This article is a small demo from my postgraduate studies. At the time I was using machine learning algorithms to predict betting odds, and because that work needed data, I wrote a crawler to collect it from the Scout network site. The code below is a small part of that project and shows how some of the data was crawled.
The example here is the match data, as listed on the Scout network site, of the 32 national teams in the 2018 Russia World Cup over the past 6 years.
1, Designing the crawl based on the page content
First, the World Cup theme page on the Scout network lists the groups and points of the 32 national teams that qualified for the group stage.
Clicking the name of any participating national team (such as Egypt) opens that team's dedicated page, which contains all of the team's match results and related data from 2011 to the present. This is the content we want to crawl.
The overall crawling plan is therefore as follows: from the page "2018 season world cup (World Cup), schedule points - scouting network", grab the name of each participating national team and the link to its statistics page; then treat those links as second-level pages and grab each team's match statistics for recent years; finally, save the information to a csv file.
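The two-level plan above can be sketched roughly as follows. This is only an outline under stated assumptions: `crawl_to_csv`, `fake_team_links` and `fake_team_matches` are hypothetical names, the example.com URLs and the match row are made-up placeholders, and the two stub functions stand in for the real page scrapers.

```python
import csv
import io


def crawl_to_csv(index_url, fetch_team_links, fetch_team_matches, out_file):
    """Two-level crawl: index page -> team pages -> rows in one CSV file.

    fetch_team_links(index_url)  -> iterable of (team_name, team_url)
    fetch_team_matches(team_url) -> iterable of rows (lists of strings)
    Both callables are placeholders for the real scrapers.
    """
    writer = csv.writer(out_file)
    rows_written = 0
    for team_name, team_url in fetch_team_links(index_url):
        for match_row in fetch_team_matches(team_url):
            # Prefix every match row with the team it belongs to.
            writer.writerow([team_name, *match_row])
            rows_written += 1
    return rows_written


# Stub scrapers with made-up placeholder data, only to show the data flow.
def fake_team_links(index_url):
    return [("Egypt", "https://example.com/team/egypt")]


def fake_team_matches(team_url):
    return [["2018-01-01", "Team X", "0-0"]]


buffer = io.StringIO()
count = crawl_to_csv(
    "https://example.com/worldcup", fake_team_links, fake_team_matches, buffer
)
print(count, "row(s) written")
print(buffer.getvalue())
```

In a real run, `out_file` would be a file opened with `open(path, "w", newline="", encoding="utf-8")` instead of the in-memory buffer.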
2, Web page html analysis
First, use the browser's developer tools to locate the information we want to grab on the first-level page. What we want to crawl is the content and the href attribute of the a tag inside the third tag.
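As a rough illustration of pulling out the link text and href, the sketch below uses only the standard library. It assumes, purely for illustration, that each team sits in a table row and that the link is inside the third `td` of that row; the real element names are not given here, and `ThirdCellLinkParser` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class ThirdCellLinkParser(HTMLParser):
    """Collect (text, href) for <a> tags inside the third <td> of each <tr>."""

    def __init__(self):
        super().__init__()
        self.cell_index = 0       # 1-based index of the current <td> in its row
        self.in_cell = False
        self.current_href = None  # href of the <a> being read, if any
        self.current_text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.cell_index = 0
        elif tag == "td":
            self.cell_index += 1
            self.in_cell = True
        elif tag == "a" and self.in_cell and self.cell_index == 3:
            self.current_href = dict(attrs).get("href")
            self.current_text = []

    def handle_data(self, data):
        # Only keep text while we are inside a matching <a> tag.
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            text = "".join(self.current_text).strip()
            self.links.append((text, self.current_href))
            self.current_href = None
        elif tag == "td":
            self.in_cell = False


# Made-up markup that mimics the assumed structure.
sample_html = """
<table>
  <tr><td>A</td><td>1</td><td><a href="/team/1">Team One</a></td></tr>
  <tr><td>A</td><td>2</td><td><a href="/team/2">Team Two</a></td></tr>
</table>
"""

parser = ThirdCellLinkParser()
parser.feed(sample_html)
links = parser.links
print(links)
```

With Selenium, the same idea would usually be expressed as a CSS selector passed to `driver.find_elements`, once the real element names have been read off the developer tools.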
Look at the files the server returns when the page is requested; among them, bomhelper.js contains the following information:
3, Configuring selenium and chrome
Having settled on the selenium library + Chrome browser approach, we first need to install the selenium library and the matching browser driver (chromedriver.exe).
selenium library installation
This is very simple. As with any common python package, you can use:
pip install selenium
Downloading and installing chromedriver
Open https://sites.google.com/a/chromium.org/chromedriver/downloads and download the chromedriver.exe that matches your browser version. For example, my browser is Chrome/93.0.4577.15, so I chose that version.
The download is a zip file. After decompressing it you get a chromedriver.exe file. Move that file to the installation directory of the Chrome browser (on my machine, C:\Program Files (x86)\Google\Chrome\Application) and add that directory to the Path environment variable.
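Before launching a browser, it can help to confirm that the driver really is reachable through the environment variable. The following is a minimal sketch using only the standard library; `find_chromedriver` is a hypothetical helper name.

```python
import shutil


def find_chromedriver(executable="chromedriver"):
    """Return the full path of chromedriver if it is on PATH, else None."""
    # On Windows, shutil.which also tries the extensions in PATHEXT (e.g. .exe).
    return shutil.which(executable)


path = find_chromedriver()
if path is None:
    print("chromedriver was not found on PATH; check the environment variable.")
else:
    print("chromedriver found at:", path)
```

If the path is not found right after editing the environment variable, opening a new terminal (so the updated variable is picked up) is usually the first thing to try.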
Verify that the chromedriver is available
Open python and enter the following:
from selenium import webdriver
driver = webdriver.Chrome()
After clicking run, if a chrome browser window is successfully opened and the program does not report an error, the configuration is successful.
This process mainly relies on the following points:
The selenium + chrome combination is used to simulate the Chrome browser sending requests, which handles data loaded asynchronously via ajax as well as the server's restrictions on browser requests.
Another point is that the site has probably been crawled heavily and appears to have anti-crawling measures in place. Pages load very slowly, so the crawler needs to be optimized for that.
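One possible way to approach the slow page loads and ajax content is sketched below. It is not the configuration used in the original project: it assumes Selenium 4, and the helper names (`chrome_arguments`, `polite_pause`, `make_driver`, `wait_for_rows`) are hypothetical. The idea is to skip image loading, stop waiting once the DOM is ready, wait explicitly for the elements that ajax fills in, and pause between requests.

```python
import random
import time


def chrome_arguments(headless=True, load_images=False):
    """Build Chrome command-line switches for a lighter, faster browser."""
    args = []
    if headless:
        args.append("--headless")
    if not load_images:
        # Skipping images reduces the amount of data each page has to load.
        args.append("--blink-settings=imagesEnabled=false")
    return args


def polite_pause(min_seconds=1.0, max_seconds=3.0):
    """Sleep for a random interval between requests; return the delay used."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay


def make_driver(headless=True, load_images=False):
    # Imported here so the helpers above work without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    # "eager": return once the DOM is ready instead of waiting for every resource.
    options.page_load_strategy = "eager"
    for arg in chrome_arguments(headless, load_images):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)


def wait_for_rows(driver, css_selector, timeout=20):
    """Block until ajax-loaded elements matching css_selector are present."""
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    return WebDriverWait(driver, timeout).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, css_selector))
    )
```

A typical loop would call `make_driver()` once, then for each team link call `driver.get(url)`, `wait_for_rows(driver, selector)` with a selector taken from the developer tools, and `polite_pause()` before the next request, finishing with `driver.quit()`.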