Hello, everyone~
Another full day~
Since you're all full of energy, how about a little project, for example scraping the job listings on "Boss direct employment" (BOSS Zhipin, zhipin.com)~
Let's go. Let's do it~
Defining the target
This time we use Selenium to capture the data.
For each job posting we collect the title, salary, company name, company information, experience requirements, tags and company benefits.
Web page analysis
Because we use Selenium, the computer simply simulates what a person would do by hand in the browser, so there is no need to analyze the web page in much depth. We only need to prepare the tools.
1. selenium installation
selenium can be installed directly with pip.
pip install selenium
2. chromedriver installation
Note that the chromedriver version must match your Chrome version, otherwise it will not work.
There are two download addresses:
1. http://chromedriver.storage.googleapis.com/index.html
2. https://npm.taobao.org/mirrors/chromedriver/
Of course, you first need to check your Chrome version. Enter the following in the browser's address bar:
chrome://version
The page that opens shows the browser's version information.
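The version check can also be done in a few lines of Python. This is only a minimal sketch: it assumes you copy the two version strings by hand, the names major_version and versions_match are made up for illustration, and it compares only the first number of each version, which may be looser than what your setup needs.

```python
def major_version(version: str) -> int:
    """Return the leading number of a dotted version string."""
    return int(version.strip().split('.')[0])


def versions_match(chrome_version: str, driver_version: str) -> bool:
    """True when Chrome and chromedriver share the same major version."""
    return major_version(chrome_version) == major_version(driver_version)


# Example version strings, invented for illustration
print(versions_match('96.0.4664.45', '96.0.4664.35'))
print(versions_match('96.0.4664.45', '95.0.4638.69'))
```

If the check fails, download the chromedriver build whose version starts with the same number as your Chrome.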
Hands-on practice
Import required modules
# Import modules
import csv
import random
import time

from icecream import ic
from selenium import webdriver
Open the browser and load the web content
# Instantiate the browser object
driver = webdriver.Chrome()
# Open the URL
driver.get('https://www.zhipin.com/c100010000/?query=python&ka=sel-city-100010000')
# Implicit wait: each element lookup keeps retrying for up to 10 seconds
driver.implicitly_wait(10)
Get web page information
As the figure shows, every job posting sits inside an li tag.
So the plan is clear: first we get all the li tags,
then we extract the information we need from inside each one.
# Get the data: find_elements (plural) returns a list of every matching li tag
# 'css selector' is the value of By.CSS_SELECTOR; this call form works on Selenium 3 and 4
lis = driver.find_elements('css selector', '.job-list ul li')
print(lis)  # prints a list of WebElement objects
'''
[<selenium.webdriver.remote.webelement.WebElement (session="02c17ca6c1ce3a7b8feb729e0ecfcd44", element="ab491619-fa11-48b6-9095-5c2720c213e1")>,
 <selenium.webdriver.remote.webelement.WebElement (session="02c17ca6c1ce3a7b8feb729e0ecfcd44", element="d6e68702-ecad-43a9-a173-35e7088467b2")>,
 ...
 <selenium.webdriver.remote.webelement.WebElement (session="02c17ca6c1ce3a7b8feb729e0ecfcd44", element="b5908150-2f4e-4f86-ae41-5da0de93a9e2")>]
(30 elements in total; the middle of the list is omitted here)
'''
We now have all the li objects. Next, we loop over them and extract the information we need from each one.
for li in lis:
    # 'css selector' is the locator string behind By.CSS_SELECTOR
    # Note: this selector returns the job's location text, as the output below shows
    title = li.find_element('css selector', '.job-area-wrapper span').text
    salary = li.find_element('css selector', '.job-limit.clearfix span').text
    cop_name = li.find_element('css selector', '.name a').text
    cop_info = li.find_element('css selector', '.company-text p').text
    exprence = li.find_element('css selector', '.job-limit.clearfix p').text
    tags = li.find_element('css selector', '.tags span').text  # first tag only
    welfare = li.find_element('css selector', '.info-desc').text
    ic(title, salary, cop_name, cop_info, exprence, tags, welfare)
'''
ic| title: 'Shenzhen·Nanshan District·Shekou'
    salary: '12-15K'
    cop_name: 'Mustard big number'
    cop_info: 'Computer software 100-499 people'
    exprence: '1-3 years Bachelor's'
    tags: 'Linux'
    welfare: 'Five insurances and one fund, regular physical examination, meal allowance, employee travel, overtime allowance, communication allowance, snacks, afternoon tea, holiday benefits, year-end bonus, paid annual leave, supplementary medical insurance, stock options, free shuttle bus and transportation allowance'
ic| title: 'Jinan·Shizhong District·Wanda Plaza'
    salary: '8-12K'
    cop_name: 'Shandong civic'
    cop_info: 'Computer software 100-499 people'
    exprence: '3-5 years Bachelor's'
    tags: 'Container technology'
    welfare: 'Overtime allowance, regular physical examination, five insurances and one fund, transportation allowance, paid annual leave, snacks, afternoon tea, employee travel, holiday welfare and year-end bonus'
ic| title: 'Qingdao'
    salary: '9-11K'
    cop_name: 'Sencott'
    cop_info: 'Shipping/Aviation/Aerospace Series B 20-99 people'
    exprence: '1-3 years Master's'
    tags: 'Python'
    welfare: 'Supplementary medical insurance, paid annual leave, meal supplement, package! Single room, stock option, communication subsidy, regular physical examination, employee travel, five insurances and one fund, holiday benefits, including food, year-end bonus, snacks, afternoon tea, free shuttle bus, overtime subsidy, housing subsidy and transportation subsidy'
'''
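One posting without, say, a benefits block is enough to stop the loop, because a single-element lookup raises an exception when nothing matches. A possible workaround is a small helper that falls back to a default value. This is a sketch only: safe_text is a hypothetical name, and FakeElement is a stand-in written just so the helper can be tried without opening a browser.

```python
def safe_text(parent, selector, default=''):
    """Return the text of the first match for selector inside parent, or default."""
    # find_elements returns an empty list instead of raising when nothing matches
    matches = parent.find_elements('css selector', selector)
    return matches[0].text.strip() if matches else default


class FakeElement:
    """Stand-in for a Selenium WebElement, for demonstration only."""

    def __init__(self, text='', children=None):
        self.text = text
        self._children = children or {}

    def find_elements(self, by, value):
        return self._children.get(value, [])


li = FakeElement(children={'.name a': [FakeElement(' Example Co ')]})
print(safe_text(li, '.name a'))          # the company name is present
print(safe_text(li, '.info-desc', '-'))  # the benefits block is missing
```

With a real li element the call looks the same, for example welfare = safe_text(li, '.info-desc'). Keep in mind that with the 10-second implicit wait set earlier, each missing element may take up to that long to come back empty.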
Saving the data
Next, we save the data to a CSV file so it can be visualized later.
# Open the file once, before the loop; the .csv extension matches what csv.DictWriter writes
f = open('Recruitment 1.csv', mode='a', encoding='utf-8', newline='')
csv_writer = csv.DictWriter(f, fieldnames=[
    'title',
    'salary',
    'corporate name',
    'Company information',
    'Experience requirements',
    'label',
    'welfare',
])
# Inside the for loop: build one row per posting and write it
dit = {
    'title': title,
    'salary': salary,
    'corporate name': cop_name,
    'Company information': cop_info,
    'Experience requirements': exprence,
    'label': tags,
    'welfare': welfare,
}
csv_writer.writerow(dit)
# When scraping is finished, call f.close() so every row is flushed to disk
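Because the file is opened in append mode, it never gets a header row, and it stays open if the script crashes. One way to handle both is sketched below. The helper name append_rows and the sample row are invented for illustration, and the demo writes to a temporary folder rather than to your working directory.

```python
import csv
import os
import tempfile

FIELDNAMES = ['title', 'salary', 'corporate name', 'Company information',
              'Experience requirements', 'label', 'welfare']


def append_rows(path, rows):
    """Append dict rows to a CSV file, writing the header only for a new or empty file."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    # The with block closes the file even if writing fails
    with open(path, mode='a', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        if need_header:
            writer.writeheader()
        writer.writerows(rows)


# Placeholder row, for demonstration only
row = {'title': 'Qingdao', 'salary': '9-11K', 'corporate name': 'Example Co',
       'Company information': 'Computer software', 'Experience requirements': '1-3 years',
       'label': 'Python', 'welfare': 'Paid annual leave'}

demo_path = os.path.join(tempfile.mkdtemp(), 'Recruitment 1.csv')
append_rows(demo_path, [row])
append_rows(demo_path, [row])  # second call must not repeat the header

with open(demo_path, encoding='utf-8', newline='') as f:
    saved = list(csv.DictReader(f))
print(len(saved))
```

Calling append_rows once per page, with the list of rows scraped from that page, keeps the file consistent even if a later page fails.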
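Scraping multiple pages
We locate the next-page button, then loop to collect up to 100 pages of data.
for page in range(1, 100 + 1):
    print(f'-----------------Grabbing page {page} data-----------------')
    # (run the extraction and saving code above here, once per page)
    time.sleep(random.random() * 3)  # random delay to avoid triggering anti-scraping measures
    # find_elements returns an empty list when there is no next-page button,
    # whereas a single-element lookup would raise NoSuchElementException
    next_page = driver.find_elements('css selector', '.next')
    if next_page:
        next_page[0].click()  # go to the next page
    else:
        print('No data~~')
        break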
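The paging logic is easier to follow, and to try out without a browser, when it is wrapped in a function. The sketch below is one possible arrangement, not the only one: crawl_pages and parse_page are hypothetical names, and FakeDriver and FakeButton only imitate the two Selenium calls the function relies on.

```python
import random
import time


def crawl_pages(driver, parse_page, max_pages=100, max_delay=3.0):
    """Parse the current page, then click 'next' until it is gone or max_pages is reached."""
    rows = []
    for page in range(1, max_pages + 1):
        rows.extend(parse_page(driver))
        if page == max_pages:
            break
        buttons = driver.find_elements('css selector', '.next')
        if not buttons:
            break  # no next-page button: this was the last page
        time.sleep(random.random() * max_delay)  # random pause between pages
        buttons[0].click()
    return rows


class FakeButton:
    """Stand-in for the next-page button, for demonstration only."""

    def __init__(self, driver):
        self.driver = driver

    def click(self):
        self.driver.page += 1


class FakeDriver:
    """Stand-in for the Selenium driver with a fixed number of pages."""

    def __init__(self, total_pages):
        self.page = 1
        self.total_pages = total_pages

    def find_elements(self, by, value):
        return [FakeButton(self)] if self.page < self.total_pages else []


rows = crawl_pages(FakeDriver(3), lambda d: [f'page {d.page}'], max_delay=0)
print(rows)
```

On the real site the button may stay in the page in a disabled state on the last page, which this sketch would not detect, so check the page's HTML before relying on it.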
Data visualization
Job recruitment ranking
Job experience requirements
Salary distribution ranking
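Before a salary chart can be drawn, the salary strings have to become numbers. Here is a minimal sketch of that preparation step, assuming salaries always start with a range such as '12-15K' as in the sample output. The helper names salary_midpoint and salary_band are made up for illustration, and the 5K band width is an arbitrary choice.

```python
import re
from collections import Counter


def salary_midpoint(salary):
    """Return the midpoint (in K) of a range like '12-15K', or None if it does not parse."""
    match = re.match(r'\s*(\d+)\s*-\s*(\d+)\s*K', salary, flags=re.IGNORECASE)
    if match is None:
        return None
    low, high = int(match.group(1)), int(match.group(2))
    return (low + high) / 2


def salary_band(midpoint, width=5):
    """Label the band of the given width (in K) that contains the midpoint."""
    low = int(midpoint // width) * width
    return f'{low}-{low + width}K'


# Salary strings taken from the sample output earlier in this post
salaries = ['12-15K', '8-12K', '9-11K']
midpoints = [salary_midpoint(s) for s in salaries]
bands = Counter(salary_band(m) for m in midpoints if m is not None)
print(midpoints)
print(bands.most_common())
```

The counts in bands can then be passed to whichever plotting library you prefer to draw the distribution.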