Scraping Boss Zhipin (Boss direct employment) job listings with Python and Selenium

Hello, everyone~
Another day full of energy~
Since you are full of energy, how about putting it to use, for example by scraping the job listings on "Boss direct employment" (Boss Zhipin, zhipin.com)~
Let's go. Let's do it~

Defining the target

This time we use Selenium to capture the data.

The fields we collect are title, salary, company name, company information, experience requirements, tags, and company benefits.

Web page analysis

Because we use Selenium, the script simply simulates the manual actions a person would perform in the browser, so the page needs little analysis up front. We only need to prepare the tools.

1. selenium installation

selenium can be installed directly with pip.

pip install selenium

2. chromedriver installation

Note that the chromedriver version must match your Chrome version; otherwise it will not work.

It can be downloaded from either of the following addresses:

1. http://chromedriver.storage.googleapis.com/index.html

2. https://npm.taobao.org/mirrors/chromedriver/

Before downloading, check your Chrome version by entering the following in the browser's address bar:

chrome://version

This page shows the browser version information.
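The version comparison can also be done in code. Here is a minimal sketch, assuming that comparing the major version number (the part before the first dot) is enough to tell whether the two match. The helper names and the version strings are made up for illustration.

```python
def major_version(version: str) -> int:
    """Return the major version, e.g. 96 for '96.0.4664.45'."""
    return int(version.strip().split('.')[0])


def same_major_version(chrome_version: str, driver_version: str) -> bool:
    """True when Chrome and chromedriver share the same major version."""
    return major_version(chrome_version) == major_version(driver_version)


# Illustrative version strings, not taken from a real machine
print(same_major_version('96.0.4664.45', '96.0.4664.93'))  # True
print(same_major_version('96.0.4664.45', '95.0.4638.69'))  # False
```

Paste the version shown by chrome://version as the first argument and the version of the chromedriver you downloaded as the second.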

Hands-on practice

Import required modules

#  Import module
import csv
import random
import time
from icecream import ic
from selenium import webdriver

Open the browser and load the web content

# Instantiate the browser object
driver = webdriver.Chrome()

# Open the URL
driver.get('https://www.zhipin.com/c100010000/?query=python&ka=sel-city-100010000')

# Implicit wait: element lookups keep retrying for up to 10 seconds before failing
driver.implicitly_wait(10)

Get web page information

As the figure shows, each job posting sits in its own li tag.

So the approach is straightforward: first we get all the li tags, then we extract the fields we need from inside each one.

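# Get the data: find_elements (plural) returns a list of all matching li tags.
# 'css selector' is the locator strategy; find_elements_by_css_selector was removed in Selenium 4.3.
lis = driver.find_elements('css selector', '.job-list ul li')
print(lis)  # prints a list of WebElement objects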
  
'''
[<selenium.webdriver.remote.webelement.WebElement (session="02c17ca6c1ce3a7b8feb729e0ecfcd44", element="ab491619-fa11-48b6-9095-5c2720c213e1")>,
 <selenium.webdriver.remote.webelement.WebElement (session="02c17ca6c1ce3a7b8feb729e0ecfcd44", element="d6e68702-ecad-43a9-a173-35e7088467b2")>,
 ... (30 WebElement objects in total, all from the same session) ...
 <selenium.webdriver.remote.webelement.WebElement (session="02c17ca6c1ce3a7b8feb729e0ecfcd44", element="b5908150-2f4e-4f86-ae41-5da0de93a9e2")>]
'''

We have successfully obtained all the li elements. Next, we loop over them and extract the fields we need from each one.

for li in lis:
    # Note: this selector returns the job location (see the output below), even though it is stored as title
    title = li.find_element('css selector', '.job-area-wrapper span').text
    salary = li.find_element('css selector', '.job-limit.clearfix span').text
    cop_name = li.find_element('css selector', '.name a').text
    cop_info = li.find_element('css selector', '.company-text p').text

    # Experience and education requirements
    exprence = li.find_element('css selector', '.job-limit.clearfix p').text
    # find_element returns only the first matching tag
    tags = li.find_element('css selector', '.tags span').text
    welfare = li.find_element('css selector', '.info-desc').text
    ic(title, salary, cop_name, cop_info, exprence, tags, welfare)
  
'''
ic| title: 'Shenzhen·Nanshan District·Shekou'
    salary: '12-15K'
    cop_name: 'Mustard big number'
    cop_info: 'Computer software 100-499 people'
    exprence: '1-3 years, undergraduate degree'
    tags: 'Linux'
    welfare: 'Five insurances and one fund, regular physical examination, meal allowance, employee travel, overtime allowance, communication allowance, snacks, afternoon tea, holiday benefits, year-end bonus, paid annual leave, supplementary medical insurance, stock options, free shuttle bus and transportation allowance'
ic| title: 'Jinan·Shizhong District·Wanda Plaza'
    salary: '8-12K'
    cop_name: 'Shandong civic'
    cop_info: 'Computer software 100-499 people'
    exprence: '3-5 years, undergraduate degree'
    tags: 'Container technology'
    welfare: 'Overtime allowance, regular physical examination, five insurances and one fund, transportation allowance, paid annual leave, snacks, afternoon tea, employee travel, holiday welfare and year-end bonus'
ic| title: 'Qingdao'
    salary: '9-11K'
    cop_name: 'Sencott'
    cop_info: 'Shipbuilding/aviation/aerospace, Series B, 20-99 people'
    exprence: '1-3 years, master degree'
    tags: 'Python'
    welfare: 'Supplementary medical insurance, paid annual leave, meal supplement, package! Single room, stock option, communication subsidy, regular physical examination, employee travel, five insurances and one fund, holiday benefits, including food, year-end bonus, snacks, afternoon tea, free shuttle bus, overtime subsidy, housing subsidy and transportation subsidy'
'''
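The salary comes back as text such as '12-15K', which is awkward to chart later. Here is a minimal sketch of turning it into numbers, assuming every salary has the simple form "low-highK" seen in the output above; parse_salary is a hypothetical helper, and any other format simply returns None.

```python
import re


def parse_salary(text: str):
    """Parse a range such as '12-15K' into (12000, 15000); return None if it does not match."""
    match = re.fullmatch(r'\s*(\d+)-(\d+)[Kk]\s*', text)
    if match is None:
        return None
    low, high = (int(group) * 1000 for group in match.groups())
    return low, high


for raw in ['12-15K', '8-12K', '9-11K', 'negotiable']:
    print(raw, '->', parse_salary(raw))
```

Returning None for unexpected text keeps one odd listing from crashing the whole run; such rows can be skipped or inspected by hand.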

Data saving

Next, we save the data to a CSV file for later visualization.

f = open('Recruitment 1.csv', mode='a', encoding='utf-8', newline='')
csv_writer = csv.DictWriter(f, fieldnames=[
    'title',
    'salary',
    'corporate name',
    'Company information',
    'Experience requirements',
    'label',
    'welfare',
])

    # Inside the for loop, after the fields of each li have been extracted
    dit = {
        'title': title,
        'salary': salary,
        'corporate name': cop_name,
        'Company information': cop_info,
        'Experience requirements': exprence,
        'label': tags,
        'welfare': welfare,
    }
    csv_writer.writerow(dit)
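The snippet above never writes a header row and never closes the file. One possible variant, not the original script's method, is sketched below: it uses a with block so the file is always closed, and writes the header only when the file is new or empty. save_rows, FIELDNAMES, and the temporary demo file are names introduced here for illustration, and the sample row is loosely based on the output shown earlier.

```python
import csv
import os
import tempfile

FIELDNAMES = ['title', 'salary', 'corporate name', 'Company information',
              'Experience requirements', 'label', 'welfare']


def save_rows(path, rows):
    """Append rows to a CSV file, writing the header only if the file is new or empty."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, mode='a', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        if need_header:
            writer.writeheader()
        writer.writerows(rows)


# Demo row (illustrative values), written to a temporary file
sample = {
    'title': 'Qingdao',
    'salary': '9-11K',
    'corporate name': 'Sencott',
    'Company information': '20-99 people',
    'Experience requirements': '1-3 years',
    'label': 'Python',
    'welfare': 'paid annual leave',
}
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
save_rows(demo_path, [sample])
save_rows(demo_path, [sample])  # the second call appends without repeating the header

with open(demo_path, encoding='utf-8', newline='') as f:
    saved = list(csv.DictReader(f))
print(len(saved), saved[0]['salary'])  # 2 9-11K
```

In the scraper itself, the dictionaries built in the loop could be collected in a list and passed to save_rows once per page.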

Multi-page scraping

We locate the element for the next-page button, then loop through up to 100 pages of data.

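for page in range(1, 100+1):
    print(f'-----------------Scraping page {page}-----------------')
    time.sleep(random.random()*3)  # Random delay to reduce the chance of triggering anti-scraping measures
    # ... extract and save the li data of the current page here ...
    # Look for the next-page button; find_elements returns an empty list when there is none
    # (find_element would raise NoSuchElementException instead of returning a false value)
    next_page = driver.find_elements('css selector', '.next')
    if next_page:
        next_page[0].click()
    else:
        print('No more data~~')
        break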
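The control flow of the paging loop can be tried without a browser. The sketch below is a hypothetical helper, crawl_pages, that takes two callables: one that scrapes the current page and one that moves to the next page and returns False when there is none. The fake functions stand in for the real Selenium actions and are assumptions made purely for the demo.

```python
import random
import time


def crawl_pages(scrape_page, go_to_next_page, max_pages=100, max_delay=3.0):
    """Scrape each page, then move on; stop at max_pages or when there is no next page.

    Returns the number of pages scraped.
    """
    scraped = 0
    for page in range(1, max_pages + 1):
        scrape_page()
        scraped += 1
        if page == max_pages or not go_to_next_page():
            break
        time.sleep(random.random() * max_delay)  # random pause between pages
    return scraped


# Stand-ins for the real browser actions: a fake site with 3 pages
state = {'page': 1, 'seen': []}


def fake_scrape():
    state['seen'].append(state['page'])


def fake_next():
    if state['page'] >= 3:
        return False  # no next-page button on the last page
    state['page'] += 1
    return True


print(crawl_pages(fake_scrape, fake_next, max_pages=100, max_delay=0.01))  # 3
print(state['seen'])  # [1, 2, 3]
```

Keeping the loop separate from the browser calls makes it easy to check that the crawl stops on the last page instead of clicking forever.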

Data visualization

Job recruitment ranking

Job experience requirements

Salary distribution ranking
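The rankings above boil down to counting how often each value appears in a column. Here is a minimal sketch with collections.Counter, assuming the rows have been read back as dictionaries (for example with csv.DictReader); rank_values and the three sample rows are illustrative only.

```python
from collections import Counter


def rank_values(rows, column):
    """Count how often each non-empty value of `column` occurs, most common first."""
    return Counter(row[column] for row in rows if row.get(column)).most_common()


rows = [
    {'label': 'Linux', 'salary': '12-15K'},
    {'label': 'Python', 'salary': '9-11K'},
    {'label': 'Python', 'salary': '8-12K'},
]
print(rank_values(rows, 'label'))  # [('Python', 2), ('Linux', 1)]
```

The resulting list of (value, count) pairs can be handed to whichever charting library you prefer.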

Keywords: Python Selenium chrome

Added by SWI03 on Sat, 04 Dec 2021 06:39:23 +0200