Crawler project 18: use Python to crawl all recruitment information of all cities in the pull hook network

preface

Use selenium+requests to visit the page and crawl the hook bar recruitment information

Tip: the following is the main content of this article. The following cases can be used for reference

1, Analyze url

By observing the page, we can see that the page data belongs to dynamic loading, so now we get the data packet through the packet capture tool

Observe its url and parameters

url="https://www.lagou.com/jobs/positionAjax.json?px=default&needAddtionalResult=false"
Parameters:
city=%E5%8C%97%E4%BA%AC  ==>city
first=true  ==>useless
pn=1  ==>the number of pages
kd=%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90  ==>Commodity keywords

Therefore, if we want to achieve full site crawling, we need city and page number

2, Get all cities and pages

After opening the dragline, we found that his data is not fully displayed. For example, in the selection of cities, only 30 pages are displayed nationwide, but the total number of pages is far greater than 30 pages; I chose Beijing and found that it is 30 pages, and Haidian District under Beijing is 30 pages. Maybe we can't crawl all the data, but we can crawl as much data as possible

In order to obtain the data of the whole station, we must have two parameters: one is the city and the other is the number of pages. Therefore, we use selenium automation to obtain all cities and the corresponding number of pages

def City_Page(self):
    City_Page={}
    url="https://www.lagou.com/jobs/allCity.html?keyword=%s&px=default&companyNum=0&isCompanySelected=false&labelWords="%(self.keyword)
    self.bro.get(url=url)
    sleep(30)
    print("Start getting cities and their maximum pages")
    if "Verification system" in self.bro.page_source:
        sleep(40)
    html = etree.HTML(self.bro.page_source)
    city_urls = html.xpath('//table[@class="word_list"]//li/input/@value')
    for city_url in city_urls:
        try:
            self.bro.get(city_url)
            if "Verification system" in self.bro.page_source:
                sleep(40)
            city=self.bro.find_element_by_xpath('//a[@class="current_city current"]').text
            page=self.bro.find_element_by_xpath('//span[@class="span totalNum"]').text
            City_Page[city]=page
            sleep(0.5)
        except:
            pass
    self.bro.quit()
    data = json.dumps(City_Page)
    with open("city_page.json", 'w', encoding="utf-8")as f:
        f.write(data)
    return City_Page

3, Generate params parameters

When we have the maximum number of pages corresponding to each city, we can generate the parameters required to access the page

def Params_List(self):
    with open("city_page.json", "r")as f:
        data = json.loads(f.read())
    Params_List = []
    for a, b in zip(data.keys(), data.values()):
        for i in range(1, int(b) + 1):
            params = {
                'city': a,
                'pn': i,
                'kd': self.keyword
            }
            Params_List.append(params)
    return Params_List

4, Get data

Finally, we can access the page to get the data by adding the request header and using the params url

def Parse_Data(self,params):
    url = "https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false"
    header={
        'referer': 'https://www.lagou.com/jobs/list_%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90?labelWords=&fromSearch=true&suginput=',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36',
        'cookie':''
    }
    try:
        text = requests.get(url=url, headers=header, params=params).text
        if "frequently" in text:
            print("The operation is frequent and has been found. It is currently the second%d individual params"%(i))
        data=json.loads(text)
        result=data["content"]["positionResult"]["result"]
        for res in result:
            with open(".//lagou1.csv", "a",encoding="utf-8") as f:
                writer = csv.DictWriter(f, res.keys())
                writer.writerow(res)
        sleep(1)
    except Exception as e:
        print(e)
        pass

summary

Although the data only shows the first 30 pages, the data is not fully obtained

When using selenium to obtain the maximum number of pages in the city, you should log in to dragnet manually, and there may be verification problems during the access, and the system needs to be verified

When using requests to access the page to obtain data, try to sleep for a long time. Frequent operations will block the IP

Finally, if you are interested in crawler projects, you can browse to my home page, and have updated several reptiles. So the data source code is in the official account "Python", this source code gets the reply "hook" to get.

If you think this article is good, just like it 👍， This is the biggest support for original bloggers

Please indicate the source of reprint

Keywords: Python JSON Selenium request

Added by ashu.khetan on Tue, 08 Mar 2022 04:27:14 +0200

Programming VIP