How to batch-crawl Zhihu user information

First of all, this crawler is written purely for testing and practice; it will not be used for any other purpose.

To start: suppose I want to get the following information about a user.

A single user is actually easy to get with requests: just send a GET request to the user's homepage URL. The returned page source contains the user's details, embedded inside the JavaScript code (see the figure below).

You can extract it easily with a regular expression.
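As a rough sketch of this step (the script id "js-initialData", the header values, and the regular expression are my assumptions about how Zhihu currently embeds the profile JSON, and may change):

import json
import re

import requests

# A public profile URL; the same approach works for any user homepage.
url = "https://www.zhihu.com/people/system-out-99"

# A normal desktop User-Agent so the request looks like a browser visit.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}

resp = requests.get(url, headers=headers)

# The user details are embedded as JSON inside a <script> tag in the page source,
# so a regular expression is enough to pull the payload out.
match = re.search(r'<script id="js-initialData"[^>]*>(.*?)</script>', resp.text, re.S)
if match:
    data = json.loads(match.group(1))
    print(json.dumps(data, ensure_ascii=False)[:500])  # peek at the extracted JSON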

Now comes the problem: how do we obtain Zhihu users in batches?

The first idea that comes to mind is to collect the homepage addresses of other users from each page we crawl; that way users can be gathered in batches. Through followers or followees?

The idea sounds good, but I found that the request for follower information carries a parameter in its request headers whose value looks random every time. You have to reverse-engineer the JavaScript to work out how it is generated before the request will go through, and that is hard (at least I can't do it).
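For reference, a raw call to the followee-list API looks roughly like the sketch below; the endpoint path and the name of the signed header (x-zse-96) are assumptions based on Zhihu's public web interface, and without a valid signature the server simply rejects the request, which is exactly the wall I hit:

import requests

# Hypothetical followee-list endpoint for a given url_token (an assumption,
# based on what the browser's network panel shows for Zhihu's web API).
api = "https://www.zhihu.com/api/v4/members/system-out-99/followees?offset=0&limit=20"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    # The anti-crawling parameter lives in a signed header (assumed to be x-zse-96);
    # its value is computed by obfuscated JS on every request, which is the part
    # that is hard to reproduce.
    # "x-zse-96": "2.0_<signature computed by the page's JS>",
}

resp = requests.get(api, headers=headers)
print(resp.status_code)   # typically 401/403 without a valid signature
print(resp.text[:200])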

That was when I realized Zhihu actually does a good job of anti-crawling; it wasn't like this before.

Here is how I got around the problem: I used Selenium. The natural first attempt is to control the browser to load the follower/followee pages directly and scrape the information from them.

But since the Selenium-controlled browser is not logged in, the pages of other users show almost no information.

Answers, questions, articles, columns, and ideas are all empty (presumably because the site detects that it is an automated test browser). In the end, the only thing left to start from is the user's activity feed.

With this idea, I wrote the following code.

from concurrent.futures.thread import ThreadPoolExecutor
from selenium.webdriver.chrome.options import Options
from selenium.webdriver import Chrome
from lxml import etree


opt = Options()
opt.add_argument("--headless")
wait_data = set()   # URLs waiting to be crawled
old_data = set()    # URLs already crawled


# Collect new user homepage URLs from a user's activity feed
def get_moreUrl(url):
    web = Chrome(r"D:\chromedriver.exe", options=opt)
    web.get(url)
    html = etree.HTML(web.page_source)
    web.close()
    divs = html.xpath('//*[@id="Profile-activities"]/div[2]/div')
    for div in divs:
        href = div.xpath('./div[2]/div/div[1]/div[1]/div/div/div[1]/span/div/div/a/@href')
        if href:
            new_href = "https:" + href[0]
            if new_href not in old_data:
                wait_data.add(new_href)
                print(new_href)


# Scrape the details of one user's homepage
def get_data(url):
    web = Chrome(r"D:\chromedriver.exe", options=opt)
    web.get(url)
    # Close the login pop-up so the page content is accessible
    web.find_element_by_xpath("/html/body/div[4]/div/div/div/div[2]/button").click()
    html = etree.HTML(web.page_source)
    name = html.xpath('//*[@id="ProfileHeader"]/div/div[2]/div/div[2]/div[1]/h1/span/text()')[0]
    num = html.xpath('//*[@id="root"]/div/main/div/meta[6]/@content')[0]
    answer_num = html.xpath('//*[@id="root"]/div/div[2]/header/div[2]/div/div/ul/li[2]/a/span/text()')[0]
    agree_num = html.xpath('//*[@id="root"]/div/main/div/meta[4]/@content')[0]
    like_num = html.xpath('//*[@id="root"]/div/main/div/meta[5]/@content')[0]
    web.close()
    result = [url, name, num, answer_num, agree_num, like_num]
    print(result)


if __name__ == '__main__':
    # Seed the queue with one user's homepage
    wait_data.add("https://www.zhihu.com/people/system-out-99")
    with ThreadPoolExecutor(10) as t:
        while True:
            if len(wait_data) > 0:
                url = wait_data.pop()
                old_data.add(url)
                t.submit(get_moreUrl, url)
                t.submit(get_data, url)

In fact, the code is not difficult; the thought process is what matters most. Selenium itself is not efficient, but with the thread pool the throughput is still quite respectable: a few dozen users per minute is feasible.

If anyone has suggestions for making the code more efficient, you are welcome to get in touch.

Finally, my official account: reptiles.

Welcome to follow it.

Keywords: Python crawler
