Crawling the URLs of Baidu search results with Python

After writing two articles, I think the focus this time should be on the crawling process itself

What to analyze:

1) First identify what you want to crawl

For example, this time we want to crawl all the URL results returned by a Baidu search

2) Analyze how you would obtain the target manually, so that you can reproduce the process programmatically

For example, with Baidu we first enter a keyword and search, Baidu returns a results page, and then we click through the results one by one

3) Think about how to implement this in a program and work through the specific difficulties along the way

 

So let's follow the steps above. First, we know which engine we are searching; we let the user enter a keyword and then run the search

We can do the search by hand first and find that the key URL after clicking search looks like this

http://www.baidu.com/s?wd=search content...

Then we try trimming the URL down and requesting it again. If the returned information is the same, we can conclude that the request only needs the wd parameter filled in.

Next we try requests.get() to see whether the page comes back normally or whether Baidu's anti-crawler measures get in the way

Luckily, the page comes back fine.

(If you don't get normal content back, just set proper request headers or even cookies; see the sketch after the snippet below.)

import requests

url = 'http://www.baidu.com/s?wd=......'

r = requests.get(url)
print r.status_code,r.content
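If the page had not come back normally, the headers or cookies mentioned above are just extra keyword arguments to requests.get(). A rough sketch continuing the snippet above, with placeholder values rather than anything from the original post:

# placeholders: put a real browser User-Agent here, or cookies copied from your browser
headers = {'User-Agent': 'Mozilla/5.0'}
cookies = {}

r = requests.get(url, headers=headers, cookies=cookies)
print r.status_code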

 

Okay, next we want to know how to crawl all the results

Analyzing the URL again, we find another key parameter, the one that controls the page number:

http://www.baidu.com/s?wd=...&pn=x

This x goes up by 10 per results page: the first page is 0, and since there are 76 pages in total the maximum value is 750; anything greater than 750 just returns the first page.
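So building the URL for every page is just a loop over that pn value. A quick sketch (the keyword here is only an example):

# pn takes the values 0, 10, 20, ..., 750 -- one per results page
keyword = 'python'
page_urls = ['http://www.baidu.com/s?wd=%s&pn=%d' % (keyword, pn) for pn in range(0, 760, 10)]
print len(page_urls)   # 76 pages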

 

Next we can analyze the pages we have captured, with the help of the friendly BeautifulSoup.

Analysis shows that the URLs we need sit in the href attribute of <a> tags, in this format:

http://www.baidu.com/link?url=......

There are many other distracting URLs mixed in, so we just need to apply a filter
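As a sketch of that filter (assuming r holds one results page fetched as in the snippet earlier):

from bs4 import BeautifulSoup as bs
import re

soup = bs(r.content, "html.parser")
# keep only the <a> tags whose href points at Baidu's redirect address
for a_tag in soup.find_all(name='a', attrs={'href': re.compile('.')}):
    if 'www.baidu.com/link?url=' in a_tag['href']:
        print a_tag['href']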

This URL is still not what we want, though; it is only Baidu's redirect link.

But happily, when we make a GET request to this redirect link, the url attribute of the response object is exactly the result link we want.

Then we try again and find that there is no other anti-crawler mechanism.
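In code, continuing the sketch above inside that filtering loop:

        # requests follows Baidu's redirect for us; the response object's url
        # attribute is then the real result address we are after
        a = requests.get(a_tag['href'])
        print a.url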

 

At first I wondered whether we should filter the new URLs by the status code they return, dropping anything that is not 200 (or even checking some response headers).

But I found that even when the status is not 200, all we need is the url of the response object; it does not matter whether the page itself comes back properly.

Our goal is not the content of the requested page, but the URL it resolves to.

So all you need to do is print it out

Of course, I still recommend passing a simple, generic headers dict to get(), which at least filters out some unnecessary results.
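By a simple, generic headers I mean something like this; the User-Agent string is only an illustration, and the headers actually used in the code below are elided:

# a minimal, browser-like set of request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}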

 

With that, the whole approach is basically complete.

Here is the code:

#coding=utf-8

import requests
import sys
import Queue
import threading 
from bs4 import BeautifulSoup as bs
import re

headers = {
    ......
}


class baiduSpider(threading.Thread):
    def __init__(self,queue,name):
        threading.Thread.__init__(self)
        self._queue = queue
        self._name = name

    def run(self):
        while True:
            try:
                # grab the next results-page URL; stop once the queue is drained
                url = self._queue.get_nowait()
            except Queue.Empty:
                break
            try:
                self.get_url(url)
            except Exception,e:
                print e
                #Be sure to handle exceptions! Otherwise the thread stops halfway and the crawled content will be incomplete!

    def get_url(self,url):
        r = requests.get(url = url,headers = headers)
        soup = bs(r.content,"html.parser")
        urls = soup.find_all(name='a',attrs={'href':re.compile('.')})
#        for i in urls:
#            print i

        # Grab the <a> tags on the Baidu results page; the hrefs we want are Baidu's redirect addresses

        for i in urls:
            if 'www.baidu.com/link?url=' in i['href']:
                a = requests.get(url = i['href'],headers = headers)

                # One request to the redirect address; the response object's url is then the result URL we need

                #if a.status_code == 200:
                #print a.url

                # simple dedup: only append the URL if it is not already in the output file
                with open('E:/url/'+self._name+'.txt') as f:
                    seen = f.read()
                if a.url not in seen:
                    with open('E:/url/'+self._name+'.txt','a') as f:
                        f.write(a.url+'\n')
            
                


def main(keyword):

    name = keyword

    # create (or empty) the output file for this keyword
    f = open('E:/url/'+name+'.txt','w')
    f.close()

    queue = Queue.Queue()
    # one queue entry per results page: pn = 0, 10, ..., 750
    for i in range(0,760,10):
        queue.put('http://www.baidu.com/s?wd=%s&pn=%s'%(keyword,str(i)))

    threads = []
    thread_count = 10

    for i in range(thread_count):
        spider = baiduSpider(queue,name)
        threads.append(spider)

    for i in threads:
        i.start()

    for i in threads:
        i.join()

    print "It's done, sir!"

if __name__ == '__main__':


    if len(sys.argv) != 2:
        print 'no keyword'
        print 'Please enter keyword '

        sys.exit(-1)
    else:
        main(sys.argv[1])

 

The way to use our tool is:

python 123.py keyword

It writes the URL results to a file.

 

A quick note about the sys part:

In the if __name__ == '__main__': block we check the arguments: if there is only one (just the script name), we print a reminder asking the user to enter a keyword and exit.

If there are two, the second one is taken as the keyword.

Of course, this logic has a flaw: if there are more than two arguments, other problems appear.

That is worth looking into, but it is not the focus of this article.

 

Okay, that is how today's pile of Baidu URL results gets collected!

Thanks for reading!
