After writing two articles, I think the focus this time should be on the crawling process itself.
What we need to analyze:
1) First, identify what you want to crawl
For example, this time we want to crawl all the result URLs returned by a Baidu search
2) Analyse how you would obtain the target manually, so that you can reproduce the process programmatically
For Baidu, for example, we first enter a keyword and search, Baidu returns a results page, and then we click through the results one by one
3) Think about how to implement the program and how to overcome the specific difficulties that come up during implementation
So let's follow the steps above. First, we know which engine we are searching: it provides a search box where the user enters a keyword and then clicks search.
We can perform a search manually first, and find that the key URL after clicking search looks like this:
http://www.baidu.com/s?wd=search content...
Next we try stripping the extra parameters from the URL and requesting it again. The information returned is the same, so we can conclude that the URL we request only needs the wd parameter filled in.
Then we try requests.get() to see whether the page is returned properly, or whether Baidu's anti-crawling measures get in the way.
Luckily, the page comes back fine.
(If you don't get the normal page back, just set proper headers or even cookies.)
import requests

url = 'http://www.baidu.com/s?wd=......'
r = requests.get(url)
print r.status_code, r.content
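If the plain request were blocked, a minimal sketch of adding a browser-like User-Agent header might look like this (the header value is only an illustrative example, not taken from the original article):

import requests

# A simple, generic set of headers; the User-Agent string is only an example
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

url = 'http://www.baidu.com/s?wd=test'
r = requests.get(url, headers=headers)
print r.status_code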
Okay, next we need to figure out how to crawl all of the result pages.
Analyzing the URL again, we find another key parameter, the one that controls the page number:
http://www.baidu.com/s?wd=...&pn=x
This x steps by 10 per page: the first page is 0, and since there are 76 pages in total the maximum value is 750; anything greater than 750 just returns the first page again.
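So generating the URL for every result page is just a matter of stepping pn from 0 to 750 in increments of 10, for example:

keyword = 'test'  # example keyword
for pn in range(0, 760, 10):
    print 'http://www.baidu.com/s?wd=%s&pn=%s' % (keyword, str(pn))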
Next we can parse the pages we fetch,
using the friendly BeautifulSoup.
Through analysis we find that the URLs we need sit in the href attribute of the a tags, in this format:
http://www.baidu.com/link?url=......
There are many other URLs mixed in with them, so we just need to apply a filter.
Also, this URL itself is not the one we want; it is just Baidu's redirect link.
Happily, though, when we make a GET request to this redirect link, the url of the response object is exactly the result link we want.
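A minimal, self-contained sketch of this extraction and redirect resolution (the keyword and headers here are just placeholders; the logic mirrors the full code further down):

import re
import requests
from bs4 import BeautifulSoup as bs

headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder generic headers
r = requests.get('http://www.baidu.com/s?wd=test&pn=0', headers=headers)
soup = bs(r.content, "html.parser")

# Grab every <a> tag that has an href attribute
for i in soup.find_all(name='a', attrs={'href': re.compile('.')}):
    # Keep only Baidu's redirect links
    if 'www.baidu.com/link?url=' in i['href']:
        a = requests.get(url=i['href'], headers=headers)
        # a.url is the final URL after the redirect, i.e. the real result link
        print a.url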
We try this out and find that there is no further anti-crawling mechanism.
My first thought was whether we should filter on the status code returned by the new URL and drop anything that is not 200 (or even check some headers).
But I found that even when the status is not 200, we only need the url of the response object; it doesn't matter whether the page itself comes back properly,
because our goal is not the content of the requested page, but the URL it resolves to.
So all you need to do is print it out.
Of course, I still recommend passing a simple, generic headers dict to the GET requests, so that at least some unnecessary results can be excluded.
With that, the whole idea is pretty much complete.
Here is the code:
#coding=utf-8
import requests
import sys
import Queue
import threading
from bs4 import BeautifulSoup as bs
import re

# A simple, generic set of request headers
headers = {
    ......
}

class baiduSpider(threading.Thread):

    def __init__(self, queue, name):
        threading.Thread.__init__(self)
        self._queue = queue
        self._name = name

    def run(self):
        while not self._queue.empty():
            url = self._queue.get()
            try:
                self.get_url(url)
            except Exception, e:
                print e
                pass
            # Be sure to handle exceptions! Otherwise the thread stops halfway
            # and the crawled content will be incomplete!

    def get_url(self, url):
        r = requests.get(url=url, headers=headers)
        soup = bs(r.content, "html.parser")
        # Grab the <a> tags of the Baidu results page; their href is Baidu's redirect link
        urls = soup.find_all(name='a', attrs={'href': re.compile('.')})
        # for i in urls:
        #     print i
        for i in urls:
            if 'www.baidu.com/link?url=' in i['href']:
                # Visit the redirect link once; the url of the response is the result URL we want
                a = requests.get(url=i['href'], headers=headers)
                # if a.status_code == 200:
                #     print a.url
                with open('E:/url/' + self._name + '.txt') as f:
                    if a.url not in f.read():
                        f = open('E:/url/' + self._name + '.txt', 'a')
                        f.write(a.url + '\n')
                        f.close()

def main(keyword):
    name = keyword
    # Create (or empty) the output file
    f = open('E:/url/' + name + '.txt', 'w')
    f.close()

    queue = Queue.Queue()
    for i in range(0, 760, 10):
        queue.put('http://www.baidu.com/s?wd=%s&pn=%s' % (keyword, str(i)))

    threads = []
    thread_count = 10
    for i in range(thread_count):
        spider = baiduSpider(queue, name)
        threads.append(spider)
    for i in threads:
        i.start()
    for i in threads:
        i.join()

    print "It's done, sir!"

if __name__ == '__main__':
    if len(sys.argv) != 2:
        print 'no keyword'
        print 'Please enter keyword'
        sys.exit(-1)
    else:
        main(sys.argv[1])
Our tool is used like this:
python 123.py keyword
and it writes the URL results to a file.
Let me say a word about the sys part.
In the if __name__ == '__main__': block we check the arguments: if there is only one (just the script name), we print a reminder and ask the user to enter a keyword.
If there are two, we take the second one as the keyword.
Of course, there is a flaw in this logic: if there are more than two arguments, other problems appear (other problems!!!).
That's worth investigating, but it's not the focus of this article.
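One possible way to handle that case, sketched here as an assumption rather than part of the original script, is to join everything after the script name into a single keyword (this reuses the main() function defined above):

import sys

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print 'no keyword'
        print 'Please enter keyword'
        sys.exit(-1)
    else:
        # Join all remaining arguments into one keyword, so multi-word input still works
        main(' '.join(sys.argv[1:]))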
Okay, that's all the Baidu URL results we'll collect today!
Thanks for reading!