The basic workflow of a Python web crawler:
First, select a set of carefully chosen seed URLs.
Put these seed URLs into the queue of URLs to be crawled.
Take a URL from the queue of URLs to be crawled, resolve its DNS to obtain the host's IP, download the web page the URL points to, and store it in the downloaded page library. Then move the URL into the queue of crawled URLs.
Parse the downloaded page data to extract new URLs, compare them against the already-crawled URLs to remove duplicates, and put the remaining URLs into the queue of URLs to be crawled, entering the next cycle.
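Below is a minimal sketch of this loop, written in Python 2 to match the urllib2 examples that follow; the seed URL, the page limit, and the regex-based link extraction are simplifying assumptions.

import re
import urllib2
from collections import deque
from urlparse import urljoin

seed_urls = ['http://www.zhihu.com']        # seed URLs (placeholder)
to_crawl = deque(seed_urls)                 # queue of URLs to be crawled
crawled = set()                             # already-crawled URLs
pages = {}                                  # downloaded page library
max_pages = 50                              # arbitrary stop condition for the sketch

while to_crawl and len(crawled) < max_pages:
    url = to_crawl.popleft()                # take a URL from the queue
    if url in crawled:
        continue
    try:
        html = urllib2.urlopen(url).read()  # DNS resolution and download are handled by urlopen
    except Exception:
        continue
    pages[url] = html                       # store the page in the downloaded page library
    crawled.add(url)                        # record the URL as crawled
    # extract new URLs, deduplicate against crawled URLs, enqueue for the next cycle
    for link in re.findall(r'href="(http[^"]+)"', html):
        link = urljoin(url, link)
        if link not in crawled:
            to_crawl.append(link)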
1. HTTP request implementation
Implementation with urllib2/urllib:
urllib2 and urllib are two built-in modules in Python 2. For HTTP functionality, urllib2 is the main module, with urllib used as a supplement.
urllib2 provides the basic function urlopen, which obtains data by sending a request to the specified URL. The simplest form is:
import urllib2

response = urllib2.urlopen('http://www.zhihu.com')
html = response.read()
print html
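Besides read(), the object returned by urlopen also exposes the response status and headers; a small sketch, using the same example URL:

import urllib2

response = urllib2.urlopen('http://www.zhihu.com')
print response.getcode()    # HTTP status code, e.g. 200
print response.geturl()     # the final URL after any redirects
print response.info()       # the response headers
html = response.read()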
In fact, the request-response interaction with http://www.zhihu.com can be split into two steps: building the request and obtaining the response. The form is as follows:
import urllib2

# request
request = urllib2.Request('http://www.zhihu.com')
# response
response = urllib2.urlopen(request)
html = response.read()
print html
There is also a POST request implementation:
import urllib
import urllib2

url = 'http://www.xxxxxx.com/login'
postdata = {'username': 'qiye', 'password': 'qiye_pass'}
# The POST data must be encoded in a format urllib2 understands; urllib.urlencode is used here
data = urllib.urlencode(postdata)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
html = response.read()
2. Request header processing
Rewriting the above example, we add request header information, setting the User-Agent and Referer fields in the request header.
import urllib
import urllib2

url = 'http://www.xxxxxx.com/login'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
referer = 'http://www.xxxxxx.com/'
postdata = {'username': 'qiye', 'password': 'qiye_pass'}
# Put user_agent and referer into the request headers
headers = {'User-Agent': user_agent, 'Referer': referer}
data = urllib.urlencode(postdata)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
html = response.read()
3. Cookie processing
urllib2 handles cookies automatically, using the CookieJar object to manage them. If you need to get the value of a cookie item, you can do this:
import urllib2
import cookielib

cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open('http://www.zhihu.com')
for item in cookie:
    print item.name + ':' + item.value
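If only one specific cookie is needed, the same loop can filter by name; a minimal sketch, where the cookie name 'session_id' is just a placeholder assumption:

import urllib2
import cookielib

cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
opener.open('http://www.zhihu.com')
session_value = None
for item in cookie:
    if item.name == 'session_id':   # placeholder cookie name
        session_value = item.value
print session_value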
But sometimes we don't want urllib2 to handle cookies automatically; instead, we want to add the cookie contents ourselves. This can be done by setting the request header:
import urllib2

opener = urllib2.build_opener()
opener.addheaders.append(('Cookie', 'email=' + "xxxxxxx@163.com"))
req = urllib2.Request("http://www.zhihu.com/")
response = opener.open(req)
print response.headers
retdata = response.read()