Introduction to Python crawlers: how to fetch web pages in Python, and what is the basic process?

The basic process of crawling web pages in Python:

First, select a set of carefully chosen seed URLs.

Put these URLs into the queue of URLs to be fetched.

Read a URL from the queue of URLs to be crawled, resolve its DNS to obtain the host's IP, download the web page at that URL, and store it in the library of downloaded pages. Then move the URL into the queue of crawled URLs.

Analyze the URLs in the crawled URL queue, extract further URLs from the downloaded page data, compare them against the URLs already crawled to remove duplicates, and finally put the de-duplicated URLs into the queue of URLs to be crawled, entering the next cycle; a sketch of this loop follows.
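Taken together, these steps form a simple queue-driven loop. Below is a minimal sketch of that loop in Python 2 (matching the urllib2 code used later in this article); the seed URL and the naive regular expression used for link extraction are illustrative assumptions, not part of the original process description.

import re
import urllib2
from collections import deque

to_crawl = deque(['http://www.zhihu.com'])  # queue of seed URLs to be fetched
crawled = set()                             # URLs that have already been crawled
pages = {}                                  # the downloaded web page library

while to_crawl:
    url = to_crawl.popleft()
    if url in crawled:
        continue
    try:
        # DNS resolution and downloading happen inside urlopen
        html = urllib2.urlopen(url).read()
    except Exception:
        continue
    pages[url] = html
    crawled.add(url)
    # extract further URLs from the page, de-duplicate, and queue them
    for link in re.findall(r'href="(http[^"]+)"', html):
        if link not in crawled:
            to_crawl.append(link)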

1. HTTP request implementation

Implementation with urllib2/urllib:

urllib2 and urllib are two built-in modules in Python 2. urllib2 is the main tool for implementing HTTP requests, with urllib as a supplement.

urllib2 provides the basic function urlopen, which obtains data by sending a request to the specified URL. Its simplest form is:

import urllib2

# send a GET request to the URL and read the response body
response = urllib2.urlopen('http://www.zhihu.com')
html = response.read()
print html
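Besides the page body, the response object returned by urlopen also exposes the status code, the final URL, and the response headers. The following is a small sketch (reusing the same URL purely as an illustration):

import urllib2

response = urllib2.urlopen('http://www.zhihu.com')
print response.getcode()  # HTTP status code, e.g. 200
print response.geturl()   # final URL after any redirects
print response.info()     # response headers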

In fact, the request-response interaction with http://www.zhihu.com can be divided into two steps: constructing the request and obtaining the response, as follows:

import urllib2

# request
request = urllib2.Request('http://www.zhihu.com')

# response
response = urllib2.urlopen(request)
html = response.read()
print html

A POST request can be implemented in a similar way:

import urllib
import urllib2
url = 'http://www.xxxxxx.com/login'
postdata = {'username' : 'qiye',
    'password' : 'qiye_pass'}
# the form data must be URL-encoded into a format urllib2 understands; urllib is used here
data = urllib.urlencode(postdata)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
html = response.read()

2. Request header processing

Rewriting the above example, we add request header information, setting the User-Agent and Referer fields in the request header:

import urllib
import urllib2
url = 'http://www.xxxxxx.com/login'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
referer='http://www.xxxxxx.com/'
postdata = {'username' : 'qiye',
    'password' : 'qiye_pass'}
# write user_agent and referer into the header fields
headers={'User-Agent':user_agent,'Referer':referer}
data = urllib.urlencode(postdata)
req = urllib2.Request(url, data,headers)
response = urllib2.urlopen(req)
html = response.read()
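As an alternative not shown in the original example, individual header fields can also be set on the Request object with add_header; a minimal sketch (the URL and User-Agent string are the same placeholders as above):

import urllib2

req = urllib2.Request('http://www.xxxxxx.com/login')
# add header fields one at a time instead of passing a headers dict
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
req.add_header('Referer', 'http://www.xxxxxx.com/')
response = urllib2.urlopen(req)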

3. Cookie processing

urllib2 handles cookies automatically, using the CookieJar class to manage them. If you need to get the value of a particular cookie item, you can do the following:

import urllib2
import cookielib

cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open('http://www.zhihu.com')
for item in cookie:
    print item.name + ':' + item.value

Sometimes, however, we do not want urllib2 to handle cookies automatically; instead, we want to add the cookie contents ourselves. This can be done by setting the request header:

import urllib2

opener = urllib2.build_opener()
opener.addheaders.append(('Cookie', 'email=' + "xxxxxxx@163.com"))
req = urllib2.Request("http://www.zhihu.com/")
response = opener.open(req)
print response.headers
retdata = response.read()

Finally, thank you for reading this article. If it was helpful, please give it a like. If you have any questions or need the materials mentioned in the article, you can send me a private message; feel free to reach out.

Keywords: Python, back-end, crawler

Added by fred_m on Fri, 21 Jan 2022 00:44:06 +0200