Python 3 web crawler development practice

1. Development environment configuration

2. Crawler basics

3. Using the basic libraries

3.1 Using urllib

urllib is Python's built-in HTTP request library and contains four modules:

request: the most basic HTTP request module, used to send requests. Just as you would type a URL into the browser and press Enter, you pass the URL and any additional parameters to a library method to simulate that process.
error: the exception handling module. If a request fails, we can catch the exception and retry or take other action so that the program does not terminate unexpectedly.
parse: a utility module that provides many URL processing methods, such as splitting, parsing, and joining.
robotparser: mainly used to parse a site's robots.txt file and determine which pages may and may not be crawled. In practice it is used less often.
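
The parse and robotparser modules can be tried without a network connection. The sketch below splits and joins URLs with parse and checks crawl permissions with robotparser. The example.com URLs and the robots.txt rules are made up for illustration; a real crawler would normally load the rules from the site with set_url() and read() instead of passing them inline.

```python
from urllib import parse, robotparser

# Split a URL into its components
parts = parse.urlparse('https://www.python.org/about/?lang=en#top')
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# Resolve a relative link against a base URL
joined = parse.urljoin('https://www.python.org/about/', '../downloads/')
print(joined)

# Hypothetical robots.txt rules, supplied inline instead of being downloaded
rules = ['User-agent: *', 'Disallow: /private/']
robots = robotparser.RobotFileParser()
robots.parse(rules)

allowed = robots.can_fetch('*', 'https://example.com/index.html')
blocked = robots.can_fetch('*', 'https://example.com/private/data.html')
print(allowed, blocked)
```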

3.1.1 Sending requests

1. urlopen()

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.read().decode('utf-8'))

urlopen() returns an object of type http.client.HTTPResponse, which provides methods such as read(), readinto(), getheader(name), getheaders(), and fileno(), and attributes such as msg, version, status, reason, debuglevel, and closed.

print(response.status)
print(response.getheaders())
print(response.getheader('Server'))

#output
200
[('Server', 'Tengine'),
 ('Content-......096908924e')]
'Tengine'

API:

urllib.request.urlopen(
    url,
    data=None,
    timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
    *,
    cafile=None,
    capath=None,
    cadefault=False,
    context=None,
)

data parameter

The data parameter is optional. It must be of type bytes, so a string or dictionary has to be encoded first, for example with urlencode() followed by bytes(). If this parameter is passed, the request is sent with the POST method instead of GET.

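import urllib.parse
import urllib.request

# URL-encode the form fields, then convert the string to bytes
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data)
print(response.read())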
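
The switch from GET to POST can be checked without sending anything. The following sketch uses the Request class (covered in the next part) only as an offline way to inspect which method urllib would choose; the variable names are illustrative.

```python
import urllib.parse
import urllib.request

# urlencode() turns a dict into a query string; encode() converts it to bytes
payload = urllib.parse.urlencode({'word': 'hello'}).encode('utf-8')
print(payload)  # b'word=hello'

# Building a Request does not send anything, so this runs offline
get_request = urllib.request.Request('http://httpbin.org/get')
post_request = urllib.request.Request('http://httpbin.org/post', data=payload)

print(get_request.get_method())   # GET
print(post_request.get_method())  # POST
```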

timeout parameter

Sets the timeout in seconds. If no response arrives within that time, an exception is raised. If the parameter is omitted, the global default timeout is used.

response= urllib.request.urlopen('http://httpbin.org/get',timeout=1)
print(response.read())
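
In practice the timeout exception is usually caught so the crawler can skip or retry the page. Below is a minimal sketch of one way to do that. The helper name fetch_or_none and its opener parameter are hypothetical, added here only so the function can be exercised without a network. A timeout can surface either wrapped in urllib.error.URLError or directly as socket.timeout, so both cases are handled.

```python
import socket
import urllib.error
import urllib.request


def fetch_or_none(url, timeout=1.0, opener=urllib.request.urlopen):
    """Return the response body as bytes, or None if the request timed out."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as exc:
        # A timeout while connecting is wrapped in URLError
        if isinstance(exc.reason, socket.timeout):
            return None
        raise  # any other URL error is passed on to the caller
    except socket.timeout:
        # A timeout while waiting for the response may be raised directly
        return None
```

A call such as fetch_or_none('http://httpbin.org/get', timeout=0.1) would then return None instead of terminating the program when the server is too slow.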

Other parameters

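Parameter	Description
context	Must be an ssl.SSLContext object; specifies the SSL settings
cafile	Path to a CA certificate file (deprecated since Python 3.6 in favor of context, removed in Python 3.13)
capath	Path to a directory of CA certificates (deprecated since Python 3.6 in favor of context, removed in Python 3.13)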
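
As a rough sketch of the context parameter, the code below builds a default SSL context with ssl.create_default_context() and passes it to urlopen(). The helper name fetch_status is hypothetical, and the request itself is wrapped in a function so that nothing is sent until it is called.

```python
import ssl
import urllib.request

# Default context: verifies server certificates against the system CA store
context = ssl.create_default_context()
print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)


def fetch_status(url, timeout=10):
    """Open an HTTPS URL with the explicit SSL context (needs network access)."""
    with urllib.request.urlopen(url, timeout=timeout, context=context) as response:
        return response.status
```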

2. Request

request = urllib.request.Request('https://creator.douyin.com/content/manage')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

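The request is still sent with urlopen(), but this time its argument is a Request object rather than a URL string. Wrapping the request in an object makes it possible to configure it in more detail.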

API

urllib.request.Request(
    url,
    data=None,
    headers={},
    origin_req_host=None,
    unverifiable=False,
    method=None,
)

Parameter	Description
url	The URL to request; the only required parameter
data	The request body; must be of type bytes if provided
headers	A dictionary of request headers
origin_req_host	Host name or IP address of the requester
unverifiable	Indicates whether the request is unverifiable, that is, whether the user had no option to approve it. The default is False
method	A string indicating the HTTP method used by the request, such as GET, POST, or PUT

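from urllib import request, parse

# Assumed endpoint, chosen to match the Host header below
url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
form = {'name': 'Germey'}
# Form data must be URL-encoded and converted to bytes
data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url, data, headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))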
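
A Request object can also be inspected and extended before it is sent, which is handy for checking what a crawler is about to transmit. The sketch below is offline only; the User-Agent string is made up, and add_header() is shown as an alternative to passing the headers dictionary.

```python
from urllib import request

req = request.Request(
    'http://httpbin.org/put',
    data=b'name=Germey',
    headers={'User-Agent': 'Mozilla/5.0 (compatible; demo-crawler)'},  # made-up value
    method='PUT',
)
# Headers can also be added after construction
req.add_header('Accept', 'application/json')

print(req.get_method())
print(req.full_url)
# Request stores header names capitalized, e.g. 'User-agent'
print(req.get_header('User-agent'))
print(req.header_items())
```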

3. Advanced usage

Handler

Class	Description
BaseHandler	The parent class of all other handlers; provides the most basic methods
HTTPDefaultErrorHandler	Handles HTTP response errors; errors are raised as HTTPError exceptions
HTTPRedirectHandler	Handles redirects
HTTPCookieProcessor	Handles cookies
ProxyHandler	Sets the proxy; by default no proxy is set
HTTPPasswordMgr	Manages passwords; maintains a table of user names and passwords
HTTPBasicAuthHandler	Manages authentication; if a link requires authentication when opened, this handler can supply it

Opener

The Request class and urlopen() used earlier are convenience wrappers that the library provides for the most common request patterns, and they are enough for basic requests. More advanced features require the lower-level Opener, represented by the OpenerDirector class.

An Opener exposes an open() method whose return type is the same as that of urlopen().

Generally, an Opener is built from one or more Handlers.
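
A minimal sketch of that pattern, assuming cookie handling is the feature needed: build_opener() combines the given handler with urllib's default handlers and returns an OpenerDirector. The helper name fetch is hypothetical, and the request is wrapped in a function so that building the opener needs no network.

```python
import http.cookiejar
import urllib.request

# Cookies received through this opener are stored in cookie_jar
cookie_jar = http.cookiejar.CookieJar()
cookie_handler = urllib.request.HTTPCookieProcessor(cookie_jar)

# build_opener() adds the default handlers alongside the one passed in
opener = urllib.request.build_opener(cookie_handler)

handler_names = [type(h).__name__ for h in opener.handlers]
print(handler_names)


def fetch(url, timeout=10):
    """Send a request through the custom opener (needs network access)."""
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```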

Authentication - HTTPBasicAuthHandler
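
A minimal sketch of how HTTPBasicAuthHandler is typically wired into an Opener. The URL, user name, and password are placeholders, and HTTPPasswordMgrWithDefaultRealm is used here as one common choice of password manager; the helper name fetch_protected is hypothetical.

```python
import urllib.request

# Placeholder site and credentials, for illustration only
url = 'https://example.com/protected/'
username = 'user'
password = 'secret'

password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# realm=None registers the credentials as the default for this URL prefix
password_mgr.add_password(None, url, username, password)

auth_handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
opener = urllib.request.build_opener(auth_handler)

# The credential lookup itself works offline
credentials = password_mgr.find_user_password(None, url + 'page.html')
print(credentials)


def fetch_protected(timeout=10):
    """Open the protected URL through the opener (needs network access)."""
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode('utf-8')
```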

Added by a-mo on Mon, 07 Mar 2022 19:53:27 +0200
