Python 3 web crawler development practice

1. Development environment configuration

2. Crawler basics

3. Using the basic libraries

3.1 Using urllib

  • request: the most basic HTTP request module, used to send requests. Just as you would type a URL into the browser and press Enter, you simulate that process by passing the URL and any additional parameters to the library's methods.
  • error: the exception handling module. If a request fails, we can catch the exception and then retry or take other action, so that the program does not terminate unexpectedly.
  • parse: a utility module that provides many URL handling methods, such as splitting, parsing, and joining.
  • robotparser: mainly used to parse a website's robots.txt file and determine which pages may be crawled and which may not. In practice it is used less often.
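
As a rough, offline illustration of the parse and robotparser modules listed above, the sketch below splits and joins URLs and checks two paths against a small robots.txt rule set. The example.com URLs, the robots.txt rules, and the 'MyCrawler' user agent name are all made up for illustration.

```python
from urllib.parse import urlparse, urljoin, urlencode
from urllib.robotparser import RobotFileParser

# Split a URL into its components (hypothetical URL).
parts = urlparse('https://example.com/search?q=python#top')
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# Resolve a relative link against a base URL.
next_page = urljoin('https://example.com/docs/index.html', 'page2.html')
print(next_page)

# Encode a dict as a query string.
query = urlencode({'word': 'hello', 'page': 2})
print(query)

# Parse robots.txt rules from memory instead of fetching them,
# so the example runs without network access.
robots = RobotFileParser()
robots.parse([
    'User-agent: *',
    'Disallow: /private/',
])
print(robots.can_fetch('MyCrawler', 'https://example.com/index.html'))
print(robots.can_fetch('MyCrawler', 'https://example.com/private/data'))
```

In a real crawler the rules would normally be loaded with set_url() and read() rather than passed to parse() by hand.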

3.1.1 Sending requests

1. urlopen()

import urllib.request
response = urllib.request.urlopen('https://www.python.org')
print(response.read().decode('utf-8'))

urlopen() returns an object of type http.client.HTTPResponse, which provides methods such as read(), readinto(), getheader(name), getheaders(), and fileno(), and attributes such as msg, version, status, reason, debuglevel, and closed.

print(response.status)
print(response.getheaders())
print(response.getheader('Server'))

#output
200
[('Server', 'Tengine'),
 ('Content-......096908924e')]
'Tengine'

API:

urllib.request.urlopen(
    url,
    data=None,
    timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
    *,
    cafile=None,
    capath=None,
    cadefault=False,
    context=None,
)

data parameter

The data parameter is optional. If it is passed, it must be byte-stream content, i.e. of type bytes, so str content (for example, the output of urllib.parse.urlencode()) has to be converted first, for instance with bytes(). When this parameter is passed, the request is sent as POST rather than GET.

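import urllib.parse
import urllib.request

# urlencode() returns a str, so convert it to bytes before passing it as data
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())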
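
The switch from GET to POST can be checked without sending anything over the network. A minimal sketch, assuming it is acceptable to inspect a Request object rather than a live response: Request.get_method() reports the HTTP method urllib would use for that request.

```python
import urllib.parse
import urllib.request

url = 'http://httpbin.org/post'

# urlencode() returns a str; the data argument needs bytes.
data = urllib.parse.urlencode({'word': 'hello'}).encode('utf-8')

without_data = urllib.request.Request(url)
with_data = urllib.request.Request(url, data=data)

print(data)                        # the encoded form body
print(without_data.get_method())   # method used when no data is given
print(with_data.get_method())      # method used when data is given
```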

timeout parameter

Sets the timeout, in seconds. If no response is received within that time, an exception is raised. If the parameter is omitted, the global default timeout is used.

response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
print(response.read())
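
One way to handle the timeout case is to catch the exception and fall back to a default value. The sketch below uses a hypothetical helper name, fetch_or_none, which is not part of urllib. Depending on where the timeout happens, urllib may raise urllib.error.URLError (with the timeout in its reason attribute) or raise the timeout exception directly, so both cases are caught.

```python
import socket
import urllib.error
import urllib.request


def fetch_or_none(url, timeout=1.0):
    """Return the response body as bytes, or None if the request times out.

    fetch_or_none is a hypothetical helper name used for illustration.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as e:
        # Timeouts while connecting or sending are wrapped in URLError.reason.
        if isinstance(e.reason, socket.timeout):
            return None
        raise  # any other URL or HTTP error is not handled here
    except socket.timeout:
        # Timeouts while waiting for or reading the response are raised directly.
        return None
```

A call such as fetch_or_none('http://httpbin.org/get', timeout=1) then returns either the body or None, and the caller can decide whether to retry.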

Other parameters

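parameter | effect
--- | ---
context | Must be of type ssl.SSLContext; specifies the SSL settings
cafile | Path to a CA certificate file
capath | Path to a directory of CA certificates

cafile and capath have been deprecated since Python 3.6 and were removed in Python 3.13; use context instead.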

2. Request

import urllib.request

request = urllib.request.Request('https://creator.douyin.com/content/manage')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

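urlopen() is still used to send the request, but its argument is now a Request object instead of a URL string. The response is the same HTTPResponse type as before.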

API

urllib.request.Request(
    url,
    data=None,
    headers={},
    origin_req_host=None,
    unverifiable=False,
    method=None,
)
parameter | explanation
--- | ---
url | Required; the URL to request
data | Must be of type bytes; a dict can be encoded first with urllib.parse.urlencode()
headers | A dictionary of request headers
origin_req_host | Host name or IP address of the requester
unverifiable | Indicates whether the request is unverifiable, i.e. the user had no option to approve it (for example, an image fetched automatically for an HTML page). The default is False
method | A string naming the HTTP method to use, such as GET, POST, or PUT

A request built with several of these parameters:

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    # The header name is User-Agent, with a hyphen
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
form = {'name': 'Germey'}  # avoid shadowing the built-in dict
data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
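
Before sending a request like this, it can help to check what was actually built. The following sketch inspects a Request object offline; it assumes the http://httpbin.org/post endpoint used earlier in this section. Note that Request normalizes header names with str.capitalize(), so 'User-Agent' is stored as 'User-agent'.

```python
from urllib import parse, request

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org',
}
form = {'name': 'Germey'}
data = parse.urlencode(form).encode('utf-8')

req = request.Request(url, data=data, headers=headers, method='POST')

# Headers can also be added after construction.
req.add_header('Accept', 'application/json')

print(req.full_url)
print(req.get_method())
print(req.data)
print(req.get_header('User-agent'))   # stored under the capitalized key
print(req.header_items())
```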

3. Advanced usage

Handler

class | explanation
--- | ---
BaseHandler | The parent class of all other Handlers; provides the most basic methods
HTTPDefaultErrorHandler | Handles HTTP error responses; errors are raised as HTTPError exceptions
HTTPRedirectHandler | Handles redirects
HTTPCookieProcessor | Handles cookies
ProxyHandler | Sets the proxy; by default the proxy is empty
HTTPPasswordMgr | Manages passwords; maintains a table of user names and passwords
HTTPBasicAuthHandler | Manages authentication; if a link requires authentication when it is opened, this Handler can be used to supply it

Opener

The Request class and urlopen() used so far are convenience wrappers that the library provides around the most common request patterns, and they are enough for basic requests. For more advanced features you work with an Opener (the OpenerDirector class); urlopen() itself uses an Opener that urllib provides.

An Opener has an open() method, whose return type is the same as that of urlopen().

In general, an Opener is built from Handlers.
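
A minimal sketch of that pattern: build_opener() takes Handler instances and returns an Opener. The proxy address below is a placeholder assumption, not a working proxy, and the final open() call is left commented out so the example does not depend on network access.

```python
import http.cookiejar
import urllib.request

# Handler that stores cookies from responses and sends them back on later requests.
cookie_jar = http.cookiejar.CookieJar()
cookie_handler = urllib.request.HTTPCookieProcessor(cookie_jar)

# Handler that routes requests through a proxy (placeholder address, assumed for illustration).
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
})

# Combine the handlers into an Opener.
opener = urllib.request.build_opener(cookie_handler, proxy_handler)

# The Opener keeps its handlers in a list; default handlers are added automatically.
for handler in opener.handlers:
    print(type(handler).__name__)

# opener.open() accepts a URL or a Request, like urlopen():
# response = opener.open('http://httpbin.org/get')
# print(response.read().decode('utf-8'))
```

Passing a ProxyHandler instance replaces the default one that build_opener() would otherwise add, so only the explicitly configured proxy settings apply.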

Authentication - HTTPBasicAuthHandler
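
A minimal sketch of Basic authentication with an Opener, assuming a placeholder URL, user name, and password (none of them refer to a real service). HTTPPasswordMgrWithDefaultRealm stores the credentials, HTTPBasicAuthHandler uses them when the server answers with 401, and build_opener() turns the Handler into an Opener.

```python
import urllib.error
import urllib.request

# Placeholder values, assumed for illustration only.
url = 'http://localhost:5000/'
username = 'username'
password = 'password'

# The password manager maps (realm, URL) to credentials; realm None means "any realm".
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, url, username, password)

auth_handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
opener = urllib.request.build_opener(auth_handler)

# Check what the manager would return for this URL (no network access needed).
print(password_mgr.find_user_password(None, url))

# With a server that requires Basic authentication running at `url`:
# try:
#     response = opener.open(url)
#     print(response.read().decode('utf-8'))
# except urllib.error.URLError as e:
#     print(e.reason)
```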

Added by a-mo on Mon, 07 Mar 2022 19:53:27 +0200