1. Development environment configuration
2. Crawler basics
3. Using the basic libraries
3.1 Using urllib
urllib is part of the Python standard library. Its main modules are:
- request: the basic module for sending HTTP requests. Just as you type a URL into a browser and press Enter, you can simulate that process by passing the URL and any additional parameters to the library's functions.
- error: the exception handling module. If a request fails, you can catch the exception and retry or take other action, so the program does not terminate unexpectedly.
- parse: a utility module that provides URL processing functions such as splitting, parsing, and joining.
- robotparser: mainly used to parse a site's robots.txt file and determine which pages may be crawled and which may not. In practice it is used less often than the others.
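As a rough illustration of the parse and robotparser modules, the sketch below runs entirely offline. The robots.txt rules and the example.com URLs are made up for the example; a real crawler would normally download the site's robots.txt instead of passing rules in by hand.

```python
from urllib.parse import urlencode, urljoin, urlparse, urlunparse
from urllib.robotparser import RobotFileParser

# parse: split a URL into its components, then put it back together.
parts = urlparse("https://www.python.org/search/?q=urllib#results")
print(parts.netloc, parts.path, parts.query)
rebuilt = urlunparse(parts)

# parse: resolve a relative link against the page it was found on.
absolute = urljoin("https://www.python.org/doc/", "../about/")

# parse: turn a dict into a query string.
query = urlencode({"word": "hello", "page": 2})

# robotparser: feed robots.txt rules directly instead of downloading them.
# These rules are hypothetical and only serve the example.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])
allowed = robots.can_fetch("*", "https://example.com/public/page")
blocked = robots.can_fetch("*", "https://example.com/private/page")

print(rebuilt, absolute, query, allowed, blocked)
```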
3.1.1 Sending requests
1. urlopen()
import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.read().decode('utf-8'))
urlopen() returns an object of type HTTPResponse. Its main methods are read(), readinto(), getheader(name), getheaders(), and fileno(), and its main attributes are msg, version, status, reason, debuglevel, and closed.
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))

# Output:
# 200
# [('Server', 'Tengine'), ('Content-......096908924e')]
# 'Tengine'
API:
urllib.request.urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT, *, cafile=None, capath=None, cadefault=False, context=None)
data parameter
The data parameter is optional. If you pass it, it must be of type bytes, so a string such as a URL-encoded form has to be converted first, for example with the bytes() function. When this parameter is passed, the request is no longer sent as GET but as POST.
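import urllib.parse
import urllib.request

# urlencode() builds the form string; bytes() converts it for the data parameter.
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())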
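The switch from GET to POST can be checked without sending anything. The sketch below uses the Request class (covered in the next section) only because it exposes get_method(); it assumes the same httpbin.org URLs as the example above.

```python
from urllib.parse import urlencode
from urllib.request import Request

# urlencode() returns a str; the data parameter needs bytes.
encoded = urlencode({"word": "hello"})
body = encoded.encode("utf-8")  # same result as bytes(encoded, encoding="utf8")

# Building a Request does not send anything, so the method can be inspected offline.
without_data = Request("http://httpbin.org/get")
with_data = Request("http://httpbin.org/post", data=body)

print(without_data.get_method())
print(with_data.get_method())
```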
timeout parameter
Sets the timeout in seconds. If no response has been received within that time, an exception is raised. If the parameter is omitted, the global default timeout is used.
import urllib.request

response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
print(response.read())
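A timeout is usually something to handle rather than let crash the program. The following is a minimal sketch, not a definitive pattern: fetch_or_none is a hypothetical helper name, and it assumes that a timeout can surface either wrapped in URLError or as a bare socket.timeout, depending on where in the request it occurs.

```python
import socket
import urllib.error
import urllib.request


def fetch_or_none(url, timeout=1.0):
    """Return the response body as bytes, or None if the request timed out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as exc:
        # A timeout while connecting typically arrives wrapped in URLError.
        if isinstance(exc.reason, socket.timeout):
            return None
        raise  # any other URL error is not a timeout, so pass it on
    except socket.timeout:
        # A timeout while waiting for the response can be raised directly.
        return None
```

Calling fetch_or_none('http://httpbin.org/get', timeout=0.1) should return None when the server is too slow to answer in time; the caller can then decide whether to retry.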
Other parameters
parameter | effect |
---|---|
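context | Must be an ssl.SSLContext object; specifies the SSL settings |
cafile | Path to a file of CA certificates for HTTPS requests (deprecated since Python 3.6 and removed in 3.13; use context instead) |
capath | Path to a directory of CA certificates for HTTPS requests (deprecated since Python 3.6 and removed in 3.13; use context instead) |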
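A context object is normally created with the ssl module and then handed to urlopen(). The sketch below only builds the context and defines a helper; open_verified is a hypothetical name and the 10-second timeout is an arbitrary choice.

```python
import ssl
import urllib.request

# A default context verifies the server certificate and host name
# against the system's trusted CA store.
context = ssl.create_default_context()


def open_verified(url, timeout=10):
    """Open url using the SSL settings in context (hypothetical helper)."""
    return urllib.request.urlopen(url, timeout=timeout, context=context)


print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)
```

If a specific CA bundle is needed, ssl.create_default_context() also accepts cafile and capath keyword arguments, so the certificate location can be set on the context itself.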
2. Request
import urllib.request

request = urllib.request.Request('https://creator.douyin.com/content/manage')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
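urlopen() is still used to send the request, but its argument is now a Request object instead of a URL string. Building the request as a separate object makes it possible to configure its data, headers, and method.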
API
urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
parameter | description |
---|---|
url | The URL to request; the only required parameter |
data | Optional; if passed, it must be of type bytes |
headers | A dictionary containing the request headers |
origin_req_host | Host name or IP address of the host making the request |
unverifiable | Indicates whether the request is unverifiable. The default is False. A request is unverifiable when the user had no option to approve it, for example an image in an HTML document that is fetched automatically |
method | A string indicating the HTTP method of the request, such as GET, POST, or PUT |
from urllib import parse, request

# Assumed endpoint; it matches the Host header below.
url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org',
}
form = {'name': 'Germey'}
data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
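Headers do not have to be passed to the constructor; they can also be added afterwards with add_header(). The sketch below builds a Request and inspects it without sending it. The Accept header is an arbitrary addition for illustration.

```python
from urllib.request import Request

req = Request(
    "http://httpbin.org/post",
    data=b"name=Germey",
    headers={"User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"},
    method="POST",
)

# Headers can also be added after construction.
req.add_header("Accept", "application/json")

# Request stores header names via str.capitalize(), so "User-Agent"
# is looked up as "User-agent".
print(req.get_header("User-agent"))
print(req.header_items())
```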
3. Advanced usage
Handler
class | description |
---|---|
BaseHandler | The parent class of all other handlers; provides the most basic methods |
HTTPDefaultErrorHandler | Handles HTTP response errors; errors are raised as HTTPError exceptions |
HTTPRedirectHandler | Handles redirects |
HTTPCookieProcessor | Handles cookies |
ProxyHandler | Sets a proxy; the default proxy is empty |
HTTPPasswordMgr | Manages passwords; it maintains a table of user names and passwords |
HTTPBasicAuthHandler | Manages authentication; if a link requires authentication when opened, this handler can supply it |
Opener
The Request class and urlopen() used so far are convenience wrappers that the library provides for the most common request patterns, and they are enough for basic requests. More advanced features require working one level lower, with an Opener (the OpenerDirector class).
An Opener has an open() method, and its return type is the same as that of urlopen().
Generally, an Opener is built from one or more Handlers.
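As a minimal sketch of building an Opener from a Handler, the example below combines build_opener() with HTTPCookieProcessor. The User-Agent value is a placeholder, and the actual request is left commented out because it needs network access.

```python
import http.cookiejar
import urllib.request

# HTTPCookieProcessor stores cookies from responses in the jar and
# sends them back on later requests made through the same opener.
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar),
)

# Headers sent with every request made through this opener
# (the User-Agent value is a placeholder).
opener.addheaders = [("User-Agent", "example-crawler/0.1")]

# build_opener() adds the default handlers alongside the one passed in.
handler_names = [type(h).__name__ for h in opener.handlers]
print(handler_names)

# Sending a request would look like this (requires network access):
# response = opener.open("https://www.python.org")
# print(response.status, len(cookie_jar))
```

To make plain urlopen() calls go through the same opener, it can be registered with urllib.request.install_opener(opener).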
Authentication - HTTPBasicAuthHandler