What is the urllib library
urllib is Python's built-in HTTP request library, so no additional installation is needed. It consists of the following four modules:
- urllib.request: request module
- urllib.error: exception handling module
- urllib.parse: URL parsing module
- urllib.robotparser: robots.txt parsing module
urllib.request
urllib.request.urlopen()
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
- url: the URL address.
- data: additional data sent to the server. It must be passed as a byte stream (bytes). The default is None, in which case the request uses GET; if data is not None, the request uses POST.
- timeout: access timeout, in seconds.
- cafile and capath: cafile is the CA certificate and capath is the path to the CA certificates; they are used with HTTPS.
- cadefault: deprecated.
- context: an ssl.SSLContext object used to specify SSL settings.
from urllib.request import urlopen

response = urlopen("https://www.baidu.com/")
# Note: the response body can only be read once; the calls below are alternatives.
print(response.read())                  # Read the whole body
print(response.read(20))                # Read the first 20 bytes
print(response.read().decode("utf-8"))  # Decode the body as UTF-8
print(response.readline())              # Read a single line
lines = response.readlines()            # Read everything into a list of lines
for line in lines:
    print(line)
Using the data parameter
import urllib.parse
import urllib.request

data = {"Word": "Hello"}
data = urllib.parse.urlencode(data).encode('utf-8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
html = response.readlines()
for line in html:
    print(line)
Using the timeout parameter
import urllib.request
import urllib.error
import socket

try:
    response = urllib.request.urlopen("http://httpbin.org", timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print("TIME OUT!")
Response
Status code
When crawling a web page, we often need to check whether it can be accessed normally. The getcode() method returns the HTTP status code: 200 means the page is reachable, while 404 means the page does not exist:
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen("http://www.baidu.com")
    print(response.getcode())  # 200 if the page is reachable
except urllib.error.HTTPError as e:
    print(e.code)              # e.g. 404 if the page does not exist
Response header
import urllib.request

response = urllib.request.urlopen("http://httpbin.org")
print(type(response))
print(response.status)
print(response.getheaders())
print(response.getheader("Server"))
The result is
<class 'http.client.HTTPResponse'>
200
[('Date', 'Wed, 09 Feb 2022 04:20:20 GMT'), ('Content-Type', 'text/html; charset=utf-8'), ('Content-Length', '9593'), ('Connection', 'close'), ('Server', 'gunicorn/19.9.0'), ('Access-Control-Allow-Origin', '*'), ('Access-Control-Allow-Credentials', 'true')]
gunicorn/19.9.0
urllib.request.Request class
We usually need to simulate request headers to crawl a web page. For this, we use the urllib.request.Request class:
class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

- url: the URL address.
- data: additional data sent to the server; the default is None.
- headers: HTTP request headers, as a dictionary.
- origin_req_host: the host of the original request, as an IP address or domain name.
- unverifiable: rarely used; indicates whether the request is unverifiable. The default is False.
- method: the request method, such as GET, POST, DELETE, or PUT.
from urllib import request, parse

url = "http://httpbin.org/post"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.9 Safari/537.36"
}
data = {"name": "Germer"}
data = parse.urlencode(data).encode("utf-8")
req = request.Request(url, data=data, headers=headers, method="POST")
req.add_header("Host", "httpbin.org")  # Add a request header
response = request.urlopen(req)
lines = response.readlines()
for line in lines:
    print(line.decode("utf-8"))
The result is
{ "args": {}, "data": "", "files": {}, "form": { "name": "Germer" }, "headers": { "Accept-Encoding": "identity", "Content-Length": "11", "Content-Type": "application/x-www-form-urlencoded", "Host": "httpbin.org", "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.9 Safari/537.36", "X-Amzn-Trace-Id": "Root=1-620347ea-463c469d1cc6e37114f8f842" }, "json": null, "origin": "120.219.4.162", "url": "http://httpbin.org/post" }
Handler
Proxy
If we always use the same IP to request pages from the same website, the site's server may eventually block us. We can therefore send requests through a proxy IP. A proxy here means a proxy server: when we send a request through a proxy IP, the server sees the proxy's IP address, and even if that address gets blocked we can switch to another proxy and continue crawling. Setting up a proxy is one way to avoid being stopped by anti-crawler measures.
Using a proxy
proxy_support = urllib.request.ProxyHandler({})
The parameter is a dictionary. The keys are the proxy types, such as http, ftp, or https, and the values are the proxy's IP address and port. The protocol (http or https) must be prepended to the proxy address. When the requested link uses the http protocol, ProxyHandler uses the http proxy; when it uses the https protocol, the https proxy is used.
import urllib.request

proxy_ip = "58.240.53.196:8080"
proxy_handler = urllib.request.ProxyHandler({
    "http": "http://" + proxy_ip,
    "https": "https://" + proxy_ip
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open("http://www.baidu.com")
html = response.read().decode("utf-8")
print(html)
Creating an opener
An opener can be thought of as a custom-built version of urlopen: it can be configured with special headers or with a specified proxy IP. The build_opener() function creates an opener with our own customization. Here it is equivalent to an opener that already has the proxy set, so we can call the opener object's open() method directly to access the link we want.
opener = urllib.request.build_opener(proxy_handler)
Note that urlopen() will not use this opener; you need to call the opener's own open() method to open the page.
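If you want plain urlopen() to go through the proxy as well, the standard library's install_opener() can register a custom opener globally. A minimal sketch, reusing the example proxy address from above (an assumption; substitute a working proxy):

import urllib.request

# Example proxy address taken from the section above; replace with a working proxy.
proxy_handler = urllib.request.ProxyHandler({
    "http": "http://58.240.53.196:8080",
    "https": "https://58.240.53.196:8080"
})
opener = urllib.request.build_opener(proxy_handler)

# install_opener() makes this opener the global default,
# so urlopen() will also send requests through the proxy.
urllib.request.install_opener(opener)

response = urllib.request.urlopen("http://www.baidu.com")
print(response.status)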
The following example uses an IP pool and randomly selects a proxy IP for each request. It assumes that all of our proxy IPs are stored in a file named IP.txt.
from urllib import request, error
import random
import socket

url = "http://ip.tool.chinaz.com"

# Read the proxy IP pool from IP.txt
proxy_iplist = []
with open("IP.txt", "r") as f:
    for line in f.readlines():
        ip = line.strip()
        proxy_iplist.append(ip)

while True:
    proxy_ip = random.choice(proxy_iplist)
    proxy_handler = request.ProxyHandler({
        "http": "http://" + proxy_ip,
        "https": "https://" + proxy_ip
    })
    opener = request.build_opener(proxy_handler)
    try:
        response = opener.open(url, timeout=1)
        print(response.read().decode("utf-8"))
    except error.HTTPError as e2:
        # HTTPError is a subclass of URLError, so it must be caught first
        if e2.code == 404:
            print("404 ERROR!")
    except error.URLError as e1:
        if isinstance(e1.reason, socket.timeout):
            print("TIME OUT!")
    finally:
        flag = input("Y/N")
        if flag == 'N' or flag == 'n':
            break
Using a proxy that requires authentication
proxy = 'username:password@58.240.53.196:8080'
Here, you only need to change the proxy variable to include the username and password for proxy authentication.
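A minimal sketch of requesting a page through an authenticated proxy, assuming the placeholder credentials and proxy address above:

import urllib.request

# Placeholder credentials and proxy address; replace with real values.
proxy = 'username:password@58.240.53.196:8080'
proxy_handler = urllib.request.ProxyHandler({
    "http": "http://" + proxy,
    "https": "https://" + proxy
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open("http://www.baidu.com")
print(response.read().decode("utf-8"))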
Cookie
We use the functions of the http.cookiejar library to work with cookies (for example, to keep login state across requests).
The CookieJar class has several subclasses: FileCookieJar, MozillaCookieJar, and LWPCookieJar.
- CookieJar: an object that manages HTTP cookie values, stores cookies generated by HTTP requests, and adds cookies to outgoing HTTP requests. The whole cookie jar is kept in memory, so its cookies are lost once the CookieJar instance is garbage-collected.
- FileCookieJar(filename, delayload=None, policy=None): derived from CookieJar; creates a FileCookieJar instance that retrieves cookie information and stores cookies in a file. filename is the name of the file in which cookies are stored. When delayload is True, deferred file access is supported, i.e. the file is read or written only when needed.
- MozillaCookieJar(filename, delayload=None, policy=None): derived from FileCookieJar; creates a FileCookieJar instance compatible with the Mozilla browser's cookies.txt format.
- LWPCookieJar(filename, delayload=None, policy=None): derived from FileCookieJar; creates a FileCookieJar instance compatible with the libwww-perl Set-Cookie3 file format (see the sketch after the MozillaCookieJar examples below).
Code example
# This code demonstrates how to obtain cookies, store them in a CookieJar object and print them
import urllib.request
import http.cookiejar

url = "http://www.baidu.com"
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open(url)

# Print the cookies
for item in cookie:
    print(item.name + "=" + item.value)
BAIDUID=069F91E0E5A0B7E85F7FDFE97194CA18:FG=1
BIDUPSID=069F91E0E5A0B7E87C083ED4D88287F6
H_PS_PSSID=35105_31660_34584_35490_35245_35796_35316_26350_35765_35746
PSTM=1644458154
BDSVRTM=0
BD_HOME=1
# Save the obtained cookies to the cookie.txt file (no load() call)
import urllib.request
import http.cookiejar

url = "http://www.baidu.com"
filename = "cookie.txt"
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open(url)
cookie.save()
# Load previously saved cookies from the cookie.txt file (using load())
import urllib.request
import http.cookiejar

url = "http://www.baidu.com"
filename = "cookie.txt"
cookie = http.cookiejar.MozillaCookieJar()
cookie.load(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open(url)
cookie.txt file:
# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This is a generated file!  Do not edit.

.baidu.com	TRUE	/	FALSE	1675995432	BAIDUID	139CA77B6F46CA597186A3F1F6FCF790:FG=1
.baidu.com	TRUE	/	FALSE	3791943079	BIDUPSID	139CA77B6F46CA5980C5AB053579F5CF
.baidu.com	TRUE	/	FALSE	3791943079	PSTM	1644459432
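LWPCookieJar is used in exactly the same way; the only difference is the Set-Cookie3 file format it writes. A minimal sketch, saving to a hypothetical cookie_lwp.txt file:

import urllib.request
import http.cookiejar

url = "http://www.baidu.com"
filename = "cookie_lwp.txt"  # hypothetical file name for this example

# LWPCookieJar stores cookies in the libwww-perl Set-Cookie3 format
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open(url)
cookie.save()

# The saved cookies can later be reloaded with cookie.load(filename)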
urllib.error
The urllib.error module defines the exception classes for exceptions raised by urllib.request; the base exception class is URLError. urllib.error contains two exception classes: URLError and HTTPError.
URLError is a subclass of OSError. It (or one of its subclasses) is raised when the request runs into a problem. Its reason attribute gives the cause of the exception.
HTTPError is a subclass of URLError and handles special HTTP error responses, for example an authentication request. Its code attribute is the HTTP status code, reason is the cause of the exception, and headers holds the HTTP response headers of the request that caused the HTTPError.
Fetching a non-existent web page and handling the exception:
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen("http://www.baidu.com")
except urllib.error.HTTPError as e:
    if e.code == 404:
        print(e.code)  # 404
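URLError can be handled in the same way through its reason attribute. A minimal sketch, using a made-up, unresolvable hostname just to trigger the exception:

import urllib.request
import urllib.error

try:
    # Hypothetical hostname used only to provoke a URLError.
    response = urllib.request.urlopen("http://www.example-does-not-exist-xyz.com", timeout=3)
except urllib.error.HTTPError as e:
    # HTTPError is a subclass of URLError, so it must be caught first.
    print("HTTP error:", e.code, e.reason)
except urllib.error.URLError as e:
    print("URL error:", e.reason)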
urllib.parse
urllib.parse is used to parse URLs. The format is as follows:
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

- urlstring: the URL string to parse.
- scheme: the protocol type.
- allow_fragments: if False, fragment identifiers are not recognized; they are instead parsed as part of the path, parameters, or query component, and fragment is set to an empty string in the return value.
Note: when urlstring already specifies a protocol, the scheme parameter is ignored and the protocol in urlstring takes precedence. If urlstring contains no protocol, the scheme parameter is used.
import urllib.parse

result1 = urllib.parse.urlparse("https://www.csdn.net/?spm=1001.2101.3001.4476")
result2 = urllib.parse.urlparse("www.csdn.net/?spm=1001.2101.3001.4476", scheme="https")
result3 = urllib.parse.urlparse("https://www.csdn.net/?spm=1001.2101.3001.4476", scheme="http")
print(result1)
print(result2)
print(result3)
ParseResult(scheme='https', netloc='www.csdn.net', path='/', params='', query='spm=1001.2101.3001.4476', fragment='')
ParseResult(scheme='https', netloc='', path='www.csdn.net/', params='', query='spm=1001.2101.3001.4476', fragment='')
ParseResult(scheme='https', netloc='www.csdn.net', path='/', params='', query='spm=1001.2101.3001.4476', fragment='')
As the result shows, the return value is a 6-tuple of strings: scheme (protocol), netloc (network location), path, params, query, and fragment.
We can also read the attributes directly:
from urllib.parse import urlparse

result = urlparse("https://www.runoob.com/?s=python+%E6%95%99%E7%A8%8B")
print(result.scheme)
https
Attribute | Index | Value | Value if not present |
---|---|---|---|
scheme | 0 | URL scheme (protocol) | scheme parameter |
netloc | 1 | Network location part | Empty string |
path | 2 | Hierarchical path | Empty string |
params | 3 | Parameters for the last path element | Empty string |
query | 4 | Query component | Empty string |
fragment | 5 | Fragment identifier | Empty string |
username | | User name | None |
password | | Password | None |
hostname | | Host name (lower case) | None |
port | | Port number as an integer, if present | None |
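As a quick illustration of the remaining attributes, here is a sketch with a made-up URL that contains a user, password, port, params, and fragment:

from urllib.parse import urlparse

# Made-up URL for illustration only.
result = urlparse("https://user:pwd@www.example.com:8443/path/page;params?key=value#section")
print(result.netloc)    # user:pwd@www.example.com:8443
print(result.hostname)  # www.example.com
print(result.port)      # 8443
print(result.path)      # /path/page
print(result.params)    # params
print(result.query)     # key=value
print(result.fragment)  # section
print(result.username)  # user
print(result.password)  # pwd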
urlunparse
In addition, we can use urlunparse() for the reverse operation, assembling a URL from its components:
from urllib.parse import urlunparse

data = ["http", "www.baidu.com", "index.html", "user", "a=6", "comment"]
print(urlunparse(data))
http://www.baidu.com/index.html;user?a=6#comment
urljoin
urljoin(base, url, allow_fragments=True)

- base: the base (parent) URL.
- url: the URL, possibly relative, to join with the base.
- allow_fragments: whether to recognize fragment identifiers.
urljoin() joins base and url into a full address. If url is already a complete URL, url takes precedence.
from urllib import parse

url1 = parse.urljoin("https://www.baidu.com", "index.html")
url2 = parse.urljoin("https://www.baidu.com", "https://www.jianshu.com/p/20065f9b39bb")
print(url1)
print(url2)
https://www.baidu.com/index.html
https://www.jianshu.com/p/20065f9b39bb
urlencode
We know that GET parameters are separated by "&", while Python dictionaries separate their elements with ",". We can use urlencode() to convert a dictionary into "&"-separated key-value pairs for passing parameters:
from urllib import parse

data = {
    "keyword": "Python",
    "id": "3252525",
    "page": "3"
}
base_url = "http://www.example.com"
url = base_url + "?" + parse.urlencode(data)
print(url)
http://www.example.com?keyword=Python&id=3252525&page=3
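The encoded query string can then be requested with urlopen(). A minimal sketch against httpbin.org/get (assuming the service is reachable), which simply echoes back the query arguments it receives:

from urllib import parse, request

data = {"keyword": "Python", "page": "3"}
# httpbin.org/get echoes the query parameters it receives
url = "http://httpbin.org/get?" + parse.urlencode(data)
response = request.urlopen(url)
print(response.read().decode("utf-8"))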
urllib.robotparser
urllib.robotparser is used to parse robots.txt files.
robots.txt (always lower case) is a robots-protocol file stored in the root directory of a website. It is usually used to tell search engines the site's crawling rules.
urllib.robotparser provides the RobotFileParser class. The syntax is as follows:
class urllib.robotparser.RobotFileParser(url='')
This class provides methods to read and parse robots.txt files:
- set_url(url): sets the URL of the robots.txt file.
- read(): reads the robots.txt URL and feeds it to the parser.
- parse(lines): parses the given lines.
- can_fetch(useragent, url): returns True if useragent is allowed to fetch url according to the parsed robots.txt file.
- mtime(): returns the time the robots.txt file was last fetched. This is useful for long-running web crawlers that need to check for new robots.txt files periodically.
- modified(): sets the time the robots.txt file was last fetched to the current time.
- crawl_delay(useragent): returns the Crawl-delay parameter from robots.txt for the specified useragent. Returns None if the parameter does not exist, does not apply to the specified useragent, or the robots.txt entry for it has a syntax error.
- request_rate(useragent): returns the contents of the Request-rate parameter from robots.txt as a named tuple RequestRate(requests, seconds). Returns None if the parameter does not exist, does not apply to the specified useragent, or the robots.txt entry for it has a syntax error.
- site_maps(): returns the contents of the Sitemap parameter from robots.txt as a list(). Returns None if the parameter does not exist or the robots.txt entry for it has a syntax error.
>>> import urllib.robotparser
>>> rp = urllib.robotparser.RobotFileParser()
>>> rp.set_url("http://www.musi-cal.com/robots.txt")
>>> rp.read()
>>> rrate = rp.request_rate("*")
>>> rrate.requests
3
>>> rrate.seconds
20
>>> rp.crawl_delay("*")
6
>>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
False
>>> rp.can_fetch("*", "http://www.musi-cal.com/")
True