What is the urllib library
urllib is Python's built-in HTTP request library, so no additional installation is needed. It consists of the following four modules:
- urllib.request: request module
- urllib.error: exception handling module
- urllib.parse: URL parsing module
- urllib.robotparser: robots.txt parsing module
urllib.request
urllib.request.urlopen()
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
- url: the URL address.
- data: additional data sent to the server. It must be passed as a byte stream (bytes). The default is None, in which case the request uses GET; if data is not None, the request uses POST.
- timeout: access timeout, in seconds.
- cafile and capath: cafile is the CA certificate and capath is the path to the CA certificates; they are used with HTTPS.
- cadefault: deprecated.
- context: an ssl.SSLContext object used to specify SSL settings.
from urllib.request import urlopen

response = urlopen("https://www.baidu.com/")
# Note: the response body can only be read once; the calls below are alternatives.
print(response.read())                  # Read the whole body
print(response.read(20))                # Read the first 20 bytes
print(response.read().decode("utf-8"))  # Decode the body as UTF-8
print(response.readline())              # Read a single line
lines = response.readlines()            # Read everything into a list of lines
for line in lines:
    print(line)
Using the data parameter
import urllib.parse
import urllib.request

data = {"Word": "Hello"}
data = urllib.parse.urlencode(data).encode('utf-8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
html = response.readlines()
for line in html:
    print(line)
Using the timeout parameter
import urllib.request
import urllib.error
import socket

try:
    response = urllib.request.urlopen("http://httpbin.org", timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print("TIME OUT!")
Response
Status code
When crawling a web page, we often need to check whether it can be accessed normally. The getcode() method returns the HTTP status code: 200 means the page is reachable, while 404 means the page does not exist:
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen("http://www.baidu.com")
    print(response.getcode())  # 200 if the page is reachable
except urllib.error.HTTPError as e:
    print(e.code)              # e.g. 404 if the page does not exist
Response header
import urllib.request

response = urllib.request.urlopen("http://httpbin.org")
print(type(response))
print(response.status)
print(response.getheaders())
print(response.getheader("Server"))
The result is
<class 'http.client.HTTPResponse'>
200
[('Date', 'Wed, 09 Feb 2022 04:20:20 GMT'), ('Content-Type', 'text/html; charset=utf-8'), ('Content-Length', '9593'), ('Connection', 'close'), ('Server', 'gunicorn/19.9.0'), ('Access-Control-Allow-Origin', '*'), ('Access-Control-Allow-Credentials', 'true')]
gunicorn/19.9.0
urllib.request.Request class
We usually need to simulate request headers to crawl a web page. For this, we use the urllib.request.Request class:
class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

- url: the URL address.
- data: additional data sent to the server; the default is None.
- headers: HTTP request headers, as a dictionary.
- origin_req_host: the host of the original request, as an IP address or domain name.
- unverifiable: rarely used; indicates whether the request is unverifiable. The default is False.
- method: the request method, such as GET, POST, DELETE, or PUT.
from urllib import request, parse

url = "http://httpbin.org/post"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.9 Safari/537.36"
}
data = {"name": "Germer"}
data = parse.urlencode(data).encode("utf-8")
req = request.Request(url, data=data, headers=headers, method="POST")
req.add_header("Host", "httpbin.org")  # Add a request header
response = request.urlopen(req)
lines = response.readlines()
for line in lines:
    print(line.decode("utf-8"))
The result is
{ "args": {}, "data": "", "files": {}, "form": { "name": "Germer" }, "headers": { "Accept-Encoding": "identity", "Content-Length": "11", "Content-Type": "application/x-www-form-urlencoded", "Host": "httpbin.org", "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.9 Safari/537.36", "X-Amzn-Trace-Id": "Root=1-620347ea-463c469d1cc6e37114f8f842" }, "json": null, "origin": "120.219.4.162", "url": "http://httpbin.org/post" }
Handler
Proxy
If we always use the same IP to request pages from the same website, the site's server may eventually block us. We can therefore send requests through a proxy IP. A proxy here means a proxy server: when we send a request through a proxy IP, the server sees the proxy's IP address, and even if that address gets blocked we can switch to another proxy and continue crawling. Setting up a proxy is one way to avoid being stopped by anti-crawler measures.
Using a proxy
proxy_support = urllib.request.ProxyHandler({})
The parameter is a dictionary. The keys are the proxy types, such as http, ftp, or https, and the values are the proxy's IP address and port. The protocol (http or https) must be prepended to the proxy address. When the requested link uses the http protocol, ProxyHandler uses the http proxy; when it uses the https protocol, the https proxy is used.
import urllib.request

proxy_ip = "58.240.53.196:8080"
proxy_handler = urllib.request.ProxyHandler({
    "http": "http://" + proxy_ip,
    "https": "https://" + proxy_ip
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open("http://www.baidu.com")
html = response.read().decode("utf-8")
print(html)
Creating an opener
An opener can be thought of as a custom-built version of urlopen: it can be configured with special headers or with a specified proxy IP. The build_opener() function creates an opener with our own customization. Here it is equivalent to an opener that already has the proxy set, so we can call the opener object's open() method directly to access the link we want.
opener = urllib.request.build_opener(proxy_handler)
Note that urlopen() will not use this opener; you need to call the opener's own open() method to open the page.
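If you want plain urlopen() to go through the proxy as well, the standard library's install_opener() can register a custom opener globally. A minimal sketch, reusing the example proxy address from above (an assumption; substitute a working proxy):

import urllib.request

# Example proxy address taken from the section above; replace with a working proxy.
proxy_handler = urllib.request.ProxyHandler({
    "http": "http://58.240.53.196:8080",
    "https": "https://58.240.53.196:8080"
})
opener = urllib.request.build_opener(proxy_handler)

# install_opener() makes this opener the global default,
# so urlopen() will also send requests through the proxy.
urllib.request.install_opener(opener)

response = urllib.request.urlopen("http://www.baidu.com")
print(response.status)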
The following example uses an IP pool and randomly selects a proxy IP for each request. It assumes that all of our proxy IPs are stored in a file named IP.txt.
from urllib import request, error
import random
import socket

url = "http://ip.tool.chinaz.com"

# Read the proxy IP pool from IP.txt
proxy_iplist = []
with open("IP.txt", "r") as f:
    for line in f.readlines():
        ip = line.strip()
        proxy_iplist.append(ip)

while True:
    proxy_ip = random.choice(proxy_iplist)
    proxy_handler = request.ProxyHandler({
        "http": "http://" + proxy_ip,
        "https": "https://" + proxy_ip
    })
    opener = request.build_opener(proxy_handler)
    try:
        response = opener.open(url, timeout=1)
        print(response.read().decode("utf-8"))
    except error.HTTPError as e2:
        # HTTPError is a subclass of URLError, so it must be caught first
        if e2.code == 404:
            print("404 ERROR!")
    except error.URLError as e1:
        if isinstance(e1.reason, socket.timeout):
            print("TIME OUT!")
    finally:
        flag = input("Y/N")
        if flag == 'N' or flag == 'n':
            break
Using a proxy that requires authentication
proxy = 'username:password@58.240.53.196:8080'
Here, you only need to change the proxy variable to include the username and password for proxy authentication.
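A minimal sketch of requesting a page through an authenticated proxy, assuming the placeholder credentials and proxy address above:

import urllib.request

# Placeholder credentials and proxy address; replace with real values.
proxy = 'username:password@58.240.53.196:8080'
proxy_handler = urllib.request.ProxyHandler({
    "http": "http://" + proxy,
    "https": "https://" + proxy
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open("http://www.baidu.com")
print(response.read().decode("utf-8"))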
Cookie
We use the functions of the http.cookiejar library to work with cookies (for example, to keep login state across requests).
The CookieJar class has several subclasses: FileCookieJar, MozillaCookieJar, and LWPCookieJar.
- CookieJar: an object that manages HTTP cookie values, stores cookies generated by HTTP requests, and adds cookies to outgoing HTTP requests. The whole cookie jar is kept in memory, so its cookies are lost once the CookieJar instance is garbage-collected.
- FileCookieJar(filename, delayload=None, policy=None): derived from CookieJar; creates a FileCookieJar instance that retrieves cookie information and stores cookies in a file. filename is the name of the file in which cookies are stored. When delayload is True, deferred file access is supported, i.e. the file is read or written only when needed.
- MozillaCookieJar(filename, delayload=None, policy=None): derived from FileCookieJar; creates a FileCookieJar instance compatible with the Mozilla browser's cookies.txt format.
- LWPCookieJar(filename, delayload=None, policy=None): derived from FileCookieJar; creates a FileCookieJar instance compatible with the libwww-perl Set-Cookie3 file format (see the sketch after the MozillaCookieJar examples below).
Code example
# This code demonstrates how to obtain cookies, store them in a CookieJar object and print them
import urllib.request
import http.cookiejar

url = "http://www.baidu.com"
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open(url)

# Print the cookies
for item in cookie:
    print(item.name + "=" + item.value)
BAIDUID=069F91E0E5A0B7E85F7FDFE97194CA18:FG=1
BIDUPSID=069F91E0E5A0B7E87C083ED4D88287F6
H_PS_PSSID=35105_31660_34584_35490_35245_35796_35316_26350_35765_35746
PSTM=1644458154
BDSVRTM=0
BD_HOME=1
# Save the obtained cookies to the cookie.txt file (no load() call)
import urllib.request
import http.cookiejar

url = "http://www.baidu.com"
filename = "cookie.txt"
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open(url)
cookie.save()
# Load previously saved cookies from the cookie.txt file (using load())
import urllib.request
import http.cookiejar

url = "http://www.baidu.com"
filename = "cookie.txt"
cookie = http.cookiejar.MozillaCookieJar()
cookie.load(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open(url)
cookie.txt file:
# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This is a generated file!  Do not edit.

.baidu.com	TRUE	/	FALSE	1675995432	BAIDUID	139CA77B6F46CA597186A3F1F6FCF790:FG=1
.baidu.com	TRUE	/	FALSE	3791943079	BIDUPSID	139CA77B6F46CA5980C5AB053579F5CF
.baidu.com	TRUE	/	FALSE	3791943079	PSTM	1644459432
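LWPCookieJar is used in exactly the same way; the only difference is the Set-Cookie3 file format it writes. A minimal sketch, saving to a hypothetical cookie_lwp.txt file:

import urllib.request
import http.cookiejar

url = "http://www.baidu.com"
filename = "cookie_lwp.txt"  # hypothetical file name for this example

# LWPCookieJar stores cookies in the libwww-perl Set-Cookie3 format
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open(url)
cookie.save()

# The saved cookies can later be reloaded with cookie.load(filename)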
urllib.error
The urllib.error module defines the exception classes for exceptions raised by urllib.request; the base exception class is URLError. urllib.error contains two exception classes: URLError and HTTPError.
URLError is a subclass of OSError. It (or one of its subclasses) is raised when the request runs into a problem. Its reason attribute gives the cause of the exception.
HTTPError is a subclass of URLError and handles special HTTP error responses, for example an authentication request. Its code attribute is the HTTP status code, reason is the cause of the exception, and headers holds the HTTP response headers of the request that caused the HTTPError.
Fetching a non-existent web page and handling the exception:
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen("http://www.baidu.com")
except urllib.error.HTTPError as e:
    if e.code == 404:
        print(e.code)  # 404
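URLError can be handled in the same way through its reason attribute. A minimal sketch, using a made-up, unresolvable hostname just to trigger the exception:

import urllib.request
import urllib.error

try:
    # Hypothetical hostname used only to provoke a URLError.
    response = urllib.request.urlopen("http://www.example-does-not-exist-xyz.com", timeout=3)
except urllib.error.HTTPError as e:
    # HTTPError is a subclass of URLError, so it must be caught first.
    print("HTTP error:", e.code, e.reason)
except urllib.error.URLError as e:
    print("URL error:", e.reason)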
urllib.parse
urllib.parse is used to parse URLs. The format is as follows:
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

- urlstring: the URL string to parse.
- scheme: the protocol type.
- allow_fragments: if False, fragment identifiers are not recognized; they are instead parsed as part of the path, parameters, or query component, and fragment is set to an empty string in the return value.
Note: when urlstring already specifies a protocol, the scheme parameter is ignored and the protocol in urlstring takes precedence. If urlstring contains no protocol, the scheme parameter is used.
import urllib.parse

result1 = urllib.parse.urlparse("https://www.csdn.net/?spm=1001.2101.3001.4476")
result2 = urllib.parse.urlparse("www.csdn.net/?spm=1001.2101.3001.4476", scheme="https")
result3 = urllib.parse.urlparse("https://www.csdn.net/?spm=1001.2101.3001.4476", scheme="http")
print(result1)
print(result2)
print(result3)
ParseResult(scheme='https', netloc='www.csdn.net', path='/', params='', query='spm=1001.2101.3001.4476', fragment='')
ParseResult(scheme='https', netloc='', path='www.csdn.net/', params='', query='spm=1001.2101.3001.4476', fragment='')
ParseResult(scheme='https', netloc='www.csdn.net', path='/', params='', query='spm=1001.2101.3001.4476', fragment='')
As the result shows, the return value is a 6-tuple of strings: scheme (protocol), netloc (network location), path, params, query, and fragment.
We can also read the attributes directly:
from urllib.parse import urlparse

result = urlparse("https://www.runoob.com/?s=python+%E6%95%99%E7%A8%8B")
print(result.scheme)
https
Attribute | Index | Value | Value if not present |
---|---|---|---|
scheme | 0 | URL scheme (protocol) | scheme parameter |
netloc | 1 | Network location part | Empty string |
path | 2 | Hierarchical path | Empty string |
params | 3 | Parameters for the last path element | Empty string |
query | 4 | Query component | Empty string |
fragment | 5 | Fragment identifier | Empty string |
username | | User name | None |
password | | Password | None |
hostname | | Host name (lower case) | None |
port | | Port number as an integer, if present | None |
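As a quick illustration of the remaining attributes, here is a sketch with a made-up URL that contains a user, password, port, params, and fragment:

from urllib.parse import urlparse

# Made-up URL for illustration only.
result = urlparse("https://user:pwd@www.example.com:8443/path/page;params?key=value#section")
print(result.netloc)    # user:pwd@www.example.com:8443
print(result.hostname)  # www.example.com
print(result.port)      # 8443
print(result.path)      # /path/page
print(result.params)    # params
print(result.query)     # key=value
print(result.fragment)  # section
print(result.username)  # user
print(result.password)  # pwd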
urlunparse
In addition, we can use urlunparse() for the reverse operation, assembling a URL from its components:
from urllib.parse import urlunparse

data = ["http", "www.baidu.com", "index.html", "user", "a=6", "comment"]
print(urlunparse(data))
http://www.baidu.com/index.html;user?a=6#comment
urljoin
urljoin(base, url, allow_fragments=True)

- base: the base (parent) URL.
- url: the URL, possibly relative, to join with the base.
- allow_fragments: whether to recognize fragment identifiers.
urljoin() joins base and url into a full address. If url is already a complete URL, url takes precedence.
from urllib import parse

url1 = parse.urljoin("https://www.baidu.com", "index.html")
url2 = parse.urljoin("https://www.baidu.com", "https://www.jianshu.com/p/20065f9b39bb")
print(url1)
print(url2)
https://www.baidu.com/index.html
https://www.jianshu.com/p/20065f9b39bb
urlencode
We know that GET parameters are separated by "&", while Python dictionaries separate their elements with ",". We can use urlencode() to convert a dictionary into "&"-separated key-value pairs for passing parameters:
from urllib import parse

data = {
    "keyword": "Python",
    "id": "3252525",
    "page": "3"
}
base_url = "http://www.example.com"
url = base_url + "?" + parse.urlencode(data)
print(url)
http://www.example.com?keyword=Python&id=3252525&page=3
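The encoded query string can then be requested with urlopen(). A minimal sketch against httpbin.org/get (assuming the service is reachable), which simply echoes back the query arguments it receives:

from urllib import parse, request

data = {"keyword": "Python", "page": "3"}
# httpbin.org/get echoes the query parameters it receives
url = "http://httpbin.org/get?" + parse.urlencode(data)
response = request.urlopen(url)
print(response.read().decode("utf-8"))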
urllib.robotparser
urllib.robotparser is used to parse robots.txt files.
robots.txt (always lower case) is a robots-protocol file stored in the root directory of a website. It is usually used to tell search engines the site's crawling rules.
urllib.robotparser provides the RobotFileParser class. The syntax is as follows:
class urllib.robotparser.RobotFileParser(url='')
This class provides methods to read and parse robots.txt files:
- set_url(url): sets the URL of the robots.txt file.
- read(): reads the robots.txt URL and feeds it to the parser.
- parse(lines): parses the given lines.
- can_fetch(useragent, url): returns True if useragent is allowed to fetch url according to the parsed robots.txt file.
- mtime(): returns the time the robots.txt file was last fetched. This is useful for long-running web crawlers that need to check for new robots.txt files periodically.
- modified(): sets the time the robots.txt file was last fetched to the current time.
- crawl_delay(useragent): returns the Crawl-delay parameter from robots.txt for the specified useragent. Returns None if the parameter does not exist, does not apply to the specified useragent, or the robots.txt entry for it has a syntax error.
- request_rate(useragent): returns the contents of the Request-rate parameter from robots.txt as a named tuple RequestRate(requests, seconds). Returns None if the parameter does not exist, does not apply to the specified useragent, or the robots.txt entry for it has a syntax error.
- site_maps(): returns the contents of the Sitemap parameter from robots.txt as a list(). Returns None if the parameter does not exist or the robots.txt entry for it has a syntax error.
>>> import urllib.robotparser
>>> rp = urllib.robotparser.RobotFileParser()
>>> rp.set_url("http://www.musi-cal.com/robots.txt")
>>> rp.read()
>>> rrate = rp.request_rate("*")
>>> rrate.requests
3
>>> rrate.seconds
20
>>> rp.crawl_delay("*")
6
>>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
False
>>> rp.can_fetch("*", "http://www.musi-cal.com/")
True