quote method of python crawler urllib get request

quote method of get request

First of all, let's expand our little knowledge: the evolution of coding set
Because the computer was invented by Americans, only 127 characters were encoded into the computer at first, that is, upper and lower English letters, numbers and some symbols,
This coding table is called ASCII coding. For example, the upper case A is 65 and the lower case z is 122.
However, it is obvious that one byte is not enough to deal with Chinese. At least two bytes are required, and it cannot conflict with ASCII coding,
Therefore, China has formulated GB2312 code to encode Chinese.
You can imagine that there are hundreds of languages all over the world. Japan compiles Japanese into English_ In JIS, South Korea compiles Korean into Euc ‐ kr,
If countries have national standards, there will inevitably be conflicts. As a result, there will be garbled codes in multilingual texts.
Therefore, Unicode came into being. Unicode unifies all languages into one set of codes, so that there will be no more random code problems.
The Unicode standard is also evolving, but the most commonly used is to represent a character in two bytes (four bytes are required if very remote characters are used).
Modern operating systems and most programming languages support Unicode directly.

Let's start with a reptile Axe:

import urllib.request
url='https://www.baidu.com/s?wd='
headers={
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'
}

#Customization of request object

request=urllib.request.Request(url=url,headers=headers)

#Impersonate the browser to send a request to the server

response=urllib.request.urlopen(request)

#Get the content of the response

content=response.read().decode('utf-8')
print(content)

We observed that wd = is followed by the search content, but we learned from the expansion that Chinese must be converted into unicode to be recognized. At this time, we use the quote method in the get request.
Full code:

import urllib.request
import urllib.parse
url='https://www.baidu.com/s?wd='
headers={
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'
}

#Convert Jay Chou into unicode encoding format
name=urllib.parse.quote('Jay Chou')
url=url+name
#Customization of request object

request=urllib.request.Request(url=url,headers=headers)

#Impersonate the browser to send a request to the server

response=urllib.request.urlopen(request)

#Get the content of the response

content=response.read().decode('utf-8')
print(content)

urlencode method of get request

The above quote method can be used normally in the face of several data, but it is unable to deal with a large amount of data. Here we introduce urlencode method, which can extract data from the dictionary and convert it into unicode coding format, and automatically connect it with & symbol, which has advantages in dealing with a large amount of data.
Just read the code:


import urllib.request
import urllib.parse
url='https://www.baidu.com/s?'

base_url='https://www.baidu.com/s?'
data={
'wd':'Jay Chou',
'sex':'male',
'location':'Taiwan Province, China'
}
new_data=urllib.parse.urlencode(data)
url=base_url+new_data


headers={
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'
}

#Convert Jay Chou into unicode encoding format
name=urllib.parse.quote('Jay Chou')
url=url+name
#Customization of request object

request=urllib.request.Request(url=url,headers=headers)

#Impersonate the browser to send a request to the server

response=urllib.request.urlopen(request)

#Get the content of the response

content=response.read().decode('utf-8')
print(content)









Keywords: Python crawler Python crawler

Added by Oren on Sun, 06 Mar 2022 18:38:53 +0200