A Simple and Practical Guide to the Python Requests Library

1. Introduction to the requests library

2. Usage

2.1 Sending requests

The requests library provides different methods for sending different types of HTTP requests, such as GET, POST, and so on.

2.1.1 Sending a GET request

Simple GET request
A GET request is the simplest request in the requests library, and it is also very simple to use.

import requests
response = requests.get('url')

That is all it takes to send a simple GET request.

GET request with parameters
When sending a request, we often need to pass parameters to the server. Usually, the parameters are appended to the URL after a question mark, in the form of key/value pairs.

import requests
response = requests.get('http://www.xxx.com/get?id=1')

By appending parameters to the URL like this, we can send a request with parameters. Of course, if you don't want the hassle of building the URL by hand every time, the requests library also provides the params argument, which accepts a dictionary.

import requests
param = {'id': '1', 'page': '20'}
response = requests.get('http://www.xxx.com/get', params=param)
print(response.url)

Execution result:

http://www.xxx.com/get?id=1&page=20

2.1.2 Sending other requests

The requests library can also send the following requests:

import requests
resp_1 = requests.post('url')
resp_2 = requests.put('url')
resp_3 = requests.delete('url')
resp_4 = requests.head('url')
resp_5 = requests.options('url')
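
For example, a POST request usually carries form data through the data argument. A minimal sketch, assuming httpbin.org is reachable (it simply echoes the submitted form back):

import requests

# Form data is passed as a dictionary via the data argument
payload = {'hero': 'leesin'}
response = requests.post('https://httpbin.org/post', data=payload)

# httpbin.org echoes the submitted form fields back under the "form" key
print(response.status_code)
print(response.json()['form'])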

2.1.3 Setting request headers

When running a crawler, the pages we crawl often apply some anti-crawling measures. The most common one is checking the request headers. To get around this kind of check, we can modify the request headers so that the crawler keeps working normally.

import requests
header = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"
}
response = requests.get('url', headers=header)

In this way, the request is sent with the header values we want.
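
If you want to confirm which headers the server actually received, httpbin.org echoes them back. A minimal sketch, with a made-up User-Agent value just for illustration:

import requests

header = {"user-agent": "my-crawler/1.0"}  # example value, not a real browser string
response = requests.get('https://httpbin.org/headers', headers=header)

# httpbin.org returns the request headers it received as JSON
print(response.json()['headers'])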

2.2 Response content

2.2.1 Common response content

Getting the content is our ultimate goal, and the response object gives us access to it as follows.

import requests
response = requests.get('url')

# Get the response status code
print(response.status_code)

# Get the response headers
print(response.headers)

# Get the response body as text
print(response.text)

# Get the response cookies
print(response.cookies)

# Get the response URL
print(response.url)

About the response.text content
Execution result:

<!DOCTYPE html>
<html lang="zh-cn">
<head>
    <meta charset="utf-8" />

Through response.text we can read the page source directly. Here we need to pay special attention to the encoding of the source. If we don't set anything, requests may not guess the encoding of the page correctly, and the text will often come out garbled.
Solution: for the common UTF-8 encoding, we can set the encoding explicitly so that the text is decoded correctly.

import requests
response = requests.get('url')
response.encoding = 'utf-8'
print(response.text)

About getting the status code and response headers

import requests
response = requests.post('https://httpbin.org/post', data={'hero': 'leesin'})

print(response.status_code)  # Get the response status code
print(response.headers)      # Get the response headers
print(response.text)         # Get the response body

Execution result:

200
{'Access-Control-Allow-Credentials': 'true', 'Access-Control-Allow-Origin': '*', 'Content-Encoding': 'gzip', 'Content-Type': 'application/json', 'Date': 'Fri, 28 Jun 2019 14:38:09 GMT', 'Referrer-Policy': 'no-referrer-when-downgrade', 'Server': 'nginx', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'DENY', 'X-XSS-Protection': '1; mode=block', 'Content-Length': '258', 'Connection': 'keep-alive'}
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "hero": "leesin"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "11", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0"
  }, 
  "json": null, 
  "origin": "61.144.173.21, 61.144.173.21", 
  "url": "https://httpbin.org/post"
}
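
When the status code matters, you don't have to compare it by hand: response.ok is False for 4xx/5xx responses, and response.raise_for_status() raises an exception for them. A minimal sketch against an httpbin.org endpoint that always returns 404:

import requests

response = requests.get('https://httpbin.org/status/404')
print(response.status_code)  # 404
print(response.ok)           # False for 4xx/5xx responses

try:
    response.raise_for_status()  # raises HTTPError because the status is 404
except requests.exceptions.HTTPError as e:
    print(e)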

2.2.2 Binary response content

For non-text responses (such as images), you can also access the response body as bytes. Requests automatically decodes gzip and deflate transfer-encoded response data for you.

import requests

response = requests.get('http://xxx.com/3.jpg')
with open('1.jpg', 'wb') as f:
    f.write(response.content)

Here we save the image from the web page to a local file.
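
For larger files it is better not to hold the whole body in memory; requests can stream the download with stream=True and iter_content(). A minimal sketch, using an httpbin.org test image as the example URL:

import requests

# stream=True defers downloading the body until iter_content() is consumed
response = requests.get('https://httpbin.org/image/png', stream=True)
with open('image.png', 'wb') as f:
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)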

2.2.3 JSON response content

There is a built-in JSON decoder in Requests, which can help you process JSON data:

import requests
r = requests.get('https://api.github.com/events')
print(r.json())

# Output results
[{u'repository': {u'open_issues': 0, u'url': 'https://github.com/...
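
Since r.json() parses the body into ordinary Python objects, you can index into the result directly; note that it raises an error if the body is not valid JSON. A minimal sketch, assuming httpbin.org is reachable:

import requests

r = requests.get('https://httpbin.org/get', params={'id': '1'})
data = r.json()            # parsed into a Python dict
print(data['args']['id'])  # -> '1'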

2.3 Passing parameters

2.3.1 File upload

import requests

files = {
    'file': open('File name', 'rb')
}
response = requests.post("url", files=files)
print(response.text)

Note the path of the file to be uploaded.
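
If you need to control the filename and content type the server sees, requests also accepts a tuple per file. A minimal sketch, where report.csv is a hypothetical local file:

import requests

# (filename sent to the server, file object, content type)
files = {'file': ('report.csv', open('report.csv', 'rb'), 'text/csv')}
response = requests.post('https://httpbin.org/post', files=files)
print(response.json()['files'])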

2.4 Cookies

Getting cookies

import requests
response = requests.get('https://www.baidu.com')
print(response.cookies)
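
Cookies can also be sent with a request through the cookies argument. A minimal sketch using httpbin.org, which echoes back the cookies it received:

import requests

response = requests.get('https://httpbin.org/cookies', cookies={'number': '123456789'})
print(response.json())  # {'cookies': {'number': '123456789'}}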

Session maintenance
After obtaining the cookies, you can simulate a login.

import requests
session = requests.Session()
session.get('http://httpbin.org/cookies/set/number/123456789')
response = session.get('http://httpbin.org/cookies')
print(response.text)

Here we use a Session to keep the cookie set by the first request, so that the follow-up request carries it and the cookie is printed successfully. If you want to simulate a login, you can make the requests with requests.Session(); like a browser, it maintains the login session across requests.
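
As a rough illustration of such a simulated login, the sketch below posts credentials with a Session and then requests a protected page; the URL and form fields are hypothetical and must be replaced with the real ones for the target site:

import requests

session = requests.Session()
# Hypothetical login endpoint and form fields -- replace with the real ones
login_data = {'username': 'user', 'password': 'pass'}
session.post('https://example.com/login', data=login_data)

# The session keeps the login cookies, so later requests are authenticated
response = session.get('https://example.com/profile')
print(response.status_code)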
