Crawler learning - 04 The requests Library

requests Library

Although the urllib module in Python's standard library already covers most of the functionality we normally need, its API is awkward to use. Requests advertises itself as "HTTP for Humans", and it is indeed more concise and convenient.
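To make the difference concrete, here is a minimal sketch (the target page is just a placeholder) that fetches the same URL with urllib and with requests:

    import requests
    from urllib import request

    # urllib: open the URL, read the raw bytes, decode them yourself
    resp = request.urlopen("http://www.baidu.com/")
    html_from_urllib = resp.read().decode("utf-8")

    # requests: one call; decoding is handled for you
    html_from_requests = requests.get("http://www.baidu.com/").text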

Installation and documentation address

It can be installed easily with pip:

pip install requests

Chinese documentation: http://docs.python-requests.org/zh_CN/latest/index.html
GitHub repository: https://github.com/requests/requests

Send GET request

  1. The simplest way to send a GET request is to call requests.get:

    response = requests.get("http://www.baidu.com/")
    
  2. Add headers and query parameters:
    If you want to add request headers, pass a dictionary through the headers parameter. If you want to pass query parameters in the URL, use the params parameter. Example code:

     import requests
    
     kw = {'wd':'China'}
    
     headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}
    
     # params accepts a dictionary or string of query parameters; a dictionary is URL-encoded automatically (no urlencode() needed)
     response = requests.get("http://www.baidu.com/s", params = kw, headers = headers)
    
     # View the response content; response.text returns the decoded text (str)
     print(response.text)
    
     # View the response content; response.content returns the raw bytes
     print(response.content)
    
     # View full url address
     print(response.url)
    
     # View response header character encoding
     print(response.encoding)
    
     # View response code
     print(response.status_code)
    

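One point worth spelling out: response.text is the body decoded with the encoding requests guessed (response.encoding), while response.content is the raw bytes. If the guess is wrong and response.text looks garbled, a minimal sketch of the usual fixes (assuming the page is actually UTF-8):

    # Option 1: tell requests which encoding to use before reading .text
    response.encoding = 'utf-8'
    print(response.text)

    # Option 2: decode the raw bytes yourself
    print(response.content.decode('utf-8'))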
Send POST request

  1. The most basic POST request just calls the post method:

    response = requests.post("http://www.baidu.com/",data=data)
    
  2. Passing in data:
    There is no need to urlencode the data yourself; just pass in a dictionary. For example, the code that requests job listing data from Lagou:

     import requests
    
     url = "https://www.lagou.com/jobs/positionAjax.json?city=%E6%B7%B1%E5%9C%B3&needAddtionalResult=false&isSchoolJob=0"
    
     headers = {
         'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36',
         'Referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput='
     }
    
     data = {
         'first': 'true',
         'pn': 1,
         'kd': 'python'
     }
    
     resp = requests.post(url,headers=headers,data=data)
     # If the response body is JSON, you can call the json() method directly
     print(resp.json())
    

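Note that data= sends the dictionary as a form-encoded body (application/x-www-form-urlencoded). If the server expects a JSON body instead, requests can serialize the dictionary itself through the json parameter. A minimal sketch, using httpbin.org only as an echo service:

    import requests

    payload = {'first': 'true', 'pn': 1, 'kd': 'python'}

    # data= sends form-encoded key/value pairs
    print(requests.post('http://httpbin.org/post', data=payload).json())

    # json= serializes the dict to JSON and sets the Content-Type header
    print(requests.post('http://httpbin.org/post', json=payload).json())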
Use a proxy

Adding a proxy with requests is also very simple: just pass the proxies parameter to the request method (get, post, etc.). Example code:

import requests

url = "http://httpbin.org/get"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36',
}

proxy = {
    'http': '171.14.209.180:27829'
}

resp = requests.get(url,headers=headers,proxies=proxy)
with open('xx.html','w',encoding='utf-8') as fp:
    fp.write(resp.text)
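The proxies dictionary maps a URL scheme to a proxy address, so you can give separate entries for http and https, and credentials can be embedded in the proxy URL if it requires authentication. A minimal sketch with made-up proxy addresses:

    # Made-up proxy addresses; replace them with proxies you actually control
    proxies = {
        'http': 'http://171.14.209.180:27829',
        'https': 'http://user:password@10.10.1.10:1080',
    }
    resp = requests.get('http://httpbin.org/ip', proxies=proxies)
    print(resp.text)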

Cookies

If a response contains cookies, you can use the cookies attribute to get the returned cookie values:

import requests

resp = requests.get('http://www.baidu.com/')
print(resp.cookies)
print(resp.cookies.get_dict())
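Cookies can also be sent with a request by passing a dictionary to the cookies parameter. A minimal sketch with made-up cookie values, again using httpbin.org as an echo service:

    import requests

    # Made-up cookie values, just for illustration
    cookies = {'sessionid': 'abc123', 'token': 'xyz'}
    resp = requests.get('http://httpbin.org/cookies', cookies=cookies)
    print(resp.text)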

Session

Previously, with the urllib library, we could use an opener to send multiple requests that shared cookies. To share cookies across requests with the requests library, we can use the Session object it provides. Note that this is not the session from web development; it is simply a session object that keeps state between requests. Taking a Renren login as an example, the requests implementation looks like this:

import requests

url = "http://www.renren.com/PLogin.do"
data = {"email":"970138074@qq.com",'password':"pythonspider"}
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"
}

# Log in
session = requests.Session()
session.post(url, data=data, headers=headers)

# Visit Dapeng's profile page; the login cookies are sent automatically
resp = session.get('http://www.renren.com/880151247/profile')

print(resp.text)
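A session keeps more than cookies: headers set on session.headers are sent with every request it makes, and the session can be used as a context manager so its connections are closed cleanly. A minimal sketch, using httpbin.org to show that a cookie set by one request is sent back on the next:

    import requests

    with requests.Session() as s:
        # Headers attached to the session go out with every request it makes
        s.headers.update({'User-Agent': 'Mozilla/5.0'})
        s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
        resp = s.get('http://httpbin.org/cookies')  # the cookie set above is sent back
        print(resp.text)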

Handling untrusted SSL certificates

For websites whose SSL certificates are trusted, such as https://www.baidu.com/, requests returns a normal response directly. For a website whose certificate is not trusted, pass verify=False to skip certificate verification. The example code is as follows:

resp = requests.get('http://www.12306.cn/mormhweb/',verify=False)
print(resp.content.decode('utf-8'))
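With verify=False, requests emits an InsecureRequestWarning on every such call. If you have accepted the risk and want to silence the warning, urllib3 (which requests is built on) can disable it; a minimal sketch:

    import requests
    import urllib3

    # Silence the InsecureRequestWarning triggered by verify=False
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

    resp = requests.get('http://www.12306.cn/mormhweb/', verify=False)
    print(resp.status_code)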
