Python crawler learning notes: web requests

Reference blog: Python crawler learning notes (CSDN blog)

The requests library

Installation and documentation address:

Install using pip: pip install requests

Chinese documentation: Requests: HTTP for Humans - Requests 2.18.1 documentation

Send GET request:

1. The simplest way to send a GET request is to call requests.get:

import requests

response = requests.get('http://www.baidu.com')
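In real crawls it also helps to set a timeout and check the status before using the response; both are built into requests. A minimal sketch (the 5-second timeout is an arbitrary choice):

import requests

# Give up if the server does not respond within 5 seconds
response = requests.get('http://www.baidu.com', timeout=5)
# Raise an HTTPError for 4xx/5xx status codes
response.raise_for_status()
print(response.status_code)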

2. Add headers and query parameters:

To add request headers, pass a dict via the headers parameter; to append query-string parameters to the URL, use the params parameter. The example code is as follows:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
data = {'wd': 'China'}
url = 'https://www.baidu.com/s'

response = requests.get(url, params=data, headers=headers)
# View the response body as decoded text
print(response.text)
# View the response body as raw bytes; call decode() to turn it into a string
print(response.content)
# View the full url address
print(response.url)
# View response header character encoding
print(response.encoding)
# View the status code of the response
print(response.status_code)
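The difference between text and content matters when a page's declared encoding is wrong: text decodes the body using the guessed response.encoding, while content is the raw bytes. A short sketch, assuming the page is actually UTF-8:

import requests

response = requests.get('http://www.baidu.com')
# Override the guessed encoding before reading .text ...
response.encoding = 'utf-8'
print(response.text)
# ... or decode the raw bytes yourself
print(response.content.decode('utf-8'))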

Send POST request:

1. The most basic POST request can be sent with the post method:

response = requests.post('https://www.baidu.com/s', data=data)

2. Incoming data:

Unlike with urllib, there is no need to urlencode the data yourself; just pass in a dictionary. If the server returns JSON, response.json() parses it into a dictionary that you can index directly. The example code is as follows:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
    'Referer': 'https://www.lagou.com/jobs/list_python%E7%88%AC%E8%99%AB?city=%E5%85%A8%E5%9B%BD&cl=false&fromSearch=true&labelWords=&suginput='
}
data = {'first': 'true', 'pn': '1', 'kd': 'python crawler'}
url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
response = requests.post(url, data=data, headers=headers)
json_data = response.json()
result = json_data['content']['positionResult']['result']
for i in result:
    # Output company name
    print(i['companyShortName'])
    # Output city name
    print(i['city'])
    print('*' * 20)
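Note that data= sends the payload form-encoded. If an API expects a JSON request body instead, requests can serialize a dictionary for you via the json parameter. A minimal sketch against httpbin.org, which simply echoes the request back:

import requests

payload = {'first': 'true', 'pn': 1}
# json= serializes the dict and sets the Content-Type: application/json header
resp = requests.post('http://httpbin.org/post', json=payload)
print(resp.json()['json'])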

Use a proxy:

Adding a proxy with requests is very simple: just pass the proxies parameter to the request method (such as get or post). The example code is as follows:

import requests
url = 'http://httpbin.org/ip'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
proxy = {
    'http': '118.190.95.35:9001'
}
resp = requests.get(url, headers=headers, proxies=proxy)
print(resp.text)
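The proxies dictionary maps URL schemes to proxy addresses, so http and https traffic can go through different proxies, and credentials can be embedded in the proxy URL. A sketch with placeholder addresses (the hosts and the user:pass pair below are illustrative, not working endpoints):

import requests

proxies = {
    # Plain proxy for http traffic (placeholder address)
    'http': 'http://10.10.1.10:3128',
    # Authenticated proxy for https traffic (placeholder credentials)
    'https': 'http://user:pass@10.10.1.10:1080',
}
resp = requests.get('http://httpbin.org/ip', proxies=proxies)
print(resp.text)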

cookie:

If a response contains cookies, you can use the cookies attribute to get the returned cookie values:

import requests

resp = requests.get('http://www.baidu.com')
print(resp.cookies)
# Get cookie details
print(resp.cookies.get_dict())
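The cookies attribute is a RequestsCookieJar that behaves much like a dictionary, so individual cookies can be read by name. A short sketch ('BDORZ' is just an example of a cookie name Baidu may set; treat it as an assumption):

import requests

resp = requests.get('http://www.baidu.com')
# Iterate over every cookie in the jar
for name, value in resp.cookies.items():
    print(name, '=', value)
# Look up a single cookie by name (returns None if it is absent)
print(resp.cookies.get('BDORZ'))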

session:

Previously, with the urllib library, you could use an opener to send multiple requests that shared cookies. To share cookies across requests with the requests library, use the Session object it provides. Note that this is not a session in the web-development sense; it is simply an object that keeps state between requests. Take logging in to Renren as an example. The example code is as follows:
 

import requests
url = 'http://renren.com/PLogin.do'
data = {'email': 'Renren email account', 'password': 'password'}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
# Sign in
session = requests.Session()
session.post(url, headers=headers, data=data)

# Visit Dapeng's profile page
resp = session.get('http://www.renren.com/880151247/profile')

print(resp.text)
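Besides sharing cookies, a Session can carry default headers for every request it sends, and it works as a context manager that releases the underlying connections when done. A small sketch of both conveniences:

import requests

# Connections are released automatically when the with-block exits
with requests.Session() as session:
    # Headers set here are merged into every request the session makes
    session.headers.update({'User-Agent': 'Mozilla/5.0'})
    resp = session.get('http://httpbin.org/headers')
    print(resp.text)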

Handling untrusted SSL certificates:

For websites with trusted SSL certificates, such as https://www.baidu.com/, requests returns a normal response directly. If a site's SSL certificate is not trusted, add the parameter verify=False when requesting it. The example code is as follows:

resp = requests.get('https://www.12306.cn/mormhweb/', verify=False)
print(resp.content.decode('utf-8'))
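With verify=False, urllib3 (the library underneath requests) emits an InsecureRequestWarning on every request. If the risk is understood and the noise is unwanted, the warning can be silenced; a minimal sketch:

import requests
import urllib3

# Suppress the InsecureRequestWarning triggered by verify=False
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

resp = requests.get('https://www.12306.cn/mormhweb/', verify=False)
print(resp.status_code)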
