Using Requests to implement a simple web crawler

In the first section we briefly introduced how a crawler works; understanding the principle helps us write better code. Python provides several tools for making HTTP requests, so there is no need to start from raw socket programming, and third-party open source libraries offer even richer functionality. For example, here is how to request a URL with Python's built-in urllib module:

import ssl

from urllib.request import Request
from urllib.request import urlopen

context = ssl._create_unverified_context()

# HTTP request
request = Request(url="https://foofish.net/pip.html",
                  method="GET",
                  headers={"Host": "foofish.net"},
                  data=None)

# HTTP response
response = urlopen(request, context=context)
headers = response.info()  # Response header
content = response.read() # Response body
code = response.getcode() # Status code

Before sending the request, we first build a Request object and specify the URL, the request method, and the request headers. The request body (data) is empty here; since there is no data to submit to the server, it could also be left out entirely. The urlopen function then establishes a connection with the target server and sends the HTTP request. Its return value is a response object, which exposes the response headers, the response body, the status code, and other attributes.

However, Python's built-in module is quite low-level and requires a lot of code. For a simple crawler, consider Requests instead. With nearly 30k stars on GitHub, Requests is a very Pythonic library. Let's get familiar with how it is used.

Install requests

pip install requests

GET request

>>> import requests
>>> r = requests.get("https://httpbin.org/ip")
>>> r
<Response [200]> # Response object
>>> r.status_code  # Response status code
200

>>> r.content  # Response content
'{\n  "origin": "183.237.232.123"\n}\n'

POST request

>>> r = requests.post('http://httpbin.org/post', data={'key': 'value'})
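
Since httpbin.org echoes back what it receives, a quick way to confirm the form body was sent is to inspect the parsed response; sending a JSON body works the same way with the json argument (a small sketch):

>>> r.json()['form']
{'key': 'value'}
>>> # Send a JSON body instead of form data
>>> r = requests.post('http://httpbin.org/post', json={'key': 'value'})
>>> r.json()['json']
{'key': 'value'}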

Custom request header

This is used very often. A server's anti-crawler mechanism will check whether the User-Agent in the client's request headers comes from a real browser, so when using Requests we usually specify a UA to disguise the request as coming from a browser.

>>> url = 'https://httpbin.org/headers'
>>> headers = {'user-agent': 'Mozilla/5.0'}
>>> r = requests.get(url, headers=headers)
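
Because https://httpbin.org/headers simply echoes the request headers, you can check that the custom User-Agent really was sent (a quick sanity check, not something a real crawler needs):

>>> r.json()['headers']['User-Agent']
'Mozilla/5.0'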

Parameter transfer

URLs often carry a long string of query parameters. To improve readability, Requests lets you pull those parameters out and pass them via the params argument instead of appending them to the URL. For example, to request http://httpbin.org/get?key=val you can write:

>>> url = "http://httpbin.org/get"
>>> r = requests.get(url, params={"key":"val"})
>>> r.url
'http://httpbin.org/get?key=val'
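
params also accepts several keys at once, and a value may be a list, in which case the key is repeated in the query string (a small sketch):

>>> r = requests.get("http://httpbin.org/get", params={"key1": "val1", "key2": ["a", "b"]})
>>> r.url
'http://httpbin.org/get?key1=val1&key2=a&key2=b'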

Specify Cookie

Cookies are the credentials a browser uses to stay logged in to a website. Although cookies are technically part of the request headers, Requests lets you pass them separately via the cookies parameter:

>>> s = requests.get('http://httpbin.org/cookies', cookies={'from-my': 'browser'})
>>> s.text
u'{\n  "cookies": {\n    "from-my": "browser"\n  }\n}\n'

Set timeout

When the server responds very slowly and you don't want to wait indefinitely, you can specify timeout to set a request timeout in seconds. If the server has not responded within that time, the request is aborted.

r = requests.get('https://google.com', timeout=5)
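
If the timeout is exceeded, Requests raises requests.exceptions.Timeout, so a long-running crawler would normally wrap the call; a minimal sketch:

import requests

try:
    r = requests.get('https://google.com', timeout=5)
except requests.exceptions.Timeout:
    # Retry, skip the URL, or log the failure here
    print('request timed out')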

Set proxy

Sending too many requests in a short period is an easy way to be flagged as a crawler by the server, so we often use a proxy IP to hide the client's real IP.

import requests

proxies = {
    'http': 'http://127.0.0.1:1080',
    'https': 'http://127.0.0.1:1080',
}

r = requests.get('http://www.kuaidaili.com/free/', proxies=proxies, timeout=2)
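
To confirm that traffic really goes through the proxy, you can ask httpbin.org for the origin IP; in this sketch the 127.0.0.1:1080 address is just the placeholder from above and must be replaced with a working proxy:

import requests

proxies = {
    'http': 'http://127.0.0.1:1080',   # placeholder; use a real proxy address
    'https': 'http://127.0.0.1:1080',
}

# The "origin" field should show the proxy's IP rather than your own
r = requests.get('https://httpbin.org/ip', proxies=proxies, timeout=2)
print(r.text)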

Session

If you want to maintain a logged-in (session) state with the server without specifying cookies on every request, use a Session. A Session provides the same API as the requests module itself.

import requests

s = requests.Session()
s.cookies = requests.utils.cookiejar_from_dict({"a": "c"})
r = s.get('http://httpbin.org/cookies')
print(r.text)
# '{"cookies": {"a": "c"}}'

r = s.get('http://httpbin.org/cookies')
print(r.text)
# '{"cookies": {"a": "c"}}'

Putting it into practice

Now let's use Requests to build a simple crawler that fetches the follower list of a Zhihu column. Pick any column, open its follower list, and use Chrome's developer tools to find the request URL that returns the followers: https://zhuanlan.zhihu.com/api/columns/pythoneer/followers?limit=20&offset=20

Then we use Requests to simulate the browser and send the request to the server:

import json

import requests


class SimpleCrawler:
    init_url = "https://zhuanlan.zhihu.com/api/columns/pythoneer/followers"
    offset = 0

    def crawl(self, params=None):
        # A UA must be specified, otherwise the server rejects the request as invalid
        headers = {
            "Host": "zhuanlan.zhihu.com",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                          "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36",
        }
        response = requests.get(self.init_url, headers=headers, params=params)
        print(response.url)
        data = response.json()
        # 7000 is the total number of followers in this example
        # Load the remaining pages through recursive calls
        while self.offset < 7000:
            self.parse(data)
            self.offset += 20
            params = {"limit": 20, "offset": self.offset}
            self.crawl(params)

    def parse(self, data):
        # Append each follower to a file, one JSON object per line
        with open("followers.json", "a", encoding="utf-8") as f:
            for item in data:
                f.write(json.dumps(item))
                f.write('\n')


if __name__ == '__main__':
    SimpleCrawler().crawl()
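
The recursive call works here because offset lives on the instance, but the same paging can also be written as a plain loop, which avoids building up a deep call stack. A minimal sketch of that alternative (the total of 7000 followers and the page size of 20 are taken from the code above):

import requests

def crawl_followers(url, total=7000, page_size=20):
    """Fetch the follower list page by page with an ordinary loop."""
    headers = {"User-Agent": "Mozilla/5.0"}
    followers = []
    for offset in range(0, total, page_size):
        params = {"limit": page_size, "offset": offset}
        response = requests.get(url, headers=headers, params=params)
        followers.extend(response.json())
    return followers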

Summary

This is the simplest single-threaded crawler, built on Requests, for fetching the follower list of a Zhihu column. Requests is very flexible: request headers, request parameters, and cookie information can all be specified directly in the request method, and if the response body is JSON you can call the json() method to get a Python object directly. For more on Requests, see the official documentation: http://docs.python-requests.org/en/master/
