1. Parse the robots.txt file
Some sites use a robots.txt file to prohibit certain user agents from crawling the site. If the target site has such rules, we should respect them.
Use urllib's robotparser module to parse the robots.txt file so that URLs prohibited from crawling are not downloaded, and then use the can_fetch() function to determine whether a given user agent is allowed to fetch a specified URL under the parsed robots.txt rules.
urllib contains four modules: request, error, parse, and robotparser.
Usage steps:
1. Create a parser: rp = robotparser.RobotFileParser()
   # Returns an empty parser object of type <class 'urllib.robotparser.RobotFileParser'>
2. Set the robots.txt URL: rp.set_url('http://example.python-scraping.com/robots.txt')
3. Read and parse robots.txt: rp.read()
   # Reads the file and stores the parsed rules in the parser object
4. Check a URL against the parsed rules:
   allow = rp.can_fetch('BadCrawler', 'http://example.python-scraping.com/robots.txt')
   # Returns True if the given user agent is allowed to fetch the URL according to robots.txt;
   # both the user agent and the URL to be accessed are checked.
from urllib import robotparser

# Encapsulate the creation of the RobotFileParser object in a function:
# pass in the robots.txt URL and return the parser object.
def get_robots_parser(robots_url):
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp
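A quick sketch of how this helper can be used, reusing the example URL and the 'BadCrawler' agent from the steps above ('wswp' is the user agent used later in this article; whether an agent is allowed depends entirely on the site's robots.txt):

rp = get_robots_parser('http://example.python-scraping.com/robots.txt')
# The results below depend on the site's robots.txt rules
print(rp.can_fetch('BadCrawler', 'http://example.python-scraping.com/'))  # False if this agent is banned
print(rp.can_fetch('wswp', 'http://example.python-scraping.com/'))        # True if 'wswp' is allowed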
2. Support server proxy
Different from the user agent (which identifies the browsing client), the proxy here refers to a server proxy. For example, when your local IP is blocked and you cannot reach a web page, you can use a proxy server to get around the block.
Compared with the urllib library, Python has a more HTTP-friendly library: requests. urllib is part of the Python standard library, so it can be used directly once Python is installed.
requests is a third-party library that must be installed separately: pip install requests. It is generally recommended to use requests, which re-encapsulates urllib-style functionality and makes many operations easier. Here we still use urllib in order to get familiar with the lower-level operations.
The operation is very simple. The code is as follows:

import urllib.request

proxy = '60.7.208.137:9999'
proxy_support = urllib.request.ProxyHandler({'http': proxy})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)
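Once install_opener() has been called, every subsequent urlopen() call goes through the proxy. A minimal sketch, using the example site from this article (the proxy address above is only an example and may no longer be reachable):

response = urllib.request.urlopen('http://example.python-scraping.com/')
html = response.read()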
3. Download speed limit
If we crawl too quickly, our IP may get blocked or we may risk overloading the server, so we need to add a delay between downloads.
Here we use urlparse to parse the URL.
In Python 3 the urlparse module was merged into urllib, so urlparse() is now called from urllib.parse. urlparse() splits a URL into six parts: scheme (protocol), netloc (domain name), path, params (optional parameters), query (key-value pairs), and fragment (anchor), and returns them in tuple form.
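For example, a small sketch of what urlparse() returns for a sample URL (the URL itself is only for illustration):

from urllib.parse import urlparse

parts = urlparse('http://example.python-scraping.com/index/1?page=2#top')
print(parts.scheme)    # 'http'
print(parts.netloc)    # 'example.python-scraping.com'
print(parts.path)      # '/index/1'
print(parts.query)     # 'page=2'
print(parts.fragment)  # 'top'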
Create a Throttle class that records the last access time for each domain (the domain is extracted with urlparse). If the time since the last access to that domain is shorter than the given delay, sleep for the remaining time before crawling again.
from urllib.parse import urlparse
import time

class Throttle:
    def __init__(self, delay):
        self.delay = delay     # Minimum interval between downloads to the same domain
        self.domains = {}      # Timestamp of the most recent download for each domain

    def wait(self, url):
        domain = urlparse(url).netloc             # netloc gives the domain name
        last_accessed = self.domains.get(domain)  # Time of the last download
        if self.delay > 0 and last_accessed is not None:
            # Remaining sleep time = configured delay - time elapsed since the last download
            sleep_sec = self.delay - (time.time() - last_accessed)
            if sleep_sec > 0:
                time.sleep(sleep_sec)
        self.domains[domain] = time.time()        # Record the timestamp of this download
Call:
throttle = Throttle(delay)
......
throttle.wait(url)
html = GetData(url)
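For instance, a minimal sketch combining Throttle with the GetData download function from the earlier article (the 2-second delay and the URLs are just illustrative values):

throttle = Throttle(2)   # at least 2 seconds between downloads to the same domain
for url in ['http://example.python-scraping.com/index/1',
            'http://example.python-scraping.com/index/2']:
    throttle.wait(url)   # sleeps if the previous download to this domain was too recent
    html = GetData(url)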
4. Limit crawl depth
Some web pages generate their content dynamically. For example, an online calendar site provides links to the next month and the next year, and the next month's page in turn links to the month after that. A crawler that keeps following such links will crawl forever and fall into a crawler trap.
A simple way to avoid crawler traps is to record how many links were followed to reach the current page, i.e. its depth, and to set a threshold: once the maximum depth is reached, the crawler no longer adds that page's links to the queue.
Method:
Change seen into a dictionary that stores the depth of each discovered link.
We use the dictionary's get(key, default) method: if the key exists, its value is returned; otherwise the default is returned.
Setting the default to 0 initializes the depth of a URL crawled for the first time to 0.
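As a small illustration of how the seen dictionary works (the URL here is made up):

seen = {}
url = 'http://example.python-scraping.com/index/1'
print(seen.get(url, 0))   # 0: not crawled yet, so the default is returned
seen[url] = 1
print(seen.get(url, 0))   # 1: the recorded depth is returned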
Taking the previous link-crawling example and modifying it, we get:
import re
from urllib.parse import urljoin

def scrap_link(start_url, link_regex, robots_url=None, user_agent='wswp', max_depth=5):
    if not robots_url:
        robots_url = '{}/robots.txt'.format(start_url)
    rp = get_robots_parser(robots_url)
    crawl_queue = [start_url]
    # seen = [start_url]
    # seen = set(crawl_queue)  # set() returns a 'set' object; elements are added with seen.add()
    seen = {}
    while crawl_queue:
        url = crawl_queue.pop()
        if rp.can_fetch(user_agent, url):
            depth = seen.get(url, 0)
            if depth == max_depth:
                print('The depth exceeds the maximum depth. Skip this page')
                continue
            html = GetData(url, user_agent=user_agent)
            if html is None:
                continue
            for link in getlinks(html):
                if re.match(link_regex, link):
                    abs_link = urljoin(start_url, link)
                    if abs_link not in seen:
                        crawl_queue.append(abs_link)
                        # depth reflects the number of layers of the loop, similar to a
                        # level-order traversal of a tree: it increases by one per layer
                        seen[abs_link] = depth + 1

# Returns the links of all <a> tags on a page
def getlinks(html):
    url_regex = re.compile("""<a[^>]+href=["'](.*?)["']""", re.IGNORECASE)
    return url_regex.findall(html)
    # Or call re.findall() directly on the pattern string with the re.IGNORECASE flag
The key modifications are
def scrap_link(start_url, link_regex, robots_url=None, user_agent='wswp', max_depth=5):
    .........
    seen = {}
    ........
    if rp.can_fetch(user_agent, url):
        depth = seen.get(url, 0)
        if depth == max_depth:
            print('The depth exceeds the maximum depth. Skip this page')
            continue
        .....
        for link in getlinks(html):
            if re.match(link_regex, link):
                abs_link = urljoin(start_url, link)
                if abs_link not in seen:
                    crawl_queue.append(abs_link)
                    seen[abs_link] = depth + 1  # depth reflects the number of layers of the loop
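A hypothetical call might look like the following; the link regex is only a placeholder and should be whatever pattern matches the links you actually want to follow:

scrap_link('http://example.python-scraping.com',
           link_regex='.*/(index|view)',   # placeholder pattern, adjust for your target site
           max_depth=5)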
5. Use requests
requests re-encapsulates urllib-style functionality and makes many operations easier. It is recommended to learn the urllib library first to become familiar with the underlying mechanism, and then to use mainly the requests library.
In the GetData function I wrote earlier, some parts can be replaced with requests. For example:
1. # Add user agent
request = urllib.request.Request(url)
request.add_header('User-Agent', user_agent)
response = urllib.request.urlopen(request)
# add server proxy
proxy_support = urllib.request.ProxyHandler({'http': proxy})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)
Replace with
response = requests.get(url, headers = header, proxies = proxy)
2. # Handle decoding
cs = response.headers.get_content_charset()
html = response.read().decode(cs)
Replace with
html = response.text
Here is a brief explanation of the requests functions and attributes used:
requests.get() requests an HTTP page with the GET method.
The .text attribute contains the HTTP response body as a string; requests automatically detects the character encoding and decodes the content before returning it.
More detailed functions and attributes are shown in the figure.
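A small sketch of a few of these attributes in action, again using the example site from this article:

import requests

response = requests.get('http://example.python-scraping.com/')
print(response.status_code)   # HTTP status code, e.g. 200
print(response.encoding)      # encoding guessed from the response headers
html = response.text          # response body decoded to a string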
The rewritten code is:

import requests

def GetData(url, user_agent='wswp', retry=2, proxy=None):
    print('download : ' + url)
    header = {'User-Agent': user_agent}   # a dict, not a set
    try:
        response = requests.get(url, headers=header, proxies=proxy)
        html = response.text
        if response.status_code >= 400:
            print('Download error', response.text)
            html = None
            if retry > 0 and 500 <= response.status_code < 600:
                # Retry on server errors (5xx)
                return GetData(url, user_agent=user_agent, retry=retry - 1, proxy=proxy)
    except requests.exceptions.RequestException as e:
        # requests raises its own exception types instead of urllib's
        # URLError / HTTPError / ContentTooShortError
        print('download error :', e)
        html = None
    return html
Compared with the original code, it is shorter and clearer in structure.
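For reference, a hedged example of calling this requests-based version (the proxies argument of requests.get() expects a dict, and the proxy address from earlier may no longer work):

html = GetData('http://example.python-scraping.com/',
               proxy={'http': '60.7.208.137:9999'})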