Scrapy single-machine request deduplication

Request deduplication

This is a high-frequency interview question for web crawler positions:

Q: How does Scrapy deduplicate requests? What is the deduplication principle? How is a request's uniqueness calculated?

With this question in mind, let's get into today's topic.

DUPEFILTER_CLASS

In a Scrapy project's configuration, DUPEFILTER_CLASS is the setting that selects the framework's request deduplication rule. Its default class path is scrapy.dupefilters.RFPDupeFilter.

Open that file and you will see that RFPDupeFilter inherits from BaseDupeFilter, which does essentially nothing beyond defining a few method stubs. The real deduplication logic therefore lives in the RFPDupeFilter class. Let's analyze it line by line.
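
For orientation, these are the related entries in a project's settings.py. This is only a minimal sketch: the DUPEFILTER_CLASS value shown is the default, while the JOBDIR path is a hypothetical example.

# settings.py
# The default duplicate filter (this is what Scrapy uses if you change nothing):
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'

# Log every filtered duplicate instead of only the first one:
DUPEFILTER_DEBUG = True

# Setting JOBDIR makes the filter persist fingerprints to <JOBDIR>/requests.seen:
JOBDIR = 'crawls/my-job'  # hypothetical directory name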

RFPDupeFilter

class RFPDupeFilter(BaseDupeFilter):
    """Request Fingerprint duplicates filter"""

    def __init__(self, path=None, debug=False):
        self.file = None
        # Use Python's built-in set() to store request fingerprints
        # A set holds unordered, unique elements, which is exactly what deduplication needs
        self.fingerprints = set()
        self.logdupes = True
        self.debug = debug
        self.logger = logging.getLogger(__name__)
        # Persist request fingerprints to a local file when a job directory is given
        if path:
            self.file = open(os.path.join(path, 'requests.seen'), 'a+')
            self.file.seek(0)
            self.fingerprints.update(x.rstrip() for x in self.file)

    @classmethod
    def from_settings(cls, settings):
        # DUPEFILTER_DEBUG only controls how duplicates are logged;
        # fingerprints are persisted only when JOBDIR is set (job_dir returns its path)
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(job_dir(settings), debug)

    def request_seen(self, request):
        # Core method: detect whether this request's fingerprint has been seen
        # request_fingerprint computes the fingerprint for the request
        fp = self.request_fingerprint(request)
        # If the fingerprint is in the collection, return True
        if fp in self.fingerprints:
            return True
        # Not seen before: add the fingerprint to the set
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + '\n')

    def request_fingerprint(self, request):
        # Delegate to scrapy.utils.request.request_fingerprint to compute the fingerprint
        return request_fingerprint(request)

    def close(self, reason):
        # Clean up: close the persistence file if one was opened
        if self.file:
            self.file.close()

    def log(self, request, spider):
        # Log the filtered duplicate and record it in the crawler stats
        if self.debug:
            msg = "Filtered duplicate request: %(request)s (referer: %(referer)s)"
            args = {'request': request, 'referer': referer_str(request)}
            self.logger.debug(msg, args, extra={'spider': spider})
        elif self.logdupes:
            msg = ("Filtered duplicate request: %(request)s"
                   " - no more duplicates will be shown"
                   " (see DUPEFILTER_DEBUG to show all duplicates)")
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
            self.logdupes = False

        spider.crawler.stats.inc_value('dupefilter/filtered', spider=spider)

The code above is simple enough that anyone could write it themselves. The request_seen method checks whether a request has already been seen and returns True when it is a duplicate. The core is the call to request_fingerprint, which computes the fingerprint, so let's go in and have a look.
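
Before diving into the fingerprint calculation, here is a minimal standalone sketch (an illustration only, assuming Scrapy is installed) that exercises request_seen directly:

from scrapy.http import Request
from scrapy.dupefilters import RFPDupeFilter

df = RFPDupeFilter()
r1 = Request("http://www.example.com/query?id=111&cat=222")
r2 = Request("http://www.example.com/query?cat=222&id=111")  # same resource, reordered params

print(df.request_seen(r1))  # None (falsy): first time seen, fingerprint is stored
print(df.request_seen(r2))  # True: the canonicalized URL matches, so it is a duplicate
df.close("finished")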

request_fingerprint

The request fingerprint is a hash that uniquely identifies the resource the request points to.

def request_fingerprint(request, include_headers=None, keep_fragments=False):
    # Normalize header names: lowercase, convert to bytes, and sort them
    if include_headers:
        include_headers = tuple(to_bytes(h.lower()) for h in sorted(include_headers))
    cache = _fingerprint_cache.setdefault(request, {})
    cache_key = (include_headers, keep_fragments)
    if cache_key not in cache:
        # Start the calculation; the hash algorithm is SHA1
        fp = hashlib.sha1()
        # Feed the request method, canonicalized URL and request body into the hash.
        # Canonicalization means URLs that point to the same resource yield the
        # same string, so the following two are considered duplicate requests:
        #   http://www.example.com/query?id=111&cat=222
        #   http://www.example.com/query?cat=222&id=111
        fp.update(to_bytes(request.method))
        fp.update(to_bytes(canonicalize_url(request.url, keep_fragments=keep_fragments)))
        fp.update(request.body or b'')
        # Optionally include the selected headers in the hash
        if include_headers:
            for hdr in include_headers:
                if hdr in request.headers:
                    fp.update(hdr)
                    for v in request.headers.getlist(hdr):
                        fp.update(v)
        cache[cache_key] = fp.hexdigest()
    return cache[cache_key]
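
To make the behaviour concrete, here is a small sketch (an illustration, not part of the Scrapy source) comparing fingerprints with and without include_headers; the X-Token header name is a hypothetical example:

from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

plain = Request("http://www.example.com/api")
tokened = Request("http://www.example.com/api", headers={"X-Token": "abc"})

# Headers are ignored by default, so the two fingerprints match:
print(request_fingerprint(plain) == request_fingerprint(tokened))  # True

# Passing include_headers makes the header participate in the hash:
print(request_fingerprint(plain, include_headers=["X-Token"])
      == request_fingerprint(tokened, include_headers=["X-Token"]))  # False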

Execution flow of scheduler

In Scrapy's Scheduler code, the class method from_crawler reads the DUPEFILTER_CLASS path from the settings, loads the class with load_object, instantiates it with create_instance, and assigns the instance to the attribute self.df.

class Scheduler:
    
    def __init__(self, dupefilter, jobdir=None, dqclass=None, mqclass=None,
                 logunser=False, stats=None, pqclass=None, crawler=None):
        self.df = dupefilter
        ......

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        dupefilter_cls = load_object(settings['DUPEFILTER_CLASS'])
        dupefilter = create_instance(dupefilter_cls, settings, crawler)
        ......
        return cls(dupefilter, jobdir=job_dir(settings), logunser=logunser,
                   stats=crawler.stats, pqclass=pqclass, dqclass=dqclass,
                   mqclass=mqclass, crawler=crawler)

    def open(self, spider):
        ......
        return self.df.open()

    def close(self, reason):
        ......
        return self.df.close(reason)

    def enqueue_request(self, request):
        if not request.dont_filter and self.df.request_seen(request):
            self.df.log(request, self.spider)
            return False
        ......
        return True

When the scheduler is opened, closed, or asked to enqueue a request in enqueue_request, it triggers the filter's open() and close() and the fingerprint check request_seen() respectively.

When constructing a request, deduplication is applied only when the parameter dont_filter is False (the default).

Beginners often get this backwards and assume dont_filter=True enables deduplication. The naming is a negation: a plain filter=True/False flag would read more directly, but the added "dont" flips it, so dont_filter=True translates to "do not filter this request", i.e. skip deduplication entirely.
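
As a quick illustration (a sketch, not from the article's source):

from scrapy import Request

# dont_filter defaults to False: the request goes through the duplicate filter.
filtered = Request("http://www.example.com/page")

# dont_filter=True means "do not filter": the request bypasses deduplication,
# which is handy for pages you deliberately revisit, such as a login page.
unfiltered = Request("http://www.example.com/page", dont_filter=True)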

Summary

Now let's answer the interviewer's questions:

Q: How does Scrapy deduplicate requests? What is the deduplication principle? How is a request's uniqueness calculated?

A: Scrapy selects the deduplication implementation through the DUPEFILTER_CLASS setting; by default it is scrapy.dupefilters.RFPDupeFilter.
Scrapy deduplicates requests locally by relying on Python's built-in set, whose elements are unique.
The hash algorithm is SHA1, and by default the request method, URL, and body are included in the uniqueness calculation.

Two core points: a set of fingerprints for deduplication, and SHA1 hashing to compute each fingerprint.
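
If you want to see the idea end to end, the following rough re-implementation is an illustrative sketch only (the real Scrapy version also caches results and can include headers); it applies SHA1 over the method, the canonicalized URL, and the body:

import hashlib
from w3lib.url import canonicalize_url  # the same canonicalization Scrapy uses

def simple_fingerprint(method, url, body=b""):
    fp = hashlib.sha1()
    fp.update(method.encode())
    fp.update(canonicalize_url(url).encode())
    fp.update(body)
    return fp.hexdigest()

print(simple_fingerprint("GET", "http://www.example.com/query?id=111&cat=222")
      == simple_fingerprint("GET", "http://www.example.com/query?cat=222&id=111"))  # True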
