Request deduplication
This is a high-frequency interview question for web crawler positions:
Q: How does Scrapy deduplicate requests? What is the deduplication principle? How is a request's uniqueness calculated?
With this question in mind, let's get into today's topic.
DUPEFILTER_CLASS
In a Scrapy project's configuration, DUPEFILTER_CLASS is the framework setting that controls the request deduplication rule. The default class path is scrapy.dupefilters.RFPDupeFilter.
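For reference, this is how the setting would typically look in a project's settings.py; this is only a sketch, and the custom class path in the commented line is a hypothetical example, not something shipped with Scrapy:

    # settings.py (sketch; the default line can normally be omitted)

    # Default dupefilter shipped with Scrapy
    DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'

    # A hypothetical custom dupefilter could be plugged in the same way:
    # DUPEFILTER_CLASS = 'myproject.dupefilters.MyDupeFilter'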
Opening that file, we can see that RFPDupeFilter inherits from BaseDupeFilter, which does little more than define empty method stubs. The real deduplication logic therefore lives in the RFPDupeFilter class. Let's analyze it line by line.
RFPDupeFilter
class RFPDupeFilter(BaseDupeFilter):
    """Request Fingerprint duplicates filter"""

    def __init__(self, path=None, debug=False):
        self.file = None
        # Use Python's built-in set() to store request fingerprints.
        # A set is an unordered collection of unique elements.
        self.fingerprints = set()
        self.logdupes = True
        self.debug = debug
        self.logger = logging.getLogger(__name__)
        # Persist request fingerprints to a local file if a path is given
        if path:
            self.file = open(os.path.join(path, 'requests.seen'), 'a+')
            self.file.seek(0)
            self.fingerprints.update(x.rstrip() for x in self.file)

    @classmethod
    def from_settings(cls, settings):
        # Read the DUPEFILTER_DEBUG flag; job_dir(settings) supplies the
        # persistence directory (from the JOBDIR setting), if any
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(job_dir(settings), debug)

    def request_seen(self, request):
        # !!! The core: detect whether this request's fingerprint has been seen.
        # Use request_fingerprint to compute the request's fingerprint
        fp = self.request_fingerprint(request)
        # If the fingerprint is already in the set, the request is a duplicate
        if fp in self.fingerprints:
            return True
        # Otherwise, add it to the set (and persist it if a file is open)
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + '\n')

    def request_fingerprint(self, request):
        # Delegate to Scrapy's request_fingerprint to compute the fingerprint
        return request_fingerprint(request)

    def close(self, reason):
        # Release resources
        if self.file:
            self.file.close()

    def log(self, request, spider):
        # Log and record filtered duplicates
        if self.debug:
            msg = "Filtered duplicate request: %(request)s (referer: %(referer)s)"
            args = {'request': request, 'referer': referer_str(request)}
            self.logger.debug(msg, args, extra={'spider': spider})
        elif self.logdupes:
            msg = ("Filtered duplicate request: %(request)s"
                   " - no more duplicates will be shown"
                   " (see DUPEFILTER_DEBUG to show all duplicates)")
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
            self.logdupes = False

        spider.crawler.stats.inc_value('dupefilter/filtered', spider=spider)
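To see the set-based logic in action, here is a minimal sketch that drives the filter directly, without a running crawler (assuming a Scrapy version where RFPDupeFilter can be constructed with no arguments; the URL is illustrative):

    from scrapy import Request
    from scrapy.dupefilters import RFPDupeFilter

    df = RFPDupeFilter()  # no path given, so fingerprints live only in the in-memory set
    req = Request('http://www.example.com/query?id=111&cat=222')

    print(bool(df.request_seen(req)))  # False: first time seen, fingerprint is added
    print(bool(df.request_seen(req)))  # True: the fingerprint is already in the set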
The code above is simple enough that anyone could write something similar. The request_seen method detects whether a request has been seen before: if its fingerprint is already in the set it returns True, otherwise the fingerprint is added to the set. The core is the call to request_fingerprint to compute the fingerprint, so let's look inside it.
request_fingerprint
The request fingerprint is a hash that uniquely identifies the resource the request points to.
def request_fingerprint(request, include_headers=None, keep_fragments=False):
    # Normalize the headers to include in the fingerprint, if any
    if include_headers:
        include_headers = tuple(to_bytes(h.lower()) for h in sorted(include_headers))
    cache = _fingerprint_cache.setdefault(request, {})
    cache_key = (include_headers, keep_fragments)
    if cache_key not in cache:
        # The fingerprint is a SHA-1 hash
        fp = hashlib.sha1()
        # Feed the request method, canonicalized URL and request body into the hash.
        # Canonicalization makes URLs that point to the same resource equal, e.g.:
        #   http://www.example.com/query?id=111&cat=222
        #   http://www.example.com/query?cat=222&id=111
        # These two URLs point to the same target, so they count as duplicate requests.
        fp.update(to_bytes(request.method))
        fp.update(to_bytes(canonicalize_url(request.url, keep_fragments=keep_fragments)))
        fp.update(request.body or b'')
        # Optionally include selected headers in the hash
        if include_headers:
            for hdr in include_headers:
                if hdr in request.headers:
                    fp.update(hdr)
                    for v in request.headers.getlist(hdr):
                        fp.update(v)
        cache[cache_key] = fp.hexdigest()
    return cache[cache_key]
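The URL canonicalization mentioned in the comments is easy to verify. The following sketch assumes a Scrapy version where the request_fingerprint helper from scrapy.utils.request is still importable (newer releases replace it with a fingerprinter component); the URLs are illustrative:

    from scrapy import Request
    from scrapy.utils.request import request_fingerprint

    r1 = Request('http://www.example.com/query?id=111&cat=222')
    r2 = Request('http://www.example.com/query?cat=222&id=111')

    print(request_fingerprint(r1))  # a 40-character SHA-1 hex digest
    print(request_fingerprint(r1) == request_fingerprint(r2))  # True: same resource, same fingerprint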
Execution flow of scheduler
In Scrapy's Scheduler code, the class method from_crawler reads the DUPEFILTER_CLASS path from the settings, loads it with load_object, instantiates it with create_instance, and assigns the resulting object to the attribute self.df.
class Scheduler:

    def __init__(self, dupefilter, jobdir=None, dqclass=None, mqclass=None,
                 logunser=False, stats=None, pqclass=None, crawler=None):
        self.df = dupefilter
        ......

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        dupefilter_cls = load_object(settings['DUPEFILTER_CLASS'])
        dupefilter = create_instance(dupefilter_cls, settings, crawler)
        ......
        return cls(dupefilter, jobdir=job_dir(settings), logunser=logunser,
                   stats=crawler.stats, pqclass=pqclass, dqclass=dqclass,
                   mqclass=mqclass, crawler=crawler)

    def open(self, spider):
        ......
        return self.df.open()

    def close(self, reason):
        ......
        return self.df.close(reason)

    def enqueue_request(self, request):
        if not request.dont_filter and self.df.request_seen(request):
            self.df.log(request, self.spider)
            return False
        ......
        return True
When the scheduler is opened, closed, or asked to enqueue a request in enqueue_request, it triggers the dupefilter's open, close, and request_seen (fingerprint calculation) respectively.
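Both configuration knobs seen above come from the settings: job_dir(settings) reads JOBDIR, which enables the requests.seen persistence file, and DUPEFILTER_DEBUG turns on verbose duplicate logging. A minimal settings sketch (the directory name is illustrative):

    # settings.py (sketch; 'crawls/myspider-1' is an illustrative directory name)

    # Persist scheduler and dupefilter state, including requests.seen, so a crawl
    # can be paused and resumed without re-scheduling already-seen requests
    JOBDIR = 'crawls/myspider-1'

    # Log every filtered duplicate request instead of only the first one
    DUPEFILTER_DEBUG = True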
When a Request is constructed with dont_filter=False (the default), it goes through the deduplication check.
Beginners often get this backwards and assume dont_filter=True means the request will be deduplicated. The naming is simply inverted from the more direct phrasing: if the parameter were called filter, then filter=True would mean "filter it" and filter=False would mean "do not filter". With the dont prefix, dont_filter=True reads literally as "do not filter this request", i.e. skip deduplication.
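A short sketch of the two cases (the URL is illustrative):

    from scrapy import Request

    # Default: dont_filter=False, so the scheduler calls request_seen() and may drop it
    normal_request = Request('http://www.example.com/page')

    # dont_filter=True: "do not filter this request" - it bypasses deduplication
    # and is scheduled even if an identical request was seen before
    unfiltered_request = Request('http://www.example.com/page', dont_filter=True)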
Summary
Now let's answer the interviewer's questions:
Q: How does Scrapy deduplicate requests? What is the deduplication principle? How is a request's uniqueness calculated?
A: Scrapy selects its deduplication implementation through the DUPEFILTER_CLASS setting in the configuration, which defaults to scrapy.dupefilters.RFPDupeFilter.
Requests are deduplicated locally using Python's built-in set, whose elements are guaranteed to be unique.
The hashing algorithm is SHA-1. By default, the uniqueness calculation covers the request method, the canonicalized URL, and the request body.
Two core points: a set for fingerprint deduplication, and SHA-1 hashing to compute the fingerprint.