Scheduler is the URL scheduler in WebMagic. It collects the URLs (the target Requests of a Page) discovered during Spider processing, and polls them back to the Spider to be crawled. Scheduler is also responsible for error retry, URL de-duplication, and keeping statistics on the total and remaining page counts.
Main interface:
Scheduler: the basic interface, defining the fundamental push and poll methods.
MonitorableScheduler: an interface extending Scheduler that adds methods to obtain the number of remaining URL requests and the total number of requests, which makes monitoring easy.
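The two interfaces can be sketched as follows. The method names match WebMagic's Scheduler and MonitorableScheduler contracts, but Request, Task, and the ToyScheduler at the end are minimal stand-ins written only so the sketch compiles and can be exercised; the real WebMagic classes carry much more state.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Minimal stand-ins for WebMagic's Request and Task so the sketch compiles.
class Request {
    final String url;
    Request(String url) { this.url = url; }
}

interface Task {
    String getUUID();
}

// The basic interface: hand urls in, take urls out.
interface Scheduler {
    void push(Request request, Task task); // give a discovered url to the scheduler
    Request poll(Task task);               // take the next url to crawl, or null
}

// The monitoring extension: expose remaining and total counts.
interface MonitorableScheduler extends Scheduler {
    int getLeftRequestsCount(Task task);   // urls still waiting in the queue
    int getTotalRequestsCount(Task task);  // urls pushed in total
}

// A toy implementation, only to show how the contract is used.
class ToyScheduler implements MonitorableScheduler {
    private final Queue<Request> queue = new ArrayDeque<>();
    private int total = 0;

    public void push(Request request, Task task) { queue.add(request); total++; }
    public Request poll(Task task) { return queue.poll(); }
    public int getLeftRequestsCount(Task task) { return queue.size(); }
    public int getTotalRequestsCount(Task task) { return total; }
}
```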
Main implementation classes in the core package:
DuplicateRemovedScheduler, an abstract class that implements a generic push template method, handling error retries and duplicate checking inside push. The de-duplication strategy delegates to the HashSetDuplicateRemover class, which will be explained below.
PriorityScheduler, a scheduler with two built-in priority queues (plus and minus) and a no-priority blocking queue.
QueueScheduler, a scheduler with a built-in blocking queue. This is the default.
URL de-duplication strategy:
DuplicateRemover: the de-duplication interface, with methods to check whether a request is a duplicate, reset the duplicate check, and get the total number of requests.
HashSetDuplicateRemover: the implementation class of DuplicateRemover, which internally maintains a thread-safe HashSet.
Let us start with the concrete implementation of the strategy. The core code is as follows:
public class HashSetDuplicateRemover implements DuplicateRemover {

    private Set<String> urls = Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

    @Override
    public boolean isDuplicate(Request request, Task task) {
        return !urls.add(getUrl(request));
    }

    ...

    @Override
    public void resetDuplicateCheck(Task task) {
        urls.clear();
    }

    @Override
    public int getTotalRequestsCount(Task task) {
        return urls.size();
    }
}
The de-duplication policy class is very simple: it maintains a thread-safe Set and uses the return value of add to decide whether a URL is a duplicate (add returns false when the element is already present). resetDuplicateCheck empties the set, and getTotalRequestsCount returns its size. Simple and clear. But if you think that is all there is to it, you are wrong. Keep reading.
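To see the add-based duplicate check in isolation, here is a stand-alone sketch of the same idea using plain URL strings instead of WebMagic's Request and Task types (SimpleDuplicateRemover is a hypothetical name, not a WebMagic class):

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Stand-alone sketch of the HashSetDuplicateRemover idea, keyed on
// plain URL strings for simplicity.
class SimpleDuplicateRemover {

    // A thread-safe Set backed by a ConcurrentHashMap.
    private final Set<String> urls =
            Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

    // Set.add returns false when the element was already present,
    // so a failed add means the url is a duplicate.
    boolean isDuplicate(String url) {
        return !urls.add(url);
    }

    void resetDuplicateCheck() {
        urls.clear();
    }

    int getTotalRequestsCount() {
        return urls.size();
    }
}
```

Calling isDuplicate twice with the same URL returns false the first time and true the second, while the count only grows by one.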
public abstract class DuplicateRemovedScheduler implements Scheduler {

    private DuplicateRemover duplicatedRemover = new HashSetDuplicateRemover();

    @Override
    public void push(Request request, Task task) {
        if (shouldReserved(request) || noNeedToRemoveDuplicate(request)
                || !duplicatedRemover.isDuplicate(request, task)) {
            pushWhenNoDuplicate(request, task);
        }
    }

    protected boolean shouldReserved(Request request) {
        return request.getExtra(Request.CYCLE_TRIED_TIMES) != null;
    }

    protected boolean noNeedToRemoveDuplicate(Request request) {
        return HttpConstant.Method.POST.equalsIgnoreCase(request.getMethod());
    }

    protected void pushWhenNoDuplicate(Request request, Task task) {
    }
}
DuplicateRemovedScheduler is an abstract class that provides a generic template for push, leaving pushWhenNoDuplicate for subclasses to implement with their own policies. The push method handles de-duplication and the retry mechanism uniformly.
It first checks whether the request is an error retry; if so, it is pushed to the queue directly. Otherwise it checks whether the request uses the POST method, and if so, it is also pushed directly. (Note that the URL of a POST request is never added to the urls set maintained by HashSetDuplicateRemover, so it is never counted by getTotalRequestsCount; the statistics we get in the end cover GET requests only.) Finally, if neither condition applies, the request is pushed only when the duplicate check says it has not been seen before.
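The decision chain above can be condensed into a self-contained sketch. SketchScheduler and its fields are illustrative names, and Request here is a minimal stand-in, but the three-way short-circuit mirrors the push logic quoted above, including the side effect that only GET URLs ever enter the de-duplication set:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Minimal stand-in for WebMagic's Request: a url, a method, and an extras map.
class Request {
    final String url;
    final String method;
    final Map<String, Object> extras = new HashMap<>();
    Request(String url, String method) { this.url = url; this.method = method; }
}

// Sketch of DuplicateRemovedScheduler's push decision.
class SketchScheduler {
    static final String CYCLE_TRIED_TIMES = "_cycle_tried_times";

    private final Set<String> seen =
            Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());
    final List<Request> queue = new ArrayList<>();

    void push(Request request) {
        boolean isRetry = request.extras.get(CYCLE_TRIED_TIMES) != null;
        boolean isPost = "POST".equalsIgnoreCase(request.method);
        // Retries and POSTs skip the duplicate check entirely; only GETs
        // reach seen.add, so only GET urls are counted and de-duplicated.
        if (isRetry || isPost || seen.add(request.url)) {
            queue.add(request);
        }
    }

    int countedUrls() { return seen.size(); }
}
```

Pushing the same URL twice as GET enqueues it once, while a POST to that URL is always enqueued and never counted.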
Each scheduler implements pushWhenNoDuplicate in its own way.
PriorityScheduler has two built-in priority queues (plus and minus) and a no-priority blocking queue. Its pushWhenNoDuplicate code is as follows:
public void pushWhenNoDuplicate(Request request, Task task) {
    if (request.getPriority() == 0) {
        noPriorityQueue.add(request);
    } else if (request.getPriority() > 0) {
        priorityQueuePlus.put(request);
    } else {
        priorityQueueMinus.put(request);
    }
}
Which queue the request joins depends on whether its priority attribute is set and whether it is positive or negative; this matters because it affects the order of subsequent polls.
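A plausible wiring of the three queues is sketched below (the exact comparators in WebMagic's PriorityScheduler may differ; Req, PrioritySketch, and the field names are illustrative). The plus and minus queues are PriorityBlockingQueues ordered by priority descending, and the sign of the priority routes each request:

```java
import java.util.Comparator;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.PriorityBlockingQueue;

// Minimal request stand-in carrying only a url and a priority.
class Req {
    final String url;
    final long priority;
    Req(String url, long priority) { this.url = url; this.priority = priority; }
}

// Sketch of PriorityScheduler's three-queue layout.
class PrioritySketch {
    private final LinkedBlockingQueue<Req> noPriorityQueue = new LinkedBlockingQueue<>();
    // Higher priority first within each priority queue.
    private final PriorityBlockingQueue<Req> priorityQueuePlus =
            new PriorityBlockingQueue<>(11, Comparator.comparingLong((Req r) -> r.priority).reversed());
    private final PriorityBlockingQueue<Req> priorityQueueMinus =
            new PriorityBlockingQueue<>(11, Comparator.comparingLong((Req r) -> r.priority).reversed());

    // Route by the sign of the priority, as in pushWhenNoDuplicate above.
    void push(Req r) {
        if (r.priority == 0) {
            noPriorityQueue.add(r);
        } else if (r.priority > 0) {
            priorityQueuePlus.put(r);
        } else {
            priorityQueueMinus.put(r);
        }
    }

    // Poll order mirrors the article: plus > no-priority > minus.
    Req poll() {
        Req r = priorityQueuePlus.poll();
        if (r != null) return r;
        r = noPriorityQueue.poll();
        if (r != null) return r;
        return priorityQueueMinus.poll();
    }
}
```

With one request in each queue, poll returns the positive-priority one first, then the no-priority one, and the negative-priority one last.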
QueueScheduler has a single built-in blocking queue. Its pushWhenNoDuplicate code is as follows:
public void pushWhenNoDuplicate(Request request, Task task) {
    queue.add(request);
}
It simply adds the request to the queue.
That covers the mechanics of URL de-duplication and push. Next, the poll logic.
In PriorityScheduler, the poll order is: plus queue > no-priority queue > minus queue.
public synchronized Request poll(Task task) {
    Request poll = priorityQueuePlus.poll();
    if (poll != null) {
        return poll;
    }
    poll = noPriorityQueue.poll();
    if (poll != null) {
        return poll;
    }
    return priorityQueueMinus.poll();
}
In QueueScheduler, poll is blunt and simple:
public Request poll(Task task) {
    return queue.poll();
}
As for the total number of URL requests, it simply returns the size of the urls set maintained by HashSetDuplicateRemover. To repeat the earlier caveat: the statistics we get in the end cover GET requests only.
public int getTotalRequestsCount(Task task) {
    return getDuplicateRemover().getTotalRequestsCount(task);
}
Of course, there are further Scheduler implementations in the extension module, such as RedisScheduler for cluster support and FileCacheQueueScheduler for resumable crawling. Since this series analyzes the core package first and the extension package afterwards, that part will be covered in a later article.
That is all for the Scheduler; the next topic is yet to be decided.