Simple crawler design -- managing the internal state of the crawler

Preface

For background on this article, please refer to the previous articles in this series:

Simple crawler design (I) -- basic model

Simple crawler design (II) -- crawling range

Simple crawler design (III) -- the range of web pages to be processed

Design description

Starting from this article, we discuss the concrete implementation of the crawler. First, we look at the data structures the crawler maintains.

The first article in this series included sample code for the crawler's control flow. The following snippet briefly reviews the processing logic; the TargetLinks and FetchedLinks it uses are the subject of this article.

Link target = null;
// Loop while there are still links waiting to be crawled
while (null != (target = targetLinks.next())) {
    try {
        fetchAndProcess(target);
        // Stop once the number of collected links reaches the maximum allowed by the crawling scope
        if (this.fetchedLinks.total() >= crawlingScope.maxToCrawl()) {
            break;
        }
        TimeUnit.MILLISECONDS.sleep(this.crawlDelay);
    } catch (Exception e) {
        e.printStackTrace();
    }
}

TargetLinks holds the target links waiting to be collected, and FetchedLinks holds the links that have already been collected. Whenever the crawler successfully crawls a link, it removes that link from TargetLinks and saves it to FetchedLinks. These two sets make up the crawler's internal state.
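The interplay between the two sets can be sketched as follows. This is a hypothetical fetchAndProcess; download() and extractLinks() are assumed helpers not shown in the article:

// A minimal sketch of fetchAndProcess. next() has already removed the target
// from TargetLinks, so on success we only record it in FetchedLinks and
// enqueue any newly discovered, not-yet-crawled links.
private void fetchAndProcess(Link target) throws Exception {
    String html = download(target);          // fetch the page (assumed helper)
    this.fetchedLinks.add(target);           // mark the link as crawled
    for (Link found : extractLinks(html)) {  // discover outgoing links (assumed helper)
        if (!this.fetchedLinks.contains(found)) {
            this.targetLinks.add(found);     // only enqueue links not crawled yet
        }
    }
}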

Code example

Interface design

interface TargetLinks {  // Collection of URLs waiting to be crawled

    // Add a batch of links
    void addAll(Collection<Link> links);
    // Add a single link
    void add(Link link);
    // Total number of links waiting to be crawled
    long total();
    // Return the next link to crawl (removing it from the set), or null if none remain
    Link next();
    // Clearing the set is equivalent to resetting the crawl progress
    void clear();
}

interface FetchedLinks {  // Collection of URLs already crawled
    // Whether a link has already been crawled
    boolean contains(Link link);
    // Add a single link
    void add(Link link);
    // Total number of crawled links
    long total();
    // Clearing the set is equivalent to resetting the crawl progress
    void clear();
}

Concrete implementation

There are several options for persisting these two sets.

(1) Keep them in memory. This is the simplest to implement, but the crawl progress is lost when the program restarts and collection must start from scratch, so it is suitable only for test environments (a minimal sketch follows this list).

(2) Keep them in a file. The implementation is more complex, since you must handle buffering and serialization yourself, but the advantage is that progress survives restarts.

(3) Keep them in a database, commonly MySQL or Redis. Because these two sets change rapidly, Redis is the better fit, though it may require installing a separate server.
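For option (1), a minimal in-memory sketch might look like this. State lives only in the JVM heap, and deduplication is left to the caller (via FetchedLinks.contains(), as in the control flow above):

import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;

// Minimal in-memory TargetLinks: simplest option, progress is lost on restart.
public class TargetLinksMemoryImpl implements TargetLinks {

    private final Deque<Link> queue = new ArrayDeque<>();

    @Override
    public void addAll(Collection<Link> links) {
        queue.addAll(links);
    }

    @Override
    public void add(Link link) {
        queue.add(link);
    }

    @Override
    public long total() {
        return queue.size();
    }

    @Override
    public Link next() {
        return queue.poll();  // returns null when empty, matching the control loop
    }

    @Override
    public void clear() {
        queue.clear();
    }
}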

The following code example is a TargetLinks implementation based on Redis. It is only suitable for an experimental environment; for production use it would need further hardening.

Note that the overall design of this simple crawler gives each crawler instance a single collection task, with its own TargetLinks and FetchedLinks objects. The Redis set names for TargetLinks and FetchedLinks include the collection task ID (taskId), isolating each task's state from the others. To execute multiple collection tasks at the same time, run each crawler instance in its own thread.
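A minimal sketch of that multithreaded setup, assuming a Crawler class with a run() entry point (not shown in this article) that is constructed with a taskId:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CrawlerLauncher {
    public static void main(String[] args) {
        // One crawler instance per task, each in its own thread; the taskId keeps
        // the Redis set names (and therefore the task states) isolated.
        ExecutorService pool = Executors.newFixedThreadPool(2);
        for (String taskId : new String[]{"task-1", "task-2"}) {  // example task IDs
            pool.submit(() -> new Crawler(taskId).run());  // Crawler is assumed, not shown here
        }
        pool.shutdown();
    }
}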

Careful readers will notice that a Redis Set is used as the persistence scheme, so the traversal order of the website cannot be controlled. If ordering matters for your use case, switch to a Redis List instead (a sketch of the affected methods follows the implementation below).

import java.util.Collection;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.data.redis.core.SetOperations;

import com.google.gson.Gson;
import com.google.gson.GsonBuilder;

public class TargetLinksRedisImpl implements TargetLinks {

    @Autowired
    private RedisTemplate<String, String> redisTemplate;

    private String setName;
    private String taskId;  // Each task corresponds to one Redis set; tasks are isolated from each other
    private Gson gson = new GsonBuilder().create();

    public void setTaskId(String taskId) {
        this.taskId = taskId;
        this.setName = taskId + "-" + "targets";
    }

    @Override
    public void addAll(Collection<Link> links) {
        SetOperations<String, String> setOps = redisTemplate.opsForSet();
        for (Link each : links) {
            setOps.add(setName, gson.toJson(each));  // serialize each link as JSON
        }
    }

    @Override
    public void add(Link link) {
        SetOperations<String, String> setOps = redisTemplate.opsForSet();
        setOps.add(setName, gson.toJson(link));
    }

    @Override
    public Link next() {
        SetOperations<String, String> setOps = redisTemplate.opsForSet();
        String linkJson = setOps.pop(setName);  // SPOP: removes and returns a random member
        return linkJson == null ? null : gson.fromJson(linkJson, Link.class);
    }

    @Override
    public long total() {
        SetOperations<String, String> setOps = redisTemplate.opsForSet();
        Long size = setOps.size(setName);
        return size == null ? 0L : size;
    }

    @Override
    public void clear() {
        redisTemplate.delete(setName);  // drop the whole set to reset crawl progress
    }
}
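As mentioned above, a Redis List could replace the Set when crawl order matters. A sketch of the two methods that would change, using Spring Data Redis ListOperations; note that a List, unlike a Set, does not deduplicate, so duplicates must be filtered via FetchedLinks.contains() before adding:

// Ordered variant: RPUSH to enqueue, LPOP to dequeue (FIFO)
@Override
public void add(Link link) {
    redisTemplate.opsForList().rightPush(setName, gson.toJson(link));
}

@Override
public Link next() {
    String linkJson = redisTemplate.opsForList().leftPop(setName);
    return linkJson == null ? null : gson.fromJson(linkJson, Link.class);
}

The article does not show the Redis-backed FetchedLinks; a sketch following the same conventions (taskId-scoped set name, JSON-serialized links) might look like this, where contains() maps to the Redis SISMEMBER command:

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.data.redis.core.SetOperations;

import com.google.gson.Gson;
import com.google.gson.GsonBuilder;

// Sketch of a Redis-backed FetchedLinks, mirroring TargetLinksRedisImpl.
public class FetchedLinksRedisImpl implements FetchedLinks {

    @Autowired
    private RedisTemplate<String, String> redisTemplate;

    private String setName;
    private final Gson gson = new GsonBuilder().create();

    public void setTaskId(String taskId) {
        this.setName = taskId + "-" + "fetched";
    }

    @Override
    public boolean contains(Link link) {
        SetOperations<String, String> setOps = redisTemplate.opsForSet();
        return Boolean.TRUE.equals(setOps.isMember(setName, gson.toJson(link)));
    }

    @Override
    public void add(Link link) {
        redisTemplate.opsForSet().add(setName, gson.toJson(link));
    }

    @Override
    public long total() {
        Long size = redisTemplate.opsForSet().size(setName);
        return size == null ? 0L : size;
    }

    @Override
    public void clear() {
        redisTemplate.delete(setName);
    }
}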

Summary

This article discussed the management of the crawler's internal state, namely TargetLinks and FetchedLinks, and presented the interface design of both sets together with a Redis-based implementation.

With these two basic concepts in place, the crawling logic becomes clearer, and it is easy to observe how the crawler's state changes as pages are crawled and to display the current crawl progress at any time.
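For example, a progress report can be derived from the two totals at any point. A hypothetical helper, assuming the two sets are fields of the crawler:

// Hypothetical helper: report crawl progress from the two state sets.
private void logProgress() {
    long fetched = fetchedLinks.total();    // links already crawled
    long remaining = targetLinks.total();   // links still waiting
    System.out.printf("progress: %d fetched, %d remaining%n", fetched, remaining);
}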

Keywords: Java, Design Pattern, crawler, Data Mining
