Getting started with the WebMagic crawler

This example crawls a movie website: it captures the list of the latest movie titles and the download address from a movie's detail page.

WebMagic is an open-source Java vertical crawler framework. Its goal is to simplify crawler development so that developers can focus on the extraction logic.

WebMagic features:

  • Fully modular design with strong extensibility.
  • A small core that still covers the whole crawling workflow. It is flexible and powerful, and it is good material for learning how crawlers work.
  • A rich API for extracting page content.
  • No configuration files: a crawler can be implemented with a POJO plus annotations.
  • Multithreading support.
  • Distributed crawling support.
  • Support for crawling pages rendered dynamically by JavaScript.
  • No framework dependencies, so it can be embedded flexibly in a project.

Example

This example crawls https://www.dytt8.net/html/gn... to capture the latest movie titles and a movie's detail page.

Configure Maven dependencies

Configure pom.xml as follows. Because WebMagic's logging binding conflicts with Spring Boot's logging, exclude WebMagic's slf4j-log4j12 dependency.

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.1.9.RELEASE</version>
        <relativePath/> <!-- lookup parent from repository -->
    </parent>
    <groupId>com.easy</groupId>
    <artifactId>webmagic</artifactId>
    <version>0.0.1</version>
    <name>webmagic</name>
    <description>Demo project for Spring Boot</description>

    <properties>
        <java.version>1.8</java.version>
        <encoding>UTF-8</encoding>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
    </properties>

    <dependencies>
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-core</artifactId>
            <version>0.7.3</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-extension</artifactId>
            <version>0.7.3</version>
        </dependency>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
        
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <scope>compile</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>

</project>

Create the list and detail page parsing classes

A PageProcessor is responsible for parsing pages, extracting useful information, and discovering new links. WebMagic uses Jsoup as its HTML parser and builds Xsoup, an XPath parsing tool, on top of it.

ListPageProcesser.java gets the list of movie names

package com.easy.webmagic.controller;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

public class ListPageProcesser implements PageProcessor {
    private Site site = Site.me().setDomain("127.0.0.1");

    @Override
    public void process(Page page) {
        // Collect every <a class="ulink"> element on the list page (one per movie)
        page.putField("title", page.getHtml().xpath("//a[@class='ulink']").all().toString());
    }

    @Override
    public Site getSite() {
        return site;
    }
}

DetailPageProcesser.java gets the movie download address from the detail page

package com.easy.webmagic.controller;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

public class DetailPageProcesser implements PageProcessor {
    private Site site = Site.me().setDomain("127.0.0.1");

    @Override
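    public void process(Page page) {
        // Take the first link in the table inside the element with id "Zoom"; it holds the download address
        page.putField("download", page.getHtml().xpath("//*[@id=\"Zoom\"]/span/table/tbody/tr/td/a").toString());
    }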

    @Override
    public Site getSite() {
        return site;
    }
}

Use a Pipeline to handle the crawl results

A Pipeline processes the extraction results: computation, persistence to files or databases, and so on. By default, WebMagic provides two ready-made options: output to the console and saving to a file.

A Pipeline defines how results are saved. To save to a particular database, you write a corresponding Pipeline. Generally, one Pipeline per kind of requirement is enough.

No processing is done here; the crawl results are simply logged to the console.

MyPipeline.java

package com.easy.webmagic.controller;

import lombok.extern.slf4j.Slf4j;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

import java.util.Map;

@Slf4j
public class MyPipeline implements Pipeline {
    @Override
    public void process(ResultItems resultItems, Task task) {
        log.info("get page: " + resultItems.getRequest().getUrl());
        for (Map.Entry<String, Object> entry : resultItems.getAll().entrySet()) {
            log.info(entry.getKey() + ":\t" + entry.getValue());
        }
    }
}
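
If the results should end up in a file rather than in the log, the body of process could do something along the following lines. This is only a sketch using the JDK, not WebMagic's own file pipeline: the class name, the method names and the sample value are hypothetical, and a plain Map stands in for the one that resultItems.getAll() returns in MyPipeline.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: the persistence step a file-saving Pipeline might delegate to.
class FileSinkSketch {

    /** Formats each field as "key:<TAB>value", the same layout MyPipeline logs. */
    static List<String> format(Map<String, Object> fields) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Object> entry : fields.entrySet()) {
            lines.add(entry.getKey() + ":\t" + entry.getValue());
        }
        return lines;
    }

    /** Appends the formatted fields to the target file, creating it if needed. */
    static void append(Path target, Map<String, Object> fields) {
        try {
            Files.write(target, format(fields), StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempFile("crawl-results", ".txt");
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("download", "ftp://example.invalid/movie.mkv"); // made-up sample value
        append(target, fields);
        System.out.println(Files.readAllLines(target, StandardCharsets.UTF_8));
    }
}
```

Keeping the formatting separate from the file write makes the formatting easy to check on its own, and the same format method could feed a database writer instead.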

Crawler entry point

Main.java

package com.easy.webmagic.controller;

import us.codecraft.webmagic.Spider;

public class Main {
    public static void main(String[] args) {
        //Get movie titles and page links
        Spider.create(new ListPageProcesser()).addUrl("https://www.dytt8.net/html/gndy/dyzz/list_23_1.html")
                .addPipeline(new MyPipeline()).thread(1).run();

        //Get the movie download address of the specified details page
        Spider.create(new DetailPageProcesser()).addUrl("https://www.dytt8.net/html/gndy/dyzz/20191204/59453.html")
                .addPipeline(new MyPipeline()).thread(1).run();
    }
}

Running the example

Run Main.java and watch the console output.

Movie titles on the first list page

14:06:28.704 [pool-1-thread-1] INFO com.easy.webmagic.controller.MyPipeline - get page: https://www.dytt8.net/html/gndy/dyzz/list_23_1.html
14:06:28.704 [pool-1-thread-1] INFO com.easy.webmagic.controller.MyPipeline - title:    [<a href="/html/gndy/dyzz/20191204/59453.html" class="ulink">2019 Chinese Captain HD Mandarin Chinese English double characters</a>, <a href="/html/gndy/dyzz/20191201/59437.html" class="ulink">2019 Animated comedy "snowman's destiny" BD Chinese and English subtitles</a>, <a href="/html/gndy/dyzz/20191201/59435.html" class="ulink">2019 Bernadette, where are you BD Chinese and English subtitles</a>, <a href="/html/gndy/dyzz/20191129/59431.html" class="ulink">2019 The Irish/Irish killer BD Chinese English Subtitle</a>, <a href="/html/gndy/dyzz/20191129/59429.html" class="ulink">2019 Downton Abbey movie BD Chinese and English double characters[Modified subtitles]</a>, <a href="/html/gndy/dyzz/20191129/59428.html" class="ulink">2018 "Snow storm" in BD Chinese characters in Mandarin</a>, <a href="/html/gndy/dyzz/20191128/59427.html" class="ulink">2019 "Official secrets" BD Chinese and English subtitles</a>, <a href="/html/gndy/dyzz/20191127/59425.html" class="ulink">2019 Young you HD Chinese characters in Mandarin</a>, <a href="/html/gndy/dyzz/20191126/59424.html" class="ulink">2019 The climber HD Mandarin Chinese English double characters</a>, <a href="/html/gndy/dyzz/20191126/59423.html" class="ulink">2019 The story of "Goldfinch" BD Chinese English Subtitle</a>, <a href="/html/gndy/dyzz/20191125/59422.html" class="ulink">2019 The Hollywood past BD Chinese and English subtitles</a>, <a href="/html/gndy/dyzz/20191125/59421.html" class="ulink">2018 Cat and peach blossom BD Guoyue bilingual Chinese characters</a>, <a href="/html/gndy/dyzz/20191124/59418.html" class="ulink">2019 Are you ready/Marriage killing game BD Chinese English Subtitle</a>, <a href="/html/gndy/dyzz/20191124/59417.html" class="ulink">2019 Suspense of the plot of "two souls" BD Guoyue bilingual Chinese characters</a>, <a href="/html/gndy/dyzz/20191122/59409.html" class="ulink">2019 Science fiction action "twin killers" HD Chinese English Subtitle</a>, <a href="/html/gndy/dyzz/20191122/59408.html" class="ulink">2019 Mount paradise/Paradise mountain BD Chinese English Subtitle</a>, <a href="/html/gndy/dyzz/20191121/59407.html" class="ulink">2019 Horror of "the return of the clown 2" BD Chinese English Subtitle</a>, <a href="/html/gndy/dyzz/20191117/59403.html" class="ulink">2019 Klaus: the secret of Christmas BD Chinese, English and Western trilingual double characters</a>, <a href="/html/gndy/dyzz/20191116/59400.html" class="ulink">2019 The fall of angels BD Chinese English Subtitle</a>, <a href="/html/gndy/dyzz/20191115/59399.html" class="ulink">2019 The crime scene HD Guoyue bilingual Chinese characters</a>, <a href="/html/gndy/dyzz/20191115/59398.html" class="ulink">2019 Don't tell her BD Chinese and English subtitles</a>, <a href="/html/gndy/dyzz/20191114/59393.html" class="ulink">2019 Original fear BD Chinese and English subtitles</a>, <a href="/html/gndy/dyzz/20191114/59392.html" class="ulink">2019 After the wedding BD Chinese and English subtitles</a>, <a href="/html/gndy/dyzz/20191113/59387.html" class="ulink">2019 Crisis: Battle of Longtan BD Chinese English Subtitle</a>, <a href="/html/gndy/dyzz/20191113/59386.html" class="ulink">2019 Silent witness BD Guoyue bilingual Chinese characters</a>]

Movie download address from the detail page

14:06:34.365 [pool-2-thread-1] INFO com.easy.webmagic.controller.MyPipeline - get page: https://www.dytt8.net/html/gndy/dyzz/20191204/59453.html
14:06:34.365 [pool-2-thread-1] INFO com.easy.webmagic.controller.MyPipeline - download:    <a href="ftp://Ygdy8: ygdy8@yg45.dydydytt.net: 4233 / sunshine movie www.ygdy8.com. Chinese captain. HD.1080p. Chinese and English double characters. MKV "> ftp://ygdy8: ygdy8@yg45.dydydytt.net: 4233 / sunshine movie www.ygdy8.com. Chinese captain. HD.1080p. Chinese and English double characters. Mkv</a>

This output shows that the data has been crawled successfully; from here you can process it however you need.
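
One natural next step is to turn the relative hrefs from the list output into absolute detail-page URLs, like the one passed to addUrl for the detail spider in Main.java. Here is a minimal sketch using only the JDK. The class and method names are hypothetical, and the regular expression assumes anchors shaped exactly like the ones in the log above.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of WebMagic.
class DetailLinkResolver {

    // Assumes the <a href="..." class="ulink"> shape seen in the list output.
    private static final Pattern ULINK_HREF =
            Pattern.compile("<a href=\"([^\"]+)\" class=\"ulink\">");

    /** Extracts every ulink href from the markup and resolves it against the list page URL. */
    static List<String> resolve(String listPageUrl, String markup) {
        URI base = URI.create(listPageUrl);
        List<String> urls = new ArrayList<>();
        Matcher matcher = ULINK_HREF.matcher(markup);
        while (matcher.find()) {
            urls.add(base.resolve(matcher.group(1)).toString());
        }
        return urls;
    }

    public static void main(String[] args) {
        String markup = "<a href=\"/html/gndy/dyzz/20191204/59453.html\" class=\"ulink\">2019 Chinese Captain</a>";
        for (String url : resolve("https://www.dytt8.net/html/gndy/dyzz/list_23_1.html", markup)) {
            System.out.println(url); // prints https://www.dytt8.net/html/gndy/dyzz/20191204/59453.html
        }
    }
}
```

Resolving against the list page URL, rather than concatenating strings, also handles hrefs that are already absolute.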

Advanced crawler topics

Extract elements with Selectable

The chained extraction API built around Selectable is a core feature of WebMagic. With the Selectable interface you can chain extraction calls on page elements directly, without worrying about the details of how the extraction is done.

Configuring, starting and stopping crawlers

Spider is the entry point for starting a crawler. Before starting, create a Spider object from a PageProcessor, then start it with run(). The Spider's other components (Downloader, Scheduler, Pipeline) can be configured through its setter methods; the example above attaches its Pipeline with addPipeline.

Jsoup and Xsoup

WebMagic's extraction relies mainly on Jsoup and on Xsoup, the XPath tool that WebMagic builds on top of Jsoup.
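
To experiment with the XPath expression from the list page processor without running a crawl, the sketch below uses the JDK's built-in XPath engine instead of Xsoup. It works only because the sample fragment is hand-written, well-formed XML; real pages rarely are, which is the gap Jsoup and Xsoup fill. The class name, method name and sample markup are assumptions for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper using the JDK XPath API (not Xsoup).
class XPathSketch {

    /** Returns the href of every <a class="ulink"> in a well-formed XML fragment. */
    static List<String> ulinkHrefs(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//a[@class='ulink']/@href",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            List<String> hrefs = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                hrefs.add(nodes.item(i).getNodeValue());
            }
            return hrefs;
        } catch (XPathExpressionException e) {
            // Raised for an invalid expression or for input that is not well-formed XML
            throw new IllegalArgumentException("could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<table><tr>"
                + "<td><a href=\"/html/gndy/dyzz/20191204/59453.html\" class=\"ulink\">2019 Chinese Captain</a></td>"
                + "<td><a href=\"/index.html\">Home</a></td>"
                + "</tr></table>";
        System.out.println(ulinkHrefs(xml)); // only the ulink anchor matches
    }
}
```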

Crawler monitoring

With this feature you can inspect a running crawler: how many pages have been downloaded, how many remain, how many threads have been started, and so on. Monitoring is implemented with JMX, so you can use a JMX tool such as JConsole to view crawler information locally or remotely.
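
As a rough picture of what "implemented with JMX" means, the sketch below registers a small statistics bean with the JVM's platform MBean server and reads one attribute back, using only the JDK. The bean, its attributes and the object name are all hypothetical; they are not WebMagic's actual monitoring beans.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicInteger;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical illustration of JMX-based monitoring; not WebMagic code.
class CrawlerJmxSketch {

    // Standard MBean convention: a public interface named <ImplClass>MBean.
    public interface CrawlStatsMBean {
        int getDownloadedPages();
        int getThreadCount();
    }

    public static class CrawlStats implements CrawlStatsMBean {
        private final AtomicInteger downloaded = new AtomicInteger();
        private final int threads;

        public CrawlStats(int threads) { this.threads = threads; }

        public void pageDownloaded() { downloaded.incrementAndGet(); }

        @Override public int getDownloadedPages() { return downloaded.get(); }
        @Override public int getThreadCount() { return threads; }
    }

    /** Registers a stats bean, simulates some downloads, and reads the counter back over JMX. */
    static int registerAndRead(String name, int pages) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        try {
            ObjectName objectName = new ObjectName(name);
            CrawlStats stats = new CrawlStats(1);
            server.registerMBean(stats, objectName);
            try {
                for (int i = 0; i < pages; i++) {
                    stats.pageDownloaded();
                }
                // The getter getDownloadedPages() is exposed as the attribute "DownloadedPages"
                return (Integer) server.getAttribute(objectName, "DownloadedPages");
            } finally {
                server.unregisterMBean(objectName);
            }
        } catch (JMException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(registerAndRead("sketch.crawler:type=CrawlStats,name=demo", 3));
    }
}
```

A tool such as JConsole reads attributes like these from a running JVM in the same way the getAttribute call does here.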

Configuring a proxy

ProxyProvider has a default implementation: SimpleProxyProvider. It is a ProxyProvider based on simple round robin with no failure checking. You can configure any number of candidate proxies, and one is selected in order for each use. It suits a relatively stable set of proxies that you host yourself.
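
The selection strategy described above, round robin with no failure checking, can be sketched in a few lines of plain Java. This illustrates the idea only and is not SimpleProxyProvider's source; the class name and the proxy addresses are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of round-robin selection; not WebMagic code.
class RoundRobinSketch {
    private final List<String> proxies;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinSketch(String... proxies) {
        if (proxies.length == 0) {
            throw new IllegalArgumentException("at least one proxy is required");
        }
        this.proxies = Arrays.asList(proxies.clone());
    }

    /** Returns the next proxy in order, wrapping around; a dead proxy is still returned. */
    String pick() {
        // floorMod keeps the index valid even after the counter overflows to a negative value
        int index = Math.floorMod(next.getAndIncrement(), proxies.size());
        return proxies.get(index);
    }

    public static void main(String[] args) {
        RoundRobinSketch provider = new RoundRobinSketch("proxy-a.example:8888", "proxy-b.example:8888");
        for (int i = 0; i < 3; i++) {
            System.out.println(provider.pick()); // a, b, then a again
        }
    }
}
```

Because nothing is checked, a broken proxy keeps receiving its share of requests, which is why this strategy fits a stable pool best.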

Handling non-GET HTTP requests

This is done by setting the method and the request body on the Request object. For example:

Request request = new Request("http://xxx/path");
request.setMethod(HttpConstant.Method.POST);
// JSON strings require double quotes, so they are escaped here
request.setRequestBody(HttpRequestBody.json("{\"id\":1}", "utf-8"));

Writing crawlers with annotations

WebMagic supports a distinctive annotation-based style for writing crawlers, which becomes available once you add the webmagic-extension package.

In annotation mode, a plain object plus annotations is enough to write a crawler with very little code. For simple crawlers, this style is easy to understand and maintain.

Resources

Spring Boot, Cloud learning project

Added by cyh123 on Wed, 11 Dec 2019 01:25:26 +0200