This example crawls a movie website to capture the list of the latest movie titles and the download address shown on a movie's detail page.
WebMagic is an open-source vertical crawler framework for Java. Its goal is to simplify crawler development so that developers can focus on their own extraction logic.
WebMagic features:
- Fully modular design, strong scalability.
- The core is simple yet covers the whole crawling process. It is flexible and powerful, and it is also good material for learning how crawlers work.
- Provides a rich API for extracting page content.
- Needs no configuration files; a crawler can be implemented with a POJO plus annotations.
- Supports multithreading.
- Supports distributed crawling.
- Supports crawling pages rendered dynamically by JavaScript.
- Depends on no other framework, so it can be embedded flexibly in a project.
Example
This example crawls https://www.dytt8.net/html/gn... to capture the latest movie titles from the list page and the download address from a detail page.
Configure Maven dependencies
pom.xml configuration. Because WebMagic's logging binding conflicts with Spring Boot's logging, exclude WebMagic's slf4j-log4j12 dependency.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.1.9.RELEASE</version>
        <relativePath/> <!-- lookup parent from repository -->
    </parent>

    <groupId>com.easy</groupId>
    <artifactId>webmagic</artifactId>
    <version>0.0.1</version>
    <name>webmagic</name>
    <description>Demo project for Spring Boot</description>

    <properties>
        <java.version>1.8</java.version>
        <encoding>UTF-8</encoding>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
    </properties>

    <dependencies>
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-core</artifactId>
            <version>0.7.3</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-extension</artifactId>
            <version>0.7.3</version>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <scope>compile</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
Create the list and detail page parsing classes
PageProcessor is responsible for parsing pages, extracting useful information, and discovering new links. WebMagic uses Jsoup as its HTML parser and builds Xsoup, an XPath parsing tool, on top of it.
ListPageProcesser.java gets the list of movie titles
package com.easy.webmagic.controller;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

public class ListPageProcesser implements PageProcessor {

    private Site site = Site.me().setDomain("127.0.0.1");

    @Override
    public void process(Page page) {
        // Collect every movie link (<a class="ulink">) on the list page
        page.putField("title", page.getHtml().xpath("//a[@class='ulink']").all().toString());
    }

    @Override
    public Site getSite() {
        return site;
    }
}
DetailPageProcesser.java gets the movie download address from the detail page
package com.easy.webmagic.controller;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

public class DetailPageProcesser implements PageProcessor {

    private Site site = Site.me().setDomain("127.0.0.1");

    @Override
    public void process(Page page) {
        // Take the first download link inside the element with id "Zoom"
        page.putField("download", page.getHtml().xpath("//*[@id=\"Zoom\"]/span/table/tbody/tr/td/a").toString());
    }

    @Override
    public Site getSite() {
        return site;
    }
}
Using a Pipeline to handle the crawl results
A Pipeline is responsible for processing the extraction results, including calculation and persistence to files, databases, and so on. By default, WebMagic provides two implementations: output to the console and save to a file.
A Pipeline defines how results are saved. To save to a particular database, you write a corresponding Pipeline. In general, one Pipeline is enough for each kind of requirement.
No processing is done here; the crawl results are written straight to the console.
MyPipeline.java
package com.easy.webmagic.controller;

import lombok.extern.slf4j.Slf4j;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

import java.util.Map;

@Slf4j
public class MyPipeline implements Pipeline {

    @Override
    public void process(ResultItems resultItems, Task task) {
        // Log the page URL, then every field the PageProcessor stored for it
        log.info("get page: " + resultItems.getRequest().getUrl());
        for (Map.Entry<String, Object> entry : resultItems.getAll().entrySet()) {
            log.info(entry.getKey() + ":\t" + entry.getValue());
        }
    }
}
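To persist results instead of logging them, the body of a custom pipeline could write each field to a file. The sketch below shows only that persistence step in plain JDK code, with no WebMagic types. The `ResultFileWriter` class, its methods, and the one-line-per-field layout are assumptions made for illustration; in a real pipeline the map would come from `resultItems.getAll()`.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: the persistence step a custom Pipeline might delegate to.
final class ResultFileWriter {

    // Turns the page URL and each extracted field into one "key:<TAB>value" line,
    // keeping the iteration order of the map.
    static List<String> toLines(String url, Map<String, Object> fields) {
        List<String> lines = new ArrayList<>();
        lines.add("url:\t" + url);
        for (Map.Entry<String, Object> entry : fields.entrySet()) {
            lines.add(entry.getKey() + ":\t" + entry.getValue());
        }
        return lines;
    }

    // Writes the lines for one page to the given file, replacing any previous content.
    static void write(Path target, String url, Map<String, Object> fields) throws IOException {
        Files.write(target, toLines(url, fields), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("title", "Example movie");
        Path target = Files.createTempFile("crawl-result", ".txt");
        write(target, "https://example.com/detail.html", fields);
        System.out.println(Files.readAllLines(target, StandardCharsets.UTF_8));
    }
}
```

Keeping the formatting in `toLines` separate from the file access makes the output format easy to check without touching the disk.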
Crawler entry point
Main.java
package com.easy.webmagic.controller;

import us.codecraft.webmagic.Spider;

public class Main {

    public static void main(String[] args) {
        // Get movie titles and page links
        Spider.create(new ListPageProcesser())
                .addUrl("https://www.dytt8.net/html/gndy/dyzz/list_23_1.html")
                .addPipeline(new MyPipeline())
                .thread(1)
                .run();

        // Get the movie download address of the specified detail page
        Spider.create(new DetailPageProcesser())
                .addUrl("https://www.dytt8.net/html/gndy/dyzz/20191204/59453.html")
                .addPipeline(new MyPipeline())
                .thread(1)
                .run();
    }
}
Running the example
Run Main.java and watch the console.
Titles from the first page of the movie list
14:06:28.704 [pool-1-thread-1] INFO com.easy.webmagic.controller.MyPipeline - get page: https://www.dytt8.net/html/gndy/dyzz/list_23_1.html
14:06:28.704 [pool-1-thread-1] INFO com.easy.webmagic.controller.MyPipeline - title: [<a href="/html/gndy/dyzz/20191204/59453.html" class="ulink">2019 Chinese Captain HD Mandarin Chinese English double characters</a>, <a href="/html/gndy/dyzz/20191201/59437.html" class="ulink">2019 Animated comedy "snowman's destiny" BD Chinese and English subtitles</a>, <a href="/html/gndy/dyzz/20191201/59435.html" class="ulink">2019 Bernadette, where are you BD Chinese and English subtitles</a>, <a href="/html/gndy/dyzz/20191129/59431.html" class="ulink">2019 The Irish/Irish killer BD Chinese English Subtitle</a>, <a href="/html/gndy/dyzz/20191129/59429.html" class="ulink">2019 Downton Abbey movie BD Chinese and English double characters[Modified subtitles]</a>, <a href="/html/gndy/dyzz/20191129/59428.html" class="ulink">2018 "Snow storm" in BD Chinese characters in Mandarin</a>, <a href="/html/gndy/dyzz/20191128/59427.html" class="ulink">2019 "Official secrets" BD Chinese and English subtitles</a>, <a href="/html/gndy/dyzz/20191127/59425.html" class="ulink">2019 Young you HD Chinese characters in Mandarin</a>, <a href="/html/gndy/dyzz/20191126/59424.html" class="ulink">2019 The climber HD Mandarin Chinese English double characters</a>, <a href="/html/gndy/dyzz/20191126/59423.html" class="ulink">2019 The story of "Goldfinch" BD Chinese English Subtitle</a>, <a href="/html/gndy/dyzz/20191125/59422.html" class="ulink">2019 The Hollywood past BD Chinese and English subtitles</a>, <a href="/html/gndy/dyzz/20191125/59421.html" class="ulink">2018 Cat and peach blossom BD Guoyue bilingual Chinese characters</a>, <a href="/html/gndy/dyzz/20191124/59418.html" class="ulink">2019 Are you ready/Marriage killing game BD Chinese English Subtitle</a>, <a href="/html/gndy/dyzz/20191124/59417.html" class="ulink">2019 Suspense of the plot of "two souls" BD Guoyue bilingual Chinese characters</a>, <a href="/html/gndy/dyzz/20191122/59409.html" class="ulink">2019 Science fiction action "twin killers" HD Chinese English Subtitle</a>, <a href="/html/gndy/dyzz/20191122/59408.html" class="ulink">2019 Mount paradise/Paradise mountain BD Chinese English Subtitle</a>, <a href="/html/gndy/dyzz/20191121/59407.html" class="ulink">2019 Horror of "the return of the clown 2" BD Chinese English Subtitle</a>, <a href="/html/gndy/dyzz/20191117/59403.html" class="ulink">2019 Klaus: the secret of Christmas BD Chinese, English and Western trilingual double characters</a>, <a href="/html/gndy/dyzz/20191116/59400.html" class="ulink">2019 The fall of angels BD Chinese English Subtitle</a>, <a href="/html/gndy/dyzz/20191115/59399.html" class="ulink">2019 The crime scene HD Guoyue bilingual Chinese characters</a>, <a href="/html/gndy/dyzz/20191115/59398.html" class="ulink">2019 Don't tell her BD Chinese and English subtitles</a>, <a href="/html/gndy/dyzz/20191114/59393.html" class="ulink">2019 Original fear BD Chinese and English subtitles</a>, <a href="/html/gndy/dyzz/20191114/59392.html" class="ulink">2019 After the wedding BD Chinese and English subtitles</a>, <a href="/html/gndy/dyzz/20191113/59387.html" class="ulink">2019 Crisis: Battle of Longtan BD Chinese English Subtitle</a>, <a href="/html/gndy/dyzz/20191113/59386.html" class="ulink">2019 Silent witness BD Guoyue bilingual Chinese characters</a>]
Movie download address from the detail page
14:06:34.365 [pool-2-thread-1] INFO com.easy.webmagic.controller.MyPipeline - get page: https://www.dytt8.net/html/gndy/dyzz/20191204/59453.html
14:06:34.365 [pool-2-thread-1] INFO com.easy.webmagic.controller.MyPipeline - download: <a href="ftp://Ygdy8: ygdy8@yg45.dydydytt.net: 4233 / sunshine movie www.ygdy8.com. Chinese captain. HD.1080p. Chinese and English double characters. MKV "> ftp://ygdy8: ygdy8@yg45.dydydytt.net: 4233 / sunshine movie www.ygdy8.com. Chinese captain. HD.1080p. Chinese and English double characters. Mkv</a>
This output shows that the data was crawled successfully; from here you can process it however you need.
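The `href` values in the list output are site-relative, so they need to be resolved against the list page URL before they can be crawled as detail pages. A minimal sketch of that step using only the JDK follows. The `LinkResolver` class and its regular expression are illustrative assumptions matched to the anchor format in the console output; they are not part of WebMagic.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: turns anchors like the ones in the list output into absolute URLs.
final class LinkResolver {

    // Matches href="..." inside an <a ...> tag. Adequate for this fixed format,
    // but not a general-purpose HTML parser.
    private static final Pattern HREF = Pattern.compile("<a\\s+[^>]*href=\"([^\"]+)\"");

    static List<String> absoluteLinks(String pageUrl, String anchorsHtml) {
        URI base = URI.create(pageUrl);
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(anchorsHtml);
        while (matcher.find()) {
            // URI.resolve handles both "/absolute/path" and "relative/path" references
            links.add(base.resolve(matcher.group(1)).toString());
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/html/gndy/dyzz/20191204/59453.html\" class=\"ulink\">title</a>";
        for (String link : absoluteLinks("https://www.dytt8.net/html/gndy/dyzz/list_23_1.html", html)) {
            System.out.println(link);
        }
    }
}
```

Inside a PageProcessor, URLs resolved this way could then be queued for crawling, for example with `page.addTargetRequests(...)`.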
Advanced crawler topics
Extract elements with Selectable
The chained extraction API built around Selectable is a core feature of WebMagic. With the Selectable interface, you can extract page elements in a chain of calls without worrying about the details of the extraction.
Configuring, starting and stopping crawlers
Spider is the entry point for starting a crawler. Before starting, create a Spider object with a PageProcessor, then start it with run(). The other components of Spider (Downloader, Scheduler, Pipeline) can be set through the corresponding set methods.
Jsoup and Xsoup
WebMagic's extraction relies mainly on Jsoup and on Xsoup, the XPath tool that the WebMagic project built on top of Jsoup.
Crawler monitoring
With this feature, you can inspect a running crawler: how many pages have been downloaded, how many pages remain, how many threads have been started, and so on. It is implemented with JMX, so you can use JMX tools such as JConsole to view local or remote crawler information.
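To make the JMX mechanism concrete, the sketch below registers a small counter bean with the platform MBean server and reads one attribute back, which is the same kind of lookup a tool such as JConsole performs. It uses only the JDK. The `CrawlStats` bean, its attributes, and the object name `example.crawler:type=CrawlStats` are made up for illustration and are not the beans WebMagic itself registers.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicInteger;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical monitoring example; all names are illustrative, not WebMagic's.
final class CrawlStatsDemo {

    // An interface whose name ends in "MXBean" defines the attributes JMX exposes.
    public interface CrawlStatsMXBean {
        int getPagesDownloaded();
        int getPagesRemaining();
    }

    static final class CrawlStats implements CrawlStatsMXBean {
        private final AtomicInteger downloaded = new AtomicInteger();
        private final AtomicInteger remaining;

        CrawlStats(int totalPages) {
            this.remaining = new AtomicInteger(totalPages);
        }

        // Called by the (simulated) crawler each time a page finishes.
        void pageDone() {
            downloaded.incrementAndGet();
            remaining.decrementAndGet();
        }

        @Override public int getPagesDownloaded() { return downloaded.get(); }
        @Override public int getPagesRemaining() { return remaining.get(); }
    }

    // Registers the bean, simulates some finished pages, and reads the
    // "PagesRemaining" attribute back through the MBean server.
    static int registerAndRead(String objectName, int totalPages, int pagesDone) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        CrawlStats stats = new CrawlStats(totalPages);
        try {
            ObjectName name = new ObjectName(objectName);
            server.registerMBean(stats, name);
            try {
                for (int i = 0; i < pagesDone; i++) {
                    stats.pageDone();
                }
                return (Integer) server.getAttribute(name, "PagesRemaining");
            } finally {
                server.unregisterMBean(name);
            }
        } catch (JMException e) {
            throw new IllegalStateException("JMX call failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(registerAndRead("example.crawler:type=CrawlStats", 10, 3));
    }
}
```

While such a bean is registered, a JMX client attached to the same JVM would list it under the `example.crawler` domain.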
Configuring proxies
ProxyProvider has a default implementation, SimpleProxyProvider. It is a simple round-robin ProxyProvider with no failure checking: you can configure any number of candidate proxies, and one is selected in order on each use. It suits relatively stable proxies that you host yourself.
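The round-robin selection described above can be illustrated in a few lines of plain Java. This is a sketch of the general technique only, not SimpleProxyProvider's source code; the `RoundRobinProxies` class and the `host:port` strings are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of round-robin selection; not WebMagic's implementation.
final class RoundRobinProxies {
    private final List<String> proxies;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinProxies(List<String> proxies) {
        if (proxies.isEmpty()) {
            throw new IllegalArgumentException("at least one proxy is required");
        }
        this.proxies = new ArrayList<>(proxies);
    }

    // Returns the candidates in order and wraps around at the end.
    // Safe to call from several threads; performs no health or failure check.
    String nextProxy() {
        // floorMod keeps the index valid even after the counter overflows
        int index = Math.floorMod(next.getAndIncrement(), proxies.size());
        return proxies.get(index);
    }

    public static void main(String[] args) {
        RoundRobinProxies pool = new RoundRobinProxies(Arrays.asList("10.0.0.1:8080", "10.0.0.2:8080"));
        for (int i = 0; i < 3; i++) {
            System.out.println(pool.nextProxy());
        }
    }
}
```

Because nothing checks whether a proxy still works, a dead proxy keeps receiving its share of requests, which is why this strategy fits stable, self-hosted proxies best.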
Handling non-GET HTTP requests
This is done by setting the method and request body on the Request object. For example:
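import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.model.HttpRequestBody;
import us.codecraft.webmagic.utils.HttpConstant;

// Build a POST request that carries a JSON body
Request request = new Request("http://xxx/path");
request.setMethod(HttpConstant.Method.POST);
// JSON requires double-quoted keys, so the quotes are escaped here
request.setRequestBody(HttpRequestBody.json("{\"id\":1}", "utf-8"));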
Using annotations to write crawlers
WebMagic supports a distinctive annotation-based style of writing crawlers, which becomes available once the webmagic-extension package is added.
In annotation mode, a simple object plus annotations is enough to write a crawler with very little code. For simple crawlers, this style is easy to understand and maintain.
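The idea behind annotation mode, a plain object whose fields declare how they are extracted, can be sketched with the JDK alone. The example below is only an illustration of that idea: the `@ExtractRegex` annotation, the `MovieLink` model, and the reflective `extract` method are hypothetical and far simpler than WebMagic's own annotations (such as `@ExtractBy` and `@TargetUrl` in the extension package).

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of annotation-driven extraction; not WebMagic's API.
final class AnnotationExtractionDemo {

    // Marks a String field with the regex whose first group supplies its value.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface ExtractRegex {
        String value();
    }

    // A plain object: the annotations are the only extraction "code" it contains.
    static final class MovieLink {
        @ExtractRegex("href=\"([^\"]+)\"")
        String url;

        @ExtractRegex(">([^<]+)</a>")
        String title;
    }

    // Creates an instance of the model and fills each annotated field from the HTML.
    static <T> T extract(Class<T> model, String html) {
        try {
            T instance = model.getDeclaredConstructor().newInstance();
            for (Field field : model.getDeclaredFields()) {
                ExtractRegex rule = field.getAnnotation(ExtractRegex.class);
                if (rule == null) {
                    continue; // field is not annotated, leave it untouched
                }
                Matcher matcher = Pattern.compile(rule.value()).matcher(html);
                if (matcher.find()) {
                    field.setAccessible(true);
                    field.set(instance, matcher.group(1));
                }
            }
            return instance;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot populate " + model.getName(), e);
        }
    }

    public static void main(String[] args) {
        String html = "<a href=\"/html/gndy/dyzz/20191204/59453.html\" class=\"ulink\">Example title</a>";
        MovieLink link = extract(MovieLink.class, html);
        System.out.println(link.title + " -> " + link.url);
    }
}
```

The model class stays free of parsing logic, which is what makes this style easy to read for simple crawlers.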