How long has it been since you last looked at the stars? [Crawling every picture from NASA's astronomy picture site]

1. Background

While browsing the web, I noticed that NASA runs a popular-science site that publishes a very high-resolution picture every day, and I wanted to download the pictures to use as wallpaper.

So I decided to write a Java crawler to fetch all the pictures. I could also add a nightly check so that, when an update is detected, the new picture is pushed to my phone the next morning; that part, of course, has not been implemented yet.

2. Development

2.1 Development environment

  • Maven
  • Java 17 (following Spring Framework 6's move to Java 17, I plan to use the Zulu JDK 17 for future development)
  • WebMagic (crawler framework for Java)

2.2 Maven pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.yeahvin</groupId>
    <artifactId>getStars</artifactId>
    <version>1.0.1-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>17</maven.compiler.source>
        <maven.compiler.target>17</maven.compiler.target>
    </properties>

    <dependencies>
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-core</artifactId>
            <version>0.7.5</version>
        </dependency>
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-extension</artifactId>
            <version>0.7.5</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>1.7.25</version>
        </dependency>
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <version>1.18.20</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>ch.qos.logback</groupId>
            <artifactId>logback-classic</artifactId>
            <version>1.2.3</version>
        </dependency>
    </dependencies>


    <build>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
                <configuration>
                    <mainClass>com.yeahvin.StartApplication</mainClass>
                    <layout>ZIP</layout>
                </configuration>
                <executions>
                    <execution>
                        <goals>
                            <goal>repackage</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>


2.3 Using the Java crawler framework WebMagic

This framework is also very easy to get started with. The official manual is here: WebMagic in Action

Here is a brief introduction to how the framework is used. In the whole process, all a developer needs to do is:

  1. Write a PageProcessor
  2. Write a Pipeline

That alone completes a very simple crawler. I recommend taking a look at this example first: First example

In this project, analysing the NASA site is actually quite simple; the process is roughly:

request the overview page --> get all existing links and traverse them --> request each detail page --> download the picture

Startup class

OK, let's start by writing StartApplication, the entry point that starts the crawler.

package com.yeahvin;

import com.yeahvin.pageProcessors.NasaStarsProcessor;
import com.yeahvin.pipelines.NasaStarsPipeline;
import lombok.extern.slf4j.Slf4j;
import us.codecraft.webmagic.Spider;

/**
 * @author Asher
 * on 2021/11/6
 */
@Slf4j
public class StartApplication {

    //Download picture storage path
    public static final String DOWNLOAD_PATH = "/Users/asher/gitWorkspace/temp/";
    public static final String INDEX_PATH = "https://apod.nasa.gov/apod/archivepix.html";

    public static void main(String[] args) {
        Spider.create(new NasaStarsProcessor())
                .addUrl(INDEX_PATH)
                .addPipeline(new NasaStarsPipeline())
                .run();
    }
}

In the code above, Spider.create(processor) builds a crawler very quickly; addUrl sets the address to request, addPipeline configures how the scraped data is handled, and finally run() starts the crawler.

Write NasaStarsProcessor

This class analyses and processes the content of a web page after it has been requested.

package com.yeahvin.pageProcessors;

import com.yeahvin.StartApplication;
import com.yeahvin.utils.StringUtil;
import lombok.extern.slf4j.Slf4j;
import org.apache.commons.lang3.StringUtils;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

import java.io.File;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

import static com.yeahvin.StartApplication.DOWNLOAD_PATH;

/**
 * @author Asher
 * on 2021/11/6
 */
@Slf4j
public class NasaStarsProcessor implements PageProcessor {

    private Site site = Site.me().setRetryTimes(5).setSleepTime(1000).setTimeOut(10000);

    @Override
    public void process(Page page) {
        if (page.getUrl().regex(StartApplication.INDEX_PATH).match()){

            //Get the saved files in the folder
            File saveDir = new File(DOWNLOAD_PATH);
            List<String> fileNames = Arrays.stream(Optional.ofNullable(saveDir.listFiles()).orElse(new File[]{}))
                    .map(File::getName)
                    .collect(Collectors.toList());

            log.info("Start homepage crawling");
            List<String> starsInfos = page.getRawText()
                    .lines()
                    .map(line -> {
                        List<String> infoList = StringUtil.getRegexString(line, "^\\d{4}.*<br>$");
                        boolean flag = infoList.size() > 0;
                        return flag ? infoList.get(0) : null;
                    })
                    .filter(StringUtils::isNoneBlank)
                    .collect(Collectors.toList());

            starsInfos.forEach(info -> {
                String link = StringUtil.getRegexString(info, "ap.*html").get(0);
                String name = StringUtil.getRegexString(info, ">.*</a>").get(0)
                        .replace("</a>", "")
                        .replace(">", "");

                //Skip if the corresponding file already exists
                if (fileNames.stream().filter(fileName -> fileName.contains(name)).count() > 0) {
                    log.info("file [{}] Already exists,Skip", name);
                    return;
                }
                page.addTargetRequest("https://apod.nasa.gov/apod/" + link);
            });
            log.info("Complete the home page capture, a total of{}Data bar", starsInfos.size());
        }else {
            page.getRawText()
                .lines()
                .filter(line -> line.contains("<title>")
                        || Pattern.compile("<a href=\"image.*jpg.*\">").matcher(line).find()
                        || Pattern.compile("<a href=\"image.*jpg.*\"").matcher(line).find()
                )
                .forEach(line -> {
                    if (line.contains("<title>")){
                        String date = StringUtil.getRegexString(line, "\\d{4}.*-").get(0)
                                .replace("-", "");
                        String name = StringUtil.getRegexString(line, "-.*").get(0)
                                .replace("-", "");
                        page.putField("date", date);
                        page.putField("name", name);
                    }else {
                        String imageLink = StringUtil.getRegexString(line, "image.*jpg").get(0);
                        page.putField("imageLink", imageLink);
                    }
                    page.putField("url", page.getUrl());
                });

        }
    }

    @Override
    public Site getSite() {
        return site;
    }
}

The logic of the above code is:

  • A Site property holds the configuration used when the crawler requests a page (retry count, sleep time, timeout).

  • Implement the PageProcessor interface and override its required process method.

  • In process, an if first checks whether the incoming request is the overview page; anything else is handled as a detail page in the else branch. This works because the detail-page links found on the overview page are added back to the same Processor with the line below, so a second Processor does not have to be written, which saves development effort.

    page.addTargetRequest("https://apod.nasa.gov/apod/" + link);
    
  • When the if condition is true (the overview page), the names of the files already present in the save folder are read first, so that if the application stopped halfway and is restarted, pictures that were already downloaded are not fetched again. Then getRawText() returns the raw text of the page, which is filtered with regular expressions to extract the links of all the detail pages (a sketch of the regex helper used here follows this list).

  • The else branch therefore handles a detail page, i.e. an individual Astronomy Picture of the Day page. Again getRawText() returns the raw page text, which is filtered with regular expressions; calls such as page.putField("date", date) store values on the page, and those values are later passed to the NasaStarsPipeline configured in StartApplication.
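
The StringUtil helper imported above is not shown in the post. Judging from how it is called (it returns a list of matches that is checked with size() and read with get(0)), a minimal sketch could look like the following; the author's actual implementation may differ.

package com.yeahvin.utils;

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Hypothetical sketch of the regex helper used by NasaStarsProcessor:
 * it is assumed to return every substring of the input that matches the
 * given regular expression, in order of appearance.
 */
public class StringUtil {

    public static List<String> getRegexString(String input, String regex) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        // collect every full match of the pattern
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }
}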


Write NasaStarsPipeline

NasaStarsPipeline processes the contents of the fields that the Processor above has put into the page.

The general logic is:

  • Take the information extracted from the detail page (name, date and the picture's download link) and download the picture. Note that a detail page does not always contain a picture; sometimes it is a video. When the page cannot be parsed as a picture, it is simply skipped and an error is logged.

package com.yeahvin.pipelines;

import com.yeahvin.utils.X509TrustManager;
import lombok.extern.slf4j.Slf4j;
import org.apache.commons.lang3.StringUtils;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

import java.io.File;

import static com.yeahvin.StartApplication.DOWNLOAD_PATH;

/**
 * @author Asher
 * on 2021/11/6
 */
@Slf4j
public class NasaStarsPipeline implements Pipeline {

    @Override
    public void process(ResultItems resultItems, Task task) {
        if (resultItems.getAll()!=null && resultItems.getAll().size()>0){
            String date = resultItems.get("date");
            String name = resultItems.get("name");
            String imageLink = resultItems.get("imageLink");

            if (StringUtils.isBlank(imageLink)){
                log.error("Failed to get the picture download link, name:{}, date:{}, url:{}", name, date, resultItems.get("url"));
                return;
            }

            String fileName = date + name + ".jpg";
            if (new File(DOWNLOAD_PATH + fileName).exists()) {
                log.info("Download pictures{}Already exists, skip, name:{}, date:{}", imageLink, name, date);
                return;
            }
            try {
                X509TrustManager.downLoadFromUrlHttps("https://apod.nasa.gov/apod/" + imageLink, fileName, DOWNLOAD_PATH);
            } catch (Exception e) {
                log.error("Picture download failed:", e);
                e.printStackTrace();
            }
            log.info("Download pictures{}Done, name:{}, date:{}", imageLink, name, date);
        }
    }

}
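
The X509TrustManager utility called in the pipeline is also not included in the post; its name suggests it additionally sets up certificate trust for HTTPS. A minimal stand-in for just the download-to-file part, with the signature assumed from the call site above, might be:

package com.yeahvin.utils;

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/**
 * Hypothetical stand-in for the downloader used by NasaStarsPipeline.
 * The original presumably also configures HTTPS certificate trust; this
 * sketch only streams the remote file into the given directory.
 */
public class X509TrustManager {

    public static void downLoadFromUrlHttps(String url, String fileName, String savePath) throws Exception {
        Path target = Path.of(savePath, fileName);
        // stream the remote resource straight into the target file
        try (InputStream in = new URL(url).openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}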

With all of that written, the crawler is essentially complete, and it can be extended later, for example by downloading pictures with multiple threads or by running the crawl as a daily scheduled task. In my case the scheduling is done with a shell script, so I don't intend to keep this application resident in memory all the time.
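
On the multi-threading point, WebMagic itself can run the crawler with several worker threads; a small tweak to the startup class (the thread count of 5 is picked arbitrarily here) would be:

// in StartApplication.main: the same spider, but with 5 worker threads
Spider.create(new NasaStarsProcessor())
        .addUrl(INDEX_PATH)
        .addPipeline(new NasaStarsPipeline())
        .thread(5)
        .run();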


 
