A detailed, hands-on walkthrough of the jsoup crawler

Today I'd like to share a summary of how I've been using jsoup to crawl Sina's web pages recently.
Before implementing this, there are two key points to understand:
1. The details of crawling with jsoup itself: you need to know jsoup's selectors and how it handles tags. If you are interested, you can look up a detailed jsoup tutorial.
2. The page structure of the target website. Pages generally come in two forms: content rendered directly in the HTML, and content presented through asynchronous JavaScript loading. The asynchronous case is more complex, so today we will focus on the practical use of jsoup on pages whose content is already in the HTML, taking Sina as the example.

Step 1: add the corresponding dependencies to the pom file:
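
A minimal sketch of the dependencies, assuming jsoup plus HtmlUnit are the only two libraries needed (the version numbers are assumptions; adjust them to your project):

    <!-- jsoup: HTML fetching and parsing (version is an assumption) -->
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.14.3</version>
    </dependency>
    <!-- HtmlUnit: headless browser for JS-rendered pages (version is an assumption) -->
    <dependency>
        <groupId>net.sourceforge.htmlunit</groupId>
        <artifactId>htmlunit</artifactId>
        <version>2.55.0</version>
    </dependency>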

Step 2: understand the basic structure of the Sina web pages

Open the Sina site and click into a category column, for example sports. On the sports category page, right-click to view the page source, and we can see the following.

The structure of the list is basically ul > li > a, so from each list item we can take the link behind the title and follow it into the details page. We can click any entry in the category list to open the corresponding details page and look at its structure as well:

On the details page, the main content sits in an element with the class "article". In this post we will mainly show how to capture the title and body content; a small selector sketch follows below.
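
As a minimal sketch (not the final code), assuming jsoup is already on the classpath, the two rules above map to selectors like this:

    // Sketch only: requires org.jsoup.Jsoup, org.jsoup.nodes.Document/Element, org.jsoup.select.Elements,
    // and should run inside a method that throws IOException
    Document listPage = Jsoup.connect("http://sports.sina.com.cn/").get();
    Elements newsLinks = listPage.select("ul > li > a");          // list page rule: ul > li > a
    for (Element a : newsLinks) {
        String title = a.text();                                   // headline text
        String detailUrl = a.absUrl("href");                       // absolute link to the details page
        Document detailPage = Jsoup.connect(detailUrl).get();
        Elements articleBody = detailPage.select(".article");      // details page rule: class "article"
        System.out.println(title + " -> " + articleBody.text().length() + " characters of body text");
    }
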
That covers Sina's page structure; now let's go straight into the business code:

1. Obtain the page's Document by simulating a browser (plain jsoup with a browser user agent, or HtmlUnit when Chrome simulation is needed). The code is as follows.

url is the address to crawl, and useHtmlUnit controls whether HtmlUnit is used (needed for JS-rendered pages). For Sina's HTML pages we can pass false.

    public static Document getHtmlFromUrl(String url, boolean useHtmlUnit) {
        Document document = null;
        if (!useHtmlUnit) {
            try {
                document = Jsoup.connect(url)
                // Send a browser user agent (an IE 9 string here) so the site serves the normal page
                        .userAgent("Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)").get();
            } catch (Exception e) {
                e.printStackTrace();
            }
        } else {
            // Simulate the Chrome browser with HtmlUnit
            WebClient webClient = new WebClient(BrowserVersion.CHROME);
            webClient.getOptions().setJavaScriptEnabled(true);// 1. enable JavaScript
            webClient.getOptions().setCssEnabled(false);// 2. disable CSS to avoid extra requests for CSS rendering
            webClient.getOptions().setActiveXNative(false);// disable ActiveX
            webClient.getOptions().setRedirectEnabled(true);// 3. follow redirects
            webClient.setCookieManager(new CookieManager());// 4. enable cookie management
            webClient.setAjaxController(new NicelyResynchronizingAjaxController());// 5. resynchronize AJAX calls so their results are available
            webClient.getOptions().setThrowExceptionOnScriptError(false);// 6. do not throw when a script error occurs
            webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);// do not throw on failing HTTP status codes
            webClient.getOptions().setUseInsecureSSL(true);// ignore SSL certificate errors
            webClient.getOptions().setTimeout(10000);// 7. timeout in milliseconds
            /**
             * To go through a proxy:
             * ProxyConfig proxyConfig = webClient.getOptions().getProxyConfig();
             * proxyConfig.setProxyHost(listserver.get(0));
             * proxyConfig.setProxyPort(Integer.parseInt(listserver.get(1)));
             */
            HtmlPage htmlPage = null;
            try {
                htmlPage = webClient.getPage(url);
                webClient.waitForBackgroundJavaScript(10000);
                webClient.setJavaScriptTimeout(10000);
                String htmlString = htmlPage.asXml();
                return Jsoup.parse(htmlString);
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                webClient.close();
            }
        }
        return document;
    }
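
A quick usage sketch for the helper above, assuming it sits in the BaseJsoupUtil utility class that the later code refers to:

    // false = plain jsoup fetch with a browser user agent, no HtmlUnit
    Document doc = BaseJsoupUtil.getHtmlFromUrl("http://sports.sina.com.cn/", false);
    if (doc != null) {
        System.out.println(doc.title());                          // page title
        System.out.println(doc.select("ul > li > a").size());     // number of list links found
    }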

2. The core crawling logic. The code is as follows:

	/**
	 * Pull the news list and article details for one channel.
	 *
	 * @param tagUrl the url to crawl, e.g. http://sports.sina.com.cn/ (Sina's sports channel)
	 * @param tag the tag id of this source (here it stands for Sina)
	 * @param num the maximum number of articles to collect; when it reaches 0 crawling ends
	 * @param regex the selector rule of the page, assumed here to be: ul > li > a
	 * @return List<Map<String,String>>
	 */
  private static List<Map<String,String>> pullNews(String tagUrl,Integer tag,Integer num,String regex) {
       //Get web page
       Document html= null;
       try {
   	    html = BaseJsoupUtil.getHtmlFromUrl(tagUrl, false);
       } catch (Exception e) {
       	e.printStackTrace();
       }
       List<Map<String,String>> returnList = new ArrayList<>();//the list that will be returned
       if(html == null) {
       	return returnList;//fetching the page failed
       }
       //Use jsoup to select the <a> tags; different sites need different selector rules
       Elements newsATags = null;
       //For the sports list page described above, the rule is basically: ul > li > a
       if(Utils.notBlank(regex)) {
       	newsATags = html.select(regex);
       }else {
       	System.out.println(tagUrl + " collection rule not set");
       	return returnList;
       }
       //When the rules do not apply
       if(newsATags.isEmpty()) {
       	System.out.println(tagUrl + " rule not applicable");
       	return returnList;
       }
       //Extract the url from each <a> tag
       List<Map<String,String>> list = new ArrayList<>();//the list that stores each article's url and title
       //Traverse the list entries; each url leads to an article details page
       for (Element a : newsATags) {
       	String articleTitle=a.text();//the title of the current entry
       	   String url = a.attr("href");//the link to the details page; from there we grab the title, images and other information we want
           if(url.contains("html")) {
           	if(url.indexOf("http") == -1) {
               	url = "http:" + url;
               }
               Map<String,String> map = new HashMap<>();
               map.put("url", url);
               map.put("title", articleTitle);
               list.add(map);
           }else {
           	continue;  
           }      
       }
       //Access the page according to the url to get the content
       for(Map<String,String> m :list) {
       	 Document newsHtml = null;
            try {
           	    //Fetch the details page for this url through the same helper
                newsHtml = BaseJsoupUtil.getHtmlFromUrl(m.get("url"), false);
                String title=m.get("title");//the article title
                Elements newsContent=null;
                String data;
                newsContent=newsHtml.select(".article");//grab the article body; as explained above, the rule here is the "article" class
                if(!newsContent.isEmpty()) {
               	 String content=newsContent.toString();//the article body as html
               	 //Articles that are too short are not worth keeping
               	 if(content.length() < 200) {//skip articles that are too short
                   	 continue;
                    }
               	 //Extract the publish time; Sina dates look like "2021年12月22日", so the Chinese characters are replaced with dashes
               	 if(Utils.isBlank(newsHtml.select(".date").text())) {
               		 data=newsHtml.select(".time").text().replaceAll("年", "-").replaceAll("月", "-").replaceAll("日", "");
               	 }else{
               		 data=newsHtml.select(".date").text().replaceAll("年", "-").replaceAll("月", "-").replaceAll("日", "");
               	 }
               	 //Some channels put the time in other elements, so handle those paths separately
               	 if(tagUrl.equals("http://collection.sina.com.cn/")) {
               		 data=newsHtml.select(".titer").text().replaceAll("年", "-").replaceAll("月", "-").replaceAll("日", "");
               	 }else if(tagUrl.equals("https://finance.sina.com.cn/stock/") || tagUrl.equals("http://blog.sina.com.cn/lm/history/")) {
               		 data=newsHtml.select(".time.SG_txtc").text().replaceAll("年", "-").replaceAll("月", "-").replaceAll("日", "").replace("(", "").replace(")", "");
               	 }
               	    //Extract the image urls from the content
                    List<String> imgSrcList = BaseJsoupUtil.getImgSrcList(content);
                    String coverImg = "";
                    if(!imgSrcList.isEmpty()) {
                   	 if(num == 0) {
                     		break;//the collection limit has been reached
                     	 }
                        num --;
    					 for(String url : imgSrcList){
                   		 //download/re-host the image; prepend "http:" for protocol-relative urls
                   		 String imgUrl = BaseJsoupUtil.downloadUrl(url.indexOf("http") == -1?"http:" + url:url);
                   		 content = Utils.isBlankStr(content)?content:content.replace(url, imgUrl);
                   		 if(Utils.isBlank(coverImg))coverImg = imgUrl.replace(FileConst.QINIU_FILE_HOST+"/", "");//the first image becomes the cover
                   	 }
                   	    //Package everything we want into a map and add it to the result list; this basically ends the pull
                      	Map<String,String> map = new HashMap<>();
                        map.put("title", title);//title
                        map.put("date", data);//article publish time
                        map.put("content", HtmlCompressor.compress(content));//article content
                        map.put("tag", tag.toString());//tag id
                        map.put("coverImg", coverImg);//article cover
                        map.put("url", m.get("url"));//url
                        returnList.add(map);
                    }
                }
            } catch (Exception e) {
           	    e.printStackTrace();
            }
       }
       return returnList;
   }
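
A usage sketch for pullNews, called from within the same class; the tag id 1, the limit of 20 and the selector rule are made-up values for illustration:

    // Pull up to 20 sports articles with tag id 1 (both values are illustrative)
    List<Map<String, String>> news = pullNews("http://sports.sina.com.cn/", 1, 20, "ul > li > a");
    for (Map<String, String> item : news) {
        System.out.println(item.get("date") + " | " + item.get("title") + " | " + item.get("url"));
    }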

That is the basic logic of crawling with jsoup. In short: get familiar with jsoup's selectors and with the specific structure of the pages you want to crawl; once you know those rules, you can extract information according to your own business logic. This Sina example is shared only as a technical exchange, and we hope you will not use it commercially. That's all for today. I'll keep sharing notes on the pitfalls I hit, and if you have any questions you can leave a message.

Keywords: Java crawler
