Detailed explanation of Puppeteer usage with examples

PhantomJS used to be the king of headless browsers, widely used for both testing and crawling. With the emergence of Headless Chrome, the author of PhantomJS made it clear that it would no longer be updated. Headless Chrome looks like the future for crawlers, while testing will largely stick with the WebDriver approach. Headless Chrome can be driven through WebDriver or through its own integrated API, Puppeteer. Puppeteer is as powerful as its name suggests: it can control Chrome or Chromium at will. Its one drawback is that it only offers a Node API.

Puppeteer is a Node library that controls Chrome over the DevTools Protocol, and it requires Node 6.4 or above. I started learning Node when I first came across this tool, and its async/await support still strikes me as extremely powerful; Puppeteer relies heavily on asynchronous calls to get its work done.

You can install Puppeteer with npm, Node's package manager:

    npm i puppeteer

Chromium is downloaded automatically during installation. If you don't need it (for example, you want to point Puppeteer at an existing Chrome), you can skip the download by setting the PUPPETEER_SKIP_CHROMIUM_DOWNLOAD environment variable (or the equivalent npm config) before installing. As a crawler engineer I won't discuss the testing side here. Next, let's see how to use it. As with WebDriver, you first need to instantiate a browser. The code is as follows:

    const puppeteer = require('puppeteer');
    
    (async () => {
     const browser = await puppeteer.launch(); 
     const page = await browser.newPage(); 
     await page.goto('http://www.baidu.com'); 
     await browser.close(); 
    })();

When this code finishes running you may not notice anything, because it starts a Chromium process in the background, opens the Baidu home page, and then closes it. Of course, we can also run Chromium in the foreground; that requires some configuration. The options are simply passed into launch(). The common ones are:

headless: whether to run without a visible browser window. The default is true

ignoreHTTPSErrors: whether to ignore HTTPS certificate errors. The default is false

executablePath: the path of the browser executable to run. By default Puppeteer uses the Chromium bundled with it

slowMo: slows every Puppeteer operation down by the specified number of milliseconds, which is useful for watching what is happening

args: extra command-line arguments for the browser, such as "--no-sandbox" to disable the sandbox or "--proxy-server" to set a proxy. For the full set, see Chromium's list of command-line switches

Examples of use are as follows:

    const browser = await puppeteer.launch({headless: false, args: ["--no-sandbox"]}) // Open the browser

Open a new page (tab):

    const page = await browser.newPage();

Set the viewport size:

    await page.setViewport({
     width: 1920,
     height: 1080
    })

Filter unwanted requests:

    await page.setRequestInterception(true);
    page.on('request', interceptedRequest => {
     if (interceptedRequest.url().endsWith('.png') || interceptedRequest.url().endsWith('.jpg'))
      interceptedRequest.abort();
     else
      interceptedRequest.continue();
    });
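Filtering on the URL suffix misses images served without a .png/.jpg extension. A more robust variant keys off the request's resource type instead; this is a sketch, and the set of blocked types below is an arbitrary example:

```javascript
// Sketch: block requests by resource type rather than URL suffix,
// so images served without a .png/.jpg extension are still caught.
const BLOCKED_TYPES = new Set(['image', 'stylesheet', 'font']);

function shouldBlock(resourceType) {
  return BLOCKED_TYPES.has(resourceType);
}

// Wiring, assuming an existing Puppeteer `page`:
async function enableBlocking(page) {
  await page.setRequestInterception(true);
  page.on('request', req =>
    shouldBlock(req.resourceType()) ? req.abort() : req.continue());
}
```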

Set the browser's User-Agent:

    await page.setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 Edge/16.16299")

Set cookies:

    const data = {
     name: "smidB2", 
     domain: ".csdn.net", 
     value: "201806051502283cf43902aa8991a248f9c605204f92530032f23ef22c16270"
    }
    await page.setCookie(data)

This example is only a demonstration; real cookies come as an array and need to be set in a loop:

    for(let data of cookies){
     await page.setCookie(data)
    }
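In practice you often copy one long cookie string out of DevTools rather than individual objects. A hypothetical helper (not part of Puppeteer) to turn a "k=v; k2=v2" string into the objects page.setCookie() expects might look like this:

```javascript
// Hypothetical helper: split a DevTools-style cookie string into
// the {name, value, domain} objects page.setCookie() accepts.
function parseCookieString(cookieStr, domain) {
  return cookieStr.split(';')
    .map(s => s.trim())
    .filter(s => s.includes('='))
    .map(pair => {
      const idx = pair.indexOf('=');   // split on the first '=' only
      return {
        name: pair.slice(0, idx).trim(),
        value: pair.slice(idx + 1).trim(),
        domain,
      };
    });
}
```

Its output can then be fed into the setCookie loop above.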

Request url:

    const url = "http://www.baidu.com"
    await page.goto(url, { waitUntil: "networkidle2" });

Set page waiting time:

    await page.waitFor(1000); // The unit is milliseconds

Wait for an element of the page to load

    await page.waitForSelector("input[class='usrname']")

Click an element

    await page.click("input[class='submit']")

Use page.evaluate() to scroll to the bottom of the page; the principle is to inject JS code into the page:

    let scrollEnable = true;  // must start as true, otherwise the loop never runs
    let scrollStep = 500;     // pixels per scroll step
    while (scrollEnable) {
     scrollEnable = await page.evaluate((scrollStep) => {
      let scrollTop = document.scrollingElement.scrollTop;
      document.scrollingElement.scrollTop = scrollTop + scrollStep;
      // 1080 is the viewport height configured earlier
      return document.body.clientHeight > scrollTop + 1080;
     }, scrollStep);
     await page.waitFor(600)
    }
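The stop condition inside the evaluate callback can be read as a pure predicate; extracted as a helper it is easier to reason about (1080 is the viewport height set earlier):

```javascript
// The loop keeps scrolling while there is still page below the fold.
function hasMoreToScroll(bodyHeight, scrollTop, viewportHeight) {
  return bodyHeight > scrollTop + viewportHeight;
}
```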

Get the page's HTML:

    const frame = await page.mainFrame()
    const bodyHandle = await frame.$('html');
    const html = await frame.evaluate(body => body.innerHTML, bodyHandle);
    await bodyHandle.dispose(); //Destroy 
    console.log(html)  
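Puppeteer also provides page.content(), which returns the serialized HTML of the page in a single call. A small wrapper, assuming an existing `page`, would be:

```javascript
// Shorter alternative to the handle-based approach above:
// page.content() returns the full serialized HTML of the page.
async function getHtml(page) {
  return await page.content();
}
```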

That covers the operations a crawler typically needs. Below is the code that crawls the basic information and ratings of Douban's popular movies. I only knew a little Node when I wrote this program, so if anything is wrong, please leave a comment.

basePupp.js

    const puppeteer = require("puppeteer")
    
    class BasePuppeteer{
     puppConfig(){
      const config = {
       headless: false
      }
      return config
     }
     async openBrower(setting){
      const browser = await puppeteer.launch(setting)
      return browser
     }
     async openPage(browser){
      const page = await browser.newPage()
      return page
     }
     async closeBrower(browser){
      await browser.close()
     }
     async closePage(page){
      await page.close()
     }
    }
    
    const pupp = new BasePuppeteer()
    module.exports = pupp

douban.js

    const pupp = require("./basePupp.js")
    const cheerio = require("cheerio")
    const mongo = require("mongodb")
    const assert = require("assert")
    
    const MongoClient = mongo.MongoClient
    const Urls = "mongodb://10.4.251.129:27017/douban"
    
    MongoClient.connect(Urls, function (err, db) {
     if (err) throw err;
     console.log('Database connected');
     var dbase = db.db("douban");   // same database the insert below writes to
     dbase.createCollection('detail', function (err, res) {
      if (err) throw err;
      console.log("Collection created!");
      db.close();
     });
    });
    
    async function getList(){
     const brower = await pupp.openBrower()
     const page = await pupp.openPage( brower)
     const url = "https://movie.douban.com/explore#!type=movie&tag=%E7%83%AD%E9%97%A8&sort=recommend&page_limit=20&page_start=0"
     await page.goto(url);
     while(true){           // Click repeatedly until the element cannot be obtained
      try{
       await page.waitFor(1000); 
       await page.waitForSelector('a[class=more]'); // Wait for element loading to complete, with a timeout of 30000ms
       await page.click("a[class=more]")
       // break
      }catch(err){
       console.log(err)
       console.log("stop click !!!")
       break
      }
     } 
     await page.waitFor(1000);        // Wait one second for the page
     const links = await page.evaluate(() => {    // Get movie details url
      let movies = [...document.querySelectorAll('.list a[class=item]')];
      return movies.map((movie) =>{
       return {
        href: movie.href.trim(),
       }
      });
     });
     console.log(links.length)
     for (var i = 0; i < links.length; i++) {
      const a = links[i];
      await page.waitFor(2000); 
      await getDetail(brower, a.href)
      // break
     }
     await pupp.closePage(page)
     await pupp.closeBrower(brower)
     
    }
    
    async function getDetail(brower, url){
     const page = await pupp.openPage(brower)
     await page.goto(url);
     await page.waitFor(1000); 
     try{
      await page.click(".more-actor", {delay: 20})
     }catch(err){
      console.log(err)
     }
     const frame = await page.mainFrame()
     const bodyHandle = await frame.$('html');
     const html = await frame.evaluate(body => body.innerHTML, bodyHandle);
     await bodyHandle.dispose();      // Destroy      
     const $ = cheerio.load(html)
     const title = $("h1 span").text().trim()
     const rating_num = $(".rating_num").text().trim()
     const data = {}
     data["title"] = title
     data["rating_num"] = rating_num
     let info = $("#info").text()
     const keyword = ["director", "screenplay", "lead", "type", "website", "location", "language", "playdate", "playtime", "byname", "imdb"]
     if (info.indexOf("www.") > 0){
      info = info.replace(/https:\/\/|http:\/\//g, "").replace(/\t/g," ").replace(/\r/g, " ").split(":")
      for(var i = 1; i < info.length; i++){
       data[keyword[i-1]] = info[i].split(/\n/g)[0].replace(/ \/ /g, ",").trim()
      }
    
     }else{
      info = info.replace(/\t/g," ").replace(/\r/g, " ").split(":")
      keyword.splice(4,1)
      for(var i = 1; i < info.length-1; i++){
       data[keyword[i-1]] = info[i].split(/\n/g)[0].replace(/ \/ /g, ",").trim()
      }
      data["website"] = ""
     }
     // console.log(data)
     MongoClient.connect(Urls,function(err,db){       //Get connection
      assert.equal(null,err);           //Use the assertion module instead of the previous if judgment
      var dbo = db.db("douban");
      dbo.collection("detail").insert(data, function(err,result){  //Connect to the database and pass in the collection with parameters
       assert.equal(null,err);
       console.log(result);
       db.close();
      });
     });
     await pupp.closePage(page)
    }
    
    getList()

The code above captures all of Douban's popular movies in the following steps:

1. Repeatedly click "load more" until the element can no longer be found and an exception is thrown

2. Once the full list of popular movies has loaded, parse each movie's detail-page URL and request them one by one

3. Parse the required data out of each detail page

4. Store the captured data; MongoDB is used here
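The field parsing in step 3 is the most fragile part of the crawler: splitting the whole #info block on ":" breaks whenever a value itself contains a colon. A hypothetical line-based alternative (the field labels below are illustrative) would be:

```javascript
// Hypothetical alternative: parse the #info text line by line,
// splitting each line on its first ':' only.
function parseInfo(text) {
  const data = {};
  for (const rawLine of text.split('\n')) {
    const line = rawLine.trim();
    const idx = line.indexOf(':');
    if (idx <= 0) continue;            // skip lines without a "key: value" shape
    const key = line.slice(0, idx).trim();
    const value = line.slice(idx + 1).trim().replace(/ \/ /g, ',');
    data[key] = value;
  }
  return data;
}
```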

After insertion, each movie ends up as a single document in the detail collection.

Finally, the browser instantiation above can be optimized into a singleton:

config.js

    module.exports = {
     browserOptions: {
      headless: false,
      // args: ['--no-sandbox', '--proxy-server=http://proxy:abc100@cp.test.com:8995'],
      args: ['--no-sandbox'],
     }
    };

brower.js

    const puppeteer = require("puppeteer");
    const config = require('./config');
    const deasync = require('deasync');
    const BROWSER_KEY = Symbol.for('browser');
    const BROWSER_STATUS_KEY = Symbol.for('browser_status');
    
    launch(config.browserOptions);
    wait4Launch();
    
    /**
     * Start the browser and store the instance globally.
     * @param {*} options - passed straight to puppeteer.launch()
     */
    function launch(options = {}) {
     if (!global[BROWSER_STATUS_KEY]) {
      global[BROWSER_STATUS_KEY] = 'launching';
      puppeteer.launch(options)
       .then((browser) => {
        global[BROWSER_KEY] = browser;
        global[BROWSER_STATUS_KEY] = 'launched';
       })
       .catch((err) => {
        global[BROWSER_STATUS_KEY] = 'error';
        throw err;
       });
     }
    }
    
    function wait4Launch(){
     // block synchronously until the browser has finished launching
     while (!global[BROWSER_KEY] && global[BROWSER_STATUS_KEY] == 'launching') {
      deasync.runLoopOnce();
     }
    }
    module.exports = global[BROWSER_KEY];
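deasync blocks the event loop and is a heavy dependency. If the callers can themselves be async, a promise-based singleton achieves the same thing without it; this is a sketch under that assumption, not the article's code:

```javascript
// Sketch: promise-based browser singleton. The first caller triggers
// the launch; every later caller reuses the same pending/resolved promise.
let browserPromise = null;

function getBrowser(launcher, options = {}) {
  if (!browserPromise) {
    browserPromise = launcher(options);
  }
  return browserPromise;
}

// Real usage would pass Puppeteer's launcher, e.g.
// getBrowser(puppeteer.launch.bind(puppeteer), config.browserOptions)
```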

That's all for this article. I hope it helps with your study, and that you will continue to support Script Home.


Added by rpadilla on Sun, 30 Jan 2022 13:15:33 +0200