Front end crawler [puppeter]

Puppeteer It is a node released by the chrome development team in 2017 JS package, as well as Headless Chrome. Used to simulate the operation of Chrome browser. It provides an advanced API to control Headless Chrome or Chromium through the DevTools protocol. It can also be configured to use full (non headless) chrome or Chromium.

Before learning about puppeter, let's have a look Chrome DevTool Protocol and Headless Chrome.

What is the Chrome DevTool Protocol

CDP is based on WebSocket and uses WebSocket to realize fast data channel with browser kernel.
CDP is divided into multiple domains (DOM, Debugger, Network, Profiler, Console...), Related Commands and Events are defined in each domain.
We can package some tools based on CDP to debug and analyze Chrome browser. For example, our commonly used "Chrome developer tool" is implemented based on CDP.
Many useful tools are implemented based on CDP, such as Chrome developer tools，chrome-remote-interface，Puppeteer Wait.

What is Headless Chrome

You can run Chrome in an interface free environment.
Operate Chrome through the command line or program language.
Without human intervention, the operation is more stable.
When starting Chrome, add the parameter -- headless to start Chrome in headless mode.
What parameters can be added when chrome starts? You can click here see.

In a word, Headless Chrome is the non interface form of Chrome browser. You can run your program using all the features supported by chrome without opening the browser.

What is a puppeter

The puppeter is a node JS tool engine.
Puppeter provides a series of API s to control the behavior of Chromium/Chrome browser through Chrome DevTools Protocol.
By default, puppeter starts Chrome with headless. You can also start Chrome with interface through parameter control.
Puppeter is bound to the latest Chromium version by default. You can also set the binding of different versions yourself.
Puppeter allows us to realize the communication with the browser without knowing too much about the underlying CDP protocol.

What can puppeter do

Official introduction: most of the operations you can manually perform in the browser can be completed with puppeter! Example:

Generate screenshots and PDF s of the page.
Crawl to SPA or SSR website.
Automated form submission, UI testing, keyboard input, etc.
Create the latest automated test environment. Use the latest JavaScript and browser functions to run tests directly in the latest version of Chrome.
Capture the timeline trace of the site to help diagnose performance problems.
Test the Chrome extension.
...

Puppeter API hierarchy

The API hierarchy in puppeter is basically consistent with that of browser. The following describes several commonly used classes:

Browser: corresponding to a browser instance, a browser can contain multiple browsercontexts
BrowserContext: a context session corresponding to the browser. Just like a browser in stealth mode after opening an ordinary Chrome, BrowserContext has an independent session (cookies and cache s are not shared independently). A BrowserContext can contain multiple pages
Page: represents a Tab page through BrowserContext newPage()/browser.newPage(), browser Newpage() will use the default BrowserContext when creating a page. A page can contain multiple frames
Frame: a frame. Each page has a main frame (page. Main frame ()), or multiple sub frames, which are mainly created by iframe tags
ExecutionContext: it is the execution environment of javascript. Each Frame has a default javascript execution environment
ElementHandle: an element node corresponding to DOM. Through this instance, we can click on the element and fill in the form. We can get the corresponding element through selector, xPath, etc
JsHandle: corresponds to the javascript object in the dom. ElementHandle inherits from JsHandle. Since we cannot directly operate the object in the DOM, it is encapsulated into JsHandle to realize relevant functions
CDPSession: it can communicate directly with the native CDP through session The send function sends a message directly through session On receives messages, which can realize functions not involved in the puppeter API
Coverage: get JavaScript and CSS code coverage
Tracing: grab performance data for analysis
Response: the response received by the page
Request: the request sent by the page

Puppeter installation and environment

Note: in V1 Before 18.1, the puppeter needs at least node V6 4.0. From V1 18.1 to v2 Version 1.0 depends on Node 8.9.0 +. From v3 From 0.0, the puppeter starts to rely on Node 10.18.1 +. To use async / await , only node v7 6.0 or later.

The puppeter is a node JS package, so the installation is simple:

npm install puppeteer
// perhaps
yarn add puppeteer

npm may report an error when installing the puppeter! This is caused by the Internet. You can visit foreign websites or install cnpm using Taobao image.

When you install puppeter, it will download the latest version of Chromium. Starting from version 1.7.0, the official release of the puppeteer-core Software package. By default, no browser will be downloaded. It is used to start an existing browser or connect to a remote browser. Note that the installed version of puppeter core is compatible with the browser you intend to connect to.

Use of puppeter

Case1: screenshot

Using puppeter, we can take a screenshot of a page or an element in the page:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  //Set the size of the visual area. The default page size is 800x600 resolution
  await page.setViewport({width: 1920, height: 800});
  await page.goto('https://www.baidu.com/');
  //Screenshot of the whole page
  await page.screenshot({
      path: './files/baidu_home.png',  //Picture saving path
      type: 'png',
      fullPage: true //Screenshot while scrolling
      // clip: {x: 0, y: 0, width: 1920, height: 800}
  });
  //Screenshot of an element of the page
  let element = await page.$('#s_lg_img');
  await element.screenshot({
      path: './files/baidu_logo.png'
  });
  await page.close();
  await browser.close();
})();

How do we get an element in the page?

page.$('#uniqueId'): get the first element corresponding to a selector
page.$$('div '): get all elements corresponding to a selector
page.$x('//img'): get all elements corresponding to an xPath
page.waitForXPath('//img'): wait for an element corresponding to an xPath to appear
page.waitForSelector('#uniqueId'): wait for the element corresponding to a selector to appear

Case2: simulate user operation

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({
        slowMo: 100,    //Slow down
        headless: false, //Turn on Visualization
        defaultViewport: {width: 1440, height: 780},
        ignoreHTTPSErrors: false, //Ignore https error
        args: ['--start-fullscreen'] //Open page in full screen
    });
    const page = await browser.newPage();
    await page.goto('https://www.baidu.com/');
    //Enter text
    const inputElement = await page.$('#kw');
    await inputElement.type('hello word', {delay: 20});
    //Click the search button
    let okButtonElement = await page.$('#su');
    //Wait until the page Jump is completed. Generally, when you click a button to jump, you need to wait for page The jump is successful only when waitfornavigation() is completed
    await Promise.all([
        okButtonElement.click(),
        page.waitForNavigation()  
    ]);
    await page.close();
    await browser.close();
})();

So what functions are provided by ElementHandle to operate elements?

elementHandle.click(): click an element
elementHandle.tap(): simulate finger touch and click
elementHandle.focus(): focus on an element
elementHandle.hover(): hover the mouse over an element
elementHandle.type('hello '): enter text in the input box

Case3: embedding javascript code

The most powerful function of puppeter is that you can execute any javascript code you want to run in the browser. The following code is an example of Baidu home page news recommendation crawling data.

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://www.baidu.com/');
    //Through page Evaluate executes code in the browser
    const resultData = await page.evaluate(async () =>  {
      let data = {};
      const ListEle = [...document.querySelectorAll('#hotsearch-content-wrapper .hotsearch-item')];
      data = ListEle.map((ele) => {
        const urlEle = ele.querySelector('a.c-link');
        const titleEle = ele.querySelector('.title-content-title');
        return {
          href: urlEle.href,
          title: titleEle.innerText,
        };
      });
      return data;
    });
    console.log(resultData)
    await page.close();
    await browser.close();
})();

What functions can execute code in a browser environment?

page. Evaluate (pagefunction [,... Args]): execute the function in the browser environment
page. Evaluatehandle (pagefunction [,... Args]): execute the function in the browser environment and return the JsHandle object
page.$$ Eval (selector, pagefunction [,... Args]): pass all elements corresponding to the selector into the function and execute it in the browser environment
page.$ Eval (selector, pagefunction [,... Args]): pass the first element corresponding to the selector into the function and execute it in the browser environment
page. Evaluate on new Document (pagefunction [,... Args]): when creating a new Document, it will be executed in the browser environment before all scripts on the page are executed
page.exposeFunction(name, puppeteerFunction): register a function on the window object, which is executed in the Node environment, and has the opportunity to call Node. in the browser environment. JS related function library

Case4: request interception

In some scenarios, it is necessary to intercept unnecessary requests to improve performance. We can listen to the request event of Page and intercept the request. The premise is to turn on the request interception Page setRequestInterception(true).

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    const blockTypes = new Set(['image', 'media', 'font']);
    await page.setRequestInterception(true); //Turn on request interception
    page.on('request', request => {
        const type = request.resourceType();
        const shouldBlock = blockTypes.has(type);
        if(shouldBlock){
            //Block requests directly
            return request.abort();
        }else{
            //Rewrite request
            return request.continue({
                //You can override URLs, method s, postData, and headers
                headers: Object.assign({}, request.headers(), {
                    'puppeteer-test': 'true'
                })
            });
        }
    });
    await page.goto('https://www.baidu.com/');
    await page.close();
    await browser.close();
})();

What events are provided on the page?

page.on('close ') page closes
page.on('console') console API called
page. Error on ('error ') page
page.on('load ') page loaded
page.on('request ') received request
page.on('requestfailed ') request failed
page.on('requestfinished ') request succeeded
page.on('response ') received a response
page.on('workercreated ') create webWorker
page.on('workerdestroyed ') destroy webWorker

Case5: get WebSocket response

Currently, puppeter does not provide a native API interface for processing WebSocket s, but we can obtain it through the lower layer Chrome DevTool Protocol (CDP)

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    //Create CDP session
    let cdpSession = await page.target().createCDPSession();
    //Enable Network debugging and listen for Network related events in Chrome DevTools Protocol
    await cdpSession.send('Network.enable');
    //Listen to the webSocketFrameReceived event and get the corresponding data
    cdpSession.on('Network.webSocketFrameReceived', frame => {
        let payloadData = frame.response.payloadData;
        if(payloadData.includes('push:query')){
            //Parse payloadData and get the data pushed by the server
            let res = JSON.parse(payloadData.match(/\{.*\}/)[0]);
            if(res.code !== 200){
                console.log(`call websocket Interface error:code=${res.code},message=${res.message}`);
            }else{
                console.log('Get websocket Interface data:', res.result);
            }
        }
    });
    await page.goto('https://netease.youdata.163.com/dash/142161/reportExport?pid=700209493');
    await page.waitForFunction('window.renderdone', {polling: 20});
    await page.close();
    await browser.close();
})();

Case6: how to grab elements in iframe

A frame contains an Execution Context. We cannot execute functions across frames. There can be multiple frames in a page, mainly generated through iframe tag embedding. Most of the functions on the page are actually page mainFrame(). An abbreviation of XX. Frame is a tree structure. We can use frame Childframes() traverses all frames. If you want to execute the function in other frames, you must obtain the corresponding frame for corresponding processing

The following is an iframe embedded in the login window when logging in to mailbox 188. When we get the iframe and log in with the following code

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({headless: false, slowMo: 50});
    const page = await browser.newPage();
    await page.goto('https://www.188.com');

    for (const frame of page.mainFrame().childFrames()){
        //Find the iframe corresponding to the login page according to the url
        if (frame.url().includes('passport.188.com')){
            await frame.type('.dlemail', 'admin@admin.com');
            await frame.type('.dlpwd', '123456');
            await Promise.all([
                frame.click('#dologin'),
                page.waitForNavigation()
            ]);
            break;
        }
    }
    await page.close();
    await browser.close();
})();

Case7: page performance analysis

Puppeter provides a tool for page Performance analysis. At present, the function is still relatively weak, and we can only obtain the data of one page Performance. How to analyze needs our own analysis according to the data, It is said that a major revision will be made in version 2.0: - a browser can only trace once at a time - in devTools' Performance, you can upload the corresponding json file and view the analysis results - we can write a script to parse trace Automatic analysis of the data in json - through tracing, we can get the page loading speed and the execution Performance of the script

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.tracing.start({path: './files/trace.json'});
    await page.goto('https://www.google.com');
    await page.tracing.stop();
    /*
        continue analysis from 'trace.json'
    */
    browser.close();
})();

Case8: file upload and download

In automated testing, we often encounter the need for file upload and download, so how to implement it in puppeter?

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    //Set download path through CDP session
    const cdp = await page.target().createCDPSession();
    await cdp.send('Page.setDownloadBehavior', {
        behavior: 'allow', //Allow all download requests
        downloadPath: 'path/to/download'  //Set download path
    });
    //Click the button to trigger the download
    await (await page.waitForSelector('#someButton')).click();
    //Wait for the file to appear, and judge whether the file appears by rotation
    await waitForFile('path/to/download/filename');

    //When uploading, the corresponding inputElement must be a < input > element
    let inputElement = await page.waitForXPath('//input[@type="file"]');
    await inputElement.uploadFile('/path/to/file');
    browser.close();
})();

Case9: jump to new tab page processing

When clicking a button to jump to a new Tab Page, a new Page will be opened. At this time, how can we get the Page instance corresponding to the changed Page? The created event on the target Browser can be used to monitor the new Page:

let page = await browser.newPage();
await page.goto(url);
let btn = await page.waitForSelector('#btn');
//Before clicking the button, define a Promise in advance to return the Page object of the new tab
const newPagePromise = new Promise(res => 
  browser.once('targetcreated', 
    target => res(target.page())
  )
);
await btn.click();
//After clicking the button, wait for the new tab object
let newPage = await newPagePromise;

Case10: simulate different devices

Puppeter provides the function of simulating different devices, among which puppeter The configuration information of many devices is defined on the devices object, which mainly includes viewport and userAgent, and then through the function page Simulate to realize the simulation of different devices

const puppeteer = require('puppeteer');
const iPhone = puppeteer.devices['iPhone 6'];
puppeteer.launch().then(async browser => {
  const page = await browser.newPage();
  await page.emulate(iPhone);
  await page.goto('https://www.baidu.com');
  await browser.close();
});

Performance and optimization

About shared memory:

Chrome Default use /dev/shm Shared memory, but docker default/dev/shm Only 64 MB，Obviously, it is not enough. There are two ways to solve it:
- start-up docker Add parameters when --shm-size=1gb To increase /dev/shm Shared memory, but swarm Not currently supported shm-size parameter
- start-up Chrome Add parameter - disable-dev-shm-usage，No use /dev/shm Shared memory

Try to use the same browser instance to realize cache sharing
Intercept unnecessary resources by request
Like opening Chrome by ourselves, too many tab pages are bound to get stuck, so we must effectively control the number of tab pages
When a Chrome instance starts for a long time, it is inevitable that there will be memory leakage, page crash and other phenomena, so it is necessary to restart the Chrome instance regularly
In order to speed up performance, turn off unnecessary configurations, such as: - no sandbox (sandbox function), -- Disable extensions (extender), etc
Try to avoid using page Waiffor (1000), let the program decide by itself, the effect will be better
Because the Websocket used when connecting to the Chrome instance will cause the problem of Websocket sticky session

reference

Added by ryanbutler on Fri, 11 Feb 2022 07:36:58 +0200

Programming VIP