[automation] [PyChromeDevTools in practice] 01 - Crawling all files of a web page that Chrome has opened

Reading guide

I have always been curious about the Sources tab of Chrome DevTools, where every file of a page can be viewed. How can we programmatically obtain the information and content of all those files?

Option 1: export a HAR file through the Network tab of Chrome DevTools. The operation is cumbersome and the range of data types it handles is limited.
Option 2: obtain the files through ChromeCacheView.exe. This tool does work, but it has the following drawbacks:

  • It does not work for localhost.
  • It cannot be driven programmatically.

Other options: packet-capture tools such as Fiddler, Selenium (too heavyweight), and so on.

CDP option: recent research revealed that the Chrome DevTools console itself obtains this file data through the Chrome DevTools Protocol (CDP). We can therefore fetch the same information through PyChromeDevTools and save it in a suitable format.
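As a minimal sketch of the whole idea (assuming Chrome is already listening on the debugging port configured in the preparation step below):

# pip install PyChromeDevTools
import PyChromeDevTools

# attach to the Chrome started with --remote-debugging-port=9992;
# ChromeInterface connects to the first open tab by default
chrome = PyChromeDevTools.ChromeInterface(port=9992)
chrome.Page.enable()                              # required before Page.* commands (explained below)
result, messages = chrome.Page.getResourceTree()  # the same file list the Sources tab displays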

development environment

  • Operating system: Win10-1607
  • Google Chrome: 96.0.4664.110 (Official Build) (64-bit) (cohort: 97_Win_99)
  • Python (venv): Python 3.8.6 (virtualenv)
  • PyChromeDevTools: 0.4

Core principles

preparation

  • Press Win+R to open the Run dialog
  • Execute the command: "C:\Program Files\Google\Chrome\Application\chrome.exe" "https://www.baidu.com" --remote-debugging-port=9992 --headless

    For the underlying principle, see the article PyChromeDevTools source code analysis.

Open http://localhost:9992/ in a browser and click the Baidu page entry to view the content of the headless browser.
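The list of open pages can also be queried from code; a small sketch using only the standard library (/json is Chrome's remote-debugging HTTP endpoint):

import json
import urllib.request

# list the inspectable targets (tabs) exposed by the headless Chrome started above
with urllib.request.urlopen('http://localhost:9992/json') as resp:
    targets = json.loads(resp.read())

for target in targets:
    print(target.get('title'), target.get('url'))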

chrome.Page.getResourceTree()

This command fetches the resource tree of the Page the client is currently connected to (only the Baidu page is open in this example, and PyChromeDevTools.ChromeInterface(port=9992) connects to that Page by default). The structure of the result is as follows:
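An abridged, illustrative view of the result (field names follow the CDP Page.getResourceTree documentation; the id and urls are placeholders, not real output):

{
    "result": {
        "frameTree": {
            "frame": {
                "id": "E6BBC1...",
                "url": "https://www.baidu.com/",
                "mimeType": "text/html"
            },
            "resources": [
                {"url": "https://www.baidu.com/img/...", "type": "Image", "mimeType": "image/png"},
                {"url": "https://www.baidu.com/css/...", "type": "Stylesheet", "mimeType": "text/css"},
                {"url": "https://www.baidu.com/js/...", "type": "Script", "mimeType": "application/javascript"}
            ]
        }
    },
    "id": 1
}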

chrome.Page.getResourceContent()

After parsing the getResourceTree result we have the metadata of every resource; the next step is to fetch the content of each resource.

The Page.getResourceContent command takes two parameters:

  • frameId: the frame id from the frameTree structure above
  • url: the url of each item in the resources list above

An example of the protocol response is shown below; whether the content needs base64 decoding is determined by the base64Encoded flag:
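A sketch of the call together with an abridged, illustrative response (frame_id and url come from the parsed resource tree):

result_, messages_ = chrome.Page.getResourceContent(frameId=frame_id, url=url)

# abridged, illustrative response; binary resources such as images arrive base64-encoded:
# {
#     "result": {
#         "content": "iVBORw0KGgoAAAANSUhEUg...",
#         "base64Encoded": true
#     },
#     "id": 5
# }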

chrome.Page.enable()

While writing the code, calling the Page.getResourceContent instruction at first returned the error {"error": {"code": -32000, "message": "Agent is not enabled."}, "id": 1}. The term "Agent" here is a bit misleading: the protocol documentation says a domain must be enabled before it can be debugged. In other words, chrome.Page.enable() must be executed before calling Page.getResourceContent.
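In code, the fix is simply to enable the domain right after connecting; a minimal sketch:

chrome = PyChromeDevTools.ChromeInterface(port=9992)

# without the next line, Page.getResourceContent fails with:
# {"error": {"code": -32000, "message": "Agent is not enabled."}, "id": 1}
chrome.Page.enable()

result, messages = chrome.Page.getResourceTree()  # Page.* commands now succeed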

Source code

import base64
import os
import urllib.parse

import PyChromeDevTools


def ez_get_object(j, list_keys):
    """
    Quickly access nested child elements of a JSON-decoded object.

    j = {
        'a': {
            'b': {
                'b1': {'name': 'xiao'},
                'b2': {'name': '2b'}
            }
        }
    }

    >>> ez_get_object(j, 'a,b,b2')
    {'name': '2b'}

    j: a JSON-decoded object (nested dicts)
    list_keys: comma-separated string of keys
    """
    ret = j
    for k in list_keys.split(','):
        ret = ret.get(k.strip())

    return ret


def download_all_web_resources_unit_content(url, content):
    """Save one resource's bytes under a local path derived from its URL."""
    url_parsed = urllib.parse.urlparse(url)
    full_path = r'G:/_TMP/_web_resources/{}/'.format(url_parsed.hostname) + url_parsed.path

    # For those that have been downloaded, return directly
    if os.path.isfile(full_path):
        return

    dirname = os.path.dirname(full_path)
    # basename = os.path.basename(full_path)
    print('\t\tfull_path', full_path)

    if not os.path.isdir(dirname):
        os.makedirs(dirname)

    with open(full_path, 'wb') as wf:
        return wf.write(content)


def download_all_web_resources_unit(url, _type, unit):
    """Decode one resource's content (base64 if needed) and save it."""
    if _type in ['Script', 'Stylesheet', 'Image']:
        if unit.get('base64Encoded'):
            content = base64.b64decode(unit.get('content'))
        else:
            content = unit.get('content').encode('utf-8')

        download_all_web_resources_unit_content(url, content)
    else:
        print('\t[Warning] unhandled resource type', _type)


def download_all_web_resources():
    """Walk the current page's resource tree and save every supported resource."""
    chrome = PyChromeDevTools.ChromeInterface(port=9992)

    chrome.Page.enable()
    result, messages = chrome.Page.getResourceTree()
    frame_id = ez_get_object(result, 'result,frameTree,frame,id')
    resources = ez_get_object(result, 'result,frameTree,resources')

    for resource in resources:
        url = resource.get('url')
        _type = resource.get('type')
        result_, messages_ = chrome.Page.getResourceContent(frameId=frame_id, url=url)
        print('chrome.Page.getResourceContent', _type, url, result_)
        if result_:
            download_all_web_resources_unit(url, _type, result_.get('result'))
        else:
            print('[Warning] chrome.Page.getResourceContent result_ is False')
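For completeness, a hypothetical entry point (it assumes the headless Chrome from the preparation step is still listening on port 9992; the output directory G:/_TMP/_web_resources/ is hard-coded above):

if __name__ == '__main__':
    download_all_web_resources()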


PS: the content of this article is for technical exchange only and must not be used for illegal purposes.

Keywords: crawler, Chrome
