Crawler stage summary

requests module

response.headers: response headers
response.request.headers: request headers
Set-Cookie: the cookie value set by the server
response.cookies: a CookieJar object
# CookieJar to dictionary
dict_cookies = requests.utils.dict_from_cookiejar(response.cookies)
# Dictionary to CookieJar
jar_cookies = requests.utils.cookiejar_from_dict(dict_cookies)
response.json(): parsed JSON data
response.content.decode(): recommended; you decode the raw bytes yourself, independent of the declared encoding, and can pass the correct encoding explicitly
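
A small sketch tying the attributes above together (the URL is just a placeholder):

import requests

response = requests.get('https://example.com')

print(response.headers)           # response headers
print(response.request.headers)   # request headers that requests sent

# CookieJar -> dict and back
dict_cookies = requests.utils.dict_from_cookiejar(response.cookies)
jar_cookies = requests.utils.cookiejar_from_dict(dict_cookies)

# decode the raw bytes yourself; pass the encoding explicitly if needed
html = response.content.decode('utf-8')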

Use cookie parameters to keep the session

  1. Build a cookie dictionary
  2. When making the next request, assign the cookie dictionary to the cookies parameter
  3. requests.get(url, cookies=cookies)
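
A minimal sketch of steps 1-3, assuming a cookie string copied from the browser (the URL and cookie values are placeholders):

import requests

cookie_str = 'sessionid=xxx; token=yyy'   # copied from the browser's request headers
cookies = {kv.split('=')[0]: kv.split('=')[1] for kv in cookie_str.split('; ')}

response = requests.get('https://example.com/profile', cookies=cookies)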

Use of timeout parameter

requests.get(url, timeout=3)  # give up after 3 s

Use of proxies

  1. A proxy parameter points to a proxy server (an IP and port) that forwards the request on your behalf
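
A sketch of the proxies parameter (the proxy address is a placeholder):

import requests

proxies = {
    'http': 'http://12.34.56.78:8888',
    'https': 'http://12.34.56.78:8888',
}
response = requests.get('https://example.com', proxies=proxies, timeout=3)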

verify parameters and CA authentication

SSLError: certificate verify failed
 Certificate verification failed
requests.get(url, verify=False)  # ignore certificate verification

post

Get input from command line

import sys
print(sys.argv)  # sys.argv[0] is the path of the current .py file
# Run from the command line, e.g.: python3 xx.py China
# sys.argv is then ['xx.py', 'China']
# So you can read what you need from the command line
word = sys.argv[1]
king = King(word)  # King is the crawler class defined elsewhere
king.run()

Data source:

  1. Fixed value: compare packet captures; the value never changes
  2. Input value: the value changes with your own input; compare captures against what you typed
  3. Default value (static): needs to be obtained in advance from the static HTML
  4. Default value (dynamic): needs to be obtained by sending a request to the specified address
  5. Generated by client-side JS: analyze the JS and simulate how the data is generated
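
Once the form fields have been worked out from the analysis above, they are sent with requests.post. A sketch with a placeholder URL and made-up field names:

import requests

url = 'https://example.com/translate'
data = {
    'query': 'China',           # input value: changes with what you type
    'from': 'zh', 'to': 'en',   # fixed values found by comparing captures
}
response = requests.post(url, data=data)
print(response.json())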

session

Role: automatically handles cookies

Scenario: multiple consecutive requests
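
A sketch of a Session carrying cookies across consecutive requests (URLs and form fields are placeholders):

import requests

session = requests.Session()
session.post('https://example.com/login', data={'user': 'u', 'pwd': 'p'})  # response cookies are stored
response = session.get('https://example.com/profile')                     # cookies are sent automatically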

Data extraction

Response classification

Structured

  • json data
    • json module
    • re module
    • jsonpath module
  • xml data (low frequency)
    • re module
    • lxml module

Unstructured

  • html
    • re module
    • lxml module

jsonpath

Extracts data directly from deeply nested, complex dictionaries

Syntax:

  • $: root node
  • .: child node
  • ..: descendant node (recursive descent)
jsonpath(data, '$.key1.key2.key3')  # the result is always a list
jsonpath(data, '$..key3')  # find key3 at any depth
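
A sketch of the jsonpath module on a nested dictionary (the data is made up):

from jsonpath import jsonpath

data = {'key1': {'key2': {'key3': 'value'}}}

print(jsonpath(data, '$.key1.key2.key3'))  # ['value']  (always a list)
print(jsonpath(data, '$..key3'))           # ['value']  (key3 at any depth)
print(jsonpath(data, '$.missing'))         # False when nothing matches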

xpath

Node selection syntax

html/head/title  # Absolute path
html//title
//title # relative path
//title/text() # get the text between the opening and closing tags
//link/@href # get the value of the specified attribute from the selected node
/html/body/div[3]/div[1]/div[last()-1]  # Select the penultimate
/html/body/div[3]/div[1]/div[position() > 10]  # Range selection
//div[@id='content-left']/div/@id # relative path; the trailing /@ gets the attribute value; [@attr='value'] filters the node by attribute name and value
//div[span[2]>9.4] # the value of the span[2] child under div is greater than 9.4: movie rating greater than 9.4
//span[i>2000] # filter by child node value: the i tag's value is greater than 2000
//div[contains(@id, 'qiushi_tag')] # contains() modifier
//span[contains(text(), 'next page')]
//*[@id='content-left'] # wildcard: select any node with the given id
//*/@* # select all attribute values
//node() # select all nodes
//td/a/@href # get the link in the a tag
//h2/a|//td/a # combine XPath expressions with |

/: nodes are separated

@: select attributes

Note:

Don't rely on an index when locating the next-page link

etree.HTML(html_str) automatically completes missing tags
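
A sketch of lxml with the XPath syntax above (the HTML snippet is made up and deliberately missing its closing tags):

from lxml import etree

html_str = '<div id="content-left"><a href="/page/2">next page</a>'
html = etree.HTML(html_str)   # missing tags are completed automatically

print(html.xpath('//div[@id="content-left"]/a/@href'))          # ['/page/2']
print(html.xpath('//a[contains(text(), "next page")]/text()'))  # ['next page']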

js parsing

  1. Locate the js file (find the js that generates the encrypted data)
    1. Locate the js file through the request's initiator
    2. Locate the js file through a global keyword search
    3. Find the js file through the event listener bound to the element
  2. Analyze the js code and work out the encryption steps
  3. Simulate the encryption steps and reproduce them in Python code
    1. Use a third-party module to run the JS: js2py, pyv8, execjs, splash (a js2py sketch follows this list)
    2. Pure Python reimplementation
  4. Locate, analyze, reproduce
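
A sketch of step 3.1 with js2py; the JS function stands in for whatever encryption code was located:

import js2py

js_code = """
function sign(t) {                  // pretend this was copied from the site's js file
    return t.split('').reverse().join('');
}
"""
context = js2py.EvalJs()
context.execute(js_code)
print(context.sign('hello'))        # call the JS function from Python -> 'olleh'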

\min_{x} f(x) = x^{2} + 4x - 1, \quad \text{s.t. } (1)\ x + 1 \leq 0, \quad (2)\ -x - 1 \leq 0
Address de-duplication

  • url
  • url-hash
  • Bloom filter
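
A sketch of simple url-hash de-duplication: keep a set of URL fingerprints and skip any URL whose hash has been seen before:

import hashlib

seen = set()

def is_new(url):
    fingerprint = hashlib.md5(url.encode('utf-8')).hexdigest()
    if fingerprint in seen:
        return False
    seen.add(fingerprint)
    return True

print(is_new('https://example.com/1'))  # True
print(is_new('https://example.com/1'))  # False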

Text content de-duplication

  • Edit distance
  • simhash

Scrapy

Complete the crawler process

  1. Modify the starting url
  2. Check and modify allowed domain names
  3. Implement crawling logic in parse method
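
A sketch of those three steps in a spider created with scrapy genspider (name, domain and selectors are placeholders):

import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    allowed_domains = ['example.com']          # 2. check/modify the allowed domains
    start_urls = ['https://example.com/list']  # 1. modify the starting url

    def parse(self, response):                 # 3. crawling logic goes here
        for li in response.xpath('//ul/li'):
            yield {'title': li.xpath('./a/text()').get()}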

Save data

Define the handling of the corresponding data in the `pipelines.py` file

Define a pipeline class

Override the pipeline class's process_item method

The process_item method must return the item to the engine after processing it

Enable the pipeline in the settings file
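
A sketch of such a pipeline class in pipelines.py (the class name and the saved fields are illustrative); it would then be enabled in settings.py, e.g. ITEM_PIPELINES = {'myproject.pipelines.JsonSavePipeline': 300}:

import json

class JsonSavePipeline:
    def open_spider(self, spider):
        self.file = open('items.json', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item   # must be returned so that later pipelines receive the item

    def close_spider(self, spider):
        self.file.close()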

scrapy list  # show the spiders that exist in the project
rm *.json  # delete files ending in .json

CrawlSpider

Inherits from the Spider class

Automatically extracts links according to rules and sends them to the engine

  1. Create a CrawlSpider crawler (a sketch follows this list)
    1. scrapy genspider [-t crawl] name domains  # -t selects the template
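
A sketch of a CrawlSpider produced by scrapy genspider -t crawl (the name, domain and link patterns are placeholders):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BookSpider(CrawlSpider):
    name = 'book'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/list_1.html']

    rules = (
        # follow the list pages, parse the detail pages with parse_item
        Rule(LinkExtractor(allow=r'/list_\d+\.html'), follow=True),
        Rule(LinkExtractor(allow=r'/detail/\d+\.html'), callback='parse_item'),
    )

    def parse_item(self, response):
        yield {'title': response.xpath('//h1/text()').get()}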

Usage of Middleware

  1. Define middleware classes in middlewares.py
  2. Override the process_request or process_response method in the middleware class (a sketch follows below)

process_request

  • None: if every downloader middleware returns None, the request is finally handed to the downloader
  • Request: if a Request object is returned, it is handed back to the scheduler
  • Response: the Response object is handed on to the spider for parsing (the download is skipped)

process_response

  1. Enable the middleware in the settings file (DOWNLOADER_MIDDLEWARES)
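
A sketch of a downloader middleware in middlewares.py that sets a random User-Agent (the UA list is illustrative); it is enabled in settings.py under DOWNLOADER_MIDDLEWARES:

import random

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
]

class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
        return None   # None: the request continues towards the downloader

    def process_response(self, request, response, spider):
        return response   # returning the response passes it on unchanged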

Distributed crawler

Speeds up the project's execution, but the total resources required (hardware & network) stay the same

The instability of a single node does not affect the stability of the whole system

Distributed features:

  • Cooperation between nodes
  • Speeds up the execution of the whole task
  • Higher stability: a single failing node will not affect the whole task

self.key = "%(spider)s:items" % {'spider': 'dmoz'}  # -> "dmoz:items"

Implementation steps of distributed crawler

  1. Implement an ordinary crawler
  2. Modify it into a distributed crawler
    1. Modify the crawler file
      1. Import the distributed crawler class
      2. Change the crawler class's inheritance
      3. Comment out the start urls and the allowed domains
      4. redis_key: get the start urls from the Redis database
      5. __init__: get the allowed domains dynamically
    2. Modify the configuration file
      1. Copy the configuration file and adjust it for the current project
      2. Write the 5 configuration items (a settings sketch follows)
  3. Run it
    1. Run a crawler node: scrapy runspider crawler_file
    2. Start the crawl: lpush redis_key start_url
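
A sketch of the scrapy_redis settings usually added to settings.py (the Redis address is an assumption):

DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"   # shared de-duplication set
SCHEDULER = "scrapy_redis.scheduler.Scheduler"               # shared task queue
SCHEDULER_PERSIST = True                                     # keep the queue between runs
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400,             # items are stored in Redis
}
REDIS_URL = "redis://127.0.0.1:6379"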

ifconfig: obtain IP information
ps aux | grep redis: check whether the Redis process is running
'%(spider)s:items' % {'spider': 'dmoz'}  # placeholder formatting

Distributed crawler writing process

  1. Write ordinary crawler

    1. Create project
    2. Clear objectives
    3. Create crawler
    4. Save content
  2. Convert it to a distributed crawler

    1. Modify the crawler

      1. Import the distributed crawler class from scrapy_redis
      2. Change the class the crawler inherits from
      3. Comment out start_urls & allowed_domains
      4. Set redis_key to supply the start_urls
      5. Set __init__ to get the allowed domains dynamically (see the spider sketch after this list)
    2. Modify the configuration file

      Copy the configuration parameters
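
A sketch of a converted spider (the spider name, redis_key and domain argument are placeholders):

from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    name = 'myspider'
    # start_urls and allowed_domains are commented out
    redis_key = 'myspider:start_urls'   # start urls are read from this Redis list

    def __init__(self, *args, **kwargs):
        # get the allowed domains dynamically, e.g. from a spider argument
        domain = kwargs.pop('domain', '')
        self.allowed_domains = list(filter(None, domain.split(',')))
        super().__init__(*args, **kwargs)

    def parse(self, response):
        yield {'url': response.url, 'title': response.xpath('//title/text()').get()}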

Distributed crawler summary

Usage scenario

  • The amount of data is very large
  • The data is needed within a limited time

Implementation of distributed

  • scrapy_redis implements distributed

  • Ordinary crawlers can be made distributed by sharing the de-duplication set and the task queue

Distributed steps

  • Poor – several ordinary laptops
  • Well-off – one server virtualized into several machines
  • Tycoon – data acquisition servers (15), management (3-4), storage (10)

Crawler case

Scraping JD.com by hand

Key JS code:

   , H = (r("pFHu"),
        Object(u.b)()((function(e) {
            var t = e.sonList;
            return c.a.createElement(c.a.Fragment, null, c.a.createElement("dt", null, c.a.createElement("a", {
                href: "//channel.jd.com/".concat(t.fatherCategoryId, "-").concat(t.categoryId, ".html")
            }, t.categoryName), c.a.createElement("b", null)), c.a.createElement("dd", null, t && t.sonList.map((function(e, r) {
                return c.a.createElement("em", {
                    key: r
                }, c.a.createElement("a", {
                    href: "//list.jd.com/".concat(t.fatherCategoryId, "-").concat(e.fatherCategoryId, "-").concat(e.categoryId, ".html")
                }, e.categoryName))
            }
            ))))
        }
        )))
scrapy genspider book jd.com  # create the spider, then write the crawling logic

Scrapy_splash

A component of scrapy

Function:

  • scrapy_splash can simulate a browser to load the JS and return the fully rendered page source

use:

  • install
    • splash
        1. Install docker
        2. Download splash image
    • python module (pip install scrapy-splash)
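
A sketch of using scrapy_splash inside a spider, assuming the Splash service from the docker image is running and SPLASH_URL plus the scrapy_splash middlewares are configured in settings.py:

import scrapy
from scrapy_splash import SplashRequest

class JsPageSpider(scrapy.Spider):
    name = 'js_page'

    def start_requests(self):
        # render the page with Splash before it reaches parse()
        yield SplashRequest('https://example.com', callback=self.parse, args={'wait': 2})

    def parse(self, response):
        yield {'title': response.xpath('//title/text()').get()}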

