requests module
response.headers: response headers
response.request.headers: request headers
Set-Cookie: the cookie value set by the server, found in the response headers
response.cookies: a CookieJar object
# CookieJar to dictionary
dict_cookies = requests.utils.dict_from_cookiejar(response.cookies)
# Dictionary to CookieJar
jar_cookies = requests.utils.cookiejar_from_dict(dict_cookies)
response.json(): the response body parsed as JSON
response.content.decode(): recommended; it decodes the raw bytes directly, so you can supply the correct encoding regardless of what the server declares
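A minimal sketch of the CookieJar/dict round trip described above; the URL is only a placeholder for illustration.

```python
import requests

# placeholder URL that sets a cookie
response = requests.get("https://httpbin.org/cookies/set?theme=dark")

# CookieJar -> dict, e.g. to inspect or persist the cookies
dict_cookies = requests.utils.dict_from_cookiejar(response.cookies)

# dict -> CookieJar, e.g. to reuse the cookies on a later request
jar_cookies = requests.utils.cookiejar_from_dict(dict_cookies)

print(dict_cookies)
```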
Use cookie parameters to keep the session
- Build the cookie dictionary
- When sending the next request, pass the cookie dictionary to the cookies parameter
- requests.get(url, cookies=cookie_dict)
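A short sketch of carrying cookies across requests; the URL and cookie names are hypothetical values copied from a browser session.

```python
import requests

# hypothetical cookie values taken from the browser after logging in
cookie_dict = {"sessionid": "abc123", "csrftoken": "xyz789"}

# the cookies parameter attaches the cookie dictionary to this request
response = requests.get("https://example.com/profile", cookies=cookie_dict)
print(response.status_code)
```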
Use of timeout parameter
requests.get(url, timeout=3)  # raise an exception if no response arrives within 3 seconds
Use of agents
- The proxies parameter points to a proxy server (IP:port) that forwards requests on your behalf
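A minimal proxies sketch; the proxy address is a placeholder and would be replaced with a working proxy server.

```python
import requests

# hypothetical proxy address
proxies = {
    "http": "http://12.34.56.78:8888",
    "https": "http://12.34.56.78:8888",
}

response = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=3)
print(response.text)
```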
verify parameters and CA authentication
SSLError: certificate verify failed means the CA certificate check failed
requests.get(url, verify=False)  # skip certificate verification
post
Get input from command line
import sys
print(sys.argv)  # sys.argv[0] is the path of the current .py file
# On the command line: python3 xxx.py China
# sys.argv is then ['xxx.py', 'China']
# so the extra arguments can be read from the command line
word = sys.argv[1]
king = King(word)
king.run()
Sources of the form data (see the sketch after this list):
- Fixed value: capture several requests and compare; the value never changes
- Input value: compare against what you typed; it changes with your input
- Default value (static): must be obtained in advance from the static HTML page
- Default value (dynamic): obtained by sending a request to a specified address
- Generated by client-side JS: analyze the JS and reproduce the generated data
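A minimal sketch of sending the assembled form data with requests.post; the URL and field names are placeholders standing in for the sources listed above.

```python
import requests

url = "https://httpbin.org/post"  # placeholder endpoint

# form fields assembled from the sources above (fixed, input, default, JS-generated)
data = {
    "query": "python",    # input value
    "from": "en",         # fixed value
    "sign": "0a1b2c3d",   # value normally generated by client-side JS
}

response = requests.post(url, data=data)
print(response.json())
```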
session
Role: automatically handles cookies between requests
Scenario: multiple consecutive requests
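A Session sketch for the multi-request scenario; the login URL and form fields are hypothetical.

```python
import requests

session = requests.Session()

# hypothetical login form; the session stores the Set-Cookie values automatically
session.post("https://example.com/login", data={"user": "u", "password": "p"})

# the stored cookies are sent automatically on every later request
response = session.get("https://example.com/profile")
print(response.status_code)
```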
Data extraction
Response classification
Structured
- json data
- json module
- re module
- jsonpath module
- xml data (low frequency)
- re module
- lxml module
Unstructured
- html
- re module
- lxml module
jsonpath
Extracting data directly from multi-layer nested complex dictionary
Syntax:
- $: root node
- . : child node
- .. : descendant node (matches at any depth)
jsonpath(data, '$.key1.key2.key3')  # the result is a list (or False if nothing matches)
jsonpath(data, '$..key3')           # descendant syntax: find key3 at any depth
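A runnable sketch of both queries against a small nested dict, assuming the jsonpath package is installed.

```python
from jsonpath import jsonpath

data = {"key1": {"key2": {"key3": "value"}}}

print(jsonpath(data, '$.key1.key2.key3'))  # ['value']
print(jsonpath(data, '$..key3'))           # ['value'] -- found at any depth
```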
xpath
Node selection syntax
html/head/title  # absolute path
//title  # relative path: title anywhere in the document
//title/text()  # get the text content between the opening and closing tags
//link/@href  # get the value of the specified attribute of the selected nodes
/html/body/div[3]/div[1]/div[last()-1]  # select the second-to-last div
/html/body/div[3]/div[1]/div[position()>10]  # range selection
//div[@id='content-left']/div/@id  # [@attr='value'] filters nodes by attribute name and value; the trailing /@id returns the attribute value
//div[span[2]>9.4]  # filter by the value of a child node: the second span under div is greater than 9.4 (movie score > 9.4)
//span[i>2000]  # filter by a child node's value: the i tag's value is greater than 2000
//div[contains(@id,'qiushi_tag')]  # contains() predicate
//span[contains(text(),'next page')]
//*[@id='content-left']  # wildcard: any tag, filtered by its id
//*/@*  # select all attribute values
//node()  # select all nodes
//td/a/@href  # get the link from an a tag
//h2/a|//td/a  # XPath union (combine two selections)
/: nodes are separated
@: select attributes
Notes:
Don't rely on an index when locating the "next page" link
etree.HTML(html_str) automatically completes missing tags
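A minimal lxml + XPath sketch tying the syntax above together; the HTML snippet is made up for illustration.

```python
from lxml import etree

html_str = """
<div id="content-left">
  <a href="/page/1">page 1</a>
  <a href="/page/2">next page</a>
</div>
"""

# etree.HTML completes the missing html/body tags automatically
element = etree.HTML(html_str)

hrefs = element.xpath("//div[@id='content-left']/a/@href")
texts = element.xpath("//a[contains(text(),'next page')]/text()")
print(hrefs, texts)
```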
js parsing
- Locate the js file (the one that generates the encrypted data)
- Locate the js file via the Initiator of the request
- Locate the js file via a global search for a keyword
- Locate the js file via the event listener bound to the element
- Analyze the js code and work out the encryption steps
- Simulate the encryption steps and reproduce them with Python code (see the sketch below)
- Load the JS with a third-party module: js2py, PyV8, execjs, splash
- Or reimplement the steps in pure Python
- Summary: locate, analyze, reproduce
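A hedged sketch of the js2py route: the JS function below is a stand-in for whatever sign/encryption function a site actually uses.

```python
import js2py

# stand-in for the encryption JS extracted from the site
js_code = """
function sign(word) {
    return word.split("").reverse().join("") + "_signed";
}
"""

context = js2py.EvalJs()
context.execute(js_code)

# call the JS function from Python and use the result as a request parameter
print(context.sign("hello"))
```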
\min_{x} f(x) = x^{2} + 4x - 1 \quad \text{s.t.} \quad (1)\; x + 1 \leq 0 \qquad (2)\; -x - 1 \leq 0
Address (URL) de-duplication (see the sketch after this list)
- url
- url-hash
- Bloom filter
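A minimal sketch of url-hash de-duplication with a plain set; a Bloom filter would replace the set when memory matters.

```python
import hashlib

seen = set()

def is_new_url(url):
    # hash the url so the set stores fixed-size digests instead of full strings
    fingerprint = hashlib.md5(url.encode("utf-8")).hexdigest()
    if fingerprint in seen:
        return False
    seen.add(fingerprint)
    return True

print(is_new_url("https://example.com/a"))  # True
print(is_new_url("https://example.com/a"))  # False
```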
Text content de-duplication (see the sketch after this list)
- Edit distance
- simhash
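A small edit-distance sketch (classic dynamic programming) to illustrate text de-duplication; simhash would be used instead for large-scale comparison.

```python
def edit_distance(a, b):
    # dp[i][j] = edits needed to turn a[:i] into b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete
                           dp[i][j - 1] + 1,        # insert
                           dp[i - 1][j - 1] + cost) # replace
    return dp[len(a)][len(b)]

print(edit_distance("crawler", "crawlers"))  # 1 -> near-duplicate
```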
Scrapy
Complete the crawler process
- Modify the start url
- Check and modify the allowed domains
- Implement the crawling logic in the parse method (see the sketch below)
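A minimal spider sketch; the spider name, domain and XPath selectors are placeholders.

```python
import scrapy

class BookSpider(scrapy.Spider):
    name = "book"
    allowed_domains = ["example.com"]           # check and modify the allowed domains
    start_urls = ["https://example.com/books"]  # modify the start url

    def parse(self, response):
        # crawling logic: extract fields and yield them to the engine
        for row in response.xpath("//div[@class='book']"):
            yield {
                "title": row.xpath("./h2/a/text()").get(),
                "link": row.xpath("./h2/a/@href").get(),
            }
```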
Save data
Define how the data is handled in the `pipelines.py` file
Define a pipeline class
Override the process_item method of the pipeline class
The process_item method must return the item to the engine after processing it
Enable the pipeline in the settings file (see the sketch below)
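A sketch of a JSON-lines pipeline under an assumed project name; only the process_item contract (return the item) comes from the notes above.

```python
# pipelines.py
import json

class JsonWriterPipeline:
    def open_spider(self, spider):
        self.file = open("items.json", "a", encoding="utf-8")

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # must return the item so later pipelines / the engine receive it

    def close_spider(self, spider):
        self.file.close()

# settings.py -- enable the pipeline (lower number = runs earlier)
# ITEM_PIPELINES = {"myproject.pipelines.JsonWriterPipeline": 300}
```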
scrapy list  # show the list of existing spiders
rm *.json  # delete the files ending in .json
CrawlSpider
Inherits from the Spider class
Automatically extracts links according to rules and sends them to the engine
- Create a CrawlSpider spider (see the sketch below)
- scrapy genspider -t crawl name domain  # -t selects the template
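A CrawlSpider sketch with hypothetical URL patterns, showing how Rule + LinkExtractor replace hand-written link extraction.

```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class JobSpider(CrawlSpider):
    name = "job"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/jobs"]

    rules = (
        # follow pagination links without parsing them
        Rule(LinkExtractor(allow=r"/jobs\?page=\d+"), follow=True),
        # send detail pages to parse_item
        Rule(LinkExtractor(allow=r"/jobs/\d+\.html"), callback="parse_item"),
    )

    def parse_item(self, response):
        yield {"title": response.xpath("//h1/text()").get()}
```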
Usage of middleware
- Define the middleware classes in middlewares.py
- Override the process_request or process_response method in the middleware class
process_request return values
- None: if every downloader middleware returns None, the request is finally handed to the downloader
- Request: the returned request is handed back to the scheduler
- Response: the response object is passed on to the spider for parsing
process_response
- Enable the middleware in the settings file (DOWNLOADER_MIDDLEWARES), see the sketch below
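A downloader middleware sketch that swaps in a random User-Agent; the UA list and project path are placeholders.

```python
# middlewares.py
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None  # None lets the request continue towards the downloader

# settings.py -- enable the middleware
# DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.RandomUserAgentMiddleware": 543}
```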
Distributed crawler
Speeds up the running of the project, though the resources required (hardware & network) remain the same as before
The instability of a single node must not affect the stability of the whole system
Distributed features:
cooperation
Speed up the execution of the whole task
Higher stability, and a single node will not affect the whole task
self.key = "dmoz:items" % {'spider': 'dmoz'}
Implementation steps of distributed crawler
1. Implement an ordinary crawler
2. Modify it into a distributed crawler
   1. Modify the crawler file
      1. Import the distributed crawler class
      2. Change the class the crawler inherits from
      3. Comment out the start urls and the allowed domains
      4. Set redis_key so the start urls are taken from the Redis database
      5. Get the allowed domains dynamically in __init__
   2. Modify the configuration file
      1. Copy the configuration and adapt it to the current project
      2. Write the 5 scrapy_redis configuration items (see the sketch below)
3. Run it
   - Run a crawler node: scrapy runspider <crawler file>
   - Start crawling: lpush <redis_key> <start_url>
ifconfig  # obtain IP information
ps aux | grep redis  # check whether the Redis server is running
'%(spider)s:items' % {'spider': 'dmoz'}  # placeholder; evaluates to 'dmoz:items'
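A sketch of the 5 scrapy_redis settings commonly used, plus a RedisSpider, under assumed project and spider names; the Redis URL is a placeholder.

```python
# settings.py -- the 5 scrapy_redis configuration items
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"   # shared de-duplication set
SCHEDULER = "scrapy_redis.scheduler.Scheduler"               # shared task queue
SCHEDULER_PERSIST = True                                     # keep the queue between runs
ITEM_PIPELINES = {"scrapy_redis.pipelines.RedisPipeline": 400}
REDIS_URL = "redis://127.0.0.1:6379"

# spiders/book.py
from scrapy_redis.spiders import RedisSpider

class BookSpider(RedisSpider):
    name = "book"
    redis_key = "book:start_urls"   # lpush book:start_urls <url> to start crawling

    def __init__(self, *args, **kwargs):
        # get the allowed domains dynamically instead of hard-coding them
        domain = kwargs.pop("domain", "")
        self.allowed_domains = list(filter(None, domain.split(",")))
        super().__init__(*args, **kwargs)

    def parse(self, response):
        yield {"url": response.url, "title": response.xpath("//title/text()").get()}
```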
Distributed crawler writing process
- Write an ordinary crawler
  - Create the project
  - Define the target
  - Create the crawler
  - Save the content
- Convert it into a distributed crawler
  - Modify the crawler
    - Import the distributed crawler class from scrapy_redis
    - Change the inherited class
    - Comment out start_urls & allowed_domains
    - Set redis_key to get the start_urls
    - Get the allowed domains in __init__
  - Modify the configuration file
    - Copy the configuration parameters
Distributed crawler summary
Usage scenario
- The amount of data is very large
- The data is required within a tight time limit
Ways to implement distribution
- scrapy_redis implements the distribution
- An ordinary crawler can be made distributed by sharing the de-duplication set and the task queue
Distributed setups
- Poor: several ordinary laptops
- Well-off: one server virtualized into several machines
- Tycoon: data-acquisition servers (15), management servers (3-4), storage servers (10)
Crawler case study
Scraping JD (jd.com) by hand
Key JS code:
, H = (r("pFHu"), Object(u.b)()((function(e) { var t = e.sonList; return c.a.createElement(c.a.Fragment, null, c.a.createElement("dt", null, c.a.createElement("a", { href: "//channel.jd.com/".concat(t.fatherCategoryId, "-").concat(t.categoryId, ".html") }, t.categoryName), c.a.createElement("b", null)), c.a.createElement("dd", null, t && t.sonList.map((function(e, r) { return c.a.createElement("em", { key: r }, c.a.createElement("a", { href: "//list.jd.com/".concat(t.fatherCategoryId, "-").concat(e.fatherCategoryId, "-").concat(e.categoryId, ".html") }, e.categoryName)) } )))) } )))
scrapy genspider book jd.com  # create the crawler, then write the crawling logic
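Based on the URL templates visible in the JS above, a hedged sketch of rebuilding the category links in Python; the field names follow the snippet, but the sample values are made up.

```python
# sample category data shaped like the JS above (values are made up)
category = {
    "fatherCategoryId": 1713,
    "categoryId": 3258,
    "categoryName": "Books",
    "sonList": [
        {"fatherCategoryId": 3258, "categoryId": 3297, "categoryName": "Novels"},
    ],
}

# dt link: //channel.jd.com/{fatherCategoryId}-{categoryId}.html
channel_url = "https://channel.jd.com/{}-{}.html".format(
    category["fatherCategoryId"], category["categoryId"])

# dd links: //list.jd.com/{t.fatherCategoryId}-{son.fatherCategoryId}-{son.categoryId}.html
list_urls = [
    "https://list.jd.com/{}-{}-{}.html".format(
        category["fatherCategoryId"], son["fatherCategoryId"], son["categoryId"])
    for son in category["sonList"]
]

print(channel_url, list_urls)
```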
Scrapy_splash
A component of scrapy
Function:
- scrapy_splash simulates a browser to load the JS and returns the fully rendered page source
use:
- Install
  - splash service
    - Install docker
    - Pull the splash docker image
  - python module
    - pip install scrapy-splash (see the sketch below)
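A hedged usage sketch, assuming the splash service is already running in docker (e.g. docker run -p 8050:8050 scrapinghub/splash); the URL and spider name are placeholders.

```python
import scrapy
from scrapy_splash import SplashRequest

class JsPageSpider(scrapy.Spider):
    name = "js_page"

    # settings.py additionally needs SPLASH_URL = "http://127.0.0.1:8050"
    # plus the scrapy_splash downloader middlewares and dupefilter

    def start_requests(self):
        # SplashRequest renders the page in splash before it reaches parse
        yield SplashRequest("https://example.com", callback=self.parse, args={"wait": 3})

    def parse(self, response):
        yield {"title": response.xpath("//title/text()").get()}
```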