requests module
response.headers: response headers
response.request.headers: request headers
Set-Cookie: the cookie value set by the server, found in the response headers
response.cookies: a CookieJar object
# CookieJar to dictionary
dict_cookies = requests.utils.dict_from_cookiejar(response.cookies)
# Dictionary to CookieJar
jar_cookies = requests.utils.cookiejar_from_dict(dict_cookies)
response.json(): the response body parsed as JSON
response.content.decode(): recommended; it decodes the raw bytes directly, so you can supply the correct encoding regardless of what the server declares
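A minimal sketch of the CookieJar/dict round trip described above; the URL is only a placeholder for illustration.

```python
import requests

# placeholder URL that sets a cookie
response = requests.get("https://httpbin.org/cookies/set?theme=dark")

# CookieJar -> dict, e.g. to inspect or persist the cookies
dict_cookies = requests.utils.dict_from_cookiejar(response.cookies)

# dict -> CookieJar, e.g. to reuse the cookies on a later request
jar_cookies = requests.utils.cookiejar_from_dict(dict_cookies)

print(dict_cookies)
```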
Use cookie parameters to keep the session
- Build the cookie dictionary
- When sending the next request, pass the cookie dictionary to the cookies parameter
- requests.get(url, cookies=cookie_dict)
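A short sketch of carrying cookies across requests; the URL and cookie names are hypothetical values copied from a browser session.

```python
import requests

# hypothetical cookie values taken from the browser after logging in
cookie_dict = {"sessionid": "abc123", "csrftoken": "xyz789"}

# the cookies parameter attaches the cookie dictionary to this request
response = requests.get("https://example.com/profile", cookies=cookie_dict)
print(response.status_code)
```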
Use of timeout parameter
requests.get(url, timeout=3)  # raise an exception if no response arrives within 3 seconds
Use of agents
- The proxies parameter points to a proxy server (IP:port) that forwards requests on your behalf
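A minimal proxies sketch; the proxy address is a placeholder and would be replaced with a working proxy server.

```python
import requests

# hypothetical proxy address
proxies = {
    "http": "http://12.34.56.78:8888",
    "https": "http://12.34.56.78:8888",
}

response = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=3)
print(response.text)
```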
verify parameters and CA authentication
SSLError: certificate verify failed means the CA certificate check failed
requests.get(url, verify=False)  # skip certificate verification
post
Get input from command line
import sys
print(sys.argv)  # sys.argv[0] is the path of the current .py file
# On the command line: python3 xxx.py China
# sys.argv is then ['xxx.py', 'China']
# so the extra arguments can be read from the command line
word = sys.argv[1]
king = King(word)
king.run()
Sources of the form data (see the sketch after this list):
- Fixed value: capture several requests and compare; the value never changes
- Input value: compare against what you typed; it changes with your input
- Default value (static): must be obtained in advance from the static HTML page
- Default value (dynamic): obtained by sending a request to a specified address
- Generated by client-side JS: analyze the JS and reproduce the generated data
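A minimal sketch of sending the assembled form data with requests.post; the URL and field names are placeholders standing in for the sources listed above.

```python
import requests

url = "https://httpbin.org/post"  # placeholder endpoint

# form fields assembled from the sources above (fixed, input, default, JS-generated)
data = {
    "query": "python",    # input value
    "from": "en",         # fixed value
    "sign": "0a1b2c3d",   # value normally generated by client-side JS
}

response = requests.post(url, data=data)
print(response.json())
```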
session
Role: automatically handles cookies between requests
Scenario: multiple consecutive requests
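A Session sketch for the multi-request scenario; the login URL and form fields are hypothetical.

```python
import requests

session = requests.Session()

# hypothetical login form; the session stores the Set-Cookie values automatically
session.post("https://example.com/login", data={"user": "u", "password": "p"})

# the stored cookies are sent automatically on every later request
response = session.get("https://example.com/profile")
print(response.status_code)
```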
Data extraction
Response classification
Structured
- json data
- json module
- re module
- jsonpath module
- xml data (low frequency)
- re module
- lxml module
Unstructured
- html
- re module
- lxml module
jsonpath
Extracting data directly from multi-layer nested complex dictionary
Syntax:
- $: root node
- . : child node
- .. : descendant node (matches at any depth)
jsonpath(data, '$.key1.key2.key3')  # the result is a list (or False if nothing matches)
jsonpath(data, '$..key3')           # descendant syntax: find key3 at any depth
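A runnable sketch of both queries against a small nested dict, assuming the jsonpath package is installed.

```python
from jsonpath import jsonpath

data = {"key1": {"key2": {"key3": "value"}}}

print(jsonpath(data, '$.key1.key2.key3'))  # ['value']
print(jsonpath(data, '$..key3'))           # ['value'] -- found at any depth
```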
xpath
Node selection syntax
html/head/title  # absolute path
//title  # relative path: title anywhere in the document
//title/text()  # get the text content between the opening and closing tags
//link/@href  # get the value of the specified attribute of the selected nodes
/html/body/div[3]/div[1]/div[last()-1]  # select the second-to-last div
/html/body/div[3]/div[1]/div[position()>10]  # range selection
//div[@id='content-left']/div/@id  # [@attr='value'] filters nodes by attribute name and value; the trailing /@id returns the attribute value
//div[span[2]>9.4]  # filter by the value of a child node: the second span under div is greater than 9.4 (movie score > 9.4)
//span[i>2000]  # filter by a child node's value: the i tag's value is greater than 2000
//div[contains(@id,'qiushi_tag')]  # contains() predicate
//span[contains(text(),'next page')]
//*[@id='content-left']  # wildcard: any tag, filtered by its id
//*/@*  # select all attribute values
//node()  # select all nodes
//td/a/@href  # get the link from an a tag
//h2/a|//td/a  # XPath union (combine two selections)
/: nodes are separated
@: select attributes
Notes:
Don't rely on an index when locating the "next page" link
etree.HTML(html_str) automatically completes missing tags
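A minimal lxml + XPath sketch tying the syntax above together; the HTML snippet is made up for illustration.

```python
from lxml import etree

html_str = """
<div id="content-left">
  <a href="/page/1">page 1</a>
  <a href="/page/2">next page</a>
</div>
"""

# etree.HTML completes the missing html/body tags automatically
element = etree.HTML(html_str)

hrefs = element.xpath("//div[@id='content-left']/a/@href")
texts = element.xpath("//a[contains(text(),'next page')]/text()")
print(hrefs, texts)
```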
js parsing
- Locate the js file (the one that generates the encrypted data)
- Locate the js file via the Initiator of the request
- Locate the js file via a global search for a keyword
- Locate the js file via the event listener bound to the element
- Analyze the js code and work out the encryption steps
- Simulate the encryption steps and reproduce them with Python code (see the sketch below)
- Load the JS with a third-party module: js2py, PyV8, execjs, splash
- Or reimplement the steps in pure Python
- Summary: locate, analyze, reproduce
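A hedged sketch of the js2py route: the JS function below is a stand-in for whatever sign/encryption function a site actually uses.

```python
import js2py

# stand-in for the encryption JS extracted from the site
js_code = """
function sign(word) {
    return word.split("").reverse().join("") + "_signed";
}
"""

context = js2py.EvalJs()
context.execute(js_code)

# call the JS function from Python and use the result as a request parameter
print(context.sign("hello"))
```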
\min_{x} f(x) = x^{2} + 4x - 1 \quad \text{s.t.} \quad (1)\; x + 1 \leq 0 \qquad (2)\; -x - 1 \leq 0
Address (URL) de-duplication (see the sketch after this list)
- url
- url-hash
- Bloom filter
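A minimal sketch of url-hash de-duplication with a plain set; a Bloom filter would replace the set when memory matters.

```python
import hashlib

seen = set()

def is_new_url(url):
    # hash the url so the set stores fixed-size digests instead of full strings
    fingerprint = hashlib.md5(url.encode("utf-8")).hexdigest()
    if fingerprint in seen:
        return False
    seen.add(fingerprint)
    return True

print(is_new_url("https://example.com/a"))  # True
print(is_new_url("https://example.com/a"))  # False
```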
Text content de-duplication (see the sketch after this list)
- Edit distance
- simhash
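A small edit-distance sketch (classic dynamic programming) to illustrate text de-duplication; simhash would be used instead for large-scale comparison.

```python
def edit_distance(a, b):
    # dp[i][j] = edits needed to turn a[:i] into b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete
                           dp[i][j - 1] + 1,        # insert
                           dp[i - 1][j - 1] + cost) # replace
    return dp[len(a)][len(b)]

print(edit_distance("crawler", "crawlers"))  # 1 -> near-duplicate
```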
Scrapy
Complete the crawler process
- Modify the start url
- Check and modify the allowed domains
- Implement the crawling logic in the parse method (see the sketch below)
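A minimal spider sketch; the spider name, domain and XPath selectors are placeholders.

```python
import scrapy

class BookSpider(scrapy.Spider):
    name = "book"
    allowed_domains = ["example.com"]           # check and modify the allowed domains
    start_urls = ["https://example.com/books"]  # modify the start url

    def parse(self, response):
        # crawling logic: extract fields and yield them to the engine
        for row in response.xpath("//div[@class='book']"):
            yield {
                "title": row.xpath("./h2/a/text()").get(),
                "link": row.xpath("./h2/a/@href").get(),
            }
```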
Save data
Define how the data is handled in the `pipelines.py` file
Define a pipeline class
Override the process_item method of the pipeline class
The process_item method must return the item to the engine after processing it
Enable the pipeline in the settings file (see the sketch below)
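A sketch of a JSON-lines pipeline under an assumed project name; only the process_item contract (return the item) comes from the notes above.

```python
# pipelines.py
import json

class JsonWriterPipeline:
    def open_spider(self, spider):
        self.file = open("items.json", "a", encoding="utf-8")

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # must return the item so later pipelines / the engine receive it

    def close_spider(self, spider):
        self.file.close()

# settings.py -- enable the pipeline (lower number = runs earlier)
# ITEM_PIPELINES = {"myproject.pipelines.JsonWriterPipeline": 300}
```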
scrapy list  # show the list of existing spiders
rm *.json  # delete the files ending in .json
CrawlSpider
Inherits from the Spider class
Automatically extracts links according to rules and sends them to the engine
- Create a CrawlSpider spider (see the sketch below)
- scrapy genspider -t crawl name domain  # -t selects the template
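A CrawlSpider sketch with hypothetical URL patterns, showing how Rule + LinkExtractor replace hand-written link extraction.

```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class JobSpider(CrawlSpider):
    name = "job"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/jobs"]

    rules = (
        # follow pagination links without parsing them
        Rule(LinkExtractor(allow=r"/jobs\?page=\d+"), follow=True),
        # send detail pages to parse_item
        Rule(LinkExtractor(allow=r"/jobs/\d+\.html"), callback="parse_item"),
    )

    def parse_item(self, response):
        yield {"title": response.xpath("//h1/text()").get()}
```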
Usage of middleware
- Define the middleware classes in middlewares.py
- Override the process_request or process_response method in the middleware class
process_request return values
- None: if every downloader middleware returns None, the request is finally handed to the downloader
- Request: the returned request is handed back to the scheduler
- Response: the response object is passed on to the spider for parsing
process_response
- Enable the middleware in the settings file (DOWNLOADER_MIDDLEWARES), see the sketch below
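A downloader middleware sketch that swaps in a random User-Agent; the UA list and project path are placeholders.

```python
# middlewares.py
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None  # None lets the request continue towards the downloader

# settings.py -- enable the middleware
# DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.RandomUserAgentMiddleware": 543}
```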
Distributed crawler
Speeds up the running of the project, though the resources required (hardware & network) remain the same as before
The instability of a single node must not affect the stability of the whole system
Distributed features:
cooperation
Speed up the execution of the whole task
Higher stability, and a single node will not affect the whole task
self.key = "dmoz:items" % {'spider': 'dmoz'}
Implementation steps of distributed crawler
1. Implement an ordinary crawler
2. Modify it into a distributed crawler
   1. Modify the crawler file
      1. Import the distributed crawler class
      2. Change the class the crawler inherits from
      3. Comment out the start urls and the allowed domains
      4. Set redis_key so the start urls are taken from the Redis database
      5. Get the allowed domains dynamically in __init__
   2. Modify the configuration file
      1. Copy the configuration and adapt it to the current project
      2. Write the 5 scrapy_redis configuration items (see the sketch below)
3. Run it
   - Run a crawler node: scrapy runspider <crawler file>
   - Start crawling: lpush <redis_key> <start_url>
ifconfig  # obtain IP information
ps aux | grep redis  # check whether the Redis server is running
'%(spider)s:items' % {'spider': 'dmoz'}  # placeholder; evaluates to 'dmoz:items'
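A sketch of the 5 scrapy_redis settings commonly used, plus a RedisSpider, under assumed project and spider names; the Redis URL is a placeholder.

```python
# settings.py -- the 5 scrapy_redis configuration items
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"   # shared de-duplication set
SCHEDULER = "scrapy_redis.scheduler.Scheduler"               # shared task queue
SCHEDULER_PERSIST = True                                     # keep the queue between runs
ITEM_PIPELINES = {"scrapy_redis.pipelines.RedisPipeline": 400}
REDIS_URL = "redis://127.0.0.1:6379"

# spiders/book.py
from scrapy_redis.spiders import RedisSpider

class BookSpider(RedisSpider):
    name = "book"
    redis_key = "book:start_urls"   # lpush book:start_urls <url> to start crawling

    def __init__(self, *args, **kwargs):
        # get the allowed domains dynamically instead of hard-coding them
        domain = kwargs.pop("domain", "")
        self.allowed_domains = list(filter(None, domain.split(",")))
        super().__init__(*args, **kwargs)

    def parse(self, response):
        yield {"url": response.url, "title": response.xpath("//title/text()").get()}
```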
Distributed crawler writing process
- Write an ordinary crawler
  - Create the project
  - Define the target
  - Create the crawler
  - Save the content
- Convert it into a distributed crawler
  - Modify the crawler
    - Import the distributed crawler class from scrapy_redis
    - Change the inherited class
    - Comment out start_urls & allowed_domains
    - Set redis_key to get the start_urls
    - Get the allowed domains in __init__
  - Modify the configuration file
    - Copy the configuration parameters
Distributed crawler summary
Usage scenario
- The amount of data is very large
- The data is required within a tight time limit
Ways to implement distribution
- scrapy_redis implements the distribution
- An ordinary crawler can be made distributed by sharing the de-duplication set and the task queue
Distributed setups
- Poor: several ordinary laptops
- Well-off: one server virtualized into several machines
- Tycoon: data-acquisition servers (15), management servers (3-4), storage servers (10)
Crawler case study
Scraping JD (jd.com) by hand
Key JS code:
, H = (r("pFHu"), Object(u.b)()((function(e) { var t = e.sonList; return c.a.createElement(c.a.Fragment, null, c.a.createElement("dt", null, c.a.createElement("a", { href: "//channel.jd.com/".concat(t.fatherCategoryId, "-").concat(t.categoryId, ".html") }, t.categoryName), c.a.createElement("b", null)), c.a.createElement("dd", null, t && t.sonList.map((function(e, r) { return c.a.createElement("em", { key: r }, c.a.createElement("a", { href: "//list.jd.com/".concat(t.fatherCategoryId, "-").concat(e.fatherCategoryId, "-").concat(e.categoryId, ".html") }, e.categoryName)) } )))) } )))
scrapy genspider book jd.com  # create the crawler, then write the crawling logic
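Based on the URL templates visible in the JS above, a hedged sketch of rebuilding the category links in Python; the field names follow the snippet, but the sample values are made up.

```python
# sample category data shaped like the JS above (values are made up)
category = {
    "fatherCategoryId": 1713,
    "categoryId": 3258,
    "categoryName": "Books",
    "sonList": [
        {"fatherCategoryId": 3258, "categoryId": 3297, "categoryName": "Novels"},
    ],
}

# dt link: //channel.jd.com/{fatherCategoryId}-{categoryId}.html
channel_url = "https://channel.jd.com/{}-{}.html".format(
    category["fatherCategoryId"], category["categoryId"])

# dd links: //list.jd.com/{t.fatherCategoryId}-{son.fatherCategoryId}-{son.categoryId}.html
list_urls = [
    "https://list.jd.com/{}-{}-{}.html".format(
        category["fatherCategoryId"], son["fatherCategoryId"], son["categoryId"])
    for son in category["sonList"]
]

print(channel_url, list_urls)
```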
Scrapy_splash
A component of scrapy
Function:
- scrapy_splash simulates a browser to load the JS and returns the fully rendered page source
use:
- Install
  - splash service
    - Install docker
    - Pull the splash docker image
  - python module
    - pip install scrapy-splash (see the sketch below)
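A hedged usage sketch, assuming the splash service is already running in docker (e.g. docker run -p 8050:8050 scrapinghub/splash); the URL and spider name are placeholders.

```python
import scrapy
from scrapy_splash import SplashRequest

class JsPageSpider(scrapy.Spider):
    name = "js_page"

    # settings.py additionally needs SPLASH_URL = "http://127.0.0.1:8050"
    # plus the scrapy_splash downloader middlewares and dupefilter

    def start_requests(self):
        # SplashRequest renders the page in splash before it reaches parse
        yield SplashRequest("https://example.com", callback=self.parse, args={"wait": 3})

    def parse(self, response):
        yield {"title": response.xpath("//title/text()").get()}
```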