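1. Web crawlers
2. Types of crawlers:
General-purpose crawler
Focused crawler
Incremental crawler: monitors a site for updates and crawls only the new data
3. Anti-crawling mechanisms
Counter-anti-crawling strategies
4. robots.txt protocol and User-Agent (UA) detection; countermeasure: UA spoofing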
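As a rough illustration of item 4, the sketch below checks a robots.txt policy with the standard-library `urllib.robotparser` and prepares a browser-like `User-Agent` header. The robots.txt content, the `example.com` URLs, and the helper name `is_allowed` are assumptions made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# UA spoofing: send a browser's User-Agent instead of the library default.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

allowed_public = is_allowed(
    ROBOTS_TXT, headers["User-Agent"], "https://example.com/index.html")
allowed_private = is_allowed(
    ROBOTS_TXT, headers["User-Agent"], "https://example.com/private/a.html")
print(allowed_public, allowed_private)
```

robots.txt is only a convention: the server does not enforce it, so honoring it is up to the crawler.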
5. HTTP and HTTPS: protocols for exchanging data between a server and a client; HTTPS is the encrypted version of HTTP.
6. Common header fields:
- User-Agent: identifies the client that carries the request
- Connection: close (close the connection once the request is finished)
- Content-Type: the media type of the data being sent
7. HTTPS encryption method: certificate-based key encryption
- Certificate: used in the HTTPS encryption process.
- The certificate is issued by a certificate authority (CA) and contains the public key used for encryption.
8. requests → get/post:
data/params: carry the request parameters (params for the query string, data for the request body)
headers: UA spoofing
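To make the difference between `params` and `data` concrete, here is a minimal offline sketch. It uses the standard-library `urllib` instead of `requests` so that nothing is sent over the network; `requests.get(url, params=..., headers=...)` and `requests.post(url, data=..., headers=...)` perform the equivalent encoding. The URL and the parameter names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
params = {"query": "crawler", "page": 2}  # hypothetical request parameters

# GET: the parameters are URL-encoded into the query string.
get_req = Request(
    "https://example.com/search?" + urlencode(params),
    headers=headers,
)

# POST: the same parameters are URL-encoded into the request body.
post_req = Request(
    "https://example.com/search",
    data=urlencode(params).encode("utf-8"),
    headers=headers,
    method="POST",
)

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```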
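9. What is dynamically loaded data? Data that is not in the response to the page's own request but is fetched by an additional request.
- ajax
- js
10. How to tell whether a page contains dynamically loaded data:
- Local search: search for the data in the response to the page's own request
- Global search: search across all captured requests to find the one that returned the data
11. The first step before crawling an unfamiliar website
- Determine whether the data you want to crawl is dynamically loaded.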
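The two searches in item 10 are normally done by hand in the browser's network panel, but the idea can be sketched in code. The example below assumes you already have the raw page source and a mapping of captured request URLs to response bodies; all names and data here are hypothetical.

```python
from typing import Dict, List


def is_dynamically_loaded(page_source: str, visible_text: str) -> bool:
    """Local search: True if the text seen in the browser is absent
    from the response to the page's own request."""
    return visible_text not in page_source


def find_source_requests(captured: Dict[str, str], visible_text: str) -> List[str]:
    """Global search: URLs of the captured responses that contain the text."""
    return [url for url, body in captured.items() if visible_text in body]


# Hypothetical capture: the page itself plus one ajax request.
captured = {
    "https://example.com/list": "<html><div id='items'></div></html>",
    "https://example.com/api/items": '{"items": ["Blue Widget"]}',
}
page_source = captured["https://example.com/list"]

dynamic = is_dynamically_loaded(page_source, "Blue Widget")
sources = find_source_requests(captured, "Blue Widget")
print(dynamic, sources)
```

If the data turns out to be dynamically loaded, the crawler should request the URL found by the global search rather than the page URL.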
Data parsing
1. Parsing: extracting data according to specified rules.
Purpose: implementing a focused crawler.
2. Coding process of a focused crawler:
Specify the URL
Send the request
Get the response data
Parse the data
Store the data persistently
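The five steps above can be sketched as one small function. To keep the example runnable offline, the request step is replaced by a stand-in `fake_fetch` that returns canned HTML; in a real crawler it would be something like `requests.get(url, headers=headers).text`. The function names, the URL, and the `<h2>` markup are assumptions for illustration only.

```python
import json
import os
import re
import tempfile


def fake_fetch(url):
    """Stand-in for the request step; returns canned HTML instead of a live page."""
    return "<html><body><h2>First</h2><h2>Second</h2></body></html>"


def parse_titles(page_text):
    """Parsing step: extract the text of every <h2> tag."""
    return re.findall(r"<h2>(.*?)</h2>", page_text, re.S)


def crawl(url, fetch, parse, out_path):
    page_text = fetch(url)          # send the request, get the response data
    items = parse(page_text)        # parse the data
    with open(out_path, "w", encoding="utf-8") as fp:
        json.dump(items, fp)        # store the data persistently
    return items


with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "titles.json")
    titles = crawl("https://example.com/", fake_fetch, parse_titles, out_path)  # specify the URL
    with open(out_path, encoding="utf-8") as fp:
        saved = json.load(fp)

print(titles)
```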
3. Data parsing methods:
- Regular expressions
- bs4
- xpath
- pyquery (extension)
4. What is the general principle of data parsing?
Data parsing operates on the page source (a set of HTML tags).
The core function of HTML: displaying data.
How HTML holds data: in the text between tags, or in tag attributes.
5. General steps:
Locate the tag
Get its text or an attribute
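The two general steps (locate the tag, then take its text or an attribute) can be illustrated without any third-party package. The sketch below uses the standard-library `html.parser` rather than bs4 or xpath, purely so it runs anywhere; the class name `ThumbParser` and the sample markup are made up for this example.

```python
from html.parser import HTMLParser


class ThumbParser(HTMLParser):
    """Collects the src attribute of <img> tags and the text inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.img_srcs = []
        self.link_texts = []
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "img":                      # locate the tag
            src = dict(attrs).get("src")      # get an attribute
            if src:
                self.img_srcs.append(src)
        elif tag == "a":
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        if self._in_link and data.strip():    # get the text
            self.link_texts.append(data.strip())


html = ('<div class="thumb"><a href="/article/1">First post</a>'
        '<img src="/img/1.jpg" alt="one"></div>')
parser = ThumbParser()
parser.feed(html)
print(parser.img_srcs, parser.link_texts)
```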
Data parsing with regular expressions
Requirement: crawl the image data from Qiushibaike (the "Embarrassing Encyclopedia")
When I went to open the page, I found that the web version of Qiushibaike was gone.
So I found another page, Paradise pictures:
Implementation process
1. Import the package and set up UA spoofing against anti-crawling
import requests

# Counter anti-crawling: UA spoofing
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36 Edg/97.0.1072.55'
}
2. Crawl image data
Try downloading a single image first. Here are two ways.
Method 1:
url = ''
img_data = requests.get(url=url, headers=headers).content  # .content returns the response body as bytes
with open('./1.jpg', 'wb') as fp:
    fp.write(img_data)
Method 2:
from urllib import request
url = ''
request.urlretrieve(url, './2.jpg')
The biggest difference between the two methods is that urlretrieve takes no headers argument, so Method 2 cannot apply UA spoofing directly.
urllib is an older network request module; before the requests module appeared, urllib was the usual way to send requests.
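The limitation noted above is specific to `urlretrieve`. As a hedged aside, urllib can still send a custom `User-Agent` if the request is built with `urllib.request.Request` and opened with `urlopen`. The helper names and the `example.com` URL below are hypothetical, and `download_image` is defined but not called so the sketch runs without network access.

```python
from urllib import request


def build_image_request(url, user_agent):
    """Build a urllib request that carries a spoofed User-Agent header."""
    return request.Request(url, headers={"User-Agent": user_agent})


def download_image(url, path, user_agent):
    """Download url to path, sending the given User-Agent (needs network access)."""
    req = build_image_request(url, user_agent)
    with request.urlopen(req) as resp, open(path, "wb") as fp:
        fp.write(resp.read())


req = build_image_request(
    "https://example.com/1.jpg",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
)
print(req.header_items())
```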
Complete code
import requests
import re
import os

if __name__ == "__main__":
    # Create a folder for the downloaded images
    if not os.path.exists('./qiutuLibs'):
        os.mkdir('./qiutuLibs')

    url = ''  # URL of the page to crawl (left blank here, as in the examples above)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
    }
    # Use the general-purpose crawler to fetch the whole page at the url
    page_text = requests.get(url=url, headers=headers).text

    # Use the focused crawler to parse out every image in the page. Sample markup:
    # <div class="thumb">
    # <a href="/article/125003930" target="_blank">
    # <img src="//pic.qiushibaike.com/system/pictures/12500/125003930/medium/14Z46S72MMC2P4ZC.jpg" alt="embarrassing #125003930"
    # class="illustration" width="100%" height="auto">
    # </a>
    # </div>
    ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
    img_src_list = re.findall(ex, page_text, re.S)
    print(img_src_list)
    for src in img_src_list:
        # Build the complete image URL
        src = 'https:' + src
        # Request the binary data of the image
        img_data = requests.get(url=src, headers=headers).content
        # Generate the image file name
        img_name = src.split('/')[-1]
        # Path where the image is finally stored
        img_Path = './qiutuLibs/' + img_name
        with open(img_Path, 'wb') as fp:
            fp.write(img_data)
        print(img_name, 'Download succeeded!!!')
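To see what the pattern in the script captures, it can be run against the sample markup quoted in its comment, with no network request involved. This is only a check of the regular expression under the assumption that the page markup matches that sample; it also shows why the `re.S` flag is needed when the tag spans several lines.

```python
import re

# Sample markup reconstructed from the comment in the script.
sample = '''<div class="thumb">
<a href="/article/125003930" target="_blank">
<img src="//pic.qiushibaike.com/system/pictures/12500/125003930/medium/14Z46S72MMC2P4ZC.jpg" alt="embarrassing #125003930" class="illustration" width="100%" height="auto">
</a>
</div>'''

ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'

# With re.S, "." also matches newlines, so the pattern can span lines.
with_dotall = re.findall(ex, sample, re.S)
# Without re.S, "." stops at the first newline and nothing matches.
without_dotall = re.findall(ex, sample)

full_url = 'https:' + with_dotall[0]      # the src is protocol-relative
img_name = full_url.split('/')[-1]        # last path segment is the file name
print(with_dotall, without_dotall, img_name)
```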