05 Data parsing - regular expressions

Review

1. Crawlers

2. Classification of crawlers:

  • General-purpose crawler

  • Focused crawler

  • Incremental crawler: monitors a site for newly updated data

3. Anti-crawling mechanisms

  • Counter-anti-crawling strategies

4. robots.txt, UA detection: countered by UA spoofing

5. http and https concepts: protocols for data exchange between a server and a client.

6. Common header fields:

  • User-Agent: identifies the client sending the request
  • Connection: close
  • Content-Type

7. https encryption method: certificate-based key encryption

  • Certificate: used in the https encryption process.
  • The certificate is issued by a certification authority and contains the public key (the encryption method).

8. requests → get/post (a minimal sketch follows this list):

  • url

  • data/params: encapsulate the request parameters

  • headers: UA spoofing
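
A minimal sketch of both request styles with the requests library. The url (httpbin.org) and the parameter kw are illustrative assumptions, not part of the original notes:

import requests

headers = {'User-Agent': 'Mozilla/5.0'}  # UA spoofing

# GET: query parameters are passed via params
resp = requests.get(url='https://httpbin.org/get',
                    params={'kw': 'picture'},  # hypothetical parameter
                    headers=headers)

# POST: form data is passed via data
resp = requests.post(url='https://httpbin.org/post',
                     data={'kw': 'picture'},  # hypothetical parameter
                     headers=headers)
print(resp.status_code)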

9. What is dynamically loaded data: data requested by a separate, additional request rather than by the initial page request.

  • ajax
  • js

10. How to tell whether a page contains dynamically loaded data?

  • Local search (search for the target data in the raw source of the page request)
  • Global search (search across all captured requests in the browser's dev tools)

11. The first step before crawling an unfamiliar website

  • Determine whether the data you want to crawl is dynamically loaded!! (A quick programmatic check is sketched below.)
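
One quick check is to fetch the page source and search it locally for a known fragment of the target data; if the fragment is absent, the data is loaded dynamically. A minimal sketch, in which the url and the search fragment are illustrative assumptions:

import requests

url = 'https://www.ivsky.com/bizhi/'  # page under inspection
headers = {'User-Agent': 'Mozilla/5.0'}

page_text = requests.get(url=url, headers=headers).text

# Local search: if a string visible in the browser is missing from the
# raw source, that data is loaded dynamically (ajax/js).
sample = 'secrets_of_the_jungle'  # hypothetical fragment of the target data
if sample in page_text:
    print('Present in the page source: statically loaded.')
else:
    print('Absent from the page source: dynamically loaded.')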

Data parsing

1. Parsing: extracting data according to specified rules.

  Purpose: it is the core of a focused crawler.

2. Coding workflow of a focused crawler (a minimal skeleton follows this list):

  • Specify the url

  • Initiate the request

  • Get the response data

  • Parse the data

  • Persist the data
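
The five steps map one-to-one onto code. A minimal sketch, assuming an arbitrary target page and leaving the parsing rule as a placeholder:

import requests

# 1. Specify the url (illustrative target)
url = 'https://www.ivsky.com/bizhi/'
# 2. Initiate the request, with UA spoofing
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url=url, headers=headers)
# 3. Get the response data
page_text = response.text
# 4. Parse the data (regex / bs4 / xpath rules go here)
parsed = page_text[:100]  # placeholder for a real parsing rule
# 5. Persist the data
with open('./result.txt', 'w', encoding='utf-8') as fp:
    fp.write(parsed)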

3. Data parsing methods:

  • Regular expressions
  • bs4
  • xpath
  • PyQuery (extension)

4. What is the general principle of data parsing?

  • Parsing acts on the page source code (a set of html tags)

  • The core function of html: displaying data

  • How html presents data: the data is placed inside html tags or in their attributes

5. General workflow (a bs4 sketch follows this list):

  • Locate the tag

  • Extract its text or its attribute value
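
To make the two steps concrete, here is a minimal sketch using bs4, one of the methods listed above; the html snippet is a made-up example:

from bs4 import BeautifulSoup

html = '<div class="thumb"><img src="/pic/1.jpg" alt="demo"></div>'  # made-up snippet
soup = BeautifulSoup(html, 'html.parser')

# Step 1: locate the tag
img = soup.find('img')

# Step 2: extract text or an attribute value
print(img['src'])    # attribute extraction -> /pic/1.jpg
print(img['alt'])    # attribute extraction -> demo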

Implementing data parsing with regular expressions

Requirement: crawl the funny-picture data from Qiushibaike (the "Embarrassing Encyclopedia").

When I went to open the web page, I found that the web version of Qiushibaike was gone.

So I found another site, Paradise Pictures: https://www.ivsky.com/bizhi/

Implementation process

1. Import packages and set up the anti-anti-crawling header

import requests
# UA spoofing to get past UA detection
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36 Edg/97.0.1072.55'
}

2. Crawl image data

First, try downloading a single picture. Here are two ways:

Mode 1:

url = 'https://img.ivsky.com/img/bizhi/li/202110/02/secrets_of_the_jungle-006.jpg'
img_data = requests.get(url=url,headers=headers).content  # .content returns the response body as bytes
with open('./1.jpg','wb') as fp:
    fp.write(img_data)

Mode 2:

from urllib import request
url = 'https://img.ivsky.com/img/bizhi/li/202110/02/secrets_of_the_jungle-006.jpg'
request.urlretrieve(url,'./2.jpg')

The biggest difference between mode 1 and mode 2 is that mode 2 cannot apply UA spoofing through urlretrieve itself, since it takes no headers argument (a workaround is sketched below).

urllib is an older networking module; before the requests module appeared, urllib was the standard way to send requests.
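
That said, urllib can still spoof the UA, just not through urlretrieve's own arguments: one workaround is to install a global opener that carries custom headers. A sketch, reusing the same image url:

from urllib import request

# Install a global opener whose requests carry a spoofed User-Agent;
# urlretrieve then picks it up implicitly.
opener = request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
request.install_opener(opener)

url = 'https://img.ivsky.com/img/bizhi/li/202110/02/secrets_of_the_jungle-006.jpg'
request.urlretrieve(url, './3.jpg')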

Complete code

import requests
import re
import os

if __name__ == "__main__":
    # Create a folder for the downloaded pictures
    if not os.path.exists('./qiutuLibs'):
        os.mkdir('./qiutuLibs')
    url = 'https://www.qiushibaike.com/imgrank/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
    }
    # Use a general-purpose request to fetch the whole page for the url
    page_text = requests.get(url=url, headers=headers).text

    # Use focused parsing to extract all picture urls from the page.
    # Each picture sits in a block like:
    # <div class="thumb">
    #   <a href="/article/125003930" target="_blank">
    #     <img src="//pic.qiushibaike.com/system/pictures/12500/125003930/medium/14Z46S72MMC2P4ZC.jpg"
    #          alt="..." class="illustration" width="100%" height="auto">
    #   </a>
    # </div>
    ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
    img_src_list = re.findall(ex, page_text, re.S)
    print(img_src_list)
    for src in img_src_list:
        # Splice a complete picture url (the src in the page is protocol-relative)
        src = 'https:' + src
        # Request the binary data of the picture
        img_data = requests.get(url=src, headers=headers).content
        # Generate the picture name from the last segment of the url
        img_name = src.split('/')[-1]
        # The path where the picture is finally stored
        img_path = './qiutuLibs/' + img_name
        with open(img_path, 'wb') as fp:
            fp.write(img_data)
            print(img_name, 'downloaded successfully!!!')
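
One detail worth noting in the code above is the re.S flag: by default '.' does not match newlines, and each <div class="thumb"> block spans several lines of source, so without re.S the pattern would match nothing. A minimal demonstration on a made-up two-line string:

import re

text = '<div class="thumb">\n<img src="/pic/1.jpg" alt="x">\n</div>'
ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'

print(re.findall(ex, text))        # [] - '.' stops at the newline
print(re.findall(ex, text, re.S))  # ['/pic/1.jpg'] - re.S lets '.' cross newlines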
