Web crawling: the first step from getting started to going to jail

Abstract: this article covers only the basics of web crawling and is meant for beginners. If you are already experienced, feel free to skip it. I wrote it mainly for myself, so I don't forget.

Contents

  1. Installing the required Python packages
  2. Crawling steps
  3. Website analysis
  4. Using requests
  5. Crawling content
  6. Saving the results

Installation

The main Python packages used are requests, re, pandas, and time. re and time are part of the standard library, so only requests and pandas need to be installed.
Open CMD and run:

pip install requests 
pip install pandas

Import the packages

import requests
import re 
import time
import pandas as pd 

(import whichever packages you need for the site you are crawling)

Crawling steps

  1. Analyze the website first: find where the content you want lives on the site
  2. Determine how the target is displayed: usually as plain text, a URL, a downloadable file behind a URL, etc.
  3. Write the crawling code

Website analysis

Website analysis may sound like you have to understand the languages the site is written in, but you don't. Of course, knowing them makes crawling easier, but don't be scared off if you don't. I don't know much HTML, CSS, or JS either, yet I can still crawl what I want. As long as you know where the content you want sits and how the links are structured, you can get it.

Every file on a website has an absolute path, a link. As long as you know a file's link, you can crawl and download it.

So the main job is to find the corresponding URLs and then send requests to them. As long as the request succeeds, you can pull the content down.

Some sites do use obfuscation, though, such as CSS or JS encryption; that needs more advanced crawling skills and is not covered here.

As long as you stay away from the websites of big companies, most sites have a fairly simple structure and are easy to crawl.

Using the Network tab

The browser's Network tab is a simple tool for inspecting the requests a page makes. The data packets the page loads show up here, and from them we can get the URLs we need to crawl parts of the site.

  • Fetch/XHR: mainly the text data the page loads; opened, these requests usually show up as XHR or JSON
  • JS: the scripts that make the site interactive
  • CSS: the site's styling, there only to make it look good
  • Img: every picture the page loads can be found here

Click these filters to sort quickly, find what you want faster, and see which request carries it.

Using requests

Python has two basic HTTP methods for accessing websites: get and post. The way I think of it: with get we go to the server to fetch something, with post we send something to the server. Most of the time get is what you use.

Let's use an example. Suppose we want to access https://www.baidu.com/ from Python:

url = 'https://www.baidu.com/'

Before visiting the website we also need to set headers, which are written as a dictionary:

headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Cookie': 'BIDUPSID=2A7A4CF29AEF7C307EF129AE0E15B742; PSTM=1609848264; BD_UPN=12314753; BDUSS=9FMlRRMFFBOEV0R29rTXJXR20td2FGVEE5WHdKb2FoeHlKOEQ1Rn5YaGs4akJnSVFBQUFBJCQAAAAAAQAAAAEAAAAD4woGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGRlCWBkZQlgT2; BDUSS_BFESS=9FMlRRMFFBOEV0R29rTXJXR20td2FGVEE5WHdKb2FoeHlKOEQ1Rn5YaGs4akJnSVFBQUFBJCQAAAAAAQAAAAEAAAAD4woGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGRlCWBkZQlgT2; __yjs_duid=1_656489a63aff4a00db25c98347131c2d1617941644108; BAIDUID=69F8BB689CC2F279E2B2902E4C9AA2D9:FG=1; BCLID_BFESS=11583441835844288591; BDSFRCVID_BFESS=trDOJexroG0Y_ARe1brQk_OQMgKK0gOTDYLEOwXPsp3LGJLVgVBXEG0Pt_NFmZK-oxmHogKK3mOTHmDF_2uxOjjg8UtVJeC6EG0Ptf8g0M5; H_BDCLCKID_SF_BFESS=tbutoK8XJDD3jt-k5brBhnL-hp_X5-CstbTl2hcH0KLKbDoo0lK-bqDy3tCtQPJX25KL5Joy2fb1MRjvDfvP0nKIjx5d-MT75erl_l5TtUJcSDnTDMRh-4ApQnoyKMnitKj9-pPKWhQrh459XP68bTkA5bjZKxtq3mkjbPbDfn02eCKuDjtBDT30DGRabK6aKC5bL6rJabC3f-oeXU6q2bDeQN3kyMoN5R6aQfjoXh7G8J3oyT3JXp0vWtv4WbbvLT7johRTWqR4eUQtWMonDh83BPTl2lTiHCOOWlnO5hvvhn6O3M7VQMKmDloOW-TB5bbPLUQF5l8-sq0x0bOte-bQXH_E5bj2qRIjVIOP; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; H_WISE_SIDS=107316_110085_127969_128698_164869_168388_170704_175649_175668_175755_176398_176553_176677_177007_177371_177412_178005_178329_178530_178632_179201_179347_179368_179380_179402_179454_180114_180276_180407_180434_180436_180513_180655_180698_180758_180869_181207_181259_181329_181401_181429_181432_181483_181536_181589_181611_181710_181791_181799_182000_182026_182061_182071_182077_182117_182191_182233_182321_182576_182598_182715_182847_182921_183002_183329_183433; MCITY=-%3A; BAIDUID_BFESS=B23738E9C47D22324B926F5D72B96F85:FG=1; BD_HOME=1; H_PS_PSSID=34398_34369_31253_34374_33848_34092_34106_34111_26350_34246; BA_HECTOR=a5agal0l81al8025ha1gh949e0r',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'
}

Where these values come from:
Open F12 -> Network -> click any request under Name -> Headers, and copy the values it shows you.

You can also add extra header fields in Python the same way. For example, suppose I want to add a Referer:

headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Cookie': '.....', # omitted because it's too long
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
    'Referer': 'https://www.baidu.com/' # <--- just add it as another dictionary entry
}

That is all it takes to add a field. Some sites need one of these extra fields when you crawl them, and you can add it this way; since headers is just a dictionary, changing a key changes the value that gets sent.
We use get to access the site:

r = requests.get(url = url, headers = headers)
html = r.content.decode('utf-8')

content: returns the response body as bytes, which is the HTML of the page.
decode: decodes the bytes as 'utf-8' to avoid garbled characters.

print(r)

A result of 200 means the request succeeded:

<Response [200]>

We can also print html to take a look:

print(html)

You will see the page's source code. I won't show the output here; run it yourself to get a feel for it.
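Before moving on to post: when you crawl more than one page, it is good practice to check the status code (the 200 above) and pause between requests using the time package from the import list. A minimal sketch, where the urls list and headers are made-up example values:

import time
import requests

# A minimal sketch: check the status code and pause between requests so we
# don't hammer the server. The urls list and headers here are made-up examples.
headers = {'User-Agent': 'Mozilla/5.0'}
urls = ['https://www.baidu.com/'] * 3

for url in urls:
    r = requests.get(url, headers=headers)
    if r.status_code == 200:      # same "200 means success" check as above
        html = r.content.decode('utf-8')
    time.sleep(1)                 # wait a second before the next request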

post takes one more argument than get: data. data is written the same way as headers, as a dictionary:

data = {
    'VIEWSTATE': '...',       # omitted
    'EVENTVALIDATION': '...', # omitted
    'PREVIOUSPAGE': '...'     # omitted
} 

r = requests.post(url, headers = headers, data = data)
html = r.content.decode('utf-8')

Where to get these values:

The corresponding values can also be extracted from the page with re:

VIEWSTATE = re.findall(r'<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="(.*?)" />', str(html))
EVENTVALIDATION = re.findall(r'<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="(.*?)" />', str(html))
PREVIOUSPAGE = re.findall(r'<input type="hidden" name="__PREVIOUSPAGE" id="__PREVIOUSPAGE" value="(.*?)" />', str(html))
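To make the flow clearer, here is a minimal sketch of how the two steps fit together: get the page once, extract the hidden fields with re, then send them back with post. The URL below is a placeholder, not a real site.

import re
import requests

# Sketch of a typical ASP.NET-style form page; the URL is a placeholder.
url = 'https://example.com/form.aspx'
headers = {'User-Agent': 'Mozilla/5.0'}

# Step 1: get the page once and read the hidden form fields out of the HTML.
html = requests.get(url, headers=headers).content.decode('utf-8')
VIEWSTATE = re.findall(r'<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="(.*?)" />', html)
EVENTVALIDATION = re.findall(r'<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="(.*?)" />', html)

# Step 2: send the values back in the post data (re.findall returns a list, so take [0]).
data = {
    '__VIEWSTATE': VIEWSTATE[0],
    '__EVENTVALIDATION': EVENTVALIDATION[0],
}
r = requests.post(url, headers=headers, data=data)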

Crawling content

1. Text crawling

This is the simplest case. Access the website as above and you get the page's source code; after that it is just extraction. There are plenty of tutorials and Python packages for extraction online, but I mainly use re, regular expressions.

You can test a regex on a regex testing site first; look up the details of regex syntax yourself.
Regex test site: https://regexr-cn.com/

The core call I mainly use is this:

text = "I'm Chen Dawen"
find = re.findall(r"I am(.*?)large",str(text)) #"I am" is used to locate (. *?) Used to extract the desired content, "big" is the final positioning
#find is a list 
print(find[0])

That is just a toy example, but test it yourself and you will find this one call gets you a very long way.
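For a slightly less artificial case, the same trick pulls the page title out of the Baidu HTML fetched earlier (assuming the html variable from the get example still holds that page):

# Anchor on the <title> tags to extract the page title from the earlier html.
title = re.findall(r'<title>(.*?)</title>', str(html))
print(title[0])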

2. Image crawling

Images are rarely encrypted at the moment, so just find the corresponding link, request it, and save the response in binary form via content. The saving section below shows how.

For example: right click -> Inspect to find the image's link. As long as you can get that link, you can crawl the picture.
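If you want to collect image links from a page's source instead of copying them by hand, re works there too. A small sketch, assuming html holds the page source fetched earlier:

# Pull the src attribute out of every <img> tag in the page source.
img_links = re.findall(r'<img.*?src="(.*?)"', str(html))
print(img_links[:5]) # the first few image URLs found on the page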

Saving the results

Any file type can be saved, .jpg, .pdf, .xlsx, as long as you give the saved file the matching extension.

Important: the file must be written in binary mode.

path = 'filename.suffix' # set the save location yourself; the suffix must match the type of the file you crawled, otherwise it won't open
# for example, for a picture: path = 'pic.jpg'
with open(path, 'wb') as f:
    f.write(r.content)
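pandas was in the package list at the top but has not appeared yet. For text data, a common way to save results is as a table. A minimal sketch, where the titles list is made-up example data (writing .xlsx also needs openpyxl installed):

import pandas as pd

# Save scraped text as a table; the titles list here is made-up example data.
titles = ['first title', 'second title']
df = pd.DataFrame({'title': titles})
df.to_excel('result.xlsx', index=False) # or df.to_csv('result.csv', index=False)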

Example

Let me take this picture as an example

After finding this picture on the website, open F12 and you can find the picture's link:

https://gimg2.baidu.com/image_search/src=http%3A%2F%2Fwww.pp-sp.com%2FUploadFiles%2Fimg_0_1579101990_2165129230_26.jpg&refer=http%3A%2F%2Fwww.pp-sp.com&app=2002&size=f9999,10000&q=a80&n=0&g=0n&fmt=jpeg?sec=1631340143&t=895ed6441a5d8af0b2e9d15150a9fb60

We can open this link directly.

What appears is the picture on its own. Any file that can be reached in this form can be crawled, and that is the end goal.

import requests

url = 'https://gimg2.baidu.com/image_search/src=http%3A%2F%2Fwww.pp-sp.com%2FUploadFiles%2Fimg_0_1579101990_2165129230_26.jpg&refer=http%3A%2F%2Fwww.pp-sp.com&app=2002&size=f9999,10000&q=a80&n=0&g=0n&fmt=jpeg?sec=1631340143&t=895ed6441a5d8af0b2e9d15150a9fb60'

r = requests.get(url) # headers are only needed for some sites; this link works with or without them
html = r.content

with open('pikachu.jpg', 'wb') as f:
    f.write(html)

Run~~~

That's it, the picture is saved.
