Crawling Baidu images with Python

This article is reproduced from: https://blog.csdn.net/qq_52907353/article/details/112391518#commentBox
What I want to write about today is how to crawl Baidu images.

1, Analysis process

1. First, open Baidu, search for a keyword, and click the Images tab

2. While scrolling down with the mouse wheel, you will notice that the images are loaded dynamically, which means they are fetched via Ajax requests.

3. Open the browser's developer tools, switch to the Network panel, and select the XHR filter

4. Scroll down again and a new request (packet) appears in the panel.

5. Copy the request URL of this packet:

https://image.baidu.com/search/acjson?tn=resultjson_com&logid=8222346496549682679&ipn=rj&ct=201326592&is=&fp=result&queryWord=%E7%BE%8E%E5%A5%B3&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=&hd=&latest=&copyright=&word=%E7%BE%8E%E5%A5%B3&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&expermode=&force=&cg=girl&pn=30&rn=30&gsm=1e&1610176483429=

6. Click this URL to see the parameters it carries

Then copy those parameters as well, and start writing the related code (a convenient way to inspect them follows).
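
A quick way to inspect those parameters (my own convenience snippet, using only Python's standard urllib.parse; the URL below is a shortened placeholder, so paste the full acjson URL you copied):

from urllib.parse import urlsplit, parse_qs

# Paste the full request URL copied from the XHR panel (shortened placeholder here)
packet_url = 'https://image.baidu.com/search/acjson?tn=resultjson_com&word=beauty&pn=30&rn=30'

# parse_qs maps each parameter name to a list of values; take the first value
for key, values in parse_qs(urlsplit(packet_url).query).items():
    print(key, '=', values[0])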

2, Write code

1. First, import the module we need

import requests
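
(requests is a third-party library; if it is not installed yet, pip install requests will add it.)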

2. Start coding

# UA spoofing: pretend the request comes from a normal browser
header = {
    'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}
url = 'https://image.baidu.com/search/acjson?'
param = {
    'tn': 'resultjson_com',
    'logid': '8846269338939606587',
    'ipn': 'rj',
    'ct': '201326592',
    'is': '',
    'fp': 'result',
    'queryWord': 'beauty',
    'cl': '2',
    'lm': '-1',
    'ie': 'utf-8',
    'oe': 'utf-8',
    'adpicid': '',
    'st': '-1',
    'z': '',
    'ic': '',
    'hd': '',
    'latest': '',
    'copyright': '',
    'word': 'beauty',
    's': '',
    'se': '',
    'tab': '',
    'width': '',
    'height': '',
    'face': '0',
    'istype': '2',
    'qc': '',
    'nc': '1',
    'fr': '',
    'expermode': '',
    'force': '',
    'cg': 'girl',
    'pn': '1',
    'rn': '30',
    'gsm': '1e',
}
# Send the request and force UTF-8 decoding of the response
page_text = requests.get(url=url, headers=header, params=param)
page_text.encoding = 'utf-8'
page_text = page_text.text
print(page_text)

At this step, let's run the request first to see whether we can get the data returned by the page.

We successfully obtained the returned data.
After that, we go back to the web page to inspect the packet and the data it returns (a quick status check is sketched below).
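
As a small optional check (my addition, not in the original article), you can confirm the HTTP status and the content type before parsing anything:

# Optional sanity check on the response object before parsing it
response = requests.get(url=url, headers=header, params=param)
print(response.status_code)                   # 200 means the request succeeded
print(response.headers.get('Content-Type'))   # how the server labels the payload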

Then paste the response into an online JSON parser and locate the field that holds each image's address.
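
Based on what the parser shows, the part of the response this article relies on looks roughly like this (a simplified sketch, not the full schema):

# Simplified shape of the acjson response (only the fields used below):
# {
#     "data": [
#         {"thumbURL": "https://...jpg", ...},   # one dict per image
#         ...
#         {}                                     # trailing empty dict
#     ]
# }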

Then continue writing the code.
Convert the returned data into JSON format: all the data sits under the 'data' key as a list of dictionaries, and each image's address is stored inside one of those dictionaries.
Then take out the link addresses:

# Parse the response as JSON; .json() must be called on the response object,
# so drop the earlier page_text = page_text.text line
page_text = page_text.json()
# First, take out the list of dictionaries where all the links are stored
info_list = page_text['data']
# The last dictionary retrieved this way is empty, so delete the last element
del info_list[-1]
# Define a list for storing the picture addresses
img_path_list = []
for info in info_list:
    img_path_list.append(info['thumbURL'])
# Then take out all the picture addresses and download them
# n will be used as each picture's file name
n = 0
for img_path in img_path_list:
    img_data = requests.get(url=img_path, headers=header).content
    img_path = './' + str(n) + '.jpg'
    with open(img_path, 'wb') as fp:
        fp.write(img_data)
    n += 1
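
If you would rather not rely on deleting the last element, a slightly more defensive variant (my own tweak, not the original code) keeps only the entries that actually carry a thumbURL key:

img_path_list = [info['thumbURL'] for info in page_text['data'] if info.get('thumbURL')]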

After completing this, we also want to download multiple pages of Baidu images. After some analysis, I found that among the parameters we submit, pn is the index of the first image to load. Following this idea, we can wrap the code above in one big loop: the first iteration starts from image 1 and downloads 30 images, the second starts from image 31, and so on.
OK! The idea is clear, so let's modify the code above.
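
The paging arithmetic in a tiny standalone sketch: with rn=30 results per request, page m (1-based) starts at pn = 1 + 30 * (m - 1).

# Paging math for rn=30 results per request
for m in range(1, 4):
    print('page', m, 'starts at pn =', 1 + 30 * (m - 1))   # 1, 31, 61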

import requests
page = input('Please enter how many pages to crawl:')
page = int(page) + 1
header = {
    'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}
n = 0
pn = 1
# pn is the index of the first image to fetch; when you scroll down, Baidu Images loads 30 at a time by default
for m in range(1,page):
    url = 'https://image.baidu.com/search/acjson?'

    param = {
        'tn': 'resultjson_com',
        'logid': '8846269338939606587',
        'ipn': 'rj',
        'ct': '201326592',
        'is': '',
        'fp': 'result',
        'queryWord': 'beauty',
        'cl': '2',
        'lm': '-1',
        'ie': 'utf-8',
        'oe': 'utf-8',
        'adpicid': '',
        'st': '-1',
        'z': '',
        'ic': '',
        'hd': '',
        'latest': '',
        'copyright': '',
        'word': 'beauty',
        's': '',
        'se': '',
        'tab': '',
        'width': '',
        'height': '',
        'face': '0',
        'istype': '2',
        'qc': '',
        'nc': '1',
        'fr': '',
        'expermode': '',
        'force': '',
        'cg': 'girl',
        'pn': pn,  # index of the first image to fetch
        'rn': '30',
        'gsm': '1e',
    }
    page_text = requests.get(url=url, headers=header, params=param)
    page_text.encoding = 'utf-8'
    page_text = page_text.json()
    info_list = page_text['data']
    del info_list[-1]
    img_path_list = []
    for i in info_list:
        img_path_list.append(i['thumbURL'])
    
    for img_path in img_path_list:
        img_data = requests.get(url=img_path, headers=header).content
        img_path = './' + str(n) + '.jpg'
        with open(img_path,'wb') as fp:
            fp.write(img_data)
        n = n + 1
        
    pn += 30  # advance one full page (rn = 30); += 29 would re-download one image each page
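
Two small optional refinements (my additions, not part of the original code): keep the images in their own folder, and pause briefly between page requests so the crawler is gentler on the server.

import os
import time

os.makedirs('baidu_imgs', exist_ok=True)   # create an output folder once (hypothetical name)
# ...inside the download loop, save into that folder instead of './':
#     img_path = './baidu_imgs/' + str(n) + '.jpg'
# ...and at the end of each page iteration:
#     time.sleep(0.5)   # short pause between requests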
