Implementing a Baidu Image Crawler in Python

Using Baidu Images ("Discover the Colorful World") as an Example

1. First, analyze the website

Open the Baidu Images site; the page looks like this:

You can see the pictures we want to crawl right on the page, and we could download them one by one by hand, but that would be far too slow. So let's press F12 to open the browser's developer tools.

As follows:

2. Write the Code

1. Get headers

To make the website believe our crawler is an ordinary human visitor, we send browser-like headers with every request. This simulates normal browsing and reduces the chance of the site blocking our IP and cutting off our access.

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0', 'X-Requested-With': 'XMLHttpRequest'}
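A quick offline way to confirm the header will actually be sent is to build a prepared request and inspect it before sending (no network needed; the target URL here is just illustrative):

```python
import requests

# Build (but do not send) a request and check the attached User-Agent.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) '
                         'Gecko/20100101 Firefox/95.0'}
req = requests.Request('GET', 'https://image.baidu.com/', headers=headers)
prepared = req.prepare()
print(prepared.headers['User-Agent'])
```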

2. Get the web page URL

After analysis, the plan is: visit the initial page, find the URL of the target request inside it, visit that URL, locate the concrete image links in its response, and then download the pictures.

The string following GET in the screenshot is the crucial URL; that is what we copy out.
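Incidentally, the long query string does not have to be copied by hand; it can be rebuilt from its parameters. A sketch, keeping only the parameters the crawler below actually varies (the keyword, pn, and rn) plus a few fixed ones; the full browser request carries many more:

```python
from urllib.parse import urlencode

def build_acjson_url(keyword, pn=0, rn=30):
    """Assemble an acjson request URL for a keyword (abridged parameter set)."""
    params = {
        'tn': 'resultjson_com',
        'ipn': 'rj',
        'ct': '201326592',
        'fp': 'result',
        'queryWord': keyword,
        'word': keyword,
        'ie': 'utf-8',
        'oe': 'utf-8',
        'pn': pn,  # offset of the first result in this page
        'rn': rn,  # results per page
    }
    return 'https://image.baidu.com/search/acjson?' + urlencode(params)

print(build_acjson_url('rose', pn=90))
```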

3. Get the source of the web page

Once we have the URL, we can try to fetch the page's source:

import requests  # import the requests module

url = ('https://image.baidu.com/search/acjson?tn=resultjson_com&logid=8304080854454513979767441'
       '&ipn=rj&ct=201326592&is=&fp=result&queryWord=rose&cl=2&lm=-1&ie=utf-8&oe=utf-8'
       '&adpicid=&st=-1&z=&ic=0&hd=&latest=&copyright=&word=rose&s=&se=&tab=&width=&height='
       '&face=0&istype=2&qc=&nc=1&fr=&expermode=&nojc=&pn=90&rn=30&gsm=5a')

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0'}

response = requests.get(url, headers=headers)  # request the page
html = response.text
print(html)

The output is the page's source, which in this case is a JSON response.
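Note that the "source" here is not HTML but JSON. The code below relies on one assumption about its shape: a top-level data list whose entries carry a thumbURL field. A minimal offline parsing sketch against a hand-made sample with that shape:

```python
import json

# Hand-made sample mimicking the shape of Baidu's acjson response.
sample = ('{"data": ['
          '{"thumbURL": "https://img0.baidu.com/it/u=1,2&fm=26"},'
          '{"thumbURL": "https://img1.baidu.com/it/u=3,4&fm=26"},'
          '{}]}')  # the last entry is often empty, so guard for missing keys
info = json.loads(sample)
links = [item['thumbURL'] for item in info['data'] if 'thumbURL' in item]
print(links)
```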

4. Get the image links

Next we request the target URL and extract the image links from the JSON it returns:

# Crawl the image links
import requests  # simulate GET requests
import json      # decode the JSON response

# 1. First simulate the GET request with the parameters found above
url = ('https://image.baidu.com/search/acjson?tn=resultjson_com&logid=8304080854454513979767441'
       '&ipn=rj&ct=201326592&is=&fp=result&queryWord=rose&cl=2&lm=-1&ie=utf-8&oe=utf-8'
       '&adpicid=&st=-1&z=&ic=0&hd=&latest=&copyright=&word=rose&s=&se=&tab=&width=&height='
       '&face=0&istype=2&qc=&nc=1&fr=&expermode=&nojc=&pn=90&rn=30&gsm=5a')
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36 Edg/96.0.1054.43'}

res = requests.get(url, headers=headers)  # fetch the JSON page
# print(res.text)                         # uncomment to inspect the raw response
jsonInfo = json.loads(res.text)           # decode the JSON string into a Python dict

for index in range(30):                   # each response carries 30 results
    print(jsonInfo['data'][index]['thumbURL'])

Once we have these links, we can fetch each picture directly through them.

The output is the list of image links:

https://img0.baidu.com/it/u=3790394977,1650517858&fm=26&fmt=auto
https://img1.baidu.com/it/u=1929383178,891687862&fm=26&fmt=auto
https://img0.baidu.com/it/u=82069894,371572825&fm=26&fmt=auto
https://img1.baidu.com/it/u=3143784284,2910826804&fm=26&fmt=auto

Paste one of these links into the browser and we get the following result:
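In code, fetching such a link is just another requests.get(link, headers=headers): the raw image bytes come back in res.content and get written out in binary mode. An offline sketch of the write step (placeholder bytes stand in for a real response body):

```python
import os
import tempfile

# Placeholder standing in for res.content from a real requests.get() call.
image_bytes = b'\xff\xd8\xff\xe0' + b'placeholder-jpeg-data'

path = os.path.join(tempfile.gettempdir(), 'baidu_demo.jpg')
with open(path, 'wb') as f:  # "wb": write bytes, overwriting any existing file
    f.write(image_bytes)

print(os.path.getsize(path))  # number of bytes written
```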

Of course, the crucial step in all of this was picking out the right GET request and its URL.

After the analysis above, our programming approach is clear.
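One detail worth settling first: the search keyword may be Chinese, so it must be percent-encoded with urllib.parse.quote before being substituted into the URL template, which is exactly what the full program below does. For example:

```python
from urllib import parse

# Chinese keywords must be percent-encoded for the URL;
# plain ASCII passes through unchanged.
print(parse.quote('玫瑰'))  # "rose" in Chinese
print(parse.quote('rose'))
```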

Now we can write the program.

The overall code is as follows:

# Crawl pictures
import requests           # simulate HTTP requests
import json               # lightweight data-interchange format, easy to read and write
from urllib import parse  # URL parsing, merging, encoding, decoding
import os                 # file and directory operations
import time               # sleep between downloads


class BaiduImageSpider(object):  # the spider class
    def __init__(self):

        self.directory = r"H:\Python\Crawler Code\First crawl\images{}"  # storage directory -- change this to where you want to save, but keep the trailing {}
        self.json_count = 0  # number of JSON pages to request (each page holds 30 images)
        self.url = 'https://image.baidu.com/search/acjson?tn=resultjson_com&logid=5179920884740494226&ipn=rj&ct' \
                   '=201326592&is=&fp=result&queryWord={' \
                   '}&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=0&hd=&latest=&copyright=&word={' \
                   '}&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&expermode=&nojc=&pn={' \
                   '}&rn=30&gsm=1e&1635054081427= '
        self.header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/95.0.4638.54 Safari/537.36 Edg/95.0.1020.30 '
        }

    # Get the image links
    def get_image_link(self, url):
        list_image_link = []  # list that will hold the image links
        strhtml = requests.get(url, headers=self.header)  # fetch the JSON page
        jsonInfo = json.loads(strhtml.text)  # decode the JSON string into a Python dict
        # collect the image links into the list
        for index in range(30):
            # append each thumbnail URL from jsonInfo to the list
            list_image_link.append(jsonInfo['data'][index]['thumbURL'])
        return list_image_link

    # Create the storage folder
    def create_directory(self, name):
        self.directory = self.directory.format(name)  # complete the folder name
        # create the directory if it does not exist
        if not os.path.exists(self.directory):
            os.makedirs(self.directory)  # create it with the os module
        self.directory += r'\{}'

    # Download the pictures
    def save_image(self, img_link, filename):
        # img_link is the link to one image
        res = requests.get(img_link, headers=self.header)  # GET request returning a response object
        if res.status_code == 404:
            print(f"Picture {img_link} failed to download ------->")
            return  # skip writing a file for a missing image
        with open(filename, "wb") as f:  # overwrite, writing in binary mode
            f.write(res.content)  # res.content holds the binary image data
            print("Storage path: " + filename)

    # Entry function
    def run(self):
        name = input("Pictures you want: ")
        searchName_parse = parse.quote(name)  # percent-encode the keyword (e.g. Chinese) for the URL

        self.create_directory(name)  # create a folder named after the query

        pic_number = 0  # number of images downloaded so far
        for index in range(self.json_count):
            pn = (index + 1) * 30  # pn is the page offset; each page holds 30 images
            # the URL template stays the same; substituting the encoded keyword
            # and pn yields the request for each page of results
            request_url = self.url.format(searchName_parse, searchName_parse, str(pn))
            # str() converts pn to a string so all three placeholders get strings

            list_image_link = self.get_image_link(request_url)  # fetch this page's image links
            for link in list_image_link:
                pic_number += 1
                self.save_image(link, self.directory.format(str(pic_number) + '.jpg'))
                time.sleep(0.2)  # sleep 0.2 s between downloads to avoid an IP ban
        print(name + " images downloaded successfully")
        print("Pictures stored under {}".format(self.directory))


if __name__ == '__main__':  # run only when executed directly, not when imported as a module
    spider = BaiduImageSpider()
    spider.json_count = 1  # download one page of 30 pictures by default
    spider.run()
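To fetch more than one page, raise json_count: the loop in run() issues json_count requests, and the offset handed to the URL template is pn = (index + 1) * 30. The paging arithmetic in isolation:

```python
# json_count pages of rn=30 images each; pn is the offset substituted
# into the URL template for every request, exactly as run() computes it.
json_count = 3
offsets = [(index + 1) * 30 for index in range(json_count)]
print(offsets)  # → [30, 60, 90]
```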


Added by sunder on Tue, 14 Dec 2021 20:15:54 +0200