After reading this article, I believe you will develop a strong interest in web crawlers and be motivated to keep learning about them. I wish you success in your studies!
Target website: mzitu.com (the Meizitu picture site)
Environment: Python 3.x
Related third-party modules: requests, beautifulsoup4
Note: to try it out, just set the path variable in the code to a directory where your system can save files, then run the script with python xxx.py or from your IDE.
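As an optional aside (not part of the original script), the save directory could also be chosen and created in a cross-platform way along these lines; the folder name mzitu_test below is only a placeholder:

# coding=utf-8
# Sketch only: pick a save directory under the user's home folder and create it if needed.
# 'mzitu_test' is a placeholder name, not taken from the original script.
import os

path = os.path.join(os.path.expanduser('~'), 'mzitu_test')
os.makedirs(path, exist_ok=True)  # does nothing if the directory already exists
print('Images will be saved to:', path)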
The complete source code is as follows:
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import os

all_url = 'https://www.mzitu.com'

# HTTP request headers
Hostreferer = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Referer': 'http://www.mzitu.com'
}
# The Referer in this header bypasses the site's image hotlink protection
Picreferer = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Referer': 'http://i.meizitu.net'
}

# Request the mzitu home page (all_url) and keep the returned HTML for parsing
start_html = requests.get(all_url, headers=Hostreferer)

# Linux save path
# path = '/home/Nick/Desktop/mzitu/'
# Windows save path
path = 'E:/mzitu/'

# Get the maximum number of list pages
soup = BeautifulSoup(start_html.text, "html.parser")
page = soup.find_all('a', class_='page-numbers')
max_page = page[-2].text

# same_url = 'http://www.mzitu.com/page/'  # home page: latest albums by default
# Get the URL of each album in this category
same_url = 'https://www.mzitu.com/mm/page/'  # you can also point this at the qingchun series

for n in range(1, int(max_page) + 1):
    # Build the list-page URL for the current page number
    ul = same_url + str(n)
    # Request the first-layer URL of the current list page
    start_html = requests.get(ul, headers=Hostreferer)
    # Extract all album titles
    soup = BeautifulSoup(start_html.text, "html.parser")
    all_a = soup.find('div', class_='postlist').find_all('a', target='_blank')
    # Iterate over all album titles
    for a in all_a:
        # Use the title text as the folder name
        title = a.get_text()
        if(title != ''):
            print("Ready to scrape: " + title)
            # Windows cannot create a directory containing '?', so check after stripping it
            if(os.path.exists(path + title.strip().replace('?', ''))):
                # print('directory already exists')
                flag = 1
            else:
                os.makedirs(path + title.strip().replace('?', ''))
                flag = 0
            # Change into the directory created above
            os.chdir(path + title.strip().replace('?', ''))
            # Extract the first-layer URL of the album and request it
            href = a['href']
            html = requests.get(href, headers=Hostreferer)
            mess = BeautifulSoup(html.text, "html.parser")
            # Get the maximum number of pages in the second layer
            pic_max = mess.find_all('span')
            pic_max = pic_max[9].text
            if(flag == 1 and len(os.listdir(path + title.strip().replace('?', ''))) >= int(pic_max)):
                print('Saved, skip')
                continue
            # Iterate over the URL of each image in the second layer
            for num in range(1, int(pic_max) + 1):
                # Build the URL of each picture page
                pic = href + '/' + str(num)
                # Request it
                html = requests.get(pic, headers=Hostreferer)
                mess = BeautifulSoup(html.text, "html.parser")
                pic_url = mess.find('img', alt=title)
                print(pic_url['src'])
                html = requests.get(pic_url['src'], headers=Picreferer)
                # Extract the image file name
                file_name = pic_url['src'].split(r'/')[-1]
                # Save the image
                f = open(file_name, 'wb')
                f.write(html.content)
                f.close()
            print('complete')
    print('Page', n, 'complete')
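The script above only strips '?' from the album title before creating a directory, because Windows rejects it in folder names. As a hedged extension (not part of the original code), a small helper could strip every character Windows forbids:

# Sketch of a more general title cleaner; not part of the original script
import re

def safe_dir_name(title):
    # \ / : * ? " < > | are all invalid in Windows file and folder names
    return re.sub(r'[\\/:*?"<>|]', '', title).strip()

print(safe_dir_name('Some album title? with: bad*chars'))  # -> 'Some album title with badchars'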
Step-by-step analysis of the scraping process (for interested readers):
1. Get the web page source code
Open the mzitu website and press F12 in the browser to inspect the requests and the page source.
The code for this step is as follows:
#coding=utf-8
import requests

url = 'http://www.mzitu.com'
# Set the headers. The site uses them to judge your browser and operating system;
# many sites refuse access when this information is missing.
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 UBrowser/6.1.2107.204 Safari/537.36'}
# Open the url with a GET request and send the headers
html = requests.get(url, headers=header)
# Print the result; .text is the text of the response, i.e. the page source
print(html.text)
If everything works, the output will look something like the following; this is the source code of the page.
<html>
<body>
......
    $("#index_banner_load").find("div").appendTo("#index_banner");
    $("#index_banner").css("height", 90);
    $("#index_banner_load").remove();
});
</script>
</body>
</html>
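Before moving on to parsing, it can also help to confirm that the request actually succeeded. The check below is a small optional sketch, not part of the original tutorial code:

# coding=utf-8
# Optional sanity check on the response before parsing (not in the original tutorial code)
import requests

url = 'http://www.mzitu.com'
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 UBrowser/6.1.2107.204 Safari/537.36'}
html = requests.get(url, headers=header)

if html.status_code == 200:
    print('Request OK, received', len(html.text), 'characters of HTML')
else:
    print('Request failed with status code', html.status_code)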
2. Extract the required information
Convert the retrieved source code into a BeautifulSoup object.
Use find to locate the data we need and store it in a container.
The code for this step is as follows:
#coding=utf-8
import requests
from bs4 import BeautifulSoup

url = 'http://www.mzitu.com'
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 UBrowser/6.1.2107.204 Safari/537.36'}
html = requests.get(url, headers=header)
# Use the built-in html.parser; it is slower but works everywhere
soup = BeautifulSoup(html.text, 'html.parser')
# All <a> tags inside the first div with class='postlist' are the entries we want
all_a = soup.find('div', class_='postlist').find_all('a', target='_blank')
for a in all_a:
    title = a.get_text()  # extract the text
This gives us the titles of all the photo albums on the current page.
Note:
BeautifulSoup() returns <class 'bs4.BeautifulSoup'>
find() returns <class 'bs4.element.Tag'>
find_all() returns <class 'bs4.element.ResultSet'>
A <class 'bs4.element.ResultSet'> does not support further find/find_all calls.
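The note above can be verified offline with a tiny example; the toy HTML string below is made up for illustration and is not from the target site:

# coding=utf-8
# Demonstrate the return types of BeautifulSoup(), find() and find_all() on a toy HTML string
from bs4 import BeautifulSoup

sample = '<div class="postlist"><a target="_blank">one</a><a target="_blank">two</a></div>'
soup = BeautifulSoup(sample, 'html.parser')

print(type(soup))                                  # <class 'bs4.BeautifulSoup'>
print(type(soup.find('div', class_='postlist')))   # <class 'bs4.element.Tag'>
print(type(soup.find_all('a', target='_blank')))   # <class 'bs4.element.ResultSet'>
# A ResultSet behaves like a list of Tag objects: call find/find_all on its elements,
# not on the ResultSet itself
for a in soup.find_all('a', target='_blank'):
    print(a.get_text())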
3. Enter the second-layer pages and download the images
After clicking into an album, we find that each page shows one picture, so we need to know the total number of pages. For example, http://www.mzitu.com/26685 is the first page of an album, and later pages simply append / and a number, e.g. http://www.mzitu.com/26685/2 for the second page. So it is very simple: we just need to find the total number of pages and then build each page URL in a loop.
The code for this step is as follows:
#coding=utf-8
import requests
from bs4 import BeautifulSoup

url = 'http://www.mzitu.com/26685'
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 UBrowser/6.1.2107.204 Safari/537.36'}
html = requests.get(url, headers=header)
soup = BeautifulSoup(html.text, 'html.parser')
# The maximum page number is in the span tag at index 10
pic_max = soup.find_all('span')[10].text
print(pic_max)
# Output the address of each picture page
for i in range(1, int(pic_max) + 1):
    href = url + '/' + str(i)
    print(href)
The next step is to find the image address and save the file. Right-click the picture and choose Inspect; you will find something like this:
<img src="https://i5.meizitu.net/2019/07/01b56.jpg" alt="xxxxxxxxxxxxxxxxxxxxxxxxx" width="728" height="485">
As shown above, the src attribute is the actual address of the image; we just need to download and save it.
The code for this step is as follows:
#coding=utf-8
import requests
from bs4 import BeautifulSoup

url = 'http://www.mzitu.com/26685'
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 UBrowser/6.1.2107.204 Safari/537.36'}
html = requests.get(url, headers=header)
soup = BeautifulSoup(html.text, 'html.parser')
# The maximum page number is in the span tag at index 10
pic_max = soup.find_all('span')[10].text
# Find the album title
title = soup.find('h2', class_='main-title').text
# Visit each picture page in turn
for i in range(1, int(pic_max) + 1):
    href = url + '/' + str(i)
    html = requests.get(href, headers=header)
    mess = BeautifulSoup(html.text, "html.parser")
    # The image address is in the img tag whose alt attribute matches the title
    pic_url = mess.find('img', alt=title)
    html = requests.get(pic_url['src'], headers=header)
    # Use the last part of the URL as the file name
    file_name = pic_url['src'].split(r'/')[-1]
    # Images are binary, not text, so write html.content in 'wb' mode
    f = open(file_name, 'wb')
    f.write(html.content)
    f.close()
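For larger images, one optional variant (not part of the original tutorial) is to stream the download to disk in chunks instead of keeping the whole file in memory; the image URL below is just the example address shown earlier:

# coding=utf-8
# Sketch of a streamed download; reuses the image URL and anti-hotlink Referer from the article
import requests

pic_src = 'https://i5.meizitu.net/2019/07/01b56.jpg'
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 UBrowser/6.1.2107.204 Safari/537.36',
    'Referer': 'http://i.meizitu.net'}
file_name = pic_src.split('/')[-1]

# stream=True fetches the body lazily; iter_content yields it in fixed-size chunks
with requests.get(pic_src, headers=header, stream=True) as resp:
    with open(file_name, 'wb') as f:
        for chunk in resp.iter_content(chunk_size=8192):
            f.write(chunk)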
Learn more about Python web development, data analysis, crawlers, artificial intelligence, and more.