Crawler goodies II: batch-downloading MM photo sets from Mzitu

After reading this article, I believe you will develop a strong interest in web crawlers and be motivated to keep learning them. I wish you success in your studies!

Target website: Mzitu (www.mzitu.com)

Environment: Python 3.x

Related third-party modules: requests, beautifulsoup4

Note: to test it, just set the path variable in the code to a save directory on your own system, then run the script with python xxx.py or from your IDE.
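
For example, a minimal sketch of picking and creating the save directory (the paths and the platform check below are only illustrative, not part of the original script):

import os
import sys

# Illustrative defaults; replace them with a directory on your own system
if sys.platform.startswith('win'):
    path = 'E:/mzitu/'
else:
    path = os.path.expanduser('~/mzitu/')

# Create the base directory up front so that later os.chdir() calls cannot fail
os.makedirs(path, exist_ok=True)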

The complete source code is as follows:
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import os

all_url = 'https://www.mzitu.com'

# http request header
Hostreferer = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Referer': 'http://www.mzitu.com'
}
# The Referer in this header bypasses the site's hotlink protection on images
Picreferer = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Referer': 'http://i.meizitu.net'
}

# Send a request to the mzitu home page (all_url) and save the returned HTML for parsing
start_html = requests.get(all_url, headers=Hostreferer)

# Linux save address
# path = '/home/Nick/Desktop/mzitu/'

# Windows save address
path = 'E:/mzitu/'

# Get maximum pages
soup = BeautifulSoup(start_html.text, "html.parser")
page = soup.find_all('a', class_='page-numbers')
max_page = page[-2].text

# same_url = 'http://www.mzitu.com/page/'  # the home page defaults to the latest pictures
# Get the URL of each category of MM
same_url = 'https://www.mzitu.com/mm/page/'  # you can also specify the qingchun MM series

for n in range(1, int(max_page) + 1):
    # Splice all URLs of the current class MM
    ul = same_url + str(n)

    # Request the first layer url of each page of the current class
    start_html = requests.get(ul, headers=Hostreferer)

    # Extract all MM titles
    soup = BeautifulSoup(start_html.text, "html.parser")
    all_a = soup.find('div', class_='postlist').find_all('a', target='_blank')

    # Traverse all MM titles
    for a in all_a:
        # Extract the title text as the folder name
        title = a.get_text()
        if(title != ''):
            print("Ready to pick up:" + title)

            # Windows cannot create a directory whose name contains '?', so strip it and check whether the directory already exists
            if(os.path.exists(path + title.strip().replace('?', ''))):
                # print('directory already exists')
                flag = 1
            else:
                os.makedirs(path + title.strip().replace('?', ''))
                flag = 0
            # Switch to the directory created in the previous step
            os.chdir(path + title.strip().replace('?', ''))

            # Extract the url of each MM in the first layer and initiate a request
            href = a['href']
            html = requests.get(href, headers=Hostreferer)
            mess = BeautifulSoup(html.text, "html.parser")

            # Get the maximum number of pages in the second layer
            pic_max = mess.find_all('span')
            pic_max = pic_max[9].text
            if(flag == 1 and len(os.listdir(path + title.strip().replace('?', ''))) >= int(pic_max)):
                print('Saved, skip')
                continue

            # Traverse the url of each image in the second layer
            for num in range(1, int(pic_max) + 1):
                # Stitching the url of each picture
                pic = href + '/' + str(num)

                # Initiate request
                html = requests.get(pic, headers=Hostreferer)
                mess = BeautifulSoup(html.text, "html.parser")
                pic_url = mess.find('img', alt=title)
                print(pic_url['src'])
                html = requests.get(pic_url['src'], headers=Picreferer)

                # Extract picture name
                file_name = pic_url['src'].split(r'/')[-1]

                # Save picture
                f = open(file_name, 'wb')
                f.write(html.content)
                f.close()
            print('complete')
    print('Page', n, 'complete')
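
The script above fires many requests back to back with no timeout, delay, or error handling. A small sketch of a hypothetical helper (not part of the original script) that adds a timeout, a simple retry, and a polite pause between requests could look like this:

import time
import requests

def polite_get(url, headers, retries=3, delay=1):
    # Fetch a URL with a timeout, retrying a few times and pausing between attempts
    for attempt in range(retries):
        try:
            resp = requests.get(url, headers=headers, timeout=10)
            resp.raise_for_status()
            time.sleep(delay)  # be polite to the server
            return resp
        except requests.RequestException as e:
            print('Request failed (attempt', attempt + 1, '):', e)
            time.sleep(delay)
    return None

Each requests.get(...) call in the script could then be swapped for polite_get(...), checking for None before using the response.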

Step-by-step analysis of the scraping process (for interested readers):

1. Get web source code

Open the mzitu website and press F12 in the browser to inspect the request process and the page source.

This step code is as follows:

#coding=utf-8

import requests

url = 'http://www.mzitu.com'

#Set the headers. The website uses them to identify your browser and operating system; many sites refuse access without this information
header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 UBrowser/6.1.2107.204 Safari/537.36'}

#Open the url with the get method and send the headers
html = requests.get(url,headers = header)

#Print the result; .text gives the text of the response, i.e. the page source
print(html.text)

If everything works, the output will look similar to the following; this is the source code of the page.

<html>
<body>

......

        $("#index_banner_load").find("div").appendTo("#index_banner");
        $("#index_banner").css("height", 90);
        $("#index_banner_load").remove();
});
</script>
</body>
</html>

2. Extract required information

Convert the retrieved source code into a BeautifulSoup object

Use find to locate the required data and save it in a container

This step code is as follows:

#coding=utf-8

import requests
from bs4 import BeautifulSoup

url = 'http://www.mzitu.com'
header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 UBrowser/6.1.2107.204 Safari/537.36'}

html = requests.get(url,headers = header)

#Use the built-in html.parser for parsing, which is slow but universal
soup = BeautifulSoup(html.text,'html.parser')

#In fact, all the a tags inside the first div with class='postlist' are the information we are looking for
all_a = soup.find('div',class_='postlist').find_all('a',target='_blank')

for a in all_a:
    title = a.get_text() #Extract the text, i.e. the photo set title
    print(title)

This prints the titles of all the photo sets on the current page.

Note: BeautifulSoup() returns <class 'bs4.BeautifulSoup'>
find() returns <class 'bs4.element.Tag'>
find_all() returns <class 'bs4.element.ResultSet'>
A <class 'bs4.element.ResultSet'> does not support further find/find_all calls

3. Enter the second layer page to download

After clicking into a photo set, we find that each page shows one picture, so we need to know the total number of pages. For example, http://www.mzitu.com/26685 is the first page of a set, and the following pages simply append / and a number, e.g. http://www.mzitu.com/26685/2 for the second page. So we just need to find the total number of pages and then build each page URL in a loop.

This step code is as follows:

#coding=utf-8

import requests
from bs4 import BeautifulSoup

url = 'http://www.mzitu.com/26685'
header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 UBrowser/6.1.2107.204 Safari/537.36'}

html = requests.get(url,headers = header)
soup = BeautifulSoup(html.text,'html.parser')

#The maximum page number is in the span tag at index 10
pic_max = soup.find_all('span')[10].text
print(pic_max)

#Output the address of each picture page
for i in range(1,int(pic_max) + 1):
    href = url+'/'+str(i)
    print(href)

The next step is to find the image address and save it. Right-click the MM picture and choose Inspect, and you will find the following:

img src="https://i5.meizitu.net/2019/07/01b56.jpg" alt="xxxxxxxxxxxxxxxxxxxxxxxxx" width="728" height="485">

The src shown above is the actual address of the MM picture; all we have to do is save it.

This step code is as follows:

#coding=utf-8

import requests
from bs4 import BeautifulSoup

url = 'http://www.mzitu.com/26685'
header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 UBrowser/6.1.2107.204 Safari/537.36'}

html = requests.get(url,headers = header)
soup = BeautifulSoup(html.text,'html.parser')

#The maximum page number is in the span tag at index 10
pic_max = soup.find_all('span')[10].text

#Find title
title = soup.find('h2',class_='main-title').text

#Output the address of each picture page
for i in range(1,int(pic_max) + 1):
    href = url+'/'+str(i)
    html = requests.get(href,headers = header)
    mess = BeautifulSoup(html.text,"html.parser")

    #The img tag's alt attribute matches the title, so we can use it to locate the image
    pic_url = mess.find('img',alt = title)

    html = requests.get(pic_url['src'],headers = header)

    #Get the name of the picture for easy naming
    file_name = pic_url['src'].split(r'/')[-1]

    #The picture is not a text file, so write it in binary mode using html.content
    f = open(file_name,'wb')
    f.write(html.content)
    f.close()
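
As a side note, a slightly more defensive way to save each image (a sketch, not the original code) is to use a with block so the file is always closed, and to send the anti-hotlinking Referer header that the full script uses:

#Sketch: assumes pic_url and file_name are defined as in the loop above
pic_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 UBrowser/6.1.2107.204 Safari/537.36',
    'Referer': 'http://i.meizitu.net'  # bypasses the image server's hotlink protection
}
resp = requests.get(pic_url['src'], headers=pic_headers, timeout=10)
if resp.status_code == 200:
    #The with block guarantees the file is closed even if the write fails
    with open(file_name, 'wb') as f:
        f.write(resp.content)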

Learn more about Python web development, data analysis, crawlers, artificial intelligence, and more.

Keywords: Python Front-end crawler Data Analysis regex
