[Crawler practice case 1] Simple implementation of crawler requirements based on BeautifulSoup + Re + Pandas

Foreword

This post shares a case of using some basic crawler techniques to scrape articles from a media website on demand and save them to a specified local folder, for reference only. While learning, do not hit the website so frequently that you bring its pages down. Act within your means!

Basic usage of BeautifulSoup: https://inganxu.blog.csdn.net/article/details/122783587

Crawling requirements

Crawl target: Construction Archives (https://www.jzda001.com), a content co-construction platform for the whole construction industry chain

Technologies involved: Requests (requesting pages), BeautifulSoup and Re (parsing), Pandas (saving)

Purpose: practising and consolidating crawler techniques only

Crawling targets:

  1. Save the title, author, release time and link of each article on the home page to an Excel file
  2. Save the body text of each article to a document in the specified location, named after the article title
  3. Save the pictures of each article to a folder in the specified location, named after the article title plus a picture serial number

Crawling approach

  1. Request the home page and get the response content
  2. Parse the home page data and extract each article's title, author, release time and link
  3. Collect the home page information and save it to Excel
  4. Request each article page and get the response content
  5. Parse each article page and extract its text and image link information
  6. Save the article text to a document in the specified location
  7. Request each image URL and get the response content
  8. Save the pictures to the folder in the specified location

Code explanation

The implementation and caveats of each step are explained below, following the order of the crawling approach above.

Step 1: import the required libraries

import requests                 # Initiate web page requests and return the response content
import pandas as pd             # Build two-dimensional data and save it to Excel
import os                       # Create folders and build file paths
import time                     # Time the crawl and slow it down if needed
import re                       # Extract content with regular expressions
from bs4 import BeautifulSoup   # Parse the response content

If any library is missing, install it as follows.

Press Win + R, type cmd to open a command prompt, and run:

pip install <missing library name>
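For example, the third-party packages used in this case can be installed in one go (os, time and re ship with Python); the Excel engine (xlwt for .xls, openpyxl for .xlsx) is a separate install if pandas asks for it:

pip install requests pandas beautifulsoup4 lxml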

Step 2: set basic anti-crawling countermeasures

"""
Record crawler time
 Set basic information for accessing web pages
"""
start_time = time.time()        
start_url = "https://www.jzda001.com"   
header = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'}

header: copy the request header (user-agent) from your own browser; if you cannot find it, just use the one above.
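As a quick sanity check (not part of the original code), you can confirm that the site accepts this header before parsing anything:

response = requests.get(url=start_url, headers=header, timeout=5)
print(response.status_code)                      # 200 means the request was accepted
response.encoding = response.apparent_encoding   # guard against garbled non-ASCII text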

Step 3: visit the home page and parse the response content

response = requests.get(url=start_url, headers=header)      # Initiate request
html = response.text                                        # Extract response content
soup = BeautifulSoup(html, 'lxml')                          # Parse response content
content = soup.find_all('div', attrs={'class': 'content-left'})[0]  # Keep only the part of the page the crawler needs

# Lists to hold the scraped information
href_list = []
title_list = []
author_list = []
releasetime_list = []


# Get article links
for href in content.find_all('a', href=re.compile('^/index/index/details')):
    if href['href'] not in href_list:
        href_list.append(href['href'])

# Get article title
for title in content.find_all('p', class_=re.compile('(twoline|sub oneline|oneline)'))[:-1]:
    title_list.append(title.text)

# Get article author and release time
for author_time in content.find_all('p', attrs={'class': 'name'}):
    if len(author_list) < 20:
        author = re.findall(r'(?<=\s)\D+(?=\s\.)', author_time.text)[0]
        author_list.append(author.replace(' ', ''))
    if len(releasetime_list) < 20:
        release_time = re.findall(r'\d{4}-\d{1,2}-\d{1,2}\s\d{1,2}:\d{1,2}:\d{1,2}', author_time.text)  # renamed from 'time' so it does not shadow the time module
        releasetime_list.append(release_time[0])

During parsing, it turns out that the extracted content has extra elements at the head or tail, so list slicing is used to drop them.
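As a minimal illustration of that idea (the values are made up, not taken from the site), slicing simply drops the unwanted elements at either end:

tags = ['nav item', 'title 1', 'title 2', 'footer item']   # hypothetical parse result
titles = tags[1:-1]                                        # drop the head and tail elements, as [:-1] does above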

Step 4: data cleaning

# Main page data cleaning
author_clear = []
for i in author_list:
    if i != '.':
        author_clear.append(i.replace('\n', '').replace(' ', ''))
author_list = author_clear
time_clear = []
for j in releasetime_list:
    time_clear.append(j.replace('\n', '').replace(' ', '').replace(':', ''))
releasetime_list = time_clear
title_clear = []
for n in title_list:
    title_clear.append(
        n.replace(': ', ' ').replace('!', ' ').replace('|', ' ').replace('/', ' ').replace('   ', ' '))
title_list = title_clear

Inspecting the page structure shows that the title, author and release time contain characters that are not allowed in file names (which would prevent creating the folders later).

So the parsed data must be cleaned before it can be used.
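A more general approach is a single regular-expression substitution that strips every character Windows forbids in file names; this helper is a sketch that could replace the chained replace() calls above, not part of the original code:

import re

def sanitize_filename(name):
    # Replace characters that are illegal in Windows file/folder names (plus line breaks) with a space
    return re.sub(r'[\\/:*?"<>|\r\n]+', ' ', name).strip()

title_list = [sanitize_filename(t) for t in title_list]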

Step 5: visit each article link, then parse and extract the response content

Comparing the href attribute in the page structure with the real article URL shows that the parsed relative path has to be joined with the site's base URL.
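Plain string concatenation with start_url works here because every parsed href begins with '/'; urllib.parse.urljoin is a slightly more robust alternative (an optional variation, not used in the original code):

from urllib.parse import urljoin

article_url = urljoin(start_url, href_list[0])   # e.g. 'https://www.jzda001.com/index/index/details...'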

 

# Access the article and save the text information and pictures to the specified folder
for nums in range(20):
    text_list = []
    img_url_list = []
    article_url = start_url + href_list[nums]
    response_url = requests.get(article_url, headers=header, timeout=1)
    html_text = response_url.text
    soup_article = BeautifulSoup(html_text, 'lxml')

    # Get sub page text (use find_all method)
    content_article = soup_article.find_all('div', attrs={'class': 'content'})[-1]
    text = ''
    for i in content_article:
        text = text + i.text
    text_list.append(text)

    # Get the picture link of the sub page (using the select method)
    soup_article_url = content_article.select('.pgc-img > img')
    for url in soup_article_url:
        img_url_list.append(url['src'])

The text has to be concatenated because the extracted content comes back paragraph by paragraph; joined together it forms the complete article text.
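BeautifulSoup can also do this join in one call: get_text() with a separator is an equivalent shortcut (a small variation on the loop above, using the same content_article tag):

    text = content_article.get_text(separator='\n', strip=True)   # all paragraphs, one per line
    text_list.append(text)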

Step 6: create nested (multi-level) folders

    # Create the nested folders
    file_name = "Building archives crawler folder"  # Folder name
    path = r'd:/' + file_name
    url_path_file = path + '/' + title_list[nums] + '/'  # Build the folder path
    if not os.path.exists(url_path_file):
        os.makedirs(url_path_file)

Because the target path is './<crawler folder name>/<article title>/', you need os.makedirs to create the nested directories in one call.

Note that the folder path ends with '/'.
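On Python 3 the existence check can be folded into the call itself: exist_ok=True makes os.makedirs a no-op when the folder already exists (a small variation on the snippet above, same variables):

    url_path_file = path + '/' + title_list[nums] + '/'   # same path as above
    os.makedirs(url_path_file, exist_ok=True)             # creates missing parents, ignores existing folders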

Step 7: save sub page pictures

Once the folders exist, start requesting each article's image links and saving the pictures.

# Save the sub-page pictures
    for index, img_url in enumerate(img_url_list):
        img_rep = requests.get(img_url, headers=header, timeout=5)  # Set a timeout
        index += 1  # 1-based picture counter
        img_name = title_list[nums] + ' picture' + '{}'.format(index)  # Picture name
        img_path = url_path_file + img_name + '.png'  # Picture path
        with open(img_path, 'wb') as f:
            f.write(img_rep.content)  # Write the binary content; the with block closes the file

Step 8: save sub page text

You could equally save the text before the pictures; just be careful with the different write modes: images are written in binary mode ('wb'), text in text mode ('w') with an encoding.

# Save the page text
    txt_name = str(title_list[nums] + ' ' + author_list[nums] + ' ' + releasetime_list[nums])  # Text file name
    txt_path = url_path_file + txt_name + '.txt'  # Text file path (url_path_file already ends with '/')
    with open(txt_path, 'w', encoding='utf-8') as f:
        f.write(str(text_list[0]))  # Text mode with utf-8 encoding; the with block closes the file

Step 9: summarize the article information

Save the title, author, release time and link of every article on the site's front page to an Excel file.

# Main page information saving
data = pd.DataFrame({'title': title_list,
                     'author': author_list,
                     'time': releasetime_list,
                     'url': href_list})
data.to_excel('{}/data.xls'.format(path), index=False)  # 'path' was set in the folder-creation step
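Note that all four lists must be the same length, otherwise the DataFrame constructor raises an error. Also, writing the legacy .xls format relies on the xlwt package, which recent pandas versions have dropped; saving as .xlsx via openpyxl is the safer choice today (a hedged alternative using the same DataFrame):

data.to_excel('{}/data.xlsx'.format(path), index=False)   # requires openpyxl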

Complete code (procedural version)

#!/usr/bin/python3.9
# -*- coding:utf-8 -*-
# @author:inganxu
# CSDN:inganxu.blog.csdn.net
# @Date: February 3, 2022

import requests                 # Initiate web page requests and return the response content
import pandas as pd             # Build two-dimensional data and save it to Excel
import os                       # Create folders and build file paths
import time                     # Time the crawl and slow it down if needed
import re                       # Extract content with regular expressions
from bs4 import BeautifulSoup   # Parse the response content

start_time = time.time()

start_url = "https://www.jzda001.com"
header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'}

response = requests.get(url=start_url, headers=header)      # Initiate request
html = response.text                                        # Extract response content
soup = BeautifulSoup(html, 'lxml')                          # Parse response content
content = soup.find_all('div', attrs={'class': 'content-left'})[0]  # Keep only the part of the page the crawler needs

# Lists to hold the scraped information
href_list = []
title_list = []
author_list = []
releasetime_list = []


# Get article links
for href in content.find_all('a', href=re.compile('^/index/index/details')):
    if href['href'] not in href_list:
        href_list.append(href['href'])

# Get article title
for title in content.find_all('p', class_=re.compile('(twoline|sub oneline|oneline)'))[:-1]:
    title_list.append(title.text)

# Get article author and release time
for author_time in content.find_all('p', attrs={'class': 'name'}):
    if len(author_list) < 20:
        author = re.findall(r'(?<=\s)\D+(?=\s\.)', author_time.text)[0]
        author_list.append(author.replace(' ', ''))
    if len(releasetime_list) < 20:
        release_time = re.findall(r'\d{4}-\d{1,2}-\d{1,2}\s\d{1,2}:\d{1,2}:\d{1,2}', author_time.text)  # renamed from 'time' so it does not shadow the time module
        releasetime_list.append(release_time[0])

# Main page data cleaning
author_clear = []
for i in author_list:
    if i != '.':
        author_clear.append(i.replace('\n', '').replace(' ', ''))
author_list = author_clear
time_clear = []
for j in releasetime_list:
    time_clear.append(j.replace('\n', '').replace(' ', '').replace(':', ''))
releasetime_list = time_clear
title_clear = []
for n in title_list:
    title_clear.append(
        n.replace(': ', ' ').replace('!', ' ').replace('|', ' ').replace('/', ' ').replace('   ', ' '))
title_list = title_clear

# Access the article and save the text information and pictures to the specified folder
for nums in range(20):
    text_list = []
    img_url_list = []
    article_url = start_url + href_list[nums]
    response_url = requests.get(article_url, headers=header, timeout=1)
    html_text = response_url.text
    soup_article = BeautifulSoup(html_text, 'lxml')
    print("Climbing to the third{}Pages!!!".format(nums + 1))

    # Get sub page text (use find_all method)
    content_article = soup_article.find_all('div', attrs={'class': 'content'})[-1]
    text = ''
    for i in content_article:
        text = text + i.text
    text_list.append(text)

    # Get the picture link of the sub page (using the select method)
    soup_article_url = content_article.select('.pgc-img > img')
    for url in soup_article_url:
        img_url_list.append(url['src'])

    # Create the nested folders
    file_name = "Building archives crawler folder"  # Folder name
    path = r'd:/' + file_name
    url_path_file = path + '/' + title_list[nums] + '/'  # Build the folder path
    if not os.path.exists(url_path_file):
        os.makedirs(url_path_file)

    # Save the sub-page pictures
    for index, img_url in enumerate(img_url_list):
        img_rep = requests.get(img_url, headers=header, timeout=5)  # Set a timeout
        index += 1  # 1-based picture counter
        img_name = title_list[nums] + ' picture' + '{}'.format(index)  # Picture name
        img_path = url_path_file + img_name + '.png'  # Picture path
        with open(img_path, 'wb') as f:
            f.write(img_rep.content)  # Write the binary content; the with block closes the file
        print('Picture {} saved successfully at:'.format(index), img_path)

    # Save the page text
    txt_name = str(title_list[nums] + ' ' + author_list[nums] + ' ' + releasetime_list[nums])  # Text file name
    txt_path = url_path_file + txt_name + '.txt'  # Text file path (url_path_file already ends with '/')
    with open(txt_path, 'w', encoding='utf-8') as f:
        f.write(str(text_list[0]))  # Text mode with utf-8 encoding; the with block closes the file

    print("The text of this page was saved successfully,The text address is:", txt_path)
    print("Crawl succeeded!!!!!!!!!!!!!!!!!!!!!!!!")
    print('\n')

# Main page information saving
data = pd.DataFrame({'title': title_list,
                     'author': author_list,
                     'time': releasetime_list,
                     'url': href_list})
data.to_excel('{}/data.xls'.format(path), index=False)
print('Saved successfully, Excel file path:', '{}/data.xls'.format(path))

print('Crawling completed!!!!')
end_time = time.time()
print('Crawl duration (seconds):', end_time - start_time)


Epilogue

Later, this case will be extended with more anti-crawling measures (IP pool, header pool, response handling) and data processing (data analysis, data visualization).
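As a preview of those extensions, a header pool plus a random delay takes only a few lines; a minimal sketch, with placeholder User-Agent strings to be replaced by real ones:

import random
import time

import requests

user_agent_pool = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',        # placeholder UA string
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...',  # placeholder UA string
]

def polite_get(url):
    header = {'user-agent': random.choice(user_agent_pool)}  # rotate the request header
    time.sleep(random.uniform(1, 3))                         # random pause to slow the crawler down
    return requests.get(url, headers=header, timeout=5)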

Reference articles

[Crawler practice case 1] Simple implementation of crawler requirements based on Requests + Xpath + Pandas - inganxu's CSDN blog

[Crawler practice case 1] Simple implementation of crawler requirements based on Scrapy + Xpath - inganxu's CSDN blog - Scrapy crawler case
