Crawler learning: crawling problem and notice information

1, Introduction to crawlers

1. Introduction

(1) A web crawler (English: Web Crawler or Web Spider) is a program that automatically browses web pages and collects the required information.
(2) A crawler starts from the URL of an initial web page and extracts the URLs found on it. As it grabs each page, it keeps extracting new URLs from the current page and putting them into a queue, and it does not stop until the stop condition given by the system is met (a minimal sketch of this loop follows the list).
(3) A crawler can download web page data from the Internet for a search engine, and it is an important component of search engines.
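To make the loop in (2) concrete, here is a minimal sketch of a queue-driven, breadth-first crawl. It is only an illustration of the idea, not code used later in this article; the seed URL, the page limit, and the function name are placeholders.

import requests
from bs4 import BeautifulSoup
from collections import deque
from urllib.parse import urljoin

def crawl(seed_url, max_pages=10):
    """Breadth-first crawl: fetch pages starting from seed_url,
    extract new URLs from each page, and stop at a page limit."""
    queue = deque([seed_url])  # URLs waiting to be fetched
    seen = {seed_url}          # URLs already discovered, to avoid revisiting
    fetched = 0
    while queue and fetched < max_pages:  # stop condition: page limit
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue  # skip pages that fail to download
        fetched += 1
        soup = BeautifulSoup(html, 'html5lib')
        for a in soup.find_all('a', href=True):
            link = urljoin(url, a['href'])  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen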

2. Working process of a crawler

(1) URL management module: initiate a request. Requests are generally made to the target site through an HTTP library; this is equivalent to opening a browser and entering a web address.
(2) Download module: get the response. If the requested content exists on the server, the server returns it, generally as HTML, binary files (video, audio), documents, JSON strings, etc.
(3) Parsing module: parse the content. For a user this means finding the information they need; for a Python crawler it means extracting the target information with regular expressions or other parsing libraries.
(4) Storage module: save the data. The parsed data can be saved locally in many forms, such as text, audio, and video. A minimal sketch of the whole pipeline follows this list.
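Taken together, the four modules map onto only a few lines of Python. In the sketch below the URL and the output file name are placeholders for illustration:

import csv
import requests
from bs4 import BeautifulSoup

url = 'http://example.com/page.html'          # URL management: the target address
r = requests.get(url, timeout=5)              # download: issue the HTTP request, get the response
r.raise_for_status()
soup = BeautifulSoup(r.text, 'html5lib')      # parsing: build a searchable DOM tree
rows = [[td.string] for td in soup.find_all('td') if td.string]  # extract target text
with open('data.csv', 'w', newline='') as f:  # storage: save the parsed data locally
    csv.writer(f).writerows(rows)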

2, Crawling the ACM problem set website of Nanyang Institute of Technology

1. Open the web page http://www.51mxd.cn/problemset.php-page=1.htm and inspect the page source code; the information we want to crawl sits inside td tags
2. Create a new file and write the code

import requests  # Import the web request library
from bs4 import BeautifulSoup  # Import the web page parsing library
import csv
from tqdm import tqdm  # Progress bar for the crawl loop

# Simulate browser access with a User-Agent header
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400'
}

# Header
csvHeaders = ['Question number', 'difficulty', 'title', 'Passing rate', 'Number of passes/Total submissions']

# Problem data
subjects = []

# Crawl the problem pages
print('Crawling problem information:\n')
for pages in tqdm(range(1, 11 + 1)):
    # Request the page for the current page number
    r = requests.get(f'http://www.51mxd.cn/problemset.php-page={pages}.htm', headers=headers)
    r.raise_for_status()
    r.encoding = 'utf-8'
    # Parse the page
    soup = BeautifulSoup(r.text, 'html5lib')
    # Find all td tags, which hold the problem fields
    td = soup.find_all('td')

    subject = []

    for t in td:
        if t.string is not None:
            subject.append(t.string)
            if len(subject) == 5:
                subjects.append(subject)
                subject = []

# Save the crawled problems to a CSV file
with open(r'D:\zhangyun1.csv', 'w', newline='', encoding='utf-8') as file:
    fileWriter = csv.writer(file)
    fileWriter.writerow(csvHeaders)
    fileWriter.writerows(subjects)

print('\nProblem information crawling completed!')


The generated zhangyun1.csv file contains the crawled problem data.
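As a quick sanity check, the saved file can be read back with the standard csv module (assuming the same path as in the script above):

import csv

# Print the header row plus the first five problems from the saved file
with open(r'D:\zhangyun1.csv', newline='', encoding='utf-8') as file:
    for i, row in enumerate(csv.reader(file)):
        print(row)
        if i >= 5:
            break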

3, Crawling notice information from the school's official website

1. Analyze the web page source code. First open http://news.cqjtu.edu.cn/xxtz.htm ; the information we want this time is the publication time and title of every notice

2. Press F12 to open the developer tools and locate the content to be crawled

3. Code implementation

# -*- coding: utf-8 -*-
"""
Created on Wed Nov 17 14:39:03 2021

@author: 86199
"""
import requests
from bs4 import BeautifulSoup
import csv
from tqdm import tqdm
import urllib.request, urllib.error  # Use urllib to request web page data

# All news
subjects = []

# Simulate browser access with a User-Agent header
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36 Edg/95.0.1020.53"
}

# Header
csvHeaders = ['time', 'title']


print('Crawling notice information:\n')
for pages in tqdm(range(1, 65 + 1)):
    # Make a request
    request = urllib.request.Request(f'http://news.cqjtu.edu.cn/xxtz/{pages}.htm', headers=headers)
    html = ""
    # If the request is successful, get the web page content
    try:
        response = urllib.request.urlopen(request)
        html = response.read().decode("utf-8")
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    # Parsing web pages
    soup = BeautifulSoup(html, 'html5lib')

    # Store a news item
    subject = []
    # Find all li Tags
    li = soup.find_all('li')
    for l in li:
        # Keep only li tags that contain both a time div and a title div
        # (find_all returns a list, so test for non-empty rather than None)
        if l.find_all('div', class_="time") and l.find_all('div', class_="right-title"):
            # time
            for time in l.find_all('div',class_="time"):
                subject.append(time.string)
            # title
            for title in l.find_all('div',class_="right-title"):
                for t in title.find_all('a',target="_blank"):
                    subject.append(t.string)
            if subject:
                print(subject)
                subjects.append(subject)
        subject = []

# Save data
with open(r'D:\cqjtu.csv', 'w', newline='', encoding='utf-8') as file:
    fileWriter = csv.writer(file)
    fileWriter.writerow(csvHeaders)
    fileWriter.writerows(subjects)

print('\nNotice information crawling completed!')

Open the generated cqjtu.csv file

4, Summary

This experiment mainly involved studying the example code, writing detailed comments on the key statements, and completing two exercises: crawling and saving the problem data from the ACM problem set website of Nanyang Institute of Technology ( http://www.51mxd.cn/ ), and rewriting the sample crawler so that the publication dates and titles of all recent notices on the news website of Chongqing Jiaotong University ( http://news.cqjtu.edu.cn/xxtz.htm ) are crawled and written to a CSV spreadsheet.

5, Reference link

https://blog.csdn.net/weixin_56102526/article/details/121366806?spm=1001.2014.3001.5501
