Network protocol packet capture analysis and an introduction to web crawlers

1, Introduction to crawlers

1. Concept

A web crawler (also known as a web spider or web robot, and often called a web chaser in the FOAF community) is a program or script that automatically fetches information from the World Wide Web according to certain rules. Other, less frequently used names include ant, automatic indexer, emulator, and worm.

2. Types

By system structure and implementation technology, web crawlers can be roughly divided into the following types: general purpose web crawlers, focused web crawlers, incremental web crawlers, and deep web crawlers. A real-world crawler system is usually built by combining several of these techniques.

3. Typical uses

  • As the page collector of a general search engine, crawling the whole Internet, e.g. Google or Baidu.
  • As a vertical search engine, crawling information on a specific topic, e.g. video sites or recruitment sites.
  • As a testing tool for a website's front end, used to evaluate the robustness of the front-end code.

4. Legality

Robots protocol: also known as the robot protocol or crawler protocol, it specifies which parts of a website's content search engines may fetch, including whether the site wants to be crawled at all and which content must not be crawled. Based on it, a web crawler "consciously" crawls or skips a site's content. Since its introduction, the robots protocol has become the standard international practice for websites to protect their sensitive data and the privacy of Internet users.
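Before crawling a site, a crawler can check the site's robots.txt programmatically. The following is a minimal sketch (my addition, using Python's standard-library urllib.robotparser; example.com is only a placeholder domain):

from urllib.robotparser import RobotFileParser

# Point the parser at the site's robots.txt (placeholder URL)
rp = RobotFileParser()
rp.set_url('https://www.example.com/robots.txt')
rp.read()

# Ask whether a given user agent may fetch a given path
print(rp.can_fetch('*', 'https://www.example.com/some/page.html'))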

5. Search strategy

Web page crawling strategies can be divided into depth-first, breadth-first and best-first. Depth-first search often causes the crawler to get trapped, so the common choices in practice are breadth-first and best-first.
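As an illustration of the breadth-first strategy, here is a minimal sketch (my addition, not from the original article) that crawls pages level by level with a FIFO queue, using the requests and beautifulsoup4 packages installed below; the start URL and page limit are arbitrary:

from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def bfs_crawl(start_url, max_pages=20):
    visited = set()
    queue = deque([start_url])                # FIFO queue gives breadth-first order
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            r = requests.get(url, timeout=5)
        except requests.RequestException:
            continue                          # skip pages that fail to download
        soup = BeautifulSoup(r.text, 'html.parser')
        for a in soup.find_all('a', href=True):
            link = urljoin(url, a['href'])    # resolve relative links
            if link.startswith('http') and link not in visited:
                queue.append(link)
    return visited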

6. Basic structure

  • URL management module
  • Download module
  • Parsing module
  • Storage module
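A minimal sketch (my addition; the function and variable names are illustrative, not from the original article) of how these four modules cooperate:

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls):
    todo, done = list(seed_urls), set()              # URL management module
    results = []
    while todo:
        url = todo.pop(0)
        if url in done:
            continue
        done.add(url)
        r = requests.get(url, timeout=5)             # download module
        soup = BeautifulSoup(r.text, 'html.parser')  # parsing module
        title = soup.title.string if soup.title else ''
        results.append((url, title))                 # storage module would persist this
    return results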

2, Environment preparation

Installation and configuration of Anaconda:
See the tutorial "The most detailed Anaconda installation tutorial in history".
Packages to install:
requests, beautifulsoup4, html5lib
The install commands are as follows:
pip install beautifulsoup4
pip install requests
pip install html5lib
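To confirm the installation worked, an optional quick check (my addition, not part of the original instructions) is to import the packages and print their versions:

import requests, bs4, html5lib
print(requests.__version__, bs4.__version__, html5lib.__version__)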

3, Crawling the ACM problem set of Nanyang Institute of Technology

ACM problem set website of Nanyang Institute of Technology
By viewing the page source, we can see that the data we need (problem number, title, pass rate, and so on) all sit inside td tags.
Press F12 on the page to open developer mode; searching for td shows more than 700 matching entries.
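As a quick sanity check (my addition, reusing the page URL pattern from the crawl below), we can count the td tags on the first page of the problem set:

import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.51mxd.cn/problemset.php-page=1.htm')
r.encoding = 'utf-8'
soup = BeautifulSoup(r.text, 'html.parser')
print(len(soup.find_all('td')))  # should print several hundred matches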

Create a .csv file in the location where you want to save the results.
Open Jupyter Notebook in Anaconda and write the code:

import requests
from bs4 import BeautifulSoup
import csv
from tqdm import tqdm

# Simulate browser access
Headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400'
}

# CSV header
csvHeaders = ['Problem number', 'Difficulty', 'Title', 'Pass rate', 'Passes/Total submissions']

# Problem data
subjects = []

# Crawl the problems
print('Crawling problem information:\n')
# tqdm displays the crawling progress as a progress bar
# Crawl all problem information on pages 1 to 11
for pages in tqdm(range(1, 11 + 1)):
    # GET request for page `pages`
    r = requests.get(f'http://www.51mxd.cn/problemset.php-page={pages}.htm', headers=Headers)
    # Raise an exception if the request failed
    r.raise_for_status()
    # Set the encoding
    r.encoding = 'utf-8'
    # Create a beautiful soup object to parse the html page data
    soup = BeautifulSoup(r.text, 'html.parser')
    # Get all td tags
    td = soup.find_all('td')
    # Holds the information of one problem
    subject = []
    # Traverse all td tags
    for t in td:
        if t.string is not None:
            subject.append(t.string)  # get the string inside the td
            if len(subject) == 5:     # every five values form one problem record
                subjects.append(subject)
                subject = []

# Save the problems to a csv file
with open('F:/Anaconda3/project/1.csv', 'w', newline='') as file:  # my .csv file path; change it to yours
    fileWriter = csv.writer(file)
    fileWriter.writerow(csvHeaders)  # Write header
    fileWriter.writerows(subjects)   # Write data

print('\nProblem information crawling completed!!!')

The operation results are as follows:

View the .csv file content:
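As an optional verification step (my addition; pandas ships with Anaconda), the file can be read back to confirm the data was written:

import pandas as pd

df = pd.read_csv('F:/Anaconda3/project/1.csv')  # the path used when writing above
print(df.head())
print(len(df), 'problems saved')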

4, Crawling the news website of Chongqing Jiaotong University

Chongqing Jiaotong University News Network
The code is as follows:

import requests
from bs4 import BeautifulSoup
import csv
from tqdm import tqdm


# Simulate browser access
cqjtu_Headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36 Edg/95.0.1020.44'
}
# CSV header
cqjtu_head = ["date", "title"]
# Storage for the scraped rows
cqjtu_infomation = []
def get_page_number():
    # Get the total number of pages of the news list
    r = requests.get("http://news.cqjtu.edu.cn/xxtz.htm", headers=cqjtu_Headers)
    r.raise_for_status()
    r.encoding = "utf-8"
    page_array = {
        'type': 'text/javascript'
    }
    soup = BeautifulSoup(r.text, 'html5lib')
    page_num = soup.find_all('script', page_array)
    # The script tag alone is not a unique locator, so the index is hard-coded
    page_number = page_num[4].string
    # Strip away everything around the page count
    page_number = page_number.strip('function a204111_gopage_fun(){_simple_list_gotopage_fun(')
    page_number = page_number.strip(',\'a204111GOPAGE\',2)}')
    page_number = int(page_number)  # convert to an integer
    return page_number
def get_time_and_title(page_num, cqjtu_Headers):  # page number, request headers
    # Get the date and title of every news item on one page
    if page_num == 69:
        # The newest page uses a different URL
        url = 'http://news.cqjtu.edu.cn/xxtz.htm'
    else:
        url = f'http://news.cqjtu.edu.cn/xxtz/{page_num}.htm'
    r = requests.get(url, headers=cqjtu_Headers)
    r.raise_for_status()
    r.encoding = "utf-8"
    # Select by class
    array = {
        'class': 'time',
    }
    title_array = {
        'target': '_blank'
    }
    soup = BeautifulSoup(r.text, 'html.parser')
    time = soup.find_all('div', array)
    title = soup.find_all('a', title_array)
    temp = []
    for i in range(0, len(time)):
        # Remove the surrounding whitespace and newlines
        time_s = time[i].string.strip()
        temp.append(time_s)
        # Note the +1 offset between the title list and the time list
        temp.append(title[i + 1].string)
        cqjtu_infomation.append(temp)
        temp = []
def write_csv(cqjtu_info):
    # Write the collected rows to a csv file
    with open('F:/Anaconda3/project/2.csv', 'w', newline='', encoding='utf-8') as file:
        fileWriter = csv.writer(file)
        fileWriter.writerow(cqjtu_head)
        fileWriter.writerows(cqjtu_info)
        print('Crawling information succeeded')

page_num = get_page_number()  # get the number of pages
for i in tqdm(range(page_num, 0, -1)):
    get_time_and_title(i, cqjtu_Headers)
write_csv(cqjtu_infomation)

Operation results:

View the .csv file:

The content is displayed as garbled text, which is probably a character-set problem; opening it with Notepad instead shows it correctly.
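Alternatively (an assumed fix on my part, not tested in the original article), writing the file with the utf-8-sig encoding adds a byte-order mark so that Excel recognizes the UTF-8 content directly:

def write_csv(cqjtu_info):
    # utf-8-sig prepends a BOM so Excel detects the encoding correctly
    with open('F:/Anaconda3/project/2.csv', 'w', newline='', encoding='utf-8-sig') as file:
        fileWriter = csv.writer(file)
        fileWriter.writerow(cqjtu_head)
        fileWriter.writerows(cqjtu_info)
        print('Crawling information succeeded')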

Information crawling succeeded

