Python crawler: scraping the Douban Top 250 movies

Preface

Lately I have had a strong hunch that I will soon need a crawler. After spending two days learning a little Python crawling, I wanted a real case to practice on. Unfortunately for Douban, it was the one I picked; who told it to serve server-rendered HTML, which is so much more interesting to parse?
Sorry, Douban

Preparation

The packages used are

requests

Used to send requests and receive results

re

Used for regular expressions

csv

Used to write the scraped data into a CSV file, which makes it easier to read back during later data processing

Approach

Let's begin.
Note: what follows is the train of thought, not the order in which the code was written; it starts from the overall picture, and the complete, directly runnable code is attached at the end.
1, Reading the page information
1. First, get the URL of the page to crawl. Here it is Douban:
Douban Movie Top 250
2. Viewing Douban's page source shows that the page is HTML rendered on the server, so regular expressions will work (there are many possible approaches here; regex is simply the first one to try)
3. Chrome's built-in packet-capture tool (the developer tools' Network panel) shows that the request method is GET

4. Because Douban is crawled so often, it has anti-crawling measures in place 🤣. Fortunately it only checks the UA: find the User-Agent that an ordinary browser sends when visiting Douban and put it into the requests headers. It is in the same captured request, at the bottom, labeled User-Agent
5. Next, write the precompiled regular expression with re.compile. For a large site like Douban, the front-end code follows strict conventions: for example, the div holding each movie's information has a class that differs from the other page components, which makes regex matching much easier.
In this page, every movie name sits inside a span tag with a specific class, so that span plus its class is enough to locate and extract the name. In Python's re module, named groups let the results be read out as a dictionary, where

?P<name>

is the key in the dictionary, and the text matched from the source code becomes the value

<span class="title">(?P<name>.*?)</span>

It is worth noting that "." in a regex matches any character except a line break, while HTML source is full of line breaks. Fortunately, re.compile accepts the re.S (DOTALL) flag, which makes "." match "\n" as well.
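Both points, the named group and the re.S flag, can be checked with a tiny self-contained demo (the HTML fragment below is made up for illustration, not Douban's actual markup):

```python
import re

# A toy HTML fragment with a line break inside the tag we want to match
html = '<span class="title">\nThe Shawshank Redemption</span>'

# Without re.S, "." stops at the newline, so the pattern fails to match
plain = re.compile(r'<span class="title">(?P<name>.*?)</span>')
print(plain.search(html))  # None

# With re.S, "." also matches "\n", so the match succeeds
dotall = re.compile(r'<span class="title">(?P<name>.*?)</span>', re.S)
match = dotall.search(html)

# groupdict() returns the named groups as a dictionary
print(match.groupdict())  # {'name': '\nThe Shawshank Redemption'}
```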
6. Like most websites, Douban paginates its results, but comparing the URLs of pages 2 and 3 only reveals a start parameter. I guessed there might be an end parameter too, but stuck with the fiddlier start parameter as an exercise, generating the offsets with a for loop:

# The URL only exposes the starting offset of each page,
# so loop over the offsets 0, 25, ..., 225 directly
for i in range(0, 250, 25):
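As a quick sanity check (the page-number arithmetic here is just an illustration, not part of the original code), the loop produces exactly ten offsets, one per page of 25 movies:

```python
# Each Douban Top 250 page shows 25 entries; the "start" query parameter
# is the index of the first entry on that page
starts = list(range(0, 250, 25))
print(len(starts))   # 10 pages in total
print(starts[:3])    # [0, 25, 50]
print(starts[-1])    # 225

# A human-readable page number can be derived from the offset if needed
page_numbers = [start // 25 + 1 for start in starts]
print(page_numbers[0], page_numbers[-1])  # 1 10
```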

2, Writing the CSV file
1. There is not much to this part, but a few parameters deserve attention. mode uses 'a+', i.e. append, rather than 'w', i.e. overwrite, because the page data is fetched and written loop by loop rather than requested all at once. On Windows, encoding='utf-8' must be set explicitly, otherwise the default is gbk. Finally, newline='' is needed because open() performs its own newline translation while csv.writer also emits line endings, which would leave a blank line after every row; newline='' hands line-ending control entirely to the csv writer.

# Open the csv file that will receive the data
    f = open('data.csv', mode='a+', encoding='utf-8', newline='')
    # Create the writer
    csv_writer = csv.writer(f)
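These parameters can be verified with a small self-contained sketch; the file path and movie rows below are made up for the demo:

```python
import csv
import os
import tempfile

# Made-up path and rows, standing in for data.csv and the scraped data
path = os.path.join(tempfile.mkdtemp(), 'demo.csv')

# newline='' hands line-ending control to csv.writer; mode='a+' appends,
# so repeated loop iterations keep adding rows instead of overwriting
with open(path, mode='a+', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Movie A', '1994', '9.7'])
    writer.writerow(['Movie B', '1993', '9.6'])

# Read the rows back to confirm no blank lines crept in
with open(path, encoding='utf-8', newline='') as f:
    rows = list(csv.reader(f))
print(rows)  # [['Movie A', '1994', '9.7'], ['Movie B', '1993', '9.6']]
```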

2. Finally, be sure to close both the request and the file stream:

# Close the connection to avoid being blacklisted
    resp.close()
# Close file stream
    f.close()
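As an aside, a with block is a common alternative that closes the resource automatically; this sketch uses a temporary file rather than the real data.csv:

```python
import csv
import os
import tempfile

# A with block closes the file even if an exception is raised mid-loop,
# so the explicit f.close() is no longer needed
with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', newline='',
                                 delete=False) as f:
    csv.writer(f).writerow(['name', 'year', 'score'])

print(f.closed)  # True: closed automatically on leaving the block
os.remove(f.name)  # tidy up the temporary file
```

Recent versions of requests support the same pattern for responses, e.g. `with requests.get(url, headers=headers) as resp:`, which closes the connection on exit.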

Finally, the complete code is attached, with plenty of comments.

# requests fetches the page source code
import requests
# re is used to further extract effective information
import re
# csv processing
import csv

# url to crawl
url = 'https://movie.douban.com/top250'
# Set UA of request header
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'
}
# Precompiled re
obj = re.compile(
    r'<li>.*?<div class="item">.*?<span class="title">(?P<name>.*?)</span>'
    r'.*?<p class="">.*?<br>(?P<year>.*?)&nbsp.*?<span class="rating_num" property="v:average">(?P<score>.*?)</span>.*?'
r'<span>(?P<num>.*?)人评价</span>',  # the page's literal text is the Chinese "人评价" ("people rated")
    re.S)
# Set the received csv file
f = open('data.csv', mode='a+', encoding='utf-8', newline='')
# The URL only exposes the starting offset of each page, so loop over the offsets 0, 25, ..., 225 directly
for i in range(0, 250, 25):
    # Request parameters
    data = {
        # start=25&filter=
        'start': i,
        'filter': ''
    }
    # get data
    resp = requests.get(url, headers=headers, params=data)
    # Convert to string
    page_content = resp.text
    # Close the connection to avoid being blacklisted
    resp.close()
    # Iterate over the regex matches
    ret = obj.finditer(page_content)
    # Create the writer
    csv_writer = csv.writer(f)
    # Loop over the matches (use a distinct name so the outer loop's i is not shadowed)
    for match in ret:
        dic = match.groupdict()
        dic['year'] = dic['year'].strip()
        csv_writer.writerow(dic.values())
# Close file stream
f.close()
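One optional extension, not part of the code above: writing a header row once makes the file self-describing, and csv.DictReader can then map each row back onto the same keys that the regex's groupdict() produced. The path and sample row below are hypothetical:

```python
import csv
import os
import tempfile

# Hypothetical file standing in for data.csv
path = os.path.join(tempfile.mkdtemp(), 'data.csv')

header = ['name', 'year', 'score', 'num']
with open(path, mode='w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)                             # header row, written once
    writer.writerow(['Movie A', '1994', '9.7', '2500000'])

# DictReader maps each data row onto the header names,
# mirroring the groupdict() keys used when writing
with open(path, encoding='utf-8', newline='') as f:
    rows = list(csv.DictReader(f))
print(rows[0]['name'], rows[0]['score'])  # Movie A 9.7
```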

When drunk, I don't know the sky is in the water. The boat is full of clear dreams and the Star River is pressed.

Keywords: Python crawler

Added by blen on Sun, 24 Oct 2021 05:08:13 +0300