Crawler: Douban top250 movie information

requirement:

Get the movie information of Douban top250, including:

ranking

name

country

particular year

director

score

Number of raters

type

Slogan

Save the movie information in a file called movie Txt file

Idea:

1. Get the source code with requests

2. Obtain information with re

3. Store the obtained data in movie Txt

Example browser: chrome

1. Forerunner warehousing, define url(requests should be installed in advance, instruction pip install requests, how to download Baidu without pip)

import requests 
import re
import os 
import time
# os and time modules will be used at that time
url = "https://movie.douban.com/top250"

2. Open the browser, enter the Douban page (movie.douban.com/top250), press f12, and the following interface will appear:

Then, locate the network and click Open:

 

Click an item randomly (refresh if not), find the user agent in the "header" and save it. The next step is to use:

 

 3. Because Douban has an anti crawling mechanism, we need to manually specify headers: the code is also very simple. Just copy the above user agent (the headers of different systems are different and need to be viewed by ourselves)

headers = {
        "user-agent":
        "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36"
    }

Next, we need to check the request method of the web page. The request method is divided into get and post, which is shown in the user agent

You can see that Douban is the get request method. OK, let's get the web page source code requests Get (URL address, your headers). Because the return value of the get method is not text, we use the text method + a variable to save the source code text

resp = requests.get(url, headers=headers)
page_content = resp.text # page_content save source code

 

3. After the above methods are completed, the most difficult regular expression of re will come, but I will try to make it clear~

Go back to the browser, refresh the web page, and pull up to see an item called "top250". Open it

After opening, click "response", switch to the web page source code, and press ctrl+shift+f to open the search box

 

Then, let's look at the regular symbols we'll encounter this time

(one)
.*? This identifier will not be understood as jumping from the content in front of the identifier to the content behind the identifier
 Take a chestnut:
"Brother, do you have time to play games?Is brother there,Play games,It's on tonight?Hello, Hello, hello,Are you there,Play the game?"
here,If your regular expression is: brother.*?Game words
 That matches:
Brother has time to play games
 Not anything else,Even in the back? They don't match
 Can you understand?
 
(two)
?P This identifier is easy to understand,Just think of it as a data type,Format is(?P<name>content)
Take the example above,If you want to get the identifier.*?Content in,that is"Call when you have time"This string,Then we can write that:
brother(?P<string>.*?)game
 So our<string>The middle part is successfully stored in the middle part~
As for how to output, I'll talk about it later

After understanding, let's get the information of the movie and search the name of a movie in the search box (here I take the first one as an example):

 

Then, we found that all the data of the film is here. Using the previous two identifiers, we can write these regularities

I wrote it for reference only In order to pretend to force me to have a clearer idea, I wrote more than 100 million unnecessary words

obj = re.compile('<li>.*?<div class="item">.*?<span class="title">(?P<name>.*?)</span>.*?'
                 '<div class="bd">.*?<p class="">(?P<dy>.*?)&nbsp;.*?<br>(P<year>.*?)&nbsp'
                 '.*?/&nbsp;(?P<country>.*?)&nbsp;.*?&nbsp;(?P<lx>.*?)</p>'
                 '.*?<span class="rating_num" property="v:average">(?P<score>.*?)</span>'
                 '.*?<span>(?P<people>.*?)</span>'
                 '.*?</div>.*?<span class="inq">(?P<xcy>.*?)</span>', re.S)

Use the finder method in obj to obtain the result, and save the same variable

result = obj.finditer(page_content)

When writing to a file, to access variables such as name and dy in obj, use group("variable name") method

Here, I defined an s=0 at the beginning to calculate the ranking

  with open("movie.txt", mode="a", encoding="utf-8") as f:
        for it in result:
            s+=1
            f.write("ranking:"+str(s)+"\n")
            f.write("Movie title:"+it.group("name").strip()+"\n")
            f.write("country:"+it.group("country").strip()+"\n")
            f.write("particular year:"+it.group("year").strip()+"\n")
            f.write(it.group("dy").strip()+"\n")
            f.write("score:"+it.group("score").strip()+"\n")
            f.write("Number of raters:"+it.group("people").strip()+"\n")
            f.write("type:"+it.group("lx").strip()+"\n")
            f.write("Slogan:"+it.group("xcy").strip()+"\n")
            f.write("----------------------------------\n")
            f.write("\n")
            print("Successfully crawled", s,  "Item movie information")

Don't forget to turn off resp, or it will be blocked

resp.close()

But then we will find a problem in movie Txt contains only one page. At this time, don't think you have a problem with your code. Switch to the next page to see the address bar:

 

See? The start value increases by 25 for each page, so let's rewrite the beginning and remember to indent the following code

for i in range(10):
    a = i * 25
    url = "https://movie.douban.com/top250?start="+str(a)+"&filter="

 

That's no problem!

Then, we just need to polish the code a little

Because we use the "a" append mode when writing, if we write multiple times, the information will explode, so our os module works. Use os path. The exists method determines whether the file exists and deletes it if it exists

if not os.path.exists("./movie.txt"):
    pass
else:
    os.remove("./movie.txt")

Because I'm used to running programs on the terminal, I added OS System ("CLS"), please use clear for Linux. Then, in order to make the program run at the beginning, I added a few sentences at the beginning

os.system("cls")
print("Start crawling for watercress in 3 seconds top250 Movie information...")
time.sleep(1)
print("3...")
time.sleep(1)
print("2...")
time.sleep(1)
print("1...")
time.sleep(1)

So far, the program is finished! Full code:

import requests
import re
import os
import time

s = 0
os.system("@echo off")

# Delete files to avoid duplicate text

if not os.path.exists("./movie.txt"):
    pass
else:
    os.remove("./movie.txt")
    
os.system("cls")
print("Start crawling for watercress in 3 seconds top250 Movie information...")
time.sleep(1)
print("3...")
time.sleep(1)
print("2...")
time.sleep(1)
print("1...")
time.sleep(1)

for i in range(10):
    a = i * 25
    url = "https://movie.douban.com/top250?start="+str(a)+"&filter="
    headers = {
        "user-agent":
        "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36"
    }
    resp = requests.get(url, headers=headers)

    page_content = resp.text # page_content save source code

    obj = re.compile('<li>.*?<div class="item">.*?<span class="title">(?P<name>.*?</span>.*?'
                 '<div class="bd">.*?<p class="">(?P<dy>.*?)&nbsp;.*?<br>(?P<year>.*?)&nbsp'
                 '.*?/&nbsp;(?P<country>.*?)&nbsp;.*?&nbsp;(?P<lx>.*?)</p>'
                 '.*?<span class="rating_num" property="v:average">(?P<score>.*?)</span>'
                 '.*?<span>(?P<people>.*?)</span>'
                 '.*?</div>.*?<span class="inq">(?P<xcy>.*?)</span>', re.S)

    result = obj.finditer(page_content)

    with open("movie.txt", mode="a", encoding="utf-8") as f:
        for it in result:
            s+=1
            f.write("ranking:"+str(s)+"\n")
            f.write("Movie title:"+it.group("name").strip()+"\n")
            f.write("country:"+it.group("country").strip()+"\n")
            f.write("particular year:"+it.group("year").strip()+"\n")
            f.write(it.group("dy").strip()+"\n")
            f.write("score:"+it.group("score").strip()+"\n")
            f.write("Number of raters:"+it.group("people").strip()+"\n")
            f.write("type:"+it.group("lx").strip()+"\n")
            f.write("Slogan:"+it.group("xcy").strip()+"\n")
            f.write("----------------------------------\n")
            f.write("\n")
            print("Successfully crawled", s,  "Item movie information")
    resp.close()
os.system("cls")
print("Crawling completed,Please enter movie.txt see")

New blogger, welcome to join us

Keywords: Python crawler request

Added by hayson1991 on Sun, 30 Jan 2022 01:28:09 +0200