"Web Crawler" should be the most popular of the 9 series "Speed and Passion!"!

Hello, I'm Talent Brother.

F9 was released recently. We went in looking forward to reliving the Fast & Furious series once more, only to be defeated by a "science fiction" and "superhero" story that is illogical and exaggerated to the extreme!

When we opened Douban a few days after the release, we found that 37,700 people had rated it, giving it a score of 5.6, the lowest in the Fast & Furious series.

What do audiences say about a film with such a reputation? Let's take a quick look at the short reviews on Douban!

Before we start the crawl, let's first look at some of the exaggerated scenes from the trailer.

Source: Recommended movies

A car fitted with a rocket booster, with space suits thrown in for fun

Racing across a collapsing wooden bridge and reaching the other side just in time

Flooring the throttle, catching the bridge's rope, and swinging safely across the cliff

Okay, is that exciting?

1. Crawler explanation

Douban only shows more short reviews once you are logged in, so log in to your account first and copy the cookie for use during the crawl.

Because a crawler for Douban short reviews is simple and there are many examples online, we only give a brief introduction to the code in three parts: requesting the web page, parsing the data, and storing the data.

In addition, we will upload the complete code to the case library of our WeChat official account; reply "955" to the account to get the runnable code.

1.1. Importing the Libraries

We use requests to fetch the data, re for regular-expression parsing, pandas for data storage, and os for file operations.

import requests
import re
import pandas as pd
import os

1.2. Request Web Page

Following the basic crawler routine (press F12 -> turn the page -> watch what changes in the requests), we found the following information:

Given how the request URL of the short reviews changes and what the relevant parameters mean, we can construct the following function to request the page data:

# Request web page data
def get_html(tid, page, headers, _type):
    """
    tid: movie id; for example, the id of Fast & Furious 9 is 25728006
    page: short-review page number, 0-24
    headers: request headers, carrying the browser User-Agent and the login cookie
    _type: review type (positive: h, mixed: m, negative: l); empty means all
    """
    url = f'https://movie.douban.com/subject/{tid}/comments?'

    params = {
        'percent_type': _type,
        'start': page*20,
        'limit': 20,
        'status': 'P',
        'sort': 'new_score',
        'comments_only': 1,
        'ck': 't9O9',
        }

    r = requests.get(url, params=params, headers=headers)
    # The response body is JSON
    data = r.json()

    html = data['html']
    # We parse with regular expressions, so strip all whitespace first
    html = re.sub(r'\s', '', html)

    return html
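To check how the page number maps onto the request URL without sending anything to Douban, the query string can be built offline. The following is a minimal sketch: `build_comment_url` is a hypothetical helper name, and the `ck` parameter is left out on the assumption that it is tied to a particular login session.

```python
import requests

def build_comment_url(tid, page, _type=''):
    """Return the URL that a request for this page would use, without sending it."""
    params = {
        'percent_type': _type,
        'start': page * 20,   # 20 reviews per page, so page 2 starts at offset 40
        'limit': 20,
        'status': 'P',
        'sort': 'new_score',
        'comments_only': 1,
    }
    url = f'https://movie.douban.com/subject/{tid}/comments'
    # Request.prepare() encodes the query string the same way requests.get would
    return requests.Request('GET', url, params=params).prepare().url

print(build_comment_url(25728006, 2, 'l'))
```

Printing the URL for a few page numbers is a quick way to compare it against what the browser's network panel shows while paging through the reviews.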

1.3. Parse data

Since we parse the data with regular expressions, we only need to locate the node that holds the data we want and write a matching pattern for it.

For example, get the evaluation content section:

comment = re.findall('"short">(.*?)</span>', html)
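As a small illustration of why the whitespace is stripped before matching, the sketch below runs the same pattern on a made-up fragment. The sample HTML is hypothetical and only shaped like what the pattern expects; it is not copied from Douban.

```python
import re

# Hypothetical fragment; note the line break inside the first review
sample = '''
<span class="short">The plot flies off
into space</span>
<span class="short">Popcorn movie, fine</span>
'''

# Same stripping as in get_html: afterwards 'class="short"' follows '<span' directly
# and no review is split across lines
html = re.sub(r'\s', '', sample)

comments = re.findall('"short">(.*?)</span>', html)
print(comments)
```

One side effect is that spaces inside the review text disappear too. That is harmless for Chinese, but it glues English words together, which may be why English phrases in the collected reviews show up without spaces.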

The complete parsing of the author, date, review text, useful-vote count, and star rating is as follows:

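# Parse data
def get_data(html):

    df = pd.DataFrame(columns=['author','date','comment','vote_count','star'])
    # Whitespace was stripped in get_html, so the patterns contain no spaces
    df['author'] = re.findall(r'<atitle="(.*?)"href', html)
    df['date'] = re.findall(r'"comment-time"title=".*?">(.*?)</span>', html)
    df['comment'] = re.findall(r'"short">(.*?)</span>', html)
    df['vote_count'] = re.findall(r'"votesvote-count">(\d+)</span>', html)
    # Star parsing is left disabled: each column is filled independently, so if any
    # review had no star rating this list would be shorter and the assignment would fail
    # df['star'] = re.findall(r'<spanclass="allstar(\d+)rating"', html)

    return df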
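Because each column above comes from a separate findall, one review with a missing field would shift or break the columns. One possible way around this, sketched below, is to split the page into per-review blocks and search each block on its own. The `comment-item` container and the sample markup are assumptions for illustration, and `parse_items` is a hypothetical name.

```python
import re

def parse_items(html):
    """Parse whitespace-stripped HTML one review at a time, so that a missing
    field becomes None instead of shifting the other columns."""
    rows = []
    # Assumption: every review starts with a <div class="comment-item" ...> block
    for block in html.split('<divclass="comment-item"')[1:]:
        def first(pattern):
            match = re.search(pattern, block)
            return match.group(1) if match else None
        rows.append({
            'author': first(r'<atitle="(.*?)"href'),
            'comment': first(r'"short">(.*?)</span>'),
            'vote_count': first(r'"votesvote-count">(\d+)</span>'),
            'star': first(r'<spanclass="allstar(\d+)rating"'),
        })
    return rows

# Hypothetical sample: the second review has no star rating
sample = (
    '<divclass="comment-item"data-cid="1"><atitle="userA"href="#"></a>'
    '<spanclass="votesvote-count">12</span>'
    '<spanclass="allstar20rating"></span>'
    '<spanclass="short">Toomuchspace</span></div>'
    '<divclass="comment-item"data-cid="2"><atitle="userB"href="#"></a>'
    '<spanclass="votesvote-count">3</span>'
    '<spanclass="short">Noratinggiven</span></div>'
)
rows = parse_items(sample)
print(rows)
```

The resulting list of dictionaries can be passed straight to pd.DataFrame if a table is needed.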

1.4. Data Storage

Here the data is stored as a CSV file, mainly because appending to it is convenient.

Before writing, we check whether the file already exists: if it does, we write in append mode without a header; otherwise we write it directly, header included.

Also, note that we set encoding='utf_8_sig'; otherwise Chinese text may appear garbled when the file is opened directly.

# Store data
def save_df(df):
    if os.path.exists('data.csv'):
        df.to_csv('data.csv',index=None,mode='a',header=None,encoding='utf_8_sig')
    else:
        df.to_csv('data.csv',index=None,encoding='utf_8_sig')
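The same "write the header only once" behaviour can also be expressed with a single to_csv call. The sketch below is one possible variant, not the code used for this post; `append_csv` and its `path` parameter are hypothetical, and a temporary directory is used so that nothing is left behind.

```python
import os
import tempfile
import pandas as pd

def append_csv(df, path):
    """Append df to path, writing the header only when the file is new."""
    is_new = not os.path.exists(path)
    df.to_csv(path, index=False, mode='a', header=is_new, encoding='utf_8_sig')

# Write two small batches to a temporary file and read them back
path = os.path.join(tempfile.mkdtemp(), 'data.csv')
append_csv(pd.DataFrame({'author': ['a', 'b'], 'comment': ['好看', '一般']}), path)
append_csv(pd.DataFrame({'author': ['c', 'd'], 'comment': ['离谱', '还行']}), path)

result = pd.read_csv(path, encoding='utf_8_sig')
print(len(result), list(result.columns))
```

Passing the path in as an argument also makes it easy to keep positive, mixed, and negative reviews in separate files.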

As for data storage, we may devote a separate post to it later, mainly on how to append data to the same sheet or to multiple sheets when storing as Excel.

1.5. Final Supplement

What remains is the parameter setup and the entry point that runs the functions, as detailed in the following code:

if __name__ == '__main__':
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36",
        # Placeholder: paste the cookie copied from your browser after logging in
        "Cookie": "your cookie here",
    }
    # Movie id; for Fast & Furious 9 it is 25728006
    tid = 25728006
    # Review type (positive: h, mixed: m, negative: l)
    _type = 'l'
    for page in range(30):
        html = get_html(tid,page,headers,_type)
        df = get_data(html)
        save_df(df)
        print(f'Page {page+1} of reviews collected..')
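The loop above asks for 30 pages, while the docstring of get_html lists pages 0-24, so the last requests may come back empty. A possible refinement, sketched here with a stand-in fetch function instead of real requests, is to stop at the first empty page and pause between pages. `crawl`, `fetch_page`, and `delay` are hypothetical names.

```python
import time

def crawl(fetch_page, max_pages, delay=0.0):
    """Call fetch_page(page) for page = 0, 1, ... and stop at the first empty page."""
    collected = []
    for page in range(max_pages):
        rows = fetch_page(page)
        if not rows:
            break
        collected.extend(rows)
        time.sleep(delay)  # pause between requests
    return collected

# Stand-in for get_html + get_data: three pages of two rows each, then nothing
def fake_fetch(page):
    return [f'review-{page}-{i}' for i in range(2)] if page < 3 else []

rows = crawl(fake_fetch, max_pages=30)
print(len(rows))
```

In a real run, fetch_page would wrap get_html and get_data, and delay would be set to a second or more to keep the request rate low.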

Result Preview:

2. Review Word Clouds

We can see that only 24% of the short reviews on Douban are positive!!!
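A word cloud is essentially a picture of word frequencies, so the counting step can be sketched with the standard library alone. This is an illustration only: the code behind the clouds in this section is not shown in the post, `top_words` is a hypothetical helper, and the reviews are made-up tokens. Real Chinese reviews would first need a word segmenter, and a plotting package would then draw the cloud.

```python
from collections import Counter

def top_words(token_lists, stopwords=(), n=3):
    """Count tokens across all reviews and return the n most frequent."""
    counts = Counter(
        token
        for tokens in token_lists
        for token in tokens
        if token not in stopwords
    )
    return counts.most_common(n)

# Hypothetical, already-tokenized reviews
reviews = [
    ['plot', 'absurd', 'space'],
    ['space', 'car', 'plot'],
    ['the', 'space', 'nostalgia'],
]
print(top_words(reviews, stopwords={'the'}, n=2))
```

Running the count separately on the files for each review type gives one frequency table per cloud.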

2.1. Positive Review Word Cloud

In the positive reviews, although people also mention the far-fetched plot, they pay more attention to the visual thrill of the racing and the trip into space, the return of Han (even though the plot logic is unclear), and the nostalgia around Paul. These are the points that attracted them. As a popcorn movie, it's OK.

One of the most upvoted positive reviews comes from a user called "the hiker", who talks about the whole series, the nonsensical but imaginative plot, and the nostalgia at the end. This is probably the shared feeling of everyone who rated the film well:

No matter which installment, I've always loved the Fast & Furious series // The storytelling is formulaic and the imagination runs wild, but each film is a benchmark of its era's film industry and the ceiling of action-movie imagination. Don't we watch movies precisely to pursue a second life that could never be possible? // When that familiar blue car shows up at the end, I can't hold back the tears. Whatever else, at least it will always stay in the hearts of fans

2.2. Mixed Review Word Cloud

What audiences mention most in the mixed reviews is that the plot is off the rails, the story logic is incomprehensible, and the set pieces are exaggerated, such as rushing into outer space and a character coming back from the dead.

In fact, most of the mixed reviews express approval laced with irony, such as:

Damn, next up is simply Fast & Furious 10: Star Wars. - One micron

Any hero movie ends up in space - no bird

If there is a card teacher's card ringing forcibly, this point should be added a little more. A cartoon beeps for an hour more than this one. Travellers in Sodoma

2.3. Negative Review Word Cloud

Negative reviews make up more than 42% of the total, and most of them roast the plot. The biggest complaint is the absurdity of rushing into outer space. Some of the nostalgia scenes in the middle even put many viewers to sleep.

That said, an F9 without Dwayne "The Rock" Johnson is, I think, another weak point. And just look at this story going all the way into outer space; it really is too much.

The review by the user "Satoshi Tsumabuki of Chaoyang District" won the approval of many viewers. It reads as follows:

Bang bang, bang bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, bang, Bang Wearefamily! End.

The negative review from the user "Ningfu's idle movies" mainly roasts the unreasonable plot settings, and it also received a lot of support:

Carsdon'tfly? It's good to be rich from Dabi Hahaha. If you can't burn it, it's absolutely the new stuff for bloggers who are spitting. More than Newton's coffin could not be covered, the pupil had to jump up and say a few sentences.

The nostalgia that the film sells seems to be amplified a lot in F9, but this nostalgia-for-nostalgia's-sake approach has also been roasted by many people:

Although Paul at the end did get to me, nostalgia can't be sold for a lifetime.

Keywords: Python crawler

Added by catnip_uk on Wed, 09 Feb 2022 01:48:18 +0200