Crawling the movie reviews of Water Gate Bridge to generate data visualizations and word clouds

I. Crawling Movie Reviews

To analyze the data on the Renyin-year (2022) Spring Festival film "The Battle at Lake Changjin: Water Gate Bridge", more than 40,000 reviews were crawled from Maoyan (cat's eye) using a web crawler.

1. To keep our own address from being banned, a pool of proxy addresses is used for crawling:

To build the proxy pool, you can get free proxy addresses from websites such as the following:

Kuaidaili - open proxies / fast proxy

66 Free Proxy IP - free HTTP server proxies, latest IP extraction, domestic and overseas proxies

89 Free Proxy IP - a completely free, high-quality HTTP proxy IP supply platform

Cloud Proxy - a high-quality HTTP proxy IP supply platform that shares a large number of free proxy IPs every day

 

We only need the proxy IP address and port, saved as a list like the one below.
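A minimal sketch of that list (the addresses below are placeholders; replace them with live IP:port pairs taken from the sites above, picking proxies that support HTTPS since Maoyan is served over HTTPS):

# Proxy pool: each entry is a dict in the format the requests library expects
# for its proxies argument; the values here are placeholders, not real proxies
items = [
    {"https": "https://111.111.111.111:8888"},
    {"https": "https://222.222.222.222:9999"},
]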

2. Comments are crawled from Maoyan by simulating the mobile-site request address: https://m.maoyan.com/mmdb/comments/movie/1446115.json?_v_=yes&offset=0

Here 1446115 is the movie id: open the movie you want to crawl on the Maoyan website and the id appears in the address bar.

Then, in the request headers, set 'User-Agent': UserAgent().random to reduce the chance of being blocked. Maoyan returns comments starting from the latest date, 15 entries per request; without a startTime only about 1,000 entries can be fetched, so we pass the date of the last comment retrieved and keep paging backwards to earlier dates.

Fill the items list with your own proxy addresses; the code for crawling comments is as follows:

import requests
import json
import pandas as pd
import random
from fake_useragent import UserAgent
import time,datetime
import openpyxl

items = [{"https": "https://111.111.111.111:8888"}]   # Proxy pool: fill in your own addresses (format shown above)

ran_time = random.random()   # Random delay between requests
now_time = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')   # Current time (starting point of the crawl)
end_time = "2022-02-01 08:00:00"    # Film release time; stop once comments are older than this

tomato = pd.DataFrame(columns=['date', 'score', 'city', 'comment', 'nick'])
def startTime(time):
    if time > end_time:  # Keep the time as a string; a timestamp-format URL does not return the 'cmts' comment field
        run(time)

def run(date):
    global id  # Define as global variable
    global starttime
    global tomato
    for i in range(67):  # Page index; only about the first 1000 comments are returned per start time (67 * 15 ≈ 1000)
        # Replace the space in the time with %20, otherwise the URL is invalid
        proxies = random.choice(items)
        url = 'https://m.maoyan.com/mmdb/comments/movie/1446115.json?_v_=yes&offset={}&startTime={}'.format(i * 15, date.replace(' ', '%20'))
        print(url)
        headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'Host': 'm.maoyan.com',
            # 'Referer': 'http://m.maoyan.com/movie/1446115/comments?_v_=yes',
            'Connection': "keep-alive",
            'Cookie': '_lxsdk_cuid=17ebd6fcb4396-0232fccb2b266b-f791539-1fa400*****',
            'User-Agent': UserAgent().random
        }
        rsp = requests.get(url, headers=headers, proxies=proxies, verify=False)
        try:
            comments = json.loads(rsp.content.decode('utf-8'))['cmts']
            for item in comments:
                tomato = tomato.append({'date': item['startTime'], 'city': item['cityName'], 'score': item['score'],
                                            'comment': item['content'], 'nick': item['nick']}, ignore_index=True)

                starttime = item['startTime']  # Time of last comment
        except:  # There may be less than 1000 comments a day
            continue

        time.sleep(ran_time)
    startTime(starttime)
if __name__ == '__main__':
    run(now_time)   # Start from the current time and page backwards towards the release date
    tomato.to_excel("Shuimen bridge review.xlsx", index=False)

Here I saved the result as an .xlsx file because the CSV file came out garbled on my machine; you can also try saving it as CSV.
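If you prefer CSV, writing with the utf-8-sig encoding usually stops Excel from showing garbled Chinese (a sketch using the same DataFrame):

# utf-8-sig adds a BOM so Excel opens the Chinese text correctly
tomato.to_csv("Shuimen bridge review.csv", index=False, encoding="utf-8-sig")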

Because requests.get is fetching an HTTPS address, pick proxies that support HTTPS when building the pool. In the request, proxies=proxies applies the proxy and verify=False skips certificate verification for the HTTPS URL. Run the script and the data is saved to an .xlsx file; I crawled roughly four days' worth of comments. The saved file looks like this:

II. Data visualization

First, the score distribution is visualized; the code is as follows:

# -*- coding: utf-8 -*-
import pyecharts
from pyecharts.charts import Pie
from pyecharts.charts import Bar
from pyecharts import options as opts
import pandas as pd


def score_view(data):
    grouped = data.groupby(by="score")["nick"].size()
    grouped = grouped.sort_values(ascending=False)
    index = grouped.index
    values = grouped.values
    # Histogram
    bar = Bar()   # Optionally pass init_opts=opts.InitOpts(width=..., height=..., page_title=...) to size the chart
    bar.add_xaxis(index.tolist())
    bar.add_yaxis("", values.tolist())
    bar.set_global_opts(
        xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=10)),
        title_opts=opts.TitleOpts(title="Score distribution"),
        datazoom_opts=opts.DataZoomOpts(),  # Adds a data-zoom slider for zooming into a range of the axis
    )

    bar.render_notebook()
    bar.render('Movie Watergate bridge score.html')

    pie = Pie()
    pie.add("", [list(z) for z in zip(index.tolist(), values.tolist())],
           radius=["30%", "75%"],
                center=["40%", "50%"],
                rosetype="radius")
    pie.set_global_opts(
                title_opts=opts.TitleOpts(title="Proportion of each score value"),
                legend_opts=opts.LegendOpts(
                    type_="scroll", pos_left="80%", orient="vertical"
                ),
            ).set_series_opts(label_opts=opts.LabelOpts(formatter="{d}%"))
    pie.render_notebook()
    pie.render('Proportion of scores of shuimen Bridge.html')

if __name__ == '__main__':
    df = pd.read_excel("Shuimen bridge review.xlsx")
    data = df.drop_duplicates(keep="first")  # Delete duplicate values
    score_view(data)

The scoring view obtained is as follows:

Proportion of each score value among all ratings

1.11% of reviewers gave a score of 0, while the top score of 5 accounts for 79.86% of all ratings.

III. Number of film viewers in each city (histogram + line chart)

Looking at the crawled data, many users' city field shows a district, a county or a county-level city, while the statistics here are done at the city level, so the administrative level of the crawled place names needs to be normalized. Because the mapping library's data is incomplete, many places cannot be handled, so only the numbers of film viewers in first-tier and new first-tier cities were compiled.

The code for mapping counties and districts to their cities is as follows:

import cpca

df2 = pd.DataFrame(data)
city_name = data["city"].values.tolist()
df_city = cpca.transform(city_name)

for i in range(len(df_city)):
    if df_city['province'][i]!=None and df_city["city"][i]==None:
        city_name[i] = df_city['province'][i].replace("city","")
    elif df_city['province'][i]!=None and df_city["city"][i]!=None:
        city_name[i] = df_city['city'][i].replace("city","")
# city_new = {"city_new":city_name}
df2["city_new"] = pd.Series(city_name)

The cpca library is used here, but its built-in data is incomplete, so some place names cannot be recognized.
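A quick way to see how many records cpca failed to resolve (a sketch; it reuses the df_city result and column names from the code above and assumes cpca leaves unmatched fields empty or NaN):

# Count place names that cpca could not map to a city
unmatched = df_city["city"].isna() | (df_city["city"] == "")
print(f"{unmatched.sum()} of {len(df_city)} records have no recognizable city")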

Next, extract the number of comments and the scores for first-tier and new first-tier cities from the processed data:

from pyecharts.charts import Line, EffectScatter   # chart types used below (Bar and opts were imported earlier)

city_num = df2["city_new"].value_counts()
city_first = ["Beijing","Shanghai","Guangzhou","Shenzhen"]   # First-tier cities
city_new_first = ["Wuhan","Nanjing","Chengdu","Chongqing","Hangzhou","Tianjin","Suzhou","Changsha","Qingdao","Xi'an","Zhengzhou","Ningbo","Wuxi","Dalian"]  # New first-tier cities
citys_front = city_first + city_new_first
nums = []
city_comment = city_num.index.tolist()
count_comment = city_num.values.tolist()    # Number of reviews per city
for city in citys_front:
    count = 0
    for i in range(len(city_comment)):
        if city_comment[i] ==city :
            count += count_comment[i]
    nums.append(count)   #Number of comments on first tier + new first tier cities
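# Note: the counting loop above can also be written in one line with pandas,
# e.g. nums = city_num.reindex(citys_front, fill_value=0).tolist()
# (equivalent result, assuming the same city_num and citys_front as above)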
city_first_score_mean = df2[df2.city.isin(citys_front)].groupby(["city"], as_index=False)["score"].mean()   # Average score of first-tier + new first-tier cities

bar = Bar(init_opts=opts.InitOpts(width="1500px", height="800px", page_title="Number of first-tier and new first-tier comments"))
bar.add_xaxis(citys_front)
bar.add_yaxis("", nums)
bar.set_global_opts(
    xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=10)),
    title_opts=opts.TitleOpts(title="Number of first-tier and new first-tier comments"),
    datazoom_opts=opts.DataZoomOpts(),  # Adds a data-zoom slider for zooming into a range of the axis
)
line = (
    Line()
        .add_xaxis(citys_front)
        .add_yaxis("", nums,label_opts = opts.LabelOpts(is_show=False))
)

bar.overlap(line)

bar.render_notebook()
bar.render('Number of comments on the first line and new line of Watergate Bridge.html')


line = (
    Line()
        .add_xaxis(city_first_score_mean.city.values.tolist())
        .add_yaxis("", city_first_score_mean.score.values.round(2).tolist(),label_opts = opts.LabelOpts(is_show=False))
)
effe = (
    EffectScatter(init_opts=opts.InitOpts(width="2000px",height="750px"))
        .add_xaxis(city_first_score_mean.city.values.tolist())
        .add_yaxis("", city_first_score_mean.score.values.round(2).tolist())
        .set_global_opts(title_opts=opts.TitleOpts(title="average score"))
)

effe.overlap(line)
# bar.render_notebook()
effe.render('City average score.html')

Here overlap() is used to draw one chart on top of another, which is why the line appears over the histogram.

The number of comments received is distributed as follows:

Comparing the distribution of commenters with each city's population gives a rough sense of how much each city likes going to the movies. From the data, Chengdu, known for its lively everyday atmosphere, has the most commenters; Hangzhou and Tianjin were presumably less affected by the epidemic, and Beijing seems to like watching movies more than Shanghai does.

The average score results are as follows (the actual page shows a ripple animation on the points):

The average scores do not reveal a clear pattern, but they can still be analyzed; differences may be related to the local audiences themselves and their satisfaction.

IV. Generate word cloud

A word cloud splits the review text with word segmentation (NLP), extracts the words that appear most frequently, and renders them as a cloud-shaped image: the higher a word's frequency, the larger its font in the cloud.
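Before drawing the cloud, it can help to check the top word frequencies directly; a minimal sketch (single-character tokens are dropped because they are rarely meaningful):

from collections import Counter
import jieba
import pandas as pd

df = pd.read_excel("Shuimen bridge review.xlsx")
# Segment all review text and count the tokens
tokens = [w for w in jieba.cut(df.comment.str.cat(sep=" ")) if len(w.strip()) > 1]
print(Counter(tokens).most_common(20))   # 20 most frequent words in the reviews

The full word-cloud script is below: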

import pandas as pd
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import jieba
import matplotlib.pyplot as plt

df = pd.read_excel("Shuimen bridge review.xlsx")

words = " ".join(jieba.cut(df.comment.str.cat(sep=" ")))
stopwords = set(STOPWORDS)          # copy the built-in stopword set rather than mutating the global one
stopwords.add(u"电影")               # exclude the generic word for "movie" (the reviews are in Chinese)

wc = WordCloud(stopwords=stopwords,
               font_path="C:/Windows/Fonts/simkai.ttf",  # Solve the problem of garbled display font
               background_color="white",width=1000,height=880, max_words=100
               )
my_wc = wc.generate_from_text(words)
plt.imshow(my_wc )
plt.rcParams['font.sans-serif']=['SimHei']
plt.rcParams['axes.unicode_minus'] = False    #Prevent Chinese from not displaying
plt.title(r"Shuimen Bridge")
# plt.imshow(my_wc.recolor(color_func=image_colors), )
plt.axis("off")
plt.show()

The resulting word cloud is as follows:
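As an optional extension, the ImageColorGenerator imported above (hinted at by the commented-out recolor call) can shape and colour the cloud from a picture; a minimal sketch building on the script above and assuming a hypothetical local image file mask.png:

import numpy as np
from PIL import Image

mask = np.array(Image.open("mask.png"))          # hypothetical mask image
wc_mask = WordCloud(stopwords=stopwords,
                    font_path="C:/Windows/Fonts/simkai.ttf",
                    background_color="white", mask=mask, max_words=100)
my_wc = wc_mask.generate_from_text(words)
image_colors = ImageColorGenerator(mask)         # sample colours from the mask image
plt.imshow(my_wc.recolor(color_func=image_colors))
plt.axis("off")
plt.show()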

 

 
