Why do Chinese people no longer focus on gold medals—— I use Python to analyze the Tokyo Olympic Games

The words written in the front

Welcome to scan code, pay attention to my official account, and make progress with me! Mainly committed to learning

Using deep learning to solve computer vision related problems
Python based Internet application services
CPU microarchitecture design based on MIPS instruction set

introduction

For most of us, this year's summer Olympic Games are a very participatory Olympic Games after the Beijing Olympic Games. I think there are several reasons:

The host city of the 2020 Summer Olympic Games is Tokyo, Japan, and the time difference between Tokyo time and Beijing time is only one hour. We can catch all the live events without reversing black and white.
Some new city confirmed cases novel coronavirus pneumonia have appeared in some cities in China, and most people's summer plans are changed from "traveling" to "squatting at home".
This year's Chinese delegation has seen a large number of post-90s and post-00s legions. They represent the Chinese young generation. Their brilliant performance in the Olympic Games has attracted extensive attention from the society.
... ...

For me, this year I also used all kinds of free time to watch some events of the Chinese delegation through live broadcasting or playback. In my spare time, I also participated in the discussion of related topics on microblog. One topic is called“ #Why do Chinese people no longer focus on gold medals# "Aroused my great interest. Although I browsed the comments on related topics, I still felt not satisfied. I thought I could get the data of comments and analyze them with Python.

get data

To analyze the data, we must first obtain the data. In this step, we obtain the information by analyzing the network request.

First, enter the interface of microblog Mobile Version (data cannot be obtained from the PC interface).

Link address: https://m.weibo.cn/

The interface after entering is shown in the figure.

We can enter "# why do Chinese people no longer rely on gold medals #" in the search box to see the interface shown in the figure below. We choose the first microblog at the top and click in.

At the same time, press "F12" or right-click the page and select "check". Select "Network" in the new function bar and select "XHR" to see the Network requests during the page loading process.

Among the observed network requests, we will find a network request starting with "hotflow". Click to enter the request, view the response information and find the comment information we need.

It can be concluded that the target website is the website of the comment information we need. It encapsulates the data and returns it in json format. Therefore, as long as we can get the return value of the website, we can get the comment information.

However, a new problem has emerged: there are only 20 comments, but the original microblog has 5000 + comments. Where's the other information? After analysis, we can find that the comment information is dynamically loaded. When we drag the information down, we can get all the comment information~

After solving all the problems, we only have the last step to process the obtained information and save the comments we need to a text file. We use the following Python program.

def get_data(review_file_path):
    """
    Read data in the file and process the data
    :param review_file_path: The storage path of the comment file, string
    :return: Comment information
    """
    review_contents = []
    with open(review_file_path, 'r') as f:
        lines = f.readlines()
    f.close()
    n = len(lines)
    for i in range(n):
        line = json.loads(lines[i])
        review_datas = line['data']['data']
        data_nums = len(review_datas)
        # Set where to start reading comments
        if i == 0:
            start = 1
        else:
            start = 0
        for j in range(start, data_nums):
        	# Regular matching processes text and other information
            pattern_01 = re.compile(r'<span class="url-icon">.*?</span>', re.S)
            pattern_02 = re.compile(r'<a .*?>.*?</a>', re.S)
            origin_review_content = review_datas[j]['text']
            review_content = re.sub(pattern_01, '', origin_review_content)
            review_content = re.sub(pattern_02, '', review_content)
            review_contents.append(review_content)
    return review_contents

The finally obtained data information and comment information (processed data information) can be downloaded through the following links.

Data information
Link: https://pan.baidu.com/s/1EYC1kUf1p9XYQ_fOt_kp1w
Extraction code: dcn2

Note: only 1000 comments were obtained in this article.

Analysis data

For the processed comment information, use the third-party library of Python to segment the word and draw the result as a word cloud. It can call the library function, which will not be repeated here. It is implemented using the following Python program.

def draw_picture(review_content_file_path):
    data = open(review_content_file_path, 'r', encoding='utf-8').read()
    # participle
    word_list = jieba.cut(data, cut_all=True)
    word_list_result = ' '.join(word_list)
    # Configuration of word cloud
    word_cloud = WordCloud(
        # Set background
        background_color='white',
        # Set the maximum number of word clouds displayed
        max_words=50,
        # Set word span
        font_step=10,
        # Set font
        font_path=r'C:\Windows\Fonts\simsun.ttc',
        # Set the height and width of the word cloud
        height=300,
        width=300,
        # Set Color Scheme 
        random_state=30
    )
    # Get results
    my_cloud = word_cloud.generate(word_list_result)
    plt.imshow(my_cloud)
    plt.axis("off")
    plt.show()

The final drawing result is shown in the figure.

Epilogue

From the picture, it is not difficult to see some bright words, such as "self-confidence", "strong", "proof", "effort" and "Chinese". They gave the best answer to this question: our country is strong, our nation is reviving, we are more confident than ever in road, theory and culture. We no longer need to prove ourselves with gold medals. This is also the place where I am most shocked, proud and benefited from the Olympic Games!

Complete code

At the end of the article, a complete program to realize the analysis is presented.

#!/usr/bin/env python
# -*- coding:utf-8 -*-

"""
@ModuleName: get_data
@Function: 
@Author: PengKai
@Time: 2021/8/6 22:52
"""

import re
import json
import jieba
import matplotlib.pyplot as plt
from wordcloud import WordCloud


def get_data(review_file_path):
    """
    Read data in the file and process the data
    :param review_file_path: The storage path of the comment file, string
    :return: Comment information
    """
    review_contents = []
    with open(review_file_path, 'r') as f:
        lines = f.readlines()
    f.close()
    n = len(lines)
    for i in range(n):
        line = json.loads(lines[i])
        review_datas = line['data']['data']
        data_nums = len(review_datas)
        # Set where to start reading comments
        if i == 0:
            start = 1
        else:
            start = 0
        for j in range(start, data_nums):
            pattern_01 = re.compile(r'<span class="url-icon">.*?</span>', re.S)
            pattern_02 = re.compile(r'<a .*?>.*?</a>', re.S)
            origin_review_content = review_datas[j]['text']
            review_content = re.sub(pattern_01, '', origin_review_content)
            review_content = re.sub(pattern_02, '', review_content)
            review_contents.append(review_content)
    return review_contents


def store_contents(review_contents):
    """
    Save the comments in the file
    :param review_contents: Comment information, List
    :return: None
    """
    n = len(review_contents)
    with open('./review_data.txt', 'w', encoding='utf-8') as f:
        for i in range(n):
            text = review_contents[i] + '\n'
            f.write(text)
    f.close()


def draw_picture(review_content_file_path):
    data = open(review_content_file_path, 'r', encoding='utf-8').read()
    # participle
    word_list = jieba.cut(data, cut_all=True)
    word_list_result = ' '.join(word_list)
    # Configuration of word cloud
    word_cloud = WordCloud(
        # Set background
        background_color='white',
        # Set the maximum number of word clouds displayed
        max_words=50,
        # Set word span
        font_step=10,
        # Set font
        font_path=r'C:\Windows\Fonts\simsun.ttc',
        # Set the height and width of the word cloud
        height=300,
        width=300,
        # Set Color Scheme 
        random_state=30
    )
    # Get results
    my_cloud = word_cloud.generate(word_list_result)
    plt.imshow(my_cloud)
    plt.axis("off")
    plt.show()


if __name__ == '__main__':
    review_file_path = './data.txt'
    review_contents = get_data(review_file_path)
    store_contents(review_contents)
    review_content_file_path = './review_data.txt'
    draw_picture(review_content_file_path)

Keywords: Python Data Analysis

Added by R4000 on Sat, 25 Dec 2021 08:02:38 +0200

Programming VIP