Official account: Special House
Author: Peter
Editor: Peter
Hello, I'm Peter~
Recently, the divorce between Wang and Li has been making a lot of noise, and I believe everyone has eaten plenty of melons. This article looks at the comments netizens left under Li's first post to see how everyone views the matter.
Webpage
Fields to crawl
- User nickname
- Comment time
- Comment content
- Number of likes
- Number of replies
- Gender
- city
The data comes from this address: https://weibo.com/5977512966/L6w2sfDXb#comment
We crawl all of the comments under this post:
Page URL pattern
The Weibo page is rendered with Ajax: as we scroll down, more comments are displayed, but the URL in the address bar never changes. We therefore need to find the actual request URL.
1. Right-click the page, choose Inspect, and open the Network tab
2. Identify the request URL for each page of comments
Here is the first page:
After scrolling down, the URLs of the subsequent pages are displayed:
3. The URL of each page
# Page 1
start_url = "https://weibo.com/ajax/statuses/buildComments?is_reload=1&id=4715531283728505&is_show_bulletin=2&is_mix=0&count=10&uid=5977512966"

# Page 2: takes the max_id returned by page 1 as a parameter value
url2 = "https://weibo.com/ajax/statuses/buildComments?flow=0&is_reload=1&id=4715531283728505&is_show_bulletin=2&is_mix=0&max_id=22426369418746150&count=20&uid=5977512966"

# Page 3: takes the max_id returned by page 2 as a parameter value
url3 = "https://weibo.com/ajax/statuses/buildComments?flow=0&is_reload=1&id=4715531283728505&is_show_bulletin=2&is_mix=0&max_id=2197966808100516&count=20&uid=5977512966"
From the second page onward, the URL has one extra parameter, max_id; its value is simply the max_id returned in the previous page's response:
4. Crawling page 1
import requests
import json

main_url = "https://weibo.com/ajax/statuses/buildComments?is_reload=1&id=4715531283728505&is_show_bulletin=2&is_mix=0&count=10&uid=5977512966"
headers = {"user-agent": "Personal request header"}

response = requests.get(url=main_url, headers=headers)
result = response.content.decode('utf8')
content = json.loads(result)  # Convert the JSON data into a Python dictionary
For example, we can get the relevant information of the first user:
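The JSON for each page contains a data list of comments. Below is a minimal sketch of pulling the crawl fields out of the first item; the key names (text_raw, like_counts, total_number, gender, location) are assumptions about the response structure and should be verified against the JSON actually returned.

```python
# A minimal sketch: extract the crawl fields from the first comment.
# Key names below are assumptions; check them in the real response.
first = content["data"][0]

item = {
    "User nickname": first["user"].get("screen_name"),
    "Comment time": first.get("created_at"),
    "Comment content": first.get("text_raw", ""),
    "Number of likes": first.get("like_counts", 0),
    "Number of replies": first.get("total_number", 0),
    "Gender": first["user"].get("gender"),
    "city": first["user"].get("location"),
}
print(item)
```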
Finally, here is the data crawled from the first page:
Following the same logic, you can crawl all of the comments under the Weibo post.
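Putting the pagination rule together, a crawl loop might look like the sketch below. It is only a sketch: whether max_id=0 fetches the first page and whether the server returns max_id=0 on the last page are assumptions to verify, and in practice the request headers will also need valid cookies.

```python
import time
import requests

headers = {"user-agent": "Personal request header"}
base_url = ("https://weibo.com/ajax/statuses/buildComments?flow=0&is_reload=1"
            "&id=4715531283728505&is_show_bulletin=2&is_mix=0"
            "&max_id={max_id}&count=20&uid=5977512966")

all_comments = []
max_id = 0  # assumption: max_id=0 requests the first page

while True:
    url = base_url.format(max_id=max_id)
    data = requests.get(url, headers=headers).json()

    comments = data.get("data", [])
    if not comments:                 # nothing returned: stop
        break
    all_comments.extend(comments)

    max_id = data.get("max_id", 0)   # next page's max_id comes from this response
    if max_id == 0:                  # assumption: 0 marks the last page
        break
    time.sleep(1)                    # pause briefly between requests
```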
Microblog analysis
Import library
Import required libraries:
import pandas as pd
import numpy as np
import jieba
from snownlp import SnowNLP

# Show all columns
# pd.set_option('display.max_columns', None)
# Show all rows
# pd.set_option('display.max_rows', None)
# Set the display width of values to 100 (default is 50)
# pd.set_option('max_colwidth', 100)

# Plotting-related imports
import matplotlib.pyplot as plt
from pyecharts.globals import CurrentConfig, OnlineHostType
from pyecharts import options as opts  # Configuration items
from pyecharts.charts import Bar, Scatter, Pie, Line, Map, WordCloud, Grid, Page  # Chart classes
from pyecharts.commons.utils import JsCode
from pyecharts.globals import ThemeType, SymbolType
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots  # For drawing subplots
Data EDA
Let's look at the basic information of the crawled data. First, display the first 5 rows:
Basic information: check the shape of the data. There are 47,638 rows and 8 fields, with no missing values.
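For reference, a small sketch of loading the crawled data and checking its shape and missing values; the file name weibo_comments.csv is hypothetical.

```python
# Hypothetical file name; load the crawled comments and inspect them
df = pd.read_csv("weibo_comments.csv")

print(df.shape)           # (47638, 8): rows x fields
print(df.isnull().sum())  # missing values per column (all zero here)
df.head()                 # first 5 rows
```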
Time preprocessing
Convert the GMT-format timestamps we crawled into a familiar, standardized datetime format:
import datetime

def change_time(x):
    """Convert a GMT-format time string into the specified datetime format."""
    std_transfer = '%a %b %d %H:%M:%S %z %Y'
    std_change_time = datetime.datetime.strptime(x, std_transfer)
    return std_change_time

df["Comment time"] = df["Comment time"].apply(change_time)
df["Registration time"] = df["Registration time"].apply(change_time)
df.head()
Other processing
- Remove the img tag part of each comment
- From the crawled city field, extract the province or municipality; for locations abroad, the value is simply "overseas"
df["Comment content"] = df["Comment content"].apply(lambda x:x.split("<img")[0]) df["province"] = df["city"].apply(lambda x:x.split(" ")[0]) df.head()
Regional melon competition
# df1 is the per-province comment count; the original does not show how it was
# built, so this groupby is an assumption that matches the chart
df1 = df.groupby("province")["user"].count().reset_index().sort_values("user", ascending=False)

fig = px.bar(df1[::-1],
             x="user",
             y="province",
             text="user",
             color="user",
             orientation="h")
fig.update_traces(textposition="outside")
fig.update_layout(title="Distribution of microblog comments on cities", width=800, height=600)
fig.show()
Among the domestic provinces, Beijing, Guangdong, Shanghai and Jiangsu are all big melon eating provinces!
Gender competition
df2 = df.groupby("Gender")["user"].count().reset_index()

fig = px.pie(df2, names="Gender", values="user", labels="Gender")
fig.update_traces(
    # Text position: one of 'inside', 'outside', 'auto', 'none'
    textposition='inside',
    textinfo='percent+label'
)
fig.show()
Sure enough: women really love eating melons 🍉, far more than men.
Hot comments
Let's look at the popular comments under this Weibo post by number of likes and number of replies:
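A quick way to surface the hot comments is to sort by the like and reply counts; a minimal sketch, using the column names used elsewhere in this article:

```python
# Top 5 comments by number of likes
top_likes = df.sort_values("Number of likes", ascending=False).head(5)
print(top_likes[["user", "Comment content", "Number of likes"]])

# Top 5 comments by number of replies
top_replies = df.sort_values("Number of replies", ascending=False).head(5)
print(top_replies[["user", "Comment content", "Number of replies"]])
```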
Number of likes
One netizen's comment has received 870,000+ likes! 666 (impressive)!
Number of replies
It is the same netizen's comment that also ranks No. 1 in the number of replies.
Looking at the overall distribution of likes and replies, this comment really stands out: it has completely broken away from the rest of the data:
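That distribution can be reproduced with a simple scatter of likes against replies; a sketch:

```python
fig = px.scatter(df,
                 x="Number of likes",
                 y="Number of replies",
                 hover_data=["user"])  # hover to identify the outlier comment
fig.update_layout(title="Likes vs. replies for all comments", width=800, height=600)
fig.show()
```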
Looking at the original data, we found that this comment is:
Conclusion: the revelations are all true
It seems that many of the earlier revelations have now been confirmed!
Microblog user age
df["interval"] = df["Comment time"] - df["Registration time"] # time interval df["day"] = df["interval"].apply(lambda x:x.days) # days property of timedelta df["year"] = df["day"].apply(lambda x:str(int( x / 365)) + "year") # Microblog age rounding; Less than one year px.scatter(df, x="Number of likes", y="Number of replies", size="day", facet_col="year", facet_col_wrap=4, # Up to 4 graphics per line color="year")
Judging by account age together with likes and replies, users whose accounts are 7, 8, 9, or 10 years old are the most active; very old accounts and brand-new accounts leave fewer comments.
At the same time, the number of likes is mostly concentrated between 2,000 and 5,000.
Comment time
px.scatter(df,
           x="Comment time",
           y="day",
           color="year",
           size="day")
Looking at the comment timestamps, when Li posted the first article it immediately set off a flood of comments (the dense part on the left); the post then went quiet for about 4 days, but unexpectedly heated up again on the night of the 23rd.
What melon-eating fans focus on
Segment the fans' comments to find what they focus on:
comment_list = df["Comment content"].tolist()

# Word segmentation
comment_jieba_list = []
for i in range(len(comment_list)):
    # jieba segmentation
    seg_list = jieba.cut(str(comment_list[i]).strip(), cut_all=False)
    for each in list(seg_list):
        comment_jieba_list.append(each)

# Build the stop-word list
def StopWords(filepath):
    stopwords = [line.strip() for line in open(filepath, 'r', encoding='utf-8').readlines()]
    return stopwords

# Path to the stop-word file
stopwords = StopWords("/Users/peter/spider/nlp_stopwords.txt")

# Keep only the words that are not stop words
useful_comment = []
for col in comment_jieba_list:
    if col not in stopwords:
        useful_comment.append(col)

# Word-frequency table
information = pd.value_counts(useful_comment).reset_index()[1::]
information.columns = ["word", "number"]
information_zip = [tuple(z) for z in zip(information["word"].tolist(),
                                         information["number"].tolist())]

# Draw the word cloud
c = (
    WordCloud()
    .add("", information_zip, word_size_range=[20, 80], shape=SymbolType.DIAMOND)
    .set_global_opts(title_opts=opts.TitleOpts(title="Cloud map of microblog comments"))
)
c.render_notebook()
Looking at the top 50 words:
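One way to chart the top 50 words is to take the head of the frequency table built above; a small sketch:

```python
top50 = information.head(50)   # 'information' is the word-frequency table built above

fig = px.bar(top50, x="word", y="number")
fig.update_layout(title="Top 50 words in the comments", width=900, height=500)
fig.show()
```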
In addition to the two parties themselves, fans also care about the children. After all, the children are innocent; but isn't this whole melon caused by the children in the first place? Just a personal view.
In short: whether it's Wang or Li, whoever really turns out to be the scum man or scum woman, please go up on the cross. Amen!
Book giveaway
Python crawling has a very powerful framework: Scrapy. I have teamed up with Peking University Press to give away two copies of "Python Web Crawler Framework Scrapy: From Introduction to Mastery" to two readers selected from the comments.
Friends interested in Python crawlers can also buy it directly.