Using Python to analyze 440,000 pieces of data to reveal how to become a net red paragraph player in Netease's cloud music commentary area

There's a passage that says, "Ten years old copywriter drivers are not as good as Netease comment area. Netease Wenhao goes everywhere to comment on all single dogs." The comment area of Netease Yun Music has always been the gathering place of all kinds of copywriter gods.

So how on earth do our ordinary users become hot commentators in Netease's music reviews?

Let me analyze it.

get data

In fact, the logic is not complicated:

  1. Click all URLs in the list of songs.

  2. Go into each song sheet and crawl all the songs url to duplicate.

  3. Go to the front page of each song to get the reviews and summarize them.

The list of songs is as follows:

Turn over the page and observe its url changes. Notice the following motion map. Change 35 at the end of each turn.

Use requests+pyquery to crawl.

def get_list():
    list1 = []
    for i in range(0,1295,35):
        url = 'https://music.163.com/discover/playlist/?order=hot&cat=%E5%8D%8E%E8%AF%AD&limit=35&offset='+str(i)
        print('Successful acquisition%i Page song list\n' %(i/35+1))
        data = []
        html = restaurant(url)
        doc = pq(html)
        for i in range(1,36): # 35 song lists on a page
            a = doc('#m-pl-container > li:nth-child(' + str(i) +') > div > a').attr('href')
            a1 = 'https://music.163.com/api' + a.replace('?','/detail?')
            data.append(a1)
        list1.extend(data)
        time.sleep(5+random.random())
    return list1

So we can get 38 pages of 35 song lists per page, a total of 1300 + song lists.

Next we need to go into each song list and crawl all the url s, and pay attention to the last "de-duplicate", different song lists may contain the same song.

Open a song list and pay attention to the id circled in red.

Observe that the information we need to get at the bottom of each song list is circled in red boxes. Using the song list id just crawled and the api of Netease Cloud Music (detailed in the next article), we can construct:

If it's not convenient to read, let's parse json.

What I don't know in the process of learning can be added to me?
python Learning Exchange Button qun,784758214
//There are good learning video tutorials, development tools and e-books in the group.
//Share with you the current talent needs of python enterprises and how to learn python from zero foundation, and what to learn
def get_playlist(url):
    data = []
    doc = get_json(url)
    obj=json.loads(doc)
    jobs=obj['result']['tracks']
    for job in jobs:
        dic = {}
        dic['name']=jsonpath.jsonpath(job,'$..name')[0] #Name of song
        dic['id']=jsonpath.jsonpath(job,'$..id')[0] #Song ID
        data.append(dic)
    return data  

So we get all the songs under the list and remember to repeat them.

#Duplicate removal
data = data.drop_duplicates(subset=None, keep='first', inplace=True) 

The rest is to get the reviews of each song, similar to the previous song, but also according to the api structure, it is easy to find.

def get_comments(url,k):
    data = []
    doc = get_json(url)
    obj=json.loads(doc)
    jobs=obj['hotComments']
    for job in jobs:
        dic = {}
        dic['content']=jsonpath.jsonpath(job,'$..content')[0] 
        dic['time']= stampToTime(jsonpath.jsonpath(job,'$..time')[0])
        dic['userId']=jsonpath.jsonpath(job['user'],'$..userId')[0]  #User ID
        dic['nickname']=jsonpath.jsonpath(job['user'],'$..nickname')[0]#User name
        dic['likedCount']=jsonpath.jsonpath(job,'$..likedCount')[0] 
        dic['name']= k
        data.append(dic)
    return data  

In summary, 440,000 music reviews were obtained.

Data analysis

Clean and fill it up.

def data_cleaning(data):
    cols = data.columns
    for col in cols:
        if data[col].dtype ==  'object':
            data[col].fillna('missing data', inplace = True)
        else:
            data[col].fillna(0, inplace = True)
    return(data)

Put them in order of points.

#sort
df1['likedCount'] = df1['likedCount'].astype('int')
df_2 = df1.sort_values(by="likedCount",ascending=False)
df_2.head()

Look at which reviews have been copied, pasted and moved around.

#sort
df_line = df.groupby(['content']).count().reset_index().sort_values(by="name",ascending=False)
df_line.head()

The first and the third are just the difference of whether there is a period at the end, which can be classified as one kind. In this way, the sentence repeated 412 times at most.~~

Look at the God who has the highest number of reviews? What lessons can we learn from him?

df_user = df.groupby(['userId']).count().reset_index().sort_values(by="name",ascending=False)
df_user.head()

Summarize and sort according to user_id.

Successfully "captured" a "Duanzi hand", the number of hot reviews up to 347, let's see what the God commented on?

df_user_max = df.loc[(df['userId'] == 101***770)]
df_user_max.head()

This "insomniac Mr. Chen" seems to be skilled in all kinds of love stories. Let's take him as an example to see how he can become a hot commentator in Netease Yun's music reviews.

Data visualization

Let's first look at the positive distribution of 347 comments.

What I don't know in the process of learning can be added to me?
python learning communication deduction qun, 784758214
 There are good learning video tutorials, development tools and e-books in the group.
Share with you the current talent needs of python enterprises and how to learn python from zero foundation, and what to learn
 # Pragmatic Distribution Map
import matplotlib.pyplot as plt
data = df_user_max['likedCount']
#data.to_csv("df_user_max.csv", index_label="index_label",encoding='utf-8-sig')
plt.hist(data,100,normed=True,facecolor='g',alpha=0.9)
plt.show()

Obviously, there are not many praises, most of them are within 500 praises, but hundreds of praises can be ranked in the hot reviews, which also shows that these songs are relatively minority, it seems that they are often cast in the new song area.

We use len() to figure out the length of each comment string and draw a distribution map.

The number of comments is between 18 and 30 words, which shows that we should pay attention to the number of words when leaving messages. The insurance policy is not to be too long to be read, nor too short to be classical.

Make a word cloud.


If you are still confused in the world of programming, you can join our Python learning button qun: 784758214 to see how our predecessors learned! Exchange experience! I am a senior Python development engineer, from basic Python script to web development, crawler, django, data mining and so on, zero-based to the actual project data have been collated. To every Python buddy! Share some learning methods and small details that need attention. Click to join us. python learner gathering place

It can be seen that his style of comment begins with a song that makes him "think" and "feel", and the object is usually "like the girl", which is often referred to as "she". Deposited feelings are "regret" and "sadness", and the end point of feeling is "put down".

Maybe we can gain some praise by analyzing the rules and become the hot comment net's red-handed person. But ultimately moving, it is still based on the sincere sharing of the song itself, and the real resonance contained in the singing.

Keywords: Python JSON REST encoding

Added by Illusionist on Tue, 20 Aug 2019 17:13:37 +0300