Chapter 15 of Python data analysis and mining practice - Emotional Analysis of e-commerce product review data

Article catalog

1. Mining background and objectives
2.2 data exploration and preprocessing
```
* 2.1 Data filtering 
```
- 2.2 data De duplication
- 2.3 delete prefix score
- 2.4 jieba participle
3 subject analysis based on LDA model
4. Weight
5. How to compare the similarity between two documents in the topic space

This paper is based on the data of Chapter 15 of the actual combat part of Python data analysis and mining - Emotional Analysis of e-commerce product review data.

This paper aims to review the processing and modeling methods of comment text data.

1. Mining background and objectives

Text mining analysis is carried out for water heater comments on JD platform, and the mining modeling is as follows:

Analyze the emotional tendency of users of a brand of water heater
Dig out the advantages and disadvantages of the brand water heater from the comment text
Refine the selling points of different brands of water heaters

2.2 data exploration and preprocessing

2.1 data filtering

    #-*- coding: utf-8 -*-
    import pandas as pd
    
    inputfile = '../data/huizong.csv' #Comment summary document
    outputfile = '../data/meidi_jd.txt' #Save path after comment extraction
    data = pd.read_csv(inputfile, encoding = 'utf-8')
    data = data[[u'comment']][data[u'brand'] == u'Beautiful']
    data.to_csv(outputfile, index = False, header = False, encoding = 'utf-8')

2.2 data De duplication

    #-*- coding: utf-8 -*-
    import pandas as pd
    
    inputfile = '../data/meidi_jd.txt' #Comment document
    outputfile = '../data/meidi_jd_process_1.txt' #Save path after comment processing
    data = pd.read_csv(inputfile, encoding = 'utf-8', header = None,sep = None)
    l1 = len(data)
    data = pd.DataFrame(data[0].unique())
    l2 = len(data)
    data.to_csv(outputfile, index = False, header = False, encoding = 'utf-8')
    print(u'Deleted%s Comments.' %(l1 - l2))

Amway has a software that can do emotion analysis called ROST CM6. Note that the documents you submit must be ANSI coded, otherwise the processed documents will be garbled

2.3 delete prefix score

    #-*- coding: utf-8 -*-
    import pandas as pd
    
    #Parameter initialization
    inputfile1 = '../data/meidi_jd_process_end_Negative emotional results.txt'
    inputfile2 = '../data/meidi_jd_process_end_Positive emotional results.txt'
    outputfile1 = '../data/meidi_jd_neg.txt'
    outputfile2 = '../data/meidi_jd_pos.txt'
    
    data1 = pd.read_csv(inputfile1, encoding = 'utf-8', header = None) #read in data
    data2 = pd.read_csv(inputfile2, encoding = 'utf-8', header = None)
    
    data1 = pd.DataFrame(data1[0].str.replace('.*?\d+?\\t ', '')) #Modifying data with regular expressions
    data2 = pd.DataFrame(data2[0].str.replace('.*?\d+?\\t ', ''))
    
    data1.to_csv(outputfile1, index = False, header = False, encoding = 'utf-8') #Save results
    data2.to_csv(outputfile2, index = False, header = False, encoding = 'utf-8')

Before deletion

After deletion
! [insert picture description here]( https://img-blog.csdnimg.cn/20190823123110736.png?x-oss-
process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80Mzc0NjQzMw==,size_16,color_FFFFFF,t_70)

2.4 jieba participle

    #-*- coding: utf-8 -*-
    import pandas as pd
    import jieba #Import stuttering word segmentation. You need to download and install it yourself
    
    #Parameter initialization
    inputfile1 = '../data/meidi_jd_neg.txt'
    inputfile2 = '../data/meidi_jd_pos.txt'
    outputfile1 = '../data/meidi_jd_neg_cut.txt'
    outputfile2 = '../data/meidi_jd_pos_cut.txt'
    
    data1 = pd.read_csv(inputfile1, encoding = 'utf-8', header = None) #read in data
    data2 = pd.read_csv(inputfile2, encoding = 'utf-8', header = None)
    
    mycut = lambda s: ' '.join(jieba.cut(s)) #Custom simple word segmentation function
    data1 = data1[0].apply(mycut) #Through word segmentation in the form of "broadcast", speed up the speed.
    data2 = data2[0].apply(mycut)
    
    data1.to_csv(outputfile1, index = False, header = False, encoding = 'utf-8') #Save results
    data2.to_csv(outputfile2, index = False, header = False, encoding = 'utf-8')

3 subject analysis based on LDA model

https://blog.csdn.net/sinat_26917383/article/details/79357700#1__27

    #-*- coding: utf-8 -*-
    import pandas as pd
    
    # Modify default encoding
    import sys
    sys.getdefaultencoding()  # View current encoding format
    # When an error is reported (AttributeError: 'module' object has no attribute 'setdefaultencoding')
    
    #Parameter initialization
    negfile = '../data/meidi_jd_neg_cut.txt'
    posfile = '../data/meidi_jd_pos_cut.txt'
    stoplist = '../data/stoplist.txt'
    
    neg = pd.read_csv(negfile, encoding = 'utf-8', header = None) #read in data
    pos = pd.read_csv(posfile, encoding = 'utf-8', header = None)
    stop = pd.read_csv(stoplist, encoding = 'utf-8', header = None, sep = 'tipdm')
    #sep sets the split word. Because csv uses half width comma as the split word by default, and the word happens to be in the stop word list, it will cause reading errors
    #So the solution is to manually set a non-existent segmentation word, such as tipdm.
    stop = [' ', ''] + list(stop[0]) #Pandas automatically filters white space characters, which are added here manually
    
    neg[1] = neg[0].apply(lambda s: s.split(' ')) #Define a partition function and broadcast it with apply
    neg[2] = neg[1].apply(lambda x: [i for i in x if i not in stop]) #Judge whether to stop using words word by word. The idea is the same as above
    pos[1] = pos[0].apply(lambda s: s.split(' '))
    pos[2] = pos[1].apply(lambda x: [i for i in x if i not in stop])
    
    from gensim import corpora, models
    
    #Negative theme analysis
    neg_dict = corpora.Dictionary(neg[2]) #Build dictionary
    neg_corpus = [neg_dict.doc2bow(i) for i in neg[2]] #Building corpus
    neg_lda = models.LdaModel(neg_corpus, num_topics = 3, id2word = neg_dict) #LDA model training
    # for i in range(3):
    #   #print(neg_lda.show_topics())#Show theme
    #   print(neg_lda.print_topic(i))
    
    #Positive theme analysis
    pos_dict = corpora.Dictionary(pos[2])
    pos_corpus = [pos_dict.doc2bow(i) for i in pos[2]]
    pos_lda = models.LdaModel(pos_corpus, num_topics = 3, id2word = pos_dict)
    # for i in range(3):
    #   #print(pos_lda.show_topics())#Show theme
    #   print(pos_lda.print_topic(i))
    
    for i in  neg_lda.show_topics():
        print(i[0], i[1])
    
    for i in  pos_lda.show_topics():
        print(i[0], i[1])

The feature words are extracted according to the three potential themes highly praised by Midea water heater:

The high-frequency feature words in topic 1, namely good, fast delivery, heating, speed, fast, service and extraordinary, mainly reflect JD's fast delivery and excellent service; Midea water heater has fast heating speed;
The high-frequency feature words in topic 2, that is, the hot concerns are mainly price, things and value, which mainly reflect that Midea water heater is good, the price is appropriate, and it is worth buying, etc;
The high-frequency feature words in topic 3, that is, the hot concerns are mainly after-sales, master, door-to-door and installation, which mainly reflect JD's after-sales service and master's door-to-door installation.

From the three potential themes of Midea's poor comments on water heaters, we can see:

The high-frequency feature words in topic 1 are mainly installation, service, yuan, etc., that is, topic 1 mainly reflects the high installation fees of Midea water heater, poor after-sales service of water heater, etc;
The high-frequency feature words in topic 2 are mainly emotional words such as but, a little, and can also be;
Theme 3 mainly reflects that Midea water heater may not meet its needs; The high-frequency feature words in topic 3 mainly include no, but, self, etc. topic 3 may mainly reflect the installation of Midea water heater.

Based on the above topics and the high-frequency characteristic words, the advantages of Midea water heater are as follows:
Affordable, cost-effective, good-looking appearance, practical water heater, convenient to use, fast heating speed and good service.

Relatively speaking, users' complaints about Midea water heater mainly reflect the following aspects:
Midea water heaters are expensive to install and provide after-sales service.

Therefore, the reasons for users' purchase can be summarized as follows * *: Midea's big brand is trustworthy, Midea's water heater is affordable and cost-effective * *.

According to the LDA theme model analysis on the user evaluation of Midea water heater on JD platform, we put forward the following suggestions for Midea brand.

1) On the basis of maintaining the advantages of convenient use and affordable price, the water heater is improved to improve the quality of the water heater as a whole.
2) Improve the overall quality of installation personnel and customer service personnel and improve service quality. The detailed rules for charging installation fees shall be made clear and transparent to reduce the problem of arbitrary charging in the installation process. Moderately reduce the installation cost and material cost, so as to highlight the advantages in the competition of big brands.

4. Weight

    for i in range(3):
      print(pos_lda.print_topic(i, topn=5)) # Weight TOP5
    
    for i in  neg_lda.show_topics(num_topics=2, num_words=10, log=False, formatted=True):  #Among the top 2 or 100 topics, the topic vocabulary with the weight of top 2
        print(i[0], i[1])

5. How to compare the similarity between two documents in the topic space

    import numpy as np
    topics =[pos_lda[c] for c in pos_corpus]
    
    dense = np.zeros((len(topics),100),float)
    
    for ti,t in enumerate(topics):
    
        for tj,v in t:
    
            dense[ti,tj] = v	# The result is deny [document id, subject id] = the weight of a document in a subject
    print(dense)

! [insert picture description here]( https://img-blog.csdnimg.cn/20190823143233596.png?x-oss-
process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80Mzc0NjQzMw==,size_16,color_FFFFFF,t_70)

Keywords: Python Data Analysis Data Mining

Added by cresler on Sat, 22 Jan 2022 21:10:38 +0200

Programming VIP