Sentiment Analysis and Prediction of Douban Top250 Movie Reviews with Naive Bayes

Preface

In this article, the Naive Bayes algorithm is used for sentiment analysis and prediction on Douban Top250 movie reviews.

I have recently been studying how to classify positive and negative sentiment in natural language processing, but most of the tutorials that can be found online are sentiment analysis of IMDB movie reviews on Kaggle.

So here I use the most basic Naive Bayes algorithm to analyze and predict the sentiment of Douban movie reviews.

Here I referred to https://github.com/aeternae/IMDb_Review. Thank you very much.

Naive Bayes classifier

Bayesian classification is the general name for a family of classification algorithms, all of which are based on Bayes' theorem, hence the name.

These algorithms are often used to classify articles and to filter spam emails and spam comments. Naive Bayes works well and has a low computational cost.

Given a conditional probability, how do we obtain the probability with the two events exchanged? That is, knowing P(A|B), how do we obtain P(B|A)?

P(B|A) denotes the probability that event B occurs given that event A has already occurred, and is called the conditional probability of B given A.

Naive Bayes Formula
P(B|A) = \frac{P(A|B)P(B)}{P(A)}
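The formula follows from writing the joint probability of A and B in two ways:

P(A \cap B) = P(A|B)P(B) = P(B|A)P(A)

Dividing both sides by P(A) gives the formula above.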

An easy-to-understand video tutorial

Youtube https://www.youtube.com/watch?v=AqonCeZUcC4

Let me give a slightly inappropriate example.

Suppose we want to know the relationship between being a programmer and going bald; we can calculate it with the Naive Bayes formula.

Now we want to find P(bald|programmer), that is, the probability that a programmer goes bald.

(I will never go bald in my life!)

Substituting into the Naive Bayes formula:

P(bald|programmer) = \frac{P(programmer|bald)P(bald)}{P(programmer)}

The known data are shown in the following table

| Name | Occupation | Bald? |
| --- | --- | --- |
| Kratos | God of War | yes |
| Agent 47 | Assassin | yes |
| Saitama | Superhero | yes |
| Thanos | Director of the Family Planning Office | yes |
| Jason Statham | Tough guy | yes |
| A 996 programmer | Programmer | yes |
| Me | Programmer | no |

Based on the Naive Bayes formula, we can derive from the table above:

P(bald|programmer) = \frac{P(programmer|bald)P(bald)}{P(programmer)} = \frac{\frac{1}{6} \times \frac{6}{7}}{\frac{2}{7}} = \frac{1}{2}
The above example simply illustrates the basic use of the Naive Bayes formula.
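As a quick sanity check, here is the same calculation in Python, with the counts read directly from the table above:

# 7 people in total: 6 are bald, 2 are programmers,
# and exactly 1 of the 6 bald people is a programmer.
p_programmer_given_bald = 1 / 6
p_bald = 6 / 7
p_programmer = 2 / 7

p_bald_given_programmer = p_programmer_given_bald * p_bald / p_programmer
print(p_bald_given_programmer)  # 0.5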

Next, I use the Douban Top250 movie reviews to train a Naive Bayes model and predict whether a review is positive or negative.

Sentiment Analysis of Douban Top250 Movie Reviews

First of all, we need the Douban Top250 movie review corpus. I used Scrapy to crawl about 50,000 reviews for training and validation.

Douban movie review crawler: https://github.com/3inchtime/douban_movie_review

With the corpus in hand, we can start development.

I recommend using Jupyter for development.

The following code can be found on my GitHub, and suggestions are welcome.

https://github.com/3inchtime/douban_sentiment_analysis

First load the corpus

# -*- coding: utf-8 -*-
import random
import csv

import jieba


file_path = './data/review.csv'
jieba.load_userdict('./data/userdict.txt')

# Read the corpus from a CSV file; each row is (sentiment, review)
def load_corpus(corpus_path):
    with open(corpus_path, 'r') as f:
        reader = csv.reader(f)
        rows = [row for row in reader]

    # Shuffle so later train/test splits are not biased by the original order
    random.shuffle(rows)

    review_list = [row[1] for row in rows]
    sentiment_list = [row[0] for row in rows]

    return review_list, sentiment_list
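Note that the later steps assume the corpus has already been loaded. The call itself is not shown in the original snippet, but it would look like this:

review_list, sentiment_list = load_corpus(file_path)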

Before training, the dataset is shuffled to randomize the order of the samples, so that the train/test split is representative and the evaluation is not biased by the original ordering; this is what the random.shuffle() call is for.

jieba.load_userdict('./data/userdict.txt'): here I built a custom user dictionary to stop jieba from segmenting certain words inaccurately, which improves accuracy by about 1%.

For example, in a sentence containing “不喜欢” (“dislike”), jieba may split it into the two tokens “不” (“not”) and “喜欢” (“like”), which makes the sentence likely to be predicted as a positive review.

So I put many similar words into the custom dictionary, which improved the accuracy a little; see the sketch below.
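Here is a minimal sketch of the effect. jieba.add_word() is the in-code equivalent of a single line in userdict.txt, and the exact default segmentation may vary with the jieba version and dictionary:

import jieba

print(jieba.lcut('很不喜欢这部电影'))  # without the entry, e.g. ['很', '不', '喜欢', '这部', '电影']
jieba.add_word('不喜欢')  # same effect as an entry in userdict.txt
print(jieba.lcut('很不喜欢这部电影'))  # with the entry, e.g. ['很', '不喜欢', '这部', '电影']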

Then the corpus is split into a test set and a training set at a ratio of 1:4.

n = len(review_list) // 5

train_review_list, train_sentiment_list = review_list[n:], sentiment_list[n:]
test_review_list, test_sentiment_list = review_list[:n], sentiment_list[:n]

Word segmentation

The corpus is segmented with jieba, and stop words are removed.

import re
import jieba


stopword_path = './data/stopwords.txt'


def load_stopwords(file_path):
    stop_words = []
    with open(file_path, encoding='UTF-8') as words:
        stop_words.extend([i.strip() for i in words.readlines()])
    return stop_words


def review_to_text(review):
    stop_words = load_stopwords(stopword_path)
    # Keep only Chinese characters and English letters
    review = re.sub('[^\u4e00-\u9fa5a-zA-Z]', '', review)
    review = jieba.cut(review)
    # Remove stop words
    all_stop_words = set(stop_words)
    words = [w for w in review if w not in all_stop_words]

    return words

# Reviews used for training
review_train = [' '.join(review_to_text(review)) for review in train_review_list]
# Sentiment labels (positive/negative) for the training reviews
sentiment_train = train_sentiment_list

# Reviews used for testing
review_test = [' '.join(review_to_text(review)) for review in test_review_list]
# Sentiment labels (positive/negative) for the test reviews
sentiment_test = test_sentiment_list

TF-IDF and Word Frequency Vectorization

TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining. It measures how important a word is to a text, based on how often the word occurs in that text and in how many documents of the corpus it appears.

Its advantage is that it filters out common but insignificant words while keeping the words that matter most to the text.
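In its classical form (sklearn uses a smoothed variant by default), the weight of a word t in a document d is:

\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log\frac{N}{\text{df}(t)}

where tf(t, d) is the number of times t occurs in d, N is the total number of documents, and df(t) is the number of documents containing t.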

CountVectorizer() is used to convert documents into vectors by counting how often each word appears in the text.

The CountVectorizer class converts the words in the texts into a word frequency matrix: element a[i][j] is the frequency of word j in document i. The counts are computed by the fit_transform method.

TfidfTransformer is used to compute the TF-IDF value of each word produced by the vectorizer.
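Before wiring these into a Pipeline, here is a small illustration on three toy, space-separated documents (my own example, not from the corpus; note that CountVectorizer's default tokenizer ignores single-character tokens):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ['剧情 精彩 演技 精彩', '剧情 无聊', '演技 在线']

vec = CountVectorizer()
counts = vec.fit_transform(docs)
print(sorted(vec.vocabulary_))  # the learned vocabulary
print(counts.toarray())  # a[i][j]: frequency of word j in document i

tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.toarray().round(3))  # TF-IDF weight of each word in each document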

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB


# A Pipeline chains all steps together, which makes it easy to reuse
# the same parameter set on new data such as the test set.
def MNB_Classifier():
    return Pipeline([
        ('count_vec', CountVectorizer(max_df=0.8, min_df=3)),
        ('mnb', MultinomialNB())
    ])

The max_df parameter acts as a threshold when the vocabulary is built: if a word's document frequency is higher than max_df, the word is not used as a feature.

If the parameter is a float, it denotes the fraction of documents in the corpus in which the word occurs; if it is an int, it denotes an absolute document count.

min_df works the same way, except that a word whose document frequency is lower than min_df is excluded, as the sketch below shows.
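A small sketch of both thresholds on toy documents (my own example; token_pattern is loosened here only so that single-character words are kept):

from sklearn.feature_extraction.text import CountVectorizer

docs = ['好 电影', '好 剧情', '好 演员', '无聊 电影']

# max_df=0.5 (float): drop words occurring in more than half of the documents
# min_df=2 (int): drop words occurring in fewer than 2 documents
vec = CountVectorizer(max_df=0.5, min_df=2, token_pattern=r'(?u)\b\w+\b')
vec.fit(docs)
print(sorted(vec.vocabulary_))  # ['电影']: '好' is dropped by max_df, the rest by min_df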

With that, we have constructed a Pipeline for training and testing.

The model is then trained on the training set with Pipeline.fit().

Pipeline.score() is then used to predict on the test set and report the accuracy directly.

mnbc_clf = MNB_Classifier()

# Training
mnbc_clf.fit(review_train, sentiment_train)

# Test Set Accuracy
print('Test set accuracy: {}'.format(mnbc_clf.score(review_test, sentiment_test)))

So we have completed the whole process from training to testing.

The test set accuracy is typically around 79%-80%.

This is partly because many reviews contain strongly negative words even when the review itself is positive, for example this review of the documentary The Cove (Dolphin Bay):

I don't think most of the people who are touched by this film know that China's baiji dolphin went extinct eight years ago, or that only about 1,000 finless porpoises are left in the Yangtze River. Instead of lamenting and cursing the Japanese for killing dolphins, it would be better to do something practical to protect the Yangtze finless porpoise, which will disappear within a few years. What the Chinese are doing is no better than what the Japanese did.

So if such misleading positive reviews were cleaned out of the corpus, the accuracy could be improved further.

Save the trained model

import pickle

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB


model_export_path = './data/bayes.pkl'

# Same parameters as the CountVectorizer used earlier
vectorizer = CountVectorizer(max_df=0.8, min_df=3)
tfidftransformer = TfidfTransformer()

# First build the word frequency matrix, then compute the TF-IDF values
tfidf = tfidftransformer.fit_transform(vectorizer.fit_transform(review_train))
# Multinomial Naive Bayes classifier
clf = MultinomialNB().fit(tfidf, sentiment_train)

# Persist the classifier together with the fitted vectorizer and transformer
with open(model_export_path, 'wb') as file:
    d = {
        "clf": clf,
        "vectorizer": vectorizer,
        "tfidftransformer": tfidftransformer,
    }
    pickle.dump(d, file)

Predicting Movie Review Sentiment with the Trained Model

Here I paste the full source code. The code is very simple: the whole processing logic is encapsulated in a class, so it is very convenient to use.

Clone it directly from my GitHub if needed.

# -*- coding: utf-8 -*-
import re
import pickle

import numpy as np
import jieba


class SentimentAnalyzer(object):
    def __init__(self, model_path, userdict_path, stopword_path):
        self.clf = None
        self.vectorizer = None
        self.tfidftransformer = None
        self.model_path = model_path
        self.stopword_path = stopword_path
        self.userdict_path = userdict_path
        self.stop_words = []
        self.tokenizer = jieba.Tokenizer()
        self.initialize()

    # Load the stop words, the trained model and the user dictionary
    def initialize(self):
        with open(self.stopword_path, encoding='UTF-8') as words:
            self.stop_words = [i.strip() for i in words.readlines()]

        with open(self.model_path, 'rb') as file:
            model = pickle.load(file)
            self.clf = model['clf']
            self.vectorizer = model['vectorizer']
            self.tfidftransformer = model['tfidftransformer']
        if self.userdict_path:
            self.tokenizer.load_userdict(self.userdict_path)

    # Filter out URLs and irrelevant characters from the text
    def replace_text(self, text):
        text = re.sub('((https?|ftp|file)://)?[-A-Za-z0-9+&@#/%?=~_|!:,.;]+[-A-Za-z0-9+&@#/%=~_|].(com|cn)', '', text)
        text = text.replace('\u3000', '').replace('\xa0', '').replace('“', '').replace('”', '')
        text = text.replace(' ', '').replace('↵', '').replace('\n', '').replace('\r', '').replace('\t', '').replace(')', '')
        # Split into sentences on Chinese/Western end-of-sentence punctuation
        text_corpus = re.split('[!!。??;;……]', text)
        return text_corpus

    # Sentiment analysis and scoring
    def predict_score(self, text_corpus):
        # Word segmentation
        docs = [self.__cut_word(sentence) for sentence in text_corpus]
        new_tfidf = self.tfidftransformer.transform(self.vectorizer.transform(docs))
        predicted = self.clf.predict_proba(new_tfidf)
        # Round the probabilities to three decimal places
        result = np.around(predicted, decimals=3)
        return result

    # jieba word segmentation
    def __cut_word(self, sentence):
        words = [i for i in self.tokenizer.cut(sentence) if i not in self.stop_words]
        result = ' '.join(words)
        return result

    def analyze(self, text):
        text_corpus = self.replace_text(text)
        result = self.predict_score(text_corpus)

        neg = result[0][0]
        pos = result[0][1]

        print('Negative: {} Positive: {}'.format(neg, pos))

To use it, just instantiate the analyzer and call the analyze() method.

# -*- coding: utf-8 -*-
from native_bayes_sentiment_analyzer import SentimentAnalyzer


model_path = './data/bayes.pkl'
userdict_path = './data/userdict.txt'
stopword_path = './data/stopwords.txt'
corpus_path = './data/review.csv'


analyzer = SentimentAnalyzer(model_path=model_path, stopword_path=stopword_path, userdict_path=userdict_path)
text = "A disappointing Nolan film, more like a hodgepodge made by the Inception crew. Although I knew this Batman film was never going to surpass the second one, I really didn't expect it to be this bad. The failed pacing and the blurred character positioning are the fatal flaws of the whole film."
analyzer.analyze(text=text)

https://github.com/3inchtime/douban_sentiment_analysis

All of the above code has been pushed to my GitHub, and suggestions are welcome.

