Preface
In this post, the naive Bayes algorithm is used to perform sentiment analysis and prediction on reviews of the Douban Top250 movies.
I have recently been studying how to handle positive and negative sentiment in natural language processing, but most of the write-ups I could find are sentiment analyses of the IMDB movie reviews on Kaggle.
So here I use the most basic naive Bayes algorithm to analyze and predict the sentiment of Douban movie reviews.
I referred to this project: https://github.com/aeternae/IMDb_Review, many thanks.
Naive Bayes classifier
Bayesian classification is the general name for a family of classification algorithms, all based on Bayes' theorem, hence the name.
These algorithms are often used to classify articles and to filter spam emails and spam comments. Naive Bayes works well for such tasks at a low cost.
Given a conditional probability, how do we obtain the probability with the two events exchanged; that is, how do we get P(B|A) when P(A|B) is known?
P(B|A) denotes the probability that event B occurs given that event A has already occurred, called the conditional probability of B given A.
Naive Bayes Formula
P(B|A) = \frac{P(A|B)\,P(B)}{P(A)}
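This is simply Bayes' theorem, and it follows in one step from the definition of joint probability:

P(A \cap B) = P(A|B)\,P(B) = P(B|A)\,P(A) \quad\Rightarrow\quad P(B|A) = \frac{P(A|B)\,P(B)}{P(A)}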
An easy-to-understand video tutorial:
YouTube: https://www.youtube.com/watch?v=AqonCeZUcC4
Here is a somewhat inappropriate example.
We want to know the relationship between being a programmer and going bald, so we can work it out with the naive Bayes formula.
Now we want to find P(bald | programmer), that is, the probability that a programmer will go bald.
(I will never go bald in my life!)
Substituting into the naive Bayes formula:
P(\text{bald} \mid \text{programmer}) = \frac{P(\text{programmer} \mid \text{bald})\,P(\text{bald})}{P(\text{programmer})}
The known data are shown in the following table:

| Name | Occupation | Bald? |
|---|---|---|
| Kratos | God of War | yes |
| Agent 47 | Assassin | yes |
| Saitama | Superhero | yes |
| Thanos | Director of the Family Planning Office | yes |
| Jason Statham | Tough guy | yes |
| A 996 programmer | Programmer | yes |
| Me | Programmer | no |
Based on the naive Bayes formula, we can derive from the table above:
P(\text{bald} \mid \text{programmer}) = \frac{\frac{1}{6} \times \frac{6}{7}}{\frac{2}{7}} = \frac{1}{2}
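As a sanity check, here is the same computation in a few lines of Python, with the numbers read off the table above:

```python
# Read off the table: 7 people, 6 of them bald, 2 programmers,
# and exactly 1 of the 6 bald people is a programmer.
p_programmer_given_bald = 1 / 6
p_bald = 6 / 7
p_programmer = 2 / 7

# Bayes' formula
p_bald_given_programmer = p_programmer_given_bald * p_bald / p_programmer
print(p_bald_given_programmer)  # ≈ 0.5
```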
The example above is a simple illustration of the basic use of the naive Bayes formula.
Next, I use naive Bayes on the Douban Top250 movie reviews to train a model and predict positive and negative reviews.
Sentiment Analysis of Douban Top250 Movie Reviews
First of all, we need a corpus of Douban Top250 movie reviews. I used Scrapy to crawl about 50,000 reviews for training and validation.
Douban movie review crawler: https://github.com/3inchtime/douban_movie_review
With the corpus in hand, we can start the actual development.
I recommend working in a Jupyter notebook.
All of the following code can be found on my Github, and suggestions are welcome.
https://github.com/3inchtime/douban_sentiment_analysis
First, load the corpus:

```python
# -*- coding: utf-8 -*-
import random
import csv

import numpy as np
import jieba

file_path = './data/review.csv'
jieba.load_userdict('./data/userdict.txt')


# Load the corpus saved in CSV format
def load_corpus(corpus_path):
    with open(corpus_path, 'r') as f:
        reader = csv.reader(f)
        rows = [row for row in reader]

    # Shuffle to randomize the order of the reviews
    review_data = np.array(rows).tolist()
    random.shuffle(review_data)

    review_list = []
    sentiment_list = []
    # Column 0 holds the sentiment label, column 1 the review text
    for words in review_data:
        review_list.append(words[1])
        sentiment_list.append(words[0])

    return review_list, sentiment_list
```
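With that helper in place, loading the corpus is a single call (this line is my addition for completeness; it assumes, as in the code above, that each CSV row stores the sentiment label in column 0 and the review text in column 1):

```python
review_list, sentiment_list = load_corpus(file_path)
```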
Before training, the data set is shuffled to disrupt its original ordering and randomize the samples; otherwise the train/test split would simply follow the crawl order and bias the results. Hence the random.shuffle() above.
jieba.load_userdict('./data/userdict.txt'): here I built a custom dictionary to prevent jieba from segmenting certain phrases inaccurately, which improves accuracy by about 1%.
For example, in a sentence like 很不喜欢 ("really don't like"), jieba may split 不喜欢 ("don't like") into 不 ("not") and 喜欢 ("like"), which makes the sentence likely to be predicted as a positive review.
So I put many similar phrases into the custom dictionary, which improves accuracy a little, as sketched below.
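A minimal sketch of the effect (my own toy example; the exact segmentation depends on your jieba version and dictionaries):

```python
import jieba

sentence = '很不喜欢这部电影'  # "really don't like this movie"

# Without a custom entry, '不喜欢' may be split into '不' + '喜欢';
# if '不' is later dropped as a stop word, only '喜欢' ("like") survives.
print(list(jieba.cut(sentence)))

# Registering '不喜欢' as a single word keeps the negation attached.
jieba.add_word('不喜欢')
print(list(jieba.cut(sentence)))
```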
Then the corpus is split into a test set and a training set at a ratio of 1:4.
```python
n = len(review_list) // 5

train_review_list, train_sentiment_list = review_list[n:], sentiment_list[n:]
test_review_list, test_sentiment_list = review_list[:n], sentiment_list[:n]
```
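For reference, scikit-learn's train_test_split can do an equivalent split (and shuffle) in one call; this is just an alternative to the slicing above, with test_size=0.2 reproducing the 1:4 ratio:

```python
from sklearn.model_selection import train_test_split

train_review_list, test_review_list, train_sentiment_list, test_sentiment_list = train_test_split(
    review_list, sentiment_list, test_size=0.2)
```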
Word Segmentation
Using jieba, the reviews are segmented into words and stop words are removed.
```python
import re
import jieba

stopword_path = './data/stopwords.txt'


def load_stopwords(file_path):
    stop_words = []
    with open(file_path, encoding='UTF-8') as words:
        stop_words.extend([i.strip() for i in words.readlines()])
    return stop_words


def review_to_text(review):
    stop_words = load_stopwords(stopword_path)
    # Keep only Chinese characters and English letters
    review = re.sub("[^\u4e00-\u9fa5^a-z^A-Z]", '', review)
    review = jieba.cut(review)
    # Remove stop words
    all_stop_words = set(stop_words)
    words = [w for w in review if w not in all_stop_words]
    return words


# Reviews used for training
review_train = [' '.join(review_to_text(review)) for review in train_review_list]
# Positive/negative labels for the training reviews
sentiment_train = train_sentiment_list

# Reviews used for testing
review_test = [' '.join(review_to_text(review)) for review in test_review_list]
# Positive/negative labels for the test reviews
sentiment_test = test_sentiment_list
```
TF-IDF and Word Frequency Vectorization
TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining. It evaluates how important a word is based on how often it occurs in a document and in how many documents of the corpus it appears.
Its advantage is that it filters out words that are common but uninformative, while keeping the important words that shape the meaning of the text.
CountVectorizer() is used to convert documents into vectors by counting how often each word appears in the text.
The CountVectorizer class turns the words in the text into a word-frequency matrix: an element a[i][j] is the frequency of word j in document i. The counts are computed with the fit_transform() method.
TfidfTransformer is then used to compute the TF-IDF value of each word produced by the vectorizer.
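As a quick illustration on a made-up toy corpus (already space-tokenized, the way review_to_text outputs reviews; get_feature_names_out() assumes scikit-learn >= 1.0):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ['剧情 精彩 喜欢', '剧情 拖沓 不喜欢', '画面 精彩']

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)       # word-frequency matrix
print(vectorizer.get_feature_names_out())     # the learned vocabulary
print(counts.toarray())                       # a[i][j]: count of word j in document i

transformer = TfidfTransformer()
tfidf = transformer.fit_transform(counts)     # TF-IDF weighted matrix
print(tfidf.toarray().round(3))
```

We can wrap these steps in a scikit-learn Pipeline: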
```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

count_vec = CountVectorizer(max_df=0.8, min_df=3)
tfidf_vec = TfidfVectorizer()


# A Pipeline chains and manages all the steps, which makes it easy to
# reuse the same parameter set on new data sets such as the test set.
def MNB_Classifier():
    return Pipeline([
        ('count_vec', CountVectorizer()),
        ('mnb', MultinomialNB())
    ])
```
The parameter max_df acts as a threshold: when the vocabulary is built, a word whose document frequency is greater than max_df is not kept as a feature.
If the parameter is a float, it denotes the fraction of documents in the corpus that contain the word; if it is an int, it denotes the absolute number of documents containing the word.
min_df works the same way, except that a word whose document frequency is lower than min_df is not kept as a feature.
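A small made-up example of how the two thresholds interact (note that CountVectorizer's default tokenizer also silently drops single-character tokens, which matters for Chinese):

```python
from sklearn.feature_extraction.text import CountVectorizer

# '电影' appears in all 4 documents (document frequency 1.0 > max_df=0.8): dropped.
# '剧情' appears in only 1 document (below min_df=2): dropped.
docs = ['电影 精彩', '电影 精彩', '电影 无聊', '电影 剧情 无聊']

vec = CountVectorizer(max_df=0.8, min_df=2)
vec.fit(docs)
print(sorted(vec.vocabulary_))  # ['无聊', '精彩']
```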
With that, the Pipeline for training and testing is built.
The training set is then fitted with Pipeline.fit().
Pipeline.score() predicts on the test set and reports its accuracy directly.
```python
mnbc_clf = MNB_Classifier()

# Train
mnbc_clf.fit(review_train, sentiment_train)

# Test set accuracy
print('Test set accuracy: {}'.format(mnbc_clf.score(review_test, sentiment_test)))
```
This completes the whole process from training to testing.
The accuracy on the test set is roughly 79%-80%.
This is partly because many positive reviews are full of emotionally negative words, as in this review of the documentary The Cove (Dolphin Bay):
"I suspect most of the people touched by this film don't know that China's baiji dolphin went extinct eight years ago, or that only about 1,000 finless porpoises are left in the Yangtze River. Rather than lamenting and cursing the Japanese for killing dolphins, it would be better to do something practical to protect the Yangtze finless porpoise, which will disappear within a few years. What the Chinese are doing is no better than what the Japanese did."
So if we could filter out this kind of positive review full of negative words, the accuracy would improve further.
Save the trained model
```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

model_export_path = './data/bayes.pkl'

# vectorizer / tfidftransformer as configured earlier (max_df=0.8, min_df=3)
vectorizer = CountVectorizer(max_df=0.8, min_df=3)
tfidftransformer = TfidfTransformer()

# First build the word-frequency matrix, then compute the TF-IDF values
tfidf = tfidftransformer.fit_transform(vectorizer.fit_transform(review_train))

# Multinomial naive Bayes classifier
clf = MultinomialNB().fit(tfidf, sentiment_train)

with open(model_export_path, 'wb') as file:
    d = {
        "clf": clf,
        "vectorizer": vectorizer,
        "tfidftransformer": tfidftransformer,
    }
    pickle.dump(d, file)
```
Predicting Review Sentiment with the Trained Model
Here I paste the full source code directly. The code is very simple; the whole processing logic is encapsulated in a class, so it is very convenient to use.
Clone it from my Github if you need it.
```python
# -*- coding: utf-8 -*-
import re
import pickle

import numpy as np
import jieba


class SentimentAnalyzer(object):
    def __init__(self, model_path, userdict_path, stopword_path):
        self.clf = None
        self.vectorizer = None
        self.tfidftransformer = None
        self.model_path = model_path
        self.stopword_path = stopword_path
        self.userdict_path = userdict_path
        self.stop_words = []
        self.tokenizer = jieba.Tokenizer()
        self.initialize()

    # Load the stop words and the pickled model
    def initialize(self):
        with open(self.stopword_path, encoding='UTF-8') as words:
            self.stop_words = [i.strip() for i in words.readlines()]

        with open(self.model_path, 'rb') as file:
            model = pickle.load(file)
            self.clf = model['clf']
            self.vectorizer = model['vectorizer']
            self.tfidftransformer = model['tfidftransformer']
        if self.userdict_path:
            self.tokenizer.load_userdict(self.userdict_path)

    # Strip URLs and other irrelevant characters, then split into sentences
    def replace_text(self, text):
        text = re.sub('((https?|ftp|file)://)?[-A-Za-z0-9+&@#/%?=~_|!:,.;]+[-A-Za-z0-9+&@#/%=~_|].(com|cn)', '', text)
        text = text.replace('\u3000', '').replace('\xa0', '').replace('“', '').replace('”', '')
        text = text.replace(' ', '').replace('↵', '').replace('\n', '').replace('\r', '').replace('\t', '').replace(')', '')
        text_corpus = re.split('[!！。?？;；…]', text)
        return text_corpus

    # Sentiment analysis and scoring
    def predict_score(self, text_corpus):
        # Word segmentation
        docs = [self.__cut_word(sentence) for sentence in text_corpus]
        new_tfidf = self.tfidftransformer.transform(self.vectorizer.transform(docs))
        predicted = self.clf.predict_proba(new_tfidf)
        # Round to three decimal places
        result = np.around(predicted, decimals=3)
        return result

    # jieba word segmentation
    def __cut_word(self, sentence):
        words = [i for i in self.tokenizer.cut(sentence) if i not in self.stop_words]
        result = ' '.join(words)
        return result

    def analyze(self, text):
        text_corpus = self.replace_text(text)
        result = self.predict_score(text_corpus)

        neg = result[0][0]
        pos = result[0][1]

        print('Negative: {} Positive: {}'.format(neg, pos))
```
To use it, just instantiate the analyzer and call the analyze() method.
```python
# -*- coding: utf-8 -*-
from native_bayes_sentiment_analyzer import SentimentAnalyzer

model_path = './data/bayes.pkl'
userdict_path = './data/userdict.txt'
stopword_path = './data/stopwords.txt'
corpus_path = './data/review.csv'

analyzer = SentimentAnalyzer(model_path=model_path, stopword_path=stopword_path, userdict_path=userdict_path)

# A (negative) review of The Dark Knight Rises
text = ("A disappointing Nolan film, more like a hodgepodge by the Inception cast. "
        "Although I knew this Batman film could hardly surpass the previous two, "
        "I really didn't expect it to be this bad. The failed pacing and the blurred "
        "character positioning are the fatal flaws of the whole film.")
analyzer.analyze(text=text)
```
https://github.com/3inchtime/douban_sentiment_analysis
All of the code above has been pushed to my Github; suggestions are welcome.