1, Introduction to the LDA topic model
The LDA topic model is mainly used to infer the topic distribution of documents. It represents each document in a corpus as a probability distribution over topics, which can then be used for topic clustering or text classification.
The LDA topic model does not care about the order of words in a document and usually represents documents with bag-of-words features. For an introduction to the bag-of-words model, see this article: Text vectorization -- bag-of-words model -- Zhihu.
To understand the LDA model, we first need to understand its generative model: how does LDA think an article comes into being?
The LDA model assumes that each topic can be represented by a distribution over words, and each article can be represented by a distribution over topics.
For example, suppose there are two topics, food and beauty. LDA says each topic can be represented by a word distribution:
Food: {bread: 0.4, hot pot: 0.5, eyebrow pencil: 0.03, blush: 0.07}
Beauty: {eyebrow pencil: 0.4, blush: 0.5, bread: 0.03, hot pot: 0.07}
Similarly, LDA believes the two articles below can be represented by topic distributions:
Beauty diary {beauty: 0.8, food: 0.1, others: 0.1}
Food exploration {food: 0.8, beauty: 0.1, others: 0.1}
Therefore, to generate an article, you first pick a topic with the probability given by the article's topic distribution, then pick a word with the probability given by that topic's word distribution. Repeating these two steps over and over produces the final article, as the sketch below illustrates.
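To make this concrete, here is a minimal sketch that samples a toy "article" from the distributions above. The probabilities are just the illustrative numbers from this section, and the unspecified "others" topic is given an assumed uniform word distribution:

import random

# Toy distributions from the example above (illustrative numbers only)
topic_dist = {'food': 0.8, 'beauty': 0.1, 'others': 0.1}  # topic distribution of "Food exploration"
word_dist = {
    'food':   {'bread': 0.4, 'hot pot': 0.5, 'eyebrow pencil': 0.03, 'blush': 0.07},
    'beauty': {'eyebrow pencil': 0.4, 'blush': 0.5, 'bread': 0.03, 'hot pot': 0.07},
    'others': {'bread': 0.25, 'hot pot': 0.25, 'eyebrow pencil': 0.25, 'blush': 0.25},  # assumed uniform
}

article = []
for _ in range(10):  # generate a ten-word "article"
    # step 1: pick a topic according to the article's topic distribution
    topic = random.choices(list(topic_dist), weights=list(topic_dist.values()))[0]
    # step 2: pick a word according to that topic's word distribution
    words = word_dist[topic]
    word = random.choices(list(words), weights=list(words.values()))[0]
    article.append(word)
print(article)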
In the LDA model, a document is generated as follows: first, a topic distribution θ is drawn from a Dirichlet prior with parameter α; then, for each word position, a topic z is drawn from the multinomial distribution θ, and a word w is drawn from that topic's word distribution φ_z, where each topic's word distribution φ is itself drawn from a Dirichlet prior with parameter β.
Here, the Beta distribution is the conjugate prior of the binomial distribution, and the Dirichlet distribution is the conjugate prior of the multinomial distribution.
If we want to generate a document, the probability of each word in it is:

p(word | document) = Σ over topics of p(word | topic) × p(topic | document)
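For example, with the toy numbers above, the probability of "bread" appearing in "Food exploration" is 0.8 × 0.4 (from the food topic) + 0.1 × 0.03 (from the beauty topic) + 0.1 × p(bread | others). Ignoring the unspecified "others" topic, that is about 0.32 + 0.003 = 0.323.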
For a more detailed mathematical derivation, see: Popular understanding of the LDA topic model (The method of structure and the way of algorithm, CSDN blog).
The purpose of modeling is to infer the hidden topic distributions. In other words, humans write articles according to this document-generation process and hand them to the computer, which sees only the finished text. From the series of words observed in an article, the computer must work backwards to recover the topics and the probability of each one: the topic distribution.
As for how the LDA topic model is implemented inside the computer, we don't have to study it in detail: there are many ready-made packages for LDA topic analysis, and we can use them directly. (That's right, I just call the packages.)
2, Python implementation
Before running the LDA topic analysis in Python, I first preprocessed the documents with word segmentation and stop-word removal (for details, see my earlier article: Word segmentation of single microblog documents in Python -- jieba segmentation (with reserved words and stop words) -- CSDN blog).
The input file used below has already been segmented into words, one document per line. If your text is still raw, see the sketch after this paragraph.
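For readers starting from raw text, here is a minimal preprocessing sketch along the lines of the linked article. The file paths and the stop-word list file are assumptions for illustration, not the ones used in this post:

import jieba

# Hypothetical paths -- replace with your own files
stopwords = set(open('E:/data/stopwords.txt', encoding='utf-8').read().split('\n'))

with open('E:/data/raw.txt', encoding='utf-8') as f, \
     open('E:/data/output.csv', 'w', encoding='utf-8') as out:
    for line in f:
        # segment each line with jieba and drop stop words and whitespace tokens
        words = [w for w in jieba.cut(line.strip()) if w not in stopwords and w.strip()]
        out.write(' '.join(words) + '\n')  # one space-separated document per line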
1. Import the required packages
import gensim
from gensim import corpora
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
import warnings
warnings.filterwarnings('ignore')  # ignore warnings for cleaner output
from gensim.models.coherencemodel import CoherenceModel
from gensim.models.ldamodel import LdaModel
2. Load data
First, convert the documents into a two-dimensional list (a list of lists), where each sub-list holds the tokens of one microblog:
PATH = "E:/data/output.csv"
file_object2 = open(PATH, encoding='utf-8', errors='ignore').read().split('\n')  # read the file line by line
data_set = []  # list holding the tokenized documents
for i in range(len(file_object2)):
    result = []
    seg_list = file_object2[i].split()
    for w in seg_list:  # collect the tokens of each line
        result.append(w)
    data_set.append(result)
print(data_set)
Then construct a dictionary and vectorize the corpus:
dictionary = corpora.Dictionary(data_set)  # build the dictionary
corpus = [dictionary.doc2bow(text) for text in data_set]  # bag-of-words counts for each document
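As a quick sanity check, here is what doc2bow produces on a tiny made-up corpus (the tokens are invented for illustration; corpora is the gensim module imported above):

from gensim import corpora

toy = [['bread', 'hot_pot', 'bread'], ['blush', 'bread']]
toy_dict = corpora.Dictionary(toy)
print(toy_dict.token2id)                   # e.g. {'bread': 0, 'hot_pot': 1, 'blush': 2}
print([toy_dict.doc2bow(t) for t in toy])  # e.g. [[(0, 2), (1, 1)], [(0, 1), (2, 1)]]

Each (id, count) pair says how many times the word with that dictionary id appears in the document.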
3. Build the LDA model
ldamodel = LdaModel(corpus, num_topics=10, id2word=dictionary, passes=30, random_state=1)  # 10 topics
print(ldamodel.print_topics(num_topics=10, num_words=15))  # print 15 words per topic
This is how to build an LDA model when the number of topics is already decided. In general, we can use metrics to evaluate model quality, or use them to pick the optimal number of topics. The usual metrics for evaluating an LDA topic model are perplexity and topic coherence: the lower the perplexity, or the higher the coherence, the better the model. Some studies suggest that perplexity is not a good indicator, so I generally use coherence to evaluate the model and select the optimal number of topics, but the code below computes both.
# Compute perplexity
def perplexity(num_topics):
    ldamodel = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=30)
    print(ldamodel.print_topics(num_topics=num_topics, num_words=15))
    print(ldamodel.log_perplexity(corpus))
    return ldamodel.log_perplexity(corpus)

# Compute coherence
def coherence(num_topics):
    ldamodel = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=30, random_state=1)
    print(ldamodel.print_topics(num_topics=num_topics, num_words=10))
    ldacm = CoherenceModel(model=ldamodel, texts=data_set, dictionary=dictionary, coherence='c_v')
    print(ldacm.get_coherence())
    return ldacm.get_coherence()
4. Plot the topic-coherence curve and select the optimal number of topics
x = range(1, 15)
# z = [perplexity(i) for i in x]  # use this line instead to evaluate with perplexity
y = [coherence(i) for i in x]
plt.plot(x, y)
plt.xlabel('Number of topics')
plt.ylabel('Coherence')
plt.rcParams['font.sans-serif'] = ['SimHei']
matplotlib.rcParams['axes.unicode_minus'] = False
plt.title('Topic-coherence curve')
plt.show()
Running this prints the word distribution of each candidate model and produces a coherence curve like the figure shown here.
5. Result output and visualization
From the evaluation above, 5 turns out to be a good number of topics. Next, we re-run the model with the number of topics set to 5 and output the most likely topic for each document:
from gensim.models import LdaModel
import pandas as pd
from gensim.corpora import Dictionary
from gensim import corpora, models
import csv

# Prepare the data
PATH = "E:/data/output1.csv"
file_object2 = open(PATH, encoding='utf-8', errors='ignore').read().split('\n')  # read the file line by line
data_set = []  # list holding the tokenized documents
for i in range(len(file_object2)):
    result = []
    seg_list = file_object2[i].split()
    for w in seg_list:  # collect the tokens of each line
        result.append(w)
    data_set.append(result)

dictionary = corpora.Dictionary(data_set)  # build the dictionary
corpus = [dictionary.doc2bow(text) for text in data_set]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=5, passes=30, random_state=1)
topic_list = lda.print_topics()
print(topic_list)

# Print the most probable topic for each document
for i in lda.get_document_topics(corpus)[:]:
    listj = []
    for j in i:
        listj.append(j[1])
    bz = listj.index(max(listj))
    print(i[bz][0])
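The snippet above imports pandas and csv but never uses them. If you want to keep the dominant topic of every document rather than just print it, a small variant like the following writes the results to a CSV. This is a sketch, not the original code: the output path is hypothetical, and empty documents (e.g. blank lines) are skipped as a precaution:

import pandas as pd  # already imported above

rows = []
for i, doc_topics in enumerate(lda.get_document_topics(corpus)):
    if not doc_topics:  # skip empty documents such as blank lines
        continue
    best = max(doc_topics, key=lambda x: x[1])  # (topic_id, probability) pair with the highest probability
    rows.append({'doc_id': i, 'topic': best[0], 'probability': best[1]})
pd.DataFrame(rows).to_csv('E:/data/doc_topics.csv', index=False)  # hypothetical output path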
At the same time, we can use pyLDAvis to visualize the results of the LDA model:
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
data = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(data, 'E:/data/3topic.html')
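One caveat: in newer versions of pyLDAvis (3.x and later), the gensim helper module was renamed, so if the import above fails, the following variant of the same code should work:

import pyLDAvis
import pyLDAvis.gensim_models  # renamed from pyLDAvis.gensim in pyLDAvis 3.x

data = pyLDAvis.gensim_models.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(data, 'E:/data/3topic.html')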
The result looks roughly like this:
Each circle on the left represents a topic, and the bar chart on the right shows how much each word contributes to the selected topic.
The complete code is as follows:
import gensim
from gensim import corpora
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
import warnings
warnings.filterwarnings('ignore')  # ignore warnings for cleaner output
from gensim.models.coherencemodel import CoherenceModel
from gensim.models.ldamodel import LdaModel

# Prepare the data
PATH = "E:/data/output.csv"
file_object2 = open(PATH, encoding='utf-8', errors='ignore').read().split('\n')  # read the file line by line
data_set = []  # list holding the tokenized documents
for i in range(len(file_object2)):
    result = []
    seg_list = file_object2[i].split()
    for w in seg_list:  # collect the tokens of each line
        result.append(w)
    data_set.append(result)
print(data_set)

dictionary = corpora.Dictionary(data_set)  # build the dictionary
corpus = [dictionary.doc2bow(text) for text in data_set]  # build the document-term matrix

# Compute perplexity
def perplexity(num_topics):
    ldamodel = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=30)
    print(ldamodel.print_topics(num_topics=num_topics, num_words=15))
    print(ldamodel.log_perplexity(corpus))
    return ldamodel.log_perplexity(corpus)

# Compute coherence
def coherence(num_topics):
    ldamodel = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=30, random_state=1)
    print(ldamodel.print_topics(num_topics=num_topics, num_words=10))
    ldacm = CoherenceModel(model=ldamodel, texts=data_set, dictionary=dictionary, coherence='c_v')
    print(ldacm.get_coherence())
    return ldacm.get_coherence()

# Plot the coherence curve
x = range(1, 15)
# z = [perplexity(i) for i in x]  # use this line instead to evaluate with perplexity
y = [coherence(i) for i in x]
plt.plot(x, y)
plt.xlabel('Number of topics')
plt.ylabel('Coherence')
plt.rcParams['font.sans-serif'] = ['SimHei']
matplotlib.rcParams['axes.unicode_minus'] = False
plt.title('Topic-coherence curve')
plt.show()
from gensim.models import LdaModel
import pandas as pd
from gensim.corpora import Dictionary
from gensim import corpora, models
import csv

# Prepare the data
PATH = "E:/data/output1.csv"
file_object2 = open(PATH, encoding='utf-8', errors='ignore').read().split('\n')  # read the file line by line
data_set = []  # list holding the tokenized documents
for i in range(len(file_object2)):
    result = []
    seg_list = file_object2[i].split()
    for w in seg_list:  # collect the tokens of each line
        result.append(w)
    data_set.append(result)

dictionary = corpora.Dictionary(data_set)  # build the dictionary
corpus = [dictionary.doc2bow(text) for text in data_set]  # build the document-term matrix

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=5, passes=30, random_state=1)
topic_list = lda.print_topics()
print(topic_list)

# Collect the most probable topic for each document
result_list = []
for i in lda.get_document_topics(corpus)[:]:
    listj = []
    for j in i:
        listj.append(j[1])
    bz = listj.index(max(listj))
    result_list.append(i[bz][0])
print(result_list)
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
data = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(data, 'E:/data/topic.html')
Feel free to take the code if you need it~
You can also follow me; I'll be posting more practical data analysis articles~