Problem description
First read Peter Brown et al.'s paper *An Estimate of an Upper Bound for the Entropy of English*, then, with reference to that article, calculate the average information entropy of Chinese. Dataset: https://share.weiyun.com/5zGPyJX
Experimental principle
Information entropy
The concept of information entropy was first proposed by Claude Shannon (1916-2001) in 1948, drawing on the concept of "thermal entropy" in thermodynamics, to express the uncertainty of information: the greater the entropy, the greater the uncertainty of the information. Its mathematical formula is:
H(X) = \sum_{x\in X} P(x)log(\frac{1}{P(x)}) = -\sum_{x\in X}P(x)log(P(x))
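As a quick check of the formula, here is a minimal Python sketch (not part of the original experiment; the toy distributions are invented for illustration). A fair coin should give exactly 1 bit:

```python
import math

def entropy(dist):
    """Shannon entropy in bits: H(X) = -sum_x p(x) * log2 p(x)."""
    return -sum(p * math.log(p, 2) for p in dist.values() if p > 0)

# Toy distributions, for illustration only.
print(entropy({"heads": 0.5, "tails": 0.5}))  # 1.0 bit: maximal uncertainty
print(entropy({"a": 0.9, "b": 0.1}))          # ~0.469 bits: less uncertainty
```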
For random variables with joint distribution (X, Y) \sim P(X, Y), the conditional entropy of X given Y is

\begin{aligned} H(X|Y) &= \sum_{y\in Y} P(y)H(X|Y=y)\\ &= -\sum_{y\in Y} P(y) \sum_{x\in X} P(x|y)log(P(x|y))\\ &= -\sum_{y\in Y}\sum_{x\in X} P(x,y)log(P(x|y)) \end{aligned}
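The last equality can be checked numerically. Below is a small sketch (illustration only, not from the original experiment; the 2×2 joint distribution is hypothetical) that computes H(X|Y) both ways:

```python
import math

# Hypothetical joint distribution P(x, y) over X = {0, 1} and Y = {0, 1}.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_y = {0: 0.6, 1: 0.4}  # marginals P(y) = sum_x P(x, y)

# Form 1: H(X|Y) = sum_y P(y) * H(X|Y=y)
h1 = sum(p_y[y] * -sum((joint[(x, y)] / p_y[y]) * math.log(joint[(x, y)] / p_y[y], 2)
                       for x in (0, 1))
         for y in (0, 1))

# Form 2: H(X|Y) = -sum_{x,y} P(x,y) * log2 P(x|y), with P(x|y) = P(x,y) / P(y)
h2 = -sum(p * math.log(p / p_y[y], 2) for (x, y), p in joint.items())

print(round(h1, 10) == round(h2, 10))  # True: both forms agree
```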
This conditional entropy is the quantity computed later by the bigram (bi-gram) and trigram (tri-gram) models.
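Concretely, under the bigram model the per-word entropy estimated by the code below is

H(w_i | w_{i-1}) = -\sum_{(w_{i-1},\,w_i)} P(w_{i-1}, w_i)log(P(w_i | w_{i-1}))

and the trigram model conditions on the two preceding words (w_{i-2}, w_{i-1}) in the same way.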
Parameter estimation of the language model
This experiment uses unigram, bigram, and trigram models to calculate the information entropy of Jin Yong's collected novels; the model parameters are estimated from raw corpus counts, as shown below.
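Specifically, maximum-likelihood estimation gives (the count notation C(\cdot) is ours, introduced for clarity):

\begin{aligned} P(w) \approx \frac{C(w)}{N},\qquad P(w_i | w_{i-1}) \approx \frac{C(w_{i-1}\,w_i)}{C(w_{i-1})},\qquad P(w_i | w_{i-2}\,w_{i-1}) \approx \frac{C(w_{i-2}\,w_{i-1}\,w_i)}{C(w_{i-2}\,w_{i-1})} \end{aligned}

where N is the total number of tokens in the corpus.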
Experimental process
Preprocessing of the experimental data
The data set consists of 16 novels by Mr. Jin Yong and contains a large amount of garbled text as well as useless or repeated Chinese and English symbols, so the experimental data must be preprocessed first.
1. Delete all hidden symbols.
2. Delete all non-Chinese characters.
3. Delete all punctuation marks without considering the context.
The preprocessing uses jieba for word segmentation; jieba is a Chinese word-segmentation library for Python. In this experiment, segmentation is carried out in accurate mode (note that the snippet below calls jieba.lcut_for_search, i.e. search-engine mode, while the full code in the appendix uses the default accurate mode via jieba.cut).
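For reference, a minimal illustration of jieba's three segmentation modes (the sample sentence is our own, not from the corpus):

```python
import jieba

text = "信息熵用来度量信息的不确定性"  # sample sentence, for illustration only
print(jieba.lcut(text))                # accurate mode (default)
print(jieba.lcut(text, cut_all=True))  # full mode: every possible word
print(jieba.lcut_for_search(text))     # search-engine mode
```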
The corpus-construction routine:

```python
def getCorpus(self, rootDir):
    corpus = []
    # Filter pattern: ASCII letters/digits plus common half- and full-width
    # punctuation (full-width characters restored here; adjust as needed).
    r1 = u'[a-zA-Z0-9'!"#$%&\'()*+,-./:：;<=>?@，。?★、…【】《》？""''！[\\]^_`{|}~]+'
    for file in os.listdir(rootDir):
        path = os.path.join(rootDir, file)
        if os.path.isfile(path):
            with open(os.path.abspath(path), "r", encoding='utf-8') as f:
                filecontext = f.read()
            filecontext = re.sub(r1, '', filecontext)
            filecontext = filecontext.replace("\n", '')
            filecontext = filecontext.replace(" ", '')
            seg_list = jieba.lcut_for_search(filecontext)
            corpus += seg_list
        elif os.path.isdir(path):
            corpus += self.getCorpus(path)  # recurse into subdirectories
    return corpus
```
Unigram model
```python
# Unigram entropy: H = -sum_w p(w) * log2 p(w), with p(w) = count(w) / words_len.
for uni_word in words_tf.items():
    entropy.append(-(uni_word[1] / words_len) * math.log(uni_word[1] / words_len, 2))
```
Bigram model
```python
for bi_word in bigram_tf.items():
    jp_xy = bi_word[1] / bigram_len               # joint probability p(w_{i-1}, w_i)
    cp_xy = bi_word[1] / words_tf[bi_word[0][0]]  # conditional probability p(w_i | w_{i-1})
    entropy.append(-jp_xy * math.log(cp_xy, 2))   # term of the bigram entropy sum
print("Chinese information entropy based on the word bigram model:", round(sum(entropy), 3), "bits/word")
```
Trigram model
```python
for tri_word in trigram_tf.items():
    jp_xy = tri_word[1] / trigram_len               # joint probability p(w_{i-2}, w_{i-1}, w_i)
    # words_tf here holds bigram context counts C(w_{i-2}, w_{i-1}); see the appendix.
    cp_xy = tri_word[1] / words_tf[tri_word[0][0]]  # conditional probability p(w_i | w_{i-2}, w_{i-1})
    entropy.append(-jp_xy * math.log(cp_xy, 2))     # term of the trigram entropy sum
print("Chinese information entropy based on the word trigram model:", round(sum(entropy), 3), "bits/word")
```
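The same computation can be written more compactly with collections.Counter. The sketch below is our restatement of the bigram case (the sample token list is invented), not the original code:

```python
import math
from collections import Counter

def bigram_entropy(tokens):
    """Per-word conditional entropy H(w_i | w_{i-1}) in bits, via MLE counts."""
    unigrams = Counter(tokens[:-1])                  # context counts C(w_{i-1})
    bigrams = Counter(zip(tokens[:-1], tokens[1:]))  # pair counts C(w_{i-1}, w_i)
    total = sum(bigrams.values())
    return -sum((c / total) * math.log(c / unigrams[w1], 2)
                for (w1, w2), c in bigrams.items())

# Invented token list, for illustration only.
print(bigram_entropy("the cat sat on the mat the cat ran".split()))
```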
Experimental results
Corpus information statistics
| # | Novel title | Characters in corpus | Tokens after segmentation | Average word length (chars/token) | Information entropy (bits/word) | Run time (s) |
|---|---|---|---|---|---|---|
| 1 | Thirty three swordsmen | 53682 | 31551 | 1.70144 | 11.65358 | 0.82091 |
| 2 | White horse whistling west wind | 64061 | 42207 | 1.51778 | 9.49565 | 0.84298 |
| 3 | Book and sword | 438338 | 255908 | 1.71287 | 11.68502 | 2.50098 |
| 4 | Xiake Xing | 317099 | 190545 | 1.66417 | 10.99487 | 2.14651 |
| 5 | The story of relying on heaven to kill Dragons | 830439 | 487540 | 1.70332 | 11.63975 | 4.22489 |
| 6 | Tianlong Babu | 830439 | 487540 | 1.70332 | 11.63975 | 4.4675 |
| 7 | Legend of Shooting Heroes | 794124 | 480018 | 1.65436 | 11.55264 | 4.8816 |
| 8 | Blue blood sword | 420068 | 246405 | 1.70479 | 11.69014 | 2.6309 |
| 9 | Eagle Warrior | 835666 | 502841 | 1.66189 | 11.51878 | 4.92906 |
| 10 | Xiaoao Jianghu | 833372 | 490987 | 1.69734 | 11.32648 | 4.64088 |
| 11 | Yuenv sword | 14803 | 9293 | 1.59292 | 9.42809 | 0.58128 |
| 12 | Liancheng formula | 199412 | 122177 | 1.63216 | 10.84085 | 1.59599 |
| 13 | Snow mountain flying fox | 119513 | 73383 | 1.62862 | 10.773 | 1.16091 |
| 14 | Flying fox | 378777 | 224513 | 1.6871 | 11.46932 | 2.26253 |
| 15 | Mandarin duck knife | 32552 | 20658 | 1.57576 | 9.71726 | 0.65679 |
| 16 | The Duke of Mount Deer | 1047468 | 626266 | 1.66459 | 11.25894 | 5.1325 |
Experimental results under different models
| # | Model | Characters in corpus | Tokens after segmentation | Average word length (chars/token) | Information entropy (bits/word) | Run time (s) |
|---|---|---|---|---|---|---|
| 1 | unigram | 7420081 | 4430767 | 1.67467 | 12.01312 | 37.68415 |
| 2 | bigram | 7420081 | 4430767 | 1.67467 | 6.8915 | 41.43119 |
| 3 | trigram | 7420081 | 4430767 | 1.67467 | 6.8915 | 60.37625 |
Appendix
```python
import jieba
import math
import time
import os
import re


class TraversalFun():
    # Initialization
    def __init__(self, rootDir):
        self.rootDir = rootDir

    def TraversalDir(self):
        return TraversalFun.getCorpus(self, self.rootDir)

    def getCorpus(self, rootDir):
        corpus = []
        count = 0
        # Filter pattern: ASCII letters/digits plus common half- and full-width
        # punctuation. You can also customize the character filtering here.
        r1 = u'[a-zA-Z0-9'!"#$%&\'()*+,-./:：;<=>?@，。?★、…【】《》？""''！[\\]^_`{|}~]+'
        listdir = os.listdir(rootDir)
        for file in listdir:
            path = os.path.join(rootDir, file)
            if os.path.isfile(path):
                # The source files are "ANSI" (GBK) encoded; gb18030 is a superset.
                with open(os.path.abspath(path), "r", encoding='gb18030') as f:
                    filecontext = f.read()
                filecontext = re.sub(r1, '', filecontext)
                filecontext = filecontext.replace("\n", '')
                filecontext = filecontext.replace(" ", '')
                # Strip the download-site advertisement footer (shown here in
                # translation; it must match the actual footer in the corpus files).
                filecontext = filecontext.replace(
                    "This book comes from www.cr173.com free txt Novel download station\n"
                    " Please pay attention to more updated free e-books www.cr173.com", '')
                count += len(filecontext)
                corpus.append(filecontext)
            elif os.path.isdir(path):
                # Recurse into subdirectories (the original called a missing AllFiles helper).
                sub_corpus, sub_count = self.getCorpus(path)
                corpus += sub_corpus
                count += sub_count
        return corpus, count


# Word-frequency statistics, used to compute the entropies.
def get_tf(tf_dic, words):
    # Counts every token except the last, so the same dict doubles as
    # bigram context counts C(w_{i-1}).
    for i in range(len(words) - 1):
        tf_dic[words[i]] = tf_dic.get(words[i], 0) + 1


def get_bigram_tf(tf_dic, words):
    for i in range(len(words) - 1):
        tf_dic[(words[i], words[i + 1])] = tf_dic.get((words[i], words[i + 1]), 0) + 1


def get_trigram_tf(tf_dic, words):
    for i in range(len(words) - 2):
        tf_dic[((words[i], words[i + 1]), words[i + 2])] = \
            tf_dic.get(((words[i], words[i + 1]), words[i + 2]), 0) + 1


def cal_unigram(corpus, count):
    before = time.time()
    split_words = []
    words_len = 0
    line_count = 0
    words_tf = {}
    for line in corpus:
        for x in jieba.cut(line):
            split_words.append(x)
            words_len += 1
        get_tf(words_tf, split_words)
        split_words = []
        line_count += 1
    print("Characters in corpus:", count)
    print("Number of tokens:", words_len)
    print("Average word length:", round(count / words_len, 5))
    entropy = []
    for uni_word in words_tf.items():
        entropy.append(-(uni_word[1] / words_len) * math.log(uni_word[1] / words_len, 2))
    print("Chinese information entropy based on the word unigram model:", round(sum(entropy), 5), "bits/word")
    after = time.time()
    print("Running time:", round(after - before, 5), "s")


def cal_bigram(corpus, count):
    before = time.time()
    split_words = []
    words_len = 0
    line_count = 0
    words_tf = {}
    bigram_tf = {}
    for line in corpus:
        for x in jieba.cut(line):
            split_words.append(x)
            words_len += 1
        get_tf(words_tf, split_words)          # context counts C(w_{i-1})
        get_bigram_tf(bigram_tf, split_words)  # pair counts C(w_{i-1}, w_i)
        split_words = []
        line_count += 1
    print("Characters in corpus:", count)
    print("Number of tokens:", words_len)
    print("Average word length:", round(count / words_len, 5))
    bigram_len = sum([dic[1] for dic in bigram_tf.items()])
    print("Bigram model length:", bigram_len)
    entropy = []
    for bi_word in bigram_tf.items():
        jp_xy = bi_word[1] / bigram_len               # joint probability p(w_{i-1}, w_i)
        cp_xy = bi_word[1] / words_tf[bi_word[0][0]]  # conditional probability p(w_i | w_{i-1})
        entropy.append(-jp_xy * math.log(cp_xy, 2))
    print("Chinese information entropy based on the word bigram model:", round(sum(entropy), 5), "bits/word")
    after = time.time()
    print("Running time:", round(after - before, 5), "s")


def cal_trigram(corpus, count):
    before = time.time()  # the original was missing the call parentheses here
    split_words = []
    words_len = 0
    line_count = 0
    words_tf = {}
    trigram_tf = {}
    for line in corpus:
        for x in jieba.cut(line):
            split_words.append(x)
            words_len += 1
        get_bigram_tf(words_tf, split_words)     # context counts C(w_{i-2}, w_{i-1})
        get_trigram_tf(trigram_tf, split_words)  # triple counts C(w_{i-2}, w_{i-1}, w_i)
        split_words = []
        line_count += 1
    print("Characters in corpus:", count)
    print("Number of tokens:", words_len)
    print("Average word length:", round(count / words_len, 5))
    trigram_len = sum([dic[1] for dic in trigram_tf.items()])
    print("Trigram model length:", trigram_len)
    entropy = []
    for tri_word in trigram_tf.items():
        jp_xy = tri_word[1] / trigram_len               # joint probability p(w_{i-2}, w_{i-1}, w_i)
        cp_xy = tri_word[1] / words_tf[tri_word[0][0]]  # conditional probability p(w_i | w_{i-2}, w_{i-1})
        entropy.append(-jp_xy * math.log(cp_xy, 2))
    print("Chinese information entropy based on the word trigram model:", round(sum(entropy), 5), "bits/word")
    after = time.time()
    print("Running time:", round(after - before, 5), "s")


if __name__ == '__main__':
    tra = TraversalFun("./datasets")
    corpus, count = tra.TraversalDir()
    cal_unigram(corpus, count)
    cal_bigram(corpus, count)
    cal_trigram(corpus, count)
```
Experimental code
The complete code for this experiment is listed in the appendix above.