# Problem description

First read Peter F. Brown's paper *An Estimate of an Upper Bound for the Entropy of English*, then, following that article, calculate the average information entropy of Chinese. Dataset: https://share.weiyun.com/5zGPyJX

# Experimental principle

## Information entropy

The concept of information entropy was first proposed by Claude Shannon (1916–2001) in 1948, drawing on the concept of "thermal entropy" in thermodynamics. It quantifies the uncertainty of information: the greater the entropy, the greater the uncertainty. Its mathematical formula is:
$$H(X) = \sum_{x\in X} P(x)\log\frac{1}{P(x)} = -\sum_{x\in X}P(x)\log P(x)$$
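Numerically this is a one-line computation. A minimal sketch in Python (the distributions are toy examples, not corpus data):

```python
import math

# Entropy of a discrete distribution, in bits (log base 2).
def entropy(probs):
    return -sum(p * math.log(p, 2) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # a fair coin: 1 bit
print(entropy([1.0]))       # a certain outcome: 0 bits
```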
For random variables with joint distribution $(X, Y) \sim P(X, Y)$, the conditional information entropy of $X$ given $Y$ is

$$\begin{aligned} H(X|Y) &= \sum_{y\in Y} P(y)\,H(X|Y=y)\\ &= -\sum_{y\in Y} P(y) \sum_{x\in X}P(x|y) \log P(x|y)\\ &= -\sum_{y\in Y}\sum_{x\in X}P(y)P(x|y)\log P(x|y)\\ &= -\sum_{y\in Y}\sum_{x\in X}P(x,y)\log P(x|y) \end{aligned}$$
This conditional entropy is used later in the bigram (bi-gram) and trigram (tri-gram) models.
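The last line of the derivation is exactly what the n-gram code later estimates from counts. A small self-contained sketch with made-up pair counts:

```python
import math

# H(X|Y) = -sum_{x,y} P(x,y) * log2 P(x|y), estimated from (x, y) pair
# counts; the counts below are toy data, not corpus statistics.
def conditional_entropy(pair_counts):
    total = sum(pair_counts.values())
    y_counts = {}
    for (x, y), c in pair_counts.items():
        y_counts[y] = y_counts.get(y, 0) + c
    h = 0.0
    for (x, y), c in pair_counts.items():
        p_xy = c / total               # joint probability P(x, y)
        p_x_given_y = c / y_counts[y]  # conditional probability P(x|y)
        h -= p_xy * math.log(p_x_given_y, 2)
    return h

# When X is fully determined by Y, H(X|Y) = 0.
print(conditional_entropy({("a", "y1"): 3, ("b", "y2"): 5}))  # -> 0.0
```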

## Parameter estimation of language model

This article uses unigram, bigram and trigram models to calculate the information entropy of Jin Yong's collected novels.
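The model parameters are estimated by maximum likelihood, i.e. as relative frequencies of counts in the corpus (this is the standard estimate, and it is what the count ratios in the code compute):

$$P(w_i) = \frac{C(w_i)}{N},\qquad P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}w_i)}{C(w_{i-1})},\qquad P(w_i \mid w_{i-2}w_{i-1}) = \frac{C(w_{i-2}w_{i-1}w_i)}{C(w_{i-2}w_{i-1})}$$

where $C(\cdot)$ is a count in the corpus and $N$ is the total number of words.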

# Experimental process

## Preprocessing of experimental data

The data set consists of Mr. Jin Yong's 16 novels. The raw files contain a large amount of garbled text and useless or repeated Chinese and English symbols, so the data set must be preprocessed:
1. Delete all hidden symbols.
2. Delete all non-Chinese characters.
3. Delete all punctuation marks (context is not considered).
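Steps 2 and 3 can be sketched with a single regular expression. This is a minimal illustration only; the range `\u4e00-\u9fa5` (common CJK Unified Ideographs) is an assumption, not the exact filter used in the experiment's code:

```python
import re

# Keep only common Chinese characters, dropping letters, digits,
# punctuation and whitespace in one pass.
def keep_chinese(text: str) -> str:
    return re.sub(r'[^\u4e00-\u9fa5]', '', text)

print(keep_chinese("金庸 Jin Yong, 1924!"))  # -> 金庸
```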

Word segmentation in the preprocessing uses jieba, a Chinese word-segmentation library for Python. In this experiment, segmentation is carried out in precise (accurate) mode.

```python
def getCorpus(self, rootDir):
    corpus = []
    # Filter out English letters, digits and punctuation
    r1 = u'[a-zA-Z0-9\'!"#$%&()*+,-./:：;<=>?@，。?★、…【】《》？“”‘’![\\]^_`{|}~]+'
    for file in os.listdir(rootDir):
        path = os.path.join(rootDir, file)
        if os.path.isfile(path):
            with open(os.path.abspath(path), "r", encoding='utf-8') as f:
                filecontext = f.read()
            filecontext = re.sub(r1, '', filecontext)
            filecontext = filecontext.replace("\n", '')
            filecontext = filecontext.replace(" ", '')
            seg_list = jieba.lcut_for_search(filecontext)
            corpus += seg_list
        elif os.path.isdir(path):
            corpus += self.getCorpus(path)  # recurse into subdirectories
    return corpus
```

## Unigram model

```python
for uni_word in words_tf.items():
    entropy.append(-(uni_word[1] / words_len) * math.log(uni_word[1] / words_len, 2))
```

## Bigram model

```python
for bi_word in bigram_tf.items():
    jp_xy = bi_word[1] / bigram_len               # joint probability p(x, y)
    cp_xy = bi_word[1] / words_tf[bi_word[0][0]]  # conditional probability p(x|y)
    entropy.append(-jp_xy * math.log(cp_xy, 2))   # entropy contribution of this bigram
print("Chinese information entropy based on the word bigram model is:", round(sum(entropy), 3), "bits/word")
```

## Trigram model

```python
for tri_word in trigram_tf.items():
    jp_xy = tri_word[1] / trigram_len               # joint probability p(x, y)
    cp_xy = tri_word[1] / words_tf[tri_word[0][0]]  # conditional probability p(x|y)
    entropy.append(-jp_xy * math.log(cp_xy, 2))     # entropy contribution of this trigram
print("Chinese information entropy based on the word trigram model is:", round(sum(entropy), 3), "bits/word")
```

# Experimental results

## Corpus statistics

| # | Novel title | Words in corpus | Segmented words | Average word length | Information entropy (bits/word) | Run time (s) |
|---|-------------|-----------------|-----------------|---------------------|---------------------------------|--------------|
| 1 | Thirty-three swordsmen | 53682 | 31551 | 1.70144 | 11.65358 | 0.82091 |
| 2 | White horse whistling west wind | 64061 | 42207 | 1.51778 | 9.49565 | 0.84298 |
| 3 | Book and sword | 438338 | 255908 | 1.71287 | 11.68502 | 2.50098 |
| 4 | Xiake Xing | 317099 | 190545 | 1.66417 | 10.99487 | 2.14651 |
| 5 | The story of relying on heaven to kill dragons | 830439 | 487540 | 1.70332 | 11.63975 | 4.22489 |
| 6 | Tianlong Babu | 830439 | 487540 | 1.70332 | 11.63975 | 4.4675 |
| 7 | Legend of shooting heroes | 794124 | 480018 | 1.65436 | 11.55264 | 4.8816 |
| 8 | Blue blood sword | 420068 | 246405 | 1.70479 | 11.69014 | 2.6309 |
| 9 | Eagle warrior | 835666 | 502841 | 1.66189 | 11.51878 | 4.92906 |
| 10 | Xiaoao Jianghu | 833372 | 490987 | 1.69734 | 11.32648 | 4.64088 |
| 11 | Yuenv sword | 14803 | 9293 | 1.59292 | 9.42809 | 0.58128 |
| 12 | Liancheng formula | 199412 | 122177 | 1.63216 | 10.84085 | 1.59599 |
| 13 | Snow mountain flying fox | 119513 | 73383 | 1.62862 | 10.773 | 1.16091 |
| 14 | Flying fox | 378777 | 224513 | 1.6871 | 11.46932 | 2.26253 |
| 15 | Mandarin duck knife | 32552 | 20658 | 1.57576 | 9.71726 | 0.65679 |
| 16 | Duke of Mount Deer | 1047468 | 626266 | 1.67459 | 11.25894 | 5.1325 |

## Results under different models

| # | Model | Words in corpus | Segmented words | Average word length | Information entropy (bits/word) | Run time (s) |
|---|-------|-----------------|-----------------|---------------------|---------------------------------|--------------|
| 1 | unigram | 7420081 | 4430767 | 1.67467 | 12.01312 | 37.68415 |
| 2 | bigram | 7420081 | 4430767 | 1.67467 | 6.8915 | 41.43119 |
| 3 | trigram | 7420081 | 4430767 | 1.67467 | 6.8915 | 60.37625 |

# Appendix

```python
import jieba
import math
import time
import os
import re


class TraversalFun():
    # Initialization
    def __init__(self, rootDir):
        self.rootDir = rootDir

    def TraversalDir(self):
        return TraversalFun.getCorpus(self, self.rootDir)

    def getCorpus(self, rootDir):
        corpus = []
        count = 0
        # The character filter can be customized here
        r1 = u'[a-zA-Z0-9\'!"#$%&()*+,-./:：;<=>?@，。?★、…【】《》？“”‘’![\\]^_`{|}~]+'
        listdir = os.listdir(rootDir)
        for file in listdir:
            path = os.path.join(rootDir, file)
            if os.path.isfile(path):
                with open(os.path.abspath(path), "r", encoding='ansi') as f:
                    filecontext = f.read()
                filecontext = re.sub(r1, '', filecontext)
                filecontext = filecontext.replace("\n", '')
                filecontext = filecontext.replace(" ", '')
                # Strip the download-site advertisement (newlines and spaces
                # have already been removed at this point)
                filecontext = filecontext.replace("This book comes from www.cr173.com free txt Novel download stationPlease pay attention to more updated free e-books www.cr173.com", '')
                # seg_list = jieba.cut(filecontext, cut_all=True)
                # corpus += seg_list
                count += len(filecontext)
                corpus.append(filecontext)
            elif os.path.isdir(path):
                # Recurse into subdirectories
                sub_corpus, sub_count = self.getCorpus(path)
                corpus += sub_corpus
                count += sub_count
        return corpus, count


# Word frequency statistics, used below to compute information entropy
def get_tf(tf_dic, words):
    # The last word is skipped so that unigram counts match the number of
    # times each word occurs as a bigram prefix
    for i in range(len(words) - 1):
        tf_dic[words[i]] = tf_dic.get(words[i], 0) + 1


def get_bigram_tf(tf_dic, words):
    for i in range(len(words) - 1):
        tf_dic[(words[i], words[i + 1])] = tf_dic.get((words[i], words[i + 1]), 0) + 1


def get_trigram_tf(tf_dic, words):
    for i in range(len(words) - 2):
        tf_dic[((words[i], words[i + 1]), words[i + 2])] = tf_dic.get(((words[i], words[i + 1]), words[i + 2]), 0) + 1


def cal_unigram(corpus, count):
    before = time.time()
    split_words = []
    words_len = 0
    line_count = 0
    words_tf = {}
    for line in corpus:
        for x in jieba.cut(line):
            split_words.append(x)
            words_len += 1
        get_tf(words_tf, split_words)
        split_words = []
        line_count += 1

    print("Number of words in corpus:", count)
    print("Number of segmented words:", words_len)
    print("Average word length:", round(count / words_len, 5))
    entropy = []
    for uni_word in words_tf.items():
        entropy.append(-(uni_word[1] / words_len) * math.log(uni_word[1] / words_len, 2))
    print("Chinese information entropy based on the word unigram model is:", round(sum(entropy), 5), "bits/word")
    after = time.time()
    print("Running time:", round(after - before, 5), "s")


def cal_bigram(corpus, count):
    before = time.time()
    split_words = []
    words_len = 0
    line_count = 0
    words_tf = {}
    bigram_tf = {}

    for line in corpus:
        for x in jieba.cut(line):
            split_words.append(x)
            words_len += 1

        get_tf(words_tf, split_words)
        get_bigram_tf(bigram_tf, split_words)

        split_words = []
        line_count += 1

    print("Number of words in corpus:", count)
    print("Number of segmented words:", words_len)
    print("Average word length:", round(count / words_len, 5))

    bigram_len = sum([dic[1] for dic in bigram_tf.items()])
    print("Number of bigrams:", bigram_len)

    entropy = []
    for bi_word in bigram_tf.items():
        jp_xy = bi_word[1] / bigram_len               # joint probability p(x, y)
        cp_xy = bi_word[1] / words_tf[bi_word[0][0]]  # conditional probability p(x|y)
        entropy.append(-jp_xy * math.log(cp_xy, 2))
    print("Chinese information entropy based on the word bigram model is:", round(sum(entropy), 5), "bits/word")

    after = time.time()
    print("Running time:", round(after - before, 5), "s")


def cal_trigram(corpus, count):
    before = time.time()
    split_words = []
    words_len = 0
    line_count = 0
    words_tf = {}
    trigram_tf = {}

    for line in corpus:
        for x in jieba.cut(line):
            split_words.append(x)
            words_len += 1

        # Bigram counts serve as the history counts for p(x|y)
        get_bigram_tf(words_tf, split_words)
        get_trigram_tf(trigram_tf, split_words)

        split_words = []
        line_count += 1

    print("Number of words in corpus:", count)
    print("Number of segmented words:", words_len)
    print("Average word length:", round(count / words_len, 5))

    trigram_len = sum([dic[1] for dic in trigram_tf.items()])
    print("Number of trigrams:", trigram_len)

    entropy = []
    for tri_word in trigram_tf.items():
        jp_xy = tri_word[1] / trigram_len               # joint probability p(x, y)
        cp_xy = tri_word[1] / words_tf[tri_word[0][0]]  # conditional probability p(x|y)
        entropy.append(-jp_xy * math.log(cp_xy, 2))
    print("Chinese information entropy based on the word trigram model is:", round(sum(entropy), 5), "bits/word")

    after = time.time()
    print("Running time:", round(after - before, 5), "s")


if __name__ == '__main__':
    tra = TraversalFun("./datasets")
    corpus, count = tra.TraversalDir()
    cal_unigram(corpus, count)
    cal_bigram(corpus, count)
    cal_trigram(corpus, count)
```
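As a sanity check of the counting scheme above, the bigram entropy can be worked out by hand on a toy word list (the list is made up; `uni` mirrors `get_tf`'s convention of skipping the last word, so its counts equal bigram-prefix counts):

```python
import math

# Toy verification of the bigram entropy computation.
words = ["a", "b", "a", "c", "a"]

# Unigram counts, mirroring get_tf (last word skipped).
uni = {}
for i in range(len(words) - 1):
    uni[words[i]] = uni.get(words[i], 0) + 1

# Bigram counts, mirroring get_bigram_tf.
bi = {}
for i in range(len(words) - 1):
    bi[(words[i], words[i + 1])] = bi.get((words[i], words[i + 1]), 0) + 1

bigram_len = sum(bi.values())
h = 0.0
for (w1, w2), c in bi.items():
    jp = c / bigram_len  # joint probability P(w1, w2)
    cp = c / uni[w1]     # conditional probability P(w2 | w1)
    h -= jp * math.log(cp, 2)
print(round(h, 5))  # -> 0.5
```

Here only "a" has an uncertain successor ("b" or "c", each with probability 1/2), which contributes the entire 0.5 bits.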



# Experimental code

The full code for this experiment is listed in the appendix above.


Added by LuiePL on Mon, 07 Mar 2022 19:32:41 +0200