Problem description
First read Peter Brown et al.'s paper *An Estimate of an Upper Bound for the Entropy of English*, then, with reference to that article, calculate the average information entropy of Chinese. Dataset: https://share.weiyun.com/5zGPyJX
Experimental principle
Information entropy
The concept of information entropy was first proposed by Claude Shannon (1916-2001) in 1948, drawing on the concept of "thermal entropy" in thermodynamics, to express the uncertainty of information: the greater the entropy, the greater the uncertainty of the information. Its mathematical formula is:
H(X) = \sum_{x\in X} P(x)log(\frac{1}{P(x)}) = -\sum_{x\in X}P(x)log(P(x))
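As a quick check of the formula, here is a minimal Python sketch (not part of the original experiment; the toy distributions are invented for illustration). A fair coin should give exactly 1 bit:

```python
import math

def entropy(dist):
    """Shannon entropy in bits: H(X) = -sum_x p(x) * log2 p(x)."""
    return -sum(p * math.log(p, 2) for p in dist.values() if p > 0)

# Toy distributions, for illustration only.
print(entropy({"heads": 0.5, "tails": 0.5}))  # 1.0 bit: maximal uncertainty
print(entropy({"a": 0.9, "b": 0.1}))          # ~0.469 bits: less uncertainty
```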
For random variables with joint distribution (X, Y) \sim P(X, Y), the conditional entropy of X given Y is

\begin{aligned} H(X|Y) &= \sum_{y\in Y} P(y)H(X|Y=y)\\ &= -\sum_{y\in Y} P(y) \sum_{x\in X} P(x|y)log(P(x|y))\\ &= -\sum_{y\in Y}\sum_{x\in X} P(x,y)log(P(x|y)) \end{aligned}
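The last equality can be checked numerically. Below is a small sketch (illustration only, not from the original experiment; the 2×2 joint distribution is hypothetical) that computes H(X|Y) both ways:

```python
import math

# Hypothetical joint distribution P(x, y) over X = {0, 1} and Y = {0, 1}.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_y = {0: 0.6, 1: 0.4}  # marginals P(y) = sum_x P(x, y)

# Form 1: H(X|Y) = sum_y P(y) * H(X|Y=y)
h1 = sum(p_y[y] * -sum((joint[(x, y)] / p_y[y]) * math.log(joint[(x, y)] / p_y[y], 2)
                       for x in (0, 1))
         for y in (0, 1))

# Form 2: H(X|Y) = -sum_{x,y} P(x,y) * log2 P(x|y), with P(x|y) = P(x,y) / P(y)
h2 = -sum(p * math.log(p / p_y[y], 2) for (x, y), p in joint.items())

print(round(h1, 10) == round(h2, 10))  # True: both forms agree
```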
This conditional entropy is the quantity computed later by the bigram (bi-gram) and trigram (tri-gram) models.
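Concretely, under the bigram model the per-word entropy estimated by the code below is

H(w_i | w_{i-1}) = -\sum_{(w_{i-1},\,w_i)} P(w_{i-1}, w_i)log(P(w_i | w_{i-1}))

and the trigram model conditions on the two preceding words (w_{i-2}, w_{i-1}) in the same way.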
Parameter estimation of the language model
This experiment uses unigram, bigram, and trigram models to calculate the information entropy of Jin Yong's collected novels; the model parameters are estimated from raw corpus counts, as shown below.
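Specifically, maximum-likelihood estimation gives (the count notation C(\cdot) is ours, introduced for clarity):

\begin{aligned} P(w) \approx \frac{C(w)}{N},\qquad P(w_i | w_{i-1}) \approx \frac{C(w_{i-1}\,w_i)}{C(w_{i-1})},\qquad P(w_i | w_{i-2}\,w_{i-1}) \approx \frac{C(w_{i-2}\,w_{i-1}\,w_i)}{C(w_{i-2}\,w_{i-1})} \end{aligned}

where N is the total number of tokens in the corpus.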
Experimental process
Preprocessing of the experimental data
The data set consists of 16 novels by Mr. Jin Yong and contains a large amount of garbled text as well as useless or repeated Chinese and English symbols, so the experimental data must be preprocessed first.
1. Delete all hidden symbols.
2. Delete all non-Chinese characters.
3. Delete all punctuation marks without considering the context.
The preprocessing uses jieba for word segmentation; jieba is a Chinese word-segmentation library for Python. In this experiment, segmentation is carried out in accurate mode (note that the snippet below calls jieba.lcut_for_search, i.e. search-engine mode, while the full code in the appendix uses the default accurate mode via jieba.cut).
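For reference, a minimal illustration of jieba's three segmentation modes (the sample sentence is our own, not from the corpus):

```python
import jieba

text = "信息熵用来度量信息的不确定性"  # sample sentence, for illustration only
print(jieba.lcut(text))                # accurate mode (default)
print(jieba.lcut(text, cut_all=True))  # full mode: every possible word
print(jieba.lcut_for_search(text))     # search-engine mode
```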
The corpus-construction routine:

```python
def getCorpus(self, rootDir):
    corpus = []
    # Filter pattern: ASCII letters/digits plus common half- and full-width
    # punctuation (full-width characters restored here; adjust as needed).
    r1 = u'[a-zA-Z0-9'!"#$%&\'()*+,-./:：;<=>?@，。?★、…【】《》？""''！[\\]^_`{|}~]+'
    for file in os.listdir(rootDir):
        path = os.path.join(rootDir, file)
        if os.path.isfile(path):
            with open(os.path.abspath(path), "r", encoding='utf-8') as f:
                filecontext = f.read()
            filecontext = re.sub(r1, '', filecontext)
            filecontext = filecontext.replace("\n", '')
            filecontext = filecontext.replace(" ", '')
            seg_list = jieba.lcut_for_search(filecontext)
            corpus += seg_list
        elif os.path.isdir(path):
            corpus += self.getCorpus(path)  # recurse into subdirectories
    return corpus
```
Unigram model
```python
# Unigram entropy: H = -sum_w p(w) * log2 p(w), with p(w) = count(w) / words_len.
for uni_word in words_tf.items():
    entropy.append(-(uni_word[1] / words_len) * math.log(uni_word[1] / words_len, 2))
```
Bigram model
```python
for bi_word in bigram_tf.items():
    jp_xy = bi_word[1] / bigram_len               # joint probability p(w_{i-1}, w_i)
    cp_xy = bi_word[1] / words_tf[bi_word[0][0]]  # conditional probability p(w_i | w_{i-1})
    entropy.append(-jp_xy * math.log(cp_xy, 2))   # term of the bigram entropy sum
print("Chinese information entropy based on the word bigram model:", round(sum(entropy), 3), "bits/word")
```
Trigram model
```python
for tri_word in trigram_tf.items():
    jp_xy = tri_word[1] / trigram_len               # joint probability p(w_{i-2}, w_{i-1}, w_i)
    # words_tf here holds bigram context counts C(w_{i-2}, w_{i-1}); see the appendix.
    cp_xy = tri_word[1] / words_tf[tri_word[0][0]]  # conditional probability p(w_i | w_{i-2}, w_{i-1})
    entropy.append(-jp_xy * math.log(cp_xy, 2))     # term of the trigram entropy sum
print("Chinese information entropy based on the word trigram model:", round(sum(entropy), 3), "bits/word")
```
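The same computation can be written more compactly with collections.Counter. The sketch below is our restatement of the bigram case (the sample token list is invented), not the original code:

```python
import math
from collections import Counter

def bigram_entropy(tokens):
    """Per-word conditional entropy H(w_i | w_{i-1}) in bits, via MLE counts."""
    unigrams = Counter(tokens[:-1])                  # context counts C(w_{i-1})
    bigrams = Counter(zip(tokens[:-1], tokens[1:]))  # pair counts C(w_{i-1}, w_i)
    total = sum(bigrams.values())
    return -sum((c / total) * math.log(c / unigrams[w1], 2)
                for (w1, w2), c in bigrams.items())

# Invented token list, for illustration only.
print(bigram_entropy("the cat sat on the mat the cat ran".split()))
```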
Experimental results
Corpus information statistics
| # | Novel title | Characters in corpus | Tokens after segmentation | Average word length (chars/token) | Information entropy (bits/word) | Run time (s) |
|---|---|---|---|---|---|---|
| 1 | Thirty three swordsmen | 53682 | 31551 | 1.70144 | 11.65358 | 0.82091 |
| 2 | White horse whistling west wind | 64061 | 42207 | 1.51778 | 9.49565 | 0.84298 |
| 3 | Book and sword | 438338 | 255908 | 1.71287 | 11.68502 | 2.50098 |
| 4 | Xiake Xing | 317099 | 190545 | 1.66417 | 10.99487 | 2.14651 |
| 5 | The story of relying on heaven to kill Dragons | 830439 | 487540 | 1.70332 | 11.63975 | 4.22489 |
| 6 | Tianlong Babu | 830439 | 487540 | 1.70332 | 11.63975 | 4.4675 |
| 7 | Legend of Shooting Heroes | 794124 | 480018 | 1.65436 | 11.55264 | 4.8816 |
| 8 | Blue blood sword | 420068 | 246405 | 1.70479 | 11.69014 | 2.6309 |
| 9 | Eagle Warrior | 835666 | 502841 | 1.66189 | 11.51878 | 4.92906 |
| 10 | Xiaoao Jianghu | 833372 | 490987 | 1.69734 | 11.32648 | 4.64088 |
| 11 | Yuenv sword | 14803 | 9293 | 1.59292 | 9.42809 | 0.58128 |
| 12 | Liancheng formula | 199412 | 122177 | 1.63216 | 10.84085 | 1.59599 |
| 13 | Snow mountain flying fox | 119513 | 73383 | 1.62862 | 10.773 | 1.16091 |
| 14 | Flying fox | 378777 | 224513 | 1.6871 | 11.46932 | 2.26253 |
| 15 | Mandarin duck knife | 32552 | 20658 | 1.57576 | 9.71726 | 0.65679 |
| 16 | The Duke of Mount Deer | 1047468 | 626266 | 1.66459 | 11.25894 | 5.1325 |
Experimental results under different models
| # | Model | Characters in corpus | Tokens after segmentation | Average word length (chars/token) | Information entropy (bits/word) | Run time (s) |
|---|---|---|---|---|---|---|
| 1 | unigram | 7420081 | 4430767 | 1.67467 | 12.01312 | 37.68415 |
| 2 | bigram | 7420081 | 4430767 | 1.67467 | 6.8915 | 41.43119 |
| 3 | trigram | 7420081 | 4430767 | 1.67467 | 6.8915 | 60.37625 |
Appendix
```python
import jieba
import math
import time
import os
import re


class TraversalFun():
    # Initialization
    def __init__(self, rootDir):
        self.rootDir = rootDir

    def TraversalDir(self):
        return TraversalFun.getCorpus(self, self.rootDir)

    def getCorpus(self, rootDir):
        corpus = []
        count = 0
        # Filter pattern: ASCII letters/digits plus common half- and full-width
        # punctuation. You can also customize the character filtering here.
        r1 = u'[a-zA-Z0-9'!"#$%&\'()*+,-./:：;<=>?@，。?★、…【】《》？""''！[\\]^_`{|}~]+'
        listdir = os.listdir(rootDir)
        for file in listdir:
            path = os.path.join(rootDir, file)
            if os.path.isfile(path):
                # The source files are "ANSI" (GBK) encoded; gb18030 is a superset.
                with open(os.path.abspath(path), "r", encoding='gb18030') as f:
                    filecontext = f.read()
                filecontext = re.sub(r1, '', filecontext)
                filecontext = filecontext.replace("\n", '')
                filecontext = filecontext.replace(" ", '')
                # Strip the download-site advertisement footer (shown here in
                # translation; it must match the actual footer in the corpus files).
                filecontext = filecontext.replace(
                    "This book comes from www.cr173.com free txt Novel download station\n"
                    " Please pay attention to more updated free e-books www.cr173.com", '')
                count += len(filecontext)
                corpus.append(filecontext)
            elif os.path.isdir(path):
                # Recurse into subdirectories (the original called a missing AllFiles helper).
                sub_corpus, sub_count = self.getCorpus(path)
                corpus += sub_corpus
                count += sub_count
        return corpus, count


# Word-frequency statistics, used to compute the entropies.
def get_tf(tf_dic, words):
    # Counts every token except the last, so the same dict doubles as
    # bigram context counts C(w_{i-1}).
    for i in range(len(words) - 1):
        tf_dic[words[i]] = tf_dic.get(words[i], 0) + 1


def get_bigram_tf(tf_dic, words):
    for i in range(len(words) - 1):
        tf_dic[(words[i], words[i + 1])] = tf_dic.get((words[i], words[i + 1]), 0) + 1


def get_trigram_tf(tf_dic, words):
    for i in range(len(words) - 2):
        tf_dic[((words[i], words[i + 1]), words[i + 2])] = \
            tf_dic.get(((words[i], words[i + 1]), words[i + 2]), 0) + 1


def cal_unigram(corpus, count):
    before = time.time()
    split_words = []
    words_len = 0
    line_count = 0
    words_tf = {}
    for line in corpus:
        for x in jieba.cut(line):
            split_words.append(x)
            words_len += 1
        get_tf(words_tf, split_words)
        split_words = []
        line_count += 1
    print("Characters in corpus:", count)
    print("Number of tokens:", words_len)
    print("Average word length:", round(count / words_len, 5))
    entropy = []
    for uni_word in words_tf.items():
        entropy.append(-(uni_word[1] / words_len) * math.log(uni_word[1] / words_len, 2))
    print("Chinese information entropy based on the word unigram model:", round(sum(entropy), 5), "bits/word")
    after = time.time()
    print("Running time:", round(after - before, 5), "s")


def cal_bigram(corpus, count):
    before = time.time()
    split_words = []
    words_len = 0
    line_count = 0
    words_tf = {}
    bigram_tf = {}
    for line in corpus:
        for x in jieba.cut(line):
            split_words.append(x)
            words_len += 1
        get_tf(words_tf, split_words)          # context counts C(w_{i-1})
        get_bigram_tf(bigram_tf, split_words)  # pair counts C(w_{i-1}, w_i)
        split_words = []
        line_count += 1
    print("Characters in corpus:", count)
    print("Number of tokens:", words_len)
    print("Average word length:", round(count / words_len, 5))
    bigram_len = sum([dic[1] for dic in bigram_tf.items()])
    print("Bigram model length:", bigram_len)
    entropy = []
    for bi_word in bigram_tf.items():
        jp_xy = bi_word[1] / bigram_len               # joint probability p(w_{i-1}, w_i)
        cp_xy = bi_word[1] / words_tf[bi_word[0][0]]  # conditional probability p(w_i | w_{i-1})
        entropy.append(-jp_xy * math.log(cp_xy, 2))
    print("Chinese information entropy based on the word bigram model:", round(sum(entropy), 5), "bits/word")
    after = time.time()
    print("Running time:", round(after - before, 5), "s")


def cal_trigram(corpus, count):
    before = time.time()  # the original was missing the call parentheses here
    split_words = []
    words_len = 0
    line_count = 0
    words_tf = {}
    trigram_tf = {}
    for line in corpus:
        for x in jieba.cut(line):
            split_words.append(x)
            words_len += 1
        get_bigram_tf(words_tf, split_words)     # context counts C(w_{i-2}, w_{i-1})
        get_trigram_tf(trigram_tf, split_words)  # triple counts C(w_{i-2}, w_{i-1}, w_i)
        split_words = []
        line_count += 1
    print("Characters in corpus:", count)
    print("Number of tokens:", words_len)
    print("Average word length:", round(count / words_len, 5))
    trigram_len = sum([dic[1] for dic in trigram_tf.items()])
    print("Trigram model length:", trigram_len)
    entropy = []
    for tri_word in trigram_tf.items():
        jp_xy = tri_word[1] / trigram_len               # joint probability p(w_{i-2}, w_{i-1}, w_i)
        cp_xy = tri_word[1] / words_tf[tri_word[0][0]]  # conditional probability p(w_i | w_{i-2}, w_{i-1})
        entropy.append(-jp_xy * math.log(cp_xy, 2))
    print("Chinese information entropy based on the word trigram model:", round(sum(entropy), 5), "bits/word")
    after = time.time()
    print("Running time:", round(after - before, 5), "s")


if __name__ == '__main__':
    tra = TraversalFun("./datasets")
    corpus, count = tra.TraversalDir()
    cal_unigram(corpus, count)
    cal_bigram(corpus, count)
    cal_trigram(corpus, count)
```
Experimental code
The complete code for this experiment is listed in the appendix above.