Deep learning and natural language processing experiment -- calculating the information entropy of Chinese

Problem description

First read Peter Brown et al.'s paper on the entropy of English ("An Estimate of an Upper Bound for the Entropy of English"), then, with reference to that article, calculate the average information entropy of Chinese. Dataset: https://share.weiyun.com/5zGPyJX

Experimental principle

Information entropy

The concept of information entropy was first proposed by Shannon (1916-2001) in 1948, by analogy with the concept of "thermal entropy" in thermodynamics. It expresses the uncertainty of information: the greater the entropy, the greater the uncertainty. Its mathematical definition is:
$$H(X) = \sum_{x\in X} P(x)\log\frac{1}{P(x)} = -\sum_{x\in X} P(x)\log P(x)$$
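As a quick sanity check of this formula, here is a minimal Python sketch (the two example distributions are invented purely for illustration):

import math

def entropy(probs):
    # Shannon entropy, in bits, of a discrete distribution given as a list of probabilities
    return -sum(p * math.log(p, 2) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # a fair coin: 1.0 bit
print(entropy([0.9, 0.1]))  # a biased coin: about 0.469 bits, i.e. less uncertainty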
For a pair of random variables $(X, Y)$ with joint distribution $P(X, Y)$, the conditional information entropy of $X$ given $Y$ is

$$\begin{aligned} H(X\mid Y) &= -\sum_{y\in Y} P(y)\sum_{x\in X} P(x\mid y)\log P(x\mid y)\\ &= -\sum_{y\in Y}\sum_{x\in X} P(y)\,P(x\mid y)\log P(x\mid y)\\ &= -\sum_{y\in Y}\sum_{x\in X} P(x,y)\log P(x\mid y) \end{aligned}$$
This conditional entropy is what the later bigram and trigram models estimate.

Parameter estimation of language model

This experiment uses unigram, bigram, and trigram models to estimate the information entropy of Jin Yong's collected novels.
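In all three models the parameters are estimated by maximum likelihood, i.e. as relative frequencies counted over the segmented corpus (here $C(\cdot)$ denotes a count and $N$ the total number of word tokens), which is exactly what the code below computes:

$$\hat{P}(w_i)=\frac{C(w_i)}{N},\qquad \hat{P}(w_i\mid w_{i-1})=\frac{C(w_{i-1},w_i)}{C(w_{i-1})},\qquad \hat{P}(w_i\mid w_{i-2},w_{i-1})=\frac{C(w_{i-2},w_{i-1},w_i)}{C(w_{i-2},w_{i-1})}$$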

Experimental process

Preprocessing of experimental data

The data set consists of 16 of Mr. Jin Yong's novels, which contain a large amount of garbled text as well as useless or repeated Chinese and English symbols, so the data set must be preprocessed first:
1. Delete all hidden symbols.
2. Delete all non-Chinese characters.
3. Delete all punctuation marks (context is not taken into account).

Preprocessing uses jieba for word segmentation. jieba is a Chinese word segmentation library for Python; in this experiment segmentation is performed in precise (default) mode.
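For example, a minimal illustrative snippet (not part of the experiment's code) of precise-mode segmentation:

import jieba

print(jieba.lcut("我来到北京清华大学"))  # lcut returns a list of tokens
# expected output (it may vary slightly with the dictionary version):
# ['我', '来到', '北京', '清华大学']

The directory-traversal and cleaning routine used in the experiment is shown below: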

def getCorpus(self, rootDir):
    corpus = []
    # Characters to strip: ASCII letters/digits plus English and Chinese punctuation
    r1 = u'[a-zA-Z0-9!"#$%&\'()*+,-./:：;<=>?@，。？★、…【】《》“”‘’！\\[\\]^_`{|}~]+'
    for file in os.listdir(rootDir):
        path = os.path.join(rootDir, file)
        if os.path.isfile(path):
            with open(os.path.abspath(path), "r", encoding='utf-8') as f:
                filecontext = f.read()
                filecontext = re.sub(r1, '', filecontext)
                filecontext = filecontext.replace("\n", '')
                filecontext = filecontext.replace(" ", '')
                seg_list = jieba.lcut_for_search(filecontext)
                corpus += seg_list
        elif os.path.isdir(path):
            corpus += self.getCorpus(path)  # recurse into subdirectories
    return corpus

Unigram model

Under the unigram model the entropy is estimated from word frequencies alone, $H \approx -\sum_{w} P(w)\log_2 P(w)$, where $P(w)$ is the relative frequency of word $w$:

    for uni_word in words_tf.items():
        entropy.append(-(uni_word[1]/words_len)*math.log(uni_word[1]/words_len, 2))

Bigram model

Under the bigram model the entropy is estimated as $H \approx -\sum_{w_1,w_2} P(w_1,w_2)\log_2 P(w_2\mid w_1)$, with both probabilities taken as relative frequencies:

    for bi_word in bigram_tf.items():
        jp_xy = bi_word[1] / bigram_len  # joint probability p(w1, w2)
        cp_xy = bi_word[1] / words_tf[bi_word[0][0]]  # conditional probability p(w2 | w1)
        entropy.append(-jp_xy * math.log(cp_xy, 2))  # contribution -p(w1, w2) * log2 p(w2 | w1)
    print("Chinese information entropy based on the word bigram model:", round(sum(entropy), 3), "bits/word")

Trigram model

The trigram model conditions on the preceding two words, $H \approx -\sum P(w_1,w_2,w_3)\log_2 P(w_3\mid w_1,w_2)$, where the denominator of the conditional probability is the count of the preceding bigram:

    for tri_word in trigram_tf.items():
        jp_xy = tri_word[1] / trigram_len  # joint probability p(w1, w2, w3)
        cp_xy = tri_word[1] / words_tf[tri_word[0][0]]  # conditional probability p(w3 | w1, w2); words_tf holds bigram counts here
        entropy.append(-jp_xy * math.log(cp_xy, 2))  # contribution -p(w1, w2, w3) * log2 p(w3 | w1, w2)
    print("Chinese information entropy based on the word trigram model:", round(sum(entropy), 3), "bits/word")

Experimental results

Corpus information statistics

| No. | Novel title | Characters in corpus | Tokens after segmentation | Average word length | Information entropy (bits/word) | Run time (s) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Thirty three swordsmen | 53682 | 31551 | 1.70144 | 11.65358 | 0.82091 |
| 2 | White horse whistling west wind | 64061 | 42207 | 1.51778 | 9.49565 | 0.84298 |
| 3 | Book and sword | 438338 | 255908 | 1.71287 | 11.68502 | 2.50098 |
| 4 | Xiake Xing | 317099 | 190545 | 1.66417 | 10.99487 | 2.14651 |
| 5 | The story of relying on heaven to kill Dragons | 830439 | 487540 | 1.70332 | 11.63975 | 4.22489 |
| 6 | Tianlong Babu | 830439 | 487540 | 1.70332 | 11.63975 | 4.4675 |
| 7 | Legend of Shooting Heroes | 794124 | 480018 | 1.65436 | 11.55264 | 4.8816 |
| 8 | Blue blood sword | 420068 | 246405 | 1.70479 | 11.69014 | 2.6309 |
| 9 | Eagle Warrior | 835666 | 502841 | 1.66189 | 11.51878 | 4.92906 |
| 10 | Xiaoao Jianghu | 833372 | 490987 | 1.69734 | 11.32648 | 4.64088 |
| 11 | Yuenv sword | 14803 | 9293 | 1.59292 | 9.42809 | 0.58128 |
| 12 | Liancheng formula | 199412 | 122177 | 1.63216 | 10.84085 | 1.59599 |
| 13 | Snow mountain flying fox | 119513 | 73383 | 1.62862 | 10.773 | 1.16091 |
| 14 | Flying fox | 378777 | 224513 | 1.6871 | 11.46932 | 2.26253 |
| 15 | Mandarin duck knife | 32552 | 20658 | 1.57576 | 9.71726 | 0.65679 |
| 16 | DUKE OF MOUNT DEER | 1047468 | 626266 | 1.66459 | 11.25894 | 5.1325 |

Experimental results under different models

| No. | Model | Characters in corpus | Tokens after segmentation | Average word length | Information entropy (bits/word) | Run time (s) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | unigram | 7420081 | 4430767 | 1.67467 | 12.01312 | 37.68415 |
| 2 | bigram | 7420081 | 4430767 | 1.67467 | 6.8915 | 41.43119 |
| 3 | trigram | 7420081 | 4430767 | 1.67467 | 6.8915 | 60.37625 |

Appendix

import jieba
import math
import time
import os
import re
class TraversalFun():

    # 1 initialization
    def __init__(self, rootDir):
        self.rootDir = rootDir

    def TraversalDir(self):
        return TraversalFun.getCorpus(self, self.rootDir)

    def getCorpus(self, rootDir):
        corpus = []
        r1 = u'[a-zA-Z0-9!"#$%&\'()*+,-./:：;<=>?@，。？★、…【】《》“”‘’！\\[\\]^_`{|}~]+'  # characters to filter out; the set can be customized here
        listdir = os.listdir(rootDir)
        count=0
        for file in listdir:
            path  = os.path.join(rootDir, file)
            if os.path.isfile(path):
                with open(os.path.abspath(path), "r", encoding='ansi') as file:
                    filecontext = file.read();
                    filecontext = re.sub(r1, '', filecontext)
                    filecontext = filecontext.replace("\n", '')
                    filecontext = filecontext.replace(" ", '')
                    filecontext = filecontext.replace("This book comes from www.cr173.com free txt Novel download station\n Please pay attention to more updated free e-books www.cr173.com", '')
                    #seg_list = jieba.cut(filecontext, cut_all=True)
                    #corpus += seg_list
                    count += len(filecontext)
                    corpus.append(filecontext)
            elif os.path.isdir(path):
                sub_corpus, sub_count = self.getCorpus(path)  # recurse into subdirectories
                corpus += sub_corpus
                count += sub_count
        return corpus,count

# Word and n-gram frequency statistics, used later for the entropy calculations
def get_tf(tf_dic, words):
    # Counts every word except the last one in the line, so the counts line up
    # with the set of bigram histories counted by get_bigram_tf below.
    for i in range(len(words)-1):
        tf_dic[words[i]] = tf_dic.get(words[i], 0) + 1

def get_bigram_tf(tf_dic, words):
    # Count adjacent word pairs (w1, w2).
    for i in range(len(words)-1):
        tf_dic[(words[i], words[i+1])] = tf_dic.get((words[i], words[i+1]), 0) + 1

def get_trigram_tf(tf_dic, words):
    # Count word triples keyed as ((w1, w2), w3), so the history is the bigram (w1, w2).
    for i in range(len(words)-2):
        tf_dic[((words[i], words[i+1]), words[i+2])] = tf_dic.get(((words[i], words[i+1]), words[i+2]), 0) + 1
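# A quick sanity check of the three helpers above (illustrative only; not used in the experiment):
#   d = {}; get_tf(d, ['a', 'b', 'a'])          -> d == {'a': 1, 'b': 1}   (the last word of the line is not counted)
#   d = {}; get_bigram_tf(d, ['a', 'b', 'a'])   -> d == {('a', 'b'): 1, ('b', 'a'): 1}
#   d = {}; get_trigram_tf(d, ['a', 'b', 'a'])  -> d == {(('a', 'b'), 'a'): 1}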

def cal_unigram(corpus,count):
    before = time.time()
    split_words = []
    words_len = 0
    line_count = 0
    words_tf = {}
    for line in corpus:
        for x in jieba.cut(line):
            split_words.append(x)
            words_len += 1
        get_tf(words_tf, split_words)
        split_words = []
        line_count += 1

    print("Corpus of words:", count)
    print("Number of participles:", words_len)
    print("Average word length:", round(count / words_len, 5))
    entropy = []
    for uni_word in words_tf.items():
        entropy.append(-(uni_word[1] / words_len) * math.log(uni_word[1] / words_len, 2))
    print("Chinese information entropy based on word unary model is:", round(sum(entropy), 5), "Bit/Words")
    after = time.time()
    print("Running time:", round(after - before, 5), "s")

def cal_bigram(corpus, count):
    before = time.time()
    split_words = []
    words_len = 0
    line_count = 0
    words_tf = {}
    bigram_tf = {}

    for line in corpus:
        for x in jieba.cut(line):
            split_words.append(x)
            words_len += 1

        get_tf(words_tf, split_words)
        get_bigram_tf(bigram_tf, split_words)

        split_words = []
        line_count += 1

    print("Number of words in Corpus:", count)
    print("Number of participles:", words_len)
    print("Average word length:", round(count / words_len, 5))

    bigram_len = sum([dic[1] for dic in bigram_tf.items()])
    print("Binary model length:", bigram_len)

    entropy = []
    for bi_word in bigram_tf.items():
        jp_xy = bi_word[1] / bigram_len  # Calculate the joint probability p(x,y)
        cp_xy = bi_word[1] / words_tf[bi_word[0][0]]  # Calculate conditional probability p(x|y)
        entropy.append(-jp_xy * math.log(cp_xy, 2))  # Binary entropy calculation model
    print("Chinese information entropy based on word binary model is:", round(sum(entropy), 5), "Bit/Words")

    after = time.time()
    print("Running time:", round(after - before, 5), "s")

def cal_trigram(corpus,count):
    before = time.time()
    split_words = []
    words_len = 0
    line_count = 0
    words_tf = {}
    trigram_tf = {}

    for line in corpus:
        for x in jieba.cut(line):
            split_words.append(x)
            words_len += 1

        get_bigram_tf(words_tf, split_words)  # bigram counts, used below as the denominator of p(w3 | w1, w2)
        get_trigram_tf(trigram_tf, split_words)

        split_words = []
        line_count += 1

    print("Number of words in Corpus:", count)
    print("Number of participles:", words_len)
    print("Average word length:", round(count / words_len, 5))

    trigram_len = sum([dic[1] for dic in trigram_tf.items()])
    print("Length of ternary model:", trigram_len)

    entropy = []
    for tri_word in trigram_tf.items():
        jp_xy = tri_word[1] / trigram_len  # Calculate the joint probability p(x,y)
        cp_xy = tri_word[1] / words_tf[tri_word[0][0]]  # Calculate conditional probability p(x|y)
        entropy.append(-jp_xy * math.log(cp_xy, 2))  # Calculate the information entropy of ternary model
    print("Chinese information entropy based on word ternary model is:", round(sum(entropy), 5), "Bit/Words")

    after = time.time()
    print("Running time:", round(after - before , 5), "s")


if __name__ == '__main__':
    tra = TraversalFun("./datasets")
    corpus,count = tra.TraversalDir()
    cal_unigram(corpus, count)
    cal_bigram(corpus,count)
    cal_trigram(corpus,count)

Experimental code

The complete code for this experiment is listed in the appendix above.
