Summary of a Study of the jieba Word Segmentation Code

2021SC@SDUSC
The algorithms used by jieba for word segmentation:
Efficient word-graph scanning based on a trie (prefix tree) generates a directed acyclic graph (DAG) of all possible word formations of the Chinese characters in a sentence.
Dynamic programming is used to find the maximum-probability path, i.e. the segmentation with the largest combined word frequency.
For unregistered (out-of-vocabulary) words, an HMM based on the word-forming ability of Chinese characters is used, decoded with the Viterbi algorithm.
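
Before diving into the source, here is a minimal usage sketch (not part of the code studied below, and assuming jieba is installed) showing how these three algorithms surface in the public API:

# -*- coding: utf-8 -*-
import jieba

text = u'我来到北京清华大学'  # "I came to Tsinghua University in Beijing"

# Exact mode: DAG + dynamic programming, with the HMM step enabled for unregistered words
print('/'.join(jieba.cut(text)))

# Exact mode with the HMM (Viterbi) step disabled
print('/'.join(jieba.cut(text, HMM=False)))

# Full mode: emit every word the dictionary can form
print('/'.join(jieba.cut(text, cut_all=True)))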

The first half of jieba's main segmentation function, cut, splits the input text into blocks using regular expressions. Then, for each block, the block segmentation function used by default (exact mode), __cut_DAG, is called.
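
The exact regular expressions vary between jieba versions; the following is only a simplified sketch of the blocking idea, with an assumed Han-character pattern, not jieba's literal code:

# -*- coding: utf-8 -*-
import re

re_han = re.compile(u'([\u4e00-\u9fa5]+)', re.U)   # runs of Chinese characters (assumed pattern)

def cut_sketch(sentence, cut_block):
    # Split the text into blocks; Han blocks go to cut_block (e.g. __cut_DAG in
    # exact mode), everything else (punctuation, spaces, Latin text) is passed through.
    for blk in re_han.split(sentence):
        if not blk:
            continue
        if re_han.match(blk):
            for word in cut_block(blk):
                yield word
        else:
            yield blk
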
The __cut_DAG function calls get_DAG(sentence), which generates the directed acyclic graph (DAG) for a sentence. A dictionary is needed to build the DAG, and it can be found in the jieba package folder as the file dict.txt. The dictionary has three columns: the first is the word, the second is its frequency, and the third is its part-of-speech tag. The dictionary is loaded by the function initialize(DICTIONARY), which, through the decorator require_initialized, is executed only when get_DAG is first called.

def require_initialized(fn):
 
    @wraps(fn)  # wraps preserves attributes of the wrapped function, such as __doc__ and __name__
    def wrapped(*args, **kwargs):
        global initialized
        if initialized:
            return fn(*args, **kwargs)
        else:
            initialize(DICTIONARY)
            return fn(*args, **kwargs)
 
    return wrapped
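
For reference, lines in dict.txt follow the three-column layout described above; the frequencies below are made up for illustration only:

北京大学 2000 nt
学习 1500 v
分词 300 n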

jieba uses caching to speed up loading of the dictionary. It serializes the generated data structures with marshal and stores them in the system's temporary directory (using the tempfile library). If the cached file cannot be found, the gen_trie function is called to build the trie (a minimal caching sketch follows the gen_trie listing below).

def gen_trie(f_name):
    lfreq = {}      # word -> frequency
    trie = {}       # nested dicts used as a trie
    ltotal = 0.0    # total frequency over all words
    with open(f_name, 'rb') as f:
        lineno = 0
        for line in f.read().rstrip().decode('utf-8').split('\n'):
            lineno += 1
            try:
                word, freq, _ = line.split(' ')
                freq = float(freq)
                lfreq[word] = freq
                ltotal += freq
                p = trie
                for c in word:        # walk/extend the trie one character at a time
                    if c not in p:
                        p[c] = {}
                    p = p[c]
                p[''] = ''            # end-of-word flag
            except ValueError as e:
                # logger is defined at module level in jieba/__init__.py
                logger.debug('%s at line %s %s' % (f_name, lineno, line))
                raise ValueError(e)
    return trie, lfreq, ltotal
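
The caching step itself can be sketched roughly as follows (assumed cache file name and layout; the real jieba code adds more bookkeeping before trusting the cache):

# -*- coding: utf-8 -*-
import marshal
import os
import tempfile

cache_file = os.path.join(tempfile.gettempdir(), 'jieba.cache')

def load_or_build(dict_path):
    # Reuse the marshalled cache if it exists, otherwise parse dict.txt
    # with gen_trie (defined above) and write the cache for next time.
    if os.path.isfile(cache_file):
        with open(cache_file, 'rb') as cf:
            return marshal.load(cf)              # (trie, lfreq, ltotal)
    trie, lfreq, ltotal = gen_trie(dict_path)
    with open(cache_file, 'wb') as cf:
        marshal.dump((trie, lfreq, ltotal), cf)
    return trie, lfreq, ltotal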

Each word in the sentence's DAG now has a frequency score. Finding the path, among all paths through the DAG, that maximizes the sum of these scores is a classic application of dynamic programming.

def calc(sentence, DAG, idx, route):
    N = len(sentence)
    route[N] = (0.0, '')                 # sentinel entry past the end of the sentence
    for idx in xrange(N - 1, -1, -1):    # work backwards from the last character
        # score of each candidate word sentence[idx:x+1] plus the best score of
        # the remaining suffix; keep the best (score, end index) pair
        candidates = [(FREQ.get(sentence[idx:x + 1], min_freq) + route[x + 1][0], x) for x in DAG[idx]]
        route[idx] = max(candidates)

This is bottom-up dynamic programming: it traverses the candidate words in reverse order, starting from the last character of the sentence, and stores the best result for each position in route as a (score, end index) tuple.
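
As a self-contained illustration (the frequencies and the hand-built DAG below are made up; the real FREQ comes from dict.txt), running calc on a four-character sentence gives:

# -*- coding: utf-8 -*-
# Toy run of the calc function above with made-up frequencies.
FREQ = {u'北京': 5.0, u'大学': 4.0, u'北京大学': 10.0,
        u'北': 1.0, u'京': 1.0, u'大': 1.0, u'学': 1.0}
min_freq = 0.5                     # fallback score for unseen substrings

sentence = u'北京大学'
# DAG[i] lists every end index j such that sentence[i:j+1] is in the dictionary
DAG = {0: [0, 1, 3], 1: [1], 2: [2, 3], 3: [3]}

route = {}
calc(sentence, DAG, 0, route)
print(route)
# route[0] holds the best (score, end index) choice at position 0; with these
# numbers it is (10.0, 3), i.e. keep 北京大学 as a single word.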

# -*- coding: utf-8 -*-
# python2.7
import marshal
 
def get_DAG(sentence):
    N = len(sentence)
    i, j = 0, 0
    p = trie
    DAG = {}
    while i < N:
        c = sentence[j]
        if c in p:                  # sentence[i:j+1] is still a prefix in the trie
            p = p[c]
            if '' in p:             # sentence[i:j+1] is a complete word
                if i not in DAG:
                    DAG[i] = []
                DAG[i].append(j)
            j += 1
            if j >= N:              # reached the end: restart from the next start position
                i += 1
                j = i
                p = trie
        else:                       # no longer a prefix: restart from the next start position
            p = trie
            i += 1
            j = i
    for i in xrange(len(sentence)):
        if i not in DAG:            # every character can at least stand alone
            DAG[i] = [i]
    return DAG
 
def calc(sentence,DAG,idx,route):
    N = len(sentence)
    route[N] = (0.0,'')
    for idx in xrange(N-1,-1,-1):
        candidates = [ ( FREQ.get(sentence[idx:x+1],0.0) + route[x+1][0],x ) for x in DAG[idx] ]
        route[idx] = max(candidates)
 
if __name__ == '__main__':
    # The test sentence must be Chinese text; it reads roughly
    # "On National Day I am studying jieba word segmentation".
    sentence = u'国庆节我在学习结巴分词'
    # Load the important variables from the cache built earlier
    trie, FREQ, total, min_freq = marshal.load(open(r'D:\jieba.cache', 'rb'))
    rs = get_DAG(sentence)            # get the DAG
    route = {}
    calc(sentence, rs, 0, route)      # preliminary segmentation based on the scores
    print route

Back to the __cut_DAG function: its first half uses the calc function to compute a preliminary segmentation, and its second half handles words that, as in the example above, do not appear in the dictionary.

    x = 0
    buf = u''
    N = len(sentence)
    while x < N:
        y = route[x][1] + 1          # end index of the best word starting at x
        l_word = sentence[x:y]
        if y - x == 1:
            buf += l_word            # collect runs of consecutive single characters
        else:
            if len(buf) > 0:
                if len(buf) == 1:
                    yield buf
                    buf = u''
                else:
                    if buf not in FREQ:
                        # unknown phrase: let the HMM-based finalseg re-segment it
                        regognized = finalseg.cut(buf)
                        for t in regognized:
                            yield t
                    else:
                        for elem in buf:
                            yield elem
                    buf = u''
            yield l_word
        x = y

Since frequency-based segmentation tends to cut phrases it does not recognize into single characters, merging these characters is one way to identify unknown words and improve the segmentation result. In the code, the buf variable collects these consecutive single characters, combines them into a string, and hands the string to finalseg.cut for further segmentation.
finalseg.cut is encapsulated in the finalseg module and performs this further segmentation mainly through the __cut function.

def __cut(sentence):
    global emit_P
    prob, pos_list =  viterbi(sentence,('B','M','E','S'), start_P, trans_P, emit_P)
    begin, next = 0,0
    #print pos_list, sentence
    for i,char in enumerate(sentence):
        pos = pos_list[i]
        if pos=='B':
            begin = i
        elif pos=='E':
            yield sentence[begin:i+1]
            next = i+1
        elif pos=='S':
            yield char
            next = i+1
    if next<len(sentence):
        yield sentence[next:]

To understand this code you need a little mathematical background, since it uses the hidden Markov model (HMM) and the Viterbi algorithm.
An HMM involves two kinds of variables: the hidden states (states for short) and the observable outputs (outputs for short). In jieba segmentation there are four hidden states, B, M, E, S, corresponding to the position of a Chinese character within a word: B (beginning), M (middle), E (end), S (single-character word); the outputs are the Chinese characters themselves.
An HMM is parameterized by three probability tables: the initial state distribution, the state transition probabilities, and the emission probabilities (an emission probability is a conditional probability: the probability of producing a given output while in a given state).
Given a sentence, i.e. a string of outputs, we want the most likely B/M/E/S sequence for its characters, from which the segmentation follows. This is exactly what the Viterbi algorithm computes.

# Allowed predecessor states: e.g. a B state can only be preceded by E or S
PrevStatus = {
    'B':('E','S'),
    'M':('M','B'),
    'S':('S','E'),
    'E':('B','M')
}
 
def viterbi(obs, states, start_p, trans_p, emit_p):
    V = [{}] #State probability matrix
    path = {}
    for y in states: 
        V[0][y] = start_p[y] + emit_p[y].get(obs[0],MIN_FLOAT)  # initial state probability; the probabilities are stored as logarithms, so multiplication becomes addition
        path[y] = [y]
    for t in range(1,len(obs)):
        V.append({})
        newpath = {}
        for y in states:
            em_p = emit_p[y].get(obs[t],MIN_FLOAT)
            (prob,state ) = max([(V[t-1][y0] + trans_p[y0].get(y,MIN_FLOAT) + em_p ,y0) for y0 in PrevStatus[y] ])
            V[t][y] =prob
            newpath[y] = path[state] + [y]  # keep only the most probable path into each state (pruning)
        path = newpath
    (prob, state) = max([(V[len(obs) - 1][y], y) for y in ('E','S')])  # the last character can only be E (end) or S (single-character word); pick the more probable
    return (prob, path[state])
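
To make the interface concrete, here is a toy call to the viterbi function above. The probability tables are made-up log probabilities for illustration only; the real values live in jieba's finalseg submodule (prob_start.py, prob_trans.py, prob_emit.py) and were trained on a corpus.

# -*- coding: utf-8 -*-
MIN_FLOAT = -3.14e100   # stands in for log(0), as in finalseg

start_P = {'B': -0.7, 'M': MIN_FLOAT, 'E': MIN_FLOAT, 'S': -0.7}
trans_P = {
    'B': {'M': -0.9, 'E': -0.5},
    'M': {'M': -1.2, 'E': -0.4},
    'E': {'B': -0.6, 'S': -0.8},
    'S': {'B': -0.7, 'S': -0.7},
}
# Pretend every state emits every character with the same log probability,
# so the decoding is driven by start_P and trans_P alone.
chars = u'我爱北京'
emit_P = dict((s, dict((c, -1.0) for c in chars)) for s in ('B', 'M', 'E', 'S'))

prob, pos_list = viterbi(chars, ('B', 'M', 'E', 'S'), start_P, trans_P, emit_P)
print(pos_list)   # with these toy numbers: ['B', 'E', 'B', 'E'], i.e. 我爱 / 北京

The decoder tags each character with B/M/E/S, and __cut then turns the tag sequence back into words, which completes jieba's handling of unregistered words.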
