2021SC@SDUSC
Algorithms used for jieba word segmentation:
Efficient word-graph scanning based on a trie (prefix tree) structure generates a directed acyclic graph (DAG) of all the possible words that the Chinese characters in a sentence can form.
Dynamic programming is used to find the maximum-probability path, i.e. the segmentation with the highest total word-frequency score.
For out-of-vocabulary (unregistered) words, an HMM based on the word-forming ability of Chinese characters is used, decoded with the Viterbi algorithm. A minimal usage sketch follows below.
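Before diving into the source, here is how these pieces surface in the public API. This is only a minimal usage sketch (the example sentence is arbitrary): cut_all switches between full and accurate mode, and HMM controls whether the Viterbi step for unknown words is applied.

# -*- coding: utf-8 -*-
# python2.7
import jieba

sentence = u'我来到北京清华大学'  # arbitrary example sentence
print '/'.join(jieba.cut(sentence, cut_all=False, HMM=True))  # accurate mode, with HMM for unknown words
print '/'.join(jieba.cut(sentence, cut_all=True))             # full mode: every word found in the DAG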
The first half of jieba's main segmentation function, cut, splits the input text into blocks with regular expressions. Then, for each block, the segmentation function used by the default (accurate) mode, __cut_DAG, is called.
The __cut_DAG function calls get_DAG(sentence), which generates the directed acyclic graph (DAG) for each sentence. A corpus is needed to build the DAG, and it can be found in the same folder as the jieba package: the file dict.txt. The corpus has three columns: the first is the word, the second its frequency, and the third its part-of-speech tag. The corpus is loaded in the function initialize(DICTIONARY), which, through the decorator require_initialized, is executed only when get_DAG is first called.
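To make the three-column format concrete, entries in dict.txt look like the lines below (the frequencies and tags here are invented for illustration; the actual file ships with jieba). The gen_trie function shown further down parses exactly this format with line.split(' ').

北京 30000 ns
大学 20000 n
北京大学 2000 nt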
def require_initialized(fn):
    @wraps(fn)  # wraps preserves attributes of the wrapped function, such as __doc__
    def wrapped(*args, **kwargs):
        global initialized
        if initialized:
            return fn(*args, **kwargs)
        else:
            initialize(DICTIONARY)
            return fn(*args, **kwargs)
    return wrapped
jieba uses caching to speed up loading of the corpus: it serializes the generated data structures with marshal and stores them in the system's temporary directory (using the tempfile library). If the .cache file cannot be found, the gen_trie function is called to generate the trie.
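The caching pattern itself is straightforward. Below is a minimal sketch of it (the cache file name is hypothetical, not jieba's actual one), storing and restoring the structures returned by gen_trie with marshal and tempfile:

import marshal
import os
import tempfile

cache_file = os.path.join(tempfile.gettempdir(), 'my_dict.cache')  # hypothetical cache path

if os.path.exists(cache_file):
    with open(cache_file, 'rb') as cf:
        trie, lfreq, ltotal = marshal.load(cf)      # fast path: deserialize the cached structures
else:
    trie, lfreq, ltotal = gen_trie('dict.txt')      # slow path: parse the corpus
    with open(cache_file, 'wb') as cf:
        marshal.dump((trie, lfreq, ltotal), cf)     # serialize for the next run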
def gen_trie(f_name):
    lfreq = {}
    trie = {}
    ltotal = 0.0
    with open(f_name, 'rb') as f:
        lineno = 0
        for line in f.read().rstrip().decode('utf-8').split('\n'):
            lineno += 1
            try:
                word, freq, _ = line.split(' ')
                freq = float(freq)
                lfreq[word] = freq
                ltotal += freq
                p = trie
                for c in word:
                    if c not in p:
                        p[c] = {}
                    p = p[c]
                p[''] = ''  # ending flag
            except ValueError, e:
                logger.debug('%s at line %s %s' % (f_name, lineno, line))
                raise ValueError, e
    return trie, lfreq, ltotal
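To get a feel for the structure gen_trie returns, consider a toy dictionary containing only the two entries u'北京' and u'北京大学' (a hypothetical reduction, not the real dict.txt). The empty-string key marks every position where a dictionary word ends:

# trie built from the toy entries u'北京' and u'北京大学':
# {u'北': {u'京': {'': '', u'大': {u'学': {'': ''}}}}}
# lfreq maps each word to its frequency; ltotal is the sum of all frequencies.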
Each candidate word in the sentence's DAG already has a frequency score. Finding, among all paths through the DAG, the one that maximizes the sum of these frequency scores is a typical application of dynamic programming.
def calc(sentence, DAG, idx, route):
    N = len(sentence)
    route[N] = (0.0, '')
    for idx in xrange(N - 1, -1, -1):
        candidates = [(FREQ.get(sentence[idx:x + 1], min_freq) + route[x + 1][0], x) for x in DAG[idx]]
        route[idx] = max(candidates)
This is a bottom-up dynamic program: starting from the last character of sentence, it traverses the candidate segmentations in reverse order and scores each one. For every position it stores the best result in route as a tuple, route[idx] = (best score, end index of the winning word).
# -*- coding: utf-8 -*-
# python2.7
import marshal

def get_DAG(sentence):
    N = len(sentence)
    i, j = 0, 0
    p = trie
    DAG = {}
    while i < N:
        c = sentence[j]
        if c in p:
            p = p[c]
            if '' in p:  # a dictionary word ends at position j
                if i not in DAG:
                    DAG[i] = []
                DAG[i].append(j)
            j += 1
            if j >= N:
                i += 1
                j = i
                p = trie
        else:
            p = trie
            i += 1
            j = i
    for i in xrange(len(sentence)):
        if i not in DAG:
            DAG[i] = [i]
    return DAG

def calc(sentence, DAG, idx, route):
    N = len(sentence)
    route[N] = (0.0, '')
    for idx in xrange(N - 1, -1, -1):
        candidates = [(FREQ.get(sentence[idx:x + 1], 0.0) + route[x + 1][0], x) for x in DAG[idx]]
        route[idx] = max(candidates)

if __name__ == '__main__':
    sentence = u'On National Day I am studying jieba word segmentation'  # in practice this would be a Chinese sentence
    trie, FREQ, total, min_freq = marshal.load(open(u'D:\jieba.cache', 'rb'))  # load the important variables from the cache
    rs = get_DAG(sentence)  # build the DAG
    route = {}
    calc(sentence, rs, 0, route)  # preliminary segmentation based on the scores
    print route
Back in the __cut_DAG function: the first half uses the calc function to compute a preliminary segmentation, and the second half handles words that, as in the example above, do not appear in the corpus.
x = 0
buf = u''
N = len(sentence)
while x < N:
    y = route[x][1] + 1
    l_word = sentence[x:y]
    if y - x == 1:
        buf += l_word  # collect consecutive single characters
    else:
        if len(buf) > 0:
            if len(buf) == 1:
                yield buf
                buf = u''
            else:
                if buf not in FREQ:
                    regognized = finalseg.cut(buf)  # hand unknown fragments to the HMM segmenter
                    for t in regognized:
                        yield t
                else:
                    for elem in buf:
                        yield elem
                buf = u''
        yield l_word
    x = y
Since frequency-based segmentation tends to cut unrecognized phrases character by character, merging these characters is one way to identify unknown words and improve the segmentation result. In the code, a buf variable collects these consecutive single characters, combines them into a string, and hands it to finalseg.cut for further segmentation.
That function is encapsulated in the finalseg module, which performs the further segmentation mainly through its __cut function.
def __cut(sentence):
    global emit_P
    prob, pos_list = viterbi(sentence, ('B', 'M', 'E', 'S'), start_P, trans_P, emit_P)
    begin, next = 0, 0
    # print pos_list, sentence
    for i, char in enumerate(sentence):
        pos = pos_list[i]
        if pos == 'B':
            begin = i
        elif pos == 'E':
            yield sentence[begin:i + 1]
            next = i + 1
        elif pos == 'S':
            yield char
            next = i + 1
    if next < len(sentence):
        yield sentence[next:]
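For intuition about how the tag sequence drives the generator, suppose (purely hypothetically) that viterbi labels the characters of u'我爱北京' as ['S', 'S', 'B', 'E']; __cut would then yield three words:

# pos_list = ['S', 'S', 'B', 'E'] for sentence = u'我爱北京' (hypothetical labelling)
# 'S' at i=0 -> yield u'我'
# 'S' at i=1 -> yield u'爱'
# 'B' at i=2 -> remember begin=2; 'E' at i=3 -> yield sentence[2:4] == u'北京'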
To understand this code you need some mathematical background, since it uses a hidden Markov model (HMM) and the Viterbi algorithm.
An HMM involves two kinds of states: the hidden states that drive the process, and the observed outputs. In jieba there are four hidden states, B, M, E and S, corresponding to the position of a Chinese character within a word: B (beginning), M (middle), E (end), S (single-character word); the observed outputs are the Chinese characters themselves. A small labelling example follows below.
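A tiny helper (not part of jieba, just to make the labelling concrete) shows how a known segmentation maps onto BMES tags:

def word_to_tags(word):
    # one word -> its BMES tag sequence
    if len(word) == 1:
        return ['S']
    return ['B'] + ['M'] * (len(word) - 2) + ['E']

# e.g. the segmentation [u'北京', u'天安门'] is labelled ['B', 'E'] and ['B', 'M', 'E']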
An HMM is also parameterized by three probability tables: the initial state probabilities, the state transition probabilities, and the emission probabilities (an emission probability is a conditional probability giving the likelihood of a particular output in a given state).
Given the sentence, a sequence of observed characters, we want to find the most likely BMES sequence for those characters in order to segment them. This is where the Viterbi algorithm comes in.
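Concretely, the start_P, trans_P and emit_P objects that __cut passes to viterbi are plain dicts keyed by the four states and holding log probabilities (that much follows from how the code below indexes them); the numeric values here are placeholders, not the trained tables shipped in jieba's finalseg:

start_P = {'B': -0.26, 'M': -3.14e+100, 'E': -3.14e+100, 'S': -1.46}  # log P(first hidden state); impossible states get a huge negative value
trans_P = {'B': {'E': -0.51, 'M': -0.92},
           'E': {'B': -0.59, 'S': -0.81},
           'M': {'E': -0.33, 'M': -1.26},
           'S': {'B': -0.72, 'S': -0.67}}                             # log P(next state | current state)
emit_P = {'B': {u'北': -6.2, u'天': -6.5},
          'M': {u'京': -5.9},
          'E': {u'门': -6.0},
          'S': {u'我': -4.8}}                                          # log P(character | state), heavily truncated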
# PrevStatus records which hidden states may precede each state;
# for example, a B (beginning) can only follow an E (end) or an S (single-character word).
PrevStatus = {
    'B': ('E', 'S'),
    'M': ('M', 'B'),
    'S': ('S', 'E'),
    'E': ('B', 'M')
}

def viterbi(obs, states, start_p, trans_p, emit_p):
    V = [{}]  # state probability matrix
    path = {}
    for y in states:
        # initial state probability; the probabilities are logarithms, so multiplication becomes addition
        V[0][y] = start_p[y] + emit_p[y].get(obs[0], MIN_FLOAT)
        path[y] = [y]
    for t in range(1, len(obs)):
        V.append({})
        newpath = {}
        for y in states:
            em_p = emit_p[y].get(obs[t], MIN_FLOAT)
            (prob, state) = max([(V[t - 1][y0] + trans_p[y0].get(y, MIN_FLOAT) + em_p, y0)
                                 for y0 in PrevStatus[y]])
            V[t][y] = prob
            newpath[y] = path[state] + [y]  # pruning: keep only the highest-probability path ending in state y
        path = newpath
    # the last character can only be in state E (end of a word) or S (single-character word)
    (prob, state) = max([(V[len(obs) - 1][y], y) for y in ('E', 'S')])
    return (prob, path[state])
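A toy call (using invented tables of the same shape as sketched above; MIN_FLOAT stands in for the log of a near-zero probability, as it does in finalseg) shows what viterbi returns:

MIN_FLOAT = -3.14e100

obs = u'你好'  # two observed characters
toy_start = {'B': -0.7, 'M': MIN_FLOAT, 'E': MIN_FLOAT, 'S': -0.7}
toy_trans = {'B': {'E': -0.5, 'M': -0.9}, 'M': {'E': -0.5, 'M': -0.9},
             'E': {'B': -0.6, 'S': -0.8}, 'S': {'B': -0.6, 'S': -0.8}}
toy_emit = dict((s, {u'你': -1.0, u'好': -1.0}) for s in ('B', 'M', 'E', 'S'))

print viterbi(obs, ('B', 'M', 'E', 'S'), toy_start, toy_trans, toy_emit)
# -> (-3.2, ['B', 'E']): the two characters are most likely one two-character word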