Word segmentation with jieba, the forward maximum matching algorithm, and the backward maximum matching algorithm


Today we start on word segmentation. I collected 79 lines of dialogue from some of Wong Kar-wai's better-known films: Ashes of Time, Happy Together, and Chungking Express. The dataset looks like this.

Forward maximum matching algorithm

The idea of this algorithm is very simple: starting from the beginning of the sentence, find the longest dictionary word that matches at the current position, take it, advance past it, and repeat until the whole sentence has been matched. It boils down to two steps (a small worked example follows the list):

  1. Build a dictionary
  2. Scan the text, greedily matching the longest word at each position
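Here is a minimal worked sketch of step 2, using a toy dictionary of my own (not the movie-line vocabulary built below):

##Worked example of forward maximum matching on a toy dictionary
toy_dict = ["研究", "研究生", "生命", "命", "的", "起源"]
text = "研究生命的起源"
# Scan from the left, always trying the longest candidate first (the longest dictionary word has 3 characters):
#   position 0: "研究生" is in the dictionary  -> take it, advance 3 characters
#   position 3: "命的起"? no; "命的"? no; "命"  -> take it, advance 1 character
#   position 4: "的起源"? no; "的起"? no; "的"  -> take it, advance 1 character
#   position 5: "起源" is in the dictionary    -> take it, done
# Forward result: 研究生 / 命 / 的 / 起源
# (A backward scan of the same text gives 研究 / 生命 / 的 / 起源 instead -- see the last section.)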

First of all, building a dictionary by hand from this many lines would be far too tedious, so I cheat a little and use jieba to build it first.
The code is as follows:

import jieba
import re
import jieba.posseg as pseg

filepath1 = 'Wong Kar Wai movie lines.txt'
word_dict = []    # the vocabulary built from jieba's segmentation

##Create dictionary
pattern = re.compile("[^\u4e00-\u9fa5a-zA-Z0-9]")    # keep only Chinese characters, letters and digits
with open(filepath1, 'r', encoding='utf-8') as sourceFile:
    lines = sourceFile.readlines()
    for line in lines:
        line1 = line.replace(' ', '')           # remove spaces from the text
        line2 = re.sub(pattern, '', line1)      # drop punctuation and other symbols
        seg = jieba.lcut(line2, cut_all=False)  # precise-mode segmentation
        for word in seg:
            if word not in word_dict:           # de-duplicate while keeping order
                word_dict.append(word)
    print(len(word_dict))
    print(word_dict)
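One optional tweak, not part of the original write-up: a Python list does a linear scan for every "in" test, so with 619 entries and many lookups per sentence a set can be faster. The matching functions below only need membership tests and the maximum word length, so they work unchanged with a set:

##Optional: keep the vocabulary in a set for O(1) average-case membership tests
##(my own tweak; the rest of the post keeps using the list and works either way)
word_set = set(word_dict)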

Take a look at this dictionary:
A total of 619 words
Now we can segment, reading the file line by line:

##Forward maximum matching algorithm
def forward_Match(text, Dict):
    word_list = []
    pi = 0                                  # current position in the text
    m = max(len(word) for word in Dict)     # length of the longest dictionary word
    while pi != len(text):
        n = len(text) - pi                  # characters left from pi to the end
        window = min(m, n)                  # never look past the end of the text
        for index in range(window, 0, -1):  # try the longest candidate first
            if text[pi:pi+index] in Dict:
                word_list.append(text[pi:pi+index])
                pi = pi + index             # advance past the matched word
                break
        else:                               # nothing matched: emit a single character
            word_list.append(text[pi])      # so the loop always makes progress
            pi = pi + 1
    print('/'.join(word_list), len(word_list))
##Segment line by line and print each line's result together with its word count
with open(filepath1, 'r', encoding='utf-8') as sourceFile:
    lines = sourceFile.readlines()
    for line in lines:
        line1 = line.replace(' ', '')       # remove spaces from the text
        line2 = re.sub(pattern, '', line1)  # keep only Chinese, letters and digits
        forward_Match(line2, word_dict)

Let's look at the word segmentation results

jieba algorithm

Segmenting with jieba is very simple: just call it directly.

##Segment line by line with jieba and print each line's result and word count
with open(filepath1, 'r', encoding='utf-8') as sourceFile:
    lines = sourceFile.readlines()
    for line in lines:
        line1 = line.replace(' ', '')           # remove spaces from the text
        line2 = re.sub(pattern, '', line1)      # keep only Chinese, letters and digits
        seg = jieba.lcut(line2, cut_all=False)  # precise-mode segmentation
        print("/".join(seg), len(seg))

Let's see the results
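As an aside, jieba.posseg was imported at the top but never used; it tags each word with its part of speech. A tiny optional sketch, on an arbitrary sentence of my own rather than the dataset:

##Optional aside: part-of-speech tagging with the pseg module imported earlier
for pair in pseg.lcut("我喜欢王家卫的电影"):  # arbitrary example sentence (my own, not from the dataset)
    print(pair.word, pair.flag)               # each item carries the word and its POS tag (e.g. v, n)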

Backward maximum matching algorithm

This is the mirror image of the forward version: scan from the end of the sentence, match the longest dictionary word that ends at the current position, and move the pointer leftwards. Because the words are collected back to front, the list is reversed before printing.

##Reverse maximum matching
def back_Match(text, Dict):
    word_list = []
    pi = len(text) - 1                           # current position, starting at the last character
    m = max(len(word) for word in Dict)          # length of the longest dictionary word
    while pi >= 0:
        n = pi + 1                               # characters available up to and including pi
        window = min(m, n)                       # never look past the start of the text
        for index in range(window, 0, -1):       # try the longest candidate ending at pi first
            if text[pi-index+1:pi+1] in Dict:
                word_list.append(text[pi-index+1:pi+1])
                pi = pi - index                  # move the pointer left past the matched word
                break
        else:                                    # nothing matched: emit a single character
            word_list.append(text[pi])           # so the loop always makes progress
            pi = pi - 1

    print('/'.join(word_list[::-1]))             # reverse: words were collected back to front
with open(filepath1, 'r', encoding='utf-8') as sourceFile:
    lines = sourceFile.readlines()
    for line in lines:
        line1 = line.replace(' ', '')       # remove spaces from the text
        line2 = re.sub(pattern, '', line1)  # keep only Chinese, letters and digits
        back_Match(line2, word_dict)
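To see why the two directions can disagree, here is a quick check on a classic ambiguous phrase with a tiny hand-made dictionary (my own illustration, not drawn from the movie-line vocabulary):

##Toy comparison of the two directions (illustrative example, not from the dataset)
toy_dict = ["结婚", "的", "和", "和尚", "尚未", "未"]
forward_Match("结婚的和尚未结婚的", toy_dict)  # 结婚/的/和尚/未/结婚/的 -- greedy from the left picks "和尚"
back_Match("结婚的和尚未结婚的", toy_dict)     # 结婚/的/和/尚未/结婚/的 -- scanning from the right picks "尚未"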

The complete project and dataset have been uploaded to the resources. One problem, though: although I set the download to 0 points, CSDN seems to adjust this internally and it now shows as a 2-point download. This time I set it to a 1.9 yuan download because I don't really understand how the points system works. If you don't have points or aren't in a hurry, just message me privately or leave a comment and I'll send it to you directly.

Keywords: Python NLP jieba
