Programming language: the Python jieba word segmentation library
jieba is an excellent third-party library for Chinese word segmentation. Chinese text has no delimiters between words, so the individual words must be obtained through word segmentation.
jieba library installation
Open a cmd window as administrator and enter the command: pip install jieba
Introduction to jieba library functions
features
- Supports three word segmentation modes (a short comparison sketch follows this list)
  - Precise mode: cuts the sentence into the most accurate segmentation possible; suitable for text analysis
  - Full mode: scans out every fragment of the sentence that can form a word; very fast, but it cannot resolve ambiguity
  - Search engine mode: based on precise mode, long words are segmented again to improve recall; suitable for search engine word segmentation
- Supports segmentation of Traditional Chinese text
- Supports custom dictionaries
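To make the difference between the three modes concrete, here is a minimal sketch using the example sentence from jieba's own documentation; the exact output may vary with the jieba version and dictionary.

import jieba

sentence = "我来到北京清华大学"  # "I came to Tsinghua University in Beijing"

# Precise mode (the default): the most accurate segmentation
print("/".join(jieba.cut(sentence)))                # 我/来到/北京/清华大学

# Full mode: every fragment that can form a word; the ambiguity between
# 清华大学, 清华, 华大 and 大学 is not resolved
print("/".join(jieba.cut(sentence, cut_all=True)))  # 我/来到/北京/清华/清华大学/华大/大学

# Search engine mode: long words such as 清华大学 are cut again to improve recall
print("/".join(jieba.cut_for_search(sentence)))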
Word segmentation functions
- The jieba.cut and jieba.lcut methods accept two parameters (see the sketch after this list for the difference between them)
  - The first parameter is the string to be segmented
  - The cut_all parameter controls whether full mode is used
  - jieba.lcut returns the result as a list instead of a generator
- The jieba.cut_for_search and jieba.lcut_for_search methods accept one parameter
  - The string to be segmented
  - This mode is suitable for the fine-grained word segmentation a search engine needs when building an inverted index
  - jieba.lcut_for_search returns a list
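The main practical difference is the return type: jieba.cut and jieba.cut_for_search return a generator, while jieba.lcut and jieba.lcut_for_search return a list. A minimal sketch (the sample sentence is only an example):

import jieba

s = "中华人民共和国是一个伟大的国家"  # "The People's Republic of China is a great country"

gen = jieba.cut(s)                   # generator, words are produced lazily
lst = jieba.lcut(s)                  # the same segmentation, returned as a list
full = jieba.lcut(s, cut_all=True)   # cut_all=True switches to full mode
search = jieba.lcut_for_search(s)    # search engine mode takes only the string

print(list(gen))
print(lst)
print(full)
print(search)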
Adding a custom dictionary
Developers can specify their own custom dictionary to include words that are not in jieba's built-in dictionary. Although jieba can recognize new words on its own, adding them explicitly ensures higher accuracy.
- Use a custom dictionary file (a sketch of the file format follows this list)
  - jieba.load_userdict(file_name) # file_name is the path to the custom dictionary
- Dynamically modify the dictionary from within a program
  - jieba.add_word(new_word) # new_word is the word you want to add
  - jieba.del_word(word) # delete a word
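In the file-based approach, each line of the dictionary file holds one word, optionally followed by a frequency and a part-of-speech tag, separated by spaces (the file should be UTF-8 encoded). A minimal sketch, where userdict.txt is a hypothetical path:

# userdict.txt might contain, one entry per line:
#   云计算 5
#   创新办 3 i
#   凱特琳 nz

import jieba

jieba.load_userdict("userdict.txt")  # load the custom dictionary file (hypothetical path)
jieba.add_word("石墨烯")              # add a single word at runtime
jieba.del_word("自定义词")            # remove a word at runtime

print(jieba.lcut("李小福是创新办主任也是云计算方面的专家"))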
Keyword extraction
- jieba.analyse.extract_tags(sentence, topK) # jieba.analyse must be imported first (see the sketch below)
  - sentence is the text from which keywords are extracted
  - topK is the number of keywords with the highest TF-IDF weight to return; the default is 20
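A minimal sketch of TF-IDF keyword extraction; the sample sentence is arbitrary:

import jieba.analyse

text = "中华人民共和国是一个伟大的国家，中国的科技发展非常迅速"

# the 5 keywords with the highest TF-IDF weight
print(jieba.analyse.extract_tags(text, topK=5))

# withWeight=True also returns each keyword's weight
for word, weight in jieba.analyse.extract_tags(text, topK=5, withWeight=True):
    print(word, weight)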
Part of speech tagging
- jieba.posseg.POSTokenizer(tokenizer=None) creates a new custom tokenizer; the tokenizer parameter specifies the jieba.Tokenizer instance used internally (see the sketch after this list)
- jieba.posseg.dt is the default part-of-speech tagging tokenizer
- It tags the part of speech of each word after segmentation, using a tag set compatible with ICTCLAS
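Case 5 below shows the default tagger used through jieba.posseg.cut; the following is a minimal sketch of building a separate tagger with POSTokenizer, assuming the default dictionary:

import jieba
import jieba.posseg as pseg

# a POS tagger wrapping its own jieba.Tokenizer instance
# (that tokenizer could load its own dictionary via load_userdict)
custom_tagger = pseg.POSTokenizer(jieba.Tokenizer())
for pair in custom_tagger.cut("我爱北京天安门"):  # "I love Tiananmen, Beijing"
    print(pair.word, pair.flag)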
Examples
1. Precise mode

import jieba

list1 = jieba.lcut("中华人民共和国是一个伟大的国家")  # "The People's Republic of China is a great country"
print(list1)
print("Precise mode: " + "/".join(list1))
2. Full mode

import jieba

list2 = jieba.lcut("中华人民共和国是一个伟大的国家", cut_all=True)
print(list2)
print("Full mode: " + "/".join(list2))
3. Search engine mode

import jieba

list3 = jieba.lcut_for_search("中华人民共和国是一个伟大的国家")
print(list3)
print("Search engine mode: " + " ".join(list3))
4. Modifying the dictionary

import jieba

text = "中信建投投资了一款游戏，中信也投资了一个游戏公司"  # "CSC invested in a game, and CITIC also invested in a game company"
word = jieba.lcut(text)
print(word)

# Add words
jieba.add_word("中信建投")
jieba.add_word("投资公司")
word1 = jieba.lcut(text)
print(word1)

# Delete a word
jieba.del_word("中信建投")
word2 = jieba.lcut(text)
print(word2)
5. Part-of-speech tagging

import jieba.posseg as pseg

words = pseg.cut("我爱北京天安门")  # "I love Tiananmen, Beijing"
for i in words:
    print(i.word, i.flag)
6. Counting the number of appearances of characters in Romance of the Three Kingdoms
Download the text of Romance of the Three Kingdoms first.
import jieba

txt = open("File path", "r", encoding='utf-8').read()  # Open and read the file
words = jieba.lcut(txt)  # Segment the text using precise mode
counts = {}  # Store words and their occurrence counts as key-value pairs
for word in words:
    if len(word) == 1:  # Single characters are not counted
        continue
    else:
        counts[word] = counts.get(word, 0) + 1  # Add 1 to the word's count every time it appears
items = list(counts.items())  # Convert the key-value pairs to a list
items.sort(key=lambda x: x[1], reverse=True)  # Sort words by occurrence count, from largest to smallest
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
import jieba

excludes = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此", "如何"}  # frequent words that are not character names ("general", "but say", "Jingzhou", ...)
txt = open("Romance of the Three Kingdoms.txt", "r", encoding='utf-8').read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    # merge different ways of referring to the same character
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"  # Kong Ming (Zhuge Liang)
    elif word == "关公" or word == "云长":
        rword = "关羽"  # Guan Yu
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"  # Liu Bei (Xuande)
    elif word == "孟德" or word == "丞相":
        rword = "曹操"  # Cao Cao (Mengde)
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for i in excludes:
    del counts[i]
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
Article source: https://www.cnblogs.com/L-hua/p/15584823.html