Programming language: the Python jieba word segmentation library
jieba is an excellent third-party library for Chinese word segmentation. Chinese text has no delimiters between words, so the individual words must be obtained through word segmentation.
jieba library installation
Open a cmd window as administrator and enter the command: pip install jieba
Introduction to jieba library functions
features
- Supports three word segmentation modes (a short comparison sketch follows this list)
  - Precise mode: cuts the sentence into the most accurate segmentation possible; suitable for text analysis
  - Full mode: scans out every fragment of the sentence that can form a word; very fast, but it cannot resolve ambiguity
  - Search engine mode: based on precise mode, long words are segmented again to improve recall; suitable for search engine word segmentation
- Supports segmentation of Traditional Chinese text
- Supports custom dictionaries
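To make the difference between the three modes concrete, here is a minimal sketch using the example sentence from jieba's own documentation; the exact output may vary with the jieba version and dictionary.

import jieba

sentence = "我来到北京清华大学"  # "I came to Tsinghua University in Beijing"

# Precise mode (the default): the most accurate segmentation
print("/".join(jieba.cut(sentence)))                # 我/来到/北京/清华大学

# Full mode: every fragment that can form a word; the ambiguity between
# 清华大学, 清华, 华大 and 大学 is not resolved
print("/".join(jieba.cut(sentence, cut_all=True)))  # 我/来到/北京/清华/清华大学/华大/大学

# Search engine mode: long words such as 清华大学 are cut again to improve recall
print("/".join(jieba.cut_for_search(sentence)))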
Word segmentation functions
- The jieba.cut and jieba.lcut methods accept two parameters (see the sketch after this list for the difference between them)
  - The first parameter is the string to be segmented
  - The cut_all parameter controls whether full mode is used
  - jieba.lcut returns the result as a list instead of a generator
- The jieba.cut_for_search and jieba.lcut_for_search methods accept one parameter
  - The string to be segmented
  - This mode is suitable for the fine-grained word segmentation a search engine needs when building an inverted index
  - jieba.lcut_for_search returns a list
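The main practical difference is the return type: jieba.cut and jieba.cut_for_search return a generator, while jieba.lcut and jieba.lcut_for_search return a list. A minimal sketch (the sample sentence is only an example):

import jieba

s = "中华人民共和国是一个伟大的国家"  # "The People's Republic of China is a great country"

gen = jieba.cut(s)                   # generator, words are produced lazily
lst = jieba.lcut(s)                  # the same segmentation, returned as a list
full = jieba.lcut(s, cut_all=True)   # cut_all=True switches to full mode
search = jieba.lcut_for_search(s)    # search engine mode takes only the string

print(list(gen))
print(lst)
print(full)
print(search)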
Adding a custom dictionary
Developers can specify their own custom dictionary to include words that are not in jieba's built-in dictionary. Although jieba can recognize new words on its own, adding them explicitly ensures higher accuracy.
- Use a custom dictionary file (a sketch of the file format follows this list)
  - jieba.load_userdict(file_name) # file_name is the path to the custom dictionary
- Dynamically modify the dictionary from within a program
  - jieba.add_word(new_word) # new_word is the word you want to add
  - jieba.del_word(word) # delete a word
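In the file-based approach, each line of the dictionary file holds one word, optionally followed by a frequency and a part-of-speech tag, separated by spaces (the file should be UTF-8 encoded). A minimal sketch, where userdict.txt is a hypothetical path:

# userdict.txt might contain, one entry per line:
#   云计算 5
#   创新办 3 i
#   凱特琳 nz

import jieba

jieba.load_userdict("userdict.txt")  # load the custom dictionary file (hypothetical path)
jieba.add_word("石墨烯")              # add a single word at runtime
jieba.del_word("自定义词")            # remove a word at runtime

print(jieba.lcut("李小福是创新办主任也是云计算方面的专家"))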
Keyword extraction
- jieba.analyse.extract_tags(sentence, topK) # jieba.analyse must be imported first (see the sketch below)
  - sentence is the text from which keywords are extracted
  - topK is the number of keywords with the highest TF-IDF weight to return; the default is 20
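A minimal sketch of TF-IDF keyword extraction; the sample sentence is arbitrary:

import jieba.analyse

text = "中华人民共和国是一个伟大的国家，中国的科技发展非常迅速"

# the 5 keywords with the highest TF-IDF weight
print(jieba.analyse.extract_tags(text, topK=5))

# withWeight=True also returns each keyword's weight
for word, weight in jieba.analyse.extract_tags(text, topK=5, withWeight=True):
    print(word, weight)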
Part of speech tagging
- jieba.posseg.POSTokenizer(tokenizer=None) creates a new custom tokenizer; the tokenizer parameter specifies the jieba.Tokenizer instance used internally (see the sketch after this list)
- jieba.posseg.dt is the default part-of-speech tagging tokenizer
- It tags the part of speech of each word after segmentation, using a tag set compatible with ICTCLAS
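Case 5 below shows the default tagger used through jieba.posseg.cut; the following is a minimal sketch of building a separate tagger with POSTokenizer, assuming the default dictionary:

import jieba
import jieba.posseg as pseg

# a POS tagger wrapping its own jieba.Tokenizer instance
# (that tokenizer could load its own dictionary via load_userdict)
custom_tagger = pseg.POSTokenizer(jieba.Tokenizer())
for pair in custom_tagger.cut("我爱北京天安门"):  # "I love Tiananmen, Beijing"
    print(pair.word, pair.flag)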
Examples
1. Precise mode

import jieba

list1 = jieba.lcut("中华人民共和国是一个伟大的国家")  # "The People's Republic of China is a great country"
print(list1)
print("Precise mode: " + "/".join(list1))
2. Full mode

import jieba

list2 = jieba.lcut("中华人民共和国是一个伟大的国家", cut_all=True)
print(list2)
print("Full mode: " + "/".join(list2))
3. Search engine mode

import jieba

list3 = jieba.lcut_for_search("中华人民共和国是一个伟大的国家")
print(list3)
print("Search engine mode: " + " ".join(list3))
4. Modifying the dictionary

import jieba

text = "中信建投投资了一款游戏，中信也投资了一个游戏公司"  # "CSC invested in a game, and CITIC also invested in a game company"
word = jieba.lcut(text)
print(word)

# Add words
jieba.add_word("中信建投")
jieba.add_word("投资公司")
word1 = jieba.lcut(text)
print(word1)

# Delete a word
jieba.del_word("中信建投")
word2 = jieba.lcut(text)
print(word2)
5. Part-of-speech tagging

import jieba.posseg as pseg

words = pseg.cut("我爱北京天安门")  # "I love Tiananmen, Beijing"
for i in words:
    print(i.word, i.flag)
6. Counting the number of appearances of characters in Romance of the Three Kingdoms
Download the text of Romance of the Three Kingdoms first.
import jieba

txt = open("File path", "r", encoding='utf-8').read()  # Open and read the file
words = jieba.lcut(txt)  # Segment the text using precise mode
counts = {}  # Store words and their occurrence counts as key-value pairs
for word in words:
    if len(word) == 1:  # Single characters are not counted
        continue
    else:
        counts[word] = counts.get(word, 0) + 1  # Add 1 to the word's count every time it appears
items = list(counts.items())  # Convert the key-value pairs to a list
items.sort(key=lambda x: x[1], reverse=True)  # Sort words by occurrence count, from largest to smallest
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
import jieba

excludes = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此", "如何"}  # frequent words that are not character names ("general", "but say", "Jingzhou", ...)
txt = open("Romance of the Three Kingdoms.txt", "r", encoding='utf-8').read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    # merge different ways of referring to the same character
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"  # Kong Ming (Zhuge Liang)
    elif word == "关公" or word == "云长":
        rword = "关羽"  # Guan Yu
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"  # Liu Bei (Xuande)
    elif word == "孟德" or word == "丞相":
        rword = "曹操"  # Cao Cao (Mengde)
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for i in excludes:
    del counts[i]
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
Article source: https://www.cnblogs.com/L-hua/p/15584823.html