IK is a lightweight, dictionary-based Chinese word segmentation toolkit that can be integrated into elasticsearch through its plug-in mechanism;
1, Integration steps
1. Create a new ik directory under the plugins directory of the elasticsearch installation;
2. Download the corresponding version of the ik plug-in from github;
https://github.com/medcl/elasticsearch-analysis-ik/releases/tag/v6.8.12
3. Unzip the plug-in file into that directory and restart elasticsearch; you can see that the ik plug-in has been loaded, as follows;
[2022-01-11T15:22:54,341][INFO ][o.e.p.PluginsService ] [4EvvJl1] loaded plugin [analysis-ik]
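You can also confirm the installation from the command line; a quick check, assuming the default installation layout:

bin\elasticsearch-plugin list

This should print analysis-ik among the installed plug-ins.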
2, Experiencing the IK analyzers
IK provides two analyzers: ik_smart and ik_max_word;
The ik_max_word analyzer segments the text to the greatest extent possible, producing relatively fine-grained tokens;
POST _analyze { "analyzer": "ik_max_word", "text":"On this business trip, we stayed at Yan Tuan Rujia Express Hotel" } { "tokens" : [ { "token" : "this time", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 0 }, { "token" : "a business travel", "start_offset" : 2, "end_offset" : 4, "type" : "CN_WORD", "position" : 1 }, { "token" : "We", "start_offset" : 4, "end_offset" : 6, "type" : "CN_WORD", "position" : 2 }, { "token" : "live", "start_offset" : 6, "end_offset" : 7, "type" : "CN_CHAR", "position" : 3 }, { "token" : "of", "start_offset" : 7, "end_offset" : 8, "type" : "CN_CHAR", "position" : 4 }, { "token" : "yes", "start_offset" : 8, "end_offset" : 9, "type" : "CN_CHAR", "position" : 5 }, { "token" : "a surname", "start_offset" : 9, "end_offset" : 10, "type" : "CN_CHAR", "position" : 6 }, { "token" : "group", "start_offset" : 10, "end_offset" : 11, "type" : "CN_CHAR", "position" : 7 }, { "token" : "Like home", "start_offset" : 11, "end_offset" : 13, "type" : "CN_WORD", "position" : 8 }, { "token" : "Budget Hotel", "start_offset" : 13, "end_offset" : 17, "type" : "CN_WORD", "position" : 9 } ] }
The ik_smart analyzer produces relatively coarse-grained tokens;
POST _analyze { "analyzer": "ik_smart", "text":"On this business trip, we stayed at Yan Tuan Rujia Express Hotel" } { "tokens" : [ { "token" : "this time", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 0 }, { "token" : "a business travel", "start_offset" : 2, "end_offset" : 4, "type" : "CN_WORD", "position" : 1 }, { "token" : "We", "start_offset" : 4, "end_offset" : 6, "type" : "CN_WORD", "position" : 2 }, { "token" : "live", "start_offset" : 6, "end_offset" : 7, "type" : "CN_CHAR", "position" : 3 }, { "token" : "of", "start_offset" : 7, "end_offset" : 8, "type" : "CN_CHAR", "position" : 4 }, { "token" : "yes", "start_offset" : 8, "end_offset" : 9, "type" : "CN_CHAR", "position" : 5 }, { "token" : "a surname", "start_offset" : 9, "end_offset" : 10, "type" : "CN_CHAR", "position" : 6 }, { "token" : "group", "start_offset" : 10, "end_offset" : 11, "type" : "CN_CHAR", "position" : 7 }, { "token" : "Like home", "start_offset" : 11, "end_offset" : 13, "type" : "CN_WORD", "position" : 8 }, { "token" : "Budget Hotel", "start_offset" : 13, "end_offset" : 17, "type" : "CN_WORD", "position" : 9 } ] }
3, Extending the ik dictionary
Because Yan Tuan is a relatively small place, IK's dictionary does not contain it, so the name is split into two single characters (the two CN_CHAR tokens above); we can add it to IK's dictionary;
Create a my.dic file under the config directory of the ik installation and put Yan Tuan into the file; when done, modify the IKAnalyzer.cfg.xml file to register the new dictionary file;
<properties> <comment>IK Analyzer Extended configuration</comment> <!--Users can configure their own extended dictionary here --> <entry key="ext_dict">my.dic</entry> <!--Users can configure their own extended stop word dictionary here--> <entry key="ext_stopwords"></entry> <!--Users can configure the remote extension dictionary here --> <!-- <entry key="remote_ext_dict">words_location</entry> --> <!--Users can configure the remote extended stop word dictionary here--> <!-- <entry key="remote_ext_stopwords">words_location</entry> --> </properties>
Restart elasticsearch and re-execute the analysis; the place name is now treated as a single token;
POST _analyze { "analyzer": "ik_smart", "text":"On this business trip, we stayed at Yan Tuan Rujia Express Hotel" } { "tokens" : [ { "token" : "this time", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 0 }, { "token" : "a business travel", "start_offset" : 2, "end_offset" : 4, "type" : "CN_WORD", "position" : 1 }, { "token" : "We", "start_offset" : 4, "end_offset" : 6, "type" : "CN_WORD", "position" : 2 }, { "token" : "live", "start_offset" : 6, "end_offset" : 7, "type" : "CN_CHAR", "position" : 3 }, { "token" : "of", "start_offset" : 7, "end_offset" : 8, "type" : "CN_CHAR", "position" : 4 }, { "token" : "yes", "start_offset" : 8, "end_offset" : 9, "type" : "CN_CHAR", "position" : 5 }, { "token" : "Yan Tuan", "start_offset" : 9, "end_offset" : 11, "type" : "CN_WORD", "position" : 6 }, { "token" : "Like home", "start_offset" : 11, "end_offset" : 13, "type" : "CN_WORD", "position" : 7 }, { "token" : "Budget Hotel", "start_offset" : 13, "end_offset" : 17, "type" : "CN_WORD", "position" : 8 } ] }
4, Experiencing the HanLP analyzer and custom dictionaries
HanLP is a Java toolkit composed of a series of models and algorithms. Starting from Chinese word segmentation, it covers common NLP tasks such as part-of-speech tagging, named entity recognition, syntactic analysis and text classification. It provides rich APIs and is widely used in search platforms such as Lucene, Solr and ES. For word segmentation it supports algorithms such as shortest-path segmentation, N-shortest-path segmentation and CRF segmentation.
Download the HanLP plug-in package from
https://github.com/KennFalcon/elasticsearch-analysis-hanlp/releases/download/v7.9.2/elasticsearch-analysis-hanlp-7.9.2.zip
Install the plug-in package:
bin\elasticsearch-plugin install file:///c:/elasticsearch-analysis-hanlp-7.9.2.zip
-> Installing file:///c:/elasticsearch-analysis-hanlp-7.9.2.zip
-> Downloading file:///c:/elasticsearch-analysis-hanlp-7.9.2.zip
[=================================================] 100%
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@     WARNING: plugin requires additional permissions     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
* java.io.FilePermission plugins/analysis-hanlp/data/- read,write,delete
* java.io.FilePermission plugins/analysis-hanlp/hanlp.cache read,write,delete
* java.lang.RuntimePermission getClassLoader
* java.lang.RuntimePermission setContextClassLoader
* java.net.SocketPermission * connect,resolve
* java.util.PropertyPermission * read,write
See http://docs.oracle.com/javase/8/docs/technotes/guides/security/permissions.html
for descriptions of what these permissions allow and the associated risks.
Continue with installation? [y/N]y
-> Installed analysis-hanlp
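After restarting elasticsearch, the installation can also be verified through the REST API; the installed analysis plug-ins should appear in the list:

GET _cat/plugins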
Use the hanlp_standard analyzer to analyze the text;
POST _analyze { "analyzer": "hanlp_standard", "text":"On this business trip, we stayed at Yan Tuan Rujia Express Hotel" } { "tokens" : [ { "token" : "this time", "start_offset" : 0, "end_offset" : 2, "type" : "r", "position" : 0 }, { "token" : "a business travel", "start_offset" : 2, "end_offset" : 4, "type" : "vi", "position" : 1 }, { "token" : "We", "start_offset" : 4, "end_offset" : 6, "type" : "rr", "position" : 2 }, { "token" : "live", "start_offset" : 6, "end_offset" : 7, "type" : "vi", "position" : 3 }, { "token" : "of", "start_offset" : 7, "end_offset" : 8, "type" : "ude1", "position" : 4 }, { "token" : "yes", "start_offset" : 8, "end_offset" : 9, "type" : "vshi", "position" : 5 }, { "token" : "Yan Tuan", "start_offset" : 9, "end_offset" : 11, "type" : "nr", "position" : 6 }, { "token" : "Like home", "start_offset" : 11, "end_offset" : 13, "type" : "r", "position" : 7 }, { "token" : "Budget Hotel", "start_offset" : 13, "end_offset" : 17, "type" : "ntch", "position" : 8 } ] }
We can see that HanLP automatically recognizes Yan Tuan as a single word;
By running the following test, we can see that HanLP does not treat "small place" as a single word;
POST _analyze { "analyzer": "hanlp_standard", "text":"Yan Tuan is a small place" } { "tokens" : [ { "token" : "Yan Tuan", "start_offset" : 0, "end_offset" : 2, "type" : "nr", "position" : 0 }, { "token" : "yes", "start_offset" : 2, "end_offset" : 3, "type" : "vshi", "position" : 1 }, { "token" : "One", "start_offset" : 3, "end_offset" : 5, "type" : "mq", "position" : 2 }, { "token" : "Small", "start_offset" : 5, "end_offset" : 6, "type" : "a", "position" : 3 }, { "token" : "local", "start_offset" : 6, "end_offset" : 8, "type" : "n", "position" : 4 } ] }
To customize the segmentation, we create my.dic under ${ES_HOME}/plugins/analysis-hanlp/data/dictionary/custom and add "small place" to it;
Then copy the hanlp.properties file from the plug-in installation package to ${ES_HOME}/config/analysis-hanlp/hanlp.properties and modify CustomDictionaryPath;
CustomDictionaryPath=data/dictionary/custom/CustomDictionary.txt; ModernChineseSupplementaryWord.txt; ChinesePlaceName.txt ns; PersonalName.txt; OrganizationName.txt; ShanghaiPlaceName.txt ns;data/dictionary/person/nrf.txt nrf;data/dictionary/custom/my.dic;
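If the new entry does not take effect after a restart, it is usually because HanLP loads its dictionaries from a compiled cache; deleting the cache (for this plug-in, presumably the hanlp.cache file mentioned in the install permissions above, though the location may vary by version) forces a rebuild on the next start:

del plugins\analysis-hanlp\hanlp.cache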
Restart elasticsearch and execute the test:
POST _analyze { "analyzer": "hanlp", "text":"Yan Tuan is a small place" } { "tokens" : [ { "token" : "Yan Tuan", "start_offset" : 0, "end_offset" : 2, "type" : "nr", "position" : 0 }, { "token" : "yes", "start_offset" : 2, "end_offset" : 3, "type" : "vshi", "position" : 1 }, { "token" : "One", "start_offset" : 3, "end_offset" : 5, "type" : "mq", "position" : 2 }, { "token" : "Small place", "start_offset" : 5, "end_offset" : 8, "type" : "n", "position" : 3 } ] }