Integrating Chinese word segmentation into Elasticsearch

IK is a lightweight, dictionary-based Chinese word segmentation toolkit that can be integrated into Elasticsearch through its plugin mechanism.
1. Integration steps

1. Create a new ik directory under the plugins directory of the Elasticsearch installation;

2. Download the matching version of the IK plugin from GitHub:

https://github.com/medcl/elasticsearch-analysis-ik/releases/tag/v6.8.12

3. Unzip the plugin file into that directory and restart Elasticsearch. The startup log shows that the IK plugin has been loaded:

[2022-01-11T15:22:54,341][INFO ][o.e.p.PluginsService     ] [4EvvJl1] loaded plugin [analysis-ik]
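
Alternatively, instead of unzipping by hand, the elasticsearch-plugin CLI can install IK straight from the release URL; a sketch, assuming the standard asset name of the v6.8.12 release and a matching Elasticsearch version:

bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.8.12/elasticsearch-analysis-ik-6.8.12.zip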

2. Experiencing the IK analyzers

IK provides two analyzers, ik_max_word and ik_smart.

The ik_max_word analyzer segments the text as exhaustively as possible, producing fine-grained tokens. Let's analyze the sentence 这次出差我们住的是闫团如家快捷酒店 ('On this business trip, we stayed at the Yan Tuan Rujia Express Hotel'):

POST _analyze
{
  "analyzer": "ik_max_word",
  "text":"这次出差我们住的是闫团如家快捷酒店"
}


{
  "tokens" : [
    {
      "token" : "这次",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "出差",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "我们",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "住",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "的",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "是",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "CN_CHAR",
      "position" : 5
    },
    {
      "token" : "闫",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "CN_CHAR",
      "position" : 6
    },
    {
      "token" : "团",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "CN_CHAR",
      "position" : 7
    },
    {
      "token" : "如家",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "CN_WORD",
      "position" : 8
    },
    {
      "token" : "快捷酒店",
      "start_offset" : 13,
      "end_offset" : 17,
      "type" : "CN_WORD",
      "position" : 9
    }
  ]
}



The ik_smart analyzer segments at a coarser granularity; on this sentence, before the dictionary is extended, its output happens to be identical:

POST _analyze
{
  "analyzer": "ik_smart",
  "text":"这次出差我们住的是闫团如家快捷酒店"
}

{
  "tokens" : [
    {
      "token" : "这次",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "出差",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "我们",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "住",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "的",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "是",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "CN_CHAR",
      "position" : 5
    },
    {
      "token" : "闫",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "CN_CHAR",
      "position" : 6
    },
    {
      "token" : "团",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "CN_CHAR",
      "position" : 7
    },
    {
      "token" : "如家",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "CN_WORD",
      "position" : 8
    },
    {
      "token" : "快捷酒店",
      "start_offset" : 13,
      "end_offset" : 17,
      "type" : "CN_WORD",
      "position" : 9
    }
  ]
}
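
A common pattern is to index with the fine-grained analyzer and query with the coarse one, so that documents match on every plausible word while queries stay precise. A minimal mapping sketch (the hotel index name is hypothetical; on Elasticsearch 6.x the properties must be nested under a mapping type):

PUT /hotel
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}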

3. Extending the IK dictionary

Because 闫团 (Yan Tuan) is a relatively small place, IK's dictionary does not contain it, which is why it was split into two single characters above. We can add it to IK's dictionary.

Create a my.dic file under the config directory of the IK plugin installation and put 闫团 in it. Then modify the IKAnalyzer.cfg.xml file there to register the new dictionary file:

<properties>
	<comment>IK Analyzer Extended configuration</comment>
	<!-- Users can configure their own extended dictionary here -->
	<entry key="ext_dict">my.dic</entry>
	<!-- Users can configure their own extended stop word dictionary here -->
	<entry key="ext_stopwords"></entry>
	<!-- Users can configure the remote extension dictionary here -->
	<!-- <entry key="remote_ext_dict">words_location</entry> -->
	<!-- Users can configure the remote extended stop word dictionary here -->
	<!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
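
The dictionary file itself is a plain UTF-8 word list, one entry per line; for this example it contains a single line:

闫团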

Restart Elasticsearch and re-run the analysis; the place name is now produced as a single token:

POST _analyze
{
  "analyzer": "ik_smart",
  "text":"这次出差我们住的是闫团如家快捷酒店"
}

{
  "tokens" : [
    {
      "token" : "这次",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "出差",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "我们",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "住",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "的",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "是",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "CN_CHAR",
      "position" : 5
    },
    {
      "token" : "闫团",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "如家",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "快捷酒店",
      "start_offset" : 13,
      "end_offset" : 17,
      "type" : "CN_WORD",
      "position" : 8
    }
  ]
}
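
IK can also hot-reload dictionaries without a restart through the remote_ext_dict entry shown commented out in the configuration above: point it at an HTTP URL that serves the word list (one entry per line) and returns Last-Modified or ETag headers, which IK polls for changes. For example (URL hypothetical):

<entry key="remote_ext_dict">http://example.com/hot_words.dic</entry>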

4. Experiencing the HanLP analyzer and custom dictionary

HanLP is a Java toolkit built from a series of models and algorithms. Starting from Chinese word segmentation, it covers common NLP tasks such as part-of-speech tagging, named entity recognition, syntactic parsing, and text classification. It provides rich APIs and is widely used with search platforms such as Lucene, Solr, and Elasticsearch. For word segmentation it supports algorithms such as shortest-path segmentation, N-shortest-path segmentation, and CRF segmentation.

Download the HanLP plugin package from

https://github.com/KennFalcon/elasticsearch-analysis-hanlp/releases/download/v7.9.2/elasticsearch-analysis-hanlp-7.9.2.zip

Install the plugin package:

bin\elasticsearch-plugin install file:///c:/elasticsearch-analysis-hanlp-7.9.2.zip
-> Installing file:///c:/elasticsearch-analysis-hanlp-7.9.2.zip
-> Downloading file:///c:/elasticsearch-analysis-hanlp-7.9.2.zip
[=================================================] 100%
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@     WARNING: plugin requires additional permissions     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
* java.io.FilePermission plugins/analysis-hanlp/data/- read,write,delete
* java.io.FilePermission plugins/analysis-hanlp/hanlp.cache read,write,delete
* java.lang.RuntimePermission getClassLoader
* java.lang.RuntimePermission setContextClassLoader
* java.net.SocketPermission * connect,resolve
* java.util.PropertyPermission * read,write
See http://docs.oracle.com/javase/8/docs/technotes/guides/security/permissions.html
for descriptions of what these permissions allow and the associated risks.

Continue with installation? [y/N]y
-> Installed analysis-hanlp
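
You can confirm that Elasticsearch sees the plugin with the list subcommand (a quick check; output varies by version):

bin\elasticsearch-plugin list
analysis-hanlp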

Analyze the text with the hanlp_standard analyzer:

POST _analyze
{
  "analyzer": "hanlp_standard",
  "text":"这次出差我们住的是闫团如家快捷酒店"
}

{
  "tokens" : [
    {
      "token" : "这次",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "r",
      "position" : 0
    },
    {
      "token" : "出差",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "vi",
      "position" : 1
    },
    {
      "token" : "我们",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "rr",
      "position" : 2
    },
    {
      "token" : "住",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "vi",
      "position" : 3
    },
    {
      "token" : "的",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "ude1",
      "position" : 4
    },
    {
      "token" : "是",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "vshi",
      "position" : 5
    },
    {
      "token" : "闫团",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "nr",
      "position" : 6
    },
    {
      "token" : "如家",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "r",
      "position" : 7
    },
    {
      "token" : "快捷酒店",
      "start_offset" : 13,
      "end_offset" : 17,
      "type" : "ntch",
      "position" : 8
    }
  ]
}

We can see that HanLP recognizes 闫团 as a single word out of the box.

The following test shows, however, that HanLP does not treat 小地方 ('small place') as a single token:

POST _analyze
{
  "analyzer": "hanlp_standard",
  "text":"闫团是一个小地方"
}

{
  "tokens" : [
    {
      "token" : "闫团",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "nr",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "vshi",
      "position" : 1
    },
    {
      "token" : "一个",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "mq",
      "position" : 2
    },
    {
      "token" : "小",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "a",
      "position" : 3
    },
    {
      "token" : "地方",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "n",
      "position" : 4
    }
  ]
}

To customize the segmentation, create my.dic under ${ES_HOME}/plugins/analysis-hanlp/data/dictionary/custom and add 小地方 to it.

Then copy the hanlp.properties file from the plugin package to ${ES_HOME}/config/analysis-hanlp/hanlp.properties and modify CustomDictionaryPath to append the new file:

CustomDictionaryPath=data/dictionary/custom/CustomDictionary.txt; 现代汉语补充词库.txt; 全国地名大全.txt ns; 人名词典.txt; 机构名词典.txt; 上海地名.txt ns;data/dictionary/person/nrf.txt nrf;data/dictionary/custom/my.dic;
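
HanLP caches compiled dictionaries as .bin files next to their sources; if a new entry does not seem to take effect, deleting the cache before restarting forces a rebuild. A sketch, assuming the default layout (Windows shell, matching the install step above):

del plugins\analysis-hanlp\data\dictionary\custom\CustomDictionary.txt.bin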

Restart Elasticsearch and run the test again:

POST _analyze
{
  "analyzer": "hanlp",
  "text":"闫团是一个小地方"
}

{
  "tokens" : [
    {
      "token" : "闫团",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "nr",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "vshi",
      "position" : 1
    },
    {
      "token" : "一个",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "mq",
      "position" : 2
    },
    {
      "token" : "小地方",
      "start_offset" : 5,
      "end_offset" : 8,
      "type" : "n",
      "position" : 3
    }
  ]
}
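
To see the custom dictionary working end to end, here is a minimal sketch that indexes a document with the hanlp analyzer and matches it on the new word (the test index name is hypothetical):

PUT /test
{
  "mappings": {
    "properties": {
      "content": { "type": "text", "analyzer": "hanlp" }
    }
  }
}

PUT /test/_doc/1
{ "content": "闫团是一个小地方" }

GET /test/_search
{ "query": { "match": { "content": "小地方" } } }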
