1, Version correspondence
Plugin version | Branch version |
---|---|
7.x | 7.x |
6.x | 6.x |
2, Installation steps
1. Download and install the plugin release version corresponding to your Elasticsearch version
a. Download the corresponding release installation package. The latest release package can be downloaded from Baidu Netdisk (extraction code: i0o7)
b. Execute the following command to install, where PATH is the absolute path of the plugin package:
./bin/elasticsearch-plugin install file://${PATH}
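For example, assuming the release package was downloaded to /tmp (the file name below is a placeholder; use the actual name of the package you downloaded):

```bash
./bin/elasticsearch-plugin install file:///tmp/elasticsearch-analysis-hanlp.zip
```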
2. Install the data package
The segmentation data bundled in the release package is the default data from the HanLP source code. To download the full data package, see the HanLP releases.
Package directory: ES_HOME/plugins/analysis-hanlp
Note: because the custom dictionary file names in the original data package are in Chinese, the hanlp.properties shipped here has been changed to use English file names; please rename the dictionary files accordingly
3. Restart Elasticsearch
Note: ES_HOME above refers to your own Elasticsearch installation path; an absolute path is required
4. Hot update
This version adds dictionary hot updates. The steps are as follows:
a. Add a custom dictionary file to the ES_HOME/plugins/analysis-hanlp/data/dictionary/custom directory
b. Modify hanlp.properties, updating CustomDictionaryPath to include the custom dictionary (see the sketch after this list)
c. After about 1 minute, the dictionary is reloaded automatically
Note: the above changes must be made on every node
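A minimal sketch of the relevant hanlp.properties line, assuming the custom dictionary added in step a is named mydict.txt (a placeholder name); the rest of the default configuration stays unchanged:

```properties
# Paths are resolved against the HanLP root configured in this file;
# multiple dictionaries are separated by semicolons.
CustomDictionaryPath=data/dictionary/custom/CustomDictionary.txt; data/dictionary/custom/mydict.txt;
```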
Analyzers provided:

Analyzer | Description |
---|---|
hanlp | Default segmentation |
hanlp_standard | Standard segmentation |
hanlp_index | Index segmentation |
hanlp_nlp | NLP segmentation |
hanlp_crf | CRF segmentation |
hanlp_n_short | N-shortest-path segmentation |
hanlp_dijkstra | Shortest-path segmentation |
hanlp_speed | Extreme-speed dictionary segmentation |
5. Sample
POST _analyze
{
  "text": ["美国阿拉斯加州发生8.0级地震"],
  "analyzer": "hanlp_index"
}
Result:
{
  "tokens" : [
    { "token" : "美国", "start_offset" : 0, "end_offset" : 2, "type" : "nsf", "position" : 0 },
    { "token" : "阿拉斯加州", "start_offset" : 2, "end_offset" : 7, "type" : "nsf", "position" : 1 },
    { "token" : "阿拉斯加", "start_offset" : 2, "end_offset" : 6, "type" : "nsf", "position" : 2 },
    { "token" : "阿拉斯", "start_offset" : 2, "end_offset" : 5, "type" : "nsf", "position" : 3 },
    { "token" : "阿拉", "start_offset" : 2, "end_offset" : 4, "type" : "r", "position" : 4 },
    { "token" : "拉斯", "start_offset" : 3, "end_offset" : 5, "type" : "nrf", "position" : 5 },
    { "token" : "加州", "start_offset" : 5, "end_offset" : 7, "type" : "ns", "position" : 6 },
    { "token" : "发生", "start_offset" : 7, "end_offset" : 9, "type" : "v", "position" : 7 },
    { "token" : "8.0", "start_offset" : 9, "end_offset" : 12, "type" : "m", "position" : 8 },
    { "token" : "级", "start_offset" : 12, "end_offset" : 13, "type" : "q", "position" : 9 },
    { "token" : "地震", "start_offset" : 13, "end_offset" : 15, "type" : "n", "position" : 10 }
  ]
}
3, Remote dictionary configuration
The configuration file is ES_HOME/config/analysis-hanlp/hanlp-remote.xml
<properties>
    <comment>HanLP Analyzer Extended configuration</comment>
    <!-- Users can configure the remote extension dictionary here -->
    <entry key="remote_ext_dict">words_location</entry>
    <!-- Users can configure the remote extension stop word dictionary here -->
    <entry key="remote_ext_stopwords">stop_words_location</entry>
</properties>
1. Remote extension dictionary
Here words_location is either a URL, or a URL followed by a space and a part of speech, for example:
1. http://localhost:8080/mydic
2. http://localhost:8080/mydic nt
The first example configures only the URL. Each line in the dictionary represents one word, in the format [word] [part of speech A] [frequency of A] [part of speech B] [frequency of B] ... If no part of speech is given for a word, the dictionary's default part of speech n is used.
The second example configures the dictionary URL together with a default part of speech nt for the dictionary. The dictionary still follows the format [word] [part of speech A] [frequency of A] [part of speech B] [frequency of B] ...; if no part of speech is configured for a word, the default part of speech nt is used.
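For illustration only, a hedged sketch of what a remote dictionary file served at such a URL might contain (the words, parts of speech, and frequencies below are made up):

```
攻城狮 nz 10
蓝瘦香菇
区块链 nz 20 n 5
```

The first and third lines carry explicit parts of speech and frequencies; the second line has none, so the dictionary's default part of speech applies.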
2. Remote extension stop word dictionary
Here stop_words_location is a URL, for example:
1. http://localhost:8080/mystopdic
The example configures only the URL. Each line in the dictionary represents one word; no part of speech or frequency is needed, just one word per line separated by newline characters (\n).
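A hedged sketch of such a stop word file (the entries are only examples):

```
的
了
一个
```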
Note that all dictionary URLs must meet the following conditions for segmentation hot updates to work:

- The HTTP response must return two headers, Last-Modified and ETag, both strings. If either of them changes, the plugin fetches the dictionary again and updates its lexicon.
- Multiple dictionary paths can be configured, separated by English semicolons (;).
- The URLs are polled every 1 minute.
- The dictionaries must be UTF-8 encoded.
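As a rough sketch of how these conditions might be met, the following standalone Java example (not part of the plugin; the port, path, and file name are arbitrary) serves a dictionary file with the Last-Modified and ETag headers the plugin looks for:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical dictionary server: serves mydic.txt at http://localhost:8080/mydic
// with Last-Modified and ETag headers so the plugin can detect changes.
public class DictServer {
    public static void main(String[] args) throws Exception {
        Path dict = Path.of("mydic.txt");   // one word per line, UTF-8 encoded
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/mydic", exchange -> {
            byte[] body = Files.readAllBytes(dict);
            // If either header value changes, the plugin reloads the dictionary.
            exchange.getResponseHeaders().set("Last-Modified",
                    String.valueOf(Files.getLastModifiedTime(dict).toMillis()));
            exchange.getResponseHeaders().set("ETag", Integer.toHexString(Arrays.hashCode(body)));
            exchange.getResponseHeaders().set("Content-Type", "text/plain; charset=UTF-8");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }
}
```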
3. Custom word segmentation configuration
In addition to its various segmentation modes, HanLP provides a number of segmentation settings, and the plugin exposes the relevant ones. A custom tokenizer can be defined with the following settings:
Config | Description |
---|---|
enable_custom_config | Enable custom configuration |
enable_index_mode | Whether to use index segmentation mode |
enable_number_quantifier_recognize | Whether to recognize numbers and quantifiers |
enable_custom_dictionary | Whether to load the custom dictionary |
enable_translated_name_recognize | Whether to recognize transliterated names |
enable_japanese_name_recognize | Whether to recognize Japanese names |
enable_organization_recognize | Whether to recognize organization names |
enable_place_recognize | Whether to recognize place names |
enable_name_recognize | Whether to recognize Chinese names |
enable_traditional_chinese_mode | Whether to enable traditional Chinese mode |
enable_stop_dictionary | Whether to enable stop words |
enable_part_of_speech_tagging | Whether to enable part-of-speech tagging |
enable_remote_dict | Whether to enable the remote dictionary |
enable_normalization | Whether to perform character normalization |
enable_offset | Whether to calculate offsets |
Note: to use the above settings in a custom tokenizer, enable_custom_config must be set to true
For example:
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_hanlp_analyzer": {
          "tokenizer": "my_hanlp"
        }
      },
      "tokenizer": {
        "my_hanlp": {
          "type": "hanlp",
          "enable_stop_dictionary": true,
          "enable_custom_config": true
        }
      }
    }
  }
}
POST test/_analyze
{
  "text": "美国,|=阿拉斯加州发生8.0级地震",
  "analyzer": "my_hanlp_analyzer"
}
Result:
{
  "tokens" : [
    { "token" : "美国", "start_offset" : 0, "end_offset" : 2, "type" : "nsf", "position" : 0 },
    { "token" : ",|=", "start_offset" : 0, "end_offset" : 3, "type" : "w", "position" : 1 },
    { "token" : "阿拉斯加州", "start_offset" : 0, "end_offset" : 5, "type" : "nsf", "position" : 2 },
    { "token" : "发生", "start_offset" : 0, "end_offset" : 2, "type" : "v", "position" : 3 },
    { "token" : "8.0", "start_offset" : 0, "end_offset" : 3, "type" : "m", "position" : 4 },
    { "token" : "级", "start_offset" : 0, "end_offset" : 1, "type" : "q", "position" : 5 },
    { "token" : "地震", "start_offset" : 0, "end_offset" : 2, "type" : "n", "position" : 6 }
  ]
}
4, Problems encountered
1. java.io.FilePermission "data/dictionary/CoreNatureDictionary.tr.txt" "read" error
Edit the plugin-security.policy file in the plugin directory and add:
// HanLP data directories
permission java.io.FilePermission "<<ALL FILES>>", "read,write,delete";
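For context, a hedged sketch of the relevant part of plugin-security.policy after the change (the surrounding grant block already exists in the shipped file):

```
grant {
    // HanLP data directories
    permission java.io.FilePermission "<<ALL FILES>>", "read,write,delete";
};
```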
Restart Elasticsearch cluster
Note: this change needs to be made on every node