1, Version correspondence
Plugin version | Branch version |
---|---|
7.x | 7.x |
6.x | 6.x |
2, Installation steps
1. Download and install the plugin release version corresponding to your Elasticsearch version
a. Download the corresponding release installation package. The latest release package can be downloaded from Baidu Netdisk (extraction code: i0o7)
b. Execute the following command to install, where PATH is the absolute path of the plugin package:
./bin/elasticsearch-plugin install file://${PATH}
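For example, assuming the release package was downloaded to /tmp (the file name below is a placeholder; use the actual name of the package you downloaded):

```bash
./bin/elasticsearch-plugin install file:///tmp/elasticsearch-analysis-hanlp.zip
```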
2. Install the data package
The segmentation data bundled in the release package is the default data from the HanLP source code. To download the full data package, see the HanLP releases.
Package directory: ES_HOME/plugins/analysis-hanlp
Note: because the custom dictionary file names in the original data package are in Chinese, the hanlp.properties shipped here has been changed to use English file names; please rename the dictionary files accordingly
3. Restart Elasticsearch
Note: ES_HOME above refers to your own Elasticsearch installation path; an absolute path is required
4. Hot update
This version adds dictionary hot updates. The steps are as follows:
a. Add a custom dictionary file to the ES_HOME/plugins/analysis-hanlp/data/dictionary/custom directory
b. Modify hanlp.properties, updating CustomDictionaryPath to include the custom dictionary (see the sketch after this list)
c. After about 1 minute, the dictionary is reloaded automatically
Note: the above changes must be made on every node
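A minimal sketch of the relevant hanlp.properties line, assuming the custom dictionary added in step a is named mydict.txt (a placeholder name); the rest of the default configuration stays unchanged:

```properties
# Paths are resolved against the HanLP root configured in this file;
# multiple dictionaries are separated by semicolons.
CustomDictionaryPath=data/dictionary/custom/CustomDictionary.txt; data/dictionary/custom/mydict.txt;
```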
Analyzers provided:

Analyzer | Description |
---|---|
hanlp | Default segmentation |
hanlp_standard | Standard segmentation |
hanlp_index | Index segmentation |
hanlp_nlp | NLP segmentation |
hanlp_crf | CRF segmentation |
hanlp_n_short | N-shortest-path segmentation |
hanlp_dijkstra | Shortest-path segmentation |
hanlp_speed | Extreme-speed dictionary segmentation |
5. Sample
POST _analyze
{
  "text": ["美国阿拉斯加州发生8.0级地震"],
  "analyzer": "hanlp_index"
}
Result:
{
  "tokens" : [
    { "token" : "美国", "start_offset" : 0, "end_offset" : 2, "type" : "nsf", "position" : 0 },
    { "token" : "阿拉斯加州", "start_offset" : 2, "end_offset" : 7, "type" : "nsf", "position" : 1 },
    { "token" : "阿拉斯加", "start_offset" : 2, "end_offset" : 6, "type" : "nsf", "position" : 2 },
    { "token" : "阿拉斯", "start_offset" : 2, "end_offset" : 5, "type" : "nsf", "position" : 3 },
    { "token" : "阿拉", "start_offset" : 2, "end_offset" : 4, "type" : "r", "position" : 4 },
    { "token" : "拉斯", "start_offset" : 3, "end_offset" : 5, "type" : "nrf", "position" : 5 },
    { "token" : "加州", "start_offset" : 5, "end_offset" : 7, "type" : "ns", "position" : 6 },
    { "token" : "发生", "start_offset" : 7, "end_offset" : 9, "type" : "v", "position" : 7 },
    { "token" : "8.0", "start_offset" : 9, "end_offset" : 12, "type" : "m", "position" : 8 },
    { "token" : "级", "start_offset" : 12, "end_offset" : 13, "type" : "q", "position" : 9 },
    { "token" : "地震", "start_offset" : 13, "end_offset" : 15, "type" : "n", "position" : 10 }
  ]
}
3, Remote dictionary configuration
The configuration file is ES_HOME/config/analysis-hanlp/hanlp-remote.xml
<properties>
    <comment>HanLP Analyzer Extended configuration</comment>
    <!-- Users can configure the remote extension dictionary here -->
    <entry key="remote_ext_dict">words_location</entry>
    <!-- Users can configure the remote extension stop word dictionary here -->
    <entry key="remote_ext_stopwords">stop_words_location</entry>
</properties>
1. Remote extension dictionary
Here words_location is either a URL, or a URL followed by a space and a part of speech, for example:
1. http://localhost:8080/mydic
2. http://localhost:8080/mydic nt
The first example configures only the URL. Each line in the dictionary represents one word, in the format [word] [part of speech A] [frequency of A] [part of speech B] [frequency of B] ... If no part of speech is given for a word, the dictionary's default part of speech n is used.
The second example configures the dictionary URL together with a default part of speech nt for the dictionary. The dictionary still follows the format [word] [part of speech A] [frequency of A] [part of speech B] [frequency of B] ...; if no part of speech is configured for a word, the default part of speech nt is used.
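For illustration only, a hedged sketch of what a remote dictionary file served at such a URL might contain (the words, parts of speech, and frequencies below are made up):

```
攻城狮 nz 10
蓝瘦香菇
区块链 nz 20 n 5
```

The first and third lines carry explicit parts of speech and frequencies; the second line has none, so the dictionary's default part of speech applies.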
2. Remote extension stop word dictionary
Here stop_words_location is a URL, for example:
1. http://localhost:8080/mystopdic
The example configures only the URL. Each line in the dictionary represents one word; no part of speech or frequency is needed, just one word per line separated by newline characters (\n).
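A hedged sketch of such a stop word file (the entries are only examples):

```
的
了
一个
```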
Note that all dictionary URLs must meet the following conditions for segmentation hot updates to work:

- The HTTP response must return two headers, Last-Modified and ETag, both strings. If either of them changes, the plugin fetches the dictionary again and updates its lexicon.
- Multiple dictionary paths can be configured, separated by English semicolons (;).
- The URLs are polled every 1 minute.
- The dictionaries must be UTF-8 encoded.
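As a rough sketch of how these conditions might be met, the following standalone Java example (not part of the plugin; the port, path, and file name are arbitrary) serves a dictionary file with the Last-Modified and ETag headers the plugin looks for:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical dictionary server: serves mydic.txt at http://localhost:8080/mydic
// with Last-Modified and ETag headers so the plugin can detect changes.
public class DictServer {
    public static void main(String[] args) throws Exception {
        Path dict = Path.of("mydic.txt");   // one word per line, UTF-8 encoded
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/mydic", exchange -> {
            byte[] body = Files.readAllBytes(dict);
            // If either header value changes, the plugin reloads the dictionary.
            exchange.getResponseHeaders().set("Last-Modified",
                    String.valueOf(Files.getLastModifiedTime(dict).toMillis()));
            exchange.getResponseHeaders().set("ETag", Integer.toHexString(Arrays.hashCode(body)));
            exchange.getResponseHeaders().set("Content-Type", "text/plain; charset=UTF-8");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }
}
```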
3. Custom word segmentation configuration
In addition to its various segmentation modes, HanLP provides a number of segmentation settings, and the plugin exposes the relevant ones. A custom tokenizer can be defined with the following settings:
Config | Description |
---|---|
enable_custom_config | Enable custom configuration |
enable_index_mode | Whether to use index segmentation mode |
enable_number_quantifier_recognize | Whether to recognize numbers and quantifiers |
enable_custom_dictionary | Whether to load the custom dictionary |
enable_translated_name_recognize | Whether to recognize transliterated names |
enable_japanese_name_recognize | Whether to recognize Japanese names |
enable_organization_recognize | Whether to recognize organization names |
enable_place_recognize | Whether to recognize place names |
enable_name_recognize | Whether to recognize Chinese names |
enable_traditional_chinese_mode | Whether to enable traditional Chinese mode |
enable_stop_dictionary | Whether to enable stop words |
enable_part_of_speech_tagging | Whether to enable part-of-speech tagging |
enable_remote_dict | Whether to enable the remote dictionary |
enable_normalization | Whether to perform character normalization |
enable_offset | Whether to calculate offsets |
Note: to use the above settings in a custom tokenizer, enable_custom_config must be set to true
For example:
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_hanlp_analyzer": {
          "tokenizer": "my_hanlp"
        }
      },
      "tokenizer": {
        "my_hanlp": {
          "type": "hanlp",
          "enable_stop_dictionary": true,
          "enable_custom_config": true
        }
      }
    }
  }
}
POST test/_analyze
{
  "text": "美国,|=阿拉斯加州发生8.0级地震",
  "analyzer": "my_hanlp_analyzer"
}
Result:
{
  "tokens" : [
    { "token" : "美国", "start_offset" : 0, "end_offset" : 2, "type" : "nsf", "position" : 0 },
    { "token" : ",|=", "start_offset" : 0, "end_offset" : 3, "type" : "w", "position" : 1 },
    { "token" : "阿拉斯加州", "start_offset" : 0, "end_offset" : 5, "type" : "nsf", "position" : 2 },
    { "token" : "发生", "start_offset" : 0, "end_offset" : 2, "type" : "v", "position" : 3 },
    { "token" : "8.0", "start_offset" : 0, "end_offset" : 3, "type" : "m", "position" : 4 },
    { "token" : "级", "start_offset" : 0, "end_offset" : 1, "type" : "q", "position" : 5 },
    { "token" : "地震", "start_offset" : 0, "end_offset" : 2, "type" : "n", "position" : 6 }
  ]
}
4, Problems encountered
1. java.io.FilePermission "data/dictionary/CoreNatureDictionary.tr.txt" "read" error
Edit the plugin-security.policy file in the plugin directory and add:
// HanLP data directories
permission java.io.FilePermission "<<ALL FILES>>", "read,write,delete";
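For context, a hedged sketch of the relevant part of plugin-security.policy after the change (the surrounding grant block already exists in the shipped file):

```
grant {
    // HanLP data directories
    permission java.io.FilePermission "<<ALL FILES>>", "read,write,delete";
};
```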
Restart Elasticsearch cluster
Note: this change needs to be made on every node