Elasticsearch, from introduction to proficiency: installing and using the HanLP analyzer plugin

1, Version correspondence

Plugin version    Branch version
7.x               7.x
6.x               6.x

2, Installation steps

1. Download and install the plugin release corresponding to your ES version

a. Download the corresponding release package. The latest release can be downloaded from Baidu Netdisk (link: Baidu network disk, extraction code: i0o7).

b. Execute the following command to install, where PATH is the absolute path of the plugin package:

  ./bin/elasticsearch-plugin install file://${PATH}
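For example, if the plugin zip had been downloaded to /opt/downloads (a hypothetical location and file name; substitute the release you actually downloaded), the command might look like:

  ./bin/elasticsearch-plugin install file:///opt/downloads/elasticsearch-analysis-hanlp-7.10.2.zip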

2. Install the data package

The release package ships with the default segmentation data from the HanLP source code. To get the full data package, see the HanLP Release page.

Data package directory: ES_HOME/plugins/analysis-hanlp
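A minimal sketch of putting the full data package in place, assuming it was downloaded as data-for-1.7.5.zip (the actual archive name depends on the HanLP release you choose, and it is expected to unpack into a data/ directory) and that ES_HOME is set:

  # hypothetical archive name taken from the HanLP Release page
  unzip data-for-1.7.5.zip -d $ES_HOME/plugins/analysis-hanlp/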

Note: because the user-defined dictionary files in the original data package have Chinese file names, the hanlp.properties shipped here has been changed to use English file names; please rename your dictionary files accordingly.

3. Restart Elasticsearch

Note: ES_HOME above is your own ES installation path; an absolute path is required.

4. Hot update

This version adds dictionary hot updating. The steps are as follows:

a. Add a user-defined dictionary in the ES_HOME/plugins/analysis-hanlp/data/dictionary/custom directory

b. Modify hanlp.properties: update CustomDictionaryPath to include the custom dictionary (a sketch follows below this list)

c. Wait about one minute; the dictionary is reloaded automatically

Note: the above changes are required on every node.
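A minimal sketch of steps a and b, assuming the new dictionary is named my_custom.txt (a made-up name) and follows HanLP's plain-text dictionary format of one word per line with an optional part of speech and frequency:

data/dictionary/custom/my_custom.txt (one entry per line: word, optional part of speech, optional frequency):

  区块链 nz 1024
  机器学习

hanlp.properties (excerpt; the entries already present in your copy of the file may differ):

  CustomDictionaryPath=data/dictionary/custom/CustomDictionary.txt; data/dictionary/custom/my_custom.txt;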

Analyzers provided:

hanlp: HanLP default segmentation
hanlp_standard: standard segmentation
hanlp_index: index segmentation
hanlp_nlp: NLP segmentation
hanlp_crf: CRF segmentation
hanlp_n_short: N-shortest-path segmentation
hanlp_dijkstra: shortest-path segmentation
hanlp_speed: extreme-speed dictionary segmentation

5. Sample

The request below analyzes the Chinese sentence "美国阿拉斯加州发生8.0级地震" ("a magnitude 8.0 earthquake struck Alaska, U.S.A.") with the index analyzer:

POST _analyze
{
  "text": ["美国阿拉斯加州发生8.0级地震"],
  "analyzer": "hanlp_index"
}

result

{
  "tokens" : [
    {
      "token" : "U.S.A",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "nsf",
      "position" : 0
    },
    {
      "token" : "Alaska",
      "start_offset" : 2,
      "end_offset" : 7,
      "type" : "nsf",
      "position" : 1
    },
    {
      "token" : "Alaska",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "nsf",
      "position" : 2
    },
    {
      "token" : "ARAS",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "nsf",
      "position" : 3
    },
    {
      "token" : "Allah",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "r",
      "position" : 4
    },
    {
      "token" : "Russ",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "nrf",
      "position" : 5
    },
    {
      "token" : "California",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "ns",
      "position" : 6
    },
    {
      "token" : "happen",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "v",
      "position" : 7
    },
    {
      "token" : "8.0",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "m",
      "position" : 8
    },
    {
      "token" : "level",
      "start_offset" : 12,
      "end_offset" : 13,
      "type" : "q",
      "position" : 9
    },
    {
      "token" : "earthquake",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "n",
      "position" : 10
    }
  ]
}
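Once installed, any of the analyzers listed above can be used in an index mapping just like a built-in analyzer. A minimal sketch, with made-up index and field names:

PUT my_hanlp_index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "hanlp_index",
        "search_analyzer": "hanlp"
      }
    }
  }
}

Pairing the index-oriented analyzer at index time with the default analyzer at search time is one common choice; either slot can take any analyzer from the list above.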

3, Remote dictionary configuration

The configuration file is ES_HOME/config/analysis-hanlp/hanlp-remote.xml

<properties>
    <comment>HanLP Analyzer Extended configuration</comment>

    <!--Users can configure the remote extension dictionary here -->
    <entry key="remote_ext_dict">words_location</entry>

    <!--Users can configure the remote extended stop word dictionary here-->
    <entry key="remote_ext_stopwords">stop_words_location</entry>
</properties>

1. Remote extension dictionary

Here words_location is either a URL, or a URL followed by a space and a part of speech, for example:

1. http://localhost:8080/mydic

2. http://localhost:8080/mydic nt

The first example configures the URL directly. Each line of the dictionary represents one word, in the format [word] [part of speech A] [frequency of A] [part of speech B] [frequency of B] ... If no part of speech is given, the dictionary's default part of speech n is used.

The second example configures both the dictionary URL and the dictionary's default part of speech nt. The dictionary follows the same format [word] [part of speech A] [frequency of A] [part of speech B] [frequency of B] ...; if no part of speech is given for a word, the default part of speech nt is used.
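For illustration, a remote dictionary served at such a URL might contain lines like the following (the words are arbitrary examples; the last line omits the part of speech and therefore falls back to the default, n or nt depending on how the URL is configured):

  区块链 nz 1024
  人工智能 n 512
  自然语言处理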

2. Remote extension stop word dictionary

Here stop_words_location is a URL, for example:

1. http://localhost:8080/mystopdic

This example configures the URL directly. Each line of the dictionary represents one word; no part of speech or frequency is needed, and lines are separated by the newline character \n.

Note that all dictionary URLs need to meet the following conditions for hot updates to work:

  • The HTTP response must return two headers: Last-Modified and ETag, both strings. If either of them changes, the plugin fetches the dictionary again and updates the thesaurus (a quick way to verify the headers is sketched after this list).

  • Multiple dictionary paths can be configured, separated by semicolons (;).

  • The URL is polled once per minute.

  • Make sure the dictionary file is encoded in UTF-8.
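As a quick sanity check before wiring a URL into hanlp-remote.xml, the headers can be inspected with curl; a static file server such as nginx normally sets both of them for you. The response shown is illustrative only:

  curl -I http://localhost:8080/mydic

  HTTP/1.1 200 OK
  Last-Modified: Mon, 24 Jan 2022 07:00:00 GMT
  ETag: "61ee3c10-2a"
  Content-Type: text/plain; charset=utf-8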

3. Custom word segmentation configuration

Besides the various analyzers, HanLP exposes a number of segmentation options, and the plugin mirrors them. A custom tokenizer can be defined with the following settings:

Config                                Description
enable_custom_config                  Enable custom configuration
enable_index_mode                     Whether to use index segmentation mode
enable_number_quantifier_recognize    Whether to recognize numbers and quantifiers
enable_custom_dictionary              Whether to load the user dictionary
enable_translated_name_recognize      Whether to recognize transliterated names
enable_japanese_name_recognize        Whether to recognize Japanese names
enable_organization_recognize         Whether to recognize organization names
enable_place_recognize                Whether to recognize place names
enable_name_recognize                 Whether to recognize Chinese names
enable_traditional_chinese_mode       Whether to enable traditional Chinese mode
enable_stop_dictionary                Whether to enable stop words
enable_part_of_speech_tagging         Whether to enable part-of-speech tagging
enable_remote_dict                    Whether to enable the remote dictionary
enable_normalization                  Whether to perform character normalization
enable_offset                         Whether to calculate offsets

Note: to use the above settings for a custom tokenizer, enable_custom_config must be set to true.

For example:

PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_hanlp_analyzer": {
          "tokenizer": "my_hanlp"
        }
      },
      "tokenizer": {
        "my_hanlp": {
          "type": "hanlp",
          "enable_stop_dictionary": true,
          "enable_custom_config": true
        }
      }
    }
  }
}
POST test/_analyze
{
  "text": "U.S.A,|=8 in Alaska.0 M earthquake",
  "analyzer": "my_hanlp_analyzer"
}

result:

{
  "tokens" : [
    {
      "token" : "U.S.A",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "nsf",
      "position" : 0
    },
    {
      "token" : ",|=",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "w",
      "position" : 1
    },
    {
      "token" : "Alaska",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "nsf",
      "position" : 2
    },
    {
      "token" : "happen",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "v",
      "position" : 3
    },
    {
      "token" : "8.0",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "m",
      "position" : 4
    },
    {
      "token" : "level",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "q",
      "position" : 5
    },
    {
      "token" : "earthquake",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "n",
      "position" : 6
    }
  ]
}
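As a further sketch under the same assumptions, the options from the table above can be combined freely, for example an index-mode tokenizer that also consults the remote dictionary (the index name my_index and the tokenizer names are made up):

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_hanlp_remote": {
          "tokenizer": "my_hanlp_remote_tokenizer"
        }
      },
      "tokenizer": {
        "my_hanlp_remote_tokenizer": {
          "type": "hanlp",
          "enable_custom_config": true,
          "enable_index_mode": true,
          "enable_remote_dict": true
        }
      }
    }
  }
}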

4, Problems encountered

1. java.io.FilePermission "data/dictionary/CoreNatureDictionary.tr.txt" "read" error

Edit the plugin-security.policy file in the plugin directory and add:

// HanLP data directories
permission java.io.FilePermission "<<ALL FILES>>", "read,write,delete";
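For context, permissions in a Java policy file live inside a grant block, so after the edit the file will contain something along these lines (any permissions already present in plugin-security.policy are kept; this is only a sketch):

grant {
  // permissions already present in the file stay here
  // HanLP data directories
  permission java.io.FilePermission "<<ALL FILES>>", "read,write,delete";
};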

Then restart the Elasticsearch cluster.

Note: this change is required on each node.
