A Summary of the Various ElasticSearch Analyzers

Definition

An Analyzer is the component in ES that handles text analysis (word segmentation). It is composed of three parts:

Character Filters: pre-process the raw text, e.g. stripping HTML tags
Tokenizer: split the text into tokens according to some rule
Token Filter: post-process the tokens, e.g. removing stop words (all three stages can be tried out directly through the _analyze API, as the sketch below shows)
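
The request below is a minimal sketch of such a call (the sample text is made up); instead of naming a prebuilt analyzer, an html_strip character filter, the standard tokenizer and a lowercase token filter are specified explicitly:

GET /_analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<b>Some HTML Text</b>"
}
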
Types of analyzers


StandardAnalyzer

This is the default analyzer. It splits the text on word boundaries, lowercases the tokens, and has stop word filtering disabled by default.
The usage is as follows:

GET /_analyze
{
  "analyzer": "standard",
  "text": "It`s a good day commander. Let`s do it for 2 times!"
}


The result is as follows; note that the uppercase letters have been converted to lowercase:

{
  "tokens" : [
    {
      "token" : "it",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "s",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "a",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "good",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "day",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "commander",
      "start_offset" : 16,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "let",
      "start_offset" : 27,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "s",
      "start_offset" : 31,
      "end_offset" : 32,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "do",
      "start_offset" : 33,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "it",
      "start_offset" : 36,
      "end_offset" : 38,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "for",
      "start_offset" : 39,
      "end_offset" : 42,
      "type" : "<ALPHANUM>",
      "position" : 10
    },
    {
      "token" : "2",
      "start_offset" : 43,
      "end_offset" : 44,
      "type" : "<NUM>",
      "position" : 11
    },
    {
      "token" : "times",
      "start_offset" : 45,
      "end_offset" : 50,
      "type" : "<ALPHANUM>",
      "position" : 12
    }
  ]
}
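
Although stop word filtering is off by default, it can be enabled when the standard analyzer is configured in index settings. A minimal sketch (the index and analyzer names are just examples):

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_english": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}
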


SimpleAnalyzer


It splits on any non-letter character, discards the non-letter characters, and lowercases the tokens.
Examples are as follows:

GET /_analyze
{
  "analyzer": "simple",
  "text": "It`s a good day commander. Let`s do it for 2 times!"
}


The output is as follows. Besides the lowercasing, every non-letter token has been removed (the number 2 is gone):

{
  "tokens" : [
    {
      "token" : "it",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "s",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "a",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "good",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "day",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "commander",
      "start_offset" : 16,
      "end_offset" : 25,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "let",
      "start_offset" : 27,
      "end_offset" : 30,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "s",
      "start_offset" : 31,
      "end_offset" : 32,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "do",
      "start_offset" : 33,
      "end_offset" : 35,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "it",
      "start_offset" : 36,
      "end_offset" : 38,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "for",
      "start_offset" : 39,
      "end_offset" : 42,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "times",
      "start_offset" : 45,
      "end_offset" : 50,
      "type" : "word",
      "position" : 11
    }
  ]
}
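
Under the hood the simple analyzer is nothing more than the lowercase tokenizer, so an equivalent custom analyzer can be declared as in the sketch below (index and analyzer names are illustrative):

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_simple": {
          "type": "custom",
          "tokenizer": "lowercase"
        }
      }
    }
  }
}
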



WhitespaceAnalyzer


It splits the text on whitespace only, for example:

GET /_analyze
{
  "analyzer": "whitespace",
  "text": "It`s a good day commander. Let`s do it for 2 times!"
}


The output is as follows. Note that It`s, Let`s, etc. are kept intact, including the trailing punctuation:

{
  "tokens" : [
    {
      "token" : "It`s",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "a",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "good",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "day",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "commander.",
      "start_offset" : 16,
      "end_offset" : 26,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "Let`s",
      "start_offset" : 27,
      "end_offset" : 32,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "do",
      "start_offset" : 33,
      "end_offset" : 35,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "it",
      "start_offset" : 36,
      "end_offset" : 38,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "for",
      "start_offset" : 39,
      "end_offset" : 42,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "2",
      "start_offset" : 43,
      "end_offset" : 44,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "times!",
      "start_offset" : 45,
      "end_offset" : 51,
      "type" : "word",
      "position" : 10
    }
  ]
}



StopAnalyzer


Compared with the SimpleAnalyzer, it adds a stop token filter, which removes stop words such as the, a and is. For example:

GET /_analyze
{
  "analyzer": "stop",
  "text": "It`s a good day commander. Let`s do it for 2 times!"
}



The output is as follows; the stop words have been removed:

{
  "tokens" : [
    {
      "token" : "s",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "good",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "day",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "commander",
      "start_offset" : 16,
      "end_offset" : 25,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "let",
      "start_offset" : 27,
      "end_offset" : 30,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "s",
      "start_offset" : 31,
      "end_offset" : 32,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "do",
      "start_offset" : 33,
      "end_offset" : 35,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "times",
      "start_offset" : 45,
      "end_offset" : 50,
      "type" : "word",
      "position" : 11
    }
  ]
}
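
The stop list itself can be customized when the stop analyzer is defined in index settings; a sketch (index, analyzer name and word list are made up):

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop": {
          "type": "stop",
          "stopwords": ["the", "a", "is", "for", "and"]
        }
      }
    }
  }
}
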



Keyword Analyzer


It performs no segmentation at all; the entire input is emitted as a single term. For example:

GET /_analyze
{
  "analyzer": "keyword",
  "text": "It`s a good day commander. Let`s do it for 2 times!"
}



The output is as follows:

{
  "tokens" : [
    {
      "token" : "It`s a good day commander. Let`s do it for 2 times!",
      "start_offset" : 0,
      "end_offset" : 51,
      "type" : "word",
      "position" : 0
    }
  ]
}
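
In practice this analyzer is attached to a text field in a mapping when the whole value should be treated as one token; a minimal sketch (the index and field names are assumptions):

PUT my_index
{
  "mappings": {
    "properties": {
      "serial_no": {
        "type": "text",
        "analyzer": "keyword"
      }
    }
  }
}
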


Pattern Analyzer


Segmentation is driven by a regular expression. The default pattern is \W+, i.e. the text is split on non-word characters. For example:

GET /_analyze
{
  "analyzer": "pattern",
  "text": "It`s a good day commander. Let`s do it for 2 times!"
}



For this input the token output is essentially the same as with the standard analyzer:

{
  "tokens" : [
    {
      "token" : "it",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "s",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "a",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "good",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "day",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "commander",
      "start_offset" : 16,
      "end_offset" : 25,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "let",
      "start_offset" : 27,
      "end_offset" : 30,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "s",
      "start_offset" : 31,
      "end_offset" : 32,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "do",
      "start_offset" : 33,
      "end_offset" : 35,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "it",
      "start_offset" : 36,
      "end_offset" : 38,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "for",
      "start_offset" : 39,
      "end_offset" : 42,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "2",
      "start_offset" : 43,
      "end_offset" : 44,
      "type" : "word",
      "position" : 11
    },
    {
      "token" : "times",
      "start_offset" : 45,
      "end_offset" : 50,
      "type" : "word",
      "position" : 12
    }
  ]
}
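
The pattern can be overridden when the analyzer is configured in index settings; the sketch below (illustrative names) splits on commas instead of the default \W+:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "comma_analyzer": {
          "type": "pattern",
          "pattern": ",",
          "lowercase": true
        }
      }
    }
  }
}
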



LanguageAnalyzer


ES also ships with language-specific analyzers, for example english:

GET /_analyze
{
  "analyzer": "english",
    "text": "It`s a good day commander. Let`s do it for 2 times!"
}



The output is as follows. Stop words are removed and the remaining tokens are stemmed (day → dai, commander → command, times → time):

{
  "tokens" : [
    {
      "token" : "s",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "good",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "dai",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "command",
      "start_offset" : 16,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "let",
      "start_offset" : 27,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "s",
      "start_offset" : 31,
      "end_offset" : 32,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "do",
      "start_offset" : 33,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "2",
      "start_offset" : 43,
      "end_offset" : 44,
      "type" : "<NUM>",
      "position" : 11
    },
    {
      "token" : "time",
      "start_offset" : 45,
      "end_offset" : 50,
      "type" : "<ALPHANUM>",
      "position" : 12
    }
  ]
}
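
A language analyzer is normally attached to a field in the index mapping; a minimal sketch (index and field names are examples):

PUT my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english"
      }
    }
  }
}
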


ICU-Analyzer


This analyzer is better suited to CJK (e.g. Chinese) word segmentation. The plugin has to be installed first:

[es@localhost bin]$ ./elasticsearch-plugin install analysis-icu

-> Installing analysis-icu
-> Downloading analysis-icu from elastic
[=================================================] 100%   
-> Installed analysis-icu


Then restart ES and test again:

GET /_analyze
{
  "analyzer": "icu_analyzer",
    "text": "It was a beautiful goal!"
}


The output is as follows (the original example used a Chinese sentence meaning "It was a beautiful goal!"; the tokens below are shown translated). Note that the word for "goal" is not kept as one word but split into two single-character tokens ("enter" and "ball"):

{
  "tokens" : [
    {
      "token" : "this",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "enter",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "ball",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "Really",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "pretty",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    }
  ]
}
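
Besides the icu_analyzer, the plugin also registers components such as an icu_tokenizer that can be combined with other filters; a small sketch (the sample text, meaning "Chinese word segmentation test", is made up):

POST _analyze
{
  "tokenizer": "icu_tokenizer",
  "text": "中文分词测试"
}
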


Configuring a custom Analyzer


A custom Analyzer is built by combining a Character Filter, a Tokenizer and Token Filters:
The built-in character filters include html_strip, mapping and pattern_replace, used for HTML tag removal, string replacement and regex-based replacement respectively
The built-in tokenizers include whitespace, standard, uax_url_email, pattern, keyword and path_hierarchy; you can also write a Java plugin to implement your own tokenizer
The built-in token filters include lowercase, stop and synonym

tokenizer+character_filter


An example of combining a tokenizer with a char_filter is shown below; the html_strip character filter removes the HTML tags before the keyword tokenizer runs:

POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "<b>aaa</b>"
}



The next request is also a tokenizer plus char_filter combination, but here a mapping character filter is defined inline to replace characters, for example:

POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": ["- => _"]
    }],
    "text": "1-2, d-4"
}


Regular expression replacement


A pattern_replace example is shown below; $1 refers to the content captured by the parentheses in the pattern, here www.baidu.com:

POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [{
    "type": "pattern_replace",
    "pattern": "http://(.*)",
    "replacement": "$1"
  }],
  "text": "http://www.baidu.com"
}



Path hierarchy tokenizer


The path hierarchy tokenizer is shown below. It treats the input /home/szc/a/b/c/e as a path and emits one token per directory level:

POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/home/szc/a/b/c/e"
}


The output is as follows

{
  "tokens" : [
    {
      "token" : "/home",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/home/szc",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/home/szc/a",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/home/szc/a/b",
      "start_offset" : 0,
      "end_offset" : 13,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/home/szc/a/b/c",
      "start_offset" : 0,
      "end_offset" : 15,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/home/szc/a/b/c/e",
      "start_offset" : 0,
      "end_offset" : 17,
      "type" : "word",
      "position" : 0
    }
  ]
}
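
The tokenizer also accepts options such as delimiter, replacement and reverse; for example, a sketch with reverse set to true keeps the trailing portions of the path instead of the leading ones (useful for domain-like or file-name-like hierarchies):

POST _analyze
{
  "tokenizer": {
    "type": "path_hierarchy",
    "reverse": true
  },
  "text": "/home/szc/a/b/c/e"
}
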


Token filter combination


Token filters can be chained; here the tokens are lowercased and the stop words removed in one pass:

POST _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase", "stop"],
  "text": "The boys in China are playing soccer!"
}



A complete example


A custom analyzer named my_analyzer combines a user-defined char_filter, tokenizer and filter.
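
The index settings that define my_analyzer are not reproduced here; judging from the output further down, they were probably close to the custom analyzer example in the Elasticsearch documentation. A rough sketch of what such a definition could look like (the component names, the emoticon mapping and the split pattern are assumptions inferred from the result):

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["emoticons"],
          "tokenizer": "punctuation",
          "filter": ["lowercase", "english_stop"]
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [":) => happy", ":( => sad"]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}
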


To use it, address the _analyze call to the index that defines the analyzer:

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "I`m a :) guy, and you ?"
}


The output is as follows. The emoticon is replaced first (:) becomes happy), the text is then split by the pattern tokenizer, and finally the stop words (a, and) are removed:

{
  "tokens" : [
    {
      "token" : "i`m",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "happy",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "guy",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "you",
      "start_offset" : 18,
      "end_offset" : 21,
      "type" : "word",
      "position" : 5
    }
  ]
}


That covers the analyzers available in ElasticSearch 7.
