Configuring text analyzers in Elasticsearch

By default, Elasticsearch uses the standard analyzer for all text analysis. It provides out-of-the-box support for most natural languages and use cases. If the standard analyzer works for you as is, no further configuration is required.

If the standard analyzer does not meet your needs, review and test Elasticsearch's other built-in analyzers. Built-in analyzers require no configuration, but some support options that adjust their behavior. For example, the standard analyzer can be configured with a custom list of stop words to remove.

If no built-in analyzer meets your needs, you can test and create a custom analyzer. Building a custom analyzer involves selecting and combining different analyzer components, giving you greater control over the analysis process.

Analyzer anatomy

An analyzer -- whether built-in or custom -- is just a package containing three lower-level building blocks: character filters, a tokenizer, and token filters.

The built-in analyzers pre-package these building blocks into analyzers suitable for different languages and types of text. Elasticsearch also exposes the individual building blocks so they can be combined to define new custom analyzers.

Character filters

A character filter receives the original text as a stream of characters and can transform the stream by adding, removing, or changing characters. For example, a character filter could be used to convert Hindu-Arabic numerals (٠١٢٣٤٥٦٧٨٩) into their Arabic-Latin equivalents (0123456789), or to strip HTML elements like <b> from the stream.

An analyzer may have zero or more character filters, which are applied in order.
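
As a quick sketch (using the analyze API described later in this article), you can watch a character filter work in isolation by pairing the built-in html_strip character filter with the keyword tokenizer, which emits the whole stream as a single token:

POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": [ "html_strip" ],
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}

The single token in the response is the text with the HTML tags removed and the &apos; entity decoded: I'm so happy!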

Tokenizer

A tokenizer receives the stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens. For example, the whitespace tokenizer breaks text into tokens whenever it sees any whitespace. It would convert the text "Quick brown fox!" into the terms [Quick, brown, fox!].

The tokenizer is also responsible for recording the order or position of each term and the start and end character offsets of the original word the term represents.

An analyzer must have exactly one tokenizer.
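
You can also run a tokenizer on its own through the analyze API (covered below). A minimal sketch reproducing the whitespace example above:

POST _analyze
{
  "tokenizer": "whitespace",
  "text": "Quick brown fox!"
}

This produces the three tokens [Quick, brown, fox!], each annotated with its position and character offsets.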

Token filters

Token filters receive the token stream and may add, remove, or change tokens. For example, the lowercase token filter converts all tokens to lowercase, the stop token filter removes common words (stop words) from the token stream, and the synonym token filter introduces synonyms into the token stream.

Token filters are not allowed to change the position or character offsets of each token.

An analyzer may have zero or more token filters, which are applied in order.
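
As a minimal sketch of token filters chained in order (the stop filter defaults to the English stop word list), the analyze API described below can run them directly:

POST _analyze
{
  "tokenizer": "standard",
  "filter": [ "lowercase", "stop" ],
  "text": "The QUICK brown foxes"
}

This should produce [quick, brown, foxes]: the tokens are lowercased first, then the stop word "the" is removed.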

Test an analyzer

The analyze API is a valuable tool for viewing the terms produced by an analyzer. A built-in analyzer can be specified inline in the request:

POST _analyze
{
  "analyzer": "whitespace",
  "text":     "The quick brown fox."
}

The API returns the following response:

{
  "tokens": [
    {
      "token": "The",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "quick",
      "start_offset": 4,
      "end_offset": 9,
      "type": "word",
      "position": 1
    },
    {
      "token": "brown",
      "start_offset": 10,
      "end_offset": 15,
      "type": "word",
      "position": 2
    },
    {
      "token": "fox.",
      "start_offset": 16,
      "end_offset": 20,
      "type": "word",
      "position": 3
    }
  ]
}

You can also test combinations of:

  • A tokenizer
  • Zero or more token filters
  • Zero or more character filters

POST _analyze
{
  "tokenizer": "standard",
  "filter":  [ "lowercase", "asciifolding" ],
  "text":      "Is this déja vu?"
}

The API returns the following response:

{
  "tokens": [
    {
      "token": "is",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "this",
      "start_offset": 3,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "deja",
      "start_offset": 8,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "vu",
      "start_offset": 13,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
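
The analyze API also accepts inline definitions of individual components, which is handy for one-off experiments. A minimal sketch with a made-up stop word list:

POST _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "stop",
      "stopwords": [ "is", "this" ]
    }
  ],
  "text": "Is this déja vu?"
}

Only [déja, vu] remain; note that without asciifolding the accented character is preserved.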

Positions and character offsets

As you can see from the output of the analyze API, analyzers not only convert text into terms, they also record the order or relative position of each term (used for phrase queries and word-proximity queries) and the start and end character offsets of each term in the original text (used for highlighting search snippets).
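
Positions are what make phrase matching possible. As a hypothetical sketch (assuming documents have been indexed into my-index-000001, which is set up in the next example), a match_phrase query only matches when the analyzed terms appear in consecutive positions:

GET my-index-000001/_search
{
  "query": {
    "match_phrase": {
      "my_text": "déjà vu"
    }
  }
}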

Alternatively, when running the analyze API against a specific index, you can reference a custom analyzer:

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_folded": { 
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_text": {
        "type": "text",
        "analyzer": "std_folded" 
      }
    }
  }
}

GET my-index-000001/_analyze 
{
  "analyzer": "std_folded", 
  "text":     "Is this déjà vu?"
}

GET my-index-000001/_analyze 
{
  "field": "my_text", 
  "text":  "Is this déjà vu?"
}

Return the following response:

{
  "tokens": [
    {
      "token": "is",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "this",
      "start_offset": 3,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "deja",
      "start_offset": 8,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "vu",
      "start_offset": 13,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
  1. Defines a custom analyzer named std_folded.
  2. The field my_text uses the std_folded analyzer.
  3. To refer to this analyzer, the analyze API must specify the index name.
  4. Refers to the analyzer by name.
  5. Refers to the analyzer used by the field my_text.

Configure built-in analyzers

Built-in analyzers can be used directly without any configuration. Some of them, however, support configuration options that alter their behavior. For example, the standard analyzer can be configured to support a list of stop words:

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_english": { 
          "type":      "standard",
          "stopwords": "_english_"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_text": {
        "type":     "text",
        "analyzer": "standard", 
        "fields": {
          "english": {
            "type":     "text",
            "analyzer": "std_english" 
          }
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "field": "my_text", 
  "text": "The old brown cow"
}

POST my-index-000001/_analyze
{
  "field": "my_text.english", 
  "text": "The old brown cow"
}
  • Defines the std_english analyzer, based on the standard analyzer but configured to remove the predefined list of English stop words.
  • The my_text field uses the standard analyzer directly, without any configuration. No stop words are removed from this field. The resulting terms are: [the, old, brown, cow]
  • The my_text.english field uses the std_english analyzer, so English stop words are removed. The resulting terms are: [old, brown, cow]

Create a custom analyzer

When the built-in analyzers do not meet your needs, you can create a custom analyzer that uses the appropriate combination of:

  • Zero or more character filters
  • A tokenizer
  • Zero or more token filters

Configuration

Custom analyzers accept the following parameters:

  • type: The analyzer type. Accepts built-in analyzer types. For a custom analyzer, use custom or omit this parameter.
  • tokenizer: A built-in or custom tokenizer. (Required)
  • char_filter: An optional array of built-in or custom character filters.
  • filter: An optional array of built-in or custom token filters.
  • position_increment_gap: When indexing an array of text values, Elasticsearch inserts a fake "gap" between the last term of one value and the first term of the next to ensure that phrase queries do not match two terms from different array elements. Defaults to 100. See position_increment_gap for more information.
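
As a minimal sketch of that gap (the index name, field, and document are made up, and position_increment_gap is shown here as a field mapping parameter with its default value of 100), a phrase query spanning two array values fails to match:

PUT my-index-000002
{
  "mappings": {
    "properties": {
      "names": {
        "type": "text",
        "position_increment_gap": 100
      }
    }
  }
}

PUT my-index-000002/_doc/1
{
  "names": [ "John Abraham", "Lincoln Smith" ]
}

GET my-index-000002/_search
{
  "query": {
    "match_phrase": {
      "names": "Abraham Lincoln"
    }
  }
}

The query returns no hits: "Abraham" sits at position 1 and "Lincoln" at position 102, so the two terms are never adjacent.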

Example configuration

Here is an example that combines the following:

Character filter

  • HTML Strip Character Filter

Tokenizer

  • Standard Tokenizer

Token filters

  • Lowercase Token Filter
  • ASCII-Folding Token Filter
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom", 
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Is this <b>déjà vu</b>?"
}
  • For a custom analyzer, set the type parameter to custom, or omit the parameter entirely.

The above example produces the following terms:

[ is, this, deja, vu ]

The previous example used tokenizers, token filters, and character filters with their default configurations, but it is possible to create configured versions of each and use them in a custom analyzer.

Here is a more complex example:

Character filter

  • Mapping Character Filter, configured to replace :) with _happy_ and :( with _sad_

Marker

  • Pattern Tokenizer, configured to split by punctuation characters

Token filters

  • Lowercase Token Filter
  • Stop Token Filter, configured to use the predefined English stop word list

Here is an example:

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": { 
          "char_filter": [
            "emoticons"
          ],
          "tokenizer": "punctuation",
          "filter": [
            "lowercase",
            "english_stop"
          ]
        }
      },
      "tokenizer": {
        "punctuation": { 
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": { 
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      },
      "filter": {
        "english_stop": { 
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you?"
}
  1. Assigns the index a default custom analyzer, my_custom_analyzer. This analyzer uses a custom tokenizer, character filter, and token filter, which are defined later in the request. This analyzer also omits the type parameter.
  2. Defines the custom punctuation tokenizer.
  3. Defines the custom emoticons character filter.
  4. Defines the custom english_stop token filter.

The above example produces the following terms:

[ i'm, _happy_, person, you ]

Specify an analyzer

Elasticsearch offers several ways to specify built-in or custom analyzers:

  • Per text field, per index, or per query
  • At index time or at search time

Keep it simple

The flexibility to specify analyzers at different levels and at different times is great, but use it only when you need it.

In most cases, a simple approach works best: specify an analyzer for each text field, as described in "Specify the analyzer for a field" below.

This approach works well with Elasticsearch's default behavior, letting you use the same analyzer for indexing and searching. It also lets you quickly see which analyzer applies to which field using the get mapping API.

If you do not typically create mappings for your indices, you can use index templates to achieve a similar effect.
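
A hypothetical sketch of such an index template (the template name and index pattern are made up):

PUT _index_template/my-text-template
{
  "index_patterns": [ "my-index-*" ],
  "template": {
    "mappings": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "whitespace"
        }
      }
    }
  }
}

Any new index whose name matches my-index-* picks up this mapping, and with it the whitespace analyzer for the title field.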

How Elasticsearch determines the index analyzer

Elasticsearch determines which index analyzer to use by checking the following parameters in order:

  1. The analyzer mapping parameter for the field. See "Specify the analyzer for a field" below.
  2. The analysis.analyzer.default index setting. See "Specify the default analyzer for an index" below.

If neither parameter is specified, the standard analyzer is used.
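
As a minimal sketch combining both levels (the body field is made up for illustration), title uses its own whitespace analyzer per rule 1, while body has no analyzer of its own and falls back to the index default, simple, per rule 2:

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "simple"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "whitespace"
      },
      "body": {
        "type": "text"
      }
    }
  }
}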

Specify the analyzer for a field

When mapping an index, you can use the analyzer mapping parameter to specify an analyzer for each text field.

The following create index API request sets the whitespace analyzer as the analyzer for the title field.

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "whitespace"
      }
    }
  }
}
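
To double-check which analyzer a field resolves to, you can pass the field name to the analyze API, as shown earlier in this article:

POST my-index-000001/_analyze
{
  "field": "title",
  "text": "The quick brown fox."
}

Because title is mapped to the whitespace analyzer, this should return the terms [The, quick, brown, fox.].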

Specify the default analyzer for an index

In addition to field-level analyzers, you can set a fallback analyzer using the analysis.analyzer.default setting.

The following create index API request sets the simple analyzer as the fallback analyzer for the index my-index-000001.

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "simple"
        }
      }
    }
  }
}

How Elasticsearch determines the search analyzer

Warning: in most cases, specifying a different search analyzer is unnecessary. Doing so could negatively impact relevance and lead to unexpected search results.

If you choose to specify a separate search analyzer, we recommend that you thoroughly test your analysis configuration before deploying it to production.

At search time, Elasticsearch determines which analyzer to use by checking the following parameters in order:

  1. The analyzer parameter in the search query. See "Specify the search analyzer for a query" below.
  2. The search_analyzer mapping parameter for the field. See "Specify the search analyzer for a field" below.
  3. The analysis.analyzer.default_search index setting. See "Specify the default search analyzer for an index" below.
  4. The analyzer mapping parameter for the field. See "Specify the analyzer for a field" above.

If none of these parameters are specified, the standard analyzer is used.

Specify the search analyzer for a query

When writing a full-text query, you can use the analyzer parameter to specify a search analyzer. If provided, it overrides any other search analyzer.

The following search API request sets the stop analyzer as the search analyzer for a match query.

GET my-index-000001/_search
{
  "query": {
    "match": {
      "message": {
        "query": "Quick foxes",
        "analyzer": "stop"
      }
    }
  }
}

Specify the search analyzer for a field

When mapping an index, you can use the search_analyzer mapping parameter to specify a search analyzer for each text field.

If a search analyzer is provided, you must also specify an index analyzer using the analyzer parameter.

The following create index API request sets the simple analyzer as the search analyzer for the title field.

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "whitespace",
        "search_analyzer": "simple"
      }
    }
  }
}

Specify the default search analyzer for an index

When creating an index, you can set a default search analyzer using the analysis.analyzer.default_search setting.

If a search analyzer is provided, a default index analyzer must also be specified using the analysis.analyzer.default setting.

The following create index API request sets the whitespace analyzer as the default search analyzer for the my-index-000001 index.

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "simple"
        },
        "default_search": {
          "type": "whitespace"
        }
      }
    }
  }
}
