Description of key attributes and fields in elasticsearch

Premise: this article is based on Elasticsearch 6.4.2. There may be slight differences in other versions.

Document, index, type

Attribute    Explanation
document     The data to be stored, such as employee data. One employee record can be represented as one document.
index        The act of storing a document in Elasticsearch is called indexing. An index is similar to a database in a traditional relational database: it is the place where related documents are stored. One Elasticsearch cluster can contain multiple indexes.
type         The specific type of the stored documents; it can be understood as a single table in a relational database. Note: in 6.x there can be only one type per index; removal of types starts in 7.x, and they are removed completely in 8.x.

Note: for why types are being removed, see footnote 1 at the end of this article.

Inverted index

Attribute         Explanation
Inverted index    The inverted index consists of a list of all the unique words that appear in the documents. For each word, there is a list of the documents that contain it.
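
As a rough illustration of the structure described above, the following Python sketch builds a tiny inverted index in memory. The sample documents and the whitespace tokenizer are assumptions made for this example; the analyzers in Elasticsearch do considerably more work than this.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each distinct lowercase word to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick brown dogs",
}

inverted = build_inverted_index(docs)
print(inverted["quick"])  # [1, 3]
print(inverted["the"])    # [1, 2]
```

Looking up a word is then a single dictionary access, which is why this structure suits full-text search.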

Query return field interpretation

{
  "took": 1,  // Time in milliseconds that the entire search request took
  "timed_out": false, // Whether the search timed out
  "_shards": { // Number of shards searched, and how many succeeded, were skipped or failed
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": { // The actual search results
    "total": 1, // Total number of matching documents
    "max_score": 0.83355963, // Highest _score among the matching documents
    "hits": [ // The matching documents, sorted by _score in descending order
      {
        "_index": "test", // Index
        "_type": "test", // Type
        "_id": "xtoffH0BS_BXL2CbFNQ_", // Document id
        "_score": 0.83355963, // How well the document matches the query
        "_source": { // The original document data
          "name": "Sina military 22"
        }
      }
    ]
  }
}
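
To make the field descriptions concrete, here is a small Python sketch that reads the commonly used fields out of such a response. The response is the sample above written as a Python dict; the helper name summarize is hypothetical, and the code assumes the 6.x format, in which hits.total is a plain integer.

```python
# Sample response from the example above (6.x format).
response = {
    "took": 1,
    "timed_out": False,
    "_shards": {"total": 5, "successful": 5, "skipped": 0, "failed": 0},
    "hits": {
        "total": 1,
        "max_score": 0.83355963,
        "hits": [
            {
                "_index": "test",
                "_type": "test",
                "_id": "xtoffH0BS_BXL2CbFNQ_",
                "_score": 0.83355963,
                "_source": {"name": "Sina military 22"},
            }
        ],
    },
}


def summarize(resp):
    """Pull the commonly used fields out of a search response (hypothetical helper)."""
    hits = resp["hits"]
    return {
        "took_ms": resp["took"],
        # Complete means: no timeout and no failed shards.
        "complete": not resp["timed_out"] and resp["_shards"]["failed"] == 0,
        "total": hits["total"],
        "documents": [(hit["_id"], hit["_score"], hit["_source"]) for hit in hits["hits"]],
    }


summary = summarize(response)
print(summary["total"], summary["documents"][0][2]["name"])  # 1 Sina military 22
```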

New document

# Automatically generate a unique _id:
POST /test/test
{
 "name":"Who am I? Who am I" 
}

# Return result:
{
  "_index": "test",
  "_type": "test",
  "_id": "gLKe4X4BLI5cjpF7njkO", //Automatically generated id
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "_seq_no": 1,
  "_primary_term": 1
}

# Set id manually
PUT /test/test/465?op_type=create
{
 "name":"Who am I? Who am I" 
}
or
PUT /test/test/465/_create
{
 "name":"Who am I? Who am I" 
}
If the request to create the new document succeeds, Elasticsearch returns:
{
  "_index": "test",
  "_type": "test",
  "_id": "465",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}
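On the other hand, if a document with the same _index, _type and _id already exists, Elasticsearch returns a 409 Conflict response code and an error message like the following. (This sample comes from an older release and uses the index website; 6.x reports the error type as version_conflict_engine_exception.)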
{
   "error": {
      "root_cause": [
         {
            "type": "document_already_exists_exception",
            "reason": "[blog][123]: document already exists",
            "shard": "0",
            "index": "website"
         }
      ],
      "type": "document_already_exists_exception",
      "reason": "[blog][123]: document already exists",
      "shard": "0",
      "index": "website"
   },
   "status": 409
}
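
The create-only behaviour can be pictured with a small in-memory model. This is only a sketch of the semantics, not how Elasticsearch is implemented: the store dict, the create function and the ConflictError class are all hypothetical names introduced for the example.

```python
class ConflictError(Exception):
    """Stands in for the HTTP 409 Conflict response."""
    status = 409


def create(store, index, doc_type, doc_id, source):
    """Store the document only if (index, type, id) is not already taken."""
    key = (index, doc_type, doc_id)
    if key in store:
        raise ConflictError(f"[{doc_type}][{doc_id}]: document already exists")
    store[key] = {"_version": 1, "_source": dict(source)}
    return {"_id": doc_id, "_version": 1, "result": "created"}


store = {}
first = create(store, "test", "test", "465", {"name": "Who am I? Who am I"})

try:
    create(store, "test", "test", "465", {"name": "another body"})
    second_status = 201  # would mean the second create succeeded
except ConflictError as err:
    second_status = err.status

print(first["result"], second_status)  # created 409
```

The existing document is left untouched by the rejected request, which is the point of using op_type=create or the _create endpoint.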

Update document

Important: in Elasticsearch, documents are immutable; they cannot be modified in place. To update an existing document, you reindex or replace it. The actual update process is as follows:

  1. Retrieve the JSON from the old document
  2. Change the JSON
  3. Delete the old document
  4. Index a new document

Internally, Elasticsearch marks the old document as deleted and adds an entirely new document. Although you can no longer access the old version of the document, it does not disappear immediately. As you continue to index more data, Elasticsearch cleans up these deleted documents in the background.

# Update the entire document. Any existing field that is not in the request body is discarded
PUT /test/test/465
{
  "title": "My first blog entry",
  "text":  "I am starting to get the hang of this...",
  "date":  "2014/01/02"
}
# Return results
{
  "_index": "test",
  "_type": "test",
  "_id": "465",
  "_version": 3,
  "result": "updated",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "_seq_no": 2,
  "_primary_term": 1
}
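
The replace semantics of a full update can be modelled in a few lines of Python. This is a simplified sketch under stated assumptions: the in-memory store and the index_document function are hypothetical, and the real version counter also advances on other operations, which is why the response above shows version 3 rather than 2.

```python
def index_document(store, doc_id, source):
    """Full replacement: the old _source is dropped and the version is incremented."""
    old = store.get(doc_id)
    version = old["_version"] + 1 if old else 1
    store[doc_id] = {"_version": version, "_source": dict(source)}
    return {"_id": doc_id, "_version": version, "result": "updated" if old else "created"}


store = {}
index_document(store, "465", {"name": "Who am I? Who am I"})
result = index_document(store, "465", {
    "title": "My first blog entry",
    "text": "I am starting to get the hang of this...",
    "date": "2014/01/02",
})

print(result["result"], result["_version"])  # updated 2
print("name" in store["465"]["_source"])     # False
```

The name field is gone after the second call because the new body replaced the old document as a whole.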

# Update some fields of the document
POST /test/test/465/_update
{
   "doc" : {
      "tags" : [ "testing" ],
      "title": "make progress every day"
   }
}
# Return results
{
  "_index": "test",
  "_type": "test",
  "_id": "465",
  "_version": 4,
  "result": "updated",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "_seq_no": 3,
  "_primary_term": 1
}
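
A partial update merges the fields under doc into the existing _source. The following Python sketch shows that merge on the example document; merge_doc is a hypothetical helper, and it assumes that nested objects are merged recursively while all other values (including arrays) are replaced.

```python
def merge_doc(source, partial):
    """Return a copy of source with partial merged in.

    Nested dicts are merged recursively; any other value is replaced.
    """
    merged = dict(source)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_doc(merged[key], value)
        else:
            merged[key] = value
    return merged


current = {
    "title": "My first blog entry",
    "text": "I am starting to get the hang of this...",
    "date": "2014/01/02",
}
updated = merge_doc(current, {"tags": ["testing"], "title": "make progress every day"})

print(updated["title"], updated["tags"])  # make progress every day ['testing']
```

Unlike a full PUT, fields that are not mentioned in doc (text and date here) are kept.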

Delete document

# Delete a document by id
DELETE /test/test/uZOYM34BLI5cjpF7dwsz
# Return results
{
  "_index": "test",
  "_type": "test",
  "_id": "uZOYM34BLI5cjpF7dwsz",
  "_version": 6,
  "result": "deleted",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "_seq_no": 8,
  "_primary_term": 1
}
# Delete documents with specified conditions
POST test/test/_delete_by_query
{
  "query": {
    "match": {
      "name": "Sina military 22"
    }
  }
}
# Return results
{
  "took": 477,
  "timed_out": false,
  "total": 4,
  "deleted": 4,
  "batches": 1,
  "version_conflicts": 0,
  "noops": 0,
  "retries": {
    "bulk": 0,
    "search": 0
  },
  "throttled_millis": 0,
  "requests_per_second": -1,
  "throttled_until_millis": 0,
  "failures": []
}


Query documents

Query documents by id

# Query by id
GET /test/test/465?pretty  // The pretty parameter only formats the response so that it is easier to read; it has no other effect
# Return result:
{
  "_index": "test",
  "_type": "test",
  "_id": "465",
  "_version": 4,
  "found": true,
  "_source": {
    "title": "make progress every day",
    "text": "I am starting to get the hang of this...",
    "date": "2014/01/02",
    "tags": [
      "testing"
    ]
  }
}

# Query by id and return only the specified fields
GET /test/test/465?_source=title,text
# Return results
{
  "_index": "test",
  "_type": "test",
  "_id": "465",
  "_version": 4,
  "found": true,
  "_source": {
    "text": "I am starting to get the hang of this...",
    "title": "make progress every day"
  }
}
# Retrieve multiple documents (possibly from several indexes) in one request
The mget API takes a docs array as a parameter. Each element contains the metadata of a document to retrieve: _index, _type and _id. To retrieve only one or more specific fields, specify their names with the _source parameter.
GET /_mget
{
  "docs" : [
    {
      "_index":"test",
      "_type":"test",
      "_id":"gLKe4X4BLI5cjpF7njkO"
    },
    {
      "_index":"test01",
      "_type":"test01",
      "_id":123
    }
    ]
}
If all the documents you want to retrieve are in the same _index (or even of the same _type), you can specify a default /_index or /_index/_type in the URL.

You can still override these values in individual entries of the docs array:
GET /website/blog/_mget
{
   "docs" : [
      { "_id" : 2 },
      { "_type" : "pageviews", "_id" :   1 }
   ]
}
If all the documents have the same _index and _type, you can pass just an ids array instead of the full docs array:
GET /test/test/_mget
{
   "ids" : [ "gLKe4X4BLI5cjpF7njkO", "3ZNyM34BLI5cjpF7LQQ3" ]  
}
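
The three request shapes above follow one rule: leave out whatever the URL already supplies. A Python sketch of that rule is shown below; build_mget_body is a hypothetical helper written for this article, not part of any Elasticsearch client.

```python
def build_mget_body(refs, default_index=None, default_type=None):
    """Build an mget request body from (index, type, id) tuples.

    Values equal to the URL defaults are omitted; if every entry is reduced
    to just an _id, the body collapses to the shorter ids form.
    """
    docs = []
    for index, doc_type, doc_id in refs:
        entry = {"_id": doc_id}
        if index != default_index:
            entry["_index"] = index
        if doc_type != default_type:
            entry["_type"] = doc_type
        docs.append(entry)
    if all(set(entry) == {"_id"} for entry in docs):
        return {"ids": [entry["_id"] for entry in docs]}
    return {"docs": docs}


# Body for GET /test/test/_mget
same = build_mget_body(
    [("test", "test", "gLKe4X4BLI5cjpF7njkO"), ("test", "test", "3ZNyM34BLI5cjpF7LQQ3")],
    default_index="test",
    default_type="test",
)
# Body for GET /website/blog/_mget, with one entry overriding the type
mixed = build_mget_body(
    [("website", "blog", 2), ("website", "pageviews", 1)],
    default_index="website",
    default_type="blog",
)

print(same)   # {'ids': ['gLKe4X4BLI5cjpF7njkO', '3ZNyM34BLI5cjpF7LQQ3']}
print(mixed)  # {'docs': [{'_id': 2}, {'_id': 1, '_type': 'pageviews'}]}
```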

Query all documents

GET /test/test/_search

Condition query

GET /test/test/_search
{
    "query" : {
        "match" : { 
            "name" : "Who am I"
        }
    }
}
# Remarks:
#   match: analyzes (tokenizes) the query string before matching
#   match_phrase: matches an exact sequence of words (a phrase)
#   match_all: matches all documents; equivalent to no filtering
#   match_phrase_prefix: leftmost-prefix phrase query (the last term is treated as a prefix)
#   multi_match: multi-field query, for example:
"multi_match": {
    "query": "make progress every day",
    "fields": ["title", "text"]
}

Highlight query

GET /test/test/_search
{
    "query" : {
        "match_phrase" : {
            "name" : "Who am I"
        }
    },
    "highlight": {
        "fields" : {
            "name" : {}
        }
    }
}

Paging query

GET test/_search
{
  "from": 0, // Number of initial results to skip. The default is 0
  "size": 2 // Number of results to return. The default is 10
}
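
Applications usually think in page numbers rather than offsets. A minimal Python sketch of the conversion follows; page_params is a hypothetical helper, and it assumes pages are numbered from 1.

```python
def page_params(page, per_page=10):
    """Translate a 1-based page number into the from/size search parameters."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be at least 1")
    return {"from": (page - 1) * per_page, "size": per_page}


print(page_params(1, per_page=2))  # {'from': 0, 'size': 2}
print(page_params(3, per_page=2))  # {'from': 4, 'size': 2}
```

The returned dict can be merged into the search request body alongside the query.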

Query function

match_all query

match_all query simply matches all documents. When no query method is specified, it is the default query:

{ "match_all": {}}

It is often used in combination with a filter, for example to retrieve all messages in your inbox. All documents are considered equally relevant, so they all receive a neutral _score of 1.

match query

If you use match query on a full-text field, it will analyze the query string with the correct analyzer before executing the query:

{ "match": { "tweet": "About Search" }}

If you use it on a field that holds an exact value, such as a number, a date, a Boolean, or a not_analyzed string field (a keyword field in 5.x and later), it matches the given value exactly:

{ "match": { "age":    26           }}
{ "match": { "date":   "2014-09-01" }}
{ "match": { "public": true         }}
{ "match": { "tag":    "full_text"  }}

For exact-value searches, you will probably want to use a filter clause instead of a query, because filters can be cached.

multi_match query

The multi_match query runs the same match query on multiple fields:
{
    "multi_match": {
        "query":    "full text search",
        "fields":   [ "title", "body" ]
    }
}

range query

The range query finds numbers or dates that fall within the specified range:

{
    "range": {
        "age": {
            "gte":  20,
            "lt":   30
        }
    }
}

The allowed operators are as follows:

  1. gt: greater than
  2. gte: greater than or equal to
  3. lt: less than
  4. lte: less than or equal to
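
The meaning of the four operators can be checked with a short Python sketch that applies a range clause to plain values. The in_range helper and the sample ages are assumptions made for the example.

```python
import operator

# Each range operator mapped to the Python comparison it corresponds to.
OPS = {
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}


def in_range(value, bounds):
    """Return True if value satisfies every bound, e.g. {"gte": 20, "lt": 30}."""
    return all(OPS[name](value, bound) for name, bound in bounds.items())


ages = [19, 20, 25, 30, 31]
matching = [age for age in ages if in_range(age, {"gte": 20, "lt": 30})]
print(matching)  # [20, 25]
```

Note that 20 is included (gte) while 30 is excluded (lt), matching the range query shown earlier in this section.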

term query

The term query matches exact values, which may be numbers, dates, Booleans, or not_analyzed strings (keyword fields in 5.x and later):

{ "term": { "age":    26           }}
{ "term": { "date":   "2014-09-01" }}
{ "term": { "public": true         }}
{ "term": { "tag":    "full_text"  }}

The term query does not analyze the input text, so it searches for exactly the value given.

terms query

The terms query is the same as the term query, but it lets you specify multiple values to match. If the field contains any of the specified values, the document matches:

{ "terms": { "tag": [ "search", "full_text", "nosql" ] }}

Like term query, terms query does not analyze the input text. It queries for exactly matched values (including differences in case, accent, space, etc.).

Note: using term or terms on a string field that is analyzed can give unexpected results, because string fields are analyzed (split into tokens) by default. For example:

GET test/test/_search
{
  "query": {
    "term": {
      "name": "Please note that"
    }
  }
}
# Query results
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

GET test/test/_search
{
  "query": {
    "term": {
      "name": "please"
    }
  }
}
# Query results
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.80259144,
    "hits": [
      {
        "_index": "test",
        "_type": "test",
        "_id": "3ZNyM34BLI5cjpF7LQQ3",
        "_score": 0.80259144,
        "_source": {
          "name": "Please note that",
          "age": "456",
          "url": "Second update"
        }
      }
    ]
  }
}

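One might expect the first query to return the document, but it does not: the name field is analyzed by default, so the index holds the individual tokens (please, note, that) rather than the whole string, and the unanalyzed term "Please note that" matches none of them. If the field uses the default dynamic mapping, querying the name.keyword sub-field matches the exact original value.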
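
The behaviour of the two queries above can be reproduced with a rough Python model. The analyze function is only a stand-in for the default analyzer (it lowercases the text and splits it into word tokens), and term_matches and match_matches are hypothetical names for this sketch.

```python
import re


def analyze(text):
    """Rough stand-in for the default analyzer: lowercase, then split into word tokens."""
    return re.findall(r"\w+", text.lower())


# Tokens stored in the index for the document {"name": "Please note that"}.
indexed_tokens = set(analyze("Please note that"))


def term_matches(value):
    """term query: the input is NOT analyzed and is compared as-is with the indexed tokens."""
    return value in indexed_tokens


def match_matches(text):
    """match query: the input is analyzed first; any shared token counts as a match."""
    return any(token in indexed_tokens for token in analyze(text))


print(term_matches("Please note that"))   # False: no single token equals the whole string
print(term_matches("please"))             # True
print(term_matches("Please"))             # False: the indexed token is lowercase
print(match_matches("Please note that"))  # True
```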

exists

The exists query finds documents that have a value in the specified field. It is essentially similar to IS NOT NULL in SQL:

{
    "exists":   {
        "field":    "title"
    }
}

bool

The bool query combines multiple queries into the Boolean logic you need. The must clause is similar to AND in SQL, must_not to NOT, and should to OR:

{
    "bool": {
        "must":     { "match": { "title": "how to make millions" }},
        "must_not": { "match": { "tag":   "spam" }},
        "should": [
            { "match": { "tag": "starred" }},
            { "range": { "date": { "gte": "2014-01-01" }}}
        ]
    }
}
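
How the clauses interact can be sketched with a toy evaluator in Python. Everything here is an assumption for illustration: clauses are modelled as plain predicates, the score is simply the number of matching scoring clauses, and real relevance scoring is far more involved.

```python
def evaluate_bool(doc, must=(), filters=(), must_not=(), should=()):
    """Return a toy score for doc, or None if the bool query does not match it."""
    required = list(must) + list(filters)
    if not all(clause(doc) for clause in required):
        return None
    if any(clause(doc) for clause in must_not):
        return None
    should_hits = sum(1 for clause in should if clause(doc))
    # With no must/filter clause, at least one should clause has to match.
    if not required and should and should_hits == 0:
        return None
    # must and should clauses add to the score; filters do not.
    return len(must) + should_hits


# Hypothetical documents.
docs = [
    {"title": "how to make millions", "tag": "starred", "date": "2014-02-01"},
    {"title": "how to make millions", "tag": "spam", "date": "2014-02-01"},
    {"title": "how to make millions", "tag": "misc", "date": "2013-05-01"},
    {"title": "gardening tips", "tag": "starred", "date": "2014-02-01"},
]

must = [lambda d: "millions" in d["title"]]
must_not = [lambda d: d["tag"] == "spam"]
should = [lambda d: d["tag"] == "starred", lambda d: d["date"] >= "2014-01-01"]

scores = [evaluate_bool(d, must=must, must_not=must_not, should=should) for d in docs]
print(scores)  # [3, None, 1, None]
```

The third document still matches even though no should clause applies to it, because should is optional once a must clause is present; it just ranks lower.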

should

When a bool query has only should clauses, a document matches if it satisfies any of them. This is similar to OR in SQL

{
    "bool": {
        "should": [
            { "match": { "tag": "starred" }},
            { "range": { "date": { "gte": "2014-01-01" }}}
        ]
    }
}

filter

Clauses in filter must match, but they run in non-scoring, filtering mode. They do not contribute to the score; they only include or exclude documents according to the filter criteria.

"filter": {
          "range": { "date": { "gte": "2014-01-01" }} 
        }

constant_score

It applies a constant score to all matching documents. It is often used when you only need to run a filter and no other (scoring) queries.
You can use it in place of a bool query that has only a filter clause. Performance is the same, but the query is simpler and clearer.

{
    "constant_score":   {
        "filter": {
            "term": { "category": "ebooks" } 
        }
    }
}

A term query placed inside constant_score is converted to a non-scoring filter. This form can be used in place of a bool query that has only a filter clause.
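
When request bodies are built in code, the two interchangeable forms can be produced by small helpers. This is a sketch only: constant_score and bool_filter are hypothetical function names that assemble plain dicts, and nothing here talks to a cluster.

```python
import json


def constant_score(filter_clause):
    """Wrap a filter clause in a constant_score query."""
    return {"constant_score": {"filter": filter_clause}}


def bool_filter(filter_clause):
    """The longer form: a bool query that has only a filter clause."""
    return {"bool": {"filter": filter_clause}}


term = {"term": {"category": "ebooks"}}
short_body = {"query": constant_score(term)}
long_body = {"query": bool_filter(term)}

print(json.dumps(short_body, indent=2))
print(json.dumps(long_body, indent=2))
```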

Validate query

Queries can become very complex, especially when combined with different analyzers and different field mappings. The validate query API can be used to check whether a query is valid.

GET test/test/_validate/query
{
  "query": {
    "trod": {
      "filter": {
        "term": {
          "age": "456"
        } 
      }
    } 
  }
}

The response to the validate request above tells us that this query is invalid:

{
  "valid": false
}

Understanding error messages

To find out why the query is invalid, add the explain parameter to the query string:

GET /gb/tweet/_validate/query?explain 
{
   "query": {
      "tweet" : {
         "match" : "really powerful"
      }
   }
}

The explain parameter provides more information about why a query is invalid.
Evidently, we mixed up the query type (match) and the field name (tweet):

{
  "valid" :     false,
  "_shards" :   { ... },
  "explanations" : [ {
    "index" :   "gb",
    "valid" :   false,
    "error" :   "org.elasticsearch.index.query.QueryParsingException:
                 [gb] No query registered for [tweet]"
  } ]
}

Understanding query statements

For valid queries, the explain parameter returns a human-readable description of the query, which is very useful for understanding exactly how Elasticsearch interprets it:

GET /_validate/query?explain
{
   "query": {
      "match" : {
         "tweet" : "really powerful"
      }
   }
}

An explanation is returned for each index we query, because each index can have its own mappings and analyzers:

{
  "valid" :         true,
  "_shards" :       { ... },
  "explanations" : [ {
    "index" :       "us",
    "valid" :       true,
    "explanation" : "tweet:really tweet:powerful"
  }, {
    "index" :       "gb",
    "valid" :       true,
    "explanation" : "tweet:realli tweet:power"
  } ]
}

The explanation shows that the match query for the string really powerful is rewritten into two single-term queries on the tweet field, one for each term produced from the query string.
For the index us, the two terms are really and powerful, while for the index gb they are realli and power. The reason is that the tweet field in the index gb was changed to use the english analyzer.

Sort query

Analyzed fields (string fields are analyzed by default) cannot be used directly for sorting; sort on the field's keyword sub-field instead.

GET test/test/_search
{
   "sort": [
    {"age": {"order": "desc"}},
    { "_score": { "order": "desc" }},
    {"name.keyword": {"order": "desc"}} //The name field is of string type, so sort on its keyword sub-field
  ]
}
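
The entries of the sort array are applied in order: the first entry decides the ordering, and each later entry only breaks ties left by the ones before it. The Python sketch below mimics that with a tuple sort key; the sample documents are hypothetical.

```python
# Hypothetical hits, reduced to the three values used for sorting.
docs = [
    {"name": "alice", "age": 30, "_score": 1.0},
    {"name": "bob", "age": 30, "_score": 2.0},
    {"name": "carol", "age": 25, "_score": 3.0},
    {"name": "dave", "age": 30, "_score": 2.0},
]

# age desc, then _score desc, then name desc. Because every key is descending,
# a single reversed sort on the tuple is enough.
ordered = sorted(docs, key=lambda d: (d["age"], d["_score"], d["name"]), reverse=True)

print([d["name"] for d in ordered])  # ['dave', 'bob', 'alice', 'carol']
```

carol has the highest _score but comes last, because age is compared first.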

  1. In an Elasticsearch index, fields that have the same name in different types are backed by the same Lucene field internally, so they must share the same mapping and data type.
    For example, suppose an index has two types, user and person, and both have a "deleted" field.
    You cannot make "deleted" hold one kind of value in one type and a Boolean in the other. In addition, storing entities that have few or no fields in common in the same index leads to sparse data, which interferes with Lucene's ability to compress documents efficiently, takes more space and reduces the efficiency of the index.

Keywords: ElasticSearch search engine lucene

Added by jon2396 on Tue, 15 Feb 2022 12:10:28 +0200