Introduction to Elasticsearch Query DSL compound queries (bool, boosting, constant_score, dis_max)

Compound queries

bool (Boolean Query)

We often want to search on several conditions at once. For example, when searching for a movie on Douban, we may restrict the query by genre, year, and Douban rating. This is a scenario in which multiple search conditions must be matched together. Elasticsearch (ES) has a query for this called the bool query. It combines multiple query clauses and returns the combined results, and it supports nested clauses. The clauses it supports fall into four types:

must: the clause must match, and it contributes to the score
should: the clause may match, and when it does it also contributes to the score
must_not: the clause must not match; it does not affect the score
filter: the clause must match, but it does not affect the score

The scores of the must and should clauses are added together to produce the final score. A bool query can also set minimum_should_match to specify how many of the should clauses a document must match before it is returned. Suppose we are searching for a movie with four should conditions: a rating above 9, the art-film genre, a 2021 release, and Zhang Guorong in the cast. If minimum_should_match is not specified and the query has no must or filter clause, a movie that meets any one of the four conditions can be returned; with minimum_should_match = 3, at least three of the four must be met. The following example of a bool query comes from the official website.

POST _search
{
  "query": {
    "bool" : {
      "must" : {
        "term" : { "user.id" : "kimchy" }
      },
      "filter": {
        "term" : { "tags" : "production" }
      },
      "must_not" : {
        "range" : {
          "age" : { "gte" : 10, "lte" : 20 }
        }
      },
      "should" : [    # Array of two term queries. If the bool query had no must or filter clause, at least one of them would have to match by default
        { "term" : { "tags" : "env1" } },
        { "term" : { "tags" : "deployed" } }
      ],
      "minimum_should_match" : 1,
      "boost" : 1.0
    }
  }
}
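
To make the minimum_should_match rule concrete, here is a minimal Python sketch. It is a toy model, not how Elasticsearch evaluates queries internally: the movie document, its field names, and the helper passes_should are all made up for illustration, and the code only counts how many should conditions a document satisfies.

```python
# Toy model of minimum_should_match; the movie fields below are hypothetical.
def passes_should(doc, should_clauses, minimum_should_match):
    """Return True if at least `minimum_should_match` clauses accept `doc`."""
    matched = sum(1 for clause in should_clauses if clause(doc))
    return matched >= minimum_should_match

# The four should conditions from the movie example in the text.
should_clauses = [
    lambda d: d["score"] > 9,
    lambda d: d["genre"] == "art film",
    lambda d: d["year"] == 2021,
    lambda d: "Zhang Guorong" in d["actors"],
]

# This movie satisfies three conditions (rating, genre, actor) but not the year.
movie = {"score": 9.5, "genre": "art film", "year": 1993, "actors": ["Zhang Guorong"]}

for required in (1, 3, 4):
    print(required, passes_should(movie, should_clauses, required))
```

With this made-up document, requiring 1 or 3 matches accepts the movie, while requiring all 4 rejects it.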

boosting (Boosting query)

Suppose we have an index named baike containing three documents. The first two describe Apple's electronic products, and the last one is about apples, the fruit. If we search with the following query, we get both the Apple products and the fruit.

POST /baike/_search
{
  "query": {
    "bool": {
      "must": {
        "match":{"title":"Apple"}
      }
    }
  }
}

"hits" : {
    "total" : {
        "value" : 3,
        "relation" : "eq"
    },
    "max_score" : 0.1546153,
    "hits" : [
        {
            "_index" : "baike",
            "_type" : "_doc",
            "_id" : "1",
            "_score" : 0.1546153,
            "_source" : {
                "title" : "Apple Pad"
            }
        },
        {
            "_index" : "baike",
            "_type" : "_doc",
            "_id" : "2",
            "_score" : 0.1546153,
            "_source" : {
                "title" : "Apple Mac"
            }
        },
        {
            "_index" : "baike",
            "_type" : "_doc",
            "_id" : "3",
            "_score" : 0.1546153,
            "_source" : {
                "title" : "Apple Pie and Apple Fruit"
            }
        }
    ]
}

But perhaps our users care about electronic products rather than fruit, so how can we match more precisely? One option is to modify the previous query with must_not so that it excludes documents with pie in the title, leaving only Apple Pad and Apple Mac in the results. But this is a little too absolute: although many people really are looking for Apple products, some always want to read about which apples taste good. Is there a compromise?

POST /baike/_search
{
  "query": {
    "bool": {
      "must": {
        "match":{"title":"Apple"}
      },
      "must_not": {
        "match":{"title":"Pie"}
      }
    }
  }
}

Elasticsearch offers a compromise in the boosting query (to boost means to raise or promote; here it refers to adjusting _score). It returns both the Apple products that most users care about and the apples you eat, while letting the Apple products appear at the top of the results.

Its parameters are:

positive: the query that documents must match; these are the documents we care about most and want to see near the top
negative: the query that identifies documents we care less about but still want returned; a document that also matches it has its score reduced
negative_boost: a float between 0.0 and 1.0 by which the score of documents matching negative is multiplied

POST /baike/_search
{
  "query": {
    "boosting": {
      "positive": {
        "match": {
          "title": "Apple"
        }
      },
      "negative": {
        "match": {
          "title": "fruit"
        }
      },
      "negative_boost": 0.5   # Documents that also match the negative query (title contains fruit) have their score multiplied by 0.5, so they sink toward the end of the results
    }
  }
}
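
The effect of negative_boost can be sketched in a few lines of Python. This is a toy model rather than Elasticsearch internals: the helper boosting_score is hypothetical, and it assumes the positive query scores every document at 0.1546153, as in the earlier match results.

```python
# Toy model of the boosting query's score adjustment (not Elasticsearch internals).
def boosting_score(positive_score, matches_negative, negative_boost):
    """Scale the positive score down only when the negative query also matches."""
    if not 0.0 <= negative_boost <= 1.0:
        raise ValueError("negative_boost should be between 0.0 and 1.0")
    return positive_score * negative_boost if matches_negative else positive_score

# Titles from the earlier results; all three scored 0.1546153 for "Apple".
titles = {"1": "Apple Pad", "2": "Apple Mac", "3": "Apple Pie and Apple Fruit"}
scores = {
    doc_id: boosting_score(0.1546153, "fruit" in title.lower(), 0.5)
    for doc_id, title in titles.items()
}

# Highest score first; the fruit document should end up last.
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```

Under these assumptions the two product documents keep their score, while the document about fruit is halved and moves to the end of the ranking instead of being excluded.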

constant_score (Constant score query)

Filter queries are not scored, and Elasticsearch automatically caches some filter queries to improve efficiency. Sometimes, though, we need matching documents to come back with a specific score, and constant_score does this: it wraps a filter query and gives every matching document a constant score equal to the boost parameter.

POST /baike/_search
{
  "query": {
    "constant_score": {
      "filter": {"term": {"title.keyword": "Quick brown rabbits"}},
      "boost": 1.2
    }
  }
}

"hits" : {
    "total" : {
        "value" : 1,
        "relation" : "eq"
    },
    "max_score" : 1.2,
    "hits" : [
        {
            "_index" : "baike",
            "_type" : "_doc",
            "_id" : "4",
            "_score" : 1.2,   # The returned score equals the boost value specified in the query
            "_source" : {
                "title" : "Quick brown rabbits",
                "body" : "Brown rabbits are commonly seen."
            }
        }
    ]
}
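
As a rough illustration of that behavior, the following Python sketch models constant_score as "filter, then assign boost". The helper constant_score and the in-memory docs dictionary are hypothetical stand-ins, not part of any Elasticsearch client.

```python
# Toy model of constant_score: every document that passes the filter gets `boost`.
def constant_score(docs, filter_fn, boost=1.0):
    """Return {doc_id: boost} for each document accepted by filter_fn."""
    return {doc_id: boost for doc_id, doc in docs.items() if filter_fn(doc)}

docs = {
    "4": {"title": "Quick brown rabbits"},
    "5": {"title": "Keeping pets healthy"},
}

# Exact-match filter, mirroring the term query on title.keyword above.
hits = constant_score(docs, lambda d: d["title"] == "Quick brown rabbits", boost=1.2)
print(hits)
```

Only the document whose title matches exactly is returned, and its score is the boost value regardless of how well it matched.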

dis_max (Disjunction max query)

We covered the bool query above; let's revisit it here. First, here are two test documents taken from the Elasticsearch Chinese website:

POST baike/_doc/4
{
  "title": "Quick brown rabbits",
  "body": "Brown rabbits are commonly seen."
}

POST baike/_doc/5
{
  "title": "Keeping pets healthy",
  "body": "My quick brown fox eats rabbits on a regular basis."
}

Suppose we want to find content related to brown fox in either the title or the body. By inspection, the document with ID 5 should be more relevant, because the complete phrase brown fox appears in its body. Naturally, we expect it to score higher and rank nearer the top. Let's run a bool query and see whether the result matches our expectation:

POST /baike/_search
{
  "query": {
    "bool": {
      "should": [
        {"match": {"title": "Brown fox"}},
        {"match": {"body": "Brown fox"}}
      ]
    }
  }
}

# Query results
"hits" : {
    "total" : {
        "value" : 2,
        "relation" : "eq"
    },
    "max_score" : 1.5974034,
    "hits" : [
        {
            "_index" : "baike",
            "_type" : "_doc",
            "_id" : "4",
            "_score" : 1.5974034,
            "_source" : {
                "title" : "Quick brown rabbits",
                "body" : "Brown rabbits are commonly seen."
            }
        },
        {
            "_index" : "baike",
            "_type" : "_doc",
            "_id" : "5",
            "_score" : 0.77041256,
            "_source" : {
                "title" : "Keeping pets healthy",
                "body" : "My quick brown fox eats rabbits on a regular basis."
            }
        }
    ]
}

In practice, the document with ID 5 did not get the higher score. Why? To answer this, note that Elasticsearch can show how a DSL query was scored, including the calculation, through the explain parameter, much like the execution plan of a SQL query in MySQL. Let's take a look:

POST /baike/_search
{
  "query": {
    "bool": {
      "should": [
        {"match": {"title": "Brown fox"}},
        {"match": {"body": "Brown fox"}}
      ]
    }
  },
  "explain": true
}

# Query results
"hits" : {
  "total" : {
    "value" : 2,
    "relation" : "eq"
  },
  "max_score" : 1.5974034,
  "hits" : [
    {
      "_shard" : "[baike][0]",
      "_node" : "aPt8G7vHTzOJU_L2FdLBpA",
      "_index" : "baike",
      "_type" : "_doc",
      "_id" : "4",
      "_score" : 1.5974034,
      "_source" : {
        "title" : "Quick brown rabbits",
        "body" : "Brown rabbits are commonly seen."
      },
      "_explanation" : {
        "value" : 1.5974034,  # Approximately the sum of the two clause scores below (1.3862942 + 0.21110919)
        "description" : "sum of:",   # The clause scores are added together
        "details" : [
          {
            "value" : 1.3862942,
            "description" : "weight(title:brown in 0) [PerFieldSimilarity], result of:",  # title contains the term brown, so it contributes a score
            "details" : [
              {
                "value" : 1.3862942,
                "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                "details" : [] # Calculation details omitted for brevity
              }
            ]
          },
          {
            "value" : 0.21110919,
            "description" : "weight(body:brown in 0) [PerFieldSimilarity], result of:", # body contains the term brown, so it contributes a score
            "details" : [
              {
                "value" : 0.21110919,
                "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                "details" : []
              }
            ]
          }
        ]
      }
    },
    {
      "_shard" : "[baike][0]",
      "_node" : "aPt8G7vHTzOJU_L2FdLBpA",
      "_index" : "baike",
      "_type" : "_doc",
      "_id" : "5",
      "_score" : 0.77041256,
      "_source" : {
        "title" : "Keeping pets healthy",
        "body" : "My quick brown fox eats rabbits on a regular basis."
      },
      "_explanation" : {
        "value" : 0.77041256,
        "description" : "sum of:",
        "details" : [
          {
            "value" : 0.160443,
            "description" : "weight(body:brown in 0) [PerFieldSimilarity], result of:", # body contains the term brown, so it contributes a score
            "details" : [
              {
                "value" : 0.160443,
                "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                "details" : []
              }
            ]
          },
          {
            "value" : 0.60996956,
            "description" : "weight(body:fox in 0) [PerFieldSimilarity], result of:", # body contains the term fox, so it contributes a score
            "details" : [
              {
                "value" : 0.60996956,
                "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                "details" : []
              }
            ]
          }
        ]
      }
    }
  ]
}

The explain output shows that the bool query scores the two sub-queries in should separately and then adds them to get the total score. Document 4 contains the term brown in both its title and its body, so both clauses contribute. Document 5 matches only in the body, because its title contains none of the query terms, so its total is lower and it ranks second.
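
The addition can be checked directly against the numbers in the explain output. The short Python sketch below only re-adds the clause scores copied from that output; the dictionary names are chosen here for illustration.

```python
import math

# Per-clause scores copied from the explain output above.
clause_scores = {
    "4": [1.3862942, 0.21110919],  # title:brown, body:brown
    "5": [0.160443, 0.60996956],   # body:brown, body:fox
}
# Final _score values reported for each document.
reported = {"4": 1.5974034, "5": 0.77041256}

for doc_id, parts in clause_scores.items():
    total = sum(parts)
    # Elasticsearch computes scores as 32-bit floats, so allow a small tolerance.
    print(doc_id, total, math.isclose(total, reported[doc_id], rel_tol=1e-6))
```

Both totals agree with the reported _score values to within floating-point rounding, which confirms that bool simply sums the matching clauses.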

This is clearly not the outcome we want, so is there a way around it? Elasticsearch provides dis_max for this. Running the same two match clauses through dis_max instead of bool gives the results below; this time the document with ID 5 gets the higher score.
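
The request that produced these results is not shown. Presumably it is the same pair of match clauses wrapped in dis_max, without tie_breaker. As a sketch under that assumption, the body can be built as a Python dictionary and printed as JSON, ready to send with POST /baike/_search:

```python
import json

# Assumed request body: the two match clauses from the bool example,
# wrapped in dis_max instead of bool/should. No tie_breaker is set here.
dis_max_body = {
    "query": {
        "dis_max": {
            "queries": [
                {"match": {"title": "Brown fox"}},
                {"match": {"body": "Brown fox"}},
            ]
        }
    }
}

print(json.dumps(dis_max_body, indent=2))
```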

"hits" : {
    "total" : {
        "value" : 2,
        "relation" : "eq"
    },
    "max_score" : 0.77041256,
    "hits" : [
        {
            "_index" : "baike",
            "_type" : "_doc",
            "_id" : "5",
            "_score" : 0.77041256,
            "_source" : {
                "title" : "Keeping pets healthy",
                "body" : "My quick brown fox eats rabbits on a regular basis."
            }
        },
        {
            "_index" : "baike",
            "_type" : "_doc",
            "_id" : "4",
            "_score" : 0.6931471,
            "_source" : {
                "title" : "Quick brown rabbits",
                "body" : "Brown rabbits are commonly seen."
            }
        }
    ]
}

Using explain again to look at the execution plan, we find that dis_max does not simply add up the scores of the two sub-queries; it takes the highest of the sub-query scores as the final score.

"hits" : {
  "total" : {
    "value" : 2,
    "relation" : "eq"
  },
  "max_score" : 0.77041256,
  "hits" : [
    {
      "_shard" : "[baike][0]",
      "_node" : "tc1MvVwdRcO-2A5L6j_l0Q",
      "_index" : "baike",
      "_type" : "_doc",
      "_id" : "5",
      "_score" : 0.77041256,
      "_source" : {
        "title" : "Keeping pets healthy",
        "body" : "My quick brown fox eats rabbits on a regular basis."
      },
      "_explanation" : {
        "value" : 0.77041256,
        "description" : "max of:",  # The highest sub-query score is taken
        "details" : [
          {
            "value" : 0.77041256,
            "description" : "sum of:",
            "details" : [
              {
                "value" : 0.160443,
                "description" : "weight(body:brown in 1) [PerFieldSimilarity], result of:",
                "details" : []
              },
              {
                "value" : 0.60996956,
                "description" : "weight(body:fox in 1) [PerFieldSimilarity], result of:",
                "details" : []
              }
            ]
          }
        ]
      }
    },
    {
      "_shard" : "[baike][0]",
      "_node" : "tc1MvVwdRcO-2A5L6j_l0Q",
      "_index" : "baike",
      "_type" : "_doc",
      "_id" : "4",
      "_score" : 0.6931471,
      "_source" : {
        "title" : "Quick brown rabbits",
        "body" : "Brown rabbits are commonly seen."
      },
      "_explanation" : {
        "value" : 0.6931471,
        "description" : "max of:",
        "details" : [
          {
            "value" : 0.6931471,
            "description" : "sum of:",
            "details" : [
              {
                "value" : 0.6931471,
                "description" : "weight(title:brown in 0) [PerFieldSimilarity], result of:",
                "details" : []
              }
            ]
          },
          {
            "value" : 0.21110919,
            "description" : "sum of:",
            "details" : [
              {
                "value" : 0.21110919,
                "description" : "weight(body:brown in 0) [PerFieldSimilarity], result of:",
                "details" : []
              }
            ]
          }
        ]
      }
    }
  ]
}

However, taking only the highest score and ignoring the other sub-queries entirely is not always reasonable. After all, outstanding individuals are rare, and the strength of ordinary people should not be underestimated, so their contribution deserves some weight too. For this, Elasticsearch provides the tie_breaker parameter, a floating-point number between 0.0 and 1.0 that defaults to 0.0. When it is set, dis_max first takes the highest sub-query score, then multiplies the score of every other matching sub-query by tie_breaker and adds those products to the highest score to get the final score. The best match still dominates, but the other sub-queries contribute as well.

This also explains the name dis_max: dis is short for disjunction, meaning separation, and max means maximum. The query splits the combined query into separate sub-queries and takes the highest sub-query score as the final score, which helps us pick out the best match.

POST /baike/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {"match": {"title": "Brown fox"}},
        {"match": {"body": "Brown fox"}}
      ],
      "tie_breaker": 0.7
    }
  }
}
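
To see what tie_breaker does to the numbers, here is a small Python sketch of the formula described above. The function dis_max_score is a hypothetical helper, and the inputs are the sub-query scores copied from the dis_max explain output; the printed values are arithmetic on those scores, not results returned by Elasticsearch.

```python
# Sketch of the dis_max formula described above:
#   score = best + tie_breaker * (sum of the other matching clause scores)
def dis_max_score(clause_scores, tie_breaker=0.0):
    """Combine the scores of the matching sub-queries for one document."""
    if not clause_scores:
        return 0.0
    best = max(clause_scores)
    others = sum(clause_scores) - best
    return best + tie_breaker * others

# Sub-query scores copied from the dis_max explain output above.
doc4 = [0.6931471, 0.21110919]  # title match, body match
doc5 = [0.77041256]             # body match only; the title clause did not match

for tb in (0.0, 0.7):
    print(tb, dis_max_score(doc4, tb), dis_max_score(doc5, tb))
```

With tie_breaker at 0.0 the function reproduces the scores shown earlier. On these particular sub-scores, a value of 0.7 would lift document 4 to about 0.84, above document 5 at about 0.77, so a smaller tie_breaker may be preferable when the best-matching field should stay dominant.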

Keywords: ElasticSearch boosting

Added by walter8111 on Sat, 08 Jan 2022 07:26:54 +0200
