[elasticsearch] part of the result data in ES aggregation scenario is not returned. Problem analysis

background

After aggregating and querying a filter field in ES, similar to groupBy operation, it is found that the new data in this field is not displayed, but the user can query the new filter data after full-text retrieval, and has carried out relevant troubleshooting for this problem.

Troubleshooting ideas

First, we need to clarify our data writing process, as shown in the figure below:

After checking that there is no problem with the data in Mysql database, start to check whether there is a problem in ES. According to the phenomenon, we know that since we can search in full-text retrieval, it means that the data must be written into es, but how to determine the aggregation result?

First, add a log to print out the final DSL statement generated by the code

LOGGER.info("\n{}", searchRequestBuilder);

This makes it easy to debug with the curl command

The following is a query executed on the generated DSL statement:

curl -XGET 'http://ip:9200/es_data_index/_search?pretty' -H 'Content-Type: application/json' -d'
{
    "query":{
        "bool":{
            "must":[
                {
                    "term":{
                        "companyId":{
                            "value":1,
                            "boost":1
                        }
                    }
                },
                {
                    "term":{
                        "yn":{
                            "value":1,
                            "boost":1
                        }
                    }
                },
                {
                    "match_all":{
                        "boost":1
                    }
                }
            ],
            "must_not":[
                {
                    "term":{
                        "table_sentinel":{
                            "value":2,
                            "boost":1
                        }
                    }
                }
            ],
            "disable_coord":false,
            "adjust_pure_negative":true,
            "boost":1
        }
    },
    "aggregations":{
        "group_by_topics":{
            "terms":{
                "field":"topic",
                "size":10,
                "min_doc_count":1,
                "shard_min_doc_count":0,
                "show_term_doc_count_error":false,
                "order":[
                    {
                        "_count":"desc"
                    },
                    {
                        "_term":"asc"
                    }
                ]
            }
        }
    }
}'

Group above_ by_ Topics is the field we want to aggregate. The following is the result of executing the DSL statement:

"aggregations" : {
    "group_by_topics" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 14,
      "buckets" : [
        {
          "key" : 1,
          "doc_count" : 35
        },
        {
          "key" : 19,
          "doc_count" : 25
        },
        {
          "key" : 18,
          "doc_count" : 17
        },
        {
          "key" : 29,
          "doc_count" : 15
        },
        {
          "key" : 20,
          "doc_count" : 12
        },
        {
          "key" : 41,
          "doc_count" : 8
        },
        {
          "key" : 161,
          "doc_count" : 5
        },
        {
          "key" : 2,
          "doc_count" : 3
        },
        {
          "key" : 3,
          "doc_count" : 2
        },
        {
          "key" : 21,
          "doc_count" : 2
        }
      ]
    }
  }

After observation, it is found that there are no new filter items in the aggregation results, and only 10 data are returned

"Sum_other_doc_count": 14. This is a key item. Literally, there are other documents, so what is the meaning of querying in ES?

After query, it is found that there is a paragraph Description:

Only top results will be returned, and some results will not be returned

How to return this part of the result?

With the problem, it is found that bucket aggregation is used. By default, it will be based on doc_count is sorted in descending order, and only 10 aggregate results are returned by default

You can solve this problem by increasing the attribute size in the aggregate query, as follows

curl -XGET 'http://ip:9200/es_data_index/_search?pretty' -H 'Content-Type: application/json' -d'
{
    "query":{
        "bool":{
            "must":[
                {
                    "term":{
                        "companyId":{
                            "value":1,
                            "boost":1
                        }
                    }
                },
                {
                    "term":{
                        "yn":{
                            "value":1,
                            "boost":1
                        }
                    }
                },
                {
                    "match_all":{
                        "boost":1
                    }
                }
            ],
            "must_not":[
                {
                    "term":{
                        "table_sentinel":{
                            "value":2,
                            "boost":1
                        }
                    }
                }
            ],
            "disable_coord":false,
            "adjust_pure_negative":true,
            "boost":1
        }
    },
    "aggregations":{
        "group_by_topics":{
            "terms":{
                "field":"topic",
                "size":100,
                "min_doc_count":1,
                "shard_min_doc_count":0,
                "show_term_doc_count_error":false,
                "order":[
                    {
                        "_count":"desc"
                    },
                    {
                        "_term":"asc"
                    }
                ]
            }
        }
    }
}'

The following is the query result:

"aggregations" : {
    "group_by_topics" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : 1,
          "doc_count" : 35
        },
        {
          "key" : 19,
          "doc_count" : 25
        },
        {
          "key" : 18,
          "doc_count" : 17
        },
        {
          "key" : 29,
          "doc_count" : 15
        },
        {
          "key" : 20,
          "doc_count" : 12
        },
        {
          "key" : 41,
          "doc_count" : 8
        },
        {
          "key" : 161,
          "doc_count" : 5
        },
        {
          "key" : 2,
          "doc_count" : 3
        },
        {
          "key" : 3,
          "doc_count" : 2
        },
        {
          "key" : 21,
          "doc_count" : 2
        },
        {
          "key" : 81,
          "doc_count" : 2
        },
        {
          "key" : 801,
          "doc_count" : 2
        },
        {
          "key" : 0,
          "doc_count" : 1
        },
        {
          "key" : 4,
          "doc_count" : 1
        },
        {
          "key" : 5,
          "doc_count" : 1
        },
        {
          "key" : 6,
          "doc_count" : 1
        },
        {
          "key" : 7,
          "doc_count" : 1
        },
        {
          "key" : 11,
          "doc_count" : 1
        },
        {
          "key" : 23,
          "doc_count" : 1
        },
        {
          "key" : 28,
          "doc_count" : 1
        },
        {
          "key" : 201,
          "doc_count" : 1
        },
        {
          "key" : 241,
          "doc_count" : 1
        }
      ]
    }

All ES filter item data are statistically returned

Set size in the code:

  TermsAggregationBuilder termAgg1 = AggregationBuilders.terms("group_by_topics")
                .field("topic").size(100);

We have solved the problem. Now consider why ES does not return the result data of all statistical items at once?

The answer is determined by the ES aggregation mechanism. How does es aggregate

The node that accepts the client is the coordination node

On the coordination node, the search task is divided into two phases: query and fetch

The nodes of real search or aggregation tasks are data nodes, as shown in figures 2, 3 and 4

Aggregation steps:

  • The client sends a request to the coordination node
  • The coordination node pushes the request to each data node
  • Each data node specifies the partition to participate in the data collection
  • Coordinate nodes to aggregate the total results

es for efficiency and performance reasons, the result of aggregation is actually inaccurate what do you mean? Take the scenario we encountered above as an example:

By default, the top 10 aggregation result is returned. First, each node takes its own topic 10 and returns it to the coordination node, and then the coordination node summarizes it This will cause the actual aggregate result of the whole quantity to be inconsistent with the expectation

Although there are many ways to improve the accuracy of es aggregation, if the response speed is fast for accurate aggregation of large amount of data, es is not good at this scenario. It is necessary to use products such as clickhouse to solve this scenario

summary

This paper mainly aims at the application problems of practical work to troubleshoot and solve the problem that some data of ES aggregated data are not displayed, and explain the aggregation retrieval principle of ES ES is not good at business scenarios with large amount of data, high aggregation accuracy and fast response speed

reference resources

https://discuss.elastic.co/t/what-does-sum-other-doc-count-mean-exactly/159687

https://stackoverflow.com/questions/22927098/show-all-elasticsearch-aggregation-results-buckets-and-not-just-10

Keywords: ElasticSearch

Added by dizzy1 on Fri, 31 Dec 2021 22:31:07 +0200