Deep Tilling ElasticSearch - Bar Chart / Aggregation by Time Statistics / Range Limited

1. Data preparation

1. Create an index mapping:

PUT /cars 
{
  "mappings": {
    "properties": {
      "price":{
        "type": "integer"
      },
      "color":{
        "type": "keyword"
      },
      "make":{
        "type": "keyword"
      },
      "sold":{
        "type": "date"
      }
    }
  }
}

2. Index documents:

POST /cars/_bulk
{ "index": {}}
{ "price" : 10000, "color" : "red", "make" : "honda", "sold" : "2014-10-28" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 30000, "color" : "green", "make" : "ford", "sold" : "2014-05-18" }
{ "index": {}}
{ "price" : 15000, "color" : "blue", "make" : "toyota", "sold" : "2014-07-02" }
{ "index": {}}
{ "price" : 12000, "color" : "green", "make" : "toyota", "sold" : "2014-08-19" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 80000, "color" : "red", "make" : "bmw", "sold" : "2014-01-01" }
{ "index": {}}
{ "price" : 25000, "color" : "blue", "make" : "ford", "sold" : "2014-02-12" }

2. Bar Charts

Another exciting feature of aggregation is the ease with which they can be converted into charts and graphs, and histograms are particularly useful. It is essentially a bar graph, and creating a histogram requires an interval. If we want to create a histogram for the sales price, we can set the interval to 20,000. Doing so will create a new bucket in each $20,000 slot, and the document will be divided into the corresponding buckets.

We want to know how many cars are sold within each sales range. We would also like to know the revenue from cars sold in each sales range, which can be calculated by summing the prices of cars sold in each sales range.

We can use histogram and a nested sum to get the answer we want:

GET /cars/_search
{
  "size": 0,
  "aggs": {
    "price_agg": {
      "histogram": {  // The histogram bucket requires two parameters: a numeric field and a bucket size interval defined.
        "field": "price",
        "interval": 20000
      },
      "aggs": {
        "sum_agg": {
          "sum": {   // sum measures are nested within each sales price interval and are used to show the total revenue within each interval.
            "field": "price"
          }
        }
      }
    }
  }
}

As we have seen, the query is built around a price aggregation and contains a histogram bucket. It requires that the type of field be numeric and that the range of intervals for grouping be set. Setting the interval to 20,000 means we'll get [0-19999, 20000-39999,...] Such an interval.

Next, we define a nested measure within the histogram, a sum measure that sums up the values of the price field in a document that falls within a specific sales range. This will give us revenue per sales range, so we can see whether we make money from regular household cars or luxury cars.

The response results are as follows:

{
  "aggregations" : {
    "price_agg" : {
      "buckets" : [
        {
          "key" : 0.0,
          "doc_count" : 3,    // price has three in [0~19999]
          "sum_agg" : {
            "value" : 37000.0 // The total price is 3700.0
          }
        },
        {
          "key" : 20000.0,    // price has four in [20000-29999]
          "doc_count" : 4,
          "sum_agg" : {
            "value" : 95000.0 // The total price is 95000.0
          }
        },
        {
          "key" : 40000.0,
          "doc_count" : 0,
          "sum_agg" : {
            "value" : 0.0
          }
        },
        {
          "key" : 60000.0,
          "doc_count" : 0,
          "sum_agg" : {
            "value" : 0.0
          }
        },
        {
          "key" : 80000.0,
          "doc_count" : 1,
          "sum_agg" : {
            "value" : 80000.0
          }
        }
      ]
    }
  }
}

Of course, we can create bars for the classification and statistical results of any aggregated output, not just histogram buckets. Let's create a bar chart with information about the 10 most popular cars and their average selling price and standard deviation. We'll use terms buckets and extended_stats measure:

GET /cars/_search
{
  "size": 0,
  "aggs": {
    "make_agg": {
      "terms": {
        "field": "make",
        "size": 10
      },
      "aggs": {
        "stats": {
          "extended_stats": {   
            "field": "price"
          }
        }
      }
    }
  }
}
{
  "aggregations" : {
    "make_agg" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "honda",
          "doc_count" : 3,
          "stats_agg" : {
            "count" : 3,
            "min" : 10000.0,
            "max" : 20000.0,
            "avg" : 16666.666666666668,
            "sum" : 50000.0,
            "sum_of_squares" : 9.0E8,
            "variance" : 2.222222222222221E7,
            "std_deviation" : 4714.045207910315,
            "std_deviation_bounds" : {
              "upper" : 26094.757082487296,
              "lower" : 7238.5762508460375
            }
          }
        }
        // .....
      ]
    }
  }
}

3. Time Statistics

3.1 Statistics by Time

Why would you like to use date_to build time statistics What about histogram? Suppose your data is timestamped. No matter what data (Apache event log, stock trading time, baseball time) you have, date_histogram analysis can be done with a time stamp. When your data has a time stamp, you always want to build an indicator analysis in the time dimension:

  • How many cars are sold each month this year?
  • What is the price of this stock for the last 12 hours?
  • What was the average response delay per hour on our website last week?

Although the usual histogram is a bar graph, date_ Hisgram tends to convert to a linear graph to show time series. Many companies use Elasticsearch only to analyze time series data. Date_ Hisgram analysis is their most basic need.

Date_ Hisgram is similar to a normal histogram. Instead of building buckets on numeric fields representing numeric ranges, buckets are built on time ranges. So each bucket is defined as a specific date size (for example, a month or 2.5 days).

Our first example will build a simple line chart to answer the following questions: How many cars are sold each month?

GET /cars/_search
{
  "size": 0,
  "aggs": {
    "sold_agg": {
      "date_histogram": {
        "field": "sold",
        "interval": "month",
        "format": "yyyy-MM-dd"
      }
    }
  }
}

Time interval requirements are calendar terms (such as one month per bucket).

Our query has only one aggregation and builds a bucket each month. So we can get the number of cars sold each month. An additional format parameter is provided so that buckets have "nice" key values.

#! Deprecation: [interval] on [date_histogram] is deprecated, use [fixed_interval] or [calendar_interval] in the future.
{
  "aggregations" : {
    "sold_agg" : {
      "buckets" : [
        {
          "key_as_string" : "2014-01-01",
          "key" : 1388534400000,
          "doc_count" : 1
        },
        {
          "key_as_string" : "2014-02-01",
          "key" : 1391212800000,
          "doc_count" : 1
        },
        {
          "key_as_string" : "2014-03-01",
          "key" : 1393632000000,
          "doc_count" : 0
        },
        {
          "key_as_string" : "2014-04-01",
          "key" : 1396310400000,
          "doc_count" : 0
        },
        {
          "key_as_string" : "2014-05-01",
          "key" : 1398902400000,
          "doc_count" : 1
        },
        {
          "key_as_string" : "2014-06-01",
          "key" : 1401580800000,
          "doc_count" : 0
        },
        {
          "key_as_string" : "2014-07-01",
          "key" : 1404172800000,
          "doc_count" : 1
        },
        {
          "key_as_string" : "2014-08-01",
          "key" : 1406851200000,
          "doc_count" : 1
        },
        {
          "key_as_string" : "2014-09-01",
          "key" : 1409529600000,
          "doc_count" : 0
        },
        {
          "key_as_string" : "2014-10-01",
          "key" : 1412121600000,
          "doc_count" : 1
        },
        {
          "key_as_string" : "2014-11-01",
          "key" : 1414800000000,
          "doc_count" : 2
        }
      ]
    }
  }
}

3.2 Filter empty buckets

We don't want to go back if there are no documents in the buckets. You can set the additional parameter min_doc_count achieves this effect:

GET /cars/_search
{
  "size": 0,
  "aggs": {
    "sold_agg": {
      "date_histogram": {
        "field": "sold",
        "interval": "month",
        "min_doc_count": 1  // When data doc_in bucket Return when count >=1
      }
    }
  }
}

As you can see from the results, min_has not been added before Doc_ Comparing the count parameter, doc_in bucket The count=0 data has been filtered out

{
  "aggregations" : {
    "sold_agg" : {
      "buckets" : [
        {
          "key_as_string" : "2014-01-01T00:00:00.000Z",
          "key" : 1388534400000,
          "doc_count" : 1
        },
        {
          "key_as_string" : "2014-02-01T00:00:00.000Z",
          "key" : 1391212800000,
          "doc_count" : 1
        },
        {
          "key_as_string" : "2014-05-01T00:00:00.000Z",
          "key" : 1398902400000,
          "doc_count" : 1
        },
        {
          "key_as_string" : "2014-07-01T00:00:00.000Z",
          "key" : 1404172800000,
          "doc_count" : 1
        },
        {
          "key_as_string" : "2014-08-01T00:00:00.000Z",
          "key" : 1406851200000,
          "doc_count" : 1
        },
        {
          "key_as_string" : "2014-10-01T00:00:00.000Z",
          "key" : 1412121600000,
          "doc_count" : 1
        },
        {
          "key_as_string" : "2014-11-01T00:00:00.000Z",
          "key" : 1414800000000,
          "doc_count" : 2
        }
      ]
    }
  }
}

Note: In fact, in most scenarios, we want to return even if there are no documents in the buckets. In the lower version of es, if the data in the bucket is 0, it will not be returned. In the higher version, the result will be returned regardless of the data in the bucket.

3.3 Specified time interval

GET /cars/_search
{
  "size": 0,
  "aggs": {
    "sold_agg": {
      "date_histogram": {
        "field": "sold",
        "interval": "month",
        "min_doc_count": 0,   // This parameter forces an empty buckets back.
        "extended_bounds": {  // This parameter forces the return of the whole year
          "min": "2014-01-01",
          "max": "2014-12-31"
        }
      }
    }
  }
}

These two parameters force the results to be returned for all months of the year regardless of the number of documents in the results. min_doc_count is easy to understand: it forces all buckets back, even if they may be empty.

Extended_ The bounds parameter requires some explanation. Min_ Doc_ The count parameter forces empty buckets to be returned, but Elasticsearch by default only returns buckets between the minimum and maximum values in your data.

So if your data falls between April and July, you only get buckets for those months (either empty or not). So in order to get the full year's data, we need to tell Elasticsearch that we want all buckets, even if those buckets might fall before the minimum or after the maximum date.

Changes to version 3.4 ES 7.0

You can see from the hint that in the new version of es, date_ The histogram aggregate query interval parameter is obsolete and recommended: calendar_interval and fixed_interval parameter

1. fixed_interval parameter

GET /cars/_search
{
  "aggs": {
    "sold_agg": {
      "date_histogram": {
        "field": "sold",
        "fixed_interval": "60m",  //  Create a bucket every 60 minutes
        "format": "yyyy-MM-dd"
      }
    }
  }
}
GET /cars/_search
{
  "aggs": {
    "sold_agg": {
      "date_histogram": {
        "field": "sold",
        "fixed_interval": "100000000s", // Create a bucket every 100,000,000 seconds
        "format": "yyyy-MM-dd"
      }
    }
  }
}
GET /cars/_search
{
  "size": 0,
  "aggs": {
    "sold_agg": {
      "date_histogram": {
        "field": "sold",
        "fixed_interval": "1d",   // Create a bucket every 1 day
        "format": "yyyy-MM-dd"
      }
    }
  }
}

2. calendar_interval parameter

GET /cars/_search
{
  "size": 0, 
  "aggs": {
    "sold_agg": {
      "date_histogram": {
        "field": "sold",
        "calendar_interval": "month",   // Create a bucket every 1 month
        "format": "yyyy-MM-dd"
      }
    }
  }
}
GET /cars/_search
{
  "size": 0, 
  "aggs": {
    "sold_agg": {
      "date_histogram": {
        "field": "sold",
        "calendar_interval": "quarter",  // Create a bucket every quarter
        "format": "yyyy-MM-dd"
      }
    }
  }
}

3.5 Extended Example

As we've seen many times, buckets can be nested within buckets for more complex analysis. As an example, we build aggregations to show the total sales of all car brands by quarter. At the same time, the total sales are calculated quarterly by each car brand, so that you can find out which brand makes the most money:

GET /cars/_search
{
  "size": 0,
  "aggs": {
    "sold_agg": {
      "date_histogram": {
        "field": "sold",
        "interval": "quarter",
        "format": "yyyy-MM-dd",
        "min_doc_count": 0,
        "extended_bounds": {
          "min": "2014-01-01",
          "max": "2014-12-31"
        }
      },
      "aggs": {
        "total_sum_agg":{       // Show total sales of all car brands Quarterly
          "sum": {
            "field": "price"
          }
        },
        "sold_agg_1":{
          "terms": {
            "field": "make"     
          },
          "aggs": {
            "sum_agg": {
              "sum": {
                "field": "price" // Calculate total sales per car brand
              }
            }
          }
        }
      }
    }
  }
}
{
  "aggregations" : {
    "sold_agg" : {
      "buckets" : [
        {
          "key_as_string" : "2014-01-01",
          "key" : 1388534400000,
          "doc_count" : 2,
          "sold_agg_1" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "bmw",
                "doc_count" : 1,
                "sum_agg" : {
                  "value" : 80000.0
                }
              },
              {
                "key" : "ford",
                "doc_count" : 1,
                "sum_agg" : {
                  "value" : 25000.0
                }
              }
            ]
          },
          "total_sum_agg" : {
            "value" : 105000.0
          }
        }
        // .....
      ]
    }
  }
}

4. Aggregation with limited scope

For all the aggregated examples so far, you may have noticed that our search request omitted a query. The whole request is just an aggregation.

Aggregation can be performed at the same time as a search request, but we need to understand a new concept: scope. By default, aggregation and query operate on the same scope, that is, aggregation is calculated based on the set of documents that our query matches.

Let's look at an example of the first aggregation:

GET /cars/_search
{
    "size" : 0,
    "aggs" : {
        "colors" : {
            "terms" : {
              "field" : "color"
            }
        }
    }
}

We can see that aggregation is isolated. In reality, Elasticsearch considers that "no query specified" and "query all documents" are equivalent. The previous query internally translates into the following request:

GET /cars/_search
{
    "size" : 0,
    "query" : {
        "match_all" : {}
    },
    "aggs" : {
        "colors" : {
            "terms" : {
              "field" : "color"
            }
        }
    }
}

Since aggregation always operates on results within the scope of a query, an isolated aggregation is actually a match_all result range operations, that is, all documents.

Once we have the concept of scope, we can further customize aggregation. All of our previous examples calculate statistics for all data: the cars with the highest sales, the average selling price of all cars, the best selling month, and so on.

Using the range, we can ask "How many colors are Ford selling cars?" Questions like these. You can simply add a query to the request (in this case, the match query):

GET /cars/_search
{
    "query" : {
        "match" : {
            "make" : "ford"
        }
    },
    "aggs" : {
        "colors" : {
            "terms" : {
              "field" : "color"
            }
        }
    }
}

Since we did not specify "size": 0, both search results and aggregated results were returned:

{
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.2809337,
    "hits" : [
      {
        "_index" : "cars",
        "_type" : "_doc",
        "_id" : "nlim_H0Bh5iD2576zz3b",
        "_score" : 1.2809337,
        "_source" : {
          "price" : 30000,
          "color" : "green",
          "make" : "ford",
          "sold" : "2014-05-18"
        }
      },
      {
        "_index" : "cars",
        "_type" : "_doc",
        "_id" : "o1im_H0Bh5iD2576zz3b",
        "_score" : 1.2809337,
        "_source" : {
          "price" : 25000,
          "color" : "blue",
          "make" : "ford",
          "sold" : "2014-02-12"
        }
      }
    ]
  },
  "aggregations" : {
    "colors" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "blue",
          "doc_count" : 1
        },
        {
          "key" : "green",
          "doc_count" : 1
        }
      ]
    }
  }
}

It doesn't seem like much, but it's important for a big dashboard. Adding a search bar can turn any static dashboard into a real-time data search device. This allows users to search for data and view all graphics that are updated in real time (due to aggregation support and limited query scope).

Keywords: Big Data ElasticSearch

Added by hws on Fri, 31 Dec 2021 04:48:02 +0200