[Java] Query and aggregation operations in Elastic


The previous article, <Basic operations of index and document in Elastic>, introduced the basics of Elastic together with index and document operations. This section describes the query and aggregation operations commonly used in Elasticsearch.

Search basics

Elasticsearch analyzes (tokenizes) the content of text fields and builds an inverted index from the resulting tokens. To match the complete, unanalyzed value of a field, query its keyword sub-field ({field}.keyword).

All examples use documents with the following structure:

{
	"number": 1802,
	"name": "Name 36zou",
	"age": 28,
	"courses": [
		{
			"name": "maths",
			"hours": 160,
			"teacher": "mike"
		},
		{
			"name": "english",
			"hours": 120,
			"teacher": "tom"
		}
	]
}

Tokenizer

A tokenizer accepts a string as input, splits it into individual words or tokens (some punctuation characters may be discarded), and outputs a token stream.

Built-in analyzers:

Analyzer      Description
standard      Default analyzer: splits on word boundaries, converts tokens to lowercase and removes most punctuation (common separators except the underscore); stop-word removal is supported but disabled by default; Chinese text is split into single characters
simple        Splits text on non-letter characters and converts the tokens to lowercase; numeric characters are dropped
whitespace    Splits on whitespace only and does not segment Chinese; tokens are not normalized and not converted to lowercase
pattern       Splits with a regular expression, default \W+ (non-word characters)
keyword       No tokenization, the input is emitted unchanged as a single token (use {field}.keyword to query the whole value)
language      Language-specific analyzers
custom        Custom analyzer
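To make the difference between the analyzers more concrete, the following plain-Java sketch approximates how whitespace and simple would split the name value of the sample document. It only imitates the behaviour with regular expressions and is not Lucene's implementation; the class and method names (AnalyzerSketch, whitespace, simple) are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical helper: approximates two built-in analyzers with regular expressions.
class AnalyzerSketch {

    // whitespace: split on whitespace only, keep case and digits untouched
    static List<String> whitespace(String text) {
        return split(text, "\\s+", false);
    }

    // simple: split on anything that is not a letter, then lowercase (digits are dropped)
    static List<String> simple(String text) {
        return split(text, "[^\\p{L}]+", true);
    }

    private static List<String> split(String text, String regex, boolean lowercase) {
        return Arrays.stream(text.split(regex))
                .filter(token -> !token.isEmpty()) // split() can yield a leading empty string
                .map(token -> lowercase ? token.toLowerCase(Locale.ROOT) : token)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(whitespace("Name 36zou")); // [Name, 36zou]
        System.out.println(simple("Name 36zou"));     // [name, zou]
    }
}
```

This is why a termQuery for "Name" finds nothing on a field analyzed with simple: the index only contains name and zou.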

Request

A query is built as a SearchRequest and executed through search, which returns a SearchResponse. The response contains the matching records and the total number of records that satisfy the conditions. The index name is passed when constructing the SearchRequest:

  • Several indexes can be given, in which case all of them are queried;
  • The * wildcard can be used: for example, test* matches every index whose name starts with test;

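SearchResponse.getHits() returns a SearchHits object, and each SearchHit in it is one returned document:

  • getSourceAsString: returns the document content as a string (JSON format);
  • getSourceAsMap: returns the document content as a Map for easy access (values are plain Objects, so convert numbers explicitly instead of casting to a fixed type);
  • getHits: returns the array of hits in the current response; its length is the number of entries actually returned, while getTotalHits().value is the total number of matching records in ES;

// Imports assume the 7.x high-level REST client; ESClient is the helper class from the previous article
import java.util.Map;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.SearchHits;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public void searchQuery(String index, SearchSourceBuilder sourceBuilder) {
    try (RestHighLevelClient rhlClient = ESClient.getClient()) {
        SearchRequest reqSearch = new SearchRequest(index);
        reqSearch.source(sourceBuilder);

        SearchResponse respSearch = rhlClient.search(reqSearch, RequestOptions.DEFAULT);
        SearchHits gotHits = respSearch.getHits();
        // Number of hits in this response vs. total number of matching documents
        System.out.printf("get size: %d, total size: %d\n", gotHits.getHits().length, gotHits.getTotalHits().value);
        for (SearchHit hit : gotHits) {
            // System.out.println(hit.getSourceAsString());
            Map<String, Object> mapHit = hit.getSourceAsMap();
            String name = mapHit.get("name").toString();
            // Go through the string form so the numeric type in the map does not matter
            Integer age = Integer.valueOf(mapHit.get("age").toString());
            System.out.printf("name: %s, age: %d\n", name, age);
        }
    } catch (Exception ex) {
        System.out.println(index + " query fail: " + ex);
    }
}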

SearchSourceBuilder

As the source of a SearchRequest, SearchSourceBuilder determines the query conditions, the number of records, the sort order, the fields returned, and so on:

  • from/size: used for paging (return size records starting at offset from); defaults to 10 records starting at 0; from + size may not exceed 10000 (the default index.max_result_window), and anything beyond that can only be fetched through scroll;
  • sort: set the sort order;
  • query(QueryBuilder): set the query condition;
  • aggregation(AggregationBuilder): set the aggregation;
  • collapse(CollapseBuilder): collapse (de-duplicate) the results on a field;
  • suggest(SuggestBuilder): set suggestions (input hints based on what matches);
  • postFilter(QueryBuilder): set a post filter (it filters the hits after the aggregations are computed, so it does not affect the aggregations);
  • fetchSource: set the fields to return; an include field must be a real field in ES (use new String[0] to return only the system fields), while the exclude list may contain non-existent fields (the system fields _doc / _score / _id cannot be excluded);
  • timeout: set the query timeout;
  • terminateAfter(int n): stop the search early once n documents have been collected (the limit applies per shard);
  • highlighter(HighlightBuilder): set highlighting;

private SearchSourceBuilder buildTermQuerySource(String field, String word) {
    SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
    BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();
    boolQuery.must(QueryBuilders.termQuery(field, word));
    sourceBuilder.query(boolQuery);

    sourceBuilder.from(0);
    sourceBuilder.size(5);
    sourceBuilder.timeout(new TimeValue(60, TimeUnit.SECONDS));
    sourceBuilder.sort("name.keyword");

    String[] excludeFields = new String[]{"@time", "@version"};
    sourceBuilder.fetchSource(null, excludeFields);

    return sourceBuilder;
}
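The from/size rule described above is plain arithmetic, sketched here in JDK-only Java. PagingSketch, fromOf and withinWindow are illustrative names, and 10000 is assumed to be the unchanged default of index.max_result_window.

```java
// Hypothetical helper illustrating from/size paging arithmetic.
class PagingSketch {
    // Assumed default of index.max_result_window
    static final int MAX_RESULT_WINDOW = 10_000;

    // Offset of the first record on a zero-based page
    static long fromOf(int page, int size) {
        return (long) page * size;
    }

    // A page can be fetched with from/size only if from + size stays inside the window
    static boolean withinWindow(int page, int size) {
        return fromOf(page, size) + size <= MAX_RESULT_WINDOW;
    }

    public static void main(String[] args) {
        System.out.println(fromOf(3, 20));         // 60
        System.out.println(withinWindow(499, 20)); // true: 9980 + 20 = 10000
        System.out.println(withinWindow(500, 20)); // false: 10000 + 20 > 10000
    }
}
```

A caller could check withinWindow before passing the offset to sourceBuilder.from(...) and switch to a scroll query when it returns false.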

Query

With the default analyzer, Elastic stores tokens in lowercase (field.keyword stores the complete content unchanged). A query that does not analyze its input therefore needs lowercase input to match a token; a query that does analyze its input lowercases it automatically.

QueryBuilders

Query conditions can be constructed conveniently through QueryBuilders:

  • matchAllQuery: match all documents;
  • termQuery: exact match, case sensitive (the input is not analyzed); termsQuery matches any of several values at once;
  • matchPhraseQuery: analyzes the input and requires the tokens to appear in the same order, which is convenient for exact matching of Chinese text;
  • queryStringQuery: can match against multiple fields and supports AND/OR;
  • fuzzyQuery: fuzzy matching (the number of characters allowed to differ can be set);
  • prefixQuery: prefix matching;
  • wildcardQuery: wildcard matching (* matches zero or more characters and ? matches exactly one character); a backslash (\\) has to be escaped (\\\\);
  • rangeQuery: range matching
    • from/to: set the start and end of the range, e.g. from("fieldValue1").to("fieldValue2").includeUpper(false).includeLower(false);
    • gt/gte/lt/lte: greater-than and less-than comparisons (use termQuery directly for equality);
  • boolQuery: combined condition query:
    • must: equivalent to AND;
    • mustNot: equivalent to NOT;
    • should: equivalent to OR (at least one should clause has to match when there is no must or filter clause; otherwise should clauses only raise the score);
    • filter: the document must satisfy the filter clause, but unlike must the clause does not take part in score calculation;
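How the four boolQuery clause types decide whether a document matches can be modelled in a few lines of plain Java. This is a rough in-memory sketch, not the Elasticsearch API: BoolSketch and its fields are hypothetical, scoring is ignored, and it assumes the default behaviour in which should clauses become mandatory only when no must or filter clause is present.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical in-memory model of how a bool query decides whether a document matches.
class BoolSketch {
    final List<Predicate<Map<String, Object>>> must = new ArrayList<>();
    final List<Predicate<Map<String, Object>>> filter = new ArrayList<>();
    final List<Predicate<Map<String, Object>>> mustNot = new ArrayList<>();
    final List<Predicate<Map<String, Object>>> should = new ArrayList<>();

    boolean matches(Map<String, Object> doc) {
        // must and filter both have to hold (filter simply does not affect the score)
        boolean required = must.stream().allMatch(p -> p.test(doc))
                && filter.stream().allMatch(p -> p.test(doc));
        // no mustNot clause may hold
        boolean excluded = mustNot.stream().anyMatch(p -> p.test(doc));
        // should is only mandatory when there is no must/filter clause
        boolean shouldRequired = !should.isEmpty() && must.isEmpty() && filter.isEmpty();
        boolean shouldOk = !shouldRequired || should.stream().anyMatch(p -> p.test(doc));
        return required && !excluded && shouldOk;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = Map.of("name", "Name 36zou", "age", 28);

        BoolSketch query = new BoolSketch();
        query.must.add(d -> (Integer) d.get("age") > 20);    // like rangeQuery("age").gt(20)
        query.mustNot.add(d -> "tom".equals(d.get("name"))); // like termQuery("name.keyword", "tom")
        System.out.println(query.matches(doc)); // true

        query.filter.add(d -> (Integer) d.get("age") < 25);  // like rangeQuery("age").lt(25)
        System.out.println(query.matches(doc)); // false
    }
}
```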

QueryStringQuery

QueryStringQuery can search several fields of the indexed documents at once (set through field/fields). When the query string contains multiple terms, they are combined with OR by default; the default can be changed through default_operator.

A query string can express fairly complex conditions over multiple fields. Inside the query string:

  • AND/OR/NOT are supported for Boolean operations, e.g. big AND fat (the operator must have a space before and after it);
  • + (must) and - (must not) are supported, e.g. +dog -cat (contains dog and does not contain cat);

private SearchSourceBuilder buildQueryStringSource(String field, String word) {
    SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
    BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();
    // Keep the built query and add it to the bool query (the original snippet discarded it)
    QueryStringQueryBuilder queryString = QueryBuilders.queryStringQuery(word)
            .field(field)
            .analyzeWildcard(true)
            .defaultOperator(Operator.AND);
    boolQuery.must(queryString);
    sourceBuilder.query(boolQuery);

    sourceBuilder.from(0);
    sourceBuilder.size(5);
    sourceBuilder.timeout(new TimeValue(60, TimeUnit.SECONDS));

    return sourceBuilder;
}

SimpleQueryStringQuery is a simplified version of QueryStringQuery. It does not support the AND/OR/NOT keywords, which are treated as ordinary words (it uses the symbols +, | and - instead).

Sort

ES sorts by _score by default; a different order can be defined through sourceBuilder.sort(field, SortOrder.DESC). Several sort fields can be added (the first field added has the highest priority). SortBuilder has four concrete implementations:

  • FieldSortBuilder: sort by a specific field; to sort on a text field, use field.keyword as the field name;
  • ScoreSortBuilder: sort by score;
  • GeoDistanceSortBuilder: sort by geographic distance;
  • ScriptSortBuilder: sort by a custom script;

private SearchSourceBuilder buildSortSource(String... fields) {
    SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
    sourceBuilder.from(0);
    sourceBuilder.size(20);

    // Fields added first take priority; later fields only break ties
    for (String field : fields) {
        sourceBuilder.sort(field, SortOrder.DESC);
    }

    return sourceBuilder;
}

buildSortSource("age", "name.keyword");
// Sort by the age field first, and sort by name for the same age
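
The priority rule for multiple sort fields behaves like a chained comparator. The sketch below mimics buildSortSource("age", "name.keyword") on an in-memory list; SortSketch and Person are hypothetical names, and nothing here calls Elasticsearch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical in-memory illustration of multi-field sort priority.
class SortSketch {
    // Minimal stand-in for a document holding the two sort fields
    static final class Person {
        final String name;
        final int age;

        Person(String name, int age) {
            this.name = name;
            this.age = age;
        }
    }

    // age DESC first; name DESC only decides between people of the same age
    static List<String> sortedNames(List<Person> people) {
        Comparator<Person> byAgeDesc = Comparator.comparingInt((Person p) -> p.age).reversed();
        Comparator<Person> byNameDesc = Comparator.comparing((Person p) -> p.name).reversed();
        List<Person> copy = new ArrayList<>(people);
        copy.sort(byAgeDesc.thenComparing(byNameDesc));
        return copy.stream().map(p -> p.name).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Person> people = List.of(
                new Person("amy", 28), new Person("bob", 30), new Person("cat", 28));
        System.out.println(sortedNames(people)); // [bob, cat, amy]
    }
}
```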

Scroll cursor

A from/size query can reach at most the first 10000 records. To read the data beyond that, use a Scroll query. When the cursor is no longer needed it has to be cleared, so that the server can release its resources and later queries are not affected.

public void scrollSearch(String index) {
    // Keep-alive: how long the scroll context stays valid between two requests
    final Scroll scroll = new Scroll(TimeValue.timeValueSeconds(60));
    SearchRequest searchRequest = new SearchRequest(index);
    searchRequest.scroll(scroll);

    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    searchSourceBuilder.size(5);

    BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();
    boolQuery.must(QueryBuilders.rangeQuery("age").gt(0));
    searchSourceBuilder.query(boolQuery);

    searchRequest.source(searchSourceBuilder);
    try (RestHighLevelClient rhlClient = ESClient.getClient()) {
        SearchResponse response = rhlClient.search(searchRequest, RequestOptions.DEFAULT);
        SearchHits gotHits = response.getHits();
        while (gotHits.getHits().length>0){
            System.out.printf("ScrollId: %s, size: %d, total: %d\n", response.getScrollId(),
                    gotHits.getHits().length, gotHits.getTotalHits().value);
            for(SearchHit hit : gotHits.getHits()){
                System.out.println(hit.getSourceAsString());
            }

            // scroll query
            SearchScrollRequest scrollRequest = new SearchScrollRequest(response.getScrollId());
            scrollRequest.scroll(scroll);
            response = rhlClient.scroll(scrollRequest, RequestOptions.DEFAULT);
            gotHits = response.getHits();
            System.out.println();
        }

        // Clear the scroll when finished so the server can release its resources
        ClearScrollRequest clearRequest = new ClearScrollRequest();
        clearRequest.addScrollId(response.getScrollId());
        ClearScrollResponse clearResponse = rhlClient.clearScroll(clearRequest, RequestOptions.DEFAULT);
        System.out.println("clear scroll: " + clearResponse.isSucceeded());
    } catch (Exception ex) {
        System.out.println(index + " query fail: " + ex);
    }
}

Aggregation

With RestHighLevelClient, grouping conditions are constructed with AggregationBuilder. Two concepts are involved:

  • Buckets: a collection of documents that meet certain conditions; the number of documents in a bucket is obtained through getDocCount();
  • Metrics: statistics calculated over the documents in a bucket;

An aggregation is a combination of buckets and metrics. It can consist of a single bucket, a single metric, or one of each; a bucket can also contain further nested buckets.
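
The relationship between buckets and metrics can be imitated on an in-memory list with the JDK stream collectors. In this sketch, grouping by age plays the role of a terms bucket and the summary of number plays the role of the count/sum/avg metrics. BucketSketch and Student are hypothetical names and the data is made up; it is an analogy, not the Elasticsearch API.

```java
import java.util.IntSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical in-memory analogy: group by age (bucket), summarize number (metrics).
class BucketSketch {
    // Minimal stand-in for a document with the two fields used here
    static final class Student {
        final int age;
        final int number;

        Student(int age, int number) {
            this.age = age;
            this.number = number;
        }
    }

    // One entry per distinct age, ordered by key, each holding count/sum/avg of number
    static Map<Integer, IntSummaryStatistics> byAge(List<Student> docs) {
        return docs.stream().collect(Collectors.groupingBy(
                (Student s) -> s.age,
                TreeMap::new,
                Collectors.summarizingInt((Student s) -> s.number)));
    }

    public static void main(String[] args) {
        List<Student> docs = List.of(
                new Student(28, 1802), new Student(28, 1600), new Student(30, 1500));
        for (Map.Entry<Integer, IntSummaryStatistics> bucket : byAge(docs).entrySet()) {
            IntSummaryStatistics stats = bucket.getValue();
            System.out.println("bucket of " + bucket.getKey() + ", count: " + stats.getCount()
                    + ", sum: " + stats.getSum() + ", avg: " + stats.getAverage());
        }
    }
}
```

For the sample data this prints a bucket for age 28 with count 2, sum 3402 and avg 1701.0, followed by a bucket for age 30 with count 1.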

AggregationBuilders

AggregationBuilders is used to construct aggregations. The parameter of each factory method is a name that identifies the aggregation (the same name is needed later to read the aggregation result). The field to aggregate on is set through field(f), and sub-aggregations are attached through subAggregation:

  • count(name): count the number of values;
  • avg(name): average value;
  • max(name): maximum value;
  • min(name): minimum value;
  • sum(name): sum of the values;
  • stats(name): statistics (count, min, max, avg and sum; use extendedStats for variance and similar values);
  • filter(name, QueryBuilder): restrict the bucket to documents that match a condition; use filters when there are several conditions;
  • range(name): bucket by ranges; addUnboundedTo / addRange / addUnboundedFrom add an upper-bounded, a bounded and a lower-bounded range respectively (from is inclusive, to is exclusive);
  • missing(name): bucket of the documents in which the given field is missing;
  • terms(name): bucket by the values of the given field;
  • topHits(name): get the document details inside an aggregation (bucket);
  • histogram(name): histogram aggregation; the interval is set through interval;
  • dateHistogram(name): date histogram aggregation; the field must be of a date type; the granularity (hour, minute, second, etc.) is set through a DateHistogramInterval, and format sets the date format (returned as a string in key_as_string);
  • dateRange(name): date range aggregation; the date format is set through format and the ranges through range;
  • ipRange(name): aggregate IP address ranges;
  • geoDistance(name, GeoPoint): geographic distance aggregation;
  • nested(name, path): aggregation over nested sub-objects;

Description of common parameters:

  • field: the field the aggregation works on; for text fields you usually need field.keyword;
  • size: the number of buckets returned, 10 by default;
  • min_doc_count: minimum document count; buckets with fewer documents than this value are not returned;
  • order: bucket sorting;
  • missing: the default value to use when the field is missing;
  • ranges: configure an array of intervals, such as [{from:0}, {from:50, to:100}, {to:200}];
  • subAggregation: add a sub-bucket;

Metrics (count/avg/max/min/stats) only compute statistics. They are normally placed under a bucket and do not group documents themselves (no new sub-group is generated):

private static void AggregateSumAndAvgByAge(String index) {
    try (RestHighLevelClient rhlClient = ESClient.getClient()) {
        SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
        TermsAggregationBuilder termAggregation = AggregationBuilders.terms("ageTerm").field("age")
                .subAggregation(AggregationBuilders.sum("sum").field("number"))
                .subAggregation(AggregationBuilders.avg("avg").field("number"))
                .subAggregation(AggregationBuilders.topHits("details").size(2));
        sourceBuilder.aggregation(termAggregation);

        SearchRequest searchRequest = new SearchRequest(index);
        searchRequest.source(sourceBuilder);

        SearchResponse response = rhlClient.search(searchRequest, RequestOptions.DEFAULT);
        // getAggregations gets the aggregated data
        Aggregations aggAge = response.getAggregations();
        Terms ageTerms = aggAge.get("ageTerm");
        for (Terms.Bucket bucket : ageTerms.getBuckets()) {
            System.out.println("bucket of " + bucket.getKeyAsString()+ ", count: " + bucket.getDocCount());

            Aggregations aggNumber = bucket.getAggregations();
            ParsedSum sumTerm = aggNumber.get("sum");
            ParsedAvg avgTerm = aggNumber.get("avg");
            System.out.println("\tsum: " + sumTerm.getValue() + ", avg: " + avgTerm.getValue());

            ParsedTopHits topHits = aggNumber.get("details");
            for(SearchHit detail : topHits.getHits()){
                System.out.println("\t" + detail.getSourceAsString());
            }
        }
    } catch (Exception ex) {
        System.out.println(index + " query fail: " + ex);
    }
}

Nested aggregation

nested is equivalent to a sub-document inside a document (similar to a child table). Its query and aggregation performance is very good, while its update performance is average. courses in the example is such a sub-document, so a nested query or aggregation is needed to work with its content.

AggregationBuilder aggregation = AggregationBuilders.nested("course", "courses")
                .subAggregation(AggregationBuilders.terms("hour").field("courses.hours"));

The path is set in nested, and the field names used in its sub-aggregations must carry the path (courses.hours, not hours).

Sort

The bucket order of an aggregation is set through order, whereas TopHits uses sort (as in a query):

private static void AggregateByAge(String index) {
    try (RestHighLevelClient rhlClient = ESClient.getClient()) {
        SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
        TermsAggregationBuilder termAggregation = AggregationBuilders
                .terms("ageTerm")
                .field("age")
                .order(BucketOrder.key(true))
                .subAggregation(
                        AggregationBuilders
                                .topHits("details")
                                .sort("name.keyword", SortOrder.ASC)
                                .size(10)
                );
        sourceBuilder.aggregation(termAggregation);

        SearchRequest searchRequest = new SearchRequest(index);
        searchRequest.source(sourceBuilder);

        SearchResponse response = rhlClient.search(searchRequest, RequestOptions.DEFAULT);
        Aggregations aggAge = response.getAggregations();
        Terms ageTerms = aggAge.get("ageTerm");
        for (Terms.Bucket bucket : ageTerms.getBuckets()) {
            System.out.println("bucket of " + bucket.getKeyAsString() + ", count: " + bucket.getDocCount());

            Aggregations aggDetail = bucket.getAggregations();
            ParsedTopHits topHits = aggDetail.get("details");
            for (SearchHit detail : topHits.getHits()) {
                System.out.println("\t" + detail.getSourceAsString());
            }
        }
    } catch (Exception ex) {
        System.out.println(index + " query fail: " + ex);
    }
}

Query aggregation

Query and aggregation can be used together to aggregate only the records that meet the conditions:

private static void FilterAndAggregate(String index) {
    try (RestHighLevelClient rhlClient = ESClient.getClient()) {
        SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
        sourceBuilder.size(0); // Only the aggregation result is needed, so return no hits

        BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();
        boolQuery.must(QueryBuilders.rangeQuery("number").gt(1500));
        sourceBuilder.query(boolQuery);

        TermsAggregationBuilder termAggregation = AggregationBuilders
                .terms("ageTerm")
                .field("age")
                .order(BucketOrder.key(true));
        sourceBuilder.aggregation(termAggregation);

        SearchRequest searchRequest = new SearchRequest(index);
        searchRequest.source(sourceBuilder);

        SearchResponse response = rhlClient.search(searchRequest, RequestOptions.DEFAULT);
        // getAggregations gets the aggregated data
        Aggregations aggAge = response.getAggregations();
        Terms ageTerms = aggAge.get("ageTerm");
        for (Terms.Bucket bucket : ageTerms.getBuckets()) {
            System.out.println("bucket of " + bucket.getKeyAsString() + ", count: " + bucket.getDocCount());
        }
        
    } catch (Exception ex) {
        System.out.println(index + " query fail: " + ex);
    }
}

Collapse de-duplication

De-duplicating through aggregation returns only the statistics (counts) by default. collapse instead selects one record from each group of records that share the same field value and returns it, and it can be combined with from/size for paging:

  • getHits().length: the number of results returned by the current query (after de-duplication);
  • getTotalHits().value: the number of all records that meet the conditions; the total number after de-duplication is not available;

public void collapseSearch(String index, String field) {
    try (RestHighLevelClient rhlClient = ESClient.getClient()) {
        SearchRequest searchRequest = new SearchRequest(index);
        SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
        sourceBuilder.collapse(new CollapseBuilder(field));
        sourceBuilder.from(0);
        sourceBuilder.size(20);
        searchRequest.source(sourceBuilder);

        SearchResponse response = rhlClient.search(searchRequest, RequestOptions.DEFAULT);

        SearchHits gotHits = response.getHits();
        System.out.printf("get size: %d, total size: %d\n", gotHits.getHits().length, gotHits.getTotalHits().value);
        for (SearchHit hit : gotHits) {
            System.out.println(hit.getSourceAsString());
        }
    } catch (Exception ex) {
        System.out.println(index + " query fail: " + ex);
    }
}
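
The effect of collapse on the two counters can be pictured with an in-memory sketch. It keeps the first document seen for each distinct field value and then applies from/size, assuming the input list is already in result order. CollapseSketch, collapse and doc are hypothetical names; this only imitates the behaviour and does not use the Elasticsearch API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory imitation of collapsing search hits on one field.
class CollapseSketch {

    // Keep the first hit per distinct field value, then page with from/size
    static List<Map<String, Object>> collapse(List<Map<String, Object>> hits, String field,
                                              int from, int size) {
        Map<Object, Map<String, Object>> firstPerValue = new LinkedHashMap<>();
        for (Map<String, Object> hit : hits) {
            firstPerValue.putIfAbsent(hit.get(field), hit);
        }
        List<Map<String, Object>> collapsed = new ArrayList<>(firstPerValue.values());
        int start = Math.min(from, collapsed.size());
        int end = Math.min(start + size, collapsed.size());
        return collapsed.subList(start, end);
    }

    // Small factory for sample documents
    static Map<String, Object> doc(String name, int age) {
        Map<String, Object> d = new LinkedHashMap<>();
        d.put("name", name);
        d.put("age", age);
        return d;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> hits = List.of(doc("a", 28), doc("b", 30), doc("c", 28));
        List<Map<String, Object>> page = collapse(hits, "age", 0, 20);
        // Two documents come back, while the total number of matches is still three
        System.out.printf("get size: %d, total size: %d%n", page.size(), hits.size());
    }
}
```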

Keywords: Java ElasticSearch

Added by hawleyjr on Fri, 14 Jan 2022 23:10:58 +0200