Elasticsearch: identical content gets different scores and breaks the ranking, and how to fix it

Problem

  • Search results need to be sorted by relevance
  • When several documents have similar text and therefore the same score, a secondary sort rule, such as creation time, has to break the tie
  • It later turned out that some of these near-identical documents received slightly different scores from the rest, so they ended up in the wrong position
  • Take the following data as an example. A fuzzy search for "economic operation in the first half of the year" matches on the title, and hits with the same score should then be sorted in reverse chronological order. In practice, however, the first hit is from 2009 and the second is from 2021, which is not the expected order
[
    {
        "createDate": "2009-07-21",
        "id": "7917561",
        "title": "2009 Economic operation of the whole province in the first half of"
    },
    {
        "createDate": "2021-08-02",
        "id": "8193901",
        "title": "2021 Economic operation of the whole province in the first half of"
    },
    {
        "createDate": "2020-08-02",
        "id": "8193891",
        "title": "2020 Economic operation of the whole province in the first half of"
    },
    {
        "createDate": "2019-08-02",
        "id": "8193881",
        "title": "2019 Economic operation of the whole province in the first half of"
    },
    {
        "createDate": "2014-08-02",
        "id": "8193861",
        "title": "2014 Economic operation of the whole province in the first half of"
    },
    {
        "createDate": "2019-07-18",
        "id": "4271871",
        "title": "2019 Economic operation of the whole province in the first half of"
    },
    {
        "createDate": "2017-08-02",
        "id": "8193871",
        "title": "2017 Economic operation of the whole province in the first half of"
    },
    {
        "createDate": "2017-01-23",
        "id": "7914371",
        "title": "2016 Economic operation of the whole province in"
    },
    {
        "createDate": "2016-01-22",
        "id": "7914981",
        "title": "2015 Economic operation of the whole province in"
    },
    {
        "createDate": "2015-01-22",
        "id": "7915411",
        "title": "2014 Economic operation of the whole province in"
    },
    {
        "createDate": "2014-01-23",
        "id": "7915791",
        "title": "2013 Economic operation of the whole province in"
    },
    {
        "createDate": "2012-01-20",
        "id": "7916451",
        "title": "2011 Economic operation of the whole province in"
    },
    {
        "createDate": "2011-01-24",
        "id": "7916941",
        "title": "2010 Economic operation of the whole province in"
    },
    {
        "createDate": "2010-01-23",
        "id": "7917271",
        "title": "2009 Economic operation of the whole province in"
    }
]
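
The intended ordering (score descending, then createDate descending) can be sketched in plain Java. This is only an illustration: the `Hit` record and the score values are hypothetical and were not taken from real Elasticsearch output. It shows that the date tie-break only applies when scores are exactly equal, so even a marginal score difference puts the 2009 document first.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TieBreakSortSketch {
    // Hypothetical hit type: only the fields needed for the ordering.
    record Hit(String id, double score, LocalDate createDate) {}

    // Score descending first, then createDate descending as the tie-breaker.
    static final Comparator<Hit> BY_SCORE_THEN_DATE =
            Comparator.comparingDouble(Hit::score).reversed()
                    .thenComparing(Hit::createDate, Comparator.reverseOrder());

    // Returns the ids in the order the comparator produces.
    static List<String> sortedIds(List<Hit> hits) {
        List<Hit> copy = new ArrayList<>(hits);
        copy.sort(BY_SCORE_THEN_DATE);
        List<String> ids = new ArrayList<>();
        for (Hit h : copy) {
            ids.add(h.id());
        }
        return ids;
    }

    public static void main(String[] args) {
        // Invented scores: with equal scores the newer document comes first.
        List<Hit> equalScores = List.of(
                new Hit("7917561", 2.0, LocalDate.parse("2009-07-21")),
                new Hit("8193901", 2.0, LocalDate.parse("2021-08-02")));
        System.out.println(sortedIds(equalScores));

        // A marginally higher score on the 2009 document overrides the date.
        List<Hit> skewedScores = List.of(
                new Hit("7917561", 2.1, LocalDate.parse("2009-07-21")),
                new Hit("8193901", 2.0, LocalDate.parse("2021-08-02")));
        System.out.println(sortedIds(skewedScores));
    }
}
```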

Root cause

Shards and Lucene

  • The same data can get different relevance scores on different shards of an index
  • This is because each shard is a Lucene instance, and Lucene computes relevance with TF/IDF (since Elasticsearch 5.0 the default similarity is BM25, which relies on the same kind of per-shard term statistics). Each Lucene instance keeps only its own TF and IDF statistics, so a shard knows how often a term occurs in its own documents, not in the whole cluster

TF: short for term frequency, how often the term occurs in the current document
IDF: short for inverse document frequency, a value that decreases as the number of documents containing the term increases

  • Under TF/IDF, the more often a term occurs in the current document, the higher the score, and the fewer documents contain the term, the higher the score. A term's score therefore depends not only on the matching document but also on the number of documents in the shard and on the content of those documents
  • Documents are assigned to shards by a hash algorithm, so the shards do not always hold the same number of documents. When the total number of documents is small, the imbalance can be significant, and the same document can score differently for a term depending on the shard it lands on
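
To make the effect concrete, here is a minimal sketch of shard-local scoring. It assumes a simplified form of the classic Lucene-style formula, idf = 1 + ln(docCount / (docFreq + 1)) and score = sqrt(tf) * idf, and ignores length norms and query normalization. The shard statistics are invented for illustration.

```java
final class ShardLocalIdfSketch {
    // Simplified classic Lucene-style IDF: 1 + ln(docCount / (docFreq + 1)).
    static double idf(long docCount, long docFreq) {
        return 1.0 + Math.log((double) docCount / (docFreq + 1));
    }

    // Simplified score for one term: sqrt(tf) * idf. Length norms and
    // query normalization are deliberately left out of this sketch.
    static double score(int termFreq, long docCount, long docFreq) {
        return Math.sqrt(termFreq) * idf(docCount, docFreq);
    }

    public static void main(String[] args) {
        // Invented statistics: the same document (tf = 1) scored on two
        // shards whose local document counts and document frequencies differ.
        double onShardA = score(1, 1000, 50);
        double onShardB = score(1, 800, 10);
        System.out.printf("shard A: %.3f, shard B: %.3f%n", onShardA, onShardB);
    }
}
```

Because each shard plugs its own docCount and docFreq into the formula, two copies of the same title can come back with different scores.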

searchType

QUERY_THEN_FETCH

  • Elasticsearch uses QUERY_THEN_FETCH by default when searching
  • According to the official documentation, a QUERY_THEN_FETCH search proceeds as follows:
    • Send the query to each shard
    • Find all matching documents on each shard and score them with the local TF/IDF statistics
    • Build a priority queue of the results (sorting, pagination, etc.)
    • Return enough metadata about the results to the requesting node. Note that the document content is not included
    • On the requesting node, merge and sort the scores from all shards to select the documents for the requested page
    • Finally, fetch the actual documents, now including their content, from the shards that hold them
    • Wrap the results and return them to the user
  • As the steps show, the default mode does not guarantee that identical documents get identical scores
  • In practice, when precision requirements are not strict, the results are still good enough for general retrieval scenarios
  • Elasticsearch assigns documents to shards with a hash algorithm. When the data volume is large, the hash evens out the number of documents per shard, so the default mode also gives good results
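
The query and merge steps above can be sketched as follows. This is a simplified model, not Elasticsearch's implementation: `ShardHit`, `queryPhase` and `merge` are hypothetical names, and the scores are invented. Each shard returns only ids and locally computed scores, and the coordinating node merges them to decide which documents to fetch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class QueryThenFetchSketch {
    // What a shard returns in the query phase: ids and scores, no content.
    record ShardHit(int shard, String id, double score) {}

    static final Comparator<ShardHit> BY_SCORE_DESC =
            Comparator.comparingDouble(ShardHit::score).reversed();

    // Query phase on one shard: keep the local top `size` hits.
    static List<ShardHit> queryPhase(List<ShardHit> localHits, int size) {
        return localHits.stream().sorted(BY_SCORE_DESC).limit(size).toList();
    }

    // Coordinating node: merge the per-shard results and keep the global
    // top `size`. These are the hits whose content is fetched afterwards.
    static List<ShardHit> merge(List<List<ShardHit>> perShardTopHits, int size) {
        List<ShardHit> all = new ArrayList<>();
        perShardTopHits.forEach(all::addAll);
        all.sort(BY_SCORE_DESC);
        return List.copyOf(all.subList(0, Math.min(size, all.size())));
    }

    public static void main(String[] args) {
        // Invented, shard-local scores.
        List<ShardHit> shard0 = List.of(new ShardHit(0, "a", 1.2), new ShardHit(0, "b", 0.7));
        List<ShardHit> shard1 = List.of(new ShardHit(1, "c", 1.5), new ShardHit(1, "d", 0.9));
        List<ShardHit> toFetch = merge(List.of(queryPhase(shard0, 2), queryPhase(shard1, 2)), 2);
        System.out.println(toFetch);
    }
}
```

The merge compares scores that were computed from different local statistics, which is exactly where the inconsistency enters.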

DFS_QUERY_THEN_FETCH

  • The search_type parameter selects another search mode. DFS_QUERY_THEN_FETCH is Elasticsearch's solution to the problem above
  • It is roughly the same as QUERY_THEN_FETCH
  • The difference is an initial scatter phase in which DFS_QUERY_THEN_FETCH collects the term statistics (TF/IDF) from all shards for more accurate scoring
  • Each shard then scores the query with the global TF/IDF statistics gathered in that pre-query phase
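
The idea of the pre-query phase can be sketched like this. It is a simplified model under stated assumptions: `TermStats` and `combine` are hypothetical names, the IDF formula is the simplified classic form 1 + ln(docCount / (docFreq + 1)), and the shard statistics are invented. Summing the per-shard statistics gives one global value that every shard uses, so identical documents get identical scores.

```java
import java.util.List;

final class DfsGlobalStatsSketch {
    // Term statistics as one shard sees them (or, after combine, the cluster).
    record TermStats(long docCount, long docFreq) {}

    // Pre-query phase: add up the statistics reported by every shard.
    static TermStats combine(List<TermStats> perShard) {
        long docCount = 0;
        long docFreq = 0;
        for (TermStats s : perShard) {
            docCount += s.docCount();
            docFreq += s.docFreq();
        }
        return new TermStats(docCount, docFreq);
    }

    // Simplified classic Lucene-style IDF: 1 + ln(docCount / (docFreq + 1)).
    static double idf(TermStats s) {
        return 1.0 + Math.log((double) s.docCount() / (s.docFreq() + 1));
    }

    public static void main(String[] args) {
        TermStats shardA = new TermStats(1000, 50);
        TermStats shardB = new TermStats(800, 10);
        TermStats global = combine(List.of(shardA, shardB));
        // Local IDFs differ per shard; the global IDF is shared by all shards.
        System.out.printf("local A: %.3f, local B: %.3f, global: %.3f%n",
                idf(shardA), idf(shardB), idf(global));
    }
}
```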

Source code

/*
 * Licensed to Elasticsearch under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied.  See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */

package org.elasticsearch.action.search;

/**
 * Search type represent the manner at which the search operation is executed.
 *
 *
 */
public enum SearchType {
    /**
     * Same as {@link #QUERY_THEN_FETCH}, except for an initial scatter phase which goes and computes the distributed
     * term frequencies for more accurate scoring.
     */
    DFS_QUERY_THEN_FETCH((byte) 0),
    /**
     * The query is executed against all shards, but only enough information is returned (not the document content).
     * The results are then sorted and ranked, and based on it, only the relevant shards are asked for the actual
     * document content. The return number of hits is exactly as specified in size, since they are the only ones that
     * are fetched. This is very handy when the index has a lot of shards (not replicas, shard id groups).
     */
    QUERY_THEN_FETCH((byte) 1),
    // 2 used to be DFS_QUERY_AND_FETCH

    /**
     * Only used for pre 5.3 request where this type is still needed
     */
    @Deprecated
    QUERY_AND_FETCH((byte) 3);

    /**
     * The default search type ({@link #QUERY_THEN_FETCH}.
     */
    public static final SearchType DEFAULT = QUERY_THEN_FETCH;

    private byte id;

    SearchType(byte id) {
        this.id = id;
    }

    /**
     * The internal id of the type.
     */
    public byte id() {
        return this.id;
    }

    /**
     * Constructs search type based on the internal id.
     */
    public static SearchType fromId(byte id) {
        if (id == 0) {
            return DFS_QUERY_THEN_FETCH;
        } else if (id == 1
            || id == 3) { // This is a BWC layer for pre 5.3 indices where QUERY_AND_FETCH was id 3 but we don't have it anymore from 5.3 on
            return QUERY_THEN_FETCH;
        } else {
            throw new IllegalArgumentException("No search type for [" + id + "]");
        }
    }

    /**
     * The a string representation search type to execute, defaults to {@link SearchType#DEFAULT}. Can be
     * one of "dfs_query_then_fetch"/"dfsQueryThenFetch", "dfs_query_and_fetch"/"dfsQueryAndFetch",
     * "query_then_fetch"/"queryThenFetch" and "query_and_fetch"/"queryAndFetch".
     */
    public static SearchType fromString(String searchType) {
        if (searchType == null) {
            return SearchType.DEFAULT;
        }
        if ("dfs_query_then_fetch".equals(searchType)) {
            return SearchType.DFS_QUERY_THEN_FETCH;
        } else if ("query_then_fetch".equals(searchType)) {
            return SearchType.QUERY_THEN_FETCH;
        } else {
            throw new IllegalArgumentException("No search type for [" + searchType + "]");
        }
    }
}

Solution

  • If scores must be consistent, use DFS_QUERY_THEN_FETCH. It may cost a little query performance, which is negligible in our production environment
searchRequestBuilder.setSearchType(SearchType.DFS_QUERY_THEN_FETCH).get();
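  • If the amount of data is small, consider using a single shard by setting number_of_shards=1 in the index configuration. Note that number_of_shards is fixed when the index is created, so an existing index has to be recreated or reindexed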
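
For callers that talk to the REST API instead of the transport client, the same search type can be requested with the `search_type` query parameter. The following is a minimal sketch using the JDK's `java.net.http` classes. The host `http://localhost:9200`, the index name `articles`, and the assumption that `createDate` is mapped as a date field are all illustrative and do not come from the original setup. The request is only built here, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class DfsSearchRequestSketch {
    // Match on the title, then sort by score and break ties by createDate.
    static final String BODY = """
            {
              "query": { "match": { "title": "economic operation in the first half of the year" } },
              "sort": [ { "_score": "desc" }, { "createDate": "desc" } ]
            }
            """;

    // Builds a search request that asks for dfs_query_then_fetch.
    // baseUrl and index are assumptions supplied by the caller.
    static HttpRequest build(String baseUrl, String index) {
        URI uri = URI.create(baseUrl + "/" + index + "/_search?search_type=dfs_query_then_fetch");
        return HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(BODY))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = build("http://localhost:9200", "articles");
        // Sending it would need an HttpClient and a running cluster.
        System.out.println(request.method() + " " + request.uri());
    }
}
```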

Keywords: Java ElasticSearch

Added by lipun4u on Fri, 31 Dec 2021 23:14:25 +0200