Contents
Spring Boot integrated with Jest
Sorting and scoring optimization
Implementation of full field matching with the highest score
Optimization of Chinese search keyword segmentation
Paging query processing of big data records
Processing of secondary search
ElasticSearch usage - query
Official query documentation: Query. ES queries are built from query clauses. In daily use, match covers most cases; when several fields need to be searched, multi_match is added, and multiple clauses are combined through a bool query with should or must. For a Java backend, Jest is recommended for operating ES.
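As a quick illustration of those building blocks, here is a minimal sketch that combines match, multi_match, should and must with the standard Elasticsearch QueryBuilders used throughout this article (the field names "title" and "content" are placeholders, not from the original):

// A minimal sketch; "title" and "content" are placeholder field names
BoolQueryBuilder bool = QueryBuilders.boolQuery()
        // must: the document has to match this clause
        .must(QueryBuilders.matchQuery("title", "elasticsearch"))
        // should: optional clause that raises the score when it matches
        .should(QueryBuilders.multiMatchQuery("spring boot", "title", "content"));

// Wrap the bool query into a search source; toString() prints the generated query DSL
SearchSourceBuilder source = new SearchSourceBuilder().query(bool);
log.debug("Generated query: {}", source.toString());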
Spring Boot integrated with Jest
Add the following configuration in the application.properties file
es.server.url=http://xxxx:9200
es.user=xxx
es.password=xxx
Add a configuration bean to load the ES configuration and initialize the custom EsClient class. EsClient is a self-encapsulated utility class that wraps Jest operations on ES; the relevant code is listed where it is used.
/**
 * ElasticSearch configuration file
 */
@Configuration
public class EsConfig {

    /** ES server URL */
    private String esServerUrl;

    /** ES user */
    private String esUser;

    /** ES user password */
    private String esPassword;

    /**
     * ES server URL
     * @param esServerUrl
     */
    @Value(value = "${es.server.url}")
    public void setEsServerUrl(String esServerUrl) {
        this.esServerUrl = esServerUrl;
    }

    /**
     * ES user
     * @param esUser
     */
    @Value(value = "${es.user}")
    public void setEsUser(String esUser) {
        this.esUser = esUser;
    }

    /**
     * ES user password
     * @param esPassword
     */
    @Value(value = "${es.password}")
    public void setEsPassword(String esPassword) {
        this.esPassword = esPassword;
    }

    /**
     * Get the ElasticSearch client
     * @return
     */
    @Bean
    public EsClient getEsClient() {
        return EsClient.getInstance(esServerUrl, esUser, esPassword);
    }
}
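The original post does not show EsClient.getInstance at this point. A minimal sketch of what such a factory method could look like with Jest's JestClientFactory and HttpClientConfig (the class structure and defaults here are assumptions, not the author's actual code):

import io.searchbox.client.JestClient;
import io.searchbox.client.JestClientFactory;
import io.searchbox.client.config.HttpClientConfig;

public class EsClient {

    private final JestClient client;

    private EsClient(JestClient client) {
        this.client = client;
    }

    /**
     * Build an EsClient backed by a Jest HTTP client (sketch, not the author's original code).
     */
    public static EsClient getInstance(String serverUrl, String user, String password) {
        JestClientFactory factory = new JestClientFactory();
        factory.setHttpClientConfig(new HttpClientConfig.Builder(serverUrl)
                .defaultCredentials(user, password)   // basic auth, as suggested by es.user/es.password
                .multiThreaded(true)
                .build());
        return new EsClient(factory.getObject());
    }
}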
Inject it with @Autowired where it is needed.
@RunWith(SpringRunner.class)
@SpringBootTest
@Log4j2
public class EsClientTests {

    @Autowired
    EsClient esClient;

    @Test
    public void testCount() {
        esClient.getIndexCount("order");
    }
}
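getIndexCount is part of the author's EsClient wrapper and is not listed in the original. A plausible sketch using Jest's Count action, assuming "client" is the underlying JestClient held by EsClient (method name and logging are assumptions):

/**
 * Count all documents in an index (sketch; assumes 'client' is the underlying JestClient).
 */
public long getIndexCount(String index) {
    try {
        CountResult result = client.execute(new Count.Builder().addIndex(index).build());
        if (result.isSucceeded()) {
            return result.getCount().longValue();
        }
    } catch (IOException ex) {
        log.warn("Count query failed", ex);
    }
    return 0L;
}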
Paging query
Paging queries mainly use the from and size parameters; refer to the official documentation: From/Size.
A manual paging query looks like this:
GET test/_search
{
  "from": 0,
  "size": 10,
  "query": {
    "match_all": {}
  }
}
Querying through Jest: first build a SearchSourceBuilder object.
@Test
public void testSearchByPage() {
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery();
    // Match query on the order code field
    MatchQueryBuilder codeMatchQueryBuilder = QueryBuilders.matchQuery("baseInfo.code.chinese", "Query keyword");
    boolQueryBuilder.should(codeMatchQueryBuilder);
    // Which record to start from (offset)
    searchSourceBuilder.from(0);
    // How many records per page
    searchSourceBuilder.size(10);
    // Set the query criteria
    searchSourceBuilder.query(boolQueryBuilder);
    // Sort by matching score
    searchSourceBuilder.sort("_score");
    // Then sort by creation date, descending
    searchSourceBuilder.sort("baseInfo.createDate", SortOrder.DESC);
    log.info(JsonUtils.objectToJson(esService.searchByPage(searchSourceBuilder)));
}
Implementation of esService.searchByPage:
@Override
public List<OrderAllInfoDTO> searchByPage(SearchSourceBuilder searchSourceBuilder) {
    // Set the list of fields returned by the query and the list of ignored fields
    searchSourceBuilder.fetchSource(ALL_INFO_FIELD_LIST, null);
    // Call the client's paging search, passing the index name, query statement and result type
    List<OrderAllInfoDTO> list = esClient.searchByPage(
            OrderConstant.INDEX_NAME,
            searchSourceBuilder,
            ORDER_ALL_INFO_DTO_TYPE);
    return list;
}
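ALL_INFO_FIELD_LIST is not shown in the original; it is presumably an array of field names passed to fetchSource as the include list (null here means nothing is excluded). A hypothetical declaration, with illustrative field names only:

// Hypothetical include list for fetchSource; the field names are illustrative, not from the original
private static final String[] ALL_INFO_FIELD_LIST = new String[] {
        "baseInfo.code",
        "baseInfo.createDate",
        "baseInfo.status"
};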
The content of ORDER_ALL_INFO_DTO_TYPE is:
private static final TypeReference<OrderAllInfoDTO> ORDER_ALL_INFO_DTO_TYPE = new TypeReference<OrderAllInfoDTO>() {};
Implementation of esClient.searchByPage:
public <T> List<T> searchByPage(String index, SearchSourceBuilder searchSourceBuilder, TypeReference<T> type) {
    String searchStr = searchSourceBuilder.toString();
    log.debug("ES query string: {}", searchStr);
    // typeName is fixed to the default type name '_doc'
    Search search = new Search.Builder(searchStr).addIndex(index).addType(typeName).build();
    return searchByPage(search, type);
}

public <T> List<T> searchByPage(Action clientRequest, TypeReference<T> type) {
    try {
        JestResult result = client.execute(clientRequest);
        if (result.isSucceeded()) {
            List<String> sourceList = result.getSourceAsStringList();
            if (CollectionUtils.isNotEmpty(sourceList)) {
                List<T> list = new ArrayList<>();
                for (String str : sourceList) {
                    list.add(JsonUtils.JsonToObject(str, type));
                }
                return list;
            }
        }
    } catch (IOException ex) {
        log.warn("Paging search failed", ex);
    }
    return Collections.emptyList();
}
JsonUtils.JsonToObject is a custom utility method that wraps Jackson. Its implementation is as follows:
public static <T> T JsonToObject(String json, TypeReference<T> javaType) {
    try {
        // OM is the shared Jackson ObjectMapper instance
        return OM.readValue(json, javaType);
    } catch (Exception e) {
        log.error("Failed to convert JSON string to object", e);
    }
    return null;
}
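The objectToJson counterpart used throughout the examples, together with the OM field, is not listed in the original. A minimal sketch of that part of JsonUtils, assuming OM is a shared com.fasterxml.jackson.databind.ObjectMapper:

// Shared Jackson ObjectMapper (assumption; the full JsonUtils class is not shown in the original)
private static final ObjectMapper OM = new ObjectMapper();

/**
 * Serialize an object to a JSON string, returning null on failure (sketch).
 */
public static String objectToJson(Object obj) {
    try {
        return OM.writeValueAsString(obj);
    } catch (Exception e) {
        log.error("Failed to convert object to JSON string", e);
    }
    return null;
}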
Note that from + size cannot exceed the index.max_result_window parameter value, otherwise data beyond that window cannot be retrieved; the default value of index.max_result_window is 10000. Although the official documentation does not recommend setting this value too large, if the business really requires it, raising this parameter is still the pragmatic choice, because the two alternatives recommended officially cannot fully support random page jumps, scrolling back and forth, or changing the number of records per page. In practice, a better experience can be achieved by limiting how far a user can jump at once: for example, forbid jumping directly to the last page, disallow entering an arbitrary page number, and only show the previous and next 10 page numbers at a time. Modification method:
PUT /my_index/_settings
{
  "index.max_result_window": 1000000
}
Sorting and scoring optimization
Sorting usually just uses _score: ES calculates a score for each doc in the current search according to the specified scoring algorithm, and then sorts the results from the highest score to the lowest. Scoring rule reference: Research on the scoring mechanism of ElasticSearch (5.3.0).
Official sorting documentation: Sort
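For reference, sorting by score and then by a field can also be expressed with SortBuilders; a minimal sketch (the date field name follows the examples in this article):

// Sketch: primary sort by relevance score, secondary sort by creation date (newest first)
SearchSourceBuilder source = new SearchSourceBuilder()
        .sort(SortBuilders.scoreSort())   // _score sorts descending by default
        .sort(SortBuilders.fieldSort("baseInfo.createDate").order(SortOrder.DESC));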
Implementation of full field matching with the highest score
ES's scoring algorithm does not automatically give whole-phrase (exact) matches the highest priority; it aggregates a score from multiple dimensions. Therefore, if you want documents that exactly match the keyword to be ranked first, you need to modify both the index mapping and the query statement. The modifications are as follows:
Mapping: add a raw subfield with the keyword data type, which is not analyzed (only a normalizer is applied).
"name": { "type": "text", "analyzer": "custome_standard", "fields": { "raw": { "type": "keyword", "normalizer": "custome_normalizer" }, "chinese": { "type": "text", "analyzer": "custome_chinese", "search_analyzer":"custome_chinese_search" } } }
Search: in addition to the normal search condition, a search condition on the raw field is added, and its weight (boost) is set to 10 times that of the ordinary search condition:
@Test
public void testSearchByPage() {
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery();
    // Normal match query on the analyzed sub-field
    MatchQueryBuilder codeMatchQueryBuilder = QueryBuilders.matchQuery("baseInfo.code.chinese", keyword);
    // Exact-match query on the raw keyword sub-field, weighted 10x
    MatchQueryBuilder codeRawMatchQueryBuilder = QueryBuilders.matchQuery("baseInfo.code.raw", keyword);
    codeRawMatchQueryBuilder.boost(10f);
    boolQueryBuilder.should(codeMatchQueryBuilder);
    boolQueryBuilder.should(codeRawMatchQueryBuilder);
    // Which record to start from (offset)
    searchSourceBuilder.from(0);
    // How many records per page
    searchSourceBuilder.size(10);
    // Set the query criteria
    searchSourceBuilder.query(boolQueryBuilder);
    // Sort by matching score
    searchSourceBuilder.sort("_score");
    // Then sort by creation date, descending
    searchSourceBuilder.sort("baseInfo.createDate", SortOrder.DESC);
    log.info(JsonUtils.objectToJson(esService.searchByPage(searchSourceBuilder)));
}
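The query body generated by the builder above is roughly the following (a sketch of the DSL, with paging and sorting omitted and default match parameters left out):

{
  "query": {
    "bool": {
      "should": [
        { "match": { "baseInfo.code.chinese": { "query": "keyword" } } },
        { "match": { "baseInfo.code.raw": { "query": "keyword", "boost": 10.0 } } }
      ]
    }
  }
}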
Optimization of Chinese search keyword segmentation
The commonly used Chinese analyzer is the IK analyzer (official repository: elasticsearch-analysis-ik). It provides two tokenizers, ik_smart and ik_max_word: ik_max_word splits the text into as many words as possible, while ik_smart performs a coarser, smarter segmentation with some semantic analysis. The actual results are as follows:
ik_smart test:
POST _analyze
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国"
}
Test results:
{ "tokens": [ { "token": "The People's Republic of China", "start_offset": 0, "end_offset": 7, "type": "CN_WORD", "position": 0 } ] }
ik_max_word test:
POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国"
}
{ "tokens": [ { "token": "The People's Republic of China", "start_offset": 0, "end_offset": 7, "type": "CN_WORD", "position": 0 }, { "token": "Chinese people", "start_offset": 0, "end_offset": 4, "type": "CN_WORD", "position": 1 }, { "token": "the chinese people", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 2 }, { "token": "Chinese", "start_offset": 1, "end_offset": 3, "type": "CN_WORD", "position": 3 }, { "token": "People's Republic of China", "start_offset": 2, "end_offset": 7, "type": "CN_WORD", "position": 4 }, { "token": "the people", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 5 }, { "token": "republic", "start_offset": 4, "end_offset": 7, "type": "CN_WORD", "position": 6 }, { "token": "republic", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 7 }, { "token": "country", "start_offset": 6, "end_offset": 7, "type": "CN_CHAR", "position": 8 } ] }
So if ik_max_word alone is used at search time, a search for the keyword 人民共和国 (People's Republic) is matched against every token produced by segmentation, which yields too many results and an unexpected ordering. For example, a doc that contains many occurrences of 中华 may end up scoring higher than a doc that contains 人民共和国 exactly once, and the difference is even larger when the search keywords contain common function words such as 的 or 是. On the other hand, if only ik_smart is used, a search keyword such as 共和国 may fail to match anything at all. Therefore, the recommendation is to use ik_max_word at index time and ik_smart at search time: the indexed text is segmented finely enough, while matching at search time stays precise. The index settings and mapping are defined as follows:
PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "pinyin_filter": {
          "type": "pinyin"
        }
      },
      "analyzer": {
        "custome_standard": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "asciifolding" ]
        },
        "custome_chinese": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": [ "lowercase", "asciifolding" ]
        },
        "custome_chinese_search": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "filter": [ "lowercase", "asciifolding" ]
        }
      },
      "normalizer": {
        "custome_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": [ "lowercase", "asciifolding" ]
        }
      }
    },
    "index.mapping.coerce": false,
    "index.mapping.ignore_malformed": false,
    "index.gc_deletes": "0s"
  },
  "mappings": {
    "_doc": {
      "properties": {
        "id": {
          "type": "long"
        },
        "name": {
          "type": "text",
          "analyzer": "custome_standard",
          "fields": {
            "raw": {
              "type": "keyword",
              "normalizer": "custome_normalizer"
            },
            "chinese": {
              "type": "text",
              "analyzer": "custome_chinese",
              "search_analyzer": "custome_chinese_search"
            }
          }
        }
      }
    }
  }
}
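To verify that the index-level analyzers behave as intended, the _analyze API can be pointed at the index itself (a quick check, mirroring the earlier examples):

POST test/_analyze
{
  "analyzer": "custome_chinese_search",
  "text": "中华人民共和国"
}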
Paging query processing of big data records
For paging queries over large data sets (more than a million records), the official documentation does not recommend the from + size approach, because from + size is limited by the index's index.max_result_window parameter (from + size cannot exceed index.max_result_window), and a from + size query actually has to sort the results first: when a query spans multiple shards, each shard sorts its own results and the coordinating node then merges and sorts them again. Suppose we search an index with five primary shards. When we request the first page of results (results 1 to 10), each shard produces its own top 10 and returns them to the coordinating node, which sorts those 50 results to obtain the overall top 10.
Now suppose we request page 1000, i.e. results 10001 to 10010. Everything works in the same way, except that each shard has to produce its top 10010 results. The coordinating node then sorts all 50050 results and finally discards 50040 of them.
You can see that, in a distributed system, the cost of sorting results grows rapidly the deeper we page, multiplied by the number of shards. This is why web search engines do not return more than about 1000 results for any query.
The official documentation does offer two alternatives, Scroll and Search After, but Scroll is essentially a forward-only cursor with a fixed page size: the page size cannot be changed and you cannot move backwards. Search After allows the page size to change, but it still cannot move backwards. Therefore the scheme has to be chosen according to the actual scenario, and cooperation from the product side is also needed:
- If standard paging is required (random page jumps and an adjustable page size), the only option is from + size with a raised index.max_result_window parameter value. During product design it is best to restrict how far users can jump, for example only allowing jumps within plus or minus 5 pages of the current page and capping the page size at 50 records; otherwise a user jumping straight to the last page can trigger the performance problems described above.
- If the scenario is incremental loading ("load more"), Search After is more appropriate.
- If the scenario is data export, Scroll is more convenient to handle; see the sketch below.
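As an illustration of the export case, here is a minimal sketch of scrolling through an index with Jest (the index name, scroll window and batch size are assumptions, not values from the original):

/**
 * Sketch: scroll through all documents of an index in batches via Jest.
 */
public void exportAll(JestClient client) throws IOException {
    Search search = new Search.Builder("{\"size\":1000,\"query\":{\"match_all\":{}}}")
            .addIndex("order")
            .setParameter(Parameters.SCROLL, "5m")   // keep the scroll context alive for 5 minutes
            .build();
    SearchResult firstPage = client.execute(search);
    String scrollId = firstPage.getJsonObject().get("_scroll_id").getAsString();
    List<String> batch = firstPage.getSourceAsStringList();

    while (batch != null && !batch.isEmpty()) {
        // ... process the current batch here ...
        SearchScroll scroll = new SearchScroll.Builder(scrollId, "5m").build();
        JestResult result = client.execute(scroll);
        scrollId = result.getJsonObject().get("_scroll_id").getAsString();
        batch = result.getSourceAsStringList();
    }
}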
Processing of secondary search
Secondary search ("search within results") means searching again with new query criteria applied on top of the previous search. The implementation scheme is as follows:
- When the back end receives a search request, it checks whether a keyword was entered. If so, it records the search conditions, serializes them to JSON, stores them in the database, and generates a unique search-condition id that is returned to the front end.
- When the user explicitly performs a secondary search, the front end submits the new query criteria together with the previously received search-condition id. If it is not a secondary search, the search id is set to null.
- When the back end receives the query, it checks whether a search-condition id is present. If so, it loads the saved search conditions for that id, appends the new condition to the old query via a bool must, serializes the combined conditions to JSON and stores them in the database again, and returns the newly generated search-condition id to the front end along with the search results.
public PageResult<OrderPageInfoResult> orderSearchList(OrderSearchParameter searchParameter, SystemUser user) {
    PageResult<OrderPageInfoResult> result = PageResult.empty();
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery();
    // Get the list of query criteria (previous criteria plus the current one)
    List<OrderSearchParameter> searchParameterList = getRefineSearchParameter(searchParameter);
    // Build the query statement
    completeQuery(boolQueryBuilder, searchParameterList);
    // Advanced search (filtering)
    completeFilter(boolQueryBuilder, searchParameterList);
    searchSourceBuilder.query(boolQueryBuilder);
    // Get the total number of matching records
    Integer total = orderEsService.countBySearch(searchSourceBuilder);
    searchSourceBuilder.sort("_score");
    searchSourceBuilder.sort("baseInfo.createDate", SortOrder.DESC);
    if (total > 0) {
        // Adjust paging parameters according to the total number of records
        searchParameter.correctPageParams(total);
        result = new PageResult<>(searchParameter.getOffset(), searchParameter.getLimit(), total);
        // Which record to start from
        searchSourceBuilder.from(searchParameter.getOffset());
        // How many records per page
        searchSourceBuilder.size(searchParameter.getLimit());
        // Set the returned items
        result.setItems(orderEsService.searchByPage(searchSourceBuilder));
    }
    // Save the new query criteria
    if (StringUtils.isNotBlank(searchParameter.getKeyword()) || searchParameter.getSearchParameterId() != null) {
        result.setSearchParameterId(saveSearchParameterList(searchParameterList, user));
    }
    return result;
}

private List<OrderSearchParameter> getRefineSearchParameter(OrderSearchParameter searchParameter) {
    if (searchParameter == null) {
        return Collections.emptyList();
    }
    Integer searchParameterId = searchParameter.getSearchParameterId();
    List<OrderSearchParameter> list = null;
    if (searchParameterId != null) {
        SystemSearchParameter refineSearchParameter = systemSearchParameterService.getById(searchParameterId);
        if (refineSearchParameter != null) {
            list = JsonUtils.JsonToObject(refineSearchParameter.getValue(),
                    new TypeReference<List<OrderSearchParameter>>() {});
        }
    }
    if (CollectionUtils.isEmpty(list)) {
        list = new ArrayList<>();
    }
    // Append the current query criteria to the previous ones
    list.add(searchParameter);
    return list;
}

private Integer saveSearchParameterList(List<OrderSearchParameter> searchParameterList, SystemUser user) {
    SystemSearchParameterForm form = new SystemSearchParameterForm();
    form.setType(SystemSearchTypeEnum.ORDER_SEARCH.value());
    form.setValue(JsonUtils.objectToJson(searchParameterList));
    return systemSearchParameterService.insert(form, user, now());
}
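completeQuery and completeFilter are referenced above but not listed in the original. As a rough sketch of how the accumulated keywords could be combined with must so that each secondary search narrows the previous results (the OrderSearchParameter accessors used here are assumptions):

/**
 * Sketch: each accumulated search parameter contributes its keyword as a must clause.
 */
private void completeQuery(BoolQueryBuilder boolQueryBuilder, List<OrderSearchParameter> searchParameterList) {
    for (OrderSearchParameter parameter : searchParameterList) {
        if (parameter != null && StringUtils.isNotBlank(parameter.getKeyword())) {
            BoolQueryBuilder keywordQuery = QueryBuilders.boolQuery();
            // Each keyword searches the analyzed and raw sub-fields, the raw field weighted higher
            keywordQuery.should(QueryBuilders.matchQuery("baseInfo.code.chinese", parameter.getKeyword()));
            keywordQuery.should(QueryBuilders.matchQuery("baseInfo.code.raw", parameter.getKeyword()).boost(10f));
            // must: every accumulated keyword condition has to hold
            boolQueryBuilder.must(keywordQuery);
        }
    }
}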