Contents
Spring Boot integrated with Jest
Sorting and scoring optimization
Implementation of full field matching with the highest score
Optimization of Chinese search keyword segmentation
Paging query processing of big data records
Processing of secondary search
ElasticSearch usage - query
Official query documentation: Query. ES queries are built from query clauses. In daily use, match covers most cases; when several fields need to be searched, multi_match is added, and multiple clauses are combined through a bool query with should or must. For a Java backend, Jest is recommended for operating ES.
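As a quick illustration of those building blocks, here is a minimal sketch that combines match, multi_match, should and must with the standard Elasticsearch QueryBuilders used throughout this article (the field names "title" and "content" are placeholders, not from the original):

// A minimal sketch; "title" and "content" are placeholder field names
BoolQueryBuilder bool = QueryBuilders.boolQuery()
        // must: the document has to match this clause
        .must(QueryBuilders.matchQuery("title", "elasticsearch"))
        // should: optional clause that raises the score when it matches
        .should(QueryBuilders.multiMatchQuery("spring boot", "title", "content"));

// Wrap the bool query into a search source; toString() prints the generated query DSL
SearchSourceBuilder source = new SearchSourceBuilder().query(bool);
log.debug("Generated query: {}", source.toString());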
Spring Boot integrated with Jest
Add the following configuration in the application.properties file
es.server.url=http://xxxx:9200
es.user=xxx
es.password=xxx
Add a configuration bean to load the ES configuration and initialize the custom EsClient class. EsClient is a self-encapsulated utility class that wraps Jest operations on ES; the relevant code is listed where it is used.
/**
 * ElasticSearch configuration file
 */
@Configuration
public class EsConfig {

    /** ES server URL */
    private String esServerUrl;

    /** ES user */
    private String esUser;

    /** ES user password */
    private String esPassword;

    /**
     * ES server URL
     * @param esServerUrl
     */
    @Value(value = "${es.server.url}")
    public void setEsServerUrl(String esServerUrl) {
        this.esServerUrl = esServerUrl;
    }

    /**
     * ES user
     * @param esUser
     */
    @Value(value = "${es.user}")
    public void setEsUser(String esUser) {
        this.esUser = esUser;
    }

    /**
     * ES user password
     * @param esPassword
     */
    @Value(value = "${es.password}")
    public void setEsPassword(String esPassword) {
        this.esPassword = esPassword;
    }

    /**
     * Get the ElasticSearch client
     * @return
     */
    @Bean
    public EsClient getEsClient() {
        return EsClient.getInstance(esServerUrl, esUser, esPassword);
    }
}
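The original post does not show EsClient.getInstance at this point. A minimal sketch of what such a factory method could look like with Jest's JestClientFactory and HttpClientConfig (the class structure and defaults here are assumptions, not the author's actual code):

import io.searchbox.client.JestClient;
import io.searchbox.client.JestClientFactory;
import io.searchbox.client.config.HttpClientConfig;

public class EsClient {

    private final JestClient client;

    private EsClient(JestClient client) {
        this.client = client;
    }

    /**
     * Build an EsClient backed by a Jest HTTP client (sketch, not the author's original code).
     */
    public static EsClient getInstance(String serverUrl, String user, String password) {
        JestClientFactory factory = new JestClientFactory();
        factory.setHttpClientConfig(new HttpClientConfig.Builder(serverUrl)
                .defaultCredentials(user, password)   // basic auth, as suggested by es.user/es.password
                .multiThreaded(true)
                .build());
        return new EsClient(factory.getObject());
    }
}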
Inject it with @Autowired where it is needed.
@RunWith(SpringRunner.class)
@SpringBootTest
@Log4j2
public class EsClientTests {

    @Autowired
    EsClient esClient;

    @Test
    public void testCount() {
        esClient.getIndexCount("order");
    }
}
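getIndexCount is part of the author's EsClient wrapper and is not listed in the original. A plausible sketch using Jest's Count action, assuming "client" is the underlying JestClient held by EsClient (method name and logging are assumptions):

/**
 * Count all documents in an index (sketch; assumes 'client' is the underlying JestClient).
 */
public long getIndexCount(String index) {
    try {
        CountResult result = client.execute(new Count.Builder().addIndex(index).build());
        if (result.isSucceeded()) {
            return result.getCount().longValue();
        }
    } catch (IOException ex) {
        log.warn("Count query failed", ex);
    }
    return 0L;
}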
Paging query
Paging queries mainly use the from and size parameters; refer to the official documentation: From/Size.
A manual paging query looks like this:
GET test/_search
{
  "from": 0,
  "size": 10,
  "query": {
    "match_all": {}
  }
}
Querying through Jest: first build a SearchSourceBuilder object.
@Test
public void testSearchByPage() {
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery();
    // Match query on the order code field
    MatchQueryBuilder codeMatchQueryBuilder = QueryBuilders.matchQuery("baseInfo.code.chinese", "Query keyword");
    boolQueryBuilder.should(codeMatchQueryBuilder);
    // Which record to start from (offset)
    searchSourceBuilder.from(0);
    // How many records per page
    searchSourceBuilder.size(10);
    // Set the query criteria
    searchSourceBuilder.query(boolQueryBuilder);
    // Sort by matching score
    searchSourceBuilder.sort("_score");
    // Then sort by creation date, descending
    searchSourceBuilder.sort("baseInfo.createDate", SortOrder.DESC);
    log.info(JsonUtils.objectToJson(esService.searchByPage(searchSourceBuilder)));
}
Implementation of esService.searchByPage:
@Override
public List<OrderAllInfoDTO> searchByPage(SearchSourceBuilder searchSourceBuilder) {
    // Set the list of fields returned by the query and the list of ignored fields
    searchSourceBuilder.fetchSource(ALL_INFO_FIELD_LIST, null);
    // Call the client's paging search, passing the index name, query statement and result type
    List<OrderAllInfoDTO> list = esClient.searchByPage(
            OrderConstant.INDEX_NAME,
            searchSourceBuilder,
            ORDER_ALL_INFO_DTO_TYPE);
    return list;
}
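ALL_INFO_FIELD_LIST is not shown in the original; it is presumably an array of field names passed to fetchSource as the include list (null here means nothing is excluded). A hypothetical declaration, with illustrative field names only:

// Hypothetical include list for fetchSource; the field names are illustrative, not from the original
private static final String[] ALL_INFO_FIELD_LIST = new String[] {
        "baseInfo.code",
        "baseInfo.createDate",
        "baseInfo.status"
};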
The content of ORDER_ALL_INFO_DTO_TYPE is:
private static final TypeReference<OrderAllInfoDTO> ORDER_ALL_INFO_DTO_TYPE = new TypeReference<OrderAllInfoDTO>() {};
Implementation of esClient.searchByPage:
public <T> List<T> searchByPage(String index, SearchSourceBuilder searchSourceBuilder, TypeReference<T> type) {
    String searchStr = searchSourceBuilder.toString();
    log.debug("ES query string: {}", searchStr);
    // typeName is fixed to the default type name '_doc'
    Search search = new Search.Builder(searchStr).addIndex(index).addType(typeName).build();
    return searchByPage(search, type);
}

public <T> List<T> searchByPage(Action clientRequest, TypeReference<T> type) {
    try {
        JestResult result = client.execute(clientRequest);
        if (result.isSucceeded()) {
            List<String> sourceList = result.getSourceAsStringList();
            if (CollectionUtils.isNotEmpty(sourceList)) {
                List<T> list = new ArrayList<>();
                for (String str : sourceList) {
                    list.add(JsonUtils.JsonToObject(str, type));
                }
                return list;
            }
        }
    } catch (IOException ex) {
        log.warn("Paging search failed", ex);
    }
    return Collections.emptyList();
}
JsonUtils.JsonToObject is a custom utility method that wraps Jackson. Its implementation is as follows:
public static <T> T JsonToObject(String json, TypeReference<T> javaType) {
    try {
        // OM is the shared Jackson ObjectMapper instance
        return OM.readValue(json, javaType);
    } catch (Exception e) {
        log.error("Failed to convert JSON string to object", e);
    }
    return null;
}
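The objectToJson counterpart used throughout the examples, together with the OM field, is not listed in the original. A minimal sketch of that part of JsonUtils, assuming OM is a shared com.fasterxml.jackson.databind.ObjectMapper:

// Shared Jackson ObjectMapper (assumption; the full JsonUtils class is not shown in the original)
private static final ObjectMapper OM = new ObjectMapper();

/**
 * Serialize an object to a JSON string, returning null on failure (sketch).
 */
public static String objectToJson(Object obj) {
    try {
        return OM.writeValueAsString(obj);
    } catch (Exception e) {
        log.error("Failed to convert object to JSON string", e);
    }
    return null;
}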
Note that from + size cannot exceed the index.max_result_window parameter value, otherwise data beyond that window cannot be retrieved; the default value of index.max_result_window is 10000. Although the official documentation does not recommend setting this value too large, if the business really requires it, raising this parameter is still the pragmatic choice, because the two alternatives recommended officially cannot fully support random page jumps, scrolling back and forth, or changing the number of records per page. In practice, a better experience can be achieved by limiting how far a user can jump at once: for example, forbid jumping directly to the last page, disallow entering an arbitrary page number, and only show the previous and next 10 page numbers at a time. Modification method:
PUT /my_index/_settings
{
  "index.max_result_window": 1000000
}
Sorting and scoring optimization
Sorting usually just uses _score: ES calculates a score for each doc in the current search according to the specified scoring algorithm, and then sorts the results from the highest score to the lowest. Scoring rule reference: Research on the scoring mechanism of ElasticSearch (5.3.0).
Official sorting documentation: Sort
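For reference, sorting by score and then by a field can also be expressed with SortBuilders; a minimal sketch (the date field name follows the examples in this article):

// Sketch: primary sort by relevance score, secondary sort by creation date (newest first)
SearchSourceBuilder source = new SearchSourceBuilder()
        .sort(SortBuilders.scoreSort())   // _score sorts descending by default
        .sort(SortBuilders.fieldSort("baseInfo.createDate").order(SortOrder.DESC));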
Implementation of full field matching with the highest score
ES's scoring algorithm does not automatically give whole-phrase (exact) matches the highest priority; it aggregates a score from multiple dimensions. Therefore, if you want documents that exactly match the keyword to be ranked first, you need to modify both the index mapping and the query statement. The modifications are as follows:
Mapping: add a raw subfield with the keyword data type, which is not analyzed (only a normalizer is applied).
"name": { "type": "text", "analyzer": "custome_standard", "fields": { "raw": { "type": "keyword", "normalizer": "custome_normalizer" }, "chinese": { "type": "text", "analyzer": "custome_chinese", "search_analyzer":"custome_chinese_search" } } }
Search: in addition to the normal search condition, a search condition on the raw field is added, and its weight (boost) is set to 10 times that of the ordinary search condition:
@Test
public void testSearchByPage() {
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery();
    // Normal match query on the analyzed sub-field
    MatchQueryBuilder codeMatchQueryBuilder = QueryBuilders.matchQuery("baseInfo.code.chinese", keyword);
    // Exact-match query on the raw keyword sub-field, weighted 10x
    MatchQueryBuilder codeRawMatchQueryBuilder = QueryBuilders.matchQuery("baseInfo.code.raw", keyword);
    codeRawMatchQueryBuilder.boost(10f);
    boolQueryBuilder.should(codeMatchQueryBuilder);
    boolQueryBuilder.should(codeRawMatchQueryBuilder);
    // Which record to start from (offset)
    searchSourceBuilder.from(0);
    // How many records per page
    searchSourceBuilder.size(10);
    // Set the query criteria
    searchSourceBuilder.query(boolQueryBuilder);
    // Sort by matching score
    searchSourceBuilder.sort("_score");
    // Then sort by creation date, descending
    searchSourceBuilder.sort("baseInfo.createDate", SortOrder.DESC);
    log.info(JsonUtils.objectToJson(esService.searchByPage(searchSourceBuilder)));
}
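The query body generated by the builder above is roughly the following (a sketch of the DSL, with paging and sorting omitted and default match parameters left out):

{
  "query": {
    "bool": {
      "should": [
        { "match": { "baseInfo.code.chinese": { "query": "keyword" } } },
        { "match": { "baseInfo.code.raw": { "query": "keyword", "boost": 10.0 } } }
      ]
    }
  }
}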
Optimization of Chinese search keyword segmentation
The commonly used Chinese analyzer is the IK analyzer (official repository: elasticsearch-analysis-ik). It provides two tokenizers, ik_smart and ik_max_word: ik_max_word splits the text into as many words as possible, while ik_smart performs a coarser, smarter segmentation with some semantic analysis. The actual results are as follows:
ik_smart test:
POST _analyze
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国"
}
Test results:
{ "tokens": [ { "token": "The People's Republic of China", "start_offset": 0, "end_offset": 7, "type": "CN_WORD", "position": 0 } ] }
ik_max_word test:
POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国"
}
{ "tokens": [ { "token": "The People's Republic of China", "start_offset": 0, "end_offset": 7, "type": "CN_WORD", "position": 0 }, { "token": "Chinese people", "start_offset": 0, "end_offset": 4, "type": "CN_WORD", "position": 1 }, { "token": "the chinese people", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 2 }, { "token": "Chinese", "start_offset": 1, "end_offset": 3, "type": "CN_WORD", "position": 3 }, { "token": "People's Republic of China", "start_offset": 2, "end_offset": 7, "type": "CN_WORD", "position": 4 }, { "token": "the people", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 5 }, { "token": "republic", "start_offset": 4, "end_offset": 7, "type": "CN_WORD", "position": 6 }, { "token": "republic", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 7 }, { "token": "country", "start_offset": 6, "end_offset": 7, "type": "CN_CHAR", "position": 8 } ] }
So if ik_max_word alone is used at search time, a search for the keyword 人民共和国 (People's Republic) is matched against every token produced by segmentation, which yields too many results and an unexpected ordering. For example, a doc that contains many occurrences of 中华 may end up scoring higher than a doc that contains 人民共和国 exactly once, and the difference is even larger when the search keywords contain common function words such as 的 or 是. On the other hand, if only ik_smart is used, a search keyword such as 共和国 may fail to match anything at all. Therefore, the recommendation is to use ik_max_word at index time and ik_smart at search time: the indexed text is segmented finely enough, while matching at search time stays precise. The index settings and mapping are defined as follows:
PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "pinyin_filter": {
          "type": "pinyin"
        }
      },
      "analyzer": {
        "custome_standard": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "asciifolding" ]
        },
        "custome_chinese": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": [ "lowercase", "asciifolding" ]
        },
        "custome_chinese_search": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "filter": [ "lowercase", "asciifolding" ]
        }
      },
      "normalizer": {
        "custome_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": [ "lowercase", "asciifolding" ]
        }
      }
    },
    "index.mapping.coerce": false,
    "index.mapping.ignore_malformed": false,
    "index.gc_deletes": "0s"
  },
  "mappings": {
    "_doc": {
      "properties": {
        "id": {
          "type": "long"
        },
        "name": {
          "type": "text",
          "analyzer": "custome_standard",
          "fields": {
            "raw": {
              "type": "keyword",
              "normalizer": "custome_normalizer"
            },
            "chinese": {
              "type": "text",
              "analyzer": "custome_chinese",
              "search_analyzer": "custome_chinese_search"
            }
          }
        }
      }
    }
  }
}
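To verify that the index-level analyzers behave as intended, the _analyze API can be pointed at the index itself (a quick check, mirroring the earlier examples):

POST test/_analyze
{
  "analyzer": "custome_chinese_search",
  "text": "中华人民共和国"
}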
Paging query processing of big data records
For paging queries over large data sets (more than a million records), the official documentation does not recommend the from + size approach, because from + size is limited by the index's index.max_result_window parameter (from + size cannot exceed index.max_result_window), and a from + size query actually has to sort the results first: when a query spans multiple shards, each shard sorts its own results and the coordinating node then merges and sorts them again. Suppose we search an index with five primary shards. When we request the first page of results (results 1 to 10), each shard produces its own top 10 and returns them to the coordinating node, which sorts those 50 results to obtain the overall top 10.
Now suppose we request page 1000, i.e. results 10001 to 10010. Everything works in the same way, except that each shard has to produce its top 10010 results. The coordinating node then sorts all 50050 results and finally discards 50040 of them.
You can see that, in a distributed system, the cost of sorting results grows rapidly the deeper we page, multiplied by the number of shards. This is why web search engines do not return more than about 1000 results for any query.
The official documentation does offer two alternatives, Scroll and Search After, but Scroll is essentially a forward-only cursor with a fixed page size: the page size cannot be changed and you cannot move backwards. Search After allows the page size to change, but it still cannot move backwards. Therefore the scheme has to be chosen according to the actual scenario, and cooperation from the product side is also needed:
- If standard paging is required (random page jumps and an adjustable page size), the only option is from + size with a raised index.max_result_window parameter value. During product design it is best to restrict how far users can jump, for example only allowing jumps within plus or minus 5 pages of the current page and capping the page size at 50 records; otherwise a user jumping straight to the last page can trigger the performance problems described above.
- If the scenario is incremental loading ("load more"), Search After is more appropriate.
- If the scenario is data export, Scroll is more convenient to handle; see the sketch below.
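As an illustration of the export case, here is a minimal sketch of scrolling through an index with Jest (the index name, scroll window and batch size are assumptions, not values from the original):

/**
 * Sketch: scroll through all documents of an index in batches via Jest.
 */
public void exportAll(JestClient client) throws IOException {
    Search search = new Search.Builder("{\"size\":1000,\"query\":{\"match_all\":{}}}")
            .addIndex("order")
            .setParameter(Parameters.SCROLL, "5m")   // keep the scroll context alive for 5 minutes
            .build();
    SearchResult firstPage = client.execute(search);
    String scrollId = firstPage.getJsonObject().get("_scroll_id").getAsString();
    List<String> batch = firstPage.getSourceAsStringList();

    while (batch != null && !batch.isEmpty()) {
        // ... process the current batch here ...
        SearchScroll scroll = new SearchScroll.Builder(scrollId, "5m").build();
        JestResult result = client.execute(scroll);
        scrollId = result.getJsonObject().get("_scroll_id").getAsString();
        batch = result.getSourceAsStringList();
    }
}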
Processing of secondary search
Secondary search ("search within results") means searching again with new query criteria applied on top of the previous search. The implementation scheme is as follows:
- When the back end receives a search request, it checks whether a keyword was entered. If so, it records the search conditions, serializes them to JSON, stores them in the database, and generates a unique search-condition id that is returned to the front end.
- When the user explicitly performs a secondary search, the front end submits the new query criteria together with the previously received search-condition id. If it is not a secondary search, the search id is set to null.
- When the back end receives the query, it checks whether a search-condition id is present. If so, it loads the saved search conditions for that id, appends the new condition to the old query via a bool must, serializes the combined conditions to JSON and stores them in the database again, and returns the newly generated search-condition id to the front end along with the search results.
public PageResult<OrderPageInfoResult> orderSearchList(OrderSearchParameter searchParameter, SystemUser user) {
    PageResult<OrderPageInfoResult> result = PageResult.empty();
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery();
    // Get the list of query criteria (previous criteria plus the current one)
    List<OrderSearchParameter> searchParameterList = getRefineSearchParameter(searchParameter);
    // Build the query statement
    completeQuery(boolQueryBuilder, searchParameterList);
    // Advanced search (filtering)
    completeFilter(boolQueryBuilder, searchParameterList);
    searchSourceBuilder.query(boolQueryBuilder);
    // Get the total number of matching records
    Integer total = orderEsService.countBySearch(searchSourceBuilder);
    searchSourceBuilder.sort("_score");
    searchSourceBuilder.sort("baseInfo.createDate", SortOrder.DESC);
    if (total > 0) {
        // Adjust paging parameters according to the total number of records
        searchParameter.correctPageParams(total);
        result = new PageResult<>(searchParameter.getOffset(), searchParameter.getLimit(), total);
        // Which record to start from
        searchSourceBuilder.from(searchParameter.getOffset());
        // How many records per page
        searchSourceBuilder.size(searchParameter.getLimit());
        // Set the returned items
        result.setItems(orderEsService.searchByPage(searchSourceBuilder));
    }
    // Save the new query criteria
    if (StringUtils.isNotBlank(searchParameter.getKeyword()) || searchParameter.getSearchParameterId() != null) {
        result.setSearchParameterId(saveSearchParameterList(searchParameterList, user));
    }
    return result;
}

private List<OrderSearchParameter> getRefineSearchParameter(OrderSearchParameter searchParameter) {
    if (searchParameter == null) {
        return Collections.emptyList();
    }
    Integer searchParameterId = searchParameter.getSearchParameterId();
    List<OrderSearchParameter> list = null;
    if (searchParameterId != null) {
        SystemSearchParameter refineSearchParameter = systemSearchParameterService.getById(searchParameterId);
        if (refineSearchParameter != null) {
            list = JsonUtils.JsonToObject(refineSearchParameter.getValue(),
                    new TypeReference<List<OrderSearchParameter>>() {});
        }
    }
    if (CollectionUtils.isEmpty(list)) {
        list = new ArrayList<>();
    }
    // Append the current query criteria to the previous ones
    list.add(searchParameter);
    return list;
}

private Integer saveSearchParameterList(List<OrderSearchParameter> searchParameterList, SystemUser user) {
    SystemSearchParameterForm form = new SystemSearchParameterForm();
    form.setType(SystemSearchTypeEnum.ORDER_SEARCH.value());
    form.setValue(JsonUtils.objectToJson(searchParameterList));
    return systemSearchParameterService.insert(form, user, now());
}
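completeQuery and completeFilter are referenced above but not listed in the original. As a rough sketch of how the accumulated keywords could be combined with must so that each secondary search narrows the previous results (the OrderSearchParameter accessors used here are assumptions):

/**
 * Sketch: each accumulated search parameter contributes its keyword as a must clause.
 */
private void completeQuery(BoolQueryBuilder boolQueryBuilder, List<OrderSearchParameter> searchParameterList) {
    for (OrderSearchParameter parameter : searchParameterList) {
        if (parameter != null && StringUtils.isNotBlank(parameter.getKeyword())) {
            BoolQueryBuilder keywordQuery = QueryBuilders.boolQuery();
            // Each keyword searches the analyzed and raw sub-fields, the raw field weighted higher
            keywordQuery.should(QueryBuilders.matchQuery("baseInfo.code.chinese", parameter.getKeyword()));
            keywordQuery.should(QueryBuilders.matchQuery("baseInfo.code.raw", parameter.getKeyword()).boost(10f));
            // must: every accumulated keyword condition has to hold
            boolQueryBuilder.must(keywordQuery);
        }
    }
}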