Using the IK analyzer
- Integrating the IK analyzer: https://mp.csdn.net/postedit/93602713
- Entity class PosEntity
```java
/** Constructor, getters and setters omitted */
class PosEntity {
    private Integer posId;
    private String posName;
    private String posAddress;
}
```
In the entity class, posName and posAddress both hold Chinese text, so both fields are analyzed with the IK analyzer.
- Create the index
```java
@Test
public void createIKIndex() {
    XContentBuilder contentBuilder = null;
    try {
        contentBuilder = XContentFactory.jsonBuilder()
                .startObject()
                    .startObject(typeName)
                        .startObject("properties")
                            .startObject("posId").field("type", "integer").endObject()
                            // both Chinese text fields are analyzed with ik_max_word
                            .startObject("posName").field("type", "text").field("analyzer", "ik_max_word").endObject()
                            .startObject("posAddress").field("type", "text").field("analyzer", "ik_max_word").endObject()
                        .endObject()
                    .endObject()
                .endObject();
    } catch (IOException e) {
        e.printStackTrace();
    }
    client.admin().indices().prepareCreate(ikIndexName).addMapping(typeName, contentBuilder).get();
}
```
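For reference, the JSON body the XContentBuilder above assembles looks like this (assuming typeName is "pos"; substitute your own type name):

```json
{
  "pos": {
    "properties": {
      "posId":      { "type": "integer" },
      "posName":    { "type": "text", "analyzer": "ik_max_word" },
      "posAddress": { "type": "text", "analyzer": "ik_max_word" }
    }
  }
}
```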
- Import data
```java
@Test
public void loadIKData() {
    PosEntity entity = new PosEntity(3, "Chongqing B21902321", "China is the most populous country in the world");
    PosEntity entity1 = new PosEntity(4, "Chongqing A21902321", "Today is your birthday, fat head");
    client.prepareIndex(ikIndexName, typeName, "3")
            .setSource(JSONObject.toJSONString(entity), XContentType.JSON).get();
    client.prepareIndex(ikIndexName, typeName, "4")
            .setSource(JSONObject.toJSONString(entity1), XContentType.JSON).get();
}
```
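JSONObject.toJSONString (presumably Alibaba fastjson, given the call style) serializes each entity into a plain JSON document; the first one looks roughly like this (field order may vary):

```json
{
  "posId": 3,
  "posName": "Chongqing B21902321",
  "posAddress": "China is the most populous country in the world"
}
```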
The imported data contains both license plate numbers and ordinary sentences. You can check in Kibana how a sentence is tokenized:
```
GET _analyze
{
  "text": "Today is your birthday, fat head",
  "analyzer": "ik_max_word"
}
```
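The same analysis can also be run from Java through the indices admin API, reusing the client field from the tests above. A minimal sketch (the test name analyzeWithIK is ours, not from the original):

```java
import org.elasticsearch.action.admin.indices.analyze.AnalyzeResponse;

@Test
public void analyzeWithIK() {
    // run the same _analyze request through the TransportClient
    AnalyzeResponse response = client.admin().indices()
            .prepareAnalyze("Today is your birthday, fat head")
            .setAnalyzer("ik_max_word")
            .get();
    // print term, offsets and token type, mirroring the Kibana output
    for (AnalyzeResponse.AnalyzeToken token : response.getTokens()) {
        System.out.println(token.getTerm() + " [" + token.getStartOffset()
                + "," + token.getEndOffset() + ") " + token.getType());
    }
}
```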
- Search (same pattern as the earlier queries)
```java
@Test
public void multiIK() {
    // match the query text against both IK-analyzed fields
    MultiMatchQueryBuilder queryBuilder = new MultiMatchQueryBuilder("It's yours", "posName", "posAddress");
    SearchRequestBuilder builder = client.prepareSearch(ikIndexName)
            .setTypes(typeName)
            .setQuery(queryBuilder);
    execSearch(builder);
}

@Test
public void ikStringQuery() {
    // Lucene query-string syntax: field:text
    QueryStringQueryBuilder queryStringQueryBuilder = new QueryStringQueryBuilder("posAddress:It's yours");
    SearchRequestBuilder builder = client.prepareSearch(ikIndexName)
            .setTypes(typeName)
            .setQuery(queryStringQueryBuilder);
    execSearch(builder);
}
```

Note that prepareSearch only takes index names; the type is set with setTypes (passing the type name as a second argument to prepareSearch would be treated as another index).
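The execSearch helper is not shown in the original; a minimal sketch of what it presumably does (execute the prepared request and print the hits):

```java
import org.elasticsearch.action.search.SearchRequestBuilder;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.search.SearchHit;

// hypothetical implementation of the execSearch helper used above
private void execSearch(SearchRequestBuilder builder) {
    SearchResponse response = builder.get();
    System.out.println("total hits: " + response.getHits().getTotalHits());
    for (SearchHit hit : response.getHits().getHits()) {
        System.out.println(hit.getId() + " -> " + hit.getSourceAsString());
    }
}
```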
Extending the IK dictionary
- First use Kibana to analyze the sentence 蓝瘦香菇很好吃 ("blue thin mushroom is delicious"):
```
GET _analyze
{
  "text": "蓝瘦香菇很好吃",
  "analyzer": "ik_max_word"
}
```
The result is as follows:

```json
{
  "tokens": [
    { "token": "蓝",   "start_offset": 0, "end_offset": 1, "type": "CN_CHAR", "position": 0 },
    { "token": "瘦",   "start_offset": 1, "end_offset": 2, "type": "CN_CHAR", "position": 1 },
    { "token": "香菇", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 2 },
    { "token": "很好", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 3 },
    { "token": "很",   "start_offset": 4, "end_offset": 5, "type": "CN_CHAR", "position": 4 },
    { "token": "好吃", "start_offset": 5, "end_offset": 7, "type": "CN_WORD", "position": 5 }
  ]
}
```
蓝瘦香菇 ("blue thin mushroom") is not recognized as one word: it is split into the single characters 蓝 ("blue") and 瘦 ("thin") plus 香菇 ("mushroom"), and 很 ("very") comes out as a token of its own. The goal is to have 蓝瘦香菇 recognized as a whole word and to have 很 ignored as a stop word.
- Go to the IK plugin's configuration directory, /home/es/elasticsearch-6.2.2/plugins/ik/config, and create a dictionary file:
```bash
vim my_extra.dic
```
Add the word 蓝瘦香菇, then save and exit.
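The dictionary file is plain UTF-8 text with one word per line, so after this step my_extra.dic contains:

```
蓝瘦香菇
```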
- Create a stop word file in the same directory:

```bash
vim my_stopword.dic
```
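One entry per line here as well; since 很 disappears from the analysis output after the restart, the file contains:

```
很
```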
- Modify the IK configuration file to register both custom files:
```bash
vim IKAnalyzer.cfg.xml
```
```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extended configuration</comment>
    <!-- users can configure their own extension dictionary here -->
    <entry key="ext_dict">my_extra.dic</entry>
    <!-- users can configure their own extension stop word dictionary here -->
    <entry key="ext_stopwords">my_stopword.dic</entry>
    <!-- users can configure a remote extension dictionary here -->
    <!-- <entry key="remote_ext_dict">words_location</entry> -->
    <!-- users can configure a remote extension stop word dictionary here -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
```
The ext_dict entry points at the extension dictionary and ext_stopwords at the stop word file.
- Restart the ES cluster so that the new dictionaries are loaded.
- Analyze the same sentence again; the result is now:
{ "tokens": [ { "token": "I feel awful. I want to cry.", "start_offset": 0, "end_offset": 4, "type": "CN_WORD", "position": 0 }, { "token": "thin", "start_offset": 1, "end_offset": 2, "type": "CN_CHAR", "position": 1 }, { "token": "Mushrooms", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 2 }, { "token": "very good", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 3 }, { "token": "Yummy", "start_offset": 5, "end_offset": 7, "type": "CN_WORD", "position": 4 } ] }