Using the IK analyzer, extending the IK dictionary, and adding a stop-word dictionary

Using the IK analyzer

For integrating the IK analyzer into Elasticsearch, see https://mp.csdn.net/postedit/93602713

Entity class PosEntity

/** Constructors, getters and setters omitted. */
class PosEntity {
	private Integer posId;
	private String posName;
	private String posAddress;
}

In the entity class, posName and posAddress both hold Chinese text, so both fields are analyzed with the IK analyzer.
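Since the listing above leaves out everything but the fields, here is a minimal sketch of what the full class might look like. The three-argument constructor is inferred from how the import step later calls `new PosEntity(3, ..., ...)`; the no-arg constructor and the accessor names are assumptions that follow the usual JavaBean convention, which JSON serializers typically rely on.

```java
// Sketch only: the constructors and accessors are assumptions, the article omits them.
class PosEntity {
    private Integer posId;
    private String posName;
    private String posAddress;

    // No-arg constructor, commonly required by JSON libraries when deserializing.
    public PosEntity() {
    }

    // Matches the three-argument calls used when importing data.
    public PosEntity(Integer posId, String posName, String posAddress) {
        this.posId = posId;
        this.posName = posName;
        this.posAddress = posAddress;
    }

    public Integer getPosId() { return posId; }
    public void setPosId(Integer posId) { this.posId = posId; }

    public String getPosName() { return posName; }
    public void setPosName(String posName) { this.posName = posName; }

    public String getPosAddress() { return posAddress; }
    public void setPosAddress(String posAddress) { this.posAddress = posAddress; }
}
```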

Create index

	@Test
	public void createIKIndex() throws IOException {
		// Map posName and posAddress as text fields analyzed with ik_max_word.
		// Let IOException propagate instead of continuing with a null builder.
		XContentBuilder contentBuilder = XContentFactory.jsonBuilder()
				.startObject()
				.startObject(typeName)
				.startObject("properties")
				.startObject("posId").field("type","integer").endObject()
				.startObject("posName").field("type","text").field("analyzer","ik_max_word").endObject()
				.startObject("posAddress").field("type","text").field("analyzer","ik_max_word").endObject()
				.endObject()
				.endObject()
				.endObject();
		client.admin().indices().prepareCreate(ikIndexName).addMapping(typeName, contentBuilder).get();
	}

Import data

	@Test
	public void loadIKData(){
		PosEntity entity = new PosEntity(3, "Chongqing B21902321", "China is the most populous country in the world");
		PosEntity entity1 = new PosEntity(4, "Chongqing A21902321", "Today's your birthday, fat head");
		client.prepareIndex(ikIndexName, typeName, "3").setSource(JSONObject.toJSONString(entity), XContentType.JSON).get();
		client.prepareIndex(ikIndexName, typeName, "4").setSource(JSONObject.toJSONString(entity1), XContentType.JSON).get();
	}

The imported data contains both license-plate-style strings and ordinary sentences. You can check how a piece of text is tokenized in Kibana:

	GET _analyze
	{"text":"Today is your birthday, fat man.","analyzer":"ik_max_word"}

The response lists the tokens that ik_max_word produces for the text.

Searching works the same way as in the previous examples

	@Test
	public void multiIK(){
		MultiMatchQueryBuilder queryBuilder = new MultiMatchQueryBuilder("It's yours", "posName", "posAddress");
		// prepareSearch takes index names only; the type is set separately.
		SearchRequestBuilder builder = client.prepareSearch(ikIndexName).setTypes(typeName).setQuery(queryBuilder);
		execSearch(builder);
	}

	@Test
	public void ikStringQuery(){
		QueryStringQueryBuilder queryStringQueryBuilder = new QueryStringQueryBuilder("posAddress:It's yours");
		// prepareSearch takes index names only; the type is set separately.
		SearchRequestBuilder builder = client.prepareSearch(ikIndexName).setTypes(typeName).setQuery(queryStringQueryBuilder);
		execSearch(builder);
	}

Extending the IK dictionary

First, use Kibana to analyze the sentence 蓝瘦香菇很好吃 ("blue thin mushroom is delicious"):

	GET _analyze
	{"text":"蓝瘦香菇很好吃","analyzer":"ik_max_word"}

The results are as follows

{
  "tokens": [
    {
      "token": "蓝",
      "start_offset": 0,
      "end_offset": 1,
      "type": "CN_CHAR",
      "position": 0
    },
    {
      "token": "瘦",
      "start_offset": 1,
      "end_offset": 2,
      "type": "CN_CHAR",
      "position": 1
    },
    {
      "token": "香菇",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "很好",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "很",
      "start_offset": 4,
      "end_offset": 5,
      "type": "CN_CHAR",
      "position": 4
    },
    {
      "token": "好吃",
      "start_offset": 5,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 5
    }
  ]
}

The phrase "blue thin mushroom" (蓝瘦香菇) is not recognized as a single word, while "very" (很) is emitted as a token of its own. The goal now is to have 蓝瘦香菇 recognized as one word and 很 dropped as a stop word.
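The start_offset and end_offset values in the response are character positions in the analyzed text, with the end position exclusive. The sketch below shows how to map an offset pair back to the text it covers; `OffsetDemo` and `slice` are hypothetical names, and the sentence 蓝瘦香菇很好吃 is assumed to be the original input, which is what the seven-character offset range suggests.

```java
class OffsetDemo {
    // Returns the part of the text covered by [startOffset, endOffset).
    static String slice(String text, int startOffset, int endOffset) {
        return text.substring(startOffset, endOffset);
    }

    public static void main(String[] args) {
        String text = "蓝瘦香菇很好吃"; // assumed original sentence, 7 characters
        System.out.println(slice(text, 2, 4)); // offsets of the third token
        System.out.println(slice(text, 4, 6)); // offsets of the fourth token
        System.out.println(slice(text, 5, 7)); // offsets of the last token
    }
}
```

Note that the fourth and last tokens overlap on offset 5: ik_max_word emits every word it can find, so neighbouring tokens may share characters.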

Go to the IK configuration directory, /home/es/elasticsearch-6.2.2/plugins/ik/config, and create a dictionary file:
```
vim my_extra.dic
```
Add 蓝瘦香菇 ("blue thin mushroom") on a line of its own, then save and exit.
Create a stop-word file and add 很 ("very") to it:
```
vim my_stopword.dic
```
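IK dictionary files are plain text with one entry per line and should be saved as UTF-8. If editing by hand is inconvenient, the files can also be generated programmatically. The following is a minimal sketch: `DictFileWriter` and `writeDict` are hypothetical names, and the example writes into a temporary directory rather than the real IK config directory.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

class DictFileWriter {
    // Writes one entry per line as UTF-8, replacing the file if it already exists.
    static void writeDict(Path file, List<String> words) throws IOException {
        Files.write(file, words, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a temp directory stands in for plugins/ik/config.
        Path dir = Files.createTempDirectory("ik-config");
        writeDict(dir.resolve("my_extra.dic"), Arrays.asList("蓝瘦香菇"));
        writeDict(dir.resolve("my_stopword.dic"), Arrays.asList("很"));
        System.out.println(Files.readAllLines(dir.resolve("my_extra.dic"), StandardCharsets.UTF_8));
    }
}
```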

Edit the IK configuration file and register the custom files:

vim IKAnalyzer.cfg.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
        <comment>IK Analyzer extended configuration</comment>
        <!-- users can configure their own extended dictionary here -->
        <entry key="ext_dict">my_extra.dic</entry>
        <!-- users can configure their own extended stop word dictionary here -->
        <entry key="ext_stopwords">my_stopword.dic</entry>
        <!-- users can configure a remote extended dictionary here -->
        <!-- <entry key="remote_ext_dict">words_location</entry> -->
        <!-- users can configure a remote extended stop word dictionary here -->
        <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
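The configuration file declares the Java properties DTD, so a quick way to catch typos (a malformed comment or tag, a misspelled key) before restarting is to parse it with `java.util.Properties.loadFromXML`. The sketch below is only a sanity check under that assumption: `IkConfigCheck` and `load` are hypothetical names, and the XML is inlined as a string for illustration. For the real file, pass an InputStream opened on IKAnalyzer.cfg.xml instead.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class IkConfigCheck {
    // Parses a properties XML document; throws an IOException subclass if it is malformed.
    static Properties load(String xml) throws IOException {
        Properties props = new Properties();
        props.loadFromXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return props;
    }

    public static void main(String[] args) throws IOException {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
                + "<properties>\n"
                + "  <comment>IK Analyzer extended configuration</comment>\n"
                + "  <entry key=\"ext_dict\">my_extra.dic</entry>\n"
                + "  <entry key=\"ext_stopwords\">my_stopword.dic</entry>\n"
                + "</properties>\n";
        Properties props = load(xml);
        System.out.println("ext_dict = " + props.getProperty("ext_dict"));
        System.out.println("ext_stopwords = " + props.getProperty("ext_stopwords"));
    }
}
```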

This registers the custom extension dictionary file and the custom stop-word file.

Restart the ES cluster so that the new dictionary files are loaded.

Analyze the same sentence again; the result is as follows

{
  "tokens": [
    {
      "token": "蓝瘦香菇",
      "start_offset": 0,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "瘦",
      "start_offset": 1,
      "end_offset": 2,
      "type": "CN_CHAR",
      "position": 1
    },
    {
      "token": "香菇",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "很好",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "好吃",
      "start_offset": 5,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 4
    }
  ]
}
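To build some intuition for why the output changes, the toy model below emits every dictionary word it finds at each position, falls back to single characters where nothing matches, and drops stop words. This is not IK's actual algorithm, only a simplified sketch: `ToySegmenter` is a hypothetical name, and the small base dictionary is an assumed stand-in for IK's built-in one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ToySegmenter {
    // Emits all dictionary words starting at each position (longest first).
    // Characters not covered by any dictionary word are emitted on their own.
    // Tokens listed in stopWords are dropped.
    static List<String> segment(String text, Set<String> dict, Set<String> stopWords) {
        int maxLen = 1;
        for (String w : dict) {
            maxLen = Math.max(maxLen, w.length());
        }
        List<String> tokens = new ArrayList<>();
        int coveredUntil = 0; // exclusive end of the furthest dictionary match so far
        for (int i = 0; i < text.length(); i++) {
            boolean matched = false;
            int limit = Math.min(text.length(), i + maxLen);
            for (int j = limit; j > i; j--) {
                String candidate = text.substring(i, j);
                if (dict.contains(candidate)) {
                    matched = true;
                    coveredUntil = Math.max(coveredUntil, j);
                    if (!stopWords.contains(candidate)) {
                        tokens.add(candidate);
                    }
                }
            }
            if (!matched && i >= coveredUntil) {
                String single = text.substring(i, i + 1);
                if (!stopWords.contains(single)) {
                    tokens.add(single);
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "蓝瘦香菇很好吃";
        // Assumed stand-in for the built-in dictionary.
        Set<String> dict = new HashSet<>(Arrays.asList("香菇", "很好", "很", "好吃"));
        System.out.println(segment(text, dict, new HashSet<String>()));

        dict.add("蓝瘦香菇"); // entry from my_extra.dic
        Set<String> stopWords = new HashSet<>(Arrays.asList("很")); // entry from my_stopword.dic
        System.out.println(segment(text, dict, stopWords));
    }
}
```

With these inputs the sketch prints [蓝, 瘦, 香菇, 很好, 很, 好吃] and then [蓝瘦香菇, 香菇, 很好, 好吃]. The real IK output above still contains 瘦 after the change, which shows that the toy model only approximates IK's behaviour.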

Keywords: vim JSON xml ElasticSearch

Added by tomm098 on Thu, 31 Oct 2019 10:41:32 +0200
