Using the IK analyzer, extending the IK dictionary, and configuring a stop-word dictionary

Using the IK analyzer

  1. Integrate the IK analyzer; see https://mp.csdn.net/postedit/93602713
  2. Entity class PosEntity
    /** Getters, setters and constructors omitted. */
    class PosEntity {
        private Integer posId;
        private String posName;
        private String posAddress;
    }

    In the entity class, posName and posAddress both hold Chinese text, so both fields are analyzed with the IK analyzer.
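
The listing above leaves out the accessors. For reference, a minimal sketch of the full class could look as follows. The three-argument constructor is inferred from the `new PosEntity(3, ...)` calls in step 4, and the no-argument constructor is an addition that JSON libraries commonly expect for deserialization.

```java
/** Sketch of the complete entity; the constructor signatures are assumptions. */
class PosEntity {
    private Integer posId;
    private String posName;
    private String posAddress;

    // Assumption: a no-argument constructor, often required when deserializing from JSON.
    public PosEntity() {
    }

    // Assumption: inferred from the way the entity is constructed in step 4.
    public PosEntity(Integer posId, String posName, String posAddress) {
        this.posId = posId;
        this.posName = posName;
        this.posAddress = posAddress;
    }

    public Integer getPosId() {
        return posId;
    }

    public void setPosId(Integer posId) {
        this.posId = posId;
    }

    public String getPosName() {
        return posName;
    }

    public void setPosName(String posName) {
        this.posName = posName;
    }

    public String getPosAddress() {
        return posAddress;
    }

    public void setPosAddress(String posAddress) {
        this.posAddress = posAddress;
    }
}
```

The public getters matter here because bean-style JSON serializers typically read field values through them.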

  3. Create index

    @Test
    public void createIKIndex() throws IOException {
        // Mapping: posId is an integer; posName and posAddress are text fields analyzed with ik_max_word.
        // The IOException is propagated so that a failed builder is never passed on as null.
        XContentBuilder contentBuilder = XContentFactory.jsonBuilder()
                .startObject()
                .startObject(typeName)
                .startObject("properties")
                .startObject("posId").field("type", "integer").endObject()
                .startObject("posName").field("type", "text").field("analyzer", "ik_max_word").endObject()
                .startObject("posAddress").field("type", "text").field("analyzer", "ik_max_word").endObject()
                .endObject()
                .endObject()
                .endObject();
        client.admin().indices().prepareCreate(ikIndexName).addMapping(typeName, contentBuilder).get();
    }

     

  4. Import data

    @Test
    public void loadIKData() {
        PosEntity entity = new PosEntity(3, "Chongqing B21902321", "China is the most populous country in the world");
        PosEntity entity1 = new PosEntity(4, "Chongqing A21902321", "Today's your birthday, fat head");
        // Serialize each entity to JSON and index it under an explicit document ID.
        client.prepareIndex(ikIndexName, typeName, "3").setSource(JSONObject.toJSONString(entity), XContentType.JSON).get();
        client.prepareIndex(ikIndexName, typeName, "4").setSource(JSONObject.toJSONString(entity1), XContentType.JSON).get();
    }

    The imported data contains license-plate numbers as well as ordinary sentences. You can check how a piece of text is tokenized in Kibana:

    	GET _analyze
    	{"text":"Today is your birthday, fat man.","analyzer":"ik_max_word"}

    The response lists the tokens that ik_max_word produces for the text.
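
The same `_analyze` request can also be sent from Java without Kibana. The sketch below uses only the JDK. The class name `AnalyzeRequest`, its methods, and the address `http://localhost:9200` are assumptions for illustration and not part of the original setup.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

/** Hypothetical helper that posts a request to the _analyze endpoint. */
final class AnalyzeRequest {

    /** Escapes quotes, backslashes and control characters for use inside a JSON string. */
    static String escapeJson(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /** Builds the JSON body, e.g. {"analyzer":"ik_max_word","text":"..."}. */
    static String buildBody(String analyzer, String text) {
        return "{\"analyzer\":\"" + escapeJson(analyzer) + "\",\"text\":\"" + escapeJson(text) + "\"}";
    }

    /** Sends the request and returns the raw JSON response. */
    static String analyze(String baseUrl, String analyzer, String text) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(baseUrl + "/_analyze").openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(buildBody(analyzer, text).getBytes(StandardCharsets.UTF_8));
        }
        try (InputStream in = conn.getInputStream();
             Scanner scanner = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    public static void main(String[] args) {
        try {
            // Assumption: a node with the IK plugin is reachable at this address.
            System.out.println(analyze("http://localhost:9200", "ik_max_word", "Today is your birthday, fat man."));
        } catch (IOException e) {
            System.err.println("Could not reach the cluster: " + e.getMessage());
        }
    }
}
```

Building the body in a separate method keeps the escaping logic checkable without a running cluster.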

  5. Search in the same way as before

    @Test
    public void multiIK() {
        // Match the query text against both IK-analyzed fields.
        MultiMatchQueryBuilder queryBuilder = new MultiMatchQueryBuilder("It's yours", "posName", "posAddress");
        // prepareSearch accepts index names only, so the type is restricted with setTypes.
        SearchRequestBuilder builder = client.prepareSearch(ikIndexName).setTypes(typeName).setQuery(queryBuilder);
        execSearch(builder);
    }

    @Test
    public void ikStringQuery() {
        // Query-string syntax: search posAddress for the given text.
        QueryStringQueryBuilder queryStringQueryBuilder = new QueryStringQueryBuilder("posAddress:It's yours");
        SearchRequestBuilder builder = client.prepareSearch(ikIndexName).setTypes(typeName).setQuery(queryStringQueryBuilder);
        execSearch(builder);
    }
    

     

Extending the IK dictionary

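  1. First, use Kibana to analyze the sentence "blue thin mushroom is delicious"
    	GET _analyze
    	{"text":"blue thin mushroom is delicious","analyzer":"ik_max_word"}

    The result is as follows. The original sentence is Chinese: the token values are shown here in English translation, while the offsets count characters of the Chinese text.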

    {
      "tokens": [
        {
          "token": "blue",
          "start_offset": 0,
          "end_offset": 1,
          "type": "CN_CHAR",
          "position": 0
        },
        {
          "token": "thin",
          "start_offset": 1,
          "end_offset": 2,
          "type": "CN_CHAR",
          "position": 1
        },
        {
          "token": "Mushrooms",
          "start_offset": 2,
          "end_offset": 4,
          "type": "CN_WORD",
          "position": 2
        },
        {
          "token": "very good",
          "start_offset": 4,
          "end_offset": 6,
          "type": "CN_WORD",
          "position": 3
        },
        {
          "token": "very",
          "start_offset": 4,
          "end_offset": 5,
          "type": "CN_CHAR",
          "position": 4
        },
        {
          "token": "Yummy",
          "start_offset": 5,
          "end_offset": 7,
          "type": "CN_WORD",
          "position": 5
        }
      ]
    }

    "blue thin mushroom" is not recognized as a single word, whereas "very" is emitted as a token of its own. The goal now is to have "blue thin mushroom" recognized as a whole and to have "very" ignored as a stop word.

  2. Go to the IK configuration directory /home/es/elasticsearch-6.2.2/plugins/ik/config and create a dictionary file.

    vim my_extra.dic

    Add the entry "blue thin mushroom" on a line of its own, then save the file.

  3. Create a stop-word file and add "very" to it

    vim my_stopword.dic
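
IK dictionary files are plain text with one entry per line, and the plugin's README asks for UTF-8 encoding. As an alternative to editing with vim, the files can be written from Java. This is a sketch under assumptions: `IkDictionaryWriter` is a made-up helper, the stop-word file name is taken from the `ext_stopwords` entry in IKAnalyzer.cfg.xml, and the entries are the English renderings used in this article (the real entries would be the original Chinese terms).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that writes IK dictionary files. */
final class IkDictionaryWriter {

    /** Writes the entries to configDir/fileName as UTF-8 text, one entry per line. */
    static Path writeDictionary(Path configDir, String fileName, List<String> entries) throws IOException {
        Path target = configDir.resolve(fileName);
        // Creates the file if needed and replaces any previous content.
        return Files.write(target, entries, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the IK config directory is given as the first argument;
        // without it, a temporary directory is used so nothing is overwritten.
        Path configDir = args.length > 0
                ? Paths.get(args[0])
                : Files.createTempDirectory("ik-config");

        // Assumption: English renderings stand in for the original Chinese terms.
        Path extra = writeDictionary(configDir, "my_extra.dic", Arrays.asList("blue thin mushroom"));
        Path stop = writeDictionary(configDir, "my_stopword.dic", Arrays.asList("very"));

        System.out.println("Wrote " + extra);
        System.out.println("Wrote " + stop);
    }
}
```

Passing the charset explicitly avoids depending on the platform default encoding of the machine that generates the files.
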
  4. Modify the IK configuration file and register the custom files

    vim IKAnalyzer.cfg.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
    <properties>
            <comment>IK Analyzer extended configuration</comment>
            <!-- users can configure their own extended dictionary here -->
            <entry key="ext_dict">my_extra.dic</entry>
            <!-- users can configure their own extended stop word dictionary here -->
            <entry key="ext_stopwords">my_stopword.dic</entry>
            <!-- users can configure a remote extended dictionary here -->
            <!-- <entry key="remote_ext_dict">words_location</entry> -->
            <!-- users can configure a remote extended stop word dictionary here -->
            <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
    </properties>
    

    The ext_dict entry registers the custom extension dictionary, and the ext_stopwords entry registers the custom stop-word dictionary.
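
Because IKAnalyzer.cfg.xml uses the standard Java properties XML format (note its DOCTYPE), `java.util.Properties.loadFromXML` can read it. The following sketch, with the hypothetical class name `IkConfigCheck`, prints the two entries so that a typo in a key or file name shows up before the cluster is restarted. It assumes the file is well-formed XML.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

/** Hypothetical helper that reads the IK configuration file. */
final class IkConfigCheck {

    /** Loads the entries of a properties XML file such as IKAnalyzer.cfg.xml. */
    static Properties load(Path cfgFile) throws IOException {
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(cfgFile)) {
            props.loadFromXML(in);
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the path is given as the first argument; otherwise the
        // file is looked up in the current directory.
        Path cfg = Paths.get(args.length > 0 ? args[0] : "IKAnalyzer.cfg.xml");
        if (!Files.exists(cfg)) {
            System.err.println("File not found: " + cfg);
            return;
        }
        Properties props = load(cfg);
        // Entries that are commented out in the XML are reported as null.
        System.out.println("ext_dict      = " + props.getProperty("ext_dict"));
        System.out.println("ext_stopwords = " + props.getProperty("ext_stopwords"));
    }
}
```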

  5. Restart the ES cluster

  6. Analyze the sentence again; the result is as follows

    {
      "tokens": [
        {
          "token": "blue thin mushroom",
          "start_offset": 0,
          "end_offset": 4,
          "type": "CN_WORD",
          "position": 0
        },
        {
          "token": "thin",
          "start_offset": 1,
          "end_offset": 2,
          "type": "CN_CHAR",
          "position": 1
        },
        {
          "token": "Mushrooms",
          "start_offset": 2,
          "end_offset": 4,
          "type": "CN_WORD",
          "position": 2
        },
        {
          "token": "very good",
          "start_offset": 4,
          "end_offset": 6,
          "type": "CN_WORD",
          "position": 3
        },
        {
          "token": "Yummy",
          "start_offset": 5,
          "end_offset": 7,
          "type": "CN_WORD",
          "position": 4
        }
      ]
    }

     


Added by tomm098 on Thu, 31 Oct 2019 10:41:32 +0200