Introduction to ElasticSearch and its deployment, principle and use

Chapter 1: Introduction to Elasticsearch

Elasticsearch is a search server based on Lucene. It provides a distributed, multi-user full-text search engine with a RESTful web interface. Elasticsearch is developed in Java and released as open source under the Apache License. It is a popular enterprise search engine.

Section 1: Usage scenarios of Elasticsearch

1. Provide users with full-text search via keyword queries.

2. Serve as a solution for processing and analyzing massive enterprise data. It is an important part of the big data field, e.g. in the well-known ELK stack (Elasticsearch, Logstash, Kibana).

Section 2: Comparison of common data stores

                           | redis            | mysql                   | elasticsearch                                                                                                                 | hbase                                                       | hadoop/hive
Capacity / expansion       | Low              | Medium                  | Large                                                                                                                         | Massive                                                     | Massive
Query timeliness           | Extremely high   | Medium                  | High                                                                                                                          | Medium                                                      | Low
Query flexibility          | Poor (k-v model) | Very good, supports SQL | Good; joins are weak, but full-text retrieval is supported, and the DSL handles filtering, matching, sorting, aggregation etc. | Poor; mainly rowkey lookups and scans, or secondary indexes | Very good, supports SQL
Write speed                | Extremely fast   | Medium                  | Fast                                                                                                                          | Fast                                                        | Slow
Consistency / transactions | Weak             | Strong                  | Weak                                                                                                                          | Weak                                                        | Weak

Section 3: Characteristics of Elasticsearch

1.3.1 Natural sharding, natural clustering

ES divides data into multiple shards, P0-P2 in the figure below. Together the shards make up the complete data set, and they can be distributed across the machine nodes in the cluster. As data keeps growing, the cluster can add shards and place them on more machines to achieve load balancing and horizontal scaling.

In actual operation, each query is submitted to one node, which is responsible for sorting and aggregating the data from the shards and returning it to the client: in effect, map computation on the individual shard nodes and reduce on the coordinating node, with the final result returned to the client.

This sharded cluster architecture gives Elasticsearch its strong data capacity and operational scalability.
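The shard placement just described can be sketched as a deterministic routing function. This is a simplified illustration only (ES actually routes with a murmur3 hash of the routing key modulo the number of primary shards; the md5 hash below is an assumption for the sketch):

```python
import hashlib

NUM_SHARDS = 3  # e.g. shards P0-P2 in the figure

def route(doc_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a document id to a shard number (sketch; ES uses murmur3)."""
    digest = hashlib.md5(doc_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_shards

# Every node computes the same placement, so any node can
# coordinate a query and knows which shards to contact.
placement = {doc: route(doc) for doc in ["1", "2", "3", "4"]}
print(placement)
```

Because the function is deterministic, adding documents never requires a lookup table: the cluster only needs the id and the shard count to find the data.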

1.3.2 Natural indexing

All data in ES is indexed by default, the opposite of MySQL: MySQL adds no indexes by default, and indexes must be created explicitly, while ES requires no extra steps.

ES uses inverted indexes, which differ from MySQL's B+Tree indexes.
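The idea behind an inverted index can be shown in a few lines: terms map to the documents that contain them, the reverse of a row-oriented lookup. A minimal sketch (toy data, whitespace tokenization; real analyzers do much more):

```python
from collections import defaultdict

docs = {
    1: "operation red sea",
    2: "operation meigong river",
    3: "incident red sea",
}

# Build: term -> set of document ids containing that term.
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        inverted[term].add(doc_id)

def search(term: str) -> set:
    """Term lookup is a single dictionary access, independent of corpus size."""
    return inverted.get(term, set())

print(sorted(search("red")))        # [1, 3]
print(sorted(search("operation")))  # [1, 2]
```

A B+Tree answers "what is in row X" quickly; the inverted index answers "which rows contain term T" quickly, which is exactly the full-text search question.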

Section 4: Relationship between Lucene and Elasticsearch

Lucene is only a core class library providing full-text search functionality; real use requires an application built on top of it with a complete service framework.

For example, Lucene is like an engine, and search engine software (ES, Solr) is the complete car.

At present there are two popular search engine products on the market, Elasticsearch and Solr. Both are built on Lucene and can be deployed and started independently as search engine services. Because the kernel is the same, apart from server installation, deployment, management and clustering, the two are very similar in data operations such as modification, insertion, saving and query. It is like two database products that both support SQL: learn one, and the other is easy to pick up.

Judging from actual enterprise use, Elasticsearch's market share is gradually overtaking Solr's. In China, Baidu, JD and Sina all implement search with Elasticsearch; abroad there are even more, such as Wikipedia, GitHub and Stack Overflow, which are also based on ES.

Chapter 2: Installation and deployment of Elasticsearch (including Kibana)

2.1 download address

https://www.elastic.co/cn/downloads/past-releases#elasticsearch

https://www.elastic.co/cn/downloads/past-releases#kibana

[note]:

  1. The ES and Kibana versions downloaded must match
  2. At present, production environments mostly use the 6.x major versions; 7.x is newer, but the deployment process is the same

2.2 machine planning

3 machines:

11.8.37.50 ops01
11.8.36.63 ops02
11.8.36.76 ops03

wangting@ops01:/home/wangting >cat /etc/hosts
127.0.0.1 ydt-cisp-ops01
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

# elasticsearch
11.8.37.50 ops01
11.8.36.63 ops02
11.8.36.76 ops03

[note]: if each node's IP resolution is configured in /etc/hosts on every node, the IPs in the subsequent configuration files can be replaced by the resolved host names;

For example: network.host: 11.8.37.50 is equivalent to network.host: ops01

2.3 download installation package

wangting@ops01:/opt/software >ll | grep 6.6.0
-rw-r--r-- 1 wangting wangting     114106988 Aug  4 14:40 elasticsearch-6.6.0.tar.gz
-rw-r--r-- 1 wangting wangting     180704352 Aug  4 14:40 kibana-6.6.0-linux-x86_64.tar.gz

2.4 environmental optimization

2.4.1 Optimization 1

The maximum number of files the system allows Elasticsearch to open must be raised to 65536

wangting@ops01:/opt/software >sudo vim /etc/security/limits.conf
# End of file
* soft nofile 65536
* hard nofile 131072
* soft nproc 2048
* hard nproc 65536

# Disconnect and reconnect the session for the new limits to take effect
wangting@ops01:/home/wangting >ulimit -n
65536

If this setting is not optimized, starting the service will fail with:
[error] max file descriptors [4096] for elasticsearch process likely too low, increase to at least [65536] elasticsearch

2.4.2 Optimization 2

The maximum number of user processes must be at least 4096; if it is not, modify it as follows

wangting@ops01:/home/wangting >sudo vim /etc/security/limits.d/20-nproc.conf
# Default limit for number of user's processes to prevent
# accidental fork bombs.
# See rhbz #432903 for reasoning.

*          soft    nproc     4096
root       soft    nproc     unlimited

If this setting is not optimized, starting the service will fail with:
[error]max number of threads [1024] for user [judy2] likely too low, increase to at least [4096]

2.4.3 Optimization 3

Set the maximum number of virtual memory areas a process may have

wangting@ops01:/home/wangting >sudo vim /etc/sysctl.conf 
vm.max_map_count=262144
# Reload the configuration
wangting@ops01:/home/wangting >sudo sysctl -p

If this setting is not optimized, starting the service will fail with:
[error]max virtual memory areas vm.max_map_count [65530] likely too low, increase to at least [262144]

2.5 decompression and installation

wangting@ops01:/opt/software >tar -xf elasticsearch-6.6.0.tar.gz -C /opt/module/
wangting@ops01:/opt/software >cd /opt/module/elasticsearch-6.6.0/
wangting@ops01:/opt/module/elasticsearch-6.6.0 >ll
total 448
drwxr-xr-x  3 wangting wangting   4096 Aug  4 15:13 bin
drwxr-xr-x  2 wangting wangting   4096 Jan 24  2019 config
drwxr-xr-x  3 wangting wangting   4096 Jan 24  2019 lib
-rw-r--r--  1 wangting wangting  13675 Jan 24  2019 LICENSE.txt
drwxr-xr-x  2 wangting wangting   4096 Jan 24  2019 logs
drwxr-xr-x 29 wangting wangting   4096 Jan 24  2019 modules
-rw-r--r--  1 wangting wangting 403816 Jan 24  2019 NOTICE.txt
drwxr-xr-x  2 wangting wangting   4096 Jan 24  2019 plugins
-rw-r--r--  1 wangting wangting   8519 Jan 24  2019 README.textile

2.6 modifying configuration files

wangting@ops01:/opt/module/elasticsearch-6.6.0 >cd config/
wangting@ops01:/opt/module/elasticsearch-6.6.0/config >cat elasticsearch.yml | grep -vE "^#|^$"
cluster.name: my-es
node.name: node-ops01
bootstrap.memory_lock: false
network.host: 11.8.37.50
http.port: 9200
discovery.zen.ping.unicast.hosts: ["11.8.37.50", "11.8.36.63", "11.8.36.76"]

Configuration item description:

cluster.name: my-es # cluster name; all nodes in the same cluster must use the same name
node.name: node-ops01 # current node name; distinguishes the nodes within the cluster
bootstrap.memory_lock: false # whether to lock the process memory at startup to prevent swapping
network.host: 11.8.37.50 # current node host
http.port: 9200 # ES HTTP port
discovery.zen.ping.unicast.hosts: ["11.8.37.50", "11.8.36.63", "11.8.36.76"] # unicast discovery: the hosts a new node contacts to join the cluster

[note]: these are the conventional settings. Others, such as the data storage path and log storage path, can be customized as needed

2.7 Distributing the installation directory

wangting@ops01:/opt/module >scp -r elasticsearch-6.6.0 ops02:/opt/module/
wangting@ops01:/opt/module >scp -r elasticsearch-6.6.0 ops03:/opt/module/

2.8 modifying other node configuration files

# Node ops02
wangting@ops02:/opt/module/elasticsearch-6.6.0/config >cat elasticsearch.yml | grep -vE "^#|^$"
cluster.name: my-es
node.name: node-ops02
bootstrap.memory_lock: false
network.host: 11.8.36.63
http.port: 9200
discovery.zen.ping.unicast.hosts: ["11.8.37.50", "11.8.36.63", "11.8.36.76"]

# Node ops03
wangting@ops03:/opt/module/elasticsearch-6.6.0/config >cat elasticsearch.yml | grep -vE "^#|^$"
cluster.name: my-es
node.name: node-ops03
bootstrap.memory_lock: false
network.host: 11.8.36.76
http.port: 9200
discovery.zen.ping.unicast.hosts: ["11.8.37.50", "11.8.36.63", "11.8.36.76"]

2.9 service startup

# Start ES on the three nodes in turn with the same command
wangting@ops01:/opt/module >cd /opt/module/elasticsearch-6.6.0/bin/
wangting@ops01:/opt/module/elasticsearch-6.6.0/bin >./elasticsearch -d
wangting@ops01:/opt/module/elasticsearch-6.6.0/bin >

[note]:

  1. -d runs the service in the background. Without -d it runs in the foreground, and closing the session window terminates the service
  2. elasticsearch must be started on all three machines
  3. No log path is configured, so operation logs go to the service directory by default: elasticsearch-6.6.0/logs/; if anything goes wrong, check the logs first

2.10 Verifying ES from the command line

wangting@ops01:/opt/module/elasticsearch-6.6.0/logs >curl http://11.8.37.50:9200/_cat/nodes?v
ip         heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
11.8.36.63           26          88   0    0.05    0.13     0.10 mdi       -      node-ops02
11.8.37.50           28          87   0    0.06    0.10     0.08 mdi       *      node-ops01
11.8.36.76           26          53   0    0.06    0.07     0.06 mdi       -      node-ops03

[note]: for an ES cluster in normal operation, curl on the nodes endpoint of any node returns results

2.11 installation of kibana

# Kibana is just a tool; pick one server to install it on
wangting@ops01:/opt/software >scp kibana-6.6.0-linux-x86_64.tar.gz ops03:/opt/software/
wangting@ops03:/opt/module/elasticsearch-6.6.0/bin >cd /opt/software/
wangting@ops03:/opt/software >tar -xf kibana-6.6.0-linux-x86_64.tar.gz -C /opt/module/
wangting@ops03:/opt/software >cd /opt/module/kibana-6.6.0-linux-x86_64/config/
wangting@ops03:/opt/module/kibana-6.6.0-linux-x86_64/config >cat kibana.yml | grep -vE "^$|^#"
server.host: "0.0.0.0"
elasticsearch.hosts: ["http://11.8.37.50:9200"]

wangting@ops03:/opt/module/kibana-6.6.0-linux-x86_64/config >cd ..
wangting@ops03:/opt/module/kibana-6.6.0-linux-x86_64 >nohup bin/kibana &

2.12 Verifying Kibana in the browser

Kibana is accessed at the address of the node where it is deployed, on port 5601

http://11.8.36.76:5601/

Chapter 3: Basic concepts of Elasticsearch

Section 1: Key terms

key word | interpretation
---------|---------------
cluster  | Elasticsearch runs in cluster mode by default; the whole cluster holds one complete, mutually backed-up data set
node     | Generally, one process is one node in the cluster
shard    | Even the data within one node is stored across multiple shards by a hash algorithm; the default is 5 shards (changed to 1 in 7.0)
index    | In 5.x, equivalent to a database in an RDBMS: a logical database for the user, although physically stored across multiple shards and possibly multiple nodes; in 6.x/7.x an index is closer to a table
type     | Similar to a table in an RDBMS, but more like a class in OOP: a collection of documents sharing the same JSON format (6.x allows only one type per index; 7.0 removed types, which is why the index became equivalent to the table level)
document | Similar to a row in an RDBMS, or an object in OOP
field    | Equivalent to a column or attribute

Section 2: Simple syntax examples:

GET /_cat/nodes?v # query the status of each node

GET /_cat/indices?v # query the status of each index

GET /_cat/shards/xxxx # query the shards of a given index

Section 3: Differences between Elasticsearch ports 9200 and 9300

When ES starts normally, it listens on ports 9200 and 9300.

1. 9200 serves the HTTP protocol and is mainly used for external communication; it is the RESTful interface over HTTP

2. 9300 serves the TCP protocol; JVM processes communicate with each other over it. Both cluster nodes and the TransportClient use it

Chapter 4: Elasticsearch RESTful API [DSL]

DSL: Domain Specific Language

Section 1: Data structure stored in ES

In Java code, if two related objects are stored in a relational database such as MySQL, they are usually split into two tables: Movie corresponds to one MySQL table and Actor to another;

package com.wangting.elasticsearch.test;

import java.util.List;

// In a relational model these two classes become two tables
// joined by a foreign key.
public class Movie {
    String id;
    String name;
    Double doubanScore;
    List<Actor> actorList;
}

class Actor {
    String id;
    String name;
}

But Elasticsearch uses a single JSON document to represent the same data.

{
    "id": "1", 
    "name": "operation red sea", 
    "doubanScore": "8.5", 
    "actorList": [
        {
            "id": "1", 
            "name": "zhangyi"
        }, 
        {
            "id": "2", 
            "name": "haiqing"
        }, 
        {
            "id": "3", 
            "name": "zhanghanyu"
        }
    ]
}

For ease of understanding, here is a simple comparison with MySQL. The keywords are only analogous, not exactly the same concepts, although the roles are much the same

MySQL     | elasticsearch
----------|--------------
databases | index
table     | type
row       | document
column    | field

Section 2: Common data operations

4.2.1 List the indexes in ES

Similar to SHOW TABLES in MySQL;

GET /_cat/indices?v

[note]:

  1. By default there is an index named .kibana
  2. GET /_cat/indices?v is equivalent to running curl http://11.8.37.50:9200/_cat/indices?v on the command line
  3. The trailing ?v can be omitted, but that is not recommended: the result then has no header row and is messy and hard to read

Definitions of the column headers in the index listing:

key word       | interpretation
---------------|---------------
health         | green (cluster fully healthy), yellow (single node normal, cluster incomplete), red (node abnormal)
status         | whether the index is usable
index          | index name
uuid           | unique index id
pri            | number of primary shards
rep            | number of replica shards
docs.count     | number of documents
docs.deleted   | number of deleted documents
store.size     | total space used
pri.store.size | space used by primary shards

4.2.2 Create an index

Similar to CREATE TABLE in MySQL;

PUT /indexname

[note]:

#! Deprecation: the default number of shards will change from [5] to [1] in 7.0.0; if you wish to continue using the default of [5] shards, you must manage this on the create index request or with an index template

The warning says that after major version 7.0.0 the default number of shards changes from 5 to 1; to keep using the default of 5, you must set it on the create-index request or with an index template

Query index information again:

4.2.3 Delete an index

Similar to DROP TABLE in MySQL;

DELETE /indexname

4.2.4 Create documents

Comparable to INSERT INTO a table in MySQL;

Syntax: PUT /index/type/id

If the movie_index index from the example above has been deleted, rebuild it first: PUT /movie_index

Create:

PUT /movie_index/movie/1
{
    "id": 1, 
    "name": "operation red sea", 
    "doubanScore": 8.5, 
    "actorList": [
        {
            "id": 1, 
            "name": "zhang yi"
        }, 
        {
            "id": 2, 
            "name": "hai qing"
        }, 
        {
            "id": 3, 
            "name": "zhang han yu"
        }
    ]
}

PUT /movie_index/movie/2
{
    "id": 2, 
    "name": "operation meigong river", 
    "doubanScore": 8, 
    "actorList": [
        {
            "id": 3, 
            "name": "zhang han yu"
        }
    ]
}

PUT /movie_index/movie/3
{
    "id": 3, 
    "name": "incident red sea", 
    "doubanScore": 5, 
    "actorList": [
        {
            "id": 4, 
            "name": "zhang chen"
        }
    ]
}

4.2.5 Fetch a document directly by id

Comparable to SELECT ... WHERE on a table in MySQL;

GET movie_index/movie/1

4.2.6 Modify (replace the whole document)

Comparable to UPDATE of a row in MySQL;

[note]: syntactically identical to creating a document. Requirement: all fields must be included, because the document is replaced as a whole

PUT /movie_index/movie/3
{
    "id": "3", 
    "name": "incident red sea", 
    "doubanScore": "5.0", 
    "actorList": [
        {
            "id": "1", 
            "name": "zhang chen"
        }
    ]
}

4.2.7 Modify (update a single field)

Modify the value of one field

Comparable to UPDATE of a single column in MySQL;

POST movie_index/movie/3/_update
{ 
  "doc": {
    "doubanScore":"7.0"
  } 
}
To disable indexing of a field, set in the mapping:
"name":{
  "type": "keyword",
  "index": false
}

4.2.8 Delete a document

Comparable to DELETE FROM <table> [WHERE ...] in MySQL

DELETE movie_index/movie/3

4.2.9 Retrieve all documents

Similar to SELECT * FROM a table in MySQL;

GET movie_index/movie/_search

4.2.10 Query by criteria (all)

GET movie_index/movie/_search
{
  "query":{
    "match_all": {}
  }
}

[note]: when querying by criteria, if the criteria are empty the effect is the same as GET movie_index/movie/_search, i.e. all data of the type is returned

4.2.11 Query with analysis (match)

Comparable to SELECT * FROM ... WHERE name LIKE '%xxx%' in MySQL, although the underlying principle differs

GET movie_index/movie/_search
{
  "query":{
    "match": {"name":"red"}
  }
}

4.2.12 Query a nested attribute with analysis

GET movie_index/movie/_search
{
  "query":{
    "match": {"actorList.name":"zhang"}
  }
}

4.2.13 match_phrase

Phrase query: instead of analyzing the input into separate terms, the phrase is matched as a whole against the original data

GET movie_index/movie/_search
{
    "query":{
      "match_phrase": {"name":"operation red"}
    }
}

This queries for "operation red" as a whole; the phrase is not split into terms matched one by one. A name containing only "operation" or only "red" does not satisfy the condition

4.2.14 Fuzzy query [corrective matching]

fuzzy performs corrective matching on the analyzed terms: when a term cannot be matched exactly, ES uses an algorithm to give very close terms a certain score so they can still be found, at the cost of extra performance.

GET movie_index/movie/_search
{
    "query":{
      "fuzzy": {"name":"rad"}
    }
}

This mechanism is like searching on Baidu: if the input contains a typo, the results page suggests "did you mean xxx" and returns the results for xxx
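The "closeness" behind fuzzy matching is edit distance (Levenshtein distance). A minimal sketch of the idea follows; note that ES's fuzzy query limits the number of allowed edits (e.g. fuzziness) rather than computing full distances like this:

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,         # delete from a
                dp[j - 1] + 1,     # insert into a
                prev + (ca != cb)  # substitute (free if chars match)
            )
    return dp[-1]

# "rad" is one substitution away from "red", so a fuzzy
# query for "rad" can still match documents containing "red".
print(edit_distance("rad", "red"))  # 1
```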

4.2.15 Filter after query (query, then filter)

post_filter keeps the documents that satisfy its condition and discards those that do not.

GET movie_index/movie/_search
{
    "query":{
      "match": {"name":"red"}
    },
    "post_filter":{
      "term": {
        "actorList.id": 3
      }
    }
}

First, the query finds the documents whose name contains red:

id=1 -> "name": "operation red sea",

id=3 -> "name": "incident red sea",

After filtering on actorList.id = 3, only the document with id 1 is returned

4.2.16 Filter before query (filter, then query)

For the same requirement, filtering before querying performs better than querying before filtering

GET movie_index/movie/_search
{ 
    "query":{
        "bool":{
          "filter":[ {"term": {  "actorList.id": "1"  }},
                     {"term": {  "actorList.id": "3"  }}
           ], 
           "must":{"match":{"name":"red"}}
         }
    }
}

4.2.17 Filter by range

The current data:

"name": "operation red sea",       "doubanScore": 8.5,
"name": "operation meigong river", "doubanScore": 8,
"name": "incident red sea",        "doubanScore": 5,

Find the films with a score of 8 or above

GET movie_index/movie/_search
{
   "query": {
     "bool": {
       "filter": {
         "range": {
            "doubanScore": {"gte": 8}
         }
       }
     }
   }
}

The result returns the documents with ids 1 and 2

key word | function
---------|---------
gt       | greater than
lt       | less than
gte      | greater than or equal to
lte      | less than or equal to

4.2.18 Sorting

GET movie_index/movie/_search
{
  "query":{
    "match": {"name":"red sea"}
  }
  , "sort": [
    {
      "doubanScore": {
        "order": "desc"
      }
    }
  ]
}

First find the documents whose name contains the keywords red sea, then sort the results; asc sorts from small to large, desc from large to small

4.2.19 Paged query

GET movie_index/movie/_search
{
  "query": { "match_all": {} },
  "from": 1,
  "size": 1
}

from defines the offset of the first result to return

size defines the number of results returned.
If you do not set them, from defaults to 0 and size to 10, i.e. a query returns only the first 10 documents by default.
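from/size behave like list slicing over the sorted hit list. A sketch of the semantics, using hypothetical hit data:

```python
hits = [f"doc-{i}" for i in range(25)]  # pretend these are sorted search hits

def paginate(hits, frm=0, size=10):
    """Return one page: skip `frm` hits, then take up to `size`."""
    return hits[frm:frm + size]

print(paginate(hits))                 # default page: the first 10 hits
print(paginate(hits, frm=1, size=1))  # the request above: ['doc-1']
```

This also shows why deep pagination is expensive: to serve from=10000 the engine must still collect and sort the first 10000 + size hits before slicing.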

4.2.20 Specify the fields to return

GET movie_index/movie/_search
{
  "query": { "match_all": {} },
  "_source": ["name", "doubanScore"]
}

4.2.21 Highlighting

GET movie_index/movie/_search
{
    "query":{
      "match": {"name":"red sea"}
    },
    "highlight": {
      "fields": {"name":{} }
    }
    
}

Here you can see that the matched terms in the result are wrapped in an em tag, i.e. highlighted in italics

You can create a file with an .html suffix containing the following code, save it, open it in a browser, and compare

<p>operation red sea</p>
<p>operation <em>red</em> <em>sea</em> </p>
<p>incident red sea </p>
<p>incident <em>red</em> <em>sea</em> </p>
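Conceptually, highlighting wraps each matched term in the stored text with <em> tags. A rough sketch of that behavior (ES actually works on analyzed tokens and character offsets, not simple string substitution):

```python
import re

def highlight(text: str, terms: list) -> str:
    """Wrap each matched whole word in <em>...</em>, case-insensitively."""
    pattern = r"\b(" + "|".join(map(re.escape, terms)) + r")\b"
    return re.sub(pattern, r"<em>\1</em>", text, flags=re.IGNORECASE)

print(highlight("operation red sea", ["red", "sea"]))
# operation <em>red</em> <em>sea</em>
```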

4.2.22 Aggregate query

Example 1: count how many movies each actor has appeared in

GET movie_index/movie/_search
{ 
  "aggs": {
    "groupby_actor": {
      "terms": {
        "field": "actorList.name.keyword"  
      }
    }
  }
}

# Query results:
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "movie_index",
        "_type" : "movie",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "id" : 2,
          "name" : "operation meigong river",
          "doubanScore" : 8,
          "actorList" : [
            {
              "id" : 3,
              "name" : "zhang han yu"
            }
          ]
        }
      },
      {
        "_index" : "movie_index",
        "_type" : "movie",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "id" : 1,
          "name" : "operation red sea",
          "doubanScore" : 8.5,
          "actorList" : [
            {
              "id" : 1,
              "name" : "zhang yi"
            },
            {
              "id" : 2,
              "name" : "hai qing"
            },
            {
              "id" : 3,
              "name" : "zhang han yu"
            }
          ]
        }
      },
      {
        "_index" : "movie_index",
        "_type" : "movie",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "id" : 3,
          "name" : "incident red sea",
          "doubanScore" : 5,
          "actorList" : [
            {
              "id" : 4,
              "name" : "zhang chen"
            }
          ]
        }
      }
    ]
  },
  "aggregations" : {
    "groupby_actor" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "zhang han yu",
          "doc_count" : 2
        },
        {
          "key" : "hai qing",
          "doc_count" : 1
        },
        {
          "key" : "zhang chen",
          "doc_count" : 1
        },
        {
          "key" : "zhang yi",
          "doc_count" : 1
        }
      ]
    }
  }
}

Example 2: compute each actor's average movie score and sort the actors by that average

GET movie_index/movie/_search
{ 
  "aggs": {
    "groupby_actor_id": {
      "terms": {
        "field": "actorList.name.keyword" ,
        "order": {
          "avg_score": "desc"
          }
      },
      "aggs": {
        "avg_score":{
          "avg": {
            "field": "doubanScore" 
          }
        }
       }
    } 
  }
}

# Query results:
{
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "movie_index",
        "_type" : "movie",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "id" : 2,
          "name" : "operation meigong river",
          "doubanScore" : 8,
          "actorList" : [
            {
              "id" : 3,
              "name" : "zhang han yu"
            }
          ]
        }
      },
      {
        "_index" : "movie_index",
        "_type" : "movie",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "id" : 1,
          "name" : "operation red sea",
          "doubanScore" : 8.5,
          "actorList" : [
            {
              "id" : 1,
              "name" : "zhang yi"
            },
            {
              "id" : 2,
              "name" : "hai qing"
            },
            {
              "id" : 3,
              "name" : "zhang han yu"
            }
          ]
        }
      },
      {
        "_index" : "movie_index",
        "_type" : "movie",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "id" : 3,
          "name" : "incident red sea",
          "doubanScore" : 5,
          "actorList" : [
            {
              "id" : 4,
              "name" : "zhang chen"
            }
          ]
        }
      }
    ]
  },
  "aggregations" : {
    "groupby_actor_id" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "hai qing",
          "doc_count" : 1,
          "avg_score" : {
            "value" : 8.5
          }
        },
        {
          "key" : "zhang yi",
          "doc_count" : 1,
          "avg_score" : {
            "value" : 8.5
          }
        },
        {
          "key" : "zhang han yu",
          "doc_count" : 2,
          "avg_score" : {
            "value" : 8.25
          }
        },
        {
          "key" : "zhang chen",
          "doc_count" : 1,
          "avg_score" : {
            "value" : 5.0
          }
        }
      ]
    }
  }
}
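The bucket values in the result can be cross-checked by hand. A small sketch that reproduces the same grouping and averaging over the three documents (order among equal averages may differ from the ES output):

```python
from collections import defaultdict

movies = [
    {"name": "operation meigong river", "doubanScore": 8,
     "actors": ["zhang han yu"]},
    {"name": "operation red sea", "doubanScore": 8.5,
     "actors": ["zhang yi", "hai qing", "zhang han yu"]},
    {"name": "incident red sea", "doubanScore": 5,
     "actors": ["zhang chen"]},
]

# Bucket each movie score under every actor who appears in it,
# mirroring the terms aggregation on actorList.name.keyword.
scores = defaultdict(list)
for m in movies:
    for actor in m["actors"]:
        scores[actor].append(m["doubanScore"])

# Average per bucket, sorted descending, like the nested avg sub-aggregation.
avg = sorted(((a, sum(s) / len(s)) for a, s in scores.items()),
             key=lambda kv: -kv[1])
for actor, score in avg:
    print(actor, score)
```

This matches the response: zhang han yu averages (8 + 8.5) / 2 = 8.25 over his two movies, while the other three actors each have a single score.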

keyword is a string sub-field that stores an unanalyzed copy of the value. Some scenarios only work on the unanalyzed form, such as filters and aggregations (aggs), so those fields should use the .keyword suffix

keyword properties:

  1. Indexed directly, without analysis
  2. Supports fuzzy and exact queries
  3. Supports aggregation

Section 3: Chinese word segmentation

The Chinese analysis that Elasticsearch ships with simply splits Chinese text character by character, with no concept of vocabulary at all. In practice, however, users query by words. If the text can be segmented into words, it matches the user's query conditions more appropriately, and queries are faster.

GET _analyze
{
  "text": ["wang ting niubi","Awesome today"]
  
}

# The results are as follows:
{
  "tokens" : [
    {
      "token" : "wang",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "ting",
      "start_offset" : 5,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "niubi",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "this",
      "start_offset" : 16,
      "end_offset" : 17,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "day",
      "start_offset" : 17,
      "end_offset" : 18,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "to",
      "start_offset" : 18,
      "end_offset" : 19,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    },
    {
      "token" : "power",
      "start_offset" : 19,
      "end_offset" : 20,
      "type" : "<IDEOGRAPHIC>",
      "position" : 6
    }
  ]
}

The example makes it clear: for "wang ting niubi 今天给力", the English is tokenized reasonably by splitting on spaces, but the Chinese is split character by character, which is obviously inappropriate (the words 今天 "today" and 给力 "awesome" are not recognized).
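Chinese analyzers solve this with a dictionary. The core idea can be sketched with a toy forward-maximum-matching segmenter (real analyzers such as the ik plug-in are far more sophisticated; the two-word dictionary here is just for illustration):

```python
DICTIONARY = {"今天", "给力"}  # toy vocabulary; real analyzers ship large dictionaries

def forward_max_match(text: str, max_len: int = 4) -> list:
    """Greedily take the longest dictionary word at each position,
    falling back to a single character when nothing matches."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + length]
            if length == 1 or word in DICTIONARY:
                tokens.append(word)
                i += length
                break
    return tokens

print(forward_max_match("今天给力"))  # ['今天', '给力']
```

With the dictionary the text segments into the two intended words; without it, the same code degrades to the character-by-character behavior shown in the _analyze output above.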

Tokenizer download site: https://github.com/medcl/elasticsearch-analysis-ik

[note]: it is best to use the ik version that matches your ES version; download the zip package

install

# Enter the plugins directory of es
wangting@ops01:/home/wangting >cd /opt/module/elasticsearch-6.6.0/plugins/
# Create a plug-in directory (a plug-in corresponds to a subdirectory under plugins)
wangting@ops01:/opt/module/elasticsearch-6.6.0/plugins >mkdir ik
# Download the ik plug-in zip package
wangting@ops01:/opt/module/elasticsearch-6.6.0/plugins/ik >ls
elasticsearch-analysis-ik-6.6.0.zip
# Unzip the installation package and clean up the zip file
wangting@ops01:/opt/module/elasticsearch-6.6.0/plugins/ik >unzip elasticsearch-analysis-ik-6.6.0.zip && rm elasticsearch-analysis-ik-6.6.0.zip
# directory structure
wangting@ops01:/opt/module/elasticsearch-6.6.0/plugins/ik >ll
total 1432
-rw-r--r-- 1 wangting wangting 263965 May  6  2018 commons-codec-1.9.jar
-rw-r--r-- 1 wangting wangting  61829 May  6  2018 commons-logging-1.2.jar
drwxr-xr-x 2 wangting wangting   4096 Aug 26  2018 config
-rw-r--r-- 1 wangting wangting  54693 Jan 30  2019 elasticsearch-analysis-ik-6.6.0.jar
-rw-r--r-- 1 wangting wangting 736658 May  6  2018 httpclient-4.5.2.jar
-rw-r--r-- 1 wangting wangting 326724 May  6  2018 httpcore-4.4.4.jar
-rw-r--r-- 1 wangting wangting   1805 Jan 30  2019 plugin-descriptor.properties
-rw-r--r-- 1 wangting wangting    125 Jan 30  2019 plugin-security.policy
# Distribute plug-ins to other nodes
wangting@ops01:/opt/module/elasticsearch-6.6.0/plugins/ik >cd ..
wangting@ops01:/opt/module/elasticsearch-6.6.0/plugins >scp -r ik ops02:/opt/module/elasticsearch-6.6.0/plugins/
wangting@ops01:/opt/module/elasticsearch-6.6.0/plugins >scp -r ik ops03:/opt/module/elasticsearch-6.6.0/plugins/

[note]: the plug-in takes effect only after ES is restarted; otherwise its functions are unavailable, as shown in the figure below

Restart es:

# Node ops01; Find es the corresponding process number
wangting@ops01:/opt/module/elasticsearch-6.6.0/plugins >jps | grep Elasticsearch|awk -F" " '{print $1}'
95973
wangting@ops01:/opt/module/elasticsearch-6.6.0/plugins >kill -9 95973
# Node ops02; Find es the corresponding process number
wangting@ops02:/opt/module/elasticsearch-6.6.0/plugins >jps | grep Elasticsearch|awk -F" " '{print $1}'
109175
wangting@ops02:/opt/module/elasticsearch-6.6.0/plugins >kill -9 109175
# Node ops03: find the Elasticsearch process ID
wangting@ops03:/opt/module/elasticsearch-6.6.0/plugins >jps | grep Elasticsearch|awk -F" " '{print $1}'
41777
wangting@ops03:/opt/module/elasticsearch-6.6.0/plugins >kill -9 41777

# Then start es again on each node with the following command
wangting@ops01:/opt/module/elasticsearch-6.6.0/plugins >cd /opt/module/elasticsearch-6.6.0/bin/
wangting@ops01:/opt/module/elasticsearch-6.6.0/bin >./elasticsearch -d

Test ik Chinese word segmentation

The ik plugin provides two commonly used analyzers: ik_smart and ik_max_word

ik_smart

Coarse-grained segmentation: the text is split into the fewest, longest words, and each character is used only once

ik_max_word

Finest-grained segmentation: every word that can be formed from adjacent characters is emitted, producing as many candidate terms as possible (terms may overlap)
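
The difference is easy to see with the _analyze API. A minimal sketch (the sample text 中华人民共和国, "the People's Republic of China", is an illustrative choice, not from the original):

GET _analyze
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国"
}

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国"
}

ik_smart should return the phrase as a single token, while ik_max_word also emits the shorter words contained in it (中华人民, 中华, 华人, 人民共和国, 人民, 共和国, …).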

[note]: as the example above shows, different analyzers produce noticeably different results. Therefore, do not rely on the default mapping when defining a type: create the mapping manually and specify the analyzer, choosing the one that suits the use case

Custom Chinese dictionary

New buzzwords keep appearing in daily life. For example, the one I have run into most recently is "yyds", short for "eternal God". With the stock dictionary, "eternal God" gets split into its component words, "forever" and "God", instead of being kept as the single term we actually mean.

In this case, a custom Chinese dictionary needs to be maintained.

Before installing the custom dictionary, let's check an example and keep the result for comparison afterwards:

Before installation:
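
The original screenshot is missing; the check would look something like this _analyze call on "eternal God" in its original Chinese form, 永远的神 (an assumed reconstruction, not the author's exact request):

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "永远的神"
}

Before the custom dictionary is installed, the phrase is expected to come back split into its component words rather than as a single token.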

Install and deploy the custom dictionary:

wangting@ops01:/opt/module/elasticsearch-6.6.0/plugins/ik/config >pwd
/opt/module/elasticsearch-6.6.0/plugins/ik/config
wangting@ops01:/opt/module/elasticsearch-6.6.0/plugins/ik/config >ls
extra_main.dic  extra_single_word.dic  extra_single_word_full.dic  extra_single_word_low_freq.dic  extra_stopword.dic  IKAnalyzer.cfg.xml  main.dic  preposition.dic  quantifier.dic  stopword.dic  suffix.dic  surname.dic
# Modify the ik plugin's config/IKAnalyzer.cfg.xml
# The line <entry key="remote_ext_dict">http://11.8.38.86/fenci/esword.txt</entry> points at a word file served through nginx
wangting@ops01:/opt/module/elasticsearch-6.6.0/plugins/ik/config >vim IKAnalyzer.cfg.xml 
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
        <comment>IK Analyzer Extended configuration</comment>
        <!--Users can configure their own extended dictionary here -->
        <entry key="ext_dict"></entry>
         <!--Users can configure their own extended stop word dictionary here-->
        <entry key="ext_stopwords"></entry>
        <!--Users can configure the remote extension dictionary here -->
        <entry key="remote_ext_dict">http://11.8.38.86/fenci/esword.txt</entry>
        <!--Users can configure the remote extended stop word dictionary here-->
        <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

# Switch to a machine with nginx service (if there is no nginx, you need to deploy and install it yourself)
root@ops04:/usr/local/nginx-1.10/conf #cd /usr/local/nginx-1.10/
root@ops04:/usr/local/nginx-1.10 #mkdir ik
root@ops04:/usr/local/nginx-1.10 #cd ik
root@ops04:/usr/local/nginx-1.10/ik #mkdir fenci
root@ops04:/usr/local/nginx-1.10/ik #cd fenci
root@ops04:/usr/local/nginx-1.10/ik/fenci #echo "Wang Ting" >> esword.txt
root@ops04:/usr/local/nginx-1.10/ik/fenci #echo "eternal God" >> esword.txt
root@ops04:/usr/local/nginx-1.10/ik/fenci #echo "sacred hymn" >> esword.txt
root@ops04:/usr/local/nginx-1.10/ik/fenci #cat esword.txt 
Wang Ting
 Eternal God
 Sacred Hymn

root@ops04:/usr/local/nginx-1.10/ik/fenci #vim /usr/local/nginx-1.10/conf/nginx.conf
listen       80;
        server_name  localhost;

        location / {
            root   html;
            index  index.html index.htm;
        }
		
		# Add the following configuration:
        location /fenci/ {
            root   ik;
        }

root@ops04:/usr/local/nginx-1.10/ik/fenci #/usr/local/nginx-1.10/sbin/nginx -s reload

# The URL must match the remote_ext_dict entry in IKAnalyzer.cfg.xml; setting up nginx first and then editing IKAnalyzer.cfg.xml would arguably be the more sensible order
wangting@ops01:/opt/module/elasticsearch-6.6.0/plugins/ik/config >curl http://11.8.38.86/fenci/esword.txt
 Wang Ting
 Eternal God
 Sacred Hymn

# Modify the xml configuration and distribute it to other nodes
wangting@ops01:/opt/module/elasticsearch-6.6.0/plugins/ik/config >scp IKAnalyzer.cfg.xml ops02:/opt/module/elasticsearch-6.6.0/plugins/ik/config/ 
wangting@ops01:/opt/module/elasticsearch-6.6.0/plugins/ik/config >scp IKAnalyzer.cfg.xml ops03:/opt/module/elasticsearch-6.6.0/plugins/ik/config/ 

# Restart es (each node)
wangting@ops01:/opt/module/elasticsearch-6.6.0/plugins/ik/config >jps | grep Elasticsearch|awk -F" " '{print $1}'
13077
wangting@ops01:/opt/module/elasticsearch-6.6.0/plugins/ik/config >kill -9 13077
wangting@ops01:/opt/module/elasticsearch-6.6.0/plugins/ik/config >cd /opt/module/elasticsearch-6.6.0/bin/
wangting@ops01:/opt/module/elasticsearch-6.6.0/bin >./elasticsearch -d

Restart es and retest: (the newly defined words can be recognized successfully)
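
To see the effect, the same _analyze request from before can be repeated (a sketch, again using the Chinese form 永远的神 of the dictionary entry "eternal God"):

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "永远的神"
}

With the remote dictionary in place, the whole phrase should now appear as a single token in the result.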

Section 4 about mapping

A type in es is analogous to a table in MySQL: just as MySQL constrains each column's data type when a table is created, es can define the data type of each field, and this per-field definition is exactly the type's mapping.

[note]: if no mapping is set, es automatically infers each field's data type from the format of the first piece of data it receives
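
A quick sketch of that inference: write one document into a brand-new index and read back the mapping es generated (the index name movie_probe is made up for illustration):

PUT movie_probe/movie/1
{
  "id": 1,
  "name": "test movie",
  "doubanScore": 8.5
}

GET movie_probe/_mapping

es should report an inferred numeric type for id and doubanScore, and text with a keyword sub-field for name.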

4.4. 1 view the mapping of type

GET movie_index/_mapping/movie

Common types:

​ true/false → boolean
​ 1020 → long
​ 20.1 → double
​ "2018-02-01" → date
​ "hello world" → text + keyword
[note]:

  1. By default, only text fields are analyzed (segmented); a keyword field is a string that is never segmented.

  2. Besides automatic inference, a mapping can also be defined manually, but only for new fields that hold no data yet. Once a field contains data, its mapping cannot be changed.

  3. Although each field's data sits under a different type, fields with the same name may only have one mapping definition within a single index.

4.4. 2 build index based on Chinese words

Creating mapping

PUT movie_chn
{
  "mappings": {
    "movie":{
      "properties": {
        "id":{
          "type": "long"
        },
        "name":{
          "type": "text"
          , "analyzer": "ik_smart"
        },
        "doubanScore":{
          "type": "double"
        },
        "actorList":{
          "properties": {
            "id":{
              "type":"long"
            },
            "name":{
              "type":"keyword"
            }
          }
        }
      }
    }
  }
}

name → defined as the text type, analyzed with the ik Chinese analyzer ik_smart

Execution results:

After creation
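
The mapping can be read back to confirm that the analyzer was recorded:

GET movie_chn/_mapping/movie

The name field should show "type": "text" with "analyzer": "ik_smart".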

PUT insert data:

# Data 1
PUT /movie_chn/movie/1
{ "id":1,
  "name":"Operation Red Sea",
  "doubanScore":8.5,
  "actorList":[  
  {"id":1,"name":"Zhang Yi"},
  {"id":2,"name":"Haiqing"},
  {"id":3,"name":"Zhang Hanyu"}
 ]
}

# Data 2
PUT /movie_chn/movie/2
{
  "id":2,
  "name":"Mekong action",
  "doubanScore":8.0,
  "actorList":[  
{"id":3,"name":"Zhang Hanyu"}
]
}

# Data 3
PUT /movie_chn/movie/3
{
  "id":3,
  "name":"Red Sea event",
  "doubanScore":5.0,
  "actorList":[  
{"id":4,"name":"Zhang Chen"}
]
}

The three name values: "Operation Red Sea", "Mekong action", "Red Sea event"

Test query effect:

Query for movies whose name matches "Red Sea campaign":

GET /movie_chn/movie/_search
{
  "query": {
    "match": {
      "name": "Red Sea campaign"
    }
  }
}

Query for the actor "Zhang Yi" (term performs an exact match against the keyword field actorList.name):

GET /movie_chn/movie/_search
{
  "query": {
    "term": {
      "actorList.name": "Zhang Yi"
    }
  }
}

Section 5 index aliases (_aliases)

An index alias is like a shortcut or symbolic link: it can point to one or more indexes and can be used by any API that expects an index name. Aliases give us great flexibility, allowing us to:
1. Group multiple indexes (for example, last_three_months can point to several indexes)
2. Create a filtered view of a subset of an index
3. Switch seamlessly from one index to another on a running cluster

4.5. 1 create index alias

PUT movie_chn_2
{  "aliases": {
      "movie_chn_2020-query": {}
  }, 
  "mappings": {
    "movie":{
      "properties": {
        "id":{
          "type": "long"
        },
        "name":{
          "type": "text"
          , "analyzer": "ik_smart"
        },
        "doubanScore":{
          "type": "double"
        },
        "actorList":{
          "properties": {
            "id":{
              "type":"long"
            },
            "name":{
              "type":"keyword"
            }
          }
        }
      }
    }
  }
}

Define alias:

"aliases": {

"movie_chn_2020-query": {}

}

Add alias to existing index

POST  _aliases
{
    "actions": [
        { "add":    { "index": "movie_chn_2", "alias": "movie_chn_2020-query" }}
    ]
}

# Additional add actions can be issued the same way
POST  _aliases
{
    "actions": [
        { "add":    { "index": "movie_chn_2", "alias": "movie_chn_2020-query" }}
    ]
}

You can also narrow the query range by adding filter conditions to create a subset view

POST  _aliases
{
    "actions": [
        { "add": {
            "index": "movie_chn_2",
            "alias": "movie_chn0919-query-zhhy",
            "filter": {
                "term": { "actorList.id": "3" }
            }
        }}
    ]
}

4.5. 2 query alias

GET movie_chn_2020-query/_search

4.5. 3 delete an index alias

POST  _aliases
{
    "actions": [
        { "remove":    { "index": "movie_chn_2", "alias": "movie_chn_2020-query" }}
    ]
}

4.5. 4 alias switching

POST /_aliases
{
    "actions": [
        { "remove": { "index": "movie_chn_2", "alias": "movie_chn_2021-query" }},
        { "add":    { "index": "movie_chn", "alias": "movie_chn_2021-query" }}
    ]
}

4.5. 5 query alias list

GET  _cat/aliases?v

Section 6 index template

The Index Template is a template for creating indexes: it defines a set of rules that give a new index the mappings and settings a particular business scenario requires, so that indexes are created consistently and predictably.

Index templates also make index creation easier. For example, when es receives the first document for an index that does not exist yet, and the index name matches a template's index pattern, es creates the index directly from that template.

4.6. 1 split index

A split index divides one business index into multiple indexes by time interval.
For example:
order_info becomes order_info_0801, order_info_0802, order_info_0803, …
There are two benefits:

  1. Flexible structure changes: elasticsearch does not allow an existing field's mapping to be modified, yet in practice the structure and settings of an index inevitably evolve. With split indexes, only the index for the next time interval needs the new structure while existing indexes stay untouched, which provides a degree of flexibility.
  2. Smaller query range: queries rarely span all time periods, so splitting by time physically reduces the amount of data scanned, which is itself a performance optimization.

4.6. 2 create index template

PUT _template/template_movie
{
  "index_patterns": ["movie_test*"],                  
  "settings": {                                               
    "number_of_shards": 1
  },
  "aliases" : { 
    "{index}-query": {},
    "movie_test-query":{}
  },
  "mappings": {                                          
"_doc": {
      "properties": {
        "id": {
          "type": "keyword"
        },
        "movie_name": {
          "type": "text",
          "analyzer": "ik_smart"
        }
      }
    }
  }
}

Here "index_patterns": ["movie_test*"] means that when data is written to an index whose name begins with movie_test, e.g. movie_test_001, and that index does not exist yet, es automatically creates it from this template.

shard quantity setting:

  "settings": {                                               
    "number_of_shards": 1
  },

Use {index} in "aliases" to obtain the real created index name.

POST movie_test_20210801/_doc
{
  "id":"100",
  "name":"aaa"
}

POST movie_test_20210801/_doc
{
  "id":"101",
  "name":"bbb"
}

POST movie_test_20210802/_doc
{
  "id":"102",
  "name":"ccc"
}
POST movie_test_20210801/_doc
{
  "id":"103",
  "name":"ddd"
}
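
To verify the template took effect, list the generated indexes and query through the alias from the template (a sketch):

GET _cat/indices/movie_test*?v

GET movie_test-query/_search

Because the template attaches the movie_test-query alias to every matching index, the search should return the documents from both movie_test_20210801 and movie_test_20210802.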

4.6. 3 query the existing template list

GET  _cat/templates

4.6. 4 view the details of a template

GET  _template/template_movie*


Keywords: Operation & Maintenance Database Big Data ElasticSearch ELK

Added by parijat_php on Tue, 28 Dec 2021 09:46:24 +0200