Getting started with Elasticsearch

Basic concepts

  • Index: similar to a table in a traditional relational database. It is where related documents are stored.

  • Document type (deprecated in 7.0, removed in 8.0)

  • Document (doc)

A doc represents one piece of data in the index, like a record in a database table. A doc stores its data in JSON format.

ES architecture design

Simple definition of each component of the architecture:

  • Gateway: the underlying storage layer, generally a file system. Multiple storage types are supported.

  • Distributed Lucene Directory: a distributed framework built on Lucene that encapsulates the inverted index, data storage, translog, segments and so on.

  • Module layer: the main modules of ES, including the index module, search module and mapping module.

  • Discovery: the cluster node discovery module, used for communication between cluster nodes and for electing the master node. It supports several discovery mechanisms, such as zen and ec2.

  • Script: the script parsing module, which supports scripts embedded in query statements, written in languages such as painless, groovy and python.

  • Plugins: third-party plug-ins, which provide various advanced features and allow customization.

  • Transport/JMX: the communication module for data transmission, built on the Netty framework.

  • RESTful/Node: the interface for accessing the Elasticsearch cluster.

  • X-Pack: an extension pack for Elasticsearch that integrates security, alerting, monitoring, graph and reporting features, with seamless integration and a pluggable design.
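The inverted index mentioned in the component list is the structure that makes full-text search fast: it maps each term to the documents that contain it. The following is only a minimal Python sketch of the idea, not how Lucene actually stores its data. The `tokenize` helper and the sample texts are assumptions made for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split on non-alphanumeric characters.

    This is a rough stand-in for an analyzer, not the real one.
    """
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def build_inverted_index(docs):
    """Map each term to the sorted list of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Illustrative sample documents (doc id -> text)
docs = {
    "1": "Zhang San, Today is Monday",
    "2": "Li Si, today is Tuesday",
}

inverted = build_inverted_index(docs)
print(inverted["today"])   # ids of docs containing the term "today"
print(inverted["monday"])  # ids of docs containing the term "monday"
```

A search for a term then becomes a dictionary lookup instead of a scan over every document.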

Image source: Juejin (Nuggets)

Request logic diagram

Source: Architect training

Create, update and delete operations

Create index

#Syntax: PUT /<index>
PUT /report_index_1
{
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 1
    },
    "mappings": {
        "properties": {
            "action_cost": {
                "type": "double"
            },
            "activation": {
                "type": "long"
            },
            "advertiser_name": {
                "type": "text",
                "fields":{
                    "keyword":{
                        "type":"keyword"
                    }
                }
            },
            "cover_url": {
                "type": "keyword"
            },
            "created_at": {
                "type": "date",
                "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd HH:mm:ss.SSS||yyyy-MM-dd||epoch_millis||strict_date_optional_time||yyyy-MM-dd'T'HH:mm:ss'+'08:00"
            }
        }
    }
}
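To make the settings above more concrete, here is a small Python sketch that builds the same kind of request body and computes how many shard copies the cluster has to host: each primary shard gets `number_of_replicas` extra copies. The helper names `build_index_body` and `total_shard_copies` are made up for this example. Nothing here talks to a real cluster.

```python
import json


def build_index_body(shards, replicas, properties):
    """Build the JSON body for a create-index request."""
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        },
        "mappings": {"properties": properties},
    }


def total_shard_copies(body):
    """Primaries plus replicas: shards * (1 + replicas)."""
    settings = body["settings"]
    return settings["number_of_shards"] * (1 + settings["number_of_replicas"])


body = build_index_body(
    shards=3,
    replicas=1,
    properties={
        "activation": {"type": "long"},
        "cover_url": {"type": "keyword"},
    },
)

print(json.dumps(body, indent=2))
print(total_shard_copies(body))  # 3 primaries + 3 replicas
```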

Create doc

#Specify id
#Syntax: PUT /<index>/_doc/<_id>
PUT /report_index_1/_doc/1
{
  "cover_url":"test2",
  "activation":100
}

#Do not specify id (one is generated automatically)
#Syntax: POST /<index>/_doc
POST /report_index_1/_doc/
{
  "cover_url":"test2",
  "activation":200
}

#If the doc already exists, creation fails with a version conflict error
PUT /report_index_1/_create/10
{
  "cover_url": "test10",
  "activation": 1000
}

Bulk creation and other bulk operations

POST _bulk
{"index":{"_index":"report_index_1","_id":"1"}}
{"cover_url":"test1","activation":100, "advertiser_name":"Zhang San, Today is Monday"}
{"index":{"_index":"report_index_1","_id":"2"}}
{"cover_url":"test2","activation":200,"advertiser_name":"Li Si, today is Tuesday"}
{"index":{"_index":"report_index_1","_id":"3"}}
{"cover_url":"test3","activation":300,"advertiser_name":"Wang Wu is today Wednesday"}
{"index":{"_index":"report_index_1","_id":"4"}}
{"cover_url":"test4","activation":400, "advertiser_name":"Ma Liu, today is Thursday"}
{"create":{"_index":"report_index_1","_id":"5"}}
{"cover_url":"test5","activation":500, "advertiser_name": "Advertiser"}
{"delete":{"_index":"report_index_1","_id":"5"}}
{"update":{"_id":"1","_index":"report_index_1"}}
{"doc":{"cover_url":"test11"}}
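The _bulk body is newline-delimited JSON: each action line is followed by an optional source line (delete has none), and the body must end with a newline. A minimal Python sketch of assembling such a body is shown below. The `build_bulk_body` helper is an assumption made for illustration and is not part of any client library.

```python
import json


def build_bulk_body(operations):
    """Serialize (action, metadata, source) tuples into an NDJSON bulk body.

    `source` must be None for "delete"; for "update" it is the
    {"doc": {...}} wrapper, for "index"/"create" the document itself.
    """
    lines = []
    for action, meta, source in operations:
        lines.append(json.dumps({action: meta}))
        if action == "delete":
            continue  # delete has no source line
        if source is None:
            raise ValueError(f"{action} requires a source line")
        lines.append(json.dumps(source))
    # The bulk API requires the body to end with a newline
    return "\n".join(lines) + "\n"


operations = [
    ("index", {"_index": "report_index_1", "_id": "1"},
     {"cover_url": "test1", "activation": 100}),
    ("delete", {"_index": "report_index_1", "_id": "5"}, None),
    ("update", {"_index": "report_index_1", "_id": "1"},
     {"doc": {"cover_url": "test11"}}),
]

bulk_body = build_bulk_body(operations)
print(bulk_body)
```

The resulting string can be sent as the body of POST _bulk with the content type application/x-ndjson.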

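Update index fields

ES does not support changing the type of an existing field. To change a field type, create a new index with the desired mapping and reindex the data. New fields can be added to an existing mapping:

#Syntax: PUT /<index>/_mapping
PUT /report_index_1/_mapping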
{
    "properties": {
        "creative_create_time": {
            "type": "date",
            "format": "yyyy-MM-dd HH:mm:ss"
        }
    }
}
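Because the type of an existing field cannot be changed, the usual workaround is to create a new index with the corrected mapping and copy the data over with the _reindex API. The Python sketch below only generates the two console requests as text. The target index name `report_index_2` and the changed field type are hypothetical.

```python
import json


def reindex_requests(old_index, new_index, new_properties):
    """Return the console requests for a create-then-copy migration."""
    create = f"PUT /{new_index}\n" + json.dumps(
        {"mappings": {"properties": new_properties}}, indent=2
    )
    copy = "POST /_reindex\n" + json.dumps(
        {"source": {"index": old_index}, "dest": {"index": new_index}},
        indent=2,
    )
    return [create, copy]


# Hypothetical change: store "activation" as double instead of long
requests = reindex_requests(
    "report_index_1",
    "report_index_2",
    {"activation": {"type": "double"}},
)

for request in requests:
    print(request)
    print()
```

After the copy finishes, clients can be pointed at the new index and the old one deleted.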

Update doc

#Update by id (PUT replaces the whole document with the request body)
#Syntax: PUT /<index>/_doc/<_id>
PUT /report_index_1/_doc/227
{
  "advertiser_name" : "Hi Sushi-Lee"
}
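Note that PUT on /<index>/_doc/<_id> replaces the stored document with the request body, so fields not present in the body are lost. A partial update goes through POST /<index>/_update/<_id> with a "doc" object, which is merged into the existing source. The difference can be simulated with plain dictionaries. The two helper functions below are illustrative only and do not call Elasticsearch.

```python
def replace_doc(existing, new_source):
    """PUT /<index>/_doc/<_id>: the stored source becomes the new body."""
    return dict(new_source)


def partial_update(existing, partial):
    """POST /<index>/_update/<_id> with {"doc": partial}: fields are merged."""
    merged = dict(existing)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = partial_update(merged[key], value)  # merge nested objects
        else:
            merged[key] = value
    return merged


stored = {"cover_url": "test2", "activation": 100}
change = {"advertiser_name": "Hi Sushi-Lee"}

replaced = replace_doc(stored, change)
updated = partial_update(stored, change)

print(replaced)  # only advertiser_name remains
print(updated)   # cover_url and activation are kept
```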

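#Update by condition
#A term query is not analyzed, so match against the keyword sub-field rather than the analyzed text field
POST /report_index_1/_update_by_query
{
  "query": {
    "term": {
      "advertiser_name.keyword": "Hi Sushi-Lee"
    }
  },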
  "script": {
    "source": "ctx._source['advertiser_name'] = 'Hi Sushi'"
  }
}

Delete index

#Syntax: DELETE /<index>
DELETE /report_index_1

Delete doc

#Syntax: DELETE /<index>/_doc/<_id>
DELETE /report_index_1/_doc/137

Delete all docs in an index

POST /report_index_1/_delete_by_query
{
    "query": {
       "match_all": {}
     }
}

Common query types

General query

  • Query all data in an index
GET /report_index_1/_search
{   
  "query": {
    "match_all": {}
  }
}

  • Query by id
GET /report_index_1/_doc/1

  • Wildcard query (fuzzy matching)
GET /report_index_1/_search
{
  "query": {
    "wildcard": {
      "cover_url": {
        "value": "*4*"
      }
    }
  }
}

  • Exists query
GET /report_index_1/_search
{
  "query": {
    "exists": {
      "field": "cover_url"
    }
  }
}

  • Range query
GET /report_index_1/_search
{
  "query":{
    "range": {
      "activation": {
        "gt": 1,
        "lte": 200
      }
    }
  }
}

  • Exact (term) query
GET /report_index_1/_search
{
  "query":{
    "term": {
      "_id": {
        "value": "1"
      }
    }
  }
}

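  • Regexp query
#A regexp pattern is not analyzed and must match the whole term, so query the keyword sub-field
GET /report_index_1/_search
{
  "query": {
    "regexp": {
      "advertiser_name.keyword": "Zhang.*"
    }
  }
}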

  • Compound query (bool query)

A bool query combines the results of multiple query clauses. It supports the following clause types:

must: the clause must match in matching documents, and it contributes to the score.

filter: the clause must match, like must, but its score is ignored. Filter clauses can be cached.

must_not: the clause must not match in matching documents.

should: the clause should match. Matching is not required, but a document that matches gets a higher score.

All must clauses must match and all must_not clauses must not match, but how many should clauses have to match? By default none, with one exception: when there is no must (or filter) clause, at least one should clause must match. Just as the precision of a match query can be controlled (https://www.elastic.co/guide/cn/elasticsearch/guide/current/match-multi-word.html), the minimum_should_match parameter controls how many should clauses have to match. It can be either an absolute number or a percentage.

GET /report_index_1/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "terms": {
            "_id": [
              "1",
              "2"
            ]
          }
        }
      ],
      "must_not": [
        {
          "term": {
            "cover_url": {
              "value": "test2"
            }
          }
        }
      ],
      "should": [
        {
          "term": {
            "cover_url": {
              "value": "test1"
            }
          }
        }
      ]
    }
  }
}
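The matching rules of the bool query above can be sketched in Python by treating each clause as a predicate over a document. This simulation ignores scoring and only decides whether a document matches. The helper names (`term`, `terms`, `bool_match`) and the three sample docs are assumptions made for illustration.

```python
def term(field, value):
    """Clause that matches when the field equals the value exactly."""
    return lambda doc: doc.get(field) == value


def terms(field, values):
    """Clause that matches when the field equals any of the values."""
    return lambda doc: doc.get(field) in values


def bool_match(doc, must=(), filter_=(), must_not=(), should=(),
               minimum_should_match=None):
    """Decide whether a doc matches a bool query (scores are ignored)."""
    if minimum_should_match is None:
        # Default: 1 when there are should clauses but no must/filter, else 0
        minimum_should_match = 1 if should and not (must or filter_) else 0
    if not all(clause(doc) for clause in list(must) + list(filter_)):
        return False
    if any(clause(doc) for clause in must_not):
        return False
    matched_should = sum(1 for clause in should if clause(doc))
    return matched_should >= minimum_should_match


docs = [
    {"_id": "1", "cover_url": "test1"},
    {"_id": "2", "cover_url": "test2"},
    {"_id": "3", "cover_url": "test3"},
]

# Same shape as the request above: must + must_not + should
hits = [
    d["_id"] for d in docs
    if bool_match(
        d,
        must=[terms("_id", ["1", "2"])],
        must_not=[term("cover_url", "test2")],
        should=[term("cover_url", "test1")],
    )
]
print(hits)

# With only should clauses, at least one of them has to match
should_only = [
    d["_id"] for d in docs
    if bool_match(d, should=[term("cover_url", "test3")])
]
print(should_only)
```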

Text query

The production ES cluster currently uses the IK analyzer for text segmentation.

Built-in analyzers:

Fingerprint, Keyword, Language, Pattern, Simple, Standard (default), Stop, Whitespace

IK analyzers (Chinese):

ik_smart, ik_max_word

  • match
GET /report_index_1/_search
{
  "query": {
    "match": {
      "advertiser_name": {
        "query": "Ma Liu today",
        "analyzer": "standard"
      }
    }
  }
}

  • match_phrase
GET /report_index_1/_search
{
  "query": {
    "match_phrase": {
      "advertiser_name": {
        "query": "Ma Liu today",
        "analyzer": "standard"
      }
    }
  }
}

  • View analysis results
POST /_analyze?pretty=true
{
  "text":"Zhang San, Today is Monday",
  "tokenizer":"ik_max_word"
}
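The difference between match and match_phrase comes down to how the analyzed tokens are compared. By default, match succeeds when any query token occurs in the field, while match_phrase requires all query tokens to occur next to each other in the same order. Below is a rough Python sketch of this. The `analyze` function only approximates the standard analyzer (lowercasing, splitting on punctuation and whitespace), and the sample texts are illustrative.

```python
import re


def analyze(text):
    """Rough approximation of the standard analyzer for ASCII text."""
    return re.findall(r"[0-9a-z]+", text.lower())


def match(field_text, query):
    """Default match behaviour: any query token occurs in the field (OR)."""
    field_tokens = set(analyze(field_text))
    return any(token in field_tokens for token in analyze(query))


def match_phrase(field_text, query):
    """All query tokens occur consecutively and in order (slop 0)."""
    field_tokens = analyze(field_text)
    query_tokens = analyze(query)
    n = len(query_tokens)
    if n == 0:
        return False
    return any(
        field_tokens[i:i + n] == query_tokens
        for i in range(len(field_tokens) - n + 1)
    )


query = "Ma Liu today"
for text in ["Ma Liu, today is Thursday",
             "Wang Wu is today Wednesday",
             "Advertiser"]:
    print(text, "|", match(text, query), "|", match_phrase(text, query))
```

The second text matches the match query only because it contains "today", but it fails match_phrase because "ma liu today" never appears as a consecutive sequence.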

Aggregation query

  • bucket aggregation

Equivalent to GROUP BY in a database.

Bucketing example:

GET /report_index_1/_search
{
  "query":{
    "range":{
      "activation":{
        "gte":300
      }
    }
  },
  "size": 0,
  "aggs": {
    "my-agg-name": {
      "terms": {
        "field": "advertiser_name.keyword"
      }
    }
  }
}

  • metric aggregation

Equivalent to aggregate functions in a database, such as max and avg.

GET /report_index_1/_search
{
  "query":{
    "range":{
      "activation":{
        "gte":300
      }
    }
  },
  "size": 0,
  "aggs": {
    "my-agg-name-avg":{
      "avg": {
        "field": "activation"
      }
    }
  }
}

  • bucket and metric combined query
GET /report_index_1/_search?pretty
{
  "size":0,
  "aggs": {
    "my-agg-name": {
      "terms": {
        "field": "advertiser_name.keyword"
      },
      "aggs": {
        "my-sub-agg-name": {
          "avg": {
            "field": "activation"
          }
        }
      }
    }
  }
}
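To see what the combined request computes, the following Python sketch reproduces a terms bucket aggregation with an avg sub-aggregation over an in-memory list. It is a simplified simulation: the function name `terms_avg` and the sample docs are made up for this example, and the buckets are ordered by doc count descending, then by key.

```python
from collections import defaultdict


def terms_avg(docs, group_field, metric_field):
    """Group docs by group_field and average metric_field per bucket."""
    counts = defaultdict(int)
    values = defaultdict(list)
    for doc in docs:
        if group_field not in doc:
            continue  # docs without the grouping field fall into no bucket
        key = doc[group_field]
        counts[key] += 1
        if metric_field in doc:
            values[key].append(doc[metric_field])
    buckets = []
    for key in sorted(counts, key=lambda k: (-counts[k], k)):
        vals = values[key]
        buckets.append({
            "key": key,
            "doc_count": counts[key],
            "avg": sum(vals) / len(vals) if vals else None,
        })
    return buckets


# Illustrative sample data
docs = [
    {"advertiser_name": "Advertiser A", "activation": 300},
    {"advertiser_name": "Advertiser A", "activation": 500},
    {"advertiser_name": "Advertiser B", "activation": 400},
]

buckets = terms_avg(docs, "advertiser_name", "activation")
for bucket in buckets:
    print(bucket)
```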

  • pipeline aggregation

A pipeline aggregation works on the output of other aggregations (buckets, or metrics of buckets) rather than directly on documents.

Its purpose is to add useful derived information to that output.

buckets_path: the path to the metric used to compute the average

#avg_bucket averages the total sales of all months
#calendar_interval replaces the deprecated interval parameter (7.2+)
POST /_search
{
    "size": 0,
    "aggs" : {
        "sales_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "calendar_interval" : "month"
            },
            "aggs": {
                "sales": {
                    "sum": {
                        "field": "price"
                    }
                }
            }
        },
        "avg_monthly_sales": {
            "avg_bucket": {
                "buckets_path": "sales_per_month>sales"
            }
        }
    }
}
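The pipeline example can be simulated in two steps: first sum the price per month (the date_histogram with a sum sub-aggregation), then average those monthly sums (avg_bucket). The Python sketch below does this on a handful of made-up sales records. The function names and data are assumptions for illustration.

```python
from collections import defaultdict


def monthly_sales(docs):
    """Step 1: bucket by month (yyyy-MM) and sum the price in each bucket."""
    sums = defaultdict(float)
    for doc in docs:
        month = doc["date"][:7]  # "2015-01-15" -> "2015-01"
        sums[month] += doc["price"]
    return dict(sorted(sums.items()))


def avg_bucket(bucket_values):
    """Step 2: average a metric across buckets (the pipeline step)."""
    values = list(bucket_values)
    return sum(values) / len(values) if values else None


# Illustrative sample data
sales = [
    {"date": "2015-01-01", "price": 200},
    {"date": "2015-01-15", "price": 350},
    {"date": "2015-02-01", "price": 60},
    {"date": "2015-03-01", "price": 375},
]

per_month = monthly_sales(sales)
avg_monthly_sales = avg_bucket(per_month.values())

print(per_month)
print(avg_monthly_sales)
```

The second step never looks at the original documents. It only reads the output of the first aggregation, which is what the buckets_path "sales_per_month>sales" expresses.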

Cluster basic information query

  • View basic cluster information
GET /

  • View cluster health
GET /_cluster/health?pretty

  • View all indices in the cluster
GET /_cat/indices

  • View index mapping information
GET /report_index_1/_mapping

Example

Report query with conditions

POST /report_index_1/_search
{
    "aggregations":{
        "count":{
            "cardinality":{
                "field":"unit_id"
            }
        },
        "group":{
            "aggregations":{
                "bucket_sort":{
                    "bucket_sort":{
                        "size":20
                    }
                },
                "lately_top_hit":{
                    "top_hits":{
                        "size":1,
                        "sort":[
                            {
                                "unit_create_time":{
                                    "order":"desc"
                                }
                            }
                        ]
                    }
                },
                "max_create_time":{
                    "max":{
                        "field":"unit_create_time"
                    }
                },
                "sum_aclick":{
                    "sum":{
                        "field":"aclick"
                    }
                },
                "sum_bclick":{
                    "sum":{
                        "field":"bclick"
                    }
                },
                "sum_charge":{
                    "sum":{
                        "field":"charge"
                    }
                },
                "sum_photo_click":{
                    "sum":{
                        "field":"photo_click"
                    }
                },
                "sum_show":{
                    "sum":{
                        "field":"show"
                    }
                }
            },
            "terms":{
                "field":"unit_id",
                "order":[
                    {
                        "max_create_time":"desc"
                    }
                ],
                "size":20
            }
        },
        "sum_aclick":{
            "sum":{
                "field":"aclick"
            }
        },
        "sum_bclick":{
            "sum":{
                "field":"bclick"
            }
        },
        "sum_charge":{
            "sum":{
                "field":"charge"
            }
        },
        "sum_photo_click":{
            "sum":{
                "field":"photo_click"
            }
        },
        "sum_show":{
            "sum":{
                "field":"show"
            }
        }
    },
    "from":0,
    "query":{
        "bool":{
            "filter":[
                {
                    "range":{
                        "stat_date":{
                            "format":"yyyy-MM-dd",
                            "from":"2021-06-08",
                            "include_lower":true,
                            "include_upper":true,
                            "to":"2021-06-08"
                        }
                    }
                },
                {
                    "term":{
                        "campaign_id":62935142
                    }
                }
            ]
        }
    },
    "size":0
}
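The report request repeats the same sum aggregations at the top level and inside each bucket, so in application code it is usually generated rather than written by hand. Here is a minimal Python sketch of such a builder. The helper names are hypothetical, and for brevity it leaves out the top_hits sub-aggregation, the query and the paging parameters from the request above.

```python
import json


def sum_aggs(fields):
    """Build one sum aggregation per field, named sum_<field>."""
    return {f"sum_{field}": {"sum": {"field": field}} for field in fields}


def build_report_aggs(group_field, metrics, size):
    """Build overall totals plus the same totals per group bucket."""
    per_bucket = sum_aggs(metrics)
    per_bucket["max_create_time"] = {"max": {"field": "unit_create_time"}}
    per_bucket["bucket_sort"] = {"bucket_sort": {"size": size}}

    aggs = sum_aggs(metrics)  # totals over all matching docs
    aggs["count"] = {"cardinality": {"field": group_field}}
    aggs["group"] = {
        "terms": {
            "field": group_field,
            "order": [{"max_create_time": "desc"}],
            "size": size,
        },
        "aggregations": per_bucket,
    }
    return aggs


METRICS = ["aclick", "bclick", "charge", "photo_click", "show"]
report_aggs = build_report_aggs("unit_id", METRICS, size=20)
print(json.dumps({"size": 0, "aggregations": report_aggs}, indent=2))
```

Adding a new metric to the report then only requires appending its field name to the list.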




Keywords: Java Big Data ElasticSearch

Added by abie10 on Thu, 03 Feb 2022 05:10:31 +0200