elasticsearch index cross-cluster migration

Project 1: elasticsearch migration scheme

elasticsearch index migration

  1. See the reference documentation: https://www.elastic.co/guide/en/elasticsearch/reference/7.15/docs-reindex.html

  2. On the destination cluster, add the ES clusters to be migrated to the remote reindex whitelist in the configuration file elasticsearch.yml:

    reindex.remote.whitelist: "otherhost:9200, another:9200, 127.0.10.*:9200, localhost:*"
    
  3. Run _reindex from the Kibana Dev Tools console, or use curl.

    #If running from the Kibana console and the amount of data is large, run the reindex asynchronously so the request does not time out:
    POST _reindex?wait_for_completion=false
    {Same request body as the curl example below}
    
    
    #Using the curl command
    curl -X POST "localhost:9200/_reindex?pretty" -H 'Content-Type: application/json' -d'
    {
      "source": {
        "remote": {
          "host": "http://otherhost:9200",
          "username": "user",
          "password": "pass",
          "socket_timeout": "1m",
          "connect_timeout": "10s"
        },
        "index": "my-index-000001",
        "size": 10,
        "query": {
          "match": {
            "test": "data"
          }
        }
      },
      "dest": {
        "index": "my-new-index-000001"
      }
    }
    '
    
    size 10    #Reindexing from a remote server uses an on-heap buffer with a default maximum size of 100mb. If the remote index contains very large documents, you will need to use a smaller batch size. The example above sets the batch size to 10, which is very, very small.
    socket_timeout / connect_timeout    #Set inside the remote block, socket_timeout sets the socket read timeout on the remote connection and connect_timeout sets the connection timeout. Both default to 30 seconds. The example above sets the socket read timeout to one minute and the connection timeout to 10 seconds.
    query    #Only documents matching the query are copied to the destination index; omit query to copy the whole index.
    
  4. Use GET _tasks/<task_id> with the task ID returned by the asynchronous request to check how the reindex is progressing.
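
    A minimal sketch of this step (the task ID below is a made-up placeholder; the asynchronous request returns the real task ID in its "task" field):

    #Check progress of the running reindex
    GET _tasks/oTUltX4IQMOUUVeiohTt8A:12345

    #A running reindex can also be cancelled by its task ID
    POST _tasks/oTUltX4IQMOUUVeiohTt8A:12345/_cancel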

Improving reindex efficiency:

Reindex supports sliced scroll to parallelize the reindexing process. This parallelization improves efficiency and provides a convenient way to break the request into smaller parts (see the sketch after the list below).

How slicing works (from medcl)

1) Isn't the Scroll interface slow? With a large amount of data, traversing it with a plain Scroll is indeed unacceptably slow, but the Scroll interface can now traverse the data concurrently.
2) Each Scroll request can be divided into multiple Slice requests, which can be understood as slices. Each Slice is independent and runs in parallel, so reindexing or traversing with Scroll becomes many times faster.
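
A minimal sketch of a sliced reindex (index names are placeholders). Note that slicing applies to reindexing between local indexes; reindex from a remote cluster does not support manual or automatic slicing:

    #Parallelize a local reindex across 5 slices; slices=auto lets ES pick one slice per shard
    POST _reindex?slices=5&refresh
    {
      "source": {
        "index": "my-index-000001"
      },
      "dest": {
        "index": "my-new-index-000001"
      }
    }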

elasticsearch full data migration

Copy the data directory configured by the ES path.data setting on the source node directly to the target ES node.
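
A minimal sketch of this approach, assuming both clusters run the same Elasticsearch version and the directory paths below are placeholders for the configured path.data:

    #Stop the source node first so the files on disk are consistent
    systemctl stop elasticsearch

    #Copy the path.data directory to the target node (paths are placeholders)
    rsync -av /var/lib/elasticsearch/ target-host:/var/lib/elasticsearch/

    #Start Elasticsearch on the target node and check that the indexes are visible
    systemctl start elasticsearch
    curl -X GET "localhost:9200/_cat/indices?v"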

Project 2: elasticsearch write performance optimization

  1. Increase refresh interval

    The default refresh interval is 1s and can be changed with the index.refresh_interval setting. By default, every second ES writes the data buffered in memory into a new segment, which is why a document becomes searchable about 1s after it is written. If we increase this interval, for example to 30s, and can accept that written data only becomes visible after 30s, we gain much higher write throughput, because writes stay buffered in memory for 30s and a new segment is created only every 30s.
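
    For example (the index name is a placeholder), the interval can be changed dynamically through the index settings API and restored afterwards:

    #Raise the refresh interval on an existing index
    PUT /my-index-000001/_settings
    {
      "index": {
        "refresh_interval": "30s"
      }
    }

    #Restore the default once the heavy writing is finished
    PUT /my-index-000001/_settings
    {
      "index": {
        "refresh_interval": "1s"
      }
    }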

  2. Increase the index buffer

    For very heavy, highly concurrent write workloads it is best to increase the index buffer by raising indices.memory.index_buffer_size. The configured buffer is shared by all shards on the node, so dividing it by the number of shards gives the average memory available to each shard; the general recommendation is at most 512mb per shard, because a larger buffer no longer improves performance. ES uses this setting as the index buffer shared by every shard, and particularly active shards use more of it. The default is 10% of the JVM heap, so if the JVM heap is 10gb the index buffer is 1gb, which is enough for two shards to share.
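
    A minimal sketch of the corresponding node-level setting in elasticsearch.yml (the value is only an example; this is a static setting, so the node must be restarted for it to take effect):

    #Share 20% of the JVM heap across the active shards on this node (default is 10%)
    indices.memory.index_buffer_size: 20%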

Reference: https://blog.csdn.net/lm324114/article/details/105028701/
