Elasticsearch cluster data migration

There are four methods for ES cluster data migration:
elasticsearch-dump
reindex
snapshot
logstash

1, elasticsearch-dump

  1. Applicable scenario
    Suitable for scenarios with a small amount of data and few indexes to migrate
  2. Mode of use
    elasticsearch-dump is an open source ES data migration tool. GitHub address: https://github.com/taskrabbit/elasticsearch-dump
  3. Installing elasticsearch-dump
    elasticsearch-dump is developed in Node.js and can be installed directly with the npm package manager
#Download package (extract and use)
cd /data/
wget https://npm.taobao.org/mirrors/node/v13.8.0/node-v13.8.0-linux-x64.tar.gz
tar -xf node-v13.8.0-linux-x64.tar.gz

#Configure node npm variable and take effect
vim /etc/profile.d/node.sh
export NODE_HOME=/data/node-v13.8.0-linux-x64
export PATH=$PATH:$NODE_HOME/bin 

#Verify that the variable is available
npm -v

#Install elasticsearch dump
npm install elasticdump -g
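After installation, it is worth checking that both node and elasticdump are on the PATH before writing any migration scripts; a quick sanity check (your version numbers will differ):

```shell
# Load the profile written above so NODE_HOME/bin is on the PATH
source /etc/profile.d/node.sh

# Both commands should print a version string if the install succeeded
node -v
elasticdump --version
```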
  4. Description of main parameters
    --input: source address; can be an ES cluster URL, a file, or stdin. An index can be specified in the format {protocol}://{host}:{port}/{index}
    --input-index: index in the source ES cluster
    --output: destination address; can be an ES cluster URL, a file, or stdout. An index can be specified in the format {protocol}://{host}:{port}/{index}
    --output-index: index in the target ES cluster
    --type: migration type; defaults to data, meaning only data is migrated. Options: settings, analyzer, data, mapping, alias
    --limit: number of documents written to the target ES cluster per batch; do not set it too large, to avoid filling the bulk queue
  5. Migrate a single index
    The following commands migrate a single index from the source cluster to the target cluster with elasticdump. Note that the first command migrates the index settings first: if you migrate the mapping or data directly, the index configuration in the source cluster, such as the number of shards and replicas, will be lost. Alternatively, you can create the index in the target cluster first and then synchronize the mapping and data directly
#Write a migration script (or execute the commands one by one)
vim /data/index
    #Copy the analyzer (word segmentation) configuration
    elasticdump \
      --input=http://<source-host>:9200/<index> \
      --output=http://<target-host>:9200/<index> \
      --type=analyzer \
      --limit=1000   #Batch size per pull
    #Copy the mapping
    elasticdump \
      --input=http://<source-host>:9200/<index> \
      --output=http://<target-host>:9200/<index> \
      --type=mapping \
      --limit=1000
    #Copy the data
    elasticdump \
      --input=http://<source-host>:9200/<index> \
      --output=http://<target-host>:9200/<index> \
      --type=data \
      --limit=1000

#Run the script in the background and record the log (the log shows migration progress)
nohup sh /data/index | tee index.txt > /dev/null &
  6. Migrate all indexes
    The following command migrates all indexes in the source cluster to the target cluster with elasticdump. Note that this operation cannot migrate index configuration such as the number of shards and replicas; each index must be migrated separately for that, or the data can be migrated after the indexes have been created in the target cluster
elasticdump --input=http://<source-host>:9200 --output=http://<target-host>:9200

!!! Note that if an index has special requirements or structure, you need to create it in the target cluster in advance, because elasticsearch-dump alone will not synchronize all structure types and shard counts; it will only create an index with one shard and one replica
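To avoid the default one-shard/one-replica index mentioned in the note above, the target index can be created explicitly before the data migration. A minimal sketch (host, index name, and shard counts are placeholders to adjust):

```shell
# Pre-create the target index with explicit shard/replica counts
# so the migration tool does not fall back to the defaults
curl -XPUT "http://<target-host>:9200/<index>" \
  -H 'Content-Type: application/json' -d '{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}'
```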

2, reindex

reindex is an API provided by Elasticsearch. It imports data from a source ES cluster into the current ES cluster, which also achieves data migration. By configuring a whitelist, the target cluster is allowed to connect to the source cluster. It can be used for indexes and clusters with a large amount of data

  1. Configure the reindex.remote.whitelist parameter (in elasticsearch.yml)
    This parameter only needs to be configured in the target ES cluster; it lists the remote clusters that may be used as reindex sources. If both sides need to pull from each other, configure the whitelist on both sides
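As a sketch, the whitelist entry can be appended to the configuration file on every node of the target cluster (the file path assumes a package install; the host is a placeholder). The nodes must be restarted for the setting to take effect:

```shell
# Allow reindex-from-remote from the source cluster; run on every
# node of the target cluster, then restart the nodes
echo 'reindex.remote.whitelist: "<source-host>:9200"' \
  >> /etc/elasticsearch/elasticsearch.yml
```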
  2. Call the reindex API (full synchronization)
    The following request queries documents from the index named test1 in the source ES cluster, matching documents whose title field contains elasticsearch, and writes the results into the test2 index of the current cluster. Remove the query clause to synchronize all data in the index
POST _reindex
{
  "source": {
    "remote": {
      "host": "http://<source-host>:9200"
    },
    "index": "test1",
    "size": 30,                          #Batch size per pull; tune according to document size
    "query": {                           #Optional; remove to synchronize all data
      "match": {
        "title": "elasticsearch"
      }
    }
  },
  "dest": {
    "index": "test2"
  }
}
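For large indexes it can help to run the reindex asynchronously and poll its progress through the task API. A sketch (hosts and the task id are placeholders):

```shell
# Start the reindex in the background; the response contains a task id
curl -XPOST "http://<target-host>:9200/_reindex?wait_for_completion=false" \
  -H 'Content-Type: application/json' -d '{
  "source": { "remote": { "host": "http://<source-host>:9200" }, "index": "test1" },
  "dest":   { "index": "test2" }
}'

# Poll progress with the task id returned above
curl "http://<target-host>:9200/_tasks/<task-id>"
```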
  3. Call the reindex API (incremental synchronization)
    This can be used for near-real-time synchronization. It relies on a field in the index data that distinguishes new records, such as a time field. With it, a live business can complete the migration in a short window and does not have to stop the service for too long because of a large data volume
POST _reindex
{
  "source": {
    "remote": {
      "host": "http://<source-host>:9200"
    },
    "index": "index1",
    "size": 30,
    "query": {
      "range": {
        "c_update_dt": {                 #Update-time field used to distinguish full from incremental synchronization
          "gte": "2021-10-19T20:30:00.000+08:00"
        }
      }
    }
  },
  "dest": {
    "index": "recruitment_search_v4"
  }
}

!!! Note that if an index has special requirements or structure, it is necessary to create it in the target cluster in advance, because reindex alone will not synchronize all structure types and shard counts; it will only create an index with one shard and one replica

3, snapshot

  1. Applicable scenario
    Suitable for scenarios with a large amount of data
  2. Mode of use
    The snapshot API is a set of APIs used by Elasticsearch for data backup and recovery. Cross-cluster data migration can be carried out through the snapshot API: create a data snapshot in the source ES cluster, then restore it in the target ES cluster. Note the ES versions:
The major version number of the target ES cluster (e.g. the 5 in 5.6.4) must be greater than or equal to the major version number of the source ES cluster;
a snapshot created in a 1.x cluster cannot be restored in a 5.x cluster.
  3. Create a repository in the source ES cluster
    A repository must be created before creating snapshots; one repository can contain multiple snapshot files. There are several repository types:
fs: shared file system; snapshot files are stored on the file system
url: specifies a URL path to the file system; supported protocols: http, https, ftp, file, jar
s3: AWS S3 object storage; snapshots are stored in S3 (supported as a plugin)
hdfs: snapshots are stored in HDFS (supported as a plugin)
cos: snapshots are stored in COS object storage (supported as a plugin)

If you need to migrate from a self-built ES cluster to a cloud ES cluster, note that the repository path must be set with path.repo in the elasticsearch.yml configuration file:

path.repo: ["/usr/local/services/test"]

After that, call snapshot api to create repository:

curl -XPUT http://<source-host>:9200/_snapshot/my_backup -H 'Content-Type: application/json' -d '{
	"type": "fs",     #Repository storage type
	"settings": {
		"location": "/usr/local/services/test",   #Self-built ES needs the location path
		"compress": true
	}
}'
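Before taking a snapshot, it is worth confirming the repository was registered and that every node can actually write to it; the snapshot API provides a verification endpoint for this (host is a placeholder):

```shell
# Show the registered repository and its settings
curl "http://<source-host>:9200/_snapshot/my_backup"

# Ask Elasticsearch to verify that all nodes can access the
# repository location; errors here usually mean path.repo is
# missing or the directory is not shared across nodes
curl -XPOST "http://<source-host>:9200/_snapshot/my_backup/_verify"
```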

If you need to migrate from one cloud vendor's ES cluster to another, or between ES clusters within the same cloud, you can use the repository type provided by the corresponding cloud vendor, such as AWS S3, Alibaba Cloud OSS, or Tencent Cloud COS

curl -XPUT http://<source-host>:9200/_snapshot/my_s3_repository -H 'Content-Type: application/json' -d '{
	"type": "s3",
	"settings": {
		"bucket": "my_bucket_name",     #Cloud ES needs the bucket name
		"region": "us-west"
	}
}'
  4. Create a snapshot in the source ES cluster
    Call the snapshot API to create a snapshot in the repository created above
curl -XPUT "http://<source-host>:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true"
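If you drop wait_for_completion=true for a large snapshot, the call returns immediately and you can poll the snapshot instead (host is a placeholder):

```shell
# Overall state of the snapshot; SUCCESS means it has finished
curl "http://<source-host>:9200/_snapshot/my_backup/snapshot_1"

# Per-shard progress details while the snapshot is running
curl "http://<source-host>:9200/_snapshot/my_backup/snapshot_1/_status"
```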
  5. Create a repository in the target ES cluster
    Creating a repository in the target ES cluster is the same as in the source ES cluster: specify the repository type, storage location, and so on
  6. Move the snapshot of the source ES cluster to the repository of the target ES cluster
    Upload the snapshot created by the source ES cluster to the repository created in the target ES cluster
  7. Restore from the snapshot
curl -XPOST http://<target-host>:9200/_snapshot/my_backup/snapshot_1/_restore
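For an fs repository this "upload" step is a plain file copy; a sketch using rsync (the paths and hosts are assumptions and should match the path.repo values configured on both sides):

```shell
# Copy the entire repository directory from a source node to the
# repository path configured on the target cluster's nodes
rsync -avz /usr/local/services/test/ \
  <target-host>:/usr/local/services/test/
```

After the copy, register a repository on the target cluster pointing at this path and the target cluster will see the snapshots in it.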

#Check snapshot recovery status
curl http://<target-host>:9200/_snapshot/_status

4, logstash

Logstash supports reading data from one ES cluster and writing it to another, so it can also be used for data migration. The configuration file is as follows:

input {
	elasticsearch {
		hosts => ["http://<source-host>:9200"]
		index => "*"
		docinfo => true
	}
}
output {
	elasticsearch {
		hosts => ["http://<target-host>:9200"]
		index => "%{[@metadata][_index]}"
	}
}

The above configuration file synchronizes all indexes of the source ES cluster to the target cluster; you can also configure it to synchronize only specified indexes. For more Logstash features, refer to the official Logstash documentation
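Assuming the configuration above is saved as migrate.conf (the file name and the Logstash install path are placeholders), the migration is started by running the pipeline once:

```shell
# Run Logstash with the migration pipeline; the elasticsearch input
# exits after it has read all matching documents, so this is a
# one-shot migration rather than a long-running service
/usr/share/logstash/bin/logstash -f migrate.conf
```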

5, Summary

  1. When elasticsearch-dump or Logstash performs cross-cluster data migration, the machine executing the migration task must be able to reach both clusters at the same time; migration is impossible if the networks cannot connect. The snapshot method has no such restriction, because it is completely offline. Therefore, elasticsearch-dump and Logstash are more suitable when the source and target ES clusters are on the same network. For migration across cloud vendors, for example from an Alibaba Cloud ES cluster to a Tencent Cloud ES cluster, you can use the snapshot method; you could also connect the clusters over the network, but the cost is high.
  2. The elasticsearch-dump tool is similar to mysqldump, the MySQL data backup tool: both are logical backups that export data record by record before importing, so it is suitable for migration in scenarios with a small amount of data.
  3. The snapshot method is suitable for migration in scenarios with a large amount of data.

Keywords: ElasticSearch

Added by animuson on Wed, 15 Dec 2021 13:03:39 +0200