ClickHouse-MySQL data synchronization

ClickHouse version: 22.1.2.2

1. MySQL engine (not recommended)

CREATE DATABASE [IF NOT EXISTS] db_name [ON CLUSTER cluster]
ENGINE = MySQL('host:port', ['database' | database], 'user', 'password')

Engine parameters

  • host:port - MySQL server address
  • database - MySQL database name
  • user - MySQL user name
  • password - MySQL user password

It is hard not to complain about this engine: the MySQL engine simply maps the tables of a remote MySQL server into ClickHouse. The MySQL database engine translates each query into MySQL syntax and sends it to the MySQL server, so no data is actually stored on the ClickHouse side.
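A minimal sketch of what this looks like; the connection details and table name are placeholders:

-- Map a remote MySQL database; no data is copied into ClickHouse.
CREATE DATABASE mysql_proxy ENGINE = MySQL('127.0.0.1:3306', 'test', 'root', 'root');

-- This query is translated to MySQL syntax and executed on the MySQL server.
SELECT * FROM mysql_proxy.some_table LIMIT 10;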

2. MaterializedMySQL (not recommended)

This engine creates a ClickHouse database containing all the tables of a MySQL database and all of their data. The ClickHouse server acts as a MySQL replica: it reads the binlog and executes the corresponding DDL and DML queries.

CREATE DATABASE [IF NOT EXISTS] db_name [ON CLUSTER cluster]
ENGINE = MaterializedMySQL('host:port', ['database' | database], 'user', 'password') [SETTINGS ...]

Engine parameters

  • host:port - MySQL server address
  • database - MySQL database name
  • user - MySQL user name
  • password - MySQL user password

Engine configuration

  • max_rows_in_buffer - maximum number of rows a single table may cache in memory (cached data cannot be queried yet). When the row count is exceeded, the data is materialized. Default: 65505.
  • max_bytes_in_buffer - maximum number of bytes a single table may cache in memory. When the byte count is exceeded, the data is materialized. Default: 1048576.
  • max_rows_in_buffers - maximum number of rows the whole database may cache in memory. When the row count is exceeded, the data is materialized. Default: 65505.
  • max_bytes_in_buffers - maximum number of bytes the whole database may cache in memory. When the byte count is exceeded, the data is materialized. Default: 1048576.
  • max_flush_data_time - maximum number of milliseconds data may stay cached in memory (per database). When this time is exceeded, the data is materialized. Default: 1000.
  • max_wait_time_when_mysql_unavailable - retry interval (in milliseconds) when MySQL is unavailable; a negative value disables retries. Default: 1000.
  • allows_query_when_mysql_lost - allow querying materialized tables when the MySQL connection is lost. Default: 0 (false).

Example:

CREATE DATABASE mysql ENGINE = MaterializedMySQL('localhost:3306', 'db', 'user', '***')
     SETTINGS 
        allows_query_when_mysql_lost=true,
        max_wait_time_when_mysql_unavailable=10000;

Usage steps:

1. Modify the MySQL configuration and restart

gtid_mode=ON
enforce_gtid_consistency=1
binlog_format=ROW
log_bin=mysql-bin
expire_logs_days=7
max_binlog_size = 256m
binlog_cache_size = 4m
max_binlog_cache_size = 512m
server-id=1
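After the restart, a quick sanity check from the MySQL client (standard MySQL statements):

-- Both should report the values configured above
SHOW VARIABLES LIKE 'gtid_mode';
SHOW VARIABLES LIKE 'binlog_format';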

2. Run the following commands in ClickHouse

set allow_experimental_database_materialized_mysql = 1;
CREATE DATABASE test ENGINE = MaterializedMySQL('ip:port', 'database', 'user', 'password');
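If creation succeeds, the replicated tables show up under the new database and can be queried directly; some_table below is a placeholder:

SHOW TABLES FROM test;
SELECT * FROM test.some_table LIMIT 10;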

Issues:

1. This engine greatly simplifies importing MySQL data into ClickHouse, but it is officially still experimental and should not be used in production

2. In my tests, the efficiency with large data volumes is still not high, only somewhat better than MySQL itself

3. There are many limitations when it is used in a cluster

3. mysql table function

ClickHouse lets you run SELECT and INSERT queries against data stored on a remote MySQL server through the mysql table function:

mysql('host:port', 'database', 'table', 'user', 'password'[, replace_query, 'on_duplicate_clause']);

Parameters

  • host:port - MySQL server address
  • database - remote database name
  • table - remote table name
  • user - MySQL user
  • password - user password
  • replace_query - flag that converts the INSERT INTO query into REPLACE INTO. If replace_query=1, the query is replaced.
  • on_duplicate_clause - adds an ON DUPLICATE KEY on_duplicate_clause expression to the INSERT query. It can only be specified with replace_query = 0; if you set replace_query = 1 and on_duplicate_clause at the same time, ClickHouse throws an exception. See the sketch after this list.
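A sketch of the optional parameters, assuming a placeholder MySQL table t(c1, c2) with a unique key on c1; the INSERT is written through to MySQL:

-- replace_query = 0 plus an ON DUPLICATE KEY expression:
INSERT INTO FUNCTION mysql('127.0.0.1:3306', 'db', 't', 'user', 'password', 0, 'UPDATE c2 = c2 + 1')
    (c1, c2) VALUES ('a', 2);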

Importing all MySQL data into ClickHouse

1. Create a ClickHouse table

2. Execute the command

INSERT INTO table SELECT * FROM mysql('ip:port', 'database', 'table', 'user', 'password') [where]

Note:

1. If * is used directly here, the column order in ClickHouse must match the column order in MySQL, otherwise there will be problems. If only some fields are required, list them explicitly.

2. A query condition (WHERE) lets you import only the data you need.

3. Duplicate data may appear in ClickHouse, even with ReplacingMergeTree. A worked sketch follows below.
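A complete sketch of such an import; table and column names are placeholders:

-- 1. Target table in ClickHouse (columns listed explicitly to avoid order problems)
CREATE TABLE ck_orders (id UInt64, amount Float64, created DateTime)
ENGINE = MergeTree ORDER BY id;

-- 2. Pull only the needed columns, with an optional filter
INSERT INTO ck_orders (id, amount, created)
SELECT id, amount, created
FROM mysql('127.0.0.1:3306', 'shop', 'orders', 'root', 'root')
WHERE created >= '2022-01-01 00:00:00';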

4. Canal synchronization

1. Prepare the jar packages (to avoid known pitfalls, use these exact versions from https://mvnrepository.com/):

clickhouse-jdbc-0.2.6.jar
httpclient-4.5.13.jar
httpcore-4.4.13.jar
lz4-1.5.0.jar
lz4-java-1.8.0.jar
druid-1.2.8.jar (the version must not be too low, otherwise the Druid connection pool does not support ClickHouse)

2. Install canal-server with Docker

docker run --name canal-server -p 11111:11111 -d canal/canal-server:latest

3. Enter the container

Copy /home/admin/canal-server/conf/example to a directory named after the new database instance you want to create; example2 is used here, as shown below.
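For example (container name and paths as above):

docker exec -it canal-server bash
cp -r /home/admin/canal-server/conf/example /home/admin/canal-server/conf/example2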

4. Modify the instance.properties configuration under example2

# MySQL address
canal.instance.master.address=127.0.0.1:3306
# MySQL credentials
canal.instance.dbUsername=root
canal.instance.dbPassword=root
# Filtering rules (all tables in the example2 database)
canal.instance.filter.regex=example2\\..*

5. Modify the /home/admin/canal-server/conf/canal.properties configuration

canal.serverMode = tcp
canal.destinations = example2

6. Install canal-adapter (you can also use Kafka)

docker run --name canal-adapter -p 8081:8081 -d slpcat/canal-adapter:v1.1.5

7. Modify canal-adapter's /opt/canal-adapter/conf/application.yml configuration file (inside the container)

canal.conf:
  mode: tcp # note the mode here
  flatMessage: true
  zookeeperHosts:
  syncBatchSize: 1000
  retries: 0
  timeout:
  accessKey:
  secretKey:
  consumerProperties:
    # address of the canal-server
    canal.tcp.server.host: 127.0.0.1:11111
    canal.tcp.zookeeper.hosts:
    canal.tcp.batch.size: 500
    canal.tcp.username:
    canal.tcp.password:

  srcDataSources:
    # source MySQL connection
    defaultDS:
      url: jdbc:mysql://127.0.0.1:3306/example2?useUnicode=true
      username: root
      password: root
  canalAdapters:
  - instance: example2
    groups:
    - groupId: g1
      outerAdapters:
      - name: logger
      - name: rdb
        key: mysql1
        # ClickHouse connection
        properties:
          jdbc.driverClassName: ru.yandex.clickhouse.ClickHouseDriver
          jdbc.url: jdbc:clickhouse://127.0.0.1:8123
          jdbc.username: default
          jdbc.password: default

8. Modify mytest_user.yml under /opt/canal-adapter/conf/rdb

dataSourceKey: defaultDS
destination: example2
groupId: g1
outerAdapterKey: mysql1  # must match the key of the rdb adapter in application.yml
concurrent: true
dbMapping:
  mirrorDb: true
  database: example2

9. Restart the server and adapter

Insert data into MySQL and check whether it is synchronized to ClickHouse. If not, check the logs to locate the problem; see the commands below.
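For example, with the container names used above:

docker restart canal-server canal-adapter
# If rows do not show up in ClickHouse, the adapter log usually says why
docker logs -f canal-adapter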

Attention

1. The MySQL binlog/GTID configuration shown in section 2 (step 1 of the usage steps) must be enabled
2. The MySQL and ClickHouse databases and tables must be created in advance
3. When using 127.0.0.1 in the Docker configuration, run the containers with host networking, otherwise they cannot reach each other
4. If MySQL or ClickHouse cannot be connected, check that the driver version and the JDBC connection string are correct

5. Other options: Red Hat's Debezium
