ClickHouse Engine Introduction

MergeTree Family (LSM tree data structure)

Replicated + MergeTree = ReplicatedMergeTree
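
A minimal sketch of a replicated table follows; the ZooKeeper path and the {shard}/{replica} macros are placeholders for the actual cluster configuration:

-- ReplicatedMergeTree sketch (illustrative names and placeholder macros)
CREATE TABLE events_replicated
(
    event_date Date,
    event_id   UInt64
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events_replicated', '{replica}')
  ORDER BY (event_id, event_date);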

  1. MergeTree: the base engine of the family; supports partitioning, a sparse primary key index, data sampling, and TTL.

  2. ReplacingMergeTree: deduplicates rows with the same sorting key during background merges (see the sketch after this list).

  3. SummingMergeTree: when parts are merged, rows with the same sorting key are combined into one row; each configured aggregation column holds the sum of the merged values, while non-aggregation columns take the value of the first row. Aggregation columns must be numeric.

  4. AggregatingMergeTree: aggregates rows with the same sorting key within the same data partition.

  5. CollapsingMergeTree: collapses and merges rows with the same sorting key within the same data partition.

  6. VersionedCollapsingMergeTree: builds on CollapsingMergeTree by adding a version column. Data is sorted according to ORDER BY; if the version column is not already part of the sorting key, it is implicitly appended as the last ORDER BY column, which affects the sort order.

  7. GraphiteMergeTree: used to store and roll up data from the Graphite time-series database.
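
As an illustration of the family, a hypothetical ReplacingMergeTree table is sketched below; the table, columns, and optional version column (updated_at) are made-up names. Deduplication only happens when background merges run, so OPTIMIZE ... FINAL can be used to force it:

-- ReplacingMergeTree sketch (illustrative names)
CREATE TABLE user_profile
(
    user_id    UInt64,
    name       String,
    updated_at DateTime
) ENGINE = ReplacingMergeTree(updated_at)
  ORDER BY user_id;

-- Duplicates sharing the same user_id are only removed during merges; force one:
OPTIMIZE TABLE user_profile FINAL;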

Log Engines

The Log family of table engines is relatively simple in function. It is mainly used to write small tables (roughly one million rows or fewer) quickly and then read them back out.

The common features of several Log table engines are:

  • Data is appended to disk sequentially;
  • DELETE and UPDATE are not supported;
  • Indexes are not supported;
  • Writes are not atomic;
  • INSERT blocks SELECT operations.

  1. TinyLog: does not support concurrent reads of data files, and query performance is poor; its format is simple, making it suitable for temporarily storing intermediate data;

  2. Log: supports concurrent reads of data files, and query performance is better than TinyLog; each column is stored in a separate file. A minimal example follows.
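
A minimal sketch of a Log table for intermediate data (illustrative names):

-- Log engine sketch
CREATE TABLE tmp_import
(
    id  UInt32,
    raw String
) ENGINE = Log;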

Integration Engines

These table engines are mainly used to import external data into ClickHouse, or to operate directly on external data sources from within ClickHouse.

  1. Kafka: imports data from a Kafka topic directly into ClickHouse;
  2. MySQL: maps a MySQL table so that it can be queried directly from ClickHouse (a sketch follows this list);
  3. JDBC/ODBC: reads external data sources through JDBC/ODBC connection strings;
  4. HDFS: directly reads data files in a specific format from HDFS;
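
A sketch of the MySQL integration engine; the host, database, table, and credentials below are placeholders for an existing MySQL instance:

-- MySQL engine sketch (placeholder connection details)
CREATE TABLE mysql_orders
(
    id     UInt32,
    amount Float64
) ENGINE = MySQL('mysql-host:3306', 'shop', 'orders', 'user', 'password');

-- The query is executed against the remote MySQL table
SELECT count() FROM mysql_orders;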

Special Engines

  1. Memory: stores data in memory, so the data is lost after a restart. Query performance is excellent, making it suitable for small tables (up to roughly 100 million rows) that do not require persistence. In ClickHouse it is usually used as a temporary table.

  2. Buffer: keeps a memory buffer in front of a target table; when the buffer meets the configured conditions, it is flushed to the target table on disk.

  3. File: uses a local file directly as the table's data;

  4. Null: written data is discarded and reads return an empty result;

  5. Distributed: the Distributed engine does not store data itself; it is used for distributed writes and queries and works together with other engines, for example Distributed + MergeTree, executing queries across shards in parallel, as sketched below.
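
A sketch of the Distributed + MergeTree combination: a local MergeTree table on every shard plus a Distributed table that fans writes and queries out over the cluster. The cluster name my_cluster and the sharding key rand() are assumptions about the deployment:

-- Distributed + MergeTree sketch (my_cluster is a placeholder cluster name)
CREATE TABLE events_local ON CLUSTER my_cluster
(
    event_date Date,
    user_id    UInt64
) ENGINE = MergeTree()
  ORDER BY (user_id, event_date);

CREATE TABLE events_all ON CLUSTER my_cluster AS events_local
ENGINE = Distributed(my_cluster, default, events_local, rand());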

-- Table creation template
CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster] 
(
      name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1] [TTL expr1], 
      name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2] [TTL expr2], 
      ... 
) ENGINE = MergeTree()
  ORDER BY expr 
  [PARTITION BY expr] 
  [PRIMARY KEY expr] 
  [SAMPLE BY expr] 
  [TTL expr [DELETE|TO DISK 'xxx'|TO VOLUME 'xxx'], ...] 
  [SETTINGS name=value, ...]

-- Example
CREATE TABLE test (
    id UInt16,
    create_time Date,
    title Nullable(String)
) ENGINE = MergeTree()
  PARTITION BY create_time
  ORDER BY (id, create_time)
  PRIMARY KEY (id, create_time)
  TTL create_time + INTERVAL 1 MONTH
  SETTINGS index_granularity=8192;
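
-- Illustrative usage of the table above: each distinct create_time value
-- gets its own partition, which can be inspected through system.parts.
INSERT INTO test VALUES (1, '2022-01-01', 'first'), (2, '2022-01-02', 'second');

SELECT partition, name, rows
FROM system.parts
WHERE table = 'test' AND active;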
   
-- Required options
# ENGINE: 
   Engine name. The MergeTree engine itself takes no parameters.
# ORDER BY: 
   Sorting key, composed of one or more columns; it determines how data is sorted, for example ORDER BY (CounterID, EventDate). If PRIMARY KEY is not explicitly specified, ORDER BY is also used as the PRIMARY KEY, so usually specifying ORDER BY alone is enough.

-- Optional options
# PARTITION BY: 
   Partition key, defining the rule by which data in the table is partitioned. A partition is a logical set of data split off by a specified rule, for example by month, by day, or by event type. Each partition is stored separately in order to reduce the amount of data that has to be scanned.
# PRIMARY KEY: 
   Primary key. When set, a primary index (primary.idx) is generated from it and the data is sorted according to the index settings, which speeds up queries. By default PRIMARY KEY is the same as ORDER BY, so usually only ORDER BY is specified and it also acts as the primary key.
# SAMPLE BY: 
   Data sampling expression. If this option is explicitly configured, the sampling expression must also be contained in the primary key, for example ORDER BY (CounterID, EventDate, intHash32(UserID)) with SAMPLE BY intHash32(UserID). See the combined example at the end of this section.
# TTL: 
   Data lifetime. TTL can be set on a single column or on the whole table, and the expression must involve a Date or DateTime column. If set on a column, expired values in that column are removed; if set at table level, expired rows are deleted from the table. If both are set, whichever expires first takes effect. For example, TTL create_time + INTERVAL 1 DAY means the data expires after one day. Typical use cases are periodically deleting or archiving data.
# index_granularity: 
   Index interval granularity. The MergeTree index is sparse: one index entry is created for every index_granularity rows. The default is 8192.
# enable_mixed_granularity_parts: 
   Enables index_granularity_bytes to control index granularity.
# index_granularity_bytes: 
   Index granularity in bytes. The default value is 10 MB.
# merge_max_block_size: 
   Maximum number of rows per block when merging; 8192 by default.
# merge_with_ttl_timeout: 
   Minimum interval between merges that apply TTL; 1 day by default.
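
A sketch combining the optional clauses above; all names are illustrative. PRIMARY KEY is omitted, so ORDER BY also serves as the primary key; the SAMPLE BY expression is part of the sorting key, and both a column-level and a table-level TTL are set:

-- Optional clauses combined (illustrative names)
CREATE TABLE hits_sample
(
    CounterID UInt32,
    EventDate Date,
    UserID    UInt64,
    Referer   String TTL EventDate + INTERVAL 1 MONTH
) ENGINE = MergeTree()
  PARTITION BY toYYYYMM(EventDate)
  ORDER BY (CounterID, EventDate, intHash32(UserID))
  SAMPLE BY intHash32(UserID)
  TTL EventDate + INTERVAL 3 MONTH
  SETTINGS index_granularity = 8192;

-- Approximate query over roughly 10% of the data
SELECT count() FROM hits_sample SAMPLE 0.1;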
