Zhang's notes - clickhouse

1, ClickHouse - Introduction

ClickHouse is a column-oriented database management system (DBMS) for online analytical processing (OLAP), open-sourced by Russia's Yandex in 2016. It is mainly used for online analytical query processing and can generate analytical data reports in real time from SQL queries. The full name is Click Stream, Data WareHouse, abbreviated to ClickHouse.

1. Advantages

  1. Flexible MPP architecture; supports linear scaling; simple, convenient and highly reliable

  2. Distributed data processing across multiple servers; a complete DBMS

  3. Columnar storage at the bottom layer with compression; optimized data storage, index data and underlying storage

  4. Fast and fault-tolerant: roughly 5x faster than Vertica, 279x faster than Hive and 800x faster than MySQL; it can process data at the billion-row scale

  5. Versatile: supports data statistics and analysis for various scenarios, SQL queries, and remote replication deployment

    Massive data storage, distributed computing, lightning-fast performance, near real-time data analysis, friendly SQL syntax and excellent function support

2. Disadvantages

  1. Transactions and true row-level delete / update (only batch mutations) are not supported
  2. High concurrency is not supported; the officially recommended QPS is about 100. You can raise the connection limit in the configuration file, but only if the server is powerful enough
  3. Secondary indexes are not supported
  4. Not good at multi-table joins; prefer large, wide (denormalized) tables
  5. Metadata management requires human intervention
  6. Try to write in batches of more than 1,000 rows; avoid row-by-row inserts and small-batch insert, update and delete operations (see the sketch below)
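A minimal sketch of the batched-write advice, assuming a hypothetical table tb_events(id UInt32, name String):

-- Hypothetical table, for illustration only
create table tb_events(
id UInt32,
name String
)engine=MergeTree
order by id;
-- Preferred: one multi-row insert (extend to 1000+ rows per batch)
insert into tb_events values (1,'a'),(2,'b'),(3,'c');
-- Avoid: looping single-row inserts such as
-- insert into tb_events values (1,'a');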

2, Installation

sudo yum -y install yum-utils
sudo rpm --import https://repo.clickhouse.tech/CLICKHOUSE-KEY.GPG
sudo yum-config-manager --add-repo https://repo.clickhouse.tech/rpm/clickhouse.repo
sudo yum -y install clickhouse-server clickhouse-client

sudo /etc/init.d/clickhouse-server start   Start service
clickhouse-client   Enter client

Open remote connection

 cd /etc/clickhouse-server  Configuration directory
 vi config.xml  Edit the configuration file
 :set nu   Show line numbers, then around line 156 uncomment the <listen_host> line (or paste in a copy)
 <listen_host>::</listen_host>
 sudo /etc/init.d/clickhouse-server restart Restart service 

3, Common directory

Data storage directory: /var/lib/clickhouse

Server configuration directory: /etc/clickhouse-server

Log directory: /var/log/clickhouse-server

4, Common data types

-- Integers
UInt8 [0-255]
UInt32 [0-4294967295]
-- Floating point
Float32
Float64
-- Higher-precision arithmetic
Decimal32(S)  -- up to 9 significant digits, S digits after the decimal point
-- Strings
String
FixedString(N) -- fixed length
-- Date types
Date 		-- date only
DateTime    -- yyyy-MM-dd HH:mm:ss
DateTime64  -- with sub-second precision
-- UUID type
UUID
-- Enumeration type
Enum('female'=1,'male'=2) -- restricts the values that can be inserted
-- Array, map and tuple
Array() -- array
Map()  -- key-value map
Tuple() -- tuple
-- IPv4 type, validates IP addresses
IPv4
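A small sketch pulling several of these types together; the table name tb_types and its columns are hypothetical:

create table tb_types(
id UInt32,
name String,
gender Enum('female'=1,'male'=2),
tags Array(String),
score Decimal32(2),
ip IPv4,
ctime DateTime
)engine=MergeTree
order by id;
-- One row exercising every column type
insert into tb_types values
(1,'zss','female',['a','b'],99.50,'127.0.0.1','2021-09-27 10:00:00');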

5, View

Normal view

ClickHouse has two kinds of views: normal views and materialized views. A materialized view has its own storage, while a normal view is just a thin query proxy.

CREATE VIEW [IF NOT EXISTS] [db_name.]view_name AS SELECT ... 

An ordinary view stores no data. It is just a thin layer of SELECT query mapping that simplifies queries and clarifies semantics; it does not improve query performance.

create view  test3_view as select id , upper(name) , role from tb_test3 ;
┌─name────────────┐
│ tb_test3        │
│ test3_view      │
│ test_partition1 │
│ test_partition2 │
│ test_rep        │
│ tt1             │
└─────────────────┘
drop view test3_view ;   -- Delete view 

Materialized view

A materialized view supports specifying a table engine; how its data is stored is determined by that engine. The full syntax for creating a materialized view is as follows:

create materialized view mv_log engine=Log populate as select * from log ;

After the materialized view is created, new data written to the source table is synchronized into the materialized view.

-- Creating materialized views specifies that the engine is AggregatingMergeTree()
create materialized view detail_table_view
engine=AggregatingMergeTree() -- Specify engine
PARTITION BY toDate(ctime)  -- Specify partition fields
order by id  -- Specify sort field
populate -- backfill existing data from the source table
as select id,ctime ,sumState(money) as moneys
from detail_table group by id,ctime;
-- Query data
select id ,ctime,sumMerge(moneys)
from detail_table_view 
group by id,ctime;

2,2021-08-07,400
1,2021-08-06,200
-- Insert data into the detail table
INSERT INTO detail_table VALUES(3, '2021-09-01', 500);
INSERT INTO detail_table VALUES(3, '2021-09-01', 500);
INSERT INTO detail_table VALUES(4, '2021-09-01', 400);
INSERT INTO detail_table VALUES(4, '2021-09-01', 400);
-- Query the materialized view again and find that the data is automatically updated synchronously
1,2021-08-06,200
2,2021-08-07,400
3,2021-09-01,1000
4,2021-09-01,800

Note

-- Deleting data from the source table does not change the data in the materialized view
Note: the data-deletion syntax below applies only to MergeTree-family engine tables. The basic syntax is as follows:
ALTER TABLE db_name.table_name DROP PARTITION '20210601'
ALTER TABLE db_name.table_name DELETE WHERE day = '20210618'
ALTER TABLE <table_name> UPDATE col1 = expr1, ... WHERE <filter>

6, Engine

engine: there are database engines and table engines
 Different engines store data in different locations and differ in their characteristics and supported operations
 Database engine: a database can be created without specifying one; a default engine is used

1.Log series engine

The Log family is a set of lightweight engines with minimal functionality. They are most effective when you need to quickly write many small tables (up to about 1 million rows) and later read them back as a whole.

1)TinyLog engine
The simplest table engine; it stores data on disk. Each column is stored in a separate compressed file, and writes append data to the end of the file. The engine has no concurrency control.
-- Build table
create table t_tinylog(
id UInt8,
name String
)engine=TinyLog;
-- insert data
insert into t_tinylog values(1,'Zhang San'),(2,'Li Si');

On disk, every insert appends to the per-column files:

-rw-r-----. 1 clickhouse clickhouse 28 Sep 27 04:04 id.bin
-rw-r-----. 1 clickhouse clickhouse 40 Sep 27 04:04 name.bin
-rw-r-----. 1 clickhouse clickhouse 64 Sep 27 04:04 sizes.json
2)StripeLog engine
data.bin    stores all the data
index.mrk   index marks for the data blocks
sizes.json  data sizes
 Supports concurrent reads and writes
-- Build table 
create table t_stripelog(
id UInt8,
name String
)engine=StripeLog;
-- insert data
insert into t_stripelog values(1,'Zhang San'),(2,'Li Si');

Data underlying storage

-rw-r-----. 1 clickhouse clickhouse 118 Sep 27 04:16 data.bin
-rw-r-----. 1 clickhouse clickhouse  63 Sep 27 04:16 index.mrk
-rw-r-----. 1 clickhouse clickhouse  68 Sep 27 04:16 sizes.json
3)Log engine
*.bin        stores the data of each column
__marks.mrk  data block marks
 Supports multithreading
 Supports concurrent reads and writes
-- Delete table
drop table if exists t_log;
-- Create table
create table t_log(
id UInt8,
name String
)engine=Log;
-- insert data
insert into t_log values(1,'Zhang San'),(2,'Li Si');

Data underlying storage

-rw-r-----. 1 clickhouse clickhouse 28 Sep 27 04:22 id.bin
-rw-r-----. 1 clickhouse clickhouse 32 Sep 27 04:22 __marks.mrk
-rw-r-----. 1 clickhouse clickhouse 40 Sep 27 04:22 name.bin
-rw-r-----. 1 clickhouse clickhouse 94 Sep 27 04:22 sizes.json

2.MergeTree series engine

The MergeTree family of table engines is the core of ClickHouse's data storage. It provides most of the features needed for resilient, high-performance data retrieval: columnar storage, custom partitioning, a sparse primary index, data-skipping (secondary) indexes, and more.
1)Stores data sorted by primary key.
2)If a partition key is specified, partitioning can be used.
3)Supports data replication
The ReplicatedMergeTree family of tables provides data replication.
4)Supports data sampling
 If necessary, a data sampling method can be set on the table.
1)The MergeTree engine does not deduplicate by primary key
When MergeTree writes a batch of data, it is always written to disk as an immutable data part. To keep the number of parts from growing too large, ClickHouse periodically merges them with background threads, combining parts that belong to the same partition into a new part. This repeated merging of data parts is where the name "merge tree" comes from.
1)Be sure to specify the sort key, e.g. order by id  -- it also serves as the default primary key
2)The primary key has no unique constraint, so duplicates are not removed

Syntax

CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster1](
    name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1] [TTL expr1],
    name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2] [TTL expr2],
    ...
INDEX index_name1 expr1 TYPE type1(...) GRANULARITY value1,
INDEX index_name2 expr2 TYPE type2(...) GRANULARITY value2) 
ENGINE = MergeTree() -- Specify engine
ORDER BY expr        -- sort
[PARTITION BY expr]  -- partition
[PRIMARY KEY expr]   -- Primary key
[SAMPLE BY expr]  
[TTL expr [DELETE|TO DISK 'xxx'|TO VOLUME 'xxx'], ...][SETTINGS name=value, ...] -- Expiration time

Create table

-- Delete table
drop table if exists t_merge_tree;
-- Create table
create table t_merge_tree(
id UInt8,
name String,
gender Enum('male'=1,'female'=2,'Simon?'=3)
)engine=MergeTree  -- Specify engine
partition by gender -- Specify partition
order by id ;  -- Specifies that sorting defaults to primary key
-- insert data
insert into t_merge_tree values
(1,'Zhang San','male'),
(2,'Li Si','male'),
(3,'Wang Wu','female'),
(4,'Xiaochao','Simon?');
insert into t_merge_tree values
(5,'Duan Zong','Simon?'),
(6,'Dalang','male'),
(7,'Golden Lotus','female'),
(8,'ximen qing','Simon?');

Merge data

-- Data before consolidation
┌─id─┬─name─┬─gender─┐
│  4 │ Xiaochao │ Simon?   │
└────┴──────┴────────┘
┌─id─┬─name───┬─gender─┐
│  5 │ Duan Zong   │ Simon?   │
│  8 │ ximen qing │ Simon?   │
└────┴────────┴────────┘
-- Merge data
optimize table t_merge_tree final;
-- Consolidated data
┌─id─┬─name───┬─gender─┐
│  4 │ Xiaochao   │ Simon?   │
│  5 │ Duan Zong   │ Simon?   │
│  8 │ ximen qing │ Simon?   │
└────┴────────┴────────┘
┌─id─┬─name─┬─gender─┐
│  3 │ Wang Wu │ female     │
│  7 │ Golden Lotus │ female     │
└────┴──────┴────────┘
┌─id─┬─name─┬─gender─┐
│  1 │ Zhang San │ male     │
│  2 │ Li Si │ male     │
│  6 │ Dalang │ male     │
└────┴──────┴────────┘
2) ReplacingMergeTree engine: deduplicates by primary key
1)Deduplication happens within a partition, based on the sorting key
2)Rows with the same sorting key in the same partition are deduplicated
3)Deduplication happens when parts are merged; merges run automatically in the background (observed here after about five inserts)
4)You can also merge manually: optimize table <table_name> final ;

No version

Which row is kept is not deterministic and may not be the latest data
 It depends on insertion order: the row inserted later is kept
-- Create table
drop table if  exists test_replacingMergeTree1 ;
create table test_replacingMergeTree1(
oid Int8 ,
ctime DateTime ,
cost Decimal(10,2)
)engine = ReplacingMergeTree() -- Specify engine
order by oid -- Specifies that the sorting field defaults to the primary key
partition by toDate(ctime) ;-- Specify partition fields
-- insert data
insert into test_replacingMergeTree1 values(3,'2021-01-01 11:11:11',30) ;
insert into test_replacingMergeTree1 values(1,'2021-01-01 11:11:14',40) ;
insert into test_replacingMergeTree1 values(1,'2021-01-01 11:11:11',10);
insert into test_replacingMergeTree1 values(2,'2021-01-01 11:11:11',20) ;
insert into test_replacingMergeTree1 values(3,'2021-01-02 11:11:11',41) ;
insert into test_replacingMergeTree1 values(4,'2021-01-02 11:11:11',10);
-- Manual merge
optimize table  test_replacingMergeTree1  final ;
-- The result keeps, for each duplicate key, the row that was inserted later
1,2021-01-01 03:11:11,10
2,2021-01-01 03:11:11,20
3,2021-01-02 03:11:11,41
3,2021-01-01 03:11:11,30
4,2021-01-02 03:11:11,10

Version available

- The version column can be a number
- The version column can be a timestamp
1. If no ver version column is set, the last row in a group of duplicates is kept.
2. If a ver version column is set, the row with the largest ver value in the group is kept.
-- Create a versioned table
drop table if  exists test_replacingMergeTree3 ;
create table test_replacingMergeTree3(
 oid Int8 ,
 ctime DateTime ,
 cost Decimal(10,2)
)engine = ReplacingMergeTree(ctime) -- Specify version field 
 order by oid
 partition by toDate(ctime) ;
-- insert data
insert into test_replacingMergeTree3 values(1,'2021-01-01 11:11:11',10) ;
insert into test_replacingMergeTree3 values(1,'2021-01-01 11:11:12',20) ;
insert into test_replacingMergeTree3 values(1,'2021-01-01 11:11:10',30);
insert into test_replacingMergeTree3 values(1,'2021-01-01 11:11:19',40) ;
insert into test_replacingMergeTree3 values(1,'2021-01-01 11:11:13',50) ;
-- Merge data manually
optimize table test_replacingMergeTree3 final;
-- The row with the largest version column value is kept
1,2021-01-01 03:11:19,40
3)CollapsingMergeTree engine: collapsing
CollapsingMergeTree is a table engine that supports row-level modification and deletion by appending rather than deleting. It defines a sign flag column that records the state of each row: if sign is 1, the row is valid data; if sign is -1, the row is to be deleted. When CollapsingMergeTree merges partitions, within the same data partition a pair of rows with sign 1 and sign -1 cancel each other out and are removed. This mutual cancellation of 1 and -1, like folding a sheet of corrugated paper, is where the name CollapsingMergeTree comes from.

ENGINE = CollapsingMergeTree(sign) specifies the collapse field

-- Same sort key in a partition, signs 1 then -1: both rows are deleted (collapsed)
-- Same sort key in a partition, signs 1 then 1: the later row is kept
-- Same sort key in a partition, signs -1 then 1: cannot collapse
-- Same sort key in a partition, signs -1 then -1: deduplicated, the first row is kept
-- Create table
drop table if exists tb_cps_merge_tree1 ;
CREATE TABLE tb_cps_merge_tree1(
 user_id UInt64,
 name String,
 age UInt8,
 sign Int8
)
ENGINE = CollapsingMergeTree(sign) -- Specifies the collapse field
ORDER BY user_id; -- rows with the same sort key are candidates for collapsing
-- insert data
insert into tb_cps_merge_tree1 values
(1,'xiaoluo',23,1),
(2,'xiaoyu',24,1),
(3,'xiaofeng',25,1) ;
insert into tb_cps_merge_tree1 values
(1,'xiaoluo_',23,-1),
(2,'xiaoyu_',24,-1),
(3,'xiaofeng2',25,1) ;
-- merge
optimize table tb_cps_merge_tree1 final;
-- After the merge, the 1/-1 pairs for users 1 and 2 cancel out; for user 3 (two rows with sign 1) the later row is kept
3,xiaofeng2,25,1
4)VersionedCollapsingMergeTree engine: collapsing + version
To solve the problem that CollapsingMergeTree cannot collapse (delete) correctly when rows are written out of order, the VersionedCollapsingMergeTree table engine adds a Version column to the table definition, which records the correspondence between state rows and cancel rows even when they arrive out of order. Rows with the same primary key (sort key), the same Version and opposite Sign are deleted during compaction. As with CollapsingMergeTree, to get correct results before collapsing happens the business layer needs to rewrite its SQL: replace count() with sum(Sign) and sum(col) with sum(col * Sign).
-- Build table
create table t_collapsing2(
    id UInt8,
    name String,
    folding Int8,
    v UInt8
)engine=VersionedCollapsingMergeTree(folding,v) -- Specify the collapse field and version field
order by id;
-- insert data
insert into t_collapsing2 values(1,'zss',-1,1);
insert into t_collapsing2 values(1,'lss',1,1);
-- Merge data
optimize table t_collapsing2 final;
-- Rows with the same version and opposite sign collapse and are deleted, even when the -1 row comes before the 1 row
5)SummingMergeTree engine: sums numeric columns
Suppose the end user only needs aggregated results, does not care about the detail rows, and the aggregation dimensions are defined in advance (the GROUP BY conditions are fixed and will not change arbitrarily).
-- Build table
drop table if exists t_sum;
create table t_sum(
    id String,
    name String,
    date Date,
    wage Float64
)engine = SummingMergeTree() -- You can specify summation fields 
order by id;
-- insert data
insert into t_sum values('1','zss','2021-9-25',30000);
insert into t_sum values('1','zss','2021-9-26',30000);
insert into t_sum values('2','lss','2021-9-25',30000);
insert into t_sum values('2','lss','2021-9-26',30000);
-- Merge data
optimize table t_sum final;
-- At merge time, within a partition, the numeric columns of rows with the same sort key are summed (sum)
-- Querying shows that rows with the same id have been summed automatically
1,zss,2021-09-25,60000
2,lss,2021-09-25,60000
6)AggregatingMergeTree engine: pre-aggregation
1)AggregatingMergeTree is an upgraded version of SummingMergeTree
2)Use the -State suffix when writing and the -Merge suffix when querying
3)Column type: AggregateFunction(aggregate_function, data_type)
-- 1)Create the detail table
drop table if exists detail_table;
CREATE TABLE detail_table
(id UInt8,
 ctime Date,
 money UInt64
) ENGINE = MergeTree()
PARTITION BY toDate(ctime) -- Partition field
ORDER BY id; -- sort field
-- 2)Insert details
INSERT INTO detail_table VALUES(1, '2021-08-06', 100);
INSERT INTO detail_table VALUES(1, '2021-08-06', 100);
INSERT INTO detail_table VALUES(2, '2021-08-07', 200);
INSERT INTO detail_table VALUES(2, '2021-08-07', 200);
-- Query data
1,2021-08-06,100
1,2021-08-06,100
2,2021-08-07,200
2,2021-08-07,200
-- 3)Create the pre-aggregation table
drop table if exists agg_table;
CREATE TABLE agg_table
(id UInt8,
 ctime Date,
 money AggregateFunction(sum, UInt64) -- Pre aggregate fields
) ENGINE = AggregatingMergeTree() -- Specify engine
PARTITION BY  toDate(ctime) -- Partition field
ORDER BY id; -- sort field
-- 4)insert data
insert into agg_table
select id,ctime,
       sumState(money) -- -State suffix on write
from detail_table
group by (id,ctime);
-- 5)Query data
select id,ctime,
       sumMerge(money) -- -Merge suffix on read
from agg_table group by id,ctime ;
2,2021-08-07,400
1,2021-08-06,200

Create materialized view

-- Creating materialized views specifies that the engine is AggregatingMergeTree()
create materialized view detail_table_view
engine=AggregatingMergeTree() -- Specify engine
PARTITION BY toDate(ctime)  -- Specify partition fields
order by id  -- Specify sort field
populate -- backfill existing data from the source table
as select id,ctime ,sumState(money) as moneys
from detail_table group by id,ctime;
-- Query data
select id ,ctime,sumMerge(moneys)
from detail_table_view 
group by id,ctime;

2,2021-08-07,400
1,2021-08-06,200
-- Insert data into the detail table
INSERT INTO detail_table VALUES(3, '2021-09-01', 500);
INSERT INTO detail_table VALUES(3, '2021-09-01', 500);
INSERT INTO detail_table VALUES(4, '2021-09-01', 400);
INSERT INTO detail_table VALUES(4, '2021-09-01', 400);
-- Query the materialized view again and find that the data is automatically updated synchronously
1,2021-08-06,200
2,2021-08-07,400
3,2021-09-01,1000
4,2021-09-01,800

3. External storage engine

HDFS engine
ClickHouse can load data directly from a specified HDFS directory; it does not store the data itself, it only reads it
ENGINE = HDFS(hdfs_uri,format) 
·hdfs_uri is the HDFS file path; 
·format is the file format (one ClickHouse supports; common ones are CSV, TSV, JSON, etc.).
-- Load HDFS data directly; mainstream file formats: text, CSV, TSV, ORC, Parquet, JSONEachRow
-- Create table
create table  tb_hdfs(
    uid String ,
    name String ,
    age UInt8 ,
    gender String
)engine = HDFS('hdfs://doit01:8020/csv/input/user.txt' , 'CSV') ;
-- Parameter 1: HDFS file path
-- Parameter 2: file format
MySQL engine

The MySQL table engine maps to a data table in a MySQL database and issues remote queries to it through SQL, including SELECT and INSERT

Its declaration is as follows:

CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
(
    name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1] [TTL expr1],
    name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2] [TTL expr2],
    ...
) ENGINE = MySQL('host:port', 'database', 'table', 'user', 'password');
-- The parameters have the following meanings: 
-- ·host:port is the MySQL address and port. 
-- ·database is the database name. 
-- ·table is the name of the table to map. 
-- ·user is the MySQL user name. 
-- ·password is the MySQL password.
-- Support query data 
-- Support inserting data
-- Delete and update operations are not supported 
-- Mysql engine
-- Create table
create table  tb_mysql(
    id Int32 ,
    name String
)engine =MySQL('doit01:3306','ck','tb_test1','root','root') ;
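A short usage sketch, assuming the MySQL server at doit01:3306 is reachable and the mapped table ck.tb_test1 exists; both reads and writes go through to MySQL:

-- Reads are pushed to the remote MySQL table
select * from tb_mysql;
-- Inserts are written through to MySQL (delete/update are not supported)
insert into tb_mysql values (3,'ww');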
File engine

4. Memory engine

Memory engine
1)The Memory table engine keeps data directly in memory. The data is not compressed or converted; it is kept in memory in the same form it appears in when queried.
2)When the ClickHouse service restarts, all data in a Memory table is lost.
3)So in some situations Memory tables are used as test tables
CREATE TABLE memory_1 ( 
id UInt64 
)ENGINE = Memory() ;
Set engine
The Set table engine does have physical storage: data is written to memory first and then synchronized to disk files, so its data is not lost when the service restarts, and when the table is reloaded the file data is fully loaded back into memory. As in the Set data structure, all elements are unique, so the Set table engine deduplicates: duplicate data is automatically ignored on write. Its usage is special and limited: although it supports normal INSERT writes, it cannot be queried directly with SELECT; a Set table can only be used indirectly as the right-hand side of an IN condition.
-- Create table
drop table if exists test_set;
create table test_set(
id Int8
)engine=Set();
-- insert data
insert into test_set values(1),(2),(3);
-- A Set table can only be used indirectly, as the right-hand side of an IN condition
-- Create a Log table
drop table if exists  test_set_source;
CREATE TABLE test_set_source(
id Int8,
name String,
age Int8
)ENGINE = Log ;
-- insert data
insert into test_set_source values(1,'lss',21),(2,'ww',33),(3,'zl',11) ;
-- Query data
select * from test_set_source where id in test_set;
1,lss,21
2,ww,33
3,zl,11
Buffer Engine
The Buffer table engine keeps data entirely in memory and does not support persisting it to disk, so after a service restart the table is empty. The Buffer engine is not designed for query scenarios; it acts as a write buffer.
Suppose we need to write data to a target MergeTree table A. Because of a high number of concurrent writes, table A may merge parts more slowly than data is written (each INSERT creates a new part directory). A Buffer table can alleviate this by acting as a buffer for writes: data is written to the Buffer table first, and when the preset conditions are met the Buffer table automatically flushes its data to the target table
ENGINE = Buffer(database, table, num_layers, min_time, max_time, min_rows, max_rows, min_bytes, max_bytes) 
Parameters can be divided into basic parameters and conditional parameters
 Basic parameters:
- database: The database of the target table. 
- table: The name of the target table, Buffer The data in the table will be automatically refreshed to the target table. 
- num_layers: It can be understood as the number of threads. The data is refreshed to the target table in parallel. The official recommendation is set to 16. 
Condition parameters:
- min_time and max_time: The minimum and maximum values of the time condition, in seconds;
- min_rows and max_rows: Minimum and maximum values of data line conditions; 
- min_bytes and max_bytes: Minimum and maximum values of data volume conditions, in bytes;
1. If all of the minimum thresholds in the three groups of conditions are met, a flush is triggered; 
2. If at least one of the three groups of conditions reaches its maximum threshold, a flush is triggered;
1) Create a target table
create table tb_user(uid Int8 , name String) engine=TinyLog ;
2) Create a cache table
CREATE TABLE tb_user_buffer AS tb_user ENGINE = Buffer(db_doit26,tb_user, 16, 5, 60, 10, 100, 10000000, 100000000) ;
3) Insert data into cache table
insert into tb_user_buffer values(1,'Yang'),(2,'Haha') ,(3,'ADA') ;
4) Wait to view the data in the target table later
select * from tb_user ;

7, Grammar

Basic syntax

Import data
1 insert into <table_name> values(...)   or   insert into <table_name> select <query>
2 cat a.txt  | clickhouse-client -q  'insert into tb_name format fileType'
3 clickhouse-client  --format_csv_delimiter='_'  -q  'insert into tb_name format fileType'  < a.txt
--format_csv_delimiter='_'  specifies the field separator
fileType: file format, e.g. CSV, TSV, ...
DDL basics

At present, only the MergeTree, Merge and Distributed table engines support ALTER modifications, so check the table engine before running ALTER!

For MergeTree, ALTER cannot modify partition-key or primary-key columns

-- View the CREATE TABLE statement (shows the engine)
show create table <table_name>
Modify table structure
-- Add a column
alter table <table_name> add column <column_name> <data_type> ;
-- Drop a column
alter table <table_name> drop column <column_name> ;
-- As a good programmer, it is a good habit to comment the columns of a table, so it is recommended to describe the meaning of each column in a comment
-- The encoding used in column comments is UTF8 by default
alter table <table_name> comment column <column_name> 'user name' ;
Move table
-- Modify table name 
rename table tb_test1 to t1 ;
-- Modify multiple table names
rename table tb_test2 to t2 , t1 to tt1 ;
-- Move table to another database 
rename table <table_name> to <database>.<table_name> ;
-- View all tables in a database 
show tables ; -- in the current database
show tables from <database> ;
Set table defaults
-- Set default values for columns 
create table tb_test3(
   id Int8 ,
    name String comment 'user name' ,
    role String comment 'role' default 'VIP'
)engine = Log ;
Update and delete data
Only data in MergeTree-family engine tables can be modified
1) You can delete a partition and re-import it
2) You can delete and update data by condition:  alter table ... delete/update ... where ...
-- Delete partition data 
alter table <table_name> drop partition 'SH' ;
-- Delete data by condition
alter table <table_name> delete where id=3 ;  -- a condition is required
-- Update data by condition
alter table <table_name> update name='Li sisi'  where id=3 ;
Partition operation
-- Create partition table
create table test_partition1(
id String , 
ctime DateTime
)engine=MergeTree() 
partition by toYYYYMM(ctime)
order by (id) ;
-- View table creation statement
show create table test_partition1;
-- insert data
insert into test_partition1 values('1',now()) ,('2','2021-06-11 11:12:13') ;
-- View the partitions of a table
-- ClickHouse has many built-in system tables for querying its own state. The system.parts table is specifically for querying the data parts (partitions) of tables.
SELECT 
name,   -- part (folder) name
table,  -- table
partition  -- partition
FROM system.parts -- system table
WHERE table = 'test_partition1' ;
-- Drop a partition; all data in the partition is deleted 
alter table test_partition1 drop partition '202109' ;
-- Copy partition  
-- Copy table structure and create table
create  table  tb_y  as  tb_x ;
Note, however, that not every pair of tables can copy partitions between each other; two preconditions must be met:
·The two tables have the same partition key
·They have exactly the same table structure.
ALTER TABLE  B  REPLACE PARTITION '<partition>' FROM A 
-- Copy a partition from test_partition1 into test_partition2
alter table test_partition2  replace  partition '202106' from  test_partition1
-- Reset (clear) a column in a partition
 Note: primary-key and partition-key columns cannot be cleared
alter  table test_rep clear column name in partition '202105' ;
-- Detach a partition
alter table <table_name> detach partition '<partition>' ;
-- Attach a partition
alter table <table_name> attach partition '<partition>' ;
-- Note: once a partition has been moved to the detached subdirectory, it is outside ClickHouse's management and ClickHouse will not clean it up. Those partition files remain there unless we delete them ourselves or reload them with an ATTACH statement 
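A concrete detach/attach sketch using the test_partition1 table from above (partition '202106' comes from the row inserted earlier):

-- Detach the partition (moves it to the detached subdirectory)
alter table test_partition1 detach partition '202106' ;
-- Its rows are no longer visible to queries
select count() from test_partition1 ;
-- Reattach it; the rows become visible again
alter table test_partition1 attach partition '202106' ;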

Query syntax

with
-- ClickHouse supports CTEs (Common Table Expressions) to improve the expressiveness of query statements
SELECT pow(2, 2) -- power function
┌─pow(2, 2)─┐
│         4 │
└───────────┘
SELECT pow(pow(2, 2), 2)

┌─pow(pow(2, 2), 2)─┐
│                16 │
└───────────────────┘
-- Using a CTE can greatly improve the readability and maintainability of a statement
with pow(2,2) as a select pow(a,2) ;
-- with syntax
with (expression or subquery) as x
-- Create table
CREATE TABLE t_user(
 `name` String,
 `aihao` String
)ENGINE = Memory;
insert into t_user values ('zss','Travel');
insert into t_user values ('lss','Play basketball');
insert into t_user values ('ww','binge-watching');
--  The subquery must return a single value
with (select name from t_user where name='zss') as x
select * from t_user where name in (x);
array join (explode)
The ARRAY JOIN clause lets you JOIN against array- or nested-type columns inside a table, expanding one row containing an array into multiple rows. It is similar to the explode function in Hive.
-- Create table
create table t_array_join(
    id UInt8,
    name Array(String)
)engine=Memory();
-- insert data
insert into t_array_join values (1,['zs','ls','ww','zl']);
insert into t_array_join values (2,['ny','yf','mq','zq']);
-- Query data array join Burst array
select * from t_array_join;
select id,names
from t_array_join
array join name as names;
-- Queried data
1,zs
1,ls
1,ww
1,zl
2,ny
2,yf
2,mq
2,zq
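A variant sketch: LEFT ARRAY JOIN also keeps rows whose array is empty (the id 3 row is added here just for illustration):

-- A row with an empty array
insert into t_array_join values (3,[]);
-- Plain ARRAY JOIN would drop id 3; LEFT ARRAY JOIN keeps it with an empty string
select id, names
from t_array_join
left array join name as names;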
Join queries
join
left join
right join
 Join strictness
all   -- the default when nothing is written; matches all rows
any   -- matches at most one row from the right table
asof  -- joins on an additional inequality condition (see the sketch below)
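A minimal strictness sketch; the tables t_a(id, ts) and t_b(id, ts, val) are hypothetical:

-- ANY LEFT JOIN: at most one matching row is taken from the right table
select a.id, b.val
from t_a as a
any left join t_b as b on a.id = b.id;
-- ASOF JOIN: matches on equality plus the closest value of an ordered column (here ts)
select a.id, a.ts, b.val
from t_a as a
asof join t_b as b on a.id = b.id and a.ts >= b.ts;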
GROUP BY modifiers (WITH CUBE / ROLLUP / TOTALS) can be used in projects
They pre-aggregate all combinations of the grouping dimensions
With dimensions a, b, c each either grouped (1) or rolled up (0):
a  b  c
1  1  1
0  0  0
CUBE produces 2^n possible combinations
-- with cube    all combinations
-- with rollup  rolls up from right to left; intermediate columns cannot be skipped
-- with totals  only the full grouping plus an overall total row
-- Create table
drop table if exists tb_with ;
create table tb_with(
	id UInt8 ,
	vist UInt8,
    province String ,
    city String ,
    area String
)engine=MergeTree() 
order by id ;
-- insert data
insert into tb_with values(1,12,'Shandong','Jinan','Calendar') ;
insert into tb_with values(2,12,'Shandong','Jinan','Calendar') ;
insert into tb_with values(3,12,'Shandong','Jinan','footbridge') ;
insert into tb_with values(4,12,'Shandong','Jinan','footbridge') ;
insert into tb_with values(5,88,'Shandong','Qingdao','Huangdao') ;
insert into tb_with values(6,88,'Shandong','Qingdao','Huangdao') ;
insert into tb_with values(7,12,'Shanxi','Taiyuan','Small shop') ;
insert into tb_with values(8,12,'Shanxi','Taiyuan','Small shop') ;
insert into tb_with values(9,112,'Shanxi','Taiyuan','Pointed lawn') ;
-- with cube 
SELECT
province ,
city ,
area
FROM 
tb_with
group by province ,city ,area 
with CUBE ;
-- Query results
province|city|area|
--------+----+----+
Shandong      |Qingdao  |Huangdao  |
Shandong      |Jinan  |footbridge  |
Shanxi      |Taiyuan  |Pointed lawn 
Shandong      |Jinan  |Calendar  |
Shanxi      |Taiyuan  |Small shop  |
Shandong      |Qingdao  |    |
Shandong      |Jinan  |    |
Shanxi      |Taiyuan  |    |
Shandong      |    |Calendar  |
Shandong      |    |footbridge  |
Shanxi      |    |Pointed lawn |
Shanxi      |    |Small shop  |
Shandong      |    |Huangdao  |
Shanxi      |    |    |
Shandong      |    |    |
        |Jinan  |Calendar  |
        |Jinan  |footbridge  |
        |Taiyuan  |Pointed lawn |
        |Qingdao  |Huangdao  |
        |Taiyuan  |Small shop  |
        |Qingdao  |    |
        |Jinan  |    |
        |Taiyuan  |    |
        |    |footbridge  |
        |    |Small shop  |
        |    |Huangdao  |
        |    |Calendar  |
        |    |Pointed lawn |
        |    |    |
-- with rollup
SELECT 
province ,city ,area
FROM 
tb_with
group by province , city , area 
with rollup;
-- Query results
province|city|area|
--------+----+----+
Shandong      |Qingdao  |Huangdao  |
Shandong      |Jinan  |footbridge  |
Shanxi      |Taiyuan  |Pointed lawn |
Shandong      |Jinan  |Calendar  |
Shanxi      |Taiyuan  |Small shop  |
Shandong      |Qingdao  |    |
Shandong      |Jinan  |    |
Shanxi      |Taiyuan  |    |
Shanxi      |    |    |
Shandong      |    |    |
        |    |    |
-- with totals
SELECT 
province ,city ,area
FROM 
tb_with
group by province , city , area 
with totals;
province|city|area|
--------+----+----+
Shandong      |Qingdao  |Huangdao  |
Shandong      |Jinan  |footbridge  |
Shanxi      |Taiyuan  |Pointed lawn |
Shandong      |Jinan  |Calendar  |
Shanxi      |Taiyuan  |Small shop  |
        |    |    |
Functions

User behavior analysis: sequenceMatch()

-- SELECT sequenceMatch('(?1)(?2)')(eventTime, eventid = 'event1', eventid = 'event2') FROM funnel_test ;
--   (?N)              matches the condition parameter at position N.
--   .*                matches any number of events; no condition parameter is needed for this pattern.
--   (?1)(?t>1800)(?2) matches events 1 and 2 that occur more than 1800 seconds apart.
--   The >=, >, <, <=, == operators can be used in the time condition.
-- data
CREATE TABLE funnel_test 
( 	
	uid String, 
	eventid String, 
	eventTime UInt64
) 
ENGINE = MergeTree 
PARTITION BY (uid, eventTime) 
ORDER BY (uid, eventTime) 
SETTINGS index_granularity = 8192 ;
insert into funnel_test values
('uid1','event1',1551398404) ,
('uid1','event2',1551398406) ,
('uid1','event3',1551398408) ,
('uid2','event2',1551398412) ,
('uid2','event4',1551398415) ,
('uid3','event3',1551398410) ,
('uid3','event4',1551398413) ,
('uid4','event2',1551398410) ,
('uid4','event4',1551398413) ;
 
SELECT uid ,sequenceMatch('(?1)(?2)')(eventTime, eventid = 'event1', eventid = 'event2') FROM funnel_test group by uid;
-- The returned value is 1 if the pattern matched, 0 otherwise
┌─uid──┬─sequenceMatch('(?1)(?2)')(eventTime, equals(eventid, 'event1'), equals(eventid, 'event2'))─┐
│ uid3 │                                                                                          0 │
│ uid1 │                                                                                          1 │
│ uid4 │                                                                                          0 │
│ uid2 │                                                                                          0 │
└──────┴────────────────────────────────────────────────────────────────────────────────────────────┘
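A hedged sketch of the time-condition pattern described above, run against the same funnel_test data (the 1800-second window is only illustrative):

-- Returns 1 only if event2 follows event1 within 1800 seconds for that uid
SELECT uid,
       sequenceMatch('(?1)(?t<1800)(?2)')(eventTime, eventid = 'event1', eventid = 'event2') AS matched
FROM funnel_test
GROUP BY uid;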

Funnel function (windowFunnel)

-- windowFunnel returns how many of the chained conditions were satisfied, in order
-- Same data as above
SELECT uid,windowFunnel(4)(toDateTime(eventTime),eventid = 'event2',eventid = 'event3') as funnel from funnel_test group by uid;
	-- The time from event2 to event3 must not exceed 4 seconds
 ┌─uid──┬─funnel─┐
│ uid3 │      0 │
│ uid1 │      2 │
│ uid4 │      1 │
│ uid2 │      2 │
└──────┴────────┘
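A follow-up sketch (an assumption about typical usage, not from the original notes): counting how many users reached each level of the funnel.

select
    countIf(funnel >= 1) as reached_event2,
    countIf(funnel >= 2) as reached_event3
from
(
    select uid,
           windowFunnel(4)(toDateTime(eventTime), eventid = 'event2', eventid = 'event3') as funnel
    from funnel_test
    group by uid
);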

8, Distributed

Configure remote connections

configuration file:  vi /etc/clickhouse-server/config.xml 
1)Enable remote connections: uncomment the line below in the file (or add a copy of it)
<listen_host>::</listen_host>


Configure zookeeper

configuration file: vi /etc/clickhouse-server/config.xml
<zookeeper>    
 <node index="1">
 <host>doit01</host>
 <port>2181</port>
 </node>
  <node index="2">
 <host>doit02</host>
 <port>2181</port>
 </node>
  <node index="3">
 <host>doit03</host>
 <port>2181</port>
 </node>
 </zookeeper>


Start the zookeeper service before starting

Distribute the configuration file:  scp config.xml doit02:$PWD
sudo /etc/init.d/clickhouse-server restart  Restart the ClickHouse service
Create a replicated table with one shard (three replicas)
-- doit01 
create table tb_demo1 (
id Int8 ,
name String)
engine=ReplicatedMergeTree('/clickhouse/tables/01/tb_demo1', 'doit01') 
order by id ;

-- doit02 
create table tb_demo1 (
id Int8 ,
name String)
engine=ReplicatedMergeTree('/clickhouse/tables/01/tb_demo1', 'doit02') 
order by id ;

-- doit03
create table tb_demo1 (
id Int8 ,
name String)
engine=ReplicatedMergeTree('/clickhouse/tables/01/tb_demo1', 'doit03') 
order by id ;
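A quick replication check, assuming ZooKeeper and all three servers are running: rows inserted on one replica should appear on the others.

-- On doit01
insert into tb_demo1 values(1,'zss'),(2,'lss');
-- On doit02 or doit03: the replicated rows show up shortly afterwards
select * from tb_demo1;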

Create a replicated table with two shards (shard 01 on doit01/doit02, shard 02 on doit03)

-- doit01 
create table tb_demo2 (
id Int8 ,
name String)
engine=ReplicatedMergeTree('/clickhouse/tables/01/tb_demo2', 'doit01') 
order by id ;

-- doit02 
create table tb_demo2 (
id Int8 ,
name String)
engine=ReplicatedMergeTree('/clickhouse/tables/01/tb_demo2', 'doit02') 
order by id ;

-- doit03
create table tb_demo2 (
id Int8 ,
name String)
engine=ReplicatedMergeTree('/clickhouse/tables/02/tb_demo2', 'doit03') 
order by id ;
Distributed create table

Configure cluster

In config.xml, search for <remote_servers> (in vi: /<remote_servers>)

<remote_servers>

    <!-- Copy the following -->
<cluster01> 
     <!-- All tables in the cluster have one slice -->
	 <shard>
	    <!-- Three copies of each table -->
		 <replica>
			 <host>doit01</host>
			 <port>9000</port>
		 </replica>
		  <replica>
			 <host>doit02</host>
			 <port>9000</port>
		 </replica>
		  <replica>
			 <host>doit03</host>
			 <port>9000</port>
		 </replica>
	 </shard>
 </cluster01>


Distribute the configuration file and restart the ClickHouse service

Create distributed tables

 -- Running this SQL on any node creates identical tables on doit01, doit02 and doit03 (ON CLUSTER)
 -- 1)Create the local table on every node in the cluster
create table  tb_demo00 on cluster cluster01(
id Int8 ,
name String 
) engine = MergeTree()
order by id ;

-- 2)Create a distributed table that points to the local tables
-- Distributed() engine parameters: 1) cluster name  2) database (here default)  3) local table name  4) sharding key
create table demo00_all 
on cluster cluster01 
engine=Distributed('cluster01','default','tb_demo00',id) as tb_demo00 ;
-- Inserting into the distributed table writes the data through to the local tables on doit01/02/03
insert into demo00_all values(1,'lny');
insert into demo00_all values(2,'yg');
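A quick verification sketch: the distributed table queries the whole cluster, while each local table only holds its own rows.

-- Query through the distributed table (covers the whole cluster)
select * from demo00_all;
-- Query the local table directly on one node
select * from tb_demo00;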

