1, ClickHouse - Introduction
ClickHouse is a columnar database management system (DBMS) for online analytical processing (OLAP), open-sourced by Russia's Yandex in 2016. It is mainly used for OLAP queries and can generate analytical reports in real time using SQL. Its full name is Click Stream, Data WareHouse, abbreviated ClickHouse.
1. Advantages
- Flexible MPP architecture: supports linear scale-out; simple, convenient, and highly reliable
- Multi-server distributed data processing; a complete DBMS
- Columnar storage at the bottom layer, with compression and optimized data storage and indexing
- Fast, fault-tolerant operation: reportedly 5x faster than Vertica, 279x faster than Hive, and 800x faster than MySQL, handling datasets on the order of a billion rows
- Versatile: supports all kinds of data statistics and analysis scenarios, SQL queries, and replicated remote deployment

In short: massive data storage, distributed computing, lightning-fast performance, near-real-time analysis, friendly SQL syntax, and excellent function support.
2. Disadvantages
- Transactions and true row-level delete/update are not supported (only batch mutations)
- High concurrency is not supported; the official recommendation is about 100 QPS. You can raise the connection limit in the configuration file, but only if the server is powerful enough
- Secondary indexes are not supported
- Not good at multi-table joins; prefer large wide tables
- Metadata management requires human intervention
- Write in batches of more than 1,000 rows; avoid row-by-row inserts and small-batch insert, update, and delete operations
2, Installation
```shell
sudo yum -y install yum-utils
sudo rpm --import https://repo.clickhouse.tech/CLICKHOUSE-KEY.GPG
sudo yum-config-manager --add-repo https://repo.clickhouse.tech/rpm/clickhouse.repo
sudo yum -y install clickhouse-server clickhouse-client

sudo /etc/init.d/clickhouse-server start   # start the service
clickhouse-client                          # enter the client
```
Open remote connection
```shell
cd /etc/clickhouse-server                    # configuration directory
vi config.xml                                # edit the configuration file
# :set nu, then find (around line 156) and uncomment, or copy in:
#   <listen_host>::</listen_host>
sudo /etc/init.d/clickhouse-server restart   # restart the service
```
3, Common directories
- Data directory: /var/lib/clickhouse
- Server configuration directory: /etc/clickhouse-server
- Log directory: /var/log/clickhouse-server
4, Common data types
```sql
-- integers
UInt8   -- [0, 255]
UInt32  -- [0, 4294967295]
-- floating point
Float32, Float64
-- higher-precision arithmetic
Decimal32(9)  -- e.g. 9 decimal places
-- strings
String
FixedString(N)  -- fixed length
-- dates and times
Date        -- date only
DateTime    -- yyyy-MM-dd HH:mm:ss
DateTime64  -- with millisecond precision
-- UUID type
UUID
-- enumeration type (restricts the values that can be inserted)
Enum('female' = 1, 'male' = 2)
-- collections
Array(T)            -- array
Map(K, V)           -- map
Tuple(T1, T2, ...)  -- tuple
-- IPv4 type (validates that the value is an IP address)
IPv4
```
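To make the list concrete, here is a small sketch combining several of these types in one table (the table and values are invented for illustration):

```sql
-- a throwaway table exercising several common types
CREATE TABLE t_types_demo(
    id     UInt32,
    name   String,
    gender Enum('female' = 1, 'male' = 2),
    ip     IPv4,
    tags   Array(String),
    ctime  DateTime
) ENGINE = MergeTree
ORDER BY id;

INSERT INTO t_types_demo VALUES
    (1, 'zss', 'male', '127.0.0.1', ['a', 'b'], now());
```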
5, Views
Normal view
ClickHouse has two kinds of views: normal views and materialized views. A materialized view has independent storage, while a normal view is just a simple query proxy.
```sql
CREATE VIEW [IF NOT EXISTS] [db_name.]view_name AS SELECT ...
```
An ordinary view stores no data; it is just a thin mapping over a SELECT query. It simplifies queries and clarifies semantics, but does not improve query performance.
```sql
create view test3_view as select id, upper(name), role from tb_test3;

-- the view shows up in the table list
┌─name────────────┐
│ tb_test3        │
│ test3_view      │
│ test_partition1 │
│ test_partition2 │
│ test_rep        │
│ tt1             │
└─────────────────┘

drop view test3_view;  -- delete the view
```
Materialized view
A materialized view supports table engines: how its data is saved is determined by the table engine it uses. The complete syntax for creating a materialized view is as follows:
```sql
create materialized view mv_log engine=Log populate as select * from log;
```
Once a materialized view is created, new data written to the source table is synchronized into the view automatically.
Note
Deleting data from the source table does not change the data already stored in the materialized view.

```sql
-- Note: the data-deletion syntax applies only to MergeTree-engine tables. Basic syntax:
ALTER TABLE db_name.table_name DROP PARTITION '20210601';
ALTER TABLE db_name.table_name DELETE WHERE day = '20210618';
ALTER TABLE <table_name> UPDATE col1 = expr1, ... WHERE <filter>;
```
6, Engines
Engines come in two levels: database engines and table engines. Different engines mean different data storage locations, different characteristics, and different supported operations. Database engine: you can create a database without specifying one; a default engine is used.
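As a quick way to see what is available on your server (a sketch using standard system tables, not from the original notes):

```sql
SELECT name FROM system.table_engines;      -- all table engines the server supports
SELECT name, engine FROM system.databases;  -- databases and their database engines
```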
1. Log family engines
The Log family is a set of lightweight engines with minimal functionality. They are most effective when you need to quickly write many small tables (up to about 1 million rows) and read them as a whole later.
1)TinyLog engine
The simplest table engine; it stores data on disk. Each column is stored in a separate compressed file, and writes append to the end of the file. The engine has no concurrency control.
```sql
-- create the table
create table t_tinylog(
    id   UInt8,
    name String
) engine = TinyLog;

-- insert data
insert into t_tinylog values (1,'Zhang San'), (2,'Li Si');
```
In the underlying storage, every write is appended to the column files:
```
-rw-r-----. 1 clickhouse clickhouse 28 Sep 27 04:04 id.bin
-rw-r-----. 1 clickhouse clickhouse 40 Sep 27 04:04 name.bin
-rw-r-----. 1 clickhouse clickhouse 64 Sep 27 04:04 sizes.json
```
2)StripeLog engine
- data.bin: stores all the data
- index.mrk: marks for the data blocks
- sizes.json: data sizes

Supports concurrent reads and writes.
```sql
-- create the table
create table t_stripelog(
    id   UInt8,
    name String
) engine = StripeLog;

-- insert data
insert into t_stripelog values (1,'Zhang San'), (2,'Li Si');
```
Underlying storage:
```
-rw-r-----. 1 clickhouse clickhouse 118 Sep 27 04:16 data.bin
-rw-r-----. 1 clickhouse clickhouse  63 Sep 27 04:16 index.mrk
-rw-r-----. 1 clickhouse clickhouse  68 Sep 27 04:16 sizes.json
```
3)Log engine
- *.bin: data for each column
- __marks.mrk: marks for the data blocks

Supports multithreading and concurrent reads and writes.
```sql
-- drop the table
drop table if exists t_log;

-- create the table
create table t_log(
    id   UInt8,
    name String
) engine = Log;

-- insert data
insert into t_log values (1,'Zhang San'), (2,'Li Si');
```
Underlying storage:
```
-rw-r-----. 1 clickhouse clickhouse 28 Sep 27 04:22 id.bin
-rw-r-----. 1 clickhouse clickhouse 32 Sep 27 04:22 __marks.mrk
-rw-r-----. 1 clickhouse clickhouse 40 Sep 27 04:22 name.bin
-rw-r-----. 1 clickhouse clickhouse 94 Sep 27 04:22 sizes.json
```
2. MergeTree family engines
The MergeTree family of table engines is the core of ClickHouse's data storage. They provide most of the features needed for resilient, high-performance data retrieval: columnar storage, custom partitioning, a sparse primary index, secondary data-skipping indexes, and more.
1) Data is stored sorted by primary key.
2) If a partition key is specified, partitioning can be used.
3) Data replication: the ReplicatedMergeTree family of tables provides replication.
4) Data sampling: if needed, a data sampling method can be set on the table.
1) MergeTree engine (the primary key is not deduplicated)
MergeTree always writes a batch of data to disk as an immutable data part. To keep the number of parts from growing too large, ClickHouse periodically merges parts belonging to the same partition into new parts in background threads. This repeated merging of data parts is the origin of the name "merge tree".
1) Be sure to specify the sort key, e.g. order by id (the sort key doubles as the primary key by default).
2) The primary key has no unique constraint, so duplicate keys are not removed.
Syntax
```sql
CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster1](
    name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1] [TTL expr1],
    name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2] [TTL expr2],
    ...
    INDEX index_name1 expr1 TYPE type1(...) GRANULARITY value1,
    INDEX index_name2 expr2 TYPE type2(...) GRANULARITY value2
) ENGINE = MergeTree()  -- specify the engine
ORDER BY expr           -- sort key
[PARTITION BY expr]     -- partition key
[PRIMARY KEY expr]      -- primary key
[SAMPLE BY expr]        -- sampling
[TTL expr [DELETE|TO DISK 'xxx'|TO VOLUME 'xxx'], ...]  -- expiration time
[SETTINGS name=value, ...]
```
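The TTL clause in the syntax above can expire rows automatically; a minimal sketch (the table name and interval are invented here):

```sql
-- rows are removed by background merges once ctime is more than 30 days old
CREATE TABLE t_ttl_demo(
    id    UInt8,
    ctime DateTime
) ENGINE = MergeTree
ORDER BY id
TTL ctime + INTERVAL 30 DAY;
```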
Create table
```sql
-- drop the table
drop table if exists t_merge_tree;

-- create the table
create table t_merge_tree(
    id     UInt8,
    name   String,
    gender Enum('male'=1, 'female'=2, 'Simon?'=3)
) engine = MergeTree  -- specify the engine
partition by gender   -- specify the partition key
order by id;          -- sort key (also the primary key by default)

-- insert data
insert into t_merge_tree values
    (1,'Zhang San','male'),
    (2,'Li Si','male'),
    (3,'Wang Wu','female'),
    (4,'Xiaochao','Simon?');
insert into t_merge_tree values
    (5,'Duan Zong','Simon?'),
    (6,'Dalang','male'),
    (7,'Golden Lotus','female'),
    (8,'ximen qing','Simon?');
```
Merge data
```sql
-- before merging (parts of the same partition are still separate)
┌─id─┬─name─────┬─gender─┐
│  4 │ Xiaochao │ Simon? │
└────┴──────────┴────────┘
┌─id─┬─name───────┬─gender─┐
│  5 │ Duan Zong  │ Simon? │
│  8 │ ximen qing │ Simon? │
└────┴────────────┴────────┘

-- merge manually
optimize table t_merge_tree final;

-- after merging
┌─id─┬─name───────┬─gender─┐
│  4 │ Xiaochao   │ Simon? │
│  5 │ Duan Zong  │ Simon? │
│  8 │ ximen qing │ Simon? │
└────┴────────────┴────────┘
┌─id─┬─name─────────┬─gender─┐
│  3 │ Wang Wu      │ female │
│  7 │ Golden Lotus │ female │
└────┴──────────────┴────────┘
┌─id─┬─name──────┬─gender─┐
│  1 │ Zhang San │ male   │
│  2 │ Li Si     │ male   │
│  6 │ Dalang    │ male   │
└────┴───────────┴────────┘
```
2) ReplacingMergeTree engine - primary key deduplication
1) Deduplication happens within a partition; rows with the same sort key end up grouped together.
2) Rows with the same sort-key value in the same partition are deduplicated.
3) Deduplication happens only when parts are merged, and background merges run at times ClickHouse chooses.
4) You can also trigger a merge manually: optimize table <table_name> final;
Without a version column
Which row is kept depends on insertion order: the row inserted last is retained, which may not be the logically newest data.
```sql
-- create the table
drop table if exists test_replacingMergeTree1;
create table test_replacingMergeTree1(
    oid   Int8,
    ctime DateTime,
    cost  Decimal(10,2)
) engine = ReplacingMergeTree()  -- specify the engine (no version column)
order by oid                     -- sort key, also the primary key by default
partition by toDate(ctime);      -- partition key

-- insert data
insert into test_replacingMergeTree1 values (3,'2021-01-01 11:11:11',30);
insert into test_replacingMergeTree1 values (1,'2021-01-01 11:11:14',40);
insert into test_replacingMergeTree1 values (1,'2021-01-01 11:11:11',10);
insert into test_replacingMergeTree1 values (2,'2021-01-01 11:11:11',20);
insert into test_replacingMergeTree1 values (3,'2021-01-02 11:11:11',41);
insert into test_replacingMergeTree1 values (4,'2021-01-02 11:11:11',10);

-- merge manually
optimize table test_replacingMergeTree1 final;

-- the result keeps the last-inserted row per sort key, per partition
1,2021-01-01 03:11:11,10
2,2021-01-01 03:11:11,20
3,2021-01-02 03:11:11,41
3,2021-01-01 03:11:11,30
4,2021-01-02 03:11:11,10
```
With a version column
- The version column can be numeric
- The version column can be a timestamp

1. If no ver version column is set, the last row in a group of duplicates is kept.
2. If ver is set, the row with the largest ver value in the group is kept.
```sql
-- create a table with a version column
drop table if exists test_replacingMergeTree3;
create table test_replacingMergeTree3(
    oid   Int8,
    ctime DateTime,
    cost  Decimal(10,2)
) engine = ReplacingMergeTree(ctime)  -- specify the version column
order by oid
partition by toDate(ctime);

-- insert data
insert into test_replacingMergeTree3 values (1,'2021-01-01 11:11:11',10);
insert into test_replacingMergeTree3 values (1,'2021-01-01 11:11:12',20);
insert into test_replacingMergeTree3 values (1,'2021-01-01 11:11:10',30);
insert into test_replacingMergeTree3 values (1,'2021-01-01 11:11:19',40);
insert into test_replacingMergeTree3 values (1,'2021-01-01 11:11:13',50);

-- merge manually
optimize table test_replacingMergeTree3 final;

-- the row with the largest version value is kept
1,2021-01-01 03:11:19,40
```
3) CollapsingMergeTree engine - collapsing
CollapsingMergeTree is a table engine that supports row-level modification and deletion through the idea of adding instead of deleting. It defines a sign flag column that records the state of each row: sign = 1 marks a valid row, and sign = -1 marks a row to be deleted. When CollapsingMergeTree merges partitions, a pair of rows in the same partition marked 1 and -1 cancel out and are removed. This mutual cancellation of 1 and -1 is like folding a sheet of corrugated paper, which is the origin of the name "collapsing merge tree".
`ENGINE = CollapsingMergeTree(sign)` specifies the collapse (sign) column.
- Same sort key within a partition, signs 1 then -1: the pair is deleted (collapsed)
- Same sort key within a partition, signs 1 then 1: deduplicated, the later row is kept
- Same sort key within a partition, signs -1 then 1: cannot collapse
- Same sort key within a partition, signs -1 then -1: deduplicated, the first row is kept
```sql
-- create the table
drop table if exists tb_cps_merge_tree1;
CREATE TABLE tb_cps_merge_tree1(
    user_id UInt64,
    name    String,
    age     UInt8,
    sign    Int8
) ENGINE = CollapsingMergeTree(sign)  -- specify the collapse column
ORDER BY user_id;  -- rows with the same sort key are candidates for collapsing

-- insert data
insert into tb_cps_merge_tree1 values
    (1,'xiaoluo',23,1),
    (2,'xiaoyu',24,1),
    (3,'xiaofeng',25,1);
insert into tb_cps_merge_tree1 values
    (1,'xiaoluo_',23,-1),
    (2,'xiaoyu_',24,-1),
    (3,'xiaofeng2',25,1);

-- merge
optimize table tb_cps_merge_tree1 final;

-- the 1/-1 pairs cancel out; for the duplicate 1/1 rows the later one is kept
3,xiaofeng2,25,1
```
4) VersionedCollapsingMergeTree engine - collapsing + version
To solve CollapsingMergeTree's inability to collapse (delete) correctly when rows arrive out of order, the VersionedCollapsingMergeTree table engine adds a Version column in the table-creation statement to record the correspondence between state rows and cancel rows. Rows with the same primary (sort) key, the same Version, and opposite Sign values are deleted during compaction. As with CollapsingMergeTree, to get correct results the business layer needs to rewrite its SQL: count() becomes sum(Sign), and sum(col) becomes sum(col * Sign).
```sql
-- create the table
create table t_collapsing2(
    id      UInt8,
    name    String,
    folding Int8,
    v       UInt8
) engine = VersionedCollapsingMergeTree(folding, v)  -- specify the collapse column and version column
order by id;

-- insert data (note: the -1 row arrives before the 1 row)
insert into t_collapsing2 values (1,'zss',-1,1);
insert into t_collapsing2 values (1,'lss', 1,1);

-- merge data
optimize table t_collapsing2 final;

-- rows with the same version and opposite signs collapse even when -1 comes first
```
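To read correct results without waiting for a merge, rewrite aggregations as described above. A sketch against a hypothetical tb_orders(user_id, amount, sign) table (names invented for illustration):

```sql
SELECT
    user_id,
    sum(sign)          AS cnt,    -- replaces count()
    sum(amount * sign) AS amount  -- replaces sum(amount)
FROM tb_orders
GROUP BY user_id
HAVING sum(sign) > 0;             -- drop fully-cancelled groups
```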
5) SummingMergeTree engine - sums numeric columns
Suppose the end user only needs aggregated results, does not care about the detail rows, and the aggregation dimensions are defined in advance (the GROUP BY conditions are clear and will not change arbitrarily).
```sql
-- create the table
drop table if exists t_sum;
create table t_sum(
    id   String,
    name String,
    date Date,
    wage Float64
) engine = SummingMergeTree()  -- the columns to sum can also be specified explicitly
order by id;

-- insert data
insert into t_sum values ('1','zss','2021-9-25',30000);
insert into t_sum values ('1','zss','2021-9-26',30000);
insert into t_sum values ('2','lss','2021-9-25',30000);
insert into t_sum values ('2','lss','2021-9-26',30000);

-- merge data
optimize table t_sum final;
-- when merging, all numeric columns of rows with the same sort key in a partition are summed

-- rows with the same id are summed automatically:
1,zss,2021-09-25,60000
2,lss,2021-09-25,60000
```
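As the comment above notes, the columns to sum can also be listed explicitly in the engine; a minimal sketch (the bonus column is invented here):

```sql
-- only wage is summed; other numeric columns keep an arbitrary value from the group
create table t_sum2(
    id    String,
    wage  Float64,
    bonus Float64
) engine = SummingMergeTree(wage)
order by id;
```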
6) AggregatingMergeTree engine - pre-aggregation
1) AggregatingMergeTree is an upgraded version of SummingMergeTree.
2) Writes require the -State combinator syntax; queries use the -Merge combinator syntax.
3) Column type: AggregateFunction(aggregate_function, data_type).
```sql
-- 1) create the detail table
drop table if exists detail_table;
CREATE TABLE detail_table (
    id    UInt8,
    ctime Date,
    money UInt64
) ENGINE = MergeTree()
PARTITION BY toDate(ctime)  -- partition key
ORDER BY id;                -- sort key

-- 2) insert detail rows
INSERT INTO detail_table VALUES(1, '2021-08-06', 100);
INSERT INTO detail_table VALUES(1, '2021-08-06', 100);
INSERT INTO detail_table VALUES(2, '2021-08-07', 200);
INSERT INTO detail_table VALUES(2, '2021-08-07', 200);

-- query the data
1,2021-08-06,100
1,2021-08-06,100
2,2021-08-07,200
2,2021-08-07,200

-- 3) create the pre-aggregation table
drop table if exists agg_table;
CREATE TABLE agg_table (
    id    UInt8,
    ctime Date,
    money AggregateFunction(sum, UInt64)  -- pre-aggregated column
) ENGINE = AggregatingMergeTree()  -- specify the engine
PARTITION BY toDate(ctime)         -- partition key
ORDER BY id;                       -- sort key

-- 4) insert data
insert into agg_table
select id, ctime, sumState(money)  -- -State syntax for writes
from detail_table
group by (id, ctime);

-- 5) query data
select id, ctime, sumMerge(money)  -- -Merge syntax for reads
from agg_table
group by id, ctime;

2,2021-08-07,400
1,2021-08-06,200
```
Create materialized view
```sql
-- create a materialized view whose engine is AggregatingMergeTree()
create materialized view detail_table_view
engine = AggregatingMergeTree()  -- specify the engine
PARTITION BY toDate(ctime)       -- partition key
order by id                      -- sort key
populate                         -- backfill existing data
as select id, ctime, sumState(money) as moneys
from detail_table
group by id, ctime;

-- query the view
select id, ctime, sumMerge(moneys) from detail_table_view group by id, ctime;
2,2021-08-07,400
1,2021-08-06,200

-- insert more rows into the detail table
INSERT INTO detail_table VALUES(3, '2021-09-01', 500);
INSERT INTO detail_table VALUES(3, '2021-09-01', 500);
INSERT INTO detail_table VALUES(4, '2021-09-01', 400);
INSERT INTO detail_table VALUES(4, '2021-09-01', 400);

-- query the materialized view again: the data is synchronized automatically
1,2021-08-06,200
2,2021-08-07,400
3,2021-09-01,1000
4,2021-09-01,800
```
3. External storage engines
HDFS engine
ClickHouse can load data directly from a specified HDFS directory; it does not store the data itself, it only reads it.
ENGINE = HDFS(hdfs_uri, format)
- hdfs_uri: the HDFS file path;
- format: the file format, one of the formats ClickHouse supports (commonly CSV, TSV, JSON, etc.).
```sql
-- load data directly from HDFS; common formats: CSV, TSV, ORC, Parquet, JSONEachRow
-- create the table
create table tb_hdfs(
    uid    String,
    name   String,
    age    UInt8,
    gender String
) engine = HDFS('hdfs://doit01:8020/csv/input/user.txt', 'CSV');
-- parameter 1: the HDFS file path
-- parameter 2: the file format
```
MySQL engine
The MySQL table engine maps to a table in a MySQL database, so you can run remote queries against it with SQL, including SELECT and INSERT.
Its declaration syntax is as follows:
```sql
CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster] (
    name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1] [TTL expr1],
    name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2] [TTL expr2],
    ...
) ENGINE = MySQL('host:port', 'database', 'table', 'user', 'password');

-- parameters:
--   host:port  address and port of the MySQL server
--   database   database name
--   table      name of the table to map
--   user       MySQL user name
--   password   MySQL password

-- querying data is supported
-- inserting data is supported
-- delete and update operations are not supported
```
```sql
-- MySQL engine
-- create the table
create table tb_mysql(
    id   Int32,
    name String
) engine = MySQL('doit01:3306', 'ck', 'tb_test1', 'root', 'root');
```
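Reads and writes then pass straight through to MySQL; a quick usage sketch (values invented):

```sql
select * from tb_mysql;                  -- executes a remote query against MySQL
insert into tb_mysql values (1, 'zss');  -- the row lands in the MySQL table
```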
File engine
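A minimal sketch of the File engine, which keeps the table's data in a single local file in the given format inside the table's data directory:

```sql
create table tb_file(
    id   UInt8,
    name String
) engine = File(CSV);  -- data is stored as data.CSV in the table directory

insert into tb_file values (1, 'zss');
select * from tb_file;
```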
4. Memory engines
Memory engine
1) The Memory table engine keeps data directly in memory. The data is neither compressed nor format-converted; it is held in memory in the same form in which it is queried.
2) When the ClickHouse service restarts, all data in a Memory table is lost.
3) So in some scenarios a Memory table is used as a test table.
```sql
CREATE TABLE memory_1 (
    id UInt64
) ENGINE = Memory();
Set engine
The Set table engine does have physical storage: data is written to memory first and then synchronized to disk files, so its data survives a service restart; when the table is reloaded, the file data is fully loaded back into memory. As in the Set data structure, all elements are unique, so the Set table engine deduplicates: duplicate data is silently ignored on write. Its usage scenarios are special and limited: it supports normal INSERT writes, but it cannot be queried directly with SELECT; a Set table can only be used indirectly, as the right-hand side of an IN condition.
```sql
-- create the table
drop table if exists test_set;
create table test_set(
    id Int8
) engine = Set();

-- insert data
insert into test_set values (1),(2),(3);

-- a Set table can only be used on the right-hand side of an IN condition
-- create a Log table
drop table if exists test_set_source;
CREATE TABLE test_set_source(
    id   Int8,
    name String,
    age  Int8
) ENGINE = Log;

-- insert data
insert into test_set_source values (1,'lss',21),(2,'ww',33),(3,'zl',11);

-- query
select * from test_set_source where id in test_set;
1,lss,21
2,ww,33
3,zl,11
```
Buffer Engine
The Buffer table engine keeps data entirely in memory and does not support persistent file storage, so after a service restart the table is empty. Buffer tables are not designed for query scenarios; they act as a write buffer. Suppose we need to write data to a target MergeTree table A under high write concurrency; A's merge speed may then fall behind the write speed (because every INSERT creates a new part directory). A Buffer table alleviates this: data is written to the Buffer table first, and when preset conditions are met, the Buffer table automatically flushes the data to the target table.
```sql
ENGINE = Buffer(database, table, num_layers, min_time, max_time, min_rows, max_rows, min_bytes, max_bytes)
```

The parameters fall into basic and condition parameters.

Basic parameters:
- database: the database of the target table.
- table: the name of the target table; data in the Buffer table is automatically flushed to it.
- num_layers: roughly the number of threads flushing data to the target table in parallel; the official recommendation is 16.

Condition parameters:
- min_time and max_time: minimum and maximum of the time condition, in seconds;
- min_rows and max_rows: minimum and maximum of the row-count condition;
- min_bytes and max_bytes: minimum and maximum of the data-volume condition, in bytes.

1. If all three minimum thresholds are met, a flush is triggered;
2. If at least one of the three maximum thresholds is met, a flush is triggered.
```sql
-- 1) create the target table
create table tb_user(uid Int8, name String) engine = TinyLog;

-- 2) create the buffer table
CREATE TABLE tb_user_buffer AS tb_user
ENGINE = Buffer(db_doit26, tb_user, 16, 5, 60, 10, 100, 10000000, 100000000);

-- 3) insert data into the buffer table
insert into tb_user_buffer values (1,'Yang'), (2,'Haha'), (3,'ADA');

-- 4) wait a moment, then check the data in the target table
select * from tb_user;
```
7, Syntax
Basic syntax
Import data
1. SQL inserts:

```sql
insert into <table> values (...);
insert into <table> select <query>;
```

2. Pipe a file into clickhouse-client:

```shell
cat a.txt | clickhouse-client -q 'insert into tb_name format <fileType>'
```

3. Redirect a file, optionally with a custom delimiter:

```shell
clickhouse-client --format_csv_delimiter='_' -q 'insert into tb_name format <fileType>' < a.txt
```

--format_csv_delimiter='_' specifies the separator; fileType is the file format: CSV, TSV, ...
DDL basics
At present only the MergeTree, Merge, and Distributed table engines support ALTER modifications, so check the table engine before running an ALTER!
In a MergeTree table, partition-key and primary-key columns cannot be modified.
```sql
-- view the CREATE statement (and the engine) of a table
show create table <table_name>;
```
Modify table structure
```sql
-- add a column
alter table <table_name> add column <column_name> <type>;

-- drop a column
alter table <table_name> drop column <column_name>;

-- As a good programmer, commenting table columns is a good habit, so it is
-- recommended to describe the meaning of a column with a comment.
-- Comments use UTF-8 encoding by default.
alter table <table_name> comment column <column_name> 'user name';
```
Rename and move tables
```sql
-- rename a table
rename table tb_test1 to t1;

-- rename several tables at once
rename table tb_test2 to t2, t1 to tt1;

-- move a table to another database
rename table <table_name> to <db_name>.<table_name>;

-- list tables in the current database
show tables;

-- list tables in another database
show tables from <db_name>;
```
Set column default values
```sql
-- set a default value for a column
create table tb_test3(
    id   Int8,
    name String comment 'user name',
    role String comment 'role' default 'VIP'
) engine = Log;
```
Update and delete data
Only data in MergeTree-engine tables can be modified.
1) You can drop a partition and re-import it.
2) You can delete and update data by condition: alter table ... delete/update ... where ...
```sql
-- drop a partition's data
alter table <table_name> drop partition 'SH';

-- delete by condition
alter table <table_name> delete where id = 3;  -- a condition is required

-- update by condition
alter table <table_name> update name = 'Li sisi' where id = 3;
```
Partition operations
```sql
-- create a partitioned table
create table test_partition1(
    id    String,
    ctime DateTime
) engine = MergeTree()
partition by toYYYYMM(ctime)
order by (id);

-- view the CREATE statement
show create table test_partition1;

-- insert data
insert into test_partition1 values (1, now()), (2, '2021-06-11 11:12:13');

-- view the partitions of a table
-- ClickHouse ships many built-in `system` tables for querying its own state;
-- the `parts` system table records the partition information of data tables.
SELECT
    name,      -- part directory
    table,     -- table
    partition  -- partition
FROM system.parts
WHERE table = 'test_partition1';

-- drop a partition (this deletes all data in the partition)
alter table test_partition1 drop partition '202109';

-- copy a partition
-- copying a table structure: create table tb_y as tb_x;
-- Note that not all tables can copy partitions to each other; two preconditions must hold:
--   * the two tables have the same partition key
--   * their table structures are exactly the same
ALTER TABLE B REPLACE PARTITION <partition> FROM A
-- copy partition '202106' of test_partition1 into test_partition2
alter table test_partition2 replace partition '202106' from test_partition1;

-- reset a column in a partition (Note: primary-key and partition columns cannot be reset)
alter table test_rep clear column name in partition '202105';

-- detach a partition
alter table <table_name> detach partition '<partition>';

-- attach a partition
alter table <table_name> attach partition '<partition>';

-- Remember: once a partition is moved to the detached subdirectory, it is out of
-- ClickHouse's management; ClickHouse will not clean those files up. They stay there
-- until you delete them yourself or reload them with an ATTACH statement.
```
Query syntax
WITH (CTE)
```sql
-- ClickHouse supports CTEs (Common Table Expressions) to improve the expressiveness of queries
SELECT pow(2, 2);  -- power function
┌─pow(2, 2)─┐
│         4 │
└───────────┘

SELECT pow(pow(2, 2), 2);
┌─pow(pow(2, 2), 2)─┐
│                16 │
└───────────────────┘

-- using a CTE greatly improves readability and maintainability
with pow(2, 2) as a select pow(a, 2);

-- with syntax: with (...) as x
-- create a table
CREATE TABLE t_user(
    `name`  String,
    `aihao` String
) ENGINE = Memory;
insert into t_user values ('zss','Travel');
insert into t_user values ('lss','Play basketball');
insert into t_user values ('ww','binge-watching');

-- the subquery must return a single row
with (select name from t_user where name = 'zss') as x
select * from t_user where name in (x);
```
ARRAY JOIN (explode)
The ARRAY JOIN clause lets you JOIN against array or nested-type columns inside a table, expanding a row containing an array into multiple rows. It is similar to Hive's explode function.
```sql
-- create the table
create table t_array_join(
    id   UInt8,
    name Array(String)
) engine = Memory();

-- insert data
insert into t_array_join values (1, ['zs','ls','ww','zl']);
insert into t_array_join values (2, ['ny','yf','mq','zq']);

-- query: ARRAY JOIN explodes the array
select * from t_array_join;
select id, names from t_array_join array join name as names;

-- result
1,zs
1,ls
1,ww
1,zl
2,ny
2,yf
2,mq
2,zq
```
Join queries
Join types: join, left join, right join.
Join strictness:
- all: keep all matching rows;
- any: join any one matching row;
- asof: join with an additional closest-match condition.
If strictness is not written, the default comes from the join_default_strictness setting (ALL out of the box). A sketch follows below.
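A short sketch of the strictness variants (tables and values invented; the ASOF join pairs each left row with the closest matching right row):

```sql
create table orders(id UInt8, t DateTime) engine = Memory;
create table visits(id UInt8, t DateTime, page String) engine = Memory;
insert into orders values (1, '2021-01-01 10:00:05');
insert into visits values
    (1, '2021-01-01 10:00:00', 'home'),
    (1, '2021-01-01 10:00:03', 'cart');

SELECT * FROM orders ANY LEFT JOIN visits USING (id);  -- at most one match per left row
SELECT * FROM orders ALL LEFT JOIN visits USING (id);  -- every match per left row

-- ASOF: an equality on id plus one inequality, which must come last
SELECT * FROM orders ASOF JOIN visits
ON orders.id = visits.id AND orders.t >= visits.t;     -- pairs with the 10:00:03 visit
```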
WITH modifiers (CUBE / ROLLUP / TOTALS), usable in projects
Pre-computes aggregates for combinations of the grouping dimensions. With dimensions a, b, c, each dimension is either included (1) or not (0), giving 2^n possible combinations.
- with cube: all combinations;
- with rollup: cannot skip intermediate fields (only prefix combinations);
- with totals: either all dimensions or none (just a grand-total row).
```sql
-- create the table
drop table if exists tb_with;
create table tb_with(
    id       UInt8,
    vist     UInt8,
    province String,
    city     String,
    area     String
) engine = MergeTree()
order by id;

-- insert data
insert into tb_with values (1,12,'Shandong','Jinan','Calendar');
insert into tb_with values (2,12,'Shandong','Jinan','Calendar');
insert into tb_with values (3,12,'Shandong','Jinan','footbridge');
insert into tb_with values (4,12,'Shandong','Jinan','footbridge');
insert into tb_with values (5,88,'Shandong','Qingdao','Huangdao');
insert into tb_with values (6,88,'Shandong','Qingdao','Huangdao');
insert into tb_with values (7,12,'Shanxi','Taiyuan','Small shop');
insert into tb_with values (8,12,'Shanxi','Taiyuan','Small shop');
insert into tb_with values (9,112,'Shanxi','Taiyuan','Pointed lawn');

-- with cube
SELECT province, city, area
FROM tb_with
group by province, city, area with CUBE;

-- query result
province|city   |area        |
--------+-------+------------+
Shandong|Qingdao|Huangdao    |
Shandong|Jinan  |footbridge  |
Shanxi  |Taiyuan|Pointed lawn|
Shandong|Jinan  |Calendar    |
Shanxi  |Taiyuan|Small shop  |
Shandong|Qingdao|            |
Shandong|Jinan  |            |
Shanxi  |Taiyuan|            |
Shandong|       |Calendar    |
Shandong|       |footbridge  |
Shanxi  |       |Pointed lawn|
Shanxi  |       |Small shop  |
Shandong|       |Huangdao    |
Shanxi  |       |            |
Shandong|       |            |
        |Jinan  |Calendar    |
        |Jinan  |footbridge  |
        |Taiyuan|Pointed lawn|
        |Qingdao|Huangdao    |
        |Taiyuan|Small shop  |
        |Qingdao|            |
        |Jinan  |            |
        |Taiyuan|            |
        |       |footbridge  |
        |       |Small shop  |
        |       |Huangdao    |
        |       |Calendar    |
        |       |Pointed lawn|
        |       |            |

-- with rollup
SELECT province, city, area
FROM tb_with
group by province, city, area with rollup;

-- query result
province|city   |area        |
--------+-------+------------+
Shandong|Qingdao|Huangdao    |
Shandong|Jinan  |footbridge  |
Shanxi  |Taiyuan|Pointed lawn|
Shandong|Jinan  |Calendar    |
Shanxi  |Taiyuan|Small shop  |
Shandong|Qingdao|            |
Shandong|Jinan  |            |
Shanxi  |Taiyuan|            |
Shanxi  |       |            |
Shandong|       |            |
        |       |            |

-- with totals
SELECT province, city, area
FROM tb_with
group by province, city, area with totals;

province|city   |area        |
--------+-------+------------+
Shandong|Qingdao|Huangdao    |
Shandong|Jinan  |footbridge  |
Shanxi  |Taiyuan|Pointed lawn|
Shandong|Jinan  |Calendar    |
Shanxi  |Taiyuan|Small shop  |
        |       |            |
```
Functions
User behavior analysis function: sequenceMatch()
```sql
-- sequenceMatch checks whether events occurred in a given order
SELECT sequenceMatch('(?1)(?2)')(eventTime, eventid = 'event1', eventid = 'event2')
FROM funnel_test;

-- (?N)               matches the condition parameter at position N
-- .*                 matches any number of events; needs no condition parameter
-- (?1)(?t>1800)(?2)  matches event1 followed by event2 more than 1800 seconds later;
--                    the >=, >, <, <=, == operators can be used in (?t ...)

-- data
CREATE TABLE funnel_test (
    uid       String,
    eventid   String,
    eventTime UInt64
) ENGINE = MergeTree
PARTITION BY (uid, eventTime)
ORDER BY (uid, eventTime)
SETTINGS index_granularity = 8192;

insert into funnel_test values
    ('uid1','event1',1551398404),
    ('uid1','event2',1551398406),
    ('uid1','event3',1551398408),
    ('uid2','event2',1551398412),
    ('uid2','event4',1551398415),
    ('uid3','event3',1551398410),
    ('uid3','event4',1551398413),
    ('uid4','event2',1551398410),
    ('uid4','event4',1551398413);

SELECT uid,
       sequenceMatch('(?1)(?2)')(eventTime, eventid = 'event1', eventid = 'event2')
FROM funnel_test
group by uid;

-- returns 1 on success, 0 on failure
┌─uid──┬─sequenceMatch('(?1)(?2)')(eventTime, equals(eventid, 'event1'), equals(eventid, 'event2'))─┐
│ uid3 │ 0 │
│ uid1 │ 1 │
│ uid4 │ 0 │
│ uid2 │ 0 │
└──────┴───┘
```
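The (?t ...) time condition described in the comments can be tried on the same data; only uid1 has event2 within 2 seconds of event1:

```sql
SELECT uid,
       sequenceMatch('(?1)(?t<=2)(?2)')(eventTime, eventid = 'event1', eventid = 'event2') AS hit
FROM funnel_test
GROUP BY uid;  -- hit = 1 only for uid1 (1551398406 - 1551398404 = 2 seconds)
```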
Funnel function: windowFunnel()
```sql
-- windowFunnel returns how many consecutive conditions of the chain were satisfied
-- (same data as above)
SELECT uid,
       windowFunnel(4)(toDateTime(eventTime), eventid = 'event2', eventid = 'event3') as funnel
from funnel_test
group by uid;

-- event3 must occur no more than 4 seconds after event2
┌─uid──┬─funnel─┐
│ uid3 │      0 │
│ uid1 │      2 │
│ uid4 │      1 │
│ uid2 │      2 │
└──────┴────────┘
```
8, Distributed
Configure remote connections
```shell
# configuration file
vi /etc/clickhouse-server/config.xml
# 1) open remote connections: uncomment (or copy in) the line
#    <listen_host>::</listen_host>
```
Configure ZooKeeper
Configuration file: vi /etc/clickhouse-server/config.xml

```xml
<zookeeper>
    <node index="1">
        <host>doit01</host>
        <port>2181</port>
    </node>
    <node index="2">
        <host>doit02</host>
        <port>2181</port>
    </node>
    <node index="3">
        <host>doit03</host>
        <port>2181</port>
    </node>
</zookeeper>
```
Start the ZooKeeper service before starting ClickHouse.
Distribute the configuration file and restart the ClickHouse service:

```shell
scp config.xml doit02:$PWD
sudo /etc/init.d/clickhouse-server restart
```
Create a replicated table with one shard (three replicas)
```sql
-- doit01
create table tb_demo1 (
    id   Int8,
    name String
) engine = ReplicatedMergeTree('/clickhouse/tables/01/tb_demo1', 'doit01')
order by id;

-- doit02
create table tb_demo1 (
    id   Int8,
    name String
) engine = ReplicatedMergeTree('/clickhouse/tables/01/tb_demo1', 'doit02')
order by id;

-- doit03
create table tb_demo1 (
    id   Int8,
    name String
) engine = ReplicatedMergeTree('/clickhouse/tables/01/tb_demo1', 'doit03')
order by id;
```
Create a replicated table with two shards
```sql
-- doit01 (shard 01)
create table tb_demo2 (
    id   Int8,
    name String
) engine = ReplicatedMergeTree('/clickhouse/tables/01/tb_demo2', 'doit01')
order by id;

-- doit02 (shard 01)
create table tb_demo2 (
    id   Int8,
    name String
) engine = ReplicatedMergeTree('/clickhouse/tables/01/tb_demo2', 'doit02')
order by id;

-- doit03 (shard 02)
create table tb_demo2 (
    id   Int8,
    name String
) engine = ReplicatedMergeTree('/clickhouse/tables/02/tb_demo2', 'doit03')
order by id;
```
Distributed tables
Configure cluster
In config.xml, search for: /<remote_servers>
```xml
<remote_servers>
    <!-- copy in the following -->
    <cluster01>
        <!-- every table in the cluster has one shard -->
        <shard>
            <!-- with three replicas -->
            <replica>
                <host>doit01</host>
                <port>9000</port>
            </replica>
            <replica>
                <host>doit02</host>
                <port>9000</port>
            </replica>
            <replica>
                <host>doit03</host>
                <port>9000</port>
            </replica>
        </shard>
    </cluster01>
</remote_servers>
```
Distribute the configuration file, then restart the ClickHouse service.
Create distributed tables
```sql
-- Running this SQL on any one server creates identical tables on doit01/02/03.
-- 1) create the local table on every node in the cluster
create table tb_demo00 on cluster cluster01(
    id   Int8,
    name String
) engine = MergeTree()
order by id;

-- 2) create the distributed table that points at the local tables
-- Distributed() engine parameters:
--   1: cluster name   2: database (default)   3: local table pointed to   4: sharding key
create table demo00_all on cluster cluster01
engine = Distributed('cluster01', 'default', 'tb_demo00', id)
as tb_demo00;

-- inserting into the distributed table synchronizes the rows to the local tables on 01/02/03
insert into demo00_all values (1,'lny');
insert into demo00_all values (2,'yg');
```