ClickHouse in simple terms (data type)

1, Data type

There are only three basic types: value, string and time. There is no Boolean type, but you can use integer 0 or 1 instead. ClickHouse's data types are compared with those of other common storage systems:

Official website: https://clickhouse.tech/docs/zh/sql-reference/data-types/

 

1.1 value type

Numerical types are divided into integer, floating-point and fixed-point numbers

1.1.1 integer

In general, Tinyint, Smallint, int and Bigint are commonly used to refer to different value ranges of integers. ClickHouse directly uses Int8, Int16, Int32 and Int64 to refer to four types of int, and the number at the end of it just indicates the size of occupied bytes (8 bits = 1 byte). ClickHouse supports unsigned integers, represented by the prefix U. Fixed length integers, including signed or unsigned integers.

Integer range:( reach ):

Int8 - [-128 : 127]
Int16 - [-32768 : 32767]
Int32 - [-2147483648 : 2147483647]
Int64 - [-9223372036854775808 : 9223372036854775807]

Unsigned integer range (0 to) ):

UInt8 - [0 : 255]
UInt16 - [0 : 65535]
UInt32 - [0 : 4294967295]
UInt64 - [0 : 18446744073709551615]

1.1.2 floating point type

Similar to integers, ClickHouse directly uses Float32 and Float64 to represent single precision floating-point numbers and double precision floating-point numbers

There are two types of floating-point values for ClickHouse:

Float32 - float
Float64 - double
It is recommended to store data in integer form as much as possible. For example, convert a fixed precision number to an integer value, such as time in milliseconds, because floating-point calculations may cause rounding errors.
bigdata02 :) select 1 - 0.9;
SELECT 1 - 0.9
┌───────minus(1, 0.9)─┐
│ 0.09999999999999998 │
└─────────────────────┘
1 rows in set. Elapsed: 0.002 sec.

Compared with standard SQL, ClickHouse supports the following categories of floating-point numbers:

Inf - positive infinity:

bigdata02 :) select 1 / 0;
SELECT 1 / 0
┌─divide(1, 0)─┐
│          inf │
└──────────────┘
1 rows in set. Elapsed: 0.001 sec.

-Inf - negative infinity:

bigdata02 :) select -1 / 0;
SELECT -1 / 0
┌─divide(-1, 0)─┐
│          -inf │
└───────────────┘
1 rows in set. Elapsed: 0.001 sec.

NaN - non numeric:

bigdata02 :) select 0 / 0;
SELECT 0 / 0
┌─divide(0, 0)─┐
│          nan │
└──────────────┘
1 rows in set. Elapsed: 0.001 sec.

When using floating-point numbers, you should be aware that they are of limited precision. If you write values that exceed the effective accuracy to Float32 and Float64 respectively, let's see what happens. For example, write the value with 20 decimal places into Float32 and Float64 respectively, and the result will have data error:

bigdata02 :) select toFloat32('0.1234567901234567890123456789') as a,
toTypeName(a);
┌──────────a─┬─toTypeName(toFloat32('0.1234567901234567890123456789'))─┐
│ 0.12345679 │ Float32                                                 │
└────────────┴─────────────────────────────────────────────────────────┘
bigdata02 :) select toFloat64('0.1234567901234567890123456789') as a,
toTypeName(a);
┌───────────────────a─┬─toTypeName(toFloat64('0.1234567901234567890123456789'))─┐
│ 0.12345679012345678 │ Float64│
└─────────────────────┴─────────────────────────────────────────────────────────┘

It can be found that Float32 has data overflow from the 8th place after the decimal point and Float64 has data overflow from the 17th place after the decimal point.

1.1.3,Decimal

If higher precision numerical operations are required, fixed-point numbers need to be used. ClickHouse provides three precision fixed-point numbers: Decimal32, Decimal64 and Decimal128. Fixed points can be declared in two forms: Decimal32(S), Decimal64(S) and Decimal128(S). The native mode is Decimal(P, S), where:

P Represents the precision and determines the total number of digits(Integer part+Fractional part),The value range is 1~38; 
S Represents the scale, determines the decimal places, and the value range is 0~P. 
Corresponding relationship between abbreviated mode and native mode

When two fixed-point numbers with different precision are used for four operations, their decimal places S will change. When performing addition operation, S takes the maximum value. For example, in the following query, S=4 after adding toDecimal64(2,4) and toDecimal32(2,2):

select toDecimal64(2, 4) + toDecimal32(2, 2);

result:
 

┌─plus(toDecimal64(2, 4), toDecimal32(2, 2))─┐
│                                     4.0000 │
└────────────────────────────────────────────┘

Pay attention to the accuracy of Decimal's four operations of addition, subtraction, multiplication and division! To sum up:

1.1.4 Boolean

There is no separate type to store Boolean values. UInt8 type can be used, and the value is limited to 0 or 1. In addition, during the use of ClickHouse, you will also find that the comparison results are 1 or 0, rather than True or False in the usual sense

select 1 == 1;
select 1 == 2;

1.2 string

String types can be subdivided into three categories: string, FixedString and UUID

1.2.1. String type

Strings can be of any length. It can contain any byte set and empty bytes, and can be used to replace Varchar, Blob, Clob and other data types. The string is defined by string and is unlimited in length. Therefore, there is no need to declare the size when using string. It completely replaces the character types such as Varchar, Text, Clob and Blob in the traditional database. String type does not limit the character set, because it does not have this concept at all, so any encoded string can be stored in it. However, for the standardization and maintainability of the program, the use of unified coding should be followed in the same set of programs. For example, "unified maintenance of UTF-8 coding" is a good convention.

create table dylan_test02(
    id UInt8,
    name String
) engine = Memory;

1.2.2,FixedString(N)

The FixedString type is somewhat similar to the traditional Char type. When some characters have a definite length, a fixed length string can be used. A fixed length string is declared by FixedString(N), where N represents the length of the string. However, unlike Char, FixedString uses null bytes to fill the end characters, while Char usually uses spaces to fill. For example, in the following example, although the string 'abc' has only 3 bits, its length is 5, because there are 2 empty characters at the end:

SELECT toFixedString('abc', 5), LENGTH(toFixedString ('abc',5))AS LENGTH;

For a String of fixed length N, n must be a strictly positive natural number. When the server reads a String with a length less than N, the length of N bytes is reached by adding empty bytes at the end of the String. When the server reads a String with a length greater than N, an error message will be returned. Compared with String, FixedString is rarely used because it is not very convenient to use.

Summary:

A,N Is the maximum number of bytes(Byte),Character length is not,If it is UTF8 character string,Then it will occupy 3 bytes,GBK It will occupy 2 bytes. 
B,When the content is less than N,The database will automatically fill in empty bytes on the right(null byte)(Follow PGsql dissimilarity,PGsql Filled with spaces),When the content is greater than N time,Will throw an error. 
C,Empty byte after writing content,The system will not automatically cut,The query will also be output(mysql No output) 
D,FixedString(N) than String Support fewer methods

1.2.3,UUID

UUID is a common primary key type in database. It is directly regarded as a data type in ClickHouse. UUID has 32 bits in total, and its format is 00000000-0000-0000-0000- 000000000000. If a field of UUID type is not assigned when writing data, it will be filled with 0 according to the format, for example:

-- Build table
create table uuid_test(
c1 UUID,
    c2 String
) engine = Memory;
-- insert data
insert into uuid_test select generateUUIDv4(), 't1'; insert into uuid_test (c2) select 't2';
-- Query data
select * from uuid_test;

From the self execution result, you can see that the UUID not assigned in the second line is filled with 0.

1.3. Date time type

Time types are divided into DateTime, DateTime64 and Date. ClickHouse currently has no timestamp type. The highest precision of time type is seconds, that is, if you need to deal with time greater than the resolution of seconds, such as milliseconds and microseconds, you can only use UInt type.

Date:  2020-02-02
DateTime: 2020-02-02 20:20:20
DateTime64: 2020-02-02 20:20:20.335

1.3.1. Date type

Date type, stored in two bytes, represents the date value from 1970-01-01 (unsigned) to the current date. It is allowed to store the upper limit threshold constant defined from the beginning of the Unix era to the compilation stage (the current upper limit is 2106, but the final fully supported year is 2105). The minimum output is 1970-01-01.

Date type does not contain specific time information. It is only accurate to days. It supports writing in string form. It should be noted that there is no time zone information stored in the date. By default, the time zone of the server will be used when the client connects to the service.

You can do this by enabling the client command line option -- use_client_time_zone to set the time to use the client.

-- Build table
drop table if exists date_test; create table date_test(
    c1 Date
) engine = Memory;
-- insert data
insert into date_test values('2021-04-25');
--query
select c1, toTypeName(c1) from date_test;

1.3.2. DateTime type

Timestamp type. The Unix timestamp is stored in four bytes (unsigned integer type, Uint32). Allows you to store values in the same range as the Date type. The minimum value is 1970-01-01 00:00:00. Timestamp type values are accurate to seconds (no leap seconds).

DateTime type contains hour, minute and second information, accurate to seconds, and supports writing in string form.

Summary:

A,the reason being that Unsigned Therefore, it cannot support January 1, 1970(UTC/GMT Midnight)Previous time. 
B,Time zones affect input and output. Please make–use_client_time_zone To switch the time zone, it is best to use it when the server starts TZ=X To ensure the time zone
-- Build table
create table datetime_test(
    c1 DateTime
) engine = Memory;
-- insert data
insert into datetime_test values('2021-04-25 10:20:30');
--query
select c1, toTypeName(c1) from datetime_test;

1.3.3. DateTime64 type

DateTime64 can record sub seconds. It adds precision settings to DateTime, for example:

-- Build table
drop table if exists datetime64_test; create table datetime64_test(
    c1 DateTime64(2),
    c2 DateTime64(4)
) engine = Memory;
-- insert data
insert into datetime64_test (c1, c2) values('2021-04-25 10:20:30', '2021-04-25 10:20:30.333');
--query
select c1, toTypeName(c1), c2, toTypeName(c2) from datetime64_test;

1.4. Composite type

In addition to the basic data types, ClickHouse also provides four types of composite types: array, tuple, enumeration and nesting. These types are usually features that are not native to other databases. With composite types, ClickHouse's data model is more expressive.

1.4.1. Enumeration type

ClickHouse supports enumeration type, which is a data type often used when defining constants. ClickHouse provides Enum8 and Enum16 enumeration types, which are the same except for different Value ranges. Enumerations define data in the form of (String:Int) Key/Value pairs, so Enum8 and Enum16 correspond to (String:Int8) and (String:Int16) respectively, including Enum8 and Enum16 types. Enum saves the correspondence of 'string'= integer. Duplicate tables are not allowed when creating tables. Secondly, the Key/Value cannot be Null at the same time, but the Key allows empty strings. There are several points to note when defining enumeration collections. First of all, Key and Value are not allowed to be repeated, and uniqueness should be guaranteed. Secondly, the values of Key and Value cannot be Null, but the Key is allowed to be an empty string. When writing enumeration data, only the Key string part will be used

Enum8 use 'String'= Int8 Description of 
Enum16 use 'String'= Int16 Description of

Usage demonstration: create a column with enum enum8 ('a '= 1,' B '= 2,' C '= 3):

CREATE TABLE dylan_enum_test (x Enum8('a' = 1, 'b' = 2, 'c'=3)) ENGINE = TinyLog;

This res. column can only store values listed in the type definition: 'a' or 'b' or 'c'. ClickHouse throws an exception if you try to save any other values.

Insert normal data:

INSERT INTO dylan_enum_test VALUES ('a'), ('b'), ('c');

Query results:

select * from dylan_enum_test;

Insert exception data:

insert into dylan_enum_test values('test');

If you need to see the value of the corresponding row, you must convert the Enum value to an integer type.

SELECT CAST(res, 'Int8') FROM dylan_enum_test;

Some people may think that String can be used instead of enumeration. Why do you need special enumeration types? This is for performance reasons. Although the Key in the enumeration definition is of String type, the Value value of Int type will be used in all subsequent operations on enumeration (including sorting, grouping, de duplication, filtering, etc.).

1.4.2. array

T can be any type, including array type. However, multi-dimensional arrays are not recommended. ClickHouse has limited support for multi-dimensional arrays. For example, you cannot store multidimensional arrays in the MergeTree table.

You can use the array function to create an array:

array(T)

Square brackets can also be used:

[]

Create array:

SELECT array(1, 2) AS x, toTypeName(x);
SELECT [1, 2] AS x, toTypeName(x);

ClickHouse will automatically detect array elements and calculate the smallest data type to store these elements according to the elements. If there is Null or Nullable element in the element, the element type of the array will become Nullable.

If ClickHouse cannot determine the data type, it will generate an exception. This happens when you try to create an array containing both strings and numbers (SELECT array(1, 'a')).

SELECT array(1, 2, NULL) AS x, toTypeName(x);

ClickHouse throws an exception if you try to create an array of incompatible data types:

SELECT array(1, 'a');

1.4.3. tuple

Tuple type consists of 1~n elements. Different data types are allowed to be set between each element, and they are not required to be compatible with each other. Tuples also support type inference, which is still based on the principle of minimum storage cost. Like arrays, tuples can be defined in two ways, the conventional tuple(T).

The function of element type is similar to that of generic type, which can further ensure the data quality. Type checking is performed during data writing. Tuple(T1, T2, ...): Tuples, where each element has a separate type.

Example of creating tuples:

SELECT tuple(1,'a') AS x, toTypeName(x);
SELECT tuple(1,'a') AS x, toTypeName(x), x.1, x.2;

Create a table with tuple field:

drop table dylan_tuple_table;
CREATE TABLE dylan_tuple_table (t Tuple(Int8, String, Array(String), Array(Int8)))
ENGINE = TinyLog;

Insert data:

INSERT INTO dylan_tuple_table VALUES((1, 'a', ['a', 'b', 'c'], [1, 2, 3])),(tuple(11, 'A', ['A', 'B', 'C'], [11, 22, 33]));

View data:

SELECT t, t.2, t.3, t.4 FROM dylan_tuple_table;

1.4.4. Nested Type

Nested type, as the name suggests, is a nested table structure. A data table can define any number of nested type fields, but the nesting level of each field only supports one level, that is, nested types cannot continue to be used in nested tables. Nested types are also a good choice for hierarchical or associative relationships in simple scenes. Nested (name1, Type1, Name2, type2,...) A nested data structure is like a nested table. The parameters of the nested data structure - column name and type - are specified in the same way as in the CREATE query. The rows of each table can correspond to any number of rows in the nested data structure.

As follows:

drop table dylan_test_table;
CREATE TABLE dylan_nest_table (
    name String,
    age Int8,
    dept Nested(
id UInt8,
name String )
)
ENGINE = TinyLog;

Insert data:

insert into dylan_test_table values ('tt', 1, 1, 'hey'); 
insert into dylan_test_table values ('tt1', 2, (1, 'Yo'));
Insert an error message, and the core exception information is:
DB::Exception: Type mismatch in IN or VALUES section. Expected: Array(UInt8).Got: UInt64

Through this information display, the exception display requires Array instead of pure Int. Therefore, it is clear here that the essence of nested type is a multi-dimensional Array structure. A field of a nested type corresponds to an Array. There is no limit to the number of fields in the corresponding Array, but the fields need to be aligned with the number in the Array.

The correct insert statement should be:

insert into dylan_test_table values ('tt', 1, [1], ['hey']); 
insert into dylan_test_table values ('tt1', 1, [1,2,3], ['python','spark','flink']);

In fact, in Hive, there is a complex type called Struct, which is very similar to the current situation, but according to experience, it is recommended to use the Nested type as little as possible

Query:

select * from dylan_test_table;

result:

┌─name────┬─age─┬─dept.id─┬─dept.name────────────┐ 
│ tt │ 1 │ [1] │ ['hey'] │ 
│ tt1 │ 1 │ [1,2,3] │ ['python','spark','flink'] │
└─────────┴─────┴─────────┴──────────────────────┘

1.5. Other types

1.5.1. Nullable(TypeName)

To be exact, Nullable is not an independent data type. It is more like an auxiliary modifier, which needs to be used together with the basic data type. The basic data type of an object is Null, which can be similar to the value of Java Optional 8.

The "missing value" can be represented by a special tag (NULL) and can be stored with the normal value of TypeName. For example, a Nullable(Int8) type column can store Int8 type values, while rows without values store NULL.

Official website explanation:

Special marking is allowed (NULL) express«Missing value»,Can be with TypeName The normal values of the are stored together. For example, Nullable(Int8) Columns of type can store Int8 Type values, and rows without values are stored NULL. 
about TypeName,Composite data type arrays and tuples cannot be used. Composite data types can contain Nullable Type value, for example Array(Nullable(Int8)). 
Nullable Type fields cannot be included in a table index.
Unless in ClickHouse Unless otherwise specified in the server configuration, otherwise NULL Is any Nullable The default value for the type.

characteristic:

1,Nullable Can only be used with basic types;
2,Cannot be used in Array/Tuple On this compound type; 
3,Cannot be an index field[ Order by()];
4,Use with caution Nullable ,Poor write performance. Because it generates separate files.

Usage:

--Build table
create table dylan_null_test(
c1 String,
    c2 Nullable(UInt8)
) engine = TinyLog;
--insert data
insert into dylan_null_test values ('aa', null); 
insert into dylan_null_test values ('aa', 1);
--Query data
select c1, c2 from dylan_null_test;

There are two other points to note when using Nullable types:

First of all, it can only be used with basic types, not for composite types such as arrays and tuples, nor as index fields;

Secondly, Nullable types should be used with caution, including Nullable data tables, otherwise the query and write performance will slow down. Because under normal circumstances, the data of each column field will be stored in the corresponding [column] In the bin file. If a column field is decorated with Nullable type, an additional [column] will be generated Null. The bin file specifically holds its Null value. This means that double the additional file operations are required when reading and writing data.

1.5.2. Domain

Domain name types are divided into IPv4 and IPv6. In essence, they are further encapsulation of integers and strings. IPv4 type is encapsulated based on UInt32. Its specific usage is as follows:

It is divided into IPv4 and IPv6. In fact, the essence is the encapsulation of integers and strings.

IPv4 use UInt32 Storage. Such as 110.253.30.113
IPv6 use FixedString(16) Storage. Such as 2 e02:ea08:e010:4100::2

Usage:

-- Build table
drop table if exists dylan_ip_test; 
create table dylan_ip_test(
url String,
    ip IPv4
) engine = Memory;
-- insert data
insert into dylan_ip_test values ('www.baidu.com', '110.253.30.113');
-- Query data
select url, ip, toTypeName(ip) from dylan_ip_test;

Question: why not just use strings instead of IPv4 types?

1,For convenience, for example IPv4 The type supports format checking. The format is incorrect IP Data cannot be written 
2,For performance reasons, it is also based on IPv4 For example, IPv4 use UInt32 Storage, compared to String More compact, less space and faster query performance. IPv6 Type is based on FixedString(16)Encapsulated, its use method and IPv4 identical

When using Domain type, it should be noted that although it looks like String, Domain type is not a String, so it does not support implicit automatic type conversion. If you need to return the String form of IP, you need to explicitly call the IPv4NumToString or IPv6NumToString function for conversion.

 

 

 

 

 

 

 

 

 

 

 

 

Keywords: clickhouse

Added by Appletalk on Sat, 19 Feb 2022 13:53:02 +0200