Doris storage file format optimization

File format

The file consists of the following parts:

- Magic code: the first 8 bytes of the file, used to identify the file format and version

- Data Region: stores the data of each column; the data here is loaded on demand, page by page

- Index Region: Doris stores the index data of all columns together in the Index Region. The data here is loaded at column granularity, so it is stored separately from the column data

- Footer:
  - FileFooterPB: the metadata of the file
  - a 4-byte checksum of the footer PB content
  - the 4-byte length of the FileFooterPB message, used to read the FileFooterPB
  - an 8-byte magic code, stored at the end of the file so that the file type can be identified in different scenarios

The data in the file is organized into pages, which are the basic unit of encoding and compression. The current page types include the following:

DataPage

There are two types of data pages: nullable and non-nullable.

The structure of a nullable data page is as follows:


                 +----------------+
                 |  value count   |
                 |----------------|
                 |  first row id  |
                 |----------------|
                 | bitmap length  |
                 |----------------|
                 |  null bitmap   |
                 |----------------|
                 |     data       |
                 |----------------|
                 |    checksum    |
                 +----------------+
The structure of a non-nullable data page is as follows:

                 +----------------+
                 |   value count  |
                 |----------------|
                 |  first row id  |
                 |----------------|
                 |     data       |
                 |----------------|
                 |    checksum    |
                 +----------------+

The meanings of each field are as follows:

value count
Indicates the number of rows in the page

first row id
The row id of the first row in the page

bitmap length
The number of bytes of the following null bitmap

null bitmap
A bitmap marking which values are null

data
Stores the encoded and compressed data

Whether the data is compressed is recorded in the header information

For data with different encodings, some field information needs to be written in the header to support parsing the data

TODO: add the header information of each encoding
checksum
Stores a page-granularity checksum covering the page header and the actual data after it
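
Under the layout described above, a nullable data page can be parsed with a short helper. The field widths here (4-byte little-endian integers for the three leading fields and the trailing checksum) are assumptions for illustration, not the actual Doris on-disk widths:

```python
import struct

def parse_nullable_data_page(buf: bytes) -> dict:
    """Parse the nullable data page layout sketched above.
    Assumes value count, first row id and bitmap length are
    little-endian uint32, and the checksum is the last 4 bytes."""
    value_count, first_row_id, bitmap_len = struct.unpack_from("<III", buf, 0)
    pos = 12
    null_bitmap = buf[pos:pos + bitmap_len]
    pos += bitmap_len
    data = buf[pos:len(buf) - 4]          # everything up to the checksum
    checksum, = struct.unpack_from("<I", buf, len(buf) - 4)
    return {
        "value_count": value_count,
        "first_row_id": first_row_id,
        "null_bitmap": null_bitmap,
        "data": data,
        "checksum": checksum,
    }
```
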

Bloom Filter Pages

For each column with a bloom filter, a bloom filter page is generated at page granularity and saved in the bloom filter pages area

Ordinal Index Page

For each column, a sparse index of row ids is built at page granularity. Each entry maps the row id of the first row of a page to that page's pointer (offset and length)
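
The lookup this index enables can be sketched as a binary search over the first row ids. The index contents below are made-up sample data, not real Doris structures:

```python
import bisect

# Hypothetical ordinal index: sorted first row ids with parallel
# page pointers (offset, length). Illustrative values only.
first_ordinals = [0, 1024, 2048, 3072]
page_pointers = [(0, 4096), (4096, 4096), (8192, 4096), (12288, 4096)]

def locate_page(row_id: int):
    """Find the page containing row_id: the last index entry whose
    first ordinal is <= row_id."""
    idx = bisect.bisect_right(first_ordinals, row_id) - 1
    return page_pointers[idx]
```
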

Short Key Index page

A sparse short key index entry is generated every N rows (configurable). The content of each entry is a mapping: short key -> row id (ordinal)

Other column indexes

The format is designed to support adding other index information later, such as bitmap indexes and spatial indexes. Only the required data needs to be written after the existing column data, and the corresponding metadata fields added to FileFooterPB

Metadata definition

SegmentFooterPB is defined as:

message ColumnPB {
    required int32 unique_id = 1;   // The column id is used instead of the column name because the plan is to support renaming columns
    optional string name = 2;   // Column name; when the name is __DORIS_DELETE_SIGN__, the column is a hidden delete column
    required string type = 3;   // Column type
    optional bool is_key = 4;   // Whether it is a key column
    optional string aggregation = 5;    // Aggregation method
    optional bool is_nullable = 6;      // Whether the column is nullable
    optional bytes default_value = 7;   // Default value
    optional int32 precision = 8;       // Precision
    optional int32 frac = 9;
    optional int32 length = 10;         // Length
    optional int32 index_length = 11;   // Index length
    optional bool is_bf_column = 12;    // Whether the column has a bloom filter
    optional bool has_bitmap_index = 15 [default=false];  // Whether the column has a bitmap index
}

// page pointer
message PagePointerPB {
    required uint64 offset = 1; // Offset of the page in the file
    required uint32 length = 2; // Size of the page
}

message MetadataPairPB {
  optional string key = 1;
  optional bytes value = 2;
}

message ColumnMetaPB {
    optional ColumnMessage encoding = 1; // Encoding of the column

    optional PagePointerPB dict_page = 2; // Dictionary page
    repeated PagePointerPB bloom_filter_pages = 3; // Bloom filter pages
    optional PagePointerPB ordinal_index_page = 4; // Ordinal (row id) index data
    optional PagePointerPB page_zone_map_page = 5; // Page-level statistics index data

    optional PagePointerPB bitmap_index_page = 6; // Bitmap index data

    optional uint64 data_footprint = 7; // Size of the data in the column
    optional uint64 index_footprint = 8; // Size of the indexes in the column
    optional uint64 raw_data_footprint = 9; // Raw column data size

    optional CompressKind compress_kind = 10; // Compression method of the column

    optional ZoneMapPB column_zone_map = 11; // File-level filtering conditions
    repeated MetadataPairPB column_meta_datas = 12;
}

message SegmentFooterPB {
    optional uint32 version = 2 [default = 1]; // For version compatibility and upgrades
    repeated ColumnPB schema = 5; // Column schema
    optional uint64 num_values = 4; // Number of rows saved in the file
    optional uint64 index_footprint = 7; // Index size
    optional uint64 data_footprint = 8; // Data size
    optional uint64 raw_data_footprint = 11; // Raw data size

    optional CompressKind compress_kind = 9 [default = COMPRESS_LZO]; // Compression method
    repeated ColumnMetaPB column_metas = 10; // Column metadata
    optional PagePointerPB key_index_page = 12; // Short key index page
}

Read-write logic

Write

The general writing process is as follows:

1. Write the magic code
2. Generate the corresponding ColumnWriter according to the schema information. Each ColumnWriter obtains its encoding information (configurable) according to its type and creates the corresponding encoder
3. Call encoder->add(value) to write data. For every K rows, a short key index entry is generated; if the current page meets certain conditions (its size exceeds 1 MB or its row count reaches K), a new page is started and the finished page is cached in memory
4. Repeat step 3 until all data is written, then flush each column's data to the file in order
5. Generate the FileFooterPB information and write it to the file
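
The page-building part of the write loop above can be sketched as follows. Class and constant names (ColumnWriter, K, PAGE_SIZE_LIMIT) are illustrative assumptions, not Doris source identifiers:

```python
PAGE_SIZE_LIMIT = 1024 * 1024   # flush a page once its size exceeds 1 MB
K = 1024                        # rows per page / short key index interval

class ColumnWriter:
    def __init__(self, encoder):
        self.encoder = encoder   # chosen according to the column's encoding
        self.buffer = []         # encoded values of the page being built
        self.size = 0
        self.pages = []          # finished pages: (first row id, page bytes)
        self.first_row_id = 0
        self.next_row_id = 0

    def add(self, value):
        encoded = self.encoder(value)
        self.buffer.append(encoded)
        self.size += len(encoded)
        self.next_row_id += 1
        # start a new page once the row count reaches K or the size limit is hit
        if len(self.buffer) >= K or self.size > PAGE_SIZE_LIMIT:
            self.flush_page()

    def flush_page(self):
        if not self.buffer:
            return
        self.pages.append((self.first_row_id, b"".join(self.buffer)))
        self.first_row_id = self.next_row_id
        self.buffer = []
        self.size = 0
```
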
Related questions:

How is the short key index generated?

Currently the sparse short key index is generated by row count: one index entry is generated every 1024 rows. The content of each entry is: short key -> ordinal
What should be stored in the ordinal index?

It stores the mapping from the first ordinal of each page to that page's pointer
What is stored in pages of different encoding types?

Dictionary compression
plain
rle
bshuf

Read

1. Read the magic code of the file and determine the file type and version
2. Read the FileFooterPB and verify its checksum
3. Read the short key index and the ordinal index data of the required columns
4. Use the start key and end key to locate the row ids to read through the short key index, then determine the row ranges to read through the ordinal index; at the same time, filter the row ranges through statistics, the bitmap index, etc.
5. Read the row data through the ordinal index according to the row ranges
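
The short key step of the read path can be sketched as two binary searches over the sparse index. The index contents below are made-up sample data:

```python
import bisect

# Hypothetical sparse short key index: one (key, ordinal) entry every N rows.
short_keys = [b"a", b"f", b"m", b"t"]
ordinals   = [0, 1024, 2048, 3072]

def row_range(start_key: bytes, end_key: bytes, num_rows: int):
    """Return a conservative [lower, upper) ordinal range that must contain
    every row with start_key <= key < end_key, assuming rows are key-sorted."""
    lo_idx = bisect.bisect_right(short_keys, start_key) - 1
    hi_idx = bisect.bisect_left(short_keys, end_key)
    lower = ordinals[max(lo_idx, 0)]
    upper = ordinals[hi_idx] if hi_idx < len(ordinals) else num_rows
    return lower, upper
```

Because the index is sparse, the result is a superset of the matching rows; the exact rows are found by scanning within the range.
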
Related questions:

How to quickly locate a row within a page?

Page data is encoded, so row-level data cannot be located directly. Different encodings need different schemes for fast row positioning within a page, which must be analyzed case by case:

- For rle encoding, skip ahead by parsing the rle headers until the rle block containing the row is reached, then decode within that block.
- For binary plain encoding, the offset of each value is stored at the end of the page, and the position of that offset array is recorded in the page header. When reading, the offset array is parsed first, so that any row's data in the block can be located quickly through its offset.
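
A minimal sketch of such a binary plain page, assuming the offset array sits right after the values and a small trailer records its position and the value count (the real on-disk layout may differ):

```python
import struct

def encode_binary_plain(values):
    """Concatenate the values, then write a little-endian uint32 offset
    array, then a trailer: (offset array position, value count)."""
    body = b"".join(values)
    offsets, pos = [], 0
    for v in values:
        offsets.append(pos)
        pos += len(v)
    offsets_start = len(body)
    offset_bytes = b"".join(struct.pack("<I", o) for o in offsets)
    return body + offset_bytes + struct.pack("<II", offsets_start, len(values))

def decode_value(page: bytes, row: int) -> bytes:
    """Random access to one row: read the trailer, parse the offset array,
    slice the value out of the body."""
    offsets_start, count = struct.unpack_from("<II", page, len(page) - 8)
    offsets = struct.unpack_from("<%dI" % count, page, offsets_start)
    end = offsets[row + 1] if row + 1 < count else offsets_start
    return page[offsets[row]:end]
```
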
How to read blocks efficiently?

Adjacent blocks can be merged and read in a single I/O. When reading, check whether the blocks are contiguous; if they are, read them at one time.

Encoding

In the existing Doris storage, string columns use plain encoding, which is inefficient. Comparison shows that in the Baidu Statistics scenario, plain encoding of string columns more than doubles the data size. Therefore, dictionary-based encoding compression is planned.
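
The idea behind dictionary encoding can be illustrated with a minimal sketch (not the actual Doris implementation): repeated strings are stored once in a dictionary, and the column data becomes small integer codes.

```python
def dict_encode(values):
    """Map each distinct string to a small integer code."""
    dictionary, codes, index = [], [], {}
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

def dict_decode(dictionary, codes):
    """Reverse the mapping to recover the original values."""
    return [dictionary[c] for c in codes]
```

The win comes from low-cardinality columns: the codes can be bit-packed far more compactly than repeated strings.
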

Compression

Implement an extensible compression framework that supports multiple compression algorithms, so that new algorithms can be added later. zstd compression is planned to be introduced.
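
Such a framework can be sketched as a codec registry keyed by name; zlib stands in for zstd here only because it ships with the Python standard library:

```python
import zlib

# Registry of (compress, decompress) pairs. New algorithms (e.g. zstd)
# register under a new name without touching the call sites.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "none": (lambda b: b, lambda b: b),
}

def compress(kind: str, data: bytes) -> bytes:
    return CODECS[kind][0](data)

def decompress(kind: str, data: bytes) -> bytes:
    return CODECS[kind][1](data)
```
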

Keywords: Data Warehouse

Added by Napper on Thu, 27 Jan 2022 06:22:27 +0200