# Doris storage file format optimization

## File format
The file contains:

- An 8-byte magic code at the beginning of the file, used to identify the file format and version.
- Data Region: stores the data of each column. The data here is loaded on demand, page by page.
- Index Region: Doris stores the index data of all columns together in the Index Region. The data here is loaded at column granularity, so it is stored separately from the column data.
- Footer information:
  - FileFooterPB: defines the metadata of the file
  - 4-byte checksum of the FileFooterPB content
  - 4-byte FileFooterPB message length, used to read the FileFooterPB
  - An 8-byte magic code, stored at the end of the file so that the file type can be identified in different scenarios
The data in the file is organized into pages, which are the basic units of encoding and compression. The current page types include the following:
### DataPage
There are two types of data pages: nullable and non-nullable.
The structure of a nullable data page is:

```
+----------------+
|  value count   |
|----------------|
|  first row id  |
|----------------|
| bitmap length  |
|----------------|
|  null bitmap   |
|----------------|
|      data      |
|----------------|
|    checksum    |
+----------------+
```

The structure of a non-nullable data page is:

```
+----------------+
|  value count   |
|----------------|
|  first row id  |
|----------------|
|      data      |
|----------------|
|    checksum    |
+----------------+
```
The meanings of each field are as follows:
- value count: the number of rows in the page
- first row id: the row id of the first row in the page
- bitmap length: the number of bytes of the null bitmap that follows
- null bitmap: a bitmap representing the null information
- data: the encoded and compressed data. A flag indicating whether the data is compressed needs to be written into the page header. For the different encodings, some field information also needs to be written into the header to support decoding. TODO: add the header information for the various encodings.
- checksum: the checksum at page granularity, covering the page header and the actual data after it
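As an illustration of this layout, here is a minimal sketch of serializing a nullable data page; the field widths, endianness, and use of zlib's CRC32 are assumptions for the example, not Doris's actual format:

```cpp
// Sketch: serializing a nullable data page into a byte buffer.
// Hypothetical layout: little-endian fixed-width integers, CRC32 checksum.
#include <cstdint>
#include <vector>
#include <zlib.h>  // for crc32()

static void put_u32(std::vector<uint8_t>& buf, uint32_t v) {
    for (int i = 0; i < 4; ++i) buf.push_back((v >> (8 * i)) & 0xff);  // little-endian
}

std::vector<uint8_t> serialize_nullable_page(uint32_t value_count,
                                             uint32_t first_row_id,
                                             const std::vector<uint8_t>& null_bitmap,
                                             const std::vector<uint8_t>& encoded_data) {
    std::vector<uint8_t> page;
    put_u32(page, value_count);                                         // value count
    put_u32(page, first_row_id);                                        // first row id
    put_u32(page, static_cast<uint32_t>(null_bitmap.size()));           // bitmap length
    page.insert(page.end(), null_bitmap.begin(), null_bitmap.end());    // null bitmap
    page.insert(page.end(), encoded_data.begin(), encoded_data.end());  // data
    uint32_t checksum = crc32(0L, page.data(), page.size());  // covers header + data
    put_u32(page, checksum);                                             // checksum
    return page;
}
```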
### Bloom Filter Pages

For each column with a bloom filter, a bloom filter page is generated at page granularity and saved in the bloom filter pages area.
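For illustration, a minimal bloom filter sketch showing how such a page can answer "is this value possibly in the page"; the hashing scheme and sizing are assumptions, not Doris's implementation:

```cpp
// Minimal bloom filter sketch: k bit positions per value, derived by
// double hashing. False positives possible, false negatives are not.
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

struct BloomFilter {
    std::vector<bool> bits;
    int num_hashes;
    BloomFilter(size_t nbits, int k) : bits(nbits, false), num_hashes(k) {}

    size_t slot(const std::string& v, int i) const {
        size_t h1 = std::hash<std::string>{}(v);
        size_t h2 = h1 * 0x9e3779b97f4a7c15ULL + 1;  // second hash derived from the first
        return (h1 + i * h2) % bits.size();
    }
    void insert(const std::string& v) {
        for (int i = 0; i < num_hashes; ++i) bits[slot(v, i)] = true;
    }
    bool may_contain(const std::string& v) const {
        for (int i = 0; i < num_hashes; ++i)
            if (!bits[slot(v, i)]) return false;  // definitely absent: skip the page
        return true;  // possibly present: the page must be read
    }
};
```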
### Ordinal Index Page

For each column, a sparse index over row ids is built at page granularity. Its content is the mapping from the row id of the first row in a page to that page's pointer (offset and length).
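A sketch of how such a sparse index can be queried: given a row id, a binary search over the first-row-id entries finds the page containing it. The struct names and index layout here are illustrative assumptions:

```cpp
// Sketch: locating the page that contains a given row id via the ordinal index.
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

struct PagePointer { uint64_t offset; uint32_t length; };
struct OrdinalEntry { uint64_t first_row_id; PagePointer page; };

// Entries are sorted by first_row_id. The page containing row_id is the one
// with the largest first_row_id that is <= row_id.
const PagePointer* find_page(const std::vector<OrdinalEntry>& index, uint64_t row_id) {
    auto it = std::upper_bound(index.begin(), index.end(), row_id,
        [](uint64_t id, const OrdinalEntry& e) { return id < e.first_row_id; });
    if (it == index.begin()) return nullptr;  // row_id precedes the first page
    return &std::prev(it)->page;
}
```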
### Short Key Index Page

A sparse short key index entry is generated every N rows (configurable). The content of the index is: short key -> row id (ordinal).
### Other Column Indexes

The format is designed to support adding other index information later, such as bitmap indexes and spatial indexes. The required data only needs to be written after the existing column data, and the corresponding metadata fields added to FileFooterPB.
## Metadata Definition

SegmentFooterPB is defined as:

```protobuf
message ColumnPB {
    required int32 unique_id = 1;       // the column id is used instead of the column name, because renaming columns is planned to be supported
    optional string name = 2;           // column name; when the name is __DORIS_DELETE_SIGN__, the column is a hidden delete column
    required string type = 3;           // column type
    optional bool is_key = 4;           // whether it is a key column
    optional string aggregation = 5;    // aggregation method
    optional bool is_nullable = 6;      // whether the column is nullable
    optional bytes default_value = 7;   // default value
    optional int32 precision = 8;       // precision
    optional int32 frac = 9;
    optional int32 length = 10;         // length
    optional int32 index_length = 11;   // index length
    optional bool is_bf_column = 12;    // whether the column has a bloom filter
    optional bool has_bitmap_index = 15 [default=false]; // whether the column has a bitmap index
}

// page pointer
message PagePointerPB {
    required uint64 offset; // offset of the page in the file
    required uint32 length; // size of the page
}

message MetadataPairPB {
    optional string key = 1;
    optional bytes value = 2;
}

message ColumnMetaPB {
    optional ColumnMessage encoding;            // encoding of the column
    optional PagePointerPB dict_page;           // dictionary page
    repeated PagePointerPB bloom_filter_pages;  // bloom filter pages
    optional PagePointerPB ordinal_index_page;  // ordinal index data
    optional PagePointerPB page_zone_map_page;  // page-level zone map index data
    optional PagePointerPB bitmap_index_page;   // bitmap index data
    optional uint64 data_footprint;             // size of the data in the column
    optional uint64 index_footprint;            // size of the indexes in the column
    optional uint64 raw_data_footprint;         // raw column data size
    optional CompressKind compress_kind;        // compression method of the column
    optional ZoneMapPB column_zone_map;         // file-level filtering condition
    repeated MetadataPairPB column_meta_datas;
}

message SegmentFooterPB {
    optional uint32 version = 2 [default = 1];  // for version compatibility and upgrades
    repeated ColumnPB schema = 5;               // column schema
    optional uint64 num_values = 4;             // number of rows saved in the file
    optional uint64 index_footprint = 7;        // index size
    optional uint64 data_footprint = 8;         // data size
    optional uint64 raw_data_footprint = 8;     // raw data size
    optional CompressKind compress_kind = 9 [default = COMPRESS_LZO]; // compression method
    repeated ColumnMetaPB column_metas = 10;    // column metadata
    optional PagePointerPB key_index_page;      // short key index page
}
```
## Read and Write Logic

### Write

The general write flow is as follows:

1. Write the magic code.
2. Generate a ColumnWriter for each column according to the schema. Each ColumnWriter obtains the encoding information for its type (configurable) and creates the corresponding encoder.
3. Call encoder->add(value) to write the data. A short key index entry is generated every K rows. If the current page satisfies certain conditions (its size exceeds 1MB or its row count reaches K), a new page is started and the finished page is cached in memory.
4. Repeat step 3 until writing is complete, then flush each column's data into the file in order.
5. Generate the FileFooterPB information and write it to the file.
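To make the page-rolling part of step 3 concrete, here is a minimal sketch of the append loop; the class names, thresholds, and index emission are illustrative assumptions, not Doris's actual code:

```cpp
// Sketch of the per-column append loop: encode values into a page, emit a
// short key index entry every K rows, and roll to a new page on overflow.
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

constexpr size_t kPageSizeLimit = 1 << 20;   // ~1MB page size threshold
constexpr size_t kRowsPerIndexEntry = 1024;  // K: short key index interval

struct PageBuilder {
    std::vector<uint8_t> buf;
    size_t rows = 0;
    void add(const uint8_t* v, size_t len) { buf.insert(buf.end(), v, v + len); ++rows; }
    bool full() const { return buf.size() >= kPageSizeLimit || rows >= kRowsPerIndexEntry; }
};

void append_rows(PageBuilder& page, std::vector<std::vector<uint8_t>>& finished_pages,
                 const std::vector<std::string>& values) {
    size_t row_id = 0;
    for (const auto& v : values) {
        if (row_id % kRowsPerIndexEntry == 0) {
            // emit a short key index entry: short key -> row_id (omitted here)
        }
        page.add(reinterpret_cast<const uint8_t*>(v.data()), v.size());
        if (page.full()) {  // cache the finished page, start a new one
            finished_pages.push_back(std::move(page.buf));
            page = PageBuilder{};
        }
        ++row_id;
    }
}
```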
Related questions:

- How is the short key index generated?
  Currently a sparse short key index is generated based on the row count: one entry is generated every 1024 rows. The content of each entry is: short key -> ordinal.
- What is stored in the ordinal index?
  The mapping from the first ordinal of each page to that page's pointer.
- What is stored in pages of the different encoding types?
  - dictionary compression
  - plain
  - rle
  - bshuf
### Read

1. Read the file's magic code and determine the file type and version.
2. Read the FileFooterPB and verify its checksum.
3. Read the short key index and the ordinal index of each required column.
4. Use the start key and end key to locate, through the short key index, the row ids to read, then determine the row ranges to read through the ordinal index; at the same time, prune the row ranges using statistics, bitmap indexes, and so on.
5. Read the row data for those row ranges through the ordinal index.
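A sketch of the short key lookup in step 4, assuming a sorted sparse index of (short key, ordinal) entries; the names and layout are illustrative, not Doris's API:

```cpp
// Sketch: turn [start_key, end_key) into a row id range via the sparse
// short key index. Because the index is sparse, step back one entry at the
// lower bound so rows between index entries are not missed.
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

struct ShortKeyEntry { std::string short_key; uint64_t ordinal; };

std::pair<uint64_t, uint64_t> locate_rows(const std::vector<ShortKeyEntry>& idx,
                                          const std::string& start_key,
                                          const std::string& end_key,
                                          uint64_t num_rows) {
    auto key_less = [](const ShortKeyEntry& e, const std::string& k) {
        return e.short_key < k;
    };
    auto lo = std::lower_bound(idx.begin(), idx.end(), start_key, key_less);
    uint64_t lower = (lo == idx.begin()) ? 0 : std::prev(lo)->ordinal;
    auto hi = std::lower_bound(idx.begin(), idx.end(), end_key, key_less);
    uint64_t upper = (hi == idx.end()) ? num_rows : hi->ordinal;
    return {lower, upper};  // candidate rows; exact bounds found by decoding keys
}
```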
Related questions:

- How to quickly locate a row within a page?
  The data inside a page is encoded, so a given row cannot be located directly. Different encodings need different schemes for fast row positioning inside a page, which must be analyzed case by case:
  - For RLE encoding, skip forward by parsing the RLE headers until reaching the RLE block that contains the row, then decode within that block.
  - For binary plain encoding, per-row offset information is stored at the end of the page, and the position of that offset information is recorded in the page header. When reading, the offsets are first parsed into an array, so the data of any row in the block can be located quickly through the per-row offsets (see the sketch after this list).
- How to read blocks efficiently? Consider merging adjacent blocks and reading them in a single I/O: when reading, check whether the blocks are contiguous in the file, and if they are, read them all at once.
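The binary plain case sketched, assuming the per-row offsets have already been parsed from the page into an array; the struct and layout are illustrative assumptions:

```cpp
// Sketch: O(1) row lookup in a binary plain page via a per-row offset array.
#include <cstdint>
#include <string_view>
#include <vector>

struct BinaryPlainPage {
    std::vector<uint8_t> body;      // concatenated variable-length values
    std::vector<uint32_t> offsets;  // parsed from the page: start of each row

    std::string_view row(size_t i) const {
        uint32_t begin = offsets[i];
        uint32_t end = (i + 1 < offsets.size()) ? offsets[i + 1]
                                                : static_cast<uint32_t>(body.size());
        return {reinterpret_cast<const char*>(body.data()) + begin, end - begin};
    }
};
```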
## Encoding

In the existing Doris storage, plain encoding is used for string types, which is inefficient. Comparison tests showed that, in Baidu Statistics scenarios, the data volume more than doubles because of the string encoding. Therefore, we plan to introduce dictionary-based encoding compression.
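A minimal sketch of the idea, assuming a per-page dictionary of distinct values plus fixed-width integer codes (illustrative only, not the planned implementation):

```cpp
// Sketch: dictionary encoding for strings. Repeated values are stored once in
// the dictionary; the column data becomes a stream of small integer codes.
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct DictEncoder {
    std::unordered_map<std::string, uint32_t> dict;  // value -> code
    std::vector<std::string> values;                 // code -> value (dictionary page)
    std::vector<uint32_t> codes;                     // encoded column data

    void add(const std::string& v) {
        auto [it, inserted] = dict.emplace(v, static_cast<uint32_t>(values.size()));
        if (inserted) values.push_back(v);
        codes.push_back(it->second);  // a repeated value costs only one small code
    }
};
```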
## Compression

Implement an extensible compression framework that supports multiple compression algorithms, so that new algorithms can be added easily later. We plan to introduce zstd compression.
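A sketch of what such an extensible framework could look like; the interface and names are assumptions for illustration, not Doris's actual API:

```cpp
// Sketch: a codec interface so new algorithms (e.g. zstd) plug in behind a
// common abstraction, selected per column by its CompressKind.
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

class BlockCompressor {
public:
    virtual ~BlockCompressor() = default;
    virtual std::vector<uint8_t> compress(const uint8_t* data, size_t len) const = 0;
    virtual std::vector<uint8_t> decompress(const uint8_t* data, size_t len,
                                            size_t uncompressed_len) const = 0;
};

// Adding an algorithm means implementing BlockCompressor and registering it
// under a CompressKind value; readers and writers look codecs up by kind.
std::unique_ptr<BlockCompressor> get_compressor(/* CompressKind kind */);
```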