hash and range partition of kudu

preface

1. The partition table supports hash partition and range partition. Tables are divided into tables according to the partition mode on the primary key column. Each tablet is provided by at least one tablet server. Ideally, a table is divided into multiple tables and distributed among different table servers to maximize parallel operations.
2. Kudu currently has no mechanism to split or merge tables after creating tables.
3. When you create a table, you must provide a partition mode for the table.
4. When designing tables, you can use primary keys to divide tables into tables that grow at the same rate.
5. You can partition the table using Impala's PARTITION BY keyword, which supports RANGE or HASH distribution. The partition scheme can contain zero or more HASH definitions, followed by an optional RANGE definition. A RANGE definition can reference one or more primary key columns

1, Division by range


Advantage: allows you to split the table based on a specific value or range of values for the selected partition key. This balances parallel write and scan efficiency
Disadvantages: if you partition by range on a column whose value is monotonically increasing, the growth of the last table will be much greater than that of others. In addition, all inserted data will be written to a single table at one time, which limits the scalability of data ingestion
example:

create table apex_report.can_ci_che_dui_detail 
(
  order_no string,            
  certification_item string,  
  order_time string,          
  bill_date string,           
  goods_code string,          
  batch string,               
  mold string,                
  ssc_region_id string,       
  ssc_region_name string,     
  ssc_trade_id string,        
  ssc_trade_name string,      
  sfc_center_node string,     
  sfc_center_name_node string,
  ssc_center_node string,     
  ssc_center_name_node string,
  order_type string,          
  install_way string,         
  install_way_name string,    
  vehicle_lisence string,     
  vehicle_name string,        
  car_number string,          
  source_sn string,           
  model_source string,        
  model_code string,          
  channel_code string,        
  channel_name string,        
  product string,             
  big_product string,         
  product_desc string,        
  sp_code string,             
  center_name string,         
  bill_type string,           
  bill_cnt double,            
  bill_cnt_ck double,         
  LAST_UPDATE_TIME string,    
  PRIMARY KEY (order_no,certification_item,order_time,bill_date,goods_code,batch)
)
PARTITION BY HASH (order_no) PARTITIONS 4
STORED AS KUDU
TBLPROPERTIES ('kudu.master_addresses'='10.138.xxx.xx:7051,10.138.xxx.xx:7051,10.138.xxx.xx:7051', 'kudu.table_name'='APEX_REPORT.CAN_CI_CHE_DUI_DETAIL');

2, PARTITION BY HASH

Advantage: data is evenly distributed between data buckets
Disadvantages: the value query may need to read all the tablet s, that is, three customized ones
example:

create table apex_report.can_ci_che_dui_detail
(
  order_no string,                 
  certification_item string,       
  order_time string,               
  bill_date string,                
  goods_code string,               
  batch string,                    
  mold string,                     
  ssc_region_id string,            
  ssc_region_name string,          
  ssc_trade_id string,             
  ssc_trade_name string,           
  sfc_center_node string,          
  sfc_center_name_node string,     
  ssc_center_node string,          
  ssc_center_name_node string,     
  order_type string,               
  install_way string,              
  install_way_name string,         
  vehicle_lisence string,          
  vehicle_name string,             
  car_number string,               
  source_sn string,                
  model_source string,             
  model_code string,               
  channel_code string,             
  channel_name string,             
  product string,                  
  big_product string,              
  product_desc string,             
  sp_code string,                  
  center_name string,              
  bill_type string,                
  bill_cnt double,                 
  bill_cnt_ck double,              
  LAST_UPDATE_TIME string,         
  PRIMARY KEY (order_no,certification_item,order_time,bill_date,goods_code,batch)
)
PARTITION BY RANGE (bill_date) (
PARTITION VALUES < '2022-01-01',
PARTITION '2022-01-01' <= VALUES < '2022-07-01', 
PARTITION '2022-07-01' <= VALUES < '2023-01-01', 
PARTITION '2023-01-01' <= VALUES < '2023-07-01', 
PARTITION '2023-07-01' <= VALUES < '2024-01-01', 
PARTITION '2024-01-01' <= VALUES < '2024-07-01', 
PARTITION '2024-07-01' <= VALUES < '2025-01-01', 
PARTITION '2025-01-01' <= VALUES < '2025-07-01', 
PARTITION '2025-07-01' <= VALUES 
)
STORED AS KUDU
TBLPROPERTIES ('kudu.master_addresses'='10.138.xxx.xx:7051,10.138.xxx.xx:7051,10.138.xxx.xx:7051', 'kudu.table_name'='APEX_REPORT.CAN_CI_CHE_DUI_DETAIL');

3, Advanced partition

PARTITION BY HASH and RANGE
Advantages: the data can be evenly distributed, and the specified data can be retained in each slice
example:

create table apex_report.can_ci_che_dui_detail
(
  order_no string,                 
  certification_item string,       
  order_time string,               
  bill_date string,                
  goods_code string,               
  batch string,                    
  mold string,                     
  ssc_region_id string,            
  ssc_region_name string,          
  ssc_trade_id string,             
  ssc_trade_name string,           
  sfc_center_node string,          
  sfc_center_name_node string,     
  ssc_center_node string,          
  ssc_center_name_node string,     
  order_type string,               
  install_way string,              
  install_way_name string,         
  vehicle_lisence string,          
  vehicle_name string,             
  car_number string,               
  source_sn string,                
  model_source string,             
  model_code string,               
  channel_code string,             
  channel_name string,             
  product string,                  
  big_product string,              
  product_desc string,             
  sp_code string,                  
  center_name string,              
  bill_type string,                
  bill_cnt double,                 
  bill_cnt_ck double,              
  LAST_UPDATE_TIME string,         
  PRIMARY KEY (order_no,certification_item,order_time,bill_date,goods_code,batch)
)
PARTITION BY HASH (order_no) PARTITIONS 4,
RANGE (bill_date) (
PARTITION VALUES < '2022-01-01',
PARTITION '2022-01-01' <= VALUES < '2022-07-01', 
PARTITION '2022-07-01' <= VALUES < '2023-01-01', 
PARTITION '2023-01-01' <= VALUES < '2023-07-01', 
PARTITION '2023-07-01' <= VALUES < '2024-01-01', 
PARTITION '2024-01-01' <= VALUES < '2024-07-01', 
PARTITION '2024-07-01' <= VALUES < '2025-01-01', 
PARTITION '2025-01-01' <= VALUES < '2025-07-01', 
PARTITION '2025-07-01' <= VALUES 
)
STORED AS KUDU
TBLPROPERTIES ('kudu.master_addresses'='10.138.xxx.xx:7051,10.138.xxx.xx:7051,10.138.xxx.xx:7051', 'kudu.table_name'='APEX_REPORT.CAN_CI_CHE_DUI_DETAIL');

Hash partition helps to maximize write throughput, while range partition can avoid the problem of unlimited growth of tablet s; The combination of hash partition and range partition can greatly improve kudu performance.

Keywords: Kudu

Added by Shadow Wolf on Thu, 16 Dec 2021 06:05:56 +0200