The storage formats of Hive tables; using the ORC format

There are several storage formats for the source files of a Hive table:

  1. TEXTFILE
  The default format, used when no format is specified at table creation. When importing data, the data files are copied directly to HDFS without any processing. Source files can be viewed directly with hadoop fs -cat.
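As a minimal sketch (the table name and file path here are hypothetical, not from the original text), a TEXTFILE table can be created and loaded directly:

```sql
-- TEXTFILE is the default when no STORED AS clause is given;
-- it is written explicitly here for clarity
CREATE TABLE test_text (str STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Plain text files can be loaded directly, with no conversion step
LOAD DATA LOCAL INPATH '/tmp/test.txt' INTO TABLE test_text;
```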
  2. SEQUENCEFILE
    A binary file format provided by the Hadoop API; it is easy to use, splittable, and compressible.
    SEQUENCEFILE serializes data into files as <key, value> pairs, using Hadoop's standard Writable interface for serialization and deserialization. Hive leaves the key empty and stores the actual row in the value, which avoids the sorting overhead in the map stage.
    Three compression options are available: NONE, RECORD, and BLOCK. RECORD compression yields a low compression ratio, so BLOCK compression is generally recommended. Set the following parameters when using it:
        SET hive.exec.compress.output=true;
        SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK
        create table test2(str STRING)  STORED AS SEQUENCEFILE; 

  3. RCFILE
    A storage format combining row and column layouts. First, it partitions the data into row groups, guaranteeing that a complete record stays within one block, so reading a record never requires reading more than one block. Second, within each block the data is stored column by column, which aids compression and fast column access.
      In theory it offers high query efficiency, but the Hive developers noted that the benefit is not obvious in practice (it only saves about 10% of storage space), so it saw little adoption.
      RCFile combines the fast-query characteristic of row storage with the space savings of column storage:
        1) All fields of a row are located on the same node, so the cost of tuple reconstruction is very low.
        2) Within a block, columns are stored contiguously, so column data can be compressed and unneeded columns can be skipped.
      During a query, columns that are not needed are skipped at the I/O level. In practice, the map stage still copies the whole block from the remote node to the local directory; columns are not truly skipped outright, but by scanning the header definition of each row group.
      However, the header at the HDFS block level does not record in which row group each column starts and ends, so RCFile does not perform as well as SequenceFile when all columns are read.

  4. ORC: a newer format provided by Hive, an upgraded version of RCFILE.

  5. Custom format: data files whose format Hive cannot recognize out of the box. Support is added by implementing a custom InputFormat and OutputFormat.
  Reference code: hive-0.8.1/src/contrib/src/java/org/apache/hadoop/hive/contrib/fileformat/Base64

  See http://www.cnblogs.com/ggjucheng/archive/2013/01/03/2843318.html
Be careful:
  Only TEXTFILE tables can be loaded with data directly. Loading local files with load data, and pointing an external table at an HDFS path, work only for TEXTFILE tables.
  Furthermore, the compressed file formats Hive supports by default (Hadoop's default compression formats) can only be read directly from TEXTFILE tables; other storage formats cannot read them. Such data can be loaded into tables of other formats with an insert from a TEXTFILE table.

  In other words, SequenceFile and RCFile tables cannot be loaded with data directly. The data must first be imported into a TEXTFILE table and then moved into the SequenceFile or RCFile table with an insert ... select from the TEXTFILE table.
  The source files of SequenceFile and RCFile tables cannot be viewed directly, but can be queried with select in Hive. RCFile source files can be viewed with hive --service rcfilecat /XXXXXXXXXXXXXXXXXXXXXXX/000000_0, though the output format is garbled.
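The two-step load described above can be sketched as follows (the staging table name and file path are hypothetical; test2 is the SEQUENCEFILE table created earlier):

```sql
-- Stage the raw text in a TEXTFILE table first
CREATE TABLE test_stage (str STRING) STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/tmp/test.txt' INTO TABLE test_stage;

-- Then copy into the SEQUENCEFILE table; the data is converted on write
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
INSERT OVERWRITE TABLE test2 SELECT * FROM test_stage;
```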

  For the compressed file formats Hive supports by default, see http://blog.csdn.net/longshenlmj/article/details/50550580

ORC format

   ORC is an upgraded version of RCFile with greatly improved performance.
    Data can be stored compressed; the compression ratio is similar to LZO, and compressing text files can save about 70% of the space. Read performance is also very high, enabling efficient queries.
    For details, see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
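ORC's compression codec can also be chosen per table via TBLPROPERTIES; a minimal sketch (the table name here is hypothetical; ZLIB is the default codec in stock Hive):

```sql
CREATE TABLE test_orc_snappy (str STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');  -- NONE / ZLIB / SNAPPY
```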

The CREATE TABLE statements are shown below. In each, the NULL representation in the ORC table is changed from the default \N to ''.

Method one

create table if not exists test_orc(
  advertiser_id string,
  ad_plan_id string,
  cnt BIGINT
) partitioned by (day string, type TINYINT COMMENT '0 as bid, 1 as win, 2 as ck', hour TINYINT)
STORED AS ORC;

alter table test_orc set serdeproperties('serialization.null.format' = '');


-- View the result
hive> show create table test_orc;
CREATE  TABLE `test_orc`(
  `advertiser_id` string, 
  `ad_plan_id` string, 
  `cnt` bigint)
PARTITIONED BY ( 
  `day` string, 
  `type` tinyint COMMENT '0 as bid, 1 as win, 2 as ck', 
  `hour` tinyint)
ROW FORMAT DELIMITED 
  NULL DEFINED AS '' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://namenode/hivedata/warehouse/pmp.db/test_orc'
TBLPROPERTIES (
  'last_modified_by'='pmp_bi', 
  'last_modified_time'='1465992624', 
  'transient_lastDdlTime'='1465992624')
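Since an ORC table cannot be loaded directly, rows are inserted from another table. A hedged sketch using dynamic partitions (the source table src_log is hypothetical, standing in for a TEXTFILE staging table with matching columns):

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Partition columns come last in the SELECT list
INSERT OVERWRITE TABLE test_orc PARTITION (day, type, hour)
SELECT advertiser_id, ad_plan_id, cnt, day, type, hour
FROM src_log;  -- src_log is a hypothetical TEXTFILE staging table
```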

Method two

drop table test_orc;
create table if not exists test_orc(
  advertiser_id string,
  ad_plan_id string,
  cnt BIGINT
) partitioned by (day string, type TINYINT COMMENT '0 as bid, 1 as win, 2 as ck', hour TINYINT)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
with serdeproperties('serialization.null.format' = '')
STORED AS ORC;


-- View the result
hive> show create table test_orc;
CREATE  TABLE `test_orc`(
  `advertiser_id` string, 
  `ad_plan_id` string, 
  `cnt` bigint)
PARTITIONED BY ( 
  `day` string, 
  `type` tinyint COMMENT '0 as bid, 1 as win, 2 as ck', 
  `hour` tinyint)
ROW FORMAT DELIMITED 
  NULL DEFINED AS '' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://namenode/hivedata/warehouse/pmp.db/test_orc'
TBLPROPERTIES (
  'transient_lastDdlTime'='1465992726')

Method three

drop table test_orc;
create table if not exists test_orc(
  advertiser_id string,
  ad_plan_id string,
  cnt BIGINT
) partitioned by (day string, type TINYINT COMMENT '0 as bid, 1 as win, 2 as ck', hour TINYINT)
ROW FORMAT DELIMITED 
  NULL DEFINED AS '' 
STORED AS ORC;

-- View the result
hive> show create table test_orc;
CREATE  TABLE `test_orc`(
  `advertiser_id` string, 
  `ad_plan_id` string, 
  `cnt` bigint)
PARTITIONED BY ( 
  `day` string, 
  `type` tinyint COMMENT '0 as bid, 1 as win, 2 as ck', 
  `hour` tinyint)
ROW FORMAT DELIMITED 
  NULL DEFINED AS '' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://namenode/hivedata/warehouse/pmp.db/test_orc'
TBLPROPERTIES (
  'transient_lastDdlTime'='1465992916')


Added by poppy28 on Sun, 12 May 2019 10:09:32 +0300