Hive configuration update and delete data operations

Hive supports transaction and row level updates from version 0.14, but it is not supported by default. Some additional configurations are required. To support row level insert, update and delete, hive support transactions need to be configured.

1, Hive has the use scenario of ACID semantic transaction
1. Stream data.
Many users use tools such as Apache Flume, Apache Storm or Apache Kafka to pour streaming data into Hadoop clusters. When these tools write at hundreds of rows per second, Hive may only add one partition every 15 minutes to an hour, because adding partitions too often will soon make the number of partitions in a table difficult to maintain. Moreover, these tools may also write data to existing partitions, but this will produce dirty reads (data written after the query start time point may be read), and leave a large number of small files in the directory where these partitions are located, which will put pressure on NameNode. In this usage scenario, transaction support can obtain a consistent view of data and avoid generating too many files.

2. Slow change dimension.
In a typical star schema data warehouse, the dimension table changes slowly with time. For example, a retailer opens a new store and needs to add the new store data to the store table, or the business area of an existing store or other features that need to be tracked have changed. These changes can cause individual records to be inserted or modified. Since version 0.14, Hive supports row level updates.

3. Data restatement.
Sometimes errors are found in the data set and need to be corrected. Or the current data is only an approximate value (for example, only 90% of all data will lag). Or business rules may need to restate specific transactions based on subsequent transactions (for example, a customer who purchases some goods and then purchases a membership can enjoy a discounted price, including previously purchased goods). Or a customer may request to DELETE their customer data after terminating the partnership according to the contract. Starting from Hive 0.14, these usage scenarios can be supported through INSERT, UPDATE and DELETE.

2, Configure Hive support transactions (Hive version 2.0)

<property>
    <name>hive.support.concurrency</name>
    <value>true</value>
</property>
<property>
    <name>hive.exec.dynamic.partition.mode</name>
    <value>nonstrict</value>
</property>
<property>
    <name>hive.txn.manager</name>
    <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<property>
    <name>hive.compactor.initiator.on</name>
    <value>true</value>
</property>
<property>
    <name>hive.compactor.worker.threads</name>
    <value>1</value>
</property>

Command operation

-- Create a non partitioned table and load data
CREATE TABLE t1 (id INT, name STRING, cty STRING, st STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';  
LOAD DATA LOCAL INPATH '/home/grid/a.txt' INTO TABLE t1;  
SELECT * FROM t1;

-- Create external partition transaction table and load data
CREATE EXTERNAL TABLE t2 (id INT, name STRING) PARTITIONED BY (country STRING, state STRING)
CLUSTERED BY (id) INTO 8 BUCKETS
STORED AS ORC TBLPROPERTIES ('transactional'='true');
INSERT INTO T2 PARTITION (country, state) SELECT * FROM T1;
SELECT * FROM t2;

-- Modify data
INSERT INTO TABLE t2 PARTITION (country, state) VALUES (5,'Liu','DD','DD');
UPDATE t2 SET name='Zhang' WHERE id=1;
DELETE FROM t2 WHERE name='Li Si';
SELECT * FROM t2;

Three problems needing attention:

1. Only internal tables can be created

2. It needs to be stored in ORC format

STORED AS ORC 

3. When creating a table, you need to add transaction attributes

TBLPROPERTIES ('transactional'='true'); 

4. Import data can only be inserted through insert [into/overwrite] table tab2 select * from tab1

reference resources:

Keywords: Java Hadoop hive

Added by ibo on Wed, 02 Mar 2022 10:11:06 +0200