Transactional SQL in Apache Hive

Transactional Operations
in Hive
Eugene Koifman
June 2017

2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
 Motivations/Goals
 End user point of view
 Design
 Performance Improvements/Results
 Roadmap

Motivations
 Modifying existing data
– INSERT OVERWRITE TABLE Target SELECT * FROM Target WHERE …
• Delete – OK, Update - ?
• Concurrency
– Hope for the best (multiple updates)
– ZooKeeper lock manager S/X locks – restrictive
• Expensive to do repeatedly (write side)

Motivations
 Continuously adding new data to Hive in the past
– ALTER TABLE Target ADD PARTITION (dt=‘2016-06-30’)
• Lots of files – bad for performance
• Fewer files –users wait longer to see latest data
– INSERT INTO Target as SELECT FROM Staging

Merge Statement – SQL Standard 2011 (Hive 2.2)
ID State County Value
1 CA LA 19.0
2 MA Norfolk 15.0
7 MA Suffolk 50.15
16 CA Orange 9.1
ID State Value
1 20.0
7 80.0
100 NH 6.0
MERGE INTO TARGET T
USING SOURCE S ON T.ID=S.ID
WHEN MATCHED THEN
UPDATE SET T.Value=S.Value
WHEN NOT MATCHED
INSERT (ID,State,Value)
VALUES(S.ID, S.State, S.Value)
1 CA LA 20.0
2 MA Norfolk 15.0
7 MA Suffolk 80.0
16 CA Orange 9.1
100 NH null 6.0

Goals
 Make above use cases easy and efficient
 Key Requirement
– Long running analytics queries should run concurrently with update commands
 NOT OLTP!!!
– Support slowly changing tables
– Not for 100s of concurrent queries trying to update the same partition

System at High Level
 A new type of table that supports Insert/Update/Delete/Merge SQL operations
 Concept of ACID transaction
– Atomic, Consistent, Isolated, Durable
 Streaming Ingest API
– Write a continuous stream of events to Hive in micro batches with transactional semantics

User Point of View
 CREATE TABLE T(a int, b int) CLUSTERED BY (b) INTO 8 BUCKETS STORED AS ORC
TBLPROPERTIES ('transactional'='true');
 Not all tables support transactional semantics
 Table must be bucketed
 Table cannot be sorted
 Currently requires ORC File but anything implementing format
– AcidInputFormat/AcidOutputFormat
 autoCommit=true
 Transactions run at Snapshot Isolation
– Lock in the state of the DB as of the start of the query for the duration of the query
– Between Serializable and Repeatable Read

Design

Design
 Transaction Manager
– Begin transaction and obtain a transaction ID
 Storage layer enhanced to support MVCC architecture
– Each row is tagged with unique ROW_ID (internal)
– Multiple versions of each row to allow concurrent readers and writers
– Result of each write is stored in a new Delta file

How ACID is implemented in Hive?
 CREATE TABLE acidtbl (a INT, b STRING) CLUSTERED BY (a) INTO 1 BUCKETS STORED AS
ORC TBLPROPERTIES ('transactional'='true');
ACID Metadata Columns original_transaction_id
bucket_id
row_id
current_transaction_id
User Columns col_1:
a : INT
col_2:
b : STRING
ACID_PK

 INSERT INTO acidtbl (a,b) VALUES (100, “foo”), (200, “xyz”), (300, “bee”);
ACID_PK a b
{ 1, 0, 0 } 100 “foo”
{ 1, 0, 1 } 200 “xyz”
{ 1, 0, 2 } 300 “bee”
delta_00001_00001/bucket_0000

 UPDATE acidTbl SET b = “bar” where a = 300;
ACID_PK a b
{ 1, 0, 0 } 100 “foo”
{ 1, 0, 1 } 200 “xyz”
{ 1, 0, 2 } 300 “bee”
ACID_PK a b
{ 1, 0, 2 } 300 “bar”
delta_00001_00001/bucket_0000
delta_00002_00002/bucket_0000

 DELETE FROM acidTbl where a = 200;
ACID_PK a b
{ 1, 0, 0 } 100 “foo”
{ 1, 0, 1 } 200 “xyz”
{ 1, 0, 2 } 300 “bee”
ACID_PK a b
{ 1, 0, 2 } 300 “bar”
ACID_PK a b
{ 1, 0, 1 } null null
delta_00001_00001/bucket_0000
delta_00002_00002/bucket_0000 delta_00003_00003/bucket_0000

 SELECT * FROM acidtbl;
ACID_PK a b
{ 1, 0, 0 } 100 “foo”
{ 1, 0, 1 } 200 “xyz”
{ 1, 0, 2 } 300 “bee”
ACID_PK a b
{ 1, 0, 2 } 300 “bar”
ACID_PK a b
{ 1, 0, 1 } null null
delta_00001_00001/bucket_0000 delta_00002_00002/bucket_0000 delta_00003_00003/bucket_0000
{ 1, 0, 0 } 100 “foo” 100 “foo”{ 1, 0, 1 } 200 “xyz”{ 1, 0, 1 } null null{ 1, 0, 2 } 300 “bee”{ 1, 0, 2 } 300 “bar”
300 “bar”

Design - Compactor
 More operations = more delta files – make reads more expensive

Design - Compactor
 ALTER TABLE acidTbl COMPACT ‘MAJOR’;
ACID_PK a b
{ 1, 0, 0 } 100 “foo”
{ 1, 0, 1 } 200 “xyz”
{ 1, 0, 2 } 300 “bee”
ACID_PK a b
{ 1, 0, 2 } 300 “bar”
ACID_PK a b
{ 1, 0, 1 } null nulldelta_00001_00001/bucket_0000
delta_00002_00002/bucket_0000
delta_00003_00003/bucket_0000
ACID_PK a b
{ 1, 0, 0 } 100 “foo”
{ 1, 0, 2 } 200 “bar”
base_00003/bucket_0000

Design - Compactor
 Compactor rewrites the table in the background
– Minor compaction - merges delta files into fewer deltas
– Major compactor merges deltas with base - more expensive
– This amortizes the cost of updates and self tunes the tables
• Makes ORC more efficient - larger stripes, better compression
 Compaction can be triggered automatically or on demand
– There are various configuration options to control when the process kicks in.
– Compaction itself is a Map-Reduce job
 Key design principle is that compactor does not affect readers/writers
 Cleaner process – removes obsolete files

Design - Concurrency
 Transaction Manager
– manages transaction ID assignment
– keeps track of transaction state: open, committed, aborted
 Lock Manager
– DDL operations acquire eXclusive locks
– Read operations acquire Shared locks
– Also locks non transactional tables – different logic
• hive.txn.strict.locking.mode
 State of both persisted in Hive Metastore

Design - Concurrency
 Write Set tracking to prevent Write-Write conflicts in concurrent transactions
 Note that 2 Inserts are never in conflict since Hive does not enforce unique
constraints.
 You are allowed to read acid and non-acid tables in same query.
 You cannot write to acid and non-acid tables at the same time (multi-insert
statement)

Design - Streaming Ingest
 Allows you to continuously write events to a hive table
– Can commit periodically to make writes durable/visible
– Can also call abort to make writes since last commit/abort invisible.
– Optimized so that it can handle writing micro batches of events - every second.
• Multiple transactions are written to one file
– Only supports adding new data
 Streaming tools like NiFi, Storm and Flume rely on this API to ingest data into hive
 This API is public so it can be used directly
 Data written via Streaming API has the same transactional semantics as SQL side

Merge Statement – SQL Standard 2011 (Hive 2.2)
1 CA LA 19.0
2 MA Norfolk 15.0
7 MA Suffolk 50.15
16 CA Orange 9.1
ID State Value
1 20.0
7 80.0
100 NH 6.0
MERGE INTO TARGET T
USING SOURCE S ON T.ID=S.ID
WHEN MATCHED THEN
UPDATE SET T.Value=S.Value
WHEN NOT MATCHED
INSERT (ID,State,Value)
VALUES(S.ID, S.State, S.Value)
1 CA LA 20.0
2 MA Norfolk 15.0
7 MA Suffolk 80.0
16 CA Orange 9.1
100 NH null 6.0

SQL Merge
Target
Source
ACID_PK ID Stat
e
Count
y
Value
{ 1, 0, 1 } 1 CA LA 20.0
{ 1, 0, 3 } 7 MA Suffolk 80.0
ACID_PK ID State Coun
ty
Value
{ 2, 0, 1 } 100 NH 6.0
delta_00002_00002/bucket_0000
delta_00002_00002_001/bucket_0000
Right Outer Join
ON T.ID=S.ID

W/o MERGE – much less efficient
 UPDATE Target set Value= 20.0 where ID = 1;
 UPDATE Target set Value = 80.0 where ID = 7;
 INSERT INTO Target (ID, State, Value) VALUES(100, ‘NH’, 6.0);

Performance

Work-In-Progress
 Split an update into combination of delete and insert
 UPDATE acidTbl SET b = “bar” where a = 300;
ACID_PK a b
{ 1, 0, 0 } 100 “foo”
{ 1, 0, 1 } 200 “xyz”
{ 1, 0, 2 } 300 “bee”
ACID_PK a b
{ 1, 0, 2 } 300 “bar”
delta_00001_00001/bucket_0000
delta_00002_00002/bucket_0000
ACID_PK a b
{ 2, 0, 0 } 300 “bar”
ACID_PK a b
{ 1, 0, 2 } null null
delta_00002_00002/bucket_0000 delete_delta_00002_00002/bucket_0000
Enabled
PPD
Splits for
Delta files

Benefits
 Improved PPD
 Better Network Utilization
 Better Memory Utilization
 Full Vectorization of Reads
 Updating bucket/partition columns

Performance
 TPC-H Benchmark
– 10 node cluster at Scale Factor 1000 (1 TB of data)
– 11 delta files with 90 GB data each

Future Work
 Multi statement transactions, i.e. BEGIN TRANSACTION/COMMIT/ROLLBACK
 Performance
– Smarter Compaction
 Finer grained concurrency management/conflict detection
 Read Committed w/Lock Based scheduling
 Better Monitoring/Alerting
 LOAD DATA … support
 Optional bucketing
 SMB support – user defined sort order

Transactional SQL in Apache Hive

More Related Content

What's hot

Similar to Transactional SQL in Apache Hive

More from DataWorks Summit

Recently uploaded

Transactional SQL in Apache Hive

Editor's Notes