Transactional Operations in Hive
Eugene Koifman
June 2017
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
 Motivations/Goals
 End user point of view
 Design
 Performance Improvements/Results
 Roadmap
Motivations
 Modifying existing data
– INSERT OVERWRITE TABLE Target SELECT * FROM Target WHERE …
• Delete – OK, Update - ?
• Concurrency
– Hope for the best (multiple updates)
– ZooKeeper lock manager S/X locks – restrictive
• Expensive to do repeatedly (write side)
Motivations
 Continuously adding new data to Hive in the past
– ALTER TABLE Target ADD PARTITION (dt='2016-06-30')
• Lots of files – bad for performance
• Fewer files – users wait longer to see latest data
– INSERT INTO Target SELECT * FROM Staging
Merge Statement – SQL Standard 2011 (Hive 2.2)
Target (before):
ID   State  County   Value
1    CA     LA       19.0
2    MA     Norfolk  15.0
7    MA     Suffolk  50.15
16   CA     Orange   9.1
Source:
ID   State  Value
1           20.0
7           80.0
100  NH     6.0
MERGE INTO TARGET T
USING SOURCE S ON T.ID=S.ID
WHEN MATCHED THEN
UPDATE SET T.Value=S.Value
WHEN NOT MATCHED THEN
INSERT (ID,State,Value)
VALUES(S.ID, S.State, S.Value)
Target (after):
ID   State  County   Value
1    CA     LA       20.0
2    MA     Norfolk  15.0
7    MA     Suffolk  80.0
16   CA     Orange   9.1
100  NH     null     6.0
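The effect of the MERGE statement above can be traced with a small Python model. This is only a sketch of the statement's semantics on the slide's sample rows, not a description of how Hive executes it; the `merge` helper and the dict-per-row layout are assumptions made for illustration.

```python
def merge(target, source):
    """Model of the slide's MERGE: update Value on matching IDs, insert the rest."""
    result = {row["ID"]: dict(row) for row in target}
    for s in source:
        if s["ID"] in result:
            # WHEN MATCHED THEN UPDATE SET T.Value = S.Value
            result[s["ID"]]["Value"] = s["Value"]
        else:
            # WHEN NOT MATCHED THEN INSERT (ID, State, Value); County is left NULL
            result[s["ID"]] = {"ID": s["ID"], "State": s["State"],
                               "County": None, "Value": s["Value"]}
    return [result[key] for key in sorted(result)]


target = [
    {"ID": 1, "State": "CA", "County": "LA", "Value": 19.0},
    {"ID": 2, "State": "MA", "County": "Norfolk", "Value": 15.0},
    {"ID": 7, "State": "MA", "County": "Suffolk", "Value": 50.15},
    {"ID": 16, "State": "CA", "County": "Orange", "Value": 9.1},
]
source = [
    {"ID": 1, "State": None, "Value": 20.0},
    {"ID": 7, "State": None, "Value": 80.0},
    {"ID": 100, "State": "NH", "Value": 6.0},
]

merged = merge(target, source)
for row in merged:
    print(row)
```

Rows 1 and 7 take their new values from the source, rows 2 and 16 are untouched, and row 100 is inserted with a null County.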
Goals
 Make above use cases easy and efficient
 Key Requirement
– Long running analytics queries should run concurrently with update commands
 NOT OLTP!!!
– Support slowly changing tables
– Not for 100s of concurrent queries trying to update the same partition
System at High Level
 A new type of table that supports Insert/Update/Delete/Merge SQL operations
 Concept of ACID transaction
– Atomic, Consistent, Isolated, Durable
 Streaming Ingest API
– Write a continuous stream of events to Hive in micro batches with transactional semantics
User Point of View
 CREATE TABLE T(a int, b int) CLUSTERED BY (b) INTO 8 BUCKETS STORED AS ORC
TBLPROPERTIES ('transactional'='true');
 Not all tables support transactional semantics
 Table must be bucketed
 Table cannot be sorted
 Currently requires ORC, but any file format that implements the ACID interfaces could be used
– AcidInputFormat/AcidOutputFormat
 autoCommit=true
 Transactions run at Snapshot Isolation
– Lock in the state of the DB as of the start of the query for the duration of the query
– Between Serializable and Repeatable Read
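Snapshot isolation as described above can be pictured as a visibility check fixed at query start. The following is a minimal sketch under stated assumptions: the `Snapshot` class, `take_snapshot` function, and the dict of transaction states are hypothetical names for illustration, not Hive's internal API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    high_watermark: int    # highest transaction ID allocated when the query started
    invisible: frozenset   # transactions that were open or aborted at that moment

    def can_see(self, writer_txn: int) -> bool:
        """A row version is visible only if its writer had committed by snapshot time."""
        return writer_txn <= self.high_watermark and writer_txn not in self.invisible


def take_snapshot(txn_states):
    """txn_states maps transaction ID -> 'open' | 'committed' | 'aborted'."""
    high = max(txn_states, default=0)
    hidden = frozenset(t for t, state in txn_states.items() if state != "committed")
    return Snapshot(high, hidden)


states = {1: "committed", 2: "open", 3: "aborted"}
snap = take_snapshot(states)

# Later changes do not affect the running query: its view is locked in.
states[2] = "committed"
states[4] = "committed"

visible = [t for t in (1, 2, 3, 4) if snap.can_see(t)]
print(visible)
```

Transaction 2 commits after the snapshot was taken and transaction 4 starts after it, so the running query keeps ignoring both.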
Design
Design
 Transaction Manager
– Begin transaction and obtain a transaction ID
 Storage layer enhanced to support MVCC architecture
– Each row is tagged with unique ROW_ID (internal)
– Multiple versions of each row to allow concurrent readers and writers
– Result of each write is stored in a new Delta file
How is ACID implemented in Hive?
 CREATE TABLE acidtbl (a INT, b STRING) CLUSTERED BY (a) INTO 1 BUCKETS STORED AS
ORC TBLPROPERTIES ('transactional'='true');
ACID metadata columns:
– original_transaction_id (part of ACID_PK)
– bucket_id (part of ACID_PK)
– row_id (part of ACID_PK)
– current_transaction_id
User columns:
– col_1: a INT
– col_2: b STRING
How is ACID implemented in Hive?
 INSERT INTO acidtbl (a,b) VALUES (100, 'foo'), (200, 'xyz'), (300, 'bee');
delta_00001_00001/bucket_0000
ACID_PK      a    b
{ 1, 0, 0 }  100  "foo"
{ 1, 0, 1 }  200  "xyz"
{ 1, 0, 2 }  300  "bee"
How is ACID implemented in Hive?
 UPDATE acidTbl SET b = 'bar' WHERE a = 300;
delta_00001_00001/bucket_0000
ACID_PK      a    b
{ 1, 0, 0 }  100  "foo"
{ 1, 0, 1 }  200  "xyz"
{ 1, 0, 2 }  300  "bee"
delta_00002_00002/bucket_0000
ACID_PK      a    b
{ 1, 0, 2 }  300  "bar"
How is ACID implemented in Hive?
 DELETE FROM acidTbl WHERE a = 200;
delta_00001_00001/bucket_0000
ACID_PK      a    b
{ 1, 0, 0 }  100  "foo"
{ 1, 0, 1 }  200  "xyz"
{ 1, 0, 2 }  300  "bee"
delta_00002_00002/bucket_0000
ACID_PK      a    b
{ 1, 0, 2 }  300  "bar"
delta_00003_00003/bucket_0000
ACID_PK      a    b
{ 1, 0, 1 }  null null
How is ACID implemented in Hive?
 SELECT * FROM acidtbl;
delta_00001_00001/bucket_0000
ACID_PK      a    b
{ 1, 0, 0 }  100  "foo"
{ 1, 0, 1 }  200  "xyz"
{ 1, 0, 2 }  300  "bee"
delta_00002_00002/bucket_0000
ACID_PK      a    b
{ 1, 0, 2 }  300  "bar"
delta_00003_00003/bucket_0000
ACID_PK      a    b
{ 1, 0, 1 }  null null
Rows from all deltas, merged by ACID_PK at read time:
{ 1, 0, 0 }  100  "foo"
{ 1, 0, 1 }  200  "xyz"
{ 1, 0, 1 }  null null
{ 1, 0, 2 }  300  "bee"
{ 1, 0, 2 }  300  "bar"
Query result:
a    b
100  "foo"
300  "bar"
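The read path shown on this slide can be sketched as a merge keyed on ACID_PK in which the version from the latest transaction wins and a delete hides the row. This is an illustrative model only: the `read_table` function is hypothetical, and representing a delete as a row whose user columns are both `None` simply mirrors the slide's "null null" notation (a simplification, since a real row could also hold nulls).

```python
def read_table(deltas):
    """deltas: list of (txn_id, rows); each row is (acid_pk, a, b).

    A delete is modelled as a row with a = b = None, as drawn on the slide.
    """
    latest = {}
    # Apply deltas in transaction order so later versions overwrite earlier ones.
    for _txn_id, rows in sorted(deltas, key=lambda delta: delta[0]):
        for acid_pk, a, b in rows:
            latest[acid_pk] = (a, b)
    ordered = sorted(latest.items(), key=lambda item: item[0])
    return [values for _pk, values in ordered if values != (None, None)]


deltas = [
    (1, [((1, 0, 0), 100, "foo"), ((1, 0, 1), 200, "xyz"), ((1, 0, 2), 300, "bee")]),
    (2, [((1, 0, 2), 300, "bar")]),      # UPDATE ... WHERE a = 300
    (3, [((1, 0, 1), None, None)]),      # DELETE ... WHERE a = 200
]

result = read_table(deltas)
print(result)
```

The output matches the slide: only (100, "foo") and (300, "bar") survive.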
Design - Compactor
 More operations = more delta files, which makes reads more expensive
Design - Compactor
 ALTER TABLE acidTbl COMPACT 'MAJOR';
Before compaction:
delta_00001_00001/bucket_0000
ACID_PK      a    b
{ 1, 0, 0 }  100  "foo"
{ 1, 0, 1 }  200  "xyz"
{ 1, 0, 2 }  300  "bee"
delta_00002_00002/bucket_0000
ACID_PK      a    b
{ 1, 0, 2 }  300  "bar"
delta_00003_00003/bucket_0000
ACID_PK      a    b
{ 1, 0, 1 }  null null
After compaction:
base_00003/bucket_0000
ACID_PK      a    b
{ 1, 0, 0 }  100  "foo"
{ 1, 0, 2 }  300  "bar"
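A major compaction of the three deltas above can be modelled as folding every row version into one base file named after the highest transaction it covers. The sketch below is an assumption-laden illustration: `major_compact` is a hypothetical helper, the five-digit directory name copies the slide's notation, and deletes are again represented as rows whose user columns are both `None`.

```python
def major_compact(deltas):
    """deltas: dict of txn_id -> list of (acid_pk, a, b). Returns (base_dir, rows)."""
    latest = {}
    for txn_id in sorted(deltas):
        for acid_pk, a, b in deltas[txn_id]:
            latest[acid_pk] = (a, b)     # newest version of each row wins
    # Deleted rows are dropped; surviving rows keep their original ACID_PK.
    rows = [(pk, a, b)
            for pk, (a, b) in sorted(latest.items(), key=lambda item: item[0])
            if (a, b) != (None, None)]
    base_dir = f"base_{max(deltas):05d}"
    return base_dir, rows


deltas = {
    1: [((1, 0, 0), 100, "foo"), ((1, 0, 1), 200, "xyz"), ((1, 0, 2), 300, "bee")],
    2: [((1, 0, 2), 300, "bar")],
    3: [((1, 0, 1), None, None)],
}

base_dir, base_rows = major_compact(deltas)
print(base_dir)
for row in base_rows:
    print(row)
```

Readers then need to open a single base file instead of merging three deltas.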
Design - Compactor
 Compactor rewrites the table in the background
– Minor compaction merges delta files into fewer deltas
– Major compaction merges deltas with the base, which is more expensive
– This amortizes the cost of updates and self-tunes the tables
• Makes ORC more efficient: larger stripes, better compression
 Compaction can be triggered automatically or on demand
– Various configuration options control when the process kicks in
– Compaction itself is a MapReduce job
 Key design principle is that compactor does not affect readers/writers
 Cleaner process – removes obsolete files
Design - Concurrency
 Transaction Manager
– manages transaction ID assignment
– keeps track of transaction state: open, committed, aborted
 Lock Manager
– DDL operations acquire eXclusive locks
– Read operations acquire Shared locks
– Also locks non-transactional tables, using different logic
• hive.txn.strict.locking.mode
 State of both managers is persisted in the Hive Metastore
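The shared/exclusive rule described for the Lock Manager can be illustrated with a toy in-memory version. This is a sketch only: the `LockManager` class and its methods are hypothetical, it covers just the S and X modes named on the slide, and it keeps state in a dict rather than in the Metastore.

```python
class LockManager:
    """Toy S/X lock table: shared locks coexist, an exclusive lock excludes all others."""

    def __init__(self):
        self._held = {}   # resource name -> list of (txn_id, mode)

    def try_acquire(self, txn_id, resource, mode):
        holders = self._held.setdefault(resource, [])
        for other_txn, other_mode in holders:
            # Two locks conflict when they belong to different transactions
            # and at least one of them is exclusive.
            if other_txn != txn_id and "X" in (mode, other_mode):
                return False
        holders.append((txn_id, mode))
        return True

    def release_all(self, txn_id):
        for holders in self._held.values():
            holders[:] = [h for h in holders if h[0] != txn_id]


locks = LockManager()
read_1 = locks.try_acquire(1, "acidtbl", "S")      # reader
read_2 = locks.try_acquire(2, "acidtbl", "S")      # second reader shares the lock
ddl_blocked = locks.try_acquire(3, "acidtbl", "X")  # DDL must wait for readers
locks.release_all(1)
locks.release_all(2)
ddl_granted = locks.try_acquire(3, "acidtbl", "X")
print(read_1, read_2, ddl_blocked, ddl_granted)
```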
Design - Concurrency
 Write Set tracking to prevent Write-Write conflicts in concurrent transactions
 Note that two inserts are never in conflict, since Hive does not enforce unique constraints.
 You are allowed to read ACID and non-ACID tables in the same query.
 You cannot write to ACID and non-ACID tables at the same time (multi-insert statement)
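Write-set tracking can be sketched as a commit-time check: a transaction fails if another transaction that committed after it began updated or deleted the same resource. The model below is a hedged illustration, not Hive's implementation; the `TxnManager` class, the "first committer wins" resolution, and the use of a partition name as the conflict unit are all assumptions.

```python
class TxnManager:
    """Toy write-set tracker that rejects write-write conflicts at commit time."""

    def __init__(self):
        self._next_id = 1
        self._commit_seq = 0
        self._start_seq = {}    # txn_id -> commit sequence observed at begin
        self._write_sets = {}   # txn_id -> resources updated or deleted
        self._committed = []    # (commit sequence, write set) of committed txns

    def begin(self):
        txn_id = self._next_id
        self._next_id += 1
        self._start_seq[txn_id] = self._commit_seq
        self._write_sets[txn_id] = set()
        return txn_id

    def record_write(self, txn_id, resource, operation):
        # Inserts never conflict, so only updates and deletes are tracked.
        if operation in ("update", "delete"):
            self._write_sets[txn_id].add(resource)

    def commit(self, txn_id):
        mine = self._write_sets.pop(txn_id)
        started = self._start_seq.pop(txn_id)
        for seq, theirs in self._committed:
            if seq > started and mine & theirs:
                return False    # overlapping write set: this transaction aborts
        self._commit_seq += 1
        self._committed.append((self._commit_seq, mine))
        return True


tm = TxnManager()
t1, t2 = tm.begin(), tm.begin()
tm.record_write(t1, "target/dt=2016-06-30", "update")
tm.record_write(t2, "target/dt=2016-06-30", "delete")
first, second = tm.commit(t1), tm.commit(t2)

t3, t4 = tm.begin(), tm.begin()
tm.record_write(t3, "target/dt=2016-06-30", "insert")
tm.record_write(t4, "target/dt=2016-06-30", "insert")
insert_a, insert_b = tm.commit(t3), tm.commit(t4)
print(first, second, insert_a, insert_b)
```

The concurrent update and delete collide, while the two concurrent inserts both commit.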
Design - Streaming Ingest
 Allows you to continuously write events to a Hive table
– Can commit periodically to make writes durable/visible
– Can also call abort to make writes since the last commit/abort invisible
– Optimized to handle micro batches of events, as often as every second
• Multiple transactions are written to one file
– Only supports adding new data
 Streaming tools like NiFi, Storm and Flume rely on this API to ingest data into Hive
 This API is public, so it can be used directly
 Data written via the Streaming API has the same transactional semantics as the SQL side
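The commit/abort behaviour listed above can be pictured with a small model in which many transactions append to one file and readers filter by committed transaction. This does not use or imitate the real Streaming Ingest API; `StreamingWriter` and all of its methods are hypothetical names chosen for the sketch.

```python
class StreamingWriter:
    """Toy model: events from many transactions share one file; only committed ones show."""

    def __init__(self):
        self.file = []           # (txn_id, event) pairs, append-only
        self.committed = set()   # transaction IDs whose events are visible
        self._txn = 1            # transaction currently being written

    def write(self, event):
        self.file.append((self._txn, event))

    def commit(self):
        # Make everything written since the last commit/abort visible.
        self.committed.add(self._txn)
        self._txn += 1

    def abort(self):
        # Events stay in the file, but their transaction never commits.
        self._txn += 1

    def visible_events(self):
        return [event for txn, event in self.file if txn in self.committed]


writer = StreamingWriter()
writer.write("e1")
writer.write("e2")
writer.commit()
writer.write("e3")
writer.abort()
writer.write("e4")
before_commit = writer.visible_events()
writer.commit()
after_commit = writer.visible_events()
print(before_commit, after_commit)
```

Event e3 remains physically in the file but is never returned, because its transaction was aborted.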
Merge Statement – SQL Standard 2011 (Hive 2.2)
Target (before):
ID   State  County   Value
1    CA     LA       19.0
2    MA     Norfolk  15.0
7    MA     Suffolk  50.15
16   CA     Orange   9.1
Source:
ID   State  Value
1           20.0
7           80.0
100  NH     6.0
MERGE INTO TARGET T
USING SOURCE S ON T.ID=S.ID
WHEN MATCHED THEN
UPDATE SET T.Value=S.Value
WHEN NOT MATCHED THEN
INSERT (ID,State,Value)
VALUES(S.ID, S.State, S.Value)
Target (after):
ID   State  County   Value
1    CA     LA       20.0
2    MA     Norfolk  15.0
7    MA     Suffolk  80.0
16   CA     Orange   9.1
100  NH     null     6.0
SQL Merge
Target Right Outer Join Source ON T.ID=S.ID
Matched rows (updates), written to delta_00002_00002/bucket_0000
ACID_PK      ID   State  County   Value
{ 1, 0, 1 }  1    CA     LA       20.0
{ 1, 0, 3 }  7    MA     Suffolk  80.0
Unmatched rows (inserts), written to delta_00002_00002_001/bucket_0000
ACID_PK      ID   State  County   Value
{ 2, 0, 1 }  100  NH              6.0
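The right outer join on this slide keeps every source row and pairs it with a target row when one exists, which is what lets a single pass produce both the update set and the insert set. The sketch below models that split; the function name and tuple layouts are assumptions, and the ACID_PK values for IDs 2 and 16 are made up to complete the example (the slide only shows those for IDs 1 and 7).

```python
def merge_via_right_outer_join(target, source):
    """target rows: (acid_pk, id, state, county, value); source rows: (id, state, value)."""
    target_by_id = {row[1]: row for row in target}
    updates, inserts = [], []
    for s_id, s_state, s_value in source:   # every source row survives the join
        match = target_by_id.get(s_id)
        if match is not None:
            acid_pk, t_id, t_state, t_county, _old_value = match
            # Matched: keep the row's ACID_PK and replace only Value.
            updates.append((acid_pk, t_id, t_state, t_county, s_value))
        else:
            # Unmatched: target side of the join is NULL, so this becomes an insert.
            inserts.append((s_id, s_state, None, s_value))
    return updates, inserts


target = [
    ((1, 0, 1), 1, "CA", "LA", 19.0),
    ((1, 0, 2), 2, "MA", "Norfolk", 15.0),    # ACID_PK assumed for illustration
    ((1, 0, 3), 7, "MA", "Suffolk", 50.15),
    ((1, 0, 4), 16, "CA", "Orange", 9.1),     # ACID_PK assumed for illustration
]
source = [(1, None, 20.0), (7, None, 80.0), (100, "NH", 6.0)]

updates, inserts = merge_via_right_outer_join(target, source)
print(updates)
print(inserts)
```

Target rows with no source match (IDs 2 and 16) never appear in the join output, so they are not rewritten.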
W/o MERGE – much less efficient
 UPDATE Target set Value= 20.0 where ID = 1;
 UPDATE Target set Value = 80.0 where ID = 7;
 INSERT INTO Target (ID, State, Value) VALUES(100, 'NH', 6.0);
Performance
Work-In-Progress
 Split an update into a combination of delete and insert
 UPDATE acidTbl SET b = 'bar' WHERE a = 300;
Existing data:
delta_00001_00001/bucket_0000
ACID_PK      a    b
{ 1, 0, 0 }  100  "foo"
{ 1, 0, 1 }  200  "xyz"
{ 1, 0, 2 }  300  "bee"
Current layout after the update:
delta_00002_00002/bucket_0000
ACID_PK      a    b
{ 1, 0, 2 }  300  "bar"
Split-update layout after the update:
delta_00002_00002/bucket_0000
ACID_PK      a    b
{ 2, 0, 0 }  300  "bar"
delete_delta_00002_00002/bucket_0000
ACID_PK      a    b
{ 1, 0, 2 }  null null
Enabled: PPD, splits for delta files
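The split-update idea can be sketched as a function that turns one UPDATE into a delete event for the old row plus an insert of a brand-new row with its own ACID_PK. This is an illustrative model, not Hive code: `split_update`, its parameters, and the way new row IDs are numbered from zero are assumptions made to reproduce the slide's example.

```python
def split_update(txn_id, bucket_id, rows, predicate, change):
    """Return (insert_rows, delete_rows) for an UPDATE executed in transaction txn_id.

    rows are (acid_pk, a, b); predicate and change operate on the user columns.
    """
    inserts, deletes = [], []
    for acid_pk, a, b in rows:
        if predicate(a, b):
            # Old version: a delete event keyed by the original ACID_PK.
            deletes.append((acid_pk, None, None))
            # New version: a fresh row owned by the updating transaction.
            new_a, new_b = change(a, b)
            inserts.append(((txn_id, bucket_id, len(inserts)), new_a, new_b))
    return inserts, deletes


existing = [
    ((1, 0, 0), 100, "foo"),
    ((1, 0, 1), 200, "xyz"),
    ((1, 0, 2), 300, "bee"),
]

# UPDATE acidTbl SET b = 'bar' WHERE a = 300, run as transaction 2
delta_rows, delete_delta_rows = split_update(
    txn_id=2,
    bucket_id=0,
    rows=existing,
    predicate=lambda a, b: a == 300,
    change=lambda a, b: (a, "bar"),
)
print(delta_rows)
print(delete_delta_rows)
```

Because the insert delta now holds only complete new rows, it can be treated like any other data file, while the delete delta carries just keys.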
Benefits
 Improved PPD (predicate pushdown)
 Better Network Utilization
 Better Memory Utilization
 Full Vectorization of Reads
 Updating bucket/partition columns
Performance
 TPC-H Benchmark
– 10 node cluster at Scale Factor 1000 (1 TB of data)
– 11 delta files with 90 GB data each
Future Work
 Multi statement transactions, i.e. BEGIN TRANSACTION/COMMIT/ROLLBACK
 Performance
– Smarter Compaction
 Finer grained concurrency management/conflict detection
 Read Committed w/Lock Based scheduling
 Better Monitoring/Alerting
 LOAD DATA … support
 Optional bucketing
 SMB support – user defined sort order
Further Reading
Etc
 Documentation
– https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
– https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest
 Follow/Contribute
– https://issues.apache.org/jira/browse/HIVE-14004?jql=project%20%3D%20HIVE%20AND%20component%20%3D%20Transactions
 user@hive.apache.org
 dev@hive.apache.org
Thank You

Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
Alex Pruden
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
Rohit Gautam
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
ThomasParaiso2
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
nkrafacyberclub
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 

Recently uploaded (20)

Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 

Transactional SQL in Apache Hive

  • 2. Agenda
     Motivations/Goals
     End user point of view
     Design
     Performance Improvements/Results
     Roadmap
  • 3. Motivations
     Modifying existing data
      – INSERT OVERWRITE TABLE Target SELECT * FROM Target WHERE …
        • Delete – OK, Update - ?
        • Concurrency
          – Hope for the best (multiple updates)
          – ZooKeeper lock manager S/X locks – restrictive
        • Expensive to do repeatedly (write side)
  • 4. Motivations
     Continuously adding new data to Hive in the past
      – ALTER TABLE Target ADD PARTITION (dt='2016-06-30')
        • Lots of files – bad for performance
        • Fewer files – users wait longer to see latest data
      – INSERT INTO Target as SELECT FROM Staging
  • 5. Merge Statement – SQL Standard 2011 (Hive 2.2)
    Target (before)
    ID   State  County   Value
    1    CA     LA       19.0
    2    MA     Norfolk  15.0
    7    MA     Suffolk  50.15
    16   CA     Orange   9.1
    Source
    ID   State  Value
    1           20.0
    7           80.0
    100  NH     6.0
    MERGE INTO TARGET T
    USING SOURCE S ON T.ID=S.ID
    WHEN MATCHED THEN
      UPDATE SET T.Value=S.Value
    WHEN NOT MATCHED THEN
      INSERT (ID,State,Value)
      VALUES(S.ID, S.State, S.Value)
    Target (after)
    ID   State  County   Value
    1    CA     LA       20.0
    2    MA     Norfolk  15.0
    7    MA     Suffolk  80.0
    16   CA     Orange   9.1
    100  NH     null     6.0
  • 6. Goals
     Make above use cases easy and efficient
     Key Requirement
      – Long running analytics queries should run concurrently with update commands
     NOT OLTP!!!
      – Support slowly changing tables
      – Not for 100s of concurrent queries trying to update the same partition
  • 7. System at High Level
     A new type of table that supports Insert/Update/Delete/Merge SQL operations
     Concept of ACID transaction
      – Atomic, Consistent, Isolated, Durable
     Streaming Ingest API
      – Write a continuous stream of events to Hive in micro batches with transactional semantics
  • 8. User Point of View
     CREATE TABLE T(a int, b int) CLUSTERED BY (b) INTO 8 BUCKETS STORED AS ORC TBLPROPERTIES ('transactional'='true');
     Not all tables support transactional semantics
     Table must be bucketed
     Table cannot be sorted
     Currently requires ORC, but any format implementing AcidInputFormat/AcidOutputFormat can be supported
     autoCommit=true
     Transactions run at Snapshot Isolation
      – Lock in the state of the DB as of the start of the query for the duration of the query
      – Between Serializable and Repeatable Read
  • 9. Design
  • 10. Design
     Transaction Manager
      – Begin transaction and obtain a transaction ID
     Storage layer enhanced to support MVCC architecture
      – Each row is tagged with unique ROW_ID (internal)
      – Multiple versions of each row to allow concurrent readers and writers
      – Result of each write is stored in a new Delta file
  • 11. How is ACID implemented in Hive?
     CREATE TABLE acidtbl (a INT, b STRING) CLUSTERED BY (a) INTO 1 BUCKETS STORED AS ORC TBLPROPERTIES ('transactional'='true');
    ACID Metadata Columns
      original_transaction_id
      bucket_id
      row_id
      current_transaction_id
    ACID_PK: { original_transaction_id, bucket_id, row_id }
    User Columns
      col_1: a : INT
      col_2: b : STRING
  • 12. How is ACID implemented in Hive?
     INSERT INTO acidtbl (a,b) VALUES (100, "foo"), (200, "xyz"), (300, "bee");
    delta_00001_00001/bucket_0000
    ACID_PK      a    b
    { 1, 0, 0 }  100  "foo"
    { 1, 0, 1 }  200  "xyz"
    { 1, 0, 2 }  300  "bee"
  • 13. How is ACID implemented in Hive?
     UPDATE acidTbl SET b = "bar" where a = 300;
    delta_00001_00001/bucket_0000
    ACID_PK      a    b
    { 1, 0, 0 }  100  "foo"
    { 1, 0, 1 }  200  "xyz"
    { 1, 0, 2 }  300  "bee"
    delta_00002_00002/bucket_0000
    ACID_PK      a    b
    { 1, 0, 2 }  300  "bar"
  • 14. How is ACID implemented in Hive?
     DELETE FROM acidTbl where a = 200;
    delta_00001_00001/bucket_0000
    ACID_PK      a    b
    { 1, 0, 0 }  100  "foo"
    { 1, 0, 1 }  200  "xyz"
    { 1, 0, 2 }  300  "bee"
    delta_00002_00002/bucket_0000
    ACID_PK      a    b
    { 1, 0, 2 }  300  "bar"
    delta_00003_00003/bucket_0000
    ACID_PK      a    b
    { 1, 0, 1 }  null null
  • 15. How is ACID implemented in Hive?
     SELECT * FROM acidtbl;
    delta_00001_00001/bucket_0000
    ACID_PK      a    b
    { 1, 0, 0 }  100  "foo"
    { 1, 0, 1 }  200  "xyz"
    { 1, 0, 2 }  300  "bee"
    delta_00002_00002/bucket_0000
    ACID_PK      a    b
    { 1, 0, 2 }  300  "bar"
    delta_00003_00003/bucket_0000
    ACID_PK      a    b
    { 1, 0, 1 }  null null
    Row versions merged by ACID_PK (rows returned to the query on the right)
    { 1, 0, 0 }  100  "foo"    100  "foo"
    { 1, 0, 1 }  200  "xyz"
    { 1, 0, 1 }  null null
    { 1, 0, 2 }  300  "bee"
    { 1, 0, 2 }  300  "bar"    300  "bar"
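The read path on slides 12 to 15 can be sketched in a few lines of Python. This is only an illustration of the idea, not Hive's reader: the `Event` type, the `read_snapshot` function and the in-memory lists standing in for delta files are all hypothetical, and a dictionary is used where a real reader would more likely sort-merge the files. It assumes the newest visible version of each ACID_PK wins and that a version with null user columns marks a delete.

```python
from typing import NamedTuple, Optional


class Event(NamedTuple):
    """One row version in a delta file (hypothetical in-memory layout)."""
    pk: tuple             # ACID_PK: (original_transaction_id, bucket_id, row_id)
    txn: int              # current_transaction_id that wrote this version
    row: Optional[tuple]  # user columns (a, b); None marks a delete


# Contents of the three delta files shown on the slides.
delta_1 = [Event((1, 0, 0), 1, (100, "foo")),
           Event((1, 0, 1), 1, (200, "xyz")),
           Event((1, 0, 2), 1, (300, "bee"))]
delta_2 = [Event((1, 0, 2), 2, (300, "bar"))]  # UPDATE ... where a = 300
delta_3 = [Event((1, 0, 1), 3, None)]          # DELETE ... where a = 200


def read_snapshot(deltas, visible_txns):
    """Return the rows visible to a reader whose snapshot is visible_txns."""
    latest = {}
    for delta in deltas:
        for ev in delta:
            if ev.txn not in visible_txns:
                continue  # written by a transaction outside the snapshot
            cur = latest.get(ev.pk)
            if cur is None or ev.txn > cur.txn:
                latest[ev.pk] = ev  # keep the newest visible version
    # Emit in ACID_PK order, skipping rows whose newest version is a delete.
    return [latest[pk].row for pk in sorted(latest) if latest[pk].row is not None]


rows = read_snapshot([delta_1, delta_2, delta_3], visible_txns={1, 2, 3})
print(rows)
```

Passing a smaller `visible_txns` set, for example `{1}`, returns the table as it looked before the update and delete, which is the snapshot behaviour a long-running query relies on.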
  • 16. Design - Compactor
     More operations = more delta files, which makes reads more expensive
  • 17. Design - Compactor
     ALTER TABLE acidTbl COMPACT 'MAJOR';
    Before compaction
    delta_00001_00001/bucket_0000
    ACID_PK      a    b
    { 1, 0, 0 }  100  "foo"
    { 1, 0, 1 }  200  "xyz"
    { 1, 0, 2 }  300  "bee"
    delta_00002_00002/bucket_0000
    ACID_PK      a    b
    { 1, 0, 2 }  300  "bar"
    delta_00003_00003/bucket_0000
    ACID_PK      a    b
    { 1, 0, 1 }  null null
    After compaction
    base_00003/bucket_0000
    ACID_PK      a    b
    { 1, 0, 0 }  100  "foo"
    { 1, 0, 2 }  300  "bar"
  • 18. Design - Compactor
     Compactor rewrites the table in the background
      – Minor compaction - merges delta files into fewer deltas
      – Major compaction - merges deltas with the base - more expensive
      – This amortizes the cost of updates and self tunes the tables
        • Makes ORC more efficient - larger stripes, better compression
     Compaction can be triggered automatically or on demand
      – There are various configuration options to control when the process kicks in.
      – Compaction itself is a Map-Reduce job
     Key design principle is that compactor does not affect readers/writers
     Cleaner process – removes obsolete files
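The difference between minor and major compaction can be illustrated with a small Python sketch. The function names and the dictionary standing in for a table directory are hypothetical. The sketch assumes that minor compaction keeps every event and only reduces the number of files, that major compaction keeps only the newest live version of each row, and that every transaction involved has already committed. Directory names use the 5-digit padding shown on the slides.

```python
# Each event is (acid_pk, txn_id, row); row is None for a delete.
deltas = {
    "delta_00001_00001": [((1, 0, 0), 1, (100, "foo")),
                          ((1, 0, 1), 1, (200, "xyz")),
                          ((1, 0, 2), 1, (300, "bee"))],
    "delta_00002_00002": [((1, 0, 2), 2, (300, "bar"))],
    "delta_00003_00003": [((1, 0, 1), 3, None)],
}


def txn_range(names):
    """Lowest and highest transaction id covered by a set of delta names."""
    lo = min(int(name.split("_")[1]) for name in names)
    hi = max(int(name.split("_")[2]) for name in names)
    return lo, hi


def minor_compact(deltas):
    """Merge many deltas into one delta; every event is kept."""
    lo, hi = txn_range(deltas)
    events = sorted((ev for evs in deltas.values() for ev in evs),
                    key=lambda ev: (ev[0], -ev[1]))  # by ACID_PK, newest first
    return f"delta_{lo:05d}_{hi:05d}", events


def major_compact(deltas):
    """Fold all deltas into a base holding only the newest live row versions."""
    _, hi = txn_range(deltas)
    latest = {}
    for evs in deltas.values():
        for pk, txn, row in evs:
            if pk not in latest or txn > latest[pk][0]:
                latest[pk] = (txn, row)
    rows = [(pk, row) for pk, (_, row) in sorted(latest.items()) if row is not None]
    return f"base_{hi:05d}", rows


minor_name, minor_events = minor_compact(deltas)
base_name, base_rows = major_compact(deltas)
print(minor_name, len(minor_events))
print(base_name, base_rows)
```

With the slide data, the major compaction result matches the `base_00003` contents on slide 17: the deleted row disappears and only the updated version of the third row survives.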
  • 19. Design - Concurrency
     Transaction Manager
      – manages transaction ID assignment
      – keeps track of transaction state: open, committed, aborted
     Lock Manager
      – DDL operations acquire eXclusive locks
      – Read operations acquire Shared locks
      – Also locks non transactional tables – different logic
        • hive.txn.strict.locking.mode
     State of both persisted in Hive Metastore
  • 20. Design - Concurrency
     Write Set tracking to prevent Write-Write conflicts in concurrent transactions
     Note that 2 Inserts are never in conflict since Hive does not enforce unique constraints.
     You are allowed to read acid and non-acid tables in same query.
     You cannot write to acid and non-acid tables at the same time (multi-insert statement)
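Write-set tracking can be sketched as a first-committer-wins check at commit time. The class below is a hypothetical toy, not Hive's transaction manager: the `TxnManager` name, its methods, and the choice of a table/partition string as the conflict granularity are all assumptions made for illustration. It records only UPDATE and DELETE targets, which reflects the slide's point that two inserts never conflict.

```python
class TxnManager:
    """Toy transaction manager with first-committer-wins conflict detection."""

    def __init__(self):
        self.next_txn_id = 1
        self.commit_seq = 0   # bumped on every successful commit
        self.open = {}        # txn_id -> (commit_seq at start, write set)
        self.committed = []   # (commit_seq, write set) of committed transactions

    def begin(self):
        txn_id = self.next_txn_id
        self.next_txn_id += 1
        self.open[txn_id] = (self.commit_seq, set())
        return txn_id

    def record_write(self, txn_id, resource, op):
        # Inserts are not tracked: without unique constraints they cannot conflict.
        if op in ("UPDATE", "DELETE"):
            self.open[txn_id][1].add(resource)

    def commit(self, txn_id):
        """Return True on commit, False if the transaction must abort."""
        start_seq, write_set = self.open.pop(txn_id)
        for seq, other in self.committed:
            # Conflict: someone committed after we started and touched the same data.
            if seq > start_seq and write_set & other:
                return False
        self.commit_seq += 1
        self.committed.append((self.commit_seq, write_set))
        return True


tm = TxnManager()
t1, t2 = tm.begin(), tm.begin()
tm.record_write(t1, "Target/dt=2016-06-30", "UPDATE")
tm.record_write(t2, "Target/dt=2016-06-30", "DELETE")
first = tm.commit(t1)    # commits
second = tm.commit(t2)   # overlapping write set, started before t1 committed
print(first, second)
```

A transaction that begins after `t1` has committed can update the same partition without conflict, because its snapshot already includes `t1`'s changes.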
  • 21. Design - Streaming Ingest
     Allows you to continuously write events to a Hive table
      – Can commit periodically to make writes durable/visible
      – Can also call abort to make writes since last commit/abort invisible.
      – Optimized so that it can handle writing micro batches of events - every second.
        • Multiple transactions are written to one file
      – Only supports adding new data
     Streaming tools like NiFi, Storm and Flume rely on this API to ingest data into Hive
     This API is public so it can be used directly
     Data written via Streaming API has the same transactional semantics as SQL side
  • 22. Merge Statement – SQL Standard 2011 (Hive 2.2)
    Target (before)
    ID   State  County   Value
    1    CA     LA       19.0
    2    MA     Norfolk  15.0
    7    MA     Suffolk  50.15
    16   CA     Orange   9.1
    Source
    ID   State  Value
    1           20.0
    7           80.0
    100  NH     6.0
    MERGE INTO TARGET T
    USING SOURCE S ON T.ID=S.ID
    WHEN MATCHED THEN
      UPDATE SET T.Value=S.Value
    WHEN NOT MATCHED THEN
      INSERT (ID,State,Value)
      VALUES(S.ID, S.State, S.Value)
    Target (after)
    ID   State  County   Value
    1    CA     LA       20.0
    2    MA     Norfolk  15.0
    7    MA     Suffolk  80.0
    16   CA     Orange   9.1
    100  NH     null     6.0
  • 23. SQL Merge
    Target Right Outer Join Source ON T.ID=S.ID
    delta_00002_00002/bucket_0000
    ACID_PK      ID   State  County   Value
    { 1, 0, 1 }  1    CA     LA       20.0
    { 1, 0, 3 }  7    MA     Suffolk  80.0
    delta_00002_00002_001/bucket_0000
    ACID_PK      ID   State  County   Value
    { 2, 0, 1 }  100  NH              6.0
  • 24. W/o MERGE – much less efficient
     UPDATE Target set Value = 20.0 where ID = 1;
     UPDATE Target set Value = 80.0 where ID = 7;
     INSERT INTO Target (ID, State, Value) VALUES(100, 'NH', 6.0);
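The effect of the MERGE statement on the example tables can be checked with a short Python sketch. It models only the statement's semantics, a single pass over Source that either updates a matching Target row or inserts a new one; it says nothing about how Hive executes the join. The `merge` function and the dictionary layout are hypothetical.

```python
# Target keyed by ID; values hold the remaining columns.
target = {
    1:  {"State": "CA", "County": "LA",      "Value": 19.0},
    2:  {"State": "MA", "County": "Norfolk", "Value": 15.0},
    7:  {"State": "MA", "County": "Suffolk", "Value": 50.15},
    16: {"State": "CA", "County": "Orange",  "Value": 9.1},
}

# Source rows carry the changes to apply; State is empty for existing IDs.
source = [
    {"ID": 1,   "State": None, "Value": 20.0},
    {"ID": 7,   "State": None, "Value": 80.0},
    {"ID": 100, "State": "NH", "Value": 6.0},
]


def merge(target, source):
    """Apply WHEN MATCHED / WHEN NOT MATCHED in one pass over source."""
    result = {key: dict(row) for key, row in target.items()}  # leave input untouched
    for s in source:
        if s["ID"] in result:
            # WHEN MATCHED THEN UPDATE SET T.Value = S.Value
            result[s["ID"]]["Value"] = s["Value"]
        else:
            # WHEN NOT MATCHED THEN INSERT (ID, State, Value); County stays null
            result[s["ID"]] = {"State": s["State"], "County": None, "Value": s["Value"]}
    return result


merged = merge(target, source)
for key in sorted(merged):
    print(key, merged[key])
```

The three separate statements on slide 24 reach the same final table, but each one is its own write to Target, whereas the merge handles all three source rows in one statement.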
  • 25. Performance
  • 26. Work-In-Progress
     Split an update into combination of delete and insert
     UPDATE acidTbl SET b = "bar" where a = 300;
    Before (update stored as a new version of the same row)
    delta_00001_00001/bucket_0000
    ACID_PK      a    b
    { 1, 0, 0 }  100  "foo"
    { 1, 0, 1 }  200  "xyz"
    { 1, 0, 2 }  300  "bee"
    delta_00002_00002/bucket_0000
    ACID_PK      a    b
    { 1, 0, 2 }  300  "bar"
    After (update split into an insert and a delete)
    delta_00002_00002/bucket_0000
    ACID_PK      a    b
    { 2, 0, 0 }  300  "bar"
    delete_delta_00002_00002/bucket_0000
    ACID_PK      a    b
    { 1, 0, 2 }  null null
    Enabled PPD Splits for Delta files
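The split-update layout on slide 26 can be sketched as follows. The `split_update` and `read` functions are hypothetical, and the sketch assumes the updated row receives a fresh ACID_PK made of the updating transaction id, the bucket id and a new row id, as the `{ 2, 0, 0 }` key on the slide suggests. Visibility filtering by snapshot is left out to keep the example short.

```python
def split_update(old_pk, new_row, txn_id, bucket_id, next_row_id):
    """Turn one UPDATE into a delete event plus an insert event."""
    delete_event = (old_pk, txn_id, None)                              # goes to delete_delta
    insert_event = ((txn_id, bucket_id, next_row_id), txn_id, new_row)  # goes to delta
    return delete_event, insert_event


def read(insert_deltas, delete_deltas):
    """Rows from insert deltas whose ACID_PK is not listed in any delete delta."""
    deleted = {pk for delta in delete_deltas for pk, _txn, _row in delta}
    return [row for delta in insert_deltas for pk, _txn, row in delta
            if pk not in deleted]


delta_1 = [((1, 0, 0), 1, (100, "foo")),
           ((1, 0, 1), 1, (200, "xyz")),
           ((1, 0, 2), 1, (300, "bee"))]

# UPDATE acidTbl SET b = "bar" where a = 300, executed as transaction 2.
delete_event, insert_event = split_update((1, 0, 2), (300, "bar"),
                                          txn_id=2, bucket_id=0, next_row_id=0)
delta_2 = [insert_event]
delete_delta_2 = [delete_event]

rows = read([delta_1, delta_2], [delete_delta_2])
print(rows)
```

Because insert deltas then hold only complete, current rows and delete deltas hold only keys, a reader can filter insert deltas on user columns and treat deletes as a key lookup, which is one plausible reading of the predicate pushdown benefit listed on the next slide.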
  • 27. Benefits
     Improved PPD (predicate pushdown)
     Better Network Utilization
     Better Memory Utilization
     Full Vectorization of Reads
     Updating bucket/partition columns
  • 28. Performance
     TPC-H Benchmark
      – 10 node cluster at Scale Factor 1000 (1 TB of data)
      – 11 delta files with 90 GB data each
  • 29. Future Work
     Multi statement transactions, i.e. BEGIN TRANSACTION/COMMIT/ROLLBACK
     Performance
      – Smarter Compaction
     Finer grained concurrency management/conflict detection
     Read Committed w/Lock Based scheduling
     Better Monitoring/Alerting
     LOAD DATA … support
     Optional bucketing
     SMB support – user defined sort order
  • 30. Further Reading
  • 31. Etc
     Documentation
      – https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
      – https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest
     Follow/Contribute
      – https://issues.apache.org/jira/browse/HIVE-14004?jql=project%20%3D%20HIVE%20AND%20component%20%3D%20Transactions
     user@hive.apache.org
     dev@hive.apache.org
  • 32. Thank You

Editor's Notes

  1. The easiest way to explain this is to talk about how you used to do some things in Hive before the Hive ACID project.
  2. Target is the table inside the warehouse. The Source table contains the changes to apply.