Hive acid-updates-summit-sjc-2014
alanfgates
1.
Adding ACID Updates to Hive
April 2014
Owen O’Malley (owen@hortonworks.com, @owen_omalley)
Alan Gates (gates@hortonworks.com, @alanfgates)
© Hortonworks Inc. 2014
2.
What’s Wrong?
• Hive Only Updates Partitions
  – INSERT OVERWRITE rewrites an entire partition
  – Forces daily or even hourly partitions
• What Happens to Concurrent Readers?
  – OK for inserts, but overwrite causes races
  – There is a ZooKeeper lock manager, but…
• No way to delete, update, or insert rows
  – Makes ad hoc work difficult
3.
Why is ACID Critical?
• Hadoop and Hive have always…
  – Worked without ACID
  – Perceived as tradeoff for performance
• But, your data isn’t static
  – It changes daily, hourly, or faster
  – Ad hoc solutions require a lot of work
  – Managing change makes the user’s life better
• Do or Do Not, There is NO Try
4.
Use Cases
• Updating a Dimension Table
  – Changing a customer’s address
• Delete Old Records
  – Remove records for compliance
• Update/Restate Large Fact Tables
  – Fix problems after they are in the warehouse
• Streaming Data Ingest
  – A continual stream of data coming in
  – Typically from Flume or Storm
5.
Design
• HDFS Does Not Allow Arbitrary Writes
  – Store changes as delta files
  – Stitched together by client on read
• Writes get a Transaction ID
  – Sequentially assigned by Metastore
• Reads get Committed Transactions
  – Provides snapshot consistency
  – No locks required
  – Provides a snapshot of data from start of query
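The snapshot rule on this slide can be sketched in a few lines of Python. This is a toy model, not Hive's implementation: `ToyMetastore`, `Snapshot`, and their methods are hypothetical names, and the only behavior taken from the slide is that IDs are assigned sequentially and a reader sees only transactions that were committed when its query started.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Transactions visible to one query, fixed when the query starts."""
    high_water_mark: int        # highest transaction ID handed out at query start
    hidden: frozenset           # IDs at or below the mark that were not committed

    def is_visible(self, txn_id: int) -> bool:
        return txn_id <= self.high_water_mark and txn_id not in self.hidden


class ToyMetastore:
    """Hypothetical stand-in: assigns sequential IDs and tracks transaction state."""

    def __init__(self):
        self._next_id = 1
        self._state = {}        # txn_id -> "open" | "committed" | "aborted"

    def open_txn(self) -> int:
        txn_id = self._next_id
        self._next_id += 1
        self._state[txn_id] = "open"
        return txn_id

    def commit(self, txn_id: int) -> None:
        self._state[txn_id] = "committed"

    def abort(self, txn_id: int) -> None:
        self._state[txn_id] = "aborted"

    def snapshot(self) -> Snapshot:
        # Everything not committed right now stays invisible to this reader,
        # even if it commits while the query is still running.
        hidden = frozenset(t for t, s in self._state.items() if s != "committed")
        return Snapshot(self._next_id - 1, hidden)
```

Because the snapshot is an immutable value captured once, the reader needs no lock: later commits change the metastore state but not the reader's view.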
6.
Stitching Buckets Together
[Figure: diagram only; no text recovered]
7.
HDFS Layout
• Partition locations remain unchanged
  – Still warehouse/$db/$tbl/$part
• Bucket Files Structured By Transactions
  – Base files $part/base_$tid/bucket_*
  – Delta files $part/delta_$tid_$tid/bucket_*
• Minor Compactions merge deltas
  – Read delta_$tid1_$tid1 .. delta_$tid2_$tid2
  – Written as delta_$tid1_$tid2
• Compaction doesn’t disturb readers
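As an illustration of the naming scheme above, the following Python sketch picks which directories a reader would open inside one partition. It is an assumption-laden simplification: `dirs_to_read` is a hypothetical helper, it looks only at directory names of the form `base_$tid` and `delta_$tid_$tid`, and it ignores snapshot filtering of uncommitted transactions.

```python
import re

BASE_RE = re.compile(r"^base_(\d+)$")
DELTA_RE = re.compile(r"^delta_(\d+)_(\d+)$")


def dirs_to_read(dir_names):
    """Return the newest base plus the deltas that extend it, in order."""
    bases, deltas = [], []
    for name in dir_names:
        if m := BASE_RE.match(name):
            bases.append((int(m.group(1)), name))
        elif m := DELTA_RE.match(name):
            deltas.append((int(m.group(1)), int(m.group(2)), name))

    newest_base = max(bases, default=None)
    covered = newest_base[0] if newest_base else 0
    chosen = [newest_base[1]] if newest_base else []

    # Widest delta first for each starting ID, so a compacted delta_1_2
    # wins over the delta_1_1 and delta_2_2 it replaced.
    for lo, hi, name in sorted(deltas, key=lambda d: (d[0], -d[1])):
        if lo > covered:
            chosen.append(name)
            covered = hi
    return chosen
```

Old and new directories can coexist while a compaction finishes, which is why the sketch skips deltas whose range is already covered rather than assuming a clean listing.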
8.
Input and Output Formats
• Created new AcidInput/OutputFormat
  – Unique key is transaction, bucket, row
• Reader returns most recent update
• Also Added Raw API for Compactor
  – Provides previous events as well
• ORC implements new API
  – Extends records with change metadata
  – Add operation (d, u, i), transaction and key
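A minimal Python sketch of "reader returns most recent update", assuming each stored record carries the change metadata listed above. The `Event` type and its field names (`orig_txn`, `current_txn`, and so on) are hypothetical; only the key (transaction, bucket, row) and the operation codes d, u, i come from the slide.

```python
from typing import NamedTuple, Optional


class Event(NamedTuple):
    op: str                  # "i" insert, "u" update, "d" delete
    orig_txn: int            # transaction that first inserted the row
    bucket: int
    row_id: int
    current_txn: int         # transaction that wrote this event
    value: Optional[dict]    # row contents; None for a delete


def latest_rows(events):
    """Collapse base and delta events into the current version of each row."""
    latest = {}
    for ev in events:
        key = (ev.orig_txn, ev.bucket, ev.row_id)
        if key not in latest or ev.current_txn > latest[key].current_txn:
            latest[key] = ev
    # A row whose newest event is a delete disappears from the result.
    return {k: ev.value for k, ev in sorted(latest.items()) if ev.op != "d"}
```

A raw reader for the compactor would skip the collapsing step and hand back every event, which is what lets a minor compaction merge deltas without losing history.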
9.
Distributing the Work
• Need to split buckets for MapReduce
  – Need to split base and deltas the same way
  – Use key ranges
  – Use indexes
10.
Transaction Manager
• Existing lock managers
  – In memory - not durable
  – ZooKeeper - requires additional components to install, administer, etc.
• Locks need to be integrated with transactions
  – commit/rollback must atomically release locks
• We sort of have this database lying around which has ACID characteristics (metastore)
• Transactions and locks stored in metastore
• Uses metastore DB to provide unique, ascending ids for transactions and locks
11.
Transaction Model
• No explicit transactions in 0.13
  – First implementation of INSERT, UPDATE, DELETE will be auto-commit
  – Will then add BEGIN, COMMIT, ROLLBACK
• Snapshot isolation
  – Reader will see consistent data for the duration of his/her query
  – May extend to other isolation levels in the future
• Current transactions can be displayed using new SHOW TRANSACTIONS statement
12.
Locking Model
• Three types of locks
  – shared
  – semi-shared (can co-exist with shared, but not other semi-shared)
  – exclusive
• Operations require different locks
  – SELECT, INSERT – shared
  – UPDATE, DELETE – semi-shared
  – DROP, INSERT OVERWRITE – exclusive
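The compatibility rules above fit in a small lookup table. The Python below is a sketch: the names `Lock`, `LOCK_FOR`, `compatible`, and `can_run` are hypothetical, and it assumes the usual meanings that shared locks coexist with each other and an exclusive lock coexists with nothing, which the slide implies but does not spell out.

```python
from enum import Enum


class Lock(Enum):
    SHARED = "shared"
    SEMI_SHARED = "semi-shared"
    EXCLUSIVE = "exclusive"


# Unordered pairs that may be held on the same resource at the same time.
_COMPATIBLE = {
    frozenset([Lock.SHARED]),                     # shared with shared
    frozenset([Lock.SHARED, Lock.SEMI_SHARED]),   # shared with semi-shared
}

# Lock required by each operation, as listed on the slide.
LOCK_FOR = {
    "SELECT": Lock.SHARED,
    "INSERT": Lock.SHARED,
    "UPDATE": Lock.SEMI_SHARED,
    "DELETE": Lock.SEMI_SHARED,
    "DROP": Lock.EXCLUSIVE,
    "INSERT OVERWRITE": Lock.EXCLUSIVE,
}


def compatible(a: Lock, b: Lock) -> bool:
    return frozenset([a, b]) in _COMPATIBLE


def can_run(operation: str, held_locks) -> bool:
    """True if the operation's lock is compatible with every lock already held."""
    wanted = LOCK_FOR[operation]
    return all(compatible(wanted, held) for held in held_locks)
```

For example, `can_run("UPDATE", [Lock.SHARED])` is true, so an update does not wait for running queries, while `can_run("UPDATE", [Lock.SEMI_SHARED])` is false, so two updates to the same resource are serialized.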
13.
Compactor
• Each transaction (or batch of transactions in streaming ingest) creates a new delta file
• Too many files strain the NameNode
• Need a way to
  – Collect many deltas into one delta – minor compaction
  – Rewrite base and delta to new base – major compaction
14.
Minor Compaction
• Run when there are 10 or more deltas (configurable)
• Results in base + 1 delta
Before:
/hive/warehouse/purchaselog/ds=201403311000/base_0028000
/hive/warehouse/purchaselog/ds=201403311000/delta_0028001_0028100
/hive/warehouse/purchaselog/ds=201403311000/delta_0028101_0028200
/hive/warehouse/purchaselog/ds=201403311000/delta_0028201_0028300
/hive/warehouse/purchaselog/ds=201403311000/delta_0028301_0028400
/hive/warehouse/purchaselog/ds=201403311000/delta_0028401_0028500
After:
/hive/warehouse/purchaselog/ds=201403311000/base_0028000
/hive/warehouse/purchaselog/ds=201403311000/delta_0028001_0028500
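The trigger and the resulting directory name can be sketched as follows. `minor_compaction_target` is a hypothetical helper that works on directory names only; the default threshold of 10 comes from the slide, and merging the actual bucket files is out of scope here.

```python
import re

DELTA_RE = re.compile(r"^delta_(\d+)_(\d+)$")


def minor_compaction_target(dir_names, threshold=10):
    """Return the name of the merged delta, or None if there are too few deltas."""
    matches = [m for m in map(DELTA_RE.match, dir_names) if m]
    if not matches or len(matches) < threshold:
        return None
    lo = min(int(m.group(1)) for m in matches)
    hi = max(int(m.group(2)) for m in matches)
    width = len(matches[0].group(1))   # keep the zero padding used by the inputs
    return f"delta_{lo:0{width}d}_{hi:0{width}d}"
```

With the five deltas listed on the slide and `threshold=5`, the function returns `delta_0028001_0028500`, matching the directory shown after compaction; with the default threshold it returns `None` because only five deltas exist.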
15.
Major Compaction
• Run when deltas are 10% the size of base (configurable)
• Results in new base
Before:
/hive/warehouse/purchaselog/ds=201403311000/base_0028000
/hive/warehouse/purchaselog/ds=201403311000/delta_0028001_0028100
/hive/warehouse/purchaselog/ds=201403311000/delta_0028101_0028200
/hive/warehouse/purchaselog/ds=201403311000/delta_0028201_0028300
/hive/warehouse/purchaselog/ds=201403311000/delta_0028301_0028400
/hive/warehouse/purchaselog/ds=201403311000/delta_0028401_0028500
After:
/hive/warehouse/purchaselog/ds=201403311000/base_0028500
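A comparable Python sketch for the major-compaction trigger. Both function names are hypothetical, sizes are plain byte counts supplied by the caller, and the 10% ratio is the default stated on the slide. The sketch assumes the new base is named after the highest transaction ID covered by the deltas, as in the example paths.

```python
def needs_major_compaction(base_bytes, delta_bytes, ratio=0.10):
    """True when the deltas together reach the given fraction of the base size."""
    return sum(delta_bytes) >= ratio * base_bytes


def major_compaction_target(delta_names, width=7):
    """Name the new base after the highest transaction ID in the deltas."""
    hi = max(int(name.rsplit("_", 1)[1]) for name in delta_names)
    return f"base_{hi:0{width}d}"
```

Note that with no base at all (`base_bytes == 0`) any non-empty delta satisfies the check, which seems a reasonable default for this sketch but is an assumption rather than documented behavior.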
16.
Compactor Continued
• Metastore Thrift server will schedule and execute compactions
  – No need for user to schedule
  – User can initiate via new ALTER TABLE COMPACT statement
• No locking required, compactions run at same time as selects and inserts
  – Compactor is aware of readers and does not remove old files until readers have finished with them
• Current compactions can be viewed via new SHOW COMPACTIONS statement
17.
Application: Streaming Ingest
• Data is flowing in from generators in a stream
• Without this, you have to add it to Hive in batches, often every hour
  – Thus your users have to wait an hour before they can see their data
• New interface in hive.hcatalog.streaming lets applications write small batches of records and commit them
  – Users can now see data within a few seconds of it arriving from the data generators
• Available for Apache Flume in HDP 2.1
  – Working on Apache Storm integration
18.
Streaming Ingest Illustrated
[Figure: Flume Agent writing to HDFS]
19.
Streaming Ingest Illustrated
[Figure: Flume Agent writing to HDFS]
while (…) write(); commit();
• Commit can be time based or size based, up to writer
• commit() flushes to disk and sends commit to metastore
20.
Streaming Ingest Illustrated
[Figure: Flume Agent writing to HDFS]
while (…) write(); commit();
• Next write() appends to the same file
21.
Streaming Ingest Illustrated
[Figure: Flume Agent writing to HDFS, with a Reader Task reading from HDFS]
while (…) write(); commit();
• Reader uses txnid to determine which records to read
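The write/commit loop in these slides can be sketched in Python with a commit policy that is either size based or time based. This does not use the hive.hcatalog.streaming interface: `BatchingWriter` and `ListSink` are hypothetical classes, and the sink is an in-memory stand-in for whatever object actually writes records and commits a transaction.

```python
import time


class BatchingWriter:
    """Forward records to a sink and commit by batch size or elapsed time."""

    def __init__(self, sink, max_records=1000, max_seconds=5.0, clock=time.monotonic):
        self.sink = sink                  # hypothetical: needs write(record) and commit()
        self.max_records = max_records
        self.max_seconds = max_seconds
        self.clock = clock                # injectable so the policy can be tested
        self._pending = 0
        self._batch_started = None

    def write(self, record):
        if self._pending == 0:
            self._batch_started = self.clock()
        self.sink.write(record)
        self._pending += 1
        if (self._pending >= self.max_records
                or self.clock() - self._batch_started >= self.max_seconds):
            self.commit()

    def commit(self):
        if self._pending:
            self.sink.commit()
            self._pending = 0


class ListSink:
    """In-memory stand-in for a transactional writer."""

    def __init__(self):
        self.open_batch, self.committed = [], []

    def write(self, record):
        self.open_batch.append(record)

    def commit(self):
        self.committed.append(self.open_batch)
        self.open_batch = []
```

Records in `open_batch` correspond to data that has been written but not yet committed, which a reader would skip because the transaction ID is not in its snapshot. The time check here only runs when a record arrives; a real writer would likely also commit from a timer.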
22.
Phases of Development
• Phase 1, Hive 0.13
  – Transaction and new lock manager
  – ORC file support
  – Automatic and manual compaction
  – Snapshot isolation
  – Streaming ingest via Flume
• Phase 2, Hive 0.14 (we hope)
  – INSERT … VALUES, UPDATE, DELETE
  – BEGIN, COMMIT, ROLLBACK
• Future (all speculative based on user feedback)
  – Versioned or point in time queries
  – Additional isolation levels such as dirty read or read committed
  – MERGE
23.
Limitations
• Only suitable for data warehousing, not for OLTP
• Table must be bucketed, and (currently) not sorted
  – Sorting restriction will be removed in the future
24.
Why Not HBase?
• Good
  – Handles compactions for us
  – Already has similar data model with LSM
• Bad
  – No cross row transactions
  – Would require us to write a transaction manager over HBase, doable, but not less work
  – HFile is column family based rather than columnar
  – HBase focused on point lookups and range scans
  – Warehousing tends to require full scans
25.
Conclusion
• JIRA: https://issues.apache.org/jira/browse/HIVE-5317
• Adds ACID semantics to Hive
• Uses SQL standard commands
  – INSERT, UPDATE, DELETE
• Provides scalable read and write access
26.
Thank You!
Questions & Answers