Apache Hive on ACID
Alan Gates
Hive PMC Member
Co-founder Hortonworks
May 2016
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
History
 Hive only updated partitions
– INSERT...OVERWRITE rewrote an entire partition
– Forced daily or even hourly partitions
– Could add files to partition directory, file compaction was manual
 What about concurrent readers?
– Ok for inserts, but overwrite caused races
– There is a ZooKeeper lock manager, but…
 No way to delete or update rows
 No INSERT INTO T VALUES…
– Breaks some tools
Why Do You Need ACID?
 Hadoop and Hive have always…
– Just said no to ACID
– Perceived as tradeoff for performance
 But, your data isn’t static
– It changes daily, hourly, or faster
– Sometimes it needs to be restated (late-arriving data) or facts change (e.g. a user’s physical address)
– Loading data into Hive every hour is so 2010; data should be available in Hive as soon as it arrives
 We saw users implementing ad hoc solutions
– This is a lot of work and hard to get right
– Hive should support this as a first class feature
When Should You Use Hive’s ACID?
 NOT OLTP!!!
 Updating a Dimension Table
– Changing a customer’s address
 Delete Old Records
– Remove records for compliance
 Update/Restate Large Fact Tables
– Fix problems after they are in the warehouse
 Streaming Data Ingest
– A continual stream of data coming in
– Typically from Flume or Storm
 NOT OLTP!!!
SQL Changes for ACID
 Since Hive 0.14
 New DML
– INSERT INTO T VALUES(1, 'fred', ...);
– UPDATE T SET x = 5[, ...] [WHERE ...]
– DELETE FROM T [WHERE ...]
– Supports partitioned and non-partitioned tables, WHERE clause can specify partition but not required
 Restrictions
– Table must have format that extends AcidInputFormat
• currently ORC
• work started on Parquet (HIVE-8123)
– Table must be bucketed and not sorted
• can use 1 bucket but this will restrict write parallelism
– Table must be marked transactional
• create table T(...) clustered by (a) into 2 buckets stored as orc TBLPROPERTIES ('transactional'='true');
• Existing ORC tables that are bucketed can be marked transactional via ALTER
Ingesting Data Into Hive From a Stream
 Data is flowing in from generators in a stream
 Without this, you have to add it to Hive in batches, often every hour
– Thus your users have to wait an hour before they can see their data
 New interface in org.apache.hive.hcatalog.streaming lets applications write small batches of records and commit them
– Users can now see data within a few seconds of it arriving from the data generators
 Available for Apache Flume and Apache Storm
Design
 HDFS does not allow arbitrary writes
– Store changes as delta files
– Stitched together by client on read
 Writes get a transaction ID
– Sequentially assigned by metastore
 Reads get highest committed transaction & list of open/aborted transactions
– Provides snapshot consistency
– No exclusive locks required
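The snapshot rule above can be sketched in a few lines of Python. This is a minimal illustration rather than Hive's actual reader code: the `Snapshot` class, its field names, and both helper functions are hypothetical and only model the idea that a reader sees writes at or below the highest committed transaction id, except those from transactions that were open or aborted when the snapshot was taken.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """What a reader is handed at query start (illustrative names only)."""
    highest_committed: int             # highest committed transaction id
    excluded: frozenset = frozenset()  # ids that were open or aborted


def is_visible(txn_id: int, snapshot: Snapshot) -> bool:
    """A write is visible if its transaction id is at or below the highest
    committed id and the transaction was neither open nor aborted."""
    return txn_id <= snapshot.highest_committed and txn_id not in snapshot.excluded


def visible_deltas(delta_txn_ids, snapshot):
    """Keep only the deltas whose transaction this reader is allowed to see."""
    return [t for t in delta_txn_ids if is_visible(t, snapshot)]


if __name__ == "__main__":
    snap = Snapshot(highest_committed=7, excluded=frozenset({5}))
    print(visible_deltas([3, 4, 5, 6, 7, 8], snap))  # [3, 4, 6, 7]
```

Because visibility is decided entirely from the snapshot, a writer that commits transaction 8 while the query runs does not change what this reader sees, which is why no exclusive lock is needed for reads.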
Why Not HBase
 Good
– Handles compactions for us
– Already has similar data model with LSM
 Bad
– When we started this, there were no transaction managers for HBase, and this work requires transactions
– HFile is column family based rather than columnar
– HBase focused on point lookups and range scans
• Warehousing requires full scans
Stitching Buckets Together
HDFS Layout
 Partition locations remain unchanged
– Still warehouse/$db/$tbl/$part
 Bucket Files Structured By Transactions
– Base files $part/base_$tid/bucket_*
– Delta files $part/delta_$tid_$tid/bucket_*
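As a rough illustration of how a reader might choose directories from this layout, the Python sketch below parses `base_$tid` and `delta_$tid_$tid` names. It assumes (this is not stated on the slide) that a base directory contains everything up to and including its transaction id, so only deltas starting after the newest base need to be read. The function name `dirs_to_read` is hypothetical, and snapshot visibility is ignored here.

```python
import re

BASE_RE = re.compile(r"^base_(\d+)$")
DELTA_RE = re.compile(r"^delta_(\d+)_(\d+)$")


def dirs_to_read(dir_names):
    """Pick the newest base plus every delta that starts after it.

    Returns (base name or None, delta names ordered by starting transaction id).
    """
    best_base, best_tid = None, -1
    deltas = []
    for name in dir_names:
        m = BASE_RE.match(name)
        if m:
            tid = int(m.group(1))
            if tid > best_tid:
                best_base, best_tid = name, tid
            continue
        m = DELTA_RE.match(name)
        if m:
            deltas.append((int(m.group(1)), int(m.group(2)), name))
    # Assumption: deltas at or below the base's transaction id are already
    # folded into the base and can be skipped.
    newer = sorted(d for d in deltas if d[0] > best_tid)
    return best_base, [name for _, _, name in newer]


if __name__ == "__main__":
    listing = ["base_3", "delta_5_5", "base_10", "delta_11_11", "delta_12_15"]
    print(dirs_to_read(listing))
```

Older directories such as `base_3` and `delta_5_5` in the example are the ones a cleaner could remove once no reader still needs them.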
Input and Output Formats
 Created new AcidInput/OutputFormat
– Unique key is original transaction id, bucket, row id
 Reader returns correct version of row based on transaction state
 Also added raw API for compactor
– Provides previous events as well
 ORC implements new API
– Extends records with change metadata
• Add operation (d, u, i), latest transaction id, and key
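To make the record model concrete, here is a small Python sketch of how a reader could resolve the correct version of each row from a stream of change events. The `Event` tuple and `current_rows` function are hypothetical names; the fields mirror the slide (key of original transaction id, bucket, and row id, plus an operation code and the latest transaction id). A real reader would more likely do a sorted merge over files; a dictionary is used here only to keep the example short.

```python
from collections import namedtuple

# key = (orig_txn, bucket, row_id); op is "i" (insert), "u" (update) or "d" (delete).
Event = namedtuple("Event", "orig_txn bucket row_id op current_txn row")


def current_rows(events, visible):
    """Return the latest visible version of each row, skipping deleted rows.

    `visible` is a predicate on a transaction id, e.g. a snapshot check.
    """
    latest = {}
    for ev in events:
        if not visible(ev.current_txn):
            continue  # written by a transaction this reader must not see
        key = (ev.orig_txn, ev.bucket, ev.row_id)
        if key not in latest or ev.current_txn > latest[key].current_txn:
            latest[key] = ev
    return [latest[k].row for k in sorted(latest) if latest[k].op != "d"]


if __name__ == "__main__":
    events = [
        Event(1, 0, 0, "i", 1, ("fred",)),
        Event(1, 0, 1, "i", 1, ("wilma",)),
        Event(1, 0, 0, "u", 2, ("frederick",)),
        Event(1, 0, 1, "d", 3, None),
    ]
    print(current_rows(events, lambda t: t <= 2))  # update seen, delete not yet
    print(current_rows(events, lambda t: t <= 3))  # delete now applied
```

Passing a predicate that accepts every transaction id gives roughly the view a compactor would want, since it needs all events rather than one reader's snapshot.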
Transaction Manager
 Existing lock managers
– In memory - not durable
– ZooKeeper - requires additional components to install, administer, etc.
 Locks need to be integrated with transactions
– commit/rollback must atomically release locks
 We sort of have this database lying around which has ACID characteristics (metastore)
 Transactions and locks stored in metastore
 Uses metastore DB to provide unique, ascending ids for transactions and locks
Transaction & Locking Model
 DML statements are auto-commit
 Snapshot isolation
– Reader will see consistent data for the duration of a query
 Current transactions can be displayed using SHOW TRANSACTIONS
 Three types of locks
– shared read
– shared write (can co-exist with shared read, but not other shared write)
– exclusive
 Operations require different locks
– SELECT, INSERT – shared read (inserts cannot conflict because there is no primary key)
– UPDATE, DELETE – shared write
– DROP, INSERT OVERWRITE – exclusive
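The lock rules above amount to a small compatibility table, sketched here in Python. The `Lock` enum and both functions are illustrative names, not Hive classes. The slide states the rule for shared write explicitly; treating exclusive as incompatible with every other lock is an assumption based on its name. The check applies to two operations touching the same table or partition.

```python
from enum import Enum


class Lock(Enum):
    SHARED_READ = "shared read"
    SHARED_WRITE = "shared write"
    EXCLUSIVE = "exclusive"


# Lock required by each operation, as listed on the slide.
REQUIRED = {
    "SELECT": Lock.SHARED_READ,
    "INSERT": Lock.SHARED_READ,
    "UPDATE": Lock.SHARED_WRITE,
    "DELETE": Lock.SHARED_WRITE,
    "DROP": Lock.EXCLUSIVE,
    "INSERT OVERWRITE": Lock.EXCLUSIVE,
}


def compatible(held: Lock, requested: Lock) -> bool:
    """True if `requested` can be granted while `held` is in place."""
    if Lock.EXCLUSIVE in (held, requested):
        return False  # assumption: exclusive coexists with nothing
    if held is Lock.SHARED_WRITE and requested is Lock.SHARED_WRITE:
        return False  # two shared writes cannot coexist
    return True


def can_run_together(op_a: str, op_b: str) -> bool:
    """Check two operations against the same table or partition."""
    return compatible(REQUIRED[op_a], REQUIRED[op_b])


if __name__ == "__main__":
    for pair in [("SELECT", "UPDATE"), ("UPDATE", "DELETE"), ("INSERT", "INSERT")]:
        print(pair, can_run_together(*pair))
```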
Compaction
 Each transaction (or batch of transactions in streaming) creates a new delta directory
 Too many files = NameNode strain and poor read performance due to fan-in on merge
 Need to automatically compact files
– Initiated by metastore server, run as MR jobs in the cluster
– Can be manually initiated by user via ALTER TABLE COMPACT
 Minor compaction merges many deltas into one
– Run when there are more than 10 delta directories (configurable)
 Major compaction merges deltas with base and rewrites base
– Run when size of the deltas > 10% of the size of the base (configurable)
 Old files kept around until all readers are done with their snapshots, then cleaned up
– Compaction and data read/writes can be done in parallel with no need to pause the world
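The two triggers can be expressed as a short decision function. This Python sketch uses the default thresholds from the slide (more than 10 delta directories, deltas larger than 10% of the base). The function name and parameters are hypothetical. Checking the major condition before the minor one, and skipping the percentage test when there is no base yet, are assumptions made for the example.

```python
def choose_compaction(num_deltas, delta_bytes, base_bytes,
                      max_deltas=10, delta_pct=0.10):
    """Return "major", "minor", or None for one table or partition."""
    # Major: the deltas have grown past a fraction of the base size.
    # Assumption: with no base (base_bytes == 0) this test is skipped.
    if base_bytes > 0 and delta_bytes > delta_pct * base_bytes:
        return "major"
    # Minor: too many delta directories to merge efficiently on read.
    if num_deltas > max_deltas:
        return "minor"
    return None


if __name__ == "__main__":
    print(choose_compaction(num_deltas=3, delta_bytes=50, base_bytes=1000))
    print(choose_compaction(num_deltas=11, delta_bytes=50, base_bytes=1000))
    print(choose_compaction(num_deltas=3, delta_bytes=200, base_bytes=1000))
```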
Issues Found and (Some) Fixed
 Not GA ready in Hive 1.2 or 2.0, hope to have GA ready by 1.3 and 2.1
 Deadlocks in the RDBMS
– The way the Hive metastore used the RDBMS caused a lot of deadlocks – greatly improved
 Usability
– SHOW COMPACTIONS and SHOW LOCKS did not give users/admins enough information to successfully determine who was blocking whom or what was getting compacted – improved, some work still to do here
 Resilience
– System was easy to knock over when clients did silly things (like open 1M+ transactions) – improved, though I am sure there are still some ways to kill it
– Initially compactor threads only ran in 1 metastore instance – resolved, now can run in multiple instances
 Correctness
– Streaming ingest did not enforce proper bucket spraying – resolved
– Initial versions of the compactor had a race condition that resulted in record loss – resolved
– Adding a column to a table or changing a column’s type caused read time errors – resolved
– Updates can get lost when overlapping transactions update the same partition – HIVE-13395
 Performance
– Some work done here (e.g. making predicate push down work, efficient split combinations)
– Much still to be done
Next: MERGE
 Standard SQL, added in SQL 2003
 Problem: today each UPDATE requires a scan of the partition or table
– There is no way to apply separate updates in a batch
 Allows upserts
 Use case:
– bring in batch from transactional/front end systems
– Apply as insert or updates (as appropriate) in one read/write pass
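The upsert use case can be illustrated with a toy in-memory model in Python. This is not how Hive would implement MERGE; `merge_batch` and its arguments are hypothetical, and rows are plain dictionaries. It only shows the semantics: the target is read once, and each incoming row is applied as an update if its key already exists and as an insert otherwise.

```python
def merge_batch(target, batch, key):
    """Apply `batch` to `target` in one pass; return (inserted, updated)."""
    # One read of the target builds a key -> position index.
    index = {row[key]: i for i, row in enumerate(target)}
    inserted = updated = 0
    for row in batch:
        pos = index.get(row[key])
        if pos is None:
            # No matching key: treat the row as an insert.
            index[row[key]] = len(target)
            target.append(dict(row))
            inserted += 1
        else:
            # Matching key: overwrite the changed columns.
            target[pos] = {**target[pos], **row}
            updated += 1
    return inserted, updated


if __name__ == "__main__":
    customers = [{"id": 1, "addr": "A"}, {"id": 2, "addr": "B"}]
    incoming = [{"id": 2, "addr": "C"}, {"id": 3, "addr": "D"}]
    print(merge_batch(customers, incoming, key="id"))  # (1, 1)
    print(customers)
```

Issuing one UPDATE per changed row would instead scan the table or partition once per statement, which is the cost MERGE is meant to avoid.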
Future Work
 Multi-statement transactions (BEGIN, COMMIT, ROLLBACK)
 Integration with LLAP
– Figure out how MVCC works with LLAP’s caching
– Build a write path through LLAP
 Lower the user burden
– Make the bucketing automatic so the user does not have to be aware of it
– Allow user to determine sort order of the table
– Eventually remove the transactional/non-transactional distinction in tables
 Improve monitoring and alerting facilities
– Make it easier for an admin to determine when the system is in trouble, e.g. the compactor is not running or is failing on every run, there are too many open transactions, etc.
Thank You

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 

Recently uploaded (20)

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 

Apache Hive on ACID

  • 1. Apache Hive on ACID
      Alan Gates, Hive PMC Member, Co-founder Hortonworks
      May 2016
  • 2. History
      ▪ Hive only updated partitions
          – INSERT...OVERWRITE rewrote an entire partition
          – Forced daily or even hourly partitions
          – Could add files to partition directory, file compaction was manual
      ▪ What about concurrent readers?
          – OK for inserts, but overwrite caused races
          – There is a ZooKeeper lock manager, but…
      ▪ No way to delete or update rows
      ▪ No INSERT INTO T VALUES…
          – Breaks some tools
      © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 3. Why Do You Need ACID?
      ▪ Hadoop and Hive have always…
          – Just said no to ACID
          – Perceived as tradeoff for performance
      ▪ But, your data isn’t static
          – It changes daily, hourly, or faster
          – Sometimes it needs to be restated (late arriving data) or facts change (e.g. a user’s physical address)
          – Loading data into Hive every hour is so 2010; data should be available in Hive as soon as it arrives
      ▪ We saw users implementing ad hoc solutions
          – This is a lot of work and hard to get right
          – Hive should support this as a first class feature
  • 4. When Should You Use Hive’s ACID?
      ▪ NOT OLTP!!!
      ▪ Updating a Dimension Table
          – Changing a customer’s address
      ▪ Delete Old Records
          – Remove records for compliance
      ▪ Update/Restate Large Fact Tables
          – Fix problems after they are in the warehouse
      ▪ Streaming Data Ingest
          – A continual stream of data coming in
          – Typically from Flume or Storm
      ▪ NOT OLTP!!!
  • 5. SQL Changes for ACID
      ▪ Since Hive 0.14
      ▪ New DML
          – INSERT INTO T VALUES(1, 'fred', ...);
          – UPDATE T SET x = 5[, ...] [WHERE ...]
          – DELETE FROM T [WHERE ...]
          – Supports partitioned and non-partitioned tables; the WHERE clause can specify a partition but is not required to
      ▪ Restrictions
          – Table must have a format that extends AcidInputFormat
              • currently ORC
              • work started on Parquet (HIVE-8123)
          – Table must be bucketed and not sorted
              • can use 1 bucket but this will restrict write parallelism
          – Table must be marked transactional
              • CREATE TABLE T(...) CLUSTERED BY (a) INTO 2 BUCKETS STORED AS ORC TBLPROPERTIES ('transactional'='true');
              • Existing ORC tables that are bucketed can be marked transactional via ALTER
  • 6. Ingesting Data Into Hive From a Stream
      ▪ Data is flowing in from generators in a stream
      ▪ Without this, you have to add it to Hive in batches, often every hour
          – Thus your users have to wait an hour before they can see their data
      ▪ New interface in hive.hcatalog.streaming lets applications write small batches of records and commit them
          – Users can now see data within a few seconds of it arriving from the data generators
      ▪ Available for Apache Flume and Apache Storm
  • 7. Design
      ▪ HDFS does not allow arbitrary writes
          – Store changes as delta files
          – Stitched together by client on read
      ▪ Writes get a transaction ID
          – Sequentially assigned by metastore
      ▪ Reads get highest committed transaction & list of open/aborted transactions
          – Provides snapshot consistency
          – No exclusive locks required
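The read rule on the Design slide can be sketched in a few lines of Python. This is a minimal illustration, not Hive's implementation: the `Snapshot` class, its field names, and `is_visible` are hypothetical names chosen here, and the only logic assumed is what the slide states (a reader gets the highest committed transaction plus the list of open/aborted transactions).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """What a reader is handed when its query starts (illustrative names)."""
    highest_committed: int        # highest committed transaction ID
    open_or_aborted: frozenset    # IDs of transactions that are open or aborted


def is_visible(txn_id: int, snap: Snapshot) -> bool:
    """A write is visible if it is not newer than the snapshot
    and its transaction is neither open nor aborted."""
    return txn_id <= snap.highest_committed and txn_id not in snap.open_or_aborted


snap = Snapshot(highest_committed=10, open_or_aborted=frozenset({7, 9}))
visible = [t for t in range(1, 13) if is_visible(t, snap)]
```

Because the snapshot is fixed for the duration of the query, later commits do not change what the reader sees, which is why no exclusive lock is needed for reads.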
  • 8. Why Not HBase
      ▪ Good
          – Handles compactions for us
          – Already has similar data model with LSM
      ▪ Bad
          – When we started this there were no transaction managers for HBase, and this work requires transactions
          – HFile is column family based rather than columnar
          – HBase focused on point lookups and range scans
              • Warehousing requires full scans
  • 9. Stitching Buckets Together
  • 10. HDFS Layout
      ▪ Partition locations remain unchanged
          – Still warehouse/$db/$tbl/$part
      ▪ Bucket Files Structured By Transactions
          – Base files $part/base_$tid/bucket_*
          – Delta files $part/delta_$tid_$tid/bucket_*
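As a rough sketch of how a reader might use this layout, the Python below picks the newest base directory and the deltas that start after it. The function name `dirs_to_read` is hypothetical, and it assumes the two IDs in `delta_$tid_$tid` are the first and last transaction covered by that delta. It deliberately ignores snapshot filtering and overlapping deltas left behind by compaction.

```python
import re

BASE_RE = re.compile(r"^base_(\d+)$")
DELTA_RE = re.compile(r"^delta_(\d+)_(\d+)$")


def dirs_to_read(dir_names):
    """Return the newest base directory followed by the deltas newer than it."""
    best_base = None   # (txn_id, name) of the newest base seen so far
    deltas = []        # (first_txn, last_txn, name)
    for name in dir_names:
        m = BASE_RE.match(name)
        if m:
            txn = int(m.group(1))
            if best_base is None or txn > best_base[0]:
                best_base = (txn, name)
            continue
        m = DELTA_RE.match(name)
        if m:
            deltas.append((int(m.group(1)), int(m.group(2)), name))
        # anything else in the partition directory is ignored
    base_txn = best_base[0] if best_base else 0
    newer = [name for first, _last, name in sorted(deltas) if first > base_txn]
    return ([best_base[1]] if best_base else []) + newer


listing = ["delta_3_3", "base_5", "delta_6_6", "delta_7_10", "base_2"]
selected = dirs_to_read(listing)
```

Here `base_2` and `delta_3_3` are skipped because `base_5` already contains their changes.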
  • 11. Input and Output Formats
      ▪ Created new AcidInput/OutputFormat
          – Unique key is original transaction id, bucket, row id
      ▪ Reader returns correct version of row based on transaction state
      ▪ Also added raw API for compactor
          – Provides previous events as well
      ▪ ORC implements new API
          – Extends records with change metadata
              • Add operation (d, u, i), latest transaction id, and key
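To make "reader returns correct version of row" concrete, here is a small Python sketch of stitching change events by their key. The `Event` tuple and `stitch` function are illustrative names, not the AcidInputFormat API; the sketch assumes each event carries the operation, the key (original transaction id, bucket, row id), and the transaction that wrote it, as the slide describes, and that at most one event per key is written by any single transaction.

```python
from typing import Callable, NamedTuple, Optional


class Event(NamedTuple):
    op: str               # "i" insert, "u" update, "d" delete
    original_txn: int     # key part 1: transaction that first inserted the row
    bucket: int           # key part 2
    row_id: int           # key part 3
    current_txn: int      # transaction that wrote this event
    row: Optional[dict]   # row data, None for deletes


def stitch(events, is_visible: Callable[[int], bool]):
    """Keep, per key, the newest event the reader is allowed to see,
    then drop keys whose newest visible event is a delete."""
    latest = {}
    for ev in events:
        if not is_visible(ev.current_txn):
            continue
        key = (ev.original_txn, ev.bucket, ev.row_id)
        prev = latest.get(key)
        if prev is None or ev.current_txn > prev.current_txn:
            latest[key] = ev
    return [latest[k].row for k in sorted(latest) if latest[k].op != "d"]


events = [
    Event("i", 1, 0, 0, 1, {"x": 1}),
    Event("i", 1, 0, 1, 1, {"x": 2}),
    Event("u", 1, 0, 0, 3, {"x": 5}),
    Event("d", 1, 0, 1, 4, None),
    Event("u", 1, 0, 0, 6, {"x": 9}),   # written after the reader's snapshot
]
rows_at_5 = stitch(events, lambda txn: txn <= 5)
rows_at_2 = stitch(events, lambda txn: txn <= 2)
```

The compactor's raw API differs from this in that it also needs the superseded events, not just the winner for each key.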
  • 12. Transaction Manager
      ▪ Existing lock managers
          – In memory: not durable
          – ZooKeeper: requires additional components to install, administer, etc.
      ▪ Locks need to be integrated with transactions
          – commit/rollback must atomically release locks
      ▪ We sort of have this database lying around which has ACID characteristics (metastore)
      ▪ Transactions and locks stored in metastore
      ▪ Uses metastore DB to provide unique, ascending ids for transactions and locks
  • 13. Transaction & Locking Model
      ▪ DML statements are auto-commit
      ▪ Snapshot isolation
          – Reader will see consistent data for the duration of a query
      ▪ Current transactions can be displayed using SHOW TRANSACTIONS
      ▪ Three types of locks
          – shared read
          – shared write (can co-exist with shared read, but not other shared write)
          – exclusive
      ▪ Operations require different locks
          – SELECT, INSERT: shared read (inserts cannot conflict because there is no primary key)
          – UPDATE, DELETE: shared write
          – DROP, INSERT OVERWRITE: exclusive
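The lock rules on this slide amount to a small compatibility table, sketched below in Python. The names `Lock`, `LOCK_FOR`, `compatible`, and `can_run_together` are hypothetical. One assumption goes beyond the slide: an exclusive lock is treated as conflicting with every other lock, which is the usual meaning but is not spelled out there.

```python
from enum import Enum


class Lock(Enum):
    SHARED_READ = "shared read"
    SHARED_WRITE = "shared write"
    EXCLUSIVE = "exclusive"


# Lock required by each operation, as listed on the slide.
LOCK_FOR = {
    "SELECT": Lock.SHARED_READ,
    "INSERT": Lock.SHARED_READ,
    "UPDATE": Lock.SHARED_WRITE,
    "DELETE": Lock.SHARED_WRITE,
    "DROP": Lock.EXCLUSIVE,
    "INSERT OVERWRITE": Lock.EXCLUSIVE,
}


def compatible(a: Lock, b: Lock) -> bool:
    """True if both locks can be held on the same object at once."""
    if Lock.EXCLUSIVE in (a, b):
        return False  # assumption: exclusive conflicts with everything
    # shared write co-exists with shared read, but not with another shared write
    return not (a is Lock.SHARED_WRITE and b is Lock.SHARED_WRITE)


def can_run_together(op_a: str, op_b: str) -> bool:
    return compatible(LOCK_FOR[op_a], LOCK_FOR[op_b])


select_during_update = can_run_together("SELECT", "UPDATE")
two_updates = can_run_together("UPDATE", "DELETE")
```

Readers are never blocked by UPDATE or DELETE under these rules, which matches the snapshot isolation described above.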
  • 14. Compaction
      ▪ Each transaction (or batch of transactions in streaming) creates a new delta directory
      ▪ Too many files = unhappy NameNode and poor read performance due to fan-in on merge
      ▪ Need to automatically compact files
          – Initiated by metastore server, run as MR jobs in the cluster
          – Can be manually initiated by user via ALTER TABLE COMPACT
      ▪ Minor compaction merges many deltas into one
          – Run when there are more than 10 delta directories (configurable)
      ▪ Major compaction merges deltas with base and rewrites base
          – Run when size of the deltas > 10% of the size of the base (configurable)
      ▪ Old files kept around until all readers are done with their snapshots, then cleaned up
          – Compaction and data read/writes can be done in parallel with no need to pause the world
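The two triggers can be written as a small decision function. This Python sketch uses the default thresholds quoted on the slide; the function and parameter names are illustrative, and two choices are assumptions made here rather than statements from the talk: the major rule is checked first, and the size rule is skipped when there is no base yet.

```python
from typing import Optional


def choose_compaction(num_deltas: int,
                      delta_bytes: int,
                      base_bytes: int,
                      delta_num_threshold: int = 10,
                      delta_pct_threshold: float = 0.10) -> Optional[str]:
    """Return "major", "minor", or None for a single partition."""
    # Major: deltas have grown past the configured fraction of the base.
    if base_bytes > 0 and delta_bytes > delta_pct_threshold * base_bytes:
        return "major"
    # Minor: too many delta directories, merge them into one.
    if num_deltas > delta_num_threshold:
        return "minor"
    return None


many_small_deltas = choose_compaction(num_deltas=11, delta_bytes=5, base_bytes=100)
few_large_deltas = choose_compaction(num_deltas=3, delta_bytes=20, base_bytes=100)
```

Both thresholds are strict inequalities, following the slide's "more than 10" and "> 10%" wording.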
  • 15. Issues Found and (Some) Fixed
      ▪ Not GA ready in Hive 1.2 or 2.0; hope to have GA ready by 1.3 and 2.1
      ▪ Deadlocks in the RDBMS
          – The way the Hive metastore used the RDBMS caused a lot of deadlocks (greatly improved)
      ▪ Usability
          – SHOW COMPACTIONS and SHOW LOCKS did not give users/admins enough information to determine who was blocking whom or what was getting compacted (improved, some work still to do here)
      ▪ Resilience
          – System was easy to knock over when clients did silly things, like opening 1M+ transactions (improved, though I am sure there are still some ways to kill it)
          – Initially compactor threads only ran in 1 metastore instance (resolved, now can run in multiple instances)
      ▪ Correctness
          – Streaming ingest did not enforce proper bucket spraying (resolved)
          – Initial versions of the compactor had a race condition that resulted in record loss (resolved)
          – Adding a column to a table or changing a column’s type caused read-time errors (resolved)
          – Updates can get lost when overlapping transactions update the same partition (HIVE-13395)
      ▪ Performance
          – Some work done here (e.g. making predicate push down work, efficient split combinations)
          – Much still to be done
  • 16. Next: MERGE
      ▪ Standard SQL, added in SQL 2003
      ▪ Problem: today each UPDATE requires a scan of the partition or table
          – There is no way to apply separate updates in a batch
      ▪ Allows upserts
      ▪ Use case:
          – Bring in batch from transactional/front end systems
          – Apply as inserts or updates (as appropriate) in one read/write pass
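The upsert use case can be illustrated outside of SQL with a short Python sketch: one pass over the incoming batch, where each row either updates a matching target row or is inserted. This only illustrates the semantics; it is not how Hive would execute MERGE, and `merge_batch` with its arguments is a hypothetical helper.

```python
def merge_batch(target, batch, key):
    """Apply a batch to target rows in a single pass.

    Rows whose key already exists are updated in place; others are inserted.
    Returns (rows, inserted_count, updated_count).
    """
    result = {row[key]: dict(row) for row in target}  # copy, keyed by match column
    inserted = updated = 0
    for row in batch:
        k = row[key]
        if k in result:
            result[k].update(row)   # WHEN MATCHED: update
            updated += 1
        else:
            result[k] = dict(row)   # WHEN NOT MATCHED: insert
            inserted += 1
    return list(result.values()), inserted, updated


customers = [{"id": 1, "addr": "A"}, {"id": 2, "addr": "B"}]
changes = [{"id": 2, "addr": "C"}, {"id": 3, "addr": "D"}]
merged, n_inserted, n_updated = merge_batch(customers, changes, key="id")
```

Issuing one UPDATE per changed row would instead scan the target once for every row in the batch, which is the cost the slide calls out.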
  • 17. Future Work
      ▪ Multi-statement transactions (BEGIN, COMMIT, ROLLBACK)
      ▪ Integration with LLAP
          – Figure out how MVCC works with LLAP’s caching
          – Build a write path through LLAP
      ▪ Lower the user burden
          – Make the bucketing automatic so the user does not have to be aware of it
          – Allow user to determine sort order of the table
          – Eventually remove the transactional/non-transactional distinction in tables
      ▪ Improve monitoring and alerting facilities
          – Make it easier for an admin to determine when the system is in trouble, e.g. the compactor is not running or is failing on every run, there are too many open transactions, etc.
  • 18. Thank You