Apache Hive: From MapReduce to
Enterprise-grade Big Data Warehousing
Slim Bouguerra
(bslim AT apache DOT org )
Apache Druid PMC
Apache Hive Committer
Apache Calcite Committer
Jesús Camacho-Rodríguez, Ashutosh Chauhan, Alan Gates,
Eugene Koifman, Owen O’Malley, Vineet Garg, Zoltan
Haindrich, Sergey Shelukhin, Prasanth Jayachandran,
Siddharth Seth, Deepak Jaiswal, Slim Bouguerra, Nishant
Bangarwa, Sankar Hariappan, Anishek Agarwal, Jason
Dere, Daniel Dai, Thejas Nair, Nita Dembla, Gopal
Vijayaraghavan, Günther Hagleitner
SIGMOD 2019 Industrial Track
© 2019 Cloudera, Inc. All rights reserved. 2
BRIEF HISTORY
HDFS, MapReduce, Hive, and Pig
• Hadoop (HDFS, MapReduce) is open sourced in 2006
– Ubiquitous platform for inexpensive data storage and processing
– Focused mainly on ETL and batch reporting workloads
• Hive (Facebook) and Pig (Yahoo!) are developed to expose a SQL-ish
higher-level abstraction for data processing on top of MapReduce
“To the developers of the Hive and Pig database systems, for developing seminal software systems that
served to bring relational-style declarative programming to the Hadoop ecosystem”
2018 SIGMOD Systems Award
NO-SQL
MOTIVATION
• Evolve from a SQL-like batch system to a low-latency,
full SQL engine on Hadoop
– Offload existing workloads from the major, expensive
MPP databases
Option 1
Implement a new system
Option 2
Extend the existing system
– It already exists
– Years' worth of hackers' code from the
open source community
– Handles XXXL-size ETL very well
– Handles many Hadoop/blob storage
consistency edge cases very well
MOTIVATION
Goal
• Requirements for our implementation
– Compliant: support SQL standard and provide ACID guarantees
– Efficient: use optimization techniques present in other MPP databases
– Flexible: work reliably for multiple use cases
– Extensible: able to interact with other data processing engines
APACHE HIVE IMPROVEMENTS
• Compliant: SQL and ACID support
• Efficient: Query optimization
• Flexible: Runtime latency
• Extensible: Federation capabilities
SQL AND ACID SUPPORT
ACID implementation
• Implementation of ACID-compliant, record-level transactions
– Support to execute INSERT, UPDATE, DELETE and MERGE statements
• How to build this?
– Transaction manager
– Overcome Hadoop/Cloud file system limitations (no in-place updates, fuzzy S3
consistency)
• Multi-version optimistic concurrency control (MVOCC)
– Snapshot isolation level
– Single statement transactions across tables
– Performance comparable to non-transactional tables
SQL AND ACID SUPPORT
Write transaction
HiveServer2 ↔ Transaction Manager (Hive Metastore)
– open transaction → TxnId
– get WriteId (table1, TxnId) → WriteId
– commit (TxnId)
Table contents
table1/
├── delta_001_001/          (WriteId = 1)
│   ├── 0000
│   └── 0001
├── delete_delta_002_002/   (WriteId = 2)
│   ├── 0000
│   └── 0001
└── delta_003_003/          (WriteId = 3)
    └── 0000
INSERT record: <ROW__ID> ‘john’ ‘doe’
DELETE record: <ROW__ID> null null
<ROW__ID> uniquely identifies every record in the table
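The write path on this slide can be illustrated with a small sketch. It is a toy model, not Hive's implementation: the `TransactionManager` class, its method names, and `delta_dir` are hypothetical, and the three-digit directory names simply follow the slide.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class TransactionManager:
    """Toy stand-in for the transaction manager (hypothetical API)."""
    _txn_ids: count = field(default_factory=lambda: count(1))
    _write_ids: dict = field(default_factory=dict)  # table -> last WriteId
    committed: set = field(default_factory=set)

    def open_transaction(self) -> int:
        # Global, monotonically increasing transaction identifier.
        return next(self._txn_ids)

    def get_write_id(self, table: str, txn_id: int) -> int:
        # WriteIds are allocated per table; txn_id is unused in this toy model.
        self._write_ids[table] = self._write_ids.get(table, 0) + 1
        return self._write_ids[table]

    def commit(self, txn_id: int) -> None:
        self.committed.add(txn_id)


def delta_dir(write_id: int, is_delete: bool = False) -> str:
    """Directory that holds the records written under one WriteId."""
    prefix = "delete_delta" if is_delete else "delta"
    return f"{prefix}_{write_id:03d}_{write_id:03d}"


tm = TransactionManager()
dirs = []
for is_delete in (False, True, False):  # INSERT, DELETE, INSERT
    txn = tm.open_transaction()
    wid = tm.get_write_id("table1", txn)
    dirs.append(delta_dir(wid, is_delete))
    tm.commit(txn)

print(dirs)  # ['delta_001_001', 'delete_delta_002_002', 'delta_003_003']
```

Each statement writes into its own directory, so existing files are never updated in place, which is what the file system limitations require.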
SQL AND ACID SUPPORT
Read transaction
HiveServer2 ↔ Transaction Manager (Hive Metastore)
– get snapshot → <TXN_ID_LIST>
– get snapshot (table1, <TXN_ID_LIST>) → <WRITE_ID_LIST>
– WRITE_ID_LIST = [2, ()]
Table contents
table1/
├── delta_001_001/
│   ├── 0000
│   └── 0001
├── delete_delta_002_002/
│   ├── 0000
│   └── 0001
└── delta_003_003/          (ignored by record reader)
    └── 0000
INSERT record: <ROW__ID> ‘john’ ‘doe’
DELETE record: <ROW__ID> null null
Record reader performs anti-semijoin
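A minimal sketch of the read path, assuming that `WRITE_ID_LIST = [2, ()]` means "high watermark 2, no excluded WriteIds". The `Record` type and function names are hypothetical. The reader first drops records outside the snapshot, then removes inserted rows that have a visible DELETE record with the same row identifier (the anti-semijoin).

```python
from typing import NamedTuple, Optional


class Record(NamedTuple):
    row_id: int                # stand-in for <ROW__ID>
    write_id: int              # WriteId under which the record was written
    values: Optional[tuple]    # None for a DELETE record


def visible(write_id, high_watermark, exceptions=frozenset()):
    """A record is in the snapshot if its WriteId is at or below the
    high watermark and is not listed as an exception (open or aborted)."""
    return write_id <= high_watermark and write_id not in exceptions


def read_snapshot(inserts, deletes, high_watermark, exceptions=frozenset()):
    # Row identifiers removed by DELETE records that the snapshot can see.
    deleted = {d.row_id for d in deletes
               if visible(d.write_id, high_watermark, exceptions)}
    # Anti-semijoin: keep visible inserts with no matching visible delete.
    return [r.values for r in inserts
            if visible(r.write_id, high_watermark, exceptions)
            and r.row_id not in deleted]


inserts = [
    Record(1, 1, ("john", "doe")),
    Record(2, 1, ("jane", "roe")),
    Record(3, 3, ("late", "row")),   # written under WriteId 3
]
deletes = [Record(1, 2, None)]       # deletes row 1 under WriteId 2

rows = read_snapshot(inserts, deletes, high_watermark=2)
print(rows)  # [('jane', 'roe')]
```

With a high watermark of 2, the record written under WriteId 3 is skipped, and the row deleted under WriteId 2 is filtered out.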
SQL AND ACID SUPPORT
Compactor
• Minor compaction: Merge files in delta directories
• Major compaction: Merge delta files with base directories
Minor compaction
Table contents (before)
table1/
├── delta_001_001/
│   ├── 0000
│   └── 0001
├── delete_delta_002_002/
│   ├── 0000
│   └── 0001
└── delta_003_003/
    └── 0000
Table contents (after)
table1/
├── delta_001_003/
│   ├── 0000
│   └── 0001
└── delete_delta_002_002/
    ├── 0000
    └── 0001
Major compaction
Table contents (before)
table2/
├── base_100/
│   ├── 0000
│   └── 0001
└── delta_101_103/
    ├── 0000
    └── 0001
Table contents (after)
table2/
└── base_103/
    ├── 0000
    └── 0001
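The directory renaming in the two compaction examples can be sketched as follows. This only models directory names with the slide's three-digit convention; the function names are hypothetical and the actual merging of records inside the files is not shown.

```python
import re

DELTA = re.compile(r"(delete_delta|delta)_(\d+)_(\d+)")
BASE = re.compile(r"base_(\d+)")


def minor_compact(dirs):
    """Merge insert deltas into one delta directory and delete deltas into
    one delete_delta directory; base directories are left untouched."""
    out = [d for d in dirs if BASE.fullmatch(d)]
    for kind in ("delta", "delete_delta"):
        spans = [(int(m[2]), int(m[3]))
                 for m in map(DELTA.fullmatch, dirs) if m and m[1] == kind]
        if spans:
            lo = min(s[0] for s in spans)
            hi = max(s[1] for s in spans)
            out.append(f"{kind}_{lo:03d}_{hi:03d}")
    return out


def major_compact(dirs):
    """Fold the base and all deltas into a new base named after the
    highest WriteId covered."""
    ids = [int(m[1]) for m in map(BASE.fullmatch, dirs) if m]
    ids += [int(m[3]) for m in map(DELTA.fullmatch, dirs) if m]
    return [f"base_{max(ids):03d}"]


table1 = ["delta_001_001", "delete_delta_002_002", "delta_003_003"]
table2 = ["base_100", "delta_101_103"]

print(minor_compact(table1))  # ['delta_001_003', 'delete_delta_002_002']
print(major_compact(table2))  # ['base_103']
```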
QUERY OPTIMIZATION
Work smarter, not harder
• Rule and cost-based optimizer based on Apache Calcite
– Representing queries at the right abstraction level is critical to implementing
advanced optimization algorithms
• Query reoptimization
– Catches runtime errors and re-executes query, changing configuration parameters
(overlay) or using statistics captured at runtime (re-optimize)
• Query results cache
– Reuses the results of a previously executed query by checking the internal
transactional state of the participating tables
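The validity check behind the query results cache can be sketched like this. It is an illustration under assumptions: the `ResultsCache` class is hypothetical, and the "transactional state" of a table is modelled as its latest committed WriteId.

```python
class ResultsCache:
    """Toy results cache: an entry is reusable only if every participating
    table is still in the transactional state it had when the entry was stored."""

    def __init__(self):
        self._entries = {}  # query text -> (table state snapshot, rows)

    def store(self, query, table_state, rows):
        self._entries[query] = (dict(table_state), rows)

    def lookup(self, query, table_state):
        entry = self._entries.get(query)
        if entry is not None and entry[0] == table_state:
            return entry[1]
        return None  # miss, or some table changed since the entry was stored


cache = ResultsCache()
query = "SELECT count(*) FROM sales"
cache.store(query, {"sales": 3}, [(42,)])

hit = cache.lookup(query, {"sales": 3})    # table unchanged: reuse the result
miss = cache.lookup(query, {"sales": 4})   # new write committed: recompute
```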
QUERY OPTIMIZATION
Work smarter, not harder
• Materialized views:
– Transparent query rewriting (rich SQL dialect), incremental maintenance
• Shared work:
– Identifying overlapping subexpressions within executing plan of a given query,
computing them only once and reusing their results
• Dynamic semijoin:
– Reduces the size of intermediate results during query execution by skipping complete
partitions (dynamic partition pruning) or row groups (index semijoin)
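As an illustration of dynamic semijoin reduction, the sketch below builds a filter from the small side of a join and uses it to skip whole row groups on the large side. The data layout and function names are assumptions, and an exact key set stands in for whatever compact filter a real engine would ship to the scan. It assumes the small side is not empty.

```python
def build_filter(dim_rows, key):
    """Summarize the join keys of the (small) filtered side."""
    keys = {row[key] for row in dim_rows}
    return min(keys), max(keys), keys


def scan_with_semijoin(row_groups, key, flt):
    """Scan the large side, skipping row groups whose min/max statistics
    cannot overlap the key range, then filtering the remaining rows."""
    lo, hi, keys = flt
    out, skipped = [], 0
    for group in row_groups:
        if group["max"] < lo or group["min"] > hi:
            skipped += 1  # whole row group pruned without reading its rows
            continue
        out.extend(r for r in group["rows"] if r[key] in keys)
    return out, skipped


dim = [{"id": 10}, {"id": 12}]
fact_groups = [
    {"min": 1, "max": 5, "rows": [{"id": 1}, {"id": 5}]},
    {"min": 8, "max": 12, "rows": [{"id": 8}, {"id": 10}, {"id": 12}]},
    {"min": 20, "max": 30, "rows": [{"id": 20}, {"id": 30}]},
]

rows, skipped = scan_with_semijoin(fact_groups, "id", build_filter(dim, "id"))
print(rows, skipped)  # [{'id': 10}, {'id': 12}] 2
```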
RUNTIME LATENCY
Motivation
• Previous improvements introduced by the Stinger initiative reduced query latency
by orders of magnitude
– Apache Tez, columnar storage formats and vectorized operators
• Architecture tailored towards cluster throughput
– Execution requires container allocation → Startup time overhead
– Containers killed after query execution → JIT compiler optimizations not effective
– Impossible to exploit data sharing and caching → Unnecessary IO overhead
RUNTIME LATENCY
Apache Hive architecture (next-gen) LLAP
[Architecture diagram]
– Clients: JDBC, ODBC, Beeline
– Shared Hive services: HiveServer2, Hive Metastore, RDBMS, LLAP daemons
– Ephemeral per query tasks: Query Coordinator, containers
– Infrastructure / Hadoop: YARN cluster (node managers), HDFS, object stores (AWS, GCP, Azure)
– External systems: Apache Druid, JDBC, other external engines
RUNTIME LATENCY
LLAP daemon anatomy
[Diagram of an LLAP daemon receiving work from Query Coordinators]
– Execution: work queue of fragments; executors, each running a fragment
– IO elevator: IO queue of requests; readers
– Off-heap cache (encoded data)
– Data (HDFS, object store)
RUNTIME LATENCY
Data caching in LLAP
• Fine-grained compact data cache
– Keep only the columns and rows that are accessed
– Data is stored encoded to minimize memory footprint
– Cache file metadata to enable predicate pushdown (PPD) with no FS reads!
• Supports the most common file formats: ORC, Parquet, Text
• Incremental: Adding new data to your tables does not invalidate the cache
• Pluggable replacement policy: FIFO, LRFU.
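To make the LRFU policy concrete, here is a sketch of the textbook LRFU idea, not LLAP's implementation: each entry keeps a combined recency and frequency (CRF) score that decays by a factor of 0.5^(λ·Δt), and the entry with the lowest decayed score is evicted. The class name and the logical clock are assumptions made for illustration.

```python
class LRFUCache:
    """Textbook-style LRFU: lam close to 0 behaves like LFU,
    lam = 1 behaves like LRU."""

    def __init__(self, capacity, lam=0.5):
        self.capacity, self.lam = capacity, lam
        self.clock = 0       # logical time, one tick per access
        self.entries = {}    # key -> (crf score, time of last access)

    def _decayed(self, key):
        crf, last = self.entries[key]
        return crf * 0.5 ** (self.lam * (self.clock - last))

    def access(self, key):
        self.clock += 1
        if key in self.entries:
            crf = 1.0 + self._decayed(key)   # add this hit to the decayed history
        else:
            if len(self.entries) >= self.capacity:
                victim = min(self.entries, key=self._decayed)
                del self.entries[victim]
            crf = 1.0
        self.entries[key] = (crf, self.clock)


frequency_biased = LRFUCache(capacity=2, lam=0.1)
recency_biased = LRFUCache(capacity=2, lam=1.0)
for key in ["a", "a", "a", "b", "c"]:
    frequency_biased.access(key)
    recency_biased.access(key)

print(sorted(frequency_biased.entries))  # ['a', 'c']: the frequently used 'a' survives
print(sorted(recency_biased.entries))    # ['b', 'c']: the least recently used 'a' is evicted
```

The single parameter λ is what makes the policy tunable between the two extremes.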
RUNTIME LATENCY
Multi-tenant deployments
• Fragment preemption based on state and priorities
• Workload manager
– Define resource plans to share LLAP cluster resources effectively
– Resource-based guardrail policies
Resource plan
– Resource pool BI: 80%
– Resource pool ETL: 20%
– Downgrade when runtime > 3s
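The resource plan on this slide can be modelled with a short sketch. It assumes that "downgrade" means moving a query from the BI pool to the ETL pool once its runtime exceeds 3 seconds. The `Query` type and function names are hypothetical, not the workload manager's API.

```python
from dataclasses import dataclass

# Share of LLAP cluster resources per pool, in percent (from the slide).
POOLS = {"BI": 80, "ETL": 20}


@dataclass
class Query:
    name: str
    pool: str
    runtime_s: float


def apply_trigger(query, threshold_s=3.0, from_pool="BI", to_pool="ETL"):
    """Guardrail: move a long-running query out of the interactive pool."""
    if query.pool == from_pool and query.runtime_s > threshold_s:
        query.pool = to_pool
    return query


def executors_for(pool, total_executors):
    """Number of executors a pool may use under the plan."""
    return total_executors * POOLS[pool] // 100


dashboard = apply_trigger(Query("dashboard", "BI", runtime_s=0.4))
report = apply_trigger(Query("monthly_report", "BI", runtime_s=12.0))

print(dashboard.pool, report.pool)                         # BI ETL
print(executors_for("BI", 50), executors_for("ETL", 50))   # 40 10
```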
[Chart: TPC-DS 10 TB, running on 10 nodes, querying ACID tables on HDFS]
FEDERATED WAREHOUSE SYSTEM
Motivation
• Growing proliferation of specialized data management systems
• Apache Hive as a mediator
– Use a blend of systems to achieve desired performance and functionality
– Implement data movement and transformations between systems
– Globally enforce access control and capture audit trails (Apache Ranger)
– Meet compliance requirements (Apache Atlas)
FEDERATED WAREHOUSE SYSTEM
Storage handler + Calcite adapter
• Storage handler implementation defines how to interact with another data
processing engine
– Treats the engine as an external Hive table
• Calcite adapters define which operations can be pushed to the engine and how
to generate queries for it
• Currently supported systems include Apache Druid, Kafka and JDBC sources
[Diagram: Query → Planning (Calcite) → Execution, shown as operator trees (op1 to op7)]
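A simplified sketch of the pushdown decision follows. It treats the plan as a linear pipeline and pushes the longest prefix of operators that a given engine supports; real planning with Calcite works on operator trees with rewrite rules, so this is only an illustration. The operator kinds and the capability set are hypothetical.

```python
def split_plan(plan, pushable):
    """Split a linear pipeline (source first) into the prefix that the
    external engine can execute and the remainder that runs in Hive."""
    pushed = []
    for op in plan:
        if op["kind"] not in pushable:
            break  # everything after the first unsupported operator stays in Hive
        pushed.append(op)
    return pushed, plan[len(pushed):]


plan = [
    {"kind": "scan", "table": "events"},
    {"kind": "filter", "expr": "country = 'ES'"},
    {"kind": "aggregate", "expr": "count(*) GROUP BY day"},
    {"kind": "window", "expr": "rank() OVER (ORDER BY cnt)"},
    {"kind": "sort", "expr": "day"},
]

# Hypothetical capabilities of an external engine, as an adapter might declare them.
engine_supports = {"scan", "filter", "aggregate", "sort"}

pushed, local = split_plan(plan, engine_supports)
print([op["kind"] for op in pushed])  # ['scan', 'filter', 'aggregate']
print([op["kind"] for op in local])   # ['window', 'sort']
```

The sort is supported by the engine but is not pushed here, because it depends on the output of an operator that has to run in Hive.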
WHAT’S NEXT?
Conclusion and road ahead
• Hive’s architecture and design principles have
proven to be powerful in today’s analytic
landscape
• The work done by the community has taken
Hive a step closer to other existing MPP
database engines
7000 analysts, 80ms average latency, 1PB data
250k BI queries per hour
• Future improvements to Apache Hive
– Compliant, efficient, flexible, extensible
ONE MORE THING
Containerized Hive in the Cloud
Work in progress
• Hive on Kubernetes
– Hive/LLAP side install (to main cluster)
– Multiple versions of Hive
– Multiple warehouse & compute instances
– Dynamic configuration and secrets
management
– Stateful and work preserving restarts (cache)
– Rolling restart for upgrades. Fast rollback to
previous good state
THANK YOU!
Questions?
BRIEF HISTORY
Wide adoption of Hadoop in the enterprise
• YARN for resource management and job scheduling in Hadoop
• Increase in workloads executed natively within Hadoop
– Batch, interactive, iterative, streaming
• Requirements
– Scalability
– Serviceability
– Multi-tenancy
– Locality awareness
– Reliability / Availability
– Secure and auditable operation
– High cluster utilization
– Support for programming model diversity
– Backwards compatible
– Flexible resource model
MOTIVATION
Why extend Hive?
• Apache Hive provided a solid foundation to satisfy these requirements
– Already designed for large-scale reliable computation in Hadoop
– Provided SQL compatibility (alas, limited)
– Implemented connectivity to other systems in the Hadoop ecosystem
• However, it needed to evolve and undergo major renovation
MOTIVATION
Apache Hive architecture (before 2.0)
[Architecture diagram]
– Clients: JDBC, ODBC, Beeline
– Shared Hive services: HiveServer2, Hive Metastore, RDBMS
– Ephemeral per query tasks: Query Coordinator, containers
– Infrastructure / Hadoop: YARN cluster (node managers), HDFS, object stores (AWS, GCP, Azure)
– External systems: Apache Druid, JDBC, other external engines
Offload data from Kafka exactly once.
RUNTIME LATENCY
Low-latency analytical processing
• Interactive queries require more fundamental enhancements
• LLAP (Live Long And Process) optional layer
– Persistent multi-threaded query executors
– Asynchronous IO and multi-tenant in-memory data cache
– Compatible with existing execution runtime

High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGMANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
 
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur EscortsRussian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 

Apache Hive: From MapReduce to Enterprise-grade Big Data Warehousing

  • 1. Apache Hive: From MapReduce to Enterprise-grade Big Data Warehousing Slim Bouguerra (bslim AT apache DOT org ) Apache Druid PMC Apache Hive Committer Apache Calcite Committer Jesús Camacho-Rodríguez, Ashutosh Chauhan, Alan Gates, Eugene Koifman, Owen O’Malley, Vineet Garg, Zoltan Haindrich, Sergey Shelukhin, Prasanth Jayachandran, Siddharth Seth, Deepak Jaiswal, Slim Bouguerra, Nishant Bangarwa, Sankar Hariappan, Anishek Agarwal, Jason Dere, Daniel Dai, Thejas Nair, Nita Dembla, Gopal Vijayaraghavan, Günther Hagleitner SIGMOD 2019 Industrial Track
  • 2. BRIEF HISTORY: HDFS, MapReduce, Hive, and Pig • Hadoop (HDFS, MapReduce) is open sourced in 2006 – Ubiquitous platform for inexpensive data storage and processing – Focused mainly on ETL and batch reporting workloads • Hive (Facebook) and Pig (Yahoo!) are developed to expose a SQL-ish higher-level abstraction for data processing on top of MapReduce. “To the developers of the Hive and Pig database systems, for developing seminal software systems that served to bring relational-style declarative programming to the Hadoop ecosystem” (2018 SIGMOD Systems Award) NO-SQL
  • 3. MOTIVATION • Evolve from a SQL-like batch system to a low-latency, full SQL engine on Hadoop. – Offload existing workloads from major, expensive MPP databases! Option 1: Implement a new system. Option 2: Extend the existing system – It already exists! – Years' worth of hackers' code in the open source community. – Handles XXXL-size ETL very well. – Handles many Hadoop/blob storage consistency edge cases very well.
  • 4. MOTIVATION: Goal • Requirements for our implementation – Compliant: support SQL standard and provide ACID guarantees – Efficient: use optimization techniques present in other MPP databases – Flexible: work reliably for multiple use cases – Extensible: able to interact with other data processing engines
  • 5. APACHE HIVE IMPROVEMENTS: Compliant (SQL and ACID support), Flexible (runtime latency), Efficient (query optimization), Extensible (federation capabilities)
  • 6. APACHE HIVE IMPROVEMENTS: Efficient (query optimization), Compliant (SQL and ACID support), Flexible (runtime latency), Extensible (federation capabilities)
  • 7. SQL AND ACID SUPPORT: ACID implementation • Implementation of ACID-compliant record-level transactions – Support to execute INSERT, UPDATE, DELETE and MERGE statements • How to build this? – Transaction manager – Overcome Hadoop/cloud file system limitations (no updates and S3 fuzzy consistency) • Multi-version optimistic concurrency control (MVOCC) – Snapshot isolation level – Single statement transactions across tables – Performance comparable to non-transactional tables
  • 8. SQL AND ACID SUPPORT: Write transaction. [Diagram: HiveServer2 talks to the Transaction Manager in the Hive Metastore: open transaction returns TxnId; get WriteId (table1, TxnId) returns WriteId; commit (TxnId). Table contents: table1/ contains delta_001_001/ (files 0000, 0001; WriteId = 1), delete_delta_002_002/ (files 0000, 0001; WriteId = 2) and delta_003_003/ (file 0000; WriteId = 3). INSERT record: <ROW__ID>, ‘john’, ‘doe’. DELETE record: <ROW__ID>, null, null. ROW__ID uniquely identifies every record in the table.]
  • 9. SQL AND ACID SUPPORT: Read transaction. [Diagram: HiveServer2 talks to the Transaction Manager in the Hive Metastore: get snapshot returns <TXN_ID_LIST>; get snapshot (table1, <TXN_ID_LIST>) returns <WRITE_ID_LIST>, here WRITE_ID_LIST = [2, ()]. Table contents: table1/ contains delta_001_001/ (files 0000, 0001), delete_delta_002_002/ (files 0000, 0001) and delta_003_003/ (file 0000). INSERT record: <ROW__ID>, ‘john’, ‘doe’. DELETE record: <ROW__ID>, null, null. Labels: “Ignored by record reader”; “Record reader performs anti-semijoin”.]
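The read path on slides 8 and 9 can be illustrated with a small in-memory sketch. Everything here is an assumption made for illustration: the `RowId` fields, the reading of WRITE_ID_LIST = [2, ()] as "a high watermark of 2 with no excluded write IDs", and the sample rows other than ‘john’ ‘doe’. It is not Hive's reader code, only a model of the idea that deltas outside the snapshot are skipped and that insert records are anti-joined against delete records on ROW__ID.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowId:
    """Hypothetical stand-in for ROW__ID: uniquely identifies a record."""
    write_id: int
    bucket: int
    row: int


# Hypothetical in-memory model of table1/: keys are the WriteId of each delta.
# Insert deltas map ROW__ID -> record; delete deltas hold the ROW__IDs deleted.
insert_deltas = {
    1: {RowId(1, 0, 0): ("john", "doe"), RowId(1, 0, 1): ("jane", "roe")},
    3: {RowId(3, 0, 0): ("late", "writer")},
}
delete_deltas = {
    2: {RowId(1, 0, 0)},
}


def read_snapshot(insert_deltas, delete_deltas, high_watermark, excluded=()):
    """Return the records visible to a reader holding the given snapshot."""

    def visible(write_id):
        # Assumed semantics: visible if at or below the watermark and not excluded.
        return write_id <= high_watermark and write_id not in excluded

    # Collect every ROW__ID deleted by a delete delta that is in the snapshot.
    deleted = set()
    for write_id, row_ids in delete_deltas.items():
        if visible(write_id):
            deleted |= row_ids

    # Anti-semijoin: keep inserted records whose ROW__ID has no matching delete.
    rows = {}
    for write_id, records in insert_deltas.items():
        if visible(write_id):
            for row_id, record in records.items():
                if row_id not in deleted:
                    rows[row_id] = record
    return rows


# Snapshot corresponding to WRITE_ID_LIST = [2, ()] under the assumption above.
snapshot = read_snapshot(insert_deltas, delete_deltas, high_watermark=2)
```

With a watermark of 2, the delta written with WriteId 3 is skipped and the ‘john’ ‘doe’ record is removed by the delete delta, so only the second sample record remains.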
  • 10. SQL AND ACID SUPPORT: Compactor • Minor compaction: merge files in delta directories. Example: table1/ with delta_001_001/ (0000, 0001), delete_delta_002_002/ (0000, 0001) and delta_003_003/ (0000) becomes table1/ with delta_001_003/ (0000, 0001) and delete_delta_002_002/ (0000, 0001). • Major compaction: merge delta files with base directories. Example: table2/ with base_100/ (0000, 0001) and delta_101_103/ (0000, 0001) becomes table2/ with base_103/ (0000, 0001).
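A minimal sketch of how the directory names on slide 10 change under compaction, assuming the naming pattern visible on the slide (`delta_<min>_<max>`, `delete_delta_<min>_<max>`, `base_<max>`). The function names are hypothetical and only the names are modeled; merging the file contents is out of scope.

```python
import re

# Directory name patterns as they appear on the slide.
DIR_PATTERN = re.compile(r"^(delta|delete_delta)_(\d+)_(\d+)$|^base_(\d+)$")


def parse(name):
    """Split a directory name into (kind, min_write_id, max_write_id)."""
    match = DIR_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized directory name: {name}")
    if match.group(4) is not None:
        return ("base", 0, int(match.group(4)))
    return (match.group(1), int(match.group(2)), int(match.group(3)))


def minor_compaction(dirs):
    """Merge delta directories per kind into one directory spanning their range."""
    ranges = {}
    result = []
    for name in dirs:
        kind, low, high = parse(name)
        if kind == "base":
            result.append(name)  # minor compaction leaves the base untouched
            continue
        current = ranges.get(kind)
        if current is None:
            ranges[kind] = (low, high)
        else:
            ranges[kind] = (min(current[0], low), max(current[1], high))
    for kind in ("delta", "delete_delta"):
        if kind in ranges:
            low, high = ranges[kind]
            result.append(f"{kind}_{low:03d}_{high:03d}")
    return result


def major_compaction(dirs):
    """Fold everything into a single base named after the highest write ID."""
    highest = max(parse(name)[2] for name in dirs)
    return [f"base_{highest:03d}"]


table1 = minor_compaction(["delta_001_001", "delete_delta_002_002", "delta_003_003"])
table2 = major_compaction(["base_100", "delta_101_103"])
```

Applied to the two examples from the slide, this reproduces the names shown there: `delta_001_003` plus `delete_delta_002_002` for table1, and `base_103` for table2.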
  • 11. APACHE HIVE IMPROVEMENTS: Efficient (query optimization), Compliant (SQL and ACID support), Flexible (runtime latency), Extensible (federation capabilities)
  • 12. QUERY OPTIMIZATION: Work smarter, not harder • Rule and cost-based optimizer based on Apache Calcite – Representing queries at the right abstraction level is critical to implementing advanced optimization algorithms • Query reoptimization – Catches runtime errors and re-executes query, changing configuration parameters (overlay) or using statistics captured at runtime (re-optimize) • Query results cache – Reuses the results of a previously executed query by checking the internal transactional state of the participating tables
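The query results cache idea can be sketched as follows. This is a toy model under stated assumptions, not Hive's implementation: the class name, the use of the query text as the key, and the use of a per-table "latest write ID" as the transactional state are all hypothetical choices made for illustration.

```python
class ResultsCache:
    """Toy results cache validated against the transactional state of tables."""

    def __init__(self):
        # query text -> (table state captured when the result was computed, result)
        self._entries = {}

    def store(self, query, table_state, result):
        # Copy the state so later changes by the caller do not affect the entry.
        self._entries[query] = (dict(table_state), result)

    def lookup(self, query, table_state):
        """Return the cached result only if no participating table has changed."""
        entry = self._entries.get(query)
        if entry is None:
            return None
        cached_state, result = entry
        if cached_state != table_state:
            # Some table received new writes: the cached result is stale.
            del self._entries[query]
            return None
        return result


cache = ResultsCache()
cache.store("SELECT count(*) FROM table1", {"table1": 3}, [(42,)])
hit = cache.lookup("SELECT count(*) FROM table1", {"table1": 3})
miss = cache.lookup("SELECT count(*) FROM table1", {"table1": 4})
```

The first lookup sees the same table state and reuses the result; the second sees a newer write on `table1`, so the entry is dropped and the query would have to run again.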
  • 13. QUERY OPTIMIZATION: Work smarter, not harder • Materialized views: – Transparent query rewriting (rich SQL dialect), incremental maintenance • Shared work: – Identifying overlapping subexpressions within the execution plan of a given query, computing them only once and reusing their results • Dynamic semijoin: – Reduces the size of intermediate results during query execution by skipping complete partitions (dynamic partition pruning) or row groups (index semijoin)
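Dynamic partition pruning, as described on the slide above, can be sketched in a few lines. The table layout, column names (`d_date_sk`, `d_year`) and function name are assumptions for illustration: the filtered dimension side is evaluated first, and its join keys decide at run time which fact partitions are scanned at all.

```python
def dynamic_partition_pruning(fact_partitions, dim_rows, dim_predicate):
    """Scan only the fact partitions whose key survives the dimension filter.

    fact_partitions: dict mapping partition key -> list of fact rows
    dim_rows: list of dimension rows (dicts) holding the join key 'd_date_sk'
    dim_predicate: filter applied to the dimension table
    """
    # Step 1: evaluate the small, filtered side and collect its join keys.
    surviving_keys = {row["d_date_sk"] for row in dim_rows if dim_predicate(row)}
    # Step 2: skip every fact partition whose key cannot produce a join match.
    return {
        key: rows for key, rows in fact_partitions.items() if key in surviving_keys
    }


# Hypothetical data: a fact table partitioned by date key, and a date dimension.
fact_partitions = {
    1: [{"item": "a", "amount": 10}],
    2: [{"item": "b", "amount": 20}],
    3: [{"item": "c", "amount": 30}],
}
dim_rows = [
    {"d_date_sk": 1, "d_year": 2018},
    {"d_date_sk": 2, "d_year": 2019},
    {"d_date_sk": 3, "d_year": 2019},
]

scanned = dynamic_partition_pruning(
    fact_partitions, dim_rows, lambda row: row["d_year"] == 2019
)
```

Here the partition with key 1 is never read, because no dimension row for 2019 refers to it.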
  • 14. APACHE HIVE IMPROVEMENTS: Efficient (query optimization), Compliant (SQL and ACID support), Flexible (runtime latency), Extensible (federation capabilities)
  • 15. RUNTIME LATENCY: Motivation • Previous improvements introduced by the Stinger initiative reduced query latency by orders of magnitude – Apache Tez, columnar storage formats and vectorized operators • Architecture tailored towards cluster throughput – Execution requires container allocation → Startup time overhead – Containers killed after query execution → JIT compiler optimizations not effective – Impossible to exploit data sharing and caching → Unnecessary IO overhead
  • 16. RUNTIME LATENCY: Apache Hive architecture (next-gen), LLAP. [Architecture diagram. Clients: JDBC, ODBC, Beeline. Shared Hive services: HiveServer2, Hive Metastore, RDBMS. Infrastructure / Hadoop: YARN cluster of node managers; HDFS; object stores (AWS, GCP, Azure); Apache Druid, JDBC, other external engines. Ephemeral per query tasks: query coordinator, containers. LLAP: LLAP daemons, LLAP coordinator.]
  • 17. RUNTIME LATENCY: LLAP daemon anatomy. [Diagram: several query coordinators send requests to an LLAP daemon. Execution: a work queue of fragments feeds executors, each running a fragment. IO elevator: an IO queue feeds readers that fetch data (HDFS, object store). Off-heap cache (encoded data).]
  • 18. RUNTIME LATENCY: Data caching in LLAP • Fine-grained compact data cache – Keep only the columns and rows that are accessed – Data is stored encoded to minimize memory footprint – Cache file metadata to enable predicate pushdown (PPD) with no FS reads! • Supports most common file formats: ORC, Parquet, Text • Incremental: adding new data to your tables does not invalidate the cache • Pluggable replacement policy: FIFO, LRFU.
  • 19. RUNTIME LATENCY: Multi-tenant deployments • Fragment preemption based on state, priorities • Workload manager – Define plans to share effectively LLAP cluster resources – Resource-based guardrail policies. Example resource plan: resource pool BI: 80%; resource pool ETL: 20%; downgrade when runtime > 3s
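The example resource plan on slide 19 can be modeled with a short sketch. The pool shares and the 3-second threshold come from the slide; everything else is an assumption for illustration, in particular that "downgrade" means moving a long-running query from the BI pool to the ETL pool, and all class and function names.

```python
from dataclasses import dataclass

# Pool shares from the slide, as percentages of the LLAP cluster.
RESOURCE_PLAN = {"bi": 80, "etl": 20}
# Guardrail threshold from the slide.
DOWNGRADE_AFTER_SECONDS = 3.0


@dataclass
class Query:
    name: str
    pool: str
    runtime_seconds: float


def executors_per_pool(total_executors):
    """Split a number of executors across pools according to the plan."""
    return {
        pool: total_executors * share // 100 for pool, share in RESOURCE_PLAN.items()
    }


def apply_guardrail(query):
    """Assumed downgrade rule: move long-running BI queries to the ETL pool."""
    if query.pool == "bi" and query.runtime_seconds > DOWNGRADE_AFTER_SECONDS:
        query.pool = "etl"
    return query


allocation = executors_per_pool(10)
fast = apply_guardrail(Query("dashboard", "bi", 0.4))
slow = apply_guardrail(Query("ad-hoc scan", "bi", 7.5))
```

With 10 executors, the plan gives 8 to BI and 2 to ETL; the short dashboard query stays in the BI pool while the 7.5-second query is moved out of it so that it no longer competes with interactive traffic.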
  • 20. TPC-DS at 10 TB scale, running on 10 nodes, querying ACID tables on HDFS
  • 21. APACHE HIVE IMPROVEMENTS: Efficient (query optimization), Compliant (SQL and ACID support), Flexible (runtime latency), Extensible (federation capabilities)
  • 22. FEDERATED WAREHOUSE SYSTEM: Motivation • Growing proliferation of specialized data management systems • Apache Hive as a mediator – Use a blend of systems to achieve desired performance and functionality – Implement data movement and transformations between systems – Globally enforce access control and capture audit trails (Apache Ranger) – Meet compliance requirements (Apache Atlas)
  • 23. FEDERATED WAREHOUSE SYSTEM: Storage handler + Calcite adapter • Storage handler implementation defines how to interact with another data processing engine – Treats the engine as an external Hive table • Calcite adapters define which operations can be pushed to the engine and how to generate queries for it • Currently supported systems include Apache Druid, Kafka and JDBC sources. [Diagram: query planning (Calcite) and execution over an operator tree, op1 to op7.]
  • 25. Conclusion and road ahead • Hive’s architecture and design principles have proven to be powerful in today’s analytic landscape • The work done by the community has taken Hive a step closer to other existing MPP database engines: 7000 analysts, 80 ms average latency, 1 PB data; 250k BI queries per hour • Future improvements to Apache Hive – Compliant, efficient, flexible, extensible
  • 27. Containerized Hive in the Cloud: Work in progress • Hive on Kubernetes – Hive/LLAP side install (to main cluster) – Multiple versions of Hive – Multiple warehouse & compute instances – Dynamic configuration and secrets management – Stateful and work-preserving restarts (cache) – Rolling restart for upgrades. Fast rollback to previous good state
  • 29. BRIEF HISTORY: Wide adoption of Hadoop in the enterprise • YARN for resource management and job scheduling in Hadoop • Increase workloads executed natively within Hadoop – Batch, interactive, iterative, streaming. [Diagram labels: Scalability; Serviceability; Multi-tenancy; Locality awareness; Reliability / Availability; Secure and auditable operation; High cluster utilization; Support for programming model diversity; Backwards compatible; Flexible resource model.]
  • 30. MOTIVATION: Why extend Hive? • Apache Hive provided a solid foundation to satisfy these requirements – Already designed for large-scale reliable computation in Hadoop – Provided SQL compatibility (alas, limited) – Implemented connectivity to other systems in the Hadoop ecosystem • However, it needed to evolve and undergo major renovation
  • 31. MOTIVATION: Apache Hive architecture (before 2.0). [Architecture diagram. Clients: JDBC, ODBC, Beeline. Shared Hive services: HiveServer2, Hive Metastore, RDBMS. Infrastructure / Hadoop: YARN cluster of node managers; HDFS; object stores (AWS, GCP, Azure); Apache Druid, JDBC, other external engines. Ephemeral per query tasks: query coordinator, containers.]
  • 32. Offload data from Kafka exactly once.
  • 33. RUNTIME LATENCY: Low-latency analytical processing • Interactive queries require more fundamental enhancements • LLAP (Live Long And Process) optional layer – Persistent multi-threaded query executors – Asynchronous IO and multi-tenant in-memory data cache – Compatible with existing execution runtime