Apache Hive: From MapReduce to Enterprise-grade Big Data Warehousing
c-bslim
SIGMOD 2019 Apache Hive Presentation.
1.
Apache Hive: From MapReduce to Enterprise-grade Big Data Warehousing
Slim Bouguerra (bslim AT apache DOT org)
Apache Druid PMC, Apache Hive Committer, Apache Calcite Committer
Jesús Camacho-Rodríguez, Ashutosh Chauhan, Alan Gates, Eugene Koifman, Owen O’Malley, Vineet Garg, Zoltan Haindrich, Sergey Shelukhin, Prasanth Jayachandran, Siddharth Seth, Deepak Jaiswal, Slim Bouguerra, Nishant Bangarwa, Sankar Hariappan, Anishek Agarwal, Jason Dere, Daniel Dai, Thejas Nair, Nita Dembla, Gopal Vijayaraghavan, Günther Hagleitner
SIGMOD 2019 Industrial Track
2.
© 2019 Cloudera, Inc. All rights reserved.
BRIEF HISTORY
HDFS, MapReduce, Hive, and Pig
• Hadoop (HDFS, MapReduce) is open sourced in 2006
– Ubiquitous platform for inexpensive data storage and processing
– Focused mainly on ETL and batch reporting workloads
• Hive (Facebook) and Pig (Yahoo!) are developed to expose a SQL-ish higher-level abstraction for data processing on top of MapReduce
2018 SIGMOD Systems Award: “To the developers of the Hive and Pig database systems, for developing seminal software systems that served to bring relational-style declarative programming to the Hadoop ecosystem”
NO-SQL
3.
MOTIVATION
• Evolve from a SQL-like batch system to a low-latency, full SQL engine on Hadoop
– Offload existing workloads from major, expensive MPP databases
Option 1: Implement a new system
Option 2: Extend the existing system
– It already exists
– Years' worth of code from the open source community
– Handles XXXL-size ETL very well
– Handles many Hadoop/blob storage consistency edge cases very well
4.
MOTIVATION
Goal
• Requirements for our implementation
– Compliant: support the SQL standard and provide ACID guarantees
– Efficient: use optimization techniques present in other MPP databases
– Flexible: work reliably for multiple use cases
– Extensible: able to interact with other data processing engines
5.
APACHE HIVE IMPROVEMENTS
• Compliant: SQL and ACID support
• Flexible: Runtime latency
• Efficient: Query optimization
• Extensible: Federation capabilities
6.
APACHE HIVE IMPROVEMENTS
• Efficient: Query optimization
• Compliant: SQL and ACID support
• Flexible: Runtime latency
• Extensible: Federation capabilities
7.
SQL AND ACID SUPPORT
ACID implementation
• Implementation of ACID-compliant, record-level transactions
– Support for executing INSERT, UPDATE, DELETE and MERGE statements
• How to build this?
– Transaction manager
– Overcome Hadoop/cloud file system limitations (no updates, and S3's fuzzy consistency)
• Multi-version optimistic concurrency control (MVOCC)
– Snapshot isolation level
– Single-statement transactions across tables
– Performance comparable to non-transactional tables
8.
SQL AND ACID SUPPORT
Write transaction
[Diagram: HiveServer2 talking to the Transaction Manager (Hive Metastore)]
– open transaction → TxnId
– get WriteId (table1, TxnId) → WriteId
– commit (TxnId)
Table contents
table1/
├── delta_001_001/ (WriteId = 1)
│   ├── 0000
│   └── 0001
├── delete_delta_002_002/ (WriteId = 2)
│   ├── 0000
│   └── 0001
└── delta_003_003/ (WriteId = 3)
    └── 0000
INSERT record: <ROW__ID> ‘john’ ‘doe’
DELETE record: <ROW__ID> null null
<ROW__ID> uniquely identifies every record in the table
9.
SQL AND ACID SUPPORT
Read transaction
[Diagram: HiveServer2 talking to the Transaction Manager (Hive Metastore)]
– get snapshot → <TXN_ID_LIST>
– get snapshot (table1, <TXN_ID_LIST>) → <WRITE_ID_LIST>
– WRITE_ID_LIST = [2, ()]
Table contents
table1/
├── delta_001_001/
│   ├── 0000
│   └── 0001
├── delete_delta_002_002/
│   ├── 0000
│   └── 0001
└── delta_003_003/
    └── 0000
INSERT record: <ROW__ID> ‘john’ ‘doe’
DELETE record: <ROW__ID> null null
Diagram annotations: "Ignored by record reader"; "Record reader performs anti-semijoin"
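The read path on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Hive's record reader: the in-memory dictionaries stand in for the delta directories, `RowId` is a simplified stand-in for `ROW__ID`, and reading `WRITE_ID_LIST = [2, ()]` as "high watermark 2, no excluded write IDs" is an assumption made for the example.

```python
from typing import Dict, FrozenSet, List, Tuple

RowId = Tuple[int, int]   # hypothetical (write_id, row_number) stand-in for ROW__ID
Row = Tuple[str, str]

# Hypothetical contents of delta_001_001 and delta_003_003 (insert deltas).
INSERT_DELTAS: Dict[int, List[Tuple[RowId, Row]]] = {
    1: [((1, 0), ("john", "doe")), ((1, 1), ("jane", "roe"))],
    3: [((3, 0), ("ann", "lee"))],
}
# Hypothetical contents of delete_delta_002_002: write 2 deletes row (1, 0).
DELETE_DELTAS: Dict[int, List[RowId]] = {2: [(1, 0)]}


def read_snapshot(
    high_watermark: int, excluded: FrozenSet[int] = frozenset()
) -> List[Tuple[RowId, Row]]:
    """Return the rows visible to a reader with the given write-ID snapshot."""

    def visible(write_id: int) -> bool:
        return write_id <= high_watermark and write_id not in excluded

    # Collect the row IDs removed by delete deltas that are part of the snapshot.
    deleted = {
        row_id
        for write_id, row_ids in DELETE_DELTAS.items()
        if visible(write_id)
        for row_id in row_ids
    }
    # Anti-semijoin: keep inserted rows whose ID has no match in the delete set.
    return [
        (row_id, row)
        for write_id, rows in sorted(INSERT_DELTAS.items())
        if visible(write_id)
        for row_id, row in rows
        if row_id not in deleted
    ]


if __name__ == "__main__":
    print(read_snapshot(high_watermark=2))
```

With a high watermark of 2, the delta written with write ID 3 is skipped entirely, and the row deleted by write ID 2 is filtered out by the anti-semijoin.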
10.
SQL AND ACID SUPPORT
Compactor
• Minor compaction: Merge files in delta directories
• Major compaction: Merge delta files with base directories
Table contents (before minor compaction)
table1/
├── delta_001_001/
│   ├── 0000
│   └── 0001
├── delete_delta_002_002/
│   ├── 0000
│   └── 0001
└── delta_003_003/
    └── 0000
Table contents (after minor compaction)
table1/
├── delta_001_003/
│   ├── 0000
│   └── 0001
└── delete_delta_002_002/
    ├── 0000
    └── 0001
Table contents (before major compaction)
table2/
├── base_100/
│   ├── 0000
│   └── 0001
└── delta_101_103/
    ├── 0000
    └── 0001
Table contents (after major compaction)
table2/
└── base_103/
    ├── 0000
    └── 0001
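The directory naming in the compaction example can be reproduced with a small Python sketch. It only computes the resulting directory names from the input names; it does not merge any files, and the function names `minor_compact` and `major_compact` are hypothetical helpers invented for this illustration.

```python
import re
from typing import List

DELTA = re.compile(r"^(delta|delete_delta)_(\d+)_(\d+)$")
BASE = re.compile(r"^base_(\d+)$")


def minor_compact(dirs: List[str]) -> List[str]:
    """Merge insert deltas into one directory and delete deltas into another."""
    result = [d for d in dirs if BASE.match(d)]  # base directories are untouched
    for kind in ("delta", "delete_delta"):
        ranges = [
            (int(m.group(2)), int(m.group(3)))
            for m in map(DELTA.match, dirs)
            if m and m.group(1) == kind
        ]
        if ranges:
            low = min(r[0] for r in ranges)
            high = max(r[1] for r in ranges)
            # Three-digit padding mirrors the names shown on the slide.
            result.append(f"{kind}_{low:03d}_{high:03d}")
    return result


def major_compact(dirs: List[str]) -> List[str]:
    """Fold the base and all deltas into a single new base directory."""
    high = 0
    for d in dirs:
        base, delta = BASE.match(d), DELTA.match(d)
        if base:
            high = max(high, int(base.group(1)))
        elif delta:
            high = max(high, int(delta.group(3)))
    return [f"base_{high}"]


if __name__ == "__main__":
    print(minor_compact(["delta_001_001", "delete_delta_002_002", "delta_003_003"]))
    print(major_compact(["base_100", "delta_101_103"]))
```

The new names carry the range of write IDs they cover, which is what lets a reader decide from the directory listing alone which directories belong to its snapshot.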
11.
APACHE HIVE IMPROVEMENTS
• Efficient: Query optimization
• Compliant: SQL and ACID support
• Flexible: Runtime latency
• Extensible: Federation capabilities
12.
QUERY OPTIMIZATION
Work smarter, not harder
• Rule and cost-based optimizer based on Apache Calcite
– Representing queries at the right abstraction level is critical to implementing advanced optimization algorithms
• Query reoptimization
– Catches runtime errors and re-executes the query, changing configuration parameters (overlay) or using statistics captured at runtime (re-optimize)
• Query results cache
– Reuses the results of a previously executed query by checking the internal transactional state of the participating tables
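The query results cache idea can be illustrated with a short Python sketch. It assumes, purely for illustration, that the "internal transactional state" of a table can be summarized as a single write ID per table; the `ResultsCache` class and its `run` method are hypothetical and are not Hive APIs.

```python
from typing import Callable, Dict, Tuple

CacheKey = Tuple[str, Tuple[Tuple[str, int], ...]]


class ResultsCache:
    """Cache query results keyed by query text plus per-table write IDs."""

    def __init__(self) -> None:
        self._entries: Dict[CacheKey, list] = {}
        self.hits = 0

    def run(
        self,
        query: str,
        table_write_ids: Dict[str, int],
        execute: Callable[[], list],
    ) -> list:
        # Any write to a participating table changes its write ID, which
        # changes the key, so stale results are never returned.
        key = (query, tuple(sorted(table_write_ids.items())))
        if key in self._entries:
            self.hits += 1
            return self._entries[key]
        result = execute()
        self._entries[key] = result
        return result


if __name__ == "__main__":
    cache = ResultsCache()
    query = "SELECT count(*) FROM sales"
    cache.run(query, {"sales": 7}, lambda: [100])
    cache.run(query, {"sales": 7}, lambda: [100])   # served from the cache
    cache.run(query, {"sales": 8}, lambda: [101])   # table changed, re-executed
    print(cache.hits)
```

Keying on the table state rather than invalidating entries on every write keeps the lookup a pure comparison, at the cost of leaving old entries behind until they are evicted.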
13.
QUERY OPTIMIZATION
Work smarter, not harder
• Materialized views:
– Transparent query rewriting (rich SQL dialect), incremental maintenance
• Shared work:
– Identifies overlapping subexpressions within the execution plan of a given query, computes them only once and reuses their results
• Dynamic semijoin:
– Reduces the size of intermediate results during query execution by skipping complete partitions (dynamic partition pruning) or row groups (index semijoin)
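Dynamic partition pruning can be sketched as follows. The example assumes a fact table partitioned by the join key and a small dimension table that is filtered first; the table contents, the `promo` column, and the `prune_partitions` helper are all hypothetical and only illustrate the idea of deriving the partition list at runtime.

```python
from typing import Callable, Dict, List


def prune_partitions(
    fact_partitions: Dict[str, List[dict]],
    dim_rows: List[dict],
    dim_filter: Callable[[dict], bool],
    join_key: str,
) -> Dict[str, List[dict]]:
    """Keep only the fact partitions whose key survives the dimension filter."""
    # Evaluate the selective side of the join first and collect its join keys.
    wanted = {row[join_key] for row in dim_rows if dim_filter(row)}
    # Partitions whose key is not in the set are skipped without being read.
    return {key: rows for key, rows in fact_partitions.items() if key in wanted}


if __name__ == "__main__":
    fact = {
        "2019-01": [{"amount": 10}],
        "2019-02": [{"amount": 20}],
        "2019-03": [{"amount": 30}],
    }
    dim = [
        {"month": "2019-01", "promo": False},
        {"month": "2019-02", "promo": True},
        {"month": "2019-03", "promo": False},
    ]
    scanned = prune_partitions(fact, dim, lambda r: r["promo"], "month")
    print(sorted(scanned))
```

The index semijoin variant mentioned on the slide applies the same key set at a finer granularity (row groups inside files) instead of whole partitions.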
14.
APACHE HIVE IMPROVEMENTS
• Efficient: Query optimization
• Compliant: SQL and ACID support
• Flexible: Runtime latency
• Extensible: Federation capabilities
15.
RUNTIME LATENCY
Motivation
• Previous improvements introduced by the Stinger initiative reduced query latency by orders of magnitude
– Apache Tez, columnar storage formats and vectorized operators
• Architecture tailored towards cluster throughput
– Execution requires container allocation → Startup time overhead
– Containers killed after query execution → JIT compiler optimizations not effective
– Impossible to exploit data sharing and caching → Unnecessary IO overhead
16.
RUNTIME LATENCY
Apache Hive architecture (next-gen)
[Architecture diagram]
– Layers: Shared Hive services; Ephemeral per query tasks; LLAP; Infrastructure / Hadoop
– Clients: JDBC, ODBC, Beeline
– Hive components: HiveServer2, Hive Metastore, RDBMS, Query Coordinator, LLAP Coordinator, LLAP daemons, containers
– Cluster and storage: YARN cluster (node managers), HDFS, object stores (AWS, GCP, Azure)
– External systems: Apache Druid, JDBC, other external engines
17.
RUNTIME LATENCY
LLAP daemon anatomy
[Diagram: LLAP daemon receiving requests from several Query Coordinators]
– Execution: work queue of fragments; executors, each running a fragment
– IO elevator: IO queue; readers fetching data (HDFS, object store)
– Off-heap cache (encoded data)
18.
RUNTIME LATENCY
Data caching in LLAP
• Fine-grained compact data cache
– Keep only the columns and rows that are accessed
– Data is stored encoded to minimize memory footprint
– Cache file metadata to enable predicate pushdown (PPD) with no FS reads
• Supports the most common file formats: ORC, Parquet, Text
• Incremental: adding new data to your tables does not invalidate the cache
• Pluggable replacement policy: FIFO, LRFU
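To make the LRFU (least recently/frequently used) policy concrete, here is a minimal Python sketch of the textbook algorithm. It is not LLAP's implementation: the class name, the logical clock, and the `lam` parameter are choices made for this example. Each entry carries a combined recency and frequency (CRF) score that decays by a factor of 0.5 ** (lam * age); a small `lam` behaves like LFU, and `lam = 1` behaves like LRU.

```python
from typing import Dict, Hashable, Tuple


class LRFUCache:
    """Textbook LRFU replacement over a fixed number of cache slots."""

    def __init__(self, capacity: int, lam: float = 0.5) -> None:
        self.capacity = capacity
        self.lam = lam
        self.clock = 0  # logical time, advanced on every access
        # key -> (CRF score at last access, time of last access)
        self.entries: Dict[Hashable, Tuple[float, int]] = {}

    def _decay(self, age: int) -> float:
        return 0.5 ** (self.lam * age)

    def _score_now(self, key: Hashable) -> float:
        crf, last = self.entries[key]
        return crf * self._decay(self.clock - last)

    def access(self, key: Hashable) -> bool:
        """Record an access; return True on a hit and False on a miss."""
        self.clock += 1
        if key in self.entries:
            crf, last = self.entries[key]
            # A hit adds 1 to the score decayed up to the current time.
            self.entries[key] = (1.0 + crf * self._decay(self.clock - last), self.clock)
            return True
        if len(self.entries) >= self.capacity:
            # Evict the entry whose decayed score is currently the lowest.
            victim = min(self.entries, key=self._score_now)
            del self.entries[victim]
        self.entries[key] = (1.0, self.clock)
        return False


if __name__ == "__main__":
    cache = LRFUCache(capacity=2, lam=0.1)
    for key in ["a", "a", "a", "b", "c"]:
        cache.access(key)
    print(sorted(cache.entries))
```

In this access sequence a frequency-leaning setting (`lam=0.1`) keeps the hot key `a` and evicts `b`, while a recency-leaning setting (`lam=1.0`) evicts `a` because it was touched longest ago.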
19.
RUNTIME LATENCY
Multi-tenant deployments
• Fragment preemption based on state, priorities
• Workload manager
– Define plans to share LLAP cluster resources effectively
– Resource-based guardrail policies
Resource plan (example)
– Resource pool BI: 80%
– Resource pool ETL: 20%
– Downgrade when runtime > 3s
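The example resource plan can be modeled with a few lines of Python. This is a sketch under stated assumptions: the `Pool` and `Trigger` classes are hypothetical, and interpreting "downgrade" as moving a long-running query from the BI pool to the ETL pool is an assumption, since the slide does not name the target pool.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Pool:
    name: str
    percent: int  # share of the LLAP cluster's executors


@dataclass(frozen=True)
class Trigger:
    from_pool: str
    to_pool: str          # assumed target of the downgrade
    max_runtime_s: float


PLAN: List[Pool] = [Pool("BI", 80), Pool("ETL", 20)]
DOWNGRADE = Trigger(from_pool="BI", to_pool="ETL", max_runtime_s=3.0)


def executors_per_pool(pools: List[Pool], total_executors: int) -> Dict[str, int]:
    """Split the executors across pools according to their percentages."""
    return {p.name: total_executors * p.percent // 100 for p in pools}


def current_pool(start_pool: str, runtime_s: float, trigger: Trigger) -> str:
    """Apply the guardrail: move the query once it exceeds the runtime limit."""
    if start_pool == trigger.from_pool and runtime_s > trigger.max_runtime_s:
        return trigger.to_pool
    return start_pool


if __name__ == "__main__":
    print(executors_per_pool(PLAN, total_executors=10))
    print(current_pool("BI", runtime_s=5.2, trigger=DOWNGRADE))
```

A guardrail like this keeps short interactive queries from queuing behind a query that turned out to be much heavier than expected.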
20.
[Chart: TPC-DS at 10TB scale, running on 10 nodes, querying ACID tables on HDFS]
21.
APACHE HIVE IMPROVEMENTS
• Efficient: Query optimization
• Compliant: SQL and ACID support
• Flexible: Runtime latency
• Extensible: Federation capabilities
22.
FEDERATED WAREHOUSE SYSTEM
Motivation
• Growing proliferation of specialized data management systems
• Apache Hive as a mediator
– Use a blend of systems to achieve desired performance and functionality
– Implement data movement and transformations between systems
– Globally enforce access control and capture audit trails (Apache Ranger)
– Meet compliance requirements (Apache Atlas)
23.
FEDERATED WAREHOUSE SYSTEM
Storage handler + Calcite adapter
• Storage handler implementation defines how to interact with another data processing engine
– Treats the engine as an external Hive table
• Calcite adapters define which operations can be pushed to the engine and how to generate queries for it
• Currently supported systems include Apache Druid, Kafka and JDBC sources
[Diagram: operator trees (op1 to op7) during query planning (Calcite) and execution]
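The pushdown decision can be illustrated with a simplified Python sketch. It treats a plan as a linear chain of operator names listed from the scan upward and pushes the longest supported prefix to the external engine. Real Calcite planning works on operator trees with rules and costs, so this is only an approximation of the idea; the operator names and the capability set are hypothetical.

```python
from typing import List, Set, Tuple


def split_plan(ops: List[str], pushable: Set[str]) -> Tuple[List[str], List[str]]:
    """Split a bottom-up operator chain into (pushed to engine, run in Hive)."""
    pushed: List[str] = []
    for op in ops:
        # Stop at the first operator the engine cannot run: everything above
        # it depends on its output and therefore has to stay in Hive too.
        if op not in pushable:
            break
        pushed.append(op)
    return pushed, ops[len(pushed):]


if __name__ == "__main__":
    plan = ["scan", "filter", "aggregate", "window", "sort"]
    engine_supports = {"scan", "filter", "aggregate", "sort"}  # hypothetical
    pushed, remaining = split_plan(plan, engine_supports)
    print("pushed to engine:", pushed)
    print("executed by Hive:", remaining)
```

Note that `sort` stays in Hive even though the engine supports it, because it sits above an operator that could not be pushed.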
24.
WHAT’S NEXT?
25.
Conclusion and road ahead
• Hive’s architecture and design principles have proven to be powerful in today’s analytic landscape
• The work done by the community has taken Hive a step closer to other existing MPP database engines
7000 analysts, 80ms average latency, 1PB data
250k BI queries per hour
• Future improvements to Apache Hive
– Compliant, efficient, flexible, extensible
26.
ONE MORE THING
27.
Containerized Hive in the Cloud
Work in progress
• Hive on Kubernetes
– Hive/LLAP side install (to main cluster)
– Multiple versions of Hive
– Multiple warehouse & compute instances
– Dynamic configuration and secrets management
– Stateful and work-preserving restarts (cache)
– Rolling restart for upgrades; fast rollback to previous good state
28.
THANK YOU! Questions?
29.
BRIEF HISTORY
Wide adoption of Hadoop in the enterprise
• YARN for resource management and job scheduling in Hadoop
• Increase workloads executed natively within Hadoop
– Batch, interactive, iterative, streaming
[Diagram labels: Scalability, Serviceability, Multi-tenancy, Locality awareness, Reliability / Availability, Secure and auditable operation, High cluster utilization, Support for programming model diversity, Backwards compatible, Flexible resource model]
30.
MOTIVATION
Why extend Hive?
• Apache Hive provided a solid foundation to satisfy these requirements
– Already designed for large-scale reliable computation in Hadoop
– Provided SQL compatibility (alas, limited)
– Implemented connectivity to other systems in the Hadoop ecosystem
• However, it needed to evolve and undergo major renovation
31.
MOTIVATION
Apache Hive architecture (before 2.0)
[Architecture diagram]
– Layers: Shared Hive services; Ephemeral per query tasks; Infrastructure / Hadoop
– Clients: JDBC, ODBC, Beeline
– Hive components: HiveServer2, Hive Metastore, RDBMS, Query Coordinator, containers
– Cluster and storage: YARN cluster (node managers), HDFS, object stores (AWS, GCP, Azure)
– External systems: Apache Druid, JDBC, other external engines
32.
Offload data from Kafka exactly once.
33.
RUNTIME LATENCY
Low-latency analytical processing
• Interactive queries require more fundamental enhancements
• LLAP (Live Long And Process) optional layer
– Persistent multi-threaded query executors
– Asynchronous IO and multi-tenant in-memory data cache
– Compatible with existing execution runtime