High Volume Updates in Apache Hive

Apache Hive provides a convenient SQL-based query language for data stored in HDFS. HDFS provides highly scalable bandwidth to the data, but does not support arbitrary writes. One of Hortonworks' customers needs to store a high volume of customer data (> 1 TB/day), and that data contains a high percentage (15%) of record updates distributed across years. In many high-update use cases, HBase would suffice, but the current lack of pushdown filters from Hive into HBase and HBase's single-level keys make it too expensive. Our solution stores the edit records as separate HDFS files and uses a custom record reader to synthesize the current set of records dynamically as the table is read. This provides an economical solution to their need that works within the framework provided by Hive. We believe this use case applies to many Hive users, and we plan to develop and open source a reusable solution.
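The slides do not show the implementation, so the following is only a minimal sketch of the merge-on-read idea described above, assuming each record has a single string key and that edit files are applied oldest to newest. The names `read_current`, `base`, and `edit_files`, the tuple record shape, and the use of `None` as a delete marker are all hypothetical choices made for illustration. The sketch also holds every edit in memory, which a real record reader over HDFS files would likely avoid.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

# Hypothetical record shape for illustration: (key, payload).
# A payload of None in an edit file marks a delete (an assumption).
Record = Tuple[str, Optional[dict]]


def read_current(base: Iterable[Record],
                 edit_files: List[Iterable[Record]]) -> Iterator[Record]:
    """Yield the current records by overlaying edits on the base data.

    edit_files is ordered oldest to newest, so a later edit to the
    same key replaces an earlier one.
    """
    # Collapse all edit files into the newest edit per key.
    latest: Dict[str, Optional[dict]] = {}
    for edits in edit_files:
        for key, payload in edits:
            latest[key] = payload

    # Stream the base records, substituting or dropping edited ones.
    seen = set()
    for key, payload in base:
        if key in latest:
            seen.add(key)
            payload = latest[key]
            if payload is None:
                continue  # record was deleted by an edit
        yield key, payload

    # Keys that appear only in the edit files are new records.
    for key, payload in latest.items():
        if key not in seen and payload is not None:
            yield key, payload


base = [("a", {"v": 1}), ("b", {"v": 2}), ("c", {"v": 3})]
edits_day1 = [("b", {"v": 20}), ("d", {"v": 4})]
edits_day2 = [("c", None), ("b", {"v": 21})]
current = list(read_current(base, [edits_day1, edits_day2]))
```

Here the base data is never rewritten: the update to `b`, the delete of `c`, and the insert of `d` all live in separate edit collections and are resolved only when the table is read.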

High Volume Updates in Hive
Owen O’Malley
owen@hortonworks.com
@owen_omalley
June 2012

© Hortonworks Inc. 2012 Page 1

Who Am I?

A Data Flood

The Dataflow

The Approach

Why not HBase?

Limitations of a Single Key

Hive Table Layout

Design

Repeatable Reads

Stitching Buckets Together

Limitations

Additional Challenges from Hive

Hive’s Output Committer

Dynamic Partitions

Conclusion

Thank You!
Questions & Answers
