ORC File and Vectorization - Hadoop Summit 2013


Eric Hanson and I gave this presentation at Hadoop Summit 2013.

Hive’s RCFile has been the standard format for storing Hive data for the last 3 years. However, RCFile has limitations because it treats each column as a binary blob without semantics. Hive 0.11 added a new file format named Optimized Row Columnar (ORC) file that uses and retains the type information from the table definition. ORC uses type-specific readers and writers that provide lightweight compression techniques such as dictionary encoding, bit packing, delta encoding, and run-length encoding, resulting in dramatically smaller files. Additionally, ORC can apply generic compression using zlib, LZO, or Snappy on top of the lightweight compression for even smaller files.

However, storage savings are only part of the gain. ORC supports projection, which selects subsets of the columns for reading, so that a query reading only one column reads only the required bytes. Furthermore, ORC files include lightweight indexes that record the minimum and maximum values for each column in each set of 10,000 rows and in the entire file. Using pushdown filters from Hive, the file reader can skip entire sets of rows that cannot match the query.

Columnar storage formats like ORC reduce I/O and storage use, but it’s just as important to reduce CPU usage. A technique called vectorized query execution works well with column store formats to do this. Vectorized query execution has been shown to give dramatic speedups, on the order of 10X to 100X, for structured data processing. We describe how we’re adding vectorized query execution to Hive and coupling it to ORC with a vectorized iterator.
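The row-skipping idea in the description can be sketched roughly as follows. This is a minimal illustration, not ORC's reader: `RowGroupStats` and `groupsToRead` are hypothetical names, the statistics are made-up values, and the predicate is assumed to have the form `value >= threshold`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for one index entry: the min and max of a column
// over one group of rows (the description says 10,000 rows per group).
final class RowGroupStats {
    final long min;
    final long max;

    RowGroupStats(long min, long max) {
        this.min = min;
        this.max = max;
    }
}

final class MinMaxSkipSketch {
    // Returns the indexes of the row groups that may contain a row with
    // value >= threshold. A group whose max is below the threshold cannot
    // match, so a reader would never have to decode it.
    static List<Integer> groupsToRead(List<RowGroupStats> index, long threshold) {
        List<Integer> keep = new ArrayList<>();
        for (int g = 0; g < index.size(); g++) {
            if (index.get(g).max >= threshold) {
                keep.add(g);
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        List<RowGroupStats> index = new ArrayList<>();
        index.add(new RowGroupStats(0, 99));     // group 0
        index.add(new RowGroupStats(100, 250));  // group 1
        index.add(new RowGroupStats(40, 120));   // group 2
        // Predicate: value >= 110. Group 0 is skipped because its max is 99.
        System.out.println(groupsToRead(index, 110)); // prints [1, 2]
    }
}
```

Only the statistics are consulted here; the rows inside the surviving groups still have to be filtered individually, since a group's range can overlap the predicate without every row matching.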

Copyright 2013 by Hortonworks and Microsoft
ORC File & Vectorization
Improving Hive Data Storage and Query Performance
June 2013
Owen O’Malley
owen@hortonworks.com
@owen_omalley
Jitendra Pandey
jitendra@hortonworks.com
Eric Hanson
ehans@microsoft.com

File Layout
[Diagram: an ORC file drawn as a series of 256 MB stripes together with a File Footer and Postscript. Each stripe holds Index Data, Row Data, and a Stripe Footer. Index Data and Row Data are each divided by column (Column 1 through Column 8), and a column is divided into streams (Stream 2.1 through Stream 2.4 are shown for Column 2).]
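One way to picture why this layout makes column projection cheap: if a reader knows which column each stream belongs to and how long it is, it can fetch only the streams of the columns a query needs. The sketch below assumes exactly that; `StreamInfo` is a hypothetical class and all byte counts are invented for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical description of one stream inside a stripe: the column it
// belongs to and its length in bytes. This is only enough detail to
// illustrate projection; it is not ORC's metadata format.
final class StreamInfo {
    final int column;
    final long length;

    StreamInfo(int column, long length) {
        this.column = column;
        this.length = length;
    }
}

final class ProjectionSketch {
    // Sums the bytes of the streams that belong to the projected columns.
    static long bytesToRead(StreamInfo[] stripe, Set<Integer> projected) {
        long total = 0;
        for (StreamInfo s : stripe) {
            if (projected.contains(s.column)) {
                total += s.length;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up sizes: column 1 has one stream, column 2 has four
        // (like Stream 2.1 to 2.4 in the diagram), column 3 has one.
        StreamInfo[] stripe = {
            new StreamInfo(1, 4_000),
            new StreamInfo(2, 1_000), new StreamInfo(2, 2_000),
            new StreamInfo(2, 500), new StreamInfo(2, 1_500),
            new StreamInfo(3, 9_000),
        };
        Set<Integer> onlyColumn2 = new HashSet<>(Arrays.asList(2));
        System.out.println(bytesToRead(stripe, onlyColumn2)); // prints 5000
    }
}
```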

Hive Compound Types
[Diagram: a type tree in which every node is numbered as a column.]
0 Struct
1 Int
2 Map
3 String
4 Struct
5 String
6 Double
7 Time
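The numbers in the diagram are consistent with a pre-order walk of the type tree, with the root struct as column 0. The sketch below works under that assumption. The tree shape (an int, a map from string to a struct of string and double, and a time field) is inferred from the diagram's labels rather than stated in the slide text, and `TypeNode` is a hypothetical class, not part of Hive or ORC.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical type-tree node; not a Hive or ORC class.
final class TypeNode {
    final String kind;
    final List<TypeNode> children;

    TypeNode(String kind, TypeNode... children) {
        this.kind = kind;
        this.children = Arrays.asList(children);
    }
}

final class ColumnIdSketch {
    // Pre-order walk: a node takes the next free id, then its children
    // are numbered left to right.
    static void assignIds(TypeNode node, List<String> out) {
        out.add(out.size() + " " + node.kind);
        for (TypeNode child : node.children) {
            assignIds(child, out);
        }
    }

    public static void main(String[] args) {
        TypeNode root = new TypeNode("Struct",
            new TypeNode("Int"),
            new TypeNode("Map",
                new TypeNode("String"),
                new TypeNode("Struct",
                    new TypeNode("String"),
                    new TypeNode("Double"))),
            new TypeNode("Time"));
        List<String> ids = new ArrayList<>();
        assignIds(root, ids);
        // Prints one "id type" pair per line, from "0 Struct" to "7 Time".
        ids.forEach(System.out::println);
    }
}
```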

Comparison

                              RC File   Trevni   Parquet     ORC
Hive Integration              Y         N        N           Y
Active Development            N         N        Y           Y
Hive Type Model               N         N        N           Y
Shred complex columns         N         Y        Y           Y
Splits found quickly          N         Y        Y           Y
Files per bucket              1         many     1 or many   1
Versioned metadata            N         Y        Y           Y
Run length data encoding      N         N        Y           Y
Store strings in dictionary   N         N        Y           Y
Store min, max, sum, count    N         N        N           Y
Store internal indexes        N         N        N           Y
No overhead for non-null      N         N        N           Y (≥ 0.12)
Predicate Pushdown            N         N        N           Y (≥ 0.12)

Why row-at-a-time execution is slow
• Hive uses Object Inspectors to work on a row
• Enables a level of abstraction
• Costs major performance
• Exacerbated by using lazy SerDes
• Inner loop has many method calls, new() calls, and if-then-else branches
• Lots of CPU instructions
• Pipeline stalls
• Poor instructions/cycle
• Poor cache locality
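To make the per-row costs listed above concrete, here is a rough sketch of what adding a constant to one column looks like in a row-at-a-time style. `ValueInspector` is a hypothetical, heavily simplified stand-in and does not reproduce Hive's ObjectInspector API; the point is only that each value goes through an interface call, a null check, an unboxing, and a boxed result.

```java
// Hypothetical, heavily simplified stand-in; Hive's real ObjectInspector
// API is richer than this.
interface ValueInspector {
    Object get(Object[] row, int column);
}

final class RowAtATimeSketch {
    // Adds a scalar to one column of one row. Every single value costs an
    // interface call, a branch on null, an unboxing, and a boxed result.
    static void addScalar(Object[] row, ValueInspector inspector,
                          int inputColumn, int outputColumn, long scalar) {
        Object value = inspector.get(row, inputColumn);
        if (value == null) {
            row[outputColumn] = null;
        } else {
            row[outputColumn] = Long.valueOf(((Long) value) + scalar);
        }
    }

    public static void main(String[] args) {
        ValueInspector inspector = (row, column) -> row[column];
        Object[][] rows = { {1L, null}, {2L, null}, {null, null} };
        for (Object[] row : rows) {   // the engine's inner loop, one row per pass
            addScalar(row, inspector, 0, 1, 10L);
        }
        System.out.println(rows[0][1] + " " + rows[1][1] + " " + rows[2][1]);
        // prints 11 12 null
    }
}
```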

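How the code works (simplified)

    import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;

    // Adds a constant to every live value of a long column, one batch at a time.
    class LongColumnAddLongScalarExpression {
        int inputColumn;   // index of the column to read
        int outputColumn;  // index of the column to write
        long scalar;       // constant added to each value

        void evaluate(VectorizedRowBatch batch) {
            long[] inVector =
                ((LongColumnVector) batch.columns[inputColumn]).vector;
            long[] outVector =
                ((LongColumnVector) batch.columns[outputColumn]).vector;
            if (batch.selectedInUse) {
                // A filter has run: only the positions listed in selected[] are live.
                for (int j = 0; j < batch.size; j++) {
                    int i = batch.selected[j];
                    outVector[i] = inVector[i] + scalar;
                }
            } else {
                // No filter: the first batch.size positions are all live.
                for (int i = 0; i < batch.size; i++) {
                    outVector[i] = inVector[i] + scalar;
                }
            }
        }
    }

• No method calls
• Low instruction count
• Cache locality to 1024 values
• No pipeline stalls
• SIMD in Java 8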
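The slide's class depends on Hive. For experimenting without Hive on the classpath, the same loop structure can be reproduced with small stand-ins. In the sketch below, `LongVec` and `RowBatch` are hypothetical classes that mirror only the fields the slide's code touches; they are not Hive's `LongColumnVector` and `VectorizedRowBatch`, and the batch capacity of 1024 is taken from the slide's callout.

```java
// Minimal stand-ins that mirror only the fields the slide's code touches.
// They are NOT Hive's VectorizedRowBatch / LongColumnVector classes.
final class LongVec {
    final long[] vector;
    LongVec(int capacity) { vector = new long[capacity]; }
}

final class RowBatch {
    static final int DEFAULT_SIZE = 1024;  // values per batch, per the slide's callout
    final LongVec[] columns;
    int size;                 // number of rows that are still live
    boolean selectedInUse;    // true once a filter has dropped some rows
    final int[] selected = new int[DEFAULT_SIZE]; // positions of live rows

    RowBatch(int numColumns) {
        columns = new LongVec[numColumns];
        for (int c = 0; c < numColumns; c++) {
            columns[c] = new LongVec(DEFAULT_SIZE);
        }
    }
}

final class AddScalarSketch {
    // Same loop structure as the slide: one tight loop over primitive arrays.
    static void evaluate(RowBatch batch, int in, int out, long scalar) {
        long[] inVector = batch.columns[in].vector;
        long[] outVector = batch.columns[out].vector;
        if (batch.selectedInUse) {
            for (int j = 0; j < batch.size; j++) {
                int i = batch.selected[j];
                outVector[i] = inVector[i] + scalar;
            }
        } else {
            for (int i = 0; i < batch.size; i++) {
                outVector[i] = inVector[i] + scalar;
            }
        }
    }

    public static void main(String[] args) {
        RowBatch batch = new RowBatch(2);
        for (int i = 0; i < 4; i++) {
            batch.columns[0].vector[i] = i;   // input column holds 0, 1, 2, 3
        }
        // Pretend an earlier filter kept only rows 1 and 3.
        batch.selectedInUse = true;
        batch.selected[0] = 1;
        batch.selected[1] = 3;
        batch.size = 2;

        evaluate(batch, 0, 1, 100L);
        long[] result = batch.columns[1].vector;
        System.out.println(result[1] + " " + result[3]); // prints 101 103
        System.out.println(result[0] + " " + result[2]); // prints 0 0 (filtered rows are never written)
    }
}
```

Note that a filter in this scheme does not copy or compact data: it only rewrites `selected` and `size`, so later expressions skip the dropped rows without moving any values.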

Preliminary performance results
• NOT a benchmark
• 218 million row fact table of real data, 25 columns
• 18GB raw data
• 6 core, 12 thread workstation, 1 disk, 16GB RAM
• select a, b, count(*) from t where c >= const group by a, b -- 53 row result

Warm start times    RC non-vectorized            ORC non-vectorized       ORC vectorized
                    (default, not compressed)    (default, compressed)    (default, compressed)
Runtime (sec)       261                          58                       43
Total CPU (sec)     381                          159                      42
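The ratios implied by these figures can be worked out directly. The small sketch below only divides the numbers reported above; `SpeedupSketch` is a hypothetical helper, and since the slide stresses that this is a single query and not a benchmark, the ratios should be read the same way.

```java
import java.util.Locale;

final class SpeedupSketch {
    // How many times smaller 'after' is than 'before'.
    static double ratio(double before, double after) {
        return before / after;
    }

    public static void main(String[] args) {
        // Figures copied from the results above (seconds).
        double rcRuntime = 261, orcRuntime = 58, orcVecRuntime = 43;
        double rcCpu = 381, orcCpu = 159, orcVecCpu = 42;

        System.out.println(String.format(Locale.ROOT,
            "Runtime, RC to ORC vectorized: %.2fx", ratio(rcRuntime, orcVecRuntime)));   // 6.07x
        System.out.println(String.format(Locale.ROOT,
            "Runtime, ORC to ORC vectorized: %.2fx", ratio(orcRuntime, orcVecRuntime))); // 1.35x
        System.out.println(String.format(Locale.ROOT,
            "CPU, RC to ORC vectorized: %.2fx", ratio(rcCpu, orcVecCpu)));               // 9.07x
        System.out.println(String.format(Locale.ROOT,
            "CPU, ORC to ORC vectorized: %.2fx", ratio(orcCpu, orcVecCpu)));             // 3.79x
    }
}
```

On this one query, vectorization cuts CPU time by a much larger factor than wall-clock time. The slide does not say why; with a single disk in the test machine, I/O is a plausible limit, but that is a guess.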

Thanks to contributors!
• Microsoft Big Data:
  • Eric Hanson, Remus Rusanu, Sarvesh Sakalanaga, Tony Murphy, Ashit Gosalia
• Hortonworks:
  • Jitendra Pandey, Owen O’Malley, Gopal V
• Others:
  • Teddy Choi, Tim Chen
Jitendra/Eric are joint leads

ORC File and Vectorization - Hadoop Summit 2013

1. ORC File & Vectorization

2. ORC – Optimized RC File

3. History

4. Remaining Challenges

5. Requirements

6. File Structure

7. Stripe Structure

8. File Layout

9. Compression

10. Integer Column Serialization

11. String Column Serialization

12. Hive Compound Types

13. Compound Type Serialization

14. Generic Compression

15. Column Projection

16. How Do You Use ORC

17. Managing Memory

18. TPC-DS File Sizes

19. ORC Predicate Pushdown

20. Additional Details

21. Current work for Hive 0.12

22. Future Work

23. Comparison

24. Vectorization

25. Vectorization

26. Why row-at-a-time execution is slow

27. How the code works (simplified)

28. Vectorization project

29. Preliminary performance results

30. Thanks to contributors!
