ORC Files


Hive’s RCFile has been the standard format for storing Hive data for the last three years. However, RCFile has limitations because it treats each column as a binary blob without semantics. The upcoming Hive 0.11 will add a new file format named Optimized Row Columnar (ORC) file that uses and retains the type information from the table definition. ORC uses type-specific readers and writers that provide lightweight compression techniques such as dictionary encoding, bit packing, delta encoding, and run-length encoding, resulting in dramatically smaller files. Additionally, ORC can apply generic compression using zlib, LZO, or Snappy on top of the lightweight compression for even smaller files.

However, storage savings are only part of the gain. ORC supports projection, which selects a subset of the columns for reading, so that a query that needs only one column reads only the bytes for that column. Furthermore, ORC files include lightweight indexes that record the minimum and maximum values of each column for each set of 10,000 rows and for the entire file. Using pushdown filters from Hive, the file reader can skip entire sets of rows that cannot match the query. Finally, ORC works together with the upcoming query vectorization work by providing a high-bandwidth reader/writer interface.
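The row skipping described in the abstract can be illustrated with a small sketch. It assumes an in-memory list of integers standing in for one column and keeps a minimum and maximum per group of 10,000 rows; the names `RowGroupStats`, `build_index`, and `groups_to_read` are hypothetical and are not part of the ORC or Hive APIs.

```python
from dataclasses import dataclass
from typing import List

ROWS_PER_GROUP = 10_000  # index granularity named in the abstract


@dataclass
class RowGroupStats:
    start: int    # position of the first row in the group
    count: int    # number of rows in the group
    minimum: int  # smallest value seen in the group
    maximum: int  # largest value seen in the group


def build_index(values: List[int],
                rows_per_group: int = ROWS_PER_GROUP) -> List[RowGroupStats]:
    """Record min/max statistics for each consecutive group of rows."""
    index = []
    for start in range(0, len(values), rows_per_group):
        group = values[start:start + rows_per_group]
        index.append(RowGroupStats(start, len(group), min(group), max(group)))
    return index


def groups_to_read(index: List[RowGroupStats],
                   low: int, high: int) -> List[RowGroupStats]:
    """Keep only groups whose [min, max] range overlaps [low, high]."""
    return [g for g in index if g.maximum >= low and g.minimum <= high]


# A sorted column makes the effect easy to see: only one group can match.
values = list(range(50_000))
index = build_index(values)
selected = groups_to_read(index, 23_000, 24_500)
print(len(index), [g.start for g in selected])
```

The filter is conservative: a group whose range overlaps the predicate may still contain no matching rows, so the reader must still evaluate the predicate on the rows it does read.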

© Hortonworks Inc. 2012
ORC Files
June 2013
Page 1
Owen O’Malley
owen@hortonworks.com
@owen_omalley

Who Am I?
Page 2

Remaining Challenges
Page 4

Requirements
Page 5

File Structure
Page 6

Stripe Structure
Page 7

File Layout
Page 8
[Figure: ORC file layout. The file is a sequence of 256MB stripes, each made up of Index Data, Row Data, and a Stripe Footer, followed by a File Footer and a Postscript. Within a stripe, the data is divided by column (Column 1 through Column 8), and a column is divided into streams (Stream 2.1 through Stream 2.4).]

Compression
Page 9

Integer Column Serialization
Page 10
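
The abstract lists run length encoding and delta encoding among the lightweight techniques applied by type-specific writers. A minimal sketch of both ideas for an integer column follows. It assumes a simple list of (value, run length) pairs rather than ORC's actual byte format, and all function names are hypothetical.

```python
from typing import List, Tuple


def rle_encode(values: List[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs: List[Tuple[int, int]] = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs


def rle_decode(runs: List[Tuple[int, int]]) -> List[int]:
    """Expand (value, run_length) pairs back into the original values."""
    out: List[int] = []
    for v, n in runs:
        out.extend([v] * n)
    return out


def delta_encode(values: List[int]) -> Tuple[int, List[int]]:
    """Store the first value plus the difference between neighbours."""
    if not values:
        raise ValueError("delta_encode needs at least one value")
    deltas = [b - a for a, b in zip(values, values[1:])]
    return values[0], deltas


def delta_decode(first: int, deltas: List[int]) -> List[int]:
    """Rebuild the values by accumulating the differences."""
    out = [first]
    for d in deltas:
        out.append(out[-1] + d)
    return out


# Sequential ids have no repeats, but their deltas are one long run.
ids = [100, 101, 102, 103, 104]
first, deltas = delta_encode(ids)
print(first, rle_encode(deltas))
```

Combining the two steps is what makes monotonically increasing columns, such as sequential keys, compress well: the deltas repeat even when the values do not.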

String Column Serialization
Page 11
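
Dictionary encoding is the technique the abstract names that applies most naturally to string columns. The sketch below shows the general idea only: each distinct string is stored once and the rows hold small integer references. Sorting the dictionary is an assumption made here for determinism, and the function names are hypothetical rather than ORC's writer API.

```python
from typing import List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Return (dictionary, ids) where ids[i] indexes into dictionary."""
    dictionary = sorted(set(values))  # assumption: sorted unique strings
    position = {s: i for i, s in enumerate(dictionary)}
    ids = [position[s] for s in values]
    return dictionary, ids


def dictionary_decode(dictionary: List[str], ids: List[int]) -> List[str]:
    """Look every id up in the dictionary to rebuild the column."""
    return [dictionary[i] for i in ids]


states = ["CA", "NY", "CA", "TX", "NY", "CA"]
dictionary, ids = dictionary_encode(states)
print(dictionary, ids)
```

The payoff depends on cardinality: a column with few distinct values shrinks to one short dictionary plus a stream of small integers, which can then be compressed further with integer techniques such as run length encoding.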

Hive Compound Types
Page 12
[Figure: type tree with numbered columns]
Column  Type
0       Struct
1       Int
2       Map
3       String
4       Struct
5       String
6       Double
7       Time
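
The column numbers on this slide are consistent with a pre-order walk of a type tree, where a parent is numbered before its children. The sketch below assumes one nesting that reproduces the listed IDs: a root struct containing an int, a map from string to a struct of (string, double), and a time. That nesting is an assumption, since only the IDs and type names are given here, and `TypeNode` and `assign_column_ids` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TypeNode:
    kind: str
    children: List["TypeNode"] = field(default_factory=list)


def assign_column_ids(root: TypeNode) -> List[Tuple[int, str]]:
    """Number every node in pre-order: parent first, then its children."""
    result: List[Tuple[int, str]] = []

    def visit(node: TypeNode) -> None:
        result.append((len(result), node.kind))
        for child in node.children:
            visit(child)

    visit(root)
    return result


# Assumed nesting: struct<int, map<string, struct<string, double>>, time>
schema = TypeNode("Struct", [
    TypeNode("Int"),
    TypeNode("Map", [
        TypeNode("String"),
        TypeNode("Struct", [TypeNode("String"), TypeNode("Double")]),
    ]),
    TypeNode("Time"),
])

for column_id, kind in assign_column_ids(schema):
    print(column_id, kind)
```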

Compound Type Serialization
Page 13

Generic Compression
Page 14

Column Projection
Page 15
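
The abstract describes projection as reading only the bytes of the columns a query needs. The sketch below illustrates that idea with a made-up layout: each column is serialized contiguously and a directory records its offset and length, so the reader can seek straight to the requested columns. The JSON encoding, the directory structure, and the function names are all assumptions for illustration and do not reflect ORC's real format.

```python
import io
import json
from typing import Dict, List, Tuple

Directory = Dict[str, Tuple[int, int]]  # column name -> (offset, length)


def write_stripe(columns: Dict[str, List[int]]) -> Tuple[bytes, Directory]:
    """Lay columns out one after another and remember where each starts."""
    body = bytearray()
    directory: Directory = {}
    for name, values in columns.items():
        encoded = json.dumps(values).encode("utf-8")
        directory[name] = (len(body), len(encoded))
        body += encoded
    return bytes(body), directory


def read_columns(data: bytes, directory: Directory,
                 wanted: List[str]) -> Tuple[Dict[str, List[int]], int]:
    """Read only the requested columns; also report how many bytes were read."""
    stream = io.BytesIO(data)
    result: Dict[str, List[int]] = {}
    bytes_read = 0
    for name in wanted:
        offset, length = directory[name]
        stream.seek(offset)
        result[name] = json.loads(stream.read(length).decode("utf-8"))
        bytes_read += length
    return result, bytes_read


columns = {"id": [1, 2, 3], "price": [10, 20, 30], "qty": [5, 6, 7]}
data, directory = write_stripe(columns)
result, bytes_read = read_columns(data, directory, ["price"])
print(result, bytes_read, len(data))
```

A row-oriented layout cannot offer this saving, because the values of one column are interleaved with all the others and every row must be read to extract them.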

How Do You Use ORC
Page 16

Managing Memory
Page 17

Pavan’s Trick
Page 18

Looking at ORC File Structures
Page 19

Looking at ORC File Structures
Page 20

TPC-DS File Sizes
Page 21

TPC-DS Query Performance
Page 22

Additional Details
Page 23

Current work
Page 24

Vectorization
Page 25

Vectorization Preliminary Results
Page 26

Future Work
Page 27

Comparison
Page 29
Feature                      RC File  Trevni  Parquet  ORC File
Hive Type Model              N        N       N        Y
Separate complex columns     N        Y       Y        Y
Splits found quickly         N        Y       Y        Y
Default column group size    4MB      64MB*   64MB*    256MB
Files per bucket             1        >1      1*       1
Store min, max, sum, count   N        N       N        Y
Versioned metadata           N        Y       Y        Y
Run length data encoding     N        N       Y        Y
Store strings in dictionary  N        N       N        Y
Store row count              N        Y       N        Y
Skip compressed blocks       N        N       N        Y
Store internal indexes       N        N       N        Y
