SlideShare a Scribd company logo
1 of 29
Download to read offline
© Hortonworks Inc. 2012
ORC Files
June 2013
Page 1
Owen O’Malley
owen@hortonworks.com
@owen_omalley
owen@hortonworks.com
© Hortonworks Inc. 2012
Who Am I?
Page 2
© Hortonworks Inc. 2012
History
Page 3
© Hortonworks Inc. 2012
Remaining Challenges
Page 4
© Hortonworks Inc. 2012
Requirements
Page 5
© Hortonworks Inc. 2012
File Structure
Page 6
© Hortonworks Inc. 2012
Stripe Structure
Page 7
© Hortonworks Inc. 2012
File Layout
Page 8
File Footer
Postscript
Index Data
Row Data
Stripe Footer
256MBStripe
Index Data
Row Data
Stripe Footer
256MBStripe
Index Data
Row Data
Stripe Footer
256MBStripe
Column 1
Column 2
Column 7
Column 8
Column 3
Column 6
Column 4
Column 5
Column 1
Column 2
Column 7
Column 8
Column 3
Column 6
Column 4
Column 5
Stream 2.1
Stream 2.2
Stream 2.3
Stream 2.4
© Hortonworks Inc. 2012
Compression
Page 9
© Hortonworks Inc. 2012
Integer Column Serialization
Page 10
© Hortonworks Inc. 2012
String Column Serialization
Page 11
© Hortonworks Inc. 2012
Hive Compound Types
Page 12
0
Struct
4
Struct
3
String
1
Int
2
Map
7
Time
5
String
6
Double
© Hortonworks Inc. 2012
Compound Type Serialization
Page 13
© Hortonworks Inc. 2012
Generic Compression
Page 14
© Hortonworks Inc. 2012
Column Projection
Page 15
© Hortonworks Inc. 2012
How Do You Use ORC
Page 16
© Hortonworks Inc. 2012
Managing Memory
Page 17
© Hortonworks Inc. 2012
Pavan’s Trick
Page 18
© Hortonworks Inc. 2012
Looking at ORC File Structures
Page 19
© Hortonworks Inc. 2012
Looking at ORC File Structures
Page 20
© Hortonworks Inc. 2012
TPC-DS File Sizes
Page 21
© Hortonworks Inc. 2012
TPC-DS Query Performance
Page 22
© Hortonworks Inc. 2012
Additional Details
Page 23
© Hortonworks Inc. 2012
Current work
Page 24
© Hortonworks Inc. 2012
Vectorization
Page 25
© Hortonworks Inc. 2012
Vectorization Preliminary Results
Page 26
© Hortonworks Inc. 2012
Future Work
Page 27
© Hortonworks Inc. 2012
Thanks!
Page 28
© Hortonworks Inc. 2012
Comparison
Page 29
RC File Trevni Parquet ORC File
Hive Type Model N N N Y
Separate complex columns N Y Y Y
Splits found quickly N Y Y Y
Default column group size 4MB 64MB* 64MB* 256MB
Files per a bucket 1 > 1 1* 1
Store min, max, sum, count N N N Y
Versioned metadata N Y Y Y
Run length data encoding N N Y Y
Store strings in dictionary N N N Y
Store row count N Y N Y
Skip compressed blocks N N N Y
Store internal indexes N N N Y

More Related Content

What's hot

ORC File and Vectorization - Hadoop Summit 2013
ORC File and Vectorization - Hadoop Summit 2013ORC File and Vectorization - Hadoop Summit 2013
ORC File and Vectorization - Hadoop Summit 2013Owen O'Malley
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & FeaturesDataStax Academy
 
RocksDB compaction
RocksDB compactionRocksDB compaction
RocksDB compactionMIJIN AN
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQLDatabricks
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsSpark Summit
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
File Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & ParquetFile Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & ParquetOwen O'Malley
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3DataWorks Summit
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkKazuaki Ishizaki
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introductioncolorant
 
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBase
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBaseHBaseCon 2015: Taming GC Pauses for Large Java Heap in HBase
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBaseHBaseCon
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceDatabricks
 

What's hot (20)

ORC File and Vectorization - Hadoop Summit 2013
ORC File and Vectorization - Hadoop Summit 2013ORC File and Vectorization - Hadoop Summit 2013
ORC File and Vectorization - Hadoop Summit 2013
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & Features
 
RocksDB compaction
RocksDB compactionRocksDB compaction
RocksDB compaction
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
File Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & ParquetFile Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & Parquet
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Hive: Loading Data
Hive: Loading DataHive: Loading Data
Hive: Loading Data
 
ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache Spark
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
ORC Deep Dive 2020
ORC Deep Dive 2020ORC Deep Dive 2020
ORC Deep Dive 2020
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
Spark tuning
Spark tuningSpark tuning
Spark tuning
 
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBase
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBaseHBaseCon 2015: Taming GC Pauses for Large Java Heap in HBase
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBase
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle Service
 
Parquet overview
Parquet overviewParquet overview
Parquet overview
 

Similar to ORC Files

ORC File Introduction
ORC File IntroductionORC File Introduction
ORC File IntroductionOwen O'Malley
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High PerformanceInderaj (Raj) Bains
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive QueriesOwen O'Malley
 
ORC: 2015 Faster, Better, Smaller
ORC: 2015 Faster, Better, SmallerORC: 2015 Faster, Better, Smaller
ORC: 2015 Faster, Better, SmallerDataWorks Summit
 
Getting Started with MongoDB Using the Microsoft Stack
Getting Started with MongoDB Using the Microsoft Stack Getting Started with MongoDB Using the Microsoft Stack
Getting Started with MongoDB Using the Microsoft Stack MongoDB
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHortonworks
 
MOUG17 Keynote: Oracle OpenWorld Major Announcements
MOUG17 Keynote: Oracle OpenWorld Major AnnouncementsMOUG17 Keynote: Oracle OpenWorld Major Announcements
MOUG17 Keynote: Oracle OpenWorld Major AnnouncementsMonica Li
 
Data lake – On Premise VS Cloud
Data lake – On Premise VS CloudData lake – On Premise VS Cloud
Data lake – On Premise VS CloudIdan Tohami
 
SQL in the Hybrid World
SQL in the Hybrid WorldSQL in the Hybrid World
SQL in the Hybrid WorldTanel Poder
 
Migre sus bases de datos Oracle a la nube
Migre sus bases de datos Oracle a la nube Migre sus bases de datos Oracle a la nube
Migre sus bases de datos Oracle a la nube EDB
 
Building Operational Data Lake using Spark and SequoiaDB with Yang Peng
Building Operational Data Lake using Spark and SequoiaDB with Yang PengBuilding Operational Data Lake using Spark and SequoiaDB with Yang Peng
Building Operational Data Lake using Spark and SequoiaDB with Yang PengDatabricks
 
Migrating from RDBMS to MongoDB Atlas - Texas American Resources Company (TARC)
Migrating from RDBMS to MongoDB Atlas - Texas American Resources Company (TARC)Migrating from RDBMS to MongoDB Atlas - Texas American Resources Company (TARC)
Migrating from RDBMS to MongoDB Atlas - Texas American Resources Company (TARC)MongoDB
 
Migration DB2 to EDB - Project Experience
 Migration DB2 to EDB - Project Experience Migration DB2 to EDB - Project Experience
Migration DB2 to EDB - Project ExperienceEDB
 
LA HUG - Agile Analytics Applications on HDP
LA HUG - Agile Analytics Applications on HDPLA HUG - Agile Analytics Applications on HDP
LA HUG - Agile Analytics Applications on HDPHortonworks
 
Things learned from OpenWorld 2013
Things learned from OpenWorld 2013Things learned from OpenWorld 2013
Things learned from OpenWorld 2013Connor McDonald
 
Whats new in Oracle Database 12c release 12.1.0.2
Whats new in Oracle Database 12c release 12.1.0.2Whats new in Oracle Database 12c release 12.1.0.2
Whats new in Oracle Database 12c release 12.1.0.2Connor McDonald
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?DataWorks Summit
 

Similar to ORC Files (20)

ORC File Introduction
ORC File IntroductionORC File Introduction
ORC File Introduction
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High Performance
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
ORC: 2015 Faster, Better, Smaller
ORC: 2015 Faster, Better, SmallerORC: 2015 Faster, Better, Smaller
ORC: 2015 Faster, Better, Smaller
 
Getting Started with MongoDB Using the Microsoft Stack
Getting Started with MongoDB Using the Microsoft Stack Getting Started with MongoDB Using the Microsoft Stack
Getting Started with MongoDB Using the Microsoft Stack
 
ORC 2015
ORC 2015ORC 2015
ORC 2015
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it final
 
MOUG17 Keynote: Oracle OpenWorld Major Announcements
MOUG17 Keynote: Oracle OpenWorld Major AnnouncementsMOUG17 Keynote: Oracle OpenWorld Major Announcements
MOUG17 Keynote: Oracle OpenWorld Major Announcements
 
Data lake – On Premise VS Cloud
Data lake – On Premise VS CloudData lake – On Premise VS Cloud
Data lake – On Premise VS Cloud
 
SQL in the Hybrid World
SQL in the Hybrid WorldSQL in the Hybrid World
SQL in the Hybrid World
 
Enabling R on Hadoop
Enabling R on HadoopEnabling R on Hadoop
Enabling R on Hadoop
 
Migre sus bases de datos Oracle a la nube
Migre sus bases de datos Oracle a la nube Migre sus bases de datos Oracle a la nube
Migre sus bases de datos Oracle a la nube
 
ORC 2015: Faster, Better, Smaller
ORC 2015: Faster, Better, SmallerORC 2015: Faster, Better, Smaller
ORC 2015: Faster, Better, Smaller
 
Building Operational Data Lake using Spark and SequoiaDB with Yang Peng
Building Operational Data Lake using Spark and SequoiaDB with Yang PengBuilding Operational Data Lake using Spark and SequoiaDB with Yang Peng
Building Operational Data Lake using Spark and SequoiaDB with Yang Peng
 
Migrating from RDBMS to MongoDB Atlas - Texas American Resources Company (TARC)
Migrating from RDBMS to MongoDB Atlas - Texas American Resources Company (TARC)Migrating from RDBMS to MongoDB Atlas - Texas American Resources Company (TARC)
Migrating from RDBMS to MongoDB Atlas - Texas American Resources Company (TARC)
 
Migration DB2 to EDB - Project Experience
 Migration DB2 to EDB - Project Experience Migration DB2 to EDB - Project Experience
Migration DB2 to EDB - Project Experience
 
LA HUG - Agile Analytics Applications on HDP
LA HUG - Agile Analytics Applications on HDPLA HUG - Agile Analytics Applications on HDP
LA HUG - Agile Analytics Applications on HDP
 
Things learned from OpenWorld 2013
Things learned from OpenWorld 2013Things learned from OpenWorld 2013
Things learned from OpenWorld 2013
 
Whats new in Oracle Database 12c release 12.1.0.2
Whats new in Oracle Database 12c release 12.1.0.2Whats new in Oracle Database 12c release 12.1.0.2
Whats new in Oracle Database 12c release 12.1.0.2
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 

More from Owen O'Malley

Running An Apache Project: 10 Traps and How to Avoid Them
Running An Apache Project: 10 Traps and How to Avoid ThemRunning An Apache Project: 10 Traps and How to Avoid Them
Running An Apache Project: 10 Traps and How to Avoid ThemOwen O'Malley
 
Big Data's Journey to ACID
Big Data's Journey to ACIDBig Data's Journey to ACID
Big Data's Journey to ACIDOwen O'Malley
 
Protect your private data with ORC column encryption
Protect your private data with ORC column encryptionProtect your private data with ORC column encryption
Protect your private data with ORC column encryptionOwen O'Malley
 
Fine Grain Access Control for Big Data: ORC Column Encryption
Fine Grain Access Control for Big Data: ORC Column EncryptionFine Grain Access Control for Big Data: ORC Column Encryption
Fine Grain Access Control for Big Data: ORC Column EncryptionOwen O'Malley
 
Fast Access to Your Data - Avro, JSON, ORC, and Parquet
Fast Access to Your Data - Avro, JSON, ORC, and ParquetFast Access to Your Data - Avro, JSON, ORC, and Parquet
Fast Access to Your Data - Avro, JSON, ORC, and ParquetOwen O'Malley
 
Strata NYC 2018 Iceberg
Strata NYC 2018  IcebergStrata NYC 2018  Iceberg
Strata NYC 2018 IcebergOwen O'Malley
 
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and ParquetFast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and ParquetOwen O'Malley
 
ORC Column Encryption
ORC Column EncryptionORC Column Encryption
ORC Column EncryptionOwen O'Malley
 
Protecting Enterprise Data in Apache Hadoop
Protecting Enterprise Data in Apache HadoopProtecting Enterprise Data in Apache Hadoop
Protecting Enterprise Data in Apache HadoopOwen O'Malley
 
Structor - Automated Building of Virtual Hadoop Clusters
Structor - Automated Building of Virtual Hadoop ClustersStructor - Automated Building of Virtual Hadoop Clusters
Structor - Automated Building of Virtual Hadoop ClustersOwen O'Malley
 
Hadoop Security Architecture
Hadoop Security ArchitectureHadoop Security Architecture
Hadoop Security ArchitectureOwen O'Malley
 
Adding ACID Updates to Hive
Adding ACID Updates to HiveAdding ACID Updates to Hive
Adding ACID Updates to HiveOwen O'Malley
 
Next Generation Hadoop Operations
Next Generation Hadoop OperationsNext Generation Hadoop Operations
Next Generation Hadoop OperationsOwen O'Malley
 
Next Generation MapReduce
Next Generation MapReduceNext Generation MapReduce
Next Generation MapReduceOwen O'Malley
 
Bay Area HUG Feb 2011 Intro
Bay Area HUG Feb 2011 IntroBay Area HUG Feb 2011 Intro
Bay Area HUG Feb 2011 IntroOwen O'Malley
 
Plugging the Holes: Security and Compatability in Hadoop
Plugging the Holes: Security and Compatability in HadoopPlugging the Holes: Security and Compatability in Hadoop
Plugging the Holes: Security and Compatability in HadoopOwen O'Malley
 

More from Owen O'Malley (17)

Running An Apache Project: 10 Traps and How to Avoid Them
Running An Apache Project: 10 Traps and How to Avoid ThemRunning An Apache Project: 10 Traps and How to Avoid Them
Running An Apache Project: 10 Traps and How to Avoid Them
 
Big Data's Journey to ACID
Big Data's Journey to ACIDBig Data's Journey to ACID
Big Data's Journey to ACID
 
Protect your private data with ORC column encryption
Protect your private data with ORC column encryptionProtect your private data with ORC column encryption
Protect your private data with ORC column encryption
 
Fine Grain Access Control for Big Data: ORC Column Encryption
Fine Grain Access Control for Big Data: ORC Column EncryptionFine Grain Access Control for Big Data: ORC Column Encryption
Fine Grain Access Control for Big Data: ORC Column Encryption
 
Fast Access to Your Data - Avro, JSON, ORC, and Parquet
Fast Access to Your Data - Avro, JSON, ORC, and ParquetFast Access to Your Data - Avro, JSON, ORC, and Parquet
Fast Access to Your Data - Avro, JSON, ORC, and Parquet
 
Strata NYC 2018 Iceberg
Strata NYC 2018  IcebergStrata NYC 2018  Iceberg
Strata NYC 2018 Iceberg
 
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and ParquetFast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
 
ORC Column Encryption
ORC Column EncryptionORC Column Encryption
ORC Column Encryption
 
Protecting Enterprise Data in Apache Hadoop
Protecting Enterprise Data in Apache HadoopProtecting Enterprise Data in Apache Hadoop
Protecting Enterprise Data in Apache Hadoop
 
Data protection2015
Data protection2015Data protection2015
Data protection2015
 
Structor - Automated Building of Virtual Hadoop Clusters
Structor - Automated Building of Virtual Hadoop ClustersStructor - Automated Building of Virtual Hadoop Clusters
Structor - Automated Building of Virtual Hadoop Clusters
 
Hadoop Security Architecture
Hadoop Security ArchitectureHadoop Security Architecture
Hadoop Security Architecture
 
Adding ACID Updates to Hive
Adding ACID Updates to HiveAdding ACID Updates to Hive
Adding ACID Updates to Hive
 
Next Generation Hadoop Operations
Next Generation Hadoop OperationsNext Generation Hadoop Operations
Next Generation Hadoop Operations
 
Next Generation MapReduce
Next Generation MapReduceNext Generation MapReduce
Next Generation MapReduce
 
Bay Area HUG Feb 2011 Intro
Bay Area HUG Feb 2011 IntroBay Area HUG Feb 2011 Intro
Bay Area HUG Feb 2011 Intro
 
Plugging the Holes: Security and Compatability in Hadoop
Plugging the Holes: Security and Compatability in HadoopPlugging the Holes: Security and Compatability in Hadoop
Plugging the Holes: Security and Compatability in Hadoop
 

ORC Files