© Hortonworks Inc. 2012
ORC Files
June 2013
Owen O’Malley
owen@hortonworks.com
@owen_omalley
Who Am I?
History
Remaining Challenges
Requirements
File Structure
Stripe Structure
File Layout
[Figure: ORC file layout. The file holds a series of 256 MB stripes followed by the File Footer and the Postscript. Each stripe contains Index Data, Row Data, and a Stripe Footer. Index Data and Row Data are each split by column (Column 1 through Column 8), and each column's data is stored as streams (Stream 2.1 through Stream 2.4 for Column 2).]
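The layout in the figure can be modeled with a few small types to show why a per-column, per-stream layout helps a reader that needs only some columns. This is a minimal sketch, not ORC's actual reader: the `Stream` and `Stripe` classes, the `kind` labels, and all byte sizes are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Stream:
    column: int   # column id this stream belongs to
    kind: str     # hypothetical label: "index" or "data"
    length: int   # bytes the stream occupies inside the stripe (made up)


@dataclass
class Stripe:
    streams: list = field(default_factory=list)

    def bytes_for(self, columns):
        """Total bytes a reader must fetch to read only the given columns."""
        return sum(s.length for s in self.streams if s.column in columns)


# A toy stripe with two columns, each with one index and one data stream.
stripe = Stripe([
    Stream(1, "index", 10), Stream(1, "data", 400),
    Stream(2, "index", 10), Stream(2, "data", 900),
])

needed = stripe.bytes_for({1})      # read column 1 only
total = stripe.bytes_for({1, 2})    # read every column
print(needed, total)
```

Because each column's streams are stored separately within the stripe, a query that touches only column 1 can skip the bytes belonging to column 2 entirely.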
Compression
Integer Column Serialization
String Column Serialization
Hive Compound Types
[Figure: type tree for a Hive compound type, with a column ID on each node]
0 Struct
  1 Int
  2 Map
    3 String
    4 Struct
      5 String
      6 Double
  7 Time
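The column IDs in the type tree figure are consistent with a pre-order walk of the type tree: a parent is numbered first, then its children from left to right. The sketch below assumes that numbering rule and assumes the nesting (a top-level struct holding an int, a map from string to struct, and a time field). The `schema` tuple encoding and the `assign_ids` helper are hypothetical names used only for illustration.

```python
from itertools import count

# (type name, children) tuples. The nesting is an assumption inferred from
# the figure's numbering, not something stated in the slide text.
schema = ("Struct", [
    ("Int", []),
    ("Map", [
        ("String", []),
        ("Struct", [("String", []), ("Double", [])]),
    ]),
    ("Time", []),
])


def assign_ids(node, counter=None, out=None):
    """Number a type tree in pre-order: parent first, then children in order."""
    if counter is None:
        counter, out = count(), []
    name, children = node
    out.append((next(counter), name))
    for child in children:
        assign_ids(child, counter, out)
    return out


ids = assign_ids(schema)
for column_id, type_name in ids:
    print(column_id, type_name)
```

Under these assumptions the walk reproduces the ID and type pairs from the figure, from 0 Struct through 7 Time.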
Compound Type Serialization
Generic Compression
Column Projection
How Do You Use ORC
Managing Memory
Pavan’s Trick
Looking at ORC File Structures
TPC-DS File Sizes
TPC-DS Query Performance
Additional Details
Current work
Vectorization
Vectorization Preliminary Results
Future Work
Thanks!
Comparison
Feature                      RC File  Trevni  Parquet  ORC File
Hive Type Model              N        N       N        Y
Separate complex columns     N        Y       Y        Y
Splits found quickly         N        Y       Y        Y
Default column group size    4MB      64MB*   64MB*    256MB
Files per a bucket           1        > 1     1*       1
Store min, max, sum, count   N        N       N        Y
Versioned metadata           N        Y       Y        Y
Run length data encoding     N        N       Y        Y
Store strings in dictionary  N        N       N        Y
Store row count              N        Y       N        Y
Skip compressed blocks       N        N       N        Y
Store internal indexes       N        N       N        Y
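The "Store min, max, sum, count", "Store internal indexes", and "Skip compressed blocks" rows fit together: if a file records the minimum and maximum of a column for each group of rows, a reader can skip any group whose range cannot match a filter. The following is a simplified illustration of that idea, not ORC's reader code. `RowGroupStats`, `groups_to_read`, and all of the values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    minimum: int   # smallest value of the column in this row group
    maximum: int   # largest value of the column in this row group
    count: int     # number of rows in the group (made-up values below)


def groups_to_read(stats, lo, hi):
    """Indexes of row groups whose [minimum, maximum] range overlaps [lo, hi]."""
    return [
        i for i, s in enumerate(stats)
        if s.count > 0 and s.maximum >= lo and s.minimum <= hi
    ]


stats = [
    RowGroupStats(1, 100, 10_000),
    RowGroupStats(101, 250, 10_000),
    RowGroupStats(251, 900, 10_000),
]

# Filter: 120 <= value <= 200. Only the second group can contain matches.
selected = groups_to_read(stats, 120, 200)
print(selected)
```

Groups that are not selected never need to be read or decompressed, which is where the saving comes from when the filtered column is roughly sorted.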
