ORC Files

Hive’s RCFile has been the standard format for storing Hive data for the last three years. However, RCFile has limitations because it treats each column as a binary blob without semantics. The upcoming Hive 0.11 will add a new file format named Optimized Row Columnar (ORC) that uses and retains the type information from the table definition. ORC uses type-specific readers and writers that provide lightweight compression techniques such as dictionary encoding, bit packing, delta encoding, and run-length encoding, resulting in dramatically smaller files. Additionally, ORC can apply generic compression using zlib, LZO, or Snappy on top of the lightweight compression for even smaller files.

However, storage savings are only part of the gain. ORC supports projection, which selects a subset of the columns for reading, so that a query reading only one column reads only the bytes for that column. Furthermore, ORC files include lightweight indexes that record the minimum and maximum values of each column for each set of 10,000 rows and for the entire file. Using pushdown filters from Hive, the file reader can skip entire sets of rows that cannot satisfy the query's filter. Finally, ORC works together with the upcoming query vectorization work to provide a high-bandwidth reader/writer interface.
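
The row-skipping idea in the description can be illustrated with a small Python sketch. This is not ORC's reader or its on-disk index format; it only mimics the principle of keeping a minimum and maximum per group of 10,000 rows and consulting them before touching the data. The names `GroupStats`, `build_index`, and `scan_greater_than` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

ROWS_PER_GROUP = 10_000  # index granularity quoted in the description


@dataclass
class GroupStats:
    """Min/max summary for one group of rows (hypothetical structure)."""
    start: int    # index of the first row in the group
    end: int      # one past the last row in the group
    minimum: int
    maximum: int


def build_index(values: List[int], group_size: int = ROWS_PER_GROUP) -> List[GroupStats]:
    """Record the minimum and maximum for each consecutive group of rows."""
    index = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        index.append(GroupStats(start, start + len(chunk), min(chunk), max(chunk)))
    return index


def scan_greater_than(values: List[int], index: List[GroupStats],
                      threshold: int) -> Tuple[List[int], int]:
    """Evaluate `value > threshold`, skipping groups whose max rules them out.

    Returns the matching values and the number of groups skipped entirely.
    """
    matches: List[int] = []
    skipped = 0
    for group in index:
        if group.maximum <= threshold:
            skipped += 1  # no row in this group can match, so do not read it
            continue
        matches.extend(v for v in values[group.start:group.end] if v > threshold)
    return matches, skipped


if __name__ == "__main__":
    column = list(range(50_000))
    idx = build_index(column)
    rows, skipped = scan_greater_than(column, idx, 44_999)
    print(len(rows), "rows matched;", skipped, "of", len(idx), "groups skipped")
```

On sorted or clustered data most groups are ruled out by their statistics alone. On randomly ordered data nearly every group spans the full value range, so little or nothing is skipped.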

ORC Files

  1. ORC Files, June 2013. Owen O’Malley, owen@hortonworks.com, @owen_omalley. © Hortonworks Inc. 2012
  2. Who Am I?
  3. History
  4. Remaining Challenges
  5. Requirements
  6. File Structure
  7. Stripe Structure
  8. File Layout. Diagram labels: File Footer; Postscript; three 256 MB stripes, each containing Index Data, Row Data, and a Stripe Footer; Column 1 to Column 8 (listed twice); Stream 2.1 to Stream 2.4.
  9. Compression
  10. Integer Column Serialization
  11. String Column Serialization
  12. Hive Compound Types. Diagram labels (column ID and type): 0 Struct, 1 Int, 2 Map, 3 String, 4 Struct, 5 String, 6 Double, 7 Time.
  13. Compound Type Serialization
  14. Generic Compression
  15. Column Projection
  16. How Do You Use ORC
  17. Managing Memory
  18. Pavan’s Trick
  19. Looking at ORC File Structures
  20. Looking at ORC File Structures
  21. TPC-DS File Sizes
  22. TPC-DS Query Performance
  23. Additional Details
  24. Current work
  25. Vectorization
  26. Vectorization Preliminary Results
  27. Future Work
  28. Thanks!
  29. Comparison

      Feature                      RC File  Trevni  Parquet  ORC File
      Hive Type Model              N        N       N        Y
      Separate complex columns     N        Y       Y        Y
      Splits found quickly         N        Y       Y        Y
      Default column group size    4MB      64MB*   64MB*    256MB
      Files per bucket             1        > 1     1*       1
      Store min, max, sum, count   N        N       N        Y
      Versioned metadata           N        Y       Y        Y
      Run length data encoding     N        N       Y        Y
      Store strings in dictionary  N        N       N        Y
      Store row count              N        Y       N        Y
      Skip compressed blocks       N        N       N        Y
      Store internal indexes       N        N       N        Y
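
Two rows of the comparison, "Run length data encoding" and "Store strings in dictionary", name techniques that are easy to show in miniature. The Python sketch below is only a toy illustration of the two ideas combined on a string column. It does not reproduce ORC's actual stream layout or byte-level encodings, and the function names and the choice of a sorted dictionary are assumptions made for the example.

```python
from typing import List, Tuple


def dictionary_encode(strings: List[str]) -> Tuple[List[str], List[int]]:
    """Replace each string with its position in a sorted list of unique values.

    Sorting the dictionary is an assumption of this sketch, not a statement
    about how ORC orders its dictionary.
    """
    dictionary = sorted(set(strings))
    position = {s: i for i, s in enumerate(dictionary)}
    return dictionary, [position[s] for s in strings]


def run_length_encode(values: List[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs: List[Tuple[int, int]] = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((v, 1))              # start a new run
    return runs


def run_length_decode(runs: List[Tuple[int, int]]) -> List[int]:
    """Expand (value, run_length) pairs back into the original sequence."""
    return [v for v, n in runs for _ in range(n)]


if __name__ == "__main__":
    column = ["us", "us", "us", "de", "de", "us"]
    dictionary, ids = dictionary_encode(column)
    runs = run_length_encode(ids)
    # Decoding the runs and looking each ID up restores the original column.
    restored = [dictionary[i] for i in run_length_decode(runs)]
    print(dictionary, ids, runs, restored == column)
```

The dictionary stores each distinct string once, and the integer IDs that replace the strings are then candidates for run-length encoding whenever equal values sit next to each other.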
