ORC Files

Hive’s RCFile has been the standard format for storing Hive data for the last three years. However, RCFile is limited because it treats each column as a binary blob without semantics. The upcoming Hive 0.11 will add a new file format, Optimized Row Columnar (ORC), that uses and retains the type information from the table definition. ORC uses type-specific readers and writers that provide lightweight compression techniques such as dictionary encoding, bit packing, delta encoding, and run-length encoding, resulting in dramatically smaller files. Additionally, ORC can apply generic compression using zlib, LZO, or Snappy on top of the lightweight compression for even smaller files. However, storage savings are only part of the gain. ORC supports projection, which selects a subset of the columns for reading, so that a query reading only one column reads only the bytes for that column. Furthermore, ORC files include lightweight indexes that record the minimum and maximum values of each column for each set of 10,000 rows and for the entire file. Using pushdown filters from Hive, the file reader can skip entire sets of rows that are not relevant to the query. Finally, ORC works together with the upcoming query vectorization work, providing a high-bandwidth reader/writer interface.
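
The row-skipping idea described above can be illustrated with a toy sketch. This is not ORC's reader or file format; the names `RowGroupStats`, `build_index`, and `scan_greater_than` are hypothetical, and the column is a plain in-memory list. It only shows how per-group minimum and maximum values let a reader skip groups of 10,000 rows that cannot satisfy a pushed-down filter.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

ROW_GROUP_SIZE = 10_000  # index granularity mentioned in the abstract


@dataclass
class RowGroupStats:
    """Min/max statistics for one group of rows (hypothetical structure)."""
    start: int    # index of the first row in the group
    end: int      # one past the last row in the group
    minimum: int
    maximum: int


def build_index(values: Sequence[int],
                group_size: int = ROW_GROUP_SIZE) -> List[RowGroupStats]:
    """Record the min and max of every consecutive group of rows."""
    index = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        index.append(
            RowGroupStats(start, start + len(chunk), min(chunk), max(chunk))
        )
    return index


def scan_greater_than(values: Sequence[int],
                      index: List[RowGroupStats],
                      threshold: int) -> Tuple[List[int], int]:
    """Evaluate `value > threshold`, reading only groups that might match.

    Returns the matching values and the number of groups actually read.
    """
    matches: List[int] = []
    groups_read = 0
    for stats in index:
        if stats.maximum <= threshold:
            continue  # no row in this group can satisfy the filter
        groups_read += 1
        matches.extend(
            v for v in values[stats.start:stats.end] if v > threshold
        )
    return matches, groups_read


# Toy column: 50,000 sorted integers, so the index has five row groups.
values = list(range(50_000))
index = build_index(values)
matches, groups_read = scan_greater_than(values, index, 44_999)
print(f"read {groups_read} of {len(index)} row groups, {len(matches)} matches")
```

Skipping works best when the data is sorted or clustered on the filtered column, as in this toy example; on randomly ordered data most groups span the full value range and few can be skipped.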

ORC Files

  1. ORC Files, June 2013. Owen O’Malley, owen@hortonworks.com, @owen_omalley. © Hortonworks Inc. 2012
  2. Who Am I?
  3. History
  4. Remaining Challenges
  5. Requirements
  6. File Structure
  7. Stripe Structure
  8. File Layout [diagram labels: File Footer; Postscript; three 256 MB stripes, each with Index Data, Row Data, and Stripe Footer; Columns 1 to 8; Streams 2.1 to 2.4]
  9. Compression
  10. Integer Column Serialization
  11. String Column Serialization
  12. Hive Compound Types [diagram labels, by column ID: 0 Struct, 1 Int, 2 Map, 3 String, 4 Struct, 5 String, 6 Double, 7 Time]
  13. Compound Type Serialization
  14. Generic Compression
  15. Column Projection
  16. How Do You Use ORC
  17. Managing Memory
  18. Pavan’s Trick
  19. Looking at ORC File Structures
  20. Looking at ORC File Structures
  21. TPC-DS File Sizes
  22. TPC-DS Query Performance
  23. Additional Details
  24. Current Work
  25. Vectorization
  26. Vectorization Preliminary Results
  27. Future Work
  28. Thanks!
  29. Comparison

      Feature                        RC File   Trevni   Parquet   ORC File
      Hive type model                N         N        N         Y
      Separate complex columns       N         Y        Y         Y
      Splits found quickly           N         Y        Y         Y
      Default column group size      4MB       64MB*    64MB*     256MB
      Files per bucket               1         > 1      1*        1
      Store min, max, sum, count     N         N        N         Y
      Versioned metadata             N         Y        Y         Y
      Run length data encoding       N         N        Y         Y
      Store strings in dictionary    N         N        N         Y
      Store row count                N         Y        N         Y
      Skip compressed blocks         N         N        N         Y
      Store internal indexes         N         N        N         Y
