ORC File and Vectorization - Hadoop Summit 2013

by co-founder and senior architect at Hortonworks on Jun 28, 2013

Eric Hanson and I gave this presentation at Hadoop Summit 2013:

Hive’s RCFile has been the standard format for storing Hive data for the last three years. However, RCFile is limited because it treats each column as a binary blob without semantics. Hive 0.11 added a new file format, Optimized Row Columnar (ORC), that uses and retains the type information from the table definition. ORC uses type-specific readers and writers that provide lightweight compression techniques such as dictionary encoding, bit packing, delta encoding, and run-length encoding, resulting in dramatically smaller files. Additionally, ORC can apply generic compression using zlib, LZO, or Snappy on top of the lightweight compression for even smaller files.
However, storage savings are only part of the gain. ORC supports projection, which selects a subset of the columns for reading, so a query that reads only one column reads only the bytes for that column. Furthermore, ORC files include lightweight indexes that record the minimum and maximum values of each column for each set of 10,000 rows and for the entire file. Using pushdown filters from Hive, the file reader can skip entire sets of rows that cannot match the query.
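
The row-skipping idea can be illustrated with a minimal Python sketch. This is not ORC's reader or its on-disk layout; it only assumes an in-memory integer column, and the names RowGroupStats, build_index, and scan_between are hypothetical. The 10,000-row granularity is the one figure taken from the description above.

```python
from dataclasses import dataclass
from typing import List, Tuple

ROWS_PER_GROUP = 10_000  # index granularity mentioned in the abstract


@dataclass
class RowGroupStats:
    """Hypothetical index entry: row range plus min/max of one column."""
    start: int    # first row of the group (inclusive)
    end: int      # last row of the group (exclusive)
    minimum: int
    maximum: int


def build_index(column: List[int],
                rows_per_group: int = ROWS_PER_GROUP) -> List[RowGroupStats]:
    """Record the min and max of every fixed-size group of rows."""
    index = []
    for start in range(0, len(column), rows_per_group):
        chunk = column[start:start + rows_per_group]
        index.append(RowGroupStats(start, start + len(chunk),
                                   min(chunk), max(chunk)))
    return index


def scan_between(column: List[int], index: List[RowGroupStats],
                 low: int, high: int) -> Tuple[List[int], int]:
    """Return values v with low <= v <= high and the number of groups skipped."""
    matches: List[int] = []
    skipped = 0
    for stats in index:
        # The predicate cannot match if the ranges do not overlap,
        # so the whole group is skipped without reading its rows.
        if stats.maximum < low or stats.minimum > high:
            skipped += 1
            continue
        rows = column[stats.start:stats.end]
        matches.extend(v for v in rows if low <= v <= high)
    return matches, skipped


# Illustrative data only: a sorted column of 50,000 values (5 row groups).
column = list(range(50_000))
index = build_index(column)
matches, skipped = scan_between(column, index, 12_000, 12_005)
```

With this sorted sample column, only one of the five row groups overlaps the requested range, so the other four are never scanned. On unsorted data the min/max ranges overlap more often and fewer groups can be skipped.
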
Columnar storage formats like ORC reduce I/O and storage use, but it’s just as important to reduce CPU usage. Vectorized query execution, a technical breakthrough that works well with column-store formats, addresses this by processing rows in batches of column values instead of one row at a time. It has been shown to give dramatic speedups, on the order of 10X to 100X, for structured data processing. We describe how we’re adding vectorized query execution to Hive and coupling it with ORC through a vectorized iterator.
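
As a rough illustration of batch-at-a-time processing, the following Python sketch filters and transforms a table one batch of column values at a time, tracking surviving rows in a selection list. It is a toy model under stated assumptions, not Hive's implementation: the batch size of 1024 and the names ColumnBatch, batches, filter_greater_than, and add_columns are all hypothetical, and pure Python will not show the CPU gains that a compiled, tight-loop implementation would.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List

BATCH_SIZE = 1024  # assumed batch size, chosen for illustration


@dataclass
class ColumnBatch:
    """Hypothetical batch: one list per column plus the rows still selected."""
    columns: Dict[str, List[int]]
    size: int
    selected: List[int]  # indexes of rows that passed all filters so far


def batches(table: Dict[str, List[int]],
            batch_size: int = BATCH_SIZE) -> Iterator[ColumnBatch]:
    """Slice a columnar table into fixed-size batches."""
    row_count = len(next(iter(table.values())))
    for start in range(0, row_count, batch_size):
        cols = {name: vals[start:start + batch_size]
                for name, vals in table.items()}
        size = len(next(iter(cols.values())))
        yield ColumnBatch(cols, size, list(range(size)))


def filter_greater_than(batch: ColumnBatch, column: str, threshold: int) -> None:
    """Filter operator: one loop over a single column, narrowing `selected`."""
    values = batch.columns[column]
    batch.selected = [i for i in batch.selected if values[i] > threshold]


def add_columns(batch: ColumnBatch, left: str, right: str, out: str) -> None:
    """Arithmetic operator: computes only the rows that are still selected."""
    a, b = batch.columns[left], batch.columns[right]
    result = [0] * batch.size
    for i in batch.selected:
        result[i] = a[i] + b[i]
    batch.columns[out] = result


# Illustrative data only: 3,000 rows in two integer columns.
table = {"price": list(range(3000)), "tax": [1] * 3000}
total = 0
batch_count = 0
for batch in batches(table):
    batch_count += 1
    filter_greater_than(batch, "price", 2995)
    add_columns(batch, "price", "tax", "gross")
    total += sum(batch.columns["gross"][i] for i in batch.selected)
```

Each operator is invoked once per batch rather than once per row, and its inner loop touches a single column at a time, which is the property that pairs naturally with a columnar reader handing over whole column segments.
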

Presentation Transcript