ORC File & Vectorization - Improving Hive Data Storage and Query Performance
Upcoming SlideShare
Loading in...5
×
 

ORC File & Vectorization - Improving Hive Data Storage and Query Performance

on

  • 3,168 views

Hive’s RCFile has been the standard format for storing Hive data for the last 3 years. However, RCFile has limitations because it treats each column as a binary blob without semantics. The upcoming ...

Hive’s RCFile has been the standard format for storing Hive data for the last 3 years. However, RCFile has limitations because it treats each column as a binary blob without semantics. The upcoming Hive 0.11 will add a new file format named Optimized Row Columnar (ORC) file that uses and retains the type information from the table definition. ORC uses type specific readers and writers that provide light weight compression techniques such as dictionary encoding, bit packing, delta encoding, and run length encoding — resulting in dramatically smaller files. Additionally, ORC can apply generic compression using zlib, LZO, or Snappy on top of the lightweight compression for even smaller files. However, storage savings are only part of the gain. ORC supports projection, which selects subsets of the columns for reading, so that queries reading only one column read only the required bytes. Furthermore, ORC files include light weight indexes that include the minimum and maximum values for each column in each set of 10,000 rows and the entire file. Using pushdown filters from Hive, the file reader can skip entire sets of rows that aren’t important for this query.
Columnar storage formats like ORC reduce I/O and storage use, but it’s just as important to reduce CPU usage. A technical breakthrough called vectorized query execution works nicely with column store formats to do this. Vectorized query execution has proven to give dramatic performance speedups, on the order of 10X to 100X, for structured data processing. We describe how we’re adding vectorized query execution to Hive, coupling it with ORC with a vectorized iterator.

Statistics

Views

Total Views
3,168
Views on SlideShare
3,168
Embed Views
0

Actions

Likes
10
Downloads
103
Comments
0

0 Embeds 0

No embeds

Accessibility

Categories

Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

ORC File & Vectorization - Improving Hive Data Storage and Query Performance ORC File & Vectorization - Improving Hive Data Storage and Query Performance Presentation Transcript

  • Copyright 2013 by Hortonworks and Microsoft ORC File & Vectorization Improving Hive Data Storage and Query Performance June 2013 Page 1 Owen O’Malley owen@hortonworks.com @owen_omalley Jitendra Pandey jitendra@hortonworks.com Eric Hanson ehans@microsoft.com owen@hortonworks.c om
  • ORC – Optimized RC File Page 2
  • History Page 3
  • Remaining Challenges Page 4
  • Requirements Page 5
  • File Structure Page 6
  • Stripe Structure Page 7
  • File Layout Page 8 File Footer Postscript Index Data Row Data Stripe Footer 256MBStripe Index Data Row Data Stripe Footer 256MBStripe Index Data Row Data Stripe Footer 256MBStripe Column 1 Column 2 Column 7 Column 8 Column 3 Column 6 Column 4 Column 5 Column 1 Column 2 Column 7 Column 8 Column 3 Column 6 Column 4 Column 5 Stream 2.1 Stream 2.2 Stream 2.3 Stream 2.4
  • Compression Page 9
  • Integer Column Serialization Page 10
  • String Column Serialization Page 11
  • Hive Compound Types Page 12 0 Struct 4 Struct 3 String 1 Int 2 Map 7 Time 5 String 6 Double
  • Compound Type Serialization Page 13
  • Generic Compression Page 14
  • Column Projection Page 15
  • How Do You Use ORC Page 16
  • Managing Memory Page 17
  • TPC-DS File Sizes Page 18
  • ORC Predicate Pushdown Page 19
  • Additional Details Page 20
  • Current work for Hive 0.12 Page 21
  • Future Work Page 22
  • Comparison Page 23 RC File Trevni Parquet ORC Hive Integration Y N N Y Active Development N N Y Y Hive Type Model N N N Y Shred complex columns N Y Y Y Splits found quickly N Y Y Y Files per a bucket 1 many 1 or many 1 Versioned metadata N Y Y Y Run length data encoding N N Y Y Store strings in dictionary N N Y Y Store min, max, sum, count N N N Y Store internal indexes N N N Y No overhead for non-null N N N Y ≥ 0.12 Predicate Pushdown N N N Y ≥ 0.12
  • Vectorization Page 24
  • Vectorization Page 25
  • Why row-at-a-time execution is slow Page 26 • Hive uses Object Inspectors to work on a row • Enables level of abstraction • Costs major performance • Exacerbated by using lazy serdes • Inner loop has many method, new(), and if- then-else calls • Lots of CPU instructions • Pipeline stalls Poor instructions/cycle • Poor cache locality
  • How the code works (simplified) Page 27 class LongColumnAddLongScalarExpression { int inputColumn; int outputColumn; long scalar; void evaluate(VectorizedRowBatch batch) { long [] inVector = ((LongColumnVector) batch.columns[inputColumn]).vector; long [] outVector = ((LongColumnVector) batch.columns[outputColumn]).vector; if (batch.selectedInUse) { for (int j = 0; j < batch.size; j++) { int i = batch.selected[j]; outVector[i] = inVector[i] + scalar; } } else { for (int i = 0; i < batch.size; i++) { outVector[i] = inVector[i] + scalar; } } } } } No method calls Low instruction count Cache locality to 1024 values No pipeline stalls SIMD in Java 8
  • Vectorization project Page 28
  • Preliminary performance results • NOT a benchmark • 218 million row fact table of real data, 25 columns • 18GB raw data • 6 core, 12 thread workstation, 1 disk, 16GB RAM • select a, b, count(*) from t where c >= const group by a, b -- 53 row result Page 29 warm start times RC non- vectorized (default, not compressed) ORC non- vectorized (default, compressed) ORC vectorized (default, compressed) Runtime (sec) 261 58 43 Total CPU (sec) 381 159 42
  • Thanks to contributors! Page 30 • Microsoft Big Data: • Eric Hanson, Remus Rusanu, Sarvesh Sakalanaga, Tony Murphy, Ashit Gosalia • Hortonworks: • Jitendra Pandey, Owen O’Malley, Gopal V • Others: • Teddy Choi, Tim Chen Jitendra/Eric are joint leads