(Julien Le Dem) Parquet

  1. Parquet
     Open columnar storage for Hadoop
     Julien Le Dem (@J_), processing tools lead, analytics infrastructure at Twitter
     http://parquet.io
  2. Twitter Context
     Twitter's data:
     • 100TB+ a day of compressed data
     • 200M+ monthly active users generating and consuming 500M+ tweets a day
     • Scale is huge: instrumentation, user graph, derived data, ...
     Analytics infrastructure:
     • Log collection pipeline
     • Several 1K+ node Hadoop clusters
     • Processing tools
     Role of Twitter's analytics infrastructure team:
     • Platform for the whole company.
     • Manages the data and enables analysis.
     • Optimizes the cluster's workload as a whole.
     (Slide image: "The Parquet Planers", Gustave Caillebotte.)
     Speaker notes:
     - Hundreds of TB. Hadoop: a distributed file system and a parallel execution engine focused on data locality.
     - Clusters: ad hoc, prod, ...
     - Log collection pipeline (Scribe, Kafka, ...); processing tools (Pig, ...).
     - Every group at Twitter uses the infrastructure: spam, ads, mobile, search, recommendations, ...
     - Optimize space/IO/CPU usage; the clusters are constantly growing and never big enough.
     - We ingest hundreds of terabytes in IO-bound processes: improved compression saves space, and column storage enables better scans.
  3. Twitter's use case
     • Logs available on HDFS
     • Thrift to store logs
     • Example: one schema has 87 columns, up to 7 levels of nesting.

         struct LogEvent {
           1: optional logbase.LogBase log_base
           2: optional i64 event_value
           3: optional string context
           4: optional string referring_event
           ...
           18: optional EventNamespace event_namespace
           19: optional list<Item> items
           20: optional map<AssociationType,Association> associations
           21: optional MobileDetails mobile_details
           22: optional WidgetDetails widget_details
           23: optional map<ExternalService,string> external_ids
         }

         struct LogBase {
           1: string transaction_id,
           2: string ip_address,
           ...
           15: optional string country,
           16: optional string pid,
         }

     Speaker notes:
     - Logs are stored using Thrift with a heavily nested data structure.
     - Example: aggregate event counts per IP address for a given event_type. We only need to access the ip_address and event_name columns.
  4. Goal
     To have a state-of-the-art columnar storage format available across the Hadoop platform.
     Hadoop is very reliable for big, long-running queries, but is also IO heavy.
     • Incrementally take advantage of column-based storage in existing frameworks.
     • Not tied to any framework in particular.
  5. Columnar storage
     • Limits the IO to only the data that is needed.
     • Saves space: a columnar layout compresses better.
     • Enables better scans: load only the columns that need to be accessed.
     • Enables vectorized execution engines.
     (Slide image credit: @EmrgencyKittens)
  6. Collaboration between Twitter and Cloudera
     • Common file format definition:
       • Language independent
       • Formally specified
     • Implementation in Java for Map/Reduce:
       • https://github.com/Parquet/parquet-mr
     • C++ and code generation in Cloudera Impala:
       • https://github.com/cloudera/impala
  7. Twitter: Initial results
     Data converted: similar to access logs, 30 columns.
     Original format: Thrift binary in block-compressed files.
     (Charts: space and scan time, Thrift vs. Parquet, for 1 column and all 30 columns.)
     • Space saving: 28% using the same compression algorithm.
     • Scan + assembly time compared to the original: one column: 10%; all columns: 114%.
  8. Additional gains with dictionary
     13 out of the 30 columns are suitable for dictionary encoding: they represent 27% of the raw data but only 3% of the compressed data.
     (Charts: space for Parquet compressed (LZO), Parquet dictionary uncompressed and Parquet dictionary compressed (LZO); scan time relative to Thrift for 1 and 13 columns, Parquet compressed vs. Parquet dictionary compressed.)
     • Space saving: another 52% using the same compression algorithm (on top of the original columnar storage gains).
     • Scan + assembly time compared to plain Parquet: all 13 columns: 48% (of the already faster columnar scan).
  9. Format
     (Diagram: a row group contains column chunks for columns a, b and c; each column chunk contains pages 0, 1, 2, ...)
     • Row group: a group of rows in columnar format.
       • Max size buffered in memory while writing (see the buffering sketch below).
       • One (or more) per split while reading.
       • Roughly: 50MB < row group < 1GB.
     • Column chunk: the data for one column in a row group.
       • Column chunks can be read independently for efficient scans.
     • Page: unit of access in a column chunk.
       • Should be big enough for compression to be efficient.
       • Minimum size to read to access a single record (when index pages are available).
       • Roughly: 8KB < page < 1MB.
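     A minimal, hypothetical buffering sketch in Java (all names and numbers are illustrative; this is not the parquet-mr writer) showing how those size bounds drive writing for a single column: values accumulate into a page until the page target is reached, finished pages accumulate into a column chunk, and the row group is flushed once the buffered total reaches the row group target.

        import java.util.ArrayList;
        import java.util.List;

        // Illustrative only: one column of one row group.
        public class WriteBuffering {
            static final long ROW_GROUP_TARGET = 256L * 1024 * 1024; // between ~50MB and 1GB
            static final long PAGE_TARGET = 64L * 1024;              // between ~8KB and 1MB

            final List<List<byte[]>> columnChunk = new ArrayList<>(); // finished pages
            final List<byte[]> currentPage = new ArrayList<>();
            long currentPageBytes = 0;
            long rowGroupBytes = 0;

            void writeValue(byte[] encodedValue) {
                currentPage.add(encodedValue);
                currentPageBytes += encodedValue.length;
                rowGroupBytes += encodedValue.length;
                if (currentPageBytes >= PAGE_TARGET) {
                    flushPage();     // a page is the unit of access (and of compression)
                }
                if (rowGroupBytes >= ROW_GROUP_TARGET) {
                    flushRowGroup(); // all column chunks written back to back; offsets go in the footer
                }
            }

            void flushPage() {
                columnChunk.add(new ArrayList<>(currentPage));
                currentPage.clear();
                currentPageBytes = 0;
            }

            void flushRowGroup() {
                if (!currentPage.isEmpty()) flushPage();
                columnChunk.clear(); // a real writer would compress and write to the file here
                rowGroupBytes = 0;
            }

            public static void main(String[] args) {
                WriteBuffering column = new WriteBuffering();
                for (int i = 0; i < 1000; i++) {
                    column.writeValue(new byte[100]); // pretend every value encodes to 100 bytes
                }
                System.out.println("finished pages buffered: " + column.columnChunk.size());
            }
        }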
  10. Format
      Layout: row groups in columnar format. A footer contains the column chunk offsets and the schema.
      Language independent: well-defined format; Hadoop and Cloudera Impala support.
  11. Nested record shredding/assembly
      • Algorithm borrowed from Google Dremel's column IO.
      • Each cell is encoded as a triplet: repetition level, definition level, value.
      • Level values are bounded by the depth of the schema: stored in a compact form.

      Schema:

          message Document {
            required int64 DocId;
            optional group Links {
              repeated int64 Backward;
              repeated int64 Forward;
            }
          }

      Columns (the leaves of the schema) and their maximum levels:

          Column          Max rep. level   Max def. level
          DocId           0                0
          Links.Backward  1                2
          Links.Forward   1                2

      Record:

          DocId: 20
          Links
            Backward: 10
            Backward: 30
            Forward: 80

      Resulting column values:

          Column          Value   R   D
          DocId           20      0   0
          Links.Backward  10      0   2
          Links.Backward  30      1   2
          Links.Forward   80      0   2

      Speaker notes:
      - We convert from a nested data structure to flat columns, so we need to store extra information to maintain the record structure.
      - We can picture the schema as a tree. The columns are the leaves, identified by their path in the schema.
      - In addition to the values we store two integers called the repetition level and the definition level.
      - When the value is NULL, the definition level captures up to which level in the path the value was defined.
      - The repetition level is the last level in the path where there was a repetition. 0 means we have a new record.
      - Both levels are bounded by the depth of the tree. They can be encoded efficiently using bit packing and run-length encoding.
      - A flat schema with all fields required does not have any overhead, as R and D are always 0 in that case.
      - Reference: http://research.google.com/pubs/pub36632.html
  12. Repetition level
      record 1: [[a, b, c], [d, e, f, g]]
      record 2: [[h], [i, j]]

      Level: 0, 2, 2, 1, 2, 2, 2, 0, 1, 2
      Data:  a, b, c, d, e, f, g, h, i, j
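      A small, self-contained Java sketch (illustrative; not the parquet-mr shredder) that reproduces the repetition levels above for a list-of-lists column: level 0 starts a new record, level 1 starts a new inner list within the same record, and level 2 continues the current inner list.

         import java.util.ArrayList;
         import java.util.List;

         public class RepetitionLevels {
             // Flattens records of type list<list<string>> into (level, value) streams.
             static void shred(List<List<List<String>>> records,
                               List<Integer> levels, List<String> data) {
                 for (List<List<String>> record : records) {
                     int listLevel = 0;               // first value of a record repeats at level 0
                     for (List<String> inner : record) {
                         int level = listLevel;       // first value of an inner list: 0 or 1
                         for (String value : inner) {
                             levels.add(level);
                             data.add(value);
                             level = 2;               // following values repeat at the innermost level
                         }
                         listLevel = 1;
                     }
                 }
             }

             public static void main(String[] args) {
                 List<List<List<String>>> records = List.of(
                     List.of(List.of("a", "b", "c"), List.of("d", "e", "f", "g")),
                     List.of(List.of("h"), List.of("i", "j")));
                 List<Integer> levels = new ArrayList<>();
                 List<String> data = new ArrayList<>();
                 shred(records, levels, data);
                 System.out.println("Level: " + levels); // [0, 2, 2, 1, 2, 2, 2, 0, 1, 2]
                 System.out.println("Data:  " + data);   // [a, b, c, d, e, f, g, h, i, j]
             }
         }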
  13. Differences between Parquet and ORC
      (Diagram: the Document schema tree with leaves DocId, Links.Backward and Links.Forward.)
      Parquet:
      • Repetition/definition levels capture the structure.
        => One column per leaf in the schema.
      • Array<int> is one column (see the layout sketch below).
      • Nullity/repetition of an inner node is stored in each of its children.
        => One column independently of nesting, with some redundancy.
      ORC:
      • An extra column for each Map or List records its size.
        => One column per node in the schema.
      • Array<int> is two columns: array size and content.
        => An extra column per nesting level.
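      A simplified Java illustration of the Array<int> point above, for two rows [1, 2] and [3]. The arrays below only sketch the idea; neither format's actual physical layout is this literal.

         import java.util.Arrays;

         public class ListLayouts {
             public static void main(String[] args) {
                 // Rows of a repeated int32 field: [1, 2] and [3].
                 int[] parquetValues    = {1, 2, 3}; // Parquet: one column per leaf
                 int[] parquetRepLevels = {0, 1, 0}; // 0 starts a new row, 1 continues the current list
                 int[] orcListLengths   = {2, 1};    // ORC: extra column recording each list's size
                 int[] orcListContent   = {1, 2, 3}; // ORC: child column holding the elements
                 System.out.println("Parquet values " + Arrays.toString(parquetValues)
                         + ", repetition levels " + Arrays.toString(parquetRepLevels));
                 System.out.println("ORC lengths " + Arrays.toString(orcListLengths)
                         + ", content " + Arrays.toString(orcListContent));
             }
         }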
  14. Iteration on fully assembled records
      • To integrate with existing row-based engines (Hive, Pig, M/R).
      • Aware of dictionary encoding: enables optimizations.
      • Assembles the projection for any subset of the columns: only those are loaded from disc.
      (Diagrams: columns a and b reassembled into records (a1, b1), (a2, b2), (a3, b3); the Document record reassembled from its DocId, Links.Backward and Links.Forward columns.)
      Speaker notes:
      - Explain in the context of the example.
      - Columns: for implementing column-based engines.
      - Record-based: for integrating with existing row-based engines (Hive, Pig, M/R).
  15. Iteration on columns
      • To implement a column-based execution engine.
      • Iteration on triplets: repetition level, definition level, value (see the sketch below).
        • Repetition level = 0 indicates a new record; R = 1 => same row.
        • D < 1 => null (for a column whose max definition level is 1).
      • Encoded or decoded values: computing aggregations on integers is faster than on strings.

          Row   R   D   V
          0     0   1   A
          1     0   1   B
                1   1   C
          2     0   0   (null)
          3     0   1
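      A hedged Java sketch of consuming such a triplet stream using the two rules above (hypothetical names; this is not the Parquet reader API):

         import java.util.List;

         public class TripletIteration {
             // One (repetition level, definition level, value) triplet; v is ignored when d < 1.
             record Triplet(int r, int d, String v) {}

             public static void main(String[] args) {
                 List<Triplet> column = List.of(
                     new Triplet(0, 1, "A"),   // row 0
                     new Triplet(0, 1, "B"),   // row 1
                     new Triplet(1, 1, "C"),   // R = 1: still row 1
                     new Triplet(0, 0, null)); // row 2: D < 1, so the value is null
                 int row = -1;
                 for (Triplet t : column) {
                     if (t.r() == 0) row++;                     // repetition level 0 starts a new row
                     String value = t.d() < 1 ? "null" : t.v(); // definition level below max means null
                     System.out.println("row " + row + ": " + value);
                 }
             }
         }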
  16. APIs
      • Schema definition and record materialization:
        • Hadoop does not have a notion of schema; however, Impala, Pig, Hive, Thrift, Avro and ProtocolBuffers do.
        • Event-based, SAX-style record materialization layer (sketched below). No double conversion.
      • Integration with existing type systems and processing frameworks:
        • Impala
        • Pig
        • Thrift and Scrooge for M/R, Cascading and Scalding
        • Cascading tuples
        • Avro
        • Hive
      Speaker notes: they all define a schema in a different way.
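      A rough Java sketch of what an event-based, SAX-style materialization layer means: the column reader pushes field events and a per-framework assembler builds its native record (Pig tuple, Thrift object, ...) directly, with no intermediate representation. The interface below is hypothetical and only gestures at the idea; the actual parquet-mr API differs.

         // Hypothetical event interface: one assembler per target type system.
         interface RecordAssembler<T> {
             void startRecord();
             void field(String name, Object value); // one event per non-null leaf value
             T endRecord();                          // returns the assembled record
         }

         // Example assembler that builds a java.util.Map instead of a framework type.
         class MapAssembler implements RecordAssembler<java.util.Map<String, Object>> {
             private java.util.Map<String, Object> current;
             public void startRecord() { current = new java.util.HashMap<>(); }
             public void field(String name, Object value) { current.put(name, value); }
             public java.util.Map<String, Object> endRecord() { return current; }
         }

         public class MaterializationSketch {
             public static void main(String[] args) {
                 MapAssembler m = new MapAssembler();
                 m.startRecord();                  // in a real reader these events come from the columns
                 m.field("ip_address", "10.0.0.1");
                 m.field("event_value", 42L);
                 System.out.println(m.endRecord());
             }
         }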
  17. Encodings
      • Bit packing: 1 3 2 0 0 2 2 0 => 01|11|10|00 00|10|10|00 (see the sketch below)
        • Small integers encoded in the minimum number of bits required.
        • Useful for repetition levels, definition levels and dictionary keys.
      • Run-length encoding: 1 1 1 1 1 1 1 1 => 8 x 1
        • Cheap compression.
        • Works well for the definition levels of sparse columns.
      • Dictionary encoding:
        • Used in combination with bit packing.
        • Useful for columns with few (< 50,000) distinct values.
      • Extensible: defining new encodings is supported by the format.
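      A minimal Java sketch of the bit-packing example above (illustrative; not Parquet's actual encoder): the eight 2-bit values 1 3 2 0 0 2 2 0 are packed most-significant-bits-first into the two bytes 01111000 and 00101000.

         public class BitPacking {
             // Packs values (each < 4) into bytes, 2 bits per value, most significant bits first.
             static byte[] pack2Bit(int[] values) {
                 byte[] out = new byte[(values.length + 3) / 4];
                 for (int i = 0; i < values.length; i++) {
                     int shift = 6 - 2 * (i % 4);            // position within the byte, MSB first
                     out[i / 4] |= (values[i] & 0b11) << shift;
                 }
                 return out;
             }

             public static void main(String[] args) {
                 int[] values = {1, 3, 2, 0, 0, 2, 2, 0};
                 for (byte b : pack2Bit(values)) {
                     String bits = Integer.toBinaryString(b & 0xFF);
                     System.out.println(String.format("%8s", bits).replace(' ', '0'));
                 }
                 // prints 01111000 then 00101000, i.e. 01|11|10|00 and 00|10|10|00
             }
         }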
  18. Contributors
      Main contributors:
      • Julien Le Dem (Twitter): Format, Core, Pig, Thrift integration, Encodings
      • Nong Li, Marcel Kornacker, Todd Lipcon (Cloudera): Format, Impala
      • Jonathan Coveney, Alex Levenson, Aniket Mokashi, Tianshuo Deng (Twitter): Encodings, projection push down
      • Mickaël Lacour, Rémy Pecqueur (Criteo): Hive integration
      • Dmitriy Ryaboy (Twitter): Format, Thrift and Scrooge Cascading integration
      • Tom White (Cloudera): Avro integration
      • Avi Bryant, Colin Marc (Stripe): Cascading tuples integration
      • Matt Massie (Berkeley AMP Lab): predicate and projection push down
      • David Chen (LinkedIn): Avro integration improvements
  19. Parquet 2.0
      • More encodings: more compact storage without heavyweight compression.
        • Delta encodings: for integers, strings and dictionary (illustrated below).
        • Improved encoding for strings and booleans.
      • Statistics: to be used by query planners and predicate push down.
      • New page format: faster predicate push down.
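      For example, a delta encoding for a roughly increasing integer column can store a base value plus small differences that then bit-pack into very few bits. The snippet below only illustrates that idea with assumed sample values; it is not Parquet 2.0's actual delta layout.

         import java.util.Arrays;

         public class DeltaSketch {
             public static void main(String[] args) {
                 long[] timestamps = {1370000000L, 1370000003L, 1370000004L, 1370000009L};
                 long base = timestamps[0];
                 long[] deltas = new long[timestamps.length - 1];
                 for (int i = 1; i < timestamps.length; i++) {
                     deltas[i - 1] = timestamps[i] - timestamps[i - 1]; // 3, 1, 5: each fits in 3 bits
                 }
                 System.out.println("base " + base + ", deltas " + Arrays.toString(deltas));
             }
         }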
  20. How to contribute
      Questions? Ideas?
      Contribute at: github.com/Parquet
      Look for tickets marked as “pick me up!”
      http://parquet.io
