Parquet: Open columnar storage for Hadoop (Julien Le Dem)

Slide 1: Parquet
Open columnar storage for Hadoop
Julien Le Dem @J_, Processing tools lead, analytics infrastructure at Twitter
http://parquet.io
Slide 2: Twitter context
Twitter's data:
• 100TB+ a day of compressed data
• 200M+ monthly active users generating and consuming 500M+ tweets a day
• Scale is huge: instrumentation, user graph, derived data, ...
Analytics infrastructure:
• Log collection pipeline
• Several 1K+ node Hadoop clusters
• Processing tools
Role of Twitter's analytics infrastructure team:
• Platform for the whole company
• Manages the data and enables analysis
• Optimizes the cluster's workload as a whole
(Image: "The Parquet Planers" by Gustave Caillebotte)

Speaker notes:
- Hundreds of TB; Hadoop is a distributed file system and a parallel execution engine focused on data locality.
- Clusters: ad hoc, prod, ...
- Log collection pipeline (Scribe, Kafka, ...); processing tools (Pig, ...)
- Every group at Twitter uses the infrastructure: spam, ads, mobile, search, recommendations, ...
- Optimize space/IO/CPU usage: the clusters are constantly growing and never big enough.
- We ingest hundreds of terabytes in IO-bound processes: improved compression saves space, and column storage enables better scans.
Slide 3: Twitter's use case
• Logs available on HDFS
• Thrift to store logs
• Example: one schema has 87 columns, up to 7 levels of nesting.

    struct LogEvent {
      1: optional logbase.LogBase log_base
      2: optional i64 event_value
      3: optional string context
      4: optional string referring_event
      ...
      18: optional EventNamespace event_namespace
      19: optional list<Item> items
      20: optional map<AssociationType, Association> associations
      21: optional MobileDetails mobile_details
      22: optional WidgetDetails widget_details
      23: optional map<ExternalService, string> external_ids
    }

    struct LogBase {
      1: string transaction_id,
      2: string ip_address,
      ...
      15: optional string country,
      16: optional string pid,
    }

Speaker notes:
- Logs are stored using Thrift with heavily nested data structures.
- Example: aggregate event counts per IP address for a given event type; we only need to access the ip_address and event_name columns.
Slide 4: Goal
To have state-of-the-art columnar storage available across the Hadoop platform.
• Hadoop is very reliable for big, long-running queries, but it is also IO heavy.
• Incrementally take advantage of column-based storage in existing frameworks.
• Not tied to any framework in particular.
Slide 5: Columnar storage
• Limits the IO to only the data that is needed.
• Saves space: a columnar layout compresses better.
• Enables better scans: load only the columns that need to be accessed.
• Enables vectorized execution engines.
(Image credit: @EmrgencyKittens)
Slide 6: Collaboration between Twitter and Cloudera
• Common file format definition:
  • Language independent
  • Formally specified
• Implementation in Java for Map/Reduce:
  • https://github.com/Parquet/parquet-mr
• C++ and code generation in Cloudera Impala:
  • https://github.com/cloudera/impala
Slide 7: Twitter: initial results
Data converted: similar to access logs, 30 columns.
Original format: Thrift binary in block-compressed files.
(Charts: space, Thrift vs. Parquet; scan time for 1 column and for all 30 columns, Thrift vs. Parquet)
• Space saving: 28% using the same compression algorithm.
• Scan + assembly time compared to the original: one column: 10%; all columns: 114%.
Slide 8: Additional gains with dictionary encoding
13 out of the 30 columns are suitable for dictionary encoding: they represent 27% of raw data but only 3% of compressed data.
(Charts: space for Parquet compressed (LZO), Parquet dictionary uncompressed, and Parquet dictionary compressed (LZO); scan time relative to Thrift for 1 and 13 columns, Parquet compressed vs. Parquet dictionary compressed)
• Space saving: another 52% using the same compression algorithm (on top of the original columnar storage gains).
• Scan + assembly time compared to plain Parquet, all 13 columns: 48% (of the already faster columnar scan).
Slide 9: Format
(Diagram: a row group with column chunks for columns a, b and c, each split into pages)
• Row group: a group of rows in columnar format.
  • Max size buffered in memory while writing.
  • One (or more) per split while reading.
  • Roughly: 50MB < row group < 1GB.
• Column chunk: the data for one column in a row group.
  • Column chunks can be read independently for efficient scans.
• Page: unit of access in a column chunk.
  • Should be big enough for compression to be efficient.
  • Minimum size to read to access a single record (when index pages are available).
  • Roughly: 8KB < page < 1MB.
Slide 10: Format
Layout: row groups in columnar format. A footer contains the column chunk offsets and the schema.
Language independent: well-defined format, supported by Hadoop and Cloudera Impala.
Slide 11: Nested record shredding/assembly
• Algorithm borrowed from Google Dremel's column IO.
• Each cell is encoded as a triplet: repetition level, definition level, value (see the sketch after this slide).
• Level values are bound by the depth of the schema: stored in a compact form.

Schema:

    message Document {
      required int64 DocId;
      optional group Links {
        repeated int64 Backward;
        repeated int64 Forward;
      }
    }

Columns (one per leaf):

    Column          Max rep. level   Max def. level
    DocId           0                0
    Links.Backward  1                2
    Links.Forward   1                2

Record:

    DocId: 20
    Links
      Backward: 10
      Backward: 30
      Forward: 80

Resulting column values (R = repetition level, D = definition level):

    Column          Value   R   D
    DocId           20      0   0
    Links.Backward  10      0   2
    Links.Backward  30      1   2
    Links.Forward   80      0   2

Speaker notes:
- We convert from a nested data structure to flat columns, so we need to store extra information to maintain the record structure.
- We can picture the schema as a tree; the columns are the leaves, identified by their path in the schema.
- In addition to the values we store two integers: the repetition level and the definition level.
- When the value is NULL, the definition level captures up to which level in the path the value was defined.
- The repetition level is the last level in the path where there was a repetition; 0 means we have a new record.
- Both levels are bound by the depth of the tree, so they can be encoded efficiently using bit packing and run-length encoding.
- A flat schema with all fields required has no overhead, since R and D are always 0 in that case.
- Reference: http://research.google.com/pubs/pub36632.html
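To make the triplet encoding concrete, below is a minimal sketch in plain Java (not the parquet-mr writer API) that shreds just the Links.Backward column of the Document schema above. It assumes max repetition level 1 and max definition level 2, and uses a null list to stand for a missing Links group and an empty list for a Links group with no Backward values.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class ShredBackwardColumn {
        // One cell of the column: repetition level, definition level, value (null when undefined).
        record Triplet(int r, int d, Long value) {}

        // Shred the Links.Backward values of one record into (R, D, V) triplets.
        static void shred(List<Long> backward, List<Triplet> out) {
            if (backward == null) {            // Links group absent: defined only up to level 0
                out.add(new Triplet(0, 0, null));
            } else if (backward.isEmpty()) {   // Links present but no Backward values: level 1
                out.add(new Triplet(0, 1, null));
            } else {
                for (int i = 0; i < backward.size(); i++) {
                    // The first value of a record starts at R = 0; repeats within the record use R = 1.
                    out.add(new Triplet(i == 0 ? 0 : 1, 2, backward.get(i)));
                }
            }
        }

        public static void main(String[] args) {
            List<Triplet> column = new ArrayList<>();
            shred(Arrays.asList(10L, 30L), column);  // the record from the slide: Backward: 10, 30
            shred(null, column);                     // a second record with no Links group
            column.forEach(t -> System.out.println("R=" + t.r() + " D=" + t.d() + " V=" + t.value()));
            // R=0 D=2 V=10
            // R=1 D=2 V=30
            // R=0 D=0 V=null
        }
    }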
Slide 12: Repetition level
record 1: [[a, b, c], [d, e, f, g]]
record 2: [[h], [i, j]]

Levels: 0, 2, 2, 1, 2, 2, 2, 0, 1, 2
Data:   a, b, c, d, e, f, g, h, i, j
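As a sanity check on the example above, here is a minimal reassembly sketch in plain Java (not a Parquet API): level 0 starts a new record, level 1 starts a new inner list, and level 2 continues the current inner list.

    import java.util.ArrayList;
    import java.util.List;

    public class RepetitionLevelDemo {
        public static void main(String[] args) {
            int[] levels = {0, 2, 2, 1, 2, 2, 2, 0, 1, 2};
            String[] data = {"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"};

            List<List<List<String>>> records = new ArrayList<>();
            List<List<String>> currentRecord = null;
            List<String> currentGroup = null;
            for (int i = 0; i < data.length; i++) {
                if (levels[i] == 0) {            // level 0: start a new record with a fresh inner list
                    currentRecord = new ArrayList<>();
                    records.add(currentRecord);
                    currentGroup = new ArrayList<>();
                    currentRecord.add(currentGroup);
                } else if (levels[i] == 1) {     // level 1: start a new inner list in the current record
                    currentGroup = new ArrayList<>();
                    currentRecord.add(currentGroup);
                }                                // level 2: stay in the current inner list
                currentGroup.add(data[i]);
            }
            System.out.println(records);
            // [[[a, b, c], [d, e, f, g]], [[h], [i, j]]]
        }
    }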
Slide 13: Differences between Parquet and ORC
(Diagram: the Document schema tree with leaves DocId, Links.Backward and Links.Forward)
Parquet:
• Repetition/definition levels capture the structure.
  => one column per leaf in the schema.
• Array<int> is one column.
• Nullity/repetition of an inner node is stored in each of its children.
  => one column independently of nesting, with some redundancy.
ORC:
• An extra column for each Map or List records its size.
  => one column per node in the schema.
• Array<int> is two columns: array size and content.
  => an extra column per nesting level.
Slide 14: Iteration on fully assembled records
• To integrate with existing row-based engines (Hive, Pig, M/R).
• Aware of dictionary encoding: enables optimizations.
• Assembles the projection for any subset of the columns: only those are loaded from disk.
(Diagrams: iterating record by record over columns a and b; the Document record assembled for different projections of DocId, Links.Backward and Links.Forward)

Speaker notes:
- Explain in the context of the example.
- Columns: for implementing column-based engines.
- Record based: for integrating with existing row-based engines (Hive, Pig, M/R).
Slide 15: Iteration on columns
• To implement a column-based execution engine.
• Iteration on triplets: repetition level, definition level, value (D < 1 => NULL; R = 1 => same row). See the sketch after this slide.
• Repetition level = 0 indicates a new record.
• Encoded or decoded values: computing aggregations on integers is faster than on strings.
(Table: an example column as a stream of (R, D, V) triplets, one or more triplets per row)
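A minimal sketch of what a column-based consumer can do with such a triplet stream, using hypothetical arrays rather than the parquet-mr reader API, and reusing the Links.Backward column shape from slide 11 (max repetition level 1, max definition level 2): sum the column and count records without ever assembling rows.

    public class TripletScanDemo {
        public static void main(String[] args) {
            // Hypothetical triplet stream for a column shaped like Links.Backward.
            int maxDef = 2;
            int[] r = {0, 1, 0, 0, 1};       // repetition level: 0 starts a new record
            int[] d = {2, 2, 0, 2, 2};       // definition level: d < maxDef means NULL at some level
            long[] v = {10, 30, 0, 5, 7};    // value slots are meaningless where d < maxDef

            long records = 0, sum = 0;
            for (int i = 0; i < r.length; i++) {
                if (r[i] == 0) records++;         // new record
                if (d[i] == maxDef) sum += v[i];  // defined value: include in the aggregate
            }
            System.out.println("records=" + records + ", sum=" + sum);  // records=3, sum=52
        }
    }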
Slide 16: APIs
• Schema definition and record materialization:
  • Hadoop does not have a notion of schema; however Impala, Pig, Hive, Thrift, Avro and Protocol Buffers do.
  • Event-based, SAX-style record materialization layer; no double conversion (see the sketch after this slide).
• Integration with existing type systems and processing frameworks:
  • Impala
  • Pig
  • Thrift and Scrooge for M/R, Cascading and Scalding
  • Cascading tuples
  • Avro
  • Hive

Speaker notes:
- They all define a schema in a different way.
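To illustrate what an event-based, SAX-style materialization layer looks like, here is a purely illustrative Java sketch. The interface and method names are hypothetical and are not the parquet-mr API; the point is that the reader pushes start/end and value events, and each framework binding builds its own objects (Pig tuples, Thrift structs, Avro records, ...) directly, with no intermediate generic record.

    // Illustrative only: a SAX-style callback a framework binding could implement.
    interface RecordEventHandler {
        void startRecord();                         // beginning of a record
        void startGroup(String field);              // enter a nested group (e.g. Links)
        void endGroup();                            // leave the nested group
        void addLong(String field, long value);     // primitive leaf value event
        void addString(String field, String value); // primitive leaf value event
        void endRecord();                           // record complete: hand it to the engine
    }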
Slide 17: Encodings
• Bit packing (see the sketch after this slide):

    values: 1 3 2 0 0 2 2 0
    packed: 01|11|10|00 00|10|10|00

  • Small integers encoded in the minimum number of bits required.
  • Useful for repetition levels, definition levels and dictionary keys.
• Run-length encoding:

    values:  1 1 1 1 1 1 1 1
    encoded: value 1, run length 8

  • Cheap compression.
  • Works well for the definition levels of sparse columns.
• Dictionary encoding:
  • Useful for columns with few (< 50,000) distinct values.
  • Used in combination with bit packing.
• Extensible:
  • Defining new encodings is supported by the format.
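A minimal sketch of the 2-bit packing shown above, in plain Java rather than the Parquet encoder: each value fits in 2 bits, so four values are packed per byte, in the bit order the slide shows (first value in the most significant slot).

    public class BitPackingDemo {
        // Pack values that each fit in 2 bits, four per byte, most significant slot first.
        static byte[] pack2(int[] values) {
            byte[] out = new byte[(values.length + 3) / 4];
            for (int i = 0; i < values.length; i++) {
                out[i / 4] |= (values[i] & 0x3) << (6 - 2 * (i % 4));
            }
            return out;
        }

        public static void main(String[] args) {
            byte[] packed = pack2(new int[]{1, 3, 2, 0, 0, 2, 2, 0});
            for (byte b : packed) {
                String bits = Integer.toBinaryString(b & 0xFF);
                System.out.println("00000000".substring(bits.length()) + bits);
            }
            // 01111000   i.e. 01|11|10|00
            // 00101000   i.e. 00|10|10|00
        }
    }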
Slide 18: Contributors
Main contributors:
• Julien Le Dem (Twitter): Format, Core, Pig, Thrift integration, Encodings
• Nong Li, Marcel Kornacker, Todd Lipcon (Cloudera): Format, Impala
• Jonathan Coveney, Alex Levenson, Aniket Mokashi, Tianshuo Deng (Twitter): Encodings, projection push down
• Mickaël Lacour, Rémy Pecqueur (Criteo): Hive integration
• Dmitriy Ryaboy (Twitter): Format, Thrift and Scrooge Cascading integration
• Tom White (Cloudera): Avro integration
• Avi Bryant, Colin Marc (Stripe): Cascading tuples integration
• Matt Massie (Berkeley AMP Lab): predicate and projection push down
• David Chen (LinkedIn): Avro integration improvements
Slide 19: Parquet 2.0
• More encodings: more compact storage without heavyweight compression
  • Delta encodings: for integers, strings and dictionary (see the sketch after this slide)
  • Improved encoding for strings and boolean
• Statistics: to be used by query planners and ...
• New page format: faster predicate push down, ...
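As a rough illustration of the delta idea for integers, here is a hedged sketch in plain Java; it is not the actual Parquet 2.0 delta layout, only the underlying intuition: store a first value plus the differences between consecutive values, which are typically small and therefore bit-pack well.

    import java.util.Arrays;

    public class DeltaEncodingDemo {
        // Encode: first value followed by consecutive differences.
        static long[] encode(long[] values) {
            long[] out = new long[values.length];
            out[0] = values[0];
            for (int i = 1; i < values.length; i++) out[i] = values[i] - values[i - 1];
            return out;
        }

        // Decode: a running sum restores the original values.
        static long[] decode(long[] deltas) {
            long[] out = new long[deltas.length];
            out[0] = deltas[0];
            for (int i = 1; i < deltas.length; i++) out[i] = out[i - 1] + deltas[i];
            return out;
        }

        public static void main(String[] args) {
            long[] timestamps = {1367507632000L, 1367507632123L, 1367507632150L, 1367507633001L};
            long[] deltas = encode(timestamps);
            System.out.println(Arrays.toString(deltas));                   // [1367507632000, 123, 27, 851]
            System.out.println(Arrays.equals(timestamps, decode(deltas))); // true
        }
    }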
Slide 20: How to contribute
Questions? Ideas?
Contribute at: github.com/Parquet
Look for tickets marked as "pick me up!"
http://parquet.io