Parquet overview
Parquet overview given to the Apache Drill meetup


© All Rights Reserved

Presentation Transcript

    • Parquet overview
      Julien Le Dem, Twitter
      http://parquet.github.com
    • Format
      • Schema definition: for the binary representation.
      • Layout: currently PAX; supports one file per column when Hadoop allows a block placement policy.
      • Not Java-centric: encodings, compression codecs, etc. are enums, not Java class names; i.e. they are formally defined. Impala reads Parquet files.
      • Footer: contains the column chunk offsets.
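The footer location follows from the Parquet file layout: the file ends with the serialized file metadata, a 4-byte little-endian length of that metadata, and the 4-byte magic "PAR1". A minimal sketch of locating the footer by reading the last 8 bytes (the function name is illustrative, not a Parquet API):

```python
import struct

def footer_metadata_length(path):
    """Locate the footer of a Parquet file.

    A Parquet file ends with:
      [serialized file metadata][4-byte LE metadata length]["PAR1" magic]
    so the metadata length is recoverable from the last 8 bytes.
    """
    with open(path, "rb") as f:
        f.seek(-8, 2)          # 8 bytes before end of file
        tail = f.read(8)
    if tail[4:] != b"PAR1":
        raise ValueError("not a Parquet file: missing trailing magic")
    return struct.unpack("<I", tail[:4])[0]
```

Knowing the metadata length, a reader can then seek backwards once more and deserialize the footer before touching any column data.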
    • Format
      • Row group: a group of rows in columnar format.
        • Max size buffered in memory while writing.
        • One (or more) per split while reading.
        • Roughly: 10 MB < row group < 1 GB.
      • Column chunk: the data for one column in a row group.
        • Column chunks can be read independently for efficient scans.
      • Page: unit of compression in a column chunk.
        • Should be big enough for compression to be efficient.
        • Minimum size to read to access a single record (when index pages are available).
        • Roughly: 8 KB < page < 100 KB.
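The writer-side buffering described above can be sketched as a toy splitter: rows accumulate in memory and are flushed as one row group once the buffered size crosses a threshold. Here `max_bytes` and `sizeof` are hypothetical knobs for illustration, not Parquet configuration names:

```python
def split_into_row_groups(rows, sizeof, max_bytes=128 * 1024 * 1024):
    """Toy sketch of row-group buffering: accumulate rows, flush a
    group once the estimated buffered size reaches `max_bytes`.
    `sizeof` estimates the in-memory size of one row."""
    groups, buffered, size = [], [], 0
    for row in rows:
        buffered.append(row)
        size += sizeof(row)
        if size >= max_bytes:
            groups.append(buffered)
            buffered, size = [], 0
    if buffered:
        groups.append(buffered)
    return groups

# e.g. ten 1-byte rows with a 4-byte threshold split as 4 / 4 / 2
groups = split_into_row_groups(list(range(10)), sizeof=lambda r: 1, max_bytes=4)
```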
    • Dremel’s shredding/assembly
      Schema:
        message Document {
          required int64 DocId;
          optional group Links {
            repeated int64 Backward;
            repeated int64 Forward;
          }
          repeated group Name {
            repeated group Language {
              required string Code;
              optional string Country;
            }
            optional string Url;
          }
        }
      Columns: DocId, Links.Backward, Links.Forward, Name.Language.Code, Name.Language.Country, Name.Url
      Reference: http://research.google.com/pubs/pub36632.html
      • Each cell is encoded as a triplet: repetition level, definition level, value.
      • This allows reconstructing the nested records.
      • Level values are bound by the depth of the schema: they are stored in a compact form.
      Example:
                                Max repetition level   Max definition level
        DocId                   0                      0
        Links.Backward          1                      2
        Links.Forward           1                      2
        Name.Language.Code      2                      2
        Name.Language.Country   2                      3
        Name.Url                1                      2
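The max levels in the table above follow mechanically from the schema: walking from the root to a column, every `repeated` field adds 1 to both the repetition and definition level, and every `optional` field adds 1 to the definition level only. A minimal sketch of that rule (not the Parquet implementation):

```python
def max_levels(path):
    """`path` is a list of (name, repetition) pairs from root to leaf.
    repeated -> +1 repetition level, +1 definition level
    optional -> +1 definition level
    required -> no change"""
    rep = sum(1 for _, r in path if r == "repeated")
    defn = sum(1 for _, r in path if r in ("repeated", "optional"))
    return rep, defn

# Column paths for the Document schema above.
DOCUMENT = {
    "DocId": [("DocId", "required")],
    "Links.Backward": [("Links", "optional"), ("Backward", "repeated")],
    "Links.Forward": [("Links", "optional"), ("Forward", "repeated")],
    "Name.Language.Code": [("Name", "repeated"), ("Language", "repeated"),
                           ("Code", "required")],
    "Name.Language.Country": [("Name", "repeated"), ("Language", "repeated"),
                              ("Country", "optional")],
    "Name.Url": [("Name", "repeated"), ("Url", "optional")],
}
```

Running `max_levels` over each path reproduces the table, e.g. Name.Language.Country yields (2, 3) because two repeated ancestors bound the repetition level and the optional leaf adds one more definition level.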
    • Abstractions
      • Column layer:
        • Iteration on triplets: repetition level, definition level, value.
        • Repetition level = 0 indicates a new record.
        • When dictionary encoding and other compact encodings are implemented, can iterate over encoded or un-encoded values.
      • Record layer:
        • Iteration on fully assembled records.
        • Provides assembled records for any subset of the columns, so that only the columns actually accessed are loaded.
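The "repetition level = 0 starts a new record" rule above can be sketched over a single column's triplet stream. The sample triplets follow the Name.Language.Code example from the Dremel paper (record r1 contributes four entries, record r2 one):

```python
def split_records(triplets):
    """Group one column's (repetition, definition, value) triplets
    into records: repetition level 0 marks a record boundary."""
    records, current = [], []
    for rep, defn, value in triplets:
        if rep == 0 and current:   # new record starts here
            records.append(current)
            current = []
        current.append((rep, defn, value))
    if current:
        records.append(current)
    return records

# Name.Language.Code stream for two records (None = value undefined
# at that definition level).
stream = [(0, 2, "en-us"), (2, 2, "en"), (1, 1, None),
          (1, 2, "en-gb"), (0, 1, None)]
```

Because each column carries its own record boundaries this way, any subset of columns can be scanned and re-aligned into records without reading the others.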
    • Extensibility
      • Schema conversion:
        • Hadoop does not have a notion of schema.
        • However Pig, Hive, Thrift, Avro, ProtoBufs, etc. do.
      • Record materialization:
        • Pluggable record materialization layer.
        • No double conversion.
        • SAX-style event-based API.
      • Encodings:
        • Extensible encoding definitions.
        • Planned: dictionary encoding, zigzag, RLE, ...
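The SAX-style materialization idea above can be sketched as a pluggable consumer: the reader pushes start/field/value events, and each binding (Pig, Hive, Thrift, ...) supplies a consumer that builds records directly in its own model, avoiding a double conversion through an intermediate representation. All class and method names here are illustrative, not Parquet's actual API:

```python
class RecordConsumer:
    """Hypothetical SAX-style event sink for record materialization."""
    def start_record(self): pass
    def end_record(self): pass
    def start_field(self, name): pass
    def end_field(self, name): pass
    def add_value(self, value): pass

class DictConsumer(RecordConsumer):
    """One possible binding: materialize each record as a plain dict."""
    def __init__(self):
        self.records = []
        self._current, self._field = None, None
    def start_record(self):
        self._current = {}
    def end_record(self):
        self.records.append(self._current)
    def start_field(self, name):
        self._field = name
    def add_value(self, value):
        self._current[self._field] = value

# The reader side would drive the consumer with events:
consumer = DictConsumer()
consumer.start_record()
consumer.start_field("DocId")
consumer.add_value(10)
consumer.end_field("DocId")
consumer.end_record()
```

Swapping in a different consumer subclass changes the output model without touching the reader.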