Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Apache parquet - Apache big data North America 2017

447 views

Published on

Apache Parquet brings the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem. Apache Parquet is built from the ground up with complex nested data structures in mind, and uses the record shredding and assembly algorithm described in the Dremel paper. We believe this approach is superior to simple flattening of nested name spaces. Apache Parquet is built to support very efficient compression and encoding schemes. Multiple projects have demonstrated the performance impact of applying the right compression and encoding scheme to the data. Apache Parquet allows compression schemes to be specified on a per-column level and is future-proofed to allow adding more encodings as they are invented and implemented. This talk highlights the internal implementation of Apache Parquet.

Published in: Technology
  • Hello! I can recommend a site that has helped me. It's called ⇒ www.HelpWriting.net ⇐ So make sure to check it out!
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • Ich kann eine Website empfehlen. Er hat mir wirklich geholfen. ⇒ www.WritersHilfe.com ⇐ Zufrieden und beeindruckt.
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

Apache parquet - Apache big data North America 2017

  1. 1. EFFI C I EN T C OL U M N AR STORAG E W I TH APACH E PARQU ET Apache: Big Data North America 2017 Ranganathan Balashanmugam, Aconex
  2. 2. “The Tables Have Turned.” Apache: Big Data North America 2017
  3. 3. “The Tables Have Turned.” - 90° Apache: Big Data North America 2017
  4. 4. Apache: Big Data North America 2017 ANALY TI CAL OV ER TRAN SAC TI ON AL Context
  5. 5. Apache: Big Data North America 2017 SIM PL E S TRUCTURE A B C a1 b1 c1 a2 b2 c2 a3 b3 c3 a1 b1 c1 a2 b2 c2 a3 b3 c3 row a1 a2 a3 b1 b2 b3 c1 c2 c3 columnar
  6. 6. Apache: Big Data North America 2017 “Optimizing the disk seeks.”
  7. 7. “The Tables Have Turned.” - 90° Apache: Big Data North America 2017 Efficient writes Efficient reads
  8. 8. Apache: Big Data North America 2017 SIM PL E S TRUCTURE A B C a1 b1 c1 a2 b2 c2 a3 b3 c3 a1 b1 c1 a2 b2 c2 a3 b3 c3 row a1 a2 a3 b1 b2 b3 c1 c2 c3 columnar
  9. 9. message Document { required int64 DocId; optional group Links { repeated int64 Backward; repeated int64 Forward; } repeated group Name { repeated group Language { required string Code; optional string Country; } optional string Url; } } Apache: Big Data North America 2017 N ES TED A N D R EPEATED S TR U C TU R ES How to preserve in column store? DocId: 10 Links: Forward: 20 Forward: 40 Forward: 60 Name: Language: Code: ‘en-us’ Country: ‘us’ Language: Code: ‘en’ Url: ‘http://a' Name: Url: ‘http://b' Name: Language: Code: ‘en-gb’ Country: ‘gb’
  10. 10. Apache: Big Data North America 2017 Colum nar
  11. 11. Apache: Big Data North America 2017 “Get these performance benefits for nested structures into Hadoop ecosystem.” problem statem ent
  12. 12. Apache: Big Data North America 2017 M OTI VATI ON * Allow complex nested data structures * Very efficient compression and encoding schemes * Support many frameworks
  13. 13. Apache: Big Data North America 2017 “Columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.”
  14. 14. Apache: Big Data North America 2017 D ES I G N G OAL S * Interoperability * Space efficiency * Query efficiency
  15. 15. Apache: Big Data North America 2017 parquet avro thrift protobuf pig hive …. avro thrift protobuf pig hive …. object model converters storage format object models parquet binary file src: https://techmagie.wordpress.com/2016/07/15/data-storage-and-modelling-in-hadoop/ column readers scalding
  16. 16. Apache: Big Data North America 2017 Page 1 Page 2 … Page n parquet file header - Magic number (4 bytes) : “PAR1” row group 0 column a column b column c row group 1 …row group n footer Page 1 Page 2 … Page n Page 1 Page 2 … Page n src: https://github.com/Parquet/parquet-format * good enough for compression efficiency * 8KB - 1 MB * good enough to read * group of rows * max size buffered while writing * 50 MB - 1GB * data of one column in row group * can be read independently
  17. 17. Apache: Big Data North America 2017 footer src: https://github.com/Parquet/parquet-format file metadata (ThriftCompactProtocol) - version - schema row group 0 metadata footer length (4 bytes) column 0 - total byte size - total rows * type/path/encodings/codec * num of values * compressed/uncompressed size * data page offset * index page offset column 1 * … Magic number (4 bytes): “PAR1” row group … column 0 … column 1 * … * …
  18. 18. Apache: Big Data North America 2017 page src: https://techmagie.wordpress.com/2016/07/15/data-storage-and-modelling-in-hadoop/ repetition levels definition levels values page header (ThriftCompactProtocol) encoded and com pressed
  19. 19. Apache: Big Data North America 2017 message Document { required int64 DocId; optional group Links { repeated int64 Backward; repeated int64 Forward; } repeated group Name { repeated group Language { required string Code; optional string Country; } optional string Url; } } Schema
  20. 20. Apache: Big Data North America 2017 Document 1 DocId: 10 Links: Forward: 20 Forward: 40 Forward: 60 Name: Language: Code: ‘en-us’ Country: ‘us’ Language: Code: ‘en’ Url: ‘http://a' Name: Url: ‘http://b' Name: Language: Code: ‘en-gb’ Country: ‘gb’ Document 2 DocId: 20 Links: Backward: 10 Backward: 30 Forward: 80 Name: Url: ‘http://c' message Document { required int64 DocId; optional group Links { repeated int64 Backward; repeated int64 Forward; } repeated group Name { repeated group Language { required string Code; optional string Country; } optional string Url; } } Documents
  21. 21. Apache: Big Data North America 2017 “Can we represent it in columnar former efficiently and read them back to their original nested data structure?” problem statem ent
  22. 22. Apache: Big Data North America 2017 “Can we represent it in columnar former efficiently and read them back to their original nested data structure?” Dremel encoding
  23. 23. Apache: Big Data North America 2017 Document 1 DocId: 10 Links: Forward: 20 Forward: 40 Forward: 60 <Backward>: NULL Name: Language: Code: ‘en-us’ Country: ‘us’ Language: Code: ‘en’ <Country>: NULL Url: ‘http://a' Name: <Language>: <Code>: NULL <Country>: NULL Url: ‘http://b' Name: Language: Code: ‘en-gb’ Country: ‘gb’ <Url>: NULL Document 2 DocId: 20 Links: Backward: 10 Backward: 30 Forward: 80 Name: <Language>: <Code>: NULL <Country>: NULL Url: ‘http://c' message Document { required int64 DocId; optional group Links { repeated int64 Backward; repeated int64 Forward; } repeated group Name { repeated group Language { required string Code; optional string Country; } optional string Url; } } fill all the nulls
  24. 24. Apache: Big Data North America 2017 Document 1 DocId: 10 Links: Forward: 20 Forward: 40 Forward: 60 <Backward>: NULL Name: Language: Code: ‘en-us’ Country: ‘us’ Language: Code: ‘en’ <Country>: NULL Url: ‘http://a' Name: <Language>: <Code>: NULL <Country>: NULL Url: ‘http://b' Name: Language: Code: ‘en-gb’ Country: ‘gb’ <Url>: NULL Document 2 DocId: 20 Links: Backward: 10 Backward: 30 Forward: 80 Name: <Language>: <Code>: NULL <Country>: NULL Url: ‘http://c' message Document { required int64 DocId; optional group Links { repeated int64 Backward; repeated int64 Forward; } repeated group Name { repeated group Language { required string Code; optional string Country; } optional string Url; } } C O L U M N S R D VAL U E Links.Forward 0 2 20 Links.Forward[1] 1 2 40 Links.Forward[2] 1 2 60 Links.Forward 0 2 80 In the path to the field, what is the last repeated field ? In the path to the field, how many defined fields ?
  25. 25. Apache: Big Data North America 2017 Document 1 DocId: 10 Links: Forward: 20 Forward: 40 Forward: 60 <Backward>: NULL Name: Language: Code: ‘en-us’ Country: ‘us’ Language: Code: ‘en’ <Country>: NULL Url: ‘http://a' Name: <Language>: <Code>: NULL <Country>: NULL Url: ‘http://b' Name: Language: Code: ‘en-gb’ Country: ‘gb’ <Url>: NULL Document 2 DocId: 20 Links: Backward: 10 Backward: 30 Forward: 80 Name: <Language>: <Code>: NULL <Country>: NULL Url: ‘http://c' message Document { required int64 DocId; optional group Links { repeated int64 Backward; repeated int64 Forward; } repeated group Name { repeated group Language { required string Code; optional string Country; } optional string Url; } } In the path to the field, what is the last repeated field ? In the path to the field, how many defined fields ? C O L U M N S R D VAL U E Name.Language.Code en-us Name.Language[1].Code en Name[1].[Language].[Code] null Name[2].Language.Code en-gb Name.Language.Code null
  26. 26. Apache: Big Data North America 2017 Document 1 DocId: 10 Links: Forward: 20 Forward: 40 Forward: 60 <Backward>: NULL Name: Language: Code: ‘en-us’ Country: ‘us’ Language: Code: ‘en’ <Country>: NULL Url: ‘http://a' Name: <Language>: <Code>: NULL <Country>: NULL Url: ‘http://b' Name: Language: Code: ‘en-gb’ Country: ‘gb’ <Url>: NULL Document 2 DocId: 20 Links: Backward: 10 Backward: 30 Forward: 80 Name: <Language>: <Code>: NULL <Country>: NULL Url: ‘http://c' message Document { required int64 DocId; optional group Links { repeated int64 Backward; repeated int64 Forward; } repeated group Name { repeated group Language { required string Code; optional string Country; } optional string Url; } } In the path to the field, what is the last repeated field ? In the path to the field, how many defined fields ? C O L U M N S R D VAL U E Name.Language.Code 0 2 en-us Name.Language[1].Code en Name[1].[Language].[Code] null Name[2].Language.Code en-gb Name.Language.Code null
  27. 27. Apache: Big Data North America 2017 Document 1 DocId: 10 Links: Forward: 20 Forward: 40 Forward: 60 <Backward>: NULL Name: Language: Code: ‘en-us’ Country: ‘us’ Language: Code: ‘en’ <Country>: NULL Url: ‘http://a' Name: <Language>: <Code>: NULL <Country>: NULL Url: ‘http://b' Name: Language: Code: ‘en-gb’ Country: ‘gb’ <Url>: NULL Document 2 DocId: 20 Links: Backward: 10 Backward: 30 Forward: 80 Name: <Language>: <Code>: NULL <Country>: NULL Url: ‘http://c' message Document { required int64 DocId; optional group Links { repeated int64 Backward; repeated int64 Forward; } repeated group Name { repeated group Language { required string Code; optional string Country; } optional string Url; } } In the path to the field, what is the last repeated field ? In the path to the field, how many defined fields ? C O L U M N S R D VAL U E Name.Language.Code 0 2 en-us Name.Language[1].Code 2 2 en Name[1].[Language].[Code] null Name[2].Language.Code en-gb Name.Language.Code null
  28. 28. Apache: Big Data North America 2017 Document 1 DocId: 10 Links: Forward: 20 Forward: 40 Forward: 60 <Backward>: NULL Name: Language: Code: ‘en-us’ Country: ‘us’ Language: Code: ‘en’ <Country>: NULL Url: ‘http://a' Name: <Language>: <Code>: NULL <Country>: NULL Url: ‘http://b' Name: Language: Code: ‘en-gb’ Country: ‘gb’ <Url>: NULL Document 2 DocId: 20 Links: Backward: 10 Backward: 30 Forward: 80 Name: <Language>: <Code>: NULL <Country>: NULL Url: ‘http://c' message Document { required int64 DocId; optional group Links { repeated int64 Backward; repeated int64 Forward; } repeated group Name { repeated group Language { required string Code; optional string Country; } optional string Url; } } In the path to the field, what is the last repeated field ? In the path to the field, how many defined fields ? C O L U M N S R D VAL U E Name.Language.Code 0 2 en-us Name.Language[1].Code 2 2 en Name[1].[Language].[Code] 1 1 null Name[2].Language.Code en-gb Name.Language.Code null
  29. 29. Apache: Big Data North America 2017 Document 1 DocId: 10 Links: Forward: 20 Forward: 40 Forward: 60 <Backward>: NULL Name: Language: Code: ‘en-us’ Country: ‘us’ Language: Code: ‘en’ <Country>: NULL Url: ‘http://a' Name: <Language>: <Code>: NULL <Country>: NULL Url: ‘http://b' Name: Language: Code: ‘en-gb’ Country: ‘gb’ <Url>: NULL Document 2 DocId: 20 Links: Backward: 10 Backward: 30 Forward: 80 Name: <Language>: <Code>: NULL <Country>: NULL Url: ‘http://c' message Document { required int64 DocId; optional group Links { repeated int64 Backward; repeated int64 Forward; } repeated group Name { repeated group Language { required string Code; optional string Country; } optional string Url; } } In the path to the field, what is the last repeated field ? In the path to the field, how many defined fields ? C O L U M N S R D VAL U E Name.Language.Code 0 2 en-us Name.Language[1].Code 2 2 en Name[1].[Language].[Code] 1 1 null Name[2].Language.Code 1 2 en-gb Name.Language.Code 0 1 null
  30. 30. Apache: Big Data North America 2017 Document 1 DocId: 10 Links: Forward: 20 Forward: 40 Forward: 60 <Backward>: NULL Name: Language: Code: ‘en-us’ Country: ‘us’ Language: Code: ‘en’ <Country>: NULL Url: ‘http://a' Name: <Language>: <Code>: NULL <Country>: NULL Url: ‘http://b' Name: Language: Code: ‘en-gb’ Country: ‘gb’ <Url>: NULL message Document { required int64 DocId; optional group Links { repeated int64 Backward; repeated int64 Forward; } repeated group Name { repeated group Language { required string Code; optional string Country; } optional string Url; } } C O L U M N S R D VAL U E DocId 0 0 10 Links.Forward 0 2 20 Links.Forward[1] 1 2 40 Links.Forward[2] 1 2 60 Links.[Backward] 0 1 null Name.Language.Code 0 2 en-us Name.Language.Country 0 3 us Name.Language[1].Code 2 2 en Name.Language[1].[Country] 2 2 null Name.Url 0 2 http://a Name[1].[Language].[Code] 1 1 null Name[1].[Language].[Country] 1 1 null Name[1].Url 1 2 http://b Name[2].Language.Code 1 2 en-gb Name[2].Language.Country 1 3 gb Name[2].[Url] 1 1 null
  31. 31. Apache: Big Data North America 2017 A B C a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5 A B C a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5 A B C a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5 OPTI M I Z ATI ONS projection push down predicate push down read required data fetch required columns filter records while reading
  32. 32. Apache: Big Data North America 2017 EN C OD I N G * Plain * Dictionary Encoding * Run Length Encoding/ Bit-Packing Hybrid * Delta Encoding * Delta-length byte array * Delta Strings
  33. 33. Apache: Big Data North America 2017 C O M PR ES S I O N TR A D E O F F * Storage * IO * Bandwidth * CPU time
  34. 34. Thank you Apache: Big Data North America 2017 Ran.ga.na.than B @ran_than

×