• Like
  • Save

Efficient Data Storage for Analytics with Apache Parquet 2.0

  • 5,039 views
Uploaded on

 

More in: Software , Technology
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
No Downloads

Views

Total Views
5,039
On Slideshare
0
From Embeds
0
Number of Embeds
8

Actions

Shares
Downloads
0
Comments
0
Likes
27

Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide

Transcript

  • 1. Efficient Data Storage for Analytics with Apache Parquet 2.0 Julien Le Dem @J_ Processing tools tech lead, Data Platform at Twitter Nong Li nong@cloudera.com Software engineer, Cloudera Impala @ApacheParquet
  • 2. Outline 2 - Why we need efficiency - Properties of efficient algorithms - Enabling efficiency - Efficiency in Apache Parquet
  • 3. Why we need efficiency
  • 4. Producing a lot of data is easy 4 Producing a lot of derived data is even easier. Solution: Compress all the things!
  • 5. Scanning a lot of data is easy 5 1% completed … but not necessarily fast. Waiting is not productive. We want faster turnaround. Compression but not at the cost of reading speed.
  • 6. Trying new tools is easy 6 ETL Storage ad-hoc queries log collection automated dashboard machine learning graph processing external datasources and schema definition ... ... We need a storage format interoperable with all the tools we use and keep our options open for the next big thing.
  • 7. Enter Apache Parquet
  • 8. Parquet design goals 8 - Interoperability - Space efficiency - Query efficiency
  • 9. Parquet timeline 9 - Fall 2012: Twitter & Cloudera merge efforts to develop columnar formats - March 2013: OSS announcement; Criteo signs on for Hive integration - July 2013: 1.0 release. 18 contributors from more than 5 organizations. - May 2014: Apache Incubator. 40+ contributors, 18 with 1000+ LOC. 26 incremental releases. - Parquet 2.0 coming as Apache release
  • 10. Interoperability
  • 11. Interoperable 11 Model agnostic Language agnostic Java C++ Avro Thrift Protocol Buffer Pig Tuple Hive SerDe Assembly/striping Parquet file format Object model parquet-avroConverters parquet-thrift parquet-proto parquet-pig parquet-hive Column encoding Impala ... ... Encoding Query execution
  • 12. Frameworks and libraries integrated with Parquet 12 Query engines: Hive, Impala, HAWQ, IBM Big SQL, Drill, Tajo, Pig, Presto ! Frameworks: Spark, MapReduce, Cascading, Crunch, Scalding, Kite ! Data Models: Avro, Thrift, ProtocolBuffers, POJOs
  • 13. Enabling efficiency
  • 14. Columnar storage 14 Logical table representation Row layout Column layout encoding Nested schema a b c a b c a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5 a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5 a1 b1 c1a2 b2 c2a3 b3 c3a4 b4 c4a5 b5 c5 encoded chunk encoded chunk encoded chunk
  • 15. Parquet nested representation 15 Document DocId Links Name Backward Forward Language Url Code Country Columns: docid links.backward links.forward name.language.code name.language.country name.url Schema: Borrowed from the Google Dremel paper https://blog.twitter.com/2013/dremel-made-simple-with-parquet
  • 16. Statistics for filter and query optimization 16 Vertical partitioning (projection push down) Horizontal partitioning (predicate push down) Read only the data you need! + = a b c a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5 a b c a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5 a b c a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5 + =
  • 17. Properties of efficient algorithms
  • 18. CPU Pipeline 18 pipe 1 a b c d 2 a b c d 3 a b c d 4 a b c 1 2 3 4 5 6 1 a b c d 2 a b c 3 a b 4 a clock1 2 3 4 5 6 7 d 7 8 b b b b c d c d c d c d 9 10 clock pipe pipeline time 8 9 10 d c b a Mis-prediction (“Bubble”) Ideal case
  • 19. Optimize for the processor pipeline 19 ifs “Bubbles” can be caused by: loops virtual calls data dependency cost ~ 12 cycles
  • 20. Minimize CPU cache misses 20 a cache miss costs 10 to 100s cycles depending on the level RAM Bus CPU Cache
  • 21. Encodings in Apache Parquet 2.0
  • 22. The right encoding for the right job 22 - Delta encodings: for sorted datasets or signals where the variation is less important than the absolute value. (timestamp, auto-generated ids, metrics, …) Focuses on avoiding branching. ! - Prefix coding (delta encoding for strings) When dictionary encoding does not work. ! - Dictionary encoding: small (60K) set of values (server IP, experiment id, …) ! - Run Length Encoding: repetitive data.
  • 23. Delta encoding 23 8 * 64bits values = 64 bytes 8 * 64bits values = 64 bytes 101 100101 105102 107101 11499 116101 102 101 119 120 121 values: deltas 1 10 51 2-1 7-2 20 1 -1 3 1 1100 100 101 100101 105102 107101 11499 116101 102 101 119 120 121100 reference block 1 block 2
  • 24. Delta encoding 24 3 02 43 11 60 12 3 1 2 0 0100 -2 min delta 1 10 51 2-1 7-2 20 1 -1 3 1 1100 1 min delta make deltas > 0 by subtracting min
  • 25. 3 02 43 11 60 12 3 1 2 0 0 maxbits = 2 11 10 11 01 0010 11 01 1110110110110100 maxbits = 3 8 * 2 bits = 2 bytes 000 100 001 110 001 010 000 000 000100001110001010000000 8 * 3 bits = 3 bytes 2 3 bits bits 100 -2 100 -2 1 1 min delta min delta reference packing packing 1110110110110100 0001000011100010100000002 3100 -2 1result: min delta min delta Delta encoding 25
  • 26. Delta encoding 26 3 02 43 11 60 12 3 1 2 0 0 maxbits = 2 11 10 11 01 0010 11 01 1110110110110100 8 * 64bits values = 64 bytes 8 * 64bits values = 64 bytes maxbits = 3 8 * 2 bits = 2 bytes 000 100 001 110 001 010 000 000 000100001110001010000000 8 * 3 bits = 3 bytes 2 3 bits bits 101 100101 105102 107101 11499 116101 102 101 119 120 121 100 values: -2 min delta 100 -2 deltas 1 10 51 2-1 7-2 20 1 -1 3 1 1100 1 min delta make deltas > 0 by subtracting min 1 min delta min delta 100 101 100101 105102 107101 11499 116101 102 101 119 120 121100 reference block 1 block 2 reference packing packing 1110110110110100 0001000011100010100000002 3100 -2 1result:
  • 27. Binary packing designed for CPU efficiency 27 better: orvalues = 0! for (int i = 0; i<values.length; ++i) {! orvalues |= values[i]! }! max = maxbit(orvalues)! see paper: “Decoding billions of integers per second through vectorization” by Daniel Lemire and Leonid Boytsov Unpredictable branch! Loop => Very predictable branch naive maxbit: max = 0! for (int i = 0; i<values.length; ++i) {! current = maxbit(values[i])! if (current > max) max = current! }! even better: orvalues = 0! orvalues |= values[0]! …! orvalues |= values[32]! max = maxbit(orvalues) no branching at all!
  • 28. Binary unpacking designed for CPU efficiency 28 ! int j = 0! while (int i = 0; i < output.length; i += 32) {! maxbit = input[j]! unpack_32_values(values, i, out, j + 1, maxbit);! j += 1 + maxbit! }!
  • 29. Compression comparison 29 TPCH: compression of two 64 bits id columns with delta encoding Primary key 0% 20% 40% 60% 80% 100% plain delta no compression + snappy
  • 30. Compression comparison 30 TPCH: compression of two 64 bits id columns with delta encoding Primary key 0% 20% 40% 60% 80% 100% plain delta no compression + snappy Foreign key 0% 20% 40% 60% 80% 100% plain delta
  • 31. Decoding time vs Compression 31 decodingspeed:! Million/second 0 350 700 1050 1400 Compression (percent saved) 0% 25% 50% 75% 100% Delta Plain + Snappy Plain
  • 32. Performance
  • 33. Size comparison 33 TPCDS 100GB scale factor (+ Snappy unless otherwise specified) Store salesLineitem Text uncompressed Seq Avro Text + LZO RC Parquet 1 Parquet 2 The area of the circle is proportional to the file size Text uncompressed Seq RC Avro Parquet 1 Parquet 2
  • 34. Impala query performance 34 Seconds 0 75 150 225 300 Interactive Reporting Deep Analytics Text Seq RC Parquet 1.0 Parquet 2.0 10 machines: 8 cores 48 GB of RAM 12 Disks OS buffer cache flushed between every query TPCDS geometric mean per query category
  • 35. Roadmap 2.x
  • 36. Roadmap 2.x 36 C++ library: implementation of encodings ! Predicate push down: use statistics to implement filters at the metadata level ! Decimal, Timestamp logical types
  • 37. Community
  • 38. Thank you to our contributors 38 Open Source announcement 1.0 release
  • 39. Get involved 39 Mailing lists: - dev@parquet.incubator.apache.org ! Parquet sync ups: - Regular meetings on google hangout
  • 40. Questions 40 Questions.foreach( answer(_) ) @ApacheParquet