Transcript of "Efficient Data Storage for Analytics with Apache Parquet 2.0"

  1. Efficient Data Storage for Analytics with Apache Parquet 2.0
     Julien Le Dem @J_, Processing tools tech lead, Data Platform at Twitter
     Nong Li, nong@cloudera.com, Software engineer, Cloudera Impala
     @ApacheParquet
  2. Outline
     - Why we need efficiency
     - Properties of efficient algorithms
     - Enabling efficiency
     - Efficiency in Apache Parquet
  3. Why we need efficiency
  4. Producing a lot of data is easy
     Producing a lot of derived data is even easier.
     Solution: Compress all the things!
  5. Scanning a lot of data is easy
     1% completed ... but not necessarily fast.
     Waiting is not productive. We want faster turnaround: compression, but not at the cost of reading speed.
  6. Trying new tools is easy
     [Diagram: storage at the center of ETL, ad-hoc queries, log collection, automated dashboards, machine learning, graph processing, external data sources and schema definition, ...]
     We need a storage format that is interoperable with all the tools we use and keeps our options open for the next big thing.
  7. Enter Apache Parquet
  8. Parquet design goals
     - Interoperability
     - Space efficiency
     - Query efficiency
  9. Parquet timeline
     - Fall 2012: Twitter & Cloudera merge efforts to develop columnar formats
     - March 2013: OSS announcement; Criteo signs on for Hive integration
     - July 2013: 1.0 release. 18 contributors from more than 5 organizations.
     - May 2014: Apache Incubator. 40+ contributors, 18 with 1000+ LOC. 26 incremental releases.
     - Parquet 2.0 coming as an Apache release
  10. Interoperability
  11. Interoperable
     Model agnostic, language agnostic.
     [Diagram: object models (Avro, Thrift, Protocol Buffer, Pig Tuple, Hive SerDe, ...) plug in through converters (parquet-avro, parquet-thrift, parquet-proto, parquet-pig, parquet-hive) and the assembly/striping layer on top of the Parquet file format; the column encoding is shared by Java and C++ (Impala, ...) query execution.]
  12. Frameworks and libraries integrated with Parquet
     Query engines: Hive, Impala, HAWQ, IBM Big SQL, Drill, Tajo, Pig, Presto
     Frameworks: Spark, MapReduce, Cascading, Crunch, Scalding, Kite
     Data models: Avro, Thrift, Protocol Buffers, POJOs
  13. Enabling efficiency
  14. Columnar storage
     [Diagram: the same logical table (rows a1 b1 c1 through a5 b5 c5, nested schemas supported) shown three ways: the logical representation, the row layout (all values of a row stored contiguously), and the column layout (all values of a column stored contiguously, then encoded chunk by chunk).]
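To make the contrast concrete, here is a minimal sketch of the two physical layouts for the table above (illustrative Java, not Parquet code):

     // Row layout: the values of one row sit next to each other, so reading
     // a single column still touches every row's data.
     String[] rowLayout    = { "a1", "b1", "c1", "a2", "b2", "c2", "a3", "b3", "c3" };

     // Column layout: the values of one column sit next to each other, so a
     // scan of column b reads one contiguous region, and each column can be
     // encoded with a scheme suited to its own data.
     String[] columnLayout = { "a1", "a2", "a3", "b1", "b2", "b3", "c1", "c2", "c3" };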
  15. Parquet nested representation
     Schema (borrowed from the Google Dremel paper):
       Document
         DocId
         Links
           Backward
           Forward
         Name
           Language
             Code
             Country
           Url
     Columns: docid, links.backward, links.forward, name.language.code, name.language.country, name.url
     https://blog.twitter.com/2013/dremel-made-simple-with-parquet
  16. Statistics for filter and query optimization
     Vertical partitioning (projection push down) + horizontal partitioning (predicate push down) = read only the data you need!
     [Diagram: a table sliced by columns, by rows, and by both at once.]
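A sketch of how per-chunk statistics enable the predicate half of this (the RowGroupStats type and its method are hypothetical names for illustration, not the Parquet API): the reader skips any chunk whose min/max range cannot satisfy the filter, without reading or decoding it.

     // Hypothetical per-chunk min/max statistics for one numeric column.
     class RowGroupStats {
       final long min, max;
       RowGroupStats(long min, long max) { this.min = min; this.max = max; }

       // Predicate push down at the metadata level: a chunk can only contain
       // target if target falls inside its [min, max] range; otherwise the
       // whole chunk is skipped.
       boolean mightContain(long target) {
         return target >= min && target <= max;
       }
     }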
  17. Properties of efficient algorithms
  18. CPU pipeline
     [Diagram: four instructions moving through pipeline stages a-d, one stage per clock cycle. In the ideal case every stage is busy on every cycle; a mis-prediction creates a "bubble" that stalls the pipeline for several cycles.]
  19. Optimize for the processor pipeline
     "Bubbles" can be caused by: ifs, loops, virtual calls, data dependencies.
     Cost: ~12 cycles each.
  20. Minimize CPU cache misses
     A cache miss costs 10 to 100s of cycles depending on the level.
     [Diagram: CPU, cache, bus, RAM hierarchy.]
  21. Encodings in Apache Parquet 2.0
  22. The right encoding for the right job
     - Delta encodings: for sorted datasets or signals where the variation is less important than the absolute value (timestamps, auto-generated ids, metrics, ...). Focuses on avoiding branching.
     - Prefix coding (delta encoding for strings): when dictionary encoding does not work.
     - Dictionary encoding: small (60K) set of values (server IP, experiment id, ...).
     - Run length encoding: repetitive data.
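As a minimal illustration of the last item, run length encoding collapses repeats into (value, count) pairs. The sketch below is deliberately simplified; Parquet actually uses a hybrid of RLE and bit packing.

     import java.util.ArrayList;
     import java.util.List;

     class RleSketch {
       // Simplified run length encoding: repetitive data shrinks to
       // (value, run length) pairs.
       static List<long[]> runLengthEncode(long[] values) {
         List<long[]> runs = new ArrayList<>();
         int i = 0;
         while (i < values.length) {
           int start = i;
           while (i < values.length && values[i] == values[start]) i++;
           runs.add(new long[] { values[start], i - start }); // {value, count}
         }
         return runs;
       }
     }

For example, {7, 7, 7, 7, 0, 0, 0, 0, 0} encodes as (7, 4), (0, 5).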
  23. Delta encoding
     8 * 64-bit values = 64 bytes per block.
     values: 100 | 101, 101, 102, 101, 101, 102, 101, 99 | 100, 105, 107, 114, 116, 119, 120, 121
             (reference | block 1 | block 2)
     deltas: 1, 0, 1, -1, 0, 1, -1, -2 | 1, 5, 2, 7, 2, 3, 1, 1
  24. Delta encoding
     Make the deltas > 0 by subtracting each block's min delta:
     block 1: min delta -2 -> 3, 2, 3, 1, 2, 3, 1, 0
     block 2: min delta 1 -> 0, 4, 1, 6, 1, 2, 0, 0
  25. Delta encoding
     Bit-pack each block with just enough bits for its largest value:
     block 1: maxbits = 2 -> 11 10 11 01 10 11 01 00 (8 * 2 bits = 2 bytes): 1110110110110100
     block 2: maxbits = 3 -> 000 100 001 110 001 010 000 000 (8 * 3 bits = 3 bytes): 000100001110001010000000
     result: reference 100, then per block (min delta, bit width, packed bits): (-2, 2, 1110110110110100) and (1, 3, 000100001110001010000000)
  26. Delta encoding
     Recap: 16 * 64-bit plain values (2 * 64 bytes) shrink to a 64-bit reference (100) plus two blocks of (min delta, bit width, packed deltas): (-2, 2 bits, 2 bytes) and (1, 3 bits, 3 bytes).
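Putting the three steps together, a minimal per-block sketch (illustrative Java, not the parquet-mr implementation; the actual bit packing is elided):

     class DeltaSketch {
       // Delta-encode one block of longs: deltas against the previous value,
       // shifted by the block's min delta, then sized for bit packing.
       static void encodeBlock(long[] block, long reference) {
         long[] deltas = new long[block.length];
         long prev = reference;
         long min = Long.MAX_VALUE;
         for (int i = 0; i < block.length; i++) {   // step 1: compute deltas
           deltas[i] = block[i] - prev;
           prev = block[i];
           min = Math.min(min, deltas[i]);
         }
         long orvalues = 0;
         for (int i = 0; i < deltas.length; i++) {  // step 2: make deltas >= 0
           deltas[i] -= min;
           orvalues |= deltas[i];
         }
         // step 3: bit width of the largest adjusted delta
         // (branch-free, as on slide 27)
         int maxbits = 64 - Long.numberOfLeadingZeros(orvalues);
         // store (min delta, maxbits, deltas bit-packed to maxbits bits each)
         System.out.println("min delta " + min + ", maxbits " + maxbits);
       }
     }

With block 1 from the example (reference 100, values 101, 101, 102, 101, 101, 102, 101, 99) this yields min delta -2 and maxbits 2, matching the slide.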
  27. Binary packing designed for CPU efficiency

     naive maxbit (unpredictable branch):
       max = 0
       for (int i = 0; i < values.length; ++i) {
         current = maxbit(values[i])
         if (current > max) max = current
       }

     better (loop => very predictable branch):
       orvalues = 0
       for (int i = 0; i < values.length; ++i) {
         orvalues |= values[i]
       }
       max = maxbit(orvalues)

     even better (unrolled: no branching at all):
       orvalues = 0
       orvalues |= values[0]
       ...
       orvalues |= values[31]
       max = maxbit(orvalues)

     See the paper "Decoding billions of integers per second through vectorization" by Daniel Lemire and Leonid Boytsov.
  28. Binary unpacking designed for CPU efficiency

     int j = 0
     for (int i = 0; i < output.length; i += 32) {
       maxbit = input[j]
       unpack_32_values(input, j + 1, output, i, maxbit)
       j += 1 + maxbit
     }
  29. Compression comparison
     TPCH: compression of two 64-bit id columns with delta encoding.
     [Chart: percent saved (0% to 100%) on the primary key column, plain vs delta, with no compression and with Snappy.]
  30. Compression comparison
     TPCH: compression of two 64-bit id columns with delta encoding.
     [Chart: percent saved (0% to 100%), plain vs delta, with no compression and with Snappy, for both the primary key and the foreign key columns.]
  31. Decoding time vs compression
     [Chart: decoding speed (million values/second, 0 to 1400) against compression (percent saved, 0% to 100%) for Delta, Plain + Snappy, and Plain.]
  32. Performance
  33. Size comparison
     TPCDS 100GB scale factor (+ Snappy unless otherwise specified).
     [Chart: file sizes for the Store sales and Lineitem tables, circle area proportional to file size: Text uncompressed, Seq, Avro, Text + LZO, RC, Parquet 1, Parquet 2.]
  34. Impala query performance
     [Chart: seconds (0 to 300) per query category (Interactive, Reporting, Deep Analytics) for Text, Seq, RC, Parquet 1.0, Parquet 2.0. TPCDS geometric mean per query category.]
     10 machines: 8 cores, 48 GB of RAM, 12 disks each. OS buffer cache flushed between every query.
  35. Roadmap 2.x
  36. Roadmap 2.x
     - C++ library: implementation of encodings
     - Predicate push down: use statistics to implement filters at the metadata level
     - Decimal and Timestamp logical types
  37. Community
  38. Thank you to our contributors
     [Contributor lists at the open source announcement and at the 1.0 release.]
  39. Get involved
     Mailing lists:
     - dev@parquet.incubator.apache.org
     Parquet sync ups:
     - Regular meetings on Google Hangout
  40. Questions
     Questions.foreach( answer(_) )
     @ApacheParquet
