Efficient Data Storage for Analytics with Parquet 2.0

Efficient Data Storage for Analytics
with Apache Parquet 2.0
Julien Le Dem @J_
Processing tools tech lead, Data Platform
@ApacheParquet

InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide
synchronization on InfoQ.com!
http://www.infoq.com/presentations
/parquet

Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon San Francisco
www.qconsf.com

Outline
2
- Why we need efficiency
- Properties of efficient algorithms
- Enabling efficiency
- Efficiency in Apache Parquet

Producing a lot of data is easy
4
Producing a lot of derived data is even easier.
Solution: Compress all the things!

Scanning a lot of data is easy
5
1% completed
… but not necessarily fast.
Waiting is not productive. We want faster turnaround.
Compression but not at the cost of reading speed.

Trying new tools is easy
6
We need a storage format interoperable with all the tools we use
and keep our options open for the next big thing.

Parquet design goals
8
- Interoperability
- Space efficiency
- Query efficiency

Parquet timeline
9
- Fall 2012: Twitter & Cloudera merge efforts to develop columnar formats
- March 2013: OSS announcement; Criteo signs on for Hive integration
- July 2013: 1.0 release. 18 contributors from more than 5 organizations.
- May 2014: Apache Incubator. 40+ contributors, 18 with 1000+ LOC. 26 incremental releases.
- Parquet 2.0 coming as Apache release

Interoperable
11
Model agnostic
Language agnostic
Java C++

Frameworks and libraries integrated with Parquet
12
Query engines:
Hive, Impala, HAWQ,
IBM Big SQL, Drill, Tajo,
Pig, Presto
Frameworks:
Spark, MapReduce,
Cascading,
Crunch, Scalding, Kite
Data Models:
Avro, Thrift, ProtocolBuffers,
POJOs

Columnar storage
14
Logical table
representation
Row layout
Column layout
encoding
Nested schema

Parquet nested representation
15
Columns:
docid
links.backward
links.forward
name.language.code
name.language.country
name.url
Schema:
Borrowed from the Google Dremel paper
https://blog.twitter.com/2013/dremel-made-simple-with-parquet

Statistics for filter and query optimization
16
Vertical partitioning
(projection push down)
Horizontal partitioning
(predicate push down)
Read only the data
you need!
+ =

Properties of efficient algorithms

CPU Pipeline
18
Mis-prediction
(“Bubble”)
Ideal case

Optimize for the processor pipeline
19
ifs
“Bubbles” can be caused by:
loops
virtual
calls
data
dependency
cost ~ 12 cycles

Minimize CPU cache misses
20
a cache miss costs 10 to 100s cycles depending on the level
RAM
Bus
CPU Cache

Encodings in Apache Parquet 2.0

The right encoding for the right job
22
- Delta encodings:
for sorted datasets or signals where the variation is less important than the
absolute value. (timestamp, auto-generated ids, metrics, …) Focuses on avoiding
branching.
- Prefix coding (delta encoding for strings)
When dictionary encoding does not work.
- Dictionary encoding:
small (60K) set of values (server IP, experiment id, …)
- Run Length Encoding:
repetitive data.

Binary packing designed for CPU efficiency
27
better:
orvalues = 0
for (int i = 0; i<values.length; ++i) {
orvalues |= values[i]
}
max = maxbit(orvalues)
see paper:
“Decoding billions of integers per second through vectorization”
by Daniel Lemire and Leonid Boytsov
Unpredictable branch! Loop => Very predictable branch
naive maxbit:
max = 0
for (int i = 0; i<values.length; ++i) {
current = maxbit(values[i])
if (current > max) max = current
}
even better:
orvalues = 0
orvalues |= values[0]
…
orvalues |= values[32]
max = maxbit(orvalues)
no branching at all!

Binary unpacking designed for CPU efficiency
28
int j = 0
while (int i = 0; i < output.length; i += 32) {
maxbit = input[j]
unpack_32_values(values, i, out, j + 1, maxbit);
j += 1 + maxbit
}

Compression comparison
29
TPCH: compression of two 64 bits id columns with delta encoding
0%
20%
40%
60%
80%
100%
120%
plain delta
Primary key
no compression + snappy

Compression comparison
30
TPCH: compression of two 64 bits id columns with delta encoding
0%
20%
40%
60%
80%
100%
120%
plain delta
Primary key
no compression + snappy
0%
20%
40%
60%
80%
100%
120%
plain delta
Foreign key

Decoding time vs Compression
31
Plain
Plain +
Snappy
Delta
0
350
700
1050
1400
0% 25% 50% 75% 100%
decodingspeed:
Million/second
Compression (percent saved)

Size comparison
32
TPCDS 100GB scale factor (+ Snappy unless otherwise specified)
Store salesLineitem
The area of the circle is proportional to the file size

Parquet 2.x roadmap
33
- Even better predicate push down.
- Zero copy path through new HDFS APIs.
- Vectorized APIs
- C++ library: implementation of encodings
- Decimal, Timestamp logical types

Thank you to our contributors
35
Open Source announcement
1.0 release

Get involved
36
Mailing lists:
- dev@parquet.incubator.apache.org
Parquet sync ups:
- Regular meetings on google hangout

Questions
37
Questions.foreach( answer(_) )
@ApacheParquet

Watch the video with slide synchronization on
InfoQ.com!
http://www.infoq.com/presentations/parquet

Efficient Data Storage for Analytics with Parquet 2.0

Recommended

Recommended

More Related Content

More from C4Media

More from C4Media (20)

Recently uploaded

Recently uploaded (20)

Efficient Data Storage for Analytics with Parquet 2.0