Video and slides synchronized, mp3 and slide download available at http://bit.ly/1xrB1JY.
Julien Le Dem discusses the advantages of a columnar data layout, specifically the features and design choices Apache Parquet uses to achieve its goals of interoperability, space and query efficiency. Filmed at qconsf.com.
Julien Le Dem is the lead for Parquet's Java implementation. He also leads data processing tool development at Twitter and is on the Apache Pig PMC.
Efficient Data Storage for Analytics with Parquet 2.0
1. Efficient Data Storage for Analytics
with Apache Parquet 2.0
Julien Le Dem @J_
Processing tools tech lead, Data Platform
@ApacheParquet
2. InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/parquet
3. Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon San Francisco
www.qconsf.com
4. Outline
- Why we need efficiency
- Properties of efficient algorithms
- Enabling efficiency
- Efficiency in Apache Parquet
6. Producing a lot of data is easy
Producing a lot of derived data is even easier.
Solution: Compress all the things!
7. Scanning a lot of data is easy
1% completed
… but not necessarily fast.
Waiting is not productive. We want faster turnaround.
Compression, but not at the cost of reading speed.
8. Trying new tools is easy
We need a storage format interoperable with all the tools we use
and keep our options open for the next big thing.
11. Parquet timeline
- Fall 2012: Twitter & Cloudera merge efforts to develop columnar formats
- March 2013: OSS announcement; Criteo signs on for Hive integration
- July 2013: 1.0 release. 18 contributors from more than 5 organizations.
- May 2014: Apache Incubator. 40+ contributors, 18 with 1000+ LOC. 26 incremental releases.
- Parquet 2.0 coming as Apache release
18. Statistics for filter and query optimization
Vertical partitioning (projection push down)
+ Horizontal partitioning (predicate push down)
= Read only the data you need!
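The row-group statistics behind predicate push down can be sketched as follows. This is a minimal illustration, not the Parquet API: `RowGroupStats` and `mightContain` are hypothetical names standing in for the min/max metadata Parquet keeps per column chunk, which lets a reader skip whole row groups that cannot match a predicate such as `value == 42`.

```java
import java.util.List;

public class StatsSkipDemo {
    // Hypothetical stand-in for the min/max statistics of one row group's column.
    record RowGroupStats(long min, long max) {}

    // A row group can only contain `target` if min <= target <= max.
    static boolean mightContain(RowGroupStats s, long target) {
        return s.min <= target && target <= s.max;
    }

    public static void main(String[] args) {
        List<RowGroupStats> groups = List.of(
            new RowGroupStats(0, 10),
            new RowGroupStats(11, 50),
            new RowGroupStats(51, 99));
        long target = 42;
        for (int i = 0; i < groups.size(); i++) {
            // Only row group 1 can possibly contain 42; the others are skipped.
            System.out.println("row group " + i + ": "
                + (mightContain(groups.get(i), target) ? "read" : "skip"));
        }
    }
}
```

The same min/max check generalizes to range predicates; the point is that the decision uses only footer metadata, so skipped row groups are never read from disk.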
24. The right encoding for the right job
- Delta encodings:
for sorted datasets or signals where the variation matters less than the
absolute value (timestamps, auto-generated ids, metrics, …). Focuses on
avoiding branching.
- Prefix coding (delta encoding for strings):
when dictionary encoding does not work.
- Dictionary encoding:
small (60K) set of values (server IP, experiment id, …)
- Run Length Encoding:
repetitive data.
29. Binary packing designed for CPU efficiency
naive maxbit:
max = 0
for (int i = 0; i < values.length; ++i) {
  current = maxbit(values[i])
  if (current > max) max = current
}
Unpredictable branch!

better:
orvalues = 0
for (int i = 0; i < values.length; ++i) {
  orvalues |= values[i]
}
max = maxbit(orvalues)
Loop => very predictable branch

even better:
orvalues = 0
orvalues |= values[0]
…
orvalues |= values[31]
max = maxbit(orvalues)
no branching at all!

see paper: "Decoding billions of integers per second through vectorization"
by Daniel Lemire and Leonid Boytsov
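A runnable Java version of the OR-accumulation trick above. This is a sketch rather than the Parquet source: it ORs all values of a block together with no data-dependent branch, then derives the maximum bit width from the result using `Integer.numberOfLeadingZeros`.

```java
public class MaxBitDemo {
    // Max bit width of a block: OR everything, then measure the OR once.
    static int maxBit(int[] values) {
        int orValues = 0;
        for (int v : values) {
            orValues |= v;   // no comparison on the data, just accumulation
        }
        // A 32-bit int needs (32 - leading zeros) bits to represent.
        return 32 - Integer.numberOfLeadingZeros(orValues);
    }

    public static void main(String[] args) {
        int[] block = {3, 7, 1, 5};          // every value fits in 3 bits
        System.out.println(maxBit(block));   // prints 3
    }
}
```

The loop still has a loop-condition branch (the "better" variant); fully unrolling it over a fixed block of 32 values, as on the slide, removes even that.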
30. Binary unpacking designed for CPU efficiency
int j = 0
for (int i = 0; i < output.length; i += 32) {
  maxbit = input[j]
  unpack_32_values(input, j + 1, output, i, maxbit);
  j += 1 + maxbit
}
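To make the unpacking loop above concrete, here is a self-contained sketch of packing and unpacking a block of 32 values at a fixed bit width. It works bit by bit for clarity; a real implementation like Parquet's unpacks whole words at a time with generated branch-free code, and `pack32`/`unpack32` are my own illustrative names.

```java
public class BitPackDemo {
    // Pack 32 values of `bitWidth` bits each into a long[] bit buffer.
    static long[] pack32(int[] values, int bitWidth) {
        long[] out = new long[(32 * bitWidth + 63) / 64];
        for (int i = 0; i < 32; i++) {
            int bitPos = i * bitWidth;
            for (int b = 0; b < bitWidth; b++, bitPos++) {
                if (((values[i] >>> b) & 1) != 0) {
                    out[bitPos / 64] |= 1L << (bitPos % 64);
                }
            }
        }
        return out;
    }

    // Reverse: read 32 values of `bitWidth` bits back out of the buffer.
    static int[] unpack32(long[] packed, int bitWidth) {
        int[] values = new int[32];
        for (int i = 0; i < 32; i++) {
            int bitPos = i * bitWidth;
            for (int b = 0; b < bitWidth; b++, bitPos++) {
                if (((packed[bitPos / 64] >>> (bitPos % 64)) & 1) != 0) {
                    values[i] |= 1 << b;
                }
            }
        }
        return values;
    }

    public static void main(String[] args) {
        int[] vals = new int[32];
        for (int i = 0; i < 32; i++) vals[i] = i % 8;   // all fit in 3 bits
        long[] packed = pack32(vals, 3);                 // 96 bits, 2 longs
        System.out.println(
            java.util.Arrays.equals(vals, unpack32(packed, 3))); // true
    }
}
```

At 3 bits per value the block occupies 12 bytes instead of 128, which is why choosing `maxbit` per block matters.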
34. Size comparison
TPCDS 100GB scale factor (+ Snappy unless otherwise specified)
Datasets: Store sales, Lineitem
[Chart: the area of each circle is proportional to the file size]
35. Parquet 2.x roadmap
- Even better predicate push down.
- Zero copy path through new HDFS APIs.
- Vectorized APIs
- C++ library: implementation of encodings
- Decimal, Timestamp logical types