Efficient Data Storage for Analytics
with Apache Parquet 2.0
Julien Le Dem @J_
Processing tools tech lead, Data Platform
@ApacheParquet
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/parquet
Presented at QCon San Francisco
www.qconsf.com
Outline
- Why we need efficiency
- Properties of efficient algorithms
- Enabling efficiency
- Efficiency in Apache Parquet
Why we need efficiency
Producing a lot of data is easy
Producing a lot of derived data is even easier.
Solution: Compress all the things!
Scanning a lot of data is easy
… but not necessarily fast. (Progress bar: 1% completed)
Waiting is not productive. We want faster turnaround.
Compression, but not at the cost of reading speed.
Trying new tools is easy
We need a storage format that is interoperable with all the tools we use
and keeps our options open for the next big thing.
Enter Apache Parquet
Parquet design goals
- Interoperability
- Space efficiency
- Query efficiency
Parquet timeline
- Fall 2012: Twitter & Cloudera merge efforts to develop columnar formats
- March 2013: OSS announcement; Criteo signs on for Hive integration
- July 2013: 1.0 release. 18 contributors from more than 5 organizations.
- May 2014: Apache Incubator. 40+ contributors, 18 with 1000+ LOC. 26 incremental releases.
- Parquet 2.0 coming as Apache release
Interoperability
Interoperable
- Model agnostic
- Language agnostic: Java, C++
Frameworks and libraries integrated with Parquet
- Query engines: Hive, Impala, HAWQ, IBM Big SQL, Drill, Tajo, Pig, Presto
- Frameworks: Spark, MapReduce, Cascading, Crunch, Scalding, Kite
- Data models: Avro, Thrift, ProtocolBuffers, POJOs
Enabling efficiency
Columnar storage
[Figure: a logical table representation, shown in row layout and in column layout with encoding; nested schema]
Parquet nested representation
Columns:
docid
links.backward
links.forward
name.language.code
name.language.country
name.url
Schema:
Borrowed from the Google Dremel paper
https://blog.twitter.com/2013/dremel-made-simple-with-parquet
Statistics for filter and query optimization
Vertical partitioning (projection push down)
+ Horizontal partitioning (predicate push down)
= Read only the data you need!
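To make the predicate push down idea concrete, here is a minimal sketch in Java. It assumes each chunk of a column carries min/max statistics, and it uses hypothetical names (`StatsPruningSketch`, `chunksToRead`) that are not part of any Parquet API. A reader evaluating a range filter can skip every chunk whose [min, max] interval cannot overlap the requested range.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: names and layout are assumptions, not the Parquet API.
final class StatsPruningSketch {
    /**
     * Returns the indexes of the chunks that may contain a value in [lo, hi].
     * mins[i] and maxes[i] are the statistics recorded for chunk i.
     * Chunks not returned can be skipped without reading their data.
     */
    static List<Integer> chunksToRead(long[] mins, long[] maxes, long lo, long hi) {
        List<Integer> keep = new ArrayList<>();
        for (int i = 0; i < mins.length; i++) {
            // The chunk overlaps the filter range unless it ends before lo or starts after hi.
            if (maxes[i] >= lo && mins[i] <= hi) {
                keep.add(i);
            }
        }
        return keep;
    }
}
```

With three chunks covering 0..99, 100..199 and 200..299, a filter on 150..210 only needs the last two chunks. Projection push down is the complementary step: only the columns named in the query are read at all.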
Properties of efficient algorithms
CPU Pipeline
[Figure: processor pipeline in the ideal case and with a mis-prediction ("bubble")]
Optimize for the processor pipeline
"Bubbles" can be caused by:
- ifs
- loops
- virtual calls
- data dependency
Cost: ~12 cycles
Minimize CPU cache misses
A cache miss costs from 10 to hundreds of cycles, depending on the cache level.
[Diagram: RAM, bus, CPU cache]
Encodings in Apache Parquet 2.0
The right encoding for the right job
- Delta encodings: for sorted datasets or signals where the variation is small compared to the absolute value (timestamps, auto-generated ids, metrics, …). Focuses on avoiding branching.
- Prefix coding (delta encoding for strings): when dictionary encoding does not work.
- Dictionary encoding: small (60K) set of values (server IP, experiment id, …).
- Run Length Encoding: repetitive data.
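As an illustration of the last two encodings in the list, the following Java sketch shows dictionary encoding and run length encoding in their simplest form. The class and method names are hypothetical, and the output layout (a list of value/run-length pairs) is an assumption chosen for readability, not the on-disk Parquet representation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: simplified encodings, not the Parquet wire format.
final class EncodingSketch {
    /**
     * Dictionary encoding: each value is replaced by the index of its first
     * occurrence in a dictionary of distinct values. The dictionary itself
     * is appended to dictionaryOut in index order.
     */
    static int[] dictionaryEncode(String[] values, List<String> dictionaryOut) {
        Map<String, Integer> ids = new HashMap<>();
        int[] encoded = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            Integer id = ids.get(values[i]);
            if (id == null) {
                id = ids.size();
                ids.put(values[i], id);
                dictionaryOut.add(values[i]);
            }
            encoded[i] = id;
        }
        return encoded;
    }

    /** Run length encoding: returns (value, runLength) pairs flattened into one list. */
    static List<Integer> runLengthEncode(int[] values) {
        List<Integer> runs = new ArrayList<>();
        int i = 0;
        while (i < values.length) {
            int j = i;
            while (j < values.length && values[j] == values[i]) {
                j++;
            }
            runs.add(values[i]);
            runs.add(j - i);
            i = j;
        }
        return runs;
    }
}
```

The two compose naturally: dictionary ids of a column with few distinct values tend to form long runs, which run length encoding then shrinks further.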
Delta encoding
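The diagrams for these slides did not survive in the transcript, so here is a hedged Java sketch of the general idea. It stores the first value, the minimum delta, and then each delta minus that minimum, which leaves small non-negative numbers that bit-pack well. The single-block layout and the names (`DeltaSketch`, `encode`, `decode`) are assumptions for illustration; the actual format splits values into blocks and packs the residuals.

```java
// Illustrative only: a one-block simplification of delta encoding.
final class DeltaSketch {
    /**
     * Encodes a non-empty array as {first, minDelta, r1, r2, ...}
     * where r_i = (values[i] - values[i-1]) - minDelta, so every r_i >= 0.
     */
    static long[] encode(long[] values) {
        long[] out = new long[values.length + 1];
        out[0] = values[0];
        long minDelta = values.length > 1 ? Long.MAX_VALUE : 0;
        for (int i = 1; i < values.length; i++) {
            minDelta = Math.min(minDelta, values[i] - values[i - 1]);
        }
        out[1] = minDelta;
        for (int i = 1; i < values.length; i++) {
            out[i + 1] = (values[i] - values[i - 1]) - minDelta;
        }
        return out;
    }

    /** Inverse of encode: rebuilds the values by accumulating the deltas. */
    static long[] decode(long[] encoded) {
        long[] values = new long[encoded.length - 1];
        values[0] = encoded[0];
        long minDelta = encoded[1];
        for (int i = 1; i < values.length; i++) {
            values[i] = values[i - 1] + minDelta + encoded[i + 1];
        }
        return values;
    }
}
```

For the input 100, 105, 110, 116, 121 the deltas are 5, 5, 6, 5, so the residuals after subtracting the minimum delta of 5 are 0, 0, 1, 0 and need only one bit each.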
Binary packing designed for CPU efficiency
naive maxbit (unpredictable branch!):
int max = 0;
for (int i = 0; i < values.length; ++i) {
  int current = maxbit(values[i]);
  if (current > max) max = current;  // taken or not depending on the data
}
better (loop => very predictable branch):
int orvalues = 0;
for (int i = 0; i < values.length; ++i) {
  orvalues |= values[i];  // the OR keeps the highest bit set in any value
}
int max = maxbit(orvalues);
even better (no branching at all!):
int orvalues = 0;
orvalues |= values[0];
…
orvalues |= values[31];  // unrolled over a block of 32 values
int max = maxbit(orvalues);
see paper:
"Decoding billions of integers per second through vectorization"
by Daniel Lemire and Leonid Boytsov
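The variants above can be written as runnable Java along the following lines. This is a sketch under assumptions: the slides do not define `maxbit`, so it is taken here to mean the number of bits needed to represent a value treated as unsigned, implemented with `Integer.numberOfLeadingZeros`. The class and method names are hypothetical.

```java
// Illustrative only: maxbit is assumed to be "bits needed for an unsigned value".
final class MaxBitSketch {
    /** Number of bits needed to represent v as an unsigned value (0 needs 0 bits). */
    static int maxbit(int v) {
        return 32 - Integer.numberOfLeadingZeros(v);
    }

    /** Naive version: one data-dependent comparison per value. */
    static int naive(int[] values) {
        int max = 0;
        for (int i = 0; i < values.length; ++i) {
            int current = maxbit(values[i]);
            if (current > max) max = current;
        }
        return max;
    }

    /** OR-based version: the only branch left is the loop condition. */
    static int orBased(int[] values) {
        int orvalues = 0;
        for (int i = 0; i < values.length; ++i) {
            orvalues |= values[i];
        }
        return maxbit(orvalues);
    }
}
```

Both methods return the same width because OR-ing values together never sets a bit above the highest bit of the largest value. The fully unrolled form from the slide replaces the loop in `orBased` with 32 straight-line statements for a fixed block size.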
Binary unpacking designed for CPU efficiency
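int j = 0;  // read position in input
for (int i = 0; i < output.length; i += 32) {
  int maxbit = input[j];  // header word: bit width used by this block
  unpack_32_values(input, j + 1, output, i, maxbit);  // 32 values packed into maxbit words
  j += 1 + maxbit;  // skip the header word and the packed words
}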
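A self-contained Java sketch of this scheme follows, assuming 32-bit words, blocks of exactly 32 values, and a one-word header holding the bit width. All names (`BitPackingSketch`, `pack32Values`, `unpack32Values`, `packAll`, `unpackAll`) are hypothetical. The generic loop inside `unpack32Values` still contains a branch; it is kept for brevity, whereas an implementation tuned as the slides describe would typically use unrolled code per bit width.

```java
import java.util.Arrays;

// Illustrative only: block layout and names are assumptions for this sketch.
final class BitPackingSketch {
    /** Packs 32 values of bitWidth bits (0..32) from in[inPos..] into bitWidth words at out[outPos..]. */
    static void pack32Values(int[] in, int inPos, int[] out, int outPos, int bitWidth) {
        long buffer = 0;                    // pending bits, lowest bits first
        int bits = 0;                       // number of valid bits in buffer
        long mask = (1L << bitWidth) - 1;
        for (int k = 0; k < 32; k++) {
            buffer |= (in[inPos + k] & mask) << bits;
            bits += bitWidth;
            if (bits >= 32) {               // a full word is ready: flush it
                out[outPos++] = (int) buffer;
                buffer >>>= 32;
                bits -= 32;
            }
        }
    }

    /** Inverse of pack32Values: reads bitWidth words and writes 32 values to out[outPos..]. */
    static void unpack32Values(int[] in, int inPos, int[] out, int outPos, int bitWidth) {
        long buffer = 0;
        int bits = 0;
        long mask = (1L << bitWidth) - 1;
        for (int k = 0; k < 32; k++) {
            if (bits < bitWidth) {          // refill with the next packed word
                buffer |= (in[inPos++] & 0xFFFFFFFFL) << bits;
                bits += 32;
            }
            out[outPos + k] = (int) (buffer & mask);
            buffer >>>= bitWidth;
            bits -= bitWidth;
        }
    }

    /** Encodes values (length must be a multiple of 32) as blocks of [bitWidth, packed words...]. */
    static int[] packAll(int[] values) {
        int[] tmp = new int[values.length / 32 * 33];  // worst case: header + 32 words per block
        int j = 0;
        for (int i = 0; i < values.length; i += 32) {
            int or = 0;
            for (int k = 0; k < 32; k++) or |= values[i + k];
            int maxbit = 32 - Integer.numberOfLeadingZeros(or);
            tmp[j] = maxbit;
            pack32Values(values, i, tmp, j + 1, maxbit);
            j += 1 + maxbit;
        }
        return Arrays.copyOf(tmp, j);
    }

    /** The decoding loop from the slide: valueCount must be a multiple of 32. */
    static int[] unpackAll(int[] input, int valueCount) {
        int[] output = new int[valueCount];
        int j = 0;
        for (int i = 0; i < output.length; i += 32) {
            int maxbit = input[j];
            unpack32Values(input, j + 1, output, i, maxbit);
            j += 1 + maxbit;
        }
        return output;
    }
}
```

Because 32 values of `maxbit` bits each occupy exactly `maxbit` 32-bit words, the read position advances by `1 + maxbit` per block, which matches the `j += 1 + maxbit` step in the loop.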
Compression comparison
TPCH: compression of two 64-bit id columns with delta encoding
[Charts: primary key and foreign key columns; relative size (0% to 120%) of plain vs. delta encoding, with no compression and with Snappy]
Decoding time vs Compression
[Chart: decoding speed (million/second) against compression (percent saved) for Plain, Plain + Snappy, and Delta]
Size comparison
TPCDS 100GB scale factor (+ Snappy unless otherwise specified)
Tables: Store sales, Lineitem
The area of the circle is proportional to the file size.
Parquet 2.x roadmap
- Even better predicate push down.
- Zero copy path through new HDFS APIs.
- Vectorized APIs
- C++ library: implementation of encodings
- Decimal, Timestamp logical types
Community
Thank you to our contributors
Open Source announcement
1.0 release
Get involved
Mailing lists:
- dev@parquet.incubator.apache.org
Parquet sync ups:
- Regular meetings on Google Hangout
Questions
Questions.foreach( answer(_) )
@ApacheParquet