
The Right Data for the Right Job

What if you could get blazing fast queries on your data without having to be on call for a giant, expensive database? By picking the right file format for your data, you can store your data on disk in the cloud and still get the performance you need for modern analytics. We'll discuss benchmarks of four different data storage formats: Parquet, ORC, Avro, and traditional character-separated files like CSV. We'll cover what they are, how they work at a bits-and-bytes level, and why you might choose each one for your use case.

The Right Data for the Right Job

  • 1. The Right Data Format for the Right Job a.k.a. How you store your files on disk will make or break you!! Emily May Curtin IBM Watson Data Platform Atlanta, GA @emilymaycurtin @ecurtin
  • 3. Query Time (seconds), bar chart: CSV 2892.3, Parquet: LZO 50.6, Parquet: Uncompressed 43.4, Parquet: GZIP 40.3, Parquet: Snappy 28.90. Query: SELECT cacheStatus, bytesSent from ADatasetThatHasToDoWithCDNs WHERE cacheStatus LIKE 'stale' AND bytesSent < 500
  • 5. Who Am I • Artist, former documentary film editor (long story) • Tabs, not spaces (at least for Scala)(where 1 tab = 2 spaces) • The Weather Company → IBM Watson Data Platform • Current lead developer of SparkTC/spark-bench (Contributors welcome! Talk to me after!) • From DeKalb County, now live in Atlanta proper
  • 6. IBM Watson Data Platform: Making Data Simple https://www.ibm.com/analytics/us/en/watson-data-platform/
  • 7. Shout-Out to My Team Matthew Schauer @showermat *not an actual photo of my team Craig Ingram @cin Brad Kaiser @brad-kaiser Spark Technology Center East Research and development on Apache Spark and the Spark ecosystem. - Dynamic Allocation on Yarn - Benchmarking and Tracing - Variations of Spark On Kubernetes
  • 9. Atlanta Recommendations From A Local • ✅ Your DeKalb Farmer’s Market • ✅ Buford Highway Farmer’s Market • ✅ Hike the East Palisades in Chattahoochee National Recreation Area • ✅ Civil Rights Museum • ✅ High Museum of Art • ✅ B’s Cracklin’ or ✅ Community Q for BBQ, make sure you get collards
  • 10. Atlanta Recommendations From A Local DO NOT go to • 🚫 World of Coke • 🚫 The Varsity • 🚫 Underground Atlanta • 🚫 Fat Matt’s
  • 11. Outline • Quick Intro to Spark • What is the Right Job • Guiding Principles • CSVs and why text formats are so bad • Improvements for All Datasets • Partition by size • Compression • Partition by data • JSON Sucks • Avro is better • ORC is even better • Parquet is even better than that • NUMBERS AND GRAPHS!! • Summarize
  • 12. TL;DR For large ETL, data mining, analytics, and machine learning/AI applications, using the right on-disk data storage format can drastically improve your application’s performance, usually between 10x – 1000x. Stop using CSV, use Parquet instead.
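A minimal sketch of that advice in Scala/Spark, assuming a hypothetical CSV input and output location (paths and app name are illustrative, not from the talk):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

    // Read the CSV once, paying the text-parsing cost a single time...
    val df = spark.read
      .option("header", "true")      // first row holds column names
      .option("inferSchema", "true") // extra pass to guess column types
      .csv("/data/raw/events.csv")   // hypothetical input path

    // ...then write it back out as Parquet for every read that follows.
    // Snappy is Spark's default Parquet codec; it's set explicitly here.
    df.write
      .option("compression", "snappy")
      .parquet("/data/lake/events.parquet")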
  • 14. Spark As The Context • Spark is the generalized distributed compute engine of choice • Integrates with pretty much everything! • Many flavors of Spark, this talk uses vanilla Spark Standalone
  • 15. The Choice of Compute Engine Matters for Benchmarks Format performance with Apache Spark IS NOT THE SAME AS format performance with another engine. Some examples: • Orc faster on Hive, slower on Spark • Parquet different between Pig, Spark, Hive • MR slower than all Spark, generally Also, differences in hardware, network speed can be huge!
  • 16. How Spark Standalone Works, Kinda. Diagram: a Driver talking to Executors, with Object Storage in the ☁️ and HDFS on each executor node; Spark will respect data locality in HDFS if possible. RDD: Resilient Distributed Dataset with partitions on each node
  • 17. (Just FYI, There’s More Than Spark Standalone) • Spark Standalone • Spark on Mesos • Spark on Yarn • Spark on Yarn with dynamic allocation • Spark Standalone on Kubernetes • Spark on Kubernetes, the fork • Spark on Kubernetes, the partially merged version in 2.3 • Etc, etc, etc
  • 18. The Right Job Do you need a Spark cluster for this?
  • 19. The Right Job • Very large datasets (100+ GB, TB, PB) • Strong need for distributed computation • Write once, read forever • ETL, Analytics, Machine Learning, IoT, etc.
  • 23. Wrong Job: Your data fits in Excel Spreadsheets rock for small use cases! Don’t mess with success!
  • 24. Wrong Job: Your data is "laptop-sized"
  • 25. Wrong Job: Your data is "laptop-sized" For small n, just do the simple thing "…premature optimization is the root of all evil (or at least most of it) in programming." - Donald Knuth
  • 26. Wrong Job: Your data is "laptop-sized" For small n, just do the simple thing "…premature optimization is the root of all evil (or at least most of it) in programming." - Donald Knuth Caveats: • "Laptop-sized" is super subjective • Many single-node sized problems can still benefit from the techniques in this talk!
  • 28. Wrong Job: Your data is highly relational and/or constantly updated If you need a database, use a database!
  • 29. But if you don’t need a database, don’t use one
  • 31. Goals for Data Lake Storage •Good Usability • Importance: 👍 • Easy to backup • Minimal learning curve • Easy integration with existing tools •Resource Efficient • Importance: 👍 👍 👍 • Disk space • Disk I/O Time • Network I/O •Affordable • Importance: 💰💰💵👍💰👍🙏👍💰👍💵 • Developer hours → Cost you $$$ • Compute cycles → Cost you $$$ • On-call problems → Cost you $$$ •Provides Fast Queries • Importance: 🏆🥇‼️💯😍 ‼️ 🥇💰👍💯🙏💯💰
  • 32. Little Costs Matter at Actual Scale "Very Large Dataset" Weather-Scale Data
  • 33. Disk and Network I/O Hurt
    Action | Computer Time | "Human Scale" Time
    1 CPU cycle | 0.3 ns | 1 s
    Level 1 cache access | 0.9 ns | 3 s
    Level 2 cache access | 2.8 ns | 9 s
    Level 3 cache access | 12.9 ns | 43 s
    Main memory access | 120 ns | 6 min
    Solid-state disk I/O | 50-150 μs | 2-6 days
    Rotational disk I/O | 1-10 ms | 1-12 months
    Internet: SF to NYC | 40 ms | 4 years
    Internet: SF to UK | 81 ms | 8 years
    Internet: SF to Australia | 183 ms | 19 years
    Source: Systems Performance: Enterprise and the Cloud by Brendan Gregg via CodingHorror.com "The Infinite Space Between Words"
  • 34. Resource Constraints in The Cloud. Diagram of CPUs, caches, memory, and disk: we have basically infinite boxes… … but we don’t get away from the limitations of the arrows (network and bus)
  • 35. Network and Bus capacity are limited…
  • 36. … so we need to make efficient use of the space we have!
  • 38. Let’s Talk About CSVs So that we can figure out how to do better!
  • 39. Defining CSVs • Comma Character Separated Values • All values stored as strings on disk • "Write programs to handle text streams, because that is a universal interface." –Doug McIlroy
  • 40. Strings On Disk
    Value | Binary
    1 | 00000001
    "1" | 00110001
    18 | 00010010
    "18" | 00110001 00111000
    511 | 111111111
    "511" | 00110101 00110001 00110001
    65535 | 1111111111111111
    "65535" | 00110110 00110101 00110101 00110011 00110101
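A tiny Scala sketch to check the arithmetic in the table above: the value 65535 costs five bytes as UTF-8 text but only two bytes as a 16-bit integer.

    import java.nio.charset.StandardCharsets

    val textBytes = "65535".getBytes(StandardCharsets.UTF_8).length // 5 bytes of characters
    val binaryBytes = java.lang.Short.BYTES                         // 2 bytes for a 16-bit value
    println(s"as text: $textBytes bytes, as binary: $binaryBytes bytes")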
  • 41. Example Sentence Start,Descriptor,Number, My dataset is,great,6234, My dataset is,cool,8679, My dataset is,nice,3, My dataset is,,2857, My dataset is,full of nulls,, My dataset is,whatever,3758, . . .
  • 42. Example Sentence Start,Descriptor,Number,\nMy dataset is,great,6234,\nMy dataset is,cool,8679,\nMy dataset is,nice,3,\nMy dataset is,,2857,\nMy dataset is,full of nulls,,\nMy dataset is,whatever,3758,\n . . .
  • 43. Pros 👍 • It’s so simple! • It’s universal! • It’s human readable! Cons 😱 • It’s gigantic • Columns have to be post-processed into the correct type • Must be scanned linearly • No separation between rows or columns on disk. What if you only want part of the data?
  • 44. Let’s Talk About Improving CSVs
  • 45. Let’s Talk About Improving CSVs How to Improve the Performance of Any Data Format
  • 46. One Giant File: Maximally Bad One Gigantic CSV
  • 47. One Giant File: Maximally Bad One Gigantic CSV Most of my benchmarks for this option didn’t even finish
  • 48. Partition the File: SO MUCH Better. Gigantic CSV/part-01.csv, Gigantic CSV/part-02.csv, Gigantic CSV/part-03.csv, Gigantic CSV/part-04.csv, …
  • 49. Spark Makes This Easy Dataframe.write.csv() will write out the same number of partitions as it already has in memory. Optionally use Dataframe.repartition().write.csv() to control the number of partitions.
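A short sketch of those two calls, assuming `df` is an existing DataFrame and the output paths are hypothetical:

    // One output file per in-memory partition, however many that happens to be.
    df.write.csv("/data/out/as-is")

    // Explicitly choose the number of partitions (64 here is arbitrary),
    // then write: 64 part files of roughly even size.
    df.repartition(64)
      .write
      .option("header", "true")
      .csv("/data/out/repartitioned")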
  • 50. Compression Smaller size over the network/bus. Must be decompressed to be useful. YouTube: The Hydraulic Press Channel
  • 51. Partition By Column my-amazing-data.csv/ year=2016/ month=02/ day=22/ part-0000.csv part-0001.csv day=23/ part-0000.csv part-0001.csv part-0002.csv Spark makes this easy! df.write.partitionBy( "whatever", "columns" ) This technique can be used for any format, not restricted to CSV
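A sketch of the same call producing the year/month/day layout from the slide, assuming `df` actually has those columns (the column names and path are illustrative):

    df.write
      .partitionBy("year", "month", "day") // one directory level per distinct column value
      .option("compression", "gzip")       // compression applies per part file
      .csv("/data/out/my-amazing-data.csv")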
  • 52. TBH, partitioning and compression are about all you can do for CSV. So…
  • 53. JSON Because the night is darkest just before the day
  • 54. JSON as a Data Lake Storage Format Pros • Your schema is baked into the data • Spark makes it really easy to parse huge amounts of JSON data Cons • Your schema is baked into the data… over and over and over and… • Same text parsing hits as CSV
  • 55. JSON { "Sentence_Start" : "My dataset is", "Descriptor" : "great", "Number" : 6234 } { "Sentence_Start" : "My dataset is", "Descriptor" : "cool", "Number" : 4367 } { "Sentence_Start": "My dataset is", "Descriptor" : "", "Number" : 4367 }
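A sketch of reading newline-delimited JSON records like the ones above, assuming an existing SparkSession `spark` and a hypothetical path:

    import spark.implicits._

    val sentences = spark.read.json("/data/raw/sentences.json")
    sentences.printSchema()                   // schema is inferred from the embedded field names
    sentences.filter($"Number" > 5000).show() // still pays the text-parsing cost on every read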
  • 56. Avro When you’re really, really sure you don’t need column-selection
  • 57. Avro is Row Oriented Led Zeppelin IV 11/08/1971 1 Houses of the Holy 03/28/1973 1 Physical Graffiti 02/24/1975 1 Led Zeppelin IV 11/08/1971 1Houses of the Holy 03/28/1973 1Physical Graffiti 02/24/1975 1 Row-Oriented data on disk Column-Oriented data on disk Title Date Chart
  • 58. Avro: JSON Schema + Data {"Title" : "String", "Release_Date" : "String", "Top_Chart_Position" : "Int"} Led Zeppelin IV 11/08/1971 1 Houses of the Holy 03/28/1973 1 Physical Graffiti 02/24/1975 1
  • 59. Avro • Serialization format comparable to Thrift and Protobuf • Handles a variety of compression formats including gzip, deflate, snappy • Schema is stored as JSON in the header of each partition • Supports primitive types and wide array of complex types (Map, array, record, enum, etc) • Supports nested schemas (it’s JSON after all!) • Supports unlimited schema evolution
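A hedged sketch of writing and reading Avro from Spark. Avro support is not built into Spark 2.3 and earlier, so this assumes the spark-avro package is on the classpath (Spark 2.4+ ships a built-in "avro" format instead of the Databricks one); `df` and the path are hypothetical:

    // Spark 2.x with the com.databricks:spark-avro package on the classpath.
    df.write.format("com.databricks.spark.avro").save("/data/out/albums.avro")

    // Reading it back: the schema comes from the Avro header, no inference pass needed.
    val albums = spark.read.format("com.databricks.spark.avro").load("/data/out/albums.avro")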
  • 60. Where Avro Excels 👍 • Read and write performance is balanced • Schemas evolving often • Strong requirement to always read all the data
  • 61. Where Avro Fails 👎 • Anything with column selection • SQL!! • Analytics • Machine learning that’s not just matrix math • Anything where read performance is paramount
  • 62. ORC: Optimized Row Columnar Let’s finally talk about columnar storage!
  • 63. Orc and Parquet are Columnar Led Zeppelin IV 11/08/1971 1 Houses of the Holy 03/28/1973 1 Physical Graffiti 02/24/1975 1 Led Zeppelin IV 11/08/1971 1Houses of the Holy 03/28/1973 1Physical Graffiti 02/24/1975 1 Row-Oriented data on disk Column-Oriented data on disk Title Date Chart
  • 64. Columnar Storage allows for slicing and dicing
  • 65. ORC Pros • Columnar • Storage indexes • Supports variety of compression formats • Very powerful configurable storage index tuning and bloom filtering Cons • Not super well supported in Spark • Only supports flat schemas
  • 68. Storage index illustration, three groups of values with their stats: 33 10 60 72 11 11 (Min: 10, Max: 72); 812 467 883 564 330 846 (Min: 330, Max: 883); 3 4567 233 93 42746 14 (Min: 3, Max: 42746)
  • 69. Storage Index • Works super well for sorted or semi-sorted data • Basically it’s stats • Numeric data • Works REALLY well if you can bucket your data before writing, e.g. dataframe.write.bucketBy(16, "column", "names").sortBy("column").saveAsTable("bucketed") (bucketBy lives on the DataFrameWriter, takes a bucket count first, and requires saveAsTable)
  • 70. Bloom Filtering • Uses a probabilistic data structure to say if an element "very probably is" or "definitely is not" in the set • ORC allows users to turn on bloom filtering for certain columns • Useful for non-numeric columns of limited cardinality
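A hedged sketch of turning bloom filters on for an ORC write from Spark. The orc.bloom.filter.* keys are standard ORC table properties, but whether the writer honors them depends on your Spark and ORC versions (the native ORC writer in newer Spark releases); the column and path names are illustrative:

    df.write
      .option("orc.bloom.filter.columns", "cacheStatus") // bloom filter on a low-cardinality column
      .option("orc.bloom.filter.fpp", "0.05")            // target false-positive probability
      .orc("/data/out/cdn-logs.orc")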
  • 71. Support in Spark is a little lacking • SPARK-20901 Feature Parity for ORC with Parquet • SPARK-22320 ORC should support VectorUDT/MatrixUDT • SPARK-22279 Turn on spark.sql.hive.convertMetastoreOrc by default • SPARK-23007 Add schema evolution test suite for file-based data sources
  • 73. Let’s Talk About Parquet Because it’s awesome 
  • 74. Parquet Format "Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language." • Binary Format • API for JVM/Hadoop & C++ • Columnar • Encoded • Compressed • Machine-Friendly
  • 76. Very Important Dataset
    Title | Released | Label | PeakChart.UK | Certification.BVMI | Certification.RIAA | (omitted for space…)
    Led Zeppelin | 01/12/1969 | Atlantic | 6 | | 8x Platinum | …
    Led Zeppelin II | 10/22/1969 | Atlantic | 1 | Platinum | Diamond | …
    Led Zeppelin III | 10/05/1970 | Atlantic | 1 | Gold | 6x Platinum | …
    Led Zeppelin IV | 11/08/1971 | Atlantic | 1 | 3x Gold | Diamond | …
    Houses of the Holy | 03/28/1973 | Atlantic | 1 | Gold | Diamond | …
    Physical Graffiti | 02/24/1975 | Swan Song | 1 | Gold | Diamond | …
    Presence | 03/31/1976 | Swan Song | 1 | | 3x Platinum | …
    In Through The Out Door | 08/15/1979 | Swan Song | 1 | | 6x Platinum | …
    Coda | 11/19/1982 | Swan Song | 4 | | Platinum | …
  • 77. Schema Breakdown. Anatomy of one schema line: column name (Title), optional/required/repeated (OPTIONAL), data type (BINARY), encoding info for binary (O:UTF8), repetition value (R:0), definition value (D:0). Flat schema:
    TITLE: OPTIONAL BINARY O:UTF8 R:0 D:1
    RELEASED: OPTIONAL BINARY O:UTF8 R:0 D:1
    LABEL: OPTIONAL BINARY O:UTF8 R:0 D:1
    PEAKCHART.UK: REQUIRED INT32 R:0 D:0
    . . .
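A quick way to see the schema Spark reads back from a Parquet footer (names, types, nullability), which lines up with the OPTIONAL/REQUIRED flags above; the path is hypothetical and an existing SparkSession `spark` is assumed:

    val albums = spark.read.parquet("/data/lake/led-zeppelin-albums.parquet")
    albums.printSchema() // e.g. Title: string (nullable = true), PeakChart.UK: integer, ...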
  • 78. Repetition and Definition Levels Source: https://github.com/apache/parquet-mr
  • 79. One Parquet Row, Two Ways Title = Led Zeppelin IV Released = 11/8/1971 Label = Atlantic PeakChart: .AUS = 2 .UK = 1 .US = 2 Certification: .ARIA = 9x Platinum .BPI = 6x Platinum .BVMI = 3x Gold .CRIA = 2x Diamond .IFPI = 2x Platinum .NVPI = Platinum .RIAA = Diamond .SNEP = 2x Platinum TITLE = LED ZEPPELIN IV RELEASED = 11/8/1971 LABEL = ATLANTIC PEAKCHART.UK = 1 PEAKCHART.AUS = 2 PEAKCHART.US = 2 CERTIFICATION.ARIA = 9X PLATINUM CERTIFICATION.BPI = 6X PLATINUM CERTIFICATION.BVMI = 3X GOLD CERTIFICATION.CRIA = 2X DIAMOND CERTIFICATION.IFPI = 2X PLATINUM CERTIFICATION.NVPI = PLATINUM CERTIFICATION.RIAA = DIAMOND CERTIFICATION.SNEP = 2X PLATINUM
  • 81. Parquet Structure In the Filesystem led-zeppelin-albums.parquet/ • _SUCCESS • Year=1969/ • Part-r-00000-6d4d42e2-c13f-4bdf-917d-2152b24a0f24.snappy.parquet • Part-r-00001-6d4d42e2-c13f-4bdf-917d-2152b24a0f24.snappy.parquet • … • Year=1970/ • Part-r-00000-35cb7ef4-6de6-4efa-9bc6-5286de520af7.snappy.parquet • ... • Groups of rows, partitioned by column values, compressed however you like. (GZIP, LZO, Snappy, etc) • In general LZO wins size benchmarks, Snappy good balance between size and CPU intensity. One compressed file == One row group
  • 82. Encoding: Incremental Encoding. Raw: Led_Zeppelin_IV, Led_Zeppelin_III, Led_Zeppelin_II, Led_Zeppelin = 58 bytes*. Incrementally encoded: 0 Led_Zeppelin; 12 _II; 15 I; 14 V = 24 bytes*. *not counting delimiters. 58% Reduction
  • 83. Encoding: Dictionary Encoding. Raw: Atlantic, Swan Song, Atlantic, Atlantic, Atlantic, Atlantic, Atlantic, Swan Song, Swan Song, Swan Song = 84 bytes*. Dictionary encoded: 0 1 0 0 0 0 0 1 1 1 with dictionary 0 → Atlantic, 1 → Swan Song = 1.25 bytes + dictionary size. ~98% Reduction
  • 84. Dictionary Filtering • Similar to Bloom filtering • Pulls the dictionary from the footer metadata. Slightly more I/O, usually for big benefit! • Enable in Spark 2+ using: val spark = SparkSession.builder.config("parquet.enable.dictionary", "true").getOrCreate()
  • 85. More Encoding Schemes • Plain (bit-packed, little endian, etc) • Dictionary Encoding • Run Length Encoding/Bit Packing Hybrid • Delta Encoding • Delta-Length Byte Array • Delta Strings (incremental Encoding) See https://github.com/apache/parquet-format/blob/master/Encodings.md for more detail
  • 86. Slicing and Dicing Within A Compressed File. Tree diagram: File Metadata at the root; Row Groups beneath it; each Row Group holds Column Chunks (Col1, Col2A, Col2B, Column 3 in the example); each Column Chunk has a Page Header and Pages; each Page holds metadata, R values, D values, and the ENCODED DATA
  • 94. Format Spec See the format spec for more detail: https://github.com/apache/parquet-format
  • 96. Spark Filter Pushdown spark.sql.parquet.filterPushdown → true by default since 1.5.0 For Where Clauses, Having clauses, etc. in SparkSQL, The Data Loading layer will test the condition before pulling a column chunk into Spark memory. select cs_bill_customer_sk customer_sk, cs_item_sk item_sk from catalog_sales,date_dim where cs_sold_date_sk = d_date_sk and d_month_seq between 1200 and 1200 + 11 Example From: https://developer.ibm.com/hadoop/2016/01/14/5-reasons-to-choose-parquet-for-spark-sql/
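A sketch of the same idea against the CDN dataset from the opening benchmark, assuming an existing SparkSession `spark` and a hypothetical path. With filterPushdown on, the WHERE conditions are checked against column-chunk statistics before the chunks are pulled into Spark memory:

    spark.conf.set("spark.sql.parquet.filterPushdown", "true") // already the default since 1.5.0

    val cdn = spark.read.parquet("/data/lake/cdn-logs.parquet")
    cdn.createOrReplaceTempView("cdn_logs")

    spark.sql("""
      SELECT cacheStatus, bytesSent
      FROM cdn_logs
      WHERE cacheStatus LIKE 'stale' AND bytesSent < 500
    """).show()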
  • 97. Slicing and Dicing Within A Compressed File. Same tree as before (File Metadata, Row Groups, Column Chunks, Pages with their encoded data); the key point here: column chunks contain metadata with statistics
  • 98. Get JUST the Data You Need • Get just the partitions you need • Get just the columns you need • Eliminate row groups using footer stats (storage index) • Optionally eliminate row groups using dictionary filtering • Eliminate individual pages using page stats
  • 99. What’s the Catch? Limitations, Write Speed, Immutability
  • 100. Limitations • Pushdown Filtering doesn’t exactly work with object stores: AWS S3, etc. No random access • Pushdown Filtering does not work on nested columns - SPARK-17636 • Binary vs. String saga – SPARK-17213
  • 101. Write Speed → Who Cares!! (In Most Cases) Write Once Read Forever Which case will you optimize for?
  • 102. Dealing With Immutability • Write using partitioning • Reimagine your data as a timeseries • Combine with a database (i.e. Cassandra) • Append additional row groups
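A sketch of the "treat it as a timeseries and append new partitions" approach, assuming `newRows` is today's batch and the path and column names are hypothetical:

    import org.apache.spark.sql.SaveMode

    newRows.write
      .mode(SaveMode.Append)               // add new files; existing row groups are never rewritten
      .partitionBy("year", "month", "day") // new data lands in new partition directories
      .parquet("/data/lake/events.parquet")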
  • 104. CSV vs. Other Options In Spark
    | CSV | JSON | Avro | ORC | Parquet
    Column or Row Major | Neither | Neither | Row | Column | Column
    Write Speed | Fast-ish | Slow | Fast | Slow | Slow
    Read Speed – Table Scan | Slow | Slow | Fast | Fast | Fast
    Read Speed – Col Select | Very Slow | Slow | Slow | Fast | Fast
    Schema Evolution | Not really | Yes | Yes | Not in Spark | Limited
    Supports Nested Schemas | No | Yes | Yes | No | Yes
    Support In Spark | Good | Good | Good | Workin’ on it… | Really Good!
    Opinion | 😱 | 😱😱 | 😐 | 🥇 | 😍
    Note: ORC is faster/better supported in Hive.
  • 105. Show Me The Numbers!
  • 106. Dataset for the Benchmarks • All doubles, no other data types • Data is not bucketed or partitioned at all • 1 master, 10 executor wimpy Spark Standalone cluster with 14GB/executor and 40 total cores • All data in co-located HDFS (unfair advantage for uncompressed formats) • Sample values: -1.352960483 0.5697647008 1.241025176 -2.307209617 -0.8867837614 -1.17363045 0.1408343935 -0.7907896731 -2.173014397 -0.1310740344 -1.256329584 0.6426700416 1.141094923 0.841751255 0.5603264606
  • 110. In Summary • Please don’t use CSVs or JSON unless you have a really good reason • Always take advantage of • Partitioning by space • Partitioning by appropriate column value • Bucketing • Compression • Spark’s default format is Parquet with Snappy compression. If you’re using Spark, that should be yours too! • Know Thy Data • Know Thy Use Case • Know Thy Reasonable Query Patterns
  • 111. Emily May Curtin www.framebit.org Github @ecurtin Twitter @emilymaycurtin Instagram @emilymaycurtin Make your own benchmarks with Spark-Bench!! Github.com/SparkTC/spark-bench

Editor's Notes

  1. I/O costs hurt! And we’re not just talking about disk I/O, we’re querying in a distributed environment so we have to get our data off disk and then across the network to our distributed compute cluster. So we have disk I/O and network I/O to consider. It’s nanoseconds to memory, but it’s MICROseconds to a solid state drive, and it’s MILLIseconds to spinning disk and it’s 10’s OF MILLISECONDS for network! This is definitely a detail worth sweating over when it comes to choosing a storage layer for our data lake. So let’s take a very opinionated look at some options.
  2. The optional indicates that it’s nullable. The binary or Int32 is the type of the column, then the binary encoding, and then the R values and D values. Repetition and Definition levels. This is where Parquet is borrowing from Google’s Dremel.
  3. To dig into this, I’m going to lift this illustration from the Parquet documentation. The Repetition values indicate whether or not the column is repeated. That doesn’t really even make sense in a flat schema. However, let’s consider this example of a JSON tree describing a document where we have a bunch of links. In this case, the Link column is going to be repeated. How about the Definition values. When we have a nested schema, Parquet is going to lay out our data in a tree-like format. The Definition level indicates at what level in the tree the values for that column are defined.
  4. I want to start at the 30,000 foot view of how Parquet works. Here’s the structure on the filesystem. We have our parquet FOLDER, which has a little Hadoop Success message, some metadata written out by Spark, and then I’m faking a partition here with the year. Partitioning is really powerful, we’ll get to that later. At the bottom of the filestructure tree we have these compressed files. These are the _real_ Parquet files. These are compressed using Snappy, you can use Gzip, LZO, or you can choose to forego compression. So this is the top-down view of what’s on the file system. Let’s move from the view waaay up here to the details way down here all the way at the bottom, and then we’ll fill up in between.
  5. So here’s an example, incremental encoding for strings. On the left we’re storing each of these album titles individually, and it’s taking 58 bytes if you don’t count delimiters. On the right, I’m using incremental encoding to store the common base and then afterwards I’m only storing the minimal data necessary. This is 2.4 times smaller, and it’s still very efficient for the cache and processor to unpack.
  6. Let’s take another example. How about dictionary encoding. If we take the column for the record labels, there’s only two distinct entries. I can create a dictionary from these and map them to ints, and now I’ve got just the size of the dictionary and 10 BITS of data! 84 bytes on the left, 10 BITS on the right, and just a little extra space for the dictionary, which will get amortized as our dataset grows beyond ten rows.
  7. And there’s more! I won’t go through all these, that would be boring, and there’s extensive detail in the format spec. There’s ways for encoding numeric types, more tricks for binary, on and on and on.
  8. So we have the 30,000 foot view from the file system, and we see what we’re gonna do at the really low level with the data. Here’s how we get from here to there. Parquet is designed as a tree of elements within each compressed file. So you can see waaaay down there at the bottom is our encoded data, and all of this is inside a compressed file.
  9. Parquet is designed as a tree of elements within each compressed file.
  10. So you can see waaaay down there at the bottom in leaf nodes is our encoded data, and all the way at the top is our file metadata, and that’s the root. All of this is inside a compressed file.
  11. At the top of our tree is the file metadata. This has the schema, some thrift headers, offsets, other stuff.
  12. At the next level of our Parquet tree we have a Row Group. So you’re a Spark node, you’re a Spark node, you’re a Spark node. I’m going to give you each a million rows to write to Parquet, ok? But you’re not going to jump into writing all million of those rows all at once, you’re going to consider just the first 100,000 rows. So you’ve got this group of rows, it’s a row group. It’s exactly what it sounds like. Each compressed file has at least one row group. In this example tree we happen to have two row groups. Next, we’re going to consider each column in this row group individually.
  13. And I’m going to slice and dice those columns into chunks. At least one column chunk per column in this row group. Here, columns 1 and 3 have just one column chunk, column 2 has two chunks. A detail, if your column values are null the whole way down in this row group, you’re not going to store that column! Why bother? Within those column chunks, all the pages share one page header,
  14. And speaking of pages
  15. …this is the real meat of our structure. Pages are the indivisible, atomic unit of Parquet. Each page has a little metadata, the R values and D values we talked about earlier, and that Encoded Data. So no actual data is stored in the higher levels of this tree. It’s all down here in the leaves, the pages. Pages are aptly named because they’re designed to be the same size as pages in virtual memory. So when you do I/O, you lift a page directly off disk, and slot it perfectly into a memory page. It’s one of those bits and bytes savings that add up as your data gets bigger.
  16. If you want to see more detail on the format spec, you can check it out online. Like so many things in this presentation, it’s not super slide-friendly.
  17. Let’s talk about a couple of ways that Spark gets really efficient query performance out of this format. I’m going to talk about a couple of features, but they all boil down to getting JUST the data that you need. We already know that when we read Parquet, we’re only going to pull in the columns that we need, not the whole dataset. But there’s more to it than that.
  18. The next thing Spark does is it takes any filtering statements from your SQL query, and it pushes those down to the level of the data scan. I’m going to borrow an example from some colleagues at IBM for this one. Here’s one of their queries, they’re pulling back just a few columns, but most importantly they’re applying some filters on those columns.
  19. The column chunks in our compressed files contain metadata in their footers that have some statistics about what’s in the column chunk, particularly mins and maxes in the case of numeric columns. So when the data is being read in, the read process can skip to the column chunk footer, do some simple tests to see hey, does this column chunk have a chance of containing what I’m looking for? And if the answer is no, it can skip that column chunk AND the rest of that row group!
  20. So that’s how Spark makes this so efficient. We’re going to get just the partitions we need, just the columns we need, and just the chunks of the columns that fit our filter conditions.
  21. And you’re thinking, man, this sounds pretty great! This sounds too good to be true! What’s the catch? There’s always a catch, isn’t there?
  22. Well, there’s limitations on what works in terms of the filter pushdown. It doesn’t work on nested columns, there’s this whole issue with binary vs. string data that I won’t get into, and it’s also not going to work quite right with object stores like S3. S3 doesn’t have random access, so we need to do that initial I/O of getting the whole compressed file over the wire before we can inspect the footers of those column chunks. Now, some would say another big catch with Parquet is the write speed. And to that I say: who cares!
  23. In most cases, particularly when you’re archiving data, this stuff is write once, read forever. Writing Parquet is fast enough, there’s some tuning things you can do to make it a little faster, but really, who cares. Which case are you going to optimize for?
  24. The last catch I want to talk about is Immutability. Parquet is a binary format, and it’s immutable. You can’t just pop open vim and mess around with it like you can with a CSV. So this takes some consideration. How can we get around this? Well, we can start by writing using partitioning. Maybe you want to reimagine your data as a timeseries and write new partitions when you get new data. Another thing that we’ve done at my former home, The Weather Company comma an IBM business, is that we’ve sometimes fronted our Parquet process with a database like Cassandra. So data comes into Cassandra, and when it’s suffienciently historical, it gets spooled out to Parquet. Another thin you can do is just append additional row groups. These compressed files are self-contained row groups, you can just append them!