SlideShare a Scribd company logo

The columnar roadmap: Apache Parquet and Apache Arrow

The Hadoop ecosystem has standardized on columnar formats—Apache Parquet for on-disk storage and Apache Arrow for in-memory. With this trend, deep integration with columnar formats is a key differentiator for big data technologies. Vertical integration from storage to execution greatly improves the latency of accessing data by pushing projections and filters to the storage layer, reducing time spent in IO reading from disk, as well as CPU time spent decompressing and decoding. Standards like Arrow and Parquet make this integration even more valuable as data can now cross system boundaries without incurring costly translation. Cross-system programming using languages such as Spark, Python, or SQL can becomes as fast as native internal performance. In this talk we’ll explain how Parquet is improving at the storage level, with metadata and statistics that will facilitate more optimizations in query engines in the future. We’ll detail how the new vectorized reader from Parquet to Arrow enables much faster reads by removing abstractions as well as several future improvements. We will also discuss how standard Arrow-based APIs pave the way to breaking the silos of big data. One example is Arrow-based universal function libraries that can be written in any language (Java, Scala, C++, Python, R, ...) and will be usable in any big data system (Spark, Impala, Presto, Drill). Another is a standard data access API with projection and predicate push downs, which will greatly simplify data access optimizations across the board. Speaker Julien Le Dem, Principal Engineer, WeWork

1 of 49
Download to read offline
The columnar roadmap:
Apache Parquet and Apache Arrow
Julien Le Dem
VP Apache Parquet, Apache Arrow PMC, Principal Engineer WeWork
@J_
julien.ledem@wework.com
June 2018
Julien Le Dem
@J_
Principal Engineer
Data Platform
• Author of Parquet
• Apache member
• Apache PMCs: Arrow, Kudu, Heron, Incubator, Pig, Parquet, Tez
• Used Hadoop first at Yahoo in 2007
• Formerly Twitter Data platform and Dremio
Julien
Agenda
• Community Driven Standard
• Benefits of Columnar representation
• Vertical integration: Parquet to Arrow
• Arrow based communication
CommunityDriven Standard
An open source standard
• Parquet: Common need for on disk columnar.
• Arrow: Common need for in memory columnar.
• Arrow is building on the success of Parquet.
• Top-level Apache project
• Standard from the start:
– Members from 13+ major open source projects involved
• Benefits:
– Share the effort
– Create an ecosystem
Calcite
Cassandra
Deeplearning4j
Drill
Hadoop
HBase
Ibis
Impala
Kudu
Pandas
Parquet
Phoenix
Spark
Storm
R
Interoperability and Ecosystem
Before With Arrow
• Each system has its own internal memory
format
• 70-80% CPU wasted on serialization and
deserialization
• Functionality duplication and unnecessary
conversions
• All systems utilize the same memory
format
• No overhead for cross-system
communication
• Projects can share functionality (eg:
Parquet-to-Arrow reader)

Recommended

Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Julien Le Dem
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Cloudera, Inc.
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Apache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In PracticeApache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In PracticeDremio Corporation
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLDatabricks
 
How to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analyticsHow to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analyticsJulien Le Dem
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 

More Related Content

What's hot

The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemDatabricks
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark Summit
 
Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Julien Le Dem
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceDatabricks
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheDremio Corporation
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumTathastu.ai
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowJulien Le Dem
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Databricks
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotXiang Fu
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
 
Data Security at Scale through Spark and Parquet Encryption
Data Security at Scale through Spark and Parquet EncryptionData Security at Scale through Spark and Parquet Encryption
Data Security at Scale through Spark and Parquet EncryptionDatabricks
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkBo Yang
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityWes McKinney
 

What's hot (20)

The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format Ecosystem
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 
Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle Service
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
 
Parquet overview
Parquet overviewParquet overview
Parquet overview
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache Pinot
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Data Security at Scale through Spark and Parquet Encryption
Data Security at Scale through Spark and Parquet EncryptionData Security at Scale through Spark and Parquet Encryption
Data Security at Scale through Spark and Parquet Encryption
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 

Similar to The columnar roadmap: Apache Parquet and Apache Arrow

Strata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmapStrata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmapJulien Le Dem
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...Dremio Corporation
 
How Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperabilityHow Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperabilityUwe Korn
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Wes McKinney
 
Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Wes McKinney
 
Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Julien Le Dem
 
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...Julien Le Dem
 
Seattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp APISeattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp APIshareddatamsft
 
Big Data Introduction - Solix empower
Big Data Introduction - Solix empowerBig Data Introduction - Solix empower
Big Data Introduction - Solix empowerDurga Gadiraju
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataWes McKinney
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformYao Yao
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupPaco Nathan
 
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Provectus
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Wes McKinney
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesWes McKinney
 
Data Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet ArrowData Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet ArrowJulien Le Dem
 
Efficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and ArrowEfficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and ArrowDataWorks Summit/Hadoop Summit
 
Transformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs PigTransformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs PigLester Martin
 
Media_Entertainment_Veriticals
Media_Entertainment_VeriticalsMedia_Entertainment_Veriticals
Media_Entertainment_VeriticalsPeyman Mohajerian
 

Similar to The columnar roadmap: Apache Parquet and Apache Arrow (20)

Strata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmapStrata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmap
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
 
How Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperabilityHow Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperability
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
 
Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)
 
Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...
 
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
 
Seattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp APISeattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp API
 
Big Data Introduction - Solix empower
Big Data Introduction - Solix empowerBig Data Introduction - Solix empower
Big Data Introduction - Solix empower
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User Group
 
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
 
Data Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet ArrowData Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet Arrow
 
Efficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and ArrowEfficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and Arrow
 
Transformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs PigTransformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs Pig
 
Media_Entertainment_Veriticals
Media_Entertainment_VeriticalsMedia_Entertainment_Veriticals
Media_Entertainment_Veriticals
 

More from DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Establishing data sharing standards to promote global industry development
Establishing data sharing standards to promote global industry developmentEstablishing data sharing standards to promote global industry development
Establishing data sharing standards to promote global industry developmentThorsten Huelsmann
 
GraphSummit London Feb 2024 - ABK - Neo4j Product Vision and Roadmap.pptx
GraphSummit London Feb 2024 - ABK - Neo4j Product Vision and Roadmap.pptxGraphSummit London Feb 2024 - ABK - Neo4j Product Vision and Roadmap.pptx
GraphSummit London Feb 2024 - ABK - Neo4j Product Vision and Roadmap.pptxNeo4j
 
software-quality-assurance question paper 2023
software-quality-assurance question paper 2023software-quality-assurance question paper 2023
software-quality-assurance question paper 2023RohanMistry15
 
Q4 2023 Quarterly Investor Presentation - FINAL.pdf
Q4 2023 Quarterly Investor Presentation - FINAL.pdfQ4 2023 Quarterly Investor Presentation - FINAL.pdf
Q4 2023 Quarterly Investor Presentation - FINAL.pdfTejal81
 
CloudStack 101: The Best Way to Build Your Private Cloud – Rohit Yadav, VP Ap...
CloudStack 101: The Best Way to Build Your Private Cloud – Rohit Yadav, VP Ap...CloudStack 101: The Best Way to Build Your Private Cloud – Rohit Yadav, VP Ap...
CloudStack 101: The Best Way to Build Your Private Cloud – Rohit Yadav, VP Ap...ShapeBlue
 
Python For Kids - Sách Lập trình cho trẻ em
Python For Kids - Sách Lập trình cho trẻ emPython For Kids - Sách Lập trình cho trẻ em
Python For Kids - Sách Lập trình cho trẻ emNho Vĩnh
 
Artificial Intelligence, Design, and More-than-Human Justice
Artificial Intelligence, Design, and More-than-Human JusticeArtificial Intelligence, Design, and More-than-Human Justice
Artificial Intelligence, Design, and More-than-Human JusticeJosh Gellers
 
AI improves software testing to be more fault tolerant, focused and efficient
AI improves software testing to be more fault tolerant, focused and efficientAI improves software testing to be more fault tolerant, focused and efficient
AI improves software testing to be more fault tolerant, focused and efficientKari Kakkonen
 
Enterprise Architecture As Strategy - Book Review
Enterprise Architecture As Strategy - Book ReviewEnterprise Architecture As Strategy - Book Review
Enterprise Architecture As Strategy - Book ReviewAshraf Fouad
 
Improving IT Investment Decisions and Business Outcomes with Integrated Enter...
Improving IT Investment Decisions and Business Outcomes with Integrated Enter...Improving IT Investment Decisions and Business Outcomes with Integrated Enter...
Improving IT Investment Decisions and Business Outcomes with Integrated Enter...Cprime
 
Pragmatic UI testing with Compose Semantics.pdf
Pragmatic UI testing with Compose Semantics.pdfPragmatic UI testing with Compose Semantics.pdf
Pragmatic UI testing with Compose Semantics.pdfinfogdgmi
 
iOncologi_Pitch Deck_2024 slide show for hostinger
iOncologi_Pitch Deck_2024 slide show for hostingeriOncologi_Pitch Deck_2024 slide show for hostinger
iOncologi_Pitch Deck_2024 slide show for hostingerssuser9354ce
 
PrismCRM-RealEstate-SalesCRM_byCode5Company
PrismCRM-RealEstate-SalesCRM_byCode5CompanyPrismCRM-RealEstate-SalesCRM_byCode5Company
PrismCRM-RealEstate-SalesCRM_byCode5CompanyMustafa Kuğu
 
ASTRAZENECA. Knowledge Graphs Powering a Fast-moving Global Life Sciences Org...
ASTRAZENECA. Knowledge Graphs Powering a Fast-moving Global Life Sciences Org...ASTRAZENECA. Knowledge Graphs Powering a Fast-moving Global Life Sciences Org...
ASTRAZENECA. Knowledge Graphs Powering a Fast-moving Global Life Sciences Org...Neo4j
 
Building Bridges: Merging RPA Processes, UiPath Apps, and Data Service to bu...
Building Bridges:  Merging RPA Processes, UiPath Apps, and Data Service to bu...Building Bridges:  Merging RPA Processes, UiPath Apps, and Data Service to bu...
Building Bridges: Merging RPA Processes, UiPath Apps, and Data Service to bu...DianaGray10
 
Microsoft x 2toLead Webinar Session 1 - How Employee Communication and Connec...
Microsoft x 2toLead Webinar Session 1 - How Employee Communication and Connec...Microsoft x 2toLead Webinar Session 1 - How Employee Communication and Connec...
Microsoft x 2toLead Webinar Session 1 - How Employee Communication and Connec...2toLead Limited
 
Centralized TLS Certificates Management Using Vault PKI + Cert-Manager
Centralized TLS Certificates Management Using Vault PKI + Cert-ManagerCentralized TLS Certificates Management Using Vault PKI + Cert-Manager
Centralized TLS Certificates Management Using Vault PKI + Cert-ManagerSaiLinnThu2
 
Low Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & PitfallsLow Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & PitfallsScyllaDB
 
Learning About GenAI Engineering with AWS PartyRock [AWS User Group Basel - F...
Learning About GenAI Engineering with AWS PartyRock [AWS User Group Basel - F...Learning About GenAI Engineering with AWS PartyRock [AWS User Group Basel - F...
Learning About GenAI Engineering with AWS PartyRock [AWS User Group Basel - F...Chris Bingham
 
Large Language Models and Applications in Healthcare
Large Language Models and Applications in HealthcareLarge Language Models and Applications in Healthcare
Large Language Models and Applications in HealthcareAsma Ben Abacha
 

Recently uploaded (20)

Establishing data sharing standards to promote global industry development
Establishing data sharing standards to promote global industry developmentEstablishing data sharing standards to promote global industry development
Establishing data sharing standards to promote global industry development
 
GraphSummit London Feb 2024 - ABK - Neo4j Product Vision and Roadmap.pptx
GraphSummit London Feb 2024 - ABK - Neo4j Product Vision and Roadmap.pptxGraphSummit London Feb 2024 - ABK - Neo4j Product Vision and Roadmap.pptx
GraphSummit London Feb 2024 - ABK - Neo4j Product Vision and Roadmap.pptx
 
software-quality-assurance question paper 2023
software-quality-assurance question paper 2023software-quality-assurance question paper 2023
software-quality-assurance question paper 2023
 
Q4 2023 Quarterly Investor Presentation - FINAL.pdf
Q4 2023 Quarterly Investor Presentation - FINAL.pdfQ4 2023 Quarterly Investor Presentation - FINAL.pdf
Q4 2023 Quarterly Investor Presentation - FINAL.pdf
 
CloudStack 101: The Best Way to Build Your Private Cloud – Rohit Yadav, VP Ap...
CloudStack 101: The Best Way to Build Your Private Cloud – Rohit Yadav, VP Ap...CloudStack 101: The Best Way to Build Your Private Cloud – Rohit Yadav, VP Ap...
CloudStack 101: The Best Way to Build Your Private Cloud – Rohit Yadav, VP Ap...
 
Python For Kids - Sách Lập trình cho trẻ em
Python For Kids - Sách Lập trình cho trẻ emPython For Kids - Sách Lập trình cho trẻ em
Python For Kids - Sách Lập trình cho trẻ em
 
Artificial Intelligence, Design, and More-than-Human Justice
Artificial Intelligence, Design, and More-than-Human JusticeArtificial Intelligence, Design, and More-than-Human Justice
Artificial Intelligence, Design, and More-than-Human Justice
 
AI improves software testing to be more fault tolerant, focused and efficient
AI improves software testing to be more fault tolerant, focused and efficientAI improves software testing to be more fault tolerant, focused and efficient
AI improves software testing to be more fault tolerant, focused and efficient
 
Enterprise Architecture As Strategy - Book Review
Enterprise Architecture As Strategy - Book ReviewEnterprise Architecture As Strategy - Book Review
Enterprise Architecture As Strategy - Book Review
 
Improving IT Investment Decisions and Business Outcomes with Integrated Enter...
Improving IT Investment Decisions and Business Outcomes with Integrated Enter...Improving IT Investment Decisions and Business Outcomes with Integrated Enter...
Improving IT Investment Decisions and Business Outcomes with Integrated Enter...
 
Pragmatic UI testing with Compose Semantics.pdf
Pragmatic UI testing with Compose Semantics.pdfPragmatic UI testing with Compose Semantics.pdf
Pragmatic UI testing with Compose Semantics.pdf
 
iOncologi_Pitch Deck_2024 slide show for hostinger
iOncologi_Pitch Deck_2024 slide show for hostingeriOncologi_Pitch Deck_2024 slide show for hostinger
iOncologi_Pitch Deck_2024 slide show for hostinger
 
PrismCRM-RealEstate-SalesCRM_byCode5Company
PrismCRM-RealEstate-SalesCRM_byCode5CompanyPrismCRM-RealEstate-SalesCRM_byCode5Company
PrismCRM-RealEstate-SalesCRM_byCode5Company
 
ASTRAZENECA. Knowledge Graphs Powering a Fast-moving Global Life Sciences Org...
ASTRAZENECA. Knowledge Graphs Powering a Fast-moving Global Life Sciences Org...ASTRAZENECA. Knowledge Graphs Powering a Fast-moving Global Life Sciences Org...
ASTRAZENECA. Knowledge Graphs Powering a Fast-moving Global Life Sciences Org...
 
Building Bridges: Merging RPA Processes, UiPath Apps, and Data Service to bu...
Building Bridges:  Merging RPA Processes, UiPath Apps, and Data Service to bu...Building Bridges:  Merging RPA Processes, UiPath Apps, and Data Service to bu...
Building Bridges: Merging RPA Processes, UiPath Apps, and Data Service to bu...
 
Microsoft x 2toLead Webinar Session 1 - How Employee Communication and Connec...
Microsoft x 2toLead Webinar Session 1 - How Employee Communication and Connec...Microsoft x 2toLead Webinar Session 1 - How Employee Communication and Connec...
Microsoft x 2toLead Webinar Session 1 - How Employee Communication and Connec...
 
Centralized TLS Certificates Management Using Vault PKI + Cert-Manager
Centralized TLS Certificates Management Using Vault PKI + Cert-ManagerCentralized TLS Certificates Management Using Vault PKI + Cert-Manager
Centralized TLS Certificates Management Using Vault PKI + Cert-Manager
 
Low Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & PitfallsLow Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & Pitfalls
 
Learning About GenAI Engineering with AWS PartyRock [AWS User Group Basel - F...
Learning About GenAI Engineering with AWS PartyRock [AWS User Group Basel - F...Learning About GenAI Engineering with AWS PartyRock [AWS User Group Basel - F...
Learning About GenAI Engineering with AWS PartyRock [AWS User Group Basel - F...
 
Large Language Models and Applications in Healthcare
Large Language Models and Applications in HealthcareLarge Language Models and Applications in Healthcare
Large Language Models and Applications in Healthcare
 

The columnar roadmap: Apache Parquet and Apache Arrow

  • 1. The columnar roadmap: Apache Parquet and Apache Arrow Julien Le Dem VP Apache Parquet, Apache Arrow PMC, Principal Engineer WeWork @J_ julien.ledem@wework.com June 2018
  • 2. Julien Le Dem @J_ Principal Engineer Data Platform • Author of Parquet • Apache member • Apache PMCs: Arrow, Kudu, Heron, Incubator, Pig, Parquet, Tez • Used Hadoop first at Yahoo in 2007 • Formerly Twitter Data platform and Dremio Julien
  • 3. Agenda • Community Driven Standard • Benefits of Columnar representation • Vertical integration: Parquet to Arrow • Arrow based communication
  • 5. An open source standard • Parquet: Common need for on disk columnar. • Arrow: Common need for in memory columnar. • Arrow is building on the success of Parquet. • Top-level Apache project • Standard from the start: – Members from 13+ major open source projects involved • Benefits: – Share the effort – Create an ecosystem Calcite Cassandra Deeplearning4j Drill Hadoop HBase Ibis Impala Kudu Pandas Parquet Phoenix Spark Storm R
  • 6. Interoperability and Ecosystem Before With Arrow • Each system has its own internal memory format • 70-80% CPU wasted on serialization and deserialization • Functionality duplication and unnecessary conversions • All systems utilize the same memory format • No overhead for cross-system communication • Projects can share functionality (eg: Parquet-to-Arrow reader)
  • 7. Benefits of Columnar representation
  • 8. Columnar layout Logical table representation Row layout Column layout @EmrgencyKittens
  • 9. On Disk and in Memory • Different trade offs – On disk: Storage. • Accessed by multiple queries. • Priority to I/O reduction (but still needs good CPU throughput). • Mostly Streaming access. – In memory: Transient. • Specific to one query execution. • Priority to CPU throughput (but still needs good I/O). • Streaming and Random access.
  • 10. Parquet on disk columnar format
  • 11. Parquet on disk columnar format • Nested data structures • Compact format: – type aware encodings – better compression • Optimized I/O: – Projection push down (column pruning) – Predicate push down (filters based on stats)
  • 12. Parquet nested representation Document DocId Links Name Backward Forward Language Url Code Country Columns: docid links.backward links.forward name.language.code name.language.country name.url Borrowed from the Google Dremel paper https://blog.twitter.com/2013/dremel-made-simple-with-parquet
  • 13. Access only the data you need a b c a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5 a b c a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5 a b c a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5 + = Columnar Statistics Read only the data you need!
  • 15. Arrow in memory columnar format
  • 16. Arrow goals • Well-documented and cross language compatible • Designed to take advantage of modern CPU • Embeddable – in execution engines, storage layers, etc. • Interoperable
  • 17. Arrow in memory columnar format • Nested Data Structures • Maximize CPU throughput – Pipelining – SIMD – cache locality • Scatter/gather I/O
  • 19. Columnar data persons = [{ name: ’Joe', age: 18, phones: [ ‘555-111-1111’, ‘555-222-2222’ ] }, { name: ’Jack', age: 37, phones: [ ‘555-333-3333’ ] }]
  • 20. Record Batch Construction Schema Negotiation Dictionary Batch Record Batch Record Batch Record Batch name (offset) name (data) age (data) phones (list offset) phones (data) data header (describes offsets into data) name (bitmap) age (bitmap) phones (bitmap) phones (offset) { name: ’Joe', age: 18, phones: [ ‘555-111-1111’, ‘555-222-2222’ ] } Each box (vector) is contiguous memory The entire record batch is contiguous on wire
  • 24. Naïve conversion Data dependent branch
  • 25. Peeling away abstraction layers Vectorized read (They have layers)
  • 27. Bit packing case No branch
  • 29. Run length encoding case Block of defined values
  • 30. Run length encoding case Block of null values
  • 32. Example: filter and projection
  • 33. Naive filter and projection implementation
  • 36. Universal high performance UDFs SQL engine Python process User defined function SQL Operator 1 SQL Operator 2 reads reads
  • 37. Arrow RPC/REST API • Generic way to retrieve data in Arrow format • Generic way to serve data in Arrow format • Simplify integrations across the ecosystem • Arrow based pipe
  • 38. RPC: arrow based storage interchange The memory representation is sent over the wire. No serialization overhead. Scanner projection/predicate push down Operator Arrow batches Storage Mem Disk SQL execution Scanner Operator Scanner Operator Storage Mem Disk Storage Mem Disk …
  • 39. RPC: arrow based cache The memory representation is sent over the wire. No serialization overhead. projection push down Operator Arrow-based Cache SQL execution Operator Operator …
  • 40. RPC: Single system execution The memory representation is sent over the wire. No serialization overhead. Scanner Scanner Scanner Parquet files projection push down read only a and b Partial Agg Partial Agg Partial Agg Agg Agg Agg Shuffle Arrow batches Result
  • 41. Results - PySpark Integration: 53x speedup (IBM spark work on SPARK-13534) http://s.apache.org/arrowresult1 - Streaming Arrow Performance 7.75GB/s data movement http://s.apache.org/arrowresult2 - Arrow Parquet C++ Integration 4GB/s reads http://s.apache.org/arrowresult3 - Pandas Integration 9.71GB/s http://s.apache.org/arrowresult4
  • 42. Language Bindings Parquet • Target Languages – Java – CPP – Python & Pandas • Engines integration: – Many! Arrow • Target Languages – Java – CPP, Python – R (underway) – C, Ruby, JavaScript • Engines integration: – Drill – Pandas, R – Spark (underway)
  • 43. Arrow Releases 237 131 76 17 62 22 34 178 195 311 89 138 93 137 October 10, 2016 February 18, 2017 May 5, 2017 May 22, 2017 July 23, 2017 August 14, 2017 September 17, 2017 0.1.0 0.2.0 0.3.0 0.4.0 0.5.0 0.6.0 0.7.0 Days Changes
  • 44. Current activity: • Spark Integration (SPARK-13534) • Pages index in Parquet footer (PARQUET-922) • Arrow REST API (ARROW-1077) • Bindings: – C, Ruby (ARROW-631) –JavaScript (ARROW-541)
  • 45. Get Involved • Join the community – dev@{arrow,parquet}.apache.org – http://{arrow,parquet}.apache.org – Follow @Apache{Parquet,Arrow}
  • 48. We’re growing We’re hiring! • WeWork doubled in size last year and the year before • We expect to double in size this year as well • Broadened scope: WeWork Labs, Powered by We, Flatiron School, Meetup, WeLive, … WeWork in numbers Technology jobs Locations • San Francisco: Platform teams • New York • Tel Aviv • Singapore • WeWork has 283 locations, in 75 cities internationally • Over 253,000 members worldwide • More than 40,000 member companies Contact: julien.ledem@wework.com
  • 49. Questions? Julien Le Dem @J_ julien.ledem@wework.com June 2018