SlideShare a Scribd company logo
1 of 27
Download to read offline
New Directions for Apache Arrow
Wes McKinney
@wesmckinn
September 10, 2021
New York R Conference
2
Apache Arrow
Multi-language toolbox for accelerated
data interchange and in-memory processing
● Founded in 2016 by a group of developers of open source data projects
● Provides a shared foundation for data analytics
● Enables unification of database and data science technology stacks
● Thriving user and developer community
● Adopted by numerous projects and products in the data ecosystem
3
2018: Ursa Labs
Founded with a not-for-profit mission
● Build cross-language, open libraries for data analytics
● Grow the Apache Arrow ecosystem
● Employ a team of full-time developers
Supported by sponsors and partners
4
2020: Ursa Computing
Founded to support enterprise applications of Arrow
● Empower teams to accelerate data workflows
● Work with enterprises to enhance data platforms
● Enable organizations to get more out of their data
A venture-backed startup
5
2021: Voltron Data
Joining forces for an Arrow-native future
● Ursa joined forces with GPU-accelerated computing pioneers
● Together we are creating a unified foundation for the future of
analytical computing
○ Optimized for diverse hardware
○ Compatible across languages
○ Fast and efficient
○ Based on Apache Arrow
● Ursa Labs is now Voltron Labs
6
Apache Arrow
● Specifies a columnar format for how data is stored in memory
● Provides implementations or bindings in numerous languages
7
Arrow Flight
High-performance data transport protocol
● Provides a framework for sending and receiving Arrow data natively
● Built using gRPC, Protocol Buffers, and the Arrow columnar format
● Designed to move large-scale data with excellent speed and efficiency
● Enables seamless interoperability across networks
Arrow Flight SQL is a next-generation standard for data access using SQL
● Adds SQL semantics to Arrow Flight
● Enables ODBC/JDBC-style data access at the speed of Flight
8
Arrow R Package
Exposes an interface to the Arrow C++ library
● Low-level access to the Arrow C++ API
● Higher-level access through a dplyr backend
Install the latest release from CRAN:
install.packages("arrow")
Install the latest nightly development build:
install.packages("arrow", repos =
c("https://arrow-r-nightly.s3.amazonaws.com", getOption("repos")))
Major Releases of Arrow
Arrow R Package Milestones
Arrow R Package Milestones
Arrow R Package Milestones
Arrow R Package Milestones
Arrow R Package Milestones
Arrow R Package Milestones
read_parquet("nyc-taxi/2015/09/data.parquet", as_data_frame = TRUE) %>%
filter(total_amount > 100) %>%
select(tip_amount, total_amount, passenger_count) %>%
mutate(tip_pct = tip_amount / total_amount * 100) %>%
group_by(passenger_count) %>%
summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
filter(n > 500) %>%
arrange(desc(avg_tip_pct))
#> # A tibble: 3 x 3
#> passenger_count avg_tip_pct n
#> <int> <dbl> <int>
#> 1 1 13.6 11714
#> 2 2 11.9 2892
#> 3 3 11.1 709
system.time(...)
#> user system elapsed
#> 4.762 0.806 1.612
Parquet file
~10 million rows
~250 MB
Read into
an R data
frame
DEV
VERSIO
N
read_parquet("nyc-taxi/2015/09/data.parquet", as_data_frame = FALSE) %>%
filter(total_amount > 100) %>%
select(tip_amount, total_amount, passenger_count) %>%
mutate(tip_pct = tip_amount / total_amount * 100) %>%
group_by(passenger_count) %>%
summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
filter(n > 500) %>%
arrange(desc(avg_tip_pct)) %>%
collect()
#> # A tibble: 3 x 3
#> passenger_count avg_tip_pct n
#> <int> <dbl> <int>
#> 1 1 13.6 11714
#> 2 2 11.9 2892
#> 3 3 11.1 709
system.time(...)
#> user system elapsed
#> 3.446 1.012 0.605
Parquet file
~10 million rows
~250 MB
Read into
an Arrow
Table
DEV
VERSIO
N
Return the result
as an R data frame
open_dataset("nyc-taxi", partitioning = c("year", "month")) %>%
filter(total_amount > 100 & year == 2015) %>%
select(tip_amount, total_amount, passenger_count) %>%
mutate(tip_pct = tip_amount / total_amount * 100) %>%
group_by(passenger_count) %>%
summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
filter(n > 5000) %>%
arrange(desc(avg_tip_pct)) %>%
collect()
#> # A tibble: 4 x 3
#> passenger_count avg_tip_pct n
#> <int> <dbl> <int>
#> 1 5 16.8 5806
#> 2 1 13.5 143087
#> 3 2 12.6 34418
#> 4 3 11.9 8922
125 Parquet files
~2 billion rows
~40 GB
system.time(...)
#> user system elapsed
#> 3.319 0.247 1.111
DEV
VERSIO
N
open_dataset("nyc-taxi", partitioning = c("year", "month")) %>%
filter(total_amount > 100 & year == 2015) %>%
select(tip_amount, total_amount, passenger_count) %>%
mutate(tip_pct = tip_amount / total_amount * 100) %>%
to_duckdb() %>%
group_by(passenger_count) %>%
summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
filter(n > 5000) %>%
arrange(desc(avg_tip_pct)) %>%
collect()
● Creates a virtual DuckDB table backed by an Arrow data object
○ No data is loaded until collect() is called
○ Returns a dbplyr object for use in dplyr pipelines
DEV
VERSIO
N
20
Coming Soon
Upcoming Arrow releases will bring additional
query execution capabilities to the Arrow C++ engine and R package
● Joins
● Window functions
● More scalar and aggregate functions
● Performance and efficiency improvements
Ibis
● The Arrow C++ engine currently lacks a high-level Python API
● Ibis can fill this gap
taxi 
.filter(taxi.total_amount > 100) 
.projection(['tip_amount', 'total_amount', 'passenger_count']) 
.mutate(tip_pct = taxi.tip_amount / taxi.total_amount * 100) 
.group_by('passenger_count') 
.aggregate(n=lambda x: x.count(), avg_tip=lambda x: x.tip_pct.mean()) 
.filter(lambda x: x.n > 500) 
.sort_by(ibis.desc('avg_tip')) 
.execute()
Engines and Interfaces
There are multiple efforts underway to develop Arrow-native query engines
● Arrow C++ engine
● Arrow DataFusion
● DuckDB
● …
Users want fluent interfaces to these engines from their preferred languages
● Python
● R
● JavaScript
● …
Users also want to run SQL queries on these engines
23
Past
24
Present
25
Future
26
Compute Intermediate Representation (IR)
The Arrow community has launched a collaboration to establish a Compute IR
● A standard serialized representation of compute expressions
● A common layer connecting APIs (front ends) and engines (back ends)
○ Is produced by APIs
○ Is consumed by engines
● Follow this initiative at substrait.io
Thank you
Wes McKinney
@wesmckinn
arrow.apache.org
voltrondata.com
We’re hiring!
voltrondata.com/careers

More Related Content

What's hot

Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodRadical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodDatabricks
 
Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Julien Le Dem
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkKazuaki Ishizaki
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta LakeDatabricks
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQLDatabricks
 
YugabyteDBを使ってみよう(NewSQL/分散SQLデータベースよろず勉強会 #1 発表資料)
YugabyteDBを使ってみよう(NewSQL/分散SQLデータベースよろず勉強会 #1 発表資料)YugabyteDBを使ってみよう(NewSQL/分散SQLデータベースよろず勉強会 #1 発表資料)
YugabyteDBを使ってみよう(NewSQL/分散SQLデータベースよろず勉強会 #1 発表資料)NTT DATA Technology & Innovation
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsDatabricks
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
 
Scalar DB: Universal Transaction Manager
Scalar DB: Universal Transaction ManagerScalar DB: Universal Transaction Manager
Scalar DB: Universal Transaction ManagerScalar, Inc.
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLDatabricks
 
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...Spark Summit
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
PostgreSQLのfull_page_writesについて(第24回PostgreSQLアンカンファレンス@オンライン 発表資料)
PostgreSQLのfull_page_writesについて(第24回PostgreSQLアンカンファレンス@オンライン 発表資料)PostgreSQLのfull_page_writesについて(第24回PostgreSQLアンカンファレンス@オンライン 発表資料)
PostgreSQLのfull_page_writesについて(第24回PostgreSQLアンカンファレンス@オンライン 発表資料)NTT DATA Technology & Innovation
 
Improving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMImproving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMHolden Karau
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDatabricks
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...Dremio Corporation
 

What's hot (20)

Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodRadical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
 
Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache Spark
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta Lake
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
 
YugabyteDBを使ってみよう(NewSQL/分散SQLデータベースよろず勉強会 #1 発表資料)
YugabyteDBを使ってみよう(NewSQL/分散SQLデータベースよろず勉強会 #1 発表資料)YugabyteDBを使ってみよう(NewSQL/分散SQLデータベースよろず勉強会 #1 発表資料)
YugabyteDBを使ってみよう(NewSQL/分散SQLデータベースよろず勉強会 #1 発表資料)
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Scalar DB: Universal Transaction Manager
Scalar DB: Universal Transaction ManagerScalar DB: Universal Transaction Manager
Scalar DB: Universal Transaction Manager
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQL
 
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
PostgreSQLのfull_page_writesについて(第24回PostgreSQLアンカンファレンス@オンライン 発表資料)
PostgreSQLのfull_page_writesについて(第24回PostgreSQLアンカンファレンス@オンライン 発表資料)PostgreSQLのfull_page_writesについて(第24回PostgreSQLアンカンファレンス@オンライン 発表資料)
PostgreSQLのfull_page_writesについて(第24回PostgreSQLアンカンファレンス@オンライン 発表資料)
 
Improving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMImproving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVM
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
 
Apache Spark の紹介(前半:Sparkのキホン)
Apache Spark の紹介(前半:Sparkのキホン)Apache Spark の紹介(前半:Sparkのキホン)
Apache Spark の紹介(前半:Sparkのキホン)
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
 

Similar to New Directions for Apache Arrow

Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming JobsDatabricks
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's comingDatabricks
 
Deep_dive_on_Amazon_Neptune_DAT361.pdf
Deep_dive_on_Amazon_Neptune_DAT361.pdfDeep_dive_on_Amazon_Neptune_DAT361.pdf
Deep_dive_on_Amazon_Neptune_DAT361.pdfShaikAsif83
 
Three Functional Programming Technologies for Big Data
Three Functional Programming Technologies for Big DataThree Functional Programming Technologies for Big Data
Three Functional Programming Technologies for Big DataDynamical Software, Inc.
 
Sparkling Water Webinar October 29th, 2014
Sparkling Water Webinar October 29th, 2014Sparkling Water Webinar October 29th, 2014
Sparkling Water Webinar October 29th, 2014Sri Ambati
 
Data visualization in python/Django
Data visualization in python/DjangoData visualization in python/Django
Data visualization in python/Djangokenluck2001
 
Predictive Analytics with Airflow and PySpark
Predictive Analytics with Airflow and PySparkPredictive Analytics with Airflow and PySpark
Predictive Analytics with Airflow and PySparkRussell Jurney
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackWes McKinney
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityDatabricks
 
Bridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and VerticaBridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and VerticaSteve Watt
 
Streaming Visualization
Streaming VisualizationStreaming Visualization
Streaming VisualizationGuido Schmutz
 
Big data analysis using spark r published
Big data analysis using spark r publishedBig data analysis using spark r published
Big data analysis using spark r publishedDipendra Kusi
 
Pumps, Compressors and Turbine Fault Frequency Analysis
Pumps, Compressors and Turbine Fault Frequency AnalysisPumps, Compressors and Turbine Fault Frequency Analysis
Pumps, Compressors and Turbine Fault Frequency AnalysisUniversity of Illinois,Chicago
 
Pumps, Compressors and Turbine Fault Frequency Analysis
Pumps, Compressors and Turbine Fault Frequency AnalysisPumps, Compressors and Turbine Fault Frequency Analysis
Pumps, Compressors and Turbine Fault Frequency AnalysisUniversity of Illinois,Chicago
 
WSO2Con ASIA 2016: WSO2 Analytics Platform: The One Stop Shop for All Your Da...
WSO2Con ASIA 2016: WSO2 Analytics Platform: The One Stop Shop for All Your Da...WSO2Con ASIA 2016: WSO2 Analytics Platform: The One Stop Shop for All Your Da...
WSO2Con ASIA 2016: WSO2 Analytics Platform: The One Stop Shop for All Your Da...WSO2
 
Awesome Banking API's
Awesome Banking API'sAwesome Banking API's
Awesome Banking API'sNatalino Busa
 
Sparklyr: Big Data enabler for R users
Sparklyr: Big Data enabler for R usersSparklyr: Big Data enabler for R users
Sparklyr: Big Data enabler for R usersICTeam S.p.A.
 

Similar to New Directions for Apache Arrow (20)

Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's coming
 
Deep_dive_on_Amazon_Neptune_DAT361.pdf
Deep_dive_on_Amazon_Neptune_DAT361.pdfDeep_dive_on_Amazon_Neptune_DAT361.pdf
Deep_dive_on_Amazon_Neptune_DAT361.pdf
 
Three Functional Programming Technologies for Big Data
Three Functional Programming Technologies for Big DataThree Functional Programming Technologies for Big Data
Three Functional Programming Technologies for Big Data
 
Sparkling Water Webinar October 29th, 2014
Sparkling Water Webinar October 29th, 2014Sparkling Water Webinar October 29th, 2014
Sparkling Water Webinar October 29th, 2014
 
Data visualization in python/Django
Data visualization in python/DjangoData visualization in python/Django
Data visualization in python/Django
 
MLflow with R
MLflow with RMLflow with R
MLflow with R
 
Predictive Analytics with Airflow and PySpark
Predictive Analytics with Airflow and PySparkPredictive Analytics with Airflow and PySpark
Predictive Analytics with Airflow and PySpark
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
 
So you think you can stream.pptx
So you think you can stream.pptxSo you think you can stream.pptx
So you think you can stream.pptx
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark community
 
Bridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and VerticaBridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and Vertica
 
Streaming Visualization
Streaming VisualizationStreaming Visualization
Streaming Visualization
 
Big data analysis using spark r published
Big data analysis using spark r publishedBig data analysis using spark r published
Big data analysis using spark r published
 
Pumps, Compressors and Turbine Fault Frequency Analysis
Pumps, Compressors and Turbine Fault Frequency AnalysisPumps, Compressors and Turbine Fault Frequency Analysis
Pumps, Compressors and Turbine Fault Frequency Analysis
 
Pumps, Compressors and Turbine Fault Frequency Analysis
Pumps, Compressors and Turbine Fault Frequency AnalysisPumps, Compressors and Turbine Fault Frequency Analysis
Pumps, Compressors and Turbine Fault Frequency Analysis
 
Data Science on Google Cloud Platform
Data Science on Google Cloud PlatformData Science on Google Cloud Platform
Data Science on Google Cloud Platform
 
WSO2Con ASIA 2016: WSO2 Analytics Platform: The One Stop Shop for All Your Da...
WSO2Con ASIA 2016: WSO2 Analytics Platform: The One Stop Shop for All Your Da...WSO2Con ASIA 2016: WSO2 Analytics Platform: The One Stop Shop for All Your Da...
WSO2Con ASIA 2016: WSO2 Analytics Platform: The One Stop Shop for All Your Da...
 
Awesome Banking API's
Awesome Banking API'sAwesome Banking API's
Awesome Banking API's
 
Sparklyr: Big Data enabler for R users
Sparklyr: Big Data enabler for R usersSparklyr: Big Data enabler for R users
Sparklyr: Big Data enabler for R users
 

More from Wes McKinney

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportWes McKinney
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesWes McKinney
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Wes McKinney
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future Wes McKinney
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackWes McKinney
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionWes McKinney
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Wes McKinney
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"Wes McKinney
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Wes McKinney
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataWes McKinney
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataWes McKinney
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data ScienceWes McKinney
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Wes McKinney
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningWes McKinney
 
Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceWes McKinney
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityWes McKinney
 
Python Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FutureWes McKinney
 
PyCon APAC 2016 Keynote
PyCon APAC 2016 KeynotePyCon APAC 2016 Keynote
PyCon APAC 2016 KeynoteWes McKinney
 
Apache Arrow and Python: The latest
Apache Arrow and Python: The latestApache Arrow and Python: The latest
Apache Arrow and Python: The latestWes McKinney
 

More from Wes McKinney (20)

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data Science
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
 
Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data Science
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and Interoperability
 
Python Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the Future
 
PyCon APAC 2016 Keynote
PyCon APAC 2016 KeynotePyCon APAC 2016 Keynote
PyCon APAC 2016 Keynote
 
Apache Arrow and Python: The latest
Apache Arrow and Python: The latestApache Arrow and Python: The latest
Apache Arrow and Python: The latest
 

Recently uploaded

Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 

Recently uploaded (20)

Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 

New Directions for Apache Arrow

  • 1. New Directions for Apache Arrow Wes McKinney @wesmckinn September 10, 2021 New York R Conference
  • 2. 2 Apache Arrow Multi-language toolbox for accelerated data interchange and in-memory processing ● Founded in 2016 by a group of developers of open source data projects ● Provides a shared foundation for data analytics ● Enables unification of database and data science technology stacks ● Thriving user and developer community ● Adopted by numerous projects and products in the data ecosystem
  • 3. 3 2018: Ursa Labs Founded with a not-for-profit mission ● Build cross-language, open libraries for data analytics ● Grow the Apache Arrow ecosystem ● Employ a team of full-time developers Supported by sponsors and partners
  • 4. 4 2020: Ursa Computing Founded to support enterprise applications of Arrow ● Empower teams to accelerate data workflows ● Work with enterprises to enhance data platforms ● Enable organizations to get more out of their data A venture-backed startup
  • 5. 5 2021: Voltron Data Joining forces for an Arrow-native future ● Ursa joined forces with GPU-accelerated computing pioneers ● Together we are creating a unified foundation for the future of analytical computing ○ Optimized for diverse hardware ○ Compatible across languages ○ Fast and efficient ○ Based on Apache Arrow ● Ursa Labs is now Voltron Labs
  • 6. 6 Apache Arrow ● Specifies a columnar format for how data is stored in memory ● Provides implementations or bindings in numerous languages
  • 7. 7 Arrow Flight High-performance data transport protocol ● Provides a framework for sending and receiving Arrow data natively ● Built using gRPC, Protocol Buffers, and the Arrow columnar format ● Designed to move large-scale data with excellent speed and efficiency ● Enables seamless interoperability across networks Arrow Flight SQL is a next-generation standard for data access using SQL ● Adds SQL semantics to Arrow Flight ● Enables ODBC/JDBC-style data access at the speed of Flight
  • 8. 8 Arrow R Package Exposes an interface to the Arrow C++ library ● Low-level access to the Arrow C++ API ● Higher-level access through a dplyr backend Install the latest release from CRAN: install.packages("arrow") Install the latest nightly development build: install.packages("arrow", repos = c("https://arrow-r-nightly.s3.amazonaws.com", getOption("repos")))
  • 10. Arrow R Package Milestones
  • 11. Arrow R Package Milestones
  • 12. Arrow R Package Milestones
  • 13. Arrow R Package Milestones
  • 14. Arrow R Package Milestones
  • 15. Arrow R Package Milestones
  • 16. read_parquet("nyc-taxi/2015/09/data.parquet", as_data_frame = TRUE) %>% filter(total_amount > 100) %>% select(tip_amount, total_amount, passenger_count) %>% mutate(tip_pct = tip_amount / total_amount * 100) %>% group_by(passenger_count) %>% summarize(avg_tip_pct = mean(tip_pct), n = n()) %>% filter(n > 500) %>% arrange(desc(avg_tip_pct)) #> # A tibble: 3 x 3 #> passenger_count avg_tip_pct n #> <int> <dbl> <int> #> 1 1 13.6 11714 #> 2 2 11.9 2892 #> 3 3 11.1 709 system.time(...) #> user system elapsed #> 4.762 0.806 1.612 Parquet file ~10 million rows ~250 MB Read into an R data frame DEV VERSIO N
  • 17. read_parquet("nyc-taxi/2015/09/data.parquet", as_data_frame = FALSE) %>% filter(total_amount > 100) %>% select(tip_amount, total_amount, passenger_count) %>% mutate(tip_pct = tip_amount / total_amount * 100) %>% group_by(passenger_count) %>% summarize(avg_tip_pct = mean(tip_pct), n = n()) %>% filter(n > 500) %>% arrange(desc(avg_tip_pct)) %>% collect() #> # A tibble: 3 x 3 #> passenger_count avg_tip_pct n #> <int> <dbl> <int> #> 1 1 13.6 11714 #> 2 2 11.9 2892 #> 3 3 11.1 709 system.time(...) #> user system elapsed #> 3.446 1.012 0.605 Parquet file ~10 million rows ~250 MB Read into an Arrow Table DEV VERSIO N Return the result as an R data frame
  • 18. open_dataset("nyc-taxi", partitioning = c("year", "month")) %>% filter(total_amount > 100 & year == 2015) %>% select(tip_amount, total_amount, passenger_count) %>% mutate(tip_pct = tip_amount / total_amount * 100) %>% group_by(passenger_count) %>% summarize(avg_tip_pct = mean(tip_pct), n = n()) %>% filter(n > 5000) %>% arrange(desc(avg_tip_pct)) %>% collect() #> # A tibble: 4 x 3 #> passenger_count avg_tip_pct n #> <int> <dbl> <int> #> 1 5 16.8 5806 #> 2 1 13.5 143087 #> 3 2 12.6 34418 #> 4 3 11.9 8922 125 Parquet files ~2 billion rows ~40 GB system.time(...) #> user system elapsed #> 3.319 0.247 1.111 DEV VERSIO N
  • 19. open_dataset("nyc-taxi", partitioning = c("year", "month")) %>% filter(total_amount > 100 & year == 2015) %>% select(tip_amount, total_amount, passenger_count) %>% mutate(tip_pct = tip_amount / total_amount * 100) %>% to_duckdb() %>% group_by(passenger_count) %>% summarize(avg_tip_pct = mean(tip_pct), n = n()) %>% filter(n > 5000) %>% arrange(desc(avg_tip_pct)) %>% collect() ● Creates a virtual DuckDB table backed by an Arrow data object ○ No data is loaded until collect() is called ○ Returns a dbplyr object for use in dplyr pipelines DEV VERSIO N
  • 20. 20 Coming Soon Upcoming Arrow releases will bring additional query execution capabilities to the Arrow C++ engine and R package ● Joins ● Window functions ● More scalar and aggregate functions ● Performance and efficiency improvements
  • 21. Ibis ● The Arrow C++ engine currently lacks a high-level Python API ● Ibis can fill this gap taxi .filter(taxi.total_amount > 100) .projection(['tip_amount', 'total_amount', 'passenger_count']) .mutate(tip_pct = taxi.tip_amount / taxi.total_amount * 100) .group_by('passenger_count') .aggregate(n=lambda x: x.count(), avg_tip=lambda x: x.tip_pct.mean()) .filter(lambda x: x.n > 500) .sort_by(ibis.desc('avg_tip')) .execute()
  • 22. Engines and Interfaces There are multiple efforts underway to develop Arrow-native query engines ● Arrow C++ engine ● Arrow DataFusion ● DuckDB ● … Users want fluent interfaces to these engines from their preferred languages ● Python ● R ● JavaScript ● … Users also want to run SQL queries on these engines
  • 26. 26 Compute Intermediate Representation (IR) The Arrow community has launched a collaboration to establish a Compute IR ● A standard serialized representation of compute expressions ● A common layer connecting APIs (front ends) and engines (back ends) ○ Is produced by APIs ○ Is consumed by engines ● Follow this initiative at substrait.io