Arrow in Flight
New Developments in
Data Connectivity
David Li / Voltron Data
Arrow Is a Memory Format
timestamp price
1667591468 61.92
1667591577 28.08
timestamp
1667591468
1667591577
price
61.92
28.08
Adjacent values are
same type = faster
processing
Don’t care about a
column? Skip it
entirely
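
The two layouts above can be sketched in a few lines. This is only an illustration using Python's standard `array` module, not Arrow itself; the `rows` and `columns` names are made up for the example, and the typed arrays stand in for Arrow's contiguous, single-type buffers.

```python
from array import array

# Row-oriented: each record is stored together, so types are interleaved.
rows = [(1667591468, 61.92), (1667591577, 28.08)]

# Column-oriented: one contiguous, single-type buffer per column.
columns = {
    "timestamp": array("q", [r[0] for r in rows]),  # 64-bit signed integers
    "price": array("d", [r[1] for r in rows]),      # 64-bit floats
}

# Summing prices reads only the price buffer; the timestamp column is skipped.
total_price = sum(columns["price"])
```

Because every value in `columns["price"]` has the same type and sits next to its neighbours, a loop over it never has to inspect or step over timestamps.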
Arrow File
Arrow Is a Memory Format (and more)
& more specifications
RecordBatch
1667591468
1667591577
61.92
28.08
Schema
Footer
Same layout as in
memory - can be
memory-mapped
Footer for random
access to batches
Optional per-buffer
compression
RecordBatch
1667591468
1667591577
61.92
28.08
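
The footer idea can be illustrated with a toy file layout. This sketch is not the Arrow file format (which has its own metadata encoding); `write_toy_file` and `read_batch` are hypothetical helpers that only show how a trailing index lets a reader memory-map the file and jump straight to one batch.

```python
import mmap
import os
import struct
import tempfile
from array import array

def write_toy_file(path, batches):
    """Write each batch as raw float64 bytes, then a footer of (offset, length) pairs."""
    index = []
    with open(path, "wb") as f:
        for values in batches:
            payload = array("d", values).tobytes()
            index.append((f.tell(), len(payload)))
            f.write(payload)
        for offset, size in index:
            f.write(struct.pack("<QQ", offset, size))
        # Last 8 bytes: number of batches, so a reader can locate the footer.
        f.write(struct.pack("<Q", len(index)))

def read_batch(path, i):
    """Memory-map the file and decode only batch i, using the footer index."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        end = len(mm)
        (count,) = struct.unpack("<Q", mm[end - 8:end])
        entry = end - 8 - 16 * count + 16 * i
        offset, size = struct.unpack("<QQ", mm[entry:entry + 16])
        out = array("d")
        out.frombytes(mm[offset:offset + size])
        return list(out)

path = os.path.join(tempfile.mkdtemp(), "toy.bin")
write_toy_file(path, [[61.92, 28.08], [1.5, 2.5, 3.5]])
second = read_batch(path, 1)
```

Reading the second batch never decodes the first: the reader looks up its offset in the footer and slices the mapped file at that position.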
Arrow Is a Set of Libraries
Libraries in multiple languages implement the Arrow specifications and higher-level features
Arrow Flight RPC
Arrow Dataset
parquet-cpp
Acero
DataFusion
Arrow Filesystems
arrow-jdbc
Arrow Flight SQL
ADBC
Gandiva
Skyhook
Arrow Tensors
Ballista
Flight SQL JDBC
nanoarrow
PyArrow
An Incomplete History
of Apache Arrow
2016, February
Apache Arrow
is announced
https://www.slideshare.net/wesm/practical-medium-data-analytics-with-python
https://wesmckinney.com/blog/pandas-and-apache-arrow/
https://www.dremio.com/press-releases/introducing-apache-arrow-columnar-in-memory-analytics/
https://blog.cloudera.com/introducing-apache-arrow-a-fast-interoperable-in-memory-columnar-data-structure-
2016, September
Arrow support is merged
into parquet-cpp
Apache Arrow
is announced
https://wesmckinney.com/blog/pandas-and-apache-arrow/
https://github.com/apache/parquet-cpp/pull/158
C++ Python R
Parquet
CSV ORC
C++ Python R
Parquet
CSV ORC
Arrow
⬆️ Without Arrow
With Arrow ⬇️
2017, July
Spark adds Pandas
UDFs via Arrow
Arrow support merged
into parquet-cpp
https://www.databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
https://arrow.apache.org/blog/2017/07/26/spark-arrow/
2018, March
Rust, Go libraries are
contributed to Arrow
Spark adds Pandas
UDFs via Arrow
https://arrow.apache.org/blog/2018/03/22/go-code-donation/
https://github.com/apache/arrow/pull/1804
C++
C♯
Java
JavaScript
Go
C (GLib)
C (nanoarrow)
MATLAB
Python
R
Ruby
Rust
Julia
Native
Bindings
Arrow
Implementations
(circa 2022)
2018, October
NVIDIA announces
RAPIDS
Rust, Go libraries are
contributed to Arrow
https://nvidianews.nvidia.com/news/nvidia-introduces-rapids-open-source-gpu-acceleration-platform-for-large-scale-data-analytics-and-machine-learning
https://ursalabs.org/blog/ursa-labs-partner-nvidia/
2019, June
Development starts on
Arrow Dataset
NVIDIA announces
RAPIDS
https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/
Arrow Dataset
Parquet CSV ORC
Arrow Filesystems
local S3 GCS
Python R
dplyr
Pandas
R code sample (with timing?)
goes here
Learn More ⏩ https://arrow-user2022.netlify.app
https://arrow.apache.org/docs/r/articles/dataset.html
2019, October
Arrow Flight RPC
is introduced
Development starts on
Arrow Dataset
https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/
Client Server
Server
Server
Server
Server
Distributed Fetch with Flight
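
A rough sketch of the distributed fetch pattern: the client first asks where the pieces of a result live, then pulls each piece from its server in parallel. Everything here is a stand-in written for illustration. `SERVERS`, `get_flight_info`, and `do_get` are hypothetical in-process functions that loosely mirror the roles of Flight's GetFlightInfo and DoGet calls; no network or Arrow library is involved.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for Flight servers: location -> {ticket: rows}.
SERVERS = {
    "server-a": {"t0": [1, 2, 3]},
    "server-b": {"t1": [4, 5]},
    "server-c": {"t2": [6]},
}

def get_flight_info(query):
    """Planning step: return the (location, ticket) endpoints for a query."""
    return [("server-a", "t0"), ("server-b", "t1"), ("server-c", "t2")]

def do_get(endpoint):
    """Fetch step: retrieve one partition of the result from one server."""
    location, ticket = endpoint
    return SERVERS[location][ticket]

def fetch_all(query):
    endpoints = get_flight_info(query)
    with ThreadPoolExecutor(max_workers=len(endpoints)) as pool:
        # pool.map returns results in endpoint order, even if fetches overlap.
        parts = list(pool.map(do_get, endpoints))
    return [row for part in parts for row in part]

result = fetch_all("SELECT * FROM trades")
```

The point of the split is that the server answering the planning request does not have to relay the data itself; each endpoint can be served by a different machine.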
2020, July
Arrow 1.0.0 is released
Arrow Flight RPC
is introduced
Data courtesy pypistats.org; charts originally by Stephanie Hazlitt
https://arrow.apache.org/blog/2020/07/24/1.0.0-release/
2021, July
Streamlit integrates
Arrow for 10x speed
boost
Arrow 1.0.0 is released
https://blog.streamlit.io/all-in-on-apache-arrow/
2021, December
DuckDB integrates the
Arrow C Data Interface
Streamlit integrates
Arrow for 10x speed
boost
https://arrow.apache.org/blog/2021/12/03/arrow-duckdb/
2022, October
Arrow 10.0.0 is released
DuckDB integrates the
Arrow C Data Interface
Data courtesy pypistats.org; charts originally by Stephanie Hazlitt
https://arrow.apache.org/blog/2022/10/31/10.0.0-release/
Arrow Today
DuckDB
Pandas Parquet
Spark R
cudf
DuckDB
Pandas Parquet
Spark R
cudf
Arrow Is an Ecosystem
…and more!
Apache Arrow,
Taking Off
geoarrow/geoparquet
https://github.com/geoarrow/geoarrow
https://observablehq.com/@kylebarron/geoarrow-and-geoparquet-in-deck-gl
https://dewey.dunnington.ca/post/2022/building-bridges-arrow-parquet-and-geospatial-computing/
“GeoArrow makes it easier to get
the best rendering performance in
deck.gl because it removes the need
for most CPU-based pre-processing
before passing the data to the GPU
for rendering.”
Arrow Flight SQL
● Client/server
database protocol
(not an SQL dialect!)
● Takes advantage of
Arrow Flight
● Implement one
protocol, support all
clients
https://arrow.apache.org/blog/2022/11/01/arrow-flight-sql-jdbc/
Arrow Flight SQL
Arrow-native database
(no spoilers) JDBC ODBC
Arrow data all
the way—no
conversions
Clients use API
of choice
Arrow Flight SQL
● JDBC, ODBC drivers
available
https://arrow.apache.org/blog/2022/11/01/arrow-flight-sql-jdbc/
ADBC: Arrow Database Connectivity
● Flight SQL helps
servers
● ADBC solves the
problem for clients
● One API, multiple
databases
ADBC
Arrow-native application
Flight SQL Postgres DuckDB
Clients get
Arrow data
ADBC driver
converts if
necessary
ADBC: Arrow Database Connectivity
ADBC API
Arrow-native application
ADBC Driver
Database
SQL
DB-specific
protocol
DB-specific
protocol
Arrow
Application doesn’t
worry about what
happens here
C (+Go, Java) APIs
for portability
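
To make the "driver converts if necessary" idea concrete, here is a minimal sketch using only Python's standard `sqlite3` module. It is not the ADBC API: `fetch_columns` is a hypothetical helper playing the role of a driver that wraps a row-oriented database and hands the application column-oriented results, with plain Python lists standing in for Arrow arrays.

```python
import sqlite3

def fetch_columns(conn, sql):
    """Run a query on a row-oriented DB-API connection and return columns."""
    cur = conn.cursor()
    cur.execute(sql)
    names = [d[0] for d in cur.description]
    rows = cur.fetchall()
    # Pivot row tuples into one list per column; the application never sees rows.
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (timestamp INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO trades VALUES (?, ?)",
    [(1667591468, 61.92), (1667591577, 28.08)],
)
table = fetch_columns(conn, "SELECT timestamp, price FROM trades ORDER BY timestamp")
```

The application code only depends on `fetch_columns`; swapping the database would mean swapping the connection and the conversion inside the helper, not the caller.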
ADBC
Learn More ⏩ https://github.com/apache/arrow-adbc
Query Engines
● Direct computation
on Arrow(-like) data
● All interoperable
● Arrow-native ‘core’
for bigger projects
Acero
Spark/xgboost
● xgboost accepts
Arrow data as input
● Intel is plugging
Arrow, Velox into
Spark
End result:
● Lower overheads for
ML training
https://medium.com/intel-analytics-software/optimizing-the-end-to-end-training-pipeline-on-apache-spark-clusters-80261d6a7b8c
https://medium.com/intel-analytics-software/accelerate-spark-sql-queries-with-gluten-9000b65d1b4e
lance
● New toolchain for CV
● Everything is Arrow
● File format: Arrow-based
● Integrates with
DuckDB: via Arrow
https://eto-ai.github.io/lance/notebooks/03_exploratory_data_analysis.html
Apache Arrow, in Flight
Arrow as glue between
systems
Arrow as alternative
protocol
Arrow as an internal
detail
Arrow as the foundation
of a system
Arrow as the primary
interface
Arrow moving up the
stack
Questions?
Learn More/Get Involved ⏩ https://arrow.apache.org/community/

OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David Li - Voltron Data.pdf