OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David Li - Voltron Data.pdf

Arrow in Flight
New Developments in
Data Connectivity
David Li / Voltron Data

Arrow Is a Memory Format
timestamp price
1667591468 61.92
1667591577 28.08
timestamp
1667591468
1667591577
price
61.92
28.08
Adjacent values are same type = faster processing
Don’t care about a column? Skip it entirely
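
A minimal sketch of the two layouts on this slide, using only the Python standard library rather than an Arrow library. The `columns` dictionary and the `mean_price` helper are illustrative names, not part of any Arrow API; the point is only that each column lives in its own typed, contiguous buffer and can be read without touching the others.

```python
from array import array

# Row-oriented: each record keeps its fields together.
rows = [(1667591468, 61.92), (1667591577, 28.08)]

# Column-oriented (the layout on the slide): one typed buffer per column.
columns = {
    "timestamp": array("q", [r[0] for r in rows]),  # 64-bit signed integers
    "price": array("d", [r[1] for r in rows]),      # 64-bit floats
}

def mean_price(cols):
    # Reads only the "price" buffer; "timestamp" is never touched.
    prices = cols["price"]
    return sum(prices) / len(prices)

avg = mean_price(columns)
```

Because every value in a buffer has the same type and width, a loop over `columns["price"]` walks one contiguous block of memory, which is what makes the "faster processing" claim plausible.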

Arrow Is a Memory Format (and more)
Arrow File
& more specifications
RecordBatch
1667591468
1667591577
61.92
28.08
Schema
Footer
Same layout as in memory - can be memory-mapped
Footer for random access to batches
Optional per-buffer compression
RecordBatch
1667591468
1667591577
61.92
28.08
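
The footer idea can be illustrated with a toy container. This is a sketch under loud assumptions: it is NOT the Arrow file format, just batches written back to back followed by an index of (offset, length) pairs, and `write_toy_file` and `read_batch` are hypothetical helpers. It shows why a footer lets a reader jump straight to any batch.

```python
import io
import struct
from array import array

def write_toy_file(batches):
    """Write batches back to back, then a footer of (offset, length) pairs."""
    buf = io.BytesIO()
    index = []
    for batch in batches:
        payload = array("d", batch).tobytes()
        index.append((buf.tell(), len(payload)))
        buf.write(payload)
    footer = b"".join(struct.pack("<QQ", off, size) for off, size in index)
    buf.write(footer)
    buf.write(struct.pack("<I", len(footer)))  # footer length goes last
    return buf.getvalue()

def read_batch(data, i):
    """Read batch i without scanning the batches before it."""
    (footer_len,) = struct.unpack_from("<I", data, len(data) - 4)
    footer_start = len(data) - 4 - footer_len
    off, size = struct.unpack_from("<QQ", data, footer_start + 16 * i)
    values = array("d")
    values.frombytes(data[off:off + size])
    return list(values)

data = write_toy_file([[61.92, 28.08], [10.5, 11.5, 12.5]])
second = read_batch(data, 1)
```

The reader starts from the end of the file, finds the footer, and seeks directly to the requested batch. The payload bytes are the same as the in-memory buffer, which is the property that makes memory-mapping possible.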

Arrow Is a Set of Libraries
Libraries in multiple languages implement the Arrow specifications and higher-level features
Arrow Flight RPC
Arrow Dataset
parquet-cpp
Acero
DataFusion
Arrow Filesystems
arrow-jdbc
Arrow Flight SQL
ADBC
Gandiva
Skyhook
Arrow Tensors
Ballista
Flight SQL JDBC
nanoarrow
PyArrow

An Incomplete History
of Apache Arrow

2016, February
Apache Arrow
is announced
https://www.slideshare.net/wesm/practical-medium-data-analytics-with-python
https://wesmckinney.com/blog/pandas-and-apache-arrow/
https://www.dremio.com/press-releases/introducing-apache-arrow-columnar-in-memory-analytics/
https://blog.cloudera.com/introducing-apache-arrow-a-fast-interoperable-in-memory-columnar-data-structure-

2016, September
Arrow support is merged
into parquet-cpp
https://wesmckinney.com/blog/pandas-and-apache-arrow/
https://github.com/apache/parquet-cpp/pull/158
C++ Python R
Parquet CSV ORC
C++ Python R
Parquet CSV ORC
Arrow
⬆️ Without Arrow
With Arrow ⬇️

2017, July
Spark adds Pandas
UDFs via Arrow
https://www.databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
https://arrow.apache.org/blog/2017/07/26/spark-arrow/

2018, March
Rust, Go libraries are
contributed to Arrow
https://arrow.apache.org/blog/2018/03/22/go-code-donation/
https://github.com/apache/arrow/pull/1804
C++
C♯
Java
JavaScript
Go
C (GLib)
C (nanoarrow)
MATLAB
Python
R
Ruby
Rust
Julia
Native
Bindings
Arrow
Implementations
(circa 2022)

2018, October
NVIDIA announces
RAPIDS
https://nvidianews.nvidia.com/news/nvidia-introduces-rapids-open-source-gpu-acceleration-platform-for-large-scale-data-analytics-and-machine-learning
https://ursalabs.org/blog/ursa-labs-partner-nvidia/

2019, June
Development starts on
Arrow Dataset
https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/
Arrow Dataset
Parquet CSV ORC
Arrow Filesystems
local S3 GCS
Python R
dplyr
Pandas

2019, June
Arrow Dataset
R code sample (with timing?)
goes here
Learn More ⏩ https://arrow-user2022.netlify.app
https://arrow.apache.org/docs/r/articles/dataset.html

2019, October
Arrow Flight RPC
is introduced
Client Server
Server
Server
Server
Server
Distributed Fetch with Flight
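
A rough simulation of the distributed fetch pattern in the diagram, with no Flight library involved. Everything here is a stand-in: `SERVERS`, `get_flight_info`, and `do_get` are hypothetical in-memory placeholders loosely named after the Flight calls they imitate. The sketch only shows the shape of the pattern: ask one place where the data lives, then pull each partition in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for data servers: endpoint name -> rows it holds.
SERVERS = {
    "server-1": [1, 2, 3],
    "server-2": [4, 5],
    "server-3": [6],
}

def get_flight_info(query):
    # Stand-in for the planning call: which endpoints hold the result?
    return sorted(SERVERS)

def do_get(endpoint):
    # Stand-in for streaming one partition of the result from one server.
    return SERVERS[endpoint]

def distributed_fetch(query):
    endpoints = get_flight_info(query)
    with ThreadPoolExecutor(max_workers=len(endpoints)) as pool:
        # pool.map returns results in endpoint order, not completion order.
        parts = list(pool.map(do_get, endpoints))
    return [row for part in parts for row in part]

result = distributed_fetch("SELECT ...")
```

In a real deployment each fetch would be a network stream from a different server, so the client's throughput is not limited by any single node.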

2020, July
Arrow 1.0.0 is released
Data courtesy pypistats.org; charts originally by Stephanie Hazlitt
https://arrow.apache.org/blog/2020/07/24/1.0.0-release/

2021, July
Streamlit integrates
Arrow for 10x speed
boost
https://blog.streamlit.io/all-in-on-apache-arrow/

2021, December
DuckDB integrates the
Arrow C Data Interface
https://arrow.apache.org/blog/2021/12/03/arrow-duckdb/

2022, October
Arrow 10.0.0 is released
Data courtesy pypistats.org; charts originally by Stephanie Hazlitt
https://arrow.apache.org/blog/2022/10/31/10.0.0-release/

Arrow Today
DuckDB
Pandas Parquet
Spark R
cudf
DuckDB
Pandas Parquet
Spark R
cudf

Arrow Is an Ecosystem
…and more!

geoarrow/geoparquet
https://github.com/geoarrow/geoarrow
https://observablehq.com/@kylebarron/geoarrow-and-geoparquet-in-deck-gl
https://dewey.dunnington.ca/post/2022/building-bridges-arrow-parquet-and-geospatial-computing/
“GeoArrow makes it easier to get
the best rendering performance in
deck.gl because it removes the need
for most CPU-based pre-processing
before passing the data to the GPU
for rendering.”

Arrow Flight SQL
● Client/server database protocol (not an SQL dialect!)
● Takes advantage of Arrow Flight
● Implement one protocol, support all clients
https://arrow.apache.org/blog/2022/11/01/arrow-flight-sql-jdbc/
Arrow Flight SQL
Arrow-native database
(no spoilers) JDBC ODBC
Arrow data all
the way—no
conversions
Clients use API
of choice

Arrow Flight SQL
● JDBC, ODBC drivers available
https://arrow.apache.org/blog/2022/11/01/arrow-flight-sql-jdbc/

ADBC: Arrow Database Connectivity
● Flight SQL helps servers
● ADBC solves the problem for clients
● One API, multiple databases
ADBC
Arrow-native application
Flight SQL Postgres DuckDB
Clients get Arrow data
ADBC driver converts if necessary

ADBC: Arrow Database Connectivity
ADBC API
Arrow-native application
ADBC Driver
Database
SQL
DB-specific
protocol
DB-specific
protocol
Arrow
Application doesn’t worry about what happens here
C (+Go, Java) APIs for portability
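
The layering in this diagram can be sketched as a driver interface. This is not the ADBC API: `Driver`, `ColumnarDriver`, `RowDriver`, and `application` are hypothetical names, and plain Python lists stand in for Arrow data. It only illustrates the idea that the application codes against one interface while each driver decides whether a conversion is needed.

```python
from abc import ABC, abstractmethod

class Driver(ABC):
    """Hypothetical driver interface: every driver returns columnar data."""

    @abstractmethod
    def execute(self, query: str) -> dict: ...

class ColumnarDriver(Driver):
    # Stands in for a database that already returns columnar results.
    def execute(self, query):
        return {"id": [1, 2], "name": ["a", "b"]}

class RowDriver(Driver):
    # Stands in for a row-oriented database: the driver does the conversion.
    def execute(self, query):
        rows = [(1, "a"), (2, "b")]
        ids, names = zip(*rows)
        return {"id": list(ids), "name": list(names)}

def application(driver: Driver):
    # The application sees only columns, whichever driver is plugged in.
    return driver.execute("SELECT id, name FROM t")["id"]
```

Swapping `ColumnarDriver()` for `RowDriver()` changes nothing in `application`, which is the "one API, multiple databases" point from the previous slide.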

ADBC
Learn More ⏩ https://github.com/apache/arrow-adbc

Query Engines
● Direct computation on Arrow(-like) data
● All interoperable
● Arrow-native ‘core’ for bigger projects
Acero

Spark/xgboost
● xgboost accepts Arrow data as input
● Intel is plugging Arrow, Velox into Spark
End result:
● Lower overheads for ML training
https://medium.com/intel-analytics-software/optimizing-the-end-to-end-training-pipeline-on-apache-spark-clusters-80261d6a7b8c
https://medium.com/intel-analytics-software/accelerate-spark-sql-queries-with-gluten-9000b65d1b4e

lance
● New toolchain for CV
● Everything is Arrow
● File format: Arrow-based
● Integrates with DuckDB: via Arrow
https://eto-ai.github.io/lance/notebooks/03_exploratory_data_analysis.html

Apache Arrow, in Flight
Arrow as glue between systems
Arrow as alternative protocol
Arrow as an internal detail
Arrow as the foundation of a system
Arrow as the primary interface
Arrow moving up the stack

Questions?
Learn More/Get Involved ⏩ https://arrow.apache.org/community/
