Wes McKinney
Apache Arrow: Cross-Language Development Platform for In-Memory Analytics
NYC R Conference, 20 April 2018
Wes McKinney
• Created Python pandas project (~2008), lead developer/maintainer until 2013
• PMC Apache Arrow, Apache Parquet, ASF Member
• Wrote Python for Data Analysis (1e 2012, 2e 2017)
• Formerly Co-founder / CEO of DataPad (acquired by Cloudera in 2014)
• Other OSS work: Ibis, Feather, Apache Kudu, statsmodels
Ursa Labs (https://ursalabs.org)
● Raise money to support full-time open source developers
● Grow Apache Arrow ecosystem
● Build cross-language, portable computational libraries for data science
● Build relationships across industry
People
Initial Sponsors and Partners
Prospective sponsors / partners, please reach out: info@ursalabs.org
Apache Arrow
• https://github.com/apache/arrow
• Open source community initiative started in 2016
• Backed by ~13 major OSS projects at start, significantly more now
• Shared standards and systems for memory interoperability and computation
• Cross-language libraries
Defragmenting Data Access
“Portable” Data Frames
[Diagram: pandas, R, and the JVM each maintain their own non-portable data frame representations; Arrow provides a single portable data frame representation they can all share]
Share data and algorithms at ~zero cost
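
As a minimal sketch of this idea in Python (assuming pyarrow and pandas are installed; the data is illustrative), a pandas DataFrame can be moved into Arrow’s language-independent columnar form and back:

    # pandas -> Arrow -> pandas. The Arrow Table uses the same columnar
    # memory layout that the R, JVM, and C++ implementations consume,
    # so handing it to another runtime requires no conversion step.
    import pandas as pd
    import pyarrow as pa

    df = pd.DataFrame({"city": ["NYC", "SF"], "population": [8.6, 0.9]})

    table = pa.Table.from_pandas(df)  # pandas -> Arrow columnar memory
    df2 = table.to_pandas()           # Arrow -> pandas

    print(table.schema)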
Some Arrow Use Cases
• Runtime in-memory format for analytical query engines
• Zero-copy (no deserialization) interchange via shared memory
• Low-overhead streaming messaging / RPC
• Serialization format implementation
• Zero-copy random access to on-disk data
• Example: Feather files (see the sketch after this list)
• Data ingest / data access
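
For the Feather bullet above, a minimal sketch (the file name is illustrative):

    # Feather stores Arrow columnar data on disk, so reading it back
    # is closer to mapping bytes than to parsing a text format.
    import pandas as pd
    import pyarrow.feather as feather

    df = pd.DataFrame({"x": list(range(5)), "y": list("abcde")})

    feather.write_feather(df, "example.feather")   # columnar write
    df2 = feather.read_feather("example.feather")  # fast read into pandas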
Arrow’s Columnar Memory Format
• Runtime memory format for analytical query processing
• Companion to serialization tech like Apache {Parquet, ORC}
• “Fully shredded” columnar, supports flat and nested schemas (see the example below)
• Organized for cache-efficient access on CPUs/GPUs
• Optimized for data locality, SIMD, parallel processing
• Accommodates both random access and scan workloads
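
A small example of what the “fully shredded” layout looks like from Python (the values are illustrative; the physical layout is the same in every Arrow implementation):

    # An Arrow array is a handful of contiguous buffers: a validity
    # bitmap plus value buffers. Contiguity is what enables SIMD and
    # cache-efficient scans.
    import pyarrow as pa

    arr = pa.array([1, 2, None, 4], type=pa.int64())

    print(arr.null_count)  # 1 -- nulls tracked in the validity bitmap
    print(arr.buffers())   # [validity bitmap buffer, int64 data buffer]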
Arrow Implementations and Bindings
Upcoming: Rust (native), R (binding), Julia (native)
Example use: Ray ML framework from Berkeley RISELab
Source: https://arxiv.org/abs/1703.03924
• Shared memory-based object store
• Zero-copy tensor reads using Arrow libraries (see the sketch below)
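
The zero-copy tensor idea can be sketched with Arrow’s tensor IPC functions and a memory-mapped file (the file name is illustrative; Ray itself reads from its shared-memory object store rather than a plain file):

    # Write a tensor in Arrow's IPC format, then memory-map it back.
    # to_numpy() returns an ndarray backed directly by the mapped
    # bytes -- there is no deserialization pass.
    import numpy as np
    import pyarrow as pa

    tensor = pa.Tensor.from_numpy(np.random.rand(1000, 1000))

    with pa.OSFile("tensor.arrow", "wb") as sink:
        pa.ipc.write_tensor(tensor, sink)

    with pa.memory_map("tensor.arrow", "r") as source:
        loaded = pa.ipc.read_tensor(source)  # zero-copy read
        print(loaded.to_numpy().shape)       # view over the mapped memory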
Some Industry Contributors in Apache Arrow
[Slide of contributor company logos, including ClearCode]
Arrow Project Growth
• 138 contributors on GitHub
• > 1900 resolved JIRAs
• > 100K binary package downloads per month
[Chart: JIRA burndown since project inception]
Current Project Status
• 0.9.0 Release: March 21, 2018
• Some focus areas
• Columnar format stability / forward compatibility
• Streaming messaging / RPC protocol
• Language implementations / interop
• Data access (e.g. Parquet input/output, ORC)
• Downstream integrations (Apache Spark, Python/pandas, …)
Upcoming Roadmap
• Software development lifecycle improvements
• Data ingest / access / export
• Computational libraries (CPU + GPU)
• Expanded language support
• Richer RPC / messaging
• More system integrations
The current data science stack’s computational foundation is severely dated, rooted in 1980s / 1990s FORTRAN-style semantics:
• Single-core / single-threaded algorithms
• Naïve execution model, eager evaluation
• Primitive memory management, expensive data access
• Fragmented language ecosystems, “proprietary” memory models …
Data scientists working with “small” data have not experienced great pain:
• Small Data (< ~10 GB): users doing fine here
• Medium Data (~10-100 GB): current Python/R stack begins to “fail” around this point
• Big Data (> ~100 GB-1 TB)
We can do so much better through modern systems techniques:
• Multi-core algorithms, GPU acceleration, code generation (LLVM)
• Lazy evaluation, “query” optimization
• Sophisticated memory management, efficient access to huge data sets
• Interoperable memory models, zero-copy interchange between system components
Note 1: Moore’s Law (and small data) enabled us to get by for a long time without confronting some of these challenges.
Note 2: Most of these methods have already been widely employed in analytic databases; limited “novel” research is needed.
Computational libraries
• “Kernel functions” performing vectorized analytics on the Arrow memory format (illustrative sketch below)
• Select CPU or GPU variant based on data location
• Operator graphs (compose multiple operators)
• Subgraph compiler (using LLVM)
• Runtime engine: execute operator graphs
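
These libraries were still on the roadmap at the time of this talk; the sketch below only illustrates the underlying idea, using numpy to run one vectorized “kernel” over an Arrow array’s data buffer without copying (the function name and the null-free assumption are for illustration):

    # Illustrative only: sum an int64 Arrow array via a zero-copy numpy
    # view of its data buffer. Real Arrow kernels are implemented
    # natively and also consult the validity bitmap.
    import numpy as np
    import pyarrow as pa

    def sum_int64(arr):
        assert arr.null_count == 0       # sketch ignores nulls
        data_buf = arr.buffers()[1]      # buffer 0 is the validity bitmap
        values = np.frombuffer(data_buf, dtype=np.int64)
        return int(values[arr.offset:arr.offset + len(arr)].sum())

    print(sum_int64(pa.array([1, 2, 3, 4], type=pa.int64())))  # 10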
Data Access / Ingest
• Apache Avro
• Apache Parquet nested data support
• Apache ORC
• CSV
• JSON
• ODBC / JDBC
• … and likely other data access points (see the Parquet example below)
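
Of the paths listed above, Parquet reading and writing is already usable from Python; a minimal sketch (the file name is illustrative):

    # Parquet <-> Arrow: the file is decoded directly into Arrow
    # columnar memory, which pandas can then view or copy as needed.
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
    pq.write_table(table, "example.parquet")   # Arrow -> Parquet

    table2 = pq.read_table("example.parquet")  # Parquet -> Arrow
    df = table2.to_pandas()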
Arrow-powered Data Science Systems
• Portable runtime libraries, usable from multiple programming languages
• Decoupled front ends
• Companion to distributed systems like Dask, Ray
Getting involved
• Join dev@arrow.apache.org
• PRs to https://github.com/apache/arrow
• Learn more about the Ursa Labs vision for Arrow-powered data science: https://ursalabs.org/tech/
