Improving data interoperability in Python and R

© Cloudera, Inc. All rights reserved.
Wes McKinney @wesmckinn
NYC R Conference April 8, 2016

http://numfocus.org

Me
• Data Science Tools at Cloudera, formerly DataPad CEO/founder
• Serial creator of structured data tools / user interfaces
• Wrote the bestseller Python for Data Analysis (2012)
• Open source projects
  • Python {pandas, Ibis, statsmodels}
  • Apache {Arrow, Parquet, Kudu (incubating)}
• Mostly work in Python and Cython/C/C++

In progress:
Python for Data Analysis, 2nd Edition
Coming late 2016 / early 2017

Apache Arrow
http://arrow.apache.org
Some slides from the Strata + Hadoop World talk with Jacques Nadeau

Arrow in a Slide
• New top-level Apache Software Foundation project
• Focused on columnar in-memory analytics
  1. 10-100x speedup on many workloads
  2. Common data layer enables companies to choose best-of-breed systems
  3. Designed to work with any programming language
  4. Support for both relational and complex data as-is
• Developers from 13+ major open source projects involved
• A significant % of the world’s data will be processed through Arrow!
Projects shown: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R
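
To make "columnar in-memory" concrete, here is a minimal sketch using only the Python standard library. It is not Arrow's API; the toy table and the names `rows` and `columns` are assumptions made for illustration. It contrasts a row-oriented layout (one object per record) with a column-oriented layout (one contiguous typed buffer per column).

```python
from array import array

# Hypothetical toy table: three records with an integer id and a float price.
rows = [(1, 9.5), (2, 3.25), (3, 7.0)]

# Row-oriented layout: each record is a separate Python tuple, so an
# aggregate over one field has to visit every record object.
row_total = sum(price for _id, price in rows)

# Column-oriented layout: each column is one contiguous, typed buffer.
columns = {
    "id": array("q", [r[0] for r in rows]),      # 64-bit signed integers
    "price": array("d", [r[1] for r in rows]),   # 64-bit floats
}

# The same aggregate now scans a single buffer of adjacent values.
col_total = sum(columns["price"])

# Size in bytes of the contiguous price column (3 values of 8 bytes each).
price_bytes = columns["price"].itemsize * len(columns["price"])
```

Scanning one contiguous column is the access pattern that analytic workloads tend to favor, which is the motivation for the columnar focus on this slide.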

High Performance Sharing & Interchange
Today
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Similar functionality implemented in multiple projects
With Arrow
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., Parquet-to-Arrow reader)
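
The difference between the two columns of this slide can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not Arrow code: `pickle` stands in for any per-system serialization step, and `memoryview` stands in for two components agreeing on one memory format.

```python
import pickle
from array import array

# Hypothetical column of 64-bit floats owned by a "producer" system.
column = array("d", [1.0, 2.0, 3.0, 4.0])

# "Today": the consumer receives a serialized copy and must decode it.
payload = pickle.dumps(column)
copied = pickle.loads(payload)
copied[0] = 100.0            # changes only the consumer's copy
assert column[0] == 1.0      # the producer's data is untouched

# "With a shared format": the consumer reads the producer's buffer directly.
view = memoryview(column)    # no bytes are copied
shared_total = sum(view)     # computed over the producer's memory
view[0] = 100.0              # writes through to the producer's memory
assert column[0] == 100.0
view.release()
```

The serialized path pays for an encode, a decode, and a second copy of the data; the shared path pays for none of them, at the cost of both sides having to agree on the layout.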

Arrow in action: Feather File Format for Python and R
• Problem: fast, language-agnostic binary data frame file format
• By Wes McKinney (Python) and Hadley Wickham (R)
• Read speeds close to disk IO performance

Real World Example: Feather File Format for Python and R

R:
library(feather)
path <- "my_data.feather"
write_feather(df, path)
df <- read_feather(path)

Python:
import feather
path = 'my_data.feather'
feather.write_dataframe(df, path)
df = feather.read_dataframe(path)

More on Feather
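[Diagram: a Feather file stores array 0, array 1, array 2, ..., array n - 1, followed by METADATA. libfeather, a C++ library, reads and writes the file; it is exposed to R data.frame through Rcpp and to pandas DataFrame through Cython.]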
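
The layout on this slide (column arrays stored back to back, then a metadata section) can be sketched as a toy file format. This is explicitly not the real Feather format: the JSON footer, the trailing 4-byte length, and the function names `write_toy_file` and `read_toy_column` are assumptions made for illustration.

```python
import io
import json
import struct
from array import array


def write_toy_file(columns):
    """Write column buffers back to back, then a JSON metadata footer."""
    buf = io.BytesIO()
    meta = []
    for name, values in columns.items():
        data = values.tobytes()
        # Record where this column starts and how many bytes it occupies.
        meta.append({"name": name, "typecode": values.typecode,
                     "offset": buf.tell(), "nbytes": len(data)})
        buf.write(data)
    footer = json.dumps(meta).encode("utf-8")
    buf.write(footer)
    # The last 4 bytes hold the footer length so a reader can locate it.
    buf.write(struct.pack("<I", len(footer)))
    return buf.getvalue()


def read_toy_column(blob, name):
    """Read one column by name without decoding the other columns."""
    (footer_len,) = struct.unpack("<I", blob[-4:])
    meta = json.loads(blob[-4 - footer_len:-4].decode("utf-8"))
    entry = next(m for m in meta if m["name"] == name)
    out = array(entry["typecode"])
    out.frombytes(blob[entry["offset"]:entry["offset"] + entry["nbytes"]])
    return out


blob = write_toy_file({
    "x": array("q", [1, 2, 3]),        # 64-bit signed integers
    "y": array("d", [0.5, 1.5, 2.5]),  # 64-bit floats
})
y = read_toy_column(blob, "y")
```

Because the metadata records each column's offset and size, a reader can jump straight to the column it needs, and the column bytes on disk already match the in-memory layout.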

Feather: the good and not-so-good
• Good
  • Language-agnostic memory representation
  • Extremely fast
  • New storage features can be added without much difficulty
• Not-so-good
  • Data must be converted between the storage representation (Arrow) and in-memory “proprietary” data structures (R / Python data frames)

Shared needs for Python, R, Julia, ...
• If programming languages can establish a common data frame memory representation at the C/C++ level, we can share algorithms and libraries much more easily
  • Example: dplyr’s in-memory backend
• Other requirements
  • Permissive licensing (Python / Julia require MIT/Apache-like)
  • Common build/test/packaging for shared C/C++ library components

Get Involved in Arrow
• Join the community
• dev@arrow.apache.org
• Slack: https://apachearrowslackin.herokuapp.com/
• http://arrow.apache.org
• @ApacheArrow

Thank you
Wes McKinney @wesmckinn
Views are my own
