Improving data interoperability in Python and R
Apr. 20, 2016
Given at New York R Conference 2016, 2016-04-08
Wes McKinney
Director of Ursa Labs, Open Source Developer at Ursa Labs
1. Improving data interoperability in Python and R
Wes McKinney, @wesmckinn
NYC R Conference, April 8, 2016
© Cloudera, Inc. All rights reserved.
2. http://numfocus.org
3. Me
• Data Science Tools at Cloudera, formerly DataPad CEO/founder
• Serial creator of structured data tools / user interfaces
• Wrote bestseller Python for Data Analysis (2012)
• Open source projects
  • Python {pandas, Ibis, statsmodels}
  • Apache {Arrow, Parquet, Kudu (incubating)}
• Mostly work in Python and Cython/C/C++
4. In process: Python for Data Analysis, 2nd Edition
Coming late 2016 / early 2017
5. Apache Arrow
http://arrow.apache.org
Some slides from Strata-HW talk w/ Jacques Nadeau
6. Arrow in a Slide
• New Top-level Apache Software Foundation project
• Focused on Columnar In-Memory Analytics
  1. 10-100x speedup on many workloads
  2. Common data layer enables companies to choose best of breed systems
  3. Designed to work with any programming language
  4. Support for both relational and complex data as-is
• Developers from 13+ major open source projects involved
• A significant % of the world’s data will be processed through Arrow!
Projects shown: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R
7. High Performance Sharing & Interchange
Today
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Similar functionality implemented in multiple projects
With Arrow
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., Parquet-to-Arrow reader)
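The contrast on this slide can be illustrated with a small sketch that uses only the Python standard library. It is not Arrow itself: `array.array` and `memoryview` stand in for a shared columnar buffer, a JSON round trip stands in for a system-specific wire format, and both function names are hypothetical.

```python
import json
from array import array


def exchange_by_serialization(column):
    """Hand a column to another 'system' by encoding and decoding it.

    Every value is converted to text and parsed back, and the consumer
    ends up with its own separate copy of the data.
    """
    wire = json.dumps(column.tolist())      # producer: serialize
    return array("d", json.loads(wire))     # consumer: deserialize into a copy


def exchange_by_shared_buffer(column):
    """Hand over a view of the same memory; no bytes are copied."""
    return memoryview(column)


producer_column = array("d", [1.0, 2.5, 4.0])

copied = exchange_by_serialization(producer_column)
shared = exchange_by_shared_buffer(producer_column)

# Mutate the producer's data after the hand-off.
producer_column[0] = 100.0

copy_first = copied[0]    # still the old value: the copy is independent
shared_first = shared[0]  # sees the new value: same underlying memory
```

The serialization path does work proportional to the data size on both sides of the exchange, while the shared-buffer path only passes a reference. That is the kind of overhead a common memory format is meant to avoid.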
8. Arrow in action: Feather File Format for Python and R
• Problem: fast, language-agnostic binary data frame file format
• By Wes McKinney (Python) and Hadley Wickham (R)
• Read speeds close to disk IO performance
9. Real World Example: Feather File Format for Python and R

R:
    library(feather)
    path <- "my_data.feather"
    write_feather(df, path)
    df <- read_feather(path)

Python:
    import feather
    path = 'my_data.feather'
    feather.write_dataframe(df, path)
    df = feather.read_dataframe(path)
10. More on Feather
[Diagram: a Feather file laid out as array 0, array 1, array 2, ..., array n - 1, METADATA. The libfeather C++ library connects to R data.frame through Rcpp and to pandas DataFrame through Cython.]
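The "arrays plus metadata" layout in this diagram can be sketched with a toy container. This is not the actual Feather on-disk format: the JSON footer, the 4-byte length suffix, the float64-only columns, and the function names are all assumptions made for illustration.

```python
import io
import json
import struct
from array import array


def write_columns(stream, columns):
    """Write each column as one contiguous block, then a metadata footer."""
    metadata = []
    for name, values in columns.items():
        data = array("d", values).tobytes()  # native-endian float64 values
        metadata.append({"name": name, "offset": stream.tell(), "nbytes": len(data)})
        stream.write(data)
    footer = json.dumps(metadata).encode("utf-8")
    stream.write(footer)
    # The final 4 bytes record the footer size so a reader can locate it.
    stream.write(struct.pack("<I", len(footer)))


def read_column(stream, name):
    """Read one column by name, touching only the footer and that column."""
    stream.seek(-4, io.SEEK_END)
    (footer_size,) = struct.unpack("<I", stream.read(4))
    stream.seek(-(4 + footer_size), io.SEEK_END)
    metadata = json.loads(stream.read(footer_size).decode("utf-8"))
    entry = next(m for m in metadata if m["name"] == name)
    stream.seek(entry["offset"])
    values = array("d")
    values.frombytes(stream.read(entry["nbytes"]))
    return values.tolist()


buffer = io.BytesIO()
write_columns(buffer, {"price": [1.5, 2.5], "volume": [10.0, 20.0]})
volume = read_column(buffer, "volume")
```

Because the metadata records where each array starts, a reader can fetch a single column without scanning the others, and a binding in any language only needs to understand the footer and the raw array bytes.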
11. Feather: the good and not-so-good
• Good
  • Language-agnostic memory representation
  • Extremely fast
  • New storage features can be added without much difficulty
• Not-so-good
  • Data must be converted between the storage representation (Arrow) and in-memory “proprietary” data structures (R / Python data frames)
12. Shared needs for Python, R, Julia, ...
• If PLs can establish a common data frame C/C++-level memory representation, we can share algorithms and libraries much more easily
  • Example: dplyr’s in-memory backend
• Other requirements
  • Permissive licensing (Python / Julia require MIT/Apache-like)
  • Common build/test/packaging for shared C/C++ library components
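As a rough analogy for sharing algorithms over a common memory representation, the sketch below writes one routine against Python's buffer protocol and applies it to data held in two different containers. The "libraries" and the `column_mean` function are hypothetical, and the buffer protocol here only stands in for the C/C++-level representation the slide proposes.

```python
from array import array


def column_mean(buffer):
    """Mean of a column of float64 values exposed through the buffer protocol."""
    view = memoryview(buffer).cast("B").cast("d")  # reinterpret bytes as doubles
    return sum(view) / len(view)


# Two hypothetical "libraries" that hold the same column in different containers.
from_library_a = array("d", [1.0, 2.0, 3.0, 6.0])
from_library_b = bytearray(from_library_a.tobytes())

mean_a = column_mean(from_library_a)
mean_b = column_mean(from_library_b)
```

The routine never needs to know which container produced the bytes. It relies only on the agreed layout (contiguous float64 values), which is the property a shared data frame representation would provide across languages.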
13. Get Involved in Arrow
• Join the community
  • dev@arrow.apache.org
  • Slack: https://apachearrowslackin.herokuapp.com/
  • http://arrow.apache.org
  • @ApacheArrow
14. Thank you
Wes McKinney, @wesmckinn
Views are my own