DataFrames: The Good, Bad, and Ugly
Wes McKinney
Director of Ursa Labs, Open Source Developer at Ursa Labs
Thoughts on the status quo in "data frame" interfaces
1.
© Cloudera, Inc. All rights reserved.
DataFrames: The Good, Bad, and Ugly
Wes McKinney, @wesmckinn
NY R Conference, 2015-04-25
2.
Disclaimer: the views presented in this talk are my personal opinions and not necessarily those of Cloudera.
3.
This talk
• Some commentary on all the data frame interfaces out there
• Biased observations and cursory judgments
• Thoughts on crafting high-quality data tools
4.
Disclaimer #2: This is a nuanced discussion
5.
Who am I?
• Father of pandas (2008 - )
• Financial analytics in R / Python starting 2007
• 2010-2012
  • Hiatus from gainful employment
  • Make pandas ready for primetime
  • Write “Python for Data Analysis”
• 2013-2014: DataPad with Chang She & co
• 2014 - : Cloudera
6.
What’s in a DataFrame?
7.
What’s in a DataFrame?
A table with some rows
By any other name would analyze as sweet.
8.
Got a table?
9.
Got a table?
Put a DataFrame (interface) on it!
10.
What is this “data frame” that you speak of?
• A table-like data structure
• An API / user interface for the table
  • Selecting data
  • Math and relational algebra (join, filter, etc.)
  • File / database IO
  • ad infinitum
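As a rough illustration of "a table plus an API", here is a minimal sketch in plain Python. It assumes a table is just a dict mapping column names to equal-length lists; the function names (`select`, `filter_rows`, `inner_join`) are hypothetical and do not come from any library mentioned in the talk.

```python
# Assumption: a "table" is a dict of column name -> list of values.
# All helper names here are hypothetical, chosen only for illustration.

def select(table, columns):
    """Selecting data: keep only the named columns."""
    return {name: table[name] for name in columns}

def filter_rows(table, predicate):
    """Relational filter: keep rows for which predicate(row_dict) is true."""
    names = list(table)
    rows = [dict(zip(names, values)) for values in zip(*table.values())]
    kept = [row for row in rows if predicate(row)]
    return {name: [row[name] for row in kept] for name in names}

def inner_join(left, right, key):
    """Relational inner join on one shared key column.

    Assumes the two tables share no column names other than the key.
    """
    right_names = [n for n in right if n != key]
    # Index the right table's row positions by key value.
    index = {}
    for pos, k in enumerate(right[key]):
        index.setdefault(k, []).append(pos)
    out = {name: [] for name in list(left) + right_names}
    for i, k in enumerate(left[key]):
        for pos in index.get(k, []):
            for name in left:
                out[name].append(left[name][i])
            for name in right_names:
                out[name].append(right[name][pos])
    return out

trades = {"symbol": ["AAPL", "MSFT", "AAPL"], "qty": [10, 5, 7]}
prices = {"symbol": ["AAPL", "MSFT"], "price": [100.0, 50.0]}

big = filter_rows(trades, lambda row: row["qty"] > 6)
joined = inner_join(big, prices, "symbol")
symbols_only = select(joined, ["symbol"])
```

Real data frame libraries differ mostly in everything this sketch leaves out: typed columns, missing values, performance, and IO.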
11.
Some axes of comparison
• Data structure internals (types, in-memory representation, etc.)
• Basic table API
• Relational algebra support
• Group-by / split-apply-combine API
• Performance, memory use, evaluation semantics
• Missing data
• Data tidying / ETL tools
• IO utilities
• Domain specific tools (e.g. time series)
• …
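For readers unfamiliar with the split-apply-combine pattern named above, a minimal sketch in plain Python follows. The dict-of-lists table layout and the `group_by` helper are assumptions made for illustration, not any particular library's API.

```python
from collections import defaultdict

def group_by(table, key, value, func):
    """Split rows by `key`, apply `func` to each group's values, combine the results."""
    # Split: collect the values belonging to each distinct key.
    groups = defaultdict(list)
    for k, v in zip(table[key], table[value]):
        groups[k].append(v)
    # Apply and combine: one output row per group, in first-seen order.
    return {key: list(groups), value: [func(vs) for vs in groups.values()]}

sales = {
    "region": ["east", "west", "east", "west", "east"],
    "amount": [10, 20, 30, 40, 50],
}
totals = group_by(sales, "region", "amount", sum)
largest = group_by(sales, "region", "amount", max)
```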
12.
The Great Data Tool Decoupling™
• Thesis: over time, user interfaces, data storage, and execution engines will decouple and specialize
• In fact, you should really want this to happen
  • Share systems among languages
  • Reduce fragmentation and “lock-in”
  • Shift developer focus to usability
• Prediction: we’ll be there by 2025; sooner if we all get our act together
13.
Crafting quality data tools
• Quality / usefulness is usually forged by the fire of battle
• Real world use cases and social proof trump theory
• Eat that dog food
• When in doubt? Look at the test suite.
14.
R data frames
• Thin layer on top of R list type
• Sequence of named vectors
• Can have row names (any R vector)
• Simple column and row selection API
• Analytics, data transformation, etc. left to base package and add-on libraries
• Richness / usability comes largely from libraries
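The "sequence of named vectors with optional row names" idea can be sketched in Python as an analogy. The `Frame` class below is hypothetical and is not how R implements data frames; it only mirrors the shape described on the slide.

```python
class Frame:
    """Hypothetical analogue of an R data frame: named, equal-length columns."""

    def __init__(self, columns, row_names=None):
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        n = lengths.pop() if lengths else 0
        self.columns = dict(columns)
        # Like R's automatic row names, default to 1..n when none are given.
        self.row_names = list(row_names) if row_names is not None else list(range(1, n + 1))
        if len(self.row_names) != n:
            raise ValueError("row_names must have one entry per row")

    def column(self, name):
        """Column selection: return one named vector."""
        return self.columns[name]

    def row(self, row_name):
        """Row selection by row name: return a column name -> value mapping."""
        i = self.row_names.index(row_name)
        return {name: values[i] for name, values in self.columns.items()}

df = Frame({"x": [1, 2, 3], "y": ["a", "b", "c"]}, row_names=["r1", "r2", "r3"])
ys = df.column("y")
second = df.row("r2")
```

Everything beyond selection (analytics, reshaping, plotting) would live outside such a class, which matches the slide's point that richness comes from libraries.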
15.
Some awesome R data frame stuff
• “Hadley Stack”
  • dplyr, tidyr
  • legacy: plyr, reshape2
  • ggplot2
• data.table (data.frame + indices, fast algorithms)
• xts: time series
16.
R data frames: rough edges
• Copy-on-write semantics
• API fragmentation / inconsistency
  • Use the “Hadley stack” for improved sanity
• Factor / String dichotomy
  • stringsAsFactors=FALSE a blessing and curse
• Somewhat limited type system
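To make the factor / string dichotomy concrete, here is a small Python sketch of factor-style encoding: a sorted set of levels plus 1-based integer codes, which is how R factors are laid out by default. The helper names are hypothetical.

```python
def as_factor(values):
    """Encode strings as (levels, codes): sorted unique levels and 1-based codes."""
    levels = sorted(set(values))
    lookup = {level: code for code, level in enumerate(levels, start=1)}
    codes = [lookup[v] for v in values]
    return levels, codes

def as_strings(levels, codes):
    """Decode factor codes back into plain strings."""
    return [levels[code - 1] for code in codes]

raw = ["b", "a", "b", "c"]
levels, codes = as_factor(raw)
round_trip = as_strings(levels, codes)
```

The compact codes are convenient for grouping and modeling, but a value outside the known levels has no code, which is one plausible reason the choice between strings and factors cuts both ways.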
17.
dplyr
• Composable table API
• Good example of what the “decoupled” future might look like
• New in-memory R/Rcpp execution engine
• SQL backends for large subset of API
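The decoupling idea (one table API, several execution backends) can be sketched in Python. This is not dplyr's design; it is a hypothetical miniature in which a query is plain data that can either run over in-memory rows or be compiled to SQL and run in SQLite through the standard-library `sqlite3` module.

```python
import operator
import sqlite3

# Comparison operators supported by both backends in this sketch.
OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}

def query(table, select, where=None):
    """Describe a query as data; `where` is an optional (column, op, value) triple."""
    return {"table": table, "select": select, "where": where}

def run_in_memory(q, tables):
    """Backend 1: evaluate the query over lists of row dicts."""
    rows = tables[q["table"]]
    if q["where"] is not None:
        col, op, value = q["where"]
        rows = [r for r in rows if OPS[op](r[col], value)]
    return [tuple(r[c] for c in q["select"]) for r in rows]

def to_sql(q):
    """Backend 2: compile the same query description to SQL text and parameters.

    The sketch trusts table and column names; only the value is parameterized.
    """
    sql = "SELECT {} FROM {}".format(", ".join(q["select"]), q["table"])
    params = ()
    if q["where"] is not None:
        col, op, value = q["where"]
        if op not in OPS:
            raise ValueError("unsupported operator: " + op)
        sql += " WHERE {} {} ?".format(col, op)
        params = (value,)
    return sql, params

def run_in_sqlite(q, conn):
    sql, params = to_sql(q)
    return conn.execute(sql, params).fetchall()

rows = [{"name": "a", "score": 1}, {"name": "b", "score": 2}, {"name": "c", "score": 3}]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?)", [(r["name"], r["score"]) for r in rows])

q = query("scores", select=["name"], where=("score", ">", 1))
local_result = run_in_memory(q, {"scores": rows})
sql_result = run_in_sqlite(q, conn)
```

The user-facing description `q` never changes; only the engine that interprets it does.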
18.
Spark DataFrames
• R/pandas-inspired API for tabular data manipulation in Scala, Python, etc.
• Logical operation graphs rewritten internally in more efficient form
• Good interop with Spark SQL
• Some interoperability with pandas
• Partial API Decoupling! (it still binds you to Spark)
19.
pandas
• Several key data structures, data frame among them
• Considerably more complex internals than other data frame libraries
• Some good things
  • Born of need
  • A “batteries included” approach
  • Hierarchical axis labeling: addresses some hard use cases at expense of semantic complexity
  • Strong time series support
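Hierarchical axis labeling can be pictured, very loosely, as labeling each row with a tuple instead of a single value. The sketch below uses a plain dict keyed by hypothetical (symbol, year) tuples; it is an analogy and not pandas's implementation.

```python
# Assumption: each value is labeled by a two-level key (symbol, year).
data = {
    ("AAPL", 2013): 1.0,
    ("AAPL", 2014): 2.0,
    ("MSFT", 2013): 3.0,
    ("MSFT", 2014): 4.0,
}

def select_level(data, level, label):
    """Keep entries whose key equals `label` at `level`, dropping that level from the key."""
    out = {}
    for key, value in data.items():
        if key[level] == label:
            rest = key[:level] + key[level + 1:]
            # Collapse a one-element remainder to a plain label.
            out[rest[0] if len(rest) == 1 else rest] = value
    return out

aapl_by_year = select_level(data, 0, "AAPL")
by_symbol_2014 = select_level(data, 1, 2014)
```

Selecting on either level is what makes some hard use cases easy; the extra rules about which level a label refers to are the semantic complexity the slide mentions.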
20.
pandas: rough edges
• Axis labelling can get in the way for folks needing “just a table”
• Ceded control of its type system / data representation from day 1 to NumPy
• Inefficient string handling (uses NumPy object arrays)
• Missing data handling less precise than other tools
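One reading of the missing-data point, offered as an interpretation: a fixed-width integer array has no spare slot for "missing", so integer data with gaps ends up stored as floating point with NaN. The sketch below uses the standard-library `array` module as a stand-in for NumPy's typed arrays, which is an assumption made to keep the example dependency-free.

```python
import math
from array import array

# A fixed-width signed 64-bit integer array cannot hold a "missing" marker.
try:
    array("q", [1, None, 3])
    int_array_accepts_missing = True
except TypeError:
    int_array_accepts_missing = False

# Fallback: store the column as doubles and use NaN for the gap.
as_floats = array("d", [1.0, math.nan, 3.0])
missing_mask = [math.isnan(x) for x in as_floats]

# Cost of the fallback: doubles cannot represent every 64-bit integer exactly.
big = 2**53 + 1
survives_round_trip = int(float(big)) == big
```

NaN also cannot say why a value is absent, and it does not apply naturally to strings or booleans, which is where tools with a dedicated NA marker are more precise.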
21.
Julia: DataFrames.jl
• Started by Harlan Harris & co
• Part of broader JuliaStats initiative
• More R-like than pandas-like
• Very active: > 50 contributors!
• Still comparatively early
  • Less comprehensive API
  • More limited IO capabilities
22.
Other data frames
• Saddle (Scala)
  • Dev’d by Adam Klein (ex-AQR) at Novus Partners (fintech startup)
  • Designed and used for financial use cases
• Deedle (F# / .NET)
  • Dev’d by AK’s colleagues at BlueMountain (hedge fund)
• GraphLab / Dato
  • Really good C++ data frame with Python interface
  • Dual-licensed: AGPL + Commercial
• That’s not all! Haskell, Go, etc.
23.
We’re not done yet
• The future is JSON-like
  • Support for nested types / semi-structured data is still weak
• Wanted: Apache-licensed, community standard C/C++ data frame that we all use (R, Python, Julia)
• Bring on the Great Decoupling
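To show why JSON-like data sits awkwardly in a flat table, here is a minimal sketch that flattens nested records into dotted column names. The `flatten` helper and the sample records are hypothetical; flattening is a common workaround and not something the talk prescribes.

```python
def flatten(record, prefix=""):
    """Turn nested dicts into one flat dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat

records = [
    {"id": 1, "user": {"name": "ann", "geo": {"city": "NYC"}}},
    {"id": 2, "user": {"name": "bob"}},
]

rows = [flatten(r) for r in records]
# The union of keys becomes the column set; absent fields become None.
columns = sorted({name for row in rows for name in row})
table = {name: [row.get(name) for row in rows] for name in columns}
```

The nesting is lost and absent fields turn into gaps, which hints at why first-class nested types would be preferable.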
24.
Thank you
@wesmckinn