Home
Explore
Submit Search
Upload
Login
Signup
Advertisement
Check these out next
Improving data interoperability in Python and R
Wes McKinney
Memory Interoperability in Analytics and Machine Learning
Wes McKinney
Data Science Languages and Industry Analytics
Wes McKinney
Enabling Python to be a Better Big Data Citizen
Wes McKinney
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Wes McKinney
Apache Arrow -- Cross-language development platform for in-memory data
Wes McKinney
Apache Arrow: Cross-language Development Platform for In-memory Data
Wes McKinney
Next-generation Python Big Data Tools, powered by Apache Arrow
Wes McKinney
1
of
34
Top clipped slide
DataFrames: The Extended Cut
Jun. 1, 2015
•
0 likes
19 likes
×
Be the first to like this
Show More
•
8,455 views
views
×
Total views
0
On Slideshare
0
From embeds
0
Number of embeds
0
Download Now
Download to read offline
Report
Technology
Open Data
Wes McKinney
Follow
Director of Ursa Labs, Open Source Developer at Ursa Labs
Advertisement
Advertisement
Advertisement
Recommended
PyCon Singapore 2013 Keynote
Wes McKinney
94.6K views
•
19 slides
DataFrames: The Good, Bad, and Ugly
Wes McKinney
12.9K views
•
24 slides
An Incomplete Data Tools Landscape for Hackers in 2015
Wes McKinney
8.1K views
•
22 slides
Data Analysis and Statistics in Python using pandas and statsmodels
Wes McKinney
19.8K views
•
29 slides
Ibis: Scaling the Python Data Experience
Wes McKinney
3.8K views
•
13 slides
My Data Journey with Python (SciPy 2015 Keynote)
Wes McKinney
7.4K views
•
37 slides
More Related Content
Slideshows for you
(20)
Improving data interoperability in Python and R
Wes McKinney
•
2.6K views
Memory Interoperability in Analytics and Machine Learning
Wes McKinney
•
5.6K views
Data Science Languages and Industry Analytics
Wes McKinney
•
5.5K views
Enabling Python to be a Better Big Data Citizen
Wes McKinney
•
6K views
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Wes McKinney
•
103.8K views
Apache Arrow -- Cross-language development platform for in-memory data
Wes McKinney
•
2.9K views
Apache Arrow: Cross-language Development Platform for In-memory Data
Wes McKinney
•
6.6K views
Next-generation Python Big Data Tools, powered by Apache Arrow
Wes McKinney
•
12.9K views
Python Data Wrangling: Preparing for the Future
Wes McKinney
•
12.5K views
Apache Arrow and Python: The latest
Wes McKinney
•
5.8K views
Apache Arrow at DataEngConf Barcelona 2018
Wes McKinney
•
2K views
Apache Arrow (Strata-Hadoop World San Jose 2016)
Wes McKinney
•
17K views
Apache Arrow - An Overview
Dremio Corporation
•
2K views
How Apache Arrow and Parquet boost cross-language interoperability
Uwe Korn
•
2.9K views
Python for Financial Data Analysis with pandas
Wes McKinney
•
61.7K views
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
Dremio Corporation
•
1.1K views
High Performance Python on Apache Spark
Wes McKinney
•
16.5K views
Improving Python and Spark (PySpark) Performance and Interoperability
Wes McKinney
•
19.7K views
Light Up Your Dark Data
Anaconda
•
3.2K views
Extending Pandas using Apache Arrow and Numba
Uwe Korn
•
5.5K views
Similar to DataFrames: The Extended Cut
(20)
PyData: The Next Generation | Data Day Texas 2015
Cloudera, Inc.
•
1.8K views
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Uri Laserson
•
1.8K views
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Hakka Labs
•
1.1K views
Building a Hadoop Data Warehouse with Impala
huguk
•
2K views
Building a Hadoop Data Warehouse with Impala
Swiss Big Data User Group
•
7.3K views
A brave new world in mutable big data relational storage (Strata NYC 2017)
Todd Lipcon
•
7.3K views
Building data pipelines with kite
Joey Echeverria
•
5.7K views
NoSQL-Overview
Ranjeet Jha - OCM-JEA
•
664 views
Technologies for Data Analytics Platform
N Masahiro
•
9.2K views
Hpc lunch and learn
John D Almon
•
707 views
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Cloudera, Inc.
•
2.5K views
Twitter with hadoop for oow
Gwen (Chen) Shapira
•
1.5K views
Data Science at Scale Using Apache Spark and Apache Hadoop
Cloudera, Inc.
•
7.1K views
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Mladen Kovacevic
•
812 views
Big Data (NJ SQL Server User Group)
Don Demcsak
•
1.1K views
Big data berlin
kammeyer
•
1.7K views
Big Data Retrospective - STL Big Data IDEA Jan 2019
Adam Doyle
•
228 views
8. Software Development Security
Sam Bowne
•
13 views
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Sam Palani
•
841 views
Impala use case @ edge
Ram Kedem
•
906 views
Advertisement
More from Wes McKinney
(17)
Solving Enterprise Data Challenges with Apache Arrow
Wes McKinney
•
996 views
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Wes McKinney
•
1.1K views
Apache Arrow: High Performance Columnar Data Framework
Wes McKinney
•
1.4K views
New Directions for Apache Arrow
Wes McKinney
•
1.9K views
Apache Arrow Flight: A New Gold Standard for Data Transport
Wes McKinney
•
2.2K views
ACM TechTalks : Apache Arrow and the Future of Data Frames
Wes McKinney
•
1.9K views
Apache Arrow: Present and Future @ ScaledML 2020
Wes McKinney
•
960 views
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
Wes McKinney
•
2.1K views
Apache Arrow: Leveling Up the Analytics Stack
Wes McKinney
•
1.4K views
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Wes McKinney
•
2.5K views
Apache Arrow: Leveling Up the Data Science Stack
Wes McKinney
•
3.5K views
Ursa Labs and Apache Arrow in 2019
Wes McKinney
•
4.1K views
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
Wes McKinney
•
1.1K views
Shared Infrastructure for Data Science
Wes McKinney
•
8.5K views
Data Science Without Borders (JupyterCon 2017)
Wes McKinney
•
6.2K views
Raising the Tides: Open Source Analytics for Data Science
Wes McKinney
•
3.2K views
PyCon APAC 2016 Keynote
Wes McKinney
•
3.6K views
Recently uploaded
(20)
Storage Capacity Management on Multi-tenant Kafka Cluster with Nurettin Omeroglu
HostedbyConfluent
•
0 views
The California Age Appropriate Design Code Act Navigating the New Requirement...
TrustArc
•
0 views
Safeguarding - Protecting Your Kafka from Misbehaving Clients with Tom Scott
HostedbyConfluent
•
0 views
Aristiun Whitepaper- Automated Threat Modelling with Aribot
Aristiun B.V.
•
0 views
Lesson-02 (1).pptx
ssuserc24e05
•
0 views
Deep Dive into Kafka Connect Protocol with Catalin Pop
HostedbyConfluent
•
0 views
CyberEthics.ppt
ANKITKUMAR920995
•
0 views
Lesson-01.pptx
ssuserc24e05
•
0 views
Intelligent, Automatic Restarts for Unhealthy Kafka Consumers on Kubernetes w...
HostedbyConfluent
•
0 views
Unveiling the Inner Workings of Apache Kafka® with Flamegraphs with Christo L...
HostedbyConfluent
•
0 views
Fortnite Is Awsome!!!
YT SavageGuy
•
0 views
KPIs&Goals.pdf
mennaHendy
•
0 views
Pytexas: Build ChatGPT over SMS in Python
Elizabeth (Lizzie) Siegle
•
0 views
Powering Consistent, High-throughput, Real-time Distributed Calculation Engin...
HostedbyConfluent
•
0 views
Generate Art with DALL·E 2 and Twilio MMS.pptx
Elizabeth (Lizzie) Siegle
•
0 views
proposal (2).pdf
mennaHendy
•
0 views
What to do if Your Kafka Streams App Gets OOMKilled? with Andrey Serebryanskiy
HostedbyConfluent
•
0 views
如何办理一份高仿东伦敦大学毕业证成绩单?
aazepp
•
0 views
ARTIFICIAL INTELLIGENCE.pptx
Butterfly education
•
0 views
Plant Disease Detection.pptx
vikasmittal92
•
0 views
Advertisement
DataFrames: The Extended Cut
1 © Cloudera,
Inc. All rights reserved. DataFrames: The Extended Cut Wes McKinney @wesmckinn Open Data Science Conference, 2015-‐05-‐31
2 © Cloudera,
Inc. All rights reserved. This talk • Some commentary / notes on all the data frame interfaces out there • Community / collaboraQon challenges • Opinions
3 © Cloudera,
Inc. All rights reserved. Disclaimer: This is a nuanced discussion
4 © Cloudera,
Inc. All rights reserved. Who am I? • Originator of pandas (2008 -‐ ) • Financial analyQcs in R / Python starQng 2007 • 2010-‐2012 • Hiatus from gainful employment • Make pandas ready for primeQme • Write "Python for Data Analysis” • 2013-‐2014: DataPad with Chang She & co • 2014 -‐ : Open source development @ Cloudera
5 © Cloudera,
Inc. All rights reserved. Data frame, in brief • A table-‐like data structure • An API / user interface for the table • SelecQng data • Math and relaQonal algebra (join, filter, etc.) • File / database IO • ad infinitum
6 © Cloudera,
Inc. All rights reserved. Some axes of comparison • Data structure internals (types, in-‐memory representaQon, etc.) • Basic table API • RelaQonal algebra support • Group-‐by / split-‐apply-‐combine API • Performance, memory use, evaluaQon semanQcs • Missing data • Data Qdying / ETL tools • IO uQliQes • Domain specific tools (e.g. Qme series) • …
7 © Cloudera,
Inc. All rights reserved. The Great Data Tool Decoupling™ • Thesis: over Qme, user interfaces, data storage, and execuQon engines will decouple and specialize • In fact, you should really want this to happen • Share systems among languages • Reduce fragmentaQon and “lock-‐in” • Shih developer focus to usability • PredicQon: we’ll be there by 2025; sooner if we all get our act together
8 © Cloudera,
Inc. All rights reserved. Crahing quality data tools • Quality / usefulness is usually forged by the fire of bamle • Real world use cases and social proof trump theory • Eat that dog food • When in doubt? Look at the test suite.
9 © Cloudera,
Inc. All rights reserved. Data frames, circa 2015 • Tight coupling amongst • User API • In-‐memory data representaQon • SerializaQon/deserializaQon tools • AnalyQcs / computaQon • Code evaluaQon semanQcs • Code sharing amongst languages fairly difficult, rarely happens in pracQce
10 © Cloudera,
Inc. All rights reserved. In-‐memory representaQon a major problem • Any algorithms wrimen against a data frame implementaQon assume a parQcular custom • Memory layout • Type system (+ missing data representaQon) • Memory management strategy • Due to organic growth of libraries / language ecosystems, there was never an effort to reach any design consensus • Downstream symptoms: benchmark-‐driven development
11 © Cloudera,
Inc. All rights reserved. Lies, damn lies, and benchmarks • Usual targets of benchmarking: • IO (CSV and database reading) • Joins • AggregaQon / group-‐by operaQons • What’s actually being tested • Quality of algorithm implementaQon • Data representaQon • Memory access pamerns / CPU cache efficiency
12 © Cloudera,
Inc. All rights reserved. Example: group-‐by-‐aggregate df.groupby(‘a’) .b.sum() SELECT a, sum(b) AS total FROM df GROUP BY 1 df %>% group_by(‘a’) %>% summarise(total=sum(b)) What’s actually being benchmarked • CPU/IO efficiency for for scanning a and b columns • Speed to push a values through a hash table (quality of hash table impl now an issue) • Time to sum b values given known categorizaQon (using hash table) • Speed of creaQng result data frame
13 © Cloudera,
Inc. All rights reserved. Data types • PrimiQve value types • Number (integer, floaQng point) • Boolean • String (UTF-‐8), Binary • Timestamp • Complex / nested types • Lists • Structs • Maps
14 © Cloudera,
Inc. All rights reserved. Data types { ‘name’: { ‘first’: ‘Wes’, ‘last’: ‘McKinney’ } ‘age’: 30 ‘numbers’: [1, 2, 3, 4] ‘addresses’: [ {‘city’: ‘New York’, ‘state’: ‘NY’}, {‘city’: ‘San Francisco’, ‘state’: ‘CA’}, ] ‘keys’: [(‘foo’, 27), (‘baz’, 35)] } struct< name: struct< first: string, last: string >, age: int32, numbers: list<int32>, addresses: list<struct< city: string, state: string >>, keys: map<string, int> > Example Table Row Table schema
15 © Cloudera,
Inc. All rights reserved. Table memory layout • Columnar • Faster for analyQcs, single-‐column scans • ProjecQons (column add/remove) cheap • Row appends harder • OperaQons across single rows are slower • Row-‐oriented • Slower single-‐column scans • ProjecQons expensive • Row appends easier • OperaQons across single rows are faster
16 © Cloudera,
Inc. All rights reserved. SerializaQon and protocols • Text-‐based formats • CSV/TSV, JSON • Binary formats • Avro • Thrih • Protocol buffers • Parquet (related: ColumnIO and RecordIO at Google) • ORCFile • HDF5 • Language specific: NumPy, bcolz, Rdata, etc.
17 © Cloudera,
Inc. All rights reserved. Lossy / text-‐based formats • Like CSV, JSON, etc. • Can be read “easily” by anyone • Expensive to read (parse) and write/generate • Expensive to infer correct schema if not known • SQll must validate parsed data, even if you know the schema • Not a high fidelity format • Schema must be known • Data can be easily lost-‐in-‐translaQon • CSV cannot easily handle nested schemas
18 © Cloudera,
Inc. All rights reserved. Binary data formats • Some, like Parquet and ORCFile, are columnar / analyQcs-‐oriented • Others (Avro, Thrih, Protobuf), designed for more general data transport and remote procedure calls (RPC) in distributed systems • Language-‐specific formats generally don’t have full-‐featured reader-‐writers outside the primary language
19 © Cloudera,
Inc. All rights reserved. Data representaQon across languages • R: list of named R arrays (each containing data of a primiQve R type) • Data frames have an implicit schema, but user does not interact with • Limited support for nested type values (JSON-‐like data) • pandas: complex, but fundamentally data stored in NumPy arrays • No explicit table schema; NumPy complexiQes largely hidden from user • Limited / no built-‐in support for nested type values • Missing / NULL values handled on a type-‐by-‐type basis (someQmes not at all)
20 © Cloudera,
Inc. All rights reserved. Missing data • R: uses special values to encode NA • Python pandas: special values, but does not work for all types • For arbitrary type data, best approach is to use a bit or byte to represent nullness/not-‐nullness
21 © Cloudera,
Inc. All rights reserved. I am acQvely working to address infrastructural issues that have prevented collaboraQon / code sharing in tabular data tools
22 © Cloudera,
Inc. All rights reserved. Some awesome R data frame stuff • “Hadley Stack” • dplyr, Qdyr • legacy: plyr, reshape2 • ggplot2 • data.table (data.frame + indices, fast algorithms) • xts : Qme series
23 © Cloudera,
Inc. All rights reserved. R data frames: rough edges • Copy-‐on-‐write semanQcs • API fragmentaQon / inconsistency • Use the “Hadley stack” for improved sanity • Factor / String dichotomy • stringsAsFactors=FALSE a blessing and curse • Somewhat limited type system
24 © Cloudera,
Inc. All rights reserved. dplyr • Composable table API • Good example of what the “decoupled” future might look like • New in-‐memory R/RCpp execuQon engine • SQL backends for large subset of API
25 © Cloudera,
Inc. All rights reserved. Spark DataFrames • R/pandas-‐inspired API for tabular data manipulaQon in Scala, Python, etc. • Logical operaQon graphs rewrimen internally in more efficient form • Good interop with Spark SQL • Some interoperability with pandas • ParQal API Decoupling! • Low-‐level internals (DataFrame ~ RDD[Row]) are … not the best
26 © Cloudera,
Inc. All rights reserved. pandas • Several key data structures, data frame among them • Considerably more complex internals than other data frame libraries • Some good things • Born of need • A “bameries included” approach • Hierarchical axis labeling: addresses some hard use cases at expense of semanQc complexity • Strong Qme series support
27 © Cloudera,
Inc. All rights reserved. pandas: rough edges • Axis labelling can get in the way for folks needing “just a table” • Ceded control of its type system / data rep’n from day 1 to NumPy • Inefficient string handling (uses NumPy object arrays) • Missing data handling less precise than other tools • No C API; very difficult for external tools to interact with pandas objects
28 © Cloudera,
Inc. All rights reserved. blaze • Open source Python project run by ConQnuum AnalyQcs • Bespoke type system (“datashape”) supporQng many kinds of nested schemas • Data expression API with many execuQon backends (SQL, pandas, Spark, etc.) • Does not have a “naQve” backend (was to be the—now aborted?—libdynd project) • Another good example of the “decoupled” future
29 © Cloudera,
Inc. All rights reserved. libdynd • Next-‐gen NumPy-‐in-‐C++ project started by ConQnuum in 2011 to be a naQve backend for Blaze, appears not to be acQvely pursued • Implements the new datashape protocol • Standalone C++11/14 library, with deep Python binding • Lots of interesQng internals / implementaQon ideas
30 © Cloudera,
Inc. All rights reserved. bcolz: the new carray / ctable • Compressed in-‐memory / on disk columnar table structure • Outgrowth of • carray (compressed-‐in-‐memory NumPy array) • blosc (mulQthreaded shuffling compression library)
31 © Cloudera,
Inc. All rights reserved. Julia: DataFrames.jl • Started by Harlan Harris & co • Part of broader JuliaStats iniQaQve • More R-‐like than pandas-‐like • Very acQve: > 50 contributors! • SQll comparaQvely early • Less comprehensive API • More limited IO capabiliQes
32 © Cloudera,
Inc. All rights reserved. Other data frames • Saddle (Scala) • Dev’d by Adam Klein (ex-‐AQR) at Novus Partners (fintech startup) • Designed and used for financial use cases • Deedle (F# / .NET) • Dev’d by AK’s colleagues at BlueMountain (hedge fund) • GraphLab / Dato • Really good C++ data frame with Python interface • Dual-‐licensed: AGPL + Commercial • That’s not all! Haskell, Go, etc…
33 © Cloudera,
Inc. All rights reserved. We’re not done yet • The future is JSON-‐like • Support for nested types / semi-‐structured data is sQll weak • Wanted: Apache-‐licensed, community standard C/C++ data frame data structure and analyQcs toolkit that we all use (R, Python, Julia) • Bring on the Great Decoupling
34 © Cloudera,
Inc. All rights reserved. Thank you @wesmckinn
Advertisement