Wes McKinney @wesmckinn
PyCon Colombia 2020
Python for Data Analysis:
Past, Present, and Future
Wes’s professional timeline
● pandas (2008)
● DataPad (2013)
● Apache Arrow (2014 — Present)
Perspectives on the last 12 years
January 2020: pandas 1.0
● 26th major release after 10 years of
development
● ~2000 unique contributors
Thanks, Indeed!
Dec 2009 - pandas 0.1
● First open source release after ~18 months
of proprietary use
● Still on PyPI!
Funding pandas development
● pandas received its first formal grant in 2019 from the Chan Zuckerberg Initiative
● Core devs primarily volunteers, self-funded,
or company-funded (Anaconda, others)
The early pandas gang (2011 - 2012)
Wes McKinney Chang She Adam Klein
pandas’s amazing Core Dev Team
Core Dev Meetup,
2019
Jeff Reback, Tom Augspurger, Brock Mendel, Marc Garcia, Joris van den Bossche
Partial cast of characters
Community engagement
Python’s journey to mainstream data language
"We believe that in the coming years there will be
great opportunity to attract users in need of
statistical data analysis tools to Python who might
have previously chosen R, MATLAB, or another
research environment. By designing robust, easy
to-use data structures that cohere with the rest of the
scientific Python stack, we can make Python
compelling choice for data analysis applications. In
our opinion, pandas provides a solid foundation upon
which a very powerful data analysis ecosystem can
be established."
Me, Proceedings of SciPy 2011
StackOverflow data from September 2017
Factors driving Python’s growth
Contributing factors
● Massive need for data wranglers + scientists
● “Perfect storm” of necessary packages
● New data science education
● Successful early adopters
● Packaging improvements
Perfect storm of packages
View from 2008
Confronting Fear, Uncertainty, Doubt
Common concerns
● Large codebase concerns
● Long-term software lifecycle
● Interpreted languages
○ ... unsafe?
○ ... slow?
● Open source… trustworthy?
May 2011 - “PyData” core dev meetings
"Need a toolset that is robust, fast, and suitable for a production environment..."
"... but also good for interactive research..."
"... and easy / intuitive for non-software engineers to use"
* also, we need to fix packaging
July 2011 - Concerns
"... the current state of affairs has me rather
anxious … these tools [e.g. pandas] have
largely not been integrated with any other tools
because of the community's collective
commitment anxiety"
http://wesmckinney.com/blog/a-roadmap-for-rich-scientific-data-structures-in-python/
Reading CSV files
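As a rough illustration of the workload this slide benchmarked, here is a minimal sketch (the file name flights.csv is hypothetical) comparing pandas' default parser with Arrow's multithreaded one:

import pandas as pd
import pyarrow.csv as pv

# pandas' C parser reads on a single thread by default
df = pd.read_csv("flights.csv")

# Arrow's CSV reader parses in parallel and can hand the result to pandas
table = pv.read_csv("flights.csv")
df2 = table.to_pandas()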
Python for Data Analysis book - 2012
● A primer on data manipulation in Python
● Focus: NumPy, IPython/Jupyter, pandas, matplotlib
● 2 editions (2012, 2017)
● 8 translations so far
PyData NYC 2013: 10 Things I Hate About pandas
● November 2013
● Summary: “pandas is
not designed like, or
intended to be used
as, a database query
engine”
Fall 2014: Python in a Big Data World
Task: Helping Python become a first-class technology for Big Data
Some Problems
● File formats
● JVM interop
● Non-array-oriented
interfaces
Difficulties in pandas (and R) dataframes
● Limited built-in data types
● Performance and memory use issues
● Challenges with larger-than-memory datasets
● Naive execution strategies (no “query
optimization”)
Does not cut down trees.
Out of memory on 10GB of CSVs
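A common mitigation at the time, sketched here with a hypothetical file and column name, was to stream the CSV in chunks instead of materializing all 10GB at once:

import pandas as pd

# Aggregate incrementally so only one chunk is in memory at a time
total = 0.0
for chunk in pd.read_csv("big.csv", chunksize=1_000_000):
    total += chunk["amount"].sum()  # "amount" is a hypothetical column
print(total)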
A … of doubt
Changing the tides
… and others
Fragmentation of data and code
Other thoughts
● Projects like pandas may be taking
responsibility for too many things
● It would be more productive (long-term) to
have a reusable computational foundation
for data frames
● New data frame format designed for speed
● Computational foundation for data processing libraries
● Fast cross-language data interchange
[Diagram: Arrow memory as the shared hub between the JVM Data Ecosystem, Database Systems, and Data Science Libraries]
Defragmenting Data
● https://github.com/apache/arrow
● Over 400 unique contributors
● Some level of support for 11 programming
languages
Important features
● CPU/GPU-friendly columnar memory layout
● Memory map huge datasets
● Relocate data structures without serialization
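For example, memory-mapping lets Arrow open files much larger than RAM and read them without copying; a minimal sketch, assuming an Arrow IPC file at a hypothetical path:

import pyarrow as pa
import pyarrow.ipc as ipc

# Map the file into the address space; pages load lazily on access
with pa.memory_map("data.arrow", "r") as source:
    table = ipc.open_file(source).read_all()  # zero-copy views of the mapped buffers
print(table.num_rows)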
Arrow C++ Platform
[Diagram: a multi-core work scheduler and core data platform, with a query engine, datasets framework, and Arrow Flight RPC layered above network and storage]
“New Data Frame” projects
● dask.dataframe (see the sketch after this list)
● Modin
● NVIDIA RAPIDS
● Vaex
● … and more surely in development
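These projects generally mirror the pandas API over a different execution backend. A minimal dask.dataframe sketch (the file glob and column names are hypothetical):

import dask.dataframe as dd

# Looks like pandas, but each call extends a lazy task graph
df = dd.read_csv("flights-*.csv")
mean_delay = df.groupby("carrier")["dep_delay"].mean()

# Nothing runs until compute() triggers parallel execution
print(mean_delay.compute())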
Learning from R
● Domain-specific language culture (“same
code, different backends”)
● Non-standard evaluation
○ Inspect and manipulate unevaluated code fragments (a rough Python analogue is sketched below)
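Python lacks R-style non-standard evaluation, but pandas approximates the flavor by accepting expressions as strings that it parses itself; a small sketch:

import pandas as pd

df = pd.DataFrame({"arr_delay": [12, 45, 3], "dep_delay": [50, 2, 8]})

# The expression arrives unevaluated (as a string) and pandas parses it,
# loosely analogous to R inspecting unevaluated code fragments
late = df.query("arr_delay > 30 or dep_delay > 30")
print(late)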
Arrow’s relationship with dplyr and friends

flights %>%
  group_by(year, month, day) %>%
  select(arr_delay, dep_delay) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)

● flights can be a massive Arrow dataset
● dplyr verbs can be translated to Arrow computation graphs, executed by a parallel runtime
● R expressions can be JIT-compiled with LLVM
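The Python counterpart for scanning a massive on-disk dataset is pyarrow's datasets layer; a minimal sketch (directory layout and column names are hypothetical) showing column projection and filter pushdown, though not the grouped aggregation from the dplyr example:

import pyarrow.dataset as ds

# A dataset can span many Parquet files and exceed available memory
dataset = ds.dataset("flights/", format="parquet")

# Only the selected columns and matching rows are materialized
table = dataset.to_table(
    columns=["year", "month", "day", "arr_delay", "dep_delay"],
    filter=(ds.field("arr_delay") > 30) | (ds.field("dep_delay") > 30),
)
print(table.num_rows)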
Funding ambitious new open source projects
Some Partners
● https://ursalabs.org
● Apache Arrow-powered
Data Science Tools
● Funded by corporate
partners
● Built in collaboration with
RStudio
Looking forward
