Wes McKinney @wesmckinn
PyCon Colombia 2020
Python for Data Analysis:
Past, Present, and Future
Wes’s professional timeline
● pandas (2008)
● DataPad (2013)
● Apache Arrow (2014 — Present)
Perspectives on the last 12 years
January 2020: pandas 1.0
● 26th major release after 10 years of
development
● ~2000 unique contributors
Thanks, Indeed!
Dec 2009 - pandas 0.1
● First open source release after ~18 months
of proprietary use
● Still on PyPI!
Funding pandas development
● pandas received its first formal grant in 2019 from the Chan Zuckerberg Initiative
● Core devs primarily volunteers, self-funded,
or company-funded (Anaconda, others)
The early pandas gang (2011 - 2012)
Wes McKinney Chang She Adam Klein
pandas’s amazing Core Dev Team
Core Dev Meetup,
2019
Jeff Reback, Tom Augspurger, Brock Mendel, Marc Garcia, Joris van den Bossche
Partial cast of characters
Community engagement
Python’s journey to mainstream data language
"We believe that in the coming years there will be
great opportunity to attract users in need of
statistical data analysis tools to Python who might
have previously chosen R, MATLAB, or another
research environment. By designing robust, easy
to-use data structures that cohere with the rest of the
scientific Python stack, we can make Python
compelling choice for data analysis applications. In
our opinion, pandas provides a solid foundation upon
which a very powerful data analysis ecosystem can
be established."
Me, Proceedings of SciPy 2011
StackOverflow data from September 2017
Factors driving Python’s growth
Contributing factors
● Massive need for data wranglers + scientists
● “Perfect storm” of necessary packages
● New data science education
● Successful early adopters
● Packaging improvements
Perfect storm of packages
View from 2008
Confronting Fear, Uncertainty, Doubt
Common concerns
● Large codebase concerns
● Long-term software lifecycle
● Interpreted languages
○ ... unsafe?
○ ... slow?
● Open source… trustworthy?
May 2011 - “PyData” core dev meetings
"Need a toolset that is robust, fast, and suitable for a production environment..."
"... but also good for interactive research..."
"... and easy / intuitive for non-software engineers to use"
* also, we need to fix packaging
July 2011 - Concerns
"... the current state of affairs has me rather
anxious … these tools [e.g. pandas] have
largely not been integrated with any other tools
because of the community's collective
commitment anxiety"
http://wesmckinney.com/blog/a-roadmap-for-rich-scientific-data-structures-in-python/
Reading CSV files
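As a rough illustration of the workload this slide benchmarked, here is a minimal sketch (the file name flights.csv is hypothetical) comparing pandas' default parser with Arrow's multithreaded one:

import pandas as pd
import pyarrow.csv as pv

# pandas' C parser reads on a single thread by default
df = pd.read_csv("flights.csv")

# Arrow's CSV reader parses in parallel and can hand the result to pandas
table = pv.read_csv("flights.csv")
df2 = table.to_pandas()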
Python for Data Analysis book - 2012
● A primer on data manipulation in Python
● Focus: NumPy, IPython/Jupyter, pandas, matplotlib
● 2 editions (2012, 2017)
● 8 translations so far
PyData NYC 2013: 10 Things I Hate About pandas
● November 2013
● Summary: “pandas is
not designed like, or
intended to be used
as, a database query
engine”
Fall 2014: Python in a Big Data World
Task: Helping Python become a first-class technology for Big Data
Some Problems
● File formats
● JVM interop
● Non-array-oriented
interfaces
Difficulties in pandas (and R) dataframes
● Limited built-in data types
● Performance and memory use issues
● Challenges with larger-than-memory datasets
● Naive execution strategies (no “query
optimization”)
Does not cut down trees.
Out of memory on 10GB of CSVs
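A common mitigation at the time, sketched here with a hypothetical file and column name, was to stream the CSV in chunks instead of materializing all 10GB at once:

import pandas as pd

# Aggregate incrementally so only one chunk is in memory at a time
total = 0.0
for chunk in pd.read_csv("big.csv", chunksize=1_000_000):
    total += chunk["amount"].sum()  # "amount" is a hypothetical column
print(total)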
A … of doubt
Changing the tides
… and others
Fragmentation of data and code
Other thoughts
● Projects like pandas may be taking
responsibility for too many things
● It would be more productive (long-term) to
have a reusable computational foundation
for data frames
● New data frame format designed for speed
● Computational foundation for data processing libraries
● Fast cross-language data interchange
[Diagram: Arrow memory as the shared hub between the JVM Data Ecosystem, Database Systems, and Data Science Libraries]
Defragmenting Data
● https://github.com/apache/arrow
● Over 400 unique contributors
● Some level of support for 11 programming
languages
Important features
● CPU/GPU-friendly columnar memory layout
● Memory map huge datasets
● Relocate data structures without serialization
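For example, memory-mapping lets Arrow open files much larger than RAM and read them without copying; a minimal sketch, assuming an Arrow IPC file at a hypothetical path:

import pyarrow as pa
import pyarrow.ipc as ipc

# Map the file into the address space; pages load lazily on access
with pa.memory_map("data.arrow", "r") as source:
    table = ipc.open_file(source).read_all()  # zero-copy views of the mapped buffers
print(table.num_rows)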
Arrow C++ Platform
[Diagram: a multi-core work scheduler and core data platform, with a query engine, datasets framework, and Arrow Flight RPC layered above network and storage]
“New Data Frame” projects
● dask.dataframe (see the sketch after this list)
● Modin
● NVIDIA RAPIDS
● Vaex
● … and more surely in development
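These projects generally mirror the pandas API over a different execution backend. A minimal dask.dataframe sketch (the file glob and column names are hypothetical):

import dask.dataframe as dd

# Looks like pandas, but each call extends a lazy task graph
df = dd.read_csv("flights-*.csv")
mean_delay = df.groupby("carrier")["dep_delay"].mean()

# Nothing runs until compute() triggers parallel execution
print(mean_delay.compute())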
Learning from R
● Domain-specific language culture (“same
code, different backends”)
● Non-standard evaluation
○ Inspect and manipulate unevaluated code fragments (a rough Python analogue is sketched below)
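Python lacks R-style non-standard evaluation, but pandas approximates the flavor by accepting expressions as strings that it parses itself; a small sketch:

import pandas as pd

df = pd.DataFrame({"arr_delay": [12, 45, 3], "dep_delay": [50, 2, 8]})

# The expression arrives unevaluated (as a string) and pandas parses it,
# loosely analogous to R inspecting unevaluated code fragments
late = df.query("arr_delay > 30 or dep_delay > 30")
print(late)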
Arrow’s relationship with dplyr and friends

flights %>%
  group_by(year, month, day) %>%
  select(arr_delay, dep_delay) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)

● flights can be a massive Arrow dataset
● dplyr verbs can be translated to Arrow computation graphs, executed by a parallel runtime
● R expressions can be JIT-compiled with LLVM
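The Python counterpart for scanning a massive on-disk dataset is pyarrow's datasets layer; a minimal sketch (directory layout and column names are hypothetical) showing column projection and filter pushdown, though not the grouped aggregation from the dplyr example:

import pyarrow.dataset as ds

# A dataset can span many Parquet files and exceed available memory
dataset = ds.dataset("flights/", format="parquet")

# Only the selected columns and matching rows are materialized
table = dataset.to_table(
    columns=["year", "month", "day", "arr_delay", "dep_delay"],
    filter=(ds.field("arr_delay") > 30) | (ds.field("dep_delay") > 30),
)
print(table.num_rows)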
Funding ambitious new open source projects
Some Partners
● https://ursalabs.org
● Apache Arrow-powered
Data Science Tools
● Funded by corporate
partners
● Built in collaboration with
RStudio
Looking forward
