PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future

Wes McKinney, Director of Ursa Labs, Open Source Developer at Ursa Labs
Wes McKinney @wesmckinn
Wes’s professional timeline
pandas
DataPad
2008 2013 2014 — Present
Apache Arrow
Perspectives on
the last 12 years
January 2020: pandas 1.0
● 26th major release after 10 years of development
● ~2000 unique contributors
Thanks, Indeed!
Dec 2009 - pandas 0.1
● First open source release after ~18 months of proprietary use
● Still on PyPI!
Funding pandas development
● pandas received first formal grant in 2019 from Chan-Zuckerberg Initiative
● Core devs primarily volunteers, self-funded, or company-funded (Anaconda, others)
The early pandas gang (2011 - 2012)
Wes McKinney, Chang She, Adam Klein
pandas’s amazing Core Dev Team
Core Dev Meetup, 2019
Partial cast of characters: Jeff Reback, Tom Augspurger, Brock Mendel, Marc Garcia, Joris van den Bossche
Community engagement
Python’s journey to
mainstream data
language
"We believe that in the coming years there will be
great opportunity to attract users in need of
statistical data analysis tools to Python who might
have previously chosen R, MATLAB, or another
research environment. By designing robust,
easy-to-use data structures that cohere with the rest of the
scientific Python stack, we can make Python a
compelling choice for data analysis applications. In
our opinion, pandas provides a solid foundation upon
which a very powerful data analysis ecosystem can
be established."
Me, Proceedings of SciPy 2011
StackOverflow
data from
September 2017
Factors driving
Python’s growth
Contributing factors
● Massive need for data wranglers + scientists
● “Perfect storm” of necessary packages
● New data science education
● Successful early adopters
● Packaging improvements
Perfect storm of packages
View from 2008
Confronting
Fear
Uncertainty
Doubt
Common concerns
● Large codebase concerns
● Long-term software lifecycle
● Interpreted languages
  ○ ... unsafe?
  ○ ... slow?
● Open source… trustworthy?
May 2011 - “PyData” core dev meetings
"Need a toolset that is robust, fast, and suitable for a production environment..."
"... but also good for interactive research... "
"... and easy / intuitive for non-software engineers to use"
* also, we need to fix packaging
July 2011 - Concerns
"... the current state of affairs has me rather
anxious … these tools [e.g. pandas] have
largely not been integrated with any other tools
because of the community's collective
commitment anxiety"
http://wesmckinney.com/blog/a-roadmap-for-rich-scientific-data-structures-in-python/
Reading CSV files
Python for Data Analysis book - 2012
● A primer in data manipulation in Python
● Focus: NumPy, IPython/Jupyter, pandas, matplotlib
● 2 editions (2012, 2017)
● 8 translations so far
PyData NYC 2013: 10 Things I Hate About pandas
● November 2013
● Summary: “pandas is not designed like, or intended to be used as, a database query engine”
Fall 2014: Python in a Big Data World
Task: Helping Python become a first-class technology for Big Data
Some Problems
● File formats
● JVM interop
● Non-array-oriented interfaces
Difficulties in pandas (and R) dataframes
● Limited built-in data types
● Performance and memory use issues
● Challenges with larger-than-memory datasets
● Naive execution strategies (no “query optimization”)
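The larger-than-memory difficulty can be illustrated without any data frame library. The following is a minimal sketch, not something from the talk: it assumes a hypothetical CSV with a numeric column named `delay`, and the helper name `streaming_mean` is made up for illustration. Instead of loading the whole file, it reads one row at a time and keeps only a running sum and count.

```python
import csv
import io


def streaming_mean(lines, column):
    """Mean of one numeric CSV column, reading one row at a time.

    `lines` is any iterable of text lines (an open file works), so memory
    use stays roughly constant no matter how large the input is.
    """
    total = 0.0
    count = 0
    for row in csv.DictReader(lines):
        cell = row[column]
        if cell == "":  # treat empty cells as missing values and skip them
            continue
        total += float(cell)
        count += 1
    return total / count if count else float("nan")


# Small in-memory stand-in for a large file on disk (hypothetical data).
sample = io.StringIO("flight,delay\nA1,10\nA2,\nA3,50\n")
print(streaming_mean(sample, "delay"))
```

The trade-off is that every new aggregate needs its own hand-written loop, which is part of the motivation for a reusable computational foundation.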
Does not cut down trees.
Out of memory on 10GB of CSVs
A
of doubt
Changing the tides
… and others
Fragmentation of data
and code
Other thoughts
● Projects like pandas may be taking responsibility for too many things
● It would be more productive (long-term) to have a reusable computational foundation for data frames
● New data frame format designed for speed
● Computational foundation for data processing libraries
● Fast cross-language data interchange
Diagram labels: Arrow memory; JVM Data Ecosystem; Database Systems; Data Science Libraries
Defragmenting Data
● https://github.com/apache/arrow
● Over 400 unique contributors
● Some level of support for 11 programming languages
Important features
● CPU/GPU-friendly columnar memory layout
● Memory map huge datasets
● Relocate data structures without serialization
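To make "columnar layout" and "memory map without serialization" concrete, here is a rough standard-library analogy. It is not Arrow's actual format or API (Arrow's real format carries more, such as schemas and null handling); the file name `delays.bin` and the data are assumptions for illustration. A column of float64 values is stored contiguously, written to disk as raw bytes, then memory-mapped and read back as numbers without a parsing or copying step.

```python
import mmap
import os
import tempfile
from array import array

# A "column" of float64 values stored contiguously, as in a columnar layout.
delays = array("d", [12.5, 40.0, 7.5, 31.0])

# Write the raw column bytes to a file (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "delays.bin")
with open(path, "wb") as f:
    f.write(delays.tobytes())

# Memory-map the file and view the bytes as float64 values in place:
# the on-disk layout is the same as the in-memory layout, so nothing
# has to be deserialized.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        with memoryview(mapped) as raw, raw.cast("d") as column:
            total = sum(column)
            n = len(column)

print(total / n)
```

Because the bytes on disk are already in their usable form, the operating system can page data in on demand, which is what makes mapping very large datasets practical.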
Arrow C++ Platform
● Multi-core Work Scheduler
● Core Data Platform
● Query Engine
● Datasets Framework
● Arrow Flight RPC
● Network
● Storage
“New Data Frame” projects
● dask.dataframe
● Modin
● NVIDIA RAPIDS
● Vaex
● … and more surely in development
Learning from R
● Domain-specific language culture (“same code, different backends”)
● Non-standard evaluation
  ○ Inspect and manipulate unevaluated code fragments
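Python has no direct counterpart to R's non-standard evaluation, but operator overloading gives a similar effect for expressions. The sketch below is illustrative only: the class names (`Col`, `Lit`, `BinOp`) and helpers are hypothetical and not taken from any library. Writing `Col("arr") > 30` builds a small tree that can be inspected as text or handed to a backend, rather than computing a value immediately.

```python
import operator
from dataclasses import dataclass


class Expr:
    """Base class: operators build a tree instead of computing a value."""

    def __gt__(self, other):
        return BinOp(">", self, wrap(other))

    def __or__(self, other):
        return BinOp("|", self, wrap(other))


@dataclass(frozen=True)
class Col(Expr):
    name: str


@dataclass(frozen=True)
class Lit(Expr):
    value: object


@dataclass(frozen=True)
class BinOp(Expr):
    op: str
    left: Expr
    right: Expr


def wrap(value):
    """Turn plain Python values into literal nodes."""
    return value if isinstance(value, Expr) else Lit(value)


def show(expr):
    """Inspect the captured, unevaluated expression as text."""
    if isinstance(expr, Col):
        return expr.name
    if isinstance(expr, Lit):
        return repr(expr.value)
    return f"({show(expr.left)} {expr.op} {show(expr.right)})"


OPS = {">": operator.gt, "|": operator.or_}


def evaluate(expr, row):
    """One possible backend: evaluate the tree against a dict row."""
    if isinstance(expr, Col):
        return row[expr.name]
    if isinstance(expr, Lit):
        return expr.value
    return OPS[expr.op](evaluate(expr.left, row), evaluate(expr.right, row))


predicate = (Col("arr") > 30) | (Col("dep") > 30)
print(show(predicate))
print(evaluate(predicate, {"arr": 12, "dep": 45}))
```

The same `predicate` object could be given to a different `evaluate` implementation, which is the "same code, different backends" idea in miniature.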
Arrow’s relationship with dplyr and friends
flights %>%
  group_by(year, month, day) %>%
  select(arr_delay, dep_delay) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)
● Can be a massive Arrow dataset
● dplyr verbs can be translated to Arrow computation graphs, executed by parallel runtime
● R expressions can be JIT-compiled with LLVM
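The idea that verbs can be translated into a computation graph and executed later can be sketched in plain Python. This is a toy under stated assumptions, not how dplyr or Arrow are implemented: `LazyTable`, `summarise_mean`, and the sample rows are hypothetical, and execution here is sequential rather than parallel. Each verb only appends a step to a plan; work happens when `collect()` is called.

```python
from collections import defaultdict


class LazyTable:
    """Records verbs as a plan; nothing runs until collect()."""

    def __init__(self, rows, plan=()):
        self.rows = rows
        self.plan = tuple(plan)

    def _then(self, step):
        return LazyTable(self.rows, self.plan + (step,))

    def group_by(self, *keys):
        return self._then(("group_by", keys))

    def summarise_mean(self, **outputs):
        # outputs maps new column name -> source column name
        return self._then(("summarise_mean", outputs))

    def filter(self, predicate):
        return self._then(("filter", predicate))

    def collect(self):
        """A simple sequential backend that walks the recorded plan."""
        rows, keys = self.rows, ()
        for verb, arg in self.plan:
            if verb == "group_by":
                keys = arg
            elif verb == "summarise_mean":
                rows = _grouped_mean(rows, keys, arg)
                keys = ()
            elif verb == "filter":
                rows = [r for r in rows if arg(r)]
        return list(rows)


def _grouped_mean(rows, keys, outputs):
    groups = defaultdict(list)
    for r in rows:
        groups[tuple(r[k] for k in keys)].append(r)
    result = []
    for key, members in groups.items():
        out = dict(zip(keys, key))
        for new, src in outputs.items():
            # Skip missing values, similar in spirit to na.rm = TRUE.
            values = [m[src] for m in members if m[src] is not None]
            out[new] = sum(values) / len(values) if values else None
        result.append(out)
    return result


# Hypothetical sample data.
flights = [
    {"month": 1, "day": 1, "arr_delay": 50, "dep_delay": 20},
    {"month": 1, "day": 1, "arr_delay": 30, "dep_delay": None},
    {"month": 1, "day": 2, "arr_delay": 5, "dep_delay": 10},
]

query = (
    LazyTable(flights)
    .group_by("month", "day")
    .summarise_mean(arr="arr_delay", dep="dep_delay")
    .filter(lambda r: r["arr"] > 30 or r["dep"] > 30)
)

print([verb for verb, _ in query.plan])  # the plan can be inspected before running
print(query.collect())
```

Because the plan is plain data, a different backend could reorder, optimize, or parallelize the steps before running them, which a library that executes each call eagerly cannot do.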
Funding ambitious
new open source
projects
Some Partners
● https://ursalabs.org
● Apache Arrow-powered Data Science Tools
● Funded by corporate partners
● Built in collaboration with RStudio
Looking forward