Data Analysis and Statistics in Python using pandas and statsmodels

Wes McKinney
Director of Ursa Labs, Open Source Developer at Ursa Labs
Statistics and Data Analysis in Python with pandas and statsmodels
                          Wes McKinney @wesmckinn

                NYC Open Statistical Programming Meetup
                              9/14/2011

Thursday, September 15,
Talk Overview
                 • Statistical Computing Big Picture
                 • Scientific Python Stack
                 • pandas
                 • statsmodels
                 • Ideas for the (near) future

Who am I?


                    MIT Math        AQR: Quant Finance



               Back to NYC

                                         Statistics

The Big Picture

                 • Building the “next generation” statistical computing environment
                 • Making data analysis / statistics more intuitive, flexible, powerful
                 • Closing the “research-production” gap

Application areas

                 • General data munging, manipulation
                 • Financial modeling and analytics
                 • Statistical modeling and econometrics
                 • “Enterprise” / “Big Data” analytics?

R, the solution?
      Hadley Wickham (ggplot2, plyr, reshape, ...)


                     “R is the most powerful statistical computing language on the planet”




Easy to miss the point




R, the solution?
      Ross Ihaka (one of the creators of R)

                “I have been worried for some time that R isn’t going to provide the base that we’re going to need for statistical computation in the future. (It may well be that the future is already upon us.) ... I have come to the conclusion that rather than ‘fixing’ R, it would be much more productive to simply start over and build something better”



Some of my gripes about R
                 • Wonky, highly idiosyncratic programming language*
                 • Poor speed and memory usage
                 • General purpose libraries and software development tools lacking
                 • The GPL
                 * But yes, really great libraries

R: great libraries and deep connections to academia
                 Example R superstars
                 • Jeff Ryan: xts, quantmod
                 • Hadley Wickham: ggplot2, plyr, reshape

Uniting against common enemies




“Research-Production” Gap

                 • Best data analysis / statistics tools: often least well-suited for building production systems
                 • The “Black Box”: embedding or RPC
                 • High productivity <=> Low productivity

“Research-Production” Gap

                 • Production: much more than crunching data and making pretty plots
                 • Code readability, debuggability, maintainability matter a lot in the long run
                 • Integration with other systems

“Research-Production” Gap




My assertion

                   Python is the best (only?) viable solution to the Research-Production gap


Scientific Python Stack
                 • Incredible growth in libraries and tools over the last 5 years
                      • NumPy: the cornerstone
                      • Killer app: IPython
                      • Cython: C speedups, 80+% less dev time
                 • Other exciting high-profile projects: scikit-learn, theano, sympy


Uniting the Python Community
                 • Fragmentation is a (big) problem / risk
                 • Statistical libraries need to be able to talk to each other easily
                 • R’s success: S-Plus legacy + quality CRAN packages built around cohesive base R / data structures



pandas
                 • Foundational rich data structures and data analysis tools
                 • Arrays with labeled axes and support for heterogeneous data
                 • Similar to R data.frame, but with many more built-in features
                 • Missing data, time series support

pandas

                 • Milestone: 0.4 release 9/12/2011
                 • Dozens of new features and enhancements
                 • Completely rewritten docs: pandas.sf.net
                 • Many more new features planned for the future
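
The data structures described on the pandas slides (labeled axes, heterogeneous columns, missing data) can be sketched roughly as follows. This is a minimal illustration, not code from the talk: the column names and values are made up, and it assumes a current pandas release rather than the 0.4 release mentioned above.

```python
import numpy as np
import pandas as pd

# Made-up sample data: three columns with different dtypes.
frame = pd.DataFrame(
    {
        "name": ["alpha", "beta", "gamma"],   # strings
        "value": [380.0, np.nan, 26.5],       # floats, one missing value
        "count": [100, 50, 200],              # integers
    },
    index=["a", "b", "c"],                    # row labels
)

# Select by row and column label rather than by integer position.
value_c = frame.loc["c", "value"]

# Missing values are skipped by default in reductions.
mean_value = frame["value"].mean()

# Count the missing entries in each column.
missing_per_column = frame.isna().sum()
```

Each column keeps its own dtype, which is the main difference from a plain two-dimensional NumPy array.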



The sleeping dragon




Little did I know...




pandas: some key features

                 • Automatic and explicit data alignment
                 • Label-based (incl. hierarchical) indexing
                 • GroupBy, pivoting, and reshaping
                 • Missing data support
                 • Time series functionality
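
A few of the features listed above (data alignment, hierarchical indexing, missing data) can be illustrated with a short sketch. The values and labels are invented for the example, and the calls assume a current pandas release.

```python
import pandas as pd

# Two Series with overlapping but not identical labels (made-up values).
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Automatic alignment: arithmetic matches on labels, not positions.
# Labels present in only one operand produce NaN.
total = s1 + s2

# Explicit alignment: reindex to a chosen set of labels.
s1_on_s2 = s1.reindex(s2.index)

# Missing data support: treat an absent label as 0 instead of propagating NaN.
total_filled = s1.add(s2, fill_value=0.0)

# Hierarchical (multi-level) label-based indexing.
idx = pd.MultiIndex.from_tuples(
    [("x", 1), ("x", 2), ("y", 1)], names=["group", "item"]
)
h = pd.Series([0.1, 0.2, 0.3], index=idx)
x_items = h.loc["x"]          # all rows in the outer group "x"
one_value = h.loc[("y", 1)]   # a single fully specified label
```

Here `total` has NaN at labels "a" and "d", because each appears in only one of the two inputs.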

Demo time
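
The live demo is not captured in this transcript. As a stand-in, and not a reconstruction of what was shown, here is a minimal sketch of GroupBy, pivoting, and time series resampling from the feature list. The data is made up and the calls assume a current pandas release.

```python
import pandas as pd

# Made-up daily observations for two hypothetical keys "A" and "B".
dates = pd.date_range("2011-09-01", periods=4, freq="D")
long_df = pd.DataFrame(
    {
        "date": list(dates) * 2,
        "key": ["A"] * 4 + ["B"] * 4,
        "value": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
    }
)

# GroupBy: split by key, apply a reduction, combine the results.
means = long_df.groupby("key")["value"].mean()

# Pivoting: reshape from long format to one column per key.
wide = long_df.pivot(index="date", columns="key", values="value")

# Time series: downsample the date-indexed frame to 2-day sums.
two_day = wide.resample("2D").sum()
```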



statsmodels
                 • Statistics and econometrics in Python
                 • Focused on estimation of statistical models
                      • Regression models (GLS, Robust LM, ...)
                      • Time series models (AR/ARMA, VAR, Kalman Filter, ...)
                      • Non-parametric models (e.g. KDE)

statsmodels
                 • Development has been largely focused on computation
                      • Correct, tested results
                 • In progress: better user interface
                      • Formula frameworks (e.g. similar to R)
                      • pandas integration
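
The slide describes formula support and pandas integration as work in progress. In later statsmodels releases this took the form of `statsmodels.formula.api`, so the sketch below shows how that looks today, not the interface available at the time of the talk. The column names and coefficients are made up.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (made up for illustration), held in a pandas DataFrame.
rng = np.random.default_rng(1)
df = pd.DataFrame({"x1": rng.normal(size=300), "x2": rng.normal(size=300)})
df["y"] = 0.5 + 1.5 * df["x1"] - 2.0 * df["x2"] + rng.normal(scale=0.1, size=300)

# R-style formula; the intercept is included implicitly.
fit = smf.ols("y ~ x1 + x2", data=df).fit()

# Parameters come back as a pandas Series labeled by term name.
coefs = fit.params
```

Because the result is labeled, `coefs["x1"]` retrieves a coefficient by name instead of by column position.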

Demo time



Ideas for the future

                 • ggpy: ggplot2 for Python
                 • Statistical Python Distribution / Umbrella project
                 • Interactive GUI widgets to visualize / explore data and statsmodels results



Thanks

                 • pandas: http://pandas.sf.net
                 • statsmodels: http://statsmodels.sf.net
                 • Twitter: @wesmckinn
                 • E-mail: wesmckinn (at) gmail (dot) com
                 • Blog: http://blog.wesmckinney.com

Thursday, September 15,
1 of 29

Recommended

Basic of python for data analysis by
Basic of python for data analysisBasic of python for data analysis
Basic of python for data analysisPramod Toraskar
433 views14 slides
Introduction to Python Pandas for Data Analytics by
Introduction to Python Pandas for Data AnalyticsIntroduction to Python Pandas for Data Analytics
Introduction to Python Pandas for Data AnalyticsPhoenix
1.7K views115 slides
Intro to Classification: Logistic Regression & SVM by
Intro to Classification: Logistic Regression & SVMIntro to Classification: Logistic Regression & SVM
Intro to Classification: Logistic Regression & SVMNYC Predictive Analytics
23.9K views54 slides
Data science applications and usecases by
Data science applications and usecasesData science applications and usecases
Data science applications and usecasesSreenatha Reddy K R
2.8K views16 slides
Python for Data Science by
Python for Data SciencePython for Data Science
Python for Data ScienceHarri Hämäläinen
17.1K views39 slides
Class ppt intro to r by
Class ppt intro to rClass ppt intro to r
Class ppt intro to rJigsawAcademy2014
18.9K views51 slides

More Related Content

What's hot

Data Science With Python by
Data Science With PythonData Science With Python
Data Science With PythonMosky Liu
5.2K views54 slides
CART – Classification & Regression Trees by
CART – Classification & Regression TreesCART – Classification & Regression Trees
CART – Classification & Regression TreesHemant Chetwani
6.5K views25 slides
Descriptive Statistics with R by
Descriptive Statistics with RDescriptive Statistics with R
Descriptive Statistics with RKazuki Yoshida
4.4K views51 slides
PPT on Data Science Using Python by
PPT on Data Science Using PythonPPT on Data Science Using Python
PPT on Data Science Using PythonNishantKumar1179
7.4K views42 slides
Statistics for data scientists by
Statistics for  data scientistsStatistics for  data scientists
Statistics for data scientistsAjay Ohri
14.2K views40 slides
1.11.association mining 3 by
1.11.association mining 31.11.association mining 3
1.11.association mining 3Krish_ver2
2.6K views34 slides

What's hot(20)

Data Science With Python by Mosky Liu
Data Science With PythonData Science With Python
Data Science With Python
Mosky Liu5.2K views
CART – Classification & Regression Trees by Hemant Chetwani
CART – Classification & Regression TreesCART – Classification & Regression Trees
CART – Classification & Regression Trees
Hemant Chetwani6.5K views
Descriptive Statistics with R by Kazuki Yoshida
Descriptive Statistics with RDescriptive Statistics with R
Descriptive Statistics with R
Kazuki Yoshida4.4K views
Statistics for data scientists by Ajay Ohri
Statistics for  data scientistsStatistics for  data scientists
Statistics for data scientists
Ajay Ohri14.2K views
1.11.association mining 3 by Krish_ver2
1.11.association mining 31.11.association mining 3
1.11.association mining 3
Krish_ver22.6K views
Data Mining: Concepts and Techniques — Chapter 2 — by Salah Amean
Data Mining:  Concepts and Techniques — Chapter 2 —Data Mining:  Concepts and Techniques — Chapter 2 —
Data Mining: Concepts and Techniques — Chapter 2 —
Salah Amean7.2K views
Big Data: Its Characteristics And Architecture Capabilities by Ashraf Uddin
Big Data: Its Characteristics And Architecture CapabilitiesBig Data: Its Characteristics And Architecture Capabilities
Big Data: Its Characteristics And Architecture Capabilities
Ashraf Uddin7.5K views
Linear Regression With R by Edureka!
Linear Regression With RLinear Regression With R
Linear Regression With R
Edureka!4.9K views
Big Data [sorry] & Data Science: What Does a Data Scientist Do? by Data Science London
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Data Science London147.9K views
Introduction to Statistical Machine Learning by mahutte
Introduction to Statistical Machine LearningIntroduction to Statistical Machine Learning
Introduction to Statistical Machine Learning
mahutte6.5K views
Dimensionality Reduction by mrizwan969
Dimensionality ReductionDimensionality Reduction
Dimensionality Reduction
mrizwan96912.6K views
Association rule mining and Apriori algorithm by hina firdaus
Association rule mining and Apriori algorithmAssociation rule mining and Apriori algorithm
Association rule mining and Apriori algorithm
hina firdaus6.3K views
Machine Learning and Data Mining: 14 Evaluation and Credibility by Pier Luca Lanzi
Machine Learning and Data Mining: 14 Evaluation and CredibilityMachine Learning and Data Mining: 14 Evaluation and Credibility
Machine Learning and Data Mining: 14 Evaluation and Credibility
Pier Luca Lanzi15.2K views
Data engineering and analytics using python by Purna Chander
Data engineering and analytics using pythonData engineering and analytics using python
Data engineering and analytics using python
Purna Chander1.2K views
Introduction to Data Visualization by Stephen Tracy
Introduction to Data VisualizationIntroduction to Data Visualization
Introduction to Data Visualization
Stephen Tracy11.3K views
Lecture1 introduction to big data by hktripathy
Lecture1 introduction to big dataLecture1 introduction to big data
Lecture1 introduction to big data
hktripathy5K views
Introduction to R for data science by Long Nguyen
Introduction to R for data scienceIntroduction to R for data science
Introduction to R for data science
Long Nguyen1.3K views

Viewers also liked

pandas: a Foundational Python Library for Data Analysis and Statistics by
pandas: a Foundational Python Library for Data Analysis and Statisticspandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and StatisticsWes McKinney
60.7K views25 slides
Data Structures for Statistical Computing in Python by
Data Structures for Statistical Computing in PythonData Structures for Statistical Computing in Python
Data Structures for Statistical Computing in PythonWes McKinney
89.5K views36 slides
Python for Financial Data Analysis with pandas by
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasWes McKinney
61.8K views22 slides
pandas: Powerful data analysis tools for Python by
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for PythonWes McKinney
9.8K views38 slides
Statistical inference for (Python) Data Analysis. An introduction. by
Statistical inference for (Python) Data Analysis. An introduction.Statistical inference for (Python) Data Analysis. An introduction.
Statistical inference for (Python) Data Analysis. An introduction.Piotr Milanowski
2K views25 slides
pandas - Python Data Analysis by
pandas - Python Data Analysispandas - Python Data Analysis
pandas - Python Data AnalysisAndrew Henshaw
15.8K views23 slides

Viewers also liked(15)

pandas: a Foundational Python Library for Data Analysis and Statistics by Wes McKinney
pandas: a Foundational Python Library for Data Analysis and Statisticspandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statistics
Wes McKinney60.7K views
Data Structures for Statistical Computing in Python by Wes McKinney
Data Structures for Statistical Computing in PythonData Structures for Statistical Computing in Python
Data Structures for Statistical Computing in Python
Wes McKinney89.5K views
Python for Financial Data Analysis with pandas by Wes McKinney
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandas
Wes McKinney61.8K views
pandas: Powerful data analysis tools for Python by Wes McKinney
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Python
Wes McKinney9.8K views
Statistical inference for (Python) Data Analysis. An introduction. by Piotr Milanowski
Statistical inference for (Python) Data Analysis. An introduction.Statistical inference for (Python) Data Analysis. An introduction.
Statistical inference for (Python) Data Analysis. An introduction.
Piotr Milanowski2K views
pandas - Python Data Analysis by Andrew Henshaw
pandas - Python Data Analysispandas - Python Data Analysis
pandas - Python Data Analysis
Andrew Henshaw15.8K views
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P... by Wes McKinney
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Wes McKinney103.9K views
Lesson04_new by shengvn
Lesson04_newLesson04_new
Lesson04_new
shengvn788 views
Recurrent Neural Networks in 10 minutes or less by Tal Perry
Recurrent Neural Networks  in 10 minutes or lessRecurrent Neural Networks  in 10 minutes or less
Recurrent Neural Networks in 10 minutes or less
Tal Perry634 views
What's new in pandas and the SciPy stack for financial users by Wes McKinney
What's new in pandas and the SciPy stack for financial usersWhat's new in pandas and the SciPy stack for financial users
What's new in pandas and the SciPy stack for financial users
Wes McKinney11.8K views
A look inside pandas design and development by Wes McKinney
A look inside pandas design and developmentA look inside pandas design and development
A look inside pandas design and development
Wes McKinney29.5K views
Ibis: Scaling the Python Data Experience by Wes McKinney
Ibis: Scaling the Python Data ExperienceIbis: Scaling the Python Data Experience
Ibis: Scaling the Python Data Experience
Wes McKinney3.8K views
My Data Journey with Python (SciPy 2015 Keynote) by Wes McKinney
My Data Journey with Python (SciPy 2015 Keynote)My Data Journey with Python (SciPy 2015 Keynote)
My Data Journey with Python (SciPy 2015 Keynote)
Wes McKinney7.4K views

Similar to Data Analysis and Statistics in Python using pandas and statsmodels

Phingified ci and deployment strategies ipc 2012 by
Phingified ci and deployment strategies ipc 2012Phingified ci and deployment strategies ipc 2012
Phingified ci and deployment strategies ipc 2012TEQneers GmbH & Co. KG
3.1K views37 slides
Python for Data: Past, Present, Future (PyCon JP 2017 Keynote) by
Python for Data: Past, Present, Future (PyCon JP 2017 Keynote)Python for Data: Past, Present, Future (PyCon JP 2017 Keynote)
Python for Data: Past, Present, Future (PyCon JP 2017 Keynote)Peter Wang
2.4K views28 slides
LSESU a Taste of R Language Workshop by
LSESU a Taste of R Language WorkshopLSESU a Taste of R Language Workshop
LSESU a Taste of R Language WorkshopKorkrid Akepanidtaworn
1.6K views80 slides
State of Pyramid - Brasilia 2013 by
State of Pyramid - Brasilia 2013State of Pyramid - Brasilia 2013
State of Pyramid - Brasilia 2013plonepaul
826 views23 slides
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future by
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future Wes McKinney
2.1K views52 slides
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark by
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkThe Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkKrishna Sankar
8.2K views87 slides

Similar to Data Analysis and Statistics in Python using pandas and statsmodels(20)

Python for Data: Past, Present, Future (PyCon JP 2017 Keynote) by Peter Wang
Python for Data: Past, Present, Future (PyCon JP 2017 Keynote)Python for Data: Past, Present, Future (PyCon JP 2017 Keynote)
Python for Data: Past, Present, Future (PyCon JP 2017 Keynote)
Peter Wang2.4K views
State of Pyramid - Brasilia 2013 by plonepaul
State of Pyramid - Brasilia 2013State of Pyramid - Brasilia 2013
State of Pyramid - Brasilia 2013
plonepaul826 views
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future by Wes McKinney
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
Wes McKinney2.1K views
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark by Krishna Sankar
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkThe Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
Krishna Sankar8.2K views
PyCon Singapore 2013 Keynote by Wes McKinney
PyCon Singapore 2013 KeynotePyCon Singapore 2013 Keynote
PyCon Singapore 2013 Keynote
Wes McKinney94.6K views
Productive Data Tools for Quants by Wes McKinney
Productive Data Tools for QuantsProductive Data Tools for Quants
Productive Data Tools for Quants
Wes McKinney1.7K views
Parallel Programming in Python: Speeding up your analysis by Manojit Nandi
Parallel Programming in Python: Speeding up your analysisParallel Programming in Python: Speeding up your analysis
Parallel Programming in Python: Speeding up your analysis
Manojit Nandi1.2K views
Resources for Getting Started in Predictive Analytics by meepbobeep
Resources for Getting Started in Predictive AnalyticsResources for Getting Started in Predictive Analytics
Resources for Getting Started in Predictive Analytics
meepbobeep677 views
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward" by Wes McKinney
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
Wes McKinney1.1K views
Enterprise Drupal by thesnufkin
Enterprise DrupalEnterprise Drupal
Enterprise Drupal
thesnufkin989 views
Introduction to Python Syntax and Semantics by Adam Cook
Introduction to Python Syntax and SemanticsIntroduction to Python Syntax and Semantics
Introduction to Python Syntax and Semantics
Adam Cook262 views
Data mining with Rattle For R by Akhil Anil
Data mining with Rattle For RData mining with Rattle For R
Data mining with Rattle For R
Akhil Anil3.2K views
R programming language - Mustafa Wahedi by UNICORNS IN TECH
R programming language - Mustafa WahediR programming language - Mustafa Wahedi
R programming language - Mustafa Wahedi
UNICORNS IN TECH262 views
Proud to be polyglot! by NLJUG
Proud to be polyglot!Proud to be polyglot!
Proud to be polyglot!
NLJUG636 views
Data Science at Scale - The DevOps Approach by Mihai Criveti
Data Science at Scale - The DevOps ApproachData Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps Approach
Mihai Criveti126 views
PyData Texas 2015 Keynote by Peter Wang
PyData Texas 2015 KeynotePyData Texas 2015 Keynote
PyData Texas 2015 Keynote
Peter Wang2.6K views
From Developer to Data Scientist - Gaines Kergosien by ITCamp
From Developer to Data Scientist - Gaines KergosienFrom Developer to Data Scientist - Gaines Kergosien
From Developer to Data Scientist - Gaines Kergosien
ITCamp756 views

More from Wes McKinney

Solving Enterprise Data Challenges with Apache Arrow by
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
1.1K views31 slides
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity by
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityWes McKinney
1.1K views26 slides
Apache Arrow: High Performance Columnar Data Framework by
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkWes McKinney
1.4K views53 slides
New Directions for Apache Arrow by
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache ArrowWes McKinney
1.9K views27 slides
Apache Arrow Flight: A New Gold Standard for Data Transport by
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportWes McKinney
2.2K views31 slides
ACM TechTalks : Apache Arrow and the Future of Data Frames by
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesWes McKinney
2K views47 slides

More from Wes McKinney(20)

Solving Enterprise Data Challenges with Apache Arrow by Wes McKinney
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
Wes McKinney1.1K views
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity by Wes McKinney
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Wes McKinney1.1K views
Apache Arrow: High Performance Columnar Data Framework by Wes McKinney
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
Wes McKinney1.4K views
New Directions for Apache Arrow by Wes McKinney
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
Wes McKinney1.9K views
Apache Arrow Flight: A New Gold Standard for Data Transport by Wes McKinney
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
Wes McKinney2.2K views
ACM TechTalks : Apache Arrow and the Future of Data Frames by Wes McKinney
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
Wes McKinney2K views
Apache Arrow: Present and Future @ ScaledML 2020 by Wes McKinney
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
Wes McKinney970 views
Apache Arrow: Leveling Up the Analytics Stack by Wes McKinney
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
Wes McKinney1.4K views
Apache Arrow Workshop at VLDB 2019 / BOSS Session by Wes McKinney
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Wes McKinney2.5K views
Apache Arrow: Leveling Up the Data Science Stack by Wes McKinney
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
Wes McKinney3.5K views
Ursa Labs and Apache Arrow in 2019 by Wes McKinney
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
Wes McKinney4.2K views
Apache Arrow at DataEngConf Barcelona 2018 by Wes McKinney
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
Wes McKinney2K views
Apache Arrow: Cross-language Development Platform for In-memory Data by Wes McKinney
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
Wes McKinney6.6K views
Apache Arrow -- Cross-language development platform for in-memory data by Wes McKinney
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
Wes McKinney2.9K views
Shared Infrastructure for Data Science by Wes McKinney
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data Science
Wes McKinney8.5K views
Data Science Without Borders (JupyterCon 2017) by Wes McKinney
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
Wes McKinney6.2K views
Memory Interoperability in Analytics and Machine Learning by Wes McKinney
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
Wes McKinney5.6K views
Raising the Tides: Open Source Analytics for Data Science by Wes McKinney
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data Science
Wes McKinney3.2K views
Improving Python and Spark (PySpark) Performance and Interoperability by Wes McKinney
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and Interoperability
Wes McKinney19.8K views
Python Data Wrangling: Preparing for the Future by Wes McKinney
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the Future
Wes McKinney12.5K views

Recently uploaded

Empathic Computing: Delivering the Potential of the Metaverse by
Empathic Computing: Delivering  the Potential of the MetaverseEmpathic Computing: Delivering  the Potential of the Metaverse
Empathic Computing: Delivering the Potential of the MetaverseMark Billinghurst
478 views80 slides
Transcript: The Details of Description Techniques tips and tangents on altern... by
Transcript: The Details of Description Techniques tips and tangents on altern...Transcript: The Details of Description Techniques tips and tangents on altern...
Transcript: The Details of Description Techniques tips and tangents on altern...BookNet Canada
136 views15 slides
Business Analyst Series 2023 - Week 3 Session 5 by
Business Analyst Series 2023 -  Week 3 Session 5Business Analyst Series 2023 -  Week 3 Session 5

Data Analysis and Statistics in Python using pandas and statsmodels

  • 1. Statistics and Data Analysis in Python with pandas and statsmodels Wes McKinney @wesmckinn NYC Open Statistical Programming Meetup 9/14/2011 Thursday, September 15,
  • 2. Talk Overview • Statistical Computing Big Picture • Scientific Python Stack • pandas • statsmodels • Ideas for the (near) future
  • 3. Who am I? • MIT Math • AQR: Quant Finance • Back to NYC • Statistics
  • 4. The Big Picture • Building the “next generation” statistical computing environment • Making data analysis / statistics more intuitive, flexible, powerful • Closing the “research-production” gap
  • 5. Application areas • General data munging, manipulation • Financial modeling and analytics • Statistical modeling and econometrics • “Enterprise” / “Big Data” analytics?
  • 6. R, the solution? Hadley Wickham (ggplot2, plyr, reshape, ...): “R is the most powerful statistical computing language on the planet”
  • 7. Easy to miss the point
  • 8. R, the solution? Ross Ihaka (one of the creators of R): “I have been worried for some time that R isn’t going to provide the base that we’re going to need for statistical computation in the future. (It may well be that the future is already upon us.) ... I have come to the conclusion that rather than ‘fixing’ R, it would be much more productive to simply start over and build something better”
  • 9. Some of my gripes about R • Wonky, highly idiosyncratic programming language* • Poor speed and memory usage • General purpose libraries and software development tools lacking • The GPL (* But yes, really great libraries)
  • 10. R: great libraries and deep connections to academia. Example R superstars: Jeff Ryan (xts, quantmod), Hadley Wickham (ggplot2, plyr, reshape)
  • 11. Uniting against common enemies
  • 12. “Research-Production” Gap • Best data analysis / statistics tools: often least well-suited for building production systems • The “Black Box”: embedding or RPC • High productivity <=> Low productivity
  • 13. “Research-Production” Gap • Production: much more than crunching data and making pretty plots • Code readability, debuggability, maintainability matter a lot in the long run • Integration with other systems
  • 16. My assertion: Python is the best (only?) viable solution to the Research-Production gap
  • 17. Scientific Python Stack • Incredible growth in libraries and tools over the last 5 years • NumPy: the cornerstone • Killer app: IPython • Cython: C speedups, 80+% less dev time • Other exciting high-profile projects: scikit-learn, theano, sympy
  • 18. Uniting the Python Community • Fragmentation is a (big) problem / risk • Statistical libraries need to be able to talk to each other easily • R’s success: S-Plus legacy + quality CRAN packages built around cohesive base R / data structures
  • 19. pandas • Foundational rich data structures and data analysis tools • Arrays with labeled axes and support for heterogeneous data • Similar to R data.frame, but with many more built-in features • Missing data, time series support
  • 20. pandas • Milestone: 0.4 release 9/12/2011 • Dozens of new features and enhancements • Completely rewritten docs: pandas.sf.net • Many more new features planned for the future
  • 22. Little did I know...
  • 23. pandas: some key features • Automatic and explicit data alignment • Label-based (inc. hierarchical) indexing • GroupBy, pivoting, and reshaping • Missing data support • Time series functionality
  • 25. statsmodels • Statistics and econometrics in Python • Focused on estimation of statistical models • Regression models (GLS, Robust LM, ...) • Time series models (AR/ARMA, VAR, Kalman Filter, ...) • Non-parametric models (e.g. KDE)
  • 26. statsmodels • Development has been largely focused on computation • Correct, tested results • In progress: better user interface • Formula frameworks (e.g. similar to R) • pandas integration
  • 28. Ideas for the future • ggpy: ggplot2 for Python • Statistical Python Distribution / Umbrella project • Interactive GUI widgets to visualize / explore data and statsmodels results
  • 29. Thanks • pandas: http://pandas.sf.net • statsmodels: http://statsmodels.sf.net • Twitter: @wesmckinn • E-mail: wesmckinn (at) gmail (dot) com • Blog: http://blog.wesmckinney.com
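
A minimal sketch of three of the pandas features listed on slide 23 (automatic data alignment, missing data support, and GroupBy / pivoting). The data, the variable names, and the use of the present-day pandas API are assumptions made for illustration; the talk describes pandas 0.4, whose interface may have differed.

```python
import pandas as pd

# Hypothetical data: two Series whose labels only partially overlap.
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Automatic data alignment: arithmetic matches values by label, not position.
# Labels found in only one operand ("a" and "d") come out as NaN.
total = s1 + s2

# Missing data support: count the NaN entries, then fill them.
n_missing = int(total.isna().sum())
filled = total.fillna(0.0)

# GroupBy: split rows by a key column and aggregate each group.
df = pd.DataFrame(
    {"key": ["x", "y", "x", "y"], "value": [1.0, 2.0, 3.0, 4.0]}
)
group_means = df.groupby("key")["value"].mean()

# Pivoting: reshape "long" records into a "wide" table, one column per item.
long_form = pd.DataFrame(
    {
        "date": ["d1", "d1", "d2", "d2"],
        "item": ["A", "B", "A", "B"],
        "value": [1.0, 2.0, 3.0, 4.0],
    }
)
wide = long_form.pivot(index="date", columns="item", values="value")

print(total)
print(group_means)
print(wide)
```

Here the sum of `s1` and `s2` is indexed by the union of their labels, which is the behavior the slide calls automatic alignment; an explicit alternative would be to reindex both operands to a chosen set of labels before combining them.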