SlideShare a Scribd company logo
1 of 29
Download to read offline
Statistics and Data
                   Analysis in Python with
                   pandas and statsmodels
                          Wes McKinney @wesmckinn

                NYC Open Statistical Programming Meetup
                              9/14/2011

Thursday, September 15,
Talk Overview
                 • Statistical Computing Big Picture
                 • Scientific Python Stack
                 • pandas
                 • statsmodels
                 • Ideas for the (near) future
Thursday, September 15,
Who am I?


                    MIT Math        AQR: Quant Finance



               Back to NYC

                                         Statistics

Thursday, September 15,
The Big Picture

                 • Building the “next generation”
                          statistical computing environment
                 • Making data analysis / statistics more
                          intuitive, flexible, powerful
                 • Closing the “research-production” gap

Thursday, September 15,
Application areas

                 • General data munging, manipulation
                 • Financial modeling and analytics
                 • Statistical modeling and econometrics
                 • “Enterprise” / “Big Data” analytics?

Thursday, September 15,
R, the solution?
      Hadley Wickham (ggplot2, plyr, reshape, ...)


                     “R is the most powerful statistical
                     computing language on the planet”




Thursday, September 15,
Easy to miss the point




Thursday, September 15,
R, the solution?
      Ross Ihaka (One of creators of R)

                “I have been worried for some time that R isn’t going
                to provide the base that we’re going to need for
                statistical computation in the future. (It may well be
                that the future is already upon us.) ... I have come to
                the conclusion that rather than ‘fixing’ R, it would
                be much more productive to simply start
                over and build something better”



Thursday, September 15,
Some of my gripes
                               about R
                 • Wonky, highly idiosyncratic programming
                          language*
                 • Poor speed and memory usage
                 • General purpose libraries and software
                          development tools lacking
                 • The GPL
                             * But yes, really great libraries

Thursday, September 15,
R: great libraries and deep
               connections to academia
                              Example R superstars




                         Jeff Ryan         Hadley Wickham
                      xts, quantmod      ggplot2, plyr, reshape

Thursday, September 15,
Uniting against
                          common enemies




Thursday, September 15,
“Research-Production” Gap

                 • Best data analysis / statistics tools: often
                          least well-suited for building production
                          systems
                 • The “Black Box”: embedding or RPC
                 • High productivity <=> Low productivity

Thursday, September 15,
“Research-Production” Gap

                 • Production: much more than crunching data
                          and making pretty plots
                 • Code readability, debuggability,
                          maintainability matter a lot in the long run
                 • Integration with other systems

Thursday, September 15,
“Research-Production” Gap




Thursday, September 15,
Thursday, September 15,
My assertion

                   Python is the best (only?)
                     viable solution to the
                   Research-Production gap


Thursday, September 15,
Scientific Python Stack
                 • Incredible growth in libraries and tools
                          over the last 5 years
                      • NumPy: the cornerstone
                      • Killer app: IPython
                      • Cython: C speedups, 80+% less dev time
                 • Other exciting high-profile projects: scikit-
                          learn, theano, sympy


Thursday, September 15,
Uniting the Python
                              Community
                 • Fragmentation is a (big) problem / risk
                 • Statistical libraries need to be able to talk
                          to each other easily
                 • R’s success: S-Plus legacy + quality CRAN
                          packages built around cohesive base R /
                          data structures



Thursday, September 15,
pandas
                 • Foundational rich data structures and data
                          analysis tools
                 • Arrays with labeled axes and support for
                          heterogeneous data
                 • Similar to R data.frame, but with many more
                          built-in features
                 • Missing data, time series support
Thursday, September 15,
pandas

                 • Milestone: 0.4 release 9/12/2011
                 • Dozens of new features and enhancements
                 • Completely rewritten docs: pandas.sf.net
                 • Many more new features planned for the
                          future



Thursday, September 15,
The sleeping dragon




Thursday, September 15,
Little did I know...




Thursday, September 15,
pandas: some key features

                 • Automatic and explicit data alignment
                 • Label-based (inc hierarchical) indexing
                 • GroupBy, pivoting, and reshaping
                 • Missing data support
                 • Time series functionality

Thursday, September 15,
Demo time



Thursday, September 15,
statsmodels
                 • Statistics and econometrics in Python
                 • Focused on estimation of statistical models
                  • Regression models (GLS, Robust LM, ...)
                  • Time series models (AR/ARMA,VAR,
                          Kalman Filter, ...)
                      • Non-parametric models (e.g. KDE)

Thursday, September 15,
statsmodels
                 • Development has been largely focused on
                          computation
                      • Correct, tested results
                 • In progress: better user interface
                  • Formula frameworks (e.g. similar to R)
                  • pandas integration

Thursday, September 15,
Demo time



Thursday, September 15,
Ideas for the future

                 • ggpy: ggplot2 for Python
                 • Statistical Python Distribution / Umbrella
                          project
                 • Interactive GUI widgets to visualize /
                          explore data and statsmodels results



Thursday, September 15,
Thanks

                 • pandas: http://pandas.sf.net
                 • statsmodels: http://statsmodels.sf.net
                 • Twitter: @wesmckinn
                 • E-mail: wesmckinn (at) gmail (dot) com
                 • Blog: http://blog.wesmckinney.com

Thursday, September 15,

More Related Content

What's hot

Data Visualization in Python
Data Visualization in PythonData Visualization in Python
Data Visualization in PythonJagriti Goswami
 
Data Analysis with Python Pandas
Data Analysis with Python PandasData Analysis with Python Pandas
Data Analysis with Python PandasNeeru Mittal
 
Python Pandas
Python PandasPython Pandas
Python PandasSunil OS
 
Introduction to data analysis using python
Introduction to data analysis using pythonIntroduction to data analysis using python
Introduction to data analysis using pythonGuido Luz Percú
 
Introduction To Python
Introduction To PythonIntroduction To Python
Introduction To PythonVanessa Rene
 
R Programming Language
R Programming LanguageR Programming Language
R Programming LanguageNareshKarela1
 
Full Python in 20 slides
Full Python in 20 slidesFull Python in 20 slides
Full Python in 20 slidesrfojdar
 
Data Science Introduction
Data Science IntroductionData Science Introduction
Data Science IntroductionGang Tao
 
Python Seaborn Data Visualization
Python Seaborn Data Visualization Python Seaborn Data Visualization
Python Seaborn Data Visualization Sourabh Sahu
 
Python Anaconda Tutorial | Edureka
Python Anaconda Tutorial | EdurekaPython Anaconda Tutorial | Edureka
Python Anaconda Tutorial | EdurekaEdureka!
 
pandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for PythonWes McKinney
 
Data Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonData Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonMOHITKUMAR1379
 
DATA VISUALIZATION USING MATPLOTLIB (PYTHON)
DATA VISUALIZATION USING MATPLOTLIB (PYTHON)DATA VISUALIZATION USING MATPLOTLIB (PYTHON)
DATA VISUALIZATION USING MATPLOTLIB (PYTHON)Mohammed Anzil
 
Python programming
Python programmingPython programming
Python programmingMegha V
 
Data science workshop
Data science workshopData science workshop
Data science workshopHortonworks
 
Introduction to Data Science.pptx
Introduction to Data Science.pptxIntroduction to Data Science.pptx
Introduction to Data Science.pptxVrishit Saraswat
 
Python NumPy Tutorial | NumPy Array | Edureka
Python NumPy Tutorial | NumPy Array | EdurekaPython NumPy Tutorial | NumPy Array | Edureka
Python NumPy Tutorial | NumPy Array | EdurekaEdureka!
 

What's hot (20)

Data Visualization in Python
Data Visualization in PythonData Visualization in Python
Data Visualization in Python
 
Data Analysis with Python Pandas
Data Analysis with Python PandasData Analysis with Python Pandas
Data Analysis with Python Pandas
 
Pandas
PandasPandas
Pandas
 
Python Pandas
Python PandasPython Pandas
Python Pandas
 
Introduction to data analysis using python
Introduction to data analysis using pythonIntroduction to data analysis using python
Introduction to data analysis using python
 
Introduction To Python
Introduction To PythonIntroduction To Python
Introduction To Python
 
Python for Data Science
Python for Data SciencePython for Data Science
Python for Data Science
 
R Programming Language
R Programming LanguageR Programming Language
R Programming Language
 
Full Python in 20 slides
Full Python in 20 slidesFull Python in 20 slides
Full Python in 20 slides
 
Data Science Introduction
Data Science IntroductionData Science Introduction
Data Science Introduction
 
Python Seaborn Data Visualization
Python Seaborn Data Visualization Python Seaborn Data Visualization
Python Seaborn Data Visualization
 
Python Anaconda Tutorial | Edureka
Python Anaconda Tutorial | EdurekaPython Anaconda Tutorial | Edureka
Python Anaconda Tutorial | Edureka
 
pandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Python
 
Data Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonData Wrangling and Visualization Using Python
Data Wrangling and Visualization Using Python
 
DATA VISUALIZATION USING MATPLOTLIB (PYTHON)
DATA VISUALIZATION USING MATPLOTLIB (PYTHON)DATA VISUALIZATION USING MATPLOTLIB (PYTHON)
DATA VISUALIZATION USING MATPLOTLIB (PYTHON)
 
Python programming
Python programmingPython programming
Python programming
 
Data science workshop
Data science workshopData science workshop
Data science workshop
 
Python for data analysis
Python for data analysisPython for data analysis
Python for data analysis
 
Introduction to Data Science.pptx
Introduction to Data Science.pptxIntroduction to Data Science.pptx
Introduction to Data Science.pptx
 
Python NumPy Tutorial | NumPy Array | Edureka
Python NumPy Tutorial | NumPy Array | EdurekaPython NumPy Tutorial | NumPy Array | Edureka
Python NumPy Tutorial | NumPy Array | Edureka
 

Viewers also liked

pandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statisticspandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and StatisticsWes McKinney
 
Data Structures for Statistical Computing in Python
Data Structures for Statistical Computing in PythonData Structures for Statistical Computing in Python
Data Structures for Statistical Computing in PythonWes McKinney
 
Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasWes McKinney
 
Statistical inference for (Python) Data Analysis. An introduction.
Statistical inference for (Python) Data Analysis. An introduction.Statistical inference for (Python) Data Analysis. An introduction.
Statistical inference for (Python) Data Analysis. An introduction.Piotr Milanowski
 
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Wes McKinney
 
Lesson04_new
Lesson04_newLesson04_new
Lesson04_newshengvn
 
Recurrent Neural Networks in 10 minutes or less
Recurrent Neural Networks  in 10 minutes or lessRecurrent Neural Networks  in 10 minutes or less
Recurrent Neural Networks in 10 minutes or lessTal Perry
 
What's new in pandas and the SciPy stack for financial users
What's new in pandas and the SciPy stack for financial usersWhat's new in pandas and the SciPy stack for financial users
What's new in pandas and the SciPy stack for financial usersWes McKinney
 
A look inside pandas design and development
A look inside pandas design and developmentA look inside pandas design and development
A look inside pandas design and developmentWes McKinney
 
Ibis: Scaling the Python Data Experience
Ibis: Scaling the Python Data ExperienceIbis: Scaling the Python Data Experience
Ibis: Scaling the Python Data ExperienceWes McKinney
 
Austin SEO Meetup 4/1/09 with BuzzStream
Austin SEO Meetup 4/1/09 with BuzzStreamAustin SEO Meetup 4/1/09 with BuzzStream
Austin SEO Meetup 4/1/09 with BuzzStreamContinuum Analytics
 
My Data Journey with Python (SciPy 2015 Keynote)
My Data Journey with Python (SciPy 2015 Keynote)My Data Journey with Python (SciPy 2015 Keynote)
My Data Journey with Python (SciPy 2015 Keynote)Wes McKinney
 

Viewers also liked (13)

pandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statisticspandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statistics
 
Data Structures for Statistical Computing in Python
Data Structures for Statistical Computing in PythonData Structures for Statistical Computing in Python
Data Structures for Statistical Computing in Python
 
Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandas
 
Statistical inference for (Python) Data Analysis. An introduction.
Statistical inference for (Python) Data Analysis. An introduction.Statistical inference for (Python) Data Analysis. An introduction.
Statistical inference for (Python) Data Analysis. An introduction.
 
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
 
Lesson04_new
Lesson04_newLesson04_new
Lesson04_new
 
Recurrent Neural Networks in 10 minutes or less
Recurrent Neural Networks  in 10 minutes or lessRecurrent Neural Networks  in 10 minutes or less
Recurrent Neural Networks in 10 minutes or less
 
What's new in pandas and the SciPy stack for financial users
What's new in pandas and the SciPy stack for financial usersWhat's new in pandas and the SciPy stack for financial users
What's new in pandas and the SciPy stack for financial users
 
A look inside pandas design and development
A look inside pandas design and developmentA look inside pandas design and development
A look inside pandas design and development
 
Ibis: Scaling the Python Data Experience
Ibis: Scaling the Python Data ExperienceIbis: Scaling the Python Data Experience
Ibis: Scaling the Python Data Experience
 
Austin SEO Meetup 4/1/09 with BuzzStream
Austin SEO Meetup 4/1/09 with BuzzStreamAustin SEO Meetup 4/1/09 with BuzzStream
Austin SEO Meetup 4/1/09 with BuzzStream
 
Multivariate
MultivariateMultivariate
Multivariate
 
My Data Journey with Python (SciPy 2015 Keynote)
My Data Journey with Python (SciPy 2015 Keynote)My Data Journey with Python (SciPy 2015 Keynote)
My Data Journey with Python (SciPy 2015 Keynote)
 

Similar to Data Analysis and Statistics in Python using pandas and statsmodels

Phingified ci and deployment strategies ipc 2012
Phingified ci and deployment strategies ipc 2012Phingified ci and deployment strategies ipc 2012
Phingified ci and deployment strategies ipc 2012TEQneers GmbH & Co. KG
 
Python for Data: Past, Present, Future (PyCon JP 2017 Keynote)
Python for Data: Past, Present, Future (PyCon JP 2017 Keynote)Python for Data: Past, Present, Future (PyCon JP 2017 Keynote)
Python for Data: Past, Present, Future (PyCon JP 2017 Keynote)Peter Wang
 
State of Pyramid - Brasilia 2013
State of Pyramid - Brasilia 2013State of Pyramid - Brasilia 2013
State of Pyramid - Brasilia 2013plonepaul
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future Wes McKinney
 
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkThe Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkKrishna Sankar
 
PyCon Singapore 2013 Keynote
PyCon Singapore 2013 KeynotePyCon Singapore 2013 Keynote
PyCon Singapore 2013 KeynoteWes McKinney
 
Productive Data Tools for Quants
Productive Data Tools for QuantsProductive Data Tools for Quants
Productive Data Tools for QuantsWes McKinney
 
Parallel Programming in Python: Speeding up your analysis
Parallel Programming in Python: Speeding up your analysisParallel Programming in Python: Speeding up your analysis
Parallel Programming in Python: Speeding up your analysisManojit Nandi
 
Resources for Getting Started in Predictive Analytics
Resources for Getting Started in Predictive AnalyticsResources for Getting Started in Predictive Analytics
Resources for Getting Started in Predictive Analyticsmeepbobeep
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"Wes McKinney
 
Enterprise Drupal
Enterprise DrupalEnterprise Drupal
Enterprise Drupalthesnufkin
 
Introduction to Python Syntax and Semantics
Introduction to Python Syntax and SemanticsIntroduction to Python Syntax and Semantics
Introduction to Python Syntax and SemanticsAdam Cook
 
Data mining with Rattle For R
Data mining with Rattle For RData mining with Rattle For R
Data mining with Rattle For RAkhil Anil
 
R programming language - Mustafa Wahedi
R programming language - Mustafa WahediR programming language - Mustafa Wahedi
R programming language - Mustafa WahediUNICORNS IN TECH
 
Proud to be polyglot!
Proud to be polyglot!Proud to be polyglot!
Proud to be polyglot!NLJUG
 
Data Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps ApproachData Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps ApproachMihai Criveti
 
PyData Texas 2015 Keynote
PyData Texas 2015 KeynotePyData Texas 2015 Keynote
PyData Texas 2015 KeynotePeter Wang
 
From Developer to Data Scientist - Gaines Kergosien
From Developer to Data Scientist - Gaines KergosienFrom Developer to Data Scientist - Gaines Kergosien
From Developer to Data Scientist - Gaines KergosienITCamp
 

Similar to Data Analysis and Statistics in Python using pandas and statsmodels (20)

Phingified ci and deployment strategies ipc 2012
Phingified ci and deployment strategies ipc 2012Phingified ci and deployment strategies ipc 2012
Phingified ci and deployment strategies ipc 2012
 
Python for Data: Past, Present, Future (PyCon JP 2017 Keynote)
Python for Data: Past, Present, Future (PyCon JP 2017 Keynote)Python for Data: Past, Present, Future (PyCon JP 2017 Keynote)
Python for Data: Past, Present, Future (PyCon JP 2017 Keynote)
 
LSESU a Taste of R Language Workshop
LSESU a Taste of R Language WorkshopLSESU a Taste of R Language Workshop
LSESU a Taste of R Language Workshop
 
State of Pyramid - Brasilia 2013
State of Pyramid - Brasilia 2013State of Pyramid - Brasilia 2013
State of Pyramid - Brasilia 2013
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
 
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkThe Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
 
PyCon Singapore 2013 Keynote
PyCon Singapore 2013 KeynotePyCon Singapore 2013 Keynote
PyCon Singapore 2013 Keynote
 
Productive Data Tools for Quants
Productive Data Tools for QuantsProductive Data Tools for Quants
Productive Data Tools for Quants
 
Parallel Programming in Python: Speeding up your analysis
Parallel Programming in Python: Speeding up your analysisParallel Programming in Python: Speeding up your analysis
Parallel Programming in Python: Speeding up your analysis
 
Resources for Getting Started in Predictive Analytics
Resources for Getting Started in Predictive AnalyticsResources for Getting Started in Predictive Analytics
Resources for Getting Started in Predictive Analytics
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
 
Enterprise Drupal
Enterprise DrupalEnterprise Drupal
Enterprise Drupal
 
Introduction to Python Syntax and Semantics
Introduction to Python Syntax and SemanticsIntroduction to Python Syntax and Semantics
Introduction to Python Syntax and Semantics
 
Data mining with Rattle For R
Data mining with Rattle For RData mining with Rattle For R
Data mining with Rattle For R
 
R programming language - Mustafa Wahedi
R programming language - Mustafa WahediR programming language - Mustafa Wahedi
R programming language - Mustafa Wahedi
 
Proud to be polyglot!
Proud to be polyglot!Proud to be polyglot!
Proud to be polyglot!
 
Data Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps ApproachData Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps Approach
 
PyData Texas 2015 Keynote
PyData Texas 2015 KeynotePyData Texas 2015 Keynote
PyData Texas 2015 Keynote
 
From Developer to Data Scientist - Gaines Kergosien
From Developer to Data Scientist - Gaines KergosienFrom Developer to Data Scientist - Gaines Kergosien
From Developer to Data Scientist - Gaines Kergosien
 
Treasure Data Cloud Strategy
Treasure Data Cloud StrategyTreasure Data Cloud Strategy
Treasure Data Cloud Strategy
 

More from Wes McKinney

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityWes McKinney
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkWes McKinney
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache ArrowWes McKinney
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportWes McKinney
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesWes McKinney
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Wes McKinney
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackWes McKinney
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionWes McKinney
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackWes McKinney
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Wes McKinney
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Wes McKinney
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataWes McKinney
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataWes McKinney
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data ScienceWes McKinney
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Wes McKinney
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningWes McKinney
 
Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceWes McKinney
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityWes McKinney
 

More from Wes McKinney (20)

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data Science
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
 
Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data Science
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and Interoperability
 

Recently uploaded

Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Jeffrey Haguewood
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFMichael Gough
 
Laying the Data Foundations for Artificial Intelligence!
Laying the Data Foundations for Artificial Intelligence!Laying the Data Foundations for Artificial Intelligence!
Laying the Data Foundations for Artificial Intelligence!Memoori
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Nikki Chapple
 
Dynamical Context introduction word sensibility orientation
Dynamical Context introduction word sensibility orientationDynamical Context introduction word sensibility orientation
Dynamical Context introduction word sensibility orientationBuild Intuit
 
QMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdfQMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdfROWELL MARQUINA
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
A Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxA Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxAna-Maria Mihalceanu
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Mark Simos
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...BookNet Canada
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Karmanjay Verma
 

Recently uploaded (20)

Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDF
 
Laying the Data Foundations for Artificial Intelligence!
Laying the Data Foundations for Artificial Intelligence!Laying the Data Foundations for Artificial Intelligence!
Laying the Data Foundations for Artificial Intelligence!
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
 
Dynamical Context introduction word sensibility orientation
Dynamical Context introduction word sensibility orientationDynamical Context introduction word sensibility orientation
Dynamical Context introduction word sensibility orientation
 
QMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdfQMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdf
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
A Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxA Glance At The Java Performance Toolbox
A Glance At The Java Performance Toolbox
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#
 

Data Analysis and Statistics in Python using pandas and statsmodels

  • 1. Statistics and Data Analysis in Python with pandas and statsmodels Wes McKinney @wesmckinn NYC Open Statistical Programming Meetup 9/14/2011 Thursday, September 15,
  • 2. Talk Overview • Statistical Computing Big Picture • Scientific Python Stack • pandas • statsmodels • Ideas for the (near) future Thursday, September 15,
  • 3. Who am I? MIT Math AQR: Quant Finance Back to NYC Statistics Thursday, September 15,
  • 4. The Big Picture • Building the “next generation” statistical computing environment • Making data analysis / statistics more intuitive, flexible, powerful • Closing the “research-production” gap Thursday, September 15,
  • 5. Application areas • General data munging, manipulation • Financial modeling and analytics • Statistical modeling and econometrics • “Enterprise” / “Big Data” analytics? Thursday, September 15,
  • 6. R, the solution? Hadley Wickham (ggplot2, plyr, reshape, ...) “R is the most powerful statistical computing language on the planet” Thursday, September 15,
  • 7. Easy to miss the point Thursday, September 15,
  • 8. R, the solution? Ross Ihaka (One of creators of R) “I have been worried for some time that R isn’t going to provide the base that we’re going to need for statistical computation in the future. (It may well be that the future is already upon us.) ... I have come to the conclusion that rather than ‘fixing’ R, it would be much more productive to simply start over and build something better” Thursday, September 15,
  • 9. Some of my gripes about R • Wonky, highly idiosyncratic programming language* • Poor speed and memory usage • General purpose libraries and software development tools lacking • The GPL * But yes, really great libraries Thursday, September 15,
  • 10. R: great libraries and deep connections to academia Example R superstars Jeff Ryan Hadley Wickham xts, quantmod ggplot2, plyr, reshape Thursday, September 15,
  • 11. Uniting against common enemies Thursday, September 15,
  • 12. “Research-Production” Gap • Best data analysis / statistics tools: often least well-suited for building production systems • The “Black Box”: embedding or RPC • High productivity <=> Low productivity Thursday, September 15,
  • 13. “Research-Production” Gap • Production: much more than crunching data and making pretty plots • Code readability, debuggability, maintainability matter a lot in the long run • Integration with other systems Thursday, September 15,
  • 16. My assertion Python is the best (only?) viable solution to the Research-Production gap Thursday, September 15,
  • 17. Scientific Python Stack • Incredible growth in libraries and tools over the last 5 years • NumPy: the cornerstone • Killer app: IPython • Cython: C speedups, 80+% less dev time • Other exciting high-profile projects: scikit- learn, theano, sympy Thursday, September 15,
  • 18. Uniting the Python Community • Fragmentation is a (big) problem / risk • Statistical libraries need to be able to talk to each other easily • R’s success: S-Plus legacy + quality CRAN packages built around cohesive base R / data structures Thursday, September 15,
  • 19. pandas • Foundational rich data structures and data analysis tools • Arrays with labeled axes and support for heterogeneous data • Similar to R data.frame, but with many more built-in features • Missing data, time series support Thursday, September 15,
  • 20. pandas • Milestone: 0.4 release 9/12/2011 • Dozens of new features and enhancements • Completely rewritten docs: pandas.sf.net • Many more new features planned for the future Thursday, September 15,
  • 22. Little did I know... Thursday, September 15,
  • 23. pandas: some key features • Automatic and explicit data alignment • Label-based (inc hierarchical) indexing • GroupBy, pivoting, and reshaping • Missing data support • Time series functionality Thursday, September 15,
  • 25. statsmodels • Statistics and econometrics in Python • Focused on estimation of statistical models • Regression models (GLS, Robust LM, ...) • Time series models (AR/ARMA,VAR, Kalman Filter, ...) • Non-parametric models (e.g. KDE) Thursday, September 15,
  • 26. statsmodels • Development has been largely focused on computation • Correct, tested results • In progress: better user interface • Formula frameworks (e.g. similar to R) • pandas integration Thursday, September 15,
  • 28. Ideas for the future • ggpy: ggplot2 for Python • Statistical Python Distribution / Umbrella project • Interactive GUI widgets to visualize / explore data and statsmodels results Thursday, September 15,
  • 29. Thanks • pandas: http://pandas.sf.net • statsmodels: http://statsmodels.sf.net • Twitter: @wesmckinn • E-mail: wesmckinn (at) gmail (dot) com • Blog: http://blog.wesmckinney.com Thursday, September 15,