- 1. Statistics and Data Analysis in Python with pandas and statsmodels Wes McKinney @wesmckinn NYC Open Statistical Programming Meetup 9/14/2011 Thursday, September 15,
- 2. Talk Overview • Statistical Computing Big Picture • Scientiﬁc Python Stack • pandas • statsmodels • Ideas for the (near) future Thursday, September 15,
- 3. Who am I? MIT Math AQR: Quant Finance Back to NYC Statistics Thursday, September 15,
- 4. The Big Picture • Building the “next generation” statistical computing environment • Making data analysis / statistics more intuitive, ﬂexible, powerful • Closing the “research-production” gap Thursday, September 15,
- 5. Application areas • General data munging, manipulation • Financial modeling and analytics • Statistical modeling and econometrics • “Enterprise” / “Big Data” analytics? Thursday, September 15,
- 6. R, the solution? Hadley Wickham (ggplot2, plyr, reshape, ...) “R is the most powerful statistical computing language on the planet” Thursday, September 15,
- 7. Easy to miss the point Thursday, September 15,
- 8. R, the solution? Ross Ihaka (One of creators of R) “I have been worried for some time that R isn’t going to provide the base that we’re going to need for statistical computation in the future. (It may well be that the future is already upon us.) ... I have come to the conclusion that rather than ‘ﬁxing’ R, it would be much more productive to simply start over and build something better” Thursday, September 15,
- 9. Some of my gripes about R • Wonky, highly idiosyncratic programming language* • Poor speed and memory usage • General purpose libraries and software development tools lacking • The GPL * But yes, really great libraries Thursday, September 15,
- 10. R: great libraries and deep connections to academia Example R superstars Jeff Ryan Hadley Wickham xts, quantmod ggplot2, plyr, reshape Thursday, September 15,
- 11. Uniting against common enemies Thursday, September 15,
- 12. “Research-Production” Gap • Best data analysis / statistics tools: often least well-suited for building production systems • The “Black Box”: embedding or RPC • High productivity <=> Low productivity Thursday, September 15,
- 13. “Research-Production” Gap • Production: much more than crunching data and making pretty plots • Code readability, debuggability, maintainability matter a lot in the long run • Integration with other systems Thursday, September 15,
- 16. My assertion Python is the best (only?) viable solution to the Research-Production gap Thursday, September 15,
- 17. Scientiﬁc Python Stack • Incredible growth in libraries and tools over the last 5 years • NumPy: the cornerstone • Killer app: IPython • Cython: C speedups, 80+% less dev time • Other exciting high-proﬁle projects: scikit- learn, theano, sympy Thursday, September 15,
- 18. Uniting the Python Community • Fragmentation is a (big) problem / risk • Statistical libraries need to be able to talk to each other easily • R’s success: S-Plus legacy + quality CRAN packages built around cohesive base R / data structures Thursday, September 15,
- 19. pandas • Foundational rich data structures and data analysis tools • Arrays with labeled axes and support for heterogeneous data • Similar to R data.frame, but with many more built-in features • Missing data, time series support Thursday, September 15,
- 20. pandas • Milestone: 0.4 release 9/12/2011 • Dozens of new features and enhancements • Completely rewritten docs: pandas.sf.net • Many more new features planned for the future Thursday, September 15,
- 21. The sleeping dragon Thursday, September 15,
- 22. Little did I know... Thursday, September 15,
- 23. pandas: some key features • Automatic and explicit data alignment • Label-based (inc hierarchical) indexing • GroupBy, pivoting, and reshaping • Missing data support • Time series functionality Thursday, September 15,
- 24. Demo time Thursday, September 15,
- 25. statsmodels • Statistics and econometrics in Python • Focused on estimation of statistical models • Regression models (GLS, Robust LM, ...) • Time series models (AR/ARMA,VAR, Kalman Filter, ...) • Non-parametric models (e.g. KDE) Thursday, September 15,
- 26. statsmodels • Development has been largely focused on computation • Correct, tested results • In progress: better user interface • Formula frameworks (e.g. similar to R) • pandas integration Thursday, September 15,
- 27. Demo time Thursday, September 15,
- 28. Ideas for the future • ggpy: ggplot2 for Python • Statistical Python Distribution / Umbrella project • Interactive GUI widgets to visualize / explore data and statsmodels results Thursday, September 15,
- 29. Thanks • pandas: http://pandas.sf.net • statsmodels: http://statsmodels.sf.net • Twitter: @wesmckinn • E-mail: wesmckinn (at) gmail (dot) com • Blog: http://blog.wesmckinney.com Thursday, September 15,