Aug. 17, 2011•0 likes## 4 likes

•11,819 views## views

Be the first to like this

Show More

Total views

0

On Slideshare

0

From embeds

0

Number of embeds

0

Download to read offline

Report

Technology

Wes McKinneyFollow

Director of Ursa Labs, Open Source Developer at Ursa LabsApache Arrow: Open Source Standard Becomes an Enterprise NecessityWes McKinney

SciPy 2011 pandas lightning talkWes McKinney

Spark Application Carousel: Highlights of Several Applications Built with SparkDatabricks

PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future Wes McKinney

Enabling exploratory data science with Spark and RDatabricks

Enabling Exploratory Analysis of Large Data with Apache Spark and RDatabricks

- What’s new in pandas and the SciPy stack for ﬁnancial users Wes McKinney
- Me • AQR: August 2007 - July 2010 • Duke Statistics: 2010 - present (now on leave) • My plans • Improving Python libs for statistics and ﬁnance • Building a ﬁnancial software + consulting business based on said tools
- Core Python stack for ﬁnance • NumPy, SciPy (heavy lifting) • pandas (data handling / computation) • IPython (dev and research env) • Cython (perf optimization) • matplotlib (visualization) • statsmodels (statistics / econometrics)
- General sentiments • Scientiﬁc Python growing solidly in ﬁnance and in many other ﬁelds • Though good sci-pythonistas are still scarce • Important work happening in many of the core projects • Growing consensus: a new computational model is needed to better cope with “big data”
- NumPy • Signiﬁcantly refactored C internals • Great progress on native datetime64 type • Will signiﬁcantly improve date-handling performance and usability • Extensible business day / holiday logic planned / in progress • Addition of low-level missing data (NA) support in the works
- IPython • One of Python’s killer apps gets even better • Rich Qt GUI console with inline plotting • New and improved architecture for high perf parallel / distributed computing • See Fernando Pérez’s SciPy 2011 talk / video
- Cython • Still the ﬁrst tool you should reach for to get better performance • New: OpenMP integration (for multi-core) with nogil: for i in prange(n): # do something in parallel • Supports (almost) all of standard Python now (some things, like closures, used to not work)
- statsmodels • Statistics and econometrics in Python • Major work in time series models over last year+ • VAR, SVAR models, eventually (V)ECM models for cointegrated time series • AR/ARMA, Kalman Filter, various macro ﬁlters (e.g. Hodrick-Prescott) implemented • Soon: Bayesian state space models (DLMs), ARCH/GARCH models, etc.
- statsmodels • Major criticism: weak user interface • No R-style formula framework • pandas not integrated (need to pass raw NumPy arrays) • I have begun work on pandas integration, formulas have been implemented and will hopefully arrive within the next few months
- pandas • Still the Python data hacker’s best friend? • Most recent release: 0.3.0 on 2/20/2011 • However, last 4 months have been the most active development period in the library’s history • ~375 commits since 0.3.0 release (more than the entire prior open source history)
- The state of data structures
- Ambitious big picture • I want to make pandas the cornerstone of the “next generation” statistical computing environment • Ease-of-use, performance, ﬂexibility all equally important
- Ambitious big picture • Taking the best features of other languages (R and friends) and making them better and easier to use • See my recent blog article “A Roadmap for Rich Scientiﬁc Data Structures in Python”
- pandas: under the hood • Complete redesign of DataFrame internals • Now a single class for 2D data retaining optimal performance of old DataFrame and DataMatrix classes • Signiﬁcantly improved mixed-type and missing data handling • Plan to use internal data structure to implement “NDFrame” for n-dimensional data
- Fancy indexing • Index a Series / DataFrame in a matrix-like way via special .ix attribute, use: • Slices with integers or labels • Lists of integers, labels, or boolean vecs • Integer or label locations df.ix[0] df.ix[date1:date2] df.ix[:5, ‘A’:’F’] df.ix[df[‘A’] > 0, [‘B’, ‘C’, ‘D’]] = nan
- Misc new features • “Sparse” (mostly NA) versions of Series, DataFrame, WidePanel • Many new functions on Series/DataFrame • describe, quantile, select, drop, dropna, corrwith, ... • New moving window methods: rolling_quantile and rolling_apply
- Improved IO • read_csv, read_table functions more ﬂexible and robust, better type inferencing df = read_table(‘foo.txt’, skiprows=[0,1], na_values=[‘#N/A’]) • ExcelFile class for reading multiple sheets out of .xls ﬁles
- Improved IO • HDFStore class provides a complete, tested dict-like PyTables storage container store = HDFStore(‘mydata.h5’) store[‘x’] = x store[‘y’] = y y = store[‘y’] • Experimental: store as Table and query store.put('df', df, table=True) piece = store.select(‘df’, [{‘field’ : ‘index’, ‘op’ : ‘>=’, ‘value’ : date}])
- Group by enhancements • Can group by multiple columns or key functions, SQL-like but more general • Syntactic sugar to invoke aggregation functions on groups • Automatic exclusion of “nuisance” columns of DataFrames • Various other usability enhancements
- Very soon: hierarchical indexing • Enable axis ticks to be identiﬁed by multiple labels instead of a single label • Easily select subsets of data by “level” • Create Excel-style pivot tables / cross- tabulations in a sensible way • Will integrate naturally with groupby
- Other misc things • Flexible binary operators • a.add(b, ﬁll_value=0.) • Some timezone support in DateRange • Numerous performance optimizations • See the (long) release notes =)
- Planned work • Fast time series up/downsampling • Improved support and perf for HF/tick data • Even more sophisticated group by tools • Better documentation, online screencast tutorials / examples
- Thanks • Email: wesmckinn@gmail.com • Twitter: @wesmckinn • Blog: http://blog.wesmckinney.com • pandas: http://github.com/wesm/pandas • statsmodels: http://statsmodels.sourceforge.net