What's new in pandas and the SciPy stack for financial users

What's new in pandas and the SciPy stack for financial users Presentation Transcript

  • 1. What’s new in pandas and the SciPy stack for financial users
    Wes McKinney
  • 2. Me
    • AQR: August 2007 - July 2010
    • Duke Statistics: 2010 - present (now on leave)
    • My plans
      • Improving Python libs for statistics and finance
      • Building a financial software + consulting business based on said tools
  • 3. Core Python stack for finance
    • NumPy, SciPy (heavy lifting)
    • pandas (data handling / computation)
    • IPython (dev and research env)
    • Cython (perf optimization)
    • matplotlib (visualization)
    • statsmodels (statistics / econometrics)
  • 4. General sentiments
    • Scientific Python growing solidly in finance and in many other fields
      • Though good sci-pythonistas are still scarce
    • Important work happening in many of the core projects
    • Growing consensus: a new computational model is needed to better cope with “big data”
  • 5. NumPy
    • Significantly refactored C internals
    • Great progress on native datetime64 type
      • Will significantly improve date-handling performance and usability
      • Extensible business day / holiday logic planned / in progress
    • Addition of low-level missing data (NA) support in the works
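The datetime64 type and the business-day logic described on this slide did ship in later NumPy releases. The following is a minimal sketch, assuming a recent NumPy; the dates and the single holiday are illustrative values, not taken from the talk.

```python
import numpy as np

# datetime64 arrays support vectorised date arithmetic.
dates = np.array(["2011-07-01", "2011-07-05", "2011-07-08"], dtype="datetime64[D]")
gaps = np.diff(dates)  # timedelta64[D] differences between consecutive dates

# Count business days in the half-open range [begin, end).
plain = np.busday_count("2011-07-01", "2011-07-08")

# Same range, with one illustrative holiday excluded from the count.
with_holiday = np.busday_count(
    "2011-07-01", "2011-07-08", holidays=["2011-07-04"]
)
```

The default week mask treats Monday through Friday as business days; a different mask or holiday list can be passed where a market calendar differs.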
  • 6. IPython
    • One of Python’s killer apps gets even better
    • Rich Qt GUI console with inline plotting
    • New and improved architecture for high perf parallel / distributed computing
    • See Fernando Pérez’s SciPy 2011 talk / video
  • 7. Cython
    • Still the first tool you should reach for to get better performance
    • New: OpenMP integration (for multi-core)
        with nogil:
            for i in prange(n):
                # do something in parallel
    • Supports (almost) all of standard Python now (some things, like closures, used to not work)
  • 8. statsmodels
    • Statistics and econometrics in Python
    • Major work in time series models over last year+
      • VAR, SVAR models, eventually (V)ECM models for cointegrated time series
      • AR/ARMA, Kalman Filter, various macro filters (e.g. Hodrick-Prescott) implemented
      • Soon: Bayesian state space models (DLMs), ARCH/GARCH models, etc.
  • 9. statsmodels
    • Major criticism: weak user interface
      • No R-style formula framework
      • pandas not integrated (need to pass raw NumPy arrays)
    • I have begun work on pandas integration; formulas have been implemented and will hopefully arrive within the next few months
  • 10. pandas
    • Still the Python data hacker’s best friend?
    • Most recent release: 0.3.0 on 2/20/2011
    • However, last 4 months have been the most active development period in the library’s history
    • ~375 commits since 0.3.0 release (more than the entire prior open source history)
  • 11. The state of data structures
  • 12. Ambitious big picture
    • I want to make pandas the cornerstone of the “next generation” statistical computing environment
    • Ease-of-use, performance, flexibility all equally important
  • 13. Ambitious big picture
    • Taking the best features of other languages (R and friends) and making them better and easier to use
    • See my recent blog article “A Roadmap for Rich Scientific Data Structures in Python”
  • 14. pandas: under the hood
    • Complete redesign of DataFrame internals
      • Now a single class for 2D data retaining optimal performance of old DataFrame and DataMatrix classes
      • Significantly improved mixed-type and missing data handling
      • Plan to use internal data structure to implement “NDFrame” for n-dimensional data
  • 15. Fancy indexing
    • Index a Series / DataFrame in a matrix-like way via special .ix attribute, use:
      • Slices with integers or labels
      • Lists of integers, labels, or boolean vecs
      • Integer or label locations
        df.ix[0]
        df.ix[date1:date2]
        df.ix[:5, 'A':'F']
        df.ix[df['A'] > 0, ['B', 'C', 'D']] = nan
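In later pandas releases the .ix attribute was removed in favor of .loc (label-based) and .iloc (position-based). Below is a minimal sketch of the same kinds of selection under that assumption; the frame, its dates, and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2011-01-03", periods=6, freq="D")
df = pd.DataFrame(
    np.arange(-6.0, 12.0).reshape(6, 3),
    index=dates,
    columns=["A", "B", "C"],
)

first_row = df.iloc[0]                      # integer position
window = df.loc[dates[1]:dates[3]]          # label slice, both ends included
block = df.loc[dates[0]:dates[4], "A":"B"]  # label slices on both axes

# Boolean row mask plus a list of column labels, assigned in place.
df.loc[df["A"] > 0, ["B", "C"]] = np.nan
```

Splitting the two access modes avoids the ambiguity of an integer that could be read either as a label or as a position.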
  • 16. Misc new features
    • “Sparse” (mostly NA) versions of Series, DataFrame, WidePanel
    • Many new functions on Series/DataFrame
      • describe, quantile, select, drop, dropna, corrwith, ...
    • New moving window methods: rolling_quantile and rolling_apply
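The module-level rolling_quantile and rolling_apply functions were later replaced by methods on a rolling window object. This is a rough sketch of the equivalent calls, assuming a recent pandas; the price series is invented for the example.

```python
import pandas as pd

prices = pd.Series([10.0, 11.0, 9.0, 12.0, 13.0, 8.0])
roll = prices.rolling(window=3)

# Moving median: the 0.5 quantile over each 3-observation window.
rolling_median = roll.quantile(0.5)

# Arbitrary function per window; raw=True passes a NumPy array to the lambda.
rolling_range = roll.apply(lambda w: w.max() - w.min(), raw=True)

# describe() returns count, mean, std, min, quartiles and max in one Series.
summary = prices.describe()
```

The first two results of each rolling call are NaN because a full window of three observations is not yet available.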
  • 17. Improved IO
    • read_csv, read_table functions more flexible and robust, better type inferencing
        df = read_table('foo.txt', skiprows=[0,1], na_values=['#N/A'])
    • ExcelFile class for reading multiple sheets out of .xls files
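To make the skiprows and na_values options concrete, here is a small self-contained sketch that parses an in-memory tab-separated string rather than foo.txt. The file contents and column names are hypothetical, and read_csv with sep="\t" stands in for read_table.

```python
import io

import pandas as pd

# Two junk header lines, then a tab-separated table with '#N/A' placeholders.
raw = (
    "exported by a hypothetical vendor tool\n"
    "generated 2011-07-01\n"
    "date\tprice\tvolume\n"
    "2011-06-28\t101.5\t1200\n"
    "2011-06-29\t#N/A\t900\n"
    "2011-06-30\t103.0\t#N/A\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    skiprows=[0, 1],      # drop the two junk lines by position
    na_values=["#N/A"],   # treat this token as missing
)
```

Because the missing tokens are converted to NaN, the price column is inferred as floating point instead of falling back to strings.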
  • 18. Improved IO
    • HDFStore class provides a complete, tested dict-like PyTables storage container
        store = HDFStore('mydata.h5')
        store['x'] = x
        store['y'] = y
        y = store['y']
    • Experimental: store as Table and query
        store.put('df', df, table=True)
        piece = store.select('df', [{'field' : 'index', 'op' : '>=', 'value' : date}])
  • 19. Group by enhancements
    • Can group by multiple columns or key functions, SQL-like but more general
    • Syntactic sugar to invoke aggregation functions on groups
    • Automatic exclusion of “nuisance” columns of DataFrames
    • Various other usability enhancements
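A brief sketch of grouping by several columns and by a key function, assuming a recent pandas and an invented trades table. It selects the numeric column explicitly rather than relying on automatic exclusion of non-numeric columns, since newer releases may not drop them silently.

```python
import pandas as pd

trades = pd.DataFrame(
    {
        "sector": ["tech", "tech", "energy", "energy", "tech"],
        "side": ["buy", "sell", "buy", "buy", "buy"],
        "note": ["a", "b", "c", "d", "e"],  # non-numeric "nuisance" column
        "qty": [100, 50, 200, 25, 75],
    }
)

# Group by two columns, then aggregate one selected numeric column.
by_sector_side = trades.groupby(["sector", "side"])["qty"].sum()

# Group by a key function applied to each index label (here: even vs odd row).
by_parity = trades["qty"].groupby(lambda i: i % 2).sum()
```

The first result is indexed by (sector, side) pairs, which is the multi-label indexing that the next slide describes.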
  • 20. Very soon: hierarchical indexing
    • Enable axis ticks to be identified by multiple labels instead of a single label
    • Easily select subsets of data by “level”
    • Create Excel-style pivot tables / cross-tabulations in a sensible way
    • Will integrate naturally with groupby
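Hierarchical indexing later shipped as MultiIndex. The sketch below, assuming a recent pandas, shows selection by level, a pivot-style reshape, and groupby on a level; the tickers, years, and returns are illustrative only.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["AAPL", "MSFT"], [2010, 2011]], names=["ticker", "year"]
)
returns = pd.Series([0.10, 0.20, -0.05, 0.15], index=index)

# Select a subset by one level of the index.
y2011 = returns.xs(2011, level="year")

# Move the inner level into columns for a cross-tabulated view.
table = returns.unstack("year")

# Aggregate over one level through groupby.
mean_by_ticker = returns.groupby(level="ticker").mean()
```

Each axis tick here is a (ticker, year) pair, so the same Series can be viewed by ticker, by year, or as a two-way table without rebuilding it.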
  • 21. Other misc things
    • Flexible binary operators
      • a.add(b, fill_value=0.)
    • Some timezone support in DateRange
    • Numerous performance optimizations
    • See the (long) release notes =)
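The effect of fill_value in a.add(b, fill_value=0.) is easiest to see next to the plain + operator. This is a small sketch with two made-up Series whose indexes only partly overlap.

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "w"])

# Plain addition aligns on labels; a label missing on either side gives NaN.
plain = a + b

# With fill_value, the missing side is treated as 0.0 before adding.
filled = a.add(b, fill_value=0.0)
```

Only the label "y" exists in both inputs, so the plain sum has a single non-missing value, while the filled sum keeps every label from either side.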
  • 22. Planned work
    • Fast time series up/downsampling
    • Improved support and perf for HF/tick data
    • Even more sophisticated group by tools
    • Better documentation, online screencast tutorials / examples
  • 23. Thanks
    • Email: wesmckinn@gmail.com
    • Twitter: @wesmckinn
    • Blog: http://blog.wesmckinney.com
    • pandas: http://github.com/wesm/pandas
    • statsmodels: http://statsmodels.sourceforge.net