What’s new in pandas and the SciPy stack for financial users
Wes McKinney
Me
•   AQR: August 2007 - July 2010

•   Duke Statistics: 2010 - present (now on leave)

•   My plans

    •   Improving Python libs for statistics and finance

    •   Building a financial software + consulting business
        based on said tools
Core Python stack for finance
• NumPy, SciPy (heavy lifting)
• pandas (data handling / computation)
• IPython (dev and research env)
• Cython (perf optimization)
• matplotlib (visualization)
• statsmodels (statistics / econometrics)
General sentiments
•   Scientific Python growing solidly in finance and
    in many other fields

    •   Though good sci-pythonistas are still scarce

•   Important work happening in many of the core
    projects

•   Growing consensus: a new computational
    model is needed to better cope with “big data”
NumPy
• Significantly refactored C internals
• Great progress on native datetime64 type
 • Will significantly improve date-handling
    performance and usability
 • Extensible business day / holiday logic
    planned / in progress
• Addition of low-level missing data (NA)
  support in the works
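As a rough illustration of the datetime64 and business day bullets above, here is a minimal sketch using functions that later NumPy releases expose (`np.busdaycalendar`, `np.is_busday`, `np.busday_offset`, `np.busday_count`). The dates and the one-holiday calendar are invented for illustration; this is not code from the talk, and the in-progress API at the time may have looked different.

```python
import numpy as np

# Illustrative dates (not from the talk): Fri 1 July, Mon 4 July, Tue 5 July 2011.
dates = np.array(["2011-07-01", "2011-07-04", "2011-07-05"], dtype="datetime64[D]")

# Hypothetical calendar: the default Mon-Fri week plus a single holiday.
cal = np.busdaycalendar(holidays=["2011-07-04"])

# Which of the dates are business days under this calendar?
is_bday = np.is_busday(dates, busdaycal=cal)

# One business day after Friday 1 July, skipping the weekend and the holiday.
next_bday = np.busday_offset("2011-07-01", 1, busdaycal=cal)

# Business days in the half-open interval [1 July, 8 July).
n_bdays = np.busday_count("2011-07-01", "2011-07-08", busdaycal=cal)
```

Passing a calendar object rather than hard-coding weekends is what makes the holiday logic extensible: a different market would only need a different `holidays` list.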
IPython

• One of Python’s killer apps gets even better
• Rich Qt GUI console with inline plotting
• New and improved architecture for high perf
  parallel / distributed computing
• See Fernando Pérez’s SciPy 2011 talk / video
Cython
• Still the first tool you should reach for to get
  better performance
• New: OpenMP integration (for multi-core)
     from cython.parallel import prange

     with nogil:
         for i in prange(n):
             pass  # do something in parallel

• Supports (almost) all of standard Python now
  (some things, like closures, used to not work)
statsmodels
•   Statistics and econometrics in Python

•   Major work in time series models over last year+

    •   VAR, SVAR models, eventually (V)ECM models
        for cointegrated time series

    •   AR/ARMA, Kalman Filter, various macro filters
        (e.g. Hodrick-Prescott) implemented

    •   Soon: Bayesian state space models (DLMs),
        ARCH/GARCH models, etc.
statsmodels
• Major criticism: weak user interface
 • No R-style formula framework
 • pandas not integrated (need to pass raw
    NumPy arrays)
• I have begun work on pandas integration;
  formulas have been implemented and will
  hopefully arrive within the next few months
pandas
• Still the Python data hacker’s best friend?
• Most recent release: 0.3.0 on 2/20/2011
• However, last 4 months have been the most
  active development period in the library’s
  history
• ~375 commits since 0.3.0 release (more than
  the entire prior open source history)
The state of data structures
Ambitious big picture

• I want to make pandas the cornerstone of the
  “next generation” statistical computing
  environment
• Ease-of-use, performance, flexibility all equally
  important
Ambitious big picture

• Taking the best features of other languages (R
  and friends) and making them better and
  easier to use
• See my recent blog article “A Roadmap for
  Rich Scientific Data Structures in Python”
pandas: under the hood
• Complete redesign of DataFrame internals
 • Now a single class for 2D data retaining
     optimal performance of old DataFrame and
     DataMatrix classes
 •   Significantly improved mixed-type and missing
     data handling
 •   Plan to use internal data structure to
     implement “NDFrame” for n-dimensional data
Fancy indexing
• Index a Series / DataFrame in a matrix-like
  way via special .ix attribute, use:
  • Slices with integers or labels
  • Lists of integers, labels, or boolean vecs
  • Integer or label locations
  df.ix[0]
  df.ix[date1:date2]
  df.ix[:5, 'A':'F']

  df.ix[df['A'] > 0, ['B', 'C', 'D']] = nan
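The `.ix` attribute shown above was later deprecated and removed from pandas. The same four kinds of selection can be sketched with the `.iloc` (positional) and `.loc` (label-based) indexers of current pandas. The frame, its dates, and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative frame: 6 business days x 4 columns, with negative and positive values.
dates = pd.date_range("2011-01-03", periods=6, freq="B")
df = pd.DataFrame(
    np.arange(24, dtype=float).reshape(6, 4) - 10,
    index=dates,
    columns=list("ABCD"),
)

first_row = df.iloc[0]                # integer location
window = df.loc[dates[1]:dates[3]]    # label slice, both endpoints included
block = df.iloc[:5].loc[:, "A":"C"]   # integer row slice, label column slice

# Boolean row mask plus a list of column labels, used for assignment.
df.loc[df["A"] > 0, ["B", "C", "D"]] = np.nan
```

Splitting the indexer in two removes the ambiguity `.ix` had when an index itself contains integers.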
Misc new features
• “Sparse” (mostly NA) versions of Series,
  DataFrame, WidePanel
• Many new functions on Series/DataFrame
 • describe, quantile, select, drop, dropna,
    corrwith, ...
• New moving window methods: rolling_quantile
  and rolling_apply
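In later pandas releases the module-level `rolling_quantile` and `rolling_apply` functions were replaced by the `.rolling()` method. A minimal sketch of the same two moving-window computations in that newer style, on a made-up series:

```python
import pandas as pd

# Illustrative data only.
s = pd.Series([1.0, 3.0, 2.0, 5.0, 4.0])

# Moving median over a 3-observation window (the 0.5 quantile).
med = s.rolling(window=3).quantile(0.5)

# Arbitrary function per window: here the max-min spread.
# raw=True passes each window as a plain NumPy array.
spread = s.rolling(window=3).apply(lambda w: w.max() - w.min(), raw=True)
```

The first two entries of each result are NaN because a full window of three observations is not yet available.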
Improved IO
• read_csv, read_table functions more
  flexible and robust, better type inference

 df = read_table('foo.txt', skiprows=[0,1],
                     na_values=['#N/A'])


• ExcelFile class for reading multiple sheets
  out of .xls files
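To make the `read_table` call above concrete without needing a file on disk, here is a small sketch that parses an in-memory string instead. The file contents are invented; `skiprows` and `na_values` are used exactly as on the slide.

```python
import io
import pandas as pd

# Made-up tab-delimited text: two junk lines, a header row, then two data rows.
raw = (
    "exported by some vendor tool\n"
    "do not edit\n"
    "date\tprice\n"
    "2011-07-01\t10.5\n"
    "2011-07-05\t#N/A\n"
)

# Skip the first two lines and treat '#N/A' as missing.
df = pd.read_table(io.StringIO(raw), skiprows=[0, 1], na_values=["#N/A"])
```

Because the missing marker is converted to NaN during parsing, the `price` column is inferred as floating point rather than as strings.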
Improved IO
• HDFStore class provides a complete, tested
  dict-like PyTables storage container
       store = HDFStore('mydata.h5')
       store['x'] = x
       store['y'] = y
       y = store['y']

• Experimental: store as Table and query
      store.put('df', df, table=True)
      piece = store.select('df',
          [{'field' : 'index', 'op' : '>=',
            'value' : date}])
Group by enhancements
• Can group by multiple columns or key
  functions, SQL-like but more general
• Syntactic sugar to invoke aggregation
  functions on groups
• Automatic exclusion of “nuisance”
  columns of DataFrames
• Various other usability enhancements
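A short sketch of the group by features listed above, written against current pandas. The column names and numbers are invented. Since recent pandas versions no longer drop non-numeric "nuisance" columns automatically, the sketch selects the numeric column explicitly.

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({
    "sector": ["tech", "tech", "energy", "energy"],
    "country": ["US", "UK", "US", "US"],
    "ret": [0.01, 0.03, -0.02, 0.04],
})

# Group by multiple columns, like SQL's GROUP BY sector, country.
by_two = df.groupby(["sector", "country"])["ret"].mean()

# Syntactic sugar: call the aggregation directly on the grouped column.
by_sector = df.groupby("sector")["ret"].sum()

# Group by a key function applied to each index label (here, the year).
ts = pd.Series(
    [1.0, 2.0, 3.0, 4.0],
    index=pd.to_datetime(["2010-12-30", "2010-12-31", "2011-01-03", "2011-01-04"]),
)
by_year = ts.groupby(lambda d: d.year).sum()
```

The key-function form is what makes this more general than SQL: any Python callable on the labels can define the groups.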
Very soon: hierarchical indexing

• Enable axis ticks to be identified by multiple
  labels instead of a single label
• Easily select subsets of data by “level”
• Create Excel-style pivot tables / cross-tabulations
  in a sensible way
• Will integrate naturally with groupby
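Hierarchical indexing was still upcoming at the time of these slides. In the pandas that followed it took the form of `MultiIndex`, and the sketch below uses that later API to illustrate the four bullets above. The tickers and values are made up.

```python
import pandas as pd

# Each row is identified by two labels: (ticker, year).
idx = pd.MultiIndex.from_product(
    [["AAPL", "MSFT"], [2010, 2011]], names=["ticker", "year"]
)
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# Select a subset of the data by level.
only_2011 = s.xs(2011, level="year")

# Pivot the inner level into columns, like a small cross-tabulation.
wide = s.unstack("year")

# Group by one level of the index.
totals = s.groupby(level="ticker").sum()
```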
Other misc things

• Flexible binary operators
 • a.add(b, fill_value=0.)
• Some timezone support in DateRange
• Numerous performance optimizations
• See the (long) release notes =)
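The `fill_value` argument in `a.add(b, fill_value=0.)` matters when the two operands do not share all labels. A minimal sketch with two made-up Series:

```python
import pandas as pd

# Illustrative data: the indexes overlap only on 'y'.
a = pd.Series([1.0, 2.0], index=["x", "y"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Plain operator: labels missing from either side produce NaN.
plain = a + b

# Flexible method: a label missing from one side is treated as 0 before adding.
filled = a.add(b, fill_value=0.)
```

With the plain operator only `y` gets a value; with `fill_value=0.` the unmatched labels `x` and `z` keep their original values.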
Planned work

• Fast time series up/downsampling
• Improved support and perf for HF/tick data
• Even more sophisticated group by tools
• Better documentation, online screencast
  tutorials / examples
Thanks

• Email: wesmckinn@gmail.com
• Twitter: @wesmckinn
• Blog: http://blog.wesmckinney.com
• pandas: http://github.com/wesm/pandas
• statsmodels: http://statsmodels.sourceforge.net
