What's new in pandas and the SciPy stack for financial users
Wes McKinney
Me
• AQR: August 2007 - July 2010
• Duke Statistics: 2010 - present (now on leave)
• My plans
• Improving Python libs for statistics and finance
• Building a financial software + consulting business
based on said tools
General sentiments
• Scientific Python growing solidly in finance and
in many other fields
• Though good sci-pythonistas are still scarce
• Important work happening in many of the core
projects
• Growing consensus: a new computational
model is needed to better cope with “big data”
NumPy
• Significantly refactored C internals
• Great progress on native datetime64 type
• Will significantly improve date-handling
performance and usability
• Extensible business day / holiday logic
planned / in progress
• Addition of low-level missing data (NA)
support in the works
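As a rough illustration of the business day / holiday logic mentioned above, the sketch below assumes a NumPy release that already provides np.busdaycalendar, np.is_busday, np.busday_count and np.busday_offset. The one-holiday calendar is a made-up example, not a real exchange calendar.

```python
import numpy as np

# Calendar days from Friday 2011-07-01 through Friday 2011-07-08 (end is exclusive).
days = np.arange("2011-07-01", "2011-07-09", dtype="datetime64[D]")

# Hypothetical calendar: Monday-Friday week with a single holiday.
calendar = np.busdaycalendar(weekmask="1111100", holidays=["2011-07-04"])

# Boolean mask of days on which the calendar is open.
is_open = np.is_busday(days, busdaycal=calendar)
business_days = days[is_open]

# Count business days in the half-open interval [2011-07-01, 2011-07-09).
n_business_days = np.busday_count("2011-07-01", "2011-07-09", busdaycal=calendar)

# Roll a Saturday forward to the next open day, skipping the holiday.
next_open = np.busday_offset("2011-07-02", 0, roll="forward", busdaycal=calendar)
```

Passing a prebuilt calendar object keeps the weekmask and holiday list in one place, so the same rules apply to every call.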
IPython
• One of Python’s killer apps gets even better
• Rich Qt GUI console with inline plotting
• New and improved architecture for
high-performance parallel / distributed computing
• See Fernando Pérez’s SciPy 2011 talk / video
Cython
• Still the first tool you should reach for to get
better performance
• New: OpenMP integration (for multi-core)
from cython.parallel import prange
with nogil:
    for i in prange(n):
        # do something in parallel
• Supports (almost) all of standard Python now
(some features, such as closures, previously did not work)
statsmodels
• Statistics and econometrics in Python
• Major work on time series models over the last year or more
• VAR, SVAR models, eventually (V)ECM models
for cointegrated time series
• AR/ARMA, Kalman Filter, various macro filters
(e.g. Hodrick-Prescott) implemented
• Soon: Bayesian state space models (DLMs),
ARCH/GARCH models, etc.
statsmodels
• Major criticism: weak user interface
• No R-style formula framework
• pandas not integrated (need to pass raw
NumPy arrays)
• I have begun work on pandas integration;
formulas have been implemented and will
hopefully arrive within the next few months
pandas
• Still the Python data hacker’s best friend?
• Most recent release: 0.3.0 on February 20, 2011
• However, the last 4 months have been the most
active development period in the library’s
history
• ~375 commits since 0.3.0 release (more than
the entire prior open source history)
Ambitious big picture
• I want to make pandas the cornerstone of the
“next generation” statistical computing
environment
• Ease-of-use, performance, flexibility all equally
important
Ambitious big picture
• Taking the best features of other languages (R
and friends) and making them better and
easier to use
• See my recent blog article “A Roadmap for
Rich Scientific Data Structures in Python”
pandas: under the hood
• Complete redesign of DataFrame internals
• Now a single class for 2D data retaining
optimal performance of old DataFrame and
DataMatrix classes
• Significantly improved mixed-type and missing
data handling
• Plan to use internal data structure to
implement “NDFrame” for n-dimensional data
Fancy indexing
• Index a Series / DataFrame in a matrix-like
way via the special .ix attribute, using:
• Slices with integers or labels
• Lists of integers, labels, or boolean vectors
• Integer or label locations
df.ix[0]
df.ix[date1:date2]
df.ix[:5, 'A':'F']
df.ix[df['A'] > 0, ['B', 'C', 'D']] = nan
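The same selections can be sketched in runnable form. This version assumes a later pandas release, where the .loc (label-based) and .iloc (position-based) accessors took over the role of .ix; the frame, its dates and its column names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical 6x6 frame indexed by business days, columns A..F.
dates = pd.date_range("2011-01-03", periods=6, freq="B")
values = np.arange(36, dtype=float).reshape(6, 6) - 10.0
df = pd.DataFrame(values, index=dates, columns=list("ABCDEF"))

# Integer location: the first row as a Series.
first_row = df.iloc[0]

# Label slice: both endpoints are included.
date_window = df.loc[dates[1]:dates[3]]

# Integer slice on rows combined with a label slice on columns.
top_left = df.iloc[:5].loc[:, "A":"C"]

# Boolean row mask plus a list of column labels, used for assignment.
df.loc[df["A"] > 0, ["B", "C", "D"]] = np.nan
```

Splitting label-based and position-based access into two accessors removes the ambiguity that arises when the index itself holds integers.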
Misc new features
• “Sparse” (mostly NA) versions of Series,
DataFrame, WidePanel
• Many new functions on Series/DataFrame
• describe, quantile, select, drop, dropna,
corrwith, ...
• New moving window methods: rolling_quantile
and rolling_apply
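A few of these methods can be sketched on a small Series. The example assumes a later pandas release, where the moving window functions are reached through the .rolling() method rather than the rolling_quantile and rolling_apply functions named above; the data is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing observation.
s = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0, 6.0])

# Drop the missing value, then take the overall median.
clean = s.dropna()
median = clean.quantile(0.5)

# Moving-window median over 3 observations (counterpart of rolling_quantile).
rolling_median = clean.rolling(window=3).quantile(0.5)

# Arbitrary function over each window (counterpart of rolling_apply):
# here the range, max minus min, of each 3-observation window.
rolling_range = clean.rolling(window=3).apply(lambda w: w.max() - w.min(), raw=True)
```

The first two entries of each rolling result are NaN because a full 3-observation window is not yet available.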
Improved IO
• read_csv, read_table functions more
flexible and robust, better type inference
df = read_table('foo.txt', skiprows=[0, 1],
                na_values=['#N/A'])
• ExcelFile class for reading multiple sheets
out of .xls files
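To try the read_table call without a file on disk, the sketch below feeds it an in-memory buffer. The file contents are invented: two junk lines to skip, a tab-separated header, and one '#N/A' marker.

```python
import io

import pandas as pd

# Hypothetical tab-separated export with two leading lines to skip.
raw = (
    "generated by export tool\n"
    "second junk line\n"
    "date\tprice\n"
    "2011-07-01\t10.5\n"
    "2011-07-05\t#N/A\n"
)

# skiprows drops the first two physical lines; na_values marks '#N/A' as missing.
df = pd.read_table(io.StringIO(raw), skiprows=[0, 1], na_values=["#N/A"])
```

Because the missing marker is converted on read, the price column is inferred as floating point rather than as strings.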
Improved IO
• HDFStore class provides a complete, tested
dict-like PyTables storage container
store = HDFStore('mydata.h5')
store['x'] = x
store['y'] = y
y = store['y']
• Experimental: store as Table and query
store.put('df', df, table=True)
piece = store.select('df',
    [{'field' : 'index', 'op' : '>=',
      'value' : date}])
Group by enhancements
• Can group by multiple columns or key
functions, SQL-like but more general
• Syntactic sugar to invoke aggregation
functions on groups
• Automatic exclusion of “nuisance”
columns of DataFrames
• Various other usability enhancements
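The grouping styles listed above can be sketched as follows, assuming a recent pandas release. The column names and values are hypothetical. Note that recent versions no longer drop non-numeric "nuisance" columns silently, so the sketch requests that behavior explicitly with numeric_only=True.

```python
import pandas as pd

# Hypothetical returns table with two categorical keys and a string ticker.
df = pd.DataFrame(
    {
        "sector": ["tech", "tech", "energy", "energy"],
        "country": ["US", "US", "US", "UK"],
        "ticker": ["AAA", "AAB", "BBA", "BBB"],
        "ret": [0.01, 0.03, -0.02, 0.04],
    }
)

# Group by multiple columns, then aggregate one column.
by_two_keys = df.groupby(["sector", "country"])["ret"].mean()

# Group by a key function applied to each index label (first letter of the ticker).
by_first_letter = df.set_index("ticker")["ret"].groupby(lambda label: label[0]).sum()

# Aggregate the whole frame, keeping only numeric columns in the result.
sector_means = df.groupby("sector").mean(numeric_only=True)
```

A key function is applied to the index labels, which is why the ticker column is moved into the index before grouping.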
Very soon: hierarchical indexing
• Enable axis ticks to be identified by multiple
labels instead of a single label
• Easily select subsets of data by “level”
• Create Excel-style pivot tables /
cross-tabulations in a sensible way
• Will integrate naturally with groupby
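A minimal sketch of these ideas, assuming the form in which the feature later shipped in pandas (the MultiIndex class). The level names, tickers and values are hypothetical.

```python
import pandas as pd

# Each observation is identified by two labels: a quarter and a ticker.
index = pd.MultiIndex.from_product(
    [["2011-Q1", "2011-Q2"], ["AAPL", "MSFT"]],
    names=["quarter", "ticker"],
)
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

# Select a subset of the data by level.
q1 = s.xs("2011-Q1", level="quarter")

# Pivot the inner level into columns, giving a quarter-by-ticker table.
table = s.unstack("ticker")

# Group by a level of the index instead of by a column.
per_ticker = s.groupby(level="ticker").sum()
```

Here unstack plays the role of a simple pivot: one index level becomes the column axis while the other remains on the rows.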
Other misc things
• Flexible binary operators
• a.add(b, fill_value=0.)
• Some timezone support in DateRange
• Numerous performance optimizations
• See the (long) release notes =)
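The effect of fill_value in the flexible binary operators is easiest to see next to the plain + operator. The two Series below are invented and deliberately have only partly overlapping labels.

```python
import pandas as pd

a = pd.Series([1.0, 2.0], index=["x", "y"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Plain addition aligns on labels; labels present on only one side give NaN.
plain = a + b

# With fill_value, a label missing on one side is treated as 0 before adding.
filled = a.add(b, fill_value=0.0)
```

With fill_value the result is defined on the union of the labels: x is 1.0, y is 12.0, and z is 20.0.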
Planned work
• Fast time series up/downsampling
• Improved support and performance for high-frequency (HF) / tick data
• Even more sophisticated group by tools
• Better documentation, online screencast
tutorials / examples