Python for Financial Data Analysis with pandas

Financial data analysis in Python with pandas

Wes McKinney
@wesmckinn

10/17/2011


My background

3 years as a quant hacker at AQR, now consultant / entrepreneur
Math and statistics background with the zest of computer science
Active in scientific Python community
My blog: http://blog.wesmckinney.com
Twitter: @wesmckinn


Bare essentials for financial research

Fast time series functionality
Easy data alignment
Date/time handling
Moving window statistics
Resampling / frequency conversion
Fast data access (SQL databases, flat files, etc.)
Data visualization (plotting)
Statistical models
Linear regression
Time series models: ARMA, VAR, ...
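
The moving window and frequency conversion items above can be sketched in a few lines. This is a minimal illustration using the current pandas API (an assumption, since the API has changed since 2011); the price series is synthetic and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical daily prices on business days (synthetic, for illustration only)
dates = pd.date_range("2011-01-03", periods=60, freq="B")
prices = pd.Series(100.0 + np.arange(60, dtype=float), index=dates)

# Moving window statistic: 20-day rolling mean (first 19 values are NaN)
rolling_mean = prices.rolling(window=20).mean()

# Frequency conversion: last observation in each calendar week
weekly_last = prices.resample("W").last()

# Simple daily returns from consecutive prices
returns = prices.pct_change()
```

Because every result keeps a date index, the three outputs can be combined or plotted together without manual bookkeeping.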


Would be nice to have

Portfolio and risk analytics, backtesting
Easy enough to write yourself, though most people do a bad job of it
Portfolio optimization
Most financial firms use a 3rd party library anyway
Derivative pricing
Can use QuantLib in most languages


What are financial firms using?

HFT: a C++ and hardware arms race, a different topic
Research
Mainstream: R, MATLAB, Python, ...
Econometrics: Stata, eViews, RATS, etc.
Non-programmatic environments: ClariFI, Palantir, ...
Production
Popular: Java, C#, C++
Less popular, but growing: Python
Fringe: Functional languages (Ocaml, Haskell, F#)


What are financial firms using?

Many hybrid-language environments (e.g. Java/R, C++/R,
C++/MATLAB, Python/C++)
Which is the main implementation language?
If the main language is Java/C++, the result is lower productivity and a
higher cost of prototyping new functionality
Trends
Banks and hedge funds are realizing that Java-based production
systems can be replaced with 20% as much Python code (or less)
MATLAB is being increasingly ditched in favor of Python. R and
Python use for research generally growing


Python language

Simple, expressive syntax
Designed for readability, like “runnable pseudocode”
Easy-to-use, powerful built-in types and data structures:
Lists and tuples (fixed-size, immutable lists)
Dicts (hash maps / associative arrays) and sets
Everything’s an object, including functions
“There should be one, and preferably only one way to do it”
“Batteries included”: great general purpose standard library
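
A short sketch of the built-in types listed above, using only the standard library. The tickers, quantities and prices are made up for illustration.

```python
# Tuples are immutable; lists can grow in place
point = (1.5, 2.5)
prices = [101.2, 99.8, 100.5]
prices.append(102.1)

# Dicts map keys to values; sets hold unique items
positions = {"AAPL": 100, "MSFT": -50}   # hypothetical holdings
tickers = set(positions) | {"GOOG"}

# Functions are objects: they can be stored in a dict and called later
def spread(values):
    return max(values) - min(values)

stats = {"min": min, "max": max, "spread": spread}
summary = {name: func(prices) for name, func in stats.items()}
```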


A simple example: quicksort

Pseudocode from Wikipedia:

function qsort(array)
if length(array) < 2
return array
var list less, greater
select and remove a pivot value pivot from array
for each x in array
if x < pivot then append x to less
else append x to greater
return concat(qsort(less), pivot, qsort(greater))



First try Python implementation:
def qsort(array):
    if len(array) < 2:
        return array

    less, greater = [], []

    pivot, rest = array[0], array[1:]

    for x in rest:
        if x < pivot:
            less.append(x)
        else:
            greater.append(x)

    return qsort(less) + [pivot] + qsort(greater)



Use list comprehensions:
def qsort(array):
    if len(array) < 2:
        return array

    pivot, rest = array[0], array[1:]
    less = [x for x in rest if x < pivot]
    greater = [x for x in rest if x >= pivot]

    return qsort(less) + [pivot] + qsort(greater)



Heck, fit it onto one line!
qs = lambda r: (r if len(r) < 2
                else (qs([x for x in r[1:] if x < r[0]])
                      + [r[0]]
                      + qs([x for x in r[1:] if x >= r[0]])))

Though that’s starting to look like Lisp code...



A quicksort using NumPy arrays
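import numpy as np

def qsort(array):
    if len(array) < 2:
        return array
    pivot, rest = array[0], array[1:]
    less = rest[rest < pivot]        # boolean mask selects smaller elements
    greater = rest[rest >= pivot]
    return np.r_[qsort(less), [pivot], qsort(greater)]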

Of course no need for this when you can just do:

sorted_array = np.sort(array)


Python: drunk with power
This comic has way too much airtime but:


Staples of Python for science: MINS

(M) matplotlib: plotting and data visualization
(I) IPython: rich interactive computing and development environment
(N) NumPy: multi-dimensional arrays, linear algebra, FFTs, random
number generation, etc.
(S) SciPy: optimization, probability distributions, signal processing,
ODEs, sparse matrices, ...


Why did Python become popular in science?

NumPy traces its roots to 1995
Extremely easy to integrate C/C++/Fortran code
Access fast low level algorithms in a high level, interpreted language
The language itself
“It fits in your head”
“It [Python] doesn’t get in my way” - Robert Kern
Python is good at all the things other scientific programming
languages are not good at (e.g. networking, string processing, OOP)
Liberal BSD license: can use Python for commercial applications


Some exciting stuff in the last few years

Cython
“Augmented” Python language with type declarations, for generating
compiled extensions
C-like speedups with Python-like development time
IPython: enhanced interactive Python interpreter
The best research and software development env for Python
An integrated parallel / distributed computing backend
GUI console with inline plotting and a rich HTML notebook (more on
this later)
PyCUDA / PyOpenCL: GPU computing in Python
Transformed Python overnight into one of the best languages for doing
GPU computing


Where has Python historically been weak?

Rich data structures for data analysis and statistics
NumPy arrays, while powerful, feel distinctly “lower level” if you’re
used to R’s data.frame
pandas has filled this gap over the last 2 years
Statistics libraries
Nowhere near the depth of R’s CRAN repository
statsmodels provides tested implementations of many standard
regression and time series models
Turns out that most financial data analysis requires only fairly
elementary statistical models
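
As one example of an elementary model, the sketch below estimates a market beta by ordinary least squares. It deliberately uses `numpy.linalg.lstsq` rather than statsmodels so that it depends only on NumPy; the return series are synthetic and noise-free, chosen so the true coefficients are known.

```python
import numpy as np

# Synthetic returns: asset = 0.001 + 1.5 * market (no noise, illustration only)
market = np.array([0.01, -0.02, 0.015, 0.005, -0.01])
asset = 0.001 + 1.5 * market

# Design matrix with an intercept column, then solve the least-squares problem
X = np.column_stack([np.ones_like(market), market])
coef, residuals, rank, singular_values = np.linalg.lstsq(X, asset, rcond=None)
alpha, beta = coef
```

With real data the fit would not be exact, and a statistics library would also report standard errors and diagnostics.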


pandas library

Began building at AQR in 2008, open-sourced late 2009
Why
R / MATLAB, while good for research / data analysis, are not suitable
implementation languages for large-scale production systems
(I personally don’t care for them for data analysis)
Existing data structures for time series in R / MATLAB were too
limited / not flexible enough for my needs
Core idea: indexed data structures capable of storing heterogeneous
data
Etymology: panel data structures


pandas in a nutshell

A clean axis indexing design to support fast data alignment, lookups,
hierarchical indexing, and more
High-performance data structures
Series/TimeSeries: 1D labeled vector
DataFrame: 2D spreadsheet-like structure
Panel: 3D labeled array, collection of DataFrames
SQL-like functionality: GroupBy, joining/merging, etc.
Missing data handling
Time series functionality


pandas design philosophy

“Think outside the matrix”: stop thinking about shape and start
thinking about indexes
Indexing and data alignment are essential
Fault-tolerance: save you from common blunders caused by coding
errors (specifically misaligned data)
Lift the best features of other data analysis environments (R,
MATLAB, Stata, etc.) and make them better, faster
Performance and usability equally important


The pandas killer feature: indexing

Each axis has an index
Automatic alignment between differently-indexed objects: makes it
nearly impossible to accidentally combine misaligned data
Hierarchical indexing provides an intuitive way of structuring and
working with higher-dimensional data
Natural way of expressing “group by” and join-type operations
Better integrated and more flexible indexing than anything available
in R or MATLAB
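
The alignment and hierarchical indexing points above can be illustrated with a small sketch. It uses the current pandas API and a MultiIndex rather than a Panel, since current pandas releases no longer include Panel. All labels and values are hypothetical.

```python
import pandas as pd

# Two Series with different, differently ordered labels
a = pd.Series([1.0, 2.0, 3.0], index=["AAA", "BBB", "CCC"])
b = pd.Series([10.0, 20.0, 30.0], index=["CCC", "AAA", "DDD"])

# Addition aligns on labels, not positions; unmatched labels become NaN
total = a + b

# Hierarchical index: (date, ticker) pairs on a single axis
idx = pd.MultiIndex.from_tuples(
    [("2011-10-14", "AAA"), ("2011-10-14", "BBB"),
     ("2011-10-17", "AAA"), ("2011-10-17", "BBB")],
    names=["date", "ticker"],
)
stacked = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# Pivot the inner level into columns, or group by it directly
wide = stacked.unstack("ticker")
by_ticker = stacked.groupby(level="ticker").mean()
```

Here `total["AAA"]` combines the values labeled "AAA" from both inputs even though they sit at different positions, which is the protection against misaligned data described above.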


Tutorial time

To the IPython console!

