Structured Data Challenges in Finance and Statistics
1. Structured Data Challenges in Finance and Statistics
Wes McKinney
Rice Statistics, 21 November 2011
Wes McKinney () Structured data challenges Rice Statistics 1 / 43
2. Me
S.B., MIT Math ’07
3 years in the quant finance business
Now: starting a software company, initially to build financial data
analysis and research systems
My blog: http://blog.wesmckinney.com
Twitter: @wesmckinn
Book! “Python for Data Analysis”, to hit the shelves later next year
from O’Reilly Media
Wes McKinney () Structured data challenges Rice Statistics 2 / 43
3. Structured data
cname year agefrom ageto ls lsc pop ccode
0 Australia 1950 15 19 64.3 15.4 558 AUS
1 Australia 1950 20 24 48.4 26.4 645 AUS
2 Australia 1950 25 29 47.9 26.2 681 AUS
3 Australia 1950 30 34 44 23.8 614 AUS
4 Australia 1950 35 39 42.1 21.9 625 AUS
5 Australia 1950 40 44 38.9 20.1 555 AUS
6 Australia 1950 45 49 34 16.9 491 AUS
7 Australia 1950 50 54 29.6 14.6 439 AUS
8 Australia 1950 55 59 28 12.9 408 AUS
9 Australia 1950 60 64 26.3 12.1 356 AUS
Wes McKinney () Structured data challenges Rice Statistics 3 / 43
4. Partial list of structured data necessities
Table modification: column insertion/deletion/type changes
Rich axis indexing, metadata
Easy data alignment
Aggregation and transformation by group (“group by”)
Missing data (NA) handling
Pivoting and reshaping
Merging and joining
Time series-specific manipulations
Fast Input/Output: text files, databases, HDF5, ...
Wes McKinney () Structured data challenges Rice Statistics 4 / 43
5. Are existing tools good enough?
We care nearly equally about
Ease-of-use (syntax / API fits your mental model)
Expressiveness
Performance (speed and memory usage)
Clean, consistent interface design is hard
Wes McKinney () Structured data challenges Rice Statistics 5 / 43
6. Auxiliary concerns
Any tool needs to integrate well with:
Statistical modeling tools
Data visualization (plotting)
Target users
Computer scientists, statisticians, software engineers?
Data scientists?
Wes McKinney () Structured data challenges Rice Statistics 6 / 43
7. Are existing tools good enough?
The typical players
R data.frame and friends + CRAN libraries
SQL and other relational databases
Python / NumPy: structured (record) arrays
Commercial products: SAS, Stata, MS Excel...
My conclusion: we still have a ways to go
R has become demonstrably better in the last 5 years (e.g. via plyr,
reshape2)
Wes McKinney () Structured data challenges Rice Statistics 7 / 43
8. Deeper problems in many industries
Facilitating the research process only part of problem
Much of academia: “Production systems?”
Industry: a wasteland of misshapen wheels or expensive vendor
products
Explosive growth in data-driven production systems
Hybrid-language systems are not always a good idea
Wes McKinney () Structured data challenges Rice Statistics 8 / 43
9. The big data conundrum
Great effort being invested in the (difficult) problem of large-scale
data processing, e.g. MapReduce-based
Less effort in the fundamental tooling for data manipulation /
preparation / integration
Single-node performance does matter
Single-node code development time matters too
Wes McKinney () Structured data challenges Rice Statistics 9 / 43
10. pandas: my effort in this arena
Pick your favorite: panel data structures or Python structured data
analysis
Starting building April 2008 back at AQR Capital
Open-sourced (BSD license) mid-2009
Heavily tested, being used by many companies (inc. lots of financial
firms) as the cornerstone of their systems
Goal: optimal balance of ease-of-use, flexibility, and performance
Heavy development the last 6 months
Wes McKinney () Structured data challenges Rice Statistics 10 / 43
11. Why did I start from scratch?
Accusations of NIH Syndrome abound
In 2008 I simultaneously needed
Agile, high performance data structures
A high productivity programming language for implementing all of the
non-computational business logic
A production application platform that would seamlessly integrate with
an interactive data analysis / research platform
In short, I was rebuilding major financial systems and I found my
options inadequate
Thrilling innovation opportunity!
Wes McKinney () Structured data challenges Rice Statistics 11 / 43
12. Why did I use Python?
High productivity general purpose language
Well thought-out object-oriented model
Excellent software-development tools
Easy for MATLAB/R users to learn
Flexible built-in data structures (dicts, sets, lists, tuples)
The right open-source scientific computing tools
Powerful array processing (NumPy)
Abundant tools for performance computing
Wes McKinney () Structured data challenges Rice Statistics 12 / 43
13. But, Python is not perfect
For statistical computing, a chicken-and-egg problem
Python’s plotting libraries are not designed for statistical graphics
Built-in data structures are not especially optimized for my large data
use cases
Occasional semantic / syntactic niggles
Wes McKinney () Structured data challenges Rice Statistics 13 / 43
14. Partial list of structured data necessities
Table modification: column insertion/deletion/type changes
Rich axis indexing, metadata
Easy data alignment
Aggregation and transformation by group (“group by”)
Missing data (NA) handling
Pivoting and reshaping
Merging and joining
Time series-specific manipulations
Fast Input/Output: text files, databases, HDF5, ...
Wes McKinney () Structured data challenges Rice Statistics 14 / 43
15. DataFrame, the pandas workhorse
A 2D tabular data structure with row and column indexes
Fast for row- and column-oriented operations
Support heterogeneous columns WITHOUT sacrificing performance in
the homogeneous (e.g. floating point only) case
Wes McKinney () Structured data challenges Rice Statistics 15 / 43
16. DataFrame
cname year agefrom ageto ls lsc pop ccode
0 Australia 1950 15 19 64.3 15.4 558 AUS
1 Australia 1950 20 24 48.4 26.4 645 AUS
2 Australia 1950 25 29 47.9 26.2 681 AUS
3 Australia 1950 30 34 44 23.8 614 AUS
4 Australia 1950 35 39 42.1 21.9 625 AUS
5 Australia 1950 40 44 38.9 20.1 555 AUS
6 Australia 1950 45 49 34 16.9 491 AUS
7 Australia 1950 50 54 29.6 14.6 439 AUS
8 Australia 1950 55 59 28 12.9 408 AUS
9 Australia 1950 60 64 26.3 12.1 356 AUS
Wes McKinney () Structured data challenges Rice Statistics 16 / 43
17. Axis indexing and metadata
Basic concept: labeled axes in use throughout the library
Need to support
Fast lookups (constant time)
Data realignment / selection by labels (linear)
Munging together irregularly indexed data
Key innovation: index is a data structure itself. Different
implementations can support more sophisticated indexing
Axis labels can be any immutable Python object
Wes McKinney () Structured data challenges Rice Statistics 17 / 43
18. Irregularly indexed data
DataFrame
Columns
I
N
D
E
X
Wes McKinney () Structured data challenges Rice Statistics 18 / 43
19. Axis indexing
Axis Index
d 0
a 1
b 2
c 3
e 4
Wes McKinney () Structured data challenges Rice Statistics 19 / 43
20. Why does this matter?
Real world data is highly irregular, especially time series
Operations between DataFrame objects automatically align on the
indexes
Nearly impossible to have errors due to misaligned data
Can vastly facilitate munging unstructured data into structured form
Grants immense freedom in writing research code
Time series are just a special case of a general indexed data structure
Wes McKinney () Structured data challenges Rice Statistics 20 / 43
22. Hierarchical indexing
Basic idea: represent high dimensional data in a lower-dimensional
structure that is easier to reason about
Axis index with k levels of indexing
Slice chunks of data in constant time!
Provides a very natural way of implementing reshaping operations
Advantage over a truly N-dimensional object: space-efficient dense
representation if groups are unbalanced
Extremely useful for econometric models on panel data
Wes McKinney () Structured data challenges Rice Statistics 22 / 43
24. Joining and merging
Join and merge-type operations are very easy to implement with
indexing in place
Multi-key join same code as aligning hierarchically-indexed
DataFrames
Will illustrate this with examples
Wes McKinney () Structured data challenges Rice Statistics 24 / 43
25. Supporting size mutability
In order to have good row-oriented performance, need to store
like-typed columns in a single ndarray
“Column” insertion: accumulate 1 × N × . . . homogeneous columns,
later consolidate with other like-typed into a single block
I.e. avoid reallocate-copy or array concatenation steps as long as
possible
Column deletions can be no-copy events (since ndarrays support
views)
Wes McKinney () Structured data challenges Rice Statistics 25 / 43
26. DataFrame, under the hood
Actually You see
6 2 4 5 1 3 1 2 3 4 5 6
O F I B I F B F F O
Wes McKinney () Structured data challenges Rice Statistics 26 / 43
29. Reshaping implementation nuances
Must carefully deal with unbalanced group sizes / missing data
I play vectorization tricks with the NumPy memory layout: no for
loops!
Care must be taken to handle heterogeneous and homogeneous data
cases
Wes McKinney () Structured data challenges Rice Statistics 29 / 43
30. GroupBy
High level process
split data set into groups
apply function to each group (an aggregation or a transformation)
combine results intelligently into a result data structure
Can be used to emulate SQL GROUP BY operations
Wes McKinney () Structured data challenges Rice Statistics 30 / 43
31. GroupBy
Grouping closely related to indexing
Create correspondence between axis labels and group labels using one
of:
Array of group labels (like a DataFrame column)
Python function to be applied to each axis tick
Can group by multiple keys
For a hierarchically indexed axis, can select a level and group by that
(or some transformation thereof)
Wes McKinney () Structured data challenges Rice Statistics 31 / 43
32. Anatomy of GroupBy
grouped = obj.groupby([key1, key2, key3])
This returns a GroupBy object
Each of the keys could be any of:
A Python function
A vector
A column name
Wes McKinney () Structured data challenges Rice Statistics 32 / 43
33. Anatomy of GroupBy
aggregate, transform, and the more general apply supported
group_means = grouped.agg(np.mean)
group_means2 = grouped.mean()
demeaned = grouped.transform(lambda x: x - x.mean())
Wes McKinney () Structured data challenges Rice Statistics 33 / 43
34. Anatomy of GroupBy
The GroupBy object is also iterable
group_means = {}
for group_name, group in grouped:
group_means[group_name] = grouped.mean()
Wes McKinney () Structured data challenges Rice Statistics 34 / 43
35. GroupBy and hierarchical indexing
Hierarchical indexing came about as the natural result of a multi-key
aggregation:
>>> group_means = df.groupby(['country', 'agefrom']).mean()
>>> group_means[['ls', 'lsc', 'pop']].unstack('country')
ls lsc pop
country Australia Austria Australia Austria Australia Austria
agefrom
15 70.03 31.1 26.17 14.67 6163 3310
20 58.02 59.98 34.51 45.83 1113 531
25 57.02 45.5 33.02 32.73 5021 2791
30 59.16 46.56 35.29 33.87 1082 527
35 59.58 43.29 34.53 30.3 1053 528.8
40 58.8 40.98 33.92 27.88 1005 522.5
45 56.79 39.19 31.71 26 927.2 503.5
50 54.71 37.14 30.37 23.94 836.6 475.2
55 51.93 34.82 26.98 22.11 735.8 443.2
60 49.92 32.35 25.85 20.2 632.6 408.8
65 47.02 29.59 22.98 18.44 522.5 361.9
70 46.27 28.52 22.53 17.72 410.5 295.8
75 46.28 28.06 23.41 18.42 624.5 437.6
Wes McKinney () Structured data challenges Rice Statistics 35 / 43
36. What makes GroupBy hard?
factor-izing the group labels is very expensive
Function call overhead on small groups
To sort or not to sort?
Cheaper than computing the group labels!
Munging together results in exceptional cases is tricky
Wes McKinney () Structured data challenges Rice Statistics 36 / 43
37. Time series operations
Fixed frequency indexing (DateRange)
Domain-specific time offsets (like business day, business month end)
Frequency conversion
Forward-filling / back-filling / interpolation
Leading/lagging
In the works (later this year), better/faster upsampling/downsampling
Wes McKinney () Structured data challenges Rice Statistics 37 / 43
38. Shoot-out with R’s time series objects
“Inner join” addition between two irregular 500K-length time series
sampled from 1 million-length time series. Timestamps are POSIXct (R) /
int64 (Python)
package timing factor
pandas 21.5 ms 1.0
xts 41.3 ms 1.92
fts 370 ms 17.2
its 1117 ms 51.95
zoo 3479 ms 161.8
Where there is smoke, there is fire?
Iron: Macbook Pro Core i7 with R 2.13.1, pandas 0.5.1git / Python 2.7.2
Wes McKinney () Structured data challenges Rice Statistics 38 / 43
39. Erm, inner join?
Intersecting time stamps loses information
Wes McKinney () Structured data challenges Rice Statistics 39 / 43
40. My (mild) performance obsession
Wes McKinney () Structured data challenges Rice Statistics 40 / 43
41. My (mild) performance obsession
Good performance stems from many factors
Well-designed algorithms (from a complexity standpoint)
Minimizing copying of data
Minimizing interpreter / function call overhead
Taking advantage of memory layout
I value functionality over performance, but I do spend a lot of time
profiling code and grokking performance characteristics
Wes McKinney () Structured data challenges Rice Statistics 41 / 43