• Share
  • Email
  • Embed
  • Like
  • Save
  • Private Content
Structured Data Challenges in Finance and Statistics
 

Structured Data Challenges in Finance and Statistics

on

  • 13,443 views

Presented

Presented

Statistics

Views

Total Views
13,443
Views on SlideShare
2,923
Embed Views
10,520

Actions

Likes
5
Downloads
81
Comments
0

11 Embeds 10,520

http://blog.wesmckinney.com 5416
http://wesmckinney.com 5081
http://www.newsblur.com 13
http://www.twylah.com 2
http://www.blog.wesmckinney.com 2
http://zootool.com 1
http://translate.googleusercontent.com 1
https://webcache.googleusercontent.com 1
http://webcache.googleusercontent.com 1
http://us-w1.rockmelt.com 1
http://www.netvibes.com 1
More...

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    Structured Data Challenges in Finance and Statistics Structured Data Challenges in Finance and Statistics Presentation Transcript

    • Structured Data Challenges in Finance and Statistics Wes McKinney Rice Statistics, 21 November 2011 Wes McKinney () Structured data challenges Rice Statistics 1 / 43
    • Me S.B., MIT Math ’07 3 years in the quant finance business Now: starting a software company, initially to build financial data analysis and research systems My blog: http://blog.wesmckinney.com Twitter: @wesmckinn Book! “Python for Data Analysis”, to hit the shelves later next year from O’Reilly Media Wes McKinney () Structured data challenges Rice Statistics 2 / 43
    • Structured data cname year agefrom ageto ls lsc pop ccode0 Australia 1950 15 19 64.3 15.4 558 AUS1 Australia 1950 20 24 48.4 26.4 645 AUS2 Australia 1950 25 29 47.9 26.2 681 AUS3 Australia 1950 30 34 44 23.8 614 AUS4 Australia 1950 35 39 42.1 21.9 625 AUS5 Australia 1950 40 44 38.9 20.1 555 AUS6 Australia 1950 45 49 34 16.9 491 AUS7 Australia 1950 50 54 29.6 14.6 439 AUS8 Australia 1950 55 59 28 12.9 408 AUS9 Australia 1950 60 64 26.3 12.1 356 AUS Wes McKinney () Structured data challenges Rice Statistics 3 / 43
    • Partial list of structured data necessities Table modification: column insertion/deletion/type changes Rich axis indexing, metadata Easy data alignment Aggregation and transformation by group (“group by”) Missing data (NA) handling Pivoting and reshaping Merging and joining Time series-specific manipulations Fast Input/Output: text files, databases, HDF5, ... Wes McKinney () Structured data challenges Rice Statistics 4 / 43
    • Are existing tools good enough? We care nearly equally about Ease-of-use (syntax / API fits your mental model) Expressiveness Performance (speed and memory usage) Clean, consistent interface design is hard Wes McKinney () Structured data challenges Rice Statistics 5 / 43
    • Auxiliary concerns Any tool needs to integrate well with: Statistical modeling tools Data visualization (plotting) Target users Computer scientists, statisticians, software engineers? Data scientists? Wes McKinney () Structured data challenges Rice Statistics 6 / 43
    • Are existing tools good enough? The typical players R data.frame and friends + CRAN libraries SQL and other relational databases Python / NumPy: structured (record) arrays Commercial products: SAS, Stata, MS Excel... My conclusion: we still have a ways to go R has become demonstrably better in the last 5 years (e.g. via plyr, reshape2) Wes McKinney () Structured data challenges Rice Statistics 7 / 43
    • Deeper problems in many industries Facilitating the research process only part of problem Much of academia: “Production systems?” Industry: a wasteland of misshapen wheels or expensive vendor products Explosive growth in data-driven production systems Hybrid-language systems are not always a good idea Wes McKinney () Structured data challenges Rice Statistics 8 / 43
    • The big data conundrum Great effort being invested in the (difficult) problem of large-scale data processing, e.g. MapReduce-based Less effort in the fundamental tooling for data manipulation / preparation / integration Single-node performance does matter Single-node code development time matters too Wes McKinney () Structured data challenges Rice Statistics 9 / 43
    • pandas: my effort in this arena Pick your favorite: panel data structures or Python structured data analysis Starting building April 2008 back at AQR Capital Open-sourced (BSD license) mid-2009 Heavily tested, being used by many companies (inc. lots of financial firms) as the cornerstone of their systems Goal: optimal balance of ease-of-use, flexibility, and performance Heavy development the last 6 months Wes McKinney () Structured data challenges Rice Statistics 10 / 43
    • Why did I start from scratch? Accusations of NIH Syndrome abound In 2008 I simultaneously needed Agile, high performance data structures A high productivity programming language for implementing all of the non-computational business logic A production application platform that would seamlessly integrate with an interactive data analysis / research platform In short, I was rebuilding major financial systems and I found my options inadequate Thrilling innovation opportunity! Wes McKinney () Structured data challenges Rice Statistics 11 / 43
    • Why did I use Python? High productivity general purpose language Well thought-out object-oriented model Excellent software-development tools Easy for MATLAB/R users to learn Flexible built-in data structures (dicts, sets, lists, tuples) The right open-source scientific computing tools Powerful array processing (NumPy) Abundant tools for performance computing Wes McKinney () Structured data challenges Rice Statistics 12 / 43
    • But, Python is not perfect For statistical computing, a chicken-and-egg problem Python’s plotting libraries are not designed for statistical graphics Built-in data structures are not especially optimized for my large data use cases Occasional semantic / syntactic niggles Wes McKinney () Structured data challenges Rice Statistics 13 / 43
    • Partial list of structured data necessities Table modification: column insertion/deletion/type changes Rich axis indexing, metadata Easy data alignment Aggregation and transformation by group (“group by”) Missing data (NA) handling Pivoting and reshaping Merging and joining Time series-specific manipulations Fast Input/Output: text files, databases, HDF5, ... Wes McKinney () Structured data challenges Rice Statistics 14 / 43
    • DataFrame, the pandas workhorse A 2D tabular data structure with row and column indexes Fast for row- and column-oriented operations Support heterogeneous columns WITHOUT sacrificing performance in the homogeneous (e.g. floating point only) case Wes McKinney () Structured data challenges Rice Statistics 15 / 43
    • DataFrame cname year agefrom ageto ls lsc pop ccode0 Australia 1950 15 19 64.3 15.4 558 AUS1 Australia 1950 20 24 48.4 26.4 645 AUS2 Australia 1950 25 29 47.9 26.2 681 AUS3 Australia 1950 30 34 44 23.8 614 AUS4 Australia 1950 35 39 42.1 21.9 625 AUS5 Australia 1950 40 44 38.9 20.1 555 AUS6 Australia 1950 45 49 34 16.9 491 AUS7 Australia 1950 50 54 29.6 14.6 439 AUS8 Australia 1950 55 59 28 12.9 408 AUS9 Australia 1950 60 64 26.3 12.1 356 AUS Wes McKinney () Structured data challenges Rice Statistics 16 / 43
    • Axis indexing and metadata Basic concept: labeled axes in use throughout the library Need to support Fast lookups (constant time) Data realignment / selection by labels (linear) Munging together irregularly indexed data Key innovation: index is a data structure itself. Different implementations can support more sophisticated indexing Axis labels can be any immutable Python object Wes McKinney () Structured data challenges Rice Statistics 17 / 43
    • Irregularly indexed data DataFrame Columns I N D E X Wes McKinney () Structured data challenges Rice Statistics 18 / 43
    • Axis indexing Axis Index d 0 a 1 b 2 c 3 e 4 Wes McKinney () Structured data challenges Rice Statistics 19 / 43
    • Why does this matter? Real world data is highly irregular, especially time series Operations between DataFrame objects automatically align on the indexes Nearly impossible to have errors due to misaligned data Can vastly facilitate munging unstructured data into structured form Grants immense freedom in writing research code Time series are just a special case of a general indexed data structure Wes McKinney () Structured data challenges Rice Statistics 20 / 43
    • Axis indexing "I have a brain!" year 1965 1970 1975 1980 1985 cname agefrom Australia 15 85.5 92.6 91.1 91.7 87.7 20 57.7 70.6 74.1 73.4 72 25 52.8 57.7 71.8 74 73.5 30 53.8 57.7 60.4 72.8 74.1Me too! 35 46.6 52.6 59.5 63.2 72.7 40 47.5 52.6 56.3 61.4 62.9 45 41.3 48.7 55.9 61.1 61.1 50 41.7 48.7 53.8 60.2 60 55 36.3 42.1 54.1 61 59.2 60 37.5 42.1 48.9 61.6 59.1 65 30.2 30.9 48.8 60.4 59.7 70 30.2 30.9 41.7 60.4 57.6 75 30.2 30.9 41.7 62.5 57.5 Wes McKinney () Structured data challenges Rice Statistics 21 / 43
    • Hierarchical indexing Basic idea: represent high dimensional data in a lower-dimensional structure that is easier to reason about Axis index with k levels of indexing Slice chunks of data in constant time! Provides a very natural way of implementing reshaping operations Advantage over a truly N-dimensional object: space-efficient dense representation if groups are unbalanced Extremely useful for econometric models on panel data Wes McKinney () Structured data challenges Rice Statistics 22 / 43
    • Hierarchical indexing agefrom 15 20 25 30 35 cname year Australia 1965 85.5 57.7 52.8 53.8 46.6 1970 92.6 70.6 57.7 57.7 52.6 1975 91.1 74.1 71.8 60.4 59.5 1980 91.7 73.4 74 72.8 63.2 1985 87.7 72 73.5 74.1 72.7 1990 80 66.4 72 73.5 74.1 1995 66.3 59.9 66.4 72 73.5 2000 73.4 38.2 59.9 66.4 72 2005 82.3 44.8 38.2 59.9 66.4 2010 78.4 41.5 44.8 38.2 59.9 Austria 1965 8.1 50.9 50.9 24.2 24.2 1970 13.3 61.4 50.9 50.9 39.6 1975 23.5 64.3 61.4 56.6 49.6 1980 23.8 73.9 64.3 61.4 61.4 1985 22.4 69.2 71.1 62.9 59.9 1990 27.4 72 68.9 68.3 61.5 1995 38.9 64.2 66.9 66.6 66.7 2000 41.4 73.2 64.2 63.1 64.5 2005 42.9 93.9 75 62.9 59.2 2010 45.4 93.5 92.4 73.2 62.7 Wes McKinney () Structured data challenges Rice Statistics 23 / 43
    • Joining and merging Join and merge-type operations are very easy to implement with indexing in place Multi-key join same code as aligning hierarchically-indexed DataFrames Will illustrate this with examples Wes McKinney () Structured data challenges Rice Statistics 24 / 43
    • Supporting size mutability In order to have good row-oriented performance, need to store like-typed columns in a single ndarray “Column” insertion: accumulate 1 × N × . . . homogeneous columns, later consolidate with other like-typed into a single block I.e. avoid reallocate-copy or array concatenation steps as long as possible Column deletions can be no-copy events (since ndarrays support views) Wes McKinney () Structured data challenges Rice Statistics 25 / 43
    • DataFrame, under the hood Actually You see 6 2 4 5 1 3 1 2 3 4 5 6 O F I B I F B F F O Wes McKinney () Structured data challenges Rice Statistics 26 / 43
    • Reshaping The fundamental operations stack: pivot level from columns to rows unstack: pivot level from rows to columns Completely natural and intuitive with hierarchical indexing No munging of column names necessary year 1965 1970 1975 1980 1985 1990 1995 2000 2005 2010 cname agefrom Australia 15 85.5 92.6 91.1 91.7 87.7 80 66.3 73.4 82.3 78.4 20 57.7 70.6 74.1 73.4 72 66.4 59.9 38.2 44.8 41.5 25 52.8 57.7 71.8 74 73.5 72 66.4 59.9 38.2 44.8 30 53.8 57.7 60.4 72.8 74.1 73.5 72 66.4 59.9 38.2 35 46.6 52.6 59.5 63.2 72.7 74.1 73.5 72 66.4 59.9 Austria 15 8.1 13.3 23.5 23.8 22.4 27.4 38.9 41.4 42.9 45.4 20 50.9 61.4 64.3 73.9 69.2 72 64.2 73.2 93.9 93.5 25 50.9 50.9 61.4 64.3 71.1 68.9 66.9 64.2 75 92.4 30 24.2 50.9 56.6 61.4 62.9 68.3 66.6 63.1 62.9 73.2 35 24.2 39.6 49.6 61.4 59.9 61.5 66.7 64.5 59.2 62.7 Wes McKinney () Structured data challenges Rice Statistics 27 / 43
    • ReshapingIn [5]: df.unstack(’agefrom’).stack(’year’) agefrom 15 20 25 30 35 cname year Australia 1965 85.5 57.7 52.8 53.8 46.6 1970 92.6 70.6 57.7 57.7 52.6 1975 91.1 74.1 71.8 60.4 59.5 1980 91.7 73.4 74 72.8 63.2 1985 87.7 72 73.5 74.1 72.7 1990 80 66.4 72 73.5 74.1 1995 66.3 59.9 66.4 72 73.5 2000 73.4 38.2 59.9 66.4 72 2005 82.3 44.8 38.2 59.9 66.4 2010 78.4 41.5 44.8 38.2 59.9 Austria 1965 8.1 50.9 50.9 24.2 24.2 1970 13.3 61.4 50.9 50.9 39.6 1975 23.5 64.3 61.4 56.6 49.6 1980 23.8 73.9 64.3 61.4 61.4 1985 22.4 69.2 71.1 62.9 59.9 1990 27.4 72 68.9 68.3 61.5 1995 38.9 64.2 66.9 66.6 66.7 2000 41.4 73.2 64.2 63.1 64.5 2005 42.9 93.9 75 62.9 59.2 2010 45.4 93.5 92.4 73.2 62.7 Wes McKinney () Structured data challenges Rice Statistics 28 / 43
    • Reshaping implementation nuances Must carefully deal with unbalanced group sizes / missing data I play vectorization tricks with the NumPy memory layout: no for loops! Care must be taken to handle heterogeneous and homogeneous data cases Wes McKinney () Structured data challenges Rice Statistics 29 / 43
    • GroupBy High level process split data set into groups apply function to each group (an aggregation or a transformation) combine results intelligently into a result data structure Can be used to emulate SQL GROUP BY operations Wes McKinney () Structured data challenges Rice Statistics 30 / 43
    • GroupBy Grouping closely related to indexing Create correspondence between axis labels and group labels using one of: Array of group labels (like a DataFrame column) Python function to be applied to each axis tick Can group by multiple keys For a hierarchically indexed axis, can select a level and group by that (or some transformation thereof) Wes McKinney () Structured data challenges Rice Statistics 31 / 43
    • Anatomy of GroupBy grouped = obj.groupby([key1, key2, key3]) This returns a GroupBy object Each of the keys could be any of: A Python function A vector A column name Wes McKinney () Structured data challenges Rice Statistics 32 / 43
    • Anatomy of GroupBy aggregate, transform, and the more general apply supported group_means = grouped.agg(np.mean) group_means2 = grouped.mean() demeaned = grouped.transform(lambda x: x - x.mean()) Wes McKinney () Structured data challenges Rice Statistics 33 / 43
    • Anatomy of GroupBy The GroupBy object is also iterable group_means = {} for group_name, group in grouped: group_means[group_name] = grouped.mean() Wes McKinney () Structured data challenges Rice Statistics 34 / 43
    • GroupBy and hierarchical indexing Hierarchical indexing came about as the natural result of a multi-key aggregation: >>> group_means = df.groupby([country, agefrom]).mean() >>> group_means[[ls, lsc, pop]].unstack(country) ls lsc pop country Australia Austria Australia Austria Australia Austria agefrom 15 70.03 31.1 26.17 14.67 6163 3310 20 58.02 59.98 34.51 45.83 1113 531 25 57.02 45.5 33.02 32.73 5021 2791 30 59.16 46.56 35.29 33.87 1082 527 35 59.58 43.29 34.53 30.3 1053 528.8 40 58.8 40.98 33.92 27.88 1005 522.5 45 56.79 39.19 31.71 26 927.2 503.5 50 54.71 37.14 30.37 23.94 836.6 475.2 55 51.93 34.82 26.98 22.11 735.8 443.2 60 49.92 32.35 25.85 20.2 632.6 408.8 65 47.02 29.59 22.98 18.44 522.5 361.9 70 46.27 28.52 22.53 17.72 410.5 295.8 75 46.28 28.06 23.41 18.42 624.5 437.6 Wes McKinney () Structured data challenges Rice Statistics 35 / 43
    • What makes GroupBy hard? factor-izing the group labels is very expensive Function call overhead on small groups To sort or not to sort? Cheaper than computing the group labels! Munging together results in exceptional cases is tricky Wes McKinney () Structured data challenges Rice Statistics 36 / 43
    • Time series operations Fixed frequency indexing (DateRange) Domain-specific time offsets (like business day, business month end) Frequency conversion Forward-filling / back-filling / interpolation Leading/lagging In the works (later this year), better/faster upsampling/downsampling Wes McKinney () Structured data challenges Rice Statistics 37 / 43
    • Shoot-out with R’s time series objects“Inner join” addition between two irregular 500K-length time seriessampled from 1 million-length time series. Timestamps are POSIXct (R) /int64 (Python) package timing factor pandas 21.5 ms 1.0 xts 41.3 ms 1.92 fts 370 ms 17.2 its 1117 ms 51.95 zoo 3479 ms 161.8 Where there is smoke, there is fire? Iron: Macbook Pro Core i7 with R 2.13.1, pandas 0.5.1git / Python 2.7.2 Wes McKinney () Structured data challenges Rice Statistics 38 / 43
    • Erm, inner join? Intersecting time stamps loses information Wes McKinney () Structured data challenges Rice Statistics 39 / 43
    • My (mild) performance obsession Wes McKinney () Structured data challenges Rice Statistics 40 / 43
    • My (mild) performance obsession Good performance stems from many factors Well-designed algorithms (from a complexity standpoint) Minimizing copying of data Minimizing interpreter / function call overhead Taking advantage of memory layout I value functionality over performance, but I do spend a lot of time profiling code and grokking performance characteristics Wes McKinney () Structured data challenges Rice Statistics 41 / 43
    • Demo, time permittingWes McKinney () Structured data challenges Rice Statistics 42 / 43