- 1. Financial data analysis in Python with pandas
  - Wes McKinney (@wesmckinn)
  - 10/17/2011
- 2. My background
  - 3 years as a quant hacker at AQR, now consultant / entrepreneur
  - Math and statistics background with the zest of computer science
  - Active in scientific Python community
  - My blog: http://blog.wesmckinney.com
  - Twitter: @wesmckinn
- 3. Bare essentials for financial research
  - Fast time series functionality
    - Easy data alignment
    - Date/time handling
    - Moving window statistics
    - Resampling / frequency conversion
  - Fast data access (SQL databases, flat files, etc.)
  - Data visualization (plotting)
  - Statistical models
    - Linear regression
    - Time series models: ARMA, VAR, ...
- 4. Would be nice to have
  - Portfolio and risk analytics, backtesting
    - Easy enough to write yourself, though most people do a bad job of it
  - Portfolio optimization
    - Most financial firms use a 3rd party library anyway
  - Derivative pricing
    - Can use QuantLib in most languages
- 5. What are financial firms using?
  - HFT: a C++ and hardware arms race, a different topic
  - Research
    - Mainstream: R, MATLAB, Python, ...
    - Econometrics: Stata, EViews, RATS, etc.
    - Non-programmatic environments: ClariFI, Palantir, ...
  - Production
    - Popular: Java, C#, C++
    - Less popular, but growing: Python
    - Fringe: functional languages (OCaml, Haskell, F#)
- 6. What are financial firms using? (continued)
  - Many hybrid-language environments (e.g. Java/R, C++/R, C++/MATLAB, Python/C++)
    - Which is the main implementation language?
    - If the main language is Java/C++, the result is lower productivity and a higher cost of prototyping new functionality
  - Trends
    - Banks and hedge funds are realizing that Java-based production systems can be replaced with 20% as much Python code (or less)
    - MATLAB is being increasingly ditched in favor of Python
    - R and Python use for research is generally growing
- 7. Python language
  - Simple, expressive syntax
  - Designed for readability, like “runnable pseudocode”
  - Easy-to-use, powerful built-in types and data structures:
    - Lists and tuples (fixed-size, immutable lists)
    - Dicts (hash maps / associative arrays) and sets
  - Everything’s an object, including functions
  - “There should be one, and preferably only one, obvious way to do it”
  - “Batteries included”: great general purpose standard library
- 8. A simple example: quicksort
  - Pseudocode from Wikipedia:

  ```
  function qsort(array)
      if length(array) < 2
          return array
      var list less, greater
      select and remove a pivot value pivot from array
      for each x in array
          if x < pivot then append x to less
          else append x to greater
      return concat(qsort(less), pivot, qsort(greater))
  ```

- 9. A simple example: quicksort
  - First try Python implementation:

  ```python
  def qsort(array):
      if len(array) < 2:
          return array
      less, greater = [], []
      pivot, rest = array[0], array[1:]
      for x in rest:
          if x < pivot:
              less.append(x)
          else:
              greater.append(x)
      return qsort(less) + [pivot] + qsort(greater)
  ```

- 10. A simple example: quicksort
  - Use list comprehensions:

  ```python
  def qsort(array):
      if len(array) < 2:
          return array
      pivot, rest = array[0], array[1:]
      less = [x for x in rest if x < pivot]
      greater = [x for x in rest if x >= pivot]
      return qsort(less) + [pivot] + qsort(greater)
  ```

- 11. A simple example: quicksort
  - Heck, fit it onto one line!

  ```python
  qs = lambda r: (r if len(r) < 2 else
                  (qs([x for x in r[1:] if x < r[0]])
                   + [r[0]]
                   + qs([x for x in r[1:] if x >= r[0]])))
  ```

  - Though that’s starting to look like Lisp code...
- 12. A simple example: quicksort
  - A quicksort using NumPy arrays:

  ```python
  import numpy as np

  def qsort(array):
      if len(array) < 2:
          return array
      pivot, rest = array[0], array[1:]
      # Boolean-mask indexing selects the elements on each side of the pivot
      less = rest[rest < pivot]
      greater = rest[rest >= pivot]
      # np.r_ concatenates the sorted pieces into one array
      return np.r_[qsort(less), [pivot], qsort(greater)]
  ```

  - Of course no need for this when you can just do:

  ```python
  sorted_array = np.sort(array)
  ```
- 13. Python: drunk with power
  - This comic has way too much airtime, but: [comic image]
- 14. Staples of Python for science: MINS
  - (M) matplotlib: plotting and data visualization
  - (I) IPython: rich interactive computing and development environment
  - (N) NumPy: multi-dimensional arrays, linear algebra, FFTs, random number generation, etc.
  - (S) SciPy: optimization, probability distributions, signal processing, ODEs, sparse matrices, ...
- 15. Why did Python become popular in science?
  - NumPy traces its roots to 1995
  - Extremely easy to integrate C/C++/Fortran code
    - Access fast low level algorithms in a high level, interpreted language
  - The language itself
    - “It fits in your head”
    - “It [Python] doesn’t get in my way” - Robert Kern
  - Python is good at all the things other scientific programming languages are not good at (e.g. networking, string processing, OOP)
  - Liberal BSD license: can use Python for commercial applications
- 16. Some exciting stuff in the last few years
  - Cython
    - “Augmented” Python language with type declarations, for generating compiled extensions
    - C-like speedups with Python-like development time
  - IPython: enhanced interactive Python interpreter
    - The best research and software development env for Python
    - An integrated parallel / distributed computing backend
    - GUI console with inline plotting and a rich HTML notebook (more on this later)
  - PyCUDA / PyOpenCL: GPU computing in Python
    - Transformed Python overnight into one of the best languages for doing GPU computing
- 17. Where has Python historically been weak?
  - Rich data structures for data analysis and statistics
    - NumPy arrays, while powerful, feel distinctly “lower level” if you’re used to R’s data.frame
    - pandas has filled this gap over the last 2 years
  - Statistics libraries
    - Nowhere near the depth of R’s CRAN repository
    - statsmodels provides tested implementations of a lot of standard regression and time series models
    - Turns out that most financial data analysis requires only fairly elementary statistical models
- 18. pandas library
  - Began building at AQR in 2008, open-sourced late 2009
  - Why
    - R / MATLAB, while good for research / data analysis, are not suitable implementation languages for large-scale production systems (I personally don’t care for them for data analysis)
    - Existing data structures for time series in R / MATLAB were too limited / not flexible enough for my needs
  - Core idea: indexed data structures capable of storing heterogeneous data
  - Etymology: panel data structures
- 19. pandas in a nutshell
  - A clean axis indexing design to support fast data alignment, lookups, hierarchical indexing, and more
  - High-performance data structures
    - Series/TimeSeries: 1D labeled vector
    - DataFrame: 2D spreadsheet-like structure
    - Panel: 3D labeled array, collection of DataFrames
  - SQL-like functionality: GroupBy, joining/merging, etc.
  - Missing data handling
  - Time series functionality
- 20. pandas design philosophy
  - “Think outside the matrix”: stop thinking about shape and start thinking about indexes
  - Indexing and data alignment are essential
  - Fault-tolerance: save you from common blunders caused by coding errors (specifically misaligned data)
  - Lift the best features of other data analysis environments (R, MATLAB, Stata, etc.) and make them better, faster
  - Performance and usability equally important
- 21. The pandas killer feature: indexing
  - Each axis has an index
  - Automatic alignment between differently-indexed objects: makes it nearly impossible to accidentally combine misaligned data
  - Hierarchical indexing provides an intuitive way of structuring and working with higher-dimensional data
  - Natural way of expressing “group by” and join-type operations
  - Better integrated and more flexible indexing than anything available in R or MATLAB
- 22. Tutorial time
  - To the IPython console!
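
The live tutorial is not part of this transcript. As a stand-in, here is a minimal sketch of the automatic alignment and hierarchical indexing described on slide 21. It is written against the present-day pandas API rather than the 2011 release, and the tickers, dates, and values are made up for illustration.

```python
import pandas as pd

# Two series keyed by ticker, in different orders and only partly
# overlapping. The tickers and numbers are hypothetical.
prices_a = pd.Series([10.0, 20.0, 30.0], index=["AAA", "BBB", "CCC"])
prices_b = pd.Series([1.0, 2.0, 3.0], index=["CCC", "AAA", "DDD"])

# Arithmetic aligns on index labels, not on position. Labels present
# in only one operand produce NaN instead of a silently wrong number.
total = prices_a + prices_b

# Hierarchical index: one row per (date, ticker) pair.
idx = pd.MultiIndex.from_product(
    [["2011-10-14", "2011-10-17"], ["AAA", "BBB"]],
    names=["date", "ticker"],
)
returns = pd.Series([0.01, -0.02, 0.03, 0.00], index=idx)

# Pivot the inner level into columns: a dates-by-tickers DataFrame.
wide = returns.unstack("ticker")

# "Group by" expressed directly in terms of an index level.
mean_by_ticker = returns.groupby(level="ticker").mean()

print(total)
print(wide)
print(mean_by_ticker)
```

Because the result of `prices_a + prices_b` carries the union of both indexes, a mismatch between the inputs shows up as missing values that can be inspected, which is the fault-tolerance argument made on slide 20.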
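
The SQL-like functionality and missing data handling listed on slide 19 could look roughly like the following. This is a sketch using the current pandas API, not code from the talk; the `trades` and `sectors` tables and their column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical trade blotter with one missing price.
trades = pd.DataFrame({
    "ticker": ["AAA", "BBB", "AAA", "CCC"],
    "shares": [100, 50, 200, 10],
    "price": [10.0, 20.0, np.nan, 5.0],
})

# Hypothetical reference table that does not cover every ticker.
sectors = pd.DataFrame({
    "ticker": ["AAA", "BBB"],
    "sector": ["Tech", "Energy"],
})

# GroupBy: total shares traded per ticker.
shares_by_ticker = trades.groupby("ticker")["shares"].sum()

# Join: a left merge keeps every trade, leaving the sector missing
# where the reference table has no matching ticker.
merged = trades.merge(sectors, on="ticker", how="left")

# Missing data: count the gaps, then fill them with the column mean
# (mean() skips NaN by default).
n_missing = merged["price"].isna().sum()
filled = merged["price"].fillna(merged["price"].mean())

print(shares_by_ticker)
print(merged)
print(filled)
```

Filling with the mean is only one possible policy, chosen here to keep the example short. Dropping the rows or forward-filling are equally plausible depending on the data.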
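
Slide 3 lists moving window statistics and resampling / frequency conversion among the essentials. A minimal sketch of both follows, assuming the current pandas API (`rolling` and `resample` methods, which differ from the function names available in 2011) and an invented series of closing prices.

```python
import pandas as pd

# Ten business days starting Monday 2011-10-03, with made-up closes.
dates = pd.date_range("2011-10-03", periods=10, freq="B")
close = pd.Series(
    [100.0, 101.0, 103.0, 102.0, 105.0, 107.0, 106.0, 108.0, 110.0, 109.0],
    index=dates,
)

# Simple daily returns; the first value is NaN by construction.
returns = close.pct_change()

# Moving window statistic: 3-day moving average. The first two
# entries are NaN because the window is not yet full.
ma3 = close.rolling(window=3).mean()

# Frequency conversion: daily to weekly, taking the last close of
# each week ending on Friday.
weekly_last = close.resample("W-FRI").last()

print(returns)
print(ma3)
print(weekly_last)
```

The date index drives both operations: the rolling window walks the ordered observations, while `resample` bins them by calendar week.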
