Scipy 2011 Time Series Analysis in Python


  1. Time Series Analysis in Python with statsmodels. Wes McKinney (Department of Statistical Science, Duke University), Josef Perktold (Department of Economics, University of North Carolina at Chapel Hill), Skipper Seabold (Department of Economics, American University). 10th Python in Science Conference, 13 July 2011.
      Slide footer: McKinney, Perktold, Seabold (statsmodels), Python Time Series Analysis, SciPy Conference 2011
  2. What is statsmodels? A library for statistical modeling, implementing standard statistical models in Python using NumPy and SciPy. Includes: linear (regression) models of many forms; descriptive statistics; statistical tests; time series analysis; ...and much more.
  3. What is Time Series Analysis? Statistical modeling of time-ordered data observations. Inferring structure, forecasting and simulation, and testing distributional assumptions about the data. Modeling dynamic relationships among multiple time series. Broad applications, e.g. in economics, finance, neuroscience, signal processing...
  4. Talk Overview: brief update on statsmodels development; aside: user interface and data structures; descriptive statistics and tests; autoregressive moving average (ARMA) models; vector autoregression (VAR) models; filtering tools (Hodrick-Prescott and others); near future: Bayesian dynamic linear models (DLMs), ARCH / GARCH volatility models, and beyond.
  5. Statsmodels development update. We’re now on GitHub! Join us: http://github.com/statsmodels/statsmodels . Check out the slick Sphinx docs: http://statsmodels.sourceforge.net . Development focus has been largely computational, i.e. writing correct, tested implementations of all the common classes of statistical models.
  6. Statsmodels development update. Major work remains to be done on providing a nice integrated user interface. We must work together to close the gap between R and Python! Some important areas: a formula framework for specifying model design matrices; integrated rich statistical data structures (pandas); data visualization of results should always be a few keystrokes away; write a “Statsmodels for R users” guide.
  7. Aside: statistical data structures and user interface. While I have a captive audience... Controversial fact: pandas is the only Python library currently providing data structures matching (and in many places exceeding) the richness of R’s data structures (for statistics). Let’s have a BoF session so I can justify this statement. Feedback I hear is that end users find the fragmented, incohesive set of Python tools for data analysis and statistics to be confusing, frustrating, and certainly not compelling them to use Python... (Not to mention the packaging headaches.)
  8. Aside: statistical data structures and user interface. We need to “commit” ASAP (not 12 months from now) to one or more high-level data structures as the “primary data structure(s) for statistical data analysis” and communicate that clearly to end users. Or we might as well all start programming in R...
  9. Example data: EEG trace data. [Figure: time series plot of the EEG trace.]
  10. Example data: Macroeconomic data. [Figure: three stacked panels showing cpi, m1, and realgdp, with years from the 1960s to the 2000s on the horizontal axis.]
  11. Example data: Stock data. [Figure: price series for AAPL, GOOG, MSFT, and YHOO, with years in the 2000s on the horizontal axis.]
  12. Descriptive statistics. Autocorrelation and partial autocorrelation plots. Commonly used for identification in ARMA(p,q) and ARIMA(p,d,q) models.
      acf = tsa.acf(eeg, 50)
      pacf = tsa.pacf(eeg, 50)
      [Figure: two panels, Autocorrelation and Partial Autocorrelation, for lags 0 to 50.]
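
As a rough illustration of what an autocorrelation function computes, here is a dependency-free sketch of the sample autocorrelation using the biased (1/n) autocovariance estimator. The function name `sample_acf` and the toy series are assumptions made for this example; this is not the statsmodels implementation, which may differ in options and defaults.

```python
def sample_acf(x, nlags):
    """Sample autocorrelations r_0..r_nlags (hypothetical helper, biased 1/n estimator)."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    # Lag-0 autocovariance, i.e. the sample variance with a 1/n divisor.
    c0 = sum(d * d for d in dev) / n
    acf = []
    for k in range(nlags + 1):
        # Lag-k autocovariance: products of deviations k steps apart.
        ck = sum(dev[t] * dev[t + k] for t in range(n - k)) / n
        acf.append(ck / c0)
    return acf


# Toy series alternating in sign, so neighbouring points are negatively correlated.
series = [1.0, -1.0] * 10
r = sample_acf(series, 3)
```

Slowly decaying autocorrelations usually point toward an AR component, while a sharp cutoff after a few lags is more typical of an MA component, which is why these plots are used for identification.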
  13. Statistical tests. Ljung-Box test for zero autocorrelation. Unit root test for cointegration (Augmented Dickey-Fuller test). Granger-causality. Whiteness (iid-ness) and normality. See our conference paper (when the proceedings get published!).
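
To make the first test on this slide concrete, the following sketch computes the Ljung-Box statistic Q = n(n+2) Σ_{k=1..h} r_k² / (n−k) from the sample autocorrelations r_k. The helper name `ljung_box_q` and the toy data are assumptions for illustration; the p-value step is left out and would come from a chi-squared distribution.

```python
def ljung_box_q(x, h):
    """Ljung-Box Q statistic over lags 1..h (hypothetical helper, no p-value)."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)  # unnormalized lag-0 autocovariance
    total = 0.0
    for k in range(1, h + 1):
        # Sample autocorrelation at lag k (the 1/n factors cancel).
        rk = sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
        total += rk * rk / (n - k)
    return n * (n + 2) * total


# Toy series with strong negative lag-1 autocorrelation.
series = [1.0, -1.0] * 10
q = ljung_box_q(series, 1)
# Under the null of no autocorrelation, Q is compared with a chi-squared
# critical value with h degrees of freedom (about 3.84 at the 5% level for h = 1).
```

A large Q relative to the critical value is evidence against the null hypothesis of zero autocorrelation up to lag h.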
  14. Autoregressive moving average (ARMA) models. One of the most common univariate time series models:
      $y_t = \mu + a_1 y_{t-1} + \dots + a_p y_{t-p} + \epsilon_t + b_1 \epsilon_{t-1} + \dots + b_q \epsilon_{t-q}$,
      where $E(\epsilon_t \epsilon_s) = 0$ for $t \neq s$ and $\epsilon_t \sim N(0, \sigma^2)$.
      The exact log-likelihood can be evaluated via the Kalman filter, but the “conditional” likelihood is easier and commonly used. statsmodels has tools for simulating ARMA processes with known coefficients $a_i$, $b_i$ and also for estimation given specified lag orders.
      import scikits.statsmodels.tsa.arima_process as ap
      # lag-polynomial coefficients, including the zero-lag term
      ar_coef = [1, .75, -.25]; ma_coef = [1, -.5]
      nobs = 100
      y = ap.arma_generate_sample(ar_coef, ma_coef, nobs)
      y += 4  # add in constant
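
The simulation step can be sketched without any library. The version below assumes the lag-polynomial convention, in which both coefficient lists start with the zero-lag 1 and the AR list carries the negated coefficients of the equation (so $a_1 = 0.75$, $a_2 = -0.25$ becomes `[1, -0.75, 0.25]`). The function `arma_sample`, its `burnin` and `seed` parameters, and the coefficient values are assumptions chosen for this illustration.

```python
import random


def arma_sample(ar_poly, ma_poly, nobs, sigma=1.0, burnin=200, seed=0):
    """Simulate ar_poly(L) y_t = ma_poly(L) eps_t (hypothetical helper).

    Both polynomials include the zero-lag coefficient, assumed to be 1.
    """
    rng = random.Random(seed)
    total = nobs + burnin
    eps = [rng.gauss(0.0, sigma) for _ in range(total)]
    y = []
    for t in range(total):
        # Moving-average part: current and lagged shocks.
        value = sum(ma_poly[j] * eps[t - j]
                    for j in range(len(ma_poly)) if t - j >= 0)
        # Autoregressive part: move the lagged y terms to the right-hand side.
        value -= sum(ar_poly[i] * y[t - i]
                     for i in range(1, len(ar_poly)) if t - i >= 0)
        y.append(value)
    # Drop the burn-in so the start-up transient does not affect the sample.
    return y[burnin:]


# Equation coefficients a_1 = 0.75, a_2 = -0.25, b_1 = -0.5 (illustrative values).
y = [4.0 + v for v in arma_sample([1, -0.75, 0.25], [1, -0.5], nobs=100)]
```

With this convention, the roots of the AR polynomial 1 − 0.75z + 0.25z² have modulus 2, so the simulated process is stationary; adding 4 afterwards simply shifts its mean.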
  15. ARMA Estimation. Several likelihood-based estimators are implemented (see docs).
      model = tsa.ARMA(y)
      result = model.fit(order=(2, 1), trend='c', method='css-mle', disp=-1)
      result.params  # array([ 3.97, -0.97, -0.05, -0.13])
      Standard model diagnostics, standard errors, information criteria (AIC, BIC, ...), etc. are available in the returned ARMAResults object.
  16. Vector Autoregression (VAR) models. Widely used model for modeling multiple ($K$-variate) time series, especially in macroeconomics:
      $Y_t = A_1 Y_{t-1} + \dots + A_p Y_{t-p} + \epsilon_t, \quad \epsilon_t \sim N(0, \Sigma)$
      Matrices $A_i$ are $K \times K$. $Y_t$ must be a stationary process (sometimes achieved by differencing). A related class of models (VECM) is used for modeling nonstationary (including cointegrated) processes.
  17. Vector Autoregression (VAR) models.
      >>> model = VAR(data); model.select_order(8)
                       VAR Order Selection
      =====================================================
                 aic          bic          fpe         hqic
      -----------------------------------------------------
      0       -27.83       -27.78    8.214e-13       -27.81
      1       -28.77       -28.57    3.189e-13       -28.69
      2       -29.00      -28.64*    2.556e-13       -28.85
      3       -29.10       -28.60    2.304e-13      -28.90*
      4       -29.09       -28.43    2.330e-13       -28.82
      5       -29.13       -28.33    2.228e-13       -28.81
      6      -29.14*       -28.18   2.213e-13*       -28.75
      7       -29.07       -27.96    2.387e-13       -28.62
      =====================================================
      * Minimum
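
For readers unfamiliar with the column headings, the criteria in a table like this are commonly defined as below. This is a sketch of the standard textbook forms, not a statement of the exact formulas used to produce the numbers above. The symbols are assumptions introduced here: $T$ is the number of usable observations, $K$ the number of variables, $\hat\Sigma_p$ the residual covariance estimate of the order-$p$ model, and $m_p$ the number of freely estimated coefficients (for example $pK^2$ for the lag matrices, plus $K$ if each equation has a constant).

```latex
\begin{aligned}
\mathrm{AIC}(p)  &= \ln\lvert\hat\Sigma_p\rvert + \frac{2}{T}\, m_p \\
\mathrm{BIC}(p)  &= \ln\lvert\hat\Sigma_p\rvert + \frac{\ln T}{T}\, m_p \\
\mathrm{HQIC}(p) &= \ln\lvert\hat\Sigma_p\rvert + \frac{2 \ln\ln T}{T}\, m_p \\
\mathrm{FPE}(p)  &= \left(\frac{T + Kp + 1}{T - Kp - 1}\right)^{K} \lvert\hat\Sigma_p\rvert
\end{aligned}
```

Each criterion trades goodness of fit (the log-determinant term) against model size. BIC penalizes extra lags most heavily, which is consistent with it selecting the shortest lag order in the table.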
  18. Vector Autoregression (VAR) models.
      >>> result = model.fit(2)
      >>> result.summary()  # print summary for each variable
      <snip>
      Results for equation m1
      ====================================================
                  coefficient   std. error   t-stat   prob
      ----------------------------------------------------
      const          0.004968     0.001850    2.685  0.008
      L1.m1          0.363636     0.071307    5.100  0.000
      L1.realgdp    -0.077460     0.092975   -0.833  0.406
      L1.cpi        -0.052387     0.128161   -0.409  0.683
      L2.m1          0.250589     0.072050    3.478  0.001
      L2.realgdp    -0.085874     0.092032   -0.933  0.352
      L2.cpi         0.169803     0.128376    1.323  0.188
      ====================================================
      <snip>
  19. Vector Autoregression (VAR) models.
      >>> result = model.fit(2)
      >>> result.summary()  # print summary for each variable
      <snip>
      Correlation matrix of residuals
                       m1    realgdp        cpi
      m1         1.000000  -0.055690  -0.297494
      realgdp   -0.055690   1.000000   0.115597
      cpi       -0.297494   0.115597   1.000000
  20. VAR: Impulse Response analysis. Analyze the systematic impact of a unit “shock” to a single variable.
      irf = result.irf(10)
      irf.plot()
      [Figure: Impulse responses, a 3 x 3 grid of panels (m1 → m1, realgdp → m1, cpi → m1; m1 → realgdp, realgdp → realgdp, cpi → realgdp; m1 → cpi, realgdp → cpi, cpi → cpi) over periods 0 to 10.]
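
The idea behind an impulse response can be sketched with the moving-average recursion Φ_0 = I, Φ_i = Σ_{j=1..min(i,p)} Φ_{i−j} A_j, where entry (r, c) of Φ_i is the response of variable r, i periods after a unit shock to variable c. The code below is a minimal, non-orthogonalized sketch using plain lists; the helper names and the 2 x 2 coefficient matrices are made up for illustration and are not taken from the fitted model in the talk.

```python
def mat_mul(a, b):
    """Product of two small matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def mat_add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def impulse_responses(coefs, periods):
    """Return [Phi_0, ..., Phi_periods] for VAR lag matrices coefs = [A_1, ..., A_p]."""
    k = len(coefs[0])
    identity = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    phis = [identity]
    for i in range(1, periods + 1):
        acc = [[0.0] * k for _ in range(k)]
        # Phi_i = sum over lags j of Phi_{i-j} A_j.
        for j in range(1, min(i, len(coefs)) + 1):
            acc = mat_add(acc, mat_mul(phis[i - j], coefs[j - 1]))
        phis.append(acc)
    return phis


# Hypothetical VAR(2) coefficient matrices for two variables.
a1 = [[0.5, 0.1], [0.0, 0.3]]
a2 = [[0.1, 0.0], [0.0, 0.1]]
phis = impulse_responses([a1, a2], 10)
```

For a stationary VAR the entries of Φ_i die out as i grows, which corresponds to the responses decaying toward zero over time.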
  21. VAR: Forecast Error Variance Decomposition. Analyze the contribution of each variable to the forecasting error.
      fevd = result.fevd(20)
      fevd.plot()
      [Figure: Forecast error variance decomposition (FEVD), one panel each for m1, realgdp, and cpi, showing the shares attributed to m1, realgdp, and cpi over horizons 0 to 20.]
  22. VAR: Statistical tests.
      In [137]: result.test_causality('m1', ['cpi', 'realgdp'])
      Granger causality f-test
      =========================================================
      Test statistic   Critical Value          p-value         df
      ---------------------------------------------------------
            1.248787         2.387325            0.289   (4, 579)
      =========================================================
      H_0: ['cpi', 'realgdp'] do not Granger-cause m1
      Conclusion: fail to reject H_0 at 5.00% significance level
  23. Filtering. The Hodrick-Prescott (HP) filter separates a time series $y_t$ into a trend $\tau_t$ and a cyclical component $\zeta_t$, so that $y_t = \tau_t + \zeta_t$. [Figure: Inflation with its cyclical component and trend component, 1960s to 2000s.]
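
One common way to compute the HP decomposition is to choose the trend that minimizes Σ (y_t − τ_t)² + λ Σ (τ_{t+1} − 2τ_t + τ_{t−1})², which leads to the linear system (I + λ D'D) τ = y with D the second-difference matrix. The sketch below solves that system with dense Gaussian elimination, which is only sensible for short series. The names `solve` and `hp_filter`, the test data, and the default λ = 1600 (the conventional choice for quarterly data) are assumptions for this illustration, not the library's implementation.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def hp_filter(y, lamb=1600.0):
    """Return (cycle, trend) for the HP filter with smoothing parameter lamb."""
    n = len(y)
    # Start from the identity and add lamb * D'D one second-difference row at a time.
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for r in range(n - 2):
        row = {r: 1.0, r + 1: -2.0, r + 2: 1.0}
        for i, di in row.items():
            for j, dj in row.items():
                a[i][j] += lamb * di * dj
    trend = solve(a, list(y))
    cycle = [yi - ti for yi, ti in zip(y, trend)]
    return cycle, trend


# A perfectly linear series has no curvature, so the trend reproduces it exactly.
line = [2.0 * t + 1.0 for t in range(12)]
cycle, trend = hp_filter(line)
```

Larger values of λ penalize curvature more strongly and give a smoother trend; as λ approaches zero the trend simply follows the data.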
  24. Filtering. In addition to the HP filter, 2 other filters popular in finance and economics, Baxter-King and Christiano-Fitzgerald, are available. We refer you to our paper and the documentation for details on these. [Figure: two panels, “Inflation and Unemployment: BK Filtered” and “Inflation and Unemployment: CF Filtered”, each showing the INFL and UNEMP series.]
  25. Preview: Bayesian dynamic linear models (DLM). A state space model by another name:
      $y_t = F_t \theta_t + \nu_t, \quad \nu_t \sim N(0, V_t)$
      $\theta_t = G \theta_{t-1} + \omega_t, \quad \omega_t \sim N(0, W_t)$
      Estimation of the basic model is by Kalman filter recursions. Provides an elegant way to do time-varying linear regressions for forecasting. Extensions: multivariate DLMs, stochastic volatility (SV) models, MCMC-based posterior sampling, mixtures of DLMs.
  26. Preview: DLM Example (Constant+Trend model).
      model = Polynomial(2)
      dlm = DLM(close_px['AAPL'], model.F, G=model.G,  # model
                m0=m0, C0=C0, n0=n0, s0=s0,            # priors
                state_discount=.95)                    # discount factor
      [Figure: Constant + Trend DLM, with a monthly axis running from Nov 2008 to Nov 2009.]
  27. Preview: Stochastic volatility models. [Figure: JPY-USD Exchange Rate Volatility Process, observations 0 to 1000.]
  28. Future: sandbox and beyond. ARCH / GARCH models for volatility. Structural VAR and error correction models (ECM) for cointegrated processes. Models with non-normally distributed errors. Better data description, visualization, and interactive research tools. More sophisticated Bayesian time series models.
  29. Conclusions. We’ve implemented many foundational models for time series analysis, but the field is very broad. The user interface can and should be much improved. Repo: http://github.com/statsmodels/statsmodels Docs: http://statsmodels.sourceforge.net Contact: pystatsmodels@googlegroups.com
