- 1. Analysis of Time-series Data Generalized Additive Model Jinseob Kim July 17, 2015 Jinseob Kim Analysis of Time-series Data July 17, 2015 1 / 45
- 2. Contents
    1. Non-linear Issues: Distribution of Y; Estimate of Beta
    2. GAM Theory: Various Spline; Model selection
    3. Descriptive Analysis of Time-series data: Time series plot
    4. Analysis using GAM
- 3. Objective
    1. Know the types of non-linear regression.
    2. Understand the concept of the additive model and of splines.
    3. Be able to explore time-series data.
    4. Be able to run an analysis with the R package mgcv.
- 4. Non-linear Issues
- 5. Non-linear Issues / Distribution of Y: Count data
    - Number of cases or deaths per day, week, or month.
    - The focus is on trends in the population: a policymaker's viewpoint.
    - The question: how likely is onset or death in the population?
    - Candidate distributions: normal, Poisson, and others (quasipoisson, Gamma, negative binomial, ZIP, ZINB, ...).
    - The choice is very important: it changes the p-values.
- 6. Non-linear Issues / Distribution of Y: Compare Distribution
    - http://resources.esri.com/help/9.3/arcgisdesktop/com/gp_toolref/process_simulations_sensitivity_analysis_and_error_analysis_modeling/distributions_for_assigning_random_values.htm
- 7. Non-linear Issues / Distribution of Y: Basic guidelines
    - Common disease: consider the normal distribution; the analysis becomes easier.
    - Rare disease: Poisson.
    - Mean < variance (overdispersion)? → quasipoisson.
    - The remaining distributions are rarely used.
- 8. Non-linear Issues / Distribution of Y: Poisson vs. quasipoisson
    - Poisson: $E(Y_i) = \mu_i$, $\mathrm{Var}(Y_i) = \mu_i$
    - quasipoisson: $E(Y_i) = \mu_i$, $\mathrm{Var}(Y_i) = \phi \times \mu_i$
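The practical effect of $\phi$ is on the standard errors, not on the point estimates. A minimal sketch in R, assuming simulated overdispersed counts (the data and all variable names are made up for illustration and are not from the slides):

```r
set.seed(1)
# Simulated counts with extra-Poisson variation (negative binomial draws)
n <- 500
x <- rnorm(n)
y <- rnbinom(n, mu = exp(0.5 + 0.3 * x), size = 2)

fit_pois  <- glm(y ~ x, family = poisson)
fit_quasi <- glm(y ~ x, family = quasipoisson)

# Point estimates are identical; only the standard errors differ
phi      <- summary(fit_quasi)$dispersion                    # estimate of phi
se_pois  <- summary(fit_pois)$coefficients[, "Std. Error"]
se_quasi <- summary(fit_quasi)$coefficients[, "Std. Error"]
print(phi)
print(se_quasi / se_pois)   # ratio of standard errors, equal to sqrt(phi)
```

With overdispersed data the quasipoisson standard errors are inflated by $\sqrt{\phi}$, which is why the p-values change.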
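- 9. Non-linear Issues / Estimate of Beta: Meaning of Beta
    - The meaning of Beta changes with the distribution.
    - Normal: Beta describes a linear relationship with Y itself.
    - Binomial: Beta is log(OR); the relationship is linear on the logit scale.
    - Poisson: Beta is log(RR); the relationship is linear on the log scale.
    - Either way, all of these are treated as linear relationships.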
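Written out for a single covariate, the three cases look as follows. This is standard GLM notation rather than anything shown on the slide; the symbols $p_i$ (a probability) and $\mu_i$ (an expected count) are introduced here for illustration:

```latex
\begin{aligned}
\text{Normal:}   &\quad E(Y_i) = \beta_0 + \beta_1 x_i \\
\text{Binomial:} &\quad \log\frac{p_i}{1 - p_i} = \beta_0 + \beta_1 x_i,
                  \qquad e^{\beta_1} = \text{OR per unit increase in } x \\
\text{Poisson:}  &\quad \log \mu_i = \beta_0 + \beta_1 x_i,
                  \qquad e^{\beta_1} = \text{RR per unit increase in } x
\end{aligned}
```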
- 10. Non-linear Issues / Estimate of Beta: Non-linear
    - A linear relationship is easy to interpret, but is it really true?
    - Climate and pollutant effects may not be exactly linear.
    - U shape, threshold, etc.
- 11. GAM Theory
- 12. GAM Theory / Various Spline: Additive Model
    - Linear model: $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \epsilon$ (1)
    - Additive model: $Y = \beta_0 + f(x_1) + \beta_2 x_2 + \cdots + \epsilon$ (2)
    - Terms of the form $f(x_1, x_2)$ are also possible but are not covered in this session.
- 13. GAM Theory / Various Spline: Determine f
    - Types: Loess, (natural) cubic spline, smoothing spline.
    - The details differ, but in practice the results are nearly the same.
- 14. GAM Theory / Various Spline: Loess (locally weighted scatterplot smoothing)
- 15. GAM Theory / Various Spline: Example: Loess
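The figure from this slide is not recoverable, so here is a minimal sketch of a loess fit in base R. The data are simulated and the two span values are arbitrary choices for illustration:

```r
set.seed(2)
# Hypothetical noisy U-shaped relationship
x <- seq(0, 10, length.out = 200)
y <- (x - 5)^2 / 5 + rnorm(200)

# span = fraction of the data used in each local fit
fit_wiggly <- loess(y ~ x, span = 0.2)
fit_smooth <- loess(y ~ x, span = 0.8)

plot(x, y, col = "grey")
lines(x, fitted(fit_wiggly), col = "red")    # small span: follows the noise
lines(x, fitted(fit_smooth), col = "blue")   # large span: smoother curve

# Equivalent number of parameters used by each fit
print(c(wiggly = fit_wiggly$enp, smooth = fit_smooth$enp))
```

The span has to be chosen by the analyst, which is the contrast with penalized splines drawn later in the deck.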
- 16. GAM Theory / Various Spline: Cubic spline
    - Cubic = third-degree polynomial.
    - Split the range of x into several intervals; the break points are the knots.
    - Model each interval with a cubic polynomial.
    - Require the pieces to join smoothly at the knots.
- 17. GAM Theory / Various Spline: Example: Cubic spline
- 18. GAM Theory / Various Spline: Example: Cubic spline (2)
- 19. GAM Theory / Various Spline: Natural cubic spline: ns
    - Cubic spline, plus the constraint that the fit is linear beyond the first and last knots.
    - This gives a conservative estimate for values below the minimum or above the maximum of the data.
    - A linear function changes less than a cubic one.
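The difference between the two bases can be seen by extrapolating just past the data. A minimal sketch using `bs()` and `ns()` from the `splines` package that ships with R; the data and the choice of 5 degrees of freedom are assumptions for illustration:

```r
library(splines)
set.seed(3)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.3)

# Cubic regression spline and natural cubic spline, both with 5 degrees of freedom
fit_bs <- lm(y ~ bs(x, df = 5))
fit_ns <- lm(y ~ ns(x, df = 5))

# Predict slightly beyond the observed range of x (0 to 10)
new_x   <- data.frame(x = c(10.5, 11, 11.5))
pred_bs <- suppressWarnings(predict(fit_bs, new_x))  # bs() warns outside the boundary knots
pred_ns <- predict(fit_ns, new_x)

print(diff(pred_bs))  # cubic extrapolation: step size changes
print(diff(pred_ns))  # natural spline: equal steps, i.e. linear beyond the boundary
```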
- 20. GAM Theory / Various Spline: Smoothing Splines
    - Alias: penalised splines.
    - Loess, cubic spline: the span or the knots are specified in advance, so the local intervals are fixed beforehand.
    - Penalized spline: the amount of smoothing is chosen automatically, as the data dictate.
    - This is the default option in the R package mgcv.
- 21. GAM Theory / Various Spline: Penalized regression: Smoothing
    - Minimize $\lVert Y - X\beta \rVert^2 + \lambda \int f''(x)^2 \, dx$
    - $\lambda \to 0$: wiggly fit. The larger $\lambda$ is, the smoother the fit.
- 22. GAM Theory / Various Spline: Example: Smoothing spline
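The role of the penalty weight can be sketched with `smooth.spline()` from base R. The data are simulated, and the two `lambda` values are arbitrary; note also that `smooth.spline()` rescales x internally, so its `lambda` is not numerically identical to the λ in the criterion on the previous slide:

```r
set.seed(4)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.3)

# Same data, two penalty weights
fit_small <- smooth.spline(x, y, lambda = 1e-8)  # weak penalty: wiggly fit
fit_large <- smooth.spline(x, y, lambda = 1)     # strong penalty: much smoother fit

plot(x, y, col = "grey")
lines(predict(fit_small, x), col = "red")
lines(predict(fit_large, x), col = "blue")

# Effective degrees of freedom shrink as lambda grows
print(c(small = fit_small$df, large = fit_large$df))
```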
- 23. GAM Theory / Model selection: Choose λ
    1. CV (cross validation)
    2. GCV (generalized cross validation)
    3. UBRE (unbiased risk estimator)
    4. Mallows's Cp
    - Whichever criterion is used, choose the λ that minimizes it.
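- 24. GAM Theory / Model selection: Cross validation
    - Minimize $\frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \hat{f}^{-[i]}(x_i) \right)^2$
    - Fit without the 1st observation, predict it, and take the difference from the actual 1st value.
    - Fit without the 2nd observation, predict it, and take the difference from the actual 2nd value.
    - ...
    - Fit without the n-th observation, predict it, and take the difference from the actual n-th value.
    - GCV: reduces the computational burden of CV.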
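The leave-one-out criterion can be coded directly. A minimal sketch, assuming simulated data, a smoothing spline as the smoother, and a small hand-picked grid of λ values; the helper name `loocv` is hypothetical:

```r
set.seed(5)
n <- 60
x <- sort(runif(n, 0, 10))
y <- sin(x) + rnorm(n, sd = 0.3)

# Leave-one-out CV score for a smoothing spline with a given lambda
loocv <- function(lambda) {
  errs <- sapply(seq_len(n), function(i) {
    fit <- smooth.spline(x[-i], y[-i], lambda = lambda)  # fit without observation i
    y[i] - predict(fit, x[i])$y                          # error at the held-out point
  })
  mean(errs^2)
}

lambdas <- 10^seq(-6, 0, by = 1)
scores  <- sapply(lambdas, loocv)
best    <- lambdas[which.min(scores)]   # lambda with the smallest CV score
print(data.frame(lambda = lambdas, cv = scores))
print(best)
```

Each candidate λ needs n refits here, which is the computational burden that GCV avoids.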
- 25. GAM Theory / Model selection: Example: 10-fold CV
- 26. GAM Theory / Model selection: Example: GCV
- 27. GAM Theory / Model selection: In practice
    - poisson: UBRE
    - quasipoisson: GCV
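In mgcv this pairing follows from whether the scale parameter is known. A minimal sketch, assuming simulated weekly counts and setting `method = "GCV.Cp"` explicitly rather than relying on the package default:

```r
library(mgcv)
set.seed(6)
# Hypothetical weekly counts with a smooth seasonal pattern
d <- data.frame(week = rep(1:52, 5))
d$y <- rpois(nrow(d), lambda = exp(1 + 0.5 * sin(2 * pi * d$week / 52)))

# method = "GCV.Cp": UBRE when the scale is known, GCV when it has to be estimated
m_pois  <- gam(y ~ s(week), family = poisson,      data = d, method = "GCV.Cp")
m_quasi <- gam(y ~ s(week), family = quasipoisson, data = d, method = "GCV.Cp")

print(c(poisson = m_pois$method, quasipoisson = m_quasi$method))
print(c(poisson = m_pois$scale.estimated, quasipoisson = m_quasi$scale.estimated))
```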
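- 28. GAM Theory / Model selection: AIC
    - Let L be the likelihood of the fitted model.
    - $\mathrm{AIC} = -2 \times \log(L) + 2 \times k$
    - k: number of explanatory variables (sex, age, salary, ...); more precisely, the number of estimated parameters.
    - The smaller the AIC, the better the model.
    - A model with a larger likelihood is preferred, but too many explanatory variables incur a penalty.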
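The formula can be checked against R's built-in `AIC()`. A minimal sketch with simulated data, in which the second covariate `x2` has no real effect; the helper `aic_by_hand` is hypothetical, and k here counts every estimated coefficient including the intercept:

```r
set.seed(7)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rpois(n, exp(0.2 + 0.5 * x1))   # x2 plays no role in generating y

fit1 <- glm(y ~ x1,      family = poisson)
fit2 <- glm(y ~ x1 + x2, family = poisson)

# AIC by hand: -2 * log-likelihood + 2 * number of estimated coefficients
aic_by_hand <- function(fit) -2 * as.numeric(logLik(fit)) + 2 * length(coef(fit))

print(c(fit1 = aic_by_hand(fit1), fit2 = aic_by_hand(fit2)))
print(c(fit1 = AIC(fit1),         fit2 = AIC(fit2)))
```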
- 29. Descriptive Analysis of Time-series data
- 30. Descriptive Analysis of Time-series data / Time series plot: Time series plot
    - [Figure: time series for Seoul, 2002 to 2010; panels: incidence, population, temp, pcp; x-axis: Time]
- 31. Descriptive Analysis of Time-series data / Time series plot: Serial Correlation
- 32. Descriptive Analysis of Time-series data / Time series plot
    - [Figure: "Autocorrelation plot: Seoul" (ACF vs. Lag) and "Partial Autocorrelation plot: Seoul" (Partial ACF vs. Lag)]
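Plots of this kind come from `acf()` and `pacf()` in base R. A minimal sketch, assuming a simulated AR(1) series in place of the Seoul data; with `frequency = 52` the lag axis is expressed in years, which would be consistent with an axis running from 0 to 0.5:

```r
set.seed(8)
# Hypothetical weekly series with serial correlation (AR(1), coefficient 0.6)
y <- ts(as.numeric(arima.sim(model = list(ar = 0.6), n = 477)), frequency = 52)

a <- acf(y,  main = "Autocorrelation plot")
p <- pacf(y, main = "Partial autocorrelation plot")

print(a$acf[1])   # lag 0, always 1
print(a$acf[2])   # lag-1 autocorrelation
```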
- 33. Descriptive Analysis of Time-series data / Time series plot: Decompose plot
    - [Figure: "Decomposition of multiplicative time series"; panels: observed, trend, seasonal, random; x-axis: Time, 2002 to 2010]
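A decomposition like this can be produced with `decompose()` in base R. A minimal sketch, assuming a simulated positive weekly series built as trend × seasonal × noise:

```r
set.seed(9)
# Hypothetical positive weekly series, 9 years of 52 weeks
tt <- 1:(52 * 9)
y  <- ts((1 + tt / 500) * (1.5 + sin(2 * pi * tt / 52)) * exp(rnorm(length(tt), sd = 0.2)),
         start = c(2002, 1), frequency = 52)

dec <- decompose(y, type = "multiplicative")
plot(dec)   # panels: observed, trend, seasonal, random

# In the multiplicative model the components multiply back to the data
ok <- !is.na(dec$trend)   # the moving-average trend is NA at both ends
print(all.equal(as.numeric(dec$trend * dec$seasonal * dec$random)[ok], as.numeric(y)[ok]))
```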
- 34. Analysis using GAM
- 35. Analysis using GAM: Seoul example: poisson (1)
    - Family: poisson. Link function: log.
    - Formula: `incidence ~ offset(log(population)) + temp + pcp + s(week, k = 53) + s(year, k = 9)`

    Parametric coefficients:

    | Term | Estimate | Std. Error | z value | Pr(>\|z\|) | Signif. |
    |---|---|---|---|---|---|
    | (Intercept) | -1.702e+01 | 2.411e-01 | -70.597 | <2e-16 | *** |
    | temp | -5.465e-03 | 1.776e-02 | -0.308 | 0.758 | |
    | pcp | -3.751e-04 | 1.332e-03 | -0.282 | 0.778 | |

    Approximate significance of smooth terms:

    | Term | edf | Ref.df | Chi.sq | p-value | Signif. |
    |---|---|---|---|---|---|
    | s(week) | 3.038 | 3.997 | 13.33 | 0.00975 | ** |
    | s(year) | 7.568 | 7.942 | 31.79 | 9.93e-05 | *** |

    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    R-sq.(adj) = 0.123, Deviance explained = 14.3%, UBRE = -0.029349, Scale est. = 1, n = 477
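The slides show only the model summary, not the call that produced it. A minimal sketch of how a model with this formula could be fitted in mgcv, assuming simulated data in place of the Seoul series: the variable names are taken from the formula on the slide, while the data-generating values and the object name `model_gam` are assumptions, so the numbers will not match the slide.

```r
library(mgcv)
set.seed(10)
# Simulated stand-in for the Seoul data: 9 years x 53 weeks = 477 rows
d <- expand.grid(week = 1:53, year = 2002:2010)
d$population <- 1.02e7 + 100 * seq_len(nrow(d))
d$temp <- 12 + 12 * sin(2 * pi * (d$week - 15) / 53) + rnorm(nrow(d), sd = 3)
d$pcp  <- rgamma(nrow(d), shape = 0.5, scale = 60)
d$incidence <- rpois(nrow(d), d$population * exp(-17 + 0.3 * cos(2 * pi * d$week / 53)))

# offset(log(population)) turns the count model into a model for the rate
model_gam <- gam(incidence ~ offset(log(population)) + temp + pcp +
                   s(week, k = 53) + s(year, k = 9),
                 family = poisson, data = d)

print(summary(model_gam))
plot(model_gam, pages = 1)   # estimated smooths for week and year
```

Replacing `temp + pcp` with `s(temp) + s(pcp)` gives the second variant, and `family = quasipoisson` gives the quasipoisson versions.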
- 36. Analysis using GAM
    - [Figure: estimated smooths s(week, 3.04) vs. week and s(year, 7.57) vs. year]
- 37. Analysis using GAM: Seoul example: poisson (2)
    - Family: poisson. Link function: log.
    - Formula: `incidence ~ offset(log(population)) + s(temp) + s(pcp) + s(week, k = 53) + s(year, k = 9)`

    Parametric coefficients:

    | Term | Estimate | Std. Error | z value | Pr(>\|z\|) | Signif. |
    |---|---|---|---|---|---|
    | (Intercept) | -17.07888 | 0.07856 | -217.4 | <2e-16 | *** |

    Approximate significance of smooth terms:

    | Term | edf | Ref.df | Chi.sq | p-value | Signif. |
    |---|---|---|---|---|---|
    | s(temp) | 1.000 | 1.000 | 0.538 | 0.46313 | |
    | s(pcp) | 3.312 | 4.142 | 7.036 | 0.14440 | |
    | s(week) | 3.063 | 4.030 | 14.319 | 0.00654 | ** |
    | s(year) | 1.798 | 2.236 | 6.634 | 0.04593 | * |

    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    R-sq.(adj) = 0.0834, Deviance explained = 11.5%, UBRE = -0.014142, Scale est. = 1, n = 477
- 38. Analysis using GAM
    - [Figure: estimated smooths s(temp, 1) vs. temp, s(pcp, 3.31) vs. pcp, s(week, 3.06), and s(year, 1.8)]
- 39. Analysis using GAM: Seoul example: quasipoisson (1)
    - Family: quasipoisson. Link function: log.
    - Formula: `incidence ~ offset(log(population)) + temp + pcp + s(week, k = 53) + s(year, k = 9)`

    Parametric coefficients:

    | Term | Estimate | Std. Error | t value | Pr(>\|t\|) | Signif. |
    |---|---|---|---|---|---|
    | (Intercept) | -17.012052 | 0.252254 | -67.440 | <2e-16 | *** |
    | temp | -0.006425 | 0.018615 | -0.345 | 0.730 | |
    | pcp | -0.000377 | 0.001378 | -0.274 | 0.785 | |

    Approximate significance of smooth terms:

    | Term | edf | Ref.df | F | p-value | Signif. |
    |---|---|---|---|---|---|
    | s(week) | 3.126 | 4.110 | 3.072 | 0.015470 | * |
    | s(year) | 7.595 | 7.949 | 3.746 | 0.000303 | *** |

    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    R-sq.(adj) = 0.124, Deviance explained = 14.3%, GCV = 0.96803, Scale est. = 1.068, n = 477
- 40. Analysis using GAM
    - [Figure: estimated smooths s(week, 3.13) vs. week and s(year, 7.59) vs. year]
- 41. Analysis using GAM: Seoul example: quasipoisson (2)
    - Family: quasipoisson. Link function: log.
    - Formula: `incidence ~ offset(log(population)) + s(temp) + s(pcp) + s(week, k = 53) + s(year, k = 9)`

    Parametric coefficients:

    | Term | Estimate | Std. Error | t value | Pr(>\|t\|) | Signif. |
    |---|---|---|---|---|---|
    | (Intercept) | -17.08040 | 0.08055 | -212 | <2e-16 | *** |

    Approximate significance of smooth terms:

    | Term | edf | Ref.df | F | p-value | Signif. |
    |---|---|---|---|---|---|
    | s(temp) | 1.000 | 1.000 | 0.543 | 0.46143 | |
    | s(pcp) | 3.356 | 4.193 | 1.616 | 0.16537 | |
    | s(week) | 3.109 | 4.088 | 3.412 | 0.00873 | ** |
    | s(year) | 1.872 | 2.329 | 2.748 | 0.05679 | . |

    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    R-sq.(adj) = 0.0838, Deviance explained = 11.6%, GCV = 0.98475, Scale est. = 1.0457, n = 477
- 42. Analysis using GAM
    - [Figure: estimated smooths s(temp, 1) vs. temp, s(pcp, 3.36) vs. pcp, s(week, 3.11), and s(year, 1.87)]
- 43. Analysis using GAM: Compare AIC
    - `model_gam$aic`: 809.8845
    - `model_gam2$aic`: 817.1379
    - `model_gam3$aic`: NA
    - `model_gam4$aic`: NA
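The two `NA` values are what R reports for quasi-likelihood families, which have no true likelihood and therefore no AIC. The slide does not say which model is which, so reading `model_gam3` and `model_gam4` as the quasipoisson fits is an assumption. A minimal sketch with simulated data and hypothetical object names:

```r
library(mgcv)
set.seed(11)
d <- data.frame(x = runif(300))
d$y <- rpois(300, exp(1 + sin(2 * pi * d$x)))

m_p <- gam(y ~ s(x), family = poisson,      data = d)
m_q <- gam(y ~ s(x), family = quasipoisson, data = d)

# AIC is available for the poisson fit but not for the quasipoisson fit
print(c(poisson = m_p$aic, quasipoisson = m_q$aic))
```

AIC can therefore only rank the two poisson models against each other; the quasipoisson fits have to be compared on other grounds, such as the GCV score.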
- 44. Analysis using GAM: Good reference
    - Using R for Time Series Analysis: http://a-little-book-of-r-for-time-series.readthedocs.org/en/latest/
- 45. Analysis using GAM: END
    - Email: secondmath85@gmail.com