The document describes forecasting expenditure for Australia's Pharmaceutical Benefits Scheme (PBS), under which billions of dollars in drug subsidies are paid annually and which had been forecast using simple Excel methods. The author developed an automated exponential smoothing algorithm in Excel, reducing forecast errors from 15-20% to 0.6%. Monthly data on thousands of drug groups were used to select exponential smoothing models automatically based on the AIC.
Automatic algorithms for time series forecasting - Rob Hyndman
Many applications require a large number of time series to be forecast completely automatically. For example, manufacturing companies often require weekly forecasts of demand for thousands of products at dozens of locations in order to plan distribution and maintain suitable inventory stocks. In these circumstances, it is not feasible for time series models to be developed for each series by an experienced analyst. Instead, an automatic forecasting algorithm is required.
In addition to providing automatic forecasts when required, these algorithms also provide high quality benchmarks that can be used when developing more specific and specialized forecasting models.
I will describe some algorithms for automatically forecasting univariate time series that have been developed over the last 20 years. The role of forecasting competitions in comparing the forecast accuracy of these algorithms will also be discussed.
Ways to evaluate a machine learning model’s performance - Mala Deep Upadhaya
Some of the ways to evaluate a machine learning model’s performance.
In summary:
Confusion matrix: representation of the true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) in a matrix format.
Accuracy: the proportion of all predictions that are correct; it becomes misleading when classes are imbalanced.
Precision: how often the model is right when it says it is right.
Recall: how many of the actual positives the model found.
Specificity: like recall, but focused on the negative instances.
F1 score: the harmonic mean of precision and recall, so the higher the F1 score, the better.
Precision-Recall (PR) curve: curve of precision against recall for various threshold values.
ROC curve: plot of the true positive rate (TPR) against the false positive rate (FPR) for various threshold values. A sketch computing these metrics follows below.
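A minimal sketch (mine, not from the presentation) of how these quantities fall out of the four confusion-matrix counts:

```python
# Summary metrics computed from raw confusion-matrix counts.
def metrics(tp, fp, tn, fn):
    accuracy    = (tp + tn) / (tp + fp + tn + fn)
    precision   = tp / (tp + fp)        # right when it says "right"
    recall      = tp / (tp + fn)        # true positive rate (TPR)
    specificity = tn / (tn + fp)        # recall on the negative instances
    f1          = 2 * precision * recall / (precision + recall)
    fpr         = fp / (fp + tn)        # x-axis of the ROC curve
    return accuracy, precision, recall, specificity, f1, fpr

# Example: 40 true positives, 10 false positives, 45 true negatives, 5 false negatives.
print(metrics(tp=40, fp=10, tn=45, fn=5))
```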
Reduction in customer complaints - Mortgage Industry - Pranov Mishra
The project analyses customer complaints/inquiries received by a US-based mortgage (loan) servicing company.
The goal of the project is to build a predictive model using the identified significant contributors and to come up with recommendations for changes which will lead to:
1. Reduced re-work
2. Reduced operational cost
3. Improved customer satisfaction
4. Improved company preparedness to respond to customers.
Three models were built: logistic regression, random forest, and gradient boosting. Accuracy, AUC (area under the curve), sensitivity, and specificity all improved markedly as model complexity increased from simple to complex.
Logistic regression did not generalize well to the non-linear data, so that model suffered from both bias and variance. Random forest is an ensemble technique in itself and helps reduce variance to a great extent. Gradient boosting, with its sequential learning ability, helps reduce the bias. The results from random forest and gradient boosting did not differ by much. This is consistent with the bias-variance trade-off: inflexible simple models have high bias on non-linear data, while complex, flexible models fit it well at the cost of potentially higher variance.
Additionally, a lift chart was built, which gives a cumulative lift of 133% in the first four deciles; a sketch of that calculation follows below.
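As a hedged illustration (the project's actual code and column names are not given in the description), cumulative lift over the top deciles can be computed like this:

```python
import numpy as np

def cumulative_lift(scores, actuals, top_deciles=4):
    """Response rate in the top deciles (ranked by score) vs the overall rate."""
    order = np.argsort(scores)[::-1]                   # highest scores first
    cutoff = int(len(scores) * top_deciles / 10)
    return actuals[order][:cutoff].mean() / actuals.mean()

# Toy stand-ins for the model's predicted probabilities and observed outcomes.
rng = np.random.default_rng(0)
scores = rng.random(1000)
actuals = (rng.random(1000) < 0.2).astype(int)
print(cumulative_lift(scores, actuals))                # 1.33 would mean 133% lift
```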
Forecasting Techniques - Data Science SG - Kai Xin Thia
Presentation by Kai Xin on techniques learnt from the book Forecasting: Principles and Practice: www.otexts.org/fpp
Covers techniques like Seasonal and Trend decomposition using Loess (STL), Holt-Winters, ARIMA, etc. R code adapted from the book is available at:
https://github.com/thiakx/Forecasting_DSSG
This Machine Learning Algorithms presentation will help you learn what machine learning is and the various ways in which you can use machine learning to solve a problem. At the end, you will see a demo on linear regression, logistic regression, decision trees, and random forests. The presentation is designed for beginners, to help them understand how to implement the different machine learning algorithms.
Below topics are covered in this Machine Learning Algorithms Presentation:
1. Real world applications of Machine Learning
2. What is Machine Learning?
3. Processes involved in Machine Learning
4. Types of Machine Learning Algorithms
5. Popular Algorithms with a hands-on demo
- Linear regression
- Logistic regression
- Decision tree and Random forest
- K-nearest neighbors
What is Machine Learning: Machine Learning is an application of Artificial Intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.
- - - - - - - -
About Simplilearn Machine Learning course:
A form of artificial intelligence, Machine Learning is revolutionizing the world of computing as well as all people’s digital interactions. Machine Learning powers such innovative automated technologies as recommendation engines, facial recognition, fraud protection and even self-driving cars. This Machine Learning course prepares engineers, data scientists and other professionals with the knowledge and hands-on skills required for certification and job competency in Machine Learning.
- - - - - - -
Why learn Machine Learning?
Machine Learning is taking over the world, and with that comes a growing need among companies for professionals who know the ins and outs of Machine Learning.
The Machine Learning market size is expected to grow from USD 1.03 Billion in 2016 to USD 8.81 Billion by 2022, at a Compound Annual Growth Rate (CAGR) of 44.1% during the forecast period.
- - - - - -
What skills will you learn from this Machine Learning course?
By the end of this Machine Learning course, you will be able to:
1. Master the concepts of supervised, unsupervised and reinforcement learning, and of modeling.
2. Gain practical mastery over principles, algorithms, and applications of Machine Learning through a hands-on approach which includes working on 28 projects and one capstone project.
3. Acquire thorough knowledge of the mathematical and heuristic aspects of Machine Learning.
4. Understand the concepts and operation of support vector machines, kernel SVM, naive Bayes, decision tree classifier, random forest classifier, logistic regression, K-nearest neighbors, K-means clustering and more.
5. Be able to model a wide variety of robust Machine Learning algorithms, including deep learning, clustering, and recommendation systems.
- - - - - - -
Alleviating Privacy Attacks Using Causal Models - Amit Sharma
Machine learning models, especially deep neural networks, have been shown to reveal membership information about inputs in the training data. Such membership inference attacks are a serious privacy concern; for example, patients providing medical records to build a model that detects HIV would not want their identity to be leaked. Further, we show that the attack accuracy amplifies when the model is used to predict samples that come from a different distribution than the training set, which is often the case in real-world applications. Therefore, we propose the use of causal learning approaches, where a model learns the causal relationship between the input features and the outcome. An ideal causal model is known to be invariant to the training distribution and hence generalizes well to shifts between samples from the same distribution and across different distributions. First, we prove that models learned using causal structure provide stronger differential privacy guarantees than associational models under reasonable assumptions. Next, we show that causal models trained on sufficiently large samples are robust to membership inference attacks across different distributions of datasets, and those trained on smaller sample sizes always have lower attack accuracy than corresponding associational models. Finally, we confirm our theoretical claims with experimental evaluation on 4 moderately complex Bayesian network datasets and a colored MNIST image dataset. Associational models exhibit up to 80% attack accuracy under different test distributions and sample sizes, whereas causal models exhibit attack accuracy close to a random guess. Our results confirm the value of the generalizability of causal models in reducing susceptibility to privacy attacks. Paper available at https://arxiv.org/abs/1909.12732
This is an elaborate presentation on how to predict employee attrition using various machine learning models. This presentation will take you through the process of statistical model building using Python.
Module 4: Model Selection and Evaluation - Sara Hooker
Delta Analytics is a 501(c)3 non-profit in the Bay Area. We believe that data is powerful, and that anybody should be able to harness it for change. Our teaching fellows partner with schools and organizations worldwide to work with students excited about the power of data to do good.
Welcome to the course! These modules will teach you the fundamental building blocks and the theory necessary to be a responsible machine learning practitioner in your own community. Each module focuses on accessible examples designed to teach you about good practices and the powerful (yet surprisingly simple) algorithms we use to model data.
To learn more about our mission or provide feedback, take a look at www.deltanalytics.org.
Causal Inference in Data Science and Machine Learning - Bill Liu
Event: https://learn.xnextcon.com/event/eventdetails/W20042010
Video: https://www.youtube.com/channel/UCj09XsAWj-RF9kY4UvBJh_A
Modern machine learning techniques are able to learn highly complex associations from data, which has led to amazing progress in computer vision, NLP, and other predictive tasks. However, there are limitations to inference from purely probabilistic or associational information. Without understanding causal relationships, ML models are unable to provide actionable recommendations, perform poorly in new but related environments, and suffer from a lack of interpretability.
In this talk, I provide an introduction to the field of causal inference, discuss its importance in addressing some of the current limitations in machine learning, and provide some real-world examples from my experience as a data scientist at Brex.
DoWhy: An end-to-end library for causal inference - Amit Sharma
In addition to efficient statistical estimators of a treatment's effect, successful application of causal inference requires specifying assumptions about the mechanisms underlying observed data and testing whether they are valid, and to what extent. However, most libraries for causal inference focus only on the task of providing powerful statistical estimators. We describe DoWhy, an open-source Python library that is built with causal assumptions as its first-class citizens, based on the formal framework of causal graphs to specify and test causal assumptions. DoWhy presents an API for the four steps common to any causal analysis: 1) modeling the data using a causal graph and structural assumptions, 2) identifying whether the desired effect is estimable under the causal model, 3) estimating the effect using statistical estimators, and finally 4) refuting the obtained estimate through robustness checks and sensitivity analyses. In particular, DoWhy implements a number of robustness checks, including placebo tests, bootstrap tests, and tests for unobserved confounding. DoWhy is an extensible library that supports interoperability with other implementations, such as EconML and CausalML for the estimation step.
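A minimal usage sketch of the four steps, based on DoWhy's public API as I understand it (the synthetic dataset and the estimator choice here are illustrative assumptions, not taken from the abstract):

```python
import dowhy.datasets
from dowhy import CausalModel

# Synthetic data with a known linear effect, shipped with DoWhy.
data = dowhy.datasets.linear_dataset(beta=10, num_common_causes=3,
                                     num_samples=1000, treatment_is_binary=True)

# 1) Model: encode the causal assumptions as a graph.
model = CausalModel(data=data["df"], treatment=data["treatment_name"],
                    outcome=data["outcome_name"], graph=data["gml_graph"])

# 2) Identify: check whether the effect is estimable under the assumed graph.
estimand = model.identify_effect()

# 3) Estimate: apply a statistical estimator to the identified estimand.
estimate = model.estimate_effect(estimand, method_name="backdoor.linear_regression")

# 4) Refute: stress-test the estimate, e.g. with a placebo treatment.
refutation = model.refute_estimate(estimand, estimate,
                                   method_name="placebo_treatment_refuter")
print(estimate.value, refutation)
```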
Prescriptive Analytics: A Hands-on Introduction to Getting Actionable Insight... - Barton Poulson
Need to know the best course of action for your company? You need prescriptive analytics, an approach to analysis that mines data for actionable insights and helps you improve your company’s reach and ROI. Prescriptive analytics is more than predictive modeling; it gives valuable conditional outcomes and cause-and-effect insight. In this workshop, we’ll explore the logic of prescriptive analytics and use Excel and R to conduct simple “what-if” simulations, cross-lagged designs, optimization models, and robust quasi-experimental analyses. You’ll leave with a collection of tools that will help you intelligently choose the best path to find new value for your company.
Statistics for UX Professionals - Jessica Cameron - User Vision
Are you looking to expand your research toolkit to include some quantitative methods, such as survey research or A/B testing? Have you been asked to collect some usability metrics, but aren’t sure how best to go about that? Or do you just want to be more aware of all of the UX research possibilities? If your answer to any of those questions is yes, then this session is for you.
You may know that without statistics, you won’t know if A is really better than B, if users are truly more satisfied with your new site than with your old one, or which changes to your site have actually impacted conversion rates. However, statistics can also help you figure out how to report satisfaction and other metrics you collect during usability tests. And they’re essential for making sense of the results of quantitative usability tests.
This session will focus on the statistical concepts that are most useful for UX researchers. It won’t make you a quant, but it will give you a good grounding in quantitative methods and reporting. (For example, you will learn what a margin of error is, how to report quantitative data collected during a usability test - and how not to - and how many people you really need to fill out a survey. A sketch of the margin-of-error arithmetic follows below.)
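For instance, the margin-of-error and sample-size arithmetic mentioned above can be sketched as follows (my example, not the session's materials; it assumes a simple random sample and a 95% confidence level):

```python
import math

def margin_of_error(p, n, z=1.96):
    """95% margin of error for an observed proportion p from n respondents."""
    return z * math.sqrt(p * (1 - p) / n)

def sample_size(margin, p=0.5, z=1.96):
    """Respondents needed so the 95% margin of error is at most `margin`."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)

print(margin_of_error(0.7, 100))  # ~0.09: report "70% +/- 9 points"
print(sample_size(0.05))          # 385 respondents for +/- 5 points
```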
Forecasting using data workshop slides for the Deliver conference in Winnipeg October 2016. This session introduces practical exercises for probabilistic forecasting. http://www.prdcdeliver.com
This lecture covers the main issues that plague all machine learning algorithms, namely bias and variance. It also demonstrates underfitting, overfitting, and generalization. There are two questions you should be able to answer after completing this lecture: how do you know that your algorithm is overfitting, and how do you overcome this phenomenon? Regularization is the main topic covered in this lecture.
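A small sketch of both questions, under my own assumptions rather than the lecture's materials: a large train/validation gap signals overfitting, and an L2 (ridge) penalty is one way to shrink it:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.linspace(0, 1, 60).reshape(-1, 1)
y = np.sin(4 * X).ravel() + 0.3 * rng.standard_normal(60)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

for alpha in [1e-9, 1.0]:  # effectively unregularized vs ridge-penalized
    model = make_pipeline(PolynomialFeatures(12), Ridge(alpha=alpha)).fit(X_tr, y_tr)
    # Overfitting shows up as a high train score with a much lower validation score.
    print(f"alpha={alpha}: train R2={model.score(X_tr, y_tr):.2f}, "
          f"val R2={model.score(X_va, y_va):.2f}")
```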
Traditional randomized experiments allow us to determine the overall causal impact of a treatment program (e.g. marketing, medical, social, education, political). Uplift modeling (also known as true lift, net lift, or incremental lift) takes a further step to identify individuals who are truly positively influenced by a treatment through data mining / machine learning. This technique allows us to identify the “persuadables” and thus optimize target selection in order to maximize treatment benefits. This important subfield of data mining/data science/business analytics has gained significant attention in areas such as personalized marketing, personalized medicine, and political elections, with plenty of publications and presentations appearing in recent years from both industry practitioners and academics.
In this workshop, I will introduce the concept of Uplift, review existing methods, contrast with the traditional approach, and introduce a new method that can be implemented with standard software. A method and metrics for model assessment will be recommended. Our discussion will include new approaches to handling a general situation where only observational data are available, i.e. without randomized experiments, using techniques from causal inference. Additionally, an integrated modeling approach for uplift and direct response (where it can be identified who actually responded, e.g., click-through or coupon scanning) will be discussed. Last but not least, extension to the multiple treatment situation with solutions to optimizing treatments at the individual level will also be discussed. While the talk is geared towards marketing applications (“personalized marketing”), the same methodologies can be readily applied in other fields such as insurance, medicine, education, political, and social programs. Examples from the retail and non-profit industries will be used to illustrate the methodologies.
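As a baseline illustration (the classic two-model approach, not the new method introduced in the workshop), uplift scores can be produced from two separate response models; the data and column conventions below are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def two_model_uplift(X, treated, responded):
    """Fit response models on the treated and control groups separately."""
    m_t = LogisticRegression().fit(X[treated == 1], responded[treated == 1])
    m_c = LogisticRegression().fit(X[treated == 0], responded[treated == 0])
    def score(X_new):
        # Uplift = P(respond | treated) - P(respond | control); the
        # "persuadables" are those with the largest predicted difference.
        return m_t.predict_proba(X_new)[:, 1] - m_c.predict_proba(X_new)[:, 1]
    return score

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 5))          # toy features
treated = rng.integers(0, 2, 2000)          # randomized treatment flag
responded = (rng.random(2000) < 0.3).astype(int)
print(two_model_uplift(X, treated, responded)(X[:5]))
```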
Stock Price Trend Forecasting using Supervised LearningSharvil Katariya
The aim of the project is to examine a number of different forecasting techniques to predict future stock returns based on past returns and numerical user-generated content, and to construct a portfolio of multiple stocks in order to diversify the risk. We do this by applying supervised learning methods to stock price forecasting, interpreting the seemingly chaotic market data.
Bad AI showing sexist or racist correlations makes headlines. Nobody sets out to make a bad system, so why does this happen? I take a look at all the ways bias creeps into AI and where you should put effort to avoid it.
Slides annotated from a talk given at ImpactfulAI meetup 19th June 2019 London
How to Perform Website Experiments [+ SEJ Experiment Walk-Through & Results] - Search Engine Journal
With so many elements going into constructing a successful website, it’s crucial to know what to look for when evaluating what users want.
So what’s the best way to test your website’s effectiveness? We were wondering something similar, so we conducted a little experiment of our own.
We’ll show you the step-by-step process for conducting experiments on your website, so you can get the important data you need to keep your website strong.
You’ll learn:
- The best time to run an experiment to get the most informative data.
- What to do when you don't see the results you expect, and how to avoid other factors influencing your results.
- How we run our experiments, and what we plan to do with the results.
Watch our very own Angie Nikoleychuk, Content Marketing Manager, and learn the key factors to focus on when running website experiments, so your users stay happy and engaged.
You’ll also learn the results of our own case study, where we examined ad performance and user happiness on our website.
Discussion Questions Chapter 15 (.docx) - edgar6wallace88877
Discussion Questions Chapter 15
Terms in Review
1. Define or explain:
1. Coding rules.
2. Spreadsheet data entry.
3. Bar codes.
4. Precoded instruments.
5. Content analysis.
6. Missing data.
7. Optical mark recognition.
2. How should the researcher handle “don’t know” responses?
Making Research Decisions
3. A problem facing shoe store managers is that many shoes eventually must be sold at markdown prices. This prompts us to conduct a mail survey of shoe store managers in which we ask, What methods have you found most successful for reducing the problem of high markdowns? We are interested in extracting as much information as possible from these answers to better understand the full range of strategies that store managers use. Establish what you think are category sets to code 500 responses similar to the 14 given here. Try to develop an integrated set of categories that reflects your theory of markdown management. After developing the set, use it to code the 14 responses.
1. Have not found the answer. As long as we buy style shoes, we will have markdowns. We use PMs on slow merchandise, but it does not eliminate markdowns. (PM stands for “push-money”—special item bonuses for selling a particular style of shoe.)
2. Using PMs before too old. Also reducing price during season. Holding meetings with salespeople indicating which shoes to push.
3. By putting PMs on any slow-selling items and promoting same. More careful check of shoes purchased.
4. Keep a close watch on your stock, and mark down when you have to—that is, rather than wait, take a small markdown on a shoe that is not moving at the time.
5. Using the PM method.
6. Less advance buying—more dependence on in-stock shoes.
7. Sales—catch bad guys before it’s too late and close out.
8. Buy as much good merchandise as you can at special prices to help make up some markdowns.
9. Reducing opening buys and depending on fill-in service. PMs for salespeople.
10. Buy more frequently, better buying, PMs on slow-moving merchandise.
11. Careful buying at lowest prices. Cash on the buying line. Buying closeouts, FDs, overstock, “cancellations.” (FD stands for “factory-discontinued” style.)
12. By buying less “chanceable” shoes. Buy only what you need, watch sizes, don’t go overboard on new fads.
13. Buying more staple merchandise. Buying more from fewer lines. Sticking with better nationally advertised merchandise.
14. No successful method with the current style situation. Manufacturers are experimenting, the retailer takes the markdowns—cuts gross profit by about 3 percent—keep your stock at lowest level without losing sales.
4. Select a small sample of class members, work associates, or friends and ask them to answer the following in a paragraph or two: What are your career aspirations for the next five years? Use one of the four basic units of content analysis to analyze their responses. Describe your findings as frequencies for the unit of analysis selected.
Bringing Research to L.
Homework #1 - SOCY 3115, Spring 20 (.docx) - pooleavelina
Homework #1
SOCY 3115
Spring 20
Read the Syllabus and FAQ on how to do your homework before beginning the assignment!
To get consideration for full credit, you must:
· Follow directions;
· Show all work required to arrive at answer (statistical calculations often require multiple steps, so you need to write these down, not just skip to the final answer)
· Use appropriate statistical notation at all times (e.g. if you are calculating a population mean, begin with the equation for population mean)
· Use units in your answer, where appropriate (e.g. a mean time would be “6.5 hours” rather than just “6.5”)
Understanding the Structure of Data
1. For the following rectangular dataset:
Id | Highest degree  | Works full-time | Annual income cat
1  | Did not grad HS | Yes             | Low
2  | HS dip          | Yes             | Low
3  | HS dip          | No              | Med
4  | BA              | No              | Low
5  | BA              | Yes             | Med
6  | MA              | Yes             | High
7  | HS dip          | Yes             | Med
a. What is the unit-of-analysis of the dataset?
b. How many variables are in the dataset?
c. How many observations/cases are in the dataset?
d. For each variable that is not named “id”:
i. What is the variable name?
ii. What is the level-of-measurement?
iii. What are the values for the variable?
iv. If you had to make a guess, what do you think the “question” was that was asked of the unit-of-analysis to get these data? (for example, if we had a continuous variable called “num_pets” the question might be “How many pets live in your household?”)
2. For the following rectangular dataset:
Id | num_bdrms | num_bthrms | sqft | Ranch
1  | 4         | 3          | 3200 | Yes
2  | 2         | 1.5        | 2800 | Yes
3  | 2         | 1          | 1200 | Yes
4  | 3         | 2          | 1500 | No
5  | 2         | 2          | 1100 | No
a. What is the unit-of-analysis of the dataset?
b. How many variables are in the dataset?
c. How many observations/cases are in the dataset?
d. For each variable that is not named “id”:
i. What is the variable name?
ii. What is the level-of-measurement? Before answering, be sure to consult the slide called “Level of measurement – language to use”. Use the formal language!
iii. What are the values for the variable?
iv. If you had to make a guess, what do you think the “question” was that was asked of the unit-of-analysis to get these data? (for example, if we had a continuous variable called “num_pets” the question might be “How many pets live in your household?”)
3. For each of the following questions (1) construct a dataset with one variable and three observations (2) add data that could have theoretically been collected (just make up the actual responses to the question); and (3) indicate the level-of-measurement of the variable. I’ve done two examples for you.
Example#1:
What is your current age? (individual is the unit-of-analysis)
id | age
1  | 25
2  | 32
3  | 61
The age variable is continuous/interval-ratio.
Example#2:
What is the size of this hospital based on number of beds? (hospital is the unit-of-analysis)? Answers can be small (1-100 beds), medium (101-500 beds), large (501 beds to 1000 beds), extra large (1001+ beds)
id | hosp_size
1  | med
2  | med
3  | ext ...
Top 100+ Google Data Science Interview Questions.pdf - Datacademy.ai
Data science interviews can be particularly difficult due to the many proficiencies you'll have to demonstrate (technical skills, problem solving, communication) and the generally high bar to entry for the industry. We provide the top 100+ Google data science interview questions: all you need to know to crack it.
Visit: https://www.datacademy.ai/google-data-science-interview-questions/
How Innovation Could Apply to Customer Insights for Better Decision Making? - Frédéric Baffou
This presentation supported a talk at the Strategic Marketing & Branding Conference (Thought Leader Global) in Amsterdam in October 2017.
It covers an innovative methodology and approach to help decision making process related to new product development. It is based on 2 pillars:
- Gather customer insights based on use-case market research (i.e. the user’s perspective)
- Interact dynamically with results through a data science application
Time to Practice – Week Four (.docx) - edwardmarivel
University of Phoenix Material
Time to Practice – Week Four
PSYCH/625
Complete Parts A, B, and C below.
Part A
Some questions in Part A require that you access data from Statistics for People Who (Think They) Hate Statistics. This data is available on the student website under the Student Text Resources link.
1. Using the data in the file named Ch. 11 Data Set 2, test the research hypothesis at the .05 level of significance that boys raise their hands in class more often than girls. Do this practice problem by hand using a calculator. What is your conclusion regarding the research hypothesis? Remember to first decide whether this is a one- or two-tailed test.
2. Using the same data set (Ch. 11 Data Set 2), test the research hypothesis at the .01 level of significance that there is a difference between boys and girls in the number of times they raise their hands in class. Do this practice problem by hand using a calculator. What is your conclusion regarding the research hypothesis? You used the same data for this problem as for Question 1, but you have a different hypothesis (one is directional and the other is nondirectional). How do the results differ and why?
3. Practice the following problems by hand just to see if you can get the numbers right. Using the following information, calculate the t test statistic for parts a, b, and c. [The summary statistics for each part are not reproduced here.]
4. Using the results you got from Question 3 and a level of significance at .05, what are the two-tailed critical values associated with each? Would the null hypothesis be rejected?
5. Using the data in the file named Ch. 11 Data Set 3, test the null hypothesis that urban and rural residents both have the same attitude toward gun control. Use IBM® SPSS® software to complete the analysis for this problem.
6. A public health researcher tested the hypothesis that providing new car buyers with child safety seats will also act as an incentive for parents to take other measures to protect their children (such as driving more safely, child-proofing the home, and so on). Dr. L counted all the occurrences of safe behaviors in the cars and homes of the parents who accepted the seats versus those who did not. The findings: a significant difference at the .013 level. Another researcher did exactly the same study; everything was the same—same type of sample, same outcome measures, same car seats, and so on. Dr. R’s results were marginally significant (recall Ch. 9) at the .051 level. Which result do you trust more and why?
7. In the following examples, indicate whether you would perform a t test of independent means or dependent means.
a. Two groups were exposed to different treatment levels for ankle sprains. Which treatment was most effective?
b. A researcher in nursing wanted to know if the recovery of patients was quicker when some received additional in-home care whereas when others received the standard amount.
c. A group of adolescent boys was offered interp ...
Artificial Intelligence and Machine Learning for business - Steven Finlay
Artificial Intelligence (AI) and Machine Learning are now mainstream business tools. They are being applied across many industries to increase profits, reduce costs, save lives and improve customer experiences.
This presentation, based on the #1 Amazon bestselling book, cuts through the technical jargon that is often associated with these subjects. It delivers a simple and concise introduction for managers and business people.
The focus is very much on practical application, and how to work with technical specialists (data scientists) to maximise the benefits of these technologies.
Exploring the feature space of large collections of time series - Rob Hyndman
It is becoming increasingly common for organizations to collect very large amounts of data over time. Data visualization is essential for exploring and understanding structures and patterns, and to identify unusual observations. However, the sheer quantity of data available challenges current time series visualisation methods.
For example, Yahoo has banks of mail servers that are monitored over time. Many measurements on server performance are collected every hour for each of thousands of servers. We wish to identify servers that are behaving unusually.
Alternatively, we may have thousands of time series we wish to forecast, and we want to be able to identify the types of time series that are easy to forecast and those that are inherently challenging.
I will demonstrate a functional data approach to this problem using a vector of features on each time series, measuring characteristics of the series. For example, the features may include lag correlation, strength of seasonality, spectral entropy, etc. Then we use a principal component decomposition on the features, and plot the first few principal components. This enables us to explore a lower dimensional space and discover interesting structure and unusual observations.
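A condensed sketch of the pipeline described, using two stand-in features (lag-1 autocorrelation and spectral entropy) rather than the talk's full feature set:

```python
import numpy as np
from sklearn.decomposition import PCA

def acf1(x):
    """Lag-1 autocorrelation."""
    x = x - x.mean()
    return np.sum(x[1:] * x[:-1]) / np.sum(x * x)

def spectral_entropy(x):
    """Normalized Shannon entropy of the periodogram."""
    psd = np.abs(np.fft.rfft(x)) ** 2
    psd = psd / psd.sum()
    n = len(psd)
    psd = psd[psd > 0]
    return -np.sum(psd * np.log(psd)) / np.log(n)

rng = np.random.default_rng(0)
series = [rng.standard_normal(120).cumsum() for _ in range(1000)]  # toy collection
features = np.array([[acf1(x), spectral_entropy(x)] for x in series])
pcs = PCA(n_components=2).fit_transform(features)
# Plotting the first two principal components reveals structure and outliers.
```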
Exploring the boundaries of predictability - Rob Hyndman
Why is it that we can accurately forecast a solar eclipse in 1000 years time, but we have no idea whether Yahoo's stock price will rise or fall tomorrow? Or why can we forecast electricity consumption next week with remarkable precision, but we cannot forecast exchange rate fluctuations in the next hour?
In this talk, I will discuss the conditions we need for predictability, how to measure the uncertainty of predictions, and the consequences of thinking we can predict something more accurately than we can.
I will draw on my experiences in forecasting Australia's health budget for the next few years, in developing forecasting models for peak electricity demand in 20 years time, and in identifying unpredictable activity on Yahoo's mail servers.
MEFM: An R package for long-term probabilistic forecasting of electricity demand - Rob Hyndman
I will describe and demonstrate a new open-source R package that implements the Monash Electricity Forecasting Model, a semi-parametric probabilistic approach to forecasting long-term electricity demand. The underlying model proposed in Hyndman and Fan (2010) is now widely used in practice, particularly in Australia. The model has undergone many improvements and developments since it was first proposed, and these have been incorporated in this R implementation.
The package allows for ensemble forecasting of demand based on simulations of future sample paths of temperatures and other predictor variables. It requires the following data as inputs: half-hourly/hourly electricity demands; half-hourly/hourly temperatures at one or two locations; seasonal (e.g., quarterly) demographic and economic data; and public holiday data.
Peak electricity demand forecasting is important in medium and long-term planning of electricity supply. Extreme demand often leads to supply failure with consequential business and social disruption. Forecasting extreme demand events is therefore an important problem in energy management, and this package provides a useful tool for energy companies and regulators in future planning.
Forecasting electricity demand distributions using a semiparametric additive ... - Rob Hyndman
Electricity demand forecasting plays an important role in short-term load allocation and long-term planning for future generation facilities and transmission augmentation. Planners must adopt a probabilistic view of potential peak demand levels, therefore density forecasts (providing estimates of the full probability distributions of the possible future values of the demand) are more helpful than point forecasts, and are necessary for utilities to evaluate and hedge the financial risk accrued by demand variability and forecasting uncertainty.
Electricity demand in a given season is subject to a range of uncertainties, including underlying population growth, changing technology, economic conditions, prevailing weather conditions (and the timing of those conditions), as well as the general randomness inherent in individual usage. It is also subject to some known calendar effects due to the time of day, day of week, time of year, and public holidays.
I will describe a comprehensive forecasting solution designed to take all the available information into account, and to provide forecast distributions from a few hours ahead to a few decades ahead. We use semi-parametric additive models to estimate the relationships between demand and the covariates, including temperatures, calendar effects and some demographic and economic variables. Then we forecast the demand distributions using a mixture of temperature simulation, assumed future economic scenarios, and residual bootstrapping. The temperature simulation is implemented through a new seasonal bootstrapping method with variable blocks.
The model is being used by the state energy market operators and some electricity supply companies to forecast the probability distribution of electricity demand in various regions of Australia. It also underpinned the Victorian Vision 2030 energy strategy.
We evaluate the performance of the model by comparing the forecast distributions with the actual demand in some previous years. An important aspect of these evaluations is to find a way to measure the accuracy of density forecasts and extreme quantile forecasts.
4. Outline
1 Where fools fear to tread
2 Working with inadequate tools
3 When you can’t lose
4 Getting dirty with data
5 Going to extremes
6 Final thoughts
5. My story
Olympic video poker slots
Beware of smelly clients
Threats and slander
Nerves in court
Three university consulting services
Reviewing my own work
Six times an expert witness
Hundreds of clients
17. Disposable tableware company
Problem: Want forecasts of each of hundreds of items. Series can be stationary, trended or seasonal. They currently have a large forecasting program written in-house but it doesn’t seem to produce sensible forecasts. They want me to tell them what is wrong and fix it.
Additional information
Program written in COBOL, making numerical calculations limited. It is not possible to do any optimisation.
Their programmer has little experience in numerical computing.
They employ no statisticians and want the program to produce forecasts automatically.
18. Disposable tableware company
Methods currently used
A: 12 month average
C: 6 month average
E: straight line regression over last 12 months
G: straight line regression over last 6 months
H: average slope between last year’s and this year’s values (equivalent to differencing at lag 12 and taking the mean)
I: same as H except over 6 months
K: I couldn’t understand the explanation
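Toy re-implementations (my reading of the slide, not the original COBOL) of three of the listed methods:

```python
import numpy as np

def method_A(y):
    """Method A: the 12 month average as the forecast."""
    return np.mean(y[-12:])

def method_E(y):
    """Method E: straight-line regression over the last 12 months, extrapolated one step."""
    slope, intercept = np.polyfit(np.arange(12), np.asarray(y)[-12:], 1)
    return intercept + slope * 12

def method_H(y):
    """Method H: last value plus the mean of the lag-12 differences."""
    y = np.asarray(y)
    return y[-1] + np.mean(y[12:] - y[:-12])

y = np.arange(36) + np.random.default_rng(1).normal(size=36)  # toy monthly data
print(method_A(y), method_E(y), method_H(y))
```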
Disposable tableware company
My solution
Use first differencing to deal with trend, or seasonal differencing to deal with seasonality.
Use simple exponential smoothing on the (differenced) data, with the smoothing parameter selected from {0.1, 0.3, 0.5, 0.7, 0.9}.
For each series, try 15 models: the three differencing choices (none, first, seasonal) crossed with SES at the 5 parameter values.
Model selected based on smallest MSE. (Only one parameter for each model, so no need to penalize for model size.)
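A sketch of this selection scheme in modern Python (the original had to live inside the COBOL program, so this is for illustration only):

```python
import numpy as np

def ses(z, alpha):
    """One-step-ahead simple exponential smoothing. Returns (MSE, final level)."""
    level, sq_errors = z[0], []
    for obs in z[1:]:
        sq_errors.append((obs - level) ** 2)
        level += alpha * (obs - level)
    return np.mean(sq_errors), level

def select_model(y, m=12):
    """Try 3 differencing choices x 5 alpha values; keep the smallest-MSE combination."""
    y = np.asarray(y, dtype=float)
    candidates = {"none": y, "first": np.diff(y), "seasonal": y[m:] - y[:-m]}
    best = min(
        (ses(z, alpha) + (name, alpha)
         for name, z in candidates.items()
         for alpha in (0.1, 0.3, 0.5, 0.7, 0.9)),
        key=lambda r: r[0],
    )
    return best  # (MSE, final level, differencing, alpha)

# Forecasts undo the differencing: with final level L,
#   none:     y(T+h) = L
#   first:    y(T+h) = y(T) + h * L
#   seasonal: y(T+h) = y(T+h-m) + L   (for h <= m)
```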
Some lessons
Be pragmatic.
Understand your tools well enough to be able to adapt them.
A successful consulting job often uses very simple methods.
3 When you can't lose
Forecasting the PBS
The Pharmaceutical Benefits Scheme (PBS) is the Australian government drugs subsidy scheme.
Many drugs bought from pharmacies are subsidised to allow more equitable access to modern drugs.
The cost to government is determined by the number and types of drugs purchased. Currently nearly 1% of GDP.
The total cost is budgeted based on forecasts of drug usage.
Forecasting the PBS
In 2001: $4.5 billion budget, under-forecast by $800 million.
Thousands of products. Seasonal demand.
Subject to covert marketing, volatile products, uncontrollable expenditure.
Although monthly data are available for 10 years, the data are aggregated to annual values, and only the first three years are used in estimating the forecasts.
All forecasts being done with the FORECAST function in MS-Excel!
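For context, Excel's FORECAST function is simple linear regression, so the process above amounts to fitting a straight line through three annual totals and extrapolating it. A short Python equivalent (the example numbers are made up for illustration):

```python
import numpy as np

def excel_forecast(x, known_ys, known_xs):
    """Equivalent of Excel's FORECAST(x, known_ys, known_xs): linear regression."""
    slope, intercept = np.polyfit(known_xs, known_ys, 1)
    return intercept + slope * x

# e.g. extrapolating three (illustrative) annual totals forward:
# excel_forecast(2001, [3.0, 3.3, 3.7], [1991, 1992, 1993])
```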
ATC drug classification
A Alimentary tract and metabolism
B Blood and blood forming organs
C Cardiovascular system
D Dermatologicals
G Genito-urinary system and sex hormones
H Systemic hormonal preparations, excluding sex hormones and insulins
J Anti-infectives for systemic use
L Antineoplastic and immunomodulating agents
M Musculo-skeletal system
N Nervous system
P Antiparasitic products, insecticides and repellents
R Respiratory system
S Sensory organs
V Various
ATC drug classification
A Alimentary tract and metabolism (one of 14 top-level classes)
A10 Drugs used in diabetes (one of 84 second-level classes)
A10B Blood glucose lowering drugs
A10BA Biguanides
A10BA02 Metformin
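Because ATC codes nest by prefix (the levels are 1, 3, 4, 5 and 7 characters long), choosing an aggregation level for a series is just string slicing:

```python
code = "A10BA02"
levels = [code[:n] for n in (1, 3, 4, 5, 7)]
# ['A', 'A10', 'A10B', 'A10BA', 'A10BA02']
```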
Forecasting the PBS
Monthly data on thousands of drug groups and 4 concession types available from 1991.
Method needs to be automated and implemented within MS-Excel.
Exponential smoothing seems appropriate (monthly data with changing trends and seasonal patterns), but in 2001, automated exponential smoothing was not well-developed, and not available in MS-Excel.
As part of this project, we developed an automatic forecasting algorithm for exponential smoothing state space models based on the AIC.
Forecast MAPE reduced from 15–20% to about 0.6%.
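The Excel implementation is not public, but the idea (fit a family of exponential smoothing state space models and keep the one with the smallest AIC) is easy to sketch with statsmodels' ETSModel. The grid below is a deliberately small assumption, not the full published method:

```python
import itertools
import numpy as np
import pandas as pd
from statsmodels.tsa.exponential_smoothing.ets import ETSModel

def auto_ets(y: pd.Series, seasonal_periods: int = 12):
    """Fit a small grid of ETS models; return the fit with the lowest AIC."""
    best_aic, best_fit = np.inf, None
    for error, trend, seasonal in itertools.product(
        ["add", "mul"], [None, "add"], [None, "add", "mul"]
    ):
        try:
            fit = ETSModel(
                y,
                error=error,
                trend=trend,
                seasonal=seasonal,
                seasonal_periods=seasonal_periods if seasonal else None,
            ).fit(disp=False)
        except Exception:
            continue  # some combinations are invalid for some series
        if fit.aic < best_aic:
            best_aic, best_fit = fit.aic, fit
    return best_fit

# Usage for one monthly drug-group series y:
# fit = auto_ets(y)
# print(fit.forecast(12))
```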
Forecasting the PBS
[Figures: total cost ($ thousands), 1995–2010, for five example series: the A03 concession safety net group, and the A05, D01, S01 and R03 general copayments groups.]
Some lessons
Often what people do is very bad, and it is easy to make a big difference.
Sometimes you have to invent new methods, and that can lead to publications.
You have to implement solutions in the client's software environment.
Be aware of the politics.
4 Getting dirty with data
Airline passenger traffic
[Figure: Melbourne−Sydney passenger traffic, 1988–1993, in three panels: first class, business class, economy class.]
Not the real data! Or is it?
Airline passenger traffic
[Figure: economy class passengers (thousands), Melbourne−Sydney, 1988–1993.]
Possible model
$$Y_t = Y_t^* + Z_t, \qquad Y_t^* = \beta_0 + \sum_j \beta_j x_{t,j} + N_t$$
$Y_t$ = observed data for one passenger class.
$Y_t^*$ = reconstructed data.
$Z_t$ = latent process (usually equal to zero).
$x_{t,j}$ are covariates and dummy variables.
$N_t$ = seasonal ARIMA process of period 52.
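One way to fit the $Y_t^*$ equation (a regression on the covariates with seasonal ARIMA errors) is statsmodels' SARIMAX. The orders below are illustrative assumptions, not those used in the original project, and a period-52 seasonal component is slow to estimate in practice:

```python
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

def fit_reg_with_sarima_errors(y: pd.Series, X: pd.DataFrame):
    """Regression with seasonal ARIMA errors, as in the model form above."""
    model = SARIMAX(
        y,
        exog=X,                        # covariates and dummy variables x_{t,j}
        order=(1, 0, 1),               # short-memory ARMA errors (assumed)
        seasonal_order=(0, 1, 1, 52),  # weekly seasonality of period 52 (assumed)
    )
    return model.fit(disp=False)
```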
Some lessons
Real data is often very messy. Be aware of the causes.
Get an answer even if it isn't pretty.
What to do with the non-integer seasonality (average period 52.19)?
How to deal with the correlations between classes and between routes?
You often think of better approaches long after the project is finished.
5 Going to extremes
The problem
We want to forecast the peak electricity demand in a half-hour period in ten years' time.
We have twelve years of half-hourly electricity data, temperature data and some economic and demographic data.
The location is South Australia: home to the most volatile electricity demand in the world.
Sounds impossible?
South Australian demand data
[Figure: South Australia state-wide demand (GW), summer 2010/11, October 2010 to March 2011.]
[Figure: South Australia state-wide demand (GW), January 2011, by date.]
Temperature data (Sth Aust)
[Figure: demand (GW) versus temperature (°C) at 12 midnight, workdays versus non-workdays.]
Monash Electricity Forecasting Model
$$\log(y_t) = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t$$
$y_t$ denotes per capita demand at time $t$ (measured in half-hourly intervals) and $p$ denotes the time of day, $p = 1, \dots, 48$;
$h_p(t)$ models all calendar effects;
$f_p(w_{1,t}, w_{2,t})$ models all temperature effects, where $w_{1,t}$ is a vector of recent temperatures at location 1 and $w_{2,t}$ is a vector of recent temperatures at location 2;
$z_{j,t}$ is a demographic or economic variable at time $t$;
$n_t$ denotes the model error at time $t$.
Monash Electricity Forecasting Model
$h_p(t)$ handles annual, weekly and daily seasonal patterns as well as public holidays:
$$h_p(t) = \ell_p(t) + \alpha_{t,p} + \beta_{t,p} + \gamma_{t,p} + \delta_{t,p}$$
$\ell_p(t)$ is the "time of summer" effect (a regression spline);
$\alpha_{t,p}$ is the day-of-week effect;
$\beta_{t,p}$ is the "holiday" effect;
$\gamma_{t,p}$ is the New Year's Eve effect;
$\delta_{t,p}$ is the millennium effect.
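Constructing the inputs to $h_p(t)$ is routine feature engineering; a pandas sketch (the holiday list is an assumption to be supplied, and the dates are illustrative):

```python
import pandas as pd

idx = pd.date_range("2010-10-01", "2011-03-31 23:30", freq="30min")
holidays = pd.to_datetime(["2010-12-25", "2011-01-01"])  # illustrative only

cal = pd.DataFrame(index=idx)
cal["period"] = idx.hour * 2 + idx.minute // 30 + 1                     # p = 1, ..., 48
cal["day_of_week"] = idx.dayofweek                                      # for alpha_{t,p}
cal["day_of_summer"] = (idx.normalize() - idx.normalize().min()).days   # spline input
cal["holiday"] = idx.normalize().isin(holidays)                         # for beta_{t,p}
```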
Fitted results (summer, 3:00 pm)
[Figure: estimated calendar effects on demand at 3:00 pm: day of summer, day of week, and holiday (normal day, day before, holiday, day after).]
Monash Electricity Forecasting Model
$$f_p(w_{1,t}, w_{2,t}) = \sum_{k=0}^{6}\bigl[f_{k,p}(x_{t-k}) + g_{k,p}(d_{t-k})\bigr] + q_p(x_t^{+}) + r_p(x_t^{-}) + s_p(\bar{x}_t) + \sum_{j=1}^{6}\bigl[F_{j,p}(x_{t-48j}) + G_{j,p}(d_{t-48j})\bigr]$$
$x_t$ is the average temperature across two sites (Kent Town and Adelaide Airport) at time $t$;
$d_t$ is the temperature difference between the two sites at time $t$;
$x_t^{+}$ is the maximum of the $x_t$ values in the past 24 hours;
$x_t^{-}$ is the minimum of the $x_t$ values in the past 24 hours;
$\bar{x}_t$ is the average temperature over the past seven days.
Each function is smooth and estimated using regression splines.
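Since each term is a regression spline, a rough single-period analogue can be written with patsy's bs() inside a statsmodels formula. Column names and spline degrees of freedom are assumptions, and only a subset of the lagged terms is shown:

```python
import numpy as np
import statsmodels.formula.api as smf

formula = (
    "np.log(demand) ~ bs(day_of_summer, df=6) + C(day_of_week) + C(holiday)"
    " + bs(temp, df=6) + bs(temp_lag1, df=6) + bs(temp_diff, df=4)"
    " + bs(temp_last_week_avg, df=4) + gsp + population + price"
)

def fit_halfhour_model(df_p):
    """Fit the regression for a single half-hour period p (rows of df_p only)."""
    return smf.ols(formula, data=df_p).fit()
```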
Fitted results (summer, 3:00 pm)
[Figure: estimated temperature effects on demand at 3:00 pm, one panel each for the current temperature, lag 1, 2 and 3 temperatures, lag 1 day temperature, last week's average temperature, previous maximum temperature and previous minimum temperature.]
Monash Electricity Forecasting Model
Same predictors used for all 48 models.
Predictors chosen by cross-validation on the summers of 2007/2008 and 2009/2010.
Each model is fitted to the data twice, first excluding the summer of 2009/2010 and then excluding the summer of 2010/2011. The average out-of-sample MSE is calculated from the omitted data for the time periods 12 noon to 8.30pm.
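A sketch of that hold-out scheme: refit excluding one summer at a time and average the MSE over the omitted afternoons and evenings. The column names and half-hour indexing are assumptions:

```python
import numpy as np

def holdout_mse(df, fit_fn, summers=("2009/2010", "2010/2011"),
                periods=range(25, 42)):  # p = 25..41 ~ 12 noon to 8.30pm (assumed)
    mses = []
    for s in summers:
        train = df[df["summer"] != s]
        test = df[(df["summer"] == s) & (df["period"].isin(periods))]
        model = fit_fn(train)
        mses.append(np.mean((test["y"] - model.predict(test)) ** 2))
    return np.mean(mses)
```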
Half-hourly models
[Figure: R-squared (%) of each half-hourly model against time of day, 12 midnight to 12 midnight.]
Half-hourly models
[Figure: actual versus fitted South Australian demand (GW), January 2011, by date.]
Adjusted model
Original model
$$\log(y_t) = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t$$
Model allowing saturated usage
$$q_t = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t$$
$$\log(y_t) = \begin{cases} q_t & \text{if } q_t \le \tau; \\ \tau + k(q_t - \tau) & \text{if } q_t > \tau. \end{cases}$$
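The saturating transform is just a two-piece linear function of $q_t$; as a vectorised helper, with $\tau$ and $k$ parameters to be estimated:

```python
import numpy as np

def saturate(q, tau, k):
    """log(y) equals q below tau; above tau, it continues with slope k."""
    return np.where(q <= tau, q, tau + k * (q - tau))
```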
Peak demand forecasting
$$q_{t,p} = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t$$
Multiple alternative futures created:
$h_p(t)$ known;
simulate future temperatures using double seasonal block bootstrap with variable blocks (with adjustment for climate change);
use assumed values for GSP, population and price;
resample residuals using double seasonal block bootstrap with variable blocks.
Peak demand backcasting
$$q_{t,p} = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t$$
Multiple alternative pasts created:
$h_p(t)$ known;
simulate past temperatures using double seasonal block bootstrap with variable blocks;
use actual values for GSP, population and price;
resample residuals using double seasonal block bootstrap with variable blocks.
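The published details of the double seasonal block bootstrap are not reproduced here; the sketch below only conveys the idea: keep whole days intact to preserve the daily pattern, draw variable-length blocks of consecutive days, and draw them from near the same point in the season to preserve the annual pattern. All the tuning constants are assumptions.

```python
import numpy as np

def dsbb(resid_by_day: np.ndarray, window: int = 14,
         min_len: int = 2, max_len: int = 10, rng=None):
    """Double seasonal block bootstrap sketch.

    resid_by_day: residuals of shape (n_days, 48), one row per day.
    Returns a bootstrap sample of the same shape. Assumes n_days is
    comfortably larger than max_len.
    """
    rng = np.random.default_rng(rng)
    n_days = resid_by_day.shape[0]
    out, t = np.empty_like(resid_by_day), 0
    while t < n_days:
        length = int(rng.integers(min_len, max_len + 1))       # variable block length
        # start the block near day t, so it comes from the same part of the season
        start = int((t + rng.integers(-window, window + 1)) % (n_days - length))
        out[t:t + length] = resid_by_day[start:start + length][: n_days - t]
        t += length
    return out
```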
Peak demand backcasting
[Figure: backcast probability-of-exceedance (PoE) demand at the 10%, 50% and 90% levels, with actual annual maxima, 1998/99 to 2010/11.]
Peak demand forecasting
[Figures: scenario inputs with high, base and low paths, 1990–2020: South Australia GSP (billion 08/09 dollars), South Australia population (millions), average electricity prices (c/kWh), and major industrial offset demand.]
Peak demand distribution
[Figure: annual PoE levels at 1%, 5%, 10%, 50% and 90%, with actual annual maxima, 1998/99 to 2020/21.]
Results
We have successfully forecast the extreme upper tail in ten years' time using only twelve years of data!
This method has now been adopted for the official long-term peak electricity demand forecasts for all states except WA.
Some lessons
Cross-validation is very useful in prediction problems.
Statistical modelling is an iterative process.
Getting clients to understand percentiles is extremely difficult.
Beware of clients who think they know more than you!
6 Final thoughts
Crazy clients
The client who wouldn't tell me the problem.
The client who wanted all meetings held at random locations for security reasons.
The client who didn't like the answer.
Expert witnessing on the color purple (and now yellow).
Go forth and consult
A good statistician is not smarter than everyone else, he merely has his ignorance better organised. (Anonymous)
Go forth and consult
All models are wrong, but some are useful. (George E P Box)
Go forth and consult
It is better to solve the right problem the wrong way than the wrong problem the right way. (John W Tukey)
Slides available from robjhyndman.com