Upcoming SlideShare
×

# 18 Modelling

985 views
915 views

Published on

1 Like
Statistics
Notes
• Full Name
Comment goes here.

Are you sure you want to Yes No
• Be the first to comment

Views
Total views
985
On SlideShare
0
From Embeds
0
Number of Embeds
3
Actions
Shares
0
28
0
Likes
1
Embeds 0
No embeds

No notes for slide

### 18 Modelling

1. 1. Stat405 Quantitative models Hadley Wickham Wednesday, 28 October 2009
2. 2. 1. Strategy for analysing large data. 2. Introduction to the Texas housing data. 3. What’s happening in Houston? 4. Using a models as a tool 5. Using models in their own right Wednesday, 28 October 2009
3. 3. Large data strategy Start with a single unit, and identify interesting patterns. Summarise patterns with a model. Apply model to all units. Look for units that don’t ﬁt the pattern. Summarise with a single model. Wednesday, 28 October 2009
4. 4. Texas housing data For each metropolitan area (45) in Texas, for each month from 2000 to 2009 (112): Number of houses listed and sold Total value of houses, and average sale price Average time on market CC BY http://www.ﬂickr.com/photos/imagesbywestfall/3510831277/ Wednesday, 28 October 2009
5. 5. Strategy Start with a single city (Houston). Explore patterns & ﬁt models. Apply models to all cities. Wednesday, 28 October 2009
6. 6. 220000 200000 avgprice 180000 160000 2000 2002 2004 2006 2008 date Wednesday, 28 October 2009
7. 7. 8000 7000 6000 sales 5000 4000 3000 2000 2002 2004 2006 2008 date Wednesday, 28 October 2009
8. 8. 6.5 6.0 onmarket 5.5 5.0 4.5 4.0 2000 2002 2004 2006 2008 date Wednesday, 28 October 2009
9. 9. Seasonal trends Make it much harder to see long term trend. How can we remove the trend? (Many sophisticated techniques from time series, but what’s the simplest thing that might work?) Wednesday, 28 October 2009
10. 10. 220000 200000 avgprice 180000 160000 2 4 6 8 10 12 month Wednesday, 28 October 2009
11. 11. 220000 200000 avgprice 180000 160000 qplot(month, avgprice, data = houston, geom = "line", group = year) + stat_summary(aes(group = 1), fun.y = "mean", geom = "line", 2 4 6 8 10 12 colour = "red", size = 2, na.rm month = TRUE) Wednesday, 28 October 2009
12. 12. Challenge What does the following function do? deseas <- function(var, month) { resid(lm(var ~ factor(month))) + mean(var, na.rm = TRUE) } How could you use it in conjunction with transform to deasonalise the data? What if you wanted to deasonalise every city? Wednesday, 28 October 2009
13. 13. houston <- transform(houston, avgprice_ds = deseas(avgprice, month), listings_ds = deseas(listings, month), sales_ds = deseas(sales, month), onmarket_ds = deseas(onmarket, month) ) qplot(month, sales_ds, data = houston, geom = "line", group = year) + avg Wednesday, 28 October 2009
14. 14. 210000 200000 190000 avgprice_ds 180000 170000 160000 150000 2 4 6 8 10 12 month Wednesday, 28 October 2009
15. 15. Model as tools Here we’re using the linear model as a tool - we don’t care about the coefﬁcients or the standard errors, just using it to get rid of a striking pattern. Tukey described this pattern as residuals and reiteration: by removing a striking pattern we can see more subtle patterns. Wednesday, 28 October 2009
16. 16. 210000 200000 190000 avgprice_ds 180000 170000 160000 150000 2000 2002 2004 2006 2008 date Wednesday, 28 October 2009
17. 17. 7000 6500 6000 sales_ds 5500 5000 4500 4000 2000 2002 2004 2006 2008 date Wednesday, 28 October 2009
18. 18. 6.5 6.0 onmarket_ds 5.5 5.0 4.5 4.0 2000 2002 2004 2006 2008 date Wednesday, 28 October 2009
19. 19. Summary Most variables seem to be combination of strong seasonal pattern plus weaker long- term trend. How do these patterns hold up for the rest of Texas? We’ll focus on sales. Wednesday, 28 October 2009
20. 20. 8000 6000 sales 4000 2000 2000 2002 2004 2006 2008 date Wednesday, 28 October 2009
21. 21. 7000 6000 5000 sales_ds 4000 3000 2000 1000 2000 2002 2004 2006 2008 tx <- ddply(tx, "city", transform, date sales_ds = deseas(sales, month)) qplot(date, sales_ds, data = tx, geom = "line", group = city) Wednesday, 28 October 2009
22. 22. Is this still such a good idea? What do we lose? Is there anything else we could do to improve the plot? 7000 6000 5000 sales_ds 4000 3000 2000 1000 2000 2002 2004 2006 2008 tx <- ddply(tx, "city", transform, date sales_ds = deseas(sales, month)) qplot(date, sales_ds, data = tx, geom = "line", group = city) Wednesday, 28 October 2009
23. 23. 3.5 3.0 2.5 log10(sales) 2.0 1.5 1.0 0.5 qplot(date, log10(sales), data = tx, geom = "line", 2000 2002 2004 2006 2008 group = city) date Wednesday, 28 October 2009
24. 24. It works, but... Instead of throwing the models away and just using the residuals, let’s keep the models and explore them in more depth. Wednesday, 28 October 2009
25. 25. Two new tools dlply: takes a data frame, splits up in the same way as ddply, applies function to each piece and combines the results into a list ldply: takes a list, splits up into elements, applies function to each piece and then combines the results into a data frame dlply + ldply = ddply Wednesday, 28 October 2009
26. 26. models <- dlply(tx, "city", function(df) lm(sales ~ factor(month), data = df)) models[[1]] coef(models[[1]]) ldply(models, coef) Wednesday, 28 October 2009
27. 27. Labelling Notice we didn’t have to do anything to have the coefﬁcients labelled correctly. Behind the scenes plyr records the labels used for the split step, and ensures they are preserved across multiple plyr calls. Wednesday, 28 October 2009
28. 28. Back to the model What are some problems with this model? How could you ﬁx them? Is the format of the coefﬁcients optimal? Turn to the person next to you and discuss for 2 minutes. Wednesday, 28 October 2009
29. 29. qplot(date, log10(sales), data = tx, geom = "line", group = city) models2 <- dlply(tx, "city", function(df) lm(log10(sales) ~ factor(month), data = df)) coef2 <- ldply(models2, function(mod) { data.frame( month = 1:12, effect = c(0, coef(mod)[-1]), intercept = coef(mod)[1]) }) Wednesday, 28 October 2009
30. 30. qplot(date, log10(sales), data = tx, geom = "line", group = city) models2 <- dlply(tx, "city", function(df) lm(log10(sales) ~ factor(month), data = df)) coef2 <- ldply(models2, function(mod) { data.frame( month = 1:12, effect = c(0, coef(mod)[-1]), intercept = coef(mod)[1]) }) Put coefﬁcients in rows, so they can be plotted more easily Wednesday, 28 October 2009
31. 31. 0.4 0.3 0.2 effect 0.1 0.0 −0.1 2 4 6 8 10 12 qplot(month, effect, data = coef2, group month = city, geom = "line") Wednesday, 28 October 2009
32. 32. 2.5 2.0 10^effect 1.5 1.0 2 4 6 8 10 12 month qplot(month, 10 ^ effect, data = coef2, group = city, geom = "line") Wednesday, 28 October 2009
33. 33. Abilene Amarillo Arlington Austin Bay Area Beaumont Brazoria County 2.5 2.0 1.5 1.0 BrownsvilleBryan−College Station Collin County Corpus Christi Dallas Denton County El Paso 2.5 2.0 1.5 1.0 Fort Bend Fort Worth Galveston Garland Harlingen Houston Irving 2.5 2.0 1.5 1.0 Killeen−Fort Hood Laredo Longview−Marshall Lubbock Lufkin McAllen Midland 10^effect 2.5 2.0 1.5 1.0 Montgomery CountyNacogdoches NE Tarrant County Odessa Palestine Paris Port Arthur 2.5 2.0 1.5 1.0 San Angelo San Antonio San Marcos Sherman−DenisonTemple−Belton Texarkana Tyler 2.5 2.0 1.5 1.0 Victoria Waco Wichita Falls 2.5 2.0 1.5 1.0 2 4 6 8 1012 2 4 6 8 1012 2 4 6 8 1012 2 4 6 8 1012 2 4 6 8 1012 2 4 6 8 1012 2 4 6 8 1012 month qplot(month, 10 ^ effect, data = coef2, geom = "line") + facet_wrap(~ city) Wednesday, 28 October 2009
34. 34. What should we do next? What do you think? You have 30 seconds to come up with (at least) one idea. Wednesday, 28 October 2009
35. 35. My ideas Fit a single model, log(sales) ~ city * factor(month), and look at residuals Fit individual models, log(sales) ~ factor(month) + ns(date, 3), look cities that don’t ﬁt Wednesday, 28 October 2009
36. 36. # One approach - fit a single model mod <- lm(log10(sales) ~ city + factor(month), data = tx) tx\$sales2 <- 10 ^ resid(mod) qplot(date, sales2, data = tx, geom = "line", group = city) last_plot() + facet_wrap(~ city) Wednesday, 28 October 2009
37. 37. 3.5 3.0 2.5 sales2 2.0 1.5 1.0 0.5 2000 2002 2004 2006 2008 date qplot(date, sales2, data = tx, geom = "line", group = city) Wednesday, 28 October 2009
38. 38. Abilene Amarillo Arlington Austin Bay Area Beaumont Brazoria County 3.5 3.0 2.5 2.0 1.5 1.0 0.5 BrownsvilleBryan−College Station Collin County Corpus Christi Dallas Denton County El Paso 3.5 3.0 2.5 2.0 1.5 1.0 0.5 Fort Bend Fort Worth Galveston Garland Harlingen Houston Irving 3.5 3.0 2.5 2.0 1.5 1.0 0.5 Killeen−Fort Hood Laredo Longview−Marshall Lubbock Lufkin McAllen Midland 3.5 sales2 3.0 2.5 2.0 1.5 1.0 0.5 Montgomery CountyNacogdoches NE Tarrant County Odessa Palestine Paris Port Arthur 3.5 3.0 2.5 2.0 1.5 1.0 0.5 San Angelo San Antonio San Marcos Sherman−DenisonTemple−Belton Texarkana Tyler 3.5 3.0 2.5 2.0 1.5 1.0 0.5 Victoria Waco Wichita Falls 3.5 3.0 2.5 2.0 1.5 1.0 0.5 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 date last_plot() + facet_wrap(~ city) Wednesday, 28 October 2009
39. 39. # Another approach: Essence of most cities is seasonal # term plus long term smooth trend. We could fit this # model to each city, and then look for models which don't # fit well. library(splines) models3 <- dlply(tx, "city", function(df) { lm(log10(sales) ~ factor(month) + ns(date, 3), data = df) }) # Extract rsquared from each model rsq <- function(mod) c(rsq = summary(mod)\$r.squared) quality <- ldply(models3, rsq) Wednesday, 28 October 2009
40. 40. Wichita Falls ● Waco ● Victoria ● Tyler ● Texarkana ● Temple−Belton ● Sherman−Denison ● San Marcos ● San Antonio ● San Angelo ● Port Arthur ● Paris ● Palestine ● Odessa ● NE Tarrant County ● Nacogdoches ● Montgomery County ● Midland ● McAllen ● Lufkin ● Lubbock ● Longview−Marshall ● Laredo ● city Killeen−Fort Hood ● Irving ● Houston ● Harlingen ● Garland ● Galveston ● Fort Worth ● Fort Bend ● El Paso ● Denton County ● Dallas ● Corpus Christi ● Collin County ● Bryan−College Station ● Brownsville ● Brazoria County ● Beaumont ● Bay Area ● Austin ● Arlington ● Amarillo ● Abilene ● 0.5 0.6 0.7 0.8 0.9 rsq qplot(rsq, city, data = quality) Wednesday, 28 October 2009
41. 41. San Antonio ● Montgomery County ● Houston How are the good ● Dallas ● Bryan−College Station Collin County ﬁts different from ● ● Denton County Austin the bad ﬁts? ● ● Fort Bend ● Fort Worth ● Tyler ● NE Tarrant County ● Bay Area ● Corpus Christi ● Arlington ● Waco ● Temple−Belton ● Lubbock ● reorder(city, rsq) Garland ● Longview−Marshall ● Midland ● Laredo ● Harlingen ● Abilene ● Killeen−Fort Hood ● Brazoria County ● McAllen ● Brownsville ● Wichita Falls ● Sherman−Denison ● Irving ● Galveston ● Odessa ● San Marcos ● San Angelo ● Amarillo ● Nacogdoches ● Lufkin ● Victoria ● Beaumont ● Texarkana ● Paris ● Palestine ● El Paso ● Port Arthur ● 0.5 0.6 0.7 0.8 0.9 rsq qplot(rsq, reorder(city, rsq), data = quality) Wednesday, 28 October 2009
42. 42. quality\$poor <- quality\$rsq < 0.7 tx2 <- merge(tx, quality, by = "city") cities <- dlply(tx, "city") mfit <- mdply(cbind(mod = models3, df = cities), function(mod, df) { data.frame( city = df\$city, date = df\$date, resid = resid(mod), pred = predict(mod)) }) tx2 <- merge(tx2, mfit) Wednesday, 28 October 2009
43. 43. quality\$poor <- quality\$rsq < 0.7 tx2 <- merge(tx, quality, by = "city") Takes each row of input cities <- dlply(tx, to function and feeds it "city") mfit <- mdply(cbind(mod = models3, df = cities), function(mod, df) { data.frame( city = df\$city, date = df\$date, resid = resid(mod), pred = predict(mod)) }) tx2 <- merge(tx2, mfit) Wednesday, 28 October 2009
44. 44. Abilene Amarillo Arlington Austin Bay Area Beaumont Brazoria County 3.5 3.0 2.5 2.0 1.5 1.0 0.5 BrownsvilleBryan−College Station Collin County Corpus Christi Dallas Denton County El Paso 3.5 3.0 2.5 2.0 1.5 1.0 0.5 Fort Bend Fort Worth Galveston Garland Harlingen Houston Irving 3.5 3.0 2.5 2.0 1.5 1.0 0.5 log10(sales) Killeen−Fort Hood Laredo Longview−Marshall Lubbock Lufkin McAllen Midland 3.5 3.0 2.5 2.0 1.5 1.0 0.5 Montgomery CountyNacogdoches NE Tarrant County Odessa Palestine Paris Port Arthur 3.5 3.0 2.5 2.0 1.5 1.0 0.5 San Angelo San Antonio San Marcos Sherman−DenisonTemple−Belton Texarkana Tyler 3.5 3.0 2.5 2.0 1.5 1.0 0.5 Victoria Waco Wichita Falls 3.5 3.0 2.5 2.0 1.5 1.0 0.5 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 Raw data date Wednesday, 28 October 2009
45. 45. Abilene Amarillo Arlington Austin Bay Area Beaumont Brazoria County 3.5 3.0 2.5 2.0 1.5 1.0 BrownsvilleBryan−College Station Collin County Corpus Christi Dallas Denton County El Paso 3.5 3.0 2.5 2.0 1.5 1.0 Fort Bend Fort Worth Galveston Garland Harlingen Houston Irving 3.5 3.0 2.5 2.0 1.5 1.0 Killeen−Fort Hood Laredo Longview−Marshall Lubbock Lufkin McAllen Midland 3.5 3.0 pred 2.5 2.0 1.5 1.0 Montgomery CountyNacogdoches NE Tarrant County Odessa Palestine Paris Port Arthur 3.5 3.0 2.5 2.0 1.5 1.0 San Angelo San Antonio San Marcos Sherman−DenisonTemple−Belton Texarkana Tyler 3.5 3.0 2.5 2.0 1.5 1.0 Victoria Waco Wichita Falls 3.5 3.0 2.5 2.0 1.5 1.0 Predictions 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 date Wednesday, 28 October 2009
46. 46. Abilene Amarillo Arlington Austin Bay Area Beaumont Brazoria County 0.4 0.2 0.0 −0.2 −0.4 BrownsvilleBryan−College Station Collin County Corpus Christi Dallas Denton County El Paso 0.4 0.2 0.0 −0.2 −0.4 Fort Bend Fort Worth Galveston Garland Harlingen Houston Irving 0.4 0.2 0.0 −0.2 −0.4 Killeen−Fort Hood Laredo Longview−Marshall Lubbock Lufkin McAllen Midland 0.4 0.2 resid 0.0 −0.2 −0.4 Montgomery County Nacogdoches NE Tarrant County Odessa Palestine Paris Port Arthur 0.4 0.2 0.0 −0.2 −0.4 San Angelo San Antonio San Marcos Sherman−DenisonTemple−Belton Texarkana Tyler 0.4 0.2 0.0 −0.2 −0.4 Victoria Waco Wichita Falls 0.4 0.2 0.0 −0.2 −0.4 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 Residuals date Wednesday, 28 October 2009
47. 47. Your turn Pick one of the other variables and perform a similar exploration, working through the same steps. Wednesday, 28 October 2009
48. 48. Conclusions Simple (and relatively small) example, but shows how collections of models can be useful for gaining understanding. Each attempt illustrated something new about the data. Plyr made it easy to create and summarise collection of models, so we didn’t have to worry about the mechanics. Wednesday, 28 October 2009