plyr: Modelling large data
Hadley Wickham
Tuesday, 7 July 2009

1. plyr: Modelling large data. Hadley Wickham.

2. Outline
   1. Strategy for analysing large data.
   2. Introduction to the Texas housing data.
   3. What's happening in Houston?
   4. Using models as a tool.
   5. Using models in their own right.

3. Large data strategy
   Start with a single unit, and identify interesting patterns. Summarise the patterns with a model. Apply the model to all units. Look for units that don't fit the pattern. Summarise with a single model.

4. Texas housing data
   For each of the 45 metropolitan areas in Texas, for each of the 112 months from 2000 to 2009:
   - number of houses listed and sold
   - total value of houses, and average sale price
   - average time on market
   Photo CC BY http://www.flickr.com/photos/imagesbywestfall/3510831277/

5. Strategy
   Start with a single city (Houston). Explore patterns and fit models. Apply the models to all cities.

6. [Plot: avgprice vs date for Houston, 2000 to 2008.]

7. [Plot: sales vs date for Houston, 2000 to 2008.]

8. [Plot: onmarket vs date for Houston, 2000 to 2008.]

9. Seasonal trends
   Seasonal trends make it much harder to see the long-term trend. How can we remove them? (There are many sophisticated techniques from time series, but what's the simplest thing that might work?)

10. [Plot: avgprice vs month (1 to 12) for Houston.]

11. Challenge
    What does the following function do?

      deseas <- function(var, month) {
        resid(lm(var ~ factor(month))) + mean(var, na.rm = TRUE)
      }

    How could you use it in conjunction with transform to deseasonalise the data? What if you wanted to deseasonalise every city?

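A minimal sketch on made-up data (not from the slides) of what deseas computes: the lm() call fits a mean for each month, resid() removes those monthly means, and adding back mean(var) keeps the series on its original scale.

      # Toy data: a seasonal signal plus noise (illustrative only).
      month <- rep(1:12, times = 3)
      var   <- 100 + 10 * sin(2 * pi * month / 12) + rnorm(36)

      deseas <- function(var, month) {
        resid(lm(var ~ factor(month))) + mean(var, na.rm = TRUE)
      }

      var_ds <- deseas(var, month)
      mean(var_ds)                  # same as mean(var): level preserved
      tapply(var_ds, month, mean)   # monthly means are now all equal

After deseasonalising, every month has the same mean (the overall mean), so only the non-seasonal variation remains.
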
12.   houston <- transform(houston,
        avgprice_ds = deseas(avgprice, month),
        listings_ds = deseas(listings, month),
        sales_ds    = deseas(sales, month),
        onmarket_ds = deseas(onmarket, month)
      )

      qplot(month, sales_ds, data = houston, geom = "line",
        group = year) + avg

13. [Plot: avgprice_ds vs month for Houston.]

14. Models as tools
    Here we're using the linear model as a tool: we don't care about the coefficients or the standard errors, we're just using it to remove a striking pattern. Tukey described this process as residuals and reiteration: by removing a striking pattern we can see more subtle patterns.

15. [Plot: avgprice_ds vs date for Houston, 2000 to 2008.]

16. [Plot: sales_ds vs date for Houston, 2000 to 2008.]

17. [Plot: onmarket_ds vs date for Houston, 2000 to 2008.]

18. Summary
    Most variables seem to be a combination of a strong seasonal pattern plus a weaker long-term trend. How do these patterns hold up for the rest of Texas? We'll focus on sales.

19. [Plot: sales vs date, 2000 to 2008.]

20.   tx <- read.csv("tx-house-sales.csv")

      qplot(date, sales, data = tx, geom = "line",
        group = city)

      tx <- ddply(tx, "city", transform,
        sales_ds = deseas(sales, month))

      qplot(date, sales_ds, data = tx, geom = "line",
        group = city)

21. [Plot: sales_ds vs date, 2000 to 2008, one line per city.]

22. It works, but...
    It doesn't give us any insight into the similarity of the patterns across multiple cities. Are the trends the same or different? So instead of throwing the models away and just using the residuals, let's keep the models and explore them in more depth.

23. Two new tools
    dlply: takes a data frame, splits it up in the same way as ddply, applies a function to each piece, and combines the results into a list.
    ldply: takes a list, splits it up into elements, applies a function to each piece, and combines the results into a data frame.
    dlply + ldply = ddply

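A quick sketch on toy data (not from the slides) of the identity on the last line: splitting with dlply and then combining with ldply should give the same rows as a single ddply call.

      library(plyr)

      df <- data.frame(g = rep(c("a", "b"), each = 5), x = rnorm(10))

      # One step: data frame in, data frame out.
      ddply(df, "g", summarise, mean_x = mean(x))

      # Two steps: data frame -> list of results -> data frame.
      pieces <- dlply(df, "g", summarise, mean_x = mean(x))
      ldply(pieces)

The two-step version matters when the intermediate objects (here trivial, but soon whole fitted models) are worth keeping around.
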
24.   models <- dlply(tx, "city", function(df)
        lm(sales ~ factor(month), data = df))

      models[[1]]
      coef(models[[1]])
      ldply(models, coef)

25. Labelling
    Notice we didn't have to do anything to have the coefficients labelled correctly. Behind the scenes, plyr records the labels used for the split step, and ensures they are preserved across multiple plyr calls.

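A small sketch of where those labels live (assuming the models list from slide 24): dlply() names the list elements after the split values and attaches the split labels as an attribute, which ldply() uses when it rebuilds a data frame.

      names(models)                 # one element per city
      attr(models, "split_labels")  # the labels plyr carries along
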
26. Back to the model
    What are some problems with this model? How could you fix them? Is the format of the coefficients optimal? Turn to the person next to you and discuss for 2 minutes.

27.   qplot(date, log10(sales), data = tx, geom = "line",
        group = city)

      models2 <- dlply(tx, "city", function(df)
        lm(log10(sales) ~ factor(month), data = df))

      coef2 <- ldply(models2, function(mod) {
        data.frame(
          month = 1:12,
          effect = c(0, coef(mod)[-1]),
          intercept = coef(mod)[1])
      })

28. The same code, annotated:

      qplot(date, log10(sales), data = tx, geom = "line",
        group = city)

      # Log-transform sales to make the coefficients
      # comparable (they become ratios).
      models2 <- dlply(tx, "city", function(df)
        lm(log10(sales) ~ factor(month), data = df))

      # Put the coefficients in rows, so they can be
      # plotted more easily.
      coef2 <- ldply(models2, function(mod) {
        data.frame(
          month = 1:12,
          effect = c(0, coef(mod)[-1]),
          intercept = coef(mod)[1])
      })

29. [Plot: effect vs month, one line per city.]

      qplot(month, effect, data = coef2, group = city,
        geom = "line")

30. [Plot: 10^effect vs month, one line per city.]

      qplot(month, 10 ^ effect, data = coef2, group = city,
        geom = "line")

31. [Plot: 10^effect vs month, faceted by city.]

      qplot(month, 10 ^ effect, data = coef2, geom = "line") +
        facet_wrap(~ city)

32. What should we do next?
    What do you think? You have 30 seconds to come up with (at least) one idea.

33. My ideas
    Fit a single model, log(sales) ~ city * factor(month), and look at the residuals.
    Fit individual models, log(sales) ~ factor(month) + ns(date, 3), and look for cities that don't fit.

34.   # One approach: fit a single model.
      mod <- lm(log10(sales) ~ city + factor(month),
        data = tx)

      # Back-transform the residuals; sales2 is the ratio of
      # actual to predicted sales.
      tx$sales2 <- 10 ^ resid(mod)

      qplot(date, sales2, data = tx, geom = "line",
        group = city)
      last_plot() + facet_wrap(~ city)

35. [Plot: sales2 vs date, one line per city.]

      qplot(date, sales2, data = tx, geom = "line", group = city)

36. [Plot: sales2 vs date, faceted by city.]

      last_plot() + facet_wrap(~ city)

37.   # Another approach: the essence of most cities is a
      # seasonal term plus a long-term smooth trend. We could
      # fit this model to each city, and then look for models
      # which don't fit well.
      library(splines)
      models3 <- dlply(tx, "city", function(df) {
        lm(log10(sales) ~ factor(month) + ns(date, 3),
          data = df)
      })

      # Extract R-squared from each model.
      rsq <- function(mod) c(rsq = summary(mod)$r.squared)
      quality <- ldply(models3, rsq)

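The rsq() helper generalises: any function that maps a model to a named vector can be fed to ldply(). A sketch (assuming models3 from above; the extra statistics are illustrative, all taken from summary.lm):

      fit_stats <- function(mod) {
        s <- summary(mod)
        c(rsq = s$r.squared, adj_rsq = s$adj.r.squared,
          sigma = s$sigma)
      }
      quality2 <- ldply(models3, fit_stats)
      head(quality2)
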
38. [Dot plot: rsq for each city, cities in alphabetical order; rsq runs from about 0.5 to 0.9.]

      qplot(rsq, city, data = quality)

39. [Dot plot: rsq for each city, ordered by rsq. How are the good fits different from the bad fits?]

      qplot(rsq, reorder(city, rsq), data = quality)

40.   quality$poor <- quality$rsq < 0.7
      tx2 <- merge(tx, quality, by = "city")

      mfit <- ldply(models3, function(mod) {
        data.frame(
          resid = resid(mod),
          pred = predict(mod))
      })
      tx2 <- cbind(tx2, mfit[, -1])

    Can you think of any potential problems with the last line?

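One answer to the question on the slide (an editorial note, not from the deck): cbind() assumes the rows of tx2 and mfit line up exactly, but merge() does not promise to preserve the original row order. A more defensive sketch joins on explicit keys instead; the obs helper column is an assumption introduced here, and it relies on resid() and predict() returning one value per input row (i.e. no rows dropped for missing values).

      # Editorial sketch: join on (city, obs) instead of
      # trusting row order.
      mfit <- ldply(models3, function(mod) {
        data.frame(
          obs   = seq_along(resid(mod)),  # position within the city
          resid = resid(mod),
          pred  = predict(mod))
      })
      tx$obs <- ave(seq_along(tx$city), tx$city, FUN = seq_along)
      tx2 <- merge(merge(tx, quality, by = "city"), mfit,
        by = c("city", "obs"))
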
41. [Plot: raw data. log10(sales) vs date, faceted by city.]

42. [Plot: predictions. pred vs date, faceted by city.]

43. [Plot: residuals. resid vs date, faceted by city.]

44. Conclusions
    A simple (and relatively small) example, but it shows how collections of models can be useful for gaining understanding. Each attempt illustrated something new about the data. plyr made it easy to create and summarise a collection of models, so we didn't have to worry about the mechanics.
