Computational Social Science, Lecture 11: Regression

Published in: Education, Technology

  1. Regression. APAM E4990: Computational Social Science. Jake Hofman, Columbia University. April 12, 2013. [Footer: Jake Hofman (Columbia University), Regression, April 12, 2013, page 1 / 37]
  2. Definition?
  5. Definition
      “The primary goal in a regression analysis is to understand, as far as possible with the available data, how the conditional distribution of the response varies across subpopulations determined by the possible values of the predictor or predictors.”
      “Applied Regression Including Computing and Graphics”, Cook & Weisberg (1999)
  7. Goals
      • Describe: provide a compact summary of outcomes under different conditions. Never “false”, but may be wasteful or misleading.
      • Predict: make forecasts for future outcomes or unobserved conditions. Varying degrees of success, often room for improvement.
      • Explain: account for associations between predictors and outcomes. Difficult to establish causality in observational studies.
      See “Regression Analysis: A Constructive Critique”, Berk (2004)
  8. Goals
      Models should be flexible enough to describe observed phenomena but simple enough to generalize to future observations.
  9. Examples¹
      Excerpt from Section 1.2, “Setting the Regression Context”: […] “of some response y varies across subpopulations determined by the possible values of the predictor or predictors” (Cook and Weisberg, 1999: 27). That is, interest centers on the distribution of the response variable Y conditional on one or more predictors X.
      Figures 1.1 and 1.2 show the distributions of SAT scores for recent applicants to a major university who self-identify as “Hispanic” and as “Asian”, respectively. The two distributions differ substantially: the Asian distribution is shifted to the right, leading to a distribution with a higher mean (1227 compared to 1072), a smaller standard deviation (170 compared to […]), and greater skewing. A comparative description of the two histograms constitutes a proper regression analysis.
      Should one be especially interested in a comparison of the means, one could proceed descriptively with a conventional least squares regression analysis as a special case. That is, for each observation i, one could let
      $\hat{y}_i = \beta_0 + \beta_1 x_i$   (1.1)
      where the response variable y is each applicant’s SAT score and x is an indicator […]
      [Figure 1.2: histogram, “SAT Scores for Asian Applicants”; x-axis: SAT score (600 to 1600); y-axis: frequency (0 to 150).]
      ¹ “Statistical Learning from a Regression Perspective”, Berk (2008)
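A minimal sketch of the point made by equation (1.1): when x is a 0/1 group indicator, the least-squares intercept equals the mean of the x = 0 group and the slope equals the difference in group means. The scores below are made up for illustration and are not the data behind the figures; the function name `fit_simple_ols` is an assumption of this sketch.

```python
from statistics import mean


def fit_simple_ols(x, y):
    """Least-squares fit of y = b0 + b1 * x for a single predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical SAT scores, invented for this example only.
group_a = [1000, 1100, 1050, 1150]  # indicator x = 0
group_b = [1200, 1250, 1300, 1150]  # indicator x = 1

x = [0] * len(group_a) + [1] * len(group_b)
y = group_a + group_b
b0, b1 = fit_simple_ols(x, y)

# b0 is the mean of group_a; b0 + b1 is the mean of group_b.
print(b0, b1, mean(group_a), mean(group_b))
```

In this special case the regression adds nothing beyond the two group means, which is why a comparison of histograms already counts as a regression analysis in the sense of the excerpt.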
  10. Examples¹
      [Figure 1.4: “SAT Score by Household Income”, SAT scores by family income; x-axis: income, bounded at $100,000; y-axis: SAT score (800 to 1600).]
      ¹ “Statistical Learning from a Regression Perspective”, Berk (2008)
  11. Examples¹
      [Figure 1.5: freshman GPA on SAT holding high school GPA constant; panels by high school GPA; x-axis: SAT score (400 to 1600); y-axis: freshman GPA (1 to 4).]
      ¹ “Statistical Learning from a Regression Perspective”, Berk (2008)
  12. Examples
      • How useful is online activity for predicting real-world outcomes? “Predicting consumer activity with Web search”, Goel, Hofman, Lahaie, Pennock & Watts, PNAS 2010
      • How do online experiences vary across subpopulations? “Who Does What on the Web”, Goel, Hofman & Sirer, ICWSM 2012
  13. Predicting consumer activity with Web search
      with Sharad Goel, Sébastien Lahaie, David Pennock, Duncan Watts
      [Figure: “Right Round”, Billboard rank and search rank by week, March to August 2009.]
  14. Search predictions: Motivation
      Does collective search activity provide useful predictive signal about real-world outcomes?
      [Figure: “Right Round”, Billboard rank and search rank by week, March to August 2009.]
  15. Search predictions: Motivation
      Past work mainly focuses on predicting the present² and ignores baseline models trained on publicly available data.
      [Figure: flu level (percent, 1 to 8) by date, 2004 to 2010; series: Actual, Search, Autoregressive.]
      ² Varian, 2009
  16. Search predictions: Motivation
      We predict future sales for movies, video games, and music.
      [Figure: (a) “Transformers 2” and (b) “Tom Clancys HAWX”, search volume versus time to release (days, −30 to 30); (c) “Right Round”, Billboard rank and search rank by week, March to August 2009.]
  17. Search predictions: Search models
      For movies and video games, predict opening weekend box office and first month sales, respectively:
      $\log(\text{revenue}) = \beta_0 + \beta_1 \log(\text{search}) + \epsilon$
      For music, predict the following week’s Billboard Hot 100 rank:
      $\text{billboard}_{t+1} = \beta_0 + \beta_1 \text{search}_t + \beta_2 \text{search}_{t-1} + \epsilon$
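The two search models on this slide can be sketched as follows, assuming ordinary least squares and made-up numbers. The first part fits a line to log(search) and log(revenue); the second only builds the lagged training rows for the music model. The function names `fit_line` and `lagged_rows` and all data values are assumptions of this sketch, not taken from the paper.

```python
import math


def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def lagged_rows(billboard, search):
    """Pair (search_t, search_{t-1}) with the target billboard_{t+1}.

    Both lists are weekly series of equal length, indexed by week t.
    """
    rows = []
    for t in range(1, len(search) - 1):
        rows.append(((search[t], search[t - 1]), billboard[t + 1]))
    return rows


# Synthetic data following an exact power law: revenue = 3 * search^2.
search = [10, 100, 1000, 10000]
revenue = [3 * s ** 2 for s in search]
b0, b1 = fit_line([math.log(s) for s in search],
                  [math.log(r) for r in revenue])
# On the log-log scale the slope recovers the exponent (2) and the
# intercept recovers log(3).

rows = lagged_rows(billboard=[10, 9, 8, 7], search=[1, 2, 3, 4])
```

Taking logs of both sides means β1 is read as an elasticity: a 1% change in search volume goes with roughly a β1% change in revenue.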
  18. Search predictions: Search volume
  19. Search predictions: Search models
      Search activity is predictive for movies, video games, and music weeks to months in advance.
      [Figure: (a) Movies and (b) Video Games (non-sequel versus sequel), actual versus predicted revenue in dollars on log scales; (c) Music, actual versus predicted Billboard rank (0 to 100); (d) to (f) model fit (0.4 to 0.9) versus time to release (weeks, −6 to 0) for movies, video games, and music.]
  20. Search predictions: Baseline models
      For movies, use budget, number of opening screens, and Hollywood Stock Exchange:
      $\log(\text{revenue}) = \beta_0 + \beta_1 \log(\text{budget}) + \beta_2 \log(\text{screens}) + \beta_3 \log(\text{hsx}) + \epsilon$
  21. Search predictions: Baseline models
      For video games, use critic ratings and predecessor sales (sequels only):
      $\log(\text{revenue}) = \beta_0 + \beta_1 \text{rating} + \beta_2 \log(\text{predecessor}) + \epsilon$
  22. Search predictions: Baseline models
      For music, use an autoregressive model with the previously available rank:
      $\text{billboard}_{t+1} = \beta_0 + \beta_1 \text{billboard}_{t-1} + \epsilon$
  23. Search predictions: Baseline + combined models
      Baseline models are often surprisingly good.
      [Figure: actual versus predicted values for (a) to (c) the baseline models and (d) to (f) the combined models, for movies, video games (non-sequel versus sequel), and music; revenue in dollars on log scales, Billboard rank from 0 to 100.]
  24. Search predictions: Model comparison
      For movies, search is outperformed by the baseline and of little marginal value.
      [Figure: model fit (0.4 to 1.0) of the baseline, search, and combined models for non-sequel games, sequel games, music, movies, and flu.]
  25. Search predictions: Model comparison
      For video games, search helps substantially for non-sequels, less so for sequels.
      [Same figure.]
  26. Search predictions: Model comparison
      For music, the addition of search yields a substantially better combined model.
      [Same figure.]
  27. Search predictions: Summary
      • Relative performance and value of search varies across domains
      • Search provides a fast, convenient, and flexible signal across domains
      • “Predicting consumer activity with Web search”, Goel, Hofman, Lahaie, Pennock & Watts, PNAS 2010
  28. Demographic diversity on the Web
      with Irmak Sirer and Sharad Goel (ICWSM 2012)
      [Figure: daily per-capita pageviews (0 to 70) by income (over/under $25k), race (Black & Hispanic/White), education (no college/some college), age (over/under 65), and sex (female/male).]
  29. Motivation
      Previous work is largely survey-based and focuses on group-level differences in online access.
  30. Motivation
      “As of January 1997, we estimate that 5.2 million African Americans and 40.8 million whites have ever used the Web, and that 1.4 million African Americans and 20.3 million whites used the Web in the past week.”
      Hoffman & Novak (1998)
  31. Motivation
      Focus on activity instead of access.
      How diverse is the Web?
      To what extent do online experiences vary across demographic groups?
  32. Data
      • Representative sample of 265,000 individuals in the US, paid via the Nielsen MegaPanel³
      • Log of anonymized, complete browsing activity from June 2009 through May 2010 (URLs viewed, timestamps, etc.)
      • Detailed individual and household demographic information (age, education, income, race, sex, etc.)
      ³ Special thanks to Mainak Mazumdar
  36. Data
      # ls -alh nielsen_megapanel.tar
      -rw-r--r-- 100G Jul 17 13:00 nielsen_megapanel.tar
      • Normalize pageviews to at most three domain levels, sans www, e.g. www.yahoo.com → yahoo.com, us.mg2.mail.yahoo.com/neo/launch → mail.yahoo.com
      • Restrict to top 100k (out of 9M+ total) most popular sites (by unique visitors)
      • Aggregate activity at the site, group, and user levels
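The normalization step described in the Data bullets can be sketched as below. This is one plausible reading of “at most three domain levels, sans www”, checked only against the two examples given on the slide; the function name `normalize_domain` and its `max_levels` parameter are assumptions of this sketch, and the original pipeline (Hadoop, Pig, and awk) may have handled edge cases differently.

```python
def normalize_domain(url, max_levels=3):
    """Reduce a URL to its host, drop a leading 'www', keep the last labels."""
    # Strip an optional scheme, then anything after the host name.
    host = url.split("://", 1)[-1]
    host = host.split("/", 1)[0].split("?", 1)[0].split(":", 1)[0].lower()
    labels = [label for label in host.split(".") if label]
    # Drop a leading "www" label before counting domain levels.
    if labels and labels[0] == "www":
        labels = labels[1:]
    # Keep at most the last `max_levels` labels.
    return ".".join(labels[-max_levels:])


print(normalize_domain("www.yahoo.com"))
print(normalize_domain("us.mg2.mail.yahoo.com/neo/launch"))
```

Collapsing URLs to a short host name like this is what makes it possible to count unique visitors per site before restricting to the most popular ones.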
  37. Hadoop + Pig (+ awk)
      100GB → ∼1GB
  38. Aggregate usage patterns
      How do users distribute their time across different categories?
      All groups spend the majority of their time in the top five most popular categories.
      [Figure: fraction of total pageviews (0.05 to 0.25) for social media, e-mail, games, portals, and search.]
  39. Aggregate usage patterns
      How do users distribute their time across different categories?
      Highly active users devote nearly twice as much of their time to social media relative to typical individuals.
      [Figure: fraction of pageviews in category (0.05 to 0.30) versus user rank by daily activity (10% to 90%), for social media, e-mail, games, portals, and search.]
  40. Group-level activity
      How does browsing activity vary at the group level?
      Large differences exist even at the aggregate level (e.g. women on average generate 40% more pageviews than men).
      [Figure: daily per-capita pageviews (0 to 70) by income (over/under $25k), race (Black & Hispanic/White), education (no college/some college), age (over/under 65), and sex (female/male).]
  41. Group-level activity
      How does browsing activity vary at the group level?
      Younger and more educated individuals are both more likely to access the Web and more active once they do.
      [Same figure.]
  42. Group-level activity
      All demographic groups spend the majority of their time in the same categories.
      [Figure: fraction of total pageviews in social media, e-mail, games, portals, and search, by age (5 to 80), education (grammar school through post-graduate degree), sex, income ($0−25k through $150k+), and race (Other, Hispanic, Black, White, Asian).]
  43. Group-level activity
      Older, more educated, male, wealthier, and Asian Internet users spend a smaller fraction of their time on social media.
      [Same figure.]
  44. Group-level activity
      Lower social media use by these groups is often accompanied by higher e-mail volume.
      [Same figure.]
  45. Group-level activity
      [Figure: female-to-male pageview ratio (0.5 to 2) by site category, with categories running from Apparel/Beauty, Family Resources, and Pets at one end, through E-mail, Search, and News & Information, to Online Trading, Sports, and Adult at the other.]
  46. Revisiting the digital divide
      How does usage of news, health, and reference vary with demographics?
      Post-graduates spend three times as much time on health sites as adults with only some high school education.
      [Figure: average pageviews per month (0 to 12) on news, health, and reference sites, by education (grammar school through post-graduate degree), sex, income ($0−25k through $150k+), and race (Other, Hispanic, Black, White, Asian).]
  47. Revisiting the digital divide
      How does usage of news, health, and reference vary with demographics?
      Asians spend more than 50% more time browsing online news than do other race groups.
      [Same figure.]
  48. Revisiting the digital divide
      How does usage of news, health, and reference vary with demographics?
      Even when less educated and less wealthy groups gain access to the Web, they utilize these resources relatively infrequently.
      [Same figure.]
  49. Revisiting the digital divide
      How does usage of news, health, and reference vary with demographics?
      Controlling for other variables, effects of race and gender largely disappear, while education continues to have a large effect.
      $p_i = \sum_j \alpha_j x_{ij} + \sum_j \sum_k \beta_{jk} x_{ij} x_{ik} + \sum_j \gamma_j x_{ij}^2 + \epsilon_i$
      [Figure: average pageviews per month (0 to 12) on news, health, and reference sites by education (high school graduate through post-graduate degree), with separate series for Asian, Black, Hispanic, and White users.]
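The regression on this slide uses main effects, pairwise interactions, and squared terms of each user's attributes. A minimal sketch of how such a feature vector might be built is below; it assumes each unordered pair (j < k) appears once, which the slide does not state, and the names `expand_features` and `predict` are hypothetical.

```python
from itertools import combinations


def expand_features(x):
    """Return main effects, pairwise interactions (j < k), and squares."""
    main = list(x)
    interactions = [x[j] * x[k] for j, k in combinations(range(len(x)), 2)]
    squares = [value * value for value in x]
    return main + interactions + squares


def predict(x, coefficients):
    """Linear prediction: dot product of expanded features and coefficients."""
    features = expand_features(x)
    if len(features) != len(coefficients):
        raise ValueError("expected %d coefficients" % len(features))
    return sum(f * c for f, c in zip(features, coefficients))


# Three made-up attribute values for one user.
print(expand_features([1, 2, 3]))
```

For 0/1 indicator attributes the squared term equals the attribute itself, so the squared terms only add information for numeric predictors such as age.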
  50. Revisiting the digital divide
      How does usage of news, health, and reference vary with demographics?
      However, women spend considerably more time on health sites compared to men.
      [Figure: average pageviews per month (0 to 12) on health sites by education (high school graduate through post-graduate degree), female versus male.]
  51. Revisiting the digital divide
      How does usage of news, health, and reference vary with demographics?
      However, women spend considerably more time on health sites compared to men, although means can be misleading.
      [Figure: monthly pageviews on health sites (20 to 100), female versus male.]
  52. Individual-level prediction
      How well can one predict an individual’s demographics from their browsing activity?
      • Represent each user by the set of sites visited
      • Fit linear models⁴ to predict majority/minority for each attribute on 80% of users
      • Tune model parameters using a 10% validation set
      • Evaluate final performance on held-out 10% test set
      ⁴ http://bit.ly/svmperf
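The 80% / 10% / 10% protocol in the bullets above can be sketched as a random split of user identifiers. The shuffling, the fixed seed, and the function name `split_users` are assumptions of this sketch; the slides do not say how users were assigned to each set.

```python
import random


def split_users(user_ids, train=0.8, validation=0.1, seed=0):
    """Shuffle users and split them into train, validation, and test lists."""
    ids = list(user_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    n_train = round(len(ids) * train)
    n_validation = round(len(ids) * validation)
    train_ids = ids[:n_train]
    validation_ids = ids[n_train:n_train + n_validation]
    test_ids = ids[n_train + n_validation:]  # whatever remains is held out
    return train_ids, validation_ids, test_ids


train_ids, validation_ids, test_ids = split_users(range(100))
print(len(train_ids), len(validation_ids), len(test_ids))
```

Keeping the test set untouched until the end matters here because model parameters are tuned on the validation set, so validation accuracy alone would be an optimistic estimate.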
  53. Individual-level prediction
      Reasonable (∼70-85%) accuracy and AUC across all attributes.
      [Figure: accuracy and AUC (0.5 to 1) for five classification tasks: College/No College, Under/Over $50,000 Household Income, White/Non-White, Female/Male, Over/Under 25 Years Old.]
  54. Individual-level prediction
      Highly-weighted sites under the fitted models

      Attribute                        Large positive weight            Large negative weight
      Female                           winster.com, lancome-usa.com     sports.yahoo.com, espn.go.com
      White                            marlboro.com, cmt.com            mediatakeout.com, bet.com
      College Educated                 news.yahoo.com, linkedin.com     youtube.com, myspace.com
      Over 25 Years Old                evite.com, classmates.com        addictinggames.com, youtube.com
      Household Income Under $50,000   eharmony.com, tracfone.com       rownine.com, matrixdirect.com

      Table 2: A selection of the most predictive (i.e., most highly weighted) sites for each classification task.
      [Figure 7: AUC and accuracy (0.5 to 1) for College/No College, Under/Over $50,000 Household Income, White/Non-White, Female/Male, and Over/Under 25 Years Old.]
      […] Figure 7, a measure that effectively re-normalizes the majority and minority classes to have equal size. Intuitively, AUC is the probability that a model scores a randomly selected positive example higher than a randomly selected negative one (e.g., the probability that the model correctly distinguishes between a randomly selected female and male). Though an uninformative rule would correctly discriminate between such pairs 50% of the time, predictions based on […]
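The pairwise reading of AUC quoted on this slide can be computed directly by comparing every positive example with every negative one. The sketch below assumes labels coded as 1 (positive) and 0 (negative) and counts ties as half a win, a common convention that the excerpt does not mention; the scores are made up.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # assumption: a tie counts as half a correct pair
    return wins / (len(positives) * len(negatives))


# Made-up model scores for two positive and two negative users.
print(auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
```

This quadratic loop is fine for illustration; for large samples the same quantity is usually computed from ranks. Because every pair has exactly one positive and one negative, the result does not depend on how unbalanced the classes are, which is the re-normalization the excerpt refers to.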
  55. Individual-level prediction
      Substantially better performance when restricted to “stereotypical” users (∼80-90%).
      [Figure: AUC and accuracy (0.70 to 0.95) versus fraction of users (0.0 to 1.0), for age, sex, race, education, and income.]
  56. Individual-level prediction
      Similar performance even when restricted to top 1k sites.
      [Figure: AUC and accuracy (0.5 to 0.9) versus number of domains (10^2 to 10^5, log scale), for age, sex, race, education, and income.]
  57. Individual-level prediction
      Proof of concept browser demo: http://bit.ly/surfpreds
  58. Summary
      • Highly active users spend disproportionately more of their time on social media and less on e-mail relative to the overall population
      • Access to research, news, and healthcare is strongly related to education, not as closely to ethnicity
      • User demographics can be inferred from browsing activity with reasonable accuracy
      • “Who Does What on the Web”, Goel, Hofman & Sirer, ICWSM 2012
