• Share
  • Email
  • Embed
  • Like
  • Save
  • Private Content
Data-driven modeling: Lecture 01
 

Data-driven modeling: Lecture 01

on

  • 10,894 views

 

Statistics

Views

Total Views
10,894
Views on SlideShare
7,153
Embed Views
3,741

Actions

Likes
1
Downloads
58
Comments
0

3 Embeds 3,741

http://jakehofman.com 3733
http://us-w1.rockmelt.com 7
http://www.twylah.com 1

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    Data-driven modeling: Lecture 01 Data-driven modeling: Lecture 01 Presentation Transcript

    • Data-driven modeling APAM E4990 Jake Hofman Columbia University January 23, 2012Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 1 / 19
    • Data-dependent products • Effective/practical systems that learn from experience impact our daily lives, e.g.: • Recommendation systems • Spam detection • Optical character recognition • Face recognition • Fraud detection • Machine translation • ...Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 2 / 19
    • Learning by exampleJake Hofman (Columbia University) Data-driven modeling January 23, 2012 3 / 19
    • Learning by example • How did you solve this problem? • Can you make this process explicit (e.g. write code to do so)?Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 3 / 19
    • Learning by example • We learn quickly from few, relatively unstructured examples ... but we don’t understand how we accomplish this • We’d like to develop algorithms that enable machines to learn by example from large data setsJake Hofman (Columbia University) Data-driven modeling January 23, 2012 4 / 19
    • Got data? • Web service APIs expose lots of dataJake Hofman (Columbia University) Data-driven modeling January 23, 2012 5 / 19
    • Got data? • Many free, public data sets available onlineJake Hofman (Columbia University) Data-driven modeling January 23, 2012 6 / 19
    • Black-boxified?Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 7 / 19
    • Black-boxified?Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 7 / 19
    • Roadmap? Step 1: Have data Step 2: ??? Step 3: ProfitJake Hofman (Columbia University) Data-driven modeling January 23, 2012 8 / 19
    • Roadmap, take two 1 Get dataJake Hofman (Columbia University) Data-driven modeling January 23, 2012 9 / 19
    • Roadmap, take two 1 Get data 2 Visualize/perform sanity checks 3 Clean/filter observations 4 Choose features to represent dataJake Hofman (Columbia University) Data-driven modeling January 23, 2012 9 / 19
    • Roadmap, take two 1 Get data 2 Visualize/perform sanity checks 3 Clean/filter observations 4 Choose features to represent data 5 Specify model 6 Specify loss functionJake Hofman (Columbia University) Data-driven modeling January 23, 2012 9 / 19
    • Roadmap, take two 1 Get data 2 Visualize/perform sanity checks 3 Clean/filter observations 4 Choose features to represent data 5 Specify model 6 Specify loss function 7 Develop algorithm to minimize lossJake Hofman (Columbia University) Data-driven modeling January 23, 2012 9 / 19
    • Roadmap, take two 1 Get data 2 Visualize/perform sanity checks 3 Clean/filter observations 4 Choose features to represent data 5 Specify model 6 Specify loss function 7 Develop algorithm to minimize loss 8 Choose performance measure 9 “Train” to minimize loss 10 “Test” to evaluate generalizationJake Hofman (Columbia University) Data-driven modeling January 23, 2012 9 / 19
    • Topics • Supervised • Unsupervised • k-nearest neighbors • K-means • Naive Bayes • Mixture models • Linear regression • Principal components • Logistic regression analysis • Support vector machines • Topic models • Collaborative filtering • Matrix factorization • Data representation: feature space, selection, normalization • Model assessment: complexity control, cross-validation, ROC curve, Bayesian Occam’s razor • Large-scale learningJake Hofman (Columbia University) Data-driven modeling January 23, 2012 10 / 19
    • Everything old is new again1 • Many fields ... • Statistics • Pattern recognition • Data mining • Machine learning • ... similar goals • Extract and recognize patterns in data • Interpret or explain observations • Test validity of hypotheses • Efficiently search the space of hypotheses • Design efficient algorithms enabling machines to learn from data 1 http://cbcl.mit.edu/publications/theses/thesis-rifkin.pdfJake Hofman (Columbia University) Data-driven modeling January 23, 2012 11 / 19
    • Statistics vs. machine learning2 Glossary Machine learning Statistics network, graphs model weights parameters learning fitting generalization test set performance supervised learning regression/classification unsupervised learning density estimation, clustering large grant = $1,000,000 large grant= $50,000 nice place to have a meeting: nice place to have a meeting: Snowbird, Utah, French Alps Las Vegas in August 1 2 http: //anyall.org/blog/2008/12/statistics-vs-machine-learning-fight/Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 12 / 19
    • Philosophy • We would like models that: • Provide predictive and explanatory power • Are complex enough to describe observed phenomena • Are simple enough to generalize to future observationsJake Hofman (Columbia University) Data-driven modeling January 23, 2012 13 / 19
    • Example: Netflix PrizeJake Hofman (Columbia University) Data-driven modeling January 23, 2012 14 / 19
    • Example: Netflix PrizeJake Hofman (Columbia University) Data-driven modeling January 23, 2012 15 / 19
    • Shipping = FeatureJake Hofman (Columbia University) Data-driven modeling January 23, 2012 16 / 19
    • ReferencesJake Hofman (Columbia University) Data-driven modeling January 23, 2012 17 / 19
    • Disclaimer You may be bored if you already know how to ... • Acquire data from APIs • Clean/explore/visualize data • Classify and cluster various types of data (e.g., images, text) • Code in Python, R, SciPy/NumPy, etc. • Scale solutions to large data sets (e.g. Hadoop, SGD) • Script with unix tools on the command line, e.g. $ sed -e ’s/<[^>]*>//g’ < page.html > page.txtJake Hofman (Columbia University) Data-driven modeling January 23, 2012 18 / 19
    • Themes Data jeopardy Regardless of scale, it’s difficult to find the right questions to ask of the dataJake Hofman (Columbia University) Data-driven modeling January 23, 2012 19 / 19
    • Themes Data hacking Cleaning and normalizing data is a substantial amount of the work (and likely impacts results)Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 19 / 19
    • Themes Data hacking The ability to iterate quickly, asking and answering many questions, is crucialJake Hofman (Columbia University) Data-driven modeling January 23, 2012 19 / 19
    • Themes Data hacking Hacks happen: sed/awk/grep are useful, and scaleJake Hofman (Columbia University) Data-driven modeling January 23, 2012 19 / 19
    • Themes “Data science” Simple methods (e.g., linear models) work surprisingly well, especially with lots of dataJake Hofman (Columbia University) Data-driven modeling January 23, 2012 19 / 19
    • Themes “Data science” It’s easy to cover your tracks—things are often much more complicated than they appearJake Hofman (Columbia University) Data-driven modeling January 23, 2012 19 / 19
    • Predicting consumer activity with Web searchwith Sharad Goel, S´bastien Lahaie, David Pennock, Duncan Watts e "Right Round" c 10 20 Rank 30 Billboard 40 Search Mar−09 Apr−09 May−09 Jun−09 Jul−09 Aug−09 WeekJake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 6 / 71
    • Search predictionsMotivation "Right Round" c 10 Does collective search activity 20 Rank provide useful predictive signal about real-world outcomes? 30 Billboard 40 Search Mar−09 Apr−09 May−09 Jun−09 Jul−09 Aug−09 WeekJake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 7 / 71
    • Search predictionsMotivation Past work mainly focuses on predicting the present1 and ignores baseline models trained on publicly available data 8 Actual 7 Search Flu Level (Percent) 6 Autoregressive 5 4 3 2 1 2004 2005 2006 2007 2008 2009 2010 Date 1 Varian, 2009Jake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 8 / 71
    • Search predictionsMotivation We predict future sales for movies, video games, and music "Transformers 2" "Tom Clancys HAWX" "Right Round" a b c 10 Search Volume Search Volume 20 Rank 30 Billboard 40 Search −30 −20 −10 0 10 20 30 −30 −20 −10 0 10 20 30 Mar−09 Apr−09 May−09 Jun−09 Jul−09 Aug−09 Time to Release (Days) Time to Release (Days) WeekJake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 9 / 71
    • Search predictionsSearch models For movies and video games, predict opening weekend box office and first month sales, respectively: log(revenue) = β0 + β1 log(search) + ￿ For music, predict following week’s Billboard Hot 100 rank: billboardt+1 = β0 + β1 searcht + β2 searcht−1 + ￿Jake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 10 / 71
    • Search predictionsSearch volumeJake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 11 / 71
    • Search predictionsSearch models Search activity is predictive for movies, video games, and music weeks to months in advance Movies Video Games Music 109 a 107 b c ! ! !! ! !! !! ! !! ! ! !! !!!!! !! !! !! !!!!!!!! !! ! ! ! ! ! !! ! ! ! !! !! ! !! !!!! !! !! ! ! 100 ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! 108 ! !! ! ! !! ! ! ! ! ! ! !! ! ! ! !! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! !! ! ! !! ! !! ! !! ! ! ! ! ! ! !! ! ! Actual Revenue (Dollars) Actual Revenue (Dollars) ! ! ! ! ! ! ! !! ! 106 ! ! ! ! ! ! ! ! ! ! !!!!! ! ! ! ! !! ! ! ! ! ! !! ! ! ! ! !! ! ! ! ! ! !! ! ! ! ! ! ! !! ! ! !! ! ! !! ! ! ! 80 Actual Billboard Rank ! ! ! ! !! ! !! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! !!!! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! 107 ! ! ! ! ! ! !!! !! ! ! ! ! !! ! ! ! ! !! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! !! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! !! ! !! ! ! ! ! !! ! !! ! !! !! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! !! ! ! ! ! ! ! !! ! ! !! ! ! ! ! ! ! ! ! ! !! ! ! ! !! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! !! !! ! ! ! ! ! ! ! ! ! !! ! 60 ! ! !! ! ! ! !! ! ! !! ! ! ! !! ! ! ! ! !! ! !! ! ! 106 105 !! ! ! !! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! !! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! !! ! ! ! !! ! ! ! ! ! ! !! ! ! ! !! ! ! ! ! ! !! ! ! ! ! ! !! ! ! !! !! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !! ! ! ! ! ! !! !! ! !! ! !! !! ! !! ! ! ! ! ! ! ! ! ! !! !! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! !! ! ! ! ! ! ! ! ! ! ! ! 105 ! !! ! ! ! ! ! ! ! 40 !! ! ! !! ! ! ! ! ! ! ! ! ! ! !! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! !! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! !! ! ! ! ! ! ! !! ! ! ! ! ! !! ! ! ! ! !! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! !! ! !! 4 ! ! ! ! ! ! ! ! ! ! ! ! 10 ! ! !! ! ! ! ! ! ! ! ! ! ! !! !! !! ! ! ! ! !! ! ! ! ! ! ! ! ! !!! ! !! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! !! !! ! ! ! ! ! ! ! !! !! ! !! ! ! ! ! ! ! ! ! 4 ! ! ! ! ! ! ! ! !! ! ! 10 ! ! ! ! 20 ! ! ! ! !! ! ! !! ! !! ! !! !! !! ! ! ! ! ! !! ! ! ! !! !! ! !! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! Non−Sequel ! ! ! ! ! ! ! !! ! ! ! ! ! !! ! !! ! !! !! !! ! ! ! ! !!! ! ! ! ! ! ! !! ! !! !! ! ! ! ! ! ! ! ! ! ! !! ! !! !! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! !! ! !! !! ! ! ! !! ! !! ! Sequel ! ! ! ! !! ! ! ! ! ! !! ! ! ! !! ! ! ! ! ! ! ! ! !! !! ! ! 103 103 ! !! ! ! ! ! !! ! ! !! ! ! !! ! ! ! !! ! !! ! 0 ! !! ! ! !! ! ! 103 104 105 106 107 108 109 103 104 105 106 107 0 20 40 60 80 100 Predicted Revenue (Dollars) Predicted Revenue (Dollars) Predicted Billboard Rank Movies Video Games Music 0.9 d 0.9 e 0.9 f 0.8 0.8 0.8 0.7 0.7 0.7 Model Fit Model Fit Model Fit 0.6 0.6 0.6 0.5 0.5 0.5 0.4 0.4 0.4 −6 −5 −4 −3 −2 −1 0 −6 −5 −4 −3 −2 −1 0 −6 −5 −4 −3 −2 −1 0 Time to Release (Weeks) Time to Release (Weeks) Time to Release (Weeks)Jake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 12 / 71
    • Search predictionsBaseline models For movies, use budget, number of opening screens and Hollywood Stock Exchange: log(revenue) = β0 + β1 log(budget) + β2 log(screens) + β3 log(hsx) + ￿Jake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 13 / 71
    • Search predictionsBaseline models For video games, use critic ratings and predecessor sales (sequels only): log(revenue) = β0 + β1 rating + β2 log(predecessor) + ￿Jake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 13 / 71
    • Search predictionsBaseline models For music, use an autoregressive model with the previously available rank: billboardt+1 = β0 + β1 billboardt−1 + ￿Jake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 13 / 71
    • Search predictionsBaseline + combined models Baseline models are often surprisingly good Movies (Baseline) Video Games (Baseline) Music (Baseline) 109 a 107 b c ! ! !!! !! ! ! ! ! ! ! ! ! ! ! !! ! !!!!! ! ! !! !!! ! ! ! ! ! !!!! ! ! !! ! !! !!! ! ! !! ! !! ! ! !! ! ! !! ! 100 !! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! 108 !! !! ! !! ! ! !! ! ! !! ! ! ! ! !! ! ! !! ! ! !! !! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !! ! !! ! ! !! !! !! ! ! ! Actual Revenue (Dollars) Actual Revenue (Dollars) !!! ! !!! ! !! ! ! !! ! ! !!! 106 ! ! ! ! ! ! ! !! ! ! ! !! ! ! ! ! !! ! ! ! !! ! ! ! ! !! ! ! ! ! ! ! ! ! !! !! ! ! ! ! !! ! !!! !! 80 Actual Billboard Rank ! !! !! ! ! ! ! !! ! !! ! ! !! !! ! ! ! ! ! ! ! ! ! ! !! ! ! ! !! !! ! !! ! ! !! !! !! ! !! ! !! ! ! ! ! ! ! ! ! ! ! 107 ! ! ! !! ! ! ! ! ! !! ! ! ! ! !! ! ! !!! ! !! ! ! !! !! ! ! !! ! ! ! ! ! !! !! ! !! ! ! !! !! ! ! !!! !! ! ! !! ! !! ! ! ! ! ! ! ! !! ! ! ! !! !! ! ! ! ! !! ! ! ! ! !!! ! ! ! ! ! !! !! ! ! ! !! ! ! ! ! ! ! !! ! ! ! ! !! !!! !! !!! ! ! ! ! ! !! ! !! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! !! ! !! ! ! ! !! ! ! ! !! ! ! ! ! ! ! ! ! !! ! ! ! 60 ! ! !! ! ! !! ! ! !! ! ! ! !! ! ! ! ! ! ! ! !! !! ! ! 106 105 ! ! ! ! !! !! ! ! ! ! ! ! !! ! ! ! ! ! !! ! !! ! !! ! ! ! ! ! !! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! !! ! ! ! ! ! ! ! !! !! !! !!! !!! ! ! ! ! ! ! ! !! ! ! ! ! ! !! ! ! !! ! !! ! !! ! ! ! ! ! ! !! ! !! ! ! ! !!! ! !!! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !!!! !! ! ! ! ! ! ! ! !!! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! !! ! ! !! ! !! ! ! !! ! ! ! ! ! ! ! ! ! ! 105 ! ! ! ! ! 40 ! ! ! !! ! ! !! ! ! ! ! ! !! ! ! !! ! ! ! !! ! !! ! ! ! ! ! !! ! !! ! ! ! ! !! ! ! ! !! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! !! ! !! ! ! ! ! ! ! ! ! ! ! ! !! ! ! !! ! !! ! ! !! ! ! !! !! !! ! ! ! ! ! !! ! ! !! ! ! ! ! !! 4 ! ! ! ! ! !!! ! ! ! ! 10 ! ! ! !! !! !! !! !! ! ! !!! ! !! ! !! ! ! ! ! !! ! ! ! ! ! ! ! ! !! ! ! ! !! ! ! !! ! !! ! ! ! ! !! ! ! !! ! ! ! ! ! !! !! ! !! ! ! ! ! ! ! ! ! ! ! ! 104 ! ! !! ! ! !! ! ! ! ! ! ! ! 20 ! ! ! !! ! ! ! ! ! !! ! ! ! !! ! ! ! ! !! !! ! ! ! ! ! ! !! ! ! !!! ! !! ! ! ! ! ! ! ! ! ! ! !! ! ! Non−Sequel ! !! ! ! ! !! ! !! ! ! ! ! ! ! ! ! ! ! !! ! !! !! ! ! ! ! ! ! !! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! !! !! ! ! ! ! !! !! ! ! ! ! ! !! ! !! ! ! ! ! ! !! ! ! !! ! !! Sequel !!! ! ! ! ! ! ! ! ! !!! ! !! ! ! ! ! ! !! !! ! !! ! ! ! ! 103 103 !!! ! !! !! ! !!! ! ! !! ! ! ! ! !! ! !! !! ! 0 !!! !! !! !! ! 103 104 105 106 107 108 109 103 104 105 106 107 0 20 40 60 80 100 Predicted Revenue (Dollars) Predicted Revenue (Dollars) Predicted Billboard Rank Movies (Combined) Video Games (Combined) Music (Combined) 109 d 107 e f ! !! ! ! ! ! ! ! !! ! ! ! ! ! ! ! !! ! ! !! ! ! ! !!! !!! !! ! !!!! ! ! ! ! ! !!!! ! ! ! !!! ! ! ! !! !! 100 ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! 108 ! ! ! ! ! !! ! ! ! ! ! !! ! !! ! ! !! ! ! ! ! ! !! !! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !! ! ! ! ! !! !! ! ! ! ! ! ! ! !! !! Actual Revenue (Dollars) Actual Revenue (Dollars) 6 ! ! ! !! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! !! ! ! ! 10 ! ! ! ! ! ! ! ! ! ! !!! ! ! ! ! !! !! ! ! ! ! ! ! !!! ! !! !! ! !! 80 Actual Billboard Rank ! !! ! ! ! ! ! ! ! ! !! !! !! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! !! ! ! !! ! ! ! !! ! ! !! ! ! !! ! ! ! ! 107 ! ! !! ! ! ! !! ! ! !! ! !! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! !! !! ! ! ! ! !!!! ! ! !! ! !! !! ! ! !! ! ! ! ! ! ! ! ! !! ! ! ! !! ! !! ! ! ! !! ! ! !! ! ! ! !! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! !! ! ! !! !! !! ! ! !! ! ! !! ! !! ! 60 ! ! ! ! ! !! ! ! ! !! ! ! ! ! !! ! ! ! ! ! ! ! !!! ! ! 106 105 ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! !! ! !! ! ! ! ! ! ! !! ! ! ! !! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! !! ! ! ! !! ! ! ! ! ! ! ! ! ! ! !! ! ! ! !! ! !! ! !! ! ! !! ! ! !! ! ! !! ! ! ! ! ! ! ! ! ! !! ! ! ! !! ! ! ! ! ! !! ! ! !! ! ! !! ! ! ! ! ! ! ! ! ! !! ! ! ! !! ! ! !! ! ! ! ! ! 105 ! !! ! ! ! 40 ! ! ! ! ! ! !! !! !! ! !! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! !! ! ! ! ! ! !! ! ! ! ! !!! !! ! ! ! ! !! ! ! ! ! ! ! !! ! ! !! ! ! ! !! ! ! ! ! !! ! ! ! ! ! !! ! ! ! ! ! ! !!! !! ! ! ! ! ! !! ! ! !! ! ! ! ! 104 ! ! ! ! !! ! ! ! !! ! !! ! ! ! ! !! ! !! ! ! !! ! ! ! !! !! ! ! ! ! !!! ! !! !! ! ! ! ! !! ! !! ! ! ! !! ! ! ! ! ! ! ! ! ! ! !! !! ! ! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! 104 ! ! !! ! !! ! ! ! ! ! ! 20 ! ! ! !! ! !! ! ! ! ! ! !! !! ! ! ! !! ! ! ! ! ! ! ! ! ! ! !! ! ! ! ! !! ! ! ! ! ! ! ! ! !! ! Non−Sequel ! !!! ! ! ! !! !! ! ! ! !! ! !! ! ! ! !! ! ! ! ! ! !! ! ! !! ! !! ! ! ! ! !! ! ! !! !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !! ! ! ! !! ! ! !!! ! ! ! !! ! ! ! !! ! ! ! ! ! ! ! ! Sequel ! ! !! ! !! ! !! !! ! !! ! ! ! ! ! ! !! ! !! !! ! ! ! 103 103 ! ! !! ! !! ! ! ! ! ! ! ! ! ! ! ! ! !! ! !! 0 !! !! ! ! ! ! 3 4 5 6 7 8 9 3 4 5 6 7 10 10 10 10 10 10 10 10 10 10 10 10 0 20 40 60 80 100 Predicted Revenue (Dollars) Predicted Revenue (Dollars) Predicted Billboard RankJake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 14 / 71
    • Search predictionsModel comparison For movies, search is outperformed by the baseline and of little marginal value 1.0 0.9 Combined 0.8 Model Fit Search 0.7 0.6 Baseline 0.5 0.4 es es ic es u Fl us am am i ov M M G G el el qu qu se Se on NJake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 15 / 71
    • Search predictionsModel comparison For video games, search helps substantially for non-sequels, less so for sequels 1.0 0.9 Combined 0.8 Model Fit Search 0.7 0.6 Baseline 0.5 0.4 es es ic es u Fl us am am i ov M M G G el el qu qu se Se on NJake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 15 / 71
    • Search predictionsModel comparison For music, the addition of search yields a substantially better combined model 1.0 0.9 Combined 0.8 Model Fit Search 0.7 0.6 Baseline 0.5 0.4 es es ic es u Fl us am am i ov M M G G el el qu qu se Se on NJake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 15 / 71
    • Search predictionsSummary • Relative performance and value of search varies across domains • Search provides a fast, convenient, and flexible signal across domains • “Predicting consumer activity with Web search” Goel, Hofman, Lahaie, Pennock & Watts, PNAS 2010Jake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 16 / 71
    • Demographic diversity on the Webwith Irmak Sirer and Sharad Goel College Female Over 25 70 Non−White ! ! ! Daily Per−Capita Pageviews 60 ! White No College 50 40 Male 30 Under 25 20 10 0 Race Education Sex AgeJake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 18 / 71
    • Motivation Previous work is largely survey-based and focuses and group-level differences in online accessJake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 19 / 71
    • Motivation “As of January 1997, we estimate that 5.2 million African Americans and 40.8 million whites have ever used the Web, and that 1.4 million African Americans and 20.3 million whites used the Web in the past week.” -Hoffman & Novak (1998)Jake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 19 / 71
    • Motivation Focus on activity instead of access How diverse is the Web? To what extent do online experiences vary across demographic groups?Jake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 20 / 71
    • Data • Representative sample of 265,000 individuals in the US, paid via the Nielsen MegaPanel2 • Log of anonymized, complete browsing activity from June 2009 through May 2010 (URLs viewed, timestamps, etc.) • Detailed individual and household demographic information (age, education, income, race, sex, etc.) 2 Special thanks to Mainak MazumdarJake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 21 / 71
    • Data # ls -alh nielsen_megapanel.tar -rw-r--r-- 100G Jul 17 13:00 nielsen_megapanel.tarJake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 22 / 71
    • Data • Transform all demographic attributes to binary variables e.g., Age → Over/Under 25, Race → White/Non-White, Sex → Female/MaleJake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 23 / 71
    • Data • Transform all demographic attributes to binary variables e.g., Age → Over/Under 25, Race → White/Non-White, Sex → Female/Male • Normalize pageviews to at most three domain levels, sans www e.g. www.yahoo.com → yahoo.com, us.mg2.mail.yahoo.com/neo/launch → mail.yahoo.comJake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 23 / 71
    • Data • Transform all demographic attributes to binary variables e.g., Age → Over/Under 25, Race → White/Non-White, Sex → Female/Male • Normalize pageviews to at most three domain levels, sans www e.g. www.yahoo.com → yahoo.com, us.mg2.mail.yahoo.com/neo/launch → mail.yahoo.com • Restrict to top 100k (out of 9M+ total) most popular sites (by unique visitors)Jake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 23 / 71
    • Data • Transform all demographic attributes to binary variables e.g., Age → Over/Under 25, Race → White/Non-White, Sex → Female/Male • Normalize pageviews to at most three domain levels, sans www e.g. www.yahoo.com → yahoo.com, us.mg2.mail.yahoo.com/neo/launch → mail.yahoo.com • Restrict to top 100k (out of 9M+ total) most popular sites (by unique visitors) • Aggregate activity at the site, group, and user levelsJake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 23 / 71
    • Hadoop + Pig (+ awk)100GB → ∼1GBJake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 24 / 71
    • Site-level skew How diverse are site audiences? • For each site and attribute, calculate the skew in visitors (e.g., 93% of pageviews on Density foxnews.com are by White users) • For each attribute, plot the distribution of visitor skew across all sites 0.0 0.2 0.4 0.6 0.8 1.0 Proportion White VisitorsJake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 25 / 71
    • Site-level skew Density Density Density 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Proportion Female Visitors Proportion White Visitors Proportion College Educated Visitors Density Density 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Proportion of Visitors With Proportion Adult Visitors Household Incomes Under $50,000Jake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 26 / 71
    • Site-level skew Density Density Density 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Proportion Female Visitors Proportion White Visitors Proportion College Educated Visitors Density Density 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Proportion of Visitors With Proportion Adult Visitors Household Incomes Under $50,000 Many sites have skew close the overall mean, but there also popular, highly-skewed sitesJake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 26 / 71
    • Sites vs. ZIPs How do diversity of the online and offline worlds compare? Sites Sites ZIPs ZIPs Density Density 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Proportion Female Proportion WhiteJake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 27 / 71
    • Sites vs. ZIPs How do diversity of the online and offline worlds compare? Sites Sites ZIPs ZIPs Density Density 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Proportion Female Proportion White As expected, neighborhoods are more gender-balanced than sitesJake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 27 / 71
    • Sites vs. ZIPs How do diversity of the online and offline worlds compare? Sites Sites ZIPs ZIPs Density Density 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Proportion Female Proportion White But sites typically have more racially diverse audiences than neighborhoods have residentsJake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 27 / 71
    • Site-level skew p.white apps.myspace.com activities.myspace.com webim.myspace.com friends.myspace.com webfetti.com profileedit.myspace.com signups.myspace.com messaging.myspace.com eq.rockyou.com photoupload.myspace.com searchservice.myspace.com browseusers.myspace.com comment.myspace.com home.myspace.com secure.myspace.com viewmorepics.myspace.com streamate.com myspace.commusic.myspace.com collect.myspace.com eltpath.com playsushi.com profile.myspace.com vids.myspace.com toolbar.inbox.com zwinky.com blogs.myspace.com photobucket.com fling.com nick.com snagajob.com youtube.com bulletins.myspace.com tagged.com gamevance.com gamehouse.com mtv.com mywebtattoo.com edit.yahoo.com adserve.brandgivewaycentre.com cim.meebo.com messenger.yahoo.com t−mobile.com fafsa.ed.gov app2.yoville.com answers.yahoo.com us.peeplo.com search.mywebsearch.com windowsmedia.com metrolyrics.com toolbar.yahoo.com degrees.classesusa.com bigfishgames.com music.yahoo.com kmart.com profile.live.com blurtit.com kaboodle.com gamestop.com fastbrowsersearch.com search.conduit.com youmindquizzes.com courseadvisor.com eversave.com iqquizapp.com findstuff.com try.alot.com winster.com discounts.shopathome.com profiles.yahoo.com onlineprofitsnow.com msg.yahoo.com webcrawler.com officialiqquiz.com scribd.com ask.com rds.yahoo.com walmart.com pogo.com help.yahoo.com translate.google.com chacha.com real.com media.causes.com home.live.com iwon.com yellowpages.com squidoo.com login.facebook.com shine.yahoo.com game3.pogo.com coolsavings.com familylink.com gamespot.com pornhub.com local.com myfuncards.com qualityhealth.com worldwinner.com lifescript.com instoresnow.walmart.com shop.avon.com addictinggames.com autos.yahoo.com answerbag.com account.live.comchannel.facebook.com wiki.answers.com lgn5.coolsavings.com help.live.com hrblock.com upload.facebook.com register.facebook.comrighthealth.com cbk1.google.com twitter.com lgn3.coolsavings.com answers.com creditreport.com video.yahoo.com oprah.com images.google.com hubpages.com tv.com buzzle.com games.com omg.yahoo.com drugs.com music.aol.com ww23.rr.com pharmdaily.com pronto.com angelfire.com pizzahut.com login.yahoo.com sprint.com video.google.com mozilla.com dictionary.reference.com zimbio.com shell.windows.com everything.yahoo.com best−price.com nlm.nih.gov us.mcafee.com smarter.com disney.go.com tvguide.com api.facebook.com livejasmin.com business.com systemofdeals.com search.yahoo.com get.adobe.com facebook.com localization.att.com turbotax.intuit.com www1.macys.com redtube.com java.com games.yahoo.com insureme.com apps.facebook.com www4.jcpenney.com true.com hulu.com geocities.com paypal−shopping.com local.yahoo.com urbandictionary.com hotjobs.yahoo.comtopix.com intelius.com wireless.att.com rightathome.com www3.jcpenney.com bettycrocker.com avg.com connect.facebook.com mail.yahoo.com ebillpay.verizonwireless.com info.com inventionconcept.com freecreditreport.com viewpoints.comcbs.com movies.yahoo.com verizonwireless.com arcamax.com forms.earnmydegree.com startsampling.comwww5.jcpenney.comhsn.com ssa.gov shopathome.com medicinenet.com usmagazine.com www2.jcpenney.com cooks.com oldnavy.gap.com buddytv.com download.cnet.com videogames.yahoo.com shopping.yahoo.com login.verizonwireless.com yahoo.com fancast.com www.yahoo.com job.com ezinearticles.com att.com vistaprint.com target.com jcpenney.com bricks.coupons.com geo.craigslist.org netflix.comshopzilla.com bbb.org officedepot.com macys.com fedex.com webmd.com wikipedia.org en.wikipedia.org craigslist.org articlesbase.combing.com radioshack.com weather.yahoo.com secure.classmates.com apple.com realestate.yahoo.com stores.ebay.com wisegeek.com dw.weather.commylife.comnbc.com irs.gov walgreens.com legacy.comprint.coupons.com everydayhealth.com abc.go.com southernfood.about.com creatives.livejasmin.com login.live.com microsoft.com smi.usps.com blockbuster.com gevalia.com associatedcontent.com coupons2.smartsource.com motors.ebay.com adobe.com myworld.ebay.com desktopfw.weather.com mahalo.com tv.yahoo.com search.aol.com dexknows.com foodnetwork.com addthis.com tube8.com checkoutfree.com people.com www.mozilla.com post.craigslist.orgmail.live.com accounts.craigslist.org itunes.apple.com merriam−webster.com peoplefinders.com offer.ebay.com health.yahoo.com staples.com superpages.com merchantcircle.com www2.victoriassecret.com voap.weather.com hgtv.commayoclinic.com dailymotion.com en.allexperts.com r.msn.com paypal.com bizrate.comsites.target.com wellsfargo.com zazzle.com pbs.org whitepages.com ancestry.com toysrus.com orientaltrading.com pages.ebay.com www22.verizon.com shop.ebay.com simplyink.com online.wellsfargo.com ussearch.com thefreedictionary.com store.yahoo.net lifestyle.msn.com metacafe.com search.mylife.com catalog.ebay.com hp.com my.yahoo.com msn.com blogger.com directv.com sears.com samsclub.com cvs.com allrecipes.com kohls.com yelp.comaolhealth.com netquote.comone−time−offer.com moviefone.com office.microsoft.com etsy.com fandango.com sitekey.bankofamerica.com comcast.com eonline.com autotrader.com google.com bestbuy.com classmates.com nextag.com popeater.com ehow.com personals.yahoo.com digg.com maps.google.com travel.yahoo.com news.yahoo.com pandora.combankofamerica.com mypoints.com search.barnesandnoble.com usps.com capitalone.comyellowpages.superpages.com onlineeast3.bankofamerica.com zappos.com wonderwall.msn.com support.microsoft.com onlineeast2.bankofamerica.com cgi.ebay.com onlineeast1.bankofamerica.com web.aol.com picasaweb.google.com health.msn.com maps.yahoo.com careerbuilder.com imdb.com ebay.com signin.ebay.com aol.com mapquest.com television.aol.com buzz.yahoo.com offers.bankofamerica.com stylelist.com priceline.com eharmony.com specials.msn.com webmail.aol.com overstock.com my.ebay.com indeed.com vap.yahoo.net wwwapps.ups.com hosted.where2getit.com amazon.com mail.aol.com cgi1.ebay.com hotwire.com chase.com retailmenot.com nfl.com history.paypal.commovies.msn.com chaseonline.chase.com screenname.aol.comservicing.capitalone.com ups.com mercurymagazines.com news.aol.com store.apple.com askville.amazon.com flickr.com mlt0.google.com bbc.co.uk docs.google.com homedepot.com tgifridays.com gima.weather.com barnesandnoble.com bedbathandbeyond.com recipezaar.com drugstore.com adultfriendfinder.com evony.com kbb.com mwfb.zynga.comcontact.ebay.com payments.ebay.com walletpop.com tv.msn.com ftd.com mail.google.com dailymail.co.uk cgi5.ebay.com feedback.ebay.comcbsnews.com servicemagic.com mycokerewards.com abcnews.go.com books.google.com dell.com monster.com comcast.net tmz.com payments.chase.com manta.com costco.com thefind.com ticketmaster.com hotels.com lowes.comexaminer.com youporn.com travel.travelocity.com travela.priceline.com fixya.com jobsearch.monster.com realtor.com travelocity.com expedia.com city−data.com groups.yahoo.com login.comcast.net shopping.hp.com mt0.google.com simplyhired.com sites.google.com clients.google.com snopes.com magazines.com justanswer.com break.com forbes.com orbitz.com stmts.chase.com accountonline.com evite.com wikihow.com us.dell.com kodakgallery.com news.google.com weather.com discovercard.com cards.chase.com buy.com new.facebook.com jobview.monster.com match.com southwest.com sports.yahoo.com bidcactus.com restaurant.com msnbc.msn.com wunderground.com huffingtonpost.com edmunds.com rivals.yahoo.com nytimes.com reviews.cnet.com time.com espn.go.com washingtonpost.com tripadvisor.com nydailynews.com cnn.com msn.foxsports.com delta.com foxnews.com online.wsj.com mlb.mlb.com finance.yahoo.com usatoday.com money.cnn.com moneycentral.msn.com linkedin.com boston.comJake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 28 / 71
    • Group-level activity How does browsing activity vary at the group level? College Female Over 25 70 Non−White ! ! ! 60 Daily Per−Capita Pageviews ! White No College 50 40 Male 30 Under 25 20 10 0 Race Education Sex Age Large differences exist even at the aggregate level (e.g. women on average generate 40% more pageviews than men)Jake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 29 / 71
    • Percent of Total Time Spent on SiteJake Hofman 0.1% 1% 10% fa m ce ai bo l.y o ap k ps g aho .co .fa oo o. m m ce gl co ai b e. m l.g oo co o m m og k.co a le m y il. .c vi c web ou live om ew h m m tu .c m an w ai be om or ne fb l.a .co ep l.f .zy ol m(Yahoo! Research) ic ac ng .co se s.m ebo a.c m ar ys ok om ch pa .c . c o m yah e.c m ys o o pa o.c m Group-level activity ce om a m .c sh ma sn om op zo .co im .e n. m ho age y bay com m s. ah .c e. go oo om m my og .co ai sp le m l.c a .c om ce om w c .c w b as om m w in t.n .y g e es sa cg ah .co t gi i.e oo m ng es ba .co .m pn y. m ys .g com p o. ci twace com m i .c .m tte o e r m fa en my eb .co ce . .e o m bo lo wik ba .co ok gin ipe y.c m .m .y di om af ah a. m ia oo or fri ga y. wa .co g en m ya rs m ds e3 ho .co .m .p o. mLearning from Online Activity ys og co o m w ta pac .co or g e m ld ge .co w d m in .c m ne o lo ee r.c m g b o m my in.li o.c m ap p ve om s. oin .c go ts om og .c m le om w a .c m p ol. om ne s.z og co w yn o.c m fa s. g o email, search, and social networking sites nt ya a m as w ho .com ys in o po n ste .co rts et r.c m se a .y flix om ar ho .co ch o m . al co .ao com ot m l.c m ca o et s m ric t.n s. et male co m femaleNovember 30, 2011 All groups spend more than a third of their time on a handful of30 / 71
    • Percent of Total Time Spent on SiteJake Hofman 0.1% 1% 10% fa m ce ai bo l.y o ap k ps g aho .co .fa oo o. m m ce gl co ai b e. m l.g oo co o m m og k.co a le m y il. .c vi c web ou live om ew h m m tu .c m an w ai be om or ne fb l.a .co ep l.f .zy ol m(Yahoo! Research) ic ac ng .co se s.m ebo a.c m ar ys ok om ch pa .c . c o m yah e.c m ys o o pa o.c m Group-level activity ce om a m .c sh ma sn om op zo .co im .e n. m ho age y bay com m s. ah .c e. go oo om m my og .co ai sp le m l.c a .c om ce om w c .c w b as om m w in t.n .y g e es sa cg ah .co t gi i.e oo m ng es ba .co .m pn y. m ys .g com p o. ci twace com m i .c .m tte o e r m fa en my eb .co ce . .e o m bo lo wik ba .co ok gin ipe y.c m .m .y di om af ah a. m ia oo or fri ga y. wa .co g en m ya rs m ds e3 ho .co .m .p o. mLearning from Online Activity ys og co o m w ta pac .co or g e m ld ge .co w d m in .c m ne o lo ee r.c m g b o m my in.li o.c m ap p ve om s. oin .c go ts om og .c m le om w a .c m p ol. om ne s.z og co w yn o.c m fa s. g o ya a m nt as universally popular and on more niche sites w ho .com ys in o po n ste .co rts et r.c m se a .y flix om ar ho .co ch o m . al co .ao com ot m l.c m ca o et s m ric t.n s. et male co m female But different groups distribute their time differently, both onNovember 30, 201130 / 71
    • Percent of Total Time Spent on SiteJake Hofman 0.1% 1% 10% 0.1% 1% 10% 0.1% 1% 10% 0.1% 1% 10% 0.1% 1% 10% fa c m eb ail oo .y k. ap ah co ps go oo m .fa og .co m ceb le. m ail o co .g ok m o .c m ogle om ail . .li co w you ve. m e t c vie ch m bm ube om w an wf ail. .co m n b ao m or el. .zy l. ep fa n co ic ce ga m(Yahoo! Research) s b .c se .my oo om ar sp k.c ch ac o .y e m m aho .com ys o pa .co ce m .c am msn om sh az .c op on om im .e .c b o ag y ay m ho es ah .co m .g oo m e. oo .c m o m ys gle m ail pa .co .c ce m om .c ca om w st w b .n w in e .y g. t m a es cg ho com sa i. o. gin e eba com g. spn y.c m .g om ys o pa .co c m cim tw e.c .m itter om e . m e co en y.e bo. m fa c ce lo .wik bay om bo g ip .c ok in. ed om .m ya ia. af ho org ia o m wa .co y frie gam .ya rs.c m nd e3 ho om s. .p o.c m og omLearning from Online Activity ys o pa .co w ta ce. m or gg c ld e om w d. in co m ner m . lo eeb com gin o .c m my .live om ap po .c s. int om go s. og co le m . m a co w p ol.c m m o o ne s.z go. m w yn co s. ga m ya .c fa h o nt w oo m as in .c ys s o po n ter. m rts etf co se ah c.y lix. m ar o om ch o.c .a om alo com ol.c tm ca om et st. ric ne s. t co mNovember 30, 2011 Income Education Race Sex Age31 / 71
    • M Female−to−male pageview ratio ult i− 0.5 1 2 caJake Hofman te g FA H ory amppa o H lid Ho ily rel/ ea ay m Re B M lth s e e ult , F& S & souaut ! i− itn p Fa rcey ! M ca e e s s ! ult te Foss cial hion i− go od & EvPe ! ca ry te P & Nut ents ! go Spe ho Co ritiots ry H cia N tog ok n !! Fa o l Oon ra ing M milme cc−P phy em y & a ro ! ! & G sio fit M be ! Sh as r C Lifeard ns ! s o sty en op M ! pin G ermm Bo les g re ch u ok ! et a nit s EdDir in n ie uc ec Geg Cdis s(Yahoo! Research) !! at tor Un ne a er io ie iv rd ! R Cor na s & eralo s ! ea p G l R s gy l E ora ifts e Gu itie ! st te & so ides M ! at In F ur s ult ! i− Kid e/A for lowces ! Fu ca D s, parma ers te G tm tio Group-level activity ll go ire am e n Se ry ct n !! rv or O Go eE−m ts ic F Te ie n v s, a e ull le Cous/L lineern Toy il C S co pooca G me s om e m C n l a n m rvic /In e s G m t er e te llu /R uides !! !! cia B rn la ew e ! l Ban et r/P ar s ! G anks Cru Se ag ds is r in ! en ks & C I e vic g ! er R & Crednsu Lines al eli ! In re it ra es gio d U n te ! re n it Unioce st M Bro & n n Po ult ad Sp Loions !! So rt i− Dca irit ans ! ftwals ca e st ua s ! ar & teg stinMe lity e Co or a dia ! y ! D Ma mm Ttion eli nu ra s v f u v ! M Ho Arery actnitieel ult te ts /S ur s e !! Lo i−c ls/H Cr/Gr tam rs ! ng at o ed aph ps D ego Matel it ic is ry p D S Ca s ta ! ! s/ ire ea rd nc En Tr c rc t ! C e/L ter aveorieh M ! ar M ult ee ocatain l In s m ! ult i− i− ca C r D l Ca efo ! la F ev A r nt ca te s in r ! te go Fr sifieanelo irlinier go ! ry ee d cia pm es ry N s l e Fin ew Me /Au To nt !! C anc G s rch ctiools ur e r In & I and nsLearning from Online Activity re /In ou s nf E is !! nt s n tan ormve e ! Ev ura d T t M a nts en nc ra e tio ! M ts e ns ss I n ult i− In Sp & /Inv po agSP te G r in ca !! rn ecia loest tatiog te e b m ! go G t To l In al Nen n ! ry am o te e ts C Ta bli ls/W res Muws om r ng e Wt N s !! puget Ha /S b e ewic C te ed rd Re w Se at s rs P w om seee rv he ! ! pu & or ar a p ic r te Au Cotalse M rchsta es r& to ns & an T kes m um C u M oo C !! ! ot e o fa ilit ls M on ive r m ctu ar m ! ult s M E u re y i− Mu um ! C V a lec n rs at lti− er idenuf troities eg A ca Ele W o ac nic ! ! or uto te c e s/M tu s ! Fin y m go tro b o re an Edu otivry nicHosvie r cia Pa ca e Au s N tin s !! l N rts tio Inf tom ewg ! ew & n &orm ot s ! s Ac C at ive & c a io ! In es ree n ! fo so r rm ries ! O P H atio s ! nli e u n ne rs m ! Tr onaor a ls ! Spdin ! o g ! Adrts ult !November 30, 201132 / 71
    • Group-level activity There is both reasonable overlap and variation amongst the most popular sites within groups Over/Under 25 ! Years Old Female/Male ! Under/Over $50,000 ! Household Income White/Non−White ! College/No College ! 0.0 0.2 0.4 0.6 0.8 1.0 Jaccard SimilarityJake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 33 / 71
    • Individual-level prediction How well can one predict an individual’s demographics from their browsing activity? • Represent each user by the set of sites visited • Fit linear models to predict majority/minority for each attribute on 80% of users • Tune model parameters using a 10% validation set • Evaluate final performance on held-out 10% test setJake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 34 / 71
    • Individual-level prediction model <- glm(is.female ~ ., data=nielsen, family=binomial) ???Jake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 35 / 71
    • Individual-level prediction y (xi ) = w · xi + b ˆ ￿ Support Vector Machine3 : L(y , y ) = C ˆ i [1 − yi y (xi )]+ + ||w ||2 ˆ 3 http://bit.ly/svmperfJake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 36 / 71
    • Individual-level prediction Reasonable (∼70-85%) accuracy and AUC across all attributes AUC Accuracy Over/Under 25 ! ! Years Old Female/Male ! ! White/Non−White ! ! Under/Over $50,000 ! ! Household Income College/No College ! ! .5 .6 .7 .8 .9 1 .5 .6 .7 .8 .9 1Jake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 37 / 71
    • Individual-level prediction Highly-weighted sites under the fitted models Large positive weight Large negative weight winster.com sports.yahoo.com Female lancome-usa.com espn.go.com marlboro.com mediatakeout.com White cmt.com bet.com news.yahoo.com youtube.com College Educated linkedin.com myspace.com evite.com addictinggames.com Over 25 Years Old classmates.com youtube.com Household Income eharmony.com rownine.com Under $50,000 tracfone.com matrixdirect.com Table 2: A selection of the most predictive (i.e., most highly weighted) sites for each classification task. AUC Accuracy Figure 7, a measure that effectively re-normalizes the ma- Over/Under 25 Years Old ! ! jority and minority classes to have equal size. Intuitively, Female/Male ! ! AUC is the probability that a model scores a randomly se- lected positive example higher than a randomly selected neg- White/Non−White ! ! ative one (e.g., the probability that the model correctly dis- Under/Over $50,000 Household Income ! ! tinguishes between a randomly selected female and male). College/No College ! ! Though an uninformative rule would correctly discriminateJake Hofman (Yahoo!.5 Research) .6 .7 .8 .9 1 .5 .6 Learning from Online Activity pairs 50% of the time, predictions based on 38 / 71 .7 .8 .9 1 between such November 30, 2011
    • Individual-level prediction Similar performance even when restricted to top 1k sites 0.9 0.9 ! ! ! ! ! 0.8 0.8 ! ! ! Accuracy 0.7 0.7 AUC ! Age ! Age Sex Sex 0.6 0.6 Race Race Education Education Income Income 0.5 0.5 102 102.5 103 103.5 104 104.5 105 102 102.5 103 103.5 104 104.5 105 Number of Domains Number of DomainsJake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 39 / 71
    • Individual-level prediction Substantially better performance when restricted to “stereotypical” users (∼80-90%) 0.95 0.95 ! ! 0.90 ! 0.90 ! ! !! !! 0.85 0.85 ! Accuracy ! AUC ! !! !! 0.80 0.80 ! Age ! Age Sex Sex 0.75 Race 0.75 Race Education Education 0.70 Income 0.70 Income 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Fraction of Users Fraction of UsersJake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 40 / 71
    • Individual-level prediction Proof of concept browser demo http://bit.ly/surfpredsJake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 41 / 71
    • Summary • Notable differences in online and offline diversity • Large group-level differences in how users distribute their time • User demographics can be inferred from browsing activity with reasonable accuracyJake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 42 / 71