Learning from Web Activity
Jake Hofman
Yahoo! Research
November 18, 2010
  1. Learning from Web Activity. Jake Hofman, Yahoo! Research, November 18, 2010. Jake Hofman (@jakehofman) Learning from Web Activity TimesOpen, 2010.11.18 1 / 33
  2. Outline
     1 Agenda: Just enough philosophy
     2 Case study: Demographic diversity on the Web
     3 Conclusion: Lessons learned
  3. Agenda: Size (only kind of) matters
     Big Data
     • Lots of data means lots to learn (from)
     • But the “big” part isn’t intrinsically interesting (although large sample sizes are always good)
     • Regardless of size, it’s really about “data jeopardy” (To what question are these data the answer?)
  7. Agenda: Tools
     Big Data tools:
     • Hadoop & Pig: filtering, aggregating
     • Shell scripting & Python: munging, glue
     • R: modeling, visualization
  9. Agenda: The clean real story
     “We have a habit in writing articles published in scientific journals to make the work as finished as possible, to cover all the tracks, to not worry about the blind alleys or to describe how you had the wrong idea first, and so on. So there isn’t any place to publish, in a dignified manner, what you actually did in order to get to do the work ...”
     Richard Feynman, Nobel Lecture, 1965 (http://bit.ly/feynmannobel)
  11. Demographic diversity on the Web: The clean story (covering our tracks)
  12. Demographic diversity on the Web, with Irmak Sirer and Sharad Goel
      How diverse is the Web? To what extent do online experiences vary across demographic groups?
  13. Diversity of the Web: Data
      • Representative sample of 265,000 individuals in the US, paid via the Nielsen MegaPanel (http://bit.ly/nielsenonline)
      • Log of anonymized, complete browsing activity from June 2009 through May 2010 (URLs viewed, timestamps, etc.)
      • Detailed individual and household demographic information (age, education, income, race, sex, etc.)
  14. Diversity of the Web: Data
      • Transform all demographic attributes to binary variables, e.g., Age → Over/Under 25, Race → White/Non-White, Sex → Female/Male
      • Normalize pageviews to at most three domain levels, sans www, e.g., www.yahoo.com → yahoo.com, us.mg2.mail.yahoo.com/neo/launch → mail.yahoo.com
      • Restrict to top 100k most popular sites
      • Aggregate activity at the site, group, and user levels
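The URL normalization rule on the Data slide (at most three domain levels, without www) could be sketched in Python roughly as follows. This is only an illustration: the function name `normalize_site`, the `max_levels` parameter, and the choice to drop a leading `www` before truncating are assumptions, not the authors' actual code (the deck suggests the real work was done with Pig and shell tools).

```python
import re
from urllib.parse import urlsplit

# Matches a leading scheme such as "http://" (hypothetical helper for this sketch).
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*://")

def normalize_site(url: str, max_levels: int = 3) -> str:
    """Reduce a URL to at most `max_levels` trailing domain labels, without a leading 'www'."""
    # urlsplit only recognizes the host when a scheme or '//' precedes it.
    if not _SCHEME.match(url) and not url.startswith("//"):
        url = "//" + url
    host = (urlsplit(url).hostname or "").lower()
    labels = [label for label in host.split(".") if label]
    # Assumption: strip a leading 'www' first, then keep the last `max_levels` labels.
    if labels and labels[0] == "www":
        labels = labels[1:]
    return ".".join(labels[-max_levels:])

# The two examples given on the slide:
print(normalize_site("www.yahoo.com"))                     # yahoo.com
print(normalize_site("us.mg2.mail.yahoo.com/neo/launch"))  # mail.yahoo.com
```

Truncating by label count is a simplification; it does not consult a public-suffix list, so hosts under multi-part suffixes may be grouped more coarsely or finely than intended.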
  15. Diversity of the Web: Pig to the rescue
  16. Diversity of the Web: Site-level skew
      How diverse are site audiences?
      • For each site and attribute, calculate the skew in visitors (e.g., 93% of pageviews on foxnews.com are by White users)
      • For each attribute, plot the distribution of visitor skew across all sites
      [Figure: density of the proportion of White visitors across sites, from 0.0 to 1.0]
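As a rough illustration of the site-level skew calculation, the sketch below computes, for one binary attribute, the fraction of each site's pageviews that come from users in the group. The input layout (one `(site, in_group)` pair per pageview), the function name `site_skew`, and the toy site names are all assumptions made for this example; the original aggregation was presumably run at scale rather than in memory.

```python
from collections import defaultdict

def site_skew(pageviews):
    """Return {site: fraction of that site's pageviews made by in-group users}.

    `pageviews` is an iterable of (site, in_group) pairs, one per pageview,
    where `in_group` is True if the viewer belongs to the group of interest.
    """
    totals = defaultdict(int)
    in_group_counts = defaultdict(int)
    for site, in_group in pageviews:
        totals[site] += 1
        if in_group:
            in_group_counts[site] += 1
    return {site: in_group_counts[site] / totals[site] for site in totals}

# Toy data with made-up sites: 3 of 4 pageviews on a.example are in-group.
toy = [("a.example", True)] * 3 + [("a.example", False)] + [("b.example", False)] * 2
skew = site_skew(toy)
print(skew["a.example"], skew["b.example"])  # 0.75 0.0
```

The per-attribute density plots on the following slides would then be a histogram or kernel density estimate over the values of this dictionary.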
  17. Diversity of the Web: Site-level skew
      [Figure: densities across sites (0.0 to 1.0) of the proportion of visitors who are female, White, college educated, adult, and from households with incomes under $50,000]
  18. Diversity of the Web: Site-level skew
      Many sites have skew close to the average, but there are also popular, highly skewed sites

      Attribute                        Greater Than 90%                       Less Than 10%
      Female                           youravon.com, collectionsetc.com       coveritlive.com, needlive.com
      White                            foxnews.com, wunderground.com          blackplanet.com, mediatakeout.com
      College Educated                 news.google.com, nytimes.com           slumz.boxden.com, sythe.com
      Over 25 Years Old                mail.yahoo.com, apps.facebook.com      nanowrimo.org, cbox.ws
      Household Income Under $50,000   scarleteen.com, boards.adultswim.com   opentable.com, marketwatch.com

      Table 1: A selection of popular sites that are homogeneous along various demographic dimensions.

      Text visible alongside the table: “... visually apparent from Figure 5, there are significant differences in how groups distribute their time on the web. These differences, which, as mentioned above, hold for highly frequented sites such as Facebook and YouTube, are in some cases even more pronounced for lower traffic sites. For instance, the gaming site pogo.com accounts for less than 1% of pageviews among both low and high income users, but low income users spend almost twice as much of their time there.”

      This skew persists even when we restrict attention to the top 10k or 1k sites
  20. Diversity of the Web: Sites vs. ZIPs
      How does the diversity of the online and offline worlds compare?
      [Figure: densities (0.0 to 1.0) of the proportion female, proportion White, and proportion college educated, each shown for sites and for ZIPs]
      • As expected, neighborhoods are more gender-balanced than sites
      • But sites typically have more racially diverse audiences than neighborhoods have residents
      • Skew by education is comparable, with online showing a bias towards higher education
  24. Diversity of the Web: Group-level activity
      How does browsing activity vary at the group level?
      [Figure: daily per-capita pageviews (0 to 70) by group: Race (Non-White, White), Education (No College, College), Sex (Male, Female), Age (Under 25, Over 25)]
      Large differences exist even at the aggregate level (e.g., women on average generate 40% more pageviews than men)
  26. Diversity of the Web: Group-level activity
      • All groups spend more than a third of their time on a handful of email, search, and social networking sites
      • But different groups distribute their time differently, both on universally popular and on more niche sites
      [Figure: percent of total time spent on site (0.1% to 10%) for the most popular sites (facebook.com, mail.yahoo.com, google.com, apps.facebook.com, mail.google.com, mail.live.com, youtube.com, ...), shown for female vs. male and for White vs. non-White users]
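The quantity plotted on these slides, the percent of a group's total time spent on each site, can be illustrated with a small aggregation. The record layout `(group, site, seconds)`, the function name `time_share_by_group`, and the toy sites are assumptions for this sketch only.

```python
from collections import defaultdict

def time_share_by_group(records):
    """Return {group: {site: percent of that group's total time spent on the site}}.

    `records` is an iterable of (group, site, seconds) tuples.
    """
    site_time = defaultdict(lambda: defaultdict(float))
    group_time = defaultdict(float)
    for group, site, seconds in records:
        site_time[group][site] += seconds
        group_time[group] += seconds
    return {
        group: {site: 100.0 * t / group_time[group] for site, t in sites.items()}
        for group, sites in site_time.items()
        if group_time[group] > 0  # skip groups with no recorded time
    }

# Toy records with made-up sites and durations.
toy = [
    ("female", "mail.example", 30), ("female", "games.example", 10),
    ("male", "mail.example", 20), ("male", "sports.example", 20),
]
shares = time_share_by_group(toy)
print(shares["female"]["mail.example"], shares["male"]["mail.example"])  # 75.0 50.0
```

Comparing the two inner dictionaries site by site gives the kind of side-by-side view shown in the figure.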
  29. [Figure: percent of total time spent on site (0.1% to 10%) for the same popular sites, in five panels: Age, Sex, Race, Education, Income]
  30. Diversity of the Web: Individual-level prediction
      How well can one predict an individual’s demographics from their browsing activity?
      • Represent each user by the set of sites visited
      • Fit linear models to predict majority/minority for each attribute on 80% of users
      • Tune model parameters using a 10% validation set
      • Evaluate final performance on held-out 10% test set
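The prediction setup above can be sketched with the standard library alone. The slide says only "linear models"; the logistic regression trained by stochastic gradient descent below is a stand-in chosen for illustration, and the names `split_users`, `train_logistic`, `predict`, and the `l2` regularization knob (the kind of parameter one might tune on the validation set) are all hypothetical.

```python
import math
import random

def split_users(user_ids, seed=0):
    """Shuffle users and split them into 80% train, 10% validation, 10% test."""
    ids = sorted(user_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(0.8 * len(ids))
    n_val = int(0.1 * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

def predict(model, sites):
    """Probability of the positive class for a user given the set of sites visited."""
    weights, bias = model
    z = bias + sum(weights.get(site, 0.0) for site in sites)
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(examples, epochs=20, lr=0.1, l2=0.0):
    """Fit one weight per site by SGD; `examples` is a list of (set_of_sites, label in {0, 1})."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for sites, label in examples:
            error = predict((weights, bias), sites) - label
            bias -= lr * error
            for site in sites:
                w = weights.get(site, 0.0)
                weights[site] = w - lr * (error + l2 * w)
    return weights, bias

# Toy usage with made-up sites and labels.
train, val, test = split_users(range(100))
model = train_logistic([({"beauty.example"}, 1), ({"sports.example"}, 0)] * 5)
print(len(train), len(val), len(test))  # 80 10 10
```

Because each user is a set of sites, the feature vector is binary and sparse, so storing weights in a dictionary keyed by site avoids materializing a 100k-wide matrix.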
  31. Diversity of the Web: GNU-fu
  32. Diversity of the Web: Individual-level prediction
      • Reasonable (∼70-85%) accuracy and AUC across all attributes
      • Similar performance even when restricted to top 1k sites
      • Can achieve substantially better performance when restricted to “stereotypical” users (∼80-90%)
      [Figure: AUC and accuracy (axis from .5 to 1) for each task: College/No College, Under/Over $50,000 Household Income, White/Non-White, Female/Male, Over/Under 25 Years Old]
  33. Diversity of the Web: Individual-level prediction
      Highly-weighted sites under the fitted models

      Attribute                        Large positive weight           Large negative weight
      Female                           winster.com, lancome-usa.com    sports.yahoo.com, espn.go.com
      White                            marlboro.com, cmt.com           mediatakeout.com, bet.com
      College Educated                 news.yahoo.com, linkedin.com    youtube.com, myspace.com
      Over 25 Years Old                evite.com, classmates.com       addictinggames.com, youtube.com
      Household Income Under $50,000   eharmony.com, tracfone.com      rownine.com, matrixdirect.com

      Table 2: A selection of the most predictive (i.e., most highly weighted) sites for each classification task.

      Text visible alongside the table: “... Figure 7, a measure that effectively re-normalizes the majority and minority classes to have equal size. Intuitively, AUC is the probability that a model scores a randomly selected positive example higher than a randomly selected negative one (e.g., the probability that the model correctly distinguishes between a randomly selected female and male). Though an uninformative rule would correctly discriminate ...”
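The intuitive definition of AUC quoted on this slide (the probability that a randomly selected positive example is scored higher than a randomly selected negative one) translates directly into a pairwise count. The function below is a minimal sketch of that definition, not the evaluation code used in the study; counting ties as half a win is a common convention and is an assumption here.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    `scores` are model outputs and `labels` are 1 (positive) or 0 (negative).
    Ties count as half a correct pair.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Three of the four positive/negative pairs are ordered correctly.
print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```

This quadratic-time version is fine for small samples; a rank-based computation gives the same value more efficiently. A model that assigns every example the same score gets 0.5 regardless of class balance, which is why the measure is insensitive to the majority/minority split.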
  34. Diversity of the Web: Individual-level prediction
      Proof of concept browser demo
  36. Diversity of the Web: The real story (what we actually did)
  37. Diversity on the Web: The real story
      • Got several hundred GBs of MegaPanel data from Nielsen (special thanks to Mainak Mazumdar)
      • Discussed possible projects
        • Predict user demographics (e.g., real-valued age) from a few minutes of browsing activity for ad targeting?
        • Infer the number of individuals using the same browser or behind the same IP?
        • Determine the number of actual uniques advertisers are receiving?
        • . . .
  39. Diversity on the Web: The real story (cont’d)
      • Started with predicting real-valued age
      • Worked on this for an embarrassingly long time (various methods, feature selection, etc.)
      • Turns out to be difficult to do better than within 10 years of true age, on average
      • Settled for classification on binary outcomes (e.g., adult/non-adult) over entire history
      • Classification worked reasonably well for age and other attributes
  40. Diversity on the Web: The real story (cont’d)
      • Became curious about why classification worked well compared to regression
      • Generated descriptive statistics across all attributes at the site and group levels
      • Compared site statistics to ZIP code data from the US Census
      • Compared time distribution across groups
      • Realized that we now had the largest comprehensive study of demographic diversity on the web
