The document summarizes Jake Hofman's presentation on learning from web activity data. It discusses a case study analyzing demographic diversity on the web using anonymized browsing data from 265,000 US users. The study found some sites have highly skewed audiences along attributes like gender, race, education and income, though most sites' audiences more closely match the general population. It also compared demographic skew between online audiences and offline neighborhoods, finding websites generally have more racially diverse audiences than neighborhoods have residents. The presentation explored differences in how demographic groups distribute their time online.
Learning from Web Activity
1. Learning from Web Activity
Jake Hofman
Yahoo! Research
November 18, 2010
Jake Hofman (@jakehofman) Learning from Web Activity TimesOpen, 2010.11.18 1 / 33
2. Outline
1 Agenda: Just enough philosophy
2 Case study: Demographic diversity on the Web
3 Conclusion: Lessons learned
3. Agenda
Size (only kind of) matters
Big Data
4. Agenda
Lots of data means lots to learn (from)
5. Agenda
But the “big” part isn’t intrinsically interesting (although large sample sizes are always good)
6. Agenda
Regardless of size, it’s really about “data jeopardy” (To what question are these data the answer?)
7. Agenda
Tools
Data tools:
• Shell scripting & Python: munging, glue
• R: modeling, visualization
8. Agenda
Tools
Big Data tools:
• Hadoop & Pig: filtering, aggregating
• Shell scripting & Python: munging, glue
• R: modeling, visualization
9. Agenda
The clean real story
“We have a habit in writing articles published in scientific journals to make the work as finished as possible, to cover all the tracks, to not worry about the blind alleys or to describe how you had the wrong idea first, and so on. So there isn’t any place to publish, in a dignified manner, what you actually did in order to get to do the work ...”
- Richard Feynman, Nobel Lecture, 1965 (http://bit.ly/feynmannobel)
10. Outline
1 Agenda: Just enough philosophy
2 Case study: Demographic diversity on the Web
3 Conclusion: Lessons learned
11. Demographic diversity on the Web
The clean story
(covering our tracks)
12. Demographic diversity on the Web
with Irmak Sirer and Sharad Goel
How diverse is the Web?
To what extent do online experiences vary across demographic groups?
13. Diversity of the Web
Data
• Representative sample of 265,000 individuals in the US, paid via the Nielsen MegaPanel (http://bit.ly/nielsenonline)
• Log of anonymized, complete browsing activity from June 2009 through May 2010 (URLs viewed, timestamps, etc.)
• Detailed individual and household demographic information (age, education, income, race, sex, etc.)
14. Diversity of the Web
Data
• Transform all demographic attributes to binary variables, e.g., Age → Over/Under 25, Race → White/Non-White, Sex → Female/Male
• Normalize pageviews to at most three domain levels, sans www, e.g., www.yahoo.com → yahoo.com, us.mg2.mail.yahoo.com/neo/launch → mail.yahoo.com
• Restrict to top 100k most popular sites
• Aggregate activity at the site, group, and user levels
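The normalization step above can be sketched in a few lines of Python. This is only an illustration of the rule as stated on the slide (at most three domain levels, no leading www), not the code used in the study; the function name `normalize_site` and the `max_levels` parameter are assumptions made for this example.

```python
def normalize_site(url, max_levels=3):
    """Reduce a URL to at most `max_levels` trailing domain labels, without a leading www.

    Hypothetical helper for illustration; the study's own pipeline is not shown in the slides.
    """
    # Strip a scheme such as "http://" only if it appears before the first slash.
    scheme_sep = url.find("://")
    first_slash = url.find("/")
    if scheme_sep != -1 and scheme_sep < first_slash:
        url = url[scheme_sep + 3:]
    # Keep the host part only: drop the path and any port.
    host = url.split("/", 1)[0].split(":", 1)[0].lower()
    labels = [label for label in host.split(".") if label]
    # Drop a leading "www", then keep the last `max_levels` labels.
    if labels and labels[0] == "www":
        labels = labels[1:]
    return ".".join(labels[-max_levels:])


print(normalize_site("www.yahoo.com"))
print(normalize_site("us.mg2.mail.yahoo.com/neo/launch"))
```

The two calls reproduce the slide's examples, mapping to yahoo.com and mail.yahoo.com respectively.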
15. Diversity of the Web
Pig to the rescue
16. Diversity of the Web
Site-level skew
How diverse are site audiences?
• For each site and attribute, calculate the skew in visitors (e.g., 93% of pageviews on foxnews.com are by White users)
• For each attribute, plot the distribution of visitor skew across all sites
[Figure: density of Proportion White Visitors across sites, x-axis from 0.0 to 1.0]
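A minimal sketch of the site-level skew calculation, assuming the activity has already been aggregated into (site, in-group flag, pageview count) records. The record layout and the function name `site_skew` are assumptions for illustration; the pageview counts in the usage lines are made up to match the 93% example.

```python
from collections import defaultdict


def site_skew(records):
    """Return {site: share of pageviews made by users in the group}.

    `records` is an iterable of (site, in_group, pageviews) tuples, where
    `in_group` is the binary attribute (e.g., True for White users).
    """
    in_group_views = defaultdict(int)
    total_views = defaultdict(int)
    for site, in_group, pageviews in records:
        total_views[site] += pageviews
        if in_group:
            in_group_views[site] += pageviews
    # Skip sites with no recorded pageviews to avoid dividing by zero.
    return {
        site: in_group_views[site] / total
        for site, total in total_views.items()
        if total > 0
    }


# Made-up counts, chosen so that foxnews.com comes out at 93%.
toy_records = [
    ("foxnews.com", True, 93),
    ("foxnews.com", False, 7),
    ("example.org", True, 1),
    ("example.org", False, 1),
]
skew = site_skew(toy_records)
```

The distribution plotted on the slide would then be a density estimate over the values of `skew`, one value per site.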
17. Diversity of the Web
Site-level skew
[Figure: five density plots of site-level skew, each on an x-axis from 0.0 to 1.0: Proportion Female Visitors, Proportion White Visitors, Proportion College Educated Visitors, Proportion Adult Visitors, and Proportion of Visitors With Household Incomes Under $50,000]
18. Diversity of the Web
Site-level skew
Many sites have skew close to the average, but there are also popular, highly skewed sites

Attribute                        Greater Than 90%                         Less Than 10%
Female                           youravon.com, collectionsetc.com         coveritlive.com, needlive.com
White                            foxnews.com, wunderground.com            blackplanet.com, mediatakeout.com
College Educated                 news.google.com, nytimes.com             slumz.boxden.com, sythe.com
Over 25 Years Old                mail.yahoo.com, apps.facebook.com        nanowrimo.org, cbox.ws
Household Income Under $50,000   scarleteen.com, boards.adultswim.com     opentable.com, marketwatch.com

Table 1: A selection of popular sites that are homogeneous along various demographic dimensions.

[...] visually apparent from Figure 5, there are significant differences in how groups distribute their time on the web. These differences, which, as mentioned above, hold for highly frequented sites such as Facebook and YouTube, are in some cases even more pronounced for lower traffic sites. For instance, the gaming site pogo.com accounts for less than 1% of pageviews among both low and high income users, but low income users spend almost twice as much of their time there.
19. Diversity of the Web
Site-level skew
This skew persists even when we restrict attention to the top 10k or 1k sites
20. Diversity of the Web
Sites vs. ZIPs
How does the diversity of the online and offline worlds compare?
[Figure: densities of Proportion Female, Proportion White, and Proportion College Educated (each from 0.0 to 1.0), comparing Sites and ZIPs]
21. Diversity of the Web
Sites vs. ZIPs
As expected, neighborhoods are more gender-balanced than sites
22. Diversity of the Web
Sites vs. ZIPs
But sites typically have more racially diverse audiences than neighborhoods have residents
23. Diversity of the Web
Sites vs. ZIPs
Skew by education is comparable, with online showing a bias towards higher education
24. Diversity of the Web
Group-level activity
How does browsing activity vary at the group level?
[Figure: daily per-capita pageviews (axis from 0 to 70) by group, in four panels: Race (White, Non-White), Education (College, No College), Sex (Female, Male), Age (Over 25, Under 25)]
25. Diversity of the Web
Group-level activity
Large differences exist even at the aggregate level (e.g., women on average generate 40% more pageviews than men)
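The group-level comparison can be sketched as a per-capita average over users in each group. The function name, the input dictionaries, and the toy numbers below are all assumptions for illustration; the toy values were picked so that the ratio matches the 40% figure quoted on the slide and are not data from the study.

```python
from collections import defaultdict


def daily_per_capita_pageviews(user_pageviews, user_group, n_days):
    """Return {group: average pageviews per user per day}.

    `user_pageviews` maps user id -> total pageviews over the period,
    `user_group` maps user id -> group label, `n_days` is the period length.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for user, group in user_group.items():
        totals[group] += user_pageviews.get(user, 0)
        counts[group] += 1
    return {group: totals[group] / (counts[group] * n_days) for group in counts}


# Made-up toy data: one female user and two male users over two days.
toy_views = {"u1": 140, "u2": 100, "u3": 100}
toy_groups = {"u1": "female", "u2": "male", "u3": "male"}
rates = daily_per_capita_pageviews(toy_views, toy_groups, n_days=2)
ratio = rates["female"] / rates["male"]
```

With these toy inputs the female rate is 70 and the male rate is 50 pageviews per day, so `ratio` is 1.4, i.e., 40% more.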
26. Diversity of the Web
Group-level activity
All groups spend more than a third of their time on a handful of email, search, and social networking sites
[Figure: percent of total time spent on site (axis ticks at 0.1%, 1%, 10%) for female and male users, over the most-visited sites, beginning with facebook.com, mail.yahoo.com, google.com, apps.facebook.com, mail.google.com, mail.live.com, and youtube.com]
27. Diversity of the Web
Group-level activity
But different groups distribute their time differently, both on universally popular and on more niche sites
28. Diversity of the Web
Group-level activity
But different groups distribute their time differently, both on universally popular and on more niche sites
[Figure: percent of total time spent on site (axis ticks at 0.1%, 1%, 10%) for white and non-white users, over the same most-visited sites]
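One way to compare how two groups distribute their time is to convert each group's time per site into shares of that group's total and then take the ratio for a given site. The sketch below assumes time has already been summed per site for each group; the function names and the numbers are hypothetical and chosen only to show a site that is small for both groups but twice as large for one of them.

```python
def time_share(seconds_by_site):
    """Convert {site: seconds} for one group into {site: share of total time}."""
    total = sum(seconds_by_site.values())
    return {site: seconds / total for site, seconds in seconds_by_site.items()}


def share_ratio(group_a, group_b, site):
    """How many times larger the site's share of time is in group A than in group B."""
    return time_share(group_a)[site] / time_share(group_b)[site]


# Made-up toy totals (in seconds) for two groups.
group_a = {"pogo.com": 8, "facebook.com": 992}
group_b = {"pogo.com": 4, "facebook.com": 996}
ratio = share_ratio(group_a, group_b, "pogo.com")
```

Here pogo.com is under 1% of time in both toy groups, yet its share in group A is twice its share in group B.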
30. Diversity of the Web
Individual-level prediction
How well can one predict an individual’s demographics from their browsing activity?
• Represent each user by the set of sites visited
• Fit linear models to predict majority/minority for each attribute on 80% of users
• Tune model parameters using a 10% validation set
• Evaluate final performance on held-out 10% test set
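The steps above can be sketched as follows. This is an illustration under stated assumptions, not the presenter's code: the slides say only "linear models", so L2-regularized logistic regression trained by stochastic gradient descent is a stand-in, and the site names, helper names, and synthetic users are hypothetical.

```python
import math
import random


def split_users(users, seed=0):
    """Shuffle and split into 80% train, 10% validation, 10% test."""
    users = list(users)
    random.Random(seed).shuffle(users)
    n_train = int(0.8 * len(users))
    n_val = int(0.1 * len(users))
    return (users[:n_train],
            users[n_train:n_train + n_val],
            users[n_train + n_val:])


def score(weights, bias, sites):
    """Linear score: bias plus the weights of the sites the user visited."""
    return bias + sum(weights.get(s, 0.0) for s in sites)


def fit(train, l2, epochs=20, lr=0.1):
    """Assumed model: L2-regularized logistic regression fit by SGD."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for sites, label in train:
            z = max(-30.0, min(30.0, score(weights, bias, sites)))
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - label
            bias -= lr * grad
            for s in sites:
                w = weights.get(s, 0.0)
                weights[s] = w - lr * (grad + l2 * w)
    return weights, bias


def accuracy(model, data):
    """Fraction of users whose predicted class matches the label."""
    weights, bias = model
    hits = sum((score(weights, bias, sites) > 0) == bool(label)
               for sites, label in data)
    return hits / len(data)


# Synthetic users: (set of visited sites, binary label). Purely illustrative.
rng = random.Random(1)
users = []
for i in range(200):
    label = i % 2
    sites = {"common.example", "a.example" if label else "b.example"}
    if rng.random() < 0.3:
        sites.add("noise.example")
    users.append((frozenset(sites), label))

train, val, test = split_users(users)
# Tune the regularization strength on the validation set only.
best_l2 = max([0.0, 0.01, 0.1], key=lambda l2: accuracy(fit(train, l2), val))
model = fit(train, best_l2)
test_accuracy = accuracy(model, test)
```

The test set is touched once, after the regularization strength is chosen, so the reported number is not biased by tuning.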
31. Diversity of the Web
GNU-fu
32. Diversity of the Web
Individual-level prediction
• Reasonable (∼70-85%) accuracy and AUC across all attributes
• Similar performance even when restricted to top 1k sites
• Can achieve substantially better performance when restricted to “stereotypical” users (∼80-90%)
[Figure: AUC and accuracy (axes from .5 to 1) for each classification task: College/No College; Under/Over $50,000 Household Income; White/Non-White; Female/Male; Over/Under 25 Years Old]
33. Diversity of the Web
Individual-level prediction
Highly weighted sites under the fitted models

Attribute                        Large positive weight          Large negative weight
Female                           winster.com, lancome-usa.com   sports.yahoo.com, espn.go.com
White                            marlboro.com, cmt.com          mediatakeout.com, bet.com
College Educated                 news.yahoo.com, linkedin.com   youtube.com, myspace.com
Over 25 Years Old                evite.com, classmates.com      addictinggames.com, youtube.com
Household Income Under $50,000   eharmony.com, tracfone.com     rownine.com, matrixdirect.com

Table 2: A selection of the most predictive (i.e., most highly weighted) sites for each classification task.
[Figure 7: AUC and accuracy for each classification task: College/No College; Under/Over $50,000 Household Income; White/Non-White; Female/Male; Over/Under 25 Years Old]
Figure 7, a measure that effectively re-normalizes the majority and minority classes to have equal size. Intuitively, AUC is the probability that a model scores a randomly selected positive example higher than a randomly selected negative one (e.g., the probability that the model correctly distinguishes between a randomly selected female and male). Though an uninformative rule would correctly discriminate
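The pairwise reading of AUC given in the excerpt can be sketched directly. The function name and the toy scores below are illustrative assumptions; counting ties as half a win is a common convention, not something stated in the slides.

```python
def auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs in which the positive example
    gets the higher score; ties count as half a correct pair."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Toy model scores; made up for illustration.
pos = [0.9, 0.8, 0.4]
neg = [0.7, 0.3, 0.2]
value = auc(pos, neg)  # 8 of the 9 pairs are ordered correctly
```

Because every positive is compared against every negative, the result does not depend on how many examples each class has, which is the re-normalization the excerpt refers to.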
34. Diversity of the Web
Individual-level prediction
Proof of concept browser demo
36. Diversity of the Web
The real story
(what we actually did)
38. Diversity on the Web
The real story
• Got several hundred GBs of MegaPanel data from Nielsen³
• Discussed possible projects
  • Predict user demographics (e.g., real-valued age) from a few minutes of browsing activity for ad targeting?
  • Infer the number of individuals using the same browser or behind the same IP?
  • Determine the number of actual unique visitors advertisers are receiving?
  • . . .
³ Special thanks to Mainak Mazumdar
39. Diversity on the Web
The real story (cont’d)
• Started with predicting real-valued age
• Worked on this for an embarrassingly long time (various methods, feature selection, etc.)
• Turns out to be difficult to do better than within 10 years of true age, on average
• Settled for classification on binary outcomes (e.g., adult/non-adult) over entire history
• Classification worked reasonably well for age and other attributes
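A small sketch of the two ways of scoring an age predictor mentioned above: average error in years for the regression framing, and agreement on a thresholded label for the binary framing. The ages, the predictions, and the cutoff of 18 for "adult" are assumptions chosen for illustration, not results from the study.

```python
def mean_absolute_error(true_ages, predicted_ages):
    """Average absolute difference, in years, between truth and prediction."""
    pairs = list(zip(true_ages, predicted_ages))
    return sum(abs(t - p) for t, p in pairs) / len(pairs)


def binary_agreement(true_ages, predicted_ages, threshold=18):
    """Fraction of users placed on the correct side of the age threshold."""
    pairs = list(zip(true_ages, predicted_ages))
    return sum((t >= threshold) == (p >= threshold) for t, p in pairs) / len(pairs)


# Made-up ages and predictions, for illustration only.
true_ages = [15, 30, 45, 60]
predicted_ages = [25, 22, 50, 48]

mae = mean_absolute_error(true_ages, predicted_ages)      # (10 + 8 + 5 + 12) / 4
agreement = binary_agreement(true_ages, predicted_ages)   # 3 of 4 on the right side
```

In this toy case a predictor that misses by several years on average still lands most users on the correct side of the cutoff, which is the kind of gap between the two framings the slide describes.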
40. Diversity on the Web
The real story (cont’d)
• Became curious about why classification worked well compared to regression
• Generated descriptive statistics across all attributes at the site and group levels
• Compared site statistics to ZIP code data from the US Census
• Compared time distribution across groups
• Realized that we now had the largest comprehensive study of demographic diversity on the web