Calibration of the web browsing traffic | MLPrague 2018
- 1. © 2018 Jumpshot, Inc.
Calibration of the web browsing traffic
Jonas Amrich
Machine Learning Prague
2018
- 2. © 2018 Jumpshot, Inc.
Jumpshot delivers digital
intelligence from within
the Internet’s most
valuable walled gardens.
Number of Days Viewed in Q1 2017
- 3. © 2018 Jumpshot, Inc.
This is done using
clickstream data from our
panelists.
- 5. © 2018 Jumpshot, Inc.
Worldwide
3.5B devices
~10^11 clicks/day
Our panel
100M devices
5B clicks/day
- 6. © 2018 Jumpshot, Inc.
We wish for a uniformly
random sample
Easy to describe
whole population
- 7. © 2018 Jumpshot, Inc.
Sadly, we have a
biased sample
Hard to describe
whole population
- 8. © 2018 Jumpshot, Inc.
Ignoring sampling bias
may lead to fatal
mistakes in prediction.
- 9. © 2018 Jumpshot, Inc.
How to correct the biased sample?
1. Identify significant groups of users in our panel
2. Assign correcting weights to these groups
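The two steps above can be illustrated with a tiny worked example. This is only a sketch: the group names, shares, and visit rates are invented for illustration, and the weights are computed from known population shares, which the real panel does not have.

```python
# Hypothetical two-group population: each group's share of the population
# and the fraction of its members who visit some site.
population_share = {"politics": 0.2, "other": 0.8}
visit_rate = {"politics": 0.5, "other": 0.1}

# Assumed biased panel: the "politics" group is over-represented.
panel_share = {"politics": 0.5, "other": 0.5}

groups = list(visit_rate)

# Ground truth, and the naive estimate that ignores the bias.
true_rate = sum(population_share[g] * visit_rate[g] for g in groups)
naive_rate = sum(panel_share[g] * visit_rate[g] for g in groups)

# Correcting weight per group: population share divided by panel share.
weights = {g: population_share[g] / panel_share[g] for g in groups}
weighted_rate = sum(panel_share[g] * weights[g] * visit_rate[g] for g in groups)

print(f"true={true_rate:.2f} naive={naive_rate:.2f} weighted={weighted_rate:.2f}")
```

The naive estimate overshoots because the over-represented group visits the site more often; the weighted estimate matches the true rate.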
- 10. © 2018 Jumpshot, Inc.
1. Identify behavior of significant groups in the panel
We do not know age, sex
or income of the panelists,
but we know their
browsing behavior.
We model it using Latent
Semantic Analysis (LSA)
from natural language
processing.
- 11. © 2018 Jumpshot, Inc.
LSA in NLP: Compact representation of articles
articles: Dewey defeats Truman, Trump likes golf
huge set of words: golf, president
small set of topics: politics, sport
- 12. © 2018 Jumpshot, Inc.
panelists: panelist 1, panelist 2
huge set of domains: nature.com, youtube.com
small set of cohorts: science, video
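A minimal sketch of LSA applied to browsing data, assuming the input is a panelist-by-domain matrix of visit counts and that LSA is computed as a truncated SVD. The matrix, the extra domain names, and the helper names `lsa` and `cosine` are illustrative, not taken from the talk.

```python
import numpy as np

# Toy panelist x domain visit-count matrix (hypothetical data).
domains = ["nature.com", "arxiv.org", "youtube.com", "vimeo.com"]
visits = np.array([
    [5, 3, 0, 0],   # panelist 1: mostly science sites
    [4, 6, 1, 0],   # panelist 2: mostly science sites
    [0, 0, 7, 2],   # panelist 3: mostly video sites
    [0, 1, 3, 5],   # panelist 4: mostly video sites
], dtype=float)

def lsa(matrix, k):
    """Rank-k truncated SVD: returns (row embeddings, component loadings)."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k], vt[:k]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Each panelist becomes a short vector over k latent "cohorts".
panelist_vecs, components = lsa(visits, k=2)

print(cosine(panelist_vecs[0], panelist_vecs[1]))  # two science readers
print(cosine(panelist_vecs[0], panelist_vecs[2]))  # science vs. video
```

Panelists with similar browsing end up close together in the low-dimensional space, which is what makes the compact vectors usable as behavioral profiles.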
- 13. © 2018 Jumpshot, Inc.
100 M users
100 k domains
10 TB input data
Procedure has to be fast —
we need to retrain our
models frequently.
LSA is unsupervised,
we can leverage our
large datasets.
- 14. © 2018 Jumpshot, Inc.
As a result, we get
accurate behavioral
profiles of the online
population.
- 15. © 2018 Jumpshot, Inc.
Cluster interested in
politics
chicagotribune.com
nytimes.com
politifact.com
politico.com
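One plausible way to read such a cluster is to list the domains that load most heavily on a single LSA component. The sketch below assumes that approach; the loadings are invented and `top_domains` is a hypothetical helper, not the method confirmed in the talk.

```python
import numpy as np

domains = ["politico.com", "nytimes.com", "youtube.com", "nature.com"]
# Hypothetical loadings of one LSA component over the domains above.
component = np.array([0.71, 0.62, 0.05, 0.10])

def top_domains(component, domains, n=2):
    """Return the n domains with the largest absolute loading."""
    order = np.argsort(-np.abs(component))[:n]
    return [domains[i] for i in order]

print(top_domains(component, domains))
```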
- 16. © 2018 Jumpshot, Inc.
How to correct the biased sample?
1. Identify significant groups of users in our panel
Groups identified according to behavior using LSA
2. Assign correcting weights to these groups
- 17. © 2018 Jumpshot, Inc.
2. Assign correcting weights to groups of users
Cluster interested in politics
Real size is unknown
- 18. © 2018 Jumpshot, Inc.
We don’t know true
number of people
interested in politics.
But we can obtain
number of visitors for
some domains.
Panel vs. reality:
politico.com (over-represented in panel): 1.1x
times.com (under-represented in panel): 4x
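As a sketch of where figures like 1.1x and 4x could come from, assume the correction weight of a domain is the ratio of its externally reported visitors to its panel visitors (after any global scaling). The counts below are invented so that the ratios match the slide; `correction_weight` is a hypothetical helper.

```python
# Hypothetical daily unique-visitor counts; only the ratios mirror the slide.
panel_visitors = {"politico.com": 100_000, "times.com": 25_000}
external_visitors = {"politico.com": 110_000, "times.com": 100_000}

def correction_weight(domain):
    """Assumed definition: external (real) visitors per panel visitor."""
    return external_visitors[domain] / panel_visitors[domain]

for domain in panel_visitors:
    print(domain, correction_weight(domain))
```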
- 19. © 2018 Jumpshot, Inc.
politico.com
(mean of visitors)
times.com
(mean of visitors)
We compute a domain’s
representation as the mean
of its visitors’ profiles.
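A minimal sketch of this averaging step, assuming each panelist already has a behavioral vector (for example from LSA). The vectors, the visitor lists, and the helper `domain_vector` are made up for illustration.

```python
import numpy as np

# Hypothetical 2-d behavioral vectors for four panelists.
user_vecs = np.array([
    [1.0, 0.0],
    [0.8, 0.2],
    [0.0, 1.0],
    [0.2, 0.8],
])

# Hypothetical visit log: domain -> indices of panelists who visited it.
visitors = {"politico.com": [0, 1], "times.com": [1, 2, 3]}

def domain_vector(domain):
    """Represent a domain as the mean vector of its visitors."""
    return user_vecs[visitors[domain]].mean(axis=0)

print(domain_vector("politico.com"))
print(domain_vector("times.com"))
```

Because domains and users then live in the same vector space, a model fitted on domain vectors can later be evaluated on user vectors.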
- 20.
We train a regression
model which fits known
correction weights.
times.com: weight 4x
politico.com: weight 1.1x
Weight scale: 1x to 10x
- 21. © 2018 Jumpshot, Inc.
And we use this model to
assign weights to users.
times.com: weight 4x
politico.com: weight 1.1x
Weight scale: 1x to 10x
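The train-on-domains, apply-to-users idea can be sketched with ordinary least squares. This is an illustration under assumptions: the vectors are invented, the first two weights mirror the 1.1x and 4x figures while the other two are made up, and a plain linear model without intercept stands in for whatever regression the talk actually uses.

```python
import numpy as np

# Hypothetical domain representations (mean of visitors) and known weights.
domain_vecs = np.array([
    [0.90, 0.10],
    [0.32, 0.68],
    [0.50, 0.50],
    [0.10, 0.90],
])
domain_weights = np.array([1.1, 4.0, 3.1, 5.1])

# Fit weight ~ vec @ coef by ordinary least squares (no intercept, for brevity).
coef, *_ = np.linalg.lstsq(domain_vecs, domain_weights, rcond=None)

# Apply the same model to individual users, who live in the same vector space.
user_vecs = np.array([[0.8, 0.2], [0.2, 0.8]])
user_weights = user_vecs @ coef

print(coef)
print(user_weights)
```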
- 22. © 2018 Jumpshot, Inc.
Dataset of daily unique visitors on 10k domains from an external source
Small to mid-size domains with unstable traffic
How to deal with variation and outliers?
- 23. © 2018 Jumpshot, Inc.
Average data over time
• Lower variation
• Information is lost
Pick a robust model
• More data points
• Information is utilized
- 24. © 2018 Jumpshot, Inc.
Huber Regression
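A small sketch of why a robust model helps with outliers, using scikit-learn's `HuberRegressor` next to ordinary least squares. The synthetic data (a line with slope 3 plus a few inflated points) is an assumption for illustration and is unrelated to the real visitor dataset.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: y = 1 + 3x plus small noise.
x = np.linspace(0.0, 1.0, 50).reshape(-1, 1)
y = 1.0 + 3.0 * x.ravel() + rng.normal(scale=0.05, size=50)

# Simulate a few observations with wildly inflated traffic (outliers).
y[-5:] += 10.0

huber = HuberRegressor().fit(x, y)
ols = LinearRegression().fit(x, y)

print("Huber slope:", huber.coef_[0])
print("OLS slope:  ", ols.coef_[0])
```

The Huber loss is quadratic for small residuals and linear for large ones, so the outliers pull the fitted slope far less than they do under squared error, without having to average them away first.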
- 25. © 2018 Jumpshot, Inc.
How to correct the biased sample?
1. Identify significant groups of users in our panel
Groups identified according to behavior using LSA
2. Assign correcting weights to these groups
Trained on domains, applied to users