© 2018 Jumpshot, Inc.
Calibration of the web browsing traffic
Jonas Amrich
Machine Learning Prague
2018
Jumpshot delivers digital
intelligence from within
the Internet’s most
valuable walled gardens.
This is done using
clickstream data from our
panelists.
[Figure: Number of Days Viewed in Q1 2017]
Our panel
100M devices
5B clicks/day
Worldwide
3.5B devices
~10^11 clicks/day
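As a rough sanity check, the panel figures on the slide can be scaled up to the worldwide device count. The sketch below assumes, purely for illustration, that a panel device clicks as often as an average device worldwide; that assumption is exactly what a biased panel violates.

```python
# Figures quoted on the slide.
panel_devices = 100e6         # 100M devices in the panel
panel_clicks_per_day = 5e9    # 5B clicks/day observed in the panel
world_devices = 3.5e9         # 3.5B devices worldwide

# Share of the worldwide device population covered by the panel.
sampling_fraction = panel_devices / world_devices

# Average activity of one panel device.
clicks_per_device = panel_clicks_per_day / panel_devices

# Naive extrapolation (assumption: panel devices behave like the average device).
naive_world_clicks = clicks_per_device * world_devices

print(f"sampling fraction: {sampling_fraction:.1%}")
print(f"clicks per panel device per day: {clicks_per_device:.0f}")
print(f"naive worldwide clicks/day: {naive_world_clicks:.2e}")
```

The naive estimate lands in the same order of magnitude as the ~10^11 clicks/day quoted for the whole population.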
We wish for a uniformly
random sample
Easy to describe the
whole population
Sadly, we have a
biased sample
Hard to describe the
whole population
Ignoring sampling bias
may lead to fatal
mistakes in prediction.
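A toy calculation shows how large the error can get. All group names, sizes, and visit rates below are hypothetical. The population shares are treated as known here only to make the effect visible; in practice the true group sizes are not available.

```python
# Hypothetical population split into two behavioral groups.
population = {"news_readers": 200, "video_watchers": 800}
# Hypothetical panel that over-samples news readers.
panel = {"news_readers": 50, "video_watchers": 50}
# Hypothetical share of each group that visits some news site.
visit_rate = {"news_readers": 0.60, "video_watchers": 0.05}


def share(groups):
    """Turn absolute group sizes into fractions of the total."""
    total = sum(groups.values())
    return {g: n / total for g, n in groups.items()}


pop_share = share(population)
panel_share = share(panel)

# True reach of the site in the population vs. the naive panel estimate.
true_rate = sum(pop_share[g] * visit_rate[g] for g in population)
naive_rate = sum(panel_share[g] * visit_rate[g] for g in panel)

# Correcting weight per group: population share divided by panel share.
weights = {g: pop_share[g] / panel_share[g] for g in panel}
weighted_rate = sum(panel_share[g] * weights[g] * visit_rate[g] for g in panel)

print(f"true: {true_rate:.3f}, naive: {naive_rate:.3f}, weighted: {weighted_rate:.3f}")
```

In this toy setup the unweighted panel roughly doubles the site's apparent reach, while the weighted estimate recovers the true value.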
How to correct the biased sample?
1. Identify significant groups of users in our panel
2. Assign correcting weights to these groups
1. Identify behavior of significant groups in the panel
We do not know age, sex
or income of the panelists,
but we know their
browsing behavior.
We model it using Latent
Semantic Analysis (LSA)
from natural language
processing.
LSA in NLP: Compact representation of articles
[Diagram: articles ("Dewey defeats Truman", "Trump likes golf") are mapped from a huge set of words (golf, president) to a small set of topics (politics, sport).]
[Diagram: the same idea applied to browsing. Panelists (panelist 1, panelist 2) are mapped from a huge set of domains (nature.com, youtube.com) to a small set of cohorts (science, video).]
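The analogy can be sketched with a truncated SVD, the factorization at the core of LSA. The visit counts, the log scaling, and the choice of k = 3 are illustrative assumptions, not the talk's actual configuration. At the scale of a real panel a dense SVD would not be feasible, and a sparse truncated solver would be needed instead.

```python
import numpy as np

# Hypothetical visit counts: rows are panelists, columns are domains.
domains = ["nature.com", "youtube.com", "chicagotribune.com",
           "nytimes.com", "politifact.com", "politico.com"]
visits = np.array([
    [9, 1, 0, 0, 0, 0],   # mostly science
    [7, 0, 0, 1, 0, 0],   # mostly science
    [1, 8, 0, 0, 0, 0],   # mostly video
    [0, 9, 0, 0, 0, 1],   # mostly video
    [0, 0, 3, 6, 4, 7],   # mostly politics
    [0, 1, 5, 4, 2, 8],   # mostly politics
], dtype=float)

# Dampen heavy users (assumption; TF-IDF-style scaling is another common choice).
X = np.log1p(visits)

# Keep only the k strongest latent dimensions: the "small set of cohorts".
k = 3
U, s, Vt = np.linalg.svd(X, full_matrices=False)
panelist_vecs = U[:, :k] * s[:k]   # compact representation of each panelist
domain_loadings = Vt[:k]           # how strongly each domain loads on each dimension


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


print("two science panelists:", cosine(panelist_vecs[0], panelist_vecs[1]))
print("science vs. politics: ", cosine(panelist_vecs[0], panelist_vecs[4]))
```

Panelists with similar browsing end up close together in the reduced space, even though no demographic labels were used.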
100 M users
100 k domains
10 TB input data
The procedure has to be fast:
we need to retrain our
models frequently.
LSA is unsupervised, so
we can leverage our
large datasets.
As a result, we get
accurate behavioral
profiles of the online
population.
Cluster interested in
politics
chicagotribune.com
nytimes.com
politifact.com
politico.com
How to correct the biased sample?
1. Identify significant groups of users in our panel
Groups identified according to behavior using LSA
2. Assign correcting weights to these groups
2. Assign correcting weights to groups of users
Cluster interested in politics: real size is unknown
We don’t know the true
number of people
interested in politics.
But we can obtain the
number of visitors for
some domains.
[Figure: panel vs. reality. politico.com is over-represented in the panel (correction 1.1x); times.com is under-represented in the panel (correction 4x).]
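One plausible reading of these factors is the ratio of real to panel visitors per domain. The visitor counts below are hypothetical and chosen only to reproduce the 1.1x and 4x factors from the slide. The slide does not say whether its factors are raw ratios or normalized in some way.

```python
# Hypothetical daily unique visitors (only the resulting ratios come from the slide).
panel_visitors = {"politico.com": 10_000, "times.com": 2_500}
real_visitors = {"politico.com": 11_000, "times.com": 10_000}

# Correction factor per domain: how much panel traffic must be scaled up
# to match the externally known traffic.
correction = {d: real_visitors[d] / panel_visitors[d] for d in panel_visitors}

for domain, factor in correction.items():
    print(f"{domain}: {factor:.1f}x")
```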
We find a domain’s
representation as the
mean of its visitors.
[Figure: politico.com and times.com, each placed at the mean of its visitors.]
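In code, this step is an average over the behavioral vectors of the panelists who visited a domain. The two-dimensional vectors and the visit lists below are hypothetical placeholders for the LSA output and the clickstream.

```python
import numpy as np

# Hypothetical behavioral vectors for five panelists (e.g. produced by LSA).
user_vecs = np.array([
    [0.9, 0.1],
    [0.8, 0.3],
    [0.2, 0.9],
    [0.1, 0.7],
    [0.5, 0.5],
])

# Hypothetical visit log: row indices of the panelists who visited each domain.
visitors = {
    "politico.com": [0, 1, 4],
    "times.com": [2, 3, 4],
}

# A domain is represented by the mean vector of its visitors.
domain_vecs = {d: user_vecs[idx].mean(axis=0) for d, idx in visitors.items()}

for domain, vec in domain_vecs.items():
    print(domain, vec.round(3))
```

Domains and users now live in the same vector space, which is what later allows a model fitted on domains to be evaluated on users.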
We train a regression
model which fits the known
correction weights.
[Figure: times.com (weight 4x) and politico.com (weight 1.1x) on a weight scale from 1x to 10x.]
And we use this model to
assign weights to users.
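A minimal sketch of "train on domains, apply to users" follows. It uses ordinary least squares on log-weights so that predicted weights stay positive; that choice, the vectors, and all weights other than 1.1x and 4x are assumptions made for illustration.

```python
import numpy as np

# Hypothetical domain representations (mean behavioral vector of each domain's visitors).
domain_vecs = np.array([
    [0.8, 0.2],   # e.g. politico.com
    [0.2, 0.8],   # e.g. times.com
    [0.5, 0.5],
    [0.9, 0.4],
])
# Known correction weights; 1.1 and 4.0 are from the slides, the rest are made up.
known_weights = np.array([1.1, 4.0, 2.0, 1.2])


def add_bias(X):
    """Append a constant column so the model has an intercept."""
    return np.hstack([X, np.ones((X.shape[0], 1))])


# Fit a linear model to log-weights (assumption: keeps predictions positive).
coef, *_ = np.linalg.lstsq(add_bias(domain_vecs), np.log(known_weights), rcond=None)

# Hypothetical users living in the same vector space as the domains.
user_vecs = np.array([
    [0.9, 0.1],   # browses like a politico.com visitor
    [0.2, 0.9],   # browses like a times.com visitor
])
user_weights = np.exp(add_bias(user_vecs) @ coef)

print(user_weights.round(2))
```

A user who resembles the visitors of an under-represented domain receives a larger weight than one who resembles the visitors of an over-represented domain.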
Dataset of daily unique visitors on 10k domains from external source
Small to mid-size domains with unstable traffic
How to deal with variation and outliers?
Average data over time
• Lower variation
• Information is lost
Pick a robust model
• More data points
• Information is utilized
Huber Regression
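The effect of a robust loss can be sketched with a small hand-rolled Huber regression, fitted by iteratively reweighted least squares. The data, the fixed threshold `delta`, and the helper names are hypothetical. Library implementations such as scikit-learn's `HuberRegressor` additionally estimate the noise scale, which this sketch does not.

```python
import numpy as np


def design(x):
    """Design matrix with a slope column and an intercept column."""
    return np.column_stack([x, np.ones_like(x)])


def fit_ols(x, y):
    """Ordinary least squares: returns (slope, intercept)."""
    coef, *_ = np.linalg.lstsq(design(x), y, rcond=None)
    return coef


def fit_huber(x, y, delta=1.0, n_iter=50):
    """Huber regression via iteratively reweighted least squares."""
    A = design(x)
    coef = fit_ols(x, y)
    for _ in range(n_iter):
        r = np.abs(y - A @ coef)
        # Residuals within delta keep full weight; larger ones are down-weighted.
        w = np.where(r <= delta, 1.0, delta / np.maximum(r, 1e-12))
        sw = np.sqrt(w)
        coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef


# Hypothetical observations on the line y = 2x + 1 with small deterministic noise.
x = np.linspace(0, 10, 30)
y = 2 * x + 1 + 0.2 * np.sin(np.arange(30))
y[-1] += 40  # one traffic spike acting as an outlier

ols = fit_ols(x, y)
hub = fit_huber(x, y)

print("OLS slope:  ", round(float(ols[0]), 3))
print("Huber slope:", round(float(hub[0]), 3))
```

The single outlier noticeably tilts the least-squares line, while the Huber fit stays close to the underlying slope without having to average the daily data first.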
How to correct the biased sample?
1. Identify significant groups of users in our panel
Groups identified according to behavior using LSA
2. Assign correcting weights to these groups
Trained on domains, applied to users
Thank you!
Jonáš Amrich
& Břéťa Šopík
jumpshot.com
