SlideShare a Scribd company logo
ith
y w
Pla
Matthieu Scordia
Kaggle?

Kaggle is a platform for predictive modeling competitions.

“We're making data science into a sport.”
Let's enter a challenge!
The Data

Noteworthy characteristics of the dataset:
●

Unique queries:

21,073,569

●

Unique urls:

703,484,26

●

Unique users:

5,736,333

●

Training sessions:

34,573,630

●

Test sessions:

797,867

●

Clicks in the training data:

64,693,054

Total records in the log: 167,413,039 (=15Go!)
Let's shake the data
Logs format.

Session 0

Session 1
Session 2
Session 3

Session 4
Sessions format.

session 5
day 15
user 1

Term IDs

(Url,Domain) ranked

Session metadata

Url clicked
time passed
Evaluation.

The URLs are labeled using 3 grades of relevance: {0, 1, 2}
The labeling is done automatically, based on dwell-time.

0 : irrelevant - no clicks and clicks with dwell time < 50 time units.
1 : relevant - clicks with dwell time > 50 and < 399 time units.
2 : highly relevant - clicks with dwell time > 400 time units.
How to beat Yandex?

Clicks
count

Rank

So we have to sort better than that!
Step 1 : Reshape it!

For each user we would estimate his click probability on an url
Step 2 : Cross validate

Split your dataset like Yandex did:
- On the last 3 days.
- Only one session by user.

Goal: auto-evaluate our model.
Step 3 : Add new features

We add some informations on each user:
- Did he see this url in the past?
- Did he click on it?
- How many times?
- Did he skip it?
- Had he ever click on a rank 9 url in the past?
Step 3 : Add new features

The thing is, we don't want to re-rank all...
So we add click entropy:

For each query:

Where p(x) is the percentage of clicks on document x among all clicks.

Example:
Small click entropy query: yahoo, youtube.
Large click entropy query: photos, jobs.
Step 4 : the model.

Goal: Predict the probability of click of an user on an url.
Our training set:
session

url

features...

We use logistic regression and random forest.

target
The leaderboard.
Thanks !

If you want to enter with us in a future challenge:

contact@dataiku.com

More Related Content

Similar to Play with Kaggle

Welcome Webinar Slides
Welcome Webinar SlidesWelcome Webinar Slides
Welcome Webinar Slides
Sumo Logic
 
SEMNE Google Analytics Master Class - 15 Oct 2014
SEMNE Google Analytics Master Class - 15 Oct 2014SEMNE Google Analytics Master Class - 15 Oct 2014
SEMNE Google Analytics Master Class - 15 Oct 2014
Jay Murphy
 
Recsys2016 Tutorial by Xavier and Deepak
Recsys2016 Tutorial by Xavier and DeepakRecsys2016 Tutorial by Xavier and Deepak
Recsys2016 Tutorial by Xavier and Deepak
Deepak Agarwal
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big Data
Abhishek M Shivalingaiah
 
Google Analytics
Google Analytics Google Analytics
Iterative Methodology for Personalization Models Optimization
 Iterative Methodology for Personalization Models Optimization Iterative Methodology for Personalization Models Optimization
Iterative Methodology for Personalization Models Optimization
Sonya Liberman
 
moma-django overview --> Django + MongoDB: building a custom ORM layer
moma-django overview --> Django + MongoDB: building a custom ORM layermoma-django overview --> Django + MongoDB: building a custom ORM layer
moma-django overview --> Django + MongoDB: building a custom ORM layer
Gadi Oren
 
Supercharging your Organic CTR
Supercharging your Organic CTRSupercharging your Organic CTR
Supercharging your Organic CTR
Phil Pearce
 
Владимир Гулин, Mail.Ru Group, Learning to rank using clickthrough data
Владимир Гулин, Mail.Ru Group, Learning to rank using clickthrough dataВладимир Гулин, Mail.Ru Group, Learning to rank using clickthrough data
Владимир Гулин, Mail.Ru Group, Learning to rank using clickthrough data
Mail.ru Group
 
kdd2015
kdd2015kdd2015
Search Engine Optimization
Search Engine OptimizationSearch Engine Optimization
Search Engine Optimization
SD Sharma
 
How can a data layer help my seo
How can a data layer help my seoHow can a data layer help my seo
How can a data layer help my seo
Phil Pearce
 
Mozilla Foundation Metrics - presentation to engineers
Mozilla Foundation Metrics - presentation to engineersMozilla Foundation Metrics - presentation to engineers
Mozilla Foundation Metrics - presentation to engineers
John Schneider
 
Profiler for Smartphone Users Interests Using Modified Hierarchical Agglomera...
Profiler for Smartphone Users Interests Using Modified Hierarchical Agglomera...Profiler for Smartphone Users Interests Using Modified Hierarchical Agglomera...
Profiler for Smartphone Users Interests Using Modified Hierarchical Agglomera...
Lippo Group Digital
 
A new algorithm for inferring user search goals with feedback sessions
A new algorithm for inferring user search goals with feedback sessionsA new algorithm for inferring user search goals with feedback sessions
A new algorithm for inferring user search goals with feedback sessions
JPINFOTECH JAYAPRAKASH
 
Recommender Systems @ Scale - PyData 2019
Recommender Systems @ Scale - PyData 2019Recommender Systems @ Scale - PyData 2019
Recommender Systems @ Scale - PyData 2019
Sonya Liberman
 
A new algorithm for inferring user search goals with feedback sessions
A new algorithm for inferring user search goals with feedback sessionsA new algorithm for inferring user search goals with feedback sessions
A new algorithm for inferring user search goals with feedback sessions
JPINFOTECH JAYAPRAKASH
 
Find and be Found: Information Retrieval at LinkedIn
Find and be Found: Information Retrieval at LinkedInFind and be Found: Information Retrieval at LinkedIn
Find and be Found: Information Retrieval at LinkedIn
Daniel Tunkelang
 
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
Spark Summit
 
Intro to Google Analytics and Google AdWords (March 19 2013)
Intro to Google Analytics and Google AdWords (March 19 2013)Intro to Google Analytics and Google AdWords (March 19 2013)
Intro to Google Analytics and Google AdWords (March 19 2013)
Chester County Marketing Group
 

Similar to Play with Kaggle (20)

Welcome Webinar Slides
Welcome Webinar SlidesWelcome Webinar Slides
Welcome Webinar Slides
 
SEMNE Google Analytics Master Class - 15 Oct 2014
SEMNE Google Analytics Master Class - 15 Oct 2014SEMNE Google Analytics Master Class - 15 Oct 2014
SEMNE Google Analytics Master Class - 15 Oct 2014
 
Recsys2016 Tutorial by Xavier and Deepak
Recsys2016 Tutorial by Xavier and DeepakRecsys2016 Tutorial by Xavier and Deepak
Recsys2016 Tutorial by Xavier and Deepak
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big Data
 
Google Analytics
Google Analytics Google Analytics
Google Analytics
 
Iterative Methodology for Personalization Models Optimization
 Iterative Methodology for Personalization Models Optimization Iterative Methodology for Personalization Models Optimization
Iterative Methodology for Personalization Models Optimization
 
moma-django overview --> Django + MongoDB: building a custom ORM layer
moma-django overview --> Django + MongoDB: building a custom ORM layermoma-django overview --> Django + MongoDB: building a custom ORM layer
moma-django overview --> Django + MongoDB: building a custom ORM layer
 
Supercharging your Organic CTR
Supercharging your Organic CTRSupercharging your Organic CTR
Supercharging your Organic CTR
 
Владимир Гулин, Mail.Ru Group, Learning to rank using clickthrough data
Владимир Гулин, Mail.Ru Group, Learning to rank using clickthrough dataВладимир Гулин, Mail.Ru Group, Learning to rank using clickthrough data
Владимир Гулин, Mail.Ru Group, Learning to rank using clickthrough data
 
kdd2015
kdd2015kdd2015
kdd2015
 
Search Engine Optimization
Search Engine OptimizationSearch Engine Optimization
Search Engine Optimization
 
How can a data layer help my seo
How can a data layer help my seoHow can a data layer help my seo
How can a data layer help my seo
 
Mozilla Foundation Metrics - presentation to engineers
Mozilla Foundation Metrics - presentation to engineersMozilla Foundation Metrics - presentation to engineers
Mozilla Foundation Metrics - presentation to engineers
 
Profiler for Smartphone Users Interests Using Modified Hierarchical Agglomera...
Profiler for Smartphone Users Interests Using Modified Hierarchical Agglomera...Profiler for Smartphone Users Interests Using Modified Hierarchical Agglomera...
Profiler for Smartphone Users Interests Using Modified Hierarchical Agglomera...
 
A new algorithm for inferring user search goals with feedback sessions
A new algorithm for inferring user search goals with feedback sessionsA new algorithm for inferring user search goals with feedback sessions
A new algorithm for inferring user search goals with feedback sessions
 
Recommender Systems @ Scale - PyData 2019
Recommender Systems @ Scale - PyData 2019Recommender Systems @ Scale - PyData 2019
Recommender Systems @ Scale - PyData 2019
 
A new algorithm for inferring user search goals with feedback sessions
A new algorithm for inferring user search goals with feedback sessionsA new algorithm for inferring user search goals with feedback sessions
A new algorithm for inferring user search goals with feedback sessions
 
Find and be Found: Information Retrieval at LinkedIn
Find and be Found: Information Retrieval at LinkedInFind and be Found: Information Retrieval at LinkedIn
Find and be Found: Information Retrieval at LinkedIn
 
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
 
Intro to Google Analytics and Google AdWords (March 19 2013)
Intro to Google Analytics and Google AdWords (March 19 2013)Intro to Google Analytics and Google AdWords (March 19 2013)
Intro to Google Analytics and Google AdWords (March 19 2013)
 

Recently uploaded

HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
kumardaparthi1024
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
Zilliz
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
Claudio Di Ciccio
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
Daiki Mogmet Ito
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
Edge AI and Vision Alliance
 

Recently uploaded (20)

HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
 

Play with Kaggle

  • 2. Kaggle? Kaggle is a platform for predictive modeling competitions. “We're making data science into a sport.”
  • 3. Let's enter a challenge!
  • 4. The Data Noteworthy characteristics of the dataset: ● Unique queries: 21,073,569 ● Unique urls: 703,484,26 ● Unique users: 5,736,333 ● Training sessions: 34,573,630 ● Test sessions: 797,867 ● Clicks in the training data: 64,693,054 Total records in the log: 167,413,039 (=15Go!)
  • 6. Logs format. Session 0 Session 1 Session 2 Session 3 Session 4
  • 7. Sessions format. session 5 day 15 user 1 Term IDs (Url,Domain) ranked Session metadata Url clicked time passed
  • 8. Evaluation. The URLs are labeled using 3 grades of relevance: {0, 1, 2} The labeling is done automatically, based on dwell-time. 0 : irrelevant - no clicks and clicks with dwell time < 50 time units. 1 : relevant - clicks with dwell time > 50 and < 399 time units. 2 : highly relevant - clicks with dwell time > 400 time units.
  • 9. How to beat Yandex? Clicks count Rank So we have to sort better than that!
  • 10. Step 1 : Reshape it! For each user we would estimate his click probability on an url
  • 11. Step 2 : Cross validate Split your dataset like Yandex did: - On the last 3 days. - Only one session by user. Goal: auto-evaluate our model.
  • 12. Step 3 : Add new features We add some informations on each user: - Did he see this url in the past? - Did he click on it? - How many times? - Did he skip it? - Had he ever click on a rank 9 url in the past?
  • 13. Step 3 : Add new features The thing is, we don't want to re-rank all... So we add click entropy: For each query: Where p(x) is the percentage of clicks on document x among all clicks. Example: Small click entropy query: yahoo, youtube. Large click entropy query: photos, jobs.
  • 14. Step 4 : the model. Goal: Predict the probability of click of an user on an url. Our training set: session url features... We use logistic regression and random forest. target
  • 16. Thanks ! If you want to enter with us in a future challenge: contact@dataiku.com