WHAT DOES IT TAKE TO WIN THE
KAGGLE/YANDEX COMPETITION

Christophe Bourguignat
Kenji Lefèvre-Hasegawa
Paul Masurel @Dataiku
Matthieu Scordia @Dataiku
OUTLINE OF THE TALK

• Review of the Kaggle/Yandex Challenge
• How we worked (team work & tools)
• The winning model
GOAL
Re-rank the URLs returned by Yandex according to each user's personal preferences
[Diagram: two orderings of the same result set (url1 to url4), one before and one after re-ranking]

ML CHALLENGE
Predict how relevant each URL is to the user and re-rank the result set accordingly
The Kaggle/Yandex challenge
GIVEN
• 30 days of logs (train: 27 days, test: 3 days)
• Users' historical queries, clicks & dwell times
[Diagram: a user's history drawn as a sequence of queries: Q Q Q Q]
• For each test session: prior activity (queries, clicks & dwell times)
[Diagram: a test session drawn as: Q Q T ?]

SIZE
• 15 GB
QUALITY METRIC
• One test query per user, taken from the last 3 days
• The NDCG metric penalizes relevance errors most heavily on the top-ranked URLs
• No A/B test
[Diagram: two rankings of the same URLs, labeled "Prediction" and "Another ranking", scored by Kaggle as OK and BAD]
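
To make the NDCG bullet concrete, here is a minimal sketch of the metric. The slides do not give the formula, so the details are assumptions: integer relevance grades (0 = irrelevant, higher = more relevant), the common gain 2^grade - 1, a log2 position discount, and a cutoff k = 10. The function names `dcg` and `ndcg` and the sample grades are illustrative only.

```python
import math


def dcg(relevances, k=10):
    """Discounted cumulative gain of the first k relevance grades.

    Assumed gain: 2**grade - 1. Assumed discount: log2(position + 2),
    so the top position (index 0) is divided by log2(2) = 1.
    """
    return sum(
        (2 ** rel - 1) / math.log2(position + 2)
        for position, rel in enumerate(relevances[:k])
    )


def ndcg(relevances, k=10):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical grades for four URLs, listed in ranked order.
good_ranking = [2, 1, 0, 0]  # most relevant URL shown first
bad_ranking = [0, 0, 1, 2]   # most relevant URL shown last
print(ndcg(good_ranking), ndcg(bad_ranking))
```

The same set of grades scores lower when the relevant URLs sit further down, which is the "penalizes errors on top-ranked URLs" behavior the slide describes.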
TEAM DATAIKU SCIENCE STUDIO / KAGGLE

• Christophe Bourguignat: engineer, data enthusiast
• Kenji Lefèvre-Hasegawa: Ph.D. in math, new to ML
• Paul Masurel: software engineer @dataiku
• Matthieu Scordia: data scientist @dataiku

First meeting: October 16th, 2013

How we worked (Team work & tools)
WE’VE USED
• Related papers (mainly Microsoft’s)
• 12 cores, 64 GB
• Python scikit-learn
• Dataiku Science Studio
• Java RankLib
DATAIKU SCIENCE STUDIO
[Workflow diagram with the nodes: Original train, Split train & validation, Features, Labels, Features & labels]
• FEATURES CONSTRUCTION: team members work independently
• LEARNING: team members work independently
• DATA-DRIVEN COMPUTATION
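
The "Split train & validation" step of the workflow can be sketched as follows. The slides do not say how the split was made, so this is only one plausible approach: hold out the last days of the training logs so that validation mimics the contest's 3-day test period. The `day` field, the function name `split_by_day`, and the toy sessions are hypothetical.

```python
def split_by_day(sessions, n_validation_days=3):
    """Split sessions into (train, validation) by calendar day.

    Assumption: each session is a dict with an integer "day" field.
    The last n_validation_days days go to validation, the rest to train.
    """
    if not sessions:
        return [], []
    last_day = max(session["day"] for session in sessions)
    cutoff = last_day - n_validation_days
    train = [s for s in sessions if s["day"] <= cutoff]
    validation = [s for s in sessions if s["day"] > cutoff]
    return train, validation


# Toy data: one session per day over the 27 training days.
sessions = [{"day": day, "session_id": day} for day in range(1, 28)]
train, validation = split_by_day(sessions)
print(len(train), len(validation))
```

A time-based split keeps future activity out of the features used for validation, which a random split would not guarantee.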
HOW MUCH WORK?
• 960+ emails
• 360+ features
• 50+ ideas grid-tuned (300+ models fitted)
• Server heavily loaded during the last 3 weeks
• 56 Kaggle submissions
• 196 teams, 264 players, 3,570 submissions

[Leaderboard timeline chart. Recoverable labels: Top 25, Top 10, 5th, 1st, 3rd, 1st; "2014-01-01"; "Future top 2 & 3 enter race"; intervals of 1/2 month and 1 week]
PROBLEM ANALYSIS
Query

Result Set
• Rank
• URL Snippet Quality
• URL is skipped, clicked or missed

CLICK
Reading URL
• URL & domain relevance, judged from dwell time

The winning model
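
The "skipped, clicked or missed" distinction can be sketched in code. The slides do not define the three states, so the rule below is an assumption: a URL is "clicked" if the user clicked it, "skipped" if it was not clicked but sits above the last clicked position, and "missed" if it was not clicked and sits below the last click (or there was no click at all). The function name `label_results` is hypothetical.

```python
def label_results(shown_urls, clicked_urls):
    """Label each displayed URL as "clicked", "skipped" or "missed".

    shown_urls: URLs in the order they were displayed (top first).
    clicked_urls: URLs the user clicked, in any order.
    """
    clicked = set(clicked_urls)
    click_positions = [i for i, url in enumerate(shown_urls) if url in clicked]
    last_click = max(click_positions) if click_positions else -1

    labels = {}
    for position, url in enumerate(shown_urls):
        if url in clicked:
            labels[url] = "clicked"
        elif position < last_click:
            labels[url] = "skipped"  # seen (above a click) but ignored
        else:
            labels[url] = "missed"   # probably never examined
    return labels


print(label_results(["url1", "url2", "url3", "url4"], ["url3"]))
```

Counting these labels per user, URL and domain gives the raw material for the probability features listed on the next slide.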
FEATURES
• Rank
• User habits, query specificity (entropy, frequency, …)
• Snippet relevance
• Missed, Skipped, Clicked
• URL & domain relevance

Variants of Missed, Skipped & Clicked:
• Probability, stimuli frequency, Mean Reciprocal Rank (MRR)
• For each user: history, previous activity in the test session & aggregate
• For all users
• Computed over all queries & over the same query
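
Two of the quantities named above, query entropy and Mean Reciprocal Rank, can be sketched as follows. The slides do not give their exact definitions, so these are assumptions: entropy is the Shannon entropy (in bits) of the distribution of clicked URLs for a query, and MRR is the mean of 1/rank over the clicked positions, with ranks starting at 1. The function names are hypothetical.

```python
import math
from collections import Counter


def click_entropy(clicked_urls):
    """Shannon entropy (bits) of the clicked-URL distribution for one query.

    Low entropy: users almost always click the same URL (a specific query).
    High entropy: clicks are spread over many URLs (an ambiguous query).
    """
    counts = Counter(clicked_urls)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mean_reciprocal_rank(click_ranks):
    """Mean of 1/rank over clicked positions (rank 1 = top result)."""
    if not click_ranks:
        return 0.0
    return sum(1.0 / rank for rank in click_ranks) / len(click_ranks)


print(click_entropy(["url1", "url1", "url1"]))  # one URL only
print(click_entropy(["url1", "url2"]))          # clicks split evenly
print(mean_reciprocal_rank([1, 2, 4]))
```

Each function can be applied per user, across all users, over all queries or over the same query, which is how a small set of definitions grows into hundreds of features.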
MODELS
• Random Forest (predicted class probabilities) + maximization of E(NDCG): Kaggle/Yandex rank 1st, then 3rd
• LambdaMART (gradient-boosted trees optimized for NDCG): WINS!
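
The "predict proba + maximize E(NDCG)" step can be sketched as a re-ranking rule. This is an assumption about how it works, not the team's confirmed code: given per-URL class probabilities for relevance grades 0, 1 and 2 (as a classifier's probability output would provide), sort URLs by expected gain, using the assumed gain 2^grade - 1. Sorting by expected gain maximizes expected DCG; treating that as maximizing E(NDCG) is an approximation. The names `expected_gain` and `rerank` and the probabilities are hypothetical.

```python
def expected_gain(probabilities):
    """Expected gain of one URL: sum over grades of P(grade) * (2**grade - 1).

    probabilities[g] is the predicted probability of relevance grade g.
    """
    return sum(p * (2 ** grade - 1) for grade, p in enumerate(probabilities))


def rerank(urls, class_probabilities):
    """Return urls sorted by decreasing expected gain."""
    scored = zip(urls, (expected_gain(p) for p in class_probabilities))
    return [url for url, _ in sorted(scored, key=lambda pair: pair[1], reverse=True)]


# Hypothetical predicted probabilities for grades 0, 1, 2.
urls = ["url1", "url2", "url3"]
probas = [
    [0.7, 0.2, 0.1],  # url1: mostly irrelevant
    [0.2, 0.3, 0.5],  # url2: likely highly relevant
    [0.6, 0.4, 0.0],  # url3: at best mildly relevant
]
print(rerank(urls, probas))
```

LambdaMART avoids this two-step scheme by optimizing a ranking objective tied to NDCG directly, which is consistent with the slide's note that it won.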
QUESTIONS?
Thank you!
