Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge
This is a presentation given on 13 August 2014 at the SF Data Mining Meetup at Trulia. It is about Dataiku and the Kaggle Personalized Web Search Ranking challenge sponsored by Yandex.

Published in: Data & Analytics
Transcript of "Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge"

1. write your own data story!
2. short story
3-7. Timeline: founded January 2013. January 2014: a Data Science Studio powered team wins a challenge. February 2014: Data Science Studio's GA. July 2014: Data Science Studio available for free with a Community Edition. 15 people now.
8. Roles and workflow: Business Analyst, BI Developer, Data Scientist; Data Preparation, Build Algorithm, Build Application, Run Application.
9. "I don't want to be a data cleaner anymore"
10. Finding Leaks in my Data Pipelines
11. "Waiting for the (gradient boosted) trees to grow"
12. MPP databases, statistical software, machine learning, NoSQL, Hadoop
13. Demo Time
14. Challenge
15. Personalized Web Search (Yandex), Fri 11 Oct 2013 – Fri 10 Jan 2014, 194 teams, $9,000 cash prize. Using historical logs of a search engine (queries, results, clicks) and a set of new queries and results, rerank the results in order to optimize relevance.
16. The team: no researcher, no experience in reranking, not much experience in ML for most of us, not exactly our job, no expectations. Kenji Lefevre, 37, algebraic geometry, learning Python. Christophe Bourguignat, 37, signal processing engineer, learning scikit. Mathieu Scordia, 24, data scientist. Paul Masurel, 33, software engineer.
17. A-Team?
18. "HOBBITS"
19. Challenge data: Yandex supplied 27 days of anonymized logs. 34,573,630 sessions with user id, 21,073,569 queries, 64,693,054 clicks, ~15 GB.
20. Relevance?
21. A metric for relevance right from the logs? Assuming we search for "french newspaper", we take a look at the logs.
22. Dwell time: we compute the so-called dwell time of a click, i.e. the time elapsed before the next action.
23. Dwell time has been shown to be correlated with relevance.
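A minimal sketch of how dwell time can be computed from a session's ordered actions; the (timestamp, action type, URL) tuple layout is illustrative, not the actual Yandex log format.

```python
def dwell_times(actions):
    """Return {url: dwell time in seconds} for every click in one session."""
    dwell = {}
    for (ts, kind, url), nxt in zip(actions, actions[1:] + [None]):
        if kind != "CLICK":
            continue
        # dwell time = time elapsed before the next action in the session
        dwell[url] = float("inf") if nxt is None else nxt[0] - ts
    return dwell

session = [(0, "QUERY", None), (5, "CLICK", "u1"), (12, "CLICK", "u2"), (400, "QUERY", None)]
print(dwell_times(session))  # {'u1': 7, 'u2': 388}
```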
24. Good, we have a measure of relevance! Can we get an overall score for our search engine now?
25. Discounted Cumulative Gain: emphasis on relevant documents, with a discount per rank.
26. Normalized Discounted Cumulative Gain: just normalize between 0 and 1.
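A minimal sketch of DCG/NDCG under the common gain and discount choice (gain 2^rel - 1, discount log2(rank + 1)); the exact formula used by the challenge may differ slightly.

```python
import math

def dcg(relevances):
    # Gain (2^rel - 1) discounted by log2(rank + 1); rank is 1-based.
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    # Normalize by the DCG of the ideal (descending) ordering -> value in [0, 1].
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 1.0

print(ndcg([0, 2, 1, 0]))  # < 1.0; moving the rel=2 result to the top would give 1.0
```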
27. Personalized reranking is about reordering the N best results based on the user's past search history. Results obtained in the contest: original NDCG 0.79056, reranked NDCG 0.80714. Equivalent to raising the rank of a relevant (relevance = 2) result from rank #6 to rank #5 on each query, or from rank #6 to rank #2 in 20% of the queries.
28. How they did it
29. A simple, pointwise approach: for each (URL, session), predict the relevance (0, 1 or 2).
30. Supervised learning on history: we split the 27 days of the train dataset into 24 days (history) + 3 days (annotated), and stop randomly in the last 3 days at a "test" session (like Yandex). This yields a train set (24 days of history), a train set (annotation), and a test set.
31. Working with an ML workflow collaboratively
32. Feature construction: team members work independently. Learning: team members work independently. Split into train & validation. Features on 30 days, labelled 30-day data.
33. Regression or classification? Regression: we keep the hierarchy between the classes, but optimizing NDCG is cookery. Classification: we lose the hierarchy, but we can optimize the NDCG (more on that later). According to P. Li, C. J. C. Burges, and Q. Wu, "McRank: Learning to rank using multiple classification and gradient boosting", NIPS 2007, classification outperforms regression.
34. Compute the probabilities P(relevance = x), build a sorted list, and sort by P(relevance = 1) + 3 P(relevance = 2).
35. Hence order by decreasing P(relevance = 1) + 3 P(relevance = 2) (P. Li, C. J. C. Burges, and Q. Wu, "McRank", NIPS 2007); slightly better results with linear weighting.
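A minimal sketch of this pointwise scheme with scikit-learn, on toy data; the classifier settings and features are illustrative, not the team's exact setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Toy data: one feature row per (URL, session) pair, relevance label in {0, 1, 2}.
X_train = rng.normal(size=(1000, 5))
y_train = rng.integers(0, 3, size=1000)
X_session = rng.normal(size=(10, 5))          # the 10 candidate URLs of one test session

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_session)          # columns follow clf.classes_, here 0, 1, 2
score = proba[:, 1] + 3 * proba[:, 2]         # P(rel = 1) + 3 * P(rel = 2)
reranked = np.argsort(-score)                 # candidate indices, best first
print(reranked)
```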
36. Features
37. Rank as a feature. First of all, the rank: in this contest, the rank is both the rank that was displayed to the user (the display rank) and the rank computed by Yandex using PageRank, non-personalized log analysis(?), TF-IDF, machine learning, etc. (the non-personalized rank).
38. Digression
39. The problem with reranking
40. 53% of the competitors could not improve on the baseline (worse: 53%, better: 47%).
41. Ideal: 1. compute the non-personalized rank; 2. select the 10 best hits and serve them in order; 3. re-rank using log analysis; 4. put the new ranking algorithm in prod (yeah right!); 5. compute NDCG on the new logs; 6. …; 7. profit!
42. Real: 1. compute the non-personalized rank; 2. select the 10 best hits; 3. serve the 10 best hits ranked in random order; 4. re-rank using log analysis, including the non-personalized rank as a feature; 5. compute the score against the log with the former rank.
43. Problem: users tend to click on the first few URLs, so the user satisfaction metric is influenced by the display rank, and our score is not aligned with our goal. We cannot separate the signal of the non-personalized rank from the effect of the display rank.
44. This promotes an over-conservative re-ranking policy: even if we knew for sure that the URL at rank 9 would be clicked by the user if it were presented at rank 1, it would probably be a bad idea to rerank it to rank 1 in this contest. (Chart: average per session of the max position jump.)
45. End of digression
46. Features: revisit (query-(user)-URL) features and variants, query features, cumulative features, user click habits, collaborative filtering, seasonality.
47. Revisits: in the past, when the user was displayed this URL with the exact same query, what is the probability that satisfaction = 2, satisfaction = 1, satisfaction = 0, miss (not clicked), or skipped (after the last click)? That gives 5 conditional probability features, plus 1 overall display counter, 4 mean reciprocal rank features (kind of the harmonic mean of the rank), and 1 snippet quality score (twisted formula used to compute snippet quality): 11 base features.
48. Many variations, with the same user: (in the past | within the same session), (with this very query | whatever query | a subquery | a super query), and was offered (this URL | this domain): ×2 ×3 ×2, 12 variants. Without being the same user (URL-query features): same domain, same URL, same query and same URL: 3 variants. 15 variants × 11 base features = 165 features.
49. Additive smoothing (http://fumicoton.com/posts/bayesian_rating): book A has 1 rating of 5, for an average rating of 5; book B has 50 ratings, for an average rating of 4.5. In our case, we use it to evaluate the probability that a (URL | query) should have a label l, under predicate P.
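A minimal sketch of the additive-smoothing idea for such a conditional probability feature; the prior and pseudo-count values are illustrative, and the slide's exact formula is not reproduced in the transcript.

```python
def smoothed_probability(label_count, total_count, prior, pseudo_count=10.0):
    """Additive (Bayesian) smoothing: pull the empirical frequency toward a prior.

    With few observations the estimate stays close to `prior`; with many it
    converges to label_count / total_count.
    """
    return (label_count + pseudo_count * prior) / (total_count + pseudo_count)

# E.g. P(label = l | URL, query, predicate) smoothed toward the global frequency of l.
print(smoothed_probability(1, 1, prior=0.2))    # a single display: stays near the prior
print(smoothed_probability(40, 50, prior=0.2))  # many displays: moves toward 40/50 = 0.8
```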
50. Cumulative features: aggregate the features of the URLs above in the ranking list. Rationale: if a URL above is likely to be clicked, those below it are likely to be missed.
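A minimal sketch of building one such cumulative feature with pandas; the column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "session": [1, 1, 1, 2, 2],
    "rank":    [1, 2, 3, 1, 2],
    "p_click": [0.6, 0.2, 0.1, 0.3, 0.4],   # any per-URL feature
})

# For each URL, aggregate (here: sum) the feature over the URLs ranked above it
# within the same session.
df = df.sort_values(["session", "rank"])
df["p_click_above"] = df.groupby("session")["p_click"].cumsum() - df["p_click"]
print(df)
```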
51. Query features: how complex and ambiguous is a query? Click entropy, number of times it has been queried, number of terms, average position within a session, average number of occurrences in a session, MRR of its clicks.
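A minimal sketch of click entropy, i.e. the entropy of the distribution of clicks over the URLs returned for a query; it assumes simple click counts and is not the team's exact code.

```python
import math
from collections import Counter

def click_entropy(clicked_urls):
    """Entropy of the click distribution over URLs for one query.

    Low entropy: users agree on one URL (unambiguous query).
    High entropy: clicks are spread out (ambiguous query).
    """
    counts = Counter(clicked_urls)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

print(click_entropy(["u1", "u1", "u1", "u1"]))  # 0.0: unambiguous
print(click_entropy(["u1", "u2", "u3", "u4"]))  # 2.0: very ambiguous
```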
52. User features: what are the user's habits? Click entropy, user click rank counters (rank {1, 2} clicks, rank {3, 4, 5} clicks, rank {6, 7, 8, 9, 10} clicks), average number of terms, average number of different terms in a session, total number of queries issued by the user.
53. Seasonality: what day is Monday?
54. Collaborative filtering (attempt): user / domain interaction matrix, FunkSVD algorithm (Simon Funk, http://sifter.org/~simon/journal/20061211.html; Cython implementation: https://github.com/commonsense/divisi/blob/master/svdlib/_svdlib.pyx). Marginal increase of 5·10^-5 in NDCG. Why?
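A minimal sketch of FunkSVD-style matrix factorization by stochastic gradient descent over the observed (user, domain) interactions; the rank, learning rate and regularization values are illustrative.

```python
import numpy as np

def funk_svd(interactions, n_users, n_domains, k=8, lr=0.05, reg=0.02, epochs=100):
    """Factorize the sparse user x domain matrix as U @ V.T (FunkSVD-style SGD).

    `interactions` is a list of (user, domain, value) triples; only observed
    cells contribute to the loss, as in Simon Funk's original method.
    """
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(n_users, k))
    V = rng.normal(scale=0.1, size=(n_domains, k))
    for _ in range(epochs):
        for u, d, value in interactions:
            err = value - U[u] @ V[d]
            u_old = U[u].copy()
            U[u] += lr * (err * V[d] - reg * U[u])
            V[d] += lr * (err * u_old - reg * V[d])
    return U, V

# Toy usage: 3 users, 4 domains, a handful of observed interaction counts.
obs = [(0, 0, 5.0), (0, 1, 1.0), (1, 1, 4.0), (2, 3, 2.0)]
U, V = funk_svd(obs, n_users=3, n_domains=4)
print(U @ V.T)   # reconstructed interaction strengths, usable as features
```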
55. Learning
56. Short story: pointwise, random forest, 30 features, 4th place (*); listwise, LambdaMART, 90 features, 1st place (*). (*) A Yandex "PaceMaker" team was also displaying results on the leaderboard and was in first place during the whole competition, even though it was not officially a contestant. Trained in 2 days, 1,135 trees; optimized & trained in ~1 hour (12 cores), 24 trees.
57. LambdaMART = LambdaRank + MART. "From RankNet to LambdaRank to LambdaMART: An Overview", Christopher J.C. Burges, Microsoft Research Technical Report MSR-TR-2010-82.
58. LambdaRank. (Figure from the same report: original ranking with 13 pairwise errors vs. re-ranked with 11 errors; high-quality vs. low-quality hits; RankNet gradient vs. LambdaRank "gradient".) "From RankNet to LambdaRank to LambdaMART: An Overview", Christopher J.C. Burges, Microsoft Research Technical Report MSR-TR-2010-82.
59. Grid search: we are not doing typical classification here, so it is extremely important to perform the grid search directly against the final NDCG score. NDCG "conservatism" ends up favoring a large "min samples per leaf" (between 40 and 80).
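A minimal sketch of grid-searching hyperparameters directly against a mean per-session NDCG computed on held-out sessions; it reuses the ndcg() helper from the DCG sketch above, and the model, grid and toy data are illustrative rather than the team's actual setup.

```python
from itertools import product
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(500, 5)), rng.integers(0, 3, size=500)
# Validation data: a few sessions of 10 candidate URLs each, with relevance labels.
val_sessions = [(rng.normal(size=(10, 5)), rng.integers(0, 3, size=10)) for _ in range(5)]

def mean_session_ndcg(model, sessions):
    scores = []
    for X_s, rel in sessions:
        proba = model.predict_proba(X_s)
        order = (proba[:, 1] + 3 * proba[:, 2]).argsort()[::-1]   # P(1) + 3 P(2)
        scores.append(ndcg([rel[i] for i in order]))              # ndcg() from the DCG sketch
    return float(np.mean(scores))

best = max(
    (mean_session_ndcg(
        GradientBoostingClassifier(min_samples_leaf=leaf, n_estimators=trees,
                                   random_state=0).fit(X_train, y_train),
        val_sessions),
     leaf, trees)
    for leaf, trees in product([40, 60, 80], [24, 50])
)
print(best)   # (best mean NDCG, min_samples_leaf, n_estimators)
```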
60. Feature selection. Top-down approach: starting from a high number of features, iteratively remove subsets of features; this led to the subset of 90 features used by the winning LambdaMART solution (a similar strategy is now implemented by sklearn.feature_selection.RFECV). Bottom-up approach: starting from a low number of features, add the features that produce the best marginal improvement; this gave the 30 features that led to the best solution with the pointwise approach.
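A minimal sketch of the bottom-up (greedy forward) selection idea; the scoring function is a stand-in for any validation metric such as mean per-session NDCG, and the feature names are hypothetical.

```python
def greedy_forward_selection(all_features, score_fn, max_features=30):
    """Bottom-up selection: repeatedly add the feature with the best marginal gain.

    score_fn(feature_subset) -> validation score (e.g. mean per-session NDCG).
    """
    selected, best_score = [], float("-inf")
    while len(selected) < max_features:
        candidates = [f for f in all_features if f not in selected]
        if not candidates:
            break
        new_score, new_feature = max((score_fn(selected + [f]), f) for f in candidates)
        if new_score <= best_score:        # no marginal improvement: stop early
            break
        selected.append(new_feature)
        best_score = new_score
    return selected

# Toy usage: pretend the validation score is just the sum of per-feature "values".
values = {"rank": 0.5, "revisit_p2": 0.3, "click_entropy": 0.1, "noise": -0.05}
print(greedy_forward_selection(list(values), lambda fs: sum(values[f] for f in fs)))
# ['rank', 'revisit_p2', 'click_entropy']
```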
61. Top Features
62. References
63. References:
    RankLib (implementation of LambdaMART): http://sourceforge.net/p/lemur/wiki/RankLib/
    These slides: http://www.slideshare.net/Dataiku
    P. Li, C. J. C. Burges, and Q. Wu. "McRank: Learning to rank using multiple classification and gradient boosting." NIPS 2007.
    Christopher J.C. Burges. "From RankNet to LambdaRank to LambdaMART: An Overview." Microsoft Research Technical Report MSR-TR-2010-82.
    Blog post about additive smoothing: http://fumicoton.com/posts/bayesian_rating
    Blog posts about the solution: http://blog.kaggle.com/2014/02/06/winning-personalized-web-search-team-dataiku/ and http://www.dataiku.com/blog/2014/01/14/winning-kaggle.html
    Paper with a detailed description: http://research.microsoft.com/en-us/um/people/nickcr/wscd2014/papers/wscdchallenge2014dataiku.pdf
    Contest URL: https://www.kaggle.com/c/yandex-personalized-web-search-challenge
64. Random thoughts. Dependency analysis and comparing rank with predicted "relevance" could help determine general cases where the existing engine is not relevant enough; how does it compare to a pure statistical approach? Applying personalization techniques this way might not be practical because of the amount of live information about users (each query, each click) that must be maintained in real time to make actionable predictions; how could a machine learning challenge enforce this kind of constraint? Is data science a science, a sport or a hobby? Newcomers can discover a field, improve existing results, and seemingly obtain incrementally more effective results, with little plateau effect. Are we just at the very beginning, the non-industrial era, of this discipline?
65. Thank you! Florian Douetteau, florian.douetteau@dataiku.com, +33 6 70 56 88 97