Click Log Mining CS598


  1. Query Log Mining: Yandex Challenge 2011
     Nikita Spirin, Shih-Wen Huang, Shuo Yang, Anirudh Ravula
  2. Search logs are used to improve search
     • Learn a ranking function
       – Users click on meaningful results
     • Personalize search based on user history
       – A user's previous searches reveal their interests
     • Identify spammers
       – Bots click on suspicious websites more often
     • Tune contextual advertising models
     • Recommend and disambiguate queries
       – See also: “java programming” vs. “java coffee”
  3. Yandex QLM Challenge 2011 goals
     • Learn a ranking function
       – For a given query, provide an ordered list of URLs using information from the log
     • Plan for today
       – Task description
       – General framework: learning to rank (L2R)
       – Features for L2R
       – Preference extraction for L2R
       – Ranking algorithms
       – Collaborative filtering and graph-based approaches
       – Experiments
       – Future plans for improvement
  4. Task description: input to the challenge
     • Query log
       – Query action: SessionID TimePassed QUERY QueryID RegionID ListOfURLs
       – Click action: SessionID TimePassed CLICK URLID
     • Training relevance labels from the set {0, 1}
       – QueryID RegionID URLID RelevanceLabel
     • Test query/region pairs
       – The goal is to provide relevant URLs for these new query/region pairs
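
Both record types can be parsed with a small dispatcher on the action field. A minimal sketch, assuming tab-separated fields in the order given above; the `parse_line` name and the dict layout are illustrative, not part of the challenge spec:

```python
# Parse one log line into a dict; field order follows the slide above.
# Tab separation is an assumption about the raw file format.
def parse_line(line):
    fields = line.rstrip("\n").split("\t")
    if fields[2] == "QUERY":
        # SessionID  TimePassed  QUERY  QueryID  RegionID  URL1 URL2 ...
        return {"type": "query", "session": fields[0], "time": int(fields[1]),
                "query": fields[3], "region": fields[4],
                "urls": fields[5:]}  # the ranked result list shown to the user
    if fields[2] == "CLICK":
        # SessionID  TimePassed  CLICK  URLID
        return {"type": "click", "session": fields[0], "time": int(fields[1]),
                "url": fields[3]}
    raise ValueError("unknown action: " + fields[2])
```
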
  5. Some real input data
     • Snapshot of the real Yandex query log (columns: SessionID, Time, Action, QueryId, RegionId, URL, URL, ...)
     • Training relevance labels from the {0, 1} set (columns: QueryId, RegionId, URL, Relevance)
  6. Some statistics about the query log
     • Unique queries: 30,717,251
     • Unique URLs: 117,093,258
     • Sessions: 43,977,859
     • Total records in the log: 340,796,067
     • Assessed query-region-URL triples for the total query set (training + test): 71,930
     • Log size: 17 GB (doesn't fit into memory)
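
At 17 GB the log forces a single streaming pass to compute statistics like those above. A sketch that keeps exact ID sets for clarity; with roughly 10^8 unique URLs those sets are themselves huge, so a real run would substitute approximate counters such as HyperLogLog. Tab-separated fields are again an assumption:

```python
# One pass over the log file; only the ID sets and a counter live in memory.
def log_stats(path):
    queries, urls, sessions, records = set(), set(), set(), 0
    with open(path) as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            records += 1
            sessions.add(fields[0])
            if fields[2] == "QUERY":
                queries.add(fields[3])     # QueryID
                urls.update(fields[5:])    # shown URLs
            else:                          # CLICK action
                urls.add(fields[3])        # clicked URLID
    return records, len(queries), len(urls), len(sessions)
```
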
  7. General framework: learning to rank (L2R)
     • Training formalization:
       – Given an ordered set of ranks Y = {0, 1}, with 0 < 1
       – Given a set of queries Q = {q1, ..., qn}
       – A list of documents is associated with each query: Dq = {dq,1, ..., dq,n(q)}
       – Factor ranking model: Xqd = (f1(q, d), ..., fm(q, d)), the feature vector for a query-document pair
     • Goal of L2R: learn a ranker X → Y
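
The slide defines Xqd abstractly and leaves the concrete features to the participant. One way this might look with three click-based features; the feature choices and the precomputed aggregate dicts (`shown`, `clicked`, `url_clicked`) are illustrative assumptions:

```python
# Build Xqd = (f1(q, d), ..., fm(q, d)) for one query-document pair.
# `shown`/`clicked` map (query, url) to counts; `url_clicked` maps a url
# to its total clicks -- all hypothetical aggregates mined from the log.
def feature_vector(q, d, shown, clicked, url_clicked):
    n_shown = shown.get((q, d), 0)
    n_click = clicked.get((q, d), 0)
    return [
        n_click / n_shown if n_shown else 0.0,  # f1: click-through rate
        float(n_shown),                         # f2: exposure count
        float(url_clicked.get(d, 0)),           # f3: query-independent popularity
    ]
```
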
  8. Subtasks of L2R from query logs
     • Extract preferences (absolute or pairwise) from the query log using click-through statistics (see the sketch below)
     • Generate features (factors) to give the problem structure
     • Learn a ranking algorithm
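
A classic rule for the pairwise variant of the first subtask is Joachims' "click > skip above": a clicked result is preferred over every unclicked result ranked above it. The slides do not say which heuristics the team used, so this particular rule is an assumption:

```python
# Extract (preferred, non_preferred) URL pairs from one query's results.
# urls: the ranked list shown to the user; clicked: set of clicked URLs.
def click_skip_above(urls, clicked):
    pairs = []
    for i, url in enumerate(urls):
        if url in clicked:
            for skipped in urls[:i]:        # results ranked above the click
                if skipped not in clicked:  # ...that the user skipped
                    pairs.append((url, skipped))
    return pairs
```

For example, `click_skip_above(["u1", "u2", "u3"], {"u3"})` yields `[("u3", "u1"), ("u3", "u2")]`.
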
  9. SVM for L2R = RankSVM
     • Extract preferences from a query log based on some heuristics
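
RankSVM is commonly implemented as a reduction of those pairwise preferences to binary classification on feature differences: x(d⁺) − x(d⁻) is a positive example and the reverse a negative one. A minimal sketch using scikit-learn's LinearSVC; the library choice and the C value are assumptions, since the slides do not name an implementation:

```python
import numpy as np
from sklearn.svm import LinearSVC

# X_pref[k] / X_nonpref[k]: feature vectors of the preferred and
# non-preferred document in the k-th extracted preference pair.
def ranksvm_fit(X_pref, X_nonpref):
    diffs = np.vstack([X_pref - X_nonpref, X_nonpref - X_pref])
    labels = np.hstack([np.ones(len(X_pref)), -np.ones(len(X_nonpref))])
    svm = LinearSVC(C=1.0).fit(diffs, labels)
    return svm.coef_.ravel()  # w: rank documents by w @ Xqd, descending
```
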
  10. Boosting for L2R = RankBoost
      • Uses each feature as a decision stump
      • Builds a linear weighted ensemble model
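
That description matches the standard RankBoost loop: keep a distribution over preference pairs, greedily pick the threshold stump with the largest weighted agreement, add it to a weighted linear ensemble, and reweight the pairs. A compact sketch; the exhaustive threshold search, the fixed round count, and the missing guard at r = ±1 are simplifications:

```python
import numpy as np

# X: (n_docs, m) feature matrix; pairs: list of (i, j), doc i preferred over j.
def rankboost(X, pairs, rounds=50):
    D = np.full(len(pairs), 1.0 / len(pairs))   # distribution over pairs
    model = []                                  # (feature, threshold, alpha)
    for _ in range(rounds):
        best = None
        for f in range(X.shape[1]):
            for theta in np.unique(X[:, f]):
                h = (X[:, f] > theta).astype(float)   # decision stump
                # weighted agreement with the preferences, r in [-1, 1]
                r = sum(D[k] * (h[i] - h[j]) for k, (i, j) in enumerate(pairs))
                if best is None or abs(r) > abs(best[0]):
                    best = (r, f, theta, h)
        r, f, theta, h = best
        alpha = 0.5 * np.log((1 + r) / (1 - r))
        model.append((f, theta, alpha))
        # down-weight pairs the new stump already orders correctly
        D *= np.exp(-alpha * np.array([h[i] - h[j] for i, j in pairs]))
        D /= D.sum()
    return model

def rankboost_score(model, x):
    return sum(alpha * (x[f] > theta) for f, theta, alpha in model)
```
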
  11. Ensemble approach
      • Generate multiple models by varying...
        – Feature subsets
        – Algorithm parameters
        – Ranking models
        – Model subsets
        – Averaging strategies (weighted, quality-based, etc.)
      • Finally, average the models [similar to CombMNZ]
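
CombMNZ combines rankers by summing each URL's normalized scores and multiplying by the number of models that retrieved it at all. A sketch of that final averaging step; the per-model min-max normalization is an assumption, since the slide only says "similar to CombMNZ":

```python
# rankings: list of {url: score} dicts, one per model in the ensemble.
def comb_mnz(rankings):
    combined = {}                  # url -> (sum of normalized scores, hit count)
    for scores in rankings:
        lo, hi = min(scores.values()), max(scores.values())
        for url, s in scores.items():
            norm = (s - lo) / (hi - lo) if hi > lo else 0.0
            total, hits = combined.get(url, (0.0, 0))
            combined[url] = (total + norm, hits + 1)
    # CombMNZ score = (sum of normalized scores) * (number of models that hit)
    return sorted(combined, key=lambda u: combined[u][0] * combined[u][1],
                  reverse=True)
```
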
  12. Best result so far: 0.642436
  13. Future work
      • Add more models
        – SVMpref (reduction of L2R to classification)
        – Direct optimization of AUC
        – Experiment with more sophisticated ensemble models (MonoRank, etc.)
