
# Multi-Task Learning and Web Search Ranking


### Multi-Task Learning and Web Search Ranking

1. Multi-Task Learning and Web Search Ranking. Gordon Sun (孙国政), Yahoo! Inc., March 2007
2. Outline:
    - Brief review: machine learning in web search ranking and multi-task learning.
    - MLR with Adaptive Target Value Transformation: each query is a task.
    - MLR for multiple languages: each language is a task.
    - MLR for multiple query classes: each type of query is a task.
    - Future work and challenges.
3. MLR (Machine Learning Ranking)
    - General function estimation and risk minimization:
        - Input: $x = \{x_1, x_2, \dots, x_n\}$
        - Output: $y$
        - Training set: $\{y_i, x_i\}$, $i = 1, \dots, n$
        - Goal: estimate the mapping function $y = F(x)$
    - In MLR work:
        - $x = x(q, d) = \{x_1, x_2, \dots, x_n\}$: ranking features
        - $y$ = judgment label, e.g. {P, E, G, F, B} mapped to {0, 1, 2, 3, 4}
        - Loss function: $L(y, F(x)) = (y - F(x))^2$
        - Algorithm: MLR with regression
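The regression setup on this slide can be written compactly as empirical risk minimization with squared loss. The symbols $\hat{F}$ (the fitted ranking function) and $\mathcal{F}$ (the class of candidate functions) are notation assumed here; they do not appear on the slide.

```latex
\hat{F} \;=\; \arg\min_{F \in \mathcal{F}} \; \frac{1}{n} \sum_{i=1}^{n} L\bigl(y_i, F(x_i)\bigr),
\qquad
L\bigl(y, F(x)\bigr) = \bigl(y - F(x)\bigr)^2
```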
4. Rank features construction
    - Query features:
        - query language, query word types (Latin, Kanji, …), …
    - Document features:
        - page_quality, page_spam, page_rank, …
    - Query-document dependent features:
        - Text match scores in body, title, anchor text (TF/IDF, proximity), …
    - Evaluation metric: DCG (Discounted Cumulative Gain), where the grades $G_i$ are the grade values for {P, E, G, F, B} (NDCG – $2^n$). DCG5 uses n = 5; DCG10 uses n = 10.
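The DCG formula itself did not survive in this transcript. As a minimal sketch, assuming the commonly used definition $\mathrm{DCG}_n = \sum_{i=1}^{n} G_i / \log_2(i + 1)$, the metric could be computed as below. The gain values in `GRADE_GAIN` and the function name `dcg` are hypothetical; the slide does not preserve the actual grade values.

```python
import math

# Hypothetical gain per judgment grade (assumption: the deck's real values are not preserved).
GRADE_GAIN = {"P": 4.0, "E": 3.0, "G": 2.0, "F": 1.0, "B": 0.0}

def dcg(grades, n=5, gain=GRADE_GAIN):
    """DCG over the top-n results: sum of gain / log2(rank + 1), with ranks starting at 1."""
    return sum(
        gain[g] / math.log2(rank + 1)
        for rank, g in enumerate(grades[:n], start=1)
    )

# grades are listed in ranked order, best-ranked result first
dcg5 = dcg(["P", "G", "B", "E", "F"], n=5)
dcg10 = dcg(["P", "G", "B", "E", "F", "G", "B", "B", "F", "E"], n=10)
```

Under this definition a highly graded page contributes more the closer it sits to the top of the ranking, which is why the same set of results can yield different DCG values under different orderings.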
5. Distribution of judgment grades
6. Multi-Task Learning
    - Single-Task Learning (STL)
        - One prediction task (classification/regression):
        - estimate a function based on one training/testing set:
        - $T = \{y_i, x_i\}$, $i = 1, \dots, n$
    - Multi-Task Learning (MTL)
        - Multiple prediction tasks, each with its own training/testing set:
        - $T_k = \{y_{ki}, x_{ki}\}$, $k = 1, \dots, m$, $i = 1, \dots, n_k$
        - The goal is to solve multiple tasks together:
            - Tasks share the same input space (or at least partially).
            - Tasks are related (say, MLR: share one mapping function).
7. Multi-Task Learning: Intuition and Benefits
    - Empirical intuition
        - Data from "related" tasks could help.
        - This is equivalent to increasing the effective sample size.
    - Goal: share data and knowledge from task to task (transfer learning).
    - Benefits
        - when the number of training examples per task is limited
        - when the number of tasks is large and cannot be handled by a separate MLR for each task
        - when it is difficult or expensive to obtain examples for some tasks
        - possible to obtain meta-level knowledge
8. Multi-Task Learning: "Relatedness" approaches
    - Probabilistic modeling for task generation [Baxter ’00], [Heskes ’00], [Teh, Seeger, Jordan ’05], [Zhang, Ghahramani, Yang ’05]
    - Latent variable correlations
        - Noise correlations [Greene ’02]
        - Latent variable modeling [Zhang ’06]
    - Hidden common data structure and latent variables
        - Implicit structure (common kernels) [Evgeniou, Micchelli, Pontil ’05]
        - Explicit structure (PCA) [Ando, Zhang ’04]
    - Transformation relatedness [Shai ’05]
9. Multi-Task Learning for MLR
    - Different levels of relatedness:
        - Grouping data by query: each query could be one task.
        - Grouping data by query language: each language is a task.
        - Grouping data by query class.
11. Adaptive Target Value Transformation
    - Intuition:
        - Rank features vary a lot from query to query.
        - Rank features vary a lot from sample to sample with the same label.
        - MLR is a ranking problem, but regression minimizes prediction errors.
    - Solution: adaptively adjust the training target values. A linear (monotonic) transformation is required; a nonlinear g() may not preserve the order of E(y|x).
12. Adaptive Target Value Transformation
    - Implementation: empirical risk minimization, where the linear transformation weights are regularized; $\lambda_\alpha$ and $\lambda_\beta$ are the regularization parameters and a p-norm is used.
    - The solution will be: [equation not preserved]
13. Adaptive Target Value Transformation
    - Norm p=2 solution: for each ($\lambda_\alpha$, $\lambda_\beta$):
        1. For the initial ($\alpha$, $\beta$), find F(x) by solving: [equation not preserved]
        2. For the given F(x), solve for each ($\alpha_q$, $\beta_q$), q = 1, 2, …, Q.
        3. Repeat from step 1 until: [condition not preserved]
    - Norm p=1 solution: solve conditional quadratic programming [Lasso/LARS].
    - Convergence analysis: assuming [not preserved]
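The alternating procedure for p = 2 can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation. The slide's equations are not preserved, so the objective is assumed to be $\sum_q \sum_i (\alpha_q y_{qi} + \beta_q - F(x_{qi}))^2 + \lambda_\alpha \sum_q (\alpha_q - 1)^2 + \lambda_\beta \sum_q \beta_q^2$, which pulls each query's transformation toward the identity. A plain linear least-squares fit stands in for the MLR regression model, and a fixed iteration count replaces the unknown stopping condition. All function names are hypothetical.

```python
import numpy as np

def fit_linear(X, t):
    """Stand-in for the MLR regressor (assumption): least-squares fit with an intercept."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    w, *_ = np.linalg.lstsq(A, t, rcond=None)
    return w

def predict_linear(w, X):
    return X @ w[:-1] + w[-1]

def update_query_transform(y, f, lam_a, lam_b):
    """Closed-form minimizer over (alpha, beta) for one query of
    sum_i (alpha*y_i + beta - f_i)^2 + lam_a*(alpha - 1)^2 + lam_b*beta^2."""
    M = np.array([[y @ y + lam_a, y.sum()],
                  [y.sum(), len(y) + lam_b]])
    rhs = np.array([y @ f + lam_a, f.sum()])
    alpha, beta = np.linalg.solve(M, rhs)
    return alpha, beta

def atvt(X, y, qid, lam_a=10.0, lam_b=10.0, n_iter=20):
    """Alternate between fitting F on transformed targets and refitting per-query (alpha, beta)."""
    queries, idx = np.unique(qid, return_inverse=True)
    alpha = np.ones(len(queries))   # start from the identity transformation
    beta = np.zeros(len(queries))
    for _ in range(n_iter):
        t = alpha[idx] * y + beta[idx]          # step 1: transformed targets, fit F
        w = fit_linear(X, t)
        f = predict_linear(w, X)
        for k in range(len(queries)):           # step 2: per-query transformation
            m = idx == k
            alpha[k], beta[k] = update_query_transform(y[m], f[m], lam_a, lam_b)
    return w, dict(zip(queries, alpha)), dict(zip(queries, beta))
```

Under the assumed objective, dropping the penalty terms admits a degenerate solution (every $\alpha_q = 0$ with a constant F), which is one way to read the deck's later observation that regularization is needed.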
14. Adaptive Target Value Transformation: experiment data
15. Adaptive Target Value Transformation: evaluation of aTVT on US and CN data
16. Adaptive Target Value Transformation
17. Adaptive Target Value Transformation
18. Adaptive Target Value Transformation: observations
    1. Relevance gain (DCG5 ~ 2%) is visible.
    2. Regularization is needed.
    3. Different query types gain differently from aTVT.
20. Multi-Language MLR
    - Objective:
        - Make MLR globally scalable: >100 languages, >50 regions.
        - Improve MLR for small regions/languages using data from other languages.
        - Build a universal MLR for all regions that do not have data and editorial support.
21. Multi-Language MLR, Part 1
    - Feature differences between languages.
    - MLR function differences between languages.
22. Multi-Language MLR: distribution of text score. Legend: JP, CN, DE, UK, KR. Panels: Perfect+Excellent URLs, Bad URLs.
23. Multi-Language MLR: distribution of spam score. Legend: JP, CN, DE, UK, KR. Panels: Perfect+Excellent URLs, Bad URLs. JP and KR are similar; DE and UK are similar.
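24. Multi-Language MLR: training and testing on different languages. Cells give the % DCG improvement over the base function. Axes: train language and test language (the orientation is not recoverable from the transcript).

    |    | CN   | JP   | KR    | DE    | UK    |
    |----|------|------|-------|-------|-------|
    | CN | 7.47 | 2.50 | 0.29  | -3.53 | 1.91  |
    | JP | 1.30 | 4.48 | -0.30 | -3.79 | -1.25 |
    | KR | 3.86 | 4.49 | 5.69  | -0.55 | 1.50  |
    | DE | 3.94 | 6.05 | 6.25  | 13.1  | 6.96  |
    | UK | 0.32 | 2.96 | -0.21 | 2.29  | 6.22  |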
25. Multi-Language MLR: language differences, observations
    - Feature differences across languages are visible but not huge.
    - An MLR trained for one language does not work well for other languages.
26. Multi-Language MLR, Part 2
    - Transfer learning with region features
27. Multi-Language MLR: query region feature
    - New feature: query region.
    - Multiple binary-valued features:
        - Feature vector: qr = (CN, JP, UK, DE, KR)
        - CN queries: (1, 0, 0, 0, 0)
        - JP queries: (0, 1, 0, 0, 0)
        - UK queries: (0, 0, 1, 0, 0)
        - …
    - To test the trained universal MLR on new languages, e.g. FR:
        - Feature vector: qr = (0, 0, 0, 0, 0)
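The query region encoding described on this slide is a one-hot vector over the training regions, with an all-zero vector for any region not seen in training. A minimal sketch, in which the names `REGIONS` and `query_region_feature` are hypothetical:

```python
# Order follows the slide's feature vector qr = (CN, JP, UK, DE, KR).
REGIONS = ("CN", "JP", "UK", "DE", "KR")

def query_region_feature(region):
    """One binary feature per training region; unseen regions map to all zeros."""
    return tuple(1 if region == r else 0 for r in REGIONS)

cn_vector = query_region_feature("CN")   # a training region
fr_vector = query_region_feature("FR")   # a new region, not in REGIONS
```

The all-zero vector lets the combined model score queries from a new region without retraining, relying only on the features it shares with the training regions.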
28. Multi-Language MLR: query region feature, experiment results (% DCG-5 improvement over base function)

    | Language | Combined Model | Combined Model With Query Region Feature |
    |----------|----------------|------------------------------------------|
    | JP       | 3.07%          | 3.53%                                    |
    | CN       | 6.24%          | 7.02%                                    |
    | UK       | 4.34%          | 5.92%                                    |
    | DE       | 9.86%          | 10.51%                                   |
    | KR       | 5.79%          | 6.83%                                    |
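29. Multi-Language MLR: query region feature, experiment results for the CJK and UK, DE models. All models include the query region feature.

    | Test Language | UK, DE Model | CJK Model | All Language Model |
    |---------------|--------------|-----------|--------------------|
    | JP            |              | 4.39%     | 3.53%              |
    | CN            |              | 7.17%     | 7.02%              |
    | UK            | 5.93%        |           | 5.92%              |
    | DE            | 12.5%        |           | 10.51%             |
    | KR            |              | 6.14%     | 6.83%              |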
30. Multi-Language MLR: query region feature, observations
    - The query region feature seems to improve combined model performance in every case, though not always with statistical significance.
    - It helped more when we had less data (KR).
    - It helped more when introducing "near languages" models (CJK, EU).
    - It would not help for languages with large training data (JP, CN).
31. Multi-Language MLR: experiments, overweighting the target language
    - This method deals with the common case where a language has only a small amount of data available.
    - Use all available data, but change the weight of the data from the target language.
    - When weight = 1, the result is the "Universal Language Model".
    - As weight → ∞, it becomes the single-language model.
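The two limiting cases on this slide can be illustrated with per-sample weights. The sketch below assumes a weighted linear least-squares fit as a stand-in for the MLR regressor, which the slides do not specify; the function names are hypothetical.

```python
import numpy as np

def target_weights(langs, target, weight):
    """Weight `weight` for samples of the target language, 1.0 for all others."""
    langs = np.asarray(langs)
    return np.where(langs == target, float(weight), 1.0)

def fit_weighted_linear(X, y, sample_weight):
    """Weighted least squares with an intercept (stand-in model, an assumption).
    Scaling each row by sqrt(weight) makes ordinary lstsq minimize the weighted loss."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    s = np.sqrt(sample_weight)
    coef, *_ = np.linalg.lstsq(A * s[:, None], y * s, rcond=None)
    return coef
```

With `weight=1` every sample counts equally, so the fit is the pooled (universal) model. As the weight grows, the target-language residuals dominate the loss and the fit approaches the model trained on the target language alone; intermediate weights interpolate between the two.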
32. Multi-Language MLR: Germany
33. Multi-Language MLR: UK
34. Multi-Language MLR: China
35. Multi-Language MLR: Korea
36. Multi-Language MLR: Japan
37. Multi-Language MLR: average DCG gain for JP, CN, DE, UK, KR
38. Multi-Language MLR: overweighting the target language, observations
    - It helps on certain languages with a small amount of data (KR, DE).
    - It does not help on some languages (CN, JP).
    - For languages with enough data, it will not help.
    - A weight of 10 seems better than 1 and 100 on average.
39. Multi-Language MLR, Part 3
    - Transfer learning with language-neutral data and regression diff
40. Multi-Language MLR: selection of language-neutral queries
    - For each of (CN, JP, KR, DE, UK), train an MLR with its own data.
    - Test the queries of one language with all languages' MLRs.
    - Select the queries that showed the best DCG across the different language MLRs.
    - Consider these queries language neutral; they could be shared by all language MLR development.
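The selection step can be sketched as a ranking of queries by how well they score under every language's model. The slide does not state the exact criterion, so the sketch below assumes the mean DCG across models; the function name and the input layout are hypothetical.

```python
def select_language_neutral(dcg_by_model, top_k=500):
    """Return the top_k query ids ranked by mean DCG across all language models.

    dcg_by_model: {query_id: {model_language: dcg_value}} (hypothetical layout).
    Assumption: 'best DCG across models' is read as the highest mean DCG.
    """
    def mean_dcg(query_id):
        values = dcg_by_model[query_id].values()
        return sum(values) / len(values)

    ranked = sorted(dcg_by_model, key=mean_dcg, reverse=True)
    return ranked[:top_k]

scores = {
    "q1": {"CN": 6.0, "JP": 5.8, "KR": 6.1},   # scores well under every model
    "q2": {"CN": 6.5, "JP": 2.0, "KR": 1.5},   # scores well only under the CN model
    "q3": {"CN": 4.0, "JP": 4.2, "KR": 4.1},
}
neutral = select_language_neutral(scores, top_k=2)
```

Other readings of the criterion, such as ranking by the minimum DCG across models, would only change the `mean_dcg` helper.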
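41. Multi-Language MLR: evaluation of language-neutral queries on the CN-simplified dataset (2,753 queries). Values are dcg5.

    |                | All the queries | Language-neutral queries only (top ~500 queries) |
    |----------------|-----------------|--------------------------------------------------|
    | CN-Traditional | 5.64            | 5.79 (+2.7%)                                     |
    | Korean         | 5.19            | 5.50 (+6%)                                       |
    | Japanese       | 5.85            | 5.83                                             |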
43. Multi-Query Class MLR
    - Intuitions:
    - Different types of queries behave differently:
        - They require different ranking features (time-sensitive queries → page_time_stamps).
        - They expect different results (navigational queries → one official page on top).
    - Also, different types of queries could share the same ranking features.
    - Multi-class learning could be done in a unified MLR by
        - introducing query classification and using the query class as an input ranking feature;
        - adding page-level features for the corresponding classes.
44. Multi-Query Class MLR
    - Time recency experiments:
    - Feature implementation:
        - Binary query feature: time sensitive (0, 1).
        - Binary page feature: discovered within the last three months.
    - Data:
        - 300 time-sensitive queries (editorial).
        - ~2000 ordinary queries.
        - Overweight time-sensitive queries by 3.
        - 10-fold cross-validation on MLR training/testing.
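The two binary features and the overweighting of time-sensitive queries could be built as in the sketch below. It approximates "three months" as 90 days, which is an assumption, and all function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta

def recency_features(is_time_sensitive_query, discovered_at, now, window_days=90):
    """Return (query feature, page feature), both binary.

    window_days=90 approximates the slide's 'last three months' (assumption).
    """
    recently_discovered = (now - discovered_at) <= timedelta(days=window_days)
    return int(is_time_sensitive_query), int(recently_discovered)

def training_weight(is_time_sensitive_query, overweight=3.0):
    """Time-sensitive queries are overweighted by 3, as on the slide; others get weight 1."""
    return overweight if is_time_sensitive_query else 1.0

now = datetime(2007, 3, 1)
fresh_page = recency_features(True, datetime(2007, 2, 1), now)
old_page = recency_features(False, datetime(2006, 1, 1), now)
```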
45. Multi-Query Class MLR
    - Time recency experiment results:
    - Compare MLR with and without the page_time feature.

    |                        | DCG gain | P-value |
    |------------------------|----------|---------|
    | Time-sensitive queries | 2.31%    | 1.08e-6 |
    | All queries            | 0.52%    | 0.0017  |
46. Multi-Query Class MLR
    - Named entity queries:
    - Feature implementation:
        - Binary query feature: named entity query (0, 1).
        - 11 new page features implemented:
            - Path length
            - Host length
            - Number of host components (URL depth)
            - Path contains "index"
            - Path contains either "cgi", "asp", "jsp", or "php"
            - Path contains "search" or "srch", …
    - Data:
        - 142 place-name entity queries.
        - ~2000 ordinary queries.
        - 10-fold cross-validation on MLR training/testing.
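The six page features that the slide names can be derived from the URL alone. The sketch below uses Python's standard `urllib.parse.urlparse`; the feature names are hypothetical, simple substring matching is an assumption, and the remaining features elided on the slide are not reconstructed.

```python
from urllib.parse import urlparse

def url_page_features(url):
    """URL-derived page features listed on the slide (feature names are hypothetical)."""
    parts = urlparse(url)
    host = parts.hostname or ""
    path = parts.path.lower()
    return {
        "path_length": len(parts.path),
        "host_length": len(host),
        "host_components": len(host.split(".")) if host else 0,
        "path_has_index": "index" in path,
        "path_has_script": any(token in path for token in ("cgi", "asp", "jsp", "php")),
        "path_has_search": "search" in path or "srch" in path,
    }

features = url_page_features("http://www.example.com/cgi/search/index.php?q=1")
```

Short hosts and shallow paths are typical of official home pages, which is presumably why such features are paired with the named entity query flag.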
47. Multi-Query Class MLR
    - Named entity query experiment results:
    - Compared MLR with the base model without named entity features.

    |                             | DCG gain | P-value |
    |-----------------------------|----------|---------|
    | Named entity queries (142)  | 0.82%    | 0.09    |
    | All queries                 | 0.28%    | 0.09    |
48. Multi-Query Class MLR
    - Observations:
        - Query class combined with page-level features could help MLR relevance.
        - More research is needed on query classification and page-level feature optimization.
50. Future Work and Challenges
    - Multi-task learning extended to different types of training data:
        - Editorial judgment data.
        - User click-through data.
    - Multi-task learning extended to different types of relevance judgments:
        - Absolute relevance judgment.
        - Relative relevance judgment.
    - Multi-task learning extended to use both:
        - Labeled data.
        - Unlabeled data.
    - Multi-task learning extended to different types of search user intentions.
51. Contributors from the Yahoo! International Search Relevance team:
    - Algorithm and model development:
        - Zhaohui Zheng
        - Hongyuan Zha
        - Lukas Biewald
        - Haoying Fu
    - Data exporting/processing/QA:
        - Jianzhang He
        - Srihari Reddy
    - Director:
        - Gordon Sun
52. Thank you. Q&A?