Machine Learned Relevance at A Large Scale Search Engine
1. Machine Learned Relevance at a Large Scale Search Engine
   Salford Analytics and Data Mining Conference 2012
2. Machine Learned Relevance at a Large Scale Search Engine
   Salford Data Mining – May 25, 2012
   Presented by: Dr. Eric Glover – eric@quixey.com; Dr. James Shanahan – james.shanahan@gmail.com
3. About the Authors
   James G. Shanahan – PhD in Machine Learning, University of Bristol, UK
   – 20+ years in the field of AI and information science
   – Principal and Founder, Boutique Data Consultancy
     • Clients include: Adobe, Digg, SearchMe, AT&T, Ancestry, SkyGrid, Telenav
   – Affiliated with University of California Santa Cruz (UCSC)
   – Adviser to Quixey
   – Previously:
     • Chief Scientist, Turn Inc. (a CPX ad network, DSP)
     • Principal Scientist, Clairvoyance Corp (CMU spinoff)
     • Co-founder of Document Souls (task-centric info access system)
     • Research Scientist, Xerox Research (XRCE)
     • AI Research Engineer, Mitsubishi Group
4. About the Authors
   Eric Glover – PhD CSE (AI) from U of M in 2001
   – Fellow at Quixey, where among other things, he focuses on the architecture and processes related to applied machine learning for relevance and evaluation methodologies
   – More than a dozen years of search engine experience including: NEC Labs, Ask Jeeves, SearchMe, and his own startup
   – Multiple relevant publications ranging from classification to automatically discovering topical hierarchies
   – Dissertation studied personalizing web search through incorporation of user preferences and machine learning
   – More than a dozen filed patents
5. Talk Outline
   – Introduction: Search and Machine Learned Ranking
   – Relevance and evaluation methodologies
   – Data collection and metrics
   – Quixey – Functional Application Search™
   – System Architecture, features, and model training
   – Alternative approaches
   – Conclusion
6. Google
7. Search Engine: SearchMe
   Search engine: lets you see and hear what you're searching for
8. 6 Steps to MLR in Practice
   1. Understand the domain and systems; define problems
   2. Collect requirements, and data
   3. Feature Engineering
   4. Modeling: extract patterns/models
   5. Interpret and evaluate discovered knowledge
   6. Deploy system in the wild (and test)
   Modeling is inherently interactive and iterative.
9. How is ML for Search Unique
   Many Machine Learning (ML) systems start with source data
   – Goal is to analyze, model, predict
   – Features are often pre-defined, in a well-studied area
   MLR for search engines is different from many other ML applications:
   – Does not start with labeled data
     • Need to pay judges to provide labels
   – Opportunity to invent new features (feature engineering)
   – Often requires real-time operation
     • Processing tens of billions of possible results, microseconds matter
   – Requires domain-specific metrics for evaluation
10. If we can't measure "it", then…
   …we should think twice about doing "it"
   Measurement has enabled us to compare systems and also to machine learn them
   Search is about measurement, measurement and measurement
11. Improve in a Measured Way
12. From Information Needs to Queries
   The idea of using computers to search for relevant pieces of information was popularized in the article "As We May Think" by Vannevar Bush in 1945
   An information need is an individual's or group's desire to locate and obtain information to satisfy a conscious or unconscious need.
   Within the context of web search, information needs are expressed as textual queries (possibly with constraints)
   E.g., "Analytics Data Mining Conference" program
   Metric: "Relevance" as a measure of how well a system is performing
13. Relevance is a Huge Challenge
   Relevance typically denotes how well a retrieved object (document) or set of objects meets the information need of the user. Relevance is often viewed as multifaceted.
   – A core facet of relevance relates to topical relevance or aboutness,
     • i.e., to what extent the topic of a result matches the topic of the query or information need.
     • Another facet of relevance is based on user perception, and is sometimes referred to as user relevance; it encompasses other concerns of the user such as timeliness, authority or novelty of the result
   In local-search-type queries, yet another facet of relevance that comes into play is geographical aboutness,
   – i.e., to what extent the location of a result (a business listing) matches the location of the query or information need
14. From Cranfield to TREC
   Text REtrieval Conference/Competition
   – http://trec.nist.gov/
   – Run by NIST (National Institute of Standards & Technology)
   Started in 1992
   Collections: > 6 gigabytes (5 CD-ROMs), > 1.5 million docs
   – Newswire & full-text news (AP, WSJ, Ziff, FT)
   – Government documents (Federal Register, Congressional Record)
   – Radio transcripts (FBIS)
   – Web "subsets"
   – Tweets
15. The TREC Benchmark
   TREC: Text REtrieval Conference (http://trec.nist.gov/) originated from the TIPSTER program sponsored by the Defense Advanced Research Projects Agency (DARPA).
   Became an annual conference in 1992, co-sponsored by the National Institute of Standards and Technology (NIST) and DARPA.
   Participants are given parts of a standard set of documents and TOPICS (from which queries have to be derived) in different stages for training and testing.
   Participants submit the P/R values for the final document and query corpus and present their results at the conference.
16. [Diagram: classic IR pipeline – the user's information need is parsed into a query; collections are pre-processed and indexed; the query is matched against the index; results feed back into query reformulation]
17. [Diagram: the same pipeline with ranking and evaluation added – the match step becomes "rank or match", and an evaluation stage informs query reformulation]
18. Talk Outline
   – Introduction: Search and Machine Learned Ranking
   – Relevance and evaluation methodologies
   – Data collection and metrics
   – Quixey – Functional Application Search™
   – System Architecture, features, and model training
   – Alternative approaches
   – Conclusion
19. Difficulties in Evaluating IR Systems
   Effectiveness is related to the relevancy of the set of returned items.
   Relevancy is not typically binary but continuous.
   Even if relevancy is binary, it can be a difficult judgment to make.
   Relevancy, from a human standpoint, is:
   – Subjective: depends upon a specific user's judgment.
   – Situational: relates to the user's current needs.
   – Cognitive: depends on human perception and behavior.
   – Dynamic: changes over time.
20. Relevance as a Measure
   Relevance is everything!
   How relevant is the document retrieved
   – for the user's information need.
   Subjective, but one assumes it's measurable
   Measurable to some extent
   – How often do people agree a document is relevant to a query?
     • More often than expected
   How well does it answer the question?
   – Complete answer? Partial?
   – Background information?
   – Hints for further exploration?
21. What to Evaluate?
   What can be measured that reflects users' ability to use the system? (Cleverdon 66)
   – Coverage of information
   – Form of presentation
   – Effort required / ease of use
   – Time and space efficiency
   – Effectiveness
   Recall – proportion of relevant material actually retrieved
   Precision – proportion of retrieved material actually relevant
   Typically a 5-point scale is used: 5=best, 1=worst
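   As a concrete illustration of the recall and precision definitions above, here is a minimal Python sketch (the document IDs are made up for the example):

      def precision_recall(retrieved, relevant):
          """Precision: fraction of retrieved items that are relevant.
          Recall: fraction of relevant items that were retrieved."""
          retrieved, relevant = set(retrieved), set(relevant)
          hits = retrieved & relevant
          precision = len(hits) / len(retrieved) if retrieved else 0.0
          recall = len(hits) / len(relevant) if relevant else 0.0
          return precision, recall

      # Hypothetical example: 3 of the 5 returned docs are relevant,
      # and the collection contains 6 relevant docs in total.
      print(precision_recall(["d1", "d2", "d3", "d4", "d5"],
                             ["d1", "d3", "d5", "d7", "d8", "d9"]))   # (0.6, 0.5)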
22. Talk Outline
   – Introduction: Search and Machine Learned Ranking
   – Relevance and evaluation methodologies
   – Data collection and metrics
   – Quixey – Functional Application Search™
   – System Architecture, features, and model training
   – Alternative approaches
   – Conclusion
23. Data Collection is a Challenge
   Most search engines do not start with labeled data (relevance judgments)
   Good labeled data is required to perform evaluations and perform learning
   Not practical to hand-label all possibilities for modern large-scale search engines
   Using 3rd-party sources such as Mechanical Turk is often very noisy/inconsistent
   Data collection is non-trivial
   – A custom system (specific to the domain) is often required
   – Phrasing of the "questions", options (including a skip option), UI design and judge training are critical to increase the chance of consistency
   Can leverage judgment collection to aid in feature engineering
   – Judges can provide reasons and observations
24. Relevance/Usefulness/Ranking
   Web Search: topical relevance or aboutness, trustability of source
   Local Search: topical relevance and geographical applicability
   Functional App Search:
   – Task relevance – the user must be convinced the app results can solve the need
   – Finding the "best" apps that address the user's task needs
   – Very domain- and user-specific
   Advertising:
   – Performance measure – expected revenue: P(click) * revenue(click)
   – Consistency with the user's search (showing irrelevant ads hurts the brand)
25. Commonly Used Search Metrics
   Early search systems used binary judgments (relevant/not relevant) and evaluated based on precision and recall
   – Recall is difficult to assess for large sets
   Modern search systems often use DCG or nDCG:
   – Easy to collect and compare large sets of "independent judgments"
     • Independent judgments map easily to MSE-minimization learners
   – Relevance is not binary, and depends on the order of results
   Other measures exist
   – Subjective "how did I do", but these are difficult to use for MLR or to compare
   – Pairwise comparison – measure the number of out-of-order pairs
     • Lots of recent research on pairwise-based MLR
     • Most companies use "independent judgments"
26. Metrics for Web Search
   Existing metrics such as Precision and Recall are limited
   – Not always a clear-cut binary decision: relevant vs. not relevant
   – Not position sensitive (p: relevant, n: not relevant):
     ranking 1: p n p n n
     ranking 2: n n n p p
   How do you measure recall over the whole web?
   – How many of the potentially billions of results will get looked at? Which ones actually need to be good?
   Normalized Discounted Cumulated Gain (NDCG)
   – K. Järvelin and J. Kekäläinen (TOIS 2002)
   – Gain: relevance of a document is no longer binary
   – Sensitive to the position of the highest-rated documents
     • Log-discounting of gains according to the positions
   – Normalize the DCG with the "ideal set" DCG (NDCG)
27. Cumulative Gain
   With graded relevance judgments, we can compute the gain at each rank.
   Cumulative Gain at rank n: CG_n = Σ_{i=1..n} rel_i
   (where rel_i is the graded relevance of the document at position i)

   n    doc #   rel (gain)   CG_n
   1    588     1.0          1.0
   2    589     0.6          1.6
   3    576     0.0          1.6
   4    590     0.8          2.4
   5    986     0.0          2.4
   6    592     1.0          3.4
   7    984     0.0          3.4
   8    988     0.0          3.4
   9    578     0.0          3.4
   10   985     0.0          3.4
   11   103     0.0          3.4
   12   591     0.0          3.4
   13   772     0.2          3.6
   14   990     0.0          3.6
28. Discounting Based on Position
   Users care more about high-ranked documents, so we discount results by 1/log2(rank).
   Discounted Cumulative Gain: DCG_n = rel_1 + Σ_{i=2..n} rel_i / log2(i)

   n    doc #   rel (gain)   CG_n   log2(n)   DCG_n
   1    588     1.0          1.0    -         1.00
   2    589     0.6          1.6    1.00      1.60
   3    576     0.0          1.6    1.58      1.60
   4    590     0.8          2.4    2.00      2.00
   5    986     0.0          2.4    2.32      2.00
   6    592     1.0          3.4    2.58      2.39
   7    984     0.0          3.4    2.81      2.39
   8    988     0.0          3.4    3.00      2.39
   9    578     0.0          3.4    3.17      2.39
   10   985     0.0          3.4    3.32      2.39
   11   103     0.0          3.4    3.46      2.39
   12   591     0.0          3.4    3.58      2.39
   13   772     0.2          3.6    3.70      2.44
   14   990     0.0          3.6    3.81      2.44
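   To make the discounting concrete, a small Python sketch that reproduces the DCG column of the table above from the gain column, using the same rel_1 + Σ rel_i/log2(i) formulation as the slide:

      import math

      def dcg(gains):
          """DCG with the slide's discount: rel_1 + sum over i>=2 of rel_i / log2(i)."""
          return sum(g if i == 1 else g / math.log2(i)
                     for i, g in enumerate(gains, start=1))

      # Graded relevance of the ranked list in the table (docs 588, 589, 576, ...).
      gains = [1.0, 0.6, 0.0, 0.8, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.0]
      print(round(dcg(gains[:4]), 2))   # 2.0, matching DCG_4 in the table
      print(round(dcg(gains), 2))       # 2.44, matching DCG_14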
29. Normalized Discounted Cumulative Gain (NDCG)
   To compare DCGs, normalize values so that an ideal ranking would have a Normalized DCG of 1.0
   Actual ranking: as in the table on the previous slide (DCG_14 = 2.44)
   Ideal ranking (the same judged results re-sorted by gain):

   n    doc #   rel (gain)   CG_n   log2(n)   IDCG_n
   1    588     1.0          1.0    0.00      1.00
   2    592     1.0          2.0    1.00      2.00
   3    590     0.8          2.8    1.58      2.50
   4    589     0.6          3.4    2.00      2.80
   5    772     0.2          3.6    2.32      2.89
   6    576     0.0          3.6    2.58      2.89
   7    986     0.0          3.6    2.81      2.89
   8    984     0.0          3.6    3.00      2.89
   9    988     0.0          3.6    3.17      2.89
   10   578     0.0          3.6    3.32      2.89
   11   985     0.0          3.6    3.46      2.89
   12   103     0.0          3.6    3.58      2.89
   13   591     0.0          3.6    3.70      2.89
   14   990     0.0          3.6    3.81      2.89
30. Normalized Discounted Cumulative Gain (NDCG)
   Normalize by the DCG of the ideal ranking: NDCG_n = DCG_n / IDCG_n
   NDCG ≤ 1 at all ranks
   NDCG is comparable across different queries

   n    doc #   rel (gain)   DCG_n   IDCG_n   NDCG_n
   1    588     1.0          1.00    1.00     1.00
   2    589     0.6          1.60    2.00     0.80
   3    576     0.0          1.60    2.50     0.64
   4    590     0.8          2.00    2.80     0.71
   5    986     0.0          2.00    2.89     0.69
   6    592     1.0          2.39    2.89     0.83
   7    984     0.0          2.39    2.89     0.83
   8    988     0.0          2.39    2.89     0.83
   9    578     0.0          2.39    2.89     0.83
   10   985     0.0          2.39    2.89     0.83
   11   103     0.0          2.39    2.89     0.83
   12   591     0.0          2.39    2.89     0.83
   13   772     0.2          2.44    2.89     0.84
   14   990     0.0          2.44    2.89     0.84
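   Continuing the sketch above (it reuses the dcg() function and the gains list from the previous snippet), NDCG at rank n divides DCG_n by the DCG of the ideal ordering of the same judged results:

      def ndcg_at(gains, n):
          """NDCG_n = DCG_n / IDCG_n; the ideal ranking re-sorts all judged gains."""
          ideal = sorted(gains, reverse=True)
          return dcg(gains[:n]) / dcg(ideal[:n])

      print(round(ndcg_at(gains, 2), 2))    # 0.8, matching NDCG_2 (1.60 / 2.00)
      print(round(ndcg_at(gains, 14), 2))   # 0.84, matching NDCG_14 (2.44 / 2.89)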
31. Machine Learning Uses in Commercial SE
   – Query parsing
   – SPAM classification
   – Result categorization
   – Behavioral categories
   – Search engine results ranking
32. 6 Steps to MLR in Practice
   1. Understand the domain and systems; define problems
   2. Collect requirements, and data
   3. Feature Engineering
   4. Modeling: extract patterns/models
   5. Interpret and evaluate discovered knowledge
   6. Deploy system in the wild (and test)
   Modeling is inherently interactive and iterative.
33. In Practice
   – QPS constraints; deploying the model
   – Imbalanced data
   – Relevance changes over time; non-stationary behavior
   – Speed, accuracy (SVMs, …)
   – Practical: grid search, 8-16 nodes, 500 trees, millions of records, interactions
   – Variable selection: 1000 → 100s of variables; add random variables
   – ~6-week cycle: training time is days; lab evaluation is weeks; then live A/B testing
   – Why TreeNet? No missing-value preprocessing needed; handles categorical variables
34. MLR – Typical Approach by Companies
   1. Define goals and the "specific problem"
   2. Collect human-judged training data:
      – Given a large set of <query, result> tuples
        • Judges rate "relevance" on a 1 to 5 scale (5 = "perfect", 1 = "worst")
   3. Generate training data from the provided <query, result> tuples
      – <q,r> → Features; input to the model is <F, judgment>
   4. Train a model, typically minimizing MSE (Mean Squared Error)
   5. Lab evaluation using DCG-type metrics
   6. Deploy the model in a test system and evaluate
35. MLR Training Data
   1. Collect human-judged training data:
      Given a large set of <query, result> tuples, judges rate "relevance" on a 1 to 5 scale (5 = "perfect", 1 = "worst")
   2. Featurize the training data from the provided <query, result> tuples
      <q,r> → Features; input to the model is <F, judgment>

      Instance          x0   x1   x2   …    xn   Label
      <query1, Doc1>    1    3    0    ..   7    4
      <query1, Doc2>    1    5    …    …    …    …
      …                 …    …    …    …    …    …
      <queryn, Docn>    1    0    4    ...  8    3
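   A minimal featurization sketch of the step above, in Python (the feature functions and the tiny judged set are hypothetical, not the actual pipeline): each judged <query, result> tuple becomes one feature row plus its 1-5 label.

      def featurize(query, doc):
          terms = query.lower().split()
          title = doc["title"].lower()
          return [
              len(terms),                                   # query length
              sum(t in title for t in terms) / len(terms),  # fraction of query terms in the title
              doc.get("popularity", 0.0),                   # result-only feature
          ]

      judged = [
          ("angry birds", {"title": "Angry Birds", "popularity": 0.95}, 5),
          ("angry birds", {"title": "Stupid Maze Game", "popularity": 0.01}, 1),
      ]
      X = [featurize(q, d) for q, d, _ in judged]
      y = [label for _, _, label in judged]
      print(X, y)   # feature matrix and labels, ready for an MSE-minimizing learner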
36. The Evaluation Disconnect
   Evaluation in a supervised learner tries to minimize MSE of the targets
   – For each tuple (F_i, x_i) the learner predicts a target y_i
     • Error is f(y_i – x_i), typically (y_i – x_i)^2
     • Optimum is some function of the "errors", i.e. try to minimize total error
   Evaluation of the deployed model is different from evaluation of the learner – typically DCG or nDCG
   Individual result error calculation is different from error based on result ordering
   – A small error in the predicted target for a result could have a substantial impact on result ordering
   – Likewise, the "best result ordering" might not exactly match the predicted targets for any results
   – An affine transform of the targets produces no change to DCG, but a large change to the calculated MSE
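   The affine-transform point can be seen in a few lines of Python (synthetic predictions; the DCG here ranks results by predicted score and then credits their true gains):

      import math

      def dcg_of_ranking(scores, gains):
          """Rank results by predicted score (descending), then compute DCG of that order."""
          order = sorted(range(len(scores)), key=lambda i: -scores[i])
          ranked = [gains[i] for i in order]
          return sum(g if r == 1 else g / math.log2(r) for r, g in enumerate(ranked, start=1))

      gains   = [1.0, 0.6, 0.0, 0.8]           # human-judged targets
      preds   = [0.9, 0.5, 0.1, 0.7]           # model predictions
      shifted = [2 * p + 3 for p in preds]     # affine transform of the same predictions

      mse = lambda y, t: sum((a - b) ** 2 for a, b in zip(y, t)) / len(y)
      print(mse(preds, gains), mse(shifted, gains))                        # 0.01 vs 12.3
      print(dcg_of_ranking(preds, gains), dcg_of_ranking(shifted, gains))  # identical DCG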
37. From Grep to Machine Learnt Ranking
   [Chart: relative performance (e.g., DCG) over time]
   – Pre-1990s: Boolean, VSM, TF-IDF
   – 1990s: graph features, language models
   – 2000s: machine learning, behavioral data
   – 2010s: personalization, social
   – Beyond: ??
38. Real World MLR Systems
   SearchMe was a visual/media search engine – about 3 billion pages in the index, and hundreds of unique features used to predict the score (and ultimately rank results). Results could be video, audio, images, or regular web pages.
   – The goal was, for a given input query, to return the best ordering of relevant results in an immersive UI (mixing different result types simultaneously)
   Quixey – Functional App Search™ – over 1M apps, many sources of data for each app (multiple stores, reviews, blog sites, etc.); the goal is, given a "functional query" (e.g. "a good offline san diego map for iphone" or "kids games for android"), to find the most relevant apps (ranked properly)
   – Dozens of sources of data for each app, many potential features used to:
     • Predict "quality", "text relevance" and other meta-features
     • Calculate a meaningful score used to make decisions by partners
     • Both rank order and raw score matter (important to know "how good" an app is)
   Local Search (Telenav, YellowPages)
39. Talk Outline
   – Introduction: Search and Machine Learned Ranking
   – Relevance and evaluation methodologies
   – Data collection and metrics
   – Quixey – Functional Application Search™
   – System Architecture, features, and model training
   – Alternative approaches
   – Conclusion
40. Quixey: What is an App?
   An app is a piece of computer software designed to help a user perform specific tasks.
   – Contrast with systems software and middleware
   Apps were originally intended for productivity (email, calendar and contact databases), but consumer and business demand has caused rapid expansion into other areas such as games, factory automation, GPS and location-based services, banking, order-tracking, and ticket purchases
   Apps run on various devices (phones, tablets, game consoles, cars)
41. My house is awash with platforms
42. My car...
   NPR programs such as Car Talk are available 24/7 on the NPR News app for Ford SYNC
43. My life...
44. [Image-only slide]
45. Own "The Millionaires App" for $1,000
46. Law Students App
47. Apps for Pets
48. Pablo Picatso!
49. 50 Best iPhone Apps 2011 [Time]
   – Games: Angry Birds, Scrabble, Plants v. Zombies, Doodle Jump, Fruit Ninja, Cut the Rope, Pictureka, Wurdle, GeoDefense, Swarm
   – On the Go: Kayak, Yelp, Word Lens, Weather Channel, OpenTable, Wikipedia, Hopstop, AroundMe, Google Earth, Zipcar
   – Lifestyle: Amazon, Epicurious, Mixology, Paypal, Shop Savvy, Mint, WebMD, Lose It!, Springpad
   – Music & Photography: Mog, Pandora, SoundHound, Bloom, Camera+, Photoshop Express, Hipstamatic, Instagram, ColorSplash
   – Entertainment: Netflix, IMDb, ESPN Scorecenter, Instapaper, Kindle, PulseNews
   – Social: Facebook, Twitter, Google, AIM, Skype, Foursquare, Bump
   [http://www.time.com/time/specials/packages/completelist/0,29569,2044480,00.html#ixzz1s1pAMNWM]
50. [Image-only slide]
51. Examples of Functional Search™
52. App World: Integrating Multi-Data Sources
   [Diagram: an app catalog (A1…A8) built by integrating multiple data sources per app – app stores 1/2/3, blogs, app review sites, and developer homepages – with entity-resolution questions ("?") such as whether two "Angry Birds" or "Learn Spanish" listings refer to the same app]
53. Talk Outline
   – Introduction: Search and Machine Learned Ranking
   – Relevance and evaluation methodologies
   – Data collection and metrics
   – Quixey – Functional Application Search™
   – System Architecture, features, and model training
   – Alternative approaches
   – Conclusion
54. Search Architecture (Online)
   [Diagram: online query flow – Query Processing (DBQ = data storage queries) → Consideration Set Generation (simple scoring / set reducer) → Feature Generation → Result Scoring (ML models) → Result Sorting → Shown Results; offline processing builds the indexes, data storage, and the features/data used by the online path]
55. Architecture Details – Online Flow:
   1. Given a "query", generate query-specific features, Fq
   2. Using Fq, generate appropriate "database queries"
   3. Cheaply pare down the initial possible results
   4. Obtain result features Fr for the remaining consideration set
   5. Generate query-result features Fqr for the remaining consideration set
   6. Given all features, score each result (assuming independent scoring)
   7. Present and organize the "best results" (not necessarily linearized by score)
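   A toy, end-to-end rendering of steps 1-7 in Python (every stage is a deliberately simplistic stand-in on a made-up catalog, not the actual Quixey/SearchMe implementation):

      CATALOG = [
          {"title": "offline san diego map", "popularity": 0.7},
          {"title": "san diego restaurant finder", "popularity": 0.4},
          {"title": "world atlas", "popularity": 0.9},
      ]

      def handle_query(query, catalog, k=2):
          q_terms = set(query.lower().split())                      # 1. query-specific features
          candidates = [d for d in catalog                          # 2-3. cheap "database query"
                        if q_terms & set(d["title"].split())]       #      plus set reduction
          scored = []
          for d in candidates:
              overlap = len(q_terms & set(d["title"].split())) / len(q_terms)   # 4-5. features
              score = 0.8 * overlap + 0.2 * d["popularity"]         # 6. independent scoring
              scored.append((score, d["title"]))
          return sorted(scored, reverse=True)[:k]                   # 7. present the best results

      print(handle_query("san diego map", CATALOG))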
56. Examples of Possible Features
   Query features:
   – popularity/frequency of query
   – number of words in query, individual POS tags per term/token
   – collection term-frequency information (per term/token)
   – geo-location of user
   Result features:
   – (web) in-links/PageRank, anchortext match (might be processed with query)
   – (app) download rate, app popularity, platform(s), star rating(s), review text
   – (app) ML quality score, etc.
   Query-result features:
   – BM25 (per text block)
   – frequency in specific sections
   – lexical similarity of query to title
   – etc.
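   Of the query-result features above, BM25 is the most formula-driven; a minimal sketch follows (whitespace tokenization and the default k1/b values are illustrative choices, not a production scorer):

      import math

      def bm25(query, doc, docs, k1=1.2, b=0.75):
          """Okapi BM25 score of one document for one query, over a tiny corpus."""
          N = len(docs)
          avgdl = sum(len(d.split()) for d in docs) / N
          terms = doc.split()
          score = 0.0
          for t in set(query.split()):
              df = sum(t in d.split() for d in docs)       # document frequency of the term
              if df == 0:
                  continue
              idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
              tf = terms.count(t)
              score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(terms) / avgdl))
          return score

      docs = ["angry birds game", "offline map of san diego", "kids games for android"]
      print(bm25("angry birds", docs[0], docs))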
57. Features Are Key
   Typically MLR systems use both textual and non-textual features:
   • What makes one app better than another?
   • Text match alone is insufficient
   • Popularity alone is insufficient
   No single feature or simple combination is sufficient
   At both SearchMe and Quixey we built learned "meta-features" (next slide)

   Query: games          Title Text Match   Non-title freq of "game"   App Popularity   How good for query
   Angry Birds           low                high                       very high        very high
   Sudoku (genina.com)   low                low                        high             high
   PacMan                low                high                       high             high
   Cave Shooter          low/medium         medium                     low              medium
   Stupid Maze Game      very high          medium                     very low         low
58. Features Are Key: Learned Meta-Features
   Meta-features can capture multiple simple features in fewer "super-features"
   – SearchMe: SpamScore, SiteAuthority, category-related
   – Quixey: App-Quality, TextMatch (as distinct from overall relevance)
   SpamScore and App-Quality are complex learned meta-features
   – Potentially hundreds of "smaller features" feed into a simpler model
   – SpamScore considered: average PageRank, num-ads, distinct concepts, several language-related features
   – App-Quality is learned (TreeNet) and designed to be resistant to gaming
     • An app developer might pay people to give high ratings
     • Has a well-defined meaning
59. Idea of Metafeatures (Example)
   In this example, each metafeature is independently learned on different training data.
   Single flat model (all raw features F1…F10 feed one final model):
   – Many data points needed (expensive)
   – Many complex trees
   – Judgments prone to human errors
   vs. explicit, human-decided metafeatures (F1…F10 feed MF1, MF2, MF3, which feed the final model):
   – Produces simpler, faster models
   – Requires fewer total training points
   – Humans can define metafeatures to minimize human errors, and possibly use different targets
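   A sketch of the two-stage idea in the diagram, with scikit-learn's GradientBoostingRegressor standing in for TreeNet and random arrays standing in for real judgments (all data here is made up): each metafeature is fit on its own training set, and the final relevance model sees only the metafeature outputs.

      import numpy as np
      from sklearn.ensemble import GradientBoostingRegressor

      rng = np.random.RandomState(0)

      # Stage 1: each metafeature model is trained on its own (raw features -> target) data.
      X_quality, y_quality = rng.rand(50, 6), rng.rand(50)   # e.g. quality judgments
      X_text,    y_text    = rng.rand(50, 4), rng.rand(50)   # e.g. text-match judgments
      quality_model = GradientBoostingRegressor(n_estimators=50).fit(X_quality, y_quality)
      text_model    = GradientBoostingRegressor(n_estimators=50).fit(X_text, y_text)

      # Stage 2: the final relevance model consumes only the few metafeature outputs.
      X_raw_q, X_raw_t = rng.rand(200, 6), rng.rand(200, 4)
      meta = np.column_stack([quality_model.predict(X_raw_q), text_model.predict(X_raw_t)])
      y_relevance = rng.randint(1, 6, size=200)               # 1-5 relevance judgments
      final_model = GradientBoostingRegressor(n_estimators=50).fit(meta, y_relevance)
      print(final_model.predict(meta[:3]))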
60. Data and Feature Engineering are Key!
   Selection of "good" query/result pairs for labeling, and good metafeatures
   – Should cover various areas of the sub-space (i.e. popular and rare queries)
   – Be sure to only pick examples which "can be learned" and are representative
     • Misspellings are a bad choice if there is no spell-corrector
     • "Exceptions" – special cases (e.g. Spanish results for an English engine) are bad and should be avoided unless features can capture this
   – Distribution is important
     • Bias the data to focus on business goals – if the goal is to be the best for "long queries", have more "long queries"
   Features are critical – they must be able to capture the variations (good metafeatures)
   Feature engineering is probably the single most important (and most difficult) aspect of MLR
61. Applying TreeNet for MLR
   Starting with a set of <query, result> pairs, obtain human judgments [1-5] and features
   – 5 = perfect, 1 = worst (maps to target [0-1])
   [Diagram: training and evaluation flow]
   – (Query, Result, Judgment) plus (Query, Result, Features), e.g. q1,r1,2 with features f1,1…f1,n → TreeNet → candidate models M1, M2, M3, …
   – For a candidate model M: test queries (q1…qn) are run through the search engine, the returned results are given human judgments, and DCG is calculated
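   A compressed sketch of this flow on synthetic data, again using scikit-learn's GradientBoostingRegressor as a stand-in for TreeNet: judgments in [1-5] are mapped to [0-1] targets, two candidate models are trained, and the winner is picked by average DCG over the test queries.

      import math
      import numpy as np
      from sklearn.ensemble import GradientBoostingRegressor

      def dcg(gains):
          return sum(g if i == 1 else g / math.log2(i) for i, g in enumerate(gains, start=1))

      rng = np.random.RandomState(1)
      X = rng.rand(300, 5)                          # features for 300 <query, result> pairs
      judgments = rng.randint(1, 6, size=300)       # human labels on a 1-5 scale
      y = (judgments - 1) / 4.0                     # map [1-5] judgments to [0-1] targets
      query_id = rng.randint(0, 30, size=300)       # 30 synthetic test queries

      candidates = [GradientBoostingRegressor(learning_rate=lr, n_estimators=100).fit(X, y)
                    for lr in (0.05, 0.1)]          # candidate models M1, M2

      def mean_dcg(model):
          scores = model.predict(X)
          per_query = []
          for q in np.unique(query_id):
              idx = np.where(query_id == q)[0]
              order = idx[np.argsort(-scores[idx])]           # rank this query's results
              per_query.append(dcg(list(judgments[order])))   # credit the human judgments
          return float(np.mean(per_query))

      print([round(mean_dcg(m), 2) for m in candidates])      # pick the model with higher DCG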
62. Talk Outline
   – Introduction: Search and Machine Learned Ranking
   – Relevance and evaluation methodologies
   – Data collection and metrics
   – Quixey – Functional Application Search™
   – System Architecture, features, and model training
   – Alternative approaches
   – Conclusion
63. Choosing the Best Model – Disconnect
   TreeNet uses mean-squared-error minimization
   – The "best" model is the one with the lowest MSE, where error is abs(target – predicted_score)
   – Each result is independent
   DCG minimizes rank-ordering error
   – The ranking is query-dependent
   Might require evaluating several TreeNet models before a real DCG improvement
   – Try new features
   – TreeNet options (learn rate, max trees), change splits of data
   – Collect more/better data (clean errors), consider active learning
64. Assumptions Made (Are There Choices?)
   MSE is used because the input data is independent judgment pairs
   Assumptions of consistency over time and between users (stationarity of judgments)
   – Is Angry Birds v1 a perfect score for "popular game" in 10 years?
   – Directions need to be very clear to ensure user consistency
     • The independent model assumes all users are consistent with each other
   Collect judgments in a different form:
   – Pairwise comparisons: <q1,r1> is better than <q1,r2>, etc.
   – Evaluate a "set" of results
   – Use a different, more granular scale for judgments
   – Full ordering (lists)
65. Other Ways to do MLR
   Changing data collection:
   – Use inferred as opposed to direct data
     • Click/user behavior to infer relevance targets
   – From independent judgments to pairwise or listwise
   Pairwise SVM:
   – R. Herbrich, T. Graepel, K. Obermayer. "Support Vector Learning for Ordinal Regression." In Proceedings of ICANN 1999.
   – T. Joachims. "A Support Vector Method for Multivariate Performance Measures." In Proceedings of ICML 2005. (http://www.cs.cornell.edu/People/tj/svm_light/svm_perf.html)
   Listwise learning:
   – LambdaRank, Chris Burges et al., 2007
   – LambdaMART, Qiang Wu, Chris J.C. Burges, Krysta M. Svore and Jianfeng Gao, 2008
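   A sketch of the pairwise reformulation mentioned above: graded judgments for the same query are turned into preference pairs (feature-vector differences labeled +1/-1), which a pairwise learner such as RankSVM can then fit (the judged tuples here are made up).

      from itertools import combinations

      judged = [            # (query_id, features, judgment)
          ("q1", [0.9, 0.2], 5),
          ("q1", [0.4, 0.7], 2),
          ("q1", [0.5, 0.5], 2),
          ("q2", [0.1, 0.9], 4),
          ("q2", [0.8, 0.1], 1),
      ]

      pairs = []
      for (qa, fa, ja), (qb, fb, jb) in combinations(judged, 2):
          if qa != qb or ja == jb:
              continue                            # only compare results for the same query
          diff = [a - b for a, b in zip(fa, fb)]
          pairs.append((diff, 1 if ja > jb else -1))

      print(pairs)   # each entry: (feature difference, which result was preferred)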
66. Talk Outline
   – Introduction: Search and Machine Learned Ranking
   – Relevance and evaluation methodologies
   – Data collection and metrics
   – Quixey – Functional Application Search™
   – System Architecture, features, and model training
   – Alternative approaches
   – Conclusion
67. Conclusion
   Machine learning is very important to search
   – Metafeatures reduce model complexity and lower costs
     • Divide and conquer (parallel development)
   – MLR is real, and is just one part of ML in search
   Major challenges include data collection and feature engineering
   – Must pay for data – non-trivial, but you have a say in what you collect
   – Features must be reasonable for the given problem (domain-specific)
   Evaluation is critical
   – How to evaluate effectively is important to ensure improvement
   – MSE vs DCG disconnect
   TreeNet can be, and is, an effective tool for machine learning in search
68. Quixey is hiring
   If you want a cool internship, or a great job, contact us afterwards or e-mail jobs@quixey.com and mention this presentation
69. Questions
   James_DOT_Shanahan_AT_gmail_DOT_com
   Eric_AT_Quixey_DOT_com
70. 3250 Ash St., Palo Alto, CA 94306
   888.707.4441
   www.quixey.com
