2011 Search Query Rewrites - Synonyms & Acronyms


Published on

July 27, 2011 Bay Area Search Presentation

Brian Johnson, Engineering Director, Query Services @ eBay

Query expansion is an important part of search recall for all search engines. In this talk I'll discuss some of the general trends driving Hadoop adoption within the Search Query Services team at eBay, and the types of algorithms and techniques we've moved to Hadoop. Over time we've moved from smaller, editorial data sets to large, machine-generated data sets mined from behavioral log data, items/listings, catalogs, etc. One common workflow is to mine large candidate rewrite/expansion data sets from multiple data sources, use crowdsourced human judgment to classify a subset of the candidates (true positive, false positive), use machine learning techniques to discard false positives, run automated validation on the final data set, and automatically push it to production.

Ravi Jammalamadaka, Senior Applied Researcher, Query Services @ eBay

Ravi is a real engineer. Not a pointy haired manager like the previous speaker. Expect some real engineering:-) He'll be doing a literature review for acronym mining and discussing a real world implementation.

Title: Mining Acronyms From Raw Text

Abstract: A significant number of eBay products are known by their acronyms. The eBay query expansion service expands user queries with their acronym equivalents to increase recall. The challenge is to mine acronyms from either seller data (e.g. item descriptions, titles) or buyer data (e.g. queries).
Ravi will present state-of-the-art algorithms from recent conferences that mine acronyms from raw text, and discuss their limitations. He will then present a new acronym mining algorithm that seeks to address those limitations, and a machine learning classifier that seeks to remove the false positives generated by the acronym mining algorithm.

Published in: Technology

  1. Bay Area Search, Wednesday July 27, 2011
  2. Agenda
     • 6:30 Eat & Greet - Free Food & Beer
     • 7:00 Speaker #1 – Brian Johnson
     • 7:45 Speaker #2 – Ravi Jammalamadaka
     • Plan on 2 fabulous 45 minute presentations by excellent local search experts. Please suggest speakers or topics you would like to hear.
     • Great speakers, good food, fine beer, and everyone's favorite search term - Free, Free, Free :-)
     • Event will be held at the eBay campus just off 17/880 @ Hamilton in the main Community building. Look for lobby/flagpole.
     • 4th Wednesday of every month
  3. How Can I Help?
     • Speakers
     • Feedback
     • Organizers
     • Videographers
  4. Brian Johnson
     • Brian is the Director of Engineering for Query Services at eBay. He has held this role since January of 2011. Prior to that he managed the engineering teams for Query Understanding (metrics and crowdsourced human judgment), classification, data publishing, and browsing. Brian has been at eBay since 2002.
     • Prior to eBay Brian was at:
       – Handspring - Managed the team working on email/IM/web browsing for one of the first smartphones (Treo)
       – Excite@Home - Director of Engineering for the Excite homepage
       – Synopsys - Engineer for chip design visualization
       – AT&T Bell Labs - Data visualization research
     • Brian received his PhD in Computer Science from the University of Maryland in 1993. His papers regarding visualizing hierarchical and categorical data with Treemaps have been cited hundreds of times.
     • Brian is a pleasure to listen to and I'm sure you'll appreciate his insights from the trenches regarding search query rewrite research and practice at eBay.
  5. Ravi Jammalamadaka
     • Ravi works in the query services team at eBay looking at ways to rewrite user queries to improve both precision and recall.
     • Received his PhD from University of California, Irvine.
       – Research on Data Security, Databases
     • Ravi has published 10 research papers in the areas of databases, data security and data mining.
     • Ravi was invited to be a Program Committee member for IEEE ISI 2010, 2011 and ICDE 2010 (demo track).
  6. Query Rewrites - Brian Johnson, Bay Area Search, July 27, 2011
  7. SEARCH: Documents + Users
  8. What Is A Query?
     • Queries are more than a text box
     • Keywords=Red Size 7 Shoes
     • Keywords=Red, Category=Shoes
     • Keywords=Red, Category=Shoes, Size = 7
     • Many filter variables affect recall
     • Query, category, attributes current context dimension targets
     • Format, condition, location/distance, shipping, seller, price
  9. Questions About Queries
     • Popularity/Rank
     • Supply
     • Demand
     • Click Through Rate (CTR)
     • Conversion
     • Rewrites/Expansions
     • Related Searches with CTR & Conversion
     • Category Supply/Demand/CTR/Sales
     • Product Supply/Demand/CTR/Sales
     • Top Products
     • Items (recalled, view, bin, bid, offer, watch, ask, purchase)
     • Autocompletes
     • Classification (broad, narrow, ambiguous, help, navigational)
     • Purchase Site
     • Frequency by day, day of week, time of day
     • Cross Border
     • Sales
     • Position distribution in user sessions
     • Result set size
     • Exit Rate
     • Exit Destination
  10. TRENDS: Data Mining & Machine Learning
  11. Query Rewrite Trends
     • Intelligence: Human → Machine
     • Data: Small → Big
     • Sources: Few → Many
     • Context: Little → Some
  12. EXAMPLES
  13. Example Query Services/Rewrites
     • Related Search: canon sd1300is, canon sd1400 is, canon sd4000, canon sd1400is, canon sd, canon sd1300 is waterproof, canon sd 1300, canon
     • Stemming (ipod or ipods)
     • Spelling (cannon or canon)
     • Condition (new or condition=new)
     • Synonyms (boat carpet or marine carpet)
     • Space Synonyms (MarioKart > Mario Kart)
     • Item Specifics (blue or color=blue)
     • Acronyms (os = one size in CSA | Operating Systems in Electronics)
     • Category (shoes or category=63850)
     • Cross Border (site=0 and category=123) or (site=3 and category=456)
     • Fitment (fits model=X)
     • Term Removal (Harry Potter and the Order of the Phoenix (daily deal))
  14. Context & Specificity
     • Beyond decontextualized single entities
     • Examples
       – Stemming failures
         ○ (cowboy v cowboys) and (hat v hats)
         ○ Doesn't work for cowboy hats & dallas cowboy caps/hats
       – hp printer > (hp v "hewlett packard") printer
       – 15 hp pump > 15 (hp v horsepower) pump
       – motor bike > motor (bike v cycle)
       – audi b6 > (audi v make=audi) & (b6 v platform=b6) v (product=789)
       – the who != who the
       – Time
         ○ Today: latest generation > latest generation v (generation=4)
         ○ Tomorrow: latest generation > latest generation v (generation=5)
  15. HOW
  16. Architecture
     • Online (Code + NoSQL Cache)
     • Offline (Hadoop)
     • Document & Behavioral Data
  17. Better, Faster, Cheaper
     Better
     • Better recall
     • Awesome related search suggestions
     • Mind reading spell corrections
     Faster
     • <3 milliseconds per query
     • 1.2 billion queries per day
     • 1,000's of queries per second on a single machine
     Cheaper
     • Hadoop offline
     • Caching online
  18. Metrics/Evaluation
     • Revenue (A/B Test)
     • Relevance (Recall, Precision, DCG, etc.)
     • Result Count
     • Result Set Overlap
     • Click Through Rate
     • Feedback (site links)
     • Human Judgment
     • Competitive/Benchmark data
     • "Gold" test sets
  19. Thinking about rewrites
     • Query length
     • Language detection
     • Intent identification
     • Concept vs instantiation (ex: car vs honda)
     • Autocomplete, autosuggest
     • Phrases
     • Summarization
     • Bracketing
     • Inference (ex: movie 9)
     • Normalization
     • Stemming
     • Key term extraction
     • Synonyms
     • Term relaxation / constraining
     • Spell checking
     • Stopwords, noise words
     • Session context
     • Abbreviations, acronyms
     • Trend detection
     • Units, brands, sizes, dimensions
     • Online feedback
     • Temporal queries, recency
     • Buzz
  20. SYNONYMS
  21. Synonym Candidates
     Synonyms derived from top changes in successive queries
     • frame / frames
     • lamp / lamps
     • case / cases
     • grill / grille
     • shoe / shoes
     Synonyms derived from top queries in item query clusters
     • texas instruments ba ii plus / ti ba ii plus
     • brighton handbag / brighton purse
     • lenovo x200 / thinkpad x200
     • king bedspread / king coverlet
     • rockabilly dress / swing dress
     • 1963 ford falcon / 63 falcon
     • jessica simpson hair extensions / jessica simpson hairdo
     Abbreviations/acronyms derived from query transitions
     • stanford ky / stanford kentucky
     • dc sub / dc subwoofer
     • meridian ms / meridian mississippi
     • front royal va / front royal virginia
     • baseball pin / baseball pinback
     • snowboard helmet l / snowboard helmet large
     • motorcycle cam / motorcycle camera
     • diamond amp / diamond amplifier
     • active sub / active subwoofer
     • shapleigh me / shapleigh maine
  22. SPELLING
  23. Spell Check – Offline
     • Successive queries qi and qi' are candidates for spell correction analysis if the edit distance is within 40% of the average query length.
       – qi and qi' may have tokens in common, called anchors.
       – Use transitivity to remove intermediate queries.
     • Create a bipartite graph for spell correction candidates.
     • The same query can exist on the source and sink sides of the graph.
     • Compute input and output degrees of each sink node, indicating how info flows in and out of a query.
     • A correct spelling candidate is a sink node with far more flow into it than out of it.
     [Figure: bipartite graph with nodes q1 to q6 and q1', q2']
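The offline heuristic on this slide can be sketched in Python. This is a minimal illustration, not eBay's implementation: reading "within 40%" as "edit distance at most 0.4 times the average query length", using transition counts as flow weights, and the `min_flow_ratio` cutoff are all assumptions, and the function names are hypothetical.

```python
from collections import Counter


def edit_distance(a, b):
    """Classic Levenshtein distance using a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def is_spell_candidate(q1, q2, ratio=0.4):
    """Assumed reading of the slide: distance <= 40% of the average length."""
    avg_len = (len(q1) + len(q2)) / 2
    return q1 != q2 and edit_distance(q1, q2) <= ratio * avg_len


def likely_corrections(transitions, min_flow_ratio=3.0):
    """transitions: iterable of (source_query, sink_query, count).

    Returns sink queries whose inbound flow is at least `min_flow_ratio`
    times their outbound flow (the ratio is a made-up cutoff).
    """
    in_flow, out_flow = Counter(), Counter()
    for src, dst, count in transitions:
        if is_spell_candidate(src, dst):
            out_flow[src] += count
            in_flow[dst] += count
    return {q for q in in_flow
            if in_flow[q] >= min_flow_ratio * max(out_flow[q], 1)}
```

With made-up counts such as `[("jetsky", "jetski", 40), ("jetski", "jetsky", 2)]`, only `jetski` receives far more flow than it emits, so it is the one kept as a likely correct spelling.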
  24. Spell Check – Online
     [Flowchart; box labels as extracted, flow order not recoverable from the transcript: query; tokenize to tokens (wi-2, wi-1, wi); in the white list?; found a match; search in dictionary; N-Gram index; edit distance, phonetics; calculate contextual possibility; obtain entropy; obtain cosine similarity; priority queue; last? (no: go to next; yes: get the list of best candidates); result]
  25. ACRONYMS
  26. Acronyms
     • Expand User Queries
       – Increase recall without sacrificing precision
       – Better deals for buyers
     • Examples
       – BAPE: 2,540 results
       – OR(Bathing Ape, Bape): 2,987 results
     Rescue Project
  27. Mining Acronyms From Query Reformulations
     • Learn from user behavioral data
     • Example
       – UCB Sweatshirt (CSA) → University of California Berkeley Sweatshirt (CSA)
  28. Acronym Context & Specificity
     • Need to express context sensitive expansions
       – Categorical
         ○ ATC > Armored Troop Carrier in Toys and Hobbies
         ○ ATC > Artist Trading Card in Art
         ○ ATC > Automatic Tool Change in Business and Industrial
       – Directional
         ○ Old > Antique
         ○ Yoga towels/mats > Yogitoes
  29. Acronym/Abbreviation Mining and Category Based Expansions
     Acronym/Abbreviation Mining
     • Acronyms/abbreviations mined from raw text and query logs
     • Look for patterns of text
       – long form (short form)
       – short form (long form)
     • Employ intelligent matching algorithms to mine candidates
     • Example title: new cheap Playstation portable (PSP)
     • Acronym discovered: PSP -> PlayStation Portable
     • Candidates mined are fed through a machine learning classifier to remove the false positives
     Category Based Expansions
     • hp in Electronics: Hewlett Packard
     • hp in Cars and Trucks: horsepower
     • System allows
       – Category based expansions
       – Directional expansions
       – Positive and negative expansions
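A category-keyed, directional expansion table like the one described on slides 28 and 29 could look roughly like this in Python. It is only a sketch: the `EXPANSIONS` table, the `expand_query` helper, and the `OR(...)` output syntax (borrowed from the BAPE example on slide 26) are illustrative assumptions, not the production data model.

```python
# Hypothetical lookup table: (token, category) -> expansion.
# Keys are one-way: "hp" expands to "horsepower", never the reverse.
EXPANSIONS = {
    ("hp", "Electronics"): "hewlett packard",
    ("hp", "Cars and Trucks"): "horsepower",
    ("atc", "Toys and Hobbies"): "armored troop carrier",
    ("atc", "Art"): "artist trading card",
    ("atc", "Business and Industrial"): "automatic tool change",
}


def expand_query(query, category):
    """Rewrite each token that has an expansion in the given category."""
    parts = []
    for token in query.lower().split():
        expansion = EXPANSIONS.get((token, category))
        if expansion:
            parts.append(f'OR({token}, "{expansion}")')
        else:
            parts.append(token)
    return " ".join(parts)
```

Because the lookup is keyed on the short form and the category, the same token can expand differently per category, and a token with no entry for the current category is left untouched.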
  31. Mining Acronyms/Abbreviations from Raw Text - Ravi Chandra Jammalamadaka, eBay, Inc, 07/27/2011
  32. Talk Overview
     • Motivation
       – Introduction of the acronym mining problem.
     • Related Work
       – Algorithm overview.
     • eBay Acronym Mining algorithm.
       – Architecture.
       – Algorithm overview.
     • Results.
     • Conclusions.
  33. Motivation
     • User queries are an incomplete representation of their information needs
       – Spelling mistakes
         ○ Jetsky instead of Jetski
       – Synonyms are not considered
         ○ PS3 and PlayStation 3 (acronym, the topic of this talk)
         ○ JetSki and Personal Watercraft
       – Users are not experts in search engine technology
         ○ Example: Anniversary gifts for men
  34. Need for Query Rewrites
     • JetSky: 2 results
     • Spelling Correction → JetSki: 23782 results
     • Synonym Expansion → OR(Jetski, Personal WaterCraft): 24151 results
  35. Motivation: Acronyms/Abbreviation
     • Uke → Ukulele
     • IQD → Iraqi Dinar
  36. Where can we find Acronyms?
     • From Item Title/Descriptions
       – Grand Theft Auto III (GTA 3) (PlayStation 2, 2001)
       – Grand Theft Auto IV (GTA 4) PS3 mint condition
       – Warhawk (No Headset) PlayStation 3 (PS3) BRAND NEW!
       – COLD LASER. Low Level Laser Therapy(LLLT) + Acupuncture
     • From Query Reformulations, i.e. how users change their queries
       – New Uke → New Ukulele
  37. Related Work
  38. Schwartz et al: Greedy Match Algorithm
     • Warhawk (No Headset) PlayStation 3 (PS3) BRAND NEW!
  39. Identifying Abbreviation Definitions in Biomedical Text.
     • Mining for patterns
       – long form (short form)
       – short form (long form)
       – Long form is no more than min(|A| + 5, |A| * 2) words, where |A| is the number of characters in the short form.
       – Roche et al. propose that number to be less than |A| * 3.
     • The characters in the short form should match the long form in the same order and the first character in the short form should be at the beginning of a word.
     • Example:
       – PS3 -> PlayStation 3
  40. Schwartz et al
     • Pros:
       – Finds almost all abbreviations and acronyms
     • Cons:
       – High false positive rate.
         ○ Foot Massage Diabetes Treatment (FEET)
       – Suffers from the truncated long form problem.
         ○ Example: American Automobile Association (AAA)
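The greedy match from slides 38 to 40 can be sketched in Python as below. The right-to-left matching loop follows the published Schwartz and Hearst description, but this is a simplified illustration: the candidate-extraction regex, the single-token restriction on short forms, and the function names are assumptions, and only the "long form (short form)" pattern is handled.

```python
import re


def find_best_long_form(short_form, long_form):
    """Match short-form characters right to left inside the candidate text.

    The first short-form character must sit at the start of a word.
    Returns the matched long form, or None if the characters do not fit.
    """
    l_index = len(long_form) - 1
    for s_index in range(len(short_form) - 1, -1, -1):
        ch = short_form[s_index].lower()
        if not ch.isalnum():
            continue
        while ((l_index >= 0 and long_form[l_index].lower() != ch) or
               (s_index == 0 and l_index > 0 and
                long_form[l_index - 1].isalnum())):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
    start = long_form.rfind(" ", 0, l_index + 1) + 1
    return long_form[start:]


def candidate_pairs(text):
    """Find 'long form (SHORT)' pairs, limiting the search window to
    min(|A| + 5, |A| * 2) words before the parenthesis."""
    pairs = []
    for match in re.finditer(r"\(([^()]+)\)", text):
        short = match.group(1).strip()
        if " " in short or len(short) < 2:
            continue  # assumption: short forms are single tokens
        words = text[:match.start()].split()
        max_words = min(len(short) + 5, len(short) * 2)
        window = " ".join(words[-max_words:])
        long_form = find_best_long_form(short, window)
        if long_form:
            pairs.append((short, long_form))
    return pairs
```

Run on the slide's own titles, this sketch reproduces both weaknesses listed under Cons: "Foot Massage Diabetes Treatment (FEET)" is accepted as a pair, and "American Automobile Association (AAA)" comes back truncated to "Automobile Association".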
  41. Acronym-Expansion Recognition and Ranking on the Web
     • First few characters match
     • Ignore stop words
     • Example:
       – Cool -> Cooperation in Ontology and Linguistics.
     Alpa Jain, Silviu Cucerzan, Saliha Azzam. Acronym-Expansion Recognition and Ranking on the Web.
  42. Jain et al
     • Pros:
       – Low false positive rate
     • Cons:
       – Does not do a good job at identifying abbreviations
       – Misses out on a lot of actual acronyms
         ○ Will not find the PlayStation 3 and PS3 association.
  43. eBay Acronym Mining Architecture
     [Diagram boxes: User Data, Candidate Generator, Feature Extractor, Classifier, Dictionary, Human Judgment, A/B Test, Live on Site]
  44. eBay Acronym/Abbreviation Mining Algorithm
     • Desirable Properties
       – Find all abbreviations and acronyms like the greedy match
       – Reduce the amount of false positives
       – Solve the truncated long form problem.
     • What makes a good acronym-expansion pair?
       – Characters in the acronym are found at the beginning of the words.
       – Expansions generally do not have words that are skipped or not represented in the acronym.
       – Can a cost metric capture the intuition?
  45. Cost Based Approach for Mining Abbreviations
     • CIM: Computer Interface Module. Total cost: low
     • PVC: PolyVinyl Cloride. Total cost: medium
     • HSF: Heat shock transcription factor. Total cost: high
  46. Cost Based Recursive Algorithm
     • Title: new American Automobile Association (AAA) map of mexico
     • Objective: Find the longest form with the lowest cost
     • American Automobile Association (AAA):
       Min( American Automobile Associ (AA), American Automobile Associ (AAA) ) + Cost so far
  47. Salient Properties of the new algorithm
     • If Cost > Threshold, then the long form is a false positive.
     • As the allowed cost (threshold) increases
       – False positives increase
       – The chance that a real acronym is not identified decreases
     • As the allowed cost (threshold) decreases
       – False positives decrease
       – The chance that a real acronym is not identified increases.
     • At lower costs, the algorithm behaves like the first few characters match.
     • At high costs, the algorithm behaves like the greedy match algorithm.
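The slides do not give the actual cost function, so the following Python sketch uses a made-up one purely to illustrate the idea: an acronym character matched at the start of a word is free, a character matched mid-word costs `mid_word_cost`, and a word that contributes no character costs `skipped_word_cost`. Both weights, the function names, and the threshold value are hypothetical, and the sketch scores a given pair rather than searching for the long form boundary.

```python
from functools import lru_cache


def _is_subsequence(sub, word):
    """True if the characters of `sub` appear in `word` in order."""
    it = iter(word)
    return all(ch in it for ch in sub)


def expansion_cost(acronym, long_form, mid_word_cost=1, skipped_word_cost=2):
    """Lowest total cost of aligning the acronym to the long form's words."""
    acr = acronym.lower()
    words = long_form.lower().split()

    def word_cost(chunk, word):
        # Cost of letting `word` account for the acronym characters in `chunk`.
        if not chunk:
            return skipped_word_cost
        if chunk[0] == word[0] and _is_subsequence(chunk[1:], word[1:]):
            return (len(chunk) - 1) * mid_word_cost
        if _is_subsequence(chunk, word[1:]):
            return len(chunk) * mid_word_cost
        return None  # this word cannot supply these characters

    @lru_cache(maxsize=None)
    def best(i, w):
        # Minimum cost to match acr[i:] against words[w:].
        if w == len(words):
            return 0 if i == len(acr) else float("inf")
        options = []
        for k in range(len(acr) - i + 1):
            cost = word_cost(acr[i:i + k], words[w])
            if cost is not None:
                options.append(cost + best(i + k, w + 1))
        return min(options)

    return best(0, 0)


def is_plausible_pair(acronym, long_form, threshold=1):
    """Pairs whose cost exceeds the threshold are treated as false positives."""
    return expansion_cost(acronym, long_form) <= threshold
```

Under these made-up weights the three examples from slide 45 come out in the same order as on the slide: CIM scores 0, PVC scores 1, and HSF scores 2. Raising `threshold` admits more pairs, which mirrors the trade-off listed on slide 47.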
  48. Experiments
     Sample Dataset: 2.5 million item titles

     Algorithm                    Total Candidates   False Positive Rate   Yield
     Greedy Match                 2548               39%                   1554
     First Few Characters Match   759                4%                    728
     Cost Based Match, k1         1223               14%                   1051
     Cost Based Match, k2         1604               16%                   1284
     Cost Based Match, k3         2023               20%                   1554
  49. Removing false positives
     • Goal
       – Develop a classification algorithm that will classify whether a candidate is an acronym or not.
     • Classification algorithm
       – Decision trees
         ○ TreeNet data mining tool.
     • Candidates are tagged with many features.
     • Classifier learns on the tagged golden set.
     • New candidates are then run through the classifier.
  50. Example of a Decision Tree
     Training Data

     Tid   Refund   Marital Status   Taxable Income   Cheat
     1     Yes      Single           125K             No
     2     No       Married          100K             No
     3     No       Single           70K              No
     4     Yes      Married          120K             No
     5     No       Divorced         95K              Yes
     6     No       Married          60K              No
     7     Yes      Divorced         220K             No
     8     No       Single           85K              Yes
     9     No       Married          75K              No
     10    No       Single           90K              Yes

     Model: Decision Tree (splitting attributes)
     • Refund = Yes: NO
     • Refund = No: MarSt
       – MarSt = Married: NO
       – MarSt = Single, Divorced: TaxInc
         ○ TaxInc < 80K: NO
         ○ TaxInc > 80K: YES
     Acknowledgements: George Kollios,
  51. Features: Neighborhood Similarity
     • Rationale: Two synonym candidates A and B will tend to have similar neighbors (viz. keywords) surrounding them.
     • Neighborhood similarity = |Neighbours(A) ∩ Neighbours(B)| / min(|Neighbours(A)|, |Neighbours(B)|)
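As a small Python sketch of the neighborhood similarity feature, assuming each candidate's neighbors have already been collected into a set of surrounding keywords (the function name and the empty-set handling are assumptions):

```python
def neighbourhood_similarity(neighbours_a, neighbours_b):
    """Size of the shared neighbor set divided by the smaller set's size."""
    a, b = set(neighbours_a), set(neighbours_b)
    if not a or not b:
        return 0.0  # assumption: no neighbors means no evidence of similarity
    return len(a & b) / min(len(a), len(b))
```

Dividing by the smaller set means a rare term whose few neighbors all also appear around a popular term still scores 1.0.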
  52. Features: Mutual Information
     • Rationale: The goal of this metric is to determine whether the candidates co-occur in the description significantly more often than they would by random chance.
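The slide gives no formula for this feature. One common way to measure "more than random chance" is pointwise mutual information, sketched below in Python; treating the slide's metric as PMI over description counts is an assumption, as are the function and parameter names.

```python
import math


def pointwise_mutual_information(count_ab, count_a, count_b, total_docs):
    """log2 of observed co-occurrence probability over the probability
    expected if the two candidates appeared independently."""
    if min(count_ab, count_a, count_b) <= 0 or total_docs <= 0:
        raise ValueError("counts must be positive")
    p_ab = count_ab / total_docs
    p_a = count_a / total_docs
    p_b = count_b / total_docs
    return math.log2(p_ab / (p_a * p_b))
```

A value near 0 means the pair co-occurs about as often as chance predicts; larger positive values mean the two candidates appear together more than independence would explain.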
  53. Features: KL divergence
     • Rationale: Two synonym candidates will have similar category distributions of their inventory.
  54. KL distance: Example
     • Ipods: Electronics (50), Clothing Shoes and Accessories (1)
       Ipod: Electronics (100), Clothing Shoes and Accessories (3)
       KL divergence: 0.83
     • Ipod: Electronics (100), Clothing Shoes and Accessories (3)
       T-shirt: Clothing Shoes and Accessories (1000), Uniforms (50)
       KL divergence: 128592.74
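A KL divergence over category counts can be sketched in Python as follows. The additive `epsilon` smoothing (needed when a category is missing from one side) and the use of the natural logarithm are assumptions; the slides do not say how the reported values were scaled or smoothed, so this sketch is not expected to reproduce 0.83 or 128592.74.

```python
import math


def category_distribution(counts, categories, epsilon=1e-6):
    """Turn raw category counts into a smoothed probability distribution."""
    total = sum(counts.get(c, 0) + epsilon for c in categories)
    return {c: (counts.get(c, 0) + epsilon) / total for c in categories}


def kl_divergence(p_counts, q_counts, epsilon=1e-6):
    """D_KL(P || Q) over the union of categories seen for either term."""
    categories = set(p_counts) | set(q_counts)
    p = category_distribution(p_counts, categories, epsilon)
    q = category_distribution(q_counts, categories, epsilon)
    return sum(p[c] * math.log(p[c] / q[c]) for c in categories)
```

With the slide's counts, the Ipods versus Ipod pair yields a much smaller divergence than Ipod versus T-shirt, which is the ordering the feature relies on.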
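  55. Classifier Decision Tree Example
     • KL Distance > 2.5: False Positive
     • KL Distance ≤ 2.5: Neighbourhood Similarity
       – Neighbourhood Similarity ≤ 0.2: False Positive
       – Neighbourhood Similarity > 0.2: Mutual Information
         ○ Mutual Information > 0.003: True Positive
         ○ Mutual Information ≤ 0.003: False Positive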
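Read as code, the example tree on this slide amounts to three threshold checks. The sketch below is a hand-written Python rendering, not output from TreeNet, and the assignment of leaves to branches is reconstructed from the flattened slide text, so it should be treated as an assumption.

```python
def classify_candidate(kl_distance, neighbourhood_similarity, mutual_information):
    """Apply the slide's example thresholds to one acronym candidate."""
    if kl_distance > 2.5:
        return "False Positive"   # category distributions differ too much
    if neighbourhood_similarity <= 0.2:
        return "False Positive"   # too few shared neighboring keywords
    if mutual_information <= 0.003:
        return "False Positive"   # co-occurrence no better than chance
    return "True Positive"
```

A candidate is kept only when all three features agree, which fits the drop in false positive rate reported on the next slide.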
  56. Classifier Results
     • False positive rate at the candidate generation stage is 20%.
     • False positive rate after going through the classifier is 5.5%.
     • The remaining false positives are removed by human judges.
  57. Conclusions
     • We presented the state-of-the-art algorithms for acronym mining and their limitations.
     • We presented a new cost based algorithm for mining acronyms from raw text that seeks to address the limitations of the previous algorithms.
     • We presented a classifier approach to remove false positives.
     • We experimentally validated our approach and showed that it is a viable approach for mining acronyms.
  58. References
     • [1] Ariel S. Schwartz, Marti A. Hearst. A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text.
     • [2] Youngja Park, Roy J. Byrd. Hybrid Text Mining for Finding Abbreviations and Their Definitions.
     • [3] Mathieu Roche, Violaine Prince. Managing the Acronym/Expansion Identification Process for Text-Mining Applications.
  59. References (2)
     • [4] Yee Fan Tan, Ergin Elmacioglu, Min-Yen Kan, Dongwon Lee. Efficient Web-Based Linkage of Short to Long Forms.
     • [5] Alpa Jain, Silviu Cucerzan, Saliha Azzam. Acronym-Expansion Recognition and Ranking on the Web.
     • [6] Xiaonan Ji, Gu Xu, James Bailey and Hang Li. Mining, Ranking and Using Acronym Patterns.
  60. Thanks