Semantic search: from document retrieval to virtual assistants


Keynote at the 3rd Spanish Conference in Information Retrieval (CERI)



  1. 1. Semantic search: from document retrieval to virtual assistants. Presented by Peter Mika, Sr. Research Scientist, Yahoo Labs ⎪ June 19, 2014
  2. 2. Agenda 2  Intro  What is Semantic Search?  Applications to Web search › Enhanced results › Entity retrieval and recommendations  Beyond Web search
  3. 3. Yahoo Labs Barcelona  Established January, 2006 › Part of a global network of Labs in Sunnyvale, New York, Barcelona, Haifa, Bangalore, Beijing, Santiago  Led by Ricardo Baeza-Yates  Research areas › Distributed Systems › Semantic Search › Social Media › Web Mining › Web Retrieval
  4. 4. Semantic Search Research Jordi Atserias Sr. Research Engineer Roi Blanco Sr. Research Scientist Hugues Bouchard Sr. Research Engineer Peter Mika Sr. Research Scientist Manager Tim Potter Research Engineer Edgar Meij Research Scientist
  5. 5. What is Semantic Search? 5
  6. 6. Search is really fast, without necessarily being intelligent
  7. 7. Why Semantic Search?  Improvements in IR are harder and harder to come by › Basic relevance models are well established › Machine learning using hundreds of features › Heavy investment in computational power, e.g. real-time indexing and instant search  Remaining challenges are not computational, but in modeling user cognition › Could Watson explain why the answer is Toronto? › Need a deeper understanding of the query, the content and the relationship of the two
  8. 8.  Semantic gap › Ambiguity • jaguar • paris hilton › Secondary meaning • george bush (and I mean the beer brewer in Arizona) › Subjectivity • reliable digital camera • paris hilton sexy › Imprecise or overly precise searches • jim hendler  Complex needs › Missing information • brad pitt zombie • florida man with 115 guns • 35 year old computer scientist living in barcelona › Category queries • countries in africa • barcelona nightlife › Transactional or computational queries • 120 dollars in euros • digital camera under 300 dollars • world temperature in 2020 Poorly solved information needs remain Are there even true keyword queries? Users may have stopped asking them
  9. 9. Real problem
  10. 10. What is it like to be a machine? Roi Blanco
  11. 11. What is it like to be a machine? ↵⏏☐ģ ✜Θ♬♬ţğ√∞§®ÇĤĪ✜★♬☐✓✓ ţğ★✜ ✪✚✜ΔΤΟŨŸÏĞÊϖυτρ℠≠⅛⌫ ≠=⅚©§★✓♪ΒΓΕ℠ ✖Γ♫⅜±⏎↵⏏☐ģğğğμλκσςτ ⏎⌥°¶§ΥΦΦΦ✗✕☐
  12. 12.  Def. Semantic Search is any retrieval method where › User intent and resources are represented in a semantic model • A set of concepts or topics that generalize over tokens/phrases • Additional structure such as a hierarchy among concepts, relationships among concepts etc. › Semantic representations of the query and the user intent are exploited in some part of the retrieval process  As a research field › Workshops • ESAIR (2008-2014) at CIKM, Semantic Search (SemSearch) workshop series (2008-2011) at ESWC/WWW, EOS workshop (2010-2011) at SIGIR, JIWES workshop (2012) at SIGIR, Semantic Search Workshop (2011-2014) at VLDB › Special Issues of journals › Surveys • Christos L. Koumenides, Nigel R. Shadbolt: Ranking methods for entity-oriented semantic web search. JASIST 65(6): 1091-1106 (2014) 12 Semantic Search
  13. 13. Semantic models: implicit vs. explicit 13  Implicit/internal semantics › Models of text extracted from a corpus of queries, documents or interaction logs • Query reformulation, term dependency models, translation models, topic models, latent space models, learning to match (PLS) › See • Hang Li and Jun Xu: Semantic Matching in Search. Foundations and Trends in Information Retrieval Vol 7 Issue 5, 2013, pp 343-469  Explicit/external semantics › Explicit linguistic or ontological structures extracted from text and linked to external knowledge › Obtained using IE techniques or acquired from Semantic Web markup
  14. 14. Semantic Search – a process view. Query Construction (keywords, forms, NL, formal language) → Query Processing (IR-style matching & ranking, DB-style precise matching, KB-style matching & inferences) → Result Presentation (query visualization, document and data presentation, summarization) → Query Refinement (implicit feedback, explicit feedback, incentives). Underlying resources: document representation, knowledge representation, semantic models, documents
  15. 15. What is it like to be a machine? ↵⏏☐ģ ✜Θ♬♬ţğ√∞§®ÇĤĪ✜★♬☐✓✓ ţğ★✜ ✪✚✜ΔΤΟŨŸÏĞÊϖυτρ℠≠⅛⌫ ≠=⅚©§★✓♪ΒΓΕ℠ ✖Γ♫⅜±⏎↵⏏☐ģğğğμλκσςτ ⏎⌥°¶§ΥΦΦΦ✗✕☐
  16. 16. What is it like to be a machine? <roi>↵⏏☐ģ</roi> ✜Θ♬♬ţğ√∞§®ÇĤĪ✜★♬☐✓✓ ţğ★✜ ✪✚✜ΔΤΟŨŸÏĞÊϖυτρ℠≠⅛⌫ ≠=⅚©§★✓♪ΒΓΕ℠ ✖Γ♫⅜±<roi>⏎↵⏏☐ģ</roi>ğğğμλκσςτ ⏎⌥°¶§ΥΦΦΦ✗✕☐
  17. 17. Information Extraction 17  Documents › Natural language • Named Entity Recognition & Disambiguation (“entity linking”) • Deep parsing (dependency parsing) › Specific to the Web • Extraction from web tables, wrapper induction etc. • Open Information Extraction such as NELL, ReVerb etc.  Queries › Short text and no structure… nothing to do?
  18. 18. Information Extraction on queries 18  Entities play an important role › ~70% of queries contain a named entity (entity mention queries) and ~50% of queries have an entity focus (entity seeking queries) • brad pitt attacked by fans › ~10% of queries are looking for a class of entities • brad pitt movies › See • Jeffrey Pound, Peter Mika, Hugo Zaragoza: Ad-hoc object retrieval in the web of data. WWW 2010: 771-780 • Thomas Lin, Patrick Pantel, Michael Gamon, Anitha Kannan, Ariel Fuxman: Active objects: actions for entity-centric search. WWW 2012: 589-598
  19. 19. Information Extraction on queries 19  Common structure to entity mention queries: query = <entity> + <intent> › Intent is typically an additional word or phrase to • Disambiguate, e.g. brad pitt actor • Specify action or aspect e.g. brad pitt net worth, brad pitt download  Useful also in off-line query log analysis › Reduce the sparsity of query log data by mapping entities and intents to a reference base of entities and intents
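The query = &lt;entity&gt; + &lt;intent&gt; structure above can be sketched with a toy longest-prefix match against a gazetteer of known entity names. This is a hypothetical illustration: the entity list is invented, and Yahoo's actual reference base of entities and intents is far richer.

```python
# Hypothetical sketch of entity+intent segmentation against a small gazetteer
# of known entity names (longest-prefix match). The entity list is
# illustrative, not Yahoo's actual reference base.
ENTITY_GAZETTEER = {"brad pitt", "oakland as", "moneyball"}

def segment_query(query):
    """Return (entity, intent); entity is None if no known entity matches."""
    tokens = query.lower().split()
    # Try the longest prefix first so "brad pitt" wins over "brad".
    for end in range(len(tokens), 0, -1):
        candidate = " ".join(tokens[:end])
        if candidate in ENTITY_GAZETTEER:
            return candidate, " ".join(tokens[end:])
    return None, query.lower()

print(segment_query("brad pitt net worth"))  # ('brad pitt', 'net worth')
```

A real system would disambiguate mentions against a knowledge base rather than rely on exact string matching.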
  20. 20. Patterns in logs are hard to see  Sample of sessions from June, 2011 containing the term “moneyball” › What are users trying to do? Raw queries: oakland as bradd pitt movie moneyball oakland as captain america moneyball trailer money moneyball moneyball peter brand peter brand oakland moneyball the movie moneyball trailer moneyball trailer brad pitt brad pitt moneyball brad pitt moneyball movie brad pitt moneyball brad pitt moneyball oscar relay for life calvert ocunty trailer for moneyball moneyball map of africa money ball movie money ball movie trailer brad pitt new brad pitt news moneyball trailer moneyball trailer
  21. 21. Semantic annotations help to generalize… e.g. oakland as → Sports team, bradd pitt → Actor, moneyball → Movie
  22. 22. … and understand user needs  In “moneyball trailer”, the object of the query is a Movie (moneyball), and “trailer” expresses what the user wants to do with it
  23. 23. Information extraction on queries 23  Entity linking › Tutorial: Entity Linking and Retrieval by Edgar Meij, Krisztián Balog and Daan Odijk › Dataset for evaluation of entity linking (2013) • Yahoo WebScope dataset L24 - Yahoo Search Query Log To Entities, version 1.0  Semantic annotation for query log analysis › Frequent pattern mining on raw queries fails due to large amount of noise › Meaningful patterns start to emerge when mining the semantic annotations instead › Laura Hollink, Peter Mika, Roi Blanco: Web usage mining with semantic analysis. WWW 2013: 561-570
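The claim that meaningful patterns emerge only after annotation can be illustrated with a toy example: replacing entity mentions by their types collapses sparse raw queries into frequent templates. The type dictionary below is an invented stand-in for a knowledge base.

```python
# Illustrative sketch: replacing entity mentions with their types before
# counting patterns, so sparse raw queries collapse into frequent templates
# such as "<Movie> trailer". The type dictionary is a toy stand-in for a
# knowledge base.
from collections import Counter

ENTITY_TYPES = {"moneyball": "Movie", "brad pitt": "Actor",
                "oakland as": "SportsTeam"}

def annotate(query):
    q = query.lower()
    # Replace longer mentions first to avoid partial overlaps.
    for mention, etype in sorted(ENTITY_TYPES.items(), key=lambda kv: -len(kv[0])):
        q = q.replace(mention, "<" + etype + ">")
    return q

log = ["moneyball trailer", "brad pitt moneyball", "trailer for moneyball",
       "moneyball trailer", "oakland as"]
patterns = Counter(annotate(q) for q in log)
print(patterns.most_common(1))  # [('<Movie> trailer', 2)]
```

On real logs, frequent pattern mining then runs over these templates instead of raw strings.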
  24. 24. Semantic Web 24  Significant extension of the Web stack › Languages for publishing raw data and document annotations › Standards for querying, validating and reasoning with data distributed across the Web  Research community formed around 2001 › ISWC, ESWC, WWW Semantic Web Track, JWS  Conflicted history with Information Retrieval › Misplaced expectations as to what the Semantic Web will bring › Building the chicken farm before any chickens or eggs  Since 2007 more solid progress in adoption › Metadata in HTML › Public and private ‘Knowledge Graphs’
  25. 25. Metadata in HTML: schema.org 25  Agreement on a shared set of schemas for common types of web content › Bing, Google, and Yahoo! as initial founders (June, 2011), joined by Yandex later › Similar in intent to sitemaps.org • Use a single format to communicate the same information to all three search engines <div vocab="http://schema.org/" typeof="Movie"> <h1 property="name">Pirates of the Caribbean: On Stranger Tides (2011)</h1> <span property="description">Jack Sparrow and Barbossa embark on a quest to find the elusive fountain of youth, only to discover that Blackbeard and his daughter are after it too.</span> Director: <div property="director" typeof="Person"> <span property="name">Rob Marshall</span> </div> </div>
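A consumer of such RDFa-style markup can be sketched with the standard library alone; this is a minimal illustration that only collects `property` attribute text, not a conformant RDFa processor (a real system would use a proper parser such as rdflib's).

```python
# Minimal sketch of consuming RDFa-style property markup (as in the movie
# example on this slide) using only the standard library; a real system
# would use a conformant RDFa parser.
from html.parser import HTMLParser

class PropertyExtractor(HTMLParser):
    """Collects text content of elements carrying a `property` attribute."""
    def __init__(self):
        super().__init__()
        self.current = None
        self.properties = {}
    def handle_starttag(self, tag, attrs):
        prop = dict(attrs).get("property")
        if prop:
            self.current = prop
            self.properties.setdefault(prop, "")
    def handle_data(self, data):
        if self.current:
            self.properties[self.current] += data
    def handle_endtag(self, tag):
        self.current = None

snippet = ('<div typeof="Movie"><h1 property="name">Pirates of the Caribbean:'
           ' On Stranger Tides (2011)</h1></div>')
extractor = PropertyExtractor()
extractor.feed(snippet)
print(extractor.properties["name"])
```

The extracted key/value pairs are exactly what an enhanced result (title, description, director) would be built from.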
  26. 26. Substantial adoption of markup 26  Over 15% of all pages now have markup  Over 5 million sites, over 25 billion entity references  In other words: same order of magnitude as the web › Source: R.V. Guha: Light at the end of the tunnel, ISWC 2013 keynote  See also › P. Mika, T. Potter. Metadata Statistics for a Large Web Corpus, LDOW 2012 • Based on Bing US corpus • 31% of webpages, 5% of domains contain some metadata (including Facebook’s OGP) › WebDataCommons • Based on CommonCrawl Nov 2013 • 26% of webpages, 14% of domains contain some metadata (including Facebook’s OGP)
  27. 27. Knowledge Graphs 27  Linked (Open) Data › Public movement for making open/public databases • available in standard Semantic Web formats • interlinking them › DBpedia is a central hub in this network of datasets • Software framework to extract structured data from Wikipedia and consolidate it under a common ontology • The resulting dataset contains links to Freebase and others – Freebase links to IMDB and so on  Basis for private Knowledge Graphs › Bing, Google, Yahoo
  28. 28. Yahoo’s Knowledge Graph  An entity graph linking, e.g., Carlos Zambrano –plays for→ Chicago Cubs –plays in→ Chicago ←lives in– Barack Obama (with offers such as 10% off tickets attached); Brad Pitt –partner→ Angelina Jolie, –casts in→ Ocean’s Twelve ←directs– Steven Soderbergh; George Clooney –casts in→ Ocean’s Twelve and E/R; Fight Club –music by→ Dust Brothers  Nicolas Torzec: Making knowledge reusable at Yahoo!: a Look at the Yahoo! Knowledge Base (SemTech 2013)
  29. 29. Building Yahoo’s Knowledge Graph  Ontology building and maintenance › Editorially maintained OWL ontology with 300+ classes › Covering the domains of interest of Yahoo  Information extraction › Public datasets and proprietary data  Data fusion › Manual mapping from the source schemas to the ontology › Supervised entity reconciliation • Kedar Bellare, Carlo Curino, Ashwin Machanavajjhala, Peter Mika, Mandar Rahurkar, Aamod Sane: WOO: A Scalable and Multi-tenant Platform for Continuous Knowledge Base Synthesis. PVLDB 2013 • Michael J. Welch, Aamod Sane, Chris Drome: Fast and accurate incremental entity resolution relative to an entity knowledge base. CIKM 2012 › Editorial curation and quality assessment
  30. 30. Applications in Web Search 33
  31. 31. Semantic Search for… 34  Improving ad-hoc document retrieval › Query composition › Result presentation › Matching › Ranking  Providing new search functionality › Entity retrieval • Related entity recommendation › Personalization › Question-answering › Task completion
  32. 32. Exploiting Semantic Web markup (internal prototype, 2007) Personal and private homepage of the same person (clear from the snippet but it could be also automatically de-duplicated) Conferences he plans to attend and his vacations from homepage plus bio events from LinkedIn Geolocation
  33. 33. Search snippets using Semantic Web markup  Summarization of HTML is a hard task • Template detection • Selecting relevant snippets • Composing readable text › Efficiency constraints  Yahoo SearchMonkey (2008) › Enhanced results using structured data from the page • Key/value pairs • Deep links • Image or Video
  34. 34. Effectiveness of enhanced results  Explicit user feedback › Side-by-side editorial evaluation (A/B testing) • Editors are shown a traditional search result and enhanced result for the same page • Users prefer enhanced results in 84% of the cases and traditional results in 3% (N=384)  Implicit user feedback › Click-through rate analysis • Long dwell time limit of 100s (Ciemiewicz et al. 2010) • 15% increase in ‘good’ clicks › User interaction model • Enhanced results lead users to relevant documents – even though they are less likely to be clicked than textual results • Enhanced results effectively reduce bad clicks!  See › Kevin Haas, Peter Mika, Paul Tarjan, Roi Blanco: Enhanced results for web search. SIGIR 2011: 725-734
  35. 35. Enhanced results at other search providers  Google announces Rich Snippets - June, 2009 › Faceted search for recipes - Feb, 2011  Bing tiles – Feb, 2011  Facebook’s Like button and the Open Graph Protocol (2010) › Shows up in profiles and news feed › Site owners can later reach users who have liked an object
  36. 36. Moving beyond entity markup 39  We would like to help our users in task completion › But we have trained our users to talk in nouns • Retrieval performance decreases by adding verbs to queries › Markup for actions/intents could potentially help  Modeling actions › Understand what actions can be taken on a page › Help users in mapping their query to potential actions › Applications in web search, email etc.  schema.org v1.2 including the Actions vocabulary published April 16, 2014
  37. 37. Applications of Actions markup Email (Gmail) SERP (Yandex)
  38. 38.  Entity retrieval › Which entity does a keyword query refer to, if any?  Related entities for navigation › Which entity would the user visit next? Entity displays in web search
  39. 39. Entity Retrieval  Keyword search over entity graphs › See Pound et al. WWW 2010 for a definition › No common benchmark until 2010  SemSearch Challenge 2010/2011 • 50 entity-mention queries selected from the Search Query Tiny Sample v1.0 dataset (Yahoo! Webscope) • Billion Triples Challenge 2009 data set • Evaluation using Mechanical Turk › See report: • Roi Blanco, Harry Halpin, Daniel M. Herzig, Peter Mika, Jeffrey Pound, Henry S. Thompson, Thanh Tran: Repeatable and reliable semantic search evaluation. J. Web Sem. 21: 14-29 (2013)
  40. 40. Glimmer: open-source entity retrieval engine from Yahoo  Extension of MG4J from the University of Milan  Indexing of RDF data › MapReduce-based › Horizontal indexing (subject/predicate/object fields) › Vertical indexing (one field per predicate)  Retrieval › BM25F with machine-learned weights for properties and domains › 52% improvement over the best system in SemSearch 2010  See › Roi Blanco, Peter Mika, Sebastiano Vigna: Effective and Efficient Entity Search in RDF Data. International Semantic Web Conference (1) 2011: 83-97
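The field-weighted retrieval model above can be sketched as follows. Each predicate of a vertically indexed entity is treated as a field with its own weight; the weights, smoothing parameters, and toy data below are illustrative, not the machine-learned values from the paper.

```python
# Sketch of BM25F-style scoring over vertically indexed RDF data, where each
# predicate (name, abstract, ...) is a field with its own weight, in the
# spirit of the Glimmer approach. All weights/parameters are illustrative.
import math

def bm25f(query_terms, entity_fields, field_weights, avg_len, df, n_docs,
          k1=1.2, b=0.75):
    """Score one entity: entity_fields maps field name -> list of tokens."""
    score = 0.0
    for term in query_terms:
        # Weighted, length-normalized term frequency summed over fields.
        tf = 0.0
        for field, tokens in entity_fields.items():
            norm = 1.0 + b * (len(tokens) / avg_len[field] - 1.0)
            tf += field_weights.get(field, 1.0) * tokens.count(term) / norm
        if tf > 0.0:
            n_t = df.get(term, 0)
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5))
            score += idf * tf / (k1 + tf)
    return score

entity = {"name": ["barack", "obama"], "abstract": ["the", "president"]}
print(bm25f(["obama"], entity, {"name": 3.0}, {"name": 2.0, "abstract": 5.0},
            {"obama": 10}, 1000))
```

Note the single saturation (k1) applied after field weighting, which is what distinguishes BM25F from summing independent per-field BM25 scores.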
  41. 41. Other evaluations in Entity Retrieval  TREC Entity Track › 2009-2011 › Data • ClueWeb 09 collection › Queries • Related Entity Finding – Entities related to a given entity through a particular relationship – (Homepages of) airlines that fly Boeing 747 • Entity List Completion – Given some elements of a list of entities, complete the list  Professional sports teams in Philadelphia such as the Philadelphia Wings, … › Relevance assessments provided by TREC assessors  Question Answering over Linked Data › 2011-2014 › Data • DBpedia and MusicBrainz in RDF › Queries • Full natural language questions of different forms, written by the organizers • Multi-lingual • Give me all actors starring in Batman Begins › Results are defined by an equivalent SPARQL query • Systems are free to return list of results or a SPARQL query 45
  42. 42. Related entity recommendations
  43. 43. Example user sessions
  44. 44. Spark(le) system for related entity recommendations 1. Knowledge Graph › Filtering and enrichment 2. Feature extraction › Query logs, Flickr, Twitter 3. MLR 4. Online/offline evaluation › Point-wise assessments › Side-by-side testing › Online evaluation 5. Runtime › Unary • Popularity features from text: probability, entropy, Wiki entity popularity … • Graph features: PageRank on the entity graph, Wikipedia, Web graph • Type features: entity type › Binary • Co-occurrence features from text: conditional probability, joint probability … • Graph features: common neighbors … • Type features: relation type 48 Roi Blanco, B. Barla Cambazoglu, Peter Mika, Nicolas Torzec: Entity Recommendations in Web Search. ISWC 2013
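The binary co-occurrence features listed for Spark can be illustrated with a toy computation over counts from some unit of text (query session, tweet, photo tag set). The counts below are invented numbers, not real log statistics.

```python
# Illustrative computation of binary co-occurrence features for a pair of
# entities: joint and conditional probabilities of appearing in the same
# unit of text (query session, tweet, photo tag set). Counts are toy numbers.
def cooccurrence_features(count_ab, count_a, count_b, total_units):
    return {
        "joint": count_ab / total_units,                           # P(a, b)
        "cond_b_given_a": count_ab / count_a if count_a else 0.0,  # P(b | a)
        "cond_a_given_b": count_ab / count_b if count_b else 0.0,  # P(a | b)
    }

# e.g. two entities co-occurring in 30 of 1000 sessions
print(cooccurrence_features(30, 120, 80, 1000))
```

In the full system, features like these are computed from several sources (query logs, Flickr, Twitter) and combined by the learned ranker.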
  45. 45. Beyond Web Search 49
  46. 46. Mobile search on the rise  Information access on-the-go requires hands-free operation › Driving, walking, gym, etc. • Americans spend 540 hours a year in their cars [1] vs. 348 hours browsing the Web [2]  ~50% of queries are coming from mobile devices (and growing) › Changing habits, e.g. iPad usage peaks before bedtime › Limitations in input/output
  47. 47. Mobile search: challenges and opportunities 51  Interaction › Question-answering › Support for interactive retrieval › Spoken-language access › Task completion  Contextualization › Personalization › Geo › Context (work/home/travel)
  48. 48. Interactive, conversational voice search  Parlance EU project › Complex dialogs within a domain • Requires complete semantic understanding  Complete system (mixed license) › Automated Speech Recognition (ASR) › Spoken Language Understanding (SLU) › Interaction Management › Knowledge Base › Natural Language Generation (NLG) › Text-to-Speech (TTS)  Video
  49. 49. Example dialogue
  50. 50. Components of a Spoken Dialog System (SDS)  Pipeline: Recognizer (ASR) → Semantic Decoder → Dialog Control → Message Generator → Synthesizer (TTS), mapping user waveforms to words to dialog acts, e.g. “I want to find a restaurant” → inform(task=find, entity=restaurant); request(food) → “What kind of food would you like?” • Currently limited domain • Hand-crafted using rule-based parsers, template generators and flowchart-based dialog control • Expensive to build and fragile in operation
  51. 51. A Statistical Spoken Dialogue System  A Partially Observable Markov Decision Process (POMDP): ASR evidence (e.g. “I’d like Italian” {0.4}, “I want an Italian” {0.2}, “I’d like Indian” {0.2}, “In the east” {0.1}) feeds a semantic decoder, yielding a distribution over dialog acts (inform(food=italian){0.6}, inform(food=indian){0.2}, inform(area=east){0.1}, null(){0.1}); belief propagation over a Bayesian belief network (built from the ontology) maintains the belief state; a stochastic policy, trained by reinforcement learning against a reward function (success/fail), with supervised learning for the decoder and response generator, selects the next action, e.g. confirm(food=italian), request(area) → “You are looking for an Italian restaurant? Whereabouts?”
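The key move in such a statistical system is updating a belief over the user's goal from the full N-best SLU evidence rather than committing to the top ASR hypothesis. The sketch below is a simplified, hypothetical update rule; the stickiness parameter and probabilities are illustrative, not from any deployed system.

```python
# Sketch of a belief update over one user-goal slot (food type) in a
# statistical SDS: the belief is re-estimated from N-best SLU evidence
# instead of trusting the single top hypothesis. All numbers illustrative.
def update_belief(belief, evidence, stickiness=0.9, floor=0.05):
    """belief and evidence map goal values to probabilities."""
    new = {}
    for value, prior in belief.items():
        likelihood = evidence.get(value, floor)  # unobserved values get a floor
        # Goals are assumed mostly stable across turns (stickiness).
        new[value] = likelihood * (stickiness * prior + (1 - stickiness))
    z = sum(new.values()) or 1.0
    return {v: p / z for v, p in new.items()}

belief = {"italian": 1 / 3, "indian": 1 / 3, "chinese": 1 / 3}
evidence = {"italian": 0.6, "indian": 0.2}  # from the semantic decoder
belief = update_belief(belief, evidence)
print(max(belief, key=belief.get))  # italian
```

A full POMDP system performs this propagation over a Bayesian network compiled from the ontology, jointly over all slots.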
  52. 52. Semantic Decoding  “I’m looking for a place to eat – perhaps french.” Extract features, e.g. frequent N-grams (“I’m looking”, “looking for”, “a place”, “place to eat”, “french”), and feed them to a bank of binary classifiers (u-act=request, u-act=inform, entity=restaurant, entity=bar, entity=hotel, food=french, food=chinese, etc.), producing a distribution over user acts: inform(entity=restaurant, food=french) {0.5}, inform(entity=bar, food=french) {0.3}, …, inform(entity=restaurant, food=chinese) {0.1}
  53. 53. Belief State  For each slot (entity, food, …) the network tracks the user goal g, the user act u (modeling user behaviour) and the observation o (modeling recognition/understanding errors) at time t, unrolled to the next time slice t+1  The Bayesian network is compiled from the ontology, e.g. task -> find(entity, method, …); entity -> restaurant(food, …) | bar(food, …); food = French, Italian, Indian, …
  54. 54. Choosing the next action – the Policy  Features are extracted from the full belief state into a summary belief space b  The next action is chosen as argmax_a { Q(b, a) : a ∈ A }, with the Q-function Q(b, a) = E[ Σ_{t'=t+1}^{T} r_{t'} | b, a ] approximated by a Gaussian Process  Example actions: inform(entity=bar) {0.4}, select(entity=bar, entity=restaurant)
  55. 55. Large Scale Evaluation – Task Success Rates
      Condition         | Word Err Rate | Conventional Success Rate | POMDP System Success Rate
      Telephone         | 21%           | 84.6%                     | 86.9%
      Telephone + noise | 30%           | 75.2%                     | 81.2%
      In Car            | 29%           | 67.8%                     | 75.8%
      Success = finding the required information for a restaurant which matches the supplied criteria. Note that users’ perceived success rate was ~10% higher!
  56. 56. Scaling up to the Web  We can build a fully statistical spoken dialogue system for a specific narrow domain – but how do we scale up to much broader domains?  CamInfo Restaurant System (working system, real users): crowd-sourced annotators supply data for input–output mapping; a user simulator for policy optimisation; corpus data for model parameter estimation; hand-crafted domain ontology, input, output, and model parameters  Personal Assistant (real users): corpus data for model parameter estimation; unsupervised learning; fast on-line reinforcement learning; wide-coverage ontology
  57. 57. Conclusions 61  Semantic Search › Explicit understanding for queries and documents through links to external knowledge • Using methods of Information Extraction or explicit annotations (markup) in webpages • Semantic Web as a source of external knowledge  Increasing level of understanding › Early focus on entities and their attributes • Applications in web search: rich results, entity displays, entity recommendation › Moving toward modeling intents/actions › Adding human-like interaction
  58. 58. Q&A  Many thanks to members of the Semantic Search team at Yahoo Labs Barcelona and to Yahoos around the world › Slides on POMDP-based dialogue systems courtesy of Prof. Steve Young, University of Cambridge  Contact › @pmika › Ask about our internships and other opportunities