Semantic search: from document retrieval to virtual assistants
1. Semantic search: from document retrieval to virtual assistants
Presented by Peter Mika, Sr. Research Scientist, Yahoo Labs | June 19, 2014
2. Agenda
Invite
What is Semantic Search?
Applications to Web search
› Enhanced results
› Entity retrieval and recommendations
Beyond Web search
3. Yahoo Labs Barcelona
Established January, 2006
› Part of a global network of Labs in
Sunnyvale, New York, Barcelona, Haifa,
Bangalore, Beijing, Santiago
Led by Ricardo Baeza-Yates
Research areas
› Distributed Systems
› Semantic Search
› Social Media
› Web Mining
› Web Retrieval
4. Semantic Search Research
Jordi Atserias
Sr. Research Engineer
Roi Blanco
Sr. Research Scientist
Hugues Bouchard
Sr. Research Engineer
Peter Mika
Sr. Research Scientist
Manager
Tim Potter
Research Engineer
Edgar Meij
Research Scientist
7. Why Semantic Search?
Improvements in IR are harder and harder to come by
› Basic relevance models are well established
› Machine learning using hundreds of features
› Heavy investment in computational power, e.g. real-time indexing and instant search
Remaining challenges are not computational, but in modeling user
cognition
› Could Watson explain why the answer is Toronto?
› Need a deeper understanding of the query, the content and the relationship of the two
8. Semantic gap
› Ambiguity
• jaguar
• paris hilton
› Secondary meaning
• george bush (and I mean the beer brewer
in Arizona)
› Subjectivity
• reliable digital camera
• paris hilton sexy
› Imprecise or overly precise searches
• jim hendler
Complex needs
› Missing information
• brad pitt zombie
• florida man with 115 guns
• 35 year old computer scientist living in
barcelona
› Category queries
• countries in africa
• barcelona nightlife
› Transactional or computational queries
• 120 dollars in euros
• digital camera under 300 dollars
• world temperature in 2020
Poorly solved information needs remain
Are there even true keyword queries? Users may have stopped asking them.
12. Def. Semantic Search is any
retrieval method where
› User intent and resources are
represented in a semantic model
• A set of concepts or topics that generalize
over tokens/phrases
• Additional structure such as a hierarchy
among concepts, relationships among
concepts etc.
› Semantic representations of the query
and the user intent are exploited in
some part of the retrieval process
As a research field
› Workshops
• ESAIR (2008-2014) at CIKM, Semantic
Search (SemSearch) workshop series
(2008-2011) at ESWC/WWW, EOS
workshop (2010-2011) at SIGIR, JIWES
workshop (2012) at SIGIR, Semantic
Search Workshop (2011-2014) at VLDB
› Special Issues of journals
› Surveys
• Christos L. Koumenides, Nigel R.
Shadbolt: Ranking methods for entity-
oriented semantic web search.
JASIST 65(6): 1091-1106 (2014)
Semantic Search
13. Semantic models: implicit vs. explicit
Implicit/internal semantics
› Models of text extracted from a corpus of queries, documents or interaction logs
• Query reformulation, term dependency models, translation models, topic models, latent space
models, learning to match (PLS)
› See
• Hang Li and Jun Xu: Semantic Matching in Search. Foundations and Trends in Information
Retrieval Vol 7 Issue 5, 2013, pp 343-469
Explicit/external semantics
› Explicit linguistic or ontological structures extracted from text and linked to external
knowledge
› Obtained using IE techniques or acquired from Semantic Web markup
14. Semantic Search – a process view
Query Construction
•Keywords
•Forms
•NL
•Formal language
Query Processing
•IR-style matching & ranking
•DB-style precise matching
•KB-style matching & inferences
Result
Presentation
•Query visualization
•Document and data presentation
•Summarization
Query
Refinement
•Implicit feedback
•Explicit feedback
•Incentives
Document Representation
Knowledge Representation
Semantic Models
Resources
Documents
17. Information Extraction
Documents
› Natural language
• Named Entity Recognition & Disambiguation (“entity linking”)
• Deep parsing (dependency parsing)
› Specific to the Web
• Extraction from web tables, wrapper induction etc.
• Open Information Extraction such as NELL, ReVerb etc.
Queries
› Short text and no structure… nothing to do?
18. Information Extraction on queries
Entities play an important role
› ~70% of queries contain a named entity (entity mention queries) and
~50% of queries have an entity focus (entity seeking queries)
• brad pitt attacked by fans
› ~10% of queries are looking for a class of entities
• brad pitt movies
› See
• Jeffrey Pound, Peter Mika, Hugo Zaragoza: Ad-hoc object retrieval in the web of data. WWW
2010: 771-780
• Thomas Lin, Patrick Pantel, Michael Gamon, Anitha Kannan, Ariel Fuxman: Active objects:
actions for entity-centric search. WWW 2012: 589-598
19. Information Extraction on queries
Common structure to entity mention queries:
query = <entity> + <intent>
› Intent is typically an additional word or phrase to
• Disambiguate, e.g. brad pitt actor
• Specify action or aspect e.g. brad pitt net worth, brad pitt download
Useful also in off-line query log analysis
› Reduce the sparsity of query log data by mapping entities and intents to a
reference base of entities and intents
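The entity + intent decomposition can be sketched in a few lines of Python. This is only an illustration: the `ENTITIES` dictionary and the `split_query` helper are hypothetical stand-ins for a real entity linker, and longest-match lookup is an assumption made for brevity, not the method described in the talk.

```python
# Toy reference base of entities (hypothetical; a real system would use an entity linker).
ENTITIES = {
    "brad pitt": "Actor",
    "moneyball": "Movie",
    "oakland as": "Sports team",
}

def split_query(query):
    """Split a query into (entity, entity_type, intent) using longest-match lookup."""
    tokens = query.lower().split()
    # Try every contiguous token span, longest first, so "brad pitt" beats shorter spans.
    for length in range(len(tokens), 0, -1):
        for start in range(len(tokens) - length + 1):
            span = " ".join(tokens[start:start + length])
            if span in ENTITIES:
                # Whatever is left after removing the entity mention is treated as the intent.
                rest = tokens[:start] + tokens[start + length:]
                return span, ENTITIES[span], " ".join(rest)
    # No known entity: the whole query is returned as the intent part.
    return None, None, " ".join(tokens)

print(split_query("brad pitt net worth"))
print(split_query("moneyball trailer"))
```

Mapping both parts to a shared reference base is what lets "brad pitt net worth" and "angelina jolie net worth" count as the same intent in log analysis.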
20. oakland as bradd pitt movie moneyball movies.yahoo.com oakland as wikipedia.org
captain america movies.yahoo.com moneyball trailer movies.yahoo.com
money moneyball movies.yahoo.com
moneyball movies.yahoo.com movies.yahoo.com en.wikipedia.org movies.yahoo.com peter brand
peter brand oakland nymag.com moneyball the movie www.imdb.com
moneyball trailer movies.yahoo.com moneyball trailer
brad pitt brad pitt moneyball brad pitt moneyball movie brad pitt moneyball brad pitt moneyball oscar
www.imdb.com
relay for life calvert ocunty www.relayforlife.org trailer for moneyball movies.yahoo.com
moneyball.movie-trailer.com
moneyball en.wikipedia.org movies.yahoo.com map of africa www.africaguide.com
money ball movie www.imdb.com money ball movie trailer moneyball.movie-trailer.com
brad pitt new www.zimbio.com www.usaweekend.com www.ivillage.com www.ivillage.com brad pitt
news news.search.yahoo.com moneyball trailer moneyball trailer www.imdb.com www.imdb.com
Patterns in logs are hard to see
Sample of sessions from June, 2011 containing the term “moneyball”
› What are users trying to do?
21. oakland as bradd pitt movie moneyball trailer movies.yahoo.com oakland as wikipedia.org
Semantic annotations help to generalize…
Sports team
Movie
Actor
22. … and understand user needs
moneyball trailer
what the user wants to do with it
Movie
Object of the query
23. Information extraction on queries
Entity linking
› Tutorial: Entity Linking and Retrieval by Edgar Meij, Krisztián Balog and Daan Odijk
› Dataset for evaluation of entity linking (2013)
• Yahoo WebScope dataset L24 - Yahoo Search Query Log To Entities, version 1.0
Semantic annotation for query log analysis
› Frequent pattern mining on raw queries fails due to large amount of noise
› Meaningful patterns start to emerge when mining the semantic annotations instead
› Laura Hollink, Peter Mika, Roi Blanco: Web usage mining with semantic analysis. WWW
2013: 561-570
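A small Python sketch of why patterns emerge only after annotation. The `TYPES` table and the sample queries are illustrative assumptions (the queries are loosely modeled on the log sample shown earlier); real annotations would come from entity linking, and real pattern mining would use a proper frequent-pattern algorithm rather than a plain counter.

```python
from collections import Counter

# Hypothetical entity-to-type annotations; in practice these come from entity linking.
TYPES = {"moneyball": "<Movie>", "captain america": "<Movie>", "brad pitt": "<Actor>"}

def annotate(query):
    """Replace the first known entity mention in the query with its type label."""
    q = query.lower()
    for mention, label in TYPES.items():
        if mention in q:
            return q.replace(mention, label, 1)
    return q

queries = [
    "moneyball trailer",
    "captain america trailer",
    "brad pitt news",
    "moneyball the movie",
]

raw_counts = Counter(queries)                            # every raw query is unique
pattern_counts = Counter(annotate(q) for q in queries)   # annotated queries share patterns

print(raw_counts.most_common(1))
print(pattern_counts.most_common(1))
```

On the raw strings no query repeats, while the annotated form groups the two trailer queries under a single "<Movie> trailer" pattern.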
24. Semantic Web
Significant extension of the Web stack
› Languages for publishing raw data and document annotations
› Standards for querying, validating and reasoning with data
distributed across the Web
Research community formed around 2001
› ISWC, ESWC, WWW Semantic Web Track, JWS
Conflicted history with Information Retrieval
› Misplaced expectations as to what the Semantic Web will bring
› Building the chicken farm before any chickens or eggs
Since 2007 more solid progress in adoption
› Metadata in HTML
› Public and private ‘Knowledge Graphs’
25. Metadata in HTML: schema.org
Agreement on a shared set of schemas for common types of web
content
› Bing, Google, and Yahoo! as initial founders (June, 2011), joined by Yandex later
› Similar in intent to sitemaps.org
• Use a single format to communicate the same information to all three search engines
<div vocab="http://schema.org/" typeof="Movie">
<h1 property="name">Pirates of the Caribbean: On Stranger Tides (2011)</h1>
<span property="description">Jack Sparrow and Barbossa embark on a quest to
find the elusive fountain of youth, only to discover that Blackbeard and
his daughter are after it too.</span>
Director: <div property="director" typeof="Person">
<span property="name">Rob Marshall</span>
</div>
</div>
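To show how a consumer might read such markup, here is a minimal Python sketch using only the standard library's `html.parser`. The `PropertyExtractor` class is a hypothetical helper written for this example; it ignores nesting (so the director's name is not attached to the `director` property) and is in no way a complete RDFa processor.

```python
from html.parser import HTMLParser

class PropertyExtractor(HTMLParser):
    """Collect typeof values and (property, text) pairs; nesting is ignored for brevity."""

    def __init__(self):
        super().__init__()
        self.types = []        # values of typeof attributes, in document order
        self.pairs = []        # (property, text) pairs
        self._pending = None   # property still waiting for its text content

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "typeof" in attrs:
            self.types.append(attrs["typeof"])
        if "property" in attrs:
            self._pending = attrs["property"]

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace and line breaks
        if text and self._pending is not None:
            self.pairs.append((self._pending, text))
            self._pending = None

markup = """
<div vocab="http://schema.org/" typeof="Movie">
  <h1 property="name">Pirates of the Caribbean: On Stranger Tides (2011)</h1>
  Director: <div property="director" typeof="Person">
    <span property="name">Rob Marshall</span>
  </div>
</div>
"""

parser = PropertyExtractor()
parser.feed(markup)
print(parser.types)
print(parser.pairs)
```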
26. Substantial adoption of schema.org markup
Over 15% of all pages now have schema.org markup
Over 5 million sites, over 25 billion entity references
In other words: same order of magnitude as the web
› Source: R.V. Guha: Light at the end of the tunnel, ISWC 2013 keynote
See also
› P. Mika, T. Potter. Metadata Statistics for a Large Web Corpus, LDOW 2012
• Based on Bing US corpus
• 31% of webpages, 5% of domains contain some metadata (including Facebook’s OGP)
› WebDataCommons
• Based on CommonCrawl Nov 2013
• 26% of webpages, 14% of domains contain some metadata (including Facebook’s OGP)
27. Knowledge Graphs
Linked (Open) Data (linkeddata.org)
› Public movement for making open/public databases
• available in standard Semantic Web formats
• interlinking them
› DBpedia is a central hub in this network of datasets
• Software framework to extract structured data from Wikipedia
and consolidate it under a common ontology
• The resulting dataset contains links to Freebase and
others
– Freebase links to IMDb and so on
Basis for private Knowledge Graphs
› Bing, Google, Yahoo
28. Yahoo’s Knowledge Graph
[Figure: fragment of the knowledge graph. Nodes: Chicago Cubs, Chicago, Barack Obama, Carlos Zambrano, "10% off tickets", Brad Pitt, Angelina Jolie, Steven Soderbergh, George Clooney, Ocean’s Twelve, E/R, Fight Club, Dust Brothers. Edge labels: for, plays for, plays in, lives in, partner, directs, casts in, takes place in, music by.]
Nicolas Torzec: Making knowledge reusable at Yahoo!:
a Look at the Yahoo! Knowledge Base (SemTech 2013)
29. Building Yahoo’s Knowledge Graph
Ontology building and maintenance
› Editorially maintained OWL ontology with 300+ classes
› Covering the domains of interest of Yahoo
Information extraction
› Public datasets and proprietary data
Data fusion
› Manual mapping from the source schemas to the ontology
› Supervised entity reconciliation
• Kedar Bellare, Carlo Curino, Ashwin Machanavajjhala, Peter Mika, Mandar Rahurkar, Aamod Sane:
WOO: A Scalable and Multi-tenant Platform for Continuous Knowledge Base Synthesis. PVLDB 2013
• Michael J. Welch, Aamod Sane, Chris Drome: Fast and accurate incremental entity resolution relative to
an entity knowledge base. CIKM 2012
› Editorial curation and quality assessment
35. Exploiting Semantic Web markup
(internal prototype, 2007)
Personal and private homepage of the same person (clear from the snippet but it could be also automatically de-duplicated)
Conferences he plans to attend and his vacations from homepage plus bio events from LinkedIn
Geolocation
36. Search snippets using Semantic Web markup
Summarization of HTML is a hard task
• Template detection
• Selecting relevant snippets
• Composing readable text
› Efficiency constraints
Yahoo SearchMonkey (2008)
› Enhanced results using structured data from the page
• Key/value pairs
• Deep links
• Image or Video
37. Effectiveness of enhanced results
Explicit user feedback
› Side-by-side editorial evaluation (A/B testing)
• Editors are shown a traditional search result and enhanced result for the same page
• Users prefer enhanced results in 84% of the cases and traditional results in 3% (N=384)
Implicit user feedback
› Click-through rate analysis
• Long dwell time limit of 100s (Ciemiewicz et al. 2010)
• 15% increase in ‘good’ clicks
› User interaction model
• Enhanced results lead users to relevant documents
– even though less likely to be clicked than textual results
• Enhanced results effectively reduce bad clicks!
See
› Kevin Haas, Peter Mika, Paul Tarjan, Roi Blanco: Enhanced results for web search. SIGIR 2011:
725-734
38. Enhanced results at other search providers
Google announces Rich Snippets - June, 2009
› Faceted search for recipes - Feb, 2011
Bing tiles – Feb, 2011
Facebook’s Like button and the Open Graph Protocol (2010)
› Shows up in profiles and news feed
› Site owners can later reach users who have liked an object
39. Moving beyond entity markup
We would like to help our users in task completion
› But we have trained our users to talk in nouns
• Retrieval performance decreases by adding verbs to queries
› Markup for actions/intents could potentially help
Modeling actions
› Understand what actions can be taken on a page
› Help users in mapping their query to potential actions
› Applications in web search, email etc.
Schema.org v1.2, including the Actions vocabulary, published April 16, 2014
41. Entity retrieval
› Which entity does a keyword query
refer to, if any?
Related entities for navigation
› Which entity would the user visit next?
Entity displays in web search
42. Entity Retrieval
Keyword search over entity graphs
› See Pound et al., WWW 2010, for a definition
› No common benchmark until 2010
SemSearch Challenge 2010/2011
• 50 entity-mention queries selected from the Search Query Tiny Sample v1.0 dataset (Yahoo! Webscope)
• Billion Triples Challenge 2009 data set
• Evaluation using Mechanical Turk
› See report:
• Roi Blanco, Harry Halpin, Daniel M. Herzig, Peter Mika, Jeffrey Pound, Henry S. Thompson,
Thanh Tran: Repeatable and reliable semantic search evaluation. J. Web Sem. 21: 14-29 (2013)
43. Glimmer: open-source entity retrieval engine from Yahoo
Extension of MG4J from University of Milano
Indexing of RDF data
› MapReduce-based
› Horizontal indexing (subject/predicate/object fields)
› Vertical indexing (one field per predicate)
Retrieval
› BM25F with machine-learned weights for properties and domains
› 52% improvement over the best system in SemSearch 2010
See
› Roi Blanco, Peter Mika, Sebastiano Vigna: Effective and Efficient Entity Search in RDF Data.
International Semantic Web Conference (1) 2011: 83-97
› https://github.com/yahoo/Glimmer/
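The BM25F ranking function mentioned above can be sketched as follows. This is a textbook-style simplification, not Glimmer's implementation: it uses a single `b` for all fields, the field weights are hand-set placeholders rather than machine-learned values, and the field names (`label`, `comment`) and toy entities are assumptions standing in for a vertical index with one field per predicate.

```python
import math

def bm25f_score(query_terms, doc_fields, field_weights, avg_field_len,
                doc_freq, num_docs, k1=1.2, b=0.75):
    """Score one entity: per-field term counts are length-normalized,
    weighted by field, summed into a pseudo-frequency, then saturated."""
    score = 0.0
    for term in query_terms:
        pseudo_tf = 0.0
        for field, tokens in doc_fields.items():
            tf = tokens.count(term)
            if tf == 0:
                continue
            # Length normalization relative to the average length of this field.
            norm = 1.0 - b + b * len(tokens) / avg_field_len[field]
            pseudo_tf += field_weights[field] * tf / norm
        if pseudo_tf == 0.0:
            continue
        df = doc_freq.get(term, 0)
        idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
        score += idf * pseudo_tf / (k1 + pseudo_tf)
    return score

# Two toy entities, each indexed with one field per predicate (hypothetical data).
entity_a = {"label": ["moneyball"], "comment": ["a", "film"]}
entity_b = {"label": ["film"], "comment": ["about", "moneyball"]}
weights = {"label": 3.0, "comment": 1.0}   # placeholder weights, not learned
avg_len = {"label": 1.0, "comment": 2.0}
doc_freq = {"moneyball": 2}

score_a = bm25f_score(["moneyball"], entity_a, weights, avg_len, doc_freq, num_docs=10)
score_b = bm25f_score(["moneyball"], entity_b, weights, avg_len, doc_freq, num_docs=10)
print(score_a, score_b)
```

With these placeholder weights, a match in the label field outranks the same match in the comment field, which is the effect that per-property weights are meant to capture.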
44. Other evaluations in Entity Retrieval
TREC Entity Track
› 2009-2011
› Data
• ClueWeb 09 collection
› Queries
• Related Entity Finding
– Entities related to a given entity through a
particular relationship
– (Homepages of) airlines that fly Boeing 747
• Entity List Completion
– Given some elements of a list of entities,
complete the list
– Professional sports teams in Philadelphia such as the Philadelphia Wings, …
› Relevance assessments provided by
TREC assessors
Question Answering over Linked Data
› 2011-2014
› Data
• DBpedia and MusicBrainz in RDF
› Queries
• Full natural language questions of different
forms, written by the organizers
• Multi-lingual
• Give me all actors starring in Batman
Begins
› Results are defined by an equivalent
SPARQL query
• Systems are free to return list of results or
a SPARQL query
47. Spark(le) system for related entity recommendations
1. Knowledge Graph
› Filtering and enrichment
2. Feature extraction
› Query logs, Flickr, Twitter
3. MLR
4. Online/offline evaluation
› Point-wise assessments
› Side-by-side testing
› Online evaluation
5. Runtime
› Unary
• Popularity features from text: probability,
entropy, Wiki entity popularity …
• Graph features: PageRank on the entity
graph, Wikipedia, Web graph
• Type features: entity type
› Binary
• Co-occurrence features from text:
conditional probability, joint probability …
• Graph features: common neighbors …
• Type features: relation type
Roi Blanco, B. Barla Cambazoglu, Peter Mika, Nicolas Torzec: Entity Recommendations in Web Search. ISWC 2013
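A minimal Python sketch of the text-based unary and binary features listed above (probability, joint probability, conditional probability). The `sessions` data is invented for illustration; the actual system derives such counts from query logs, Flickr and Twitter, and feeds many more features into the learned ranker.

```python
from collections import Counter
from itertools import combinations

# Hypothetical "sessions": sets of entities that co-occur in one session or document.
sessions = [
    {"Brad Pitt", "Angelina Jolie"},
    {"Brad Pitt", "Angelina Jolie", "George Clooney"},
    {"Brad Pitt", "Fight Club"},
    {"George Clooney"},
]

n = len(sessions)
entity_counts = Counter(e for s in sessions for e in s)
pair_counts = Counter(
    frozenset(pair) for s in sessions for pair in combinations(sorted(s), 2)
)

def probability(e):
    """Unary feature: P(e)."""
    return entity_counts[e] / n

def joint_probability(a, b):
    """Binary feature: P(a, b)."""
    return pair_counts[frozenset((a, b))] / n

def conditional_probability(b, a):
    """Binary feature: P(b | a)."""
    if entity_counts[a] == 0:
        return 0.0
    return pair_counts[frozenset((a, b))] / entity_counts[a]

print(probability("Brad Pitt"))
print(joint_probability("Brad Pitt", "Angelina Jolie"))
print(conditional_probability("Angelina Jolie", "Brad Pitt"))
```

Note that the conditional probability is asymmetric, which is why it is useful for ranking candidates relative to the entity the user is currently viewing.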
49. Mobile search on the rise
Information access on-the-go requires hands-free operation
› Driving, walking, gym, etc.
• Americans spend 540 hours a year in their cars [1] vs. 348 hours browsing the Web [2]
~50% of queries are coming from mobile devices (and growing)
› Changing habits, e.g. iPad usage peaks before bedtime
› Limitations in input/output
[1] http://answers.google.com/answers/threadview?id=392456
[2] http://articles.latimes.com/2012/jun/22/business/la-fi-tn-top-us-brands-news-web-sites-20120622
50. Mobile search: challenges and opportunities
Interaction
› Question-answering
› Support for interactive retrieval
› Spoken-language access
› Task completion
Contextualization
› Personalization
› Geo
› Context (work/home/travel)
• Try getaviate.com
51. Interactive, conversational voice search
Parlance EU project
› Complex dialogs within a domain
• Requires complete semantic understanding
Complete system (mixed license)
› Automated Speech Recognition (ASR)
› Spoken Language Understanding (SLU)
› Interaction Management
› Knowledge Base
› Natural Language Generation (NLG)
› Text-to-Speech (TTS)
Video
53. Components of a Spoken Dialog System (SDS)
Recognizer
(ASR)
Semantic
Decoder
Dialog
Control
Synthesizer
(TTS)
Message
Generator
User
Waveforms Words
Dialog
Acts
I want to find a restaurant?
inform(task=find, entity=restaurant)
request(food)
What kind of food would you like?
The Web
• Currently limited domain
• Hand-crafted using rule-based parsers, template
generators and flowchart-based dialog control
• Expensive to build and fragile in operation
54. A Statistical Spoken Dialogue System
Bayesian
Belief
Network
Semantic
Decoder
Stochastic
Policy
Response
Generator
Ontology
inform(food=italian){0.6}
inform(food=indian) {0.2}
inform(area=east){0.1}
null(){0.1}
confirm(food=italian)
request(area)
Action
Reward Function
Rewards: success/fail
Reinforcement
Learning
Supervised Learning
Partially Observable Markov Decision Process (POMDP)
ASR
Evidence
Belief
State
Belief
Propagation
I want an Italian
You are looking for an Italian restaurant? Whereabouts?
Id like italian {0.4}
I want an Italian {0.2}
Id like Indian{0.2}
In the east{0.1}
TTS
Ita Ind -
Food
N E S W
Area
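To give a feel for how a belief state over a slot such as Food is revised from a scored list of dialog acts, here is a deliberately simplified Python sketch. It is not the Bayesian network update from the slide: it assumes the user's goal does not change and that an `inform` act reveals the goal directly, so the new belief is the informed mass plus the prior scaled by the remaining (unexplained) mass. The `update_belief` helper and the uniform prior are assumptions for illustration; the act probabilities are the ones shown on the slide.

```python
def update_belief(belief, slu_hypotheses, slot):
    """One simplified belief update for a single slot.

    belief: dict mapping slot value -> probability (sums to 1)
    slu_hypotheses: dict mapping (slot, value) -> probability of that dialog act
    """
    informed = {}
    for (act_slot, value), prob in slu_hypotheses.items():
        if act_slot == slot:
            informed[value] = informed.get(value, 0.0) + prob
    # Probability mass of hypotheses that say nothing about this slot.
    unexplained = 1.0 - sum(informed.values())
    values = set(belief) | set(informed)
    return {
        v: informed.get(v, 0.0) + unexplained * belief.get(v, 0.0)
        for v in values
    }

# Uniform prior over three food types (hypothetical value set).
prior = {"italian": 1 / 3, "indian": 1 / 3, "french": 1 / 3}

# Semantic decoder output from the slide: dialog acts with confidences.
slu = {
    ("food", "italian"): 0.6,
    ("food", "indian"): 0.2,
    ("area", "east"): 0.1,
    (None, None): 0.1,       # null() act
}

posterior = update_belief(prior, slu, "food")
print(posterior)
```

The result remains a proper distribution, with most of the mass on "italian", which is what would justify a confirm(food=italian) action.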
55. Semantic Decoding
I’m looking for a place to eat – perhaps french.
Extract features
eg frequent N-grams
I’m looking
I’m looking for
for a place
place to eat
french
u-act = request
u-act = inform
entity=restaurant
entity=bar
entity=hotel
food=french
food=chinese
etc
Bank of binary classifiers
inform(entity=restaurant,
food=french) {0.5}
User Acts
0.1 0.6 0.5 0.3 0.0 0.8 0.1
inform(entity=bar,
food=french) {0.3}
….
inform(entity=restaurant,
food=chinese) {0.1}
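The decoding pipeline (n-gram features feeding a bank of binary classifiers) can be sketched in Python as below. The classifier weights and biases in `CLASSIFIERS` are hand-set, hypothetical values standing in for trained models, and a logistic function is assumed for turning feature matches into scores; the step that combines these scores into full dialog-act hypotheses is omitted.

```python
import math
import re

def ngrams(utterance, max_n=3):
    """Return the set of word n-grams (n = 1..max_n) found in the utterance."""
    tokens = re.findall(r"[a-z']+", utterance.lower())
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }

# One binary classifier per semantic item: (n-gram weights, bias).
# All numbers are hypothetical; real classifiers would be trained on annotated data.
CLASSIFIERS = {
    "u-act=inform": ({"i'm looking for": 2.0}, -1.0),
    "entity=restaurant": ({"place to eat": 2.0}, -1.0),
    "food=french": ({"french": 3.0}, -2.0),
    "food=chinese": ({"chinese": 3.0}, -2.0),
}

def decode(utterance):
    """Run every binary classifier on the utterance's n-gram features."""
    feats = ngrams(utterance)
    scores = {}
    for label, (weights, bias) in CLASSIFIERS.items():
        z = bias + sum(w for gram, w in weights.items() if gram in feats)
        scores[label] = 1.0 / (1.0 + math.exp(-z))   # logistic output in (0, 1)
    return scores

scores = decode("I'm looking for a place to eat, perhaps french.")
for label, score in scores.items():
    print(f"{label}: {score:.2f}")
```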
56. Belief State
o_entity
g_entity
u_entity
Goal
User
Act
Observation
at time t
User
Behaviour
Recognition/
Understanding
Errors
task -> find(entity,method,…)
entity -> restaurant(food, ..)
entity -> bar(food, ..)
food = French, Italian, Indian, ..
o_food
g_food
u_food
Next time slice t+1
Compile
Bayesian
Network
a
Ontology
57. Choosing the next action – the Policy
g_entity g_food
inform(entity=bar) {0.4}
HB R Fr It In -
b
Feature
Extraction
summary
belief space
select(entity=bar,
entity=restaurant)
Sample
$\arg\max_a \{Q(b,a) : a \in A\}$
Gaussian Process Q-Function Approximation
$Q(b,a) = E\left[ \sum_{\tau=t+1}^{T} r_\tau \mid b, a \right]$
$\{Q(b,a) : a \in A\}$
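The definition of Q(b, a) as the expected total future reward, and the greedy choice of the action that maximizes it, can be illustrated with a small Python sketch. Everything here is a stand-in: the belief is reduced to a hypothetical discrete summary label, Q is estimated by plain Monte Carlo averaging rather than the Gaussian Process approximation named on the slide, and the reward values (-1 per turn, +20 on success) and episodes are invented.

```python
from collections import defaultdict

def estimate_q(episodes):
    """Monte Carlo estimate of Q(b, a): the mean total reward collected
    after taking action a in summary belief state b."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for belief, action, future_rewards in episodes:
        totals[(belief, action)] += sum(future_rewards)
        counts[(belief, action)] += 1
    return {key: totals[key] / counts[key] for key in totals}

def greedy_action(q, belief, actions):
    """Pick argmax over a of Q(b, a); unseen actions are treated as worst."""
    return max(actions, key=lambda a: q.get((belief, a), float("-inf")))

# (summary belief, action taken, rewards received from the next turn to the end)
episodes = [
    ("food_uncertain", "confirm(food)", [-1, -1, 20]),
    ("food_uncertain", "confirm(food)", [-1, 20]),
    ("food_uncertain", "inform(venue)", [-1, 0]),
    ("food_uncertain", "inform(venue)", [-1, 20]),
]

q = estimate_q(episodes)
best = greedy_action(q, "food_uncertain", ["confirm(food)", "inform(venue)"])
print(q)
print(best)
```

In this toy data, confirming the uncertain slot first leads to higher average return than answering immediately, so the greedy policy picks the confirmation.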
58. Large Scale Evaluation – Task Success Rates
Condition | Word Err Rate | Conventional Success Rate | POMDP System Success Rate
Telephone | 21% | 84.6% | 86.9%
Telephone + noise | 30% | 75.2% | 81.2%
In Car | 29% | 67.8% | 75.8%
Success = finding the required information for a restaurant
which matches the supplied criteria
Note that user’s perceived success rate was ~10% higher!
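As a quick check on the table, the absolute gain of the POMDP system over the conventional one can be computed per condition. The snippet below only restates the figures reported on the slide; the variable names are arbitrary.

```python
# (word error rate %, conventional success %, POMDP success %) as reported on the slide
results = {
    "Telephone": (21, 84.6, 86.9),
    "Telephone + noise": (30, 75.2, 81.2),
    "In Car": (29, 67.8, 75.8),
}

# Absolute improvement in task success rate, in percentage points.
gains = {
    condition: round(pomdp - conventional, 1)
    for condition, (wer, conventional, pomdp) in results.items()
}

for condition, gain in gains.items():
    wer = results[condition][0]
    print(f"{condition}: WER {wer}%, gain {gain} points")
```

The gain is smallest on the clean telephone channel and larger in the two conditions with higher word error rates.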
59. Scaling up to the Web
Real Users
Working System
We can build a fully statistical spoken dialogue system for a specific
narrow domain – but how do we scale up to much broader domains?
CamInfo
Restaurant System
Crowd-sourced annotators
Data for input
output mapping
User simulator for
policy optimisation
Corpus Data for
model parameter
estimation
Domain
Ontology
Hand-crafted
input, output,
and model
parameters
Personal
Assistant
Corpus Data for
model parameter
estimation
Domain
Ontology
Unsupervised
learning
Fast on-line
reinforcement
learning
Wide
coverage
ontology
Real
Users
60. Conclusions
Semantic Search
› Explicit understanding for queries and documents
through links to external knowledge
• Using methods of Information Extraction or
explicit annotations (markup) in webpages
• Semantic Web as a source of external knowledge
Increasing level of understanding
› Early focus on entities and their attributes
• Applications in web search: rich results,
entity displays, entity recommendation
› Moving toward modeling intents/actions
› Adding human-like interaction
61. Q&A
Many thanks to members of the Semantic Search team
at Yahoo Labs Barcelona and to Yahoos around the world
› Slides on POMDP-based dialogue systems courtesy of prof. Steve Young, UCAM
Contact
› pmika@yahoo-inc.com
› @pmika
› http://www.slideshare.net/pmika/
› Ask about our internships and other opportunities