Semantic search: from document retrieval to virtual assistants

Presented by Peter Mika, Sr. Research Scientist, Yahoo Labs | June 19, 2014

Agenda
 Invite
 What is Semantic Search?
 Applications to Web search
› Enhanced results
› Entity retrieval and recommendations
 Beyond Web search

Yahoo Labs Barcelona
 Established January, 2006
› Part of a global network of Labs in
Sunnyvale, New York, Barcelona, Haifa,
Bangalore, Beijing, Santiago
 Led by Ricardo Baeza-Yates
 Research areas
› Distributed Systems
› Semantic Search
› Social Media
› Web Mining
› Web Retrieval

Semantic Search Research
Jordi Atserias
Sr. Research Engineer
Roi Blanco
Sr. Research Scientist
Hugues Bouchard
Sr. Research Engineer
Peter Mika
Sr. Research Scientist
Manager
Tim Potter
Research Engineer
Edgar Meij
Research Scientist

Search is really fast, without necessarily being intelligent

Why Semantic Search?
 Improvements in IR are harder and harder to come by
› Basic relevance models are well established
› Machine learning using hundreds of features
› Heavy investment in computational power, e.g. real-time indexing and instant search
 Remaining challenges are not computational, but in modeling user
cognition
› Could Watson explain why the answer is Toronto?
› Need a deeper understanding of the query, the content and the relationship of the two

 Semantic gap
› Ambiguity
• jaguar
• paris hilton
› Secondary meaning
• george bush (and I mean the beer brewer
in Arizona)
› Subjectivity
• reliable digital camera
• paris hilton sexy
› Imprecise or overly precise searches
• jim hendler
 Complex needs
› Missing information
• brad pitt zombie
• florida man with 115 guns
• 35 year old computer scientist living in
barcelona
› Category queries
• countries in africa
• barcelona nightlife
› Transactional or computational queries
• 120 dollars in euros
• digital camera under 300 dollars
• world temperature in 2020
Poorly solved information needs remain
Are there even true keyword queries? Users may have stopped asking them.

What is it like to be a machine?
Roi Blanco

↵⏏☐ģ
✜Θ♬♬ţğ√∞§®ÇĤĪ✜★♬☐✓✓
ţğ★✜
✪✚✜ΔΤΟŨŸÏĞÊϖυτρ℠≠⅛⌫
≠=⅚©§★✓♪ΒΓΕ℠
✖Γ♫⅜±⏎↵⏏☐ģğğğμλκσςτ
⏎⌥°¶§ΥΦΦΦ✗✕☐

 Def. Semantic Search is any
retrieval method where
› User intent and resources are
represented in a semantic model
• A set of concepts or topics that generalize
over tokens/phrases
• Additional structure such as a hierarchy
among concepts, relationships among
concepts etc.
› Semantic representations of the query
and the user intent are exploited in
some part of the retrieval process
 As a research field
› Workshops
• ESAIR (2008-2014) at CIKM, Semantic
Search (SemSearch) workshop series
(2008-2011) at ESWC/WWW, EOS
workshop (2010-2011) at SIGIR, JIWES
workshop (2012) at SIGIR, Semantic
Search Workshop (2011-2014) at VLDB
› Special Issues of journals
› Surveys
• Christos L. Koumenides, Nigel R.
Shadbolt: Ranking methods for entity-
oriented semantic web search.
JASIST 65(6): 1091-1106 (2014)
Semantic Search

Semantic models: implicit vs. explicit
 Implicit/internal semantics
› Models of text extracted from a corpus of queries, documents or interaction logs
• Query reformulation, term dependency models, translation models, topic models, latent space
models, learning to match (PLS)
› See
• Hang Li and Jun Xu: Semantic Matching in Search. Foundations and Trends in Information
Retrieval Vol 7 Issue 5, 2013, pp 343-469
 Explicit/external semantics
› Explicit linguistic or ontological structures extracted from text and linked to external
knowledge
› Obtained using IE techniques or acquired from Semantic Web markup

Semantic Search – a process view
Query Construction
•Keywords
•Forms
•NL
•Formal language
Query Processing
•IR-style matching & ranking
•DB-style precise matching
•KB-style matching & inferences
Result
Presentation
•Query visualization
•Document and data presentation
•Summarization
Query
Refinement
•Implicit feedback
•Explicit feedback
•Incentives
Document Representation
Knowledge Representation
Semantic Models
Resources
Documents

<roi>↵⏏☐ģ</roi>
✜Θ♬♬ţğ√∞§®ÇĤĪ✜★♬☐✓✓
ţğ★✜
✪✚✜ΔΤΟŨŸÏĞÊϖυτρ℠≠⅛⌫
≠=⅚©§★✓♪ΒΓΕ℠
✖Γ♫⅜±<roi>⏎↵⏏☐ģ</roi>ğğğμλκσςτ
⏎⌥°¶§ΥΦΦΦ✗✕☐
<roi>

Information Extraction
 Documents
› Natural language
• Named Entity Recognition & Disambiguation (“entity linking”)
• Deep parsing (dependency parsing)
› Specific to the Web
• Extraction from web tables, wrapper induction etc.
• Open Information Extraction such as NELL, ReVerb etc.
 Queries
› Short text and no structure… nothing to do?

Information Extraction on queries
 Entities play an important role
› ~70% of queries contain a named entity (entity mention queries) and
~50% of queries have an entity focus (entity seeking queries)
• brad pitt attacked by fans
› ~10% of queries are looking for a class of entities
• brad pitt movies
› See
• Jeffrey Pound, Peter Mika, Hugo Zaragoza: Ad-hoc object retrieval in the web of data. WWW
2010: 771-780
• Thomas Lin, Patrick Pantel, Michael Gamon, Anitha Kannan, Ariel Fuxman: Active objects:
actions for entity-centric search. WWW 2012: 589-598

Information Extraction on queries
 Common structure to entity mention queries:
query = <entity> + <intent>
› Intent is typically an additional word or phrase to
• Disambiguate, e.g. brad pitt actor
• Specify action or aspect e.g. brad pitt net worth, brad pitt download
 Useful also in off-line query log analysis
› Reduce the sparsity of query log data by mapping entities and intents to a
reference base of entities and intents
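The entity + intent structure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the alias table `ENTITY_ALIASES`, the entity identifiers, and the function name `split_query` are hypothetical, and a real system would use a proper entity linker rather than exact alias lookup.

```python
from collections import Counter

# Hypothetical alias table: surface form -> reference entity ID (illustrative only).
ENTITY_ALIASES = {
    "brad pitt": "Brad_Pitt",
    "bradd pitt": "Brad_Pitt",        # misspelling mapped to the same entity
    "moneyball": "Moneyball_(film)",
    "money ball": "Moneyball_(film)",
}

def split_query(query):
    """Split a query into (entity_id, intent) using the longest matching alias.

    Returns (None, query) when no alias occurs in the query.
    """
    tokens = query.lower().split()
    n = len(tokens)
    for length in range(n, 0, -1):                 # try the longest span first
        for start in range(n - length + 1):
            span = " ".join(tokens[start:start + length])
            if span in ENTITY_ALIASES:
                rest = tokens[:start] + tokens[start + length:]
                return ENTITY_ALIASES[span], " ".join(rest)
    return None, " ".join(tokens)

# Different surface forms collapse onto the same (entity, intent) pattern,
# which is what reduces sparsity in log analysis.
queries = ["brad pitt net worth", "bradd pitt net worth",
           "moneyball trailer", "money ball trailer"]
patterns = Counter(split_query(q) for q in queries)
```

Counting over `(entity, intent)` pairs instead of raw strings is the point of the sketch: four distinct query strings become two patterns with a count of two each.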

oakland as bradd pitt movie moneyball movies.yahoo.com oakland as wikipedia.org
captain america movies.yahoo.com moneyball trailer movies.yahoo.com
money moneyball movies.yahoo.com
moneyball movies.yahoo.com movies.yahoo.com en.wikipedia.org movies.yahoo.com peter brand
peter brand oakland nymag.com moneyball the movie www.imdb.com
moneyball trailer movies.yahoo.com moneyball trailer
brad pitt brad pitt moneyball brad pitt moneyball movie brad pitt moneyball brad pitt moneyball oscar
www.imdb.com
relay for life calvert ocunty www.relayforlife.org trailer for moneyball movies.yahoo.com
moneyball.movie-trailer.com
moneyball en.wikipedia.org movies.yahoo.com map of africa www.africaguide.com
money ball movie www.imdb.com money ball movie trailer moneyball.movie-trailer.com
brad pitt new www.zimbio.com www.usaweekend.com www.ivillage.com www.ivillage.com brad pitt
news news.search.yahoo.com moneyball trailer moneyball trailer www.imdb.com www.imdb.com
Patterns in logs are hard to see
 Sample of sessions from June, 2011 containing the term “moneyball”
› What are users trying to do?

oakland as bradd pitt movie moneyball trailer movies.yahoo.com oakland as wikipedia.org
Semantic annotations help to generalize…
Sports team
Movie
Actor

… and understand user needs
moneyball trailer
what the user wants to do with it
Movie
Object of the query

Information extraction on queries
 Entity linking
› Tutorial: Entity Linking and Retrieval by Edgar Meij, Krisztián Balog and Daan Odijk
› Dataset for evaluation of entity linking (2013)
• Yahoo WebScope dataset L24 - Yahoo Search Query Log To Entities, version 1.0
 Semantic annotation for query log analysis
› Frequent pattern mining on raw queries fails due to large amount of noise
› Meaningful patterns start to emerge when mining the semantic annotations instead
› Laura Hollink, Peter Mika, Roi Blanco: Web usage mining with semantic analysis. WWW
2013: 561-570

Semantic Web
 Significant extension of the Web stack
› Languages for publishing raw data and document annotations
› Standards for querying, validating and reasoning with data
distributed across the Web
 Research community formed around 2001
› ISWC, ESWC, WWW Semantic Web Track, JWS
 Conflicted history with Information Retrieval
› Misplaced expectations as to what the Semantic Web will bring
› Building the chicken farm before any chickens or eggs
 Since 2007 more solid progress in adoption
› Metadata in HTML
› Public and private ‘Knowledge Graphs’

Metadata in HTML: schema.org
 Agreement on a shared set of schemas for common types of web
content
› Bing, Google, and Yahoo! as initial founders (June, 2011), joined by Yandex later
› Similar in intent to sitemaps.org
• Use a single format to communicate the same information to all three search engines
<div vocab="http://schema.org/" typeof="Movie">
<h1 property="name">Pirates of the Caribbean: On Stranger Tides (2011)</h1>
<span property="description">Jack Sparrow and Barbossa embark on a quest to
find the elusive fountain of youth, only to discover that Blackbeard and
his daughter are after it too.</span>
Director: <div property="director" typeof="Person">
<span property="name">Rob Marshall</span>
</div>
</div>
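To show how a consumer might read such markup, here is a minimal sketch using only Python's standard `html.parser`. It is not a conforming RDFa processor: it merely collects the text found directly inside elements that carry a `property` attribute and records the chain of enclosing properties as a path. The class name `PropertyCollector`, the helper `collect_properties`, and the shortened `SAMPLE` string are assumptions made for illustration.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # elements without a closing tag

class PropertyCollector(HTMLParser):
    """Collect (property path, text) pairs from elements with a `property` attribute."""

    def __init__(self):
        super().__init__()
        self._open = []    # one entry per open element: its property name or None
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self._open.append(dict(attrs).get("property"))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._open:
            self._open.pop()

    def handle_data(self, data):
        text = " ".join(data.split())
        # Keep only text that sits directly inside a property-bearing element.
        if text and self._open and self._open[-1]:
            path = "/".join(p for p in self._open if p)
            self.pairs.append((path, text))

def collect_properties(html):
    parser = PropertyCollector()
    parser.feed(html)
    parser.close()
    return parser.pairs

SAMPLE = """
<div vocab="http://schema.org/" typeof="Movie">
<h1 property="name">Pirates of the Caribbean: On Stranger Tides (2011)</h1>
Director: <div property="director" typeof="Person">
<span property="name">Rob Marshall</span>
</div>
</div>
"""

pairs = collect_properties(SAMPLE)
```

The nested path (`director/name`) is a simplification chosen here so that the director's name is not confused with the movie's name; a full RDFa parser would instead emit triples with typed subjects.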

Substantial adoption of schema.org markup
 Over 15% of all pages now have schema.org markup
 Over 5 million sites, over 25 billion entity references
 In other words: same order of magnitude as the web
› Source: R.V. Guha: Light at the end of the tunnel, ISWC 2013 keynote
 See also
› P. Mika, T. Potter. Metadata Statistics for a Large Web Corpus, LDOW 2012
• Based on Bing US corpus
• 31% of webpages, 5% of domains contain some metadata (including Facebook’s OGP)
› WebDataCommons
• Based on CommonCrawl Nov 2013
• 26% of webpages, 14% of domains contain some metadata (including Facebook’s OGP)

Knowledge Graphs
 Linked (Open) Data (linkeddata.org)
› Public movement for making open/public databases
• available in standard Semantic Web formats
• interlinking them
› DBpedia is a central hub in this network of datasets
• Software framework to extract structured data from Wikipedia
and consolidate it under a common ontology
• The resulting dataset that contains links to Freebase and
others
– Freebase links to IMDB and so on
 Basis for private Knowledge Graphs
› Bing, Google, Yahoo

Yahoo’s Knowledge Graph
[Figure: fragment of the knowledge graph. Entity nodes: Chicago Cubs, Chicago, Barack Obama, Carlos Zambrano, Brad Pitt, Angelina Jolie, Steven Soderbergh, George Clooney, Ocean’s Twelve, E/R, Fight Club, Dust Brothers, and an offer "10% off tickets". Edge labels: for, plays for, plays in, lives in, partner, directs, casts in, takes place in, music by.]
Nicolas Torzec: Making knowledge reusable at Yahoo!:
a Look at the Yahoo! Knowledge Base (SemTech 2013)

Building Yahoo’s Knowledge Graph
 Ontology building and maintenance
› Editorially maintained OWL ontology with 300+ classes
› Covering the domains of interest of Yahoo
 Information extraction
› Public datasets and proprietary data
 Data fusion
› Manual mapping from the source schemas to the ontology
› Supervised entity reconciliation
• Kedar Bellare, Carlo Curino, Ashwin Machanavajjhala, Peter Mika, Mandar Rahurkar, Aamod Sane:
WOO: A Scalable and Multi-tenant Platform for Continuous Knowledge Base Synthesis. PVLDB 2013
• Michael J. Welch, Aamod Sane, Chris Drome: Fast and accurate incremental entity resolution relative to
an entity knowledge base. CIKM 2012
› Editorial curation and quality assessment

Semantic Search for…
 Improving ad-hoc document retrieval
› Query composition
› Result presentation
› Matching
› Ranking
 Providing new search functionality
› Entity retrieval
• Related entity recommendation
› Personalization
› Question-answering
› Task completion

Exploiting Semantic Web markup
(internal prototype, 2007)
Personal and private homepage of the same person (clear from the snippet, but it could also be automatically de-duplicated)
Conferences he plans to attend and his vacations from homepage, plus bio events from LinkedIn
Geolocation

Search snippets using Semantic Web markup
 Summarization of HTML is a hard task
• Template detection
• Selecting relevant snippets
• Composing readable text
› Efficiency constraints
 Yahoo SearchMonkey (2008)
› Enhanced results using structured data from the page
• Key/value pairs
• Deep links
• Image or Video

Effectiveness of enhanced results
 Explicit user feedback
› Side-by-side editorial evaluation (A/B testing)
• Editors are shown a traditional search result and enhanced result for the same page
• Users prefer enhanced results in 84% of the cases and traditional results in 3% (N=384)
 Implicit user feedback
› Click-through rate analysis
• Long dwell time limit of 100s (Ciemiewicz et al. 2010)
• 15% increase in ‘good’ clicks
› User interaction model
• Enhanced results lead users to relevant documents
– even though less likely to be clicked than textual results
• Enhanced results effectively reduce bad clicks!
 See
› Kevin Haas, Peter Mika, Paul Tarjan, Roi Blanco: Enhanced results for web search. SIGIR 2011:
725-734

Enhanced results at other search providers
 Google announces Rich Snippets - June, 2009
› Faceted search for recipes - Feb, 2011
 Bing tiles – Feb, 2011
 Facebook’s Like button and the Open Graph Protocol (2010)
› Shows up in profiles and news feed
› Site owners can later reach users who have liked an object

Moving beyond entity markup
 We would like to help our users in task completion
› But we have trained our users to talk in nouns
• Retrieval performance decreases by adding verbs to queries
› Markup for actions/intents could potentially help
 Modeling actions
› Understand what actions can be taken on a page
› Help users in mapping their query to potential actions
› Applications in web search, email etc.
Schema.org v1.2, including the Actions vocabulary, published April 16, 2014

Applications of Actions markup
Email (Gmail) SERP (Yandex)

 Entity retrieval
› Which entity does a keyword query
refer to, if any?
 Related entities for navigation
› Which entity would the user visit next?
Entity displays in web search

Entity Retrieval
 Keyword search over entity graphs
› see Pound et al. WWW 2010 for a definition
› No common benchmark until 2010
 SemSearch Challenge 2010/2011
• 50 entity-mention queries selected from the Search Query Tiny Sample v1.0 dataset (Yahoo!
Webscope)
• Billion Triples Challenge 2009 data set
• Evaluation using Mechanical Turk
› See report:
• Roi Blanco, Harry Halpin, Daniel M. Herzig, Peter Mika, Jeffrey Pound, Henry S. Thompson,
Thanh Tran: Repeatable and reliable semantic search evaluation. J. Web Sem. 21: 14-29 (2013)

Glimmer: open-source entity retrieval engine from Yahoo
 Extension of MG4J from University of Milano
 Indexing of RDF data
› MapReduce-based
› Horizontal indexing (subject/predicate/object fields)
› Vertical indexing (one field per predicate)
 Retrieval
› BM25F with machine-learned weights for properties and domains
› 52% improvement over the best system in SemSearch 2010
 See
› Roi Blanco, Peter Mika, Sebastiano Vigna: Effective and Efficient Entity Search in RDF Data.
International Semantic Web Conference (1) 2011: 83-97
› https://github.com/yahoo/Glimmer/
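As an illustration of the ranking function named above, the following is a minimal sketch of the standard BM25F formulation over a fielded entity description (one field per predicate, as in vertical indexing). It is not Glimmer's code: the function name, the field names, and the hand-set field weights are assumptions for this example, whereas the slide states that the real weights are machine-learned.

```python
import math

def bm25f_score(query_terms, doc_fields, field_weights, avg_field_len,
                doc_freq, n_docs, k1=1.2, b=0.75):
    """Score one entity against a query with BM25F.

    doc_fields:    field name -> list of tokens for this entity
    field_weights: field name -> weight (hand-set here, learned in practice)
    avg_field_len: field name -> average token count of that field in the collection
    doc_freq:      term -> number of entities containing the term in any field
    """
    score = 0.0
    for term in query_terms:
        # Combine per-field term frequencies into one weighted pseudo-frequency,
        # normalizing each field by its own length.
        pseudo_tf = 0.0
        for field, tokens in doc_fields.items():
            tf = tokens.count(term)
            if tf == 0:
                continue
            norm = 1.0 - b + b * len(tokens) / avg_field_len[field]
            pseudo_tf += field_weights.get(field, 1.0) * tf / norm
        if pseudo_tf == 0.0:
            continue
        df = doc_freq.get(term, 0)
        idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
        # Saturation is applied once, after the fields have been combined.
        score += idf * pseudo_tf / (k1 + pseudo_tf)
    return score

# Toy collection statistics (illustrative values).
weights = {"label": 3.0, "abstract": 1.0}
avg_len = {"label": 1.5, "abstract": 5.0}
doc_freq = {"moneyball": 2}

film = {"label": ["moneyball"], "abstract": ["film", "about", "baseball"]}
actor = {"label": ["brad", "pitt"], "abstract": ["brad", "pitt", "stars", "in", "moneyball"]}

score_film = bm25f_score(["moneyball"], film, weights, avg_len, doc_freq, n_docs=10)
score_actor = bm25f_score(["moneyball"], actor, weights, avg_len, doc_freq, n_docs=10)
```

With these toy weights a match in the short, heavily weighted `label` field outranks a match buried in the `abstract`, which is the behavior per-field weighting is meant to produce.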

Other evaluations in Entity Retrieval
 TREC Entity Track
› 2009-2011
› Data
• ClueWeb 09 collection
› Queries
• Related Entity Finding
– Entities related to a given entity through a
particular relationship
– (Homepages of) airlines that fly Boeing 747
• Entity List Completion
– Given some elements of a list of entities,
complete the list
– Professional sports teams in Philadelphia such
as the Philadelphia Wings, …
› Relevance assessments provided by
TREC assessors
 Question Answering over Linked Data
› 2011-2014
› Data
• DBpedia and MusicBrainz in RDF
› Queries
• Full natural language questions of different
forms, written by the organizers
• Multi-lingual
• Give me all actors starring in Batman
Begins
› Results are defined by an equivalent
SPARQL query
• Systems are free to return list of results or
a SPARQL query

Related entity recommendations

Spark(le) system for related entity recommendations
1. Knowledge Graph
› Filtering and enrichment
2. Feature extraction
› Query logs, Flickr, Twitter
3. MLR
4. Online/offline evaluation
› Point-wise assessments
› Side-by-side testing
› Online evaluation
5. Runtime
› Unary
• Popularity features from text: probability,
entropy, Wiki entity popularity …
• Graph features: PageRank on the entity
graph, Wikipedia, Web graph
• Type features: entity type
› Binary
• Co-occurrence features from text:
conditional probability, joint probability …
• Graph features: common neighbors …
• Type features: relation type
Roi Blanco, B. Barla Cambazoglu, Peter Mika, Nicolas Torzec: Entity Recommendations in Web Search. ISWC 2013
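The binary co-occurrence features listed above (joint and conditional probability) can be estimated from counts. The sketch below assumes that each session has already been annotated with the entities it mentions; the session data, entity identifiers, and function names are hypothetical and are not taken from the Spark system.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_features(sessions):
    """Build joint and conditional probability estimators from entity-annotated sessions."""
    n = len(sessions)
    single = Counter()   # number of sessions mentioning an entity
    pair = Counter()     # number of sessions mentioning both entities of a pair
    for entities in sessions:
        unique = sorted(set(entities))
        single.update(unique)
        pair.update(combinations(unique, 2))

    def joint(a, b):
        """P(a, b): fraction of sessions containing both entities."""
        return pair[tuple(sorted((a, b)))] / n

    def conditional(b, given):
        """P(b | given): fraction of sessions with `given` that also contain b."""
        if single[given] == 0:
            return 0.0
        return pair[tuple(sorted((given, b)))] / single[given]

    return joint, conditional

# Hypothetical annotated sessions.
sessions = [
    ["Brad_Pitt", "Moneyball"],
    ["Brad_Pitt", "Angelina_Jolie"],
    ["Moneyball", "Oakland_Athletics"],
    ["Brad_Pitt", "Moneyball", "Oakland_Athletics"],
]
joint, conditional = cooccurrence_features(sessions)
```

Note that the conditional probability is asymmetric: here every session mentioning `Angelina_Jolie` also mentions `Brad_Pitt`, but not the other way around, which is why both directions can be useful as separate features.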

Mobile search on the rise
 Information access on-the-go requires hands-free operation
› Driving, walking, gym, etc.
• Americans spend 540 hours a year in their cars [1] vs. 348 hours browsing the Web [2]
 ~50% of queries are coming from mobile devices (and growing)
› Changing habits, e.g. iPad usage peaks before bedtime
› Limitations in input/output
[1] http://answers.google.com/answers/threadview?id=392456
[2] http://articles.latimes.com/2012/jun/22/business/la-fi-tn-top-us-brands-news-web-sites-20120622

Mobile search: challenges and opportunities
 Interaction
› Question-answering
› Support for interactive retrieval
› Spoken-language access
› Task completion
 Contextualization
› Personalization
› Geo
› Context (work/home/travel)
• Try getaviate.com

Interactive, conversational voice search
 Parlance EU project
› Complex dialogs within a domain
• Requires complete semantic understanding
 Complete system (mixed license)
› Automated Speech Recognition (ASR)
› Spoken Language Understanding (SLU)
› Interaction Management
› Knowledge Base
› Natural Language Generation (NLG)
› Text-to-Speech (TTS)
 Video

Components of a Spoken Dialog System (SDS)
Recognizer
(ASR)
Semantic
Decoder
Dialog
Control
Synthesizer
(TTS)
Message
Generator
User
Waveforms Words
Dialog
Acts
I want to find a restaurant?
inform(task=find, entity=restaurant)
request(food)
What kind of food would you like?
The Web
• Currently limited domain
• Hand-crafted using rule-based parsers, template
generators and flowchart-based dialog control
• Expensive to build and fragile in operation

A Statistical Spoken Dialogue System
Bayesian
Belief
Network
Semantic
Decoder
Stochastic
Policy
Response
Generator
Ontology
inform(food=italian){0.6}
inform(food=indian) {0.2}
inform(area=east){0.1}
null(){0.1}
confirm(food=italian)
request(area)
Action
Reward Function
Rewards: success/fail
Reinforcement
Learning
Supervised Learning
Partially Observable Markov Decision Process (POMDP)
ASR
Evidence
Belief
State
Belief
Propagation
I want
an
Italian
You are looking for an Italian
restaurant? Whereabouts?
Id like italian {0.4}
I want an Italian {0.2}
Id like Indian{0.2}
In the east{0.1}
TTS
Ita Ind -
Food
N E S W
Area
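To make the idea of a belief state concrete, here is a deliberately simplified sketch of a Bayesian update for a single slot (`food`), driven by the N-best list of dialog acts shown on this slide. It is not the Bayesian network of the actual system: it assumes a static user goal, a truthful user, and a single parameter `p_inform` (the assumed probability that the user talks about this slot in a given turn). All names are hypothetical.

```python
def update_belief(prior, nbest, slot, p_inform=0.5):
    """One Bayesian update of the belief over a slot's value.

    prior: value -> probability
    nbest: list of (slot or None, value or None, confidence) hypotheses
    Simplifying assumptions: if the user mentions the slot, the true value is
    given; hypotheses about other slots are equally likely under every value.
    """
    off_slot = sum(conf for s, _, conf in nbest if s != slot)
    unnormalized = {}
    for value, p in prior.items():
        on_value = sum(conf for s, v, conf in nbest if s == slot and v == value)
        unnormalized[value] = p * (p_inform * on_value + (1.0 - p_inform) * off_slot)
    total = sum(unnormalized.values())
    if total == 0.0:
        return dict(prior)          # the evidence rules out everything: keep the prior
    return {value: u / total for value, u in unnormalized.items()}

# N-best semantic hypotheses from the slide.
nbest = [
    ("food", "italian", 0.6),
    ("food", "indian", 0.2),
    ("area", "east", 0.1),
    (None, None, 0.1),              # null()
]
prior = {"italian": 1 / 3, "indian": 1 / 3, "french": 1 / 3}
belief = update_belief(prior, nbest, slot="food")
```

The useful property is that no hypothesis is discarded: `indian` keeps some probability mass, so a later turn can still recover from a recognition error.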

Semantic Decoding
I’m looking for a place to eat – perhaps french.
Extract features
eg frequent N-grams
I’m looking
I’m looking for
for a place
place to eat
french
u-act = request
u-act = inform
entity=restaurant
entity=bar
entity=hotel
food=french
food=chinese
etc
Bank of binary classifiers
inform(entity=restaurant,
food=french) {0.5}
User Acts
0.1
0.6
0.5
0.3
0.0
0.8
0.1
inform(entity=bar,
food=french) {0.3}
….
inform(entity=restaurant,
food=chinese) {0.1}
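The decoding pipeline on this slide (n-gram features feeding a bank of binary classifiers) might look roughly as follows. The weights and biases in `CLASSIFIERS` are made up for illustration; a real decoder would learn them from annotated data, and the slide does not specify the classifier type, so simple logistic scoring is an assumption here.

```python
import math

def ngrams(text, max_n=3):
    """Return the set of word n-grams (n = 1..max_n) of a lower-cased utterance."""
    tokens = text.lower().replace(".", " ").replace(",", " ").split()
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }

# One binary classifier per semantic item: (feature weights, bias).
# All numbers are hypothetical and hand-set for this example.
CLASSIFIERS = {
    "u-act=inform":      ({"i'm looking for": 2.0, "place to eat": 1.0}, -1.0),
    "entity=restaurant": ({"place to eat": 2.5}, -1.0),
    "entity=hotel":      ({"place to stay": 2.5}, -1.0),
    "food=french":       ({"french": 3.0}, -1.5),
    "food=chinese":      ({"chinese": 3.0}, -1.5),
}

def decode(text):
    """Score every semantic item independently with a logistic classifier."""
    features = ngrams(text)
    scores = {}
    for label, (weights, bias) in CLASSIFIERS.items():
        z = bias + sum(w for feature, w in weights.items() if feature in features)
        scores[label] = 1.0 / (1.0 + math.exp(-z))
    return scores

scores = decode("I'm looking for a place to eat, perhaps french.")
```

Combining the per-item scores into a ranked list of complete user acts, such as `inform(entity=restaurant, food=french)`, is a further step that this sketch leaves out.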

Belief State
o_entity
g_entity
u_entity
Goal
User
Act
Observation
at time t
User
Behaviour
Recognition/
Understanding
Errors
task -> find(entity,method,…)
entity -> restaurant(food, ..)
entity -> bar(food, ..)
food = French, Italian, Indian, ..
o_food
g_food
u_food
Next time slice t+1
Compile
Bayesian
Network
a
Ontology

Choosing the next action – the Policy
g_entity g_food
inform(entity=bar) {0.4}
HB R Fr It In -
b
Feature
Extraction
summary
belief space
select(entity=bar,
entity=restaurant)
Sample
\arg\max_{a} \{ Q(b,a) : a \in A \}
Gaussian Process Q-Function Approximation
Q(b, a) = E\left[ \sum_{\tau=t+1}^{T} r_\tau \,\middle|\, b, a \right]
\{ Q(b,a) : a \in A \}
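The policy step on this slide picks the action with the highest Q-value after sampling from the Q-function approximation. The sketch below illustrates that idea with independent Gaussians per action, which is a simplification and not a Gaussian process; the Q-value numbers are hypothetical, and the action names are borrowed from the slides.

```python
import random

def return_after(rewards, t):
    """Sum of rewards r_{t+1} .. r_T, where rewards[i] is the reward at step i."""
    return sum(rewards[t + 1:])

def greedy_action(q_values):
    """argmax over actions of Q(b, a) for a fixed belief b."""
    return max(q_values, key=q_values.get)

def sample_then_argmax(q_posterior, rng):
    """Draw one Q sample per action from its (mean, std), then act greedily.

    Sampling makes uncertain actions win occasionally, which gives exploration.
    """
    samples = {a: rng.gauss(mean, std) for a, (mean, std) in q_posterior.items()}
    return greedy_action(samples)

# Hypothetical Q estimates for the current summary belief state.
q_posterior = {
    "confirm(food=italian)": (0.7, 0.3),
    "request(area)": (0.9, 0.3),
    "select(entity=bar, entity=restaurant)": (0.2, 0.3),
}
means = {a: mean for a, (mean, _) in q_posterior.items()}
best = greedy_action(means)
explored = sample_then_argmax(q_posterior, random.Random(0))
```

Q(b, a) itself is the expectation of `return_after` over dialogs that take action a in belief state b; the reward function on the earlier slide scores success or failure at the end of the dialog.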

Large Scale Evaluation – Task Success Rates
Condition           Word Err Rate   Conventional Success Rate   POMDP System Success Rate
Telephone           21%             84.6%                       86.9%
Telephone + noise   30%             75.2%                       81.2%
In Car              29%             67.8%                       75.8%
Success = finding the required information for a restaurant
which matches the supplied criteria
Note that user’s perceived success rate was ~10% higher!

Real
Users
Working
System
Scaling up to the Web
We can build a fully statistical spoken dialogue system for a specific
narrow domain – but how do we scale up to much broader domains?
CamInfo
Restaurant System
Crowd-sourced annotators
Data for input
output mapping
User simulator for
policy optimisation
Corpus Data for
model parameter
estimation
Domain
Ontology
Hand-crafted
input, output,
and model
parameters
Personal
Assistant
Corpus Data for
model parameter
estimation
Domain
Ontology
Unsupervised
learning
Fast on-line
reinforcement
learning
Wide
coverage
ontology
Real
Users

Conclusions
 Semantic Search
› Explicit understanding for queries and documents
through links to external knowledge
• Using methods of Information Extraction or
explicit annotations (markup) in webpages
• Semantic Web as a source of external knowledge
 Increasing level of understanding
› Early focus on entities and their attributes
• Applications in web search: rich results,
entity displays, entity recommendation
› Moving toward modeling intents/actions
› Adding human-like interaction

Q&A
 Many thanks to members of the Semantic Search team
at Yahoo Labs Barcelona and to Yahoos around the world
› Slides on POMDP-based dialogue systems courtesy of prof. Steve Young, UCAM
 Contact
› pmika@yahoo-inc.com
› @pmika
› http://www.slideshare.net/pmika/
› Ask about our internships and other opportunities
