SemTech 2013 Tutorial

Peter Mika | Yahoo! Research, Spain
pmika@yahoo-inc.com
Thanh Tran | Semsolute, Germany
Tran@semsolute.com
Semantic Search on the Rise

About the speakers
 Peter Mika
 Senior Research Scientist
 Head of Semantic Search group at
Yahoo! Labs
 Expertise: Semantic Search, Web
Object Retrieval, Natural Language
Processing
 Tran Duc Thanh
 CEO of Semsolute, Semantic Search
Technologies Company
 Served as Assistant Professor for
Karlsruhe Institute of Technology and
Stanford University
 Expertise: Semantic Search,
Semantic / Linked Data Management

Agenda
 Why Semantic Search
 What is Semantic Search
 Innovative Semantic Search Applications
 Behind the Scenes
 Questions

Why Semantic Search? I.
 “We are at the beginning of search.” (Marissa Mayer)
 Solved large classes of queries, e.g. navigational
 Remaining queries are hard, not solvable by brute
force, require deep understanding of the world and
human cognition, e.g.
 Ambiguous searches: paris hilton
 Imprecise or overly precise searches
 Searches for descriptions: 34 year old computer scientist
living in barcelona
 Background knowledge and metadata can help to
address poorly solved queries
Many of these queries would not be asked by users, who have learned over
time what search technology can and cannot do.

Why Semantic Search? II.
 The Semantic Web is now a reality
 Large amounts of data published in RDF
 Linked Data
 Metadata in HTML
 Facebook's Open Graph Protocol
 Schema.org
 Casual users
 Don't know SPARQL
 Unaware of the schema of the data
 Searching data instead of or in addition to searching
documents
 Enable innovative search applications / tasks

Semantic Search: Using Semantic Models for
Search
 Semantic search is a retrieval paradigm that
 Exploits the semantics of the data or explicit background
knowledge to understand user intent and the meaning of
content
 Incorporates the intent of the query and the meaning of
content into the search process (semantic models)

Semantic Search: Different Kinds / Different
Uses of Semantic Models
 Wide range of semantic search systems
 Employ different semantic models, possibly at
different steps of the search process and in order to
support different tasks
 Query formulation
 Query processing / understanding
 Ranking
 Result presentation
 Result / query refinement

Semantic models
 Semantics is concerned with the meaning of the
resources made available for search
 Various representations of meaning
 Word-level models: models of relationships among
words
 Taxonomies, thesauri, dictionaries of entity names
 Inference along linguistic relations, e.g. broader/narrower
terms
 Concept-level models: models of relationships
among objects
 Ontologies capture entities in the world and their
relationships
 Inference along domain-specific relations
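
As a rough illustration of inference along broader/narrower terms, the sketch below expands a query keyword with all of its transitively narrower terms. The thesaurus contents and the function names are hypothetical and only serve the example.

```python
# Hypothetical toy thesaurus: term -> directly narrower terms (illustrative data only).
NARROWER = {
    "vehicle": ["car", "bicycle"],
    "car": ["sedan", "convertible"],
}

def expand_narrower(term, thesaurus):
    """Return the term followed by all transitively narrower terms (depth-first)."""
    expanded, stack, seen = [], [term], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against cycles in the thesaurus
        seen.add(current)
        expanded.append(current)
        # Push children in reverse so they are visited in their listed order.
        stack.extend(reversed(thesaurus.get(current, [])))
    return expanded

def expand_query(query, thesaurus):
    """Map each query keyword to its expansion list."""
    return {word: expand_narrower(word, thesaurus) for word in query.lower().split()}

print(expand_query("vehicle rental", NARROWER))
```

A keyword without an entry in the thesaurus simply expands to itself, so the expansion never drops query terms.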

Graph-based Conceptual Models
 Core of W3C standards for knowledge representation
and data exchange: RDF, OWL
 Large amount of data / knowledge on the Web
available as graphs
 Linked Data: hundreds of interconnected datasets
capturing domain-independent and domain-specific
knowledge
 Metadata in HTML
 RDFa, microdata, Facebook's OGP
 Private graphs
 Google's Knowledge Graph
 Facebook Graph
 Yahoo's Knowledge Base (talk yesterday)
 Microsoft's Satori

Where can you find Linked Data?
 Downloads
 DBpedia data dumps
 SPARQL access
 LOD cache by OpenLink: 51 billion triples
 Keyword search
 Sindice by SindiceTech
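
For the SPARQL access route, a query is typically sent to an endpoint over HTTP. The sketch below only builds the request URL and does not contact any server. It assumes the public DBpedia endpoint at https://dbpedia.org/sparql and the standard `query` parameter of the SPARQL protocol; the example query and the helper name are illustrative.

```python
from urllib.parse import urlencode

# Assumption: the public DBpedia endpoint; its availability is not guaranteed.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"

def build_sparql_url(endpoint, query):
    """Encode a SPARQL query as an HTTP GET URL using the protocol's 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})

# Illustrative query: English labels of the DBpedia resource for Barcelona.
QUERY = """
SELECT ?label WHERE {
  <http://dbpedia.org/resource/Barcelona> <http://www.w3.org/2000/01/rdf-schema#label> ?label .
  FILTER (lang(?label) = "en")
}
""".strip()

url = build_sparql_url(DBPEDIA_ENDPOINT, QUERY)
print(url)
```

The resulting URL can be fetched with any HTTP client; the desired result format is normally requested through the `Accept` header.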

Google Knowledge Graph
 Start with Freebase's database, which had 12 million
entities
 As of June 2012, Knowledge Graph has 500 million
entities and over 3.5 billion relationships between
those entities
 Prioritize properties based on what users were most

Facebook's Open Graph Protocol
 The 'Like' button provides publishers with a way to
promote their content on Facebook and build
communities
 Shows up in profiles and news feed
 Site owners can later reach users who have liked an
object
 Facebook Graph API allows 3rd party developers to
access the data
 Open Graph Protocol is an RDFa-based format that
allows publishers to describe the object that the user 'Likes'

Facebook's Open Graph Protocol
 RDF vocabulary to be used in conjunction with RDFa
 Simplify the work of developers by restricting the freedom in RDFa
 Activities, Businesses, Groups, Organizations, People, Places,
Products and Entertainment
 Only HTML <head> accepted
 http://opengraphprotocol.org/
<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
<title>The Rock (1996)</title>
<meta property="og:title" content="The Rock" />
<meta property="og:type" content="movie" />
<meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
<meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" /> …
</head> ...
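
A consumer of this markup needs to read the og: properties back out of the page. The following is a minimal sketch using Python's standard `html.parser` module; the class name `OpenGraphParser` is hypothetical, and the page is a shortened copy of the example above.

```python
from html.parser import HTMLParser

class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property") or ""
        if prop.startswith("og:") and "content" in attributes:
            self.properties[prop] = attributes["content"]

PAGE = """<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
<title>The Rock (1996)</title>
<meta property="og:title" content="The Rock" />
<meta property="og:type" content="movie" />
<meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
</head>
</html>"""

parser = OpenGraphParser()
parser.feed(PAGE)
print(parser.properties)
```

Because the protocol restricts the markup to `<meta>` elements in the `<head>`, a simple tag-level parser like this is enough and no general RDFa processor is required.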

Semantic Web markup: schema.org
 Agreement on a shared set of schemas for common types
of web content
 Use a single format to communicate the same information to all three
search engines
 Bing, Google, and Yahoo! (June, 2011), Yandex (Nov, 2011)
 Microdata and RDFa support
 Schemas for most common web content
 Business listings, images/video, recipes, reviews, products, jobs…
 Community
 public-vocabs@w3.org

Current state of metadata on the Web
 Analysis of the Bing/Yahoo! Search Crawl
 US crawl, January, 2012
 31% of webpages, 5% of domains contain some metadata
 P. Mika, T. Potter. Metadata Statistics for a Large Web Corpus,
LDOW 2012
 WebDataCommons.org
 Data extracted from a public crawl (commoncrawl.org)
 February, 2012 results show 11% of URLs with metadata
compared to 5% in 2009/2010 data
 7.3 billion triples available for download
 H. Mühleisen, C. Bizer. Web Data Commons - Extracting
Structured Data from Two Large Web Corpora, LDOW 2012
 Large increase in RDFa and microdata adoption compared
to microformats

Where can you find HTML metadata?
 Web Data Commons
 Glimmer: glimmer.research.yahoo.com
 Online index of the schema.org data in Web Data
Commons

Innovative Semantic Search Applications

Innovative Semantic Search Applications
 Entity search: entity/entities as results
 Factual search: direct answers, facts (about entities)
 Relational search: complex relationships between entities
 Semantic auto-completion: suggesting queries based on
the intent of the provided inputs
 Results aggregation / analysis / prediction: apply
computational models
 Semantic log analysis: understanding user behavior in
terms of objects
 Semantic profiling: recommendations based on particular
interests
 Semantic context: contextual model of users / interests
 Support for complex tasks, e.g. booking a vacation using a
combination of services
 Conversational search

Entity Search: Entity-based
Disambiguation

Entity Search: Entity-based Navigation / Exploration

Semantic auto-completion: Facebook Graph
Search

Semantic Auto-completion: Semsolute's semantic search
engine
[Screenshot of the auto-completion interface; panel labels: Keywords, Syntactic Completions, Semantic Completions]

Contextual (pervasive, ambient) search
 Yahoo! Connected TV: widget engine embedded into the TV
 Yahoo! IntoNow: recognize audio and show related content

Interactive Voice Search
 Siri
 Question-Answering
 Variety of backend sources
including Wolfram Alpha and
various Yahoo! services
 Task completion
 E.g. schedule an event

Conversational Search
 Google's Interactive Voice Search

Conversational Search
 Parlance EU project
 Complex dialogs around a set of objects
 Restaurant
 Area
 Price range
 Type of cuisine
 Complete system
 Automated Speech Recognition (ASR)
 Spoken Language Understanding (SLU)
 Interaction Management
 Knowledge Base
 Natural Language Generation (NLG)
 Text-to-Speech (TTS)
 Video
 Commercial alternatives from Nuance

Main Technological Building Blocks
 Query Interpretation
 Spelling Correction
 Query Segmentation
 Entity Recognition
 Query Intent Interpretation for Semantic Auto-Completion
 Ranking
 Entity Ranking
 Relationship Ranking
 Aggregation
 Result Fusion
 Rank / Score Aggregation
 Result Presentation
 Summary Generation
 Visualization
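
One simple way to realize the rank / score aggregation step is CombSUM over min-max normalized scores. This is a generic fusion technique picked for illustration, not necessarily the one used in the systems discussed here; the function names are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale a {result: score} map to [0, 1]; a constant list maps to all 1.0."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {result: 1.0 for result in scores}
    return {result: (s - low) / (high - low) for result, s in scores.items()}

def comb_sum(result_lists):
    """CombSUM: add up the normalized scores each result receives across lists."""
    fused = {}
    for scores in result_lists:
        for result, score in min_max_normalize(scores).items():
            fused[result] = fused.get(result, 0.0) + score
    # Highest fused score first; ties broken by identifier for determinism.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))

# Two hypothetical result lists with scores on different scales.
fused = comb_sum([{"a": 10, "b": 5, "c": 0}, {"b": 2.0, "c": 1.0, "d": 0.0}])
print([result for result, _ in fused])
```

Normalizing before summing keeps a source with a large score range from dominating the fused ranking.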

Semsolute's Building Blocks - Keyword / Key Phrase
Interpretation
[Figure: example query “address company san francisco”; label: Entity]
 Semantic entity index
 Inverted index for entities /
triples
 Return entities / entities'
relationships as results to
keys
 Semantic entity ranking
 Structured language model:
one language model for every
attribute
 Returns the entities whose LMs
most likely generate the
keywords, i.e. the entity
descriptions that best match
the keywords
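
The two components above can be sketched roughly as follows: an inverted index from terms to (entity, attribute) pairs selects candidates, and a per-attribute language model scores them. The toy data, the uniform attribute weights, and the interpolation with a background model are all assumptions made for this illustration, not details taken from the actual system.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy entity descriptions: entity -> attribute -> text.
ENTITIES = {
    "acme": {"name": "Acme", "type": "company", "city": "San Francisco"},
    "globex": {"name": "Globex", "type": "company", "city": "Springfield"},
    "sf": {"name": "San Francisco", "type": "city"},
}

def tokenize(text):
    return text.lower().split()

def build_inverted_index(entities):
    """Map each term to the set of (entity, attribute) pairs it occurs in."""
    index = defaultdict(set)
    for entity, attributes in entities.items():
        for attribute, text in attributes.items():
            for term in tokenize(text):
                index[term].add((entity, attribute))
    return index

def score_entity(query, attributes, collection, collection_size, smoothing=0.5):
    """Query log-likelihood under a uniform mixture of per-attribute LMs."""
    models = [Counter(tokenize(text)) for text in attributes.values()]
    log_likelihood = 0.0
    for term in tokenize(query):
        # Add-one smoothed background probability over the whole collection.
        background = (collection[term] + 1) / (collection_size + len(collection) + 1)
        mixture = sum(m[term] / max(sum(m.values()), 1) for m in models) / len(models)
        log_likelihood += math.log(smoothing * mixture + (1 - smoothing) * background)
    return log_likelihood

def rank_entities(query, entities):
    index = build_inverted_index(entities)
    # Candidates: entities matching at least one query term in any attribute.
    candidates = {e for term in tokenize(query) for e, _ in index.get(term, ())}
    collection = Counter(
        t for attrs in entities.values() for text in attrs.values() for t in tokenize(text)
    )
    size = sum(collection.values())
    scored = [(score_entity(query, entities[e], collection, size), e) for e in candidates]
    return [entity for _, entity in sorted(scored, reverse=True)]

print(rank_entities("company san francisco", ENTITIES))
```

Keeping one model per attribute lets a match in a short, specific attribute (such as the type) count for more than the same match buried in a long description.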

Semsolute's Building Blocks – Semantic Graph
Construction
 Offline component: query-
independent schema graph
 Reuse schema
 Pseudo-schema construction:
all possible connections
between classes of entities,
e.g. friendships between users
 Online component: query-
specific keyword matching
elements
 Connect keyword matching
elements / entities to the
classes they belong to
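
A minimal sketch of the two steps, assuming instance data is available as triples plus a type map: the offline step lifts instance-level triples to class-level edges (the pseudo-schema), and the online step attaches keyword-matching entities to their classes. All names and data below are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: (subject, predicate, object) triples and entity types.
TRIPLES = [
    ("acme", "locatedIn", "san_francisco"),
    ("alice", "worksFor", "acme"),
    ("alice", "friendOf", "bob"),
]
TYPES = {"acme": "Company", "san_francisco": "City", "alice": "Person", "bob": "Person"}

def build_pseudo_schema(triples, types):
    """Offline step: derive class-level edges from instance-level triples."""
    schema = set()
    for subject, predicate, obj in triples:
        if subject in types and obj in types:
            schema.add((types[subject], predicate, types[obj]))
    return schema

def augment_with_matches(schema, matches, types):
    """Online step: attach keyword-matching entities to the classes they belong to."""
    graph = defaultdict(set)
    for source, predicate, target in schema:
        graph[source].add((predicate, target))
        graph[target].add((predicate + "^-1", source))  # inverse edge for exploration
    for _keyword, entity in matches:
        graph[entity].add(("type", types[entity]))
        graph[types[entity]].add(("type^-1", entity))
    return graph

schema = build_pseudo_schema(TRIPLES, TYPES)
graph = augment_with_matches(schema, [("francisco", "san_francisco")], TYPES)
print(sorted(graph["City"]))
```

Only the small online step depends on the query, so the class-level graph can be computed once and reused.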

Semsolute's Building Blocks – Graph Exploration
 Top-k graph exploration
 Shortest-path based algorithm
that finds top-k graphs
connecting keyword matching
elements
 Top-k graph ranking
 Language model based
 Aggregated model that
combines the LMs of entities
matching the keywords
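
The sketch below is a strongly simplified stand-in for top-k graph exploration: it runs a breadth-first search from every keyword-matching element and ranks the nodes reachable from all of them by total path length. Each such node, together with its shortest paths to the keyword elements, defines one connecting graph. The toy graph is hypothetical, and ranking by hop count replaces the language-model-based ranking described above.

```python
from collections import deque

# Hypothetical undirected toy graph as an adjacency map.
GRAPH = {
    "san_francisco": ["City"],
    "City": ["san_francisco", "Company"],
    "Company": ["City", "acme", "Person"],
    "acme": ["Company"],
    "Person": ["Company", "alice"],
    "alice": ["Person"],
}

def bfs_distances(graph, start):
    """Hop distance from start to every reachable node."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances

def top_k_connecting_nodes(graph, keyword_elements, k):
    """Rank nodes reachable from all keyword elements by total path length."""
    per_element = [bfs_distances(graph, element) for element in keyword_elements]
    common = set(per_element[0]).intersection(*per_element[1:])
    costs = [(sum(d[node] for d in per_element), node) for node in common]
    return sorted(costs)[:k]  # lowest total cost first, ties broken by node name

print(top_k_connecting_nodes(GRAPH, ["san_francisco", "acme", "alice"], k=2))
```

The function expects at least one keyword element; an empty list is not handled in this sketch.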

Semsolute's Building Blocks – Query Generation &
Processing
[Figure: labels Triple, Relationships / Structure, Entity; example question: “Address of companies located in San Francisco?”]
 Graph to query mapping
 Translation rules that map top
ranked graphs to structured
queries (SQL, SPARQL)
 Translation rules that map
structured queries to natural
language questions
 Graph matching
 Triple index: cover index
supporting different triple
patterns
 Various join implementations
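
To make the two bullets more concrete, the sketch below shows a single translation rule that turns graph edges into a SPARQL SELECT query, and a small triple index that keeps three access paths (by subject, by predicate, by object) so that different triple patterns can be answered. The example.org IRIs, the class name `TripleIndex`, and the choice of three in-memory maps are assumptions for illustration; a real cover index would be considerably more elaborate.

```python
from collections import defaultdict

def to_sparql(edges, select):
    """Translate (subject, predicate_iri, object) edges into a SPARQL SELECT query."""
    patterns = [f"  {s} <{p}> {o} ." for s, p, o in edges]
    return "SELECT " + " ".join(select) + " WHERE {\n" + "\n".join(patterns) + "\n}"

class TripleIndex:
    """Keeps triples under subject, predicate, and object keys for pattern lookup."""

    def __init__(self, triples):
        self.by_s, self.by_p, self.by_o = defaultdict(set), defaultdict(set), defaultdict(set)
        for s, p, o in triples:
            self.by_s[s].add((s, p, o))
            self.by_p[p].add((s, p, o))
            self.by_o[o].add((s, p, o))

    def match(self, s=None, p=None, o=None):
        """Return triples matching a pattern; None acts as a wildcard."""
        if s is not None:
            candidates = self.by_s.get(s, set())
        elif o is not None:
            candidates = self.by_o.get(o, set())
        elif p is not None:
            candidates = self.by_p.get(p, set())
        else:
            candidates = set().union(*self.by_s.values())
        return sorted(
            t for t in candidates
            if (s is None or t[0] == s) and (p is None or t[1] == p) and (o is None or t[2] == o)
        )

# Hypothetical vocabulary under example.org.
EDGES = [
    ("?company", "http://example.org/locatedIn", "<http://example.org/San_Francisco>"),
    ("?company", "http://example.org/address", "?address"),
]
print(to_sparql(EDGES, ["?address"]))

INDEX = TripleIndex([
    ("acme", "locatedIn", "sf"),
    ("acme", "address", "12 Market St"),
    ("globex", "locatedIn", "springfield"),
])
print(INDEX.match(p="locatedIn", o="sf"))
```

Looking up the most selective bound position first keeps the candidate set small before the remaining positions are filtered.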

Yahoo! Spark: Entity Recommendation in
Search
 Different use cases in Web Search
 Some users are short on time
 Need direct answers
 Query expansion, question-answering, information boxes, rich
results…
 Other users want to explore
 Long term interests such as sports, celebrities, movies and music
 Long running tasks such as travel planning
 Spark is a search assistance tool for exploration
 Recommend related entities given the user's current
query
 Based on explicit relations in a Knowledge Base

High-Level Architecture View
[Architecture diagram; components: entity graph, data preprocessing, feature extraction, model learning, feature sources, editorial judgements, datapack, ranking model, ranking and disambiguation, entity data, features]

Spark challenges
 Interpretation and disambiguation
 Obama and Toyota are places in Japan, but maybe
the user is not looking for them
 The popularity of “obama” is not a sign of the
popularity of a Japanese town
 Ranking
 “Release me” from Engelbert Humperdinck should
rank higher than “Lesbian Seagull” which only
appeared on the soundtrack of a Beavis and
Butthead episode
 Editorial relevance vs. what people click
 Large-scale data processing and ML
 Knowledge Base built from Wikipedia, Yahoo! data,
Web extraction
 Feature extraction from query logs, Flickr and Twitter
data
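
The ranking challenge can be illustrated with a toy recommender: candidates come from explicit relations in a knowledge base and are ordered by a single popularity feature. The relation data and the co-occurrence counts are invented for this example, and the one-feature ordering is only a stand-in for a learned ranking model.

```python
# Hypothetical knowledge-base relations: entity -> list of (relation, related entity).
RELATIONS = {
    "engelbert_humperdinck": [
        ("performed", "lesbian_seagull"),
        ("performed", "release_me"),
    ],
}

# Hypothetical feature: how often two entities co-occur in query logs.
CO_OCCURRENCE = {
    ("engelbert_humperdinck", "release_me"): 500,
    ("engelbert_humperdinck", "lesbian_seagull"): 3,
}

def recommend(entity, relations, co_occurrence, k=2):
    """Return up to k related entities, most frequently co-occurring first."""
    related = {target for _relation, target in relations.get(entity, [])}
    ranked = sorted(related, key=lambda t: (-co_occurrence.get((entity, t), 0), t))
    return ranked[:k]

print(recommend("engelbert_humperdinck", RELATIONS, CO_OCCURRENCE))
```

With these counts, "release_me" is recommended ahead of "lesbian_seagull", which matches the ordering the slide asks for.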

Contact
 Peter Mika
 pmika@yahoo-inc.com
 @pmika
 Tran Duc Thanh
 thanh.tran@semsolute.com

Resources
 Detailed information
 Peter Mika. Entity Search on the Web, Keynote at Web of
Linked Entities WS
 Peter Mika, Thanh Tran. Semantic search tutorial
SemTech2012
 Books
 Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern
Information Retrieval. ACM Press. 2011
 Survey papers
 Thanh Tran, Peter Mika. Survey of Semantic Search
Approaches. Under submission, 2012.
 Conferences and workshops
 ISWC, ESWC, WWW, SIGIR, CIKM, SemTech
 Semantic Search workshop series
 Exploiting Semantic Annotations in Information Retrieval
(ESAIR)
 Entity-oriented Search (EOS) workshop
 Web of Linked Entities (WoLE) workshop
