Large-Scale Semantic Search 
Roi Blanco (roi@yahoo-inc.com) 
http://labs.yahoo.com/Yahoo_Labs_Barcelona
Semantic Search 
• Gain insights/value over your data 
– Aggregate 
– Search 
• Adding an “understanding” layer to the stages of a search 
engine 
– Typically very hard, limited success, slow, no clear benefits or 
application … 
– Boils down to generating structure over unstructured text 
• Currently, (more or less) confined within “entity-search” 
– Identifying (or extracting) real-world concepts in free text, with 
types 
– Although that shouldn’t be the end! 
• Borrows from different fields (IR, SW, NLP, DB) 
– Large scale = only the efficient/reliable parts
Search is really fast, without 
necessarily being intelligent
Why Semantic Search? Part I 
• Improvements in IR are harder and harder to 
come by 
– Machine learning using hundreds of features 
• Text-based features for matching 
• Graph-based features provide authority 
– Heavy investment in computational power, e.g. real-time 
indexing and instant search 
• Remaining challenges are not computational, but 
in modeling user cognition 
– Need a deeper understanding of the query, the 
content and/or the world at large 
– Could Watson explain why the answer is Toronto?
Ambiguity
What it’s like to be a machine? 
Roi Blanco
What it’s like to be a machine? 
✜Θ♬♬ţğ 
✜Θ♬♬ţğ√∞§®ÇĤĪ✜★♬☐✓✓ 
ţğ★✜ 
✪✚✜ΔΤΟŨŸÏĞÊϖυτρ℠≠⅛⌫ 
≠=⅚©§★✓♪ΒΓΕ℠ 
✖Γ♫⅜±⏎↵⏏☐ģğğğμλκσςτ 
⏎⌥°¶§ΥΦΦΦ✗✕☐
Poorly solved information needs 
• Multiple interpretations 
– paris hilton 
• Long tail queries 
– george bush (and I mean the beer brewer in Arizona) 
• Multimedia search 
– paris hilton sexy 
• Imprecise or overly precise searches 
– jim hendler 
– pictures of strong adventures people 
• Searches for descriptions 
– countries in africa 
– 34 year old computer scientist living in barcelona 
– reliable digital camera under 300 dollars 
Many of these queries would not be asked by users, who have learned over 
time what search technology can and cannot do.
Use cases in web search 
Top-1 entity with 
structured data 
Related entities 
Structured data 
extracted from HTML
Semantics at every step of the IR process 
[Figure: a user query (“bla bla bla?”) goes to the IR engine, which sits between the user and the Web. Stages shown:] 
• Query interpretation: q = “bla” * 3 
• Document processing 
• Indexing 
• Ranking: θ(q,d) 
• Result presentation
Usability 
We also fail at using the technology 
Sometimes
Annotated documents 
Barack Obama visited Tokyo this Monday as part of an extended Asian trip. 
He is expected to deliver a speech at the ASEAN conference next Tuesday 
Dates attached as annotations: 20 May 2009, 28 May 2009
Semantic annotations help to 
generalize… 
Sports team 
oakland as bradd pitt movie moneyball trailer movies.yahoo.com oakland as wikipedia.org 
Movie 
Actor
… and understand user needs 
what the user wants to do with it 
moneyball trailer 
Movie 
Object of the query
Is NLU that complex? 
“A child of five would understand this. 
Send someone to fetch a child of five.” 
Groucho Marx
Applications 
• Enhanced search 
– Better query understanding 
– Better ranking (tail/hard queries) 
– Better results presentation 
– Use heavy types, dependencies + WSD 
• Advisable to employ models that minimize overfitting (Blanco & Boldi, 
Extending BM25 with multiple query operators, SIGIR 2012) 
• Recommender systems 
– Structured data helps cross-domain recommendation 
• Diversity in search/recommendations 
• Crazy prototypes! 
– From Q&A to mining/retrieving heavily annotated information 
• Even predictions about the future! 
– Matthews et al, 2010. Searching over time in the NYT. HCIR 2010 
• Or systems that return entity-grained answers
Other applications 
• Frequent pattern mining over queries 
– PrefixSpan algorithm (movies) 
• Types as items 
– Film queries are more common than Actor queries 
• Attributes as items 
– Trailers and DVDs are most commonly searched for new movie 
releases 
– Cast and quote queries are most common for older movies 
• Abandonment 
– ML model to predict when users abandon a site in favor of the 
competition 
• Combination of attributes, types for past two queries 
• Tree ensemble ~ set of positive/negative patterns 
L. Hollink, P. Mika and R. Blanco. Web Usage Mining with Semantic 
Analysis. WWW 2013
Large-Scale Semantic Search
How does correlator work? 
Query: Monty Python 
→ Inverted Index (sentence/doc level) 
→ Forward Index (entity level) 
→ Entities: Flying Circus, John Cleese, Brian
Parallel Indexes 
• Standard index contains only tokens 
• Parallel indices contain annotations on the tokens – the 
annotation indices must be aligned with main token index 
• For example: given the sentence “New York has great 
pizza” where New York has been annotated as a LOCATION 
– Token index has five entries 
(“new”, “york”, “has”, “great”, “pizza”) 
– The annotation index has five entries 
(“LOC”, “LOC”, “O”, “O”, “O”) 
Can optionally encode BIO format (e.g. LOC-B, LOC-I) 
• To search for the New York location entity, we search for: 
“token:New ^ entity:LOC token:York ^ entity:LOC”
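The alignment between the token index and the annotation index can be sketched in a few lines. This is a minimal in-memory illustration for a single sentence, not the engine's actual implementation; the function names (`build_parallel_indexes`, `aligned`, `find_entity_phrase`) are hypothetical and positions are 0-based for convenience.

```python
from collections import defaultdict

def build_parallel_indexes(tokens, tags):
    """Build a token index and an annotation index over the same positions."""
    if len(tokens) != len(tags):
        raise ValueError("token and annotation streams must be aligned")
    token_index = defaultdict(set)   # token -> positions
    entity_index = defaultdict(set)  # annotation -> positions
    for pos, (tok, tag) in enumerate(zip(tokens, tags)):
        token_index[tok.lower()].add(pos)
        entity_index[tag].add(pos)
    return token_index, entity_index

def aligned(token_index, entity_index, token, tag):
    """The ^ operator: positions where `token` occurs and carries `tag`."""
    return token_index.get(token.lower(), set()) & entity_index.get(tag, set())

def find_entity_phrase(token_index, entity_index, words, tag):
    """Start positions where consecutive `words` all carry annotation `tag`."""
    starts = aligned(token_index, entity_index, words[0], tag)
    for offset, word in enumerate(words[1:], start=1):
        following = aligned(token_index, entity_index, word, tag)
        starts = {p for p in starts if p + offset in following}
    return sorted(starts)

tokens = ["new", "york", "has", "great", "pizza"]
tags = ["LOC", "LOC", "O", "O", "O"]
token_index, entity_index = build_parallel_indexes(tokens, tags)

# Equivalent of: token:New ^ entity:LOC token:York ^ entity:LOC
matches = find_entity_phrase(token_index, entity_index, ["New", "York"], "LOC")
```

Because both indexes share positions, the annotation lookup is just a set intersection per token; no second pass over the text is needed.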
Parallel Indices (II) 
Doc #3: The last time Peter exercised was in the XXth century. 
Doc #5: Hope claims that in 1994 she run to Peter Town. 
Token postings: 
Peter → D3:4, D5:9 
Town → D5:10 
Hope → D5:1 
1994 → D5:5 
… 
Annotation postings: 
WSJ:PERSON → D3:4, D5:1 
WSJ:CITY → D5:9, D5:10 
WNS:V_DATE → D5:5 
Possible queries: 
“Peter AND run” 
“Peter AND WNS:N_DATE” 
“(WSJ:CITY ^ *) AND run” 
“(WSJ:PERSON ^ Hope) AND run” 
(Bracketing can also be dealt with)
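As a rough sketch of how such queries might be evaluated, the snippet below rebuilds the token postings from the two example documents and intersects them with the annotation postings listed on the slide. The helper names (`align`, `docs_of`, `query_and`) are hypothetical, punctuation is stripped by hand, and positions are 1-based to match the slide.

```python
from collections import defaultdict

docs = {
    3: "The last time Peter exercised was in the XXth century".split(),
    5: "Hope claims that in 1994 she run to Peter Town".split(),
}
# Annotation postings copied from the slide: tag -> {(doc, position)}.
annotations = {
    "WSJ:PERSON": {(3, 4), (5, 1)},
    "WSJ:CITY": {(5, 9), (5, 10)},
}

# Token postings: token -> {(doc, position)}.
token_postings = defaultdict(set)
for doc_id, words in docs.items():
    for pos, word in enumerate(words, start=1):
        token_postings[word].add((doc_id, pos))

def align(tag, token=None):
    """The ^ operator: positions carrying `tag`; token=None plays the role of *."""
    hits = annotations.get(tag, set())
    if token is not None:
        hits = hits & token_postings.get(token, set())
    return hits

def docs_of(postings):
    return {doc_id for doc_id, _ in postings}

def query_and(*posting_sets):
    """Document-level AND over several posting sets."""
    result = docs_of(posting_sets[0])
    for postings in posting_sets[1:]:
        result &= docs_of(postings)
    return sorted(result)

peter_and_run = query_and(token_postings["Peter"], token_postings["run"])
city_and_run = query_and(align("WSJ:CITY"), token_postings["run"])
person_hope_and_run = query_and(align("WSJ:PERSON", "Hope"), token_postings["run"])
person_peter_and_run = query_and(align("WSJ:PERSON", "Peter"), token_postings["run"])
```

The last query shows what the annotation buys: “Peter” as a PERSON occurs only in Doc #3, which does not contain “run”, so the result is empty even though the plain keyword query matches Doc #5.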
Resource Description Framework 
(RDF) 
• Each resource (thing, entity) is identified by a URI 
– Globally unique identifiers 
– Locators of information 
• Data is broken down into individual facts 
– Triples of (subject, predicate, object) 
• A set of triples (an RDF graph) is published together in an RDF document 
RDF document: 
(example:roi, type, foaf:Person) 
(example:roi, name, “Roi Blanco”)
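For illustration, the same two facts can be held as plain (subject, predicate, object) tuples. This is only a sketch of the data model: the `Triple` type and `describe` helper are hypothetical, and the `rdf:` and `foaf:` prefixes on the predicates are an assumption (the slide only labels them “type” and “name”).

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str

# An RDF graph is a set of triples.
graph = {
    Triple("example:roi", "rdf:type", "foaf:Person"),
    Triple("example:roi", "foaf:name", '"Roi Blanco"'),
}

def describe(graph, subject):
    """Collect every fact about one resource as a predicate -> object mapping."""
    return {t.predicate: t.obj for t in graph if t.subject == subject}

roi = describe(graph, "example:roi")
```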
Linked Data: interlinked RDF 
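[Figure: RDF graphs from several sources linked to each other] 
Resources: example:roi, example:roi2, example:peter, foaf:Person 
Literals: “Roi Blanco”, “pmika@yahoo-inc.com” 
Edge labels: name, type, sameAs, worksWith, email 
Sources: Roi’s homepage, Yahoo, Friend-of-a-Friend ontology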
Information access in the Semantic 
Web 
• Database-style indexing of RDF data 
– Triple stores 
– Structural queries (SPARQL) 
– No ranking 
– Evaluation focused on efficiency 
• IR-style indexing of RDF data 
– Search engines 
– Keyword queries 
– Ranking 
– Evaluation focused on effectiveness 
• Combined methods 
– Keyword matching and limited join processing
Search over RDF data 
• Unstructured or hybrid search over RDF data 
– Supporting end-users 
• Users who cannot express their need in SPARQL 
– Dealing with large-scale data 
• Giving up query expressivity for scale 
– Dealing with heterogeneity 
• Users who are unaware of the schema of the data 
• No single schema to the data 
– Example: 2.6m classes and 33k properties in Billion Triples 2009 
• Entity search 
– Queries where the user is looking for a single entity named or 
described in the query 
– e.g. kaz vaporizer, hospice of cincinnati, mst3000
Conclusions 
• Large-scale semantic search should become a 
commodity soon 
– Plenty of open source tools for extraction and linking 
– (Soon) also for indexing and ranking semantic information 
• Research challenges ahead 
– Making all the pieces fit together 
– Using more fine-grained structured information 
(think of context, location, device)
Architecture overview 
Doc → map/reduce → Index 
1. Download, uncompress, convert (if needed) 
2. Sort quads by subject 
3. Compute Minimal Perfect Hash (MPH) 
4. Each mapper reads part of the collection 
5. Each reducer builds an index for a subset of the vocabulary 
6. Optionally, we also build an archive (forward index) 
7. The sub-indices are merged into a single index 
8. Serving and ranking
RDF indexing using MapReduce 
• Text indexing using MapReduce 
– Map: parse input into (term, doc) pairs 
• Pre-processing such as stemming, blacklisting 
• To support phrase queries values are (doc, position) pairs 
– Reduce: collect all values for the same key: (term, {doc1,doc2…}), output 
posting-list 
• Secondary sort to pre-sort document ids before iteration 
• RDF indexing using MapReduce 
– Document is all triples with a given subject 
• Variations: index also RDF molecules, triples where the URI is an object 
– Index terms in property-values 
• Keys are (field, term) pairs 
• Variation: distinguish values for the same property 
– Index terms in the subject URI 
• Variation: index also terms in object URIs
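The map and reduce steps above can be simulated in a single process to make the key/value shapes concrete. This is a sketch under simplifying assumptions, not the Hadoop job itself: `map_rdf`, `shuffle` and `reduce_postings` are hypothetical names, tokenization is a plain lowercase split, positions restart for each property value, and sorting the emitted pairs stands in for the framework's shuffle and secondary sort.

```python
from itertools import groupby

def map_rdf(subject, triples):
    """Map: emit ((field, term), (doc, position)) for each token of a property value."""
    for predicate, value in triples:
        for pos, term in enumerate(value.lower().split()):
            yield (predicate, term), (subject, pos)

def shuffle(pairs):
    """Group pairs by key; sorting also pre-sorts the (doc, position) values."""
    return groupby(sorted(pairs), key=lambda kv: kv[0])

def reduce_postings(key, values):
    """Reduce: collect all values for the same key into one posting list."""
    return key, [value for _, value in values]

# The "document" for a subject is the set of triples with that subject.
collection = {
    "example:roi": [("name", "Roi Blanco")],
    "example:roi2": [("name", "Roi B"), ("label", "Roi")],
}
pairs = [kv for subj, triples in collection.items() for kv in map_rdf(subj, triples)]
index = dict(reduce_postings(key, values) for key, values in shuffle(pairs))
```

Keying on (field, term) rather than on the term alone is what lets the resulting index answer property-restricted queries.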
Horizontal index structure 
• One field per position 
– one for object (token), one for predicates (property), optionally one for context 
• For each term, store the property on the same position in the property 
index 
– Positions are required even without phrase queries 
• Query engine needs to support fields and the alignment operator 
✓ Dictionary is number of unique terms + number of properties 
✓ Occurrences is number of tokens * 2
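A small sketch of the horizontal layout, assuming an in-memory index and a whitespace tokenizer. The names `index_horizontal` and `search_horizontal` and the “employer” and “colleague” properties are made up for the example.

```python
from collections import defaultdict

def index_horizontal(entities):
    """Two aligned fields: tokens in one, the property of each token in the other."""
    token_field = defaultdict(set)     # term -> {(doc, position)}
    property_field = defaultdict(set)  # property -> {(doc, position)}
    for doc, triples in entities.items():
        pos = 0
        for predicate, value in triples:
            for term in value.lower().split():
                token_field[term].add((doc, pos))
                property_field[predicate].add((doc, pos))  # same position
                pos += 1
    return token_field, property_field

def search_horizontal(token_field, property_field, term, predicate):
    """Alignment operator: `term` must sit at a position labelled `predicate`."""
    hits = token_field.get(term, set()) & property_field.get(predicate, set())
    return sorted({doc for doc, _ in hits})

entities = {
    "example:roi": [("name", "Roi Blanco"), ("employer", "Yahoo")],
    "example:peter": [("name", "Peter Mika"), ("colleague", "Roi Blanco")],
}
token_field, property_field = index_horizontal(entities)
dictionary_size = len(token_field) + len(property_field)
occurrences = sum(len(p) for p in token_field.values()) + sum(
    len(p) for p in property_field.values()
)
```

With 7 tokens in the toy collection, the two fields together hold 14 occurrences, which matches the “number of tokens * 2” estimate; positions are needed for the alignment even though no phrase query is involved.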
Vertical index structure 
• One field (index) per property 
• Positions are not required 
• Query engine needs to support fields 
✓ Dictionary is number of unique terms 
✓ Occurrences is number of tokens 
✗ Number of fields is a problem for merging, query performance 
• In experiments we index the N most common properties
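The vertical layout can be sketched the same way. The function name `index_vertical` and the sample properties are hypothetical, and dropping every property outside the N most common ones is an assumption of this sketch: the slide does not say what happens to the rest.

```python
from collections import Counter, defaultdict

def index_vertical(entities, n_fields):
    """One field (index) per property, limited to the n_fields most common properties."""
    counts = Counter(pred for triples in entities.values() for pred, _ in triples)
    kept = {pred for pred, _ in counts.most_common(n_fields)}
    fields = defaultdict(lambda: defaultdict(set))  # property -> term -> {doc}
    for doc, triples in entities.items():
        for predicate, value in triples:
            if predicate not in kept:
                continue  # assumption: rare properties are simply not indexed
            for term in value.lower().split():
                fields[predicate][term].add(doc)  # no positions stored
    return fields

entities = {
    "example:roi": [("name", "Roi Blanco"), ("employer", "Yahoo")],
    "example:peter": [("name", "Peter Mika"), ("colleague", "Roi Blanco")],
}
fields = index_vertical(entities, n_fields=1)
name_hits = sorted(fields["name"]["roi"])
```

Each posting is a bare document id, so occurrences equal the number of tokens, but the number of fields grows with the number of properties kept.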
Big data = data 
• Modern data-sets comprise a mixture of structured and non-structured 
data 
– Text, news, blogs 
– Microformats, RDF 
– Images 
– Video 
– Social media (a mixture too) 
• Transform unstructured data into structured data 
• Entity extraction, disambiguation 
• Provide value over the data 
– Aggregation (BI) 
– Search 
• Scalable semantic search 
– Power next-generation search, recommendation, analytics etc. 
– Improvements linear with resources 
– Lightweight processes, powering interactive real-time experiences
Efficiency improvements 
• r-vertical (reduced-vertical) index 
– One field per weight vs. one field per property 
– More efficient for keyword queries but loses the ability to 
restrict per field 
– Example: three weight levels 
• Pre-computation of alignments 
– Additional term-to-field index 
– Used to quickly determine which fields contain a term (in any 
document)
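Both ideas can be illustrated together. In this sketch the assignment of properties to the three weight levels (`WEIGHT_LEVEL`) is invented for the example, as are the function names; the term-to-field lookup is shown over the weight-level fields but would work the same way over per-property fields.

```python
from collections import defaultdict

# Hypothetical assignment of properties to three weight levels.
WEIGHT_LEVEL = {"name": "high", "label": "high", "comment": "medium"}
DEFAULT_LEVEL = "low"

def index_r_vertical(entities):
    """One field per weight level instead of one field per property."""
    fields = defaultdict(lambda: defaultdict(set))  # level -> term -> {doc}
    for doc, triples in entities.items():
        for predicate, value in triples:
            level = WEIGHT_LEVEL.get(predicate, DEFAULT_LEVEL)
            for term in value.lower().split():
                fields[level][term].add(doc)
    return fields

def term_to_field(fields):
    """Pre-computed alignment: for each term, the fields that contain it anywhere."""
    lookup = defaultdict(set)
    for field, postings in fields.items():
        for term in postings:
            lookup[term].add(field)
    return lookup

entities = {
    "example:roi": [("name", "Roi Blanco"), ("employer", "Yahoo")],
    "example:peter": [("name", "Peter Mika"), ("comment", "works with Roi")],
}
fields = index_r_vertical(entities)
lookup = term_to_field(fields)
```

At query time the lookup tells the engine which fields are worth opening for a term, at the cost of no longer being able to restrict a query to a single property.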
Indexing efficiency 
• Billion Triples 2009 dataset 
– 249 GB in uncompressed N-Quad 
– 114 million URIs and 274 million triples with datatype properties 
– 2.9B / 1.4B occurrences (horiz/vert) 
• Selected 300 most frequent datatype properties for vertical indexing 
• Resulting index is 9-10GB in size 
• Horizontal and vertical indexing using Hadoop 
– Scale is only limited by number of machines 
– Number of reducers is a trade-off between speed and number of sub-indices to be merged
Run-time efficiency 
• Measured average execution time (including ranking) 
– Using 150k queries that lead to a click on Wikipedia 
– Avg. length 2.2 tokens 
– Baseline is plain text indexing with BM25 
• Results 
– Some cost for field-based retrieval compared to plain text indexing 
– AND is always faster than OR 
• Except in horizontal, where alignment time dominates 
– r-vertical significantly improves execution time in OR mode 
             AND mode   OR mode 
plain text   46 ms      80 ms 
horizontal   819 ms     847 ms 
vertical     97 ms      780 ms 
r-vertical   78 ms      152 ms
Efficient element retrieval 
• Goal 
– Given an ad-hoc query, return a list of documents and 
annotations ranked according to their relevance to the query 
• Simple Solution 
– For each document that matches the query, retrieve its 
annotations and return the ones with the highest counts 
• Problems 
– If there are many documents in the result set this will take too 
long - too many disk seeks, too much data to search through 
– What if counting isn’t the best method for ranking elements? 
• Solution 
– Special compressed data structures designed specifically for 
annotation retrieval
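The “simple solution” amounts to a counting loop over the result set. The sketch below assumes a forward index held as a plain dictionary from document id to its annotations; the function name `top_annotations` and the document ids are hypothetical.

```python
from collections import Counter

def top_annotations(matching_docs, forward_index, k):
    """Count annotations over all matching documents and return the k most frequent."""
    counts = Counter()
    for doc in matching_docs:
        # One forward-index lookup per matching document: this is the step
        # that becomes too slow when the result set is large.
        counts.update(forward_index.get(doc, ()))
    return counts.most_common(k)

forward_index = {
    "d1": ["John Cleese", "Flying Circus"],
    "d2": ["John Cleese", "Brian"],
    "d3": ["John Cleese", "Flying Circus"],
}
ranked = top_annotations(["d1", "d2", "d3"], forward_index, k=2)
```

The loop touches every matching document, which is why the slide argues for compressed, in-memory structures once result sets grow.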
Forward Index 
• Access metadata and document contents 
– Length, terms, annotations 
• Compressed (in memory) forward indexes 
– Gamma, Delta, Nibble, Zeta codes (power laws) 
• Retrieving and scoring annotations 
– Sort terms by frequency 
• Random access using an extra compressed 
pointer list (Elias-Fano)
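As one concrete example of the codes listed above, here is a sketch of Elias gamma coding using bit strings for readability. Storing annotation ids as gamma-coded gaps is an assumption made for the example; the slide does not say which values are coded with which scheme, and a real forward index would pack bits rather than build strings.

```python
def gamma_encode(n):
    """Elias gamma code of a positive integer, as a string of '0'/'1'."""
    if n < 1:
        raise ValueError("gamma codes are defined for integers >= 1")
    binary = bin(n)[2:]
    # Unary length prefix (len - 1 zeros) followed by the binary digits.
    return "0" * (len(binary) - 1) + binary

def gamma_decode_all(bits):
    """Decode a concatenation of gamma codes back into integers."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values

def encode_sorted_ids(ids):
    """Encode strictly increasing non-negative ids as gamma-coded gaps."""
    prev, out = -1, []
    for x in ids:
        out.append(gamma_encode(x - prev))
        prev = x
    return "".join(out)

def decode_sorted_ids(bits):
    ids, prev = [], -1
    for gap in gamma_decode_all(bits):
        prev += gap
        ids.append(prev)
    return ids

encoded = encode_sorted_ids([0, 2, 7])
decoded = decode_sorted_ids(encoded)
```

Small values get short codes, which suits power-law distributed data where most gaps or frequencies are small. Because the codes are variable-length, jumping to an arbitrary document's record needs a separate list of offsets, the role the slide assigns to the Elias-Fano pointer list.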

More Related Content

What's hot

Text REtrieval Conference (TREC) Dynamic Domain Track 2015
Text REtrieval Conference (TREC) Dynamic Domain Track 2015Text REtrieval Conference (TREC) Dynamic Domain Track 2015
Text REtrieval Conference (TREC) Dynamic Domain Track 2015
Grace Hui Yang
 
Week12
Week12Week12
Week12
Esha Meher
 
Konsep Dasar Information Retrieval - Edi faizal
Konsep Dasar Information Retrieval - Edi faizal Konsep Dasar Information Retrieval - Edi faizal
Konsep Dasar Information Retrieval - Edi faizal
EdiFaizal2
 
Model of information retrieval (3)
Model  of information retrieval (3)Model  of information retrieval (3)
Model of information retrieval (3)
9866825059
 
Big Data & Text Mining
Big Data & Text MiningBig Data & Text Mining
Big Data & Text Mining
Michel Bruley
 
3. introduction to text mining
3. introduction to text mining3. introduction to text mining
3. introduction to text mining
Lokesh Ramaswamy
 
Apache Solr, il motore di ricerca enterprise open source
Apache Solr, il motore di ricerca enterprise open sourceApache Solr, il motore di ricerca enterprise open source
Apache Solr, il motore di ricerca enterprise open source
Luca Bonesini
 
Tesxt mining
Tesxt miningTesxt mining
Tesxt mining
Maurice Masih
 
Information retrieval system
Information retrieval systemInformation retrieval system
Information retrieval system
Leslie Vargas
 
Information retrieval introduction
Information retrieval introductionInformation retrieval introduction
Information retrieval introduction
nimmyjans4
 
Quick tour all handout
Quick tour all handoutQuick tour all handout
Quick tour all handout
Yi-Shin Chen
 
Text mining
Text miningText mining
Text mining
Koshy Geoji
 
Challenges and opportunities in library discovery services gen
Challenges and opportunities in library discovery services genChallenges and opportunities in library discovery services gen
Challenges and opportunities in library discovery services gen
robin fay
 
Text mining
Text miningText mining
Text mining
Malik Imran
 
CWIN17 Frankfurt / talend_nlp
CWIN17 Frankfurt / talend_nlpCWIN17 Frankfurt / talend_nlp
CWIN17 Frankfurt / talend_nlp
Capgemini
 
Knowledge Representation, Semantic Web
Knowledge Representation, Semantic WebKnowledge Representation, Semantic Web
Knowledge Representation, Semantic Web
Serendipity Seraph
 
Role of Text Mining in Search Engine
Role of Text Mining in Search EngineRole of Text Mining in Search Engine
Role of Text Mining in Search Engine
Jay R Modi
 
Text data mining1
Text data mining1Text data mining1
Text data mining1
KU Leuven
 
Knowledge mangement
Knowledge mangementKnowledge mangement
Knowledge mangement
Serendipity Seraph
 

What's hot (19)

Text REtrieval Conference (TREC) Dynamic Domain Track 2015
Text REtrieval Conference (TREC) Dynamic Domain Track 2015Text REtrieval Conference (TREC) Dynamic Domain Track 2015
Text REtrieval Conference (TREC) Dynamic Domain Track 2015
 
Week12
Week12Week12
Week12
 
Konsep Dasar Information Retrieval - Edi faizal
Konsep Dasar Information Retrieval - Edi faizal Konsep Dasar Information Retrieval - Edi faizal
Konsep Dasar Information Retrieval - Edi faizal
 
Model of information retrieval (3)
Model  of information retrieval (3)Model  of information retrieval (3)
Model of information retrieval (3)
 
Big Data & Text Mining
Big Data & Text MiningBig Data & Text Mining
Big Data & Text Mining
 
3. introduction to text mining
3. introduction to text mining3. introduction to text mining
3. introduction to text mining
 
Apache Solr, il motore di ricerca enterprise open source
Apache Solr, il motore di ricerca enterprise open sourceApache Solr, il motore di ricerca enterprise open source
Apache Solr, il motore di ricerca enterprise open source
 
Tesxt mining
Tesxt miningTesxt mining
Tesxt mining
 
Information retrieval system
Information retrieval systemInformation retrieval system
Information retrieval system
 
Information retrieval introduction
Information retrieval introductionInformation retrieval introduction
Information retrieval introduction
 
Quick tour all handout
Quick tour all handoutQuick tour all handout
Quick tour all handout
 
Text mining
Text miningText mining
Text mining
 
Challenges and opportunities in library discovery services gen
Challenges and opportunities in library discovery services genChallenges and opportunities in library discovery services gen
Challenges and opportunities in library discovery services gen
 
Text mining
Text miningText mining
Text mining
 
CWIN17 Frankfurt / talend_nlp
CWIN17 Frankfurt / talend_nlpCWIN17 Frankfurt / talend_nlp
CWIN17 Frankfurt / talend_nlp
 
Knowledge Representation, Semantic Web
Knowledge Representation, Semantic WebKnowledge Representation, Semantic Web
Knowledge Representation, Semantic Web
 
Role of Text Mining in Search Engine
Role of Text Mining in Search EngineRole of Text Mining in Search Engine
Role of Text Mining in Search Engine
 
Text data mining1
Text data mining1Text data mining1
Text data mining1
 
Knowledge mangement
Knowledge mangementKnowledge mangement
Knowledge mangement
 

Viewers also liked

Semantic differential in design
Semantic differential in designSemantic differential in design
Semantic differential in design
R. Sosa
 
Semantic differential
Semantic differentialSemantic differential
Semantic differential
poojarameshkumar19
 
Likert scale
Likert scaleLikert scale
Likert scale
Abhilash S Ram
 
Likert Scale
Likert ScaleLikert Scale
Likert Scale
anithagrahalakshmi
 
本当は怖いオープンデータ・ビッグデータ
本当は怖いオープンデータ・ビッグデータ本当は怖いオープンデータ・ビッグデータ
本当は怖いオープンデータ・ビッグデータ
pgcafe
 
Strategic research agenda for cocoa coffee Wageningen UR 09062014
Strategic research agenda for cocoa coffee Wageningen UR 09062014Strategic research agenda for cocoa coffee Wageningen UR 09062014
Strategic research agenda for cocoa coffee Wageningen UR 09062014
Verina Ingram
 
User-centred innovation at Digital World Research Centre
User-centred innovation at Digital World Research CentreUser-centred innovation at Digital World Research Centre
User-centred innovation at Digital World Research Centre
Peter Lancaster
 
Greater Halifax: A Global Talent Magnet
Greater Halifax: A Global Talent MagnetGreater Halifax: A Global Talent Magnet
Greater Halifax: A Global Talent Magnet
Halifax Partnership
 
Slinky presentation
Slinky presentationSlinky presentation
Slinky presentation
iLoveGeorgeStr8
 
Corporateblogging
CorporatebloggingCorporateblogging
Corporateblogging
guru5016
 
Werven Utrecht
Werven UtrechtWerven Utrecht
Werven Utrecht
Gemeente
 
Presentacion emprendimiento
Presentacion emprendimientoPresentacion emprendimiento
Presentacion emprendimiento
jessicamena95
 
Englishtestunit73 eso d-2
Englishtestunit73 eso d-2Englishtestunit73 eso d-2
Englishtestunit73 eso d-2
Vicky
 
Wssu session 2
Wssu session 2Wssu session 2
Searching over the past, present and future
Searching over the past, present and futureSearching over the past, present and future
Searching over the past, present and future
Roi Blanco
 
Early Intervention Language Coach Course Overview and Introduction
Early Intervention Language Coach Course Overview and IntroductionEarly Intervention Language Coach Course Overview and Introduction
Early Intervention Language Coach Course Overview and Introduction
Jenny Brown
 
Rene descartes
Rene descartesRene descartes
Rene descartes
Jackie Magaña
 
Top+5+world+flatness
Top+5+world+flatnessTop+5+world+flatness
Top+5+world+flatness
IUisawesome
 
Media evaluation
Media evaluationMedia evaluation
Media evaluation
Becca McPartland
 
Pitching your brand
Pitching your brandPitching your brand
Pitching your brand
Halifax Partnership
 

Viewers also liked (20)

Semantic differential in design
Semantic differential in designSemantic differential in design
Semantic differential in design
 
Semantic differential
Semantic differentialSemantic differential
Semantic differential
 
Likert scale
Likert scaleLikert scale
Likert scale
 
Likert Scale
Likert ScaleLikert Scale
Likert Scale
 
本当は怖いオープンデータ・ビッグデータ
本当は怖いオープンデータ・ビッグデータ本当は怖いオープンデータ・ビッグデータ
本当は怖いオープンデータ・ビッグデータ
 
Strategic research agenda for cocoa coffee Wageningen UR 09062014
Strategic research agenda for cocoa coffee Wageningen UR 09062014Strategic research agenda for cocoa coffee Wageningen UR 09062014
Strategic research agenda for cocoa coffee Wageningen UR 09062014
 
User-centred innovation at Digital World Research Centre
User-centred innovation at Digital World Research CentreUser-centred innovation at Digital World Research Centre
User-centred innovation at Digital World Research Centre
 
Greater Halifax: A Global Talent Magnet
Greater Halifax: A Global Talent MagnetGreater Halifax: A Global Talent Magnet
Greater Halifax: A Global Talent Magnet
 
Slinky presentation
Slinky presentationSlinky presentation
Slinky presentation
 
Corporateblogging
CorporatebloggingCorporateblogging
Corporateblogging
 
Werven Utrecht
Werven UtrechtWerven Utrecht
Werven Utrecht
 
Presentacion emprendimiento
Presentacion emprendimientoPresentacion emprendimiento
Presentacion emprendimiento
 
Englishtestunit73 eso d-2
Englishtestunit73 eso d-2Englishtestunit73 eso d-2
Englishtestunit73 eso d-2
 
Wssu session 2
Wssu session 2Wssu session 2
Wssu session 2
 
Searching over the past, present and future
Searching over the past, present and futureSearching over the past, present and future
Searching over the past, present and future
 
Early Intervention Language Coach Course Overview and Introduction
Early Intervention Language Coach Course Overview and IntroductionEarly Intervention Language Coach Course Overview and Introduction
Early Intervention Language Coach Course Overview and Introduction
 
Rene descartes
Rene descartesRene descartes
Rene descartes
 
Top+5+world+flatness
Top+5+world+flatnessTop+5+world+flatness
Top+5+world+flatness
 
Media evaluation
Media evaluationMedia evaluation
Media evaluation
 
Pitching your brand
Pitching your brandPitching your brand
Pitching your brand
 

Similar to Large-Scale Semantic Search

Mining Web content for Enhanced Search
Mining Web content for Enhanced Search Mining Web content for Enhanced Search
Mining Web content for Enhanced Search
Roi Blanco
 
Semantic Search overview at SSSW 2012
Semantic Search overview at SSSW 2012Semantic Search overview at SSSW 2012
Semantic Search overview at SSSW 2012
Peter Mika
 
Semtech bizsemanticsearchtutorial
Semtech bizsemanticsearchtutorialSemtech bizsemanticsearchtutorial
Semtech bizsemanticsearchtutorial
Barbara Starr
 
CS6007 information retrieval - 5 units notes
CS6007   information retrieval - 5 units notesCS6007   information retrieval - 5 units notes
CS6007 information retrieval - 5 units notes
Anandh Arumugakan
 
IR
IRIR
Exploratory Search upon Semantically Described Web Data Sources: Service regi...
Exploratory Search upon Semantically Described Web Data Sources: Service regi...Exploratory Search upon Semantically Described Web Data Sources: Service regi...
Exploratory Search upon Semantically Described Web Data Sources: Service regi...
Marco Brambilla
 
Semantic Web: introduction & overview
Semantic Web: introduction & overviewSemantic Web: introduction & overview
Semantic Web: introduction & overview
Amit Sheth
 
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning... RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
S. Diana Hu
 
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
Joaquin Delgado PhD.
 
Web search engines and search technology
Web search engines and search technologyWeb search engines and search technology
Web search engines and search technology
Stefanos Anastasiadis
 
Eureka, I found it! - Special Libraries Association 2021 Presentation
Eureka, I found it! - Special Libraries Association 2021 PresentationEureka, I found it! - Special Libraries Association 2021 Presentation
Eureka, I found it! - Special Libraries Association 2021 Presentation
Access Innovations, Inc.
 
Relevancy and Search Quality Analysis - Search Technologies
Relevancy and Search Quality Analysis - Search TechnologiesRelevancy and Search Quality Analysis - Search Technologies
Relevancy and Search Quality Analysis - Search Technologies
enterprisesearchmeetup
 
Riley-o.com
Riley-o.comRiley-o.com
Riley-o.com
Albert Rojas
 
Latest trends in AI and information Retrieval
Latest trends in AI and information Retrieval Latest trends in AI and information Retrieval
Latest trends in AI and information Retrieval
Abhay Ratnaparkhi
 
Entity-Centric Data Management
Entity-Centric Data ManagementEntity-Centric Data Management
Entity-Centric Data Management
eXascale Infolab
 
Scaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrScaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solr
Trey Grainger
 
SemTech 2011 Semantic Search tutorial
SemTech 2011 Semantic Search tutorialSemTech 2011 Semantic Search tutorial
SemTech 2011 Semantic Search tutorial
Peter Mika
 
Practical Information Architecture
Practical Information ArchitecturePractical Information Architecture
Practical Information Architecture
Rob Bogue
 
IRT Unit_I.pptx
IRT Unit_I.pptxIRT Unit_I.pptx
IRT Unit_I.pptx
thenmozhip8
 
05. EDT 513 Week 5 2023 Searching the Internet.pptx
05. EDT 513 Week 5 2023 Searching the Internet.pptx05. EDT 513 Week 5 2023 Searching the Internet.pptx
05. EDT 513 Week 5 2023 Searching the Internet.pptx
Gambari Amosa Isiaka
 

Similar to Large-Scale Semantic Search (20)

Mining Web content for Enhanced Search
Mining Web content for Enhanced Search Mining Web content for Enhanced Search
Mining Web content for Enhanced Search
 
Semantic Search overview at SSSW 2012
Semantic Search overview at SSSW 2012Semantic Search overview at SSSW 2012
Semantic Search overview at SSSW 2012
 
Semtech bizsemanticsearchtutorial
Semtech bizsemanticsearchtutorialSemtech bizsemanticsearchtutorial
Semtech bizsemanticsearchtutorial
 
CS6007 information retrieval - 5 units notes
CS6007   information retrieval - 5 units notesCS6007   information retrieval - 5 units notes
CS6007 information retrieval - 5 units notes
 
IR
IRIR
IR
 
Exploratory Search upon Semantically Described Web Data Sources: Service regi...
Exploratory Search upon Semantically Described Web Data Sources: Service regi...Exploratory Search upon Semantically Described Web Data Sources: Service regi...
Exploratory Search upon Semantically Described Web Data Sources: Service regi...
 
Semantic Web: introduction & overview
Semantic Web: introduction & overviewSemantic Web: introduction & overview
Semantic Web: introduction & overview
 
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning... RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
 
Web search engines and search technology
Web search engines and search technologyWeb search engines and search technology
Web search engines and search technology
 
Eureka, I found it! - Special Libraries Association 2021 Presentation
Eureka, I found it! - Special Libraries Association 2021 PresentationEureka, I found it! - Special Libraries Association 2021 Presentation
Eureka, I found it! - Special Libraries Association 2021 Presentation
 
Relevancy and Search Quality Analysis - Search Technologies
Relevancy and Search Quality Analysis - Search TechnologiesRelevancy and Search Quality Analysis - Search Technologies
Relevancy and Search Quality Analysis - Search Technologies
 
Riley-o.com
Riley-o.comRiley-o.com
Riley-o.com
 
Latest trends in AI and information Retrieval
Latest trends in AI and information Retrieval Latest trends in AI and information Retrieval
Latest trends in AI and information Retrieval
 
Entity-Centric Data Management
Entity-Centric Data ManagementEntity-Centric Data Management
Entity-Centric Data Management
 
Scaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrScaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solr
 
SemTech 2011 Semantic Search tutorial
SemTech 2011 Semantic Search tutorialSemTech 2011 Semantic Search tutorial
SemTech 2011 Semantic Search tutorial
 
Practical Information Architecture
Practical Information ArchitecturePractical Information Architecture
Practical Information Architecture
 
IRT Unit_I.pptx
IRT Unit_I.pptxIRT Unit_I.pptx
IRT Unit_I.pptx
 
05. EDT 513 Week 5 2023 Searching the Internet.pptx
05. EDT 513 Week 5 2023 Searching the Internet.pptx05. EDT 513 Week 5 2023 Searching the Internet.pptx
05. EDT 513 Week 5 2023 Searching the Internet.pptx
 

More from Roi Blanco

From Queries to Answers in the Web
From Queries to Answers in the WebFrom Queries to Answers in the Web
From Queries to Answers in the Web
Roi Blanco
 
Entity Linking via Graph-Distance Minimization
Entity Linking via Graph-Distance MinimizationEntity Linking via Graph-Distance Minimization
Entity Linking via Graph-Distance Minimization
Roi Blanco
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
Roi Blanco
 
Keyword Search over RDF Graphs
Keyword Search over RDF GraphsKeyword Search over RDF Graphs
Keyword Search over RDF Graphs
Roi Blanco
 
Extending BM25 with multiple query operators
Extending BM25 with multiple query operatorsExtending BM25 with multiple query operators
Extending BM25 with multiple query operators
Roi Blanco
 
Energy-Price-Driven Query Processing in Multi-center Web Search Engines
Energy-Price-Driven Query Processing in Multi-center WebSearch EnginesEnergy-Price-Driven Query Processing in Multi-center WebSearch Engines
Energy-Price-Driven Query Processing in Multi-center Web Search Engines
Roi Blanco
 
Effective and Efficient Entity Search in RDF data
Effective and Efficient Entity Search in RDF dataEffective and Efficient Entity Search in RDF data
Effective and Efficient Entity Search in RDF data
Roi Blanco
 
Caching Search Engine Results over Incremental Indices
Caching Search Engine Results over Incremental IndicesCaching Search Engine Results over Incremental Indices
Caching Search Engine Results over Incremental Indices
Roi Blanco
 
Finding support sentences for entities
Finding support sentences for entitiesFinding support sentences for entities
Finding support sentences for entities
Roi Blanco
 

Large-Scale Semantic Search

  • 1. Large-Scale Semantic Search Roi Blanco (roi@yahoo-inc.com) http://labs.yahoo.com/Yahoo_Labs_Barcelona
  • 2. Semantic Search • Gain insights/value over your data – Aggregate – Search • Adding an “understanding” layer to the stages of a search engine – Typically very hard, limited success, slow, no clear benefits or application … – Boils down to generating structure over unstructured text • Currently, (more or less) confined within “entity-search” – Identifying (or extracting) real-world concepts in free text, with types – Although that shouldn’t be the end! • Borrows from different fields (IR, SW, NLP, DB) – Large scale = only the efficient/reliable parts
  • 3. Search is really fast, without necessarily being intelligent
  • 4. Why Semantic Search? Part I • Improvements in IR are harder and harder to come by – Machine learning using hundreds of features • Text-based features for matching • Graph-based features provide authority – Heavy investment in computational power, e.g. real-time indexing and instant search • Remaining challenges are not computational, but in modeling user cognition – Need a deeper understanding of the query, the content and/or the world at large – Could Watson explain why the answer is Toronto?
  • 6. What it’s like to be a machine? Roi Blanco
  • 7. What it’s like to be a machine? ✜Θ♬♬ţğ ✜Θ♬♬ţğ√∞§®ÇĤĪ✜★♬☐✓✓ ţğ★✜ ✪✚✜ΔΤΟŨŸÏĞÊϖυτρ℠≠⅛⌫ ≠=⅚©§★✓♪ΒΓΕ℠ ✖Γ♫⅜±⏎↵⏏☐ģğğğμλκσςτ ⏎⌥°¶§ΥΦΦΦ✗✕☐
  • 8. Poorly solved information needs • Multiple interpretations – paris hilton • Long tail queries Many of these queries would not be asked by users, who learned over time what search technology can and can not do. – george bush (and I mean the beer brewer in Arizona) • Multimedia search – paris hilton sexy • Imprecise or overly precise searches – jim hendler – pictures of strong adventures people • Searches for descriptions – countries in africa – 34 year old computer scientist living in barcelona – reliable digital camera under 300 dollars
  • 9. Use cases in web search Top-1 entity with structured data Related entities Structured data extracted from HTML
  • 10. Semantics at every step of the IR process bla bla bla? bla bla bla The IR engine The Web Query interpretation q=“bla” * 3 Document processing bla bla bla bla bla bla Indexing Ranking θ(q,d) “bla” Result presentation
  • 11. Usability We also fail at using the technology Sometimes
  • 12. Annotated documents Barack Obama visited Tokyo this Monday as part of an extended Asian trip. He is expected to deliver a speech at the ASEAN conference next Tuesday 20 May 2009 28 May 2009 Barack Obama visited Tokyo this Monday as part of an extended Asian trip. He is expected to deliver a speech at the ASEAN conference next Tuesday
  • 13. Semantic annotations help to generalize… Sports team oakland as bradd pitt movie moneyball trailer movies.yahoo.com oakland as wikipedia.org Movie Actor
  • 14. … and understand user needs what the user wants to do with it moneyball trailer Movie Object of the query
  • 15. Is NLU that complex? ”A child of five would understand this. Send someone to fetch a child of five”. Groucho Marx
  • 16. Applications • Enhanced search – Better query understanding – Better ranking (tail/hard queries) – Better results presentation – Use heavy types, dependencies + WSD • Advisable to employ models that minimize overfitting (Blanco & Boldi, Extending BM25 with multiple query operators, SIGIR 2012) • Recommender systems – Structured data helps cross-domain recommendation • Diversity in search/recommendations • Crazy prototypes! – From Q&A to mining/retrieving heavily annotated information • Even predictions about the future! – Matthews et al, 2010. Searching over time in the NYT. HCIR 2010 • Or systems that return entity-grained answers
  • 17. Other applications • Frequent pattern mining over queries – PrefixSpan algorithm (movies) • Types as items – Film queries are more common than Actor queries • Attributes as items – Trailers and DVD are most commonly searched for new movie releases – Cast and quote queries are most common for older movies • Abandonment – ML model to predict when users abandon a site in favor of the competition • Combination of attributes, types for past two queries • Tree ensemble ~ set of positive/negative patterns L. Hollink, P. Mika and R. Blanco. Web Usage Mining with Semantic Analysis. WWW 2013
  • 19. How does correlator work? Monty Python Inverted Index (sentence/doc level) Forward Index (entity level) Flying Circus John Cleese Brian
  • 20. Parallel Indexes • Standard index contains only tokens • Parallel indices contain annotations on the tokens – the annotation indices must be aligned with main token index • For example: given the sentence “New York has great pizza” where New York has been annotated as a LOCATION – Token index has five entries (“new”, “york”, “has”, “great”, “pizza”) – The annotation index has five entries (“LOC”, “LOC”, “O”, “O”, “O”) Can optionally encode BIO format (e.g. LOC-B, LOC-I) • To search for the New York location entity, we search for: “token:New ^ entity:LOC token:York ^ entity:LOC”
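The alignment idea on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions, not the system described in the talk: the function names (`build_parallel_indexes`, `aligned`, `find_entity_phrase`), the in-memory dictionaries, and the second sample document are all made up for the example. The `^` operator is modelled as an intersection of (document, position) pairs.

```python
from collections import defaultdict


def build_parallel_indexes(docs):
    """docs maps doc_id -> list of (token, tag) pairs.

    Returns (token_index, tag_index); each maps a key to a set of
    (doc_id, position) pairs, so both indexes share the same positions.
    """
    token_index = defaultdict(set)
    tag_index = defaultdict(set)
    for doc_id, pairs in docs.items():
        for pos, (token, tag) in enumerate(pairs):
            token_index[token.lower()].add((doc_id, pos))
            tag_index[tag].add((doc_id, pos))
    return token_index, tag_index


def aligned(token_index, tag_index, token, tag):
    """Positions where `token` occurs and carries annotation `tag` (the ^ operator)."""
    return token_index.get(token.lower(), set()) & tag_index.get(tag, set())


def find_entity_phrase(token_index, tag_index, tokens, tag):
    """Start positions where consecutive `tokens` all carry annotation `tag`."""
    starts = aligned(token_index, tag_index, tokens[0], tag)
    for offset, tok in enumerate(tokens[1:], start=1):
        following = aligned(token_index, tag_index, tok, tag)
        starts = {(d, p) for (d, p) in starts if (d, p + offset) in following}
    return sorted(starts)


# Document 1 is the slide's example; document 2 is a made-up distractor in
# which "new york" appears without a LOC annotation.
docs = {
    1: [("New", "LOC"), ("York", "LOC"), ("has", "O"), ("great", "O"), ("pizza", "O")],
    2: [("A", "O"), ("new", "O"), ("york", "O"), ("minster", "O")],
}
token_index, tag_index = build_parallel_indexes(docs)
result = find_entity_phrase(token_index, tag_index, ["New", "York"], "LOC")
```

Only document 1 matches, because the query requires both tokens to be aligned with a LOC annotation at adjacent positions.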
  • 21. Parallel Indices (II) Doc #3: The last time Peter exercised was in the XXth century. Doc #5: Hope claims that in 1994 she run to Peter Town. Peter → D3:4, D5:9 Town → D5:10 Hope → D5:1 1994 → D5:5 … Possible Queries: “Peter AND run” “Peter AND WNS:N_DATE” “(WSJ:CITY ^ *) AND run” “(WSJ:PERSON ^ Hope) AND run” WSJ:PERSON → D3:4, D5:1 WSJ:CITY → D5:9, D5:10 WNS:V_DATE → D5:5 (Bracketing can also be dealt with)
  • 22. Resource Description Framework (RDF) • Each resource (thing, entity) is identified by a URI – Globally unique identifiers – Locators of information • Data is broken down into individual facts – Triples of (subject, predicate, object) • A set of triples (an RDF graph) is published together in an RDF document example:roi “Roi Blanco” type name foaf:Person RDF document
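As a small illustration of the triple model, the two facts on this slide can be written as plain Python tuples and grouped by subject. The `rdf:` and `foaf:` prefixes on the predicates and the helper name `group_by_subject` are assumptions made for the sketch; the slide itself only labels the edges "type" and "name".

```python
from collections import defaultdict

# Facts from the slide as (subject, predicate, object) triples.
# The predicate prefixes ("rdf:", "foaf:") are assumed for illustration.
triples = [
    ("example:roi", "rdf:type", "foaf:Person"),
    ("example:roi", "foaf:name", "Roi Blanco"),
]


def group_by_subject(triples):
    """Collect the (predicate, object) pairs published about each subject."""
    by_subject = defaultdict(list)
    for subject, predicate, obj in triples:
        by_subject[subject].append((predicate, obj))
    return dict(by_subject)


graph = group_by_subject(triples)
```

Grouping by subject gives one record per resource, which is a convenient unit when the data is later indexed for search.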
  • 23. Linked Data: interlinked RDF example:roi “Roi Blanco” name foaf:Person sameAs example:roi2 worksWith example:peter type email “pmika@yahoo-inc.com” type Roi’s homepage Yahoo Friend-of-a-Friend ontology
  • 24. Information access in the Semantic Web • Database-style indexing of RDF data – Triple stores – Structural queries (SPARQL) – No ranking – Evaluation focused on efficiency • IR-style indexing of RDF data – Search engines – Keyword queries – Ranking – Evaluation focused on effectiveness • Combined methods – Keyword matching and limited join processing
  • 25. Search over RDF data • Unstructured or hybrid search over RDF data – Supporting end-users • Users who can not express their need in SPARQL – Dealing with large-scale data • Giving up query expressivity for scale – Dealing with heterogeneity • Users who are unaware of the schema of the data • No single schema to the data – Example: 2.6m classes and 33k properties in Billion Triples 2009 • Entity search – Queries where the user is looking for a single entity named or described in the query – e.g. kaz vaporizer, hospice of cincinnati, mst3000
  • 26. Conclusions • Large-scale semantic search should become a commodity soon – Plenty of open source tools for extraction, linking – (soon) and indexing, ranking semantic information • Research challenges ahead – Making all the pieces fit together – Using more fine-grained structured information (think of context, location, device)
  • 27. Architecture overview Doc 1. Download, uncompress, convert (if needed) 2. Sort quads by subject 3. Compute Minimal Perfect Hash (MPH) map map reduce reduce map reduce Index 3. Each mapper reads part of the collection 4. Each reducer builds an index for a subset of the vocabulary 5. Optionally, we also build an archive (forward-index) 5. The sub-indices are merged into a single index 6. Serving and Ranking
  • 28. RDF indexing using MapReduce • Text indexing using MapReduce – Map: parse input into (term, doc) pairs • Pre-processing such as stemming, blacklisting • To support phrase queries values are (doc, position) pairs – Reduce: collect all values for the same key: (term, {doc1,doc2…}), output posting-list • Secondary sort to pre-sort document ids before iteration • RDF indexing using MapReduce – Document is all triples with a given subject • Variations: index also RDF molecules, triples where the URI is an object – Index terms in property-values • Keys are (field, term) pairs • Variation: distinguish values for the same property – Index terms in the subject URI • Variation: index also terms in object URIs
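The map and reduce steps above can be simulated in a single process to show the shape of the keys and values. This is a toy sketch under assumptions, not Hadoop code: `tokenize`, `map_phase`, `reduce_phase` and the sample documents are hypothetical, and the sort inside the reducer stands in for the secondary sort mentioned on the slide.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumerics (a stand-in for real pre-processing)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def map_phase(subject, property_values):
    """Emit ((field, term), (doc, position)) pairs for one RDF 'document',
    i.e. all property values that share the same subject."""
    position = 0
    for prop, value in property_values:
        for term in tokenize(value):
            yield (prop, term), (subject, position)
            position += 1


def reduce_phase(pairs):
    """Group values by key and sort them, producing one posting list per key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sorted(values) for key, values in grouped.items()}


# Sample documents keyed by subject URI; the property names are illustrative.
documents = {
    "example:roi": [("name", "Roi Blanco"), ("email", "roi@yahoo-inc.com")],
    "example:peter": [("email", "pmika@yahoo-inc.com")],
}
emitted = [pair for subj, pv in documents.items() for pair in map_phase(subj, pv)]
index = reduce_phase(emitted)
```

Because the key is a (field, term) pair, the term "roi" in a name and the term "roi" in an email address end up in separate posting lists.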
  • 29. Horizontal index structure • One field per position – one for object (token), one for predicates (property), optionally one for context • For each term, store the property on the same position in the property index – Positions are required even without phrase queries • Query engine needs to support fields and the alignment operator ✓ Dictionary is number of unique terms + number of properties ✓ Occurrences is number of tokens * 2
  • 30. Vertical index structure • One field (index) per property • Positions are not required • Query engine needs to support fields ✓ Dictionary is number of unique terms ✓ Occurrences is number of tokens ✗ Number of fields is a problem for merging, query performance • In experiments we index the N most common properties
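The difference between the horizontal and vertical layouts can be made concrete with a toy example. Everything here is an assumption for illustration (the function names, the dictionary-based storage, the two sample documents); it only mirrors the counting arguments on the two slides above.

```python
from collections import defaultdict


def tokens_of(doc):
    """Yield (property, term) for every token in a document's property values."""
    for prop, value in doc:
        for term in value.lower().split():
            yield prop, term


def build_horizontal(docs):
    """Two position-aligned fields: one holds terms, the other holds properties."""
    token_field = defaultdict(list)     # term -> [(doc, position)]
    property_field = defaultdict(list)  # property -> [(doc, position)]
    for doc_id, doc in docs.items():
        for pos, (prop, term) in enumerate(tokens_of(doc)):
            token_field[term].append((doc_id, pos))
            property_field[prop].append((doc_id, pos))
    return token_field, property_field


def build_vertical(docs):
    """One small index per property; no positions are stored."""
    fields = defaultdict(lambda: defaultdict(list))  # property -> term -> [doc]
    for doc_id, doc in docs.items():
        for prop, term in tokens_of(doc):
            fields[prop][term].append(doc_id)
    return fields


def occurrences(index):
    return sum(len(postings) for postings in index.values())


docs = {
    "d1": [("name", "roi blanco"), ("affiliation", "yahoo labs")],
    "d2": [("name", "yahoo")],
}
token_field, property_field = build_horizontal(docs)
vertical = build_vertical(docs)

horizontal_occ = occurrences(token_field) + occurrences(property_field)
vertical_occ = sum(occurrences(field) for field in vertical.values())
horizontal_dict = len(token_field) + len(property_field)

# "yahoo" restricted to the name property, answered both ways.
horizontal_hits = set(token_field["yahoo"]) & set(property_field["name"])
vertical_hits = vertical["name"]["yahoo"]
```

With five tokens in the collection, the horizontal layout stores ten occurrences and the vertical layout stores five. The horizontal query needs the position-level intersection (the alignment operator), while the vertical query is a plain lookup in the field for that property.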
  • 31. Big data = data • Modern data-sets comprise a mixture of structured and non-structured data – Text, news, blogs – Microformats, rdf – Images – Video – Social media (a mixture too) • Transform unstructured data into structured data • Entity extraction, disambiguation • Provide value over the data – Aggregation (BI) – Search • Scalable semantic search – Power next-generation search, recommendation, analytics etc. – Improvements linear with resources – Lightweight processes, powering interactive real-time experiences
  • 32. Efficiency improvements • r-vertical (reduced-vertical) index – One field per weight vs. one field per property – More efficient for keyword queries but loses the ability to restrict per field – Example: three weight levels • Pre-computation of alignments – Additional term-to-field index – Used to quickly determine which fields contain a term (in any document)
  • 33. Indexing efficiency • Billion Triples 2009 dataset – 249 GB in uncompressed N-Quad – 114 million URIs and 274 million triples with datatype properties – 2.9B / 1.4B occurrences (horiz/vert) • Selected 300 most frequent datatype properties for vertical indexing • Resulting index is 9-10GB in size • Horizontal and vertical indexing using Hadoop – Scale is only limited by number of machines – Number of reducers is a trade-off between speed and number of sub-indices to be merged
  • 34. Run-time efficiency • Measured average execution time (including ranking) – Using 150k queries that lead to a click on Wikipedia – Avg. length 2.2 tokens – Baseline is plain text indexing with BM25 • Results – Some cost for field-based retrieval compared to plain text indexing – AND is always faster than OR • Except in horizontal, where alignment time dominates – r-vertical significantly improves execution time in OR mode • Execution time (AND mode / OR mode): plain text 46 ms / 80 ms; horizontal 819 ms / 847 ms; vertical 97 ms / 780 ms; r-vertical 78 ms / 152 ms
  • 35. Efficient element retrieval • Goal – Given an ad-hoc query, return a list of documents and annotations ranked according to their relevance to the query • Simple Solution – For each document that matches the query, retrieve its annotations and return the ones with the highest counts • Problems – If there are many documents in the result set this will take too long - too many disk seeks, too much data to search through – What if counting isn’t the best method for ranking elements? • Solution – Special compressed data structures designed specifically for annotation retrieval
  • 36. Forward Index • Access metadata and document contents – Length, terms, annotations • Compressed (in memory) forward indexes – Gamma, Delta, Nibble, Zeta codes (power laws) • Retrieving and scoring annotations – Sort terms by frequency • Random access using an extra compressed pointer list (Elias-Fano)
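Of the codes listed on this slide, the Elias gamma code is the simplest to show. The sketch below encodes a positive integer n as floor(log2 n) zero bits followed by the binary form of n, and uses it on the gaps of a sorted id list. The gap encoding, the string-of-bits representation and the function names are assumptions chosen for readability; the slide does not say which values are coded this way, and a real implementation would pack bits rather than build strings.

```python
def gamma_encode(n):
    """Elias gamma code of a positive integer, as a string of '0'/'1'."""
    if n < 1:
        raise ValueError("gamma codes are defined for integers >= 1")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def gamma_decode_all(bits):
    """Decode a concatenation of gamma codes back into a list of integers."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":  # the unary prefix gives the length of the binary part
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


def encode_sorted_ids(ids):
    """Gamma-code the gaps of a strictly increasing list of ids (all >= 1)."""
    previous = 0
    codes = []
    for value in ids:
        codes.append(gamma_encode(value - previous))
        previous = value
    return "".join(codes)


def decode_sorted_ids(bits):
    ids, total = [], 0
    for gap in gamma_decode_all(bits):
        total += gap
        ids.append(total)
    return ids


encoded = encode_sorted_ids([3, 4, 9])
decoded = decode_sorted_ids(encoded)
```

Small values get short codes, which is why such codes suit the skewed (power-law) distributions the slide refers to.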