Effective and Efficient Entity Search in RDF data 
Roi Blanco¹, Peter Mika¹ and Sebastiano Vigna²
¹ Yahoo! Research
² Università degli Studi di Milano
Semantic Search 
• Unstructured or hybrid search over RDF data 
– Supporting end-users 
• Users who cannot express their needs in SPARQL
– Dealing with large-scale data 
• Giving up query expressivity for scale 
– Dealing with heterogeneity 
• Users who are unaware of the schema of the data 
• No single schema to the data 
– Example: 2.6M classes and 33k properties in Billion Triples 2009
• Entity search 
– Queries where the user is looking for a single entity named or described in the query
– e.g. kaz vaporizer, hospice of cincinnati, mst3000
Use cases in web search
• Top-1 entity with structured data
• Related entities
• Structured data extracted from HTML
Information access in the Semantic Web 
• Database-style indexing of RDF data 
– Triple stores 
– Structural queries (SPARQL) 
– No ranking 
– Evaluation focused on efficiency 
• IR-style indexing of RDF data 
– Search engines 
– Keyword queries 
– Ranking 
– Evaluation focused on effectiveness 
• Combined methods 
– Keyword matching and limited join processing
Related works 
• Ranking methods on RDF data 
– Wang et al. Semplore: A scalable IR approach to search the 
Web of Data. ISWC 2007, JWS 7(3) 
– Pérez-Agüera et al. Using BM25F for semantic search. 
SemSearch 2010 
– (many others) 
• Evaluation campaigns 
– SemSearch Challenge 2010, 2011 
– Question-Answering over Linked Data (QALD) 2011 
– TREC Entity Track 2010, 2011 
• Keyword search in databases 
– No open evaluation campaigns
1st part of the talk 2nd part 
Architecture overview
1. Download, uncompress, convert (if needed)
2. Sort quads by subject
3. Compute Minimal Perfect Hash (MPH)
[Diagram: Doc → map / reduce → Index]
4. Each mapper reads part of the collection
5. Each reducer builds an index for a subset of the vocabulary
6. Optionally, we also build an archive (forward-index)
7. The sub-indices are merged into a single index
8. Serving and Ranking
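The pre-processing steps of the pipeline (sorting quads by subject, then assigning each subject a dense document id) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy quads and the function names are hypothetical, and a plain dictionary stands in for the minimal perfect hash.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical toy quads: (subject, predicate, object, context).
QUADS = [
    ("ex:b", "foaf:name", "Bob", "ex:g1"),
    ("ex:a", "foaf:name", "Alice", "ex:g1"),
    ("ex:a", "foaf:age", "42", "ex:g2"),
]

def group_by_subject(quads):
    """Sort quads by subject and yield one (subject, quads) document per subject."""
    ordered = sorted(quads, key=itemgetter(0))
    for subject, group in groupby(ordered, key=itemgetter(0)):
        yield subject, list(group)

def assign_doc_ids(subjects):
    """Map each subject to a dense integer id.

    Assumption: an ordinary dict replaces the minimal perfect hash here;
    a real MPH gives the same subject -> id mapping in far less memory.
    """
    return {subject: i for i, subject in enumerate(sorted(set(subjects)))}

docs = dict(group_by_subject(QUADS))
doc_ids = assign_doc_ids(docs)
```

Sorting first means all quads of one subject are adjacent, so each document can be assembled in a single streaming pass.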
RDF indexing using MapReduce 
• Text indexing using MapReduce 
– Map: parse input into (term, doc) pairs 
• Pre-processing such as stemming, blacklisting 
• To support phrase queries, values are (doc, position) pairs
– Reduce: collect all values for the same key: (term, {doc1,doc2…}), 
output posting-list 
• Secondary sort to pre-sort document ids before iteration 
• RDF indexing using MapReduce (see Mika, SemSearch 2009) 
– Document is all triples with a given subject 
• Variations: index also RDF molecules, triples where the URI is an object 
– Index terms in property-values 
• Keys are (field, term) pairs 
• Variation: distinguish values for the same property 
– Index terms in the subject URI 
• Variation: index also terms in object URIs 
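The map and reduce phases described above can be simulated in a single process as a rough sketch. The toy documents and function names are hypothetical; tokenization is a plain lowercase split, and positions restart at zero for each property value, which is one possible choice rather than something the slides specify.

```python
from collections import defaultdict

# Hypothetical toy collection: doc id -> list of (property, literal value).
DOCS = {
    0: [("foaf:name", "Alice Smith"), ("rdfs:label", "alice")],
    1: [("foaf:name", "Bob Smith")],
}

def map_phase(doc_id, triples):
    """Emit ((field, term), (doc, position)) pairs for one document."""
    for prop, value in triples:
        for position, term in enumerate(value.lower().split()):
            yield (prop, term), (doc_id, position)

def reduce_phase(pairs):
    """Collect all values that share a key and sort them into a posting list."""
    postings = defaultdict(list)
    for key, value in pairs:
        postings[key].append(value)
    # Sorting here plays the role of the secondary sort on document ids.
    return {key: sorted(values) for key, values in postings.items()}

pairs = [p for doc_id, triples in DOCS.items() for p in map_phase(doc_id, triples)]
index = reduce_phase(pairs)
```

Looking up `index[("foaf:name", "smith")]` then returns the posting list for the term "smith" restricted to the `foaf:name` field.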
Horizontal index structure 
• One field per position 
– one for object (token), one for predicates (property), optionally one for context 
• For each term, store the property on the same position in the 
property index 
– Positions are required even without phrase queries 
• Query engine needs to support fields and the alignment operator 
✓ Dictionary is number of unique terms + number of properties
✓ Occurrences is number of tokens * 2
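A small sketch of the horizontal layout may help: a token field and a property field share positions, and the alignment operator intersects them. The names and toy data are hypothetical, and the set intersection is a simplification of what a real query engine does over compressed posting lists.

```python
from collections import defaultdict

# Hypothetical toy collection: doc id -> list of (property, literal value).
DOCS = {
    0: [("foaf:name", "Alice Smith"), ("rdfs:label", "alice")],
    1: [("foaf:name", "Bob Smith")],
}

def build_horizontal(docs):
    """Build a token index and a property index whose positions line up."""
    token_index = defaultdict(list)     # term -> [(doc, position)]
    property_index = defaultdict(list)  # property -> [(doc, position)]
    for doc_id, triples in docs.items():
        position = 0
        for prop, value in triples:
            for term in value.lower().split():
                token_index[term].append((doc_id, position))
                property_index[prop].append((doc_id, position))
                position += 1
    return token_index, property_index

def aligned(token_index, property_index, term, prop):
    """Alignment operator: docs where `term` sits at a position owned by `prop`."""
    hits = set(token_index.get(term, [])) & set(property_index.get(prop, []))
    return sorted({doc for doc, _ in hits})

tokens, properties = build_horizontal(DOCS)
```

Every token produces one posting in each field, which is why the occurrence count is twice the number of tokens, and why positions are needed even when phrase queries are not.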
Vertical index structure 
• One field (index) per property 
• Positions are not required 
• Query engine needs to support fields 
✓ Dictionary is number of unique terms
✓ Occurrences is number of tokens
✗ Number of fields is a problem for merging, query performance 
• In experiments we index the N most common properties
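For comparison, the vertical layout can be sketched as one inverted index per property, restricted to the N most common properties. The function name, the toy data and the way "most common" is counted (by number of property values) are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy collection: doc id -> list of (property, literal value).
DOCS = {
    0: [("foaf:name", "Alice Smith"), ("rdfs:label", "alice")],
    1: [("foaf:name", "Bob Smith")],
}

def build_vertical(docs, top_n):
    """Build one positionless inverted index per property, for the top_n properties."""
    counts = Counter(prop for triples in docs.values() for prop, _ in triples)
    kept = {prop for prop, _ in counts.most_common(top_n)}
    fields = {prop: defaultdict(set) for prop in kept}
    for doc_id, triples in docs.items():
        for prop, value in triples:
            if prop not in kept:
                continue  # properties outside the top_n are not indexed
            for term in value.lower().split():
                fields[prop][term].add(doc_id)
    return fields

fields = build_vertical(DOCS, top_n=1)
```

Each token is stored once and without a position, but the number of separate indices grows with `top_n`, which is the merging and query-time cost noted above.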
Efficiency improvements 
• r-vertical (reduced-vertical) index 
– One field per weight vs. one field per property 
– More efficient for keyword queries but loses the ability to 
restrict per field 
– Example: three weight levels 
• Pre-computation of alignments 
– Additional term-to-field index 
– Used to quickly determine which fields contain a term (in any 
document)
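Both improvements can be illustrated with a short sketch. The property-to-level mapping, the default level and the function names are hypothetical; the slides only state that properties are collapsed into weight levels and that a term-to-field index is pre-computed.

```python
from collections import defaultdict

# Hypothetical toy collection and a hypothetical manual classification.
DOCS = {
    0: [("foaf:name", "Alice Smith"), ("ex:note", "met alice")],
    1: [("dc:title", "Smith report")],
}
LEVELS = {"foaf:name": "important", "ex:note": "unimportant"}

def build_r_vertical(docs, levels, default="neutral"):
    """One field per weight level instead of one field per property."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, triples in docs.items():
        for prop, value in triples:
            level = levels.get(prop, default)
            for term in value.lower().split():
                index[level][term].add(doc_id)
    return index

def build_term_to_field(docs):
    """Pre-computed alignments: for each term, the fields that contain it anywhere."""
    term_fields = defaultdict(set)
    for triples in docs.values():
        for prop, value in triples:
            for term in value.lower().split():
                term_fields[term].add(prop)
    return term_fields

r_vertical = build_r_vertical(DOCS, LEVELS)
term_fields = build_term_to_field(DOCS)
```

After the collapse, a query can no longer be restricted to `foaf:name` specifically, only to the "important" level, which is the trade-off the slide mentions.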
Indexing efficiency 
• Billion Triples 2009 dataset 
– 249 GB of uncompressed N-Quads
– 114 million URIs and 274 million triples with datatype properties
– 2.9B / 1.4B occurrences (horizontal / vertical)
• Selected 300 most frequent datatype properties for vertical indexing 
• Resulting index is 9-10GB in size 
• Horizontal and vertical indexing using Hadoop 
– Scale is only limited by number of machines 
– Number of reducers is a trade-off between speed and number of sub-indices to be merged
Run-time efficiency 
• Measured average execution time (including ranking) 
– Using 150k queries that lead to a click on Wikipedia 
– Avg. length 2.2 tokens 
– Baseline is plain text indexing with BM25 
• Results 
– Some cost for field-based retrieval compared to plain text indexing 
– AND is faster than OR in every configuration
• The gap nearly vanishes in horizontal, where alignment time dominates
– r-vertical significantly improves execution time in OR mode 
Index        AND mode   OR mode
plain text   46 ms      80 ms
horizontal   819 ms     847 ms
vertical     97 ms      780 ms
r-vertical   78 ms      152 ms
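To make the cost of field-based retrieval easier to compare, the timings can be expressed relative to the plain-text baseline. The figures below are copied from the table; the helper name is hypothetical.

```python
# Average execution times in milliseconds, as (AND mode, OR mode).
TIMES_MS = {
    "plain text": (46, 80),
    "horizontal": (819, 847),
    "vertical": (97, 780),
    "r-vertical": (78, 152),
}

def slowdown(times, baseline="plain text"):
    """Return each index's (AND, OR) time as a multiple of the baseline's."""
    base_and, base_or = times[baseline]
    return {
        name: (and_ms / base_and, or_ms / base_or)
        for name, (and_ms, or_ms) in times.items()
    }

ratios = slowdown(TIMES_MS)
for name, (and_ratio, or_ratio) in ratios.items():
    print(f"{name:<11} AND x{and_ratio:.1f}  OR x{or_ratio:.1f}")
```

In OR mode, r-vertical stays under twice the baseline time while vertical is nearly ten times slower, which is the improvement the results bullet refers to.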
BM25F Ranking 
BM25(F) uses a term frequency (tf) that accounts for the
decreasing marginal contribution of terms:
$$\tilde{tf}_i = \sum_s v_s \frac{tf_{si}}{B_s}$$
where
$v_s$ is the weight of field $s$
$tf_{si}$ is the frequency of term $i$ in field $s$
$B_s$ is the document length normalization factor:
$$B_s = (1 - b_s) + b_s \frac{l_s}{avl_s}$$
$l_s$ is the length of field $s$
$avl_s$ is the average length of $s$
$b_s$ is a tunable parameter
BM25F ranking cont. 
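• Final term score is a combination of tf and idf:
$$w_i^{BM25F} = \frac{\tilde{tf}_i}{k_1 + \tilde{tf}_i} \cdot w_i^{IDF}$$
where
$\tilde{tf}_i$ is the field-weighted term frequency of term $i$
$k_1$ is a tunable parameter
$w_i^{IDF}$ is the inverse document frequency (standard form, with $N$ the number of documents and $n_i$ the number of documents containing term $i$):
$$w_i^{IDF} = \log \frac{N - n_i + 0.5}{n_i + 0.5}$$
• Finally, the score of a document D is the sum of the scores of query terms q:
$$score(D, q) = \sum_{i \in q} w_i^{BM25F}$$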
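The ranking function described on these two slides can be sketched in a few lines. This assumes the standard BM25F formulation (field-weighted, length-normalized term frequency, saturation with k1, and the usual logarithmic idf); the function signature, argument names and toy values are hypothetical, and document weights are left out.

```python
import math

def bm25f_score(query_terms, doc_fields, field_weights, avg_len, b, k1,
                n_docs, doc_freq):
    """Score one document for a keyword query.

    doc_fields maps field name -> list of tokens; field_weights, avg_len and b
    map field name -> v_s, avl_s and b_s; doc_freq maps term -> n_i.
    """
    score = 0.0
    for term in query_terms:
        tf = 0.0
        for field, tokens in doc_fields.items():
            if not tokens:
                continue  # an empty field contributes nothing
            # B_s = (1 - b_s) + b_s * l_s / avl_s
            norm = (1.0 - b[field]) + b[field] * len(tokens) / avg_len[field]
            tf += field_weights[field] * tokens.count(term) / norm
        n_i = doc_freq.get(term, 0)
        idf = math.log((n_docs - n_i + 0.5) / (n_i + 0.5))
        score += tf / (k1 + tf) * idf  # saturating tf times idf
    return score

# Hypothetical single-field example.
example = bm25f_score(
    query_terms=["alice"],
    doc_fields={"name": ["alice", "alice", "smith"]},
    field_weights={"name": 1.0},
    avg_len={"name": 3.0},
    b={"name": 0.0},
    k1=1.0,
    n_docs=10,
    doc_freq={"alice": 1},
)
```

Because the term frequency is divided by `k1 + tf`, repeating a term adds less and less to the score, which is the decreasing marginal contribution mentioned at the start of the BM25F slides.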
Effectiveness evaluation 
• Semantic Search Challenge 2010 
– Data, queries, assessments available online 
• Billion Triples Challenge 2009 dataset 
• 92 entity queries from web search 
– Queries where the user is looking for a single entity 
– Sampled randomly from Microsoft and Yahoo! query logs 
• Assessed using Amazon’s Mechanical Turk 
– Halpin et al. Evaluating Ad-Hoc Object Retrieval, IWEST 2010 
– Blanco et al. Repeatable and Reliable Search System 
Evaluation using Crowd-Sourcing, SIGIR 2011
Evaluation form
Implementation 
• Simplified model to reduce the number of parameters 
– Three levels of v_s: important, neutral, unimportant
– Assign weights to domains instead of individual doc weights w_D
– Single parameter b for all b_s
– Single parameter l for all l_s, bounded by a maximum l_max = 10
• Manually classified a small number of properties and 
domains into important, neutral, unimportant 
– Future work to learn this classification 
– Weights are learned (see next)
Effectiveness results 
• Individual features 
– Positive, stat. significant improvement from each feature 
– Even a manual classification of properties and domains helps 
• Combination 
– Positive stat. significant marginal improvement from each additional feature 
– Total improvement of 53% over the baseline 
– Different signals of relevance
Comparison to SemSearch’10 
• Two-fold cross validation 
• Tuning all parameters at the same time 
– Promising directions algorithm (Robertson and Zaragoza) 
• 42% improvement over the best method submitted 
• Performs well on short, specific queries with many results 
– Negative examples: the morning call lehigh valley pa 
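The tuning protocol can be outlined as below. Note the substitution: a plain greedy coordinate search stands in for the promising directions algorithm, which is more elaborate. All function names, the `evaluate` callback and the candidate grids are hypothetical.

```python
import random

def two_fold_splits(query_ids, seed=0):
    """Split queries into two halves and return both (train, test) pairings."""
    ids = sorted(query_ids)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    first, second = ids[:half], ids[half:]
    return [(first, second), (second, first)]

def coordinate_search(objective, start, candidates, sweeps=3):
    """Greedy search over one parameter at a time, keeping any improvement.

    Assumption: this is a simple stand-in, not the promising directions
    algorithm itself.
    """
    best = dict(start)
    best_value = objective(best)
    for _ in range(sweeps):
        for name, values in candidates.items():
            for value in values:
                trial = dict(best, **{name: value})
                trial_value = objective(trial)
                if trial_value > best_value:
                    best, best_value = trial, trial_value
    return best, best_value

def cross_validate(query_ids, evaluate, start, candidates):
    """Tune on one fold, test on the other, and average the two test scores.

    evaluate(params, queries) is assumed to return a mean effectiveness
    score for the given parameters on the given queries.
    """
    scores = []
    for train, test in two_fold_splits(query_ids):
        params, _ = coordinate_search(
            lambda p: evaluate(p, train), start, candidates
        )
        scores.append(evaluate(params, test))
    return sum(scores) / len(scores)
```

Tuning all parameters in the same loop, rather than fixing some of them first, matches the "at the same time" point above; only the search strategy differs from the one used in the talk.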
Conclusions 
• Indexing and ranking RDF data 
– Novel index structures 
– Ranking method based on BM25F 
• Future work 
– Ranking documents with metadata 
• e.g. microdata/RDFa 
– Exploiting more semantics 
• e.g. sameAs 
– Ranking triples for display 
– Question-answering

More Related Content

What's hot

Algorithms for Query Processing and Optimization of Spatial Operations
Algorithms for Query Processing and Optimization of Spatial OperationsAlgorithms for Query Processing and Optimization of Spatial Operations
Algorithms for Query Processing and Optimization of Spatial OperationsNatasha Mandal
 
13. Query Processing in DBMS
13. Query Processing in DBMS13. Query Processing in DBMS
13. Query Processing in DBMSkoolkampus
 
Textmining Retrieval And Clustering
Textmining Retrieval And ClusteringTextmining Retrieval And Clustering
Textmining Retrieval And Clusteringguest0edcaf
 
SQL: Query optimization in practice
SQL: Query optimization in practiceSQL: Query optimization in practice
SQL: Query optimization in practiceJano Suchal
 
Query processing-and-optimization
Query processing-and-optimizationQuery processing-and-optimization
Query processing-and-optimizationWBUTTUTORIALS
 
An Approach for the Incremental Export of Relational Databases into RDF Graphs
An Approach for the Incremental Export of Relational Databases into RDF GraphsAn Approach for the Incremental Export of Relational Databases into RDF Graphs
An Approach for the Incremental Export of Relational Databases into RDF GraphsNikolaos Konstantinou
 
Unit I- Data structures Introduction, Evaluation of Algorithms, Arrays, Spars...
Unit I- Data structures Introduction, Evaluation of Algorithms, Arrays, Spars...Unit I- Data structures Introduction, Evaluation of Algorithms, Arrays, Spars...
Unit I- Data structures Introduction, Evaluation of Algorithms, Arrays, Spars...DrkhanchanaR
 
Programming in C++ and Data Strucutres
Programming in C++ and Data StrucutresProgramming in C++ and Data Strucutres
Programming in C++ and Data StrucutresDr. C.V. Suresh Babu
 
Ranking Objects by Exploiting Relationships: Computing Top-K over Aggregation
Ranking Objects by Exploiting Relationships: Computing Top-K over AggregationRanking Objects by Exploiting Relationships: Computing Top-K over Aggregation
Ranking Objects by Exploiting Relationships: Computing Top-K over AggregationJason Yang
 
Introduction to data analysis using R
Introduction to data analysis using RIntroduction to data analysis using R
Introduction to data analysis using RVictoria López
 
Unit 2 linked list
Unit 2   linked listUnit 2   linked list
Unit 2 linked listDrkhanchanaR
 
Information Content based Ranking Metric for Linked Open Vocabularies
Information Content based Ranking Metric for Linked Open VocabulariesInformation Content based Ranking Metric for Linked Open Vocabularies
Information Content based Ranking Metric for Linked Open VocabulariesGhislain Atemezing
 
dipLODocus[RDF]: Short and Long-Tail RDF Analytics for Massive Webs of Data
dipLODocus[RDF]: Short and Long-Tail RDF Analytics for Massive Webs of DatadipLODocus[RDF]: Short and Long-Tail RDF Analytics for Massive Webs of Data
dipLODocus[RDF]: Short and Long-Tail RDF Analytics for Massive Webs of DataeXascale Infolab
 
Query-porcessing-& Query optimization
Query-porcessing-& Query optimizationQuery-porcessing-& Query optimization
Query-porcessing-& Query optimizationSaranya Natarajan
 
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...Victor Giannakouris
 
CS 542 -- Query Optimization
CS 542 -- Query OptimizationCS 542 -- Query Optimization
CS 542 -- Query OptimizationJ Singh
 

What's hot (20)

Algorithms for Query Processing and Optimization of Spatial Operations
Algorithms for Query Processing and Optimization of Spatial OperationsAlgorithms for Query Processing and Optimization of Spatial Operations
Algorithms for Query Processing and Optimization of Spatial Operations
 
13. Query Processing in DBMS
13. Query Processing in DBMS13. Query Processing in DBMS
13. Query Processing in DBMS
 
Textmining Retrieval And Clustering
Textmining Retrieval And ClusteringTextmining Retrieval And Clustering
Textmining Retrieval And Clustering
 
Chapter15
Chapter15Chapter15
Chapter15
 
Query trees
Query treesQuery trees
Query trees
 
SQL: Query optimization in practice
SQL: Query optimization in practiceSQL: Query optimization in practice
SQL: Query optimization in practice
 
Query processing-and-optimization
Query processing-and-optimizationQuery processing-and-optimization
Query processing-and-optimization
 
IR tutorial
IR tutorialIR tutorial
IR tutorial
 
Unit 3
Unit 3Unit 3
Unit 3
 
An Approach for the Incremental Export of Relational Databases into RDF Graphs
An Approach for the Incremental Export of Relational Databases into RDF GraphsAn Approach for the Incremental Export of Relational Databases into RDF Graphs
An Approach for the Incremental Export of Relational Databases into RDF Graphs
 
Unit I- Data structures Introduction, Evaluation of Algorithms, Arrays, Spars...
Unit I- Data structures Introduction, Evaluation of Algorithms, Arrays, Spars...Unit I- Data structures Introduction, Evaluation of Algorithms, Arrays, Spars...
Unit I- Data structures Introduction, Evaluation of Algorithms, Arrays, Spars...
 
Programming in C++ and Data Strucutres
Programming in C++ and Data StrucutresProgramming in C++ and Data Strucutres
Programming in C++ and Data Strucutres
 
Ranking Objects by Exploiting Relationships: Computing Top-K over Aggregation
Ranking Objects by Exploiting Relationships: Computing Top-K over AggregationRanking Objects by Exploiting Relationships: Computing Top-K over Aggregation
Ranking Objects by Exploiting Relationships: Computing Top-K over Aggregation
 
Introduction to data analysis using R
Introduction to data analysis using RIntroduction to data analysis using R
Introduction to data analysis using R
 
Unit 2 linked list
Unit 2   linked listUnit 2   linked list
Unit 2 linked list
 
Information Content based Ranking Metric for Linked Open Vocabularies
Information Content based Ranking Metric for Linked Open VocabulariesInformation Content based Ranking Metric for Linked Open Vocabularies
Information Content based Ranking Metric for Linked Open Vocabularies
 
dipLODocus[RDF]: Short and Long-Tail RDF Analytics for Massive Webs of Data
dipLODocus[RDF]: Short and Long-Tail RDF Analytics for Massive Webs of DatadipLODocus[RDF]: Short and Long-Tail RDF Analytics for Massive Webs of Data
dipLODocus[RDF]: Short and Long-Tail RDF Analytics for Massive Webs of Data
 
Query-porcessing-& Query optimization
Query-porcessing-& Query optimizationQuery-porcessing-& Query optimization
Query-porcessing-& Query optimization
 
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
 
CS 542 -- Query Optimization
CS 542 -- Query OptimizationCS 542 -- Query Optimization
CS 542 -- Query Optimization
 

Viewers also liked

Saidanaturalistaenseadadainsualibretadecampo
SaidanaturalistaenseadadainsualibretadecampoSaidanaturalistaenseadadainsualibretadecampo
SaidanaturalistaenseadadainsualibretadecampoBelén Lorenzo
 
#ForoEGovAR | Casos de PSC y su adaptación
 #ForoEGovAR | Casos de PSC y su adaptación #ForoEGovAR | Casos de PSC y su adaptación
#ForoEGovAR | Casos de PSC y su adaptaciónCESSI ArgenTIna
 
D:\งานส่ง\G48 53011810075
D:\งานส่ง\G48 53011810075D:\งานส่ง\G48 53011810075
D:\งานส่ง\G48 53011810075BenjamasS
 
Mastering the eligible content
Mastering the eligible contentMastering the eligible content
Mastering the eligible contentLeah Vestal
 
Searching over the past, present and future
Searching over the past, present and futureSearching over the past, present and future
Searching over the past, present and futureRoi Blanco
 
July 2012 Newsletter
July 2012 NewsletterJuly 2012 Newsletter
July 2012 NewsletterFelix Ortiz
 
#ForoEGovAR | Plan de Modernización del Estado
#ForoEGovAR | Plan de Modernización del Estado#ForoEGovAR | Plan de Modernización del Estado
#ForoEGovAR | Plan de Modernización del EstadoCESSI ArgenTIna
 
Mastering the Curriculum in Reading and Math
Mastering the Curriculum in Reading and MathMastering the Curriculum in Reading and Math
Mastering the Curriculum in Reading and MathLeah Vestal
 
Build Great Apps on Android - Boris Chan - FITC Spotlight Android
Build Great Apps on Android - Boris Chan - FITC Spotlight AndroidBuild Great Apps on Android - Boris Chan - FITC Spotlight Android
Build Great Apps on Android - Boris Chan - FITC Spotlight AndroidBoris Chan
 
Corporate wellbeing
Corporate wellbeingCorporate wellbeing
Corporate wellbeingRavi Samuel
 
Filosofia 6º ano - 2012
Filosofia   6º ano - 2012Filosofia   6º ano - 2012
Filosofia 6º ano - 2012evertonbazu
 
Workshops Red ArgenTIna IT 2015 - Propuesta de Sponsoreo
Workshops Red ArgenTIna IT 2015 - Propuesta de SponsoreoWorkshops Red ArgenTIna IT 2015 - Propuesta de Sponsoreo
Workshops Red ArgenTIna IT 2015 - Propuesta de SponsoreoCESSI ArgenTIna
 
第4回 JAWS-UG Okayama 月額3.3円〜でレンタルサーバーを始める方法
第4回 JAWS-UG Okayama 月額3.3円〜でレンタルサーバーを始める方法第4回 JAWS-UG Okayama 月額3.3円〜でレンタルサーバーを始める方法
第4回 JAWS-UG Okayama 月額3.3円〜でレンタルサーバーを始める方法Takeshi Furusato
 
Tech training 7.17.13
Tech training 7.17.13Tech training 7.17.13
Tech training 7.17.13Leah Vestal
 
Entity Linking via Graph-Distance Minimization
Entity Linking via Graph-Distance MinimizationEntity Linking via Graph-Distance Minimization
Entity Linking via Graph-Distance MinimizationRoi Blanco
 
Mayonn, Inc. Website in PDF format
Mayonn, Inc. Website in PDF formatMayonn, Inc. Website in PDF format
Mayonn, Inc. Website in PDF formatmayonn
 
Englishtestunit73 eso d-2
Englishtestunit73 eso d-2Englishtestunit73 eso d-2
Englishtestunit73 eso d-2Vicky
 
Best of the web ms.hs
Best of the web ms.hsBest of the web ms.hs
Best of the web ms.hsLeah Vestal
 

Viewers also liked (20)

Saidanaturalistaenseadadainsualibretadecampo
SaidanaturalistaenseadadainsualibretadecampoSaidanaturalistaenseadadainsualibretadecampo
Saidanaturalistaenseadadainsualibretadecampo
 
#ForoEGovAR | Casos de PSC y su adaptación
 #ForoEGovAR | Casos de PSC y su adaptación #ForoEGovAR | Casos de PSC y su adaptación
#ForoEGovAR | Casos de PSC y su adaptación
 
D:\งานส่ง\G48 53011810075
D:\งานส่ง\G48 53011810075D:\งานส่ง\G48 53011810075
D:\งานส่ง\G48 53011810075
 
Mastering the eligible content
Mastering the eligible contentMastering the eligible content
Mastering the eligible content
 
Searching over the past, present and future
Searching over the past, present and futureSearching over the past, present and future
Searching over the past, present and future
 
July 2012 Newsletter
July 2012 NewsletterJuly 2012 Newsletter
July 2012 Newsletter
 
#ForoEGovAR | Plan de Modernización del Estado
#ForoEGovAR | Plan de Modernización del Estado#ForoEGovAR | Plan de Modernización del Estado
#ForoEGovAR | Plan de Modernización del Estado
 
Mastering the Curriculum in Reading and Math
Mastering the Curriculum in Reading and MathMastering the Curriculum in Reading and Math
Mastering the Curriculum in Reading and Math
 
Build Great Apps on Android - Boris Chan - FITC Spotlight Android
Build Great Apps on Android - Boris Chan - FITC Spotlight AndroidBuild Great Apps on Android - Boris Chan - FITC Spotlight Android
Build Great Apps on Android - Boris Chan - FITC Spotlight Android
 
Halifax: Economic Trends
Halifax:  Economic Trends Halifax:  Economic Trends
Halifax: Economic Trends
 
Gic2011 aula4-ingles-theory
Gic2011 aula4-ingles-theoryGic2011 aula4-ingles-theory
Gic2011 aula4-ingles-theory
 
Corporate wellbeing
Corporate wellbeingCorporate wellbeing
Corporate wellbeing
 
Filosofia 6º ano - 2012
Filosofia   6º ano - 2012Filosofia   6º ano - 2012
Filosofia 6º ano - 2012
 
Workshops Red ArgenTIna IT 2015 - Propuesta de Sponsoreo
Workshops Red ArgenTIna IT 2015 - Propuesta de SponsoreoWorkshops Red ArgenTIna IT 2015 - Propuesta de Sponsoreo
Workshops Red ArgenTIna IT 2015 - Propuesta de Sponsoreo
 
第4回 JAWS-UG Okayama 月額3.3円〜でレンタルサーバーを始める方法
第4回 JAWS-UG Okayama 月額3.3円〜でレンタルサーバーを始める方法第4回 JAWS-UG Okayama 月額3.3円〜でレンタルサーバーを始める方法
第4回 JAWS-UG Okayama 月額3.3円〜でレンタルサーバーを始める方法
 
Tech training 7.17.13
Tech training 7.17.13Tech training 7.17.13
Tech training 7.17.13
 
Entity Linking via Graph-Distance Minimization
Entity Linking via Graph-Distance MinimizationEntity Linking via Graph-Distance Minimization
Entity Linking via Graph-Distance Minimization
 
Mayonn, Inc. Website in PDF format
Mayonn, Inc. Website in PDF formatMayonn, Inc. Website in PDF format
Mayonn, Inc. Website in PDF format
 
Englishtestunit73 eso d-2
Englishtestunit73 eso d-2Englishtestunit73 eso d-2
Englishtestunit73 eso d-2
 
Best of the web ms.hs
Best of the web ms.hsBest of the web ms.hs
Best of the web ms.hs
 

Similar to Effective and Efficient Entity Search in RDF data

Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...
Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...
Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...Thanh Tran
 
An introduction to Elasticsearch's advanced relevance ranking toolbox
An introduction to Elasticsearch's advanced relevance ranking toolboxAn introduction to Elasticsearch's advanced relevance ranking toolbox
An introduction to Elasticsearch's advanced relevance ranking toolboxElasticsearch
 
RedisSearch / CRDT: Kyle Davis, Meir Shpilraien
RedisSearch / CRDT: Kyle Davis, Meir ShpilraienRedisSearch / CRDT: Kyle Davis, Meir Shpilraien
RedisSearch / CRDT: Kyle Davis, Meir ShpilraienRedis Labs
 
ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...
ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...
ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...eswcsummerschool
 
Improving Search Relevance in Elasticsearch Using Machine Learning - Milorad ...
Improving Search Relevance in Elasticsearch Using Machine Learning - Milorad ...Improving Search Relevance in Elasticsearch Using Machine Learning - Milorad ...
Improving Search Relevance in Elasticsearch Using Machine Learning - Milorad ...Institute of Contemporary Sciences
 
search.ppt
search.pptsearch.ppt
search.pptPikaj2
 
SPARQL and RDF query optimization
SPARQL and RDF query optimizationSPARQL and RDF query optimization
SPARQL and RDF query optimizationKisung Kim
 
ROMAN URDU OPINION MINING SYSTEM (RUOMIS)
ROMAN URDU OPINION MINING SYSTEM (RUOMIS) ROMAN URDU OPINION MINING SYSTEM (RUOMIS)
ROMAN URDU OPINION MINING SYSTEM (RUOMIS) cseij
 
Hybrid geo textual index structure
Hybrid geo textual index structureHybrid geo textual index structure
Hybrid geo textual index structurecseij
 
Extending LargeRDFBench for Multi-Source Data at Scale for SPARQL Endpoint F...
 Extending LargeRDFBench for Multi-Source Data at Scale for SPARQL Endpoint F... Extending LargeRDFBench for Multi-Source Data at Scale for SPARQL Endpoint F...
Extending LargeRDFBench for Multi-Source Data at Scale for SPARQL Endpoint F...Holistic Benchmarking of Big Linked Data
 
SPARQL and Linked Data Benchmarking
SPARQL and Linked Data BenchmarkingSPARQL and Linked Data Benchmarking
SPARQL and Linked Data BenchmarkingKristian Alexander
 
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RSpark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RDatabricks
 
chapter08 - Database fundamentals.pdf
chapter08 - Database fundamentals.pdfchapter08 - Database fundamentals.pdf
chapter08 - Database fundamentals.pdfsatonaka3
 
Physical database design(database)
Physical database design(database)Physical database design(database)
Physical database design(database)welcometofacebook
 
An introduction to Elasticsearch's advanced relevance ranking toolbox
An introduction to Elasticsearch's advanced relevance ranking toolboxAn introduction to Elasticsearch's advanced relevance ranking toolbox
An introduction to Elasticsearch's advanced relevance ranking toolboxElasticsearch
 

Similar to Effective and Efficient Entity Search in RDF data (20)

Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...
Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...
Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...
 
An introduction to Elasticsearch's advanced relevance ranking toolbox
An introduction to Elasticsearch's advanced relevance ranking toolboxAn introduction to Elasticsearch's advanced relevance ranking toolbox
An introduction to Elasticsearch's advanced relevance ranking toolbox
 
search engine
search enginesearch engine
search engine
 
RedisSearch / CRDT: Kyle Davis, Meir Shpilraien
RedisSearch / CRDT: Kyle Davis, Meir ShpilraienRedisSearch / CRDT: Kyle Davis, Meir Shpilraien
RedisSearch / CRDT: Kyle Davis, Meir Shpilraien
 
ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...
ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...
ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...
 
Improving Search Relevance in Elasticsearch Using Machine Learning - Milorad ...
Improving Search Relevance in Elasticsearch Using Machine Learning - Milorad ...Improving Search Relevance in Elasticsearch Using Machine Learning - Milorad ...
Improving Search Relevance in Elasticsearch Using Machine Learning - Milorad ...
 
search.ppt
search.pptsearch.ppt
search.ppt
 
SPARQL and RDF query optimization
SPARQL and RDF query optimizationSPARQL and RDF query optimization
SPARQL and RDF query optimization
 
ROMAN URDU OPINION MINING SYSTEM (RUOMIS)
ROMAN URDU OPINION MINING SYSTEM (RUOMIS) ROMAN URDU OPINION MINING SYSTEM (RUOMIS)
ROMAN URDU OPINION MINING SYSTEM (RUOMIS)
 
Hybrid geo textual index structure
Hybrid geo textual index structureHybrid geo textual index structure
Hybrid geo textual index structure
 
Extending LargeRDFBench for Multi-Source Data at Scale for SPARQL Endpoint F...
 Extending LargeRDFBench for Multi-Source Data at Scale for SPARQL Endpoint F... Extending LargeRDFBench for Multi-Source Data at Scale for SPARQL Endpoint F...
Extending LargeRDFBench for Multi-Source Data at Scale for SPARQL Endpoint F...
 
Extended LargeRDFBench
Extended LargeRDFBenchExtended LargeRDFBench
Extended LargeRDFBench
 
SPARQL and Linked Data Benchmarking
SPARQL and Linked Data BenchmarkingSPARQL and Linked Data Benchmarking
SPARQL and Linked Data Benchmarking
 
MUDROD - Ranking
MUDROD - RankingMUDROD - Ranking
MUDROD - Ranking
 
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RSpark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
 
Semantics-enhanced Geoscience Interoperability, Analytics, and Applications
Semantics-enhanced Geoscience Interoperability, Analytics, and ApplicationsSemantics-enhanced Geoscience Interoperability, Analytics, and Applications
Semantics-enhanced Geoscience Interoperability, Analytics, and Applications
 
chapter08 - Database fundamentals.pdf
chapter08 - Database fundamentals.pdfchapter08 - Database fundamentals.pdf
chapter08 - Database fundamentals.pdf
 
Physical database design(database)
Physical database design(database)Physical database design(database)
Physical database design(database)
 
An introduction to Elasticsearch's advanced relevance ranking toolbox
An introduction to Elasticsearch's advanced relevance ranking toolboxAn introduction to Elasticsearch's advanced relevance ranking toolbox
An introduction to Elasticsearch's advanced relevance ranking toolbox
 
IR
IRIR
IR
 

More from Roi Blanco

From Queries to Answers in the Web
From Queries to Answers in the WebFrom Queries to Answers in the Web
From Queries to Answers in the WebRoi Blanco
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataRoi Blanco
 
Mining Web content for Enhanced Search
Mining Web content for Enhanced Search Mining Web content for Enhanced Search
Mining Web content for Enhanced Search Roi Blanco
 
Influence of Timeline and Named-entity Components on User Engagement
Influence of Timeline and Named-entity Components on User Engagement Influence of Timeline and Named-entity Components on User Engagement
Influence of Timeline and Named-entity Components on User Engagement Roi Blanco
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information RetrievalRoi Blanco
 
Beyond document retrieval using semantic annotations
Beyond document retrieval using semantic annotations Beyond document retrieval using semantic annotations
Beyond document retrieval using semantic annotations Roi Blanco
 
Keyword Search over RDF Graphs
Keyword Search over RDF GraphsKeyword Search over RDF Graphs
Keyword Search over RDF GraphsRoi Blanco
 
Large-Scale Semantic Search
Large-Scale Semantic SearchLarge-Scale Semantic Search
Large-Scale Semantic SearchRoi Blanco
 
Extending BM25 with multiple query operators
Extending BM25 with multiple query operatorsExtending BM25 with multiple query operators
Extending BM25 with multiple query operatorsRoi Blanco
 
Energy-Price-Driven Query Processing in Multi-center Web Search Engines
Energy-Price-Driven Query Processing in Multi-center WebSearch EnginesEnergy-Price-Driven Query Processing in Multi-center WebSearch Engines
Energy-Price-Driven Query Processing in Multi-center Web Search EnginesRoi Blanco
 
Caching Search Engine Results over Incremental Indices
Caching Search Engine Results over Incremental IndicesCaching Search Engine Results over Incremental Indices
Caching Search Engine Results over Incremental IndicesRoi Blanco
 
Finding support sentences for entities
Finding support sentences for entitiesFinding support sentences for entities
Finding support sentences for entitiesRoi Blanco
 

Effective and Efficient Entity Search in RDF data

  • 1. Effective and Efficient Entity Search in RDF data
      Roi Blanco (1), Peter Mika (1) and Sebastiano Vigna (2)
      (1) Yahoo! Research; (2) Università degli Studi di Milano
  • 2. Semantic Search
      • Unstructured or hybrid search over RDF data
          – Supporting end-users
              • Users who cannot express their need in SPARQL
          – Dealing with large-scale data
              • Giving up query expressivity for scale
          – Dealing with heterogeneity
              • Users who are unaware of the schema of the data
              • No single schema to the data
                  – Example: 2.6m classes and 33k properties in Billion Triples 2009
      • Entity search
          – Queries where the user is looking for a single entity named or described in the query
          – e.g. kaz vaporizer, hospice of cincinnati, mst3000
  • 3. Use cases in web search
      • Top-1 entity with structured data
      • Related entities
      • Structured data extracted from HTML
  • 4. Information access in the Semantic Web
      • Database-style indexing of RDF data
          – Triple stores
          – Structural queries (SPARQL)
          – No ranking
          – Evaluation focused on efficiency
      • IR-style indexing of RDF data
          – Search engines
          – Keyword queries
          – Ranking
          – Evaluation focused on effectiveness
      • Combined methods
          – Keyword matching and limited join processing
  • 5. Related works
      • Ranking methods on RDF data
          – Wang et al. Semplore: A scalable IR approach to search the Web of Data. ISWC 2007, JWS 7(3)
          – Pérez-Agüera et al. Using BM25F for semantic search. SemSearch 2010
          – (many others)
      • Evaluation campaigns
          – SemSearch Challenge 2010, 2011
          – Question-Answering over Linked Data (QALD) 2011
          – TREC Entity Track 2010, 2011
      • Keyword search in databases
          – No open evaluation campaigns
  • 6. Architecture overview
      [Diagram: documents pass through parallel map and reduce tasks into an index; the slide labels the stages "1st part of the talk" and "2nd part".]
      1. Download, uncompress, convert (if needed)
      2. Sort quads by subject
      3. Compute Minimal Perfect Hash (MPH)
      3. Each mapper reads part of the collection
      4. Each reducer builds an index for a subset of the vocabulary
      5. Optionally, we also build an archive (forward-index)
      5. The sub-indices are merged into a single index
      6. Serving and Ranking
  • 7. RDF indexing using MapReduce
      • Text indexing using MapReduce
          – Map: parse input into (term, doc) pairs
              • Pre-processing such as stemming, blacklisting
              • To support phrase queries, values are (doc, position) pairs
          – Reduce: collect all values for the same key: (term, {doc1, doc2, …}), output posting-list
              • Secondary sort to pre-sort document ids before iteration
      • RDF indexing using MapReduce (see Mika, SemSearch 2009)
          – Document is all triples with a given subject
              • Variations: index also RDF molecules, triples where the URI is an object
          – Index terms in property-values
              • Keys are (field, term) pairs
              • Variation: distinguish values for the same property
          – Index terms in the subject URI
              • Variation: index also terms in object URIs
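The map and reduce steps on this slide can be sketched in plain Python, without Hadoop. This is an illustrative sketch only: the function names, the sample subjects and properties, and the whitespace tokenizer are assumptions, and the in-memory `sorted` call merely stands in for the secondary sort the slide mentions. Positions here are counted within each property value.

```python
from collections import defaultdict

def map_triples(subject, triples):
    """Map step: emit ((field, term), (subject, position)) for each token of each value."""
    for prop, value in triples:
        for position, term in enumerate(value.lower().split()):
            yield (prop, term), (subject, position)

def reduce_postings(pairs):
    """Reduce step: group values by key and sort them into one posting list per (field, term)."""
    postings = defaultdict(list)
    for key, value in pairs:
        postings[key].append(value)
    # Sorting in memory stands in for the secondary sort done by the framework.
    return {key: sorted(values) for key, values in postings.items()}

# Hypothetical documents: each one is all (property, value) pairs of one subject.
docs = {
    "ex:kaz": [("rdfs:label", "Kaz vaporizer"), ("ex:category", "vaporizer")],
    "ex:hospice": [("rdfs:label", "Hospice of Cincinnati")],
}
pairs = [p for subj, triples in docs.items() for p in map_triples(subj, triples)]
index = reduce_postings(pairs)
print(index[("rdfs:label", "vaporizer")])
```

Keying on (field, term) rather than on the term alone is what lets the same word be retrieved separately per property.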
  • 8. Horizontal index structure
      • One field per position
          – one for object (token), one for predicates (property), optionally one for context
      • For each term, store the property on the same position in the property index
          – Positions are required even without phrase queries
      • Query engine needs to support fields and the alignment operator
      ✓ Dictionary is number of unique terms + number of properties
      ✓ Occurrences is number of tokens * 2
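A minimal sketch of the horizontal layout for a single document, assuming two parallel, position-aligned fields. The names `build_horizontal` and `matches` and the sample data are hypothetical; a real engine would evaluate the alignment operator over posting lists rather than over Python lists.

```python
def build_horizontal(triples):
    """Return two parallel lists: the tokens, and the property each token came from."""
    token_field, property_field = [], []
    for prop, value in triples:
        for term in value.lower().split():
            token_field.append(term)
            property_field.append(prop)  # stored at the same position as the token
    return token_field, property_field

def matches(token_field, property_field, term, prop):
    """Alignment check: does `term` occur at a position where the property is `prop`?"""
    return any(t == term and p == prop for t, p in zip(token_field, property_field))

tokens, props = build_horizontal([("rdfs:label", "Kaz vaporizer"), ("ex:maker", "Kaz Inc")])
print(matches(tokens, props, "vaporizer", "rdfs:label"))
print(matches(tokens, props, "inc", "rdfs:label"))
```

Every token is stored twice, once per field, which is why the slide counts occurrences as the number of tokens times two.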
  • 9. Vertical index structure
      • One field (index) per property
      • Positions are not required
      • Query engine needs to support fields
      ✓ Dictionary is number of unique terms
      ✓ Occurrences is number of tokens
      ✗ Number of fields is a problem for merging, query performance
      • In experiments we index the N most common properties
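The vertical layout can be sketched as one small inverted index per property, restricted to the N most frequent properties. The function name and sample data are assumptions, and dropping the values of the remaining properties is a simplification made for this sketch; the slide does not say how they are handled.

```python
from collections import Counter, defaultdict

def build_vertical(docs, n_properties):
    """Build field -> term -> set of subjects, keeping only the most common properties."""
    counts = Counter(prop for triples in docs.values() for prop, _ in triples)
    kept = {prop for prop, _ in counts.most_common(n_properties)}
    index = defaultdict(lambda: defaultdict(set))
    for subject, triples in docs.items():
        for prop, value in triples:
            if prop not in kept:
                continue  # assumption: values of rare properties are simply skipped
            for term in value.lower().split():
                index[prop][term].add(subject)
    return {field: dict(terms) for field, terms in index.items()}

docs = {
    "ex:a": [("rdfs:label", "Kaz vaporizer"), ("ex:rare", "discontinued")],
    "ex:b": [("rdfs:label", "Hospice of Cincinnati")],
}
vertical = build_vertical(docs, n_properties=1)
print(sorted(vertical))
```

No positions are stored, since a term's field already identifies the property it came from.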
  • 10. Efficiency improvements
      • r-vertical (reduced-vertical) index
          – One field per weight vs. one field per property
          – More efficient for keyword queries but loses the ability to restrict per field
          – Example: three weight levels
      • Pre-computation of alignments
          – Additional term-to-field index
          – Used to quickly determine which fields contain a term (in any document)
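Both ideas on this slide can be sketched together: properties are collapsed into one field per weight level, and a separate term-to-field map records which fields contain each term. The property-to-level table, the level names and the function name are hypothetical choices for this sketch.

```python
from collections import defaultdict

# Hypothetical assignment of properties to weight levels; unlisted ones are neutral.
WEIGHT_LEVEL = {"rdfs:label": "important", "ex:comment": "unimportant"}

def build_r_vertical(docs, default="neutral"):
    """Build level -> term -> set of subjects, plus a term -> set of levels lookup."""
    index = defaultdict(lambda: defaultdict(set))
    term_to_fields = defaultdict(set)
    for subject, triples in docs.items():
        for prop, value in triples:
            field = WEIGHT_LEVEL.get(prop, default)
            for term in value.lower().split():
                index[field][term].add(subject)
                term_to_fields[term].add(field)
    return index, term_to_fields

docs = {
    "ex:a": [("rdfs:label", "Kaz vaporizer"), ("ex:category", "vaporizer")],
    "ex:b": [("ex:comment", "warm mist")],
}
r_index, term_to_fields = build_r_vertical(docs)
print(sorted(term_to_fields["vaporizer"]))
```

With only three fields a keyword query touches far fewer posting lists, at the cost that a query can no longer be restricted to one specific property.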
  • 11. Indexing efficiency
      • Billion Triples 2009 dataset
          – 249 GB in uncompressed N-Quad
          – 114 million URIs and 274 million triples with datatype properties
          – 2.9B / 1.4B occurrences (horiz/vert)
      • Selected 300 most frequent datatype properties for vertical indexing
      • Resulting index is 9-10 GB in size
      • Horizontal and vertical indexing using Hadoop
          – Scale is only limited by number of machines
          – Number of reducers is a trade-off between speed and number of sub-indices to be merged
  • 12. Run-time efficiency
      • Measured average execution time (including ranking)
          – Using 150k queries that lead to a click on Wikipedia
          – Avg. length 2.2 tokens
          – Baseline is plain text indexing with BM25
      • Results
          – Some cost for field-based retrieval compared to plain text indexing
          – AND is always faster than OR
              • Except in horizontal, where alignment time dominates
          – r-vertical significantly improves execution time in OR mode

      Index        AND mode   OR mode
      plain text   46 ms      80 ms
      horizontal   819 ms     847 ms
      vertical     97 ms      780 ms
      r-vertical   78 ms      152 ms
  • 13. BM25F Ranking
      BM25(F) uses a term-frequency (tf) that accounts for the decreasing marginal contribution of terms
      [equation image not in transcript]
      where
          – v_s is the weight of the field
          – tf_si is the frequency of term i in field s
          – B_s is the document length normalization factor: [equation image not in transcript]
          – l_s is the length of field s
          – avl_s is the average length of s
          – b_s is a tunable parameter
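The equations on this slide were images and are absent from the transcript. As a hedged reconstruction, the usual BM25F definitions that match the symbols listed on the slide are given below. They are an assumption based on the standard formulation, not a copy of the slide, and the tilde notation for the field-weighted term frequency is introduced here.

```latex
\widetilde{tf}_i = \sum_{s} v_s \, \frac{tf_{si}}{B_s},
\qquad
B_s = (1 - b_s) + b_s \, \frac{l_s}{avl_s}
```

With b_s = 0 the normalization factor is 1 and field length is ignored; with b_s = 1 the frequency in field s is scaled fully by the ratio of the field's length to its average length.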
  • 14. BM25F ranking cont.
      • Final term score is a combination of tf and idf
        [equation image not in transcript]
        where
          – k_1 is a tunable parameter
          – w_IDF is the inverse-document frequency: [equation image not in transcript]
      • Finally, the score of a document D is the sum of the scores of query terms q
        [equation image not in transcript]
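The formulas on this slide are also missing from the transcript. A hedged reconstruction in the standard BM25F form follows, writing the field-weighted term frequency of term i as tf with a tilde. The symbols N (number of documents in the collection) and n_i (number of documents containing term i) are assumptions, since the slide text does not name them, and the idf shown is the commonly used Robertson-Spärck Jones variant.

```latex
score_i = \frac{\widetilde{tf}_i}{k_1 + \widetilde{tf}_i} \, w_i^{IDF},
\qquad
w_i^{IDF} = \log \frac{N - n_i + 0.5}{n_i + 0.5},
\qquad
score(D, q) = \sum_{i \in q} score_i
```

The saturating fraction is what gives repeated occurrences of a term a decreasing marginal contribution: it approaches 1 as the weighted frequency grows, with k_1 controlling how quickly.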
  • 15. Effectiveness evaluation
      • Semantic Search Challenge 2010
          – Data, queries, assessments available online
      • Billion Triples Challenge 2009 dataset
      • 92 entity queries from web search
          – Queries where the user is looking for a single entity
          – Sampled randomly from Microsoft and Yahoo! query logs
      • Assessed using Amazon's Mechanical Turk
          – Halpin et al. Evaluating Ad-Hoc Object Retrieval, IWEST 2010
          – Blanco et al. Repeatable and Reliable Search System Evaluation using Crowd-Sourcing, SIGIR 2011
  • 16. Evaluation form
  • 17. Implementation
      • Simplified model to reduce the number of parameters
          – Three levels of v_s: important, neutral, unimportant
          – Assign weights to domains instead of individual doc weights w_D
          – Single parameter b for all b_s
          – Single parameter l_s for all l, bounded by a maximum l_max = 10
      • Manually classified a small number of properties and domains into important, neutral, unimportant
          – Future work to learn this classification
          – Weights are learned (see next)
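A minimal sketch of a BM25F scorer under the simplifications listed on this slide: three weight levels for v_s and a single b shared by all fields. The numeric weights, the default values of b and k1, the idf variant and the sample data are all assumptions made for illustration; the sketch also leaves out the domain weights and the length bound l_max, whose exact form the transcript does not give.

```python
import math

# Hypothetical weights for the three levels; the talk learns these from data.
LEVEL_WEIGHT = {"important": 3.0, "neutral": 1.0, "unimportant": 0.3}

def bm25f_score(query_terms, doc_fields, avg_len, doc_freq, n_docs, b=0.75, k1=1.2):
    """Score one document: sum over query terms of tf / (k1 + tf) * idf."""
    score = 0.0
    for term in query_terms:
        tf = 0.0
        for field, tokens in doc_fields.items():
            # One b for every field: B_s = (1 - b) + b * l_s / avl_s
            norm = (1.0 - b) + b * len(tokens) / avg_len[field]
            tf += LEVEL_WEIGHT[field] * tokens.count(term) / norm
        n = doc_freq.get(term, 0)
        idf = math.log((n_docs - n + 0.5) / (n + 0.5))  # assumed idf variant
        score += tf / (k1 + tf) * idf
    return score

# Hypothetical document whose field lengths equal the collection averages.
doc = {"important": ["kaz", "vaporizer"], "neutral": ["warm", "mist", "vaporizer"]}
avg_len = {"important": 2.0, "neutral": 3.0}
score = bm25f_score(["vaporizer"], doc, avg_len, {"vaporizer": 10}, n_docs=1000)
print(score)
```

Because both fields here have exactly average length, the normalization factor is 1 and the weighted frequency of "vaporizer" is 3.0 + 1.0 = 4.0 before saturation.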
  • 18. Effectiveness results
      • Individual features
          – Positive, stat. significant improvement from each feature
          – Even a manual classification of properties and domains helps
      • Combination
          – Positive, stat. significant marginal improvement from each additional feature
          – Total improvement of 53% over the baseline
          – Different signals of relevance
  • 19. Comparison to SemSearch'10
      • Two-fold cross validation
      • Tuning all parameters at the same time
          – Promising directions algorithm (Robertson and Zaragoza)
      • 42% improvement over the best method submitted
      • Performs well on short, specific queries with many results
          – Negative examples: the morning call lehigh valley pa
  • 20. Conclusions
      • Indexing and ranking RDF data
          – Novel index structures
          – Ranking method based on BM25F
      • Future work
          – Ranking documents with metadata
              • e.g. microdata/RDFa
          – Exploiting more semantics
              • e.g. sameAs
          – Ranking triples for display
          – Question-answering