TRank: Ranking Entity Types Using the Web of Data
Alberto Tonon¹, Michele Catasta², Gianluca Demartini¹, Philippe Cudré-Mauroux¹, and Karl Aberer²
¹ eXascale Infolab, University of Fribourg, Switzerland, {alberto, demartini, phil}@exascale.info
² Distributed Information Systems Laboratory, EPFL, Switzerland, {firstname.lastname}@epfl.ch
ISWC, 25 October 2013
Why Entities?
• The Web is getting entity-centric!
• Entity-centric services (e.g., Google)
…and Why Types?
• “Summarization” of texts
• Contextual entity summaries in Web pages
• Disambiguation of other entities
• Diversification of search results
Example article title: “Bin Laden Relative Pleads Not Guilty in Terrorism Case”
Entities: Osama Bin Laden, Abu Ghaith, Lewis Kaplan, Manhattan
Types: Al-Qaeda Propagandists, Kuwaiti Al-Qaeda Members, Judge, Borough (New York City)
Example sentence: «Sulaiman Abu Ghaith, a son-in-law of Osama bin Laden who once served as a spokesman for Al Qaeda»
Types annotated in the sentence: Al-Qaeda Propagandist, Kuwaiti Al-Qaeda Members, Jihadist Organizations
Entities May Have Many Types
Figure: types of Bill Gates. Thing, Agent, Person, Living People, American Billionaires, American Computer Programmers, American Philanthropists, American People of Scottish Descent, Harvard University People, Windows People, People from King County, People from Seattle.
G: DBpedia 3.8
e: Bill Gates
c: «Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975.»
Our Task: Ranking Types Given a
Context
• Input: a knowledge base G, an entity e, and a context c in which e appears.
• Output: e’s types ranked by relevance with respect to the context c.
• Evaluation: crowdsourcing + MAP, NDCG
Example output for Bill Gates:
1. American Chief Executive
2. American Computer Programmer
3. American Billionaires
4. …
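The two evaluation measures named on this slide can be sketched as follows. This is a minimal illustration, not the evaluation code used by the authors: the function names, the toy type names, and the particular DCG variant (gain divided by log2 of rank + 1) are assumptions.

```python
import math

def average_precision(ranked_types, relevant):
    """Average precision of one ranked list; `relevant` is the set of relevant types."""
    hits, precision_sum = 0, 0.0
    for rank, t in enumerate(ranked_types, start=1):
        if t in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant) if relevant else 0.0

def dcg(gains):
    """Discounted cumulative gain (assumed variant: gain / log2(rank + 1))."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))

def ndcg(ranked_types, grades):
    """NDCG of one ranked list; `grades` maps a type to its graded relevance."""
    gains = [grades.get(t, 0) for t in ranked_types]
    ideal = sorted(grades.values(), reverse=True)[:len(ranked_types)]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy judgments (hypothetical): type A is highly relevant, C marginally, B not at all.
grades = {"A": 3, "B": 0, "C": 1}
ap = average_precision(["A", "B", "C"], {"A", "C"})
score = ndcg(["A", "B", "C"], grades)
```

MAP is then the mean of `average_precision` over all entity/context pairs in the test collection.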
TRank Pipeline
1. Text extraction (BoilerPipe)
2. Named Entity Recognition (Stanford NER) ⟹ list of entity labels
3. Entity linking (inverted index: DBpedia labels ⟹ resource URIs) ⟹ list of entity URIs
4. For each entity URI: type retrieval (inverted index: resource URIs ⟹ type URIs) ⟹ list of type URIs
5. Type ranking ⟹ ranked list of types
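The last three pipeline stages can be sketched with plain dictionaries standing in for the inverted indices. This is an illustrative sketch only: text extraction and NER are assumed to have already produced the entity labels, and the function names, URIs, and the placeholder ranking function are hypothetical.

```python
def link_entities(labels, label_index):
    """Entity linking: look each label up in a label -> resource URI index."""
    return [label_index[label] for label in labels if label in label_index]

def retrieve_types(entity_uri, type_index):
    """Type retrieval: look the entity up in a resource URI -> type URIs index."""
    return type_index.get(entity_uri, [])

def trank_pipeline(entity_labels, label_index, type_index, rank_types):
    """Run linking, type retrieval and ranking for every recognized entity."""
    uris = link_entities(entity_labels, label_index)
    ranked = {}
    for uri in uris:
        types = retrieve_types(uri, type_index)
        # The ranking function also receives the co-occurring entities as context.
        ranked[uri] = rank_types(uri, types, uris)
    return ranked

# Hypothetical toy indices and a placeholder ranker (alphabetical order).
label_index = {"Bill Gates": "dbpedia:Bill_Gates"}
type_index = {"dbpedia:Bill_Gates": ["Person", "American Billionaires"]}
result = trank_pipeline(["Bill Gates", "Unknown Label"], label_index, type_index,
                        lambda uri, types, context: sorted(types))
```

Labels that are missing from the label index are simply dropped here; how the real system treats unlinked labels is not stated on the slide.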
Type Hierarchy
• Root of the hierarchy: <owl:Thing>
• Types come from DBpedia, schema.org, and YAGO
• subClassOf relationships are explicit, inferred from <owl:equivalentClass>, obtained from the PARIS ontology mapping, or manually added
• YAGO/DBpedia mappings are computed with PARIS
Ranking Algorithms
• Entity-centric
• Hierarchy-based
• Context-aware (featuring type-hierarchy)
• Learning to Rank
Entity-Centric Ranking Approaches
(An Example)
• SAMEAS
Score(e, t) = number of URIs representing e with type t.
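A minimal sketch of the SAMEAS score, assuming the equivalent URIs of an entity (for example those reached through owl:sameAs links) are available in a dictionary. The data layout and all identifiers below are hypothetical.

```python
def sameas_score(entity_uri, type_uri, same_as, types_of):
    """Count the URIs representing the entity that carry the given type.

    same_as:  entity URI -> iterable of equivalent URIs (assumed precomputed)
    types_of: URI -> set of type URIs attached to that URI
    """
    uris = {entity_uri} | set(same_as.get(entity_uri, ()))
    return sum(1 for uri in uris if type_uri in types_of.get(uri, ()))

# Toy data: the entity is described by three URIs, two of which are typed "Person".
same_as = {"dbpedia:Bill_Gates": ["yago:Bill_Gates", "freebase:m.017nt"]}
types_of = {
    "dbpedia:Bill_Gates": {"Person", "Agent"},
    "yago:Bill_Gates": {"Person"},
    "freebase:m.017nt": {"Businessperson"},
}
person_score = sameas_score("dbpedia:Bill_Gates", "Person", same_as, types_of)
```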
Hierarchy-Based Approaches
(An Example)
• ANCESTORS
Score(e, t) = number of t’s ancestors in the type hierarchy contained in Te.
Note: Te (the set of e’s types) often doesn’t contain all supertypes of a specific type.
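A minimal sketch of the ANCESTORS score, assuming the type hierarchy is a tree stored as a child-to-parent dictionary and that Te is the set of types attached to e. The names and the toy hierarchy are hypothetical.

```python
def ancestors(type_uri, parent):
    """Return the ancestors of a type, nearest first, by following parent links."""
    result, seen = [], {type_uri}
    current = parent.get(type_uri)
    while current is not None and current not in seen:  # `seen` guards against cycles
        result.append(current)
        seen.add(current)
        current = parent.get(current)
    return result

def ancestors_score(type_uri, entity_types, parent):
    """Count how many ancestors of the type are themselves types of the entity."""
    return sum(1 for a in ancestors(type_uri, parent) if a in entity_types)

# Toy hierarchy: AmericanActor -> Actor -> Person -> Agent -> Thing
parent = {"AmericanActor": "Actor", "Actor": "Person", "Person": "Agent", "Agent": "Thing"}
# Te misses "Agent", illustrating that not every supertype is attached to e.
Te = {"Thing", "Person", "Actor", "AmericanActor"}
scores = {t: ancestors_score(t, Te, parent) for t in Te}
```

In this toy case the most specific type gets the highest score, because more of its ancestors are present in Te.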
Context-Aware Ranking Approaches
(An Example)
• SAMETYPE
Score(e, t, cT) = number of times t appears among the types of the other entities in cT.
Figure: a context containing the entities e, e' and e'', annotated with the types Person, Actor, AmericanActor, Organization, and Thing.
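A minimal sketch of the SAMETYPE score, assuming the context is given as the list of entities that co-occur with e and that each entity's types are available in a dictionary. The entity names and type assignments below are hypothetical and are not read off the slide's figure.

```python
def sametype_score(entity, type_uri, context_entities, types_of):
    """Count the other entities in the context that also have the given type."""
    return sum(
        1
        for other in context_entities
        if other != entity and type_uri in types_of.get(other, ())
    )

# Toy context with three entities.
types_of = {
    "e": {"Actor", "AmericanActor", "Person"},
    "e1": {"Person", "Actor"},
    "e2": {"Organization", "Thing"},
}
context = ["e", "e1", "e2"]
actor_score = sametype_score("e", "Actor", context, types_of)
```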
Learning to Rank Entity Types
Determine an optimal combination of all our
approaches:
• Decision trees
• Linear regression models
• 10-fold cross validation
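The set-up can be sketched in two small pieces: each (entity, type, context) triple becomes a feature vector holding the scores of the individual approaches, and the labeled examples are split into 10 folds for cross validation. This is an assumption-laden sketch; the slide does not say how folds were drawn or which features were used, and the model training step itself is omitted.

```python
def feature_vector(entity, type_uri, context, scorers):
    """One feature per ranking approach; each scorer is a callable (e, t, c) -> number."""
    return [score(entity, type_uri, context) for score in scorers]

def k_fold_splits(n_items, k=10):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size

# Toy usage with two hypothetical scorers.
scorers = [lambda e, t, c: len(t), lambda e, t, c: len(c)]
features = feature_vector("e", "Actor", ["e", "e1"], scorers)
splits = list(k_fold_splits(25, k=10))
```

A decision tree or linear regression model would then be trained on the training indices of each split and evaluated on the held-out fold.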
Avoiding SPARQL Queries with
Inverted Indices and Map/Reduce
• TRank is implemented with Hadoop and Map/Reduce.
• All computations are done by using inverted indices:
– Entity linking
– Path index
– Depth index
• The inverted indices are publicly available at exascale.info/TRank
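One way to picture the path and depth indices is as lookup tables precomputed from the type hierarchy, so that no query is needed at ranking time. The sketch below assumes an acyclic hierarchy stored as a child-to-parent dictionary; the function name and the toy data are hypothetical, and the real indices may be organized differently.

```python
def build_path_and_depth_index(parent):
    """Precompute, for every type, its root-to-type path and its depth.

    parent: child type -> parent type (the root has no entry; no cycles assumed)
    """
    path_index, depth_index = {}, {}
    for t in set(parent) | set(parent.values()):
        path = [t]
        while path[-1] in parent:      # climb until the root is reached
            path.append(parent[path[-1]])
        path.reverse()                 # store the path from the root down to t
        path_index[t] = path
        depth_index[t] = len(path) - 1  # the root has depth 0
    return path_index, depth_index

parent = {"Actor": "Person", "Person": "Agent", "Agent": "Thing"}
path_index, depth_index = build_path_and_depth_index(parent)
```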
EXPERIMENTAL EVALUATION
Datasets
• 128 recent NYTimes articles split to create:
– Entity Collection
– Sentence Collection
– Paragraph Collection
– 3-Paragraphs Collection
• Ground-truth obtained by using crowdsourcing
– 3 workers per entity/context
– 4 levels of relevance for each type
– Overall cost: $190
Effectiveness Evaluation
Check our paper or contact us for a complete description of all the approaches we evaluated.
Efficiency Evaluation
• Tested efficiency on a 1 TB CommonCrawl sample
– 1,310,459 HTML pages
– 23 GB compressed
• Map/Reduce on a cluster of 8 machines with 12 cores, 32 GB of RAM and 3 SATA disks
• On average, 25 min. processing time (> 100 docs per node per second)
Share of processing time per pipeline step:
Text Extraction | NER | Entity Linking | Type Retrieval | Type Ranking
18.9% | 35.6% | 29.5% | 9.8% | 6.2%
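The throughput figure can be checked from the numbers on this slide, assuming the 25 minutes cover all 1,310,459 pages and the work is spread evenly over the 8 machines. The variable names are illustrative.

```python
pages = 1_310_459
minutes = 25
nodes = 8

# Pages processed per machine per second, assuming an even split across nodes.
docs_per_node_per_second = pages / (minutes * 60) / nodes  # roughly 109

# The per-step shares reported on the slide should add up to 100%.
shares = {
    "Text Extraction": 18.9,
    "NER": 35.6,
    "Entity Linking": 29.5,
    "Type Retrieval": 9.8,
    "Type Ranking": 6.2,
}
total_share = sum(shares.values())
```

With these inputs the rate comes out slightly above 100 documents per node per second, consistent with the slide, and NER plus entity linking account for about two thirds of the time.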
Conclusions
• New task: ranking entity types.
– Useful for: “summarization” of Web documents, entity summaries, disambiguation.
• Various approaches: entity-centric, context-aware, hierarchy-based, learning to rank.
– Hierarchy-based and learning to rank are the most effective.
• Hadoop, Map/Reduce, and inverted indices to achieve scalability.
Grazie!
• Datasets (with relevance judgments!), inverted indices, evaluation tools and more material are available at exascale.info/Trank.
Thank you for your attention!
Check out B-hist at the SW Challenge!
Thanks to [logo] for the Travel Award!
TRank is open-source! https://github.com/MEM0R1ES/TRank
Entity-Centric Ranking Approaches
• FREQ
Rank(e, t, ck) = number of triples <e> <rdf:type> <t> in the knowledge base.
• WIKILINK
Rank(e, t, ck) = number of e’s “neighbor entities” with type t.
• SAMEAS
Rank(e, t, ck) = number of URIs representing e with type t.
• LABEL
Rank(e, t, ck) = frequency of t among the top-10 most similar entities in terms of label (computed with Lucene).
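FREQ and WIKILINK can be sketched over an in-memory list of triples. This is an illustration under assumptions: the neighbor relation is taken as a precomputed dictionary (the slide does not define "neighbor entities" precisely), and all entity and type names are hypothetical.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def freq_score(entity, type_uri, triples):
    """FREQ: count the triples (entity, rdf:type, type) in the knowledge base."""
    return sum(1 for s, p, o in triples
               if s == entity and p == RDF_TYPE and o == type_uri)

def wikilink_score(entity, type_uri, neighbors, types_of):
    """WIKILINK: count the entity's neighbor entities that have the given type."""
    return sum(1 for n in neighbors.get(entity, ())
               if type_uri in types_of.get(n, ()))

# Toy knowledge base: the same type statement appears twice (e.g. from two sources).
triples = [
    ("ex:Tom_Hanks", RDF_TYPE, "ex:Actor"),
    ("ex:Tom_Hanks", RDF_TYPE, "ex:Actor"),
    ("ex:Tom_Hanks", RDF_TYPE, "ex:Person"),
]
neighbors = {"ex:Tom_Hanks": ["ex:Tom_Cruise", "ex:Hollywood"]}
types_of = {"ex:Tom_Cruise": {"ex:Actor"}, "ex:Hollywood": {"ex:Place"}}
freq_actor = freq_score("ex:Tom_Hanks", "ex:Actor", triples)
link_actor = wikilink_score("ex:Tom_Hanks", "ex:Actor", neighbors, types_of)
```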
Entity-Centric Ranking Approaches
• LABEL
Rank(e, t, ck) = frequency of t among the top-10 most similar entities in terms of label. Exploits an inverted index.
Figure: creating the inverted index. Each entity in the knowledge base (e1, e2, e3, ..., eN) has a label ("Tom Cruise", "Bill Gates", "Tom Hanks", "Osama Bin Laden", ...). The index maps each label token to the entities whose label contains it: "Tom" ⟹ e1, e3, ...; "Cruise" ⟹ e1, ...; "Hanks" ⟹ e3, ...; "Bill" ⟹ e2, ...
Figure: querying the index. Label(e) is used as the query, matching entities are ranked by TF-IDF, and the top 10 are kept.
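The LABEL approach can be imitated in a few lines with a token-level inverted index and a simplified IDF-weighted match score. This sketch does not reproduce Lucene's actual scoring; the function names, the scoring formula, and the toy data are assumptions.

```python
import math
from collections import Counter, defaultdict

def build_label_index(labels):
    """Map each lower-cased label token to the set of entities whose label contains it."""
    index = defaultdict(set)
    for entity, label in labels.items():
        for token in label.lower().split():
            index[token].add(entity)
    return index

def most_similar(entity, labels, index, k=10):
    """Rank the other entities by the summed IDF of the label tokens they share."""
    n = len(labels)
    scores = Counter()
    for token in set(labels[entity].lower().split()):
        postings = index.get(token, set())
        idf = math.log(n / len(postings))  # postings always contains `entity` itself
        for other in postings:
            if other != entity:
                scores[other] += idf
    return [e for e, _ in scores.most_common(k)]

def label_score(entity, type_uri, labels, index, types_of, k=10):
    """LABEL: frequency of the type among the k entities with the most similar labels."""
    return sum(1 for other in most_similar(entity, labels, index, k)
               if type_uri in types_of.get(other, ()))

labels = {"e1": "Tom Cruise", "e2": "Bill Gates", "e3": "Tom Hanks"}
index = build_label_index(labels)
types_of = {"e1": {"Actor"}, "e3": {"Actor"}, "e2": {"Businessperson"}}
actor_for_e1 = label_score("e1", "Actor", labels, index, types_of)
```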
Hierarchy-Based Ranking Approaches
• DEPTH
Rank(e, t, cH) = depth of t in the type hierarchy.
• ANCESTORS
Rank(e, t, cH) = number of t’s ancestors in the type hierarchy contained in Te.
• ANC_DEPTH
Rank(e, t, cH) = …
Note: Te (the set of e’s types) often doesn’t contain all supertypes of a specific type.
Context-Aware Ranking Approaches
• The context can help to obtain a better ranking of types.
Italy’s rebellious voters, who opted for a flamboyant billionaire and a
clown, reminded us last week how deeply in crisis the Continent is.
Meanwhile, France is going it virtually alone in Mali, and Britain talks
openly of jumping the European ship altogether.
Landlocked Countries
Least Developed Countries
States And Territories Established In 1960
French-speaking Countries
World Trade Organization Member Economies
Country
African Union Member States
African Countries
Member States Of La Francophonie
African Union Member Economies
Populated Place
Place
• Which is the right type for Mali?
Context-Aware Ranking Approaches
PATH
• Suppose we have to compute Rank(e, t, cT).
• Consider each type t’ of each other entity e’ in c.
• P(t) = path from the root of the type hierarchy to t.
???
Context-Aware Ranking Approaches
Ranking Tom Hanks’s types when co-occurring with Tom Cruise in some text.
Relevance Judgments
• Crowdsourced relevance judgments.
• Anonymous Web users are TRank users.
• 3 workers per entity/context.
• Overall cost: $190
• Pilot study on task design… mega-bubbles!
• Number of votes used as the relevance score for a type.
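Turning votes into graded relevance can be sketched as a simple count. The sketch assumes each worker marks the types they consider relevant for an entity in a context; with 3 workers this would give scores from 0 to 3, which is one possible reading of the 4 relevance levels mentioned in the dataset description. The data layout and names are hypothetical.

```python
from collections import Counter

def relevance_from_votes(votes_per_worker):
    """Relevance score of a type = number of workers who voted for it.

    votes_per_worker: one collection of selected types per worker
    (assumed layout; types nobody selected implicitly score 0).
    """
    counts = Counter()
    for selected in votes_per_worker:
        counts.update(set(selected))  # a worker counts at most once per type
    return dict(counts)

# Three workers judging the types of one entity in one context.
votes = [{"Actor", "Person"}, {"Actor"}, {"Actor", "AmericanActor"}]
relevance = relevance_from_votes(votes)
```

The resulting dictionary can be used directly as the graded judgments for NDCG.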
27

More Related Content

What's hot

Redis Day TLV 2018 - Graph Distribution
Redis Day TLV 2018 - Graph DistributionRedis Day TLV 2018 - Graph Distribution
Redis Day TLV 2018 - Graph DistributionRedis Labs
 
MappingBetweenRealWorldandComputerScience
MappingBetweenRealWorldandComputerScienceMappingBetweenRealWorldandComputerScience
MappingBetweenRealWorldandComputerScienceKaushik Patidar
 
Pig Latin, Data Model with Load and Store Functions
Pig Latin, Data Model with Load and Store FunctionsPig Latin, Data Model with Load and Store Functions
Pig Latin, Data Model with Load and Store FunctionsRupak Roy
 
Tutorial 9 (bloom filters)
Tutorial 9 (bloom filters)Tutorial 9 (bloom filters)
Tutorial 9 (bloom filters)Kira
 
Java Extension Methods
Java Extension MethodsJava Extension Methods
Java Extension MethodsAndreas Enbohm
 
Incremental View Maintenance for openCypher Queries
Incremental View Maintenance for openCypher QueriesIncremental View Maintenance for openCypher Queries
Incremental View Maintenance for openCypher QueriesGábor Szárnyas
 

What's hot (11)

Redis Day TLV 2018 - Graph Distribution
Redis Day TLV 2018 - Graph DistributionRedis Day TLV 2018 - Graph Distribution
Redis Day TLV 2018 - Graph Distribution
 
MappingBetweenRealWorldandComputerScience
MappingBetweenRealWorldandComputerScienceMappingBetweenRealWorldandComputerScience
MappingBetweenRealWorldandComputerScience
 
R packages
R packagesR packages
R packages
 
21 spam
21 spam21 spam
21 spam
 
Pig Latin, Data Model with Load and Store Functions
Pig Latin, Data Model with Load and Store FunctionsPig Latin, Data Model with Load and Store Functions
Pig Latin, Data Model with Load and Store Functions
 
Theano tutorial
Theano tutorialTheano tutorial
Theano tutorial
 
Tutorial 9 (bloom filters)
Tutorial 9 (bloom filters)Tutorial 9 (bloom filters)
Tutorial 9 (bloom filters)
 
Java Extension Methods
Java Extension MethodsJava Extension Methods
Java Extension Methods
 
Ghost
GhostGhost
Ghost
 
R Introduction
R IntroductionR Introduction
R Introduction
 
Incremental View Maintenance for openCypher Queries
Incremental View Maintenance for openCypher QueriesIncremental View Maintenance for openCypher Queries
Incremental View Maintenance for openCypher Queries
 

Similar to TRank ISWC2013

Entities for Augmented Intelligence
Entities for Augmented IntelligenceEntities for Augmented Intelligence
Entities for Augmented Intelligencekrisztianbalog
 
Binary Similarity : Theory, Algorithms and Tool Evaluation
Binary Similarity :  Theory, Algorithms and  Tool EvaluationBinary Similarity :  Theory, Algorithms and  Tool Evaluation
Binary Similarity : Theory, Algorithms and Tool EvaluationLiwei Ren任力偉
 
Entity Retrieval (SIGIR 2013 tutorial)
Entity Retrieval (SIGIR 2013 tutorial)Entity Retrieval (SIGIR 2013 tutorial)
Entity Retrieval (SIGIR 2013 tutorial)krisztianbalog
 
The Web of Data: do we actually understand what we built?
The Web of Data: do we actually understand what we built?The Web of Data: do we actually understand what we built?
The Web of Data: do we actually understand what we built?Frank van Harmelen
 
Web-scale semantic search
Web-scale semantic searchWeb-scale semantic search
Web-scale semantic searchEdgar Meij
 
What's "For Free" on Craigslist?
What's "For Free" on Craigslist? What's "For Free" on Craigslist?
What's "For Free" on Craigslist? Josh Mayer
 
Elasticsearch - basics and beyond
Elasticsearch - basics and beyondElasticsearch - basics and beyond
Elasticsearch - basics and beyondErnesto Reig
 
Modern text mining – understanding a million comments in 60 minutes
Modern text mining – understanding a million comments in 60 minutesModern text mining – understanding a million comments in 60 minutes
Modern text mining – understanding a million comments in 60 minutesZOLLHOF - Tech Incubator
 
Elasticsearch - DevNexus 2015
Elasticsearch - DevNexus 2015Elasticsearch - DevNexus 2015
Elasticsearch - DevNexus 2015Roy Russo
 
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...Databricks
 
2013 11-06 lsr-dublin_m_hausenblas_solr as recommendation engine
2013 11-06 lsr-dublin_m_hausenblas_solr as recommendation engine2013 11-06 lsr-dublin_m_hausenblas_solr as recommendation engine
2013 11-06 lsr-dublin_m_hausenblas_solr as recommendation enginelucenerevolution
 
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data TutorialESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorialeswcsummerschool
 
lecture1-intro.ppt
lecture1-intro.pptlecture1-intro.ppt
lecture1-intro.pptIshaXogaha
 
Introduction to search engine-building with Lucene
Introduction to search engine-building with LuceneIntroduction to search engine-building with Lucene
Introduction to search engine-building with LuceneKai Chan
 
From Research Objects to Reproducible Science Tales
From Research Objects to Reproducible Science TalesFrom Research Objects to Reproducible Science Tales
From Research Objects to Reproducible Science TalesBertram Ludäscher
 

Similar to TRank ISWC2013 (20)

Entities for Augmented Intelligence
Entities for Augmented IntelligenceEntities for Augmented Intelligence
Entities for Augmented Intelligence
 
Binary Similarity : Theory, Algorithms and Tool Evaluation
Binary Similarity :  Theory, Algorithms and  Tool EvaluationBinary Similarity :  Theory, Algorithms and  Tool Evaluation
Binary Similarity : Theory, Algorithms and Tool Evaluation
 
Entity Retrieval (SIGIR 2013 tutorial)
Entity Retrieval (SIGIR 2013 tutorial)Entity Retrieval (SIGIR 2013 tutorial)
Entity Retrieval (SIGIR 2013 tutorial)
 
The Web of Data: do we actually understand what we built?
The Web of Data: do we actually understand what we built?The Web of Data: do we actually understand what we built?
The Web of Data: do we actually understand what we built?
 
Web-scale semantic search
Web-scale semantic searchWeb-scale semantic search
Web-scale semantic search
 
What's "For Free" on Craigslist?
What's "For Free" on Craigslist? What's "For Free" on Craigslist?
What's "For Free" on Craigslist?
 
Type-Aware Entity Retrieval
Type-Aware Entity RetrievalType-Aware Entity Retrieval
Type-Aware Entity Retrieval
 
Elasticsearch - basics and beyond
Elasticsearch - basics and beyondElasticsearch - basics and beyond
Elasticsearch - basics and beyond
 
Recommender Systems and Linked Open Data
Recommender Systems and Linked Open DataRecommender Systems and Linked Open Data
Recommender Systems and Linked Open Data
 
Modern text mining – understanding a million comments in 60 minutes
Modern text mining – understanding a million comments in 60 minutesModern text mining – understanding a million comments in 60 minutes
Modern text mining – understanding a million comments in 60 minutes
 
Elasticsearch - DevNexus 2015
Elasticsearch - DevNexus 2015Elasticsearch - DevNexus 2015
Elasticsearch - DevNexus 2015
 
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
 
2013 11-06 lsr-dublin_m_hausenblas_solr as recommendation engine
2013 11-06 lsr-dublin_m_hausenblas_solr as recommendation engine2013 11-06 lsr-dublin_m_hausenblas_solr as recommendation engine
2013 11-06 lsr-dublin_m_hausenblas_solr as recommendation engine
 
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data TutorialESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
 
lecture1-intro.ppt
lecture1-intro.pptlecture1-intro.ppt
lecture1-intro.ppt
 
lecture1-intro.ppt
lecture1-intro.pptlecture1-intro.ppt
lecture1-intro.ppt
 
Big data
Big dataBig data
Big data
 
Introduction to search engine-building with Lucene
Introduction to search engine-building with LuceneIntroduction to search engine-building with Lucene
Introduction to search engine-building with Lucene
 
From Research Objects to Reproducible Science Tales
From Research Objects to Reproducible Science TalesFrom Research Objects to Reproducible Science Tales
From Research Objects to Reproducible Science Tales
 
Big data
Big dataBig data
Big data
 

More from eXascale Infolab

Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link Prediction
Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link PredictionBeyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link Prediction
Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link PredictioneXascale Infolab
 
It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...
It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...
It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...eXascale Infolab
 
Representation Learning on Complex Graphs
Representation Learning on Complex GraphsRepresentation Learning on Complex Graphs
Representation Learning on Complex GraphseXascale Infolab
 
A force directed approach for offline gps trajectory map
A force directed approach for offline gps trajectory mapA force directed approach for offline gps trajectory map
A force directed approach for offline gps trajectory mapeXascale Infolab
 
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...eXascale Infolab
 
SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous...
SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous...SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous...
SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous...eXascale Infolab
 
Dependency-Driven Analytics: A Compass for Uncharted Data Oceans
Dependency-Driven Analytics: A Compass for Uncharted Data OceansDependency-Driven Analytics: A Compass for Uncharted Data Oceans
Dependency-Driven Analytics: A Compass for Uncharted Data OceanseXascale Infolab
 
SANAPHOR: Ontology-based Coreference Resolution
SANAPHOR: Ontology-based Coreference ResolutionSANAPHOR: Ontology-based Coreference Resolution
SANAPHOR: Ontology-based Coreference ResolutioneXascale Infolab
 
Efficient, Scalable, and Provenance-Aware Management of Linked Data
Efficient, Scalable, and Provenance-Aware Management of Linked DataEfficient, Scalable, and Provenance-Aware Management of Linked Data
Efficient, Scalable, and Provenance-Aware Management of Linked DataeXascale Infolab
 
Entity-Centric Data Management
Entity-Centric Data ManagementEntity-Centric Data Management
Entity-Centric Data ManagementeXascale Infolab
 
LDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked Data
LDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked DataLDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked Data
LDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked DataeXascale Infolab
 
Executing Provenance-Enabled Queries over Web Data
Executing Provenance-Enabled Queries over Web DataExecuting Provenance-Enabled Queries over Web Data
Executing Provenance-Enabled Queries over Web DataeXascale Infolab
 
The Dynamics of Micro-Task Crowdsourcing
The Dynamics of Micro-Task CrowdsourcingThe Dynamics of Micro-Task Crowdsourcing
The Dynamics of Micro-Task CrowdsourcingeXascale Infolab
 
CIKM14: Fixing grammatical errors by preposition ranking
CIKM14: Fixing grammatical errors by preposition rankingCIKM14: Fixing grammatical errors by preposition ranking
CIKM14: Fixing grammatical errors by preposition rankingeXascale Infolab
 
An Introduction to Big Data
An Introduction to Big DataAn Introduction to Big Data
An Introduction to Big DataeXascale Infolab
 
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)eXascale Infolab
 

More from eXascale Infolab (20)

Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link Prediction
Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link PredictionBeyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link Prediction
Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link Prediction
 
It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...
It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...
It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...
 
Representation Learning on Complex Graphs
Representation Learning on Complex GraphsRepresentation Learning on Complex Graphs
Representation Learning on Complex Graphs
 
A force directed approach for offline gps trajectory map
A force directed approach for offline gps trajectory mapA force directed approach for offline gps trajectory map
A force directed approach for offline gps trajectory map
 
Cikm 2018
Cikm 2018Cikm 2018
Cikm 2018
 
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...
 
SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous...
SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous...SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous...
SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous...
 
Dependency-Driven Analytics: A Compass for Uncharted Data Oceans
Dependency-Driven Analytics: A Compass for Uncharted Data OceansDependency-Driven Analytics: A Compass for Uncharted Data Oceans
Dependency-Driven Analytics: A Compass for Uncharted Data Oceans
 
Crowd scheduling www2016
Crowd scheduling www2016Crowd scheduling www2016
Crowd scheduling www2016
 
SANAPHOR: Ontology-based Coreference Resolution
SANAPHOR: Ontology-based Coreference ResolutionSANAPHOR: Ontology-based Coreference Resolution
SANAPHOR: Ontology-based Coreference Resolution
 
Efficient, Scalable, and Provenance-Aware Management of Linked Data
Efficient, Scalable, and Provenance-Aware Management of Linked DataEfficient, Scalable, and Provenance-Aware Management of Linked Data
Efficient, Scalable, and Provenance-Aware Management of Linked Data
 
Entity-Centric Data Management
Entity-Centric Data ManagementEntity-Centric Data Management
Entity-Centric Data Management
 
SSSW 2015 Sense Making
SSSW 2015 Sense MakingSSSW 2015 Sense Making
SSSW 2015 Sense Making
 
LDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked Data
LDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked DataLDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked Data
LDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked Data
 
Executing Provenance-Enabled Queries over Web Data
Executing Provenance-Enabled Queries over Web DataExecuting Provenance-Enabled Queries over Web Data
Executing Provenance-Enabled Queries over Web Data
 
The Dynamics of Micro-Task Crowdsourcing
The Dynamics of Micro-Task CrowdsourcingThe Dynamics of Micro-Task Crowdsourcing
The Dynamics of Micro-Task Crowdsourcing
 
CIKM14: Fixing grammatical errors by preposition ranking
CIKM14: Fixing grammatical errors by preposition rankingCIKM14: Fixing grammatical errors by preposition ranking
CIKM14: Fixing grammatical errors by preposition ranking
 
OLTP-Bench
OLTP-BenchOLTP-Bench
OLTP-Bench
 
An Introduction to Big Data
An Introduction to Big DataAn Introduction to Big Data
An Introduction to Big Data
 
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)
 

Recently uploaded

Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxlancelewisportillo
 
Using Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea DevelopmentUsing Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea Developmentchesterberbo7
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management systemChristalin Nelson
 
How to Fix XML SyntaxError in Odoo the 17
How to Fix XML SyntaxError in Odoo the 17How to Fix XML SyntaxError in Odoo the 17
How to Fix XML SyntaxError in Odoo the 17Celine George
 
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITWQ-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITWQuiz Club NITW
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Projectjordimapav
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfTechSoup
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfJemuel Francisco
 
IPCRF/RPMS 2024 Classroom Observation tool is your access to the new performa...
IPCRF/RPMS 2024 Classroom Observation tool is your access to the new performa...IPCRF/RPMS 2024 Classroom Observation tool is your access to the new performa...
IPCRF/RPMS 2024 Classroom Observation tool is your access to the new performa...MerlizValdezGeronimo
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptxmary850239
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
 
MS4 level being good citizen -imperative- (1) (1).pdf
MS4 level   being good citizen -imperative- (1) (1).pdfMS4 level   being good citizen -imperative- (1) (1).pdf
MS4 level being good citizen -imperative- (1) (1).pdfMr Bounab Samir
 
How to Make a Duplicate of Your Odoo 17 Database
How to Make a Duplicate of Your Odoo 17 DatabaseHow to Make a Duplicate of Your Odoo 17 Database
How to Make a Duplicate of Your Odoo 17 DatabaseCeline George
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parentsnavabharathschool99
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4JOYLYNSAMANIEGO
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptxmary850239
 
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...Association for Project Management
 
Grade Three -ELLNA-REVIEWER-ENGLISH.pptx
Grade Three -ELLNA-REVIEWER-ENGLISH.pptxGrade Three -ELLNA-REVIEWER-ENGLISH.pptx
Grade Three -ELLNA-REVIEWER-ENGLISH.pptxkarenfajardo43
 

Recently uploaded (20)

Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
 
Using Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea DevelopmentUsing Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea Development
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management system
 
How to Fix XML SyntaxError in Odoo the 17
How to Fix XML SyntaxError in Odoo the 17How to Fix XML SyntaxError in Odoo the 17
How to Fix XML SyntaxError in Odoo the 17
 
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITWQ-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Project
 
TRank ISWC2013

  • 1. TRank: Ranking Entity Types Using the Web of Data. Alberto Tonon¹, Michele Catasta², Gianluca Demartini¹, Philippe Cudré-Mauroux¹, and Karl Aberer². ¹eXascale Infolab, University of Fribourg, Switzerland, {alberto, demartini, phil}@exascale.info. ²Distributed Information Systems Laboratory, EPFL, Switzerland, {firstname.lastname}@epfl.ch. ISWC, 25 October 2013.
  • 2. Why Entities? • The Web is getting entity-centric! • Entity-centric services (e.g. Google)
  • 3. …and Why Types? • “Summarization” of texts • Contextual entity summaries in Web pages • Disambiguation of other entities • Diversification of search results. Example: Article title: “Bin Laden Relative Pleads Not Guilty in Terrorism Case”. Entities: Osama Bin Laden, Abu Ghaith, Lewis Kaplan, Manhattan. Types: Al-Qaeda Propagandists, Kuwaiti Al-Qaeda Members, Judge, Borough (New York City). Contextual summary: “Sulaiman Abu Ghaith, a son-in-law of Osama bin Laden who once served as a spokesman for Al Qaeda”, with types Al-Qaeda Propagandist, Kuwaiti Al-Qaeda Members, Jihadist Organizations.
  • 4. Entities May Have Many Types. Example types: Thing, American Billionaires, People from King County, People from Seattle, Windows People, Agent, Person, Living People, American People of Scottish Descent, Harvard University People, American Computer Programmers, American Philanthropists.
  • 5. Our Task: Ranking Types Given a Context • Input: a knowledge base G, an entity e, a context c in which e appears. • Output: e’s types ranked by relevance with respect to the context c. • Evaluation: crowdsourcing + MAP, NDCG. Example: G: DBpedia 3.8; e: Bill Gates; c: «Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975.» Ranked types for Bill Gates: 1. American Chief Executive 2. American Computer Programmer 3. American Billionaires 4. …
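The slide names MAP and NDCG as evaluation metrics without defining them. The sketch below shows one common formulation of each, assuming graded relevance values listed in ranked order. The linear gain and log2 discount in `dcg` are an assumed variant, not necessarily the one used in the TRank evaluation, and all function names are illustrative.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: linear gain, log2 position discount (assumed variant)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG of the given ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def average_precision(relevant_flags):
    """Mean of precision@k over the positions k that hold a relevant item."""
    hits, total = 0, 0.0
    for rank, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(rankings):
    """MAP: average precision averaged over several ranked lists."""
    return sum(average_precision(r) for r in rankings) / len(rankings) if rankings else 0.0
```

A ranking that already lists types in decreasing relevance gets an NDCG of 1.0; any other order scores lower.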
  • 6. TRank Pipeline: Text extraction (BoilerPipe) → Named Entity Recognition (Stanford NER) → list of entity labels → Entity linking (inverted index: DBpedia labels ⟹ resource URIs) → list of entity URIs → for each entity URI: Type retrieval (inverted index: resource URIs ⟹ type URIs) → list of type URIs → Type ranking → ranked list of types
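The control flow of the pipeline can be sketched as follows. This is a minimal illustration, not the TRank implementation: the stages are passed in as plain callables and dictionaries standing in for BoilerPipe, Stanford NER, and the two inverted indices, and every name here is hypothetical. It also assumes the label index is keyed by lowercase labels.

```python
def trank_pipeline(html, extract_text, recognize_entities,
                   label_index, type_index, rank_types):
    """Run the five stages and return {entity URI: ranked list of type URIs}.

    extract_text:        callable html -> text (stand-in for text extraction)
    recognize_entities:  callable text -> list of entity labels (stand-in for NER)
    label_index:         dict lowercase label -> list of resource URIs (entity linking)
    type_index:          dict resource URI -> list of type URIs (type retrieval)
    rank_types:          callable (uri, types, text) -> ranked list of types
    """
    text = extract_text(html)
    ranked = {}
    for label in recognize_entities(text):
        # Entity linking: look the label up in the inverted index.
        for uri in label_index.get(label.lower(), []):
            # Type retrieval, then type ranking for each linked entity.
            types = type_index.get(uri, [])
            ranked[uri] = rank_types(uri, types, text)
    return ranked
```

Keeping the two lookups as dictionary accesses mirrors the point of the design: linking and type retrieval become index lookups rather than queries against a triple store.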
  • 7. Type Hierarchy [Figure: a single type hierarchy rooted at <owl:Thing> that integrates types from DBpedia, schema.org, and YAGO. Legend for subClassOf relationships: explicit; inferred from <owl:equivalentClass>; manually added; PARIS ontology mapping (YAGO/DBpedia mappings).]
  • 8. Ranking Algorithms • Entity-centric • Hierarchy-based • Context-aware (featuring the type hierarchy) • Learning to Rank
  • 9. Entity-Centric Ranking Approaches (An Example) • SAMEAS Score(e, t) = number of URIs representing e with type t.
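A minimal sketch of the SAMEAS score, assuming the sameAs links and the type assertions are available as plain dictionaries. The names `same_as` and `types_of` are hypothetical, and counting the entity's own URI alongside its equivalents is an assumption made here for illustration.

```python
def sameas_score(entity, type_uri, same_as, types_of):
    """Count the URIs representing `entity` that carry `type_uri`.

    same_as:  dict URI -> iterable of equivalent URIs (sameAs links)
    types_of: dict URI -> set of type URIs asserted for that URI
    """
    # Assumption: the entity's own URI counts as one of its representations.
    representations = set(same_as.get(entity, ())) | {entity}
    return sum(1 for uri in representations
               if type_uri in types_of.get(uri, ()))
```

A type asserted by several independent datasets for the same entity therefore scores higher than a type asserted by only one.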
  • 10. Hierarchy-Based Approaches (An Example) • ANCESTORS Score(e, t) = number of t’s ancestors in the type hierarchy contained in Te, the set of types associated with e in the knowledge base. Note: Te often doesn’t contain all supertypes of a specific type.
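The ANCESTORS score can be sketched as below, assuming the type hierarchy is a tree stored as a child-to-parent dictionary with the root absent from its keys. The `parent_of` layout and the function names are assumptions for illustration.

```python
def ancestors(type_uri, parent_of):
    """Return the ancestors of `type_uri`, nearest first, by walking parent links."""
    result, seen = [], {type_uri}
    current = parent_of.get(type_uri)
    while current is not None and current not in seen:  # guard against cycles
        result.append(current)
        seen.add(current)
        current = parent_of.get(current)
    return result


def ancestors_score(type_uri, entity_types, parent_of):
    """Number of ancestors of `type_uri` that are also in the entity's type set Te."""
    return sum(1 for a in ancestors(type_uri, parent_of) if a in entity_types)
```

Because Te often lacks some supertypes, the score counts only the ancestors actually asserted for the entity, so a specific type scores highest when the knowledge base also confirms its more general types.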
  • 11. Context-Aware Ranking Approaches (An Example) • SAMETYPE Score(e, t, cT) = number of times t appears among the types of every other entity in cT. [Figure: a context containing the entity e and two other entities, e' and e'', annotated with types such as Person, Actor, AmericanActor, Organization, and Thing.]
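A minimal sketch of the SAMETYPE score, assuming the context is given as the list of entities found in the text and that each entity's types are available in a dictionary. The names `context_entities` and `types_of` are hypothetical.

```python
def sametype_score(entity, type_uri, context_entities, types_of):
    """Count the other entities in the context that also have `type_uri`.

    context_entities: entities that co-occur in the textual context cT
    types_of:         dict entity -> set of type URIs
    """
    return sum(1 for other in context_entities
               if other != entity and type_uri in types_of.get(other, ()))
```

For example, if e is both a Person and an Actor and it co-occurs with another actor, the types Person and Actor each gain one point, while a type held by no co-occurring entity scores zero.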
  • 12. Learning to Rank Entity Types. Determine an optimal combination of all our approaches: • Decision trees • Linear regression models • 10-fold cross validation
  • 13. Avoiding SPARQL Queries with Inverted Indices and Map/Reduce • TRank is implemented with Hadoop and Map/Reduce. • All computations are done by using inverted indices: – Entity linking – Path index – Depth index • The inverted indices are publicly available at exascale.info/TRank
  • 15. Datasets • 128 recent NYTimes articles split to create: – Entity Collection – Sentence Collection – Paragraph Collection – 3-Paragraphs Collection • Ground truth obtained by using crowdsourcing – 3 workers per entity/context – 4 levels of relevance for each type – Overall cost: $190
  • 16. Effectiveness Evaluation. Check our paper or contact us for a complete description of all the approaches we evaluated.
  • 17. Efficiency Evaluation • Tested efficiency on a CommonCrawl sample of 1TB – 1,310,459 HTML pages – 23GB compressed • Map/Reduce on a cluster of 8 machines with 12 cores, 32GB of RAM and 3 SATA disks • On average, 25 min. processing time (> 100 docs per node per second) • Share of processing time per stage: Text Extraction 18.9%, NER 35.6%, Entity Linking 29.5%, Type Retrieval 9.8%, Type Ranking 6.2%
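The throughput figure on this slide can be checked from the other numbers it reports. The worked arithmetic below does that, and also converts the per-stage percentages into minutes. Pairing each percentage with a stage in the order listed is an assumption drawn from the flattened chart labels.

```python
pages = 1_310_459      # HTML pages in the CommonCrawl sample
minutes = 25           # average processing time
nodes = 8              # machines in the cluster

docs_per_second = pages / (minutes * 60)
docs_per_node_per_second = docs_per_second / nodes  # roughly 109, consistent with "> 100"

# Assumed pairing of stages and shares, in the order they appear on the slide.
stage_share = {
    "Text Extraction": 18.9,
    "NER": 35.6,
    "Entity Linking": 29.5,
    "Type Retrieval": 9.8,
    "Type Ranking": 6.2,
}
stage_minutes = {stage: minutes * share / 100 for stage, share in stage_share.items()}
```

Under that pairing, NER and entity linking together account for about 65% of the run, while type ranking itself takes the smallest share.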
  • 18. Conclusions • New task: ranking entity types. – Useful for: “summarization” of Web documents, entity summaries, disambiguation. • Various approaches: entity-centric, context-aware, hierarchy-based, learning to rank. – Hierarchy-based and learning to rank are the most effective. • Hadoop, Map/Reduce, and inverted indices to achieve scalability.
  • 19. Grazie! Thank you for your attention! • Datasets (with relevance judgments!), inverted indices, evaluation tools and more material are available at exascale.info/Trank. • TRank is open-source! https://github.com/MEM0R1ES/TRank • Check out B-hist at the SW Challenge! • Thanks to the Semantic Web Science Association for the Travel Award!
  • 21. Entity-Centric Ranking Approaches • FREQ Rank(e, t, ck) = number of triples <e> <rdf:type> <t> in the knowledge base. • WIKILINK Rank(e, t, ck) = number of e’s “neighbor entities” with type t. • SAMEAS Rank(e, t, ck) = number of URIs representing e with type t. • LABEL Rank(e, t, ck) = frequency of t among the top-10 most similar entities in terms of label (thank you, Lucene).
  • 22. Entity-Centric Ranking Approaches • LABEL Rank(e, t, ck) = frequency of t among the top-10 most similar entities in terms of label. Exploits an inverted index. [Figure: creating the inverted index. Knowledge-base entities e1 … eN carry labels such as "Tom Cruise", "Tom Hanks", "Bill Gates", and "Osama Bin Laden"; the index maps label tokens to entities ("Tom" → e1, e3; "Cruise" → e1; "Hanks" → e3; "Bill" → e2). Label(e) is issued as a query against the index, and TF-IDF ranking returns the top-10 entities.]
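The LABEL approach can be sketched with a small in-memory inverted index. This is an illustration under stated assumptions, not the Lucene-based implementation: similarity is a simplified IDF-weighted token overlap rather than Lucene's TF-IDF scoring, and the names `labels`, `types_of`, and all functions are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_label_index(labels):
    """Map each lowercase label token to the set of entities whose label contains it."""
    index = defaultdict(set)
    for entity, label in labels.items():
        for token in label.lower().split():
            index[token].add(entity)
    return index


def similar_entities(entity, labels, index, k=10):
    """Return up to k other entities ranked by IDF-weighted shared label tokens."""
    n = len(labels)
    scores = Counter()
    for token in set(labels[entity].lower().split()):
        postings = index.get(token, set())
        if not postings:
            continue
        idf = math.log(1 + n / len(postings))  # smoothed so the weight stays positive
        for other in postings:
            if other != entity:
                scores[other] += idf
    return [e for e, _ in scores.most_common(k)]


def label_score(entity, type_uri, labels, index, types_of, k=10):
    """Frequency of `type_uri` among the top-k entities with the most similar labels."""
    return sum(1 for other in similar_entities(entity, labels, index, k)
               if type_uri in types_of.get(other, ()))
```

With labels "Tom Cruise", "Bill Gates", and "Tom Hanks", the token "tom" links the first and third entities, so each appears in the other's list of similar entities and contributes its types to the score.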
  • 23. Hierarchy-Based Ranking Approaches • DEPTH Rank(e, t, cH) = depth of t in the type hierarchy. • ANCESTORS Rank(e, t, cH) = number of t’s ancestors in the type hierarchy contained in Te. • ANC_DEPTH Rank(e, t, cH) = [formula not captured in the transcript]. Note: Te often doesn’t contain all supertypes of a specific type.
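A minimal sketch of the DEPTH approach, assuming an acyclic hierarchy stored as a child-to-parent dictionary with roots absent from its keys and assigned depth 0. Precomputing all depths once loosely mirrors the idea of a depth index; the names and the tie-breaking rule are assumptions.

```python
def build_depth_index(parent_of):
    """Precompute the depth of every type; roots (types without a parent) get depth 0."""
    depths = {}

    def depth(type_uri):
        if type_uri not in depths:
            parent = parent_of.get(type_uri)
            depths[type_uri] = 0 if parent is None else depth(parent) + 1
        return depths[type_uri]

    all_types = set(parent_of) | {p for p in parent_of.values() if p is not None}
    for type_uri in all_types:
        depth(type_uri)
    return depths


def rank_by_depth(entity_types, depth_index):
    """Order types from deepest (most specific) to shallowest; ties broken by name."""
    return sorted(entity_types, key=lambda t: (-depth_index.get(t, 0), t))
```

Ranking by depth favors the most specific types, since a type deep in the hierarchy is assumed to say more about the entity than a general one such as Thing.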
  • 24. Context-Aware Ranking Approaches • The context can help produce a better ranking of types. • Which is the right type for Mali? Context: “Italy’s rebellious voters, who opted for a flamboyant billionaire and a clown, reminded us last week how deeply in crisis the Continent is. Meanwhile, France is going it virtually alone in Mali, and Britain talks openly of jumping the European ship altogether.” Candidate types for Mali: Landlocked Countries, Least Developed Countries, States and Territories Established in 1960, French-speaking Countries, World Trade Organization Member Economies, Country, African Union Member States, African Countries, Member States of La Francophonie, African Union Member Economies, Populated Place, Place.
  • 25. Context-Aware Ranking Approaches: PATH • Suppose we have to compute Rank(e, t, cT). • Consider each type t’ of each other entity e’ in c. • P(t) = path from the root of the type hierarchy to t. ???
  • 26. Context-Aware Ranking Approaches. Ranking Tom Hanks’s types when co-occurring with Tom Cruise in some text.
  • 27. Relevance Judgments • Crowdsourced relevance judgments. • Anonymous Web users are TRank users. • 3 workers per entity/context. • Overall cost: $190 • Pilot study on task design… mega-bubbles! • Number of votes used as the relevance score for a type.
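Turning worker votes into relevance scores can be sketched as a simple count. The sketch assumes each worker selects exactly one type per entity/context pair, which is an assumption based on the speaker notes; the function and argument names are hypothetical.

```python
from collections import Counter


def relevance_scores(candidate_types, worker_choices):
    """Relevance of each candidate type = number of workers who selected it.

    candidate_types: all types of the entity being judged
    worker_choices:  one selected type per worker (assumed: one vote each)
    """
    votes = Counter(worker_choices)
    return {t: votes.get(t, 0) for t in candidate_types}
```

With 3 workers a type can receive 0, 1, 2, or 3 votes, which would be consistent with the 4 levels of relevance mentioned for the datasets.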

Editor's Notes

  1. An entity is something that exists by itself, although it need not be of material existence
  2. READ THE TYPES
  3. STATE-OF-THE-ART NER AND LINKING; FOCUS IS RANKING TYPES
  4. PARIS: VLDB 2012 ontology alignment. YAGO: super-specific types
  5. Entity-centric: use only the information connected to the entity. Context-aware: exploit the types of entities that co-occur in the context (e.g. Bill Gates + Microsoft vs Bill Gates + Scotland). Hierarchy-based: exploit the type hierarchy. Learning to Rank: combine evidence coming from all previous approaches in an optimal way
  6. We start from the node representing an entity, follow same-as links (reaching other nodes that represent the same entity), and count how many “new” nodes feature the type we’re giving a score to
  7. The set of types associated to an entity in a knowledge base often doesn’t contain all super types
  8. C_T is the context given by the text
  9. 10-FOLD CROSS-VALIDATION, DECISION TREE REGRESSION … Preliminary experiments showed that it is the best-performing model
  10. Use Inverted indices to AVOID SPARQL QUERIES!!
  11. - Increasing granularities of context: from no-context (here is the entity, here are its types, rank them), one sentence/paragraph (rank the types of all entities in this sentence/paragraph) - 3 workers were asked to select the best type of each entity appearing in a given context
  12. ANCESTORS is the real winner, since it uses inverted indices, which are faster, and needs no machine learning
  13. Only pages with schema.org
  14. HADOOP -> scalable, not efficient
  15. SEMANTIC WEB SCIENCE ASSOCIATION!
  16. Ck is the context given by the knowledge base