NEXT: Background
SCID: Semantic Co-reference Inaccuracy Detection - [ INTRODUCTION ]
Filtering Inaccurate Entity Co-references on the Linked Open Data
John Cuzzola, Jelena Jovanovic, Ebrahim Bagheri
bagheri@ryerson.ca
DEXA 2015
The Linked Open Data (LOD) cloud represents hundreds of datasets available throughout the Web.
❖ 570 datasets and 2909 linkage relationships between the datasets [1].
[1] http://data.dws.informatik.uni-mannheim.de/lodcloud/2014/
NEXT: How are datasets linked?
SCID: Semantic Co-reference Inaccuracy Detection - [ BACKGROUND ]
To use data from multiple ontologies within the LOD, “equivalence” relationships between concepts are necessary (i.e., the “edges” or linkages of the LOD must be defined).
Example: what are the equivalence relationships between DBpedia and Freebase?
NEXT: The sameAs predicate
The equivalence relationship is often expressed via the predicate owl:sameAs:
http://rdf.freebase.com/ns/en.dog <owl:sameAs> http://dbpedia.org/resource/Dog
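As a minimal sketch, the link between the two Dog resources can be written out as one N-Triples line. The helper name same_as_triple is hypothetical, and the choice of which URI is the subject is an assumption (owl:sameAs is symmetric, so either order states the same thing).

```python
# Full IRI of the owl:sameAs predicate.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

def same_as_triple(subject_uri, object_uri):
    """Serialise one owl:sameAs assertion as an N-Triples line (hypothetical helper)."""
    return f"<{subject_uri}> <{OWL_SAME_AS}> <{object_uri}> ."

line = same_as_triple("http://rdf.freebase.com/ns/en.dog",
                      "http://dbpedia.org/resource/Dog")
print(line)
```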
NEXT: sameAs linkage mistakes
Example (from http://www.sameas.org):
http://rdf.freebase.com/ns/en.bitch <owl:sameAs> http://dbpedia.org/resource/Dog
X NOT the same!
ns:common.topic.description: "Bitch, literally meaning a female dog, is a common slang term in the English language, especially used as a denigrating term applied to a person, commonly a woman"
dbo:abstract: "The domestic dog (Canis lupus familiaris) is a usually furry, carnivorous member of the canidae family."
The Problem: There are many incorrect LOD linkages using owl:sameAs.
The Effect: Incorrect (embarrassing) assertions by reasoners that use the LOD.
NEXT: SCID
SCID: Semantic Co-reference Inaccuracy Detection - [ PROBLEM / MOTIVATION ]
SCID: Semantic Co-reference Inaccuracy Detection
❖ A method of natural language analysis for detecting incorrect owl:sameAs assertions.
1. Construct a baseline comparison vector vb(x,Sx).
2. For each resource (1,2,...) claiming to be the “same”, construct vectors v1(x1,Sx), v2(x2,Sx), …
3. Compare the individual distances from v1(x1,Sx), v2(x2,Sx), … to the baseline vb(x,Sx).
4. Disregard those v1(x1,Sx), v2(x2,Sx), … that are outside some threshold distance δ.
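The four steps can be sketched as one small driver function. This is only an illustration under stated assumptions: scid_filter, toy_rho and toy_similarity are hypothetical names, the category selection, distribution and comparison functions are passed in as parameters, and the comparison is expressed as a similarity score (keep if the score is at least δ) rather than a distance.

```python
def scid_filter(candidate_uri, same_as_texts, baseline_text,
                select_categories, rho, similarity, delta):
    """Return the sameAs URIs whose vector is close enough to the baseline."""
    categories = select_categories(candidate_uri)   # Sx
    v_b = rho(baseline_text, categories)            # step 1: baseline vector
    kept = []
    for uri, text in same_as_texts.items():
        v_i = rho(text, categories)                 # step 2: one vector per resource
        if similarity(v_b, v_i) >= delta:           # steps 3 and 4: compare, then filter
            kept.append(uri)
    return kept

# Toy stand-ins (assumptions, not the functions used by SCID).
def toy_rho(text, categories):
    words = text.lower().split()
    return [words.count(c) / len(words) for c in categories]

def toy_similarity(u, v):
    # 1 minus the largest coordinate gap; 1.0 means identical vectors.
    return 1.0 - max(abs(a - b) for a, b in zip(u, v))

kept = scid_filter(
    "ex:Port",
    {"ex:Seaport": "harbour harbour ship", "ex:Port_wine": "wine wine grape"},
    "harbour ship harbour dock",
    select_categories=lambda uri: ["harbour", "wine"],
    rho=toy_rho, similarity=toy_similarity, delta=0.7,
)
print(kept)
```

Passing the three functions as arguments keeps the driver independent of how the vectors are built or compared.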
NEXT: The core functions of SCID.
UPCOMING: How are vb(x,Sx) and v1(x1,Sx), v2(x2,Sx), … made?
SCID: Semantic Co-reference Inaccuracy Detection - [ CONTRIBUTION ]
SCID: Semantic Co-reference Inaccuracy Detection - [ METHOD ]
SCID depends on two key functions:
1. A category distribution function: ρ(t,S).
Given some natural language text (t) and a set of “suitable” subject categories (S) for t, compute a distribution vector of how t relates to each subject category of S.
2. A category selection function: S(uri).
Given a resource (uri), return a “suitable” set of subject categories (S) that can be used in ρ(t,S).
NEXT: The category distribution function.
UPCOMING: The category selection function.
The category distribution function: ρ(t,S).
Ex: Given input text (t) as shown and three DBpedia subject categories S=[Fruit, Oranges, Color], ρ(t,S) produces the output:
ρ(t,[Fruit, Oranges, Color]) = v1(x1,S) = [ Fruit: 0.27, Oranges: 0.50, Color: 0.22 ]
NEXT: The category selection function
UPCOMING: How is baseline vector vb(x,Sx) computed and compared to v1(x1,S)?
● Computes R(x,k), defined as the importance of word x to category k, for every word x in t.
○ Uses 5 features: (1) count of x in k, (2) count of x across all k, (3) count of concepts where word x appears, (4) ratio of x in k to the vocabulary of all k, (5) average word frequency of x per resource in k.
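A much-simplified sketch of a category distribution function follows. It is not the five-feature importance score described above: as a loudly stated assumption, it uses only feature (1), the raw count of word x in category k, and the per-category sample texts are made up for illustration.

```python
from collections import Counter

def category_distribution(text, category_texts):
    """Toy rho(t, S): share of t's word matches that fall in each category."""
    # Word counts per category, built from hypothetical sample text.
    counts = {k: Counter(doc.lower().split()) for k, doc in category_texts.items()}
    scores = dict.fromkeys(category_texts, 0.0)
    for word in text.lower().split():
        for k in scores:
            # Stand-in for the importance of x to k: count of x in k (feature 1 only).
            scores[k] += counts[k][word]
    total = sum(scores.values())
    # Normalise so the vector sums to 1; all zeros if no word matched.
    return {k: (s / total if total else 0.0) for k, s in scores.items()}

categories = {
    "Fruit": "fruit sweet juice tree fruit",
    "Oranges": "orange citrus juice orange peel",
    "Color": "orange red colour spectrum",
}
v1 = category_distribution("orange juice is a citrus fruit drink", categories)
print(v1)
```

The output is a vector over S whose entries sum to 1, which is the shape of result shown in the Fruit/Oranges/Color example.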
The category selection function: S(uri).
⇶ DBpedia contains 656,000+ subject categories.
How do we select a few suitable ones for ρ(t,S)?
1. Begin with a candidate resource (uri):
http://dbpedia.org/resource/Orange_(fruit)
2. Find its DBpedia disambiguation page:
http://dbpedia.org/resource/Orange_(disambiguation)
3. Combine (take the union of) the subject categories of each resource listed on the disambiguation page.
S(uri) = [ category: { Optical_Spectrum, Oranges, Citrus_hybrids, Tropical_agriculture, American_punk_rock, Rock_music, Hellcat_Records } ]
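The selection steps can be sketched with plain dictionaries standing in for DBpedia lookups. The two tables below are hypothetical in-memory data, not a real API; the grouping of categories per resource is read off the slide's figure labels, and the fallback for resources without a disambiguation page is an assumption.

```python
# Hypothetical stand-ins for DBpedia lookups.
DISAMBIGUATES = {
    "dbr:Orange_(disambiguation)": [
        "dbr:Orange_(colour)", "dbr:Orange_(fruit)", "dbr:Orange_(band)",
    ],
}
SUBJECTS = {
    "dbr:Orange_(colour)": {"Optical_Spectrum"},
    "dbr:Orange_(fruit)": {"Oranges", "Citrus_hybrids", "Tropical_agriculture"},
    "dbr:Orange_(band)": {"American_punk_rock", "Rock_music", "Hellcat_Records"},
}

def select_categories(uri):
    """Toy S(uri): union of the categories of all resources on uri's disambiguation page."""
    for members in DISAMBIGUATES.values():
        if uri in members:
            return set().union(*(SUBJECTS.get(m, set()) for m in members))
    # Assumption: with no disambiguation page, fall back to the resource's own categories.
    return set(SUBJECTS.get(uri, set()))

s_uri = select_categories("dbr:Orange_(fruit)")
print(sorted(s_uri))
```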
NEXT: How do we compute v1(x1,S) for sameAs inaccuracy filtering?
UPCOMING: How is baseline vector vb(x,Sx) computed and compared to v1(x1,S)?
dbr:Orange_(colour): category:Optical_Spectrum
dbr:Orange_(fruit): category:Oranges, category:Citrus_hybrids, category:Tropical_agriculture, ...
dbr:Orange_(band): category:American_punk_rock, category:Rock_music, category:Hellcat_Records
NEXT: How is baseline vector vb(x,Sx) computed and compared to v1(x1,S)?
UPCOMING: Experimental results
http://www.sameas.org
● dbr:Port
● www.w3.org:synset-seaport-noun-1
● rdf.freebase.com:en.port
● sw.opencyc.org:Seaport
● rdf.freebase.com:River_port
● dbr:Bad_conduct
● rdf.freebase.com:en.military_discharge
● dbr:IVDP
● rdf.freebase.com:en.port_wine
How do we compute v1(t1,S), v2(t2,S), … for sameAs inaccuracy filtering?
1. Start with a group of resources that are identified as sameAs.
Ex: http://dbpedia.org/resource/Port (dbr:Port)
2. Collect the subject categories S(dbr:Port) using the category selection function.
3. For each of the sameAs resources, collect natural language text (t) describing the resource: rdfs:comment for DBpedia, ns:common.topic.description for Freebase, wn20schema:gloss for www.w3.org.
4. Compute the vectors v1(t1,S(dbr:Port)), v2(t2,S(dbr:Port)), … using the category distribution function ρ(t,S(dbr:Port)), where t1 = rdfs:comment of dbr:Port, t2 = ns:common.topic.description of rdf.freebase.com:River_port, …
We now have the individual vectors v1(t1,S(dbr:Port)), v2(t2,S(dbr:Port)), … and only need the base vector vb(tb,S) for comparison.
NEXT: Experimental results
UPCOMING: Conclusion
How is the baseline vector vb(x,Sx) computed and compared to v1(x1,S)?
1. Retrieve the subject categories of the candidate resource from DBpedia.
Ex: http://dbpedia.org/resource/Port (dbr:Port)
2. Find (all) other resources that use the categories of the candidate resource. Concatenate the rdfs:comment text from all these resources into (t).
3. Compute vb(t,S(dbr:Port)) using the category distribution function ρ(t,S(dbr:Port)).
● We now have the base vector vb(t,S(dbr:Port)), which can be compared to the individual sameAs vectors v1(x1,S(dbr:Port)), v2(x2,S(dbr:Port)), …
● We use the Pearson Correlation Coefficient (PCC) to compare vectors.
● Remove vectors whose PCC is less than the threshold δ.
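The comparison and filtering step can be sketched in a few lines. The vectors below are made-up numbers over three hypothetical categories, the function names are illustrative, and treating a constant vector as having zero correlation is an assumption of this sketch.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    # A constant vector has no defined correlation; return 0.0 (assumption).
    return cov / (su * sv) if su and sv else 0.0

def filter_same_as(baseline, candidates, delta):
    """Keep only the candidates whose PCC with the baseline is at least delta."""
    return {uri: vec for uri, vec in candidates.items()
            if pearson(baseline, vec) >= delta}

# Made-up vectors for illustration only.
v_b = [0.6, 0.3, 0.1]
candidates = {
    "rdf.freebase.com:en.port": [0.5, 0.4, 0.1],
    "rdf.freebase.com:en.port_wine": [0.1, 0.2, 0.7],
}
kept = filter_same_as(v_b, candidates, delta=0.5)
print(list(kept))
```

With these toy numbers the port-wine vector correlates negatively with the baseline, so it is removed.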
Categories of dbr:Port (from http://www.dbpedia.org): category:Nautical_terms, category:Ports_and_harbours
SCID: Semantic Co-reference Inaccuracy Detection - [ EXPERIMENTATION ]
NEXT: Experimental results continued.
UPCOMING: Conclusion
● We examined 7,690 resources obtained from the www.sameAs.org database, covering five topics:
○ Animal, City, Person, Color, and Miscellaneous.
● We performed some data cleansing on these resources.
○ Removal of duplicate resources (e.g., aliases/redirects), broken links, and redundant resources (e.g., dbpedialite is a subset of DBpedia).
● After cleansing, 411 unique resources remained, with 251 errors identified by a human oracle.
○ e.g., http://dbpedia.org/resource/Dog is not the same as http://rdf.freebase.com/ns/en.bitch
● We computed individual vectors v(t,S) for all 411 resources, each with its associated baseline comparison vector.
● We computed the Pearson correlation between each individual vector and its baseline.
● We removed identity links based on thresholds ranging from 0.0 to 0.90 and calculated the F-score for each threshold used.
○ The original 411 resources contained 160 correct / 251 incorrect sameAs links (0.560 F-score).
○ Thresholds (δ) of 0.50 and 0.60 gave the best F-score.
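The 0.560 figure for the unfiltered data can be checked with a short calculation. This sketch assumes the F-score is the balanced F1 measure with correct links as the positive class, and that the unfiltered set accepts all 411 links; under those assumptions the reported value is reproduced.

```python
def f_score(tp, fp, fn):
    """Balanced F-score (F1) from true positives, false positives, false negatives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Unfiltered data: every link is accepted, so the 160 correct links are
# true positives, the 251 incorrect links are false positives, and no
# correct link is rejected (zero false negatives).
baseline = f_score(tp=160, fp=251, fn=0)
print(round(baseline, 3))  # 0.56
```

Precision is 160/411 and recall is 1, so F1 = 320/571, which rounds to 0.560.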
NEXT: Experimental results continued.
UPCOMING: Conclusion
[Figure: scatter plot of F-score versus Pearson Correlation Coefficient for oracle-identified right (blue) and wrong (red) identity links. Plot labels: PEARSON, wrong, right, δ.]
NEXT: Conclusion
SCID: Semantic Co-reference Inaccuracy Detection - [ CONCLUSION ]
● In this presentation:
○ We introduced SCID, a technique for discovering inaccuracies in identity link assertions (owl:sameAs).
○ Experimental results indicate that SCID can identify incorrect identity link assertions and improve the precision of an identity database (http://www.sameas.org).
● In the future:
○ Experimentation with identity links other than owl:sameAs (e.g., skos:closeMatch, skos:exactMatch, owl:equivalentClass).
○ Experimentation with vector comparison methods other than Pearson correlation (e.g., cosine similarity, Euclidean distance, Spearman rank coefficient).
-- END --

Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 

Filtering Inaccurate Entity Co-references on the Linked Open Data

  • 1. SCID: Semantic Co-reference Inaccuracy Detection - [ INTRODUCTION ] Filtering Inaccurate Entity Co-references on the Linked Open Data. John Cuzzola, Jelena Jovanovic, Ebrahim Bagheri (bagheri@ryerson.ca). DEXA 2015.
  • 2. [ BACKGROUND ] The Linked Open Data (LOD) cloud represents hundreds of datasets available throughout the Web: 570 datasets and 2909 linkage relationships between them (source: http://data.dws.informatik.uni-mannheim.de/lodcloud/2014/).
  • 3. [ BACKGROUND ] To use data from multiple ontologies within the LOD, "equivalence" relationships between concepts are necessary (i.e., the "edges" or linkages of the LOD must be defined). For example, what are the equivalence relationships between DBpedia and Freebase?
  • 4. [ BACKGROUND ] The equivalence relationship is often expressed with the predicate owl:sameAs. Example: http://rdf.freebase.com/ns/en.dog owl:sameAs http://dbpedia.org/resource/Dog
  • 5. [ PROBLEM / MOTIVATION ] The Problem: There are many incorrect LOD linkages using owl:sameAs. The Effect: Incorrect (embarrassing) assertions by reasoners that use the LOD. Example (from http://www.sameas.org): http://rdf.freebase.com/ns/en.bitch owl:sameAs http://dbpedia.org/resource/Dog. These are NOT the same! ns:common.topic.description: "Bitch, literally meaning a female dog, is a common slang term in the English language, especially used as a denigrating term applied to a person, commonly a woman" dbo:abstract: "The domestic dog (Canis lupus familiaris) is a usually furry, carnivorous member of the canidae family."
  • 6. [ CONTRIBUTION ] SCID: Semantic Co-reference Inaccuracy Detection ❖ A method of natural language analysis for detecting incorrect owl:sameAs assertions. 1. Construct a baseline comparison vector v_b(x, S_x). 2. For each resource (1, 2, ...) claiming to be the "same", construct vectors v_1(x_1, S_x), v_2(x_2, S_x), ... 3. Compare the individual distances from v_1(x_1, S_x), v_2(x_2, S_x), ... to the baseline v_b(x, S_x). 4. Disregard those v_1(x_1, S_x), v_2(x_2, S_x), ... that fall outside some threshold distance δ.
  • 7. [ METHOD ] SCID depends on two key functions: 1. A category distribution function ρ(t, S). Given some natural language text (t) and a set of "suitable" subject categories (S) for t, compute a distribution vector of how t relates to each subject category in S. 2. A category selection function S(uri). Given a resource (uri), return a "suitable" set of subject categories (S) that can be used in ρ(t, S).
  • 8. [ METHOD ] The category distribution function ρ(t, S). Ex: Given an input text (t, shown on the slide) and three DBpedia subject categories S = [Fruit, Oranges, Color], ρ(t, S) produces the output: ρ(t, [Fruit, Oranges, Color]) = v_1(x_1, S) = [Fruit: 0.27, Oranges: 0.50, Color: 0.22]. ● Computes R_{x,k}, defined as the importance of word x to category k, for every word in t. ○ Uses 5 features: (1) count of x in k, (2) count of x across all k, (3) count of concepts where word x appears, (4) ratio of x in k to the vocabulary of all k, (5) average word frequency of x per resource in k.
  • 9. [ METHOD ] The category selection function S(uri). DBpedia contains 656,000+ subject categories. How do we select a few that are suitable for ρ(t, S)? 1. Begin with a candidate resource (uri): http://dbpedia.org/resource/Orange_(fruit) 2. Find a DBpedia disambiguation page: http://dbpedia.org/resource/Orange_(disambiguation) 3. Combine (take the union of) the subject categories for each of these resources. S(uri) = [ category: { Optical_Spectrum, Oranges, Citrus_hybrids, Tropical_agriculture, American_punk_rock, Rock_music, Hellcat_Records } ] [Figure: the resources dbr:Orange_(colour), dbr:Orange_(fruit), and dbr:Orange_(band) linked to their subject categories.]
  • 10. [ METHOD ] How do we compute v_1(t_1, S), v_2(t_2, S), ... for sameAs inaccuracy filtering? 1. Start with a group of resources that are identified as sameAs. Ex: http://dbpedia.org/resource/Port (dbr:Port), whose sameAs group on http://www.sameas.org includes: dbr:Port, www.w3.org:synset-seaport-noun-1, rdf.freebase.com:en.port, sw.opencyc.org:Seaport, rdf.freebase.com:River_port, dbr:Bad_conduct, rdf.freebase.com:en.military_discharge, dbr:IVDP, rdf.freebase.com:en.port_wine. 2. Collect the subject categories S_dbr:Port using the category selection function. 3. For each of the sameAs resources, collect natural language text (t) describing the resource, using DBpedia rdfs:comment, Freebase ns.common.topic.description, and www.w3.org wn20schema:gloss. 4. Compute vectors v_1(t_1, S_dbr:Port), v_2(t_2, S_dbr:Port), ... where t_1 = rdfs:comment of dbr:Port, t_2 = ns.common.topic.description of rdf.freebase.com:River_port, ... using the category distribution function ρ(t, S_dbr:Port). We now have the individual vectors v_1, v_2, ... and only need the base vector v_b(t_b, S) for comparison.
  • 11. [ METHOD ] How is the baseline vector v_b(x, S_x) computed and compared to v_1(x_1, S)? 1. Retrieve the subject categories of the candidate resource from DBpedia (http://www.dbpedia.org). Ex: http://dbpedia.org/resource/Port (dbr:Port) has category:Nautical_terms and category:Ports_and_harbours. 2. Find (all) other resources that use the categories of the candidate resource. Concatenate the rdfs:comment of all these resources into a text (t). 3. Compute v_b(t, S_dbr:Port) using the category distribution function ρ(t, S_dbr:Port). ● We now have the base vector v_b(t, S_dbr:Port), which can be compared to the individual sameAs vectors v_1(x_1, S_dbr:Port), v_2(x_2, S_dbr:Port), ... ● We use the Pearson Correlation Coefficient (PCC) to compare vectors. ● Remove vectors whose PCC is less than the threshold δ.
  • 12. [ EXPERIMENTATION ] ● We examined 7,690 resources obtained from the www.sameAs.org database, covering five topics: ○ Animal, City, Person, Color, and Miscellaneous. ● We performed some data cleansing on these resources. ○ Removal of: duplicate resources (i.e., aliases/redirects), broken links, and redundant resources (e.g., dbpedialite is a subset of DBpedia). ● After cleansing, 411 unique resources remained, with 251 errors identified by a human oracle. ○ E.g., http://dbpedia.org/resource/Dog is not the same as http://rdf.freebase.com/ns/en.bitch
  • 13. [ EXPERIMENTATION ] ● We computed individual vectors v(t, S) for all 411 resources, each with its associated baseline comparison vector. ● We computed the Pearson correlation between each individual vector and its baseline. ● We removed identity links based on thresholds ranging from 0.0 to 0.90 and calculated the F-score for each threshold. ○ The original 411 resources contained 160 correct / 251 incorrect sameAs links (0.560 F-score). ○ Thresholds (δ) of 0.50 and 0.60 gave the best F-score.
  • 14. [ EXPERIMENTATION ] [Figure: scatter plot of F-score versus Pearson Correlation Coefficient for oracle-identified right (blue) and wrong (red) identity links, with the threshold δ marked.]
  • 15. [ CONCLUSION ] ● In this presentation: ○ We introduce SCID, a technique for discovering inaccuracies in identity link assertions (owl:sameAs). ○ Experimental results indicate SCID can identify incorrect identity link assertions and improve the precision of an identity database (http://www.sameas.org). ● In the future: ○ Experimentation with identity links other than owl:sameAs (e.g., skos:closeMatch, skos:exactMatch, owl:equivalentClass). ○ Experimentation with vector comparison methods other than Pearson correlation (e.g., cosine similarity, Euclidean distance, Spearman rank coefficient).
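
The comparison step on slide 11 (keep a sameAs candidate only if its Pearson correlation with the baseline vector is at least δ) and the baseline F-score on slide 13 can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`pearson`, `filter_same_as`, `f1_score`), the `ex:` identifiers, and the example vectors are hypothetical, and the category distribution function ρ(t, S) is assumed to have already produced the vectors. The F-score check assumes the unfiltered database is scored by treating every one of the 411 links as accepted, which reproduces the reported 0.560.

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(u)
    if n != len(v) or n < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a constant vector carries no correlation signal
    return cov / (norm_u * norm_v)


def filter_same_as(baseline, candidates, delta=0.5):
    """Keep candidates whose correlation with the baseline is >= delta."""
    return {
        uri: vec
        for uri, vec in candidates.items()
        if pearson(vec, baseline) >= delta
    }


def f1_score(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical category distribution vectors over three subject categories.
baseline = [0.6, 0.3, 0.1]
candidates = {
    "ex:candidate_a": [0.5, 0.4, 0.1],  # similar profile to the baseline
    "ex:candidate_b": [0.1, 0.2, 0.7],  # opposite profile
}
print(sorted(filter_same_as(baseline, candidates, delta=0.5)))

# Unfiltered database from slide 13: all 411 links accepted,
# 160 correct (true positives), 251 incorrect (false positives).
print(f"{f1_score(tp=160, fp=251, fn=0):.3f}")  # prints 0.560
```

Raising `delta` removes more links, trading recall for precision; the slides report that thresholds of 0.50 and 0.60 gave the best F-score on their data.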