A method for identifying incorrect sameAs links on the Linked Open Data cloud
Details published in:
John Cuzzola, Ebrahim Bagheri, Jelena Jovanovic:
Filtering Inaccurate Entity Co-references on the Linked Open Data. DEXA (1) 2015: 128-143
Discovering Alignments in Ontologies of Linked Data - Craig Knoblock
Rahul Parundekar and Craig A. Knoblock and Jose Luis Ambite, Discovering Alignments in Ontologies of Linked Data, Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI), 2013
Presentation of Profiling Similarity Links in LOD @ DesWEB, ICDE 2016 - Blerina Spahiu
Abstract—Usually the content of a dataset published as LOD is rather unknown, and data publishers face the challenge of interlinking new knowledge with existing datasets. Although tools to facilitate data interlinking exist, they rely on prior knowledge about the datasets to be interlinked. In this paper we present a framework to profile the quality of the owl:sameAs property in the Linked Open Data cloud and to automatically discover new similarity links, giving a similarity score for all instances without prior knowledge of the properties used. Experimental results demonstrate the usefulness and effectiveness of the framework in automatically generating new links between two or more similar instances.
A Generic Language for Integrated RDF Mappings of Heterogeneous Data - andimou
Despite the significant number of existing tools, incorporating data from multiple sources and different formats into the Linked Open Data cloud remains complicated. No mapping formalization exists to define how to map such heterogeneous sources into RDF in an integrated and interoperable fashion.
This paper introduces the RML mapping language, a generic language based on an extension over R2RML, the W3C standard for mapping relational databases into RDF. Broadening R2RML's scope, the language becomes source-agnostic and extensible, while facilitating the definition of mappings of multiple heterogeneous sources. This leads to higher integrity within datasets and richer interlinking among resources.
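Since no mapping snippet is given above, here is a minimal illustrative RML mapping; the people.json source, its field names, and the ex: URIs are made up for the example. It is embedded in Python and parsed with rdflib only to check that it is well-formed Turtle, not to execute the mapping:

```python
# A minimal sketch of an RML mapping over a hypothetical JSON source.
from rdflib import Graph

RML_MAPPING = """
@prefix rr:   <http://www.w3.org/ns/r2rml#> .
@prefix rml:  <http://semweb.mmlab.be/ns/rml#> .
@prefix ql:   <http://semweb.mmlab.be/ns/ql#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<#PersonMapping> a rr:TriplesMap ;
    # RML generalizes R2RML's logical table to a logical source,
    # so non-relational data (here JSON) can be mapped too.
    rml:logicalSource [
        rml:source "people.json" ;
        rml:referenceFormulation ql:JSONPath ;
        rml:iterator "$.people[*]"
    ] ;
    rr:subjectMap [ rr:template "http://example.com/person/{id}" ] ;
    rr:predicateObjectMap [
        rr:predicate foaf:name ;
        rr:objectMap [ rml:reference "name" ]
    ] .
"""

g = Graph()
g.parse(data=RML_MAPPING, format="turtle")
print(f"Mapping parses: {len(g)} triples")
```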
Although RDF is a cornerstone of the Semantic Web and knowledge graphs, it has not been embraced by everyday programmers and software architects who need to safely create and access well-structured data. There is a lack of the common tools and methodologies that are available in more conventional settings to improve data quality by defining schemas that can later be validated. Two technologies have recently been proposed for RDF validation: Shape Expressions (ShEx) and Shapes Constraint Language (SHACL). In the talk, we will review the history and motivation of both technologies. We will also enumerate some challenges and future work with regard to RDF validation.
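As a taste of what such validation looks like, here is a minimal SHACL sketch (one of the two technologies discussed), using rdflib and pySHACL; the ex: vocabulary and the data are made up for illustration:

```python
# Validate a tiny RDF graph against a SHACL node shape.
from rdflib import Graph
from pyshacl import validate

SHAPES = """
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/> .

ex:PersonShape a sh:NodeShape ;
    sh:targetClass ex:Person ;
    sh:property [ sh:path ex:name ;
                  sh:datatype xsd:string ;
                  sh:minCount 1 ] .
"""

DATA = """
@prefix ex: <http://example.org/> .
ex:alice a ex:Person ; ex:name "Alice" .
ex:bob   a ex:Person .    # missing ex:name, so this node violates the shape
"""

shapes = Graph().parse(data=SHAPES, format="turtle")
data = Graph().parse(data=DATA, format="turtle")
conforms, _, report = validate(data, shacl_graph=shapes)
print(conforms)   # False: ex:bob has no ex:name
print(report)
```

ShEx expresses the analogous constraint as a schema of shape expressions rather than as an RDF vocabulary, but the validation workflow is similar.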
Detection of Related Semantic Datasets Based on Frequent Subgraph Mining - Mikel Emaldi Manrique
We describe an approach to find similarities between RDF datasets, which may be applicable to tasks such as link discovery, dataset summarization or dataset understanding. Our approach builds on the assumption that similar datasets should have a similar structure and include semantically similar resources and relationships. It is based on the combination of Frequent Subgraph Mining (FSM) techniques, used to synthesize the datasets and find similarities among them. The result of this work can be applied to ease the task of data interlinking and to promote data reuse in the Semantic Web.
Full paper at: http://memaldi.github.io/pdf/iesd2015.pdf
The Linked Open Data (LOD) cloud contains tremendous amounts of interlinked instances, from which we can retrieve abundant knowledge. However, because the ontologies are large and heterogeneous, it is time consuming to learn them all manually, and it is difficult to observe which properties are important for describing instances of a specific class. In order to construct an ontology that can help users easily access various data sets, we propose a semi-automatic ontology integration framework that can reduce the heterogeneity of ontologies and retrieve frequently used core properties for each class. The framework consists of three main components: graph-based ontology integration, machine-learning-based ontology schema extraction, and an ontology merger. By analyzing the instances of the linked data sets, this framework acquires ontological knowledge and constructs a high-quality integrated ontology, which is easily understandable and effective in knowledge acquisition from various data sets using simple SPARQL queries.
May 2012 JaxDUG presentation by Zachary Gramana on using the Lucene.NET library to add search functionality to .NET applications. Contains an overview of search/information retrieval concepts and highlights some common use-cases.
[Master Thesis]: SPARQL Query Rewriting with Paths - Abdullah Abbas
The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries. It involves publishing in languages specifically designed for data, such as the Resource Description Framework (RDF). In order to access the published data, it offers a query language named SPARQL.
The goal of this study is to transform SPARQL queries into other SPARQL queries which can be executed more efficiently. Our main goal of transformation is to eliminate non-distinguished variables, which are a source of extra complexity, where such elimination is possible. We rewrite SPARQL queries with property paths, which were introduced in SPARQL 1.1.
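The flavor of rewrite targeted here can be illustrated with a small, hypothetical example (the ex: vocabulary is made up, and this is not the thesis's actual rewriting algorithm); rdflib evaluates both queries to confirm they return the same bindings:

```python
# Eliminate a non-distinguished variable with a SPARQL 1.1 sequence path.
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.alice, EX.knows, EX.bob))
g.add((EX.bob, EX.worksFor, EX.acme))

ORIGINAL = """
PREFIX ex: <http://example.org/>
SELECT DISTINCT ?x ?z WHERE {
  ?x ex:knows ?y .        # ?y is never projected: non-distinguished
  ?y ex:worksFor ?z .
}
"""

REWRITTEN = """
PREFIX ex: <http://example.org/>
SELECT DISTINCT ?x ?z WHERE {
  ?x ex:knows/ex:worksFor ?z .   # sequence path: ?y eliminated
}
"""

assert set(g.query(ORIGINAL)) == set(g.query(REWRITTEN))
print("Both queries return the same bindings")
```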
Is there a perfect data-parallel programming language? (Experiments with Morel) - Julian Hyde
The perfect data parallel language has not yet been invented. SQL queries can achieve great performance and scale, but there are many general purpose algorithms that it cannot express. In Morel, we build on the functional and relational roots of MapReduce in an elegant and strongly-typed general-purpose programming language. But Morel is, in a real sense, a query language; programs are executed on relational frameworks such as Google BigQuery and Spark.
In this talk, we describe the principles that drove Morel’s design, the problems that we had to solve in order to implement a hybrid functional/relational language, and how Morel can be applied to implement data-intensive systems.
We also introduce Apache Calcite, the popular open source framework for query planning, and describe how Morel's compiler uses Calcite's relational algebra and rewrite rules to generate efficient plans.
In this presentation, I talk about how simple yet innovative ideas can solve complex problems. Many computationally challenging problems can be solved using very simple solutions and through the engagement of the public, the so-called crowds! I talk about how the concept of gamification has been influential.
The microblogging service, Twitter, has gained wide popularity, with over 300M active users and over 500M tweets per day. Twitter's unique characteristic of only allowing short messages to be communicated has brought about interesting changes to how information is expressed and communicated by users; i.e., the semantics of information when expressed on Twitter differ from when it is expressed in other media. For instance, the word 'metal' when observed on Twitter carries a different semantic meaning, most likely referring to heavy metal music, as opposed to when used in other contexts where its predominant sense is the metal material. In this talk, I will discuss how the meaning and senses of words can be captured and modeled on Twitter to enable better and more efficient search, retrieval and recommendation of content.
- What are clustering, honeypots, and density-based clustering?
- What is OPTICS clustering and how is it different from density-based (DB) clustering? ...and how can it be used for outlier detection? (see the sketch after this list)
- What is so-called soft clustering and how is it different from clustering? ...and how can it be used for outlier detection?
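A minimal sketch of the OPTICS-for-outliers idea mentioned above, using scikit-learn on synthetic data (the deck's own honeypot data is not reproduced here, and the parameters are illustrative):

```python
# OPTICS-based outlier detection: points labelled -1 are noise.
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)
dense1 = rng.normal(loc=0.0, scale=0.5, size=(200, 2))   # tight cluster 1
dense2 = rng.normal(loc=5.0, scale=0.5, size=(200, 2))   # tight cluster 2
noise = rng.uniform(low=-3, high=8, size=(10, 2))        # scattered outliers
X = np.vstack([dense1, dense2, noise])

clust = OPTICS(min_samples=10).fit(X)
outliers = X[clust.labels_ == -1]   # OPTICS labels noise points as -1
print(f"{len(outliers)} points flagged as outliers")
```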
Data Tactics Data Science Brown Bag (April 2014) - Rich Heimann
This is a presentation we give internally every quarter as part of our Data Science Brown Bag series. This presentation covered different types of soft clustering techniques, all of which the team currently uses depending on the complexity of the data and the complexity of customer problems. If you are interested in learning more about working with L-3 Data Tactics or in working for the L-3 Data Tactics Data Science team, please contact us soon! Thank you.
Tutorial@BDA 2017 -- Knowledge Graph Expansion and Enrichment - Paris Sud University
Today, we are experiencing an unprecedented production of resources, published as Linked Open Data (LOD, for short). This is leading to the creation of knowledge graphs (KGs) containing billions of RDF (Resource Description Framework) triples, such as DBpedia, YAGO and Wikidata on the academic side, and the Google Knowledge Graph or Microsoft's Satori graph on the commercial side. These KGs contain millions of entities (such as people, proteins, or books) and millions of facts about them. This knowledge is typically expressed as RDF triples of the form ⟨Macron, presidentOf, France⟩. Some KGs provide an ontology expressed in OWL 2 (Web Ontology Language), which describes the vocabulary (the classes and properties) for the RDF facts. However, to exploit and benefit from the richness of this available data and knowledge, several problems have to be faced, namely data linking, data fusion and knowledge discovery, when data is of big volume, heterogeneous and evolving. In this tutorial we will first give an overview of existing data linking and key discovery approaches. Then, we will discuss the problem of identity crisis caused by the misuse of the owl:sameAs predicate and give some possible solutions. We will finish by highlighting some current challenges in this research area.
An approach to identify how much a Linked Data dataset is biased, using statistical methods and the links between datasets. 28/11/2014 @EKAW2014, Linköping, Sweden
Wi2015 - Clustering of Linked Open Data - the LODeX tool - Laura Po
Presentation of the tool LODeX (http://www.dbgroup.unimore.it/lodex2/testCluster) at the 2015 IEEE/WIC/ACM International Conference on Web Intelligence, Singapore, December 6-8, 2015
Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010 - Paolo Missier
Missier, P., Ludascher, B., Bowers, S., Anand, M. K., Altintas, I., Dey, S., et al. (2010). Linking Multiple Workflow Provenance Traces for Interoperable Collaborative Science. Proc. 5th Workshop on Workflows in Support of Large-Scale Science (WORKS).
Machine learning on graphs is an important and ubiquitous task, with applications ranging from drug design to friendship recommendation in social networks. The primary challenge in this domain is finding a way to represent, or encode, graph structure so that it can be easily exploited by machine learning models. Traditionally, however, machine learning approaches have relied on user-defined heuristics to extract features encoding structural information about a graph. In this talk I will discuss methods that automatically learn to encode graph structure into low-dimensional embeddings, using techniques based on deep learning and nonlinear dimensionality reduction. I will provide a conceptual review of key advancements in this area of representation learning on graphs, including random-walk based algorithms and graph convolutional networks.
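As a flavor of the random-walk family the talk reviews, here is a DeepWalk-style sketch (not the speaker's own code) using networkx and gensim; the graph, walk lengths and embedding sizes are illustrative only:

```python
# Random-walk node embeddings: walks become "sentences" for Word2Vec.
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walks(G, num_walks=10, walk_len=20, seed=0):
    """Generate uniform random walks, one batch starting from every node."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for node in G.nodes():
            walk = [node]
            while len(walk) < walk_len:
                nbrs = list(G.neighbors(walk[-1]))
                if not nbrs:
                    break
                walk.append(rng.choice(nbrs))
            walks.append([str(n) for n in walk])   # Word2Vec expects tokens
    return walks

G = nx.karate_club_graph()
# Nodes that co-occur on many walks end up with nearby embedding vectors.
model = Word2Vec(random_walks(G), vector_size=32, window=5, min_count=0, sg=1)
print(model.wv.most_similar("0", topn=3))   # nodes structurally close to node 0
```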
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... - John Andrews
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices that have already converged can save iteration time. Skipping in-identical vertices (those with the same in-links) helps avoid duplicate computations and can likewise reduce iteration time. Road networks often have chains which can be short-circuited before the PageRank computation to improve performance, since the final ranks of chain nodes can be calculated directly; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
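A minimal sketch of the first idea above: skip recomputing vertices none of whose in-neighbors changed in the previous iteration. The plain-dict graph encoding, the in-place (Gauss-Seidel-style) updates, and the absence of dangling-node handling are all simplifications for illustration, not part of STICD itself:

```python
# PageRank that skips vertices whose in-neighborhood has converged.
def pagerank_skip_converged(graph, d=0.85, tol=1e-10, max_iter=200):
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    in_nbrs = {v: [] for v in graph}              # reverse adjacency
    for u, outs in graph.items():
        for v in outs:
            in_nbrs[v].append(u)
    changed = set(graph)                          # vertices updated last round
    for it in range(max_iter):
        if not changed:
            break                                 # everything has converged
        # First pass touches every vertex; later passes only touch vertices
        # with a recently-updated in-neighbor, skipping the converged rest.
        affected = set(graph) if it == 0 else {v for u in changed for v in graph[u]}
        changed = set()
        for v in affected:
            # In-place update: already-updated in-neighbors feed in directly.
            r = (1 - d) / n + d * sum(rank[u] / len(graph[u]) for u in in_nbrs[v])
            if abs(r - rank[v]) > tol:
                rank[v] = r
                changed.add(v)
    return rank

print(pagerank_skip_converged({"a": ["b"], "b": ["c"], "c": ["a", "b"]}))
```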
As Europe's leading economic powerhouse and the fourth-largest #economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like #Russia and #China, #Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in #cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to #AdvancedPersistentThreats (#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Filtering Inaccurate Entity Co-references on the Linked Open Data
1. Filtering Inaccurate Entity Co-references on the Linked Open Data
John Cuzzola, Jelena Jovanovic, Ebrahim Bagheri
bagheri@ryerson.ca
DEXA 2015
SCID: Semantic Co-reference Inaccuracy Detection - [ INTRODUCTION ]
NEXT: Background
2. The Linked Open Data (LOD) cloud represents hundreds of available datasets throughout the Web.
❖ 570 datasets and 2909 linkage relationships between the datasets [1].
[1] http://data.dws.informatik.uni-mannheim.de/lodcloud/2014/
SCID: Semantic Co-reference Inaccuracy Detection - [ BACKGROUND ]
NEXT: How are datasets linked?
3. To utilize the data from multiple ontologies within the LOD, "equivalence" relationships between concepts are necessary (i.e., the "edges" or linkages of the LOD must be defined).
570 datasets and 2909 linkage relationships between the datasets.
Equivalence relationships between DBpedia and Freebase?
SCID: Semantic Co-reference Inaccuracy Detection - [ BACKGROUND ]
NEXT: The sameAs predicate
4. The equivalency relationship is often accomplished via the predicate owl:sameAs:
<http://rdf.freebase.com/ns/en.dog> owl:sameAs <http://dbpedia.org/resource/Dog>
SCID: Semantic Co-reference Inaccuracy Detection - [ BACKGROUND ]
NEXT: sameAs linkage mistakes
5. <http://rdf.freebase.com/ns/en.bitch> owl:sameAs <http://dbpedia.org/resource/Dog> -- NOT the same!
ns:common.topic.description: "Bitch, literally meaning a female dog, is a common slang term in the English language, especially used as a denigrating term applied to a person, commonly a woman"
dbo:abstract: "The domestic dog (Canis lupus familiaris) is a usually furry, carnivorous member of the canidae family."
The Problem: There are many incorrect LOD linkages using owl:sameAs.
The Effect: Incorrect (embarrassing) assertions by reasoners that use the LOD.
Example from http://www.sameas.org.
SCID: Semantic Co-reference Inaccuracy Detection - [ PROBLEM / MOTIVATION ]
NEXT: SCID
6. SCID: Semantic Co-reference Inaccuracy Detection - [ CONTRIBUTION ]
❖ A method of natural language analysis for detecting incorrect owl:sameAs assertions:
1. Construct a baseline comparison vector vb(x,Sx).
2. For each resource (1, 2, ...) claiming to be the "same", construct vectors v1(x1,Sx), v2(x2,Sx), ...
3. Compare the individual distances from v1(x1,Sx), v2(x2,Sx), ... to the baseline vb(x,Sx).
4. Disregard those v1(x1,Sx), v2(x2,Sx), ... that are outside some threshold distance δ.
NEXT: The core functions of SCID.
UPCOMING: How are vb(x,Sx) and v1(x1,Sx), v2(x2,Sx), ... made?
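To make the control flow of these four steps concrete, here is a schematic Python sketch; rho, select_categories, get_text, and baseline_text are hypothetical stand-ins for the paper's ρ(t,S), S(uri), and text-collection steps, and Pearson correlation (the comparison the deck introduces later) stands in for the distance:

```python
# Control-flow skeleton of the SCID filter, with stubbed-out helpers.
from scipy.stats import pearsonr

def scid_filter(candidate_uri, sameas_uris, rho, select_categories,
                get_text, baseline_text, delta=0.5):
    S = select_categories(candidate_uri)              # S(uri): subject categories
    v_base = rho(baseline_text(candidate_uri, S), S)  # 1. baseline vector vb
    kept = []
    for uri in sameas_uris:                           # 2. one vector per claim
        v = rho(get_text(uri), S)
        if pearsonr(v, v_base)[0] >= delta:           # 3./4. keep only close ones
            kept.append(uri)
    return kept
```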
7. SCID: Semantic Co-reference Inaccuracy Detection - [ METHOD ]
SCID depends on two key functions:
1. A category distribution function ρ(t,S): given some natural language text (t) and a set of "suitable" subject categories (S) for t, compute a distribution vector of how t relates to each subject category of S.
2. A category selection function S(uri): given a resource (uri), return a "suitable" set of subject categories (S) that can be used in ρ(t,S).
NEXT: The category distribution function.
UPCOMING: The category selection function.
8. SCID: Semantic Co-reference Inaccuracy Detection - [ METHOD ]
The category distribution function ρ(t,S):
● Ex: Given an input text (t) and three DBpedia subject categories S = [Fruit, Oranges, Color], ρ(t,S) produces the output:
ρ(t, [Fruit, Oranges, Color]) = v1(x1,S) = [0.27 (Fruit), 0.50 (Oranges), 0.22 (Color)]
● Computes R(x,k), defined as the importance of word x to category k, for every word in t.
○ Uses 5 features: (1) count of x in k, (2) count of x across all k, (3) count of concepts where word x appears, (4) ratio of x in k to the vocabulary of all k, (5) average word frequency of x per resource in k.
NEXT: The category selection function.
UPCOMING: How is the baseline vector vb(x,Sx) computed and compared to v1(x1,S)?
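As a rough illustration of ρ(t,S) (not the paper's exact scoring, which combines all five features above), the sketch below scores each category by a single feature, the count of each word of t in the category's representative text, and normalizes the scores into a distribution; the category texts are made up:

```python
# A naive stand-in for the category distribution function ρ(t,S).
from collections import Counter

def rho(text, category_texts):
    """category_texts: dict mapping category name -> representative text."""
    words = text.lower().split()
    scores = {}
    for k, cat_text in category_texts.items():
        counts = Counter(cat_text.lower().split())
        scores[k] = sum(counts[x] for x in words)   # crude R(x,k) summed over t
    total = sum(scores.values()) or 1.0
    return [scores[k] / total for k in category_texts]  # distribution over S

cats = {"Fruit": "fruit sweet tree orange apple",
        "Oranges": "orange citrus fruit peel juice",
        "Color": "color orange hue light spectrum"}
print(rho("the orange is a citrus fruit", cats))   # skews toward Oranges
```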
9. SCID: Semantic Co-reference Inaccuracy Detection - [ METHOD ]
The category selection function S(uri):
⇶ DBpedia contains 656,000+ category:subjects. How do we select a few suitable for ρ(t,S)?
1. Begin with a candidate resource (uri): http://dbpedia.org/resource/Orange_(fruit)
2. Find a DBpedia disambiguation page: http://dbpedia.org/resource/Orange_(disambiguation), which points to dbr:Orange_(colour), dbr:Orange_(fruit), dbr:Orange_(band), ...
3. Combine (union of) the subject categories for each of these resources (dbr:Orange_(colour) → category:Optical_Spectrum; dbr:Orange_(fruit) → category:Oranges, category:Citrus_hybrids, category:Tropical_agriculture; dbr:Orange_(band) → category:American_punk_rock, category:Rock_music, category:Hellcat_Records):
Suri = [ category:{ Optical_Spectrum, Oranges, Citrus_hybrids, Tropical_agriculture, American_punk_rock, Rock_music, Hellcat_Records } ]
NEXT: How do we compute v1(x1,S) for sameAs inaccuracy filtering?
UPCOMING: How is the baseline vector vb(x,Sx) computed and compared to v1(x1,S)?
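The category-gathering part of this step can be sketched against the public DBpedia SPARQL endpoint, where resources link to their subject categories via dct:subject; the helper below is illustrative, covers a single resource only (not the union over disambiguation candidates), and may differ from the paper's exact procedure:

```python
# Fetch DBpedia subject categories for one resource via SPARQL.
from SPARQLWrapper import SPARQLWrapper, JSON

def subject_categories(resource_uri):
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)
    sparql.setQuery(f"""
        PREFIX dct: <http://purl.org/dc/terms/>
        SELECT DISTINCT ?cat WHERE {{ <{resource_uri}> dct:subject ?cat }}
    """)
    results = sparql.query().convert()
    return {b["cat"]["value"] for b in results["results"]["bindings"]}

print(subject_categories("http://dbpedia.org/resource/Orange_(fruit)"))
```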
10. SCID: Semantic Co-reference Inaccuracy Detection - [ METHOD ]
How do we compute v1(t1,S), v2(t2,S), ... for sameAs inaccuracy filtering?
1. Start with a group of resources that are identified as sameAs at http://www.sameas.org. Ex, for http://dbpedia.org/resource/Port (dbr:Port):
● dbr:Port
● www.w3.org:synset-seaport-noun-1
● rdf.freebase.com:en.port
● sw.opencyc.org:Seaport
● rdf.freebase.com:River_port
● dbr:Bad_conduct
● rdf.freebase.com:en.military_discharge
● dbr:IVDP
● rdf.freebase.com:en.port_wine
2. Collect the subject categories Sdbr:Port using the category selection function.
3. For each of the sameAs resources, collect natural language text (t) describing the resource. Collect (t) using DBpedia rdfs:comment, Freebase ns.common.topic.description, and www3.org wn20schema:gloss.
4. Compute the vectors v1(t1,Sdbr:Port), v2(t2,Sdbr:Port), ..., with t1 = rdfs:comment of dbr:Port, t2 = ns.common.topic.description of rdf.freebase:River_port, ..., using the category distribution function ρ(t,Sdbr:Port).
We now have the individual v1,2,...(t1,2,...,Sdbr:Port) vectors. We only need the base vector vb(tb,S) for comparison.
NEXT: How is the baseline vector vb(x,Sx) computed and compared to v1(x1,S)?
UPCOMING: Experimental results
11. SCID: Semantic Co-reference Inaccuracy Detection - [ METHOD ]
How is the baseline vector vb(x,Sx) computed and compared to v1(x1,S)?
1. Retrieve the subject:categories of the candidate resource from DBpedia (http://www.dbpedia.org). Ex: http://dbpedia.org/resource/Port (dbr:Port) has category:Nautical_terms and category:Ports_and_harbours.
2. Find (all) other resources that use the categories of the candidate resource. Concatenate the rdfs:comment of all these resources into (t).
3. Compute vb(t,Sdbr:Port) using the category distribution function ρ(t,Sdbr:Port).
● We now have the base vector vb(t,Sdbr:Port), which can be compared to the individual sameAs vectors v1,2(x1,2,Sdbr:Port).
● We use the Pearson Correlation Coefficient (PCC) to compare vectors.
● Remove vectors whose PCC is less than the threshold δ.
NEXT: Experimental results
UPCOMING: Conclusion
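A small, self-contained illustration of this final comparison step; the distribution vectors over four categories are made up for the example:

```python
# Pearson-correlation filtering of sameAs candidates against the baseline.
from scipy.stats import pearsonr

v_base = [0.05, 0.60, 0.30, 0.05]   # hypothetical vb(t, S) over 4 categories
candidates = {
    "freebase:en.port":      [0.10, 0.55, 0.30, 0.05],   # plausibly the same
    "freebase:en.port_wine": [0.70, 0.05, 0.05, 0.20],   # likely a wrong link
}
delta = 0.5
for uri, v in candidates.items():
    pcc = pearsonr(v_base, v)[0]
    verdict = "keep" if pcc >= delta else "remove"
    print(f"{uri}: PCC={pcc:+.2f} -> {verdict}")
```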
12. SCID: Semantic Co-reference Inaccuracy Detection - [ EXPERIMENTATION ]
● We examined 7,690 resources obtained from the www.sameAs.org database across five topics: Animal, City, Person, Color, and Miscellaneous.
● We performed some data cleansing on these resources: removal of duplicate resources (i.e., aliases/redirects), broken links, and redundant resources (i.e., dbpedialite is a subset of DBpedia).
● After cleansing, 411 unique resources remained, with 251 errors identified by a human oracle (i.e., http://dbpedia.org/resource/Dog is not the same as http://rdf.freebase.com/ns/en.bitch).
NEXT: Experimental results continued.
UPCOMING: Conclusion
13. SCID: Semantic Co-reference Inaccuracy Detection - [ EXPERIMENTATION ]
● We computed the individual v411(t,S) vectors for all 411 resources, together with the associated baseline comparison vectors.
● We computed the Pearson Correlation between each v411(t,S) and its baseline.
● We removed identity links based on thresholds ranging from 0.0 to 0.90, with the F-score calculated for each threshold used.
○ The original 411 resources contained 160 correct / 251 incorrect sameAs links (0.560 F-score).
○ Thresholds (δ) of 0.50 and 0.60 gave the best F-score.
NEXT: Experimental results continued.
UPCOMING: Conclusion
14. SCID: Semantic Co-reference Inaccuracy Detection - [ EXPERIMENTATION ]
[Figure: scatter plot of F-score versus Pearson Correlation Coefficient for oracle-identified right (blue) and wrong (red) identity links, with the threshold δ marked.]
NEXT: Conclusion
15. SCID: Semantic Co-reference Inaccuracy Detection - [ CONCLUSION ]
● In this presentation:
○ We introduced SCID, a technique for discovering inaccuracies in identity link assertions (owl:sameAs).
○ Experimental results indicate SCID can identify incorrect identity link assertions and improve the precision of an identity database (http://www.sameas.org).
● In the future:
○ Experimentation with identity links other than owl:sameAs (e.g., skos:closeMatch, skos:exactMatch, owl:equivalentClass).
○ Experimentation with vector comparison methods other than Pearson Correlation (e.g., cosine similarity, Euclidean distance, Spearman rank coefficient).
-- END --