Link Analysis of Life Sciences Linked Data

Wei Hu1, Honglei Qiu1, and Michel Dumontier2
1State Key Laboratory for Novel Software Technology, Nanjing University, China
2Center for Biomedical Informatics Research, Stanford University
@micheldumontier::ISWC 2015

Linked Data offers links between
datasets, but they are often
incomplete and may contain
errors.

Network Analysis
• Network analysis has long been
used to study link structures
– The structure of the Web
– Network medicine: cellular
networks and implications
A power-law degree distribution is scale-free
A graph demonstrates the small world
phenomenon, if its clustering coefficient is
significantly higher than that of a random
graph on the same node set, and if the graph
has a shorter average distance.
BTC2010
The clustering coefficient quantifies how close
a node's neighbors are to being a clique. The average
distance is the average shortest path length
between all pairs of nodes in the graph.
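
The two measures defined above can be computed on a small graph with a few lines of Python. This is a minimal sketch, not the authors' implementation: it assumes an undirected graph stored as a dictionary of neighbor sets, averages the local clustering coefficient over all nodes, and averages distances over reachable pairs only. The graph `toy` and both function names are made up for illustration.

```python
from collections import deque

def clustering_coefficient(adj):
    """Average local clustering coefficient of an undirected graph."""
    total = 0.0
    for node, neighbors in adj.items():
        k = len(neighbors)
        if k < 2:
            continue  # nodes with fewer than two neighbors contribute 0
        nbrs = list(neighbors)
        # Count edges that exist among the neighbors of this node.
        links = sum(1 for i in range(k) for j in range(i + 1, k)
                    if nbrs[j] in adj[nbrs[i]])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

def average_distance(adj):
    """Mean shortest-path length over all reachable ordered node pairs."""
    total, pairs = 0, 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:  # breadth-first search from src
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0

# Hypothetical toy graph: a triangle a-b-c with a pendant node d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(clustering_coefficient(toy), average_distance(toy))
```

To test for the small-world phenomenon, the same two functions would be applied to a random graph on the same node set and the values compared.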

Dataset link analysis
(using RDF data model)
Entity link analysis
(using cross-references)
Term link analysis
(using ontology matching)

Linked Data for the Life Sciences
Bio2RDF is an open source project to unify the
representation and interlinking of biological data using RDF.
chemicals/drugs/formulations,
genomes/genes/proteins, domains
Interactions, complexes & pathways
animal models and phenotypes
Disease, genetic markers, treatments
Terminologies & publications
• Release 3 (June 2014)
• 35 datasets
• 11B RDF triples
• 1B entities
• 2K classes
• 4K properties

Dataset Links
Network Properties
1. Well linked
2. Hubs and authorities
3. Small-world phenomenon
Average distance = 2.77 vs 6
Clustering coefficient = 0.22 vs 0.13
4. Robust to systematic removal of nodes

Entity Link Analysis
How well do entities link to each other?
• 76% of entity links involve a special kind of RDF triple
– e.g. <kegg:D03455, kegg:x-drugbank, drugbank:DB00002>
– x-relations have under-specified semantics
• May be truly identical, may refer to another related entity …
• Degree distribution
– Some do not follow power law
• Exponent is too large (close to 5)
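
One common way to estimate a power-law exponent from a degree sequence is the maximum-likelihood estimator under a continuous approximation, alpha = 1 + n / sum(ln(d_i / x_min)). The sketch below assumes that estimator; the slides do not say how the exponent was fitted. The function name, the `x_min` cutoff, and the sample degrees are hypothetical.

```python
import math

def power_law_exponent(degrees, x_min=1):
    """Estimate alpha for p(d) ~ d^(-alpha) using degrees >= x_min.

    Continuous-approximation MLE: alpha = 1 + n / sum(ln(d / x_min)).
    """
    tail = [d for d in degrees if d >= x_min]
    log_sum = sum(math.log(d / x_min) for d in tail)
    if log_sum == 0:
        raise ValueError("need at least one degree strictly above x_min")
    return 1.0 + len(tail) / log_sum

# Hypothetical degree sequence, for illustration only.
sample_degrees = [1, 1, 1, 2, 2, 3, 5, 8, 20]
print(power_law_exponent(sample_degrees))
```

An estimate far above the range usually reported for Web-like graphs (roughly 2 to 3) is one signal that a power law describes the data poorly.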
BTC2010

Symmetry of entity links varies
between different pairs of datasets
• Over 99% of links are reciprocated in DrugBank-PharmGKB and
OMIM-HGNC
– Suggests link sharing and synchronization
• Only 58% of DrugBank-KEGG links and 51% of OMIM-Orphanet
links are reciprocated
– Suggests incomplete mapping
• 28% of OMIM-Orphanet links are malposed
– Suggests variation in model (omim:Phenotype to orphanet:Disorder)
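
The reciprocity percentages above can be read as the share of directed links whose reverse link also exists. A minimal sketch under that assumption follows; the function name is made up, and the `ex:` identifiers are hypothetical (only the KEGG and DrugBank identifiers come from the earlier slide).

```python
def reciprocity(links):
    """Fraction of directed (source, target) links whose reverse also exists."""
    link_set = set(links)
    if not link_set:
        return 0.0
    reciprocated = sum(1 for s, t in link_set if (t, s) in link_set)
    return reciprocated / len(link_set)

links = [
    ("kegg:D03455", "drugbank:DB00002"),
    ("drugbank:DB00002", "kegg:D03455"),  # reverse link present
    ("ex:A", "ex:B"),                     # hypothetical one-way link
]
print(reciprocity(links))
```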

Transitivity Analysis:
Find mismatches and discover new links
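
One way to picture transitivity analysis: treat entity links as undirected, group entities that are connected through any chain of links, and then inspect each group. The sketch below is an assumption about how this could be done, not the procedure from the talk. It flags two entities from the same dataset in one group as a possible mismatch, and reports cross-dataset pairs that are implied but not asserted as candidate new links. The dataset prefixes `x:`, `y:`, `z:` and the function name are hypothetical.

```python
from itertools import combinations

def transitive_candidates(links):
    """Return (candidate new links, possible mismatches) from entity links."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)  # union the two groups

    groups = {}
    for entity in list(parent):
        groups.setdefault(find(entity), set()).add(entity)

    known = {frozenset(pair) for pair in links}
    new_links, mismatches = set(), set()
    for members in groups.values():
        for a, b in combinations(sorted(members), 2):
            if a.split(":")[0] == b.split(":")[0]:
                mismatches.add((a, b))   # two entities from one dataset
            elif frozenset((a, b)) not in known:
                new_links.add((a, b))    # implied by the chain, not asserted
    return new_links, mismatches

chain = [("x:1", "y:1"), ("y:1", "z:1"), ("z:1", "x:2")]
new_links, mismatches = transitive_candidates(chain)
print(new_links, mismatches)
```

Because the meaning of a link may shift along a chain, candidates produced this way are suggestions for review rather than confirmed equivalences.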

Evaluation of Entity Matching
How accurate are current entity matching approaches?
• Built a benchmark from the reciprocal links between similarly-typed
entities
• Evaluated several entity matching approaches
– Label similarity: Levenshtein, Jaro-Winkler, N-gram, Jaccard
– Machine learning: Linear regression, logistic regression with 5 properties
• Many-to-one links are difficult to discover
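
Two of the label similarity measures named above can be sketched in plain Python. The normalization of edit distance to a 0..1 similarity and the use of Jaccard over character bigrams are assumptions made for illustration; the talk does not specify the exact variants, and the function names are made up.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def levenshtein_similarity(a, b):
    """Edit distance rescaled to [0, 1], where 1 means identical labels."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def ngram_jaccard(a, b, n=2):
    """Jaccard similarity between the sets of character n-grams."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0

print(levenshtein_similarity("kitten", "sitting"), ngram_jaccard("abc", "abd"))
```

A matcher built on such scores would typically accept a pair when the score exceeds a chosen threshold, which is where label-only matching can fail for entities with dissimilar names.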

Term Link Analysis
How similar are the topics in the data network?
• Use ontology matching to generate term link graph
– Falcon-AO (linguistic matchers + structural matcher + synonyms)
• Created 83K class mappings, 1.5K object property mappings, and 858 data
property mappings
– Similarity threshold = 0.9
– Top-5 popular labels for classes and properties
• Significant overlap in topics; the term link graph does not follow a power law, unlike the broader Semantic Web

Correlation of Link Graphs
To what degree are the three link graphs correlated?
• Spearman’s rank correlation coefficient:
– Entity link graph → dataset pairs: entity links / entities
– Term link graph → dataset pairs: term mappings / terms
– Dataset link graph → dataset pairs: shortest path length
• All positively correlated
– Closer datasets in distance have more linked entities and terms
– Number of linked entities contributes little to overlap of topics

Summary of Findings
• Dataset, entity and term link graphs do not necessarily share the same
characteristics with the Hypertext / Semantic Web
– Degree distribution of entity links does not follow power law
– Data hubs
• A significant number of entities have been linked using x-relations, but
their intended semantics differs
– Classes are identical or equivalent  entity links represent logical equivalence
• Symmetric and transitive entity links do exist, but their utility is weakened
due to their small number
– Meanings of entity links may shift during transitive closure
• Matching only the labels of entities may fail, while combining different
properties and using simple learning algorithms achieves good accuracy

dumontierlab.com
michel.dumontier@stanford.edu
Website: http://dumontierlab.com
Presentations: http://slideshare.com/micheldumontier
