Link Analysis of Life Sciences Linked Data

Semantic web technologies offer a potential mechanism for the representation and integration of thousands of biomedical databases. Many of these databases offer cross-references to other data sources, but these are generally incomplete and prone to error. In this paper, we conduct an empirical analysis of the link structure of life science Linked Data, obtained from the Bio2RDF project. Three different link graphs for datasets, entities and terms are characterized by degree, connectivity, and clustering metrics, and their correlation is measured as well. Furthermore, we utilize the symmetry and transitivity of entity links to build a benchmark and evaluate several popular entity matching approaches. Our findings indicate that the life science data network can help find hidden links, can be used to validate links, and may offer a mechanism to integrate a wider set of resources to support biomedical knowledge discovery.
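
As a minimal sketch of the symmetry and transitivity checks mentioned above, assuming entity links are available as directed (source, target) pairs, the following Python separates reciprocated links from one-way links and proposes candidate links by one-step transitivity. The function names and toy identifiers are hypothetical, and this is an illustration rather than the authors' implementation.

```python
from collections import defaultdict


def split_by_symmetry(links):
    """Split directed links into reciprocated and one-way sets."""
    link_set = set(links)
    reciprocal = {(s, t) for (s, t) in link_set if (t, s) in link_set}
    return reciprocal, link_set - reciprocal


def transitive_candidates(links):
    """Propose (a, c) when a-b and b-c are linked but a-c is not.

    Links are treated as undirected here, which is an assumption:
    a cross-reference does not always mean the two entities are identical.
    """
    neighbors = defaultdict(set)
    for s, t in links:
        neighbors[s].add(t)
        neighbors[t].add(s)
    candidates = set()
    for linked in neighbors.values():
        for a in linked:
            for c in linked:
                # a < c keeps one orientation of each unordered pair
                if a < c and c not in neighbors[a]:
                    candidates.add((a, c))
    return candidates


# Toy identifiers, made up for illustration.
toy_links = [
    ("a:1", "b:1"),
    ("b:1", "a:1"),  # reciprocated link
    ("b:1", "c:1"),  # one-way link
]
reciprocal, one_way = split_by_symmetry(toy_links)
new_links = transitive_candidates(toy_links)
```

Candidates produced this way would still need validation, since the meaning of a link can shift as links are chained.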

Link Analysis of Life Sciences Linked Data

  1. Link Analysis of Life Science Linked Data
     Wei Hu (1), Honglei Qiu (1), and Michel Dumontier (2)
     (1) State Key Laboratory for Novel Software Technology, Nanjing University, China
     (2) Center for Biomedical Informatics Research, Stanford University
     @micheldumontier::ISWC 2015
  2. Linked Data offers links between datasets, but they are often incomplete and may contain errors.
  3. Network Analysis
     • Network analysis has long been used to study link structures
       – The structure of the Web
       – Network medicine: cellular networks and implications
     • A power-law degree distribution is scale free.
     • A graph demonstrates the small-world phenomenon if its clustering coefficient is significantly higher than that of a random graph on the same node set, and if the graph has a shorter average distance.
     • The clustering coefficient of a node quantifies how close its neighbors are to forming a clique.
     • The average distance is the average shortest path length between all pairs of nodes in the graph.
     • Figure label: BTC2010
  4. • Dataset link analysis (using the RDF data model)
     • Entity link analysis (using cross-references)
     • Term link analysis (using ontology matching)
  5. Linked Data for the Life Sciences
     Bio2RDF is an open source project to unify the representation and interlinking of biological data using RDF.
     • Chemicals, drugs, formulations
     • Genomes, genes, proteins, domains
     • Interactions, complexes and pathways
     • Animal models and phenotypes
     • Diseases, genetic markers, treatments
     • Terminologies and publications
     • Release 3 (June 2014)
       – 35 datasets
       – 11B RDF triples
       – 1B entities
       – 2K classes
       – 4K properties
  6. Dataset Links
     Network properties:
     1. Well linked
     2. Hubs and authorities
     3. Small-world phenomenon
        – Average distance = 2.77 vs 6
        – Clustering coefficient = 0.22 vs 0.13
     4. Robust to systematic removal of nodes
  7. Entity Link Analysis
     How well do entities link to each other?
     • 76% of entity links involve a special kind of RDF triple
       – e.g. <kegg:D03455, kegg:x-drugbank, drugbank:DB00002>
       – x-relations have under-specified semantics
         • May be truly identical, may refer to another related entity …
     • Degree distribution
       – Some do not follow a power law
         • Exponent is too large (close to 5)
     • Figure label: BTC2010
  8. Symmetry of entity links varies between different pairs of datasets
     • Over 99% of links are reciprocated in DrugBank-PharmGKB and OMIM-HGNC
       – Suggests link sharing and synchronization
     • Only 58% of DrugBank-KEGG links and 51% of OMIM-Orphanet links are reciprocal
       – Suggests incomplete mapping
     • 28% of OMIM-Orphanet links are malposed
       – Suggests variation in model (omim:Phenotype to orphanet:Disorder)
  9. Transitivity Analysis: Find mismatches and discover new links
  10. Evaluation of Entity Matching
      How accurate are current entity matching approaches?
      • Built a benchmark from the reciprocal links between similarly-typed entities
      • Evaluated several entity matching approaches
        – Label similarity: Levenshtein, Jaro-Winkler, N-gram, Jaccard
        – Machine learning: linear regression, logistic regression with 5 properties
      • Many-to-one links are difficult to discover
  11. Term Link Analysis
      How similar are the topics in the data network?
      • Use ontology matching to generate the term link graph
        – Falcon-AO (linguistic matchers + structural matcher + synonyms)
      • Created 83K class mappings, 1.5K object property mappings, and 858 data property mappings
        – Similarity threshold = 0.9
        – Top-5 popular labels for classes and properties
      • Significant overlap in topics; unlike the broader Semantic Web, the term link graph does not follow a power law
  12. Correlation of Link Graphs
      To what degree are the three link graphs correlated?
      • Spearman’s rank correlation coefficient:
        – Entity link graph → dataset pairs: entity links / entities
        – Term link graph → dataset pairs: term mappings / terms
        – Dataset link graph → dataset pairs: shortest path length
      • All positively correlated
        – Datasets that are closer in distance have more linked entities and terms
        – Number of linked entities contributes little to overlap of topics
  13. Summary of Findings
      • Dataset, entity and term link graphs do not necessarily share the same characteristics as the Hypertext / Semantic Web
        – Degree distribution of entity links does not follow a power law
        – Data hubs
      • A significant number of entities have been linked using x-relations, but their intended semantics differ
        – Classes are identical or equivalent → entity links represent logical equivalence
      • Symmetric and transitive entity links do exist, but their utility is weakened by their small number
        – Meanings of entity links may shift during transitive closure
      • Matching only the labels of entities may fail, while combining different properties and using simple learning algorithms achieves good accuracy
  14. dumontierlab.com
      michel.dumontier@stanford.edu
      Website: http://dumontierlab.com
      Presentations: http://slideshare.com/micheldumontier
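
The two small-world quantities defined on slide 3 (clustering coefficient and average distance) can be sketched in plain Python. This is a minimal illustration on a made-up undirected toy graph, not the pipeline used to analyze Bio2RDF; the function names are hypothetical, nodes of degree below 2 are assumed to have coefficient 0, and the comparison against a random graph on the same node set is left out.

```python
from collections import deque
from itertools import combinations


def average_clustering(adj):
    """Mean local clustering coefficient of an undirected graph."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue  # assumption: coefficient is 0 when degree < 2
        closed = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        total += 2.0 * closed / (k * (k - 1))
    return total / len(adj)


def average_distance(adj):
    """Average shortest path length over reachable ordered node pairs."""
    total, pairs = 0, 0
    for source in adj:
        # Breadth-first search gives shortest hop counts from source.
        hops = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in hops:
                    hops[v] = hops[u] + 1
                    queue.append(v)
        total += sum(hops.values())
        pairs += len(hops) - 1
    return total / pairs


# Toy graph: a triangle A-B-C with a pendant node D attached to C.
toy = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
cc = average_clustering(toy)
avg_dist = average_distance(toy)
```

To apply the criterion, the same two numbers would be computed for a random graph with the same nodes and compared, as in the "vs" figures on slide 6.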

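Two of the label similarity measures named on slide 10, Levenshtein and Jaccard, can be sketched as follows. The slides do not say how the measures were normalized or tokenized, so dividing the edit distance by the longer label's length and splitting labels on whitespace are assumptions made here; the function names are hypothetical, and Jaro-Winkler and N-gram are omitted.

```python
def levenshtein(a, b):
    """Edit distance computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(a, b):
    """Similarity in [0, 1]; normalizing by the longer label is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard_similarity(a, b):
    """Jaccard overlap of lower-cased whitespace tokens (an assumption)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


# Made-up labels for illustration.
edit = levenshtein("kitten", "sitting")
overlap = jaccard_similarity("breast cancer", "Breast Cancer type 1")
```

A matcher built on these would accept a pair of entities when the similarity exceeds a chosen threshold, which is why label-only matching can fail when equivalent entities carry very different names.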