
Case Study in Linked Data and Semantic Web: Human Genome



The National Human Genome Research Institute's "GWAS Catalog" (Genome-Wide Association Studies) project is a successful implementation of Linked Data and Semantic Web concepts. This deck discusses how the project was implemented, the challenges it faced and possible paths for the future.

Published in: Technology


  1. 1. Case study in Linked Data and Semantic Web for the Human Genome domain: NHGRI’s “GWAS Catalog” Project (National Human Genome Research Institute). David Portnoy, 312.970.9740. © Copyright 2012-2014 Datalytx, Inc.
  2. 2. About the Project
     • Project name: The National Human Genome Research Institute (NHGRI) Catalog of Published Genome-Wide Association Studies (GWAS Catalog)
     • Project description: Manually curated collection of published GWAS assaying at least 100,000 single-nucleotide polymorphisms (SNPs), and of all SNP-trait associations with P < 1 × 10⁻⁵
     • In addition to SNP-trait association data, the project provides the “Diagram Browser”, an interactive diagram of these associations mapped to the SNPs’ chromosomal locations
     • Stats as of Aug 2014: almost 2,000 GWAS-related publications; over 14,000 SNPs
     • [Chart: Project Growth, 2005 to 2014, showing # of studies, # of traits and SNP-trait associations]
     • Website:
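The inclusion criteria on this slide can be read as a simple filter. What follows is a minimal sketch, not the Catalog's actual curation code: the `Association` record, the function name and the sample values are hypothetical, and only the two thresholds come from the slide.

```python
from dataclasses import dataclass

# Thresholds stated on the slide.
MIN_SNPS_ASSAYED = 100_000
P_VALUE_CUTOFF = 1e-5


@dataclass(frozen=True)
class Association:
    """Hypothetical record for one SNP-trait association."""
    snp_id: str
    trait: str
    p_value: float


def eligible_associations(snps_assayed, associations):
    """Return the associations that meet both inclusion criteria."""
    if snps_assayed < MIN_SNPS_ASSAYED:
        return []  # the study itself is out of scope
    return [a for a in associations if a.p_value < P_VALUE_CUTOFF]


# Illustrative values only, not real Catalog data.
sample = [
    Association("rs_example_1", "example trait", 3e-8),
    Association("rs_example_2", "example trait", 2e-4),
]
kept = eligible_associations(500_000, sample)
print([a.snp_id for a in kept])
```

Only the first sample association survives, because the second one has a p-value above the cutoff.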
  3. 3. Accessing the data
     The GWAS Catalog can be accessed in three ways:
     • Via the “Diagram Browser”
        • Implemented as a dynamic visualization on the human karyotype
        • With links to the study publication, SNPs in Ensembl and ontology terms in EFO (Experimental Factor Ontology)
     • Via a web query search interface
        • Provides tabular data for view or download
        • Includes traits and links to the study publication
     • Via other GWAS-related data portals, such as
        • Ensembl
        • UCSC Genome Browser
        • PheGenI
        • GWAS Central
  4. 4. GWAS Components
     The project is implemented in 3 main components:
     1. Curation / Data loading pipeline
     2. Data Publisher
     3. Diagram Browser
     [Diagram labels: Curation, SNP Batch Loader, PubMed Tracking, Publisher, Inference engine, Ontology Loading, Diagram Browser, Knowledge Base, Ontology Schema]
     * The source code is managed under the GOCI (GWAS Ontology and Curation Infrastructure) project
  5. 5. Application Implementation
     The following technologies have been used for this project:
     • Java for server-side processing
     • Spring for the MVC framework
     • Maven for build automation and dependency management
     • Apache Tomcat for the web server
     • Oracle for the relational database
     • HermiT for the OWL reasoner
     • JavaScript / AJAX for Diagram Browser interactivity
     • SVG for rendering vector graphics in the Diagram Browser
     • Apache POI for processing spreadsheets
     • ColdFusion for generating records for each SNP
  7. 7. Ontology schema needed
     Before the project could be implemented, an ontology had to be designed for its components to operate on. Working backwards:
     • The Diagram Browser needs to display GWAS-related data in order to answer common GWAS use cases
     • The Publisher needs to store data such that it can be reasoned over and served up to the Diagram Browser
     • The Batch Loader needs to extract GWAS data from publications in a consistent manner for later retrieval by the Publisher
  8. 8. GWAS Catalog Ontology
     The ontology was created by mapping each trait to one or more terms in the Experimental Factor Ontology (EFO):
     • At the start, 20% of GWAS traits were already in EFO
     • SKOS was used to extend EFO for GWAS-specific views
     • 500 new terms were added to create the GWAS-EFO-SKOS ontology
     Reasons for using EFO:
     • It’s actively developed
     • It’s well suited to cover the diversity of GWAS traits
     Metrics:
     • Number of classes: 13,850
     • Number of individuals: 370
     • Number of properties: 50
     • Maximum depth: 15
     • Maximum # of children: 700
     • Average # of children: 7
     • Classes with a single child: 500
     • Classes with > 25 children: 100
     • Classes with no definition: 13,500
     * Note: “GWAS Catalog Ontology” and “GWAS Diagram OWL” are used interchangeably in this deck
  9. 9. GWAS Catalog Ontology (cont.)
     • Purpose: Models how the GWAS concepts of “SNP”, “trait” and “chromosome” relate to one another and to the Diagram
     • Location of ontology schemas used: EFO schema: GWAS-Diagram schema:
     Class hierarchy:
     • GWAS study
     • chromosome
        • chromosome 1..23
        • chromosome X, Y
     • cytogenetic band
     • single nucleotide polymorphism
     • trait association
     • experimental factor
     Object property hierarchy: has_part, located_in, location_of, associated_with, is_about, has_about, part_of
     Data property hierarchy: has_name, has_snp_reference_id, has_bp_position, has_length, has_p_value, has_pubmed_id, has_author, has_publication_date, has_gwas_trait_name
     * Source:;
  10. 10. Field definitions for the OWL schema
     1. SNP reference ID: A single nucleotide polymorphism identifier, as assigned by the Single Nucleotide Polymorphism Database (dbSNP).
     2. Base pair position: The position, in base pairs, of a particular element on a genome.
     3. Base pair length: The length, in base pairs, of any genomic element.
     4. P-value: The probability of obtaining a test statistic at least as extreme as the one that was actually observed.
     5. PubMed ID: The publication ID of a scientific paper, as assigned by the PubMed database.
     6. Author: The primary author of a publication, usually expressed as surname followed by initial(s).
     7. Publication date: The date on which a given entity was published.
     8. GWAS trait name: An arbitrary text label that records a GWAS trait name that does not specifically map to an ontology term. Usually this is used to annotate instances of Experimental Factor in order to retain information about a trait that was not defined in the ontology.
     9. Chromosomes: Chromosome 1-23; Chromosomes X & Y.
     10. Trait association: An association that can be asserted between two entities with a degree of confidence expressed as a p-value.
     11. GWAS study: A study, described by a scientific publication, that identifies genome-wide associations between single nucleotide polymorphisms and phenotypic traits or disorders.
  11. 11. Using SKOS for defining the GWAS Catalog ontology
     SKOS (Simple Knowledge Organization System) was used to create the GWAS Catalog ontology by extending the EFO ontology, because SKOS:
     • Requires less expertise, effort and cost, since it is less semantically strict and expressive than OWL
     • Can be used where the complexity of inferences is limited
     • Is easy to use for extending other vocabularies
     Introduction to SKOS: SKOS is an area of work developing specifications and standards to support the use of knowledge organization systems (KOS) such as thesauri, classification schemes, subject heading systems and taxonomies within the framework of the Semantic Web.
  12. 12. Sample dataset generated by OWL API is broken into…
     • Data Property Assertion
     • Class Assertion
     • Object Property Assertion
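To make the three assertion kinds concrete, here is a small sketch in plain Python rather than the OWL API. The individuals (`snp_1`, `band_1`, `assoc_1`), the p-value and the way the properties are wired together are assumptions for illustration; the class and property names are taken from the schema slide, and `rs1333049` is the SNP that appears in the deck's sample links.

```python
from collections import namedtuple

ClassAssertion = namedtuple("ClassAssertion", "individual owl_class")
ObjectPropertyAssertion = namedtuple("ObjectPropertyAssertion", "subject prop target")
DataPropertyAssertion = namedtuple("DataPropertyAssertion", "subject prop value")

# Hypothetical individuals; the wiring between them is an assumption.
axioms = [
    # Class assertions: an individual belongs to a class.
    ClassAssertion("snp_1", "single nucleotide polymorphism"),
    ClassAssertion("band_1", "cytogenetic band"),
    ClassAssertion("assoc_1", "trait association"),
    # Object property assertions: an individual is linked to another individual.
    ObjectPropertyAssertion("snp_1", "located_in", "band_1"),
    ObjectPropertyAssertion("assoc_1", "is_about", "snp_1"),
    # Data property assertions: an individual is linked to a literal value.
    DataPropertyAssertion("snp_1", "has_snp_reference_id", "rs1333049"),
    DataPropertyAssertion("assoc_1", "has_p_value", 3e-8),  # illustrative value
]


def group_by_kind(items):
    """Group axioms by the name of their assertion type."""
    groups = {}
    for item in items:
        groups.setdefault(type(item).__name__, []).append(item)
    return groups


groups = group_by_kind(axioms)
for kind, members in groups.items():
    print(kind, len(members))
```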
  13. 13. Advantage of ontology for traits
     Using a predefined ontology for describing traits (rather than unstructured lists) allows:
     1. More complex, compounded and context-dependent traits to be described, e.g. “Type 2 diabetes and gout”; “Parkinson’s disease (interaction with caffeine)”
     2. Creation of semantically meaningful links between traits
     3. More complex and meaningful queries
     Traits:
     • Phenotypes, e.g. hair & eye color
     • Treatment responses, e.g. response to antineoplastic agents
     • Diseases, e.g. type 2 diabetes
     • Assays, e.g. glycosylated haemoglobin level
     • Chemical/drug names, e.g. C-reactive protein
  14. 14. CURATION
  15. 15. The Curation process is partially automated
     1. Run automated literature searches to capture eligible studies
     2. Enter them into the system for review by curators
     3. Triage and assign papers to a curator
     4. Curators use a web-based tracking and data entry system which allows multiple users to search, annotate, verify and publish the Catalog data. There are two levels of manual curation:
        a. First, all data are extracted by one curator.
        b. Some studies have more than 1,000 significant SNPs, so curators create spreadsheets of SNPs for batch loading into the DB (using the Apache POI Java API for Microsoft Documents and a ColdFusion extension).
        c. Then the data are double-checked for accuracy and consistency by another curator.
     5. Run the automated pipeline that:
        a. Checks multiple data sources for accuracy, completeness and consistency: PubMed, dbSNP and NCBI's Gene database
        b. Adds genomic annotation, such as each SNP's base pair and cytogenetic location
     [Diagram: Literature search (ID eligible studies) → Entry into workflow tool (Triage & assignment) → Manual curation (Data entry; Check accuracy) → Automated pipeline (Check against PubMed, dbSNP, NCBI; Add annotation)]
  16. 16. Creation of links to external data sources
     Each entry in the GWAS Catalog has links to supporting data sources for convenience.
     Reference source: sample link / URI
     • NCBI’s dbSNP:
     • Ensembl: -22126003;v=rs1333049;vdb=variation;vf=1004336
     • PubMed: nk&LinkName=snp_pubmed_cited&LinkReadableName=Pubmed+(SNP+ Cited)&IdsFromResult=1333049
     • OMIM:
     * Note that currently these are links for use by people, rather than machine-readable linkages that would allow querying across multiple data sources
  17. 17. Future: Opportunity for automating curation
     • Machine learning and natural language processing (NLP) to categorize studies into traits defined in the GWAS Catalog ontology
     • Assign categorization confidence metrics to assist the processing workflow
     • Accuracy can be verified by humans based on highlighting and annotations provided by the NLP engine
     [Diagram: NLP processing & confidence assignment → Workflow for human validation (where needed) → Knowledge Base]
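One way to picture the confidence-driven workflow is a router that sends low-confidence categorizations to curators. This is a sketch of the idea only: the threshold of 0.9, the tuple layout and the paper identifiers are hypothetical, and no NLP engine is involved.

```python
REVIEW_THRESHOLD = 0.9  # hypothetical cut-off; a real workflow would tune this


def route(predictions, threshold=REVIEW_THRESHOLD):
    """Split (paper_id, trait, confidence) tuples into two queues.

    Returns (accepted, needs_review). The review queue is sorted so that
    the least confident categorizations come first.
    """
    accepted, needs_review = [], []
    for paper_id, trait, confidence in predictions:
        queue = accepted if confidence >= threshold else needs_review
        queue.append((paper_id, trait, confidence))
    needs_review.sort(key=lambda prediction: prediction[2])
    return accepted, needs_review


# Illustrative categorizations, not output from any real model.
predictions = [
    ("paper_a", "type 2 diabetes", 0.97),
    ("paper_b", "gout", 0.55),
    ("paper_c", "type 2 diabetes", 0.80),
]
accepted, needs_review = route(predictions)
print("accepted:", [p[0] for p in accepted])
print("needs review:", [p[0] for p in needs_review])
```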
  18. 18. PUBLISHER
  19. 19. Data flow for GOCI Publisher
     1. Start with the Oracle relational database created by the Curation process
     2. The Java Publisher app converts the relational data into OWL individuals
     3. The result is a knowledge base in the format of the GWAS Catalog ontology, with 13,000 individuals and 43,000 axioms
     4. OWL API and the HermiT reasoner create inferences from the GWAS Catalog ontology
     5. Since it takes > 10 hours to run the reasoner, the job is run in batch and the results are cached in RAM
     6. Results are retrieved by the Diagram Browser with requests to the app running on the Tomcat server
     [Diagram: Relational Database (Oracle) → Java Publisher job → Knowledge Base (OWL individuals / triples cached in RAM) → HermiT + OWL API → Knowledge Base with Inferred Triples (cached in RAM) → GWAS Diagram Browser; SPARQL Endpoint (future)]
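The relational-to-triples step can be sketched as follows. This is not the GOCI Publisher: it uses an in-memory SQLite database as a stand-in for Oracle, a made-up `association` table, and plain tuples instead of OWL API objects. The predicate names echo the schema slide, but which subject each one attaches to is an assumption.

```python
import sqlite3


def publish(conn):
    """Turn each relational row into (subject, predicate, object) triples."""
    triples = []
    query = "SELECT id, snp_ref, trait, p_value FROM association ORDER BY id"
    for row_id, snp_ref, trait, p_value in conn.execute(query):
        assoc = f"association_{row_id}"  # one individual per association row
        snp = f"snp_{snp_ref}"           # one individual per SNP
        triples.extend([
            (assoc, "rdf:type", "trait_association"),
            (assoc, "has_p_value", p_value),
            (assoc, "has_gwas_trait_name", trait),
            (snp, "rdf:type", "single_nucleotide_polymorphism"),
            (snp, "has_snp_reference_id", snp_ref),
            (snp, "associated_with", assoc),
        ])
    return triples


# In-memory stand-in for the curated database; schema and row are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE association ("
    "id INTEGER PRIMARY KEY, snp_ref TEXT, trait TEXT, p_value REAL)"
)
conn.execute(
    "INSERT INTO association (snp_ref, trait, p_value) VALUES (?, ?, ?)",
    ("rs_example_1", "example trait", 3e-8),
)
triples = publish(conn)
for triple in triples:
    print(triple)
```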
  20. 20. Publisher’s output is OWL triples
     …because this format is preferable to having the Diagram Browser query a relational database. The benefits are:
     • Additional inferences about SNP-trait associations
     • More expressive queries
     • Ability to detect errors or inconsistencies, as defined by the ontology
     Using direct queries:
     • Data is an unstructured catalog of traits in a fixed relational schema
     • Queries rely on string pattern matching and must be done one at a time. It’s not possible to query for related or inferred traits.
     • Example queries:
        • Can search on trait name containing “diabetes” and get results for both type 1 and type 2 diabetes
        • Comparison between gastric and esophageal cancers requires manually combining results from two distinct searches
     Using the OWL knowledge base:
     • Data is structured in semantic triples and reasoned over using an ontology
     • Queries can include inferences and complex questions
     • Example queries: *
        • Find all SNPs that are associated with cancers located in the upper digestive tract
        • Find all SNPs located on chromosomes 5, 7, 15 and 21 that are associated with diseases located in the urinary tract, with a p-value smaller than 10⁻⁸
     * Source: Welter, D., Burdett, T., et al. (2012) Ontology-driven visualization of NHGRI GWAS data
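The difference between the two query styles can be shown with a toy example. The hierarchy and the SNP annotations below are invented for illustration and are not the real EFO structure; the point is only that a label search cannot find traits whose names do not contain the search text, while a hierarchy-aware search can.

```python
# Toy "is a kind of" hierarchy (child -> parent); not the real EFO structure.
PARENT = {
    "gastric cancer": "upper digestive tract cancer",
    "esophageal cancer": "upper digestive tract cancer",
    "upper digestive tract cancer": "cancer",
    "bladder cancer": "cancer",
}

# Hypothetical SNP-to-trait annotations.
ANNOTATIONS = [
    ("snp_a", "gastric cancer"),
    ("snp_b", "esophageal cancer"),
    ("snp_c", "bladder cancer"),
]


def string_search(text):
    """Direct query: match on the trait label only."""
    return sorted(snp for snp, trait in ANNOTATIONS if text in trait)


def is_kind_of(trait, ancestor):
    """Walk up the hierarchy to see whether trait falls under ancestor."""
    while trait is not None:
        if trait == ancestor:
            return True
        trait = PARENT.get(trait)
    return False


def ontology_search(term):
    """Knowledge-base query: match the term or any of its descendants."""
    return sorted(snp for snp, trait in ANNOTATIONS if is_kind_of(trait, term))


label_hits = string_search("upper digestive tract")
ontology_hits = ontology_search("upper digestive tract cancer")
print(label_hits)
print(ontology_hits)
```

Here the label search returns nothing, while the hierarchy-aware search returns the SNPs annotated with gastric and esophageal cancer in a single query.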
  21. 21. HermiT OWL Reasoner
     • HermiT is a reasoner for ontologies written in OWL (Web Ontology Language). It is also available as a Protégé plugin.
     • HermiT can determine whether the ontology in a given OWL file is consistent and identify the relationships between classes
     • HermiT passes all OWL 2 conformance tests for direct semantics reasoners
     • HermiT can be accessed from Java apps through the OWL API
     • OWL API is a Java interface for creating, manipulating and serializing OWL ontologies
     • It includes parsers and writers for RDF, OWL and Turtle, as well as an interface for working with reasoners
  22. 22. HermiT reasoner is run in “forward chaining” style
     • How it works: Rules are processed by the reasoner once, in batch mode, to generate and cache inferred triples
     • Best when:
        • Rules of inference and original data don’t change often
        • There’s sufficient disk and RAM to store all the inferred triples
     • Benefits: Retrieval queries run faster
     • Limitation: When the rules or the explicit data set change, it may be necessary to empty and reload the entire data store and re-run the reasoner over it
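The batch behaviour described above can be sketched as a fixed-point loop that materializes every inferred triple up front. This is an illustration of the caching idea only and does not reflect how HermiT works internally; the two rules and the sample triples are assumptions.

```python
def materialize(triples):
    """Apply two rules repeatedly until no new triples appear."""
    inferred = set(triples)
    while True:
        new = set()
        for s, p, o in inferred:
            for s2, p2, o2 in inferred:
                if o != s2 or p2 != "subClassOf":
                    continue
                if p == "subClassOf":
                    # Rule 1: subClassOf is transitive.
                    new.add((s, "subClassOf", o2))
                elif p == "type":
                    # Rule 2: an instance of a class is an instance of its superclasses.
                    new.add((s, "type", o2))
        if new <= inferred:
            return inferred  # fixed point reached
        inferred |= new


# Hypothetical explicit data.
explicit = {
    ("gastric cancer", "subClassOf", "digestive system cancer"),
    ("digestive system cancer", "subClassOf", "cancer"),
    ("trait_1", "type", "gastric cancer"),
}

cache = materialize(explicit)  # slow step, run once in batch
is_cancer = ("trait_1", "type", "cancer") in cache  # fast lookup at query time
print(len(cache), is_cancer)
```

Queries become simple set lookups, but any change to `explicit` means calling `materialize` again, which mirrors the limitation listed on the slide.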
  24. 24. What is the Diagram Browser?
     It’s a diagram that shows SNP-trait associations mapped to the SNPs’ chromosomal locations on the human karyotype. This project has made significant improvements to it:
     • Originally: The diagram was a static document created manually on a quarterly basis (by a medical illustrator)
     • Now: Creation is fully automated as each study is added, and the diagram is interactive, so that it can be explored dynamically
  25. 25. Diagram Browser: Interactive functionality
     • Clicking on an SNP-associated trait category enables selection of only the bands with relevant traits
     • Zoom in and hover over chromosomes in order to see traits by chromosomal location
     • Clicking on the diagram displays all SNPs for a trait and band
  26. 26. How is the Diagram Browser implemented?
     1. The Diagram Browser is a JavaScript app rendered in the client browser
     2. Interaction with the diagram, such as filter, zoom or click, generates a query
     3. The query request is sent via AJAX from the web client to the Tomcat server
     4. The server runs a Java program that converts this request into an OWL class expression, which is processed by the reasoner
     5. The query result causes a string of SVG (Scalable Vector Graphics) code to be generated
     6. This code is sent back to the web client via AJAX
     7. The JavaScript app renders the SVG provided
     [Diagram: Web Browser (JavaScript app): 1 Trigger (filter, zoom, click), 2 Generate AJAX request, 7 Render SVG code; Web Server: 4 Process request, 5 Generate SVG; Knowledge Base (using GWAS Catalog ontology)]
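Steps 5 to 7 amount to turning a query result into an SVG string that the client can render. The real server side is Java; the sketch below uses Python only to illustrate the idea, and the marker layout (`x`, `y`, `trait` keys) and the circle styling are hypothetical.

```python
from xml.sax.saxutils import escape


def render_svg(markers, width=200, height=100):
    """Return SVG markup with one circle per marker; trait names become tooltips."""
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
    ]
    for marker in markers:
        title = escape(marker["trait"])  # escape &, < and > for XML
        parts.append(
            f'<circle cx="{marker["x"]}" cy="{marker["y"]}" r="3">'
            f"<title>{title}</title></circle>"
        )
    parts.append("</svg>")
    return "".join(parts)


# Hypothetical query result: one trait marker at a diagram position.
svg = render_svg([{"x": 40, "y": 25, "trait": "hair & eye color"}])
print(svg)
```

The returned string is what would travel back to the browser in step 6 for the JavaScript app to insert into the page.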
  27. 27. THE FUTURE
  28. 28. Future scalability
     Current implementation: will run into scalability issues as…
     • Size of the knowledge base grows
     • Tools for querying the knowledge base become more sophisticated
     Short-term solution:
     • Monitor system resources and increase them where there are bottlenecks
     • Limit queries to predefined ranges
     • Precompute more inferences, based on query frequency
     Long-term solution:
     • Migrate from the knowledge base to a persistent RDF triplestore (such as Virtuoso)
     • Implement a SPARQL endpoint for queries instead of using OWL class expressions
     • Consider a backward chaining reasoner if the inferred data set gets too big to cache
  29. 29. Future “backward chaining” option
     • How it works: The reasoner is deployed between the GWAS Diagram or SPARQL endpoint and the data store, so that inferred triples are generated in real time as part of the query result set
     • Best when:
        • Rules of inference and original data change often
        • Disk or RAM is insufficient to store all the inferred triples
     • Benefits: No need to re-run the reasoner when data or rules change
     • Limitation: Query response may be slow
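By contrast with precomputing everything, query-time inference can be sketched as a walk over the explicit facts only. This is an assumption-laden illustration: each class has a single parent, the hierarchy is assumed to be acyclic, and the class names are made up.

```python
# Explicit facts only; nothing is precomputed. All names are hypothetical.
SUBCLASS_OF = {
    "gastric cancer": "digestive system cancer",
    "digestive system cancer": "cancer",
}
TYPE_OF = {"trait_1": "gastric cancer"}


def is_instance_of(individual, target_class):
    """Answer at query time by walking up the explicit class hierarchy."""
    current = TYPE_OF.get(individual)
    while current is not None:
        if current == target_class:
            return True
        current = SUBCLASS_OF.get(current)  # assumes no cycles
    return False


before = is_instance_of("trait_1", "disease")
SUBCLASS_OF["cancer"] = "disease"  # new explicit fact; no reload or re-run needed
after = is_instance_of("trait_1", "disease")
print(before, after)
```

The new fact is visible to the very next query, at the cost of repeating the walk on every call.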
  30. 30. SPARQL Example: GWAS Central
     • Although the NHGRI project doesn’t currently host a live SPARQL endpoint, it could be set up to do so
     • The GWAS Central project already does this. (It collates data from a range of sources, including the published literature and collaborating databases such as the NHGRI GWAS Catalog.)
     SPARQL query page for GWAS Central: uery.html
  31. 31. SPARQL Example: EBI’s Atlas
     • EBI hosts the GWAS Diagram, but doesn’t provide a SPARQL endpoint associated with that project
     • It does, however, host SPARQL endpoints for multiple other projects, such as Atlas
     [Screenshot: SPARQL query page and multiple examples for EBI’s Atlas project]
  32. 32. GWAS Central: Towards Federation
     • GWAS Central is a comprehensive resource for the comparison and interrogation of multiple GWAS (genome-wide association studies) projects
     • Allows for storage, mining and display of summary-level association data
     • More comprehensive than other openly available projects with a similar focus (i.e., millions vs. thousands of P-values)
     • Provides user tools and interfaces not previously available from a single resource
     • Aggregates other related resources:
        • GWAS Catalog
        • OADGAR
        • SNPedia
     • The GWAS Central platform is available for adoption by other institutes, consortia, teams and countries
     • Ideally, multiple implementations can be federated to allow searching across multiple data sets
  33. 33. GWAS Central: Towards Federation (cont.) Comparison of features for GWAS Central, GWAS Catalog, OADGAR*, SNPedia * Open Access Database of Genome-wide Association Results
  34. 34. GWAS Central: Towards Federation (cont.) SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware. This specification defines the syntax and semantics of SPARQL 1.1 Federated Query extension for executing queries distributed over different SPARQL endpoints. The SERVICE keyword extends SPARQL 1.1 to support queries that merge data distributed across the Web. Source:
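As an illustration of the SERVICE keyword, the helper below assembles a federated query string. It is a sketch under stated assumptions: the `ex:` prefix, the property names, the endpoint URL and the trait URI are all placeholders on example.org, not identifiers from GWAS Central or the GWAS Catalog, and the query is only built, not executed.

```python
def federated_query(remote_endpoint, trait_uri):
    """Build a SPARQL 1.1 query that joins local data with a remote endpoint."""
    return f"""PREFIX ex: <http://example.org/gwas#>
SELECT ?snp ?remoteRecord WHERE {{
  ?association ex:is_about <{trait_uri}> ;
               ex:has_snp ?snp .
  SERVICE <{remote_endpoint}> {{
    ?remoteRecord ex:annotated_with <{trait_uri}> .
  }}
}}"""


# Placeholder endpoint and trait URI.
query = federated_query(
    "https://example.org/sparql",
    "http://example.org/trait/example_trait",
)
print(query)
```

The patterns outside the SERVICE block are matched against the local store, while the block inside it is sent to the remote endpoint and joined on the shared trait URI.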
  35. 35. Setting up GWAS Catalog project to query across data sets
     • Querying across databases using EFO: Since the GWAS Catalog is based on EFO, it’s possible for a query to include other biomedical databases annotated with EFO: ArrayExpress, Ensembl, BioSamples, Pride, etc.
     • Querying across databases using other ontologies: Even where EFO is not used, cross-reference definition citations allow querying across ontologies. The ID of an external class is added as an annotation on the relevant EFO term.
     • Example: “Connective tissue” is an EFO term that has been mapped to terms in other ontologies, such as BTO:0000421, the identifier for connective tissue in the Brenda ontology.
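The cross-reference mechanism can be sketched as a lookup table that bridges two annotation vocabularies. Only `BTO:0000421` comes from the slide; the EFO key is a placeholder rather than the real EFO identifier, and the record names and the second BTO term are invented.

```python
# Cross-references stored as annotations on an EFO term.
# The EFO key is a placeholder; BTO:0000421 is the example from the slide.
XREFS = {"EFO:connective_tissue_placeholder": {"BTO:0000421"}}

# Hypothetical records from two databases, each annotated with its own ontology.
EFO_ANNOTATED = [("gwas_record_1", "EFO:connective_tissue_placeholder")]
BTO_ANNOTATED = [
    ("enzyme_record_7", "BTO:0000421"),
    ("enzyme_record_8", "BTO:placeholder_other"),
]


def linked_records(efo_term):
    """Collect records annotated with the EFO term or with any of its cross-references."""
    external_ids = XREFS.get(efo_term, set())
    local = [record for record, term in EFO_ANNOTATED if term == efo_term]
    remote = [record for record, term in BTO_ANNOTATED if term in external_ids]
    return local + remote


matches = linked_records("EFO:connective_tissue_placeholder")
print(matches)
```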
  36. 36. THANKS!