The National Human Genome Research Institute's "GWAS Catalog" (Genome-Wide Association Studies) project is a successful implementation of Linked Data (http://linkeddata.org/) and Semantic Web (http://www.w3.org/standards/semanticweb/) concepts. This deck discusses how this project has been implemented, challenges faced and possible paths for the future.
2. About the Project
• Project Name: The National Human Genome Research Institute (NHGRI) Catalog of Published Genome-Wide Association Studies (GWAS Catalog)
• Project Description: Manually curated collection of published GWAS assaying at least 100,000 single-nucleotide polymorphisms (SNPs) and all SNP-trait associations with P < 1 × 10⁻⁵.
• In addition to SNP-trait association data, provides the "Diagram Browser", an interactive diagram of these associations mapped to the SNPs' chromosomal locations.
Project Growth (stats as of Aug 2014):
• Almost 2,000 GWAS-related publications
• Over 14,000 SNPs
[Chart: number of studies, number of traits and SNP-trait associations, 2005 to 2014]
Website: http://www.genome.gov/gwastudies/
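The inclusion criteria above can be read as a simple filter. The following is a minimal Python sketch of that reading; the record fields and the sample values are hypothetical and are not taken from the Catalog's actual schema.

```python
from dataclasses import dataclass

MIN_SNPS_ASSAYED = 100_000   # a study must assay at least this many SNPs
P_VALUE_THRESHOLD = 1e-5     # an association must have P < 1 x 10^-5


@dataclass
class Association:
    """One SNP-trait association (field names are illustrative only)."""
    snp_id: str
    trait: str
    p_value: float


def study_is_eligible(snps_assayed: int) -> bool:
    """True when the study meets the SNP-count criterion."""
    return snps_assayed >= MIN_SNPS_ASSAYED


def reportable(associations):
    """Keep only associations below the p-value threshold."""
    return [a for a in associations if a.p_value < P_VALUE_THRESHOLD]


# Made-up records, used only to exercise the filter.
candidates = [
    Association("rs0000001", "example trait A", 3e-9),
    Association("rs0000002", "example trait B", 2e-4),
]
kept = reportable(candidates)
```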
3. Accessing the data
The GWAS Catalog can be accessed:
• Via the "Diagram Browser"
  - Implemented as a dynamic visualization on the human karyotype
  - With links to study publication, SNPs in Ensembl and ontology terms in EFO (Experimental Factor Ontology)
• Via a web query search interface
  - Provides tabular data for view or download
  - Includes traits and links to study publication
• Via other GWAS-related data portals, such as
  - Ensembl
  - UCSC Genome Browser
  - PheGenI
  - GWAS Central
4. GWAS Components
The project is implemented in 3 main components:
1. Curation / Data loading pipeline
2. Data Publisher
3. Diagram Browser
[Architecture diagram; labels: Curation, SNP Batch Loader, PubMed Tracking, Publisher, Inference engine, Ontology Loading, Diagram Browser, Knowledge Base, Ontology Schema]
* The source code is managed under the GOCI (GWAS Ontology and Curation Infrastructure) project
5. Application Implementation
The following technologies have been used for this project:
• Java for server-side processing
• Spring for MVC framework
• Maven for build automation and dependency management
• Apache Tomcat for web server
• Oracle for relational database
• HermiT for OWL reasoner
• JavaScript / AJAX for Diagram Browser interactivity
• SVG for rendering vector graphics in the Diagram Browser
• Apache POI for processing spreadsheets
• ColdFusion for generating records for each SNP
7. Ontology schema needed
Before the project could be implemented, an ontology had to be designed for its components to operate. Working backwards:
• The Diagram Browser needs to display GWAS-related data in order to answer common GWAS use cases
• The Publisher needs to store data such that it can be reasoned over and served up to the Diagram Browser
• The Batch Loader needs to extract GWAS data from publications in a consistent manner for later retrieval by the Publisher
8. GWAS Catalog Ontology
The GWAS Catalog ontology was created by mapping each trait to one or more terms in the Experimental Factor Ontology (EFO)
• At the start, 20% of GWAS traits were already in EFO
• SKOS was used to extend EFO for GWAS-specific views
• 500 new terms were added to create the GWAS-EFO-SKOS ontology
Reasons for using EFO
• It's actively developed
• It's well suited to cover the diversity of GWAS traits
Metrics
Number of classes: 13,850
Number of individuals: 370
Number of properties: 50
Maximum depth: 15
Maximum # of children: 700
Average # of children: 7
Classes with a single child: 500
Classes with > 25 children: 100
Classes with no definition: 13,500
* Note: GWAS Catalog Ontology and GWAS Diagram OWL have been used interchangeably
9. GWAS Catalog Ontology (cont.)
• Purpose: Models the relationships between the GWAS concepts of "SNP", "trait" and "chromosome" for display in the Diagram
• Location of ontology schemas used:
  - EFO schema: http://www.ebi.ac.uk/efo
  - GWAS-Diagram schema: http://www.ebi.ac.uk/efo/gwas-diagram
Class hierarchy:
• GWAS study
• chromosome
  - chromosome 1..23
  - chromosome X, Y
• cytogenetic band
• single nucleotide polymorphism
• trait association
• experimental factor
Object property hierarchy:
• has_part
• located_in
• location_of
• associated_with
• is_about
• has_about
• part_of
Data property hierarchy:
• has_name
• has_snp_reference_id
• has_bp_position
• has_length
• has_p_value
• has_pubmed_id
• has_author
• has_publication_date
• has_gwas_trait_name
* Source: OntologyConstants.java; http://www.ebi.ac.uk/fgpt/gwas/ontology/gwas-diagram.owl
10. Field definitions for the OWL schema
1. SNP reference ID: A single nucleotide polymorphism identifier, as assigned by the Single Nucleotide Polymorphism Database (dbSNP).
2. Base pair position: The position, in base pairs, of a particular element on a genome.
3. Base pair length: The length, in base pairs, of any genomic element.
4. P-value: The probability of obtaining a test statistic at least as extreme as the one that was actually observed.
5. PubMed ID: The publication ID of a scientific paper, as assigned by the PubMed database.
6. Author: The primary author of a publication, usually expressed as surname followed by initial(s).
7. Publication date: A date on which a given entity was published.
8. GWAS trait name: An arbitrary text label used to record a GWAS trait name that does not specifically map to an ontology term. Usually this will be used to annotate instances of Experimental Factor in order to retain information about a trait that was not defined in the ontology.
9. Chromosomes: Chromosome 1-23; Chromosomes X & Y
10. Trait association: An association that can be asserted between two entities with a degree of confidence expressed as a p-value.
11. GWAS Study: A study, described by a scientific publication, that identifies genome-wide associations between single nucleotide polymorphisms and phenotypic traits or disorders.
11. Using SKOS for defining the GWAS Catalog ontology
SKOS (Simple Knowledge Organization System) was used to create the GWAS Catalog ontology by extending the EFO ontology, because SKOS:
• Requires less expertise, effort and cost, since it is less semantically strict and expressive than OWL
• Can be used where the complexity of inferences is limited
• Is easy to use for extending other vocabularies
Introduction to SKOS
SKOS is an area of work developing specifications and
standards to support the use of knowledge organization
systems (KOS) such as thesauri, classification schemes,
subject heading systems and taxonomies within the
framework of the Semantic Web.
12. Sample dataset generated by OWL API is broken into…
Data Property Assertion
Class Assertion
Object Property Assertion
13. Advantage of ontology for traits
Using a predefined ontology for describing traits (rather than unstructured lists) allows:
1. More complex, compounded and context-dependent traits to be described
  - e.g. "Type 2 diabetes and gout"; "Parkinson's disease (interaction with caffeine)"
2. Creation of semantically meaningful links between traits
3. More complex and meaningful queries
Traits
• Phenotypes, e.g. hair & eye color
• Treatment responses, e.g. response to antineoplastic agents
• Diseases, e.g. type 2 diabetes
• Assays, e.g. glycosylated haemoglobin level
• Chemical/drug names, e.g. C-reactive protein
15. The Curation process is partially automated
1. Run automated literature searches to capture eligible studies
2. Enter them into the system for review by curators
3. Triage and assign papers to curators
4. Curators use a web-based tracking and data entry system which allows multiple users to search, annotate, verify and publish the Catalog data. There are two levels of manual curation:
   a. First, all data are extracted by one curator.
   b. Some studies can have more than 1,000 significant SNPs, so curators create spreadsheets of SNPs for batch loading into the DB (using the Apache POI Java API for Microsoft Documents and a ColdFusion extension).
   c. Then the data are double-checked for accuracy and consistency by another curator.
5. Run the automated pipeline that:
   a. Checks multiple data sources for accuracy, completeness and consistency: PubMed, dbSNP and NCBI's Gene database
   b. Adds genomic annotation such as each SNP's base pair and cytogenetic location
[Workflow diagram: Literature search → ID eligible studies → Entry into workflow tool → Triage & assignment → Manual curation (data entry, check accuracy) → Automated pipeline (check against PubMed, dbSNP, NCBI; add annotation)]
16. Creation of links to external data sources
Each entry in the GWAS Catalog has links to supporting data sources for convenience.
Reference Source: Sample Link / URI
• NCBI's dbSNP: http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=1333049
• Ensembl: http://useast.ensembl.org/Homo_sapiens/Variation/Explore?r=9:22125003-22126003;v=rs1333049;vdb=variation;vf=1004336
• PubMed: http://www.ncbi.nlm.nih.gov/pubmed?Db=pubmed&DbFrom=snp&Cmd=Link&LinkName=snp_pubmed_cited&LinkReadableName=Pubmed+(SNP+Cited)&IdsFromResult=1333049
• OMIM: http://omim.org/entry/611139#0000
* Note that currently these are links for use by people, rather than machine readable
linkages that would allow querying across multiple data sources
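As an illustration of how such human-facing links can be derived from a single SNP reference ID, here is a minimal Python sketch. It assumes only the dbSNP and PubMed URL patterns visible in the sample links on this slide; the function names are hypothetical, and whether the Catalog builds its links this way is not stated in the deck.

```python
import re
from urllib.parse import urlencode

DBSNP_BASE = "http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi"
PUBMED_BASE = "http://www.ncbi.nlm.nih.gov/pubmed"


def rs_number(rs_id: str) -> str:
    """Return the numeric part of an ID such as 'rs1333049'."""
    match = re.fullmatch(r"rs(\d+)", rs_id)
    if match is None:
        raise ValueError(f"not an rs identifier: {rs_id!r}")
    return match.group(1)


def dbsnp_link(rs_id: str) -> str:
    return f"{DBSNP_BASE}?{urlencode({'rs': rs_number(rs_id)})}"


def pubmed_link(rs_id: str) -> str:
    params = {
        "Db": "pubmed",
        "DbFrom": "snp",
        "Cmd": "Link",
        "LinkName": "snp_pubmed_cited",
        "LinkReadableName": "Pubmed (SNP Cited)",
        "IdsFromResult": rs_number(rs_id),
    }
    # safe="()" keeps the parentheses literal, as in the sample link.
    return f"{PUBMED_BASE}?{urlencode(params, safe='()')}"


links = {"dbSNP": dbsnp_link("rs1333049"), "PubMed": pubmed_link("rs1333049")}
```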
17. Future: Opportunity for automating curation
• Use machine learning and natural language processing (NLP) to categorize studies into traits defined in the GWAS Catalog ontology
• Assign categorization confidence metrics to assist the processing workflow
• Accuracy can be verified by humans based on highlighting and annotations provided by the NLP engine
[Diagram: NLP processing & confidence assignment → Workflow for human validation (where needed) → Knowledge Base]
19. Data flow for GOCI Publisher
1. Start with the Oracle relational database created by the Curation process.
2. The Java Publisher app converts the relational data into OWL individuals.
3. The result is a knowledge base in the format of the GWAS Catalog ontology, with 13,000 individuals and 43,000 axioms.
4. The OWL API and HermiT reasoner create inferences from the GWAS Catalog ontology.
5. Since it takes > 10 hours to run the reasoner, the job is run in batch and the results are cached in RAM.
6. Results are retrieved by the Diagram Browser with requests to an app running on the Tomcat server.
[Data flow diagram: Relational Database (Oracle) → Java Publisher job → Knowledge Base (OWL individuals / triples cached in RAM) → HermiT + OWL API → Knowledge Base with Inferred Triples (cached in RAM) → GWAS Diagram Browser; SPARQL Endpoint (future)]
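The conversion from relational rows to OWL individuals can be pictured as emitting class, data property and object property assertions for each row. The sketch below uses plain Python tuples rather than the OWL API, and it is only an illustration: the row columns, the identifier prefixes and the choice of which object property links which individuals are assumptions, although the property names themselves come from the schema slide.

```python
def row_to_triples(row):
    """Turn one (hypothetical) relational row into subject-predicate-object triples."""
    snp = f"snp:{row['rs_id']}"
    assoc = f"assoc:{row['association_id']}"
    study = f"study:{row['pubmed_id']}"
    return [
        # Class assertions: which class each individual belongs to.
        (snp, "rdf:type", "single nucleotide polymorphism"),
        (assoc, "rdf:type", "trait association"),
        (study, "rdf:type", "GWAS study"),
        # Data property assertions: literal values attached to individuals.
        (snp, "has_snp_reference_id", row["rs_id"]),
        (assoc, "has_p_value", row["p_value"]),
        (study, "has_pubmed_id", row["pubmed_id"]),
        # Object property assertions: links between individuals
        # (the direction and pairing used here are assumptions).
        (assoc, "is_about", snp),
        (assoc, "part_of", study),
    ]


# Made-up row, used only to show the output shape.
example_row = {
    "association_id": 1,
    "rs_id": "rs1333049",
    "pubmed_id": "00000000",
    "p_value": 1e-9,
}
triples = row_to_triples(example_row)
```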
20. Publisher's output is OWL triples
…because this format is preferable to having the Diagram Browser query a relational database. The benefits are:
• Additional inferences about SNP-trait associations
• More expressive queries
• Ability to detect errors or inconsistencies, as defined by the ontology
Using direct queries:
• Data is an unstructured catalog of traits in a fixed relational schema
• Queries can only use string pattern matching and must be run one at a time. It's not possible to query for related or inferred traits.
• Example queries:
  - Can search on trait name containing "diabetes" and get results for both type 1 and type 2 diabetes
  - Comparison between gastric and esophageal cancers requires manually combining results from two distinct searches
Using OWL knowledge base:
• Data is structured in semantic triples and reasoned over using an ontology
• Queries can include inferences and complex questions
• Example queries: *
  - Find all SNPs that are associated with cancers located in the upper digestive tract
  - Find all SNPs located on chromosomes 5, 7, 15 and 21 that are associated with diseases located in the urinary tract, with a p-value smaller than 10⁻⁸
* Source: Welter, D., Burdett, T., et al. (2012) Ontology-driven visualization of NHGRI GWAS data
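The difference between the two query styles can be sketched with a toy trait hierarchy. Everything in the example below is hypothetical: the hierarchy is a hand-made stand-in for EFO, and the SNP identifiers are placeholders.

```python
# Hypothetical child -> parent links standing in for an ontology hierarchy.
PARENT = {
    "gastric cancer": "upper digestive tract cancer",
    "esophageal cancer": "upper digestive tract cancer",
    "upper digestive tract cancer": "cancer",
    "bladder cancer": "cancer",
}

# Placeholder (SNP, trait) pairs.
ASSOCIATIONS = [
    ("rs0000001", "gastric cancer"),
    ("rs0000002", "esophageal cancer"),
    ("rs0000003", "bladder cancer"),
]


def is_a(term, ancestor):
    """Walk up the hierarchy from term, looking for ancestor."""
    while term is not None:
        if term == ancestor:
            return True
        term = PARENT.get(term)
    return False


def string_search(text):
    """Direct-query style: substring match on the trait name."""
    return [snp for snp, trait in ASSOCIATIONS if text in trait]


def ontology_search(ancestor):
    """Knowledge-base style: match any trait that is a kind of ancestor."""
    return [snp for snp, trait in ASSOCIATIONS if is_a(trait, ancestor)]


by_string = string_search("upper digestive tract")
by_ontology = ontology_search("upper digestive tract cancer")
```

In this toy data the substring search finds nothing, because no trait name contains the phrase, while the hierarchy-aware search returns both the gastric and the esophageal cancer SNPs in a single query.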
21. HermiT OWL Reasoner
• HermiT is a reasoner for ontologies written using OWL (Web Ontology Language). It is available as a Protégé plugin.
• HermiT can determine whether the ontology in any given OWL file is consistent and identify the relationships between classes
• HermiT passes all OWL 2 conformance tests for direct semantics reasoners
• HermiT can be accessed from Java apps through the OWL API
• OWL API is a Java interface for creating, manipulating and serializing OWL ontologies
  - It includes parsers and writers for RDF/XML, OWL/XML and Turtle, as well as interfaces for working with reasoners
22. HermiT reasoner is implemented with "forward chaining"
• How it works: Rules are processed by the reasoner once in batch mode to generate and cache inferred triples
• Best when:
  - Rules of inference and original data don't change often
  - There's sufficient disk and RAM to store all the inferred triples
• Benefits: Retrieval queries run faster
• Limitation: When the rules or the explicit data set change, it may be necessary to empty and reload the entire data store and re-run the reasoner over it again
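The batch "generate once, then cache" pattern described above can be sketched as a small fixpoint loop. This is a generic illustration of forward chaining over two simple rules, written in plain Python; it is not a description of HermiT's internal algorithm, and the triples are made up.

```python
def materialize(explicit):
    """Apply two rules repeatedly until no new triples appear (a fixpoint)."""
    inferred = set(explicit)
    while True:
        new = set()
        for (a, p1, b) in inferred:
            for (c, p2, d) in inferred:
                if b != c:
                    continue
                # Rule 1: subClassOf is transitive.
                if p1 == "subClassOf" and p2 == "subClassOf":
                    new.add((a, "subClassOf", d))
                # Rule 2: an instance of a class is an instance of its superclasses.
                if p1 == "type" and p2 == "subClassOf":
                    new.add((a, "type", d))
        if new <= inferred:
            return inferred
        inferred |= new


# Made-up explicit triples.
EXPLICIT = {
    ("trait1", "type", "gastric cancer"),
    ("gastric cancer", "subClassOf", "digestive cancer"),
    ("digestive cancer", "subClassOf", "cancer"),
}

# Run once in batch; afterwards each lookup is a cheap set-membership test.
CACHE = materialize(EXPLICIT)
trait1_is_cancer = ("trait1", "type", "cancer") in CACHE
```

If `EXPLICIT` changes, `CACHE` has to be rebuilt, which mirrors the limitation listed above.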
24. What is the Diagram Browser?
It's a diagram that shows SNP-trait associations mapped to the SNPs' chromosomal locations on the human karyotype. This project has made significant improvements to it:
• Originally: The diagram was a static document created manually on a quarterly basis (by a medical illustrator)
• Now: Creation is fully automated with each study added, and the diagram is interactive, so that it can be explored dynamically
25. Diagram Browser: Interactive functionality
Clicking on SNP-associated trait
category enables selection of
only bands with relevant traits
Zoom in and hover over
chromosomes in order to see
traits by chromosomal location
Clicking on diagram displays all
SNPs for a trait and band
26. How is the Diagram Browser implemented?
1. The Diagram Browser is a JavaScript app
rendered on the client browser
2. Interaction with the diagram, such as filter,
zoom or click, generates a query
3. The query request is sent via AJAX from
the web client to the Tomcat server
4. The server runs a Java program that
converts this request into an OWL class
expression which is processed by the
reasoner
5. The query result causes a string of SVG
(Scalable Vector Graphics) code to be
generated
6. This code is sent back to the web client via
AJAX
7. The JavaScript app renders the SVG
provided
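[Diagram: Web Browser (JavaScript app) ↔ Web Server ↔ Knowledge Base (using GWAS Catalog ontology). A trigger (filter, zoom, click) generates an AJAX request; the server processes the request and generates SVG; the JavaScript app renders the SVG code. Numbers 1-7 on the diagram correspond to the steps above.]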
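Step 5, turning a query result into a string of SVG, can be sketched as follows. The deck says the server side is Java; this Python version only illustrates the idea, and the shape of the query result (`x`, `y`, `color`, `trait`) is a hypothetical stand-in for whatever the real server produces.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SVG_NS = "http://www.w3.org/2000/svg"


def render_svg(hits, width=200, height=100):
    """Build one SVG string with a circle (and tooltip title) per hit."""
    parts = [f'<svg xmlns="{SVG_NS}" width="{width}" height="{height}">']
    for hit in hits:
        parts.append(
            f'<circle cx="{hit["x"]}" cy="{hit["y"]}" r="4" '
            f'fill={quoteattr(hit["color"])}>'
            f'<title>{escape(hit["trait"])}</title></circle>'
        )
    parts.append("</svg>")
    return "".join(parts)


# Made-up query result.
hits = [
    {"x": 20, "y": 30, "color": "#cc0000", "trait": "type 2 diabetes"},
    {"x": 60, "y": 30, "color": "#0066cc", "trait": "hair & eye color"},
]
svg = render_svg(hits)

# Parsing the string confirms that the generated markup is well-formed XML.
root = ET.fromstring(svg)
```

The client would receive `svg` as the AJAX response body and insert it into the page.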
28. Future scalability
The current implementation will run into scalability issues as…
• The size of the knowledge base grows
• Tools for querying the knowledge base become more sophisticated
Short-term solution:
• Monitor system resources and increase them where there are bottlenecks
• Limit queries to predefined ranges
• Precompute more inferences, based on query frequency
Long-term solution:
• Migrate from the in-memory knowledge base to a persistent RDF triplestore (such as Virtuoso)
• Implement a SPARQL endpoint for queries instead of using OWL class expressions
• Consider a backward chaining reasoner if the inferred data set gets too big to cache
29. Future "backward chaining" option
• How it works: The reasoner is deployed between the GWAS Diagram or SPARQL endpoint and the data store, so that inferred triples are generated in real time as part of the query result set
• Best when:
  - Rules of inference and original data change often
  - Disk or RAM is insufficient to store all the inferred triples
• Benefits: No need to re-run the reasoner when data or rules change
• Limitation: Query response may be slow
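For contrast with precomputing inferences, the query-time approach can be sketched as a goal-driven check that walks the hierarchy only when a question is asked. The data and names below are made up, and the sketch covers a single rule (class membership through superclasses), not a general reasoner.

```python
# Made-up explicit data; nothing inferred is ever stored.
SUBCLASS_OF = {
    "gastric cancer": "digestive cancer",
    "digestive cancer": "cancer",
}
TYPE_OF = {"trait1": "gastric cancer"}


def has_type(individual, target):
    """Answer one query by walking up the hierarchy on demand."""
    current = TYPE_OF.get(individual)
    seen = set()
    while current is not None and current not in seen:
        if current == target:
            return True
        seen.add(current)          # guards against accidental cycles
        current = SUBCLASS_OF.get(current)
    return False


before_change = has_type("trait1", "disease")

# The explicit data changes; no batch job needs to be re-run.
SUBCLASS_OF["cancer"] = "disease"
after_change = has_type("trait1", "disease")
```

Each call repeats the walk, which is the source of the slower query response noted above.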
30. SPARQL Example: GWAS Central
• Although the NHGRI project currently doesn't host a live SPARQL endpoint, it could be set up to do so
• The GWAS Central project already does this. (It collates data from a range of sources, including the published literature and collaborating databases such as the NHGRI GWAS Catalog.)
SPARQL query page for GWAS Central: http://fuseki.gwascentral.org/query.html
31. SPARQL Example: EBI's Atlas
• EBI hosts the GWAS Diagram, but doesn't provide a SPARQL endpoint associated with that project
• It does, however, host SPARQL endpoints for multiple other projects, such as Atlas
SPARQL query page and multiple examples for EBI's Atlas project (https://www.ebi.ac.uk/rdf/services/atlas/sparql)
32. GWAS Central: Towards Federation
• GWAS Central is a comprehensive resource for the comparison and interrogation of multiple GWAS (genome-wide association studies) projects
• Allows for storage, mining and display of summary-level association data
• More comprehensive than other openly available projects with a similar focus (i.e., millions vs. thousands of P-values)
• Provides user tools and interfaces not previously available from a single resource
• Aggregates other related resources:
  - GWAS Catalog
  - OADGAR
  - SNPedia
• GWAS Central platform is available for adoption by other institutes, consortia, teams and countries
  - Ideally, multiple implementations can be federated to allow searching across multiple data sets
33. GWAS Central: Towards Federation (cont.)
Comparison of features for GWAS Central, GWAS Catalog,
OADGAR*, SNPedia
* Open Access Database of Genome-wide Association Results
34. GWAS Central: Towards Federation (cont.)
SPARQL can be used to express queries across diverse data sources, whether
the data is stored natively as RDF or viewed as RDF via middleware. This
specification defines the syntax and semantics of SPARQL 1.1 Federated
Query extension for executing queries distributed over different SPARQL
endpoints.
The SERVICE keyword extends SPARQL 1.1 to support queries that merge
data distributed across the Web.
Source: http://www.w3.org/TR/sparql11-federated-query/
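To make the SERVICE keyword concrete, the sketch below assembles a federated query as a Python string. Only the SERVICE syntax and the Atlas endpoint URL come from this deck and the W3C specification; the `ex:` prefix, the predicates and the EFO term IRI are placeholders, since the GWAS Catalog does not currently expose a SPARQL endpoint and the remote schema is not described here.

```python
ATLAS_ENDPOINT = "https://www.ebi.ac.uk/rdf/services/atlas/sparql"  # URL cited in this deck


def federated_query(efo_term_iri, remote_endpoint=ATLAS_ENDPOINT):
    """Return a SPARQL 1.1 query that joins local data with a remote endpoint.

    The ex: predicates are hypothetical placeholders, not a real vocabulary.
    """
    return f"""
PREFIX ex: <http://example.org/gwas#>
SELECT ?snp ?remoteRecord WHERE {{
  ?association ex:is_about ?snp ;
               ex:has_trait <{efo_term_iri}> .
  SERVICE <{remote_endpoint}> {{
    ?remoteRecord ex:annotated_with <{efo_term_iri}> .
  }}
}}
""".strip()


# Placeholder IRI; a real query would use an actual EFO term identifier.
query = federated_query("http://www.ebi.ac.uk/efo/EFO_XXXXXXX")
```

The patterns outside the SERVICE block are evaluated against the local store, and the pattern inside it is sent to the remote endpoint; the shared EFO term is what joins the two result sets.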
35. Setting up GWAS Catalog project to query across data sets
Querying across databases using EFO: Since the GWAS Catalog is based on EFO, it's possible for a query to include other biomedical databases annotated with EFO terms: ArrayExpress, Ensembl, BioSamples, PRIDE, etc.
Querying across databases using other ontologies: Even if EFO is not used, cross-reference annotations allow querying across ontologies. The ID of an external class is added as an annotation on the relevant EFO term.
Example: Connective tissue is an EFO term that has been mapped to terms in other ontologies, such as term BTO:0000421, the identifier for connective tissue in the BRENDA Tissue Ontology.
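The cross-reference idea can be sketched as a lookup from an external identifier back to the EFO term that carries it as an annotation. The dictionary below is a hypothetical in-memory stand-in for those annotations and contains only the one mapping given in the example above.

```python
# EFO term label -> external identifiers recorded as annotations on that term.
XREFS = {
    "connective tissue": ["BTO:0000421"],
}


def efo_terms_for(external_id):
    """Return the EFO term labels annotated with the given external ID."""
    return [label for label, ids in XREFS.items() if external_id in ids]


matches = efo_terms_for("BTO:0000421")
```

A data set annotated with BTO identifiers could then be joined to GWAS Catalog traits by translating each identifier through such a lookup.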