Case Study in Linked Data and Semantic Web: Human Genome

David Portnoy
http://LinkedIn.com/in/DavidPortnoy
312.970.9740
© Copyright 2012-2014 Datalytx, Inc.
Case study in Linked Data and Semantic Web
for the Human Genome domain
NHGRI’s
“GWAS Catalog” Project
National Human Genome Research Institute

 Project Growth:
About the Project
 Project Name: The National Human Genome Research Institute (NHGRI)
Catalog of Published Genome-Wide Association Studies (GWAS Catalog)
 Project Description: Manually curated collection of published GWAS
assaying at least 100,000 single-nucleotide polymorphisms (SNPs), and of
all SNP-trait associations with p < 1 × 10^-5.
 In addition to SNP-trait association data, it provides the “Diagram
Browser”, an interactive diagram of these associations mapped to the
SNPs’ chromosomal locations.
Stats as of Aug 2014:
 Almost 2,000 GWAS-related publications
 Over 14,000 SNPs
[Chart: project growth from 2005 to 2014 in # of studies, # of traits and SNP-trait associations]
Website: http://www.genome.gov/gwastudies/

Accessing the data
The GWAS Catalog can be accessed:
 Via the “Diagram Browser”
 Implemented as a dynamic visualization on the human karyotype
 With links to study publication, SNPs in Ensembl and ontology terms in
EFO (Experimental Factor Ontology)
 Via a web query search interface
 Provides tabular data for view or download
 Includes traits and links to study publication
 Via other GWAS-related data portals, such as
 Ensembl
 UCSC Genome Browser
 PheGenI
 GWAS Central

GWAS Components
The project is implemented in 3 main components:
1. Curation / Data loading pipeline
2. Data Publisher
3. Diagram Browser
[Diagram: component boxes labeled Curation, SNP Batch Loader, PubMed, Tracking, Publisher, Inference engine, Ontology Loading, Diagram Browser, Knowledge Base, Ontology Schema]
* The source code is managed under the GOCI (GWAS Ontology and Curation Infrastructure) project

Application Implementation
The following technologies have been used for this project
 Java for server-side processing
 Spring for MVC framework
 Maven for build automation and dependency management
 Apache Tomcat for web server
 Oracle for relational database
 HermiT for OWL reasoner
 JavaScript / AJAX for Diagram Browser interactivity
 SVG for rendering vector graphics in the Diagram Browser
 Apache POI for processing spreadsheets
 ColdFusion for generating records for each SNP

Ontology schema needed
Before the project could be implemented, an ontology had to be
designed for its components to operate. Working backwards:
 The Diagram Browser needs to display GWAS related data in
order to answer common GWAS use cases
 The Publisher needs to store data, such that it can be reasoned
over and served up to the Diagram Browser
 The Batch Loader needs to extract GWAS data from
publications in a consistent manner for later retrieval by the
Publisher

GWAS Catalog Ontology
The ontology was created by mapping each trait to one or
more terms in the Experimental Factor Ontology (EFO)
 At the start, 20% of GWAS traits were already in EFO
 SKOS was used to extend EFO for GWAS-specific views
 500 new terms were added to create the GWAS-EFO-SKOS ontology
Reasons for using EFO
 It’s actively developed
 It’s well suited to cover diversity of GWAS
traits
Metrics
Number of classes              13,850
Number of individuals          370
Number of properties           50
Maximum depth                  15
Maximum # of children          700
Average # of children          7
Classes with a single child    500
Classes with > 25 children     100
Classes with no definition     13,500
* Note: GWAS Catalog Ontology and GWAS Diagram OWL have been used interchangeably

GWAS Catalog Ontology (cont.)
 Purpose: Models how the GWAS concepts “SNP”, “trait” and
“chromosome” relate to one another and to the Diagram
 Location of ontology schemas used:
EFO schema: http://www.ebi.ac.uk/efo
GWAS-Diagram schema: http://www.ebi.ac.uk/efo/gwas-diagram
Class hierarchy:
 GWAS study
 chromosome
   chromosome 1..23
   chromosome X, Y
 cytogenetic band
 single nucleotide polymorphism
 trait association
 experimental factor
Object property hierarchy:
 has_part
 located_in
 location_of
 associated_with
 is_about
 has_about
 part_of
Data property hierarchy:
 has_name
 has_snp_reference_id
 has_bp_position
 has_length
 has_p_value
 has_pubmed_id
 has_author
 has_publication_date
 has_gwas_trait_name
* Source: OntologyConstants.java; http://www.ebi.ac.uk/fgpt/gwas/ontology/gwas-diagram.owl

Field definitions for the OWL schema
1. SNP reference ID: A single nucleotide polymorphism identifier, as assigned by the Single
Nucleotide Polymorphism Database (dbSNP).
2. Base pair position: The position, in base pairs, of a particular element on a genome
3. Base pair length: The length, in base pairs, of any genomic element.
4. P-value: The probability of obtaining a test statistic at least as extreme as the one that
was actually observed, assuming the null hypothesis is true.
5. PubMed ID: The publication ID of a scientific paper, as assigned by the PubMed
database.
6. Author: The primary author of a publication, usually expressed as surname followed by
initial(s).
7. Publication date: A date on which a given entity was published
8. GWAS trait name: An arbitrary text label used to add a text definition of a GWAS trait
name that does not specifically map to an ontology term. Usually this will be used to
annotate instances of Experimental Factor in order to retain information about a trait that
was not defined in the ontology.
9. Chromosomes: Chromosome 1-23; Chromosomes X & Y
10. Trait association: An association that can be asserted between two entities with a
degree of confidence expressed as a p-value.
11. GWAS Study: A study, described by a scientific publication, that identifies genome-wide
associations between single nucleotide polymorphisms and phenotypic traits or
disorders.

Using SKOS for defining the GWAS Catalog ontology
SKOS (Simple Knowledge Organization System) was used to create the
GWAS Catalog ontology by extending the EFO ontology, because:
 Requires less expertise, effort and cost, since it is less semantically
strict and expressive than OWL
 Can be used where the complexity of inferences is limited
 Is easy to use for extending other vocabularies
Introduction to SKOS
SKOS is an area of work developing specifications and
standards to support the use of knowledge organization
systems (KOS) such as thesauri, classification schemes,
subject heading systems and taxonomies within the
framework of the Semantic Web.

Sample dataset generated by OWL API is broken into…
Data Property Assertion
Class Assertion
Object Property Assertion
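The three assertion types named on this slide can be illustrated with a minimal sketch that prints them in OWL 2 functional-style syntax. This is not output from the GOCI Publisher: the individual names (snp_rs1333049, band_example) and the class local name are hypothetical, and only the property names located_in and has_snp_reference_id come from the ontology listing earlier in the deck.

```python
# Illustrative sketch only: individual and class local names are hypothetical.

def class_assertion(cls, individual):
    """States that an individual is an instance of a class."""
    return f"ClassAssertion( :{cls} :{individual} )"

def object_property_assertion(prop, subject, obj):
    """Links two individuals through an object property."""
    return f"ObjectPropertyAssertion( :{prop} :{subject} :{obj} )"

def data_property_assertion(prop, subject, value, datatype="xsd:string"):
    """Attaches a literal value to an individual through a data property."""
    return f'DataPropertyAssertion( :{prop} :{subject} "{value}"^^{datatype} )'

axioms = [
    class_assertion("SingleNucleotidePolymorphism", "snp_rs1333049"),
    object_property_assertion("located_in", "snp_rs1333049", "band_example"),
    data_property_assertion("has_snp_reference_id", "snp_rs1333049", "rs1333049"),
]

for axiom in axioms:
    print(axiom)
```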

Advantage of ontology for traits
Using a predefined ontology for describing traits
(rather than unstructured lists) allows:
1. More complex, compounded and context-
dependent traits to be described
 e.g. “Type 2 diabetes and gout”;
“Parkinson’s disease (interaction with
caffeine)”
2. Creation of semantically meaningful links
between traits
3. More complex and meaningful queries
Traits
• Phenotypes, e.g. hair & eye color
• Treatment responses, e.g.
response to antineoplastic agents
• Diseases, e.g. type 2 diabetes
• Assays, e.g. glycosylated
haemoglobin level
• Chemical/drug names, e.g.
C-reactive protein

The Curation process is partially automated
1. Run automated literature searches to capture eligible studies
2. Enter them into the system for review by curators
3. Triage and assign papers to curator
4. Curators use a web-based tracking and data entry system which allows multiple
users to search, annotate, verify and publish the Catalog data. There are two levels
of manual curation:
a. First all data are extracted by one curator.
b. Some studies can have more than 1,000 significant SNPs, so curators create
spreadsheets of SNPs for batch loading into the DB (using the Apache POI Java API
for Microsoft Documents and a ColdFusion extension).
c. Then data are double-checked for accuracy and consistency by another curator
5. Run the automated pipeline that:
a. Checks multiple data sources for accuracy, completeness and consistency:
PubMed, dbSNP, and NCBI's Gene database
b. Adds genomic annotation such as SNP's base pair and cytogenetic location
[Workflow diagram: Literature search → ID eligible studies → Entry into workflow tool → Triage & assignment → Manual curation (data entry, check accuracy) → Automated pipeline (check against PubMed, dbSNP, NCBI; add annotation)]
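The kind of consistency check run by the automated pipeline can be sketched as follows. This is a simplified, hypothetical check on a single curated record, not the GOCI pipeline itself: the record field names are assumptions, it only validates formats locally, and it does not contact PubMed, dbSNP or NCBI Gene. The p-value cutoff is the inclusion threshold stated in the project description.

```python
import re

RS_ID = re.compile(r"^rs\d+$")   # dbSNP reference IDs look like "rs1333049"
P_VALUE_CUTOFF = 1e-5            # Catalog inclusion threshold from the project description

def check_association(record):
    """Return a list of problems found in one curated SNP-trait record.

    The dictionary keys ('snp', 'p_value', 'pubmed_id') are hypothetical.
    """
    problems = []
    if not RS_ID.match(record.get("snp", "")):
        problems.append("malformed rs ID")
    p = record.get("p_value")
    if p is None or not (0 < p < P_VALUE_CUTOFF):
        problems.append("p-value missing or not below 1e-5")
    if not str(record.get("pubmed_id", "")).isdigit():
        problems.append("PubMed ID is not numeric")
    return problems

# Placeholder records for illustration; the values are not taken from the Catalog.
good = {"snp": "rs1333049", "p_value": 3e-8, "pubmed_id": "12345678"}
bad = {"snp": "1333049", "p_value": 0.01, "pubmed_id": "abc"}
print(check_association(good))
print(check_association(bad))
```

Records that return an empty list would move on to annotation; anything else would go back to a curator.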

Creation of links to external data sources
Each entry in the GWAS Catalog has links to supporting data sources for convenience

Reference Source    Sample Link / URI
NCBI’s dbSNP        http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=1333049
Ensembl             http://useast.ensembl.org/Homo_sapiens/Variation/Explore?r=9:22125003-22126003;v=rs1333049;vdb=variation;vf=1004336
PubMed              http://www.ncbi.nlm.nih.gov/pubmed?Db=pubmed&DbFrom=snp&Cmd=Link&LinkName=snp_pubmed_cited&LinkReadableName=Pubmed+(SNP+Cited)&IdsFromResult=1333049
OMIM                http://omim.org/entry/611139#0000
* Note that currently these are links for use by people, rather than machine readable
linkages that would allow querying across multiple data sources
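As a small sketch of how such links can be generated from a SNP identifier, the functions below rebuild the dbSNP and PubMed sample URLs shown on this slide. The function names are hypothetical, and the URL patterns are simply copied from the samples; whether those services still accept them is not something this sketch verifies. The Ensembl and OMIM links need extra identifiers (region, variation feature ID, OMIM entry) and are left out.

```python
def rs_number(rs_id):
    """Strip the 'rs' prefix from a dbSNP reference ID such as 'rs1333049'."""
    return rs_id[2:] if rs_id.startswith("rs") else rs_id

def dbsnp_link(rs_id):
    """Build a dbSNP link following the sample URL on the slide."""
    return ("http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi"
            f"?rs={rs_number(rs_id)}")

def pubmed_link(rs_id):
    """Build a PubMed 'SNP cited' link following the sample URL on the slide."""
    return ("http://www.ncbi.nlm.nih.gov/pubmed?Db=pubmed&DbFrom=snp&Cmd=Link"
            "&LinkName=snp_pubmed_cited&LinkReadableName=Pubmed+(SNP+Cited)"
            f"&IdsFromResult={rs_number(rs_id)}")

print(dbsnp_link("rs1333049"))
print(pubmed_link("rs1333049"))
```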

Future: Opportunity for automating curation
 Machine learning and natural language processing (NLP) to categorize
into traits defined in the GWAS Catalog ontology
 Assign categorization confidence metrics to assist processing workflow
 Accuracy can be verified by humans based on highlighting and
annotations provided by NLP engine
[Diagram: NLP processing & confidence assignment → Workflow for human validation (where needed) → Knowledge Base]
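One way confidence metrics could drive the workflow is a simple threshold-based triage, sketched below. Everything here is an assumption for illustration: the threshold values, the queue names and the shape of a prediction are hypothetical, and no NLP model is included.

```python
def route(prediction, auto_accept=0.95, needs_review=0.60):
    """Decide which queue a categorized paper goes to.

    prediction: (trait, confidence) pair produced by some upstream classifier.
    The thresholds are placeholder values, not tuned figures.
    """
    trait, confidence = prediction
    if confidence >= auto_accept:
        return "accept"            # goes straight to the knowledge base
    if confidence >= needs_review:
        return "curator_review"    # a human verifies the highlighted annotation
    return "manual_curation"       # too uncertain; curate from scratch

for prediction in [("type 2 diabetes", 0.98), ("gout", 0.70), ("unknown", 0.20)]:
    print(prediction[0], "->", route(prediction))
```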

Data flow for GOCI Publisher
1. Start with the Oracle relational database created by the Curation process
2. The Java Publisher app converts the relational data into OWL individuals
3. The result is a knowledge base in the format of the GWAS Catalog
ontology, with 13,000 individuals and 43,000 axioms
4. OWL API and the HermiT reasoner create inferences from the GWAS
Catalog ontology
5. Since it takes > 10 hours to run the reasoner, the job is run in batch
and results are cached in RAM
6. Results are retrieved by the Diagram Browser with requests to the app
running on the Tomcat server
[Data flow diagram: Relational Database (Oracle) → Java Publisher job → Knowledge Base (OWL individuals / triples cached in RAM) → HermiT + OWL API → Knowledge Base with Inferred Triples (cached in RAM) → GWAS Diagram Browser; SPARQL Endpoint (future)]
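The conversion step performed by the Publisher can be sketched in a few lines. The real Publisher is a Java application built on the OWL API; the Python below only illustrates the idea of turning relational rows into triples. The namespace, the row column names, the class local names and the direction of is_about are assumptions, while the data property names come from the ontology listing earlier in the deck.

```python
GWAS = "http://example.org/gwas#"   # placeholder namespace, not the real ontology IRI

def row_to_triples(row):
    """Convert one relational row (a dict) into subject-predicate-object triples."""
    snp = f"{GWAS}snp/{row['rs_id']}"
    assoc = f"{GWAS}association/{row['id']}"
    return [
        (snp, "rdf:type", f"{GWAS}SingleNucleotidePolymorphism"),
        (snp, f"{GWAS}has_snp_reference_id", row["rs_id"]),
        (assoc, "rdf:type", f"{GWAS}TraitAssociation"),
        (assoc, f"{GWAS}is_about", snp),
        (assoc, f"{GWAS}has_p_value", row["p_value"]),
        (assoc, f"{GWAS}has_gwas_trait_name", row["trait"]),
    ]

# Placeholder row; the p-value and trait label are invented for illustration.
rows = [{"id": 1, "rs_id": "rs1333049", "p_value": 3e-8, "trait": "example trait"}]
knowledge_base = [triple for row in rows for triple in row_to_triples(row)]

for triple in knowledge_base:
    print(triple)
```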

The Publisher outputs OWL triples
…because this format is preferable to having the Diagram Browser query a
relational database. The benefits are:
 Additional inferences about SNP-trait associations
 More expressive queries
 Ability to detect errors or inconsistencies, as defined by the ontology
Using direct queries:
 Data has an unstructured catalog of traits in a fixed relational schema
 Queries can only use string pattern matching and must be run one at a
time. It’s not possible to query for related or inferred traits.
 Example queries:
• Can search on trait name containing “diabetes” and get results for
both type 1 and type 2 diabetes
• Comparison between gastric and esophageal cancers requires manually
combining results from two distinct searches

Using OWL knowledge base:
 Data is structured in semantic triples and reasoned over using an ontology
 Queries can include inferences and complex questions
 Example queries: *
• Find all SNPs that are associated with cancers located in the upper
digestive tract
• Find all SNPs located on chromosomes 5, 7, 15 and 21 that are
associated with diseases located in the urinary tract, with a p-value
smaller than 10^-8
* Source: Welter, D., Burdett, T., et al. (2012) Ontology-driven visualization of NHGRI GWAS data
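The difference between the two query styles can be shown with a toy example. The trait hierarchy, SNP labels and function names below are invented for illustration and do not come from EFO or the Catalog; the point is only that a hierarchy-aware query finds gastric and esophageal cancer under a shared parent, while a substring search on the trait name does not.

```python
# Toy data: the hierarchy and SNP labels are hypothetical.
PARENT = {
    "gastric cancer": "upper digestive tract cancer",
    "esophageal cancer": "upper digestive tract cancer",
    "upper digestive tract cancer": "cancer",
    "bladder cancer": "cancer",
}
ASSOCIATIONS = [
    ("snp_a", "gastric cancer"),
    ("snp_b", "esophageal cancer"),
    ("snp_c", "bladder cancer"),
]

def is_a(trait, ancestor):
    """Walk up the hierarchy to test whether trait falls under ancestor."""
    while trait is not None:
        if trait == ancestor:
            return True
        trait = PARENT.get(trait)
    return False

def snps_for(ancestor):
    """Hierarchy-aware query: SNPs associated with any trait under ancestor."""
    return sorted(snp for snp, trait in ASSOCIATIONS if is_a(trait, ancestor))

def snps_by_substring(text):
    """Relational-style query: plain string matching on the trait name."""
    return sorted(snp for snp, trait in ASSOCIATIONS if text in trait)

print(snps_for("upper digestive tract cancer"))
print(snps_by_substring("upper digestive"))
```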

HermiT OWL Reasoner
 HermiT is a reasoner for ontologies written using OWL (Web
Ontology Language). It is also available as a Protégé plugin.
 HermiT can determine whether the ontology for any given OWL
file is consistent and identify the relationship between classes
 HermiT passes all OWL 2 conformance tests for direct semantics
reasoners
 HermiT can be accessed from Java apps through the OWL API
 OWL API is a Java interface for creating, manipulating and
serializing OWL Ontologies
 It includes parsers and writers for RDF/XML, OWL/XML and Turtle, as well as
interfaces for working with reasoners

HermiT reasoner is run in a “forward chaining” style
 How it works: Rules are processed by reasoner once in batch
mode to generate and cache inferred triples
 Best when:
 Rules of inference and original data don’t change often
 There’s sufficient disk and RAM to store all the inferred triples
 Benefits: Retrieval queries run faster
 Limitation: When the rules or the explicit data set change, it may be
necessary to empty and reload the entire data store and re-run
the reasoner over it
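The batch approach described above can be sketched as a loop that applies rules until no new triples appear, then keeps the full result in memory. This is a generic illustration, not HermiT's internal algorithm: the two rules (part_of is transitive; located_in propagates along part_of) and the entity names are assumptions chosen to resemble the ontology's properties.

```python
def materialize(triples):
    """Apply the rules repeatedly until a fixed point is reached."""
    inferred = set(triples)
    changed = True
    while changed:
        new = set()
        for (a, p, b) in inferred:
            for (c, q, d) in inferred:
                if b != c:
                    continue
                # Rule 1 (assumed): part_of is transitive.
                if p == "part_of" and q == "part_of":
                    new.add((a, "part_of", d))
                # Rule 2 (assumed): located_in propagates along part_of.
                if p == "located_in" and q == "part_of":
                    new.add((a, "located_in", d))
        changed = not new <= inferred
        inferred |= new
    return inferred

# Hypothetical explicit data.
explicit = {
    ("snp_a", "located_in", "band_1"),
    ("band_1", "part_of", "arm_p"),
    ("arm_p", "part_of", "chromosome_9"),
}

store = materialize(explicit)   # run once in batch, then cache
print(("snp_a", "located_in", "chromosome_9") in store)
```

After materialization, a retrieval query is a plain set lookup, which is why queries run fast; the cost is that any change to the rules or explicit data requires calling materialize again.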

What is the Diagram Browser?
It’s a diagram that shows SNP-trait associations mapped to the SNPs’
chromosomal locations on the human karyotype. This project has made
significant improvements to it:
 Originally: The diagram was a static document, created manually
on a quarterly basis (by a medical illustrator)
 Now: Creation is fully automated with each study added and it is
interactive, so that it can be explored dynamically

Diagram Browser: Interactive functionality
Clicking on SNP-associated trait
category enables selection of
only bands with relevant traits
Zoom in and hover over
chromosomes in order to see
traits by chromosomal location
Clicking on diagram displays all
SNPs for a trait and band

How is the Diagram Browser implemented?
1. The Diagram Browser is a JavaScript app
rendered on the client browser
2. Interaction with the diagram, such as filter,
zoom or click, generates a query
3. The query request is sent via AJAX from
the web client to the Tomcat server
4. The server runs a Java program that
converts this request into an OWL class
expression which is processed by the
reasoner
5. The query result causes a string of SVG
(Scalable Vector Graphics) code to be
generated
6. This code is sent back to the web client via
AJAX
7. The JavaScript app renders the SVG
provided
[Diagram: Web Browser (JavaScript app), Web Server and Knowledge Base (using GWAS Catalog ontology), annotated with steps 1-7: trigger (filter, zoom, click), generate AJAX request, process request, generate SVG, render SVG code]
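Step 5, turning a query result into a string of SVG, can be sketched as below. The actual server component is a Java program; this Python version only illustrates the idea. The input shape (a list of dicts with a trait label and pixel coordinates) and the choice of one circle per association are assumptions.

```python
from xml.sax.saxutils import escape

def render_svg(associations, width=200, height=100):
    """Build an SVG string with one circle per SNP-trait association.

    associations: list of dicts with hypothetical keys 'trait', 'x', 'y'.
    """
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
    ]
    for item in associations:
        x, y = item["x"], item["y"]
        label = escape(item["trait"])   # avoid breaking the markup on & or <
        parts.append(
            f'<circle cx="{x}" cy="{y}" r="4"><title>{label}</title></circle>'
        )
    parts.append("</svg>")
    return "".join(parts)

svg = render_svg([{"trait": "type 2 diabetes & gout", "x": 10, "y": 20}])
print(svg)
```

The client-side JavaScript app would then insert the returned string into the page, as in step 7.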

Future scalability
Current Implementation:
 Will run into scalability issues as…
   Size of knowledge base grows
   Tools for querying the knowledge base become more sophisticated
Short Term Solution:
 Monitor system resources and increase where there are bottlenecks
 Limit queries to predefined ranges
 Precompute more inferences, based on query frequency
Long Term Solution:
 Migrate from the in-memory knowledge base to a persistent RDF
triplestore (such as Virtuoso)
 Implement SPARQL endpoint for queries instead of using OWL class
expressions
 Consider backward chaining reasoner if inferred data set gets too big
to cache

Future “backward chaining” option
 How it works: Reasoner is deployed between the GWAS
Diagram or SPARQL endpoint and data store, so that inferred
triples are generated in real time as part of query result set
 Best when:
 Rules of inference and original data change often
 Disk or RAM is insufficient to store all the inferred triples
 Benefits: No need to re-run reasoner when data or rules change
 Limitation: Query response may be slow
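A minimal sketch of the query-time approach: nothing is precomputed, and each query walks the explicit triples to derive what it needs. The rule used (located_in propagates along part_of) and the entity names are assumptions for illustration, not the behavior of any particular reasoner.

```python
# Hypothetical explicit data; no inferred triples are stored.
EXPLICIT = {
    ("snp_a", "located_in", "band_1"),
    ("band_1", "part_of", "arm_p"),
    ("arm_p", "part_of", "chromosome_9"),
}

def located_in(entity, place, facts=EXPLICIT):
    """Answer 'is entity located in place?' by inferring at request time."""
    # Start from the explicit locations of the entity.
    stack = [o for (s, p, o) in facts if s == entity and p == "located_in"]
    seen = set()
    while stack:
        current = stack.pop()
        if current == place:
            return True
        if current in seen:
            continue
        seen.add(current)
        # Assumed rule: follow part_of links upward.
        stack.extend(o for (s, p, o) in facts if s == current and p == "part_of")
    return False

print(located_in("snp_a", "chromosome_9"))
print(located_in("snp_a", "chromosome_5"))
```

Adding or removing a triple in EXPLICIT takes effect on the next query with no reload, at the price of repeating the traversal on every request.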

SPARQL Example: GWAS Central
 Although the NHGRI project currently doesn’t host a live SPARQL
endpoint, it could be set up to do so
 The GWAS Central project already does this. (It collates data from a
range of sources, including the published literature and collaborating
databases such as the NHGRI GWAS Catalog.)
SPARQL query page for GWAS Central:
http://fuseki.gwascentral.org/query.html

SPARQL Example: EBI’s Atlas
 EBI hosts the GWAS Diagram, but doesn’t provide a SPARQL endpoint
associated with that project
 It does however host SPARQL endpoints for multiple other projects,
such as Atlas
SPARQL query page and multiple examples for EBI’s Atlas project
(https://www.ebi.ac.uk/rdf/services/atlas/sparql)

GWAS Central: Towards Federation
 GWAS Central is a comprehensive resource for the comparison
and interrogation of multiple GWAS (genome-wide association
studies) projects
 Allows for storage, mining and display of summary-level
association data
 More comprehensive than other openly available projects with a
similar focus (i.e., millions vs. thousands of p-values)
 Provides user tools and interfaces not previously available from a
single resource
 Aggregates other related resources:
 GWAS Catalog
 OADGAR
 SNPedia
 GWAS Central platform is available for adoption by other
institutes, consortia, teams and countries
 Ideally, multiple implementations can be federated to allow searching across
multiple data sets

GWAS Central: Towards Federation (cont.)
Comparison of features for GWAS Central, GWAS Catalog,
OADGAR*, SNPedia
* Open Access Database of Genome-wide Association Results

GWAS Central: Towards Federation (cont.)
SPARQL can be used to express queries across diverse data sources, whether
the data is stored natively as RDF or viewed as RDF via middleware. This
specification defines the syntax and semantics of SPARQL 1.1 Federated
Query extension for executing queries distributed over different SPARQL
endpoints.
The SERVICE keyword extends SPARQL 1.1 to support queries that merge
data distributed across the Web.
Source: http://www.w3.org/TR/sparql11-federated-query/
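To make the SERVICE keyword concrete, the sketch below assembles a federated query as a string. Only the SPARQL 1.1 syntax (PREFIX, SELECT, SERVICE) is standard; the example.org predicates and the graph shape are hypothetical, since the deck does not describe the actual RDF schema of any of these endpoints. The query is only built, not sent, and the trait label is inserted without escaping, which real code would need to handle.

```python
def federated_query(remote_endpoint, trait_label):
    """Build a SPARQL 1.1 query that joins local data with a remote endpoint.

    All example.org predicates are placeholders for illustration.
    """
    return f"""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?snp ?trait ?remoteRecord WHERE {{
  ?association <http://example.org/gwas#is_about> ?snp ;
               <http://example.org/gwas#has_trait> ?trait .
  ?trait rdfs:label "{trait_label}" .
  SERVICE <{remote_endpoint}> {{
    ?remoteRecord <http://example.org/remote#annotated_with> ?trait .
  }}
}}
""".strip()

query = federated_query(
    "https://www.ebi.ac.uk/rdf/services/atlas/sparql", "type 2 diabetes"
)
print(query)
```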

Setting up GWAS Catalog project to query across data sets
Querying across databases using EFO: Since the
GWAS Catalog is based on EFO, it’s possible for a
query to include other biomedical databases annotated
for EFO: ArrayExpress, Ensembl, BioSamples, Pride,
etc.
Querying across databases using other ontologies:
Even if EFO is not used, cross-reference definition
citations allow querying across ontologies. The ID of
an external class is added as an annotation on the
relevant EFO term.
Example: Connective tissue is an EFO term that has been
mapped to terms in other ontologies, such as
BTO:0000421, the identifier for connective tissue in the
BRENDA Tissue Ontology (BTO).
