- David Portnoy
http://LinkedIn.com/in/DavidPortnoy
312.970.9740-
© Copyright 2012-2014 Datalytx, Inc.
Case study in Linked Data and Semantic Web
for the Human Genome domain
NHGRI’s
“GWAS Catalog” Project
National Human Genome Research Institute
About the Project
 Project Name: The National Human Genome Research Institute (NHGRI)
Catalog of Published Genome-Wide Association Studies (GWAS)
 Project Description: Manually curated collection of published GWAS
assaying at least 100,000 single-nucleotide polymorphisms (SNPs) and
all SNP-trait associations with P < 1 × 10^-5.
 In addition to SNP-trait association data, provides the “Diagram
Browser”, an interactive diagram of these associations mapped to the
SNPs’ chromosomal locations.
Stats as of Aug 2014:
 Almost 2,000 GWAS-related publications
 Over 14,000 SNPs
[Chart: Project Growth, 2005 to 2014: # of studies, # of traits, SNP-trait associations]
Website: http://www.genome.gov/gwastudies/
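The two inclusion criteria in the project description can be expressed as a simple filter. This is a minimal sketch, not the catalog's actual code: the `Association` type, the function name and the sample rows are hypothetical, and only the two thresholds come from the slide.

```python
from dataclasses import dataclass

MIN_SNPS_ASSAYED = 100_000  # study-level threshold from the project description
MAX_P_VALUE = 1e-5          # association-level threshold from the project description


@dataclass
class Association:
    snp_id: str     # dbSNP reference ID
    trait: str
    p_value: float


def eligible_associations(snps_assayed, associations):
    """Return the associations that satisfy both catalog criteria."""
    if snps_assayed < MIN_SNPS_ASSAYED:
        return []
    return [a for a in associations if a.p_value < MAX_P_VALUE]


# Illustrative values only; these are not real catalog entries.
sample = [
    Association("rs0000001", "example trait A", 3e-9),
    Association("rs0000002", "example trait B", 2e-4),
]
kept = eligible_associations(500_000, sample)
```

Here only the first sample association passes, because the second has a p-value above the threshold.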
Accessing the data
The GWAS Catalog can be accessed:
 Via the “Diagram Browser”
 Implemented as a dynamic visualization on the human karyotype
 With links to study publication, SNPs in Ensembl and ontology terms in
EFO (Experimental Factor Ontology)
 Via a web query search interface
 Provides tabular data for view or download
 Includes traits and links to study publication
 Via other GWAS-related data portals, such as
 Ensembl
 UCSC Genome Browser
 PheGenI
 GWAS Central
GWAS Components
The project is implemented in 3 main components:
1. Curation / Data loading pipeline
2. Data Publisher
3. Diagram Browser
[Diagram of components: Curation, SNP Batch Loader, PubMed Tracking, Publisher, Inference engine, Ontology Loading, Diagram Browser, Knowledge Base, Ontology Schema]
* The source code is managed under the GOCI (GWAS Ontology and Curation Infrastructure) project
Application Implementation
The following technologies have been used for this project
 Java for server-side processing
 Spring for MVC framework
 Maven for build automation and dependency management
 Apache Tomcat for web server
 Oracle for relational database
 HermiT for OWL reasoner
 JavaScript / AJAX for Diagram Browser interactivity
 SVG for rendering vector graphics in the Diagram Browser
 Apache POI for processing spreadsheets
 ColdFusion for generating records for each SNP
ONTOLOGY SCHEMAS
Ontology schema needed
Before the project could be implemented, an ontology had to be
designed for its components to operate. Working backwards:
 The Diagram Browser needs to display GWAS related data in
order to answer common GWAS use cases
 The Publisher needs to store data, such that it can be reasoned
over and served up to the Diagram Browser
 The Batch Loader needs to extract GWAS data from
publications in a consistent manner for later retrieval by the
Publisher
GWAS Catalog Ontology
The ontology was created by mapping each trait to one or
more terms in the Experimental Factor Ontology (EFO)
 At the start, 20% of GWAS traits were already in EFO
 SKOS was used to extend EFO for GWAS-specific views
 500 new terms were added to create the GWAS-EFO-SKOS ontology
Reasons for using EFO
 It’s actively developed
 It’s well suited to cover the diversity of GWAS traits
Metrics
Number of classes: 13,850
Number of individuals: 370
Number of properties: 50
Maximum depth: 15
Maximum # of children: 700
Average # of children: 7
Classes with a single child: 500
Classes with > 25 children: 100
Classes with no definition: 13,500
* Note: GWAS Catalog Ontology and GWAS Diagram OWL have been used interchangeably
GWAS Catalog Ontology (cont.)
 Purpose: Models the relationships between GWAS concepts of
“SNP”, “trait” and “chromosome” to the Diagram
 Location of ontology schemas used:
EFO schema: http://www.ebi.ac.uk/efo
GWAS-Diagram schema: http://www.ebi.ac.uk/efo/gwas-diagram
Class hierarchy:
 GWAS study
 chromosome
   chromosome 1..23
   chromosome X, Y
 cytogenetic band
 single nucleotide polymorphism
 trait association
 experimental factor
Object property hierarchy:
 has_part
 located_in
 location_of
 associated_with
 is_about
 has_about
 part_of
Data property hierarchy:
 has_name
 has_snp_reference_id
 has_bp_position
 has_length
 has_p_value
 has_pubmed_id
 has_author
 has_publication_date
 has_gwas_trait_name
* Source: OntologyConstants.java; http://www.ebi.ac.uk/fgpt/gwas/ontology/gwas-diagram.owl
Field definitions for the OWL schema
1. SNP reference ID: A single nucleotide polymorphism identifier, as assigned by the Single
Nucleotide Polymorphism Database (dbSNP).
2. Base pair position: The position, in base pairs, of a particular element on a genome.
3. Base pair length: The length, in base pairs, of any genomic element.
4. P-value: The probability of obtaining a test statistic at least as extreme as the one that
was actually observed.
5. PubMed ID: The publication ID of a scientific paper, as assigned by the PubMed
database.
6. Author: The primary author of a publication, usually expressed as surname followed by
initial(s).
7. Publication date: A date on which a given entity was published.
8. GWAS trait name: An arbitrary text label used to record a GWAS trait name that does
not specifically map to an ontology term. Usually this will be used to annotate instances
of Experimental Factor in order to retain information about a trait that was not defined in
the ontology.
9. Chromosomes: Chromosome 1-23; Chromosomes X & Y
10. Trait association: An association that can be asserted between two entities with a
degree of confidence expressed as a p-value.
11. GWAS Study: A study, described by a scientific publication, that identifies genome-wide
associations between single nucleotide polymorphisms and phenotypic traits or
disorders.
Using SKOS for defining the GWAS Catalog ontology
SKOS (Simple Knowledge Organization System) was used to create the
GWAS Catalog ontology by extending the EFO ontology, because:
 Requires less expertise, effort and cost, since it is less semantically
strict and expressive than OWL
 Can be used where the complexity of inferences is limited
 Is easy to use for extending other vocabularies
Introduction to SKOS
SKOS is an area of work developing specifications and
standards to support the use of knowledge organization
systems (KOS) such as thesauri, classification schemes,
subject heading systems and taxonomies within the
framework of the Semantic Web.
Sample dataset generated by OWL API is broken into…
Data Property Assertion
Class Assertion
Object Property Assertion
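As a rough illustration of the three assertion kinds named on this slide, the sketch below models each as a small Python record. The class and property names come from the GWAS Diagram schema listed earlier; the individual names (`snp_rs1333049`, `band_9p21.3`) and the grouping helper are hypothetical and are not the output of the OWL API.

```python
from typing import NamedTuple, Union


class ClassAssertion(NamedTuple):
    individual: str
    owl_class: str


class ObjectPropertyAssertion(NamedTuple):
    subject: str
    prop: str
    obj: str


class DataPropertyAssertion(NamedTuple):
    subject: str
    prop: str
    value: Union[str, int, float]


# Illustrative axioms; individual names are made up for this sketch.
axioms = [
    ClassAssertion("snp_rs1333049", "single nucleotide polymorphism"),
    ClassAssertion("band_9p21.3", "cytogenetic band"),
    ObjectPropertyAssertion("snp_rs1333049", "located_in", "band_9p21.3"),
    DataPropertyAssertion("snp_rs1333049", "has_snp_reference_id", "rs1333049"),
]


def group_by_kind(items):
    """Split a flat axiom list into the three assertion kinds."""
    groups = {}
    for axiom in items:
        groups.setdefault(type(axiom).__name__, []).append(axiom)
    return groups


groups = group_by_kind(axioms)
```

Class assertions say what kind of thing an individual is, object property assertions link two individuals, and data property assertions attach a literal value to an individual.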
Advantage of ontology for traits
Using a predefined ontology for describing traits
(rather than unstructured lists) allows:
1. More complex, compounded and context-
dependent traits to be described
 e.g. “Type 2 diabetes and gout”;
“Parkinson’s disease (interaction with
caffeine)”
2. Creation of semantically meaningful links
between traits
3. More complex and meaningful queries
Traits
• Phenotypes, e.g. hair & eye color
• Treatment responses, e.g.
response to antineoplastic agents
• Diseases, e.g. type 2 diabetes
• Assays, e.g. glycosylated
haemoglobin level
• Chemical/drug names, e.g. C-reactive protein
CURATION
The Curation process is partially automated
1. Run automated literature searches to capture eligible studies
2. Enter them into the system for review by curators
3. Triage and assign papers to curator
4. Curators use a web-based tracking and data entry system which allows multiple
users to search, annotate, verify and publish the Catalog data. There are two levels
of manual curation:
a. First, all data are extracted by one curator.
b. Some studies can have more than 1,000 significant SNPs, so curators create
spreadsheets of SNPs for batch loading into the DB (using the Apache POI Java API
for Microsoft Documents and a ColdFusion extension).
c. Then the data are double-checked for accuracy and consistency by another curator.
5. Run the automated pipeline that:
a. Checks multiple data sources for accuracy, completeness and consistency:
PubMed, dbSNP, and NCBI's Gene database
b. Adds genomic annotation such as SNP's base pair and cytogenetic location
[Flow diagram: Literature search → ID eligible studies → Entry into workflow tool → Triage & assignment → Manual curation (data entry, check accuracy) → Automated pipeline (check against PubMed, dbSNP, NCBI; add annotation)]
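The kind of format check a batch loader might run before SNP rows reach the database can be sketched as follows. This is an assumption-laden illustration: CSV text stands in for the curators' spreadsheet, the column names `snp_id` and `p_value` are hypothetical, and the real pipeline validates against PubMed, dbSNP and NCBI's Gene database rather than with local rules like these.

```python
import csv
import io
import re

RS_ID = re.compile(r"rs\d+")  # dbSNP reference IDs look like "rs" plus digits


def validate_rows(csv_text, max_p=1e-5):
    """Return (line_number, message) pairs for rows that fail basic checks."""
    errors = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        if not RS_ID.fullmatch(row["snp_id"]):
            errors.append((line_no, "bad SNP ID"))
        try:
            p_value = float(row["p_value"])
        except ValueError:
            errors.append((line_no, "p-value is not a number"))
            continue
        if not 0 < p_value < max_p:
            errors.append((line_no, "p-value outside catalog range"))
    return errors


# Illustrative rows only.
sample_sheet = "snp_id,p_value\nrs0000001,3e-9\nSNP_2,0.5\nrs0000003,abc\n"
problems = validate_rows(sample_sheet)
```

Catching malformed rows before loading keeps the second curator's review focused on content rather than formatting.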
Creation of links to external data sources
Each entry in the GWAS Catalog has links to supporting data sources for convenience.
Reference Source: Sample Link / URI
 NCBI’s dbSNP: http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=1333049
 Ensembl: http://useast.ensembl.org/Homo_sapiens/Variation/Explore?r=9:22125003-22126003;v=rs1333049;vdb=variation;vf=1004336
 PubMed: http://www.ncbi.nlm.nih.gov/pubmed?Db=pubmed&DbFrom=snp&Cmd=Link&LinkName=snp_pubmed_cited&LinkReadableName=Pubmed+(SNP+Cited)&IdsFromResult=1333049
 OMIM: http://omim.org/entry/611139#0000
* Note that currently these are links for use by people, rather than machine-readable
linkages that would allow querying across multiple data sources
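Links like the dbSNP sample can be generated from the SNP reference ID alone. The sketch below rebuilds only the dbSNP pattern shown on the slide; the helper names are hypothetical, and the other sources are left out because their sample URLs contain parameters that cannot be derived from the rs number.

```python
from urllib.parse import urlencode

# Base URL taken from the sample dbSNP link on the slide.
DBSNP_BASE = "http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi"


def rs_number(snp_id):
    """Strip the 'rs' prefix and check that the remainder is numeric."""
    if not snp_id.startswith("rs") or not snp_id[2:].isdigit():
        raise ValueError(f"not a dbSNP reference ID: {snp_id!r}")
    return snp_id[2:]


def dbsnp_link(snp_id):
    """Build the human-facing dbSNP page link for one SNP."""
    return f"{DBSNP_BASE}?{urlencode({'rs': rs_number(snp_id)})}"


link = dbsnp_link("rs1333049")
```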
Future: Opportunity for automating curation
 Machine learning and natural language processing (NLP) to categorize
studies into traits defined in the GWAS Catalog ontology
 Assign categorization confidence metrics to assist processing workflow
 Accuracy can be verified by humans based on highlighting and
annotations provided by NLP engine
[Flow diagram: NLP processing & confidence assignment → Workflow for human validation (where needed) → Knowledge Base]
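The confidence-based routing proposed here could look like the sketch below. Everything in it is hypothetical: the 0.9 threshold, the tuple layout and the sample predictions are placeholders, since the slides do not describe an implemented NLP engine.

```python
def triage(predictions, threshold=0.9):
    """Route each (paper_id, trait, confidence) prediction.

    High-confidence categorizations go straight through; the rest are
    queued for human validation. The threshold is an arbitrary example.
    """
    accepted, needs_review = [], []
    for paper_id, trait, confidence in predictions:
        target = accepted if confidence >= threshold else needs_review
        target.append((paper_id, trait))
    return accepted, needs_review


# Illustrative predictions only.
predictions = [
    ("paper-1", "type 2 diabetes", 0.97),
    ("paper-2", "hair color", 0.55),
]
accepted, needs_review = triage(predictions)
```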
PUBLISHER
Data flow for GOCI Publisher
1. Start with the Oracle relational database created by the Curation process
2. The Java Publisher app converts the relational data into OWL individuals
3. The result is a knowledge base in the format of the GWAS Catalog ontology, with 13,000 individuals and 43,000 axioms
4. The OWL API and the HermiT reasoner create inferences from the GWAS Catalog ontology
5. Since it takes > 10 hours to run the reasoner, the job is run in batch and the results are cached in RAM
6. Results are retrieved by the Diagram Browser through requests to the app running on the Tomcat server
[Diagram: Relational Database (Oracle) → Java Publisher job → Knowledge Base (OWL individuals / triples cached in RAM) → HermiT + OWL API → Knowledge Base with Inferred Triples (cached in RAM) → GWAS Diagram Browser; SPARQL Endpoint (future)]
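The relational-to-OWL conversion step can be sketched in miniature. This is not the GOCI Publisher: an in-memory SQLite table stands in for Oracle, the table and column names are hypothetical, and linking an association to its SNP with `is_about` is an assumption about how the schema's properties are wired together.

```python
import sqlite3

# In-memory SQLite stands in for the Oracle database; the schema is made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE association (snp_id TEXT, trait_name TEXT, p_value REAL)"
)
conn.execute(
    "INSERT INTO association VALUES ('rs0000001', 'example trait', 3e-9)"
)


def row_to_triples(row_id, snp_id, trait_name, p_value):
    """Turn one relational row into class and property assertions."""
    assoc = f"association_{row_id}"
    return [
        (snp_id, "rdf:type", "single nucleotide polymorphism"),
        (assoc, "rdf:type", "trait association"),
        (assoc, "is_about", snp_id),  # assumed wiring, see lead-in
        (assoc, "has_gwas_trait_name", trait_name),
        (assoc, "has_p_value", p_value),
    ]


knowledge_base = []
query = "SELECT rowid, snp_id, trait_name, p_value FROM association"
for row in conn.execute(query):
    knowledge_base.extend(row_to_triples(*row))
```

Each row becomes several triples, which is why a modest relational table can expand into tens of thousands of axioms.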
Publisher’s output is to OWL triples
…because this format is preferable to having the Diagram Browser query a
relational database. The benefits are:
 Additional inferences about SNP-trait associations
 More expressive queries
 Ability to detect errors or inconsistencies, as defined by the ontology
Using direct queries:
 Data has an unstructured catalog of traits in a fixed relational schema
 Queries can only use string pattern matching and must be run one at a time.
It’s not possible to query for related or inferred traits.
 Example queries:
• Can search on trait name containing “diabetes” and get results for both
type 1 and type 2 diabetes
• Comparison between gastric and esophageal cancers requires manually
combining results from two distinct searches
Using OWL knowledge base:
 Data is structured in semantic triples and reasoned over using an ontology
 Queries can include inferences and complex questions
 Example queries: *
• Find all SNPs that are associated with cancers located in the upper
digestive tract
• Find all SNPs located on chromosomes 5, 7, 15 and 21 that are associated
with diseases located in the urinary tract, with a p-value smaller than 10^-8
* Source: Welter, D., Burdett, T., et al. (2012) Ontology-driven visualization of NHGRI GWAS data
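The difference between the two query styles can be shown with a toy example. The trait hierarchy, SNP IDs and function names below are all hypothetical, and the sketch reduces the ontology to a plain child-to-parent map; the real knowledge base models anatomical location with richer class expressions.

```python
# Hypothetical mini hierarchy: each trait points to its parent class.
PARENT = {
    "gastric cancer": "upper digestive tract cancer",
    "esophageal cancer": "upper digestive tract cancer",
    "upper digestive tract cancer": "cancer",
    "bladder cancer": "cancer",
}

# Illustrative (SNP, trait) pairs; not real catalog data.
ASSOCIATIONS = [
    ("rs0000001", "gastric cancer"),
    ("rs0000002", "esophageal cancer"),
    ("rs0000003", "bladder cancer"),
]


def is_a(trait, ancestor):
    """Walk up the hierarchy to see whether trait falls under ancestor."""
    while trait is not None:
        if trait == ancestor:
            return True
        trait = PARENT.get(trait)
    return False


def snps_by_string(pattern):
    """Direct query: plain substring match on the trait name."""
    return sorted(s for s, t in ASSOCIATIONS if pattern in t)


def snps_by_class(ancestor):
    """Ontology query: match any trait classified under ancestor."""
    return sorted(s for s, t in ASSOCIATIONS if is_a(t, ancestor))
```

A substring search for "upper digestive tract" finds nothing because no trait name contains that text, while the class-based query returns both the gastric and the esophageal SNPs.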
HermiT OWL Reasoner
 HermiT is a reasoner for ontologies written using OWL (Web
Ontology Language). It is also available as a Protégé plugin.
 HermiT can determine whether the ontology in any given OWL
file is consistent and identify subsumption relationships between classes
 HermiT passes all OWL 2 conformance tests for direct semantics
reasoners
 HermiT can be accessed from Java apps through the OWL API
 OWL API is a Java interface for creating, manipulating and
serializing OWL ontologies
 It includes parsers and writers for RDF/XML, OWL/XML and Turtle, as well as
interfaces for working with reasoners
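HermiT reasoner is used in a “forward chaining” style
 How it works: The ontology is processed by the reasoner once in batch
mode to generate and cache inferred triples (HermiT itself is based on a
hypertableau calculus; “forward chaining” here describes computing the
inferences up front)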
 Best when:
 Rules of inference and original data don’t change often
 There’s sufficient disk and RAM to store all the inferred triples
 Benefits: Retrieval queries run faster
 Limitation: When the rules or the explicit data set change, it may be
necessary to empty and reload the entire data store and re-run
the reasoner over it
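The precompute-and-cache pattern can be illustrated with a tiny materialization loop. This is a sketch of the general idea only and not how HermiT works internally; the single rule used (something located in a part is also located in the whole) and the sample triples are hypothetical.

```python
def materialize(explicit):
    """Apply one rule until no new triples appear, then return everything.

    Rule: (x located_in y) and (y part_of z) implies (x located_in z).
    """
    store = set(explicit)
    while True:
        new = {
            (x, "located_in", z)
            for (x, p1, y) in store if p1 == "located_in"
            for (y2, p2, z) in store if p2 == "part_of" and y2 == y
        } - store
        if not new:
            return store
        store |= new


# Illustrative explicit triples.
explicit = {
    ("rs0000001", "located_in", "band_A"),
    ("band_A", "part_of", "chromosome 9"),
}
cached = materialize(explicit)
```

After the batch run, a lookup such as `("rs0000001", "located_in", "chromosome 9") in cached` is a plain set membership test, which is why retrieval is fast but any change to the data means re-running `materialize`.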
DIAGRAM BROWSER
What is the Diagram Browser?
It’s a diagram that shows SNP-trait associations mapped to the SNPs’
chromosomal locations on the human karyotype. This project has made
significant improvements to it:
 Originally: The diagram used to be a static document manually created
on a quarterly basis (by a medical illustrator)
 Now: Creation is fully automated with each study added and it is
interactive, so that it can be explored dynamically
Diagram Browser: Interactive functionality
 Clicking on a SNP-associated trait category enables selection of
only bands with relevant traits
 Zoom in and hover over chromosomes in order to see
traits by chromosomal location
 Clicking on the diagram displays all SNPs for a trait and band
How is the Diagram Browser implemented?
1. The Diagram Browser is a JavaScript app
rendered on the client browser
2. Interaction with the diagram, such as filter,
zoom or click, generates a query
3. The query request is sent via AJAX from
the web client to the Tomcat server
4. The server runs a Java program that
converts this request into an OWL class
expression which is processed by the
reasoner
5. The query result causes a string of SVG
(Scalable Vector Graphics) code to be
generated
6. This code is sent back to the web client via
AJAX
7. The JavaScript app renders the SVG
provided
[Diagram: Web Browser (JavaScript app), Web Server and Knowledge Base (using GWAS Catalog ontology), annotated with steps 1-7: trigger (filter, zoom, click), generate AJAX request, process request, generate SVG, render SVG code]
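Step 5, turning a query result into a string of SVG, can be sketched as below. The real server does this in Java; this Python version is only an illustration, and the result layout (x, y, trait label), the circle markers and the function name are all assumptions.

```python
from xml.sax.saxutils import escape


def render_svg(hits, width=200, height=100):
    """Build an SVG string with one marker per (x, y, trait_label) hit."""
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}">'
    ]
    for x, y, label in hits:
        # The <title> child gives browsers a hover tooltip for the marker.
        parts.append(
            f'<circle cx="{x}" cy="{y}" r="4">'
            f"<title>{escape(label)}</title></circle>"
        )
    parts.append("</svg>")
    return "".join(parts)


# Illustrative query result.
svg = render_svg([(10, 20, "trait A"), (30, 40, "trait B & C")])
```

Escaping the labels matters because trait names can contain characters such as `&` that would otherwise produce invalid SVG on the client.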
THE FUTURE
Future scalability
The project will run into scalability issues as…
 Size of knowledge base grows
 Tools for querying the knowledge base become more
sophisticated
Current implementation → Short-term solution → Long-term solution
Short-term solution:
 Monitor system resources and increase them where
there are bottlenecks
 Limit queries to predefined ranges
 Precompute more inferences, based on query
frequency
Long-term solution:
 Migrate from the current knowledge base to a persistent
RDF triplestore (such as Virtuoso)
 Implement a SPARQL endpoint for queries
instead of using OWL class expressions
 Consider a backward chaining reasoner if the
inferred data set gets too big to cache
Future “backward chaining” option
 How it works: Reasoner is deployed between the GWAS
Diagram or SPARQL endpoint and data store, so that inferred
triples are generated in real time as part of query result set
 Best when:
 Rules of inference and original data change often
 Disk or RAM is insufficient to store all the inferred triples
 Benefits: No need to re-run reasoner when data or rules change
 Limitation: Query response may be slow
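For contrast with the cached approach, a query-time version of the same kind of inference might look like this. It is a minimal sketch under assumptions: the rule, the triples and the function names are hypothetical, and the recursion assumes the `part_of` data contains no cycles.

```python
def part_of(part, whole, facts):
    """Follow part_of links on demand; nothing inferred is stored."""
    for s, p, o in facts:
        if s == part and p == "part_of":
            if o == whole or part_of(o, whole, facts):
                return True
    return False


def located_in(item, place, facts):
    """Answer 'is item located in place?' by reasoning at query time."""
    for s, p, o in facts:
        if s == item and p == "located_in":
            if o == place or part_of(o, place, facts):
                return True
    return False


# Illustrative explicit triples only; no inferred triples are kept.
facts = {
    ("rs0000001", "located_in", "band_A"),
    ("band_A", "part_of", "chromosome 9"),
}
answer = located_in("rs0000001", "chromosome 9", facts)
```

New facts are usable immediately with no batch job, at the cost of repeating the traversal on every query.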
SPARQL Example: GWAS Central
 Although the NHGRI project currently doesn’t host a live SPARQL
endpoint, it could be set up to do so
 The GWAS Central project already does this. (It collates data from a
range of sources, including the published literature and collaborating
databases such as the NHGRI GWAS Catalog.)
SPARQL query page for GWAS Central:
http://fuseki.gwascentral.org/query.html
SPARQL Example: EBI’s Atlas
 EBI hosts the GWAS Diagram, but doesn’t provide a SPARQL endpoint
associated with that project
 It does however host SPARQL endpoints for multiple other projects,
such as Atlas
SPARQL query page and multiple examples for EBI’s Atlas project
(https://www.ebi.ac.uk/rdf/services/atlas/sparql)
GWAS Central: Towards Federation
 GWAS Central is a comprehensive resource for the comparison
and interrogation of multiple GWAS (genome-wide association
studies) projects
 Allows for storage, mining and display of summary-level
association data
 More comprehensive than other openly available projects with a
similar focus (i.e., millions vs. thousands of p-values)
 Provides user tools and interfaces not previously available from a
single resource
 Aggregates other related resources:
 GWAS Catalog
 OADGAR
 SNPedia
 GWAS Central platform is available for adoption by other
institutes, consortia, teams and countries
 Ideally, multiple implementations can be federated to allow searching across
multiple data sets
GWAS Central: Towards Federation (cont.)
Comparison of features for GWAS Central, GWAS Catalog,
OADGAR*, SNPedia
* Open Access Database of Genome-wide Association Results
GWAS Central: Towards Federation (cont.)
SPARQL can be used to express queries across diverse data sources, whether
the data is stored natively as RDF or viewed as RDF via middleware. This
specification defines the syntax and semantics of SPARQL 1.1 Federated
Query extension for executing queries distributed over different SPARQL
endpoints.
The SERVICE keyword extends SPARQL 1.1 to support queries that merge
data distributed across the Web.
Source: http://www.w3.org/TR/sparql11-federated-query/
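A federated query using the SERVICE keyword could be assembled as in the sketch below. Only the Atlas endpoint URL comes from these slides; the `ex:` vocabulary, its property names and the EFO term IRI are placeholders, so the query shows the shape of federation rather than something that would return data as written.

```python
# Endpoint URL quoted from the EBI Atlas slide.
ATLAS_ENDPOINT = "https://www.ebi.ac.uk/rdf/services/atlas/sparql"


def federated_query(remote_endpoint, efo_term_iri):
    """Return SPARQL text that joins local data with a remote endpoint.

    The ex: properties are hypothetical stand-ins for real vocabularies.
    """
    return f"""
PREFIX ex: <http://example.org/gwas#>
SELECT ?snp ?remoteResource WHERE {{
  ?association ex:about_snp ?snp ;
               ex:has_trait <{efo_term_iri}> .
  SERVICE <{remote_endpoint}> {{
    ?remoteResource ex:annotated_with <{efo_term_iri}> .
  }}
}}
""".strip()


# Placeholder term IRI, not a real EFO identifier.
query_text = federated_query(ATLAS_ENDPOINT, "http://www.ebi.ac.uk/efo/EFO_0000000")
```

The patterns outside the SERVICE block are evaluated locally, the ones inside are sent to the remote endpoint, and the shared term IRI is what joins the two result sets.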
Setting up GWAS Catalog project to query across data sets
Querying across databases using EFO: Since the
GWAS Catalog is based on EFO, it’s possible for a
query to include other biomedical databases annotated
with EFO: ArrayExpress, Ensembl, BioSamples, Pride,
etc.
Querying across databases using other ontologies:
Even if EFO is not used, cross-reference annotations
(definition citations) allow querying across ontologies. The ID of
an external class is added as an annotation on the
relevant EFO term.
Example: Connective tissue is an EFO term that has been
mapped to terms in other ontologies, such as term
BTO:0000421, the identifier for connective tissue in the
Brenda ontology.
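The cross-reference idea can be reduced to a lookup from an external identifier back to the EFO term that carries it. The dictionary layout and function name below are hypothetical; the single mapping is the connective tissue example given above.

```python
# EFO term label -> external class IDs recorded as annotations on that term.
XREFS = {
    "connective tissue": ["BTO:0000421"],  # mapping stated on the slide
}


def efo_terms_for(external_id, xrefs=XREFS):
    """Find EFO terms annotated with the given external class ID."""
    return sorted(label for label, ids in xrefs.items() if external_id in ids)


matches = efo_terms_for("BTO:0000421")
```

A database annotated with the Brenda ontology could then be joined to GWAS Catalog data through the matching EFO term.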
THANKS!

Case Study in Linked Data and Semantic Web: Human Genome

  • 1.
    - David Portnoy http://LinkedIn.com/in/DavidPortnoy 312.970.9740- ©Copyright 2012-2014 Datalytx, Inc. Case study in Linked Data and Semantic Web for the Human Genome domain NHGRI’s “GWAS Catalog” Project National Human Genome Research Institute
  • 2.
     Project Growth: Aboutthe Project  Project Name: The National Human Genome Research Institute (NHGRI) Catalog of Published Genome-Wide Association Studies (GWAS) Catalog  Project Description: Manually curated collection of published GWAS assaying at least 100,000 single-nucleotide polymorphisms (SNPs) and all SNP-trait associations with P <1 × 10−5.  In addition to SNP-trait association data, provides the “Diagram Browser”, an interactive diagram of these associations mapped to the SNPs’ chromosomal locations. Stats as of Aug 2014:  Almost 2,000 GWAS related publications  Over 14,000 SNPs # of studies # of traits SNP-trait associations 2005 2014 Website: http://www.genome.gov/gwastudies/
  • 3.
    Accessing the data TheGWAS Catalog can be accessed via  Via the “Diagram Browser”  Implemented as a dynamic visualization on the human karyotype  With links to study publication, SNPs in Ensembl and ontology terms in EFO (Experimental Factor Ontology)  Via a web query search interface  Provides tabular data for view or download  Includes traits and links to study publication  Via other GWAS-related data portals, such as  Ensembl  UCSC Genome Browser  PheGenI  GWAS Central
  • 4.
    GWAS Components The projectis implemented in 3 main components: 1. Curation / Data loading pipeline 2. Data Publisher 3. Diagram Browser Curation SNP Batch Loader PubMed Tracking Publisher Inference engine Ontology Loading Diagram Browser Knowledge Base Ontology Schema * The source code is managed under the GOCI (GWAS Ontology and Curation Infrastructure) project
  • 5.
    Application Implementation The followingtechnologies have been used for this project  Java for server-side processing  Spring for MVC framework  Maven for build automation and dependency management  Apache Tomcat for web server  Oracle for relational database  HermiT for OWL reasoner  JavaScript / AJAX for Diagram Browser interactivity  SVG for rendering vector graphics in the Diagram Browser  Apache POI for processing spreadsheets  ColdFusion for generating records for each SNP * The source code is managed under the GOCI (GWAS Ontology and Curation Infrastructure) project
  • 6.
  • 7.
    Ontology schema needed Beforethe project could be implemented, an ontology had to be designed for its components to operate. Working backwards:  The Diagram Browser needs to display GWAS related data in order to answer common GWAS use cases  The Publisher needs to store data, such that it can be reasoned over and served up to the Diagram Browser  The Batch Loader needs to extract GWAS data from publications in a consistent manner for later retrieval by the Publisher
  • 8.
    GWAS Catalog Ontology Wascreated by mapping each trait to one or more terms in the Experimental Factor Ontology (EFO)  At the start, 20% of GWAS traits were already in EFO  SKOS was used to extend EFO for GWAS- specific views  500 new terms were added to create GWAS-EFO-SKOS ontology Reasons for using EFO  It’s actively developed  It’s well suited to cover diversity of GWAS traits Metrics Number of classes 13,850 Number of individuals 370 Number of properties 50 Maximum depth Maximum # of children Average # of children Classes with a single child Classes with > 25 children Classes with no definition 15 700 7 500 100 13,500 * Note: GWAS Catalog Ontology and GWAS Diagram OWL have been used interchangeably
  • 9.
    GWAS Catalog Ontology(cont.)  Purpose: Models the relationships between GWAS concepts of “SNP”, “trait” and “chromosome” to the Diagram  Location of ontology schemas used: EFO schema: http://www.ebi.ac.uk/efo GWAS-Diagram schema: http://www.ebi.ac.uk/efo/gwas-diagram Class Hierarchy Object property hierarchy Data property hierarchy GWAS study chromosome  chromosome 1..23,  Chromosome X, Y cytogenetic band single nucleotide polymorphism trait association experimental factor has_part located_in location_of associated_with is_about has_about part_of has_name has_snp_reference_id has_bp_position has_length has_p_value has_pubmed_id has_author has_publication_date has_gwas_trait_name * Source: OntologyConstants.java; http://www.ebi.ac.uk/fgpt/gwas/ontology/gwas-diagram.owl
  • 10.
    Field definitions forOWL schema definitions 1. SNP reference ID: A single nucleotide polymorpism identifier, as assigned by the Single Nucleotide Polymorphism Database (dbSNP). 2. Base pair position: The position, in base pairs, of a particular element on a genome 3. Base pair length: The length, in base pairs, of any genomic element. 4. P-value: The probability of obtaining a test statistic at least as extreme as the one that was actually observed. 5. PubMed ID: The publication ID of a scientific paper, as assigned by the PubMed database. 6. Author: The primary author of a publication, usually expressed as surname followed by initial(s). 7. Publication date: A date on which a given entity was published 8. GWAS trait name: An arbitrary text label used to add a text definition of a GWAS trait name that is does not specificially map. Usually this will be used to annotate instances of Experimental Factor in order to retain information about a trait that was not defined in the ontology. 9. Chromosomes: Chromosome 1-23; Chromosomes X & Y 10. Trait association: An association that can be asserted between two entities with a degree of confidence expressed as a p-value. 11. GWAS Study: A study, described by a scientific publication, that identifies genome wide associations between single nucleotide polymorphisms and phylogenetic traits or disorders.
  • 11.
    Using SKOS fordefining the GWAS Catalog ontology SKOS (Simple Knowledge Organization System) was used to create the GWAS Catalog ontology by extending the EFO ontology, because:  Requires less expertise, effort and cost, since it is less semantically strict and expressive than OWL  Can be used where the complexity of inferences is limited  Is easy to use for extending other vocabularies Introduction to SKOS SKOS is an area of work developing specifications and standards to support the use of knowledge organization systems (KOS) such as thesauri, classification schemes, subject heading systems and taxonomies within the framework of the Semantic Web.
  • 12.
    Sample dataset generatedby OWL API is broken into… Data Property Assertion Class Assertion Object Property Assertion
  • 13.
    Advantage of ontologyfor traits Using a predefined ontology for describing traits (rather than unstructured lists) allows: 1. More complex, compounded and context- dependent traits to be described  e.g. “Type 2 diabetes and gout”; “Parkinson’s disease (interaction with caffeine)” 2. Creation of semantically meaningful links between traits 3. More complex and meaningful queries Traits • Phenotypes, e.g. hair & eye color • Treatment responses, e.g. response to antineoplastic agents • Diseases, e.g. type 2 diabetes • Assays, e.g. glcyoslyated haemoglogin level • Chemical/drug names, e.g. C- reactive protein
  • 14.
  • 15.
    The Curation processis partially automated 1. Run automated literature searches to capture eligible studies 2. Enter them into the system for review by curators 3. Triage and assign papers to curator 4. Curators use use a web-based tracking and data entry system which allows multiple users to search, annotate, verify and publish the Catalog data. There are two levels of manual curation: a. First all data are extracted by one curator. b. Some studies could have more than 1000 significant SNPs. So curators create spreadsheets of SNPs for batch loading into the DB (using Apachi POI Java API for Microsoft Documents and a ColdFusion extension). c. Then data are double-checked for accuracy and consistency by another curator 5. Run the automated pipeline that: a. Checks multiple data sources for accuracy, completeness and consistency: PubMed, dbSNP, and NCBI's Gene database b. Adds genomic annotation such as SNP's base pair and cytogenetic location Literature search ID eligible studies Entry into workflow tool Triage & assignment Manual curation • Data entry • Check accuracy Automated pipeline • Check against PubMed, dbSNP, NCBI • Add annotation
  • 16.
    Creation of linksto external data sources Each entry in the GWAS Catalog has links to supporting data sources for convenience Reference Source Sample Link / URI NCBI’s dbSNP http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=1333049 Ensembl http://useast.ensembl.org/Homo_sapiens/Variation/Explore?r=9:22125003 -22126003;v=rs1333049;vdb=variation;vf=1004336 PubMed http://www.ncbi.nlm.nih.gov/pubmed?Db=pubmed&DbFrom=snp&Cmd=Li nk&LinkName=snp_pubmed_cited&LinkReadableName=Pubmed+(SNP+ Cited)&IdsFromResult=1333049 OMIM http://omim.org/entry/611139#0000 * Note that currently these are links for use by people, rather than machine readable linkages that would allow querying across multiple data sources
  • 17.
    Future: Opportunity forautomating curation  Machine learning and natural language processing (NLP) to categorize into traits defined in the GWAS Catalog ontology  Assign categorization confidence metrics to assist processing workflow  Accuracy can be verified by humans based on highlighting and annotations provided by NLP engine NLP processing & confidence assignment Workflow for human validation (where needed) Knowledge Base
  • 18.
  • 19.
    Data flow forGOCI Publisher Start with the Oracle relational database created by the Curation process Java Publisher app converts from the relational database into OWL individuals Knowledge base in format of GWAS Catalog ontology with 13,000 individuals and 43,000 axioms OWL API and HermiT reasoner create inferences from GWAS Catalog ontology Since it takes > 10 hours to run the reasoner, the job is run in batch and results are cached in RAM Results are retrieved by Diagram Browser with requests to app running on Tomcat server HermiT + OWL API SPARQL Endpoint (future) Knowledge Base (OWL individuals / triples cached in RAM) Relational Database (Oracle) Java Publisher job Knowledge Base with Inferred Triples (Cached in RAM) GWAS Diagram Browser
  • 20.
    Publisher’s output isto OWL triples …because this format is preferable to having the Diagram Browser query a relational database. The benefits are:  Additional inferences about SNP-trait associations  More expressive queries  Ability to detect errors or inconsistencies, as defined by the ontology Using direct queries Using OWL knowledge base Data has unstructured catalog of traits and in a fixed relational schema Data is structured in semantic triples and reasoned over using an ontology Queries can be only on string pattern matching and must be done one at a time. It’s not possible to query for related or inferred traits. Queries can include inferences and complex questions Example queries: • Can search on trait name containing “diabetes” and get results for both type 1 and type 2 diabetes • Comparison between gastric and esophageal cancers requires manually combining results from two distinct searches Example queries: * • Find all SNPs that are associated with cancers located in the upper digestive tract • Find all SNPs located on chromosomes 5, 7, 15 and 21 that are associated with diseases located in the urinary tract, with a p-value smaller than 10-8 * Source: Welter, D., Burdett, T., et al. (2012) Ontology-driven visualization of NHGRI GWAS data
  • 21.
    HermiT OWL Reasoner HermiT is a reasoner for ontologies written using OWL (Web Ontology Language). It is a Protégé plugin.  HermiT can determine whether the ontology for any given OWL file is consistent and identify the relationship between classes  HermiT passes all OWL 2 conformance tests for direct semantics reasoners  HermiT can be accessed from Java apps through the OWL API  OWL API is a Java interface for creating, manipulating and serializing OWL Ontologies  It includes parsers and writers for RDF, OWL and Turtle, as well as interface for working with reasoners
  • 22.
    HermiT reasoner isimplemented with “forward chaining”  How it works: Rules are processed by reasoner once in batch mode to generate and cache inferred triples  Best when:  Rules of inference and original data don’t change often  There’s sufficient disk and RAM to store all the inferred triples  Benefits: Retrieval queries run faster  Limitation: When rules or explicit data set changes, it may be necessary to empty and reload the entire data store and re-run the reasoner over it again
  • 23.
  • 24.
    What is theDiagram Browser? It’s a diagram that shows SNP-trait associations mapped to the SNPs’ chromosomal locations of the human karyotype. This project has made significant improvements to it:  Originally: The diagram used to be a static document manually created on a quarterly basis (by a medical illustrator)  Now: Creation is fully automated with each study added and it is interactive, so that it can be explored dynamically
  • 25.
    Diagram Browser: Interactivefunctionality Clicking on SNP-associated trait category enables selection of only bands with relevant traits Zoom in and hover over chromosomes in order to see traits by chromosomal location Clicking on diagram displays all SNPs for a trait and band
  • 26.
    How is theDiagram Browser implemented? 1. The Diagram Browser is a JavaScript app rendered on the client browser 2. Interaction with the diagram, such as filter, zoom or click, generates a query 3. The query request is sent via AJAX from the web client to the Tomcat server 4. The server runs a Java program that converts this request into an OWL class expression which is processed by the reasoner 5. The query result causes a string of SVG (Scalable Vector Graphics) code to be generated 6. This code is sent back to the web client via AJAX 7. The JavaScript app renders the SVG provided Web Browser JavaScript app Web Server Knowledge Base (using GWAS Catalog ontology) Generate AJAX request Render SVG code 1 Trigger: Filter, zoom, click 2 3 4 6 5 7 Process request Generate SVG
  • 27.
  • 28.
    Future scalability The project will run into scalability issues as…  The size of the knowledge base grows  Tools for querying the knowledge base become more sophisticated Solutions by horizon (current implementation, short term, long term):  Monitor system resources and increase them where there are bottlenecks  Limit queries to predefined ranges  Precompute more inferences, based on query frequency  Migrate from the knowledge base to a persistent RDF triplestore (such as Virtuoso)  Implement a SPARQL endpoint for queries instead of using OWL class expressions  Consider a backward-chaining reasoner if the inferred data set gets too big to cache
  • 29.
    Future “backward chaining” option  How it works: The reasoner is deployed between the GWAS Diagram or SPARQL endpoint and the data store, so that inferred triples are generated in real time as part of the query result set  Best when:  Rules of inference and original data change often  Disk or RAM is insufficient to store all the inferred triples  Benefits: No need to re-run the reasoner when data or rules change  Limitation: Query response may be slow
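For contrast, a minimal sketch of the backward-chaining idea, again an illustration rather than any particular reasoner's implementation: nothing is precomputed, and the subclass hierarchy is walked only when a query asks for it. The predicate names and sample triples are assumptions made up for this example.

```python
def is_subclass(sub, sup, triples, seen=None):
    """Answer 'is sub a subclass of sup?' by walking the hierarchy at query time."""
    seen = set() if seen is None else seen
    if sub == sup:
        return True
    if sub in seen:  # already explored, or a cycle
        return False
    seen.add(sub)
    return any(
        is_subclass(o, sup, triples, seen)
        for s, p, o in triples
        if s == sub and p == "subClassOf"
    )


def instances_of(cls, triples):
    # Inference happens here, per query, instead of in a batch step
    return {s for s, p, o in triples if p == "type" and is_subclass(o, cls, triples)}


# Hypothetical explicit data (made up for illustration)
triples = {
    ("type_2_diabetes", "subClassOf", "diabetes"),
    ("diabetes", "subClassOf", "metabolic_disease"),
    ("assoc_1", "type", "type_2_diabetes"),
}

before = instances_of("metabolic_disease", triples)

# New data is visible on the next query; there is no cache to rebuild
triples.add(("assoc_2", "type", "diabetes"))
after = instances_of("metabolic_disease", triples)
```

Each call repeats the hierarchy walk, which is the source of the slower query response noted on the slide.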
  • 30.
    SPARQL Example: GWAS Central  Although the NHGRI project doesn’t currently host a live SPARQL endpoint, it could be set up to do so  The GWAS Central project already does this. (It collates data from a range of sources, including the published literature and collaborating databases such as the NHGRI GWAS Catalog.) SPARQL query page for GWAS Central: http://fuseki.gwascentral.org/query.html
  • 31.
    SPARQL Example: EBI’s Atlas  EBI hosts the GWAS Diagram, but doesn’t provide a SPARQL endpoint associated with that project  It does, however, host SPARQL endpoints for multiple other projects, such as Atlas. SPARQL query page and multiple examples for EBI’s Atlas project: https://www.ebi.ac.uk/rdf/services/atlas/sparql
  • 32.
    GWAS Central: Towards Federation  GWAS Central is a comprehensive resource for the comparison and interrogation of multiple GWAS (genome-wide association study) projects  Allows for storage, mining and display of summary-level association data  More comprehensive than other openly available projects with a similar focus (i.e., millions vs. thousands of P-values)  Provides user tools and interfaces not previously available from a single resource  Aggregates other related resources:  GWAS Catalog  OADGAR  SNPedia  The GWAS Central platform is available for adoption by other institutes, consortia, teams and countries  Ideally, multiple implementations can be federated to allow searching across multiple data sets
  • 33.
    GWAS Central: Towards Federation (cont.) Comparison of features for GWAS Central, GWAS Catalog, OADGAR* and SNPedia (* Open Access Database of Genome-wide Association Results)
  • 34.
    GWAS Central: Towards Federation (cont.) SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware. This specification defines the syntax and semantics of SPARQL 1.1 Federated Query extension for executing queries distributed over different SPARQL endpoints. The SERVICE keyword extends SPARQL 1.1 to support queries that merge data distributed across the Web. Source: http://www.w3.org/TR/sparql11-federated-query/
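To show the shape of a query that uses the SERVICE keyword, here is a small sketch that only builds the query text; it does not contact any endpoint. The helper name `federated_query` and the two graph patterns are hypothetical, and the patterns are not taken from the Atlas or GWAS Central schemas. The endpoint URL is the EBI Atlas one mentioned on the earlier slide.

```python
def federated_query(remote_endpoint, local_pattern, remote_pattern, limit=10):
    """Build a SPARQL 1.1 query whose inner block is sent to a remote endpoint."""
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT * WHERE {\n"
        f"  {local_pattern}\n"
        f"  SERVICE <{remote_endpoint}> {{\n"
        f"    {remote_pattern}\n"
        "  }\n"
        f"}} LIMIT {limit}"
    )


query = federated_query(
    remote_endpoint="https://www.ebi.ac.uk/rdf/services/atlas/sparql",
    # Hypothetical patterns: match terms locally, then look them up remotely
    local_pattern="?term rdfs:label ?label .",
    remote_pattern="?resource ?property ?term .",
)
print(query)
```

The pattern outside the SERVICE block is evaluated by the endpoint that receives the query, and the pattern inside it by the remote endpoint; the shared variable `?term` is what joins the two result sets.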
  • 35.
    Setting up the GWAS Catalog project to query across data sets Querying across databases using EFO: Since the GWAS Catalog is based on EFO, a query can also include other biomedical databases annotated with EFO: ArrayExpress, Ensembl, BioSamples, PRIDE, etc. Querying across databases using other ontologies: Even where EFO is not used, cross-reference definition citations allow querying across ontologies. The ID of an external class is added as an annotation on the relevant EFO term. Example: Connective tissue is an EFO term that has been mapped to terms in other ontologies, such as BTO:0000421, the identifier for connective tissue in the BRENDA Tissue Ontology (BTO).
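A rough sketch of how such cross-reference annotations could be used to move between ontologies. The lookup table is a hypothetical in-memory stand-in for annotations read from EFO; the key `EFO:connective_tissue` is a placeholder rather than a real EFO identifier, while `BTO:0000421` is the identifier given in the example above.

```python
# Hypothetical stand-in for cross-reference annotations on EFO terms.
# "EFO:connective_tissue" is a placeholder key, not a real EFO ID.
EFO_XREFS = {
    "EFO:connective_tissue": {
        "label": "connective tissue",
        "xrefs": ["BTO:0000421"],
    },
}


def external_ids(efo_id, prefix):
    """Return the external IDs annotated on an EFO term for one ontology prefix."""
    term = EFO_XREFS.get(efo_id, {})
    return [x for x in term.get("xrefs", []) if x.startswith(prefix + ":")]


def efo_terms_for(external_id):
    """Reverse lookup: which EFO terms carry this external ID as an annotation?"""
    return [efo for efo, term in EFO_XREFS.items() if external_id in term["xrefs"]]


bto_ids = external_ids("EFO:connective_tissue", "BTO")
efo_ids = efo_terms_for("BTO:0000421")
```

A query against a database annotated with BTO could then be rewritten by substituting the IDs returned by `external_ids` for the EFO term.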
  • 36.