On the frontier of genotype-2-phenotype
data integration
Melissa Haendel, PhD
March 22, 2016
AMIA TBI
@monarchinit @ontowonka
haendel@ohsu.edu
Filling the G2P knowledge gap from other
organisms
Other= rat, fly, worm, mouse, zebrafish
monarchinitiative.org
Ulcerated
paws
Palmoplantar
hyperkeratosis
Thick hand skin
Challenge: Each database uses their own
vocabulary/ontology
MP
HP
MGI
HPOA
Challenge: Each database uses their own
phenotype vocabulary/ontology
ZFA
MP
DPO
WPO
HP
OMIA
VT
FYPO
APO
SNO
MED
…
…
…
WB
PB
FB
OMIA
MGI
RGD
ZFIN
SGD
HPOA
EHR
IMPC
OMIM
…
QTLdb
monarchinitiative.org
Can we help machines understand
phenotype terms?
“Palmoplantar
hyperkeratosis”
Human phenotype
I have
absolutely no
idea what that
means
The Human Phenotype Ontology
Hyposmia
Abnormality of
globe location
eyeball of
camera-type eye
sensory
perception of smell
Abnormal eye
morphology
Motor neuron
atrophyDeeply set eyes
motor neuronCL
34571 annotations in
22 species
157534 phenotype
annotations
2150 phenotype
annotations
monarchinitiative.org
Genotype-phenotype integration
One source
Two sources
3 or more
9%
91% of our 2.2 Million G2P associations required
integrating 2 or more data sources
(this number does not even include orthology (Panther) or any ontologies!)
91%
Diagnosing an undiagnosed disease
www.owlsim.org
Phenotype Exchange Standard
Mechanistic discovery
Improved
searchability
Integrated Data Landscape
Tool/algorithm creation
Cohort
identification
Patient
registries
Databases,
Web tools,
AlgorithmsPhenopacket
Registry
JournalsDiagnostic
screening
programs Clinical
trials
Phenopacket
flow
Primarybenefits
tostakeholders
Patients/
Families
Physicians
Patient
matchmaking
Diagnosis speed/accuracy
Organismal
biologist
www.phenopackets.org
What’s in a Phenopacket?
Ontology-based phenotypic descriptions for:
 Human patients, model organisms, or any organism
 Groupings of human patients or organisms
What does it include?
 age of patient or organism
 sex of patient or organism
 disease (if named)
 age of onset of disease
 Positive and negative phenotype associations
 Reference to Genes, variants, or collections of variants
 Reference to environmental factors
Multiple formats: TSV, JSON, YAML, JSON
Validation tools
Uses standardized publication citation mechanism for data sharing
brca-website.cloudapp.net
 13501 variants from ENIGMA, ClinVar, LOVD, exLOVD, BIC
 Merged by genomic coordinate and alternate allele string
Problems with evidence and provenance
of G2P Associations
PROBLEMS:
Variants have different pathogenicity calls due to annotation
inconsistency AND different experimental evidence
Incomplete, not computable, and frequently conflated
Annotations are to different aspects of the genotype: allele, variant,
gene, transcript, etc.
A computable model would enable:
 context to evaluate credibility/confidence
 support filtering and analysis of data
 detailed history for attribution
Building a computable model for ACMG
guidelines
http://brcaexchange.org/
Provenance Evidence Claim
- Materials & methods
- Agent(s) of evidence
- Agent(s) of claim
- Time and place
- Data (eg: images, sequences)
- Evidence codes
- Publications
- Confidence (p-val, z-score)
- Summary figures
- Conclusions from previous studies
- Domain expert’s knowledge
Causal relationships,
hypothesized relationships,
correlations etc.
https://github.com/monarch-initiative/SEPIO-ontology
Summary
 Ontologies can be used to perform deep phenotyping
integration across species
 An exchange standard is needed to facilitate distributed
phenotype data sharing
 A computable G2P evidence model can aid variant
interpretation
Acknowledgements
Lawrence Berkeley
Chris Mungall
Nicole Washington
Suzanna Lewis
Jeremy Nguyen
Seth Carbon
Charité
Peter Robinson
Sebastian Kohler
U of Pittsburgh
Harry Hochheiser
Mike Davis
Joe Zhou
OHSU
Nicole Vasilesky
Matt Brush
Kent Shefchek
Julie McMurry
Tom Conlin
Genomics England
Damian Smedley
Jules Jacobson
UCSC
David Haussler
Benedict Paten
Mark Diekhans
Melissa Cline
Garvan
Tudor Groza
Craig McNamara
Edwin Zhang
FUNDING: NIH Office of Director: 1R24OD011883; NIH-UDP:
HHSN268201300036C, HHSN268201400093P;
NCINCI/Leidos #15X143, BD2K U54HG007990-S2 (Haussler)
& BD2K PA-15-144-U01 (Kesselman)

On the frontier of genotype-2-phenotype data integration

  • 1.
    On the frontierof genotype-2-phenotype data integration Melissa Haendel, PhD March 22, 2016 AMIA TBI @monarchinit @ontowonka haendel@ohsu.edu
  • 2.
    Filling the G2Pknowledge gap from other organisms Other= rat, fly, worm, mouse, zebrafish
  • 3.
  • 4.
    Challenge: Each databaseuses their own vocabulary/ontology MP HP MGI HPOA
  • 5.
    Challenge: Each databaseuses their own phenotype vocabulary/ontology ZFA MP DPO WPO HP OMIA VT FYPO APO SNO MED … … … WB PB FB OMIA MGI RGD ZFIN SGD HPOA EHR IMPC OMIM … QTLdb
  • 6.
    monarchinitiative.org Can we helpmachines understand phenotype terms? “Palmoplantar hyperkeratosis” Human phenotype I have absolutely no idea what that means
  • 7.
    The Human PhenotypeOntology Hyposmia Abnormality of globe location eyeball of camera-type eye sensory perception of smell Abnormal eye morphology Motor neuron atrophyDeeply set eyes motor neuronCL 34571 annotations in 22 species 157534 phenotype annotations 2150 phenotype annotations
  • 9.
    monarchinitiative.org Genotype-phenotype integration One source Twosources 3 or more 9% 91% of our 2.2 Million G2P associations required integrating 2 or more data sources (this number does not even include orthology (Panther) or any ontologies!) 91%
  • 10.
    Diagnosing an undiagnoseddisease www.owlsim.org
  • 11.
    Phenotype Exchange Standard Mechanisticdiscovery Improved searchability Integrated Data Landscape Tool/algorithm creation Cohort identification Patient registries Databases, Web tools, AlgorithmsPhenopacket Registry JournalsDiagnostic screening programs Clinical trials Phenopacket flow Primarybenefits tostakeholders Patients/ Families Physicians Patient matchmaking Diagnosis speed/accuracy Organismal biologist www.phenopackets.org
  • 12.
    What’s in aPhenopacket? Ontology-based phenotypic descriptions for:  Human patients, model organisms, or any organism  Groupings of human patients or organisms What does it include?  age of patient or organism  sex of patient or organism  disease (if named)  age of onset of disease  Positive and negative phenotype associations  Reference to Genes, variants, or collections of variants  Reference to environmental factors Multiple formats: TSV, JSON, YAML, JSON Validation tools Uses standardized publication citation mechanism for data sharing
  • 13.
    brca-website.cloudapp.net  13501 variantsfrom ENIGMA, ClinVar, LOVD, exLOVD, BIC  Merged by genomic coordinate and alternate allele string
  • 14.
    Problems with evidenceand provenance of G2P Associations PROBLEMS: Variants have different pathogenicity calls due to annotation inconsistency AND different experimental evidence Incomplete, not computable, and frequently conflated Annotations are to different aspects of the genotype: allele, variant, gene, transcript, etc. A computable model would enable:  context to evaluate credibility/confidence  support filtering and analysis of data  detailed history for attribution
  • 15.
    Building a computablemodel for ACMG guidelines http://brcaexchange.org/ Provenance Evidence Claim - Materials & methods - Agent(s) of evidence - Agent(s) of claim - Time and place - Data (eg: images, sequences) - Evidence codes - Publications - Confidence (p-val, z-score) - Summary figures - Conclusions from previous studies - Domain expert’s knowledge Causal relationships, hypothesized relationships, correlations etc. https://github.com/monarch-initiative/SEPIO-ontology
  • 16.
    Summary  Ontologies canbe used to perform deep phenotyping integration across species  An exchange standard is needed to facilitate distributed phenotype data sharing  A computable G2P evidence model can aid variant interpretation
  • 17.
    Acknowledgements Lawrence Berkeley Chris Mungall NicoleWashington Suzanna Lewis Jeremy Nguyen Seth Carbon Charité Peter Robinson Sebastian Kohler U of Pittsburgh Harry Hochheiser Mike Davis Joe Zhou OHSU Nicole Vasilesky Matt Brush Kent Shefchek Julie McMurry Tom Conlin Genomics England Damian Smedley Jules Jacobson UCSC David Haussler Benedict Paten Mark Diekhans Melissa Cline Garvan Tudor Groza Craig McNamara Edwin Zhang FUNDING: NIH Office of Director: 1R24OD011883; NIH-UDP: HHSN268201300036C, HHSN268201400093P; NCINCI/Leidos #15X143, BD2K U54HG007990-S2 (Haussler) & BD2K PA-15-144-U01 (Kesselman)

Editor's Notes

  • #5 2 issues: database integration, vocabulary integration
  • #6 Multiple databases
  • #7 Our approach is to try and get the machine to understand the terms so that it can assist us intelligently.
  • #8 Represent organism as a biological subject Represent diseases/genotypes as collections of nodes in the graph 3. Interoperable with other bioinformatics resources and leverage modern semantic standards
  • #9 If we include bridging ontologies, we can unify diseases across sources AND phenotypes across sources and organisms.
  • #16 To support downstream hypothesis testing and evaluation, “trust”, we need a computable model for evidence.
  • #18 There are a lot of people who have contributed to this work over many years. 