Keynote presented at the Ninth International Biocuration Conference, Geneva, Switzerland, April 10-14, 2016
The health of an individual organism results from complex interplay between its genes and environment. Although great strides have been made in standardizing the representation of genetic information for exchange, there are no comparable standards to represent phenotypes (e.g. patient disease features, variation across biodiversity) or environmental factors that may influence such phenotypic outcomes. Phenotypic features of individual organisms are currently described in diverse places and in diverse formats: publications, databases, health records, registries, clinical trials, museum collections, and even social media. In these contexts, biocuration has been pivotal to obtaining a computable representation, but is still deeply challenged by the lack of standardization, accessibility, persistence, and computability among these contexts. How can we help all phenotype data creators contribute to this biocuration effort when the data is so distributed across so many communities, sources, and scales? How can we track contributions and provide proper attribution? How can we leverage phenotypic data from the model organism or biodiversity communities to help diagnose disease or determine evolutionary relatedness? Biocurators unite in a new community effort to address these challenges.
4. Genes + Environment = Phenotypes
Biology central dogma
Standards for encoding and exchanging data must be up to these challenges.
This is where you come in.
@ontowonka
5. Genes + Environment = Phenotypes
Computable encodings are essential
Base pairs
Variant notation (e.g. HGVS)
Human Phenotype Ontology
Mammalian Phenotype Ontology
Medical procedure coding
Environment Ontology
6. Genes Environment Phenotypes
VCF PXF GFF
Standard exchange formats exist for genes …
but for phenotypes? Environment?
BED
7. The relationships too must be captured
It is not just the bits…
G-P or D (disease)
causes
contributes to
is risk factor for
protects against
correlates with
is marker for
modulates
involved in
increases susceptibility to
G-G (kind of)
regulates
negatively regulates (inhibits)
positively regulates (activates)
directly regulates
interacts with
co-localizes with
co-expressed with
P/D - P/D
part of
results in
co-occurs with
correlates with
hallmark of (P->D)
E-P
contributes to (E->P)
influences (E->P)
exacerbates (E->P)
manifest in (P->E)
G-E (kind of)
expressed in
expressed during
contains
inactivated by
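The relationship types listed above can be treated as a small controlled vocabulary, keyed by the kinds of things each relation connects. The following is a minimal Python sketch of that idea, not a published schema: the names `Kind`, `ALLOWED`, and `Association` are hypothetical, and only a subset of the slide's relations is included.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    GENE = "G"
    PHENOTYPE = "P"      # phenotype or disease (P/D on the slide)
    ENVIRONMENT = "E"


# Subset of the slide's relations, keyed by (subject kind, object kind).
ALLOWED = {
    (Kind.GENE, Kind.PHENOTYPE): {
        "causes", "contributes to", "is risk factor for",
        "protects against", "is marker for",
    },
    (Kind.GENE, Kind.GENE): {"regulates", "interacts with", "co-expressed with"},
    (Kind.PHENOTYPE, Kind.PHENOTYPE): {"part of", "results in", "co-occurs with"},
    (Kind.ENVIRONMENT, Kind.PHENOTYPE): {"contributes to", "influences", "exacerbates"},
    (Kind.GENE, Kind.ENVIRONMENT): {"expressed in", "inactivated by"},
}


@dataclass(frozen=True)
class Association:
    """One typed link; hypothetical structure for illustration only."""
    subject: str
    subject_kind: Kind
    relation: str
    obj: str
    obj_kind: Kind

    def __post_init__(self):
        # Reject relations that are not defined for this pair of kinds.
        allowed = ALLOWED.get((self.subject_kind, self.obj_kind), set())
        if self.relation not in allowed:
            raise ValueError(
                f"{self.relation!r} is not defined for "
                f"{self.subject_kind.value}-{self.obj_kind.value} links"
            )


# Illustrative pairing only, borrowed from the genotype/phenotype slide.
edge = Association("ALMS1", Kind.GENE, "contributes to", "obesity", Kind.PHENOTYPE)
```

Checking the relation against the pair of kinds at construction time keeps "expressed in" from being recorded between two genes, which is the sort of error a free-text field would let through.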
8. The genome is sequenced, but…
…we still don’t know very much about what it does
3,435 OMIM Mendelian diseases with no known genetic basis
66,396 ClinVar variants with no known pathogenicity
9. Why we need all the organisms
Model data can provide up to
80% phenotypic coverage of the human coding genome
11. B6.Cg-Alms1foz/foz/J
increased weight,
adipose tissue volume,
glucose homeostasis altered
ALMS1(NM_015120.4)
[c.10775delC] + [-]
GENOTYPE
PHENOTYPE
obesity,
diabetes mellitus,
insulin resistance
increased food intake,
hyperglycemia,
insulin resistance
kcnj11c14/c14; insrt143/+(AB)
Can we use model phenotypes to
inform genetic mechanisms of disease?
???
12. CC2.0 European Southern Observatory
https://www.flickr.com/photos/esoastronomy/6923443595
Crossing the language barrier
13. Ulcerated paws
Palmoplantar hyperkeratosis
Thick hand skin
Image credits:
"HandsEBS" by James Heilman, MD - Own work. Licensed under CC BY-SA 3.0 via Commons –
https://commons.wikimedia.org/wiki/File:HandsEBS.JPG#/media/File:HandsEBS.JPG
http://www.guinealynx.info/pododermatitis.html
16. Challenge: Each database uses its own phenotype vocabulary/ontology
ZFA
MP
DPO
WPO
HP
OMIA
VT
FYPO
APO
SNOMED
…
…
…
WB
PB
FB
OMIA
MGI
RGD
ZFIN
SGD
HPOA
EHR
IMPC
OMIM
…
QTLdb
17. Can we help machines understand phenotype terms?
“Palmoplantar hyperkeratosis”
Human phenotype
I have absolutely no idea what that means
18. Decomposition of complex concepts allows interoperability
Mungall, C. J., Gkoutos, G., Smith, C., Haendel, M., Lewis, S., & Ashburner,
M. (2010). Integrating phenotype ontologies across multiple species.
Genome Biology, 11(1), R2. doi:10.1186/gb-2010-11-1-r2
“Palmoplantar hyperkeratosis”
increased
Stratum corneum layer of skin
=
Human phenotype
PATO
Uberon
Species neutral ontologies, homologous concepts
Autopod
keratinization
GO
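The decomposition on this slide can be sketched as a set of species-neutral building blocks, each tagged with the ontology it comes from. The Python below is a minimal illustration under stated assumptions: the blocks for "Palmoplantar hyperkeratosis" are the four labels shown on the slide, while the decomposition of "ulcerated paws" is hypothetical and invented here only to show a comparison. Flat set overlap stands in for the ontology reasoning and semantic similarity a real system would use.

```python
# Building blocks named on the slide, as (ontology, label) pairs.
PALMOPLANTAR_HYPERKERATOSIS = frozenset({
    ("PATO", "increased"),
    ("Uberon", "stratum corneum layer of skin"),
    ("Uberon", "autopod"),
    ("GO", "keratinization"),
})

# HYPOTHETICAL decomposition of a model-organism term, for illustration only.
ULCERATED_PAWS = frozenset({
    ("PATO", "ulcerated"),
    ("Uberon", "skin of body"),
    ("Uberon", "autopod"),
})


def shared_blocks(a, b):
    """Return the building blocks two decomposed phenotypes have in common."""
    return a & b


def jaccard(a, b):
    """Crude overlap score: shared blocks divided by all distinct blocks."""
    return len(a & b) / len(a | b)


common = shared_blocks(PALMOPLANTAR_HYPERKERATOSIS, ULCERATED_PAWS)
score = jaccard(PALMOPLANTAR_HYPERKERATOSIS, ULCERATED_PAWS)
```

Even though the two term labels share no words, the shared "autopod" block gives a machine something to match on, which is the point of decomposing into homologous, species-neutral concepts.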
23. The prevailing clinical diagnosis pipelines leverage only a tiny fraction of the available data
PATIENT EXOME / GENOME
PATIENT PHENOTYPES
PATIENT ENVIRONMENT
PUBLIC GENOMIC DATA
PUBLIC PHENOTYPE, DISEASE DATA
PUBLIC ENVIRONMENT, DISEASE DATA
POSSIBLE DISEASES
DIAGNOSIS & TREATMENT
Under-utilized data
24. It takes an interoperable village to diagnose a rare platelet syndrome
http://bit.ly/stim1paper
Phenotypic
profile
Genes
Heterozygous,
missense mutation
STIM-1
MGI mouse
N/A
Heterozygous,
missense mutation
STIM-1
N/A
Ranked STIM-1 variant maximally pathogenic
based on cross-species G2P data,
in the absence of traditional data sources
http://bit.ly/exomiser
Stim1Sax/Sax
26. If it is alive, it can be PhenoPackaged
Some biodiversity images adapted from http://i.vimeocdn.com/video/417366050_1280x720.jpg
Model Organisms
Biodiversity Crops Domestic Animals
Disease vectors
Epidemiological
Monitoring
Drug discovery
& Development
Rare Disease
Diagnosis
Personalized
Medicine
Environmental
Monitoring
Patients & Cohorts
Genetic
Engineering
Mechanistic
Discovery
27. What is in a PhenoPacket?
Biography: This is “Maru”, a 4-year-old, male cat of the Scottish Fold breed
Phenotypes & qualifiers: abnormal sheltering behavior [MP:0014039] (onset at birth)
Measurements: Weighs 6 kg
Source: youtube.com/user/mugumogu
29. How does it handle measurements?
title: "measurement example, taken from genenetwork.org"
organisms:
  - id: "#1"
    label: "BXD mouse population"
    taxon: NCBITaxon:10090
phenotype_profile:
  - entity: "#1"
    phenotype:
      description: "cerebellum weight"
      types:
        - id: "PATO:0000128"
          label: "weight"
      measurements:
        - unit: mg
          value: 61.400
          property_values:
            - property: standard_error
              filler: 2.38
      attribute_of:
        types:
          - id: "UBERON:0002037"
            label: "cerebellum"
      onset:
        description: "measured in adults"
        types:
          - id: "MmusDv:0000061"
            label: "early adult"
Slide callouts:
Ontology of Statistical properties
We can represent population phenotypes too
attribute
For non-abnormal phenotypes we can use a trait ontology, or a building block approach, with
• PATO
• Uberon
Measured entity
UO
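The measurement record above maps directly onto nested dictionaries and lists, so it can be read with ordinary code. The following is a minimal Python sketch, assuming the nesting implied by the field names and using a hypothetical helper, `summarize_measurements`, that is not part of any PhenoPackets library. The record is written as a Python literal so the example needs only the standard library.

```python
import json

# The slide's record as plain Python data (nesting is an assumption).
packet = {
    "title": "measurement example, taken from genenetwork.org",
    "organisms": [
        {"id": "#1", "label": "BXD mouse population", "taxon": "NCBITaxon:10090"},
    ],
    "phenotype_profile": [{
        "entity": "#1",
        "phenotype": {
            "description": "cerebellum weight",
            "types": [{"id": "PATO:0000128", "label": "weight"}],
            "measurements": [{
                "unit": "mg",
                "value": 61.4,
                "property_values": [
                    {"property": "standard_error", "filler": 2.38},
                ],
            }],
            "attribute_of": {
                "types": [{"id": "UBERON:0002037", "label": "cerebellum"}],
            },
            "onset": {
                "description": "measured in adults",
                "types": [{"id": "MmusDv:0000061", "label": "early adult"}],
            },
        },
    }],
}


def summarize_measurements(packet):
    """Return one readable line per measurement (hypothetical helper)."""
    labels = {org["id"]: org["label"] for org in packet["organisms"]}
    lines = []
    for assoc in packet["phenotype_profile"]:
        phenotype = assoc["phenotype"]
        for m in phenotype.get("measurements", []):
            # Flatten the property/filler pairs into a lookup table.
            props = {p["property"]: p["filler"] for p in m.get("property_values", [])}
            text = (f'{labels[assoc["entity"]]}: {phenotype["description"]} '
                    f'= {m["value"]} {m["unit"]}')
            if "standard_error" in props:
                text += f' (SE {props["standard_error"]})'
            lines.append(text)
    return lines


summary = summarize_measurements(packet)
as_json = json.dumps(packet, indent=2)  # the same record serializes to JSON
```

Because the entity reference (`"#1"`) is resolved against the `organisms` list, the same phenotype block can describe an individual or a whole population without changing shape.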
31. Human Phenotype Ontology, now with 6,200 plain language synonyms for patients, families, and non-experts
http://bit.ly/hpo-biocuration
32. Phenopackets for journals
Each article can be associated with a phenopacket
Robinson, P. N., Mungall, C. J., & Haendel, M. (2015). Capturing phenotypes for precision
medicine. Molecular Case Studies, 1(1), a000372. doi:10.1101/mcs.a000372
Each phenopacket can be shared via DOI in any repository outside a paywall (e.g. Figshare, Zenodo, etc.)
33. So, do you expect us to put these together ourselves?
Emerging tool: WebPhenote
(based on Phenote)
create.monarchinitiative.org
39. PHENOTYPING ISN’T FREE;
SO HOW MUCH IS ENOUGH?
bit.ly/annotationsufficiency
Enlarged ears
Dark hair
Blue skin
Pointy ears
Hair on head
Horns
Enlarged lip
Increased skin pigmentation
yes
no
!
40. THE MORE PHENOTYPE DATA WE HAVE,
THE BETTER ABLE WE ARE
TO ANSWER THAT QUESTION
bit.ly/annotationsufficiency
• Depth/specificity of phenotypic coverage
• Rarity
• Breadth of phenotypic coverage
41. Which phenotypes (and sets of phenotypes) enable precision recall and matching?
Enlarged ears (2)
Dark hair (6)
Female (4)
Male (4)
Blue skin (1)
Pointy ears (1)
Hair absent on head (1)
Horns present (1)
Hair present on head (7)
Enlarged lip (2)
Increased skin pigmentation (3)
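The counts on this slide are enough to illustrate why rare phenotypes discriminate better than common ones. The Python below is a minimal sketch using information content, IC = -log2(frequency), as the rarity measure. That choice is an assumption made here for illustration, not necessarily the scoring behind the annotation sufficiency work, and the population size of 8 is inferred from the 4 female and 4 male individuals shown.

```python
import math

POPULATION_SIZE = 8  # assumption: 4 female + 4 male individuals on the slide

# Number of individuals showing each phenotype, as printed on the slide.
COUNTS = {
    "Blue skin": 1,
    "Pointy ears": 1,
    "Hair absent on head": 1,
    "Horns present": 1,
    "Enlarged ears": 2,
    "Enlarged lip": 2,
    "Increased skin pigmentation": 3,
    "Female": 4,
    "Male": 4,
    "Dark hair": 6,
    "Hair present on head": 7,
}


def information_content(term):
    """Rarity score in bits: the rarer the phenotype, the higher the score."""
    return -math.log2(COUNTS[term] / POPULATION_SIZE)


# Most informative phenotypes first.
ranked = sorted(COUNTS, key=information_content, reverse=True)

for term in ranked:
    print(f"{term}: {information_content(term):.2f} bits")
```

Under these assumptions a phenotype seen in a single individual scores 3 bits, while "Hair present on head" scores well under 1 bit, so asking about it does little to narrow the match.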
42. PhenoPackets make phenotype data:
Findable
Accessible outside paywalls and private data sources
Attributable
Interoperable and Computable
Reusable, exchangeable across contexts and disciplines
FAIR++
43. Sign up below to receive updates
Or to provide feedback and requirements
http://bit.ly/biocuration2016
Thank you!
Live Long and Phenotype
44. Acknowledgements
Lawrence Berkeley
Chris Mungall
Suzanna Lewis
Jeremy Nguyen
Seth Carbon
Charité
Peter Robinson
Sebastian Kohler
RTI
Jim Balhoff
Cyverse
Ramona Walls
U of Pittsburgh
Harry Hochheiser
OHSU
Matt Brush
Kent Shefchek
Julie McMurry
Tom Conlin
Nicole Vasilevsky
Queen Mary College
London
Damian Smedley
Jules Jacobson
Garvan
Tudor Groza
Alfred Wegener
Pier Buttigieg
FUNDING: NIH Office of Director: 1R24OD011883; NIH-UDP: HHSN268201300036C,
HHSN268201400093P, Phenotype Ontology Research Coordination Network (NSF-DEB-0956049)
With special thanks to Julie McMurry for excellent graphic design
Editor's Notes
Trite answer: Something that can be represented by a class in a phenotype ontology
MP
HP
..
But there is more
This basic phenotype description can be adorned with…
Natural language descriptions
Temporal information (onset)
Qualifiers
Severity, progression
Quantitative information
Measurements (unit, value, error, etc)
Environment
…Much more!
The classic G+E=P. But the = has a lot that can be applied to aid the linking.
There is a lot we don’t know about the genome
As of April 2016
OMIM updated number: 3435
ClinVar updated number: 66396
Data from mouse, rat, zebrafish, worm, fruit fly
Human: OMIM, ClinVar
Orthology via PANTHER v9
Highlighting how we get different phenotypic information from different sources, species
Data from MGI, ZFIN, & HPO, reasoned over with cross-species phenotype ontology
https://code.google.com/p/phenotype-ontologies/
The distribution of phenotype information per model genotype is different compared to human disease annotations.
For mouse, there’s a much higher representation of metabolic, cardiovascular, blood, and endocrine phenotypes available to compare;
For fish, there’s increased representation of nervous, skeletal, head and neck, cardiovascular, and connective tissue phenotypes.
(Note that these do not include “normal” phenotypes for either diseases or genotypes.)
What does it mean to replicate a phenotypic profile in a model organism? For many patients or diseases, we may need different models to fully recapitulate the disease. Further, some phenotypes are common in a given species and, if present in the patient, would be a less significant result.
Our approach is to try and get the machine to understand the terms so that it can assist us intelligently.
We make things digestible. Complex concepts into simpler parts. We use ontologies that are comparative by design.
If we include bridging ontologies, we can unify diseases across sources AND phenotypes across sources and organisms.
This was the novel case we solved. The UDP patient had a number of signs and symptoms, including various platelet abnormalities. The same heterozygous, missense mutation was seen in 2 patients and ranked top by Exomiser. It had never been seen in any of the SNP databases and was predicted maximally pathogenic. Finally, a mouse curated by MGI, carrying a heterozygous, missense point mutation introduced by chemical mutagenesis, exhibited strikingly similar platelet abnormalities.
Mosquito image from https://pixabay.com/en/brazil-health-mosquito-news-virus-1300017/ no attribution required
Knowing the normal distribution and clustering of phenotypes tells us that blue skin is rare and can reliably distinguish between phenotype profiles. It also tells us that if the first phenotype entered is enlarged lip, the next one to ask for would be enlarged ears. The combination of 3 non-unique phenotypes offers a perfect match.
There are a lot of people who have contributed to this work over many years.