Function and Phenotype Prediction through Data and Knowledge Fusion

Karin M. Verspoor, The University of Melbourne
karin.verspoor@unimelb.edu.au
27 January 2016 – King Abdullah University of Science and
Technology, Computational Bioscience Research Center

We have the blueprints to life,
but we don’t know how to read them.
• At least a quarter of protein families in
PFAM have no known function
(Domains of Unknown Function)
• Millions of proteins uncharacterised

What is protein function?
The Gene Ontology (GO) provides a vocabulary
• Captures biological process, molecular function, cellular component
• Common representation for model organism databases to facilitate sharing

What about phenotype?
Human Phenotype Ontology

Knowledge-based features
Knowledge source:

Exponential knowledge growth
• ~1550 peer-reviewed gene-
related databases in NAR
online Mol Bio collection
• Over 25 million PubMed
entries (> 2,000/day)
• Breakdown of disciplinary
boundaries makes more of it
relevant to each of us
• “Like drinking from a firehose”
– Jim Ostell (NCBI IEB Chief)

Text as a primary source of knowledge
Despite ever increasing structured resources,
the literature remains the primary repository of
knowledge in biomedicine
[Figure: number of Swiss-Prot proteins (0 to 120,000), 1/02 to 1/07: proteins missing a FUNCTION comment vs. proteins gaining a FUNCTION comment]
“Manual curation is not sufficient for annotation of genomic databases”
Baumgartner et al Bioinformatics (ISMB 2007)

Why biomedical text mining?
[Figure: Exponential growth in size of PubMed; publications per year (0 to 1,200,000), 1914 to 2014]

Data sources, Data Integration
• Structured Resources
– Largely manually ‘curated’, high quality
– Often unannotated
– Organizes targeted information
– Computable
• Unstructured Resources
– Literature: peer reviewed, well-formed
– Natural Language: ambiguity, complexity
– Broad, current coverage of biological knowledge
– Intended for Human communication

Bio Text Analysis in a nutshell
Input: Documents
• pre-processing (e.g., format conversion)
• tokenization
• sentence detection
• term normalisation (e.g., stem, lemmatise)
• biological named entity recognition
• biological concept recognition
• syntactic analysis
• coordination resolution
• co-reference resolution
• ambiguity resolution
• entity linking
• relation extraction
• event annotation
• reasoning and inference
Domain Knowledge: Terminologies, Ontologies, Known Relationships
Output: Annotated Documents; extracted facts and relationships
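The first steps of such a pipeline can be illustrated with a minimal Python sketch. It assumes naive regular-expression rules for sentence detection and tokenization, with lower-casing standing in for term normalisation; real biomedical pipelines use dedicated tools. The function names and the second sentence of the sample text are hypothetical.

```python
import re

def split_sentences(text):
    """Naive sentence detection: split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Naive tokenization: lower-cased runs of letters, digits, '+' and '-'."""
    return re.findall(r"[a-z0-9][a-z0-9+\-]*", sentence.lower())

# First sentence is the CRAFT example from these slides; the second is made up.
text = ("Previous in vitro experiments using renal cell lines suggest recessive "
        "Aqp2 mutations result in improper trafficking of the mutant water pore. "
        "This second sentence is a made-up toy example.")

sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
print(len(sentences), tokens[0][:5])
```

Later stages (entity recognition, concept recognition, relation extraction) would consume these tokens together with the domain knowledge sources.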

GO Function Prediction
Sokolov and Ben-Hur. J Bioinform Comput Biol. 2010 Apr;8(2):357-76.
Sokolov, Funk, Graim, Verspoor, Ben-Hur. BMC Bioinformatics. 2013;14 Suppl 3:S10.

GOstruct: Structured output SVM
[Diagram: multi-view structured SVM training]
• cross-species view: sequence-based features; labels from mouse and human GO annotations; structured SVM training
• species-specific view: co-mention, PPI and gene expression features; labels from mouse GO annotations; structured SVM training
• multi-view: $f(x, y) = f^{(c)}(x, y) + f^{(s)}(x, y)$, where $c$ is the cross-species view and $s$ the species-specific view

Structured output
• Represent a set of annotations as a single vector
• Encodes the hierarchical structure from annotation to
root
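As an illustration of this representation, here is a minimal sketch that turns a set of annotations into a single binary vector by propagating each annotation up to the root. The toy hierarchy and its term names (`A`, `A1`, `AB`, ...) are hypothetical and are not GO terms; GOstruct's actual encoding may differ in detail.

```python
# Toy ontology (hypothetical names): each term maps to its parent terms.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "A1": ["A"],
    "A2": ["A"],
    "AB": ["A", "B"],  # a term with two parents, as allowed in a DAG
}
TERMS = sorted(PARENTS)  # fixed term order defines the vector positions

def with_ancestors(terms):
    """Return the given terms plus every ancestor on a path to the root."""
    seen, stack = set(), list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(PARENTS[term])
    return seen

def to_label_vector(annotations):
    """Encode a set of annotations as one 0/1 vector over all terms."""
    closed = with_ancestors(annotations)
    return [1 if term in closed else 0 for term in TERMS]

print(TERMS)
print(to_label_vector({"A1"}))
print(to_label_vector({"AB"}))
```

Because ancestors are always switched on, every vector produced this way is consistent with the hierarchy, which is what lets a structured-output learner score a whole annotation set at once.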

GOstruct approach
“What functions does this protein perform?”

Feature integration via kernels
• Cross-species (sequence-based) features
– e-values from significant BLAST hits
– features from WoLF PSORT protein localization software
– transmembrane protein prediction using TMHMM
– k-mer composition of the N and C termini
– low complexity regions
• Species-specific features
– Protein interactions
– Gene Expression
– Phylogenetic profiles
– Text-derived features
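One common way to integrate heterogeneous feature sets through kernels is to compute a kernel per feature set, normalise each, and add them. The sketch below assumes linear kernels and cosine normalisation on toy data; it is an illustration of the idea, not necessarily the exact scheme used in GOstruct, and all names and values are hypothetical.

```python
import math

def linear_kernel(X):
    """Gram matrix of dot products between all pairs of rows in X."""
    n = len(X)
    return [[sum(a * b for a, b in zip(X[i], X[j])) for j in range(n)]
            for i in range(n)]

def normalize_kernel(K):
    """Cosine normalisation: K[i][j] / sqrt(K[i][i] * K[j][j])."""
    n = len(K)
    return [[K[i][j] / math.sqrt(K[i][i] * K[j][j])
             if K[i][i] > 0 and K[j][j] > 0 else 0.0
             for j in range(n)] for i in range(n)]

def sum_kernels(kernels):
    """Element-wise sum of equally sized kernel matrices."""
    n = len(kernels[0])
    return [[sum(K[i][j] for K in kernels) for j in range(n)] for i in range(n)]

# Toy feature sets for three proteins (made-up values).
sequence_features = [[1, 0], [1, 1], [0, 2]]
text_features = [[2, 0, 1], [0, 1, 0], [2, 0, 1]]

combined = sum_kernels([
    normalize_kernel(linear_kernel(sequence_features)),
    normalize_kernel(linear_kernel(text_features)),
])
print(combined)
```

Normalising before summing keeps a feature set with large raw values from dominating the combined similarity.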

Extraction & Analysis pipeline
Christopher Funk (2015) PhD dissertation, U. Colorado Denver

Integrating Text
• Protein – Gene Ontology term co-occurrence
• Protein – Protein co-occurrence
Sokolov et al. BMC Bioinformatics 2013, 14(Suppl 3):S10

Text-based features
• Words
– (tokens)
• Entities or Concepts
– (gene/protein mentions)
– (gene ontology concepts)
• Relations
– (simple co-occurrences)

Feature Extraction from text
Target: P50281 – Matrix metalloproteinase 14 (MMP14)

Feature Extraction
Bag of words:
WordsSent1(membrane, otherwise, known, … , proteolytic,
enzyme, known, extracellular, invasion, … , progression)
WordsSent2(protein, and, message, levels, of, was , …)

Feature Extraction
Protein GO term co-mentions:
sent_comen(P50281, GO:0008237)
nonSent_comen(P50281, GO:0010467)

Feature Representation
Bag of Words:
P50281, known=2, membrane=1, protein=1, proteolytic=1, enzyme=1, …
Protein GO term co-mentions (sentence):
P50281, GO:0008237=1, GO:0006508=1, GO:0009056=1, GO:0031012=1
Protein GO term co-mentions (non-sentence):
P50281, GO:0010467=2, GO:0005623=2

Feature Representation
Bag of Words:
UniprotID1, w1=countw1, w2=countw2, w3=countw3, … , wi=countwi
UniprotID2, w1=countw1, w2=countw2, w3=countw3, … , wi=countwi
…
UniprotIDi, w1=countw1, w2=countw2, w3=countw3, … , wi=countwi
Protein GO term co-mentions (sentence):
UniprotID1, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
…
UniprotIDi, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
Protein GO term co-mentions (non-sentence):
…
UniprotIDi, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
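The count-based rows above could be produced along the following lines. This is a minimal sketch that assumes the text has already been processed into one record per sentence mentioning a protein; the record layout, the field names and the toy values are hypothetical, and "non-sentence" is taken here to mean GO terms recognised elsewhere in the same document span.

```python
from collections import Counter, defaultdict

# Hypothetical pre-annotated input: one record per protein-mentioning sentence.
records = [
    {"protein": "P50281",
     "tokens": ["membrane", "known", "proteolytic", "enzyme", "known"],
     "go_in_sentence": ["GO:0008237"],
     "go_elsewhere": ["GO:0010467"]},
    {"protein": "P50281",
     "tokens": ["protein", "levels"],
     "go_in_sentence": [],
     "go_elsewhere": ["GO:0010467"]},
]

def build_features(records):
    """Accumulate per-protein counts for the three feature families."""
    bow = defaultdict(Counter)
    sentence_comentions = defaultdict(Counter)
    nonsentence_comentions = defaultdict(Counter)
    for rec in records:
        protein = rec["protein"]
        bow[protein].update(rec["tokens"])
        sentence_comentions[protein].update(rec["go_in_sentence"])
        nonsentence_comentions[protein].update(rec["go_elsewhere"])
    return bow, sentence_comentions, nonsentence_comentions

def format_row(protein, counts):
    """Render one sparse feature row as 'ID, feature=count, ...'."""
    return ", ".join([protein] + [f"{k}={v}" for k, v in sorted(counts.items())])

bow, sent, nonsent = build_features(records)
print(format_row("P50281", bow["P50281"]))
print(format_row("P50281", sent["P50281"]))
print(format_row("P50281", nonsent["P50281"]))
```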

An aside on GO concept recognition
• Given:
– Gene Ontology (~46,000 concepts)
– Text: “In mice lacking ephrin-A5 function, cell proliferation and
survival of newborn neurons…” (PMID 20474079)
• Return:
– GO:0008283 cell proliferation
– GO:0005125 cytokine activity
– GO:0048666 neuron development
(can be based on a judgment about the depth of
experimental evidence)

(CRAFT example)
Previous in vitro experiments using renal
cell lines suggest recessive Aqp2
mutations result in improper trafficking
of the mutant water pore.
GO:0005623 – “cell”
CL:0000000 – “cell”
PR:000004182 – “aquaporin-2”
EG:359 – “Aqp2”
SO:0001059 – “sequence_alteration”
GO:0006810 – “transport”
GO:0015250 – “water channel activity”
CHEBI:15377 – “water”

GO:0006900 – membrane budding
[Term]
id: GO:0006900
name: membrane budding
…
def: "The evagination of a membrane, resulting in formation of a vesicle."
…
synonym: "membrane evagination"
synonym: "nonselective vesicle assembly"
synonym: "vesicle biosynthesis"
synonym: "vesicle formation"
…
Variation in PMID: 12925238
• Lipid rafts play a key role in
membrane budding…
• …involvement of annexin A7 in
budding of vesicles…
• …Ca2+-mediated vesiculation
process was not impaired.
• Red blood cells which lack the
ability to vesiculate cause…
• Having excluded a direct role
in vesicle formation…
GO vs NL
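The gap between ontology labels and natural language can be made concrete with a minimal dictionary-matching sketch. It assumes a lexicon built only from the name and synonyms of GO:0006900 shown above and applies exact, case-insensitive phrase matching to the five variants quoted from PMID 12925238 (elisions dropped). The function name is hypothetical, and this is not how MetaMap, Concept Mapper or NCBO Annotator work internally.

```python
import re

# Name and synonyms of GO:0006900 as listed in the OBO entry above.
LEXICON = {
    "GO:0006900": [
        "membrane budding",
        "membrane evagination",
        "nonselective vesicle assembly",
        "vesicle biosynthesis",
        "vesicle formation",
    ],
}

def recognize(text, lexicon):
    """Return (GO id, matched label, offset) for exact whole-phrase matches."""
    hits = []
    lowered = text.lower()
    for go_id, labels in lexicon.items():
        for label in labels:
            pattern = r"\b" + re.escape(label) + r"\b"
            for match in re.finditer(pattern, lowered):
                hits.append((go_id, label, match.start()))
    return sorted(hits, key=lambda hit: hit[2])

snippets = [
    "Lipid rafts play a key role in membrane budding",
    "involvement of annexin A7 in budding of vesicles",
    "Ca2+-mediated vesiculation process",
    "Red blood cells which lack the ability to vesiculate",
    "Having excluded a direct role in vesicle formation",
]
found = [bool(recognize(s, LEXICON)) for s in snippets]
print(found)  # [True, False, False, False, True]
```

Exact matching finds only two of the five mentions; "budding of vesicles", "vesiculation" and "vesiculate" need reordering, stemming or derivational handling, which is what the tool parameters compared on the next slide control.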

Comparing tool performance on CR
[Figure: Best performance for all tools on all ontologies; precision vs. recall (0.0 to 1.0) with F-measure contours from f=0.1 to f=0.9. Systems: MetaMap, Concept Mapper, NCBO Annotator. Ontologies: GO_CC, GO_MF, GO_BP, SO, CL, PR, NCBITAXON, CHEBI]
• NCBO Annotator (96 combinations): wholeWordOnly, filterNumber, stopWords, stopWordsCaseSensitive, minTermSize, withSynonyms
• MetaMap (864 combinations): model, gaps, wordOrder, acronymAbb, derivationalVars, scoreFilter, minTermSize
• Concept Mapper (576 combinations): searchStrategy, caseMatch, stemmer, orderIndependentLookup, findAllMatches, stopWords, synonyms
Funk et al. BMC Bioinformatics 2014, Feb 26;15:59.

Literature alone is useful
[Figure: Macro-averaged F-measure (0 to 0.7) by Gene Ontology branch (MF, BP, CC) for four settings: Baseline (co-mentions as predictions), Co-mentions, BoW, Co-mentions + BoW]
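A minimal sketch of how the baseline could be scored: each protein–GO term co-mention is treated directly as a predicted annotation, and the F-measure is macro-averaged over terms. The per-term averaging, the function names and the toy data are assumptions for illustration; the exact evaluation protocol is described in the cited papers.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f(gold, predicted, terms):
    """Average the per-term F-measure over all terms (each term weighted equally)."""
    scores = []
    for term in terms:
        gold_proteins = {p for p, ts in gold.items() if term in ts}
        pred_proteins = {p for p, ts in predicted.items() if term in ts}
        tp = len(gold_proteins & pred_proteins)
        fp = len(pred_proteins - gold_proteins)
        fn = len(gold_proteins - pred_proteins)
        scores.append(f_measure(tp, fp, fn))
    return sum(scores) / len(scores)

# Toy data (made up): gold annotations and co-mentions used as predictions.
gold = {"p1": {"T1", "T2"}, "p2": {"T1"}, "p3": {"T2"}}
comentions = {"p1": {"T1"}, "p2": {"T1", "T2"}, "p3": set()}

score = macro_f(gold, comentions, ["T1", "T2"])
print(score)  # 0.5
```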

Literature features approach performance of
commonly used biological features
[Figure: Macro-averaged F-measure (0 to 0.8) by Gene Ontology branch (MF, BP, CC) for feature sets: Trans/Localization, Homology, Network, Literature, All Combined]
(and combining them with other features is even better!)

Manual inspection of misclassifications
Some false positives appear to have literature support:
• GCNT1 – carbohydrate metabolic process
(Q02742 - GO:0005975)
Genes related to carbohydrate metabolism
include PPP1R3C, B3GNT1, and GCNT1…
[PMID:23646466]
• CERS2 – ceramide biosynthetic process
(Q96G23 - GO:0046513)
…CersS2, which uses C22-CoA for
ceramide synthesis…
[PMID:22144673]

Results: different sources
Mouse annotations from geneontology.org

Phenotype Prediction
Kahanda I, Funk C, Verspoor K and Ben-Hur A.
F1000Research 2015, 4:259 (doi: 10.12688/f1000research.6670.1)

PHENOstruct
Human Phenotype Ontology
Organ, Inheritance, Onset subontologies
have separate models

PHENOstruct Features
• Network (functional association data)
– protein-protein interactions
– co-expression
– co-localization
– From BioGRID, STRING, GeneMANIA
• Gene Ontology (experimental) annotations
• Literature mined data: bag of words in gene sentences
• Genetic variants (protein -> disease -> variants)

Performance
Subontology   Terms   Method         AUC    P-value
Organ         1,796   Binary SVMs    0.66   1.70E-262
                      Clus-HMC-Ens   0.65   0.00E+00
                      PHENOstruct    0.73   -
Inheritance   12      Clus-HMC-Ens   0.73   7.30E-01
Onset         23      Clus-HMC-Ens   0.58   3.30E-05

PHENOstruct in Organ subontology

Gold vs Predicted, P43681
Gold Predicted
Hierarchical, protein-centric
P = 1.0; R = 0.62
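Hierarchical, protein-centric precision and recall can be sketched as follows: for one protein, both the gold and the predicted term sets are expanded with their ancestors before comparison. The toy hierarchy, the exclusion of the root and the function names are assumptions for illustration; the example shows the same pattern as the slide (perfect precision, lower recall) but does not reproduce its numbers.

```python
# Toy ontology (hypothetical names): each term maps to its parent terms.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "A1": ["A"],
    "A2": ["A"],
    "B1": ["B"],
}

def with_ancestors(terms):
    """Return the given terms plus every ancestor on a path to the root."""
    seen, stack = set(), list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(PARENTS[term])
    return seen

def hierarchical_pr(gold, predicted):
    """Precision and recall over ancestor-expanded term sets for one protein."""
    gold_closed = with_ancestors(gold) - {"root"}       # root excluded (assumption)
    pred_closed = with_ancestors(predicted) - {"root"}
    overlap = gold_closed & pred_closed
    precision = len(overlap) / len(pred_closed) if pred_closed else 0.0
    recall = len(overlap) / len(gold_closed) if gold_closed else 0.0
    return precision, recall

precision, recall = hierarchical_pr(gold={"A1", "B1"}, predicted={"A1"})
print(precision, recall)  # 1.0 0.5
```

Everything predicted is correct, but the whole `B` branch of the gold annotation is missed, so precision is perfect while recall is not.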

Top Literature Features
Category: Tokens
• proteins & complexes: cx32, kisspeptin, -308, t308, smn2, ns5, trap-positive, mpp+-induced, 1-methyl-4-phenylpyridinium, tnf-alpha-mediated, tnf-alpha-stimulated, tnf–mediated, ink4a/arf, ns4b, hmsh6, fukutin, cdtb, ns5b, apoai, tnf–stimulated, ns4a, tnf-alpha-, rhbmp-2, tnf-alpha-treated, frataxin, ki-ras, connexin32, tcdb, recql4, =-galcer, tyrosinase-related, hpms2, her4, cd40-cd40l, lmp2a, ryrs, mg2+-atpase, ews-fli1, abeta42, fancc, p40phox, her1, bdnf-induced, trap+, gfap-ir, daf-16/foxo, hdl3, -238, [tnf-alpha], cd40/cd40l, tnf–treated, anti-ngf, tep1, recq, nt-4, pfemp1, zo-2, nphp1, tnf-alpha-dependent, pomt1, igm-positive, apoa-ii, p110alpha, fancf, tbx4, anti-cd40l, igg
• genes: hmsh2, cx26, fkrp, smn1, cln3, nphp4, mn1, nnt, apex2, akt-2
• pathways: ras/raf/mek/erk, pi3k-akt-mtor
• diseases/phenotypes: cmt1a, hnpp, hdl2, cln2, hpp, fmf, rtt, hnpcc, charcot-marie-tooth, amenorrhea, rett, anticardiolipin
• misc.: sheldrick, shelxl97, bruker, farrugia, ortep-3, platon, shelxs97, spek, sgdid, wlds, caii, aoa, tdf, crysalis, wingx, amf

Conclusions
• The literature provides a significant resource for
biological function prediction
• The literature provides one ‘view’ of biological
knowledge and is best combined with other resources
• Even some simple strategies for extracting
associations from the literature can provide valuable
information, taken at large scale
– “bag of words” and co-occurrence models are a reasonable
starting point: they capture implied relationships
– scope for integration of more targeted extracted
relationships (e.g. protein-protein interactions), with the
usual Precision/Recall tradeoff

Acknowledgements
• Los Alamos National Laboratory
– Michael Wall
• Colorado
– Larry Hunter (U. Colorado Denver)
– Christopher Funk (U. Colorado Denver)
– Asa Ben-Hur (Colorado State University)
– Indika Kahanda (Colorado State University)
• NICTA Victoria Research Laboratory
– Geoffrey Macintyre (U. Cambridge)
– Antonio Jimeno Yepes (IBM Research Australia)
– Cheng-Soon Ong (NICTA Canberra)
• Funding:
US NIH, US NSF, NICTA, Australian Research Council

Machine learning for text analysis
[Diagram: training]
• Training set: notes + labels for classes of interest
• Language processing, drawing on biomedical knowledge sources (UMLS, OBOs)
• Features: words, phrases, linguistic categories; names of entities; domain concepts; document features
• Machine learning algorithm
• Model: relating features of the text to classes of interest

Machine learning for text analysis
[Diagram: prediction]
• New text to be classified
• Language processing, drawing on biomedical knowledge sources (UMLS, OBOs)
• Features: words, phrases, linguistic categories; names of entities; domain concepts; document features
• Model
• Predicted classification (label)
