Function and Phenotype Prediction
through Data and Knowledge Fusion
Karin M. Verspoor, The University of Melbourne
karin.verspoor@unimelb.edu.au
27 January 2016 – King Abdullah University of Science and
Technology, Computational Bioscience Research Center
We have the blueprints to life,
but we don’t know how to read them.
• At least a quarter of protein families in
PFAM have no known function
(Domains of Unknown Function)
• Millions of proteins uncharacterised
From sequence to function
What is protein function?
• Captures biological
process, molecular
function, cellular
component
• Common
representation for
Model organism
databases to
facilitate sharing
The Gene Ontology (GO) provides a vocabulary
What about phenotype?
Human Phenotype Ontology
Knowledge-based features
Knowledge source:
Exponential knowledge growth
• ~1550 peer-reviewed gene-
related databases in NAR
online Mol Bio collection
• Over 25 million PubMed
entries (> 2,000/day)
• Breakdown of disciplinary
boundaries makes more of it
relevant to each of us
• “Like drinking from a firehose”
– Jim Ostell (NCBI IEB Chief)
Text as a primary source of knowledge
Despite ever increasing structured resources,
the literature remains the primary repository of
knowledge in biomedicine
[Figure: number of Swiss-Prot proteins (0 to 120,000) from 1/02 to 1/07, showing proteins missing a FUNCTION comment and proteins gaining a FUNCTION comment]
“Manual curation is not sufficient for annotation of genomic databases”
Baumgartner et al Bioinformatics (ISMB 2007)
Why biomedical text mining?
[Figure: Exponential growth in size of PubMed. Publications per year (0 to 1,200,000) by year, 1914 to 2014]
Data sources, Data Integration
• Structured Resources
– Largely manually ‘curated’, high quality
– Often unannotated
– Organizes targeted information
– Computable
• Unstructured Resources
– Literature: peer reviewed, well-formed
– Natural Language: ambiguity, complexity
– Broad, current coverage of biological knowledge
– Intended for Human communication
Bio Text Analysis in a nutshell
Input
Documents
pre-processing
(e.g., format conversion)
tokenization
sentence detection
term normalisation
(e.g., stem, lemmatise)
biological named entity
recognition
biological concept
recognition
syntactic analysis
coordination resolution
co-reference resolution
ambiguity resolution
entity linking
Domain Knowledge:
Terminologies
Ontologies
Known Relationships
relation extraction
event annotation
reasoning and inference
Annotated
Documents
extracted facts
and
relationships
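The first stages of this pipeline can be sketched in plain Python. This is only a minimal illustration, assuming naive regular-expression rules for sentence detection and tokenization and lower-casing as the only normalisation step. The function names and the second example sentence are hypothetical, and real biomedical pipelines typically rely on trained models for these stages.

```python
import re

def split_sentences(text):
    """Naive sentence detection: split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Naive tokenization: runs of letters, digits, '+' and '-' form one token."""
    return re.findall(r"[A-Za-z0-9+\-]+", sentence)

def normalise(tokens):
    """Term normalisation reduced to lower-casing (no stemming or lemmatising)."""
    return [t.lower() for t in tokens]

# Illustrative input text; the second sentence is made up for this sketch.
text = "Lipid rafts play a key role in membrane budding. Annexin A7 is involved."
sentences = split_sentences(text)
tokens = [normalise(tokenize(s)) for s in sentences]
```

Entity recognition, linking and relation extraction would then operate on `tokens`, drawing on the domain knowledge sources listed in the diagram.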
GO Function Prediction
Sokolov and Ben-Hur. J Bioinform Comput Biol. 2010 Apr;8(2):357-76.
Sokolov, Funk, Graim, Verspoor, Ben-Hur. BMC Bioinformatics. 2013;14 Suppl 3:S10.
GOstruct: Structured output SVM
• Cross-species view
– Sequence-based features
– Labels: mouse GO annotations, human GO annotations
– Structured SVM training
• Species-specific view
– Features: co-mention, PPI, gene expression
– Labels: mouse annotations
– Structured SVM training
• Multi-view: $f(x, y) = f^{(c)}(x, y) + f^{(s)}(x, y)$
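The multi-view combination adds the compatibility scores of the cross-species and species-specific views, and prediction picks the candidate label set with the highest combined score. The sketch below assumes each view is an arbitrary scoring function; the toy views, their per-term weights and the candidate list are hypothetical and ignore the input features entirely.

```python
def predict_multi_view(x_cross, x_specific, candidates, f_c, f_s):
    """Return the candidate label set y maximising f_c(x, y) + f_s(x, y)."""
    return max(candidates, key=lambda y: f_c(x_cross, y) + f_s(x_specific, y))

# Toy views: each scores a label set by summing per-term weights (made-up values).
cross_weights = {"GO:0008237": 0.9, "GO:0006508": 0.4, "GO:0031012": -0.2}
specific_weights = {"GO:0008237": 0.1, "GO:0006508": 0.5, "GO:0031012": 0.3}

def f_c(x, y):
    """Cross-species view score (toy version, ignores x)."""
    return sum(cross_weights.get(term, 0.0) for term in y)

def f_s(x, y):
    """Species-specific view score (toy version, ignores x)."""
    return sum(specific_weights.get(term, 0.0) for term in y)

candidates = [
    frozenset({"GO:0008237"}),
    frozenset({"GO:0008237", "GO:0006508"}),
    frozenset({"GO:0031012"}),
]
best = predict_multi_view(None, None, candidates, f_c, f_s)
```

In a trained model the two scoring functions would be learned from their respective feature sets rather than written by hand.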
Structured output
• Represent a set of annotations as a single vector
• Encodes the hierarchical structure from annotation to
root
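A small sketch of this representation, assuming a toy is_a hierarchy with placeholder term names (not real GO identifiers): each annotation is propagated to all of its ancestors, and the resulting set is written as one binary vector over a fixed term order.

```python
# Toy hierarchy (child -> parents); term names are placeholders for illustration.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "A1": ["A"],
    "AB": ["A", "B"],
}
TERMS = sorted(PARENTS)  # fixed term order defines the vector positions

def with_ancestors(annotations):
    """Close a set of annotations upwards so every ancestor up to the root is included."""
    closed, stack = set(), list(annotations)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(PARENTS[term])
    return closed

def to_vector(annotations):
    """Binary vector with 1 for each annotated term and each of its ancestors."""
    closed = with_ancestors(annotations)
    return [1 if term in closed else 0 for term in TERMS]

vector = to_vector({"A1"})
```

Because the vector already contains the path to the root, two predictions that differ only deep in the hierarchy share most of their entries.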
GOstruct approach
“What functions does this protein perform?”
Feature integration via kernels
• Cross-species (sequence-based) features
– e-values from significant BLAST hits
– features from WoLF PSORT protein localization software
– transmembrane protein prediction using TMHMM
– k-mer composition of the N and C termini
– low complexity regions
• Species-specific features
– Protein interactions
– Gene Expression
– Phylogenetic profiles
– Text-derived features
Extraction & Analysis pipeline
Christopher Funk (2015) PhD dissertation, U. Colorado Denver
Integrating Text
• Protein – Gene Ontology term co-occurrence
• Protein – Protein co-occurrence
Sokolov et al. BMC Bioinformatics 2013, 14(Suppl 3):S10
Text-based features
• Words
– (tokens)
• Entities or Concepts
– (gene/protein mentions)
– (gene ontology concepts)
• Relations
– (simple co-occurrences)
Feature Extraction from text
Target: P50281 – Matrix metalloproteinase 14 (MMP14)
Feature Extraction
Target: P50281 – Matrix metalloproteinase 14 (MMP14)
Bag of words:
WordsSent1(membrane, otherwise, known, … , proteolytic,
enzyme, known, extracellular, invasion, … , progression)
WordsSent2(protein, and, message, levels, of, was, …)
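The bag-of-words step can be sketched as a token count over all sentences that mention the target protein. The token lists below are abbreviated from the slide, with the elided words simply left out; the function name is hypothetical.

```python
from collections import Counter

# Tokens from two sentences mentioning the target protein (elided words omitted).
sentence_tokens = [
    ["membrane", "otherwise", "known", "proteolytic", "enzyme", "known",
     "extracellular", "invasion", "progression"],
    ["protein", "and", "message", "levels", "of", "was"],
]

def bag_of_words(sentences):
    """Count each token across all sentences that mention the protein."""
    counts = Counter()
    for tokens in sentences:
        counts.update(tokens)
    return counts

bow = bag_of_words(sentence_tokens)
```

Word order and sentence boundaries are discarded; only the per-protein counts are kept as features.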
Feature Extraction
Target: P50281 – Matrix metalloproteinase 14 (MMP14)
Protein GO term co-mentions:
sent_comen(P50281, GO:0008237)
sent_comen(P50281, GO:0006508)
sent_comen(P50281, GO:0009056)
sent_comen(P50281, GO:0031012)
nonSent_comen(P50281, GO:0010467)
nonSent_comen(P50281, GO:0005623)
Feature Extraction
Target: P50281 – Matrix metalloproteinase 14 (MMP14)
Protein GO term co-mentions:
nonSent_comen(P50281, GO:0008237)
nonSent_comen(P50281, GO:0006508)
nonSent_comen(P50281, GO:0009056)
nonSent_comen(P50281, GO:0031012)
nonSent_comen(P50281, GO:0010467)
nonSent_comen(P50281, GO:0005623)
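One possible reading of the two co-mention types, sketched in Python. The assumption here, which the slides do not spell out, is that a sentence co-mention pairs a protein and a GO term found in the same sentence, while a non-sentence co-mention pairs entities that share a document span (for example an abstract) but never a sentence. The input structure and function name are hypothetical.

```python
from collections import Counter
from itertools import product

def co_mentions(sentences):
    """Split protein-GO co-mentions into sentence-level and non-sentence ones.

    `sentences` is a list of (protein_ids, go_ids) pairs, one pair per sentence
    of a single document span.
    """
    sent = Counter()
    for proteins, go_terms in sentences:
        sent.update(product(proteins, go_terms))

    all_proteins = set().union(*(p for p, _ in sentences))
    all_go = set().union(*(g for _, g in sentences))
    # Assumption: a pair is non-sentence only if it never shares a sentence.
    non_sent = Counter(
        pair for pair in product(all_proteins, all_go) if pair not in sent
    )
    return sent, non_sent

doc = [
    ({"P50281"}, {"GO:0008237", "GO:0006508"}),  # protein and terms in one sentence
    (set(), {"GO:0010467"}),                      # term mentioned without the protein
]
sent, non_sent = co_mentions(doc)
```

Counts accumulated over many documents then become the per-protein feature values.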
Feature Representation
Target: P50281 – Matrix metalloproteinase 14 (MMP14)
Bag of Words:
P50281, known=2, membrane=1, protein=1, proteolytic=1, enzyme=1, …
Protein GO term co-mentions (sentence):
P50281, GO:0008237=1, GO:0006508=1, GO:0009056=1, GO:0031012=1
Protein GO term co-mentions (non-sentence):
P50281, GO:0010467=2, GO:0005623=2
Feature Representation
Bag of Words:
UniprotID1, w1=countw1, w2=countw2, w3=countw3, … , wi=countwi
UniprotID2, w1=countw1, w2=countw2, w3=countw3, … , wi=countwi
…
UniprotIDi, w1=countw1, w2=countw2, w3=countw3, … , wi=countwi
Protein GO term co-mentions (sentence):
UniprotID1, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
UniprotID2, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
…
UniprotIDi, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
Protein GO term co-mentions (non-sentence):
UniprotID1, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
UniprotID2, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
…
UniprotIDi, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
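Rows in this sparse "ID, feature=count" layout can be aligned on a shared vocabulary to give one fixed-length vector per protein. The sketch below assumes the comma-separated format shown on the slide; the function names are hypothetical and the second row's count is made up for illustration.

```python
def parse_row(line):
    """Parse 'ID, key=count, key=count' into (ID, {key: count})."""
    protein_id, *pairs = [field.strip() for field in line.split(",")]
    counts = {}
    for pair in pairs:
        key, value = pair.rsplit("=", 1)
        counts[key] = int(value)
    return protein_id, counts

def to_matrix(rows):
    """Align sparse rows on a shared, sorted feature vocabulary."""
    vocab = sorted({key for _, counts in rows for key in counts})
    matrix = [[counts.get(key, 0) for key in vocab] for _, counts in rows]
    return vocab, matrix

rows = [
    parse_row("P50281, GO:0008237=1, GO:0006508=1"),
    parse_row("Q02742, GO:0005975=3"),  # count invented for this example
]
vocab, matrix = to_matrix(rows)
```

Each feature family (bag of words, sentence co-mentions, non-sentence co-mentions) would get its own matrix, which fits the kernel-based feature integration described earlier.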
An aside on GO concept recognition
• Given:
– Gene Ontology (~46,000 concepts)
In mice lacking ephrin-A5 function, cell proliferation and
survival of newborn neurons… (PMID 20474079)
• Return:
– GO:0008283 cell proliferation
– GO:0005125 cytokine activity
– GO:0048666 neuron development
(can be based on a judgment about the depth of
experimental evidence)
(CRAFT example)
Previous in vitro experiments using renal
cell lines suggest recessive Aqp2
mutations result in improper trafficking
of the mutant water pore.
GO:0005623 – “cell”
CL:0000000 – “cell”
PR:000004182 – “aquaporin-2”
EG:359 – “Aqp2”
SO:0001059 – “sequence_alteration”
GO:0006810 – “transport”
SO:0001059 – “sequence_alteration”
GO:0015250 – “water channel activity”
CHEBI:15377 – “water”
GO:0006900 – membrane budding
[Term]
id: GO:0006900
name: membrane budding
…
def: "The evagination of a membrane,
resulting in formation of a vesicle."
…
synonym: "membrane evagination"
synonym: "nonselective vesicle assembly"
synonym: "vesicle biosynthesis"
synonym: "vesicle formation"
…
Variation in PMID: 12925238
• Lipid rafts play a key role in
membrane budding…
• …involvement of annexin A7 in
budding of vesicles…
• …Ca2+-mediated vesiculation
process was not impaired.
• Red blood cells which lack the
ability to vesiculate cause…
• Having excluded a direct role
in vesicle formation…
GO vs NL
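The gap between ontology labels and natural language can be made concrete with a naive dictionary lookup. This sketch assumes case-insensitive exact substring matching against the name and synonyms of GO:0006900; it is not one of the evaluated tools, and the function name is hypothetical.

```python
# Name and synonyms copied from the GO:0006900 stanza.
GO_0006900 = [
    "membrane budding",
    "membrane evagination",
    "nonselective vesicle assembly",
    "vesicle biosynthesis",
    "vesicle formation",
]

def mentions_concept(text, labels):
    """Case-insensitive exact substring lookup of any label in the text."""
    lowered = text.lower()
    return any(label in lowered for label in labels)

snippets = [
    "Lipid rafts play a key role in membrane budding",
    "involvement of annexin A7 in budding of vesicles",
    "Ca2+-mediated vesiculation process was not impaired.",
    "Red blood cells which lack the ability to vesiculate cause",
    "Having excluded a direct role in vesicle formation",
]
hits = [mentions_concept(s, GO_0006900) for s in snippets]
```

Only the first and last snippets match. Reordered phrases ("budding of vesicles") and derived forms ("vesiculation", "vesiculate") are missed by exact lookup, which is why settings such as stemming, word order and derivational variants matter in concept recognition tools.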
Comparing tool performance on CR
[Figure: best performance for all tools on all ontologies. Axes: precision and recall, each from 0.0 to 1.0, with F-measure contours from f=0.1 to f=0.9]
Systems: MetaMap, Concept Mapper, NCBO Annotator
Ontologies: GO_CC, GO_MF, GO_BP, SO, CL, PR, NCBITAXON, CHEBI
• NCBO Annotator
(96 combinations)
wholeWordOnly, filterNumber,
stopWords, stopWordsCaseSensitive,
minTermSize, withSynonyms
• MetaMap
(864 combinations)
model, gaps, wordOrder,
acronymAbb, derivationalVars,
scoreFilter, minTermSize
• Concept Mapper
(576 combinations)
searchStrategy, caseMatch, stemmer,
orderIndependentLookup,
findAllMatches, stopWords, synonyms
Funk et al. BMC Bioinformatics 2014, Feb 26;15:59.
Literature alone is useful
[Figure: macro-averaged F-measure (0 to 0.7) by Gene Ontology branch (MF, BP, CC). Series: Baseline (co-mentions as predictions), Co-mentions, BoW, Co-mentions + BoW]
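For reference, a macro-averaged F-measure can be sketched as follows, assuming the average is taken over GO terms so that rare terms weigh as much as frequent ones. The per-term counts are hypothetical and are not taken from the evaluation.

```python
def f_measure(tp, fp, fn):
    """Balanced F-measure from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f(per_term_counts):
    """Average the per-term F-measures, giving every term equal weight."""
    scores = [f_measure(*counts) for counts in per_term_counts.values()]
    return sum(scores) / len(scores)

# Hypothetical (tp, fp, fn) counts per GO term.
counts = {"GO:0008237": (8, 2, 2), "GO:0031012": (1, 1, 3)}
score = macro_f(counts)
```

Here the first term scores 0.8 and the second about 0.33, so the macro average is pulled down by the poorly predicted term even though it has fewer annotations.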
Literature features approach performance of commonly used biological features
[Figure: macro-averaged F-measure (0 to 0.8) by Gene Ontology branch (MF, BP, CC). Series: Trans/Localization, Homology, Network, Literature, All Combined]
(and combining them with other features is even better!)
Manual inspection of misclassifications
Some false positives appear to have literature support:
• GCNT1 – carbohydrate metabolic process
(Q02742 - GO:0005975)
Genes related to carbohydrate metabolism
include PPP1R3C, B3GNT1, and GCNT1…
[PMID:23646466]
• CERS2 – ceramide biosynthetic process
(Q96G23 - GO:0046513)
…CerS2, which uses C22-CoA for
ceramide synthesis…
[PMID:22144673]
Results: Multi-view learning
Results: different sources
Mouse annotations from geneontology.org
Phenotype Prediction
Kahanda I, Funk C, Verspoor K and Ben-Hur A.
F1000Research 2015, 4:259 (doi: 10.12688/f1000research.6670.1)
PHENOstruct
Human Phenotype Ontology
Organ, Inheritance, Onset subontologies
have separate models
Gold annotations via transfer
PHENOstruct Features
• Network (functional association data)
– protein-protein interactions
– co-expression
– co-localization
– From BioGRID, STRING, GeneMANIA
• Gene Ontology (experimental) annotations
• Literature mined data: bag of words in gene sentences
• Genetic variants (protein -> disease -> variants)
Performance
Subont.      Terms  Method        AUC   P-value
Organ        1,796  Binary SVMs   0.66  1.70E-262
Organ        1,796  Clus-HMC-Ens  0.65  0.00E+00
Organ        1,796  PHENOstruct   0.73  -
Inheritance  12     Binary SVMs   0.72  2.20E-01
Inheritance  12     Clus-HMC-Ens  0.73  7.30E-01
Inheritance  12     PHENOstruct   0.74  -
Onset        23     Binary SVMs   0.62  4.40E-03
Onset        23     Clus-HMC-Ens  0.58  3.30E-05
Onset        23     PHENOstruct   0.64  -
PHENOstruct in Organ subontology
Gold vs Predicted, P43681
Gold Predicted
Hierarchical, protein-centric
P = 1.0; R = 0.62
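A hierarchical, protein-centric precision and recall can be sketched by propagating both the gold and the predicted terms to the root before comparing them. The hierarchy and term names below are placeholders (not real HPO identifiers) and the numbers are not those of P43681; the function name is hypothetical.

```python
def hierarchical_pr(predicted, gold, parents):
    """Precision and recall for one protein after propagating terms to the root."""
    def closure(terms):
        closed, stack = set(), list(terms)
        while stack:
            term = stack.pop()
            if term not in closed:
                closed.add(term)
                stack.extend(parents.get(term, []))
        return closed

    pred_closed, gold_closed = closure(predicted), closure(gold)
    overlap = len(pred_closed & gold_closed)
    precision = overlap / len(pred_closed) if pred_closed else 0.0
    recall = overlap / len(gold_closed) if gold_closed else 0.0
    return precision, recall

# Toy hierarchy (child -> parents) with placeholder term names.
parents = {"t1": ["root"], "t2": ["t1"], "t3": ["t1"], "t4": ["root"]}
precision, recall = hierarchical_pr({"t2"}, {"t2", "t3", "t4"}, parents)
```

In this toy case every predicted term (including its ancestors) is in the gold set, so precision is 1.0, while recall is 0.6 because two gold terms are never predicted.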
Impact of data sources
Leave-one-source-out
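The leave-one-source-out procedure can be sketched as a loop that removes each data source in turn and records the change in score. The `evaluate` callback is an assumption: in practice it would train and score a model on the remaining sources, whereas the stand-in below only counts them so that the sketch runs without any data.

```python
def leave_one_source_out(sources, evaluate):
    """Return the score change caused by removing each source in turn."""
    full_score = evaluate(sources)
    impact = {}
    for left_out in sources:
        remaining = [s for s in sources if s != left_out]
        impact[left_out] = evaluate(remaining) - full_score
    return impact

sources = ["network", "gene ontology", "literature", "variants"]

def count_sources(active):
    """Stand-in scorer: counts sources instead of training and scoring a model."""
    return len(active)

impact = leave_one_source_out(sources, count_sources)
```

With a real scorer, the most negative entry would point to the source whose removal hurts performance most.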
Top Literature Features
Category: Tokens
proteins & complexes: cx32, kisspeptin, -308, t308, smn2, ns5, trap-positive, mpp+-induced, 1-methyl-4-phenylpyridinium, tnf-alpha-mediated, tnf-alpha-stimulated, tnf–mediated, ink4a/arf, ns4b, hmsh6, fukutin, cdtb, ns5b, apoai, tnf–stimulated, ns4a, tnf-alpha-, rhbmp-2, tnf-alpha-treated, frataxin, ki-ras, connexin32, tcdb, recql4, =-galcer, tyrosinase-related, hpms2, her4, cd40-cd40l, lmp2a, ryrs, mg2+-atpase, ews-fli1, abeta42, fancc, p40phox, her1, bdnf-induced, trap+, gfap-ir, daf-16/foxo, hdl3, -238, [tnf-alpha], cd40/cd40l, tnf–treated, anti-ngf, tep1, recq, nt-4, pfemp1, zo-2, nphp1, tnf-alpha-dependent, pomt1, igm-positive, apoa-ii, p110alpha, fancf, tbx4, anti-cd40l, igg
genes: hmsh2, cx26, fkrp, smn1, cln3, nphp4, mn1, nnt, apex2, akt-2
pathways: ras/raf/mek/erk, pi3k-akt-mtor
diseases/phenotypes: cmt1a, hnpp, hdl2, cln2, hpp, fmf, rtt, hnpcc, charcot-marie-tooth, amenorrhea, rett, anticardiolipin
misc.: sheldrick, shelxl97, bruker, farrugia, ortep-3, platon, shelxs97, spek, sgdid, wlds, caii, aoa, tdf, crysalis, wingx, amf
Conclusions
• The literature provides a significant resource for
biological function prediction
• The literature provides one ‘view’ of biological
knowledge and is best combined with other resources
• Even some simple strategies for extracting
associations from the literature can provide valuable
information, taken at large scale
– “bag of words” and co-occurrence models reasonable
starting point: capture implied relationships
– scope for integration of more targeted extracted
relationships (e.g. protein-protein interactions), with the
usual Precision/Recall tradeoff
Acknowledgements
• Los Alamos National Laboratory
– Michael Wall
• Colorado
– Larry Hunter (U. Colorado Denver)
– Christopher Funk (U. Colorado Denver)
– Asa Ben-Hur (Colorado State University)
– Indika Kahanda (Colorado State University)
• NICTA Victoria Research Laboratory
– Geoffrey Macintyre (U. Cambridge)
– Antonio Jimeno Yepes (IBM Research Australia)
– Cheng-Soon Ong (NICTA Canberra)
• Funding:
US NIH, US NSF, NICTA, Australian Research Council
Machine learning for text analysis
Training set
Notes + labels
for classes of interest
Machine learning
algorithm
Words, Phrases,
Linguistic categories;
names of entities;
Domain concepts;
Document features
Biomedical
knowledge sources
UMLS
OBOs
Language processing
Model
Relating features
of the text to
classes of interest
Machine learning for text analysis
New text
to be classified
Words, Phrases,
Linguistic categories;
names of entities;
Domain concepts;
Document features
Biomedical
knowledge sources
UMLS
OBOs
Language processing
Model
Predicted
Classification
(label)

 
basicmodesofventilation2022-220313203758.pdf
basicmodesofventilation2022-220313203758.pdfbasicmodesofventilation2022-220313203758.pdf
basicmodesofventilation2022-220313203758.pdf
 
A Classical Text Review on Basavarajeeyam
A Classical Text Review on BasavarajeeyamA Classical Text Review on Basavarajeeyam
A Classical Text Review on Basavarajeeyam
 
Dehradun #ℂall #gIRLS Oyo Hotel 9719300533 #ℂall #gIRL in Dehradun
Dehradun #ℂall #gIRLS Oyo Hotel 9719300533 #ℂall #gIRL in DehradunDehradun #ℂall #gIRLS Oyo Hotel 9719300533 #ℂall #gIRL in Dehradun
Dehradun #ℂall #gIRLS Oyo Hotel 9719300533 #ℂall #gIRL in Dehradun
 
Cervical & Brachial Plexus By Dr. RIG.pptx
Cervical & Brachial Plexus By Dr. RIG.pptxCervical & Brachial Plexus By Dr. RIG.pptx
Cervical & Brachial Plexus By Dr. RIG.pptx
 
Top 10 Best Ayurvedic Kidney Stone Syrups in India
Top 10 Best Ayurvedic Kidney Stone Syrups in IndiaTop 10 Best Ayurvedic Kidney Stone Syrups in India
Top 10 Best Ayurvedic Kidney Stone Syrups in India
 
Evaluation of antidepressant activity of clitoris ternatea in animals
Evaluation of antidepressant activity of clitoris ternatea in animalsEvaluation of antidepressant activity of clitoris ternatea in animals
Evaluation of antidepressant activity of clitoris ternatea in animals
 
Vision-1.pptx, Eye structure, basics of optics
Vision-1.pptx, Eye structure, basics of opticsVision-1.pptx, Eye structure, basics of optics
Vision-1.pptx, Eye structure, basics of optics
 
Flu Vaccine Alert in Bangalore Karnataka
Flu Vaccine Alert in Bangalore KarnatakaFlu Vaccine Alert in Bangalore Karnataka
Flu Vaccine Alert in Bangalore Karnataka
 
TEST BANK for Operations Management, 14th Edition by William J. Stevenson, Ve...
TEST BANK for Operations Management, 14th Edition by William J. Stevenson, Ve...TEST BANK for Operations Management, 14th Edition by William J. Stevenson, Ve...
TEST BANK for Operations Management, 14th Edition by William J. Stevenson, Ve...
 
Tom Selleck Health: A Comprehensive Look at the Iconic Actor’s Wellness Journey
Tom Selleck Health: A Comprehensive Look at the Iconic Actor’s Wellness JourneyTom Selleck Health: A Comprehensive Look at the Iconic Actor’s Wellness Journey
Tom Selleck Health: A Comprehensive Look at the Iconic Actor’s Wellness Journey
 
Hemodialysis: Chapter 3, Dialysis Water Unit - Dr.Gawad
Hemodialysis: Chapter 3, Dialysis Water Unit - Dr.GawadHemodialysis: Chapter 3, Dialysis Water Unit - Dr.Gawad
Hemodialysis: Chapter 3, Dialysis Water Unit - Dr.Gawad
 
CDSCO and Phamacovigilance {Regulatory body in India}
CDSCO and Phamacovigilance {Regulatory body in India}CDSCO and Phamacovigilance {Regulatory body in India}
CDSCO and Phamacovigilance {Regulatory body in India}
 
Pharynx and Clinical Correlations BY Dr.Rabia Inam Gandapore.pptx
Pharynx and Clinical Correlations BY Dr.Rabia Inam Gandapore.pptxPharynx and Clinical Correlations BY Dr.Rabia Inam Gandapore.pptx
Pharynx and Clinical Correlations BY Dr.Rabia Inam Gandapore.pptx
 
Effective-Soaps-for-Fungal-Skin-Infections.pptx
Effective-Soaps-for-Fungal-Skin-Infections.pptxEffective-Soaps-for-Fungal-Skin-Infections.pptx
Effective-Soaps-for-Fungal-Skin-Infections.pptx
 
ANATOMY AND PHYSIOLOGY OF URINARY SYSTEM.pptx
ANATOMY AND PHYSIOLOGY OF URINARY SYSTEM.pptxANATOMY AND PHYSIOLOGY OF URINARY SYSTEM.pptx
ANATOMY AND PHYSIOLOGY OF URINARY SYSTEM.pptx
 
Triangles of Neck and Clinical Correlation by Dr. RIG.pptx
Triangles of Neck and Clinical Correlation by Dr. RIG.pptxTriangles of Neck and Clinical Correlation by Dr. RIG.pptx
Triangles of Neck and Clinical Correlation by Dr. RIG.pptx
 
How STIs Influence the Development of Pelvic Inflammatory Disease.pptx
How STIs Influence the Development of Pelvic Inflammatory Disease.pptxHow STIs Influence the Development of Pelvic Inflammatory Disease.pptx
How STIs Influence the Development of Pelvic Inflammatory Disease.pptx
 
ARTHROLOGY PPT NCISM SYLLABUS AYURVEDA STUDENTS
ARTHROLOGY PPT NCISM SYLLABUS AYURVEDA STUDENTSARTHROLOGY PPT NCISM SYLLABUS AYURVEDA STUDENTS
ARTHROLOGY PPT NCISM SYLLABUS AYURVEDA STUDENTS
 
Lung Cancer: Artificial Intelligence, Synergetics, Complex System Analysis, S...
Lung Cancer: Artificial Intelligence, Synergetics, Complex System Analysis, S...Lung Cancer: Artificial Intelligence, Synergetics, Complex System Analysis, S...
Lung Cancer: Artificial Intelligence, Synergetics, Complex System Analysis, S...
 

Function and Phenotype Prediction through Data and Knowledge Fusion

  • 1. Function and Phenotype Prediction through Data and Knowledge Fusion Karin M. Verspoor, The University of Melbourne karin.verspoor@unimelb.edu.au 27 January 2016 – King Abdullah University of Science and Technology, Computational Bioscience Research Center
  • 2. We have the blueprints to life, but we don’t know how to read them. • At least a quarter of protein families in PFAM have no known function (Domains of Unknown Function) • Millions of proteins uncharacterised
  • 3. From sequence to function
  • 4. What is protein function? The Gene Ontology (GO) provides a vocabulary • Captures biological process, molecular function, cellular component • Common representation for model organism databases to facilitate sharing
  • 5. What about phenotype? Human Phenotype Ontology
  • 7. Exponential knowledge growth • ~1550 peer-reviewed gene-related databases in NAR online Mol Bio collection • Over 25 million PubMed entries (> 2,000/day) • Breakdown of disciplinary boundaries makes more of it relevant to each of us • “Like drinking from a firehose” – Jim Ostell (NCBI IEB Chief)
  • 8. Text as a primary source of knowledge Despite ever-increasing structured resources, the literature remains the primary repository of knowledge in biomedicine. [Figure: number of Swiss-Prot proteins missing a FUNCTION comment vs. gaining a FUNCTION comment, 1/02 to 1/07; y-axis 0 to 120,000] “Manual curation is not sufficient for annotation of genomic databases” Baumgartner et al., Bioinformatics (ISMB 2007)
  • 9. Why biomedical text mining? [Figure: exponential growth in the size of PubMed; publications per year, 1914 to 2014; y-axis 0 to 1,200,000]
  • 10. Data sources, Data Integration • Structured Resources – Largely manually ‘curated’, high quality – Often unannotated – Organizes targeted information – Computable • Unstructured Resources – Literature: peer reviewed, well-formed – Natural Language: ambiguity, complexity – Broad, current coverage of biological knowledge – Intended for Human communication
  • 11. Bio Text Analysis in a nutshell • Input: documents • Processing steps: pre-processing (e.g., format conversion); tokenization; sentence detection; term normalisation (e.g., stem, lemmatise); biological named entity recognition; biological concept recognition; syntactic analysis; coordination resolution; co-reference resolution; ambiguity resolution; entity linking; relation extraction; event annotation; reasoning and inference • Domain knowledge: terminologies, ontologies, known relationships • Output: annotated documents; extracted facts and relationships
  • 12. GO Function Prediction Sokolov and Ben-Hur. J Bioinform Comput Biol. 2010 Apr;8(2):357-76. Sokolov, Funk, Graim, Verspoor, Ben-Hur. BMC Bioinformatics. 2013;14 Suppl 3:S10.
  • 13. GOstruct: Structured output SVM [Diagram: a cross-species view and a species-specific view. Features shown: sequence-based features; co-mention, PPI, gene expression. Labels shown: mouse GO annotations, human GO annotations. Each view feeds structured SVM training.] Multi-view: f(x,y) = f^(c)(x,y) + f^(s)(x,y)
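
The multi-view formula on this slide can be illustrated with a small sketch. It assumes prediction picks the candidate label set with the highest summed score from the cross-species view f^(c) and the species-specific view f^(s). The scorers, the candidate list and all names below are hypothetical toy stand-ins, not the trained GOstruct models.

```python
from typing import Callable, FrozenSet, Iterable

LabelSet = FrozenSet[str]
Scorer = Callable[[str, LabelSet], float]


def predict(protein: str, candidates: Iterable[LabelSet],
            f_cross: Scorer, f_species: Scorer) -> LabelSet:
    """Return the candidate y maximising f_c(x, y) + f_s(x, y)."""
    return max(candidates,
               key=lambda y: f_cross(protein, y) + f_species(protein, y))


# Hypothetical toy scorers standing in for the two trained views.
def toy_cross(protein: str, labels: LabelSet) -> float:
    return 1.0 if "GO:0008237" in labels else 0.0


def toy_species(protein: str, labels: LabelSet) -> float:
    return 0.5 * len(labels & {"GO:0006508", "GO:0008237"})


candidates = [
    frozenset({"GO:0008237"}),
    frozenset({"GO:0006508"}),
    frozenset({"GO:0008237", "GO:0006508"}),
]
best = predict("P50281", candidates, toy_cross, toy_species)
```

In practice the space of candidate label sets is far too large to enumerate, so a real system needs a search or inference procedure in place of the plain `max` used here.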
  • 14. Structured output • Represent a set of annotations as a single vector • Encodes the hierarchical structure from annotation to root
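
A minimal sketch of the encoding described on this slide, assuming a toy is_a hierarchy with invented term names. A set of annotations becomes one binary vector in which every annotated term and all of its ancestors up to the root are switched on.

```python
from typing import Dict, List, Set

# Toy hierarchy (child -> parents). Term names are hypothetical.
PARENTS: Dict[str, List[str]] = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "A1": ["A"],
    "AB": ["A", "B"],
}
ORDER = ["root", "A", "B", "A1", "AB"]  # fixed position of each term


def ancestors_inclusive(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return the term together with every ancestor on a path to the root."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def encode(annotations: Set[str], parents: Dict[str, List[str]],
           order: List[str]) -> List[int]:
    """Encode a set of annotations as a single hierarchy-consistent vector."""
    closed: Set[str] = set()
    for term in annotations:
        closed |= ancestors_inclusive(term, parents)
    return [1 if term in closed else 0 for term in order]


vector = encode({"A1"}, PARENTS, ORDER)
```

Because the vector always contains the full path to the root, any two label vectors can be compared directly, which is what lets a structured model score whole annotation sets rather than individual terms.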
  • 15. GOstruct approach “What functions does this protein perform?”
  • 16. Feature integration via kernels • Cross-species (sequence-based) features – e-values from significant BLAST hits – features from WoLF PSORT protein localization software – transmembrane protein prediction using TMHMM – k-mer composition of the N and C termini – low complexity regions • Species-specific features – Protein interactions – Gene Expression – Phylogenetic profiles – Text-derived features
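
One common way to integrate heterogeneous feature sets through kernels is to compute a kernel per feature set and add them up. The sketch below assumes normalised linear kernels and an unweighted sum. Both choices, and the feature-set names, are illustrative assumptions rather than the configuration used in the talk.

```python
import math
from typing import Dict, Sequence

FeatureSets = Dict[str, Sequence[float]]


def linear_kernel(u: Sequence[float], v: Sequence[float]) -> float:
    """Plain dot product between two feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalised_kernel(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine-normalised linear kernel, so each feature set is on one scale."""
    denom = math.sqrt(linear_kernel(u, u) * linear_kernel(v, v))
    return linear_kernel(u, v) / denom if denom else 0.0


def combined_kernel(x1: FeatureSets, x2: FeatureSets) -> float:
    """Sum the per-feature-set kernels over the sets both proteins have."""
    shared = x1.keys() & x2.keys()
    return sum(normalised_kernel(x1[name], x2[name]) for name in shared)


# Hypothetical proteins with two feature sets each.
p = {"blast": [1.0, 0.0], "text": [2.0, 0.0]}
q = {"blast": [1.0, 0.0], "text": [0.0, 3.0]}
similarity = combined_kernel(p, q)
```

A sum of valid kernels is itself a valid kernel, which is why new data sources such as text-derived features can be added without changing the learning algorithm.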
  • 17. Extraction & Analysis pipeline Christopher Funk (2015) PhD dissertation, U. Colorado Denver
  • 18. Integrating Text • Protein – Gene Ontology term co-occurrence • Protein – Protein co-occurrence Sokolov et al. BMC Bioinformatics 2013, 14(Suppl 3):S10
  • 19. Text-based features • Words – (tokens) • Entities or Concepts – (gene/protein mentions) – (gene ontology concepts) • Relations – (simple co-occurrences)
  • 20. Feature Extraction from text Target: P50281 – Matrix metalloproteinase 14 (MMP14)
  • 21. Feature Extraction Target: P50281 – Matrix metalloproteinase 14 (MMP14)
  • 22. Feature Extraction Target: P50281 – Matrix metalloproteinase 14 (MMP14) Bag of words: WordsSent1(membrane, otherwise, known, … , proteolytic, enzyme, known, extracellular, invasion, … , progression) WordsSent2(protein, and, message, levels, of, was , …)
  • 23. Feature Extraction Target: P50281 – Matrix metalloproteinase 14 (MMP14) Protein GO term co-mentions: sent_comen(P50281, GO:0008237) sent_comen(P50281, GO:0006508) sent_comen(P50281, GO:0009056) sent_comen(P50281, GO:0031012) nonSent_comen(P50281, GO:0010467) nonSent_comen(P50281, GO:0005623)
  • 24. Feature Extraction Target: P50281 – Matrix metalloproteinase 14 (MMP14) Protein GO term co-mentions: nonSent_comen(P50281, GO:0008237) nonSent_comen(P50281, GO:0006508) nonSent_comen(P50281, GO:0009056) nonSent_comen(P50281, GO:0031012) nonSent_comen(P50281, GO:0010467) nonSent_comen(P50281, GO:0005623)
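
The co-mention features on these slides can be sketched as follows. The sketch assumes entity and concept recognition has already run, so each sentence arrives as a pair of ID sets. It also assumes one possible reading of "non-sentence": a protein and a GO term that appear in the same passage but never in the same sentence. The function name and input layout are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import List, Set, Tuple

Sentence = Tuple[Set[str], Set[str]]  # (protein IDs, GO IDs) in one sentence


def co_mentions(passage: List[Sentence]) -> Tuple[Counter, Counter]:
    """Count sentence-level and non-sentence protein/GO co-mentions."""
    sent: Counter = Counter()
    passage_proteins: Set[str] = set()
    passage_terms: Set[str] = set()
    for proteins, terms in passage:
        # Every protein/term pair inside one sentence is a sentence co-mention.
        sent.update(product(proteins, terms))
        passage_proteins |= proteins
        passage_terms |= terms
    # Assumption: a non-sentence co-mention is a passage-level pair that
    # never shares a sentence; it is counted once per passage.
    non_sent = Counter(
        pair for pair in product(passage_proteins, passage_terms)
        if pair not in sent
    )
    return sent, non_sent


passage = [
    ({"P50281"}, {"GO:0008237", "GO:0006508"}),
    (set(), {"GO:0010467"}),
]
sent_counts, non_sent_counts = co_mentions(passage)
```

Running this over every passage that mentions a protein and summing the counters gives per-protein counts of the kind shown in the feature representation slides.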
  • 25. Feature Representation Target: P50281 – Matrix metalloproteinase 14 (MMP14) Bag of Words: P50281, known=2, membrane=1, protein=1, proteolytic=1, enzyme=1, … Protein GO term co-mentions (sentence): P50281, GO:0008237=1, GO:0006508=1, GO:0009056=1, GO:0031012=1 Protein GO term co-mentions (non-sentence): P50281, GO:0010467=2, GO:0005623=2
  • 26. Feature Representation Bag of Words: UniprotID1, w1=countw1, w2=countw2, w3=countw3, … , wi=countwi UniprotID2, w1=countw1, w2=countw2, w3=countw3, … , wi=countwi … UniprotIDi, w1=countw1, w2=countw2, w3=countw3, … , wi=countwi Protein GO term co-mentions (sentence): UniprotID1, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi UniprotID2, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi … UniprotIDi, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi Protein GO term co-mentions (non-sentence): UniprotID1, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi UniprotID2, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi … UniprotIDi, GO:1=countGO1, GO:2=countGO2, … , GO:i=countGOi
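
The sparse "ID, feature=count" layout above can be produced with a few lines of code. This is a sketch only: whitespace tokenisation, lowercasing and alphabetical feature order are simplifying assumptions, and the example sentences are invented for illustration.

```python
from collections import Counter
from typing import Iterable, Mapping


def bag_of_words(sentences: Iterable[str]) -> Counter:
    """Count lowercased whitespace tokens across a protein's sentences."""
    counts: Counter = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().split())
    return counts


def feature_line(uniprot_id: str, counts: Mapping[str, int]) -> str:
    """Serialise counts as 'UniprotID, f1=count1, f2=count2, ...'."""
    parts = [uniprot_id] + [f"{name}={n}" for name, n in sorted(counts.items())]
    return ", ".join(parts)


# Invented example sentences mentioning the target protein.
sentences = ["MMP14 is a known membrane protein", "a known proteolytic enzyme"]
words = bag_of_words(sentences)
line = feature_line("P50281", words)
```

The same `feature_line` helper works for the GO co-mention counts, since those are also a mapping from feature name to count.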
  • 27. An aside on GO concept recognition • Given: – Gene Ontology (~46,000 concepts) – Text: “In mice lacking ephrin-A5 function, cell proliferation and survival of newborn neurons…” (PMID 20474079) • Return: – GO:0008283 cell proliferation – GO:0005125 cytokine activity – GO:0048666 neuron development (can be based on a judgment about the depth of experimental evidence)
  • 28. (CRAFT example) Previous in vitro experiments using renal cell lines suggest recessive Aqp2 mutations result in improper trafficking of the mutant water pore. GO:0005623 – “cell” CL:0000000 – “cell” PR:000004182 – “aquaporin-2” EG:359 – “Aqp2” SO:0001059 – “sequence_alteration” GO:0006810 – “transport” SO:0001059 – “sequence_alteration” GO:0015250 – “water channel activity” CHEBI:15377 – “water”
  • 29. GO:0006900 – membrane budding [Term] id: GO:0006900 name: membrane budding … def: "The evagination of a membrane, resulting in formation of a vesicle." … synonym: "membrane evagination" synonym: "nonselective vesicle assembly" synonym: "vesicle biosynthesis" synonym: "vesicle formation" … Variation in PMID: 12925238 • Lipid rafts play a key role in membrane budding… • …involvement of annexin A7 in budding of vesicles… • …Ca2+-mediated vesiculation process was not impaired. • Red blood cells which lack the ability to vesiculate cause… • Having excluded a direct role in vesicle formation… GO vs NL
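
The gap between ontology terms and natural language can be made concrete with a naive dictionary matcher. The sketch below uses the name and synonyms of GO:0006900 listed on this slide and plain case-insensitive phrase matching. It is a deliberately simple baseline, not one of the concept recognition tools evaluated in the talk.

```python
import re
from typing import Dict, List, Set

# Name and synonyms for GO:0006900, as listed on the slide.
LEXICON: Dict[str, List[str]] = {
    "GO:0006900": [
        "membrane budding",
        "membrane evagination",
        "nonselective vesicle assembly",
        "vesicle biosynthesis",
        "vesicle formation",
    ],
}


def find_concepts(text: str, lexicon: Dict[str, List[str]]) -> Set[str]:
    """Return IDs whose name or synonym occurs as a whole phrase in the text."""
    lowered = text.lower()
    return {
        go_id
        for go_id, names in lexicon.items()
        if any(re.search(r"\b" + re.escape(name) + r"\b", lowered)
               for name in names)
    }


hit = find_concepts("Lipid rafts play a key role in membrane budding", LEXICON)
miss = find_concepts("involvement of annexin A7 in budding of vesicles", LEXICON)
```

Exact matching finds "membrane budding" and "vesicle formation" but misses "budding of vesicles" and "vesiculate", which is why the tools compared on the next slide expose options such as stemming, word-order independence and synonym handling.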
  • 30. Comparing tool performance on CR [Figure: precision vs. recall, best performance for all tools on all ontologies, with F-measure contours f=0.1 to f=0.9. Systems: MetaMap, Concept Mapper, NCBO Annotator. Ontologies: GO_CC, GO_MF, GO_BP, SO, CL, PR, NCBITAXON, CHEBI] • NCBO Annotator (96 combinations): wholeWordOnly, filterNumber, stopWords, stopWordsCaseSensitive, minTermSize, withSynonyms • MetaMap (864 combinations): model, gaps, wordOrder, acronymAbb, derivationalVars, scoreFilter, minTermSize • Concept Mapper (576 combinations): searchStrategy, caseMatch, stemmer, orderIndependentLookup, findAllMatches, stopWords, synonyms Funk et al. BMC Bioinformatics 2014, Feb 26;15:59.
  • 31. Literature alone is useful [Figure: macro-averaged F-measure (axis 0 to 0.7) per Gene Ontology branch (MF, BP, CC) for: Baseline (co-mentions as predictions), Co-mentions, BoW, Co-mentions + BoW]
  • 32. Literature features approach performance of commonly used biological features [Figure: macro-averaged F-measure (axis 0 to 0.8) per Gene Ontology branch (MF, BP, CC) for: Trans/Localization, Homology, Network, Literature, All Combined] (and combining them with other features is even better!)
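
For readers unfamiliar with the metric on the y-axis, here is a sketch of one common definition of macro-averaged F-measure: compute F1 separately for each term, then take the unweighted mean over terms. The exact averaging protocol behind the figure is not stated on the slide, so treat this definition and the toy data as assumptions.

```python
from typing import Dict, Set

Annotations = Dict[str, Set[str]]  # protein ID -> set of terms


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold: Annotations, predicted: Annotations) -> float:
    """Unweighted mean of per-term F1 over all terms seen in gold or predicted."""
    terms = set().union(*gold.values(), *predicted.values())
    if not terms:
        return 0.0
    proteins = gold.keys() | predicted.keys()
    total = 0.0
    for term in terms:
        tp = fp = fn = 0
        for protein in proteins:
            in_gold = term in gold.get(protein, set())
            in_pred = term in predicted.get(protein, set())
            if in_gold and in_pred:
                tp += 1
            elif in_pred:
                fp += 1
            elif in_gold:
                fn += 1
        total += f1_score(tp, fp, fn)
    return total / len(terms)


# Toy data with hypothetical term names.
gold = {"p1": {"A", "B"}, "p2": {"A"}}
predicted = {"p1": {"A"}, "p2": {"A", "B"}}
score = macro_f1(gold, predicted)
```

Under this definition, the baseline named in the figure would correspond to passing the raw co-mention pairs as `predicted`, with no learning step in between.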
  • 33. Manual inspection of misclassifications Some false positives appear to have literature support: • GCNT1 – carbohydrate metabolic process (Q02742 - GO:0005975) Genes related to carbohydrate metabolism include PPP1R3C, B3GNT1, and GCNT1… [PMID:23646466] • CERS2 – ceramide biosynthetic process (Q96G23 - GO:0046513) …CersS2, which uses C22-CoA for ceramide synthesis… [PMID:22144673]
  • 35. Results: different sources Mouse annotations from geneontology.org
  • 36. Phenotype Prediction Kahanda I, Funk C, Verspoor K and Ben-Hur A. F1000Research 2015, 4:259 (doi: 10.12688/f1000research.6670.1)
  • 37. PHENOstruct: Human Phenotype Ontology • Organ, Inheritance, Onset subontologies have separate models
  • 39. PHENOstruct Features • Network (functional association data) – protein-protein interactions – co-expression – co-localization – From BioGRID, STRING, GeneMANIA • Gene Ontology (experimental) annotations • Literature mined data: bag of words in gene sentences • Genetic variants (protein -> disease -> variants)
  • 40. Performance
      Subont.      Terms   Method         AUC    P-value
      Organ        1,796   Binary SVMs    0.66   1.70E-262
                           Clus-HMC-Ens   0.65   0.00E+00
                           PHENOstruct    0.73   -
      Inheritance  12      Binary SVMs    0.72   2.20E-01
                           Clus-HMC-Ens   0.73   7.30E-01
                           PHENOstruct    0.74   -
      Onset        23      Binary SVMs    0.62   4.40E-03
                           Clus-HMC-Ens   0.58   3.30E-05
                           PHENOstruct    0.64   -
  • 41. PHENOstruct in Organ subontology
  • 42. Gold vs Predicted, P43681 [Figure: gold and predicted annotations shown side by side] Hierarchical, protein-centric P = 1.0; R = 0.62
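
A sketch of how hierarchical, protein-centric precision and recall can be computed for a single protein. It assumes the usual approach of expanding both the gold and predicted term sets with all their ancestors before comparing them. The toy hierarchy, the term names and the decision to count the root are illustrative assumptions, not details taken from the slide.

```python
from typing import Dict, List, Set, Tuple

# Toy hierarchy (child -> parents). Term names are hypothetical.
PARENTS: Dict[str, List[str]] = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "A1": ["A"],
}


def close_upwards(terms: Set[str], parents: Dict[str, List[str]]) -> Set[str]:
    """Add every ancestor of every term (the root is counted here)."""
    closed: Set[str] = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents[term])
    return closed


def hierarchical_pr(gold: Set[str], predicted: Set[str],
                    parents: Dict[str, List[str]]) -> Tuple[float, float]:
    """Precision and recall over ancestor-expanded term sets for one protein."""
    gold_closed = close_upwards(gold, parents)
    pred_closed = close_upwards(predicted, parents)
    overlap = len(gold_closed & pred_closed)
    precision = overlap / len(pred_closed) if pred_closed else 0.0
    recall = overlap / len(gold_closed) if gold_closed else 0.0
    return precision, recall


# A correct but less specific prediction than the gold annotations.
precision, recall = hierarchical_pr({"A1", "B"}, {"A"}, PARENTS)
```

The toy case shows the same pattern as the slide: when everything predicted is correct but less complete or less specific than the gold annotations, precision is perfect and only recall drops.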
  • 43. Impact of data sources
  • 45. Top Literature Features (Category: Tokens)
      – proteins & complexes: cx32, kisspeptin, -308, t308, smn2, ns5, trap-positive, mpp+-induced, 1-methyl-4-phenylpyridinium, tnf-alpha-mediated, tnf-alpha-stimulated, tnf–mediated, ink4a/arf, ns4b, hmsh6, fukutin, cdtb, ns5b, apoai, tnf–stimulated, ns4a, tnf-alpha-, rhbmp-2, tnf-alpha-treated, frataxin, ki-ras, connexin32, tcdb, recql4, =-galcer, tyrosinase-related, hpms2, her4, cd40-cd40l, lmp2a, ryrs, mg2+-atpase, ews-fli1, abeta42, fancc, p40phox, her1, bdnf-induced, trap+, gfap-ir, daf-16/foxo, hdl3, -238, [tnf-alpha], cd40/cd40l, tnf–treated, anti-ngf, tep1, recq, nt-4, pfemp1, zo-2, nphp1, tnf-alpha-dependent, pomt1, igm-positive, apoa-ii, p110alpha, fancf, tbx4, anti-cd40l, igg
      – genes: hmsh2, cx26, fkrp, smn1, cln3, nphp4, mn1, nnt, apex2, akt-2
      – pathways: ras/raf/mek/erk, pi3k-akt-mtor
      – diseases/phenotypes: cmt1a, hnpp, hdl2, cln2, hpp, fmf, rtt, hnpcc, charcot-marie-tooth, amenorrhea, rett, anticardiolipin
      – misc.: sheldrick, shelxl97, bruker, farrugia, ortep-3, platon, shelxs97, spek, sgdid, wlds, caii, aoa, tdf, crysalis, wingx, amf
  • 46. Conclusions • The literature provides a significant resource for biological function prediction • The literature provides one ‘view’ of biological knowledge and is best combined with other resources • Even some simple strategies for extracting associations from the literature can provide valuable information, taken at large scale – “bag of words” and co-occurrence models reasonable starting point: capture implied relationships – scope for integration of more targeted extracted relationships (e.g. protein-protein interactions), with the usual Precision/Recall tradeoff
  • 47. Acknowledgements • Los Alamos National Laboratory – Michael Wall • Colorado – Larry Hunter (U. Colorado Denver) – Christopher Funk (U. Colorado Denver) – Asa Ben-Hur (Colorado State University) – Indika Kahanda (Colorado State University) • NICTA Victoria Research Laboratory – Geoffrey Macintyre (U. Cambridge) – Antonio Jimeno Yepes (IBM Research Australia) – Cheng-Soon Ong (NICTA Canberra) • Funding: US NIH, US NSF, NICTA, Australian Research Council
  • 49. Machine learning for text analysis • Training set: notes + labels for classes of interest • Language processing, drawing on biomedical knowledge sources (UMLS, OBOs) • Features: words, phrases, linguistic categories; names of entities; domain concepts; document features • Machine learning algorithm • Model: relating features of the text to classes of interest
  • 50. Machine learning for text analysis • New text to be classified • Language processing, drawing on biomedical knowledge sources (UMLS, OBOs) • Features: words, phrases, linguistic categories; names of entities; domain concepts; document features • Model • Predicted classification (label)