DextMP: Deep dive into Text for predicting Moonlighting Proteins
Ishita K. Khan (1), Mansurul Bhuiyan (3) & Daisuke Kihara (1,2)
(1) Department of Computer Sciences and (2) Department of Biological Sciences, Purdue University, IN, USA
(3) Department of Computer Science, Indiana University-Purdue University Indianapolis, IN, USA
Bioinformatics (2017) 33 (14): i83-i91
Moonlighting Proteins
• Proteins that are involved in more than one mechanistically different, independent cellular function.
• The distinct functions are not due to splice variants, gene fusions, or pleiotropism (the same function in different pathways).
• An ancestral protein possessed a single function but developed an additional function through the course of evolution.
• The most common primary function of moonlighting proteins is enzymatic catalysis; secondary functions include signal transduction, transcriptional regulation, apoptosis, motility, etc.
Examples of Mechanisms of Moonlighting Proteins
[Figure; source: Jeffery CJ, TIBS 1999]
Examples of Moonlighting Proteins
Protein | ID | # Domains | Function 1 | Function 2 | Cause
--------|----|-----------|------------|------------|------
Aconitase | Q99798 | 2 | TCA cycle enzyme | Iron homeostasis | Fe concentration
Fructose-bisphosphate aldolase | Q968V9 | 1 | Glycolytic enzyme | Host-cell invasion | independent functions
Phosphopantothenoylcysteine decarboxylase subunit VHS3 | Q08438 | 1 | halotolerance determinant | coenzyme A biosynthesis | independent functions
cAMP-dependent transcription factor ATF-2 | P15336 | 1 | transcription factor | DNA damage response | radiation stress
Dihydrolipoyl dehydrogenase, mitochondrial (DLD) | P09622 | 4 | energy metabolism | Protease | pH in mitochondrial matrix
Vacuolar protein-sorting-associated protein 25 | Q7JXV9 | 1 | endosomal protein sorting | bicoid mRNA | independent functions
Glutamate racemase | D3FPC2 | 1 | glutamate racemase | DNA gyrase inhibitor | independent functions
STAT3 | Q99ML3 | 0 | transcription factor | Electron transport chain | mutation and phosphorylation
Galactokinase | P09608 | 3 | galactose catabolism enzyme | Induction of galactose genes | presence of galactose
Databases of Moonlighting Proteins
• MoonProt DB (Jeffery Lab): manual curation
• Multitasking Protein DB (E. Querol et al.): from review articles; keywords from PubMed
• MoonDB (Brun Lab): human MPs; literature; network-based prediction
How to Identify Moonlighting Proteins?
• From currently available annotations (UniProt)
  - Most moonlighting proteins are not labeled with terms such as “moonlighting”, “dual function”, or “multitasking”
  1. Are current GO annotations useful for finding novel moonlighting proteins?
  2. By text mining?
• From large-scale omics data
  - Without GO annotations
  - Do moonlighting proteins have any characteristic features in protein-protein interactions, co-expressed genes, phylogenetic profiles, genetic interactions, etc.?
GO-Based Identification Applied to the E. coli Genome
• E. coli proteins with GO term annotation: 4146 proteins
• Clustering profile: MP: 140 proteins; Non-MP: 150 proteins
  - Moonlighting proteins: (1) > 8 GO terms; (2) > 2 clusters at 0.1 score; (3) > 4 clusters at 0.4 score
  - Non-moonlighting proteins: (1) > 8 GO terms; (2) 1 cluster at 0.1 score; (3) 1 cluster at 0.4 score
• Literature survey: 43 proteins; 33 proteins with dual functions that do not originate from multiple domains
(Khan et al., Biology Direct, 2014)
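The GO-based selection criteria above can be sketched as a small labeling function. This is only an illustration: it assumes the three numbered conditions must all hold together, and the names `GoProfile`, `label_candidate`, and the `"unlabeled"` fallback are hypothetical, not taken from the slides or the paper.

```python
from dataclasses import dataclass


@dataclass
class GoProfile:
    """GO clustering summary for one protein (hypothetical container)."""
    n_go_terms: int       # number of GO terms annotated to the protein
    clusters_at_01: int   # number of GO-term clusters at score cutoff 0.1
    clusters_at_04: int   # number of GO-term clusters at score cutoff 0.4


def label_candidate(p: GoProfile) -> str:
    """Apply the slide's thresholds, assuming all three conditions are required."""
    if p.n_go_terms <= 8:
        return "unlabeled"  # both classes require more than 8 GO terms
    if p.clusters_at_01 > 2 and p.clusters_at_04 > 4:
        return "MP"
    if p.clusters_at_01 == 1 and p.clusters_at_04 == 1:
        return "non-MP"
    return "unlabeled"      # assumption: anything in between is left out
```

Proteins that fall between the two definitions are left unlabeled in this sketch, which would keep the two classes well separated; whether the original study treated them that way is not stated on the slide.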
Features Considered:
• GO annotations (GO)
• PPI network (PPI)
• gene expression profiles (GE)
• phylogenetic profiles (PE)
• genetic interactions (GI)
• disordered protein regions (DOR)
• graph properties of PPI (NET)
Dataset for DextMP
• Moonlighting Proteins (MPs): from the MoonProt DB
• Non-MPs: selected by applying the non-MP criteria to human, E. coli, yeast, and mouse proteins
• Text information taken from UniProt
Khan, Bhuiyan, & Kihara, Bioinformatics (2017) 33 (14): i83-i91
The Number of Abstracts Available for MPs and non-MPs
[Figure not recovered]
Workflow of DextMP
[Figure: workflow diagram with stages labeled "Text Level Prediction" and "Protein Level Prediction"]
3 Language Models
• Bag-of-Words: Term Frequency-Inverse Document Frequency (TFIDF)
  - N-dimensional vector (N: dictionary size of the corpus)
  - TFIDF(w) = TF(w) * IDF(w)
  - TF(w) = (# of occurrences of w in a text) / (total # of words in the text)
  - IDF(w), Inverse Document Frequency = log(total # of texts in the corpus / # of texts containing w)
• Latent Dirichlet Allocation (LDA)
  - A text is characterized by a set of latent topics, each of which has a distribution over words
  - Dirichlet-multinomial distributions map documents to topics and topics to words
• Deep learning
  - Constructs feature vectors so that similar texts appear close to each other
  - DEEP: trained on texts from MPs and non-MPs
  - PDEEP: pre-trained on all texts in UniProt: 1,060,520 titles and 551,056 function descriptions
  - paragraph2vec
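The TFIDF definitions above can be made concrete with a short, standard-library sketch. It assumes whitespace tokenization and lowercasing, which the slides do not specify, and the function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return (vocabulary, one TFIDF vector per text) using the slide's formulas."""
    # Assumption: lowercase, whitespace-separated tokens.
    tokenized = [t.lower().split() for t in texts]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_texts = len(tokenized)

    # Document frequency: number of texts that contain each word.
    df = Counter(w for toks in tokenized for w in set(toks))
    # IDF(w) = log(total # of texts / # of texts containing w)
    idf = {w: math.log(n_texts / df[w]) for w in vocab}

    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks)
        if total == 0:
            vectors.append([0.0] * len(vocab))
            continue
        # TF(w) = count of w in the text / total words in the text
        vectors.append([(counts[w] / total) * idf[w] for w in vocab])
    return vocab, vectors
```

Each vector has length N, the dictionary size of the corpus. A word that occurs in every text gets IDF = log(1) = 0, so it contributes nothing to any vector.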
Hyper-Parameter Tuning
• LDA
  - # of topics: 10-100 with a step of 10
• DEEP
  - Minimum count: 1-5
  - Convolution window size: 2-8
  - Dimension size (feature vector length): 20-200 with a step of 20
• PDEEP used the same parameters as DEEP
• LR (logistic regression), SVM (support vector machine)
  - Regularization, a cost parameter
  - Kernel (linear, radial)
• RF (random forest)
  - # of trees
• GBM (gradient boosting machine)
  - Learning rate
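As an illustration of the search space for the language models, the ranges above can be expanded into explicit parameter sets. This is a sketch only: the dictionary keys (`n_topics`, `min_count`, `window`, `dim`) and the helper `expand_grid` are hypothetical names, and the classifier grids are omitted because the slide gives no value ranges for them.

```python
from itertools import product

# Ranges taken from the slide; key names are illustrative assumptions.
LDA_GRID = {"n_topics": list(range(10, 101, 10))}
DEEP_GRID = {
    "min_count": list(range(1, 6)),   # 1-5
    "window": list(range(2, 9)),      # 2-8
    "dim": list(range(20, 201, 20)),  # 20-200, step 20
}


def expand_grid(grid):
    """Enumerate every combination of values in a {name: [values]} grid."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]
```

Under these ranges, LDA has 10 settings and DEEP has 5 × 7 × 10 = 350 settings, each of which would then be paired with the classifier settings.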
5-Fold Cross Validation
• 3 subsets: training
  - Under a hyper-parameter set
  - Combinations of hyper-parameters from a language model and a classifier
• 1 subset: validation, to decide the best hyper-parameter set
• 1 subset: testing
  - F1-score reported, averaged over the 5 testing sets and weighted for MPs and non-MPs
  - F1 = 2 * (precision * recall) / (precision + recall)
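The 3/1/1 split and the weighted F1-score can be sketched as follows. Two assumptions are made that the slides do not confirm: the folds rotate so that each serves once as the test set with the next fold as validation, and "weighted" means a support-weighted average of the per-class F1-scores. The function names are hypothetical.

```python
def fold_roles(k=5):
    """Assign folds to roles for each round: 1 test, 1 validation, the rest training."""
    rounds = []
    for i in range(k):
        test = i
        validation = (i + 1) % k  # assumption: the next fold is used for validation
        train = [f for f in range(k) if f not in (test, validation)]
        rounds.append({"train": train, "validation": validation, "test": test})
    return rounds


def f1(precision, recall):
    """F1 = 2 * (precision * recall) / (precision + recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred, labels=("MP", "non-MP")):
    """Per-class F1 averaged with weights proportional to class support (assumed)."""
    total = len(y_true)
    if total == 0:
        return 0.0
    score = 0.0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        score += ((tp + fn) / total) * f1(precision, recall)
    return score
```

In each round, every hyper-parameter set would be trained on the three training folds, the set with the best validation score would be kept, and its score on the test fold would be recorded; the reported value is the mean over the five rounds.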
Text-Level Accuracy
[Figure not recovered]
Protein-Level Accuracy
[Figure; annotation: "Used for genome-scale prediction"]
Genome-Scale MP Prediction
[Figure; annotation: "MPFit: 10.97% 7.82%"]
Summary
• DextMP, a machine learning approach for identifying moonlighting proteins from text information, is presented.
• DextMP can help filter UniProt entries for potential moonlighting proteins, which can later be examined manually.
• Estimated moonlighting proteins in a genome:
  - Human: ~10-20% of proteins
  - Yeast: ~10-30% of proteins
  - Xenopus: ~5% of proteins
• Prediction relies on literature information, so there may be more moonlighting proteins in each genome.
Acknowledgements
Mansurul Bhuiyan, Ishita Khan
http://kiharalab.org
@kiharalab


Editor's Notes

  1. For P15336 (ATF2), the authors describe stress in terms of two things: ionizing radiation (IR) and UV-induced lesions. They also mention that the second function (DNA damage response) is also due to ATF2's association with a member of a chromatin remodeling complex. "Independent functions" are the ones for which no "switch" between the two functions has been identified. In most cases, the second function was found accidentally (termed 'serendipitous' in the literature), and the papers claim that the two functions are independent.