Eccb poster
DeepSCLpred: Protein subcellular localization prediction by Deep N-to-1 neural networks
Manaz Kaleel^1, Amina Khalid^1, Tejaswini Kumar^1, Zheng Yandan^2, Chen Jialiang^2, Feng Xuanming^2, Gianluca Pollastri^1 & Catherine Mooney^1,2,∗
^1 School of Computer Science, University College Dublin, Belfield, Dublin 4, Ireland;
^2 Beijing-Dublin International College, Beijing University of Technology, Beijing, China
∗ catherine.mooney@ucd.ie
1. Introduction
Knowledge of the subcellular location of a protein provides valuable information about its function and possible interactions with other proteins. In the
post-genomic era, fast and accurate predictors of subcellular location are required if the abundance of available sequence data is to be fully exploited. We have
developed a subcellular localization predictor, DeepSCLpred, which assigns the location of a protein within the cell to one of two classes: the endomembrane
system and secretory pathway (EMS) versus all others. The predictor uses machine learning models trained in 10-fold cross-validation on a large set of
metazoan protein sequences, redundancy reduced to less than 30% sequence similarity, along with evolutionary information. The evolutionary information for each protein is encoded
in the form of profiles generated by PSI-BLAST [1]. The algorithm powering DeepSCLpred is a Deep Convolutional Neural Network, which is capable
of mapping whole protein sequences into single properties without resorting to predefined transformations, but rather by adaptively compressing the
sequence into a hidden feature vector.
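The N-to-1 idea described above, mapping a sequence of arbitrary length to a fixed-length feature vector and then to a single class, can be illustrated with a minimal sketch. It assumes a single convolution over a sliding window, average pooling over positions, and a logistic output unit. The window size, layer sizes, random weights and all function names are hypothetical, and the sketch is untrained; it is not the DeepSCLpred implementation.

```python
import math
import random

def conv_layer(profile, weights, bias, half_window):
    """Apply one 1D convolution (tanh units) over a per-residue profile.

    profile: list of N rows, each a list of d floats (e.g. one profile row per residue).
    weights: list of h filters, each of length (2 * half_window + 1) * d.
    Returns N rows of h hidden values; positions past the ends are zero-padded.
    """
    n, d = len(profile), len(profile[0])
    zero = [0.0] * d
    out = []
    for i in range(n):
        window = []
        for j in range(i - half_window, i + half_window + 1):
            window.extend(profile[j] if 0 <= j < n else zero)
        out.append([
            math.tanh(sum(w * x for w, x in zip(filt, window)) + b)
            for filt, b in zip(weights, bias)
        ])
    return out

def pool(hidden):
    """Average over sequence positions: N x h values become one h-vector."""
    n = len(hidden)
    return [sum(row[k] for row in hidden) / n for k in range(len(hidden[0]))]

def predict_ems(profile, conv_w, conv_b, out_w, out_b, half_window=1):
    """Return a score in (0, 1) for the EMS class from a variable-length profile."""
    feature = pool(conv_layer(profile, conv_w, conv_b, half_window))
    z = sum(w * f for w, f in zip(out_w, feature)) + out_b
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical random (untrained) parameters: d = 20 inputs, h = 4 filters.
rng = random.Random(0)
D, H, HALF = 20, 4, 1
conv_w = [[rng.uniform(-0.1, 0.1) for _ in range((2 * HALF + 1) * D)] for _ in range(H)]
conv_b = [0.0] * H
out_w = [rng.uniform(-0.1, 0.1) for _ in range(H)]
out_b = 0.0

def random_profile(length):
    """Stand-in for a real profile: `length` rows of D random values."""
    return [[rng.random() for _ in range(D)] for _ in range(length)]

# Sequences of different lengths are handled by the same parameters.
short_score = predict_ems(random_profile(50), conv_w, conv_b, out_w, out_b, HALF)
long_score = predict_ems(random_profile(300), conv_w, conv_b, out_w, out_b, HALF)
```

Because pooling removes the length dimension, the same weights apply to a 50-residue and a 300-residue protein, which is the property the text refers to as compressing the sequence into a hidden feature vector.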
2. Materials and Methods
Datasets
• UniProt release 2018_06 [2]
• All eukaryotic entries except those from fungi and plants
• 112,449 protein sequences from 4,879 species
• 274 unique locations categorised into two classes: the endomembrane
system and secretory pathway (EMS) and everything else
• Internally redundancy reduced to less than 30% sequence similarity
• Split into a training set (TS) and an independent test set (ITS); the ITS
contains sequences that were added to UniProtKB after 2015
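The time-based split in the last bullet can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Entry` record, its field names and the example accessions are hypothetical, and reading "after 2015" as "creation year later than 2015" is an assumption.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Entry:
    accession: str     # hypothetical identifier, not a real UniProt accession
    date_added: date   # date the entry was first added to the database
    is_ems: bool       # True for the EMS class, False for everything else

def temporal_split(entries, cutoff_year=2015):
    """Entries added after cutoff_year form the ITS; the rest form the TS."""
    ts = [e for e in entries if e.date_added.year <= cutoff_year]
    its = [e for e in entries if e.date_added.year > cutoff_year]
    return ts, its

# Made-up entries for illustration only.
entries = [
    Entry("X0001", date(2009, 3, 24), True),
    Entry("X0002", date(2015, 12, 9), False),
    Entry("X0003", date(2016, 1, 20), False),
    Entry("X0004", date(2017, 6, 7), True),
]
ts, its = temporal_split(entries)
```

A split by date keeps the test sequences strictly newer than anything seen in training, which mimics how the predictor would be used on newly deposited proteins.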
Predictive architecture
• The prediction algorithm employs a modified Convolutional Neural
Network (CNN)
• Trained and tested in 10-fold cross-validation
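• We use the Matthews correlation coefficient (MCC) as a measure of the
correlation between observed and predicted states, where TP, TN, FP and FN
are the numbers of true positives, true negatives, false positives and false negatives:
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$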
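A small worked example may help show why MCC is reported alongside accuracy. The sketch below computes both from confusion-matrix counts and applies them to a trivial predictor that labels every ITS sequence as "Other", using the ITS class counts given on this poster (78 EMS, 285 Other). Returning 0 when the denominator is zero is a common convention and is an assumption here, not something the poster states.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: undefined MCC is reported as 0.
        return 0.0
    return (tp * tn - fp * fn) / denom

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Trivial predictor: every ITS sequence is labelled "Other" (EMS is the positive class).
baseline = dict(tp=0, tn=285, fp=0, fn=78)
baseline_acc = accuracy(**baseline)   # 285 / 363, about 0.785
baseline_mcc = mcc(**baseline)        # 0.0 under the convention above
```

On an imbalanced test set, a predictor with no discriminative power can still reach a high accuracy, whereas its MCC stays at zero.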
Class    TS       ITS
EMS      1,678    78
Other    5,406    285
Total    7,084    363
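For the 10-fold cross-validation on the TS, one simple way to partition the 7,084 training sequences is sketched below. The poster does not say how the folds were built (for example, whether they were stratified by class), so the random, unstratified partition and the function names here are assumptions.

```python
import random

def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle item indices and deal them into k folds of near-equal size."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def train_test_pairs(folds):
    """Yield (train, test) index lists, holding out one fold at a time."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

folds = k_fold_indices(7084, k=10)
fold_sizes = sorted(len(f) for f in folds)   # six folds of 708 and four of 709
```

Each sequence is held out exactly once, so every model is evaluated on roughly 10% of the TS that it did not see during training.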
[Figure: Neural Network Architecture. Block labels: input convolution, pooling, feature vector; N × hidden convolution, pooling, feature vector; fully connected network; output class.]
3. Results
Method         MCC      Accuracy
DeepLoc [3]    0.351    71.9%
SCLpred_N      0.696    90.4%
SCLpred_D      0.715    90.9%
The performance of DeepSCLpred compared to DeepLoc [3] on the ITS. Here, the subscripts N and D indicate the
non-deep and deep versions of the system.
[Figure panels: SCLpred_N (TS); SCLpred_N (ITS)]
4. Conclusions
• DeepSCLpred is a state-of-the-art protein
subcellular localization prediction tool
• We predict subcellular localization into
two classes: the endomembrane system
and secretory pathway (EMS) and everything else
• DeepSCLpred achieves an MCC of 0.715
on an independent test set of 363 protein
sequences added to UniProt since 2015
5. Future Work
Current work is focused on extending DeepSCLpred to predict sub-locations within the
endomembrane system and secretory pathway.
We are also developing DeepSCLpred into a publicly
available web server.
6. References
[1] Stephen F Altschul, Thomas L Madden, Alejandro A Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J Lipman. Gapped BLAST and PSI-BLAST:
a new generation of protein database search programs. Nucleic Acids Research, 25(17):3389–3402, 1997.
[2] UniProt Consortium et al. UniProt: the universal protein knowledgebase. Nucleic Acids Research, 46(5):2699, 2018.
[3] José Juan Almagro Armenteros, Casper Kaae Sønderby, Søren Kaae Sønderby, Henrik Nielsen, and Ole Winther. DeepLoc: prediction of protein subcellular
localization using deep learning. Bioinformatics, 33(21):3387–3395, 2017.
7. Acknowledgements
The work of M.K. is supported by a grant from the Irish Research Council [GOIPG/2014/603].
The authors acknowledge the Research IT Service at University College Dublin for providing HPC resources that have contributed to the research results reported
within this poster. http://www.ucd.ie/itservices/ourservices/researchit/