S-Glutathionylation site prediction in proteins

NSF REU EMCoR@NCAT
Grant # ACI-1560385
REU Fellow: Marcus Postell
Mentor: Dr. Dukka KC
July 27, 2017

Outline
• Motivation
• Introduction
• Research Problem Statement
• Research Goal(s)
• Literature Review
• Methodology / Approach / Tools
• Results
• Conclusion
• Future Work

North Carolina
Agricultural and Technical State University
EMCOR@NCAT
NSF Grant
Motivation
• Machine learning techniques cost less and take less time than laboratory techniques.
• They are especially useful when dealing with large data sets.
• Computational approaches can effectively and accurately identify S-glutathionylated sites.

Motivation
• Computational approaches for S-glutathionylation are urgently needed.
• Bioinformatics tools have been proposed to identify the disulfide bonding state of cysteine and the catalytic redox-active cysteine.
• Several methods have been developed to predict S-glutathionylation sites, such as PGluS.

Introduction
• S-Glutathionylation is a reversible protein post-translational modification.
• It generates mixed disulfides between glutathione (GSH) and cysteine residues.
• It plays an important role in protein stability, activity, and redox regulation.
• Identifying S-glutathionylated sites provides valuable insight into the molecular mechanism of S-glutathionylation.
• Due to the labile nature and low abundance of S-glutathionylation in vivo, its detailed characteristics and mechanisms remain to be clarified.

Research Problem Statement
• There is currently a lack of reliable tools, which limits researchers to expensive and time-consuming laboratory techniques for identifying S-glutathionylation.
• These biological experiments often run into cross-contamination.
• Computational prediction of S-glutathionylated sites is very desirable due to its convenience and high speed.

Research Goal(s)
• The goal is to use machine learning, modeling neural networks in Python, to develop an algorithm that accurately predicts S-glutathionylation sites.

Literature Review
• Chen YJ, Lu CT, Huang KY, Wu HY, Chen YJ, Lee TY. GSHSite: exploiting an iteratively statistical method to identify S-glutathionylation sites with substrate specificity. PLoS One. 2015 Apr 7.
• S-glutathionylation, the covalent attachment of a glutathione (GSH) to the sulfur atom of cysteine, is a selective and reversible protein post-translational modification (PTM) that regulates protein activity, localization, and stability.
• Despite its implication in the regulation of protein functions and cell signaling, the substrate specificity of cysteine S-glutathionylation remains unknown.
• Based on a total of 1783 experimentally identified S-glutathionylation sites from mouse macrophages, this work presents an informatics investigation of S-glutathionylation sites, including structural factors such as the flanking amino acid composition and the accessible surface area (ASA).
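
A flanking amino acid composition of the kind mentioned above can be computed from a fixed-length window centered on each cysteine. The sketch below is only an illustration under stated assumptions, not the GSHSite authors' code: the function names, the window half-width `flank`, the `-` padding character, and the toy sequence are all hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def cysteine_windows(sequence, flank=10, pad="-"):
    """Yield (1-based position, window) for every cysteine in a sequence.

    Each window holds `flank` residues on either side of the cysteine;
    positions beyond the ends of the sequence are filled with `pad`.
    """
    padded = pad * flank + sequence + pad * flank
    for i, residue in enumerate(sequence):
        if residue == "C":
            # Index i in `sequence` corresponds to index i + flank in `padded`.
            yield i + 1, padded[i:i + 2 * flank + 1]

def composition(window):
    """Return the fraction of each amino acid among the flanking residues."""
    center = len(window) // 2
    flanking = window[:center] + window[center + 1:]
    counts = Counter(r for r in flanking if r in AMINO_ACIDS)
    total = sum(counts.values()) or 1  # guard against an all-padding window
    return [counts[aa] / total for aa in AMINO_ACIDS]

# Toy sequence (hypothetical) with cysteines at positions 3 and 8.
windows = list(cysteine_windows("MKCLLAGCT", flank=3))
features = [composition(w) for _, w in windows]
```

Each cysteine then becomes one row of 20 numbers that a classifier can consume; accessible surface area would need structural data and is not covered here.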

Literature Review
• The training data were divided into five subgroups by splitting each dataset.
• During cross-validation, one subgroup was used as the test set and the remaining four as the training set.
• Cross-validation was repeated five times, and the validation results were combined to produce a single estimate.
Chen YJ, Lu CT, Huang KY, Wu HY, Chen YJ, Lee TY. GSHSite: exploiting an iteratively statistical method to identify s-glutathionylation sites with substrate specificity. PLoS One. 2015 Apr 7
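
The five-fold procedure described on this slide can be sketched in plain Python. This is a minimal illustration, not the GSHSite authors' implementation: the names `k_fold_indices`, `cross_validate`, and `majority_baseline` are hypothetical, the baseline scorer is only a stand-in for a real model, and averaging is used as one simple way to combine the five results.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(samples, labels, train_and_score, k=5, seed=0):
    """Hold out each fold once, train on the other k-1, and average the scores."""
    folds = k_fold_indices(len(samples), k, seed)
    scores = []
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in range(len(samples)) if i not in held]
        scores.append(train_and_score(
            [samples[i] for i in train_idx], [labels[i] for i in train_idx],
            [samples[i] for i in held_out], [labels[i] for i in held_out],
        ))
    return sum(scores) / k

def majority_baseline(train_x, train_y, test_x, test_y):
    """Placeholder model: always predict the most common training label."""
    majority = max(set(train_y), key=train_y.count)
    return sum(1 for y in test_y if y == majority) / len(test_y)

# Toy data (hypothetical): 20 samples, 15 negatives and 5 positives.
toy_x = [[float(i)] for i in range(20)]
toy_y = [0] * 15 + [1] * 5
mean_accuracy = cross_validate(toy_x, toy_y, majority_baseline, k=5)
```

Any function with the same four-argument signature can replace `majority_baseline`, so the same loop works for a neural network or a support vector machine.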

Literature Review
• Research article: GSHSite: Exploiting an Iteratively Statistical Method to Identify S-Glutathionylation Sites with Substrate Specificity.
• 1783 experimentally identified S-glutathionylation sites compared to 2,326 experimentally identified S-glutathionylation sites.
• Bioinformatic approaches are powerful tools for prediction.
• The method was evaluated by cross-validation and an independent test.

Literature Review
• Zhao X, Ning Q, Ai M, Chai H, Yin M. PGluS: prediction of protein S-glutathionylation sites with multiple features and analysis. Mol Biosyst. 2015 Mar;11
• A new bioinformatics tool named PGluS was developed to predict S-glutathionylated sites based on multiple features and support vector machines.
• PGluS was evaluated on an independent testing dataset, reaching an accuracy of 71.25%, which suggests that PGluS is promising for predicting S-glutathionylated sites.
• Feature analysis showed that all features adopted in the method contributed to the S-glutathionylation process.

Literature Review
• Research article: PGluS: prediction of protein S-glutathionylation sites with multiple features and analysis.
• Computational predictions are important.
• Prediction accuracy of 71.41% compared to 87%.
• Feature analysis was performed and showed that the adopted features contributed to the S-glutathionylation process.
• Identification of specific S-glutathionylated sites is crucial.

Methodology / Approach / Tools
• To train the model, cross-validation is used as a validation technique to assess how well the results of a statistical analysis generalize to an independent data set.
• The plan is to take a given number of known sites and compare them with the model's predictions for the same set to check the algorithm's accuracy.
• Accuracy:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
• Matthews Correlation Coefficient (MCC):
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
• TP: True Positive, TN: True Negative, FP: False Positive, FN: False Negative.
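
Both metrics are short enough to compute directly from the four confusion-matrix counts. The sketch below uses the standard MCC definition, with the square root of the four-factor product in the denominator. The function names are hypothetical, and the example counts are taken from the first independent-test confusion matrix in the Results (TN = 65, FP = 1, FN = 0, TP = 40).

```python
import math

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient, ranging from -1 to +1."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is reported as 0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0

acc = accuracy(tp=40, tn=65, fp=1, fn=0)
coef = mcc(tp=40, tn=65, fp=1, fn=0)
print(f"Accuracy: {acc:.3f}  MCC: {coef:.3f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which makes it more informative when the two classes differ in size.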

SMOTE
• Synthetic Minority Oversampling Technique (SMOTE) is a variation of Random Oversampling (ROS) that addresses the overfitting caused by duplicated samples.
• It does this by creating synthetic instances instead of making random copies.
• It is useful because it can extract more information from the data, which is very helpful when the dataset is small.
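
The core idea of SMOTE, interpolating between a minority sample and one of its nearest minority neighbours, fits in a few lines. This is a simplified sketch for illustration, not the implementation used in this project: the function name `smote`, the neighbour count `k`, and the toy points are assumptions, and a maintained library implementation would normally be preferred in practice.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points from a list of minority-class vectors.

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of `base`, excluding `base` itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy minority class (hypothetical): the four corners of the unit square.
corners = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(corners, n_new=10, k=2)
```

Because every synthetic point is a blend of two real minority samples, the new data stay inside the region the minority class already occupies instead of repeating exact copies.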

Feature Selection
• Feature selection is also known as variable selection or attribute selection.
• It is the automatic selection of the attributes in the data (such as columns in tabular data) that are most relevant to the predictive modeling problem a researcher is working on.
• Problem that feature selection solves: creating an accurate predictive model.
• It helps choose features that give equal or better accuracy while requiring less data.
• These methods can be used to identify and remove unnecessary attributes from the data.
• There are three general classes of feature selection algorithms: filter methods, wrapper methods, and embedded methods.
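
Of the three classes, a filter method is the simplest to illustrate: score each feature on its own and keep the best ones. The sketch below ranks columns by the absolute Pearson correlation with the class label. It is an assumed example, not the selection method used in this project, and the names `pearson` and `select_top_features` and the toy data are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0  # constant columns score 0

def select_top_features(rows, labels, top_k):
    """Filter method: keep the top_k columns most correlated with the label."""
    n_features = len(rows[0])
    scores = [abs(pearson([r[j] for r in rows], labels))
              for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: -scores[j])
    keep = sorted(ranked[:top_k])  # preserve the original column order
    return keep, [[r[j] for j in keep] for r in rows]

# Toy data (hypothetical): column 0 tracks the label, column 1 is constant.
rows = [[0, 5, 1], [1, 5, 0], [0, 5, 1], [1, 5, 1]]
labels = [0, 1, 0, 1]
kept, reduced = select_top_features(rows, labels, top_k=2)
```

Wrapper and embedded methods differ in that they score features through a trained model rather than one column at a time.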

Results
With Feature Selection, With SMOTE

• Independent Test (5%)
                   Predicted Negative   Predicted Positive   Total
Actual Negative                    65                    1      66
Actual Positive                     0                   40      40
Total                              65                   41
Accuracy is: 99.3%

• Training Set (95%)
                   Predicted Negative   Predicted Positive   Total
Actual Negative                   995                  154    1149
Actual Positive                    69                 1080    1149
Total                            1064                 1234
Accuracy is: 90%
Matthews Correlation Coefficient (MCC): .80

Accuracy is: 0.865530596437 x 100 = 86.5%

Key terms (cell layout of each table):
TN  FP
FN  TP

Results
Without Feature Selection, With SMOTE

• Training Set (95%)
                   Predicted Negative   Predicted Positive   Total
Actual Negative                  1025                  124    1149
Actual Positive                    93                 1056    1149
Total                            1118                 1180
Accuracy is: 90%
Matthews Correlation Coefficient (MCC): .81

• Independent Test (5%)
                   Predicted Negative   Predicted Positive   Total
Actual Negative                    66                    0      66
Actual Positive                     1                   39      40
Total                              67                   39
Accuracy is: 99%
Matthews Correlation Coefficient (MCC): .98

Results
With Feature Selection, Without SMOTE

• Training Set (95%)
                   Predicted Negative   Predicted Positive   Total
Actual Negative                   978                  171    1149
Actual Positive                   244                  621     865
Total                            1222                  792
Accuracy is: 79%

• Independent Test (5%)
                   Predicted Negative   Predicted Positive   Total
Actual Negative                    66                    0      66
Actual Positive                     2                   38      40
Total                              68                   38
Accuracy is: 98%

Results
Without Feature Selection, Without SMOTE

• Training Set (95%)
                   Predicted Negative   Predicted Positive   Total
Actual Negative                  1015                  134    1149
Actual Positive                   187                  678     865
Total                            1202                  812
Accuracy is: 84%

• Independent Test (5%)
                   Predicted Negative   Predicted Positive   Total
Actual Positive                    32                  275      40
Total                              98                  275
Accuracy is: 98%

Benchmark Dataset
Class Label    Total    Training Set    Independent Test
0               1215            1149                  66
1                905             865                  40
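
A split of roughly 95% training and 5% independent test can be drawn per class so that both classes appear in both sets. The sketch below is an assumption about how such a split could be made, not a record of how this benchmark was built; the function name `stratified_split` and the fixed seed are hypothetical. A strict 5% draw from each class would not reproduce the 66 and 40 test sites in the table exactly, since those are about 5.4% and 4.4% of their classes.

```python
import random

def stratified_split(labels, test_fraction=0.05, seed=0):
    """Return (train_indices, test_indices), sampling each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Class sizes from the benchmark table: 1215 sites of class 0, 905 of class 1.
labels = [0] * 1215 + [1] * 905
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```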

Conclusion
• To fully understand S-glutathionylation mechanisms, identification of substrates and specific S-glutathionylated sites is crucial.
• After running several tests, the training set accuracy was 90% and the independent testing dataset accuracy was 99%.
• Computational predictions of S-glutathionylation are very useful due to their high speed.
• The experimental results showed that scikit-learn could be useful in assisting the discovery of S-glutathionylated sites.

Future Work
• Bioinformatics in Systems Biology
   • Systems biology seeks to understand how cells, tissues, and organisms function from the perspective of the system as a whole.
• Computational systems biology uses mathematical modeling, simulation, and statistical analysis to gain a fundamental understanding of biological processes.
• Examples include determining the minimal requirements for function and dissecting protein and nucleic acid networks.

Acknowledgments
• This research is fully supported by the National Science Foundation, Grant # ACI-1560385.
• I would like to give special thanks to Dr. Bala Ram for this great opportunity.
• Thank you to my REU team:
   • Mentor: Dr. Dukka KC
   • PhD Student: Mr. Clarence White
   • Director, Special Academic Programs: Dr. Marcia Williams
   • Grad Student: Manoj Rijal
• EMCOR REU participants

Thank You