1. NSF REU EMCoR@NCAT
Grant # ACI-1560385
S-Glutathionylation Site Prediction in Proteins
REU Fellow: Marcus Postell
Mentor: Dr. Dukka KC
July 27, 2017
2. Outline
• Motivation
• Introduction
• Research Problem Statement
• Research Goal(s)
• Literature Review
• Methodology / Approach / Tools
• Results
• Conclusion
• Future Work
3. North Carolina
Agricultural and Technical State University
EMCOR@NCAT
NSF Grant
Motivation
• Machine learning techniques save money and take less time than laboratory techniques.
• They are especially useful when dealing with large data sets.
• Computational approaches can effectively and accurately identify S-glutathionylated sites.
4. Motivation
Computational approaches for S-glutathionylation are urgently needed.
Bioinformatics tools have been proposed to identify the disulfide bonding state of cysteine and the catalytic redox-active cysteine.
Multiple methods have been developed to predict S-glutathionylation sites, such as PGluS.
5. Introduction
• S-Glutathionylation is a reversible protein post-translational
modification.
• It generates mixed disulfides between glutathione (GSH) and cysteine residues.
• It plays an important role in protein stability, activity, and redox regulation.
• Identifying S-glutathionylated sites provides valuable insight into the molecular mechanism of S-glutathionylation.
• Due to the labile nature and low abundance of in vivo S-glutathionylation, its detailed characteristics and mechanisms still await clarification.
6. Research Problem Statement
• There is currently a lack of reliable tools, which limits researchers to expensive and time-consuming laboratory techniques for identifying S-glutathionylation sites.
• These biological experiments are often affected by cross-contamination.
• Computational predictions of S-glutathionylated sites are very desirable due to their convenience and high speed.
7. Research Goal(s)
• Using machine learning, the goal is to build neural network models in Python that can accurately predict S-glutathionylation sites.
8. Literature Review
• Chen YJ, Lu CT, Huang KY, Wu HY, Chen YJ, Lee TY. GSHSite: exploiting an iteratively statistical method to identify S-glutathionylation sites with substrate specificity. PLoS One. 2015 Apr 7
• S-glutathionylation, the covalent attachment of a glutathione (GSH) to the sulfur atom of cysteine, is a selective and reversible protein post-translational modification (PTM) that regulates protein activity, localization, and stability.
• Despite its implication in the regulation of protein functions and cell signaling, the substrate specificity of cysteine S-glutathionylation remains unknown.
• Based on a total of 1783 experimentally identified S-glutathionylation sites from mouse macrophages, this work presents an informatics investigation of S-glutathionylation sites, including structural factors such as the flanking amino acid composition and the accessible surface area (ASA).
9. Literature Review
• The training data were divided into five subgroups by splitting each dataset.
• During cross-validation, one subgroup was used as the test set and the remaining four as the training set.
• Cross-validation was repeated five times, and the validation results were combined to produce a single estimate.
Chen YJ, Lu CT, Huang KY, Wu HY, Chen YJ, Lee TY. GSHSite: exploiting an iteratively statistical method to identify s-glutathionylation sites with substrate specificity. PLoS One. 2015 Apr 7
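The five-fold procedure described on this slide can be sketched in a few lines of Python. This is only an illustration: the function names, the sample count of 20, and the fixed random seed are assumptions made for the example and are not taken from the GSHSite paper or from this project's code.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    return [indices[i::k] for i in range(k)]


def cross_validation_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    folds = k_fold_indices(n_samples, k, seed)
    for i, test_fold in enumerate(folds):
        # The four folds that are not held out form the training set.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold


# Hypothetical data set of 20 samples, split five ways.
splits = list(cross_validation_splits(n_samples=20, k=5))
for train, test in splits:
    print(len(train), "training samples,", len(test), "test samples")
```

Each sample is used for testing exactly once, so the predictions from the five test folds can be pooled into a single performance estimate.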
10. Literature Review
Research article: GSHSite: Exploiting an Iteratively Statistical Method to Identify S-
Glutathionylation Sites with Substrate Specificity.
GSHSite used 1,783 experimentally identified S-glutathionylation sites, compared to the 2,326 experimentally identified S-glutathionylation sites available for this project.
Bioinformatic approaches are powerful tools for prediction.
The method was evaluated by cross-validation and an independent test.
11. Literature Review
• Zhao X, Ning Q, Ai M, Chai H, Yin M. PGluS: prediction of protein S-glutathionylation sites with
multiple features and analysis. Mol Biosyst. 2015 Mar;11
• A new bioinformatics tool named PGluS was developed to predict S-glutathionylated sites based on multiple features and support vector machines.
• PGluS was evaluated using an independent testing dataset resulting in an accuracy of 71.25%,
which demonstrated that PGluS was very promising for predicting S-glutathionylated sites.
• Also, feature analysis was performed and it was shown that all features adopted in the method
contributed to the S-glutathionylation process.
12. Literature Review
Research article: PGluS: prediction of protein S-glutathionylation sites with multiple
features and analysis.
Computational predictions are important.
Prediction accuracy: 71.41% for PGluS compared to 87% in this project.
Feature analysis was performed, and it showed that the adopted features contributed to the S-glutathionylation process.
Identification of specific S-glutathionylated sites is crucial.
14. Methodology / Approach / Tools
• To train the model, cross-validation is used as a validation technique for assessing how well the results of a statistical analysis will generalize to an independent data set.
• The plan is to take a given number of known sites and compare them with the model's predictions for the same set to check the algorithm's accuracy.
• Accuracy:
$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
• Matthews Correlation Coefficient (MCC):
$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$
• TP: True Positive, TN: True Negative, FP: False Positive, FN: False Negative.
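As a rough illustration, both metrics can be computed directly from the four confusion-matrix counts. The sketch below uses the standard definition of MCC, in which the denominator is the square root of the product of the four marginal sums. The function names and the example counts are hypothetical and are not results from this project.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal sum is zero.
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts, chosen only to demonstrate the arithmetic.
tp, tn, fp, fn = 40, 50, 10, 0
print("Accuracy:", accuracy(tp, tn, fp, fn))
print("MCC:", round(mcc(tp, tn, fp, fn), 3))
```

MCC ranges from -1 to 1, so a classifier can have high accuracy and still a modest MCC when one class dominates the data.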
15. SMOTE
The Synthetic Minority Oversampling Technique (SMOTE) is a variation of Random Oversampling (ROS) that addresses the overfitting problem.
It does this by creating synthetic instances instead of making random copies.
It is useful because it can extract more information from the data, which is very helpful when the dataset is small.
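The core idea of creating synthetic instances can be sketched as follows: pick a minority-class sample, pick one of its nearest minority-class neighbours, and place a new point somewhere on the line segment between them. This is a simplified illustration, not the implementation used in the project. The function name, the choice of k=2, and the toy data are assumptions.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        neighbours = sorted(
            others,
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy minority class with two features per sample (hypothetical values).
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_sketch(minority, n_new=6)
for point in new_points:
    print([round(v, 3) for v in point])
```

Because every synthetic point lies between two real minority samples, the new data stay inside the region the minority class already occupies rather than duplicating existing rows.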
16. Feature Selection
Feature selection is also known as variable selection or attribute selection.
It is the automatic selection of the attributes in the data (such as tabular data) that are most relevant to the predictive modeling problem a researcher is working on.
The problem feature selection solves: creating an accurate predictive model.
It helps choose features that give equal or better accuracy while requiring less data.
The methods can be used to identify and remove unnecessary attributes from the data.
There are three general classes of feature selection algorithms: filter methods, wrapper methods, and embedded methods.
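A filter method, the simplest of the three classes, can be sketched by scoring each feature independently and keeping the top-ranked ones. The example below uses the absolute Pearson correlation with the class label as the score. That scoring choice, the function names, and the toy data are assumptions for illustration and are not the method used in this project.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    # A constant column carries no information, so score it as 0.
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def select_top_k(rows, labels, k):
    """Score every feature column and return the indices of the k best."""
    n_features = len(rows[0])
    scores = [
        abs(pearson([row[j] for row in rows], labels)) for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k]), scores


# Toy data: feature 0 tracks the label, feature 1 is noisy, feature 2 is constant.
rows = [[0.1, 5.0, 1.0], [0.2, 3.0, 1.0], [0.9, 4.0, 1.0], [0.8, 6.0, 1.0]]
labels = [0, 0, 1, 1]
kept, scores = select_top_k(rows, labels, k=1)
print("Kept feature indices:", kept)
print("Scores:", [round(s, 3) for s in scores])
```

Wrapper and embedded methods differ in that they involve a predictive model when scoring features, which is more expensive but can capture interactions between features.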
17. Results
With Feature Selection, With SMOTE

Training Set (95%)
                  Predicted
                  Negative  Positive  Total
Actual  Negative  995       154       1149
        Positive  69        1080      1149
        Total     1064      1234
Accuracy is: 90%
Accuracy is: 0.865530596437 x 100 = 86.5%
Matthews Correlation Coefficient (MCC): .80

Independent Test (5%)
                  Predicted
                  Negative  Positive  Total
Actual  Negative  65        1         66
        Positive  0         40        40
        Total     65        41
Accuracy is: 99%
Accuracy is: 0.993217784476 x 100 = 99.3%
Matthews Correlation Coefficient (MCC): .98

Key terms:
TN  FP
FN  TP
18. Results
Without Feature Selection, With SMOTE

Training Set (95%)
                  Predicted
                  Negative  Positive  Total
Actual  Negative  1025      124       1149
        Positive  93        1056      1149
        Total     1118      1180
Accuracy is: 90%
Matthews Correlation Coefficient (MCC): .81

Independent Test (5%)
                  Predicted
                  Negative  Positive  Total
Actual  Negative  66        0         66
        Positive  1         39        40
        Total     67        39
Accuracy is: 99%
Matthews Correlation Coefficient (MCC): .98
19. Results
With Feature Selection, Without SMOTE

Training Set (95%)
                  Predicted
                  Negative  Positive  Total
Actual  Negative  978       171       1149
        Positive  244       621       865
        Total     1222      792
Accuracy is: 79%
Matthews Correlation Coefficient (MCC): .57

Independent Test (5%)
                  Predicted
                  Negative  Positive  Total
Actual  Negative  66        0         66
        Positive  2         38        40
        Total     68        38
Accuracy is: 98%
Matthews Correlation Coefficient (MCC): .96
20. Results
Without Feature Selection, Without SMOTE

Training Set (95%)
                  Predicted
                  Negative  Positive  Total
Actual  Negative  1015      134       1149
        Positive  187       678       865
        Total     1202      812
Accuracy is: 84%
Matthews Correlation Coefficient (MCC): .67

Independent Test (5%)
                  Predicted
                  Negative  Positive  Total
Actual  Negative  66        0         66
        Positive  32        275       40
        Total     98        275
Accuracy is: 98%
Matthews Correlation Coefficient (MCC): .96
21. Benchmark Dataset
Class Label  Total  Training Set  Independent Test
0            1215   1149          66
1            905    865           40
22. Conclusion
To fully understand the mechanisms of S-glutathionylation, identification of substrates and specific S-glutathionylated sites is crucial.
After running several tests, the training set accuracy was 90% and the independent test set accuracy was 99%.
Computational predictions of S-glutathionylation are very useful due to their high speed.
The experimental results showed that scikit-learn could be useful in assisting the discovery of S-glutathionylated sites.
23. Future Work
Bioinformatics in Systems Biology
Systems biology seeks to understand how cells, tissues, and organisms function from the perspective of the system as a whole.
Computational systems biology uses mathematical modeling, simulation, and statistical analysis to gain a fundamental understanding of biological processes.
Examples include determining the minimal requirements for function and dissecting protein and nucleic acid networks.
24. Acknowledgments
This research is fully supported by the National Science Foundation, Grant # ACI-1560385.
I would like to give special thanks to Dr. Bala Ram for this great opportunity.
Thank you to my REU team:
Mentor: Dr. Dukka KC
PhD Student: Mr. Clarence White
Director, Special Academic Programs: Dr. Marcia Williams
Grad Student: Manoj Rijal
EMCOR REU participants
25. Thank You
Editor's Notes
Good afternoon Mentors, REU Participants, and staff
Name
Classification
School
Where I’m From
Computational approaches are needed because they are faster and more convenient.
PGluS was developed to predict S-glutathionylated sites based on multiple features and support vector machines.
Support vector machine: an algorithm used for classification and regression problems.
Protein: a large molecule, consisting of amino acids, that our cells need to function properly.
The data are mainly extracted from RedoxDB and Satellite Global Data Base (SGDB).
RedoxDB: a curated database of experimentally verified protein oxidative modifications.
In vivo: in the living organism.
Cross-contamination occurs most frequently through avoidable procedural errors.
Practicing good aseptic technique is critical, but the computational approach is even better.
Machine learning = representation+evaluation+optimization
Steps for machine learning: define the problem, prepare the data, spot-check algorithms, improve the results, and present the results.
To prepare the data, it is important to choose a set of data that is representative of the defined problem.
Representation: the set of classifiers a learner can represent is known as the hypothesis space. If a classifier is not in the hypothesis space, it cannot be learned.
Evaluation: spot-checking algorithms requires a scoring function to differentiate a good classifier from a bad one.
Here the scoring is based on TP, TN, FP, and FN.
Optimization: applying various methods to improve the results.
A support vector machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression problems; it is mostly used for classification.
Algorithm used: Multilayer perceptron- a class of feedforward artificial neural network.
We currently have 2,326 S-glutathionylation sites from dbGSH, a database of experimentally verified S-glutathionylation sites from multiple species.
The data set provides the UniProt ID, organism, PubMed ID, and sequence.
A feature vector is a vector that contains multiple features.
UniProt: the Universal Protein Resource, a central repository of protein data created by combining multiple databases.
The databases are Swiss-Prot, TrEMBL, and PIR-PSD.
Talk about the data request (web scraper) and using a web scraper to extract information from websites.
The job of the script is to read in a file containing the positive site information and to read in the Swiss-Prot FASTA sequences file.
Matching sequences are written to a FASTA file. This FASTA file combines the negative and positive data sets.
The positive and negative data sets then have to be split into two files. (A lengthy script handles the sequences that contain C sites.)
The positive sites are identified from a file containing only positive sites. Positive window sequences are added to a FASTA file that represents the positive sequences. All other C sites are placed in the negative file.
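The window-splitting step described in these notes might look roughly like the sketch below. It centres a fixed-length window on every cysteine (C) in a sequence and files the window as positive or negative depending on whether that position is in the list of known sites. The function name, the flank size of 3, the padding character, and the example sequence are all hypothetical, and the project's actual script is not shown in the slides.

```python
def cysteine_windows(sequence, positive_positions, flank=3, pad="-"):
    """Split windows centred on each cysteine into positive and negative lists.

    positive_positions holds 1-based positions of known modified cysteines.
    """
    # Pad both ends so that cysteines near the termini still get full windows.
    padded = pad * flank + sequence + pad * flank
    positives, negatives = [], []
    for i, residue in enumerate(sequence):
        if residue != "C":
            continue
        # Residue i sits at padded index i + flank, the centre of this slice.
        window = padded[i : i + 2 * flank + 1]
        if (i + 1) in positive_positions:
            positives.append(window)
        else:
            negatives.append(window)
    return positives, negatives


# Hypothetical sequence with cysteines at positions 3, 7, and 10.
seq = "MKCAAGCLLC"
pos, neg = cysteine_windows(seq, positive_positions={3}, flank=3)
print("Positive windows:", pos)
print("Negative windows:", neg)
```

Each window could then be written to the positive or negative FASTA file and later converted into a feature vector.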
Cross-validation: evaluating estimator performance.
MCC is used in machine learning as a measure of the quality of binary (two-class) classification.
It takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes.
How is feature importance determined?
In the experiment, it is determined with a random forest, which gives each feature a score known as the Gini importance score. A random forest is a collection of decision trees. Gini importance combines the Gini impurity index for all instances across all trees. The impurity index measures how well each feature splits the tree between positive and negative decisions.
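To make the impurity idea concrete, the sketch below computes the Gini impurity of a set of binary labels and the impurity decrease produced by splitting on one feature at a threshold. In a random forest, decreases like this are accumulated over every split that uses a feature, across all trees, to give that feature's importance. The function names, the threshold, and the toy values are assumptions for illustration.

```python
def gini_impurity(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p_positive = sum(labels) / n
    return 1.0 - p_positive ** 2 - (1.0 - p_positive) ** 2


def impurity_decrease(values, labels, threshold):
    """Drop in Gini impurity when splitting samples at values <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    # Child impurities are weighted by the share of samples in each child.
    weighted = (
        len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)
    )
    return gini_impurity(labels) - weighted


labels = [0, 0, 1, 1]
informative = [0.1, 0.2, 0.8, 0.9]    # separates the two classes perfectly
uninformative = [0.1, 0.9, 0.2, 0.8]  # mixes the classes in both children

print("Informative feature:", impurity_decrease(informative, labels, 0.5))
print("Uninformative feature:", impurity_decrease(uninformative, labels, 0.5))
```

A feature that cleanly separates positives from negatives produces a large decrease, while a feature that leaves both children mixed produces none.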
Filter feature selection methods apply a statistical measure to assign a score to each feature.
The features are ranked by score and are either kept or removed from the dataset.
Wrapper methods treat the selection of a set of features as a search problem, where different combinations are prepared, evaluated, and compared to other combinations.
A predictive model is used to evaluate a combination of features and assign a score based on model accuracy.
Embedded methods learn which features best contribute to the accuracy of the model while the model is being created.
Regularization methods are a common type of embedded feature selection method.
A confusion matrix is often used to describe the performance of a classification model.
TN, FP
FN, TP
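As a small illustration of the layout above, a confusion matrix can be built by counting how the actual and predicted labels pair up. The function name and the example labels are hypothetical. Class 1 is treated as positive and class 0 as negative.

```python
def confusion_matrix(actual, predicted):
    """Return [[TN, FP], [FN, TP]] for binary labels (1 = positive)."""
    tn = fp = fn = tp = 0
    for a, p in zip(actual, predicted):
        if a == 0 and p == 0:
            tn += 1
        elif a == 0 and p == 1:
            fp += 1
        elif a == 1 and p == 0:
            fn += 1
        else:
            tp += 1
    # Rows are the actual class, columns are the predicted class.
    return [[tn, fp], [fn, tp]]


# Hypothetical labels for eight sites.
actual = [0, 0, 0, 1, 1, 1, 1, 0]
predicted = [0, 1, 0, 1, 0, 1, 1, 0]
matrix = confusion_matrix(actual, predicted)
print(matrix[0])
print(matrix[1])
```

Accuracy and MCC are both computed from these four counts.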
Benchmark data in last
Dataset used as the baseline for your experiment
Scikit-learn: open source; simple and efficient tools for data mining and data analysis.