NSF REU EMCoR@NCAT
Grant # ACI-1560385
S-Glutathionylation site prediction in
proteins
REU Fellow: Marcus Postell
Mentor: Dr. Dukka KC
July 27, 2017
Outline
• Motivation
• Introduction
• Research Problem Statement
• Research Goal(s)
• Literature Review
• Methodology / Approach / Tools
• Results
• Conclusion
• Future
North Carolina
Agricultural and Technical State University
EMCOR@NCAT
NSF Grant
Motivation
• Machine learning techniques cost less and take less time than laboratory techniques.
• They are especially useful when dealing with large data sets.
• Computational approaches can identify S-glutathionylated sites effectively and accurately.
Motivation
• Computational approaches for predicting S-glutathionylation sites are urgently needed.
• Bioinformatics tools have been proposed to identify the disulfide bonding state of cysteine and the catalytic redox-active cysteine.
• Several methods have been developed to predict S-glutathionylation sites, one of which is PGluS.
Introduction
• S-Glutathionylation is a reversible protein post-translational modification.
• It generates mixed disulfides between glutathione (GSH) and cysteine residues.
• It plays an important role in redox regulation and in controlling protein stability and activity.
• Identifying the modified sites provides valuable insight into the molecular mechanism of S-glutathionylation.
• Because in vivo S-glutathionylation is labile and of low abundance, its detailed characteristics and mechanisms remain to be clarified.
Research Problem Statement
• Reliable tools are currently lacking, which limits researchers to expensive and time-consuming laboratory techniques for identifying S-glutathionylation.
• These biological experiments often run into cross-contamination.
• Computational prediction of S-glutathionylated sites is very desirable because of its convenience and high speed.
Research Goal(s)
• The goal is to use machine learning, modeling neural networks in Python, to develop an algorithm that accurately predicts S-glutathionylation sites.
Literature Review
• Chen YJ, Lu CT, Huang KY, Wu HY, Chen YJ, Lee TY. GSHSite: exploiting an iteratively statistical method to identify S-glutathionylation sites with substrate specificity. PLoS One. 2015 Apr 7
• S-glutathionylation, the covalent attachment of a glutathione (GSH) to the sulfur atom of cysteine, is a selective and reversible protein post-translational modification (PTM) that regulates protein activity, localization, and stability.
• Despite its implication in the regulation of protein functions and cell signaling, the substrate specificity of cysteine S-glutathionylation remains unknown.
• Based on a total of 1783 experimentally identified S-glutathionylation sites from mouse macrophages, this work presents an informatics investigation of S-glutathionylation sites, including structural factors such as the flanking amino acid composition and the accessible surface area (ASA).
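The flanking amino acid composition mentioned above can be illustrated with a short sketch. This is not the GSHSite feature pipeline: the window half-width `flank`, the padding character, and both function names are assumptions made for illustration only.

```python
from collections import Counter


def cysteine_windows(sequence, flank=10, pad="-"):
    """Yield (position, window) for every cysteine in a protein sequence.

    `flank` (hypothetical choice) is the number of residues kept on each
    side of the cysteine. Positions are 1-based, and windows that run past
    either terminus are filled with `pad`.
    """
    padded = pad * flank + sequence + pad * flank
    for i, residue in enumerate(sequence):
        if residue == "C":
            yield i + 1, padded[i:i + 2 * flank + 1]


def flanking_composition(window, pad="-"):
    """Return the fraction of each amino acid among the flanking residues."""
    center = len(window) // 2
    flanks = [aa for j, aa in enumerate(window) if j != center and aa != pad]
    counts = Counter(flanks)
    total = len(flanks)
    return {aa: n / total for aa, n in counts.items()} if total else {}


# Toy sequence, not taken from any dataset
sites = list(cysteine_windows("MKCAAGCLV", flank=3))
composition = flanking_composition(sites[0][1])
```

Each candidate cysteine becomes a fixed-length window, so every site can be turned into a feature vector of the same size regardless of where it sits in the protein.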
Literature Review
• The training data was divided into five subgroups by splitting each dataset.
• During cross-validation, one subgroup was used as the test set and the remaining four as the training set.
• Cross-validation was repeated five times, and the validation results were combined to produce a single estimate.
Source: Chen YJ, Lu CT, Huang KY, Wu HY, Chen YJ, Lee TY. GSHSite: exploiting an iteratively statistical method to identify S-glutathionylation sites with substrate specificity. PLoS One. 2015 Apr 7
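The five-fold procedure described above can be sketched in plain Python. The function names and the `train_and_predict` callback are hypothetical, and the majority-class model is only a stand-in for a real classifier. scikit-learn offers `KFold` and `cross_val_score` for the same purpose.

```python
import random


def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Shuffle sample indices and deal them into n_folds nearly equal groups."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def cross_validate(samples, labels, train_and_predict, n_folds=5, seed=0):
    """Run k-fold cross-validation and pool the results into one accuracy.

    train_and_predict(train_X, train_y, test_X) is a hypothetical callback
    that fits any model and returns one predicted label per test sample.
    """
    folds = five_fold_indices(len(samples), n_folds, seed)
    correct = 0
    for k, test_idx in enumerate(folds):
        # Every fold except fold k forms the training set for this round
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        predictions = train_and_predict(
            [samples[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [samples[i] for i in test_idx],
        )
        correct += sum(p == labels[i] for p, i in zip(predictions, test_idx))
    return correct / len(samples)


def majority_class(train_X, train_y, test_X):
    """Toy stand-in model: always predict the most common training label."""
    most_common = max(sorted(set(train_y)), key=train_y.count)
    return [most_common] * len(test_X)


# Toy data: 12 negative and 8 positive samples
X = list(range(20))
y = [0] * 12 + [1] * 8
folds = five_fold_indices(len(X))
pooled_accuracy = cross_validate(X, y, majority_class)
```

Pooling the correct predictions from all five rounds, rather than averaging five separate scores, matches the "combined to produce a single estimate" step in the last bullet.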
Literature Review
• Research article: GSHSite: Exploiting an Iteratively Statistical Method to Identify S-Glutathionylation Sites with Substrate Specificity.
• 1783 experimentally identified S-glutathionylation sites, compared to 2,326 experimentally identified S-glutathionylation sites.
• Bioinformatic approaches are powerful tools for prediction.
• The method was evaluated by cross-validation and an independent test.
Literature Review
• Zhao X, Ning Q, Ai M, Chai H, Yin M. PGluS: prediction of protein S-glutathionylation sites with multiple features and analysis. Mol Biosyst. 2015 Mar;11
• A new bioinformatics tool named PGluS was developed to predict S-glutathionylated sites based on multiple features and support vector machines.
• PGluS was evaluated on an independent testing dataset and reached an accuracy of 71.25%, which demonstrated that PGluS is very promising for predicting S-glutathionylated sites.
• Feature analysis was also performed, and it showed that all features adopted in the method contributed to the S-glutathionylation process.
Literature Review
• Research article: PGluS: prediction of protein S-glutathionylation sites with multiple features and analysis.
• Importance of computational predictions.
• Prediction accuracy of 71.41%, compared to 87%.
• Feature analysis was performed and showed which adopted features contributed to the S-glutathionylation process.
• Identification of specific S-glutathionylated sites is crucial.
Methodology / Approach / Tools
• To assess the model, cross-validation will be used as a validation technique for estimating how the results of a statistical analysis generalize to an independent data set.
• The plan is to take a given number of sites with known labels and compare them with the model's predictions for the same sites to check the algorithm's accuracy.
• Accuracy:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
• Matthews Correlation Coefficient (MCC):
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$
• TP: True Positive, TN: True Negative, FP: False Positive, FN: False Negative.
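The two metrics above can be computed directly from the four confusion-matrix counts. The sketch below is a minimal illustration, and returning 0.0 when the MCC denominator vanishes is an assumed convention. The example counts are taken from one of the independent-test matrices in the Results slides (TN = 66, FP = 0, FN = 1, TP = 39). scikit-learn provides `accuracy_score` and `matthews_corrcoef` for label arrays.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when a row or column of the matrix is empty
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Counts from an independent-test confusion matrix in the Results slides
acc = accuracy(tp=39, tn=66, fp=0, fn=1)
score = mcc(tp=39, tn=66, fp=0, fn=1)
print(f"Accuracy: {acc:.2%}, MCC: {score:.2f}")
```

Unlike accuracy, MCC uses all four cells of the matrix, so it stays informative when the two classes differ in size.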
SMOTE
• Synthetic Minority Oversampling Technique (SMOTE) is a variation of Random Oversampling (ROS) that mitigates the overfitting problem.
• It does so by creating synthetic minority-class instances instead of making random copies.
• It is useful because it lets the model extract more information from the data, which is very helpful when the dataset is small.
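The core idea of SMOTE, interpolating between a minority sample and one of its nearest minority neighbors, can be sketched as follows. This is a simplified illustration and not the implementation used in the project. The function name, the default `k`, and the toy points are assumptions. Libraries such as imbalanced-learn ship a full `SMOTE` class.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority neighbors.

    `minority` is a list of feature tuples with at least two entries.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point, excluding itself
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Toy minority class: the four corners of the unit square
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority_points, n_synthetic=3, k=2)
```

In the benchmark dataset the training set holds 1149 negative and 865 positive sites, so balancing the classes this way would call for 284 synthetic positive sites.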
Feature Selection
• Feature selection is also known as variable selection or attribute selection.
• It is the automatic selection of the attributes in the data (for example, columns of tabular data) that are most relevant to the predictive modeling problem a researcher is working on.
• The problem feature selection addresses: building an accurate predictive model.
• It helps choose features that give equal or better accuracy while requiring less data.
• The methods can be used to identify and remove unnecessary attributes from the data.
• There are three general classes of feature selection algorithms: filter methods, wrapper methods, and embedded methods.
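As an illustration of the first class, a filter method scores each feature independently of any model and keeps the best-scoring ones. The sketch below ranks features by the absolute Pearson correlation with the class label. The scoring choice, the function names, and the toy data are assumptions, and the slides do not say which selection method was actually used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_top_k(rows, labels, k):
    """Filter method: keep the k feature columns most correlated with the label."""
    n_features = len(rows[0])
    scores = [
        abs(pearson([row[j] for row in rows], labels))
        for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy data: column 0 tracks the label, column 1 is constant, column 2 is noise
rows = [
    [1.0, 5.0, 0.2],
    [2.0, 5.0, 0.9],
    [3.0, 5.0, 0.1],
    [4.0, 5.0, 0.8],
]
labels = [0, 0, 1, 1]
kept = select_top_k(rows, labels, k=1)
```

Wrapper and embedded methods differ in that they consult a trained model, either by retraining on candidate feature subsets or by reading feature importance from the model itself.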
Results
With Feature Selection, With SMOTE

Independent Test (5%)
                    Predicted Negative   Predicted Positive   Total
Actual Negative                     65                    1      66
Actual Positive                      0                   40      40
Total                               65                   41
Accuracy: 99%
Matthews Correlation Coefficient (MCC): .98

Training Set (95%)
                    Predicted Negative   Predicted Positive   Total
Actual Negative                    995                  154    1149
Actual Positive                     69                 1080    1149
Total                             1064                 1234
Accuracy: 90%
Matthews Correlation Coefficient (MCC): .80

Accuracy values also listed on this slide: 0.865530596437 × 100 = 86.5% and 0.993217784476 × 100 = 99.3%.
Key terms: each matrix is laid out as TN, FP (top row) and FN, TP (bottom row).
Results
Without Feature Selection, With SMOTE

Training Set (95%)
                    Predicted Negative   Predicted Positive   Total
Actual Negative                   1025                  124    1149
Actual Positive                     93                 1056    1149
Total                             1118                 1180
Accuracy: 90%
Matthews Correlation Coefficient (MCC): .81

Independent Test (5%)
                    Predicted Negative   Predicted Positive   Total
Actual Negative                     66                    0      66
Actual Positive                      1                   39      40
Total                               67                   39
Accuracy: 99%
Matthews Correlation Coefficient (MCC): .98
Results
With Feature Selection, Without SMOTE

Training Set (95%)
                    Predicted Negative   Predicted Positive   Total
Actual Negative                    978                  171    1149
Actual Positive                    244                  621     865
Total                             1222                  792
Accuracy: 79%
Matthews Correlation Coefficient (MCC): .57

Independent Test (5%)
                    Predicted Negative   Predicted Positive   Total
Actual Negative                     66                    0      66
Actual Positive                      2                   38      40
Total                               68                   38
Accuracy: 98%
Matthews Correlation Coefficient (MCC): .96
Results
Without Feature Selection, Without SMOTE

Training Set (95%)
                    Predicted Negative   Predicted Positive   Total
Actual Negative                   1015                  134    1149
Actual Positive                    187                  678     865
Total                             1202                  812
Accuracy: 84%
Matthews Correlation Coefficient (MCC): .67

Independent Test (5%)
                    Predicted Negative   Predicted Positive   Total
Actual Negative                     66                    0      66
Actual Positive                     32                  275      40
Total                               98                  275
Accuracy: 98%
Matthews Correlation Coefficient (MCC): .96
Benchmark Dataset
Class Label   Total   Training Set   Independent Test
0              1215           1149                 66
1               905            865                 40
Conclusion
• To fully understand the mechanisms of S-glutathionylation, identifying substrates and specific S-glutathionylated sites is crucial.
• After running several tests, the training set accuracy was 90% and the independent testing dataset accuracy was 99%.
• Computational predictions of S-glutathionylation are very useful because of their high speed.
• The experimental results showed that scikit-learn can be useful in assisting the discovery of S-glutathionylated sites.
Future Work
• Bioinformatics in Systems Biology
• Systems biology seeks to understand how cells, tissues, and organisms function from the perspective of the system as a whole.
• Computational approaches use mathematical modeling, simulation, and statistical analysis to gain a fundamental understanding of biological processes.
• Examples include determining the minimal requirements for function and dissecting protein and nucleic acid networks.
Acknowledgments
• This research is fully supported by the National Science Foundation, Grant # ACI-1560385.
• I would like to give special thanks to Dr. Bala Ram for this great opportunity.
• Thank you to my REU team:
  ◦ Mentor: Dr. Dukka KC
  ◦ PhD Student: Mr. Clarence White
  ◦ Director, Special Academic Programs: Dr. Marcia Williams
  ◦ Grad Student: Manoj Rijal
  ◦ EMCOR REU participants
Thank You

More Related Content

Similar to S-Glutathionylation site prediction in proteins

Eugm 2012 mehta - future plans for east - 2012 eugm
Eugm 2012   mehta - future plans for east - 2012 eugmEugm 2012   mehta - future plans for east - 2012 eugm
Eugm 2012 mehta - future plans for east - 2012 eugmCytel USA
 
Stability based validation of dietary patterns obtained by cluster
Stability based validation of dietary patterns obtained by clusterStability based validation of dietary patterns obtained by cluster
Stability based validation of dietary patterns obtained by clusterAjay RJ
 
Stability based validation of dietary patterns obtained by cluster (1)
Stability based validation of dietary patterns obtained by cluster (1)Stability based validation of dietary patterns obtained by cluster (1)
Stability based validation of dietary patterns obtained by cluster (1)SarathvarmaTirumalar
 
Performance evaluation of random forest with feature selection methods in pre...
Performance evaluation of random forest with feature selection methods in pre...Performance evaluation of random forest with feature selection methods in pre...
Performance evaluation of random forest with feature selection methods in pre...IJECEIAES
 
Ijricit 01-002 enhanced replica detection in short time for large data sets
Ijricit 01-002 enhanced replica detection in  short time for large data setsIjricit 01-002 enhanced replica detection in  short time for large data sets
Ijricit 01-002 enhanced replica detection in short time for large data setsIjripublishers Ijri
 
Multivariate Analysis and Visualization of Proteomic Data
Multivariate Analysis and Visualization of Proteomic DataMultivariate Analysis and Visualization of Proteomic Data
Multivariate Analysis and Visualization of Proteomic DataUC Davis
 
Hupo2017 wessels mb2021 Glycopeptide profiling
Hupo2017 wessels mb2021 Glycopeptide profilingHupo2017 wessels mb2021 Glycopeptide profiling
Hupo2017 wessels mb2021 Glycopeptide profilingHans Wessels
 
Lopez-Bigas talk at the EBI/EMBL Cancer Genomics Workshop
Lopez-Bigas talk at the EBI/EMBL Cancer Genomics WorkshopLopez-Bigas talk at the EBI/EMBL Cancer Genomics Workshop
Lopez-Bigas talk at the EBI/EMBL Cancer Genomics WorkshopNuria Lopez-Bigas
 
A Critical Assessment Of Mus Musculus Gene Function Prediction Using Integrat...
A Critical Assessment Of Mus Musculus Gene Function Prediction Using Integrat...A Critical Assessment Of Mus Musculus Gene Function Prediction Using Integrat...
A Critical Assessment Of Mus Musculus Gene Function Prediction Using Integrat...Sara Alvarez
 
Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...
Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...
Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...Elia Brodsky
 
Data Integrity in Decentralized Clinical Trials (DCTs)
Data Integrity in Decentralized Clinical Trials (DCTs)Data Integrity in Decentralized Clinical Trials (DCTs)
Data Integrity in Decentralized Clinical Trials (DCTs)InsideScientific
 
Large Scale PCA Analysis in SVS
Large Scale PCA Analysis in SVSLarge Scale PCA Analysis in SVS
Large Scale PCA Analysis in SVSGolden Helix
 
Peter (Yun-shao) Sung's Resume 2016III
Peter (Yun-shao) Sung's Resume 2016IIIPeter (Yun-shao) Sung's Resume 2016III
Peter (Yun-shao) Sung's Resume 2016IIIPeter Sung
 
Predictive Analytics of Cell Types Using Single Cell Gene Expression Profiles
Predictive Analytics of Cell Types Using Single Cell Gene Expression ProfilesPredictive Analytics of Cell Types Using Single Cell Gene Expression Profiles
Predictive Analytics of Cell Types Using Single Cell Gene Expression ProfilesAli Al Hamadani
 
Can a combination of constrained-based and kinetic modeling bridge time scale...
Can a combination of constrained-based and kinetic modeling bridge time scale...Can a combination of constrained-based and kinetic modeling bridge time scale...
Can a combination of constrained-based and kinetic modeling bridge time scale...Natal van Riel
 
Robust Pathway-based Multi-Omics Data Integration using Directed Random Walk ...
Robust Pathway-based Multi-Omics Data Integration using Directed Random Walk ...Robust Pathway-based Multi-Omics Data Integration using Directed Random Walk ...
Robust Pathway-based Multi-Omics Data Integration using Directed Random Walk ...SOYEON KIM
 

Similar to S-Glutathionylation site prediction in proteins (20)

Eugm 2012 mehta - future plans for east - 2012 eugm
Eugm 2012   mehta - future plans for east - 2012 eugmEugm 2012   mehta - future plans for east - 2012 eugm
Eugm 2012 mehta - future plans for east - 2012 eugm
 
CPRIT Poster Final
CPRIT Poster FinalCPRIT Poster Final
CPRIT Poster Final
 
Stability based validation of dietary patterns obtained by cluster
Stability based validation of dietary patterns obtained by clusterStability based validation of dietary patterns obtained by cluster
Stability based validation of dietary patterns obtained by cluster
 
Stability based validation of dietary patterns obtained by cluster (1)
Stability based validation of dietary patterns obtained by cluster (1)Stability based validation of dietary patterns obtained by cluster (1)
Stability based validation of dietary patterns obtained by cluster (1)
 
Performance evaluation of random forest with feature selection methods in pre...
Performance evaluation of random forest with feature selection methods in pre...Performance evaluation of random forest with feature selection methods in pre...
Performance evaluation of random forest with feature selection methods in pre...
 
Ijricit 01-002 enhanced replica detection in short time for large data sets
Ijricit 01-002 enhanced replica detection in  short time for large data setsIjricit 01-002 enhanced replica detection in  short time for large data sets
Ijricit 01-002 enhanced replica detection in short time for large data sets
 
Multivariate Analysis and Visualization of Proteomic Data
Multivariate Analysis and Visualization of Proteomic DataMultivariate Analysis and Visualization of Proteomic Data
Multivariate Analysis and Visualization of Proteomic Data
 
Hupo2017 wessels mb2021 Glycopeptide profiling
Hupo2017 wessels mb2021 Glycopeptide profilingHupo2017 wessels mb2021 Glycopeptide profiling
Hupo2017 wessels mb2021 Glycopeptide profiling
 
Qi liu 08.08.2014
Qi liu 08.08.2014Qi liu 08.08.2014
Qi liu 08.08.2014
 
Lopez-Bigas talk at the EBI/EMBL Cancer Genomics Workshop
Lopez-Bigas talk at the EBI/EMBL Cancer Genomics WorkshopLopez-Bigas talk at the EBI/EMBL Cancer Genomics Workshop
Lopez-Bigas talk at the EBI/EMBL Cancer Genomics Workshop
 
A Critical Assessment Of Mus Musculus Gene Function Prediction Using Integrat...
A Critical Assessment Of Mus Musculus Gene Function Prediction Using Integrat...A Critical Assessment Of Mus Musculus Gene Function Prediction Using Integrat...
A Critical Assessment Of Mus Musculus Gene Function Prediction Using Integrat...
 
Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...
Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...
Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...
 
Data Integrity in Decentralized Clinical Trials (DCTs)
Data Integrity in Decentralized Clinical Trials (DCTs)Data Integrity in Decentralized Clinical Trials (DCTs)
Data Integrity in Decentralized Clinical Trials (DCTs)
 
ISHIposter16_f
ISHIposter16_fISHIposter16_f
ISHIposter16_f
 
Large Scale PCA Analysis in SVS
Large Scale PCA Analysis in SVSLarge Scale PCA Analysis in SVS
Large Scale PCA Analysis in SVS
 
Peter (Yun-shao) Sung's Resume 2016III
Peter (Yun-shao) Sung's Resume 2016IIIPeter (Yun-shao) Sung's Resume 2016III
Peter (Yun-shao) Sung's Resume 2016III
 
Predictive Analytics of Cell Types Using Single Cell Gene Expression Profiles
Predictive Analytics of Cell Types Using Single Cell Gene Expression ProfilesPredictive Analytics of Cell Types Using Single Cell Gene Expression Profiles
Predictive Analytics of Cell Types Using Single Cell Gene Expression Profiles
 
Can a combination of constrained-based and kinetic modeling bridge time scale...
Can a combination of constrained-based and kinetic modeling bridge time scale...Can a combination of constrained-based and kinetic modeling bridge time scale...
Can a combination of constrained-based and kinetic modeling bridge time scale...
 
SLAS Screen Design and Assay Technology Special Interest Group SLAS2017 Prese...
SLAS Screen Design and Assay Technology Special Interest Group SLAS2017 Prese...SLAS Screen Design and Assay Technology Special Interest Group SLAS2017 Prese...
SLAS Screen Design and Assay Technology Special Interest Group SLAS2017 Prese...
 
Robust Pathway-based Multi-Omics Data Integration using Directed Random Walk ...
Robust Pathway-based Multi-Omics Data Integration using Directed Random Walk ...Robust Pathway-based Multi-Omics Data Integration using Directed Random Walk ...
Robust Pathway-based Multi-Omics Data Integration using Directed Random Walk ...
 

Recently uploaded

Jual obat aborsi Dubai ( 085657271886 ) Cytote pil telat bulan penggugur kand...
Jual obat aborsi Dubai ( 085657271886 ) Cytote pil telat bulan penggugur kand...Jual obat aborsi Dubai ( 085657271886 ) Cytote pil telat bulan penggugur kand...
Jual obat aborsi Dubai ( 085657271886 ) Cytote pil telat bulan penggugur kand...ZurliaSoop
 
Top profile Call Girls In Sagar [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Sagar [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Sagar [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Sagar [ 7014168258 ] Call Me For Genuine Models We ...nirzagarg
 
Howrah [ Call Girls Kolkata ₹7.5k Pick Up & Drop With Cash Payment 8005736733...
Howrah [ Call Girls Kolkata ₹7.5k Pick Up & Drop With Cash Payment 8005736733...Howrah [ Call Girls Kolkata ₹7.5k Pick Up & Drop With Cash Payment 8005736733...
Howrah [ Call Girls Kolkata ₹7.5k Pick Up & Drop With Cash Payment 8005736733...HyderabadDolls
 
B.tech civil major project by Deepak Kumar
B.tech civil major project by Deepak KumarB.tech civil major project by Deepak Kumar
B.tech civil major project by Deepak KumarDeepak15CivilEngg
 
Complete Curriculum Vita for Paul Warshauer
Complete Curriculum Vita for Paul WarshauerComplete Curriculum Vita for Paul Warshauer
Complete Curriculum Vita for Paul WarshauerPaul Warshauer
 
Top profile Call Girls In chittoor [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In chittoor [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In chittoor [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In chittoor [ 7014168258 ] Call Me For Genuine Models ...gajnagarg
 
Top profile Call Girls In Rampur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Rampur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Rampur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Rampur [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
Gabriel_Carter_EXPOLRATIONpp.pptx........
Gabriel_Carter_EXPOLRATIONpp.pptx........Gabriel_Carter_EXPOLRATIONpp.pptx........
Gabriel_Carter_EXPOLRATIONpp.pptx........deejay178
 
b-sc-agri-course-curriculum.pdf for Karnataka state board
b-sc-agri-course-curriculum.pdf for Karnataka state boardb-sc-agri-course-curriculum.pdf for Karnataka state board
b-sc-agri-course-curriculum.pdf for Karnataka state boardramyaul734
 
Dating Call Girls inTiruvallur { 9332606886 } VVIP NISHA Call Girls Near 5 St...
Dating Call Girls inTiruvallur { 9332606886 } VVIP NISHA Call Girls Near 5 St...Dating Call Girls inTiruvallur { 9332606886 } VVIP NISHA Call Girls Near 5 St...
Dating Call Girls inTiruvallur { 9332606886 } VVIP NISHA Call Girls Near 5 St...ruksarkahn825
 
Vip Malegaon Escorts Service Girl ^ 9332606886, WhatsApp Anytime Malegaon
Vip Malegaon Escorts Service Girl ^ 9332606886, WhatsApp Anytime MalegaonVip Malegaon Escorts Service Girl ^ 9332606886, WhatsApp Anytime Malegaon
Vip Malegaon Escorts Service Girl ^ 9332606886, WhatsApp Anytime Malegaonmeghakumariji156
 
K Venkat Naveen Kumar | GCP Data Engineer | CV
K Venkat Naveen Kumar | GCP Data Engineer | CVK Venkat Naveen Kumar | GCP Data Engineer | CV
K Venkat Naveen Kumar | GCP Data Engineer | CVK VENKAT NAVEEN KUMAR
 
Top profile Call Girls In Shivamogga [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Shivamogga [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Shivamogga [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Shivamogga [ 7014168258 ] Call Me For Genuine Model...nirzagarg
 
7737669865 Call Girls In Ahmedabad Escort Service Available 24×7 In In Ahmedabad
7737669865 Call Girls In Ahmedabad Escort Service Available 24×7 In In Ahmedabad7737669865 Call Girls In Ahmedabad Escort Service Available 24×7 In In Ahmedabad
7737669865 Call Girls In Ahmedabad Escort Service Available 24×7 In In Ahmedabadgargpaaro
 
UXPA Boston 2024 Maximize the Client Consultant Relationship.pdf
UXPA Boston 2024 Maximize the Client Consultant Relationship.pdfUXPA Boston 2024 Maximize the Client Consultant Relationship.pdf
UXPA Boston 2024 Maximize the Client Consultant Relationship.pdfDan Berlin
 
Top profile Call Girls In Hubli [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hubli [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hubli [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hubli [ 7014168258 ] Call Me For Genuine Models We ...gajnagarg
 
Miletti Gabriela_Vision Plan for artist Jahzel.pdf
Miletti Gabriela_Vision Plan for artist Jahzel.pdfMiletti Gabriela_Vision Plan for artist Jahzel.pdf
Miletti Gabriela_Vision Plan for artist Jahzel.pdfGabrielaMiletti
 
Personal Brand Exploration - Fernando Negron
Personal Brand Exploration - Fernando NegronPersonal Brand Exploration - Fernando Negron
Personal Brand Exploration - Fernando Negronnegronf24
 

Recently uploaded (20)

Jual obat aborsi Dubai ( 085657271886 ) Cytote pil telat bulan penggugur kand...
Jual obat aborsi Dubai ( 085657271886 ) Cytote pil telat bulan penggugur kand...Jual obat aborsi Dubai ( 085657271886 ) Cytote pil telat bulan penggugur kand...
Jual obat aborsi Dubai ( 085657271886 ) Cytote pil telat bulan penggugur kand...
 
Top profile Call Girls In Sagar [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Sagar [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Sagar [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Sagar [ 7014168258 ] Call Me For Genuine Models We ...
 
Howrah [ Call Girls Kolkata ₹7.5k Pick Up & Drop With Cash Payment 8005736733...
Howrah [ Call Girls Kolkata ₹7.5k Pick Up & Drop With Cash Payment 8005736733...Howrah [ Call Girls Kolkata ₹7.5k Pick Up & Drop With Cash Payment 8005736733...
Howrah [ Call Girls Kolkata ₹7.5k Pick Up & Drop With Cash Payment 8005736733...
 
B.tech civil major project by Deepak Kumar
B.tech civil major project by Deepak KumarB.tech civil major project by Deepak Kumar
B.tech civil major project by Deepak Kumar
 
Complete Curriculum Vita for Paul Warshauer
Complete Curriculum Vita for Paul WarshauerComplete Curriculum Vita for Paul Warshauer
Complete Curriculum Vita for Paul Warshauer
 
Top profile Call Girls In chittoor [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In chittoor [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In chittoor [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In chittoor [ 7014168258 ] Call Me For Genuine Models ...
 
Girls in Aiims Metro (delhi) call me [🔝9953056974🔝] escort service 24X7
Girls in Aiims Metro (delhi) call me [🔝9953056974🔝] escort service 24X7Girls in Aiims Metro (delhi) call me [🔝9953056974🔝] escort service 24X7
Girls in Aiims Metro (delhi) call me [🔝9953056974🔝] escort service 24X7
 
Top profile Call Girls In Rampur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Rampur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Rampur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Rampur [ 7014168258 ] Call Me For Genuine Models We...
 
Gabriel_Carter_EXPOLRATIONpp.pptx........
Gabriel_Carter_EXPOLRATIONpp.pptx........Gabriel_Carter_EXPOLRATIONpp.pptx........
Gabriel_Carter_EXPOLRATIONpp.pptx........
 
b-sc-agri-course-curriculum.pdf for Karnataka state board
b-sc-agri-course-curriculum.pdf for Karnataka state boardb-sc-agri-course-curriculum.pdf for Karnataka state board
b-sc-agri-course-curriculum.pdf for Karnataka state board
 
Dating Call Girls inTiruvallur { 9332606886 } VVIP NISHA Call Girls Near 5 St...
Dating Call Girls inTiruvallur { 9332606886 } VVIP NISHA Call Girls Near 5 St...Dating Call Girls inTiruvallur { 9332606886 } VVIP NISHA Call Girls Near 5 St...
Dating Call Girls inTiruvallur { 9332606886 } VVIP NISHA Call Girls Near 5 St...
 
Vip Malegaon Escorts Service Girl ^ 9332606886, WhatsApp Anytime Malegaon
Vip Malegaon Escorts Service Girl ^ 9332606886, WhatsApp Anytime MalegaonVip Malegaon Escorts Service Girl ^ 9332606886, WhatsApp Anytime Malegaon
Vip Malegaon Escorts Service Girl ^ 9332606886, WhatsApp Anytime Malegaon
 
K Venkat Naveen Kumar | GCP Data Engineer | CV
K Venkat Naveen Kumar | GCP Data Engineer | CVK Venkat Naveen Kumar | GCP Data Engineer | CV
K Venkat Naveen Kumar | GCP Data Engineer | CV
 
Top profile Call Girls In Shivamogga [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Shivamogga [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Shivamogga [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Shivamogga [ 7014168258 ] Call Me For Genuine Model...
 
7737669865 Call Girls In Ahmedabad Escort Service Available 24×7 In In Ahmedabad
7737669865 Call Girls In Ahmedabad Escort Service Available 24×7 In In Ahmedabad7737669865 Call Girls In Ahmedabad Escort Service Available 24×7 In In Ahmedabad
7737669865 Call Girls In Ahmedabad Escort Service Available 24×7 In In Ahmedabad
 
UXPA Boston 2024 Maximize the Client Consultant Relationship.pdf
UXPA Boston 2024 Maximize the Client Consultant Relationship.pdfUXPA Boston 2024 Maximize the Client Consultant Relationship.pdf
UXPA Boston 2024 Maximize the Client Consultant Relationship.pdf
 
Top profile Call Girls In Hubli [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hubli [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hubli [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hubli [ 7014168258 ] Call Me For Genuine Models We ...
 
Miletti Gabriela_Vision Plan for artist Jahzel.pdf
Miletti Gabriela_Vision Plan for artist Jahzel.pdfMiletti Gabriela_Vision Plan for artist Jahzel.pdf
Miletti Gabriela_Vision Plan for artist Jahzel.pdf
 
Cara Gugurkan Kandungan Awal Kehamilan 1 bulan (087776558899)
Cara Gugurkan Kandungan Awal Kehamilan 1 bulan (087776558899)Cara Gugurkan Kandungan Awal Kehamilan 1 bulan (087776558899)
Cara Gugurkan Kandungan Awal Kehamilan 1 bulan (087776558899)
 
Personal Brand Exploration - Fernando Negron
Personal Brand Exploration - Fernando NegronPersonal Brand Exploration - Fernando Negron
Personal Brand Exploration - Fernando Negron
 

S-Glutathionylation site prediction in proteins

  • 1. NSF REU EMCoR@NCAT Grant # ACI-1560385 S-Glutathionylation site prediction in proteins REU Fellow: Marcus Postell Mentor: Dr. Dukka KC July 27, 2017
  • 2. Outline • Motivation • Introduction • Research Problem Statement • Research Goal(s) • Literature Review • Methodology / Approach / Tools • Results • Conclusion • Future NSF REU EMCoR@NCAT Grant # ACI-1560385
  • 3. North Carolina Agricultural and Technical State University EMCOR@NCAT NSF Grant Motivation • Machine learning technique = saving money and less time consuming than laboratory techniques . • It is very useful when dealing with large data sets. • Computational approaches can effectively and accurately identify the S- glutathionylated sites.
  • 4. Motivation • Computational approaches for S-glutathionylation are urgently needed. • Bioinformatics tools have been proposed to identify the disulfide bonding state of cysteine and the catalytic redox-active cysteine. • Multiple methods have been developed to predict S-glutathionylation sites, including PGluS.
  • 5. Introduction • S-Glutathionylation is a reversible protein post-translational modification. • It generates mixed disulfides between glutathione (GSH) and cysteine residues. • This plays an important role in regulating protein stability, activity, and redox regulation. • Identifying S-glutathionylated sites provides valuable insight into the molecular mechanism of S-glutathionylation. • Due to the labile nature and low abundance of in vivo S-glutathionylation, its detailed characteristics and mechanisms are still to be clarified.
  • 6. Research Problem Statement • There is currently a lack of reliable tools, which limits researchers to expensive and time-consuming laboratory techniques for identifying S-glutathionylation. • These biological experiments often run into cross-contamination. • Computational predictions of S-glutathionylated sites are very desirable due to their convenience and high speed.
  • 7. Research Goal(s) • The goal is to use machine learning, modeling neural networks in Python, to develop an algorithm that accurately predicts S-glutathionylation sites.
  • 8. Literature Review • Chen YJ, Lu CT, Huang KY, Wu HY, Chen YJ, Lee TY. GSHSite: exploiting an iteratively statistical method to identify s-glutathionylation sites with substrate specificity. PLoS One. 2015 Apr 7 • S-glutathionylation, the covalent attachment of a glutathione (GSH) to the sulfur atom of cysteine, is a selective and reversible protein post-translational modification (PTM) that regulates protein activity, localization, and stability. • Despite its implication in the regulation of protein functions and cell signaling, the substrate specificity of cysteine S-glutathionylation remains unknown. • Based on a total of 1783 experimentally identified S-glutathionylation sites from mouse macrophages, this work presents an informatics investigation of S-glutathionylation sites, including structural factors such as the flanking amino acid composition and the accessible surface area (ASA).
  • 9. Literature Review • The training data below has been divided into five groups by splitting each dataset. • During cross-validation, one subgroup was regarded as the test set and the remaining four as the training set. • Cross-validation was repeated five times and the validation results were combined to produce a single estimate. Chen YJ, Lu CT, Huang KY, Wu HY, Chen YJ, Lee TY. GSHSite: exploiting an iteratively statistical method to identify s-glutathionylation sites with substrate specificity. PLoS One. 2015 Apr 7
  • 10. Literature Review • Research article: GSHSite: Exploiting an Iteratively Statistical Method to Identify S-Glutathionylation Sites with Substrate Specificity. • 1783 experimentally identified S-glutathionylation sites compared to 2,326 experimentally identified S-glutathionylation sites. • Bioinformatic approaches are powerful tools for prediction. • Evaluation by cross-validation and an independent test.
  • 11. Literature Review • Zhao X, Ning Q, Ai M, Chai H, Yin M. PGluS: prediction of protein S-glutathionylation sites with multiple features and analysis. Mol Biosyst. 2015 Mar;11 • A new bioinformatics tool named PGluS was built to predict S-glutathionylated sites based on multiple features and support vector machines. • PGluS was evaluated using an independent testing dataset, resulting in an accuracy of 71.25%, which demonstrated that PGluS was very promising for predicting S-glutathionylated sites. • Feature analysis was also performed, and it showed that all features adopted in the method contributed to the S-glutathionylation process.
  • 12. Literature Review • Research article: PGluS: prediction of protein S-glutathionylation sites with multiple features and analysis. • Importance of computational predictions. • Prediction accuracy of 71.41% compared to 87%. • Feature analysis was performed and showed that the adopted features contributed to the S-glutathionylation process. • Identification of specific S-glutathionylated sites is crucial.
  • 13. Methodology / Approach / Tools
  • 14. Methodology / Approach / Tools • To train the model, cross-validation will be used as a validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. • The plan is to take a given number of data sites and compare them with the model's predictions for that set to check the algorithm's accuracy. • Accuracy: $\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$ • Matthews Correlation Coefficient (MCC): $\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$ • TP: True Positive, TN: True Negative, FP: False Positive, FN: False Negative.
  • 15. SMOTE • Synthetic Minority Oversampling Technique (SMOTE) is a variation of Random Oversampling (ROS) that addresses the overfitting problem. • It does this by creating synthetic instances instead of making random copies. • It is useful because it can extract more information from the data, which is very helpful when the dataset is small.
  • 16. Feature Selection • Feature selection is also known as variable selection or attribute selection. • It is the automatic selection of the attributes in the data (such as tabular data) that are most relevant to the predictive modeling problem a researcher is working on. • Problem feature selection solves: creating an accurate predictive model. • It helps with choosing features that give equal or better accuracy while requiring less data. • The methods can be used to identify and remove unnecessary attributes from the data. • Three general classes of feature selection algorithms: filter methods, wrapper methods, and embedded methods.
  • 17. Results: With Feature Selection, With SMOTE • Key terms: TN, FP / FN, TP
      Training Set (95%)
                          Predicted Negative   Predicted Positive   Total
        Actual Negative                  995                  154    1149
        Actual Positive                   69                 1080    1149
        Total                           1064                 1234
      Accuracy is: 90% • Matthews Correlation Coefficient (MCC): .80
      Independent Test (5%)
                          Predicted Negative   Predicted Positive   Total
        Actual Negative                   65                    1      66
        Actual Positive                    0                   40      40
        Total                             65                   41
      Accuracy is: 99% • Matthews Correlation Coefficient (MCC): .98
      Two further accuracy readouts appear on the slide: 0.865530596437 x 100 = 86.5% and 0.993217784476 x 100 = 99.3%.
  • 18. Results: Without Feature Selection, With SMOTE
      Training Set (95%)
                          Predicted Negative   Predicted Positive   Total
        Actual Negative                 1025                  124    1149
        Actual Positive                   93                 1056    1149
        Total                           1118                 1180
      Accuracy is: 90% • Matthews Correlation Coefficient (MCC): .81
      Independent Test (5%)
                          Predicted Negative   Predicted Positive   Total
        Actual Negative                   66                    0      66
        Actual Positive                    1                   39      40
        Total                             67                   39
      Accuracy is: 99% • Matthews Correlation Coefficient (MCC): .98
  • 19. Results: With Feature Selection, Without SMOTE
      Training Set (95%)
                          Predicted Negative   Predicted Positive   Total
        Actual Negative                  978                  171    1149
        Actual Positive                  244                  621     865
        Total                           1222                  792
      Accuracy is: 79% • Matthews Correlation Coefficient (MCC): .57
      Independent Test (5%)
                          Predicted Negative   Predicted Positive   Total
        Actual Negative                   66                    0      66
        Actual Positive                    2                   38      40
        Total                             68                   38
      Accuracy is: 98% • Matthews Correlation Coefficient (MCC): .96
  • 20. Results: Without Feature Selection, Without SMOTE
      Training Set (95%)
                          Predicted Negative   Predicted Positive   Total
        Actual Negative                 1015                  134    1149
        Actual Positive                  187                  678     865
        Total                           1202                  812
      Accuracy is: 84% • Matthews Correlation Coefficient (MCC): .67
      Independent Test (5%)
                          Predicted Negative   Predicted Positive   Total
        Actual Negative                   66                    0      66
        Actual Positive                   32                  275      40
        Total                             98                  275
      Accuracy is: 98% • Matthews Correlation Coefficient (MCC): .96
  • 21. Benchmark Dataset
      Class Label   Total   Training Set   Independent Test
      0              1215           1149                 66
      1               905            865                 40
  • 22. Conclusion • To fully understand S-glutathionylation mechanisms, identification of substrates and specific S-glutathionylated sites is crucial. • After running several tests, the training set accuracy was 90% and the independent testing dataset accuracy was 99%. • Computational predictions of S-glutathionylation are very useful due to their high speed. • The experimental results showed that scikit-learn could be useful in assisting the discovery of S-glutathionylated sites.
  • 23. Future Work • Bioinformatics in Systems Biology • Systems biology seeks to understand how cells, tissues, and organisms function from the perspective of the system as a whole. • Computational systems use mathematical modeling, simulation, and statistical analysis to gain a fundamental understanding of biological processes. • Examples include determining the minimal requirements for function and dissecting protein and nucleic acid networks.
  • 24. Acknowledgments • This research is fully supported by the National Science Foundation, Grant # ACI-1560385. • I would like to give special thanks to Dr. Bala Ram for this great opportunity. • Thank you to my REU team: • Mentor: Dr. Dukka KC. • PhD Student: Mr. Clarence White. • Director, Special Academic Programs: Dr. Marcia Williams. • Grad Student: Manoj Rijal • EMCOR REU participants
  • 25. Thank You
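
The accuracy and MCC formulas from the methodology slide can be checked against the confusion matrices on the results slides. The following is a minimal Python sketch rather than the presenter's code: the function names `accuracy` and `mcc` are illustrative assumptions, and the counts are copied from the "With Feature Selection, With SMOTE" matrices on slide 17.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all sites that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient; returns 0.0 if any margin is empty."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Counts copied from slide 17 (matrix layout: TN, FP / FN, TP).
training = dict(tn=995, fp=154, fn=69, tp=1080)
independent = dict(tn=65, fp=1, fn=0, tp=40)

train_acc, train_mcc = accuracy(**training), mcc(**training)
test_acc, test_mcc = accuracy(**independent), mcc(**independent)

print(f"training:    accuracy={train_acc:.3f}  MCC={train_mcc:.3f}")
print(f"independent: accuracy={test_acc:.3f}  MCC={test_mcc:.3f}")
```

With these counts the training accuracy comes out near 0.90 and the independent-test accuracy near 0.99, in line with the figures quoted in the conclusion.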

Editor's Notes

  1. Good afternoon mentors, REU participants, and staff. Name, classification, school, where I'm from.
  2. Machine learning = representation+evaluation+optimization
  3. Computational approaches are needed because they are faster and more convenient. PGluS was developed to predict S-glutathionylated sites based on multiple features and support vector machines. Support vector machine: an algorithm used for classification and regression problems.
  4. Proteins: large molecules, consisting of amino acids, that our cells need to function properly. The data is mainly extracted from RedoxDB and Satellite Global Data Base (SGDB). RedoxDB: a curated database of experimentally verified protein oxidative modifications. In vivo: in the living organism.
  5. Cross-contamination occurs most frequently through avoidable procedural errors. Practicing good aseptic technique is critical, but the computational approach is even better.
  6. Machine learning = representation + evaluation + optimization. Steps for machine learning: define the problem, prepare the data, spot-check the algorithms, improve the results, and present them. To prepare the data, it is important to choose a set of data that is representative of the defined problem. This is known as the hypothesis space. If the data is not in the hypothesis space, it cannot be learned. Spot-checking the algorithm is important in determining a scoring function to differentiate a good classifier from a bad one, based on TP, TN, FP, and FN. Optimization means applying various methods to improve the results.
  7. "Support Vector Machine" (SVM) is a supervised machine learning algorithm that can be used for both classification and regression challenges; it is mostly used in classification problems.
  8. Algorithm used: multilayer perceptron, a class of feedforward artificial neural network. We currently have 2,326 S-glutathionylation sites from dbGSH, a database of experimentally verified S-glutathionylation sites from multiple species. The data set provides UniProt ID, organism, PubMed ID, and sequence. A feature vector is a vector that contains multiple features. UniProt (Universal Protein Resource) is a central repository of protein data created by combining multiple databases: Swiss-Prot, TrEMBL, and PIR-PSD. Talk about the data request (web scraper) and using a web scraper to extract information from websites. The script reads in a file containing the positive site information and the Swiss-Prot FASTA sequences file. Matching sequences are written to a FASTA file. The resulting FASTA file combines the negative and positive data sets, which then have to be split into two files. (A lengthy script handles the sequences that contain C sites.) The positive sites are identified in a file containing only positive sites. Positive window sequences are added to a FASTA file that represents positive sequences. All other C sites are placed in the negative file.
  9. Cross-validation: evaluating estimator performance. MCC is used in machine learning as a measure of the quality of binary (two-class) classification. It takes into account true and false positives and negatives and is generally regarded as a balanced measure that can be used even if the classes are of very different sizes.
  10. How is feature importance determined? In the experiment, it is determined with a method known as random forest, which gives each feature a score known as the Gini importance score. A random forest is a collection of decision trees. Gini importance combines the Gini impurity index for all instances across all trees. The impurity index measures how well each feature splits the tree between positive and negative decisions. Filter feature selection methods apply a statistical measure to assign a score to each feature. The features are ranked by score and either kept or removed from the dataset. Wrapper methods treat the selection of a set of features as a search problem, where different combinations are prepared, evaluated, and compared to other combinations. A predictive model is used to evaluate a combination of features and assign a score based on model accuracy. Embedded methods learn which features best contribute to the accuracy of the model while the model is being created. A common type of embedded feature selection method is regularization.
  11. A confusion matrix is often used to describe the performance of a classification model. TN, FP / FN, TP. Benchmark data comes last.
  12. TN, FP FN, TP
  13. TN, FP FN, TP
  14. TN, FP FN, TP
  15. Dataset used as the baseline for your experiment
  16. scikit-learn: open source; simple and efficient tools for data mining and data analysis.
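
The window-extraction step described in note 8 (positive window sequences go to one file, all other C sites to the negative file) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the script used in the project: the function name `cysteine_windows`, the flank size of 10 residues, the `-` padding character, and the use of 1-based site positions are all hypothetical choices, since the notes do not give them.

```python
def cysteine_windows(sequence, positive_positions, flank=10, pad="-"):
    """Return (positives, negatives): windows centred on each cysteine.

    positive_positions holds 1-based positions of experimentally verified
    S-glutathionylation sites (an assumed convention). Every other cysteine
    in the sequence is treated as a negative site.
    """
    # Pad both ends so cysteines near the termini still give full windows.
    padded = pad * flank + sequence + pad * flank
    positives, negatives = [], []
    for index, residue in enumerate(sequence):
        if residue != "C":
            continue
        # In the padded string the residue sits at index + flank, so the
        # slice below keeps it in the centre of the window.
        window = padded[index:index + 2 * flank + 1]
        if index + 1 in positive_positions:
            positives.append(window)
        else:
            negatives.append(window)
    return positives, negatives


# Toy sequence with cysteines at positions 3 and 6; only position 3 is
# marked as a verified site.
pos_windows, neg_windows = cysteine_windows("MKCAACT", {3}, flank=2)
print(pos_windows)
print(neg_windows)
```

Each window could then be written to a positive or negative FASTA file and turned into a feature vector for the classifier.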