SUMOylation-site Prediction. Denis C. Bauer, Fabian A. Buske, Mikael Bodén
Overview: Background (SUMOylation: what is that?); Published predictors; Our approach; What makes SUMO hard to tackle
SUMO is not 相撲 (sumo wrestling). The Small Ubiquitin-related Modifier (SUMO) is a small protein of 97 amino acids with 20% homology to ubiquitin. It is a post-translational modification, covalently attached to lysines. It is involved in many pathways/mechanisms: transcriptional regulation, compartmentalisation.
SUMOylation pathway
SUMOylation motif: One consensus motif, [ILV]K.E, accounts for about 60% of known sites (true positives, TP). However, not all [ILV]K.E sites are SUMOylated (false positives, FP), and not all SUMOylated sites have the consensus motif (false negatives, FN).
Baseline prediction: Regular Expression scanner, CC = 0.68
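The baseline scanner can be sketched in a few lines of Python. This is only an illustration of scanning for the [ILV]K.E motif named on the slides, not the authors' implementation; the function name `scan_consensus_sites` and the toy sequence are assumptions made for the example.

```python
import re

# Consensus motif from the slides: [ILV]K.E
# The lookahead lets overlapping occurrences of the motif be reported too.
CONSENSUS = re.compile(r"(?=[ILV]K.E)")


def scan_consensus_sites(sequence):
    """Return 1-based positions of lysines that sit inside an [ILV]K.E match.

    Hypothetical helper: every lysine in a motif match is predicted as
    SUMOylated, every other lysine as not SUMOylated.
    """
    # m.start() is the 0-based index of the [ILV] residue, so the lysine is
    # at m.start() + 1 (0-based), i.e. m.start() + 2 (1-based).
    return [m.start() + 2 for m in CONSENSUS.finditer(sequence.upper())]


# Toy sequence (made up for illustration, not a real protein).
toy = "MALKSEPVKQEAAIKDEK"
print(scan_consensus_sites(toy))  # [4, 9, 15]
```

The last lysine of the toy sequence (position 18) is not reported because it is not preceded by I, L or V and followed by any residue plus E.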
Comparison with existing predictors
Method                      CC
Regular Expression scanner  0.68
SUMOpre +                   0.64
SUMOsp ‡                    0.26
SUMOplot †                  0.48
+ Xu J., BMC Bioinformatics 2008, 9:8
‡ Xue Y., Nucleic Acids Res 2006, W254-W257
† http://www.abgent.com/doc/sumoplot (commercial)
Case study: Core histones in yeast. Identified SUMOylation sites +: H2B: K6/7, K16/17; H2A: K2, K126; H4: somewhere in the tail. No SUMOylation consensus site. Predictors to date are not able to predict even a single SUMOylation site in the histone sequences. + Nathan D., Genes Dev 2006, 20(8):966-76
Our approach: Identify the best window size and the best ML method. Voilà: a better predictor! Diagram: sequence window xxxx K xxxx → ML → SUMOylation 1/0
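Training in more detail: each lysine (K) in the protein sequence is presented to the ML method together with its flanking windows wU and wD; the targets T (for example 0 1 0) are compared with the predictions P (for example 1 1 0). Imbalance in the dataset: more negatives (not SUMOylated K) than positives (SUMOylated K).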
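Prediction in more detail: each lysine (K) in the protein sequence, with its flanking windows wU and wD, is passed to the trained ML method, which labels it as SUMOylated (1) or not SUMOylated (0), for example 1 1 0.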
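The windowing step used for both training and prediction might look like the following sketch. It assumes that wU and wD are the numbers of residues taken on either side of the lysine, and that windows running past the sequence ends are padded with "-"; the padding character, the default size of 4 (chosen to match the "xxxx K xxxx" diagram) and the function name `lysine_windows` are illustrative assumptions rather than details given on the slides.

```python
def lysine_windows(sequence, w_up=4, w_down=4, pad="-"):
    """Yield (1-based position, window) for every lysine in the sequence.

    Assumption: windows that extend beyond the termini are padded with
    `pad` so that all windows have length w_up + 1 + w_down.
    """
    padded = pad * w_up + sequence + pad * w_down
    for i, residue in enumerate(sequence):
        if residue == "K":
            # In padded coordinates the lysine sits at index i + w_up,
            # so the window starts at i and spans w_up + 1 + w_down residues.
            yield i + 1, padded[i : i + w_up + 1 + w_down]


# Toy sequence (made up for illustration).
for position, window in lysine_windows("MKAAK", w_up=2, w_down=2):
    print(position, window)
# 2 -MKAA
# 5 AAK--
```

Each window would then be paired with its 1/0 label for training, or passed on its own to the trained model for prediction.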
ML methods. Bidirectional Recurrent Neural Network (BRNN): uses information from flanking windows, decaying with distance to the center window; prone to overfitting. Support Vector Machine (SVM): regularized; requires a suitable kernel and feature representation. Standard kernels: linear, polynomial, RBF. String kernels: P-kernel, local-alignment kernel.
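To make the three standard kernels concrete, here is a small pure-Python sketch that evaluates them on one-hot encoded sequence windows. The one-hot encoding, the alphabet with a "-" padding symbol, and the parameter defaults (`degree`, `coef0`, `gamma`) are assumptions for illustration; the slides do not state which feature representation or kernel parameters were used.

```python
import math

# 20 standard amino acids plus a padding symbol (assumed encoding).
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def one_hot(window):
    """Encode a window as a flat 0/1 vector, one block per position."""
    vec = []
    for residue in window:
        vec.extend(1.0 if residue == a else 0.0 for a in ALPHABET)
    return vec


def linear_kernel(x, y):
    # Dot product; for one-hot windows this counts matching positions.
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    return (linear_kernel(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=0.5):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Two toy windows that differ at one position.
x, y = one_hot("AAKEE"), one_hot("AAKDE")
print(linear_kernel(x, y))      # 4.0
print(polynomial_kernel(x, y))  # 25.0
print(rbf_kernel(x, y))         # exp(-1), about 0.368
```

With this encoding each mismatching position contributes 2 to the squared distance, which is why a single mismatch with gamma = 0.5 gives exp(-1).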
Data set. Training/testing data: 144 proteins with 241 SUMOylation sites and 5,741 non-SUMOylated lysines; 68% of the SUMOylated sites conform to the consensus motif. Hold-out: 13 proteins with 27 SUMOylation sites; 48% conform to the consensus motif. Source: Xu J., BMC Bioinformatics 2008, 9:8
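The class imbalance in the training/testing data can be quantified directly from the counts above. The sketch below also computes inverse-frequency class weights, which are one common way to compensate for imbalance; the slides do not say how (or whether) the imbalance was handled, so the weighting scheme is an assumption.

```python
positives = 241    # SUMOylated lysines (from the slide)
negatives = 5741   # non-SUMOylated lysines (from the slide)
total = positives + negatives

positive_fraction = positives / total
negatives_per_positive = negatives / positives

# Accuracy of a trivial classifier that always predicts "not SUMOylated".
always_negative_accuracy = negatives / total

# Assumed heuristic: weight = total / (number of classes * class count).
class_weights = {
    1: total / (2 * positives),
    0: total / (2 * negatives),
}

print(f"positive fraction:        {positive_fraction:.3f}")
print(f"negatives per positive:   {negatives_per_positive:.1f}")
print(f"always-negative accuracy: {always_negative_accuracy:.3f}")
print(f"class weights:            {class_weights}")
```

Roughly 4% of the lysines are positives, so a predictor that never calls a site reaches about 96% accuracy, which is why accuracy alone says little on this data set.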
Evaluation: 5-fold cross-validation; Matthews correlation coefficient (CC); sensitivity, specificity, accuracy; area under the curve (AUC)
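The threshold-based measures listed here can be computed from the four confusion-matrix counts. The following is a minimal sketch using the standard definitions; the function names are illustrative, and returning 0.0 for an undefined Matthews correlation coefficient is a convention assumed here. The example reuses the targets T = 0 1 0 and predictions P = 1 1 0 from the training slide.

```python
import math


def confusion_counts(targets, predictions):
    """Return (tp, fp, tn, fn) for paired lists of 0/1 labels."""
    pairs = list(zip(targets, predictions))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def matthews_cc(tp, fp, tn, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when the coefficient is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def sensitivity(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0


def specificity(tn, fp):
    return tn / (tn + fp) if tn + fp else 0.0


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


tp, fp, tn, fn = confusion_counts([0, 1, 0], [1, 1, 0])
print(matthews_cc(tp, fp, tn, fn))  # 0.5
print(sensitivity(tp, fn))          # 1.0
print(specificity(tn, fp))          # 0.5
```

Unlike accuracy, the correlation coefficient uses all four counts, so it stays informative when negatives heavily outnumber positives.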
Performance overview SUMOsvm
Comparison with existing methods
Quest to improve performance: protein structural features and evolutionary features; separating SUMOylation sites from different species or compartments; clustering for other motifs using kernel hierarchical clustering
Summary: The Regular Expression Scanner is still the best classifier. SUMO is more versatile than expected! The road to better predictions: Are there other motifs? Which features can discriminate? Is the dataset biased? Image: http://spot.colorado.edu/~colemab/Theatre_Resources/SumoBallerina.jpg
Acknowledgment. Predictor/Analysis: Mikael Bodén, Fabian Buske. Dataset: Xu et al. PhD Supervisors: Tim Bailey, Andrew Perkins, Mikael Bodén. Other bioinformatic tools: STREAM, a practical workbench for modeling transcriptional regulation. www.bioinformatics.org.au/stream/
