Gao bosc2010 musite

Musite: Prediction of Protein
Phosphorylation Sites

Jianjiong Gao
University of Missouri, Columbia, Missouri
http://musite.sourceforge.net/

Background:
Protein Phosphorylation
Protein phosphorylation is one of the most
important post-translational modifications.
It was estimated that up to 50% of proteins are
phosphorylated in some cellular state.
Abnormality in phosphorylation is a cause or
consequence of many diseases:
Cancer
Diabetes
Parkinson’s
Hepatitis B
…

Background:
Protein Phosphorylation
Phosphorylation-dephosphorylation is a
biochemical switch system regulating
various cellular processes.
Catalyzed by various specific protein
kinases.
[Diagram: Kinase switches ON; Phosphatase switches OFF]

Phosphorylation Site Prediction
Problem Formulation

Phosphorylation site: a phosphorylated amino acid
in a protein (determined by protein sequence)
General phosphorylation site prediction: to predict
whether an amino acid can be phosphorylated
Kinase-specific phosphorylation site prediction: to
predict whether an amino acid can be
phosphorylated by a specific kinase
Based on protein sequence only
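
To make the formulation concrete, here is a minimal sketch of how candidate sites could be enumerated from sequence alone. The function name `candidate_sites`, the flank width, and the `-` padding character are illustrative assumptions, not details taken from Musite.

```python
from typing import Iterator, Tuple


def candidate_sites(
    sequence: str, residues: str = "STY", flank: int = 6
) -> Iterator[Tuple[int, str, str]]:
    """Yield (1-based position, residue, window) for every candidate residue.

    Each window has length 2 * flank + 1 with the candidate in the centre.
    Positions beyond the protein termini are padded with '-'.
    The flank width of 6 is an arbitrary choice for illustration.
    """
    seq = sequence.upper()
    padded = "-" * flank + seq + "-" * flank
    for i, aa in enumerate(seq):
        if aa in residues:
            # In the padded string the residue sits at index i + flank,
            # so the window starts at index i.
            yield i + 1, aa, padded[i : i + 2 * flank + 1]


if __name__ == "__main__":
    for pos, aa, window in candidate_sites("MSTAY", flank=2):
        print(pos, aa, window)
```

A general predictor would score every such window; a kinase-specific predictor would score the same windows with a model trained on substrates of one kinase.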

Limitations of Current Methods

Current prediction tools have
limitations when applied to whole
proteomes
Prediction accuracy could be improved
Most were released as web servers that
restrict the data users can upload
Training data were out of date
Stringency adjustment was not fully
supported

Our tool Musite is unique

Novel method with better accuracy
First open-source tool in the field that meets the
OSI Open Standards Requirement
Standalone program designed for proteome-scale
prediction
Supports both general and kinase-specific
phosphorylation site prediction
Supports customized model training
Supports continuous stringency adjustment
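
One way to picture continuous stringency adjustment is as choosing a score cutoff that gives a requested specificity on a set of control (non-phosphorylated) sites. The sketch below uses a plain empirical quantile; the slides do not say how Musite computes its cutoffs, so the function names and the method are assumptions.

```python
import math
from typing import List, Sequence


def threshold_for_specificity(
    control_scores: Sequence[float], specificity: float
) -> float:
    """Return the smallest cutoff at which at least `specificity` of the
    control scores are rejected (a score is rejected when score <= cutoff)."""
    if not 0.0 <= specificity <= 1.0:
        raise ValueError("specificity must be between 0 and 1")
    if not control_scores:
        raise ValueError("control_scores must not be empty")
    ranked = sorted(control_scores)
    # Number of control sites that must fall at or below the cutoff.
    n_reject = math.ceil(specificity * len(ranked))
    if n_reject == 0:
        return -math.inf  # nothing needs to be rejected
    return ranked[n_reject - 1]


def predict(scores: Sequence[float], cutoff: float) -> List[bool]:
    """Call a site phosphorylated when its score is strictly above the cutoff."""
    return [s > cutoff for s in scores]


if __name__ == "__main__":
    control = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
    cutoff = threshold_for_specificity(control, 0.75)
    print(cutoff, predict([0.65, 0.95], cutoff))
```

Because the cutoff is a function of a continuous specificity value, the user can move smoothly between lenient and stringent predictions instead of picking from a few fixed levels.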

Flowchart
Data collection from high-quality sources, such as UniProt/Swiss-Prot, Phospho.ELM, PhosphoPep, and PhosPhAt
Non-redundant datasets built by BLASTclust
Phosphorylation sites and non-phosphorylation sites
Feature extraction: KNN scores, disorder scores, amino acid frequencies
Features from positive set and features from negative set
Training data
Bootstrap: bootstrap sample 1 ... bootstrap sample m
Training: classifier 1 ... classifier m
Aggregating
Phosphorylation prediction model
Control data: specificity estimation
Making predictions on new data
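
The bootstrap, training, and aggregating steps of the flowchart can be sketched roughly as follows. This is an illustration only: the nearest-centroid base learner is a stand-in chosen to keep the example dependency-free, the slides do not name the classifier or the sampling scheme Musite uses, and all function names are hypothetical.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence

Vector = Sequence[float]
Classifier = Callable[[Vector], float]


def centroid(rows: Sequence[Vector]) -> List[float]:
    """Component-wise mean of a set of feature vectors."""
    return [mean(col) for col in zip(*rows)]


def train_centroid_classifier(pos: Sequence[Vector], neg: Sequence[Vector]) -> Classifier:
    """Stand-in base learner (an assumption, not Musite's classifier).

    Returns a score in [0, 1]; values above 0.5 mean the vector is closer
    to the positive centroid than to the negative one.
    """
    cp, cn = centroid(pos), centroid(neg)

    def score(x: Vector) -> float:
        dp = sum((a - b) ** 2 for a, b in zip(x, cp))
        dn = sum((a - b) ** 2 for a, b in zip(x, cn))
        return dn / (dp + dn) if dp + dn > 0 else 0.5

    return score


def train_bagged_model(
    pos: Sequence[Vector], neg: Sequence[Vector], m: int = 10, seed: int = 0
) -> Classifier:
    """Train m classifiers on bootstrap samples and average their scores."""
    rng = random.Random(seed)
    classifiers: List[Classifier] = []
    for _ in range(m):
        # Bootstrap sample: draw with replacement, same size as the original set.
        pos_sample = rng.choices(pos, k=len(pos))
        neg_sample = rng.choices(neg, k=len(neg))
        classifiers.append(train_centroid_classifier(pos_sample, neg_sample))
    # Aggregating step: the ensemble score is the mean of the member scores.
    return lambda x: mean(c(x) for c in classifiers)


if __name__ == "__main__":
    positives = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.9]]
    negatives = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.2]]
    model = train_bagged_model(positives, negatives, m=5)
    print(model([0.85, 0.85]), model([0.15, 0.15]))
```

In this sketch each feature vector would hold the KNN scores, disorder scores, and amino acid frequencies of one site; the ensemble score is what a specificity-based cutoff would later be applied to.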

Data Extraction

[Figure: flowchart repeated from the Flowchart slide]

Feature Extraction

[Figure: flowchart repeated from the Flowchart slide]

KNN Features
Motivation
Rationale for using KNN features: local
sequence clusters exist around
phosphorylation sites, since
Each phosphorylation site is a substrate of a specific
protein kinase
Substrates of the same kinase or kinase family
usually share similar patterns in local sequences
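
A KNN feature built on this rationale could look like the sketch below: for a query window, find its k most similar windows in a labelled set and report the fraction that are known phosphosites. The slides do not define the score or the similarity measure, so both are assumptions here; Hamming distance is used only as a simple stand-in for a real sequence similarity measure.

```python
from typing import Sequence, Tuple


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length windows."""
    return sum(x != y for x, y in zip(a, b))


def knn_score(
    query: str, labelled_windows: Sequence[Tuple[str, bool]], k: int
) -> float:
    """Fraction of known phosphosites among the k windows nearest to `query`.

    `labelled_windows` holds (window, is_phosphosite) pairs.
    Ties in distance are broken by input order (sorted() is stable).
    """
    if not 1 <= k <= len(labelled_windows):
        raise ValueError("k must be between 1 and the number of labelled windows")
    ranked = sorted(labelled_windows, key=lambda item: hamming(query, item[0]))
    nearest = ranked[:k]
    return sum(1 for _, is_phospho in nearest if is_phospho) / k


if __name__ == "__main__":
    training = [
        ("RRASVAA", True),
        ("RRKSLAA", True),
        ("GGGSGGG", False),
        ("LLLSLLL", False),
    ]
    print(knn_score("RRASLAA", training, k=2))
```

If substrates of the same kinase cluster in sequence space, a true phosphosite should find mostly phosphosites among its neighbours and score close to 1, while a non-phosphosite should score near the background fraction of positives.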

KNN Features
Result
Overall, phosphosites have larger KNN scores
than non-phosphosites
Average KNN scores
0.7~0.8 for phosphosites
≈0.5 for non-phosphosites

[Figure: Boxplot of KNN features (Human Ser/Thr), Phospho vs. Nonphospho. x-axis: size of nearest neighbors (% of sample size): 0.25, 0.5, 1, 2, 4. y-axis: KNN score, 0 to 1.]

Disorder Features
Concept & Rationale

Disordered region (structure)
Some parts of a protein have a rigid structure,
such as α-helix and β-sheet.
Other parts, disordered regions, do not have
well-defined conformations
The conformational flexibility of disordered
regions may facilitate protein phosphorylation
[Dunker, 2008]: protein phosphorylation sites
are frequently located within disordered regions

Disorder Features
Result
For phosphosites
Occurrence increases exponentially
when disorder score increases
For non-phosphosites
Significantly different distribution
Disorder score > 0.5
Phosphosites: ~91%
Non-phosphosites: ~55%
Phosphosites are significantly
over-represented in disordered
regions

[Figure: Histogram of disorder features (Human Ser/Thr). (A) Phospho-S/T in H. sapiens. (B) Non-phospho-S/T in H. sapiens. x-axis: disorder score, 0 to 1. y-axis: occurrence.]

Amino Acid Frequencies
Result
[Figure: Log2(Ratio of Frequency) by amino acid for H. sapiens (S/T), M. musculus (S/T), D. melanogaster (S/T), C. elegans (S/T), S. cerevisiae (S/T), and A. thaliana (S/T). x-axis: amino acid, in the order P R D E S K G A Q N V T H L M I F Y W C. y-axis: Log2(Ratio of Frequency), -2.5 to 1.]

P, R, D, E, S, K, and G are enriched around
phosphosites
C, W, Y, F, I, M, L, H, T, and V are depleted
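
The enrichment and depletion reported here can be expressed as a log2 ratio of residue frequencies. The sketch below assumes the ratio is the frequency in windows around phosphosites divided by the frequency in windows around non-phosphosites; the slide does not spell out the denominator, and the function names are illustrative.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def residue_frequencies(windows: Iterable[str]) -> Dict[str, float]:
    """Relative frequency of each residue over all windows.

    Non-letter characters such as '-' padding are ignored.
    """
    counts = Counter(aa for w in windows for aa in w if aa.isalpha())
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def log2_frequency_ratio(
    pos_windows: Iterable[str], neg_windows: Iterable[str]
) -> Dict[str, float]:
    """log2(frequency around phosphosites / frequency around non-phosphosites).

    Positive values indicate enrichment, negative values depletion.
    Residues missing from either set are skipped because the ratio is undefined.
    """
    pos = residue_frequencies(pos_windows)
    neg = residue_frequencies(neg_windows)
    return {aa: math.log2(pos[aa] / neg[aa]) for aa in pos if aa in neg}


if __name__ == "__main__":
    ratios = log2_frequency_ratio(["PPAA"], ["PAAA"])
    for aa, value in sorted(ratios.items(), key=lambda kv: -kv[1]):
        print(aa, round(value, 3))
```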

Classifier Training

[Figure: flowchart repeated from the Flowchart slide]

Results
Trained Models
General Prediction
Human Ser/Thr
Human Tyr
Mouse Ser/Thr
Mouse Tyr
Fruit fly Ser/Thr
Worm Ser/Thr
Yeast Ser/Thr
Arabidopsis Ser/Thr

Kinase-Specific Prediction
ATM
CDK/CDK1/CDK2
CK1/CK2
MAPK1/MAPK3
PKA
PKB
PKC
Src

Results
Cross validation
[Figure: ROC curves from cross-validation, with a second panel zoomed to 1 - Specificity from 0 to 0.1. Curves: C. elegans (S/T), A. thaliana (S/T), H. sapiens (S/T), M. musculus (S/T), S. cerevisiae (S/T), D. melanogaster (S/T), M. musculus (Y), H. sapiens (Y), and random guess. x-axis: 1 - Specificity. y-axis: Sensitivity.]

Results
Comparison to other tools
[Figure: ROC curves comparing Musite, Scan-x, DISPHOS, and NetPhos, with a second panel zoomed to 1 - Specificity from 0 to 0.1. x-axis: 1 - Specificity. y-axis: Sensitivity.]

Software Implementation-Musite

Open Source
License: GNU General Public License (GPL)
http://musite.sourceforge.net/
Stand-alone application
Based on Java
Supports Windows, Linux, and Mac OS X
A web server is also being developed
http://musite.net/

Implementation
User Interface

Implementation
Customized Model Training

A unique utility for users to train
prediction models from their own data
Take advantage of latest data
Train disease-specific models
Train organ-specific models
Integrate into experimental procedure in an
iterative way

Summary

Musite predicts general and kinase-specific
phosphosites with better accuracy

Musite is an open-source standalone program
capable of performing proteome-wide
predictions

Acknowledgements

Dr. Dong Xu (University of Missouri)
Dr. Jay Thelen (University of Missouri)
Dr. Keith Dunker (Indiana University)
Curtis Bollinger (University of Missouri)

Funding
NSF [# DBI-0604439]
NIH [# R21/R33 GM078601]

Visit us at
http://musite.sourceforge.net
http://musite.net
Poster R09 at ISMB
