Prediction of protein subcellular localization (PSL) is important for genome annotation, protein function prediction, and drug discovery. Many computational approaches to sequence-based PSL prediction for Gram-negative bacteria have been proposed in recent years. We present PSLDoc, a method based on gapped-dipeptides and probabilistic latent semantic analysis (PLSA). A protein is treated as a term string composed of gapped-dipeptides, defined as any two residues separated by one or more positions. The weighting of each gapped-dipeptide is calculated from a position specific scoring matrix, which captures sequence evolutionary information. PLSA is then applied for feature reduction, and the reduced vectors are input to five one-versus-rest support vector machine classifiers; the localization site with the highest probability is assigned as the final prediction. Because a strong correlation between sequence homology and subcellular localization has been reported (Nair and Rost, Protein Sci 2002;11:2836–2847; Yu et al., Proteins 2006;64:643–651), we evaluate PSLDoc separately on low- and high-homology data sets. PSLDoc's overall accuracy reaches 86.84% on the low-homology set and 98.21% on the high-homology set, comparing favorably with CELLO II (Yu et al., Proteins 2006;64:643–651). In addition, we set a confidence threshold to achieve high precision at specified levels of recall. When the confidence threshold is set at 0.7, PSLDoc achieves a precision of 97.89%, which is considerably better than that of PSORTb v.2.0 (Gardy et al., Bioinformatics 2005;21:617–623). Our approach demonstrates that this feature representation can be successfully applied to the prediction of protein subcellular localization and improves prediction accuracy.
Moreover, because the representation is general, our method can be extended to eukaryotic proteomes in the future. The PSLDoc web server is publicly available at http://bio-cluster.iis.sinica.edu.tw/∼bioapp/PSLDoc/.
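As an illustration of the gapped-dipeptide representation, here is a hypothetical sketch that only enumerates the terms of a short sequence. The `"Ad0L"`-style term encoding and the `max_gap` cutoff are illustrative choices, not PSLDoc's exact conventions, and the PSSM-based weighting step is omitted.

```python
# Hypothetical sketch of the gapped-dipeptide representation: a gapped
# dipeptide is residue X followed by residue Z with d residues in between
# (d = 0 means adjacent residues). Term encoding and max_gap are illustrative.
def gapped_dipeptides(sequence, max_gap=2):
    """Enumerate the gapped-dipeptide 'terms' of a protein sequence."""
    terms = []
    for d in range(max_gap + 1):
        for i in range(len(sequence) - d - 1):
            terms.append(f"{sequence[i]}d{d}{sequence[i + d + 1]}")
    return terms
```

In PSLDoc each such term is further weighted using the protein's position specific scoring matrix before PLSA reduction; this sketch covers term extraction only.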
The document proposes a data-driven approach to predict travel times for vehicles traveling on known routes using historical trajectory data. It represents routes and trajectories mathematically and defines a weighted LP-norm distance measure to find the most similar historical trajectory to a current partial trajectory. This nearest neighbor trajectory is then used to predict the future movement. Evaluation shows the correlation-based weighted LP-norm together with an iterative threshold algorithm provides accurate and efficient predictions.
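The distance computation described can be sketched as follows. Function names, the weights, and the toy data are assumptions; real trajectories would be sequences of timestamped positions rather than scalars.

```python
# Illustrative sketch of a weighted Lp-norm nearest-neighbor search over
# historical trajectories (names, weights, and toy data are assumptions).
def weighted_lp_distance(x, y, w, p=2):
    """Weighted Lp-norm distance between two equal-length value sequences."""
    return sum(wi * abs(xi - yi) ** p for xi, yi, wi in zip(x, y, w)) ** (1.0 / p)

def nearest_trajectory(partial, history, w, p=2):
    """Index of the historical trajectory whose prefix is closest to `partial`."""
    return min(range(len(history)),
               key=lambda i: weighted_lp_distance(partial, history[i][:len(partial)], w, p))
```

The selected historical trajectory's remaining portion would then serve as the prediction of future movement.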
This study investigates the localization and function of VDAC4, a porin protein in Tetrahymena thermophila mitochondria. Bioinformatics analysis predicted VDAC4 contains a conserved Porin3 domain found in mitochondrial porins and Tom40 proteins. The researcher amplified and cloned the VDAC4 gene, created a YFP fusion construct, transformed Tetrahymena cells, and observed YFP localization using microscopy. YFP localized to punctate structures consistent with mitochondrial localization, supporting VDAC4 involvement in mitochondrial membrane processes.
This document summarizes a study on the localization and potential functions of two proteins, VDAC4 and TGrPE1, in the ciliated protozoan Tetrahymena thermophila. VDAC4 encodes a porin protein that localizes to mitochondria and basal bodies, suggesting a role in mitochondrial-ciliary transport. TGrPE1 encodes a protein containing a GrPE1 domain that may regulate chaperone proteins through ATP/ADP exchange, and it localizes near mitochondria. The goal is to better understand the roles of these proteins in mitochondrial processes and protein folding in Tetrahymena.
This document discusses the use of latent semantic analysis (LSA) for document clustering. It describes issues with traditional information retrieval systems, defines key concepts like synonymy and polysemy, and explains how LSA addresses these issues by reducing the semantic space. An experiment is described where documents are clustered with and without LSA preprocessing, showing that LSA leads to improved cluster quality metrics like purity, entropy, and average intra-cluster similarity. The study demonstrates LSA can perform comparably to dedicated clustering tools for organizing documents by topic.
The document discusses the Flint water crisis where the city switched its water source in 2014 from the Detroit water system to the Flint River without implementing corrosion control. This caused lead to leach into the drinking water from aging pipes. Independent studies showed high lead levels in water and children's blood, but officials dismissed residents' concerns. The crisis highlighted issues with aging infrastructure, improper water treatment, and environmental racism. Lead exposure can negatively impact childhood development and public health. Similar problems with lead in drinking water have been found in over half of Massachusetts schools tested.
Finding new friends: A different kind of recommendation system (Eva Ward)
This document discusses using topic modeling techniques like Non-negative Matrix Factorization (NMF) and Latent Dirichlet Allocation (LDA) to analyze a corpus of 2,000 documents about Seattle in order to discover common topics that could be used to recommend potential hosts to visitors. The analysis identified 12 topics in the documents, including "artsy", "yoga", "hippies", "outdoorsy", and "young professional". The document compares the topics discovered using NMF versus LDA.
The document discusses model-based clustering of bike sharing station usage data from the Velib' system in Paris. It presents an approach using a naive Poisson mixture model to cluster stations based on their temporal usage profiles, represented as count time series. The model assumes stations belong to clusters that capture their weekly and daily usage patterns. Expectation-maximization is used to estimate the model parameters, assigning stations to clusters and identifying cluster-specific temporal profiles. Analysis of results from applying this approach to Velib' data aims to better understand station usage and grouping stations with similar behaviors.
This document summarizes a presentation on the OpenNLP toolkit. OpenNLP is an open-source Java toolkit for natural language processing. It provides common NLP features like tokenization, sentence segmentation, part-of-speech tagging, and named entity extraction. The presentation discusses how these features work using pre-trained models for different languages. An example is also given showing how OpenNLP could be used to extract tags from a website and display them in a tag cloud. The presentation concludes by providing contact information for the presenter.
A Simple Stochastic Gradient Variational Bayes for Latent Dirichlet Allocation (Tomonari Masada)
This document proposes applying stochastic gradient variational Bayes (SGVB) to latent Dirichlet allocation (LDA) topic modeling to obtain an efficient posterior estimation. SGVB introduces randomness into variational inference for LDA by estimating expectations with Monte Carlo integration and using reparameterization to sample from approximate posterior distributions. Evaluation on several text corpora shows perplexities comparable to existing LDA inference methods, with the potential for faster parallelization using techniques like GPU processing. Future work will explore applying SGVB to other probabilistic document models like correlated topic models.
EM algorithm and its application in probabilistic latent semantic analysis (zukun)
The document discusses the EM algorithm and its application in Probabilistic Latent Semantic Analysis (pLSA). It begins by introducing the parameter estimation problem and comparing frequentist and Bayesian approaches. It then describes the EM algorithm, which iteratively computes lower bounds to the log-likelihood function. Finally, it applies the EM algorithm to pLSA by modeling documents and words as arising from a mixture of latent topics.
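As an illustration of the E- and M-steps described, here is a minimal pure-Python pLSA sketch on toy document-word counts. All names and the toy corpus are illustrative; a real implementation would monitor the log-likelihood for convergence.

```python
# Minimal pLSA EM sketch: fit P(z|d) and P(w|z) to a doc-word count matrix.
import random

def plsa_em(counts, K, iters=30, seed=0):
    rng = random.Random(seed)
    D, W = len(counts), len(counts[0])
    p_z_d = [[rng.random() + 0.1 for _ in range(K)] for _ in range(D)]  # P(z|d)
    p_w_z = [[rng.random() + 0.1 for _ in range(W)] for _ in range(K)]  # P(w|z)
    for dist in p_z_d + p_w_z:                      # normalize initial guesses
        s = sum(dist)
        dist[:] = [v / s for v in dist]
    for _ in range(iters):
        new_zd = [[1e-12] * K for _ in range(D)]
        new_wz = [[1e-12] * W for _ in range(K)]
        for d in range(D):
            for w in range(W):
                if counts[d][w] == 0:
                    continue
                # E-step: posterior P(z|d,w) is proportional to P(z|d) * P(w|z)
                post = [p_z_d[d][z] * p_w_z[z][w] for z in range(K)]
                s = sum(post)
                for z in range(K):
                    r = counts[d][w] * post[z] / s  # expected count for topic z
                    new_zd[d][z] += r
                    new_wz[z][w] += r
        # M-step: re-normalize expected counts into distributions
        for d in range(D):
            s = sum(new_zd[d]); p_z_d[d] = [v / s for v in new_zd[d]]
        for z in range(K):
            s = sum(new_wz[z]); p_w_z[z] = [v / s for v in new_wz[z]]
    return p_z_d, p_w_z
```

Each EM iteration provably does not decrease the data log-likelihood, which is what the lower-bound argument in the document establishes.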
Latent semantic analysis (LSA) is a technique used in natural language processing to analyze relationships between documents and terms by producing concepts related to them. LSA assumes words with similar meanings will occur in similar texts, and uses a documents-terms matrix and singular value decomposition to discover hidden concepts and represent words and documents as vectors in a semantic vector space. Apache OpenNLP is a machine learning toolkit that can be used for various natural language processing tasks like part-of-speech tagging and parsing, and LSA can be seen as part of natural language processing.
Topic modeling is a technique for discovering hidden semantic patterns in large document collections. It represents documents as probability distributions over latent topics, where each topic is characterized by a distribution over words. Two common probabilistic topic models are latent Dirichlet allocation (LDA) and probabilistic latent semantic analysis (pLSA). LDA assumes each document exhibits multiple topics in different proportions, with topics modeled as distributions over words. Topic modeling provides dimensionality reduction and can be applied to problems like text classification, collaborative filtering, and computer vision tasks like image classification.
Latent semantic indexing (LSI) allows search engines to determine the topic of a page outside of directly matching search terms. LSI models the contexts in which words are used to find related pages, similar to how humans understand language through context. LSI was introduced to search engines by Applied Semantics and acquired by Google to power adsense through finding related ads. LSI gives search engines the ability to return more relevant results to users by understanding related terms, synonyms, singular/plural forms, and words with similar meanings or roots. Implementing LSI in a website involves developing thematically focused content using related keywords and synonyms throughout to better match user intent.
The document provides an introduction to Probabilistic Latent Semantic Analysis (PLSA). It discusses how PLSA improves on previous Latent Semantic Analysis methods by incorporating a probabilistic framework. PLSA models documents as mixtures of topics and allows words to have multiple meanings. The parameters of the PLSA model, including the topic distributions and word-topic distributions, are estimated using an expectation-maximization algorithm to find the parameters that best explain the observed word-document co-occurrence data.
The document discusses various bioinformatics concepts and tools including sequence alignment, BLAST, substitution matrices, and open reading frames. Sequence alignment involves comparing sequences to find similar regions and can be local or global. BLAST is a tool used to find similar sequences in a database by searching for exact and similar matches. Substitution matrices like BLOSUM and PAM assign scores to amino acid substitutions observed in protein evolution. Open reading frames refer to the three possible frames for translating a nucleic acid sequence into a protein.
2016.09.28
TOPIC REVIEW
• Exam
• PS2 Sequence Alignment
• Command Line Blast
• PS1 Molecular Biology
• Personal Microbiome Project
CURRENTLY
LET’S NEGOTIATE
• Problem sets (4) - 10%
• Microbiome project - 20%
• Exam (1) - 20%
• Research project - 45%
• Participation - 5%
OR
• Problem sets (4) - 10%
• Microbiome project - 20%
• Exam 1 - 15%
• Exam 2 - 15%
• Research project - 35%
• Participation - 5%
PS2 SEQUENCE ALIGNMENT
RefSeqs, protein (experimentally supported)
On chromosome 17
Reverse strand
PRCD Progressive rod-cone degeneration
PS2: GLOBAL ALIGNMENT
BLOSUM62
• Substitutions are penalized less and are preferred over gaps; the level of identity also decreases.
BLOSUM80
• Substitutions are penalized more and gaps are favored.
PAM60
• Substitutions are penalized more and gaps are favored.
PAM250
• Substitutions are penalized less and are preferred over gaps; the level of identity also decreases.
PS2: LOCAL ALIGNMENT
SEQ1: A L S C V W M I P (9 residues)
SEQ2: A I S C M I P T (8 residues)
Create matrix: (length of seq1 + 1) x (length of seq2 + 1) = 10 x 9
        A    L    S    C    V    W    M    I    P
    0  -2   -4   -6   -8  -10  -12  -14  -16  -18
A  -2
I  -4
S  -6
C  -8
M -10
I -12
P -14
T -16
Exercise: fill the scores of the alignment matrix using the BLOSUM62 substitution matrix.
Gap opening penalty: -5
Gap extension penalty: -1
SEQ1: S V E T D
SEQ2: T S I N Q E T
BLOSUM62 substitution matrix (lower triangle):

        A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
Ala A   4
Arg R  -1   5
Asn N  -2   0   6
Asp D  -2  -2   1   6
Cys C   0  -3  -3  -3   9
Gln Q  -1   1   0   0  -3   5
Glu E  -1   0   0   2  -4   2   5
Gly G   0  -2   0  -1  -3  -2  -2   6
His H  -2   0   1  -1  -3   0   0  -2   8
Ile I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
Leu L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
Lys K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
Met M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
Phe F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
Pro P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
Ser S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
Thr T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
Trp W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Tyr Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
Val V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
Dynamic programming - global alignment
BLOSUM62
GAP COST: -2
At each cell, three scores are calculated:
• Match score = diagonal cell score + score from the substitution matrix
• Vertical gap score = upper neighbor + gap cost
• Horizontal gap score = left neighbor + gap cost
• The highest score is retained and its arrow is recorded
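The fill step above can be sketched as follows; a toy match/mismatch scorer stands in for the BLOSUM62 lookup, the gap cost is -2 as on the slide, and function names are illustrative.

```python
# Sketch of the global-alignment (Needleman-Wunsch) matrix fill described
# above. seq1 indexes the rows of F; a toy scorer replaces BLOSUM62.
def needleman_wunsch_matrix(seq1, seq2, score, gap=-2):
    n, m = len(seq1), len(seq2)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap                     # first column: leading gaps
    for j in range(1, m + 1):
        F[0][j] = j * gap                     # first row: leading gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = F[i - 1][j - 1] + score(seq1[i - 1], seq2[j - 1])  # diagonal
            vgap = F[i - 1][j] + gap          # upper neighbor + gap cost
            hgap = F[i][j - 1] + gap          # left neighbor + gap cost
            F[i][j] = max(match, vgap, hgap)  # retain the highest score
    return F

toy_score = lambda a, b: 1 if a == b else -1  # stand-in for a BLOSUM62 lookup
F = needleman_wunsch_matrix("ALSCVWMIP", "AISCMIPT", toy_score)
```

The first row and column of `F` reproduce the -2-per-gap border shown in the exercise matrix; a traceback over the recorded arrows would recover the alignment itself.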
The document appears to be the output from running a BLAST search on a query sequence against a yeast genome database. The top hit from the BLAST search is a sequence on Saccharomyces cerevisiae chromosome XI with 85% identity over 88 amino acids and an E-value of 4e-12. Additional significant hits are also reported. The bottom of the document discusses retrieving the genomic sequence for the top hit and generating a dotplot comparison between the query and hit sequences.
This document summarizes a talk given by Dr. Noel O'Boyle on using Python for chemistry. It discusses what Python is, why it is useful for chemistry, and how it can be used. Specific examples are given of popular Python modules for tasks like data analysis, visualization, cheminformatics, and interfacing with other languages like R and Java. The document provides an overview of the capabilities of Python for scientific computing and highlights its growing adoption in the chemistry community.
The document discusses bioinformatics and computational biology. It describes a lab with over 100 people from diverse backgrounds, including engineers, scientists, technicians, geneticists and clinicians. The lab applies information technology to analyze biological data, focusing on areas like sequence analysis, molecular modeling, phylogeny, medical applications, statistics and more. Specific applications mentioned include analyzing genomes to study genetic diseases and drug design, as well as using the same techniques in agriculture and animal health.
- Maintain a 2D array P of size V×V along with the distance array D
- P[i][j] stores the intermediate vertex on the shortest path from i to j
- During each iteration of k, if an updated shortest path is found via k, store k in P[i][j]
- To extract the path, start from P[i][j] and follow the chain of intermediate vertices stored in P until reaching i
- This allows tracing any shortest path after the algorithm terminates, in time proportional to the number of vertices on that path
The modification requires storing only the additional V×V array P alongside the distance array D.
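The bullet points above can be sketched as follows, with the arrays named D and P as in the notes; the toy graph in the usage is illustrative.

```python
# Floyd-Warshall extended with the intermediate-vertex array P described
# above: P[i][j] records the vertex k through which the i->j path improved.
INF = float("inf")

def floyd_warshall_with_paths(weights):
    n = len(weights)
    D = [row[:] for row in weights]
    P = [[None] * n for _ in range(n)]       # None: direct edge, no intermediate
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if D[i][k] + D[k][j] < D[i][j]:
                    D[i][j] = D[i][k] + D[k][j]
                    P[i][j] = k              # shortest i->j path goes via k
    return D, P

def extract_path(P, i, j):
    """Recover the vertex sequence of the shortest path from i to j."""
    k = P[i][j]
    if k is None:
        return [i, j]
    return extract_path(P, i, k)[:-1] + extract_path(P, k, j)
```

Storing the single intermediate vertex per pair (rather than whole paths) keeps the extra memory at O(V^2) while still allowing full path recovery.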
This document discusses sequence alignment and contains four sections:
1) Global alignment which finds the highest scoring alignment between entire sequences using dynamic programming.
2) Scoring matrices which generalize alignment scoring by assigning scores to individual character matches/mismatches based on biological evidence.
3) Local alignment which finds the best scoring alignment between substrings of sequences to identify conserved regions, as global alignment may miss these.
4) Ways to solve the local alignment problem efficiently in quadratic time instead of quartic time by computing alignments from each vertex in the grid.
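Section 3's local alignment can be sketched as the Smith-Waterman variant of the global recurrence: every cell is floored at zero so an alignment may start anywhere, and the best cell anywhere in the matrix gives the score. The toy scorer and function names are illustrative.

```python
# Local-alignment (Smith-Waterman) score sketch with a linear gap cost.
def smith_waterman_best(seq1, seq2, score, gap=-2):
    """Best local-alignment score between any substrings of seq1 and seq2."""
    n, m = len(seq1), len(seq2)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(0,                  # floor at 0: restart alignment here
                          F[i - 1][j - 1] + score(seq1[i - 1], seq2[j - 1]),
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
            best = max(best, F[i][j])
    return best
```

Filling one quadratic-size matrix like this is exactly the efficiency improvement section 4 refers to, versus naively aligning every pair of substrings.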
This document provides an overview of databases, definitions, scoring matrices, and pairwise sequence alignment. It discusses major bioinformatics databases like NCBI, ExPASy, and EBI. It also defines key terms like identity, homology, orthologous, and paralogous sequences. Additionally, it examines the theoretical and empirical bases for scoring matrices like PAM, BLOSUM, and transition/transversion matrices, and how they are used in sequence alignment.
The document presents a compartment-based model to describe the degradation kinetics of a peptide substrate reporter (VI-B) in five cell cultures. The model uses a system of differential equations to track the concentration of the reporter and its fragments over time. Parameters of the model are fitted to time-series data from the cultures. Results show variation in degradation rates between cultures and fragments, revealing targets of peptidases. The model has potential utility in peptide substrate reporter design.
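The kind of compartment model described can be sketched as a pair of coupled first-order equations integrated with Euler steps. The names, rate constants, and the two-compartment simplification here are illustrative, not the study's fitted parameters or solver.

```python
# Toy two-compartment degradation sketch: the intact reporter degrades into a
# fragment at rate k1, and the fragment is cleared at rate k2.
def simulate(k1, k2, r0=1.0, dt=0.01, steps=1000):
    """Integrate dr/dt = -k1*r and df/dt = k1*r - k2*f with Euler steps."""
    r, f = r0, 0.0                            # reporter and fragment levels
    series = [(r, f)]
    for _ in range(steps):
        dr = -k1 * r                          # reporter loss
        df = k1 * r - k2 * f                  # fragment gain minus clearance
        r, f = r + dr * dt, f + df * dt
        series.append((r, f))
    return series
```

Fitting the rate constants of such equations to measured time series per culture is what reveals culture-specific degradation rates and peptidase targets.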
1) This document introduces methods for detecting sequence similarity, which is a fundamental analysis in bioinformatics.
2) It describes how to search databases for similar sequences using BLAST or FASTA, and how to compare two sequences using dynamic programming algorithms like Needleman-Wunsch or Smith-Waterman.
3) Substitution matrices like BLOSUM62 are used to score alignments and measure sequence similarity based on amino acid properties.
A New Multi-Objective Mixed-Discrete Particle Swarm Optimization Algorithm (Weiyang Tong)
A new multi-objective optimization algorithm to handle problems that are highly constrained, highly nonlinear, and have mixed types of design variables.
A Simple Stochastic Gradient Variational Bayes for Latent Dirichlet AllocationTomonari Masada
This document proposes applying stochastic gradient variational Bayes (SGVB) to latent Dirichlet allocation (LDA) topic modeling to obtain an efficient posterior estimation. SGVB introduces randomness into variational inference for LDA by estimating expectations with Monte Carlo integration and using reparameterization to sample from approximate posterior distributions. Evaluation on several text corpora shows perplexities comparable to existing LDA inference methods, with the potential for faster parallelization using techniques like GPU processing. Future work will explore applying SGVB to other probabilistic document models like correlated topic models.
EM algorithm and its application in probabilistic latent semantic analysiszukun
The document discusses the EM algorithm and its application in Probabilistic Latent Semantic Analysis (pLSA). It begins by introducing the parameter estimation problem and comparing frequentist and Bayesian approaches. It then describes the EM algorithm, which iteratively computes lower bounds to the log-likelihood function. Finally, it applies the EM algorithm to pLSA by modeling documents and words as arising from a mixture of latent topics.
Latent semantic analysis (LSA) is a technique used in natural language processing to analyze relationships between documents and terms by producing concepts related to them. LSA assumes words with similar meanings will occur in similar texts, and uses a documents-terms matrix and singular value decomposition to discover hidden concepts and represent words and documents as vectors in a semantic vector space. Apache OpenNLP is a machine learning toolkit that can be used for various natural language processing tasks like part-of-speech tagging and parsing, and LSA can be seen as part of natural language processing.
Topic modeling is a technique for discovering hidden semantic patterns in large document collections. It represents documents as probability distributions over latent topics, where each topic is characterized by a distribution over words. Two common probabilistic topic models are latent Dirichlet allocation (LDA) and probabilistic latent semantic analysis (pLSA). LDA assumes each document exhibits multiple topics in different proportions, with topics modeled as distributions over words. Topic modeling provides dimensionality reduction and can be applied to problems like text classification, collaborative filtering, and computer vision tasks like image classification.
Latent semantic indexing (LSI) allows search engines to determine the topic of a page outside of directly matching search terms. LSI models the contexts in which words are used to find related pages, similar to how humans understand language through context. LSI was introduced to search engines by Applied Semantics and acquired by Google to power adsense through finding related ads. LSI gives search engines the ability to return more relevant results to users by understanding related terms, synonyms, singular/plural forms, and words with similar meanings or roots. Implementing LSI in a website involves developing thematically focused content using related keywords and synonyms throughout to better match user intent.
The document provides an introduction to Probabilistic Latent Semantic Analysis (PLSA). It discusses how PLSA improves on previous Latent Semantic Analysis methods by incorporating a probabilistic framework. PLSA models documents as mixtures of topics and allows words to have multiple meanings. The parameters of the PLSA model, including the topic distributions and word-topic distributions, are estimated using an expectation-maximization algorithm to find the parameters that best explain the observed word-document co-occurrence data.
1. The document discusses various bioinformatics concepts and tools including sequence alignment, BLAST, substitution matrices, and open reading frames. Sequence alignment involves comparing sequences to find similar regions and can be local or global. BLAST is a tool used to find similar sequences in a database by searching for exact and similar matches. Substitution matrices like BLOSUM and PAM assign scores to amino acid substitutions observed in protein evolution. Open reading frames refer to the three possible frames for translating a nucleic acid sequence into a protein.
2016.09.28
TOPIC REVIEW
• Exam
• PS2 Sequence Alignment
• Command Line Blast
• PS1 Molecular Biology
• Personal Microbiome Project
CURRENTLY
LET’S NEGOTIATE
• Problem sets (4) - 10%
• Microbiome project - 20%
• Exam (1) - 20%
• Research project - 45%
• Participation - 5%
OR
• Problem sets (4) - 10%
• Microbiome project - 20%
• Exam 1 - 15%
• Exam 2 - 15%
• Research project - 35%
• Participation - 5%
PS2 SEQUENCE ALIGNMENT
PS2 SEQUENCE ALIGNMENT
RefSeqs, protein (experimentally supported)
On chromosome 17
Reverse strand
PRCD Progressive rod-cone degeneration
PS2: GLOBAL ALIGNMENT
BLOSUM62
• substitutions less penalized and are
preferred to gaps. There is also a
decrease in the level of identity.
BLOSUM80
• Substitutions more penalized and
gaps are favored.
PAM60
• Substitutions more penalized and gaps
are favored.
PAM250
• substitutions less penalized and are
preferred to gaps. There is also a
decrease in the level of identity.
PS2: LOCAL ALIGNMENT
SEQ1 A L S C V W M I P
SEQ2 A I S C M I P T
9 residues
8 residues
Create Matrix: length of seq1 + 1
x
length of seq2 + 1
Matrix 10 x 9
A L S C V W M I P
0 -2 -4 -6 -8 -10 -12 -14 -16 -18
-2
-4
-6
-8
-10
-12
-14
-16
A
I
S
C
M
I
P
T
Exercise: fill the scores of the alignment matrix
using the BLOSUM62 substitution matrix.
Gap opening penalty: -5
Gap extension penalty: -1
S V E T D
T
S
I
N
Q
E
T
Ala A 4
Arg R -1 5
Asn N -2 0 6
Asp D -2 -2 1 6
Cys C 0 -3 -3 -3 9
Gln Q -1 1 0 0 -3 5
Glu E -1 0 0 2 -4 2 5
Gly G 0 -2 0 -1 -3 -2 -2 6
His H -2 0 1 -1 -3 0 0 -2 8
Ile I -1 -3 -3 -3 -1 -3 -3 -4 -3 4
Leu L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4
Lys K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5
Met M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5
Phe F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6
Pro P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7
Ser S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4
Thr T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5
Trp W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11
Tyr Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7
Val V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4
A
la
A
rg
A
sn
A
sp
C
y
s
G
ln
G
lu
G
ly
H
is
Il
e
L
e
u
L
y
s
M
e
t
P
h
e
P
ro
S
e
r
T
h
r
T
rp
T
y
r
V
a
l
A R N D C Q E G H I L K M F P S T W Y V
Dynamical programming - global alignment
83
BLOSUM62
GAP COST: -2
At each cell, 3 scores are calculated:
• match score = diagonal cell score +
score from the substitution matrix.
• Vertical gap score = upper neighbor
+ gap cost
• Horizontal gap score = left neighbor
+ gap cost
• The highest score is retained and
the arrow is labelled
A L S C V W M I P
0 -2 -4 -6 -8 -10 -12 -14 -16 -18
-2
-4
-6
-8
-10
-12
-14
-16
A
I
S
C
M
I
P
T
Exercise: fill the scores of the alignment matrix
using the BLOSUM62 substitution matrix.
Gap opening penalty: -5
Gap extension penalty: -1
S V E T D
T
S
I
N
Q
E
T
A ...
7. Vector Space Model
• Salton's Vector Space Model
  – Represent each document by a high-dimensional vector in the space of words
[Figure: documents mapped to vectors; photo of Gerald Salton]
7/50
9. Term-Document Matrix
• Term-document matrix A is an m×n matrix, where m is the number of terms and n
  is the number of documents:

          d1   d2  ...  dn
    t1  [ a11  a12 ... a1n ]
    t2  [ a21  a22 ... a2n ]
    ..  [ ...           ...]
    tm  [ am1  am2 ... amn ]

  Row i corresponds to term ti; column j corresponds to document dj.
9/50
10. Term Weighting by TFIDF
• The term frequency (tf) in the given document d gives a measure of the
  importance of the term ti within the particular document:

    tf(ti, d) = ni / Σk nk

  with ni being the number of occurrences of the considered term, and the
  denominator being the number of occurrences of all terms in d.
• The inverse document frequency (idf) is obtained by dividing the number of
  all documents by the number of documents containing the term ti:

    idf(ti) = log( |D| / |{d : ti ∈ d}| )

  |D| : total number of documents in the corpus
  |{d : ti ∈ d}| : number of documents where the term ti appears
• tfidf(ti, d) = tf(ti, d) × idf(ti)
10/50
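The tf, idf, and tfidf definitions above can be sketched directly. A minimal illustration on a toy corpus of word lists; the corpus contents and term names are invented for the example:

```python
import math

def tf(term, doc):
    # term frequency: occurrences of `term` over the total term count in doc
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # inverse document frequency: log of |D| over the number of
    # documents containing the term
    n_docs_with_term = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_docs_with_term)

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

corpus = [["protein", "membrane", "membrane"],
          ["protein", "cytoplasm"],
          ["nucleus", "cytoplasm"]]
print(round(tfidf("membrane", corpus[0], corpus), 3))  # 0.732: frequent here, rare elsewhere
print(round(tfidf("protein", corpus[0], corpus), 3))   # 0.135: appears in most documents
```

The weighting rewards terms that are frequent in one document but rare across the corpus, which is exactly why it discriminates between documents.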
11. Prediction by 1-Nearest-Neighbor
based on Cosine Similarity
• The query is assigned the label of the most similar training document,
  measured by the cosine similarity between the document and query vectors
11/50
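The 1-nearest-neighbor rule above can be sketched as follows; the vectors and labels are toy values invented for the illustration:

```python
import math

def cosine(u, v):
    # cosine similarity: dot product over the product of vector norms
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def predict_1nn(query, docs, labels):
    # assign the label of the training vector most similar to the query
    best = max(range(len(docs)), key=lambda i: cosine(query, docs[i]))
    return labels[best]

docs = [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0]]
labels = ["CP", "IM"]
print(predict_1nn([2.0, 0.1, 3.9], docs, labels))  # "CP"
```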
12. Feature Reduction
• There exists a best choice of axes – the one that shows the most variation in
  the data. It is found by linear algebra: Singular Value Decomposition (SVD)
[Figure: true plot in k dimensions vs. reduced-dimensionality plot]
12/50
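A minimal sketch of SVD-based feature reduction, assuming a small toy term-document matrix (the values are invented):

```python
import numpy as np

# toy term-document matrix A (m terms × n documents); values assumed
A = np.array([[3.0, 1.0, 0.0, 0.0],
              [2.0, 0.0, 0.0, 1.0],
              [0.0, 0.0, 4.0, 3.0],
              [0.0, 1.0, 3.0, 2.0]])

# decompose A = U S V^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # keep only the k axes showing the most variation
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# A_k is the best rank-k approximation of A in the least-squares sense
print(np.linalg.norm(A - A_k))
```

Documents can then be represented by their k-dimensional coordinates (the columns of `np.diag(s[:k]) @ Vt[:k, :]`) instead of the full m-dimensional term vectors.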
14. Outline
• Introduction
– Protein Subcellular Localization
– Document Classification
• PSLDoc
– Term and its weighting scheme
– Feature Reduction
– SVM learning
• Evaluation and Results
• Discussion
14/50
15. The Terms of Proteins – Gapped-dipeptides*
• Let XdZ denote the amino acid coupling pattern of amino acid types X and Z
  that are separated by d amino acids
• If d ranges from 0 to 20, there are 8400 (= 20×20×21) features for a vector

  Sequence:  M P L D L N T L T
  Terms:     M0P, M1L, M2D, ...

*Liang HK, Huang CM, Ko MT, Hwang JK. The amino acid-coupling patterns in
thermophilic proteins. Proteins: Structure, Function and Bioinformatics
2005;59:58-63.
15/50
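Enumerating the gapped-dipeptide terms of a sequence can be sketched as below; the function name and the `max_gap` parameter are illustrative (the slide uses d up to 20):

```python
def gapped_dipeptides(seq, max_gap=20):
    # enumerate terms XdZ: residues X and Z separated by d intermediate positions
    terms = []
    for d in range(max_gap + 1):
        for i in range(len(seq) - d - 1):
            terms.append(f"{seq[i]}{d}{seq[i + d + 1]}")
    return terms

seq = "MPLDLNTLT"  # the example sequence from the slide
terms = gapped_dipeptides(seq, max_gap=2)
print(terms[:3])  # ['M0P', 'P0L', 'L0D']
```

For the slide's example, the terms anchored at the first residue are M0P (adjacent), M1L (one position apart), and M2D (two positions apart), matching the sequence M P L D ...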
16. Term Weighting Scheme – TF
Position Specific Score Matrix (1/2)
• Position Specific Score Matrix (PSSM): a PSSM is constructed from a multiple
  alignment of the highest scoring hits in the BLAST search

       A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
 1 M  -3 -3 -4 -5 -3 -3 -4 -5 -4  0  1 -3 10 -2 -5 -4 -3 -4 -3 -1
 2 P   2 -3 -3 -1 -3 -1 -1 -1 -4 -2 -4 -2 -2 -5  4  2  4 -5 -4 -3
 3 L  -4 -5 -6 -6 -4 -3 -5 -6 -5  3  5 -5  4  0 -5 -5 -3 -4 -3  2
 4 D  -2  5 -1 -3 -4  2 -1 -4  2 -5 -3  5 -2 -2 -4 -2  0 -1  0 -3
 5 L  -4 -5 -6 -6 -4 -5 -6 -6 -4  4  4 -5  0  1 -5 -5 -3 -4 -3  3
 ...
78 N  -4 -3  8  4 -6 -3 -2 -3 -2 -6 -6 -3 -5 -6 -4 -1 -3 -7 -5 -6
79 T  -2 -3 -1 -3 -1 -3 -3 -4 -3 -4 -4 -1 -4 -4 -4  4  6 -5 -4 -2
80 L   0 -1 -5 -5 -4 -3 -4 -4 -3 -1  5 -3  3  0 -4 -3 -3 -3 -2 -1
81 T  -1 -3 -1 -1 -4 -2 -3 -2 -1 -4 -3 -1 -3 -4 -4  3  6 -5 -4 -3
16/50
17. Term Weighting Scheme – TF
Position Specific Score Matrix (2/2)
• The weight of XdZ in protein P:

    W(XdZ, P) = Σ_{1 ≤ i ≤ n−(d+1)} f(i, X) × f(i+d+1, Z)

  where f(i, Y) denotes the normalized value of the PSSM entry at the ith row
  and the column corresponding to amino acid type Y
• An example:
  W(M2D, P)
  = f(1,M) × f(4,D) + f(2,M) × f(5,D) + … + f(78,M) × f(81,D)
  = 0.99995×0.04743 + 0.11920×0.00247 + … + 0.00669×0.26894
17/50
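A sketch of the weighting scheme. It assumes the PSSM entries are normalized with a logistic function, f = 1/(1+e^(-s)); this is an assumption, but it is consistent with the worked example (f(1, M) = 1/(1+e^(-10)) = 0.99995 for the PSSM entry 10):

```python
import math

def sigmoid(x):
    # assumed normalization of raw PSSM scores into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def gapped_dipeptide_weight(pssm, aa_index, X, d, Z):
    # W(XdZ, P) = sum over i of f(i, X) * f(i+d+1, Z)
    # pssm: list of rows (one per sequence position), 20 scores per row
    # aa_index: maps an amino acid letter to its column in the PSSM
    n = len(pssm)
    total = 0.0
    for i in range(n - (d + 1)):  # 0-based equivalent of i = 1 .. n-(d+1)
        total += sigmoid(pssm[i][aa_index[X]]) * sigmoid(pssm[i + d + 1][aa_index[Z]])
    return total

aa = "ARNDCQEGHILKMFPSTWYV"
aa_index = {a: i for i, a in enumerate(aa)}
# toy 5-row PSSM: score 1 on the residue's own column, -2 elsewhere (assumed)
pssm = [[1 if a == r else -2 for a in aa] for r in "MPLDL"]
w = gapped_dipeptide_weight(pssm, aa_index, "M", 2, "D")
print(round(w, 4))
```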
20. Feature Reduction – Probabilistic
Latent Semantic Analysis (2/3)
• The joint probability between a term w and a document d is modeled through a
  latent variable z (with a "small" number of states, representing concepts):

    P(w, d) = P(d) Σ_{z∈Z} P(w|z) P(z|d)

  P(w|z): concept expression probabilities
  P(z|d): document-specific mixing proportions
• The parameters are estimated by maximizing the likelihood with the EM
  algorithm.
20/50
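The EM fit for the model above can be sketched compactly. This is a from-scratch illustration, not the implementation used in PSLDoc, and the toy counts are invented:

```python
import numpy as np

def plsa(counts, n_topics, n_iter=100, seed=0):
    """Fit P(w|z) and P(z|d) to a term-document count matrix by EM."""
    rng = np.random.default_rng(seed)
    n_terms, n_docs = counts.shape
    p_w_z = rng.random((n_terms, n_topics))
    p_w_z /= p_w_z.sum(axis=0)              # columns are P(w|z)
    p_z_d = rng.random((n_topics, n_docs))
    p_z_d /= p_z_d.sum(axis=0)              # columns are P(z|d)
    for _ in range(n_iter):
        # E-step: posterior over topics, P(z|w,d) ∝ P(w|z) P(z|d)
        joint = p_w_z[:, :, None] * p_z_d[None, :, :]      # shape (w, z, d)
        post = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate parameters from expected counts
        expected = counts[:, None, :] * post               # shape (w, z, d)
        p_w_z = expected.sum(axis=2)
        p_w_z /= p_w_z.sum(axis=0)
        p_z_d = expected.sum(axis=0)
        p_z_d /= p_z_d.sum(axis=0)
    return p_w_z, p_z_d

# toy term-document counts (values assumed for illustration)
counts = np.array([[4., 0., 1.],
                   [3., 1., 0.],
                   [0., 5., 4.],
                   [1., 4., 3.]])
p_w_z, p_z_d = plsa(counts, n_topics=2)
print(p_z_d.round(2))  # document-specific mixing proportions P(z|d)
```

The columns of `p_z_d` are the reduced topic-space representations of the documents, which is what PSLDoc feeds to the SVM classifiers.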
21. Feature Reduction – Probabilistic
Latent Semantic Analysis (3/3)
[Figure: PLSA feature reduction maps each vector from the term space
(Term 1 – Term 5) to the lower-dimensional topic space (Topic 1 – Topic 3)]
21/50
23. Classifier – Support Vector Machines
• Support Vector Machines (SVM)
  – LIBSVM software*
  – Five 1-v-rest SVM classifiers corresponding to the five localization sites:
      SVM_CP: CP vs. non-CP
      SVM_IM: IM vs. non-IM
      SVM_PP: PP vs. non-PP
      SVM_OM: OM vs. non-OM
      SVM_EC: EC vs. non-EC
  – Kernel: Radial Basis Function (RBF)
  – Parameter selection
    • c (cost) and γ (gamma) are optimized by
    • five-fold cross-validation
*Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001.
Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
23/50
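The 1-v-rest arrangement and the max-score decision can be sketched generically. The centroid scorer below is a toy stand-in for the RBF-kernel LIBSVM classifiers, and all names and data are illustrative:

```python
import math

class OneVsRest:
    """One binary scorer per localization site; the site whose scorer
    gives the highest score is the final prediction."""
    def __init__(self, make_scorer):
        self.make_scorer = make_scorer
        self.scorers = {}

    def fit(self, X, y):
        for site in set(y):
            pos = [x for x, lab in zip(X, y) if lab == site]   # site
            neg = [x for x, lab in zip(X, y) if lab != site]   # non-site
            self.scorers[site] = self.make_scorer(pos, neg)
        return self

    def predict(self, x):
        return max(self.scorers, key=lambda site: self.scorers[site](x))

def centroid_scorer(pos, neg):
    # toy binary scorer: distance to the negative centroid minus
    # distance to the positive centroid (larger = more site-like)
    def centroid(pts):
        return [sum(c) / len(pts) for c in zip(*pts)]
    cp, cn = centroid(pos), centroid(neg)
    return lambda x: math.dist(x, cn) - math.dist(x, cp)

X = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.8], [0.0, 5.0]]
y = ["CP", "CP", "IM", "IM", "PP"]
clf = OneVsRest(centroid_scorer).fit(X, y)
print(clf.predict([5.1, 5.1]))  # "IM"
```

With real SVMs, the per-site score would be the classifier's probability estimate, and the site with the highest probability is assigned, as the abstract describes.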
27. Data set (2/3)
• Eukaryotic proteins, 7579 proteins, 12 localization sites:
  Chloroplast 9%, Cytoplasmic 16%, Cytoskeleton 1%, ER 1%, Extracellular 11%,
  Golgi 1%, Lysosomal 1%, Mitochondrial 10%, Nuclear 25%, Peroxisomal 2%,
  Plasma membrane 22%, Vacuole 1%
Park KJ, Kanehisa M. Prediction of protein subcellular locations by support
vector machines using compositions of amino acids and amino acid pairs.
Bioinformatics 2003;19(13):1656-1663.
27/50
28. Data set (3/3)
• Human data set, 2197 proteins, 9 localization sites:
  ER 15%, Golgi 4%, Cytosol 16%, Nucleus 27%, Peroxisome 2%,
  Plasma membrane 9%, Lysosome 4%, Mitochondria 10%, Extracellular 13%
Scott MS, Thomas DY, Hallett MT. Predicting subcellular localization via
protein motif co-occurrence. Genome Res 2004;14(10A):1957-1966.
28/50
29. Evaluation
• Accuracy (Acc)

    Acc = Σ_{i=1}^{l} TP_i / Σ_{i=1}^{l} N_i

  – l = 5 is the number of total localization sites
  – N_i is the number of proteins in localization site i
• Matthews correlation coefficient (MCC)

    MCC_i = (TP_i × TN_i − FP_i × FN_i) /
            √((TP_i + FN_i)(TP_i + FP_i)(TN_i + FP_i)(TN_i + FN_i))

29/50
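The two measures can be sketched directly from the formulas above; the counts are toy values assumed for the example:

```python
import math

def accuracy(tp, n):
    # overall accuracy: correctly predicted proteins over all proteins,
    # summed across localization sites
    return sum(tp) / sum(n)

def mcc(tp, tn, fp, fn):
    # Matthews correlation coefficient for a single localization site
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

print(accuracy([90, 80], [100, 100]))      # 0.85
print(round(mcc(90, 85, 15, 10), 3))       # 0.751
```

MCC is preferred over per-site accuracy alone because it stays informative when the site sizes are imbalanced, as they are in these data sets.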
31. Simple Prediction Methods (2/2)
• 1NN_PSI-BLASTps, 1NN_PSI-BLASTnr
• 1NN_ClustalW
[Figure: the query protein is matched against the training database to find the
most similar protein, either via PSI-BLAST (with PSSMs built from the training
database or from the NCBI nr database) or via ClustalW]
31/50
32. The comparison of 1NN_TFPSSM and 1NN_TFIDF on the PSHigh783 and PSLow661
data sets

            PSHigh783                        PSLow661
            1NN_TFPSSM     1NN_TFIDF        1NN_TFPSSM     1NN_TFIDF
Loc. Sites  Acc.(%)  MCC   Acc.(%)  MCC     Acc.(%)  MCC   Acc.(%)  MCC
CP          94.20    0.96  71.01    0.74    83.25    0.77  41.15    0.36
IM          99.31    0.99  98.62    0.89    82.93    0.82  84.15    0.48
PP          95.86    0.94  86.21    0.89    74.05    0.63  38.17    0.46
OM          99.66    0.99  95.88    0.95    85.00    0.82  66.00    0.48
EC          96.99    0.96  92.48    0.91    57.89    0.51  28.07    0.26
Overall     97.96    -     91.83    -       79.43    -     53.86    -
32/50
42. Gapped-peptide signature
• The site-topic preference of topic z for a localization site l
  = average { P(z|d) : d (a protein) belongs to class l }
[Plots with Acc. = 89 and Acc. = 90]
42/50
44. Gapped-peptide signature
• For each localization site, the ten preferred topics are selected according
  to site-preference confidence (= the largest site-topic preference minus the
  second-largest site-topic preference).
• For each topic, the five most frequent gapped-dipeptides are selected.
44/50
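Computing site-topic preferences and the site-preference confidence defined on these slides can be sketched as follows; the P(z|d) values, protein names, and site labels are toy assumptions:

```python
def site_topic_preference(p_z_d, site_of):
    """Average P(z|d) over the proteins d belonging to each localization site.
    p_z_d: dict protein -> list of topic probabilities
    site_of: dict protein -> localization site"""
    sites = {}
    for d, probs in p_z_d.items():
        sites.setdefault(site_of[d], []).append(probs)
    return {site: [sum(col) / len(rows) for col in zip(*rows)]
            for site, rows in sites.items()}

def preference_confidence(pref_by_site, z):
    # largest site-topic preference minus the second largest, for topic z
    vals = sorted((pref[z] for pref in pref_by_site.values()), reverse=True)
    return vals[0] - vals[1]

p_z_d = {"p1": [0.8, 0.2], "p2": [0.6, 0.4], "p3": [0.1, 0.9]}
site_of = {"p1": "IM", "p2": "IM", "p3": "EC"}
pref = site_topic_preference(p_z_d, site_of)
print([round(v, 3) for v in pref["IM"]])         # [0.7, 0.3]
print(round(preference_confidence(pref, 0), 3))  # 0.6
```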
46. Gapped-dipeptide signatures reflecting
motifs relevant to protein localization sites
• In integral membrane proteins, helix-helix interactions are stabilized by
  aromatic residues. Specifically, the aromatic motif (WXXW, i.e. W2W in
  gapped-dipeptide notation) is involved in the dimerization of transmembrane
  domains through π-π interactions.
• In the outer membrane class, the C-terminal signature sequence is recognized
  by the assembly factor OMP85, which regulates the insertion and integration
  of OM proteins into the outer membrane of Gram-negative bacteria. The
  C-terminal signature sequence contains a Phe (F) at the C-terminal position,
  preceded by a strong preference for a basic amino acid (K, R) => R0F.
46/50
47. The amino acid compositions of single
residues and gapped-dipeptide signatures
for each localization site
[Figure: composition (%) per amino acid (A–W) for the five localization sites
(CP, IM, PP, OM, EC); panel (A) single residues, panel (B) gapped-dipeptide
signatures]
47/50
48. The grouped amino acid compositions of single
residues and gapped-dipeptide signatures
[Figure: proportion (%) of each amino acid group per localization site (CP, IM,
PP, OM, EC); panel (A) single residues, panel (B) gapped-dipeptide signatures]
Amino acid groups: N (non-polar: AIGLMV), P (polar: CNPQST), C (charged: DEHKR),
and A (aromatic: FYW)
48/50
49. Gapped-dipeptide signatures and their amino acid
compositions for each localization site
[Figure: proportion (%) of each amino acid group per localization site (CP, IM,
PP, OM, EC)]
Amino acid groups: N (non-polar: AIGLMV), P (polar: CNPQST), C (charged: DEHKR),
and A (aromatic: FYW)
49/50
50. Gapped-dipeptide signatures and their
amino acid compositions for each
localization site
• IM has a high percentage of non-polar amino acids (60%) and no charged (0%)
  amino acids.
  – This reflects the physico-chemical properties of the lipid bilayer:
    non-polar amino acids are favored in the transmembrane domains of IM
    proteins.
  – Charged amino acids are disfavored because of the energetic penalty they
    incur in the assembly of IM proteins.
• The CP and EC classes have a high percentage of charged and polar amino
  acids, respectively.
  – The role of charged amino acids in the cytoplasm is probably related to pH
    homeostasis, in which they act as buffers, whereas secreted proteins in
    the EC class may require more polar amino acids to promote interactions
    with the solvent environment.
50/50