PSLDoc: Protein subcellular localization prediction based on gapped-dipeptides and probabilistic latent semantic analysis
Outline
•  Introduction
– Protein Subcellular Localization
– Document Classification
•  PSLDoc
– Term and its weighting scheme
– Feature Reduction
– SVM learning
•  Evaluation and Results
•  Discussion
Protein Subcellular Localization
Document Classification
Vector Space Model
•  Salton’s Vector Space Model
– Represent each document by a high-dimensional
vector in the space of words
[Figure: documents mapped to vectors in term space; photo of Gerard Salton.]
Vectors in Term Space
Term-Document Matrix
•  A term-document matrix is an m×n matrix, where m is the number of terms and n is the number of documents:
$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}$$

Each row corresponds to a term t_1, …, t_m and each column to a document d_1, …, d_n.
Term Weighting by TF-IDF
•  The term frequency (tf) of a term t_i in a given document d measures the importance of the term within that particular document:

$$tf(t_i, d) = \frac{n_i}{\sum_k n_k}$$

with n_i being the number of occurrences of the considered term, and the denominator being the total number of term occurrences in d.
•  The inverse document frequency (idf) is obtained by dividing the number of all documents by the number of documents containing the term t_i and taking the logarithm:

$$idf(t_i) = \log \frac{|D|}{|\{d : t_i \in d\}|}$$

where |D| is the total number of documents in the corpus and the denominator is the number of documents in which the term t_i appears.
•  The TF-IDF weight is the product: tfidf(t_i, d) = tf(t_i, d) × idf(t_i).
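To make the weighting concrete, here is a minimal Python sketch (not the paper's code; the toy documents are invented for illustration):

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocab, rows) where rows[i][j] is the tfidf of term j in doc i."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = {t: sum(1 for doc in docs if t in doc) for t in vocab}
    rows = []
    for doc in docs:
        counts, total = Counter(doc), len(doc)
        # tf(t, d) = n_t / sum_k n_k ;  idf(t) = log(|D| / df(t))
        rows.append([(counts[t] / total) * math.log(n_docs / df[t])
                     for t in vocab])
    # Transposing rows gives the m x n term-document matrix of the slides.
    return vocab, rows

vocab, M = tfidf_matrix([["trade", "imports", "trade"],
                         ["economic", "trade"],
                         ["protein", "membrane"]])
```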
Prediction by 1-Nearest-Neighbor Based on Cosine Similarity
•  The similarity between a document vector d and a query vector q is measured by the cosine of the angle between them: cos(d, q) = d · q / (|d||q|).
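The 1-NN rule over such vectors is then a few lines, as in this sketch (train_vectors and train_labels are placeholder names):

```python
import numpy as np

def cosine(a, b):
    # cos(a, b) = a . b / (|a| |b|); returns 0 for a zero vector.
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom else 0.0

def predict_1nn(query, train_vectors, train_labels):
    """Label of the training vector with the highest cosine similarity."""
    sims = [cosine(query, v) for v in train_vectors]
    return train_labels[int(np.argmax(sims))]
```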
Feature Reduction
•  There exists a best choice of axes, namely the one that shows the most variation in the data. It can be found by linear algebra: Singular Value Decomposition (SVD).
[Figure: true plot in k dimensions vs. reduced-dimensionality plot.]
Singular Value Decomposition
[Figure: SVD factorization of the term-document matrix; the reduced feature size is 40.]
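A sketch of the reduction with numpy's SVD; k = 40 follows the slide (per the editor's notes, this takes the feature count from 18,081 down to 40):

```python
import numpy as np

def reduce_by_svd(A, k=40):
    """Rank-k reduction of an m x n term-document matrix A.

    Returns a k x n matrix: each column is a document expressed in the
    k strongest singular directions.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return np.diag(s[:k]) @ Vt[:k, :]
```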
The Terms of Proteins: Gapped-Dipeptides*
•  Let XdZ denote the amino acid coupling pattern of amino acid types X and Z that are separated by d amino acids.
If d ranges from 0 to 20, there are 8,400 (= 20 × 20 × 21) features per vector.
Sequence: M P L D L N T L T → M0P, M1L, M2D, ...
*Liang HK, Huang CM, Ko MT, Hwang JK. The amino acid-coupling patterns in thermophilic proteins. Proteins: Structure, Function and Bioinformatics (2005), 59, 58-63.
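A sketch of term extraction under this definition, using raw counts only (PSLDoc itself weights the terms by the PSSM, as the next slides describe):

```python
from collections import Counter

def gapped_dipeptides(seq, max_gap=20):
    """Count terms XdZ: residues X and Z separated by d positions."""
    counts = Counter()
    for i, x in enumerate(seq):
        for d in range(max_gap + 1):
            j = i + d + 1
            if j >= len(seq):
                break
            counts[f"{x}{d}{seq[j]}"] += 1
    return counts

# Matches the slide's example: MPLDLNTLT yields M0P, M1L, M2D, ...
print(sorted(gapped_dipeptides("MPLDLNTLT", max_gap=2))[:6])
```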
Term Weighting Scheme – TF
Position Specific Score Matrix (1/2)
•  Position Specific Score Matrix (PSSM): a PSSM is constructed from a multiple alignment of the highest-scoring hits in the BLAST search.

          A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
  1 M    -3 -3 -4 -5 -3 -3 -4 -5 -4  0  1 -3 10 -2 -5 -4 -3 -4 -3 -1
  2 P     2 -3 -3 -1 -3 -1 -1 -1 -4 -2 -4 -2 -2 -5  4  2  4 -5 -4 -3
  3 L    -4 -5 -6 -6 -4 -3 -5 -6 -5  3  5 -5  4  0 -5 -5 -3 -4 -3  2
  4 D    -2  5 -1 -3 -4  2 -1 -4  2 -5 -3  5 -2 -2 -4 -2  0 -1  0 -3
  5 L    -4 -5 -6 -6 -4 -5 -6 -6 -4  4  4 -5  0  1 -5 -5 -3 -4 -3  3
  ...
 78 N    -4 -3  8  4 -6 -3 -2 -3 -2 -6 -6 -3 -5 -6 -4 -1 -3 -7 -5 -6
 79 T    -2 -3 -1 -3 -1 -3 -3 -4 -3 -4 -4 -1 -4 -4 -4  4  6 -5 -4 -2
 80 L     0 -1 -5 -5 -4 -3 -4 -4 -3 -1  5 -3  3  0 -4 -3 -3 -3 -2 -1
 81 T    -1 -3 -1 -1 -4 -2 -3 -2 -1 -4 -3 -1 -3 -4 -4  3  6 -5 -4 -3
Term Weighting Scheme – TF
Position Specific Score Matrix (2/2)
•  The weight of XdZ in protein P:

$$W(XdZ, P) = \sum_{1 \le i \le n-(d+1)} f(i, X) \times f(i+d+1, Z)$$

where f(i, Y) denotes the normalized value of the PSSM entry at the i-th row and the column corresponding to amino acid type Y.
•  An example:
W(M2D, P) = f(1,M) × f(4,D) + f(2,M) × f(5,D) + … + f(78,M) × f(81,D)
          = 0.99995×0.04743 + 0.11920×0.00247 + … + 0.00669×0.26894
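This weight can be computed directly, as in the sketch below. The logistic normalization is inferred from the slide's example numbers (a raw PSSM entry of 10 maps to 0.99995 and -3 to 0.04743, which matches 1/(1+e^-x)):

```python
import numpy as np

AA = "ARNDCQEGHILKMFPSTWYV"
COL = {a: i for i, a in enumerate(AA)}

def normalize(raw_pssm):
    # f(i, Y) = 1 / (1 + e^-score): 10 -> 0.99995, -3 -> 0.04743.
    return 1.0 / (1.0 + np.exp(-np.asarray(raw_pssm, dtype=float)))

def tfpssm_weight(f, x, d, z):
    """W(XdZ, P) = sum_{1 <= i <= n-(d+1)} f(i, X) * f(i+d+1, Z).

    f: (n, 20) array of normalized PSSM values, rows = sequence positions.
    """
    n = f.shape[0]
    return float(f[: n - (d + 1), COL[x]] @ f[d + 1 :, COL[z]])
```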
Feature Reduction – Probabilistic Latent Semantic Analysis (1/3)
[Figure: documents and terms connected through latent concepts; e.g., the latent concept TRADE links the terms economic, imports, and trade.]
Feature Reduction – Probabilistic Latent Semantic Analysis (2/3)
•  The joint probability of a term w and a document d can be modeled as:

$$P(w, d) = P(d) \sum_{z \in Z} P(w|z) \, P(z|d)$$

where z is a latent variable with a "small" number of states, P(w|z) are the concept expression probabilities, and P(z|d) are the document-specific mixing proportions.
•  The parameters can be estimated by maximizing the likelihood function with the EM algorithm.
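A compact EM sketch of this model (dense numpy arrays, random initialization, fixed iteration count; a simplification of the actual fitting, with memory growing as terms × documents × topics):

```python
import numpy as np

def plsa(counts, n_topics=80, n_iter=50, seed=0):
    """Fit P(w|z) and P(z|d) to a (n_terms, n_docs) weight matrix by EM."""
    rng = np.random.default_rng(seed)
    n_terms, n_docs = counts.shape
    p_w_z = rng.random((n_terms, n_topics))
    p_w_z /= p_w_z.sum(axis=0, keepdims=True)           # P(w|z)
    p_z_d = rng.random((n_topics, n_docs))
    p_z_d /= p_z_d.sum(axis=0, keepdims=True)           # P(z|d)
    for _ in range(n_iter):
        # E-step: P(z|d,w) is proportional to P(w|z) * P(z|d).
        post = p_w_z[:, None, :] * p_z_d.T[None, :, :]  # shape (w, d, z)
        post /= post.sum(axis=2, keepdims=True) + 1e-12
        # M-step: reweight the posterior by the observed counts n(w, d).
        weighted = counts[:, :, None] * post
        p_w_z = weighted.sum(axis=1)
        p_w_z /= p_w_z.sum(axis=0, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=0).T
        p_z_d /= p_z_d.sum(axis=0, keepdims=True) + 1e-12
    # Columns of p_z_d, i.e., P(z|d), are the reduced document vectors.
    return p_w_z, p_z_d
```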
Feature Reduction – Probabilistic Latent Semantic Analysis (3/3)
[Figure: PLSA feature reduction maps a vector in the term space (Term 1 … Term 5) to a vector in the topic space (Topic 1 … Topic 3).]
Classifier – Support Vector Machines
•  Support Vector Machines (SVM)
–  LIBSVM software*
–  Five 1-vs-rest SVM classifiers corresponding to the five localization sites: SVM_CP (CP vs. non-CP), SVM_IM (IM vs. non-IM), SVM_PP (PP vs. non-PP), SVM_OM (OM vs. non-OM), and SVM_EC (EC vs. non-EC)
–  Kernel: Radial Basis Function (RBF)
–  Parameter selection (see the sketch below)
•  c (cost) and γ (gamma) are optimized
•  five-fold cross-validation
*Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
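The cited implementation is LIBSVM; as an equivalent sketch, scikit-learn's SVC (which wraps LIBSVM) reproduces the same setup, with the usual exponential grids for c and gamma (the exact grids used in PSLDoc are not stated on the slide, and X, y are placeholders):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

def train_psldoc_svms(X, y):
    """Five one-vs-rest RBF SVMs (one per localization site in y),
    with cost and gamma chosen by five-fold cross-validation."""
    grid = {"estimator__C": [2.0 ** k for k in range(-5, 16, 2)],
            "estimator__gamma": [2.0 ** k for k in range(-15, 4, 2)]}
    ovr = OneVsRestClassifier(SVC(kernel="rbf", probability=True))
    return GridSearchCV(ovr, grid, cv=5).fit(X, y)
```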
System Architecture
[Figure: the PSLDoc pipeline. A protein sequence (MPLDLYNTLT…) is run through PSI-BLAST to produce a PSSM; the PSSM is converted to the gapped-dipeptide representation (A0A, A1A, A2A, …, Y5Y) with TF-PSSM weights; PLSA reduces this vector to the topic space; five one-vs-rest classifiers (SVM_CP, SVM_IM, SVM_PP, SVM_OM, SVM_EC) score it, and the site with the highest probability is the predicted localization. PSLDoc = Protein Subcellular Localization prediction by Document classification.]
Data set (1/3)
•  Gram-negative bacteria: PS1444
–  ePSORTdb version 2.0, Gram-negative
–  1,444 proteins: Cytoplasmic 19%, Inner membrane 22%, Periplasmic 19%, Outer membrane 27%, Extracellular 13%
–  Split by pairwise sequence identity (> 30%) into PSHigh783 and PSLow661
Data set (2/3)
•  Eukaryotic proteins: 7,579 proteins, 12 localization sites
–  Chloroplast 9%, Cytoplasmic 16%, Cytoskeleton 1%, ER 1%, Extracellular 11%, Golgi 1%, Lysosomal 1%, Mitochondrial 10%, Nuclear 25%, Peroxisomal 2%, Plasma membrane 22%, Vacuole 1%
Park KJ, Kanehisa M. Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics 2003;19(13):1656-1663.
Data set (3/3)
•  Human data set: 2,197 proteins, 9 localization sites
–  ER 15%, Golgi 4%, Cytosol 16%, Nucleus 27%, Peroxisome 2%, Plasma membrane 9%, Lysosome 4%, Mitochondria 10%, Extracellular 13%
Scott MS, Thomas DY, Hallett MT. Predicting subcellular localization via protein motif co-occurrence. Genome Res 2004;14(10A):1957-1966.
Evaluation
•  Accuracy (Acc)

$$Acc = \sum_{i=1}^{l} TP_i \Big/ \sum_{i=1}^{l} N_i$$

–  l = 5 is the total number of localization sites
–  N_i is the number of proteins in localization site i
•  Matthews correlation coefficient (MCC)

$$MCC_i = \frac{TP_i \, TN_i - FP_i \, FN_i}{\sqrt{(TP_i + FN_i)(TP_i + FP_i)(TN_i + FP_i)(TN_i + FN_i)}}$$
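Both metrics follow directly from per-site confusion counts, as in this sketch:

```python
import math

def overall_accuracy(tp_per_site, n_per_site):
    """Acc = sum_i TP_i / sum_i N_i over the l localization sites."""
    return sum(tp_per_site) / sum(n_per_site)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for one localization site."""
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```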
Simple Prediction Methods (1/2)
•  1NN_TFIDF: 1NN + gapped-dipeptides + TFIDF weighting
•  1NN_TFPSSM: 1NN + gapped-dipeptides + TF-PSSM weighting
[Figure: the gapped-dipeptide representation pipeline (sequence → PSI-BLAST PSSM → weighted gapped-dipeptide vector → PLSA reduction), as in the system architecture.]
Simple Prediction Methods (2/2)
•  1NN_PSI-BLASTps, 1NN_PSI-BLASTnr
•  1NN_ClustalW
[Figure: 1NN_PSI-BLASTps searches the query protein against the training database with PSI-BLAST; 1NN_PSI-BLASTnr first builds a PSSM against the NCBI nr database and then searches the training database; 1NN_ClustalW aligns the query against the training database with ClustalW. The most similar training protein provides the prediction.]
Comparison of 1NN_TFIDF and 1NN_TFPSSM on the PSHigh783 and PSLow661 data sets

                 PSHigh783                            PSLow661
Loc.     1NN_TFPSSM      1NN_TFIDF       1NN_TFPSSM      1NN_TFIDF
Sites    Acc.(%)  MCC    Acc.(%)  MCC    Acc.(%)  MCC    Acc.(%)  MCC
CP       94.20    0.96   71.01    0.74   83.25    0.77   41.15    0.36
IM       99.31    0.99   98.62    0.89   82.93    0.82   84.15    0.48
PP       95.86    0.94   86.21    0.89   74.05    0.63   38.17    0.46
OM       99.66    0.99   95.88    0.95   85.00    0.82   66.00    0.48
EC       96.99    0.96   92.48    0.91   57.89    0.51   28.07    0.26
Overall  97.96    -      91.83    -      79.43    -      53.86    -
Comparison of 1NN_TFPSSM, 1NN_ClustalW, 1NN_PSI-BLASTps and 1NN_PSI-BLASTnr

PSHigh783
Loc.     1NN_TFPSSM      1NN_ClustalW    1NN_PSI-BLASTps  1NN_PSI-BLASTnr
Sites    Acc.(%)  MCC    Acc.(%)  MCC    Acc.(%)  MCC     Acc.(%)  MCC
CP       94.20    0.96   89.86    0.90   88.41    0.92    86.96    0.90
IM       99.31    0.99   98.62    0.97   99.31    0.98    99.31    0.98
PP       95.86    0.94   93.79    0.93   93.79    0.93    92.41    0.91
OM       99.66    0.99   99.66    0.99   99.66    0.99    99.66    0.99
EC       96.99    0.96   98.50    0.98   98.50    0.98    98.50    0.98
Overall  97.96    -      97.32    -      97.32    -       96.93    -

PSLow661
Loc.     1NN_TFPSSM      1NN_ClustalW    1NN_PSI-BLASTps  1NN_PSI-BLASTnr
Sites    Acc.(%)  MCC    Acc.(%)  MCC    Acc.(%)  MCC     Acc.(%)  MCC
CP       83.25    0.77   39.23    0.23   36.84    0.40    55.50    0.53
IM       82.93    0.82   46.95    0.33   68.29    0.57    75.00    0.66
PP       74.05    0.63   41.98    0.44   59.54    0.51    64.12    0.54
OM       85.00    0.82   45.00    0.47   87.00    0.57    87.00    0.66
EC       57.89    0.51   43.86    0.10   50.88    0.37    52.63    0.45
Overall  79.43    -      42.97    -      57.94    -       66.57    -
Evaluation and Results
[Table not recovered from the slide image: comparison of predictors on the Gram-negative data set.]
*HYBRID combines the results of CELLO II and ALIGN.
Evaluation and Results

Eukaryotic         PSLDoc          CELLO II        PLOC
Loc. Sites         Acc.    MCC     Acc.    MCC     Acc.    MCC
Chloroplast        76.9    0.77    79.9    0.81    72.3    -
Cytoplasmic        81.4    0.76    77.2    0.71    72.2    -
Cytoskeleton       70.0    0.81    67.5    0.81    58.5    -
ER                 74.6    0.81    67.5    0.78    46.5    -
Extracellular      90.9    0.88    90.2    0.88    78.0    -
Golgi              76.6    0.87    53.2    0.69    14.6    -
Lysosomal          75.3    0.78    68.8    0.78    61.8    -
Mitochondrial      71.9    0.71    72.9    0.72    57.4    -
Nuclear            93.5    0.87    91.0    0.83    89.6    -
Peroxisomal        67.2    0.76    47.2    0.63    25.2    -
Plasma Membrane    95.5    0.95    95.9    0.94    92.2    -
Vacuole            33.3    0.46    51.9    0.66    25.0    -
Overall            86.5    -       85.0    -       78.2    -
Prediction Confidence
•  The confidence of the final predicted class:
Prediction Confidence = the largest probability − the second-largest probability
[Figure: overall accuracy (%) per prediction-confidence bin, from [0-0.1) to [0.9-1].]
[Figure: example. Among the five SVM outputs, SVM_CP is the largest and SVM_OM the second largest, so Prediction Confidence = SVM_CP − SVM_OM.]
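The confidence rule, combined with the threshold rule on the next slides, reduces to a few lines; probs is assumed to hold the five one-vs-rest SVM probabilities in site order:

```python
import numpy as np

SITES = ("CP", "IM", "PP", "OM", "EC")

def predict_with_confidence(probs, threshold=0.0):
    """Top site, or 'Unknown' when confidence falls below the threshold."""
    order = np.argsort(probs)[::-1]
    confidence = probs[order[0]] - probs[order[1]]  # largest - second largest
    site = SITES[order[0]] if confidence >= threshold else "Unknown"
    return site, float(confidence)

# Example: SVM_CP is largest and SVM_OM second largest -> ("CP", 0.47).
print(predict_with_confidence(np.array([0.81, 0.02, 0.05, 0.34, 0.01]),
                              threshold=0.3))
```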
Prediction Threshold (1/3)
[Figure: recall (%) versus precision (%) as the prediction threshold varies from 0 to 0.9; the value above each point denotes the corresponding prediction threshold.]
•  If Prediction Confidence ≥ Threshold, the predicted localization site is output; otherwise the prediction is Unknown.
Prediction Threshold (2/3)

            PSLDoc_PreThr=0.7     PSLDoc_PreThr=0.3     PSORTb v.2.0
Loc. Sites  Precision  Recall     Precision  Recall     Precision  Recall
CP          97.30      77.70      94.92      87.41      92.86      70.14
IM          98.91      88.35      97.94      92.23      95.33      92.56
PP          96.19      73.19      93.00      81.88      95.50      69.20
OM          99.46      93.61      98.41      95.14      97.38      94.88
EC          95.57      79.47      91.57      85.79      97.40      78.95
Overall     97.89      83.66      95.77      89.27      95.82      82.62
Prediction Threshold (3/3)
*The threshold is set such that the coverage is similar to that of PSLT.
Gapped-Dipeptide Signatures
•  The number of topics is set to 80.
Gapped-Dipeptide Signatures
•  The site-topic preference of a topic z for a localization site l = average{ P(z|d) : protein d belongs to class l }
[Figure: site-topic preference visualizations; the two panels correspond to Acc. = 89 and Acc. = 90.]
Gapped-Dipeptide Signatures
•  The maximum gap distance is 13, so the number of gapped-dipeptides is 5,600 (= 20 × 20 × 14).
Gapped-Dipeptide Signatures
•  For each localization site, ten preferred topics are selected according to site-preference confidence (= the largest site-topic preference − the second-largest site-topic preference); see the sketch below.
•  For each topic, the five most frequent gapped-dipeptides are selected.
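A sketch of this selection under stated assumptions: p_z_d holds P(z|d) from PLSA (topics × proteins), labels gives each protein's site, and topic_terms maps each topic to a gapped-dipeptide frequency dict; assigning a topic to the site with its largest preference is our reading of "preferred topics", not a detail given on the slide:

```python
import numpy as np

def site_topic_preference(p_z_d, labels, site):
    """Average P(z|d) over the proteins d that belong to the given site."""
    cols = [j for j, lab in enumerate(labels) if lab == site]
    return p_z_d[:, cols].mean(axis=1)

def signatures(p_z_d, labels, sites, site, topic_terms, n_topics=10, n_terms=5):
    prefs = np.array([site_topic_preference(p_z_d, labels, s) for s in sites])
    top2 = np.sort(prefs, axis=0)[-2:]       # per-topic two largest preferences
    confidence = top2[1] - top2[0]           # site-preference confidence
    # Topics whose strongest preference is this site, ranked by confidence.
    chosen = [z for z in np.argsort(-confidence)
              if sites[int(prefs[:, z].argmax())] == site][:n_topics]
    # The n_terms most frequent gapped-dipeptides of each chosen topic.
    return [sorted(topic_terms[z], key=topic_terms[z].get, reverse=True)[:n_terms]
            for z in chosen]
```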
Gapped-Dipeptide Signatures

Site  Gapped-dipeptide signatures
CP    E0E, K1I, K5V, K1V, D0E; L1H, L5H, L3H, H4L, H0L; A12C, A9C, A13C, A5C, A7C;
      R3R, R6R, R2R, R0R, R9R; A6A, A13A, A7A, A10A, A11A; I0E, R6I, I3R, I3K, R6V;
      H3H, H1H, H7H, H13H, H10H; H1M, H2M, H11M, M0H, H0M; A4E, E1E, A2E, V4E, A9E;
      E4E, K6E, E6E, E3E, E0E
IM    I2I, I3I, I0I, L0I, I0F; L7L, L4L, L10L, L3L, L6L; M3M, M2M, M0M, M8M, M6M;
      V2I, V2V, V3I, V3V, I0V; T2F, T6F, F3F, T4F, T8F; A1A, A7L, A4A, A1C, A11L;
      W3W, W0W, W2W, W6W, W4W; Y12L, Y1L, Y11L, L0Y, L1L; M2T, M3T, M10T, M4T, M0L;
      F10P, F8P, F12P, F3P, F13P
PP    A1A, A2A, A0A, A3A, M4A; M0H, W1Q, W1H, W1K, W5Q; P1E, P0E, E0P, P0K, E1P;
      D0D, Q0D, D3D, D3Q, D11D; W0E, E4W, W11E, E0W, W13E; K3K, K0K, K2K, K1K, K7K;
      A3A, A7A, A1P, A6R, A10R; P3N, N4P, N3P, N5P, N0P; H6G, G3M, H7D, G11H, H11G;
      A10A, A11A, A6A, A12A, A3A
OM    T1R, R3T, R1T, T5R, P0P; R0F, R4F, Y13R, R6F, R2F; N4N, N0N, N10N, N7N, F1N;
      Q6Q, Q1Q, Q3Q, Q13Q, Q4Q; S0F, A3F, F0S, R9F, F7F; G0G, A0G, A1G, G1A, G3A;
      N1Q, N1N, Q1Q, N12N, Q11V; W2N, N2W, N0W, D2W, N13W; Q5R, R1Q, Q1R, Q3R, R2Q;
      Y1Y, Y0Y, Y5Y, Y4Y, Y12Y
EC    S6S, S2S, T11T, S13S, T6S; G8G, G0G, G7G, G9G, G6G; T1T, T3T, T5T, T9T, T10T;
      N10N, N9N, N13N, N11N, N12N; N1N, N3N, N4N, N11N, N1T; I5Y, Y12S, Y3S, Y9S, Y6I;
      Q2N, N1Q, Q1Q, N3Q, Q7Q; K1S, S6S, S5S, S11M, S0S; S3G, G3G, G4S, G3S, G2G;
      N0N, N12V, N4V, V12N, N9V
Gapped-Dipeptide Signatures Reflect Motifs Relevant to Protein Localization Sites
•  In integral membrane proteins, helix-helix interactions are stabilized by aromatic residues. Specifically, the aromatic motif (WXXW, i.e., W2W) is involved in the dimerization of transmembrane domains through π-π interactions.
•  In the outer membrane class, the C-terminal signature sequence is recognized by the assembly factor OMP85, which regulates the insertion and integration of OM proteins into the outer membrane of Gram-negative bacteria. This signature contains a Phe (F) at the C-terminal position, preceded by a strong preference for a basic amino acid (K, R), which corresponds to the R0F signature.
The Amino Acid Compositions of Single Residues and Gapped-Dipeptide Signatures for Each Localization Site
[Figure: composition (%) over the 20 amino acids for (A) single residues and (B) gapped-dipeptide signatures, per localization site (CP, IM, PP, OM, EC).]
The Grouped Amino Acid Compositions of Single Residues and Gapped-Dipeptide Signatures
[Figure: proportion (%) per localization site (CP, IM, PP, OM, EC) for (A) single residues and (B) gapped-dipeptide signatures.]
Amino acid groups: N (non-polar: AIGLMV), P (polar: CNPQST), C (charged: DEHKR), and A (aromatic: FYW)
Gapped-Dipeptide Signatures and Their Amino Acid Compositions for Each Localization Site
[Figure: proportion (%) of each amino acid group per localization site for the gapped-dipeptide signatures.]
Amino acid groups: N (non-polar: AIGLMV), P (polar: CNPQST), C (charged: DEHKR), and A (aromatic: FYW)
Gapped-Dipeptide Signatures and Their Amino Acid Compositions for Each Localization Site
•  IM has a high percentage of non-polar amino acids (60%) and no charged amino acids (0%).
–  This reflects the physico-chemical properties of the lipid bilayer: non-polar amino acids are favored in the transmembrane domains of IM proteins.
–  Charged amino acids are disfavored because of the energetic penalty they incur in the assembly of IM proteins.
•  The CP and EC classes have a high percentage of charged and polar amino acids, respectively.
–  The role of charged amino acids in the cytoplasm is probably related to pH homeostasis, in which they act as buffers, whereas secreted proteins in the EC class may require more polar amino acids to promote interactions with the solvent environment.
People
Wen-Lian Hsu, Ting-Yi Sung, Jia-Ming Chang, Hua-Sheng Chiu, Allan Lo, Emily Chia-Yu Su
Thank You!
Questions?

Editor's Notes
1. Transformation: the term-document matrix becomes a gapped-dipeptide × protein matrix.
2. LSI for dimension reduction; similar to SVD.
3. SVD reduces the number of features from 18,081 to 40.
4. (i.e., d has localization site l.)
5. IM has a high percentage of non-polar amino acids (60%) and no charged (0%) amino acids.