Prediction of protein subcellular localization (PSL) is important for genome annotation, protein function prediction, and drug discovery. Many computational approaches to sequence-based PSL prediction for Gram-negative bacteria have been proposed in recent years. We present PSLDoc, a method based on gapped-dipeptides and probabilistic latent semantic analysis (PLSA). A protein is treated as a term string composed of gapped-dipeptides, defined as any two residues separated by one or more positions. The weighting of each gapped-dipeptide is calculated from a position specific scoring matrix, which captures sequence evolutionary information. PLSA is then applied for feature reduction, and the reduced vectors are input to five one-versus-rest support vector machine classifiers; the localization site with the highest probability is assigned as the final prediction. Because a strong correlation between sequence homology and subcellular localization has been reported (Nair and Rost, Protein Sci 2002;11:2836–2847; Yu et al., Proteins 2006;64:643–651), we evaluate PSLDoc separately on low- and high-homology data sets. PSLDoc's overall accuracy reaches 86.84% on the low-homology set and 98.21% on the high-homology set, comparing favorably with CELLO II (Yu et al., Proteins 2006;64:643–651). In addition, we set a confidence threshold to achieve high precision at specified levels of recall. When the confidence threshold is set at 0.7, PSLDoc achieves a precision of 97.89%, which is considerably better than that of PSORTb v.2.0 (Gardy et al., Bioinformatics 2005;21:617–623). Our approach demonstrates that this feature representation can be successfully applied to the prediction of protein subcellular localization and improves prediction accuracy.
Moreover, because the representation is general, our method can be extended to eukaryotic proteomes in the future. The PSLDoc web server is publicly available at http://bio-cluster.iis.sinica.edu.tw/∼bioapp/PSLDoc/.
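As an illustration of the gapped-dipeptide representation, here is a hypothetical sketch that only enumerates the terms of a short sequence. The `"Ad0L"`-style term encoding and the `max_gap` cutoff are illustrative choices, not PSLDoc's exact conventions, and the PSSM-based weighting step is omitted.

```python
# Hypothetical sketch of the gapped-dipeptide representation: a gapped
# dipeptide is residue X followed by residue Z with d residues in between
# (d = 0 means adjacent residues). Term encoding and max_gap are illustrative.
def gapped_dipeptides(sequence, max_gap=2):
    """Enumerate the gapped-dipeptide 'terms' of a protein sequence."""
    terms = []
    for d in range(max_gap + 1):
        for i in range(len(sequence) - d - 1):
            terms.append(f"{sequence[i]}d{d}{sequence[i + d + 1]}")
    return terms
```

In PSLDoc each such term is further weighted using the protein's position specific scoring matrix before PLSA reduction; this sketch covers term extraction only.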
The document proposes a data-driven approach to predict travel times for vehicles traveling on known routes using historical trajectory data. It represents routes and trajectories mathematically and defines a weighted LP-norm distance measure to find the most similar historical trajectory to a current partial trajectory. This nearest neighbor trajectory is then used to predict the future movement. Evaluation shows the correlation-based weighted LP-norm together with an iterative threshold algorithm provides accurate and efficient predictions.
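The distance computation described can be sketched as follows. Function names, the weights, and the toy data are assumptions; real trajectories would be sequences of timestamped positions rather than scalars.

```python
# Illustrative sketch of a weighted Lp-norm nearest-neighbor search over
# historical trajectories (names, weights, and toy data are assumptions).
def weighted_lp_distance(x, y, w, p=2):
    """Weighted Lp-norm distance between two equal-length value sequences."""
    return sum(wi * abs(xi - yi) ** p for xi, yi, wi in zip(x, y, w)) ** (1.0 / p)

def nearest_trajectory(partial, history, w, p=2):
    """Index of the historical trajectory whose prefix is closest to `partial`."""
    return min(range(len(history)),
               key=lambda i: weighted_lp_distance(partial, history[i][:len(partial)], w, p))
```

The selected historical trajectory's remaining portion would then serve as the prediction of future movement.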
This study investigates the localization and function of VDAC4, a porin protein in Tetrahymena thermophila mitochondria. Bioinformatics analysis predicted VDAC4 contains a conserved Porin3 domain found in mitochondrial porins and Tom40 proteins. The researcher amplified and cloned the VDAC4 gene, created a YFP fusion construct, transformed Tetrahymena cells, and observed YFP localization using microscopy. YFP localized to punctate structures consistent with mitochondrial localization, supporting VDAC4 involvement in mitochondrial membrane processes.
This document summarizes a study on the localization and potential functions of two proteins, VDAC4 and TGrPE1, in the ciliated protozoan Tetrahymena thermophila. VDAC4 encodes a porin protein that localizes to mitochondria and basal bodies, suggesting a role in mitochondrial-ciliary transport. TGrPE1 encodes a protein containing a GrPE1 domain that may regulate chaperone proteins through ATP/ADP exchange, and it localizes near mitochondria. The goal is to better understand the roles of these proteins in mitochondrial processes and protein folding in Tetrahymena.
This document discusses the use of latent semantic analysis (LSA) for document clustering. It describes issues with traditional information retrieval systems, defines key concepts like synonymy and polysemy, and explains how LSA addresses these issues by reducing the semantic space. An experiment is described where documents are clustered with and without LSA preprocessing, showing that LSA leads to improved cluster quality metrics like purity, entropy, and average intra-cluster similarity. The study demonstrates LSA can perform comparably to dedicated clustering tools for organizing documents by topic.
The document discusses the Flint water crisis where the city switched its water source in 2014 from the Detroit water system to the Flint River without implementing corrosion control. This caused lead to leach into the drinking water from aging pipes. Independent studies showed high lead levels in water and children's blood, but officials dismissed residents' concerns. The crisis highlighted issues with aging infrastructure, improper water treatment, and environmental racism. Lead exposure can negatively impact childhood development and public health. Similar problems with lead in drinking water have been found in over half of Massachusetts schools tested.
Finding new friends: A different kind of recommendation system (Eva Ward)
This document discusses using topic modeling techniques like Non-negative Matrix Factorization (NMF) and Latent Dirichlet Allocation (LDA) to analyze a corpus of 2,000 documents about Seattle in order to discover common topics that could be used to recommend potential hosts to visitors. The analysis identified 12 topics in the documents, including "artsy", "yoga", "hippies", "outdoorsy", and "young professional". The document compares the topics discovered using NMF versus LDA.
The document discusses model-based clustering of bike sharing station usage data from the Velib' system in Paris. It presents an approach using a naive Poisson mixture model to cluster stations based on their temporal usage profiles, represented as count time series. The model assumes stations belong to clusters that capture their weekly and daily usage patterns. Expectation-maximization is used to estimate the model parameters, assigning stations to clusters and identifying cluster-specific temporal profiles. Analysis of results from applying this approach to Velib' data aims to better understand station usage and grouping stations with similar behaviors.
This document summarizes a presentation on the OpenNLP toolkit. OpenNLP is an open-source Java toolkit for natural language processing. It provides common NLP features like tokenization, sentence segmentation, part-of-speech tagging, and named entity extraction. The presentation discusses how these features work using pre-trained models for different languages. An example is also given showing how OpenNLP could be used to extract tags from a website and display them in a tag cloud. The presentation concludes by providing contact information for the presenter.
A Simple Stochastic Gradient Variational Bayes for Latent Dirichlet Allocation (Tomonari Masada)
This document proposes applying stochastic gradient variational Bayes (SGVB) to latent Dirichlet allocation (LDA) topic modeling to obtain an efficient posterior estimation. SGVB introduces randomness into variational inference for LDA by estimating expectations with Monte Carlo integration and using reparameterization to sample from approximate posterior distributions. Evaluation on several text corpora shows perplexities comparable to existing LDA inference methods, with the potential for faster parallelization using techniques like GPU processing. Future work will explore applying SGVB to other probabilistic document models like correlated topic models.
EM algorithm and its application in probabilistic latent semantic analysis (zukun)
The document discusses the EM algorithm and its application in Probabilistic Latent Semantic Analysis (pLSA). It begins by introducing the parameter estimation problem and comparing frequentist and Bayesian approaches. It then describes the EM algorithm, which iteratively computes lower bounds to the log-likelihood function. Finally, it applies the EM algorithm to pLSA by modeling documents and words as arising from a mixture of latent topics.
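As an illustration of the E- and M-steps described, here is a minimal pure-Python pLSA sketch on toy document-word counts. All names and the toy corpus are illustrative; a real implementation would monitor the log-likelihood for convergence.

```python
# Minimal pLSA EM sketch: fit P(z|d) and P(w|z) to a doc-word count matrix.
import random

def plsa_em(counts, K, iters=30, seed=0):
    rng = random.Random(seed)
    D, W = len(counts), len(counts[0])
    p_z_d = [[rng.random() + 0.1 for _ in range(K)] for _ in range(D)]  # P(z|d)
    p_w_z = [[rng.random() + 0.1 for _ in range(W)] for _ in range(K)]  # P(w|z)
    for dist in p_z_d + p_w_z:                      # normalize initial guesses
        s = sum(dist)
        dist[:] = [v / s for v in dist]
    for _ in range(iters):
        new_zd = [[1e-12] * K for _ in range(D)]
        new_wz = [[1e-12] * W for _ in range(K)]
        for d in range(D):
            for w in range(W):
                if counts[d][w] == 0:
                    continue
                # E-step: posterior P(z|d,w) is proportional to P(z|d) * P(w|z)
                post = [p_z_d[d][z] * p_w_z[z][w] for z in range(K)]
                s = sum(post)
                for z in range(K):
                    r = counts[d][w] * post[z] / s  # expected count for topic z
                    new_zd[d][z] += r
                    new_wz[z][w] += r
        # M-step: re-normalize expected counts into distributions
        for d in range(D):
            s = sum(new_zd[d]); p_z_d[d] = [v / s for v in new_zd[d]]
        for z in range(K):
            s = sum(new_wz[z]); p_w_z[z] = [v / s for v in new_wz[z]]
    return p_z_d, p_w_z
```

Each EM iteration provably does not decrease the data log-likelihood, which is what the lower-bound argument in the document establishes.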
Latent semantic analysis (LSA) is a technique used in natural language processing to analyze relationships between documents and terms by producing concepts related to them. LSA assumes words with similar meanings will occur in similar texts, and uses a documents-terms matrix and singular value decomposition to discover hidden concepts and represent words and documents as vectors in a semantic vector space. Apache OpenNLP is a machine learning toolkit that can be used for various natural language processing tasks like part-of-speech tagging and parsing, and LSA can be seen as part of natural language processing.
Topic modeling is a technique for discovering hidden semantic patterns in large document collections. It represents documents as probability distributions over latent topics, where each topic is characterized by a distribution over words. Two common probabilistic topic models are latent Dirichlet allocation (LDA) and probabilistic latent semantic analysis (pLSA). LDA assumes each document exhibits multiple topics in different proportions, with topics modeled as distributions over words. Topic modeling provides dimensionality reduction and can be applied to problems like text classification, collaborative filtering, and computer vision tasks like image classification.
Latent semantic indexing (LSI) allows search engines to determine the topic of a page outside of directly matching search terms. LSI models the contexts in which words are used to find related pages, similar to how humans understand language through context. LSI was introduced to search engines by Applied Semantics and acquired by Google to power adsense through finding related ads. LSI gives search engines the ability to return more relevant results to users by understanding related terms, synonyms, singular/plural forms, and words with similar meanings or roots. Implementing LSI in a website involves developing thematically focused content using related keywords and synonyms throughout to better match user intent.
The document provides an introduction to Probabilistic Latent Semantic Analysis (PLSA). It discusses how PLSA improves on previous Latent Semantic Analysis methods by incorporating a probabilistic framework. PLSA models documents as mixtures of topics and allows words to have multiple meanings. The parameters of the PLSA model, including the topic distributions and word-topic distributions, are estimated using an expectation-maximization algorithm to find the parameters that best explain the observed word-document co-occurrence data.
The document discusses various bioinformatics concepts and tools including sequence alignment, BLAST, substitution matrices, and open reading frames. Sequence alignment involves comparing sequences to find similar regions and can be local or global. BLAST is a tool used to find similar sequences in a database by searching for exact and similar matches. Substitution matrices like BLOSUM and PAM assign scores to amino acid substitutions observed in protein evolution. Open reading frames refer to the three possible frames for translating a nucleic acid sequence into a protein.
2016.09.28
TOPIC REVIEW
• Exam
• PS2 Sequence Alignment
• Command Line Blast
• PS1 Molecular Biology
• Personal Microbiome Project
CURRENTLY
LET’S NEGOTIATE
• Problem sets (4) - 10%
• Microbiome project - 20%
• Exam (1) - 20%
• Research project - 45%
• Participation - 5%
OR
• Problem sets (4) - 10%
• Microbiome project - 20%
• Exam 1 - 15%
• Exam 2 - 15%
• Research project - 35%
• Participation - 5%
PS2 SEQUENCE ALIGNMENT
RefSeqs, protein (experimentally supported)
On chromosome 17
Reverse strand
PRCD Progressive rod-cone degeneration
PS2: GLOBAL ALIGNMENT
BLOSUM62
• Substitutions are penalized less and are preferred over gaps; the level of identity also decreases.
BLOSUM80
• Substitutions are penalized more and gaps are favored.
PAM60
• Substitutions are penalized more and gaps are favored.
PAM250
• Substitutions are penalized less and are preferred over gaps; the level of identity also decreases.
PS2: LOCAL ALIGNMENT
SEQ1: A L S C V W M I P (9 residues)
SEQ2: A I S C M I P T (8 residues)
Create matrix: (length of seq1 + 1) x (length of seq2 + 1) = 10 x 9
        A    L    S    C    V    W    M    I    P
    0  -2   -4   -6   -8  -10  -12  -14  -16  -18
A  -2
I  -4
S  -6
C  -8
M -10
I -12
P -14
T -16
Exercise: fill the scores of the alignment matrix using the BLOSUM62 substitution matrix.
Gap opening penalty: -5
Gap extension penalty: -1
SEQ1: S V E T D
SEQ2: T S I N Q E T
BLOSUM62 substitution matrix (lower triangle):

        A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
Ala A   4
Arg R  -1   5
Asn N  -2   0   6
Asp D  -2  -2   1   6
Cys C   0  -3  -3  -3   9
Gln Q  -1   1   0   0  -3   5
Glu E  -1   0   0   2  -4   2   5
Gly G   0  -2   0  -1  -3  -2  -2   6
His H  -2   0   1  -1  -3   0   0  -2   8
Ile I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
Leu L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
Lys K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
Met M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
Phe F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
Pro P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
Ser S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
Thr T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
Trp W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Tyr Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
Val V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
Dynamic programming - global alignment
BLOSUM62
GAP COST: -2
At each cell, three scores are calculated:
• Match score = diagonal cell score + score from the substitution matrix
• Vertical gap score = upper neighbor + gap cost
• Horizontal gap score = left neighbor + gap cost
• The highest score is retained and its arrow is recorded
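The fill step above can be sketched as follows; a toy match/mismatch scorer stands in for the BLOSUM62 lookup, the gap cost is -2 as on the slide, and function names are illustrative.

```python
# Sketch of the global-alignment (Needleman-Wunsch) matrix fill described
# above. seq1 indexes the rows of F; a toy scorer replaces BLOSUM62.
def needleman_wunsch_matrix(seq1, seq2, score, gap=-2):
    n, m = len(seq1), len(seq2)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap                     # first column: leading gaps
    for j in range(1, m + 1):
        F[0][j] = j * gap                     # first row: leading gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = F[i - 1][j - 1] + score(seq1[i - 1], seq2[j - 1])  # diagonal
            vgap = F[i - 1][j] + gap          # upper neighbor + gap cost
            hgap = F[i][j - 1] + gap          # left neighbor + gap cost
            F[i][j] = max(match, vgap, hgap)  # retain the highest score
    return F

toy_score = lambda a, b: 1 if a == b else -1  # stand-in for a BLOSUM62 lookup
F = needleman_wunsch_matrix("ALSCVWMIP", "AISCMIPT", toy_score)
```

The first row and column of `F` reproduce the -2-per-gap border shown in the exercise matrix; a traceback over the recorded arrows would recover the alignment itself.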
The document appears to be the output from running a BLAST search on a query sequence against a yeast genome database. The top hit from the BLAST search is a sequence on Saccharomyces cerevisiae chromosome XI with 85% identity over 88 amino acids and an E-value of 4e-12. Additional significant hits are also reported. The bottom of the document discusses retrieving the genomic sequence for the top hit and generating a dotplot comparison between the query and hit sequences.
This document summarizes a talk given by Dr. Noel O'Boyle on using Python for chemistry. It discusses what Python is, why it is useful for chemistry, and how it can be used. Specific examples are given of popular Python modules for tasks like data analysis, visualization, cheminformatics, and interfacing with other languages like R and Java. The document provides an overview of the capabilities of Python for scientific computing and highlights its growing adoption in the chemistry community.
The document discusses bioinformatics and computational biology. It describes a lab with over 100 people from diverse backgrounds, including engineers, scientists, technicians, geneticists and clinicians. The lab applies information technology to analyze biological data, focusing on areas like sequence analysis, molecular modeling, phylogeny, medical applications, statistics and more. Specific applications mentioned include analyzing genomes to study genetic diseases and drug design, as well as using the same techniques in agriculture and animal health.
- Maintain a 2D array P of size V×V along with the distance array D
- P[i][j] stores the intermediate vertex on the shortest path from i to j
- During each iteration of k, if an updated shortest path is found via k, store k in P[i][j]
- To extract the path, start from P[i][j] and follow the chain of intermediate vertices stored in P until reaching i
- This allows tracing any shortest path after the algorithm terminates, in time proportional to the number of vertices on that path
The modification requires storing only the additional V×V array P alongside the distance array D.
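The bullet points above can be sketched as follows, with the arrays named D and P as in the notes; the toy graph in the usage is illustrative.

```python
# Floyd-Warshall extended with the intermediate-vertex array P described
# above: P[i][j] records the vertex k through which the i->j path improved.
INF = float("inf")

def floyd_warshall_with_paths(weights):
    n = len(weights)
    D = [row[:] for row in weights]
    P = [[None] * n for _ in range(n)]       # None: direct edge, no intermediate
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if D[i][k] + D[k][j] < D[i][j]:
                    D[i][j] = D[i][k] + D[k][j]
                    P[i][j] = k              # shortest i->j path goes via k
    return D, P

def extract_path(P, i, j):
    """Recover the vertex sequence of the shortest path from i to j."""
    k = P[i][j]
    if k is None:
        return [i, j]
    return extract_path(P, i, k)[:-1] + extract_path(P, k, j)
```

Storing the single intermediate vertex per pair (rather than whole paths) keeps the extra memory at O(V^2) while still allowing full path recovery.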
This document discusses sequence alignment and contains four sections:
1) Global alignment which finds the highest scoring alignment between entire sequences using dynamic programming.
2) Scoring matrices which generalize alignment scoring by assigning scores to individual character matches/mismatches based on biological evidence.
3) Local alignment which finds the best scoring alignment between substrings of sequences to identify conserved regions, as global alignment may miss these.
4) Ways to solve the local alignment problem efficiently in quadratic time instead of quartic time by computing alignments from each vertex in the grid.
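Section 3's local alignment can be sketched as the Smith-Waterman variant of the global recurrence: every cell is floored at zero so an alignment may start anywhere, and the best cell anywhere in the matrix gives the score. The toy scorer and function names are illustrative.

```python
# Local-alignment (Smith-Waterman) score sketch with a linear gap cost.
def smith_waterman_best(seq1, seq2, score, gap=-2):
    """Best local-alignment score between any substrings of seq1 and seq2."""
    n, m = len(seq1), len(seq2)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(0,                  # floor at 0: restart alignment here
                          F[i - 1][j - 1] + score(seq1[i - 1], seq2[j - 1]),
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
            best = max(best, F[i][j])
    return best
```

Filling one quadratic-size matrix like this is exactly the efficiency improvement section 4 refers to, versus naively aligning every pair of substrings.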
This document provides an overview of databases, definitions, scoring matrices, and pairwise sequence alignment. It discusses major bioinformatics databases like NCBI, ExPASy, and EBI. It also defines key terms like identity, homology, orthologous, and paralogous sequences. Additionally, it examines the theoretical and empirical bases for scoring matrices like PAM, BLOSUM, and transition/transversion matrices, and how they are used in sequence alignment.
The document presents a compartment-based model to describe the degradation kinetics of a peptide substrate reporter (VI-B) in five cell cultures. The model uses a system of differential equations to track the concentration of the reporter and its fragments over time. Parameters of the model are fitted to time-series data from the cultures. Results show variation in degradation rates between cultures and fragments, revealing targets of peptidases. The model has potential utility in peptide substrate reporter design.
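The kind of compartment model described can be sketched as a pair of coupled first-order equations integrated with Euler steps. The names, rate constants, and the two-compartment simplification here are illustrative, not the study's fitted parameters or solver.

```python
# Toy two-compartment degradation sketch: the intact reporter degrades into a
# fragment at rate k1, and the fragment is cleared at rate k2.
def simulate(k1, k2, r0=1.0, dt=0.01, steps=1000):
    """Integrate dr/dt = -k1*r and df/dt = k1*r - k2*f with Euler steps."""
    r, f = r0, 0.0                            # reporter and fragment levels
    series = [(r, f)]
    for _ in range(steps):
        dr = -k1 * r                          # reporter loss
        df = k1 * r - k2 * f                  # fragment gain minus clearance
        r, f = r + dr * dt, f + df * dt
        series.append((r, f))
    return series
```

Fitting the rate constants of such equations to measured time series per culture is what reveals culture-specific degradation rates and peptidase targets.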
1) This document introduces methods for detecting sequence similarity, which is a fundamental analysis in bioinformatics.
2) It describes how to search databases for similar sequences using BLAST or FASTA, and how to compare two sequences using dynamic programming algorithms like Needleman-Wunsch or Smith-Waterman.
3) Substitution matrices like BLOSUM62 are used to score alignments and measure sequence similarity based on amino acid properties.
A New Multi-Objective Mixed-Discrete Particle Swarm Optimization Algorithm (Weiyang Tong)
A new multi-objective optimization algorithm to handle problems that are highly constrained, highly nonlinear, and have mixed types of design variables.
A Simple Stochastic Gradient Variational Bayes for Latent Dirichlet AllocationTomonari Masada
This document proposes applying stochastic gradient variational Bayes (SGVB) to latent Dirichlet allocation (LDA) topic modeling to obtain an efficient posterior estimation. SGVB introduces randomness into variational inference for LDA by estimating expectations with Monte Carlo integration and using reparameterization to sample from approximate posterior distributions. Evaluation on several text corpora shows perplexities comparable to existing LDA inference methods, with the potential for faster parallelization using techniques like GPU processing. Future work will explore applying SGVB to other probabilistic document models like correlated topic models.
EM algorithm and its application in probabilistic latent semantic analysiszukun
The document discusses the EM algorithm and its application in Probabilistic Latent Semantic Analysis (pLSA). It begins by introducing the parameter estimation problem and comparing frequentist and Bayesian approaches. It then describes the EM algorithm, which iteratively computes lower bounds to the log-likelihood function. Finally, it applies the EM algorithm to pLSA by modeling documents and words as arising from a mixture of latent topics.
Latent semantic analysis (LSA) is a technique used in natural language processing to analyze relationships between documents and terms by producing concepts related to them. LSA assumes words with similar meanings will occur in similar texts, and uses a documents-terms matrix and singular value decomposition to discover hidden concepts and represent words and documents as vectors in a semantic vector space. Apache OpenNLP is a machine learning toolkit that can be used for various natural language processing tasks like part-of-speech tagging and parsing, and LSA can be seen as part of natural language processing.
Topic modeling is a technique for discovering hidden semantic patterns in large document collections. It represents documents as probability distributions over latent topics, where each topic is characterized by a distribution over words. Two common probabilistic topic models are latent Dirichlet allocation (LDA) and probabilistic latent semantic analysis (pLSA). LDA assumes each document exhibits multiple topics in different proportions, with topics modeled as distributions over words. Topic modeling provides dimensionality reduction and can be applied to problems like text classification, collaborative filtering, and computer vision tasks like image classification.
Latent semantic indexing (LSI) allows search engines to determine the topic of a page outside of directly matching search terms. LSI models the contexts in which words are used to find related pages, similar to how humans understand language through context. LSI was introduced to search engines by Applied Semantics and acquired by Google to power adsense through finding related ads. LSI gives search engines the ability to return more relevant results to users by understanding related terms, synonyms, singular/plural forms, and words with similar meanings or roots. Implementing LSI in a website involves developing thematically focused content using related keywords and synonyms throughout to better match user intent.
The document provides an introduction to Probabilistic Latent Semantic Analysis (PLSA). It discusses how PLSA improves on previous Latent Semantic Analysis methods by incorporating a probabilistic framework. PLSA models documents as mixtures of topics and allows words to have multiple meanings. The parameters of the PLSA model, including the topic distributions and word-topic distributions, are estimated using an expectation-maximization algorithm to find the parameters that best explain the observed word-document co-occurrence data.
1. The document discusses various bioinformatics concepts and tools including sequence alignment, BLAST, substitution matrices, and open reading frames. Sequence alignment involves comparing sequences to find similar regions and can be local or global. BLAST is a tool used to find similar sequences in a database by searching for exact and similar matches. Substitution matrices like BLOSUM and PAM assign scores to amino acid substitutions observed in protein evolution. Open reading frames refer to the three possible frames for translating a nucleic acid sequence into a protein.
2016.09.28
TOPIC REVIEW
• Exam
• PS2 Sequence Alignment
• Command Line Blast
• PS1 Molecular Biology
• Personal Microbiome Project
CURRENTLY
LET’S NEGOTIATE
• Problem sets (4) - 10%
• Microbiome project - 20%
• Exam (1) - 20%
• Research project - 45%
• Participation - 5%
OR
• Problem sets (4) - 10%
• Microbiome project - 20%
• Exam 1 - 15%
• Exam 2 - 15%
• Research project - 35%
• Participation - 5%
PS2 SEQUENCE ALIGNMENT
PS2 SEQUENCE ALIGNMENT
RefSeqs, protein (experimentally supported)
On chromosome 17
Reverse strand
PRCD Progressive rod-cone degeneration
PS2: GLOBAL ALIGNMENT
BLOSUM62
• substitutions less penalized and are
preferred to gaps. There is also a
decrease in the level of identity.
BLOSUM80
• Substitutions more penalized and
gaps are favored.
PAM60
• Substitutions more penalized and gaps
are favored.
PAM250
• substitutions less penalized and are
preferred to gaps. There is also a
decrease in the level of identity.
PS2: LOCAL ALIGNMENT
SEQ1 A L S C V W M I P
SEQ2 A I S C M I P T
9 residues
8 residues
Create Matrix: length of seq1 + 1
x
length of seq2 + 1
Matrix 10 x 9
A L S C V W M I P
0 -2 -4 -6 -8 -10 -12 -14 -16 -18
-2
-4
-6
-8
-10
-12
-14
-16
A
I
S
C
M
I
P
T
Exercise: fill the scores of the alignment matrix
using the BLOSUM62 substitution matrix.
Gap opening penalty: -5
Gap extension penalty: -1
S V E T D
T
S
I
N
Q
E
T
Ala A 4
Arg R -1 5
Asn N -2 0 6
Asp D -2 -2 1 6
Cys C 0 -3 -3 -3 9
Gln Q -1 1 0 0 -3 5
Glu E -1 0 0 2 -4 2 5
Gly G 0 -2 0 -1 -3 -2 -2 6
His H -2 0 1 -1 -3 0 0 -2 8
Ile I -1 -3 -3 -3 -1 -3 -3 -4 -3 4
Leu L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4
Lys K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5
Met M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5
Phe F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6
Pro P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7
Ser S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4
Thr T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5
Trp W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11
Tyr Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7
Val V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4
A
la
A
rg
A
sn
A
sp
C
y
s
G
ln
G
lu
G
ly
H
is
Il
e
L
e
u
L
y
s
M
e
t
P
h
e
P
ro
S
e
r
T
h
r
T
rp
T
y
r
V
a
l
A R N D C Q E G H I L K M F P S T W Y V
Dynamical programming - global alignment
83
BLOSUM62
GAP COST: -2
At each cell, 3 scores are calculated:
• match score = diagonal cell score +
score from the substitution matrix.
• Vertical gap score = upper neighbor
+ gap cost
• Horizontal gap score = left neighbor
+ gap cost
• The highest score is retained and
the arrow is labelled
A L S C V W M I P
0 -2 -4 -6 -8 -10 -12 -14 -16 -18
-2
-4
-6
-8
-10
-12
-14
-16
A
I
S
C
M
I
P
T
Exercise: fill the scores of the alignment matrix
using the BLOSUM62 substitution matrix.
Gap opening penalty: -5
Gap extension penalty: -1
S V E T D
T
S
I
N
Q
E
T
A ...
7. Vector Space Model
• Salton's Vector Space Model
  – Represent each document by a high-dimensional vector in the space of words
[Figure: documents mapped to vectors; photo of Gerald Salton]
7/50
9. Term-Document Matrix
• Term-document matrix A is an m×n matrix, where m is the number of terms and n
  is the number of documents:

          d1   d2  ...  dn
    t1  [ a11  a12 ... a1n ]
    t2  [ a21  a22 ... a2n ]
    ..  [ ...           ...]
    tm  [ am1  am2 ... amn ]

  Row i corresponds to term ti; column j corresponds to document dj.
9/50
10. Term Weighting by TFIDF
• The term frequency (tf) in the given document d gives a measure of the
  importance of the term ti within the particular document:

    tf(ti, d) = ni / Σk nk

  with ni being the number of occurrences of the considered term, and the
  denominator being the number of occurrences of all terms in d.
• The inverse document frequency (idf) is obtained by dividing the number of
  all documents by the number of documents containing the term ti:

    idf(ti) = log( |D| / |{d : ti ∈ d}| )

  |D| : total number of documents in the corpus
  |{d : ti ∈ d}| : number of documents where the term ti appears
• tfidf(ti, d) = tf(ti, d) × idf(ti)
10/50
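The tf, idf, and tfidf definitions above can be sketched directly. A minimal illustration on a toy corpus of word lists; the corpus contents and term names are invented for the example:

```python
import math

def tf(term, doc):
    # term frequency: occurrences of `term` over the total term count in doc
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # inverse document frequency: log of |D| over the number of
    # documents containing the term
    n_docs_with_term = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_docs_with_term)

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

corpus = [["protein", "membrane", "membrane"],
          ["protein", "cytoplasm"],
          ["nucleus", "cytoplasm"]]
print(round(tfidf("membrane", corpus[0], corpus), 3))  # 0.732: frequent here, rare elsewhere
print(round(tfidf("protein", corpus[0], corpus), 3))   # 0.135: appears in most documents
```

The weighting rewards terms that are frequent in one document but rare across the corpus, which is exactly why it discriminates between documents.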
11. Prediction by 1-Nearest-Neighbor
based on Cosine Similarity
• The query is assigned the label of the most similar training document,
  measured by the cosine similarity between the document and query vectors
11/50
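The 1-nearest-neighbor rule above can be sketched as follows; the vectors and labels are toy values invented for the illustration:

```python
import math

def cosine(u, v):
    # cosine similarity: dot product over the product of vector norms
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def predict_1nn(query, docs, labels):
    # assign the label of the training vector most similar to the query
    best = max(range(len(docs)), key=lambda i: cosine(query, docs[i]))
    return labels[best]

docs = [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0]]
labels = ["CP", "IM"]
print(predict_1nn([2.0, 0.1, 3.9], docs, labels))  # "CP"
```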
12. Feature Reduction
• There exists a best choice of axes – the one that shows the most variation in
  the data. It is found by linear algebra: Singular Value Decomposition (SVD)
[Figure: true plot in k dimensions vs. reduced-dimensionality plot]
12/50
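A minimal sketch of SVD-based feature reduction, assuming a small toy term-document matrix (the values are invented):

```python
import numpy as np

# toy term-document matrix A (m terms × n documents); values assumed
A = np.array([[3.0, 1.0, 0.0, 0.0],
              [2.0, 0.0, 0.0, 1.0],
              [0.0, 0.0, 4.0, 3.0],
              [0.0, 1.0, 3.0, 2.0]])

# decompose A = U S V^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # keep only the k axes showing the most variation
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# A_k is the best rank-k approximation of A in the least-squares sense
print(np.linalg.norm(A - A_k))
```

Documents can then be represented by their k-dimensional coordinates (the columns of `np.diag(s[:k]) @ Vt[:k, :]`) instead of the full m-dimensional term vectors.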
14. Outline
• Introduction
– Protein Subcellular Localization
– Document Classification
• PSLDoc
– Term and its weighting scheme
– Feature Reduction
– SVM learning
• Evaluation and Results
• Discussion
14/50
15. The Terms of Proteins – Gapped-dipeptides*
• Let XdZ denote the amino acid coupling pattern of amino acid types X and Z
  that are separated by d amino acids
• If d ranges from 0 to 20, there are 8400 (= 20×20×21) features for a vector

  Sequence:  M P L D L N T L T
  Terms:     M0P, M1L, M2D, ...

*Liang HK, Huang CM, Ko MT, Hwang JK. The amino acid-coupling patterns in
thermophilic proteins. Proteins: Structure, Function and Bioinformatics
2005;59:58-63.
15/50
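Enumerating the gapped-dipeptide terms of a sequence can be sketched as below; the function name and the `max_gap` parameter are illustrative (the slide uses d up to 20):

```python
def gapped_dipeptides(seq, max_gap=20):
    # enumerate terms XdZ: residues X and Z separated by d intermediate positions
    terms = []
    for d in range(max_gap + 1):
        for i in range(len(seq) - d - 1):
            terms.append(f"{seq[i]}{d}{seq[i + d + 1]}")
    return terms

seq = "MPLDLNTLT"  # the example sequence from the slide
terms = gapped_dipeptides(seq, max_gap=2)
print(terms[:3])  # ['M0P', 'P0L', 'L0D']
```

For the slide's example, the terms anchored at the first residue are M0P (adjacent), M1L (one position apart), and M2D (two positions apart), matching the sequence M P L D ...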
16. Term Weighting Scheme – TF
Position Specific Score Matrix (1/2)
• Position Specific Score Matrix (PSSM): a PSSM is constructed from a multiple
  alignment of the highest scoring hits in the BLAST search

       A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
 1 M  -3 -3 -4 -5 -3 -3 -4 -5 -4  0  1 -3 10 -2 -5 -4 -3 -4 -3 -1
 2 P   2 -3 -3 -1 -3 -1 -1 -1 -4 -2 -4 -2 -2 -5  4  2  4 -5 -4 -3
 3 L  -4 -5 -6 -6 -4 -3 -5 -6 -5  3  5 -5  4  0 -5 -5 -3 -4 -3  2
 4 D  -2  5 -1 -3 -4  2 -1 -4  2 -5 -3  5 -2 -2 -4 -2  0 -1  0 -3
 5 L  -4 -5 -6 -6 -4 -5 -6 -6 -4  4  4 -5  0  1 -5 -5 -3 -4 -3  3
 ...
78 N  -4 -3  8  4 -6 -3 -2 -3 -2 -6 -6 -3 -5 -6 -4 -1 -3 -7 -5 -6
79 T  -2 -3 -1 -3 -1 -3 -3 -4 -3 -4 -4 -1 -4 -4 -4  4  6 -5 -4 -2
80 L   0 -1 -5 -5 -4 -3 -4 -4 -3 -1  5 -3  3  0 -4 -3 -3 -3 -2 -1
81 T  -1 -3 -1 -1 -4 -2 -3 -2 -1 -4 -3 -1 -3 -4 -4  3  6 -5 -4 -3
16/50
17. Term Weighting Scheme – TF
Position Specific Score Matrix (2/2)
• The weight of XdZ in protein P:

    W(XdZ, P) = Σ_{1 ≤ i ≤ n−(d+1)} f(i, X) × f(i+d+1, Z)

  where f(i, Y) denotes the normalized value of the PSSM entry at the ith row
  and the column corresponding to amino acid type Y
• An example:
  W(M2D, P)
  = f(1,M) × f(4,D) + f(2,M) × f(5,D) + … + f(78,M) × f(81,D)
  = 0.99995×0.04743 + 0.11920×0.00247 + … + 0.00669×0.26894
17/50
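A sketch of the weighting scheme. It assumes the PSSM entries are normalized with a logistic function, f = 1/(1+e^(-s)); this is an assumption, but it is consistent with the worked example (f(1, M) = 1/(1+e^(-10)) = 0.99995 for the PSSM entry 10):

```python
import math

def sigmoid(x):
    # assumed normalization of raw PSSM scores into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def gapped_dipeptide_weight(pssm, aa_index, X, d, Z):
    # W(XdZ, P) = sum over i of f(i, X) * f(i+d+1, Z)
    # pssm: list of rows (one per sequence position), 20 scores per row
    # aa_index: maps an amino acid letter to its column in the PSSM
    n = len(pssm)
    total = 0.0
    for i in range(n - (d + 1)):  # 0-based equivalent of i = 1 .. n-(d+1)
        total += sigmoid(pssm[i][aa_index[X]]) * sigmoid(pssm[i + d + 1][aa_index[Z]])
    return total

aa = "ARNDCQEGHILKMFPSTWYV"
aa_index = {a: i for i, a in enumerate(aa)}
# toy 5-row PSSM: score 1 on the residue's own column, -2 elsewhere (assumed)
pssm = [[1 if a == r else -2 for a in aa] for r in "MPLDL"]
w = gapped_dipeptide_weight(pssm, aa_index, "M", 2, "D")
print(round(w, 4))
```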
20. Feature Reduction – Probabilistic
Latent Semantic Analysis (2/3)
• The joint probability between a term w and a document d is modeled through a
  latent variable z (with a "small" number of states, representing concepts):

    P(w, d) = P(d) Σ_{z∈Z} P(w|z) P(z|d)

  P(w|z): concept expression probabilities
  P(z|d): document-specific mixing proportions
• The parameters are estimated by maximizing the likelihood with the EM
  algorithm.
20/50
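The EM fit for the model above can be sketched compactly. This is a from-scratch illustration, not the implementation used in PSLDoc, and the toy counts are invented:

```python
import numpy as np

def plsa(counts, n_topics, n_iter=100, seed=0):
    """Fit P(w|z) and P(z|d) to a term-document count matrix by EM."""
    rng = np.random.default_rng(seed)
    n_terms, n_docs = counts.shape
    p_w_z = rng.random((n_terms, n_topics))
    p_w_z /= p_w_z.sum(axis=0)              # columns are P(w|z)
    p_z_d = rng.random((n_topics, n_docs))
    p_z_d /= p_z_d.sum(axis=0)              # columns are P(z|d)
    for _ in range(n_iter):
        # E-step: posterior over topics, P(z|w,d) ∝ P(w|z) P(z|d)
        joint = p_w_z[:, :, None] * p_z_d[None, :, :]      # shape (w, z, d)
        post = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate parameters from expected counts
        expected = counts[:, None, :] * post               # shape (w, z, d)
        p_w_z = expected.sum(axis=2)
        p_w_z /= p_w_z.sum(axis=0)
        p_z_d = expected.sum(axis=0)
        p_z_d /= p_z_d.sum(axis=0)
    return p_w_z, p_z_d

# toy term-document counts (values assumed for illustration)
counts = np.array([[4., 0., 1.],
                   [3., 1., 0.],
                   [0., 5., 4.],
                   [1., 4., 3.]])
p_w_z, p_z_d = plsa(counts, n_topics=2)
print(p_z_d.round(2))  # document-specific mixing proportions P(z|d)
```

The columns of `p_z_d` are the reduced topic-space representations of the documents, which is what PSLDoc feeds to the SVM classifiers.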
21. Feature Reduction – Probabilistic
Latent Semantic Analysis (3/3)
[Figure: PLSA feature reduction maps each vector from the term space
(Term 1 – Term 5) to the lower-dimensional topic space (Topic 1 – Topic 3)]
21/50
23. Classifier – Support Vector Machines
• Support Vector Machines (SVM)
  – LIBSVM software*
  – Five 1-v-rest SVM classifiers corresponding to the five localization sites:
      SVM_CP: CP vs. non-CP
      SVM_IM: IM vs. non-IM
      SVM_PP: PP vs. non-PP
      SVM_OM: OM vs. non-OM
      SVM_EC: EC vs. non-EC
  – Kernel: Radial Basis Function (RBF)
  – Parameter selection
    • c (cost) and γ (gamma) are optimized by
    • five-fold cross-validation
*Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001.
Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
23/50
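The 1-v-rest arrangement and the max-score decision can be sketched generically. The centroid scorer below is a toy stand-in for the RBF-kernel LIBSVM classifiers, and all names and data are illustrative:

```python
import math

class OneVsRest:
    """One binary scorer per localization site; the site whose scorer
    gives the highest score is the final prediction."""
    def __init__(self, make_scorer):
        self.make_scorer = make_scorer
        self.scorers = {}

    def fit(self, X, y):
        for site in set(y):
            pos = [x for x, lab in zip(X, y) if lab == site]   # site
            neg = [x for x, lab in zip(X, y) if lab != site]   # non-site
            self.scorers[site] = self.make_scorer(pos, neg)
        return self

    def predict(self, x):
        return max(self.scorers, key=lambda site: self.scorers[site](x))

def centroid_scorer(pos, neg):
    # toy binary scorer: distance to the negative centroid minus
    # distance to the positive centroid (larger = more site-like)
    def centroid(pts):
        return [sum(c) / len(pts) for c in zip(*pts)]
    cp, cn = centroid(pos), centroid(neg)
    return lambda x: math.dist(x, cn) - math.dist(x, cp)

X = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.8], [0.0, 5.0]]
y = ["CP", "CP", "IM", "IM", "PP"]
clf = OneVsRest(centroid_scorer).fit(X, y)
print(clf.predict([5.1, 5.1]))  # "IM"
```

With real SVMs, the per-site score would be the classifier's probability estimate, and the site with the highest probability is assigned, as the abstract describes.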
27. Data set (2/3)
• Eukaryotic proteins, 7579 proteins, 12 localization sites:
  Chloroplast 9%, Cytoplasmic 16%, Cytoskeleton 1%, ER 1%, Extracellular 11%,
  Golgi 1%, Lysosomal 1%, Mitochondrial 10%, Nuclear 25%, Peroxisomal 2%,
  Plasma membrane 22%, Vacuole 1%
Park KJ, Kanehisa M. Prediction of protein subcellular locations by support
vector machines using compositions of amino acids and amino acid pairs.
Bioinformatics 2003;19(13):1656-1663.
27/50
28. Data set (3/3)
• Human data set, 2197 proteins, 9 localization sites:
  ER 15%, Golgi 4%, Cytosol 16%, Nucleus 27%, Peroxisome 2%,
  Plasma membrane 9%, Lysosome 4%, Mitochondria 10%, Extracellular 13%
Scott MS, Thomas DY, Hallett MT. Predicting subcellular localization via
protein motif co-occurrence. Genome Res 2004;14(10A):1957-1966.
28/50
29. Evaluation
• Accuracy (Acc)

    Acc = Σ_{i=1}^{l} TP_i / Σ_{i=1}^{l} N_i

  – l = 5 is the number of total localization sites
  – N_i is the number of proteins in localization site i
• Matthews correlation coefficient (MCC)

    MCC_i = (TP_i × TN_i − FP_i × FN_i) /
            √((TP_i + FN_i)(TP_i + FP_i)(TN_i + FP_i)(TN_i + FN_i))

29/50
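The two measures can be sketched directly from the formulas above; the counts are toy values assumed for the example:

```python
import math

def accuracy(tp, n):
    # overall accuracy: correctly predicted proteins over all proteins,
    # summed across localization sites
    return sum(tp) / sum(n)

def mcc(tp, tn, fp, fn):
    # Matthews correlation coefficient for a single localization site
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

print(accuracy([90, 80], [100, 100]))      # 0.85
print(round(mcc(90, 85, 15, 10), 3))       # 0.751
```

MCC is preferred over per-site accuracy alone because it stays informative when the site sizes are imbalanced, as they are in these data sets.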
31. Simple Prediction Methods (2/2)
• 1NN_PSI-BLASTps, 1NN_PSI-BLASTnr
• 1NN_ClustalW
[Figure: the query protein is matched against the training database to find the
most similar protein, either via PSI-BLAST (with PSSMs built from the training
database or from the NCBI nr database) or via ClustalW]
31/50
32. The comparison of 1NN_TFPSSM and 1NN_TFIDF on the PSHigh783 and PSLow661
data sets

            PSHigh783                        PSLow661
            1NN_TFPSSM     1NN_TFIDF        1NN_TFPSSM     1NN_TFIDF
Loc. Sites  Acc.(%)  MCC   Acc.(%)  MCC     Acc.(%)  MCC   Acc.(%)  MCC
CP          94.20    0.96  71.01    0.74    83.25    0.77  41.15    0.36
IM          99.31    0.99  98.62    0.89    82.93    0.82  84.15    0.48
PP          95.86    0.94  86.21    0.89    74.05    0.63  38.17    0.46
OM          99.66    0.99  95.88    0.95    85.00    0.82  66.00    0.48
EC          96.99    0.96  92.48    0.91    57.89    0.51  28.07    0.26
Overall     97.96    -     91.83    -       79.43    -     53.86    -
32/50
42. Gapped-peptide signature
• The site-topic preference of topic z for a localization site l
  = average { P(z|d) : d (a protein) belongs to class l }
[Plots with Acc. = 89 and Acc. = 90]
42/50
44. Gapped-peptide signature
• For each localization site, the ten preferred topics are selected according
  to site-preference confidence (= the largest site-topic preference minus the
  second-largest site-topic preference).
• For each topic, the five most frequent gapped-dipeptides are selected.
44/50
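Computing site-topic preferences and the site-preference confidence defined on these slides can be sketched as follows; the P(z|d) values, protein names, and site labels are toy assumptions:

```python
def site_topic_preference(p_z_d, site_of):
    """Average P(z|d) over the proteins d belonging to each localization site.
    p_z_d: dict protein -> list of topic probabilities
    site_of: dict protein -> localization site"""
    sites = {}
    for d, probs in p_z_d.items():
        sites.setdefault(site_of[d], []).append(probs)
    return {site: [sum(col) / len(rows) for col in zip(*rows)]
            for site, rows in sites.items()}

def preference_confidence(pref_by_site, z):
    # largest site-topic preference minus the second largest, for topic z
    vals = sorted((pref[z] for pref in pref_by_site.values()), reverse=True)
    return vals[0] - vals[1]

p_z_d = {"p1": [0.8, 0.2], "p2": [0.6, 0.4], "p3": [0.1, 0.9]}
site_of = {"p1": "IM", "p2": "IM", "p3": "EC"}
pref = site_topic_preference(p_z_d, site_of)
print([round(v, 3) for v in pref["IM"]])         # [0.7, 0.3]
print(round(preference_confidence(pref, 0), 3))  # 0.6
```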
46. Gapped-dipeptide signatures reflecting
motifs relevant to protein localization sites
• In integral membrane proteins, helix-helix interactions are stabilized by
  aromatic residues. Specifically, the aromatic motif (WXXW, i.e. W2W in
  gapped-dipeptide notation) is involved in the dimerization of transmembrane
  domains through π-π interactions.
• In the outer membrane class, the C-terminal signature sequence is recognized
  by the assembly factor OMP85, which regulates the insertion and integration
  of OM proteins into the outer membrane of Gram-negative bacteria. The
  C-terminal signature sequence contains a Phe (F) at the C-terminal position,
  preceded by a strong preference for a basic amino acid (K, R) => R0F.
46/50
47. The amino acid compositions of single
residues and gapped-dipeptide signatures
for each localization site
[Figure: composition (%) per amino acid (A–W) for the five localization sites
(CP, IM, PP, OM, EC); panel (A) single residues, panel (B) gapped-dipeptide
signatures]
47/50
48. The grouped amino acid compositions of single
residues and gapped-dipeptide signatures
[Figure: proportion (%) of each amino acid group per localization site (CP, IM,
PP, OM, EC); panel (A) single residues, panel (B) gapped-dipeptide signatures]
Amino acid groups: N (non-polar: AIGLMV), P (polar: CNPQST), C (charged: DEHKR),
and A (aromatic: FYW)
48/50
49. Gapped-dipeptide signatures and their amino acid
compositions for each localization site
[Figure: proportion (%) of each amino acid group per localization site (CP, IM,
PP, OM, EC)]
Amino acid groups: N (non-polar: AIGLMV), P (polar: CNPQST), C (charged: DEHKR),
and A (aromatic: FYW)
49/50
50. Gapped-dipeptide signatures and their
amino acid compositions for each
localization site
• IM has a high percentage of non-polar amino acids (60%) and no charged (0%)
  amino acids.
  – This reflects the physico-chemical properties of the lipid bilayer:
    non-polar amino acids are favored in the transmembrane domains of IM
    proteins.
  – Charged amino acids are disfavored because of the energetic penalty they
    incur in the assembly of IM proteins.
• The CP and EC classes have a high percentage of charged and polar amino
  acids, respectively.
  – The role of charged amino acids in the cytoplasm is probably related to pH
    homeostasis, in which they act as buffers, whereas secreted proteins in
    the EC class may require more polar amino acids to promote interactions
    with the solvent environment.
50/50