1. Development of algorithms and software for
unravelling the biological role of low
complexity regions in protein sequences
PhD thesis by Ioannis Kirmitzoglou
Supervised by Vasilis Promponas
Bioinformatics Research Laboratory
Department of Biological Sciences
University of Cyprus
2. Outline
Introduction
Scientific
hypothesis &
aims
Effects of LCRs in
database
similarity
searches
Development of
novel software
for
searching,
displaying and
comparing LCRs
Predicting the
pathogenicity of
Escherichia
strains based on
local and global
amino acid
compositional
signatures
3. Protein sequences
>gi|157159493|ref|YP_001456811.1| isoleucyl-tRNA synthetase
MSDYKSTLNLPETGFPMRGDLAKREPGMLARWTDDDLYGIIRAAKKGKKTFILHDGPPYANGSIHIGHSVNK
ILKDIIVKSKGLSGYDSPYVPGWDCHGLPIELKVEQEYGKPGEKFTAAEFRAKCREYAATQVDGQRKDFIRL
GVLGDWSHPYLTMDFKTEANIIRALGKIIGNGHLHKGAKPVHWCVDCRSALAEAEVEYYDKTSLSIDVAFQA
VDQDALKAKFAVSNVNGPISLVIWTTTPWTLPANRAISIAPDFDYALVQIDGQAVILAKDLVESVMQRIGVT
DYTILGTVKGAELELLRFTHPFMGFDVPAILGDHVTLDAGTGAVHTAPGHGPDDYVIGQKYGLETANPVGPD
GTYLPGTYPTLDGVNVFKANDIVVALLQEKGALLHVEKMQHSYPCCWRHKTPIIFRATPQWFVSMDQKGLRA
QSLKEIKGVQWIPDWGQARIESMVANRPDWCISRQRTWGVPMSLFVHKDTEELHPRTLELMEEVAKRVEVDG
IQAWWDLDAKEILGDEADQYVKVPDTLDVWFDSGSTHSSVVDVRPEFAGHAADMYLEGSDQHRGWFMSSLMI
STAMKGKAPYRQVLTHGFTVDGQGRKMSKSIGNTVSPQDVMNKLGADILRLWVASTDYTGEMAVSDEILKRA
ADSYRRIRNTARFLLANLNGFDPAKDMVKPEEMVVLDRWAVGCAKAAQEDILKAYEAYDFHEVVQRLMRFCS
VEMGSFYLDIIKDRQYTAKADSVARRSCQTALYHIAEALVRWMAPILSFTADEVWGYLPGEREKYVFTGEWY
EGLFGLADSEAMNDAFWDELLKVRGEVNKVIEQARADKKVGGSLEAAVTLYAEPELSAKLTALGDELRFVLL
TSGATVADYNDAPADAQQSEVLKGLKVALSKAEGEKCPRCWHYTQDVGKVAEHAEICGRCVSNVAGDGEKRK
FA
8. Lack of experimental data
[Pie chart: five categories (1 to 5) with shares of 15%, 12%, 70%, 3% and 0%, in extracted order]
June 2013; http://www.expasy.org/sprot/relnotes/relstat.html
9. Searching...
Database
• Huge number of sequences
• Redundancy
• Exponential growth
Search method
• Fast
• Accurate
• Easy to use and program
Results
• Statistical significance
• Ranking
• Detailed info
10. Sequence comparison methods
Dot Plots (intuitive)
Dynamic Programming (e.g. SSEARCH; exact but slow)
Heuristics (e.g. BLAST, FASTA; fast but not exact)
prot A M---EEPQSDPSVEPPLSQETFS
: :: ::: :.: ::::::::
prot B MTAMEESQSDISLELPLSQETFS
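The dynamic programming approach listed on this slide can be illustrated with a minimal Smith-Waterman sketch that returns only the best local alignment score. The scoring values (match, mismatch, linear gap penalty) and the function name are assumptions chosen for illustration; tools such as SSEARCH use substitution matrices and affine gap penalties instead.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (score only, no traceback)."""
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score of aligning a[i-1] with b[j-1]
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

The full matrix costs time proportional to the product of the two sequence lengths, which is why heuristics are preferred for database-scale searches.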
13. BLAST scoring and statistics
$$E(S \ge s) = K m n \, e^{-\lambda s}$$
$$S_{ij} = \frac{\ln\left(\frac{q_{ij}}{p_i p_j}\right)}{\lambda}$$
λ: depends on the composition of the database
Karlin, S. and S. F. Altschul, 1990
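The two Karlin-Altschul relations on this slide can be evaluated numerically. This is a sketch only: the function names are hypothetical, and any values passed for K, λ, m and n are placeholders rather than parameters taken from BLAST.

```python
import math

def log_odds_score(q_ij, p_i, p_j, lam):
    """S_ij = ln(q_ij / (p_i * p_j)) / lambda.

    q_ij: target frequency of the aligned pair (i, j)
    p_i, p_j: background frequencies of residues i and j
    """
    return math.log(q_ij / (p_i * p_j)) / lam

def expected_hits(score, m, n, K, lam):
    """E(S >= s) = K * m * n * exp(-lambda * s).

    m, n: query length and database length
    """
    return K * m * n * math.exp(-lam * score)
```

Because λ enters both the scores and the exponent, a database whose composition differs from the assumed background (for example one rich in LCRs) shifts the expected number of chance hits.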
15. Low complexity regions (LCRs)
Any of the 20 AAs...
Single or grouped,
periodic or not...
May cover a big
portion of the
sequence...
>NP_702279.1 [Plasmodium falciparum 3D7]
MGSKKNSNTVDSSENVEEVVDNLTSEKNKESLKKDKRKKKEKKNNDVDDINEEEDEEGNDEDTMKKFSVD
TSENEDDKEDDDDDEDDDDDDDDDDDEDDDDEDDDDDDDDDDDDDDDDDEDDDDDDDDDEDDDDEDDEDD
DDDDEDDDDDFDDMDEDDDDDDDDDEDDDDEEDYDDDDDDDDDDDEDDDEDDDEDDDDENDQNEYAGDDK
KDEDGDAKKGSDDEGFD
>NP_702802.1 [Plasmodium falciparum 3D7]
MSRLFFFYFFIFPYPLFRSYVLTFLRTSFLSYFLSFLLSFLLSFLLSFLLSFLLSFLLSFLLSFLLSFLL
SFLLSFLLICFLNYLLSFSFSFSLFLSLSLFFFK
>NP_703640.1 [Plasmodium falciparum 3D7]
MKFFEKKKKKKKKKKEKKKKKKEKQNKTKVLFISSLFPFFFLFFVLSLLSYIFIIFFVSLSSFLYIDDII
LLRIIVLYTMTLYIYKYIHIYIYIYIYIYITYFLIIIFL
Wootton & Federhen, 1993
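One simple way to flag compositionally biased stretches such as those above is Shannon entropy over a sliding window. The sketch below is an illustration only and is not the SEG algorithm of Wootton & Federhen; the window length, the threshold and the function names are assumptions.

```python
import math
from collections import Counter

def shannon_entropy(segment):
    """Shannon entropy (bits) of the residue distribution in a segment."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def low_entropy_windows(seq, window=12, threshold=2.2):
    """0-based start positions of windows whose entropy is below the threshold."""
    return [i for i in range(len(seq) - window + 1)
            if shannon_entropy(seq[i:i + window]) < threshold]
```

A homopolymeric window has entropy 0, while a window of 12 distinct residues reaches log2(12) bits, so lower values indicate lower complexity.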
25. LCRs and Protein Structure
LCRs
Lack of 3D structure
high-order non-globular
structures when in complexes
greater flexibility
transcription factors
greater flexibility
protein to protein interactions
proteins can perform more than
one specific function
Corral et al., 1993; Suzuki et al., 1993; Gill et al., 1993; Wootton, 1994; Romero et al., 2001; Karlin et al., 2002; Coletta et al., 2010
26. LCRs and Evolution
LCRs evolve rapidly
New LCRs generate novel
material for functional
expansions
Expansions and contractions of
LCRs may constitute a large
source of phenotypic variation
Coletta et al., 2010
27. LCRs and Human Health
LCRs
Very common in the
malaria parasite
(Plasmodium falciparum)
polyQ repeats involved in
more than 10 neurological
diseases
Play important roles in the
proper function of health-
related proteins (e.g. p53)
Miller et al., 2000; Promponas et al., 2000; Li et al., 2007
28. Problems faced in LCR research
Problems
LCR detection and
manipulation are
everywhere
LCRs are missing
from experimentally
determined protein
structures
Various and
conflicting
definitions
Lack of proper tools
aimed at searching,
displaying and
comparing LCRs
29. Scientific Hypothesis & Aims
LCRs as a means to predict
the phenotype of an
organism
Tools for LCR
manipulation:
Search, Display, Compare
LCRs Review:
Algorithms, Tools &
Biological Relevance
30. Effects of LCRs and LCR-handling tools
on protein sequence comparisons
36. Self comparisons:
Affected proteins
E-value only:
many false hits
E-value + bit score
+ others:
most false hits
disappear
No LCRs?
(almost) no
problems
SM1 is the worst
LCR-handling
method tested
37. BBH: Run times
and file sizes
All species
• All modes except SM3
cause no noticeable
decrease in runtimes
• SM3 significantly
increases runtime
(fragmented db)
P. falciparum
• At least 80% decrease
in runtimes for all
masking modes
• CM3 causes a 50-fold
decrease in runtime
• Differences are more
pronounced when the
E-value threshold is 10
(data not shown)
38. Dataset: Database Search
http://scop.mrc-lmb.cam.ac.uk/scop/, http://astral.berkeley.edu/
Class
• Secondary structure content of the domain
Fold
• Similar arrangement of regular secondary structures
• Major structural similarity
Superfamily
• Structural and functional similarity to suggest evolutionary relationship
• Probable common evolutionary origin
Family
• Clear sequence similarity can be detected
• Clear evolutionary relationship
39. Database search:
Affected proteins
CAST masking is
better than SEG or
CBS
CBS are better than
SEG
Bare BLAST (NM) is
sometimes better
than SEG or CBS!!!
NM is more
sensitive
40. Database search:
Affected proteins
No LCRs return 20x
more TPs for the
same number of
FPs
SEG & NM struggle
when LCRs are
present
CM2 & NMCB2 are
the best methods
when LCRs are
present
REMEMBER:
few sequences
with LCRs in
ASTRAL
[Figure panels: No LCRs; With LCRs]
41. DS: Run times
and file sizes
Compositional
statistics increase
CPU time
Differences are
small, possibly because of
the low LCR
content of ASTRAL
42. Conclusions
Best bi-
directional hit
• Utilizing extra BLASTP features may
increase performance
Masking the
database with
CAST
• Generally increases sensitivity
• Decreases usage of computational
resources (especially CM3)
ASTRAL is
poor in LCRs
• Its usage as a benchmark tool does not
simulate a real-life scenario
• Different data-sets for benchmarking
sequence similarity tools are needed
Hits rich in
LCRs are not
always
spurious
• One should be extra cautious when
applying LCR manipulation methods
44. The problem
Most LCR-detection
algorithms are designed
to mask them
Lack of tools aimed at
searching, visualizing and
sharing LCRs
Major databases store
LCRs in an inconsistent
and incomplete manner
Only a few studies with
large-scale analysis of
LCRs
LCR-related research is
focused on the problems
caused by them
46. Main goals
Uniform way of
LCR
representation
Combine the most
commonly used
algorithms and
databases
The first (?)
universal resource
for LCR related
research
Aid researchers in
assessing the
biological relevance
of LCRs
http://repeat.biol.ucy.ac.cy/mgb2/gbrowse/
48. Features
Search
• Protein properties
• LCRs by residue type(s)
• Within UniProt results
Browse or
download results
Database schema
optimized for fast
search and
retrieval of data
Deep links to
UniProt, PDB & GO
49. Features
Local BLAST search
against masked
databases
Initiate NCBI’s BLASTP
with normal or inverted
masking with optimal
pre-configured settings
for LCR detection
53. Performance of old and new CAST versions
[Figure: execution time (secs) and millions of masked AAs (# Xs) for CAST v1.0 (gcc), CAST v2.1 (gcc), CAST v2.1s (gcc) and CAST v2.1s (icc) on the datasets P.f., P.v., P.c., P.a. 1, P.a. 2, C.t. and C.p.]
54. Porting CAST v2.1 to other hardware platforms
FPGAs
GP-GPUs
Multi-core CPUs
55. Predicting the pathogenicity of Escherichia strains based
on local and global amino acid compositional signatures
56. Motivation
Kreil & Ouzounis (2001)
identified thermophilic
species by their global
amino acid composition
Promponas (2009)
identified pathogenic
Escherichia strains using
global and local amino acid
compositional signatures
57. Data retrieval
31 full genomes
from GenBank
Original genome
publications
Complementary
publications &
data
28 Escherichia strains with
verified pathogenicity status
NCBI Bacterial Genomes Division [ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/]; GOLD [http://genomesonline.org/];
NCBI Bacterial Genome Project Division [ftp://ftp.ncbi.nlm.nih.gov/genomes/bioproject/]
31 genomes
Training set: 22 verified
Validation set: 6 verified
Prediction: 3 with
unverified pathogenicity
58. Signatures Generation
& Training Set Creation
Kreil & Ouzounis, 2003; Promponas, 2009
CAST with
default settings
and –stat
GC: Global AA
composition
LB: LCRs
composition
XB: Masked
residues
composition
x 28
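The global composition (GC) signature can be pictured as a 20-dimensional frequency vector computed over all proteins of a strain. This is a minimal sketch under stated assumptions: the function name is hypothetical, non-standard residue symbols are simply ignored, and it does not reproduce the output of CAST's statistics option.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, fixed order

def global_composition(sequences):
    """Relative frequency of each standard amino acid over all sequences."""
    counts = Counter()
    for seq in sequences:
        # Assumption: anything outside the 20 standard letters (e.g. X) is skipped
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]
```

The same function applied only to the residues inside detected LCRs, or only to masked residues, would give vectors analogous to the LB and XB signatures.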
60. Validation sets
generation
6 verified genomes
Randomly sample
sequences from each
strain
9 gene count classes:
• 100, 500, 1000, 1500, 2000,
2500, 3000, 3500, 4000
• 2-98% genome coverage
5400 sub-samples: 100
for each genome &
gene count class
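The sub-sampling scheme on this slide (6 genomes × 9 gene count classes × 100 replicates = 5400 sub-samples) can be sketched as follows. Sampling without replacement, the fixed seed, the skipping of classes larger than a genome and all names are assumptions made for illustration.

```python
import random

GENE_COUNT_CLASSES = [100, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000]

def make_subsamples(genomes, classes=GENE_COUNT_CLASSES, replicates=100, seed=0):
    """genomes: dict mapping strain name -> list of protein sequences.

    Returns a list of (strain, gene_count, sampled_proteins) tuples.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    subsamples = []
    for name, proteins in genomes.items():
        for size in classes:
            if size > len(proteins):
                continue  # assumption: skip classes larger than the genome
            for _ in range(replicates):
                # Draw `size` proteins without replacement
                subsamples.append((name, size, rng.sample(proteins, size)))
    return subsamples
```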
61. Validation Results
Global composition
signatures work
better
Accuracy over 80%
using less than 20%
of proteins
Similar results using
“glocal” vectors
Preliminary results
indicate that local
signatures at lower
CAST thresholds are
enhanced
[Figure: accuracy, sensitivity and specificity (0.3 to 1) of the GC and LB signatures versus number of genes (100 to 4000)]
Species                     GenBank Accession   Prediction
E. coli ATCC 8739           NC_010468           Non-pathogenic
E. coli SMS-3-5             NC_010498           Pathogenic
E. fergusonii ATCC 35469    NC_011740           Pathogenic
62. Exploration for the identification of proteomic subsets
responsible for the predictive power of the final models
63. Chimera generation
Genes clustered using
blastclust
4 pools of genes
• P or NP only
• Core genes
• Mixed genes
4 chimera types
• P only or NP only
• Equal P/NP
• All
14 gene count classes
• 100, 500, 1000,
2000, 2500, 3000,
3500 … 10000
Genomes
Gene Pools
Chimeras
64. Results
P & NP only sequences
can characterize a
genome
Prediction rate with
4000 genes similar to
external validation
results
Compositional signal of
P only class stronger
than NP only
LB similar to GC on P
only. Perhaps stronger
LCR signature?
65. Conclusions
Compositional features can be used to accurately perform
large-scale phenotype prediction
kNN is much faster than the approach of Sims et al. (2011)
Compositional features enable a more compressed
representation of complete genomes, which further increases
accuracy & performance
Robust even when incomplete data are available
Gene classes associated with pathogenicity seem to
generate enough signal to control the pathogenicity of a
genome according to the predictive model
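The kNN classification mentioned in these conclusions can be sketched on composition vectors as below. The Euclidean distance, the value of k, the majority vote and the function name are assumptions; the slides do not state the settings actually used.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (vector, label) pairs; query: vector of the same length.

    Returns the majority label among the k training vectors nearest to the
    query by Euclidean distance.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

With 20-dimensional composition vectors and a few dozen training genomes, each prediction only needs one distance computation per training genome, which is in line with the speed claim above.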
66. Synopsis
We evaluated the
performance of the most
popular LCR-handling
algorithms
We evaluated, and
questioned, the most
commonly employed
data-set for assessing the
performance of LCR-handling
algorithms
We described the best
ways to handle LCRs
We built tools to search,
display and compare LCRs
We used LCRs and other
compositional measures
to accurately predict the
pathogenicity of
Escherichia strains
We demonstrated that
our methods also work
when using only parts of
the available information
LCRs
67. Future work
Publish at least 3 papers
(1 submitted, 2 are in preparation)
Predict the pathogenicity of the 77 incomplete
genomes that we collected from GenBank
Expand the LCR-eXXXplorer
Apply for a Post-Doc position
Party
68. Acknowledgments
Dr. I. Iliopoulos
Dr L. Kostrikis
Dr. P. Skourides
Dr. C.A. Ouzounis
Dr. V. Promponas
Cyprus Research
Promotion Foundation
ΠΕΝΕΚ/ENIΣΧ/0308/77
Athina Theodosiou
Stella Tamana
Ioanna Kalvari
Maria Xenophontos
Marilena Aplikioti
My family