Development of algorithms and software for unravelling the biological role of low complexity regions in protein sequences
PhD thesis by Ioannis Kirmitzoglou
Supervised by Vasilis Promponas
Bioinformatics Research Laboratory
Department of Biological Sciences
University of Cyprus
Outline
Introduction
Scientific hypothesis & aims
Effects of LCRs in database similarity searches
Development of novel software for searching, displaying and comparing LCRs
Predicting the pathogenicity of Escherichia strains based on local and global amino acid compositional signatures
Protein sequences
>gi|157159493|ref|YP_001456811.1| isoleucyl-tRNA synthetase
MSDYKSTLNLPETGFPMRGDLAKREPGMLARWTDDDLYGIIRAAKKGKKTFILHDGPPYANGSIHIGHSVNK
ILKDIIVKSKGLSGYDSPYVPGWDCHGLPIELKVEQEYGKPGEKFTAAEFRAKCREYAATQVDGQRKDFIRL
GVLGDWSHPYLTMDFKTEANIIRALGKIIGNGHLHKGAKPVHWCVDCRSALAEAEVEYYDKTSLSIDVAFQA
VDQDALKAKFAVSNVNGPISLVIWTTTPWTLPANRAISIAPDFDYALVQIDGQAVILAKDLVESVMQRIGVT
DYTILGTVKGAELELLRFTHPFMGFDVPAILGDHVTLDAGTGAVHTAPGHGPDDYVIGQKYGLETANPVGPD
GTYLPGTYPTLDGVNVFKANDIVVALLQEKGALLHVEKMQHSYPCCWRHKTPIIFRATPQWFVSMDQKGLRA
QSLKEIKGVQWIPDWGQARIESMVANRPDWCISRQRTWGVPMSLFVHKDTEELHPRTLELMEEVAKRVEVDG
IQAWWDLDAKEILGDEADQYVKVPDTLDVWFDSGSTHSSVVDVRPEFAGHAADMYLEGSDQHRGWFMSSLMI
STAMKGKAPYRQVLTHGFTVDGQGRKMSKSIGNTVSPQDVMNKLGADILRLWVASTDYTGEMAVSDEILKRA
ADSYRRIRNTARFLLANLNGFDPAKDMVKPEEMVVLDRWAVGCAKAAQEDILKAYEAYDFHEVVQRLMRFCS
VEMGSFYLDIIKDRQYTAKADSVARRSCQTALYHIAEALVRWMAPILSFTADEVWGYLPGEREKYVFTGEWY
EGLFGLADSEAMNDAFWDELLKVRGEVNKVIEQARADKKVGGSLEAAVTLYAEPELSAKLTALGDELRFVLL
TSGATVADYNDAPADAQQSEVLKGLKVALSKAEGEKCPRCWHYTQDVGKVAEHAEICGRCVSNVAGDGEKRK
FA
Protein Sequence Resources
NCBI nr Swiss-Prot
TrEMBL PDB
Environmental
samples
Organism
Proteomes
Most common databases
• NCBI nr: 20,703,722 Seqs; 7,197,824,772 AAs
• UniProtKB TrEMBL: 35,502,518 Seqs; 11,384,440,438 AAs
• UniProtKB SwissProt: 540,261 Seqs; 191,876,607 AAs
• Protein Data Bank (PDB): 91,359 Seqs; 60,847,644 AAs
June 2013; ftp://ftp.ncbi.nih.gov/blast/db/, http://www.ebi.ac.uk/uniprot/TrEMBLstats/,
http://www.expasy.org/sprot/relnotes/relstat.html, http://www.pdb.org
(Most) Databases grow exponentially
[Figure: Number of searchable structures per year in PDB (y-axis in thousands, 0 to 100)]
[Figure: Number of entries in NCBI/Proteins (y-axis in millions, 0 to 90)]
June 2013; http://www.ncbi.nlm.nih.gov/protein, http://www.ebi.ac.uk/uniprot/TrEMBLstats/,
http://www.expasy.org/sprot/relnotes/relstat.html, http://www.pdb.org
Similarity
Structure
Function
Homology
Evolution
Sequence similarity search
Lack of experimental data
[Pie chart: categories 1 to 5 account for 15%, 12%, 70%, 3% and 0% respectively]
June 2013; http://www.expasy.org/sprot/relnotes/relstat.html
Searching...
Database
• Huge number of sequences
• Redundancy
• Exponential growth
Search method
• Fast
• Accurate
• Easy to use and program
Results
• Statistical significance
• Ranking
• Detailed info
Sequence comparison methods
• Dot Plots (intuitive)
• Dynamic Programming (e.g. SSEARCH; exact but slow)
• Heuristics (e.g. BLAST, FASTA; fast but not exact)
prot A  M---EEPQSDPSVEPPLSQETFS
        :   :: ::: :.: ::::::::
prot B  MTAMEESQSDISLELPLSQETFS
BLAST (Altschul et al., 1990)
BLAST output
BLAST scoring and statistics
$E(S \ge s) = K m n e^{-\lambda s}$
$S_{ij} = \frac{\ln\left(q_{ij} / (p_i p_j)\right)}{\lambda}$
λ: depends on the composition of the database
Karlin, S. and S. F. Altschul, 1990
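The two formulas on this slide, the expected number of chance hits E = K·m·n·e^(-λs) and the log-odds score S_ij = ln(q_ij / (p_i p_j)) / λ, can be evaluated directly. The sketch below is only an illustration: it assumes m and n are the query and database lengths in residues, and the function names and all numeric values are made up for the example, not taken from the thesis.

```python
import math


def expected_hits(score, m, n, K, lam):
    """Karlin-Altschul expectation: number of chance hits scoring >= score."""
    return K * m * n * math.exp(-lam * score)


def log_odds_score(q_ij, p_i, p_j, lam):
    """Substitution score from target frequency q_ij and background p_i, p_j."""
    return math.log(q_ij / (p_i * p_j)) / lam


# Illustrative values only (K and lam are assumptions for this example).
# n is one of the residue counts quoted on the databases slide.
e_value = expected_hits(score=60, m=300, n=7_197_824_772, K=0.041, lam=0.267)
score = log_odds_score(q_ij=0.04, p_i=0.1, p_j=0.1, lam=math.log(2))
```

Because λ sits in the exponent, a database whose composition shifts λ changes E for every hit, which is why compositionally biased sequences can distort the reported significance.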
Background frequencies
[Bar chart: background frequencies of the 20 amino acids (y-axis 0 to 12)]
http://www.ncbi.nlm.nih.gov/protein
Low complexity regions (LCRs)
Any of the 20 AAs...
Single or grouped, periodic or not...
May cover a big portion of the sequence...
>NP_702279.1 [Plasmodium falciparum 3D7]
MGSKKNSNTVDSSENVEEVVDNLTSEKNKESLKKDKRKKKEKKNNDVDDINEEEDEEGNDEDTMKKFSVD
TSENEDDKEDDDDDEDDDDDDDDDDDEDDDDEDDDDDDDDDDDDDDDDDEDDDDDDDDDEDDDDEDDEDD
DDDDEDDDDDFDDMDEDDDDDDDDDEDDDDEEDYDDDDDDDDDDDEDDDEDDDEDDDDENDQNEYAGDDK
KDEDGDAKKGSDDEGFD
>NP_702802.1 [Plasmodium falciparum 3D7]
MSRLFFFYFFIFPYPLFRSYVLTFLRTSFLSYFLSFLLSFLLSFLLSFLLSFLLSFLLSFLLSFLLSFLL
SFLLSFLLICFLNYLLSFSFSFSLFLSLSLFFFK
>NP_703640.1 [Plasmodium falciparum 3D7]
MKFFEKKKKKKKKKKEKKKKKKEKQNKTKVLFISSLFPFFFLFFVLSLLSYIFIIFFVSLSSFLYIDDII
LLRIIVLYTMTLYIYKYIHIYIYIYIYIYITYFLIIIFL
Wootton & Federhen, 1993
Compositional extremes
>NP_702279.1 [Plasmodium falciparum 3D7]
MGSKKNSNTVDSSENVEEVVDNLTSEKNKESLKKDKRKKKEKKNNDVDDINEEEDEEGNDEDTMKKFSVD
TSENEDDKEDDDDDEDDDDDDDDDDDEDDDDEDDDDDDDDDDDDDDDDDEDDDDDDDDDEDDDDEDDEDD
DDDDEDDDDDFDDMDEDDDDDDDDDEDDDDEEDYDDDDDDDDDDDEDDDEDDDEDDDDENDQNEYAGDDK
KDEDGDAKKGSDDEGFD
[Bar chart: composition of the sequence above across the 20 amino acids (y-axis 0 to 60)]
LCRs affect homology search
BLASTP results for kinesin k39 (Leishmania major)
Workarounds
masking
composition-based score adjustments
>NP_702279.1 [Plasmodium falciparum 3D7]
MGSKKNSNTVDSSENVEEVVDNLTSEKNKESLKKDKRKKKEKKNNDVDDINEEEDEEGND
EDTMKKFSVDTSENEDDKEDDDDDEDDDDDDDDDDDEDDDDEDDDDDDDDDDDDDDDDDE
DDDDDDDDDEDDDDEDDEDDDDDDEDDDDDFDDMDEDDDDDDDDDEDDDDEEDYDDDDDD
DDDDDEDDDEDDDEDDDDENDQNEYAGDDKKDEDGDAKKGSDDEGFD
>NP_702802.1 [Plasmodium falciparum 3D7]
MSRLFFFYFFIFPYPLFRSYVLTFLRTSFLSYFLSFLLSFLLSFLLSFLLSFLLSFLLSF
LLSFLLSFLLSFLLSFLLICFLNYLLSFSFSFSLFLSLSLFFFK
>NP_703640.1 [Plasmodium falciparum 3D7]
MKFFEKKKKKKKKKKEKKKKKKEKQNKTKVLFISSLFPFFFLFFVLSLLSYIFIIFFVSL
SSFLYIDDIILLRIIVLYTMTLYIYKYIHIYIYIYIYIYITYFLIIIFL
SEG (Wootton and Federhen, 1993)
>NP_702279.1 [Plasmodium falciparum 3D7]
MGSKKxxxxxxxxxxxxxxxxxLTSxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxTMKKFSVDTSENxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxNDQNEYxxxxxxxxxxxxxxxxxxxxFD
>NP_702802.1 [Plasmodium falciparum 3D7]
MSRxxxxxxxxxxxxxxRSYVxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxK
>NP_703640.1 [Plasmodium falciparum 3D7]
MxxxxxxxxxxxxxxxxxxxxxxxQNKTKVxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxTMTLxxxxxxxxxxxxxxxxxxxxxxxxxFL
SEG (Wootton and Federhen, 1993)
>NP_702279.1 [Plasmodium falciparum 3D7]
MGSKKNSNTVxSSENVEEVVxNLTSExNxESLxxxxRxxxExxNNxVxxINEEExEEGNx
ExTMxxFSVxTSENExxKExxxxxExxxxxxxxxxxExxxxExxxxxxxxxxxxxxxxxE
xxxxxxxxxExxxxExxExxxxxxExxxxxFxxMxExxxxxxxxxExxxxEExYxxxxxx
xxxxxExxxExxxExxxxENxQNEYAGxxKKxExGxAKKGSxxEGFx
>NP_702802.1 [Plasmodium falciparum 3D7]
MSRLxxxYxxIxPYPLxRSYVLTxLRTSxLSYxxSxxxSxxxSxxxSxxxSxxxSxxxSx
xxSxxxSxxxSxxxSxxxICxxNYxxSxSxSxSLxLSLSLxxxK
>NP_703640.1 [Plasmodium falciparum 3D7]
MKFFExxxxxxxxxxExxxxxxExQNxTxVLFxSSLFPFFFLFFVLSLLSxxFxxFFVSL
SSFLxxDDxxLLRxxVLxTMTLxxxKxxHxxxxxxxxxxxTxFLxxxFL
CAST (Promponas et al., 2000)
SEG vs. CAST
>No_masking
MSRLFFFYFFIFPYPLFRSYVLTFLRTSFLSYFLSFLLSFLLSFLLSFLLSFLLSFLLSF
LLSFLLSFLLSFLLSFLLICFLNYLLSFSFSFSLFLSLSLFFFK
>SEG
MSRxxxxxxxxxxxxxxRSYVxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxK
>CAST
MSRLxxxYxxIxPYPLxRSYVLTxLRTSxLSYxxSxxxSxxxSxxxSxxxSxxxSxxxSx
xxSxxxSxxxSxxxSxxxICxxNYxxSxSxSxSLxLSLSLxxxK
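The masked sequences above replace low-complexity residues with x. As a rough illustration of window-based masking, the sketch below masks every window whose Shannon entropy falls at or below a cutoff. It is neither SEG nor CAST: the entropy criterion, the function names and the default window and cutoff are assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def mask_low_entropy(seq, window=12, cutoff=2.2, mask_char="x"):
    """Mask whole windows whose entropy is at or below the cutoff."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) <= cutoff:
            masked[start:start + window] = mask_char * window
    return "".join(masked)
```

Like the SEG output shown on the slides, this masks whole stretches regardless of residue type, whereas the CAST output masks only the residue type responsible for the bias.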
Composition-based score adjustments
Are LCRs the equivalent of “junk DNA” for protein sequences?
Haerty & Golding, 2010
LCRs and Protein Structure
LCRs
Lack of 3D structure
high-order non-globular
structures when in complexes
greater flexibility
transcription factors
greater flexibility
protein to protein interactions
proteins can perform more than
one specific function
Corral et al., 1993; Suzuki et al., 1993; Gill et al., 1993; Wootton, 1994; Romero et al., 2001; Karlin et al., 2002; Coletta et al., 2010
LCRs and Evolution
LCRs Evolve rapidly
New LCRs generate novel
material for functional
expansions
Expansion and contractions of
LCRs may constitute a large
source of phenotypic variation
Coletta et al., 2010
LCRs and Human Health
LCRs
Very common in the
malaria parasite
(Plasmodium falciparum)
polyQ repeats involved in
more than 10 neurological
diseases
Play important roles in the
proper function of health-
related proteins (e.g. p53)
Miller et al., 2000; Promponas et al., 2000; Li et al., 2007
Problems faced in LCR research
Problems
LCR detection and manipulation are everywhere
LCRs are missing from experimentally determined protein structures
Various and conflicting definitions
Lack of proper tools aimed at searching, displaying and comparing LCRs
Scientific Hypothesis & Aims
LCRs as a means to predict
the phenotype of an
organism
Tools for LCRs
manipulation:
Search, Display, Compare
LCRs Review:
Algorithms, Tools &
Biological Relevance
Effects of LCRs and LCR-handling tools
on protein sequence comparisons
Common methods
Database search Best bi-directional hit
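A best bi-directional hit (BBH) pairs two proteins when each is the other's top hit in the opposite proteome. The sketch below assumes the best hit per query has already been extracted from the search results into two dictionaries; the function and variable names are hypothetical.

```python
def best_bidirectional_hits(best_a_to_b, best_b_to_a):
    """Return sorted pairs (a, b) where b is a's best hit and a is b's best hit."""
    return sorted(
        (a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a
    )


# Toy example: a2 and a3 both point to b2, but b2 points back only to a3.
pairs = best_bidirectional_hits(
    {"a1": "b1", "a2": "b2", "a3": "b2"},
    {"b1": "a1", "b2": "a3"},
)
```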
LCRs cause problems
Ideal scenario Real life
Workarounds
masking
composition-based score adjustments
LCR-handling schemes
All possible combinations
Best performing
Most commonly employed
NM: No masking
NMCB2: Composition-based statistics mode 2
SM1: SEG-masked query
SM2: SEG-masked database
SM2CB2: SEG-masked database & CBS2
SM3: SEG-masked query and database (two-way masking)
CM1: CAST-masked query
CM2: CAST-masked database
CM3: CAST-masked query & database
Datasets: Self comparisons
Low complexity content
Determining best hit
Self comparisons:
Affected proteins
E-value only:
many false hits
E-value + bit score
+ others:
most false hits
disappear
No LCRs?
(almost) no
problems
SM1 is the worst
LCR-handling
method tested
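The point about ranking by more than the E-value can be illustrated with a tie-breaking sort key. In the sketch below, hits with equal E-values (for example several hits reported as 0.0) are separated by bit score and then by percent identity. Using identity for "others" is an assumption, as the slide does not say which extra criteria were used, and the `Hit` record is hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    subject: str
    evalue: float
    bitscore: float
    identity: float  # percent identity; hypothetical stand-in for "others"


def best_hit(hits):
    """Lowest E-value wins; ties fall back to bit score, then identity."""
    if not hits:
        return None
    return min(hits, key=lambda h: (h.evalue, -h.bitscore, -h.identity))
```

In a self comparison, a sort on E-value alone could return either of two hits tied at 0.0, while the extended key prefers the one with the higher bit score.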
BBH: Run times
and file sizes
All species
• All modes except SM3 cause no noticeable decrease in runtime
• SM3 significantly increases runtime (fragmented db)
P. falciparum
• At least 80% decrease in runtime for all masking modes
• CM3 causes a 50-fold decrease in runtime
• Differences are more pronounced when the E-value threshold is 10 (data not shown)
Dataset: Database Search
http://scop.mrc-lmb.cam.ac.uk/scop/, http://astral.berkeley.edu/
Class
• Secondary structure content of the domain
Fold
• Similar arrangement of regular secondary structures
• Major structural similarity
Superfamily
• Structural and functional similarity suggesting an evolutionary relationship
• Probable common evolutionary origin
Family
• Clear sequence similarity can be detected
• Clear evolutionary relationship
Database search:
Affected proteins
CAST masking is
better than SEG or
CBS
CBS are better than
SEG
Bare BLAST (NM) is
sometimes better
than SEG or CBS!!!
NM is more
sensitive
Database search:
Affected proteins
No LCRs return 20x
more TPs for the
same number of
FPs
SEG & NM struggle
when LCRs are
present
CM2 & NMCB2 are
the best methods
when LCRs are
present
REMEMBER:
few sequences
with LCRs in
ASTRAL
No LCRs With LCRs
DS: Run times
and file sizes
Compositional
statistics increase
CPU time
Differences are small, possibly because of the low LCR content of ASTRAL
Conclusions
Best bi-
directional hit
• Utilizing extra BLASTP features may
increase performance
Masking the
database with
CAST
• Generally increases sensitivity
• Decreases usage of computational
resources (especially CM3)
ASTRAL is
poor in LCRs
• Its use as a benchmark tool does not simulate a real-life scenario
• Different data-sets for benchmarking
sequence similarity tools are needed
Hits rich in
LCRs are not
always
spurious
• One should be extra cautious when
applying LCR manipulation methods
Development of novel software for
searching, displaying and comparing LCRs
The problem
Most LCR-detection
algorithms are designed
to mask them
Lack of tools aimed at searching, visualizing and sharing LCRs
Major databases store
LCRs in an inconsistent
and incomplete manner
Only a few studies with
large-scale analysis of
LCRs
LCR-related research is
focused on the problems
caused by them
Prior art
LPS-annotate: http://cedra.biol.mcgill.ca/LPS/lps-annotate.html (Harbi et al., 2011)
HRaP: http://bioinfo.protres.ru/hrap/ (Lobanov et al., 2014)
Main goals
Uniform way of
LCRs
representation
Combine the most
commonly used
algorithms and
databases
The first (?)
universal resource
for LCR related
research
Aid researchers in assessing the biological relevance of LCRs
http://repeat.biol.ucy.ac.cy/mgb2/gbrowse/
Features
Based on
UniProt/SwissProt
data
CAST & SEG
(expandable)
Ease of use
Allow settings
saving and
sharing
Mix and Match
masks
Features
Search
• Protein properties
• LCRs by residue type(s)
• Within UniProt results
Browse or
download results
Database schema
optimized for fast
search and
retrieval of data
Deep links to
UniProt, PDB & GO
Features
Local BLAST search
against masked
databases
Initiate NCBI’s BLASTP
with normal or inverted
masking with optimal
pre-configured settings
for LCR detection
A new generation of the CAST algorithm
MxxxQSDxSVxxxLSQxTFS
AAAAAAAAAAAAAAAAAAAA
MEEPQSDPSVEPPLSQETFS
EEEEEEEEEEEEEEEEEEEE
MEEPQSDPSVEPPLSQETFS
FFFFFFFFFFFFFFFFFFFF
MxxPQSDPSVxPPLSQxTFS
PPPPPPPPPPPPPPPPPPPP
MxxPQSDPSVxPPLSQxTFS
QQQQQQQQQQQQQQQQQQQQ
MxxxQSDxSVxxxLSQxTFS
CAST (Promponas et al., 2000)
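The slide shows the query compared against homopolymers (AAAA…, EEEE…, and so on), with only residues of the detected type masked at each round. The sketch below is a toy version of that idea, not the CAST algorithm: it swaps in +1/-1 match/mismatch scoring and a maximum-scoring-segment search, and the threshold, scoring and function names are all assumptions.

```python
def best_segment(seq, residue, match=1, mismatch=-1):
    """Highest-scoring ungapped segment against a homopolymer of `residue`.

    Returns (score, start, end) with `end` exclusive.
    """
    best = (0, 0, 0)
    run_score, run_start = 0, 0
    for i, aa in enumerate(seq):
        if run_score <= 0:
            run_score, run_start = 0, i  # restart the running segment here
        run_score += match if aa == residue else mismatch
        if run_score > best[0]:
            best = (run_score, run_start, i + 1)
    return best


def toy_cast_mask(seq, threshold=5, mask_char="x"):
    """Iteratively mask the residue type with the strongest local bias."""
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    seq = list(seq)
    while True:
        candidates = [(best_segment(seq, r), r) for r in set(seq) - {mask_char}]
        if not candidates:
            break
        (score, start, end), residue = max(candidates)
        if score < threshold:
            break
        for i in range(start, end):
            if seq[i] == residue:  # mask only the biased residue type
                seq[i] = mask_char
    return "".join(seq)
```

Residues of other types inside a biased segment are left untouched, which mirrors the selective masking visible in the CAST-masked examples earlier in the deck.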
Performance of old and new CAST versions
[Figure: execution time (secs) and speed-up factor compared to CAST v1 for CAST v1.0 (gcc), CAST v2.1 (gcc), CAST v2.1s (gcc), CAST v1.0 (icc), CAST v2.1 (icc), CAST v2.1s (icc) and SEG; bar labels: 154.4s, 85.2s, 55.5s, 11.4s, 6.5s, 4.9s, 1.9s]
Performance of old and new CAST versions
[Figure: execution time (secs) and millions of masked AAs (# Xs) for datasets P.f., P.v., P.c., P.a. 1, P.a. 2, C.t. and C.p.; series: CAST v1.0 (gcc), CAST v2.1 (gcc), CAST v2.1s (gcc), CAST v2.1s (icc)]
Porting CAST v2.1 on other hardware platforms
FPGAs
GP-GPUs
Multi-core CPUs
Predicting the pathogenicity of Escherichia strains based
on local and global amino acid compositional signatures
Motivation
Kreil & Ouzounis (2001)
identified thermophilic
species by their global
amino acid composition
Promponas (2009)
identified pathogenic
Escherichia strains using
global and local amino acid
compositional signatures
Data retrieval
31 full genomes
from GenBank
Original genome
publications
Complementary
publications &
data
28 Escherichia strains with
verified pathogenicity status
NCBI Bacterial Genomes Division [ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/]; GOLD [http://genomesonline.org/];
NCBI Bacterial Genome Project Division [ftp://ftp.ncbi.nlm.nih.gov/genomes/bioproject/]
31 genomes
Training set: 22 verified
Validation set: 6 verified
Prediction: 3 with
unverified pathogenicity
Signatures Generation
& Training Set Creation
Kreil & Ouzounis, 2003; Promponas, 2009
CAST with
default settings
and –stat
GC: Global AA
composition
LB: LCRs
composition
XB: Masked residues composition
x 28
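A compositional signature of this kind is a 20-dimensional vector of amino acid frequencies. The sketch below computes analogues of GC (global composition) and XB (composition of the masked residues) from a sequence and its masked copy. How LB differs from XB is not spelled out on the slide, so it is not attempted, and the function names and the use of x as the mask character are assumptions.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(residues):
    """Relative frequency of each of the 20 amino acids, in AMINO_ACIDS order."""
    counts = Counter(r for r in residues if r in AMINO_ACIDS)
    total = sum(counts.values())
    return [counts[a] / total if total else 0.0 for a in AMINO_ACIDS]


def signatures(original, masked, mask_char="x"):
    """Return (global composition, composition of the masked residues)."""
    masked_residues = [o for o, m in zip(original, masked) if m == mask_char]
    return composition(original), composition(masked_residues)
```

For a whole proteome the same functions could be applied to the concatenation of all sequences, giving one vector per strain.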
k-nearest neighbour
cross-validation and
optimal k selection
LOOCV for
1 ≤ k ≤ 11
Optimal k
selection
Kuhn, 2008
Pathogenic
Non Pathogenic
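Leave-one-out cross-validation (LOOCV) for choosing k can be sketched in plain Python: each strain is held out in turn, classified by majority vote among its k nearest neighbours, and the k with the highest accuracy is kept. This is an independent illustration rather than the software used in the thesis; Euclidean distance, the tie-breaking behaviour and all names are assumptions.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Majority label among the k training vectors nearest to `query`."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def loocv_accuracy(samples, k):
    """Fraction of samples classified correctly when each is held out in turn."""
    correct = 0
    for i, (vector, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        correct += knn_predict(rest, vector, k) == label
    return correct / len(samples)


def select_k(samples, ks=range(1, 12)):
    """Return the k in 1..11 with the best LOOCV accuracy (smallest k on ties)."""
    return max(ks, key=lambda k: loocv_accuracy(samples, k))
```

Here `samples` is a list of `(vector, label)` pairs, for example composition vectors labelled "P" or "NP".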
Validation sets
generation
6 verified genomes
Randomly sample
sequences from each
strain
9 gene count classes:
• 100, 500, 1000, 1500, 2000,
2500, 3000, 3500, 4000
• 2-98% genome coverage
5400 sub-samples: 100
for each genome &
gene count class
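The sub-sampling scheme (6 genomes × 9 gene count classes × 100 replicates = 5400 sub-samples) can be sketched as follows. Sampling without replacement and skipping classes larger than a proteome are assumptions, as are the function and parameter names.

```python
import random

GENE_COUNTS = (100, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000)


def subsample_proteomes(proteomes, gene_counts=GENE_COUNTS, replicates=100, seed=0):
    """Yield (strain, gene_count, sampled_proteins) for every replicate.

    `proteomes` maps a strain name to a list of its protein sequences.
    """
    rng = random.Random(seed)
    for strain, proteins in proteomes.items():
        for count in gene_counts:
            if count > len(proteins):
                continue  # assumption: skip classes larger than the proteome
            for _ in range(replicates):
                yield strain, count, rng.sample(proteins, count)
```

Each yielded sample can then be turned into a compositional signature and classified, which gives the accuracy per gene count class.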
Validation Results
Global composition
signatures work
better
Accuracy over 80%
using less than 20%
of proteins
Similar results using
“glocal” vectors
Preliminary results
indicate that local
signatures at lower
CAST thresholds are
enhanced
[Figure: accuracy, sensitivity and specificity (0.3 to 1) of GC and LB signatures versus number of genes (100 to 4000)]
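The three measures plotted here follow from the confusion matrix of predicted versus verified labels. The sketch below treats pathogenic ("P") as the positive class, which is an assumption, and it expects both classes to be present so that no denominator is zero.

```python
def classification_metrics(true_labels, predicted_labels, positive="P"):
    """Accuracy, sensitivity and specificity for a two-class prediction."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # pathogenic strains correctly flagged
        "specificity": tn / (tn + fp),  # non-pathogenic strains correctly cleared
    }
```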
Species | GenBank accession | Prediction
E. coli ATCC 8739 | NC_010468 | Non-pathogenic
E. coli SMS-3-5 | NC_010498 | Pathogenic
E. fergusonii ATCC 35469 | NC_011740 | Pathogenic
Exploration for the identification of proteomic subsets
responsible for the predictive power of the final models
Chimera generation
Genes clustered using
blastclust
4 pools of genes
• P or NP only
• Core genes
• Mixed genes
4 chimera types
• P only or NP only
• Equal P/NP
• All
14 gene count classes
• 100, 500, 1000,
2000, 2500, 3000,
3500 … 10000
Genomes
Gene pools
Chimeras
Results
P & NP only sequences
can characterize a
genome
Prediction rate with
4000 genes similar to
external validation
results
Compositional signal of
P only class stronger
than NP only
LB similar to GC on P
only. Perhaps stronger
LCR signature?
Conclusions
• Compositional features can be used to accurately perform large-scale phenotype prediction
• k-NN is much faster than the approach of Sims et al. (2011)
• Compositional features enable a more compressed representation of complete genomes, which further increases accuracy & performance
• Robust even when incomplete data are available
• Gene classes associated with pathogenicity seem to generate enough signal to control the pathogenicity of a genome according to the predictive model
Synopsis
We evaluated the
performance of the most
popular LCRs handling
algorithms
We evaluated, and
questioned, the most
commonly employed
data-set for assessing the
performance of LCRs
handling algorithms
We described the best
ways to handle LCRs
We built tools to search,
display and compare LCRs
We used LCRs and other
compositional measures
to accurately predict the
pathogenicity of
Escherichia strains
We demonstrated that
our methods also work
when using only parts of
the available information
LCRs
Future work
Publish at least 3 papers
(1 submitted, 2 are in preparation)
Predict the pathogenicity of the 77 incomplete
genomes that we collected from GenBank
Expand the LCR-eXXXplorer
Apply for a Post-Doc position
Party
Acknowledgments
• Dr. I. Iliopoulos
• Dr. L. Kostrikis
• Dr. P. Skourides
• Dr. C.A. Ouzounis
• Dr. V. Promponas
• Cyprus Research Promotion Foundation
• ΠΕΝΕΚ/ENIΣΧ/0308/77
• Athina Theodosiou
• Stella Tamana
• Ioanna Kalvari
• Maria Xenophontos
• Marilena Aplikioti
• My family
  • 3. Protein sequences >gi|157159493|ref|YP_001456811.1| isoleucyl-tRNA synthetase MSDYKSTLNLPETGFPMRGDLAKREPGMLARWTDDDLYGIIRAAKKGKKTFILHDGPPYANGSIHIGHSVNK ILKDIIVKSKGLSGYDSPYVPGWDCHGLPIELKVEQEYGKPGEKFTAAEFRAKCREYAATQVDGQRKDFIRL GVLGDWSHPYLTMDFKTEANIIRALGKIIGNGHLHKGAKPVHWCVDCRSALAEAEVEYYDKTSLSIDVAFQA VDQDALKAKFAVSNVNGPISLVIWTTTPWTLPANRAISIAPDFDYALVQIDGQAVILAKDLVESVMQRIGVT DYTILGTVKGAELELLRFTHPFMGFDVPAILGDHVTLDAGTGAVHTAPGHGPDDYVIGQKYGLETANPVGPD GTYLPGTYPTLDGVNVFKANDIVVALLQEKGALLHVEKMQHSYPCCWRHKTPIIFRATPQWFVSMDQKGLRA QSLKEIKGVQWIPDWGQARIESMVANRPDWCISRQRTWGVPMSLFVHKDTEELHPRTLELMEEVAKRVEVDG IQAWWDLDAKEILGDEADQYVKVPDTLDVWFDSGSTHSSVVDVRPEFAGHAADMYLEGSDQHRGWFMSSLMI STAMKGKAPYRQVLTHGFTVDGQGRKMSKSIGNTVSPQDVMNKLGADILRLWVASTDYTGEMAVSDEILKRA ADSYRRIRNTARFLLANLNGFDPAKDMVKPEEMVVLDRWAVGCAKAAQEDILKAYEAYDFHEVVQRLMRFCS VEMGSFYLDIIKDRQYTAKADSVARRSCQTALYHIAEALVRWMAPILSFTADEVWGYLPGEREKYVFTGEWY EGLFGLADSEAMNDAFWDELLKVRGEVNKVIEQARADKKVGGSLEAAVTLYAEPELSAKLTALGDELRFVLL TSGATVADYNDAPADAQQSEVLKGLKVALSKAEGEKCPRCWHYTQDVGKVAEHAEICGRCVSNVAGDGEKRK FA
  • 4. Protein Sequence Resources NCBI nr Swiss-Prot TrEMBL PDB Environmental samples Organism Proteomes
  • 5. Most common databases (June 2013). NCBI nr: 20,703,722 sequences, 7,197,824,772 amino acids. UniProtKB/TrEMBL: 35,502,518 sequences, 11,384,440,438 amino acids. UniProtKB/Swiss-Prot: 540,261 sequences, 191,876,607 amino acids. Protein Data Bank (PDB): 91,359 sequences, 60,847,644 amino acids. Sources: ftp://ftp.ncbi.nih.gov/blast/db/, http://www.ebi.ac.uk/uniprot/TrEMBLstats/, http://www.expasy.org/sprot/relnotes/relstat.html, http://www.pdb.org
  • 6. (Most) Databases grow exponentially [Charts: number of searchable structures per year in PDB (thousands); number of entries in NCBI/Proteins (millions)] June 2013; http://www.ncbi.nlm.nih.gov/protein, http://www.ebi.ac.uk/uniprot/TrEMBLstats/, http://www.expasy.org/sprot/relnotes/relstat.html, http://www.pdb.org
  • 8. Lack of experimental data [Pie chart: 15%, 12%, 70%, 3% and 0% across five categories] June 2013; http://www.expasy.org/sprot/relnotes/relstat.html
  • 9. Searching... Database: huge number of sequences, redundancy, exponential growth. Search method: fast, accurate, easy to use and program. Results: statistical significance, ranking, detailed info.
  • 10. Sequence comparison methods • Dot Plots (intuitive) • Dynamic Programming (e.g. SSEARCH; exact but slow) • Heuristics (e.g. BLAST, FASTA; fast but not exact) prot A M---EEPQSDPSVEPPLSQETFS : :: ::: :.: :::::::: prot B MTAMEESQSDISLELPLSQETFS
  • 11. BLAST (Altschul et al., 1990)
  • 13. BLAST scoring and statistics (Karlin, S. and S. F. Altschul, 1990). Scores: $S_{ij} = \frac{1}{\lambda} \ln \frac{q_{ij}}{p_i p_j}$. Expected number of hits: $E(S \ge s) = K m n e^{-\lambda s}$. λ depends on the composition of the database.
  • 14. Background frequencies [Bar chart: background frequencies of the 20 amino acids] http://www.ncbi.nlm.nih.gov/protein
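The E-value formula on slide 13 can be sketched in a few lines of Python. This is a minimal illustration, not BLAST's implementation: the values of `LAMBDA` and `K` below are placeholder constants chosen for the example (real values depend on the scoring matrix, gap penalties and database composition), and the function names are hypothetical.

```python
import math

# Illustrative constants only (assumed for this sketch); real values depend
# on the scoring matrix, gap penalties and database composition.
LAMBDA = 0.318
K = 0.134


def expected_hits(raw_score, query_len, db_len, lam=LAMBDA, k=K):
    """E(S >= s) = K * m * n * exp(-lambda * s)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, lam=LAMBDA, k=K):
    """Normalised score S' = (lambda * s - ln K) / ln 2, so E = m * n * 2**(-S')."""
    return (lam * raw_score - math.log(k)) / math.log(2)


if __name__ == "__main__":
    for s in (40, 60, 80):
        print(s, round(bit_score(s), 1), expected_hits(s, 300, 10**9))
```

Because the score sits in the exponent, a modest rise in raw score lowers the E-value by orders of magnitude. A shift in λ has a similarly large effect, which is why compositionally biased regions can distort the statistics.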
  • 15. Low complexity regions (LCRs) Any of the 20 AAs... Single or grouped, periodic or not... May cover a big portion of the sequence... >NP_702279.1 [Plasmodium falciparum 3D7] MGSKKNSNTVDSSENVEEVVDNLTSEKNKESLKKDKRKKKEKKNNDVDDINEEEDEEGNDEDTMKKFSVD TSENEDDKEDDDDDEDDDDDDDDDDDEDDDDEDDDDDDDDDDDDDDDDDEDDDDDDDDDEDDDDEDDEDD DDDDEDDDDDFDDMDEDDDDDDDDDEDDDDEEDYDDDDDDDDDDDEDDDEDDDEDDDDENDQNEYAGDDK KDEDGDAKKGSDDEGFD >NP_702802.1 [Plasmodium falciparum 3D7] MSRLFFFYFFIFPYPLFRSYVLTFLRTSFLSYFLSFLLSFLLSFLLSFLLSFLLSFLLSFLLSFLLSFLL SFLLSFLLICFLNYLLSFSFSFSLFLSLSLFFFK >NP_703640.1 [Plasmodium falciparum 3D7] MKFFEKKKKKKKKKKEKKKKKKEKQNKTKVLFISSLFPFFFLFFVLSLLSYIFIIFFVSLSSFLYIDDII LLRIIVLYTMTLYIYKYIHIYIYIYIYIYITYFLIIIFL Wootton & Federhen, 1993
  • 16. Compositional extremes >NP_702279.1 [Plasmodium falciparum 3D7] MGSKKNSNTVDSSENVEEVVDNLTSEKNKESLKKDKRKKKEKKNNDVDDINEEEDEEGNDEDTMKKFSVD TSENEDDKEDDDDDEDDDDDDDDDDDEDDDDEDDDDDDDDDDDDDDDDDEDDDDDDDDDEDDDDEDDEDD DDDDEDDDDDFDDMDEDDDDDDDDDEDDDDEEDYDDDDDDDDDDDEDDDEDDDEDDDDENDQNEYAGDDK KDEDGDAKKGSDDEGFD [Bar chart: composition of this sequence across the 20 amino acids]
  • 17. LCRs affect homology search BLASTP results for kinesin k39 (Leishmania major)
  • 19. >NP_702279.1 [Plasmodium falciparum 3D7] MGSKKNSNTVDSSENVEEVVDNLTSEKNKESLKKDKRKKKEKKNNDVDDINEEEDEEGND EDTMKKFSVDTSENEDDKEDDDDDEDDDDDDDDDDDEDDDDEDDDDDDDDDDDDDDDDDE DDDDDDDDDEDDDDEDDEDDDDDDEDDDDDFDDMDEDDDDDDDDDEDDDDEEDYDDDDDD DDDDDEDDDEDDDEDDDDENDQNEYAGDDKKDEDGDAKKGSDDEGFD >NP_702802.1 [Plasmodium falciparum 3D7] MSRLFFFYFFIFPYPLFRSYVLTFLRTSFLSYFLSFLLSFLLSFLLSFLLSFLLSFLLSF LLSFLLSFLLSFLLSFLLICFLNYLLSFSFSFSLFLSLSLFFFK >NP_703640.1 [Plasmodium falciparum 3D7] MKFFEKKKKKKKKKKEKKKKKKEKQNKTKVLFISSLFPFFFLFFVLSLLSYIFIIFFVSL SSFLYIDDIILLRIIVLYTMTLYIYKYIHIYIYIYIYIYITYFLIIIFL SEG (Wootton and Federhen, 1993)
  • 20. >NP_702279.1 [Plasmodium falciparum 3D7] MGSKKxxxxxxxxxxxxxxxxxLTSxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxTMKKFSVDTSENxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxNDQNEYxxxxxxxxxxxxxxxxxxxxFD >NP_702802.1 [Plasmodium falciparum 3D7] MSRxxxxxxxxxxxxxxRSYVxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxK >NP_703640.1 [Plasmodium falciparum 3D7] MxxxxxxxxxxxxxxxxxxxxxxxQNKTKVxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxTMTLxxxxxxxxxxxxxxxxxxxxxxxxxFL SEG (Wootton and Federhen, 1993)
  • 21. >NP_702279.1 [Plasmodium falciparum 3D7] MGSKKNSNTVxSSENVEEVVxNLTSExNxESLxxxxRxxxExxNNxVxxINEEExEEGNx ExTMxxFSVxTSENExxKExxxxxExxxxxxxxxxxExxxxExxxxxxxxxxxxxxxxxE xxxxxxxxxExxxxExxExxxxxxExxxxxFxxMxExxxxxxxxxExxxxEExYxxxxxx xxxxxExxxExxxExxxxENxQNEYAGxxKKxExGxAKKGSxxEGFx >NP_702802.1 [Plasmodium falciparum 3D7] MSRLxxxYxxIxPYPLxRSYVLTxLRTSxLSYxxSxxxSxxxSxxxSxxxSxxxSxxxSx xxSxxxSxxxSxxxSxxxICxxNYxxSxSxSxSLxLSLSLxxxK >NP_703640.1 [Plasmodium falciparum 3D7] MKFFExxxxxxxxxxExxxxxxExQNxTxVLFxSSLFPFFFLFFVLSLLSxxFxxFFVSL SSFLxxDDxxLLRxxVLxTMTLxxxKxxHxxxxxxxxxxxTxFLxxxFL CAST (Promponas et al., 2000)
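The masking shown on slides 19 to 21 can be illustrated with a deliberately simplified sketch. The code below is neither SEG nor CAST: it slides a fixed window along the sequence and replaces every residue in any window whose Shannon entropy falls below a threshold with `x`. The window length and threshold are arbitrary assumptions for the example.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (bits) of the residue distribution in a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_entropy(seq, window=12, threshold=2.2):
    """Replace residues in every low-entropy window with 'x'.

    window and threshold are illustrative values, not SEG or CAST defaults.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = "x"
    return "".join(masked)


if __name__ == "__main__":
    print(mask_low_entropy("MGSKKNSNTVDSSENVEEVVDNLTSEDDDDDEDDDDDDDDDDDEDDDD"))
```

A window-entropy filter of this kind masks whole stretches, much like the SEG output on slide 20. CAST's output on slide 21 instead masks only the over-represented residue type inside each region.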
  • 24. Are LCRs the equivalent of “junk DNA” for protein sequences? Haerty & Golding, 2010
  • 25. LCRs and Protein Structure. LCRs: lack of 3D structure; high-order non-globular structures when in complexes; greater flexibility; transcription factors; greater flexibility; protein to protein interactions; proteins can perform more than one specific function. Corral et al., 1993; Suzuki et al., 1993; Gill et al., 1993; Wootton, 1994; Romero et al., 2001; Karlin et al., 2002; Coletta et al., 2010
  • 26. LCRs and Evolution. LCRs evolve rapidly; new LCRs generate novel material for functional expansions; expansions and contractions of LCRs may constitute a large source of phenotypic variation. Coletta et al., 2010
  • 27. LCRs and Human Health. LCRs are very common in the malaria parasite (Plasmodium falciparum); polyQ repeats are involved in more than 10 neurological diseases; LCRs play important roles in the proper function of health-related proteins (e.g. p53). Miller et al., 2000; Promponas et al., 2000; Li et al., 2007
  • 28. Problems faced in LCR research. LCR detection and manipulation is everywhere; LCRs are missing from experimentally determined protein structures; various and conflicting definitions; lack of proper tools aimed at searching, displaying and comparing LCRs
  • 29. Scientific Hypothesis & Aims LCRs as a means to predict the phenotype of an organism Tools for LCRs manipulation: Search, Display, Compare LCRs Review: Algorithms, Tools & Biological Relevance
  • 30. Effects of LCRs and LCR-handling tools on protein sequence comparisons
  • 31. Common methods Database search Best bi-directional hit
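As a rough illustration of the best bi-directional hit (BBH) method named on slide 31, the sketch below pairs proteins from two proteomes that are each other's top-scoring hit. The input format (tuples of query, subject and bit score) and the function names are assumptions made for the example; they do not describe the pipeline used in the thesis.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, bit_score) tuples (assumed format).
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


if __name__ == "__main__":
    a_vs_b = [("a1", "b1", 200.0), ("a1", "b2", 50.0), ("a2", "b2", 80.0)]
    b_vs_a = [("b1", "a1", 190.0), ("b2", "a1", 60.0), ("b2", "a2", 55.0)]
    print(bidirectional_best_hits(a_vs_b, b_vs_a))
```

In this toy input, `a2` picks `b2` as its best hit, but `b2` prefers `a1`, so only the reciprocal pair survives. A spurious LCR-driven top hit in either direction breaks a pair in the same way.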
  • 32. LCRs cause problems Ideal scenario Real life
  • 34. LCR-handling schemes. Selected from: all possible combinations; best performing; most commonly employed. NM: no masking. NMCB2: compositional based statistics mode 2. SM1: SEG masked query. SM2: SEG masked database. SM2CB2: SEG masked database & CBS2. SM3: SEG filtered query and database (two-way masking). CM1: CAST masked query. CM2: CAST masked database. CM3: CAST masked query & database.
  • 35. Datasets: Self comparisons Low complexity content Determining best hit
  • 36. Self comparisons: Affected proteins. E-value only: many false hits. E-value + bit score + others: most false hits disappear. No LCRs? (Almost) no problems. SM1 is the worst LCR-handling method tested.
  • 37. BBH: Run times and file sizes. All species: • All modes except SM3 don’t cause a noticeable decrease in runtimes • SM3 significantly increases runtime (fragmented db). P. falciparum: • At least 80% decrease in runtimes for all masking modes • CM3 causes a 50-fold decrease in runtime • Differences are more pronounced when the E-value threshold is 10 (data not shown)
  • 38. Dataset: Database Search http://scop.mrc-lmb.cam.ac.uk/scop/, http://astral.berkeley.edu/ Class: • Secondary structure content of the domain. Fold: • Similar arrangement of regular secondary structures • Major structural similarity. Superfamily: • Structural and functional similarity to suggest evolutionary relationship • Probable common evolutionary origin. Family: • Clear sequence similarity can be detected • Clear evolutionary relationship
  • 39. Database search: Affected proteins. CAST masking is better than SEG or CBS. CBS are better than SEG. Bare BLAST (NM) is sometimes better than SEG or CBS! NM is more sensitive.
  • 40. Database search: Affected proteins. Sequences without LCRs return 20x more TPs for the same number of FPs. SEG & NM struggle when LCRs are present. CM2 & NMCB2 are the best methods when LCRs are present. REMEMBER: few sequences with LCRs in ASTRAL. [Figure panels: No LCRs; With LCRs]
  • 41. DS: Run times and file sizes. Compositional statistics increase CPU time. Differences are small, possibly because of the low LCR content of ASTRAL.
  • 42. Conclusions. Best bi-directional hit: • Utilizing extra BLASTP features may increase performance. Masking the database with CAST: • Generally increases sensitivity • Decreases usage of computational resources (especially CM3). ASTRAL is poor in LCRs: • Its use as a benchmark tool does not simulate a real-life scenario • Different data-sets for benchmarking sequence similarity tools are needed. Hits rich in LCRs are not always spurious: • One should be extra cautious when applying LCR manipulation methods
  • 43. Development of novel software for searching, displaying and comparing LCRs
  • 44. The problem. Most LCR-detection algorithms are designed to mask them. Lack of tools aimed at searching, visualizing and sharing LCRs. Major databases store LCRs in an inconsistent and incomplete manner. Only a few studies with large-scale analysis of LCRs. LCR-related research is focused on the problems caused by them.
  • 45. Prior art. LPS-annotate: http://cedra.biol.mcgill.ca/LPS/lps-annotate.html (Harbi et al., 2011). HRaP: http://bioinfo.protres.ru/hrap/ (Lobanov et al., 2014).
  • 46. Main goals. Uniform way of LCR representation. Combine the most commonly used algorithms and databases. The first (?) universal resource for LCR-related research. Aid researchers in assessing the biological relevance of LCRs. http://repeat.biol.ucy.ac.cy/mgb2/gbrowse/
  • 47. Features Based on UniProt/SwissProt data CAST & SEG (expandable) Ease of use Allow settings saving and sharing Mix and Match masks
  • 48. Features Search • Protein properties • LCRs by residue type(s) • Within UniProt results Browse or download results Database schema optimized for fast search and retrieval of data Deep links to UniProt, PDB & GO
  • 49. Features Local BLAST search against masked databases Initiate NCBI’s BLASTP with normal or inverted masking with optimal pre-configured settings for LCR detection
  • 50. A new generation of the CAST algorithm
  • 52. Performance of old and new CAST versions. Execution time (secs): CAST v1.0 (gcc) 154.4; CAST v2.1 (gcc) 85.2; CAST v2.1s (gcc) 55.5; CAST v1.0 (icc) 11.4; CAST v2.1 (icc) 6.5; CAST v2.1s (icc) 4.9; SEG 1.9. [Bar chart with a second axis: speed-up factor compared to CAST v1]
  • 53. Performance of old and new CAST versions [Chart: execution time (secs) and millions of masked AAs (# Xs) for P.f., P.v., P.c., P.a. 1, P.a. 2, C.t. and C.p.; series: CAST v1.0 (gcc), CAST v2.1 (gcc), CAST v2.1s (gcc), CAST v2.1s (icc)]
  • 54. Porting CAST v2.1 on other hardware platforms: FPGAs, GP-GPUs, multi-core CPUs
  • 55. Predicting the pathogenicity of Escherichia strains based on local and global amino acid compositional signatures
  • 56. Motivation Kreil & Ouzounis (2001) identified thermophilic species by their global amino acid composition Promponas (2009) identified pathogenic Escherichia strains using global and local amino acid compositional signatures
  • 57. Data retrieval. 31 full genomes from GenBank; original genome publications; complementary publications & data; 28 Escherichia strains with verified pathogenicity status. Sources: NCBI Bacterial Genomes Division [ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/]; GOLD [http://genomesonline.org/]; NCBI Bacterial Genome Project Division [ftp://ftp.ncbi.nlm.nih.gov/genomes/bioproject/]. 31 genomes: training set, 22 verified; validation set, 6 verified; prediction, 3 with unverified pathogenicity.
  • 58. Signatures Generation & Training Set Creation (Kreil & Ouzounis, 2003; Promponas, 2009). CAST with default settings and –stat. GC: global AA composition. LB: LCRs composition. XB: masked residues composition. x 28
  • 59. k-nearest neighbour cross-validation and optimal k selection. LOOCV for 1 ≤ k ≤ 11; optimal k selection (Kuhn, 2008). Classes: Pathogenic, Non Pathogenic.
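The global composition signature (GC) on slide 58 can be pictured as a 20-dimensional vector of amino acid fractions over a whole proteome. The sketch below is an assumption about what such a vector looks like, not the thesis code; the LB and XB signatures additionally require CAST output and are not reproduced here. The function name and the residue ordering are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # fixed ordering assumed for the vector


def global_composition(sequences):
    """Return the fraction of each standard amino acid over all sequences.

    Non-standard symbols (e.g. X) are ignored in this sketch.
    """
    counts = Counter()
    for seq in sequences:
        counts.update(res for res in seq.upper() if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


if __name__ == "__main__":
    proteome = ["MGSKKNSNTVDSSENVEEVVDNLTSEKNKESLKKDKRKKKEKKNNDVDD", "MSRLFFFYFFIFPYPLFRSYVLTFLRTSFLSYFLSFLLSFLL"]
    for aa, frac in zip(AMINO_ACIDS, global_composition(proteome)):
        print(aa, round(frac, 3))
```

Each genome then becomes a single point in a 20-dimensional space, which is what makes a distance-based classifier applicable.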
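The leave-one-out cross-validation and choice of k on slide 59 can be sketched in plain Python. The slide cites Kuhn (2008), so the original analysis probably relied on an existing package. The code below is a minimal stand-in that assumes Euclidean distance and majority voting, with hypothetical function names and toy two-dimensional vectors in place of real composition signatures.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Majority label among the k training vectors nearest to query."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def loocv_accuracy(data, k):
    """Leave each sample out in turn and classify it from the remaining ones."""
    correct = 0
    for i, (vector, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_predict(rest, vector, k) == label:
            correct += 1
    return correct / len(data)


def best_k(data, ks=range(1, 12)):
    """Smallest k in ks with the highest LOOCV accuracy (1 <= k <= 11 by default)."""
    return max(ks, key=lambda k: loocv_accuracy(data, k))


if __name__ == "__main__":
    toy = [((0, 0), "P"), ((0, 1), "P"), ((1, 0), "P"),
           ((10, 10), "NP"), ((10, 11), "NP"), ((11, 10), "NP")]
    print(best_k(toy), loocv_accuracy(toy, 1))
```

With only 22 training genomes, leaving one out at a time uses the data far more efficiently than a single train/test split, which is presumably why LOOCV was chosen.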
  • 60. Validation sets generation 6 verified genomes Randomly sample sequences from each strain 9 gene count classes: • 100, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000 • 2-98% genome coverage 5400 sub-samples: 100 for each genome & gene count class
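The sub-sampling scheme on slide 60 (6 genomes x 9 gene count classes x 100 replicates = 5400 sub-samples) can be sketched as a generator. The function name, the fixed seed and the rule of skipping classes larger than a genome are assumptions made for this illustration.

```python
import random


def subsamples(genomes, gene_counts, replicates, seed=0):
    """Yield (genome, size, replicate, sampled_genes) tuples.

    genomes: dict mapping a genome name to a list of gene identifiers.
    Sampling is without replacement; classes larger than the genome are skipped
    (an assumption of this sketch).
    """
    rng = random.Random(seed)
    for name, genes in genomes.items():
        for size in gene_counts:
            if size > len(genes):
                continue
            for rep in range(replicates):
                yield name, size, rep, rng.sample(genes, size)


if __name__ == "__main__":
    toy = {"strain_A": [f"gene{i}" for i in range(10)],
           "strain_B": [f"gene{i}" for i in range(8)]}
    for name, size, rep, genes in subsamples(toy, (2, 5), 2):
        print(name, size, rep, genes)
```

Each sub-sample would then be converted to a composition signature and classified, showing how accuracy changes with genome coverage.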
  • 61. Validation Results. Global composition signatures work better. Accuracy over 80% using less than 20% of proteins. Similar results using “glocal” vectors. Preliminary results indicate that local signatures at lower CAST thresholds are enhanced. [Chart: accuracy, sensitivity and specificity of the GC and LB signatures against number of genes (100 to 4000)] Predictions (species, GenBank accession, prediction): E. coli ATCC 8739, NC_010468, non-pathogenic; E. coli SMS-3-5, NC_010498, pathogenic; E. fergusonii ATCC 35469, NC_011740, pathogenic.
  • 62. Exploration for the identification of proteomic subsets responsible for the predictive power of the final models
  • 63. Chimera generation. Genes clustered using blastclust. 4 pools of genes • P or NP only • Core genes • Mixed genes. 4 chimera types • P only or NP only • Equal P/NP • All. 14 gene count classes • 100, 500, 1000, 2000, 2500, 3000, 3500 … 10000. [Diagram: Genomes, Gene Pools, Chimeras]
  • 64. Results. P-only and NP-only sequences can characterize a genome. Prediction rate with 4000 genes similar to external validation results. Compositional signal of the P-only class is stronger than that of NP-only. LB similar to GC on P-only. Perhaps a stronger LCR signature?
  • 65. Conclusions • Compositional features can be used to accurately perform large-scale phenotype prediction • k-NN is much faster compared to Sims et al. (2011) • Compositional features enable a more compressed representation of complete genomes, which further increases accuracy & performance • Robust even when incomplete data are available • Gene classes associated with pathogenicity seem to generate enough signal to control the pathogenicity of a genome according to the predictive model
  • 66. Synopsis. We evaluated the performance of the most popular LCR-handling algorithms. We evaluated, and questioned, the most commonly employed data-set for assessing the performance of LCR-handling algorithms. We described the best ways to handle LCRs. We built tools to search, display and compare LCRs. We used LCRs and other compositional measures to accurately predict the pathogenicity of Escherichia strains. We demonstrated that our methods also work when using only parts of the available information.
  • 67. Future work Publish at least 3 papers (1 submitted, 2 are in preparation) Predict the pathogenicity of the 77 incomplete genomes that we collected from GenBank Expand the LCR-eXXXplorer Apply for a Post-Doc position Party
  • 68. Acknowledgments • Dr. I. Iliopoulos • Dr. L. Kostrikis • Dr. P. Skourides • Dr. C.A. Ouzounis • Dr. V. Promponas • Cyprus Research Promotion Foundation • ΠΕΝΕΚ/ENIΣΧ/0308/77 • Athina Theodosiou • Stella Tamana • Ioanna Kalvari • Maria Xenophontos • Marilena Aplikioti • My family