Homology Search
Paul Gardner
March 24, 2015
Paul Gardner Homology Search
News & Views reminder: 20% of your course grade.
Due March 26; reviewed April 2 (5/20); revisions April 28 (15/20).
Meredith et al. (2014) Evidence for a single loss of
mineralized teeth in the common avian ancestor. Science
Nunez et al. (2015) Integrase-mediated spacer acquisition
during CRISPR-Cas adaptive immunity. Nature
Homology search
In a huge collection of biological sequences, how can you locate
similar sequences?
By using heuristic, super-fast sequence alignment methods.
BLAST
Identify all 'hits' (word matches) at least W residues long
Find any hits on the same diagonal of an alignment matrix
Trigger a full alignment in that region
Basic idea: identify near-identical sub-sequences first → align any
hits in full
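The seed-finding steps above can be sketched in a few lines of Python. This is a minimal illustration, not the BLAST implementation: it assumes exact word matches only (real BLAST also scores neighbourhood words for proteins and applies further heuristics), and the function names `find_seeds` and `seeds_by_diagonal` are hypothetical.

```python
from collections import defaultdict

def find_seeds(query, subject, w):
    """Return (i, j) pairs where query[i:i+w] == subject[j:j+w]."""
    # Index every word of length w in the subject by its start position.
    index = defaultdict(list)
    for j in range(len(subject) - w + 1):
        index[subject[j:j + w]].append(j)
    # Look up every query word in that index.
    seeds = []
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            seeds.append((i, j))
    return seeds

def seeds_by_diagonal(seeds):
    """Group seeds by diagonal (j - i) of the alignment matrix."""
    diagonals = defaultdict(list)
    for i, j in seeds:
        diagonals[j - i].append((i, j))
    return dict(diagonals)

seeds = find_seeds("ACGTACGT", "TTACGTACGTTT", w=4)
diagonals = seeds_by_diagonal(seeds)
# The diagonal holding the most seeds is the candidate region for a full alignment.
best = max(diagonals, key=lambda d: len(diagonals[d]))
```

Several seeds on one diagonal suggest an ungapped stretch of similarity, which is where a full alignment would be triggered.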
What does that E-value (Expect) mean?
>gb|CP001191.1| Rhizobium leguminosarum bv. trifolii WSM2304, complete genome
Length=4537948
Features in this part of subject sequence:
cold-shock DNA-binding domain protein
Score = 57.2 bits (62), Expect = 2e-05
Identities = 78/106 (74%), Gaps = 6/106 (6%)
Strand=Plus/Plus
Query  1       CTTCGTCAGATTTCCTCTCAATATCGATCATACCGGACTGATATTCGTCCGG----GAAC
               || |||||||| ||||||||| ||||||    |  | | || |||| ||||     ||||
Sbjct  828507  CTCCGTCAGATATCCTCTCAACATCGATACGGCTTGTCGGACATTCTTCCGCAGGCGAAC

Query  57      TCTAGCGATTGAAA-GGAAATCGTTATGAACTCAGGCACCGTAAAG
                | | || |||||| ||| ||||||||||| ||||||  ||| |||
Sbjct  828567  ACAA-CGGTTGAAAAGGAGATCGTTATGAATTCAGGCGTCGTCAAG
How can we evaluate the significance of a score?
Note that a bit-score of 57.2 by itself is not that useful.
Its significance depends on the query and database size and composition.
To account for this we can compute an Expect-value (E-value).
This is the number of hits expected by chance with a score at least as
high as the one observed, for the given query and database sizes.
P-values can also be used.
[Figure: Separating true from false hits. Number of matches (y-axis, 0 to 10000) against score in bits (x-axis, 0 to 700) for random sequences/negative controls and true homologs/positive controls. A score threshold divides the hits into true positives, false positives, true negatives and false negatives.]
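The effect of the threshold in the figure can be sketched as a simple count. This is an illustrative example only: the function name `confusion_counts` and the scores used are made up, and it assumes a hit is called positive when its score is at or above the threshold.

```python
def confusion_counts(positive_scores, negative_scores, threshold):
    """Count TP/FN/FP/TN for a score threshold.

    positive_scores: scores of true homologs (positive controls)
    negative_scores: scores of random sequences (negative controls)
    """
    tp = sum(s >= threshold for s in positive_scores)  # homologs above threshold
    fn = len(positive_scores) - tp                     # homologs missed
    fp = sum(s >= threshold for s in negative_scores)  # random hits accepted
    tn = len(negative_scores) - fp                     # random hits rejected
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}

# Hypothetical scores, for illustration only.
counts = confusion_counts([50, 120, 300], [10, 20, 60], threshold=55)
```

Raising the threshold trades false positives for false negatives, and lowering it does the reverse.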
How can we evaluate the significance of a score?
$$E = \kappa M N \, 2^{-\lambda x}$$
E: E-value
M & N: query & database size
κ & λ: fitting parameters
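A minimal sketch of the E-value formula, taking x to be the score in bits (the x-axis of the plot). The default values of `kappa` and `lam` are placeholders, not fitted parameters; in practice they are estimated for the scoring system in use. The P-value conversion assumes the usual Poisson approximation, P = 1 - exp(-E).

```python
import math

def e_value(score, query_len, db_len, kappa=1.0, lam=1.0):
    """E = kappa * M * N * 2^(-lambda * x); kappa and lam are placeholder defaults."""
    return kappa * query_len * db_len * 2.0 ** (-lam * score)

def p_value(e):
    """Probability of at least one chance hit, assuming hits are Poisson distributed."""
    return -math.expm1(-e)  # numerically stable form of 1 - exp(-e)

# Doubling the database size doubles the E-value for the same score.
small_db = e_value(10, query_len=100, db_len=1000)
large_db = e_value(10, query_len=100, db_len=2000)
```

The same score is therefore less significant in a larger database, which is why a bit-score alone is hard to interpret. For small E, the P-value and E-value are nearly equal.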
BLAST is not the only, or the best, tool for the job!
Profile-based homology search
Krogh, A. et al. (1994) Hidden Markov models in computational biology. Applications to protein modeling. J Mol
Biol.
Image provided by Eric Nawrocki.
Profile-based homology search – scoring sequences
Image provided by Eric Nawrocki.
Profile HMMs are slightly more complicated
A tree-weighting scheme takes care of unbalanced
alignments
Dirichlet-mixture priors are used to incorporate information
about amino-acid biochemistry
Effective sequence number is used to down-weight priors
when many sequences are available
Transition probabilities to Insert & Delete states are estimated
from the alignment
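For contrast with the refinements listed above, here is a deliberately simplified profile. It is a sketch under several assumptions: the alignment has no gaps, every sequence gets equal weight (no tree weighting), a uniform pseudocount stands in for Dirichlet-mixture priors, the background is uniform, and there are no insert or delete states. The names `build_profile` and `score` are hypothetical.

```python
import math

def build_profile(alignment, alphabet="ACGU", pseudocount=1.0):
    """Per-column log-odds scores (bits) from an ungapped alignment."""
    background = 1.0 / len(alphabet)  # assumed uniform background frequency
    profile = []
    for col in range(len(alignment[0])):
        # Start every residue at the pseudocount so unseen residues are not impossible.
        counts = {a: pseudocount for a in alphabet}
        for seq in alignment:
            counts[seq[col]] += 1
        total = sum(counts.values())
        profile.append({a: math.log2((counts[a] / total) / background)
                        for a in alphabet})
    return profile

def score(profile, seq):
    """Sum the column log-odds for a sequence of the same length as the profile."""
    return sum(column[residue] for column, residue in zip(profile, seq))

profile = build_profile(["ACGU", "ACGU", "ACGA"])
member_score = score(profile, "ACGU")     # resembles the alignment
unrelated_score = score(profile, "UGCA")  # does not
```

A sequence that matches the conserved columns scores above zero, while an unrelated one scores below. The techniques on this slide refine how the counts become probabilities and add states for insertions and deletions.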
Why not just use BLAST?
ACCURACY!
Every benchmark of homology search tools has shown that
profile methods are more accurate than single-sequence
methods.
Eddy (2011) Accelerated Profile HMM Searches. PLoS
Computational Biology.
Why not just use BLAST?
SPEED! To search a single query vs a database of all proteins:
BLAST: searches 42 million UniProt sequences
HMMER: searches 15,000 Pfam profiles
The search space is ∼3,000x smaller for profiles
Save Planet Earth, use HMMER3
Eddy (2011) Accelerated Profile HMM Searches. PLoS
Computational Biology.
Pfam
What is a Pfam-A Entry?
[Diagram: SEED → hmmbuild → HMM → hmmsearch → OUTPUT → hmmalign → ALIGN, plus DESC]
Slide borrowed from Rob Finn.
But, what about RNA?
[Figure: three RNA secondary structure diagrams, each drawn from 5' to 3', with a sequence conservation scale from 0 to 1.]
Covariance models
Nawrocki & Eddy (2007) Query-Dependent Banding (QDB) for Faster RNA Similarity Searches. PLoS
Computational Biology.
Benchmark
Freyhult, Bollback & Gardner (2007) Exploring genomic dark matter: A critical assessment of the performance of
homology search methods on noncoding RNA. Genome Research.
Rfam
Relevant reading
Reviews:
Eddy SR (2004) What is a hidden Markov model? Nature
Biotechnology.
Methods:
Altschul SF et al. (1997) Gapped BLAST and PSI-BLAST: a
new generation of protein database search programs. Nucleic
Acids Research.
Eddy (2011) Accelerated Profile HMM Searches. PLoS
Computational Biology.
The End