Alignment: Spaced Seed
2012-03-16
LEAST Seminar
Abner Huang
< abnercyh@cs.nthu.edu.tw >
CSBB Lab, NTHU
Outline
• Introduction
• Theory
• Remarks
Stringology
• String matching
• Pattern matching
• Periodicities
• Data structure
• Text Compression
• Alignment
Alignment
• Spelling correction
• Bitext word alignment
• File comparison (diff)
• Amino acid sequences comparison
Nature Milestones in DNA: BLAST
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local
alignment search tool. J. Mol. Biol. 215, 403–410 (1990)
A. Califano and I. Rigoutsos, “Flash: A fast look-up algorithm for string homology,”
in Proceedings of the 1st International Conference on Intelligent Systems for
Molecular Biology (ISMB), pp. 56-64, July 1993.
Spaced seed
Spaced seed, Multiple spaced seeds,
Vector (Relaxed) seed, Neighbor seeds
Optimal Spaced Seed
(Ma, Tromp, Li: Bioinformatics, 18:3, 2002, 440-445)
• Spaced Seed: nonconsecutive matches and
optimized match positions.
• Represent BLAST seed by 11111111111
• Spaced seed: 111*1**1*1**11*111
– 1 means a required match
– * means “don’t care” position
• This seemingly simple change makes a huge
difference: it significantly increases the hit probability
in homologous regions while reducing bad hits.
Formalization
• Given an i.i.d. sequence (homology region) with
Pr(1)=p and Pr(0)=1-p for each bit:
1100111011101101011101101011111011101
• Which seed is more likely to hit this region:
– BLAST seed: 11111111111
– Spaced seed: 111*1**1*1**11*111
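The question can be checked directly on the example region. The sketch below slides each seed along the 0/1 string and records every offset where all of the seed's 1 positions sit over a 1; the helper name `hit_positions` is an assumption made for illustration.

```python
REGION = "1100111011101101011101101011111011101"  # example region from the slide
BLAST_SEED = "11111111111"
SPACED_SEED = "111*1**1*1**11*111"


def hit_positions(region, seed):
    """Return every 0-based offset at which the seed hits the region.

    A hit means each '1' in the seed is aligned with a '1' (match) in the
    region; '*' positions are ignored.
    """
    required = [i for i, c in enumerate(seed) if c == "1"]
    return [
        start
        for start in range(len(region) - len(seed) + 1)
        if all(region[start + i] == "1" for i in required)
    ]
```

On this region the longest run of matches is only five long, so the BLAST seed finds no hit, while the spaced seed hits once, at offset 17.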
Sensitivity: PH (PatternHunter) weight 11 seed vs Blast 11 & 10
PH 2-hit sensitivity vs Blastn 11, 12 1-hit
Expect Less, Get More
• Lemma: The expected number of hits of a weight-W, length-M
seed model within a length-L region with similarity p is
$(L-M+1)\,p^W$.
Proof: The expected number of hits is the sum, over the L-M+1 possible
positions of fitting the seed within the region, of the probability of W
specific matches, the latter being $p^W$. ■
• Example: In a region of length 64 with 0.7 similarity, PH has
probability 0.466 of hitting vs. 0.3 for Blast, a 50% increase. On the
other hand, by the lemma above, Blast expects 1.07 hits, while PH
expects 0.93, 14% less.
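The lemma and the example can be explored numerically. The sketch below evaluates $(L-M+1)\,p^W$ and also computes the exact probability of at least one hit under the i.i.d. model, using a dynamic program over the most recent M-1 bits. The function names are assumptions, and the dynamic program is a generic textbook approach, not necessarily the method used in the cited paper.

```python
from collections import defaultdict


def expected_hits(seed, region_len, p):
    """Expected number of hits, (L - M + 1) * p**W, from the lemma."""
    weight = seed.count("1")
    return (region_len - len(seed) + 1) * p ** weight


def hit_probability(seed, region_len, p):
    """Exact Pr(at least one hit) in an i.i.d. region with Pr(1) = p."""
    m = len(seed)
    # Bit m-1-i of the window corresponds to seed position i (oldest bit first).
    mask = sum(1 << (m - 1 - i) for i, c in enumerate(seed) if c == "1")
    keep = (1 << (m - 1)) - 1  # retains the most recent m-1 bits
    no_hit = {0: 1.0}          # state -> Pr(state reached with no hit so far)
    for t in range(region_len):
        nxt = defaultdict(float)
        for state, prob in no_hit.items():
            for bit, q in ((1, p), (0, 1.0 - p)):
                window = (state << 1) | bit
                # A full window exists only once m bits have been read.
                if t >= m - 1 and window & mask == mask:
                    continue   # this path produces a hit; drop its mass
                nxt[window & keep] += prob * q
        no_hit = nxt
    return 1.0 - sum(no_hit.values())
```

`expected_hits("1" * 11, 64, 0.7)` rounds to 1.07 and `expected_hits("111*1**1*1**11*111", 64, 0.7)` rounds to 0.93, matching the example. `hit_probability` can be used in the same way to check the quoted hit probabilities; for the 18-column seed the state table grows to 2^17 entries, so that call is noticeably slower in pure Python.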
Why Is Spaced Seed Better?
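A wrong, but intuitive, proof: seed s, interval I, similarity p
E(#hits) = Pr(s hits) · E(#hits | s hits)
Thus:
Pr(s hits) = $L\,p^W$ / E(#hits | s hits)
For an optimized spaced seed, E(#hits | s hits) is a sum over shifted
copies of the seed; each copy hits with probability p raised to the number
of its 1s that do not overlap 1s of the original:

Shifted seed              Non-overlap  Prob.
111*1**1*1**11*111        (original)
 111*1**1*1**11*111       6            $p^6$
  111*1**1*1**11*111      6            $p^6$
   111*1**1*1**11*111     6            $p^6$
    111*1**1*1**11*111    7            $p^7$
…

• For spaced seed: the divisor is $1 + p^6 + p^6 + p^6 + p^7 + \dots$
• For BLAST seed: the divisor is bigger: $1 + p + p^2 + p^3 + \dots$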
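The divisor in this heuristic argument can be computed mechanically from the seed's self-overlap. The sketch below counts, for each shift, how many 1s of the shifted seed are not covered by 1s of the original, then sums the corresponding powers of p. The names `non_overlap` and `divisor` are assumptions, and since the argument is labelled "wrong, but intuitive" above, the result is only an illustration, not an exact sensitivity.

```python
BLAST_SEED = "11111111111"
SPACED_SEED = "111*1**1*1**11*111"


def non_overlap(seed, shift):
    """Count the '1's of the seed shifted right by `shift` that do not land
    on a '1' of the unshifted seed."""
    ones = {i for i, c in enumerate(seed) if c == "1"}
    return sum(1 for i in ones if i + shift not in ones)


def divisor(seed, p):
    """Heuristic estimate of E(#hits | s hits): 1 + sum over shifts of
    p ** non_overlap(seed, shift)."""
    return 1.0 + sum(p ** non_overlap(seed, k) for k in range(1, len(seed)))
```

For shifts 1 to 4 the spaced seed gives non-overlap counts 6, 6, 6, 7, as in the table, whereas the BLAST seed gives 1, 2, 3, 4. At p = 0.7 the spaced-seed divisor is therefore smaller than the BLAST one.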
Improvements
• Brejova-Brown-Vinar (HMM) and Buhler-Keich-Sun
(Markov): The input sequence can be modeled by a
(hidden) Markov process instead of being i.i.d.
• Multiple seeds
• Brejova-Brown-Vinar: Vector seeds
• Csuros: Variable length seeds – e.g. shorter
seeds for rare query words.
Thank You


Editor's Notes

  • #3 Introduction: history; applications (homology search, BFEAST). Theory: finding seeds; sensitivity; NP-hard.
  • #5 Spelling correction: Research extends back to 1957, including spelling checkers for bitmap images of cursive writing and special applications to find records in databases in spite of incorrect entries. In 1961, Les Earnest, who headed the research on this budding technology, saw it necessary to include the first spell checker that accessed a list of 10,000 acceptable words. Ralph Gorin, a graduate student under Earnest at the time, created the first true spelling checker program written as an applications program (rather than research) for general English text: Spell for the DEC PDP-10 at Stanford University's Artificial Intelligence Laboratory, in February 1971. Gorin wrote SPELL in assembly language, for faster action; he made the first spelling corrector by searching the word list for plausible correct spellings that differ by a single letter or adjacent letter transpositions and presenting them to the user. Gorin made SPELL publicly accessible, as was done with most SAIL (Stanford Artificial Intelligence Laboratory) programs, and it soon spread around the world via the new ARPAnet, about ten years before personal computers came into general use. Spell, its algorithms and data structures inspired the Unix ispell program.
    Bitext word alignment: http://en.wikipedia.org/wiki/Bitext_word_alignment. Example: Italian "porre", Spanish "poner", "to put".
    Diff: Longest common subsequence problem. David Maier (1978). "The Complexity of Some Problems on Subsequences and Supersequences". J. ACM (ACM Press) 25 (2): 322–336. doi:10.1145/322063.322075. For the general case of an arbitrary number of input sequences, the problem is NP-hard. When the number of sequences is constant, the problem is solvable in polynomial time by dynamic programming.
    Robert A. Wagner and Michael J. Fischer (1974). "The String-to-String Correction Problem". Journal of the ACM 21 (1): 168–173. doi:10.1145/321796.321811. http://dl.acm.org/citation.cfm?id=321811. Possible applications are to the problems of automatic spelling correction and determining the longest subsequence of characters common to two strings.