Finding and aligning   related sequences           Martin C. FrithComputational Biology Research Center            AIST, T...
CBRCwww.cbrc.jp   2
Finding and aligning related sequences•   Examples•   What are we really trying to do?•   Simple sequences•   Repeat maski...
Compare human and mouse genomes            gctagtgtac            ||| || ||            gct--tgaac             aa-gtaca     ...
HumanMouse        5
Compare DNA from a patient              to a reference genome                                            tttaccagctgga    ...
What kinds of microbial genes                   are there?                                               tttaccagctgga    ...
More examples• Compare ancient DNA to a reference genome  – Mammoth, neanderthal, Turin Shroud, …• Align (potentially spli...
Finding and aligning related sequences•   Examples•   What are we really trying to do?•   Simple sequences•   Repeat maski...
What are we really trying to do?1. Find and align similar sequences?2. Find and align homologous sequences?3. Find and ali...
Homology, orthology, paralogy                Homology: descent from a common ancestor                                     ...
Example         β-globinhuman                    mouse           β1-globin   β2-globin                                   12
Example         β-globin         Orthologshuman                             mouse                     Paralogs            ...
Example                     • Orthology is not necessarily 1-to-1                     • Orthology is not transitive       ...
What are we really trying to do?1. Find and align similar sequences?2. Find and align homologous sequences?3. Find and ali...
Compare human and mouse genomes         What is the aim?         • Find similar sequences         • Find homologs         ...
Compare human and mouse genomes         What is the aim?         • Find similar sequences         • Find homologs         ...
Compare DNA from a patient              to a reference genome                                            tttaccagctgga    ...
Compare DNA from a patient              to a reference genome                                               tttaccagctgga ...
What are we really trying to do?1. Find and align similar sequences?2. Find and align homologous sequences?3. Find and ali...
Aims and algorithms• Sequence comparison algorithms basically  find similar sequences• Finding homologs is harder• Finding...
Similarity versus homology                       Homologous                        sequences                              ...
• The most frequent case of convergent  evolution is simple sequences                                         23
Finding and aligning related sequences•   Examples•   What are we really trying to do?•   Simple sequences•   Repeat maski...
Simple sequences• DNA (and RNA and protein) frequently has  simple sequences:atgatcgattatcgtagtctaggtcgtatgctatgattcgataaa...
How do simple sequences evolve?• Strand slippage during DNA replication:              catcatcatc              gtagtagtagta...
How do simple sequences evolve?• Strand slippage during DNA replication:              catcatcatca              gtagtagtagt...
How do simple sequences evolve?• Strand slippage during DNA replication:              catcatcatcat              gtagtagtag...
How do simple sequences evolve?• Strand slippage during DNA replication:              catcatcatcatc              gtagtagta...
How do simple sequences evolve?• Strand slippage during DNA replication:              catcatcatcatca              gtagtagt...
How do simple sequences evolve?• Strand slippage during DNA replication:              catcatcatcatcat              gtagtag...
How do simple sequences evolve?• Strand slippage during DNA replication:              catcat   cat              gtagtagtag...
How do simple sequences evolve?• Strand slippage during DNA replication:              catcat   catc              gtagtagta...
How do simple sequences evolve?• Strand slippage during DNA replication:              catcat   catca              gtagtagt...
How do simple sequences evolve?• Strand slippage during DNA replication:              catcat   catcat              gtagtag...
How do simple sequences evolve?• Strand slippage during DNA replication:              catcat   catcat              gtagtag...
How do simple sequences evolve?• An initial (short, mild) simple sequence occurs  by chance• Due to slippage, it gets long...
Homology between human and                  banana?Human  atatatatatatatatatatatatatatatatatatatata       ||||||||||||||||...
Avoiding non-homologous alignments        of simple sequences• The standard way is to identify and “mask”  them, before al...
Finding and aligning related sequences•   Examples•   What are we really trying to do?•   Simple sequences•   Repeat maski...
Repeat masking• There are standard “repeat masking” tools  – RepeatMasker, DustMasker, SegMasker, TRF, …• Most people just...
Repeat confusion Simple sequence:      atcttatgtctctctctctctctctctctggatgcttgaccac Interspersed repeat:cttgttattgctgatcgtc...
Test of avoiding non-homologous                    alignments • Compare two sequences after reversing one of them • Sequen...
Test result   The C. elegans genome versus the reversed P. pacificus genome,                after masking both with DustMa...
A spurious alignmentUpper sequence: from C. elegansLower sequence: from reversed P. pacificus        Conclusion: DustMaske...
Other methods?• SegMasker does not work either:Upper sequence: part of an animal proteinLower sequence: part of a reversed...
Repeat masking• There are standard “repeat masking” tools  – RepeatMasker, DustMasker, SegMasker, TRF, …• Most people just...
New repeat-masking method• tantan: http://www.cbrc.jp/tantan/• It looks for slippery regions in sequences• Slippery = simi...
tantan test result   The C. elegans genome versus the reversed P. pacificus genome,                     after masking with...
Conclusion• tantan prevents simple-sequence alignments• Without masking an excessive amount• It even works for extremely A...
Finding and aligning related sequences•   Examples•   What are we really trying to do?•   Simple sequences•   Repeat maski...
Classic score-based alignment1. Define a scoring scheme2. Find alignments with high (maximum) scores                      ...
Alignment scoring schemeSubstitution score matrix       Gap scores           a c g t      a    2 -3 -1 -3       Gap existe...
Alignment scoring scheme    Substitution score matrix                      Gap scores                a c g t           a  ...
Classic score-based alignment1. Define a scoring scheme2. Find alignments with high (maximum) scores  ctatgctacgtgaggtgtgg...
How to find alignments               with max score?  ctatgctacgtgaggtgtggc                        tacgtg--aggt           ...
Finding and aligning related sequences•   Examples•   What are we really trying to do?•   Simple sequences•   Repeat maski...
Alignment scoring scheme   Substitution score matrix       Gap scores              a c g t         a    2 -3 -1 -3Sxy =   ...
Scores are log likelihood ratios  Model of                               Probability of x aligned to yhomologous          ...
Different matrices for different tasksStrong similarities (~99% identity)   Weak similarities (~75% identity)             ...
What about gap scores?Pair hidden Markov modelThe arrows describe probabilities for insertions and deletions.(It looks mor...
A useful formula: Prob(alignment) ∝ exp(alignment score / t)
Finding and aligning related sequences•   Examples•   What are we really trying to do?•   Simple sequences•   Repeat maski...
Alignment ambiguityctagctaaccgtatcgtgggc||||| |   ||||| | ||ctagcca---gtatctagtgc         Orctagctaaccgtatcgtgggc|||||   |...
Per-column probabilities… g c a t         c c t       t   g g g t         c t     c g a c a t …… g c c t         c g t    ...
Importance… g c a t         c c t       t   g g g t         c t     c g a c a t …… g c c t         c g t       t   a g a -...
How to calculate ambiguity                  ctagctaaccgtatcgtgggc                  ||||| |   ||||| | ||                  c...
An aligner that indicates ambiguity                       Since 2008                http://last.cbrc.jp/       Warning: LA...
Finding and aligning related sequences•   Examples•   What are we really trying to do?•   Simple sequences•   Repeat maski...
Sequence quality data• Some DNA sequencers estimate the error  probability of every base      t      a      g      c      ...
General case: compare 2 sequences     with error probabilities      a      t      g      c      c     …     0.01   0.02   ...
Traditional sequence comparison                           Giga-sequencers                 Score matrix                    ...
An aligner that combinesscore matrix & quality data                  Since 2008           http://last.cbrc.jp/  Warning: L...
Finding and aligning related sequences•   Examples•   What are we really trying to do?•   Simple sequences•   Repeat maski...
Why is BLAST too slow?                         75
Why is BLAST too slow? 1. Find “seeds” (initial matches) of a fixed length (e.g. 11) 2. Try extending an alignment from ea...
Why is BLAST too slow?    1. Find “seeds” (initial matches) of a fixed length (e.g. 11)    2. Try extending an alignment f...
Example• Compare the human and chimp genomes• Each genome has ~ 1 million Alu elements• So we will get ~ 1012 seed matches...
Solution: adaptive seeds   1. Find “seeds” (initial matches) of a fixed length rareness   2. Try extending an alignment fr...
An aligner that uses adaptive seeds                       Since 2008                http://last.cbrc.jp/       Warning: LA...
LAST run times• Compare the human and chicken genomes  – 3.5 hours• Align 1 million length-87 DNA reads to the  human geno...
Error rate(% of aligned reads  that are wrong)     Simulated DNA readsSensitivity (% of reads that are correctly aligned) ...
Method                 Time (min) bwa                  16 bwa-n10              67 last                 41 last last ...
For more detail• Adaptive seeds tame genomic sequence comparison  Kiełbasa SM, Wan R, Sato K, Horton P, Frith MC  Genome R...
Summary• It is feasible to use classic, statistical alignment  approaches with large modern sequence datasets   – This is ...
Main collaboratorsPaul Horton         Michiaki Hamada     Szymon Kielbasa   CBRC             U of Tokyo / CBRC   Leiden Un...
Programming wisdom•   Measuring programming progress by lines of code is like measuring aircraft    building progress by w...
Finding and Aligning Related Sequences (Martin Frith)

Transcript of "Finding and Aligning Related Sequences (Martin Frith)"

  1. 1. Finding and aligning related sequences. Martin C. Frith, Computational Biology Research Center, AIST, Tokyo. www.cbrc.jp/~martin. 2012-12-09 @ BioinfoSummer, Adelaide
  2. 2. CBRC, www.cbrc.jp
  3. 3. Finding and aligning related sequences• Examples• What are we really trying to do?• Simple sequences• Repeat masking• Classic score-based alignment• Alignment & probability models• Alignment ambiguity• Using sequence quality data• Scaling to huge datasets 3
  4. 4. Compare human and mouse genomes
        gctagtgtac
        |||  || ||
        gct--tgaac

        aa-gtaca
        || |||||
        aaggtaca
  5. 5. [Figure: Human, Mouse]
  6. 6. Compare DNA from a patient to a reference genome
        Patient DNA → Sequencer → DNA reads:
          tttaccagctgga
          ctatgctagtcgta
          ccctagtcgtatgg
          cctatagtctgtatg
          ctagtcgtagtgtgg
          atatatatattatta
        Reference genome sequence:
          ctagcttatcgtctgattgcttatttacgttcgtatgctagctgatcgtagtcgtcgagcttatcgtgggc
  7. 7. What kinds of microbial genes are there?
        Water from e.g. a hot spring → DNA → Sequencer → DNA reads:
          tttaccagctgga
          ctatgctagtcgta
          ccctagtcgtatgg
          cctatagtctgtatg
          ctagtcgtagtgtgg
          atatatatattatta
        A read compared with the database of all known proteins:
          atatatatatattagccgt
          |||...||| |||...|||
          ArgLysTyrProPheLeuLeuIsoArgLysPheAlaPro-ProGlyGlyAlaGly…
          GlyGlyPhePheGlyAlaLeuCysCysTrpTrpAlaGlyAlaPro…
  8. 8. More examples• Compare ancient DNA to a reference genome – Mammoth, neanderthal, Turin Shroud, …• Align (potentially spliced) RNA sequences to a reference genome – To see which genes are active• Align short DNA reads to each other – In order to assemble them 8
  9. 9. Finding and aligning related sequences• Examples• What are we really trying to do?• Simple sequences• Repeat masking• Classic score-based alignment• Alignment & probability models• Alignment ambiguity• Using sequence quality data• Scaling to huge datasets 9
  10. 10. What are we really trying to do? 1. Find and align similar sequences? 2. Find and align homologous sequences? 3. Find and align orthologous sequences? 4. Find and align paralogous sequences?
  11. 11. Homology, orthology, paralogy. Homology: descent from a common ancestor. Orthology: descent from a common ancestor by genome division. Paralogy: descent from a common ancestor by duplication within a genome. [Figure: gene trees running from past to present]
  12. 12. Example. [Figure: human β-globin; mouse β1-globin and β2-globin]
  13. 13. Example. [Figure: human β-globin; mouse β1-globin and β2-globin. The human and mouse genes are labeled orthologs; the two mouse genes are labeled paralogs]
  14. 14. Example • Orthology is not necessarily 1-to-1 • Orthology is not transitive ⇒ Not an equivalence relation. [Figure: human β-globin is labeled an ortholog of both mouse β1-globin and mouse β2-globin]
  15. 15. What are we really trying to do? 1. Find and align similar sequences? 2. Find and align homologous sequences? 3. Find and align orthologous sequences? 4. Find and align paralogous sequences?
  16. 16. Compare human and mouse genomes What is the aim? • Find similar sequences • Find homologs • Find orthologs • Find paralogs 16
  17. 17. Compare human and mouse genomes What is the aim? • Find similar sequences • Find homologs • Find orthologs • Find paralogs Do we want to align mouse α-globin with human β-globin? Probably not 17
  18. 18. Compare DNA from a patient to a reference genome. Patient DNA → Sequencer → DNA reads: tttaccagctgga ctatgctagtcgta ccctagtcgtatgg cctatagtctgtatg ctagtcgtagtgtgg atatatatattatta. What is the aim? • Find similar sequences • Find homologs • Find orthologs • Find paralogs. Reference genome sequence: ctgattgcttatttacgttcgtatgctagctgatcgtagtcgtcgagcttatcgtgggc
  19. 19. Compare DNA from a patient to a reference genome. Patient DNA → Sequencer → DNA reads: tttaccagctgga ctatgctagtcgta ccctagtcgtatgg cctatagtctgtatg ctagtcgtagtgtgg atatatatattatta. What is the aim? • Find similar sequences • Find homologs • Find orthologs • Find paralogs. Do we want to align the patient’s α-globin to the reference’s β-globin? Reference genome sequence: ctgattgcttatttacgttcgtatgctagctgatcgtagtcgtcgagcttatcgtgggc
  20. 20. What are we really trying to do? 1. Find and align similar sequences? 2. Find and align homologous sequences? 3. Find and align orthologous sequences? 4. Find and align paralogous sequences?
  21. 21. Aims and algorithms• Sequence comparison algorithms basically find similar sequences• Finding homologs is harder• Finding orthologs is even harder 21
  22. 22. Similarity versus homology. [Venn diagram: homologous sequences and similar sequences overlap. Homologous but not similar: rapid evolution over a long time span. Similar but not homologous: convergent evolution]
  23. 23. • The most frequent case of convergent evolution is simple sequences 23
  24. 24. Finding and aligning related sequences• Examples• What are we really trying to do?• Simple sequences• Repeat masking• Classic score-based alignment• Alignment & probability models• Alignment ambiguity• Using sequence quality data• Scaling to huge datasets 24
  25. 25. Simple sequences• DNA (and RNA and protein) frequently has simple sequences:atgatcgattatcgtagtctaggtcgtatgctatgattcgataaaaaaaaaaaaaaaaaaacggtatgcgtagctgcgatcgtagtgactatatgagagaggattcgatgctaagttctctaggagaggcttaggctgagcgcgtatcactggctcgcggctgtgtgtgtgtgtgtgtgtgtgtgtgtgacgtatcgcacatcgtcgattttgagattcccgatggcc 25
  26. 26. How do simple sequences evolve?• Strand slippage during DNA replication: catcatcatc gtagtagtagtagtagta 26
  27. 27. How do simple sequences evolve?• Strand slippage during DNA replication: catcatcatca gtagtagtagtagtagta 27
  28. 28. How do simple sequences evolve?• Strand slippage during DNA replication: catcatcatcat gtagtagtagtagtagta 28
  29. 29. How do simple sequences evolve?• Strand slippage during DNA replication: catcatcatcatc gtagtagtagtagtagta 29
  30. 30. How do simple sequences evolve?• Strand slippage during DNA replication: catcatcatcatca gtagtagtagtagtagta 30
  31. 31. How do simple sequences evolve?• Strand slippage during DNA replication: catcatcatcatcat gtagtagtagtagtagta 31
  32. 32. How do simple sequences evolve?• Strand slippage during DNA replication: catcat cat gtagtagtagtagtagta 32
  33. 33. How do simple sequences evolve?• Strand slippage during DNA replication: catcat catc gtagtagtagtagtagta 33
  34. 34. How do simple sequences evolve?• Strand slippage during DNA replication: catcat catca gtagtagtagtagtagta 34
  35. 35. How do simple sequences evolve?• Strand slippage during DNA replication: catcat catcat gtagtagtagtagtagta 35
  36. 36. How do simple sequences evolve?• Strand slippage during DNA replication: catcat catcat gtagtagtagtagtagta On the top strand, it has got longer 36
  37. 37. How do simple sequences evolve?• An initial (short, mild) simple sequence occurs by chance• Due to slippage, it gets longer…• And longer… 37
  38. 38. Homology between human and banana?
        Human   atatatatatatatatatatatatatatatatatatatata
                |||||||||||||||||||||||||||||||||||||||||
        Banana  atatatatatatatatatatatatatatatatatatatata
        • Probably not.
  39. 39. Avoiding non-homologous alignments of simple sequences• The standard way is to identify and “mask” them, before alignment 39
  40. 40. Finding and aligning related sequences• Examples• What are we really trying to do?• Simple sequences• Repeat masking• Classic score-based alignment• Alignment & probability models• Alignment ambiguity• Using sequence quality data• Scaling to huge datasets 40
  41. 41. Repeat masking• There are standard “repeat masking” tools – RepeatMasker, DustMasker, SegMasker, TRF, …• Most people just assume they work 41
  42. 42. Repeat confusion
        Simple sequence:     atcttatgtctctctctctctctctctctggatgcttgaccac
        Interspersed repeat: cttgttattgctgatcgtcctctctgtaaattgttattgctgatcatgctttaac
        They are both called “repeats”, but they are rather different. Don’t confuse them.
  43. 43. Test of avoiding non-homologous alignments
        • Compare two sequences after reversing one of them
        • Sequences never evolve by reversal, so there are no true homologs in this test
        • But repeats may still cause strong similarities, if they are not suppressed
        Human   atatatatatatatatatatatatatatatatatatatata
                |||||||||||||||||||||||||||||||||||||||||
        Banana  atatatatatatatatatatatatatatatatatatatata
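The reversal test can be illustrated with a toy check. This is only a sketch: the function names are hypothetical, and counting shared k-mers stands in for a real alignment step. It shows that reversing a simple sequence (reversal, not reverse-complementing) leaves it just as similar to the unreversed one.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(seq1, seq2, k=8):
    """Count distinct k-mers that occur in both sequences."""
    return len(kmers(seq1, k) & kmers(seq2, k))


human = "at" * 20                    # simple sequence, as on the slide
banana_reversed = ("at" * 20)[::-1]  # reversed, not reverse-complemented

# Reversal turns "atat..." into "tata...", which still contains "atat...".
print(shared_kmers(human, banana_reversed))
```

Any similarity found after reversal cannot be homology, so a nonzero count here is purely an effect of the repeat.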
  44. 44. Test result. The C. elegans genome versus the reversed P. pacificus genome, after masking both with DustMasker. [Figure. Red: observed number of alignments. Black: expected number of alignments for random sequences (E-value)]
  45. 45. A spurious alignment. [Figure. Upper sequence: from C. elegans. Lower sequence: from reversed P. pacificus] Conclusion: DustMasker fails to mask some tandem repeats
  46. 46. Other methods? • SegMasker does not work either: [Figure. Upper sequence: part of an animal protein. Lower sequence: part of a reversed plant protein] • Nor does RepeatMasker, TRF… Reference: A new repeat-masking method enables specific detection of homologous sequences. Frith MC. Nucleic Acids Research 2011 39:e23.
  47. 47. Repeat masking• There are standard “repeat masking” tools – RepeatMasker, DustMasker, SegMasker, TRF, …• Most people just assume they work – Cargo cult science • Genomic bioinformatics is riddled with it 47
  48. 48. New repeat-masking method • tantan: http://www.cbrc.jp/tantan/ • It looks for slippery regions in sequences • Slippery = similar to shifted versions of itself • It integrates similarity at different slip distances, using a Forward-Backward algorithm. Reference: A new repeat-masking method enables specific detection of homologous sequences. Frith MC. Nucleic Acids Research 2011 39:e23.
  49. 49. tantan test result. The C. elegans genome versus the reversed P. pacificus genome, after masking with tantan. [Figure. Red: observed number of alignments. Black: expected number of alignments for random sequences (E-value)]
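The notion of "slippery = similar to shifted versions of itself" can be made concrete with a crude measure. The sketch below is not tantan's algorithm (which, per the slide, integrates over slip distances with a Forward-Backward algorithm); it is a simplified stand-in with hypothetical function names that only compares a sequence with copies of itself shifted by a few positions.

```python
def shift_similarity(seq, shift):
    """Fraction of positions i where seq[i] equals seq[i - shift]."""
    if not 0 < shift < len(seq):
        raise ValueError("shift must be between 1 and len(seq) - 1")
    matches = sum(seq[i] == seq[i - shift] for i in range(shift, len(seq)))
    return matches / (len(seq) - shift)


def slipperiness(seq, max_shift=4):
    """Highest self-similarity over shifts 1..max_shift (a crude proxy)."""
    limit = min(max_shift, len(seq) - 1)
    return max(shift_similarity(seq, d) for d in range(1, limit + 1))


print(slipperiness("ctctctctctctctctct"))   # tandem repeat: high
print(slipperiness("atgatcgattatcgtagtc"))  # ordinary sequence: lower
```

A real masker has to weigh evidence across all shifts and positions at once, which is what a probabilistic model provides and this toy score does not.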
  50. 50. Conclusion• tantan prevents simple-sequence alignments• Without masking an excessive amount• It even works for extremely AT-rich DNA – Plasmodium falciparum (malaria): 80% AT – Dictyostelium discoideum (slime mould): 80% AT 50
  51. 51. Finding and aligning related sequences• Examples• What are we really trying to do?• Simple sequences• Repeat masking• Classic score-based alignment• Alignment & probability models• Alignment ambiguity• Using sequence quality data• Scaling to huge datasets 51
  52. 52. Classic score-based alignment 1. Define a scoring scheme 2. Find alignments with high (maximum) scores
  53. 53. Alignment scoring scheme
        Substitution score matrix:
               a   c   g   t
           a   2  -3  -1  -3
           c  -3   2  -3  -1
           g  -1  -3   2  -3
           t  -3  -1  -3   2
        Gap scores:
           Gap existence cost: 5
           Gap extension cost: 1
  54. 54. Alignment scoring scheme
        Substitution score matrix:
               a   c   g   t
           a   2  -3  -1  -3
           c  -3   2  -3  -1
           g  -1  -3   2  -3
           t  -3  -1  -3   2
        Gap scores:
           Gap existence cost: 5
           Gap extension cost: 1
        Example:
           t  a  c  g  t  g  -  -  a  g  g  t
           |  |  |     |  |        |  |  |  |
           t  a  c  a  t  g  c  t  a  g  g  t
          +2 +2 +2 -1 +2 +2   -7  +2 +2 +2 +2
        Alignment score: 10
  55. 55. Classic score-based alignment 1. Define a scoring scheme 2. Find alignments with high (maximum) scores
        Sequences:
           ctatgctacgtgaggtgtggc
           attacatgctaggtccac
        Alignment:
           tacgtg--aggt
           ||| ||  ||||
           tacatgctaggt
  56. 56. How to find alignments with max score?
        Sequences:
           ctatgctacgtgaggtgtggc
           attacatgctaggtccac
        Alignment:
           tacgtg--aggt
           ||| ||  ||||
           tacatgctaggt
        • Smith-Waterman algorithm – Exact: guarantees to find the max score – A bit slow
        • BLAST, FASTA, etc – Heuristic: no guarantee – Faster
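The worked example on this slide can be reproduced in code. The sketch below uses the slide's matrix and gap costs; the function name and the convention that a gap of length n costs existence + n × extension are assumptions, chosen because they give the slide's −7 for the two-letter gap.

```python
SCORES = {  # substitution score matrix from the slide
    "a": {"a": 2, "c": -3, "g": -1, "t": -3},
    "c": {"a": -3, "c": 2, "g": -3, "t": -1},
    "g": {"a": -1, "c": -3, "g": 2, "t": -3},
    "t": {"a": -3, "c": -1, "g": -3, "t": 2},
}
GAP_EXIST = 5   # charged once per gap
GAP_EXTEND = 1  # charged once per gap letter


def alignment_score(top, bottom):
    """Score a gapped alignment given as two equal-length rows."""
    if len(top) != len(bottom):
        raise ValueError("rows must have equal length")
    score = 0
    open_gap = None  # which row currently has an open gap, if any
    for x, y in zip(top, bottom):
        if x == "-" and y == "-":
            raise ValueError("column with two gaps")
        if x == "-" or y == "-":
            row = "top" if x == "-" else "bottom"
            if open_gap != row:
                score -= GAP_EXIST  # a new gap starts here
            score -= GAP_EXTEND
            open_gap = row
        else:
            score += SCORES[x][y]
            open_gap = None
    return score


print(alignment_score("tacgtg--aggt", "tacatgctaggt"))  # 10
```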
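For readers who want to see what "exact" means here, the following is a minimal sketch of Smith-Waterman local alignment scoring with affine gap costs (the Gotoh-style recursion). It returns only the maximum score, not the alignment. The function name is hypothetical, the matrix and gap costs are the ones from the slides, and the gap convention (existence + length × extension) is the same assumption as in the worked example.

```python
SCORES = {  # substitution score matrix from the slides
    "a": {"a": 2, "c": -3, "g": -1, "t": -3},
    "c": {"a": -3, "c": 2, "g": -3, "t": -1},
    "g": {"a": -1, "c": -3, "g": 2, "t": -3},
    "t": {"a": -3, "c": -1, "g": -3, "t": 2},
}


def smith_waterman_score(s, t, scores=SCORES, gap_exist=5, gap_extend=1):
    """Maximum local alignment score, using O(len(t)) memory."""
    neg_inf = float("-inf")
    n = len(t)
    h_prev = [0] * (n + 1)        # best score of an alignment ending at (i-1, j)
    f_prev = [neg_inf] * (n + 1)  # same, but ending in a gap that consumes s
    best = 0
    for i in range(1, len(s) + 1):
        h_cur = [0] * (n + 1)
        f_cur = [neg_inf] * (n + 1)
        e = neg_inf               # ending in a gap that consumes t
        for j in range(1, n + 1):
            # Extend an existing gap, or open a new one from the best state.
            e = max(e, h_cur[j - 1] - gap_exist) - gap_extend
            f_cur[j] = max(f_prev[j], h_prev[j] - gap_exist) - gap_extend
            diag = h_prev[j - 1] + scores[s[i - 1]][t[j - 1]]
            h_cur[j] = max(0, diag, e, f_cur[j])  # 0 = start a fresh alignment
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


print(smith_waterman_score("ctatgctacgtgaggtgtggc", "attacatgctaggtccac"))
```

The run time grows with the product of the two sequence lengths, which is the reason the slide calls it "a bit slow" and why heuristics such as BLAST exist.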
  57. 57. Finding and aligning related sequences• Examples• What are we really trying to do?• Simple sequences• Repeat masking• Classic score-based alignment• Alignment & probability models• Alignment ambiguity• Using sequence quality data• Scaling to huge datasets 57
  58. 58. Alignment scoring scheme
        Substitution score matrix Sxy:
               a   c   g   t
           a   2  -3  -1  -3
           c  -3   2  -3  -1
           g  -1  -3   2  -3
           t  -3  -1  -3   2
        Gap scores:
           Gap existence cost: 5
           Gap extension cost: 1
        • Where do these scores come from?
        • Why is this a good method anyway?
  59. 59. Scores are log likelihood ratios
        $$S_{xy} = t \times \log\left(\frac{A_{xy}}{P_x \times Q_y}\right)$$
        Model of homologous sequences: A_xy = probability of x aligned to y in a true alignment.
        Model of independent sequences: P_x = probability of x in the first sequence, Q_y = probability of y in the second sequence.
  60. 60. Different matrices for different tasks
        Strong similarities (~99% identity):
               a   c   g   t
           a   1  -3  -3  -3
           c  -3   1  -3  -3
           g  -3  -3   1  -3
           t  -3  -3  -3   1
        Weak similarities (~75% identity):
               a   c   g   t
           a   1  -1  -1  -1
           c  -1   1  -1  -1
           g  -1  -1   1  -1
           t  -1  -1  -1   1
        AT-rich DNA (e.g. malaria):
               a   c   g   t
           a   2  -3  -2  -3
           c  -3   5  -3  -2
           g  -2  -3   5  -3
           t  -3  -2  -3   2
        Bisulfite-converted DNA:
               a   c   g   t
           a   2  -6  -6  -6
           c  -6   2  -6   1
           g  -6  -6   2  -6
           t  -6  -6  -6   1
  61. 61. What about gap scores? Pair hidden Markov model. [Figure: state diagram] The arrows describe probabilities for insertions and deletions. (It looks more complicated than it really is.)
  62. 62. A useful formula
        $$\mathrm{Prob}(\text{alignment}) \propto \exp(\text{alignment score} / t)$$
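The log likelihood ratio can be turned into a score matrix once the two models are specified. The sketch below is an illustration under stated assumptions: uniform base frequencies, 75% identity with all 12 mismatch types equally likely, and a scale factor t = 1/ln 3. None of these values are given on the slide, and the function name is hypothetical. Under these assumptions the formula yields +1 for matches and −1 for mismatches, the same values as the matrix the talk lists for weak similarities (~75% identity).

```python
import math


def substitution_scores(pair_probs, p, q, t=1.0):
    """S[x][y] = t * ln(A_xy / (P_x * Q_y)) for every letter pair."""
    return {
        x: {y: t * math.log(pair_probs[x][y] / (p[x] * q[y])) for y in q}
        for x in p
    }


bases = "acgt"
identity = 0.75  # assumed fraction of aligned pairs that are identical

# Assumed model of homologous sequences: the four kinds of match share
# `identity` equally, and the 12 kinds of mismatch share the remainder.
A = {
    x: {y: identity / 4 if x == y else (1 - identity) / 12 for y in bases}
    for x in bases
}
P = Q = {x: 0.25 for x in bases}  # assumed uniform base frequencies

S = substitution_scores(A, P, Q, t=1 / math.log(3))
print(round(S["a"]["a"], 6), round(S["a"]["c"], 6))
```

Changing `identity` or the base frequencies changes the matrix, which is the point of the slide on different matrices for different tasks.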
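As a small illustration of this proportionality, the sketch below converts a handful of alignment scores into relative probabilities. Normalising over a short list of candidates is an assumption made for brevity (the true normaliser sums over all possible alignments), and the function name and the example scores are hypothetical.

```python
import math


def alignment_probabilities(scores, t):
    """Normalise exp(score / t) over a list of candidate alignment scores."""
    top = max(scores)  # subtract the maximum to keep exp() well-behaved
    weights = [math.exp((s - top) / t) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


# Two alignments tied at score 10 and one worse alignment at score 7.
probs = alignment_probabilities([10, 10, 7], t=1.0)
print(probs)
```

Equal scores give equal probabilities, and a score difference of d changes the probability by a factor of exp(d / t).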
  63. 63. Finding and aligning related sequences• Examples• What are we really trying to do?• Simple sequences• Repeat masking• Classic score-based alignment• Alignment & probability models• Alignment ambiguity• Using sequence quality data• Scaling to huge datasets 63
  64. 64. Alignment ambiguity
           ctagctaaccgtatcgtgggc
           ||||| |   |||||  | ||
           ctagcca---gtatctagtgc
        Or
           ctagctaaccgtatcgtgggc
           |||||   | |||||  | ||
           ctagc---cagtatctagtgc
        ?
  65. 65. Per-column probabilities
        …  g   c   a   t   c   c   t   t   g   g   g   t   c   t   c   g   a   c   a   t  …
        …  g   c   c   t   c   g   t   t   a   g   a   -   -   t   a   g   a   t   a   g  …
          .99 .99 .99 .95 .93 .92 .90 .79 .55 .33 .16 .22 .49 .55 .59 .71 .93 .97 .98 .99
  66. 66. Importance
        …  g   c   a   t   c   c   t   t   g   g   g   t   c   t   c   g   a   c   a   t  …
        …  g   c   c   t   c   g   t   t   a   g   a   -   -   t   a   g   a   t   a   g  …
          .99 .99 .99 .95 .93 .92 .90 .79 .55 .33 .16 .22 .49 .55 .59 .71 .93 .97 .98 .99
        • Column reliability is important for: – Studying the evolution of binding sites – Identifying polymorphisms – Finding recombination breakpoints – …
  67. 67. How to calculate ambiguity
           ctagctaaccgtatcgtgggc
           ||||| |   |||||  | ||
           ctagcca---gtatctagtgc
        Prob(column) = sum of probs of all alignments that include the column
        $$\mathrm{Prob}(\text{column}) = \frac{\sum_{\text{alignments that include the column}} \exp(\text{score} / t)}{\sum_{\text{all alignments}} \exp(\text{score} / t)}$$
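The ratio of sums can be evaluated literally for very short sequences by enumerating every alignment. The sketch below does that by brute force; a practical aligner would use dynamic programming instead. It assumes global alignment, a simple match/mismatch/linear-gap scoring scheme, and hypothetical function names, none of which come from the slides.

```python
import math


def all_alignments(n, m):
    """Yield every global alignment of sequences of lengths n and m as a
    list of columns (i, j); None in place of i or j marks a gap."""
    def extend(i, j, cols):
        if i == n and j == m:
            yield list(cols)
            return
        if i < n and j < m:
            yield from extend(i + 1, j + 1, cols + [(i, j)])
        if i < n:
            yield from extend(i + 1, j, cols + [(i, None)])
        if j < m:
            yield from extend(i, j + 1, cols + [(None, j)])
    yield from extend(0, 0, [])


def score_alignment(cols, s, t_seq, match=1, mismatch=-1, gap=-2):
    """Score one alignment with an assumed simple scoring scheme."""
    total = 0
    for i, j in cols:
        if i is None or j is None:
            total += gap
        else:
            total += match if s[i] == t_seq[j] else mismatch
    return total


def column_probability(s, t_seq, i, j, temperature=1.0):
    """Probability that s[i] is aligned to t_seq[j], summed over alignments."""
    included = 0.0
    everything = 0.0
    for cols in all_alignments(len(s), len(t_seq)):
        weight = math.exp(score_alignment(cols, s, t_seq) / temperature)
        everything += weight
        if (i, j) in cols:
            included += weight
    return included / everything


print(column_probability("acgt", "acgt", 0, 0))  # first letters aligned
print(column_probability("acgt", "acgt", 0, 1))  # an off-diagonal column
```

The number of alignments grows very quickly with sequence length, so this enumeration is only usable for a handful of letters.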
  68. 68. An aligner that indicates ambiguity Since 2008 http://last.cbrc.jp/ Warning: LAST was made by me and colleagues 68
  69. 69. Finding and aligning related sequences• Examples• What are we really trying to do?• Simple sequences• Repeat masking• Classic score-based alignment• Alignment & probability models• Alignment ambiguity• Using sequence quality data• Scaling to huge datasets 69
  70. 70. Sequence quality data
        • Some DNA sequencers estimate the error probability of every base
             t     a     g     c     t     g     a
           0.01  0.02  0.07  0.24  0.32  0.75  0.75
        • We ought to use this information when comparing sequences
  71. 71. General case: compare 2 sequences with error probabilities
             a     t     g     c     c   …
           0.01  0.02  0.02  0.09  0.17
             g     t     a     c     c   …
           0.03  0.01  0.08  0.12  0.44
  72. 72. Traditional sequence comparison: score matrix (real substitutions: mutation / evolution)
               a   c   g   t
           a   2  -3  -1  -3
           c  -3   2  -3  -1
           g  -1  -3   2  -3
           t  -3  -1  -3   2
        Giga-sequencers: error probabilities (erroneous substitutions)
             t     a     g     c     t
           0.01  0.02  0.07  0.24  0.32
        Log likelihood ratio:
        $$S_{xy} = \log\left[\frac{A_{xy}}{B_x B_y}\right]$$
        Generalized log likelihood ratio:
        $$S_{xpyq} = \log\left[pq\,\frac{A_{xy}}{B_x B_y} + (1 - pq)\right]$$
        Reference: Incorporating sequence quality data into alignment improves DNA read mapping. Frith MC, Wan R, Horton P. Nucleic Acids Research 2010 38:e100
  73. 73. An aligner that combines score matrix & quality data. Since 2008. http://last.cbrc.jp/ Warning: LAST was made by me and colleagues
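One reading of the generalized log likelihood ratio on this slide is S = log[pq · A_xy / (B_x B_y) + (1 − pq)]. The sketch below evaluates that expression. It treats p and q as weights between 0 and 1 attached to the two bases; interpreting them as the probability that each base is reliable is an assumption, not something the slide states, and the function name is hypothetical.

```python
import math


def quality_aware_score(ratio, p, q):
    """log(p*q*ratio + (1 - p*q)), where ratio = A_xy / (B_x * B_y).

    p and q are assumed to lie in [0, 1]; their exact relation to the
    sequencer's error probabilities is not specified here.
    """
    if not (0 <= p <= 1 and 0 <= q <= 1):
        raise ValueError("p and q must be between 0 and 1")
    return math.log(p * q * ratio + (1 - p * q))


match_ratio = 3.0  # hypothetical likelihood ratio for a matching pair
print(quality_aware_score(match_ratio, 1.0, 1.0))  # same as log(ratio)
print(quality_aware_score(match_ratio, 0.5, 1.0))  # shrinks toward 0
print(quality_aware_score(match_ratio, 0.0, 1.0))  # no information: 0
```

With p = q = 1 the expression reduces to the traditional log likelihood ratio, and as pq approaches 0 the score approaches 0, so unreliable bases contribute little evidence either way.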
  74. 74. Finding and aligning related sequences• Examples• What are we really trying to do?• Simple sequences• Repeat masking• Classic score-based alignment• Alignment & probability models• Alignment ambiguity• Using sequence quality data• Scaling to huge datasets 74
  75. 75. Why is BLAST too slow? 75
  76. 76. Why is BLAST too slow? 1. Find “seeds” (initial matches) of a fixed length (e.g. 11) 2. Try extending an alignment from each seed
        …atcgtatcgtatcgtactgctggcctagtggggga…
        …ctcgtcgatgctagtcgtactgctgatgctatatatatattaatg…
  77. 77. Why is BLAST too slow? 1. Find “seeds” (initial matches) of a fixed length (e.g. 11) 2. Try extending an alignment from each seed
        …atcgtatcgtatcgtactgctggcctagtggggga…
        …ctcgtcgatgctagtcgtactgctgatgctatatatatattaatg…
        Problem: non-uniform composition (atatatatatatatatatata, Alu, isochores, LINEs, SINEs, CpG islands) → too many seeds → too many extensions → too slow
  78. 78. Example • Compare the human and chimp genomes • Each genome has ~1 million Alu elements • So we will get ~10^12 seed matches…
        Problem: non-uniform composition (atatatatatatatatatata, Alu, isochores, LINEs, SINEs, CpG islands) → too many seeds → too many extensions → too slow
  79. 79. Solution: adaptive seeds 1. Find “seeds” (initial matches) of a fixed rareness (instead of a fixed length) 2. Try extending an alignment from each seed
        …atcgtatcgtatcgtactgctggcctagtggggga…
        …ctcgtcgatgctagtcgtactgctgatgctatatatatattaatg…
        Adaptive seeds can be found efficiently by using a suffix array
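Step 1 can be sketched with a simple k-mer index. This is an illustration of fixed-length seeding in general, not BLAST's actual implementation; the function name is hypothetical. The two sequences are the ones shown on the slide, and the second call shows how a simple repeat inflates the number of seeds.

```python
from collections import defaultdict


def fixed_length_seeds(query, reference, k=11):
    """Return (query_pos, reference_pos) pairs whose length-k substrings
    match exactly."""
    index = defaultdict(list)  # k-mer -> start positions in the reference
    for j in range(len(reference) - k + 1):
        index[reference[j:j + k]].append(j)
    seeds = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], ()):
            seeds.append((i, j))
    return seeds


query = "atcgtatcgtatcgtactgctggcctagtggggga"
reference = "ctcgtcgatgctagtcgtactgctgatgctatatatatattaatg"
print(fixed_length_seeds(query, reference))

# A short simple repeat against itself already yields many seeds.
print(len(fixed_length_seeds("at" * 10, "at" * 10)))
```

Every seed triggers an extension attempt in step 2, so the seed count is what drives the run time.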
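A minimal sketch of the adaptive-seed idea follows: from each query position, lengthen the match until it occurs at most `max_hits` times in the reference. The suffix array is built naively by sorting suffixes, which is only suitable for short text, and the function names and the `max_hits` parameter are hypothetical rather than taken from LAST.

```python
def build_suffix_array(text):
    """Naive suffix array: start positions sorted by their suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def occurrence_range(text, sa, pattern):
    """Half-open range of suffix-array slots whose suffix starts with pattern."""
    k = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first slot whose length-k prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first slot whose length-k prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def adaptive_seeds(query, reference, max_hits=2):
    """Return (query_pos, reference_pos, length) for the shortest match
    from each query position that occurs at most max_hits times."""
    sa = build_suffix_array(reference)
    seeds = []
    for i in range(len(query)):
        for length in range(1, len(query) - i + 1):
            start, end = occurrence_range(reference, sa, query[i:i + length])
            if end == start:
                break  # no occurrence at all: no seed from this position
            if end - start <= max_hits:
                seeds.extend((i, sa[slot], length) for slot in range(start, end))
                break
    return seeds


reference = "atatatatatatatatgattaca"
query = "atatgatt"
print(adaptive_seeds(query, reference))
```

In repetitive regions the seeds become longer, and in unusual regions they stay short, so the number of seeds per query position stays bounded.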
  80. 80. An aligner that uses adaptive seeds Since 2008 http://last.cbrc.jp/ Warning: LAST was made by me and colleagues 80
  81. 81. LAST run times• Compare the human and chicken genomes – 3.5 hours• Align 1 million length-87 DNA reads to the human genome – 6 minutes• (Using 1 CPU core) 81
  82. 82. [Figure: simulated DNA reads. Axes: error rate (% of aligned reads that are wrong), sensitivity (% of reads that are correctly aligned), run time (minutes)]
  83. 83. Run times
        Method               Time (min)
        bwa                  16
        bwa-n10              67
        last                 41
        last
        last
        novoalign            518
        shrimp2              ?
        stampy               72
        stampy (sensitive)   248
  84. 84. For more detail
        • Adaptive seeds tame genomic sequence comparison. Kiełbasa SM, Wan R, Sato K, Horton P, Frith MC. Genome Research 2011 21:487
        • Incorporating sequence quality data into alignment improves DNA read mapping. Frith MC, Wan R, Horton P. Nucleic Acids Research 2010 38:e100
        • A mostly traditional approach improves alignment of bisulfite-converted DNA. Frith MC, Mori R, Asai K. Nucleic Acids Research 2012 40:e100
  85. 85. Summary• It is feasible to use classic, statistical alignment approaches with large modern sequence datasets – This is beneficial for modeling: diverged sequences, biased base frequencies, etc.• Alignment ambiguity should be used more often• Try to avoid cargo cult science! 85
  86. 86. Main collaborators: Paul Horton (CBRC), Michiaki Hamada (U of Tokyo / CBRC), Szymon Kielbasa (Leiden University)
  87. 87. Programming wisdom
        • Measuring programming progress by lines of code is like measuring aircraft building progress by weight. – Bill Gates
        • As you’re about to add a comment, ask yourself, ‘How can I improve the code so that this comment isn’t needed?’ – Steve McConnell
        • The key to performance is elegance, not battalions of special cases. – Jon Bentley and M. Douglas McIlroy
        • Weeks of programming can save you hours of planning. – Unknown
        • Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live. – Unknown