Finding and Aligning Related Sequences (Martin Frith)
Transcript

  • 1. Finding and aligning related sequences
      Martin C. Frith, Computational Biology Research Center, AIST, Tokyo
      www.cbrc.jp/~martin
      2012-12-09 @ BioinfoSummer, Adelaide
  • 2. CBRC, www.cbrc.jp
  • 3. Finding and aligning related sequences
      • Examples
      • What are we really trying to do?
      • Simple sequences
      • Repeat masking
      • Classic score-based alignment
      • Alignment & probability models
      • Alignment ambiguity
      • Using sequence quality data
      • Scaling to huge datasets
  • 4. Compare human and mouse genomes
          gctagtgtac
          |||  || ||
          gct--tgaac

          aa-gtaca
          || |||||
          aaggtaca
  • 5. [Figure labelled "Human" and "Mouse"]
  • 6. Compare DNA from a patient to a reference genome
      Patient DNA → Sequencer → DNA reads:
          tttaccagctgga
          ctatgctagtcgta
          ccctagtcgtatgg
          cctatagtctgtatg
          ctagtcgtagtgtgg
          atatatatattatta
      Reference genome sequence:
          ctagcttatcgtctgattgcttatttacgttcgtatgctagctgatcgtagtcgtcgagcttatcgtgggc
  • 7. What kinds of microbial genes are there?
      Water from e.g. a hot spring → DNA → Sequencer → DNA reads:
          tttaccagctgga
          ctatgctagtcgta
          ccctagtcgtatgg
          cctatagtctgtatg
          ctagtcgtagtgtgg
          atatatatattatta
      A read compared with the database of all known proteins:
          atatatatatattagccgt
          |||...||| |||...|||
          ArgLysTyrProPheLeuLeuIsoArgLysPheAlaPro-ProGlyGlyAlaGly…
          GlyGlyPhePheGlyAlaLeuCysCysTrpTrpAlaGlyAlaPro…
  • 8. More examples
      • Compare ancient DNA to a reference genome
          – Mammoth, Neanderthal, Turin Shroud, …
      • Align (potentially spliced) RNA sequences to a reference genome
          – To see which genes are active
      • Align short DNA reads to each other
          – In order to assemble them
  • 10. What are we really trying to do?
      1. Find and align similar sequences?
      2. Find and align homologous sequences?
      3. Find and align orthologous sequences?
      4. Find and align paralogous sequences?
  • 11. Homology, orthology, paralogy
      • Homology: descent from a common ancestor
      • Orthology: descent from a common ancestor by genome division
      • Paralogy: descent from a common ancestor by duplication within a genome
      [Figure with a time axis running from "Past" to "Present"]
  • 12. Example
      [Figure labelled: β-globin, human, mouse, β1-globin, β2-globin]
  • 13. Example
      [Same figure, with "Orthologs" and "Paralogs" marked]
  • 14. Example
      • Orthology is not necessarily 1-to-1
      • Orthology is not transitive, so it is not an equivalence relation
      [Same figure, with "Orthologs" marked twice]
  • 16–17. Compare human and mouse genomes
      What is the aim?
      • Find similar sequences
      • Find homologs
      • Find orthologs
      • Find paralogs
      Do we want to align mouse α-globin with human β-globin? Probably not.
  • 18–19. Compare DNA from a patient to a reference genome
      Patient DNA → Sequencer → DNA reads:
          tttaccagctgga
          ctatgctagtcgta
          ccctagtcgtatgg
          cctatagtctgtatg
          ctagtcgtagtgtgg
          atatatatattatta
      Reference genome sequence:
          ctgattgcttatttacgttcgtatgctagctgatcgtagtcgtcgagcttatcgtgggc
      What is the aim?
      • Find similar sequences
      • Find homologs
      • Find orthologs
      • Find paralogs
      Do we want to align the patient’s α-globin to the reference’s β-globin?
  • 21. Aims and algorithms
      • Sequence comparison algorithms basically find similar sequences
      • Finding homologs is harder
      • Finding orthologs is even harder
  • 22. Similarity versus homology
      [Figure: "Homologous sequences" and "Similar sequences" as overlapping sets. Homologous but not similar: rapid evolution over a long time span. Similar but not homologous: convergent evolution.]
  • 23. The most frequent case of convergent evolution is simple sequences
  • 25. Simple sequences
      • DNA (and RNA and protein) frequently has simple sequences:
          atgatcgattatcgtagtctaggtcgtatgctatgattcgataaaaaaaaaaaaaaaaaaacggtatgcgtagctgcgatcgtagtgactatatgagagaggattcgatgctaagttctctaggagaggcttaggctgagcgcgtatcactggctcgcggctgtgtgtgtgtgtgtgtgtgtgtgtgtgacgtatcgcacatcgtcgattttgagattcccgatggcc
  • 26–36. How do simple sequences evolve?
      • Strand slippage during DNA replication (animation; the bottom strand is gtagtagtagtagtagta in every frame, and the top strand changes as follows):
          26. catcatcatc
          27. catcatcatca
          28. catcatcatcat
          29. catcatcatcatc
          30. catcatcatcatca
          31. catcatcatcatcat
          32. catcat cat
          33. catcat catc
          34. catcat catca
          35. catcat catcat
          36. catcat catcat (On the top strand, it has got longer)
  • 37. How do simple sequences evolve?
      • An initial (short, mild) simple sequence occurs by chance
      • Due to slippage, it gets longer…
      • And longer…
  • 38. Homology between human and banana?
          Human  atatatatatatatatatatatatatatatatatatatata
                 |||||||||||||||||||||||||||||||||||||||||
          Banana atatatatatatatatatatatatatatatatatatatata
      • Probably not.
  • 39. Avoiding non-homologous alignments of simple sequences
      • The standard way is to identify and “mask” them, before alignment
  • 41. Repeat masking
      • There are standard “repeat masking” tools
          – RepeatMasker, DustMasker, SegMasker, TRF, …
      • Most people just assume they work
  • 42. Repeat confusion
      Simple sequence:
          atcttatgtctctctctctctctctctctggatgcttgaccac
      Interspersed repeat:
          cttgttattgctgatcgtcctctctgtaaattgttattgctgatcatgctttaac
      They are both called “repeats”, but they are rather different. Don’t confuse them.
  • 43. Test of avoiding non-homologous alignments
      • Compare two sequences after reversing one of them
      • Sequences never evolve by reversal, so there are no true homologs in this test
      • But repeats may still cause strong similarities, if they are not suppressed
          Human  atatatatatatatatatatatatatatatatatatatata
                 |||||||||||||||||||||||||||||||||||||||||
          Banana atatatatatatatatatatatatatatatatatatatata
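A minimal sketch of why the reversal test still lets simple sequences through. The helper names `reverse_only` and `longest_shared_run` are hypothetical, and the longest shared exact run is only a crude stand-in for a real aligner. Note that the test reverses the sequence without complementing it.

```python
def reverse_only(seq: str) -> str:
    """Reverse a sequence WITHOUT complementing it (not a reverse complement)."""
    return seq[::-1]


def longest_shared_run(a: str, b: str) -> int:
    """Length of the longest exact substring shared by a and b.

    Simple O(len(a) * len(b)) dynamic programming; fine for short examples only.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


# A simple sequence, like the human/banana example on the slide.
simple = "at" * 20
# Reversal cannot create true homology, yet the repeat still matches itself.
print(longest_shared_run(simple, reverse_only(simple)))  # 39
```

Any strong match reported in such a test is spurious by construction, which is what makes it a useful check on a masking tool.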
  • 44. Test result
      The C. elegans genome versus the reversed P. pacificus genome, after masking both with DustMasker:
      [Plot. Red: observed number of alignments. Black: expected number of alignments for random sequences (E-value)]
  • 45. A spurious alignment
      [Alignment figure. Upper sequence: from C. elegans. Lower sequence: from reversed P. pacificus]
      Conclusion: DustMasker fails to mask some tandem repeats
  • 46. Other methods?
      • SegMasker does not work either:
          [Alignment figure. Upper sequence: part of an animal protein. Lower sequence: part of a reversed plant protein]
      • Nor does RepeatMasker, TRF…
      A new repeat-masking method enables specific detection of homologous sequences. Frith MC. Nucleic Acids Research 2011 39:e23.
  • 47. Repeat masking
      • There are standard “repeat masking” tools
          – RepeatMasker, DustMasker, SegMasker, TRF, …
      • Most people just assume they work
          – Cargo cult science
              • Genomic bioinformatics is riddled with it
  • 48. New repeat-masking method
      • tantan: http://www.cbrc.jp/tantan/
      • It looks for slippery regions in sequences
      • Slippery = similar to shifted versions of itself
      • It integrates similarity at different slip distances, using a Forward-Backward algorithm
      A new repeat-masking method enables specific detection of homologous sequences. Frith MC. Nucleic Acids Research 2011 39:e23.
  • 49. tantan test result
      The C. elegans genome versus the reversed P. pacificus genome, after masking with tantan:
      [Plot. Red: observed number of alignments. Black: expected number of alignments for random sequences (E-value)]
  • 50. Conclusion
      • tantan prevents simple-sequence alignments
      • Without masking an excessive amount
      • It even works for extremely AT-rich DNA
          – Plasmodium falciparum (malaria): 80% AT
          – Dictyostelium discoideum (slime mould): 80% AT
  • 52. Classic score-based alignment
      1. Define a scoring scheme
      2. Find alignments with high (maximum) scores
  • 53–54. Alignment scoring scheme
      Substitution score matrix:
              a   c   g   t
          a   2  -3  -1  -3
          c  -3   2  -3  -1
          g  -1  -3   2  -3
          t  -3  -1  -3   2
      Gap scores:
          Gap existence cost: 5
          Gap extension cost: 1
      Example:
           t  a  c  g  t  g  -  -  a  g  g  t
           |  |  |     |  |        |  |  |  |
           t  a  c  a  t  g  c  t  a  g  g  t
          +2 +2 +2 -1 +2 +2  [-7]  +2 +2 +2 +2
      Alignment score: 10
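The arithmetic in this example can be checked with a short sketch. It assumes, as the example total implies, that a gap of length k costs the existence cost plus k times the extension cost (5 + 2 × 1 = 7 for the two-letter gap). The function name `alignment_score` is hypothetical.

```python
# Substitution scores and gap costs copied from the slide.
SCORES = {
    "a": {"a": 2, "c": -3, "g": -1, "t": -3},
    "c": {"a": -3, "c": 2, "g": -3, "t": -1},
    "g": {"a": -1, "c": -3, "g": 2, "t": -3},
    "t": {"a": -3, "c": -1, "g": -3, "t": 2},
}
GAP_EXISTENCE = 5
GAP_EXTENSION = 1


def alignment_score(top: str, bottom: str) -> int:
    """Score a gapped alignment given as two equal-length rows ('-' marks a gap)."""
    assert len(top) == len(bottom)
    score = 0
    gap_row = None  # which row the current run of gaps is in, if any
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            row = "top" if x == "-" else "bottom"
            if gap_row != row:
                score -= GAP_EXISTENCE  # charged once per gap
            score -= GAP_EXTENSION  # charged for every gap position
            gap_row = row
        else:
            score += SCORES[x][y]
            gap_row = None
    return score


print(alignment_score("tacgtg--aggt", "tacatgctaggt"))  # 10
```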
  • 55. Classic score-based alignment
      1. Define a scoring scheme
      2. Find alignments with high (maximum) scores
      Sequences:
          ctatgctacgtgaggtgtggc
          attacatgctaggtccac
      Alignment:
          tacgtg--aggt
          ||| ||  ||||
          tacatgctaggt
  • 56. How to find alignments with max score?
      Sequences:
          ctatgctacgtgaggtgtggc
          attacatgctaggtccac
      Alignment:
          tacgtg--aggt
          ||| ||  ||||
          tacatgctaggt
      • Smith-Waterman algorithm
          – Exact: guarantees to find the max score
          – A bit slow
      • BLAST, FASTA, etc
          – Heuristic: no guarantee
          – Faster
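As an illustration of the exact approach, here is a minimal sketch of Smith-Waterman local alignment with the slide's scoring scheme. It uses the standard three-matrix (Gotoh-style) recurrence for affine gaps and returns only the maximum score, not the alignment itself. It is not the code of any particular tool, and the function name is hypothetical. Its running time is proportional to the product of the two sequence lengths, which is why it is "a bit slow" for genomes.

```python
SCORES = {
    "a": {"a": 2, "c": -3, "g": -1, "t": -3},
    "c": {"a": -3, "c": 2, "g": -3, "t": -1},
    "g": {"a": -1, "c": -3, "g": 2, "t": -3},
    "t": {"a": -3, "c": -1, "g": -3, "t": 2},
}
GAP_EXISTENCE = 5
GAP_EXTENSION = 1
NEG_INF = float("-inf")


def smith_waterman_score(s: str, u: str):
    """Maximum local alignment score of s and u with affine gap costs."""
    n, m = len(s), len(u)
    # best[i][j]: best score of a local alignment ending at s[:i], u[:j] (or 0).
    best = [[0] * (m + 1) for _ in range(n + 1)]
    # gap_s[i][j]: best score ending with a letter of u aligned to a gap.
    gap_s = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    # gap_u[i][j]: best score ending with a letter of s aligned to a gap.
    gap_u = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    top = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            gap_s[i][j] = max(gap_s[i][j - 1],
                              best[i][j - 1] - GAP_EXISTENCE) - GAP_EXTENSION
            gap_u[i][j] = max(gap_u[i - 1][j],
                              best[i - 1][j] - GAP_EXISTENCE) - GAP_EXTENSION
            match = best[i - 1][j - 1] + SCORES[s[i - 1]][u[j - 1]]
            best[i][j] = max(0, match, gap_s[i][j], gap_u[i][j])
            top = max(top, best[i][j])
    return top


# The slide's alignment scores 10, so the maximum is at least 10.
print(smith_waterman_score("ctatgctacgtgaggtgtggc", "attacatgctaggtccac"))
```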
  • 58. Alignment scoring scheme
      Substitution score matrix, Sxy =
              a   c   g   t
          a   2  -3  -1  -3
          c  -3   2  -3  -1
          g  -1  -3   2  -3
          t  -3  -1  -3   2
      Gap scores:
          Gap existence cost: 5
          Gap extension cost: 1
      • Where do these scores come from?
      • Why is this a good method anyway?
  • 59. Scores are log likelihood ratios
      $$S_{xy} = t \times \log\left(\frac{A_{xy}}{P_x \times Q_y}\right)$$
      • Numerator (model of homologous sequences): $A_{xy}$ = probability of x aligned to y in a true alignment
      • Denominator (model of independent sequences): $P_x$ = probability of x in the first sequence, $Q_y$ = probability of y in the second sequence
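A worked sketch of this formula, under illustrative assumptions that are not from the slides: uniform base frequencies (0.25 each), 75% identity in true alignments, and mismatches spread evenly over the other three bases. With the scale factor t = 1 / ln 3, these assumptions give +1 for a match and -1 for a mismatch.

```python
import math

BASES = "acgt"


def llr_scores(joint, p, q, t):
    """S[x][y] = t * log(A_xy / (P_x * Q_y)) for every pair of letters."""
    return {x: {y: t * math.log(joint[x][y] / (p[x] * q[y])) for y in q}
            for x in p}


# Assumed model of homologous sequences: 75% identity, uniform mismatches.
identity = 0.75
joint = {x: {y: identity / 4 if x == y else (1 - identity) / 12
             for y in BASES} for x in BASES}
# Assumed model of independent sequences: uniform base frequencies.
background = {x: 0.25 for x in BASES}

scores = llr_scores(joint, background, background, t=1 / math.log(3))
print(round(scores["a"]["a"], 6), round(scores["a"]["c"], 6))  # 1.0 -1.0
```

Changing `identity`, the background frequencies, or `t` yields a different matrix, which is the sense in which different tasks call for different matrices.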
  • 60. Different matrices for different tasks
      Strong similarities (~99% identity):
              a   c   g   t
          a   1  -3  -3  -3
          c  -3   1  -3  -3
          g  -3  -3   1  -3
          t  -3  -3  -3   1
      Weak similarities (~75% identity):
              a   c   g   t
          a   1  -1  -1  -1
          c  -1   1  -1  -1
          g  -1  -1   1  -1
          t  -1  -1  -1   1
      AT-rich DNA (e.g. malaria):
              a   c   g   t
          a   2  -3  -2  -3
          c  -3   5  -3  -2
          g  -2  -3   5  -3
          t  -3  -2  -3   2
      Bisulfite-converted DNA:
              a   c   g   t
          a   2  -6  -6  -6
          c  -6   2  -6   1
          g  -6  -6   2  -6
          t  -6  -6  -6   1
  • 61. What about gap scores?
      Pair hidden Markov model
      [Figure: state diagram] The arrows describe probabilities for insertions and deletions. (It looks more complicated than it really is.)
  • 62. A useful formula
      $$\mathrm{Prob}(\text{alignment}) \propto \exp(\text{alignment score} / t)$$
  • 64. Alignment ambiguity
          ctagctaaccgtatcgtgggc
          ||||| |   |||||  | ||
          ctagcca---gtatctagtgc
      Or
          ctagctaaccgtatcgtgggc
          |||||   | |||||  | ||
          ctagc---cagtatctagtgc
      ?
  • 65–66. Per-column probabilities / Importance
          …  g   c   a   t   c   c   t   t   g   g   g   t   c   t   c   g   a   c   a   t  …
          …  g   c   c   t   c   g   t   t   a   g   a   -   -   t   a   g   a   t   a   g  …
            .99 .99 .99 .95 .93 .92 .90 .79 .55 .33 .16 .22 .49 .55 .59 .71 .93 .97 .98 .99
      • Column reliability is important for:
          – Studying the evolution of binding sites
          – Identifying polymorphisms
          – Finding recombination breakpoints
          – …
  • 67. How to calculate ambiguity
          ctagctaaccgtatcgtgggc
          ||||| |   |||||  | ||
          ctagcca---gtatctagtgc
      Prob(column) = sum of probs of all alignments that include the column
      $$\mathrm{Prob}(\text{column}) = \frac{\sum_{\text{alignments that include the column}} \exp(\text{score} / t)}{\sum_{\text{all alignments}} \exp(\text{score} / t)}$$
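The ratio above can be computed without enumerating alignments, by summing over prefixes (forward) and suffixes (backward). The sketch below is a deliberately simplified illustration and not the method of any particular aligner: it assumes global alignments, a single illustrative match/mismatch score, and a linear gap cost of 6 per gap position, none of which come from the slides.

```python
import math

MATCH, MISMATCH, GAP = 2, -3, 6  # illustrative values (assumptions)


def column_probabilities(s: str, u: str, t: float = 1.0):
    """prob[i][j] = probability that s[i] is aligned to u[j].

    Each alignment is weighted by exp(score / t), as in the formula above.
    """
    n, m = len(s), len(u)
    gap_w = math.exp(-GAP / t)

    def pair_w(x, y):
        return math.exp((MATCH if x == y else MISMATCH) / t)

    # fwd[i][j]: summed weight of all alignments of s[:i] with u[:j].
    fwd = [[0.0] * (m + 1) for _ in range(n + 1)]
    fwd[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                fwd[i][j] += fwd[i - 1][j - 1] * pair_w(s[i - 1], u[j - 1])
            if i > 0:
                fwd[i][j] += fwd[i - 1][j] * gap_w
            if j > 0:
                fwd[i][j] += fwd[i][j - 1] * gap_w

    # bwd[i][j]: summed weight of all alignments of s[i:] with u[j:].
    bwd = [[0.0] * (m + 1) for _ in range(n + 1)]
    bwd[n][m] = 1.0
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i < n and j < m:
                bwd[i][j] += pair_w(s[i], u[j]) * bwd[i + 1][j + 1]
            if i < n:
                bwd[i][j] += gap_w * bwd[i + 1][j]
            if j < m:
                bwd[i][j] += gap_w * bwd[i][j + 1]

    total = fwd[n][m]  # sum of exp(score / t) over all alignments
    return [[fwd[i][j] * pair_w(s[i], u[j]) * bwd[i + 1][j + 1] / total
             for j in range(m)] for i in range(n)]


probs = column_probabilities("ctagct", "ctagcc")
print([round(probs[i][i], 3) for i in range(6)])
```

Columns in unambiguous regions get probabilities near 1, while columns next to gaps or mismatches tend to get lower values.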
  • 68. An aligner that indicates ambiguity
      LAST, since 2008: http://last.cbrc.jp/
      Warning: LAST was made by me and colleagues
  • 70. Sequence quality data
      • Some DNA sequencers estimate the error probability of every base
          t     a     g     c     t     g     a
          0.01  0.02  0.07  0.24  0.32  0.75  0.75
      • We ought to use this information when comparing sequences
  • 71. General case: compare 2 sequences with error probabilities
          a     t     g     c     c     …
          0.01  0.02  0.02  0.09  0.17

          g     t     a     c     c     …
          0.03  0.01  0.08  0.12  0.44
  • 72. Traditional sequence comparison and giga-sequencers
      Traditional sequence comparison: score matrix, for real substitutions (mutation / evolution)
              a   c   g   t
          a   2  -3  -1  -3
          c  -3   2  -3  -1
          g  -1  -3   2  -3
          t  -3  -1  -3   2
      $$S_{xy} = \log\left[\frac{A_{xy}}{B_x B_y}\right]$$
      Giga-sequencers: error probabilities, for erroneous substitutions
          t     a     g     c     t
          0.01  0.02  0.07  0.24  0.32
      Generalized log likelihood ratio:
      $$S_{xpyq} = \log\left[pq \, \frac{A_{xy}}{B_x B_y} + (1 - pq)\right]$$
      Incorporating sequence quality data into alignment improves DNA read mapping. Frith MC, Wan R, Horton P. Nucleic Acids Research 2010 38:e100
  • 73. An aligner that combines score matrix & quality data
      LAST, since 2008: http://last.cbrc.jp/
      Warning: LAST was made by me and colleagues
  • 75–77. Why is BLAST too slow?
      1. Find “seeds” (initial matches) of a fixed length (e.g. 11)
      2. Try extending an alignment from each seed
          …atcgtatcgtatcgtactgctggcctagtggggga…
          …ctcgtcgatgctagtcgtactgctgatgctatatatatattaatg…
      Problem: non-uniform composition (atatatatatatatatatata, Alu, isochores, LINEs, SINEs, CpG islands) → too many seeds → too many extensions → too slow
  • 78. Example
      • Compare the human and chimp genomes
      • Each genome has ~1 million Alu elements
      • So we will get ~10^12 seed matches (10^6 × 10^6)…
      Problem: non-uniform composition (atatatatatatatatatata, Alu, isochores, LINEs, SINEs, CpG islands) → too many seeds → too many extensions → too slow
  • 79. Solution: adaptive seeds
      1. Find “seeds” (initial matches) of a fixed rareness (instead of a fixed length)
      2. Try extending an alignment from each seed
          …atcgtatcgtatcgtactgctggcctagtggggga…
          …ctcgtcgatgctagtcgtactgctgatgctatatatatattaatg…
      Adaptive seeds can be found efficiently by using a suffix array
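A toy sketch of the "fixed rareness" idea: starting at each query position, lengthen the match until it occurs at most `max_hits` times in the reference. This is an illustration only, not LAST's implementation. The suffix array is built naively, the function names and the `max_hits` parameter are hypothetical, and the example sequences are made up.

```python
def build_suffix_array(text: str):
    """Naive suffix array: start positions sorted by suffix (toy sizes only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def narrow(text, sa, lo, hi, depth, c):
    """Shrink sa[lo:hi] to the suffixes whose letter at offset `depth` is c.

    All suffixes in sa[lo:hi] share their first `depth` letters, so they are
    sorted by the letter at offset `depth` (suffixes that end early sort first).
    """
    def char_at(k):
        pos = sa[k] + depth
        return text[pos] if pos < len(text) else ""

    a, b = lo, hi
    while a < b:  # first index whose letter is >= c
        mid = (a + b) // 2
        if char_at(mid) < c:
            a = mid + 1
        else:
            b = mid
    new_lo, b = a, hi
    while a < b:  # first index whose letter is > c
        mid = (a + b) // 2
        if char_at(mid) <= c:
            a = mid + 1
        else:
            b = mid
    return new_lo, a


def adaptive_seeds(query: str, ref: str, max_hits: int):
    """For each query start, the shortest match with at most max_hits hits."""
    sa = build_suffix_array(ref)
    seeds = []
    for start in range(len(query)):
        lo, hi, length = 0, len(sa), 0
        while hi - lo > max_hits and start + length < len(query):
            lo, hi = narrow(ref, sa, lo, hi, length, query[start + length])
            length += 1
        if length > 0 and 0 < hi - lo <= max_hits:
            seeds.append((start, length, sorted(sa[lo:hi])))
    return seeds


seeds = adaptive_seeds("atatgctag", "atatatatatgctagtcgta", max_hits=1)
print(seeds[0])  # (0, 5, [6])
```

In the repetitive "atat…" region the seed has to grow to length 5 before it is unique, whereas the seed starting at "gc" is unique at length 2: seed length adapts to local repetitiveness.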
  • 80. An aligner that uses adaptive seeds
      LAST, since 2008: http://last.cbrc.jp/
      Warning: LAST was made by me and colleagues
  • 81. LAST run times
      • Compare the human and chicken genomes
          – 3.5 hours
      • Align 1 million length-87 DNA reads to the human genome
          – 6 minutes
      • (Using 1 CPU core)
  • 82. Simulated DNA reads
      [Plot. Labels: error rate (% of aligned reads that are wrong); sensitivity (% of reads that are correctly aligned); run time (minutes)]
  • 83. Run times
          Method               Time (min)
          bwa                  16
          bwa-n10              67
          last                 41
          last
          last
          novoalign            518
          shrimp2              ?
          stampy               72
          stampy (sensitive)   248
  • 84. For more detail
      • Adaptive seeds tame genomic sequence comparison. Kiełbasa SM, Wan R, Sato K, Horton P, Frith MC. Genome Research 2011 21:487
      • Incorporating sequence quality data into alignment improves DNA read mapping. Frith MC, Wan R, Horton P. Nucleic Acids Research 2010 38:e100
      • A mostly traditional approach improves alignment of bisulfite-converted DNA. Frith MC, Mori R, Asai K. Nucleic Acids Research 2012 40:e100
  • 85. Summary
      • It is feasible to use classic, statistical alignment approaches with large modern sequence datasets
          – This is beneficial for modeling: diverged sequences, biased base frequencies, etc.
      • Alignment ambiguity should be used more often
      • Try to avoid cargo cult science!
  • 86. Main collaborators
      • Paul Horton (CBRC)
      • Michiaki Hamada (U of Tokyo / CBRC)
      • Szymon Kielbasa (Leiden University)
  • 87. Programming wisdom
      • Measuring programming progress by lines of code is like measuring aircraft building progress by weight. – Bill Gates
      • As you’re about to add a comment, ask yourself, ‘How can I improve the code so that this comment isn’t needed?’ – Steve McConnell
      • The key to performance is elegance, not battalions of special cases. – Jon Bentley and M. Douglas McIlroy
      • Weeks of programming can save you hours of planning. – Unknown
      • Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live. – Unknown