Sequence Alignment by Information Compression

A presentation based on Minh Cao's 2010 paper "A genome alignment algorithm based on compression"

1. Sequence Alignment by Information Compression. Nacho Caballero
2. Outline: Traditional Alignments; Probability and Information; Alignment by Compression
4. Traditional alignments can’t handle low-complexity regions. [Figure: genomic sequence in which blocks such as AAGCAGAATTTAACATGTGGTTTGCTCA alternate with masked blocks shown as runs of N.] 50% of the human genome is masked.
5. Traditional scoring schemes don’t reflect sequence bias (GC content, GC skew). Example scheme: match +8, mismatch -4, gap -3.
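To make the point about fixed scoring concrete, here is a minimal Python sketch that scores an already-aligned pair of sequences with the values shown on the slide. The function name `score_alignment`, the use of `-` for a gap, and the choice to charge the gap penalty once per gapped column are assumptions made for illustration; the slide only gives the three numbers.

```python
# Scores taken from the slide; everything else here is illustrative.
MATCH, MISMATCH, GAP = 8, -4, -3

def score_alignment(row_a, row_b):
    """Sum column scores for two aligned rows of equal length ('-' marks a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += GAP        # assumed: flat penalty per gapped column
        elif a == b:
            total += MATCH
        else:
            total += MISMATCH
    return total

# A match scores the same whatever the base composition of the genome,
# which is the bias problem the slide points at.
print(score_alignment("ATAT", "ATAT"))  # four matches
print(score_alignment("GCGC", "GCGC"))  # four matches, same score
print(score_alignment("AC-T", "ACGT"))  # three matches and one gap
```

Because the scores are constants, an A/T match in an AT-rich genome earns exactly as much as a rarer G/C match, even though the latter is less likely to occur by chance.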
6. Traditional alignments lack an objective function to measure quality
7. Probability and Information
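8. Information and probability are two sides of the same coin: $I(\text{event}) = \log_2 \frac{1}{p(\text{event})} = -\log_2 p(\text{event})$. [Figure: information (bits) against the probability that the event occurs; marked points: p = .05 gives 4.3 bits, p = .25 gives 2 bits, p = .5 gives 1 bit, p = 1 gives 0 bits.]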
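The formula on this slide can be checked numerically with a few lines of Python. The function name `information_bits` is hypothetical; the probabilities are the ones marked on the slide's plot.

```python
import math

def information_bits(p):
    """Information content, in bits, of an event that occurs with probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return math.log2(1 / p)

# Probabilities marked on the slide's plot.
for p in (0.05, 0.25, 0.5, 1.0):
    print(f"p = {p}: {information_bits(p):.1f} bits")
```

Rarer events carry more information, and a certain event (p = 1) carries none.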
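9. Information and probability are two sides of the same coin: $I(\text{event}) = \log_2 \frac{1}{p(\text{event})} = -\log_2 p(\text{event})$. Example sequences: AAAAAAAAAAAAAAAA…, AAAAAAATTTTTTTTT…, ATGCACTACTAACGGA…. [Figure: information (bits) against the probability that the event occurs; marked points: p = 1 gives 0 bits, p = .5 gives 1 bit, p = .25 gives 2 bits, the maximum in DNA.]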
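As a rough illustration of why the three example sequences sit at different points on the plot, the sketch below estimates the average information per base from each sequence's own base frequencies. This is an assumption-laden simplification: it uses only the 16 bases visible on the slide (the sequences are truncated there), ignores context between neighbouring bases, and the function name `bits_per_base` is hypothetical.

```python
import math
from collections import Counter

def bits_per_base(seq):
    """Average information per base under the sequence's own base frequencies."""
    counts = Counter(seq)
    n = len(seq)
    # Each base with count c has probability c/n and costs log2(n/c) bits.
    return sum((c / n) * math.log2(n / c) for c in counts.values())

# Only the prefixes visible on the slide; the originals continue past them.
for seq in ("AAAAAAAAAAAAAAAA", "AAAAAAATTTTTTTTT", "ATGCACTACTAACGGA"):
    print(seq, round(bits_per_base(seq), 2))
```

The all-A prefix costs 0 bits per base, the A/T prefix comes out just under 1 bit because it holds 7 A's and 9 T's rather than an even split, and the mixed prefix approaches the 2-bit ceiling for a four-letter alphabet.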
10. Compression encodes symbols using a probability distribution. Example: AAAAAACGGG. [Figure: code trees with leaves labelled A, C, G, T.] Encodings shown: 00000000000001101010 (20 bits) and 11011010 (8 bits).
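The link between a probability distribution and code length can be sketched in Python. The 2-bit table below (A=00, C=01, G=10, T=11) is an assumption; it happens to reproduce the 20-bit string on the slide. The code behind the slide's 8-bit string cannot be recovered from the transcript, so it is not reproduced here. Instead, `ideal_bits` (a hypothetical name) reports the total cost if each base were charged $-\log_2$ of its frequency in the sequence.

```python
import math
from collections import Counter

# Assumed fixed-length code: every base costs 2 bits regardless of frequency.
FIXED_CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}

def encode_fixed(seq):
    """Encode a DNA string with the fixed 2-bit table above."""
    return "".join(FIXED_CODE[base] for base in seq)

def ideal_bits(seq):
    """Total bits if each base costs -log2 of its frequency in seq."""
    counts = Counter(seq)
    n = len(seq)
    return sum(c * math.log2(n / c) for c in counts.values())

seq = "AAAAAACGGG"
print(encode_fixed(seq), len(encode_fixed(seq)), "bits")
print(round(ideal_bits(seq), 2), "bits under the sequence's own base frequencies")
```

Frequent symbols (A here) get short codes and rare ones (C) get long codes, so a skewed distribution compresses below the flat 2 bits per base.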
11. Alignment by Compression
12. Homologous sequences share information. [Figure: query sequence encoded with a Markov expert; labelled quantity: I(Query).]
13. Homologous sequences share information. [Figure: query sequence encoded with a Markov expert and an align expert; labelled quantities: I(Query), I(Query | Reference), mutual information.]
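The three quantities labelled on this slide are related in the standard way: the mutual information is what the reference saves when encoding the query, I(Query) - I(Query | Reference). The Python sketch below illustrates that relation with a deliberately simple stand-in, an adaptive order-2 Markov model with add-one smoothing that can be primed with counts from a reference. It is not the expert model used by XMAligner; `encoding_cost`, its parameters, and the toy sequences are all assumptions for illustration.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def encoding_cost(query, prior="", order=2):
    """Bits needed to encode `query` with an adaptive order-`order` Markov
    model, optionally primed with context counts from `prior`."""
    counts = defaultdict(lambda: defaultdict(int))

    # Prime the model with the reference, if one is given.
    for i in range(order, len(prior)):
        counts[prior[i - order:i]][prior[i]] += 1

    bits = 0.0
    for i, base in enumerate(query):
        context = query[max(0, i - order):i]
        seen = counts[context]
        total = sum(seen.values())
        # Add-one smoothing keeps unseen bases at a non-zero probability.
        p = (seen[base] + 1) / (total + len(ALPHABET))
        bits += -math.log2(p)
        if i >= order:
            seen[base] += 1  # learn from the query as it is encoded
    return bits

reference = "ACGTAGCTCAAT"   # toy reference
query = "ACGTATCTCAAT"       # toy query: the reference with one substitution

alone = encoding_cost(query)                    # plays the role of I(Query)
given = encoding_cost(query, prior=reference)   # plays the role of I(Query | Reference)
print(f"alone: {alone:.2f} bits, given reference: {given:.2f} bits")
print(f"shared information: {alone - given:.2f} bits")
```

When the query resembles the reference, priming the model lowers the encoding cost, and the size of that saving is the signal a compression-based aligner can use to call a region homologous.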
15. XMAligner wins on distantly related, biased sequences. [Figure: comparison plot with axes Specificity and Sensitivity.]
16. XMAligner is the most sensitive at detecting exons
17. XMAligner detecting a gene cluster. [Figure: Plasmodium gene cluster.]
18. Alignment by compression overcomes the limitations of traditional alignment, producing better results in distantly related or biased sequences