Sequence Alignment by Information Compression

A presentation based on Minh Cao's 2010 paper "A genome alignment algorithm based on compression"

  1. Sequence Alignment by Information Compression. Nacho Caballero
  2. Outline: Traditional Alignments; Probability and Information; Alignment by Compression
  3. Traditional Alignments
  4. Traditional alignments can’t handle low-complexity regions. [Figure: lines of DNA sequence alternating with masked lines written as runs of N.] 50% of the human genome is masked.
  5. Traditional scoring schemes don’t reflect sequence bias (GC content, GC skew). Example scoring scheme: match +8, mismatch -4, gap -3.
  6. Traditional alignments lack an objective function to measure quality
  7. Probability and Information
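  8. Information and probability are two sides of the same coin: $I(\text{event}) = \log_2 \frac{1}{p(\text{event})} = -\log_2 p(\text{event})$. [Plot: information against the probability that the event occurs, with 4.3 bits at p = .05, 2 bits at p = .25, and 1 bit at p = .5; the probability axis ends at 1.]
  9. Information and probability are two sides of the same coin: $I(\text{event}) = \log_2 \frac{1}{p(\text{event})} = -\log_2 p(\text{event})$. [Plot with the example sequences AAAAAAAAAAAAAAAA…, AAAAAAATTTTTTTTT…, and ATGCACTACTAACGGA…: the information in a base A is 2 bits at p = .25 (the maximum in DNA), 1 bit at p = .5, and 0 bits at p = 1.]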
  10. Compression encodes symbols using a probability distribution. [Figure: the sequence AAAAAACGGG over the symbols A, C, G, T, shown encoded as 00000000000001101010 (20 bits) and as 11011010 (8 bits).]
  11. Alignment by Compression
  12. Homologous sequences share information. [Figure labels: Markov Expert; I(Query). The dinucleotide pairs drawn in the diagram are omitted.]
  13. Homologous sequences share information. [Figure labels: Markov Expert; Align Expert; I(Query); I(Query | Reference); Mutual Information. The dinucleotide pairs drawn in the diagram are omitted.]
  14. Homologous sequences share information. [Same figure labels as slide 13.]
  15. XMAligner wins on distantly related, biased sequences. [Plots: Specificity; Sensitivity.]
  16. XMAligner is the most sensitive at detecting exons
  17. XMAligner detecting a gene cluster. [Figure: Plasmodium gene cluster.]
  18. Alignment by compression overcomes the limitations of traditional alignment, producing better results in distantly related or biased sequences
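
The formula on slides 8 and 9 and the encoding idea on slide 10 can be tried out with a short Python sketch. It is only an illustration, not the method from the paper or from XMAligner: the function names are hypothetical, the 2-bit mapping (A=00, C=01, G=10, T=11) is an assumption chosen because it reproduces the 20-bit string on slide 10, and the "ideal cost" simply sums the information of each base under the sequence's own base frequencies. It does not attempt to reproduce the 8-bit string shown on the slide.

```python
import math
from collections import Counter


def information_bits(p: float) -> float:
    """Information in bits of an event with probability p, as on slide 8."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return math.log2(1.0 / p)


def fixed_code(seq: str) -> str:
    """Encode DNA with a fixed 2-bit code per base.

    The mapping below is an assumption, not taken from the slides.
    """
    table = {"A": "00", "C": "01", "G": "10", "T": "11"}
    return "".join(table[base] for base in seq)


def ideal_cost_bits(seq: str) -> float:
    """Sum of per-base information under the sequence's own base frequencies."""
    counts = Counter(seq)
    total = len(seq)
    return sum(information_bits(counts[base] / total) for base in seq)


if __name__ == "__main__":
    for p in (0.05, 0.25, 0.5, 1.0):
        print(f"p = {p}: {information_bits(p):.1f} bits")

    seq = "AAAAAACGGG"
    encoded = fixed_code(seq)
    print(encoded, len(encoded), "bits with the fixed code")
    print(f"{ideal_cost_bits(seq):.2f} bits under the empirical base frequencies")
```

Because A is far more common than C or G in this sequence, the frequency-based cost comes out below the 20 bits of the fixed code, which is the gap a compressor exploits when its probability distribution matches the data.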
