Sequence Alignment by Information Compression
Nacho Caballero
Traditional Alignments
Probability and Information
Alignment by Compression

Traditional Alignments
Traditional alignments can’t handle low-complexity regions


NNNNNNNNNNNNNNNNNNNNNNNNNNNN    AAGCAGAATTTAACATGTGGTTTGCTCA
NNNNNNNNNNNNNNNNNNNNNNNNNNNN    TTTGTTCTTTATCGCATCTTTTGAAAAC
NNNNNNNNNNNNNNNNNNNNNNNNNNNN    GCTATCGAAATAGCAGTACCTTCAGACT
NNNNNNNNNNNNNNNNNNNNNNNNNNNN    TTTTCCGAATACAGTTTAGCCAAAAATA
NNNNNNNNNNNNNNNNNNNNNNNNNNNN    TCAAGAAAAGCTTGAGCGCAAGTTCCTC
NNNNNNNNNNNNNNNNNNNNNNNNNNNN    GAACTTTCTGGACACCCCATTAAACTTT
NNNNNNNNNNNNNNNNNNNNNNNNNNNN!   TGTTTGCCGTTAAAAAAGGTACTTATCT!


50% of the human genome is masked
Traditional scoring schemes don’t reflect sequence bias

[Figure: GC content and GC skew]

Match     +8
Mismatch  -4
Gap       -3
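
To make the point about sequence bias concrete, here is a minimal sketch that computes GC content and GC skew and applies the fixed scores from the slide. The function names and the two test sequences are hypothetical, and GC skew is taken to be (G - C) / (G + C), the usual definition; the slide itself does not define it.

```python
MATCH, MISMATCH, GAP = 8, -4, -3  # scores shown on the slide


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_skew(seq: str) -> float:
    """(G - C) / (G + C); returns 0.0 when the sequence has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if g + c else 0.0


def fixed_score(a: str, b: str) -> int:
    """Score two already-aligned, equal-length strings; '-' marks a gap."""
    if len(a) != len(b):
        raise ValueError("aligned strings must have equal length")
    score = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            score += GAP
        elif x == y:
            score += MATCH
        else:
            score += MISMATCH
    return score


# Hypothetical sequences: one AT-rich, one with balanced composition.
at_rich = "AAAATTTTAAAATTTT"
balanced = "ACGTACGTACGTACGT"

print(gc_content(at_rich), gc_content(balanced))
print(fixed_score(at_rich, at_rich), fixed_score(balanced, balanced))
```

Both self-alignments receive the same score even though the two sequences differ sharply in composition. The fixed scheme has no way to say that a run of matching A's in an AT-rich background is less surprising than the same run in a balanced one.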
Traditional alignments lack an objective function to measure quality
Probability and Information
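Information and probability are two sides of the same coin

$$I(\text{event}) = \log_2 \frac{1}{p(\text{event})} = -\log_2 p(\text{event})$$

[Figure: information (bits) plotted against the probability that the event occurs, from 0 to 1. Marked points: p = .05 gives 4.3 bits, p = .25 gives 2 bits, p = .5 gives 1 bit.]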
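
The marked points on the curve can be checked directly from the formula. This is a small sketch; the function name `information_bits` is an assumption made for illustration.

```python
import math


def information_bits(p: float) -> float:
    """Information, in bits, carried by an event of probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return math.log2(1.0 / p)


# Probabilities marked on the slide's x-axis.
for p in (0.05, 0.25, 0.5, 1.0):
    print(f"p = {p}: {information_bits(p):.1f} bits")
```

The rarer the event, the more bits it carries; a certain event (p = 1) carries none.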
[Figure: the same curve annotated with the information carried by an A in three example sequences. In AAAAAAAAAAAAAAAA…, p(A) = 1 and each A carries 0 bits; in AAAAAAATTTTTTTTT…, p(A) = .5 and each A carries 1 bit; in ATGCACTACTAACGGA…, p(A) = .25 and each A carries 2 bits, the maximum in DNA.]
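
The three cases can be reproduced by estimating symbol probabilities from each sequence and averaging the information per symbol. This is a sketch under two assumptions: the sequences below are hypothetical stand-ins (the slide's examples are truncated), and symbols are treated as independent, so only base frequencies matter.

```python
import math
from collections import Counter


def bits_per_symbol(seq: str) -> float:
    """Average information per symbol under the sequence's own base frequencies."""
    n = len(seq)
    return sum(c / n * math.log2(n / c) for c in Counter(seq).values())


# Hypothetical stand-ins for the three kinds of sequence on the slide.
examples = {
    "one base only": "A" * 16,
    "two bases, equally frequent": "A" * 8 + "T" * 8,
    "four bases, equally frequent": "ACGT" * 4,
}

for label, seq in examples.items():
    print(f"{label}: {bits_per_symbol(seq):.1f} bits per symbol")
```

With four equally likely bases the average reaches 2 bits per symbol, which is why 2 bits is the ceiling for DNA.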
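Compression encodes symbols using a probability distribution

[Figure: the sequence AAAAAACGGG encoded under two distributions over A, C, G, T. With a uniform distribution the encoding is 00000000000001101010 (20 bits). With a distribution matched to the sequence (A most probable, then G, C, T) the encoding is 11011010 (8 bits).]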
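
The uniform case can be worked through by hand: with four equally likely bases, each symbol costs 2 bits. The sketch below assumes the mapping A=00, C=01, G=10, T=11, which reproduces the 20-bit string on the slide. It then computes the ideal code length under the sequence's own base frequencies. It does not attempt to reproduce the slide's shorter bit string, whose coding scheme is not given.

```python
import math
from collections import Counter

# Assumed fixed-length code; it matches the slide's uniform encoding.
FIXED_CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}


def encode_fixed(seq: str) -> str:
    """Encode each base with its 2-bit code word."""
    return "".join(FIXED_CODE[base] for base in seq)


def ideal_bits(seq: str, probs: dict) -> float:
    """Total information of seq, in bits, under the given symbol probabilities."""
    return sum(math.log2(1.0 / probs[base]) for base in seq)


seq = "AAAAAACGGG"
uniform = {base: 0.25 for base in "ACGT"}
empirical = {base: count / len(seq) for base, count in Counter(seq).items()}

print(encode_fixed(seq))
print(f"uniform model:   {ideal_bits(seq, uniform):.1f} bits")
print(f"empirical model: {ideal_bits(seq, empirical):.1f} bits")
```

A model that assigns high probability to the symbols that actually occur needs fewer bits than the uniform model, which is the sense in which a better probability distribution gives better compression.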
Alignment by Compression
Homologous sequences share information

[Figure: two DNA sequences drawn vertically, TAGTAACAGTTTCCGAATCAAG on the left and CCGAATCATGTCCGAATCATT on the right. Labels: Markov Expert, I(Query).]
[Figure: the same two sequences with an Align Expert added alongside the Markov Expert. Labels: I(Query), I(Query | Reference), Mutual Information.]
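
The quantities on this slide can be illustrated with a toy pair of experts. This is only a sketch and not XMAligner's implementation: the order-2 adaptive Markov model, the add-one smoothing, the fixed match probability of 0.9, the fixed offset, and both sequences are assumptions made for illustration. How XMAligner combines its experts is not shown on the slides.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def markov_bits(query: str, order: int = 2) -> float:
    """I(Query): bits to encode query with an adaptive order-k Markov model."""
    # Every context starts with a count of 1 per base (add-one smoothing).
    counts = defaultdict(lambda: dict.fromkeys(ALPHABET, 1))
    total = 0.0
    for i, base in enumerate(query):
        seen = counts[query[max(0, i - order):i]]
        total += math.log2(sum(seen.values()) / seen[base])
        seen[base] += 1  # learn from the symbol just encoded
    return total


def align_bits(query: str, reference: str, offset: int = 0,
               p_match: float = 0.9) -> float:
    """I(Query | Reference): bits when each base is predicted from the reference."""
    p_other = (1.0 - p_match) / 3.0
    total = 0.0
    for i, base in enumerate(query):
        j = i + offset
        if 0 <= j < len(reference):
            p = p_match if reference[j] == base else p_other
        else:
            p = 0.25  # nothing to copy from: fall back to a uniform guess
        total += math.log2(1.0 / p)
    return total


# Hypothetical reference, and a query that differs from it at one position.
reference = "ACGTTGCATGCCGAATCATGTCAGGCTA"
query = "ACGTTGCATGCCGTATCATGTCAGGCTA"

i_query = markov_bits(query)
i_query_given_ref = align_bits(query, reference)
print(f"I(Query)             = {i_query:.1f} bits")
print(f"I(Query | Reference) = {i_query_given_ref:.1f} bits")
print(f"Mutual information   = {i_query - i_query_given_ref:.1f} bits")
```

When the reference predicts the query well, encoding the query given the reference takes far fewer bits than encoding it alone. The difference estimates the information the two sequences share, and it gives an objective measure of how good the alignment is.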
XMAligner wins on distantly related, biased sequences

[Figure: plot with Sensitivity on the x-axis and Specificity on the y-axis]
XMAligner is the most sensitive at detecting exons
XMAligner detecting a gene cluster

[Figure: Plasmodium gene cluster]
Alignment by compression overcomes the limitations of traditional alignment, producing better results in distantly related or biased sequences
