DNA Motif Finding
                                          Stewart MacArthur

                                           Bioinformatics Core


                                          March 11th, 2010




Stewart MacArthur (Bioinformatics Core)      DNA Motif Finding   March 11th, 2010   1 / 33
Introduction




What is a DNA Motif?
 DNA motifs are short, recurring patterns that are presumed to have a
 biological function.
    • sequence-specific binding sites
        • transcription factors
        • nucleases
    • ribosome binding
    • mRNA processing
         • splicing
         • editing
         • polyadenylation
    • transcription termination




Representing a motif




How to represent a DNA motif?
 How can we represent the binding specificity of a protein, such that we
 can reliably predict its binding to any given sequence?
 Restriction enzyme sites can be written as a simple DNA sequence,
 e.g. GAATTC for EcoRI.

                                            5’-G A A T T C-3’
                                            3’-C T T A A G-5’

 These sequences can incorporate ambiguity, e.g. GTYRAC for HincII,
 using the IUPAC code.

                                                      GTYRAC
                                                    Y = C or T
                                                    R = A or G

 All matching sites will be cut by the restriction enzyme
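
 Because each IUPAC code stands for a fixed set of bases, a pattern such as
 GTYRAC can be translated into a regular expression. The following is a
 minimal sketch, not part of the original slides: the helper names
 `iupac_to_regex` and `find_sites` and the test sequence `demo` are
 illustrative assumptions.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(pattern):
    """Translate an IUPAC pattern such as GTYRAC into a regex string."""
    return "".join(IUPAC[code] for code in pattern.upper())


def find_sites(pattern, sequence):
    """Return 0-based start positions of all matches, including overlaps."""
    # The lookahead makes the regex engine report overlapping sites too.
    regex = re.compile("(?=" + iupac_to_regex(pattern) + ")")
    return [m.start() for m in regex.finditer(sequence.upper())]


# Made-up sequence: contains GTCGAC and GTTAAC (both match GTYRAC)
# and GTCCAC (does not match, because R excludes C).
demo = "AAGTCGACTTGTTAACGGGTCCACAA"
print(iupac_to_regex("GTYRAC"))
print(find_sites("GTYRAC", demo))
```

 Only one strand is searched here. For palindromic sites such as GAATTC
 that is sufficient; for other patterns the reverse complement would also
 need to be scanned.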
Representing a motif




Transcription Factors are different...

     • Regulatory motifs are often degenerate: variable but similar.
     • Transcription factors are often pleiotropic, regulating several
         genes, but those genes may need to be expressed at different levels.
     • A side effect of this degeneracy is spurious binding, where the
         protein has affinity for positions in the genome other than its
         functional sites.
     • Degeneracy in restriction enzyme binding would be lethal.
     • Non-specific binding competes for protein, so more protein must
         be produced than would otherwise be required.




Representing a motif   Consensus




The Consensus Sequence
     • A consensus binding site is often used to represent transcription
         factor binding.
     • It is a sequence that matches all examples of the binding site
         closely, but not necessarily exactly.
     • There is a trade-off between the ambiguity allowed in the consensus
         and its sensitivity.

                                                         TACGAT
                                                         TATAAT
                                                         TATAAT
                                                         GATACT
                                                         TATGAT
                                                         TATGTT


Representing a motif   Consensus

The Consensus Sequence : Example

 Known sites, with the number of mismatches to the consensus TATAAT:

                                                       TACGAT   2
                                                       TATAAT   0
                                                       TATAAT   0
                                                       GATACT   2
                                                       TATGAT   1
                                                       TATGTT   2

 Allowing 0 mismatches finds 2/6 sites - about 1 matching site expected every 4kb
 Allowing at most 1 mismatch finds 3/6 sites - about 1 matching site expected every 200bp
 Allowing up to 2 mismatches finds 6/6 sites - about 1 matching site expected every 30bp

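
 The trade-off in this example can be checked with a few lines of Python.
 This is a rough sketch under two assumptions that are not stated in the
 slides: the fourth known site is GATACT (as listed when the six sites are
 first shown), and the expected spacing of chance matches is estimated for
 a random sequence with equal base frequencies. The function names are
 illustrative.

```python
from math import comb

# The six known sites and the consensus from the example.
SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
CONSENSUS = "TATAAT"


def mismatches(site, consensus):
    """Number of positions at which site and consensus differ."""
    return sum(a != b for a, b in zip(site, consensus))


def sites_found(sites, consensus, max_mismatches):
    """How many known sites lie within max_mismatches of the consensus."""
    return sum(mismatches(s, consensus) <= max_mismatches for s in sites)


def matching_words(length, max_mismatches):
    """Number of distinct DNA words within max_mismatches of a fixed word."""
    return sum(comb(length, k) * 3 ** k for k in range(max_mismatches + 1))


for k in range(3):
    found = sites_found(SITES, CONSENSUS, k)
    # Assumes all 4**6 hexamers are equally likely (uniform base composition).
    spacing = 4 ** len(CONSENSUS) / matching_words(len(CONSENSUS), k)
    print(f"<= {k} mismatches: {found}/{len(SITES)} sites, "
          f"roughly one chance match per {spacing:.0f} bp")
```

 Under these assumptions the spacings come out in the same range as the
 rounded figures on the slide (about 4kb, 200bp and 30bp).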
Representing a motif   IUPAC




IUPAC codes
                                                 A           Adenine
                                                 C           Cytosine
                                                 G           Guanine
                                                 T           Thymine
                                                 R            A or G
                                                 Y            C or T
                                                 S            G or C
                                                 W            A or T
                                                 K            G or T
                                                 M            A or C
                                                 B          C or G or T
                                                 D          A or G or T
                                                 H          A or C or T
                                                 V          A or C or G
                                                 N           any base
                                               . or -           gap
Representing a motif   IUPAC

The Consensus Sequence : Example

 Known sites, with the number of mismatches to the IUPAC consensus TATRNT:

                                                       TACGAT   1
                                                       TATAAT   0
                                                       TATAAT   0
                                                       GATACT   1
                                                       TATGAT   0
                                                       TATGTT   0

 Exact match finds 4/6 sites - 1 site every 500bp
 Up to one mismatch finds 6/6 sites - 1 site every 30bp

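
 Matching against an IUPAC consensus only changes the comparison at each
 position: a base matches if it belongs to the set the code stands for.
 A minimal sketch, assuming the fourth known site is GATACT (as listed
 when the six sites are first shown; with that site the counts agree with
 the slide). The name `iupac_mismatches` is illustrative.

```python
# Each IUPAC code stands for a set of allowed bases.
IUPAC_SETS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
PATTERN = "TATRNT"


def iupac_mismatches(site, pattern):
    """Count positions where the base is not allowed by the IUPAC code."""
    return sum(base not in IUPAC_SETS[code]
               for base, code in zip(site, pattern))


for allowed in (0, 1):
    found = sum(iupac_mismatches(s, PATTERN) <= allowed for s in SITES)
    print(f"<= {allowed} mismatches: {found}/{len(SITES)} sites")
```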
Representing a motif   Matrix




The Matrix
     • A position weight matrix (PWM)
         • also called position-specific weight matrix (PSWM)
         • also called position-frequency matrix (PFM)
         • also called position-specific scoring matrix (PSSM)
         • or just matrix
     • Alternative to the consensus.
     • There is a matrix element for all possible bases at every position.

                      1      2        3         4       5         6         7    8   9    10     11
              A       4     13        5         3       0         0         0    0   17    0      6
              C       4      1        2         0       0         0         0    0   0     1      0
              G       3      3        0         0      18         0         0    0   1     4      3
              T       7      1       11        15       0        18        18   18   0    13     9


Representing a motif   Matrix

Matrix Formats
 Counts
          1     2     3     4     5     6     7     8     9    10    11
   A      4    13     5     3     0     0     0     0    17     0     6
   C      4     1     2     0     0     0     0     0     0     1     0
   G      3     3     0     0    18     0     0     0     1     4     3
   T      7     1    11    15     0    18    18    18     0    13     9
 Frequency
          1     2     3     4     5     6     7     8     9    10    11
   A    0.2   0.7   0.3   0.2   0.0   0.0   0.0   0.0   0.9   0.0   0.3
   C    0.2   0.1   0.1   0.0   0.0   0.0   0.0   0.0   0.0   0.1   0.0
   G    0.2   0.2   0.0   0.0   1.0   0.0   0.0   0.0   0.1   0.2   0.2
   T    0.4   0.1   0.6   0.8   0.0   1.0   1.0   1.0   0.0   0.7   0.5
 Weight (log odds)
          1     2     3     4     5     6     7     8     9    10    11
   A   -0.1   1.0   0.1  -0.4  -2.9  -2.9  -2.9  -2.9   1.3  -2.9   0.3
   C   -0.1  -1.3  -0.7  -2.9  -2.9  -2.9  -2.9  -2.9  -2.9  -1.3  -2.9
   G   -0.4  -0.4  -2.9  -2.9   1.3  -2.9  -2.9  -2.9  -1.3  -0.1  -0.4
   T    0.4  -1.3   0.9   1.2  -2.9   1.3   1.3   1.3  -2.9   1.0   0.7

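
 The three formats are successive transformations of the same counts. The
 slides do not state how the weights were computed; the sketch below
 assumes a natural logarithm, a uniform background of 0.25 per base and a
 pseudocount of 1 shared out according to the background. Those are
 inferred assumptions, chosen because they reproduce the rounded values in
 the weight table, and the function names are illustrative.

```python
import math

# Count matrix from the slides: 18 sites, 11 positions.
COUNTS = {
    "A": [4, 13, 5, 3, 0, 0, 0, 0, 17, 0, 6],
    "C": [4, 1, 2, 0, 0, 0, 0, 0, 0, 1, 0],
    "G": [3, 3, 0, 0, 18, 0, 0, 0, 1, 4, 3],
    "T": [7, 1, 11, 15, 0, 18, 18, 18, 0, 13, 9],
}
BACKGROUND = 0.25   # assumed: equal prior probability for each base
PSEUDOCOUNT = 1.0   # assumed: one pseudo-observation per column


def n_sites(counts):
    """Number of aligned sites, taken from the first column."""
    return sum(row[0] for row in counts.values())


def frequencies(counts):
    """Relative frequency of each base at each position."""
    n = n_sites(counts)
    return {base: [c / n for c in row] for base, row in counts.items()}


def weights(counts, background=BACKGROUND, pseudo=PSEUDOCOUNT):
    """Log-odds weights: ln(corrected frequency / background frequency)."""
    n = n_sites(counts)
    return {
        base: [
            math.log(((c + pseudo * background) / (n + pseudo)) / background)
            for c in row
        ]
        for base, row in counts.items()
    }


for base, row in weights(COUNTS).items():
    print(base, " ".join(f"{w:5.1f}" for w in row))
```

 The pseudocount keeps the weight finite for bases never observed at a
 position, which is why zero counts appear as -2.9 rather than minus
 infinity.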
Representing a motif   Matrix




Sequence Logos
    • A visual representation of the motif
    • Each column of the matrix is represented as a stack of letters
        whose size is proportional to the corresponding residue frequency
    • The total height of each column is proportional to its
        information content.

          A   4   13   5    3    0    0    0    0    17   0    6
          C   4   1    2    0    0    0    0    0    0    1    0
          G   3   3    0    0    18   0    0    0    1    4    3
          T   7   1    11   15   0    18   18   18   0    13   9

Information theory




Information Theory

     • Information theory is a branch of applied mathematics concerned
         with the quantification of information.
     • It has been applied to DNA motifs to determine the amount of
         uncertainty at each position in a site.
     • Uncertainty is measured in bits, on a log2 scale.
     • Information is a decrease in uncertainty.




Information theory




Information theory
                                                                         A   4   13   5    3    0    0    0    0    17   0    6
                                                                         C   4   1    2    0    0    0    0    0    0    1    0
                                                                         G   3   3    0    0    18   0    0    0    1    4    3
                                                                         T   7   1    11   15   0    18   18   18   0    13   9

    • 1 base occurs every time - 2 bits
    • 2 bases each occur 50% of the time - 1 bit
    • 4 bases occur equally - 0 bits



 Example
   I_i = 2 + \sum_{b \in \{A,C,G,T\}} f_{b,i} \log_2 f_{b,i}
   1 = 2 + 0.5 \log_2(0.5) + 0.5 \log_2(0.5)

 where f_{b,i} is the frequency of base b at position i.
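
 The formula can be applied to every column of the count matrix. A minimal
 sketch, assuming terms with zero frequency contribute nothing (the usual
 convention) and applying no small-sample correction; the function name is
 illustrative.

```python
import math

# Count matrix from the slides, one list per base.
COUNTS = {
    "A": [4, 13, 5, 3, 0, 0, 0, 0, 17, 0, 6],
    "C": [4, 1, 2, 0, 0, 0, 0, 0, 0, 1, 0],
    "G": [3, 3, 0, 0, 18, 0, 0, 0, 1, 4, 3],
    "T": [7, 1, 11, 15, 0, 18, 18, 18, 0, 13, 9],
}


def information_content(column_counts):
    """I = 2 + sum_b f_b * log2(f_b), in bits, for one matrix column."""
    total = sum(column_counts)
    info = 2.0
    for count in column_counts:
        if count > 0:  # a base that never occurs contributes 0
            f = count / total
            info += f * math.log2(f)
    return info


# The three reference cases from the slide.
print(information_content([18, 0, 0, 0]))  # one base every time
print(information_content([9, 9, 0, 0]))   # two bases, 50% each
print(information_content([5, 5, 5, 5]))   # four bases equally

# Information content of each position in the example matrix.
columns = zip(COUNTS["A"], COUNTS["C"], COUNTS["G"], COUNTS["T"])
for position, column in enumerate(columns, start=1):
    print(position, round(information_content(column), 2))
```

 These per-column values are what set the stack heights in a sequence logo.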



Information theory




Why do we want to find them?

Expression Microarrays
    • Find co-regulated genes
    • Suggest pathways

ChIP-seq / ChIP-chip
    • Determine binding preferences
    • Find co-factors




Information theory




Two Methods

Pattern Matching: finding known motifs
    • Does protein X bind upstream of my genes?
    • Does it bind more than expected by chance?
    • e.g. Patser, Pscan, MAST ...

Pattern Discovery: finding unknown motifs
    • What motifs are upstream of my genes?
    • What are these motifs?
    • e.g. MEME, Weeder, MDScan ...

Databases of Motifs




Where can we find known motifs?
 Online databases
    • Multicellular Eukaryotes
        • Jaspar
        • Transfac
        • Pazar
    • Yeast
        • Yeastract
        • SCPD
    • Prokaryotes
        • RegulonDB
        • Prodoric
    • Other
        • UniProbe

Finding known motifs




How do we find them?




       TATATTGTTTATTTTCATGACTTCATGTCGCATGTATTGTTAATTAA
       CACATGTCTCATGTACTGGACCATGTCTAAGGGGTGTAAGGGTACTA
       ACGAATCGTAGCATGTCCAGAGGTGCGGAGTACGTAAGGAGGGTGCC
       CATACATGTCCGTTTCATATGAGCCTGCATTAATGTACCAACCTTCA
       ACCATGTCTCAACATGTCGCGGGTGTGCCTCCACGTACGAGCCGGAA
       GTCGACTCGCATGTCTGTCAGTATTATCCAAAGCATGTCGACCTCTT
       CATGTCAGCGAACGCAAGATCTTCATATGAGCCTGCATTAATGTACC


Finding known motifs




Pattern Matching

 The weight matrix is slid along the sequence one base at a time, and
 each 11-base window is scored by summing the weights of its bases.

          1     2     3     4     5     6     7     8     9    10    11
   A   -0.1   1.0   0.1  -0.4  -2.9  -2.9  -2.9  -2.9   1.3  -2.9   0.3
   C   -0.1  -1.3  -0.7  -2.9  -2.9  -2.9  -2.9  -2.9  -2.9  -1.3  -2.9
   G   -0.4  -0.4  -2.9  -2.9   1.3  -2.9  -2.9  -2.9  -1.3  -0.1  -0.4
   T    0.4  -1.3   0.9   1.2  -2.9   1.3   1.3   1.3  -2.9   1.0   0.7

   Sequence   TATATTGTTTATTTTCATGACTTCATGTCGCATGTATTGTTAATTAA

   Window 1   TATATTGTTTA
   Window 2    ATATTGTTTAT
   Window 3     TATTGTTTATT
   ...

 Matching sites found in the sequence:

   TA TATTGTTTATT TTCATGACTTCATGTCGCATG TATTGTTAATT AA

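
 The sliding-window scan shown on these slides can be sketched in a few
 lines of Python, using the rounded weights and the example sequence from
 the slides. The score threshold of 5.0 is an arbitrary assumption chosen
 for illustration; real tools such as Patser derive thresholds
 statistically. The function name `scan` is illustrative.

```python
# Rounded log-odds weights from the slides, one list per base.
WEIGHTS = {
    "A": [-0.1, 1.0, 0.1, -0.4, -2.9, -2.9, -2.9, -2.9, 1.3, -2.9, 0.3],
    "C": [-0.1, -1.3, -0.7, -2.9, -2.9, -2.9, -2.9, -2.9, -2.9, -1.3, -2.9],
    "G": [-0.4, -0.4, -2.9, -2.9, 1.3, -2.9, -2.9, -2.9, -1.3, -0.1, -0.4],
    "T": [0.4, -1.3, 0.9, 1.2, -2.9, 1.3, 1.3, 1.3, -2.9, 1.0, 0.7],
}
SEQ = "TATATTGTTTATTTTCATGACTTCATGTCGCATGTATTGTTAATTAA"
THRESHOLD = 5.0  # assumed cut-off, not taken from the slides


def scan(sequence, weights):
    """Score every window of motif width; return (start, window, score)."""
    width = len(weights["A"])
    results = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Sum the weight of the observed base at each motif position.
        score = sum(weights[base][i] for i, base in enumerate(window))
        results.append((start, window, score))
    return results


hits = [(start, window, round(score, 1))
        for start, window, score in scan(SEQ, WEIGHTS)
        if score >= THRESHOLD]
for start, window, score in hits:
    print(start, window, score)
```

 With this threshold the scan reports the same two windows that are marked
 on the final slide of the sequence. Only the forward strand is scanned in
 this sketch.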
Pattern Discovery




Introduction to de-novo motif finding

 de-novo or ab-initio motif finding refers to finding motifs “from the
 beginning”, i.e. without previous knowledge

 Various Methods
     • Word-based algorithms e.g. Oligo-Analysis, Weeder
     • Expectation-Maximization methods e.g. MEME
     • Gibbs sampling methods e.g. Gibbs sampler, MotifSampler
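
 The idea behind word-based methods can be illustrated by counting every
 word of a fixed length in a set of sequences. This is only a toy sketch:
 it reuses the seven example sequences shown earlier in the slides, counts
 one strand, and ranks words by raw count, whereas tools such as
 oligo-analysis compare observed counts with those expected under a
 background model. The function name `count_words` and the word length of
 6 are assumptions.

```python
from collections import Counter

# The seven example sequences from the "How do we find them?" slide.
SEQUENCES = [
    "TATATTGTTTATTTTCATGACTTCATGTCGCATGTATTGTTAATTAA",
    "CACATGTCTCATGTACTGGACCATGTCTAAGGGGTGTAAGGGTACTA",
    "ACGAATCGTAGCATGTCCAGAGGTGCGGAGTACGTAAGGAGGGTGCC",
    "CATACATGTCCGTTTCATATGAGCCTGCATTAATGTACCAACCTTCA",
    "ACCATGTCTCAACATGTCGCGGGTGTGCCTCCACGTACGAGCCGGAA",
    "GTCGACTCGCATGTCTGTCAGTATTATCCAAAGCATGTCGACCTCTT",
    "CATGTCAGCGAACGCAAGATCTTCATATGAGCCTGCATTAATGTACC",
]


def count_words(sequences, k):
    """Count every overlapping word of length k across all sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


word_counts = count_words(SEQUENCES, 6)
for word, n in word_counts.most_common(5):
    print(word, n)
```

 In this small data set the hexamer CATGTC stands out because it occurs in
 every sequence, which is the kind of signal a word-based method looks for.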




Pattern Discovery




Guidelines

     • If possible, remove repetitive sequences from the target sequences.
     • Use multiple motif prediction algorithms.
     • Run probabilistic algorithms multiple times.
     • Return multiple motifs.
     • Try a range of motif widths and expected numbers of sites.

 “... we do not recommend to trust pattern discovery results with
 vertebrate genomes.” (Jacques van Helden)




Recommended Tools




Recommended Tools


Pattern Matching                                              Pattern Discovery
    • RSAT                                                        • RSAT
    • Pscan                                                       • MEME
    • Galaxy                                                      • Weeder
    • MotifMogul                                                  • WebMOTIFS




Recommended Tools    RSA Tools




Regulatory Sequence Analysis Tools
                               http://rsat.ulb.ac.be/rsat/

 Modular computer programs specifically designed for the detection of
 regulatory signals in non-coding sequences.




 Nature Protocols series, Volume 3, No. 10 (2008):
     • Using RSAT to scan genome sequences for transcription factor binding
       sites and cis-regulatory modules
     • Using RSAT oligo-analysis and dyad-analysis tools to discover
       regulatory signals in nucleic sequences
     • Analyzing multiple data sets by interconnecting RSAT programs via
       SOAP Web services - an example with ChIP-chip data
     • Network Analysis Tools: from biological networks to clusters and
       pathways
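
 The word-counting idea behind discovery tools such as oligo-analysis can be sketched in a few lines. This is not RSAT's implementation or its statistical model: the sketch only counts every k-mer in a set of input sequences and compares each count with the number expected from a background set. The function names, the pseudocount and the ratio used for ranking are assumptions made for illustration.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every overlapping k-mer (word of length k) in the sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def enrichment(foreground, background, k, pseudocount=1.0):
    """Rank foreground k-mers by observed/expected ratio (highest first).

    Expected counts are the background frequency of the word scaled to the
    number of k-mer positions in the foreground. This is a crude ratio,
    not a significance test.
    """
    fg = count_kmers(foreground, k)
    bg = count_kmers(background, k)
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    n_words = 4 ** k  # number of possible DNA words of length k
    ratios = {}
    for word, observed in fg.items():
        bg_freq = (bg[word] + pseudocount) / (bg_total + pseudocount * n_words)
        ratios[word] = observed / (bg_freq * fg_total)
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)
```

 A real tool would replace the ratio with a proper significance score and would usually count both strands.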




Example Workflow
 Problem
 I have some differentially expressed genes from a microarray
 experiment. I would like to know whether p53 binds in their promoter
 regions and, if so, where.

 Workflow
     • BioMart: convert gene IDs, if necessary
     • RSAT: retrieve the promoter sequences
     • JASPAR: get the p53 PWM (MA0106.1)
     • RSAT: scan the sequences with the matrix (matrix-scan)
     • RSAT: draw the hits on a feature map
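
 The matrix-scanning step of this workflow can be illustrated with a small log-odds scan. The sketch is not RSAT's matrix-scan and does not use the real JASPAR MA0106.1 matrix: the count matrix is a made-up four-position toy, and the pseudocount, the uniform background and the score threshold are all assumptions.

```python
import math

BASES = "ACGT"

# Hypothetical count matrix: one row per base, one column per motif position.
TOY_COUNTS = {
    "A": [8, 0, 0, 1],
    "C": [0, 9, 0, 1],
    "G": [1, 0, 10, 0],
    "T": [1, 1, 0, 8],
}


def log_odds_matrix(counts, pseudocount=0.5, background=0.25):
    """Convert counts to log2(frequency / background) weights per position."""
    width = len(counts["A"])
    weights = []
    for pos in range(width):
        total = sum(counts[b][pos] for b in BASES) + 4 * pseudocount
        weights.append({
            b: math.log2((counts[b][pos] + pseudocount) / total / background)
            for b in BASES
        })
    return weights


def scan(sequence, weights, threshold):
    """Return (start, site, score) for forward-strand windows scoring >= threshold."""
    sequence = sequence.upper()
    width = len(weights)
    hits = []
    for start in range(len(sequence) - width + 1):
        site = sequence[start:start + width]
        if any(base not in BASES for base in site):
            continue  # skip windows containing N or other ambiguity codes
        score = sum(weights[i][base] for i, base in enumerate(site))
        if score >= threshold:
            hits.append((start, site, score))
    return hits
```

 A realistic scan would also score the reverse complement of each window and would pick the threshold from a score distribution rather than by hand.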



Recommended Tools    Pscan




  Pscan
         “Finding over-represented transcription
         factor binding site motifs in sequences from
         co-regulated or co-expressed genes”




Example Workflow

 Problem
 I have some differentially expressed genes from a microarray
 experiment. I would like to know which transcription factors are
 likely to bind their promoters.

 Workflow
     • BioMart: Convert Gene IDs, if necessary
     • Pscan: retrieve sequence
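
 One way to phrase "over-represented in a gene set" is as a z-test on matrix scores: take the best match score of a motif in each input promoter and ask whether the mean of those scores is higher than the mean over all promoters in the genome. The sketch below shows only that statistic. It is not taken from the Pscan source code, and the function name, its inputs and the normal approximation are assumptions.

```python
import math


def overrepresentation_z(sample_scores, genome_mean, genome_sd):
    """z-score and one-sided p-value for a gene set's mean motif score.

    sample_scores: best motif score found in each input promoter.
    genome_mean, genome_sd: mean and standard deviation of the same score
    over all promoters (the background).
    """
    n = len(sample_scores)
    if n == 0 or genome_sd <= 0:
        raise ValueError("need at least one score and a positive background SD")
    sample_mean = sum(sample_scores) / n
    z = (sample_mean - genome_mean) / (genome_sd / math.sqrt(n))
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value
```

 Repeating the test for every matrix in a collection and sorting by p-value gives a ranked list of candidate factors; such a list would still need correction for multiple testing.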




Recommended Tools    Galaxy




Galaxy
 http://main.g2.bx.psu.edu
             “Galaxy allows you to do analyses you cannot do anywhere
         else without the need to install or download anything. You can
         analyze multiple alignments, compare genomic annotations, profile
         metagenomic samples and much much more...”


    • Collection of online tools                                  • Reproducible analysis
    • Modular                                                     • Shared histories
    • Can create workflows                                        • In-house version
    • Saved Histories                                             • Easily extendable

                                      http://kinchie/galaxy


Recommended Tools    MEME Suite




MEME Suite
 Suite of web-based tools for motif discovery

 • MEME - de novo motif finding
 • MAST - find matches to known motifs (MEME output)
 • TOMTOM - compare motifs to TRANSFAC and JASPAR
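
 Comparing a discovered motif with database motifs comes down to scoring how similar two matrices are. The sketch below uses the mean Pearson correlation between aligned columns, tried at every ungapped offset. That measure is chosen here for illustration only and is not a description of TOMTOM's own scoring or statistics; all names and the minimum overlap are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a flat column carries no information
    return cov / math.sqrt(vx * vy)


def best_alignment(query, target, min_overlap=3):
    """Slide query along target; return (best mean column correlation, offset).

    Each motif is a list of columns, and each column is a list of four base
    frequencies in A, C, G, T order. The offset is the target index aligned
    to query column 0 (it may be negative).
    """
    best = (-1.0, None)
    first = -(len(query) - min_overlap)
    last = len(target) - min_overlap
    for offset in range(first, last + 1):
        pairs = [
            (query[i], target[i + offset])
            for i in range(len(query))
            if 0 <= i + offset < len(target)
        ]
        if len(pairs) < min_overlap:
            continue
        score = sum(pearson(q, t) for q, t in pairs) / len(pairs)
        if score > best[0]:
            best = (score, offset)
    return best
```

 A database search would repeat this against every stored matrix, usually on both strands, and turn the raw similarity into a p-value.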




Further Reading
     • Stormo GD. DNA binding sites: representation and discovery.
         Bioinformatics. 2000 Jan;16(1):16-23. Review. PubMed PMID:
         10812473.
     • D’haeseleer P. How does DNA sequence motif discovery work?
         Nat Biotechnol. 2006 Aug;24(8):959-61. Review. PubMed PMID:
         16900144.
     • Das MK, Dai HK. A survey of DNA motif finding algorithms. BMC
         Bioinformatics. 2007 Nov 1;8 Suppl 7:S21. Review. PubMed
         PMID: 18047721; PubMed Central PMCID: PMC2099490.
     • Tompa M, Li N, et al. Assessing computational tools for the
         discovery of transcription factor binding sites. Nat Biotechnol.
         2005 Jan;23(1):137-44. PubMed PMID: 15637633.


Practical Session




DNA Motif Finding 2010

  • 1. DNA Motif Finding Stewart MacArthur Bioinformatics Core March 11th, 2010 Stewart MacArthur (Bioinformatics Core) DNA Motif Finding March 11th, 2010 1 / 33
  • 2. Introduction What is a DNA Motif? DNA motifs are short, recurring patterns that are presumed to have a biological function. Stewart MacArthur (Bioinformatics Core) DNA Motif Finding March 11th, 2010 2 / 33
  • 3. Introduction What is a DNA Motif? DNA motifs are short, recurring patterns that are presumed to have a biological function. • sequence-specific binding sites • transcription factors • nucleases • ribosome binding • mRNA processing • splicing • editing • polyadenylation • transcription termination Stewart MacArthur (Bioinformatics Core) DNA Motif Finding March 11th, 2010 2 / 33
  • 4. Introduction What is a DNA Motif? DNA motifs are short, recurring patterns that are presumed to have a biological function. • sequence-specific binding sites • transcription factors • nucleases • ribosome binding • mRNA processing • splicing • editing • polyadenylation • transcription termination Stewart MacArthur (Bioinformatics Core) DNA Motif Finding March 11th, 2010 2 / 33
  • 5. Representing a motif How to represent a DNA motif? How can we represent the binding specificity of a protein, such that we can reliably predict its binding to any given sequence? Restriction enzymes sites can be written as simple DNA sequence, e.g. GAATTC for EcoRI 5’-G A A T T C-3’ 3’-C T T A A G-5’ These sequences can incorporate ambiguity, e.g. GTYRAC for HincII, using the IUPAC code. GTYRAC Y = C or T R = A or C All matching sites will be cut by the restriction enzyme Stewart MacArthur (Bioinformatics Core) DNA Motif Finding March 11th, 2010 3 / 33
  • 6. Representing a motif Transcription Factors are different... • Regulatory motifs are often degenerate,variable but similar. • Transcription factors are often pleiotropic, regulating several genes, but they may need to be expressed at different levels. • A side effect of this degeneracy is spurious binding, where the protein has affinity at positions in the genome other than their functional sites. • Degeneracy in restriction enzyme binding would be lethal • Non-specific binding competes for protein and requires more protein to be produced than would be required otherwise Stewart MacArthur (Bioinformatics Core) DNA Motif Finding March 11th, 2010 4 / 33
  • 7. Representing a motif Consensus The Consensus Sequence • A consensus binding site is often used to represent transcription factor binding • Refers to a sequence that matches all examples of the binding site closely but not exactly • There is a trade-off between the ambiguity in the consensus and its sensitivity Stewart MacArthur (Bioinformatics Core) DNA Motif Finding March 11th, 2010 5 / 33
  • 8. Representing a motif Consensus The Consensus Sequence • A consensus binding site is often used to represent transcription factor binding • Refers to a sequence that matches all examples of the binding site closely but not exactly • There is a trade-off between the ambiguity in the consensus and its sensitivity TACGAT TATAAT TATAAT GATACT TATGAT TATGTT Stewart MacArthur (Bioinformatics Core) DNA Motif Finding March 11th, 2010 5 / 33
  • 9. Representing a motif Consensus The Consensus Sequence : Example TACGAT TATAAT TATAAT TATACT TATGAT TATGTT TATAAT Allowing 0 mismatches finds 2/6 Sites 1 site every 4kb Stewart MacArthur (Bioinformatics Core) DNA Motif Finding March 11th, 2010 6 / 33
  • 10. Representing a motif: The Consensus Sequence: Example
    Sites (* = matched): TACGAT TATAAT* TATAAT* GATACT TATGAT TATGTT
    Consensus: TATAAT
    Allowing 0 mismatches finds 2/6 sites (1 site every 4 kb).
  • 11. Representing a motif: The Consensus Sequence: Example
    Sites (* = matched): TACGAT TATAAT* TATAAT* GATACT TATGAT* TATGTT
    Consensus: TATAAT
    Allowing at most 1 mismatch finds 3/6 sites (1 site every 200 bp).
  • 12. Representing a motif: The Consensus Sequence: Example
    Sites (* = matched): TACGAT* TATAAT* TATAAT* GATACT* TATGAT* TATGTT*
    Consensus: TATAAT
    Allowing up to 2 mismatches finds 6/6 sites (1 site every 30 bp).
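The sensitivity side of this trade-off can be checked with a few lines of code. This is a minimal sketch that counts how many of the six example sites (as first listed, with GATACT as the fourth site) lie within a given number of mismatches of the consensus TATAAT; the function names are chosen for this illustration only.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

# The six example binding sites
SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]

def sites_found(consensus, sites, max_mismatches):
    """Sites that match the consensus with at most max_mismatches differences."""
    return [s for s in sites if mismatches(consensus, s) <= max_mismatches]

# Number of sites recovered for 0, 1 and 2 allowed mismatches
found = {k: len(sites_found("TATAAT", SITES, k)) for k in (0, 1, 2)}
```

With these sites, `found` gives 2, 3 and 6 sites for 0, 1 and 2 mismatches, matching the counts on the slides.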
  • 13. Representing a motif: IUPAC codes
        A = Adenine; C = Cytosine; G = Guanine; T = Thymine
        R = A or G; Y = C or T; S = G or C; W = A or T; K = G or T; M = A or C
        B = C or G or T; D = A or G or T; H = A or C or T; V = A or C or G
        N = any base; . or - = gap
  • 15. Representing a motif: IUPAC. The Consensus Sequence: Example
    Sites (* = matched): TACGAT TATAAT* TATAAT* GATACT TATGAT* TATGTT*
    Consensus: TATRNT
    Exact match finds 4/6 sites (1 site every 500 bp).
  • 16. Representing a motif: IUPAC. The Consensus Sequence: Example
    Sites (* = matched): TACGAT* TATAAT* TATAAT* GATACT* TATGAT* TATGTT*
    Consensus: TATRNT
    Up to one mismatch finds 6/6 sites (1 site every 30 bp).
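The "1 site every N bp" figures can be approximated by brute force: enumerate all 4^6 hexamers, count how many match the pattern within the allowed number of mismatches, and divide. The sketch below assumes uniform random DNA (each base at 25%) and a single strand, which is an assumption of this example rather than something the slides state; `iupac_mismatches` and `expected_spacing` are illustrative names.

```python
from itertools import product

# Bases allowed by each IUPAC code
IUPAC_SETS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def iupac_mismatches(pattern, site):
    """Positions where the site's base is not allowed by the IUPAC pattern."""
    return sum(base not in IUPAC_SETS[code] for code, base in zip(pattern, site))

def expected_spacing(pattern, max_mismatches):
    """Average distance in bp between chance matches in uniform random DNA."""
    width = len(pattern)
    n_matching = sum(
        iupac_mismatches(pattern, kmer) <= max_mismatches
        for kmer in product("ACGT", repeat=width)
    )
    return 4 ** width / n_matching

SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
exact_hits = sum(iupac_mismatches("TATRNT", s) == 0 for s in SITES)
one_off_hits = sum(iupac_mismatches("TATRNT", s) <= 1 for s in SITES)
```

Under these assumptions TATAAT gives about 4096, 216 and 27 bp for 0, 1 and 2 mismatches, and TATRNT gives 512 and about 37 bp for 0 and 1 mismatches, in line with the rounded figures on the slides.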
  • 18. Representing a motif: The Matrix
    • A position weight matrix (PWM)
        • also called position-specific weight matrix (PSWM)
        • also called position-frequency matrix (PFM)
        • also called position-specific scoring matrix (PSSM)
        • or just matrix
    • Alternative to the consensus.
    • There is a matrix element for all possible bases at every position.
             1    2    3    4    5    6    7    8    9   10   11
        A    4   13    5    3    0    0    0    0   17    0    6
        C    4    1    2    0    0    0    0    0    0    1    0
        G    3    3    0    0   18    0    0    0    1    4    3
        T    7    1   11   15    0   18   18   18    0   13    9
  • 21. Representing a motif: Matrix Formats
    Counts
        A    4   13    5    3    0    0    0    0   17    0    6
        C    4    1    2    0    0    0    0    0    0    1    0
        G    3    3    0    0   18    0    0    0    1    4    3
        T    7    1   11   15    0   18   18   18    0   13    9
    Frequency
        A  0.2  0.7  0.3  0.2  0.0  0.0  0.0  0.0  0.9  0.0  0.3
        C  0.2  0.1  0.1  0.0  0.0  0.0  0.0  0.0  0.0  0.1  0.0
        G  0.2  0.2  0.0  0.0  1.0  0.0  0.0  0.0  0.1  0.2  0.2
        T  0.4  0.1  0.6  0.8  0.0  1.0  1.0  1.0  0.0  0.7  0.5
    Weight (log odds)
        A -0.1  1.0  0.1 -0.4 -2.9 -2.9 -2.9 -2.9  1.3 -2.9  0.3
        C -0.1 -1.3 -0.7 -2.9 -2.9 -2.9 -2.9 -2.9 -2.9 -1.3 -2.9
        G -0.4 -0.4 -2.9 -2.9  1.3 -2.9 -2.9 -2.9 -1.3 -0.1 -0.4
        T  0.4 -1.3  0.9  1.2 -2.9  1.3  1.3  1.3 -2.9  1.0  0.7
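The three formats are linked by simple arithmetic. The slides do not say how the weights were derived; the sketch below assumes a natural logarithm, a uniform background of 0.25 per base and a total pseudocount of 1 per column, which happens to reproduce the rounded values shown. Treat these parameter choices as assumptions of this example.

```python
import math

# Count matrix from the slide: 18 aligned sites, 11 positions
COUNTS = {
    "A": [4, 13, 5, 3, 0, 0, 0, 0, 17, 0, 6],
    "C": [4, 1, 2, 0, 0, 0, 0, 0, 0, 1, 0],
    "G": [3, 3, 0, 0, 18, 0, 0, 0, 1, 4, 3],
    "T": [7, 1, 11, 15, 0, 18, 18, 18, 0, 13, 9],
}

def column_totals(counts):
    """Number of sites contributing to each column."""
    width = len(counts["A"])
    return [sum(counts[b][i] for b in "ACGT") for i in range(width)]

def frequencies(counts):
    """Counts divided by the column total."""
    totals = column_totals(counts)
    return {b: [c / n for c, n in zip(counts[b], totals)] for b in "ACGT"}

def log_odds(counts, background=0.25, pseudocount=1.0):
    """ln of the (pseudocount-smoothed) frequency over the background frequency."""
    totals = column_totals(counts)
    return {
        b: [
            math.log((c + pseudocount * background) / (n + pseudocount) / background)
            for c, n in zip(counts[b], totals)
        ]
        for b in "ACGT"
    }

freq = frequencies(COUNTS)
weights = log_odds(COUNTS)
```

The pseudocount is what keeps zero counts from producing a weight of minus infinity; with these settings a zero count becomes about -2.9.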
  • 22. Representing a motif: Sequence Logos
        A    4   13    5    3    0    0    0    0   17    0    6
        C    4    1    2    0    0    0    0    0    0    1    0
        G    3    3    0    0   18    0    0    0    1    4    3
        T    7    1   11   15    0   18   18   18    0   13    9
    • A visual representation of the motif.
    • Each column of the matrix is represented as a stack of letters whose size is proportional to the corresponding residue frequency.
    • The total height of each column is proportional to its information content.
  • 23. Information Theory
    • Information theory is a branch of applied mathematics concerned with the quantification of information.
    • It has been applied to DNA motifs in order to determine the amount of uncertainty at each position in a site.
    • Uncertainty is measured in bits of information, which is on a log2 scale.
    • Information is a decrease in uncertainty.
  • 25. Information theory
        A    4   13    5    3    0    0    0    0   17    0    6
        C    4    1    2    0    0    0    0    0    0    1    0
        G    3    3    0    0   18    0    0    0    1    4    3
        T    7    1   11   15    0   18   18   18    0   13    9
    • 1 base occurs every time: 2 bits
    • 2 bases occur 50% of the time each: 1 bit
    • 4 bases occur equally: 0 bits
    Example:
        $I_i = 2 + \sum_{b} f_{b,i} \log_2 f_{b,i}$
        $1 = 2 + 0.5 \times \log_2(0.5) + 0.5 \times \log_2(0.5)$
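The formula above is easy to evaluate per column. This is a minimal sketch with an illustrative function name; it treats a base with zero frequency as contributing nothing to the sum and applies no small-sample correction, which real logo software may add.

```python
import math

def information_content(column_counts):
    """Information in bits for one column: 2 + sum over bases of f * log2(f)."""
    total = sum(column_counts)
    bits = 2.0
    for count in column_counts:
        if count > 0:  # a base that never occurs contributes 0 to the sum
            f = count / total
            bits += f * math.log2(f)
    return bits

# Counts are given in A, C, G, T order
fully_conserved = information_content([0, 0, 18, 0])   # column 5 of the matrix
two_bases = information_content([9, 9, 0, 0])          # two bases at 50% each
uniform = information_content([1, 1, 1, 1])            # all four bases equally likely
```

The three calls reproduce the 2-bit, 1-bit and 0-bit cases listed on the slide.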
  • 27. Information theory: Why do we want to find them?
    Expression microarrays
    • Find co-regulated genes
    • Suggest pathways
    ChIP-seq / ChIP-chip
    • Determine binding preferences
    • Find co-factors
  • 30. Information theory: Two Methods
    Pattern Matching: finding known motifs
    • Does protein X bind upstream of my genes?
    • Does it bind more than expected by chance?
    • e.g. Patser, Pscan, MAST ...
    Pattern Discovery: finding unknown motifs
    • What motifs are upstream of my genes?
    • What are these motifs?
    • e.g. MEME, Weeder, MDScan ...
  • 33. Databases of Motifs: Where can we find known motifs?
    Online databases
    • Multicellular eukaryotes
        • Jaspar
        • Transfac
        • Pazar
    • Yeast
        • Yeastract
        • SCPD
    • Prokaryotes
        • RegulonDB
        • Prodoric
    • Other
        • UniProbe
  • 34. Finding known motifs: How do we find them?
        TATATTGTTTATTTTCATGACTTCATGTCGCATGTATTGTTAATTAA
        CACATGTCTCATGTACTGGACCATGTCTAAGGGGTGTAAGGGTACTA
        ACGAATCGTAGCATGTCCAGAGGTGCGGAGTACGTAAGGAGGGTGCC
        CATACATGTCCGTTTCATATGAGCCTGCATTAATGTACCAACCTTCA
        ACCATGTCTCAACATGTCGCGGGTGTGCCTCCACGTACGAGCCGGAA
        GTCGACTCGCATGTCTGTCAGTATTATCCAAAGCATGTCGACCTCTT
        CATGTCAGCGAACGCAAGATCTTCATATGAGCCTGCATTAATGTACC
  • 37. Finding known motifs: Pattern Matching
    Counts
        A    4   13    5    3    0    0    0    0   17    0    6
        C    4    1    2    0    0    0    0    0    0    1    0
        G    3    3    0    0   18    0    0    0    1    4    3
        T    7    1   11   15    0   18   18   18    0   13    9
    Frequency
        A  0.2  0.7  0.3  0.2  0.0  0.0  0.0  0.0  0.9  0.0  0.3
        C  0.2  0.1  0.1  0.0  0.0  0.0  0.0  0.0  0.0  0.1  0.0
        G  0.2  0.2  0.0  0.0  1.0  0.0  0.0  0.0  0.1  0.2  0.2
        T  0.4  0.1  0.6  0.8  0.0  1.0  1.0  1.0  0.0  0.7  0.5
    Weight (log odds)
        A -0.1  1.0  0.1 -0.4 -2.9 -2.9 -2.9 -2.9  1.3 -2.9  0.3
        C -0.1 -1.3 -0.7 -2.9 -2.9 -2.9 -2.9 -2.9 -2.9 -1.3 -2.9
        G -0.4 -0.4 -2.9 -2.9  1.3 -2.9 -2.9 -2.9 -1.3 -0.1 -0.4
        T  0.4 -1.3  0.9  1.2 -2.9  1.3  1.3  1.3 -2.9  1.0  0.7
  • 38. Finding known motifs: Pattern Matching
        A -0.1  1.0  0.1 -0.4 -2.9 -2.9 -2.9 -2.9  1.3 -2.9  0.3
        C -0.1 -1.3 -0.7 -2.9 -2.9 -2.9 -2.9 -2.9 -2.9 -1.3 -2.9
        G -0.4 -0.4 -2.9 -2.9  1.3 -2.9 -2.9 -2.9 -1.3 -0.1 -0.4
        T  0.4 -1.3  0.9  1.2 -2.9  1.3  1.3  1.3 -2.9  1.0  0.7
    Sequence: TATATTGTTTATTTTCATGACTTCATGTCGCATGTATTGTTAATTAA
  • 39. Finding known motifs: Pattern Matching (same weight matrix as slide 38)
    Window aligned to the matrix: T A T A T T G T T T A
    Sequence: [TATATTGTTTA]TTTTCATGACTTCATGTCGCATGTATTGTTAATTAA
  • 40. Finding known motifs: Pattern Matching (same weight matrix as slide 38)
    Window aligned to the matrix: A T A T T G T T T A T
    Sequence: T[ATATTGTTTAT]TTTCATGACTTCATGTCGCATGTATTGTTAATTAA
  • 41. Finding known motifs: Pattern Matching (same weight matrix as slide 38)
    Window aligned to the matrix: T A T T G T T T A T T
    Sequence: TA[TATTGTTTATT]TTCATGACTTCATGTCGCATGTATTGTTAATTAA
  • 42. Finding known motifs: Pattern Matching
    Highlighted windows (in brackets): TA[TATTGTTTATT]TTCATGACTTCATGTCGCATG[TATTGTTAATT]AA
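The sliding-window scan shown on these slides can be sketched as follows: at each offset, sum the weight of the observed base in every matrix column. The weights are recomputed from the counts using the same assumed settings that reproduce the slide values (natural log, 0.25 background, total pseudocount of 1); the function names are illustrative and only the forward strand is scanned.

```python
import math

COUNTS = {
    "A": [4, 13, 5, 3, 0, 0, 0, 0, 17, 0, 6],
    "C": [4, 1, 2, 0, 0, 0, 0, 0, 0, 1, 0],
    "G": [3, 3, 0, 0, 18, 0, 0, 0, 1, 4, 3],
    "T": [7, 1, 11, 15, 0, 18, 18, 18, 0, 13, 9],
}
SEQ = "TATATTGTTTATTTTCATGACTTCATGTCGCATGTATTGTTAATTAA"

def weight_matrix(counts, background=0.25, pseudocount=1.0):
    """Log-odds weights (natural log) with a pseudocount; assumed settings."""
    width = len(counts["A"])
    totals = [sum(counts[b][i] for b in "ACGT") for i in range(width)]
    return {
        b: [
            math.log((counts[b][i] + pseudocount * background)
                     / (totals[i] + pseudocount) / background)
            for i in range(width)
        ]
        for b in "ACGT"
    }

def scan(sequence, weights):
    """Score every window: returns {start: (window, score)} with 0-based starts."""
    width = len(weights["A"])
    scores = {}
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(weights[base][i] for i, base in enumerate(window))
        scores[start] = (window, score)
    return scores

scores = scan(SEQ, weight_matrix(COUNTS))
best_start = max(scores, key=lambda s: scores[s][1])
```

In practice a score threshold, or a comparison with scores from background sequence, decides which windows are reported as sites. Here the best window is TATTGTTTATT at offset 2, and the second highlighted window, TATTGTTAATT at offset 34, also scores well above zero.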
  • 43. Pattern Discovery: Introduction to de novo motif finding
    De novo or ab initio motif finding refers to finding motifs “from the beginning”, i.e. without previous knowledge.
    Various methods
    • Word-based algorithms, e.g. Oligo-Analysis, Weeder
    • Expectation-maximization methods, e.g. MEME
    • Gibbs sampling methods, e.g. Gibbs sampler, MotifSampler
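To give a feel for the word-based idea, the sketch below simply counts every 6-letter word in the seven example sequences shown earlier and reports the most frequent one. This is a deliberately simplified illustration, not the algorithm used by Oligo-Analysis or Weeder: real tools compare observed counts with the counts expected from a background model and usually consider both strands. The function name `count_words` is made up for this example.

```python
from collections import Counter

# The seven example sequences from the "How do we find them?" slide
SEQUENCES = [
    "TATATTGTTTATTTTCATGACTTCATGTCGCATGTATTGTTAATTAA",
    "CACATGTCTCATGTACTGGACCATGTCTAAGGGGTGTAAGGGTACTA",
    "ACGAATCGTAGCATGTCCAGAGGTGCGGAGTACGTAAGGAGGGTGCC",
    "CATACATGTCCGTTTCATATGAGCCTGCATTAATGTACCAACCTTCA",
    "ACCATGTCTCAACATGTCGCGGGTGTGCCTCCACGTACGAGCCGGAA",
    "GTCGACTCGCATGTCTGTCAGTATTATCCAAAGCATGTCGACCTCTT",
    "CATGTCAGCGAACGCAAGATCTTCATATGAGCCTGCATTAATGTACC",
]

def count_words(sequences, k):
    """Count every overlapping word of length k across all sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

word_counts = count_words(SEQUENCES, 6)
top_word, top_count = word_counts.most_common(1)[0]
```

On these sequences the most frequent hexamer is CATGTC, which occurs 10 times and at least once in every sequence.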
  • 45. Pattern Discovery: Guidelines
    • If possible, remove repeat patterns from the target sequences.
    • Use multiple motif prediction algorithms.
    • Run probabilistic algorithms multiple times.
    • Return multiple motifs.
    • Try a range of motif widths and expected numbers of sites.
    “... we do not recommend to trust pattern discovery results with vertebrate genomes.” Jacques van Helden
  • 55. Recommended Tools
    Pattern Matching
    • RSAT
    • Pscan
    • Galaxy
    • MotifMogul
    Pattern Discovery
    • RSAT
    • MEME
    • Weeder
    • WebMOTIFS
  • 56. Recommended Tools: RSA Tools
    Regulatory Sequence Analysis Tools, http://rsat.ulb.ac.be/rsat/
    Modular computer programs specifically designed for the detection of regulatory signals in non-coding sequences.
  • 58. Recommended Tools: RSA Tools
    Regulatory Sequence Analysis Tools. Nature Protocols series: Volume 3, No. 10, 2008
    • Using RSAT to scan genome sequences for transcription factor binding sites and cis-regulatory modules
    • Using RSAT oligo-analysis and dyad-analysis tools to discover regulatory signals in nucleic sequences
    • Analyzing multiple data sets by interconnecting RSAT programs via SOAP Web services: an example with ChIP-chip data
    • Network Analysis Tools: from biological networks to clusters and pathways
  • 59. Recommended Tools: RSA Tools. Example Workflow
    Problem: I have some differentially expressed genes from a microarray experiment. I would like to know if P53 binds in their promoter regions, and if so where.
    Workflow
    • BioMart: Convert gene IDs, if necessary
    • RSAT: retrieve sequence
    • JASPAR: Get PWM (MA0106.1)
    • RSAT: matrix-scan
    • RSAT: feature map
  • 60. Recommended Tools: Pscan
    “Finding over-represented transcription factor binding site motifs in sequences from co-regulated or co-expressed genes”
  • 61. Recommended Tools: Pscan. Example Workflow
    Problem: I have some differentially expressed genes from a microarray experiment. I would like to know which transcription factors bind to their promoters.
    Workflow
    • BioMart: Convert gene IDs, if necessary
    • Pscan: retrieve sequence
  • 70. Recommended Tools: Galaxy
    http://main.g2.bx.psu.edu
    “Galaxy allows you to do analyses you cannot do anywhere else without the need to install or download anything. You can analyze multiple alignments, compare genomic annotations, profile metagenomic samples and much much more...”
    • Collection of online tools
    • Modular
    • Can create workflows
    • Saved histories
    • Reproducible analysis
    • Shared histories
    • In-house version
    • Easily extendable
    http://kinchie/galaxy
  • 73. Recommended Tools: MEME Suite
    Suite of web-based tools for motif discovery
    • MEME: de novo motif finding
    • MAST: find matches to known motifs (MEME output)
    • TOMTOM: compare motifs to TRANSFAC and Jaspar
  • 74. Further Reading
    • Stormo GD. DNA binding sites: representation and discovery. Bioinformatics. 2000 Jan;16(1):16-23. Review. PubMed PMID: 10812473.
    • D’haeseleer P. How does DNA sequence motif discovery work? Nat Biotechnol. 2006 Aug;24(8):959-61. Review. PubMed PMID: 16900144.
    • Das MK, Dai HK. A survey of DNA motif finding algorithms. BMC Bioinformatics. 2007 Nov 1;8 Suppl 7:S21. Review. PubMed PMID: 18047721; PubMed Central PMCID: PMC2099490.
    • Tompa M, Li N, et al. Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol. 2005 Jan;23(1):137-44. PubMed PMID: 15637633.
  • 75. Practical Session