R’MES
  Finding Exceptional Motifs in Sequences

                 S. Schbath


         INRA, Jouy-en-Josas, France



http://genome.jouy.inra.fr/ssb/rmes/




                                            BOSC, Stockholm, June 27-28, 2009 – p.1
Introduction:
motifs and statistics




DNA and motifs


 • DNA: Long molecule, sequence of
   nucleotides
 • Nucleotides: A(denine), C(ytosine),
   G(uanine), T(hymine).
 • Motif (= oligonucleotide): short
   sequence of nucleotides, e.g.
   CAGTAG
 • Functional motif: recognized by
   proteins or enzymes to initiate a
   biological process



TAGACAGATAGACGATCAGTAGCCAGTAGACAGTAGGCATGA. . .
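
Counting the occurrences of a motif along a sequence is the basic operation behind the statistics that follow. A minimal Python sketch, using the example sequence and the motif CAGTAG from this slide; the function name `find_occurrences` is illustrative and not part of R’MES, and overlapping occurrences are counted.

```python
def find_occurrences(sequence: str, motif: str) -> list[int]:
    """Return the 0-based start positions of every (possibly overlapping) occurrence."""
    sequence, motif = sequence.upper(), motif.upper()
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        # Restart one position further so that overlapping occurrences are kept.
        start = sequence.find(motif, start + 1)
    return positions


SEQUENCE = "TAGACAGATAGACGATCAGTAGCCAGTAGACAGTAGGCATGA"
positions = find_occurrences(SEQUENCE, "CAGTAG")
print(len(positions), positions)  # 3 [16, 23, 30]
```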



Some functional motifs

• Restriction sites: recognized by specific bacterial restriction enzymes ⇒
  double-strand DNA break.
  E.g. GAATTC recognized by EcoRI
  very rare along bacterial genomes
• Chi motif: recognized by an enzyme that moves along the DNA sequence
  and degrades it ⇒ the degradation activity stops and DNA repair by
  recombination is stimulated.
  E.g. GCTGGTGG recognized by RecBCD (E. coli)
  very frequent along E. coli genome
• parS: recognized by the Spo0J protein ⇒ organization of B. subtilis genome
  into macro-domains.
        t
  TGTTAACACGTGAAACA
   c    c   t           t
  very frequent in the ORI domain, rare elsewhere
• promoter: structured motif recognized by the RNA polymerase to initiate
  gene transcription.
  E.g. TTGAC -(16;18)- TATAAT (E. coli): two boxes separated by a spacer of 16 to 18 nucleotides.
  particularly located in front of genes
Prediction of functional motifs

Most of the functional motifs are unknown in the different species.
For instance,
  • which would be the Chi motif of S. aureus? [Halpern et al. (08)]
  • Is there an equivalent of parS in E. coli? [Mercier et al. (08)]
Statistical approach: to identify candidate motifs based on their statistical
properties.

The most over-represented 8-letter words under M1, E. coli (length $4.6 \times 10^6$):

    word         obs      exp    score
    gctggtgg     762     84.9     73.5
    ggcgctgg     828    125.9     62.6
    cgctggcg     870    150.8     58.6
    gctggcgg     723    125.9     53.3
    cgctggtg     619    101.7     51.3

The most over-represented families anbcdefg under M1, H. influenzae (length $1.8 \times 10^6$):

    motif        obs      exp    score
    gntggtgg     223     55.3    22.33
    anttcatc     469    180.3    21.59
    anatcgcc     288     87.8    21.38
    tnatcgcc     279     84.5    21.18
    gnagaaga     270     83.6    20.10
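
As a rough sanity check on the E. coli columns, the reported scores are close to the naive standardization (obs − exp)/√exp. The Python sketch below only illustrates that arithmetic: treating the variance as equal to the expected count is an assumption made here for illustration, not the variance used by the Gaussian approximation in R’MES, and `naive_score` is a hypothetical helper name.

```python
import math

# (word, observed count, expected count under M1, reported score), E. coli table above
ROWS = [
    ("gctggtgg", 762, 84.9, 73.5),
    ("ggcgctgg", 828, 125.9, 62.6),
    ("cgctggcg", 870, 150.8, 58.6),
    ("gctggcgg", 723, 125.9, 53.3),
    ("cgctggtg", 619, 101.7, 51.3),
]


def naive_score(observed: float, expected: float) -> float:
    """(obs - exp) / sqrt(exp): Poisson-like standardization (assumed variance = exp)."""
    return (observed - expected) / math.sqrt(expected)


for word, obs, exp, reported in ROWS:
    print(f"{word}  naive={naive_score(obs, exp):6.2f}  reported={reported}")
```

The agreement is looser for the H. influenzae families (for gntggtgg the naive value is about 22.5 against a reported 22.33), so this shortcut should not be read as the score definition.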
Presentation of R’MES




Statistical questions addressed by R’MES

Questions related to the significance of the number of occurrences of motifs w
in sequences:

  • Is $N^{\mathrm{obs}}(w)$ significantly high?
  • Is $N^{\mathrm{obs}}(w)$ significantly higher than $N^{\mathrm{obs}}(w')$?
     → If $w' = \overline{w}$: is $w$ significantly skewed (strand bias)?
  • Is $N_1^{\mathrm{obs}}(w)$ significantly more unexpected than $N_2^{\mathrm{obs}}(w)$?
Several types of motifs w:

  • fixed words (e.g. gctggtgg),
  • degenerate patterns (e.g. gntggtgg),
  • sets of words (e.g. $\{w, \overline{w}\}$).
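
The three motif types relate simply: a degenerate pattern stands for a set of fixed words, and a word can be paired with its reverse complement. A short Python sketch under stated assumptions: only the symbol `n` (any nucleotide) is handled, and the helper names `expand_pattern` and `reverse_complement` are illustrative, not R’MES functions.

```python
from itertools import product

ALPHABET = "acgt"
COMPLEMENT = str.maketrans("acgt", "tgca")


def expand_pattern(pattern: str) -> list[str]:
    """List the fixed words covered by a pattern in which 'n' means any nucleotide."""
    choices = [ALPHABET if symbol == "n" else symbol for symbol in pattern.lower()]
    return ["".join(letters) for letters in product(*choices)]


def reverse_complement(word: str) -> str:
    """Complement every nucleotide, then read the word backwards."""
    return word.lower().translate(COMPLEMENT)[::-1]


print(expand_pattern("gntggtgg"))      # ['gatggtgg', 'gctggtgg', 'ggtggtgg', 'gttggtgg']
print(reverse_complement("gctggtgg"))  # ccaccagc
```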




Is $N^{\mathrm{obs}}(w)$ significantly high?

• One needs to calculate the p-value $P(N(w) \geq N^{\mathrm{obs}}(w))$, where $N(w)$ is the
  count (a random variable) of w in random sequences (→ model).
• R’MES considers Markov chain models of order m (Mm), which fit the
  sequence composition in oligos of length 1 up to (m + 1).
  The phase in coding sequences can be taken into account (Mm_3).
• R’MES approximates the p-value by using
   • either a Gaussian approximation of $N(w)$ (when $E(N(w))$ is large)
      [Prum et al. (95)], [Schbath et al. (95)]
   • or a compound Poisson approximation of $N(w)$ (when $E(N(w))$ is small)
      [Schbath (95)], [Roquain and Schbath (07)]
  (see DNA, Words and Models, Robin, Rodolphe, Schbath, CUP 2005)
• R’MES produces scores of exceptionality (probit transformation of the p-value).
  High positive (resp. negative) scores correspond to exceptionally frequent
  (resp. rare) motifs.

         rmes –gauss –s seqfile –m m –l wordlength –o outputfile
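
Two ingredients of this slide can be sketched in Python: an expected count under a Markov model of order m, and the probit transformation of a p-value into a score. The estimator below is the usual plug-in ratio of observed oligo counts (words of length m + 1 over words of length m); whether R’MES computes exactly this, and the sign convention chosen for `probit_score`, are assumptions. All function names are illustrative.

```python
from statistics import NormalDist


def count(sequence: str, word: str) -> int:
    """Number of (possibly overlapping) occurrences of word in sequence."""
    return sum(sequence.startswith(word, i) for i in range(len(sequence) - len(word) + 1))


def expected_count(sequence: str, word: str, m: int) -> float:
    """Plug-in estimate of E(N(word)) under a Markov model of order m >= 1."""
    if m < 1 or len(word) < m + 1:
        raise ValueError("need m >= 1 and len(word) >= m + 1")
    numerator = 1.0
    for i in range(len(word) - m):          # every (m+1)-letter sub-word
        numerator *= count(sequence, word[i:i + m + 1])
    denominator = 1.0
    for i in range(1, len(word) - m):       # inner m-letter sub-words
        denominator *= count(sequence, word[i:i + m])
    return numerator / denominator if denominator else 0.0


def probit_score(p_value: float) -> float:
    """Map an upper-tail p-value in (0, 1) to a standard normal quantile."""
    # -inv_cdf(p) equals inv_cdf(1 - p) but stays accurate for tiny p-values.
    return -NormalDist().inv_cdf(p_value)


toy = "ACGTACGTAC"
print(count(toy, "ACG"), expected_count(toy, "ACG", m=1))
print(round(probit_score(0.025), 2))
```

With this convention a small p-value for over-representation gives a large positive score; an exceptionally rare motif would need the lower-tail p-value to obtain a negative score.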


Is $N^{\mathrm{obs}}(w)$ significantly higher than $N^{\mathrm{obs}}(w')$?

 • One needs to calculate the p-value
   $P\left( \frac{N(w)}{N(w')} \geq \frac{N^{\mathrm{obs}}(w)}{N^{\mathrm{obs}}(w')} \right)$,
   where $N(\cdot)$ is the count (a random variable) in random sequences (→ model).
 • R’MES considers Markov chain models of order m (Mm), which fit the
   sequence composition in oligos of length 1 up to (m + 1).
   The phase in coding sequences can be taken into account (Mm_3).
 • R’MES approximates the p-value by using
    • the 2-dimensional Gaussian approximation of $(N(w), N(w'))$ (when
       $E(N(w))$ and $E(N(w'))$ are large)
       [Prum et al. (95)], [Schbath et al. (95)]
 • R’MES produces scores of exceptional skew (probit transformation), for the
   case $w' = \overline{w}$, the reverse complement of $w$:
   High positive (resp. negative) scores correspond to motifs significantly more
   frequent (resp. rare) along the sequence than along the complementary one.

          rmes –skew –seq seqfile –m m –l wordlength –o outputfile
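
The observed quantity behind this test is the ratio of the count of a word to the count of its reverse complement on the same strand. A minimal Python sketch of that ratio only (not of the Gaussian test itself); the toy sequence and the helper names are made up for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word: str) -> str:
    """Complement every nucleotide, then reverse the word."""
    return word.upper().translate(COMPLEMENT)[::-1]


def count(sequence: str, word: str) -> int:
    """Number of (possibly overlapping) occurrences of word in sequence."""
    return sum(sequence.startswith(word, i) for i in range(len(sequence) - len(word) + 1))


def observed_skew(sequence: str, word: str) -> float:
    """Count of word divided by the count of its reverse complement."""
    n_word = count(sequence, word)
    n_revcomp = count(sequence, reverse_complement(word))
    if n_revcomp == 0:
        # Ratio undefined (0/0) or unbounded (n/0); made explicit rather than hidden.
        return float("inf") if n_word else float("nan")
    return n_word / n_revcomp


# Toy sequence: two copies of GCTGGTGG and one copy of its reverse complement CCACCAGC.
toy = "GCTGGTGG" + "AA" + "GCTGGTGG" + "TT" + "CCACCAGC"
print(observed_skew(toy, "GCTGGTGG"))  # 2.0
```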




Is $N_1^{\mathrm{obs}}(w)$ significantly more exceptional than $N_2^{\mathrm{obs}}(w)$?

    • One wants to compare the exceptionality of a motif w in two different
      sequences (two observed counts $N_1^{\mathrm{obs}}(w)$ and $N_2^{\mathrm{obs}}(w)$).
    • R’MES computes a test statistic and its associated p-value to test
                 H0 : {w is equally exceptional in both sequences}
      against
                 H1 : {w is more exceptional in the first sequence}
      [Robin et al. (08)]
    • The test is performed by modelling the occurrence processes as Poisson
      processes whose intensities take the sequence compositions in oligos of
      length 1 up to (m + 1) into account.
    • Option –seq2 soon available in R’MES.
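
One way to picture such a comparison is a conditional binomial test: if the two counts behave like independent Poisson variables, then given their total, the first count is binomial with success probability E1/(E1 + E2) under H0. The Python sketch below is written under that assumption and is not taken from the R’MES code; the function names are hypothetical. The numbers are N1, N2, E1, E2 of the first row of the matS table later in this deck.

```python
from math import comb


def binomial_upper_tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def compare_exceptionality(n1: int, n2: int, e1: float, e2: float) -> float:
    """p-value for H1 'more exceptional in sequence 1', conditioning on n1 + n2."""
    p0 = e1 / (e1 + e2)  # share of occurrences expected in sequence 1 under H0
    return binomial_upper_tail(n1 + n2, n1, p0)


# GACACTGTCAC: N1 = 7, N2 = 0, E1 = 0.21, E2 = 0.43
p_value = compare_exceptionality(7, 0, 0.21, 0.43)
print(f"{p_value:.4f}")
```

The result, about 0.0004, is close to the p-skew reported for that row, which suggests but does not establish that the column was obtained in a similar way.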




RMESPlot interface




Prediction and identification
 of functional DNA motifs




Chi motifs in bacterial genomes

• Motif involved in the repair of double-strand DNA breaks.
  Chi needs to be frequent along bacterial genomes.
• Chi motifs have been identified for only a few bacterial species. They are not
  conserved across species.
• Known Chi motifs are 5 to 8 nucleotides long and can be degenerate.
• Moreover, Chi activity is strongly orientation-dependent (direction of DNA
  replication).
  Chi is present preferentially on the leading strands (high skew).




E. coli as a learning case

  • 8-letter word GCTGGTGG
  • 762 occurrences on the leading strands (length $4.6 \times 10^6$)
  • Among the most over-represented 8-letter words (whatever the model Mm)
     ⇒ its frequency cannot be explained by the genome composition.
  • Its rank is improved if one analyzes only the backbone genome (genome
     conserved in several strains of the species).
  • Its skew equals 3.20 (p-value of $3.3 \times 10^{-11}$).


The skew of a motif $w$ is defined by $N^{\mathrm{obs}}(w)/N^{\mathrm{obs}}(\overline{w})$, where $\overline{w}$ is the reverse
complement of $w$.




Identification of Chi motif in S. aureus

                               Halpern et al. (07)
 • Analysis of the S. aureus backbone (length $2.44 \times 10^6$).
 • 8-letter words: none of the most over-represented and skewed motifs was
    frequent enough.
 • 7-letter words:




A=gaaaatg (1067),      B=ggattag (266),   C=gaagcgg (272),   D=gaattag (614)
Organization of the Ter macrodomain in E. coli

The chromosome of E. coli is organized into 4 macrodomains [Valens et al. (04)].
How is such structure ensured?



Bacillus subtilis as a learning case:

  • In B. subtilis, the parS motif is responsible for structuring the
     chromosomal domain surrounding the origin of replication [Lin and
     Grossman (98)].
  • The parS motif is 16 nt long; its sequence is partially degenerate and almost
     palindromic.
                                    t
                             TGTTAACACGTGAAACA
                             c      c   t        t

  • It is recognized by Spo0J in both directions.
  • One of its 11-mers is the most exceptional 11-mer $(w, \overline{w})$ in the origin domain.



Identification of matS in E. coli

10 most over-represented 11-mers $(w, \overline{w})$ of the TER domain (compound Poisson
approximation + family option):

                                                                   rank    ra
 word          N1   N2     E1      E2   score1   score2   p-skew   R’MES   Ske
 GACACTGTCAC    7    0   0.21    0.43     5.84     0.39   0.0004       1
 TGACACTGTCA    7    2   0.28    0.53     5.49     1.29   0.0101       2     4
 GACAGTGTCAC    6    0   0.20    0.43     5.24     0.38   0.0011       3     1
 GACGTTGTCAC    7    3   0.35    1.30     5.22     1.06   0.0012       4     1
 GACAACGTCAC    7    3   0.37    1.49     5.15     0.88   0.0008       5     1
 GACCCGAACGA    5    1   0.12    0.47     5.09     0.31   0.0017       6     2
 ATAGGGTAGAT    4    1   0.06    0.26     4.94     0.73   0.0041       7     3
 TAGTTACAACA    5    1   0.16    0.54     4.79     0.21   0.0032       8     2
 ATAAACGGCCC    6    3   0.31    1.68     4.76     0.71   0.0008       9     1
 TGACAACGTCA    7    5   0.51   1.786     4.72     1.81   0.0073      10     3




Identification of matS in E. coli

                             GACACTGTCAC
                            TGACACTGTCA
                             GACAGTGTCAC
                             GACGTTGTCAC
                             GACAACGTCAC
                            TGACAACGTCA

                          GTGACRNYGTCAC



matS is the 13-nt motif GTGACRNYGTCAC: it is recognized by the MatP protein, which
structures the Ter domain [Mercier et al. (08)].
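
The step from the six exceptional 11-mers to the 13-nt consensus can be checked mechanically: each 11-mer should fit some window of GTGACRNYGTCAC once the IUPAC symbols are expanded (R = A or G, Y = C or T, N = any nucleotide). A small Python sketch of that check; the helper names are illustrative.

```python
import re

# IUPAC nucleotide codes needed for this consensus, written as regex fragments.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "N": "[ACGT]"}


def consensus_to_regex(consensus: str) -> str:
    """Translate an IUPAC consensus into a regular expression."""
    return "".join(IUPAC[symbol] for symbol in consensus.upper())


def windows_matching(consensus: str, word: str) -> list[int]:
    """Offsets in the consensus at which the word is a valid instance."""
    k = len(word)
    return [i for i in range(len(consensus) - k + 1)
            if re.fullmatch(consensus_to_regex(consensus[i:i + k]), word)]


MATS = "GTGACRNYGTCAC"
ELEVEN_MERS = ["GACACTGTCAC", "TGACACTGTCA", "GACAGTGTCAC",
               "GACGTTGTCAC", "GACAACGTCAC", "TGACAACGTCA"]

for word in ELEVEN_MERS:
    print(word, windows_matching(MATS, word))
```

Each of the six words fits at offset 1 or 2 of the consensus, which is how the overlapping 11-mers assemble into the longer motif.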




Acknowledgment

       Françoise Gélis (R’MES 1.0)
        Annie Bouvier (R’MES 2.0)
       Mark Hoebeke (R’MES 3.0)


http://genome.jouy.inra.fr/ssb/rmes/




Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
tolgahangng
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Safe Software
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Wask
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
akankshawande
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 

Recently uploaded (20)

20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 

Schbath Rmes Bosc2009

  • 1. R’MES Finding Exceptional Motifs in Sequences S. Schbath INRA, Jouy-en-Josas, France http://genome.jouy.inra.fr/ssb/rmes/ BOSC, Stockholm, June 27-28, 2009 – p.1
  • 2. Introduction: motifs and statistics
  • 3. DNA and motifs
      • DNA: long molecule, sequence of nucleotides
      • Nucleotides: A(denine), C(ytosine), G(uanine), T(hymine)
      • Motif (= oligonucleotide): short sequence of nucleotides, e.g. CAGTAG
      • Functional motif: recognized by proteins or enzymes to initiate a biological process
      TAGACAGATAGACGATCAGTAGCCAGTAGACAGTAGGCATGA...
  • 5. Some functional motifs
      • Restriction sites: recognized by specific bacterial restriction enzymes ⇒ double-strand DNA break.
        E.g. GAATTC recognized by EcoRI: very rare along bacterial genomes.
      • Chi motif: recognized by an enzyme which proceeds along the DNA sequence and degrades it ⇒ the enzyme's degradation activity is stopped and DNA repair is stimulated by recombination.
        E.g. GCTGGTGG recognized by RecBCD (E. coli): very frequent along the E. coli genome.
      • parS: recognized by the Spo0J protein ⇒ organization of the B. subtilis genome into macro-domains.
        TGTTAACACGTGAAACA (degenerate; alternative bases t, c, c, t, t): very frequent in the ORI domain, rare elsewhere.
      • Promoter: structured motif recognized by the RNA polymerase to initiate gene transcription.
        E.g. TTGAC -(16;18)- TATAAT (E. coli): particularly located in front of genes.
  • 6. Prediction of functional motifs
      Most of the functional motifs are unknown in the different species. For instance,
      • which would be the Chi motif of S. aureus? [Halpern et al. (08)]
      • is there an equivalent of parS in E. coli? [Mercier et al. (08)]
      Statistical approach: identify candidate motifs based on their statistical properties.

      The most over-represented 8-letter words under M1, E. coli (length $4.6 \times 10^{6}$):
        word        obs     exp    score
        gctggtgg    762    84.9     73.5
        ggcgctgg    828   125.9     62.6
        cgctggcg    870   150.8     58.6
        gctggcgg    723   125.9     53.3
        cgctggtg    619   101.7     51.3

      The most over-represented families anbcdefg under M1, H. influenzae (length $1.8 \times 10^{6}$):
        motif       obs     exp    score
        gntggtgg    223    55.3    22.33
        anttcatc    469   180.3    21.59
        anatcgcc    288    87.8    21.38
        tnatcgcc    279    84.5    21.18
        gnagaaga    270    83.6    20.10
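The "exp" column above is an expected count under a Markov model fitted to the sequence. As a rough illustration only, the following sketch uses the classical plug-in estimate for an order-m model (product of the counts of the (m+1)-letter subwords divided by the product of the counts of the inner m-letter subwords). The function names `count_words` and `expected_count` are hypothetical, and this is not claimed to be how R’MES computes its values.

```python
from collections import Counter


def count_words(seq, k):
    """Count all (overlapping) occurrences of every k-letter word in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def expected_count(seq, word, m):
    """Plug-in estimate of the expected count of `word` under an
    order-m Markov model fitted to `seq` (requires 1 <= m <= len(word) - 1)."""
    if not 1 <= m <= len(word) - 1:
        raise ValueError("need 1 <= m <= len(word) - 1")
    big = count_words(seq, m + 1)   # counts of (m+1)-letter words
    small = count_words(seq, m)     # counts of m-letter words
    num = 1
    for i in range(len(word) - m):          # all (m+1)-letter subwords of word
        num *= big[word[i:i + m + 1]]
    den = 1
    for i in range(1, len(word) - m):       # inner m-letter subwords of word
        den *= small[word[i:i + m]]
    return num / den if den else 0.0


# Toy usage on a short periodic sequence (not real genome data).
toy = "ACGTACGTACGT"
observed = count_words(toy, 3)["ACG"]
expected = expected_count(toy, "ACG", 1)
```

Comparing `observed` with `expected` is only the first step; a score additionally needs the distribution of the count, which this sketch does not attempt.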
  • 7. Presentation of R’MES
  • 8. Statistical questions addressed by R’MES
      Questions related to the significance of the number of occurrences of motifs $w$ in sequences:
      • Is $N^{\mathrm{obs}}(w)$ significantly high?
      • Is $N^{\mathrm{obs}}(w)$ significantly higher than $N^{\mathrm{obs}}(w')$?
        If $w' = \bar{w}$: is $w$ significantly skewed (strand bias)?
      • Is $N_1^{\mathrm{obs}}(w)$ significantly more unexpected than $N_2^{\mathrm{obs}}(w)$?
      Several types of motifs $w$:
      • fixed words (e.g. gctggtgg),
      • degenerate patterns (e.g. gntggtgg),
      • sets of words (e.g. $\{w, \bar{w}\}$).
  • 11. Is $N^{\mathrm{obs}}(w)$ significantly high?
      • One needs to calculate the p-value $P(N(w) \geq N^{\mathrm{obs}}(w))$, where $N(w)$ is the count (random variable) of $w$ in random sequences (→ model).
      • R’MES considers Markov chain models of order m (Mm), which fit the sequence composition in oligonucleotides of length 1 up to (m + 1). The phase in coding sequences can be taken into account (Mm_3).
      • R’MES approximates the p-value by using
          • either a Gaussian approximation of $N(w)$ (when $E(N(w))$ is large) [Prum et al. (95)], [Schbath et al. (95)]
          • or a compound Poisson approximation of $N(w)$ (when $E(N(w))$ is small) [Schbath (95)], [Roquain and Schbath (07)]
        (see DNA, Words and Models, Robin, Rodolphe, Schbath, CUP 2005)
      • R’MES produces scores of exceptionality (probit transformation). High positive (resp. negative) scores correspond to exceptionally frequent (resp. rare) motifs.
      rmes --gauss -s seqfile -m m -l wordlength -o outputfile
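The probit transformation mentioned above maps a p-value to the standard normal quantile with the same upper-tail probability. A minimal sketch, assuming the score $u$ is defined by $P(Z \geq u) = p$ for $Z \sim N(0,1)$; the function name `probit_score` is hypothetical and the exact convention used inside R’MES is not shown on the slides.

```python
from statistics import NormalDist


def probit_score(p_upper):
    """Return u such that P(Z >= u) = p_upper for a standard normal Z.

    Small p-values (over-representation) give large positive scores;
    p-values close to 1 (under-representation) give negative scores.
    """
    if not 0.0 < p_upper < 1.0:
        raise ValueError("p-value must lie strictly between 0 and 1")
    # By symmetry of the normal law, the upper quantile is minus the lower one;
    # this avoids computing 1 - p, which loses precision for tiny p.
    return -NormalDist().inv_cdf(p_upper)
```

With this convention a p-value of 0.5 corresponds to a score of 0, and a p-value of 0.025 to a score close to 1.96.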
  • 14. Is $N^{\mathrm{obs}}(w)$ significantly higher than $N^{\mathrm{obs}}(w')$?
      • One needs to calculate the p-value $P\left( \frac{N(w)}{N(w')} \geq \frac{N^{\mathrm{obs}}(w)}{N^{\mathrm{obs}}(w')} \right)$, where $N(\cdot)$ is the count (random variable) in random sequences (→ model).
      • R’MES considers Markov chain models of order m (Mm), which fit the sequence composition in oligonucleotides of length 1 up to (m + 1). The phase in coding sequences can be taken into account (Mm_3).
      • R’MES approximates the p-value by using the 2-dimensional Gaussian approximation of $(N(w), N(w'))$ (when $E(N(w))$ and $E(N(w'))$ are large) [Prum et al. (95)], [Schbath et al. (95)]
      • R’MES produces scores of exceptional skew (probit transformation): high positive (resp. negative) scores correspond to motifs significantly more frequent (resp. rare) along the sequence than along the complementary one.
      rmes --skew --seq seqfile -m m -l wordlength -o outputfile
  • 17. Is $N_1^{\mathrm{obs}}(w)$ significantly more exceptional than $N_2^{\mathrm{obs}}(w)$?
      • One wants to compare the exceptionality of a motif $w$ in two different sequences (two observed counts $N_1^{\mathrm{obs}}(w)$ and $N_2^{\mathrm{obs}}(w)$).
      • R’MES computes a test statistic and its associated p-value to test
        H0: {w is equally exceptional in both sequences} against H1: {w is more exceptional in the first sequence} [Robin et al. (08)]
      • The test treats the occurrence processes as Poisson processes whose intensities take into account the composition of each sequence in oligonucleotides of length 1 up to (m + 1).
      • Option --seq2 soon available in R’MES.
  • 18. RMESPlot interface
  • 19. Prediction and identification of functional DNA motifs
  • 20. Chi motifs in bacterial genomes
      • Motif involved in the repair of double-strand DNA breaks. Chi needs to be frequent along bacterial genomes.
      • Chi motifs have been identified for only a few bacterial species. They are not conserved across species.
      • Known Chi motifs are 5 to 8 nucleotides long and can be degenerate.
      • Moreover, Chi activity is strongly orientation-dependent (direction of DNA replication). Chi is present preferentially on the leading strands (high skew).
  • 21. E. coli as a learning case
      • 8-letter word GCTGGTGG
      • 762 occurrences on the leading strands (length $4.6 \times 10^{6}$)
      • Among the most over-represented 8-letter words (whatever the model Mm) ⇒ its frequency cannot be explained by the genome composition.
      • Its rank is improved if one analyzes only the backbone genome (genome conserved in several strains of the species).
      • Its skew equals 3.20 (p-value of $3.3 \times 10^{-11}$).
      The skew of a motif $w$ is defined by $N^{\mathrm{obs}}(w) / N^{\mathrm{obs}}(\bar{w})$, where $\bar{w}$ is the reverse complement of $w$.
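The skew defined above is simply the ratio of two observed counts, so it can be sketched in a few lines. The helper names below are hypothetical, the toy sequence is invented for illustration (it is not E. coli data), and the sketch computes only the ratio, not its p-value.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(word):
    """Return the reverse complement of a DNA word."""
    return word.translate(_COMPLEMENT)[::-1]


def count_occurrences(seq, word):
    """Count occurrences of word in seq, overlapping ones included."""
    k = len(word)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == word)


def skew(seq, word):
    """Observed skew: count of word divided by count of its reverse complement."""
    n_w = count_occurrences(seq, word)
    n_rc = count_occurrences(seq, reverse_complement(word))
    if n_rc == 0:
        raise ValueError("reverse complement never occurs; skew is undefined")
    return n_w / n_rc


# Toy sequence: two copies of GCTGGTGG and one copy of its reverse complement.
toy = "GCTGGTGG" + "AAA" + "CCACCAGC" + "GCTGGTGG"
toy_skew = skew(toy, "GCTGGTGG")
```

On a real genome the counts would be taken on the leading strands, as on the slide; that strand bookkeeping is left out here.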
  • 22. Identification of Chi motif in S. aureus
      Halpern et al. (07)
      • Analysis of the S. aureus backbone (length $2.44 \times 10^{6}$).
      • 8-letter words: none of the most over-represented and skewed motifs was frequent enough.
      • 7-letter words: A = gaaaatg (1067), B = ggattag (266), C = gaagcgg (272), D = gaattag (614)
  • 24. Organization of the Ter macrodomain in E. coli
      The chromosome of E. coli is organized into 4 macrodomains [Valens et al. (04)]. How is such a structure ensured?
      Bacillus subtilis as a learning case:
      • In B. subtilis, the parS motif is responsible for structuring the chromosomal domain surrounding the origin of replication [Lin and Grossman (98)].
      • The parS motif is 16 nt long; its sequence is partially degenerate and rather palindromic.
        TGTTAACACGTGAAACA (degenerate; alternative bases t, c, c, t, t)
      • It is recognized by Spo0J in both directions.
      • One of its 11-mers is the most exceptional 11-mer $(w, \bar{w})$ in the origin domain.
  • 25. Identification of matS in E. coli
      10 most over-represented 11-mers $(w, \bar{w})$ of the TER domain (compound Poisson approximation + family option):
        word          N1  N2    E1     E2     score1  score2  p-skew   rank (R’MES)  rank (Ske…)
        GACACTGTCAC    7   0   0.21   0.43     5.84    0.39   0.0004        1
        TGACACTGTCA    7   2   0.28   0.53     5.49    1.29   0.0101        2            4
        GACAGTGTCAC    6   0   0.20   0.43     5.24    0.38   0.0011        3            1
        GACGTTGTCAC    7   3   0.35   1.30     5.22    1.06   0.0012        4            1
        GACAACGTCAC    7   3   0.37   1.49     5.15    0.88   0.0008        5            1
        GACCCGAACGA    5   1   0.12   0.47     5.09    0.31   0.0017        6            2
        ATAGGGTAGAT    4   1   0.06   0.26     4.94    0.73   0.0041        7            3
        TAGTTACAACA    5   1   0.16   0.54     4.79    0.21   0.0032        8            2
        ATAAACGGCCC    6   3   0.31   1.68     4.76    0.71   0.0008        9            1
        TGACAACGTCA    7   5   0.51   1.786    4.72    1.81   0.0073       10            3
  • 26. Identification of matS in E. coli
        GACACTGTCAC
        TGACACTGTCA
        GACAGTGTCAC
        GACGTTGTCAC
        GACAACGTCAC
        TGACAACGTCA
        GTGACRNYGTCAC
      matS is the 13-nt motif GTGACRNYGTCAC: it is recognized by the MatP protein, which structures the Ter domain [Mercier et al. (08)].
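The relation between the six 11-mers and the 13-nt consensus can be checked mechanically: each 11-mer should match some 11-letter window of GTGACRNYGTCAC once the IUPAC codes R (A or G), Y (C or T) and N (any base) are expanded. A small sketch, with hypothetical helper names and only the IUPAC codes needed here:

```python
import re

# Subset of the IUPAC nucleotide codes, written as regular-expression classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "N": "[ACGT]",
}


def iupac_to_regex(pattern):
    """Translate an IUPAC pattern into an equivalent regular expression."""
    return "".join(IUPAC[c] for c in pattern)


def matches_window(word, consensus):
    """True if word matches at least one len(word)-letter window of consensus."""
    k = len(word)
    return any(
        re.fullmatch(iupac_to_regex(consensus[i:i + k]), word)
        for i in range(len(consensus) - k + 1)
    )


CONSENSUS = "GTGACRNYGTCAC"
CANDIDATES = [
    "GACACTGTCAC", "TGACACTGTCA", "GACAGTGTCAC",
    "GACGTTGTCAC", "GACAACGTCAC", "TGACAACGTCA",
]
results = {w: matches_window(w, CONSENSUS) for w in CANDIDATES}
```

All six candidates are compatible with the consensus, whereas an unrelated top-ranked 11-mer such as GACCCGAACGA is not.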
  • 27. Acknowledgment Françoise Gélis (R’MES 1.0) Annie Bouvier (R’MES 2.0) Mark Hoebeke (R’MES 3.0) http://genome.jouy.inra.fr/ssb/rmes/