Schbath Rmes Bosc2009

Transcript

  • 1. R’MES: Finding Exceptional Motifs in Sequences. S. Schbath, INRA, Jouy-en-Josas, France. http://genome.jouy.inra.fr/ssb/rmes/ BOSC, Stockholm, June 27-28, 2009
  • 2. Introduction: motifs and statistics
  • 3. DNA and motifs
      • DNA: long molecule, a sequence of nucleotides.
      • Nucleotides: A(denine), C(ytosine), G(uanine), T(hymine).
      • Motif (= oligonucleotide): short sequence of nucleotides, e.g. CAGTAG.
      • Functional motif: recognized by proteins or enzymes to initiate a biological process.
      TAGACAGATAGACGATCAGTAGCCAGTAGACAGTAGGCATGA...
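Counting the occurrences of a motif in a sequence is the basic quantity behind the statistics discussed in the rest of the talk. A minimal Python sketch, assuming overlapping occurrences are counted on a single strand (the function name `count_occurrences` is illustrative and is not part of R’MES):

```python
def count_occurrences(sequence: str, motif: str) -> int:
    """Count occurrences of motif in sequence, overlapping ones included."""
    sequence, motif = sequence.upper(), motif.upper()
    count = 0
    start = sequence.find(motif)
    while start != -1:
        count += 1
        # Advance by one position only, so overlapping occurrences are found.
        start = sequence.find(motif, start + 1)
    return count


# Example sequence and motif taken from the slide.
example = "TAGACAGATAGACGATCAGTAGCCAGTAGACAGTAGGCATGA"
print(count_occurrences(example, "CAGTAG"))  # prints 3
```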
  • 5. Some functional motifs
      • Restriction sites: recognized by specific bacterial restriction enzymes ⇒ double-strand DNA break. E.g. GAATTC, recognized by EcoRI. Very rare along bacterial genomes.
      • Chi motif: recognized by an enzyme that travels along the DNA sequence and degrades it ⇒ the degradation activity stops and DNA repair by recombination is stimulated. E.g. GCTGGTGG, recognized by RecBCD (E. coli). Very frequent along the E. coli genome.
      • parS: recognized by the Spo0J protein ⇒ organization of the B. subtilis genome into macro-domains. t TGTTAACACGTGAAACA c c t t Very frequent in the ORI domain, rare elsewhere.
      • Promoter: structured motif recognized by the RNA polymerase to initiate gene transcription. E.g. TTGAC −(16;18)− TATAAT (E. coli), i.e. two boxes separated by 16 to 18 nucleotides. Particularly located in front of genes.
  • 6. Prediction of functional motifs
      Most functional motifs are unknown in the various species. For instance:
      • What would be the Chi motif of S. aureus? [Halpern et al. (08)]
      • Is there an equivalent of parS in E. coli? [Mercier et al. (08)]
      Statistical approach: identify candidate motifs based on their statistical properties.

      The most over-represented 8-letter words under M1, E. coli (length 4.6 × 10^6):
      word      obs  exp    score
      gctggtgg  762  84.9   73.5
      ggcgctgg  828  125.9  62.6
      cgctggcg  870  150.8  58.6
      gctggcgg  723  125.9  53.3
      cgctggtg  619  101.7  51.3

      The most over-represented families anbcdefg under M1, H. influenzae (length 1.8 × 10^6):
      motif     obs  exp    score
      gntggtgg  223  55.3   22.33
      anttcatc  469  180.3  21.59
      anatcgcc  288  87.8   21.38
      tnatcgcc  279  84.5   21.18
      gnagaaga  270  83.6   20.10
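The "exp" columns are counts expected under a Markov model. One standard plug-in estimator for a model of order m multiplies the observed counts of the (m+1)-letter subwords of the motif and divides by the counts of its inner m-letter subwords. The sketch below assumes that estimator; the slides do not say exactly how R’MES computes the expectation, and the function names are hypothetical.

```python
def count_overlapping(seq: str, word: str) -> int:
    """Count occurrences of word in seq, overlapping ones included."""
    count, start = 0, seq.find(word)
    while start != -1:
        count += 1
        start = seq.find(word, start + 1)
    return count


def expected_count(seq: str, word: str, m: int = 1) -> float:
    """Plug-in estimate of the expected count of word under a Markov model of order m >= 1."""
    if m < 1 or len(word) < m + 1:
        raise ValueError("need m >= 1 and a word of length at least m + 1")
    numerator = 1.0
    for i in range(len(word) - m):
        # Counts of all (m+1)-letter subwords of the motif.
        numerator *= count_overlapping(seq, word[i:i + m + 1])
    denominator = 1.0
    for i in range(1, len(word) - m):
        # Counts of the inner m-letter subwords (first and last excluded).
        denominator *= count_overlapping(seq, word[i:i + m])
    return numerator / denominator if denominator else 0.0


toy = "ACGTACGTACGT"
print(expected_count(toy, "ACG", m=1))  # N(AC) * N(CG) / N(C) = 3 * 3 / 3 = 3.0
```

Comparing this expectation with the observed count is what makes a word such as gctggtgg (762 observed against 84.9 expected) stand out.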
  • 7. Presentation of R’MES
  • 8. Statistical questions addressed by R’MES
      Questions related to the significance of the number of occurrences of motifs w in sequences:
      • Is $N^{\mathrm{obs}}(w)$ significantly high?
      • Is $N^{\mathrm{obs}}(w)$ significantly higher than $N^{\mathrm{obs}}(w')$? → If $w' = \bar{w}$: is w significantly skewed (strand bias)?
      • Is $N_1^{\mathrm{obs}}(w)$ significantly more unexpected than $N_2^{\mathrm{obs}}(w)$?
      Several types of motifs w:
      • fixed words (e.g. gctggtgg),
      • degenerate patterns (e.g. gntggtgg),
      • sets of words (e.g. $\{w, \bar{w}\}$).
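A degenerate pattern can be read as the family of fixed words it stands for. The sketch below expands a pattern such as gntggtgg into its words; it assumes the usual IUPAC meaning of n (any nucleotide), r (a or g) and y (c or t), covers only that subset of the IUPAC codes, and uses a hypothetical function name.

```python
from itertools import product

# Subset of the IUPAC nucleotide codes, enough for the patterns in these slides.
IUPAC = {"a": "a", "c": "c", "g": "g", "t": "t", "r": "ag", "y": "ct", "n": "acgt"}


def expand_pattern(pattern: str) -> list[str]:
    """Return every fixed word matched by a degenerate pattern."""
    choices = [IUPAC[letter] for letter in pattern.lower()]
    return ["".join(letters) for letters in product(*choices)]


family = expand_pattern("gntggtgg")
print(family)  # ['gatggtgg', 'gctggtgg', 'ggtggtgg', 'gttggtgg']
```

The count of the pattern is then the sum of the counts of the words in the family.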
  • 11. Is $N^{\mathrm{obs}}(w)$ significantly high?
      • One needs to calculate the p-value $P(N(w) \ge N^{\mathrm{obs}}(w))$, where $N(w)$ is the count (a random variable) of w in random sequences (→ model).
      • R’MES considers Markov chain models of order m (Mm), which fit the sequence composition in oligos of length 1 up to (m + 1). The phase in coding sequences can be taken into account (Mm_3).
      • R’MES approximates the p-value using
          • either a Gaussian approximation of $N(w)$ (when $E(N(w))$ is large) [Prum et al. (95)], [Schbath et al. (95)],
          • or a compound Poisson approximation of $N(w)$ (when $E(N(w))$ is small) [Schbath (95)], [Roquain and Schbath (07)]
        (see DNA, Words and Models, Robin, Rodolphe, Schbath, CUP 2005).
      • R’MES produces scores of exceptionality (probit transformation). High positive (resp. negative) scores correspond to exceptionally frequent (resp. rare) motifs.
      rmes --gauss -s seqfile -m m -l wordlength -o outputfile
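The probit transformation maps a p-value to the quantile of a standard normal distribution with the same tail probability. A minimal Python sketch of that mapping, assuming the p-value is the upper tail $P(N(w) \ge N^{\mathrm{obs}}(w))$; the function name and the exact sign convention are assumptions, not taken from the R’MES source.

```python
from statistics import NormalDist


def exceptionality_score(p_upper: float) -> float:
    """Map an upper-tail p-value to a probit score.

    A small p-value (over-represented motif) gives a large positive score,
    a p-value close to 1 (under-represented motif) a large negative one.
    """
    if not 0.0 < p_upper < 1.0:
        raise ValueError("p-value must lie strictly between 0 and 1")
    # For a standard normal Z, the score s satisfies P(Z >= s) = p_upper.
    return -NormalDist().inv_cdf(p_upper)


print(round(exceptionality_score(0.025), 2))  # prints 1.96
```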
  • 14. Is $N^{\mathrm{obs}}(w)$ significantly higher than $N^{\mathrm{obs}}(\bar{w})$?
      • One needs to calculate the p-value $P\left(\frac{N(w)}{N(\bar{w})} \ge \frac{N^{\mathrm{obs}}(w)}{N^{\mathrm{obs}}(\bar{w})}\right)$, where $N(\cdot)$ is the count (a random variable) in random sequences (→ model) and $\bar{w}$ is the reverse complement of w.
      • R’MES considers Markov chain models of order m (Mm), which fit the sequence composition in oligos of length 1 up to (m + 1). The phase in coding sequences can be taken into account (Mm_3).
      • R’MES approximates the p-value using the 2-dimensional Gaussian approximation of $(N(w), N(\bar{w}))$ (when $E(N(w))$ and $E(N(\bar{w}))$ are large) [Prum et al. (95)], [Schbath et al. (95)].
      • R’MES produces scores of exceptional skew (probit transformation): high positive (resp. negative) scores correspond to motifs significantly more frequent (resp. rare) along the sequence than along the complementary one.
      rmes --skew --seq seqfile -m m -l wordlength -o outputfile
  • 17. Is $N_1^{\mathrm{obs}}(w)$ significantly more exceptional than $N_2^{\mathrm{obs}}(w)$?
      • One wants to compare the exceptionality of a motif w in two different sequences (two observed counts $N_1^{\mathrm{obs}}(w)$ and $N_2^{\mathrm{obs}}(w)$).
      • R’MES computes a test statistic and its associated p-value to test H0: {w is equally exceptional in both sequences} against H1: {w is more exceptional in the first sequence} [Robin et al. (08)].
      • The test treats the occurrence processes as Poisson processes whose intensities take into account the sequence compositions in oligos of length 1 up to (m + 1).
      • Option --seq2 soon available in R’MES.
  • 18. RMESPlot interface
  • 19. Prediction and identification of functional DNA motifs
  • 20. Chi motifs in bacterial genomes
      • Motif involved in the repair of double-strand DNA breaks. Chi needs to be frequent along bacterial genomes.
      • Chi motifs have been identified for only a few bacterial species. They are not conserved across species.
      • Known Chi motifs are 5 to 8 nucleotides long and can be degenerate.
      • Moreover, Chi activity is strongly orientation-dependent (direction of DNA replication). Chi is preferentially present on the leading strands (high skew).
  • 21. E. coli as a learning case
      • 8-letter word GCTGGTGG.
      • 762 occurrences on the leading strands (length 4.6 × 10^6).
      • Among the most over-represented 8-letter words (whatever the model Mm) ⇒ its frequency cannot be explained by the genome composition.
      • Its rank improves if one analyzes only the backbone genome (genome conserved in several strains of the species).
      • Its skew equals 3.20 (p-value of $3.3 \times 10^{-11}$).
      The skew of a motif w is defined by $N^{\mathrm{obs}}(w) / N^{\mathrm{obs}}(\bar{w})$, where $\bar{w}$ is the reverse complement of w.
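The skew defined on this slide can be computed directly from two counts. The Python sketch below does so on a single sequence; it assumes overlapping occurrences are counted and ignores the leading/lagging strand bookkeeping of a real genome. The function names and the toy sequence are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word: str) -> str:
    """Return the reverse complement of a DNA word."""
    return word.upper().translate(COMPLEMENT)[::-1]


def count_overlapping(seq: str, word: str) -> int:
    """Count occurrences of word in seq, overlapping ones included."""
    count, start = 0, seq.find(word)
    while start != -1:
        count += 1
        start = seq.find(word, start + 1)
    return count


def skew(seq: str, word: str) -> float:
    """Ratio of the count of word to the count of its reverse complement."""
    forward = count_overlapping(seq, word)
    backward = count_overlapping(seq, reverse_complement(word))
    if backward == 0:
        raise ZeroDivisionError("reverse complement does not occur in the sequence")
    return forward / backward


# Toy sequence: three copies of Chi followed by one copy of its reverse complement.
toy = "GCTGGTGG" * 3 + reverse_complement("GCTGGTGG")
print(reverse_complement("GCTGGTGG"))  # prints CCACCAGC
print(skew(toy, "GCTGGTGG"))           # prints 3.0
```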
  • 22. Identification of the Chi motif in S. aureus [Halpern et al. (07)]
      • Analysis of the S. aureus backbone (length 2.44 × 10^6).
      • 8-letter words: none of the most over-represented and skewed motifs was frequent enough.
      • 7-letter words: A=gaaaatg (1067), B=ggattag (266), C=gaagcgg (272), D=gaattag (614).
  • 24. Organization of the Ter macrodomain in E. coli
      The chromosome of E. coli is organized into 4 macrodomains [Valens et al. (04)]. How is this structure maintained?
      Bacillus subtilis as a learning case:
      • In B. subtilis, the parS motif is responsible for structuring the chromosomal domain surrounding the origin of replication [Lin and Grossman (98)].
      • The parS motif is 16 nt long; its sequence is partially degenerate and nearly palindromic. t TGTTAACACGTGAAACA c c t t
      • It is recognized by Spo0J in both directions.
      • One of its 11-mers is the most exceptional 11-mer $(w, \bar{w})$ in the origin domain.
  • 25. Identification of matS in E. coli
      10 most over-represented 11-mers $(w, \bar{w})$ of the TER domain (compound Poisson approximation + family option):
      word         N1  N2  E1    E2     score1  score2  p-skew  rank R’MES  rank Skew
      GACACTGTCAC  7   0   0.21  0.43   5.84    0.39    0.0004  1
      TGACACTGTCA  7   2   0.28  0.53   5.49    1.29    0.0101  2           4
      GACAGTGTCAC  6   0   0.20  0.43   5.24    0.38    0.0011  3           1
      GACGTTGTCAC  7   3   0.35  1.30   5.22    1.06    0.0012  4           1
      GACAACGTCAC  7   3   0.37  1.49   5.15    0.88    0.0008  5           1
      GACCCGAACGA  5   1   0.12  0.47   5.09    0.31    0.0017  6           2
      ATAGGGTAGAT  4   1   0.06  0.26   4.94    0.73    0.0041  7           3
      TAGTTACAACA  5   1   0.16  0.54   4.79    0.21    0.0032  8           2
      ATAAACGGCCC  6   3   0.31  1.68   4.76    0.71    0.0008  9           1
      TGACAACGTCA  7   5   0.51  1.786  4.72    1.81    0.0073  10          3
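One way to compare two counts under Poisson assumptions is a conditional binomial test: given the total N1 + N2, the first count is binomial with success probability E1 / (E1 + E2) when the motif is equally exceptional in both sequences. The Python sketch below applies that test to the first row of the table. Whether R’MES computes the p-skew column in exactly this way is an assumption, and the function name is hypothetical.

```python
from math import comb


def binomial_upper_tail(n1: int, n2: int, e1: float, e2: float) -> float:
    """P(X >= n1) for X ~ Binomial(n1 + n2, e1 / (e1 + e2))."""
    total = n1 + n2
    p = e1 / (e1 + e2)
    return sum(
        comb(total, k) * p**k * (1.0 - p) ** (total - k)
        for k in range(n1, total + 1)
    )


# First row of the table: GACACTGTCAC with N1=7, N2=0, E1=0.21, E2=0.43.
p_value = binomial_upper_tail(7, 0, 0.21, 0.43)
print(f"{p_value:.4f}")  # prints 0.0004
```

For this row the result rounds to 0.0004, the value listed in the p-skew column.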
  • 26. Identification of matS in E. coli
      GACACTGTCAC
      TGACACTGTCA
      GACAGTGTCAC
      GACGTTGTCAC
      GACAACGTCAC
      TGACAACGTCA
      Consensus: GTGACRNYGTCAC
      matS is the 13-nt motif GTGACRNYGTCAC: it is recognized by the MatP protein, which structures the Ter domain [Mercier et al. (08)].
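The six 11-mers above are all compatible with the 13-nt consensus, each matching a window of it. The Python sketch below checks this, assuming the usual IUPAC meaning of R (A or G), Y (C or T) and N (any nucleotide); the function name is illustrative.

```python
import re

# Subset of the IUPAC nucleotide codes, written as regular-expression classes.
IUPAC_REGEX = {"A": "A", "C": "C", "G": "G", "T": "T",
               "R": "[AG]", "Y": "[CT]", "N": "[ACGT]"}


def fits_consensus(word: str, consensus: str) -> bool:
    """True if word matches some window of the consensus of the same length."""
    for offset in range(len(consensus) - len(word) + 1):
        window = consensus[offset:offset + len(word)]
        pattern = "".join(IUPAC_REGEX[letter] for letter in window)
        if re.fullmatch(pattern, word):
            return True
    return False


words = ["GACACTGTCAC", "TGACACTGTCA", "GACAGTGTCAC",
         "GACGTTGTCAC", "GACAACGTCAC", "TGACAACGTCA"]
print(all(fits_consensus(w, "GTGACRNYGTCAC") for w in words))  # prints True
```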
  • 27. Acknowledgment
      Françoise Gélis (R’MES 1.0), Annie Bouvier (R’MES 2.0), Mark Hoebeke (R’MES 3.0)
      http://genome.jouy.inra.fr/ssb/rmes/