The document describes the R'MES software for identifying exceptional motifs in DNA sequences. R'MES uses statistical methods to determine whether the number of occurrences of a motif is significantly higher than expected by chance, or significantly skewed between the two strands. It can compare the exceptionality of a motif between sequences and identify motifs that are over-represented or orientation-dependent in a genome. As examples, it discusses how R'MES was used to identify the Chi motif in Staphylococcus aureus and investigate the organization of DNA in Escherichia coli.
Martin Roth: A spatial peaks-over-threshold model in a nonstationary climate (Jiří Šmída)
1. The document proposes a spatial peaks-over-threshold model for estimating quantiles and trends in daily precipitation in a nonstationary climate.
2. It uses a generalized Pareto distribution fitted to precipitation extremes above a threshold to model peaks over threshold, with the threshold and distribution parameters allowed to vary over time in a nonstationary manner.
3. Spatial dependence is incorporated through an index flood approach where distribution parameters are constant across sites after scaling by a site-specific index flood value.
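The peaks-over-threshold step described above can be sketched with SciPy's generalized Pareto fit. The synthetic precipitation series, the 95th-percentile threshold, and the quantile level below are illustrative assumptions, not values from the paper:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
daily = rng.gamma(shape=2.0, scale=5.0, size=10_000)  # synthetic "daily precipitation"

u = np.quantile(daily, 0.95)       # threshold: 95th percentile
excess = daily[daily > u] - u      # peaks over threshold

# Fit a generalized Pareto distribution to the excesses (location fixed at 0)
shape, loc, scale = stats.genpareto.fit(excess, floc=0)

# A high quantile of the original series via the POT representation
p = 0.999
q = u + stats.genpareto.ppf((p - 0.95) / 0.05, shape, loc=0, scale=scale)
print(round(q, 2))
```

In the nonstationary, spatial setting of the paper, the threshold and GPD parameters would additionally vary over time and be scaled by a site-specific index flood value.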
Poster for Bayesian Statistics in the Big Data Era conference (Christian Robert)
The document proposes a new version of Hamiltonian Monte Carlo (HMC) sampling that is essentially calibration-free. It achieves this by learning the optimal leapfrog scale from the distribution of integration times using the No-U-Turn Sampler algorithm. Compared to the original NUTS algorithm on benchmark models, this new enhanced HMC (eHMC) exhibits significantly improved efficiency with no hand-tuning of parameters required. The document tests eHMC on a Susceptible-Infected-Recovered model of disease transmission.
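The leapfrog integrator whose step size and path length eHMC/NUTS tune can be sketched as follows; the standard-normal target and the specific `eps` and `L` values are illustrative assumptions:

```python
import numpy as np

def leapfrog(q, p, grad_U, eps, L):
    """L leapfrog steps for Hamiltonian dynamics with potential U."""
    q, p = q.copy(), p.copy()
    p -= 0.5 * eps * grad_U(q)          # initial half step for momentum
    for _ in range(L - 1):
        q += eps * p                    # full step for position
        p -= eps * grad_U(q)            # full step for momentum
    q += eps * p
    p -= 0.5 * eps * grad_U(q)          # final half step for momentum
    return q, p

# Standard normal target: U(q) = q^2 / 2, so grad U(q) = q
grad_U = lambda q: q
q0, p0 = np.array([1.0]), np.array([0.5])
q1, p1 = leapfrog(q0, p0, grad_U, eps=0.1, L=20)

H = lambda q, p: 0.5 * q @ q + 0.5 * p @ p   # Hamiltonian (energy)
print(abs(H(q1, p1) - H(q0, p0)))            # small discretization error
```

The near-conservation of the Hamiltonian is what makes HMC proposals acceptable with high probability; eHMC's contribution is choosing `eps` and the path length without hand-tuning.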
1. The document discusses maximum likelihood estimation and Bayesian parameter estimation for machine learning problems involving parametric densities like the Gaussian.
2. Maximum likelihood estimation finds the parameter values that maximize the probability of obtaining the observed training data. For Gaussian distributions with unknown mean and variance, MLE returns the sample mean and variance.
3. Bayesian parameter estimation treats the parameters as random variables and uses prior distributions and observed data to obtain posterior distributions over the parameters. This allows incorporation of prior knowledge with the training data.
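For a univariate Gaussian, the MLE claim in point 2 can be verified directly; the data values below are made up for illustration:

```python
import numpy as np

x = np.array([2.1, 1.9, 2.4, 2.0, 2.6])

# MLE for a Gaussian: the sample mean and the (biased, 1/n) sample variance
mu_hat = x.mean()
var_hat = ((x - mu_hat) ** 2).mean()   # equivalently np.var(x, ddof=0)

print(mu_hat, var_hat)
```

Note that the MLE variance divides by n, not n-1; the Bayesian treatment would instead place a prior on (mu, sigma^2) and report a posterior.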
This document discusses key concepts in probability theory, including:
1) Markov's inequality and Chebyshev's inequality, which relate the probability that a random variable exceeds a value to its expected value and variance.
2) The weak law of large numbers and central limit theorem, which describe how the means of independent random variables converge to the expected value and follow a normal distribution as the number of variables increases.
3) Stochastic processes, which are collections of random variables indexed by time or another parameter and can model evolving systems. Examples of stochastic processes and their properties are provided.
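Chebyshev's inequality from point 1 can be checked empirically; the exponential distribution (mean 1, variance 1) and the sample size are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=100_000)   # mean 1, variance 1

k = 3.0
# Chebyshev: P(|X - mu| >= k * sigma) <= 1 / k^2
empirical = np.mean(np.abs(x - 1.0) >= k * 1.0)
bound = 1.0 / k**2
print(empirical, bound)
```

The empirical tail probability (about e^-4 here) sits well below the distribution-free bound of 1/9, as the inequality guarantees.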
There are three possible ROCs for a transform with poles at a, b, and c on the real axis:
1. Outside the circle through the outermost pole
2. An annular region between two of the pole circles
3. Inside the circle through the innermost pole
(Pole-zero plot: poles a, b, c on the Re axis.)
The z-Transform
Important z-Transform Pairs
1. Unit impulse: δ(n) = 1 if n = 0, and 0 otherwise; its z-transform is X(z) = 1, with ROC the entire z-plane.
2.
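The unit-impulse pair can be verified numerically: summing δ(n) z^(-n) over a finite support gives 1 for any nonzero z. A minimal sketch, with arbitrary sample points for z:

```python
import numpy as np

n = np.arange(-10, 11)
delta = (n == 0).astype(float)   # unit impulse: 1 at n = 0, else 0

def z_transform(x, n, z):
    """Finite-support z-transform: X(z) = sum_n x[n] z^(-n)."""
    return np.sum(x * np.power(complex(z), -n))

for z in [0.5, 2.0, 1j, 0.3 - 0.7j]:
    print(z_transform(delta, n, z))   # 1 for every z != 0
```

Since the sum converges for every nonzero z, the ROC of the unit impulse is the whole z-plane, unlike the pole-dependent regions discussed above.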
This document describes stochastic definite clause grammars (SDCG), which extend definite clause grammars (DCG) with probabilities. SDCG transforms a DCG into a stochastic logic program using PRISM, allowing probabilistic inferences and parameter learning. The probabilistic model assigns a random variable to each rule expansion. SDCG introduces syntax extensions like regular expressions and macros to make grammars more concise. Conditioned rules allow modeling higher-order hidden Markov models by selecting rules based on variable unification. SDCG provides tools for parsing sentences and learning rule probabilities from data.
The document discusses histograms and histogram equalization for digital image processing. It describes a histogram as an estimate of the probability distribution of gray values in an image, providing insight into the image's contrast. Histogram equalization is introduced as a technique that transforms an image's gray values such that the transformed values are approximately uniformly distributed, improving contrast by spreading out the most frequent intensities. The key steps of histogram equalization are outlined.
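The equalization steps outlined above amount to mapping each gray value through the scaled empirical CDF; a minimal sketch on a synthetic low-contrast image:

```python
import numpy as np

def equalize(img, levels=256):
    """Histogram equalization: map gray values through the scaled CDF."""
    hist = np.bincount(img.ravel(), minlength=levels)
    cdf = hist.cumsum() / img.size                     # empirical CDF in [0, 1]
    lut = np.round(cdf * (levels - 1)).astype(np.uint8)
    return lut[img]

rng = np.random.default_rng(0)
img = rng.integers(100, 140, size=(64, 64)).astype(np.uint8)  # low-contrast image
out = equalize(img)
print(img.min(), img.max(), out.min(), out.max())
```

The output occupies nearly the full [0, 255] range even though the input was squeezed into [100, 139], which is the contrast-stretching effect the document describes.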
This document describes polarization coherence tomography (PCT) techniques for tomographic imaging of natural volumetric media using polarimetric synthetic aperture radar (SAR). It discusses using Born's approximation and decomposing the scattering function over Legendre polynomials to relate the coherence to polynomial integrals. The basic steps of the technique are to 1) set the volume limits, 2) compute polynomial integrals, 3) select a polarization channel, and 4) invert the linear relation to reconstruct the scattering function and create tomographic images. It also proposes a least squares tomography approach using orthogonal polynomials to provide a compact solution for arbitrary polynomial order PCT.
Optimal constant-time approximation algorithms and inapproximability for every CSP in the bounded-degree model (Yuichi Yoshida)
1. The document discusses the maximum constraint satisfaction problem (Max CSP) and how to approximate its optimal value. It presents a basic linear programming (LP) relaxation called BasicLP that provides an (αΛ-ε, ε)-approximation for any CSP Λ, where αΛ is the integrality gap.
2. For some CSPs like Max Cut, BasicLP can be implemented as a packing LP and solved to give an (αΛ+ε, δ)-approximation in O(√n) time, improving on the Ω(n) time needed for general CSPs.
3. The document outlines how to derive the (αΛ+
The document summarizes the Wang-Landau algorithm and some of its improvements. The Wang-Landau algorithm is an adaptive Markov chain Monte Carlo method that iteratively estimates the density of states of a system. It partitions the state space into bins and iteratively adjusts estimates of the density within each bin so that the generated samples spend an equal amount of time in each bin. The algorithm has been improved through automatic binning methods, adaptive proposal distributions, and using parallel interacting chains. An example application to variable selection is also discussed.
EM algorithm and its application in probabilistic latent semantic analysis (zukun)
The document discusses the EM algorithm and its application in Probabilistic Latent Semantic Analysis (pLSA). It begins by introducing the parameter estimation problem and comparing frequentist and Bayesian approaches. It then describes the EM algorithm, which iteratively computes lower bounds to the log-likelihood function. Finally, it applies the EM algorithm to pLSA by modeling documents and words as arising from a mixture of latent topics.
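The EM iteration for pLSA can be sketched on a toy term-document matrix (the counts and the choice of K = 2 topics are made up for illustration). The log-likelihood is non-decreasing across iterations, which is the lower-bound argument in action:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy term-document count matrix n(w, d): 4 words x 4 documents (made-up counts)
N = np.array([[5., 2., 0., 0.],
              [4., 3., 0., 1.],
              [0., 1., 6., 4.],
              [0., 0., 5., 3.]])
W, D, K = 4, 4, 2   # words, documents, latent topics

p_w_z = rng.random((W, K)); p_w_z /= p_w_z.sum(axis=0)   # p(w|z)
p_z_d = rng.random((K, D)); p_z_d /= p_z_d.sum(axis=0)   # p(z|d)

def loglik():
    return np.sum(N * np.log(p_w_z @ p_z_d + 1e-12))     # sum n(w,d) log p(w|d)

lls = []
for _ in range(50):
    lls.append(loglik())
    joint = p_w_z[:, :, None] * p_z_d[None, :, :]        # shape (W, K, D)
    resp = joint / joint.sum(axis=1, keepdims=True)      # E-step: p(z|w,d)
    Nz = N[:, None, :] * resp                            # expected topic counts
    p_w_z = Nz.sum(axis=2); p_w_z /= p_w_z.sum(axis=0)   # M-step updates
    p_z_d = Nz.sum(axis=0); p_z_d /= p_z_d.sum(axis=0)

print(round(lls[0], 3), round(lls[-1], 3))
```

Each E-step computes topic responsibilities and each M-step re-normalizes expected counts, exactly the alternating lower-bound maximization the document describes.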
Spectral Learning Methods for Finite State Machines with Applications to Na... (LARCA UPC)
The document summarizes a spectral learning method for probabilistic finite-state machines (FSMs). It introduces observable operator models that represent probabilistic transducers using conditional probabilities between inputs, outputs, and hidden states. A key contribution is a spectral algorithm that learns the parameters of these models from data in linear time, with theoretical PAC-style guarantees. Experimental results on synthetic data show the method outperforms baselines like HMMs and k-HMMs on learning tasks.
This document summarizes and compares different distances that can be used in generative adversarial networks (GANs). It introduces the Wasserstein distance, also known as the Earth Mover (EM) distance or Wasserstein-1 distance. The document shows that the Wasserstein distance is more meaningful than other distances like total variation, Kullback-Leibler divergence, and Jensen-Shannon divergence when the real and generated distributions start to differ but their supports still overlap. It also demonstrates that training GANs with the Wasserstein distance provides improved stability during training compared to other distances. Several theorems and examples are provided to illustrate properties of the Wasserstein distance such as Lipschitz continuity.
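In one dimension with equal-size samples, the Wasserstein-1 distance reduces to the mean absolute difference of sorted samples, which makes the "more meaningful" claim easy to see on a toy example (the values below are illustrative):

```python
import numpy as np

def w1(a, b):
    """Wasserstein-1 between two equal-size 1-D empirical samples:
    with equal weights it is the mean absolute difference of sorted samples."""
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

real = np.array([0.0, 0.0, 0.0, 0.0])
for theta in [0.0, 0.5, 1.0, 2.0]:
    fake = np.full(4, theta)
    print(theta, w1(real, fake))   # grows smoothly with theta
```

Here W1 equals |theta| and shrinks continuously as the generated mass approaches the real mass, whereas total variation or JS divergence would jump to their maximum for any theta != 0, giving no useful gradient.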
Digital signal processing (DSP) is concerned with the digital representation and processing of signals. DSP has advantages over analog processing like guaranteed accuracy, reproducibility, and flexibility. DSP has applications in areas like image processing, speech recognition, telecommunications, and biomedical signal analysis. A discrete-time signal is represented as a sequence of numbers. A discrete-time system maps an input signal to an output signal. Linear, time-invariant systems can be characterized by their impulse response. The discrete Fourier transform provides a means to obtain Fourier components of a signal using digital computation at discrete frequencies.
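The discrete Fourier transform mentioned at the end can be written out directly from its definition and checked against NumPy's FFT; the input sequence is arbitrary:

```python
import numpy as np

def dft(x):
    """Naive O(N^2) DFT: X[k] = sum_n x[n] exp(-2*pi*i*k*n/N)."""
    N = len(x)
    n = np.arange(N)
    return np.array([np.sum(x * np.exp(-2j * np.pi * k * n / N)) for k in range(N)])

x = np.array([1.0, 2.0, 0.0, -1.0])
print(np.allclose(dft(x), np.fft.fft(x)))   # True
```

The naive sum makes the "Fourier components at discrete frequencies" idea concrete; in practice the FFT computes the same result in O(N log N).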
14th Athens Colloquium on Algorithms and Complexity (ACAC19) (Apostolos Chalkis)
This document presents a new method for estimating the volume of convex polytopes, practical volume estimation with a new annealing schedule. It uses a multiphase Monte Carlo approach with a sequence of concentric convex bodies to approximate the volume. A new simulated annealing method constructs a sparser sequence of bodies. Billiard walk sampling is used for V-represented polytopes and zonotopes. The method scales to dimension 100 within an hour for random V-polytopes and zonotopes, outperforming previous methods with theoretical complexity of O*(d^3).
This document summarizes a method for performing kernel-based similarity search in massive graph databases using wavelet trees. It introduces the need for efficient graph similarity search as graph databases grow large. It describes representing graphs as bags-of-words and using a semi-conjunctive query to relax cosine similarity searches. The method replaces inverted indexes with a wavelet tree to enable fast top-down search while using less memory than traditional inverted indexes. Experiments on a dataset of 25 million chemical compounds demonstrate the method's ability to perform similarity search efficiently in large graph databases.
- The thesis studies numerical methods for stochastic partial differential equations (SPDEs) subject to generalized Levy noise.
- It develops both deterministic methods using the Fokker-Planck equation and probabilistic methods like polynomial chaos.
- Key contributions include developing adaptive multi-element polynomial chaos for discrete measures, comparing approaches to construct orthogonal polynomials over discrete measures, and improving efficiency and accuracy through adaptive integration meshes and sparse grids.
Multimodal pattern matching algorithms and applications (Xavier Anguera)
In this presentation I focus on 3 projects I have been working on in the last year. The first one is a novel pattern matching algorithm, based on the well-known Dynamic Time Warping. The presented algorithm can be used to find real-valued subsequences within a longer sequence, without prior knowledge of their start and end points. I have applied the algorithm to the task of acoustic matching, for which I will show some preliminary results. Then I will continue to explain a second DTW-based algorithm, this one able to perform an online alignment of two musical pieces. One of the music pieces can be input live or retrieved from an audio file, while the second one is extracted from an online music video. The online alignment allows the music video to be played in total synchrony with the corresponding ambient/recorded audio. Finally, I will talk about video copy detection, which is the task of finding duplicate video segments within a big database. I will explain our multimodal approach, based on audio-visual change-based features.
(1) The document describes a method for efficient similarity search in massive graph databases using wavelet trees. (2) It converts graphs into bags-of-words representations using the Weisfeiler-Lehman procedure and indexes the words with a wavelet tree to enable fast semi-conjunctive queries. (3) Experiments on 25 million chemical compounds showed the method was significantly faster than alternative approaches while using less memory.
High-dimensional polytopes defined by oracles: algorithms, computations and a... (Vissarion Fisikopoulos)
The document discusses algorithms for computing volumes of polytopes. It notes that exactly computing volumes is hard, but randomized polynomial-time algorithms can approximate volumes with high probability. It describes two algorithms: Random Directions Hit-and-Run (RDHR), which generates random points within a polytope via random walks; and Multiphase Monte Carlo, which approximates a polytope's volume by sampling points within a sequence of enclosing balls. RDHR mixes in O(d^3) steps and these algorithms can compute volumes of high-dimensional polytopes that exact algorithms cannot handle.
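As a baseline for why such randomized methods are needed, here is the naive rejection-sampling volume estimator. It works in low dimension but its acceptance rate degrades exponentially as d grows, which is exactly what hit-and-run and multiphase Monte Carlo avoid. The 3-dimensional simplex is an illustrative choice with a known volume of 1/3! = 1/6:

```python
import numpy as np

rng = np.random.default_rng(0)

def volume_estimate(inside, d, cube_vol, n=200_000):
    """Crude Monte Carlo volume: fraction of uniform points from the
    bounding cube [0,1]^d landing inside the body, times the cube volume."""
    pts = rng.random((n, d))
    return cube_vol * np.mean(inside(pts))

# d-dimensional standard simplex {x >= 0, sum x <= 1}; true volume = 1/d!
d = 3
inside = lambda pts: pts.sum(axis=1) <= 1.0
est = volume_estimate(inside, d, cube_vol=1.0)
print(est)   # close to 1/6
```

For d = 100 the simplex occupies about 1/100! of the cube, so essentially no sample would ever land inside; the sequence-of-enclosing-bodies trick replaces this single vanishing ratio with a product of moderate ratios.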
1) Pairwise sequence alignment is a method to compare two biological sequences like DNA, RNA, or proteins. It involves arranging the sequences in columns to highlight their similarities and differences.
2) There are many possible alignments between two sequences, but most imply too many mutations. The best alignment minimizes the number of mutations needed to explain the differences between the sequences.
3) For short protein sequences like "QKGSYPVRSTC" and "QKGSGPVRSTC", the optimal alignment implies one single mutation occurred since the sequences diverged from a common ancestor.
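For these two equal-length sequences the gap-free alignment cost is just the number of mismatched columns, which confirms the single-mutation claim:

```python
s1 = "QKGSYPVRSTC"
s2 = "QKGSGPVRSTC"

# For two same-length sequences, the gap-free alignment cost is the
# Hamming distance: count the columns where the residues differ.
mismatches = [(i, a, b) for i, (a, b) in enumerate(zip(s1, s2)) if a != b]
print(mismatches)   # [(4, 'Y', 'G')]: one substitution since divergence
```

General pairwise alignment additionally allows gaps (insertions and deletions) and scores them with dynamic programming, but for this example the single substitution at position 4 is already the minimal-mutation explanation.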
This document presents a method for estimating the eigenvalues of a covariance matrix when there are few samples. It involves shifting the sampled eigenvalues toward the population values based on theoretical distributions, and balancing the energy across eigenvalues. This simple 3-matrix approach improves estimation and detection performance compared to using the sampled eigenvalues alone. Simulations and hyperspectral data experiments demonstrate the effectiveness of the method.
Computing the volume of a convex body is a fundamental problem in computational geometry and optimization. In this talk we discuss the computational complexity of this problem from a theoretical as well as practical point of view. We show examples of how volume computation appears in applications ranging from combinatorics to algebraic geometry.
Next, we design the first practical algorithm for polytope volume approximation in high dimensions (few hundreds).
The algorithm utilizes uniform sampling from a convex region and efficient boundary polytope oracles.
Interestingly, our software provides a framework for exploring theoretical advances since it is believed, and our experiments provide evidence for this belief, that the current asymptotic bounds are unrealistically high.
DESeq models read counts with a negative binomial distribution to account for biological variability between samples, which a Poisson distribution underestimates. It estimates variance for each gene based on a local regression of variance against mean expression of other genes. This allows it to better control false positives compared to EdgeR or a Poisson model. DESeq also estimates sequencing depth differently than EdgeR to improve differential expression testing across the dynamic range of expression levels.
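The mean-variance relationship that distinguishes the negative binomial from the Poisson can be written out directly; the dispersion value alpha below is an assumed illustration, not a DESeq estimate:

```python
import numpy as np

mu = np.array([5.0, 50.0, 500.0])   # mean read counts across expression levels
alpha = 0.1                          # assumed gene-wise dispersion

poisson_var = mu                     # Poisson: variance equals the mean
nb_var = mu + alpha * mu**2          # negative binomial: extra biological variability

print(nb_var / poisson_var)          # overdispersion factor 1 + alpha*mu
```

The overdispersion factor grows with expression level, which is why a Poisson model increasingly understates variance (and inflates false positives) for highly expressed genes.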
This document summarizes a thesis on numerical methods for stochastic systems subject to generalized Levy noise. It includes:
1) Motivation for studying such systems from both mathematical and applicational perspectives, such as in mathematical finance and chaotic flows.
2) An introduction to Levy processes and the probability collocation method (PCM) for uncertainty quantification (UQ).
3) Details on improving PCM through a multi-element approach and constructing orthogonal polynomials for discrete measures.
TMPA-2015: Implementing the MetaVCG Approach in the C-light System (Iosif Itkin)
Alexei Promsky, Dmitry Kondratyev, A.P. Ershov Institute of Informatics Systems, Novosibirsk
12 - 14 November 2015
Tools and Methods of Program Analysis in St. Petersburg
I am Vincent S., an Algorithm Assignment Expert at programminghomeworkhelp.com. I hold a Ph.D. in Programming from the University of Minnesota, USA. I have been helping students with their homework for the past 9 years, solving assignments related to algorithms.
Visit programminghomeworkhelp.com or email support@programminghomeworkhelp.com. You can also call on +1 678 648 4277 for any assistance with Algorithm assignments.
次数制限モデルにおける全てのCSPに対する最適な定数時間近似アルゴリズムと近似困難性Yuichi Yoshida
1. The document discusses the maximum constraint satisfaction problem (Max CSP) and how to approximate its optimal value. It presents a basic linear programming (LP) relaxation called BasicLP that provides an (αΛ-ε, ε)-approximation for any CSP Λ, where αΛ is the integrality gap.
2. For some CSPs like Max Cut, BasicLP can be implemented as a packing LP and solved in polynomial time to give an (αΛ+ε, δ)-approximation in √n time, improving on the Ω(n) time needed for general CSPs.
3. The document outlines how to derive the (αΛ+
The document summarizes the Wang-Landau algorithm and some of its improvements. The Wang-Landau algorithm is an adaptive Markov chain Monte Carlo method that iteratively estimates the density of states of a system. It partitions the state space into bins and iteratively adjusts estimates of the density within each bin so that the generated samples spend an equal amount of time in each bin. The algorithm has been improved through automatic binning methods, adaptive proposal distributions, and using parallel interacting chains. An example application to variable selection is also discussed.
EM algorithm and its application in probabilistic latent semantic analysiszukun
The document discusses the EM algorithm and its application in Probabilistic Latent Semantic Analysis (pLSA). It begins by introducing the parameter estimation problem and comparing frequentist and Bayesian approaches. It then describes the EM algorithm, which iteratively computes lower bounds to the log-likelihood function. Finally, it applies the EM algorithm to pLSA by modeling documents and words as arising from a mixture of latent topics.
Spectral Learning Methods for Finite State Machines with Applications to Na...LARCA UPC
The document summarizes a spectral learning method for probabilistic finite-state machines (FSMs). It introduces observable operator models that represent probabilistic transducers using conditional probabilities between inputs, outputs, and hidden states. A key contribution is a spectral algorithm that learns the parameters of these models from data in linear time, with theoretical PAC-style guarantees. Experimental results on synthetic data show the method outperforms baselines like HMMs and k-HMMs on learning tasks.
This document summarizes and compares different distances that can be used in generative adversarial networks (GANs). It introduces the Wasserstein distance, also known as the Earth Mover (EM) distance or Wasserstein-1 distance. The document shows that the Wasserstein distance is more meaningful than other distances like total variation, Kullback-Leibler divergence, and Jensen-Shannon divergence when the real and generated distributions start to differ but their support still overlap. It also demonstrates that training GANs with the Wasserstein distance provides improved stability during training compared to other distances. Several theorems and examples are provided to illustrate properties of the Wasserstein distance such as Lipschitz continuity.
Digital signal processing (DSP) is concerned with the digital representation and processing of signals. DSP has advantages over analog processing like guaranteed accuracy, reproducibility, and flexibility. DSP has applications in areas like image processing, speech recognition, telecommunications, and biomedical signal analysis. A discrete-time signal is represented as a sequence of numbers. A discrete-time system maps an input signal to an output signal. Linear, time-invariant systems can be characterized by their impulse response. The discrete Fourier transform provides a means to obtain Fourier components of a signal using digital computation at discrete frequencies.
14th Athens Colloquium on Algorithms and Complexity (ACAC19)Apostolos Chalkis
This document presents a new method for estimating the volume of convex polytopes called practical volume estimation by a new annealing schedule. It uses a multiphase Monte Carlo approach with a sequence of concentric convex bodies to approximate the volume. A new simulated annealing method constructs a sparser sequence of bodies. Billiard walk sampling is used for volume-represented and zonotope polytopes. The method scales to dimensions of 100 in an hour for random V-polytopes and zonotopes, outperforming previous methods with theoretical complexity of O*(d^3).
This document summarizes a method for performing kernel-based similarity search in massive graph databases using wavelet trees. It introduces the need for efficient graph similarity search as graph databases grow large. It describes representing graphs as bags-of-words and using a semi-conjunctive query to relax cosine similarity searches. The method replaces inverted indexes with a wavelet tree to enable fast top-down search while using less memory than traditional inverted indexes. Experiments on a dataset of 25 million chemical compounds demonstrate the method's ability to perform similarity search efficiently in large graph databases.
- The thesis studies numerical methods for stochastic partial differential equations (SPDEs) subject to generalized Levy noise.
- It develops both deterministic methods using the Fokker-Planck equation and probabilistic methods like polynomial chaos.
- Key contributions include developing adaptive multi-element polynomial chaos for discrete measures, comparing approaches to construct orthogonal polynomials over discrete measures, and improving efficiency and accuracy through adaptive integration meshes and sparse grids.
Multimodal pattern matching algorithms and applicationsXavier Anguera
In this presentation I focus on 3 projects I have been working in the last year. The first one is a novel pattern matching algorithm, based on the well known Dynamic Time Warping. The presented algorithm can be used to find real-valued subsequences within a longer sequence, without prior knowledge of their start-end points. I have applied the algorithm for the task of acoustic matching, for which I will show some preliminary results. Then I will continue to explain a second DTW-based algorithm, this one being able do an online of two musical pieces. One of the music pieces can be input life or be retrieved from an audio file, while the second one is extracted from an online music video. The online alignment allows for the music video to be played in total synchrony with the corresponding ambient/recorded audio. Finally, I will talk about video copy detection, which is the task of finding video duplicate segments within a big database. I will explain our multimodal approach, based on audio-visual change-based features.
(1) The document describes a method for efficient similarity search in massive graph databases using wavelet trees. (2) It converts graphs into bags-of-words representations using the Weisfeiler-Lehman procedure and indexes the words with a wavelet tree to enable fast semi-conjunctive queries. (3) Experiments on 25 million chemical compounds showed the method was significantly faster than alternative approaches while using less memory.
High-dimensional polytopes defined by oracles: algorithms, computations and a...Vissarion Fisikopoulos
The document discusses algorithms for computing volumes of polytopes. It notes that exactly computing volumes is hard, but randomized polynomial-time algorithms can approximate volumes with high probability. It describes two algorithms: Random Directions Hit-and-Run (RDHR), which generates random points within a polytope via random walks; and Multiphase Monte Carlo, which approximates a polytope's volume by sampling points within a sequence of enclosing balls. RDHR mixes in O(d^3) steps and these algorithms can compute volumes of high-dimensional polytopes that exact algorithms cannot handle.
1) Pairwise sequence alignment is a method to compare two biological sequences like DNA, RNA, or proteins. It involves arranging the sequences in columns to highlight their similarities and differences.
2) There are many possible alignments between two sequences, but most imply too many mutations. The best alignment minimizes the number of mutations needed to explain the differences between the sequences.
3) For short protein sequences like "QKGSYPVRSTC" and "QKGSGPVRSTC", the optimal alignment implies one single mutation occurred since the sequences diverged from a common ancestor.
This document presents a method for estimating the eigenvalues of a covariance matrix when there are few samples. It involves shifting the sampled eigenvalues toward the population values based on theoretical distributions, and balancing the energy across eigenvalues. This simple 3-matrix approach improves estimation and detection performance compared to using the sampled eigenvalues alone. Simulations and hyperspectral data experiments demonstrate the effectiveness of the method.
Computing the volume of a convex body is a fundamental problem in computational geometry and optimization. In this talk we discuss the computational complexity of this problem from a theoretical as well as practical point of view. We show examples of how volume computation appear in applications ranging from combinatorics to algebraic geometry.
Next, we design the first practical algorithm for polytope volume approximation in high dimensions (few hundreds).
The algorithm utilizes uniform sampling from a convex region and efficient boundary polytope oracles.
Interestingly, our software provides a framework for exploring theoretical advances since it is believed, and our experiments provide evidence for this belief, that the current asymptotic bounds are unrealistically high.
DESeq models read counts with a negative binomial distribution to account for biological variability between samples, which a Poisson distribution underestimates. It estimates variance for each gene based on a local regression of variance against mean expression of other genes. This allows it to better control false positives compared to EdgeR or a Poisson model. DESeq also estimates sequencing depth differently than EdgeR to improve differential expression testing across the dynamic range of expression levels.
This document summarizes a thesis on numerical methods for stochastic systems subject to generalized Levy noise. It includes:
1) Motivation for studying such systems from both mathematical and applicational perspectives, such as in mathematical finance and chaotic flows.
2) An introduction to Levy processes and the probability collocation method (PCM) for uncertainty quantification (UQ).
3) Details on improving PCM through a multi-element approach and constructing orthogonal polynomials for discrete measures.
TMPA-2015: Implementing the MetaVCG Approach in the C-light System (Iosif Itkin)
Alexei Promsky, Dmitry Kondratyev, A.P. Ershov Institute of Informatics Systems, Novosibirsk
12 - 14 November 2015
Tools and Methods of Program Analysis in St. Petersburg
Markov Chain Monitoring - Application to demand prediction in bike sharing sy... (Harshal Chaudhari)
The presentation accompanying the paper at SDM 2018 - https://epubs.siam.org/doi/abs/10.1137/1.9781611975321.50
Github: https://github.com/chdhr-harshal/mc-monitor
In networking applications, one often wishes to obtain estimates about the number of objects at different parts of the network (e.g., the number of cars at an intersection of a road network or the number of packets expected to reach a node in a computer network) by monitoring the traffic in a small number of network nodes or edges. We formalize this task by defining the Markov Chain Monitoring problem. Given an initial distribution of items over the nodes of a Markov chain, we wish to estimate the distribution of items at subsequent times. We do this by asking a limited number of queries that retrieve, for example, how many items transitioned to a specific node or over a specific edge at a particular time. We consider different types of queries, each defining a different variant of the Markov Chain Monitoring. For each variant, we design efficient algorithms for choosing the queries that make our estimates as accurate as possible. In our experiments with synthetic and real datasets we demonstrate the efficiency and the efficacy of our algorithms in a variety of settings.
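The estimation setting described above can be sketched in a few lines. This is a minimal illustration of pushing an item distribution one step through a Markov chain, with a toy transition matrix and counts; it does not implement the paper's query-selection algorithms:

```python
# Toy setup: items distributed over the nodes of a 3-node Markov chain,
# pushed forward one step through the row-stochastic transition matrix.
P = [
    [0.5, 0.5, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.3, 0.7],
]
x0 = [100, 50, 50]  # initial item counts per node

def step(x, P):
    """Expected item counts after one transition: x_{t+1}[j] = sum_i x_t[i] * P[i][j]."""
    n = len(P)
    return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]

x1 = step(x0, P)
print(x1)
```

Queries (e.g. observing how many items actually transitioned over an edge) would then be used to correct this expected distribution; choosing which queries to ask is the optimization problem the paper addresses.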
This document summarizes a presentation about variational autoencoders (VAEs) presented at the ICLR 2016 conference. The document discusses 5 VAE-related papers presented at ICLR 2016, including Importance Weighted Autoencoders, The Variational Fair Autoencoder, Generating Images from Captions with Attention, Variational Gaussian Process, and Variationally Auto-Encoded Deep Gaussian Processes. It also provides background on variational inference and VAEs, explaining how VAEs use neural networks to model probability distributions and maximize a lower bound on the log likelihood.
We examine the effectiveness of randomized quasi Monte Carlo (RQMC) to improve the convergence rate of the mean integrated square error, compared with crude Monte Carlo (MC), when estimating the density of a random variable X defined as a function over the s-dimensional unit cube (0,1)^s. We consider histograms and kernel density estimators. We show both theoretically and empirically that RQMC estimators can achieve faster convergence rates in some situations.
This is joint work with Amal Ben Abdellah, Art B. Owen, and Florian Puchhammer.
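As a toy contrast between the two kinds of point sets (not the paper's estimators), the following compares crude MC points with a randomly shifted rank-1 lattice, one simple RQMC construction; the generating vector and sample size are illustrative choices:

```python
import random

# Estimate the density of X = U1 + U2 over (0,1)^2 with a histogram,
# using crude MC points vs a randomly shifted rank-1 lattice (RQMC).

def hist_density(points, bins=8):
    h = [0] * bins
    for u1, u2 in points:
        x = u1 + u2                       # X lies in (0, 2)
        h[min(int(x / 2 * bins), bins - 1)] += 1
    n = len(points)
    return [c / n * bins / 2 for c in h]  # normalize counts to a density

n = 1024
mc = [(random.random(), random.random()) for _ in range(n)]

z = (1, 433)                               # toy generating vector, gcd(433, n) = 1
s1, s2 = random.random(), random.random()  # random shift makes the rule unbiased
lattice = [(((i * z[0] / n) + s1) % 1, ((i * z[1] / n) + s2) % 1)
           for i in range(n)]

print(hist_density(mc))
print(hist_density(lattice))
```

Repeating this over many random shifts/seeds and averaging the integrated square error is how the MC-vs-RQMC convergence rates discussed above would be compared empirically.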
The document describes MOLGENIS, a database generator that can automatically generate useful data applications from simple data models. It is being used to create xGAP (extensible Genotype And Phenotype database), which aims to harmonize and enable collaboration and analysis of diverse genotype and phenotype data. The document outlines challenges in data integration in biology and how MOLGENIS addresses these challenges through its platform and software generators.
The 10th Annual Bioinformatics Open Source Conference (BOSC 2009) was held June 27-28, 2009 and organized by Kam Dahlquist, Lonnie Welch, and others. The conference schedule and information were available online and included calls for lightning talks and Birds of a Feather sessions. Lunch was provided each day and the conference featured keynote speakers, sessions, and a student travel award.
The document discusses a panel at the 2009 Bioinformatics Open Source Community conference about applying software patterns to bioinformatics open source development. The panel explores how patterns can help create better bioinformatics software, patterns they have already identified, what is done with patterns, whether there is a pattern repository, and who maintains such a repository.
The document discusses two software patterns used in developing Chipster, a bioinformatics application: graceful GUI blocking, which places an opaque layer over the GUI to indicate loading and prevent user interaction; and self-service distributed state management, which distributes application state management to clients to avoid single points of failure in a distributed system. The patterns were found useful for Chipster, which provides bioinformatics analysis tools through a graphical interface and supports distributed computing.
This document discusses the discovery that DNA previously thought to have no value ("junk DNA") may actually play important roles in gene expression regulation. Scientists investigated junk DNA in the model plant Arabidopsis thaliana and found short, linked patterns of DNA called pyknons. This suggests a universal genetic mechanism is at play across biology that is not yet fully understood. The discovery illustrates the connection between coding and non-coding DNA and that the term "junk DNA" may need reevaluation.
EMBOSS is an open source software suite for sequence analysis. It contains over 200 applications and supports over 100 file formats. It is funded by the UK BBSRC and developed by researchers at the EBI, Sanger Institute, and other institutions. EMBOSS faced an uncertain future in 2004 when its original developers were relocated, but continued funding has allowed ongoing development and support for a worldwide user base conducting research on all continents.
BioJava is an open source Java framework for processing biological data. It provides tools for analyzing and manipulating sequences, structures, and other biological data. The latest version, BioJava 1.7, includes improved support for 3D structures and modularization into separate modules. The project aims to facilitate rapid bioinformatics application development and is supported by an active developer community.
Soaplab is a generator of web services for accessing command-line programs and other tools. It wraps hundreds of EMBOSS programs and other plugins as SOAP web services. A new release, Soaplab 2.2.0, adds support for "typed services" which define inputs and outputs using WSDL and XSD for better integration with third party tools. Developers can add new command line tools or plugins and Soaplab will generate the corresponding web services.
This document provides an update on the Biopython project. It discusses recent releases including support for new file formats like FASTQ and new modules. It outlines current and future projects including work on parsing new file types and switching from CVS to git version control. Development involves an international team through an open source model and is supported by various organizations.
The document discusses software patterns for reusable design, outlining what a software pattern is, how patterns are used within communities, and how to apply patterns to documentation, design, and development. It provides an overview of pattern concepts including what constitutes a pattern, pattern languages, and pattern communities while cautioning that patterns should not be viewed as a "turn the crank" approach to software development.
PSODA is an open-source phylogenetic search and DNA analysis package that is compatible with PAUP* and adds a scripting language to PAUP blocks to allow for advanced meta-searches. It began development in 2005 as an alternative to PAUP* that could be used for phylogenetic search, multiple alignment, and detecting natural selection. PSODA's scripting language, PSODAscript, adds functionality like decision statements, loops, and functions to PAUP blocks and allows for easy scripting of meta-searches.
This document summarizes an approach called VAMSAS that enables sharing of data like sequences, alignments, and annotations between different bioinformatics tools. It describes how VAMSAS uses a shared XML document and client library to allow tools to access and update shared data, view events in other tools, and better integrate workflows. Examples of tools like TOPALI and Jalview that could benefit from this approach are discussed.
This document describes a method for discovering composite motifs in DNA sequences. The method searches for overrepresented patterns representing transcription factor binding sites. It improves on previous methods by modeling motifs as modules that occur together, rather than as isolated patterns. The algorithm ranks predicted modules based on support, specificity and significance. It was shown to outperform other tools, particularly at realistic noise levels, due to its use of real DNA backgrounds and support-based scoring. Future work includes exploring the full Pareto front of optimal solutions and parameter interactions to improve predictions.
3) The algorithm works by first discovering short motif seeds, then extending these seeds into full length position weight matrices, and iteratively refining the matrices to discover overrepresented motifs.
Debian-Med is a Debian Linux distribution community focused on adapting and disseminating open source bioinformatics software. It maintains over 160 bioinformatics packages through a collaborative development process, providing quality assurance and compatibility across multiple architectures. The community aims to improve access to packages for high-performance and cloud computing as well as ease of data management and distribution of bioinformatics libraries and software.
The document summarizes the BioLib project, which aims to create C/C++ libraries for common biological functionality that can be accessed from multiple bioinformatics programming languages to avoid duplication of efforts. It has created bindings for several existing libraries, including Affyio, Staden IO, GSL, Rlib, and others. The project uses Git for version control, CMake for building, and SWIG for generating language bindings in an effort to maximize code reuse across languages.
This document introduces BNFinder, a Python software for reconstructing Bayesian networks and dynamic Bayesian networks from data. It uses a fast, exact algorithm to find the optimal network topology, unlike traditional Markov chain Monte Carlo methods. The software supports discrete and continuous data, different scoring functions, and datasets with perturbations. It is open source and runs efficiently on large real-world genomic and neural network examples. Future plans include parallelization and improvements to continuous variable and classification models.
The document discusses the BioHDF project which aims to develop scalable data infrastructure for bioinformatics using HDF5. It notes that next generation DNA sequencing is producing vast amounts of complex data that is challenging to analyze and compare across samples due to lack of consistent data models and structured storage. The BioHDF project seeks to address this by developing HDF5 domain extensions and tools to organize, index, annotate and access sequencing data in a way that enables more efficient analysis, visualization and exploration of results within and between samples.
The document describes the biomanycores.org project, which aims to create a repository of open-source GPU-accelerated bioinformatics algorithms. It provides interfaces to popular bioinformatics tools like BioJava, BioPerl, and Biopython to easily integrate the GPU implementations. The project currently includes tools like Smith-Waterman alignment and PWM scanning. The challenges include differing APIs, object representations, real-world pipelines, and licensing. The goals are to share more OpenCL code, integrate and benchmark new algorithms, and improve usability for bioinformaticians.
This document discusses Q-normalization, a method for normalizing gene expression data. It presents parallel implementations of Q-normalization using shared memory, message passing, and GPU architectures. Benchmarking shows the GPU implementation provides a 5.5x speedup over the sequential CPU version for processing large gene expression datasets. The shared memory implementation provides a 2.9x total speedup, while the message passing version is suitable for distributed memory clusters.
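For reference, the sequential algorithm being parallelized can be sketched as plain quantile normalization (a minimal version; the benchmarked shared-memory, message-passing, and GPU implementations are not shown in the document):

```python
# Quantile normalization: each sample's k-th smallest value is replaced
# by the mean of the k-th smallest values across all samples, so every
# sample ends up with the same empirical distribution.

def quantile_normalize(columns):
    n = len(columns[0])
    sorted_cols = [sorted(c) for c in columns]
    rank_means = [sum(col[k] for col in sorted_cols) / len(columns)
                  for k in range(n)]
    out = []
    for c in columns:
        order = sorted(range(n), key=c.__getitem__)  # indices by increasing value
        norm = [0.0] * n
        for rank, idx in enumerate(order):
            norm[idx] = rank_means[rank]             # assign rank mean back in place
        out.append(norm)
    return out

samples = [[5, 2, 3], [4, 1, 6]]  # two samples, three genes
print(quantile_normalize(samples))
```

The per-sample sort and rank assignment are independent across samples, which is what makes the method amenable to the parallel architectures benchmarked above.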
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx (SitimaJohn)
Ocean Lotus cyber threat actors represent a sophisticated, persistent, and politically motivated group that poses a significant risk to organizations and individuals in the Southeast Asian region. Their continuous evolution and adaptability underscore the need for robust cybersecurity measures and international cooperation to identify and mitigate the threats posed by such advanced persistent threat groups.
Best 20 SEO Techniques To Improve Website Visibility In SERP (Pixlogix Infotech)
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Programming Foundation Models with DSPy - Meetup Slides (Zilliz)
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...) (Jeffrey Haguewood)
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on integration of Salesforce with Bonterra Impact Management.
Interested in deploying an integration with Salesforce for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence (IndexBug)
Imagine a world where machines not only perform tasks but also learn, adapt, and make decisions. This is the promise of Artificial Intelligence (AI), a technology that's not just enhancing our lives but revolutionizing entire industries.
Monitoring and Managing Anomaly Detection on OpenShift.pdf (Tosin Akinosho)
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
HCL Notes and Domino License Cost Reduction in the World of DLAU (panagenda)
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able to lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
Introduction of Cybersecurity with OSS at Code Europe 2024 (Hiroshi SHIBATA)
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
Driving Business Innovation: Latest Generative AI Advancements & Success Story (Safe Software)
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
How to Get CNIC Information System with Paksim Ga.pptx (danishmna97)
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many of those features trade security for convenience and capability. This best practices guide outlines steps users can take to better protect personal devices and information.
Digital Marketing Trends in 2024 | Guide for Staying Ahead (Wask)
https://www.wask.co/ebooks/digital-marketing-trends-in-2024
Feeling lost in the digital marketing whirlwind of 2024? Technology is changing, consumer habits are evolving, and staying ahead of the curve feels like a never-ending pursuit. This e-book is your compass. Dive into actionable insights to handle the complexities of modern marketing. From hyper-personalization to the power of user-generated content, learn how to build long-term relationships with your audience and unlock the secrets to success in the ever-shifting digital landscape.
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers (akankshawande)
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
UiPath Test Automation using UiPath Test Suite series, part 6 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
The UiPath Test Automation with generative AI and OpenAI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, as a test automation solution, with OpenAI's advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Schbath Rmes Bosc2009
1. R’MES
Finding Exceptional Motifs in Sequences
S. Schbath
INRA, Jouy-en-Josas, France
http://genome.jouy.inra.fr/ssb/rmes/
BOSC, Stockholm, June 27-28, 2009 – p.1
3. DNA and motifs
• DNA: long molecule, a sequence of nucleotides.
• Nucleotides: A(denine), C(ytosine), G(uanine), T(hymine).
• Motif (= oligonucleotide): short sequence of nucleotides, e.g. CAGTAG.
• Functional motif: recognized by proteins or enzymes to initiate a biological process.
TAGACAGATAGACGATCAGTAGCCAGTAGACAGTAGGCATGA...
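The count statistic used throughout the talk, N^obs(w), is simply the number of (possibly overlapping) occurrences of the motif w; a minimal sketch using the example sequence from this slide:

```python
# Count (possibly overlapping) occurrences of a motif w in a sequence,
# i.e. the observed count N^obs(w) from the slides.

def count_occurrences(seq, motif):
    return sum(1 for i in range(len(seq) - len(motif) + 1)
               if seq[i:i + len(motif)] == motif)

seq = "TAGACAGATAGACGATCAGTAGCCAGTAGACAGTAGGCATGA"
print(count_occurrences(seq, "CAGTAG"))  # -> 3
```

Overlaps matter: a self-overlapping word like AA in AAAA counts three times, which is why the statistics later in the deck need more than a simple binomial model.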
5. Some functional motifs
• Restriction sites: recognized by specific bacterial restriction enzymes ⇒ double-strand DNA break.
E.g. GAATTC recognized by EcoRI.
Very rare along bacterial genomes.
• Chi motif: recognized by an enzyme which processes along the DNA sequence and degrades it ⇒ the enzyme's degradation activity is stopped and DNA repair is stimulated by recombination.
E.g. GCTGGTGG recognized by RecBCD (E. coli).
Very frequent along the E. coli genome.
• parS: recognized by the Spo0J protein ⇒ organization of the B. subtilis genome into macro-domains.
E.g. TGTTAACACGTGAAACA (consensus; a few positions tolerate t/c variants).
Very frequent in the ORI domain, rare elsewhere.
• promoter: structured motif recognized by the RNA polymerase to initiate gene transcription.
E.g. TTGAC −(16–18 bp)− TATAAT (E. coli).
Particularly located in front of genes.
6. Prediction of functional motifs
Most of the functional motifs are unknown in the different species. For instance:
• what would be the Chi motif of S. aureus? [Halpern et al. (08)]
• is there an equivalent of parS in E. coli? [Mercier et al. (08)]
Statistical approach: identify candidate motifs based on their statistical properties.

The most over-represented 8-letter words under M1, E. coli (ℓ = 4.6 × 10^6):

word       obs   exp    score
gctggtgg   762    84.9  73.5
ggcgctgg   828   125.9  62.6
cgctggcg   870   150.8  58.6
gctggcgg   723   125.9  53.3
cgctggtg   619   101.7  51.3

The most over-represented families anbcdefg under M1, H. influenzae (ℓ = 1.8 × 10^6):

motif      obs   exp    score
gntggtgg   223    55.3  22.33
anttcatc   469   180.3  21.59
anatcgcc   288    87.8  21.38
tnatcgcc   279    84.5  21.18
gnagaaga   270    83.6  20.10
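The "exp" column comes from the M1 model. A hedged sketch of the standard order-1 plug-in estimator, E(N(w)) = Π_i N(w_i w_{i+1}) / Π_inner N(w_i), applied to a short made-up sequence rather than the E. coli genome:

```python
# Expected count of a word w under an order-1 Markov (M1) model,
# estimated from observed dinucleotide and nucleotide counts:
#   E(N(w)) = prod over i of N(w_i w_{i+1}) / prod over inner letters of N(w_i)

def count(seq, word):
    return sum(1 for i in range(len(seq) - len(word) + 1)
               if seq[i:i + len(word)] == word)

def expected_count_m1(seq, w):
    num = 1.0
    for i in range(len(w) - 1):          # all dinucleotides of w
        num *= count(seq, w[i:i + 2])
    den = 1.0
    for i in range(1, len(w) - 1):       # inner single letters of w
        den *= count(seq, w[i])
    return num / den if den else 0.0

seq = "GCTGGTGGCGCTGGTGGAAGCTGGTGGT"  # toy sequence, not a real genome
w = "GCTGG"
print(count(seq, w), expected_count_m1(seq, w))
```

Comparing the observed count with this expectation (here 3 observed vs. roughly 1 expected) is what the scores in the table quantify, after normalizing by the model-based standard deviation.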
8. Statistical questions addressed by R’MES
Questions related to the significance of the number of occurrences of a motif w in sequences:
• Is N^obs(w) significantly high?
• Is N^obs(w) significantly higher than N^obs(w′)?
→ If w′ = w̄, the reverse complement of w: is w significantly skewed (strand bias)?
• Is N1^obs(w) significantly more unexpected than N2^obs(w)?
Several types of motifs w:
• fixed words (e.g. gctggtgg),
• degenerated patterns (e.g. gntggtgg),
• sets of words (e.g. {w, w̄}).
11. Is N_obs(w) significantly high?
• One needs to calculate the p-value P(N(w) ≥ N_obs(w)), where N(w) is the
count (a random variable) of w in random sequences (→ model).
• R’MES considers Markov chain models of order m (Mm), which fit the
sequence composition in oligos of lengths 1 up to (m + 1).
The phase in coding sequences can also be taken into account (Mm_3).
• R’MES approximates the p-value by using
• either a Gaussian approximation of N(w) (when E(N(w)) is large)
[Prum et al. (95)], [Schbath et al. (95)]
• or a compound Poisson approximation of N(w) (when E(N(w)) is small)
[Schbath (95)], [Roquain and Schbath (07)]
(see DNA, Words and Models, Robin, Rodolphe, Schbath, CUP 2005)
• R’MES produces scores of exceptionality (probit transformation):
high positive (resp. negative) scores correspond to exceptionally frequent
(resp. rare) motifs.
rmes -gauss -s seqfile -m m -l wordlength -o outputfile
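As a rough illustration of the idea (not R’MES itself), E(N(w)) under M1 can be estimated by plugging in the observed dinucleotide and inner-letter counts, and a crude standardized score then compares observed and expected counts. The exact Gaussian variance used by R’MES [Prum et al. (95)] is more involved; the sketch below assumes a plain (obs − exp)/√exp standardization.

```python
def count_overlapping(seq, w):
    """Count occurrences of word w in seq, overlaps allowed."""
    return sum(1 for i in range(len(seq) - len(w) + 1) if seq[i:i + len(w)] == w)

def m1_expected_count(seq, w):
    """Plug-in estimate of E(N(w)) under an order-1 Markov model (M1):
    product of dinucleotide counts over product of inner-letter counts."""
    num = 1.0
    for i in range(len(w) - 1):
        num *= count_overlapping(seq, w[i:i + 2])
    den = 1.0
    for i in range(1, len(w) - 1):
        den *= count_overlapping(seq, w[i])
    return num / den if den else 0.0

def naive_score(seq, w):
    """Crude standardized score (NOT the exact R'MES Gaussian score)."""
    obs = count_overlapping(seq, w)
    exp = m1_expected_count(seq, w)
    return (obs - exp) / exp ** 0.5 if exp > 0 else float("nan")
```

On a sequence whose dinucleotide composition fully explains w, the score is near 0; large positive values flag words the M1 composition cannot account for.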
14. Is N_obs(w) significantly higher than N_obs(w̄)?
• One needs to calculate the p-value
P( N(w)/N(w̄) ≥ N_obs(w)/N_obs(w̄) ),
where N(·) is the count (a random variable) in random sequences (→ model).
• R’MES considers Markov chain models of order m (Mm), which fit the
sequence composition in oligos of lengths 1 up to (m + 1).
The phase in coding sequences can also be taken into account (Mm_3).
• R’MES approximates the p-value by using
• the 2-dimensional Gaussian approximation of (N(w), N(w̄)) (when
E(N(w)) and E(N(w̄)) are large)
[Prum et al. (95)], [Schbath et al. (95)]
• R’MES produces scores of exceptional skew (probit transformation):
high positive (resp. negative) scores correspond to motifs significantly more
frequent (resp. rare) along the sequence than along the complementary one.
rmes -skew -seq seqfile -m m -l wordlength -o outputfile
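The raw skew itself is simple to compute directly; a minimal sketch, building the reverse complement by hand (the p-value R’MES attaches via the 2-d Gaussian approximation is not reproduced here):

```python
def count_overlapping(seq, w):
    """Count occurrences of w in seq, overlaps allowed."""
    return sum(1 for i in range(len(seq) - len(w) + 1) if seq[i:i + len(w)] == w)

def revcomp(w):
    """Reverse complement w̄ of a DNA word."""
    comp = {"a": "t", "t": "a", "g": "c", "c": "g"}
    return "".join(comp[b] for b in reversed(w.lower()))

def skew(seq, w):
    """Strand skew N_obs(w) / N_obs(w̄) on a single strand."""
    fwd = count_overlapping(seq, w)
    rev = count_overlapping(seq, revcomp(w))
    return fwd / rev if rev else float("inf")
```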
17. Is N1_obs(w) significantly more exceptional than N2_obs(w)?
• One wants to compare the exceptionality of a motif w in two different
sequences (two observed counts, N1_obs(w) and N2_obs(w)).
• R’MES computes a test statistic and its associated p-value to test
H0: {w is equally exceptional in both sequences}
against
H1: {w is more exceptional in the first sequence}
[Robin et al. (08)]
• The test is performed by modeling the occurrence processes as Poisson
processes whose intensities take the sequence compositions in oligos of
lengths 1 up to (m + 1) into account.
• Option -seq2 soon available in R’MES.
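The exact statistic follows [Robin et al. (08)]; as a simplified stand-in under the Poisson picture above, one can condition on the total count: if N1 and N2 are independent Poisson with means e1 and e2, then given N1 + N2 = n, N1 is Binomial(n, e1/(e1+e2)). The sketch below uses that conditional binomial test; it is an illustration, not the R’MES implementation.

```python
from math import comb

def binom_sf(k, n, p):
    """Upper tail P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

def compare_counts(n1_obs, n2_obs, e1, e2):
    """Conditional test: a small p-value suggests w is more exceptional in
    sequence 1 than the expected counts e1, e2 would allow."""
    n = n1_obs + n2_obs
    return binom_sf(n1_obs, n, e1 / (e1 + e2))
```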
20. Chi motifs in bacterial genomes
• Motif involved in the repair of double-strand DNA breaks;
Chi needs to be frequent along bacterial genomes.
• Chi motifs have been identified for only a few bacterial species, and they are
not conserved across species.
• Known Chi motifs are 5 to 8 nucleotides long and can be degenerate.
• Moreover, Chi activity is strongly orientation-dependent (direction of DNA
replication): Chi is present preferentially on the leading strands (high skew).
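Put together, the two properties above (high frequency, high skew) suggest a naive screen: count all k-mers in one pass and rank by skew among sufficiently frequent words. This is a composition-blind sketch; R’MES adds the model-based exceptionality scores that make such a ranking meaningful.

```python
from collections import Counter

def revcomp(w):
    """Reverse complement of a DNA word."""
    comp = {"a": "t", "t": "a", "g": "c", "c": "g"}
    return "".join(comp[b] for b in reversed(w))

def chi_candidates(seq, k, min_count=1):
    """Rank k-mers by strand skew N(w)/N(w̄), keeping words seen at least
    min_count times; returns (skew, count, word) tuples, best first."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    out = []
    for w, n in counts.items():
        nbar = counts.get(revcomp(w), 0)
        if n >= min_count and nbar > 0:
            out.append((n / nbar, n, w))
    return sorted(out, reverse=True)
```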
21. E. coli as a learning case
• 8-letter word GCTGGTGG:
• 762 occurrences on the leading strands (genome length = 4.6×10^6),
• among the most over-represented 8-letter words (whatever the model Mm)
⇒ its frequency cannot be explained by the genome composition.
• Its rank improves if one analyzes only the backbone genome (the part of the
genome conserved in several strains of the species).
• Its skew equals 3.20 (p-value of 3.3×10^−11).
The skew of a motif w is defined by N_obs(w)/N_obs(w̄), where w̄ is the reverse
complement of w.
22. Identification of Chi motif in S. aureus
Halpern et al. (07)
• Analysis of the S. aureus backbone (length = 2.44×10^6).
• 8-letter words: none of the most over-represented and skewed motifs were
frequent enough.
• 7-letter word candidates (with observed counts):
A = gaaaatg (1067), B = ggattag (266), C = gaagcgg (272), D = gaattag (614)
24. Organization of the Ter macrodomain in E. coli
The chromosome of E. coli is organized into 4 macrodomains [Valens et al. (04)].
How is such structure ensured?
Bacillus subtilis as a learning case:
• In B. subtilis, the parS motif is responsible for structuring the
chromosomal domain surrounding the origin of replication [Lin and
Grossman (98)].
• The parS motif is 16 nt long; its sequence is partially degenerate and rather
palindromic:
TGTTAACACGTGAAACA (several positions admit an alternative base).
• It is recognized by Spo0J in both directions.
• One of its 11-mers is the most exceptional 11-mer (w, w̄) in the origin domain.
26. Identification of matS in E. coli
GACACTGTCAC
TGACACTGTCA
GACAGTGTCAC
GACGTTGTCAC
GACAACGTCAC
TGACAACGTCA
Consensus: GTGACRNYGTCAC
matS is the 13-nt motif GTGACRNYGTCAC: it is recognized by the MatP protein,
which structures the Ter domain [Mercier et al. (08)].
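To locate occurrences of a degenerate consensus such as GTGACRNYGTCAC, the IUPAC ambiguity codes (R = a/g, Y = c/t, N = any base) can be expanded into a regular expression. A minimal sketch (the helper names are mine, not R’MES functions):

```python
import re

# Partial IUPAC table -- only the codes used here (hypothetical helper,
# not part of R'MES)
IUPAC = {"a": "a", "c": "c", "g": "g", "t": "t",
         "r": "[ag]", "y": "[ct]", "n": "[acgt]"}

def iupac_to_regex(motif):
    """Turn a degenerate DNA motif into a plain regex string."""
    return "".join(IUPAC[b] for b in motif.lower())

def find_sites(seq, motif):
    """Start positions of all (non-overlapping) matches of the motif."""
    return [m.start() for m in re.finditer(iupac_to_regex(motif), seq.lower())]
```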
27. Acknowledgments
Françoise Gélis (R’MES 1.0)
Annie Bouvier (R’MES 2.0)
Mark Hoebeke (R’MES 3.0)
http://genome.jouy.inra.fr/ssb/rmes/