Drablos Composite Motifs Bosc2009
Published in: Technology, Business
Transcript

  • 1. Computational discovery of composite motifs in DNA. Geir Kjetil Sandve, Osman Abul and Finn Drabløs. Finn Drabløs [tare.medisin.ntnu.no]
  • 2. Introduction: Basic gene regulation
      • Proteins (transcription factors, TFs) recognise binding sites (sequence motifs) in gene regulatory regions
      • The transcription factors stabilise the transcription complex
      • Distal promoters (enhancers) interact through DNA looping
      [Figure label: Michael Lones]
  • 3. Motivation: De novo prediction of binding sites
      • Make a set of co-regulated genes
          – E.g. from microarray experiments, normally imperfect sets
      • Extract assumed regulatory regions
          – Normally a fixed region upstream from the TSS of each gene
      • Search for overrepresented patterns in these regions
          – Use a model for what a motif should look like
              • Consensus sequence with mismatches
              • Position Weight Matrix (PWM) based on log odds scores for occurrences
          – Use a strategy to find (local) optima for this model
              • E.g. Gibbs sampling, expectation maximisation …
      • Problem: More than 100 different methods
          – Which methods are reliable?
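The PWM model named on this slide can be illustrated with a minimal Python sketch. It is not taken from the talk: the uniform background (0.25 per base), the pseudocount of 1, the toy binding sites and the function names `build_pwm`, `score` and `best_hit` are all assumptions chosen for illustration.

```python
import math

BASES = "ACGT"

def build_pwm(sites, background=None, pseudocount=1.0):
    """Build a PWM as a list of per-position dicts of log2 odds scores."""
    if background is None:
        background = {b: 0.25 for b in BASES}  # assumption: uniform background
    pwm = []
    for pos in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}  # pseudocount avoids log(0)
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / background[b]) for b in BASES})
    return pwm

def score(pwm, window):
    """Sum the log odds scores of the bases in one window."""
    return sum(column[base] for column, base in zip(pwm, window))

def best_hit(pwm, sequence):
    """Return (score, position) of the best-scoring window in the sequence."""
    width = len(pwm)
    return max((score(pwm, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))

# Hypothetical aligned binding sites, invented for this example.
toy_sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGAGTAA"]
toy_pwm = build_pwm(toy_sites)
best_score, best_pos = best_hit(toy_pwm, "GGCCTGACTCAGGCC")
```

Scanning every window and keeping the best score is the simplest use of a PWM; the de novo tools discussed here additionally have to learn the matrix itself, for example by Gibbs sampling or expectation maximisation.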
  • 4. Motivation: Benchmarking of de novo tools
      • Tompa et al., Nature Biotech 23, 137-144 (2005)
      • Tested 14 different tools for motif discovery
      • Used 52 data sets from fly (6), human (26), mouse (12) and yeast (8)
      • Used data sets with real (Transfac) binding sites in different sequence contexts
          – "real": The actual promoter sequences
          – "generic": Randomly chosen promoter sequences from the same genome
          – "markov": Sequences generated by a Markov chain of order 3
      • Measured performance at nucleotide level
  • 5. Motivation: Average benchmark performance
      Method           TP     FP     FN      TN
      AlignAce        477   3789   8186  436048
      ANN-Spec        754   7799   7909  432038
      Consensus       178   1394   8485  438443
      GLAM            223   5619   8440  434218
      Improbizer      594   7942   8069  431895
      MEME            581   4836   8082  435001
      MEME3           673   6726   7990  433111
      MITRA           272   4092   8391  435745
      MotifSampler    520   4344   8143  435493
      Oligo/dyad      345   1891   8318  437946
      QuickScore      151   4856   8512  434981
      SeSiMCMC        530  13813   8133  426024
      Weeder          748   1748   7915  438089
      YMF             554   3492   8109  436345

      Average over all methods (rows: real, columns: predicted; layout TP FN / FP TN):
                   Pred_P   Pred_N
      Real_P          471     8192
      Real_N         5167   434670
      nCC = 0.053
      • Performance is close to random!
      • Too many FP, FN
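The nCC value on this slide can be reproduced from the averaged confusion matrix. The sketch below assumes that nCC is the nucleotide-level correlation coefficient as defined in Tompa et al. (2005), which has the same form as the Pearson (Matthews) correlation for a 2×2 table; the function name `ncc` is an illustrative choice.

```python
import math

def ncc(tp, fp, fn, tn):
    """Nucleotide-level correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fn * fp
    denominator = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    # Return 0.0 when a marginal is empty and the coefficient is undefined.
    return numerator / denominator if denominator else 0.0

# Averaged counts from the slide: Real_P = (471, 8192), Real_N = (5167, 434670).
average_ncc = ncc(tp=471, fp=5167, fn=8192, tn=434670)
print(round(average_ncc, 3))  # prints 0.053
```

A value of 1 would mean perfect prediction and 0 means no better than chance, which is why 0.053 is described as close to random.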
  • 6. Motivation: Can we improve performance?
      • Use better motif representations
          – Hidden Markov Models
      • Use better algorithms
          – More exhaustive searching (TODAY!)
          – Discriminative motif discovery
      • Use better background models
          – Real sequences (not Markov models) (TODAY!)
      • Filter out false positives
          – Identify "motif-like" solutions
          – Identify regulatory regions
          – Use co-occurrence of motifs (TODAY!)
              • Modules, composite motifs
  • 7. Approach: Composite motif discovery
      • TFs act together as modules
      • Modules are not completely unique
  • 8. Algorithm: Basic definitions
      • Frequent modules
          – Modules (and motifs) can be ranked by support
              • Fraction of sequences where the module (or motif) is found
          – Support is monotonic
              • Adding a motif to a module can never increase module support
      • Specific modules
          – Modules can be ranked by hit probability
              • Probability that a sequence supports the module
          – Hit probability is monotonic (as for support)
          – Specific modules have low hit probability in background sequences
      • Significant modules
          – Modules can be ranked by significance
              • Probability that the support in the sequence set differs from the support in background
  • 9. Algorithm: Search tree
      • Discretized single motifs {1, 2, 3, …} are organised as an implicit search tree
      • Support set H and hit probability P are computed iteratively (monotonicity)
          – Initially H is the full sequence set and P is 1
      • The search tree is efficiently pruned (indicated with X) based on H and P
      • Final output can be ranked by module significance
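The search described on slides 8 and 9 can be sketched as a depth-first enumeration that carries the support set H and hit probability P down the tree. This is an illustration under stated assumptions, not the Compo implementation: the names `find_modules`, `min_support`, `max_hit_prob` and `max_size` are hypothetical, pruning here uses only the support threshold, P is taken as a simple product of per-motif background probabilities, and the toy data are invented.

```python
def find_modules(hits, bg_prob, n_seqs, min_support, max_hit_prob, max_size=3):
    """Enumerate modules (motif combinations) that are frequent and specific.

    hits:    motif id -> set of sequence indices containing the motif
    bg_prob: motif id -> probability of a hit in a background sequence
    """
    motifs = sorted(hits)
    results = []

    def extend(module, support_set, hit_prob, start):
        for idx in range(start, len(motifs)):
            motif = motifs[idx]
            new_set = support_set & hits[motif]      # H can only shrink
            support = len(new_set) / n_seqs
            if support < min_support:
                continue  # prune: no extension of this module can regain support
            new_prob = hit_prob * bg_prob[motif]     # P can only shrink
            new_module = module + (motif,)
            if new_prob <= max_hit_prob:
                results.append((new_module, support, new_prob))
            if len(new_module) < max_size:
                extend(new_module, new_set, new_prob, idx + 1)

    # Initially H is the full sequence set and P is 1.
    extend((), set(range(n_seqs)), 1.0, 0)
    return results

# Invented toy input: four sequences, four discretized motifs.
toy_hits = {1: {0, 1, 2, 3}, 2: {0, 1, 2}, 3: {0, 3}, 4: {1, 2, 3}}
toy_bg = {1: 0.5, 2: 0.2, 3: 0.3, 4: 0.4}
modules = find_modules(toy_hits, toy_bg, n_seqs=4, min_support=0.75, max_hit_prob=0.15)
```

Because support is monotonic, a branch that falls below the support threshold can be discarded together with all of its descendants, which is what keeps an otherwise exponential search tractable.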
  • 10. Implementation: Module significance
      • Position-level probability in background
          – Probability of a single motif at a specific location
          – Estimated from real DNA background sequences
      • Sequence-level probability in background
          – Probability of a single motif occurring at least once in a given background sequence
          – Estimated as the union of position-level probabilities
      • Hit probability in background
          – Probability of the composite motif occurring at least once in a background sequence
          – Estimated as the product over the individual motif components
      • Significance p-value of observed support
          – Probability of seeing at least the observed support in the background set
          – Estimated as the right tail of the binomial distribution
              • At least k out of n successes given the hit probability
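The chain of estimates on this slide can be written out as a short Python sketch. It assumes that positions are independent when forming the union, and that motif components are independent when forming the product; the function names and all numbers are illustrative and do not come from the talk.

```python
import math

def sequence_level_prob(position_probs):
    """P(motif occurs at least once), as a union over independent positions."""
    miss = 1.0
    for p in position_probs:
        miss *= 1.0 - p
    return 1.0 - miss

def module_hit_prob(motif_probs):
    """P(all component motifs occur), as a product of sequence-level probabilities."""
    return math.prod(motif_probs)

def binomial_right_tail(k, n, p):
    """P(at least k successes out of n) for success probability p."""
    return sum(math.comb(n, i) * p**i * (1.0 - p)**(n - i) for i in range(k, n + 1))

# Invented example: two motifs scanned over 500 positions per background sequence.
motif_a = sequence_level_prob([0.001] * 500)
motif_b = sequence_level_prob([0.0005] * 500)
hit_prob = module_hit_prob([motif_a, motif_b])

# Module observed in k = 6 of n = 10 input sequences.
p_value = binomial_right_tail(6, 10, hit_prob)
```

A small p-value means that the observed support would be unlikely if the input sequences behaved like background, which is the ranking criterion used for the final output.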
  • 11. Implementation: Problem specification
      • Frequent and specific modules
          – Use thresholds on support and specificity
          – Complete solutions but multi-objective optimization
      • Top-ranking modules
          – Combine objectives into a single measure, e.g. p-value
      • Pareto-optimal modules
          – Each objective is a separate dimension of optimality (http://en.wikipedia.org/wiki/Pareto_efficiency)
          – Return the Pareto front of composite motifs
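The Pareto-optimal formulation can be illustrated with a small Python sketch. The candidate modules, their objective values and the helper names `dominates` and `pareto_front` are assumptions for illustration; every objective is expressed so that larger is better, so hit probability is negated.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates.

    candidates: name -> tuple of objective values, all to be maximised
    """
    return {name: objectives
            for name, objectives in candidates.items()
            if not any(dominates(other, objectives) for other in candidates.values())}

# Invented modules scored as (support, -background hit probability).
toy_modules = {
    "A": (0.9, -0.30),
    "B": (0.7, -0.05),
    "C": (0.6, -0.20),
    "D": (0.9, -0.40),
}
front = pareto_front(toy_modules)
print(sorted(front))  # prints ['A', 'B']
```

Returning the whole front avoids committing to one trade-off between support and specificity, at the cost of handing the user several candidate modules instead of a single ranked winner.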
  • 12. Implementation: Motif prediction flowchart [figure]
  • 13. Benchmarking: Benchmark data set
      • Known composite motifs from the TransCompel database
      • Tests performance by adding "noise matrices" to input
          – Matrices for TFs assumed not to bind in the sequence set
              • Will have random (false positive) hits
          – Selected at random from Transfac
      • Max noise level includes all Transfac matrices
          – Similar to actual usage
              • Searching for motifs consisting of unknown TFs
  • 14. Benchmarking: General performance (nCC)
      • Compo compared to several other tools
          – TransCompel benchmark set
      • Compo clearly has the best performance, in particular at realistic settings (high noise level)
  • 15. Benchmarking: Background and support
      • Compo gains performance from realistic background (real DNA) and support
          – Random DNA is based on a multinomial sequence model
      • Performance without real DNA background or support is comparable to other tools
  • 16. Future development: Pareto front
      • Pareto front on support, max motif distance and significance (colour)
      • Compo prediction not optimal
          – Compo predicted Ets and GATA
          – Annotated motif is AP1 and NFAT
      • Explore alternative solutions
      • Explore parameter interactions
      [Figure legend: X – NFAT, O – AP1]
  • 17. Acknowledgements: The research group BiGR Programmers / Technicians Johansen, Jostein Drabløs, Finn Thomas, Laurent Olsen, Lene C. Postdocs / Researchers Sætrom, Pål Others Kusnierczyk, Wacek Solbakken, Trude Rye, Morten Klein, Jörn Master students Anderssen, Endre Bolstad, Kjersti Wang, Xinhui (ERCIM) Muiser, Iwe Capatana, Ana (ERCIM, starting 2009) Sponberg, Bjørn Brands, Stef PhDs Skaland, Even Bratlie, Marit Skyrud Klepper, Kjetil Former members Saito, Takaya Sandve, Geir Kjetil Lundbæk, Marie Abul, Osman Håndstad, Tony Schwalie, Petra Lones, Michael
