Motif-based analysis of ChIP-seq data - Timothy Bailey


ChIP-seq experiments are the current method of choice for surveying the targets of DNA-binding transcription factors (TFs). Motif-based sequence analysis of ChIP-seq data can provide extremely valuable insights into the biological mechanisms underlying transcriptional regulation. For example, it can determine the DNA-binding affinity (motif) of the factor, distinguish regions bound directly or indirectly by the factor, and suggest the identities of TFs that regulate cooperatively. Motif analysis can also be employed in differential mode to identify explanatory differences in the motif content of genomic regions bound by the factor under different cellular contexts. I will describe several types of motif-based analysis and give concrete examples of biological insights they can yield.

Published in: Health & Medicine, Technology

  1. 1. Motif-based analysis of ChIP-seq data Timothy L. Bailey AMATA October 16, 2013
  2. 2. Overview of Talk • ChIP-seq data analysis – Why do motif-based analysis? – MEME-ChIP • Two case studies – KLF1 in mouse fetal liver cells – NFI in mouse neural stem cells
  3. 3. Steps in ChIP-seq • Cross-link proteins to DNA • Fragment chromatin • Immunoprecipitate with antibody to protein • Size-select and ligate • Amplify • Sequence
  4. 4. What can we learn from Transcription Factor ChIP-seq data? • Where is the TF bound? • What is its DNA-binding affinity? • What genes might it regulate? • What are its partners?
  5. 5. ChIP-seq Data Analysis 1. Mapping: Align the reads with the reference genome. 2. Peak Calling: Identify regions with significant read coverage. 3. Motif-based sequence analysis: Identify DNA sequence patterns in the peaks. … “Practical guidelines for the comprehensive analysis of ChIP-seq data”, Bailey et al., PLoS Comp Bio (in press).
  6. 6. Why do motif-based analysis? • Quality control • Understanding DNA-binding affinity • Understanding regulatory mechanisms
  7. 7. PWM-based Word-based Known motifs
  8. 8. Motif Discovery: MEME • Searches for the novel PWM motif with the most significant information content (IC). • Null model: the distribution of the IC of a set of sites of a given width in random sequences of a given length. [Figure: sites from 100-bp ChIP-seq regions are aligned into a motif and drawn as a motif logo; IC = total height of the letters.]
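
The "IC = total height of letters" idea on this slide can be illustrated with a minimal sketch. It assumes a uniform background (0.25 per base), uses no pseudocounts or small-sample correction, and runs on invented toy sites. It only computes the information content of an already aligned set of sites; it is not MEME's search procedure or its significance calculation.

```python
import math
from collections import Counter

def information_content(sites, alphabet="ACGT"):
    """Per-column information content (bits) of equal-length aligned sites.

    Assumption: uniform background frequencies, no pseudocounts.
    """
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same width")
    background = 1.0 / len(alphabet)
    per_column = []
    for i in range(width):
        counts = Counter(site[i] for site in sites)
        ic = 0.0
        for letter in alphabet:
            freq = counts[letter] / len(sites)
            if freq > 0:
                # Relative entropy of the column against the background
                ic += freq * math.log2(freq / background)
        per_column.append(ic)
    return per_column

# Hypothetical aligned sites, invented for illustration only
toy_sites = ["CCACACCC", "CCACGCCC", "CCCCACCC", "CCACACCC"]
columns = information_content(toy_sites)
total_ic = sum(columns)  # corresponds to the total letter height in a logo
```

A fully conserved column contributes 2 bits under these assumptions, and a column that matches the background contributes 0.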
  9. 9. Discriminative Motif Discovery: DREME • Searches for the novel regular-expression motif with the most significant enrichment of sites in the positive sequences. • Null model: the probability of a site is the same in the two sets of sequences. • Test: Fisher’s Exact Test on P and N (the numbers of sequences with ≥ 1 site in the positive and negative sets). [Figure: 100-bp ChIP-seq regions, P = 5; 100-bp shuffled regions, N = 3; motif regular expression: CCMRCCC.]
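
The test on this slide can be sketched as follows: count the sequences in each set that contain at least one match to the IUPAC pattern, then apply a one-sided Fisher's Exact Test. This is a simplified illustration, not DREME itself: it scans only the given strand, the toy sequences are invented, and the set sizes (6 positive, 6 negative) are hypothetical because the slide gives only P = 5 and N = 3.

```python
import re
from math import comb

# IUPAC nucleotide codes expanded to regular-expression character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "[AC]", "R": "[AG]", "W": "[AT]", "S": "[CG]",
    "Y": "[CT]", "K": "[GT]", "N": "[ACGT]",
}

def iupac_to_regex(motif):
    """Compile an IUPAC motif such as CCMRCCC into a regular expression."""
    return re.compile("".join(IUPAC[code] for code in motif))

def count_hits(pattern, sequences):
    """Number of sequences with at least one match (single strand only)."""
    return sum(1 for seq in sequences if pattern.search(seq))

def fisher_one_sided(pos_hits, pos_total, neg_hits, neg_total):
    """P(at least pos_hits of the hits fall in the positive set) under the null."""
    hits = pos_hits + neg_hits
    total = pos_total + neg_total
    upper = min(hits, pos_total)
    tail = sum(comb(hits, k) * comb(total - hits, pos_total - k)
               for k in range(pos_hits, upper + 1))
    return tail / comb(total, pos_total)

pattern = iupac_to_regex("CCMRCCC")
toy_hits = count_hits(pattern, ["AACCAGCCCTT", "GGGGGGGG", "TCCCACCCA"])

# Slide values P = 5 and N = 3; the totals of 6 and 6 are assumed
p_value = fisher_one_sided(5, 6, 3, 6)
```

With these assumed totals the p-value is 3/11, which shows why a handful of sequences cannot establish enrichment; real ChIP-seq sets contain hundreds or thousands of regions.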
  10. 10. Central Motif Enrichment Analysis: CentriMo • Searches for the known motif whose best sites are most centrally enriched in the ChIP-seq regions. • Null model: best sites are uniformly distributed within the regions. • Test: Binomial(S, T, W/L), where S = number of “successes”, T = number of “trials”, W = width of the central window and L = region length. [Figure: “site-probability” curve (probability vs. position of best site) over 500-bp ChIP-seq regions; example with W = 120, L = 500, S = 4, T = 5.]
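
The binomial test on this slide can be worked through with the slide's own numbers (S = 4, T = 5, window 120, region length 500). The sketch below is a simplification under stated assumptions: it tests a single fixed window, ignores motif width, and treats each best-site position as a single coordinate. The positions are invented so that four of five fall inside the central window.

```python
from math import comb

def binomial_tail(successes, trials, p):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(successes, trials + 1))

def central_enrichment(best_site_positions, region_length, window):
    """Count best sites inside the central window and test against uniformity."""
    center = region_length / 2
    successes = sum(1 for pos in best_site_positions
                    if abs(pos - center) <= window / 2)
    trials = len(best_site_positions)
    # Under the uniform null, a best site lands in the window with prob. window/length
    p_value = binomial_tail(successes, trials, window / region_length)
    return successes, trials, p_value

# Hypothetical best-site positions, one per 500-bp region
toy_positions = [250, 230, 300, 200, 20]
S, T, p_value = central_enrichment(toy_positions, region_length=500, window=120)
```

Here the success probability under the null is 120/500 = 0.24, and the probability of seeing at least 4 successes in 5 trials is about 0.0134.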
  11. 11. Motif Spacing Analysis: SpaMo • Searches for known motifs whose best sites have a preferred spacing with the primary motif. 1. Align regions on best primary site. 2. Predict best secondary site. 3. Compute enrichment at each possible spacing. • Null model: uniform • Test: Binomial [Figure: 500-bp ChIP-seq regions; 300-bp regions centered on the primary site.]
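
The third step on this slide (enrichment at each possible spacing, uniform null, binomial test) can be sketched as below. It assumes the first two steps are already done, so the input is one spacing per region between the best primary and best secondary site. Strand, orientation and multiple-testing correction are ignored, and the spacings are invented, so this is an illustration of the test as stated on the slide rather than of SpaMo's implementation.

```python
from collections import Counter
from math import comb

def binomial_tail(successes, trials, p):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(successes, trials + 1))

def spacing_enrichment(spacings, max_spacing):
    """Map each observed spacing to (count, p-value) under a uniform null.

    Assumption: every spacing from 0 to max_spacing is equally likely a priori.
    """
    trials = len(spacings)
    p_uniform = 1.0 / (max_spacing + 1)
    counts = Counter(spacings)
    return {gap: (count, binomial_tail(count, trials, p_uniform))
            for gap, count in sorted(counts.items())}

# Hypothetical spacings (bp), one per region
toy_spacings = [10, 10, 10, 10, 3, 27, 41, 10]
result = spacing_enrichment(toy_spacings, max_spacing=49)
best_gap = min(result, key=lambda gap: result[gap][1])
```

In this toy input the spacing of 10 bp occurs in five of eight regions and receives by far the smallest p-value, while spacings seen once are unremarkable.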
  12. 12. Case Study 1: KLF1 Did my ChIP-seq work?
  13. 13. Knowing when TF ChIP-seq fails 1) KLF1 ChIP: Tallack et al, Genome Research, 2010. • The top MEME and DREME motifs confirm the in vitro KLF-family motif. [Logos: UniPROBE Klf7_primary motif; MEME KLF1 motif] 2) KLF1 ChIP: Other published data. • The best DREME motif only approximates the KLF-family motif. • MEME finds no similar motif. [Logos: UniPROBE Klf7_primary motif; DREME KLF1 motif]
  14. 14. TF motif databases are an invaluable resource
  15. 15. KLF-family motifs are nearly identical
  16. 16. Strong Evidence of Failure: Central Motif Enrichment (KLF7 in vitro motif) 1) Tallack KLF1 data: p = 10^-66 2) Other KLF1 data: p = 0.7
  17. 17. Are KLF1, GATA and GATA/SCL motifs the most centrally enriched motifs in KLF1 peak regions? 1. Tallack KLF1 data – yes. CentriMo Analysis of Tallack KLF1 data: top four centrally enriched motifs in JASPAR+UniProbe (862 motifs): Klf4, Klf7, GATA1 and GATA1/SCL. [Figure: site-probability curves; legible values are W = 103, 111, 194, 177, P = 10^-54, P = 10^-48 and the exponents -40 and -66; their pairing with the motifs is not recoverable from the transcript.]
  18. 18. Are KLF1, GATA and GATA/SCL motifs the most centrally enriched motifs in KLF1 peak regions? 1. Tallack KLF1 data – yes. 2. Other KLF1 data – no. [Figure: CentriMo plots for 1. Tallack KLF1 data (KLF4) and 2. Other KLF1 data.]
  19. 19. KLF1 summary • Enrichment of the known KLF-family motif(s), together with enrichment of known co-factor motifs, is strong evidence of a successful TF ChIP-seq experiment. • Perform motif-based analysis on TF ChIP-seq before publishing!
  20. 20. Case Study 2: NFI How does my TF bind?
  21. 21. Nuclear Factor I • Martynoga et al. (2013) ChIP-ed NFI in proliferating and quiescent mouse neural stem cells. • Family members: NFIA, NFIB, NFIC and NFIX. • NFI is thought to bind as a dimer.
  22. 22. Enriched motifs in NFI peaks in proliferating neural stem cells
  23. 23. Does NFI bind as a monomer in neural stem cells?
  24. 24. Enrichment suggests NFIX binds often as a monomer [Logos: two MEME motifs]
  25. 25. Half-site spacing enriched at multiples of 10 bp
  26. 26. Dimeric sites are twice as common in embryonic fibroblasts [Panels: mNSC ChIP; MEF ChIP]
  27. 27. E-boxes are highly enriched near NFI peaks
  28. 28. Most enriched E-box is enriched also in quiescent neural stem cells [Panels: proliferating; quiescent]
  29. 29. One E-box not enriched in quiescent neural stem cells [Panels: proliferating; quiescent]
  30. 30. Differentially enriched motif could bind OLIG or NEUROG/D
  31. 31. NFI summary • Motif-based analysis sheds light on how TFs bind. – Unlike other NFIs, NFIX often binds as a monomer. – NFI binding is less associated with binding of OLIG or NEUROG/D factors in quiescent than in proliferating neural stem cells.
  32. 32. Conclusions • Motif-based TF ChIP-seq analysis is highly useful for: – Quality control – Understanding DNA-binding affinity – Understanding regulatory mechanisms
  33. 33. Acknowledgements The MEME Suite: • William Noble • James Johnson • Charles Grant • Martin Frith • Philip Machanick • Tom Whitington • Shobhit Gupta • Tom Lesluyes • Benjamin Dartigues. KLF Project: • Michael Tallack • Tom Whitington • Andrew Perkins • Sean Grimmond • Brooke Gardiner • Ehsan Nourbakhsh • Nicole Cloonan • Elanor Wainwright • Janelle Keys • Wai Shan Yuen
  35. 35. Transcription Factors • Mammalian transcription is controlled (in part) by about 1400 DNA-binding transcription factor (TF) proteins. • These proteins control transcription in two main ways: – Directly, by promoting (or preventing) the assembly of the pre-initiation complex. – Indirectly, by modifying chromatin.
  36. 36. ChIP-seq • Chromatin ImmunoPrecipitation followed by high-throughput sequencing. • TF binding sites (“punctate peaks”) • Chromatin mods (“broad peaks”)
  37. 37. KLF1 is a key transcription factor in blood cell development • We performed KLF1 ChIP-seq in mouse fetal liver cells and analyzed the resulting 945 peak regions using the MEME Suite [Tallack et al, Genome Research, 2010.] • We confirmed – the in vitro binding motif of KLF1, – several co-factor TFs, and – a co-factor complex. [Figure: pooled 4 livers (~80 × 10^6 cells); positive: ChIP (αKLF1 rabbit polyclonal Ab); control: input DNA.]
  38. 38. A second KLF1 ChIP-seq experiment • Pilon et al. (Blood, 2011) also performed KLF1 ChIP-seq in mouse fetal liver cells. • They predicted over 13,000 peak regions. • We reanalyzed their data using the MEME Suite. • This second ChIP-seq dataset gives very different results.
  39. 39. Do the Pilon KLF1 ChIP-seq regions contain KLF1 co-factor sites?
  40. 40. GATA1 and SCL are important KLF1 regulatory co-factors • GATA1 and SCL bind DNA in a protein complex [Wadman et al, EMBO Journal, 1997]. 1. Tallack KLF1 data – MEME finds the complex motif. 2. Pilon KLF1 data – MEME does not find the motif. [Logos: known GATA-SCL motif (JASPAR database); MEME GATA-SCL motif found in Tallack KLF1 data]
  41. 41. Is motif discovery failing? • To check this, we use CentriMo to search for any motifs in the JASPAR+UniPROBE motif database that are centrally enriched in the two KLF1 ChIP-seq datasets.
  42. 42. Caveats in ChIP-seq Motif Analysis • Peak regions may contain other TF motifs due to looping. • The binding of the ChIP-ed factor “X” may be indirect. • ChIP-ed motif might be weak due to assisted binding. Farnham, Nature Reviews Genetics, 2009
  43. 43. MEME motif is E-box with adjacent NFI half-site [Logo label: NFI half-site]
  44. 44. Differential central enrichment