Analysis of ChIP-Seq Data

An introduction to the tools and methods used for the bioinformatics analysis of ChIP-Seq data.

Written and delivered for the "Epigenetics and its applications in clinical research" course at the Karolinska Institute in Stockholm, Sweden.

Published in: Science

  1. Bioinformatics Analysis of ChIP-Seq
     Phil Ewels, NGI Stockholm
     phil.ewels@scilifelab.se
     Epigenetics and its applications in clinical research (2601)
     2017-03-21
  2. Talk Overview
     Phil Ewels - Bioinformatics Analysis of ChIP-Seq / 42
     • Overview of ChIP-Seq
     • ChIP-Seq data processing
     • Peak Calling
     • Normalisation & quality control
     • Analysis Pipelines
     • Downstream analyses
  3. Overview of ChIP-Seq
  4. Overview of ChIP-Seq
     • Question
       - Can we find where a protein of interest binds across the genome?
     • Requirements
       - Good antibody
       - Reference genome
     • Assumptions
       - Protein binds in a stable pattern
       - Binding is comparable across a population of cells
  5. Overview of ChIP-Seq
     • Cross-link DNA and proteins
     • Isolate DNA & fragmentation
     • Chromatin Immunoprecipitation
     • Reverse cross-links and purify DNA
     • Add adapters & sequence
  10. ChIP-Seq data processing
  11. Data processing
      • Sequence QC
        - FastQC / FastQ Screen
      • Trimming
        - Cutadapt / Trimmomatic / AlienTrimmer / FASTX-Toolkit
      • Alignment
        - Bowtie / BWA / STAR
      • Duplicate removal
        - Picard / Samtools / SeqMonk
  12. Data processing: FastQC
  13. Data processing: FastQ Screen
  14. Data processing: cutadapt
      http://opensource.scilifelab.se/projects/cutadapt/
  15. Data processing: cutadapt + FastQC
  16. Data processing: Alignment
      • Bowtie 1 & 2
        - Bowtie 1 is good for short reads (under 50 bp)
        - Bowtie 2 is better with longer reads
      • STAR
        - As good as Bowtie but much faster
        - Has a large memory footprint (~30 GB for human)
      • BWA / Subread / SOAP / MAQ
      • Alignments should be unique (discard multi-mapping reads)
  17. Data processing: Duplicate Removal
      • Duplicates can come from multiple sources
        - PCR duplicates
        - Optical duplicates
        - Deep sequencing (genuine duplicates)
      • How do you define duplicates?
        - Sequence content - errors?
        - Mapping position
      • Sonication makes genuine duplicates unlikely
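The "mapping position" definition of a duplicate can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Read` record and its field names are hypothetical, and real tools such as Picard work on BAM files and also consider mate positions and base qualities.

```python
from collections import namedtuple

# Hypothetical minimal read record; a real pipeline would read these from a BAM file.
Read = namedtuple("Read", ["name", "chrom", "five_prime", "strand"])


def remove_position_duplicates(reads):
    """Keep the first read seen at each (chrom, 5' position, strand); drop the rest."""
    seen = set()
    kept = []
    for read in reads:
        key = (read.chrom, read.five_prime, read.strand)
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


reads = [
    Read("r1", "chr1", 100, "+"),
    Read("r2", "chr1", 100, "+"),  # same position and strand as r1: treated as a duplicate
    Read("r3", "chr1", 100, "-"),  # same position, opposite strand: kept
    Read("r4", "chr2", 100, "+"),  # different chromosome: kept
]
deduplicated = remove_position_duplicates(reads)
print([read.name for read in deduplicated])
```

Defining duplicates by position rather than by sequence content sidesteps the sequencing-error problem raised on the slide: two reads with one mismatching base still map to the same place.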
  18. Results Summary: MultiQC
      • Scans your results directory and parses log files
      • Builds a single report summarising everything
      http://multiqc.info
  19. Peak Calling
  20. Peak Calling: Considerations
      • What kind of mark are you looking for?
      • Point-source factors
        - Few, sharp peaks
        - Most transcription factors
      • Many peaks
        - RNA Polymerase II
      • Broad peaks
        - Some histone marks (H3K27me3)
  21. Peak Calling: Tools
      • Huge number of tools available
      • Many different statistical approaches
      • The one important thing to remember: your results should look sensible and you must be consistent
      • If in doubt, use MACS v2 or SPP
        - https://github.com/taoliu/MACS/
        - http://compbio.med.harvard.edu/Supplements/ChIP-seq/
  22. Normalisation & QC
  23. Normalisation & quality control
      • What we expect to see:
      • Assumptions:
        - Specific antibody
        - Perfect purification
        - Equal representation
      • Reality:
        - Non-specific antibody binding
        - Unbound DNA being sequenced
        - Open chromatin bias, repetitive regions not aligned
  24. Normalisation: input controls
      • Typically run an input sample
        - Cross-linked DNA, but no ChIP step
        - Can use a non-nuclear antibody such as IgG
        - Same sample, same prep
        - Captures systematic biases (e.g. chromatin type, GC content)
      • Can use the data in multiple ways
        - Just determine regions to exclude
        - Subtraction normalisation
        - Typically used when calling peaks
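A rough sketch of subtraction normalisation, assuming reads have already been counted in fixed genomic windows. Scaling the input by total read depth is an assumption made here for simplicity; the function name and the toy counts are hypothetical, and real tools tend to use more robust scaling factors.

```python
def subtract_input(chip_counts, input_counts):
    """Scale input counts to the ChIP library size, then subtract them window by window.

    Negative values are clipped to zero, since a window cannot have negative signal.
    """
    input_total = sum(input_counts)
    if input_total == 0:
        raise ValueError("input sample has no reads")
    scale = sum(chip_counts) / input_total
    return [max(0.0, chip - scale * inp) for chip, inp in zip(chip_counts, input_counts)]


# Toy per-window read counts (illustrative numbers only).
chip = [5, 6, 40, 5, 4]   # 60 reads in total
inp = [4, 5, 6, 5, 10]    # 30 reads in total, so the scale factor is 2
normalised = subtract_input(chip, inp)
print(normalised)
```

Only the third window keeps any signal after subtraction; the last window, where the input is unusually high, is the kind of region one might instead choose to exclude outright.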
  25. Normalisation: Signal and Noise
      • We will sequence lots of irrelevant stuff
        - What is signal and what is noise?
      • Essentially, we’re looking for enrichment
        - Peak callers do a lot of this for you
      • Most peak callers need an input sample
        - Some can use mappability and GC content instead
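To make "looking for enrichment" concrete, here is a minimal sketch that asks, for each window, how surprising the ChIP count is if the scaled input count were the true background rate. It uses a plain Poisson tail probability; this is loosely in the spirit of Poisson-based peak callers but is not the algorithm of any particular tool, and the window counts, threshold and function names are all assumptions.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed from the cumulative sum."""
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)


def enriched_windows(chip_counts, input_counts, alpha=1e-3):
    """Return indices of windows where ChIP reads exceed the scaled input background."""
    scale = sum(chip_counts) / sum(input_counts)
    scaled = [scale * count for count in input_counts]
    # Use the average background as a floor so empty input windows do not give lambda = 0.
    floor = sum(scaled) / len(scaled)
    hits = []
    for index, (observed, expected) in enumerate(zip(chip_counts, scaled)):
        if poisson_sf(observed, max(expected, floor)) < alpha:
            hits.append(index)
    return hits


chip = [8, 12, 60, 10, 10]   # toy counts, one clearly enriched window
inp = [10, 10, 10, 10, 10]
print(enriched_windows(chip, inp))
```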
  26. Quality Control: visualisation
      • Visualising the data is quick and very helpful
        - UCSC / SeqMonk / IGV
      • Fast impression of how the experiment has worked
      • Not enough on its own
  27. Quality Control: SeqMonk
  28. Quality Control: Saturation Analysis
      • If you sequence more reads, you’ll find more peaks
      • If you’ve sequenced enough, you should be nearing a plateau
      • Look into complexity of data
        - Preseq
        - SPP subsampling
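The idea of a saturation plateau can be sketched by subsampling. The example below counts distinct read positions at increasing depths as a simple proxy for library complexity; it does not call peaks and is not how Preseq or SPP work internally. The toy library and the function name are assumptions.

```python
import random


def saturation_curve(positions, fractions, seed=0):
    """Return (reads sampled, distinct positions) for nested subsamples of a library."""
    rng = random.Random(seed)
    shuffled = list(positions)
    rng.shuffle(shuffled)
    curve = []
    for fraction in fractions:
        n_reads = int(len(shuffled) * fraction)
        # Taking a prefix of one shuffle makes each subsample contain the previous one.
        curve.append((n_reads, len(set(shuffled[:n_reads]))))
    return curve


# Toy library: 1000 reads drawn from only 200 possible positions (low complexity).
toy_rng = random.Random(1)
positions = [toy_rng.randrange(200) for _ in range(1000)]
curve = saturation_curve(positions, [0.1, 0.25, 0.5, 0.75, 1.0])
for n_reads, n_distinct in curve:
    print(n_reads, n_distinct)
```

When the second column stops growing as the first increases, further sequencing of that library is mostly producing duplicates.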
  29. Quality Control: Strand Cross-Correlation
      • Single-end sequencing should give a bimodal peak around binding sites on the two DNA strands
      • Some peak callers use this to aid in region calling and for QC
      • Can define NSC and RSC scores…
  30. Quality Control: Strand Cross-Correlation
      • Can define NSC and RSC scores…
      Landt et al. 2012
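A small sketch of strand cross-correlation, assuming per-base counts of read 5' ends on each strand are already available as lists. The reverse-strand profile is shifted against the forward-strand profile and the Pearson correlation is recorded at each shift; the shift with the highest correlation estimates the fragment length. The toy coverage and helper names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def strand_cross_correlation(forward, reverse, max_shift):
    """Correlation between forward[i] and reverse[i + shift] for shift = 0..max_shift."""
    return [
        pearson(forward[: len(forward) - shift], reverse[shift:])
        for shift in range(max_shift + 1)
    ]


def make_coverage(length, sites, profile):
    """Toy helper: place the same small pile-up of reads at each site."""
    coverage = [0] * length
    for site in sites:
        for offset, count in enumerate(profile):
            coverage[site + offset] += count
    return coverage


fragment_length = 60  # assumed fragment length used to build the toy data
profile = [2, 5, 9, 5, 2]
forward = make_coverage(400, [100, 250], profile)
reverse = make_coverage(400, [100 + fragment_length, 250 + fragment_length], profile)

cc = strand_cross_correlation(forward, reverse, max_shift=100)
estimated = max(range(len(cc)), key=lambda shift: cc[shift])
print(estimated)
```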
  31. Quality Control: Stats, stats, stats
      • NSC and RSC
        - Normalised strand cross-correlation coefficient
        - Relative strand cross-correlation coefficient
      • FRiP
        - Fraction of reads in peaks
      • FDRs, IDRs, p-values of peaks
        - False discovery rates
        - Irreproducible discovery rates
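These statistics are simple ratios once the underlying numbers are known. The sketch below follows the definitions in Landt et al. 2012 as I understand them: NSC is the fragment-length cross-correlation divided by the minimum (background) cross-correlation, and RSC compares the fragment-length peak with the read-length ("phantom") peak after subtracting the minimum. The input values are illustrative, not real measurements, and the function names are assumptions.

```python
import math


def nsc(cc_fragment, cc_min):
    """Normalised strand cross-correlation coefficient."""
    return cc_fragment / cc_min


def rsc(cc_fragment, cc_phantom, cc_min):
    """Relative strand cross-correlation coefficient."""
    return (cc_fragment - cc_min) / (cc_phantom - cc_min)


def frip(reads, peaks):
    """Fraction of reads in peaks.

    reads: list of (chrom, position); peaks: list of (chrom, start, end), half-open.
    """
    in_peaks = sum(
        1
        for chrom, position in reads
        if any(c == chrom and start <= position < end for c, start, end in peaks)
    )
    return in_peaks / len(reads)


# Illustrative cross-correlation values only.
print(nsc(cc_fragment=0.30, cc_min=0.20))
print(rsc(cc_fragment=0.30, cc_phantom=0.25, cc_min=0.20))

reads = [("chr1", 150), ("chr1", 500), ("chr1", 199), ("chr2", 150)]
peaks = [("chr1", 100, 200)]
print(frip(reads, peaks))
```

The linear scan over peaks in `frip` is fine for a toy example; for real data an interval index or a tool such as BEDTools would be the practical choice.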
  32. Quality Control: Stats, stats, stats
      • Useful if you have a lot of samples
        - Allows benchmarking and identification of failed samples
      • Don’t be overwhelmed by the acronyms
      • Believe your eyes - if the data looks trustworthy, it probably is trustworthy
  33. Analysis Pipelines
  34. Bioinformatics Workflows
      • Running all of these steps for many samples is repetitive
        - Difficult, dull, prone to errors
      • Processing can be automated by a Workflow Manager
        - Also known as Pipeline Tools
      • Execute processing steps for you, managing files and dependencies.
  35. Cluster Flow
      • Available on UPPMAX
      • ChIP-seq pipeline
        - Runs QC, alignment, deduplication and generates coverage tracks / fingerprint plots
        - Written with J Westholm
      Pipeline modules:
        #fastqc #bowtie1 #samtools_sort_index #bedtools_bamToBed #bedToNrf #picard_dedup #samtools_sort_index #phantompeaktools_runSpp #deeptools_bamCoverage #deeptools_bamFingerprint #bedtools_intersectNeg #samtools_sort_index
      Commands:
        module load clusterflow
        cf --setup cf-uppmax --add_genomes
        cf --genome GRCh37 chipseq_qc *.fq.gz
  36. Nextflow
      • Runs on UPPMAX
      • Several pipelines built at NGI, including ChIP-seq
        - Still under development, could be a little buggy
      • Also runs elsewhere. Docker coming soon.
      Commands:
        curl -fsSL get.nextflow.io | bash
        nextflow run SciLifeLab/NGI-ChIPseq --project b2017123 --reads '*_R{1,2}.fastq.gz' --macsconfig 'macssetup.config' --genome GRCh37
  37. Bioinformatics Workflows
      • These are great, but come with some caveats
        - Some setup is required
        - They don’t always work…
        - Results must be checked
      • They are not a substitute for understanding the analysis steps
      • You are still responsible for your results!
  38. Downstream Analysis
  39. Downstream Analysis: Annotation
      • You have reads! Peaks! But where are they?
        - Co-ordinates are not helpful by themselves
      • BEDTools
        - closest: distance to nearest genes
        - intersect: overlap with feature classes
      • HOMER annotation
      • SeqMonk Average quantitation plots
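What a "closest gene" annotation does can be sketched as follows. This is a plain-Python illustration of the idea, not a reimplementation of BEDTools: it uses half-open intervals, reports a distance of 0 for overlapping features, and the gene names and coordinates are made up.

```python
def closest_gene(peak, genes):
    """Return (gene name, distance) for the nearest gene on the same chromosome.

    peak: (chrom, start, end); genes: list of (name, chrom, start, end).
    Overlapping features have distance 0. Returns None if no gene is on that chromosome.
    """
    chrom, peak_start, peak_end = peak
    best = None
    for name, gene_chrom, gene_start, gene_end in genes:
        if gene_chrom != chrom:
            continue
        # Gap between the two intervals; zero when they overlap.
        distance = max(0, gene_start - peak_end, peak_start - gene_end)
        if best is None or distance < best[1]:
            best = (name, distance)
    return best


# Hypothetical annotation and peak.
genes = [
    ("geneA", "chr1", 100, 500),
    ("geneB", "chr1", 1500, 2500),
    ("geneC", "chr2", 1000, 1100),
]
print(closest_gene(("chr1", 1000, 1100), genes))
```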
  40. Downstream Analysis: Annotation
      • HOMER can annotate read intensities
  41. Downstream Analysis: Annotation
      • SeqMonk average quantitation plot across genes
  42. Downstream Analysis: Annotation
      • GO analysis is increasingly popular
        - Gene Ontology search
      • Databases classify every gene with a restricted vocabulary
      • Use your data to find if any GO terms are enriched
      • Like peak callers, lots of software available
        - DAVID and GREAT are popular
        - Cytoscape good for visualisation
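Testing whether a GO term is enriched usually comes down to an over-representation test. A minimal sketch using the hypergeometric distribution is shown below; the gene counts are invented for illustration, and tools such as DAVID and GREAT apply their own statistics and multiple-testing corrections on top of this basic idea.

```python
from math import comb, isclose


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """P(X >= hits) when drawing `sample` genes from `population` genes,
    of which `annotated` carry the GO term."""
    total = comb(population, sample)
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return favourable / total


# Illustrative numbers: 20 genes in total, 5 carry the term,
# and all 5 genes near our peaks carry it.
p_value = hypergeom_enrichment_p(population=20, annotated=5, sample=5, hits=5)
print(p_value)
```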
  43. Downstream Analysis: Motif Searching
      • Search peaks for enriched sequence motifs
        - Could indicate a TF binding motif
        - Interesting for new ChIP factors
        - Can be informative for co-operative binding
      • HOMER is one of many tools to do this
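The simplest form of motif searching is to ask which short sequences (k-mers) are more frequent under peaks than in background sequence. The sketch below does exactly that with a pseudocount; it is far cruder than HOMER, which works with position weight matrices and proper statistics, and the sequences here are made up.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every k-mer across a list of sequences."""
    counts = Counter()
    for sequence in sequences:
        sequence = sequence.upper()
        for i in range(len(sequence) - k + 1):
            counts[sequence[i : i + k]] += 1
    return counts


def enriched_kmers(peak_seqs, background_seqs, k, pseudocount=1.0):
    """Rank k-mers found in peaks by their frequency ratio against the background."""
    peak = kmer_counts(peak_seqs, k)
    background = kmer_counts(background_seqs, k)
    peak_total = sum(peak.values())
    background_total = sum(background.values())
    scores = {}
    for kmer, count in peak.items():
        peak_freq = (count + pseudocount) / (peak_total + pseudocount)
        background_freq = (background[kmer] + pseudocount) / (background_total + pseudocount)
        scores[kmer] = peak_freq / background_freq
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy sequences: every peak sequence contains GATAAG, the background does not.
peak_seqs = ["AAGATAAGCC", "TTGATAAGGA", "CGGATAAGTT"]
background_seqs = ["ACGTACGTAC", "TTTTGGGGCC", "CAGTCAGTCA"]
ranked = enriched_kmers(peak_seqs, background_seqs, k=6)
print(ranked[0])
```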
  44. Downstream Analysis: Differential binding
      • May want to compare samples across conditions or time series
      • Overlapping peaks is too simplistic
      • DiffBind: R Bioconductor package
        - ChIP-seq equivalent of DESeq and edgeR
        - Extensive documentation and tutorials
        - http://bioconductor.org/packages/release/bioc/html/DiffBind.html
  45. Conclusions
  46. Conclusions
      • There is no “correct way” to analyse ChIP-seq
        - Depends on biological system and question
        - Affected by number of samples and experimental setup
        - Defined by your experience and skills
      • Two packages that do a lot of steps:
        - HOMER
        - SeqMonk
        - Lots of YouTube walk-through videos
        - https://youtu.be/LcMVb4zQBXI and https://youtu.be/Cy13yV6Rf6s
  47. Further Reading
      • Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data
        - Bailey et al. PLOS Comp Bio (2013)
      • ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia
        - Landt et al. Genome Research (2012)
      • ChIP–seq: advantages and challenges of a maturing technology
        - Park. Nature Reviews Genetics (2009)
      • http://seqanswers.com and http://biostars.org
  48. Questions?
      phil.ewels@scilifelab.se
      Slides: http://tiny.cc/chipseq