Workshop NGS data analysis - 3
ChIP-seq peak calling and downstream analysis

Presentation Transcript

  • Sequencing data analysis workshop, part 3: peak calling and annotation
    Maté Ongenaert
    Outline
    - Previously in this workshop…
    - Peak calling and annotation: the steps
    - Peak calling and annotation: the workshop
  • Previously in this workshop… Introduction: the real cost of sequencing
  • Previously in this workshop… The workflow of NGS data analysis
    Data analysis: raw machine reads… what's next?
    - Preprocessing (machine/technology dependent)
      - adaptors, indexes, conversions, …
    - Reads with associated qualities (universal)
      - FASTQ
      - QC check
    - Depending on application (generally applicable)
      - 'de novo' assembly of a genome (bacterial genomes, …)
      - mapping to a reference genome → mapped reads (SAM/BAM/…)
    - High-level analysis (specific for the application)
      - SNP calling
      - peak calling
  • Previously in this workshop… The workflow of NGS data analysis
  • Previously in this workshop… Main data formats
    Raw sequence reads:
    - Represent the sequence ~ FASTA
        >SEQUENCE_IDENTIFIER
        GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
    - Extension: represent the quality, per base ~ FASTQ (Q for quality)
      Score ~ Phred ~ ASCII table ~ Phred + 33 = Sanger
        @SEQUENCE_IDENTIFIER
        GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
        +
        !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
    - Machine and platform independent and compressed: SRA (NCBI)
      Get the original FASTQ file using the SRA Toolkit (NCBI)
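The "Phred + 33" encoding on this slide can be illustrated with a few lines of Python. This is a minimal sketch, not part of the original workshop material; the function names are made up for the illustration.

```python
def phred33_scores(quality: str) -> list[int]:
    """Decode a Sanger (Phred+33) quality string into per-base Phred scores."""
    return [ord(ch) - 33 for ch in quality]


def error_probability(q: int) -> float:
    """Convert a Phred score Q into the base-call error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


# A short, made-up quality string: '!' is the lowest score, 'C' a high one.
scores = phred33_scores("!*5C")
probabilities = [error_probability(q) for q in scores]
print(scores)
print(probabilities)
```

A score of 20 corresponds to a 1% chance that the base call is wrong, a score of 30 to 0.1%.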
  • Previously in this workshop… Main data formats
    - Now moving to a common file format → SAM / BAM (Sequence Alignment/Map)
    - BAM: binary (read: computer-readable, indexed, compressed) form of SAM
    Description of the 11 fields in the alignment section:
      # QNAME: template name
      # FLAG
      # RNAME: reference name
      # POS: mapping position
      # MAPQ: mapping quality
      # CIGAR: CIGAR string
      # RNEXT: reference name of the mate/next fragment
      # PNEXT: position of the mate/next fragment
      # TLEN: observed template length
      # SEQ: fragment sequence
      # QUAL: ASCII of Phred-scale base quality + 33
    # Headers
      @HD VN:1.3 SO:coordinate
      @SQ SN:ref LN:45
    # Alignment block
      r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
      r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
      r003 0 ref 9 30 5H6M * 0 0 AGCTAA * NM:i:1
      r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
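To make the 11 SAM fields concrete, the sketch below splits one alignment line from the slide into named fields and derives the reference span from the CIGAR string. It is an illustration only: real pipelines would normally use a dedicated library, and the helper names here are assumptions, not an existing API. It assumes tab-separated input, as in real SAM files.

```python
import re

SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_sam_line(line: str) -> dict:
    """Split a tab-separated SAM alignment line into the 11 mandatory fields."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["TAGS"] = cols[11:]  # optional fields such as NM:i:1
    return record


def reference_span(cigar: str) -> int:
    """Number of reference bases covered: only M, D, N, = and X consume reference."""
    return sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in "MDN=X")


line = "r001\t163\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*"
rec = parse_sam_line(line)
end = rec["POS"] + reference_span(rec["CIGAR"]) - 1  # 1-based, inclusive end
is_reverse = bool(rec["FLAG"] & 0x10)                # bit 0x10: read on reverse strand
is_second_in_pair = bool(rec["FLAG"] & 0x80)         # bit 0x80: second read of the pair
print(rec["QNAME"], rec["POS"], end, is_reverse, is_second_in_pair)
```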
  • Previously in this workshop… Main data formats
    - BED files (location / annotation / scores): Browser Extensible Data
      Used for mapping / annotation / peak locations; extension: bigBed (binary)
      Fields used:
        # chr
        # start
        # end
        # name
        # score
        # strand
      track name=pairedReads description="Clone Paired Reads" useScore=1
      #chr start end name score strand
      chr22 1000 5000 cloneA 960 +
      chr22 2000 6000 cloneB 900 -
    - bedGraph files (location, combined with score)
      Used to represent peak scores
      track type=bedGraph name="BedGraph Format" description="BedGraph format" visibility=full color=200,100,0 altColor=0,100,200 priority=20
      #chr start end score
      chr19 59302000 59302300 -1.0
      chr19 59302300 59302600 -0.75
      chr19 59302600 59302900 -0.50
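As a small illustration of the six BED fields listed above, the following sketch reads the slide's example records. The class and function names are hypothetical. It relies on the BED convention that start is 0-based and end is exclusive, so the feature length is simply end minus start.

```python
from typing import NamedTuple


class BedRecord(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    name: str
    score: int
    strand: str

    @property
    def length(self) -> int:
        return self.end - self.start


def parse_bed(lines):
    """Parse 6-column BED lines, skipping track/browser/comment lines."""
    records = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, name, score, strand = line.split()[:6]
        records.append(BedRecord(chrom, int(start), int(end), name, int(score), strand))
    return records


bed_text = """track name=pairedReads description="Clone Paired Reads" useScore=1
#chr start end name score strand
chr22 1000 5000 cloneA 960 +
chr22 2000 6000 cloneB 900 -
"""
records = parse_bed(bed_text.splitlines())
for r in records:
    print(r.name, r.chrom, r.start, r.end, r.length, r.strand)
```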
  • Previously in this workshop… Main data formats
    - WIG files (location / annotation / scores): wiggle
      Used to visualize or summarize data, in most cases count data or normalized count data (RPKM); extension: bigWig, the binary version (often used in GEO for ChIP-seq peaks)
        browser position chr19:59304200-59310700
        browser hide all
        #150 base wide bar graph at arbitrarily spaced positions,
        #threshold line drawn at y=11.76
        #autoScale off viewing range set to [0:25]
        #priority = 10 positions this as the first graph
        track type=wiggle_0 name="variableStep" description="variableStep format" visibility=full autoScale=off viewLimits=0.0:25.0 color=50,150,255 yLineMark=11.76 yLineOnOff=on priority=10
        variableStep chrom=chr19 span=150
        59304701 10.0
        59304901 12.5
        59305401 15.0
        59305601 17.5
        59305901 20.0
        59306081 17.5
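The relation between a variableStep wiggle track and bedGraph-style intervals can be sketched as follows. This is an assumption-labelled illustration (the function name is invented, and only variableStep blocks are handled): wiggle positions are 1-based, while bedGraph intervals are 0-based with an exclusive end, so each position p with span s becomes the interval [p - 1, p - 1 + s).

```python
def variable_step_to_intervals(lines):
    """Convert variableStep wiggle lines into (chrom, start, end, value) tuples.

    Output coordinates follow the bedGraph convention (0-based start, exclusive end).
    """
    chrom, span = None, 1
    intervals = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        if line.startswith("variableStep"):
            # Declaration line, e.g. "variableStep chrom=chr19 span=150"
            params = dict(item.split("=", 1) for item in line.split()[1:])
            chrom = params["chrom"]
            span = int(params.get("span", 1))
            continue
        position, value = line.split()
        start = int(position) - 1
        intervals.append((chrom, start, start + span, float(value)))
    return intervals


wig_text = """variableStep chrom=chr19 span=150
59304701 10.0
59304901 12.5
59305401 15.0
"""
intervals = variable_step_to_intervals(wig_text.splitlines())
for chrom, start, end, value in intervals:
    print(chrom, start, end, value)
```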
  • Previously in this workshop… Main data formats
    - GFF (General Feature Format) or GTF
      Used for annotation of genetic / genomic features, such as all coding genes in Ensembl
      Often used in downstream analysis to assign annotation to regions / peaks / …
      Fields used:
        # seqname (the name of the sequence)
        # source (the program that generated this feature)
        # feature (the name of this type of feature, for example: exon)
        # start (the starting position of the feature in the sequence)
        # end (the ending position of the feature)
        # score (a score between 0 and 1000)
        # strand (valid entries include +, -, or .)
        # frame (if the feature is a coding exon, frame should be a number between 0 and 2 that represents the reading frame of the first base; if the feature is not a coding exon, the value should be '.')
        # group (all lines with the same group are linked together into a single item)
      track name=regulatory description="TeleGene(tm) Regulatory Regions"
      #chr source feature start end score strand frame group
      chr22 TeleGene enhancer 1000000 1001000 500 + . touch1
      chr22 TeleGene promoter 1010000 1010100 900 + . touch1
      chr22 TeleGene promoter 1020000 1020000 800 - . touch2
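The "group" field is what ties several GFF lines into one item. A minimal sketch of that idea, using the slide's example records and assuming tab-separated columns (function names are hypothetical):

```python
from collections import defaultdict

GFF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "group"]


def parse_gff_line(line: str) -> dict:
    """Split one tab-separated GFF line into its nine named columns."""
    record = dict(zip(GFF_COLUMNS, line.rstrip("\n").split("\t")))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    return record


def group_features(lines) -> dict:
    """Collect features that share the same 'group' value into one list."""
    groups = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        record = parse_gff_line(line)
        groups[record["group"]].append(record)
    return dict(groups)


gff_lines = [
    'track name=regulatory description="TeleGene(tm) Regulatory Regions"',
    "chr22\tTeleGene\tenhancer\t1000000\t1001000\t500\t+\t.\ttouch1",
    "chr22\tTeleGene\tpromoter\t1010000\t1010100\t900\t+\t.\ttouch1",
    "chr22\tTeleGene\tpromoter\t1020000\t1020000\t800\t-\t.\ttouch2",
]
groups = group_features(gff_lines)
for name, features in groups.items():
    print(name, [f["feature"] for f in features])
```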
  • Peak calling: the workflow
    Peak calling:
    Identify genomic regions where the number of sequenced reads (coverage) in the IP sample is higher than expected from the input (control) sample >> enriched regions >> possibly captured by the IP and thus sequenced at higher coverage
    Peak annotation:
    Once such enriched regions are identified, where are they located (intron/exon/…)? What is the closest gene or the closest promoter region?
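The idea of "more IP reads than expected from the input" can be sketched with a deliberately simple window-based test. This is not how any particular peak caller works; it assumes fixed windows, a library-size scaling factor, a pseudocount and a Poisson model, all of which are choices made for this illustration only.

```python
import math


def poisson_sf(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)   # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i     # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def enriched_windows(ip_counts, input_counts, ip_total, input_total,
                     alpha=1e-5, pseudocount=1.0):
    """Return (window index, fold enrichment, p-value) for enriched windows.

    The input count of each window is scaled to the IP library size and
    used as the expected count under the Poisson model.
    """
    scale = ip_total / input_total
    hits = []
    for index, (ip, control) in enumerate(zip(ip_counts, input_counts)):
        expected = (control + pseudocount) * scale
        p_value = poisson_sf(ip, expected)
        if p_value < alpha:
            hits.append((index, ip / expected, p_value))
    return hits


# Made-up read counts for three windows; only the middle one stands out.
hits = enriched_windows([5, 40, 6], [4, 5, 6], ip_total=1000, input_total=1000)
print(hits)
```

Real peak callers differ mainly in how they estimate the expected count (local versus genome-wide background) and how they correct for multiple testing.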
  • Peak calling: the workflow
    Peak calling: coverage
    - From the BAM file: mapping against the reference genome
    - Both the IP sample and the control (input) must be mapped; duplicates will be ignored by most peak callers
    - The peak caller determines the coverage for both samples
    - Store them for visualisation (WIG files, bigWig files or similar)
    Peak calling: enriched regions
    Find out which regions are enriched (either within the sample or versus a control (input) sample) → statistics ~ model of tag distributions and normalisation strategy
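Duplicate removal and coverage calculation, as described on this slide, can be sketched like this. It assumes reads are already available as (chrom, start, end, strand) tuples with 0-based, end-exclusive coordinates, and it treats reads with the same chromosome, start and strand as duplicates; both are simplifying assumptions for the illustration.

```python
from collections import Counter


def remove_duplicates(reads):
    """Keep one read per (chrom, start, strand) combination."""
    seen = set()
    unique = []
    for read in reads:
        key = (read[0], read[1], read[3])
        if key not in seen:
            seen.add(key)
            unique.append(read)
    return unique


def coverage(reads, chrom):
    """Return (start, end, depth) intervals with depth > 0 for one chromosome."""
    delta = Counter()
    for read_chrom, start, end, _strand in reads:
        if read_chrom == chrom:
            delta[start] += 1   # depth goes up where a read starts
            delta[end] -= 1     # and down where it ends
    intervals = []
    depth, previous = 0, None
    for position in sorted(delta):
        if previous is not None and depth > 0:
            intervals.append((previous, position, depth))
        depth += delta[position]
        previous = position
    return intervals


reads = [
    ("chr1", 100, 150, "+"),
    ("chr1", 100, 150, "+"),   # duplicate of the first read
    ("chr1", 120, 170, "-"),
]
unique_reads = remove_duplicates(reads)
cov = coverage(unique_reads, "chr1")
print(cov)
```

The resulting intervals map directly onto bedGraph lines (chrom, start, end, depth).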
  • Peak calling: the workflow
    Peak calling: enriched regions
    Find out which regions are enriched (either within the sample or versus a control (input) sample) → statistics ~ model of tag distributions and normalisation strategy
    [Table: comparison of peak callers, one row per program, with feature marks (not shown here)]
    Column groups: density profiles; peak assignment; control data adjustment; significance relative to control; statistical model / test
    Features compared include: window-based, tag clustering, Gaussian kernel, strand-specific, peak height or FE, background subtraction, genomic duplications/deletions, normalized control, FDR, conditional binomial, local Poisson, Poisson model, HMM, t-test
    Programs compared: CisGenome [73], Minimal ChipSeq Peak Finder [74], E-RANGE [75], MACS [76], QuEST [77], HPeak [78], Sole-Search [79], PeakSeq [80], SISSRs [81], spp package [82]
  • Peak calling: the workflow
    macs14 -- Model-based Analysis for ChIP-Sequencing
    Usage: macs14 <-t tfile> [-n name] [-g genomesize] [options]
    Example: macs14 -t ChIP.bam -c Control.bam -f BAM -g h -n test -w --call-subpeaks
    Options:
      --version
          show program's version number and exit
      -h, --help
          show this help message and exit.
      -t TFILE, --treatment=TFILE
          ChIP-seq treatment files. REQUIRED. When ELANDMULTIPET is selected, you must provide two files separated by comma, e.g. s_1_1_eland_multi.txt,s_1_2_eland_multi.txt
      -c CFILE, --control=CFILE
          Control files. When ELANDMULTIPET is selected, you must provide two files separated by comma, e.g. s_2_1_eland_multi.txt,s_2_2_eland_multi.txt
      -n NAME, --name=NAME
          Experiment name, which will be used to generate output file names. DEFAULT: "NA"
      -f FORMAT, --format=FORMAT
          Format of tag file, "AUTO", "BED" or "ELAND" or "ELANDMULTI" or "ELANDMULTIPET" or "ELANDEXPORT" or "SAM" or "BAM" or "BOWTIE". The default AUTO option will let MACS decide which format the file is. Please check the definition in 00README file if you choose ELAND/ELANDMULTI/ELANDMULTIPET/ELANDEXPORT/SAM/BAM/BOWTIE. DEFAULT: "AUTO"
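When many samples have to be processed, the example command can be assembled from a script. The sketch below only builds the argument list and prints it; it uses just the flags that appear in the usage text and example above, and the helper name and its parameters are hypothetical. Passing the resulting list to `subprocess.run` would execute it, provided macs14 is installed.

```python
import shlex


def macs14_command(treatment, control=None, name="NA", fmt="AUTO",
                   genome_size=None, wiggle=False, call_subpeaks=False):
    """Build a macs14 argument list from the options shown in the example."""
    cmd = ["macs14", "-t", treatment]
    if control:
        cmd += ["-c", control]
    cmd += ["-f", fmt]
    if genome_size:
        cmd += ["-g", genome_size]
    cmd += ["-n", name]
    if wiggle:
        cmd.append("-w")
    if call_subpeaks:
        cmd.append("--call-subpeaks")
    return cmd


cmd = macs14_command("ChIP.bam", "Control.bam", name="test", fmt="BAM",
                     genome_size="h", wiggle=True, call_subpeaks=True)
command_line = shlex.join(cmd)
print(command_line)
```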
  • Peak calling: the workflow
    Peak annotation
    Enriched peak locations > in which features is my peak located? Is it close to a gene? Provide some statistics on how far my peaks are from annotated TSSs.
    Tools:
    - R/Bioconductor: ChIPpeakAnno package
    - PeakAnalyzer
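The core of "how far is my peak from the nearest TSS" is a nearest-neighbour lookup in a sorted list. The following sketch is not the ChIPpeakAnno or PeakAnalyzer implementation; it assumes one chromosome, ignores strand, and uses invented gene names and positions.

```python
import bisect


def nearest_tss(peak_summit, tss_list):
    """Return (gene, signed distance) for the TSS closest to a peak summit.

    tss_list must be a list of (position, gene) tuples sorted by position.
    The distance is summit minus TSS position, so a negative value means
    the summit lies at a lower coordinate than the TSS.
    """
    positions = [position for position, _gene in tss_list]
    index = bisect.bisect_left(positions, peak_summit)
    # Only the TSS just before and just after the summit can be the nearest.
    candidates = tss_list[max(0, index - 1):index + 1]
    position, gene = min(candidates, key=lambda tss: abs(tss[0] - peak_summit))
    return gene, peak_summit - position


# Hypothetical TSS annotation for a single chromosome.
tss_list = [(1000, "geneA"), (5000, "geneB"), (9000, "geneC")]
annotations = [nearest_tss(summit, tss_list) for summit in (500, 4600, 9500)]
print(annotations)
```

Collecting these distances over all peaks gives the distance-to-TSS statistics mentioned on the slide.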
  • Peak calling: the workflow
    Further downstream processing: peak overlaps
    Is the observed overlap larger than one would expect if the datasets were random?
    → The peak caller gives each peak a score
    → Randomly distribute these scores across the peaks of the same peak set (factor) and, for a given percentage of top peaks, calculate the number of overlapping peaks in the real dataset and with the randomly distributed scores
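The score-shuffling procedure described above can be sketched as a small permutation test. This is one possible reading of the slide, not the author's code: peaks are (chrom, start, end) tuples, the function names are invented, and the returned fraction is the share of permutations whose overlap is at least as large as the real one.

```python
import random


def top_peaks(peaks, scores, fraction):
    """Return the top `fraction` of peaks ranked by score (at least one peak)."""
    n = max(1, int(len(peaks) * fraction))
    ranked = sorted(zip(scores, peaks), key=lambda pair: pair[0], reverse=True)
    return [peak for _score, peak in ranked[:n]]


def count_overlaps(peaks_a, peaks_b):
    """Count peaks in A that overlap at least one peak in B."""
    return sum(
        any(a[0] == b[0] and a[1] < b[2] and b[1] < a[2] for b in peaks_b)
        for a in peaks_a
    )


def overlap_permutation_test(peaks_a, scores_a, peaks_b, scores_b,
                             fraction, n_perm=1000, seed=0):
    """Compare the real top-peak overlap with overlaps after shuffling scores."""
    rng = random.Random(seed)
    real = count_overlaps(top_peaks(peaks_a, scores_a, fraction),
                          top_peaks(peaks_b, scores_b, fraction))
    random_counts = []
    for _ in range(n_perm):
        shuffled_a, shuffled_b = scores_a[:], scores_b[:]
        rng.shuffle(shuffled_a)   # scores stay within their own peak set
        rng.shuffle(shuffled_b)
        random_counts.append(
            count_overlaps(top_peaks(peaks_a, shuffled_a, fraction),
                           top_peaks(peaks_b, shuffled_b, fraction)))
    mean_random = sum(random_counts) / n_perm
    exceed_fraction = sum(c >= real for c in random_counts) / n_perm
    return real, mean_random, exceed_fraction


# Toy data: 20 peaks per factor, with the strongest peaks at the same loci.
peaks_a = [("chr1", i * 1000, i * 1000 + 200) for i in range(20)]
peaks_b = [("chr1", i * 1000 + 100, i * 1000 + 300) for i in range(20)]
scores_a = list(range(20))
scores_b = list(range(20))
real, mean_random, exceed_fraction = overlap_permutation_test(
    peaks_a, scores_a, peaks_b, scores_b, fraction=0.25, n_perm=200)
print(real, mean_random, exceed_fraction)
```

Dividing the real overlap by the mean random overlap gives an enrichment factor for each top-peak percentage.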
  • Peak calling: the workflow
    Further downstream processing
    - Identify sequence motifs (the region around the 'peak' is searched for motifs)
    - Identify differentially bound regions between conditions/factors/…
  • Peak calling: the workflow
    Further downstream processing: peak overlaps

                   10%          15%          20%          30%          50%          75%
    Real           7            18           25           52           102          201
    Means          0.347        1.153        2.699        9.297        42.377       [unclear]
    Factor diff    20.17291066  15.61144845  9.262689885  5.593202108  2.406966043  [unclear]

                   10%    15%    20%    30%    50%    75%
    FDR            0      0      0      0      0      0
    10%            282    333    506    907    1000   1000
    20%            59     33     125    332    1000   1000
    30%            4      2      9      27     981    1000
    50%            2      0      0      0      95     1000
    75%            0      0      0      0      0      148