Here you have your reads: now what?
Making sense of high-throughput sequencing illustrated with ChIP- and RNA-seq data
Javier Quilez Oliete - Bioinformatician @ Beato Lab
Core analysis
- Sample-level
- Homogeneous
- Similar steps across *seq types
Downstream analyses
- Multi-sample
- Project-specific
- Varied/flexible
- Combine different *seq types
ChIP-seq, RNA-seq, …
ChIP-seq
- Formaldehyde (chemical binding) links protein to DNA
- Sonication (physical fragmentation) produces ChIP fragments
- Technical sequences (e.g. adapters) are added to the ChIP fragment
- The fragment sequenced includes sequence beyond that of the actual binding site
- Single end sequencing: most common
- Paired end sequencing
Core analysis (ChIP-seq)
Trimming
- sequencing adapters
- low-quality ends
- too-short reads
Trimmomatic
Improves alignment to
genome sequence
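The three trimming steps can be sketched in a few lines of Python. This is only an illustration of the idea, not how Trimmomatic is implemented: the function name `trim_read`, the adapter sequence, the quality threshold and the minimum length are all assumptions chosen for the example, and adapters are matched exactly only.

```python
def trim_read(seq, quals, adapter="AGATCGGAAGAGC", min_qual=20, min_len=25):
    """Trim one read; return (seq, quals) or None if the result is too short.

    seq is the base string, quals a list of Phred scores (one per base).
    All default values are illustrative assumptions.
    """
    # 1. Sequencing adapter: cut at the first exact adapter match.
    cut = seq.find(adapter)
    if cut != -1:
        seq, quals = seq[:cut], quals[:cut]

    # 2. Low-quality ends: drop bases below min_qual from both ends.
    start, end = 0, len(seq)
    while start < end and quals[start] < min_qual:
        start += 1
    while end > start and quals[end - 1] < min_qual:
        end -= 1
    seq, quals = seq[start:end], quals[start:end]

    # 3. Too-short reads: discard what is left if it is below min_len.
    if len(seq) < min_len:
        return None
    return seq, quals


# Toy read: 40 genomic bases, then the adapter, then 4 extra bases.
read = "ACGT" * 10 + "AGATCGGAAGAGC" + "TTTT"
phred = [30] * 38 + [10] * 2 + [30] * 17
trimmed = trim_read(read, phred)
```

In this toy read the adapter and everything after it are removed first, then the two low-quality bases at the new 3' end, leaving 38 bases.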
Alignment
[Figure labels: protein binding site, genome sequence]
BWA, Bowtie, GEM, …
Read-by-read sequence alignment to the genome sequence with the goal of identifying the genomic location from which the ChIP fragment originated
Read counts profiles
[Figure labels: protein binding site, genome sequence]
bam2wig, BEDtools, SAMtools, Deeptools, …
- Raw read counts, 100 million reads vs 10 million reads: not comparable!
- Reads per million: comparable!
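A minimal sketch of the reads-per-million (RPM) scaling, assuming per-window read counts are already available. The function name and the toy counts are made up for the example.

```python
def reads_per_million(counts, total_reads):
    """Scale raw per-window read counts to reads per million sequenced reads."""
    return [c * 1_000_000 / total_reads for c in counts]


# Same underlying signal sequenced at two different depths (toy numbers).
deep = reads_per_million([500, 1000, 250], total_reads=100_000_000)
shallow = reads_per_million([50, 100, 25], total_reads=10_000_000)
```

The raw counts differ tenfold between the two samples, but after scaling both profiles take the same values.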
Peak calling
Identification of regions showing significant signal enrichment over the background levels (MACS2, Zerone, …)
[Figure labels: genome sequence, signal background, signal enrichment, peak region]
Peak calling
Control (no ChIP) vs ChIP sample
Including a control sample allows accounting for spurious enrichments (resulting from structural variation in the genome or ChIP artefacts) and improves the accuracy of the peak calling by reducing the false positives
[Figure labels: peak region, true enrichment, spurious enrichment]
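To make the role of the control concrete, here is a toy enrichment test on fixed windows. It is a simplified sketch under stated assumptions, not the algorithm used by MACS2 or Zerone: read counts are assumed to follow a Poisson distribution, the background in each window is the larger of the genome-wide ChIP average and the depth-scaled control count, and the function names and the `alpha` threshold are hypothetical.

```python
import math


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam); fine for small toy counts."""
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)


def enriched_windows(chip, control, alpha=1e-5):
    """Indices of windows whose ChIP count is significantly above background."""
    scale = sum(chip) / sum(control)      # bring control to the ChIP depth
    genome_bg = sum(chip) / len(chip)     # genome-wide average per window
    hits = []
    for i, (c, k) in enumerate(zip(chip, control)):
        lam = max(genome_bg, k * scale)   # local background from the control
        if poisson_upper_tail(c, lam) < alpha:
            hits.append(i)
    return hits


# Window 3 is enriched in the ChIP only; window 5 is high in both samples.
chip = [5, 6, 4, 60, 5, 50, 5, 6, 4, 5]
control = [5, 5, 5, 5, 5, 50, 5, 5, 5, 5]
hits = enriched_windows(chip, control)
```

Window 5 has a high ChIP count, but the control is equally high there, so it is treated as a spurious enrichment and not reported.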
Downstream analyses (ChIP-seq)
Genome Browser
[Figure: genome browser view around chr9:137,300,000-137,400,000 (hg38, 50 kb scale bar). RPM profiles and peaks identified with MACS2 are shown for the control samples (T47D gDNA; Input T0 (Roberto); T47D input) and for the T47D PR ChIP-seq samples (PR T0; PR T60; T0 PR; T30 PR at 1, 2, 5, 10 and 100 nM), above the GENCODE v24, segmental duplication, simple repeat, RepeatMasker and WindowMasker + SDust annotation tracks. Annotations mark the control samples, the true peaks and a false positive.]
Overlap of peaks genomic coordinates
http://bedtools.readthedocs.io/en/latest/content/tools/intersect.html
- Replicate 1 vs Replicate 2: measure overlap between ChIP-seq replicate samples (expected to be high) as a quality metric
- Protein A Treatment 1, Protein B, Protein A Treatment 2: interrogate overlap between proteins/conditions
(Venn diagrams with >3 groups cannot be proportional and are harder to interpret)
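The overlap idea can be sketched in plain Python for small peak lists. In practice a tool such as BEDtools intersect does this efficiently on whole files. Peaks are assumed here to be `(chrom, start, end)` tuples with 0-based, half-open coordinates as in BED; the function name and the coordinates are invented for the example.

```python
def fraction_overlapping(peaks_a, peaks_b):
    """Fraction of peaks in A that overlap at least one peak in B."""
    by_chrom = {}
    for chrom, start, end in peaks_b:
        by_chrom.setdefault(chrom, []).append((start, end))

    n_overlapping = 0
    for chrom, start, end in peaks_a:
        # Two half-open intervals overlap if each starts before the other ends.
        if any(start < e and s < end for s, e in by_chrom.get(chrom, [])):
            n_overlapping += 1
    return n_overlapping / len(peaks_a) if peaks_a else 0.0


replicate_1 = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
replicate_2 = [("chr1", 150, 250), ("chr1", 600, 700), ("chr2", 50, 120)]
shared = fraction_overlapping(replicate_1, replicate_2)
```

Here two of the three replicate 1 peaks overlap a replicate 2 peak. The peak at chr1:500-600 only touches its neighbour at position 600, which does not count as an overlap with half-open coordinates.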
Signal enrichment over regions
[Figure labels: gene expression, gene promoter, ChIP-signal]
Is there consistent ChIP-seq signal enrichment over gene promoters?
Signal enrichment over regions
[Figure: Protein A and Protein B signal over gene promoters, Protein C peaks and random regions]
- For each promoter (rows), the normalised protein A ChIP-seq signal is shown for the promoter (center of the row) as well as for its flanking region
- The darker the color in the heatmap, the higher the intensity of the ChIP-seq signal (i.e. number of reads)
- Average profile: curve showing the average for all rows (e.g. gene promoters)
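The average profile is simply the column-wise mean of the heatmap matrix. A minimal sketch, assuming the normalised signal has already been extracted into one row per region and one column per position bin around the region center (the matrix values are toy numbers):

```python
def average_profile(matrix):
    """Column-wise mean of a signal matrix (rows = regions, columns = bins)."""
    n_rows = len(matrix)
    return [sum(column) / n_rows for column in zip(*matrix)]


# Three promoters, five bins each; the middle bin is the promoter center.
signal = [
    [0, 2, 8, 2, 0],
    [1, 3, 10, 3, 1],
    [2, 4, 6, 4, 2],
]
profile = average_profile(signal)
```

A profile that rises in the middle bin, as in this toy matrix, indicates signal that is consistently enriched at the region centers.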
Genomic distribution of peaks
- Percentage of peaks falling in each of the annotation categories
- Percentage of peaks at a given distance from a transcription start site (TSS)
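The distance-to-TSS summary can be sketched as follows, assuming peak midpoints and TSS positions on a single chromosome. The function names, the positions and the distance bins (<1 kb, 1-10 kb, >=10 kb) are assumptions for the example.

```python
import bisect


def distance_to_nearest_tss(peak_midpoints, tss_positions):
    """Absolute distance from each peak midpoint to the closest TSS."""
    tss = sorted(tss_positions)
    distances = []
    for mid in peak_midpoints:
        i = bisect.bisect_left(tss, mid)
        # Only the TSS just before and just after the midpoint can be closest.
        neighbours = tss[max(0, i - 1):i + 1]
        distances.append(min(abs(mid - t) for t in neighbours))
    return distances


def percentage_per_bin(distances, edges=(1000, 10000)):
    """Percentage of peaks per distance bin: <1 kb, 1-10 kb, >=10 kb."""
    counts = [0] * (len(edges) + 1)
    for d in distances:
        counts[bisect.bisect_right(edges, d)] += 1
    return [100 * c / len(distances) for c in counts]


tss = [1000, 5000, 20000]
midpoints = [1200, 4000, 30000, 500]
distances = distance_to_nearest_tss(midpoints, tss)
percentages = percentage_per_bin(distances)
```

For real data the same calculation would be done per chromosome, and strand could be taken into account to separate upstream from downstream peaks.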
Motif discovery analysis
[Figure labels: protein binding site, peak region, genome sequence, signal enrichment]
The fragment sequenced includes sequence beyond that of the actual binding site
Motif discovery analysis
[Figure: the sequence TGTTCT recurring across peak sequences]
• Is there any motif/sequence over-represented in the sequences of the peaks (relative to the genome)?
• Does such a motif correspond to any annotated motif (e.g. transcription factors)?
• Motif discovery allows defining the binding site of the target protein as well as the binding of secondary proteins
• Need to account for the fact that the peak may reflect a region broader than the precise binding site
Motif discovery analysis
1. Find consensus motif
The height of each letter is proportional to its frequency in that position within the motif
RC = reverse complement
2. Motif annotation
Search for known transcription factors compatible with the consensus motif
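A toy version of these ideas: count k-mers in peak sequences, compare them with background sequences, and tabulate per-position base frequencies for a set of aligned sites. Dedicated motif discovery tools use more elaborate statistical models, so this is only a sketch; the function names, the pseudocount and all sequences are made up.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement (RC) of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def count_kmers(seqs, k):
    """Count every k-mer occurring in a list of sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def kmer_enrichment(peak_seqs, background_seqs, k=6, pseudocount=1):
    """Ratio of k-mer frequency in peaks vs background (with a pseudocount)."""
    peaks, bg = count_kmers(peak_seqs, k), count_kmers(background_seqs, k)
    p_total = sum(peaks.values()) + pseudocount
    b_total = sum(bg.values()) + pseudocount
    return {
        kmer: ((n + pseudocount) / p_total) / ((bg[kmer] + pseudocount) / b_total)
        for kmer, n in peaks.items()
    }


def position_frequencies(sites):
    """Per-position base frequencies for aligned sites of equal length."""
    return [
        {base: n / len(sites) for base, n in Counter(column).items()}
        for column in zip(*sites)
    ]


peak_seqs = ["AATGTTCTAA", "CCTGTTCTGG", "GTTGTTCTCA"]
background_seqs = ["ACGTACGTAC", "GGCATCGATT", "TTAACCGGTA"]
enrichment = kmer_enrichment(peak_seqs, background_seqs)
top_kmer = max(enrichment, key=enrichment.get)
frequencies = position_frequencies(["TGTTCT", "TGTACT", "TGTTCT", "AGTTCT"])
```

The per-position frequencies are the quantities that set the letter heights in the simple logo described above, and `reverse_complement` gives the RC form of a motif found on the opposite strand.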
RNA-seq
- DNA: gene with Exon1, Exon2, …, ExonN
- Transcription et al*: mRNA with poly-A tail
- Poly-A selection + cDNA synthesis: cDNA
- Technical sequences (e.g. adapters) are added to the cDNA
- Paired end sequencing
*splicing plus addition of polyA-tail
This RNA-seq experiment targets messenger RNA (mRNA), as this is one of the most common applications. However, other applications exist (e.g. total RNA, ribosomal RNA)
Core analysis (RNA-seq)
Trimming
- sequencing adapters
- low-quality ends
- too-short reads
Trimmomatic
Improves alignment to
genome sequence
Alignment
[Figure labels: genome sequence, gene, Exon1, Exon2, …, ExonN]
STAR, TopHat, GEM, …
Read-by-read sequence alignment to the genome sequence with the goal of identifying the genomic location from which the RNA originated
Some RNA-seq reads will originate from different exons and thus map to non-contiguous genomic positions. RNA-seq aligners need to be aware of this and split reads accordingly (red region) during the alignment
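Split reads show up in the alignment output: in the SAM/BAM format the CIGAR string uses the `N` operation for a region of the reference skipped by the read, which for RNA-seq corresponds to an intron. The sketch below converts a start position and CIGAR string into the genomic blocks covered by the read. The function name is hypothetical, and the start position is assumed to be already converted to 0-based (the SAM POS field itself is 1-based).

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_blocks(pos, cigar):
    """Genomic blocks (start, end), 0-based half-open, covered by one read."""
    blocks = []
    block_start = current = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "M=XD":
            # These operations consume reference bases within a block.
            current += length
        elif op == "N":
            # Skipped region (intron): close the block and jump over it.
            if current > block_start:
                blocks.append((block_start, current))
            current += length
            block_start = current
        # I, S, H and P do not consume reference bases.
    if current > block_start:
        blocks.append((block_start, current))
    return blocks


spliced = aligned_blocks(100, "30M500N20M")
unspliced = aligned_blocks(100, "50M")
```

The spliced read covers two blocks separated by a 500 bp gap, i.e. it spans an exon-exon junction, whereas the unspliced read covers a single block.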
Read counts profiles
[Figure labels: genome sequence, gene, Exon1, Exon2, …, ExonN]
STAR, bam2wig, BEDtools, SAMtools, Deeptools, …
Normalise by the number of million reads to make different experiments comparable
Expression quantification
• The number of reads is proportional to the level of expression (i.e. more RNA, more reads)
• Expression can be quantified at either the gene or the transcript level (Kallisto, HTSeq, featureCounts)
• There are several units to measure expression:
  • read counts per gene/transcript: not normalised, i.e. does not account for the sample library size (the number of reads sequenced) or gene/transcript length
  • Reads Per Kilobase of transcript per Million mapped reads (RPKM): accounts for library size and locus length
  • Transcripts Per Million (TPM): accounts for library size and locus length
• TPM is becoming more popular than RPKM, and some argue that RPKM values are inconsistent across samples (http://blog.nextgenetics.net/?e=51)
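The two normalised units can be written out in a few lines. This is a sketch on toy numbers: the function names are hypothetical, and the total of the counted reads is assumed to stand in for the number of mapped reads.

```python
def rpkm(counts, lengths_bp):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    millions = sum(counts) / 1e6  # assumption: all mapped reads are counted here
    return [c / (length / 1000) / millions for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts Per Million: length-normalise first, then scale to 1e6."""
    rates = [c / (length / 1000) for c, length in zip(counts, lengths_bp)]
    scale = sum(rates) / 1e6
    return [r / scale for r in rates]


lengths = [1000, 2000, 3000]   # transcript lengths in bp
sample_1 = [100, 200, 300]     # read counts per transcript
sample_2 = [100, 200, 3000]

tpm_totals = [sum(tpm(s, lengths)) for s in (sample_1, sample_2)]
rpkm_totals = [sum(rpkm(s, lengths)) for s in (sample_1, sample_2)]
```

TPM values add up to one million in every sample, so a given TPM always means the same share of the total. RPKM totals differ from sample to sample, which is the inconsistency referred to above.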
Downstream analyses (RNA-seq)
Genome Browser
[Figure: genome browser view of the FKBP5 locus, around chr6:35,590,000-35,680,000 (hg38, 50 kb scale bar). RPM profiles are shown for T47D gDNA and for three T47D T0 and three T47D R6 RNA-seq samples, with the samples labelled as Treated or Untreated, above the GENCODE v24, segmental duplication, simple repeat, RepeatMasker and WindowMasker + SDust annotation tracks.]
Indeed, looking for genes that show differential expression between two conditions (e.g. treated vs untreated) is likely the most common application of RNA-seq
This is not done by visual inspection in the browser but with dedicated software (sleuth, DESeq2, edgeR). These tools account for biological/technical variation between replicates and assign a significance value to the differential expression
Differential expression analysis
ChIP-seq
Core analysis
- Trimming: Trimmomatic
- Alignment: BWA, Bowtie, GEM
- Read counts profiles: bam2wig, BEDtools, SAMtools, Deeptools
- Peak calling: MACS2, Zerone
Downstream analyses
- Genome Browser
- Overlaps of peaks genomic coordinates
- Signal enrichment over regions
- Genomic distribution of peaks
- Motif discovery analysis
- …
RNA-seq
Core analysis
- Trimming: Trimmomatic
- Alignment: STAR, TopHat, GEM
- Read counts profiles: STAR, bam2wig, BEDtools, SAMtools, Deeptools
- Expression quantification: Kallisto, featureCounts, HTSeq
Downstream analyses
- Genome Browser
- Differential expression analysis
- …

