1. Here you have your reads: now what?
Making sense of high-throughput sequencing illustrated with ChIP- and RNA-seq data
Javier Quilez Oliete - Bioinformatician @ Beato Lab
18. Peak calling
Peak region
Genome sequence
Identification of regions showing
significant signal enrichment over the
background levels (MACS2, Zerone…)
Signal background
Signal enrichment
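The idea of "significant enrichment over background" can be sketched with a toy example. This is not how MACS2 or Zerone work internally; it is a minimal illustration that assumes reads have already been counted in fixed-size windows and that the background follows a Poisson distribution whose mean is the genome-wide average. The function names, the window counts and the p-value threshold are all hypothetical.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), obtained as 1 minus the lower tail."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)

def call_enriched_windows(window_counts, p_threshold=1e-5):
    """Return indices of windows whose read count is unlikely under the background."""
    # Assumption: background = mean reads per window (inflated slightly by true peaks)
    background = sum(window_counts) / len(window_counts)
    return [i for i, c in enumerate(window_counts)
            if poisson_sf(c, background) < p_threshold]

counts = [2, 3, 1, 2, 40, 38, 2, 3, 1, 2]  # toy read counts per window
print(call_enriched_windows(counts))
```

Adjacent enriched windows would then be merged into a single peak region.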
20. Peak calling
Control
(no ChIP)
ChIP sample
Including a control sample allows
accounting for spurious enrichments
(resulting from structural variation in
the genome, ChIP artefacts) and
improves the accuracy of the peak
calling by reducing the false positives
Peak region
True enrichment
Spurious enrichment
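A minimal sketch of why a control helps, assuming per-window read counts are available for both the ChIP and the control sample. The control is scaled to the ChIP library size and a window is kept only if the ChIP signal clearly exceeds it. The function name, the pseudocount and the fold threshold of 4 are illustrative assumptions, not the procedure of any specific peak caller.

```python
def fold_enrichment(chip_counts, control_counts, pseudocount=1.0):
    """Per-window ChIP/control ratio after scaling the control to the ChIP library size."""
    # Assumption: library size is approximated by the sum of the window counts
    scale = sum(chip_counts) / sum(control_counts)
    return [(c + pseudocount) / (k * scale + pseudocount)
            for c, k in zip(chip_counts, control_counts)]

chip    = [2, 30, 2, 28, 2]   # windows 1 and 3 look like peaks in the ChIP sample
control = [1, 1, 1, 14, 1]    # window 3 is also high in the control (spurious)

ratios = fold_enrichment(chip, control)
kept = [i for i, r in enumerate(ratios) if r >= 4.0]
print(kept)
```

Window 3 is high in both samples, so its ratio stays low and it is discarded as a spurious enrichment.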
24. Genome Browser
[Figure: genome browser screenshot, hg38, chr9 around 137,300,000-137,400,000 (50 kb scale bar)]
Sample tracks, each shown as an RPM profile (scale 0 to 1) with peaks identified with MACS2:
T47D gDNA (DNA-seq; peaks called without control)
Input T0 (Roberto)
T47D PR T0 (gv_009_02_01_chipseq)
T47D PR T60 (gv_066_01_01_chipseq)
T47D input (gv_098_01_01_chipseq)
T47D T0 PR (gv_092_01_01_chipseq)
T47D T30 PR 1nM (gv_093_01_01_chipseq)
T47D T30 PR 2nM (gv_094_01_01_chipseq)
T47D T30 PR 5nM (gv_095_01_01_chipseq)
T47D T30 PR 10nM (gv_097_01_01_chipseq)
T47D T30 PR 100nM (gv_096_01_01_chipseq)
Annotation tracks: GENCODE v24, Pseudogenes, Segmental Dups, Simple Repeats, RepeatMasker, WM + SDust
Highlighted in the figure: Control samples, True peaks, False positive
27. Overlap of peaks genomic coordinates
http://bedtools.readthedocs.io/en/latest/content/tools/intersect.html
Replicate 1
Replicate 2
Measure overlap between ChIP-seq
replicate samples (expected to be
high) as a quality metric
Protein A
Treatment 1
Protein B
Protein A
Treatment 2
Interrogate overlap
between proteins/
conditions
(Venn diagrams with more than 3 groups cannot be drawn
proportionally and are harder to interpret)
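The replicate-overlap metric can be sketched in a few lines. This is a pure-Python illustration, not a substitute for bedtools intersect; it assumes peaks are BED-style tuples (chromosome, start, end) with the end coordinate exclusive, and the function names and toy peaks are hypothetical.

```python
def overlaps(a, b):
    """True if two BED-style intervals (chrom, start, end) share at least 1 bp."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def fraction_overlapping(peaks_a, peaks_b):
    """Fraction of peaks in A that overlap at least one peak in B."""
    hits = sum(1 for a in peaks_a if any(overlaps(a, b) for b in peaks_b))
    return hits / len(peaks_a)

rep1 = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
rep2 = [("chr1", 150, 250), ("chr2", 300, 400), ("chr1", 600, 700)]

print(fraction_overlapping(rep1, rep2))
```

Note that the metric is not symmetric: the fraction of replicate 1 peaks found in replicate 2 can differ from the reverse, so it is worth reporting both. The all-against-all comparison is fine for a toy example but too slow for real peak sets, where sorted intervals or bedtools are the practical choice.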
29. Signal enrichment over regions
Gene expression
Gene promoter
ChIP-signal
…
Is there consistent ChIP-seq signal enrichment
over gene promoters?
30. Signal enrichment over regions
Gene promoters
Protein C peaks
Random regions
Protein A Protein B
For each promoter (rows) the
normalised protein A ChIP-seq
signal is shown for the promoter
(center of the row) as well as for its
flanking region
The darker the color in the
heatmap, the higher the
intensity of the ChIP-seq
signal (i.e. number of reads)
Average profile: curve
showing the average for all
rows (e.g. gene promoters)
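The average profile is simply the column-wise mean of the heatmap matrix. A minimal sketch, assuming the signal has already been binned so that each row is a region (e.g. a promoter plus its flanks) and each column is a bin; the matrix values and the function name are made up for illustration.

```python
def average_profile(matrix):
    """Column-wise mean of a regions x bins signal matrix."""
    n_rows = len(matrix)
    return [sum(column) / n_rows for column in zip(*matrix)]

# Toy normalised signal: 3 regions x 5 bins, centre bin = promoter
signal = [
    [0.0, 1.0, 4.0, 1.0, 0.0],
    [1.0, 2.0, 6.0, 2.0, 1.0],
    [0.0, 0.0, 2.0, 0.0, 0.0],
]

profile = average_profile(signal)
print(profile)
```

A peak in the centre of the profile indicates that the enrichment over the promoter is consistent across regions rather than driven by a few rows.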
32. Genomic distribution of peaks
Percentage of peaks falling in each of the annotation categories
Percentage of peaks at a given distance from a transcription start site (TSS)
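The distance-to-TSS summary can be sketched as follows, assuming a single chromosome, peak summits and TSS positions given as integer coordinates, and arbitrary distance bins. All names, coordinates and bin edges are hypothetical.

```python
from bisect import bisect_left

def distance_to_nearest_tss(position, sorted_tss):
    """Absolute distance from a position to the closest TSS (sorted_tss must be sorted)."""
    i = bisect_left(sorted_tss, position)
    candidates = sorted_tss[max(0, i - 1):i + 1]  # TSS just before and just after
    return min(abs(position - t) for t in candidates)

def percent_by_distance(peak_summits, tss_positions, bin_edges=(1000, 10000, 100000)):
    """Percentage of peaks falling in each distance-to-TSS bin."""
    sorted_tss = sorted(tss_positions)
    labels = [f"<= {e} bp" for e in bin_edges] + [f"> {bin_edges[-1]} bp"]
    counts = dict.fromkeys(labels, 0)
    for summit in peak_summits:
        d = distance_to_nearest_tss(summit, sorted_tss)
        for edge, label in zip(bin_edges, labels):
            if d <= edge:
                counts[label] += 1
                break
        else:  # farther than the largest bin edge
            counts[labels[-1]] += 1
    return {label: 100.0 * c / len(peak_summits) for label, c in counts.items()}

tss = [10000, 50000, 200000]
summits = [10200, 48000, 120000, 500000]
print(percent_by_distance(summits, tss))
```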
33. Motif discovery analysis
Peak region
Genome sequence
Signal enrichment
Protein binding site
The sequenced fragment
includes sequence beyond
that of the actual binding site
34. Motif discovery analysis
[Figure: repeated occurrences of the motif TGTTCT across peak sequences]
• Is there any motif/sequence over-represented in the sequences of the peaks (relative to the genome)?
• Does such a motif correspond to any annotated motif (e.g. those of transcription factors)?
• Motif discovery allows defining the binding site of the target protein as well as the binding of secondary proteins
• Need to account for the fact that the peak may reflect a region broader than the precise binding site
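Over-representation can be illustrated by comparing how often each k-mer occurs in peak sequences versus background sequences. This is a deliberately naive sketch, not the algorithm of any motif discovery tool; the sequences, the pseudocount and the function names are hypothetical, and only the forward strand is scanned.

```python
from collections import Counter

def kmer_presence(sequences, k):
    """Count, for each k-mer, how many sequences contain it at least once."""
    counts = Counter()
    for seq in sequences:
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts

def enriched_kmers(peak_seqs, background_seqs, k=6, pseudocount=1.0):
    """Rank k-mers by (fraction of peaks) / (fraction of background sequences)."""
    peak = kmer_presence(peak_seqs, k)
    background = kmer_presence(background_seqs, k)
    scores = {}
    for kmer, n in peak.items():
        peak_frac = (n + pseudocount) / (len(peak_seqs) + pseudocount)
        bg_frac = (background[kmer] + pseudocount) / (len(background_seqs) + pseudocount)
        scores[kmer] = peak_frac / bg_frac
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

peak_seqs = ["AATGTTCTGG", "CCTGTTCTAA", "GTGTTCTCAT"]        # toy peak sequences
background_seqs = ["ACGTACGTAC", "GGCCTTAAGG", "TTAACCGGTT"]  # toy background

top_kmer, top_score = enriched_kmers(peak_seqs, background_seqs)[0]
print(top_kmer, top_score)
```

Because a peak is broader than the binding site, the k-mer is searched for anywhere in the sequence rather than at a fixed position.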
36. Motif discovery analysis
1. Find consensus motif
The height of each letter is
proportional to its frequency in that
position within the motif
RC = reverse complement
2. Motif annotation
Search for known transcription
factors compatible with the
consensus motif
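Step 1 can be sketched by building per-position base frequencies from aligned binding sites and reading off the most frequent base at each position. The aligned sites below are invented for illustration and the function names are hypothetical; real tools also handle unaligned input and both strands.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement (RC) of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def position_frequencies(sites):
    """Per-position base frequencies for equal-length, aligned sites."""
    n = len(sites)
    return [{base: column.count(base) / n for base in "ACGT"}
            for column in zip(*sites)]

def consensus(frequencies):
    """Most frequent base at each position."""
    return "".join(max(column, key=column.get) for column in frequencies)

sites = ["TGTTCT", "TGTTCT", "TGTACT", "AGTTCT"]  # toy aligned sites
freqs = position_frequencies(sites)

print(consensus(freqs))
print(reverse_complement(consensus(freqs)))
```

The frequencies in `freqs` are what the letter heights in the logo represent; the RC matters because a site on the opposite strand appears as the reverse complement in the reference sequence.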
43. RNA-seq
DNA
Gene
Exon1
…
Exon2
ExonN
Transcription et al*
*splicing plus addition of polyA-tail
Poly-A tail
mRNA
cDNA
Technical sequences
(e.g. adapters)
Poly-A selection
+
cDNA synthesis
The RNA-seq experiment shown targets
messenger RNA (mRNA), one of the
most common applications; other
applications exist (e.g. total RNA,
ribosomal RNA)
Paired end
Paired end
46. Alignment
Genome sequence
Gene
Exon1
…
Exon2
ExonN
STAR
TopHat
GEM
…
Read-by-read sequence alignment to
genome sequence with the goal of
identifying the genomic location from
which the RNA originated
Some RNA-seq reads will originate
from different exons and thus map to
non-contiguous genomic positions;
RNA-seq aligners need to be aware
of this and split reads accordingly
(red region) during the alignment
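In SAM/BAM output, a split read is encoded in the CIGAR string, where the N operation marks a skipped stretch of the reference (e.g. an intron). A minimal sketch of recovering the genomic blocks a read covers, assuming a 0-based start position; the function name and the example alignment are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def aligned_blocks(start, cigar):
    """Return 0-based, end-exclusive reference blocks covered by a read."""
    blocks = []
    pos = start
    block_start = start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "M=XD":       # consumes the reference within the current block
            pos += length
        elif op == "N":        # skipped region: close the block and jump ahead
            if pos > block_start:
                blocks.append((block_start, pos))
            pos += length
            block_start = pos
        # I, S, H and P do not consume the reference
    if pos > block_start:
        blocks.append((block_start, pos))
    return blocks

# A 50-base read: 30 bases in one exon, a 500 bp gap, 20 bases in the next exon
print(aligned_blocks(1000, "30M500N20M"))
```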
47. Read counts profiles
Genome sequence
Gene
Exon1
…
Exon2
ExonN
STAR
bam2wig
BEDtools
SAMtools
Deeptools
…
Normalise by the number of
reads (in millions) to make different
experiments comparable
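The normalisation is a single scaling step, producing reads per million (RPM). A minimal sketch, assuming per-window read counts and the total number of mapped reads per sample are known; the numbers and the function name are made up.

```python
def reads_per_million(window_counts, total_mapped_reads):
    """Scale raw per-window read counts to reads per million mapped reads (RPM)."""
    scale = 1_000_000 / total_mapped_reads
    return [count * scale for count in window_counts]

# Sample B was sequenced twice as deeply as sample A
sample_a = reads_per_million([10, 50, 10], total_mapped_reads=20_000_000)
sample_b = reads_per_million([20, 100, 20], total_mapped_reads=40_000_000)

print(sample_a)
print(sample_b)
```

The raw counts differ twofold, but after scaling the two profiles coincide, which is what makes samples of different sequencing depth comparable.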
48. Expression quantification
• The number of reads is proportional to the level of expression (i.e. more RNA, more reads)
• Expression can be quantified at either the gene or the transcript level (Kallisto, HTSeq, featureCounts)
• There are several units to measure expression:
  • Read counts per gene/transcript: not normalised, i.e. they do not account for the sample library size (the number of reads sequenced) or for gene/transcript length
  • Reads Per Kilobase of transcript per Million mapped reads (RPKM): accounts for library size and gene/transcript length
  • Transcripts Per Million (TPM): accounts for library size and gene/transcript length
• TPM is becoming more popular than RPKM, and some argue that RPKM values are inconsistent across samples (http://blog.nextgenetics.net/?e=51)
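The difference between the two units comes down to the order of the normalisation steps. A minimal sketch, assuming read counts and gene lengths (in bp) for a toy sample of three genes, and approximating the library size by the sum of the counts; function names and values are hypothetical.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase per million reads: scale by library size and by length."""
    total_millions = sum(counts) / 1e6  # assumption: library size = sum of counts
    return [c / (length / 1e3) / total_millions
            for c, length in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalise first, then rescale to sum to 1e6."""
    rates = [c / (length / 1e3) for c, length in zip(counts, lengths_bp)]
    scale = 1e6 / sum(rates)
    return [rate * scale for rate in rates]

counts = [100, 200, 700]
lengths = [1000, 2000, 7000]

print(rpkm(counts, lengths))
print(tpm(counts, lengths))
print(sum(tpm(counts, lengths)))
```

TPM values add up to one million in every sample, whereas the sum of RPKM values depends on the sample, which is the usual argument for RPKM being harder to compare across samples.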