20161021_master_lesson_no_feedback

Here you have your reads: now what?
Making sense of high-throughput sequencing illustrated with ChIP- and RNA-seq data
Javier Quilez Oliete - Bioinformatician @ Beato Lab

Core analysis
- Sample-level
- Homogeneous
- Similar steps across *seq types
Downstream analyses
- Multi-sample
- Project-specific
- Varied/flexible
- Combine different *seq types
*seq types: ChIP-seq, RNA-seq, …

ChIP-seq
- DNA and protein: bound with formaldehyde (chemical binding)
- Sonication (physical fragmentation) produces the ChIP fragment
- Technical sequences (e.g. adapters) flank the ChIP fragment
- The fragment sequenced includes sequence beyond that of the actual binding
- Single end: most common
- Paired end

Trimming
- sequencing adapters
- low-quality ends
- too-short reads
Trimmomatic
Improves alignment to
genome sequence
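
The three trimming steps listed above can be sketched in plain Python. This is only an illustration of the idea, not how Trimmomatic is implemented; the function name, the adapter sequence, the quality cutoff and the minimum length are all assumptions chosen for the example.

```python
def trim_read(seq, quals, adapter="AGATCGGAAGAGC", min_qual=20, min_len=25):
    """Return (seq, quals) after trimming, or None if the read ends up too short.

    seq   : read sequence (string)
    quals : per-base Phred quality scores (list of ints, same length as seq)
    All default values are illustrative assumptions.
    """
    # 1. Sequencing adapters: cut the read at the first exact adapter match.
    pos = seq.find(adapter)
    if pos != -1:
        seq, quals = seq[:pos], quals[:pos]

    # 2. Low-quality ends: drop bases from the 3' end while below the cutoff.
    end = len(seq)
    while end > 0 and quals[end - 1] < min_qual:
        end -= 1
    seq, quals = seq[:end], quals[:end]

    # 3. Too-short reads: discard them altogether.
    if len(seq) < min_len:
        return None
    return seq, quals
```

Real trimmers also tolerate mismatches in the adapter and use sliding quality windows; the exact-match search here is kept deliberately simple.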

Alignment
Protein
binding
site
Genome sequence
BWA
Bowtie
GEM
…
Read-by-read sequence alignment to
genome sequence with the goal of
identifying the genomic location from
which the ChIP fragment originated

Read counts profiles
Protein
binding
site
Genome sequence
bam2wig
BEDtools
SAMtools
Deeptools
…

100 million reads
10 million reads
Not comparable!

Reads per million
Comparable!
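
A small sketch of the reads-per-million (RPM) scaling, using made-up counts for the same three genomic bins in a deep and a shallow library. The function name and the numbers are assumptions for illustration only.

```python
def reads_per_million(counts, total_reads):
    """Scale raw read counts to reads per million sequenced reads."""
    return [c * 1_000_000 / total_reads for c in counts]

# Same underlying signal, sequenced at two different depths (toy values).
deep = reads_per_million([500, 1000, 200], total_reads=100_000_000)
shallow = reads_per_million([50, 100, 20], total_reads=10_000_000)

# After scaling, the two profiles can be compared directly.
print(deep)
print(shallow)
```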

Peak calling
Peak region
Genome sequence
Identification of regions showing significant signal enrichment over the
background levels (MACS2, Zerone…)
Signal background
Signal enrichment
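
To make "significant enrichment over the background" concrete, here is a minimal sketch that tests each fixed-size window against a Poisson background. It is not the algorithm of MACS2 or Zerone; the window-based layout, the genome-wide mean as background rate and the p-value cutoff are simplifying assumptions.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)

def call_enriched_windows(window_counts, p_cutoff=1e-5):
    """Return the indices of windows whose read count exceeds the background.

    Assumption: most windows are background, so the mean count per window
    is used as the Poisson rate.
    """
    lam = sum(window_counts) / len(window_counts)
    return [i for i, c in enumerate(window_counts)
            if poisson_sf(c, lam) < p_cutoff]
```

With toy counts such as `[2, 3, 1, 2, 40, 3, 2, 1, 2, 3]`, only the window holding 40 reads is reported.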

Peak calling
Control
(no ChIP)
ChIP sample
Including a control sample allows
accounting for spurious enrichments
(resulting from structural variation in
the genome, ChIP artefacts) and
improves the accuracy of the peak
calling by reducing the false positives
Peak region
True enrichment
Spurious enrichment
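
One simple way to picture the role of the control is to keep only the peaks whose ChIP signal clearly exceeds the control signal over the same region. The sketch below does this with a fold-enrichment threshold; the function name, the threshold and the pseudocount are assumptions, and real peak callers model the control statistically rather than with a fixed ratio.

```python
def filter_peaks_with_control(peaks, chip_rpm, control_rpm,
                              min_fold=4.0, pseudo=0.1):
    """Keep peaks whose ChIP/control RPM ratio is at least min_fold.

    chip_rpm / control_rpm: dict mapping a peak identifier to its RPM signal.
    The pseudocount avoids division by zero where the control has no reads.
    """
    kept = []
    for peak in peaks:
        fold = (chip_rpm[peak] + pseudo) / (control_rpm[peak] + pseudo)
        if fold >= min_fold:
            kept.append(peak)
    return kept
```

A region that is equally high in the ChIP and in the control sample (spurious enrichment) is dropped, while a region that is high only in the ChIP sample is kept.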

Downstream analyses (ChIP-seq)

Genome Browser
[Figure: genome browser view of chr9 (hg38, 50 kb scale bar, around positions 137,300,000 to 137,400,000). RPM profiles, each on a 0 to 1 scale, for T47D gDNA, Input T0 (Roberto), T47D input, T47D PR T0, T47D PR T60, T47D T0 PR and T47D T30 PR at 1, 2, 5, 10 and 100 nM, together with DNA-seq peaks identified with MACS2 (without control), ChIP-seq peaks identified with MACS2 and annotation tracks (GENCODE v24, Pseudogenes, Segmental Dups, Simple Repeats, RepeatMasker, WM + SDust). Annotations on the figure mark the control samples, the true peaks and a false positive.]

Overlap of peaks genomic coordinates
http://bedtools.readthedocs.io/en/latest/content/tools/intersect.html

Replicate 1
Replicate 2
Measure overlap between ChIP-seq
replicate samples (expected to be
high) as a quality metric
Protein A
Treatment 1
Protein B
Protein A
Treatment 2
Interrogate overlap between proteins/conditions
(Venn diagrams with more than 3 groups cannot be
proportional and are harder to interpret)
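
The overlap between two peak sets can be sketched as follows. In practice this is what tools such as bedtools intersect are used for; the code below is a plain-Python illustration with assumed function names, half-open (chrom, start, end) intervals and a simple all-against-all scan.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share >= 1 bp."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def fraction_overlapping(peaks_a, peaks_b):
    """Fraction of peaks in peaks_a that overlap at least one peak in peaks_b."""
    if not peaks_a:
        return 0.0
    hits = sum(1 for a in peaks_a if any(overlaps(a, b) for b in peaks_b))
    return hits / len(peaks_a)
```

For replicates, `fraction_overlapping(rep1, rep2)` close to 1 would be the expected outcome; note that the measure is not symmetric, so it is worth computing it in both directions.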

Signal enrichment over regions
Gene expression
Gene promoter
ChIP-signal
…
Is there consistent ChIP-seq signal enrichment
over gene promoters?

[Figure: heatmaps and average profiles. Panel labels: Gene promoters, Protein C peaks, Random regions; Protein A, Protein B]
For each promoter (rows) the normalised protein A ChIP-seq
signal is shown for the promoter (center of the row) as well as for its
flanking region
The darker the color in the heatmap, the higher the
intensity of the ChIP-seq signal (i.e. number of reads)
Average profile: curve showing the average for all
rows (e.g. gene promoters)
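
The average profile is simply the column-wise mean of the heatmap matrix. A minimal sketch, assuming the signal has already been binned so that every row (region) has the same number of positions:

```python
def average_profile(matrix):
    """Column-wise mean of a signal matrix.

    matrix: list of rows, one per region (e.g. gene promoter); each row holds
    the normalised signal at successive positions around the region centre.
    """
    n_rows = len(matrix)
    return [sum(column) / n_rows for column in zip(*matrix)]

# Three toy promoters, five bins each, signal peaking at the centre bin.
signal = [
    [0, 2, 8, 2, 0],
    [0, 4, 10, 4, 0],
    [0, 0, 6, 0, 0],
]
profile = average_profile(signal)
```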

Genomic distribution of peaks
Percentage of peaks falling in each of the annotation categories
Percentage of peaks at a given distance from a transcription start site (TSS)
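
The distance-to-TSS summary can be sketched like this, for peaks and TSSs on a single chromosome. The function names and the distance bins are assumptions for illustration; the bins are exclusive, so a peak counted under "<=10000" lies between 1,001 and 10,000 bp from its nearest TSS.

```python
import bisect

def distance_to_nearest_tss(peak_center, tss_positions):
    """tss_positions: sorted, non-empty list of TSS coordinates."""
    i = bisect.bisect_left(tss_positions, peak_center)
    # Only the TSS just before and just after the peak can be the nearest.
    candidates = tss_positions[max(0, i - 1): i + 1]
    return min(abs(peak_center - t) for t in candidates)

def percent_by_distance(peak_centers, tss_positions,
                        bin_edges=(1_000, 10_000, 100_000)):
    """Percentage of peaks falling in each distance bin."""
    labels = [f"<={e}" for e in bin_edges] + [f">{bin_edges[-1]}"]
    counts = dict.fromkeys(labels, 0)
    for center in peak_centers:
        d = distance_to_nearest_tss(center, tss_positions)
        counts[labels[bisect.bisect_left(bin_edges, d)]] += 1
    return {label: 100 * n / len(peak_centers) for label, n in counts.items()}
```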

Motif discovery analysis
[Figure: peak region showing signal enrichment over the genome sequence, with the protein binding site inside it; the sequence TGTTCT recurs across the sequences of the peaks]
The fragment sequenced includes sequence beyond that of the actual binding
• Is there any motif/sequence over-represented in the sequences of the
peaks (relative to the genome)?
• Does such a motif correspond to any annotated motif (e.g. transcription
factors)?
• Motif discovery allows defining the binding site of the target protein as
well as the binding of secondary proteins
• Need to account for the fact that the peak may reflect a region broader
than the precise binding site
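
The first question above (is any sequence over-represented in the peaks relative to the genome?) can be illustrated by counting k-mers. This is a toy sketch, not what dedicated motif discovery software does: the function names, k = 6 and the pseudocount are assumptions, and only the forward strand is counted.

```python
from collections import Counter

def kmer_counts(sequences, k):
    """Count every k-mer occurring in a list of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def enriched_kmers(peak_seqs, background_seqs, k=6, pseudo=1.0):
    """Rank k-mers by their frequency in peaks relative to the background."""
    fg = kmer_counts(peak_seqs, k)
    bg = kmer_counts(background_seqs, k)
    fg_total, bg_total = sum(fg.values()), sum(bg.values())
    scores = {}
    for kmer, n in fg.items():
        fg_freq = (n + pseudo) / (fg_total + pseudo)
        bg_freq = (bg[kmer] + pseudo) / (bg_total + pseudo)
        scores[kmer] = fg_freq / bg_freq
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Because peaks are broader than the binding site, k-mers from flanking sequence are scored too; a genuine motif stands out only because it recurs across many peaks.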

1. Find consensus motif
The height of each letter is
proportional to its frequency in that
position within the motif
RC = reverse complement
2. Motif annotation
Search for known transcription
factors compatible with the
consensus motif
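
A minimal sketch of what lies behind the motif logo: the per-position letter frequencies of a set of aligned binding sites, the consensus read from them, and the reverse complement (RC). Function names and the toy sites are assumptions for illustration.

```python
def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))

def position_frequency_matrix(sites):
    """Per-position frequency of A, C, G, T in equal-length aligned sites."""
    matrix = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        matrix.append({b: column.count(b) / len(sites) for b in "ACGT"})
    return matrix

def consensus(matrix):
    """Most frequent letter at each position."""
    return "".join(max(column, key=column.get) for column in matrix)

sites = ["TGTTCT", "TGTTCT", "TGTACT", "AGTTCT"]  # toy aligned sites
pfm = position_frequency_matrix(sites)
```

Here `consensus(pfm)` gives TGTTCT and its reverse complement is AGAACA; a motif found on one strand is the same binding site as its RC on the other strand.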

RNA-seq
- DNA: gene with exons (Exon1, Exon2, …, ExonN)
- Transcription et al* produces the mRNA with its poly-A tail
  *splicing plus addition of poly-A tail
- Poly-A selection + cDNA synthesis: from mRNA to cDNA
- Technical sequences (e.g. adapters) flank the cDNA
- Paired end
This RNA-seq experiment targets messenger RNA (mRNA), as this is
one of the most common applications; however, other
applications exist (e.g. total RNA, ribosomal RNA)

Alignment
Genome sequence
Gene
Exon1
…
Exon2
ExonN
STAR
TopHat
GEM
…
Read-by-read sequence alignment to genome sequence with the goal of
identifying the genomic location from which the RNA originated
Some RNA-seq reads will originate from different exons and thus map to
non-contiguous genomic positions; RNA-seq aligners need to be aware
of this and split reads accordingly (red region) during the alignment
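
To see why a read can end up on non-contiguous genomic positions, the sketch below goes in the opposite direction to an aligner: given where a read starts within the spliced transcript, it computes the genomic blocks the read comes from. It is not how STAR, TopHat or GEM work; the function name, forward-strand genes and half-open coordinates are assumptions.

```python
def split_read_over_exons(exons, transcript_start, read_length):
    """Map a read given in transcript coordinates onto genomic blocks.

    exons            : sorted list of (genomic_start, genomic_end), half-open
    transcript_start : 0-based offset of the read in the spliced transcript
    """
    blocks = []
    offset, remaining = transcript_start, read_length
    for start, end in exons:
        exon_length = end - start
        if offset >= exon_length:
            # The read starts downstream of this exon: skip it.
            offset -= exon_length
            continue
        block_start = start + offset
        block_end = min(end, block_start + remaining)
        blocks.append((block_start, block_end))
        remaining -= block_end - block_start
        offset = 0
        if remaining == 0:
            break
    return blocks
```

With exons at (100, 200) and (500, 600), a 50 bp read starting at transcript offset 80 yields two blocks, (180, 200) and (500, 530), separated by the intron; a read starting at offset 10 stays within the first exon.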

Genome sequence
Gene
Exon1
…
Exon2
ExonN
STAR
bam2wig
BEDtools
SAMtools
Deeptools
…
Normalise by the number of millions of reads
to make different experiments comparable

Expression quantification
• The number of reads is proportional to the level of expression (i.e. more RNA, more
reads)
• Expression can be quantified at either the gene or the transcript level
(Kallisto, HTSeq, featureCounts)
• There are several units to measure expression:
  • read counts per gene/transcript: not normalised, i.e. they do not account for the
  sample library size (the number of reads sequenced) or the gene/transcript
  length
  • Reads Per Kilobase of transcript per Million mapped reads (RPKM): accounts
  for library size and locus length
  • Transcripts Per Million (TPM): accounts for library size and locus length
• TPM is becoming more popular than RPKM, and some argue that RPKM values are
inconsistent across samples (http://blog.nextgenetics.net/?e=51)
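
The two normalised units can be written out in a few lines. This is a sketch with toy numbers; the function names are assumptions, and the library size is taken here as the sum of the counts passed in.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    millions_of_reads = sum(counts) / 1_000_000
    return [c / (length / 1_000) / millions_of_reads
            for c, length in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalise first, then rescale."""
    rates = [c / (length / 1_000) for c, length in zip(counts, lengths_bp)]
    scale = sum(rates) / 1_000_000
    return [rate / scale for rate in rates]

counts = [100, 200, 700]       # toy read counts for three genes
lengths = [1000, 2000, 1000]   # gene lengths in bp
print(rpkm(counts, lengths))
print(tpm(counts, lengths))
```

The order of the two normalisation steps is the difference: TPM values add up to one million in every sample, whereas the sum of RPKM values varies from sample to sample, which is the usual argument for TPM being easier to compare across samples.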

Genome Browser
[Figure: genome browser view of the FKBP5 locus on chr6 (hg38, 50 kb scale bar, around positions 35,590,000 to 35,680,000; SNORA40 and MIR5690 also annotated). RPM profiles for T47D gDNA and for the RNA-seq samples T47D T0 (fd_004_02_01_rnaseq) and T47D R6 (fd_006_03_01_rnaseq), labelled Untreated and Treated, with the Basic Gene Annotation Set from GENCODE Version 24 (Ensembl 83).]
Indeed, looking for genes that show differential expression between two conditions (e.g. treated vs
untreated) is likely the most common application of RNA-seq
This is not done by visual inspection in the browser but with dedicated software (sleuth,
DESeq2, edgeR); these tools account for biological/technical variation between replicates and assign a
significance value to the differential expression

Differential expression
analysis
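
As a rough picture of the quantity being tested, the sketch below computes a log2 fold change per gene from replicate expression values. It only gives an effect size: unlike sleuth, DESeq2 or edgeR, it does not model the variation between replicates and assigns no significance value. The function name, the pseudocount, the gene names and the values are all made up for illustration.

```python
import math
from statistics import mean

def log2_fold_changes(treated, untreated, pseudo=1.0):
    """log2(treated / untreated) of the mean expression per gene.

    treated / untreated: dict mapping gene -> list of normalised expression
    values, one per replicate. The pseudocount keeps the ratio defined when
    a gene has zero expression in one condition.
    """
    changes = {}
    for gene in treated:
        t = mean(treated[gene]) + pseudo
        u = mean(untreated[gene]) + pseudo
        changes[gene] = math.log2(t / u)
    return changes

# Hypothetical genes and values: geneA goes up on treatment, geneB does not.
treated = {"geneA": [31.0, 31.0], "geneB": [7.0, 7.0]}
untreated = {"geneA": [3.0, 3.0], "geneB": [7.0, 7.0]}
changes = log2_fold_changes(treated, untreated)
```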

ChIP-seq
Core analysis
- Trimming: Trimmomatic
- Alignment: BWA, Bowtie, GEM
- Read counts profiles: bam2wig, BEDtools, SAMtools, Deeptools
- Peak calling: MACS2, Zerone
Downstream analyses
- Genome Browser
- Overlaps of peaks genomic coordinates
- Signal enrichment over regions
- Genomic distribution of peaks
- Motif discovery analysis
- …

RNA-seq
Core analysis
- Trimming: Trimmomatic
- Alignment: STAR, TopHat, GEM
- Read counts profiles: STAR, bam2wig, BEDtools, SAMtools, Deeptools
- Expression quantification: Kallisto, featureCounts, HTSeq
Downstream analyses
- Genome Browser
- Differential expression analysis
- …
