This presentation is available under the Creative Commons
Attribution-ShareAlike 3.0 Unported License. Please refer to
http://www.bits.vib.be/ if you use this presentation or parts
hereof.
RNA-seq for DE analysis training
Detecting differentially
expressed genes
Joachim Jacob
22 and 24 April 2014
2 of 44
Bioinformatics analysis will take most of your time
Quality control (QC) of raw reads
Preprocessing: filtering of reads
and read parts, to help our goal
of differential detection.
QC of preprocessing Mapping to a reference genome
(alternative: to a transcriptome)
QC of the mapping
Count table extraction
QC of the count table
DE test
Biological insight
1
2
3
5
4
6
3 of 44
Goal: get me some DE genes!
Based on a raw count table, we want to detect
differentially expressed genes between conditions
of interest.
We will assign to each gene a p-value (0-1), which
shows us 'how surprised we should be' to see this
difference, when we assume there is no difference.
0 1
Very big chance there is a difference
p-value
Very small chance there is a real difference
4 of 44
Goal
Every single decision we have taken in
previous analysis steps was done to
improve this outcome of detecting DE
expressed genes.
5 of 44
Raw counts to DE genes
6 of 44
DE detection tools from count tables
http://wiki.bits.vib.be/index.php/RNAseq_toolbox#Detecting_differential_expression_by_count_analysis
7 of 44
Algorithms under active development
http://genomebiology.com/2010/11/10/r106
8 of 44
Intuition: how to detect DE?
gene_id CAF0006876
sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8
23171 22903 29227 24072 23151 26336 25252 24122
Sample9 sample10 sample11 sample12 sample13 sample14 sample15 sample16
19527 26898 18880 24237 26640 22315 20952 25629
Variability X
Variability Y
Compare and conclude given a
mean (or 'base') level: similar
or not?
Condition A
Condition B
9 of 44
Intuition
gene_id CAF0006876
sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8
23171 22903 29227 24072 23151 26336 25252 24122
Sample9 sample10 sample11 sample12 sample13 sample14 sample15 sample16
19527 26898 18880 24237 26640 22315 20952 25629
Condition A
Condition B
10 of 44
Intuition – model is fitted
gene_id CAF0006876
sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8
23171 22903 29227 24072 23151 26336 25252 24122
Sample9 sample10 sample11 sample12 sample13 sample14 sample15 sample16
19527 26898 18880 24237 26640 22315 20952 25629
Condition A
Condition B
NB model is estimated
11 of 44
Intuition – difference is quantified
gene_id CAF0006876
sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8
23171 22903 29227 24072 23151 26336 25252 24122
Sample9 sample10 sample11 sample12 sample13 sample14 sample15 sample16
19527 26898 18880 24237 26640 22315 20952 25629
Condition A
Condition B
NB model is estimated:
2 parameters: mean and
dispersion needed.
Difference is put into p-value
12 of 44
But counts are dependent on
The read counts of a gene between different
conditions, is dependent on (see first part):
1. Chance (NB model)
2. Expression level
3. Library size (number of reads in that library)
4. Length of transcript
5. GC content of the genes
13 of 44
Normalize for library size
Assumption: most genes are not DE between
samples. DESeq calculates for every sample the
'effective library size' by a scale factor.
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3426807/
Rest of the
genes
Rest of the
genes
100%
100%
14 of 44
Normalize for library size
DESeq computes a scaling factor for a given sample by computing the median of the ratio, for each gene, of its
read count over its geometric mean across all samples. It then uses the assumption that most genes are not DE
and uses this median of ratios to obtain the scaling factor associated with this sample.
Original library size * scale factor = effective library size
DESeq will multiply original counts by the sample
scaling factor.
DESeq: This normalization method [14] is included in the DESeq Bioconductor package (ver-
sion 1.6.0) [14] and is based on the hypothesis that most genes are not DE. A DESeq scaling
factor for a given lane is computed as the median of the ratio, for each gene, of its read count
over its geometric mean across all lanes. The underlying idea is that non-DE genes should have
similar read counts across samples, leading to a ratio of 1. Assuming most genes are not DE, the
median of this ratio for the lane provides an estimate of the correction factor that should be
applied to all read counts of this lane to fulfill the hypothesis
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3426807/
15 of 44
Normalize for library size
Geom Mean
24595
…
…
...
…
…
1,1 1,2 0,9 1,0 1,2 0,8 0,85 1,0
0,9 1,0 1,2 0,8 0,85 1,0 1,1 1,2
1,1 1,2 0,9 1,0 1,2 0,8 0,85 1,0
1,0 1,2 0,8 0,85 1,0 1,2 1,1 0,8
1,1 1,0 1,2 0,8 0,85 1,0 0,85 1,0
0,8 0,85 1,01,1 1,2 0,9 1,0 1,2 1,2
… … … … … … … …
1,2 0,8 1,1 0,9 0,85 1,1 1,1 1,0
Divide each
count by
geomean
Take median
per column
16 of 44
Other normalisations
● EdgeR: TMM, trimmed mean of M-values
In the end: the algorithms conduct internally the
normalization, and just continue.
17 of 44
Dispersion estimation
● For every gene, an NB is fitted based on
the counts. The most important factor in
that model to be estimated is the
dispersion.
● DESeq2 applies three steps
● Estimates dispersion parameter for each gene
● Plots and fits a curve
● Adjusts the dispersion parameter towards the
curve ('shrinking')
18 of 44
Dispersion estimation
1. Black dots: estimated
from normalized data.
2. Red line: curve fitted
3. blue dots: final assigned
dispersion parameter for
that gene
Model is fit!
19 of 44
Test is run between conditions
If 2 conditions are
compared, for each
gene 2 NB models (one
for each condition) are
made, and a test (Wald
test) decides whether
the difference is
significant (red in plot).
Significant (p-value < 0,01)
Not significant
MA-plot: mean of counts
versus the log2 fold change
between 2 conditions.
20 of 44
Test is run between conditions
If 2 conditions are
compared, for each
gene 2 NB models (one
for each condition) are
made, and a test (Wald
test) decides whether
the difference is
significant (red in plot).
This means that we are
going to perform 1000's
of tests.
If we set a cut-off on the p-value
of 0,01 and we have performed
20000 tests (= genes), 200 genes that
do not differ will turn up
significant only by chance.
21 of 44
Check the distribution of p-values
An enrichment
(smaller or
Bigger) should
be seen at low
P-values.
Other p-values should not
show a trend.
The histogram of the p-values must
look like the one below. If not, the
test is not reliable. Perhaps the NB
fitting step did not succeed, or
confounding variables are present.
22 of 44
Confounded distribution of p-values
23 of 44
Improve test results
A fraction is
false positive
You set a cut-off of 0,05.
A fraction is
correctly identified
as DE
24 of 44
Improve test results
We can improve testing by 2 measures:
● avoid testing: apply a filtering before testing, an
independent filtering.
● apply a multiple testing correction
25 of 44
Avoid testing by independent filtering
Some scientists just
remove genes with
mean counts in the
samples <10. But there
is a more formal
method to remove
genes, in order to
reduce the testing.
http://www.bioconductor.org/help/course-materials/2012/Bressanone2012/
From this collection, read 2012-07-04-Huber-Multiple-testing-independent-filtering.pdf
26 of 44
Avoid testing by independent filtering
Left: a scatter plot of
mean counts versus
transformed p-values.
The red line depicts a
cut-off of 0,1. Note that
genes with lower counts
do not reach the p-value
threshold. Some of them
are save to exclude from
testing.
27 of 44
Avoid testing by independent filtering
If we filter out increasingly bigger portions of genes based on their
mean counts, the number of significant genes increase.
28 of 44
Avoid testing by independent filtering
See later (slide 30)
Choose the variable of interest.
You can run it once on all to check the outcome.
http://www.bioconductor.org/help/course-materials/2012/Bressanone2012/
From this collection, read 2012-07-04-Huber-Multiple-testing-independent-filtering.pdf
29 of 44
Avoid testing by independent filtering
30 of 44
Packages for independent filtering
HTSFilter is a package
especially developed for
independent filtering in a
non-arbitrary way.
In our Galaxy, during the
exercises, you will be
using another approach.
http://www.bioconductor.org/packages/release/bioc/html/HTSFilter.html
31 of 44
Multiple testing correction
Automatically performed and reported in results:
Benjamini/Hochberg correction, to control false
discovery rate (FDR).
FDR is the fraction of false positives in the genes that
are classified as DE.
If we set a threshold α of 0,05, 20% of the genes will
be false positives. If we apply FDR correction of 0.05,
5% of the genes in the final list will be false positives.
32 of 44
Including influencing factors
Through a generalized linear model (GLM), the
influencing factors are modeled to predict the counts.
The factors come from the sample descriptions file.
Yeast (=WT)
GDA (=G)
Yeast mutant
(=UPC)
GDA + vit C (=AG)
Additional metadata (batch
factor)
Day 1 Day 1Day 2 Day 2
33 of 44
DESeq2 to detect DE genes
We provide a combination of factors
(the model, GLM) which influence the
counts. Every factor should match the
column name in the sample
descriptions
The levels of the factors corresponding
To the 'base' or 'no perturbation'.
The fraction filtered out, determined
by the independent filter tool.
Adjusted p-value cut-off
34 of 44
The output of DESeq2
The 'detect differential
expression' tool gives you four
results: the first is the report
including graphs.
Only lower than
cut-off and with
indep filtering.
All genes, with indep
filtering applied.
Complete DESeq results,
without indep filtering
applied.
35 of 44
Effect of variance on DE detection
Log2(FC) Log2(FC)
StandardError(SE)ofLogFC
StandardError(SE)ofLogFC
All genes, with their logFCOnly the DE genes
36 of 44
Volcano plot is often asymmetric
Volcano plot: shows
the DE genes with
our given cut-off.
-0.3 0.3
-log10(pvalue)
log10(FC)
37 of 44
Comparing different conditions
Yeast (=WT)
GDA (=G)
Yeast mutant
(=UPC)
GDA + vit C (=AG)
Day 1 Day 1Day 2 Day 2
Which genes are DE between UPC and WT?
Which genes are DE between G and AG?
Which genes are DE in WT between G and AG?
38 of 44
Comparing different conditions
Adjust the sample descriptions file and the model:
Remove these
Remove these
1. Which genes are DE between UPC and WT?
2. Which genes are DE between G and AG?
3. Which genes are DE in WT between G and AG?
1. 2. 3.
39 of 44
Congratulations!
We have reached our goal!
40 of 44
Overview
http://www.nature.com/nprot/journal/v8/n9/full/nprot.2013.099.html
41 of 44
Reads
42 of 44
Keywords
Effective library size
dispersion
shrinking
Significantly differentially expressed
MA-plot
Alpha cut-off
Independent filtering
FDR
p-value
Write in your own words what the terms mean
43 of 44
Exercises
● →
Detecting differential expression from a
count table
44 of 44
Break

Part 5 of RNA-seq for DE analysis: Detecting differential expression

  • 1.
    This presentation isavailable under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to http://www.bits.vib.be/ if you use this presentation or parts hereof. RNA-seq for DE analysis training Detecting differentially expressed genes Joachim Jacob 22 and 24 April 2014
  • 2.
    2 of 44 Bioinformaticsanalysis will take most of your time Quality control (QC) of raw reads Preprocessing: filtering of reads and read parts, to help our goal of differential detection. QC of preprocessing Mapping to a reference genome (alternative: to a transcriptome) QC of the mapping Count table extraction QC of the count table DE test Biological insight 1 2 3 5 4 6
  • 3.
    3 of 44 Goal:get me some DE genes! Based on a raw count table, we want to detect differentially expressed genes between conditions of interest. We will assign to each gene a p-value (0-1), which shows us 'how surprised we should be' to see this difference, when we assume there is no difference. 0 1 Very big chance there is a difference p-value Very small chance there is a real difference
  • 4.
    4 of 44 Goal Everysingle decision we have taken in previous analysis steps was done to improve this outcome of detecting DE expressed genes.
  • 5.
    5 of 44 Rawcounts to DE genes
  • 6.
    6 of 44 DEdetection tools from count tables http://wiki.bits.vib.be/index.php/RNAseq_toolbox#Detecting_differential_expression_by_count_analysis
  • 7.
    7 of 44 Algorithmsunder active development http://genomebiology.com/2010/11/10/r106
  • 8.
    8 of 44 Intuition:how to detect DE? gene_id CAF0006876 sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 23171 22903 29227 24072 23151 26336 25252 24122 Sample9 sample10 sample11 sample12 sample13 sample14 sample15 sample16 19527 26898 18880 24237 26640 22315 20952 25629 Variability X Variability Y Compare and conclude given a mean (or 'base') level: similar or not? Condition A Condition B
  • 9.
    9 of 44 Intuition gene_idCAF0006876 sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 23171 22903 29227 24072 23151 26336 25252 24122 Sample9 sample10 sample11 sample12 sample13 sample14 sample15 sample16 19527 26898 18880 24237 26640 22315 20952 25629 Condition A Condition B
  • 10.
    10 of 44 Intuition– model is fitted gene_id CAF0006876 sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 23171 22903 29227 24072 23151 26336 25252 24122 Sample9 sample10 sample11 sample12 sample13 sample14 sample15 sample16 19527 26898 18880 24237 26640 22315 20952 25629 Condition A Condition B NB model is estimated
  • 11.
    11 of 44 Intuition– difference is quantified gene_id CAF0006876 sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 23171 22903 29227 24072 23151 26336 25252 24122 Sample9 sample10 sample11 sample12 sample13 sample14 sample15 sample16 19527 26898 18880 24237 26640 22315 20952 25629 Condition A Condition B NB model is estimated: 2 parameters: mean and dispersion needed. Difference is put into p-value
  • 12.
    12 of 44 Butcounts are dependent on The read counts of a gene between different conditions, is dependent on (see first part): 1. Chance (NB model) 2. Expression level 3. Library size (number of reads in that library) 4. Length of transcript 5. GC content of the genes
  • 13.
    13 of 44 Normalizefor library size Assumption: most genes are not DE between samples. DESeq calculates for every sample the 'effective library size' by a scale factor. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3426807/ Rest of the genes Rest of the genes 100% 100%
  • 14.
    14 of 44 Normalizefor library size DESeq computes a scaling factor for a given sample by computing the median of the ratio, for each gene, of its read count over its geometric mean across all samples. It then uses the assumption that most genes are not DE and uses this median of ratios to obtain the scaling factor associated with this sample. Original library size * scale factor = effective library size DESeq will multiply original counts by the sample scaling factor. DESeq: This normalization method [14] is included in the DESeq Bioconductor package (ver- sion 1.6.0) [14] and is based on the hypothesis that most genes are not DE. A DESeq scaling factor for a given lane is computed as the median of the ratio, for each gene, of its read count over its geometric mean across all lanes. The underlying idea is that non-DE genes should have similar read counts across samples, leading to a ratio of 1. Assuming most genes are not DE, the median of this ratio for the lane provides an estimate of the correction factor that should be applied to all read counts of this lane to fulfill the hypothesis http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3426807/
  • 15.
    15 of 44 Normalizefor library size Geom Mean 24595 … … ... … … 1,1 1,2 0,9 1,0 1,2 0,8 0,85 1,0 0,9 1,0 1,2 0,8 0,85 1,0 1,1 1,2 1,1 1,2 0,9 1,0 1,2 0,8 0,85 1,0 1,0 1,2 0,8 0,85 1,0 1,2 1,1 0,8 1,1 1,0 1,2 0,8 0,85 1,0 0,85 1,0 0,8 0,85 1,01,1 1,2 0,9 1,0 1,2 1,2 … … … … … … … … 1,2 0,8 1,1 0,9 0,85 1,1 1,1 1,0 Divide each count by geomean Take median per column
  • 16.
    16 of 44 Othernormalisations ● EdgeR: TMM, trimmed mean of M-values In the end: the algorithms conduct internally the normalization, and just continue.
  • 17.
    17 of 44 Dispersionestimation ● For every gene, an NB is fitted based on the counts. The most important factor in that model to be estimated is the dispersion. ● DESeq2 applies three steps ● Estimates dispersion parameter for each gene ● Plots and fits a curve ● Adjusts the dispersion parameter towards the curve ('shrinking')
  • 18.
    18 of 44 Dispersionestimation 1. Black dots: estimated from normalized data. 2. Red line: curve fitted 3. blue dots: final assigned dispersion parameter for that gene Model is fit!
  • 19.
    19 of 44 Testis run between conditions If 2 conditions are compared, for each gene 2 NB models (one for each condition) are made, and a test (Wald test) decides whether the difference is significant (red in plot). Significant (p-value < 0,01) Not significant MA-plot: mean of counts versus the log2 fold change between 2 conditions.
  • 20.
    20 of 44 Testis run between conditions If 2 conditions are compared, for each gene 2 NB models (one for each condition) are made, and a test (Wald test) decides whether the difference is significant (red in plot). This means that we are going to perform 1000's of tests. If we set a cut-off on the p-value of 0,01 and we have performed 20000 tests (= genes), 200 genes that do not differ will turn up significant only by chance.
  • 21.
    21 of 44 Checkthe distribution of p-values An enrichment (smaller or Bigger) should be seen at low P-values. Other p-values should not show a trend. The histogram of the p-values must look like the one below. If not, the test is not reliable. Perhaps the NB fitting step did not succeed, or confounding variables are present.
  • 22.
    22 of 44 Confoundeddistribution of p-values
  • 23.
    23 of 44 Improvetest results A fraction is false positive You set a cut-off of 0,05. A fraction is correctly identified as DE
  • 24.
    24 of 44 Improvetest results We can improve testing by 2 measures: ● avoid testing: apply a filtering before testing, an independent filtering. ● apply a multiple testing correction
  • 25.
    25 of 44 Avoidtesting by independent filtering Some scientists just remove genes with mean counts in the samples <10. But there is a more formal method to remove genes, in order to reduce the testing. http://www.bioconductor.org/help/course-materials/2012/Bressanone2012/ From this collection, read 2012-07-04-Huber-Multiple-testing-independent-filtering.pdf
  • 26.
    26 of 44 Avoidtesting by independent filtering Left: a scatter plot of mean counts versus transformed p-values. The red line depicts a cut-off of 0,1. Note that genes with lower counts do not reach the p-value threshold. Some of them are save to exclude from testing.
  • 27.
    27 of 44 Avoidtesting by independent filtering If we filter out increasingly bigger portions of genes based on their mean counts, the number of significant genes increase.
  • 28.
    28 of 44 Avoidtesting by independent filtering See later (slide 30) Choose the variable of interest. You can run it once on all to check the outcome. http://www.bioconductor.org/help/course-materials/2012/Bressanone2012/ From this collection, read 2012-07-04-Huber-Multiple-testing-independent-filtering.pdf
  • 29.
    29 of 44 Avoidtesting by independent filtering
  • 30.
    30 of 44 Packagesfor independent filtering HTSFilter is a package especially developed for independent filtering in a non-arbitrary way. In our Galaxy, during the exercises, you will be using another approach. http://www.bioconductor.org/packages/release/bioc/html/HTSFilter.html
  • 31.
    31 of 44 Multipletesting correction Automatically performed and reported in results: Benjamini/Hochberg correction, to control false discovery rate (FDR). FDR is the fraction of false positives in the genes that are classified as DE. If we set a threshold α of 0,05, 20% of the genes will be false positives. If we apply FDR correction of 0.05, 5% of the genes in the final list will be false positives.
  • 32.
    32 of 44 Includinginfluencing factors Through a generalized linear model (GLM), the influencing factors are modeled to predict the counts. The factors come from the sample descriptions file. Yeast (=WT) GDA (=G) Yeast mutant (=UPC) GDA + vit C (=AG) Additional metadata (batch factor) Day 1 Day 1Day 2 Day 2
  • 33.
    33 of 44 DESeq2to detect DE genes We provide a combination of factors (the model, GLM) which influence the counts. Every factor should match the column name in the sample descriptions The levels of the factors corresponding To the 'base' or 'no perturbation'. The fraction filtered out, determined by the independent filter tool. Adjusted p-value cut-off
  • 34.
    34 of 44 Theoutput of DESeq2 The 'detect differential expression' tool gives you four results: the first is the report including graphs. Only lower than cut-off and with indep filtering. All genes, with indep filtering applied. Complete DESeq results, without indep filtering applied.
  • 35.
    35 of 44 Effectof variance on DE detection Log2(FC) Log2(FC) StandardError(SE)ofLogFC StandardError(SE)ofLogFC All genes, with their logFCOnly the DE genes
  • 36.
    36 of 44 Volcanoplot is often asymmetric Volcano plot: shows the DE genes with our given cut-off. -0.3 0.3 -log10(pvalue) log10(FC)
  • 37.
    37 of 44 Comparingdifferent conditions Yeast (=WT) GDA (=G) Yeast mutant (=UPC) GDA + vit C (=AG) Day 1 Day 1Day 2 Day 2 Which genes are DE between UPC and WT? Which genes are DE between G and AG? Which genes are DE in WT between G and AG?
  • 38.
    38 of 44 Comparingdifferent conditions Adjust the sample descriptions file and the model: Remove these Remove these 1. Which genes are DE between UPC and WT? 2. Which genes are DE between G and AG? 3. Which genes are DE in WT between G and AG? 1. 2. 3.
  • 39.
    39 of 44 Congratulations! Wehave reached our goal!
  • 40.
  • 41.
  • 42.
    42 of 44 Keywords Effectivelibrary size dispersion shrinking Significantly differentially expressed MA-plot Alpha cut-off Independent filtering FDR p-value Write in your own words what the terms mean
  • 43.
    43 of 44 Exercises ●→ Detecting differential expression from a count table
  • 44.