RNA-seq for DE analysis: detecting differential expression - part 5

Detecting differentially
expressed genes
RNA-seq for DE analysis training
Joachim Jacob
20 and 27 January 2014

This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to
http://www.bits.vib.be/ if you use this presentation or parts hereof.

Goal
Based on a count table, we want to detect
differentially expressed genes between conditions
of interest.
We will assign to each gene a p-value (0-1), which
shows us 'how surprised we should be' to see this
difference, when we assume there is no difference.

p-value scale, from 0 to 1:
Close to 0: strong evidence that there is a real difference.
Close to 1: little evidence that there is a real difference.

Goal
Every decision we have taken in the previous
analysis steps was made to improve this
outcome: detecting DE genes.

Algorithms under active development

http://wiki.bits.vib.be/index.php/RNAseq_toolbox#Detecting_differential_expression_by_count_analysis

http://genomebiology.com/2010/11/10/r106

Intuition
gene_id: CAF0006876

Condition A
sample1  sample2  sample3  sample4  sample5  sample6  sample7  sample8
23171    22903    29227    24072    23151    26336    25252    24122

Condition B
sample9  sample10 sample11 sample12 sample13 sample14 sample15 sample16
19527    26898    18880    24237    26640    22315    20952    25629

Condition A shows variability X, condition B shows variability Y.

Compare and conclude, given this variability:
is the mean level similar or not?
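As a small illustration of this intuition, the sketch below computes the mean and the sample-to-sample spread per condition for the gene shown above. It uses the raw counts exactly as listed, without normalization, which is a simplification made only for this example.

```python
from statistics import mean, stdev

# Counts for gene CAF0006876, copied from the slide.
condition_a = [23171, 22903, 29227, 24072, 23151, 26336, 25252, 24122]
condition_b = [19527, 26898, 18880, 24237, 26640, 22315, 20952, 25629]

mean_a, mean_b = mean(condition_a), mean(condition_b)
# Sample standard deviation: the "variability" within each condition.
sd_a, sd_b = stdev(condition_a), stdev(condition_b)

print(f"Condition A: mean {mean_a:.2f}, sd {sd_a:.0f}")
print(f"Condition B: mean {mean_b:.2f}, sd {sd_b:.0f}")
print(f"Difference in means: {mean_a - mean_b:.2f}")
```

For this gene the difference between the means is smaller than the standard deviation within either condition, which is why the variability has to be taken into account before concluding anything.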

Intuition – model is fitted

An NB (negative binomial) model is estimated.
It needs 2 parameters: the mean and the
dispersion.
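To make the two parameters concrete, here is a rough sketch that estimates them for condition A of the gene above with a simple method-of-moments formula, using the NB relation variance = mean + dispersion x mean^2. This is not the estimator DESeq uses, and it works on the raw, unnormalized counts; both are assumptions made for illustration.

```python
from statistics import mean, variance

def moment_dispersion(counts):
    """Method-of-moments NB estimate from Var = mu + alpha * mu**2."""
    mu = mean(counts)
    var = variance(counts)  # sample variance
    # Clamp at zero: a variance below the mean would give a negative value.
    alpha = max((var - mu) / mu**2, 0.0)
    return mu, alpha

# Condition A counts for gene CAF0006876, copied from the slide.
condition_a = [23171, 22903, 29227, 24072, 23151, 26336, 25252, 24122]
mu_a, alpha_a = moment_dispersion(condition_a)

print(f"mean = {mu_a:.2f}, dispersion = {alpha_a:.4f}")
```

The square root of the dispersion can be read as the coefficient of variation that remains once the counting noise is removed.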

Intuition – difference is quantified

An NB (negative binomial) model is estimated.
It needs 2 parameters: the mean and the
dispersion.
The difference between the conditions is then
expressed as a p-value.

BUT counts depend on several factors
The read count of a gene in different
conditions depends on (see first part):
1. Chance (NB model)
2. Expression level
3. Library size (number of reads in that library)
4. Length of the transcript
5. GC content of the gene

Normalize for library size
Assumption: most genes are not DE between
samples. DESeq calculates for every sample the
'effective library size' using a scale factor.

[Figure: two libraries, each totalling 100% of the reads, with the share taken by the 'rest of the genes' marked.]

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3426807/

Original library size * scale factor = effective library size
DESeq divides the original counts by the sample's
scaling factor (its 'size factor') to obtain normalized counts.
DESeq: This normalization method [14] is included in the DESeq Bioconductor package (version 1.6.0) [14] and is based on the hypothesis that most genes are not DE. A DESeq scaling factor for a given lane is computed as the median of the ratio, for each gene, of its read count over its geometric mean across all lanes. The underlying idea is that non-DE genes should have similar read counts across samples, leading to a ratio of 1. Assuming most genes are not DE, the median of this ratio for the lane provides an estimate of the correction factor that should be applied to all read counts of this lane to fulfill the hypothesis.
DESeq computes a scaling factor for a given sample by computing the median of the ratio, for each gene, of its
read count over its geometric mean across all samples. It then uses the assumption that most genes are not DE
and uses this median of ratios to obtain the scaling factor associated with this sample.

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3426807/

[Figure: worked example. A count table with a 'Geom Mean' column (24595 for the first gene shown), followed by a table of ratios per sample, with values roughly between 0,8 and 1,2.]

1. Divide each count by the geometric mean of its gene.
2. Take the median per column (sample): this is the scaling factor of that sample.
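The two steps (divide each count by the geometric mean of its gene, then take the median per sample) can be sketched as follows. The function name and the toy count table are made up for this example, and skipping genes with a zero count is an assumption of the sketch.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios scaling factors.

    counts: list of gene rows, each row holding one count per sample.
    """
    n_samples = len(counts[0])
    ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if min(row) <= 0:
            continue  # geometric mean would be zero; skip such genes
        geo_mean = math.exp(sum(math.log(c) for c in row) / n_samples)
        for j, c in enumerate(row):
            ratios[j].append(c / geo_mean)
    # One scaling factor per sample: the median ratio in that column.
    return [median(column) for column in ratios]

# Toy table: sample 2 was sequenced twice as deep; the last gene is truly DE.
toy_counts = [[100, 200], [50, 100], [30, 60], [10, 20], [5, 40]]
factors = size_factors(toy_counts)

# Normalized counts: each count divided by the factor of its sample.
normalized = [[c / f for c, f in zip(row, factors)] for row in toy_counts]
print(factors)
```

Because the median is used, the single DE gene in the toy table does not influence the factors, and the four non-DE genes end up with equal normalized counts in both samples.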

Other normalisations
● edgeR: TMM, trimmed mean of M-values

Dispersion estimation
● For every gene, an NB is fitted based on the counts. The most important factor in that model to be estimated is the dispersion.
● DESeq2 applies three steps:
  1. Estimates the dispersion parameter for each gene
  2. Plots and fits a curve
  3. Adjusts the dispersion parameter towards the curve ('shrinking')

Dispersion estimation
1. Black dots: dispersion estimated from the normalized data.
2. Red line: the fitted curve.
3. Blue dots: the final dispersion parameter assigned to that gene.

The model is fitted!

Test is run between conditions
If 2 conditions are compared, for each gene 2 NB models (one for each condition) are made, and a test (Wald test) decides whether the difference is significant (red in plot).

MA-plot: mean of counts versus the log2 fold change between 2 conditions.

Legend: significant (p-value < 0,01) / not significant
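As a naive illustration of the two axes of an MA-plot, the sketch below computes one point for the example gene CAF0006876. It works on raw counts and takes condition B relative to condition A; both choices are assumptions of this sketch, and the tool itself derives the fold change from the fitted model on normalized data.

```python
import math
from statistics import mean

# Counts for gene CAF0006876, copied from the slide.
condition_a = [23171, 22903, 29227, 24072, 23151, 26336, 25252, 24122]
condition_b = [19527, 26898, 18880, 24237, 26640, 22315, 20952, 25629]

# x-axis: mean of the counts over all samples.
mean_count = mean(condition_a + condition_b)
# y-axis: log2 fold change, here condition B relative to condition A.
log2_fold_change = math.log2(mean(condition_b) / mean(condition_a))

print(f"mean count = {mean_count:.0f}, log2FC = {log2_fold_change:.3f}")
```

A log2 fold change of 0 means no change, +1 a doubling and -1 a halving.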

Test is run between conditions

This means that we are going to perform thousands of tests.

If we set a cut-off on the p-value of 0,01 and we have performed 20000 tests (= genes), about 200 genes (20000 x 0,01) will appear significant by chance, even if no gene is truly DE.
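The expected number of chance findings is simply the number of tests times the cut-off. The sketch below checks this with simulated p-values, assuming that no gene is truly DE, in which case the p-values are uniformly distributed between 0 and 1. The seed and variable names are arbitrary.

```python
import random

n_genes = 20000
cutoff = 0.01

# Expected number of genes below the cut-off when nothing is DE.
expected_false_positives = n_genes * cutoff

# Simulation: under the null hypothesis, p-values are uniform on [0, 1).
rng = random.Random(1)
null_pvalues = [rng.random() for _ in range(n_genes)]
observed_false_positives = sum(p < cutoff for p in null_pvalues)

print(f"expected: {expected_false_positives:.0f}")
print(f"simulated: {observed_false_positives}")
```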

Check the distribution of p-values
In the histogram of the p-values, an enrichment (smaller or bigger) should be seen at low p-values. The other p-values should not show a trend.

If the histogram of the p-values does not match this profile, the test is not reliable. Perhaps the NB fitting step did not succeed, or confounding variables are present.
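A text version of such a histogram can be sketched as below. The p-values are simulated: 9000 genes without a difference (uniform p-values) and 1000 genes whose p-values are pushed towards 0. The proportions and the way the small p-values are generated are arbitrary choices for this example only.

```python
import random

def pvalue_histogram(pvalues, n_bins=20):
    """Count p-values per bin of equal width between 0 and 1."""
    counts = [0] * n_bins
    for p in pvalues:
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    return counts

rng = random.Random(2)
null_p = [rng.random() for _ in range(9000)]     # no difference: flat
de_p = [rng.random() ** 6 for _ in range(1000)]  # squeezed towards 0
hist = pvalue_histogram(null_p + de_p)

for i, count in enumerate(hist):
    print(f"{i / len(hist):.2f}  {'#' * (count // 50)}  {count}")
```

The first bin stands out, while the remaining bins are roughly flat; a histogram with a hump in the middle or a rise towards 1 would be a warning sign.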

Confounded distribution of p-values

Improve test results

You set a cut-off of 0,05. Of the genes below this cut-off, a fraction is correctly identified as DE and a fraction is false positive.

Improve test results
We can improve testing by 2 measures:
● avoid testing: apply a filter before testing, an independent filtering
● apply a multiple testing correction

Independent filtering
Some scientists just
remove genes with
mean counts in the
samples <10. But there
is a more formal
method to remove
genes, in order to
reduce the testing.

http://www.bioconductor.org/help/course-materials/2012/Bressanone2012/
From this collection, read 2012-07-04-Huber-Multiple-testing-independent-filtering.pdf

Left: a scatter plot of mean counts versus transformed p-values. The red line depicts a cut-off of 0,1. Note that genes with lower counts do not reach the p-value threshold. Some of them are safe to exclude from testing.

If we filter out increasingly larger portions of genes based on their mean counts, the number of significant genes increases.


See later (slide 30)
Choose the variable of interest.
You can run it once on all to check the outcome.


Multiple testing correction
Automatically in results: Benjamini/Hochberg
correction, to control the false discovery rate (FDR).
FDR is the fraction of false positives among the genes that
are classified as DE.
If we set a threshold α of 0,05, about 5% of the genes
called DE are expected to be false positives.
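For readers who want to see what the Benjamini/Hochberg correction does to a list of p-values, here is a minimal sketch. The function name and the four example p-values are made up; the tool computes the adjusted p-values for you.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini/Hochberg adjusted p-values, in the input order."""
    n = len(pvalues)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

pvalues = [0.01, 0.04, 0.03, 0.005]
padj = benjamini_hochberg(pvalues)
# Genes called DE at an FDR threshold of 0.05.
significant = [p <= 0.05 for p in padj]
print(padj)
```

Each p-value is multiplied by the number of tests and divided by its rank, so the smallest p-values receive the strongest correction.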

Including different factors
Always remember your experimental setup and the
goal. Summarize the setup in the sample
descriptions file.
Experimental setup (from the figure):
● GDA (=G) versus GDA + vit C (=AG)
● Yeast (=WT) versus yeast mutant (=UPC)
● Day 1 and Day 2 for each: additional metadata (batch factor)


● We provide a combination of factors (the model, GLM) which influence the counts. Every factor should match a column name in the sample descriptions file.
● The levels of the factors corresponding to the 'base' or 'no perturbation' condition.
● The fraction filtered out, determined by the independent filter tool.
● The adjusted p-value cut-off.
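How factors and base levels turn into a model can be sketched with a small, hypothetical sample descriptions table. The sample names, the column names (strain, treatment, day) and the choice of WT, G and day 1 as base levels are assumptions for this example; the sketch only handles factors with two levels.

```python
# Hypothetical sample descriptions: one row per sample, one column per factor.
samples = [
    {"sample": "s1", "strain": "WT",  "treatment": "G",  "day": "1"},
    {"sample": "s2", "strain": "WT",  "treatment": "AG", "day": "1"},
    {"sample": "s3", "strain": "UPC", "treatment": "G",  "day": "1"},
    {"sample": "s4", "strain": "UPC", "treatment": "AG", "day": "1"},
    {"sample": "s5", "strain": "WT",  "treatment": "G",  "day": "2"},
    {"sample": "s6", "strain": "WT",  "treatment": "AG", "day": "2"},
    {"sample": "s7", "strain": "UPC", "treatment": "G",  "day": "2"},
    {"sample": "s8", "strain": "UPC", "treatment": "AG", "day": "2"},
]

# The 'base' or 'no perturbation' level of each factor (assumed here).
base_levels = {"strain": "WT", "treatment": "G", "day": "1"}

def design_row(sample, factors):
    """Intercept, then a 0/1 indicator per factor (1 = not the base level)."""
    return [1] + [int(sample[f] != base_levels[f]) for f in factors]

factors = ["strain", "treatment", "day"]  # must match the column names
design = [design_row(s, factors) for s in samples]

for s, row in zip(samples, design):
    print(s["sample"], row)
```

Each column after the intercept corresponds to one factor in the model; the coefficient fitted for it describes the effect of moving away from the base level.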

The 'detect differential expression' tool gives you four results. The first is the report, including graphs. The others are:
● Only the genes below the cut-off, with independent filtering applied.
● All genes, with independent filtering applied.
● Complete DESeq results, without independent filtering applied.

Result columns marked in the figure: Log2(FC) and the Standard Error (SE) of LogFC.

Volcano plot: shows the DE genes with our given cut-off.

Comparing different conditions
Experimental setup (from the figure):
● GDA (=G) versus GDA + vit C (=AG)
● Yeast (=WT) versus yeast mutant (=UPC)
● Day 1 and Day 2 for each

Which genes are DE between UPC and WT?
Which genes are DE between G and AG?
Which genes are DE in WT between G and AG?

Comparing different conditions
Adjust the sample descriptions file and the model:
1. Which genes are DE between UPC and WT?
2. Which genes are DE between G and AG?
3. Which genes are DE in WT between G and AG?
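[Screenshots 1 to 3: the adjusted sample descriptions file for each question; the rows marked 'Remove these' are the samples to leave out.]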

Congratulations!
We have reached our goal!

Overview

http://www.nature.com/nprot/journal/v8/n9/full/nprot.2013.099.html

Keywords
Effective library size
dispersion
shrinking
Significantly differentially expressed
MA-plot
Alpha cut-off
FDR
p-value
Write in your own words what the terms mean
