presentation

RNASeq Differential Expression (DE) Analysis in
the context of Breast Cancer
Debit Ahmed
18 June 2016

Outline
RNASeq DE analysis
RNA DE Analysis, What is it ?
Description of the data
Proposed Workflow-Pipeline
Analysis steps: From sequencing raw data to read counts
Quality Control FastQC
Mapping to the Human reference genome
Outputs and post-processing Samtools
Summarization
Analysis steps: Normalization, Exploration and Visualization
Why do we need to normalize the count data ?
Statistics for RNASeq DE analysis
Summary of the normalization-based methods
Results
DESeq-based and DESeq2-based pipelines
edgeR-based and voomLimma-based pipelines
Overlapping significant DE genes
What next ?

RNASeq DE Analysis, Aims and Objectives

RNA DE Analysis, What is it ? [A. Anjum et al., April
2016]
Analysis of differences in expression of gene populations under
different environments, conditions, treatments, and stages.
Statistical distributions are used to approximate the pattern of
differential gene expression.
A gene is declared differentially expressed if a difference or
change observed in read counts or expression levels/index
between two experimental conditions is statistically significant.

Description of the data
Illumina RNA Sequencing reads (.fastq files containing the
read sequences and the Phred scores)
Paired-end: Forward/Reverse
22 tumor samples and the matched 22 normal tissues
2 conditions (Status)
Biological replicates
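
As a small illustration of the Phred scores stored in a .fastq file, the sketch below decodes a quality string into integer scores and error probabilities. It assumes the Phred+33 encoding; the record shown is hypothetical and not taken from the study data.

```python
def phred_scores(quality_line, offset=33):
    """Convert a FASTQ quality string into integer Phred scores.

    The offset of 33 (Phred+33) is an assumption about the encoding.
    """
    return [ord(ch) - offset for ch in quality_line]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


# Hypothetical 4-line FASTQ record: header, sequence, separator, qualities.
record = ["@read1/1", "GATTACA", "+", "IIIIH5#"]
scores = phred_scores(record[3])
print(scores)
print([round(error_probability(q), 4) for q in scores])
```

A Phred score of 20 corresponds to a 1% chance of a wrong base call, and a score of 40 to 0.01%.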

Proposed Workflow-Pipeline part I

Proposed Workflow-Pipeline part II

FastQC: Quality control checks on raw sequence data

FastQC: Quality control checks on raw sequence data
rRNA contamination of the raw data !!!

Mapping to the Human reference genome
Next-Gen specific read alignment programs
Handle the vast amount of data generated by next-generation
sequencers
Runtime/Memory consuming
BWA, Bowtie, Tophat, STAR
Human reference genome from Ensembl:
Homosapiens.GRCh38.fa

Tophat Mapping Strategy
source: [Bioinformatics. 2009 May 1; 25(9): 1105–1111].

Outputs and post-processing Samtools
.bam file
samtools sort ⇒ sorted .bam file
samtools index ⇒ index file .bam.bai (IGV)
samtools sort -n ⇒ .bam file sorted by read name
samtools view ⇒ a huge .sam file is generated, ready to
use with HTSeq
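
The post-processing steps above can be sketched as a list of command lines. This is only an illustration assuming the samtools 1.x syntax (`-o` for the output file); older releases use a different `sort` syntax. The file names and the helper `samtools_commands` are hypothetical, and nothing is executed here.

```python
def samtools_commands(bam, prefix):
    """Build the post-processing commands as argument lists (not executed)."""
    coord_sorted = f"{prefix}.sorted.bam"
    name_sorted = f"{prefix}.namesorted.bam"
    return [
        # Sort by genomic coordinate.
        ["samtools", "sort", "-o", coord_sorted, bam],
        # Index the coordinate-sorted file (writes a .bam.bai file for IGV).
        ["samtools", "index", coord_sorted],
        # Sort by read name, so that paired reads sit next to each other.
        ["samtools", "sort", "-n", "-o", name_sorted, bam],
        # Convert to text .sam, keeping the header (-h).
        ["samtools", "view", "-h", "-o", f"{prefix}.namesorted.sam", name_sorted],
    ]


for cmd in samtools_commands("accepted_hits.bam", "sample01"):
    print(" ".join(cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)` on a machine where samtools is installed.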

Summarization
Inputs: List of genomic features (filtered .gtf file), Aligned
sequencing reads
featureCounts, Cufflinks-Cuffdiff, HTSeq (Python library)
In the case of RNA-Seq, the features are typically genes. The
simplest and most common approach counts the number of reads
overlapping the exons in a gene [A. Oshlack et al., 2010].
The counts can then be used for gene-level differential expression
analyses using methods such as DESeq2 or edgeR [S. Anders et
al., 2014]
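
The counting idea (reads overlapping the exons of a gene) can be sketched in a few lines. This is a toy version, not the algorithm of any of the tools above: it ignores strand, read pairing and spliced alignments, and it simply drops reads that overlap more than one gene. All coordinates and gene names are hypothetical.

```python
def count_reads_per_gene(exons_by_gene, reads):
    """Count reads overlapping the exons of exactly one gene.

    exons_by_gene: {gene: [(chrom, start, end), ...]}, inclusive coordinates.
    reads: [(chrom, start, end), ...], inclusive coordinates.
    """
    counts = {gene: 0 for gene in exons_by_gene}
    for chrom, start, end in reads:
        hits = {
            gene
            for gene, exons in exons_by_gene.items()
            if any(c == chrom and start <= e_end and end >= e_start
                   for c, e_start, e_end in exons)
        }
        if len(hits) == 1:  # unambiguous read
            counts[hits.pop()] += 1
    return counts


exons = {
    "geneA": [("chr1", 100, 200), ("chr1", 300, 400)],
    "geneB": [("chr1", 380, 500)],
}
reads = [
    ("chr1", 150, 199),  # geneA
    ("chr1", 390, 395),  # overlaps geneA and geneB: ambiguous, dropped
    ("chr1", 450, 480),  # geneB
    ("chr2", 150, 160),  # no feature
    ("chr1", 190, 310),  # spans two exons of geneA
]
print(count_reads_per_gene(exons, reads))
```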

Why do we need to normalize the RNASeq count data ?
The basic source of variations between samples is the
difference in library size (RNA samples may be sequenced to
different depths) [J. Zyprych-Walczak et al., 2015]
Sequencing depth, gene length, and count distribution are the
main biases that must be accounted for in the normalization
and differential expression calculations.
Many methods have been proposed differing both in the type
of bias adjustment and in the adopted statistical strategy

Normalization of RNASeq count data: Methods and
Strategies
RPKM (Mortazavi et al., 2008) - Reads Per Kilobase per Million
mapped reads: counts divided by the transcript length (kb) times
the total number of mapped reads (in millions).
FPKM (Trapnell et al., 2010) - Fragments per Kilobase of
exon per Million mapped reads, analogous to RPKM.
DESeq normalization implemented in the DESeq
Bioconductor package (S. Anders et al., 2010)
Upper-quartile (Bullard et al., 2010) - Counts are divided by
upper-quartile of counts for transcripts with at least one read.
TMM (Robinson and Oshlack, 2010) - Trimmed mean of M
values implemented in the edgeR Bioconductor package.
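
Two of these normalizations are simple enough to sketch directly. The code below computes an RPKM value and DESeq-style size factors (median of the ratios to the per-gene geometric mean). It is a minimal sketch with hypothetical counts, not the code of the Bioconductor packages; the function names are illustrative.

```python
import math
from statistics import median


def rpkm(count, length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return count / ((length_bp / 1e3) * (total_mapped_reads / 1e6))


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: {gene: [count in sample 1, count in sample 2, ...]}.
    Genes with a zero count in any sample are skipped.
    """
    n_samples = len(next(iter(counts.values())))
    log_geo_mean = {
        gene: sum(math.log(c) for c in row) / n_samples
        for gene, row in counts.items()
        if all(c > 0 for c in row)
    }
    return [
        math.exp(median(math.log(counts[g][j]) - lgm
                        for g, lgm in log_geo_mean.items()))
        for j in range(n_samples)
    ]


# Hypothetical counts: sample 2 was sequenced twice as deep as sample 1.
toy = {"g1": [10, 20], "g2": [100, 200], "g3": [5, 10], "g4": [0, 3]}
factors = size_factors(toy)
print(factors)
print([c / f for c, f in zip(toy["g2"], factors)])  # normalized counts
print(rpkm(500, 2000, 10_000_000))
```

Dividing each sample by its size factor puts the two samples on a common scale, which is the purpose of library-size normalization.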

Normalization of RNASeq count data: Methods and
Strategies
Issues
⇒ How to choose a normalization adapted to our experiment and
which criteria can help make this choice ?
⇒ What is the impact of the normalization step on the lists of
differentially expressed genes ?
In our case
We conduct the DE analysis from the raw read counts generated in
the previous step, following the DESeq, DESeq2 and edgeR methods
and the one implemented in the voom-limma package. The
dedicated pipeline corresponding to each method is described in
detail.

Statistics for RNASeq count data: Which appropriate
model can we use ?
source: CartoonStock.com

Statistics for RNASeq count data: Which appropriate
model can we use ?
We have count data for some list of genes with biological
replicates corresponding to two conditions we want to compare
X: a random variable representing the number of reads falling
in a given gene
Description: a fixed number of events n, each with a
constant probability of success
Event: An RNASeq read lands in a given gene (success) or
not (failure)
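
This description is a binomial model: n reads, each landing in the gene with a small probability p. When n is large and p is small, the binomial is well approximated by a Poisson distribution with mean n·p. The sketch below checks this numerically for hypothetical values of n and p.

```python
import math


def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)


def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**x / math.factorial(x)


n_reads = 1_000_000     # hypothetical library size
p_gene = 5e-6           # hypothetical probability that a read lands in the gene
lam = n_reads * p_gene  # expected count for this gene

max_gap = max(
    abs(binomial_pmf(x, n_reads, p_gene) - poisson_pmf(x, lam))
    for x in range(21)
)
print(max_gap)  # largest pointwise difference between the two models
```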

Statistics for RNASeq count data: Poisson or Negative
Binomial NB distribution ?
The most used distributions to model the RNASeq count data are
Poisson and NB.
- The Poisson assumption [A. Anjum et al., 2016]
$P(X = x) = \frac{e^{-\lambda} \lambda^{x}}{x!}, \quad x = 0, 1, 2, \ldots$
Simple: One parameter λ (mean = variance)
However, it does not account for the variability between biological
replicates ⇒ inflated false positive rates
RNASeq count data are overdispersed (the variance exceeds the
mean and grows faster than it). This problem is addressed by using
the negative binomial distribution.

Statistics for RNASeq count data: Poisson or Negative
Binomial ?
We can solve the overdispersion problem !!!
- The Negative Binomial distribution [A. Anjum et al., 2016]
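$P(X = x) = \binom{x + r - 1}{r - 1} p^{r} q^{x}, \quad q = 1 - p, \quad x = 0, 1, 2, \ldots$
p: probability of a single success
r: number of successes; x, the number of failures observed before
the r-th success, plays the role of the read count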
Two parameters: mean and dispersion (modeling of more
general mean–variance relationships)
Extension of the Poisson model including overdispersion
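
The link between the (p, r) form and the mean/dispersion form can be checked numerically. The sketch below assumes the usual reparameterization r = 1/α and p = r/(r + μ), where μ is the mean and α the dispersion, which gives variance μ + αμ². The values of μ and α are hypothetical.

```python
import math


def nb_pmf(x, mu, alpha):
    """Negative binomial P(X = x) with mean mu and dispersion alpha."""
    r = 1.0 / alpha
    p = r / (r + mu)
    # log of the binomial coefficient C(x + r - 1, r - 1), valid for real r
    log_coef = math.lgamma(x + r) - math.lgamma(r) - math.lgamma(x + 1)
    return math.exp(log_coef + r * math.log(p) + x * math.log(1 - p))


mu, alpha = 20.0, 0.5  # hypothetical mean and dispersion
support = range(5000)  # far enough into the tail for these values
total = sum(nb_pmf(x, mu, alpha) for x in support)
mean = sum(x * nb_pmf(x, mu, alpha) for x in support)
var = sum((x - mean) ** 2 * nb_pmf(x, mu, alpha) for x in support)
print(total, mean, var)  # var should match mu + alpha * mu**2
```

As α tends to 0 the variance tends to μ, which recovers the Poisson case.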

The RNASeq count data are overdispersed
RNASeq count data (for biological replicates) follow the NB
distribution instead of the Poisson distribution
source: [I. Gonzalez, Statistical analysis of RNA-Seq data, 2014]

The RNASeq count data are overdispersed: checking this
in my data
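
One simple way to run such a check is to compare the per-gene mean and variance across replicates. The sketch below is an assumption about how this could be done, not the procedure used for the figure: it uses a method-of-moments dispersion estimate, (variance − mean) / mean², on hypothetical counts assumed to be already normalized for library size.

```python
from statistics import mean, variance


def moment_dispersion(counts):
    """Method-of-moments dispersion; 0 means no evidence of overdispersion."""
    m = mean(counts)
    v = variance(counts)  # sample variance across replicates
    if m == 0:
        return 0.0
    return max((v - m) / m**2, 0.0)


# Hypothetical normalized counts over five replicates.
replicates = {
    "gene_overdispersed": [10, 50, 20, 80, 40],
    "gene_poisson_like": [9, 10, 11, 10, 10],
}
for gene, counts in replicates.items():
    print(gene, mean(counts), variance(counts), moment_dispersion(counts))
```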

Multiple testing in the case of RNASeq count data
Due to the large number of tests performed in the analysis of
RNA-seq, the multiple testing problem needs to be addressed
(control the FDR, adjustment of p value)
The Benjamini-Hochberg method is used for the correction for the
multiple testing.

The Benjamini-Hochberg Algorithm revisited [Y. Benjamini
and Y. Hochberg, 1995]
Algorithm 1 benjamini hochberg()
m: number of hypotheses to test (number of genes tested for differential expression between conditions)
p_i: unadjusted p value from the test of the i-th hypothesis (i = 1, ..., m)
begin
1. order the p_i: p_1 ≤ p_2 ≤ ... ≤ p_m
for i = 1 to m do:
2. calculate the adjustment factor a_i: a_i = m/i (i: rank of the gene)
3. multiply p_i by a_i: q_i = p_i a_i
done
for i = 1 to m do:
4. if the multiplication violated the original ordering, repair this by decreasing the highest value in all the violating pairs:
p̃_i = min_{j = i, ..., m} q_j
5. cap at 1: p̃_i = min(p̃_i, 1)
done
end
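
The same procedure can be written as a short function. This is a minimal sketch of the step-up adjustment; the p values in the example are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p values, in the input order."""
    m = len(pvalues)
    # Indices of the p values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0  # also caps the adjusted values at 1
    # Walk from the largest p value down, so the running minimum
    # repairs any ordering violation introduced by the factor m / rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]  # hypothetical unadjusted p values
print(benjamini_hochberg(raw))
```

Genes whose adjusted p value falls below the chosen threshold (for example 0.05) are reported as significant at that FDR.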

Summary of the analysis methods: DESeq-based pipeline
[S. Anders et al., 2010]
Statistical test: negative binomial test, nbinomTest() (takes the
raw counts, not normalized data)
limma test: with the variance stabilizing transformation
getVarianceStabilizedData() to get the normalized count for
testing
Adjustment for multiple testing: the Benjamini-Hochberg
multiple testing adjustment procedure.
Normalization: DESeq

Summary of the analysis methods: DESeq2-based [M.
Love et al., 2014]
Statistical test: Wald test (requires raw count data,
not normalized counts)
Independent filtering: DESeq2 uses the average expression
strength of each gene across all samples, as its filter criterion,
and it omits all genes with mean normalized counts below a
filtering threshold from multiple testing adjustment
Adjustment of the Wald-test p-value: the P-values from the
subset of genes that pass the independent filtering step are
adjusted for multiple testing using the procedure of Benjamini
and Hochberg
Normalization: DESeq-based
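
The principle of a Wald test is to divide an estimated coefficient by its standard error and compare the result to a standard normal distribution. The sketch below shows only this last step, with hypothetical values for the log2 fold change and its standard error; it does not reproduce how DESeq2 estimates them.

```python
import math


def wald_test(log2_fold_change, standard_error):
    """Return the Wald statistic z and its two-sided normal p value."""
    z = log2_fold_change / standard_error
    # 2 * (1 - Phi(|z|)) written with the complementary error function.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


z, p = wald_test(1.5, 0.5)  # hypothetical estimate and standard error
print(z, p)
```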

Summary of the analysis methods: EdgeR-based pipeline
[Y. Chen et al., 2016]
Statistical test: GLM likelihood ratio test (requires raw
count data, not normalized counts). Since the classical
edgeR is designed only for single-factor experiments, we
use the GLM edgeR for our two-condition experiment with
matched tumor/normal samples.
Once the negative binomial GLMs are fitted with the
Cox-Reid dispersion estimates, the GLM likelihood ratio test
(GLRT) is performed for each tag (gene). Tags can then be
ranked in order of evidence for differential expression based on
the p-value computed for each tag.
The ranking is then used for multiple testing correction with
the Benjamini-Hochberg method to produce the adjusted
p-value.
Normalization: library size normalization using a trimmed
mean of M values (TMM) between each pair of samples, and
gene-specific correction.
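
The likelihood ratio test compares the fit of the full model (with the condition effect) to the reduced model (without it). The sketch below shows only the final step for one gene and one tested coefficient, assuming the statistic is compared to a chi-square distribution with 1 degree of freedom. The log-likelihood values are hypothetical, and the fitting itself is not shown.

```python
import math


def likelihood_ratio_test(loglik_full, loglik_reduced):
    """Return the LR statistic and its chi-square (1 df) p value."""
    lr = 2.0 * (loglik_full - loglik_reduced)
    lr = max(lr, 0.0)  # guard against tiny negative values from rounding
    # Survival function of a chi-square with 1 df: erfc(sqrt(lr / 2)).
    return lr, math.erfc(math.sqrt(lr / 2.0))


lr, p = likelihood_ratio_test(-120.3, -125.1)  # hypothetical log-likelihoods
print(lr, p)
```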

Summary of the analysis methods: Voom-Limma pipeline
for DE analysis [M. E. Ritchie et al., 2015]
Unlike the above methods, limma is not based on a negative
binomial model.
Voom transformation is applied to the read counts
(conversion to the log-counts per million LogCPM with
associated precision weights.) ⇒ estimating the
mean-variance relationship empirically.
Voom function to convert the mean-variance trend into
precision weights
Using the design matrix to fit the linear model, and test for
DE between conditions
Statistical test: moderated t-statistics
TMM normalization
Adjustment for multiple testing: Benjamini-Hochberg’s
approach (Default method)
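
The log-counts per million conversion can be sketched as follows. It assumes the offsets used by voom (0.5 added to the count and 1 to the library size, which avoids taking the log of zero); the counts and library sizes are hypothetical, and the precision weights are not computed here.

```python
import math


def log_cpm(count, library_size):
    """log2 counts per million with voom-style offsets."""
    return math.log2((count + 0.5) / (library_size + 1.0) * 1e6)


# Hypothetical counts for one gene in two libraries of different depth.
print(log_cpm(100, 20_000_000))
print(log_cpm(200, 40_000_000))  # close to the value above
print(log_cpm(0, 20_000_000))    # still defined for a zero count
```

In a pipeline with TMM normalization, the library size passed here would be the effective (normalized) library size.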

Differential expression study, Analysis and Visualization:
DESeq-based and DESeq2-based pipelines

DESeq-based and DESeq2-based pipelines, Some QC:
Sample distance heatmap

DESeq-based and DESeq2-based pipelines, Some QC: PCA

DESeq-based and DESeq2-based pipelines, Checking the
normalization

DESeq-based and DESeq2-based pipelines, Biological
dispersion

DESeq-based analysis results: significant DE genes
Only 5 genes are significant in the differential expression test after
adjustment (FDR < 0.1), and none at FDR < 0.05

DESeq2-based analysis results: significant DE genes
Significant DE genes at FDR < 0.05 with DESeq2-based method

Differential expression study, Analysis and Visualization:
edgeR-based and voomLimma-based pipelines

edgeR-based pipeline, Distance similarity between samples

edgeR-based pipeline: Biological coefficient of variation

edgeR-based analysis results: significant DE genes

Overlapping significant DE genes: at FDR < 0.05

Overlapping significant DE genes: at FDR < 1e-5
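
Overlaps of this kind can be computed with plain set operations on the lists of significant genes. The sketch below uses hypothetical gene identifiers and hypothetical lists, not the results of this analysis.

```python
# Hypothetical lists of significant DE genes per method.
significant = {
    "DESeq2": {"gene1", "gene2", "gene3", "gene5"},
    "edgeR": {"gene1", "gene2", "gene4", "gene5"},
    "voomLimma": {"gene1", "gene2", "gene5", "gene6"},
}

# Genes reported by every method.
common = set.intersection(*significant.values())

# Genes reported by exactly one method.
unique = {
    method: genes - set.union(*(g for m, g in significant.items() if m != method))
    for method, genes in significant.items()
}

print(sorted(common))
print({method: sorted(genes) for method, genes in unique.items()})
```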

What next ?
Biological insights from the list of DE genes
Coexpression network analysis WGCNA package [P.
Langfelder and S. Horvath, 2008]
Prior knowledge (guidance) ⇒ epistasis analysis
Integromics
