RNASeq Differential Expression (DE) Analysis in
the Context of Breast Cancer
Debit Ahmed
18 June 2016
Outline
RNASeq DE analysis
RNA DE Analysis, What is it ?
Description of the data
Proposed Workflow-Pipeline
Analysis steps: From sequencing raw data to read counts
Quality Control FastQC
Mapping to the Human reference genome
Outputs and post-processing Samtools
Summarization
Analysis steps: Normalization, Exploration and Visualization
Why do we need to normalize the count data ?
Statistics for RNASeq DE analysis
Summary of the normalization-based methods
Results
DESeq-based and DESeq2-based pipelines
edgeR-based and voomLimma-based pipelines
Overlapping significant DE genes
What next ?
RNASeq DE Analysis, Aims and Objectives
RNA DE Analysis, What is it ? [A. Anjum et al., April
2016]
Analysis of differences in expression of gene populations under
different environments, conditions, treatments, and stages.
Statistical distributions are used to approximate the pattern of
differential gene expression.
A gene is declared differentially expressed if a difference or
change observed in read counts or expression levels/index
between two experimental conditions is statistically significant.
Description of the data
Illumina RNA Sequencing reads (.fastq files containing the
read sequences and the Phred scores)
Paired-end: Forward/Reverse
22 tumor samples and the matched 22 normal tissues
2 conditions (Status)
Biological replicates
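The Phred scores stored in the .fastq files can be decoded as in the minimal sketch below. It assumes the Phred+33 ASCII encoding (the offset is an assumption about these files, not something stated on the slide), and the function names are illustrative.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    offset=33 assumes the Phred+33 encoding; older Illumina
    pipelines used an offset of 64.
    """
    return [ord(ch) - offset for ch in quality_line]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)


# Hypothetical quality string for a 5-base read.
scores = phred_scores("II5+!")
probabilities = [error_probability(q) for q in scores]
```

A Phred score of 20 corresponds to a 1% chance of a wrong base call, and a score of 40 to 0.01%.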
Proposed Workflow-Pipeline part I
Proposed Workflow-Pipeline part II
DE Pipeline Analysis Steps
FastQC: Quality control checks on raw sequence data
rRNA contamination of the raw data !!!
Mapping to the Human reference genome
Next-Gen specific read alignment programs
Handle the vast amount of data generated by next-generation
sequencers
Runtime/Memory consuming
BWA, Bowtie, Tophat, STAR
Human reference genome from Ensembl:
Homosapiens.GRCh38.fa
Tophat Mapping Strategy
source: [Bioinformatics. 2009 May 1; 25(9): 1105–1111].
Outputs and post-processing Samtools
.bam file
samtools sort ⇒ sorted .bam file
samtools index ⇒ indexed .bam.bai file (IGV)
samtools sort -n ⇒ .bam file sorted by read name
samtools view ⇒ a huge .sam file is generated, ready to
use with HTSeq
Summarization
Inputs: List of genomic features (filtered .gtf file), Aligned
sequencing reads
FeatureCounts, Cufflinks-Cuffdiff, HTSeq (Python Library)
In the case of RNA-Seq, the features are typically genes. The
simplest and most common approach counts the number of reads
overlapping the exons in a gene [A. Oshlack et al., 2010].
The counts can then be used for gene-level differential expression
analyses using methods such as DESeq2 or EdgeR [S. Anders et
al., 2014]
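The idea of counting reads that overlap the exons of a gene can be illustrated with a small sketch. It is a simplification written for this slide, not the HTSeq or featureCounts implementation: it assumes a single chromosome, closed integer intervals, no strand information, and it skips reads that hit exons of more than one gene. All names and coordinates are hypothetical.

```python
def count_reads(exons, reads):
    """Count reads per gene.

    exons: dict mapping gene name -> list of (start, end) exon intervals
    reads: list of (start, end) aligned read intervals
    A read is assigned to a gene if it overlaps at least one of its exons.
    Reads overlapping exons of several genes are treated as ambiguous
    and are not counted.
    """
    counts = {gene: 0 for gene in exons}
    for r_start, r_end in reads:
        hits = {
            gene
            for gene, intervals in exons.items()
            if any(r_start <= e_end and e_start <= r_end
                   for e_start, e_end in intervals)
        }
        if len(hits) == 1:
            counts[hits.pop()] += 1
    return counts


# Hypothetical annotation and alignments.
exons = {"geneA": [(100, 200), (300, 400)], "geneB": [(380, 500)]}
reads = [(150, 180), (190, 310), (390, 395), (600, 650), (450, 480)]
counts = count_reads(exons, reads)
```

The read at 390-395 overlaps both genes and is dropped, and the read at 600-650 overlaps no feature.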
Summarization HTSeq
Why do we need to normalize the RNASeq count data ?
The basic source of variations between samples is the
difference in library size (RNA samples may be sequenced to
different depths) [J. Zyprych-Walczak et al., 2015]
Sequencing depth, gene length, and count distribution are the
main biases that must be accounted for in the normalization
and differential expression calculations.
Many methods have been proposed differing both in the type
of bias adjustment and in the adopted statistical strategy
Normalization of RNASeq count data: Methods and
Strategies
RPKM (Mortazavi et al., 2008) - Reads Per Kilobase per Million
mapped reads: counts divided by the product of the transcript
length (kb) and the total number of mapped reads (in millions).
FPKM (Trapnell et al., 2010) - Fragments per Kilobase of
exon per Million mapped reads, analogous to RPKM.
DESeq normalization implemented in the DESeq
Bioconductor package (S. Anders et al., 2010)
Upper-quartile (Bullard et al., 2010) - Counts are divided by
upper-quartile of counts for transcripts with at least one read.
TMM (Robinson and Oshlack, 2010) - Trimmed mean of M
values implemented in the edgeR Bioconductor package.
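Two of the methods listed above can be sketched in a few lines. The first function follows the RPKM definition directly. The second is a simplified version of the median-of-ratios idea behind the DESeq size factors (each sample is compared with a per-gene geometric mean, and genes with a zero count are skipped); it is an illustration under those assumptions, not the package code, and the function names are hypothetical.

```python
import math
from statistics import median


def rpkm(count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return count / ((gene_length_bp / 1e3) * (total_mapped_reads / 1e6))


def median_of_ratios_size_factors(counts):
    """Size factor per sample from a genes x samples count table.

    counts: list of rows, one row per gene, one column per sample.
    For each gene, each count is divided by the geometric mean of that
    gene across samples; the size factor of a sample is the median of
    its ratios. Genes with any zero count are ignored.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if min(row) == 0:
            continue
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]


# Hypothetical table: the second sample was sequenced twice as deeply.
table = [[10, 20], [100, 200], [30, 60], [0, 5]]
size_factors = median_of_ratios_size_factors(table)
normalized = [[c / s for c, s in zip(row, size_factors)] for row in table]
```

Dividing each column by its size factor puts the two samples on a common scale, so the first three genes end up with equal normalized counts.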
Normalization of RNASeq count data: Methods and
Strategies
Issues
⇒ How to choose a normalization adapted to our experiment and
which criteria can help make this choice ?
⇒ What is the impact of the normalization step on lists of
differential expression genes ?
In our case
We conduct the DE analysis from the raw read counts generated in
the previous step, following the DESeq, DESeq2, and edgeR methods
and the one implemented in the voom-limma package. The dedicated
pipeline corresponding to each method is described in
detail.
Statistics for RNASeq count data: Which appropriate
model can we use ?
source: CartoonStock.com
Statistics for RNASeq count data: Which appropriate
model can we use ?
We have count data for some list of genes with biological
replicates corresponding to two conditions we want to compare
X: a random variable representing the number of reads falling
in a given gene
Description: a fixed number of events n, each with a
constant probability of success
Event: An RNASeq read lands in a given gene (success) or
not (failure)
Statistics for RNASeq count data: Poisson or Negative
Binomial NB distribution ?
The most used distributions to model the RNASeq count data are
Poisson and NB.
- The Poisson assumption [A. Anjum et al., 2016]
$P(X = x) = \frac{e^{-\lambda} \lambda^{x}}{x!}, \quad x = 0, 1, 2, \ldots$
Simple: One parameter λ (mean = variance)
However, it does not account for the variability between biological
replicates ⇒ inflated false positive rates
RNASeq count data are overdispersed (the variance grows faster
than the mean). This problem is addressed by modeling the counts
with the negative binomial distribution.
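The "mean = variance" property of the Poisson model can be checked numerically. The sketch below evaluates the Poisson probability mass function for a hypothetical rate λ = 4 and computes the mean and variance from it; the truncation at x = 60 is an assumption that is harmless for such a small rate.

```python
import math


def poisson_pmf(x, lam):
    """P(X = x) for a Poisson distribution with rate lam, x = 0, 1, 2, ..."""
    return math.exp(-lam) * lam ** x / math.factorial(x)


lam = 4.0  # hypothetical expected read count for one gene
support = range(0, 61)  # truncated support; the tail beyond 60 is negligible
total = sum(poisson_pmf(x, lam) for x in support)
mean = sum(x * poisson_pmf(x, lam) for x in support)
var = sum((x - mean) ** 2 * poisson_pmf(x, lam) for x in support)
```

Both `mean` and `var` come out equal to λ, which is exactly the constraint that real replicate data tend to violate.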
Statistics for RNASeq count data: Poisson or Negative
Binomial ?
We can solve the overdispersion problem !!!
- The Negative Binomial distribution [A. Anjum et al., 2016]
$P(X = x) = \binom{x + r - 1}{r - 1} p^{r} q^{x}, \quad q = 1 - p, \quad x = 0, 1, 2, \ldots$
p: probability of a single success
r: number of successes (for read counts, X is the count and 1/r plays the role of the dispersion)
Two parameters: mean and dispersion (modeling of more
general mean–variance relationships)
Extension of the Poisson model that includes overdispersion
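How the negative binomial model relaxes the Poisson constraint can be seen numerically. The sketch below assumes an integer r so that the binomial coefficient can be computed with `math.comb` (in practice r is real-valued and a gamma function is used instead). With mean μ = rq/p and dispersion α = 1/r, the variance equals μ + αμ², which exceeds the Poisson variance μ. The parameter values are hypothetical.

```python
import math


def nb_pmf(x, r, p):
    """P(X = x) = C(x + r - 1, r - 1) * p^r * (1 - p)^x, for integer r >= 1."""
    return math.comb(x + r - 1, r - 1) * p ** r * (1 - p) ** x


r, p = 5, 0.5  # hypothetical parameters
support = range(0, 201)  # truncated support; the remaining tail is negligible
total = sum(nb_pmf(x, r, p) for x in support)
mean = sum(x * nb_pmf(x, r, p) for x in support)
var = sum((x - mean) ** 2 * nb_pmf(x, r, p) for x in support)

alpha = 1 / r  # dispersion
expected_var = mean + alpha * mean ** 2
```

Here the mean is 5 and the variance is 10, twice what a Poisson model with the same mean would allow.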
The RNASeq count data are overdispersed
RNASeq count data (for biological replicates) follow the NB
distribution instead of the Poisson distribution
source: [I. Gonzalez, Statistical analysis of RNA-Seq data, 2014]
The RNASeq count data are overdispersed: checking this
in my data
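A quick way to check overdispersion in a count table is to compare the mean and variance of each gene across replicates. The sketch below uses a method-of-moments estimate, α = (variance − mean) / mean², clipped at zero. It assumes the counts have already been normalized for library size; the gene names and values are hypothetical, and the estimators used by DESeq or edgeR are more elaborate.

```python
from statistics import mean, variance


def moment_dispersion(counts):
    """Method-of-moments dispersion for one gene across replicates.

    Returns 0.0 when the variance does not exceed the mean
    (no evidence of overdispersion) or when the mean is zero.
    """
    mu = mean(counts)
    if mu == 0:
        return 0.0
    var = variance(counts)  # sample variance (n - 1 denominator)
    return max((var - mu) / mu ** 2, 0.0)


# Hypothetical normalized counts for two genes in four replicates.
replicates = {
    "gene_variable": [10, 20, 30, 40],
    "gene_stable": [9, 10, 11, 10],
}
dispersions = {g: moment_dispersion(c) for g, c in replicates.items()}
overdispersed = [g for g, a in dispersions.items() if a > 0]
```

Plotting the per-gene variance against the mean on log scales, with the line variance = mean as reference, gives the usual visual version of this check.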
Multiple testing in the case of RNASeq count data
Due to the large number of tests performed in the analysis of
RNA-seq, the multiple testing problem needs to be addressed
(control the FDR, adjustment of p value)
The Benjamini-Hochberg method is used for the correction for the
multiple testing.
The Benjamini-Hochberg Algorithm revisited [Y. Benjamini
and Y. Hochberg, 1995]
Algorithm 1 benjamini_hochberg()
m: number of hypotheses to test (number of genes tested for differential expression between conditions)
p_i: unadjusted p-value of the i-th hypothesis (i = 1, ..., m)
begin
1. order the p_i: p_(1) ≤ p_(2) ≤ ... ≤ p_(m)
for i = 1 to m do:
2. calculate the adjustment factor a_i: a_i = m/i (i: rank of the gene)
3. multiply p_(i) by a_i: p'_i = p_(i) a_i
done
for i = m down to 1 do:
if the multiplication violated the original ordering then
4. repair this by decreasing the highest p-value in all the violating pairs:
p̃_i = min_{j = i, ..., m} p'_j
endif
5. p̃_i = min(p̃_i, 1)
done
end
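The adjustment can be sketched in Python as follows. This is a minimal illustration of the Benjamini-Hochberg adjusted p-values (rank-based scaling followed by a running minimum from the largest p-value downwards), not the code used in the pipelines; the p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted in increasing order.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0  # also caps the adjusted values at 1
    # Walk from the largest p-value (rank m) down to the smallest (rank 1),
    # so each adjusted value is the minimum over all higher ranks.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical unadjusted p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
significant = [i for i, p in enumerate(adj) if p < 0.05]
```

Genes whose adjusted p-value falls below the chosen FDR threshold are reported as significant.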
Summary of the analysis methods: DESeq-based pipeline
[S. Anders et al., 2010]
Statistical test: negative binomial test (applied to the raw
counts, not to normalized ones)
limma test: applied to the variance-stabilized data returned by
getVarianceStabilizedData(), used as normalized values for
testing
Adjustment for multiple testing: the Benjamini-Hochberg
multiple testing adjustment procedure.
Normalization: DESeq
Summary of the analysis methods: DESeq2-based [M.
Love et al., 2014]
Statistical test: Wald test (requires the raw count data,
not normalized counts)
Independent filtering: DESeq2 uses the average expression
strength of each gene across all samples, as its filter criterion,
and it omits all genes with mean normalized counts below a
filtering threshold from multiple testing adjustment
Adjustment of the Wald-test p-value: the P-values from the
subset of genes that pass the independent filtering step are
adjusted for multiple testing using the procedure of Benjamini
and Hochberg
Normalization: DESeq-based
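The Wald test itself reduces to a short computation once a log fold change and its standard error have been estimated. The sketch below assumes those two numbers are given (estimating them is the hard part and is done by the package) and compares their ratio with a standard normal distribution; the input values are hypothetical.

```python
import math


def wald_test(log_fold_change, standard_error):
    """Two-sided Wald test against a standard normal distribution.

    Returns the z statistic and the p-value 2 * (1 - Phi(|z|)),
    computed with the complementary error function.
    """
    z = log_fold_change / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical estimate for one gene: log2 fold change 1.2, standard error 0.4.
z, p = wald_test(1.2, 0.4)
```

The resulting p-values for the genes that pass independent filtering are then adjusted for multiple testing.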
Summary of the analysis methods: EdgeR-based pipeline
[Y. Chen et al., 2016]
Statistical test: GLM likelihood ratio test (requires the raw
count data, not normalized counts). Since the classic
edgeR is designed only for a single factor experiment, we
should use the GLM edgeR for our two-condition
experiment.
Once the negative binomial GLMs are fitted with the
Cox-Reid dispersion estimates, the GLM likelihood ratio test
(GLRT) is performed for each tag (gene). Tags can then be
ranked in order of evidence for differential expression based on
the p-value computed for each tag.
The ranking is then used for multiple testing correction with
the Benjamini-Hochberg method to produce the adjusted
p-value.
Normalization: library size normalization using a trimmed
mean of M values (TMM) between each pair of samples, and
gene-specific correction.
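For a single tested coefficient, the likelihood ratio test compares twice the difference in log-likelihood between the full and the reduced model with a chi-square distribution with one degree of freedom. The sketch below assumes the two log-likelihoods have already been obtained from fitted negative binomial GLMs; the numbers are hypothetical and the function is an illustration, not the edgeR code.

```python
import math


def likelihood_ratio_test(loglik_full, loglik_reduced):
    """Likelihood ratio test for one extra parameter (1 degree of freedom).

    The chi-square(1) tail probability P(X > lr) equals
    erfc(sqrt(lr / 2)), so no statistics library is needed.
    """
    lr = max(2.0 * (loglik_full - loglik_reduced), 0.0)
    p_value = math.erfc(math.sqrt(lr / 2.0))
    return lr, p_value


# Hypothetical log-likelihoods for one tag (gene).
lr, p = likelihood_ratio_test(-100.0, -102.0)
```

Ranking the tags by this p-value gives the ordering used for the multiple testing correction.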
Summary of the analysis methods: Voom-Limma pipeline
for DE analysis [M. E. Ritchie et al., 2015]
Unlike the above methods, Limma is not based on negative
binomial model.
Voom transformation is applied to the read counts
(conversion to the log-counts per million LogCPM with
associated precision weights.) ⇒ estimating the
mean-variance relationship empirically.
Voom function to convert the mean-variance trend into
precision weights
Using the design matrix to fit the linear model, and test for
DE between conditions
Statistical test: moderated t-statistics
TMM normalization
Adjustment for multiple testing: Benjamini-Hochberg’s
approach (Default method)
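The log-counts-per-million conversion at the start of the voom transformation can be sketched as below. The offsets of 0.5 on the count and 1 on the library size follow the formula described for voom; the sketch assumes the library size is the plain column sum (no TMM scaling factor applied) and does not compute the precision weights. The counts are hypothetical.

```python
import math


def log_cpm(counts, library_size=None):
    """log2 counts per million for one sample, with voom-style offsets.

    counts: list of raw read counts, one per gene
    library_size: total mapped reads; defaults to the sum of counts
    """
    if library_size is None:
        library_size = sum(counts)
    return [
        math.log2((c + 0.5) / (library_size + 1.0) * 1e6)
        for c in counts
    ]


# Hypothetical sample with three genes.
sample_counts = [0, 250, 999_749]
values = log_cpm(sample_counts)
```

The small offsets keep genes with zero counts finite on the log scale.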
Differential expression study, Analysis and Visualization:
DESeq-based and DESeq2-based pipelines
DESeq-based and DESeq2-based pipelines, Some QC:
Sample distance heatmap
DESeq-based and DESeq2-based pipelines, Some QC: PCA
DESeq-based and DESeq2-based pipelines, Checking the
normalization
DESeq-based and DESeq2-based pipelines, Checking the
normalization
DESeq-based and DESeq2-based pipelines, Biological
dispersion
DESeq-based analysis results: significant DE genes
Only 5 genes are significantly differentially expressed after
adjustment at FDR < 0.1, and 0 genes at FDR < 0.05
DESeq2-based analysis results: significant DE genes
Significant DE genes at FDR < 0.05 with DESeq2-based method
Differential expression study, Analysis and Visualization:
edgeR-based and voomLimma-based pipelines
edgeR-based pipeline, PCA
edgeR-based pipeline, Distance similarity between samples
edgeR-based pipeline: Biological coefficient of variation
edgeR-based analysis results: significant DE genes
Overlapping significant DE genes: at FDR < 0.05
Overlapping significant DE genes: at FDR < 1e − 5
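The overlap between the significant gene lists of the different pipelines can be computed with set operations. The sketch below uses made-up gene identifiers and adjusted p-values purely for illustration; it is not the analysis result.

```python
def significant_genes(adjusted_pvalues, fdr):
    """Set of gene IDs whose adjusted p-value is below the FDR threshold."""
    return {gene for gene, p in adjusted_pvalues.items() if p < fdr}


# Hypothetical adjusted p-values per pipeline.
results = {
    "DESeq2": {"g1": 1e-8, "g2": 0.01, "g3": 0.20},
    "edgeR": {"g1": 1e-6, "g2": 0.03, "g3": 0.04},
    "voom-limma": {"g1": 1e-7, "g2": 0.06, "g3": 0.50},
}

per_method = {m: significant_genes(r, fdr=0.05) for m, r in results.items()}
common = set.intersection(*per_method.values())
union = set.union(*per_method.values())
```

Lowering the threshold (for example to 1e-5) shrinks each list and usually makes the lists agree more closely.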
What next ?
Biological insights from the list of DE genes
Coexpression network analysis WGCNA package [P.
Langfelder and S. Horvath, 2008]
Prior knowledge (guidance) ⇒ epistasis analysis
Integromics
