Applied Bioinformatics Journal Club
Wednesday, March 5
Background
• Comparison of
commonly used DE
software packages
– Cuffdiff
– edgeR
– DESeq
– PoissonSeq
– baySeq
– limma

• Two benchmark
datasets
– Sequencing Quality
Control (SEQC) dataset
• Includes qRT-PCR for
1,000 genes

– Biological replicates from
3 cell lines as part of
ENCODE project
Focus of paper:
Comparison of relevant measures for
DE detection

• Normalization of count data

• Sensitivity and specificity of DE detection
• Genes expressed in one condition but no
expression in the other condition
• Sequencing depth and number of replicates
Theoretical background
• Count matrix—number
of reads assigned to
gene i in sequencing
experiment j
• Length bias when
measuring gene
expression by RNA-seq
– Reduces the ability to
detect differential
expression among
shorter genes

• Differential gene
expression consists of 3
components:
– Normalization of counts
– Parameter estimation of
the statistical model
– Tests for differential
expression
Normalization
• Commonly used
– RPKM
– FPKM
– Biases—proportional
representation of each
gene is dependent on
expression levels of other
genes
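
A minimal sketch of RPKM, assuming a genes x samples matrix of raw counts and known gene lengths in base pairs (the function name and toy numbers are made up for illustration). The toy data shows the bias noted above: gene 0 has the same raw count in both samples, but its RPKM drops tenfold because another gene is much more highly expressed in sample 2.

```python
import numpy as np

def rpkm(counts, gene_lengths_bp):
    """Reads per kilobase of gene length per million mapped reads.

    counts: genes x samples array of raw read counts.
    gene_lengths_bp: one gene length per row, in base pairs.
    """
    counts = np.asarray(counts, dtype=float)
    lengths_kb = np.asarray(gene_lengths_bp, dtype=float) / 1e3
    # Library size of each sample, expressed in millions of reads.
    per_million = counts.sum(axis=0) / 1e6
    return counts / lengths_kb[:, None] / per_million[None, :]

# Toy data (made up): 3 genes, 2 samples.
toy_counts = np.array([[100, 100],
                       [200, 200],
                       [700, 9700]])
toy_lengths = np.array([1000, 2000, 1000])
toy_rpkm = rpkm(toy_counts, toy_lengths)
```

FPKM follows the same arithmetic with fragments (read pairs) in place of reads.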

• DESeq: scaling-factor-based
normalization
– Per-sample median, over
genes, of the ratio of each
gene's read count to its
geometric mean across
all samples

• Cuffdiff—extension of
DESeq normalization
– Intra-condition library
scaling
– Second scaling between
conditions
– Also accounts for changes
in isoform levels
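
The median-of-ratios scaling listed for DESeq above can be sketched as follows. This is an illustration under stated assumptions, not the package's code: the function name is hypothetical, and genes with a zero count in any sample are skipped so that the geometric mean is well defined.

```python
import numpy as np

def median_of_ratios_size_factors(counts):
    """One scaling factor per sample from a genes x samples count matrix."""
    counts = np.asarray(counts, dtype=float)
    # Keep genes observed in every sample so all logs are finite.
    expressed = np.all(counts > 0, axis=1)
    log_counts = np.log(counts[expressed])
    # Log of each gene's geometric mean across samples.
    log_geo_mean = log_counts.mean(axis=1)
    # Per-sample median of log(count / geometric mean) over genes.
    log_ratios = log_counts - log_geo_mean[:, None]
    return np.exp(np.median(log_ratios, axis=0))

# Toy data (made up): sample 2 is sequenced twice as deeply as sample 1.
toy_counts = np.array([[10, 20],
                       [100, 200],
                       [50, 100],
                       [0, 5]])
size_factors = median_of_ratios_size_factors(toy_counts)
normalized = toy_counts / size_factors
```

Dividing each column by its factor puts the samples on a common scale; because the median is used, a handful of strongly changing genes has little influence on the factors.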
Normalization
• edgeR
– Trimmed means of M
values (TMM)
– Weighted average of
subset of genes
(excluding genes of high
average read counts and
genes with large
differences in
expression)

• baySeq
– Sum gene counts to
upper 25% quantile to
normalize library size

• PoissonSeq
– Goodness of fit estimate
to define a gene set that
is least differentiated
between 2 conditions,
and then used to
compute library
normalization factors
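
A rough sketch of the baySeq-style library size described above, taking the slide's wording literally: per sample, sum the counts of genes at or below the 75th percentile of nonzero counts. The function name and the choice to ignore zeros when locating the percentile are assumptions; the package's own implementation may differ.

```python
import numpy as np

def upper_quartile_library_sizes(counts):
    """Per-sample sum of counts up to the 75th percentile of nonzero counts."""
    counts = np.asarray(counts, dtype=float)
    sizes = np.empty(counts.shape[1])
    for j in range(counts.shape[1]):
        col = counts[:, j]
        # Assumption: zeros are excluded when locating the percentile.
        cutoff = np.percentile(col[col > 0], 75)
        sizes[j] = col[col <= cutoff].sum()
    return sizes

# Toy data (made up): one very highly expressed gene dominates sample 1.
toy_counts = np.array([[1, 2],
                       [2, 4],
                       [3, 6],
                       [4, 8],
                       [5, 10],
                       [1000, 50]])
toy_sizes = upper_quartile_library_sizes(toy_counts)
```

In the toy data the raw totals are 1015 and 80, but the trimmed sizes are 10 and 20, so the single dominant gene no longer drives the library size.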
Normalization
• limma (2 normalization procedures)
– Quantile normalization: sorts the counts from each
sample and sets the values equal to the mean of that
quantile across all samples
– Voom: LOWESS regression to estimate the mean-variance
relation; transforms read counts to log form for
linear modeling
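
Quantile normalization as described above can be sketched in a few lines. This is a minimal version with a hypothetical function name; ties are not averaged, which full implementations usually handle.

```python
import numpy as np

def quantile_normalize(values):
    """Give every sample (column) the same empirical distribution."""
    values = np.asarray(values, dtype=float)
    # Sort each column, remembering where each value came from.
    order = np.argsort(values, axis=0)
    sorted_vals = np.take_along_axis(values, order, axis=0)
    # Mean across samples at each rank (the "quantile mean").
    rank_means = sorted_vals.mean(axis=1)
    # Write the rank means back into the original positions.
    normalized = np.empty_like(values)
    np.put_along_axis(normalized, order, rank_means[:, None], axis=0)
    return normalized

# Toy data (made up): 3 genes, 2 samples.
toy_values = np.array([[5.0, 4.0],
                       [2.0, 1.0],
                       [3.0, 6.0]])
toy_normalized = quantile_normalize(toy_values)
```

After normalization the sorted values of every column are identical, while each gene keeps its rank within its own sample.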
Statistical modeling of gene expression
• edgeR and DESeq
– Negative binomial distribution (estimation of
dispersion factor)

• edgeR
– Estimation of dispersion factor as weighted
combination of 2 components
• Gene specific dispersion effect and common dispersion
effect calculated for all genes
Statistical modeling of gene expression
• DESeq
– Variance estimate decomposed into a combination of a
Poisson estimate and a second term that models biological
variability

• Cuffdiff
– Separate variance models for single isoform and
multiple isoform genes
• Single isoform: similar to DESeq
• Multiple isoform: mixed model of negative binomial
and beta distributions
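
The Poisson-plus-biological-variability decomposition noted for DESeq can be illustrated with the usual negative binomial mean-variance relation, variance = mean + dispersion x mean^2. The function name and the numbers below are illustrative; how each package estimates the dispersion is not shown.

```python
import numpy as np

def nb_variance(mean, dispersion):
    """Negative binomial variance split into its two components."""
    mean = np.asarray(mean, dtype=float)
    poisson_term = mean                       # sampling (shot) noise
    biological_term = dispersion * mean ** 2  # extra variability between replicates
    return poisson_term + biological_term

# Toy values (made up): a gene with mean count 100.
var_no_overdispersion = nb_variance(100.0, 0.0)
var_with_overdispersion = nb_variance(100.0, 0.1)
```

With zero dispersion the model reduces to Poisson (variance equals the mean); with dispersion 0.1 the biological term dominates at this expression level.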
Statistical modeling of gene expression
• baySeq
– Full Bayesian model of negative binomial
distributions
– Prior probability parameters are estimated by
numerical sampling of the data

• PoissonSeq
– Models gene counts as a Poisson variable
– Mean of distribution represented by log-linear
relationship of library size, expression of gene, and
correlation of gene with condition
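
The log-linear mean described for PoissonSeq can be written as log(mu) = log(depth of sample) + log(baseline expression of gene) + gamma of gene x condition label. The sketch below only evaluates that mean; the parameter names and the 0/1 coding of the condition are assumptions made for illustration, and no fitting is shown.

```python
import numpy as np

def poisson_log_linear_mean(depth, baseline, gamma, condition):
    """Expected counts (genes x samples) under a log-linear Poisson model.

    depth: per-sample library-size factor.
    baseline: per-gene expression level.
    gamma: per-gene association with the condition (0 means no DE).
    condition: per-sample label, coded 0/1 here (an assumption).
    """
    depth = np.asarray(depth, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    log_mu = (np.log(depth)[None, :]
              + np.log(baseline)[:, None]
              + np.outer(gamma, condition))
    return np.exp(log_mu)

# Toy values (made up): gene 0 is unchanged, gene 1 doubles in condition 1.
toy_mu = poisson_log_linear_mean(depth=[1.0, 2.0],
                                 baseline=[10.0, 20.0],
                                 gamma=[0.0, np.log(2.0)],
                                 condition=[0, 1])
```

Testing whether gamma differs from zero for a gene is then the test for differential expression.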
Test for differential expression
• edgeR and DESeq
– Variation of Fisher exact test modified for negative
binomial distribution
– Returns exact P value from derived probabilities

• Cuffdiff
– Ratio of normalized counts between 2 conditions
(follows normal distribution)
– t-test to calculate P value
Test for differential expression
• limma
– Moderated t-statistic with a modified standard error
and degrees of freedom

• baySeq
– Estimates 2 models for every gene
• No differential expression
• Differential expression

– Posterior likelihood of DE given the data is used to
identify differentially expressed genes (no P value)
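
The moderated t-statistic listed for limma above replaces each gene's sample variance with a value shrunk toward a prior variance, and adds the prior degrees of freedom. A minimal sketch, assuming the prior variance and prior degrees of freedom are already given (in practice they are estimated from all genes); all argument names are hypothetical.

```python
import numpy as np

def moderated_t(coef, s2, df_resid, stdev_unscaled, s2_prior, df_prior):
    """Moderated t-statistic and its degrees of freedom for one gene."""
    # Shrink the gene-wise variance toward the prior variance.
    s2_post = (df_prior * s2_prior + df_resid * s2) / (df_prior + df_resid)
    t_mod = coef / (stdev_unscaled * np.sqrt(s2_post))
    return t_mod, df_prior + df_resid

# Toy values (made up): a gene whose sample variance is unusually small.
t_mod, df_total = moderated_t(coef=2.0, s2=1.0, df_resid=4,
                              stdev_unscaled=0.5, s2_prior=4.0, df_prior=4)
t_ordinary = 2.0 / (0.5 * np.sqrt(1.0))
```

Here the ordinary t-statistic is 4.0, while the moderated one is about 2.53: the small gene-wise variance is pulled toward the prior, which guards against spuriously large statistics when there are few replicates.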
Test for differential expression
• PoissonSeq
– Test for significance of correlation term
– Evaluated by score statistics which follow a chi-squared distribution (used to derive P values)

• Multiple hypothesis corrections
– Benjamini-Hochberg
– PoissonSeq—permutation based FDR
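
The Benjamini-Hochberg correction named above can be sketched as follows: sort the P values, scale each by (number of tests / rank), then enforce monotonicity from the largest P value down. The function name is hypothetical and the P values are made up.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, in the original order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale the k-th smallest P value by n / k.
    scaled = p[order] * n / np.arange(1, n + 1)
    # Running minimum from the right keeps adjusted values non-decreasing.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted

# Toy P values (made up) for four genes.
toy_p = [0.01, 0.04, 0.03, 0.5]
toy_adjusted = benjamini_hochberg(toy_p)
```

Genes whose adjusted value falls below the chosen FDR threshold are called differentially expressed.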
Results
• Normalization and log expression correlation
• Differential expression analysis

• Evaluation of type I errors
• Evaluation of genes expressed in one condition
• Impact of sequencing depth and replication on
DE detection
RNASeq DE methods review Applied Bioinformatics Journal Club
