STUDYING THE IMPACT OF GENE LENGTH ON
DETECTION OF DIFFERENTIALLY EXPRESSED GENES
WITH RNA-SEQ TECHNOLOGY
Long Pei
Zhejiang University
College of Life Sciences
Université Libre de Bruxelles
Faculty of Applied Science
Supervisors:
Professor Jacques van Helden
Professor Esteban Zimanyi

Abstract
RNA-seq is an emerging high-throughput sequencing method that is widely used to study the
structure and function of transcriptomes. The goal of this work is to analyze the impact of
gene length on tests of differential gene expression. In the first part of the analysis, we test
the reproducibility of RNA-seq technology and the effect of normalization. In the second
part, we focus on statistical tests of differential expression and compare the differentially
expressed genes (DEG) identified by the Negative Binomial exact test and the Fisher exact
test. In the final part, we test the impact of gene length on the detection of differential
expression. Longer genes tend to be more likely to be identified as differentially expressed
because more short reads map to them.
Key Words: RNA-seq, exact test, differential gene expression

Abbreviations and Comments
DEG Differentially expressed genes
RNA Ribonucleic acid (RNA) is one of the three major macromolecules
(along with DNA and proteins) that are essential for all known forms
of life.
RNA-seq Method relying on high-throughput sequencing technology to
measure the transcriptome (concentration of each RNA).
Microarray A multiplex lab-on-a-chip on a solid substrate (usually a glass slide or
silicon thin-film cell) that assays large amounts of biological material.
NB Negative binomial distribution. A discrete probability distribution of
the number of successes in a sequence of Bernoulli trials before a
specified (non-random) number r of failures occurs.
FET Fisher exact test. The test is useful for categorical data that result
from classifying objects in two different ways; it is used to examine
the significance of the association (contingency) between the two
kinds of classification.
Sequencing depth The total number of sequence reads or base pairs
represented in a single sequencing experiment or series of
experiments.
Alternative splicing A process by which the exons of the RNA produced by transcription
of a gene (a primary gene transcript or pre-mRNA) are reconnected
in multiple ways during RNA splicing.
cDNA library A combination of cloned cDNA (complementary DNA) fragments
inserted into a collection of host cells, which together constitute
some portion of the transcriptome of the organism.
RPKM Reads mapping to the genome per kilobase of transcript per million
reads sequenced.

Background
RNA-seq technology and its current application
Broadly speaking, the transcriptome is the complete set of transcripts in a cell, and the
quantity of transcript from each gene varies with developmental stage and physiological
condition. According to Wang et al. (Zhong Wang, Mark Gerstein,
Michael Snyder, 2009), "the key aims of transcriptomics are to catalogue all species of
transcripts, including mRNAs, non-coding RNAs and small RNAs; to determine the
transcriptional structure of genes, in terms of their start sites, 5′ and 3′ ends, splicing
patterns and other post-transcriptional modifications; and to quantify the changing
expression levels of each transcript during development and under different conditions".
Since the mid-1990s, DNA microarrays have been the major biological technology for
carrying out large-scale studies of gene expression levels, which is a widely discussed aspect
of transcriptomics. The ability of these arrays to simultaneously interrogate thousands of
transcripts has led to important advances in a wide range of biological problems, including
the identification of gene expression differences among diseased and healthy tissues, and
new insights into developmental processes, pharmacogenomic responses, and the evolution
of gene regulation (John Marioni, Christopher Mason, Shrikant Mane, Matthew Stephens,
Yoav Gilad, 2008). However, these methods have several known
limitations, including reliance on existing knowledge of the genome sequence;
complicated normalization methods required to compare expression levels across
different experiments; high background levels owing to cross-hybridization; and a limited
dynamic range of detection owing to both background and saturation of signals.
Sequencing-based approaches to measuring gene expression levels have the potential to
overcome these limitations. RNA-Seq, one of the main applications of next-generation
sequencing, is a recently developed approach to transcriptome profiling that uses
deep-sequencing technologies. Studies using this method have already altered our view of
the extent and complexity of eukaryotic transcriptomes. RNA-Seq also provides a far more
precise measurement of the levels of transcripts and their isoforms than other methods. In
general, a population of RNA is converted to a library of cDNA fragments with adaptors
attached to one or both ends. Each molecule, with or without amplification, is then
sequenced in a high-throughput manner to obtain short sequences from one end (single-end
sequencing) or both ends (paired-end sequencing). The reads are typically 30–400 bp long,
depending on the DNA-sequencing technology used.

The advantages of RNA-seq
In a general view, any high-throughput sequencing technology can be used for RNA-Seq, and
the Illumina IG, Roche 454 Life Science systems and Applied Biosystems SOLiD have already
been applied for this purpose. The Helicos Biosciences tSMS system has not yet been used
for published RNA-Seq studies, but is also appropriate and has the added advantage of
avoiding amplification of target cDNA (John Marioni, Christopher Mason, Shrikant Mane,
Matthew Stephens, Yoav Gilad, 2008). Following sequencing, the resulting reads are either
aligned to a reference genome or reference transcripts (a process usually referred to as
read mapping), or assembled de novo without the genomic sequence, to produce a
genome-scale transcription map that consists of both the transcriptional structure (for
example exon and intron boundaries, or multiple transcription products due to alternative
splicing) and the level of expression of each gene.
Although RNA-Seq is still a technology under active development, it offers several key
advantages over existing technologies. First, unlike traditional hybridization-based
approaches, RNA-Seq is not limited to detecting transcripts that correspond to existing
reference genomic sequence. For example, 454-based RNA-Seq has been used to sequence
the transcriptome of the Glanville fritillary butterfly (Zhong Wang, Mark Gerstein, Michael
Snyder, 2009). This makes RNA-Seq particularly attractive for non-model organisms with
genomic sequences that are yet to be determined. A second advantage of RNA-Seq relative
to DNA microarrays is that RNA-Seq has very low, if any, background signal because DNA
sequences can be unambiguously mapped to unique regions of the genome. RNA-Seq
also has no upper limit for quantification, which correlates with the
number of sequences obtained. Consequently, it has a large dynamic range of expression
levels over which transcripts can be detected.
Motivation of the work
In a recent conference (Ana Conesa, http://www.lirmm.fr/~rivals/HTS-2011/, 2011), Ana
Conesa raised the issue of the impact of gene length and sequencing depth on the sensitivity
of statistical tests for detecting differentially expressed genes (DEG). The goal of this work is
to analyze the impact of gene length on DEG tests. In the first stage, we will test the
reproducibility of the RNA-seq technology and the effect of the process of normalization.
Then in the second stage, we will focus on the statistical tests of differential gene expression,
comparing the DEG identification by different tests. In the third stage of the work, we will
test the impact of gene length on the detection of differential expression. We will combine
two approaches:
1) Use the Negative Binomial exact test and the Fisher exact test to identify DEG, and
compare the two tests by their detection rates for up- and down-regulated genes.
2) Use experimental data to analyze the impact of gene length on the frequency of genes
declared differentially expressed by the various statistical tests.

Material and methods
Statistical analysis
All statistical analyses were carried out with the R statistical package (http://www.r-project.org/). We
downloaded the following libraries, available from the CRAN repository and from Bioconductor.
edgeR (http://www.bioconductor.org/packages/2.3/bioc/html/edgeR.html)
statmod (http://cran.r-project.org/web/packages/statmod/index.html)
Liver and Kidney Dataset
The Liver versus Kidney dataset used in our analysis was originally published by Marioni et
al. in 2008 and was later revised by Mark Robinson and Alicia Oshlack in 2010 (Mark
Robinson, Alicia Oshlack, 2010). The original dataset (John Marioni, Christopher Mason,
Shrikant Mane, Matthew Stephens, Yoav Gilad, 2008) contains 32,000 genes in total; for each
gene, the read counts of the Liver and Kidney samples are compared to detect whether the
gene is differentially expressed between the two tissues. Marioni et al. focused on testing the
reproducibility of RNA-seq technology with technical replicates, using two runs of Illumina
sequencing with seven lanes in each run. The original 14 lanes were designed to compare not
only gene expression levels between Kidney and Liver tissues but also different cDNA
concentrations. The dataset revised by Robinson and Oshlack (WEHI Bioinformatics -
Resources, 2008) retains 10 of the 14 lanes, all sequenced at the same cDNA concentration.
In our analysis we use this revised dataset, which contains 22,490 genes in total and 5 lanes
of technical replicates for each of the Liver and Kidney tissues.
Normalization of the RNA-seq data
Estimated normalization factors should follow the core principle that a gene with the same
expression level in two different samples is not detected as a differentially expressed gene
(DEG). To highlight the need for more sophisticated normalization procedures for RNA-seq
data, consider a small hypothetical example: a sequencing experiment comparing two RNA
populations, A and B. Suppose that every gene expressed in B is expressed in A with the same
number of transcripts, but that sample A also contains a second set of genes, equal in number
and expression, that are not expressed in B. Sample A thus has twice as many expressed
genes as sample B, that is, its total RNA production is twice that of sample B. Suppose that
each sample is then sequenced to the same depth. Without any additional adjustment, a
gene expressed in both samples will have, on average, half as many reads in sample A as in
sample B, since the reads of sample A are spread over twice as many genes. The correct
normalization would therefore adjust sample A by a factor of 2 (Mark Robinson, Alicia
Oshlack, 2010).
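The thought experiment can be checked numerically. The sketch below is illustrative only: the number of genes, the transcript number per gene and the sequencing depth are made-up values, and reads are assumed to be shared out exactly in proportion to transcript abundance, without sampling noise.

```python
def expected_reads_per_gene(transcripts, depth):
    """Expected reads per gene when `depth` reads are shared out in
    proportion to each gene's transcript abundance (no sampling noise)."""
    total = sum(transcripts.values())
    return {gene: depth * n / total for gene, n in transcripts.items()}

# Hypothetical populations: 100 shared genes, plus 100 genes expressed only in A.
shared = {f"shared_{i}": 50 for i in range(100)}
only_a = {f"onlyA_{i}": 50 for i in range(100)}

sample_a = {**shared, **only_a}   # twice the total RNA production of B
sample_b = dict(shared)

depth = 1_000_000                 # both libraries sequenced to the same depth
reads_a = expected_reads_per_gene(sample_a, depth)
reads_b = expected_reads_per_gene(sample_b, depth)

# A shared gene receives half as many reads in A although its expression is unchanged.
ratio = reads_a["shared_0"] / reads_b["shared_0"]
print(ratio)        # 0.5
print(ratio * 2)    # 1.0 once sample A is adjusted by a factor of 2
```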
The hypothetical example above highlights the notion that the proportion of reads
attributed to a given gene in a library depends on the expression properties of the whole
sample rather than just the expression level derived from read counts of that gene. In real
experimental conditions, there are indeed biological and even technical situations where
such a normalization is required. For example, if an RNA sample is contaminated, the reads
that represent the contamination will take away reads from the true sample, thus dropping
the number of reads of interest and offsetting the proportion for every gene. However, as
we demonstrate in Results and Discussion, true biological differences in RNA composition
between different samples will be the main reason for normalization.
For the dataset used in this analysis, the library sizes (total numbers of mapped short
reads) are 1,691,734 and 1,804,977 for the Liver and Kidney samples, respectively. Since
only two tissues are compared, a simple normalization method can be used to correct for
genes being falsely identified as differentially expressed because of the difference in
library size. A scaling factor $f$ is used to adjust both library sizes. $f$ is defined as
the ratio of the library sizes of the two tissue samples and thus reflects the relative
production of the two samples: $f = s_{Kidney} / s_{Liver}$, where $s$ represents the total
RNA-seq read production of a sample. The adjusted library sizes are obtained by dividing one
library size by $\sqrt{f}$ and multiplying the other by $\sqrt{f}$.
Identifying different expressed genes
In the statistical test procedure, we first use the sage.test function from the CRAN statmod
package (CRAN-Package statmod (R package), 2007) to calculate a Fisher exact P-value for
each gene. For two libraries, the effective library sizes are calculated by multiplying or
dividing the original library size by the square root of the estimated normalization factor.
Let us assume we sequenced mRNA from two biological samples with total counts of
10,000,000 reads for each sample. A given gene g is represented by 4 reads in the first
sample (sample A), and by 2 reads in the second sample (sample B). Before we proceed with
the Fisher test, we first introduce some notation. We represent the cells by the letters a, b, c
and d, call the totals across rows and columns marginal totals, and represent the grand total
by n. So the table now looks like this:
                   Sample A    Sample B    Total
Gene g detected    a           b           a + b
Not detected       c           d           c + d
Totals             a + c       b + d       n
Thus, in this example, a = 4, b = 2, c = 9,999,996, d = 9,999,998 and n = 2e+7.

Fisher showed that the probability of obtaining any such set of values is given by the
hypergeometric distribution:
$$p = \frac{\binom{a+b}{a}\binom{c+d}{c}}{\binom{n}{a+c}} = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!}$$
where $\binom{n}{k}$ is the binomial coefficient and the symbol “!” indicates the factorial operator.
This formula gives the exact probability of observing this particular arrangement of the data,
assuming the given marginal totals, under the null hypothesis that gene g is equally likely to
be detected in sample A and in sample B. In other words, if the probability that a read comes
from gene g is p in sample A and also p in sample B, and reads enter sample A or sample B
independently of whether or not they come from gene g, then the hypergeometric formula
gives the conditional probability of observing the values a, b, c, d in the four cells,
conditional on the observed marginal totals.
In the DEG analysis, it is important to note that the Fisher test gives a different
significance when the same ratio is observed with different absolute counts. Let us again
assume that we sequenced mRNA from two biological samples with total counts of
10,000,000 reads for each sample. Four genes are represented by 4, 40, 400 and 4000 reads
in the first sample (sample A), and by 2, 20, 200 and 2000 reads in the second sample
(sample B). The ratio is thus always the same. To apply the Fisher exact test, one option is to
load the statmod package and use the function sage.test().
        Sample A   Sample B   Total reads per sample   P-value
Gene1   4          2          1e+7                     4.531250e-01
Gene2   40         20         1e+7                     9.853448e-03
Gene3   400        200        1e+7                     1.788647e-16
Gene4   4000       2000       1e+7                     8.760637e-150
The table above shows that read counts with the same ratio yield different P-values
in the Fisher exact test. The P-value is the probability of observing a difference at least as
extreme as the one measured if the gene were not differentially expressed; the smaller the
P-value, the stronger the evidence that the gene is differentially expressed between the two
samples.
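The dependence of the P-value on absolute counts can be illustrated with a small pure-Python sketch of a two-sided Fisher exact test. This is not statmod's sage.test but a direct sum over the hypergeometric distribution, so its values are not expected to match the table digit for digit; the function name is hypothetical and only the qualitative trend (smaller P-values for larger counts at the same ratio) is the point.

```python
import math

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    marginal totals that are no more probable than the observed one.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    # Log-probabilities up to a constant, built from the ratio P(k+1)/P(k),
    # so that no factorial of a 10^7-sized number is ever evaluated.
    logw = [0.0]
    for k in range(lo, hi):
        ratio = (col1 - k) * (row1 - k) / ((k + 1) * (n - col1 - row1 + k + 1))
        logw.append(logw[-1] + math.log(ratio))
    peak = max(logw)
    weights = [math.exp(w - peak) for w in logw]
    observed = logw[a - lo]
    tail = sum(w for w, lw in zip(weights, logw) if lw <= observed + 1e-9)
    return tail / sum(weights)

library_size = 10_000_000
p_values = []
for count_a, count_b in [(4, 2), (40, 20), (400, 200), (4000, 2000)]:
    p = fisher_exact_two_sided(count_a, count_b,
                               library_size - count_a, library_size - count_b)
    p_values.append(p)
    print(count_a, count_b, p)
```

The same 2:1 ratio gives a P-value that shrinks steadily as the counts grow, which is the behaviour described in the text.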

In addition, for the two-library comparison of the Liver versus Kidney dataset, we use the
exactTest function from the edgeR package (Mark Robinson, Davis McCarthy, Gordon Smyth,
2010) with a common tag dispersion to carry out a Negative Binomial exact test. As for the
Fisher test, the effective library sizes are calculated by multiplying or dividing the original
library size by the square root of the estimated normalization factor.
The negative binomial distribution is a discrete probability distribution of the number of
successes in a sequence of Bernoulli trials before a specified (non-random) number r of
failures occurs. Its parameters are r > 0, the number of failures at which the experiment is
stopped, and p ∈ (0,1), the probability of success in each trial. The probability that the
random variable K takes the exact value k is
$$\Pr(K = k) = \binom{k + r - 1}{k} p^{k} (1 - p)^{r}, \quad k = 0, 1, 2, \ldots$$
The common dispersion is estimated from a large number of tags and is therefore
treated as fixed when applying these tests. The negative binomial model assumes that both
samples have the same variance; however, genes that are repressed in one
condition and active in another are likely to have a wider variance in the active condition
than in the repressed one. This is typical of Poisson-like distributions, in which the
variance is proportional to the mean. The common dispersion is therefore an important
parameter.
With the edgeR package, we use an exact test for the negative binomial distribution, which
has strong parallels with Fisher's exact test: the hypergeometric probabilities are replaced
by a negative binomial model to compute exact P-values that can be used to assess
differential expression (Mark Robinson, Gordon Smyth, 2008). The function exactTest in
edgeR conducts the NB exact test for pairwise comparisons of groups.
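As a minimal sketch of the negative binomial distribution in the parameterization described above (k successes before the r-th failure, success probability p), the probability mass function can be evaluated directly. The values of r and p are illustrative and are not estimated from the data; the sketch does not reproduce edgeR's exact test.

```python
import math

def nb_pmf(k, r, p):
    """P(K = k): probability of k successes before the r-th failure,
    with success probability p per trial (r may be non-integer)."""
    log_coeff = math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
    return math.exp(log_coeff + k * math.log(p) + r * math.log(1 - p))

r, p = 5, 0.3   # illustrative values only
probs = [nb_pmf(k, r, p) for k in range(200)]
mean = sum(k * q for k, q in enumerate(probs))
var = sum((k - mean) ** 2 * q for k, q in enumerate(probs))
print(sum(probs), mean, var)
```

In this parameterization the mean is r·p/(1−p) and the variance is r·p/(1−p)², so the variance always exceeds the mean, unlike in a Poisson model.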
DE ratio versus transcripts length
As a consequence of the above discussion, RNA-seq data should be more reliable for
highly expressed genes than for poorly expressed genes (an effect of expression level).
Furthermore, RNA-seq read counts should be more reliable for long genes than for short
genes, since on average one expects more reads to map to long genes than to short ones (an
effect of gene length). Read counts also tend to be more reliable for datasets with a high
sequencing depth (total number of counts) than for datasets with a low sequencing depth.
The sensitivity of RNA-Seq can thus be regarded as a function of both molar concentration
and transcript length, and the proportion of genes detected as differentially expressed
should be affected as well.
A normalization method accounting for gene length and total read count was therefore
introduced, which quantifies transcript levels in reads per kilobase of exon model
per million mapped reads (RPKM).
The RPKM measure of read density (Ali Mortazavi, Brian Williams, Kenneth McCue, Lorian
Schaeffer, Barbara Wold, 2008) reflects the molar concentration of a transcript in the starting
sample by normalizing for RNA length and for the total read number in the measurement.
This facilitates transparent comparison of transcript levels both within and between samples.
In our study, we calculate RPKM for each sample to get an intuitive idea of differential
gene expression. For the statistical testing procedure, however, RPKM is a less
reliable value than the raw read counts themselves.
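The RPKM definition translates into a one-line computation. The sketch below uses made-up read counts, transcript lengths and library size; the function name is hypothetical.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads."""
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)

# Hypothetical example: a long and a short transcript in a library of one million reads.
long_gene = rpkm(500, 2000, 1_000_000)    # 500 reads on a 2 kb transcript
short_gene = rpkm(250, 1000, 1_000_000)   # 250 reads on a 1 kb transcript
print(long_gene, short_gene)
```

Both transcripts obtain the same RPKM although the longer one collects twice as many reads, which is why RPKM is convenient for visual comparison while the raw counts carry the information needed by the statistical tests.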
Another point to consider is that the original dataset only contains the start and end
positions of each gene, whereas the impact of gene length should be explored with the
length of the transcription products. In addition, some of the Ensembl Gene IDs in
Marioni’s dataset have since been revised: when checking the Ensembl database, some
genes could not be found under the IDs provided in the original dataset. In our
analysis, we use Perl and shell scripts to extract mRNA sequences in FASTA
format and obtain the length of every transcript. Many genes in the Liver versus Kidney
dataset have multiple transcription products because alternative splicing produces different
transcripts from the same gene. We therefore carry out the analysis of differential expression
rate versus gene length using, in turn, the minimum, mean and maximum length
of the transcripts of each gene.
Since alternative splicing is a very common phenomenon in transcription,
the genes are then divided into two subsets (Alicia Oshlack, Matthew Wakefield, 2009). One
subset contains only the genes with a single transcription product; the other
contains the genes with multiple transcription products due to alternative splicing.
For the whole dataset and for each subset, the genes are sorted in increasing order of
minimum, mean or maximum mRNA length and then divided into groups of 100 genes.
The DEG ratio of each group is then easy to compute, and the analysis of the impact of gene
length is based on the plot of DEG ratio against transcription product length. In each group,
the average mRNA length is calculated as a rough estimate of the transcription product
length. We thus obtain a set of points Pi(Xi,Yi), where i is the index of the group, Xi is the
average mRNA length of the group and Yi is the DEG ratio of the group.
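The grouping procedure can be sketched as follows. The function name and the toy input are hypothetical; which length is supplied per gene (minimum, mean or maximum over its transcripts) is left to the caller, and the last group may contain fewer genes than the others.

```python
def deg_ratio_by_length(genes, group_size=100):
    """genes: iterable of (mrna_length, is_deg) pairs.
    Returns one (mean_length, deg_ratio) point per group of `group_size`
    genes taken in order of increasing mRNA length."""
    ordered = sorted(genes, key=lambda g: g[0])
    points = []
    for start in range(0, len(ordered), group_size):
        group = ordered[start:start + group_size]
        mean_length = sum(length for length, _ in group) / len(group)
        deg_ratio = sum(1 for _, is_deg in group if is_deg) / len(group)
        points.append((mean_length, deg_ratio))
    return points

# Toy data: eight genes of length 100..800 bp, every fourth one flagged as DEG.
toy_genes = [(100 * i, i % 4 == 0) for i in range(1, 9)]
points = deg_ratio_by_length(toy_genes, group_size=4)
print(points)
```

Each returned pair corresponds to one point Pi(Xi,Yi) of the plot described above.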

Results and discussion
Technical reproducibility test
The Liver versus Kidney RNA-seq dataset contains technical replicates. RNA-seq technology
is expected to show high technical reproducibility: only a small fraction of genes should
show marked variation among replicates due to sequencing errors or systematic
differences. In our statistical analysis of the RNA-seq data, we first compare the read
counts among the technical replicates in order to check the reproducibility of RNA-seq
technology (Figure 1). In the plots of read counts among replicates, for both
Liver and Kidney tissues, the data are highly correlated, as expected. The correlation tables
of both tissues show very high reproducibility among replicates (Table 1).
Figure 1. Reproducibility of the RNA-seq technology. Each plot compares the counts for each gene (represented as a dot)
between two technical replicates (sequencing obtained from the same biological sample). Left panel: five technical
replicates of kidney transcriptome. Right: liver transcriptome.
R1L1Kidney R1L3Kidney R1L7Kidney R2L2Kidney R2L6Kidney
R1L1Kidney 1 0.9994263 0.9989557 0.9991514 0.9991353
R1L3Kidney 0.9994263 1 0.9994721 0.9985344 0.9987507
R1L7Kidney 0.9989557 0.9994721 1 0.997683 0.9979635
R2L2Kidney 0.9991514 0.9985344 0.997683 1 0.9995904
R2L6Kidney 0.9991353 0.9987507 0.9979635 0.9995904 1

R1L2Liver R1L2Liver R1L2Liver R1L2Liver R1L2Liver
R1L2Liver 1 0.9998428 0.9996362 0.999604 0.9996646
R1L2Liver 0.9998428 1 0.9997997 0.9997832 0.9995034
R1L2Liver 0.9996362 0.9997997 1 0.9998645 0.999162
R1L2Liver 0.999604 0.9997832 0.9998645 1 0.9990891
R1L2Liver 0.9996646 0.9995034 0.999162 0.9990891 1
Table 1. Correlation coefficients of the RNA-seq technical replicates. The first table shows the correlation coefficients
among Kidney samples; the second table shows the correlation coefficients among Liver samples.
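A correlation table of this kind can be computed from the per-gene read counts of each pair of lanes. The sketch below assumes that the Pearson coefficient is meant (the text does not name the type of correlation) and uses invented counts for six genes.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length count vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical read counts for six genes in two technical replicates.
lane_1 = [0, 3, 12, 150, 980, 4200]
lane_2 = [1, 2, 15, 141, 1010, 4150]
r_lanes = pearson(lane_1, lane_2)
print(r_lanes)
```

Because the coefficient is dominated by the most highly expressed genes, values very close to 1 are expected whenever the largest counts agree between lanes.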
RNA-seq data normalization
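A common way to compare read counts and detect differential expression is the MA plot,
which plots the log-fold-change of each gene between two samples against its absolute
expression value. Here we define the log-fold-change (M) and the absolute expression value (A) as follows:
$$M_g = \log_2 \frac{Y_{g,Liver} / s_{Liver}}{Y_{g,Kidney} / s_{Kidney}}, \qquad A_g = \frac{1}{2} \log_2 \left( \frac{Y_{g,Liver}}{s_{Liver}} \cdot \frac{Y_{g,Kidney}}{s_{Kidney}} \right) \quad \text{for } Y_{g,Liver}, Y_{g,Kidney} \neq 0$$
In the above equations, $Y_{g,Liver}$ and $Y_{g,Kidney}$ stand for the observed counts of gene g in the Liver and
Kidney samples, respectively, and the library sizes of the Liver and Kidney samples are denoted
$s_{Liver}$ and $s_{Kidney}$.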
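As a sketch, M and A can be computed per gene as below, assuming the definitions of Robinson and Oshlack (2010): M is the log2 ratio of the library-size-scaled counts (Liver over Kidney) and A is the mean of their log2 values. The function and argument names are hypothetical, and the example counts are invented.

```python
import math

def ma_values(y_liver, y_kidney, lib_liver, lib_kidney):
    """Return (M, A) for one gene, or None when either count is zero
    (the logarithm is then undefined)."""
    if y_liver == 0 or y_kidney == 0:
        return None
    p_liver, p_kidney = y_liver / lib_liver, y_kidney / lib_kidney
    m = math.log2(p_liver / p_kidney)
    a = 0.5 * math.log2(p_liver * p_kidney)
    return m, a

# Invented example: 8 reads in Liver, 4 in Kidney, equal library sizes of 1000 reads.
example = ma_values(8, 4, 1000, 1000)
print(example)
print(ma_values(0, 4, 1000, 1000))   # None: gene with a zero count
```

Because A is an average of log2 proportions, it is negative for every gene, which matches the range of A values discussed in this section.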
As indicated in Material and methods, normalization is required because the library
sizes of the Liver and Kidney samples are unequal. We first use the MA plot to show the
effect of normalization and then compare the ratio change of the M and A values before and
after normalization. We take lanes R1L1Kidney and R1L2Liver as representatives of the
Kidney and Liver samples for this two-sample comparison; their total numbers of mapped
short reads are 1,804,977 and 1,691,734, respectively. The relative production of the two
samples is thus the ratio of the total counts, $f$ = 1.0669. The normalization factors for Liver
and Kidney are then $1/\sqrt{f}$ and $\sqrt{f}$, following
the core principle that a gene with the same expression level in two different samples is not
detected as a differentially expressed gene.

Because some genes are expressed only in Liver or only in Kidney, some read counts
are zero, and for these genes no valid A value can be defined to estimate the absolute
expression. In the MA plot (Figure 2 A,D) these genes are shown as orange points in a smear
area, with different M values but an invalid A value. The M value bias among all genes
before normalization is
log2(Liver.libsizes / Kidney.libsizes) = -0.0935, which follows from the assumption that
equal read counts could be observed and a difference still be detected because of the
unequal library sizes.
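The library-size ratio and the M value bias quoted in this section can be recomputed from the two read totals given in the text. The variable names below are hypothetical.

```python
import math

liver_reads = 1_691_734    # mapped reads, lane R1L2Liver (from the text)
kidney_reads = 1_804_977   # mapped reads, lane R1L1Kidney (from the text)

ratio = kidney_reads / liver_reads                # relative production of the two samples
m_bias = math.log2(liver_reads / kidney_reads)    # log2(Liver.libsizes / Kidney.libsizes)
print(round(ratio, 4), round(m_bias, 4))
```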
Normalization shifts the M value more strongly than the A value, as shown in
Figure 2 B,C,E,F. An elementary estimation suggests that the shift of the mean M value
(log-fold-change) is not very pronounced, because the changes for up-regulated genes
(genes with a higher expression level in Liver than in Kidney) and down-regulated genes
(genes with a higher expression level in Kidney than in Liver) go in opposite directions and
partly offset each other. The mean log-fold-change is -0.5010 before normalization and
-0.5945 after normalization. We compute, for all genes, the ratio change between the
log-fold-change calculated before and after normalization (Figure 2 H). The ratio change is
defined as:
Ratio change = (Normalized M value - Original M value) / Original M value
Not all ratio change results are valid: for genes expressed in only one tissue, a zero read
count in the other tissue (Liver or Kidney) leads to an invalid log-fold-change.
We apply a similar plot to the ratio change of absolute expression values before
and after normalization (Figure 2 I). Then, following the distribution of absolute
expression values, we examine the ratio change of the log-fold-change against absolute
expression. Since the majority of the genes have an A value between -20 and -5, this range
is divided into discrete intervals, and the mean M ratio change before and after
normalization is computed within each interval. For example, the mean M ratio change of
genes with an A value in the interval (-17,-16) represents the average for the genes in this
interval.
We then plot the valid mean M ratio changes of these intervals (a valid ratio change lies in
the range (-1,1); invalid values caused by taking the logarithm of zero counts are
excluded). The genes valid for the comparison of M values before and after
normalization make up 66.7% of the original data, and the average M ratio change due to
normalization is 0.87%. The relation between absolute expression and M ratio change is
shown by the blue curve (Figure 2 G). Overall, there is no marked trend towards a larger
M ratio change for more highly expressed genes; however, the relation between M ratio
change and A value appears to be significant for genes with very low expression levels.

Figure 2. M-A plots before and after normalization. A: M-A plot for the original data. B: Histogram of the M
value of differentially expressed genes of the original data. C: Histogram of the A value (absolute expression) of the original
data. D: M-A plot for the normalized data. E: Histogram of the M value of differentially expressed genes of the normalized
data. F: Histogram of the A value (absolute expression) of the normalized data. G: M ratio change versus the absolute
expression; the blue line shows the relationship between the average M value for each interval and the absolute
expression value. H: Histogram of the ratio change of M value before and after normalization. I: Histogram of the ratio
change of A value before and after normalization.
Identifying differential expressed genes
To test for differential expression between the Liver and Kidney samples, we carry out
the Negative Binomial exact test using the function exactTest from the edgeR package and
the Fisher exact test using the function sage.test from the statmod package. In both tests,
normalized library sizes are used instead of the original total read counts (Figure 3 A,B).
P-values are calculated for both the Negative Binomial exact test and the Fisher exact test,
and genes with p<0.01 are identified as differentially expressed in each test. The genes are
also ranked by P-value, which provides a reference for selecting the most differentially
expressed genes when a given number is needed. In the statistical testing study, we first
draw an MA plot for the NB exact test and for the Fisher exact test. The difference between
the two tests is then shown in the MA plot by selecting DEG with the NB exact test as the
first step and then with the Fisher exact test as the second step. In a similar way, we reverse
the two steps and compare the DEG selections on the MA plots produced by these two
procedures. Following this comparison, we rank the P-values and select the top 1000 DEG
among all the genes in the dataset (Figure 3 E,F).
The Negative Binomial exact test reports 7204 genes as DEG, while the Fisher exact test
identifies 7941 genes as DEG. The contingency table shows that 6918 genes are in the
intersection of both tests; 286 genes are identified as DEG by the NB exact test but not by
the Fisher exact test; conversely, 1023 genes are identified as DEG by the Fisher exact test
but not by the NB exact test (Table 2). The MA plot suggests that the NB exact test detects
relatively more up-regulated genes while the Fisher exact test detects relatively more
down-regulated genes (Figure 3 C,D). The same can be seen within the DEG identified by
each test. Among the 7204 differentially expressed genes identified by the NB exact test,
1836 genes (25.5%) are up-regulated (expression level in Liver is higher than in Kidney), and
5368 genes (74.5%) are down-regulated (expression level in Liver is lower than in Kidney).
Among the 7941 differentially expressed genes identified by the Fisher exact test, 1553
genes (19.6%) are up-regulated and 6388 genes (80.4%) are down-regulated. This
comparison illustrates the relative preference of the NB exact test for up-regulated genes
and of the Fisher exact test for down-regulated genes.
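The counts quoted in this paragraph are mutually consistent, as the following arithmetic check shows. It uses only figures stated in the text (including the total of 22490 genes in the revised dataset); the variable and function names are hypothetical.

```python
total_genes = 22490
nb_deg, fisher_deg, both = 7204, 7941, 6918   # counts reported in the text

nb_only = nb_deg - both                # DEG for the NB exact test only
fisher_only = fisher_deg - both        # DEG for the Fisher exact test only
neither = total_genes - both - nb_only - fisher_only
print(nb_only, fisher_only, neither)

def up_fraction(up, down):
    """Share of up-regulated genes among the DEG of one test."""
    return up / (up + down)

nb_up = round(100 * up_fraction(1836, 5368), 1)       # NB exact test
fisher_up = round(100 * up_fraction(1553, 6388), 1)   # Fisher exact test
print(nb_up, fisher_up)
```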
The top 1000 DEG detected by the NB exact test contain 572 down-regulated genes and 428
up-regulated genes, while the top 1000 DEG detected by the Fisher exact test contain 639
down-regulated genes and 361 up-regulated genes. The proportion of up-regulated genes
therefore decreases as the selection widens. With the NB exact test, it falls from 42.8% in
the top 1000 subset to 25.5% in the whole DEG set. With the Fisher exact test, it falls from
36.1% in the top 1000 subset to 19.6% in the whole DEG set.
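The proportion of up-regulated genes among the k best-ranked DEG can be recomputed with a few lines of Python. The sketch below is only an illustration: the function name up_fraction_in_top_k and the toy (P-value, log-fold-change) pairs are assumptions introduced here, and a positive log-fold-change is assumed to mean that the expression level in Liver is higher than in Kidney.

```python
from typing import List, Tuple


def up_fraction_in_top_k(genes: List[Tuple[float, float]], k: int) -> float:
    """Share of up-regulated genes among the k genes with the smallest P-values.

    genes: one (p_value, log_fold_change) pair per gene (hypothetical layout).
    """
    ranked = sorted(genes, key=lambda gene: gene[0])  # increasing P-value
    top = ranked[:k]
    if not top:
        raise ValueError("k must select at least one gene")
    n_up = sum(1 for _, log_fc in top if log_fc > 0)  # assumed sign convention
    return n_up / len(top)


# Toy data, not taken from the Liver versus Kidney dataset.
toy_genes = [
    (0.001, 2.0), (0.002, -1.5), (0.003, 1.2),
    (0.020, -0.8), (0.030, -2.1), (0.040, -1.1),
]
print(up_fraction_in_top_k(toy_genes, 3))  # top of the ranking
print(up_fraction_in_top_k(toy_genes, 6))  # whole set

# The percentages reported in the text follow from the reported counts.
print(round(100 * 428 / 1000, 1), round(100 * 1836 / 7204, 1))  # NB exact test
print(round(100 * 361 / 1000, 1), round(100 * 1553 / 7941, 1))  # Fisher exact test
```

Applying the same function to successive values of k (1000, 2000, and so on) would give the trend of the up-regulated proportion along the ranking.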
In addition to comparing the numbers of DEG and the up-regulated / down-regulated
proportions obtained with the NB exact test and the Fisher exact test, we draw a
significance plot, where the significance is defined as -log10(E-value), with the
significance values of the NB exact test on the X axis and those of the Fisher exact test on
the Y axis (Figure 4). Most of the points lie close to the line y=x, which is consistent
with the 6918 genes found in the intersection of the NB exact test and Fisher exact test DEG
sets.
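The significance values used for such a plot can be sketched as follows. This is a minimal illustration under one stated assumption: the E-value is taken to be the P-value multiplied by the number of tested genes, which is a common convention but is not spelled out in this section. The function name and the example P-values are hypothetical.

```python
import math


def significance(p_value: float, n_tests: int) -> float:
    """Return -log10(E-value), assuming E-value = P-value * number of tests."""
    e_value = p_value * n_tests  # assumed definition of the E-value
    if e_value <= 0.0:
        return math.inf  # a P-value reported as 0 gives unbounded significance
    return -math.log10(e_value)


N_GENES = 22490  # number of genes in the Liver versus Kidney dataset

# Hypothetical P-values of a single gene under the two tests.
sig_nb = significance(1e-12, N_GENES)      # X coordinate (NB exact test)
sig_fisher = significance(1e-13, N_GENES)  # Y coordinate (Fisher exact test)
print(sig_nb, sig_fisher)

# Vertical distance of the point from the line y = x.
print(abs(sig_fisher - sig_nb))
```

With this definition, a gene has a positive significance when its E-value is below 1.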

Figure 3. M-A plots of the statistical tests used to identify differentially expressed genes. A: M-A plot of the common tags
and DEG using the Negative Binomial exact test. B: M-A plot of the common tags and DEG using the Fisher exact test. C: M-A
plot of the common tags and DEG, using first the Negative Binomial exact test and then the Fisher exact test, to show the DEG
identified only by FET and not by NB. D: M-A plot of the common tags and DEG, using first the Fisher exact test and then the
Negative Binomial exact test, to show the DEG identified only by NB and not by FET. E: M-A plot of the common tags and top
1000 DEG using the Negative Binomial exact test. F: M-A plot of the common tags and top 1000 DEG using the Fisher exact test.
                        Fisher exact test
NB exact test           Non-significant     Significant (DEG)
Non-significant         14263               1023
Significant (DEG)       286                 6918
Table 2. Contingency table of the DEG identified by the Negative Binomial exact test and the Fisher exact test.
Non-significant means that the gene is not identified as a differentially expressed gene. Significant means that the gene is
identified as a differentially expressed gene.
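The four cells of Table 2 follow from simple set operations on the two DEG lists. The sketch below is an illustration only: the function name contingency_counts is an assumption, and the integer identifiers are synthetic placeholders chosen so that the set sizes match the reported totals (22490 genes, 7204 NB DEG, 7941 Fisher DEG, 6918 in common); they are not real Ensembl Gene IDs.

```python
def contingency_counts(all_genes, nb_deg, fisher_deg):
    """Count the four cells of a 2x2 table from sets of gene identifiers."""
    return {
        "both": len(nb_deg & fisher_deg),
        "nb_only": len(nb_deg - fisher_deg),
        "fisher_only": len(fisher_deg - nb_deg),
        "neither": len(all_genes - (nb_deg | fisher_deg)),
    }


# Synthetic identifiers that only reproduce the reported set sizes.
all_genes = set(range(22490))
nb_deg = set(range(0, 7204))        # 7204 DEG with the NB exact test
fisher_deg = set(range(286, 8227))  # 7941 DEG with the Fisher exact test

cells = contingency_counts(all_genes, nb_deg, fisher_deg)
print(cells)
# {'both': 6918, 'nb_only': 286, 'fisher_only': 1023, 'neither': 14263}
```

The four counts sum to 22490, the total number of genes in the dataset.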
Figure 4. Significance plot of the Fisher exact test versus the Negative Binomial exact test. The significance is defined as
-log10(E-value) and indicates how strongly a gene is supported as a differentially expressed gene.
Impact of transcription product (mRNA) length
It is commonly assumed that RNA-seq read counts are more reliable for long genes than for
short genes: at a given expression level, a longer transcript offers more positions to which
short reads can be mapped, and it therefore receives more reads. From this point of view, we
can expect gene length to have an impact on the measured gene expression level and on the
detection rate of differentially expressed genes.
In order to have a valid biological meaning, the gene length should be defined as the
mRNA length rather than the genomic length calculated from the start and end sites of the
gene. With the Ensembl Gene IDs contained in the original dataset, we retrieve the sequence
information from the RSAT platform (http://rsat.ulb.ac.be/) and save the results in a FASTA
file. The original dataset was published in 2008, and the Ensembl Gene IDs have been revised
in recent years as the technology for detecting genes and annotating known genes has
progressed.
When checking the Ensembl database, we could not find some of the genes with the IDs
provided in the original dataset. Moreover, alternative splicing has to be taken into
consideration when analyzing the length of transcription products, since many genes of the
original dataset have several isoforms.
We extract the mRNA lengths of the genes that return a valid result for an Ensembl Gene ID
query. Among the 22490 genes contained in the original Liver versus Kidney dataset, 19182
genes are reported to have at least one transcription product. Together, these genes have
116104 transcripts resulting from alternative splicing. Since each gene is tested only once
for differential expression, with the reads mapped to all of its mRNA isoforms counted
together, we need one representative transcript length per gene. We therefore use the
minimum, mean and maximum isoform lengths of each gene as alternative measures of gene
length, and analyze the relationship between gene length and the ratio of differentially
expressed genes between the Liver and Kidney samples.
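Summarizing the isoforms of each gene by their minimum, mean and maximum length can be sketched as below. This is an illustration under assumptions: the (gene identifier, transcript length) record layout, the function name summarize_lengths and the toy identifiers geneA and geneB are hypothetical and do not reflect the actual RSAT export format.

```python
from collections import defaultdict
from statistics import mean


def summarize_lengths(records):
    """Group transcript lengths by gene and return per-gene summary values.

    records: iterable of (gene_id, transcript_length_bp) pairs (assumed layout).
    """
    lengths_by_gene = defaultdict(list)
    for gene_id, length in records:
        lengths_by_gene[gene_id].append(length)
    return {
        gene_id: {
            "n_transcripts": len(lengths),
            "min": min(lengths),
            "mean": mean(lengths),
            "max": max(lengths),
        }
        for gene_id, lengths in lengths_by_gene.items()
    }


# Toy records with hypothetical identifiers and lengths.
toy_records = [
    ("geneA", 1200), ("geneA", 1800), ("geneA", 3000),
    ("geneB", 950),
]
summary = summarize_lengths(toy_records)
print(summary["geneA"])  # three isoforms: min, mean and max differ
print(summary["geneB"])  # single isoform: min, mean and max coincide
```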
We first check the distributions of the minimum, mean and maximum transcript lengths
(Figure 5). Each distribution shows a peak, but the position of the peak differs between
the minimum and the maximum transcript lengths. This might be related to gene ontology: it
could reflect the presence of a large number of genes belonging to a particular family
expressed in the Liver and Kidney tissues.
Since we have, for each gene, the transcript lengths and the outcome of the test indicating
whether or not the gene is regarded as DEG, we plot the ratio of differentially expressed
genes within a range of transcript lengths against the mean transcript length of that
range. We rank the 19182 genes by transcript length and then divide them into 192 groups
according to the ranking, with 100 genes in each group (82 genes in the last group). This
division into 192 groups gives an estimate of the impact of transcript length on the DEG
detection ratio (Figure 6).
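The grouping procedure described above can be sketched as follows: sort the genes by length, cut the ranking into consecutive blocks of 100 genes, and compute the mean length and the DEG ratio of each block. The function name, the (length, is_deg) layout and the synthetic data are assumptions made for illustration; only the number of genes (19182) is taken from the text.

```python
def deg_ratio_by_length(genes, group_size=100):
    """Return (mean_length, deg_ratio, n_genes) for each length-ranked group.

    genes: list of (transcript_length, is_deg) pairs (assumed layout).
    """
    ranked = sorted(genes, key=lambda gene: gene[0])  # increasing length
    groups = []
    for start in range(0, len(ranked), group_size):
        block = ranked[start:start + group_size]  # last block may be shorter
        mean_length = sum(length for length, _ in block) / len(block)
        deg_ratio = sum(1 for _, is_deg in block if is_deg) / len(block)
        groups.append((mean_length, deg_ratio, len(block)))
    return groups


# Synthetic genes: arbitrary lengths, every fourth gene flagged as DEG.
synthetic = [(100 + i, i % 4 == 0) for i in range(19182)]
groups = deg_ratio_by_length(synthetic)
print(len(groups))    # number of groups
print(groups[-1][2])  # size of the last group
print(groups[0])      # mean length and DEG ratio of the first group
```

Plotting the second value of each tuple against the first gives the kind of curve shown in Figure 6.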

Figure 5. Distribution of the lengths of gene transcription products. Each distribution plot shows the minimum, mean or
maximum mRNA length for all the genes, for the differentially expressed genes detected by the Negative Binomial exact test,
and for the differentially expressed genes detected by the Fisher exact test.

Figure 6. Ratio of differentially expressed genes (within each group) versus mRNA length. Each plot compares the ratio of
differentially expressed genes with the mRNA length. The first column shows the Negative Binomial exact test results for the
minimum, mean and maximum mRNA lengths, while the second column shows the Fisher exact test results for the same length
parameters.
Among the 19182 genes, 4563 genes (23.8%) have a single transcription product and 14619
genes (76.2%) have multiple transcription products due to alternative splicing. Following
the method used for dividing all genes into 192 groups, we build two subsets, the single
product genes and the alternative splicing genes, and construct similar groupings of 46
groups for the single mRNA subset and 146 groups for the alternative splicing subset.
The genes of each subset are sorted in increasing order of mean transcript length for the
single mRNA subset (for which the minimum, mean and maximum lengths are identical) and of
maximum transcript length for the alternative splicing subset. Without alternative splicing
as a factor, the single mRNA subset makes it easier to explore the relationship between
gene length and DEG ratio. The analysis below therefore treats separately the genes having
a single mRNA transcription product and the genes having several isoforms due to
alternative splicing. The distribution of the single mRNA lengths is shown in Figure 7.
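The separation into the two subsets and their respective sort keys can be sketched as below. The function names, the dictionary layout and the toy gene identifiers are assumptions introduced for illustration only.

```python
def split_by_isoform_count(lengths_by_gene):
    """Split genes into single-transcript and multi-transcript subsets.

    lengths_by_gene: dict mapping a gene identifier to the list of its
    transcript lengths (assumed layout).
    """
    single, multi = {}, {}
    for gene_id, lengths in lengths_by_gene.items():
        target = single if len(lengths) == 1 else multi
        target[gene_id] = lengths
    return single, multi


def sorted_by_length(subset, key_func):
    """Return the gene identifiers of a subset in increasing order of key_func."""
    return sorted(subset, key=lambda gene_id: key_func(subset[gene_id]))


# Toy data with hypothetical identifiers and lengths.
toy_lengths = {
    "geneA": [1500],
    "geneB": [900, 2400, 3100],
    "geneC": [700],
    "geneD": [1200, 1800],
}
single, multi = split_by_isoform_count(toy_lengths)

# Single mRNA subset: mean length (equal to the min and max for one isoform).
single_order = sorted_by_length(single, lambda ls: sum(ls) / len(ls))
# Alternative splicing subset: maximum isoform length.
multi_order = sorted_by_length(multi, max)
print(single_order, multi_order)
```

Each ordered subset can then be cut into groups of 100 genes in the same way as the complete gene set.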

Figure 7. Distribution of mRNA length for single transcription products. Histogram of the transcription product lengths for
the genes that have a single mRNA transcription product.
Figure 8. Length curves of mRNA length. Each plot shows the average of the minimum, mean or maximum mRNA length within
each group. The group index runs from 1 to 192, after sorting the genes by increasing mRNA length.

Figure 9. Length curves of the single mRNA and alternative splicing subsets. The upper plot shows the length curve of the
single mRNA subset, sorted in increasing order, which contains 46 groups. The other plot shows the length curve of the
alternative splicing subset, sorted in increasing order, which contains 146 groups.
After dividing the sorted genes into groups, we plot the mean length within each group
(every group contains 100 genes) for all the genes, for the single mRNA subset and for the
alternative splicing subset (Figure 8, Figure 9). We then compute the DE ratio obtained with
the Negative Binomial exact test and the Fisher exact test for the two subsets separately.
The results suggest that the single mRNA subset gives clearer information about the impact
of transcript length on the ratio of differentially expressed genes, while the alternative
splicing subset seems to show a steeper slope between transcript length and DE ratio
(Figure 10). As stated above, a particular gene family, in the sense of gene ontology, might
be over-represented, so it is not surprising that the alternative splicing subset gives a
less obvious correlation between transcript length and DE ratio than the single mRNA
subset.

Figure 10. Ratio of differentially expressed genes versus mRNA length. Each plot shows the relationship between the DEG
ratio (within each group) and the mRNA length, from the results obtained separately with the Negative Binomial exact test and
the Fisher exact test, for genes producing a single mRNA and for genes having multiple transcripts due to alternative splicing.

Summary and Conclusions
The goal of this work is to analyze the impact of gene length on differential gene expression
tests. In the first part of our analysis, we test the reproducibility of the RNA-seq
technology and the effect of normalization. The very high correlation coefficients obtained
among the different sample replicates show the strong reproducibility of the RNA-seq
technology. In order to test the effect of normalization, we use the log-fold-change (M) and
the absolute expression (A) to compare how these values change between the original library
sizes and the adjusted library sizes.
In the second step, we focus on the statistical tests of differential gene expression,
comparing the DEG identified by different tests. We use the Negative Binomial exact test and
the Fisher exact test to detect DEG. From the MA plot results, it appears that the Negative
Binomial exact test performs better at detecting up-regulated genes, while the Fisher exact
test seems to be more sensitive at identifying down-regulated genes. We also extract the top
1000 differentially expressed genes, ranked by increasing P-value, for the NB exact test and
the Fisher exact test respectively. Further work could explore the trend of the proportions
of up-regulated and down-regulated genes across successive groups of DEG, such as the top
1000 DEG, the genes ranked 1001 to 2000, and so on up to the complete set of differentially
expressed genes.
In the final stage of the work, we test the impact of gene length on differential gene
expression. We extract the mRNA lengths of the genes that return a valid result for an
Ensembl Gene ID query. Among the 22490 genes contained in the original Liver versus Kidney
dataset, 19182 genes are reported to have at least one transcription product. As for the use
of RPKM to analyze differentially expressed genes, applying statistical tests to RPKM
values is not considered a good idea, according to Simon Anders: “If you want to test
for differential expression, it is a good idea to stay on the level of raw, integer counts, and
not use RPKM or related data that is normalized by transcript length. This is because
significance depends on the number of actual reads that you count. If you have low count
you need to see a high fold-change to call significance.”
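A small numerical example illustrates the point of this quotation. RPKM (reads per kilobase of transcript per million mapped reads, as defined by Mortazavi et al., 2008) divides the read count by the transcript length and by the library size, so two genes can share the same RPKM while being supported by very different numbers of reads. The gene lengths, read counts and library size below are hypothetical values chosen for the illustration.

```python
def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of transcript per million mapped reads."""
    return 1e9 * read_count / (gene_length_bp * total_mapped_reads)


LIBRARY_SIZE = 10_000_000  # hypothetical number of mapped reads

# Hypothetical short and long genes expressed at the same RPKM.
short_gene = rpkm(read_count=10, gene_length_bp=500, total_mapped_reads=LIBRARY_SIZE)
long_gene = rpkm(read_count=100, gene_length_bp=5000, total_mapped_reads=LIBRARY_SIZE)
print(short_gene, long_gene)  # 2.0 2.0
```

Both genes have an RPKM of 2.0, but the long gene is backed by ten times more reads, so a count-based test has more evidence available for it than for the short gene.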
We plot the ratio of differentially expressed genes within a range of transcript lengths
(every 100 genes forming a group) against the mean transcript length of that range. We then
separate the 19182 genes into two subsets, the genes with a single transcription product and
the genes with alternative splicing, and compute the differential expression ratio obtained
with the Negative Binomial exact test and the Fisher exact test for the two subsets
separately. All the results obtained from these tests show that gene length does have an
impact on the detection rate of differentially expressed genes. Longer genes tend to have a
higher probability of being identified as differentially expressed.

References
Ali Mortazavi, Brian Williams, Kenneth McCue, Lorian Schaeffer, Barbara Wold. (2008). Mapping and
quantifying mammalian transcriptomes by RNA-Seq. Nature Methods, 5:621-628.
Alicia Oshlack, Matthew Wakefield. (2009). Transcript length bias in RNA-seq data confounds systems
biology. Biology Direct, 4:14.
Ana Conesa. (2011). Differential expression with RNASeq: length and depth does matter. Paris.
Retrieved from http://www.lirmm.fr/~rivals/HTS-2011/
Bioconductor. (2003). Retrieved from http://www.bioconductor.org/
CRAN-Package statmod (R package). (2007). Retrieved from
http://cran.r-project.org/web/packages/statmod/index.html
James Bradford, Yvonne Hey, Tim Yates, Yaoyong Li, Stuart Pepper, Crispin Miller. (2010). A
comparison of massively parallel nucleotide sequencing with oligonucleotide microarrays for global
transcription profiling. BMC Genomics, 11:282.
John Marioni, Christopher Mason, Shrikant Mane, Matthew Stephens, Yoav Gilad. (2008). RNA-seq:
an assessment of technical reproducibility and comparison with gene expression arrays. Genome
Research, 18:1509-1517.
Mark Robinson, Alicia Oshlack. (2010). A scaling normalization method for differential expression
analysis of RNA-seq data. Genome Biology, 11:R25.
Mark Robinson, Davis McCarthy, Gordon Smyth. (2010). edgeR: a Bioconductor package for
differential expression analysis of digital gene expression data. Bioinformatics, 26:139-140.
Mark Robinson, Gordon Smyth. (2008). Small sample estimation of negative binomial dispersion,
with applications to SAGE data. Biostatistics, 9:321-332.
WEHI Bioinformatics - Resources. (2008). Retrieved from http://bioinf.wehi.edu.au/resources/
Zhong Wang, Mark Gerstein, Michael Snyder. (2009). RNA-Seq: a revolutionary tool for
transcriptomics. Nature Reviews Genetics, 10:57-63.
