1. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
Revision
Normalization and cufflinks
2. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
Normalization of read count
R/FPKM (Mortazavi et al., 2008) - Reads/Fragments per kilobase of exon per million mappable reads
• Corrects for: differences in sequencing depth and transcript length
• Aiming to: compare a gene across samples and different genes within a sample
TMM (Robinson and Oshlack, 2010) - Trimmed mean of M values
• Corrects for: differences in transcript pool composition; extreme outliers
• Aiming to: provide better across-sample comparability
3. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
Normalization of read count
Limma voom (logCPM) (Law et al., 2013) - log counts per million
• Aiming to: stabilize the variance by removing its dependence on the mean
TPM (Li et al., 2010; Wagner et al., 2012) - Transcripts per million
• Corrects for: transcript length distribution in RNA pool
• Aiming to: provide better across-sample comparability
4. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
R/FPKM (Mortazavi et al., 2008)
• FPKM is used for paired-end reads and RPKM for single-end reads.
• "Fragment" means a fragment of DNA, so the two reads that make up a paired-end read count as one.
• "Per kilobase of exon" means the fragment counts are normalized by dividing by the total length of all exons in the gene, in kilobases. This makes it possible to compare Gene A with Gene B even if they are of different lengths.
• "Per million reads" means this value is then normalized against the library size, in millions of mapped reads. This makes it possible to compare Gene A in Sample 1 with Gene A in Sample 2.
5. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
R/FPKM (Mortazavi et al., 2008)
A quantification measurement for gene expression
R = C / (L × N), with L in kb and N in millions
• R: expression level of the gene
• L: length of the gene
• N: depth of the sequencing (total mappable reads)
• C: total number of reads that fall into the gene region
Example: the total exon size of a gene is 3,000 nt. Calculate the expression level of this gene in RPKM in an RNA-seq experiment that contained 50 million mappable reads, with 600 reads falling into the exon regions of this gene.
R = 600 / (3.0 × 50) = 4.00
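The worked example above can be checked with a small shell function. This is only a sketch: the function name rpkm and its argument order are made up for illustration, and it takes the gene length in nt and the depth in reads, converting them to kb and millions internally.

```shell
#!/usr/bin/env bash
# rpkm READS GENE_LENGTH_NT TOTAL_MAPPED_READS
# Hypothetical helper implementing R = C / (L x N),
# with L converted to kb and N converted to millions.
rpkm() {
  awk -v c="$1" -v len="$2" -v n="$3" \
    'BEGIN { printf "%.2f\n", c / ((len / 1000) * (n / 1000000)) }'
}

# Slide example: 600 reads, 3,000 nt of exon, 50 million mappable reads
rpkm 600 3000 50000000
```

For the slide example this prints 4.00, the value obtained by hand.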
6. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
Calculation of FPKM/RPKM
Raw read counts:
Genes      Sample 1   Sample 2   Sample 3
1 (2kb)    20         24         60
2 (4kb)    40         50         120
3 (1kb)    10         16         30
4 (10kb)   0          0          2
Total      70         90         212
Total reads for samples 1, 2 and 3: 7M, 9M and 21.2M
(for simplicity, ten reads in the table stand for one million reads)
Step 1. Divide the reads of each gene by the total reads of the sample, in millions
Genes      Sample 1 (RPM)   Sample 2 (RPM)   Sample 3 (RPM)
1 (2kb)    2.86             2.67             2.83
2 (4kb)    5.71             5.56             5.66
3 (1kb)    1.43             1.78             1.42
4 (10kb)   0                0                0.09
7. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
Fragments/Reads per kilobase per million reads
Reads are scaled for both depth and length
Step 2. Divide the values obtained in step 1 by the gene lengths in kb
Genes                    Sample 1 (RPKM)   Sample 2 (RPKM)   Sample 3 (RPKM)
1 (2kb)                  1.43              1.33              1.42
2 (4kb)                  1.43              1.39              1.42
3 (1kb)                  1.43              1.78              1.42
4 (10kb)                 0                 0                 0.009
Total normalized reads   4.29              4.50              4.25
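The two steps can be reproduced with awk. The sketch below assumes the slide's convention that ten reads stand for one million, so the depth in millions is the column total divided by 10. The function name rpkm_table and the whitespace-separated input layout (gene, length in kb, then one count column per sample) are choices made for this illustration only.

```shell
#!/usr/bin/env bash
# Toy counts from the slides: gene, length in kb, raw counts per sample.
counts='gene len_kb S1 S2 S3
g1 2 20 24 60
g2 4 40 50 120
g3 1 10 16 30
g4 10 0 0 2'

# Hypothetical helper: prints RPKM per gene and the column totals.
rpkm_table() {
  awk '
    NR == 1 { next }                              # skip the header row
    {
      n++; name[n] = $1; len[n] = $2
      for (s = 3; s <= NF; s++) { cnt[n, s] = $s; depth[s] += $s }
      cols = NF
    }
    END {
      for (g = 1; g <= n; g++) {
        row = name[g]
        for (s = 3; s <= cols; s++) {
          rpm = cnt[g, s] / (depth[s] / 10)       # step 1: scale by depth
          rpkm = rpm / len[g]                     # step 2: scale by length
          sum[s] += rpkm
          row = row sprintf("\t%.3f", rpkm)
        }
        print row
      }
      row = "Total"
      for (s = 3; s <= cols; s++) row = row sprintf("\t%.3f", sum[s])
      print row
    }'
}

printf '%s\n' "$counts" | rpkm_table
```

The last line of the output holds the column sums, which differ from sample to sample.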
8. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
Calculation of TPM
Raw read counts:
Genes      Sample 1   Sample 2   Sample 3
1 (2kb)    20         24         60
2 (4kb)    40         50         120
3 (1kb)    10         16         30
4 (10kb)   0          0          2
Step 1. Divide the reads of each gene by the length of the gene in kb
Genes      Sample 1 (RPK)   Sample 2 (RPK)   Sample 3 (RPK)
1 (2kb)    10               12               30
2 (4kb)    10               12.5             30
3 (1kb)    10               16               30
4 (10kb)   0                0                0.2
Total      30               40.5             90.2
Total reads per kb of gene for samples 1, 2 and 3: 3M, 4.05M and 9.02M
(for simplicity, ten reads in the table stand for one million reads)
9. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
Calculation of TPM
Values obtained in step 1:
Genes      Sample 1 (RPK)   Sample 2 (RPK)   Sample 3 (RPK)
1 (2kb)    10               12               30
2 (4kb)    10               12.5             30
3 (1kb)    10               16               30
4 (10kb)   0                0                0.2
Step 2. Divide the values obtained in step 1 by the total RPK of the sample, in millions (3, 4.05 and 9.02)
Genes      Sample 1 (TPM)   Sample 2 (TPM)   Sample 3 (TPM)
1 (2kb)    3.33             2.96             3.326
2 (4kb)    3.33             3.09             3.326
3 (1kb)    3.33             3.95             3.326
4 (10kb)   0                0                0.02
Total      10               10               10
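The TPM steps can be sketched in awk as well. The function name tpm_table and the input layout are assumptions for illustration. The scale defaults to 10 to match the toy table, where ten reads stand for one million; with real data, pass 1000000 so that each sample sums to one million.

```shell
#!/usr/bin/env bash
# Toy counts from the slides: gene, length in kb, raw counts per sample.
counts='gene len_kb S1 S2 S3
g1 2 20 24 60
g2 4 40 50 120
g3 1 10 16 30
g4 10 0 0 2'

# Hypothetical helper: tpm_table [SCALE], SCALE defaults to 10.
tpm_table() {
  awk -v scale="${1:-10}" '
    NR == 1 { next }                              # skip the header row
    {
      n++; name[n] = $1
      for (s = 3; s <= NF; s++) {
        rpk[n, s] = $s / $2                       # step 1: reads per kb
        total[s] += rpk[n, s]
      }
      cols = NF
    }
    END {
      for (g = 1; g <= n; g++) {
        row = name[g]
        for (s = 3; s <= cols; s++) {
          tpm = rpk[g, s] / total[s] * scale      # step 2: scale by total RPK
          sum[s] += tpm
          row = row sprintf("\t%.3f", tpm)
        }
        print row
      }
      row = "Total"
      for (s = 3; s <= cols; s++) row = row sprintf("\t%.3f", sum[s])
      print row
    }'
}

printf '%s\n' "$counts" | tpm_table
```

Because the length correction comes first and the depth correction second, every column sums to the same total.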
10. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
RPKM vs TPM
Genes                    Sample 1 (RPKM)   Sample 2 (RPKM)   Sample 3 (RPKM)
1 (2kb)                  1.43              1.33              1.42
2 (4kb)                  1.43              1.39              1.42
3 (1kb)                  1.43              1.78              1.42
4 (10kb)                 0                 0                 0.009
Total normalized reads   4.29              4.50              4.25
Genes                    Sample 1 (TPM)    Sample 2 (TPM)    Sample 3 (TPM)
1 (2kb)                  3.33              2.96              3.326
2 (4kb)                  3.33              3.09              3.326
3 (1kb)                  3.33              3.95              3.326
4 (10kb)                 0                 0                 0.02
Total normalized reads   10                10                10
Sums of total normalized reads: the TPM columns sum to the same value in every sample, whereas the RPKM sums differ between samples, so only TPM values can be read as comparable proportions across samples.
11. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
TMM - Trimmed Mean of M Values
Attempts to correct for differences in RNA composition between samples.
E.g. if certain genes are very highly expressed in one tissue but not in another, there will be less "sequencing real estate" left for the less expressed genes in that tissue, and RPKM normalization (or similar) will give biased expression values for them compared to the other sample.
[Figure: RNA population 1 and RNA population 2]
Equal sequencing depth -> yellow and green will get lower RPKM in RNA population 1, although the expression levels are actually the same in populations 1 and 2.
Robinson and Oshlack, Genome Biology 2010, 11:R25, http://genomebiology.com/2010/11/3/R25
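A toy calculation illustrates the composition effect. The counts below are invented for illustration: both populations are sequenced to the same depth, yellow and green are equally expressed in both, and a hypothetical red gene takes half of the reads in population 1. The sketch averages the per-gene log2 ratios (M values) of the genes seen in both samples and converts the mean back into a scaling factor. It is not the edgeR implementation, which additionally trims extreme M and A values and weights the genes.

```shell
#!/usr/bin/env bash
# Invented counts: gene, reads in population 1, reads in population 2.
toy='gene pop1 pop2
yellow 25 50
green 25 50
red 50 0'

# Hypothetical helper: mean of M values over shared genes, as a factor.
composition_factor() {
  awk '
    NR == 1 { next }                               # skip the header row
    { c1[NR] = $2; c2[NR] = $3; n1 += $2; n2 += $3; last = NR }
    END {
      for (i = 2; i <= last; i++) {
        if (c1[i] > 0 && c2[i] > 0) {              # genes seen in both samples
          m += log((c1[i] / n1) / (c2[i] / n2)) / log(2)
          k++
        }
      }
      printf "%.2f\n", exp((m / k) * log(2))       # 2 to the power of mean M
    }'
}

printf '%s\n' "$toy" | composition_factor
```

The factor of 0.50 says that depth-scaled values of the shared genes in population 1 are half of what they are in population 2; dividing by it brings yellow and green back to equal values.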
12. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
Identification of differentially expressed genes
Quality filtered/trimmed RNA-Seq short reads
Reference genome available (Y/N)?
Y: Mapping to the reference (GMAP-GSNAP, TopHat, Bowtie, etc.)
   • FPKM based strategy: calculate transcript abundances (Cufflinks), then detection of DEGs (cuffdiff2)
   • Count based strategy: generate count data (RSEM), then detection of DEGs (DESeq, edgeR, EBSeq)
N: De novo transcriptome assembly (Trinity), then mapping and detection of DEGs (RSEM)
13. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
Genome Mapping and Alignment using GMAP - GSNAP
Genomic Mapping and Alignment Program
• GMAP is a standalone program for mapping and aligning cDNA sequences to a
genome.
• The program maps and aligns a single sequence with minimal startup time and
memory requirements, and provides fast batch processing of large sequence sets.
• The program generates accurate gene structures, even in the presence of
substantial polymorphisms and sequence errors, without using probabilistic splice
site models.
• GMAP initially used a hashing scheme but later adopted a much more efficient double lookup scheme.
Step 1. Command for indexing the genome:
gmap_build -d btau8 bosTau8.fa
14. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
Step 2. Mapping the reads to the genome
The index files created by gmap_build are stored in the folder btau8.
gsnap -d btau8 -t 4 -A sam control_R1.fastq > control_R1.sam
15. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
• The end product of the GSNAP aligner is a SAM file, which needs to be converted into a BAM file for further analysis in cufflinks.
• Repeat the same for the other replicates by changing the input file name.
• A total of four SAM files are generated separately.
The BAM files generated can be analysed in two ways -
1. The BAM files can be used to generate a merged assembly of transcripts via cufflinks and cuffmerge. This merged assembly (i.e. merged.gtf) is used in cuffdiff to generate differentially expressed genes.
2. Cuffdiff can be used directly on the BAM files to generate differentially expressed genes.
16. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
Step 3. Converting SAM to BAM using samtools
./samtools view -bSh aln.sam > aln.bam
-b: output in the BAM format. -S: input is in the SAM format (detected automatically by samtools 1.0 and later). -h: include the header in the output.
For the control sample:
./samtools view -bSh control_R1.sam > control_R1.bam
For the infected sample:
./samtools view -bSh infected_R1.sam > infected_R1.bam
17. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
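Step 4. Sorting BAM using samtools
Command for sorting: ./samtools sort aln.bam aln.sorted
(This is the legacy samtools 0.1.x syntax, in which the second argument is the output prefix. samtools 1.3 and later use: samtools sort -o aln.sorted.bam aln.bam)
Example:
For the control sample:
./samtools sort control_R1.bam control_R1_sorted
For the infected sample:
./samtools sort infected_R1.bam infected_R1_sorted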
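Steps 3 and 4 have to be repeated for every replicate, so a loop saves typing. The sketch below assumes samtools 1.3 or later (the -o form of view and sort) and the four sample names used in these slides. The run wrapper and the RUN variable are made up for this example: by default the commands are only printed, so the loop can be checked before anything is executed.

```shell
#!/usr/bin/env bash
# Sample names follow the slides; adjust them to your own files.
samples="control_R1 control_R2 infected_R1 infected_R2"

# Dry run by default: print each command instead of executing it.
# Set RUN=1 to execute (requires samtools on the PATH).
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

sam_to_sorted_bam() {
  for s in $samples; do
    run samtools view -b -o "$s.bam" "$s.sam"         # SAM -> BAM
    run samtools sort -o "${s}_sorted.bam" "$s.bam"   # coordinate sort
  done
}

sam_to_sorted_bam
```

Running it with RUN=1 in the directory that holds the SAM files produces the *_sorted.bam files used in the following steps.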
18. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
Step 5. (Option 1) Differential expression using cufflinks, cuffmerge and cuffdiff
Command for running cufflinks on a BAM file
For the control sample:
cufflinks -G btau8refflat.gtf -g btau8refflat.gtf -b bosTau8.fa -u -L CN control_R1_sorted.bam
19. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
For the infected sample:
cufflinks -G btau8refflat.gtf -g btau8refflat.gtf -b bosTau8.fa -u -L CN infected_R1_sorted.bam
These commands generate a transcripts.gtf file for each replicate. These files are then used in cuffmerge to generate a merged assembly, which in turn is used in cuffdiff to generate differentially expressed genes.
20. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
Command for running cuffmerge
cuffmerge -g btau8refflat.gtf -s bosTau8.fa -p 8 assemblies.txt
assemblies.txt is the file with the list of all the GTFs.
This generates a merged.gtf in the merged_asm folder. This file is
used in the next cuffdiff command.
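The list file for cuffmerge can be built with find. This sketch assumes that every cufflinks run wrote its transcripts.gtf into its own output folder, for example through the cufflinks -o option; the slides do not show this, and the folder layout and the function name list_assemblies are hypothetical.

```shell
#!/usr/bin/env bash
# list_assemblies DIR
# Hypothetical helper: prints the path of every transcripts.gtf below DIR,
# one per line, which is the format expected in assemblies.txt.
list_assemblies() {
  find "$1" -name transcripts.gtf | sort
}

list_assemblies . > assemblies.txt
```

Each line of assemblies.txt then points to the GTF of one replicate.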
Command for running cuffdiff
cuffdiff merged_asm/merged.gtf control_R1_sorted.bam,control_R2_sorted.bam infected_R1_sorted.bam,infected_R2_sorted.bam
(Replicates of the same condition are separated by commas, conditions by a space.)
This command generates many files, out of which gene_exp.diff is the file
of our concern.
21. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
Step 5. (Option 2) Differential expression using CuffDiff directly
CuffDiff computes differentially expressed genes between sets of samples. To compute differential expression, at least two samples (infected and control) are required. CuffDiff should always be run on replicates, i.e. N infected vs N control.
Command:
cuffdiff -p <int> -N transcripts.gtf
-p: number of threads (--num-threads <int>). -N: upper-quartile normalization (--upper-quartile-norm).
Running cuffdiff for our BAM files:
cuffdiff -p 3 -N -o cuffdiff_out btau8refflat.gtf control_R1_sorted.bam,control_R2_sorted.bam infected_R1_sorted.bam,infected_R2_sorted.bam
22. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
gene_exp.diff
Columns of the cuffdiff output file gene_exp.diff:
• test_id: a unique identifier describing the object (gene, transcript, CDS, primary transcript)
• gene_id: Gene ID
• gene: Gene Name
• locus: genomic coordinates for easy browsing to the genes or transcripts being tested
• sample_1 and sample_2: the two conditions compared (here Control and Infected)
• status: OK (test successful), NOTEST (not enough alignments for testing), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents testing
• value_1: FPKM in Sample 1
• value_2: FPKM in Sample 2
• log2(fold_change): the (base 2) log of the fold change y/x
• test_stat: the value of the test statistic used to compute significance of the observed change in FPKM
• p_value: the uncorrected p-value of the test statistic
Log2 fold change = Log2(FPKM of infected / FPKM of control)
= Log2(0.576748/3.92513) = -2.76673
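The fold-change arithmetic can be verified on the command line. The function name log2fc is made up for this sketch; awk has no log2 function, so the natural logarithm is divided by log(2).

```shell
#!/usr/bin/env bash
# log2fc FPKM_CONTROL FPKM_INFECTED
# Hypothetical helper: log2 of the fold change y/x.
log2fc() {
  awk -v x="$1" -v y="$2" 'BEGIN { printf "%.4f\n", log(y / x) / log(2) }'
}

# FPKM values from the slide: control 3.92513, infected 0.576748
log2fc 3.92513 0.576748
```

The result agrees with the value in gene_exp.diff to four decimal places; a negative value means lower expression in the infected sample.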
23. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
Identification of Differentially expressed genes - I
(using RSEM - EBSeq)
24. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
Quality filtered/trimmed RNA-Seq short reads
Reference genome: download the reference genome and GTF from the Ensembl genome browser
Mapping to the reference (Bowtie)
Count based strategy: calculate transcript abundances (RSEM)
Detection of DEGs (DESeq, edgeR, EBSeq)
25. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
RSEM (RNA-Seq by Expectation-Maximization)
(Li and Dewey, 2011)
RSEM is an RNA-Seq analysis package that offers an end-to-end solution for differential expression and simplifies the whole process. It also introduced a more robust unit of RNA-Seq measurement called TPM.
Step 1. Downloading and installing RSEM
wget http://deweylab.biostat.wisc.edu/rsem/src/rsem-1.2.19.tar.gz
tar -xvzf rsem-1.2.19.tar.gz
cd rsem-1.2.19
make
Step 2. Prerequisites for running RSEM: Perl, R and Bowtie must be installed. Perl and R are normally already present on most computers.
26. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
RSEM (RNA-Seq by Expectation-Maximization)
Step 3. Downloading and installing Bowtie
Download Bowtie from http://sourceforge.net/projects/bowtie-bio/files/bowtie/1.1.1/
Step 4. Copy bowtie into your path or add the bowtie path to your bash profile
Copying bowtie into your path:
sudo cp -R /Users/appleserver/Desktop/bowtie2 /usr/local/bin
Adding the bowtie path to the bash profile (preferred):
export PATH="/Users/ravikumar/Desktop/bowtie2:$PATH"
Then run: source ~/.bash_profile
To check whether the path has been added: echo $PATH
27. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
Step 5. Downloading the reference, gunzipping and concatenating
Download the Bos taurus genome from the Ensembl genome browser. An easier alternative is to use the wget command for a direct download on the HPC:
wget -m ftp://ftp.ensembl.org/pub/release-81/fasta/bos_taurus/dna/ && for f in $(find . -name "*.gz")
28. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
The folder that is created is as below
29. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
Direct download of each individual chromosome and gtf from the
ftp site can be done
30. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
The files downloaded are gunzipped using -
gunzip Bos_taurus.UMD3.1.dna.chromosome.*.fa.gz
Concatenating/combining all the fasta file into a combined fasta file
(reference):
cat Bos_taurus.UMD3.1.dna.chromosome.*.fa > combined.fa
Step 6. Download the annotation file in GTF format.
Command for downloading: wget -m ftp://ftp.ensembl.org/pub/release-81/gtf/bos_taurus
The downloaded GTF file needs to be modified to extract only the exon annotations.
awk command to extract the exon annotations from the GTF:
awk '$3 == "exon"' Bos_taurus.UMD3.1.8.1.gtf > filtered.gtf
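The effect of the exon filter can be seen on a tiny example. The two GTF lines below are invented for illustration and are not real annotation; the function name exons_only is also hypothetical. The field separator is set to a tab explicitly, since GTF is tab-delimited and the attribute column contains spaces.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only GTF records whose third column is "exon".
exons_only() {
  awk -F '\t' '$3 == "exon"'
}

# Two made-up GTF records: one gene line and one exon line.
printf 'chr1\tsrc\tgene\t1\t100\t.\t+\t.\tgene_id "g1";\nchr1\tsrc\texon\t1\t50\t.\t+\t.\tgene_id "g1";\n' | exons_only
```

Only the exon record is printed; the gene record, and any header lines starting with #, are dropped.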
31. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
filtered.gtf
original gtf
32. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
Step 7. Prepare reference using RSEM
To prepare the reference sequences, run the 'rsem-prepare-reference' program:
./rsem-prepare-reference --gtf filtered.gtf --bowtie2 combined.fa BT
This creates 12 index files named with the prefix BT; the Bowtie 2 index files have the extension .bt2.
33. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
Step 8. Calculating expression values in counts, TPM and FPKM:
To calculate expression values, run the 'rsem-calculate-expression' program.
For the control sample:
./rsem-calculate-expression --bowtie2 control_R1.fastq BT ControlR1
For the infected sample:
./rsem-calculate-expression --bowtie2 infected_R1.fastq BT infectedR1
Six files are generated for each sample; genes.results is the most important of the six.
34. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
Step 9. Combining the RSEM genes.results files of all samples:
RSEM produces "expected counts" (gene counts). After these expected counts are rounded to the nearest integer, EBSeq, DESeq or edgeR can be used to identify differentially expressed genes.
./rsem-generate-data-matrix *.genes.results > genes.results
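The rounding step can be done with awk before the matrix is handed to a count-based tool. The sketch assumes a tab-delimited matrix with one header row of sample names and the gene ID in the first column; the function name round_counts and the sample values are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: round every count to the nearest integer,
# leaving the header row and the gene ID column untouched.
round_counts() {
  awk 'BEGIN { FS = OFS = "\t" }
       NR == 1 { print; next }                    # keep the header row
       { for (i = 2; i <= NF; i++) $i = int($i + 0.5); print }'
}

# Made-up example row: expected counts 10.49 and 3.50
printf '\tsampleA\tsampleB\ngene1\t10.49\t3.50\n' | round_counts
```

int(x + 0.5) rounds halves upwards, which is adequate here because expected counts are never negative.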
35. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
Differential expression using EBSeq (Leng et al., 2013):
Empirical Bayesian approach for RNA-Seq data analysis
EBSeq is an R package for identifying genes and isoforms differentially expressed (DE) across two or more biological conditions in an RNA-seq experiment. EBSeq uses RSEM counts as input to identify differentially expressed genes.
Step 1. Installing EBSeq:
To install, type the following commands in R:
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("EBSeq")
(R versions before 3.5 used source("https://bioconductor.org/biocLite.R") followed by biocLite("EBSeq").)
Step 2. Command for loading the package EBSeq
> library(EBSeq)
Step 3. Command for getting the working directory
> getwd()
36. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
Step 4. Command for setting the working directory
> setwd("<path to your working directory>")
Step 5. Input requirement for gene level DE analysis:
The input file formats supported by EBSeq are .csv, .xls, .xlsx and .txt (tab delimited). In the input file, rows should be the genes and columns should be the samples.
Example of the data set in .txt format (genesresults.txt) that is used here
37. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
Step 6. Commands to Run EBSeq:
> x=data.matrix(read.table("genesresults.txt"))
> dim(x)
[1] 24596 4
> str(x)
num [1:24596, 1:4] 615 3 0 473 1 286 832 362 103 17 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:24596] "ENSBTAG00000000005" "ENSBTAG00000000008"
"ENSBTAG00000000009" "ENSBTAG00000000010" ...
..$ : chr [1:4] "infectedR1.genes.results" "infectedR2.genes.results"
"ControlR1.genes.results" "ControlR2.genes.results"
> Sizes=MedianNorm(x)
> EBOut=EBTest(Data=x,
+ Conditions=as.factor(rep(c("C1","C2"),each=2)),sizeFactors=Sizes,
+ maxround=5)
38. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
Removing transcripts with 75 th quantile < = 10
12071 transcripts will be tested
iteration 1 done
time 0.12
iteration 2 done
time 0.13
iteration 3 done
time 0.08
iteration 4 done
> PP=GetPPMat(EBOut)
> str(PP)
num [1:12071, 1:2] 1 1 0 0 1 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:12071] "ENSBTAG00000000005"
"ENSBTAG00000000010" "ENSBTAG00000000012"
"ENSBTAG00000000013" ...
..$ : chr [1:2] "PPEE" "PPDE"
39. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
> DEfound=rownames(PP)[which(PP[,"PPDE"]>=.95)]
> str(DEfound)
chr [1:6528] "ENSBTAG00000000012" "ENSBTAG00000000013"
"ENSBTAG00000000015" "ENSBTAG00000000019"
"ENSBTAG00000000021" "ENSBTAG00000000025"
"ENSBTAG00000000026" "ENSBTAG00000000032" ...
> write.table(DEfound,"DE.txt",sep = "\t",quote = F,col.names=F)
> GeneFC=PostFC(EBOut)
> write.table(GeneFC,"FC.txt",sep = "\t",quote = F,col.names=F)
Output
GeneID PostFC Real FC comparison
40. Computational Biology and Genomics Facility, Indian Veterinary Research Institute
The posterior fold change estimations will give less extreme values for
low expressers. e.g. if gene1 has Y = 5000 and X = 1000, its FC and
PostFC will both be 5. If gene2 has Y = 5 and X = 1, its FC will be 5 but
its PostFC will be < 5 and closer to 1. Therefore when we sort the
PostFC, gene2 will be less significant than gene1.