Computational Biology and Genomics Facility, Indian Veterinary Research Institute
Revision
Normalization and cufflinks
Normalization of read count
R/FPKM (Mortazavi et al., 2008) - Reads/Fragments per kilobase of exon per million mappable reads
• Corrects for: differences in sequencing depth and transcript length
• Aiming to: compare a gene across samples and different genes within a sample
TMM (Robinson and Oshlack, 2010) - Trimmed mean of M-values
• Corrects for: differences in transcript pool composition; extreme outliers
• Aiming to: provide better across-sample comparability
Normalization of read count (continued)
Limma voom (logCPM) (Law et al., 2013) - Log counts per million
• Aiming to: stabilize the variance by removing its dependence on the mean
TPM (Li et al., 2010; Wagner et al., 2012) - Transcripts per million
• Corrects for: transcript length distribution in the RNA pool
• Aiming to: provide better across-sample comparability
R/FPKM (Mortazavi et al., 2008)
• FPKM is used for paired-end reads and RPKM for single-end reads.
• Fragment means a fragment of DNA (cDNA), so the two reads that make up a paired-end read count as one.
• Per kilobase of exon means the fragment counts are normalized by dividing by the total length of all exons in the gene (in kb).
• This makes it possible to compare Gene A to Gene B even if they are of different lengths.
• Per million reads means the value is then normalized against the library size (in millions of reads).
• This makes it possible to compare Gene A in Sample 1 to Gene A in Sample 2.
R/FPKM (Mortazavi et al., 2008)
A quantification measurement for gene expression
R = C / (L × N), with L in kb and N in millions
• R: expression level of the gene
• L: length of the gene
• N: depth of the sequencing
• C: total number of reads that fall into the gene region
Example: the total exon size of a gene is 3,000 nt. Calculate the expression level of this gene in RPKM in an RNA-seq experiment that contained 50 million mappable reads, with 600 reads falling into the exon regions of this gene.
R = 600 / (3 × 50) = 4.00
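The formula R = C / (L × N) can be sketched in a few lines of Python. This is only an illustration of the arithmetic on the slide; the function name `rpkm` and its argument names are assumptions made for this example and do not come from any RNA-seq package.

```python
def rpkm(reads_in_gene, exon_length_nt, total_mapped_reads):
    """Return R = C / (L * N), with L in kilobases and N in millions of reads."""
    length_kb = exon_length_nt / 1_000              # L: exon length in kb
    depth_millions = total_mapped_reads / 1_000_000  # N: library size in millions
    return reads_in_gene / (length_kb * depth_millions)


# Worked example from the slide: 600 reads, 3,000 nt of exon, 50 million reads.
example_value = rpkm(600, 3_000, 50_000_000)
print(example_value)  # 4.0
```

For paired-end data the same function gives FPKM if `reads_in_gene` and `total_mapped_reads` are counted in fragments rather than reads.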
Calculation of FPKM/RPKM
Genes      Sample 1   Sample 2   Sample 3
1 (2kb)    20         24         60
2 (4kb)    40         50         120
3 (1kb)    10         16         30
4 (10kb)   0          0          2
Total      70         90         212
Total reads for samples 1, 2 and 3: 7M, 9M and 21.2M
(millions of reads are represented here on a scale of tens of reads)
Step 1. Divide the reads of each gene by the total reads of the sample (in millions) to get reads per million (RPM)
Genes      Sample 1 (RPM)   Sample 2 (RPM)   Sample 3 (RPM)
1 (2kb)    2.86             2.67             2.83
2 (4kb)    5.71             5.56             5.66
3 (1kb)    1.43             1.78             1.42
4 (10kb)   0                0                0.09
Fragments/Reads per kilobase per million reads
Step 2. Divide the values obtained in step 1 by the gene lengths (in kb)
Genes                    Sample 1 (RPKM)   Sample 2 (RPKM)   Sample 3 (RPKM)
1 (2kb)                  1.43              1.33              1.42
2 (4kb)                  1.43              1.39              1.42
3 (1kb)                  1.43              1.78              1.42
4 (10kb)                 0                 0                 0.009
Total normalized reads   4.29              4.50              4.25
Reads are now scaled for both depth and length.
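The two RPKM steps (scale by depth, then by length) can be reproduced with a short Python sketch. The gene labels, the dictionary layout and the function name `rpkm_table` are assumptions made for this illustration; the counts, lengths and totals are the ones from the slide.

```python
# Raw counts per gene for samples 1, 2 and 3 (values from the slide).
counts = {
    "gene1": [20, 24, 60],
    "gene2": [40, 50, 120],
    "gene3": [10, 16, 30],
    "gene4": [0, 0, 2],
}
lengths_kb = {"gene1": 2, "gene2": 4, "gene3": 1, "gene4": 10}
totals_millions = [7.0, 9.0, 21.2]  # total reads per sample, in millions


def rpkm_table(counts, lengths_kb, totals_millions):
    """Step 1: divide by library size (RPM). Step 2: divide by gene length in kb."""
    table = {}
    for gene, row in counts.items():
        rpm = [c / n for c, n in zip(row, totals_millions)]
        table[gene] = [value / lengths_kb[gene] for value in rpm]
    return table


rpkm_values = rpkm_table(counts, lengths_kb, totals_millions)
rpkm_sums = [sum(column) for column in zip(*rpkm_values.values())]

for gene, row in rpkm_values.items():
    print(gene, [round(v, 3) for v in row])
print("column sums", [round(s, 2) for s in rpkm_sums])
```

The column sums differ from sample to sample, which is the property that the comparison with TPM later in the deck relies on.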
Calculation of TPM
Step 1. Divide the reads of each gene by the length of the gene (in kb) to get reads per kilobase (RPK)
Genes      Sample 1   Sample 2   Sample 3
1 (2kb)    20         24         60
2 (4kb)    40         50         120
3 (1kb)    10         16         30
4 (10kb)   0          0          2
Genes      Sample 1 (RPK)   Sample 2 (RPK)   Sample 3 (RPK)
1 (2kb)    10               12               30
2 (4kb)    10               12.5             30
3 (1kb)    10               16               30
4 (10kb)   0                0                0.2
Total      30               40.5             90.2
Total reads per kb of gene for samples 1, 2 and 3: 3M, 4.05M and 9.02M
(millions of reads are represented here on a scale of tens of reads)
Calculation of TPM
Step 2. Divide the RPK values obtained in step 1 by the total RPK of the sample in millions (3, 4.05 and 9.02 on the scale used here)
Genes      Sample 1 (TPM)   Sample 2 (TPM)   Sample 3 (TPM)
1 (2kb)    3.33             2.96             3.33
2 (4kb)    3.33             3.09             3.33
3 (1kb)    3.33             3.95             3.33
4 (10kb)   0                0                0.02
Total      10               10               10
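The TPM calculation reverses the order of the two RPKM steps: length first, then depth. A minimal Python sketch follows; the function name `tpm_table` and the `scale` parameter are assumptions for this example. Real TPM uses a scale of one million, while `scale=10` reproduces the reduced "tens of reads" scale used on the slides.

```python
counts = {
    "gene1": [20, 24, 60],
    "gene2": [40, 50, 120],
    "gene3": [10, 16, 30],
    "gene4": [0, 0, 2],
}
lengths_kb = {"gene1": 2, "gene2": 4, "gene3": 1, "gene4": 10}


def tpm_table(counts, lengths_kb, scale=1_000_000):
    """Step 1: reads per kilobase (RPK). Step 2: divide by the sample's total RPK."""
    genes = list(counts)
    n_samples = len(next(iter(counts.values())))
    rpk = {g: [c / lengths_kb[g] for c in counts[g]] for g in genes}
    rpk_totals = [sum(rpk[g][j] for g in genes) for j in range(n_samples)]
    return {
        g: [rpk[g][j] / rpk_totals[j] * scale for j in range(n_samples)]
        for g in genes
    }


tpm_values = tpm_table(counts, lengths_kb, scale=10)
tpm_sums = [sum(column) for column in zip(*tpm_values.values())]

for gene, row in tpm_values.items():
    print(gene, [round(v, 2) for v in row])
print("column sums", [round(s, 2) for s in tpm_sums])
```

Because each sample is divided by its own RPK total, every column sums to the chosen scale regardless of sequencing depth.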
RPKM vs TPM
Genes                    Sample 1 (RPKM)   Sample 2 (RPKM)   Sample 3 (RPKM)
1 (2kb)                  1.43              1.33              1.42
2 (4kb)                  1.43              1.39              1.42
3 (1kb)                  1.43              1.78              1.42
4 (10kb)                 0                 0                 0.009
Total normalized reads   4.29              4.50              4.25
Genes                    Sample 1 (TPM)    Sample 2 (TPM)    Sample 3 (TPM)
1 (2kb)                  3.33              2.96              3.33
2 (4kb)                  3.33              3.09              3.33
3 (1kb)                  3.33              3.95              3.33
4 (10kb)                 0                 0                 0.02
Total normalized reads   10                10                10
Sums of total normalized reads: the TPM values add up to the same total in every sample, whereas the RPKM totals differ between samples. A TPM value is therefore the same proportion of the total in every sample, which makes it easier to compare across samples.
TMM - Trimmed Mean of M-values
Attempts to correct for differences in RNA composition between samples.
E.g. if certain genes are very highly expressed in one tissue but not in another, there will be less "sequencing real estate" left for the less expressed genes in that tissue, and RPKM normalization (or similar) will give biased expression values for them compared to the other sample.
[Figure: RNA population 1 and RNA population 2] At equal sequencing depth, the yellow and green genes will get lower RPKM in RNA population 1, although their expression levels are actually the same in populations 1 and 2.
Robinson and Oshlack, Genome Biology 2010, 11:R25, http://genomebiology.com/2010/11/3/R25
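The composition effect described here can be shown with a few invented numbers. The sketch below is not the TMM algorithm; it only illustrates the bias that TMM is designed to correct. The gene names, abundances and the function `reads_per_million` are all hypothetical, and the genes are assumed to have equal length so that reads are shared out in proportion to abundance.

```python
def reads_per_million(abundance):
    """Share one million reads among genes in proportion to their abundance."""
    total = sum(abundance.values())
    return {gene: amount / total * 1_000_000 for gene, amount in abundance.items()}


# Hypothetical RNA populations: "yellow" and "green" are equally expressed in both,
# but population 1 also contains a very highly expressed "red" gene.
population_1 = {"yellow": 100, "green": 100, "red": 800}
population_2 = {"yellow": 100, "green": 100, "red": 50}

rpm_1 = reads_per_million(population_1)
rpm_2 = reads_per_million(population_2)

for gene in ("yellow", "green"):
    print(gene, round(rpm_1[gene]), round(rpm_2[gene]))
```

Although "yellow" and "green" have the same true abundance in both populations, their depth-normalized values are four times lower in population 1, because the "red" gene takes up most of the reads there.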
Identification of differentially expressed genes
Quality filtered/trimmed RNA-Seq short reads
• Reference genome available (Y):
  - Mapping to the reference (GMAP-GSNAP, TopHat, Bowtie, etc.)
  - FPKM-based strategy: calculate transcript abundances (Cufflinks), then detect DEGs (cuffdiff2)
  - Count-based strategy: generate count data (RSEM), then detect DEGs (DESeq, edgeR, EBSeq)
• No reference genome (N):
  - De novo transcriptome assembly (Trinity)
  - Mapping and detection of DEGs (RSEM)
Genome Mapping and Alignment using GMAP - GSNAP
Genomic Mapping and Alignment Program
• GMAP is a standalone program for mapping and aligning cDNA sequences to a genome.
• The program maps and aligns a single sequence with minimal startup time and memory requirements, and provides fast batch processing of large sequence sets.
• The program generates accurate gene structures, even in the presence of substantial polymorphisms and sequence errors, without using probabilistic splice site models.
• It initially used a hashing scheme but later moved to a much more efficient double lookup scheme.
Step 1. Command for indexing the genome:
gmap_build -d btau8 bosTau8.fa
The index files are created in the folder btau8.
Step 2. Mapping the reads to the genome
gsnap -d btau8 -t 4 -A sam control_R1.fastq > control_R1.sam
-d: genome database. -t: number of threads. -A sam: write the alignments in the SAM format.
• The end product of the GSNAP aligner is a SAM file, which needs to be converted into a BAM file for further analysis in cufflinks.
• Repeat the same for the other replicates by changing the input file name.
• A total of four SAM files are generated separately.
The BAM files generated can be analysed in two ways:
1. The BAM files can be used to generate a merged assembly of transcripts via cufflinks and cuffmerge. This merged assembly (i.e. merged.gtf) is used in cuffdiff to identify differentially expressed genes.
2. Cuffdiff can be used directly on the BAM files to identify differentially expressed genes.
Step 3. Converting SAM to BAM using samtools
./samtools view -bSh aln.sam > aln.bam
-b: output in the BAM format. -S: input is in the SAM format (needed by older samtools releases; newer ones detect the input format automatically). -h: include the header in the output.
For the Control sample:
./samtools view -bSh control_R1.sam > control_R1.bam
For the Infected sample:
./samtools view -bSh infected_R1.sam > infected_R1.bam
Step 4. Sorting BAM using samtools
Command for sorting: ./samtools sort aln.bam aln.sorted
This is the samtools 0.1.x syntax, in which the second argument is an output prefix to which .bam is appended. With samtools 1.3 or later, use: samtools sort -o aln.sorted.bam aln.bam
Example:
For the Control sample:
./samtools sort control_R1.bam control_R1_sorted
For the Infected sample:
./samtools sort infected_R1.bam infected_R1_sorted
Step 5. (Option 1) Differential expression using cufflinks, cuffmerge and cuffdiff.
Command for running cufflinks on a BAM file
For the Control sample:
cufflinks -G btau8refflat.gtf -g btau8refflat.gtf -b bosTau8.fa -u -L CN control_R1_sorted.bam
For the Infected sample:
cufflinks -G btau8refflat.gtf -g btau8refflat.gtf -b bosTau8.fa -u -L CN infected_R1_sorted.bam
-G / -g: reference annotation in GTF format. -b: genome FASTA used for fragment bias correction. -u: multi-read correction. -L: label used as the prefix of the assembled transcript IDs.
These commands generate a transcripts.gtf file for each replicate. These files are then used in cuffmerge to generate a merged assembly, which in turn is used in cuffdiff to identify differentially expressed genes.
Command for running cuffmerge
cuffmerge -g btau8refflat.gtf -s bosTau8.fa -p 8 assemblies.txt
assemblies.txt is a text file listing all the transcripts.gtf files, one per line.
This generates merged.gtf in the merged_asm folder. This file is used in the next cuffdiff command.
Command for running cuffdiff
cuffdiff merged_asm/merged.gtf control_R1_sorted.bam,control_R2_sorted.bam infected_R1_sorted.bam,infected_R2_sorted.bam
Replicates of the same condition are separated by commas and the two conditions by a space.
This command generates many files, of which gene_exp.diff is the one of interest.
Step 5. (Option 2) Differential expression using CuffDiff directly
CuffDiff computes the differentially expressed genes in the set. To compute differential expression, at least two samples (infected and control) are required. CuffDiff should always be run on replicates, i.e. N infected vs N control.
Command:
cuffdiff -p <int> -N transcripts.gtf <control BAM files> <infected BAM files>
-p: num-threads <int>. -N: upper-quartile normalization.
Running cuffdiff for our BAM files
cuffdiff -p 3 -N -o cuffdiff_out btau8refflat.gtf control_R1_sorted.bam,control_R2_sorted.bam infected_R1_sorted.bam,infected_R2_sorted.bam
gene_exp.diff
Columns of the output file:
• test_id: a unique identifier describing the object (gene, transcript, CDS, primary transcript)
• gene_id: Gene ID
• gene: Gene Name
• locus: genomic coordinates for easy browsing to the genes or transcripts being tested
• sample_1: Control
• sample_2: Infected
• status: OK (test successful), NOTEST (not enough alignments for testing), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents testing
• value_1: FPKM in Sample 1
• value_2: FPKM in Sample 2
• log2(fold_change): the (base 2) log of the fold change y/x
• test_stat: the value of the test statistic used to compute the significance of the observed change in FPKM
• p_value: the uncorrected p-value of the test statistic
Log2 fold change = Log2(FPKM of infected / FPKM of control)
= Log2(0.576748/3.92513) = -2.76673
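The log2 fold change on this slide can be checked with a small Python sketch. The function name `log2_fold_change` is an assumption for this example; it simply applies log2(y/x) to the two FPKM values quoted above and is not a reimplementation of cuffdiff.

```python
import math


def log2_fold_change(fpkm_control, fpkm_infected):
    """Return log2(y / x), where x is the control FPKM and y the infected FPKM."""
    return math.log2(fpkm_infected / fpkm_control)


# FPKM values taken from the gene_exp.diff example on the slide.
example_log2fc = log2_fold_change(3.92513, 0.576748)
print(round(example_log2fc, 3))
```

A negative value means the gene has a lower FPKM in the infected sample than in the control. The sketch assumes both FPKM values are greater than zero, since the ratio is undefined otherwise.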
Identification of Differentially expressed genes - I (using RSEM - EBSeq)
Count-based strategy, starting from quality filtered/trimmed RNA-Seq short reads:
• Download the reference genome and GTF from the Ensembl genome browser
• Mapping to the reference (Bowtie)
• Calculate transcript abundances (RSEM)
• Detection of DEGs (DESeq, edgeR, EBSeq)
RSEM (RNA-Seq by Expectation-Maximization) (Li and Dewey, 2011)
RSEM is an RNA-Seq analysis package that provides an end-to-end solution for differential expression and simplifies the whole process. It also introduces TPM, a more robust unit of RNA-Seq measurement.
Step 1. Downloading RSEM and installing
wget http://deweylab.biostat.wisc.edu/rsem/src/rsem-1.2.19.tar.gz
tar -xvzf rsem-1.2.19.tar.gz
cd rsem-1.2.19
make
Step 2. Prerequisites for running RSEM: Perl, R and Bowtie must be installed. Perl and R are normally present on most computers.
Step 3. Downloading Bowtie and installing
Download Bowtie from http://sourceforge.net/projects/bowtie-bio/files/bowtie/1.1.1/
Step 4. Copy bowtie to a directory in your path or add the bowtie path to your bash profile
Copying bowtie to your path:
sudo cp -R /Users/appleserver/Desktop/bowtie2 /usr/local/bin
Adding the bowtie path to the bash profile (preferred). Add this line to ~/.bash_profile:
export PATH="/Users/ravikumar/Desktop/bowtie2:$PATH"
Then run:
source ~/.bash_profile
echo $PATH - to check whether the path has been added
Step 5. Downloading the reference, gunzipping and concatenating
Download the Bos taurus genome from the Ensembl genome browser. An easier alternative is to use the wget command for a direct download on the HPC:
wget -m ftp://ftp.ensembl.org/pub/release-81/fasta/bos_taurus/dna/ && for f in $(find . -name "*.gz")
[Screenshot: the folder that is created]
Each individual chromosome and the GTF can also be downloaded directly from the FTP site.
The downloaded files are gunzipped using:
gunzip Bos_taurus.UMD3.1.dna.chromosome.*.fa.gz
Concatenating all the FASTA files into a combined FASTA file (reference):
cat Bos_taurus.UMD3.1.dna.chromosome.*.fa > combined.fa
Step 6. Download the annotation file in GTF format.
Command for downloading: wget -m ftp://ftp.ensembl.org/pub/release-81/gtf/bos_taurus
The downloaded GTF file needs to be modified to extract only the exon annotations.
awk command to extract the exon annotations from the GTF:
awk '$3 == "exon"' Bos_taurus.UMD3.1.81.gtf > filtered.gtf
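The exon filter can also be written as a short Python sketch, which makes the rule explicit: keep a line only if its third field (the feature type) is exactly "exon". The function name `keep_exons` and the sample GTF lines are invented for this illustration, and unlike awk's default whitespace splitting this version splits on tabs, which is how GTF columns are separated.

```python
def keep_exons(gtf_lines):
    """Yield the GTF lines whose third tab-separated field (feature) is 'exon'."""
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header/comment lines carry no feature field
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[2] == "exon":
            yield line


# Hypothetical GTF lines, for illustration only.
sample_gtf = [
    "#!genome-build UMD3.1\n",
    '1\tensembl\tgene\t100\t900\t.\t+\t.\tgene_id "G1";\n',
    '1\tensembl\texon\t100\t300\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    '1\tensembl\tCDS\t150\t300\t.\t+\t0\tgene_id "G1"; transcript_id "T1";\n',
]

exon_lines = list(keep_exons(sample_gtf))
print(len(exon_lines))  # 1
```

To filter a real file, the same generator can be fed an open file handle and its output written line by line to filtered.gtf.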
[Screenshots: the original GTF and filtered.gtf]
Step 7. Prepare reference using RSEM
To prepare the reference sequences, run the 'rsem-prepare-reference' program.
Command for preparing the reference:
./rsem-prepare-reference --gtf filtered.gtf --bowtie2 combined.fa BT
This creates 12 index files with the name BT, including the Bowtie 2 index files with the extension .bt2.
Step 8. Calculating expression values in counts, TPM and FPKM:
To calculate expression values, run the 'rsem-calculate-expression' program.
Command for running rsem-calculate-expression:
For the Control sample:
./rsem-calculate-expression --bowtie2 control_R1.fastq BT ControlR1
For the Infected sample:
./rsem-calculate-expression --bowtie2 infected_R1.fastq BT infectedR1
Six files are generated for each sample. genes.results is the most important of the six.
Step 9. Combining the RSEM genes.results of all the samples:
RSEM produces "expected counts" or "gene counts" values. After rounding these expected count values to the nearest integer, they can be used in EBSeq, DESeq or edgeR to identify differentially expressed genes.
./rsem-generate-data-matrix *.genes.results > genes.results
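What this step does can be sketched in Python: take the expected_count column of each genes.results table, round it to an integer, and put the samples side by side. This is an illustration only, not the rsem-generate-data-matrix script itself. The function names, the two tiny tables and their values are hypothetical; the column names follow the genes.results header (gene_id, transcript_id(s), length, effective_length, expected_count, TPM, FPKM).

```python
import csv
import io

HEADER = "gene_id\ttranscript_id(s)\tlength\teffective_length\texpected_count\tTPM\tFPKM\n"


def expected_counts(genes_results_text):
    """Map gene_id -> expected_count rounded to the nearest integer."""
    reader = csv.DictReader(io.StringIO(genes_results_text), delimiter="\t")
    # Python's round() sends exact .5 ties to the nearest even integer.
    return {row["gene_id"]: round(float(row["expected_count"])) for row in reader}


def count_matrix(samples):
    """samples: dict of sample name -> genes.results text. Returns (header, rows)."""
    per_sample = {name: expected_counts(text) for name, text in samples.items()}
    genes = sorted(set().union(*per_sample.values()))
    header = ["gene_id"] + list(per_sample)
    rows = [[g] + [per_sample[s].get(g, 0) for s in per_sample] for g in genes]
    return header, rows


# Hypothetical genes.results content for two samples.
samples = {
    "ControlR1": HEADER + "G1\tT1\t1000\t900\t615.37\t10.0\t8.0\nG2\tT2\t2000\t1900\t2.61\t0.1\t0.1\n",
    "infectedR1": HEADER + "G1\tT1\t1000\t900\t472.80\t9.0\t7.5\nG2\tT2\t2000\t1900\t0.00\t0.0\t0.0\n",
}

matrix_header, matrix_rows = count_matrix(samples)
print(matrix_header)
for row in matrix_rows:
    print(row)
```

The result has genes as rows and samples as columns, which is the layout EBSeq expects as input.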
Differential expression using EBSeq (Leng et al., 2013): Empirical Bayesian approach for RNA-Seq data analysis
EBSeq is an R package for identifying genes and isoforms differentially expressed (DE) across two or more biological conditions in an RNA-seq experiment. EBSeq uses RSEM counts as input to identify differentially expressed genes.
Step 1. Installing EBSeq:
To install, type the following commands in R:
source("https://bioconductor.org/biocLite.R")
biocLite("EBSeq")
On current Bioconductor releases, where biocLite is no longer available, use instead:
install.packages("BiocManager")
BiocManager::install("EBSeq")
Step 2. Command for loading the package EBSeq
> library(EBSeq)
Step 3. Command for getting the working directory
> getwd()
Step 4. Command for setting the working directory
> setwd("path/to/working/directory")
Step 5. Input requirement for gene-level DE analysis:
The input file formats supported by EBSeq are .csv, .xls, .xlsx and .txt (tab delimited). In the input file, the rows should be the genes and the columns should be the samples.
The data set used here is in .txt format (genesresults.txt).
Step 6. Commands to Run EBSeq:
> x=data.matrix(read.table("genesresults.txt"))
> dim(x)
[1] 24596 4
> str(x)
num [1:24596, 1:4] 615 3 0 473 1 286 832 362 103 17 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:24596] "ENSBTAG00000000005" "ENSBTAG00000000008"
"ENSBTAG00000000009" "ENSBTAG00000000010" ...
..$ : chr [1:4] "infectedR1.genes.results" "infectedR2.genes.results"
"ControlR1.genes.results" "ControlR2.genes.results"
> Sizes=MedianNorm(x)
> EBOut=EBTest(Data=x,
+ Conditions=as.factor(rep(c("C1","C2"),each=2)),sizeFactors=Sizes,
+ maxround=5)
Removing transcripts with 75 th quantile < = 10
12071 transcripts will be tested
iteration 1 done
time 0.12
iteration 2 done
time 0.13
iteration 3 done
time 0.08
iteration 4 done
> PP=GetPPMat(EBOut)
> str(PP)
num [1:12071, 1:2] 1 1 0 0 1 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:12071] "ENSBTAG00000000005"
"ENSBTAG00000000010" "ENSBTAG00000000012"
"ENSBTAG00000000013" ...
..$ : chr [1:2] "PPEE" "PPDE"
> DEfound=rownames(PP)[which(PP[,"PPDE"]>=.95)]
> str(DEfound)
chr [1:6528] "ENSBTAG00000000012" "ENSBTAG00000000013"
"ENSBTAG00000000015" "ENSBTAG00000000019"
"ENSBTAG00000000021" "ENSBTAG00000000025"
"ENSBTAG00000000026" "ENSBTAG00000000032" ...
> write.table(DEfound,"DE.txt",sep = "\t",quote = F,col.names=F)
> GeneFC=PostFC(EBOut)
> write.table(GeneFC,"FC.txt",sep = "\t",quote = F,col.names=F)
Output columns of FC.txt: GeneID, PostFC, Real FC, comparison
The posterior fold change estimates give less extreme values for genes with low expression. E.g. if gene1 has Y = 5000 and X = 1000, its FC and PostFC will both be about 5. If gene2 has Y = 5 and X = 1, its FC will be 5 but its PostFC will be < 5 and closer to 1. Therefore, when we sort by PostFC, gene2 will rank below gene1.
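The shrinking effect can be imitated with a simple pseudo-count. Note that this is a swapped-in technique chosen for illustration: EBSeq derives PostFC from its fitted empirical Bayes model, not from the fixed pseudo-count used here. The function names and the pseudo-count value of 1 are assumptions.

```python
def raw_fold_change(y, x):
    """Plain ratio of the two expression values."""
    return y / x


def shrunken_fold_change(y, x, pseudo_count=1.0):
    """Ratio after adding the same pseudo-count to both values.

    Illustration only: this is NOT how EBSeq computes PostFC.
    """
    return (y + pseudo_count) / (x + pseudo_count)


# The two genes from the example above: same raw FC, very different expression.
for name, y, x in (("gene1", 5000, 1000), ("gene2", 5, 1)):
    print(name, raw_fold_change(y, x), round(shrunken_fold_change(y, x), 3))
```

Both genes have a raw fold change of 5, but the shrunken value stays close to 5 for the highly expressed gene1 and drops to 3 for the weakly expressed gene2, so gene2 falls below gene1 when the genes are sorted.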

Normal Labour/ Stages of Labour/ Mechanism of Labour
Normal Labour/ Stages of Labour/ Mechanism of LabourNormal Labour/ Stages of Labour/ Mechanism of Labour
Normal Labour/ Stages of Labour/ Mechanism of Labour
Wasim Ak
 
Marketing internship report file for MBA
Marketing internship report file for MBAMarketing internship report file for MBA
Marketing internship report file for MBA
gb193092
 
The approach at University of Liverpool.pptx
The approach at University of Liverpool.pptxThe approach at University of Liverpool.pptx
The approach at University of Liverpool.pptx
Jisc
 
Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
Vikramjit Singh
 
Operation Blue Star - Saka Neela Tara
Operation Blue Star   -  Saka Neela TaraOperation Blue Star   -  Saka Neela Tara
Operation Blue Star - Saka Neela Tara
Balvir Singh
 
Multithreading_in_C++ - std::thread, race condition
Multithreading_in_C++ - std::thread, race conditionMultithreading_in_C++ - std::thread, race condition
Multithreading_in_C++ - std::thread, race condition
Mohammed Sikander
 
Introduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp NetworkIntroduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp Network
TechSoup
 
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
Tamralipta Mahavidyalaya
 
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
MysoreMuleSoftMeetup
 
Group Presentation 2 Economics.Ariana Buscigliopptx
Group Presentation 2 Economics.Ariana BuscigliopptxGroup Presentation 2 Economics.Ariana Buscigliopptx
Group Presentation 2 Economics.Ariana Buscigliopptx
ArianaBusciglio
 
Overview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with MechanismOverview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with Mechanism
DeeptiGupta154
 
The French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free downloadThe French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free download
Vivekanand Anglo Vedic Academy
 

Recently uploaded (20)

TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
 
Embracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic ImperativeEmbracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic Imperative
 
A Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in EducationA Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in Education
 
Chapter -12, Antibiotics (One Page Notes).pdf
Chapter -12, Antibiotics (One Page Notes).pdfChapter -12, Antibiotics (One Page Notes).pdf
Chapter -12, Antibiotics (One Page Notes).pdf
 
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
 
Honest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptxHonest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptx
 
Chapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptxChapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptx
 
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
 
Normal Labour/ Stages of Labour/ Mechanism of Labour
Normal Labour/ Stages of Labour/ Mechanism of LabourNormal Labour/ Stages of Labour/ Mechanism of Labour
Normal Labour/ Stages of Labour/ Mechanism of Labour
 
Marketing internship report file for MBA
Marketing internship report file for MBAMarketing internship report file for MBA
Marketing internship report file for MBA
 
The approach at University of Liverpool.pptx
The approach at University of Liverpool.pptxThe approach at University of Liverpool.pptx
The approach at University of Liverpool.pptx
 
Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
 
Operation Blue Star - Saka Neela Tara
Operation Blue Star   -  Saka Neela TaraOperation Blue Star   -  Saka Neela Tara
Operation Blue Star - Saka Neela Tara
 
Multithreading_in_C++ - std::thread, race condition
Multithreading_in_C++ - std::thread, race conditionMultithreading_in_C++ - std::thread, race condition
Multithreading_in_C++ - std::thread, race condition
 
Introduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp NetworkIntroduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp Network
 
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
 
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
 
Group Presentation 2 Economics.Ariana Buscigliopptx
Group Presentation 2 Economics.Ariana BuscigliopptxGroup Presentation 2 Economics.Ariana Buscigliopptx
Group Presentation 2 Economics.Ariana Buscigliopptx
 
Overview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with MechanismOverview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with Mechanism
 
The French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free downloadThe French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free download
 

RSEM and DE packages

  • 1. Computational Biology and Genomics Facility, Indian Veterinary Research Institute Revision Normalization and cufflinks
  • 2. Normalization of read count
      R/FPKM (Mortazavi et al., 2008): reads/fragments per kilobase of exon per million mappable reads
        • Corrects for: differences in sequencing depth and transcript length
        • Aims to: compare a gene across samples and different genes within a sample
      TMM (Robinson and Oshlack, 2010): trimmed mean of M values
        • Corrects for: differences in transcript pool composition; extreme outliers
        • Aims to: provide better across-sample comparability
  • 3. Normalization of read count
      Limma voom (logCPM) (Law et al., 2013): counts per million
        • Aims to: stabilize the variance by removing its dependence on the mean
      TPM (Li et al., 2010; Wagner et al., 2012): transcripts per million
        • Corrects for: transcript length distribution in the RNA pool
        • Aims to: provide better across-sample comparability
  • 4. R/FPKM (Mortazavi et al., 2008)
        • FPKM is used for paired-end reads and RPKM for single-end reads.
        • "Fragment" means a fragment of DNA, so the two reads that make up a paired-end read count as one.
        • "Per kilobase of exon" means the fragment counts are normalized by dividing by the total length of all exons in the gene. This makes it possible to compare Gene A to Gene B even if they are of different lengths.
        • "Per million reads" means this value is then normalized against the library size. This makes it possible to compare Gene A in Sample 1 to Gene A in Sample 2.
  • 5. R/FPKM (Mortazavi et al., 2008)
      A quantification measurement for gene expression:
        • R: expression level of the gene
        • L: length of the gene
        • N: depth of the sequencing
        • C: total number of reads falling into the gene region
      R = C / (L × N), with L in kb and N in millions.
      Example: the total exon size of a gene is 3,000 nt. Calculate the expression level of this gene in RPKM for an RNA-seq experiment that contained 50 million mappable reads, with 600 reads falling into the exon regions of this gene.
      R = 600 / (3.000 × 50) = 4.00
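The worked RPKM example on this slide can be checked with a small shell helper. This is only a sketch: the function name `rpkm` and its argument order are made up for illustration, and it assumes the exon length is given in nt and the library size in raw reads, so both are converted to kb and millions before applying R = C / (L × N).

```shell
#!/bin/sh
# rpkm COUNT EXON_LENGTH_NT MAPPED_READS   (hypothetical helper, not part of any tool)
# Implements R = C / (L x N) with L in kb and N in millions of mappable reads.
rpkm() {
    awk -v c="$1" -v len="$2" -v n="$3" \
        'BEGIN { printf "%.2f\n", c / ((len / 1000) * (n / 1000000)) }'
}

# Slide example: 600 reads, 3,000 nt of exon, 50 million mappable reads.
rpkm 600 3000 50000000
```

For the slide's numbers this prints 4.00, the same value as the hand calculation.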
  • 6. Calculation of FPKM/RPKM
      Raw read counts:
        Genes       Sample 1   Sample 2   Sample 3
        1 (2 kb)    20         24         60
        2 (4 kb)    40         50         120
        3 (1 kb)    10         16         30
        4 (10 kb)   0          0          2
        Total       70         90         212
      Total reads for samples 1, 2 and 3: 7M, 9M and 21.2M (millions of reads equated to a scale of tens of reads).
      Step 1. Divide the reads of each gene by the total reads of the sample (in millions) to get reads per million (RPM):
        Genes       Sample 1 (RPM)   Sample 2 (RPM)   Sample 3 (RPM)
        1 (2 kb)    2.86             2.67             2.83
        2 (4 kb)    5.71             5.56             5.66
        3 (1 kb)    1.43             1.78             1.42
        4 (10 kb)   0                0                0.09
  • 7. Fragments/reads per kilobase per million reads
      Reads are scaled for both depth and length.
      Step 2. Divide the values obtained in step 1 by the gene lengths (in kb):
        Genes                    Sample 1 (RPKM)   Sample 2 (RPKM)   Sample 3 (RPKM)
        1 (2 kb)                 1.43              1.33              1.42
        2 (4 kb)                 1.43              1.39              1.42
        3 (1 kb)                 1.43              1.78              1.42
        4 (10 kb)                0                 0                 0.009
        Total normalized reads   4.29              4.5               4.25
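The two RPKM steps (scale by depth, then by length) can be sketched for one sample column as follows. The helper name `rpkm_column` and the input layout "gene length_kb count" are assumptions made for this illustration; the scaling factor is passed in millions, matching the 7M, 9M and 21.2M totals used on the slides.

```shell
#!/bin/sh
# rpkm_column MILLIONS_OF_READS   (hypothetical helper)
# Reads "gene length_kb count" lines on stdin and prints RPKM per gene plus the column total.
rpkm_column() {
    awk -v m="$1" '
        {
            r = ($3 / m) / $2          # step 1: reads per million; step 2: per kilobase
            sum += r
            printf "%s\t%.3f\n", $1, r
        }
        END { printf "total\t%.3f\n", sum }'
}

# Sample 3 from the slides: counts 60, 120, 30 and 2 with 21.2 million reads.
printf '%s\n' "gene1 2 60" "gene2 4 120" "gene3 1 30" "gene4 10 2" | rpkm_column 21.2
```

Running the same helper on the other samples shows that the RPKM column totals are not the same from sample to sample.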
  • 8. Calculation of TPM
      Raw read counts:
        Genes       Sample 1   Sample 2   Sample 3
        1 (2 kb)    20         24         60
        2 (4 kb)    40         50         120
        3 (1 kb)    10         16         30
        4 (10 kb)   0          0          2
      Step 1. Divide the reads of each gene by the length of the gene (in kb) to get reads per kilobase (RPK):
        Genes       Sample 1 (RPK)   Sample 2 (RPK)   Sample 3 (RPK)
        1 (2 kb)    10               12               30
        2 (4 kb)    10               12.5             30
        3 (1 kb)    10               16               30
        4 (10 kb)   0                0                0.2
        Total       30               40.5             90.2
      Total reads per kb of gene for samples 1, 2 and 3: 3M, 4.05M and 9.02M (millions of reads equated to a scale of tens of reads).
  • 9. Calculation of TPM
      RPK values from step 1:
        Genes       Sample 1 (RPK)   Sample 2 (RPK)   Sample 3 (RPK)
        1 (2 kb)    10               12               30
        2 (4 kb)    10               12.5             30
        3 (1 kb)    10               16               30
        4 (10 kb)   0                0                0.2
      Step 2. Divide the values obtained in step 1 by the total RPK of the sample (in millions: 3, 4.05 and 9.02):
        Genes       Sample 1 (TPM)   Sample 2 (TPM)   Sample 3 (TPM)
        1 (2 kb)    3.33             2.96             3.326
        2 (4 kb)    3.33             3.09             3.326
        3 (1 kb)    3.33             3.95             3.326
        4 (10 kb)   0                0                0.02
        Total       10               10               10
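The TPM calculation reverses the order of the two corrections: length first, then depth. A minimal sketch for one sample column is given below. The helper name `tpm_column` and the "gene length_kb count" input layout are assumptions for illustration; the optional argument is the value the column should sum to, which is 1,000,000 for real TPM and 10 on the slides' reduced scale.

```shell
#!/bin/sh
# tpm_column [SCALE]   (hypothetical helper; SCALE defaults to 1000000)
# Reads "gene length_kb count" lines on stdin and prints TPM per gene.
tpm_column() {
    awk -v scale="${1:-1000000}" '
        {
            name[NR] = $1
            rpk[NR] = $3 / $2          # step 1: reads per kilobase
            total += rpk[NR]
        }
        END {
            # step 2: divide each RPK by the sample total, then rescale
            for (i = 1; i <= NR; i++)
                printf "%s\t%.2f\n", name[i], (total > 0 ? rpk[i] / total * scale : 0)
        }'
}

# Sample 2 from the slides, on the slides' scale (column sums to 10).
printf '%s\n' "gene1 2 24" "gene2 4 50" "gene3 1 16" "gene4 10 0" | tpm_column 10
```

Because every column is divided by its own RPK total, the TPM values of each sample add up to the same number by construction.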
  • 10. RPKM vs TPM
      RPKM:
        Genes                    Sample 1 (RPKM)   Sample 2 (RPKM)   Sample 3 (RPKM)
        1 (2 kb)                 1.43              1.33              1.42
        2 (4 kb)                 1.43              1.39              1.42
        3 (1 kb)                 1.43              1.78              1.42
        4 (10 kb)                0                 0                 0.009
        Total normalized reads   4.29              4.5               4.25
      TPM:
        Genes                    Sample 1 (TPM)    Sample 2 (TPM)    Sample 3 (TPM)
        1 (2 kb)                 3.33              2.96              3.326
        2 (4 kb)                 3.33              3.09              3.326
        3 (1 kb)                 3.33              3.95              3.326
        4 (10 kb)                0                 0                 0.02
        Total normalized reads   10                10                10
      Sums of total normalized reads: the RPKM totals differ between samples, whereas the TPM totals are identical.
  • 11. TMM: Trimmed Mean of M Values
      Attempts to correct for differences in RNA composition between samples.
      E.g., if certain genes are very highly expressed in one tissue but not in another, there will be less "sequencing real estate" left for the less expressed genes in that tissue, and RPKM normalization (or similar) will give biased expression values for them compared to the other sample.
      Figure (RNA population 1 vs. RNA population 2): at equal sequencing depth, yellow and green get a lower RPKM in RNA population 1, although the expression levels are actually the same in populations 1 and 2.
      Robinson and Oshlack, Genome Biology 2010, 11:R25, http://genomebiology.com/2010/11/3/R25
  • 12. Identification of differentially expressed genes
      Input: quality filtered/trimmed RNA-Seq short reads
      Reference genome (Y/N)?
        • Y: mapping to the reference (GMAP-GSNAP, Tophat, Bowtie, etc.)
            FPKM-based strategy: calculate transcript abundances (Cufflinks), then detection of DEGs (cuffdiff2)
            Count-based strategy: generate count data (RSEM), then detection of DEGs (DESeq, edgeR, EBSeq)
        • N: de novo transcriptome assembly (Trinity), then mapping and detection of DEGs (RSEM)
  • 13. Genome mapping and alignment using GMAP-GSNAP (Genomic Mapping and Alignment Program)
        • GMAP is a standalone program for mapping and aligning cDNA sequences to a genome.
        • The program maps and aligns a single sequence with minimal startup time and memory requirements, and provides fast batch processing of large sequence sets.
        • The program generates accurate gene structures, even in the presence of substantial polymorphisms and sequence errors, without using probabilistic splice site models.
        • It initially used a hashing scheme but later adopted a much more efficient double lookup scheme.
      Step 1. Command for indexing the genome:
        gmap_build -d btau8 bosTau8.fa
  • 14. The index files are created in the folder btau8.
      Step 2. Mapping the reads to the genome (-A sam requests SAM output):
        gsnap -d btau8 -t 4 -A sam control_R1.fastq > control_R1.sam
  • 15. Output of the alignment step
        • The end product of the GSNAP aligner is a SAM file, which needs to be converted into a BAM file for further analysis in cufflinks.
        • Repeat the same for the other replicates by changing the input file name.
        • A total of four SAM files are generated separately.
      The BAM files generated can be analysed in two ways:
        1. The BAM files can be used to generate a merged assembly of transcripts via cufflinks and cuffmerge. This merged assembly (i.e. merged.gtf) is used in cuffdiff to identify differentially expressed genes.
        2. Cuffdiff can be used directly on the BAM files to identify differentially expressed genes.

  • 16. Step 3. Converting SAM to BAM using samtools
        ./samtools view -bSh aln.sam > aln.bam
      -b: output in the BAM format. -S: input in the SAM format. -h: include the header in the output.
      For the control sample:
        ./samtools view -bSh control_R1.sam > control_R1.bam
      For the infected sample:
        ./samtools view -bSh infected_R1.sam > infected_R1.bam
  • 17. Step 4. Sorting BAM using samtools
      Command for sorting (legacy syntax; in samtools 1.3 and later use: samtools sort -o aln.sorted.bam aln.bam):
        ./samtools sort aln.bam aln.sorted
      For the control sample:
        ./samtools sort control_R1.bam control_R1_sorted
      For the infected sample:
        ./samtools sort infected_R1.bam infected_R1_sorted
  • 18. Step 5. (Option 1) Differential expression using cufflinks, cuffmerge and cuffdiff
      Command for running cufflinks on a BAM file, for the control sample:
        cufflinks -G btau8refflat.gtf -g btau8refflat.gtf -b bosTau8.fa -u -L CN control_R1_sorted.bam
  • 19. For the infected sample:
        cufflinks -G btau8refflat.gtf -g btau8refflat.gtf -b bosTau8.fa -u -L CN infected_R1_sorted.bam
      These commands generate a transcripts.gtf file for each replicate. These files are then used in cuffmerge to generate a merged assembly, which in turn is used in cuffdiff to identify differentially expressed genes.
  • 20. Command for running cuffmerge:
        cuffmerge -g btau8refflat.gtf -s bosTau8.fa -p 8 assemblies.txt
      assemblies.txt is the file listing all the GTFs. This generates merged.gtf in the merged_asm folder, which is used in the next cuffdiff command.
      Command for running cuffdiff (replicates of one condition are separated by commas, conditions by spaces):
        cuffdiff merged.gtf control_R1_sorted.bam,control_R2_sorted.bam infected_R1_sorted.bam,infected_R2_sorted.bam
      This command generates many files, of which gene_exp.diff is the one of interest.
  • 21. Step 5. (Option 2) Differential expression using CuffDiff directly
      CuffDiff computes the differentially expressed genes in the set. Computing differential expression requires at least two samples, infected and control. CuffDiff should always be run on replicates, i.e. N infected vs N control.
      Command:
        cuffdiff -p -N transcripts.gtf
      Options: -p: num-threads <int>. -N
      Running cuffdiff for our BAM files:
        cuffdiff -p 3 -N bostau8reflat.gtf control_R1_sorted.bam,control_R2_sorted.bam infected_R1_sorted.bam,infected_R2_sorted.bam -o cuffdiff_out
  • 22. gene_exp.diff: meaning of the columns
        • A unique identifier describing the object (gene, transcript, CDS, primary transcript)
        • Gene ID
        • Gene name
        • Genomic coordinates for easy browsing to the genes or transcripts being tested
        • Sample labels: Control and Infected
        • Test status: OK (test successful), NOTEST (not enough alignments for testing), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL (an ill-conditioned covariance matrix or other numerical exception prevents testing)
        • FPKM in Sample 1
        • FPKM in Sample 2
        • The (base 2) log of the fold change y/x
        • The value of the test statistic used to compute the significance of the observed change in FPKM
        • The uncorrected p-value of the test statistic
      Example: log2 fold change = log2(FPKM infected / FPKM control) = log2(0.576748 / 3.92513) = -2.76673
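The log2 fold change on this slide is simply the base-2 logarithm of the ratio of the two FPKM values. A small sketch for checking individual rows by hand follows; the helper name `log2fc` is made up, and printing NA for a zero FPKM is a choice made for this example, not cuffdiff's own behaviour.

```shell
#!/bin/sh
# log2fc FPKM_X FPKM_Y   (hypothetical helper)
# Prints log2(y / x), i.e. the second sample relative to the first.
log2fc() {
    awk -v x="$1" -v y="$2" 'BEGIN {
        # The ratio is undefined when either value is zero; report NA in that case.
        if (x <= 0 || y <= 0) { print "NA"; exit }
        printf "%.4f\n", log(y / x) / log(2)
    }'
}

# Slide example: control FPKM 3.92513, infected FPKM 0.576748.
log2fc 3.92513 0.576748
```

A negative value means the gene has a lower FPKM in the infected sample than in the control.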
  • 23. Identification of differentially expressed genes - I (using RSEM - EBSeq)
  • 24. Count-based strategy
        1. Quality filtered/trimmed RNA-Seq short reads
        2. Reference genome: download the reference genome and GTF from the Ensembl genome browser
        3. Mapping to the reference (Bowtie)
        4. Calculate transcript abundances (RSEM)
        5. Detection of DEGs (DESeq, edgeR, EBSeq)
  • 25. RSEM (RNA-Seq by Expectation-Maximization) (Li and Dewey, 2011)
      RSEM is an RNA-Seq analysis package that provides an end-to-end solution for differential expression and simplifies the whole process. It also introduces a more robust unit of RNA-Seq measurement called TPM.
      Step 1. Downloading and installing RSEM:
        wget http://deweylab.biostat.wisc.edu/rsem/src/rsem-1.2.19.tar.gz
        tar -xvzf rsem-1.2.19.tar.gz
        cd rsem-1.2.19/
        make
      Step 2. Prerequisites for running RSEM: Perl, R and Bowtie must be installed. Perl and R are normally present on most computers.
  • 26. RSEM (RNA-Seq by Expectation-Maximization)
      Step 3. Downloading and installing Bowtie: download Bowtie from http://sourceforge.net/projects/bowtie-bio/files/bowtie/1.1.1/
      Step 4. Copy bowtie into your path or add the bowtie path to the bash profile.
      Copying bowtie into your path:
        sudo cp -R /Users/appleserver/Desktop/bowtie2 /usr/local/bin
      Adding the bowtie path to the bash profile (preferred):
        export PATH="/Users/ravikumar/Desktop/bowtie2:$PATH"
      Then run:
        source ~/.bash_profile
        echo $PATH
      The output of echo $PATH shows whether the path has been added.
  • 27. Step 5. Downloading the reference, gunzipping and concatenating
      Download the Bos taurus genome from the Ensembl genome browser. An easier alternative is to use the wget command for a direct download on the HPC:
        wget -m ftp://ftp.ensembl.org/pub/release-81/fasta/bos_taurus/dna/ &or f in $(find . -name "*.gz")
  • 28. The folder that is created is as below
  • 29. Direct download of each individual chromosome and gtf from the ftp site can be done
  • 30. The downloaded files are gunzipped using:
        gunzip Bos_taurus.UMD3.1.dna.chromosome.*.fa.gz
      Concatenating all the fasta files into a combined fasta file (the reference):
        cat Bos_taurus.UMD3.1.dna.chromosome.*.fa > combined.fa
      Step 6. Download the annotation file in gtf format. Command for downloading:
        wget -m ftp://ftp.ensembl.org/pub/release-81/gtf/bos_taurus
      The downloaded gtf file needs to be modified to extract only the exon annotations. awk command to extract the exon annotations from the gtf:
        awk '$3 == "exon"' Bos_taurus.UMD3.1.8.1.gtf > filtered.gtf
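The effect of the exon filter can be tried on a tiny made-up GTF fragment before running it on the full annotation. In this sketch the wrapper name `filter_exons`, the source label "demo" and the IDs G1 and T1 are all hypothetical; the field separator is set to a tab explicitly, which is a slight tightening of the command above since GTF columns are tab-delimited.

```shell
#!/bin/sh
# filter_exons   (hypothetical wrapper around the awk filter)
# Keeps only rows whose third tab-separated column (the feature type) is "exon".
filter_exons() {
    awk -F '\t' '$3 == "exon"'
}

# Three-line toy GTF: one gene row and two exon rows.
{
    printf '1\tdemo\tgene\t100\t900\t.\t+\t.\tgene_id "G1";\n'
    printf '1\tdemo\texon\t100\t300\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
    printf '1\tdemo\texon\t600\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
} | filter_exons
```

Only the two exon rows are printed; the gene row is dropped, as are any header comment lines, because they have no third column equal to "exon".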
  • 31. Screenshots: filtered.gtf and the original gtf
  • 32. Step 7. Prepare the reference using RSEM
      To prepare the reference sequence, run the 'rsem-prepare-reference' program. The command for preparing the reference:
        ./rsem-prepare-reference --gtf filtered.gtf --bowtie2 combined.fa BT
      This creates 12 index files with the name BT and the extension bt2.
  • 33. Step 8. Calculating expression values in counts, TPM and FPKM
      To calculate expression values, run the 'rsem-calculate-expression' program.
      For the control sample:
        ./rsem-calculate-expression --bowtie2 control_R1.fastq BT ControlR1
      Six files are generated; genes.results is the most important of the six.
      For the infected sample:
        ./rsem-calculate-expression --bowtie2 infected_R1.fastq BT infectedR1
  • 34. Step 9. Combining the RSEM genes.results of all the samples
      RSEM produces "expected counts" (gene counts) values. After rounding these expected counts to the nearest integer, they can be used in EBSeq, DESeq or edgeR to identify differentially expressed genes.
        ./rsem-generate-data-matrix *.genes.results > genes.results
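The rounding step mentioned on this slide can be sketched with awk. This assumes a tab-delimited matrix with one header row, gene IDs in the first column and non-negative expected counts in the remaining columns; the helper name `round_counts` and the toy gene and sample names are made up for illustration.

```shell
#!/bin/sh
# round_counts   (hypothetical helper)
# Rounds every count column of a tab-delimited matrix to the nearest integer.
round_counts() {
    awk 'BEGIN { FS = OFS = "\t" }
         NR == 1 { print; next }                      # keep the header row as-is
         {
             for (i = 2; i <= NF; i++)
                 $i = int($i + 0.5)                   # valid for non-negative counts
             print
         }'
}

# Toy matrix with two genes and two samples.
printf 'gene\tcontrolR1\tinfectedR1\nG1\t615.49\t3.50\nG2\t0.00\t472.51\n' | round_counts
```

Adding 0.5 before truncating rounds halves upwards, which is adequate here because expected counts are never negative.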
  • 35. Differential expression using EBSeq (Leng et al., 2013): empirical Bayesian approach for RNA-Seq data analysis
      EBSeq is an R package for identifying genes and isoforms differentially expressed (DE) across two or more biological conditions in an RNA-seq experiment. EBSeq uses RSEM counts as input to identify differentially expressed genes.
      Step 1. Installing EBSeq. To install, type the following commands in R (on current Bioconductor releases, use BiocManager::install("EBSeq") instead):
        source("https://bioconductor.org/biocLite.R")
        biocLite("EBSeq")
      Step 2. Command for loading the package EBSeq:
        > library(EBSeq)
      Step 3. Command for getting the working directory:
        > getwd()
  • 36. Step 4. Command for setting the working directory:
        > setwd()
      Step 5. Input requirements for gene-level DE analysis: the input file formats supported by EBSeq are .csv, .xls, .xlsx and .txt (tab delimited). In the input file, rows should be the genes and columns should be the samples. The data set used here is in .txt format (genesresults.txt).
  • 37. Step 6. Commands to run EBSeq:
        > x=data.matrix(read.table("genesresults.txt"))
        > dim(x)
        [1] 24596 4
        > str(x)
         num [1:24596, 1:4] 615 3 0 473 1 286 832 362 103 17 ...
         - attr(*, "dimnames")=List of 2
          ..$ : chr [1:24596] "ENSBTAG00000000005" "ENSBTAG00000000008" "ENSBTAG00000000009" "ENSBTAG00000000010" ...
          ..$ : chr [1:4] "infectedR1.genes.results" "infectedR2.genes.results" "ControlR1.genes.results" "ControlR2.genes.results"
        > Sizes=MedianNorm(x)
        > EBOut=EBTest(Data=x,
        + Conditions=as.factor(rep(c("C1","C2"),each=2)),sizeFactors=Sizes,
        + maxround=5)
  • 38. Output of EBTest, followed by extraction of the posterior probabilities:
        Removing transcripts with 75th quantile <= 10
        12071 transcripts will be tested
        iteration 1 done
        time 0.12
        iteration 2 done
        time 0.13
        iteration 3 done
        time 0.08
        iteration 4 done
        > PP=GetPPMat(EBOut)
        > str(PP)
         num [1:12071, 1:2] 1 1 0 0 1 ...
         - attr(*, "dimnames")=List of 2
          ..$ : chr [1:12071] "ENSBTAG00000000005" "ENSBTAG00000000010" "ENSBTAG00000000012" "ENSBTAG00000000013" ...
          ..$ : chr [1:2] "PPEE" "PPDE"
  • 39. Selecting the DE genes (PPDE >= 0.95) and writing the results:
        > DEfound=rownames(PP)[which(PP[,"PPDE"]>=.95)]
        > str(DEfound)
         chr [1:6528] "ENSBTAG00000000012" "ENSBTAG00000000013" "ENSBTAG00000000015" "ENSBTAG00000000019" "ENSBTAG00000000021" "ENSBTAG00000000025" "ENSBTAG00000000026" "ENSBTAG00000000032" ...
        > write.table(DEfound,"DE.txt",sep = "\t",quote = F,col.names=F)
        > GeneFC=PostFC(EBOut)
        > write.table(GeneFC,"FC.txt",sep = "\t",quote = F,col.names=F)
      Output columns: GeneID, PostFC, Real FC, comparison
  • 40. The posterior fold change estimates give less extreme values for genes with low expression. E.g. if gene1 has Y = 5000 and X = 1000, its FC and PostFC will both be 5. If gene2 has Y = 5 and X = 1, its FC will be 5 but its PostFC will be < 5 and closer to 1. Therefore, when we sort by PostFC, gene2 will rank as less significant than gene1.