Computational Biology and Genomics Facility, Indian Veterinary Research Institute
Revision
Normalization and cufflinks
Normalization of read count
R/FPKM (Mortazavi et al., 2008) - Reads/Fragments per kilobase of exon
per million mappable reads
• Corrects for: differences in sequencing depth and transcript length
• Aiming to: compare a gene across samples and different genes within
samples
TMM (Robinson and Oshlack, 2010) - Trimmed mean of M values
• Corrects for: differences in transcript pool composition; extreme outliers
• Aiming to: provide better across-sample comparability
Normalization of read count
Limma voom (logCPM) (Law et al., 2013) - log counts per million
• Aiming to: stabilize the variance by removing its dependence on the
mean
TPM (Li et al., 2010; Wagner et al., 2012) - Transcripts per million
• Corrects for: transcript length distribution in the RNA pool
• Aiming to: provide better across-sample comparability
R/FPKM (Mortazavi et al., 2008)
• FPKM is used for paired-end reads and RPKM for single-end reads.
• "Fragment" means a fragment of DNA (cDNA), so the two reads that
make up a paired-end read count as one.
• "Per kilobase of exon" means the fragment counts are normalized by
dividing by the total length of all exons in the gene. This makes it
possible to compare Gene A to Gene B even if they are of different lengths.
• "Per million reads" means the value is then normalized against the
library size. This makes it possible to compare Gene A in Sample 1 to
Gene A in Sample 2.
R/FPKM (Mortazavi et al., 2008)
A quantification measurement for gene expression
• R: expression level of the gene (RPKM)
• L: total exon length of the gene, in kilobases
• N: depth of the sequencing, in millions of mappable reads
• C: total number of reads that fall into the exon regions of the gene
R = C / (L × N), with L in kb and N in millions
Example: the total exon size of a gene is 3,000 nt. Calculate the expression
level of this gene in RPKM in an RNA-seq experiment that contained 50 million
mappable reads, with 600 reads falling into the exon regions of this gene.
R = 600 / (3.0 × 50) = 4.00
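The formula can be checked with a few lines of Python. This is a minimal sketch; the function name `rpkm` and its parameter names are illustrative and not part of any tool used in these slides.

```python
def rpkm(read_count, exon_length_nt, mappable_reads):
    """Reads per kilobase of exon per million mappable reads: C / (L * N)."""
    length_kb = exon_length_nt / 1_000          # L, in kilobases
    depth_millions = mappable_reads / 1_000_000  # N, in millions of reads
    return read_count / (length_kb * depth_millions)


# Worked example from the slide: 600 reads, 3,000 nt of exon, 50 million reads.
print(rpkm(600, 3_000, 50_000_000))  # 4.0
```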
Calculation of FPKM/RPKM
Read counts:
Genes        Sample 1   Sample 2   Sample 3
1 (2 kb)     20         24         60
2 (4 kb)     40         50         120
3 (1 kb)     10         16         30
4 (10 kb)    0          0          2
Total        70         90         212
Total reads for samples 1, 2 and 3: 7M, 9M and 21.2M
(millions of reads equated to a scale of tens of reads)
Step 1. Divide the read count of each gene by the total reads of the sample
to get reads per million (RPM)
Genes        Sample 1 (RPM)   Sample 2 (RPM)   Sample 3 (RPM)
1 (2 kb)     2.86             2.67             2.83
2 (4 kb)     5.71             5.56             5.66
3 (1 kb)     1.43             1.78             1.42
4 (10 kb)    0                0                0.09
Fragments/Reads per kilobase per million reads
Reads are scaled for both depth and length
Step 2. Divide the values obtained in step 1 by the gene lengths (in kb)
Genes                    Sample 1 (RPKM)   Sample 2 (RPKM)   Sample 3 (RPKM)
1 (2 kb)                 1.43              1.33              1.42
2 (4 kb)                 1.43              1.39              1.42
3 (1 kb)                 1.43              1.78              1.42
4 (10 kb)                0                 0                 0.009
Total normalized reads   4.29              4.5               4.25
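The two RPKM steps can be reproduced with a short Python sketch. The gene labels and the `scale` parameter are illustrative names; dividing the totals by 10 mirrors the slides' convention of equating millions of reads to tens of reads, so it is an assumption of this toy example rather than part of the RPKM definition.

```python
GENE_LENGTH_KB = {"gene1": 2, "gene2": 4, "gene3": 1, "gene4": 10}
COUNTS = {
    "sample1": {"gene1": 20, "gene2": 40, "gene3": 10, "gene4": 0},
    "sample2": {"gene1": 24, "gene2": 50, "gene3": 16, "gene4": 0},
    "sample3": {"gene1": 60, "gene2": 120, "gene3": 30, "gene4": 2},
}


def rpkm_table(counts, length_kb, scale=10):
    """Step 1: divide by sequencing depth; step 2: divide by gene length."""
    table = {}
    for sample, gene_counts in counts.items():
        depth = sum(gene_counts.values()) / scale  # "millions" in the toy scale
        table[sample] = {
            gene: count / depth / length_kb[gene]
            for gene, count in gene_counts.items()
        }
    return table


rpkm_values = rpkm_table(COUNTS, GENE_LENGTH_KB)
for sample, values in rpkm_values.items():
    per_gene = {gene: round(value, 3) for gene, value in values.items()}
    print(sample, per_gene, "total:", round(sum(values.values()), 2))
```

Printing the per-sample totals makes it easy to check the last row of the table against the per-gene values.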
Calculation of TPM
Step 1. Divide the read count of each gene by the length of the gene (in kb)
to get reads per kilobase (RPK)
Read counts:
Genes        Sample 1   Sample 2   Sample 3
1 (2 kb)     20         24         60
2 (4 kb)     40         50         120
3 (1 kb)     10         16         30
4 (10 kb)    0          0          2
Reads per kilobase:
Genes        Sample 1 (RPK)   Sample 2 (RPK)   Sample 3 (RPK)
1 (2 kb)     10               12               30
2 (4 kb)     10               12.5             30
3 (1 kb)     10               16               30
4 (10 kb)    0                0                0.2
Total        30               40.5             90.2
Total reads per kb of gene for samples 1, 2 and 3: 3M, 4.05M and 9.02M
(millions of reads equated to a scale of tens of reads)
Calculation of TPM
Step 2. Divide the RPK values obtained in step 1 by the total RPK of the
sample (the scaling factors 3, 4.05 and 9.02 for samples 1, 2 and 3)
Genes        Sample 1 (TPM)   Sample 2 (TPM)   Sample 3 (TPM)
1 (2 kb)     3.33             2.96             3.326
2 (4 kb)     3.33             3.09             3.326
3 (1 kb)     3.33             3.95             3.326
4 (10 kb)    0                0                0.02
Total        10               10               10
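The TPM calculation differs from RPKM only in the order of the two operations: length first, depth second. A minimal Python sketch using the same toy counts follows; the gene labels and the `scale` parameter are illustrative, and the factor of 10 again follows the slides' "tens of reads" convention.

```python
GENE_LENGTH_KB = {"gene1": 2, "gene2": 4, "gene3": 1, "gene4": 10}
COUNTS = {
    "sample1": {"gene1": 20, "gene2": 40, "gene3": 10, "gene4": 0},
    "sample2": {"gene1": 24, "gene2": 50, "gene3": 16, "gene4": 0},
    "sample3": {"gene1": 60, "gene2": 120, "gene3": 30, "gene4": 2},
}


def tpm_table(counts, length_kb, scale=10):
    """Step 1: reads per kilobase (RPK); step 2: divide by the sample's total RPK."""
    table = {}
    for sample, gene_counts in counts.items():
        rpk = {gene: count / length_kb[gene] for gene, count in gene_counts.items()}
        scaling_factor = sum(rpk.values()) / scale
        table[sample] = {gene: value / scaling_factor for gene, value in rpk.items()}
    return table


tpm_values = tpm_table(COUNTS, GENE_LENGTH_KB)
for sample, values in tpm_values.items():
    per_gene = {gene: round(value, 3) for gene, value in values.items()}
    print(sample, per_gene, "total:", round(sum(values.values()), 2))
```

Because every sample is divided by its own RPK total, the normalized values of each sample add up to the same number, which is what makes TPM values comparable as proportions across samples.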
RPKM vs TPM
Genes                    Sample 1 (RPKM)   Sample 2 (RPKM)   Sample 3 (RPKM)
1 (2 kb)                 1.43              1.33              1.42
2 (4 kb)                 1.43              1.39              1.42
3 (1 kb)                 1.43              1.78              1.42
4 (10 kb)                0                 0                 0.009
Total normalized reads   4.29              4.5               4.25
Genes                    Sample 1 (TPM)    Sample 2 (TPM)    Sample 3 (TPM)
1 (2 kb)                 3.33              2.96              3.326
2 (4 kb)                 3.33              3.09              3.326
3 (1 kb)                 3.33              3.95              3.326
4 (10 kb)                0                 0                 0.02
Total normalized reads   10                10                10
Sums of total normalized reads: they differ between samples for RPKM but are
the same in every sample for TPM.
TMM - Trimmed Mean of M Values
Attempts to correct for differences in RNA composition between samples
E.g. if certain genes are very highly expressed in one tissue but not another,
there will be less "sequencing real estate" left for the less expressed genes in
that tissue, and RPKM normalization (or similar) will give biased expression
values for them compared to the other sample.
[Figure: RNA population 1 and RNA population 2]
Equal sequencing depth -> the yellow and green genes will get lower RPKM in RNA
population 1 although their expression levels are actually the same in
populations 1 and 2
Robinson and Oshlack, Genome Biology 2010, 11:R25,
http://genomebiology.com/2010/11/3/R25
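The composition effect described above can be illustrated with a toy calculation. The sketch below is not the TMM algorithm; it only shows the bias that TMM is designed to correct. The gene names and abundances are made-up illustrative values: "yellow" and "green" are equally abundant in both populations, while "red" is highly expressed in population 1 only.

```python
def counts_per_million(abundance):
    """Share of the sequenced reads each gene receives, scaled to one million."""
    total = sum(abundance.values())
    return {gene: 1_000_000 * amount / total for gene, amount in abundance.items()}


# Hypothetical RNA populations sequenced to the same depth.
population_1 = {"yellow": 100, "green": 100, "red": 800}
population_2 = {"yellow": 100, "green": 100}

cpm_1 = counts_per_million(population_1)
cpm_2 = counts_per_million(population_2)

for gene in ("yellow", "green"):
    print(gene, cpm_1[gene], cpm_2[gene])
```

Although "yellow" and "green" have identical abundances in both populations, their depth-normalized values are lower in population 1 because "red" takes up most of the reads there.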
Identification of differentially expressed genes
Quality filtered/trimmed RNA-Seq short reads
Reference genome available? (Y/N)
• Y: Mapping to the reference (GMAP-GSNAP, TopHat, Bowtie, etc.)
  - FPKM based strategy: calculate transcript abundances (Cufflinks), then
    detection of DEGs (cuffdiff2)
  - Count based strategy: generate count data (RSEM), then detection of DEGs
    (DESeq, edgeR, EBSeq)
• N: De novo transcriptome assembly (Trinity), then mapping and detection of
  DEGs (RSEM)
Genome Mapping and Alignment using GMAP - GSNAP
GMAP: Genomic Mapping and Alignment Program
• GMAP is a standalone program for mapping and aligning cDNA sequences to a
genome.
• The program maps and aligns a single sequence with minimal startup time and
memory requirements, and provides fast batch processing of large sequence sets.
• The program generates accurate gene structures, even in the presence of
substantial polymorphisms and sequence errors, without using probabilistic splice
site models.
• It initially used a hashing scheme but later moved to a much more efficient
double lookup scheme.
Step 1. Command for indexing the genome:
gmap_build -d btau8 bosTau8.fa
The index files are created in the folder btau8.
Step 2. Mapping the reads to the genome
gsnap -d btau8 -t 4 -A sam control_R1.fastq > control_R1.sam
-d: name of the genome database built in step 1. -t: number of threads.
-A sam: write the alignments in SAM format.
• The end product of the GMAP-GSNAP aligner is a SAM file, which needs to be
converted into a BAM file for further analysis in Cufflinks.
• Repeat the same for the other replicate by changing the input file name.
• A total of four SAM files are generated separately (two control and two
infected replicates).
The BAM files generated can be analysed in two ways -
1. The BAM files can be used to generate a merged assembly of transcripts
via cufflinks and cuffmerge. This merged assembly (i.e. merged.gtf) is
used in cuffdiff to identify differentially expressed genes.
2. Cuffdiff can be used directly on the BAM files to identify differentially
expressed genes.
Step 3. Converting SAM to BAM using samtools
./samtools view -bSh aln.sam > aln.bam
-b: output in the BAM format. -S: input is in the SAM format. -h: include the
header in the output
For the Control sample:
./samtools view -bSh control_R1.sam > control_R1.bam
For the Infected sample:
./samtools view -bSh infected_R1.sam > infected_R1.bam
Step 4. Sorting BAM using samtools
Command for sorting: ./samtools sort aln.bam aln.sorted
The second argument is the output prefix, so this command writes aln.sorted.bam.
This is the syntax of older samtools releases; samtools 1.3 and later expect
samtools sort -o aln.sorted.bam aln.bam instead.
Example:
For the Control sample:
./samtools sort control_R1.bam control_R1_sorted
For the Infected sample:
./samtools sort infected_R1.bam infected_R1_sorted
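Since steps 3 and 4 are repeated for every replicate, the command lines can be generated rather than typed by hand. The following Python sketch only builds the command strings and does not run them. The helper name `conversion_commands` and the sample list are illustrative; it assumes the legacy `samtools sort <in.bam> <out.prefix>` syntax used in these slides and the capital `-S` flag for SAM input.

```python
import shlex

# Sample names assumed from the file names used in the slides.
SAMPLES = ["control_R1", "control_R2", "infected_R1", "infected_R2"]


def conversion_commands(sample, samtools="./samtools"):
    """Return the SAM-to-BAM and sort commands for one sample as shell strings."""
    sam = shlex.quote(f"{sample}.sam")
    bam = shlex.quote(f"{sample}.bam")
    sorted_prefix = shlex.quote(f"{sample}_sorted")
    return [
        f"{samtools} view -bSh {sam} > {bam}",      # step 3: SAM to BAM
        f"{samtools} sort {bam} {sorted_prefix}",   # step 4: legacy sort syntax
    ]


for sample in SAMPLES:
    for command in conversion_commands(sample):
        print(command)
```

The printed lines can be pasted into a shell script; with a recent samtools the sort line would need the `-o` form instead.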
Step 5. (Option 1) Differential expression using cufflinks,
cuffmerge and cuffdiff.
Command for running cufflinks on a BAM file
For the Control sample:
cufflinks -G btau8refflat.gtf -g btau8refflat.gtf -b bosTau8.fa -u -L CN
control_R1_sorted.bam
For the Infected sample:
cufflinks -G btau8refflat.gtf -g btau8refflat.gtf -b bosTau8.fa -u -L CN
infected_R1_sorted.bam
-G: reference annotation used for quantification. -g: reference annotation used
to guide the assembly. -b: genome sequence for fragment bias correction.
-u: multi-read correction. -L: label prefix for the transcript IDs.
Each run writes its output to the current folder unless an output folder is set
with -o, so run each sample in its own folder to avoid overwriting the results.
These commands generate a transcripts.gtf file for each replicate, which are
further used in cuffmerge to generate a merged assembly. This merged
assembly is then used in cuffdiff to identify differentially expressed genes.
Command for running cuffmerge
cuffmerge -g btau8refflat.gtf -s bosTau8.fa -p 8 assemblies.txt
assemblies.txt is a text file listing the paths of all the GTFs, one per line.
This generates merged.gtf in the merged_asm folder. This file is
used in the next cuffdiff command.
Command for running cuffdiff
cuffdiff merged_asm/merged.gtf control_R1_sorted.bam,control_R2_sorted.bam
infected_R1_sorted.bam,infected_R2_sorted.bam
Replicates of the same condition are separated by commas and the two conditions
by a space.
This command generates many files, of which gene_exp.diff is the file
of interest.
Step 5. (Option 2) Differential expression using CuffDiff directly
CuffDiff computes differentially expressed genes in the set. For computing
differential expression at least two samples (infected and control) are required.
CuffDiff should always be run on replicates, i.e., N infected vs N control.
Command:
cuffdiff -p <int> -N transcripts.gtf <condition 1 BAM files> <condition 2 BAM files>
-p: num-threads <int>. -N: upper-quartile normalization.
Running cuffdiff for our BAM files
cuffdiff -p 3 -N -o cuffdiff_out btau8refflat.gtf
control_R1_sorted.bam,control_R2_sorted.bam
infected_R1_sorted.bam,infected_R2_sorted.bam
gene_exp.diff
[Screenshot: gene_exp.diff output with its columns annotated]
• Test ID: a unique identifier describing the object (gene, transcript, CDS,
primary transcript)
• Gene ID
• Gene name
• Locus: genomic coordinates for easy browsing to the genes or transcripts
being tested
• Sample 1 and sample 2: Control and Infected
• Status: OK (test successful), NOTEST (not enough alignments for testing),
LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in
locus), or FAIL, when an ill-conditioned covariance matrix or other numerical
exception prevents testing
• FPKM in sample 1 and FPKM in sample 2
• The (base 2) log of the fold change y/x
• The value of the test statistic used to compute significance of the observed
change in FPKM
• The uncorrected p-value of the test statistic
Log2 fold change = log2(FPKM of infected / FPKM of control)
= log2(0.576748 / 3.92513) = -2.76673
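The fold-change arithmetic can be verified in Python. This is a minimal sketch; the function name `log2_fold_change` is illustrative, and it assumes both FPKM values are greater than zero (a zero in either sample would need separate handling).

```python
import math


def log2_fold_change(fpkm_control, fpkm_infected):
    """Base-2 log of the fold change y/x, with x = control and y = infected."""
    return math.log2(fpkm_infected / fpkm_control)


# FPKM values taken from the worked example on the slide.
print(round(log2_fold_change(3.92513, 0.576748), 5))
```

A negative value means the gene has a lower FPKM in the infected sample than in the control.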
Identification of Differentially expressed genes - I
(using RSEM - EBSeq)
Count based strategy
1. Downloading the reference genome and GTF from the Ensembl genome browser
2. Quality filtered/trimmed RNA-Seq short reads
3. Mapping to the reference genome (Bowtie)
4. Calculate transcript abundances (RSEM)
5. Detection of DEGs (DESeq, edgeR, EBSeq)
RSEM (RNA-Seq by Expectation-Maximization)
(Li and Dewey, 2011)
RSEM is an RNA-Seq analysis package that provides an end-to-end workflow for
differential expression analysis and simplifies the whole process. It also
introduced TPM, a more robust unit of RNA-Seq measurement.
Step 1. Downloading and installing RSEM
wget http://deweylab.biostat.wisc.edu/rsem/src/rsem-1.2.19.tar.gz
tar -xvzf rsem-1.2.19.tar.gz
cd rsem-1.2.19
make
Step 2. Prerequisites for running RSEM: Perl, R and Bowtie must be
installed. Perl and R are normally present on most computers.
RSEM (RNA-Seq by Expectation-Maximization)
Step 3. Downloading and installing Bowtie
Download Bowtie from
http://sourceforge.net/projects/bowtie-bio/files/bowtie/1.1.1/
Step 4. Copy bowtie into your path or add the bowtie path to the bash
profile
Copying bowtie into your path:
sudo cp -R /Users/appleserver/Desktop/bowtie2 /usr/local/bin
Adding the bowtie path to the bash profile (preferred): add this line to
~/.bash_profile
export PATH="/Users/ravikumar/Desktop/bowtie2:$PATH"
and then run
source ~/.bash_profile
To check whether the path has been added:
echo $PATH
[Screenshot: output of echo $PATH, indicating that the path has been added]
Step 5. Downloading the reference, gunzipping and concatenating
Download the Bos taurus genome from the Ensembl genome browser. An easier
alternative is to use the wget command for a direct download on the HPC:
wget -m ftp://ftp.ensembl.org/pub/release-81/fasta/bos_taurus/dna/
The downloaded .gz files can then be located with:
find . -name "*.gz"
[Screenshot: the folder created by the download]
Each individual chromosome and the GTF can also be downloaded directly from the
FTP site.
The downloaded files are gunzipped using:
gunzip Bos_taurus.UMD3.1.dna.chromosome.*.fa.gz
Concatenating all the fasta files into a combined fasta file
(reference):
cat Bos_taurus.UMD3.1.dna.chromosome.*.fa > combined.fa
Step 6. Download the annotation file in GTF format.
Command for downloading:
wget -m ftp://ftp.ensembl.org/pub/release-81/gtf/bos_taurus
The downloaded GTF file needs to be modified to extract only the exon
annotations.
awk command to extract the exon annotations from the GTF:
awk '$3 == "exon"' Bos_taurus.UMD3.1.8.1.gtf > filtered.gtf
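The same exon filter can be written in Python, which may be convenient when the filtering is part of a larger script. This is a sketch under the assumption that the GTF is tab-delimited with the feature type in the third column; the function name `filter_exons` and the example records are made up for illustration.

```python
def filter_exons(lines):
    """Yield the GTF lines whose third column (feature type) is 'exon'."""
    for line in lines:
        if line.startswith("#"):
            continue  # header/comment lines carry no feature column
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[2] == "exon":
            yield line


# Hypothetical GTF records, used only to demonstrate the filter.
example = [
    "#!genome-build UMD3.1\n",
    '1\tensembl\tgene\t100\t900\t.\t+\t.\tgene_id "G1";\n',
    '1\tensembl\texon\t100\t300\t.\t+\t.\tgene_id "G1";\n',
    '1\tensembl\tCDS\t150\t300\t.\t+\t0\tgene_id "G1";\n',
    '1\tensembl\texon\t600\t900\t.\t+\t.\tgene_id "G1";\n',
]
exon_lines = list(filter_exons(example))
print(len(exon_lines), "exon records kept")
```

To filter a real file, pass an open file handle to `filter_exons` and write the yielded lines to filtered.gtf.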
[Screenshots: the original GTF and filtered.gtf]
Step 7. Prepare the reference using RSEM
To prepare the reference sequence, run the 'rsem-prepare-reference' program.
The command for preparing the reference:
./rsem-prepare-reference --gtf filtered.gtf --bowtie2 combined.fa BT
This creates 12 reference and index files with the prefix BT; the Bowtie 2
index files among them have the extension .bt2
Step 8. Calculating expression values in counts, TPM and FPKM:
To calculate expression values, run the 'rsem-calculate-expression' program.
Command for running rsem-calculate-expression:
For the control sample:
./rsem-calculate-expression --bowtie2 control_R1.fastq BT ControlR1
Six output files are generated for each sample; genes.results (here
ControlR1.genes.results) is the most important of the six.
For the infected sample:
./rsem-calculate-expression --bowtie2 infected_R1.fastq BT infectedR1
Step 9. Combining the RSEM genes.results of all the samples:
RSEM produces "expected counts" ("gene counts") values. After rounding
these expected counts to the nearest integer, they can be used in EBSeq,
DESeq or edgeR to identify differentially expressed genes.
./rsem-generate-data-matrix *.genes.results > genes.results
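The rounding step can be sketched in Python as follows. The layout assumed here (a tab-delimited matrix with a header row of sample names and the gene ID in the first column) and the example values are assumptions for illustration, and `round_count_matrix` is a hypothetical helper name. Python's built-in `round` rounds exact halves to the nearest even integer.

```python
import csv
import io


def round_count_matrix(text):
    """Parse a tab-delimited count matrix and round every count to an integer."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    samples = header[1:]  # first header cell is the (possibly empty) gene column
    counts = {
        row[0]: [round(float(value)) for value in row[1:]]
        for row in reader
        if row
    }
    return samples, counts


# Made-up expected counts for two genes in two samples.
example = (
    "\tinfectedR1.genes.results\tControlR1.genes.results\n"
    "ENSBTAG00000000005\t615.00\t472.67\n"
    "ENSBTAG00000000008\t3.42\t0.00\n"
)
samples, counts = round_count_matrix(example)
print(samples)
print(counts)
```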
Differential expression using EBSeq (Leng et al., 2013):
Empirical Bayesian approach for RNA-Seq data analysis
EBSeq is an R package for identifying genes and isoforms differentially
expressed (DE) across two or more biological conditions in an RNA-seq
experiment. EBSeq uses RSEM counts as input to identify differentially
expressed genes.
Step 1. Installing EBSeq:
To install, type the following commands in R:
source("https://bioconductor.org/biocLite.R")
biocLite("EBSeq")
On current Bioconductor releases biocLite has been replaced by BiocManager:
install.packages("BiocManager")
BiocManager::install("EBSeq")
Step 2. Command for loading the package EBSeq
> library(EBSeq)
Step 3. Command for getting the working directory
> getwd()
Step 4. Command for setting the working directory
> setwd()
Pass the path of the folder that contains the input file as the argument.
Step 5. Input requirement for gene level DE analysis:
The input file formats supported by EBSeq are .csv, .xls, .xlsx and .txt (tab
delimited). In the input file, rows should be the genes and the columns
should be the samples.
Example of the data set in .txt format (genesresults.txt) that is used
here:
[Screenshot: contents of genesresults.txt]
Step 6. Commands to Run EBSeq:
> x=data.matrix(read.table("genesresults.txt"))
> dim(x)
[1] 24596 4
> str(x)
num [1:24596, 1:4] 615 3 0 473 1 286 832 362 103 17 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:24596] "ENSBTAG00000000005" "ENSBTAG00000000008"
"ENSBTAG00000000009" "ENSBTAG00000000010" ...
..$ : chr [1:4] "infectedR1.genes.results" "infectedR2.genes.results"
"ControlR1.genes.results" "ControlR2.genes.results"
> Sizes=MedianNorm(x)
> EBOut=EBTest(Data=x,
+ Conditions=as.factor(rep(c("C1","C2"),each=2)),sizeFactors=Sizes,
+ maxround=5)
Removing transcripts with 75 th quantile < = 10
12071 transcripts will be tested
iteration 1 done
time 0.12
iteration 2 done
time 0.13
iteration 3 done
time 0.08
iteration 4 done
> PP=GetPPMat(EBOut)
> str(PP)
num [1:12071, 1:2] 1 1 0 0 1 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:12071] "ENSBTAG00000000005"
"ENSBTAG00000000010" "ENSBTAG00000000012"
"ENSBTAG00000000013" ...
..$ : chr [1:2] "PPEE" "PPDE"
> DEfound=rownames(PP)[which(PP[,"PPDE"]>=.95)]
> str(DEfound)
chr [1:6528] "ENSBTAG00000000012" "ENSBTAG00000000013"
"ENSBTAG00000000015" "ENSBTAG00000000019"
"ENSBTAG00000000021" "ENSBTAG00000000025"
"ENSBTAG00000000026" "ENSBTAG00000000032" ...
> write.table(DEfound,"DE.txt",sep = "\t",quote = F,col.names=F)
> GeneFC=PostFC(EBOut)
> write.table(GeneFC,"FC.txt",sep = "\t",quote = F,col.names=F)
Genes with PPDE >= 0.95 are called differentially expressed, which corresponds
to a target FDR of 0.05.
Output (FC.txt) columns: GeneID, PostFC, RealFC, comparison
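The `Sizes=MedianNorm(x)` call in the session above estimates one library size factor per sample. The sketch below shows a median-of-ratios calculation of that kind in plain Python. It is written on the assumption that MedianNorm follows this DESeq-style scheme and is not EBSeq's own code; the function name and the toy matrix are illustrative.

```python
import math
from statistics import median


def median_of_ratios(matrix):
    """matrix: list of gene rows, each a list with one count per sample."""
    n_samples = len(matrix[0])
    ratios = [[] for _ in range(n_samples)]
    for row in matrix:
        if any(count <= 0 for count in row):
            continue  # geometric mean would be zero, so the gene is skipped
        geometric_mean = math.exp(sum(math.log(count) for count in row) / n_samples)
        for sample_index, count in enumerate(row):
            ratios[sample_index].append(count / geometric_mean)
    # One size factor per sample: the median ratio to the per-gene geometric mean.
    return [median(sample_ratios) for sample_ratios in ratios]


# Toy matrix: the second sample was sequenced exactly twice as deeply.
toy_counts = [[10, 20], [100, 200], [5, 10], [0, 3]]
size_factors = median_of_ratios(toy_counts)
print(size_factors)
```

Dividing each sample's counts by its size factor puts the samples on a common scale; in the toy matrix the second factor is twice the first.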
The posterior fold change (PostFC) estimates are less extreme for lowly
expressed genes. E.g. if gene1 has Y = 5000 and X = 1000, its FC and
PostFC will both be about 5. If gene2 has Y = 5 and X = 1, its FC will be 5 but
its PostFC will be < 5 and closer to 1. Therefore, when genes are sorted by
PostFC, gene2 will rank below gene1.
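The shrinkage effect described above can be illustrated with a simple pseudocount. This is only an analogy: EBSeq derives PostFC from its fitted empirical Bayes model, not from the formula below, and the function name and the `pseudocount` value of 1 are assumptions chosen for illustration.

```python
def shrunken_fold_change(y, x, pseudocount=1.0):
    """Fold change y/x after adding the same small constant to both values."""
    return (y + pseudocount) / (x + pseudocount)


# gene1 is highly expressed, gene2 is lowly expressed; both have a raw FC of 5.
gene1 = shrunken_fold_change(5000, 1000)
gene2 = shrunken_fold_change(5, 1)
print(round(gene1, 3), round(gene2, 3))
```

The highly expressed gene keeps a fold change close to 5, while the lowly expressed gene is pulled towards 1, so it ranks lower when genes are sorted by the shrunken value.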

POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxRaymartEstabillo3
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Celine George
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
Class 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfClass 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfakmcokerachita
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsKarinaGenton
 
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,Virag Sontakke
 

Recently uploaded (20)

“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
internship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerinternship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developer
 
Painted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaPainted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of India
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
Blooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxBlooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docx
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Class 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfClass 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdf
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its Characteristics
 
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
 

RSEM and DE packages

  • 1. Computational Biology and Genomics Facility, Indian Veterinary Research Institute Revision Normalization and cufflinks
  • 2. Normalization of read count
    R/FPKM (Mortazavi et al., 2008): reads/fragments per kilobase of exon per million mappable reads
      • Corrects for: differences in sequencing depth and transcript length
      • Aims to: compare a gene across samples and different genes within a sample
    TMM (Robinson and Oshlack, 2010): trimmed mean of M values
      • Corrects for: differences in transcript pool composition; extreme outliers
      • Aims to: provide better across-sample comparability
  • 3. Normalization of read count
    Limma voom (logCPM) (Law et al., 2013): counts per million
      • Aims to: stabilize the variance by removing the dependence of the variance on the mean
    TPM (Li et al., 2010; Wagner et al., 2012): transcripts per million
      • Corrects for: transcript length distribution in the RNA pool
      • Aims to: provide better across-sample comparability
  • 4. R/FPKM (Mortazavi et al., 2008)
      • FPKM is used for paired-end reads and RPKM for single-end reads.
      • Fragment means fragment of DNA, so the two reads that comprise a paired-end read count as one.
      • Per kilobase of exon means the fragment counts are normalized by dividing by the total length of all exons in the gene.
      • This bit of magic makes it possible to compare Gene A to Gene B even if they are of different lengths.
      • Per million reads means this value is then normalized against the library size.
      • This bit of magic makes it possible to compare Gene A in Sample 1 with Gene A in Sample 2.
  • 5. R/FPKM (Mortazavi et al., 2008)
    A quantification measurement for gene expression
      • R: expression level of the gene
      • L: length of the gene (in kb)
      • N: depth of the sequencing (in millions of mappable reads)
      • C: total number of reads falling into the gene region
    $R = \dfrac{C}{L \times N}$
    Example: the total exon size of a gene is 3,000 nt. Calculate the expression level of this gene in RPKM for an RNA-seq experiment that contained 50 million mappable reads, with 600 reads falling into the exon regions of this gene.
    $R = \dfrac{600}{3.000 \times 50} = 4.00$
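The formula on this slide can be checked with a few lines of awk. This is only a sketch: the helper name `rpkm` and its argument order are assumptions made for illustration, and it takes the gene length in nucleotides and the depth in raw reads so that the unit conversions are explicit.

```shell
# rpkm COUNT LENGTH_NT TOTAL_MAPPED_READS  (hypothetical helper, not part of any tool)
# Implements R = C / (L * N) with L in kilobases and N in millions of reads.
rpkm() {
  awk -v c="$1" -v len="$2" -v n="$3" 'BEGIN {
    l_kb  = len / 1000       # gene length in kilobases
    n_mil = n / 1000000      # mappable reads in millions
    printf "%.2f\n", c / (l_kb * n_mil)
  }'
}

# Slide example: 600 reads, 3,000 nt of exon, 50 million mappable reads
rpkm 600 3000 50000000
```

For the slide's numbers this prints 4.00, matching the worked example.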
  • 6. Calculation of FPKM/RPKM
    Raw read counts:
      Genes      Sample 1   Sample 2   Sample 3
      1 (2kb)    20         24         60
      2 (4kb)    40         50         120
      3 (1kb)    10         16         30
      4 (10kb)   0          0          2
      Total      70         90         212
    Total reads for samples 1, 2 and 3 are taken as 7M, 9M and 21.2M (millions of reads are equated to a scale of tens of reads), so the per-sample scaling factors are 7, 9 and 21.2.
    Step 1. Divide the read count of each gene by the total reads of the sample to get reads per million (RPM)
      Genes      Sample 1 (RPM)   Sample 2 (RPM)   Sample 3 (RPM)
      1 (2kb)    2.86             2.67             2.83
      2 (4kb)    5.71             5.56             5.66
      3 (1kb)    1.43             1.78             1.42
      4 (10kb)   0                0                0.09
  • 7. Fragments/Reads per kilobase per million reads
    Reads are scaled for both depth and length.
    Step 2. Divide the RPM values obtained in step 1 by the gene lengths (in kb)
      Genes                    Sample 1 (RPKM)   Sample 2 (RPKM)   Sample 3 (RPKM)
      1 (2kb)                  1.43              1.33              1.42
      2 (4kb)                  1.43              1.39              1.42
      3 (1kb)                  1.43              1.78              1.42
      4 (10kb)                 0                 0                 0.009
      Total normalized reads   4.29              4.5               4.25
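The two RPKM steps (divide by depth, then by length) can be reproduced for the whole table at once. The sketch below assumes a whitespace-separated layout of gene, length in kb and one count column per sample; the function name `rpkm_table` and the gene labels `g1` to `g4` are placeholders, and the scaling factors 7, 9 and 21.2 are the ones used on the slides.

```shell
# Columns: gene, length_kb, count_sample1, count_sample2, count_sample3
counts='g1 2 20 24 60
g2 4 40 50 120
g3 1 10 16 30
g4 10 0 0 2'

# rpkm_table D1 D2 D3 : per-sample depth scaling factors (hypothetical helper)
rpkm_table() {
  awk -v d1="$1" -v d2="$2" -v d3="$3" '
    {
      # step 1: divide by depth; step 2: divide by gene length in kb
      r1 = $3 / d1 / $2; r2 = $4 / d2 / $2; r3 = $5 / d3 / $2
      s1 += r1; s2 += r2; s3 += r3
      printf "%s %.3f %.3f %.3f\n", $1, r1, r2, r3
    }
    END { printf "total %.3f %.3f %.3f\n", s1, s2, s3 }'
}

printf '%s\n' "$counts" | rpkm_table 7 9 21.2
```

The last line of the output holds the per-sample RPKM sums (4.286, 4.500 and 4.255), which differ from sample to sample.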
  • 8. Calculation of TPM
    Raw read counts:
      Genes      Sample 1   Sample 2   Sample 3
      1 (2kb)    20         24         60
      2 (4kb)    40         50         120
      3 (1kb)    10         16         30
      4 (10kb)   0          0          2
    Step 1. Divide the read count of each gene by the length of the gene (in kb) to get reads per kilobase (RPK)
      Genes      Sample 1 (RPK)   Sample 2 (RPK)   Sample 3 (RPK)
      1 (2kb)    10               12               30
      2 (4kb)    10               12.5             30
      3 (1kb)    10               16               30
      4 (10kb)   0                0                0.2
      Total      30               40.5             90.2
    Total reads per kb of gene for samples 1, 2 and 3: 3M, 4.05M and 9.02M (millions of reads are equated to a scale of tens of reads).
  • 9. Calculation of TPM
    RPK values from step 1:
      Genes      Sample 1 (RPK)   Sample 2 (RPK)   Sample 3 (RPK)
      1 (2kb)    10               12               30
      2 (4kb)    10               12.5             30
      3 (1kb)    10               16               30
      4 (10kb)   0                0                0.2
    Step 2. Divide the RPK values obtained in step 1 by the total RPK of the sample (3, 4.05 and 9.02 on the scale used here)
      Genes      Sample 1 (TPM)   Sample 2 (TPM)   Sample 3 (TPM)
      1 (2kb)    3.33             2.96             3.326
      2 (4kb)    3.33             3.09             3.326
      3 (1kb)    3.33             3.95             3.326
      4 (10kb)   0                0                0.02
      Total      10               10               10
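The TPM calculation reverses the order of the two divisions: length first, then the per-sample total. A minimal sketch for one sample column follows; `tpm_column` is a hypothetical helper, its input layout (gene, length in kb, count) is an assumption, and the optional scale argument defaults to one million but is set to 10 here to match the scale of the slide tables.

```shell
# tpm_column [SCALE] : reads "gene length_kb count" lines on stdin (hypothetical helper)
tpm_column() {
  awk -v scale="${1:-1000000}" '
    { gene[NR] = $1; rpk[NR] = $3 / $2; total += rpk[NR] }   # step 1: reads per kilobase
    END {
      if (total == 0) { print "no reads counted" > "/dev/stderr"; exit 1 }
      for (i = 1; i <= NR; i++)                               # step 2: rescale so the column sums to SCALE
        printf "%s %.2f\n", gene[i], rpk[i] / total * scale
    }'
}

# Sample 2 from the slides, on the slides' scale of 10
printf '%s\n' 'g1 2 24' 'g2 4 50' 'g3 1 16' 'g4 10 0' | tpm_column 10
```

Because every column is divided by its own RPK total, the values of each sample always sum to the same number, which is what makes TPM comparable across samples.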
  • 10. RPKM vs TPM
      Genes                    Sample 1 (RPKM)   Sample 2 (RPKM)   Sample 3 (RPKM)
      1 (2kb)                  1.43              1.33              1.42
      2 (4kb)                  1.43              1.39              1.42
      3 (1kb)                  1.43              1.78              1.42
      4 (10kb)                 0                 0                 0.009
      Total normalized reads   4.29              4.5               4.25

      Genes                    Sample 1 (TPM)    Sample 2 (TPM)    Sample 3 (TPM)
      1 (2kb)                  3.33              2.96              3.326
      2 (4kb)                  3.33              3.09              3.326
      3 (1kb)                  3.33              3.95              3.326
      4 (10kb)                 0                 0                 0.02
      Total normalized reads   10                10                10
    Sums of total normalized reads: they differ between samples for RPKM but are identical for TPM.
  • 11. TMM: Trimmed Mean of M Values
    Attempts to correct for differences in RNA composition between samples.
    E.g., if certain genes are very highly expressed in one tissue but not in another, there is less "sequencing real estate" left for the less expressed genes in that tissue, and RPKM normalization (or similar) gives biased expression values for them compared with the other sample.
    Figure: RNA population 1 vs RNA population 2. At equal sequencing depth, yellow and green get lower RPKM in RNA population 1, although their expression levels are actually the same in populations 1 and 2.
    Robinson and Oshlack, Genome Biology 2010, 11:R25, http://genomebiology.com/2010/11/3/R25
  • 12. Identification of differentially expressed genes
    Input: quality filtered/trimmed RNA-Seq short reads
    Reference genome available? (Y/N)
      • Y: Mapping to the reference (GMAP-GSNAP, TopHat, Bowtie, etc.), followed by one of two strategies:
          - FPKM based strategy: calculate transcript abundances (Cufflinks), then detection of DEGs (cuffdiff2)
          - Count based strategy: generate count data (RSEM), then detection of DEGs (DESeq, edgeR, EBSeq)
      • N: De novo transcriptome assembly (Trinity), then mapping and detection of DEGs (RSEM)
  • 13. Genome Mapping and Alignment using GMAP - GSNAP
    Genomic Mapping and Alignment Program
      • GMAP is a standalone program for mapping and aligning cDNA sequences to a genome.
      • The program maps and aligns a single sequence with minimal startup time and memory requirements, and provides fast batch processing of large sequence sets.
      • The program generates accurate gene structures, even in the presence of substantial polymorphisms and sequence errors, without using probabilistic splice site models.
      • It initially used a hashing scheme but later moved to a much more efficient double lookup scheme.
    Step 1. Command for indexing the genome:
      gmap_build -d btau8 bosTau8.fa
  • 14. The index files are created in the folder btau8.
    Step 2. Mapping the reads to the genome (-A sam requests SAM output)
      gsnap -d btau8 -t 4 -A sam control_R1.fastq > control_R1.sam
  • 15. • The end product of the GSNAP aligner is a SAM file, which needs to be converted into a BAM file for further analysis in cufflinks.
    • Repeat the same for the other replicates by changing the input file name.
    • A total of four SAM files are generated separately.
    The BAM files generated can be analysed in two ways:
      1. The BAM files can be used to generate a merged assembly of transcripts via cufflinks and cuffmerge. This merged assembly (i.e. merged.gtf) is used in cuffdiff to identify differentially expressed genes.
      2. Cuffdiff can be used directly on the BAM files to identify differentially expressed genes.

  • 16. Step 3. Converting SAM to BAM using samtools
      ./samtools view -bSh aln.sam > aln.bam
      -b: output in the BAM format
      -S: input is in the SAM format (uppercase S; lowercase -s is the subsampling option)
      -h: include the header in the output
    For the control sample:
      ./samtools view -bSh control_R1.sam > control_R1.bam
    For the infected sample:
      ./samtools view -bSh infected_R1.sam > infected_R1.bam
  • 17. Step 4. Sorting BAM using samtools
    Command for sorting (samtools 0.1.x syntax; the second argument is an output prefix, so this writes aln.sorted.bam):
      ./samtools sort aln.bam aln.sorted
    Example:
    For the control sample:
      ./samtools sort control_R1.bam control_R1_sorted
    For the infected sample:
      ./samtools sort infected_R1.bam infected_R1_sorted
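Since the conversion and sorting steps are repeated for each of the four samples, they can be generated in a loop. The sketch below only prints the commands rather than running them; the function name `make_bam_commands` is hypothetical, the sample names follow the file names used in these slides, and the `sort <in.bam> <out.prefix>` form assumes the older samtools 0.1.x syntax.

```shell
# make_bam_commands SAMPLE... : print the SAM -> BAM -> sorted BAM commands
# for each sample name (hypothetical helper; nothing is executed here).
make_bam_commands() {
  for sample in "$@"; do
    echo "samtools view -bS -h ${sample}.sam > ${sample}.bam"
    echo "samtools sort ${sample}.bam ${sample}_sorted"
  done
}

make_bam_commands control_R1 control_R2 infected_R1 infected_R2
```

Piping the output to `sh` would execute the commands in order once they have been reviewed.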
  • 18. Step 5. (Option 1) Differential expression using cufflinks, cuffmerge and cuffdiff
    Command for running cufflinks on a BAM file
    For the control sample:
      cufflinks -G btau8refflat.gtf -g btau8refflat.gtf -b bosTau8.fa -u -L CN control_R1_sorted.bam
  • 19. For the infected sample:
      cufflinks -G btau8refflat.gtf -g btau8refflat.gtf -b bosTau8.fa -u -L CN infected_R1_sorted.bam
    These commands generate a transcripts.gtf file for each replicate, which are then used in cuffmerge to generate a merged assembly. This merged assembly is then used in cuffdiff to identify differentially expressed genes.
  • 20. Command for running cuffmerge
      cuffmerge -g btau8refflat.gtf -s bosTau8.fa -p 8 assemblies.txt
    assemblies.txt is the file with the list of all the GTFs. This generates a merged.gtf in the merged_asm folder. This file is used in the next cuffdiff command.
    Command for running cuffdiff (replicates of one condition are separated by commas, conditions by spaces)
      cuffdiff merged_asm/merged.gtf control_R1_sorted.bam,control_R2_sorted.bam infected_R1_sorted.bam,infected_R2_sorted.bam
    This command generates many files, of which gene_exp.diff is the one of interest.
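The assemblies.txt file passed to cuffmerge is a plain list with one GTF path per line. A minimal sketch of building it is shown below; it assumes each cufflinks run was given its own output directory with `-o`, and the `<sample>_clout` directory naming and the helper name `write_assemblies` are purely hypothetical, since the slides do not show where each transcripts.gtf was written.

```shell
# write_assemblies LISTFILE SAMPLE... : one transcripts.gtf path per sample.
# The "<sample>_clout" directory convention is an assumption for illustration.
write_assemblies() {
  list="$1"; shift
  : > "$list"                                   # start with an empty list file
  for sample in "$@"; do
    printf '%s\n' "${sample}_clout/transcripts.gtf" >> "$list"
  done
}

workdir=$(mktemp -d)
write_assemblies "$workdir/assemblies.txt" control_R1 control_R2 infected_R1 infected_R2
cat "$workdir/assemblies.txt"
```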
  • 21. Step 5. (Option 2) Differential expression using CuffDiff directly
    CuffDiff computes differentially expressed genes in the set. At least two samples, infected and control, are required for computing differential expression. CuffDiff should always be run on replicates, i.e. N infected vs N control.
    Command:
      cuffdiff -p <int> -N transcripts.gtf
      -p: num-threads <int>
      -N: upper-quartile normalization (--upper-quartile-norm in older Cufflinks releases)
    Running cuffdiff for our BAM files:
      cuffdiff -p 3 -N -o cuffdiff_out btau8refflat.gtf control_R1_sorted.bam,control_R2_sorted.bam infected_R1_sorted.bam,infected_R2_sorted.bam
  • 22. gene_exp.diff
    Annotated columns of the file, in order:
      • test_id: a unique identifier describing the object (gene, transcript, CDS, primary transcript) being tested
      • gene_id: Gene ID
      • gene: Gene Name
      • locus: genomic coordinates for easy browsing to the genes or transcripts being tested
      • sample_1: Control
      • sample_2: Infected
      • status: OK (test successful), NOTEST (not enough alignments for testing), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents testing
      • value_1: FPKM in Sample 1
      • value_2: FPKM in Sample 2
      • log2(fold_change): the (base 2) log of the fold change y/x
      • test_stat: the value of the test statistic used to compute significance of the observed change in FPKM
      • p_value: the uncorrected p-value of the test statistic
    Log2 fold change = log2(FPKM of infected / FPKM of control) = log2(0.576748 / 3.92513) = -2.76673
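The fold-change arithmetic and a simple filter on gene_exp.diff can be sketched in awk. Both helper names are hypothetical. `significant_rows` assumes the usual tab-separated cuffdiff layout in which the 14th and last column is `significant` with values `yes` or `no`; `log2fc` prints three decimals and returns NA when either FPKM is zero, which is a choice made for this sketch rather than what cuffdiff itself reports.

```shell
# log2fc X Y : log2(Y / X), with X the control FPKM and Y the infected FPKM
log2fc() {
  awk -v x="$1" -v y="$2" 'BEGIN {
    if (x <= 0 || y <= 0) { print "NA"; exit }   # log ratio undefined for zero FPKM
    printf "%.3f\n", log(y / x) / log(2)          # awk log() is the natural log
  }'
}

# significant_rows FILE : header line plus rows whose 14th column reads "yes"
significant_rows() {
  awk -F'\t' 'NR == 1 || $14 == "yes"' "$1"
}

# FPKM values from the slide: control 3.92513, infected 0.576748
log2fc 3.92513 0.576748
```

A typical call for the filter would be `significant_rows cuffdiff_out/gene_exp.diff > significant_genes.txt`, where the output file name is again just an example.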
  • 23. Identification of differentially expressed genes - I (using RSEM - EBSeq)
  • 24. Count based strategy
    Quality filtered/trimmed RNA-Seq short reads
      → Reference genome (download the reference genome and GTF from the Ensembl genome browser)
      → Mapping to the reference (Bowtie)
      → Calculate transcript abundances (RSEM)
      → Detection of DEGs (DESeq, edgeR, EBSeq)
  • 25. RSEM (RNA-Seq by Expectation-Maximization) (Li and Dewey, 2011)
    RSEM is a cutting-edge RNA-Seq analysis package that is an end-to-end solution for differential expression and simplifies the whole process. It also introduces a new, more robust unit of RNA-Seq measurement called TPM.
    Step 1. Downloading and installing RSEM
      wget http://deweylab.biostat.wisc.edu/rsem/src/rsem-1.2.19.tar.gz
      tar -xvzf rsem-1.2.19.tar.gz
      cd rsem-1.2.19
      make
    Step 2. Prerequisites for running RSEM: Perl, R and Bowtie must be installed. Perl and R are normally present on most computers.
  • 26. RSEM (RNA-Seq by Expectation-Maximization)
    Step 3. Downloading and installing Bowtie
      Download Bowtie from http://sourceforge.net/projects/bowtie-bio/files/bowtie/1.1.1/
    Step 4. Copy bowtie into your path or add the bowtie path to the bash profile
      Copying bowtie into your path:
        sudo cp -R /Users/appleserver/Desktop/bowtie2 /usr/local/bin
      Adding the bowtie path to the bash profile (preferred):
        export PATH="/Users/ravikumar/Desktop/bowtie2:$PATH"
      Then run source ~/.bash_profile, and use echo $PATH to check whether the path has been added.
  • 27. Step 5. Downloading the reference, gunzipping and concatenating
    Download the Bos taurus genome from the Ensembl genome browser. An easier alternative is to use the wget command for a direct download on the HPC:
      wget -m ftp://ftp.ensembl.org/pub/release-81/fasta/bos_taurus/dna/ &or f in $(find . -name "*.gz")
  • 28. The folder that is created is as below
  • 29. Direct download of each individual chromosome and gtf from the ftp site can be done
  • 30. The downloaded files are gunzipped using:
      gunzip Bos_taurus.UMD3.1.dna.chromosome.*.fa.gz
    Concatenating all the fasta files into a combined fasta file (reference):
      cat Bos_taurus.UMD3.1.dna.chromosome.*.fa > combined.fa
    Step 6. Download the annotation file in gtf format.
    Command for downloading:
      wget -m ftp://ftp.ensembl.org/pub/release-81/gtf/bos_taurus
    The downloaded gtf file needs to be modified to extract only the exon annotations.
    awk command to extract the exon annotations from the gtf:
      awk '$3 == "exon"' Bos_taurus.UMD3.1.8.1.gtf > filtered.gtf
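The effect of the exon filter is easy to see on a toy input. In the sketch below the three input lines are made up for illustration (they are not taken from the Ensembl file), and `filter_exons` is a hypothetical wrapper that splits strictly on tabs, a slightly stricter variant of the whitespace-splitting command on the slide.

```shell
# filter_exons [FILE...] : keep only GTF records whose feature column (3rd) is "exon"
filter_exons() {
  awk -F'\t' '$3 == "exon"' "$@"
}

# Toy GTF with a gene record, an exon record and a header comment (made-up values)
printf '1\tensembl\tgene\t100\t900\t.\t+\t.\tgene_id "toy1";\n1\tensembl\texon\t100\t300\t.\t+\t.\tgene_id "toy1";\n#!genome-build toy\n' | filter_exons
```

Only the exon record survives; the gene record and the `#!` header comment are dropped because their third tab-separated field is not `exon`.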
  • 31. Figure: filtered.gtf compared with the original gtf
  • 32. Step 7. Prepare the reference using RSEM
    To prepare the reference sequence, run the 'rsem-prepare-reference' program.
    The command for preparing the reference:
      ./rsem-prepare-reference --gtf filtered.gtf --bowtie2 combined.fa BT
    This creates 12 files with the prefix BT, including the index files with the extension bt2.
  • 33. Step 8. Calculating expression values in counts, TPM and FPKM
    To calculate expression values, run the 'rsem-calculate-expression' program.
    For the control sample:
      ./rsem-calculate-expression --bowtie2 control_R1.fastq BT ControlR1
    Six files are generated; genes.results is the most important of the six.
    For the infected sample:
      ./rsem-calculate-expression --bowtie2 infected_R1.fastq BT infectedR1
  • 34. Step 9. Combining the RSEM genes.results of all the samples
    RSEM produces "expected counts" or "gene counts" values. After rounding these expected counts to the nearest integer, EBSeq, DESeq or edgeR can be used to identify differentially expressed genes.
      ./rsem-generate-data-matrix *.genes.results > genes.results
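The rounding step mentioned on this slide can be done with awk before the matrix is loaded into R. This sketch assumes a tab-separated matrix with one header row and the gene ID in the first column; the helper name `round_counts` and the toy values are made up for illustration. Note that `%.0f` typically rounds exact halves to the nearest even integer.

```shell
# round_counts [FILE...] : round every count column to the nearest integer,
# leaving the header row and the gene ID column untouched (hypothetical helper).
round_counts() {
  awk 'BEGIN { FS = OFS = "\t" }
       NR == 1 { print; next }                       # keep the header row as is
       { for (i = 2; i <= NF; i++) $i = sprintf("%.0f", $i); print }' "$@"
}

# Toy matrix with made-up expected counts
printf '\tS1\tS2\ngeneA\t10.4\t3.6\ngeneB\t0.00\t7.2\n' | round_counts
```

With the file produced above, a call such as `round_counts genes.results > genesresults.txt` would write the rounded matrix; the output name matches the file read in the later EBSeq step.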
  • 35. Differential expression using EBSeq (Leng et al., 2013): Empirical Bayesian approach for RNA-Seq data analysis
    EBSeq is an R package for identifying genes and isoforms differentially expressed (DE) across two or more biological conditions in an RNA-seq experiment. EBSeq uses RSEM counts as input to identify differentially expressed genes.
    Step 1. Installing EBSeq. To install, type the following commands in R:
      source("https://bioconductor.org/biocLite.R")
      biocLite("EBSeq")
    (On Bioconductor 3.8 and later, biocLite has been replaced by BiocManager::install("EBSeq").)
    Step 2. Command for loading the package EBSeq
      > library(EBSeq)
    Step 3. Command for getting the working directory
      > getwd()
  • 36. Step 4. Command for setting the working directory
      > setwd()
    Step 5. Input requirement for gene level DE analysis: the input file formats supported by EBSeq are .csv, .xls, .xlsx and .txt (tab delimited). In the input file, rows should be the genes and columns should be the samples. The example data set used here is in .txt format (genesresults.txt).
  • 37. Step 6. Commands to run EBSeq:
      > x=data.matrix(read.table("genesresults.txt"))
      > dim(x)
      [1] 24596 4
      > str(x)
       num [1:24596, 1:4] 615 3 0 473 1 286 832 362 103 17 ...
       - attr(*, "dimnames")=List of 2
        ..$ : chr [1:24596] "ENSBTAG00000000005" "ENSBTAG00000000008" "ENSBTAG00000000009" "ENSBTAG00000000010" ...
        ..$ : chr [1:4] "infectedR1.genes.results" "infectedR2.genes.results" "ControlR1.genes.results" "ControlR2.genes.results"
      > Sizes=MedianNorm(x)
      > EBOut=EBTest(Data=x,
      + Conditions=as.factor(rep(c("C1","C2"),each=2)),sizeFactors=Sizes,
      + maxround=5)
  • 38.   Removing transcripts with 75 th quantile < = 10
      12071 transcripts will be tested
      iteration 1 done
      time 0.12
      iteration 2 done
      time 0.13
      iteration 3 done
      time 0.08
      iteration 4 done
      > PP=GetPPMat(EBOut)
      > str(PP)
       num [1:12071, 1:2] 1 1 0 0 1 ...
       - attr(*, "dimnames")=List of 2
        ..$ : chr [1:12071] "ENSBTAG00000000005" "ENSBTAG00000000010" "ENSBTAG00000000012" "ENSBTAG00000000013" ...
        ..$ : chr [1:2] "PPEE" "PPDE"
  • 39.   > DEfound=rownames(PP)[which(PP[,"PPDE"]>=.95)]
      > str(DEfound)
       chr [1:6528] "ENSBTAG00000000012" "ENSBTAG00000000013" "ENSBTAG00000000015" "ENSBTAG00000000019" "ENSBTAG00000000021" "ENSBTAG00000000025" "ENSBTAG00000000026" "ENSBTAG00000000032" ...
      > write.table(DEfound,"DE.txt",sep = "\t",quote = F,col.names=F)
      > GeneFC=PostFC(EBOut)
      > write.table(GeneFC,"FC.txt",sep = "\t",quote = F,col.names=F)
    Output columns: GeneID, PostFC, RealFC (for comparison)
  • 40. The posterior fold change (PostFC) estimates give less extreme values for lowly expressed genes. E.g., if gene1 has Y = 5000 and X = 1000, its FC and PostFC will both be 5. If gene2 has Y = 5 and X = 1, its FC will be 5 but its PostFC will be < 5 and closer to 1. Therefore, when we sort by PostFC, gene2 will rank as less significant than gene1.