Introduction to RNAseq & Differential Gene Expression
Introduction to RNA-seq
Differential Gene Expression
By : Amit Kumar Singh
Next Generation Sequencing data
• Traditional Sanger vs. next-generation sequencing
– Reduced cost per base
– Reduced sequencing time
– Covering wide range of
Massive growth in amount of data generation since 2006
Sequencing is cheaper and faster.
Analysis of the data is complex and time-consuming.
Comparative costs: sequencing a human genome
Next Generation Sequencing: Possibilities
What is a Transcriptome ?
Complete set of all RNA molecules in a cell. It includes mRNA, rRNA, tRNA
and other non-coding RNA.
Array of mRNA transcripts produced in a particular cell or tissue type.
Transcriptomics, also referred to
as expression profiling, examines the expression
level of mRNAs in a given cell population.
GENOME vs. TRANSCRIPTOME
Genome: Content is fixed
Transcriptome: Content is time- and cell-specific
and is much more complex than the genome
Next-generation sequencing (NGS) of cDNA (RNA-Seq) is becoming more widely
adopted for transcriptome profiling.
* Dropping prices and maturing technology are making NGS the technology of choice
RNA-Seq does not depend on genome annotation
Transcript reconstruction – non model organisms.
Transcript verification – model organisms
RNA-Seq is the method of choice in projects using non-model organisms and for
novel transcript discovery and genome annotation.
Accurate expression level determination
Current wet-lab RNA-Seq strategies require lengthy library preparation procedures
Different types of RNA
Transcripts and alternate splicing
An RNA transcript is the code that is copied from one strand of DNA (known
as the template strand).
mRNA is the strand that carries the code out of the nucleus
and into the cytoplasm.
The precursor mRNA (pre-mRNA) undergoes alternative splicing, in which introns are spliced out.
Transcripts sharing same TSS or CDS
What is RNAseq ?
Sequencing based method to
Use of Next-Generation
Sequencing (NGS) technology to
measure RNA levels
Generating and sequencing
‘reads’ from cDNA
Mapping reads to reference
Quantification of assembled
Experiment design : Replicates
Technical replicates: measure a quantity repeatedly from one source.
E.g.: 5 samples from a single patient suffering from lung cancer
Biological replicates: measure a quantity from different sources under the same conditions.
E.g.: 5 samples, one from each of 5 different patients suffering from lung cancer
Use of replicates
– Minimize experimental variation or artifacts
– Improve results by averaging out random variation
– The more data, the more robust the statistical test
and the more power it has to detect real differences
Transcript assembly or genome
Transcript and gene quantification
pipeline for detecting
An overview of RNAseq for Differential gene
Tuxedo Pipeline for RNAseq analysis
Objective: To find the unique location where a short
read is identical to the reference
Reality: The reference is never a perfect representation
of the actual biological source of the RNA being sequenced
Samples have specific attributes such as SNPs and indels;
short reads can align perfectly to multiple locations and
can contain sequencing errors
The real task is to find the location where each short
read best matches the reference, allowing for errors
and structural variation
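To make the "best match allowing for errors" idea concrete, here is a minimal Python sketch. It is only an illustration, not how Bowtie or TopHat work internally: it slides the read along the reference, counts mismatches at each position, and keeps every position that ties for the lowest count. The function name, the `max_mismatches` parameter and the toy sequences are assumptions made for this example, and indels and splicing are ignored.

```python
def best_matches(read, reference, max_mismatches=2):
    """Return (mismatch_count, [0-based start positions]) for the best
    placements of `read` on `reference`, or None if no placement has
    at most `max_mismatches` mismatches. Substitutions only."""
    best_count, best_positions = None, []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches > max_mismatches:
            continue
        if best_count is None or mismatches < best_count:
            best_count, best_positions = mismatches, [start]
        elif mismatches == best_count:
            best_positions.append(start)  # equally good hit: a multimap
    return None if best_count is None else (best_count, best_positions)


reference = "TTGACCATGCAAGTCCATGGA"          # toy reference, hypothetical
print(best_matches("CCATGC", reference))     # unique exact hit
print(best_matches("CCATGA", reference))     # two equally good hits
```

The second read has no exact match and fits two positions equally well with one mismatch each, which is the multimapping situation described in this section.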
Problem in mapping of reads spanning splice junctions
These reads are
Splice junction aligners break junction Reads
and index the information
Multimaps: Reads that map equally well to several locations
Paired-end reads reduce the problem of multimapping
Splice junction mapper
Initial mapping onto the genome (exons)
by Bowtie, an ultrafast short read aligner
Builds a database of possible splice junctions
Maps unmapped reads against the
It also splits the unmapped reads into
smaller fragments to map onto exons.
Input to know: GTF file
• GTF: Gene Transfer Format
• A reference GTF file is a collection of every transcript (genes and
their isoforms, plus non-coding RNA transcripts)
• Available from genome databases such as Ensembl, UCSC and RefSeq
Sample Ref.GTF file format
Attributes of transcripts
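As a rough illustration of the GTF layout, the Python sketch below splits one GTF line into its nine tab-separated columns and turns the attribute column into a dictionary. The example line, its source name and its gene and transcript identifiers are made up for this sketch; real files from Ensembl, UCSC or RefSeq carry more attributes.

```python
def parse_gtf_line(line):
    """Split one GTF record into its nine standard columns and parse
    the semicolon-separated attribute column into a dict."""
    (seqname, source, feature, start, end,
     score, strand, frame, attr_text) = line.rstrip("\n").split("\t")
    attributes = {}
    for item in attr_text.split(";"):
        item = item.strip()
        if item:
            key, value = item.split(" ", 1)
            attributes[key] = value.strip().strip('"')
    return {
        "seqname": seqname, "source": source, "feature": feature,
        "start": int(start), "end": int(end),  # 1-based, inclusive
        "score": score, "strand": strand, "frame": frame,
        "attributes": attributes,
    }


# Hypothetical record, for illustration only.
example = ('chr19\texample\texon\t1000\t1200\t.\t+\t.\t'
           'gene_id "GENE1"; transcript_id "GENE1.1"; exon_number "1";')
record = parse_gtf_line(example)
print(record["feature"], record["start"], record["end"], record["attributes"])
```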
Mapping with TopHat
How to use it
TopHat is a splice junction aligner. At the backend it uses Bowtie for mapping
short reads onto the genome.
Bowtie uses an extremely economical data structure
called the FM index to store the reference genome sequence, which
allows it to be searched rapidly.
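The idea behind FM-index search can be sketched in a few lines of Python. This is a teaching sketch under simplifying assumptions, not Bowtie's implementation: it builds the suffix array by plain sorting and stores full count tables, so it only suits toy sequences. All function names and the toy genome are assumptions made for this example.

```python
def build_fm_index(text):
    """Build a toy FM index: suffix array, first-column offsets, and
    occurrence counts over the Burrows-Wheeler transform (BWT)."""
    text += "$"  # sentinel; sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # character preceding each suffix
    alphabet = sorted(set(text))
    first, total = {}, 0  # first[c] = number of characters smaller than c
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, first, occ


def find(pattern, sa, first, occ):
    """Backward search: return sorted 0-based positions of exact matches."""
    lo, hi = 0, len(sa)  # current suffix-array interval [lo, hi)
    for c in reversed(pattern):
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


index = build_fm_index("ACGTACGTTACG")  # toy genome, hypothetical
print(find("ACG", *index))  # [0, 4, 9]
```

The search touches one pattern character per step regardless of genome length, which is why this kind of index can be searched rapidly.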
Indexing of the reference genome:
E.g.: The reference genome is chr19.fa. Indexing of the reference genome is done with the
bowtie2 utility bowtie2-build.
bowtie2-build <Ref genome fasta> <prefix>
[user]$ bowtie2-build chr19.fa chr19
(i) Mapping without using reference annotation
[user]$ tophat chr19 reads1.fastq reads2.fastq
(ii) Mapping using reference annotation
It uses the reference annotation (GTF) for known splice junction locations
for better mapping.
[user]$ tophat -G chr19.gtf chr19 reads1.fastq reads2.fastq
(iii) Mapping only to the reference annotation
[user]$ tophat -G chr19.gtf --no-novel-juncs chr19 reads1.fastq
Note: The Gene Transfer Format (GTF) is a file format used to hold information
about gene structure.
New feature: Mapping on the transcriptome:
You can even map your reads directly onto the transcriptome with this new feature of TopHat.
When providing TopHat with a known transcript file (-G/--GTF option above), a
transcriptome sequence file is built.
Bowtie then creates the index for these new transcriptome sequences.
Reads are then aligned to these known transcripts (first run).
[user]$ tophat -o output_sample1 -G chr19.gtf --transcriptome-index=transcriptome/known chr19 sample1_1.fastq sample1_2.fastq
Once the transcriptome index is built, there is no need to specify the -G option the next
time you run TopHat for other samples (next run, mapping on the transcriptome):
[user]$ tophat -o output_sample2 --transcriptome-index=transcriptome/known chr19
Output of Tophat
1. accepted_hits.bam. A list of read alignments in BAM format.
2. junctions.bed. A UCSC BED track of junctions reported by TopHat.
The score is the number of alignments spanning the junction.
Alignments are reported in BAM files.
BAM is the compressed, binary version of SAM, a flexible
and general-purpose read alignment format.
Many downstream analysis tools accept SAM and BAM as input.
There are also numerous utilities for viewing and manipulating
SAM and BAM files. Perhaps the most popular among these
is SAMtools.
Mapping quality mate
CIGAR string (describes the position of insertions/deletions/matches in the
alignment, encodes splice junctions, for example)
For more information : http://samtools.sourceforge.net/samtools.shtml
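To show how a CIGAR string encodes a splice junction, here is a small Python sketch. It reads the CIGAR field of a SAM record, adds up the operations that consume reference bases, and reports each `N` operation (skipped region, i.e. an intron) as reference coordinates. The function names and the example alignment (position 100, CIGAR `50M1000N25M`) are assumptions chosen for illustration.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")  # operations that advance along the reference


def parse_cigar(cigar):
    """Turn '50M1000N25M' into [(50, 'M'), (1000, 'N'), (25, 'M')]."""
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


def introns(pos, cigar):
    """Return (start, end) of each skipped region, 1-based inclusive,
    given the 1-based leftmost mapping position `pos` from the SAM record."""
    found, ref_pos = [], pos
    for n, op in parse_cigar(cigar):
        if op == "N":
            found.append((ref_pos, ref_pos + n - 1))
        if op in CONSUMES_REFERENCE:
            ref_pos += n
    return found


print(parse_cigar("50M1000N25M"))   # [(50, 'M'), (1000, 'N'), (25, 'M')]
print(reference_span("50M1000N25M"))  # 1075
print(introns(100, "50M1000N25M"))  # [(150, 1149)]
```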
Start & End
Junctions View on
Analysis with samtools
(i) View the BAM file
[user]$ samtools view accepted_hits.bam
(ii) Convert the BAM file into non binary SAM file
[user]$ samtools view accepted_hits.bam > accepted_hits.sam
(iii) Count the number of lines of sam file
[user]$ wc -l accepted_hits.sam
(iv) Sorting of BAM file (samtools 0.1.x syntax; samtools 1.x uses: samtools sort -o outprefix.bam accepted_hits.bam)
[user]$ samtools sort accepted_hits.bam outprefix
(v) Indexing of BAM file
[user]$ samtools index accepted_hits.bam
(vi) Getting the statistics of the BAM file
[user]$ samtools flagstat accepted_hits.bam
Cufflinks is used to generate a transcriptome assembly for each sample. Cufflinks assembles
individual transcripts from RNA-seq reads that have been aligned to the genome.
More reads are mapped to a transcript if it is
- at a higher depth of coverage
• Normalize such that
features of different lengths and from different conditions can be compared
• Need for normalization:
to reduce bias within a sample or between different samples
• FPKM is one such normalization strategy adopted by cufflinks.
• Cufflinks estimates the abundance values in
FPKM (fragments per kilobase of transcript per
million mapped fragments)
• Cufflinks ensures that expression levels for
different genes and transcripts can be compared
across runs by FPKM values.
• FPKM is a measure of how many reads have
been recorded for each transcript normalized by
transcript length and the total number of reads.
$$\mathrm{FPKM} = \frac{10^{9} \times C}{N \times L}$$
C = the number of reads mapped onto the gene's exons
N = total number of reads in the experiment
L = the sum of the exon lengths in base pairs.
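A quick worked example of the FPKM calculation, written as a Python sketch with C, N and L as defined above. The function name and the numbers are made up for illustration; Cufflinks itself estimates abundances with a statistical model rather than this direct count.

```python
def fpkm(c, n, l):
    """FPKM = 10^9 * C / (N * L), with
    c = reads mapped onto the gene's exons,
    n = total number of reads in the experiment,
    l = summed exon length in base pairs."""
    return 1e9 * c / (n * l)


# Hypothetical gene: 500 reads on 2,000 bp of exons, 20 million reads in total.
value = fpkm(500, 20_000_000, 2_000)
print(value)  # 12.5

# A transcript twice as long with twice the reads has the same FPKM.
print(fpkm(1_000, 20_000_000, 4_000))  # 12.5
```

The second call shows why length normalization matters: the longer transcript collects more reads but is not more highly expressed.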
Visualizing data on IGV
Gene in BAM
The assemblies generated by
Cufflinks are then merged
together using the Cuffmerge
utility (a component of the Cufflinks package).
This merged assembly provides
a uniform basis for calculating
gene and transcript expression
in each condition.
[user]$ cuffmerge -s genome/chr19.fa -g chr19.gtf assembly_GTF.list
where assembly_GTF.list is a text file containing the paths of all Cufflinks assemblies you
want to merge.
Cuffdiff: Protocol to estimate differential gene expression
Calculates expression levels and tests the statistical significance of observed changes
Estimates log2 fold change
log2(FPKM_B / FPKM_A)
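The log2 fold change can be worked through with a short Python sketch. The function name and the FPKM values are assumptions for illustration, and returning plus or minus infinity when one condition has zero expression is a choice made for this sketch.

```python
import math


def log2_fold_change(fpkm_a, fpkm_b):
    """log2(FPKM_B / FPKM_A): positive when expression is higher in
    condition B, negative when it is higher in condition A."""
    if fpkm_a == 0 and fpkm_b == 0:
        return 0.0             # no expression in either condition
    if fpkm_a == 0:
        return math.inf        # expressed only in B (sketch convention)
    if fpkm_b == 0:
        return -math.inf       # expressed only in A (sketch convention)
    return math.log2(fpkm_b / fpkm_a)


print(log2_fold_change(10.0, 40.0))  # 2.0  (four-fold up in B)
print(log2_fold_change(40.0, 10.0))  # -2.0 (four-fold down in B)
```

On the log2 scale, a four-fold increase and a four-fold decrease are symmetric around zero, which makes up- and down-regulated genes easy to compare.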
Cuffdiff reports numerous output files containing the results of its differential
analysis of the samples.
These files contain statistical values such as fold change and P values, gene and
transcript features such as common name and location in the genome, and the FPKM
values for each feature.
[user]$ cuffdiff merged_asm/merged.gtf sample1/tophat_out/accepted_hits.bam
Tools used in RNAseq
DESeq (R package)
Gene set enrichment analysis
Identification of GO terms that are significantly overrepresented
in the given set of genes.
A hypergeometric statistical test is performed to identify such terms.
Simple Example :
Let your statistically significant gene list = 694 genes (each gene associated with GO
Total genes in the organism = 10,738
Total genes with the cell division GO term (biological process) in the organism = 634
Hypergeometric test will predict (with its statistical values for confidence): Out of
694 genes 107 genes have cell division GO term (Biological process) which is
You can conclude that there is cell division which is altered between normal and
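The numbers in this example can be put through a hypergeometric test with a short Python sketch that uses only the standard library. It computes the probability of seeing at least 107 cell division genes in a list of 694 by chance, given 634 such genes among 10,738. The function name is an assumption, and this is a plain one-sided test without the multiple-testing correction that GO enrichment tools normally apply.

```python
from math import comb


def hypergeom_upper_tail(k, total_genes, annotated_genes, list_size):
    """P(X >= k) when `list_size` genes are drawn without replacement
    from `total_genes`, of which `annotated_genes` carry the GO term."""
    all_draws = comb(total_genes, list_size)
    tail = sum(
        comb(annotated_genes, i)
        * comb(total_genes - annotated_genes, list_size - i)
        for i in range(k, min(annotated_genes, list_size) + 1)
    )
    return tail / all_draws


# Values from the example above.
expected = 694 * 634 / 10_738   # genes expected by chance, about 41
p_value = hypergeom_upper_tail(107, 10_738, 634, 694)
print(f"expected by chance: {expected:.1f}, observed: 107, p = {p_value:.3g}")
```

Observing 107 genes where about 41 are expected by chance gives a very small p-value, which is what flags the cell division term as overrepresented.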