RNA-Seq Analysis Pipeline for Unraveling Traits in Crops: The Tuxedo Protocol
By Abdulsalam Toyin
OUTLINE
• RNA-Seq
• Biology of gene expression
• Concept of genes and transcripts
• Alignment
  • RNA-Seq-specific aligners: TopHat, STAR
  • SAM/BAM format
• Quantification
  • Cufflinks (gene level, isoform level, splicing, …)
  • Alternative methods
• Post-alignment quality control
• Other analyses with RNA-Seq
RNA-Seq
The diagram summarizes the RNA-Seq technique: (1) RNA is isolated from a panel of tissues or treatments. (2) A pool of core tissues (e.g., leaf, root, flower, fruit) is used to create the reference cDNA library, which is then sequenced.
RNA-Seq
• RNA-Seq seeks to answer the following questions:
  • Which genes are expressed in a sample?
  • Which transcripts are expressed for each gene?
  • What are the expression levels of those genes and transcripts?
  • How do expression levels and splicing patterns differ between two conditions?
Biology of Gene Expression
The process of gene expression is very complex. “Genes take on a body of their own in a process called gene expression.” The process starts with transcription, during which RNA polymerase creates a copy of the gene, nucleotide by nucleotide, as a single-stranded molecule. While still in the nucleus the transcript is highly unstable, so it is stabilized by 5’ capping, 3’ cleavage, and the addition of a long poly(A) tail.
Genes are encoded in the genome, each occupying a specific location. Genes consist of informative blocks (“exons”) and non-informative blocks (“introns”). The whole gene is transcribed; the introns are then spliced out of the transcript and the exons are joined together to form the mature mRNA.
RNA-Seq Pipeline
Mapping
• The problem: how do we match millions of reads (fragments of ~100 characters) against a reference sequence of billions of characters?
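To make the scale problem concrete, here is a minimal sketch of index-based read mapping. It assumes a simple hash table of k-mers and a seed-then-verify step; Bowtie itself relies on a Burrows-Wheeler (FM) index, which is far more memory-efficient, so this only illustrates the idea of "index the reference once, then look reads up". The function names and the toy sequences are made up for this illustration.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k, max_mismatches=1):
    """Return 0-based positions where the read aligns without gaps.

    The first k bases of the read act as an exact seed; each seed hit is
    then verified by counting mismatches over the full read length.
    """
    hits = []
    for pos in index.get(read[:k], []):
        candidate = reference[pos:pos + len(read)]
        if len(candidate) < len(read):
            continue  # the read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, candidate))
        if mismatches <= max_mismatches:
            hits.append(pos)
    return hits


# Toy data (hypothetical): a 30-base "genome" and an 8-base "read".
reference = "ACGTACGTTAGCCGATAGGCTTACGATCGA"
index = build_kmer_index(reference, k=4)
print(map_read("TAGCCGAT", reference, index, k=4))
```

The index is built once and reused for every read, which is what makes mapping millions of reads feasible.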
Mapping for RNA-Seq
• A transcript may result from splicing, and the mapping strategy must account for this.
• Either we map to the transcriptome, or we map to the genome with a “splice-aware” aligner.
• TopHat was one of the first splice-aware aligners and became one of the most popular for RNA-Seq data.
  • It is built “on top” of Bowtie.
Mapping for RNA-Seq
• TopHat breaks the reads into pieces, maps the pieces to the genome first, and then extends the alignments across “possible splice junctions”.
• We can also pass an annotation file (GTF format) as an argument. TopHat will then:
  • extract the transcripts and build an index;
  • map to the “transcriptome” first and then to the genome.
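The "break the read into pieces" idea can be sketched as follows. This is a toy under strong assumptions (exact string matching, a single junction per read, first occurrence only); TopHat's real algorithm is considerably more elaborate. The function name, the `min_anchor` and `min_intron` parameters, and the sequences are hypothetical.

```python
def find_junction(read, genome, min_anchor=4, min_intron=4):
    """Locate a read that spans one splice junction.

    Returns (left_start, intron_start, intron_end) with 0-based,
    end-exclusive coordinates, or None if no split placement is found.
    """
    if genome.find(read) != -1:
        return None  # the read maps contiguously, no junction needed
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_pos = genome.find(left)
        if left_pos == -1:
            continue
        left_end = left_pos + len(left)
        # The right piece must map downstream, leaving room for an intron.
        right_pos = genome.find(right, left_end + min_intron)
        if right_pos != -1:
            return (left_pos, left_end, right_pos)
    return None


# Toy gene (hypothetical): exon 1, an intron starting GT and ending AG, exon 2.
exon1, intron, exon2 = "ACGTTGCA", "GTAAGTCCCTTTAG", "CATGGATC"
genome = exon1 + intron + exon2
read = "TTGCA" + "CATGG"  # last 5 bases of exon 1 joined to first 5 of exon 2

junction = find_junction(read, genome)
print(junction, genome[junction[1]:junction[2]])
```

The read cannot be placed in one piece, but its two halves map on either side of the intron, which is exactly the signal a splice-aware aligner looks for.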
TopHat
• Using an annotation:
  • improves accuracy, mostly around splice junctions;
  • introduces a “bias” in favor of known transcripts;
  • gives less power to detect novel transcripts or novel isoforms.
  • Beware if your annotation is incomplete.
• At the mapping step, it is better to keep multi-mappers (within a reasonable limit, e.g. 10 or 20 hits). TopHat provides an option to control the number of multiple mappings.
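As an illustration, the options discussed above can be assembled into a TopHat 2 command line. The sketch below only builds and prints the command; it does not run it. The flags `-G` (annotation GTF), `-g` (maximum multi-hits), `-p` (threads) and `-o` (output directory) are taken from the TopHat 2 manual, so check `tophat --help` for your installed version. All file names and the helper function are hypothetical.

```python
import shlex


def tophat_command(index_base, reads, out_dir, gtf=None,
                   max_multihits=20, threads=1):
    """Build a TopHat 2 command as an argument list (nothing is executed)."""
    cmd = ["tophat", "-o", out_dir, "-p", str(threads),
           "-g", str(max_multihits)]
    if gtf is not None:
        cmd += ["-G", gtf]  # supply known transcripts to guide the mapping
    cmd += [index_base] + list(reads)
    return cmd


# Hypothetical paired-end sample, keeping at most 10 alignments per read.
cmd = tophat_command("genome_index",
                     ["sample_1.fastq", "sample_2.fastq"],
                     "tophat_out", gtf="genes.gtf",
                     max_multihits=10, threads=4)
print(shlex.join(cmd))
```

Building the command as a list makes it easy to hand over to `subprocess.run` later and to log exactly what was run for each sample.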
SAM/BAM FORMAT
• Composed of two parts:
  • a header, which describes the source of the data, the reference sequence, the method of alignment, and so on;
  • the alignments, which describe the reads and the location and nature of each alignment.
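The two parts can be told apart mechanically: header lines start with `@`, and each alignment line has 11 mandatory tab-separated fields followed by optional tags. Below is a minimal parsing sketch for plain-text SAM (BAM is the compressed binary equivalent and needs a library such as samtools or pysam to read). The example record is invented; its CIGAR string `20M300N30M` describes a read split across a 300-base intron.

```python
SAM_FIELDS = ("qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual")


def parse_sam(text):
    """Split SAM text into header lines and alignment records (dicts)."""
    header, alignments = [], []
    for line in text.splitlines():
        if not line:
            continue
        if line.startswith("@"):
            header.append(line)
            continue
        cols = line.split("\t")
        record = dict(zip(SAM_FIELDS, cols[:11]))
        for key in ("flag", "pos", "mapq", "pnext", "tlen"):
            record[key] = int(record[key])
        record["tags"] = cols[11:]  # optional fields such as NH:i:1
        alignments.append(record)
    return header, alignments


# Hypothetical SAM content with two header lines and one spliced alignment.
example = "\n".join([
    "@HD\tVN:1.0\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t101\t50\t20M300N30M\t*\t0\t0\t"
    + "A" * 50 + "\t" + "I" * 50 + "\tNH:i:1",
])
header, alignments = parse_sam(example)
print(len(header), alignments[0]["rname"], alignments[0]["pos"],
      alignments[0]["cigar"])
```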
SAM/BAM FORMAT
Visualization in IGV
Quantification
• In general, we want more than just alignments.
• RNA-Seq is, in principle, a quantitative assay, and we want to measure gene expression.
Cufflinks
• Starting from the alignments, there are two goals:
  • transcript assembly
  • transcript quantification
Cufflinks
• Assembly: try to find the minimal number of paths in a graph that fully represents the alignments.
• Quantification: estimate the most likely abundance of the different isoforms.
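"Most likely abundance" is a maximum-likelihood problem: many reads are compatible with more than one isoform, so they have to be shared out in proportion to the abundances we are trying to estimate. A generic way to solve this is expectation-maximization (EM). The sketch below is not Cufflinks' actual model (it ignores effective length and the fragment-length distribution, for instance); the function name and read counts are hypothetical.

```python
def em_isoform_abundance(compatibility, n_isoforms, n_iter=200):
    """Estimate relative isoform abundances with a plain EM loop.

    compatibility: one tuple per read listing the isoform indices that
    the read could have come from.
    """
    theta = [1.0 / n_isoforms] * n_isoforms
    for _ in range(n_iter):
        counts = [0.0] * n_isoforms
        # E-step: split each read among its compatible isoforms.
        for isoforms in compatibility:
            total = sum(theta[i] for i in isoforms)
            if total == 0.0:
                continue
            for i in isoforms:
                counts[i] += theta[i] / total
        # M-step: abundances become the normalized expected counts.
        n = sum(counts)
        theta = [c / n for c in counts]
    return theta


# Hypothetical gene with two isoforms: 30 reads unique to isoform 0,
# 10 unique to isoform 1, and 60 reads in exons shared by both.
reads = [(0,)] * 30 + [(1,)] * 10 + [(0, 1)] * 60
theta = em_isoform_abundance(reads, n_isoforms=2)
print([round(t, 3) for t in theta])
```

The 60 ambiguous reads end up split 3:1, following the ratio set by the uniquely assigned reads.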
Unit of Abundance
• We can only obtain relative measures: out of X reads in the library, Y map to gene A and Z map to gene B. If we change X, then both Y and Z will change.
• Now suppose gene A and gene B have the same number of reads mapped. Do they have similar expression levels?
  • If gene A and gene B have the same size, yes.
  • Otherwise, no, because a longer gene will receive more reads than a shorter gene expressed at the same level.
FPKM
• Fragments Per Kilobase of transcript per Million mapped reads.
• “Kilobase” refers to the effective size, not the annotated gene size.
  • Effective length: the number of possible start sites on the transcript (depends on the estimated fragment size).
• “Million reads” means millions of mapped reads (not millions of sequenced reads).
• FPKM lets you compare the expression of a gene between samples (because it accounts for differences in library depth).
• It also lets you compare the expression of two genes within the same sample (because it accounts for differences in gene length).
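The definition translates directly into a formula: FPKM = fragments × 10^9 / (effective length in bases × total mapped fragments). The sketch below assumes a single fixed fragment size, so the effective length is simply transcript length minus fragment length plus one; Cufflinks works with an estimated fragment-length distribution instead. The function names and all numbers are made up for the example.

```python
def effective_length(transcript_length, fragment_length):
    """Number of positions where a fragment of this size can start."""
    return max(transcript_length - fragment_length + 1, 1)


def fpkm(fragments, transcript_length, fragment_length, total_mapped):
    """Fragments per kilobase of effective length per million mapped fragments."""
    eff_len = effective_length(transcript_length, fragment_length)
    return fragments * 1e9 / (eff_len * total_mapped)


# Hypothetical library of 10 million mapped fragments, 200-base fragments.
# Gene A and gene B receive the same 500 fragments but differ in length.
gene_a = fpkm(500, transcript_length=2199, fragment_length=200,
              total_mapped=10_000_000)
gene_b = fpkm(500, transcript_length=1199, fragment_length=200,
              total_mapped=10_000_000)
print(gene_a, gene_b)
```

With equal fragment counts, the shorter gene B comes out with twice the FPKM of gene A, which is the length correction described on the "Unit of Abundance" slide.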
Cufflinks and Cuffdiff
• In addition to estimating isoform expression, Cufflinks outputs gene expression (more or less the sum over the different isoforms).
• The Cufflinks package includes a program called Cuffdiff for differential expression.
• Cuffdiff estimates isoform expression in two groups (each of which can be composed of multiple replicates) and performs statistical tests for:
  • differential gene expression
  • differential transcript levels
  • alternative splicing
  • differential usage of transcription start sites
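The quantity usually reported alongside such tests is the log2 fold change between the two groups. The sketch below computes only that effect size from replicate FPKM values; it is not Cuffdiff's statistical model and does not produce a p-value. The pseudocount, the function name and the FPKM values are assumptions made for this illustration.

```python
import math
from statistics import mean


def log2_fold_change(group_a, group_b, pseudocount=1.0):
    """log2 of mean expression in group B over group A.

    The pseudocount avoids division by zero for unexpressed genes.
    """
    return math.log2((mean(group_b) + pseudocount)
                     / (mean(group_a) + pseudocount))


# Hypothetical FPKM values for one gene, three replicates per condition.
control = [6.0, 7.0, 8.0]
treated = [30.0, 31.0, 32.0]
lfc = log2_fold_change(control, treated)
print(lfc)  # positive values mean higher expression in the second group
```

A fold change says how large the difference is; the statistical test is what tells you whether replicate variability could explain it.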
Quality Control
• A number of metrics are important to look at:
  • Ribosomal contamination: map the entire library against a set of ribosomal RNA sequences and count the number of reads that map.
  • Number of reads mapping, and number of reads mapping uniquely.
  • Distribution of expression: a few highly expressed genes, some mid-expressed genes, and many lowly expressed genes.
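The mapping counts can be derived from the SAM FLAG field and the optional `NH` tag. The sketch below assumes the input contains one primary record per read (including unmapped reads) and that the aligner writes `NH:i:<n>` with the number of reported alignments; both assumptions should be checked for your aligner and version. Flag bits used: 0x4 unmapped, 0x100 secondary, 0x800 supplementary. The same counting, applied to a mapping against rRNA sequences, gives the ribosomal fraction. The function name and records are hypothetical.

```python
def mapping_stats(records):
    """Count total, mapped and uniquely mapped reads.

    records: iterable of (flag, tags) pairs, one per SAM alignment line.
    """
    total = mapped = unique = 0
    for flag, tags in records:
        if flag & 0x100 or flag & 0x800:
            continue  # secondary or supplementary line, read already counted
        total += 1
        if flag & 0x4:
            continue  # read is unmapped
        mapped += 1
        nh = next((int(t.split(":")[2]) for t in tags
                   if t.startswith("NH:i:")), None)
        if nh == 1:
            unique += 1
    return {
        "total": total,
        "mapped": mapped,
        "unique": unique,
        "mapped_fraction": mapped / total if total else 0.0,
        "unique_fraction": unique / total if total else 0.0,
    }


# Hypothetical records: two unique hits, one multi-mapper with a
# secondary line, and one unmapped read.
records = [(0, ["NH:i:1"]), (16, ["NH:i:3"]), (256, ["NH:i:3"]),
           (4, []), (0, ["NH:i:1"])]
stats = mapping_stats(records)
print(stats)
```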
Quality Control
• Other metrics:
  • Duplication rate (based on the location of the alignments), which might bias the estimation of gene expression.
  • Percentage of reads mapping to CDS, UTR, intron, and intergenic regions; the more on CDS and UTR, the better.
  • Unsupervised clustering, to verify that the samples cluster according to biological differences and not according to experimental batches:
    • hierarchical clustering
    • PCA, MDS plot
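Hierarchical clustering and MDS both start from a sample-to-sample distance matrix. A minimal version of that first step is sketched below, using 1 minus the Pearson correlation of log2-transformed expression as the distance and then reporting each sample's nearest neighbor. In practice a statistics package would do the clustering and plotting; the function names, sample names and expression values here are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def nearest_samples(expression):
    """For each sample, find the closest other sample by 1 - Pearson r."""
    logged = {s: [math.log2(v + 1) for v in values]
              for s, values in expression.items()}
    nearest = {}
    for s in logged:
        others = [t for t in logged if t != s]
        nearest[s] = min(others,
                         key=lambda t: 1 - pearson(logged[s], logged[t]))
    return nearest


# Hypothetical FPKM values for four genes in four samples.
expression = {
    "control_1": [10, 200, 5, 80],
    "control_2": [12, 190, 6, 75],
    "treated_1": [100, 20, 50, 80],
    "treated_2": [110, 25, 45, 85],
}
nearest = nearest_samples(expression)
print(nearest)
```

If a sample's nearest neighbor belongs to another condition but to the same sequencing run, that is a hint of a batch effect worth investigating.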
Differential Expression
THANK YOU
