RNA-Seq Data Analysis

National Bureau of Animal Genetic Resources
Karnal
RNA sequence data analysis and transcriptome sequencing. Sequencing the steady-state RNA in a sample is known as RNA-Seq. It is free of limitations such as the need for prior knowledge about the organism.
RNA-Seq is useful for unravelling otherwise inaccessible complexities of the transcriptome, such as novel transcripts and isoforms.
The data set produced is large and complex; interpretation is not straightforward.

Published in: Education, Technology
Transcript of "Rna seq pipeline"

  1. RNA-Seq Data Analysis. National Bureau of Animal Genetic Resources, Karnal
  2. Transcriptome Sequencing: Sequencing the steady-state RNA in a sample is known as RNA-Seq. It is free of limitations such as the need for prior knowledge about the organism. RNA-Seq is useful for unravelling otherwise inaccessible complexities of the transcriptome, such as novel transcripts and isoforms. The data set produced is large and complex; interpretation is not straightforward.
  3. Making sense of RNA-Seq data: This depends on the scientific question of interest. For example, allele-specific expression requires accurate determination of the transcribed SNPs. Finding novel transcripts helps in detecting fusion gene events and aberrations in cancer samples.
  4. Applications of RNA-Seq: (1) abundance estimation; (2) alternative splicing; (3) RNA editing; (4) finding novel transcripts; (5) finding isoforms; and many more.
  5. From RNA-seq reads to differential expression results: Oshlack et al., Genome Biology 2010, 11:220
  6. Mapping Reads to Reference: CLC bio Workbench. The RNA-Seq analysis is done in several steps. First, all genes are extracted from the reference genome (using annotations of type gene). Other annotations on the gene sequences are preserved (e.g. CDS information about coding sequences). Next, all annotated transcripts (using annotations of type mRNA) are extracted. If there are several annotated splice variants, they are all extracted. Note that the mRNA annotation type is used for extracting the exon-exon boundaries.
  7. Mapping Examples
  8. The mapping parameters. Maximum number of mismatches: applies to short reads (shorter than 56 nucleotides, except for color space data, which are always treated as long reads). This is the maximum number of mismatches allowed. The maximum value is 3, except for color space, where it is 2. Minimum length fraction: the default is 0.9, which means that at least 90% of the bases need to align to the reference. Minimum similarity fraction: the default is 0.8; combined with the default length fraction, this means that 90% of the read must align with 80% similarity for the read to be included. Maximum number of hits for a read: a read that matches more distinct places in the references than the 'Maximum number of hits for a read' specified will not be mapped. Strand-specific alignment: maps reads to a specific strand.
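As a rough illustration of how the length and similarity fractions above combine, the following Python sketch applies both thresholds to a single alignment. The function name and its arguments are hypothetical and are not part of CLC bio Workbench; only the default values 0.9 and 0.8 come from the slide.

```python
def passes_mapping_filter(read_length, aligned_length, matching_bases,
                          min_length_fraction=0.9,
                          min_similarity_fraction=0.8):
    """Return True if an alignment meets both thresholds.

    Hypothetical helper for illustration only:
    - length fraction = aligned bases / read length
    - similarity      = matching bases / aligned bases
    """
    if read_length <= 0 or aligned_length <= 0:
        return False
    length_fraction = aligned_length / read_length
    similarity = matching_bases / aligned_length
    return (length_fraction >= min_length_fraction
            and similarity >= min_similarity_fraction)


# A 100 nt read with 95 aligned bases, 85 of which match the reference.
print(passes_mapping_filter(100, 95, 85))
```

A read that aligns over only 85 of its 100 bases would be rejected by the length threshold no matter how well those 85 bases match.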
  9. Summarization
  10. Summarization
  11. Summarization
  12. Summarization
  13. Summarization: Mapping Statistics
  14. Summarization: Detailed Mapping Statistics
  15. Summarization: Parameters. Transcripts: the number of transcripts based on the mRNA annotations on the reference. Note that this is not based on the sequencing data, only on the annotations already on the reference sequence(s). Exon length: the total length of all exons (not all transcripts). Unique gene reads: the number of reads that match uniquely to the gene. Total gene reads: all the reads that are mapped to this gene, both reads that map uniquely to the gene and reads that matched more positions in the reference (but fewer than the 'Maximum number of hits for a read' parameter) and were assigned to this gene. RPKM: Reads Per Kilobase of exon model per Million mapped reads, the expression value [Mortazavi et al., 2008]: RPKM = total exon reads / (mapped reads (millions) × exon length (kb)).
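The RPKM definition above can be sketched in a few lines of Python. The function and argument names are assumptions made for this example, and the read counts in the usage line are invented purely to show the arithmetic.

```python
def rpkm(total_exon_reads, mapped_reads, exon_length_bp):
    """Reads Per Kilobase of exon model per Million mapped reads.

    total_exon_reads: reads assigned to the gene's exons
    mapped_reads:     all mapped reads in the sample
    exon_length_bp:   total exon length of the gene, in base pairs
    """
    if mapped_reads <= 0 or exon_length_bp <= 0:
        raise ValueError("mapped_reads and exon_length_bp must be positive")
    millions_mapped = mapped_reads / 1e6
    exon_length_kb = exon_length_bp / 1e3
    return total_exon_reads / (millions_mapped * exon_length_kb)


# Invented numbers: 500 reads on a gene with 2 kb of exons,
# in a sample with 10 million mapped reads.
print(rpkm(500, 10_000_000, 2000))  # 500 / (10 * 2) = 25.0
```

Dividing by both library size and exon length is what makes values comparable across samples of different depth and across genes of different length.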
  16. Visualizing Mapping
  17. Read Quality Assessment
  18. Basic Statistics Summary. The Basic Statistics module generates some simple composition statistics for the file analysed. Filename: the original filename of the file which was analysed. File type: says whether the file appeared to contain actual base calls or colorspace data which had to be converted to base calls. Total Sequences: a count of the total number of sequences processed. There are two values reported, actual and estimated. Sequence Length: provides the length of the shortest and longest sequence in the set. If all sequences are the same length, only one value is reported. %GC: the overall %GC of all bases in all sequences. Warning: Basic Statistics never raises a warning.
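To make these statistics concrete, here is a minimal Python sketch that computes the sequence count, length range and overall %GC from FASTQ text. It assumes simple four-line FASTQ records and is not FastQC's own code; the function name and the two example reads are made up for illustration.

```python
import io


def basic_statistics(handle):
    """Summarise a FASTQ stream, assuming four lines per record."""
    total = 0
    min_len = None
    max_len = 0
    gc_bases = 0
    all_bases = 0
    for index, line in enumerate(handle):
        if index % 4 != 1:  # the sequence is the second line of each record
            continue
        seq = line.strip().upper()
        total += 1
        min_len = len(seq) if min_len is None else min(min_len, len(seq))
        max_len = max(max_len, len(seq))
        gc_bases += seq.count("G") + seq.count("C")
        all_bases += len(seq)
    percent_gc = 100 * gc_bases / all_bases if all_bases else None
    return {
        "total_sequences": total,
        "min_length": min_len,
        "max_length": max_len,
        "percent_gc": percent_gc,
    }


# Two made-up reads, just to exercise the function.
example = "@r1\nACGT\n+\nIIII\n@r2\nGGCCAA\n+\nIIIIII\n"
print(basic_statistics(io.StringIO(example)))
```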
  19. This view shows an overview of the range of quality values across all bases at each position in the FastQ file. For each position a box-and-whisker plot is drawn. The elements of the plot are as follows: the central red line is the median value; the yellow box represents the inter-quartile range (25-75%); the upper and lower whiskers represent the 10% and 90% points; the blue line represents the mean quality. The y-axis on the graph shows the quality scores. The higher the score, the better the base call. The background of the graph divides the y-axis into very good quality calls (green), calls of reasonable quality (orange), and calls of poor quality (red). The quality of calls on most platforms will degrade as the run progresses, so it is common to see base calls falling into the orange area towards the end of a read. Note that there are a number of different ways to encode a quality score in a FastQ file. FastQC attempts to determine automatically which encoding method was used, and the title of the graph will describe the encoding FastQC thinks your file used.
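The quantities drawn in the plot described above (median, quartiles, 10% and 90% points, mean) can be approximated with the Python sketch below. It assumes Phred+33 encoded quality strings and uses simple linear interpolation for percentiles, which may differ from FastQC's exact method; all names are hypothetical.

```python
def percentile(sorted_values, p):
    """Linearly interpolated percentile of a sorted, non-empty list."""
    k = (len(sorted_values) - 1) * p / 100
    lo = int(k)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (k - lo)


def per_position_summary(quality_strings, offset=33):
    """Summarise quality scores at each read position.

    offset=33 assumes Phred+33 encoding; other encodings need another offset.
    """
    longest = max((len(q) for q in quality_strings), default=0)
    summary = []
    for pos in range(longest):
        # Only reads long enough to cover this position contribute.
        scores = sorted(ord(q[pos]) - offset
                        for q in quality_strings if len(q) > pos)
        summary.append({
            "mean": sum(scores) / len(scores),
            "median": percentile(scores, 50),
            "q25": percentile(scores, 25),
            "q75": percentile(scores, 75),
            "p10": percentile(scores, 10),
            "p90": percentile(scores, 90),
        })
    return summary


# Three made-up quality strings of length 2.
for position, stats in enumerate(per_position_summary(["II", "I5", "5!"]), 1):
    print(position, stats)
```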
  20. The per sequence quality score report allows you to see if a subset of your sequences have universally low quality values. It is often the case that a subset of sequences will have universally poor quality, often because they are poorly imaged (on the edge of the field of view etc.). However, these should represent only a small percentage of the total sequences. If a significant proportion of the sequences in a run have overall low quality, this could indicate some kind of systematic problem, possibly with just part of the run (for example one end of a flowcell).
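One simple way to look for such a subset is to compute the mean quality of each read and report the share of reads that fall below a cutoff. The sketch below assumes Phred+33 encoding; the cutoff of 20 and the function names are illustrative choices, not values taken from FastQC.

```python
def mean_read_quality(quality_string, offset=33):
    """Mean Phred score of one read (Phred+33 encoding assumed)."""
    if not quality_string:
        raise ValueError("empty quality string")
    return sum(ord(c) - offset for c in quality_string) / len(quality_string)


def fraction_low_quality(quality_strings, threshold=20):
    """Fraction of reads whose mean quality is below the threshold.

    threshold=20 is an arbitrary illustrative cutoff.
    """
    if not quality_strings:
        return 0.0
    low = sum(1 for q in quality_strings
              if mean_read_quality(q) < threshold)
    return low / len(quality_strings)


# Made-up reads with mean qualities 40, 0 and 20.
print(fraction_low_quality(["IIII", "!!!!", "5555"]))
```

A small fraction is expected; a large one suggests checking whether the low-quality reads cluster in one part of the run.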
  21. Normalization
  22. Differential expression
  23. Clustering
  24. Comparison of Expression Profile
  25. Expression Profile of Specific Pathways
  26. Systems Biology: GOstat Analysis. Columns: Best GOs; Genes (Max: 100); Count; Total; P-Value. GO:0003735 Mitochondria: mrpl42 mrpl41 ndufa13 ndufb5 timm13 etfb ndufa3 atp5d atp5j2 ndufb7 mrpl14 ndufa5 ndufa11 mrpl34. GO:0005840 Ribosome: rps2 mrpl42 rps18 rps17 mrpl41 rps23 mrps18c rplp2 mrpl14 rpl9 rps29 mrpl34. Count: 150 12. Total: 18253 156 12 163. P-Value: 4.78E-06 4.78E-06.
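The slide reports p-values for over-represented GO terms but does not say how they were computed. One common way to score over-representation is the upper tail of the hypergeometric distribution, sketched below in Python. This is an assumption for illustration and not necessarily the test GOstat applied; the function name and the small numbers in the usage line are invented.

```python
from math import comb


def hypergeom_enrichment_p(hits_in_list, list_size, hits_in_background,
                           background_size):
    """P(X >= hits_in_list) when drawing list_size genes without replacement
    from a background in which hits_in_background genes carry the GO term."""
    total_ways = comb(background_size, list_size)
    upper = min(list_size, hits_in_background)
    favourable = sum(
        comb(hits_in_background, i)
        * comb(background_size - hits_in_background, list_size - i)
        for i in range(hits_in_list, upper + 1)
    )
    return favourable / total_ways


# Invented toy numbers: 2 of 3 selected genes carry a term that
# annotates 4 of the 10 background genes.
print(hypergeom_enrichment_p(2, 3, 4, 10))
```

In practice the resulting p-values are usually corrected for the many GO terms tested at once.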