The document discusses tools for analyzing transcriptome data. It describes FastQC, a tool used for quality control checks on raw sequencing data by generating statistics on base quality, GC content, overrepresented sequences, etc. Scripture is described as a tool for de novo assembly of RNA-seq data that relies on aligned reads and a reference genome to reconstruct transcripts. The document outlines the typical workflow of indexing aligned reads, running quality checks with FastQC, and using Scripture or other tools for reconstruction. Common file formats like FASTQ, SAM, BAM and output formats like BED are also summarized.
2. Transcript- “omics”
Like the other -omics techniques, transcriptomics is the detailed study of the transcriptome.
The transcriptome is the complete set of all RNA molecules in a cell, a population of cells or an
organism.
Transcriptome analysis is the study of the complete set of RNA transcripts that are
produced by the genome under specific circumstances or in a specific cell, using high-throughput methods.
Such analysis is done with techniques like microarrays and RNA-seq.
3. Numerous erroneous sequence variants can be introduced during the library preparation, sequencing, and imaging
steps, which should be identified and filtered out in the data analysis step. Thus, QC of raw data should be
performed as the initial step of a routine RNA-seq workflow.
Tools such as FastQC and HTQC can be applied.
Depending on the RNA-seq library construction strategy, some form of read trimming may be advisable prior to
aligning the RNA-seq data.
Trimming is optional and can be done after the QC check, since the FastQC report indicates whether trimming is needed.
Modern high-throughput sequencers can generate hundreds of millions of sequences in a single run.
A QC check ensures that the raw data look good and show no bias.
Need for QC check
4. Transcriptome data
Typical outputs include quantitative tables of the transcript levels.
The results of transcriptomic analyses are often presented graphically as heat maps.
Clustered data
Venn diagrams, which count the transcripts that are equivalently regulated in multiple
samples
Figure: (Left) Heat map representation of p53 data from the Brainspan database. (Right) Venn diagram representation for transcriptome data.
5. The most widely used format in sequence analysis is FASTQ.
Data can also be represented in .csv or .xlsx file formats.
CEL files (created by Affymetrix DNA microarray image analysis software) contain the data extracted from "probes" on an
Affymetrix GeneChip and can store thousands of data points.
SAM format may also be used.
Sources of transcriptome data
Ensembl
GEO
Brainspan etc.
Data Formats
7. Fastq
Most widely used format in sequence analysis
Generally delivered from a sequencer.
FASTQ format stores sequences and Phred qualities in a single file.
Contains much more information than FastA.
Hence preferred by software such as aligners and QC tools.
Each sequence requires at least 4 lines:
The first line is the sequence header, which starts with an ‘@’ (not a ‘>’!).
The second line is the sequence.
The third line starts with ‘+’, optionally followed by the same sequence identifier.
The fourth line holds the quality scores, one character per base.
The sequence identifier is further split up into flow cell ID, run ID etc.
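The four-line layout described above can be sketched in Python. This is a minimal illustration that assumes each record occupies exactly four lines (no wrapped sequences); the names `FastqRecord` and `read_fastq` are made up for this example and do not come from any library.

```python
import io
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str  # header text without the leading '@'
    sequence: str    # base calls
    quality: str     # one quality character per base


def read_fastq(handle) -> Iterator[FastqRecord]:
    """Yield records from a text handle, assuming strict 4-line records."""
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        header = header.rstrip("\n")
        if not header:          # tolerate blank lines between records
            continue
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a 4-line FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield FastqRecord(header[1:], sequence, quality)


# Hypothetical single-read example
example = "@read1 demo\nACGTN\n+\nIIII#\n"
records = list(read_fastq(io.StringIO(example)))
```

Checking that the sequence and quality lines have the same length catches truncated files early, before they reach an aligner or QC tool.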
10. Tools
FastQC
is a very popular tool used to provide an overview of basic quality control metrics for raw next
generation sequencing data. There are a number of different analyses (called modules) that may be
performed on a sequence data set. Written by Simon Andrews of Babraham Bioinformatics.
Scripture
Is a tool for transcriptome reconstruction: it assembles full-length gene transcripts from RNA-seq data.
Relies solely on RNA-seq reads and an assembled genome to build a transcriptome ab initio.
12. Fastqc
Tool used to provide an overview of basic quality control metrics for raw next generation sequencing
data
It runs a set of analyses on one or more raw sequence files in FASTQ, SAM or BAM format and produces a
report which summarizes the results.
Can run as an interactive graphical application, started on Windows by running the run_fastqc.bat file.
Can also run in non-interactive mode on the command line.
In this mode FastQC generates an HTML report for each file without launching a user interface.
13. Fastqc
How does it perform such quality checks?
What algorithm is it based on?
14. FastQC supports files in the following formats
FastQ (all quality encoding variants)
Casava FastQ files*
Colorspace FastQ
GZip compressed FastQ
SAM
BAM
SAM/BAM Mapped only (normally used for colorspace data)
File formats
15. 1. Basic Statistics
2. Per base sequence quality
3. Per tile sequence quality
4. Per sequence quality scores
5. Per base sequence content
6. Per sequence GC content
7. Per base N content
8. Sequence Length Distribution
9. Sequence Duplication Levels
10. Overrepresented sequences
11. Adapter content
Summary includes:
16. The Basic Statistics module generates simple composition statistics for the file analysed.
Filename: The original filename of the file which was analysed
File type
Encoding: Says which ASCII encoding of quality values was
found in this file.
Total Sequences: A count of the total number of sequences
processed.
Sequence Length: Provides the length of the shortest and longest
sequence in the set. If all sequences are the same length only one
value is reported.
%GC: The overall %GC of all bases in all sequences
Basic Statistics
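17. Overview of the range of quality values across all bases at
each position in the FastQ file.
It produces a box-and-whisker plot of the quality values at each position.
A warning is issued if the lower quartile for any base is less than 10, or if the median for any base is less than 25.
A failure is raised if the lower quartile for any base is less than 5, or if the median for any base is less than 20.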
If the quality of the library falls to a low level then
perform quality trimming (reads are truncated based on
their average quality).
Per Base Sequence Quality
Figure: example plots for good and bad data.
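The per-base check can be sketched as follows. This is an illustration, not FastQC's own code: it assumes Phred+33 encoded quality strings of equal length, uses the thresholds given in the FastQC documentation (warning when the median is below 25 or the lower quartile below 10; failure when the median is below 20 or the lower quartile below 5), and uses a simple nearest-rank lower quartile. All function names are hypothetical.

```python
from statistics import median


def phred33(quality_line):
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(char) - 33 for char in quality_line]


def lower_quartile(values):
    """Nearest-rank lower quartile; a simplification for illustration."""
    ordered = sorted(values)
    return ordered[(len(ordered) - 1) // 4]


def per_base_status(quality_lines):
    """Return 'pass', 'warn' or 'fail' for a set of equal-length reads."""
    status = "pass"
    # zip(*...) turns the per-read score lists into per-position columns
    for scores in zip(*(phred33(line) for line in quality_lines)):
        med, low = median(scores), lower_quartile(scores)
        if med < 20 or low < 5:
            return "fail"
        if med < 25 or low < 10:
            status = "warn"
    return status


good = per_base_status(["IIII", "IIII", "IIII"])    # every score is 40
middling = per_base_status(["IIII", "III5", "III5"])  # last position: 40, 20, 20
bad = per_base_status(["III#", "III#", "IIII"])     # last position: 2, 2, 40
```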
18. Shows whether a subset of the sequences has universally low quality
values.
Such poor quality is often caused by poor imaging.
One may check whether a significant proportion of the
sequences in a run have overall low quality.
A warning is raised if the most frequently observed mean quality is below 27 (a 0.2% error rate).
A failure is raised if the most frequently observed mean
quality is below 20, which equates to a 1% error rate.
Per Sequence Quality Scores
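The link between a quality score and an error rate follows from the Phred definition, Q = -10 log10(p). The sketch below converts scores to error probabilities and finds the most frequently observed mean quality for a set of reads. It assumes Phred+33 encoding, and the function names are illustrative only.

```python
from collections import Counter


def error_probability(phred_score):
    """Probability that a base call is wrong, from Q = -10 * log10(p)."""
    return 10 ** (-phred_score / 10)


def modal_mean_quality(quality_lines):
    """Most frequently observed per-read mean quality, rounded to an integer."""
    means = [
        round(sum(ord(char) - 33 for char in line) / len(line))
        for line in quality_lines
    ]
    return Counter(means).most_common(1)[0][0]


p20 = error_probability(20)   # about 0.01, i.e. a 1% error rate
p30 = error_probability(30)   # about 0.001
mode = modal_mean_quality(["IIII", "IIII", "5555"])  # means are 40, 40, 20
```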
19. Plots out the proportion of each base position for which
each of the four normal DNA bases has been called.
Issues a warning if the difference between A and T, or G
and C, is greater than 10% in any position, and a failure if it is greater than 20%.
Overrepresented sequences: If there is any evidence of
overrepresented sequences such as adapter dimers or
rRNA in a sample then these sequences may bias the
overall composition and their sequence will emerge from
this plot.
Per Base Sequence Content
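A minimal sketch of this check, assuming equal-length reads and the 10% warning threshold stated above; the helper names are hypothetical and this is not FastQC's implementation.

```python
from collections import Counter


def base_percentages(sequences):
    """Percentage of A, C, G and T at each read position."""
    result = []
    for column in zip(*sequences):  # one tuple of bases per position
        counts = Counter(column)
        total = len(column)
        result.append({base: 100 * counts[base] / total for base in "ACGT"})
    return result


def flagged_positions(sequences, threshold=10.0):
    """Positions where |A - T| or |G - C| exceeds the threshold (in %)."""
    flagged = []
    for index, pct in enumerate(base_percentages(sequences)):
        if (abs(pct["A"] - pct["T"]) > threshold
                or abs(pct["G"] - pct["C"]) > threshold):
            flagged.append(index)
    return flagged


# Made-up reads: only position 1 is unbalanced (A 25% vs T 0%, C 50% vs G 25%)
reads = ["ACGT", "TGCA", "AAGT", "TCCA"]
unbalanced = flagged_positions(reads)
```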
20. Measures GC content across the whole length of each
sequence in a file and compares it to a modelled normal
distribution of GC content.
In a normal random library we see a roughly normal
distribution of GC content where the central peak
corresponds to the overall GC content of the underlying
genome.
An unusually shaped distribution could indicate a
contaminated library or some other kind of biased subset.
GC content
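The per-sequence GC measurement can be sketched as below. The example only builds the observed histogram of GC percentages; the comparison with a modelled normal distribution is left out. Function names are assumptions made for this illustration.

```python
def gc_percent(sequence):
    """GC percentage of one read, ignoring uncalled bases such as N."""
    sequence = sequence.upper()
    called = sum(sequence.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return 100 * (sequence.count("G") + sequence.count("C")) / called


def gc_histogram(sequences):
    """Count reads per integer GC percentage (index 0 to 100)."""
    histogram = [0] * 101
    for sequence in sequences:
        histogram[round(gc_percent(sequence))] += 1
    return histogram


histogram = gc_histogram(["ACGT", "GGCC", "ATAT"])
```

In a clean library the non-zero bins of such a histogram would be expected to cluster around a single central peak.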
21. The left-hand side of the main interactive display, or the top of the HTML report, shows a summary of the
modules which were run, and a quick evaluation of whether the results of each module seem entirely normal
(green tick), slightly abnormal (orange triangle) or very unusual (red cross).
In addition to providing an interactive report FastQC also has the option to create an HTML version of this
report for a more permanent record. This HTML report can also be generated directly by running FastQC
in non-interactive mode.
To create a report simply select File > Save Report from the main menu.
The HTML file which is saved is a self-contained document with all of the graphs embedded into it.
Output & Result Analysis
23. Scripture
Scripture is a method for transcriptome reconstruction that relies solely on RNA-Seq reads and an assembled
genome to build a transcriptome ab initio.
Scripture assembles full-length gene transcripts from RNA-seq data. The Scripture algorithm
needs both reads and a genome sequence.
Scripture provides three main operations or tasks:
Segmentation: To call transcripts based on previously aligned data
Score: To evaluate expression of transcript sets
Add paired end data to a previously segmented graph.
24. Identification of all protein isoforms that may be expressed by a gene.
RNA-seq has been used to reconstruct transcriptomes by assembling sequencing reads with [5] or without [6] reference genomes.
However, transcriptome diversity owing to alternative transcription start sites, alternative splicing of exons, and/or the use of
different poly(A) sites is often difficult to capture and characterize using NGS data, due to their relatively short read length
(typically ≤ 400 nt) [10] in comparison to the length of mature transcripts (median > 2500 nt).
Necessity of reconstruction?
The knowledge of all protein isoforms that may be expressed by a gene is fundamental.
Tools such as Scripture, Cufflinks, SLIDE, MultiSplice etc. use RNA-seq data for exon identification, and expression level data for
transcript assembly.
While exon identification performs quite well, transcript assembly remains difficult for complex transcriptomes.
Transcriptome reconstruction
25. Genome-guided methods rely on a reference genome to first map all the reads to the genome and then
assemble overlapping reads into transcripts. By contrast, genome-independent methods assemble the reads
directly into transcripts without using a reference genome.
Both genome-guided and genome-independent algorithms have been reported to accurately reconstruct
thousands of transcripts and many alternative splice forms [28,29,53,55]. So which should be preferred?
This is governed by the particular biological question to be answered. Genome-independent methods are the
obvious choice for organisms without a reference sequence, whereas the increased sensitivity of genome-
guided approaches makes them the obvious choice for annotating organisms with a reference genome.
Reconstruction methods
26. Reads originating from two different isoforms of the
same gene are colored black and blue. In genome-
guided assembly, reads are first mapped to a
reference genome, and spliced reads are used to
build a transcript graph, which is then parsed into
gene annotations.
In the genome-independent approach, reads are
broken into k-mer seeds and arranged into a de Bruijn
graph structure. The graph is parsed to identify
transcript sequences, which are aligned to the
genome to produce gene annotations.
Spliced reads give rise to four possible
transcripts, but only two transcripts are needed
to explain all reads; the two possible sets of
minimal isoforms are depicted.
Method
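The k-mer step of the genome-independent approach can be illustrated with a toy de Bruijn graph. This sketch only builds the graph (nodes are (k-1)-mers, edges are k-mers); real assemblers add error correction, coverage tracking and path parsing. The function name and the toy read are made up for this example.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for start in range(len(read) - k + 1):
            kmer = read[start:start + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge: prefix -> suffix
    return dict(graph)


# Toy read with k = 3: k-mers ACG, CGT, GTA
graph = de_bruijn_graph(["ACGTA"], 3)
```

Walking the edges AC -> CG -> GT -> TA spells out the original read, which is the idea behind parsing the graph into transcript sequences.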
27. SAM stands for Sequence Alignment/Map format.
It is a generic format for storing large nucleotide sequence alignments.
Can easily be generated by alignment programs or converted from existing alignment formats
Allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus.
It is a TAB-delimited text format
Consists of a header section, which is optional, and an alignment section.
If present, the header must be prior to the alignments. Header lines start with ‘@’, while alignment lines do not.
Each alignment line has 11 mandatory fields for essential alignment information
SAM format
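The 11 mandatory fields can be made concrete with a small parser sketch. The field names follow the SAM specification (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL); the function name and the example line are hypothetical.

```python
SAM_FIELDS = (
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
)
INTEGER_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")


def parse_sam_line(line):
    """Return a dict for an alignment line, or None for a header line."""
    if line.startswith("@"):  # header lines start with '@'
        return None
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("expected at least 11 TAB-separated fields")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for name in INTEGER_FIELDS:
        record[name] = int(record[name])
    record["TAGS"] = columns[11:]  # optional fields, kept as raw strings
    return record


# Made-up alignment line for illustration
line = "read1\t0\tchr19\t100\t255\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0"
record = parse_sam_line(line)
```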
30. Scripture
Requires the Java Runtime Environment
Downloaded as a .jar file (scripture.jar)
Command line interface
java -jar scripture.jar
Algorithm
Scripture's main algorithm to "segment" the genome from the sequence data into regions enriched in read
coverage takes as input a read alignment file, genome information and filtering parameters to produce a
transcript graph.
Command:
java -jar scripture.jar <mandatory parameters> <optional parameters>
31. Mandatory Parameters
-alignment: Path to a spliced read alignment file
-out: Path to a file for Scripture to write its output.
-sizeFile: A 2-column tab separated file containing the chromosome name and size for the organism.
-chr: Chromosome to segment
-chrSequence: Full path to the chromosome sequence in fasta format for the chromosome to segment.
Optional Parameters
-start: Start of region to segment if not segmenting the full chromosome.
-end: End of region to segment when not segmenting the full chromosome.
-pairedEnd: Paired end data. This file can be in either SAM or BAM format.
32. Aligned reads data -> Indexing -> Reconstruction
Workflow
Aligned reads data: Either use pre-aligned reads or perform read alignment with Bowtie or TopHat prior to sorting.
Indexing: For sorting and indexing of aligned files, igvtools (for SAM) and samtools (for BAM) are used.
Reconstruction: Perform transcriptome reconstruction with Scripture, Trinity etc.
33. 1. Use pre-aligned reads from a GEO dataset, for example GSE20851 (aligned to the mouse genome).
2. Unzip the file and proceed to the next step.
gunzip GSE20851_GSM521650_ES.aligned.sam.gz
3. Perform indexing by using igvtools.
igvtools index GSE20851_GSM521650_ES.aligned.sam
4. Run Scripture from the command line with
java -jar scripture.jar
5. Get the sizes file for mouse and the FASTA file for the chromosome (let’s say chr19).
6. Run Scripture on this chromosome (chr19):
java -jar scripture.jar -alignment GSE20851_GSM521650_ES.aligned.sam -out chr19.scriptureESTest.segments -sizeFile mm9.sizes -chr chr19 -chrSequence chr19.fa
Steps
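When the same run has to be repeated for several chromosomes, the command line can be assembled programmatically. The sketch below only builds the argument list from the parameters listed in these slides and does not execute anything. The function `scripture_command` is a hypothetical helper, and the file names are the ones used in the steps above.

```python
import shlex


def scripture_command(alignment, out, size_file, chrom, chrom_sequence,
                      start=None, end=None):
    """Build the argument list for a Scripture segmentation run."""
    command = [
        "java", "-jar", "scripture.jar",
        "-alignment", alignment,
        "-out", out,
        "-sizeFile", size_file,
        "-chr", chrom,
        "-chrSequence", chrom_sequence,
    ]
    # Optional region limits, used when not segmenting the full chromosome
    if start is not None:
        command += ["-start", str(start)]
    if end is not None:
        command += ["-end", str(end)]
    return command


command = scripture_command(
    alignment="GSE20851_GSM521650_ES.aligned.sam",
    out="chr19.scriptureESTest.segments",
    size_file="mm9.sizes",
    chrom="chr19",
    chrom_sequence="chr19.fa",
)
printable = shlex.join(command)  # shell-safe string, e.g. for logging
```

The resulting list could be passed to `subprocess.run(command, check=True)` on a machine where Java and scripture.jar are available.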
35. Output
The output of Scripture is a file in BED format containing:
all identified transcripts
The BED format is a concise and flexible way to represent genomic features and annotations. The
BED format description supports up to 12 columns.
And a graph file of .dot format containing
all segments found in the data (significant or not)
can be visualized using programs such as GraphViz
BED format
chrom | chromStart | chromEnd | name | score | strand | thickStart | thickEnd | itemRgb | blockCount | blockSizes | blockStarts
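To show how the BED columns encode a transcript, the sketch below parses one 12-column line and derives exon coordinates from the block fields. It assumes TAB-separated columns following the UCSC BED description, where block starts are relative to chromStart; the function name and the example line are invented for illustration.

```python
BED_FIELDS = (
    "chrom", "chromStart", "chromEnd", "name", "score", "strand",
    "thickStart", "thickEnd", "itemRgb", "blockCount", "blockSizes",
    "blockStarts",
)


def parse_bed_line(line):
    """Parse a BED line; add absolute exon coordinates when 12 columns exist."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 3:
        raise ValueError("BED needs at least chrom, chromStart and chromEnd")
    record = dict(zip(BED_FIELDS, columns))
    record["chromStart"] = int(record["chromStart"])
    record["chromEnd"] = int(record["chromEnd"])
    record["exons"] = []
    if len(columns) >= 12:
        # Comma-separated lists may end with a trailing comma
        sizes = [int(v) for v in record["blockSizes"].split(",") if v]
        starts = [int(v) for v in record["blockStarts"].split(",") if v]
        for size, offset in zip(sizes, starts):
            exon_start = record["chromStart"] + offset
            record["exons"].append((exon_start, exon_start + size))
    return record


# Made-up two-exon transcript
bed_line = "chr19\t1000\t5000\ttx1\t0\t+\t1000\t5000\t0\t2\t500,1000,\t0,3000,"
transcript = parse_bed_line(bed_line)
```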