Tools for Transcriptome data analysis
Sanjana Pandey
M.Sc. Bioinformatics
Transcript- “omics”
 Like the other -omics disciplines, transcriptomics is the detailed study of the transcriptome.
 The transcriptome is the complete set of all RNA molecules in a cell, a population of cells, or an organism.
 Transcriptome analysis is the study of the transcriptome, i.e. the complete set of RNA transcripts produced by the genome under specific circumstances or in a specific cell, using high-throughput methods.
 Such analysis is done with techniques such as microarrays and RNA-seq.
 Numerous erroneous sequence variants can be introduced during the library preparation, sequencing, and imaging steps; these should be identified and filtered out during data analysis. QC of raw data should therefore be the first step of a routine RNA-seq workflow.
 Tools such as FastQC and HTQC can be applied.
 Depending on the RNA-seq library construction strategy, some form of read trimming may be advisable prior to aligning the RNA-seq data.
 Trimming is optional and can be done after the QC check, since the FastQC report indicates whether trimming is needed.
 Modern high-throughput sequencers can generate hundreds of millions of sequences in a single run.
 QC ensures that the raw data look good and show no bias before further analysis.
Need for QC check
Transcriptome data
 Typical outputs include quantitative tables of transcript levels.
 The results of transcriptomic analyses are often presented graphically as heat maps.
 Clustered data
 Venn diagrams, which count the transcripts that are regulated in the same way across multiple samples
Figure: (Left) Heat map representation of p53 data from the BrainSpan database. (Right) Venn diagram representation of transcriptome data.
 The most widely used format in sequence analysis is FASTQ.
 Expression data can also be stored in .csv or .xlsx files.
 A CEL file is produced by Affymetrix DNA microarray image analysis software. It contains the data extracted from the "probes" on an Affymetrix GeneChip and can store thousands of data points.
 The SAM format may also be used.
Sources to find transcriptome data
 Ensembl
 GEO
 BrainSpan, etc.
Data Formats
Excel/csv format
Figure: Excel file of expression data for p53 and its interacting genes in the cerebrum
Fastq
 Most widely used format in sequence analysis
 Generally delivered from a sequencer.
 FASTQ format stores sequences and Phred qualities in a single file.
 Contains much more information than FastA.
 Hence it is preferred by software such as aligners and QC tools.
Each sequence requires at least 4 lines:
 The first line is the sequence header, which starts with an ‘@’ (not a ‘>’!).
 The second line is the sequence.
 The third line starts with ‘+’ and may optionally repeat the sequence identifier.
 The fourth line holds the quality scores, one character per base.
 The sequence identifier is further split into fields such as the flow cell ID, run ID, etc.
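The four-line record layout described above can be sketched in Python as follows. This is a minimal illustration, not a full FASTQ parser: it assumes the simple layout with exactly four lines per record and the Phred+33 quality encoding (both assumptions, since the slides do not state the encoding), and the names `FastqRecord` and `parse_fastq` are hypothetical.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str        # header text without the leading '@'
    sequence: str          # base calls
    qualities: List[int]   # one Phred score per base


def parse_fastq(lines: Iterable[str], phred_offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from FASTQ lines (4 lines per record, Phred+33 assumed)."""
    # Drop line endings and blank lines so records sit at multiples of four.
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    # A trailing incomplete record (fewer than 4 lines) is silently ignored.
    for i in range(0, len(stripped) - 3, 4):
        header, sequence, separator, quality = stripped[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality strings differ in length")
        # Each quality character encodes one Phred score as ASCII code minus offset.
        scores = [ord(ch) - phred_offset for ch in quality]
        yield FastqRecord(header[1:], sequence, scores)


example = ["@read1\n", "ACGT\n", "+\n", "II5!\n"]
records = list(parse_fastq(example))
```

With Phred+33, the character `I` decodes to 40, `5` to 20 and `!` to 0, so the example read carries the scores 40, 40, 20, 0.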
Workflow
Figure: General workflow for transcriptome data processing and subsequent analysis.
Tools
 FastQC
 A very popular tool that provides an overview of basic quality control metrics for raw next-generation sequencing data. It offers a number of different analyses (called modules) that may be performed on a sequence data set. Written by Simon Andrews of Babraham Bioinformatics.
 Scripture
 A tool for transcriptome reconstruction. Scripture assembles full-length gene transcripts from RNA-seq data.
 It relies solely on RNA-seq reads and an assembled genome to build a transcriptome ab initio.
https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
FastQC
FastQC
 A tool that provides an overview of basic quality control metrics for raw next-generation sequencing data.
 It runs a set of analyses on one or more raw sequence files in FASTQ or SAM/BAM format and produces a report that summarizes the results.
 It can be run as an interactive graphical application, on Windows by launching the run_fastqc.bat file.
 It can also be run in non-interactive mode on the command line.
 In non-interactive mode, FastQC generates an HTML report for each file without launching a user interface.
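As a rough sketch of the non-interactive mode, the Python snippet below assembles a FastQC command line and runs it only if the `fastqc` executable is found on the PATH. The helper names `build_fastqc_command` and `run_fastqc` are hypothetical; `--outdir` and `--threads` are FastQC command-line options, and the output directory is assumed to exist already.

```python
import shutil
import subprocess
from typing import List, Sequence


def build_fastqc_command(fastq_files: Sequence[str], outdir: str, threads: int = 1) -> List[str]:
    """Return the argument list for a non-interactive FastQC run."""
    command = ["fastqc", "--outdir", outdir, "--threads", str(threads)]
    command.extend(fastq_files)  # one report is produced per input file
    return command


def run_fastqc(fastq_files: Sequence[str], outdir: str, threads: int = 1) -> None:
    """Run FastQC if it is installed; the output directory must already exist."""
    if shutil.which("fastqc") is None:
        raise FileNotFoundError("fastqc executable not found on PATH")
    subprocess.run(build_fastqc_command(fastq_files, outdir, threads), check=True)


# Hypothetical file names, for illustration only.
print(" ".join(build_fastqc_command(["sample_R1.fastq.gz"], "qc_reports")))
```

Building the argument list separately from running it keeps the command easy to inspect or log before any files are touched.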
FastQC
How does it perform such quality checks?
What algorithm is it based on?
FastQC supports files in the following formats
 FastQ (all quality encoding variants)
 Casava FastQ files*
 Colorspace FastQ
 GZip compressed FastQ
 SAM
 BAM
 SAM/BAM Mapped only (normally used for colorspace data)
File formats
1. Basic Statistics
2. Per base sequence quality
3. Per tile sequence quality
4. Per sequence quality scores
5. Per base sequence content
6. Per sequence GC content
7. Per base N content
8. Sequence Length Distribution
9. Sequence Duplication Levels
10. Overrepresented sequences
11. Adapter content
Summary includes:
 The Basic Statistics module generates simple composition statistics for the file analysed.
 Filename: The original filename of the file which was analysed
 File type
 Encoding: Says which ASCII encoding of quality values was
found in this file.
 Total Sequences: A count of the total number of sequences
processed.
 Sequence Length: Provides the length of the shortest and longest
sequence in the set. If all sequences are the same length only one
value is reported.
 %GC: The overall %GC of all bases in all sequences
Basic Statistics
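The fields listed for this module are simple to compute. The sketch below is a minimal Python illustration that works on an in-memory list of read sequences; it is not FastQC's own implementation, and the function name `basic_statistics` and the dictionary keys are hypothetical.

```python
from typing import Dict, Sequence, Union


def basic_statistics(sequences: Sequence[str]) -> Dict[str, Union[int, str]]:
    """Compute total sequences, sequence length range and overall %GC."""
    if not sequences:
        raise ValueError("at least one sequence is required")
    lengths = [len(s) for s in sequences]
    total_bases = sum(lengths)
    gc_bases = sum(s.upper().count("G") + s.upper().count("C") for s in sequences)
    shortest, longest = min(lengths), max(lengths)
    return {
        "total_sequences": len(sequences),
        # Only one value is reported when all sequences share the same length.
        "sequence_length": str(shortest) if shortest == longest else f"{shortest}-{longest}",
        # %GC over all bases in all sequences, rounded to a whole percent.
        "percent_gc": round(100 * gc_bases / total_bases) if total_bases else 0,
    }


stats = basic_statistics(["ACGT", "GGCC", "AATT"])
```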
 Overview of the range of quality values across all bases at
each position in the FastQ file.
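 It is drawn as a box-and-whisker plot, one box per position.
 A warning is issued if the lower quartile for any base is less than 10, or if the median for any base is less than 25.
 The module fails if the lower quartile for any base is less than 5, or if the median for any base is less than 20.
 If the quality of the library falls to a low level along the read, perform quality trimming (reads are truncated based on their average quality).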
Per Base Sequence Quality
Figure: Per base sequence quality plots for a good and a bad dataset.
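To make the pass/warn/fail logic of this module concrete, here is a small Python sketch. It assumes the thresholds given in the FastQC documentation for this module (warning: lower quartile below 10 or median below 25; failure: lower quartile below 5 or median below 20). The function name `per_base_status` is hypothetical, and the quartile is computed with Python's `statistics.quantiles`, which may differ slightly from FastQC's own calculation.

```python
from statistics import median, quantiles
from typing import List, Sequence


def per_base_status(quality_columns: Sequence[Sequence[int]]) -> List[str]:
    """Classify each read position from the Phred scores observed at it.

    quality_columns[i] holds the scores of all reads at position i.
    """
    statuses = []
    for scores in quality_columns:
        med = median(scores)
        # quantiles() needs at least two values; fall back to the single score.
        lower_quartile = quantiles(scores, n=4)[0] if len(scores) > 1 else scores[0]
        if lower_quartile < 5 or med < 20:
            statuses.append("fail")
        elif lower_quartile < 10 or med < 25:
            statuses.append("warn")
        else:
            statuses.append("pass")
    return statuses


columns = [[30, 32, 34, 36], [22, 23, 24, 26], [2, 3, 15, 18]]
print(per_base_status(columns))
```

In the example, the first position has a median of 33, the second a median of 23.5 and the third a median of 9, which fall into the pass, warning and failure ranges respectively.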
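 Shows whether a subset of your sequences has universally low quality values.
 Such a subset is often of poor quality because it was poorly imaged.
 One may check whether a significant proportion of the sequences in a run have low overall quality.
 A warning is raised if the most frequently observed mean quality is below 27, which equates to a 0.2% error rate.
 An error is raised if the most frequently observed mean quality is below 20, which equates to a 1% error rate.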
Per Sequence Quality Scores
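The link between a Phred score and an error rate follows from the definition Q = -10 log10(P), so P = 10^(-Q/10). The Python sketch below shows that conversion and one simple way to find the most frequently observed mean quality; the function names are hypothetical, and rounding each read's mean to the nearest integer is an assumption made for illustration.

```python
from collections import Counter
from typing import Sequence


def phred_to_error_probability(quality: float) -> float:
    """Convert a Phred quality score Q into an error probability 10^(-Q/10)."""
    return 10 ** (-quality / 10)


def most_frequent_mean_quality(reads_qualities: Sequence[Sequence[int]]) -> int:
    """Return the most common per-read mean quality (rounded to an integer)."""
    means = [round(sum(scores) / len(scores)) for scores in reads_qualities]
    return Counter(means).most_common(1)[0][0]


# Q20 corresponds to a 1% error rate and Q30 to 0.1%.
for q in (20, 30):
    print(q, phred_to_error_probability(q))
```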
 Plots out the proportion of each base position for which
each of the four normal DNA bases has been called.
 Issues a warning if the difference between A and T, or between G and C, is greater than 10% at any position.
 Overrepresented sequences: If there is any evidence of
overrepresented sequences such as adapter dimers or
rRNA in a sample then these sequences may bias the
overall composition and their sequence will emerge from
this plot.
Per Base Sequence Content
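The check described above can be sketched in Python as follows. This is an illustrative reimplementation under simple assumptions (reads held in memory, N and other ambiguous calls ignored, positions numbered from 0), not FastQC's code; `per_base_content` and `flag_positions` are hypothetical names.

```python
from typing import Dict, List, Sequence


def per_base_content(sequences: Sequence[str]) -> List[Dict[str, float]]:
    """Return the percentage of A, C, G and T called at each read position."""
    length = max(len(s) for s in sequences)
    table = []
    for pos in range(length):
        column = [s[pos].upper() for s in sequences if pos < len(s)]
        called = [base for base in column if base in "ACGT"]  # skip N etc.
        table.append({
            base: (100 * called.count(base) / len(called)) if called else 0.0
            for base in "ACGT"
        })
    return table


def flag_positions(table: Sequence[Dict[str, float]], threshold: float = 10.0) -> List[int]:
    """List 0-based positions where |A-T| or |G-C| exceeds the threshold (in %)."""
    return [
        i for i, row in enumerate(table)
        if abs(row["A"] - row["T"]) > threshold or abs(row["G"] - row["C"]) > threshold
    ]


reads = ["AA", "AT", "GC", "CG"]
print(flag_positions(per_base_content(reads)))
```

In the example, position 0 holds A, A, G, C (50% A against 0% T) and is flagged, while position 1 holds one of each base and is not.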
 Measures GC content across the whole length of each
sequence in a file and compares it to a modelled normal
distribution of GC content.
 In a normal random library we see a roughly normal
distribution of GC content where the central peak
corresponds to the overall GC content of the underlying
genome.
 An unusually shaped distribution could indicate a contaminated library or some other kind of biased subset.
GC content
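The observed side of this comparison is a histogram of per-read GC percentages. The Python sketch below builds only that histogram; fitting and comparing against a modelled normal distribution, as FastQC does, is left out. The function names `gc_percent` and `gc_histogram` are hypothetical.

```python
from collections import Counter
from typing import Sequence


def gc_percent(sequence: str) -> float:
    """Return the GC content of one non-empty sequence as a percentage."""
    seq = sequence.upper()
    return 100 * (seq.count("G") + seq.count("C")) / len(seq)


def gc_histogram(sequences: Sequence[str]) -> Counter:
    """Count reads per rounded GC percentage (empty reads are skipped)."""
    return Counter(round(gc_percent(s)) for s in sequences if s)


histogram = gc_histogram(["ACGT", "AGCT", "AAAA"])
```

A library with a single central peak in this histogram matches the expected shape; additional peaks or a strongly skewed shape are what the module is designed to flag.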
 The left-hand side of the main interactive display, or the top of the HTML report, shows a summary of the modules which were run and a quick evaluation of whether the results of each module seem entirely normal (green tick), slightly abnormal (orange triangle) or very unusual (red cross).
 In addition to providing an interactive report FastQC also has the option to create an HTML version of this
report for a more permanent record. This HTML report can also be generated directly by running FastQC
in non-interactive mode.
 To create a report simply select File > Save Report from the main menu.
 The HTML file which is saved is a self-contained document with all of the graphs embedded into it.
Output & Result Analysis
http://software.broadinstitute.org/software/scripture/home
Scripture
Scripture
 Scripture is a method for transcriptome reconstruction that relies solely on RNA-Seq reads and an assembled
genome to build a transcriptome ab initio.
 Scripture assembles full-length gene transcripts from RNA-seq data without relying on existing annotation. It is genome-guided rather than genome-independent: the Scripture algorithm needs both reads and a genome sequence.
Scripture provides three main operations or tasks:
 Segmentation: To call transcripts based on previously aligned data
 Score: To evaluate expression of transcript sets
 Add paired end data to a previously segmented graph.
 Identification of all protein isoforms that may be expressed by a gene.
 RNA-seq has been used to reconstruct transcriptomes by assembling sequencing reads with or without reference genomes.
 However, transcriptome diversity owing to alternative transcription start sites, alternative splicing of exons, and/or the use of different poly(A) sites is often difficult to capture and characterize using NGS data, because the reads are relatively short (typically ≤ 400 nt) in comparison to the length of mature transcripts (median > 2500 nt).
Necessity of reconstruction?
 The knowledge of all protein isoforms that may be expressed by a gene is fundamental.
 Tools such as Scripture, Cufflinks, SLIDE, MultiSplice, etc. use RNA-seq data for exon identification and expression-level data for transcript assembly.
 While exon identification performs quite well, transcript assembly remains difficult for complex transcriptomes.
Transcriptome reconstruction
 Genome-guided methods rely on a reference genome to first map all the reads to the genome and then
assemble overlapping reads into transcripts. By contrast, genome-independent methods assemble the reads
directly into transcripts without using a reference genome.
 Both genome-guided and genome-independent algorithms have been reported to accurately reconstruct thousands of transcripts and many alternative splice forms. So which should be preferred?
 This is governed by the particular biological question to be answered. Genome-independent methods are the
obvious choice for organisms without a reference sequence, whereas the increased sensitivity of genome-
guided approaches makes them the obvious choice for annotating organisms with a reference genome.
Reconstruction methods
 Reads originating from two different isoforms of the
same gene are colored black and blue. In genome-
guided assembly, reads are first mapped to a
reference genome, and spliced reads are used to
build a transcript graph, which is then parsed into
gene annotations.
 In the genome-independent approach, reads are
broken into k-mer seeds and arranged into a de Bruijn
graph structure. The graph is parsed to identify
transcript sequences, which are aligned to the
genome to produce gene annotations.
 Spliced reads give rise to four possible
transcripts, but only two transcripts are needed
to explain all reads; the two possible sets of
minimal isoforms are depicted
Method
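The k-mer step of the genome-independent approach can be illustrated with a toy de Bruijn graph in Python. This is a teaching sketch only: real assemblers add error correction, coverage tracking and graph simplification, none of which is shown here. The function name `build_de_bruijn_graph` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Sequence, Set


def build_de_bruijn_graph(reads: Sequence[str], k: int) -> Dict[str, Set[str]]:
    """Build a de Bruijn graph whose nodes are (k-1)-mers.

    Each k-mer found in a read adds an edge from its prefix (first k-1
    bases) to its suffix (last k-1 bases).
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Two toy reads that share a prefix and then diverge, as two isoforms might.
graph = build_de_bruijn_graph(["ACGT", "ACGA"], k=3)
```

In the example, the node `CG` has two outgoing edges (to `GT` and `GA`), which is the kind of branch point that the parsing step has to resolve into separate transcript sequences.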
 SAM stands for Sequence Alignment/Map format.
 It is a generic format for storing large nucleotide sequence alignments.
 Can easily be generated by alignment programs or converted from existing alignment formats
 Allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus.
 It is a TAB-delimited text format
 Consists of a header section, which is optional, and an alignment section.
 If present, the header must be prior to the alignments. Header lines start with ‘@’, while alignment lines do not.
Each alignment line has 11 mandatory fields for essential alignment information
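The 11 mandatory fields of an alignment line are, in order, QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ and QUAL. A minimal Python sketch for splitting one line into these fields is shown below; optional tags after the eleventh column are ignored, and the names `SamAlignment` and `parse_sam_line` are hypothetical.

```python
from typing import NamedTuple, Optional


class SamAlignment(NamedTuple):
    qname: str   # query (read) name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of the mate/next read
    pnext: int   # position of the mate/next read
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # ASCII-encoded base qualities


def parse_sam_line(line: str) -> Optional[SamAlignment]:
    """Parse one SAM line; header lines (starting with '@') return None."""
    if line.startswith("@"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    return SamAlignment(qname, int(flag), rname, int(pos), int(mapq),
                        cigar, rnext, int(pnext), int(tlen), seq, qual)


# A made-up alignment line, for illustration only.
record = parse_sam_line("read1\t0\tchr19\t100\t255\t4M\t*\t0\t0\tACGT\tIIII")
```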
SAM format
Figure: Example SAM header. The reference sequence dictionary gives each reference sequence name (SN) and reference sequence length (LN); the header also carries file-level metadata.
Scripture
 Requires the Java Runtime Environment
 Downloaded as a .jar file (scripture.jar)
 Command-line interface
java -jar scripture.jar
Algorithm
 Scripture's main algorithm to "segment" the genome from the sequence data into regions enriched in read
coverage takes as input a read alignment file, genome information and filtering parameters to produce a
transcript graph.
 Command:
java -jar scripture.jar <mandatory parameters> <optional parameters>
Mandatory Parameters
 -alignment: Path to a spliced read alignment file
 -out: Path to a file for Scripture to write its output.
 -sizeFile: A 2-column tab separated file containing the chromosome name and size for the organism.
 -chr: Chromosome to segment
 -chrSequence: Full path to the chromosome sequence in fasta format for the chromosome to segment.
Optional Parameters
 -start: Start of region to segment if not segmenting the full chromosome.
 -end: End of region to segment when not segmenting the full chromosome.
 -pairedEnd: Paired-end data. This file can be in either SAM or BAM format.
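The parameters listed above can be assembled into a command line programmatically. The Python sketch below only uses the flags named on these slides (they have not been checked against a particular Scripture release), and the helper name `build_scripture_command` is hypothetical. It builds the argument list without running anything.

```python
from typing import List, Optional


def build_scripture_command(
    alignment: str,
    out: str,
    size_file: str,
    chromosome: str,
    chr_sequence: str,
    start: Optional[int] = None,
    end: Optional[int] = None,
    paired_end: Optional[str] = None,
    jar: str = "scripture.jar",
) -> List[str]:
    """Return the argument list for a Scripture segmentation run."""
    command = [
        "java", "-jar", jar,
        "-alignment", alignment,      # spliced read alignment file
        "-out", out,                  # output file
        "-sizeFile", size_file,       # chromosome name/size table
        "-chr", chromosome,           # chromosome to segment
        "-chrSequence", chr_sequence  # FASTA file of that chromosome
    ]
    # Optional parameters are appended only when given.
    if start is not None:
        command += ["-start", str(start)]
    if end is not None:
        command += ["-end", str(end)]
    if paired_end is not None:
        command += ["-pairedEnd", paired_end]
    return command


cmd = build_scripture_command(
    "GSE20851_GSM521650_ES.aligned.sam",
    "chr19.scriptureESTest.segments",
    "mm9.sizes",
    "chr19",
    "chr19.fa",
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where Java and scripture.jar are available.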
Workflow
Aligned reads data: Either use pre-aligned reads or perform read alignment with Bowtie or TopHat prior to sorting.
Indexing: For sorting and indexing of aligned files, igvtools (for SAM) and samtools (for BAM) are used.
Reconstruction: Perform transcriptome reconstruction with Scripture, Trinity, etc.
1. Use pre-aligned reads from a GEO dataset, e.g. GSE20851 (aligned to the mouse genome).
2. Unzip the file and proceed to the next step.
gunzip GSE20851_GSM521650_ES.aligned.sam.gz
3. Perform indexing using igvtools.
igvtools index GSE20851_GSM521650_ES.aligned.sam
4. Run Scripture from the command line with
java -jar scripture.jar
5. Get the chromosome sizes file for mouse (mm9.sizes) and the FASTA file for the chromosome of interest (say, chr19).
6. Run Scripture on this chromosome (chr19):
java -jar scripture.jar -alignment GSE20851_GSM521650_ES.aligned.sam -out chr19.scriptureESTest.segments -sizeFile mm9.sizes -chr chr19 -chrSequence chr19.fa
Steps
Figure: Files needed to carry out a Scripture run
Output
 The output of Scripture is a BED file containing all identified transcripts.
 The BED format is a concise and flexible way to represent genomic features and annotations. The BED format description supports up to 12 columns.
 Scripture also writes a graph file in .dot format containing all segments found in the data (significant or not), which can be visualized using programs such as Graphviz.
BED format
chrom | chromStart | chromEnd | name | score | strand | thickStart | thickEnd | itemRgb | blockCount | blockSizes | blockStarts
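A line of such a BED file can be read into named fields with a few lines of Python. The sketch below assumes the standard 12-column BED layout (chrom, chromStart, chromEnd, name, score, strand, thickStart, thickEnd, itemRgb, blockCount, blockSizes, blockStarts) with tab-separated columns; `parse_bed12_line` is a hypothetical helper and the example line is made up rather than taken from real Scripture output.

```python
from typing import Any, Dict

BED12_COLUMNS = (
    "chrom", "chromStart", "chromEnd", "name", "score", "strand",
    "thickStart", "thickEnd", "itemRgb", "blockCount", "blockSizes", "blockStarts",
)


def parse_bed12_line(line: str) -> Dict[str, Any]:
    """Parse one tab-separated 12-column BED line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(BED12_COLUMNS):
        raise ValueError("expected 12 tab-separated BED columns")
    record: Dict[str, Any] = dict(zip(BED12_COLUMNS, fields))
    for key in ("chromStart", "chromEnd", "thickStart", "thickEnd", "blockCount"):
        record[key] = int(record[key])
    # Parsed as float to tolerate tools that write non-integer scores.
    record["score"] = float(record["score"])
    # Block lists are comma-separated and may end with a trailing comma.
    for key in ("blockSizes", "blockStarts"):
        record[key] = [int(x) for x in record[key].split(",") if x]
    return record


line = "chr19\t1000\t5000\ttranscript_1\t0\t+\t1000\t5000\t0,0,0\t2\t500,700,\t0,3300,"
transcript = parse_bed12_line(line)
```

Here each block corresponds to one exon: the example transcript has two exons of 500 and 700 bases, starting 0 and 3300 bases after chromStart.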
References
1. https://doi.org/10.1038/nmeth.1613
2. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4742321/
3. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE20851
4. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-S9-S3
5. http://software.broadinstitute.org/software/scripture/home
6. https://www.affymetrix.com/support/developer/powertools/changelog/gcos-agcc/cel.html
7. http://www.brainspan.org/rnaseq/search/index.html
8. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html
9. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html
Thank you.
Questions and feedback are most welcome.

Forest laws, Indian forest laws, why they are important
 
Manassas R - Parkside Middle School 🌎🏫
Manassas R - Parkside Middle School 🌎🏫Manassas R - Parkside Middle School 🌎🏫
Manassas R - Parkside Middle School 🌎🏫
 
Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
 
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
 
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxAnalytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C P
 
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
 
Transposable elements in prokaryotes.ppt
Transposable elements in prokaryotes.pptTransposable elements in prokaryotes.ppt
Transposable elements in prokaryotes.ppt
 
Solution chemistry, Moral and Normal solutions
Solution chemistry, Moral and Normal solutionsSolution chemistry, Moral and Normal solutions
Solution chemistry, Moral and Normal solutions
 
‏‏VIRUS - 123455555555555555555555555555555555555555
‏‏VIRUS -  123455555555555555555555555555555555555555‏‏VIRUS -  123455555555555555555555555555555555555555
‏‏VIRUS - 123455555555555555555555555555555555555555
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
 
RESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptx
RESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptxRESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptx
RESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptx
 

Tools for Transcriptome Data Analysis

  • 1. Tools for Transcriptome Data Analysis. Sanjana Pandey, M.Sc. Bioinformatics
  • 2. Transcript-"omics". Like the other -omics techniques, transcriptomics is the detailed study of the transcriptome. The transcriptome is the complete set of all RNA molecules in a cell, a population of cells, or an organism. Transcriptome analysis is the study of the complete set of RNA transcripts produced by the genome under specific circumstances or in a specific cell, using high-throughput methods. Such analysis is done with techniques such as microarrays and RNA-seq.
  • 3. Need for QC check. Numerous erroneous sequence variants can be introduced during the library preparation, sequencing, and imaging steps; these should be identified and filtered out during data analysis. QC of the raw data should therefore be the first step of a routine RNA-seq workflow. Tools such as FastQC and HTQC can be applied. Depending on the RNA-seq library construction strategy, some form of read trimming may be advisable before aligning the RNA-seq data. Trimming is optional and can be done after the QC check, since the FastQC report indicates whether it is needed. Modern high-throughput sequencers can generate hundreds of millions of sequences in a single run, so QC is needed to confirm that the raw data look good and show no bias.
  • 4. Transcriptome data. Typical outputs include quantitative tables of transcript levels. The results of transcriptomic analyses are often presented graphically as heat maps, as clustered data, or as Venn diagrams, which count the transcripts that are equivalently regulated in multiple samples. Figure: (Left) Heat map representation of p53 data from the BrainSpan database. (Right) Venn diagram representation of transcriptome data.
  • 5. Data formats. The most widely used format in sequence analysis is FASTQ. Expression data can also be stored in .csv or .xlsx files. A CEL file is created by Affymetrix DNA microarray image analysis software; it contains the data extracted from "probes" on an Affymetrix GeneChip and can store thousands of data points. The SAM format may also be used. Sources of transcriptome data: Ensembl, GEO, BrainSpan, etc.
  • 6. Excel/CSV format. Figure: Excel file data for the expression of p53 and its interacting genes in the cerebrum.
  • 7. FASTQ. The most widely used format in sequence analysis, generally delivered directly from the sequencer. FASTQ stores sequences and their Phred quality scores in a single file, so it contains much more information than FASTA and is preferred by software such as aligners and QC tools. Each sequence requires at least 4 lines: the first line is the sequence header, which starts with '@' (not '>'); the second line is the sequence; the third line starts with '+' and may repeat the sequence identifier; the fourth line holds the quality scores. The sequence identifier is further split into fields such as flow cell ID and run ID.
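The four-line record layout described on this slide can be read with a few lines of Python. This is a minimal sketch, not a full parser: it assumes each record occupies exactly four lines (no wrapped sequences) and that qualities use the common Phred+33 ASCII encoding. The names `FastqRecord` and `read_fastq` are illustrative and do not come from any library.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str        # header text without the leading '@'
    sequence: str          # called bases
    qualities: List[int]   # one Phred score per base


def read_fastq(handle: TextIO, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a blank line) ends the stream
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"header must start with '@': {header!r}")
        if not separator.startswith("+"):
            raise ValueError(f"third line must start with '+': {separator!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # Assumption: Phred+33, i.e. score = ASCII code - 33.
        scores = [ord(char) - offset for char in quality]
        yield FastqRecord(header[1:], sequence, scores)


# Small in-memory example with one made-up read.
example_records = list(read_fastq(io.StringIO("@read1\nACGT\n+\nII5!\n")))
```

Older Illumina pipelines used a Phred+64 encoding, which is why the offset is left as a parameter.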
  • 8.
  • 9. Workflow (Scripture). Figure: General workflow for transcriptome data processing and subsequent analysis.
  • 10. Tools. FastQC is a very popular tool that provides an overview of basic quality control metrics for raw next-generation sequencing data. It can run a number of different analyses (called modules) on a sequence data set. It was written by Simon Andrews of Babraham Bioinformatics. Scripture is a tool for transcriptome reconstruction: it assembles full-length gene transcripts from RNA-seq data, relying solely on RNA-seq reads and an assembled genome to build a transcriptome ab initio.
  • 12. FastQC. FastQC provides an overview of basic quality control metrics for raw next-generation sequencing data. It runs a set of analyses on one or more raw sequence files in FASTQ or SAM format and produces a report that summarizes the results. It can be run as an interactive graphical application (on Windows, by running the run_fastqc.bat file) or in non-interactive mode on the command line, in which case FastQC generates an HTML report for each file without launching a user interface.
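As a rough illustration of the non-interactive mode, the sketch below assembles a FastQC command line in Python. It assumes the `fastqc` wrapper script is on the PATH and uses its `--outdir` option. The file and directory names are hypothetical, and the helper `build_fastqc_command` is illustrative only.

```python
from typing import List, Sequence


def build_fastqc_command(fastq_files: Sequence[str], outdir: str) -> List[str]:
    """Return an argument list for a non-interactive FastQC run."""
    if not fastq_files:
        raise ValueError("at least one input file is required")
    # Passing files on the command line makes FastQC skip the GUI.
    # FastQC expects the output directory to exist already.
    return ["fastqc", "--outdir", outdir, *fastq_files]


# Hypothetical paired-end sample; replace with real paths.
example_command = build_fastqc_command(
    ["sample_R1.fastq.gz", "sample_R2.fastq.gz"], "qc_reports"
)

# To actually run it (requires FastQC to be installed):
#   import subprocess
#   subprocess.run(example_command, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when file names contain spaces.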
  • 13. FastQC. How does it perform such quality checks? What algorithm is it based on?
  • 14. File formats. FastQC supports files in the following formats: FastQ (all quality encoding variants); Casava FastQ files*; Colorspace FastQ; GZip-compressed FastQ; SAM; BAM; SAM/BAM mapped only (normally used for colorspace data).
  • 15. Summary includes: 1. Basic Statistics 2. Per base sequence quality 3. Per tile sequence quality 4. Per sequence quality scores 5. Per base sequence content 6. Per sequence GC content 7. Per base N content 8. Sequence Length Distribution 9. Sequence Duplication Levels 10. Overrepresented sequences 11. Adapter content
  • 16. Basic Statistics. The Basic Statistics module generates summary statistics for the file analysed. Filename: the original filename of the file that was analysed. File type. Encoding: which ASCII encoding of quality values was found in the file. Total Sequences: a count of the total number of sequences processed. Sequence Length: the length of the shortest and longest sequence in the set; if all sequences are the same length, only one value is reported. %GC: the overall %GC of all bases in all sequences.
  • 17. Per Base Sequence Quality. Gives an overview of the range of quality values across all bases at each position in the FastQ file, drawn as a box plot per position. A warning is issued if the median for any base is below 25 or the lower quartile is below 10; a failure is raised if the median is below 20 or the lower quartile is below 5. If the quality of the library falls to a low level, perform quality trimming (reads are truncated based on their average quality). Figure: examples of good and bad per base quality plots.
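The per-position check can be sketched as follows. This is an approximation for illustration, not FastQC's own code. The thresholds are those given in the FastQC documentation as understood here (warn: median below 25 or lower quartile below 10; fail: median below 20 or lower quartile below 5). The quartile is taken with a simple sorted-index rule, which is an assumption, and the function names are hypothetical.

```python
from typing import List, Sequence


def percentile(sorted_values: Sequence[int], fraction: float) -> int:
    """Pick the value at the given fraction of a sorted list (simple index rule)."""
    index = min(int(len(sorted_values) * fraction), len(sorted_values) - 1)
    return sorted_values[index]


def per_base_quality_status(reads: Sequence[Sequence[int]]) -> str:
    """Classify a set of per-read Phred score lists as 'pass', 'warn' or 'fail'."""
    longest = max(len(read) for read in reads)
    status = "pass"
    for position in range(longest):
        # Collect the scores of every read that is long enough to cover this position.
        column: List[int] = sorted(
            read[position] for read in reads if position < len(read)
        )
        lower_quartile = percentile(column, 0.25)
        median = percentile(column, 0.5)
        if lower_quartile < 5 or median < 20:
            return "fail"
        if lower_quartile < 10 or median < 25:
            status = "warn"
    return status


# Made-up score matrices: four reads each.
good_reads = [[38, 38, 37], [36, 35, 30], [40, 39, 34], [37, 36, 33]]
borderline_reads = [[38, 22], [37, 24], [36, 21], [39, 23]]
bad_reads = [[38, 2], [37, 3], [36, 30], [39, 31]]
```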
  • 18. Per Sequence Quality Scores. Shows whether a subset of your sequences has universally low quality values, for example because of poor imaging. Use it to check whether a significant proportion of the sequences in a run have overall low quality. An error is raised if the most frequently observed mean quality is below 20, which equates to a 1% error rate.
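The link between a quality of 20 and a 1% error rate follows from the Phred definition, Q = -10 log10(P). A small sketch of the conversion, with illustrative function names:

```python
from typing import Sequence


def phred_to_error_probability(quality: float) -> float:
    """Convert a Phred score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-quality / 10)


def mean_quality(qualities: Sequence[int]) -> float:
    """Mean Phred score of one read, the quantity this module summarises."""
    return sum(qualities) / len(qualities)


# Q20 corresponds to 1 error in 100 base calls, Q30 to 1 in 1000.
error_at_q20 = phred_to_error_probability(20)
error_at_q30 = phred_to_error_probability(30)
```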
  • 19. Per Base Sequence Content. Plots, for each base position, the proportion of calls made for each of the four normal DNA bases. A warning is issued if the difference between A and T, or between G and C, is greater than 10% at any position. Overrepresented sequences: if a sample contains overrepresented sequences such as adapter dimers or rRNA, these may bias the overall composition, and their sequence will emerge from this plot.
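A minimal sketch of the calculation behind this plot is shown below. It computes the A/C/G/T proportions at each position and lists the positions where A differs from T, or G from C, by more than 10%. The function names are illustrative, and ignoring N calls in the denominator is a simplifying assumption.

```python
from collections import Counter
from typing import Dict, List, Sequence


def base_proportions(sequences: Sequence[str]) -> List[Dict[str, float]]:
    """Return, for each position, the fraction of A, C, G and T calls."""
    longest = max(len(seq) for seq in sequences)
    proportions = []
    for position in range(longest):
        counts = Counter(seq[position] for seq in sequences if position < len(seq))
        total = sum(counts[base] for base in "ACGT")  # N calls are ignored
        proportions.append(
            {base: (counts[base] / total if total else 0.0) for base in "ACGT"}
        )
    return proportions


def imbalanced_positions(
    proportions: Sequence[Dict[str, float]], threshold: float = 0.10
) -> List[int]:
    """Positions (0-based) where |A-T| or |G-C| exceeds the threshold."""
    return [
        index
        for index, p in enumerate(proportions)
        if abs(p["A"] - p["T"]) > threshold or abs(p["G"] - p["C"]) > threshold
    ]


# Made-up reads: the second set is A-rich at its first two positions.
balanced = ["ACGT", "ACGA", "TGCA", "TGCT"]
skewed = ["AAGC", "AAGC", "ATCG", "AACG"]
```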
  • 20. GC content. Measures the GC content across the whole length of each sequence in a file and compares it to a modelled normal distribution of GC content. In a normal random library, we expect a roughly normal distribution of GC content whose central peak corresponds to the overall GC content of the underlying genome. An unusually shaped distribution could indicate a contaminated library or some other kind of biased subset.
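The observed side of this comparison is a histogram of per-read GC percentages, which can be sketched as below. Fitting the modelled normal curve is left out. The function names are illustrative, and rounding down to whole percentages is an assumption made for simplicity.

```python
from collections import Counter
from typing import Sequence


def gc_percent(sequence: str) -> int:
    """Whole-number GC percentage of one read, ignoring non-ACGT calls."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0
    gc = seq.count("G") + seq.count("C")
    return 100 * gc // called  # integer arithmetic avoids float rounding


def gc_histogram(sequences: Sequence[str]) -> Counter:
    """Count how many reads fall at each GC percentage."""
    return Counter(gc_percent(seq) for seq in sequences)


# Made-up reads for illustration.
example_histogram = gc_histogram(["GGCCAATT", "ACGTACGT", "GGGC", "NNNN"])
```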
  • 21. Output & Result Analysis. The left-hand side of the main interactive display, or the top of the HTML report, shows a summary of the modules that were run and a quick evaluation of whether each module's results seem entirely normal (green tick), slightly abnormal (orange triangle), or very unusual (red cross). In addition to the interactive report, FastQC can create an HTML version of the report for a more permanent record; this HTML report can also be generated directly by running FastQC in non-interactive mode. To create a report, select File > Save Report from the main menu. The saved HTML file is a self-contained document with all of the graphs embedded in it.
  • 23. Scripture. Scripture is a method for transcriptome reconstruction that relies solely on RNA-seq reads and an assembled genome to build a transcriptome ab initio; the algorithm therefore needs both reads and a genome sequence. Scripture provides three main operations or tasks: Segmentation, to call transcripts based on previously aligned data; Score, to evaluate the expression of transcript sets; and adding paired-end data to a previously segmented graph.
  • 24. Transcriptome reconstruction. Necessity of reconstruction: knowledge of all protein isoforms that may be expressed by a gene is fundamental. RNA-seq has been used to reconstruct transcriptomes by assembling sequencing reads with or without reference genomes. However, transcriptome diversity owing to alternative transcription start sites, alternative splicing of exons, and/or the use of different poly(A) sites is often difficult to capture and characterize using NGS data, because reads are short (typically ≤ 400 nt) compared with mature transcripts (median > 2500 nt). Tools such as Scripture, Cufflinks, SLIDE, and MultiSplice use RNA-seq data for exon identification and expression-level data for transcript assembly. While exon identification performs quite well, transcript assembly remains difficult for complex transcriptomes.
  • 25. Reconstruction methods. Genome-guided methods rely on a reference genome: they first map all the reads to the genome and then assemble overlapping reads into transcripts. By contrast, genome-independent methods assemble the reads directly into transcripts without using a reference genome. Both genome-guided and genome-independent algorithms have been reported to accurately reconstruct thousands of transcripts and many alternative splice forms. So which to prefer? This depends on the biological question to be answered. Genome-independent methods are the obvious choice for organisms without a reference sequence, whereas the increased sensitivity of genome-guided approaches makes them the obvious choice for annotating organisms with a reference genome.
  • 26. Method. Figure: Reads originating from two different isoforms of the same gene are colored black and blue. In genome-guided assembly, reads are first mapped to a reference genome, and spliced reads are used to build a transcript graph, which is then parsed into gene annotations. In the genome-independent approach, reads are broken into k-mer seeds and arranged into a de Bruijn graph structure; the graph is parsed to identify transcript sequences, which are aligned to the genome to produce gene annotations. Spliced reads give rise to four possible transcripts, but only two transcripts are needed to explain all reads; the two possible sets of minimal isoforms are depicted.
  • 27. SAM format. SAM stands for Sequence Alignment/Map format. It is a generic format for storing large nucleotide sequence alignments. It can easily be generated by alignment programs or converted from existing alignment formats, and it allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus. It is a TAB-delimited text format consisting of an optional header section and an alignment section. If present, the header must precede the alignments. Header lines start with '@', while alignment lines do not. Each alignment line has 11 mandatory fields for essential alignment information.
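The 11 mandatory fields can be made concrete with a small parser. This is a sketch for illustration; real work would normally use an established library. The field names follow the SAM specification (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL). The class and function names, and the example alignment line, are made up.

```python
from typing import List, NamedTuple, Optional


class SamAlignment(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # ASCII-encoded base qualities
    optional: List[str]  # any TAG:TYPE:VALUE fields after column 11


def parse_sam_line(line: str) -> Optional[SamAlignment]:
    """Parse one SAM line; header lines (starting with '@') return None."""
    if line.startswith("@"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("an alignment line needs 11 mandatory fields")
    return SamAlignment(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], fields[11:],
    )


# Made-up alignment line for illustration.
example_alignment = parse_sam_line(
    "r001\t99\tchr19\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*\tNM:i:1"
)
example_header = parse_sam_line("@SQ\tSN:chr19\tLN:61342430")
```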
  • 28. SAM format (header). File-level metadata (@HD line). Reference sequence dictionary (@SQ lines): reference sequence name (SN) and reference sequence length (LN).
  • 30. Scripture. Requires a Java Runtime Environment. Downloaded as a .jar file (scripture.jar). Command-line interface: java -jar scripture.jar. Algorithm: Scripture's main algorithm "segments" the genome from the sequence data into regions enriched in read coverage. It takes as input a read alignment file, genome information, and filtering parameters, and produces a transcript graph. Command: java -jar scripture.jar <mandatory parameters> <optional parameters>
  • 31. Mandatory parameters. -alignment: path to a spliced read alignment file. -out: path to the file to which Scripture writes its output. -sizeFile: a 2-column tab-separated file containing the chromosome names and sizes for the organism. -chr: chromosome to segment. -chrSequence: full path to the FASTA sequence of the chromosome to segment. Optional parameters. -start: start of the region to segment when not segmenting the full chromosome. -end: end of the region to segment when not segmenting the full chromosome. -pairedEnd: paired-end data; this file can be in either SAM or BAM format.
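If a ready-made sizes file is not at hand, the 2-column file described for -sizeFile can be derived from a FASTA file. The sketch below is one way to do that, assuming a plain (uncompressed) FASTA in which the chromosome name is the first word of each '>' header. The function names and the tiny example sequence are hypothetical.

```python
import io
from typing import Dict, TextIO


def chromosome_sizes(fasta: TextIO) -> Dict[str, int]:
    """Sum the sequence length under each FASTA header."""
    sizes: Dict[str, int] = {}
    name = None
    for line in fasta:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # first word of the header
            sizes[name] = 0
        elif name is not None:
            sizes[name] += len(line)
    return sizes


def format_size_file(sizes: Dict[str, int]) -> str:
    """Render 'name<TAB>size' lines, one chromosome per line."""
    return "".join(f"{name}\t{size}\n" for name, size in sizes.items())


# Toy FASTA with two very short "chromosomes".
example_sizes = chromosome_sizes(io.StringIO(">chr19 test\nACGT\nAC\n>chrX\nGG\n"))
example_size_file = format_size_file(example_sizes)
```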
  • 32. Reconstruction workflow. Aligned reads: either use pre-aligned reads or perform read alignment with Bowtie or TopHat before sorting. Indexing: sort and index the aligned files; igvtools (for SAM) and samtools (for BAM) are used. Reconstruction: perform transcriptome reconstruction with Scripture, Trinity, etc.
  • 33. Steps. 1. Use pre-aligned reads from a GEO dataset, e.g. GSE20851 (aligned to the mouse genome). 2. Unzip the file: gunzip GSE20851_GSM521650_ES.aligned.sam.gz 3. Index it with igvtools: igvtools index GSE20851_GSM521650_ES.aligned.sam 4. Run Scripture from the command line: java -jar scripture.jar 5. Get the mouse chromosome sizes file and the FASTA file for the chromosome (say, chr19). 6. Run Scripture on this chromosome (chr19): java -jar scripture.jar -alignment GSE20851_GSM521650_ES.aligned.sam -out chr19.scriptureESTest.segments -sizeFile mm9.sizes -chr chr19 -chrSequence chr19.fa
  • 34. Figure: Files needed to run Scripture.
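When the run has to be repeated for several chromosomes, the final command can be assembled in a script. The following sketch builds the argument list from the parameters listed in these slides. The flag names are copied from the slides and have not been checked against a particular Scripture release, and `build_scripture_command` is an illustrative helper, not part of Scripture.

```python
from typing import List, Optional


def build_scripture_command(
    alignment: str,
    out: str,
    size_file: str,
    chromosome: str,
    chr_sequence: str,
    jar: str = "scripture.jar",
    start: Optional[int] = None,
    end: Optional[int] = None,
) -> List[str]:
    """Assemble a 'java -jar scripture.jar ...' argument list."""
    command = [
        "java", "-jar", jar,
        "-alignment", alignment,
        "-out", out,
        "-sizeFile", size_file,
        "-chr", chromosome,
        "-chrSequence", chr_sequence,
    ]
    # Optional region limits, only added when given.
    if start is not None:
        command += ["-start", str(start)]
    if end is not None:
        command += ["-end", str(end)]
    return command


# Same file names as in the steps on this slide.
example_command = build_scripture_command(
    alignment="GSE20851_GSM521650_ES.aligned.sam",
    out="chr19.scriptureESTest.segments",
    size_file="mm9.sizes",
    chromosome="chr19",
    chr_sequence="chr19.fa",
)

# To run it (requires Java and scripture.jar in the working directory):
#   import subprocess
#   subprocess.run(example_command, check=True)
```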
  • 35. Output. The output of Scripture is a BED file containing all identified transcripts. The BED format is a concise and flexible way to represent genomic features and annotations; its description supports up to 12 columns. Scripture also writes a graph file in .dot format containing all segments found in the data (significant or not), which can be visualized with programs such as GraphViz. BED format columns: chrom | start | end | name | score | strand | thickStart | thickEnd | itemRgb | blockCount | blockSizes | blockStarts
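To show how the 12 BED columns describe a multi-exon transcript, here is a minimal parsing sketch. It assumes a tab-separated BED12 line with 0-based, half-open coordinates, and block starts given relative to the transcript start, as in the standard BED description. The class and function names and the example transcript are made up, and the score is read as a float in case a tool writes non-integer scores.

```python
from typing import List, NamedTuple, Tuple


class BedRecord(NamedTuple):
    chrom: str
    start: int
    end: int
    name: str
    score: float
    strand: str
    thick_start: int
    thick_end: int
    item_rgb: str
    block_count: int
    block_sizes: List[int]
    block_starts: List[int]


def _int_list(text: str) -> List[int]:
    """Parse a comma-separated list, tolerating a trailing comma."""
    return [int(item) for item in text.split(",") if item]


def parse_bed12(line: str) -> BedRecord:
    """Parse one 12-column BED line into a BedRecord."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 12:
        raise ValueError("expected 12 tab-separated columns")
    record = BedRecord(
        f[0], int(f[1]), int(f[2]), f[3], float(f[4]), f[5],
        int(f[6]), int(f[7]), f[8], int(f[9]),
        _int_list(f[10]), _int_list(f[11]),
    )
    if not (record.block_count == len(record.block_sizes) == len(record.block_starts)):
        raise ValueError("blockCount does not match blockSizes/blockStarts")
    return record


def exon_intervals(record: BedRecord) -> List[Tuple[int, int]]:
    """Genomic (start, end) of each block, i.e. each exon of the transcript."""
    return [
        (record.start + offset, record.start + offset + size)
        for offset, size in zip(record.block_starts, record.block_sizes)
    ]


# Made-up two-exon transcript.
example_record = parse_bed12(
    "chr19\t1000\t5000\ttranscript1\t960\t+\t1000\t5000\t0,0,0\t2\t567,488,\t0,3512"
)
example_exons = exon_intervals(example_record)
```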
  • 36. References 1. https://sci-hub.scihubtw.tw/https://doi.org/10.1038/nmeth.1613 2. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4742321/ 3. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE20851 4. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-S9-S3 5. http://software.broadinstitute.org/software/scripture/home 6. https://www.affymetrix.com/support/developer/powertools/changelog/gcos-agcc/cel.html 7. http://www.brainspan.org/rnaseq/search/index.html 8. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html 9. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html
  • 37. Thank you. Questions and feedback are most welcome.