Galaxy RNA-Seq Analysis: Tuxedo Protocol

ChangBum Hong, KT Bioinformatics, GenomeCloud SCIC

genome-cloud.com
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 New Zealand License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/nz/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.

Introduction
• RNA-Seq

• Transcriptome assembly: qualitative identification of expressed sequence

• Differential expression analysis: quantitative measurement of transcript expression

• RNA-Seq Applications

• Annotation: Identify novel genes, transcripts, exons, splicing events, ncRNAs

• Detecting RNA editing and SNPs

• Measurements: RNA quantification and differential gene expression

Experimental design
• What are my goals?

• Transcriptome assembly?

• Differential expression analysis?

• Identify rare transcripts?

• What are the characteristics of my system?

• Large, complex genome?

• Introns and high degree of alternative splicing?

• No reference genome or transcriptome?

Experimental Outputs

Assembly

Expression

Differentially expressed

Splicing

Sequencing
• Platforms

• Library preparation

• Multiplexing

• Sequence reads

• File names

• FASTQ format (formats vary)

• 4 lines per read

Illumina Read ID
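The four-lines-per-read layout can be illustrated with a minimal Python sketch. It assumes an uncompressed FASTQ file with unwrapped sequence lines; the function name `read_fastq` and the two example reads are made up for illustration and are not part of the tutorial dataset.

```python
import io
from itertools import islice
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str   # header line without the leading "@"
    sequence: str  # base calls
    quality: str   # one quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per group of four lines (assumes unwrapped FASTQ)."""
    while True:
        chunk = [line.rstrip("\n") for line in islice(handle, 4)]
        if not chunk:
            return  # end of file
        if len(chunk) < 4:
            raise ValueError("truncated FASTQ record")
        header, sequence, separator, quality = chunk
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield FastqRecord(header[1:], sequence, quality)


# Illustrative reads only (hypothetical IDs and sequences).
EXAMPLE = "@read1/1\nGATTACA\n+\nIIIIHHG\n@read2/1\nACGT\n+\n!!!!\n"
records = list(read_fastq(io.StringIO(EXAMPLE)))
for record in records:
    print(record.read_id, len(record.sequence))
```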

Data Quality Control
• Data Quality Assessment

• Identify poor/bad samples

• Identify contaminants

• Trimming: remove bad bases from read

• Filtering: remove bad reads from library

Read Mapping
• Alignment algorithm must be

• fast

• able to handle SNPs, indels, and sequencing errors

• able to allow for introns when aligning to a reference genome

• Input

• fastq read library

• reference genome index

• insert size mean and stddev (for paired-end libraries)

• Output

• SAM (text) / BAM (binary) alignment files
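As a rough illustration of the insert size inputs listed above, the sketch below computes a mean and standard deviation from signed template lengths, such as the TLEN field of SAM records. The function name and the example values are hypothetical; in this tutorial the value is obtained later from the Picard insert size tool.

```python
from statistics import mean, stdev
from typing import Iterable, Tuple


def insert_size_stats(template_lengths: Iterable[int]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of absolute template lengths.

    TLEN is signed in SAM (positive for the leftmost mate, negative for the
    rightmost), so the absolute value is used; zeros (unavailable) are skipped.
    """
    sizes = [abs(t) for t in template_lengths if t != 0]
    if len(sizes) < 2:
        raise ValueError("need at least two non-zero template lengths")
    return mean(sizes), stdev(sizes)


# Hypothetical TLEN values for three read pairs (both mates listed).
example_tlen = [200, -200, 190, -190, 210, -210]
size_mean, size_sd = insert_size_stats(example_tlen)
print(size_mean, round(size_sd, 2))
```

Listing both mates of each pair counts every fragment twice, which leaves the mean unchanged; a real calculation would normally use one mate per properly paired fragment.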

Differential Expression
• Cuffdiff (Cufflinks package)

• Pairwise comparisons

• Differential gene, transcript, and primary transcript expression

• Easy to use, well documented

• Input: transcriptome, SAM/BAM read alignments

Transcriptome Assembly
• RNA-Seq

• Reference genome

• Reference transcriptome

• RNA-Seq

• Reference genome

• No reference transcriptome

• RNA-Seq

• No reference genome

• No reference transcriptome

Reference
Genome

FASTA

GFF/GTF

Experimental Design

Reference

Transcriptome

RNA

Sequencing
FASTQ

Reads
FASTQ

Data Quality Control

Tuxedo protocol

Combining tools in a pipeline
• Linux Command-line Tools

• Shell script, Makefile

• GUI Based pipeline

• DNAnexus

• Seven Bridges Genomics

• Galaxy

• Open Source

• Wrapper for command line utilities

• Workflows

• Save all steps you did in your analysis

• Rerun the entire analysis on a new dataset

• Share your workflow with other people
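For comparison with the Galaxy route, the command-line route listed above can be sketched as a small Python script that only builds the command lines. This is a sketch under stated assumptions: all file names, the index name `dm3`, and the standard deviation of 20 are hypothetical, while 198 is the value used later in this tutorial. The options shown (`-o`, `-G`, `-r`, `--mate-std-dev` for TopHat; `-o`, `-L` for Cuffdiff) are standard options of those tools.

```python
import shlex
from typing import List, Sequence


def tophat_command(index_base: str, reads_1: str, reads_2: str, gtf: str,
                   out_dir: str, inner_dist: int, inner_dist_sd: int) -> List[str]:
    """Build a TopHat command line for one paired-end sample."""
    return [
        "tophat",
        "-o", out_dir,                         # output directory
        "-G", gtf,                             # gene model annotations
        "-r", str(inner_dist),                 # mean inner distance between mates
        "--mate-std-dev", str(inner_dist_sd),  # std. dev. of inner distance
        index_base, reads_1, reads_2,
    ]


def cuffdiff_command(gtf: str, out_dir: str, labels: Sequence[str],
                     bam_groups: Sequence[Sequence[str]]) -> List[str]:
    """Build a Cuffdiff command line; replicates are comma-joined per condition."""
    return (["cuffdiff", "-o", out_dir, "-L", ",".join(labels), gtf]
            + [",".join(group) for group in bam_groups])


# Hypothetical file names; nothing is executed here, the commands are only printed.
tophat_cmd = tophat_command("dm3", "C1_R1_1.fq", "C1_R1_2.fq", "genes.gtf",
                            "C1_R1_out", 198, 20)
cuffdiff_cmd = cuffdiff_command(
    "genes.gtf", "diff_out", ["C1", "C2"],
    [["C1_R1_out/accepted_hits.bam", "C1_R2_out/accepted_hits.bam"],
     ["C2_R1_out/accepted_hits.bam", "C2_R2_out/accepted_hits.bam"]],
)
for cmd in (tophat_cmd, cuffdiff_cmd):
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Building the argument lists first and printing them makes each step easy to review before anything is run; a Galaxy workflow records the same kind of information for you.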

How to use Galaxy?
GALAXY MAIN: User disk quotas 250GB for registered users, maximum concurrent jobs: 8
Comparison criteria: no wait times, no job submission limits, no storage quotas, no data transfer bottlenecks, no IT experience required, no required infrastructure, cost

Cost by option:

• Galaxy Main: Free

• Local Galaxy: Free?

• Cloud Galaxy (Amazon): about twice the cost of KT for equivalent specifications

• SlipStream Galaxy: $19,995 (22 million KRW)

• KT GenomeCloud Galaxy: from 740 KRW per hour

Outline of tutorial
• Starting Galaxy

• Mapping with Tophat

• Workflows

• Visualizing alignment with IGV

• Computing differential expression with Cuffdiff

• Cuffdiff visualization with CummeRbund

Starting Galaxy
• Tutorial Dataset

• Accessing Galaxy

• Import files for one sample into current history

• Set file attributes

• Run FastQC

Tutorial Dataset
• FASTQ files (fastq): Sequence Reads

• Reference (fasta): Genome Sequence (Galaxy default)

• Geneset (GTF / GFF3): Reference Geneset

• Bowtie2 index: Reference index files for Bowtie2 (Galaxy default)

Tutorial Dataset
Reference & Gene sets

• Ensembl

• http://www.ensembl.org/info/data/ftp/index.html

Tutorial Dataset
Reference & Gene sets
• Illumina iGenomes

• The iGenomes are a collection of reference sequences and annotation files for commonly analyzed organisms. The files have been downloaded from Ensembl, NCBI, or UCSC, and chromosome names have been changed to be simple and consistent with their download source. Each iGenome is available as a compressed file that contains sequences and annotation files for a single genomic build of an organism.

• http://support.illumina.com/sequencing/sequencing_software/igenome.ilmn

Tutorial Dataset
Sequencing data

• Sequencing data (Drosophila melanogaster)

• Gene Expression Omnibus at accession GSE32038

• http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE32038

Biological replicates vs. technical replicates
Technical Replicates

Biological Replicates

Accessing Galaxy
• Open a web browser and navigate to the Galaxy website: usegalaxy.org or www.genome-cloud.com

• Log in with your username and password

GenomeCloud (genome-cloud.com)

• Select the Galaxy service

• When your Galaxy is ready, you will receive an e-mail

• Access your Galaxy via its public IP address

• You can register via User menu > Register

Center pane
Tools pane

History pane

Import files
• Open a web browser and navigate to the Galaxy website: usegalaxy.org or www.genome-cloud.com

• Log in with your username and password

• Example FASTQ and GTF files are located in Shared Data > RNA-Seq with Drosophila melanogaster

• Import the data into your history panel (ready for analysis)

Set file attributes
• In the history pane, click on the pencil icon

• Enter “fastqsanger” as the datatype (this may take some time)

Quality score encodings and Galaxy datatypes

• Sanger: Phred+33, fastqsanger (CASAVA 1.8 and later)

• Illumina 1.3: Phred+64, fastqillumina (before CASAVA 1.8)

• Solexa: Solexa+64, fastqsolexa

Tophat options

• --solexa-quals: Use the Solexa scale for quality values in FASTQ files

• --solexa1.3-quals: Phred64/Illumina 1.3~1.5

BWA options

• -I: The input is in the Illumina 1.3+ read format (quality equals ASCII-64)

GenomeCloud (g-Analysis)

Error probability

Quality Score Encoding

CASAVA 1.8.2 Quality Score (or Q-score)
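The relationship between a quality character, its Phred score, and the error probability can be sketched in Python. The Phred definition Q = -10 log10(P) and the ASCII offsets 33 and 64 are standard; the function names are hypothetical. Solexa scores use a different formula and are deliberately left out of this sketch.

```python
from typing import List

# ASCII offsets for the two Phred-scaled Galaxy datatypes named in this tutorial.
OFFSETS = {"fastqsanger": 33, "fastqillumina": 64}


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string into integer Phred scores."""
    scores = [ord(char) - offset for char in quality]
    if scores and min(scores) < 0:
        raise ValueError("negative score: wrong offset for this encoding?")
    return scores


def error_probability(q: float) -> float:
    """Probability that a base call is wrong, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


# "I" in Phred+33 and "h" in Phred+64 both decode to Q40.
sanger_q = phred_scores("I", OFFSETS["fastqsanger"])
illumina_q = phred_scores("h", OFFSETS["fastqillumina"])
print(sanger_q, illumina_q, error_probability(40))
```

Decoding a Phred+33 file with the Phred+64 offset typically produces negative scores, which is why the sketch raises an error in that case.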

Run FastQC
• Load the FastQC tool from the tool pane

• Set the input file (repeat this step for all paired files of C1 and C2)

Galaxy status: wait, running, done, error

When FastQC has finished running, click on the eye icon on the FastQC output file to display the report

FastQC report panels, each compared across four datasets: Illumina (in-house data), IonTorrent (in-house data), Illumina (good dataset from the FastQC homepage), and Illumina (bad dataset from the FastQC homepage)

• Per base sequence quality

• Per sequence quality score

• Per base sequence content

• Per base GC content

• Per sequence GC content

• Per base N content

• Sequence Length Distribution

• Sequence Duplication Levels

Mapping with Tophat
• Initial Tophat run

• Determine insert size

• Rerun Tophat with correct insert size

• Review mapping statistics

Initial Tophat run
• Use Full Tophat parameters

• Paired-end FASTQ files, Select reference genome, Use Own Junctions (Yes), Use Gene Annotation Model (Yes)

• Gene Model Annotations (use GFF file)

Determine insert size
• Load the insert size tool “NGS: Picard (beta) -> Insertion size metrics”

• Click the “eye” icon

• Identify the MIN_INSERT_SIZE (198)

Rerun Tophat
• Click any one of the Tophat2 output files in the history pane

• Click on the circular blue arrow icon

• Change the “Mean Inner Distance between Mate Pairs” (198)

Tophat Output
• accepted_hits.bam (BAM): a list of read alignments in BAM/SAM format

• unmapped.bam (BAM)

• junctions.bed (BED): a BED track of junctions reported by Tophat, where each junction consists of two connected BED blocks and each block is as long as the maximal overhang of any read spanning the junction

• deletions.bed (BED): the left coordinate refers to the first genomic base of the deletion

• insertions.bed (BED): the left coordinate refers to the last genomic base before the insertion
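To make the two-block junction layout concrete, the sketch below recovers the intron coordinates from one junctions.bed line. It assumes the standard 12-column BED layout (blockSizes in column 11, blockStarts in column 12); the function name and the example line are hypothetical and not taken from the tutorial data.

```python
from typing import NamedTuple, Optional


class Junction(NamedTuple):
    chrom: str
    intron_start: int  # 0-based position of the first intronic base
    intron_end: int    # 0-based, exclusive (start of the second block)
    strand: str
    score: int         # BED score column


def parse_junction(line: str) -> Optional[Junction]:
    """Parse one BED12 junction line; return None for track or blank lines."""
    if not line.strip() or line.startswith("track"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected 12 BED columns")
    chrom_start = int(fields[1])
    block_sizes = [int(x) for x in fields[10].split(",") if x]
    block_starts = [int(x) for x in fields[11].split(",") if x]
    if len(block_sizes) != 2 or len(block_starts) != 2:
        raise ValueError("a junction should have exactly two blocks")
    # The intron lies between the end of block 1 and the start of block 2.
    intron_start = chrom_start + block_sizes[0]
    intron_end = chrom_start + block_starts[1]
    return Junction(fields[0], intron_start, intron_end, fields[5], int(fields[4]))


# Hypothetical junction: blocks of 50 and 60 bases flanking a 190-base intron.
EXAMPLE_LINE = "chr2R\t100\t400\tJUNC00000001\t12\t+\t100\t400\t255,0,0\t2\t50,60\t0,240"
junction = parse_junction(EXAMPLE_LINE)
print(junction)
```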

Load files into IGV
• Click on the “accepted hits” file in the history pane

• Click on “display with IGV web current”

• A file named “igv.jnlp” will be downloaded by your browser

• Open it with a text editor and copy the BAM file location

IGV with Housekeeping gene

http://www.sabiosciences.com/rt_pcr_product/HTML/PADM-000Z.html

Load files into IGV
• Enter “Act42A” in the search box to view the aligned reads

• Right-click on the coverage track and select “Set Data Range” (set the max value to 4372)

Housekeeping gene: Act42A

Set max value

IGV with Differential Expression

Keyword: regucalcin (calcium-binding protein)

This gene has four isoforms

Load files into Trackster
• Click on the “accepted hits” file in the history pane

• Click on the graph icon and select “Trackster”

• Select BAM files

Drag into new group

Move to regucalcin gene

Create new group: “Add group”

Run cuffdiff
• Load the Cuffdiff tool: “NGS: RNA Analysis -> Cuffdiff”

• Perform replicate analysis (Yes)

• Add new Group / Add new Replicate

Cuffdiff output
• Genes: gene differential FPKM

• Isoforms: Transcript differential FPKM

• CDS: Coding sequence differential FPKM
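The FPKM unit in these outputs stands for fragments per kilobase of transcript per million mapped fragments. A minimal sketch of the textbook definition follows; the function name and example numbers are hypothetical, and Cuffdiff itself estimates abundances with a statistical model rather than with this direct formula.

```python
def fpkm(fragments: float, transcript_length_bp: float,
         total_mapped_fragments: float) -> float:
    """Fragments per kilobase of transcript per million mapped fragments.

    Equivalent to fragments / (length / 1e3) / (total / 1e6).
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total fragment count must be positive")
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical numbers: 500 fragments on a 2,000 bp transcript,
# in a library with 10 million mapped fragments.
example_fpkm = fpkm(500, 2000, 10_000_000)
print(example_fpkm)
```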

View and filter Cuffdiff output
Differential Gene Expression (DGE)

• Keep only genes with a significant change in expression and an absolute log2 fold-change greater than 1: enter “c14 == 'yes' and abs(c10) > 1” in the “With following condition” text box
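The filter condition above refers to column 14 (significant) and column 10 (log2 fold change) of the Cuffdiff gene differential expression table. A minimal Python sketch of the same test outside Galaxy is shown below; the function name and the three example rows are made up for illustration.

```python
import csv
import io
from typing import Iterable, List

# Column names of the Cuffdiff differential expression table.
HEADER = ["test_id", "gene_id", "gene", "locus", "sample_1", "sample_2",
          "status", "value_1", "value_2", "log2(fold_change)", "test_stat",
          "p_value", "q_value", "significant"]


def filter_significant(rows: Iterable[List[str]],
                       min_abs_log2fc: float = 1.0) -> List[List[str]]:
    """Keep rows where c14 == 'yes' and abs(c10) > min_abs_log2fc."""
    kept = []
    for row in rows:
        log2fc = float(row[9])    # c10 (1-based) is index 9; "inf" also parses
        significant = row[13]     # c14 (1-based) is index 13
        if significant == "yes" and abs(log2fc) > min_abs_log2fc:
            kept.append(row)
    return kept


# Hypothetical rows, not results from the tutorial dataset.
EXAMPLE_ROWS = [
    ["g1", "g1", "geneA", "2R:1-100", "C1", "C2", "OK", "10", "40", "2.0", "3.1", "0.001", "0.01", "yes"],
    ["g2", "g2", "geneB", "2R:200-300", "C1", "C2", "OK", "10", "12", "0.26", "0.4", "0.6", "0.8", "no"],
    ["g3", "g3", "geneC", "3L:1-100", "C1", "C2", "OK", "10", "15", "0.58", "2.5", "0.01", "0.04", "yes"],
]
table = "\n".join("\t".join(r) for r in [HEADER] + EXAMPLE_ROWS)

reader = csv.reader(io.StringIO(table), delimiter="\t")
next(reader)  # skip the header line
kept = filter_significant(reader)
print([row[2] for row in kept])
```

In this made-up table geneC is significant but changes by less than one log2 unit, so only geneA passes both parts of the condition.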

Cuffdiff visualization with CummeRbund
• Load the CummeRbund tool: NGS: RNA Analysis -> cummerbund

• Plot type: Density, check the “Replicates” box

Samples have similar density distributions (density plot)

Samples cluster by expression condition (MDS / PCA plot)

Samples cluster by experimental condition (Dendrogram)

Volcano

Differential analysis results for regucalcin

Expression plot shows clear differences in the expression of regucalcin across conditions C1 and C2 (four alternative isoforms)

Scatter plots highlight general similarities and specific outliers between conditions C1 and C2

Extract workflow from current history
• Click on the small gear icon and select “Extract Workflow”

Edit workflow
• Click on “Workflow” at the top of the Galaxy window

• Move the elements of the workflow

Run workflow
• Load a workflow by clicking on “Workflow” at the top of the screen

• Click on “Run”

• Select the input data

Useful galaxy sites
• Public main Galaxy site (user disk quotas 250GB for registered users, maximum concurrent jobs: 8): https://usegalaxy.org/

• Test Galaxy site (beta site for the Galaxy main instance): https://test.galaxyproject.org/

• Galaxy screencasts and tutorials: https://wiki.galaxyproject.org/Learn

• SNP analysis using Galaxy (Korean)

• NGS analysis using Galaxy (Korean)

• Bushman genome analysis using Galaxy (Korean)

• http://hongiiv.tistory.com/701

Acknowledgements:
YoungGi Kim

HanKyu Choi

WanPyo Hong

KangJung Kim

Thank you
