Galaxy RNA-Seq Analysis: Tuxedo Protocol
ChangBum Hong, KT Bioinformatics, GenomeCloud SCIC

genome-cloud.com
Published in: Health & Medicine, Technology
Transcript of "Galaxy RNA-Seq Analysis: Tuxedo Protocol"

  1. Galaxy RNA-Seq Analysis: Tuxedo Protocol
     ChangBum Hong, KT Bioinformatics, GenomeCloud SCIC
     genome-cloud.com
     This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 New Zealand License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/nz/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
  2. Introduction
     • RNA-Seq
       • Transcriptome assembly
         • Qualitative identification of expressed sequence
       • Differential expression analysis
         • Quantitative measurement of transcript expression
     • RNA-Seq Applications
       • Annotation: Identify novel genes, transcripts, exons, splicing events, ncRNAs
       • Detecting RNA editing and SNPs
       • Measurements: RNA quantification and differential gene expression
  3. Experimental design
     • What are my goals?
       • Transcriptome assembly?
       • Differential expression analysis?
       • Identify rare transcripts?
     • What are the characteristics of my system?
       • Large, complex genome?
       • Introns and high degree of alternative splicing?
       • No reference genome or transcriptome?
  4. Experimental Outputs
     • Assembly
     • Expression
     • Differentially expressed
     • Splicing
  5. Sequencing
     • Platforms
     • Library preparation
     • Multiplexing
     • Sequence reads
       • File names
       • FASTQ format (formats vary)
       • 4 lines per read
     (Figure: Illumina read ID)
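The "4 lines per read" layout mentioned on the Sequencing slide can be illustrated with a minimal sketch. It assumes the common unwrapped FASTQ layout (header starting with `@`, sequence, separator starting with `+`, quality string); the function name `read_fastq` and the example read are made up for illustration and are not part of the tutorial dataset.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per group of four lines (assumes unwrapped FASTQ)."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Made-up example read, for illustration only.
EXAMPLE = "@read1 hypothetical\nACGTAC\n+\nIIII##\n"
records = list(read_fastq(io.StringIO(EXAMPLE)))
```

Real read files are often gzip-compressed and far too large to hold in memory, which is why the sketch yields records one at a time instead of building a list inside the function.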
  6. Data Quality Control
     • Data Quality Assessment
       • Identify poor/bad samples
       • Identify contaminants
     • Trimming: remove bad bases from reads
     • Filtering: remove bad reads from the library
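The difference between trimming and filtering can be sketched as follows. This is only an illustration under stated assumptions: qualities are Phred+33 encoded, "bad" means below a threshold of Q20, and only the 3' tail is trimmed. The function names and thresholds are hypothetical and do not correspond to any particular Galaxy tool.

```python
from typing import Tuple


def trim_low_quality_tail(sequence: str, quality: str,
                          min_q: int = 20, offset: int = 33) -> Tuple[str, str]:
    """Trimming: drop bases from the 3' end while their quality is below min_q."""
    end = len(quality)
    while end > 0 and ord(quality[end - 1]) - offset < min_q:
        end -= 1
    return sequence[:end], quality[:end]


def keep_read(sequence: str, min_length: int = 20) -> bool:
    """Filtering: discard reads that are too short after trimming."""
    return len(sequence) >= min_length


# 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
trimmed = trim_low_quality_tail("ACGTAC", "IIII##")
```

Trimming shortens individual reads, while filtering removes whole reads from the library, so the two are usually applied in that order.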
  7. Read Mapping
     • Alignment algorithm must
       • be fast
       • be able to handle SNPs, indels, and sequencing errors
       • allow for introns when aligning to a reference genome
     • Input
       • FASTQ read library
       • reference genome index
       • insert size mean and standard deviation (for paired-end libraries)
     • Output
       • SAM (text) / BAM (binary) alignment files
  8. Differential Expression
     • Cuffdiff (Cufflinks package)
       • Pairwise comparisons
       • Differential gene, transcript, and primary transcript expression
       • Easy to use, well documented
       • Input: transcriptome, SAM/BAM read alignments
  9. Transcriptome Assembly
     • RNA-Seq
       • Reference genome
       • Reference transcriptome
     • RNA-Seq
       • Reference genome
       • No reference transcriptome
     • RNA-Seq
       • No reference genome
       • No reference transcriptome
  10. (Workflow diagram) Experimental Design; RNA Sequencing (FASTQ); Reads (FASTQ); Data Quality Control; Tuxedo protocol. Inputs: Reference Genome (FASTA), Reference Transcriptome (GFF/GTF)
  11. Combining tools in a pipeline
     • Linux command-line tools
       • Shell script, Makefile
     • GUI-based pipelines
       • DNAnexus
       • Seven Bridges Genomics
     • Galaxy
       • Open source
       • Wrapper for command-line utilities
     • Workflows
       • Save all the steps of your analysis
       • Rerun the entire analysis on a new dataset
       • Share your workflow with other people
  12. How to use Galaxy?
     GALAXY MAIN: user disk quota of 250 GB for registered users, maximum concurrent jobs: 8
     (Comparison table; the per-option marks did not survive extraction)
     • Criteria: no wait times, no job submission limits, no storage quotas, no data transfer bottlenecks, no IT experience required, no required infrastructure, cost
     • Cost by option
       • Galaxy Main: free
       • Local Galaxy: free?
       • Cloud Galaxy (Amazon): about twice the cost of KT's for the same specification
       • SlipStream Galaxy: $19,995 (22 million KRW)
       • KT GenomeCloud Galaxy: from 740 KRW per hour
  13. Outline of tutorial
     • Starting Galaxy
     • Mapping with Tophat
     • Workflows
     • Visualizing alignment with IGV
     • Computing differential expression with cuffdiff
     • Cuffdiff visualization with CummeRbund
  14. Starting Galaxy
     • Tutorial Dataset
     • Accessing Galaxy
     • Import files for one sample into current history
     • Set file attributes
     • Run FastQC
  15. Tutorial Dataset
     • FASTQ files (fastq): Sequence Reads
     • Reference (fasta): Genome Sequence (Galaxy default)
     • Geneset (GTF / GFF3): Reference Geneset
     • Bowtie2 index: Reference index files for Bowtie2 (Galaxy default)
  16. Tutorial Dataset: Reference & Gene sets
     • Ensembl
       • http://www.ensembl.org/info/data/ftp/index.html
  17. Tutorial Dataset: Reference & Gene sets
     • Illumina iGenomes
       • The iGenomes are a collection of reference sequences and annotation files for commonly analyzed organisms. The files have been downloaded from Ensembl, NCBI, or UCSC, and chromosome names have been changed to be simple and consistent with their download source. Each iGenome is available as a compressed file that contains sequences and annotation files for a single genomic build of an organism.
       • http://support.illumina.com/sequencing/sequencing_software/igenome.ilmn
  18. Tutorial Dataset: Sequencing data
     • Sequencing data (Drosophila melanogaster)
       • Gene Expression Omnibus at accession GSE32038
       • http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE32038
  19. Biological replicates vs. technical replicates
     (Figure panels: Technical Replicates; Biological Replicates)
  20. Accessing Galaxy
     • Open a web browser and navigate to the Galaxy website: usegalaxy.org or www.genome-cloud.com
     • Log in with username and password
     (Screenshot: GenomeCloud (genome-cloud.com), select the Galaxy service)
  21. (Screenshots)
     • When your Galaxy is ready, you will receive an e-mail
     • Access Galaxy via the public IP address
     • You can register via User menu > Register
     • Galaxy layout: Tools pane, Center pane, History pane
  22. Import files
     • Open a web browser and navigate to the Galaxy website: usegalaxy.org or www.genome-cloud.com
     • Log in with username and password
     (Screenshots)
     • Example FASTQ and GTF files are located in Shared Data > RNA-Seq with Drosophila melanogaster
     • Import the data into your history panel (ready for analysis)
  23. Set file attributes
     • In the history pane click on the pencil icon
     • Enter “fastqsanger” (this will take some time)
     Quality score encodings and the matching Galaxy datatypes:
     • Sanger: Phred+33, fastqsanger (CASAVA 1.8 and later)
     • Illumina 1.3: Phred+64, fastqillumina (before CASAVA 1.8)
     • Solexa: Solexa+64, fastqsolexa
     Tophat options
     • --solexa-quals: Use the Solexa scale for quality values in FASTQ files
     • --solexa1.3-quals: Phred64/Illumina 1.3~1.5
     BWA options
     • -I: The input is in the Illumina 1.3+ read format (quality equals ASCII-64)
     (Screenshot: GenomeCloud (g-Analysis))
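The practical difference between the Phred+33 and Phred+64 encodings is only the ASCII offset, which the following sketch makes concrete. The datatype names are the Galaxy ones listed above; the function names are hypothetical. fastqsolexa is deliberately left out because Solexa scores use a different scale, not just a different offset.

```python
from typing import List

# ASCII offsets for the two Phred-scaled encodings.
OFFSETS = {"fastqsanger": 33, "fastqillumina": 64}


def decode_phred(quality: str, datatype: str) -> List[int]:
    """Convert a FASTQ quality string into integer Phred scores."""
    offset = OFFSETS[datatype]
    return [ord(char) - offset for char in quality]


def must_be_phred33(quality: str) -> bool:
    """Characters below ';' (ASCII 59) cannot occur in +64 encodings,
    so seeing one means the string is Phred+33. The reverse does not hold:
    a string without such characters may still be Phred+33."""
    return any(ord(char) < 59 for char in quality)


sanger_scores = decode_phred("II5", "fastqsanger")
illumina_scores = decode_phred("hhT", "fastqillumina")
```

Choosing the wrong datatype shifts every score by 31, which is why the attribute has to be set before mapping.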
  24. Quality Score Encoding
     (Figure: error probability versus CASAVA 1.8.2 Quality Score (or Q-score))
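The relationship between a Q-score and the error probability is the standard Phred definition, Q = -10 · log10(P). A minimal sketch of the conversion in both directions (the function names are illustrative):

```python
import math


def phred_to_error_probability(q_score: float) -> float:
    """P = 10 ** (-Q / 10): probability that the base call is wrong."""
    return 10 ** (-q_score / 10)


def error_probability_to_phred(probability: float) -> float:
    """Q = -10 * log10(P): inverse of the conversion above."""
    return -10 * math.log10(probability)


p_q20 = phred_to_error_probability(20)  # 1 error in 100 base calls
p_q30 = phred_to_error_probability(30)  # 1 error in 1000 base calls
```

Each increase of 10 in the Q-score corresponds to a tenfold drop in the expected error rate.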
  25. Run FastQC
     • Load the FastQC tool from the tool pane
     • Set the input file (repeat this step for all C1 and C2 paired files)
  26. Galaxy status
     • Dataset states: wait, running, done, error
     • When FastQC has finished running, click on the eye icon on the FastQC output file to display it
  27. (FastQC plots compared across four datasets: Illumina in-house data, IonTorrent in-house data, Illumina good dataset from the FastQC homepage, Illumina bad dataset from the FastQC homepage)
     • Per base sequence quality
     • Per sequence quality score
     • Per base sequence content
  28. (FastQC plots for the same four datasets)
     • Per base GC content
     • Per sequence GC content
     • Per base N content
  29. (FastQC plots for the same four datasets)
     • Sequence Length Distribution
     • Sequence Duplication Levels
  30. Mapping with Tophat
     • Initial Tophat run
     • Determine insert size
     • Rerun Tophat with correct insert size
     • Review mapping statistics
  31. Initial Tophat run
     • Use Full Tophat parameters
     • Paired-end FASTQ files, Select reference genome, Use Own Junctions (Yes), Use Gene Annotation Model (Yes)
     • Gene Model Annotations (use GFF file)
  32. Determine insert size
     • Load the insert size tool “NGS: Picard (beta) -> Insertion size metrics”
  33. Determine insert size
     • Click the “eye” icon
     • Identify the MIN_INSERT_SIZE (198)
  34. Rerun Tophat
     • Click any one of the Tophat2 output files in the history pane
     • Click on the circular blue arrow icon
     • Change the “Mean Inner Distance between Mate Pairs” (198)
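For readers running TopHat outside Galaxy, the "Mean Inner Distance between Mate Pairs" field corresponds to the `-r/--mate-inner-dist` option, which describes the gap between the two mates and not the whole fragment. The sketch below assumes that the measured size is a full fragment length that includes both reads, so the read lengths are subtracted; whether that applies to the value used in the slides depends on the metric and library. All file names and the Bowtie2 index name are hypothetical placeholders.

```python
from typing import List


def mate_inner_distance(mean_fragment_size: int, read_length: int) -> int:
    """Inner distance = fragment length minus both read lengths
    (assumes the fragment size includes the two reads)."""
    return mean_fragment_size - 2 * read_length


def tophat2_command(inner_distance: int, std_dev: int, gtf: str, out_dir: str,
                    index: str, reads_1: str, reads_2: str) -> List[str]:
    """Build a paired-end tophat2 argument list (paths are placeholders)."""
    return [
        "tophat2",
        "-r", str(inner_distance),      # mean inner distance between mates
        "--mate-std-dev", str(std_dev),
        "-G", gtf,                      # gene model annotations
        "-o", out_dir,
        index, reads_1, reads_2,
    ]


# Hypothetical library: 300 bp fragments sequenced with 2 x 50 bp reads.
distance = mate_inner_distance(300, 50)
command = tophat2_command(distance, 20, "genes.gtf", "C1_R1_tophat",
                          "genome_index", "C1_R1_1.fq", "C1_R1_2.fq")
```

The list can be passed to `subprocess.run` on a machine where TopHat is installed; it is only constructed here, not executed.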
  35. Tophat Output
     • accepted_hits.bam (BAM): a list of read alignments in BAM/SAM format
     • unmapped.bam (BAM)
     • junctions.bed (BED): a BED track of junctions reported by Tophat, where each junction consists of two connected BED blocks and each block is as long as the maximum overhang of any read spanning the junction
     • insertions.bed (BED): reports the last genomic base before the insertion
     • deletions.bed (BED): reports the first genomic base of the deletion
  36. Load files into IGV
     • Click on the “accepted hits” file in the history pane
     • Click on “display with IGV web current”
     • A file named “igv.jnlp” will be downloaded by your browser
     • Open it with a text editor and copy the BAM file location
  37. IGV with Housekeeping gene
     http://www.sabiosciences.com/rt_pcr_product/HTML/PADM-000Z.html
  38. Load files into IGV
     • Enter “Act42A” in the search box to view the aligned reads
     • Right-click on the coverage track and select “Set Data Range” (max value to 4372)
     (Screenshot: housekeeping gene Act42A, set max value)
  39. IGV with Differential Expression
  40. Keyword: regucalcin (calcium-binding protein)
     This gene has four isoforms
  41. Load files into Trackster
     • Click on the “accepted hits” file in the history pane
     • Click on the graph icon and select “Trackster”
     • Select BAM files
  42. (Trackster screenshots)
     • Create a new group with “Add group”
     • Drag tracks into the new group
     • Move to the regucalcin gene
  43. (Trackster screenshot) Set max value
  44. Run cuffdiff
     • Load the Cuffdiff tool: “NGS:RNA Analysis->Cuffdiff”
     • Perform replicate analysis (Yes)
     • Add new Group / Add new Replicate
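The groups and replicates chosen in the Galaxy form map onto the way Cuffdiff takes its input on the command line: replicates of one condition are joined with commas, and conditions are separated by spaces. The sketch below only builds such an argument list. The condition labels C1 and C2 come from the tutorial; the BAM and GTF file names are hypothetical placeholders.

```python
from typing import Dict, List


def cuffdiff_command(gtf: str, out_dir: str,
                     replicates: Dict[str, List[str]]) -> List[str]:
    """Build a cuffdiff argument list from {condition: [BAM files]}."""
    labels = list(replicates)
    command = ["cuffdiff", "-o", out_dir, "-L", ",".join(labels), gtf]
    for label in labels:
        # One positional argument per condition, replicates comma-separated.
        command.append(",".join(replicates[label]))
    return command


command = cuffdiff_command(
    "genes.gtf",
    "cuffdiff_out",
    {
        "C1": ["C1_R1.bam", "C1_R2.bam", "C1_R3.bam"],
        "C2": ["C2_R1.bam", "C2_R2.bam", "C2_R3.bam"],
    },
)
```

Keeping the replicates grouped per condition is what lets Cuffdiff estimate within-condition variability.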
  45. Cuffdiff output
     • Genes: gene differential FPKM
     • Isoforms: transcript differential FPKM
     • CDS: coding sequence differential FPKM
  46. View and filter cuffdiff output
     • Differential Gene Expression (DGE)
     • Keep only genes with a significant change in expression and an absolute log2 fold-change greater than 1: enter “c14 == 'yes' and abs(c10)>1” in the “With following condition” text box
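The same filter can be reproduced outside Galaxy. The sketch assumes the usual column layout of Cuffdiff's gene differential expression file, where column 10 is `log2(fold_change)` and column 14 is `significant`; the three rows are made-up values for illustration, not results from the tutorial dataset.

```python
import csv
import io
from typing import List

HEADER = ["test_id", "gene_id", "gene", "locus", "sample_1", "sample_2",
          "status", "value_1", "value_2", "log2(fold_change)", "test_stat",
          "p_value", "q_value", "significant"]

# Made-up rows, for illustration only.
ROWS = [
    ["g1", "g1", "geneA", "2L:1-100", "C1", "C2", "OK", "10", "45", "2.17", "5.1", "0.0001", "0.001", "yes"],
    ["g2", "g2", "geneB", "2L:200-300", "C1", "C2", "OK", "10", "12", "0.26", "0.4", "0.6", "0.8", "no"],
    ["g3", "g3", "geneC", "2R:1-100", "C1", "C2", "OK", "10", "15", "0.58", "3.0", "0.001", "0.01", "yes"],
]
EXAMPLE_TSV = "\n".join("\t".join(fields) for fields in [HEADER] + ROWS) + "\n"


def significant_genes(handle, min_abs_log2_fc: float = 1.0) -> List[str]:
    """Equivalent of the Galaxy filter c14 == 'yes' and abs(c10) > 1."""
    selected = []
    for row in csv.DictReader(handle, delimiter="\t"):
        log2_fc = float(row["log2(fold_change)"])  # float() also parses "inf"
        if row["significant"] == "yes" and abs(log2_fc) > min_abs_log2_fc:
            selected.append(row["gene"])
    return selected


hits = significant_genes(io.StringIO(EXAMPLE_TSV))
```

In this made-up table only geneA passes: geneB is not significant, and geneC is significant but changes by less than twofold.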
  47. Cuffdiff visualization with CummeRbund
     • Load the CummeRbund tool: NGS:RNA Analysis->cummerbund
     • Plot type: Density, check the “Replicates” box
  48. (CummeRbund plots)
     • Samples have a similar density distribution (density plot)
     • Samples cluster by expression condition (MDS / PCA plot)
     • Samples cluster by experimental condition (dendrogram)
  49. (CummeRbund plots)
     • Volcano plot
     • Differential analysis results for regucalcin
     • Expression plot shows clear differences in the expression of regucalcin across conditions C1 and C2 (four alternative isoforms)
     • Scatter plots highlight general similarities and specific outliers between conditions C1 and C2
  50. Extract workflow from current history
     • Click on the small gear icon and select “Extract Workflow”
  51. Edit workflow
     • Click on “Workflow” at the top of the Galaxy window
     • Move the elements of the workflow
  52. Run workflow
     • Load a workflow by clicking on “Workflow” at the top of the screen
     • Click on “Run”
     • Select the input datasets
  53. Useful Galaxy sites
     • Public main Galaxy site (user disk quota of 250 GB for registered users, maximum concurrent jobs: 8)
       • https://usegalaxy.org/
     • Test Galaxy site (beta site for the Galaxy main instance)
       • https://test.galaxyproject.org/
     • Galaxy screencasts and tutorials
       • https://wiki.galaxyproject.org/Learn
     • Blog posts in Korean: SNP analysis with Galaxy; NGS analysis with Galaxy; Bushman genome analysis with Galaxy
       • http://hongiiv.tistory.com/701
       • http://hongiiv.tistory.com/652
       • http://hongiiv.tistory.com/655
  54. Acknowledgements: YoungGi Kim, HanKyu Choi, WanPyo Hong, KangJung Kim
     Thank you