Galaxy RNA-Seq Analysis: Tuxedo Protocol
ChangBum Hong, KT Bioinformatics, GenomeCloud SCIC
genome-cloud.com
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 New Zealand License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/nz/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
3. Experimental design
• What are my goals?
• Transcriptome assembly?
• Differential expression analysis?
• Identify rare transcripts?
• What are the characteristics of my system?
• Large, complex genome?
• Introns and high degree of alternative splicing?
• No reference genome or transcriptome?
6. Data Quality Control
• Data Quality Assessment
• Identify poor/bad samples
• Identify contaminants
• Trimming: remove bad bases from reads
• Filtering: remove bad reads from the library
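The difference between trimming and filtering can be sketched in a few lines of Python. This is only an illustration and not the method of any particular QC tool: the function names, the Phred+33 offset, and the thresholds (`min_q=20`, `min_len=4`) are assumptions chosen for the example.

```python
def trim_read(seq, qual, min_q=20, offset=33):
    """Trim bases from the 3' end while their Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def filter_reads(reads, min_len=4, min_q=20):
    """Trim every (seq, qual) pair, then drop reads shorter than min_len."""
    kept = []
    for seq, qual in reads:
        t_seq, t_qual = trim_read(seq, qual, min_q)
        if len(t_seq) >= min_len:
            kept.append((t_seq, t_qual))
    return kept


# Toy reads (invented): 'I' is Phred 40 and '#' is Phred 2 under Phred+33.
reads = [
    ("ACGTACGT", "IIIIII##"),  # two bad bases at the 3' end: trimmed
    ("ACGTACGT", "########"),  # bad throughout: removed from the library
]
print(filter_reads(reads))
```

Trimming shortens a read but keeps it; filtering discards the whole read once too little of it survives.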
7. Read Mapping
• Alignment algorithm must be
• fast
• able to handle SNPs, indels, and sequencing errors
• able to allow for introns when aligning to a reference genome
• Input
• fastq read library
• reference genome index
• insert size mean and standard deviation (for paired-end libraries)
• Output
• SAM (text) / BAM (binary) alignment files
8. Differential Expression
• Cuffdiff (Cufflinks package)
• Pairwise comparisons
• Differential gene, transcript, and primary transcript expression
• Easy to use, well documented
• Input: transcriptome, SAM/BAM read alignments
9. Transcriptome Assembly
• RNA-Seq with a reference genome and a reference transcriptome
• RNA-Seq with a reference genome but no reference transcriptome
• RNA-Seq with neither a reference genome nor a reference transcriptome
11. Combining tools in a pipeline
• Linux Command-line Tools
• Shell script, Makefile
• GUI Based pipeline
• DNAnexus
• Seven Bridges Genomics
• Galaxy
• Open Source
• Wrapper for command-line utilities
• Workflows
• Save all steps you did in your analysis
• Rerun the entire analysis on a new dataset
• Share your workflow with other people
12. How to use Galaxy?
GALAXY MAIN: User disk quotas 250GB for registered users, maximum concurrent jobs: 8
Criteria for comparing Galaxy options: no wait times, no job submission limits, no storage quotas, no data transfer bottlenecks, no IT experience required, no required infrastructure, cost
Cost by option:
• Galaxy Main: free
• Local Galaxy: free (?)
• Cloud Galaxy (Amazon): about twice KT's price for the same specification
• SlipStream Galaxy: $19,995 (about 22 million KRW)
• KT GenomeCloud Galaxy: from 740 KRW per hour
13. Outline of tutorial
• Starting Galaxy
• Mapping with Tophat
• Workflows
• Visualizing alignment with IGV
• Computing differential expression with cuffdiff
• Cuffdiff visualization with CummeRbund
15. Starting Galaxy
• Tutorial Dataset
• Accessing Galaxy
• Import files for one sample into current history
• Set file attributes
• Run FastQC
18. Tutorial Dataset
Reference & Gene sets
• Illumina iGenomes
• The iGenomes are a collection of reference sequences and annotation files for commonly analyzed organisms. The files have been downloaded from Ensembl, NCBI, or UCSC, and chromosome names have been changed to be simple and consistent with their download source. Each iGenome is available as a compressed file that contains sequences and annotation files for a single genomic build of an organism.
• http://support.illumina.com/sequencing/sequencing_software/igenome.ilmn
19. Tutorial Dataset
Sequencing data
• Sequencing data (Drosophila melanogaster)
• Gene Expression Omnibus at accession GSE32038
• http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE32038
21. Accessing Galaxy
• Open a web browser and navigate to the Galaxy website usegalaxy.org or www.genome-cloud.com
• Log in with username and password
• Select the Galaxy service: GenomeCloud (genome-cloud.com)
22. When your Galaxy is ready
• You will receive an e-mail when your Galaxy instance is ready
• Access Galaxy via its public IP address
• You can register via User menu > Register
• Interface layout: Tools pane, Center pane, History pane
23. Import files
• Open a web browser and navigate to the Galaxy website usegalaxy.org or www.genome-cloud.com
• Log in with username and password
• Example FASTQ and GTF files are located in Shared Data > RNA-Seq with Drosophila melanogaster
• Import the data into your history pane (ready for analysis)
24. Set file attributes
• In the history pane, click on the pencil icon
• Enter “fastqsanger” as the datatype (this will take some time)
FASTQ quality encodings and Galaxy datatypes
Format          Quality encoding   Galaxy datatype
Sanger          Phred+33           fastqsanger (CASAVA 1.8 and later)
Illumina 1.3    Phred+64           fastqillumina (before CASAVA 1.8)
Solexa          Solexa+64          fastqsolexa
Tophat options
• --solexa-quals: Use the Solexa scale for quality values in FASTQ files
• --solexa1.3-quals: Phred64/Illumina 1.3~1.5
BWA options (bwa aln)
• -I: The input is in the Illumina 1.3+ read format (quality equals ASCII-64)
GenomeCloud (g-Analysis)
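If it is unclear which datatype to enter, the range of quality characters in the file gives a hint. The sketch below is a heuristic under stated assumptions, not part of Galaxy or FastQC: the function name is hypothetical, and the cut-offs follow from the encodings above (Phred+33 starts at ASCII 33, Solexa+64 at ASCII 59, Phred+64 at ASCII 64). A file whose lowest character is 64 or above is ambiguous, since a very high-quality Sanger file can look the same, so treat that answer as a guess.

```python
def guess_fastq_encoding(quality_strings):
    """Guess a Galaxy FASTQ datatype from the lowest quality character seen."""
    codes = [ord(ch) for q in quality_strings for ch in q]
    if not codes:
        raise ValueError("no quality characters supplied")
    lowest = min(codes)
    if lowest < 59:
        # Only Phred+33 can produce characters below ASCII 59.
        return "fastqsanger"
    if lowest < 64:
        # ASCII 59-63 correspond to negative Solexa scores.
        return "fastqsolexa"
    # Ambiguous case: assumed to be Phred+64 (Illumina 1.3-1.5).
    return "fastqillumina"


# Toy quality strings (invented for illustration).
print(guess_fastq_encoding(["IIII##"]))
print(guess_fastq_encoding(["hhhh;;"]))
print(guess_fastq_encoding(["hhhhBB"]))
```

Feeding it the quality lines of the first few thousand reads is usually enough to see a low-scoring base if one exists.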
28-30. FastQC report figures
Each figure compares four datasets: Illumina (in-house data), IonTorrent (in-house data), Illumina (good dataset from the FastQC homepage), and Illumina (bad dataset from the FastQC homepage).
FastQC modules shown:
• Per base sequence quality
• Per sequence quality score
• Per base sequence content
• Per base GC content
• Per sequence GC content
• Per base N content
• Sequence Length Distribution
• Sequence Duplication Levels
32. Mapping with Tophat
• Initial Tophat run
• Determine insert size
• Rerun Tophat with correct insert size
• Review mapping statistics
33. Initial Tophat run
• Use full Tophat parameters
• Paired-end FASTQ files, select reference genome, Use Own Junctions (Yes), Use Gene Annotation Model (Yes)
• Gene Model Annotations (use GFF file)
36. Rerun Tophat
• Click any one of the Tophat2 output files in the history pane
• Click on the circular blue arrow icon
• Change the “Mean Inner Distance between Mate Pairs” (198)
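The mean inner distance is not the same as the fragment length: it is the gap between the two mates, so the length of both reads has to be subtracted. The sketch below shows the arithmetic and the equivalent command-line call. The fragment size (348), read length (75), standard deviation (20), and all file names are hypothetical values chosen only so that the result matches the 198 shown above; the function names are also illustrative. The flags `-r`, `--mate-std-dev`, `-G`, and `-o` are Tophat command-line options.

```python
def mean_inner_distance(mean_fragment_size, read_length):
    """Inner distance between mates = fragment size minus both read lengths."""
    return mean_fragment_size - 2 * read_length


def tophat_command(index, reads_1, reads_2, inner_dist, std_dev, gtf, out_dir):
    """Build (but do not run) a Tophat call for one paired-end sample."""
    return [
        "tophat",
        "-r", str(inner_dist),            # mean inner distance between mates
        "--mate-std-dev", str(std_dev),   # std. dev. of the inner distance
        "-G", gtf,                        # gene model annotations
        "-o", out_dir,                    # output directory
        index, reads_1, reads_2,
    ]


# Hypothetical numbers: 348 bp fragments sequenced as 2 x 75 bp reads.
inner = mean_inner_distance(348, 75)
cmd = tophat_command("genome_index", "C1_R1_1.fq", "C1_R1_2.fq",
                     inner, 20, "genes.gtf", "C1_R1_tophat_out")
print(inner)
print(" ".join(cmd))
```

In Galaxy the same two numbers go into the “Mean Inner Distance between Mate Pairs” field and its standard deviation field when the job is rerun.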
37. Tophat Output
• accepted_hits.bam (BAM): a list of read alignments in BAM/SAM format
• unmapped.bam (BAM)
• junctions.bed (BED): a BED track of junctions reported by Tophat, where each junction consists of two connected BED blocks and each block is as long as the maximal overhang of any read spanning the junction
• insertions.bed (BED): records the last genomic base before the insertion
• deletions.bed (BED): records the first genomic base of the deletion
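Because each junctions.bed record spans both overhang blocks, the intron itself has to be derived from the block sizes. The following sketch assumes the standard 12-column BED layout (block sizes in column 11) and that the score column holds the number of alignments spanning the junction; the function name and the example line are invented for illustration.

```python
def intron_from_junction(bed_line):
    """Return intron coordinates (0-based, end-exclusive) from a BED12 junction."""
    fields = bed_line.rstrip("\n").split("\t")
    chrom_start, chrom_end = int(fields[1]), int(fields[2])
    # Column 11 lists the two overhang block sizes, e.g. "50,40".
    block_sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
    return {
        "chrom": fields[0],
        "intron_start": chrom_start + block_sizes[0],  # after left overhang
        "intron_end": chrom_end - block_sizes[1],      # before right overhang
        "strand": fields[5],
        "spanning_reads": int(fields[4]),
    }


# Invented example record, tab-separated as in a real BED file.
line = "\t".join([
    "2R", "1000", "2000", "JUNC00000001", "12", "+",
    "1000", "2000", "255,0,0", "2", "50,40", "0,960",
])
print(intron_from_junction(line))
```

For this record the left block covers 1000-1050 and the right block 1960-2000, so the intron is the 910 bases in between.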
38. Load files into IGV
• Click on the “accepted hits” file in the history pane
• Click on “display with IGV web current”
• A file named “igv.jnlp” will be downloaded by your browser
• Open it with a text editor and copy the BAM file location
39. IGV with Housekeeping gene
http://www.sabiosciences.com/rt_pcr_product/HTML/PADM-000Z.html
40. Load files into IGV
• Enter “Act42A” in the search box to view the reads aligning to the gene
• Right-click on the coverage track and select “Set Data Range” (set the max value to 4372)
Housekeeping gene: Act42A
48. View and filter cuffdiff output
• Differential Gene Expression (DGE)
• Keep genes with a significant change in expression and an absolute log2 fold-change greater than 1: enter “c14=='yes' and abs(c10)>1” in the “With following condition” text box
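Outside Galaxy, the same filter can be applied to the cuffdiff gene-level table with a short script. This sketch assumes the usual 14-column layout in which column 10 is log2(fold_change) and column 14 is the significant flag, matching c10 and c14 in the Galaxy condition; the function name and the three data rows are invented for illustration.

```python
import csv
import io


def significant_genes(handle, min_abs_log2fc=1.0):
    """Keep rows where significant == 'yes' and |log2 fold change| > threshold."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header line
    hits = []
    for row in reader:
        try:
            log2fc = float(row[9])   # column 10: log2(fold_change)
        except ValueError:
            continue                 # skip rows with an unparsable value
        if row[13] == "yes" and abs(log2fc) > min_abs_log2fc:
            hits.append(row)
    return hits


# Invented rows in the 14-column layout (tab-separated).
header = ["test_id", "gene_id", "gene", "locus", "sample_1", "sample_2",
          "status", "value_1", "value_2", "log2(fold_change)",
          "test_stat", "p_value", "q_value", "significant"]
rows = [
    ["g1", "g1", "geneA", "2R:1-100", "C1", "C2", "OK", "10", "56.6",
     "2.5", "4.1", "0.0001", "0.001", "yes"],
    ["g2", "g2", "geneB", "2R:200-300", "C1", "C2", "OK", "10", "13.2",
     "0.4", "2.9", "0.004", "0.03", "yes"],
    ["g3", "g3", "geneC", "2R:400-500", "C1", "C2", "OK", "80", "10",
     "-3.0", "-1.2", "0.2", "0.6", "no"],
]
table = "\n".join("\t".join(r) for r in [header] + rows) + "\n"
for hit in significant_genes(io.StringIO(table)):
    print(hit[0], hit[9])
```

Only the first row passes: the second is significant but changes too little, and the third changes strongly but is not flagged significant.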
51. Samples have similar density distributions (density plot)
Samples cluster by expression condition (MDS / PCA plot)
Samples cluster by experimental condition (dendrogram)
52. Volcano plot
Differential analysis results for regucalcin
Expression plot shows clear differences in the expression of regucalcin across conditions C1 and C2 (four alternative isoforms)
Scatter plots highlight general similarities and specific outliers between conditions C1 and C2
54. Edit workflow
• Click on “Workflow” at the top of the Galaxy window
• Move the elements of the workflow
55. Run workflow
• Load a workflow by clicking on “Workflow” at the top of the screen
• Click on “Run”
• Select the input datasets
57. Useful galaxy sites
• Public main Galaxy site (user disk quotas 250GB for registered users, maximum concurrent jobs: 8): https://usegalaxy.org/
• Test Galaxy site (beta site for the Galaxy main instance): https://test.galaxyproject.org/
• Galaxy screencasts and tutorials: https://wiki.galaxyproject.org/Learn
• Korean-language blog posts on SNP analysis with Galaxy, NGS analysis with Galaxy, and Bushman genome analysis with Galaxy: http://hongiiv.tistory.com/701, http://hongiiv.tistory.com/652, http://hongiiv.tistory.com/655