Dgaston dec-06-2012
Intro primer on Bioinformatics and Gene Expression analysis in RNA-Seq using the Tuxedo pipeline

Published in Technology
  • 1. Bioinformatics: Intro to RNA-Seq Analysis Integrated Learning Session Daniel Gaston, PhD Dr. Karen Bedard Lab, Department of Pathology December 6th, 2012
  • 2. Overview
      • Introduction
          • Considerations for RNA-Seq
          • Computational Resources/Options
      • Analysis of RNA-Seq Data
          • Principle of analyzing RNA-Seq
          • General RNA-Seq analysis pipeline
          • “Tuxedo” pipeline
          • Alternative tools
      • Resources
  • 3. Before You Start: Considerations for RNA-Seq Analysis
      • Next-Generation Sequencing experiments generate a lot of raw data
          • 25-40 GB per sample per replicate for most transcriptomes, tissue types, cell lines, and conditions
      • Analysing the data requires more computational resources than many labs routinely have available
          • Several processing “cores” (8 minimum)
          • Large amount of RAM (16 GB+)
          • Large amount of disk storage space for intermediate and final results files, in addition to the raw FastQ files
          • Analysis can take a significant amount of time per sample (days to a week)
  • 4. Computational Options
      • Local (large workstation or cluster)
      • Remote computer/cluster (ComputeCanada/ACENet)
      • Cloud services
          • Amazon Web Services
      • Cloud/local bioinformatics “portals”
          • Galaxy
          • Chipster
          • GenomeSpace
          • CloudBioLinux
          • CloudMan
          • BioCloudCentral (interface to CloudMan, CloudBioLinux, etc.)
  • 5. RNA-Seq Analysis Workflow
  • 6. So I Ran an RNA-Seq Experiment. Now What?
      • Need to go from raw “read” data to gene expression data
      • We now have:
          • De-multiplexed FastQ files for each individual sample and replicate
      • We want lists of:
          • Differentially expressed genes/transcripts
          • Potentially novel genes/transcripts
          • Potentially novel splice junctions
          • Potential fusion events
      • Organize your data, programs, and additional resources (discussed later)
  • 7. What is the Raw Data?
      • A single lane of Illumina HiSeq 2000 sequencing produces ~250-300 million sequencing “reads”
      • Can be paired-end or single-end sequencing (paired-end preferred)
      • Various read lengths (number of sequencing cycles)
          • 2x50bp, 2x75bp, 2x100bp, 2x150bp most common
          • Trade-off between cost and amount of usable data
      • The true raw data is image data with colour intensities, which is then converted into a text format called FastQ (bases A, C, G, T plus quality scores)
  • 8. FastQ
      • Example record:
          @M00814:1:000000000-A2472:1:1101:14526:1866 1:N:0:1
          TGGAACATGCGTGCGNAGCCGAAAGTGTGTCCCCACTTTCATATGAAGAAAGAC
          +
          ?????BBBBB9?+<+#,,6C>CAEEHHHFFHHHHFEHHHHHHHHHGHHHHHHHHHH
      • FASTA-like text format: each record has a header line, a sequence line, a “+” separator line, and a line of quality scores with one score for every base in the sequenced read
      • In paired-end sequencing there is one file for each “end” of sequencing (Primer 1 and Primer 2)
      • Quality scores are encoded with a single character representing a number. The most common encoding scheme is called Phred33. Old Illumina software (Illumina 1.3 – 1.7) used Phred64, but the current generation does not.
          • The encoding often needs to be set explicitly in alignment programs
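To make the quality encoding concrete, here is a minimal sketch that decodes a FastQ quality string into numeric Phred scores. The function names `decode_qualities` and `error_probability` and the `offset` parameter are illustrative and do not come from any of the tools in this deck. An offset of 33 corresponds to Phred33, and 64 would be used for older Illumina 1.3 – 1.7 data.

```python
def decode_qualities(quality_line, offset=33):
    """Convert a FastQ quality string into a list of Phred scores."""
    scores = [ord(ch) - offset for ch in quality_line]
    # A negative score usually means the wrong offset was chosen for the file.
    if scores and min(scores) < 0:
        raise ValueError("negative score: wrong offset for this file?")
    return scores


def error_probability(phred):
    """Probability that a base call is wrong, given its Phred score."""
    return 10 ** (-phred / 10)


# First 16 quality characters of the example record on the slide.
print(decode_qualities("?????BBBBB9?+<+#"))
```

With the Phred33 offset, `?` decodes to 30 (roughly a 1 in 1000 chance of error) and `#` decodes to 2. In the example record the `#` lines up with the `N` base call.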
  • 9. General Analysis Pipeline
      • Short-Read Alignment
      • Transcript Reconstruction
      • Abundance/Expression
      • Visualization / Statistics
  • 10. What is Short-Read Alignment? (figure labels: Paired-End Reads; Section of Reference Chromosome)
  • 11. What’s Special About RNA-Seq?
      • Normally the distance between paired reads and the size of insertions are both constrained
      • With RNA-Seq the source is mRNA, not genomic DNA
      • Mapping is to a reference genome, not a transcriptome
      • Need to account for introns: pairs can be much further apart than expected
  • 12. Transcript Reconstruction: Intron/Exon Junctions (figure labels: Exon 1, Exon 2, Exon 3)
  • 13. Transcript Reconstruction: Alternative Splicing (figure labels: Exon 1, Exon 2, Exon 3)
  • 14. Transcript Reconstruction: Novel Exon/Transcript Identification (figure labels: Exon 1, Exon 2, Exon X, Exon 3)
  • 15. Transcript Reconstruction: Fusion Transcripts (figure labels: Exon 1, Exon 2, Exon 3, Gene 2 Exon 4)
  • 16. Transcript Reconstruction: Differential Expression (figure labels: Sample 1, Sample 2)
  • 17. What else can we look for?
      • Combine with ChIP-Seq to differentiate various levels of regulation
      • Integrative analyses to identify common elements (micro-RNA, transcription factors, molecular pathways, protein-DNA interactions)
      • Combine with whole-exome or whole-genome sequencing
          • Allele-specific expression
          • Allelic imbalance
          • LOH
          • Large genomic rearrangements/abnormalities
  • 18. Caution
      • Need to differentiate between real data and artifacts
      • Differentiate between biologically meaningful data and “noise”
      • Sample selection, experimental design, biological replication (not technical replication), and robust statistical methods are important
      • Looking at your data “by eye” is useful, but needs to be backed up by statistics
      • Avoid experimenter bias
      • Try to be holistic in your analyses
  • 19. Visualizing with IGV
  • 20. “Tuxedo” Analysis Pipeline
      • Bowtie
      • TopHat
      • Cufflinks package: Cufflinks, Cuffcompare, Cuffmerge, Cuffdiff
      • CummeRbund
  • 21. What you need before you begin
      • The individual programs
      • Reference genome (hg19/GRCh37)
          • FASTA file of the whole genome, where each chromosome is a sequence entry
      • Bowtie2 index files for the reference genome
          • Index files are compressed representations of the genome that allow reads to be aligned to the reference efficiently and in parallel
      • Gene/transcript annotation reference (UCSC, Ensembl, ENCODE, etc.)
          • Gives the location of genes and of important features such as introns, exons, and splice junctions
  • 22. Step 0: Bowtie
      • Bowtie forms the core of TopHat for short-read alignment
      • Initial mapping of a subset of reads (~5 million) to a reference transcriptome to estimate the inner-distance mean/median and standard deviation for TopHat
      • This information can be retrieved from the library prep stage, but it is better to estimate it from your final data
      • Sample command-line (Bowtie2 syntax; -x takes the base name of the index files, not the FASTA file):
          bowtie2 -x /path/to/transcriptome_ref -q --phred33 --local -p 8 -1 read1.fastq -2 read2.fastq -S output.sam
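One way to turn the SAM output of this step into the two numbers TopHat needs is sketched below. This is not part of the Tuxedo tools. The function name `inner_distance_stats` is hypothetical, and the sketch assumes both mates have the same read length and that the template length (TLEN, column 9 of a SAM record) approximates the fragment size.

```python
import statistics


def inner_distance_stats(sam_lines):
    """Return (mean, standard deviation) of inner distances from SAM records."""
    distances = []
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, tlen, seq = int(fields[1]), int(fields[8]), fields[9]
        # Keep properly paired records (flag bit 0x2) and count each pair once
        # by using only the mate that reports a positive template length.
        if flag & 0x2 and tlen > 0:
            # Inner distance = fragment size - (2 x read length)
            distances.append(tlen - 2 * len(seq))
    # statistics.stdev needs at least two pairs.
    return statistics.mean(distances), statistics.stdev(distances)
```

A typical use would be `with open("output.sam") as handle: mean, sd = inner_distance_stats(handle)`. The rounded values can then be passed to TopHat as the inner distance and its standard deviation.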
  • 23. Step 1: TopHat
      • TopHat is a short-read mapper capable of aligning reads to a reference genome and finding exon-exon junctions
      • Can be provided with a list of known junctions, do de novo junction discovery, or both
      • Also has an option to find potential fusion-gene transcripts
      • Sample command-line:
          tophat -p 8 -G gene_annotations.gtf -r inner_distance --mate-std-dev std_dev -o Output.dir /path/to/bowtie2indexes/genome read1.fastq read2.fastq
  • 24. About TopHat Options
      • -o: The path/name of a directory in which to place all of the TopHat output files
      • -G: The path to and name of an annotation file, so TopHat can be aware of known junctions
      • Reference genome: Given as path and “base name.” If the reference genome is saved as /genomes/hg19/genome.fa, then the relevant path and base name would be /genomes/hg19/genome
      • Inner Distance = Fragment size – (2 x read length)
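The two derived values on this slide can be worked through in a few lines. This is only an illustration: the helper names `index_basename` and `inner_distance` are made up, and the fragment size and read length in the example call are hypothetical numbers rather than values from the talk.

```python
import os


def index_basename(fasta_path):
    """Strip the file extension to get the path-plus-base-name form."""
    base, _extension = os.path.splitext(fasta_path)
    return base


def inner_distance(fragment_size, read_length):
    """Inner Distance = Fragment size - (2 x read length)."""
    return fragment_size - 2 * read_length


print(index_basename("/genomes/hg19/genome.fa"))
# Hypothetical library: 300 bp fragments sequenced as 2x100bp.
print(inner_distance(300, 100))
```

The inner distance is negative when the two reads of a pair overlap, for example 150 bp fragments sequenced as 2x100bp.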
  • 25. TopHat: Additional options
      • --no-mixed
      • --b2-very-sensitive
      • --fusion-search
      • Running with the above options on 6 processing cores took ~26 hours for one sample
  • 26. Step 2: Cufflinks
      • Cufflinks performs gene and transcript discovery
      • Many possible options
          • No novel discovery, use only a reference group of transcripts
          • de novo mode (shown below, beginner's default)
          • Mixed reference-guided assembly and de novo discovery
          • Options for more robust normalization methods and error correction
      • Sample command-line:
          cufflinks -p 8 -o Cufflinks.out/ accepted_hits.bam
  • 27. Step 3: Cuffmerge
      • Merges sample assemblies, estimates abundances, and cleans up the transcriptome
      • Sample command-line:
          cuffmerge -g gene_annotations.gtf -s /path/to/genome.fa -p 8 text_list_of_assemblies.txt
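The text list of assemblies is a plain text file with one assembly GTF path per line. Below is a small sketch of how it could be generated, assuming each sample was run through Cufflinks with its own output directory and that each directory holds the `transcripts.gtf` file Cufflinks writes. The function name `write_assembly_list` and the directory names in the usage note are hypothetical.

```python
from pathlib import Path


def write_assembly_list(cufflinks_dirs, list_path):
    """Write one transcripts.gtf path per line and return the paths."""
    paths = [str(Path(d) / "transcripts.gtf") for d in cufflinks_dirs]
    Path(list_path).write_text("\n".join(paths) + "\n")
    return paths
```

For example, `write_assembly_list(["Cond1.cufflinks", "Cond2.cufflinks"], "text_list_of_assemblies.txt")` would produce the file named in the sample command.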
  • 28. Step 4: Cuffdiff
      • Calculates expression levels of transcripts in samples
      • Estimates differential expression between samples
      • Calculates a significance value for the difference in expression levels between samples
      • Groups together transcripts that share a transcription start site, which helps identify genes under transcriptional or post-transcriptional regulation
      • Sample command-line:
          cuffdiff -o Output.dir/ -b /path/to/genome.fa -p 8 -L Cond1,Cond2 -u merged.gtf cond1.bam cond2.bam
  • 29. Cuffdiff Output
      • FPKM values for genes, isoforms, CDS, and groups of genes from the same Transcription Start Site, for each condition
          • FPKM is the normalized “expression value” used in RNA-Seq
      • Count files for the above
      • The same files on a per-replicate basis
      • Differential expression test results for genes, CDS, primary transcripts, and spliced transcripts on a per sample (condition) comparison basis (each possible X vs Y comparison unless otherwise specified)
          • Includes identifiers, expression levels, expression difference values, p-values, q-values, and a yes/no significance field
      • Differential splicing tests, differential coding output, differential promoter use
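FPKM stands for fragments per kilobase of transcript per million mapped fragments. The sketch below shows only that basic definition. Cuffdiff's reported values come from a statistical model that also handles fragments shared between isoforms, so they will not match this simple calculation exactly. The function name `fpkm` and the example numbers are illustrative.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    kilobases = transcript_length_bp / 1_000
    millions_mapped = total_mapped_fragments / 1_000_000
    return fragments / (kilobases * millions_mapped)


# Hypothetical transcript: 500 fragments on a 2,000 bp transcript
# in a sample with 20 million mapped fragments.
print(fpkm(500, 2_000, 20_000_000))
```

Dividing by transcript length and by sequencing depth is what makes values comparable between long and short transcripts and between samples sequenced to different depths.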
  • 30. Step 5: CummeRbund (R) Trapnell et al., 2012
  • 31. Visualization Trapnell et al., 2012
  • 32. Help!
      • Command X failed
          • Keep calm
          • Don't blame the computer
          • Check input files and formats
          • Google/SeqAnswers/Biostars
      • Results look “weird”
          • Check the raw data
          • Re-check the commands you used
      • RNA-Seq analysis is an experiment:
          • Maintain good records of what you did, like any other experiment
  • 33. Alternative tools
      • Alternative short-read alignment
          • BWA: not splice-aware, so it cannot align RNA-Seq reads across introns
          • GSNAP
          • STAR: requires a minimum of 30 GB of RAM
      • Alternative transcript reconstruction
          • STAR
          • Scripture
      • Alternative expression/abundance estimation
          • DESeq
          • DEXSeq
          • edgeR
  • 34. Resources
  • 35. Software Websites
      • TopHat
      • Cufflinks
      • STAR
      • Scripture
      • Bioconductor
          • DEXSeq
          • DESeq
          • edgeR
  • 36. Additional Resources
      • Trapnell et al. (2012) Differential gene and transcript expression analysis of RNA-Seq experiments with TopHat and Cufflinks. Nature Protocols 7(3)
      • (Q&A site)
      • SeqAnswers Forum
      • GENCODE Gene Annotations
      • TopHat / Illumina iGenomes References and Annotation Files
  • 37. Acknowledgements Dalhousie University  Dr. Graham Dellaire  Dr. Karen Bedard  Montgomery Lab  Dr. Chris McMaster Stanford  Dr. Andrew Orr  Dr. Stephen Montgomery  Dr. Conrad Fernandez  BHCRI CRTP Skills  Dr. Marissa Leblanc Acquisition Program  Mat Nightingale  Bedard Lab  IGNITE
  • 38. Experimental Data for Genes of Interest
  • 39. UCSC Genome Browser
  • 40. UCSC Genome Browser
  • 41. MetabolicMine
  • 42. MetabolicMine
  • 43. NCI Pathway Interaction Database
  • 44. The Cancer Genome Atlas
      • Identify cancer subtypes, actionable driver mutations, personalized/genomic/precision medicine
      • More than $275 million in funding from NIH
      • Multiple research groups around the world
      • 20 cancer types being studied
      • 205 publications from the research network since late 2008
  • 45. The Cancer Genome Atlas
  • 46. The Cancer Genome Atlas
  • 47. The Cancer Genome Atlas
  • 48. UNIX/Linux command-line basics
  • 49. What is UNIX?
      • UNIX and UNIX-like systems are a family of computer operating systems originally developed at AT&T's Bell Labs
          • Apple OS X and iOS (UNIX)
          • Linux (UNIX-like)
  • 50. Intro
      • The terminal (command-line) isn't THAT scary.
      • Maintaining a Linux environment can be challenging, but most of these analyses can also be done in an OS X environment
      • Installing software can sometimes be cumbersome and confusing; however, many standard bioinformatics programs and software libraries are fairly easy to set up
      • Working with the programs from the command-line will often give you a better appreciation for what the program does and what it requires
  • 51. Terms to Know Path: The location of a directory, file, or command on the computer.  Example: /Users/dan (OS X home directory)
  • 52. The Commands You Need to Know
      • ls: Lists the files in the current directory. Directories (folders) are just a special type of file themselves
      • cd: Change directory
      • pwd: View the full path of the directory you are currently in
      • cat: Displays the contents of a file on the terminal screen
      • head / tail: Display the top or bottom of a file on the screen, respectively