Dgaston dec-06-2012

Intro primer on Bioinformatics and Gene Expression analysis in RNA-Seq using the Tuxedo pipeline

Presentation Transcript

  • Bioinformatics: Intro to RNA-Seq Analysis
    Integrated Learning Session
    Daniel Gaston, PhD
    Dr. Karen Bedard Lab, Department of Pathology
    December 6th, 2012
  • Overview
    • Introduction
      • Considerations for RNA-Seq
      • Computational Resources/Options
    • Analysis of RNA-Seq Data
      • Principle of analyzing RNA-Seq
      • General RNA-Seq analysis pipeline
      • “Tuxedo” pipeline
      • Alternative tools
    • Resources
      • http://www.slideshare.net/DanGaston
  • Before You Start: Considerations for RNA-Seq Analysis
    • Next-Generation Sequencing experiments generate a lot of raw data
      • 25-40 GB/sample/replicate for most transcriptomes/tissue types/cell lines/conditions
    • Analyzing the data requires more computational resources than many labs routinely have available
      • At minimum several processing “cores” (8 minimum)
      • Large amount of RAM (16GB+)
      • Large amount of disk storage space for intermediate and final results files in addition to raw FastQ files
      • Can take a significant amount of time per sample (days to a week)
  • Computational Options
    • Local (Large workstation or cluster)
    • Remote Computer/Cluster (ComputeCanada/ACENet)
    • Cloud Services
      • Amazon Web Services
    • Cloud/Local Bioinformatics “Portals”
      • Galaxy
      • Chipster
      • GenomeSpace
      • CloudBioLinux
      • CloudMan
      • BioCloudCentral (Interface to CloudMan, CloudBioLinux, etc)
  • RNA-Seq Analysis Workflow
  • So I Ran an RNA-Seq Experiment. Now What?
    • Need to go from raw “read” data to gene expression data
    • We now have:
      • De-multiplexed FastQ files for each individual sample and replicate
    • We want lists of:
      • Differentially expressed genes/transcripts
      • Potentially novel genes/transcripts
      • Potentially novel splice junctions
      • Potential fusion events
    • Organize your data, programs, and additional resources (discussed later)
  • What is the Raw Data
    • A single lane of Illumina HiSeq 2000 sequencing produces ~ 250 – 300 million “reads” of sequencing
    • Can be paired or single-end sequencing (paired-end preferred)
    • Various sequencing lengths (number of sequencing cycles)
      • 2x50bp, 2x75bp, 2x100bp, 2x150bp most common
      • Cost versus amount of usable data
    • True raw data is actually image data with colour intensities, which are then converted into a text format called FastQ (A, C, G, T and quality scores)
  • FastQ
    @M00814:1:000000000-A2472:1:1101:14526:1866 1:N:0:1
    TGGAACATGCGTGCGNAGCCGAAAGTGTGTCCCCACTTTCATATGAAGAAAGAC
    +
    ?????BBBBB9?+<+#,,6C>CAEEHHHFFHHHHFEHHHHHHHHHGHHHHHHHHHH
    • A FASTA-like text format with a header line, a sequence line, and quality scores for every base in the sequenced read
    • In Paired-End Sequencing there is one file for each “end” of sequencing (Primer 1 and Primer 2)
    • Quality scores are encoded with a single character representing a number. The most common encoding scheme is called Phred33. Old Illumina software used Phred64 but the current generation does not (Illumina 1.3 – 1.7 is Phred64)
      • Often needs to be set explicitly in alignment programs
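
As a rough illustration of the quality encoding described on this slide, the sketch below splits a four-line FastQ record and converts its quality characters to Phred scores by subtracting the offset (33 or 64). The function names and the shortened record are hypothetical and chosen only for the example; real files should be read with an established parser.

```python
def parse_fastq_record(lines):
    """Split one four-line FastQ record into (name, sequence, qualities)."""
    header, sequence, separator, qualities = [line.rstrip("\n") for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FastQ record")
    return header[1:], sequence, qualities


def decode_qualities(quality_string, offset=33):
    """Convert quality characters to Phred scores (offset 33 or 64)."""
    return [ord(char) - offset for char in quality_string]


def error_probability(phred_score):
    """A Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-phred_score / 10)


# Hypothetical, shortened record used only for illustration.
record = [
    "@read_1 1:N:0:1\n",
    "TGGAN\n",
    "+\n",
    "??BB#\n",
]
name, sequence, qualities = parse_fastq_record(record)
scores = decode_qualities(qualities)  # Phred33 offset
print(name, sequence, scores)
```

Passing `offset=64` decodes the older Phred64 files; using the wrong offset shifts every score by 31, which is why aligners often need the encoding set explicitly.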
  • General Analysis Pipeline
    • Short-Read Alignment
    • Transcript Reconstruction
    • Abundance/Expression
    • Visualization / Statistics
  • What is Short-Read Alignment?
    [Figure: Paired-End Reads aligned to a Section of Reference Chromosome]
  • What’s Special About RNA-Seq
    • Normally distance between paired-reads and size of insertions both constrained
    • With RNA-Seq the source is mRNA, not genomic DNA
    • Mapping to a reference genome, not transcriptome
    • Need to account for introns, pairs can be much further apart than expected
  • Transcript Reconstruction: Intron/Exon Junctions
    [Figure: Exon 1, Exon 2, Exon 3]
  • Transcript Reconstruction: Alternative Splicing
    [Figure: Exon 1, Exon 2, Exon 3]
  • Transcript Reconstruction: Novel Exon/Transcript Identification
    [Figure: Exon 1, Exon 2, Exon X, Exon 3]
  • Transcript Reconstruction: Fusion Transcripts
    [Figure: Exon 1, Exon 2, Exon 3, Gene 2 Exon 4]
  • Transcript Reconstruction: Differential Expression
    [Figure: Sample 1, Sample 2]
  • What else can we look for?
    • Combine with ChIP-Seq to differentiate various levels of regulation
    • Integrative analyses to identify common elements (micro-RNA, transcription factors, molecular pathways, protein-DNA interactions)
    • Combine with whole-exome or whole-genome sequencing
      • Allele-specific expression
      • Allelic imbalance
      • LOH
      • Large genomic rearrangements/abnormalities
  • Caution
    • Need to differentiate between real data and artifacts
    • Differentiate between biologically meaningful data and “noise”
    • Sample selection, experimental design, biological replication (not technical replication), and robust statistical methods are important
    • Looking at your data “by eye” is useful, but needs to be backed up by stats
    • Avoid experimenter bias
    • Try to be holistic in your analyses
  • Visualizing with IGV
  • “Tuxedo” Analysis Pipeline
    • Bowtie
    • Tophat
    • Cufflinks
    • Cuffcompare
    • Cuffmerge
    • Cuffdiff
    • CummeRbund
  • What you need before you begin
    • The individual programs
    • Reference genome (hg19/GRCh37)
      • FASTA file of whole genome, each chromosome is a sequence entry
    • Bowtie2 Index files for reference genome
      • Index files are compressed representations of the genome that allow reads to be aligned to the reference efficiently and in parallel
    • Gene/Transcript annotation reference (UCSC, Ensembl, ENCODE, etc)
      • Gives information about the location of genes and important features such as location of introns, exons, splice junctions, etc
  • Step 0: Bowtie
    • Bowtie forms the core of TopHat for short-read alignment
    • Initial mapping of a subset of reads (~5 million) to a reference transcriptome to estimate inner-distance mean/median and standard deviation for TopHat
    • This info can be retrieved from the library prep stage but it is actually better to estimate it from your final data
    • Sample command-line (the --phred33 and --local options belong to Bowtie2, so the bowtie2 executable is used):
      bowtie2 -x /path/to/transcriptome_ref.fa -q --phred33 --local -p 8 -1 read1.fastq -2 read2.fastq -S output.sam
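
One way the inner-distance statistics mentioned on this slide could be derived from the resulting SAM file is sketched below. It is a minimal sketch that assumes both mates have the same length, uses the SAM TLEN column (column 9) of properly paired reads, and ignores soft clipping from local alignment; the function names and the three demo pairs are hypothetical.

```python
import statistics

PROPER_PAIR = 0x2  # SAM flag bit: both mates aligned as a proper pair


def inner_distances(sam_lines):
    """Yield one inner-distance estimate per properly paired fragment."""
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, tlen, seq = int(fields[1]), int(fields[8]), fields[9]
        # A positive TLEN marks the leftmost mate, so each pair is counted once.
        if flag & PROPER_PAIR and tlen > 0:
            # Assumption: the other mate has the same length as this read.
            yield tlen - 2 * len(seq)


def summarize(distances):
    """Return (mean, median, standard deviation) of the estimates."""
    distances = list(distances)
    return (
        statistics.mean(distances),
        statistics.median(distances),
        statistics.stdev(distances),
    )


# Hypothetical SAM records (10 bp reads) used only for illustration.
read = "ACGTACGTAC\tIIIIIIIIII\n"
demo_sam = [
    "@HD\tVN:1.0\n",
    "r1\t99\tchr1\t100\t60\t10M\t=\t250\t160\t" + read,
    "r1\t147\tchr1\t250\t60\t10M\t=\t100\t-160\t" + read,
    "r2\t99\tchr1\t500\t60\t10M\t=\t670\t180\t" + read,
    "r2\t147\tchr1\t670\t60\t10M\t=\t500\t-180\t" + read,
    "r3\t99\tchr1\t900\t60\t10M\t=\t1090\t200\t" + read,
    "r3\t147\tchr1\t1090\t60\t10M\t=\t900\t-200\t" + read,
]
mean, median, stdev = summarize(inner_distances(demo_sam))
print(mean, median, stdev)
```

On real data the same functions could be fed an open file handle for output.sam, and the resulting mean and standard deviation passed to TopHat.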
  • Step 1: Tophat
    • Tophat is a short-read mapper capable of aligning reads to a reference genome and finding exon-exon junctions
    • Can be provided a list of known junctions, do de novo junction discovery, or both
    • Also has an option to find potential fusion-gene transcripts
    • Sample command-line:
      tophat -p 8 -G gene_annotations.gtf -r inner_distance --mate-std-dev std_dev -o Output.dir /path/to/bowtie2indexes/genome read1.fastq read2.fastq
  • About TopHat Options
    • -o: The path/name of a directory in which to place all of the TopHat output files
    • -G: Path to and name of an annotation file so TopHat can be aware of known junctions
    • Reference Genome: Given as path and “base name.” If reference genome saved as: /genomes/hg19/genome.fa then the relevant path and basename would be /genomes/hg19/genome
    • Inner Distance = Fragment size - (2 x read length)
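
The two conventions on this slide can be written out as a short sketch. The function names are hypothetical, the fragment size and read length are made-up example values, and the base-name helper simply follows the slide's convention that the index shares the FASTA file's name without its extension.

```python
import os


def inner_distance(fragment_size, read_length):
    """Inner Distance = Fragment size - (2 x read length)."""
    return fragment_size - 2 * read_length


def index_basename(fasta_path):
    """Drop the file extension to get the path and base name used for the index."""
    return os.path.splitext(fasta_path)[0]


# Hypothetical library: 300 bp fragments sequenced as 2x100bp.
print(inner_distance(300, 100))
print(index_basename("/genomes/hg19/genome.fa"))
```

Note that the inner distance can be negative when the two reads overlap, for example 150 bp fragments sequenced as 2x100bp.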
  • TopHat: Additional options
    • --no-mixed
    • --b2-very-sensitive
    • --fusion-search
    • Running the above options on 6 processing cores on one sample took ~26 hours
  • Step 2: Cufflinks
    • Cufflinks performs gene and transcript discovery
    • Many possible options
      • No novel discovery, use only a reference group of transcripts
      • de novo mode (shown below, beginner's default)
      • Mixed Reference-Guided Assembly and de novo discovery
      • Options for more robust normalization methods and error correction
    • Sample command-line:
      cufflinks -p 8 -o Cufflinks.out/ accepted_hits.bam
  • Step 3: Cuffmerge
    • Merges sample assemblies, estimates abundances, cleans up transcriptome
    • Sample command-line:
      cuffmerge -g gene_annotations.gtf -s /path/to/genome.fa -p 8 text_list_of_assemblies.txt
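
The text list of assemblies passed to cuffmerge is a plain file with one assembly GTF path per line. A minimal sketch of generating it is shown below, assuming each sample's Cufflinks output directory contains the usual transcripts.gtf; the directory names and the helper function are hypothetical.

```python
import tempfile
from pathlib import Path


def write_assembly_list(cufflinks_dirs, list_path):
    """Write one transcripts.gtf path per line for cuffmerge to read."""
    gtf_paths = [str(Path(directory) / "transcripts.gtf") for directory in cufflinks_dirs]
    Path(list_path).write_text("\n".join(gtf_paths) + "\n")
    return gtf_paths


# Hypothetical per-sample Cufflinks output directories.
sample_dirs = ["Sample1.cufflinks", "Sample2.cufflinks"]

with tempfile.TemporaryDirectory() as tmp:
    list_file = Path(tmp) / "text_list_of_assemblies.txt"
    written = write_assembly_list(sample_dirs, list_file)
    manifest_lines = list_file.read_text().splitlines()

print(manifest_lines)
```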
  • Step 4: Cuffdiff
    • Calculates expression levels of transcripts in samples
    • Estimates differential expression between samples
    • Calculates significance value for difference in expression levels between samples
    • Also groups together transcripts that all start from the same start site, to identify genes under transcriptional/post-transcriptional regulation
    • Sample command-line:
      cuffdiff -o Output.dir/ -b /path/to/genome.fa -p 8 -L Cond1,Cond2 -u merged.gtf cond1.bam cond2.bam
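
When each condition has several replicates, cuffdiff takes the replicate BAM files of one condition as a single comma-separated argument, with conditions separated by spaces. The sketch below assembles such a command line using the same options as the sample command; the helper name and the replicate file names are hypothetical.

```python
def cuffdiff_command(conditions, merged_gtf, genome_fasta, output_dir, threads=8):
    """Build a cuffdiff argument list; `conditions` maps label -> list of BAM paths."""
    command = [
        "cuffdiff",
        "-o", output_dir,
        "-b", genome_fasta,
        "-p", str(threads),
        "-L", ",".join(conditions),  # labels in the same order as the BAM groups
        "-u", merged_gtf,
    ]
    # Replicates of one condition are comma-joined; each condition is one argument.
    command += [",".join(bam_files) for bam_files in conditions.values()]
    return command


# Hypothetical replicate file names.
conditions = {
    "Cond1": ["cond1_rep1.bam", "cond1_rep2.bam"],
    "Cond2": ["cond2_rep1.bam", "cond2_rep2.bam"],
}
command = cuffdiff_command(conditions, "merged.gtf", "/path/to/genome.fa", "Output.dir/")
print(" ".join(command))
```

The resulting list can be handed to `subprocess.run` directly, which avoids shell quoting problems with unusual file names.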
  • Cuffdiff Output
    • FPKM values for genes, isoforms, CDS, and groups of genes from same Transcription Start Site for each condition
      • FPKM is the normalized “expression value” used in RNA-Seq
    • Count files of above
    • As above but on a per replicate basis
    • Differential expression test results for genes, CDS, primary transcripts, spliced transcripts on a per sample (condition) comparison basis (Each possible X vs Y comparison unless otherwise specified)
      • Includes identifiers, expression levels, expression difference values, p-values, q-values, and yes/no significance field
    • Differential splicing tests, differential coding output, differential promoter use
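
FPKM stands for Fragments Per Kilobase of transcript per Million mapped fragments. The sketch below shows only the basic definition, with made-up example numbers; Cufflinks and Cuffdiff estimate abundances with a statistical model, so their reported values will not generally match this naive calculation.

```python
def fpkm(fragment_count, feature_length_bp, total_mapped_fragments):
    """Fragments per kilobase of feature per million mapped fragments."""
    return fragment_count * 1e9 / (feature_length_bp * total_mapped_fragments)


# Hypothetical example: 500 fragments on a 2,000 bp transcript
# in a library with 20 million mapped fragments.
example_fpkm = fpkm(500, 2_000, 20_000_000)
print(example_fpkm)
```

Dividing by feature length and library size is what makes values comparable between long and short transcripts and between samples sequenced to different depths.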
  • Step 5: CummeRbund (R) Trapnell et al., 2012
  • Visualization Trapnell et al., 2012
  • Help!
    • Command X failed
      • Keep calm
      • Don't blame the computer
      • Check input files and formats
      • Google/SeqAnswers/Biostars
    • Results look “weird”
      • Check the raw data
      • Re-check the commands you used
    • RNA-Seq analysis is an experiment:
      • Maintain good records of what you did, like any other experiment
  • Alternative tools
    • Alternative short-read alignment
      • BWA -> Cannot align RNA-Seq data
      • GSNAP
      • STAR -> Requires minimum of 30GB of RAM
    • Alternative transcript reconstruction
      • STAR
      • Scripture
    • Alternative Expression/Abundance Estimation
      • DESeq
      • DEXSeq
      • edgeR
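
One simple way to keep the kind of record this slide recommends is to run each analysis step through a small wrapper that appends the time, the exact command line, and the exit code to a log file. This is only a sketch: the wrapper name and log file name are hypothetical, and the demo runs a harmless Python one-liner rather than a real pipeline step.

```python
import datetime
import subprocess
import sys
import tempfile
from pathlib import Path


def run_and_log(command, log_path):
    """Run a command and append the start time, command line, and exit code to a log."""
    started = datetime.datetime.now().isoformat(timespec="seconds")
    result = subprocess.run(command)
    with open(log_path, "a") as log:
        log.write(f"{started}\t{' '.join(command)}\texit={result.returncode}\n")
    return result.returncode


# Demo with a harmless command standing in for a pipeline step.
with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "analysis_log.tsv"
    exit_code = run_and_log([sys.executable, "-c", "print('ok')"], log_file)
    log_text = log_file.read_text()

print(exit_code, log_text)
```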
  • Resources
  • Software Websites
    • TopHat: http://tophat.cbcb.umd.edu
    • Cufflinks: http://cufflinks.cbcb.umd.edu
    • STAR: http://gingeraslab.cshl.edu/STAR/
    • Scripture: http://www.broadinstitute.org/software/scripture/
    • Bioconductor: http://www.bioconductor.org/
      • DEXSeq
      • DESeq
      • edgeR
  • Additional Resources
    • Differential gene and transcript expression analysis of RNA-Seq Experiments with TopHat and Cufflinks (2012) Nature Protocols. 7(3)
    • www.biostars.org (Q&A site)
    • SeqAnswers Forum
    • GENCODE Gene Annotations
      • http://www.gencodegenes.org/
      • ftp://ftp.sanger.ac.uk/pub/gencode
    • TopHat / Illumina iGenomes References and Annotation Files:
      • http://tophat.cbcb.umd.edu/igenomes.html
  • Acknowledgements
    • Dalhousie University
      • Dr. Graham Dellaire
      • Dr. Karen Bedard
      • Dr. Chris McMaster
      • Dr. Andrew Orr
      • Dr. Conrad Fernandez
      • Dr. Marissa Leblanc
      • Mat Nightingale
      • Bedard Lab
      • IGNITE
    • Montgomery Lab, Stanford
      • Dr. Stephen Montgomery
    • BHCRI CRTP Skills Acquisition Program
  • Experimental Data for Genes of Interest
  • UCSC Genome Browser
  • UCSC Genome Browser
  • MetabolicMine
  • MetabolicMine
  • NCI Pathway Interaction Database
  • The Cancer Genome Atlas Identify cancer subtypes, actionable driver mutations, personalized/genomic/precision medicine More than $275 million in funding from NIH Multiple research groups around the world 20 cancer types being studied 205 publications from the research network since late 2008
  • The Cancer Genome Atlas
  • The Cancer Genome Atlas
  • The Cancer Genome Atlas
  • UNIX/Linux command-line basics
  • What is UNIX?
    • UNIX and UNIX-Like are a family of computer operating systems originally developed at AT&T's Bell Labs
      • Apple OS X and iOS (UNIX)
      • Linux (UNIX-Like)
  • Intro
    • The terminal (command-line) isn't THAT scary.
    • Maintaining a Linux environment can be challenging, but most of these analyses can also be done in an OS X environment
    • Installing software can sometimes be cumbersome and confusing, however many standard bioinformatics programs and software libraries are fairly easy to set up
    • Working with the programs from the command-line will often give you a better appreciation for what the program does and what it requires
  • Terms to Know
    • Path: The location of a directory, file, or command on the computer.
      • Example: /Users/dan (OS X home directory)
  • The Commands You Need to Know
    • ls: Lists the files in the current directory. Directories (folders) are just a special type of file themselves
    • cd: Change directory
    • pwd: View the full path of the directory you are currently in
    • cat: Displays the contents of a file on the terminal screen
    • head / tail: Display the top or bottom contents of a file on the screen, respectively