Dgaston dec-06-2012


Intro primer on Bioinformatics and Gene Expression analysis in RNA-Seq using the Tuxedo pipeline

Published in: Technology

  1. 1. Bioinformatics: Intro to RNA-Seq Analysis
     Integrated Learning Session
     Daniel Gaston, PhD
     Dr. Karen Bedard Lab, Department of Pathology
     December 6th, 2012
  2. 2. Overview
     • Introduction
        - Considerations for RNA-Seq
        - Computational Resources/Options
     • Analysis of RNA-Seq Data
        - Principle of analyzing RNA-Seq
        - General RNA-Seq analysis pipeline
        - “Tuxedo” pipeline
        - Alternative tools
     • Resources
        - http://www.slideshare.net/DanGaston
  3. 3. Before You Start: Considerations for RNA-Seq Analysis
     • Next-Generation Sequencing experiments generate a lot of raw data
        - 25-40 GB/sample/replicate for most transcriptomes/tissue types/cell lines/conditions
     • Analysing the data requires more computational resources than many labs routinely have available
        - Several processing “cores” (8 minimum)
        - Large amount of RAM (16 GB+)
        - Large amount of disk storage space for intermediate and final results files, in addition to the raw FastQ files
        - Can take a significant amount of time per sample (days to a week)
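As a rough worked example of the storage figure above, the sketch below multiplies the 25-40 GB per sample per replicate range by an experiment size. The sample and replicate counts are assumed values for illustration only, and the result covers raw data, not intermediate or final results files.

```shell
# Hypothetical experiment size (assumed values, not from the slides)
n_samples=4
n_replicates=3

# Per-sample, per-replicate raw data range quoted above, in GB
gb_low=25
gb_high=40

# One sequencing library per sample per replicate
n_libraries=$((n_samples * n_replicates))
total_low=$((n_libraries * gb_low))
total_high=$((n_libraries * gb_high))

echo "Libraries: ${n_libraries}"
echo "Estimated raw data: ${total_low}-${total_high} GB"
```

Intermediate alignment files can be as large as the raw data, so it is probably wise to budget well above this estimate.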
  4. 4. Computational Options
     • Local (large workstation or cluster)
     • Remote computer/cluster (ComputeCanada/ACENet)
     • Cloud services
        - Amazon Web Services
     • Cloud/local bioinformatics “portals”
        - Galaxy
        - Chipster
        - GenomeSpace
        - CloudBioLinux
        - CloudMan
        - BioCloudCentral (interface to CloudMan, CloudBioLinux, etc.)
  5. 5. RNA-Seq Analysis Workflow
  6. 6. So I Ran an RNA-Seq Experiment. Now What?
     • Need to go from raw “read” data to gene expression data
     • We now have:
        - De-multiplexed FastQ files for each individual sample and replicate
     • We want lists of:
        - Differentially expressed genes/transcripts
        - Potentially novel genes/transcripts
        - Potentially novel splice junctions
        - Potential fusion events
     • Organize your data, programs, and additional resources (discussed later)
  7. 7. What is the Raw Data?
     • A single lane of Illumina HiSeq 2000 sequencing produces ~250-300 million “reads”
     • Can be paired-end or single-end sequencing (paired-end preferred)
     • Various sequencing lengths (number of sequencing cycles)
        - 2x50bp, 2x75bp, 2x100bp, 2x150bp most common
        - Trade-off between cost and amount of usable data
     • True raw data is actually image data with colour intensities, which are then converted into text (A, C, G, T and quality scores) in a format called FastQ
  8. 8. FastQ
     @M00814:1:000000000-A2472:1:1101:14526:1866 1:N:0:1
     TGGAACATGCGTGCGNAGCCGAAAGTGTGTCCCCACTTTCATATGAAGAAAGAC
     +
     ?????BBBBB9?+<+#,,6C>CAEEHHHFFHHHHFEHHHHHHHHHGHHHHHHHHHH
     • Text file format (an extension of FASTA) with a header line, a sequence line, and quality scores for every base in the sequenced read
     • In paired-end sequencing there is one file for each “end” of sequencing (Primer 1 and Primer 2)
     • Quality scores are encoded with a single character representing a number. The most common encoding scheme is called Phred33. Older Illumina software (Illumina 1.3 to 1.7) used Phred64, but the current generation does not.
        - The encoding often needs to be set explicitly in alignment programs
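To make the four-line record layout and the Phred33 encoding concrete, here is a minimal sketch. The two-read FastQ file is made up for illustration, and `phred33_score` is a hypothetical helper name, not part of any tool mentioned in these slides.

```shell
# Write a tiny, made-up FastQ file with two reads (illustrative data only)
fastq_file=$(mktemp)
cat > "$fastq_file" <<'EOF'
@read1
ACGTN
+
HH?#5
@read2
TTGCA
+
?????
EOF

# Each FastQ record is four lines, so reads = lines / 4
n_lines=$(wc -l < "$fastq_file")
n_reads=$((n_lines / 4))
echo "Reads: ${n_reads}"

# Decode one Phred33 quality character: ASCII code minus 33
phred33_score() {
    # printf with a leading quote prints the character's numeric code
    code=$(printf '%d' "'$1")
    echo $((code - 33))
}

phred33_score 'H'   # a high-quality base
phred33_score '#'   # a very low-quality base
```

For Phred64 data the same idea would apply with 64 subtracted instead of 33, which is why the encoding has to match what the aligner expects.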
  9. 9. General Analysis Pipeline
     • Short-Read Alignment
     • Transcript Reconstruction
     • Abundance/Expression
     • Visualization / Statistics
  10. 10. What is Short-Read Alignment? (Figure labels: Paired-End Reads; Section of Reference Chromosome)
  11. 11. What’s Special About RNA-Seq
      • Normally the distance between paired reads and the size of insertions are both constrained
      • With RNA-Seq the source is mRNA, not genomic DNA
      • Mapping to a reference genome, not a transcriptome
      • Need to account for introns; pairs can be much further apart than expected
  12. 12. Transcript Reconstruction: Intron/Exon Junctions (Figure labels: Exon 1, Exon 2, Exon 3)
  13. 13. Transcript Reconstruction: Alternative Splicing (Figure labels: Exon 1, Exon 2, Exon 3)
  14. 14. Transcript Reconstruction: Novel Exon/Transcript Identification (Figure labels: Exon 1, Exon 2, Exon X, Exon 3)
  15. 15. Transcript Reconstruction: Fusion Transcripts (Figure labels: Exon 1, Exon 2, Exon 3, Gene 2 Exon 4)
  16. 16. Transcript Reconstruction: Differential Expression (Figure labels: Sample 1, Sample 2)
  17. 17. What else can we look for?
      • Combine with ChIP-Seq to differentiate various levels of regulation
      • Integrative analyses to identify common elements (micro-RNA, transcription factors, molecular pathways, protein-DNA interactions)
      • Combine with whole-exome or whole-genome sequencing
         - Allele-specific expression
         - Allelic imbalance
         - LOH
         - Large genomic rearrangements/abnormalities
  18. 18. Caution
      • Need to differentiate between real data and artifacts
      • Differentiate between biologically meaningful data and “noise”
      • Sample selection, experimental design, biological replication (not technical replication), and robust statistical methods are important
      • Looking at your data “by eye” is useful, but needs to be backed up by statistics
      • Avoid experimenter bias
      • Try to be holistic in your analyses
  19. 19. Visualizing with IGV
  20. 20. “Tuxedo” Analysis Pipeline (Diagram labels: Bowtie, TopHat, Cufflinks, Cuffcompare, Cuffmerge, Cuffdiff, CummeRbund)
  21. 21. What you need before you begin
      • The individual programs
      • Reference genome (hg19/GRCh37)
         - FASTA file of the whole genome; each chromosome is a sequence entry
      • Bowtie2 index files for the reference genome
         - Index files are compressed representations of the genome that allow reads to be aligned to the reference efficiently and in parallel
      • Gene/transcript annotation reference (UCSC, Ensembl, ENCODE, etc.)
         - Gives information about the location of genes and important features such as introns, exons, splice junctions, etc.
  22. 22. Step 0: Bowtie
      • Bowtie forms the core of TopHat for short-read alignment
      • Initial mapping of a subset of reads (~5 million) to a reference transcriptome to estimate the inner-distance mean/median and standard deviation for TopHat
      • This information can be retrieved from the library prep stage, but it is actually better to estimate it from your final data
      • Sample command-line (Bowtie2 syntax; -x takes the basename of the Bowtie2 index):
        bowtie2 -x /path/to/transcriptome_ref -q --phred33 --local -p 8 -1 read1.fastq -2 read2.fastq -S output.sam
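One possible way to turn the Step 0 alignment into the numbers TopHat needs is sketched below. It is not the presenter's method: it assumes a fixed read length, reads the template length from column 9 (TLEN) of the SAM output, and applies inner distance = TLEN - 2 x read length. The SAM records are made up, and `inner_distance_stats` is a hypothetical helper name.

```shell
# Summarise inner distance from the TLEN field (column 9) of SAM records.
inner_distance_stats() {
    # $1 = read length in bp; SAM records are read from standard input
    awk -v rl="$1" '
        /^@/ { next }                 # skip SAM header lines
        $9 > 0 {                      # count each pair once (mate with positive TLEN)
            d = $9 - 2 * rl
            n++; sum += d; sumsq += d * d
        }
        END {
            if (n < 2) { print "need at least two pairs"; exit 1 }
            mean = sum / n
            sd = sqrt((sumsq - n * mean * mean) / (n - 1))   # sample standard deviation
            printf "%.1f %.1f\n", mean, sd
        }'
}

# Made-up SAM records: one header line, three read pairs (illustrative data only)
sam_file=$(mktemp)
{
    printf '@HD\tVN:1.0\n'
    printf 'p1\t99\ttx1\t100\t42\t100M\t=\t300\t300\t*\t*\n'
    printf 'p1\t147\ttx1\t300\t42\t100M\t=\t100\t-300\t*\t*\n'
    printf 'p2\t99\ttx1\t500\t42\t100M\t=\t720\t320\t*\t*\n'
    printf 'p3\t99\ttx2\t50\t42\t100M\t=\t230\t280\t*\t*\n'
} > "$sam_file"

# Prints the mean and standard deviation of the inner distance
inner_distance_stats 100 < "$sam_file"
```

With real data the input would be the `output.sam` file from the command above. Reads that are soft-clipped by local alignment or improperly paired would need extra filtering that this sketch leaves out.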
  23. 23. Step 1: TopHat
      • TopHat is a short-read mapper capable of aligning reads to a reference genome and finding exon-exon junctions
      • Can be provided a list of known junctions, do de novo junction discovery, or both
      • Also has an option to find potential fusion-gene transcripts
      • Sample command-line:
        tophat -p 8 -G gene_annotations.gtf -r inner_distance --mate-std-dev std_dev -o Output.dir /path/to/bowtie2indexes/genome read1.fastq read2.fastq
  24. 24. About TopHat Options
      • -o: The path/name of a directory in which to place all of the TopHat output files
      • -G: Path to and name of an annotation file, so TopHat can be aware of known junctions
      • Reference genome: Given as path and “base name.” If the reference genome is saved as /genomes/hg19/genome.fa, then the relevant path and basename would be /genomes/hg19/genome
      • Inner Distance = Fragment size - (2 x read length)
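The inner-distance formula above can be worked through in a few lines. The fragment size, read length, and standard deviation below are assumed values for illustration. The command is assembled from the sample command-line in Step 1 and only printed, not run.

```shell
# Assumed library values for illustration (not from the slides)
fragment_size=300   # mean fragment size in bp
read_length=100     # length of each paired-end read in bp
std_dev=20          # standard deviation of the inner distance

# Inner Distance = Fragment size - (2 x read length)
inner_distance=$((fragment_size - 2 * read_length))
echo "Inner distance: ${inner_distance}"

# Print the TopHat command rather than running it, so it can be checked first
tophat_cmd="tophat -p 8 -G gene_annotations.gtf -r ${inner_distance} --mate-std-dev ${std_dev} -o Output.dir /path/to/bowtie2indexes/genome read1.fastq read2.fastq"
echo "$tophat_cmd"
```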
  25. 25. TopHat: Additional options
      • --no-mixed
      • --b2-very-sensitive
      • --fusion-search
      • Running with the above options on 6 processing cores took ~26 hours for one sample
  26. 26. Step 2: Cufflinks
      • Cufflinks performs gene and transcript discovery
      • Many possible options
         - No novel discovery, use only a reference group of transcripts
         - De novo mode (shown in the sample command-line on this slide, beginner’s default)
         - Mixed reference-guided assembly and de novo discovery
         - Options for more robust normalization methods and error correction
      • Sample command-line:
        cufflinks -p 8 -o Cufflinks.out/ accepted_hits.bam
  27. 27. Step 3: Cuffmerge
      • Merges sample assemblies, estimates abundances, and cleans up the transcriptome
      • Sample command-line:
        cuffmerge -g gene_annotations.gtf -s /path/to/genome.fa -p 8 text_list_of_assemblies.txt
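The last argument to cuffmerge is a plain text file listing one assembly per line. A minimal sketch of building that list follows. The sample names and directory layout are hypothetical, and empty stand-in files take the place of the `transcripts.gtf` file that Cufflinks writes to each output directory.

```shell
# Scratch directory with two hypothetical per-sample Cufflinks output folders
work_dir=$(mktemp -d)
for sample in sample1 sample2; do
    mkdir -p "${work_dir}/${sample}.cufflinks"
    # Empty stand-in for the transcripts.gtf written by Cufflinks
    : > "${work_dir}/${sample}.cufflinks/transcripts.gtf"
done

# Write one assembly path per line
list_file="${work_dir}/text_list_of_assemblies.txt"
for gtf in "${work_dir}"/*.cufflinks/transcripts.gtf; do
    echo "$gtf"
done > "$list_file"

cat "$list_file"
```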
  28. 28. Step 4: Cuffdiff
      • Calculates expression levels of transcripts in samples
      • Estimates differential expression between samples
      • Calculates a significance value for the difference in expression levels between samples
      • Also groups together transcripts that all start from the same start site, to identify genes under transcriptional/post-transcriptional regulation
      • Sample command-line:
        cuffdiff -o Output.dir/ -b /path/to/genome.fa -p 8 -L Cond1,Cond2 -u merged.gtf cond1.bam cond2.bam
  29. 29. Cuffdiff Output
      • FPKM values for genes, isoforms, CDS, and groups of genes from the same Transcription Start Site for each condition
         - FPKM (Fragments Per Kilobase of transcript per Million mapped fragments) is the normalized “expression value” used in RNA-Seq
      • Count files of the above
      • As above, but on a per-replicate basis
      • Differential expression test results for genes, CDS, primary transcripts, and spliced transcripts on a per-sample (condition) comparison basis (each possible X vs Y comparison unless otherwise specified)
         - Includes identifiers, expression levels, expression difference values, p-values, q-values, and a yes/no significance field
      • Differential splicing tests, differential coding output, differential promoter use
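As an illustration of using the yes/no significance field, the sketch below filters a gene-level differential expression table. The rows are made up, laid out in the column order of Cuffdiff's `gene_exp.diff` file, and `significant_genes` is a hypothetical helper name.

```shell
# Made-up rows in the column layout of gene_exp.diff (illustrative data only)
diff_file=$(mktemp)
{
    printf 'test_id\tgene_id\tgene\tlocus\tsample_1\tsample_2\tstatus\tvalue_1\tvalue_2\tlog2(fold_change)\ttest_stat\tp_value\tq_value\tsignificant\n'
    printf 'XLOC_1\tXLOC_1\tGENEA\tchr1:100-900\tCond1\tCond2\tOK\t10.0\t40.0\t2.0\t4.1\t0.0001\t0.002\tyes\n'
    printf 'XLOC_2\tXLOC_2\tGENEB\tchr2:500-2000\tCond1\tCond2\tOK\t12.0\t13.0\t0.12\t0.3\t0.7\t0.9\tno\n'
    printf 'XLOC_3\tXLOC_3\tGENEC\tchr3:50-700\tCond1\tCond2\tOK\t80.0\t20.0\t-2.0\t-3.9\t0.0002\t0.003\tyes\n'
} > "$diff_file"

# Keep rows whose last column is "yes"; print gene, log2 fold change, q-value
significant_genes() {
    awk -F '\t' 'NR > 1 && $14 == "yes" { print $3 "\t" $10 "\t" $13 }' "$1"
}

significant_genes "$diff_file"
```

In a real run the input would be the `gene_exp.diff` file in the Cuffdiff output directory. A quick filter like this is a starting point for inspection, not a substitute for the visualization step that follows.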
  30. 30. Step 5: CummeRbund (R) Trapnell et al., 2012
  31. 31. Visualization Trapnell et al., 2012
  32. 32. Help!
      • Command X failed
         - Keep calm
         - Don’t blame the computer
         - Check input files and formats
         - Google/SeqAnswers/Biostars
      • Results look “weird”
         - Check the raw data
         - Re-check the commands you used
      • RNA-Seq analysis is an experiment:
         - Maintain good records of what you did, like any other experiment
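One lightweight way to keep such records is to log every command as it is run. The wrapper below is a sketch only: `run_logged` and the log file location are hypothetical, and the `echo` stands in for a real analysis command.

```shell
# Log file for this analysis (a temporary file here; use a fixed path in practice)
analysis_log=$(mktemp)

# Append a timestamped record of the command to the log, then run it
run_logged() {
    printf '%s\t%s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$analysis_log"
    "$@"
}

# Usage: prefix any command with run_logged
run_logged echo "placeholder for an analysis command"

# Show the most recent log entry
tail -n 1 "$analysis_log"
```

Keeping the log alongside the output directories makes it easier to re-check the exact commands when results look odd.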
  33. 33. Alternative tools
      • Alternative short-read alignment
         - BWA -> Cannot align RNA-Seq data to a genome (not splice-aware)
         - GSNAP
         - STAR -> Requires a minimum of 30 GB of RAM
      • Alternative transcript reconstruction
         - STAR
         - Scripture
      • Alternative expression/abundance estimation
         - DESeq
         - DEXSeq
         - edgeR
  34. 34. Resources
  35. 35. Software Websites
      • TopHat: http://tophat.cbcb.umd.edu
      • Cufflinks: http://cufflinks.cbcb.umd.edu
      • STAR: http://gingeraslab.cshl.edu/STAR/
      • Scripture: http://www.broadinstitute.org/software/scripture/
      • Bioconductor: http://www.bioconductor.org/
         - DEXSeq
         - DESeq
         - edgeR
  36. 36. Additional Resources
      • Differential gene and transcript expression analysis of RNA-Seq experiments with TopHat and Cufflinks (2012) Nature Protocols. 7(3)
      • www.biostars.org (Q&A site)
      • SeqAnswers Forum
      • GENCODE Gene Annotations
         - http://www.gencodegenes.org/
         - ftp://ftp.sanger.ac.uk/pub/gencode
      • TopHat / Illumina iGenomes References and Annotation Files:
         - http://tophat.cbcb.umd.edu/igenomes.html
  37. 37. Acknowledgements
      • Dalhousie University
      • Dr. Graham Dellaire
      • Dr. Karen Bedard
      • Montgomery Lab, Stanford
      • Dr. Chris McMaster
      • Dr. Andrew Orr
      • Dr. Stephen Montgomery
      • Dr. Conrad Fernandez
      • BHCRI CRTP Skills Acquisition Program
      • Dr. Marissa Leblanc
      • Mat Nightingale
      • Bedard Lab
      • IGNITE
  38. 38. Experimental Data for Genes of Interest
  39. 39. UCSC Genome Browser
  40. 40. UCSC Genome Browser
  41. 41. MetabolicMine
  42. 42. MetabolicMine
  43. 43. NCI Pathway Interaction Database
  44. 44. The Cancer Genome Atlas
      • Identify cancer subtypes, actionable driver mutations, personalized/genomic/precision medicine
      • More than $275 million in funding from NIH
      • Multiple research groups around the world
      • 20 cancer types being studied
      • 205 publications from the research network since late 2008
  45. 45. The Cancer Genome Atlas
  46. 46. The Cancer Genome Atlas
  47. 47. The Cancer Genome Atlas
  48. 48. UNIX/Linux command-line basics
  49. 49. What is UNIX?
      • UNIX and UNIX-like systems are a family of computer operating systems derived from or modelled on the original UNIX developed at AT&T’s Bell Labs
         - Apple OS X and iOS (UNIX)
         - Linux (UNIX-like)
  50. 50. Intro
      • The terminal (command-line) isn’t THAT scary.
      • Maintaining a Linux environment can be challenging, but most of these analyses can also be done in an OS X environment
      • Installing software can sometimes be cumbersome and confusing; however, many standard bioinformatics programs and software libraries are fairly easy to set up
      • Working with the programs from the command-line will often give you a better appreciation for what the program does and what it requires
  51. 51. Terms to Know
      • Path: The location of a directory, file, or command on the computer.
         - Example: /Users/dan (OS X home directory)
  52. 52. The Commands You Need to Know
      • ls: Lists the files in the current directory. Directories (folders) are just a special type of file themselves
      • cd: Change directory
      • pwd: View the full path of the directory you are currently in
      • cat: Displays the contents of a file on the terminal screen
      • head / tail: Displays the top or bottom contents of a file on the screen, respectively
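The commands above can be tried safely in a scratch directory. This is a minimal sketch, and the directory and file names are hypothetical.

```shell
# Create a scratch directory and a small example file (names are hypothetical)
scratch_dir=$(mktemp -d)
cd "$scratch_dir"
printf 'line1\nline2\nline3\nline4\nline5\n' > example.txt

pwd                      # full path of the current directory
ls                       # list files in the current directory
cat example.txt          # print the whole file
head -n 2 example.txt    # first two lines
tail -n 2 example.txt    # last two lines
```

`head` and `tail` are especially handy for peeking at large FastQ or SAM files without printing the whole file to the screen.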