Dgaston dec-06-2012

Intro primer on Bioinformatics and Gene Expression analysis in RNA-Seq using the Tuxedo pipeline

Published in: Technology

Transcript

  • 1. Bioinformatics: Intro to RNA-Seq Analysis. Integrated Learning Session. Daniel Gaston, PhD, Dr. Karen Bedard Lab, Department of Pathology. December 6th, 2012
  • 2. Overview
      • Introduction
          • Considerations for RNA-Seq
          • Computational Resources/Options
      • Analysis of RNA-Seq Data
          • Principle of analyzing RNA-Seq
          • General RNA-Seq analysis pipeline
          • “Tuxedo” pipeline
          • Alternative tools
      • Resources
          • http://www.slideshare.net/DanGaston
  • 3. Before You Start: Considerations for RNA-Seq Analysis
      • Next-Generation Sequencing experiments generate a lot of raw data
          • 25-40 GB/sample/replicate for most transcriptomes/tissue types/cell lines/conditions
      • Analysing the data requires more computational resources than many labs routinely have available
          • Several processing “cores” (8 minimum)
          • Large amount of RAM (16GB+)
          • Large amount of disk storage space for intermediate and final results files in addition to raw FastQ files
          • Can take a significant amount of time per sample (days to a week)
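As a rough planning aid, the 25-40 GB per sample per replicate figure can be multiplied out before an experiment is ordered. This is only a back-of-the-envelope sketch: the function name `estimate_storage_gb` is made up for illustration, and the result covers raw FastQ files only, not intermediate or final results.

```shell
# Rough raw-data storage estimate from the 25-40 GB/sample/replicate range.
# Usage: estimate_storage_gb <number_of_samples> <replicates_per_sample>
# (hypothetical helper, not part of any pipeline tool)
estimate_storage_gb() {
    samples=$1
    replicates=$2
    low=$(( samples * replicates * 25 ))
    high=$(( samples * replicates * 40 ))
    echo "${low}-${high} GB of raw FastQ (intermediate and final files are extra)"
}

# Example: 2 samples with 3 biological replicates each
estimate_storage_gb 2 3
```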
  • 4. Computational Options
      • Local (Large workstation or cluster)
      • Remote Computer/Cluster (ComputeCanada/ACENet)
      • Cloud Services
          • Amazon Web Services
      • Cloud/Local Bioinformatics “Portals”
          • Galaxy
          • Chipster
          • GenomeSpace
          • CloudBioLinux
          • CloudMan
          • BioCloudCentral (Interface to CloudMan, CloudBioLinux, etc)
  • 5. RNA-Seq Analysis Workflow
  • 6. So I Ran an RNA-Seq Experiment. Now What?
      • Need to go from raw “read” data to gene expression data
      • We now have:
          • De-multiplexed fastq files for each individual sample and replicate
      • We want lists of:
          • Differentially expressed genes/transcripts
          • Potentially novel genes/transcripts
          • Potentially novel splice junctions
          • Potential fusion events
      • Organize your data, programs, and additional resources (discussed later)
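One possible way to organize a project before starting is sketched below. Every directory name is an illustrative assumption rather than a layout the slides prescribe; the point is only to keep raw data, references, and each stage's output apart.

```shell
# Hypothetical project layout; all directory names are illustrative assumptions.
make_project() {
    project_dir=$1
    mkdir -p "$project_dir/raw_fastq" \
             "$project_dir/reference" \
             "$project_dir/alignments" \
             "$project_dir/assemblies" \
             "$project_dir/diff_expression" \
             "$project_dir/logs"
}

# Example: build the layout in a throwaway directory and list it
demo_root=$(mktemp -d)
make_project "$demo_root/rnaseq_project"
ls "$demo_root/rnaseq_project"
```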
  • 7. What is the Raw Data?
      • A single lane of Illumina HiSeq 2000 sequencing produces ~250-300 million “reads” of sequencing
      • Can be paired or single-end sequencing (paired-end preferred)
      • Various sequencing lengths (number of sequencing cycles)
          • 2x50bp, 2x75bp, 2x100bp, 2x150bp most common
          • Cost versus amount of usable data
      • True raw data is actually image data with colour intensities that are then converted into text (A, C, G, T and quality scores) called FastQ
  • 8. FastQ
      • Example record:
          @M00814:1:000000000-A2472:1:1101:14526:1866 1:N:0:1
          TGGAACATGCGTGCGNAGCCGAAAGTGTGTCCCCACTTTCATATGAAGAAAGAC
          +
          ?????BBBBB9?+<+#,,6C>CAEEHHHFFHHHHFEHHHHHHHHHGHHHHHHHHHH
      • FASTA-like text format: each read has a header line, a sequence line, a “+” separator line, and a line of quality scores, one per sequenced base
      • In Paired-End Sequencing there is one file for each “end” of sequencing (Primer 1 and Primer 2)
      • Quality scores are encoded with a single character representing a number. The most common encoding scheme is called Phred33. Old Illumina software used Phred64 but the current generation does not. (Illumina 1.3 - 1.7 is Phred64)
          • Often needs to be set explicitly in alignment programs
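To make the quality encoding concrete, here is a minimal sketch of how one quality character maps to a number: take its ASCII code and subtract the offset (33 for Phred33, 64 for Phred64). The helper names `phred_score` and `count_reads` and the two-read demo file are made up for illustration.

```shell
# Decode one quality character into its Phred score.
# offset is 33 for Phred33 (default here) and 64 for Phred64.
phred_score() {
    qual_char=$1
    offset=${2:-33}
    ascii=$(printf '%d' "'$qual_char")   # ASCII code of the character
    echo $(( ascii - offset ))
}

# Count reads in a FastQ file: every record is exactly four lines.
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

phred_score '?'        # ASCII 63 - 33 = 30
phred_score 'H'        # ASCII 72 - 33 = 39
phred_score 'h' 64     # ASCII 104 - 64 = 40

# Tiny made-up FastQ file with two reads
demo_fastq=$(mktemp)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\n??#H\n' > "$demo_fastq"
count_reads "$demo_fastq"
```

A score of 30 corresponds to an estimated error probability of 1 in 1000 for that base, which is why decoding with the wrong offset can make good reads look unusable, or the reverse.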
  • 9. General Analysis Pipeline: Short-Read Alignment → Transcript Reconstruction → Abundance/Expression → Visualization/Statistics
  • 10. What is Short-Read Alignment? [Figure: paired-end reads aligned to a section of a reference chromosome]
  • 11. What’s Special About RNA-Seq
      • Normally the distance between paired reads and the size of insertions are both constrained
      • With RNA-Seq the source is mRNA, not genomic DNA
      • Mapping to a reference genome, not transcriptome
      • Need to account for introns: pairs can be much further apart than expected
  • 12. Transcript Reconstruction: Intron/Exon Junctions [Figure: Exon 1, Exon 2, Exon 3]
  • 13. Transcript Reconstruction: Alternative Splicing [Figure: Exon 1, Exon 2, Exon 3]
  • 14. Transcript Reconstruction: Novel Exon/Transcript Identification [Figure: Exon 1, Exon 2, Exon X, Exon 3]
  • 15. Transcript Reconstruction: Fusion Transcripts [Figure: Exon 1, Exon 2, Exon 3, Gene 2 Exon 4]
  • 16. Transcript Reconstruction: Differential Expression [Figure: Sample 1, Sample 2]
  • 17. What else can we look for?
      • Combine with ChIP-Seq to differentiate various levels of regulation
      • Integrative analyses to identify common elements (micro-RNA, transcription factors, molecular pathways, protein-DNA interactions)
      • Combine with whole-exome or whole-genome sequencing
          • Allele-specific expression
          • Allelic imbalance
          • LOH (loss of heterozygosity)
          • Large genomic rearrangements/abnormalities
  • 18. Caution
      • Need to differentiate between real data and artifacts
      • Differentiate between biologically meaningful data and “noise”
      • Sample selection, experimental design, biological replication (not technical replication), and robust statistical methods are important
      • Looking at your data “by eye” is useful, but needs to be backed up by stats
      • Avoid experimenter bias
      • Try to be holistic in your analyses
  • 19. Visualizing with IGV
  • 20. “Tuxedo” Analysis Pipeline: Bowtie → TopHat → Cufflinks (run once per sample) → Cuffcompare / Cuffmerge → Cuffdiff → CummeRbund
  • 21. What you need before you begin
      • The individual programs
      • Reference genome (hg19/GRCh37)
          • FASTA file of whole genome, each chromosome is a sequence entry
      • Bowtie2 Index files for reference genome
          • Index files are compressed representations of the genome that allow alignment to the reference efficiently and in parallel
      • Gene/Transcript annotation reference (UCSC, Ensembl, ENCODE, etc)
          • Gives information about the location of genes and important features such as location of introns, exons, splice junctions, etc
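A quick sanity check on a reference FASTA is to count its sequence entries, since each one begins with a `>` header line. The sketch below uses a tiny made-up two-entry file; the helper name `count_fasta_entries` is an assumption. The Bowtie2 index itself is built with `bowtie2-build`, shown only as a comment because it needs the real genome and Bowtie2 installed.

```shell
# Count sequence entries in a FASTA file (one ">" header line per entry).
count_fasta_entries() {
    grep -c '^>' "$1"
}

# Tiny made-up FASTA standing in for a whole-genome file
demo_fasta=$(mktemp)
printf '>chr1\nACGTACGT\n>chr2\nGGCCTTAA\n' > "$demo_fasta"
count_fasta_entries "$demo_fasta"

# Building the index (not run here; paths are placeholders):
#   bowtie2-build /genomes/hg19/genome.fa /genomes/hg19/genome
```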
  • 22. Step 0: Bowtie
      • Bowtie forms the core of TopHat for short-read alignment
      • Initial mapping of a subset of reads (~5 million) to a reference transcriptome to estimate the inner-distance mean/median and standard deviation for TopHat
      • This info can be retrieved from the library prep stage but is actually better to estimate from your final data
      • Sample command-line: bowtie2 -x /path/to/transcriptome_ref.fa -q --phred33 --local -p 8 -1 read1.fastq -2 read2.fastq -S output.sam
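One way to turn the Step 0 SAM file into the numbers TopHat wants is sketched below. It assumes that column 9 of each alignment (TLEN in the SAM specification) spans the whole fragment, that both mates have the same length, and that soft-clipping from local alignment is negligible; real data may need stricter filtering. The function name and the four-line demo file are made up for illustration.

```shell
# Estimate inner-distance mean and standard deviation from a SAM file.
# Assumption: inner distance = TLEN - 2 * read length for each aligned pair.
inner_distance_stats() {
    awk -F '\t' '
        /^@/ { next }                 # skip SAM header lines
        $9 > 0 {                      # leftmost mate of each pair has a positive TLEN
            inner = $9 - 2 * length($10)
            n++; sum += inner; sumsq += inner * inner
        }
        END {
            if (n == 0) { print "no paired alignments found"; exit 1 }
            mean = sum / n
            var = sumsq / n - mean * mean
            if (var < 0) var = 0
            printf "mean=%.1f sd=%.1f n=%d\n", mean, sqrt(var), n
        }
    ' "$1"
}

# Made-up SAM file: two pairs of 10 bp reads with fragment sizes 60 and 80
demo_sam=$(mktemp)
printf '@HD\tVN:1.0\tSO:unsorted\n' > "$demo_sam"
printf 'r1\t99\ttx1\t100\t42\t10M\t=\t150\t60\tACGTACGTAC\tIIIIIIIIII\n' >> "$demo_sam"
printf 'r1\t147\ttx1\t150\t42\t10M\t=\t100\t-60\tACGTACGTAC\tIIIIIIIIII\n' >> "$demo_sam"
printf 'r2\t99\ttx1\t200\t42\t10M\t=\t270\t80\tACGTACGTAC\tIIIIIIIIII\n' >> "$demo_sam"
printf 'r2\t147\ttx1\t270\t42\t10M\t=\t200\t-80\tACGTACGTAC\tIIIIIIIIII\n' >> "$demo_sam"

inner_distance_stats "$demo_sam"
```

Mapping to the transcriptome rather than the genome matters here: introns would otherwise inflate the apparent fragment size.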
  • 23. Step 1: Tophat
      • Tophat is a short-read mapper capable of aligning reads to a reference genome and finding exon-exon junctions
      • Can be provided a list of known junctions, do de novo junction discovery, or both
      • Also has an option to find potential fusion-gene transcripts
      • Sample command-line: tophat -p 8 -G gene_annotations.gtf -r inner_distance --mate-std-dev std_dev -o Output.dir /path/to/bowtie2indexes/genome read1.fastq read2.fastq
  • 24. About TopHat Options
      • -o: The path/name of a directory in which to place all of the TopHat output files
      • -G: Path to and name of an annotation file so TopHat can be aware of known junctions
      • Reference Genome: Given as path and “base name.” If the reference genome is saved as /genomes/hg19/genome.fa then the relevant path and basename would be /genomes/hg19/genome
      • Inner Distance = Fragment size - (2 x read length)
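The inner-distance formula and the “base name” rule are both simple enough to check by hand. The two helpers below are a minimal sketch with made-up names, using example fragment sizes that are not from the slides.

```shell
# Inner Distance = Fragment size - (2 x read length)
inner_distance() {
    fragment_size=$1
    read_length=$2
    echo $(( fragment_size - 2 * read_length ))
}

# Index base name: the FASTA path with its ".fa" extension removed
index_basename() {
    echo "${1%.fa}"
}

inner_distance 300 100                     # 300 bp fragments, 2x100bp reads
inner_distance 250 75                      # 250 bp fragments, 2x75bp reads
index_basename /genomes/hg19/genome.fa
```

A negative result is possible and simply means the two reads of a pair overlap.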
  • 25. TopHat: Additional options
      • --no-mixed
      • --b2-very-sensitive
      • --fusion-search
      • Running the above options on 6 processing cores on one sample took ~26 hours
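The sketch below shows where these three options would sit in the Step 1 sample command. It only prints the command (a dry run) so it can be inspected before committing to a run of a day or more; the function name is made up, and the file names are the same placeholders used in the sample command.

```shell
# Print (do not run) a TopHat command with the additional options added.
# $1 = inner distance, $2 = its standard deviation (e.g. estimated in Step 0)
tophat_command() {
    echo "tophat -p 8 -G gene_annotations.gtf -r $1 --mate-std-dev $2" \
         "--no-mixed --b2-very-sensitive --fusion-search" \
         "-o Output.dir /path/to/bowtie2indexes/genome read1.fastq read2.fastq"
}

tophat_command 100 25
```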
  • 26. Step 2: Cufflinks
      • Cufflinks performs gene and transcript discovery
      • Many possible options
          • No novel discovery, use only a reference group of transcripts
          • de novo mode (shown below, beginner's default)
          • Mixed Reference-Guided Assembly and de novo discovery
          • Options for more robust normalization methods and error correction
      • Sample command-line: cufflinks -p 8 -o Cufflinks.out/ accepted_hits.bam
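The three modes differ only in how the annotation file is passed: Cufflinks uses `-G` to quantify against reference transcripts only and `-g` for reference-guided assembly that still allows novel discovery. The dry-run helper below is a sketch with made-up mode names; it prints the command rather than running it.

```shell
# Print (do not run) a Cufflinks command for one of three modes.
# Mode names are illustrative; -G and -g are the Cufflinks annotation options.
cufflinks_command() {
    mode=$1
    case $mode in
        reference) extra="-G gene_annotations.gtf" ;;   # known transcripts only
        guided)    extra="-g gene_annotations.gtf" ;;   # reference-guided plus novel
        denovo)    extra="" ;;                          # no annotation supplied
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    echo "cufflinks -p 8 ${extra:+$extra }-o Cufflinks.out/ accepted_hits.bam"
}

cufflinks_command denovo
cufflinks_command guided
cufflinks_command reference
```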
  • 27. Step 3: Cuffmerge
      • Merges sample assemblies, estimates abundances, cleans up the transcriptome
      • Sample command-line: cuffmerge -g gene_annotations.gtf -s /path/to/genome.fa -p 8 text_list_of_assemblies.txt
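The text list of assemblies is a plain file with one path per line, each pointing at a `transcripts.gtf` written by Cufflinks. A minimal sketch for generating it follows, assuming a hypothetical layout with one Cufflinks output directory per sample.

```shell
# Write the list of per-sample assemblies that cuffmerge reads.
# $1 = directory holding per-sample Cufflinks outputs (assumed layout)
# $2 = list file to write
write_assembly_list() {
    find "$1" -name transcripts.gtf | sort > "$2"
}

# Demo with empty stand-in files for two samples
demo_root=$(mktemp -d)
mkdir -p "$demo_root/sample1" "$demo_root/sample2"
: > "$demo_root/sample1/transcripts.gtf"
: > "$demo_root/sample2/transcripts.gtf"

write_assembly_list "$demo_root" "$demo_root/text_list_of_assemblies.txt"
cat "$demo_root/text_list_of_assemblies.txt"
```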
  • 28. Step 4: Cuffdiff
      • Calculates expression levels of transcripts in samples
      • Estimates differential expression between samples
      • Calculates significance value for difference in expression levels between samples
      • Also groups together transcripts that all start from the same start site. Identifies genes under transcriptional/post-transcriptional regulation
      • Sample command-line: cuffdiff -o Output.dir/ -b /path/to/genome.fa -p 8 -L Cond1,Cond2 -u merged.gtf cond1.bam cond2.bam
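With biological replicates, Cuffdiff takes the BAM files of one condition as a single comma-separated argument, with conditions separated by spaces. The sketch below builds such a command as a dry run; the replicate file names and the `join_replicates` helper are made up for illustration.

```shell
# Join replicate BAM files of one condition with commas.
join_replicates() {
    old_ifs=$IFS
    IFS=,
    joined="$*"        # "$*" joins arguments with the first character of IFS
    IFS=$old_ifs
    echo "$joined"
}

cond1=$(join_replicates cond1_rep1.bam cond1_rep2.bam cond1_rep3.bam)
cond2=$(join_replicates cond2_rep1.bam cond2_rep2.bam cond2_rep3.bam)

# Print (do not run) the resulting command
echo "cuffdiff -o Output.dir/ -b /path/to/genome.fa -p 8 -L Cond1,Cond2 -u merged.gtf $cond1 $cond2"
```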
  • 29. Cuffdiff Output
      • FPKM values for genes, isoforms, CDS, and groups of genes from the same Transcription Start Site for each condition
          • FPKM (Fragments Per Kilobase of transcript per Million mapped fragments) is the normalized “expression value” used in RNA-Seq
      • Count files of the above
      • As above but on a per replicate basis
      • Differential expression test results for genes, CDS, primary transcripts, spliced transcripts on a per sample (condition) comparison basis (each possible X vs Y comparison unless otherwise specified)
          • Includes identifiers, expression levels, expression difference values, p-values, q-values, and yes/no significance field
      • Differential splicing tests, differential coding output, differential promoter use
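To show what the FPKM normalization does, here is the textbook definition worked through for one transcript. This is a simplified sketch: Cufflinks and Cuffdiff estimate FPKM with a statistical model rather than this direct ratio, and the function name and example numbers are made up.

```shell
# Textbook FPKM: fragments * 10^9 / (transcript length in bp * total mapped fragments)
# $1 = fragments mapped to the transcript
# $2 = transcript length in bp
# $3 = total mapped fragments in the sample
fpkm() {
    awk -v c="$1" -v l="$2" -v n="$3" 'BEGIN { printf "%.2f\n", c * 1e9 / (l * n) }'
}

# 500 fragments on a 2000 bp transcript, 20 million mapped fragments in total
fpkm 500 2000 20000000
# Same fragment count on a transcript twice as long gives half the FPKM
fpkm 500 4000 20000000
```

Dividing by length and by sequencing depth is what makes values comparable between transcripts and between samples.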
  • 30. Step 5: CummeRbund (R) (figure from Trapnell et al., 2012)
  • 31. Visualization (figure from Trapnell et al., 2012)
  • 32. Help!
      • Command X failed
          • Keep calm
          • Don't blame the computer
          • Check input files and formats
          • Google/SeqAnswers/Biostars
      • Results look “weird”
          • Check the raw data
          • Re-check the commands you used
      • RNA-Seq analysis is an experiment:
          • Maintain good records of what you did, like any other experiment
  • 33. Alternative tools
      • Alternative short-read alignment
          • BWA -> Cannot align RNA-Seq reads across splice junctions (it is not splice-aware)
          • GSNAP
          • STAR -> Requires minimum of 30GB of RAM
      • Alternative transcript reconstruction
          • STAR
          • Scripture
      • Alternative Expression/Abundance Estimation
          • DESeq
          • DEXSeq
          • edgeR
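One lightweight way to keep records is to route every analysis command through a small wrapper that writes the time and the exact command line to a log file before running it. The `run_logged` helper below is a made-up sketch, demonstrated with `echo` standing in for a real pipeline step.

```shell
# Append a timestamped record of a command to a log file, then run the command.
# $1 = log file, remaining arguments = the command and its arguments
run_logged() {
    log_file=$1
    shift
    printf '%s\t%s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$log_file"
    "$@"
}

demo_log=$(mktemp)
run_logged "$demo_log" echo "pretend this is a tophat run"
cat "$demo_log"
```

Recording program versions alongside the commands is worth doing too, since results can change between releases.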
  • 34. Resources
  • 35. Software Websites
      • TopHat http://tophat.cbcb.umd.edu
      • Cufflinks http://cufflinks.cbcb.umd.edu
      • STAR http://gingeraslab.cshl.edu/STAR/
      • Scripture http://www.broadinstitute.org/software/scripture/
      • Bioconductor http://www.bioconductor.org/
          • DEXSeq
          • DESeq
          • edgeR
  • 36. Additional Resources
      • Trapnell et al. (2012) Differential gene and transcript expression analysis of RNA-Seq experiments with TopHat and Cufflinks. Nature Protocols 7(3)
      • www.biostars.org (Q&A site)
      • SeqAnswers Forum
      • GENCODE Gene Annotations
          • http://www.gencodegenes.org/
          • ftp://ftp.sanger.ac.uk/pub/gencode
      • TopHat / Illumina iGenomes References and Annotation Files:
          • http://tophat.cbcb.umd.edu/igenomes.html
  • 37. Acknowledgements
      • Dalhousie University
          • Dr. Graham Dellaire
          • Dr. Karen Bedard
          • Dr. Chris McMaster
          • Dr. Andrew Orr
          • Dr. Conrad Fernandez
          • Dr. Marissa Leblanc
          • Mat Nightingale
          • Bedard Lab
          • IGNITE
      • Montgomery Lab, Stanford
          • Dr. Stephen Montgomery
      • BHCRI CRTP Skills Acquisition Program
  • 38. Experimental Data for Genes of Interest
  • 39. UCSC Genome Browser
  • 40. UCSC Genome Browser
  • 41. MetabolicMine
  • 42. MetabolicMine
  • 43. NCI Pathway Interaction Database
  • 44. The Cancer Genome Atlas
      • Identify cancer subtypes, actionable driver mutations, personalized/genomic/precision medicine
      • More than $275 million in funding from NIH
      • Multiple research groups around the world
      • 20 cancer types being studied
      • 205 publications from the research network since late 2008
  • 45. The Cancer Genome Atlas
  • 46. The Cancer Genome Atlas
  • 47. The Cancer Genome Atlas
  • 48. UNIX/Linux command-line basics
  • 49. What is UNIX?
      • UNIX and UNIX-Like are a family of computer operating systems originally developed at AT&T's Bell Labs
          • Apple OS X and iOS (UNIX)
          • Linux (UNIX-Like)
  • 50. Intro
      • The terminal (command-line) isn't THAT scary
      • Maintaining a Linux environment can be challenging, but most of these analyses can also be done in an OS X environment
      • Installing software can sometimes be cumbersome and confusing; however, many standard bioinformatics programs and software libraries are fairly easy to set up
      • Working with the programs from the command-line will often give you a better appreciation for what the program does and what it requires
  • 51. Terms to Know
      • Path: The location of a directory, file, or command on the computer
          • Example: /Users/dan (OS X home directory)
  • 52. The Commands You Need to Know
      • ls: Lists the files in the current directory. Directories (folders) are just a special type of file themselves
      • cd: Change directory
      • pwd: View the full path of the directory you are currently in
      • cat: Displays the contents of a file on the terminal screen
      • head / tail: Display the top or bottom contents of a file on the screen, respectively
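A short practice session with these commands might look like the sketch below. It works in a throwaway directory created by `mktemp`, and the file `example.txt` is made up for the demonstration.

```shell
# Create a scratch directory and move into it
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Show the full path of the current directory
pwd

# Create a small five-line file to look at
printf 'line %s\n' 1 2 3 4 5 > example.txt

# List the files in the current directory
ls

# Show the whole file, then only its first and last two lines
cat example.txt
head -n 2 example.txt
tail -n 2 example.txt
```

On real FastQ files, `head` is the safer habit: `cat` on a multi-gigabyte file will scroll for a very long time.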
