Dgaston dec-06-2012

Bioinformatics: Intro to RNA-Seq Analysis

Integrated Learning Session
Daniel Gaston, PhD
Dr. Karen Bedard Lab, Department of Pathology

December 6th, 2012

Overview
 Introduction
 Considerations for RNA-Seq
 Computational Resources/Options
 Analysis of RNA-Seq Data
 Principle of analyzing RNA-Seq
 General RNA-Seq analysis pipeline
 “Tuxedo” pipeline
 Alternative tools
 Resources
 http://www.slideshare.net/DanGaston

Before You Start: Considerations for RNA-Seq Analysis
 Next-Generation Sequencing experiments generate
a lot of raw data
 25-40 GB/sample/replicate for most transcriptomes/tissue
types/cell lines/conditions

 Require more computational resources than many
labs routinely have available for analysing data
 At minimum several processing “cores” (8 minimum)
 Large amount of RAM (16GB+)
 Large amount of disk storage space for intermediate and
final results files in addition to raw FastQ files
 Can take a significant amount of time per sample (days to
a week)
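
As a rough planning aid, the per-sample figure above can be turned into a back-of-the-envelope storage estimate. This is only a minimal sketch: the number of conditions and replicates is a made-up example, and the result covers raw FastQ files only, not intermediate or final results.

```shell
# Hypothetical experiment: 2 conditions x 3 biological replicates each.
conditions=2
replicates=3
n_samples=$((conditions * replicates))

# Raw data range per sample/replicate quoted on the slide (GB).
gb_low=25
gb_high=40

raw_low=$((n_samples * gb_low))
raw_high=$((n_samples * gb_high))
echo "Raw FastQ storage for $n_samples samples: $raw_low-$raw_high GB"
```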

Computational Options
 Local (Large workstation or cluster)
 Remote Computer/Cluster
(ComputeCanada/ACENet)
 Cloud Services
 Amazon Web Services
 Cloud/Local Bioinformatics “Portals”
 Galaxy
 Chipster
 GenomeSpace
 CloudBioLinux
 CloudMan
 BioCloudCentral (Interface to CloudMan, CloudBioLinux,
etc)

So I Ran an RNA-Seq Experiment. Now What?
 Need to go from raw “read” data to gene expression
data
 We now have:
 De-multiplexed fastq files for each individual sample and
replicate
 We want lists of:
 Differentially expressed genes/transcripts
 Potentially novel genes/transcripts
 Potentially novel splice junctions
 Potential fusion events
 Organize your data, programs, and additional
resources (discussed later)
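
One possible way to organize a project is a fixed directory layout created before any analysis starts. The directory names below are hypothetical and only illustrate the idea; the sketch builds the layout in a temporary location so it can be tried safely.

```shell
# Hypothetical project layout; rename the directories to suit your experiment.
project="$(mktemp -d)/rnaseq_project"

mkdir -p "$project/raw_fastq" \
         "$project/reference" \
         "$project/alignments" \
         "$project/assemblies" \
         "$project/expression" \
         "$project/logs"

# List what was created.
ls "$project"
```

Keeping raw FastQ files in their own directory makes it harder to overwrite them by accident with intermediate results.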

What is the Raw Data?
 A single lane of Illumina HiSeq 2000 sequencing
produces ~ 250 – 300 million “reads” of sequencing
 Can be paired or single-end sequencing (paired-end
preferred)
 Various sequencing lengths (number of sequencing
cycles)
 2x50bp, 2x75bp, 2x100bp, 2x150bp most common
 Cost versus amount of usable data
 True raw data is actually image data with colour
intensities that are then converted into a text format
(bases A, C, G, T and quality scores) called FastQ

FastQ
@M00814:1:000000000-A2472:1:1101:14526:1866 1:N:0:1
TGGAACATGCGTGCGNAGCCGAAAGTGTGTCCCCACTTTCATATGAAGAAAGAC
+
?????BBBBB9?+<+#,,6C>CAEEHHHFFHHHHFEHHHHHHHHHGHHHHHHHHHH

 FASTA-like text format with four lines per read: a header
line, the sequence, a “+” separator line, and quality
scores for every base in the sequenced read
 In Paired-End Sequencing there is one file for each “end”
of sequencing (Primer 1 and Primer 2)
 Quality scores are encoded with a single character
representing a number. The most common encoding scheme
is called Phred33 (ASCII code minus 33). Old Illumina
software used Phred64 but the current generation does not
(Illumina 1.3 – 1.7 is Phred64)
 Often needs to be set explicitly in alignment programs
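
The four-lines-per-read layout and the Phred33 encoding can be checked by hand from the command line. The sketch below uses a tiny made-up FastQ file (not real data) so that it is self-contained; it counts reads and decodes a single quality character.

```shell
# Write a tiny, made-up FastQ file (two reads) so the example is self-contained.
workdir=$(mktemp -d)
cat > "$workdir/example.fastq" <<'EOF'
@read1
ACGTN
+
IIII#
@read2
GGCCA
+
?????
EOF

# Every read occupies exactly four lines, so reads = lines / 4.
n_lines=$(wc -l < "$workdir/example.fastq")
n_reads=$((n_lines / 4))
echo "reads: $n_reads"

# Decode one Phred33 quality character: ASCII code minus 33.
qchar='I'
ascii=$(printf '%d' "'$qchar")
phred=$((ascii - 33))
echo "quality '$qchar' = Q$phred"
```

Under Phred64 the same character would decode to a different score (ASCII code minus 64), which is why the encoding has to match what the aligner expects.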

General Analysis Pipeline

Short-Read Alignment -> Transcript Reconstruction -> Abundance/Expression -> Visualization / Statistics

What is Short-Read Alignment?

[Figure: Paired-End Reads aligned to a Section of Reference Chromosome]

What’s Special About RNA-Seq

 Normally the distance between paired reads and the size
of insertions are both constrained
 With RNA-Seq the source is mRNA, not genomic
DNA
 Mapping to a reference genome, not transcriptome
 Need to account for introns, pairs can be much
further apart than expected

Transcript Reconstruction: Intron/Exon
Junctions

[Figure: Exon 1, Exon 2, Exon 3]

Transcript Reconstruction: Alternative
Splicing

[Figure: Exon 1, Exon 2, Exon 3]

Transcript Reconstruction: Novel
Exon/Transcript Identification

[Figure: Exon 1, Exon 2, Exon X, Exon 3]

Transcript Reconstruction: Fusion
Transcripts

[Figure: Exon 1, Exon 2, Exon 3; Gene 2 Exon 4]

Transcript Reconstruction: Differential
Expression

[Figure: Sample 1, Sample 2]

What else can we look for?
 Combine with ChIP-Seq to differentiate various
levels of regulation
 Integrative analyses to identify common elements
(micro-RNA, transcription factors, molecular
pathways, protein-DNA interactions)
 Combine with whole-exome or whole-genome
sequencing
 Allele-specific expression
 Allelic imbalance
 LOH
 Large genomic rearrangements/abnormalities

Caution
 Need to differentiate between real data and artifacts
 Differentiate between biologically meaningful data
and “noise”
 Sample selection, experimental design, biological
replication (not technical replication), and robust
statistical methods are important
 Looking at your data “by eye” is useful, but needs to
be backed up by stats
 Avoid experimenter bias
 Try to be holistic in your analyses

“Tuxedo” Analysis Pipeline

Bowtie -> TopHat -> Cufflinks (Cufflinks, Cuffcompare, Cuffmerge, Cuffdiff) -> CummeRbund

What you need before you begin
 The individual programs
 Reference genome (hg19/GRCh37)
 FASTA file of whole genome, each chromosome is a
sequence entry
 Bowtie2 Index files for reference genome
 Index files are compressed representations of the
genome that allow alignment to the reference efficiently
and in parallel
 Gene/Transcript annotation reference (UCSC,
Ensembl, ENCODE, etc)
 Gives information about the location of genes and
important features such as location of introns, exons,
splice junctions, etc

Step 0: Bowtie
 Bowtie forms the core of TopHat for short-read
alignment
 Initial mapping of a subset of reads (~5 million) to a
reference transcriptome to estimate the inner-distance
mean/median and standard deviation for TopHat
 This information can be retrieved from the library prep
stage, but it is better to estimate it from your final data
 Sample command-line:

bowtie2 -x /path/to/transcriptome_ref -q --phred33 --local -p 8 -1 read1.fastq -2 read2.fastq -S output.sam
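
Once a subset of reads has been mapped, the inner-distance mean and standard deviation can be estimated from the template length stored in column 9 (TLEN) of the SAM output. The following is a simplified sketch, not a definitive method: the SAM records are made up, a fixed read length of 50 bp is assumed, and soft-clipping from local alignment is ignored.

```shell
# Made-up SAM records for three read pairs (50 bp reads); only TLEN matters here.
workdir=$(mktemp -d)
sam="$workdir/subset.sam"
{
  printf '@HD\tVN:1.0\tSO:unsorted\n'
  printf 'pair1\t99\ttx1\t100\t60\t50M\t=\t250\t200\t*\t*\n'
  printf 'pair1\t147\ttx1\t250\t60\t50M\t=\t100\t-200\t*\t*\n'
  printf 'pair2\t99\ttx1\t400\t60\t50M\t=\t570\t220\t*\t*\n'
  printf 'pair2\t147\ttx1\t570\t60\t50M\t=\t400\t-220\t*\t*\n'
  printf 'pair3\t99\ttx2\t10\t60\t50M\t=\t140\t180\t*\t*\n'
  printf 'pair3\t147\ttx2\t140\t60\t50M\t=\t10\t-180\t*\t*\n'
} > "$sam"

# Skip header lines; keep the mate with positive TLEN so each pair counts once.
# Inner distance = template length - (2 x read length).
stats=$(awk -F'\t' -v read_len=50 '
  $1 !~ /^@/ && $9 > 0 {
    inner = $9 - 2 * read_len
    n++; sum += inner; sumsq += inner * inner
  }
  END {
    mean = sum / n
    sd = sqrt((sumsq - n * mean * mean) / (n - 1))   # sample standard deviation
    printf "%.0f %.0f\n", mean, sd
  }' "$sam")

echo "inner distance mean and standard deviation: $stats"
```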

Step 1: TopHat
 TopHat is a short-read mapper capable of aligning
reads to a reference genome and finding exon-exon
junctions
 Can be provided a list of known junctions, do de
novo junction discovery, or both
 Also has an option to find potential fusion-gene
transcripts

tophat -p 8 -G gene_annotations.gtf -r inner_distance --mate-std-dev std_dev -o Output.dir /path/to/bowtie2indexes/genome read1.fastq read2.fastq

About TopHat Options
 -o: The path/name of a directory in which to place all
of the TopHat output files
 -G: The path to and name of an annotation file, so
that TopHat is aware of known junctions
 Reference Genome: Given as path and “base
name.” If reference genome saved as:
/genomes/hg19/genome.fa then the relevant path
and basename would be /genomes/hg19/genome
 Inner Distance = Fragment size – (2 x read length)
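
The inner-distance formula can be worked through with shell arithmetic and substituted into the TopHat command. This is a minimal sketch: the fragment size and standard deviation are hypothetical values, and the command is only printed (a dry run), not executed.

```shell
fragment_size=300   # hypothetical mean fragment size in bp
read_length=100     # 2x100bp paired-end reads
std_dev=20          # hypothetical standard deviation of the inner distance

# Inner Distance = Fragment size - (2 x read length)
inner_distance=$((fragment_size - 2 * read_length))

# Print the command instead of running it, so it can be checked first.
cmd="tophat -p 8 -G gene_annotations.gtf -r $inner_distance --mate-std-dev $std_dev -o Output.dir /path/to/bowtie2indexes/genome read1.fastq read2.fastq"
echo "$cmd"
```

Note that the inner distance comes out negative when the two reads of a pair overlap, for example a 150 bp fragment sequenced as 2x100bp.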

TopHat: Additional options
 --no-mixed
 --b2-very-sensitive
 --fusion-search
 Running with the above options on 6 processing cores
took ~26 hours for one sample

Step 2: Cufflinks
 Cufflinks performs gene and transcript discovery
 Many possible options
 No novel discovery, use only a reference group of
transcripts
 de novo mode (shown below, beginner's default)
 Mixed Reference-Guided Assembly and de novo
discovery.
 Options for more robust normalization methods and error
correction

cufflinks -p 8 -o Cufflinks.out/ accepted_hits.bam

Step 3: Cuffmerge
 Merges sample assemblies, estimates abundances, and
cleans up the transcriptome

cuffmerge -g gene_annotations.gtf -s /path/to/genome.fa -p 8 text_list_of_assemblies.txt
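
The text list passed to cuffmerge is a plain file with one path per line, each pointing to the transcripts.gtf file that Cufflinks wrote for a sample. A minimal sketch of building it is shown here; the sample names and the per-sample output directories are hypothetical.

```shell
# Build the list in a temporary directory so the example is safe to try.
workdir=$(mktemp -d)
list="$workdir/text_list_of_assemblies.txt"

# Hypothetical sample names; assumes one Cufflinks output directory per sample.
for sample in Cond1_rep1 Cond1_rep2 Cond2_rep1 Cond2_rep2; do
  echo "Cufflinks.out/$sample/transcripts.gtf"
done > "$list"

cat "$list"
```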

Step 4: Cuffdiff
 Calculates expression levels of transcripts in
samples
 Estimates differential expression between samples
 Calculates significance value for difference in
expression levels between samples
 Also groups together transcripts that all start from the
same start site, to identify genes under
transcriptional/post-transcriptional regulation

cuffdiff -o Output.dir/ -b /path/to/genome.fa -p 8 -L Cond1,Cond2 -u merged.gtf cond1.bam cond2.bam

Cuffdiff Output
 FPKM values for genes, isoforms, CDS, and groups of
genes from same Transcription Start Site for each
condition
 FPKM (Fragments Per Kilobase of transcript per Million mapped fragments) is the normalized “expression value” used in RNA-Seq
 Count files of above
 As above but on a per replicate basis
 Differential expression test results for genes, CDS,
primary transcripts, spliced transcripts on a per sample
(condition) comparison basis (Each possible X vs Y
comparison unless otherwise specified)
 Includes identifiers, expression levels, expression difference
values, p-values, q-values, and yes/no significance field
 Differential splicing tests, differential coding output,
differential promoter use
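
The yes/no significance field makes it easy to pull out significant genes with standard command-line tools. The sketch below runs on a made-up, simplified table (real Cuffdiff files such as gene_exp.diff have more columns), and it finds the "significant" column by its header name rather than assuming a fixed position.

```shell
# Made-up, simplified differential expression table (tab-separated).
workdir=$(mktemp -d)
diff_file="$workdir/gene_exp.diff"
printf 'gene\tvalue_1\tvalue_2\tq_value\tsignificant\n' > "$diff_file"
printf 'GENE_A\t10.5\t42.0\t0.001\tyes\n' >> "$diff_file"
printf 'GENE_B\t7.2\t7.9\t0.8\tno\n' >> "$diff_file"
printf 'GENE_C\t0.5\t12.3\t0.02\tyes\n' >> "$diff_file"

# Locate the "significant" column from the header, then keep rows marked "yes".
awk -F'\t' '
  NR == 1 { for (i = 1; i <= NF; i++) if ($i == "significant") col = i; print; next }
  $col == "yes"
' "$diff_file" > "$workdir/significant_genes.txt"

cat "$workdir/significant_genes.txt"
```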

Step 5: CummeRbund (R)

Trapnell et al., 2012

Visualization

Trapnell et al., 2012

Help!
 Command X failed
 Keep calm
 Don't blame the computer
 Check input files and formats
 Google/SeqAnswers/Biostars
 Results look “weird”
 Check the raw data
 Re-check the commands you used
 RNA-Seq analysis is an experiment:
 Maintain good records of what you did, like any other
experiment
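
One simple way to keep records is to log every command with a timestamp before running it. The helper below is a hypothetical sketch (the name run_logged is made up for illustration), demonstrated with a harmless echo in place of a real analysis step.

```shell
# Log file for this analysis session (a temporary file in this example).
log_file=$(mktemp)

run_logged() {
  # Record the timestamp and the full command line, then run the command.
  printf '%s\t%s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$log_file"
  "$@"
}

# Stand-in for a real step such as a tophat or cufflinks run.
run_logged echo "pretend this is an analysis step"

# Review what was run and when.
cat "$log_file"
```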

Alternative tools
 Alternative short-read alignment
 BWA -> Not splice-aware, so it cannot align RNA-Seq reads across exon-exon junctions
 GSNAP
 STAR -> Requires minimum of 30GB of RAM
 Alternative transcript reconstruction
 STAR
 Scripture
 Alternative Expression/Abundance Estimation
 DESeq
 DEXSeq
 edgeR

Software Websites
 TopHat http://tophat.cbcb.umd.edu
 Cufflinks http://cufflinks.cbcb.umd.edu
 STAR http://gingeraslab.cshl.edu/STAR/
 Scripture http://www.broadinstitute.org/software/scripture/

 Bioconductor http://www.bioconductor.org/
 DEXSeq
 DESeq
 edgeR

Additional Resources
 Trapnell et al. (2012) Differential gene and transcript
expression analysis of RNA-seq experiments with TopHat
and Cufflinks. Nature Protocols 7(3): 562-578
 www.biostars.org (Q&A site)
 SeqAnswers Forum
 GENCODE Gene Annotations
 http://www.gencodegenes.org/
 ftp://ftp.sanger.ac.uk/pub/gencode
 TopHat / Illumina iGenomes References and
Annotation Files:
 http://tophat.cbcb.umd.edu/igenomes.html

Acknowledgements
 Dalhousie University
 Dr. Karen Bedard
 Dr. Chris McMaster
 Dr. Andrew Orr
 Dr. Conrad Fernandez
 Dr. Marissa Leblanc
 Mat Nightingale
 Bedard Lab
 IGNITE
 Dr. Graham Dellaire
 Montgomery Lab, Stanford
 Dr. Stephen Montgomery
 BHCRI CRTP Skills Acquisition Program

Experimental Data for Genes of Interest

NCI Pathway Interaction Database

The Cancer Genome Atlas
 Identify cancer subtypes, actionable driver
mutations, personalized/genomic/precision medicine
 More than $275 million in funding from NIH
 Multiple research groups around the world
 20 cancer types being studied
 205 publications from the research network since
late 2008

UNIX/Linux command-line basics

What is UNIX?
 UNIX and UNIX-Like are a family of computer
operating systems originally developed at AT&T's
Bell Labs
 Apple OS X and iOS (UNIX)
 Linux (UNIX-Like)

Intro
 The terminal (command-line) isn't THAT scary.
Maintaining a Linux environment can be challenging,
but most of these analyses can also be done in an
OS X environment
 Installing software can sometimes be cumbersome
and confusing, however many standard
bioinformatics programs and software libraries are
fairly easy to set-up
 Working with the programs from the command-line
will often give you a better appreciation for what the
program does and what it requires

Terms to Know
 Path: The location of a directory, file, or command on
the computer.
 Example: /Users/dan (OS X home directory)

The Commands You Need to Know
 ls: Lists the files in the current directory. Directories
(folders) are just a special type of file themselves
 cd: Change directory
 pwd: View the full path of the directory you are
currently in
 cat: Displays the contents of a file on the terminal
screen
 head / tail : Displays the top or bottom contents of a
file to the screen respectively
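
The commands above can be tried safely in a scratch directory. This short session is a sketch using made-up file and directory names; it creates a temporary directory, moves into it, and inspects a small text file.

```shell
# Create a scratch directory and move into it.
workdir=$(mktemp -d)
cd "$workdir"

# pwd: show the full path of the directory we are now in.
pwd

# Make a small file and a sub-directory to look at.
printf 'line1\nline2\nline3\nline4\nline5\n' > notes.txt
mkdir results

# ls: list the files (and directories) in the current directory.
ls

# cat: print the whole file; head/tail: print only the top or bottom lines.
cat notes.txt
head -n 2 notes.txt
tail -n 1 notes.txt

first=$(head -n 1 notes.txt)
last=$(tail -n 1 notes.txt)
echo "first line: $first, last line: $last"
```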
