RNA-seq Analysis

Mikael Huss

Bioinformatics scientist at WABI (Wallenberg Advanced Infrastructure for Bioinformatics), Science for Life Laboratory / DBB, Stockholm University

February 13, 2013

Omics, biology and diseases

[Figure: genomics, protein “parts list”, RNA profiles, protein profiles and interactomics feed into systems biology, which leads to pathways, molecular targets and diagnostics]

Approximate contents of talk

- Gene expression analysis in general; differences between RNA-seq and microarrays

- Typical workflow(s) for RNA-seq analysis

- Normalization issues

- Visualization

- Differential expression analysis

I have tried to include many references so you can go back to these slides for
reference afterwards

How DNA gets transcribed to RNA (and then translated to proteins) varies between e.g.

- Tissues
- Cell types
- Cell states
- Individuals

What can gene expression tell us?

Basic research

- How do gene expression patterns determine cellular identity? (tissues, cell types …)
- How does gene expression control early development in an embryo?
- What kinds of genes are expressed in response to specific stimuli (infections, smoking, environmental pollution, gym exercise …)?
- What kinds of genes do bacteria or other microorganisms express in the human gut / in soil / in oceans under different conditions?

… and much, much more …

What can gene expression tell us?

Diseases

- Which genes are over- (or under-)expressed in patients vs. healthy controls?
- Which genes are correlated to disease progression?
- Can markers of hidden disease be found by sequencing blood plasma?

Gene expression signatures for disease?

Hypothesis: Cell types are stable states in a “space” of gene expression patterns. Diseases (e.g. cancers) distort the gene expression so that the cell ends up in the wrong stable state.

Furusawa and Kaneko, Biology Direct 2009, 4:17

Can the research community find such patterns?

On-line prediction competitions, objectively scored by the organizers

- Diagnosing MS (multiple sclerosis), lung cancer, psoriasis, COPD (Swedish: KOL)
- Prognosticating breast cancer outcome

Human tissue RNA-seq data sets

Genotype-Tissue Expression project
http://commonfund.nih.gov/GTEx/

Illumina Human Body Map
accessed via ReCount database, bowtie-bio.sourceforge.net/recount/

Wang 2008 data set of ~15 human tissues
accessed via ReCount

RNA-seq Atlas
http://medicalgenomics.org/rna_seq_atlas

Human Protein Atlas
http://www.proteinatlas.org (tissue RNA-seq data not yet publicly released)

Tools for genome-scale gene expression measurements

Microarrays (ca. 1995)

- Sometimes called “gene chips”
- Based on hybridization

RNA sequencing (ca. 2008 in current form)

- Based on sampling

Typical (m)RNA-seq experiment

[Figure: (m)RNA-seq experiment, from “library” to reads. Source: http://cmb.molgen.mpg.de]

Alternative: rRNA depletion

There are various kits for depleting rRNA instead of selecting poly-A tailed RNA

Pluses:
- Can use for microorganisms that don’t have poly-A tails
- Thus, can use for simultaneous host/pathogen expression profiling
- Can find non-coding RNA

Minuses:
- Usually leaves in quite a lot of rRNA
- In practice, often variable efficiency between samples -> hard to compare results

Sequencing platforms

               ABI 3730xl          454 Life Sciences   SOLiD + Illumina   Pacific Biosciences, Oxford Nanopore etc.
Technology     Sanger sequencing   pyrosequencing                         single-molecule sequencing
Length/read    800 bp              400 bp              100 bp             20 000+ bp
Reads/run      96                  1 million           2 billion          5 million
Bases/run      60 kbp              400 Mbp             500 Gbp            100 Gbp
Speed          10 years/HG         1 month/HG          1 day/HG           10 min/HG
Generation     “old school”        “2nd gen”           “2nd gen”          “3rd gen”

(HG = human genome)

Microarray: Hybridization

[Figure: microarray hybridization. Source: Wikipedia]

The design of the microarray determines what you can detect in a sample

RNA sequencing: Sampling

It is possible to detect transcripts that are not known a priori (in advance)

RNA-seq advantages

The non-dependence on a reference makes possible:

- Meta-transcriptomics
- Detecting novel splice variants
- Detecting novel transcripts
- Fusion transcripts
- Non-coding transcripts

Some examples

[Figures: example expression data from RNA-seq Atlas, Wang 2008 and HPA; panels labeled skeletal muscle and adipose tissue]

What does one do with RNA-seq reads?

- Mapping (also called alignment)
- (de novo) Assembly

Mapping (alignment) vs. assembly

Imagine a book being ripped to pieces with word or sentence fragments ending up on each piece of paper.

If you have a copy of the book that you can compare the pieces to, you have a mapping (alignment) problem.

If you have no copy of the book, you have a de novo assembly problem.

Mapping to a reference genome

[Figure: reads from the sequencer aligned to a gene (or transcript) sequence; mismatches are marked as sequencing error or genetic variation]

Reads from the sequencer:

CAATCAGA G TCCCACTGTGG
AGACG TCCCACTGTGGGGTG
GTGAAGTGTCCGTAGATGTGTG
GCAAATGCAATCAGACG TCCC





Mapping to the genome vs. the transcriptome

Vs. the genome:
- Can (in principle) detect new transcripts, splice variants
- Less sensitive, need a lot of coverage to discover new things
- Need a “splice-aware” aligner such as TopHat, MapSplice, RUM etc.

Vs. the transcriptome:
- Not unbiased anymore, tied to existing annotation
- Faster, more sensitive, need less coverage

The best of both worlds?
- Tools like TopHat (v1.4 and up) now do both

If it had been de novo assembly

[Figure: reads assembled into consensus sequence(s); a read left outside the assembly is a “singleton”]

Assembly of RNA-seq reads

Will not be discussed much further here.

Most popular de novo assemblers build de Bruijn graphs where overlapping k-mers
are connected to each other. The programs then try to find paths through the graph
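
The de Bruijn graph idea can be sketched in a few lines of Python. This is a toy illustration only, not how Trinity or Velvet are implemented: the reads, the value of k and the helper names `de_bruijn_graph` and `walk` are made up for the example, nodes are (k-1)-mers, and the walk simply follows edges while the next step is unambiguous.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])  # edge: prefix -> suffix
    return graph


def walk(graph, start):
    """Follow edges from `start` while there is exactly one distinct successor."""
    contig, node, seen = start, start, {start}
    while len(set(graph.get(node, []))) == 1:
        node = graph[node][0]
        if node in seen:  # stop if we run into a cycle
            break
        seen.add(node)
        contig += node[-1]
    return contig


# Three made-up overlapping reads from the same (hypothetical) transcript
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = de_bruijn_graph(reads, k=4)
print(walk(graph, "ATG"))
```

Real assemblers additionally have to deal with sequencing errors, repeats and branching paths from alternative isoforms, which is part of why they need so much memory.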

Typically needs a LOT of RAM. Can try to pre-process using “digital normalization”

Tools:
- Trinity
- Velvet/Oases
- CLC Bio (commercial)

Assembly of RNA-seq reads

Typical workflow could be:

- Clean the reads properly (remove adapters, low-quality reads)
- Useful tools: FastQC, PRINSEQ, FASTX toolkit etc.

- Run assembly tool of choice, resulting in a set of contigs

- BLAST the contigs against nt database, check for % overlap by transcript in
related organisms

- Map your original reads back to the contigs and count the reads overlapping
each

[Figure: comparison of assembly and mapping]

Quantifying expression with RNA-seq

Microarrays give a continuous (floating-point) expression value for each gene

RNA-seq gives an integer value for each gene (“digital expression”): read counts

Example (SciLifeLab) mapping workflow

FASTQ file(s)
-> TopHat 2.0
-> BAM file
-> Picard tools (SortSam, MarkDuplicates)
-> Sorted BAM file with duplicate reads removed
   -> HTSeq 0.5 -> Gene-level count files (for DE analysis)
   -> Cufflinks 2.0 -> Gene- and isoform-level expression estimates (FPKM, for reporting)

RNA-seq mapping: different isoforms

Isoform 1: Exon 1, Exon 2, Exon 3
Isoform 2: Exon 1, Exon 2

[Figure: what it would look like mapped to the genome (Exon 1, Exon 2, Exon 3)]

Need a special mapping algorithm which allows large gaps, a “split-read aligner”

[Figure: what we would actually observe. Of course we don’t know which reads come from which isoform]

Statistical algorithms are needed to estimate what proportion of reads comes from which isoform. (For example, maximum likelihood / expectation maximization)

Name                    Free (F) / Commercial (C) /   Type of approach
                        Description only (D)
Xing et al. 2006        D                             Maximum likelihood
Partek                  C                             Maximum likelihood
Li et al. 2010          D                             Maximum likelihood
Avadis                  C                             Maximum likelihood
IsoEM                   F                             Maximum likelihood
MISO                    F                             Maximum likelihood (MCMC)
Cufflinks               F                             Maximum likelihood
rQuant                  F                             Least squares (quadratic programming)
Rpkmforgenes.py         F                             Least squares
Howard and Heber 2010   D                             Least squares
FluxCapacitor           F                             Linear programming
CLC Bio                 C                             ?
NSMAP                   F                             Nonnegative Sparse Maximum A Posteriori
ALEXA-SEQ               F                             Use only reads that are compatible with a single isoform
NEUMA                   D                             Normalization by Expected Uniquely Mappable Area

Some remarks on isoform quantification

- It is necessary for correct gene-level quantification as well because straight read counting methods can never be fully correct (from 2012 CuffDiff2 paper)

- Xing et al. (2006) gave the basic idea for EM-based isoform quantification which other programs (Cufflinks, MISO, IsoEM, …) have added various “bells and whistles” to

- It is actually pretty hard to do isoform quantification well because there can be a lot of possible isoforms -> not enough sequence coverage to estimate them

Basic idea of the EM approach

We have a set of reads mapping to some locus
- Some fit one specific isoform
- Some fit several isoforms

If we knew the isoforms’ expression levels, we could distribute the reads proportionally
to those. But we don’t!

On the other hand, if we knew the probability of each read to match each isoform, we
could estimate the isoforms’ expression pretty well. But we don’t know that either.

So … start with a guess and iterate!

- Assign reads to isoforms according to some initial guess
- Re-estimate isoform expression levels
- Repeat until convergence!
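
The guess-and-iterate loop above can be sketched as follows. This is a deliberately simplified toy, not the algorithm of Cufflinks, MISO or IsoEM: it ignores transcript length, read position and fragment size, runs a fixed number of iterations instead of checking convergence, and the function name `em_isoform_proportions` and the read numbers are invented for the example.

```python
def em_isoform_proportions(read_compat, n_isoforms, n_iter=200):
    """Estimate the fraction of reads coming from each isoform.

    read_compat: one tuple per read, listing the isoform indices it fits.
    """
    theta = [1.0 / n_isoforms] * n_isoforms  # initial guess: equal expression
    for _ in range(n_iter):
        counts = [0.0] * n_isoforms
        # Assign each read fractionally to its compatible isoforms,
        # in proportion to the current expression estimates
        for isoforms in read_compat:
            total = sum(theta[i] for i in isoforms)
            for i in isoforms:
                counts[i] += theta[i] / total
        # Re-estimate expression levels from the fractional counts
        theta = [c / len(read_compat) for c in counts]
    return theta


# 30 reads fit only isoform 0, 10 fit only isoform 1, 60 fit both
reads = [(0,)] * 30 + [(1,)] * 10 + [(0, 1)] * 60
print(em_isoform_proportions(reads, n_isoforms=2))
```

In this toy case the ambiguous reads end up being split 3:1, mirroring the ratio of the uniquely assignable reads.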

Gene fusion detection with RNA-seq

Beyond isoforms: Detect pieces of different genes that have been fused

Look for reads that map in “wrong” ways

Wang et al., Briefings in Bioinformatics, doi:10.1093/bib/bbs044

Some further comments on microarrays and RNA-seq

- Microarrays are still cheaper and faster.
- You may be able to run more replicates, which is important for statistical power.
- RNA-seq has a wider measurement range.
- Lowly expressed transcripts:
  - Microarrays have high background signal -> poor measurement
  - RNA-seq can measure well if you sequence very deeply
- Medium expressed transcripts:
  - Microarrays measure well
  - RNA-seq measures well if sequenced relatively deeply
- Highly expressed transcripts:
  - Microarrays measure poorly because of saturation
  - RNA-seq measures well
- Less is understood about how to pre-process and normalize RNA-seq data.
- One interesting aspect of RNA-seq: You can continue to sequence a sample more to obtain better gene expression estimates.

Analysis

- Pre-processing and normalization
- Visualization
- Differential gene expression analysis
- (Gene set analysis, pathway analysis, gene expression signatures … -> try to find the biological significance)

Pre-processing

Why do we do pre-processing and normalization of RNA-seq (or microarray) data?

- To correct for batch effects
  - Different labs
  - Different preparation times
  - Etc.
- To correct for intrinsic technical biases in the technologies
- To make the expression value distributions conform to some assumptions in order to perform statistical tests

RNA-seq pre-processing

For RNA-seq data, it is still less understood than for microarrays how one should pre-process and normalize the data. Let’s look at some aspects (that sometimes apply to both RNA-seq and microarray data)

R and Bioconductor

Very helpful for (e.g.) microarray and RNA-seq differential expression analysis

Microarray:
- affy, lumi (read raw microarray signal files & preprocess)
- limma (differential expression analysis with complex designs)

RNA-seq:
- DESeq, edgeR, baySeq (differential expression analysis based on count data)
- SAMSeq (nonparametric differential expression analysis)

Variance stabilization

Raw data (could be microarray signal or RNA-seq counts): higher value -> higher variability (noise)

Log transform: lower value -> higher variability. Too aggressive

Variance stabilizing transform, e.g. voom() in limma package

http://bridgecrest.blogspot.se/2011_09_01_archive.html
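
To see why the log transform overshoots, here is a small back-of-the-envelope calculation. It assumes pure Poisson counting noise (variance equal to the mean), which real RNA-seq data only approximate, and uses the square root as the textbook variance-stabilizing transform for that case; it is not what voom() computes. The function name `approx_sd` is made up.

```python
import math


def approx_sd(mean):
    """Delta-method standard deviations for a Poisson count with this mean."""
    sd_raw = math.sqrt(mean)                   # Var(X) = mean
    sd_log = sd_raw / mean                     # slope of log(x) is 1/x
    sd_sqrt = sd_raw / (2 * math.sqrt(mean))   # slope of sqrt(x) is 1/(2 sqrt(x))
    return sd_raw, sd_log, sd_sqrt


for mean in (10, 100, 1000, 10000):
    raw, log, root = approx_sd(mean)
    print(f"mean={mean:>6}  sd(raw)={raw:7.2f}  sd(log)={log:.3f}  sd(sqrt)={root:.2f}")
```

Under these assumptions the noise grows with the mean on the raw scale, shrinks with the mean on the log scale, and stays constant on the square-root scale.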

Quantifying expression with RNA-seq

If you want to compare RNA-seq counts between different genes and/or samples, consider:

- Longer genes/transcripts are expected to generate more reads
- The more you sequence, the more reads you get from each gene

Therefore, the standard measure has been RPKM (reads per kilobase per million mapped reads), which corrects for transcript length and sequencing depth:

$$\mathrm{RPKM} = \frac{X_t / \left( l_{\mathrm{eff},t} / 10^3 \right)}{N_{\mathrm{lib}} / 10^6} = \frac{10^9 \cdot X_t}{N_{\mathrm{lib}} \cdot l_{\mathrm{eff},t}}$$

where X_t is the number of reads mapped to transcript/gene/… t, N_lib is the number of mapped reads in the library, and l_eff,t is the effective length of transcript/gene/… t.

FPKM is a paired-end version of this
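
As a quick check of the RPKM definition, the calculation can be written out directly. The function name `rpkm` and the library size of 20 million mapped reads are assumptions made for this example.

```python
def rpkm(read_count, effective_length_bp, n_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return 1e9 * read_count / (n_mapped_reads * effective_length_bp)


n_lib = 20_000_000  # hypothetical number of mapped reads in the library

# A long transcript with many reads and a short one with few reads
print(rpkm(1000, 30000, n_lib))
print(rpkm(10, 300, n_lib))
```

Both calls give the same value, because the read count per base is identical for the two transcripts.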

Alternatives

TPM – “transcripts per million”

A slightly modified RPKM measure that
accounts for differences in gene length
distribution in the transcript population
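
A minimal sketch of TPM, using the commonly cited definition (per-base read rates rescaled so that they sum to one million within each sample). The function name `tpm` and the numbers are made up for illustration.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million for one sample."""
    rates = [c / l for c, l in zip(counts, lengths_bp)]  # reads per base
    total = sum(rates)
    return [1e6 * r / total for r in rates]


values = tpm(counts=[1000, 10, 500], lengths_bp=[30000, 300, 1000])
print(values)
print(sum(values))
```

Unlike RPKM, the values always add up to the same total in every sample, which makes the proportions easier to compare across samples.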

Alternatives

TMM – “trimmed mean of M values”

Attempts to correct for differences in RNA composition between samples

E.g. if certain genes are very highly expressed in one tissue but not another, there will be less “sequencing real estate” left for the less expressed genes in that tissue and RPKM normalization (or similar) will give biased expression values for them compared to the other sample

[Figure: RNA population 1 and RNA population 2]

Equal sequencing depth -> orange and red will get lower RPKM in RNA population 1 although the expression levels are actually the same in populations 1 and 2

Robinson and Oshlack Genome Biology 2010, 11:R25, http://genomebiology.com/2010/11/3/R25
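
The composition effect and its correction can be illustrated with a stripped-down version of the idea. This sketch is an assumption-laden simplification, not the edgeR implementation: it trims only on the log ratios (M values), uses no weights and no trimming on absolute expression, and the function name `simple_tmm_factor`, the trim fraction and the counts are invented.

```python
import math


def simple_tmm_factor(sample, reference, trim=0.3):
    """Scaling factor for `sample` relative to `reference` (unweighted sketch)."""
    n_s, n_r = sum(sample), sum(reference)
    # M values: log2 ratios of library-size-scaled counts, genes seen in both
    m_values = sorted(
        math.log2((s / n_s) / (r / n_r))
        for s, r in zip(sample, reference)
        if s > 0 and r > 0
    )
    cut = int(len(m_values) * trim)           # drop the extremes at both ends
    kept = m_values[cut:len(m_values) - cut]
    return 2 ** (sum(kept) / len(kept))


# Same depth (1000 reads); in `sample`, one gene soaks up most of the reads
reference = [100] * 10
sample = [50] * 9 + [550]
factor = simple_tmm_factor(sample, reference)
print(factor)
# Dividing by (library size * factor) puts the nine unchanged genes back on par
print(50 / (sum(sample) * factor), 100 / sum(reference))
```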

Across-sample comparability

Dillies et al., Briefings in Bioinformatics, doi:10.1093/bib/bbs046

Practical issues with normalization methods

Limma / voom can give negative values

TMM cannot be done on a single sample

RNA-seq pre-processing

In RNA-seq, normalization of counts is often interwoven with differential expression analysis and done implicitly in DE packages such as DESeq, edgeR etc.

Normalized values like RPKM are usually only used for reporting expression values, not testing for differential expression.

Why?

Count nature of RNA-seq data

These methods want to use the added statistical power provided by the count nature of RNA-seq data.

Simplified toy example:

Scenario 1: A 30000-bp transcript has 1000 counts in sample A and 700 counts
in sample B.

Scenario 2: A 300-bp transcript has 10 counts in sample A and 7 counts in
sample B.

Assume that the sequencing depths are the same in both samples and both scenarios. Then the RPKM for sample A is the same in both scenarios, and likewise for sample B.

In scenario 1, we can be more confident that there is a true difference in the expression level than in scenario 2 (although we would want more replicates of course!), by analogy to a coin flip: 700 heads out of 1000 trials gives much more confidence that a coin is biased than 7 heads out of 10 trials.
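
The coin-flip analogy can be made concrete with an exact binomial tail probability. This only illustrates the analogy; it is not the test that DESeq or edgeR perform, and the function name `upper_tail` is made up.

```python
from math import comb


def upper_tail(heads, flips):
    """P(at least `heads` heads in `flips` tosses of a fair coin)."""
    return sum(comb(flips, k) for k in range(heads, flips + 1)) / 2 ** flips


print(upper_tail(7, 10))      # 0.171875
print(upper_tail(700, 1000))  # vanishingly small
```

The proportion of heads is the same in both cases, but only the larger experiment gives strong evidence against a fair coin.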

Visualization

Can be useful for “sanity checking”, outlier detection and exploratory analysis in general

Examples of useful visualizations

- Heat maps
- PCA/MDS/NMF
- Box plots, violin plots etc.

Box plots

Useful for comparing groups

Adding the actual data points is optional but can be interesting

Sample correlation heat maps

Heat maps are ubiquitous in transcriptomics
Correlations between samples, hierarchical clustering

Used for “sanity checks”, outlier detection

[Figure panels: two tissues; batch effects]

Gene / sample heat maps

With a smaller collection of genes, one sometimes looks at gene/sample heat maps

PCA plots

Another way to see how samples cluster

Nice thing with PCA: you can also see how much each gene contributes to each principal component -> a kind of feature selection

Alternatives to PCA

NMF: non-negative matrix factorization. Also a matrix decomposition technique (like PCA)

“A bioinformatic assay for pluripotency in human cells”, Nature Methods, doi:10.1038/nmeth.1580

PCA plot of human tissue RNA-seq

Red: GTEx
Green: Body Map
Black: Human Protein Atlas

# of genes taking up X% of sequences

[Figures: GTEx (RPKM; labeled genes HBA1, HBB, HBA2), GTEx, and Wang/Sandberg]

Differential expression analysis

Many tools available!

Easily the most common type of analysis, even though it is understood that
gene expression levels are not independent of each other, and should in
principle be considered together.

However, since the number of samples is typically << the number of
measured genes, a full model is usually not feasible to construct in practice.
Some sort of feature selection is needed.

Differential expression analysis

One would simply like to do a t-test or something like that for each gene, but …

- Assumes normal distribution & no mean-variance dependence
- Hard to estimate variance from few samples
- Multiple testing issue

Parametric vs. non-parametric methods

It would be nice to not have to assume anything about the expression value
distributions but only use rank-order statistics. -> methods like SAM
(Significance Analysis of Microarrays) or SAM-seq (equivalent for RNA-seq data)

However, it is (typically) harder to show statistical significance with non-
parametric methods with few replicates.

My rule of thumb:

- Many replicates (~ >10) in each group -> use SAM(Seq)
- Otherwise use DESeq or other parametric method

Note that Simon Anders (creator of DESeq) says that non-parametric methods are definitely better with 12 replicates and maybe already at five

http://seqanswers.com/forums/showpost.php?p=74264&postcount=3

Standard DE methods

Limma (microarrays, RNA-seq)
edgeR, DESeq (RNA-seq)



Distributional issue: Solved by variance stabilizing transform in limma

edgeR and DESeq model the count data using a negative binomial distribution and
use their own modified statistical tests based on that.

Multiple testing issue: All of these packages report false discovery rate (corrected
p values).
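
For illustration, here is the Benjamini-Hochberg procedure, one widely used way to turn raw p values into false discovery rate estimates. Treating it as the relevant procedure here is an assumption of this sketch; the function name `benjamini_hochberg` and the p values are made up.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```

With thousands of genes tested at once, the adjusted values are what should be compared against the chosen significance threshold.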

Variance estimation issue: These packages (in slightly different ways) “borrow”
information across genes to get a better variance estimate. One says that the
estimates “shrink” from gene-specific estimates towards a common mean value.
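
The “shrinking” can be pictured as a weighted average between each gene's own variance and a common value. This is only a cartoon of the idea: the packages estimate the common value and the weights from the data in their own ways, whereas here the prior weight `df_prior` is simply fixed by hand, and the function name `shrink_variances` and the numbers are invented.

```python
def shrink_variances(gene_vars, df_gene, df_prior):
    """Pull gene-wise variances towards their common mean.

    df_gene:  degrees of freedom behind each gene-wise estimate
    df_prior: weight given to the common value (hypothetical, fixed by hand)
    """
    common = sum(gene_vars) / len(gene_vars)
    return [
        (df_prior * common + df_gene * v) / (df_prior + df_gene)
        for v in gene_vars
    ]


raw = [0.1, 1.0, 4.0]
print(shrink_variances(raw, df_gene=2, df_prior=4))
```

Extreme gene-wise estimates, which are often just noise when there are few replicates, get moved the most.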

CuffDiff2

Integrates isoform quantification +
differential expression analysis

Complex designs

The simplest case is when you just want to compare two groups against each other.

But what if you have several factors that you want to control for?

E.g. you have taken tumor samples at two different time points from six patients,
cultured the samples and treated them with two different anticancer drugs and a mock
control treatment. -> 2x6x3 = 36 samples.

Now you want to assess the differential expression in response to one of the anticancer drugs, drug X. You could just compare all “drug X” samples to all control samples but the inter-subject variability might be larger than the specific drug effect.

-> Enter limma / DESeq / edgeR, which can work with factorial designs

(SAMSeq cannot, which is another reason one might not want to use it)

Limma and factorial designs

limma stands for “linear models for microarray data”

Essentially, the expression of each gene is modeled with a linear relation

http://www.math.ku.dk/~richard/courses/bioconductor2009/handout/19_08_Wednesday/KU-August2009-LIMMA/PPT-PDF/Robinson-limma-linear-models-ku-2009.6up.pdf

The design matrix describes all the conditions, e.g. treatment, patient, time etc.

y = a + b*treatment + c*time + d*patient + e

where a is the baseline/average and e is the error term/noise
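
To show what fitting such a model means for a single gene, here is a small ordinary least squares example in plain Python. It is a sketch under assumptions, not limma's code: the time factor is left out to keep it short, the expression values are made up without noise, and the helpers `solve` and `least_squares` are written just for this example.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def least_squares(design, y):
    """Ordinary least squares via the normal equations (X^T X) beta = X^T y."""
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


# Columns: baseline, treatment (0 = control, 1 = drug X), patient 2, patient 3
design = [
    [1, 0, 0, 0], [1, 1, 0, 0],  # patient 1: control, drug X
    [1, 0, 1, 0], [1, 1, 1, 0],  # patient 2
    [1, 0, 0, 1], [1, 1, 0, 1],  # patient 3
]
y = [5.0, 6.0, 8.0, 9.0, 3.0, 4.0]  # made-up log expression values for one gene
baseline, treatment, patient2, patient3 = least_squares(design, y)
print(baseline, treatment, patient2, patient3)
```

The patients differ by several units, far more than the drug effect, yet the treatment coefficient comes out cleanly because the patient columns absorb the between-patient differences.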

Recent DE software comparison

Take-away messages from DE tool comparison

- CuffDiff2, which should theoretically be better, seems to work worse, probably
due to the increased “statistical burden” from isoform expression estimation

- The HTSeq quantification which is theoretically “wrong” seems to give good
results with downstream software

- It is practically always better to sequence more biological replicates than to
sequence the same samples deeper

Omitted from this comparison
- gains from ability to do complex designs
- non-parametric methods

The end


Contact me at mikael.huss@scilifelab.se if you have any questions
