RNA-Seq Analysis: An Introduction to Mapping, Quantification, and Differential Expression

NGS APPLICATIONS 2: INTRODUCTION TO RNA-SEQ ANALYSIS

Overview

•  Earlier: from libraries to raw reads.

Now:

•  What to do with RNA-Seq reads?
•  How to design an RNA-Seq experiment?

Blencowe B J et al. Genes Dev. 2009;23:1379-1386
Reads are ready. Now what?

Illumina HiSeq → bcl2fastq → large FASTQ files (2-30 GB)

•  Reads represent real biology.
•  More reads corresponding to a transcript indicate higher abundance of that transcript.
•  Reads may represent novel transcripts or novel arrangements of exons that are not present in any known reference genome.
•  New exon-exon junctions, RNA editing, and nucleotide variations (SNPs) may all be present in the read data.

How do we translate these raw reads into biological knowledge? Start with sequence alignment.

Reads are ready. Now what?

FASTQ

•  Do we have a genome reference?
   –  No: perform full de novo transcriptome construction.
   –  Yes: do we have a transcript/gene annotation reference?
      –  No: perform alignment-guided de novo transcriptome assembly.
      –  Yes: align to the genome, choosing one of:
         –  Quantification only: accept only alignments that correspond to known transcripts (like a microarray).
         –  Align to known exons but accept alternative arrangements.
         –  Align to known exons plus other regions.
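The decision tree above can be written down as a small function. This is only a sketch of one reading of the flowchart; the function name and the returned strings are illustrative, not part of any tool.

```python
def choose_strategy(has_genome: bool, has_annotation: bool) -> str:
    """Return the analysis route suggested by the flowchart (illustrative only)."""
    if not has_genome:
        # No genome reference: nothing to align against.
        return "full de novo transcriptome construction"
    if not has_annotation:
        # Genome but no gene models: align first, then assemble transcripts.
        return "alignment-guided de novo transcriptome assembly"
    # Genome and annotation: align, then quantify against known features.
    return "align to the genome"


print(choose_strategy(has_genome=True, has_annotation=False))
```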

What to map to?

Map to a genome with no gene annotation.

•  Assembling transcripts from exon regions is difficult and requires complex statistical algorithms.
•  Identifying alternative transcript isoforms is unreliable.
•  Usually this is the best option for novel or unannotated genomes.

(Figure labels: Exons, ?, Genome ref)

What to map to?

Map to the genome, with knowledge of transcript annotations.

•  A well-annotated genome reference is required.
•  To map effectively to exon junctions, you need a mapping algorithm that can split the sequencing reads and map the portions independently.
•  Identifying alternative transcript isoforms involves complex algorithms.

Which sequence mappers to use?

•  An RNA-Seq alignment algorithm must
   –  Be fast
   –  Handle SNPs, indels, and sequencing errors
   –  Maintain accurate quantification
   –  Allow for introns when aligning to a reference genome (spliced alignment detection)
•  Burrows-Wheeler Transform (BWT) mappers
   –  Fast
   –  Limited mismatches allowed (<3)
   –  Limited indel detection ability
   –  Examples: Bowtie2, BWA, TopHat
   –  Use cases: large and conserved genomes and transcriptomes
•  Hash table mappers
   –  Require a large amount of RAM for indexing
   –  More mismatches allowed
   –  Indel detection
   –  Examples: GSNAP, SHRiMP, STAR (STAR indexes with an uncompressed suffix array rather than a hash table, but shares the large memory footprint)
   –  Use cases: highly variable or smaller genomes and transcriptomes
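To make the BWT idea concrete, here is a toy sketch of the transform and of exact-match counting by backward search. It assumes a tiny in-memory reference and recomputes ranks naively; real mappers such as Bowtie2 or BWA use compressed indexes and tolerate mismatches, which this sketch does not attempt.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (toy-sized inputs only)."""
    text += "$"  # sentinel that sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_string: str, pattern: str) -> int:
    """Count exact occurrences of pattern using backward search on the BWT."""
    # first_row[ch] = index of the first row starting with ch in sorted order
    first_row = {}
    for index, ch in enumerate(sorted(bwt_string)):
        first_row.setdefault(ch, index)

    def rank(ch: str, upto: int) -> int:
        # Occurrences of ch in bwt_string[:upto]; naive, an FM-index tabulates this
        return bwt_string[:upto].count(ch)

    low, high = 0, len(bwt_string)
    for ch in reversed(pattern):
        if ch not in first_row:
            return 0
        low = first_row[ch] + rank(ch, low)
        high = first_row[ch] + rank(ch, high)
        if low >= high:
            return 0
    return high - low


reference = "ACGTACGA"  # hypothetical reference sequence
index = bwt(reference)
print(index, count_occurrences(index, "ACG"))
```

The search only ever touches the transformed string, which is why BWT mappers are fast and memory-efficient; the price is that each allowed mismatch multiplies the search paths, consistent with the low mismatch limit noted above.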

Generalized Analysis Workflow

RNA-Seq reads (fastq file)
→ Alignment (Bowtie2, BWA, TopHat, GSNAP, SHRiMP, STAR) → SAM/BAM file
→ Assemble transcripts (Trinity, Cufflinks) → transcript isoforms
→ Count reads (HTSeq, Cufflinks, Bioconductor) → gene or transcript quantification

Tools:
HTSeq - http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html
Cufflinks - http://cufflinks.cbcb.umd.edu/
Bioconductor - http://www.bioconductor.org/
Trinity - http://trinityrnaseq.sourceforge.net/

TopHat/Cufflinks Workflow

RNA-Seq reads (fastq file)
→ TopHat, with genome reference (.fasta) and gene annotations (.gtf): align to the genome using Bowtie/TopHat → SAM/BAM file
→ Cufflinks, with genome reference (.fasta) and gene annotations (.gtf) → transcript isoforms, gene/transcript quantification

•  Spliced fragments align to known exon-exon junctions.
•  Genomic mapped reads may identify novel isoforms.
•  Cufflinks identifies mutually exclusive exons.
•  Graph-based analysis uses a shortest-path algorithm to determine ...

Sequence Alignment Files

BAM/SAM alignment files

•  SAM is the standard alignment file format produced by most mappers.
•  Alignments are usually stored as BAM files, an industry standard.
•  BAM is a compressed (binary) version of the SAM file. BAM is not human-readable. It can be indexed so that huge alignment files can be read and searched rapidly by other tools and genome browsers.
•  A suite of tools called "samtools" is used to convert between SAM and BAM.
•  Samtools can also index a BAM file for faster visualization in IGV or the UCSC Genome Browser.

SAM format
http://samtools.sourceforge.net/SAM1.pdf

(Figure: example SAM file with the header fields format version, reference sequence name, reference sequence length, and sort order labeled, plus the CIGAR string column of an alignment record.)
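As a rough illustration of the labeled fields, the sketch below reads them from SAM text with plain Python. In the SAM specification the `@HD` header line carries `VN` (format version) and `SO` (sort order), and each `@SQ` line carries `SN` (reference name) and `LN` (reference length). The example lines are invented for the demonstration.

```python
def parse_sam_header(lines):
    """Collect format version, sort order and reference lengths from header lines."""
    header = {"version": None, "sort_order": None, "references": {}}
    for line in lines:
        if not line.startswith("@"):
            break  # header lines come first; alignment records follow
        tag, *fields = line.rstrip("\n").split("\t")
        if tag not in ("@HD", "@SQ"):
            continue  # other header line types are ignored in this sketch
        values = dict(field.split(":", 1) for field in fields)
        if tag == "@HD":
            header["version"] = values.get("VN")
            header["sort_order"] = values.get("SO")
        else:
            header["references"][values["SN"]] = int(values["LN"])
    return header


def parse_sam_record(line):
    """Return the first six mandatory columns of an alignment record."""
    cols = line.rstrip("\n").split("\t")
    return {
        "qname": cols[0],
        "flag": int(cols[1]),
        "rname": cols[2],
        "pos": int(cols[3]),   # 1-based leftmost mapping position
        "mapq": int(cols[4]),
        "cigar": cols[5],
    }


# Invented example content, not taken from a real run
sam_lines = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:248956422",
    "read1\t0\tchr1\t100\t60\t50M\t*\t0\t0\t" + "A" * 50 + "\t" + "I" * 50,
]
print(parse_sam_header(sam_lines))
print(parse_sam_record(sam_lines[-1]))
```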

CIGAR Strings

CIGAR = Compact Idiosyncratic Gapped Alignment Report
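A CIGAR string is a run-length list of alignment operations. The sketch below decodes one and derives three quantities useful for RNA-Seq: how much reference the alignment spans, how long the read is, and the lengths of skipped regions (`N`, typically introns). The operation letters follow the SAM specification; the example strings are made up.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")
CONSUMES_READ = set("MIS=X")


def parse_cigar(cigar: str):
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def reference_span(cigar: str) -> int:
    """Number of reference bases covered, including skipped introns."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


def read_length(cigar: str) -> int:
    """Number of read bases described, including soft-clipped bases."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_READ)


def intron_lengths(cigar: str):
    """Lengths of skipped reference regions (N operations)."""
    return [n for n, op in parse_cigar(cigar) if op == "N"]


spliced = "30M500N20M"  # a 50 bp read split across a 500 bp intron
print(reference_span(spliced), read_length(spliced), intron_lengths(spliced))
```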

Differential Gene Expression Analysis

•  Given samples from different experimental conditions, find changes in transcriptome profiles
•  Allows for hypothesis generation on molecular abnormalities and mechanisms that may contribute to the tumor phenotype
•  Provides insights into potential biological mechanisms associated with experimental/diseased conditions

Standard Transcriptome Sequencing Pipeline

Sample annotations → STAR aligner → featureCounts → DESeq, GSEA, QC → HTML report

This is really a simple sequence counting problem

Data: NGS randomly samples and sequences all gene transcripts from the samples (so the number of reads correlates with the number of transcripts)

Objective: Does gene X have more copies in condition Z than in condition B (Z > B)?

(Figure: genes X, Y, Z under Condition Z and Condition B.)

Counting Rules for RNA-Seq

•  Count mapped reads, not base pairs
•  Count each read at most once
•  Discard a read if
   –  It cannot be uniquely mapped
   –  Its alignment overlaps with several genes
   –  The alignment quality score is bad
   –  (for paired-end reads) the mates do not map to the same gene (potential fusion genes)
•  Do not discard read duplicates (the same read appearing multiple times)
•  Keep track of the alignment method and parameters
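The rules above can be sketched as a filter over already-annotated reads. The dictionary keys and the quality threshold of 10 are assumptions made for illustration; counting tools such as HTSeq or featureCounts derive the equivalent information from the BAM file and the annotation.

```python
from collections import Counter


def count_reads(reads, min_mapq=10):
    """Apply the counting rules to pre-annotated reads.

    Each read is a dict with hypothetical keys:
      num_hits   - number of genomic locations the read aligned to
      mapq       - alignment quality score
      genes      - set of genes the alignment overlaps
      mate_genes - genes overlapped by the mate, or None for single reads
    """
    counts, discarded = Counter(), Counter()
    for read in reads:
        if read["num_hits"] != 1:
            discarded["not uniquely mapped"] += 1
        elif read["mapq"] < min_mapq:
            discarded["low alignment quality"] += 1
        elif len(read["genes"]) != 1:
            discarded["overlaps zero or several genes"] += 1
        elif read.get("mate_genes") is not None and read["mate_genes"] != read["genes"]:
            discarded["mates map to different genes"] += 1
        else:
            (gene,) = read["genes"]
            counts[gene] += 1  # each read counted once; duplicates are kept
    return counts, discarded


example_reads = [
    {"num_hits": 1, "mapq": 60, "genes": {"TP53"}, "mate_genes": None},
    {"num_hits": 1, "mapq": 60, "genes": {"TP53"}, "mate_genes": None},  # duplicate, kept
    {"num_hits": 3, "mapq": 1, "genes": {"CCT2"}, "mate_genes": None},   # multi-mapped
    {"num_hits": 1, "mapq": 60, "genes": {"CCT2", "CXCR5"}, "mate_genes": None},
    {"num_hits": 1, "mapq": 2, "genes": {"CCT2"}, "mate_genes": None},
    {"num_hits": 1, "mapq": 60, "genes": {"CCT2"}, "mate_genes": {"TP53"}},
    {"num_hits": 1, "mapq": 60, "genes": {"CCT2"}, "mate_genes": {"CCT2"}},
]
gene_counts, reasons = count_reads(example_reads)
print(dict(gene_counts), dict(reasons))
```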

What kind of questions can be answered from sequence count data?

Gene Sequence Count Data

Gene    Healthy 1  Healthy 2  Healthy 3  Patient 1  Patient 2  Patient 3
CCT2    50         60         45         75         5          69
TP53    30         72         30         127        40         80
CXCR5   3          10         60         20         5          40

Is gene TP53 upregulated in patient samples?

-  Hint: If healthy samples were sequenced at 20 million reads and patient samples were sequenced at 80 million reads, does it change the answer?

Are there more TP53 transcript copies compared to CCT2?

-  Hint: The TP53 transcript is a lot longer than CCT2
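The first hint can be worked through numerically. This sketch takes the TP53 row of the table and the library sizes from the hint (20 million reads per healthy sample, 80 million per patient sample) and compares raw means with counts per million; the helper names are illustrative.

```python
healthy_tp53 = [30, 72, 30]
patient_tp53 = [127, 40, 80]
HEALTHY_LIBRARY = 20_000_000  # reads per healthy sample, from the hint
PATIENT_LIBRARY = 80_000_000  # reads per patient sample, from the hint


def mean(values):
    return sum(values) / len(values)


def counts_per_million(count, library_size):
    """Scale a count to what it would be in a library of one million reads."""
    return count / library_size * 1_000_000


raw_ratio = mean(patient_tp53) / mean(healthy_tp53)
scaled_ratio = (
    counts_per_million(mean(patient_tp53), PATIENT_LIBRARY)
    / counts_per_million(mean(healthy_tp53), HEALTHY_LIBRARY)
)
print(f"raw patient/healthy ratio:    {raw_ratio:.2f}")
print(f"scaled patient/healthy ratio: {scaled_ratio:.2f}")
```

On raw counts TP53 looks higher in patients (ratio above 1); once the fourfold difference in sequencing depth is accounted for, the ratio drops below 1, so the apparent upregulation disappears.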

Direct comparison of read counts per gene is problematic

More sequence reads are mapped to a transcript if it is
a) Long
b) Sequenced at a higher depth of coverage

(Figure, panel a: Read Counts = 12, Depth = 3X vs Read Counts = 5, Depth = 3X)
(Figure, panel b: Read Counts = 11, Depth = 5X vs Read Counts = 5, Depth = 3X)

We cannot claim that the blue transcript is transcribed at a higher level than the green transcript based on read counts alone

Normalization of RNA-Seq Count Data

•  Data normalization is ALWAYS required to compare one sequencing result to another
•  It brings count data from different experiments to the same scale for comparison
•  RNA-Seq count normalization aims to adjust the data such that:
   –  Genes with different lengths can be compared
   –  Total sequence counts are taken into account

RPKM: Reads per Kilobase per Million Mapped Reads

$$\mathrm{RPKM} = \frac{10^{9} \cdot C}{N \cdot L}$$

C = number of mappable reads in a feature (exon or transcript)
N = number of mappable reads in the experiment
L = length of the feature in base pairs

The easiest way to normalize is to take the number of reads mapped to a transcript and divide it by the length of the transcript and by the total number of reads

Nature Methods 5, 621-628 (2008)

•  Generally corrects for biases
•  Vulnerable to bias when a few highly expressed genes drive N to be large
•  Used to be the standard, but not anymore
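A minimal sketch of the RPKM calculation, using the standard definition RPKM = 10^9 × C / (N × L) with C, N and L as defined above. The read counts, library size and feature lengths in the example are hypothetical.

```python
def rpkm(reads_in_feature: int, total_mapped_reads: int, feature_length_bp: int) -> float:
    """Reads per kilobase of feature per million mapped reads."""
    return 1e9 * reads_in_feature / (total_mapped_reads * feature_length_bp)


# Hypothetical numbers: twice the reads on a feature twice as long, same library
short_feature = rpkm(reads_in_feature=50, total_mapped_reads=20_000_000, feature_length_bp=2_000)
long_feature = rpkm(reads_in_feature=100, total_mapped_reads=20_000_000, feature_length_bp=4_000)
print(short_feature, long_feature)
```

Both features get the same value, which is the point of the length correction: the longer feature collects more reads without being more abundant.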

Other Normalization Methods

Upper Quartile Method

Aim: Correct for the bias that total read count is strongly dependent on a few highly expressed transcripts

Method: Use the upper quartile (75th percentile) of each sample's transcript counts, rather than the total read count, as the scaling factor and report back normalized counts
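A sketch of upper-quartile scaling under simplifying assumptions: zero counts are dropped per sample before taking the 75th percentile, and samples are rescaled to the mean upper quartile. The sample names and counts are invented, and published implementations differ in details such as which genes are excluded.

```python
import statistics


def upper_quartile(counts):
    """75th percentile of the non-zero counts of one sample."""
    nonzero = [c for c in counts if c > 0]
    return statistics.quantiles(nonzero, n=4, method="inclusive")[2]


def upper_quartile_normalize(samples):
    """samples: dict of sample name -> list of per-gene counts (same gene order)."""
    quartiles = {name: upper_quartile(counts) for name, counts in samples.items()}
    target = statistics.mean(list(quartiles.values()))
    return {
        name: [c * target / quartiles[name] for c in counts]
        for name, counts in samples.items()
    }


# Invented data: sample B is sample A sequenced twice as deep
samples = {"A": [0, 10, 20, 30, 1000], "B": [0, 20, 40, 60, 2000]}
normalized = upper_quartile_normalize(samples)
print(normalized)
```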

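Geometric Mean Method (the DESeq method)

Aim: to minimize the effect of the majority of sequences and concentrate on variation between conditions

Assumption: a majority of transcripts are not differentially expressed

Method: take the geometric mean of each gene's read counts across samples as a reference value; the size factor s_j used to normalize the transcript counts of sample j is the median ratio of its counts to these references

$$s_j = \operatorname{median}_i \frac{k_{ij}}{\left(\prod_{v=1}^{m} k_{iv}\right)^{1/m}}$$

k_ij = number of reads in sample j assigned to gene i
v = sample index, from 1 to m

Bullard et al. BMC Bioinformatics 2010, 11:94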
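The size-factor idea can be sketched in a few lines: for each gene take the geometric mean of its counts across the m samples, divide each sample's count by it, and use the per-sample median of those ratios as s_j. The sketch works in log space and skips genes containing a zero count, which is an assumption of this example; the toy counts are invented.

```python
import math
import statistics


def size_factors(counts):
    """counts: dict of gene -> list of read counts, one per sample (same order)."""
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for gene_counts in counts.values():
        if any(k == 0 for k in gene_counts):
            continue  # geometric mean would be zero; skip the gene
        log_geo_mean = sum(math.log(k) for k in gene_counts) / n_samples
        for j, k in enumerate(gene_counts):
            log_ratios[j].append(math.log(k) - log_geo_mean)
    # Median taken on the log scale, then mapped back
    return [math.exp(statistics.median(ratios)) for ratios in log_ratios]


# Invented counts: sample 2 is sample 1 sequenced twice as deep,
# except for one strongly changed gene that the median ignores
toy_counts = {
    "geneA": [10, 20],
    "geneB": [50, 100],
    "geneC": [7, 14],
    "geneD": [5, 500],
    "geneE": [200, 400],
}
factors = size_factors(toy_counts)
normalized = {g: [k / s for k, s in zip(ks, factors)] for g, ks in toy_counts.items()}
print(factors)
print(normalized["geneA"], normalized["geneD"])
```

Because the median is used, the single strongly changed gene does not shift the size factors, which is why the method relies on the assumption that most transcripts are not differentially expressed.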

Inferring Differential Expression (DE)

Method   | Normalization | Needs replicates | Input      | Statistics for DE                                                      | Availability
edgeR    | Library size  | Yes              | Raw counts | Empirical Bayesian estimation based on negative binomial distribution | R/Bioconductor
DESeq    | Library size  | No               | Raw counts | Negative binomial distribution                                         | R/Bioconductor
baySeq   | Library size  | Yes              | Raw counts | Empirical Bayesian estimation based on negative binomial distribution | R/Bioconductor
LIMMA    | Library size  | Yes              | Raw counts | Empirical Bayesian estimation                                          | R/Bioconductor
CuffDiff | RPKM          | No               | RPKM       | Log ratio                                                              | Standalone

Typical DE Result Table

•  Gene or transcript name
•  Mean expression levels
•  Fold Change: measurement of the magnitude of change, calculated as FC = baseMeanB / baseMeanA. Typically log2(FC) is reported
•  Significance: use the adjusted P value (padj) instead of the raw P value (pval) unless you know what you are doing
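A small worked example of the fold-change column, following FC = baseMeanB / baseMeanA. The mean values are invented; zero means would need separate handling, which this sketch leaves out.

```python
import math


def log2_fold_change(base_mean_a: float, base_mean_b: float) -> float:
    """log2 of baseMeanB / baseMeanA; positive means higher in condition B."""
    return math.log2(base_mean_b / base_mean_a)


print(log2_fold_change(44.0, 88.0))   # doubling in B
print(log2_fold_change(88.0, 44.0))   # halving in B
```

Reporting log2(FC) makes up- and down-regulation symmetric: a doubling and a halving have the same magnitude with opposite signs.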

Why use the adjusted P-value instead of the raw P-value?

Multiple Comparison Problem: when a large number of statistical tests are performed simultaneously (as in genomic analysis), some tests will have P values less than 0.05 purely by chance, even if all your null hypotheses are really true.

The Dead Thinking Salmon Experiment (Bennett-Salmon-2009)

-  Buy a whole salmon
-  Take fMRI images of the salmon; similar to genomic analysis, this asks whether each small region (voxel) of the brain is active
-  Some regions WILL BE significantly active if enough pictures and enough voxels are taken
-  Suggesting the dead salmon is thinking…
-  Nothing is significant if the p-value is adjusted

Methods for Adjustment: Bonferroni correction, FDR controlling procedures
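Both adjustment families fit in a few lines. The sketch below implements the Bonferroni correction and the Benjamini-Hochberg procedure (one common FDR-controlling method) in plain Python; the p-values are invented for the example.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]  # invented p-values for four genes
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```

Bonferroni controls the chance of any false positive and is the stricter of the two; Benjamini-Hochberg controls the expected fraction of false positives among the reported genes, which is usually the more practical choice when thousands of genes are tested.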

Heatmap and Hierarchical Clustering

•  Most common representation for differential expression analysis
•  Hierarchical clustering on both samples and genes is often performed to identify similar samples/genes
•  Can be generated using many tools, such as the R/Bioconductor heatmap and gplots packages

Functional Enrichment Analysis

•  Use gene expression to identify pathways or gene functions that are over-represented
•  Address the question: "What biological functions are different between sample groups?"
•  Many open-source and proprietary tools
   –  GSEA (http://www.broadinstitute.org/gsea/index.jsp)
   –  DAVID (https://david.ncifcrf.gov)
   –  TopGO/GOSEQ (R/Bioconductor)
   –  Ingenuity Pathway Analysis (QIAGEN, proprietary)
•  Detailed discussion is out of scope for this course

DESIGN RNA-SEQ EXPERIMENT

Design RNA-Seq Experiment

•  Biological comparison(s)
•  Replicates
•  Read length
•  Paired End/Single Read
•  Read depth
•  Pooling

Biological System in Question

Simple Question

Examples:
•  Cell line groups treated with different conditions
•  Patient groups with the same disease treated with different treatments

Complex Question

Examples:
•  Matched patient samples from both normal and diseased tissues
•  Normal and cancer samples obtained from a genotypically diverse population

Experimental Questions

•  What are my goals?
   –  Differential expression analysis of genes?
   –  Differential expression analysis of transcripts?
   –  Identify rare transcript isoforms?
   –  Identify transcript polymorphisms?
   –  Identify non-coding RNA populations such as miRNA, lincRNA?
•  What are the characteristics of the system?
   –  Large, complex genome? (e.g., human)
   –  Highly heterogeneous sample population? (e.g., breast tumor)
   –  No reference genome or transcriptome?
   –  High degree of alternative splicing?

Experimental Questions

What are the sequencing options?

How much money to spend?

What are Single Read (SR) and Paired End (PE) sequencing?

Single Read (SR): only one end of each cDNA fragment is sequenced, generating one read per fragment

Paired End (PE): the cDNA fragment is sequenced from both ends, generating two reads per fragment from two directions

What are Single Read (SR) and Paired End (PE) sequencing?

Single Read (SR)
-  Samples the same number of cDNA fragments as PE
-  Generates half as many reads (half the depth) as PE
-  Suitable for gene expression level detection
-  Substantially cheaper than PE

Paired End (PE)
-  Samples the same number of cDNA fragments as SR
-  Allows for more accurate detection of structural variants, novel isoform identification, and quantification

(Figure label: Reference Sequence)

Impacts of Read Length on RNA-Seq

A longer read length (e.g., 75 bp vs 50 bp) provides:
-  Better ability to assemble unknown transcripts
-  Higher accuracy when mapping reads to complex regions (e.g., repeats, highly polymorphic regions)
-  Splice junction detection is most affected by read length

Does a longer read length (e.g., 100 bp vs 50 bp) always give better results?
-  Not necessarily
-  Long reads convey minimal to no advantage for differential gene expression analysis

(Figure panels: 50 bp vs 75 bp)

Impacts of Sequencing Depth

•  A quick means to detect more genes and transcript variants with low expression (the more reads you sequence, the more genes you find)
•  Returns diminish: the number of detected genes grows roughly with the logarithm of depth, so each linear gain in detected genes requires a multiplicative increase in reads

(Figure: genes X, Y, Z in RNASeq 1 with 30 million reads and RNASeq 2 with 10 million reads)
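The diminishing return from extra depth can be illustrated with a simple expectation: if a gene accounts for a fraction p of all transcripts, the chance of seeing at least one read from it among n reads is 1 - (1 - p)^n. The abundance profile below (2,000 genes with abundance proportional to 1/rank²) is purely hypothetical, and "detected" is taken to mean at least one read.

```python
def expected_detected_genes(relative_abundance, n_reads):
    """Expected number of genes with at least one read among n_reads."""
    total = sum(relative_abundance)
    return sum(1 - (1 - a / total) ** n_reads for a in relative_abundance)


# Hypothetical, highly skewed expression profile
abundance = [1 / rank**2 for rank in range(1, 2001)]
depths = [10_000, 100_000, 1_000_000, 10_000_000]
detected = [expected_detected_genes(abundance, depth) for depth in depths]
for depth, genes in zip(depths, detected):
    print(f"{depth:>10,} reads -> about {genes:,.0f} genes detected")
```

Printing the values for tenfold steps in depth shows how the gain per extra read shrinks as the remaining undetected genes become rarer.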

Number of reads needed for an experiment

•  Different RNA sequencing applications require different numbers of reads
•  More genes are detected with higher sequencing depth
•  However, the gain in detected genes falls off substantially as depth increases
•  Understand your sequencing system before deciding on depth
•  Depth can always be increased by additional sequencing of the same library
   –  Unlike microarrays, there is very limited batch effect for RNA-Seq

Differential expression in RNA-seq: A matter of depth. Genome Res. 2011.

Experimental Design

•  Technical replicates
   –  Not needed: RNA-Seq has low technical variation
      •  Minimize batch effects
•  Biological replicates
   –  Not needed for novel transcript identification and transcriptome assembly
   –  Essential for differential expression analysis
   –  Difficult to estimate the minimum number
      •  3+ for cell lines
      •  5+ for inbred lines (e.g., mouse, model organisms)
      •  20+ for human samples (usually unachievable)
   –  Must have 3+ to perform statistical analysis

Experimental Design

•  Pooling samples
   –  Limited RNA obtainable
      •  Tumor samples from hard-to-reach tissue types (e.g., brain)
   –  Novel transcriptome assembly
   –  Don't do it unless you know what you are doing

Questions to ask when getting raw RNA-Seq data back

•  How was the RNA extracted?
•  How was the RNA-Seq library constructed?
•  Which platform was the library sequenced on?
•  How long was the read length?
•  Was sequencing done with single read or paired end?
•  How many reads were sequenced per sample?
•  Where is the QC report?

Checklist for getting RNA-Seq DE analysis results back

☐  Fastq files
☐  FastQC Report
☐  BAM files
☐  RNA-Seq QC Report (not discussed)
☐  Table of Differentially Expressed Genes/Transcripts
☐  Heatmaps
☐  Functional Enrichment Analysis Table

Recognize Yourself as a Genomic Data Consumer

Bioinformaticists/Data Scientists
-  Let data drive scientific hypothesis generation
-  Start with raw data (e.g., fastq)
-  Process raw data with privately tuned pipelines

KNOW YOUR DATA SOURCE

Translational Scientists
-  Start with a specific hypothesis derived from observation
-  Find processed data to perform secondary analysis
-  Use readily available tools
-  Interpret results in the context of the initial hypothesis

KNOW YOUR TOOLS
