5. Gene Expression Measurements
Evaluation of gene expression measurements from
commercial microarray platforms
Paul K. Tan, Thomas J. Downey1, Edward L. Spitznagel Jr2, Pin Xu, Dadin Fu, Dimiter S. Dimitrov3, Richard A. Lempicki4, Bruce M. Raaka5 and Margaret C. Cam*
Microarray Core Laboratory, National Institute of Diabetes and Digestive and Kidney Disorders (NIDDK), National Institutes of Health, 1Partek Incorporated, 2Department of Mathematics, Washington University, 3Laboratory of Experimental and Computational Biology (LECB), National Cancer Institute, NIH, 4National Institute of Allergy and Infectious Diseases (NIAID), NIH, SAIC-Frederick, Inc., 5Clinical Endocrinology Branch, NIDDK, NIH, USA
Received May 23, 2003; Revised July 11, 2003; Accepted August 11, 2003
ABSTRACT
Multiple commercial microarrays for measuring genome-wide gene expression levels are currently available, including oligonucleotide and cDNA, single- and two-channel formats. This study reports on the results of gene expression measurements generated from identical RNA preparations that were obtained using three commercially available microarray platforms. RNA was collected from PANC-1 cells grown in serum-rich medium and at 24 h following the removal of serum. Three biological replicates were prepared for each condition, and three experimental replicates were produced for the first biological replicate. RNA was labeled and hybridized to microarrays from three major suppliers according to manufacturers' protocols, and gene expression measurements were obtained using each platform's standard software. For each platform, gene targets from a subset of 2009 common genes were compared. Correlations in gene expression levels and comparisons for significant gene expression changes in this subset were calculated, and showed considerable divergence across the different platforms, suggesting the need for establishing industrial manufacturing standards, and further independent and thorough validation of the technology.
INTRODUCTION
A powerful application of microarray technology is in (1). Once target genes are identified, additional laboratory resources may be invested to validate this list and to further characterize the relationship of their biological functions to the process under study (2). The efficiency of knowledge discovery using this high-throughput experimental process depends upon the reliability of the microarray technology used in the initial screening experiments. Researchers planning to utilize microarray experiments for discovery-based research must evaluate available commercial technologies when allocating laboratory resources for prospective experiments.
Several formats of microarrays for measuring genome-wide gene expression levels are currently available (3). Important factors for selecting an appropriate microarray platform would include sensitivity, specificity and both inter- and intra-assay reproducibility. Also important is knowledge of the degree of cross-platform agreement, as interchangeability amongst various microarray formats would allow for the utility of gene expression data without regard to platform. Having such a property would allow researchers from independent laboratories to make direct comparisons on data produced from different types of available platforms, and would reduce the need to replicate experiments (4). Such cross-platform comparisons ideally require that corresponding RNA expression measurements be concordant. Previous comparisons of microarray formats suggested that expression data on the NCI60 cell lines from spotted cDNA microarrays could not be directly combined with data from synthesized oligonucleotide arrays (5). This finding was determined using identical originating cell lines; however, cell culturing, mRNA preparation and hybridization of targets were all performed separately. In this study we analyzed identical RNA preparations using three commercially available high-density microarray platforms. This experimental design allowed us to compare the microarray formats while controlling for
5676–5684 Nucleic Acids Research, 2003, Vol. 31, No. 19
DOI: 10.1093/nar/gkg763
• Transcript ratios across samples
• Lack of confidence in gene expression experiments
  – Same pair of samples, different platforms, different ratio results!
• Critical applications
  – Cancer Biology
  – Drug Discovery
  – Tissue engineering
  – Stem Cell Biology
6. External RNA Control Consortium (ERCC)
• Industry-initiated, NIST-hosted, stakeholder coupled
  – grew out of NIST workshop in 2003
    • initiated by Janet Warrington, VP Clinical Genomics at Affymetrix
  – all major microarray technology developers
  – other gene expression assay developers
• Open to all interested parties
• Voluntary
• More than 90 participants
  – Private, Public, Academic
[Figure label: Spike-ins]
7. ERCC Collaborative Study
• Developed sequence library from submission by ERCC members, as well as synthesis
  – evaluated performance of RNA controls on variety of platforms
  – selected 96 well-performing sequences in collaborative study
• Array manufacturers modified products to include ERCC control sequences
[Figure labels: 176, 144, 106, 96]
8. SRM 2374 – DNA Sequence Library for External RNA Controls
• NIST Standard Reference Material (SRM 2374)
• Contains 96 unique control sequences inserted in common plasmid DNA
  – engineered to be readily in vitro transcribed to make RNA controls
  – RNA controls intended to mimic mammalian mRNA
• http://www.nist.gov/srm/
10. Creating Spike-in Mixtures from SRM 2374
SRM 2374 Plasmid DNA Library → in vitro transcription → RNA transcripts → Pooling → Mixtures with known abundance ratios
…
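The pooling step on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the control names, the subpool ratios, and the concentration ladder are hypothetical, and nothing here reproduces the actual ERCC or SRM 2374 mixture design. Controls are dealt into subpools, each subpool gets one nominal Mix 1 / Mix 2 ratio, and concentrations within a subpool are spread evenly on a log2 scale.

```python
def design_mixtures(control_ids, ratios, min_conc=1.0, log2_range=10):
    """Assign each control a (mix1, mix2) concentration pair.

    Hypothetical design: controls are dealt round-robin into one subpool
    per entry in `ratios`; every control in subpool k has a nominal
    mix1/mix2 ratio of ratios[k]. Mix 2 concentrations within a subpool
    are spaced evenly on a log2 scale over `log2_range` doublings.
    """
    n_sub = len(ratios)
    design = {}
    for k, ratio in enumerate(ratios):
        members = control_ids[k::n_sub]          # round-robin subpool
        steps = max(len(members) - 1, 1)
        for i, cid in enumerate(members):
            mix2 = min_conc * 2 ** (log2_range * i / steps)
            design[cid] = (mix2 * ratio, mix2)
    return design


# Illustrative use: eight made-up controls, two subpools (4:1 and 1:1).
controls = [f"ctl-{i:02d}" for i in range(8)]
for cid, (mix1, mix2) in design_mixtures(controls, [4.0, 1.0]).items():
    print(f"{cid}: mix1={mix1:.1f} mix2={mix2:.1f} ratio={mix1 / mix2:.2f}")
```

Because every control carries a known ratio and a known abundance, the same design supports both ratio checks and dose-response checks downstream.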
11. Method Validation with erccdashboard R package
Feature      A_1   A_2   A_3   B_1   B_2   B_3
T1             1     5     4     0     2     3
T2           200   204   199   101    97   103
T3           142   153   147   149   130   155
ERCC-0001      5     8    10    20    23    19
…
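As a rough illustration of what a ratio measurement from the example count table on this slide looks like, the sketch below averages the three replicates of each sample and reports a log2 ratio of B to A. It is written in Python rather than R, it is not the erccdashboard method, and the pseudocount of 1 and the lack of library-size normalization are simplifying assumptions made here.

```python
import math

# Counts copied from the slide's example table (replicates A_1..A_3, B_1..B_3).
COUNTS = {
    "T1": ([1, 5, 4], [0, 2, 3]),
    "T2": ([200, 204, 199], [101, 97, 103]),
    "T3": ([142, 153, 147], [149, 130, 155]),
    "ERCC-0001": ([5, 8, 10], [20, 23, 19]),
}


def log2_ratio(a_counts, b_counts, pseudocount=1.0):
    """log2 of mean(B) / mean(A); the pseudocount guards against zero counts."""
    mean_a = sum(a_counts) / len(a_counts)
    mean_b = sum(b_counts) / len(b_counts)
    return math.log2((mean_b + pseudocount) / (mean_a + pseudocount))


RATIOS = {feature: log2_ratio(a, b) for feature, (a, b) in COUNTS.items()}

for feature, value in RATIOS.items():
    print(f"{feature}: log2(B/A) = {value:+.2f}")
```

For a spike-in row, the observed ratio can be compared against the nominal ratio at which that control was mixed, which is the kind of check spike-in ratio mixtures are meant to enable.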
erccdashboard Package Vignette
Sarah A. Munro
May 4, 2014
This vignette describes the use of the erccdashboard R package to analyze External RNA Control
Consortium (ERCC) spike-in control ratio mixtures in gene expression experiments. If you use this package for
method validation of your gene expression experiments please cite our publication:
Please cite our paper when you use the erccdashboard
package for analysis. This is a placeholder citation,
because our manuscript is still under review.
Munro SA, Lund S, Pine PS, Binder H, Clevert D,
Conesa A, Dopazo J, Fasold M, Hochreiter S, Hong H,
Jafari N, Kreil DP, Łabaj PP, Liao Y, Lin S, Meehan
J, Mason CE, Santoyo J, Setterquist RA, Shi L, Shi
W, Smyth GK, Stralis-Pavese N, Su Z, Tong W, Wang
C, Wang J, Xu J, Ye Z, Yang Y, Yu Y, Salit M (Under
Review, 2014). Assessing Technical Performance in
Gene Expression Experiments with External Spike-in
RNA Control Ratio Mixtures.
A BibTeX entry for LaTeX users is
@Article{,
title = {Assessing Technical Performance in Gene Expression Experiments with External Spike-in RNA Co
author = {Munro SA and Lund S and Pine PS and Binder H and Clevert D and Conesa A and Dopazo J and Fa
journal = {Under Review},
volume = {0},
pages = {0},
year = {2014},
}
Munro SA, Lund S, Pine PS, Binder H, Clevert D, Conesa A, Dopazo J, Fasold M, Hochreiter S, Hong H,
Jafari N, Kreil DP, Łabaj PP, Li S, Liao Y, Lin S, Meehan J, Mason CE, Santoyo J, Setterquist RA, Shi
• Open-source R package
  – erccdashboard
• Assess technical performance of a gene expression experiment
• Compare results
  – Within a single laboratory
  – Between laboratories
18. Others…
Synthetic Spike-in Standards Improve Run-Specific
Systematic Error Analysis for DNA and RNA Sequencing
Justin M. Zook1*, Daniel Samarov2, Jennifer McDaniel1, Shurjo K. Sen3, Marc Salit1
1 Biochemical Science Division, National Institute of Standards and Technology, Gaithersburg, Maryland, United States of America, 2 Statistical Engineering Division,
National Institute of Standards and Technology, Gaithersburg, Maryland, United States of America, 3 Genetic Disease Research Branch, National Human Genome Research
Institute, National Institutes of Health, Bethesda, Maryland, United States of America
Abstract
While the importance of random sequencing errors decreases at higher DNA or RNA sequencing depths, systematic
sequencing errors (SSEs) dominate at high sequencing depths and can be difficult to distinguish from biological variants.
These SSEs can cause base quality scores to underestimate the probability of error at certain genomic positions, resulting in
false positive variant calls, particularly in mixtures such as samples with RNA editing, tumors, circulating tumor cells,
bacteria, mitochondrial heteroplasmy, or pooled DNA. Most algorithms proposed for correction of SSEs require a data set
used to calculate association of SSEs with various features in the reads and sequence context. This data set is typically either
from a part of the data set being “recalibrated” (Genome Analysis ToolKit, or GATK) or from a separate data set with special
characteristics (SysCall). Here, we combine the advantages of these approaches by adding synthetic RNA spike-in standards
to human RNA, and use GATK to recalibrate base quality scores with reads mapped to the spike-in standards. Compared to
conventional GATK recalibration that uses reads mapped to the genome, spike-ins improve the accuracy of Illumina base
quality scores by a mean of 5 Phred-scaled quality score units, and by as much as 13 units at CpG sites. In addition, since the
spike-in data used for recalibration are independent of the genome being sequenced, our method allows run-specific
recalibration even for the many species without a comprehensive and accurate SNP database. We also use GATK with the
spike-in standards to demonstrate that the Illumina RNA sequencing runs overestimate quality scores for AC, CC, GC, GG,
and TC dinucleotides, while SOLiD has less dinucleotide SSEs but more SSEs for certain cycles. We conclude that using these
DNA and RNA spike-in standards with GATK improves base quality score recalibration.
Citation: Zook JM, Samarov D, McDaniel J, Sen SK, Salit M (2012) Synthetic Spike-in Standards Improve Run-Specific Systematic Error Analysis for DNA and RNA
Sequencing. PLoS ONE 7(7): e41356. doi:10.1371/journal.pone.0041356
Editor: Janet Kelso, Max Planck Institute for Evolutionary Anthropology, Germany
Received February 28, 2012; Accepted June 20, 2012; Published July 31, 2012
This is an open-access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for
any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.
Funding: This research was supported in part by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of
Health. No additional external funding was received for this study. The funders had no role in study design, data collection and analysis, decision to publish, or
preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: zook@nist.gov
Introduction
As sequencing costs drop, it is becoming cost-effective to
sequence even whole genomes to a sufficient depth that random
errors become insignificant. However, systematic sequencing
errors (SSEs) and biases remain problematic even at high
sequencing depths, so recent research has started to focus on
understanding these SSEs and biases [1,2]. In this work, we focus
on SSEs rather than coverage biases, where SSEs are systematic
errors in sample preparation and sequencing processes that cause
base call errors to accumulate preferentially at certain base
positions in the genome, and coverage biases are biases in the
number of reads covering certain genomic regions such as GC-
bias [3–5]. Examples of SSEs, as well as random errors, are
portrayed in Figure 1(a). Compensating for these SSEs is critical
for applications in which a variant might be expected to be in only
a small fraction of the reads, such as samples containing RNA-
editing [6,7], cancer tissues and circulating tumor cells [8–11],
fetal DNA in mother’s blood [12], mixtures of bacterial strains
[13], mitochondrial heteroplasmy [14], mosaic disorders [15], and
pooled samples [16,17]. Since the causes of many SSEs are not
well understood and may vary due to batch effects in a run-specific
manner, compensating for them requires training data sets. The
two previously proposed approaches either use a separate data set
with special characteristics (e.g., SysCall uses overlapping paired-
end reads [1]) or use the data set itself excluding regions known to
have variants (e.g., Genome Analysis Toolkit, or GATK, base
quality score recalibration [2]). Here, we combine the advantages
of these approaches by using DNA or RNA spike-in standards
without homology to almost all biological organisms.
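The idea of training on spike-ins can be illustrated with a small Python sketch. Because the spike-in sequence is known, any mismatch in a read aligned to it can be counted as an error, and an empirical Phred quality can be computed for each reported quality value. This is not GATK's recalibration algorithm: the single covariate (reported quality) and the add-one smoothing are assumptions made for illustration only.

```python
import math


def phred(p_error):
    """Phred-scaled quality: Q = -10 * log10(probability of error)."""
    return -10.0 * math.log10(p_error)


def empirical_quality(mismatches, observations):
    """Empirical quality with add-one smoothing (an assumption made here)."""
    return phred((mismatches + 1) / (observations + 2))


def recalibration_table(bases):
    """Map each reported quality to an empirical quality.

    `bases` is an iterable of (reported_quality, is_mismatch) pairs for
    bases aligned to a spike-in whose true sequence is known.
    """
    counts = {}
    for reported_q, is_mismatch in bases:
        n_mismatch, n_total = counts.get(reported_q, (0, 0))
        counts[reported_q] = (n_mismatch + int(is_mismatch), n_total + 1)
    return {q: empirical_quality(m, n) for q, (m, n) in counts.items()}


# Made-up data: bases reported at Q30 that mismatch the spike-in 9 times in 998.
observed = [(30, True)] * 9 + [(30, False)] * 989
for q, q_emp in recalibration_table(observed).items():
    print(f"reported Q{q} -> empirical Q{q_emp:.1f}")
```

A gap between reported and empirical quality at a given covariate value is the signal a recalibration step would correct.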
The first approach, SysCall, used a methyl-Seq dataset that had
overlapping paired-end reads to detect SSEs depending on
sequencing direction for the Illumina sequencer [1]. The region
in which the reads overlap can be used to find systematic errors
that preferentially occur on one DNA strand compared to the
other strand. To improve variant calls, the SysCall method uses
a separate dataset with overlapping reads to train a logistic
regression model that accounts for SSEs correlated with several
covariates: (1) the 2 preceding bases + the base in question (each
base independently), (2) directionality bias of the errors, the
proportion of non-reference reads, and (3) a comparison of the
quality scores of the error base to the next base. Most sequencing
runs do not contain overlapping paired reads, so SysCall assumes
the SSEs in a training data set are the same as the SSEs in other
19. Experience from Expression Analysis
Thanks to Wendell Jones and Erik Aronesty
I like…
• 1000s of RNA-Seq samples
  – Ambion Mix 1
  – “did the library reactions work appropriately and consistently?”
  – “did our lab degrade samples or were the samples already degraded?”
  – “effects (or lack of) between lane or flowcell”
I wish…
• “Construct Ext RNA Controls that emulate a variety of splice variation (some that may be challenging) and have them at different magnitudes”
  – “examine not only the chemistry but also the bioinformatic pipeline to ensure it has basic fitness.”
• “Suggest a protocol for adding Ext RNA Controls for FFPE.”
  – “While we spike in ERCC controls at a fixed amount for FFPE samples, we get out a wild range of sequence coming out that aligns to the ERCC controls.”
  – Hypothesis: “much of the target RNA is so damaged that it doesn't ligate to adapters correctly, but ERCC controls do (ligate); as a result they are (much) preferentially amplified.”
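The FFPE observation above suggests a simple process-control check: when controls are spiked in at a fixed amount, the fraction of reads aligning to them should be roughly stable across samples. The Python sketch below computes that fraction and flags samples far from the median. The `ERCC-` name prefix, the sample names, the counts, and the five-fold threshold are all hypothetical choices for illustration.

```python
import statistics


def ercc_read_fraction(counts, prefix="ERCC-"):
    """Fraction of a sample's reads assigned to features named with `prefix`."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    spike = sum(n for name, n in counts.items() if name.startswith(prefix))
    return spike / total


def flag_outliers(fractions, fold=5.0):
    """Return sample names whose fraction is more than `fold` away from the median."""
    median = statistics.median(fractions.values())
    if median == 0:
        return []
    return sorted(
        name for name, f in fractions.items()
        if f > median * fold or f < median / fold
    )


# Made-up per-sample counts; only the last sample is dominated by spike-in reads.
samples = {
    "fresh_1": {"ERCC-0001": 20, "T2": 980},
    "fresh_2": {"ERCC-0001": 25, "T2": 975},
    "ffpe_1": {"ERCC-0001": 600, "T2": 400},
}
fractions = {name: ercc_read_fraction(c) for name, c in samples.items()}
print(fractions)
print("flagged:", flag_outliers(fractions))
```

A flagged sample does not say why the fraction shifted; it only marks where a hypothesis such as preferential ligation of intact controls would be worth testing.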
21. Opportunities for ERCC 2.0
• New technologies
  – RNA-Seq
  – Long reads
    • PacBio, Moleculo
  – Digital counting
    • Cellular Research, digital PCR
• Method improvements
  – Library preparation
  – Bioinformatics
• New discoveries
Counting individual DNA molecules by the
stochastic attachment of diverse labels
Glenn K. Fu, Jing Hu, Pei-Hua Wang, and Stephen P. A. Fodor1
Affymetrix, Inc., 3420 Central Expressway, Santa Clara, CA 95051
Edited* by Ronald W. Davis, Stanford Genome Technology Center, Palo Alto, CA, and approved March 22, 2011 (received for review November 27, 2010)
We implement a unique strategy for single molecule counting
termed stochastic labeling, where random attachment of a diverse
set of labels converts a population of identical DNA molecules
into a population of distinct DNA molecules suitable for threshold
detection. The conceptual framework for stochastic labeling is
developed and experimentally demonstrated by determining the
absolute and relative number of selected genes after stochastically
labeling approximately 360,000 different fragments of the human
genome. The approach does not require the physical separation of
molecules and takes advantage of highly parallel methods such as
microarray and sequencing technologies to simultaneously count
absolute numbers of multiple targets. Stochastic labeling should
be particularly useful for determining the absolute numbers of
RNA or DNA molecules in single cells.
absolute counting ∣ digital PCR ∣ next-generation sequencing ∣
single molecule detection
Determining small numbers of biological molecules and their
changes is essential when unraveling mechanisms of cellular
response, differentiation or signal transduction, and in perform-
[Figure 1 panel labels: identical DNA target molecules {t1, t2, …, tn}; pool of labels {l1, l2, …, lm}; random labeling (t1 with l20, t2 with l107, t3 with l477, t4 with l9); amplification and detection of k distinctly labeled molecules]
Fig. 1. A schematic representation of the labeling process. An example
showing four identical target molecules in solution. Each DNA molecule ran-
domly captures and joins with a label by choosing from a large, nondepleting
Fu et al. PNAS 2011
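The counting idea in the abstract and figure can be made concrete with the standard occupancy calculation, sketched here in Python. If each of n identical molecules independently picks one of m equally likely labels from a nondepleting pool, the expected number of distinct labels observed is k = m(1 - (1 - 1/m)^n), and inverting that expression gives an estimate of n from an observed k. The pool size used in the example is an illustrative assumption, and the inversion ignores sampling noise in k.

```python
import math


def expected_distinct_labels(n_molecules, m_labels):
    """Expected number of distinct labels when n molecules each draw one of
    m equally likely labels with replacement."""
    return m_labels * (1.0 - (1.0 - 1.0 / m_labels) ** n_molecules)


def estimate_molecules(k_observed, m_labels):
    """Invert the expectation to estimate the number of molecules from k."""
    if not 0 <= k_observed < m_labels:
        raise ValueError("k_observed must be in [0, m_labels)")
    return math.log(1.0 - k_observed / m_labels) / math.log(1.0 - 1.0 / m_labels)


# Illustrative pool of 1000 labels (a made-up size).
m = 1000
for n in (10, 100, 500, 2000):
    k = expected_distinct_labels(n, m)
    print(f"n={n}: expected k={k:.1f}, back-estimated n={estimate_molecules(k, m):.1f}")
```

When n is small relative to m, k is close to n and counting labels is nearly direct; as n approaches and exceeds m, labels saturate and the correction grows.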
23. Charge to the Workshop
• Develop consensus on…
  – Concept
    • Shared interests
  – Portfolio
    • Controls
    • Analysis
    • Documentary Standards
• Develop consortium structure…
  – Working groups
  – Steering committee
24. Principles of Operation
• Pre-competitive
• Consensus decision-making
• Data-driven
• Technology independent
• Leadership
  – Working Groups
  – Steering Committee
• NIST-hosted
• “You get out of it what you put into it.”
25. “A rising tide floats all boats”
ERCC operates by consensus
“A rising tide floats all boats…”
27. Scope of ERCC 2.0
• ERCC 2.0 is convened to develop standard controls for RNA measurements
• Three working groups are proposed
  1. Design
    – Types of controls & sequence design
  2. Development
    – Building controls, developing & testing control mixtures
  3. Analysis
    – Standard performance metrics
    – Tools as needed to support design & development
28. The Arc of ERCC 2.0
• Products
  – Sequences representing different types of RNA
    • Transcript isoforms
    • miRNA
    • New mRNA mimics
    • …
  – Documentary standards for using controls
  – Performance metrics
• Logistics
  – Workshops
    • Number, frequency
  – Telecons, Mailing list, Wiki
• Development Schedules
  – Sequence selection
  – Control synthesis
  – Control testing and analysis
  – Reference material development, characterization, release
  – Analytical methods & tools
  – Documentary standards
  – …
  – Finished.
• Dissemination
  – Steering committee to address business models
30. What will we do together?
• NIST is committed to...
  – Hosting the consortium
  – Supporting product development
• Portfolio possibilities
  – Reference materials
  – Reference data
  – Analysis methods
  – Analysis tools
  – Documentary standards
  – …
• Define consortium mission
  – Purpose of ERCC 2.0 products
    • Providing infrastructure to discern signal from artifact
    • Confidence in RNA measurement results
    • …
31. How can we work together?
• How do we make decisions?
• How do we operate?
  – Working groups
  – Semi-annual meetings
  – Conference calls
  – Email list, wiki?
• Why a consortium?
  – We can make better standards together
• Things the consortium can do as an entity:
  – Integrate controls from the membership
  – Conduct validation studies
  – Make recommendations, guidelines, develop standards
    • Documentary standards to support regulated applications
35. Development Working Group
• Control synthesis
  – DNA templates, RNA molecules
  – Special modifications
• QC of DNA, RNA controls
  – Purity
  – Homogeneity
  – Stability
• Control Mixture Design
  – Dynamic range
  – Ratios
  – …
• Plan and conduct interlaboratory studies to evaluate controls
  – Validation of controls
  – Validation of concepts and analysis
  – Use multiple measurement technologies
36. Analysis Working Group
• Develop standard performance metrics
  – Develop reference implementation as example
• Applications
  – Process control
  – Quantitative benchmarking
  – Normalization
  – Optimization
• Tools as needed to support design & development
  – Validation study analysis
  – Mixture design tools
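One candidate for a standard performance metric of the kind listed above is the slope of a spike-in dose-response on a log-log scale, where a slope near 1 means signal is proportional to abundance. The Python sketch below fits that line by ordinary least squares. The choice of metric and the example numbers are assumptions for illustration, not a metric adopted by the working group.

```python
import math


def loglog_fit(concentrations, signals):
    """Least-squares slope and intercept of log2(signal) against log2(concentration)."""
    xs = [math.log2(c) for c in concentrations]
    ys = [math.log2(s) for s in signals]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up spike-in data: signal exactly three times the nominal concentration.
nominal = [1, 2, 4, 8, 16]
measured = [3 * c for c in nominal]
slope, intercept = loglog_fit(nominal, measured)
print(f"slope={slope:.3f} intercept={intercept:.3f}")
```

A slope below 1 at the high end would point to saturation, and scatter at the low end gives a rough view of the limit of detection.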
39. Closing Comments Day 1
• 9:00 am start tomorrow (there will be coffee)
• More presentations tomorrow morning
• Open Pitch session is also available tomorrow
  – Let us know if you want to speak tomorrow, but you can also just get up and pitch
• Working groups will reconvene tomorrow to develop summaries
• Please join us now for dinner