804 Mol. BioSyst., 2012, 8, 804–817 This journal is © The Royal Society of Chemistry 2012
Cite this: Mol. BioSyst., 2012, 8, 804–817
A statistical framework for multiparameter analysis at the single-cell level†
Wandaliz Torres-García,ab
Shashanka Ashili,b
Laimonas Kelbauskas,*b
Roger H. Johnson,b
Weiwen Zhang,*b
George C. Runger*a
and Deirdre R. Meldrumb
Received 17th October 2011, Accepted 2nd December 2011
DOI: 10.1039/c2mb05429a
Phenotypic characterization of individual cells provides crucial insights into intercellular
heterogeneity and enables access to information that is unavailable from ensemble averaged, bulk
cell analyses. Single-cell studies have attracted significant interest in recent years and spurred the
development of a variety of commercially available and research-grade technologies. To quantify
cell-to-cell variability of cell populations, we have developed an experimental platform for
real-time measurements of oxygen consumption (OC) kinetics at the single-cell level. Unique
challenges inherent to these single-cell measurements arise, and no existing data analysis
methodology is available to address them. Here we present a data processing and analysis method
that addresses challenges encountered with this unique type of data in order to extract
biologically relevant information. We applied the method to analyze OC profiles obtained with
single cells of two different cell lines derived from metaplastic and dysplastic human Barrett’s
esophageal epithelium. In terms of method development, three main challenges were considered
for this heterogeneous dynamic system: (i) high levels of noise, (ii) the lack of a priori knowledge
of single-cell dynamics, and (iii) the role of intercellular variability within and across cell types.
Several strategies and solutions to address each of these three challenges are presented.
Features such as slopes, intercepts, and breakpoints (change-points) were extracted for every OC profile
and compared across individual cells and cell types. The results demonstrated that the extracted
features facilitated exposition of subtle differences between individual cells and their responses to
cell–cell interactions. With minor modifications, this method can be used to process and analyze
data from other acquisition and experimental modalities at the single-cell level, providing a
valuable statistical framework for single-cell analysis.
Introduction
Cell-to-cell variability has been found to play a central role in
a variety of physiological processes such as differentiation,
proliferation, stress response and pathogenesis. Due to the
stochastic nature of many intracellular processes, individual
cells can exhibit significant phenotypic differences and respond
differently to stimuli and changes in the microenvironment.1–4
The origin of many diseases is thought to lie in a few aberrant
cells, or perhaps even a single one, that acquire the capability to
evade the cues regulating normal cell function and death.
Early identification and detailed characterization of such
abnormal cells bear the potential not only to provide deep
insights into fundamental cell processes, but also to open new
avenues for treatment and management of diseases with high
morbidity and mortality, including cancer. For this reason,
single-cell studies have gained momentum over the last
decade, facilitated by technological advances enabling reliable
measurement of various biologically relevant parameters with
high sensitivity and precision. To study cell signaling and
metabolic pathways, one needs to be able to characterize
simultaneously as many parameters of living single cells
as possible. Multiparameter analysis could reveal the details
of intracellular mechanisms, providing novel insights into
systems biology of cells.
Technological challenges such as extremely low amounts of
biological material, small differential changes in metabolite
concentrations and the fragility of cells have been hampering
significant progress in single-cell analysis. One of the major
limitations in single-cell experiments is the low signal-to-noise
ratio. Reliable separation of meaningful data from noise
represents a formidable challenge, one that is exacerbated by
the absence of a priori knowledge of the dynamics of physiological
processes that take place in individual cells. This is
particularly true in experiments where single living cells need
a
School of Computing, Informatics, and Decision Systems
Engineering, Arizona State University, Tempe, AZ 85287-5906,
USA. E-mail: George.Runger@asu.edu
b
Center for Biosignatures Discovery Automation, The Biodesign
Institute, Arizona State University, Tempe, AZ 85287-6501, USA.
E-mail: Laimonas.Kelbauskas@asu.edu, Weiwen.Zhang@asu.edu
† Electronic supplementary information (ESI) available. See DOI:
10.1039/c2mb05429a
Molecular
BioSystems
Dynamic Article Links
www.rsc.org/molecularbiosystems PAPER
DownloadedbyArizonaStateUniversityon14March2012
Publishedon05January2012onhttp://pubs.rsc.org|doi:10.1039/C2MB05429A
to be characterized with minimal perturbation of their normal
function, further limiting the experimentalist’s choice of
available methods.
There are numerous methods available to remove white noise
in dynamic data. Many of them have been suitably adapted for
application in a variety of research fields such as chemistry,
environmental studies and medicine.5–7
Nevertheless, reducing
random perturbation in a signal is not a trivial task, since
it is usually unclear how much noise can be removed without
losing the “true” signal. Quality assurance problems in newly
developed technologies are common. For example, in the
1990s, when DNA microarrays started to gain interest in the
scientific community, the importance and value of the unique
information for understanding biological systems,8–13
as well as
the need for quality assessment and noise reduction14,15
were
clearly acknowledged. In light of these challenges, many noise
reduction methods were proposed, and later developed into a
mature and unified methodology.16,17
Analogously to noise reduction, the characterization of data and
signals has been widely studied in bioinformatics, especially for data
quality assessment in microarray-based studies. Many
feature selection techniques are commonly used for microarray
data characterization, including selection of genes with significant
expression levels in response to changes in conditions or
experimental settings.18
Modeling of real-time data obtained from dynamical
systems has been explored in the literature utilizing traditional
statistical methods.19–23
The traditional methods tend to
establish parametric assumptions which are often hard to
justify in complex biological systems. Hence, there exists a
critical need to model real-time measurements in biological
systems, including live cells, without a priori knowledge of the
nature of underlying dynamical processes. However, so far
none of these established methods have been applied to
analyze data obtained from individual living cells.
Here we present a study focused on the analysis of novel
respiration kinetics data from individual cells. The cell
metabolic analysis method entails manipulation24 and isolation
of single cells25 and determination of their oxygen consumption
(OC) kinetics in real time.26–28 The data obtained from
these measurements exhibit much higher levels of noise than
those from bulk-cell experiments. The lack of a priori knowledge
of single-cell dynamics makes it difficult to define characteristic
features in these datasets, posing challenges in the
extraction of biologically relevant information and its proper
interpretation. The real-time nature of the measurements
contributes additional complexity to the analysis. In this
work we describe our initial efforts to develop statistical
methodologies to address the challenges of noise reduction,
data characterization through feature extraction, and biological
comparison for respiration phenotype measurements in
individual cells. We analyzed OC kinetics data of single cells
obtained from two esophageal epithelial cell lines: metaplastic
(CP-A) and early dysplastic (CP-C) Barrett’s esophageal cells.
These cell lines were derived from biopsies taken from the
corresponding regions in human esophagus and represent
different stages of pre-neoplastic progression. Because of the
clear delineation of the two cell types in terms of histopathology
and their relevance to cancer, our findings may also be of interest
to cancer biologists. Because they serve to define and extract
elements of a disease biosignature, the statistical methodologies
presented here could be used as a foundational framework for
analyzing single-cell data.
Results
Data preprocessing
The data consist of OC kinetics in single human metaplastic
(CP-A) and early dysplastic (CP-C) cells. Fig. 1 summarizes
the challenges addressed in this unique data structure. The
early exploration of OC measurements at the single-cell level
indicated the need to reduce noise and unwanted perturbations
in the signals. Reducing noise helps enhance the discovery of
Fig. 1 Statistical framework diagram. Sequential steps to process and analyze single-cell oxygen consumption data: smoothing, feature extraction
and classification. Major challenges and proposed solution strategies to address each one of them are shown.
relevant features related to the “true” signal’s behavior. In our
approach two main stages of noise reduction were performed:
(1) low-pass filtering and (2) outlier smoothing. Common
filtering techniques were applied to the OC data in two
different ways. First, a filter was applied to each of the OC
curves to estimate a curve-specific metric of variation that was
used to detect outliers in the unsmoothed data. This outlier
smoothing process invoked traditional control charts in which
any data point of the OC curves lying outside curve-specific
control limits was considered an outlier and its value was
smoothed by neighborhood averaging (see the Methods section).
This step reduced the adverse influence of artifacts caused by
the stochastic response of the microsensor or other measurement
system components. After outlier smoothing, the resulting signal
was processed through a low-pass filter (Fig. 2).
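The two-stage noise reduction described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the moving-average filter, the window width, the use of the residual standard deviation as the curve-specific metric of variation, and the 3-sigma control limits are all illustrative choices, and the function names are hypothetical.

```python
from statistics import mean, pstdev

def moving_average(y, window=5):
    """Centered moving average; the window shrinks at the curve edges."""
    half = window // 2
    return [mean(y[max(0, i - half): i + half + 1]) for i in range(len(y))]

def smooth_outliers(y, window=5, k=3.0):
    """Replace points outside trend +/- k*sigma by the mean of their neighbors.

    Assumption: sigma is the standard deviation of the residuals between
    the raw curve and its moving average (a curve-specific control limit).
    """
    trend = moving_average(y, window)
    residuals = [yi - ti for yi, ti in zip(y, trend)]
    sigma = pstdev(residuals)
    half = window // 2
    cleaned = list(y)
    for i, r in enumerate(residuals):
        if abs(r) > k * sigma:
            lo, hi = max(0, i - half), min(len(y), i + half + 1)
            neighbors = [y[j] for j in range(lo, hi) if j != i]
            cleaned[i] = mean(neighbors)  # neighborhood averaging
    return cleaned

def preprocess(y, window=5, k=3.0):
    """Outlier smoothing followed by a simple low-pass (moving-average) filter."""
    return moving_average(smooth_outliers(y, window, k), window)
```

For a curve `y` sampled at regular intervals, `preprocess(y)` returns the filtered signal. A narrower window or a larger `k` makes the procedure less aggressive.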
Feature extraction
After preprocessing, the data analysis procedure aimed to
characterize the OC kinetics. The feature extraction step
addresses challenge number two described in Fig. 1, which
can be divided into two separate problems: (1) removal of
redundant information characterized through the understanding
of experimental limitations, and (2) extraction of distinctive
features without a priori knowledge of the system.
Detection of the time needed to reach zero oxygen concentration
in the microchambers. The removal of redundant
information from further analysis is based on experimental
considerations. During measurement, individual cells are
hermetically isolated in sub-nanolitre volume chambers which
results in a limited amount of oxygen being available for
consumption by cells.27,28
Data collected after the oxygen
concentration in the chambers reaches zero are not useful
for OC kinetics analysis and can be discarded as extraneous
or redundant. During an experiment, the OC kinetics of nine
individual cells were recorded simultaneously. The time needed
for each cell to deplete the oxygen in the microchamber varied
significantly from cell to cell due to metabolic rate heterogeneity.
The experiment was continued until the oxygen concentration
in all nine chambers reached zero, resulting in different
amounts of redundant data collected for each cell.28
To
address this issue, we proceeded to automatically detect the
time point where each curve reached a zero value (0% oxygen)
and to discard data collected after that time point (referred to
as zero-value tails), excluding it from further analysis (Fig. 2).
Hence, we define redundant information in the context of
experimental conditions, namely by the limited amount of
oxygen available for each cell to consume and the variable rate
at which they consume it. Further experimental details related
to the data used in this study can be found in the Methods
section. Removing the zero-value tails from each of the OC
curves facilitated robust modeling of these curves in regions of
interest and allowed for reliable feature extraction from the
kinetics. Removal of the zero-value tails would be a trivial
problem if one had to analyze a small number of curves or if
the time to reach zero was the same for all cells. For the
analysis of hundreds of cells, however, we needed to develop
an algorithm to automatically remove the tails to ensure
consistency and rapid data processing.
A cumulative sum (CUSUM) control chart is a commonly
used statistical tool to detect small changes.29
Its application
allowed us to automatically detect the time point at which each
OC curve reaches a zero oxygen value and does not change
significantly afterwards (Fig. 2). This statistical procedure
was used to detect a change time point, a feature called
TimetoZero, for each sample. Fig. 3 summarizes the time points
detected by the CUSUM procedure: frequency plots of the times
needed to reach 0% oxygen concentration, stratified by cell line
for the entire data set, give a general view of the distribution
of the TimetoZero feature across cell lines. We used this feature
as a reference point to remove the
data lying beyond this point as redundant. At the same time, by
determining the time-to-zero reference points, we captured a
unique characteristic of the OC kinetics to better understand
cell heterogeneity.
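One way such a CUSUM-based detection of TimetoZero could look is sketched below. The paper does not give the chart parameters, so everything specific here is an assumption: the chart is run backward from the end of the curve, the last `tail` samples define the in-control (near-zero) state, and the allowance `k` and decision limit `h` are conventional illustrative values. The function name is hypothetical.

```python
from statistics import mean, pstdev

def time_to_zero(t, y, tail=10, k=0.5, h=5.0):
    """Estimate when a curve settles onto its final near-zero plateau.

    A one-sided CUSUM of standardized deviations is accumulated while
    walking backward in time. When it exceeds h, the curve has clearly
    left the plateau; the estimate is the last index at which the CUSUM
    was still zero. Returns None if no departure is detected.
    """
    mu = mean(y[-tail:])
    sigma = pstdev(y[-tail:]) or 1e-9  # guard against a perfectly flat tail
    s = 0.0
    last_zero = len(y) - 1
    for i in range(len(y) - 1, -1, -1):
        z = (y[i] - mu) / sigma
        s = max(0.0, s + z - k)
        if s > h:
            return t[last_zero]
        if s == 0.0:
            last_zero = i
    return None
```

Given the returned time `tz`, the zero-value tail can be dropped with `[(ti, yi) for ti, yi in zip(t, y) if ti <= tz]`, and `tz` itself is kept as a feature of the curve.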
OC characterization and other features. Modeling OC curves
is challenging since there is no a priori knowledge of single-cell
respiration kinetics. Other than the notion that cells are
Fig. 2 Step-by-step statistical framework example. Main steps used to characterize the OC kinetics data are shown: (a) data filtering, (b) detection
of feature, TimetoZero, using CUSUM, (c) removal of zero-valued tails, (d) identification of characteristic features using a spline model.
heterogeneous, little is known about the characteristics of specific
factors such as oxygen consumption and their relationship to
different cell types and metabolic states.1,30
We addressed this
challenge by approximating OC kinetics with a constrained
piecewise linear regression model. This spline model fits two
continuous linear regressions with slopes constrained to be
negative. Based upon preliminary data analysis which revealed
this pattern, it seemed appropriate to study OC curves by means
of fitting two linear models (Fig. 2). The two continuous regressions
share a mutual breakpoint optimally detected through a
likelihood method across the entire time span. This model allows
us to capture features in different segments of the data.
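The constrained two-segment fit can be illustrated with the following sketch. It rests on assumptions that are not stated in the text: errors are taken to be Gaussian, so that maximizing the likelihood over the breakpoint amounts to minimizing the residual sum of squares; candidate breakpoints are restricted to observed time points; and the negativity constraint is enforced simply by discarding candidate fits with a non-negative slope. All names are hypothetical.

```python
def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(3):
        p = max(range(c, 3), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, 3):
            f = M[r][c] / M[c][c]
            for j in range(c, 4):
                M[r][j] -= f * M[c][j]
    x = [0.0, 0.0, 0.0]
    for r in range(2, -1, -1):
        x[r] = (M[r][3] - sum(M[r][j] * x[j] for j in range(r + 1, 3))) / M[r][r]
    return x

def fit_two_segments(t, y):
    """Continuous two-segment linear fit: y = b0 + b1*t + b2*max(0, t - c)."""
    best = None
    for c in t[2:-2]:  # keep at least two points on each side of the breakpoint
        X = [[1.0, ti, max(0.0, ti - c)] for ti in t]
        XtX = [[sum(r[a] * r[b] for r in X) for b in range(3)] for a in range(3)]
        Xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(3)]
        b0, b1, b2 = solve3(XtX, Xty)
        left, right = b1, b1 + b2
        if left >= 0 or right >= 0:  # both slopes must be negative
            continue
        sse = sum((yi - (b0 + b1 * ti + b2 * max(0.0, ti - c))) ** 2
                  for ti, yi in zip(t, y))
        if best is None or sse < best["sse"]:
            best = {"change_time": c, "change_oxygen": b0 + b1 * c,
                    "intercept": b0, "left_slope": left,
                    "right_slope": right, "sse": sse}
    return best
```

The returned dictionary mirrors the kind of features discussed in the text (change-point time and oxygen value, intercept, left and right slopes); dividing `sse` by the number of points gives a mean squared error for the best fit.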
The spline model was compared with the simple linear
regression model using a goodness-of-fit criterion. We
performed the comparison of these two models by using an
F test for each OC curve. These multiple comparisons raise a
commonly known problem in multiple hypothesis testing:
increased false-positives. To address this problem, we have
corrected all computed p-values using the Bonferroni correction
method.31
Through the evaluation of these tests, we found
that 99.3% and 97.7% of the OC kinetics data obtained from
CP-A and CP-C cells, respectively, could be fit better with the
constrained piecewise linear (spline) regression than with the
simple linear regression model at α = 0.001. Fig. 4 shows
the percentage of curves that were fit more accurately with the
spline model as a function of the level (α) of Type I error for the
F test. In general, more than 90% of OC curves obtained with
both cell types showed a statistically significant improvement of
the fit at different values of α when using the constrained
piecewise model as compared to simple linear regression.
A slightly higher percentage of curves measured from CP-A
compared to CP-C cells could be fit more reliably with the
constrained piecewise model.
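The model comparison can be sketched as a nested-model F statistic per curve followed by a Bonferroni adjustment across curves. The parameter counts used in the example (two for the simple linear model, four for the spline: intercept, two slopes and breakpoint) are an assumption, and the function names are hypothetical. Converting the statistic to a p-value requires the F distribution with the corresponding degrees of freedom (for instance `scipy.stats.f.sf`, if SciPy is available), which this stdlib-only sketch leaves out.

```python
def f_statistic(sse_reduced, sse_full, n, p_reduced, p_full):
    """F statistic for comparing a reduced model against a fuller nested model.

    n is the number of observations; p_* are the numbers of fitted parameters.
    """
    df_num = p_full - p_reduced
    df_den = n - p_full
    return ((sse_reduced - sse_full) / df_num) / (sse_full / df_den)

def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def fraction_better(p_values, alpha):
    """Fraction of curves whose adjusted p-value falls below alpha."""
    adjusted = bonferroni(p_values)
    return sum(p < alpha for p in adjusted) / len(adjusted)
```

Evaluating `fraction_better` over a range of `alpha` values yields a curve of the same kind as the one summarized in Fig. 4.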
The model enabled the extraction of relevant features that
were used to characterize the OC kinetics. Besides the regular
features from fitting linear regressions (intercepts and slopes),
we were able to detect several other features (Table 1), such as
time and oxygen concentration at which the first slope of the
piecewise model is replaced by the second slope. All features
were determined for each kinetics curve of both cell types
(CP-A and CP-C), and the feature distributions within and
across cell types were further analyzed. Whether or not piecewise
linear regression represents a biologically relevant model
Fig. 3 Features histogram and significance tests between CP-A and CP-C for the TimetoZero feature. (a) Distribution histograms for single CP-A
and CP-C cells; (b) 95% confidence interval of the means of the feature for both cell types.
Fig. 4 Comparative multiple hypothesis testing between the spline
model and linear regression fit. Percentage of OC curves per cell type
that revealed a better fit with the spline model than with the linear
regression shown as a function of different values of α (Type-I error).
The Bonferroni correction was applied to the individual test p-values
to alleviate the problem of false-positives when multiple comparisons
are performed. Inset: zoom in on the α range [0, 0.05].
of OC kinetics, it provided a good empirical fit to the
experimental data with a simple structure, permitting feature
extraction for comparative studies between different cell types
and conditions.
Validation
Prior to using this statistical methodology for biological data
interpretation, it was validated to assess its accuracy and
robustness. For validation we used a model system based on
enzymatic scavenging of oxygen by Oxyrase.32,33
Oxyrase is a
preparation of membrane fragments from Escherichia coli
and contains membrane monooxygenases and dioxygenases.
When it comes in contact with lactic acid, Oxyrase removes
oxygen rapidly from aqueous environments, including cell
medium. Because of its enzymatic basis, oxygen removal
kinetics by Oxyrase can be modeled using the Michaelis–
Menten equation that describes enzymatic reaction rates as
a function of substrate concentration. To reproduce data
collection conditions as close as possible to actual experiments,
we measured oxygen consumption kinetics of Oxyrase (no cells)
using experimental settings identical to those used for single
cells. This ensures that the signal-to-noise ratios are similar to
single-cell data. We used four different Oxyrase concentrations,
50 µL, 150 µL, 200 µL, and 250 µL (ranging from 0.06 to 0.2% by
volume) for more robust validation of the statistical framework.
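For reference, the Michaelis–Menten rate law mentioned above has the standard textbook form shown below. Writing it with oxygen as the substrate is an assumption made here for illustration, and the symbols are the conventional ones rather than notation taken from the paper: $v$ is the oxygen removal rate, $V_{\max}$ the maximum rate, and $K_m$ the Michaelis constant.

```latex
v = -\frac{d[\mathrm{O_2}]}{dt} = \frac{V_{\max}\,[\mathrm{O_2}]}{K_m + [\mathrm{O_2}]}
```

When the oxygen concentration is well above $K_m$, the rate is approximately constant at $V_{\max}$ and the concentration declines almost linearly; as the concentration approaches $K_m$, the rate falls and the curve flattens.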
The features extracted from the OC kinetics data obtained
with Oxyrase utilizing the statistical framework showed significant
differences among signals measured with different
Oxyrase concentrations. The application of a Random Forest
classifier model34
to the extracted features revealed clear
discrimination among the four different concentrations with
out-of-bag error rates of 2% when all features were included
in the model, and 11.1%, when TimetoZero (see Feature
extraction) was removed from the data analysis. Ensemble
learners are predictive models that combine a collection of
simpler classifiers yielding better predictive performance as an
ensemble than any of the individual classifiers.35
The distinct
discrimination among the different Oxyrase concentrations
was visualized with the use of multidimensional scaling36
in
panels (a) and (b) of Fig. 5. Each panel portrays the visualization
Table 1 Extracted features and their descriptions
Feature: Description
Change-point.Time: Time value at which the change in slopes in the piecewise linear fit takes place
Change-point.Oxygen: Oxygen consumption value at which the change in slopes in the piecewise linear fit takes place
Intercept coefficient (B0): Intercept of left linear regression
Left slope coefficient (B1): Slope of the linear regression before the change-point (a)
Right slope coefficient (B1): Slope of the linear regression after the change-point (a)
Kurtosis: Measure of “peakedness”. Higher kurtosis means more of the variance is the result of infrequent extreme deviations, as opposed to frequent modestly sized deviations.
Skewness: Measure of the asymmetry.
Minimum MSE: The mean squared error value for the best piecewise linear regression fit.
TimetoZero: Time at which the oxygen concentration in the chamber reaches a value of zero
Brief description of features extracted from curves after application of smoothing and filtering techniques.
(a) Slope magnitudes extracted from the spline model are divided by two for curves obtained with two cells per well.
Fig. 5 Multidimensional scaling plots for the Oxyrase enzymatic reaction used for validation. The plots visualize the scaling coordinates of the proximity matrix
obtained with a Random Forest trained to classify four distinct Oxyrase concentrations. The Oxyrase measurements were gathered
with the same semi-automated technology as the OC curves in this study. They were used for validation because the behavior of the reaction is well understood and
differences are expected between features of Oxyrase curves measured at different concentrations.
patterns for the two Random Forest models discussed earlier:
(a) a classifier with all features and (b) a classifier with all
features except TimetoZero. The proximity matrix produced by
the Random Forest classifier is used as input to multidimensional
scaling to find a suitable 2D configuration
that showcases the sample patterns. The axes, labeled scaling
dimensions, are the 2D coordinates in which these
patterns are plotted. The ability to clearly differentiate varying
reaction rates (slopes) obtained with different Oxyrase concentrations
shows that our approach enables adequately robust and
accurate characterization of dynamic processes. By capturing
these differences among the signals known to have different
kinetics using the statistical framework employed in this work,
we validated our approach for application to single-cell OC data.
Biological inferences and interpretation
Comparison between different cell lines. Extracted quantitative
features such as slopes, intercepts, and breakpoints (change-points)
were compared across individual cells and cell types.
To detect differences between CP-A and CP-C features we
computed two sets of significance tests. A test of the statistical
Fig. 6 Comparison of features between CP-A and CP-C cells by means of a spline model. Three main features were extracted using the
constrained piecewise linear model: (a, b) oxygen concentration where the change of slopes in the fit occurs (change-point), (c, d) left (before slope
change) and (e, f) right (after slope change) slopes. Figures on the left show feature frequency values and those on the right show 95% confidence
interval of the features means.
significance of differences between the means or the medians of
the features of the two cell lines revealed significant differences
for the TimetoZero and Change-point.Oxygen features (Table 1).
The distribution of the time point when each OC kinetics
curve reaches an oxygen concentration value near zero
(TimetoZero feature) exhibits a broad range of values in both
cell types, as mentioned previously (Fig. 3). Statistical analysis
revealed significant differences between both the means
and the medians of the two cell types with p-values equal to
0.003 and 0.008, respectively (Fig. 3).
Another feature of interest is the value of oxygen at the
point where the two linear regressions of the spline model meet
(Change-point.Oxygen). At the breakpoint of the spline model
two features can be captured: oxygen concentration and time.
Oxygen concentration when the change in slopes takes place is
biologically relevant as it indicates a change in the oxygen
consumption kinetics most likely caused by alterations in the
energy production of the cell. The distributions of the Change-point.Oxygen
feature within each cell type showed characteristics
typical of a bimodal density function. Qualitatively, the
distribution histograms of the two cell types show significant
similarity (Fig. 6) with a more clearly defined main peak at
6–6.5 ppm for CP-C cells. The distributions clearly indicate
marked heterogeneity in OC kinetics within the same cell type.
More subtle differences can be seen when comparing the two
cell types (Fig. 6b). One of the most notable differences is the
existence of a second, broader peak between 2–4 ppm in CP-C
cells, which is less pronounced in CP-A cells. However, the
statistical tests of the mean and median gave p-values of
0.053 and 0.061, respectively, indicating that neither
parameter differs significantly between the two cell types at α = 0.05.
Two other features that we analyzed were the slopes (rates)
of the OC kinetics measured in the study. Understanding how
fast individual cells consume oxygen is of great interest as it is
directly related to the energy production levels in the cell. The
distributions of the slopes showed a long tail containing only a
small number of cells, while the majority of the cells’ OC rates
were concentrated in a relatively narrow range, [−0.02, 0]
(Fig. 6). For both the left and right slopes, no statistically
significant differences between the means were found when
comparing the two cell types (Fig. 6). However, the median
values of the right slope were found to be statistically different
between the two cell types with a p-value equal to 0.002
(Fig. 6).
We further explored these comparisons as a classification
problem with two classes (e.g. one cell type versus another)
finding subtle differences between the two cell types using an
ensemble-based classifier: Random Forest. The classification
problem indicated an out-of-bag error rate of 30% when
classifying single-cell CP-A and CP-C cells based on the
extracted features (Table 1). A multidimensional plot from
the tested Random Forest (more details in the Methods
section: Comparisons and classification techniques) is shown
in Fig. 7. This plot shows differences among cell lines.
The role of intercellular interactions: comparison between OC
kinetics in isolated single and interacting cells. To explore
metabolic heterogeneity in the presence of intercellular interactions,
OC kinetics curves were obtained with two cells of
the same type placed into one microchamber. We compared
features extracted from the OC data of single cells (i.e., CP-A_1
and CP-C_1) with those obtained with two cells per single
chamber (i.e., CP-A_2 and CP-C_2). The same statistical
methodology was applied to CP-A_2 and CP-C_2 OC
curves as for the data acquired with single, non-interacting cells
with only minor modifications to certain features. To account
for the number of cells (one or two) per microchamber the
values of the slopes measured in microchambers with double
occupancy were divided by two assuming equal OC for the two
cells in a microwell, allowing comparisons with single-cell
slopes.
We first investigated the goodness-of-fit of the spline model
applied to the OC kinetics data of interacting cells. We
compared data fits obtained with the spline model and with
simple linear regression using multiple hypothesis testing with
Bonferroni correction as described in the Methods section.
Similar to the results obtained with individual, non-interacting
cells of both cell lines, the spline model fit was found to be
statistically better than the simple linear regression model for
all measurements with double-occupancy, interacting cells
(Fig. S1, ESI†).
A set of features from CP-A_1, CP-A_2, CP-C_1, and
CP-C_2 curves were extracted using the constrained piecewise
linear regression model. Distribution patterns similar to those
obtained with single, non-interacting cells were found for the
OC kinetics curves with interacting cells for features such as
TimetoZero, Change-point.Oxygen, Left.Slope, and Right.Slope
(description in Table 1). Statistically significant differences in
both the mean and median were found for at least one of the
four distinct groups of OC curves for the feature TimetoZero as
Fig. 7 Multidimensional scaling plot: a Random Forest classifier for
single CP-A vs. CP-C cells. This plot visualizes the scaling coordinates
of the proximity matrix obtained from a Random Forest to classify
CP-A versus CP-C cells at the single-cell level. This graphical representation
shows how the Random Forest classifier was able to find
high-dimensional interactions between data features that cluster OC
curves together.
Fig. 8 The TimetoZero feature extracted from single- and double-cells for CP-A and CP-C oxygen consumption curves. Time to zero is a
time feature extracted after removal of zero-valued tails using the CUSUM method. (a) Distribution histogram of the feature among single,
non-interacting (CP-A_1 and CP-C_1) cells and for interacting (two cells per well; CP-A_2 and CP-C_2) cells. (b) 95% confidence interval plot of
the means of TimetoZero for each experimental condition. Testing for statistically significant differences between the means or between the
location shifts (e.g., medians) showed p-values equal to 0 in both cases.
Fig. 9 Other features of interest extracted from oxygen consumption kinetics of single, non-interacting- and double, interacting-CP-A and CP-C
cells. The left panels show distribution histograms of the corresponding features; the right panels show 95% confidence interval of the means of the
corresponding features. (a) and (b) Oxygen concentration values where the change of slopes in the spline model occurs. (c) and (d) Slope values of
the first linear regression of the spline model (Left.Slope). (e) and (f) Slope values of the second linear regression (Right.Slope). See Table 1 for
more detailed description of the slopes.
shown in Fig. 8. With p-values close to zero, this feature may
be an important discriminator among these non-interacting
and interacting cells (less marked differences can be observed
for CP-C_2 probably due to its small sample size). Other
extracted features such as the ones presented in Fig. 9 (oxygen
concentration at breakpoint, slopes before and after the break-
point) portrayed less distinct differences among these groups
but revealed empirical distribution patterns only available
through the study of individual OC curves. For example, oxygen
concentration at the breakpoint differed significantly for at
least one group, with p-values of 0.001 and 0.01 when testing
means and medians, respectively, suggesting that CP-C_1 differs
most for this feature (Fig. 9). In contrast, slope values
(adjusted for interacting cells by dividing by two) did not
differ as much across cell groups, except for the median of
Right.Slope, for which the test gave a p-value of 0.003 for at
least one group differing from the others (Fig. 9). These
comparisons are made possible by the methodology presented in
this work.
The features extracted using the statistical framework
allowed for multiple comparisons of different phenotypes.
As seen before, the distributions of each of the features
permitted comparisons and showcased subtle differences. To
further analyze the OC curves through the extracted features,
an ensemble classifier [34,35] was applied with the objective
of classifying the four groups of interest (CP-A_1, CP-A_2,
CP-C_1, and CP-C_2). A Random Forest classifier [34]
(see Methods) was applied to the extracted features to unravel
nonlinear relationships among the relevant features. Initially,
we built Random Forest models for pairs of classes
(i.e., CP-A_1 vs. CP-A_2, CP-C_1 vs. CP-C_2, etc.), obtaining
error rates of ~20–30% for all pairs. These models included
all extracted features. When all four data classes were included
in a single Random Forest model, the classification error rates
were found to be around 40% when all features were used in
the model and 50% for a Random Forest model that included all
features except TimetoZero (Fig. 10). The TimetoZero feature
was removed from the classification model to capture discrimi-
nant relationships among other features where differences might
not be as clear or direct as in the case of TimetoZero.
Table 2 shows the confusion matrices providing details on
how many curves were misclassified using the models with or
without the TimetoZero feature. Also shown in Table 2 is that
the number of curves among the four different classes is
unbalanced. To address this problem, down-sampling was
performed on all Random Forest models applied here to lessen
the sample size effect in the learning model. Down-sampling is
a sampling technique that reduces the size of the majority class,
i.e., the class with the greatest number of samples. It is widely
used to balance the classes and minimize the overall error rate [37].
In addition, Table 3 presents the feature importance scores for
both Random Forest models. It can be seen that TimetoZero
has the highest score for distinguishing between the different
experimental classes. However, when the TimetoZero feature
was removed, all features ranked similarly. Although their
predictability measures are not high, the results obtained with
the Random Forest models show semi-defined clusters within
the same experimental condition or the cell type. Fig. 10 shows
how the data points of the same type of experiment tend to
agglomerate in regions partially overlapping with other experi-
mental conditions. This Random Forest model extracts non-
linear patterns among the features to discriminate among
different classes. The two cell lines used in the study represent
different stages of pre-neoplastic progression in esophageal
cancer and, thus, are closely related in their phenotypic and
genotypic profiles. Therefore, it is likely that they will show
similarities in terms of oxygen consumption as well, thus
making the differentiation more difficult. More features either
from the OC curves or any other biologically relevant data
might be necessary to distinguish them clearly.
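The down-sampling step described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `down_sample`, the fixed random seed, and the choice to reduce every class to the size of the smallest class are assumptions. The class sizes are those listed in Table 2.

```python
import random
from collections import defaultdict

def down_sample(labels, seed=0):
    """Return sorted indices that keep the same number of cases per class.

    Every class is reduced, by random sampling without replacement, to the
    size of the smallest class (an assumption of this sketch).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    target = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, target))
    return sorted(kept)

# Class sizes taken from Table 2.
labels = ["CP-A_1"] * 154 + ["CP-A_2"] * 118 + ["CP-C_1"] * 256 + ["CP-C_2"] * 44
kept = down_sample(labels)
counts = defaultdict(int)
for index in kept:
    counts[labels[index]] += 1
print(dict(counts))
```

Balancing in this way discards data from the larger classes, so in practice the sampling would typically be repeated for each tree rather than done once.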
Fig. 10 Multidimensional scaling plots: a Random Forest model for non-interacting and interacting CP-A and CP-C cells. This plot visualizes
the scaling coordinates of the proximity matrix obtained with a Random Forest performed to classify CP-A versus CP-C at the single- and
double-cell level. (a) Results using all features as described in Table 1. (b) Results using all features with the TimetoZero feature excluded from the
analysis.
Conclusion
The analysis and interpretation of intercellular heterogeneity
data are of fundamental importance in cell biology. There is
great interest in the scientific community in understanding
the role of heterogeneity in cellular homeostasis and
pathogenesis [28,38]. In recent years, innovative technologies
have been developed to perform biological studies at the
single-cell level [24–28], including single-cell oxygen consumption
measurements. Despite the availability of these technologies,
their real potential can only be exploited utilizing effective
analytical methods capable of performing robust de-noising
and feature extraction steps on the novel type of information.
Through preliminary studies, we have identified three major
challenges when dealing with real-time phenotypic measure-
ments at the single-cell level: random noise, presence of
multiple functional states, and reliable differentiation of cell
behavior within and across different cell types (Fig. 1). In this
study, using single-cell OC data as an example, we made an
initial effort to establish a statistical framework for
multiparameter analysis of experimental data at the single-cell
level. In our approach to analyzing single-cell data we applied
several sets of tools used in signal processing
and statistics for data modeling and feature extraction. The
validation of the method showed that experimental data can be
modeled and their features extracted reliably. The quantitative
features extracted from the single-cell experimental data using
our analysis method revealed subtle differences between
non-interacting, single cells as well as between interacting cells of
both types. This demonstrates the feasibility of the developed
methodology to reliably process the measurement data and
characterize oxygen consumption kinetics. Because of its general
applicability, our statistical framework can be utilized to address
similar challenges that arise in other single-cell data acquisition
and experimental modalities.
Methods
Dataset
Description of oxygen consumption measurements. As a first
step in acquiring and analyzing multiparameter data, our
center has developed an experimental platform for metabolic
phenotype characterization, including oxygen consumption, at
the single-cell level [27,28]. Single-cell oxygen consumption rates
are on the scale of fmol min⁻¹ cell⁻¹. Because oxygen sensing
is based on the dynamic quenching of sensor luminescence
by oxygen, the signal-to-noise ratio of the measurement varies
as a function of oxygen concentration in the microchamber.
This factor needs to be taken into account, especially when
applying signal processing algorithms for de-noising
purposes. Other sources of noise include detector
readout noise, intensity variations of the excitation source, and
stochastic sensor noise. For the two cell types studied in this
work, the average time required for an isolated cell to consume
all oxygen within the finite volume (~140 pL) of cell media
ranges between 30 and 90 min. Noise levels resulting from the
various sources can be significant, requiring the data to be
analyzed with a rigorous statistical framework capable of
reducing noise and extracting quantitative features.
We analyzed several sets of oxygen consumption kinetics
data from two Barrett’s esophageal epithelial cell lines
(metaplastic CP-A and dysplastic CP-C) obtained with the
single-cell technology. The numbers of OC curves studied for CP-A
and CP-C were 154 and 256, respectively. The cells were
loaded into microwells and incubated for 15–30 hours before
measurements were performed. The incubation time was
selected based on previous studies of cell viability and
morphology. After incubation, microwells with cells were
hermetically sealed with a lid containing an extracellular
optical oxygen sensor. The sensor emission intensity was
collected as a function of time until the oxygen concentration in
the microchamber reached zero [27].
Table 2 Confusion matrices obtained with Random Forest classification models

(A) All features included:
True class        Predicted class                        Class
(Num. curves)     CP-A_1   CP-A_2   CP-C_1   CP-C_2      error (%)
CP-A_1 (154)        75       24       51        4          51.3
CP-A_2 (118)         4       81        1       32          31.4
CP-C_1 (256)        61       22      165        8          35.5
CP-C_2 (44)          5       20        2       17          61.4

(B) Without TimetoZero feature:
True class        Predicted class                        Class
(Num. curves)     CP-A_1   CP-A_2   CP-C_1   CP-C_2      error (%)
CP-A_1 (154)        74       29       45        6          51.9
CP-A_2 (118)        20       61       13       24          48.3
CP-C_1 (256)        60       28      142       26          44.5
CP-C_2 (44)          7       17        8       12          72.7

Individual error rates per cell type and number of cells within
a microwell are shown for Random Forest models constructed using
all features and with the TimetoZero feature excluded from the
analysis. Each entry is the number of curves of the true class
assigned to the given predicted class by the nonlinear model.
Classification error is the percentage of curves that were
misclassified. Misclassified signals (the off-diagonal entries)
are shown in the gray boxes.
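As a worked check of how the class errors in Table 2 follow from the counts, the short sketch below recomputes them from panel (A). The helper names `class_errors` and `overall_error` are illustrative only; the counts are copied from the table.

```python
# Rows are true classes, columns are predicted classes, both in the order
# CP-A_1, CP-A_2, CP-C_1, CP-C_2 (values copied from Table 2A).
confusion_all = [
    [75, 24, 51, 4],
    [4, 81, 1, 32],
    [61, 22, 165, 8],
    [5, 20, 2, 17],
]

def class_errors(matrix):
    """Percentage of each true class that was assigned to another class."""
    return [100.0 * (sum(row) - row[i]) / sum(row) for i, row in enumerate(matrix)]

def overall_error(matrix):
    """Percentage of all curves that fall off the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(row[i] for i, row in enumerate(matrix))
    return 100.0 * (total - correct) / total

print([round(e, 1) for e in class_errors(confusion_all)])  # [51.3, 31.4, 35.5, 61.4]
print(round(overall_error(confusion_all), 1))               # 40.9
```

The overall value of 40.9% is consistent with the error rate of around 40% reported in the text for the model with all features.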
Table 3 Variable importance scores from Random Forest classification models

Features                 Mean decrease Gini (%)
                         All features   Without TimetoZero
Change-point.Time            9.03            12.93
Change-point.Oxygen         10.99            13.23
Left.B0.Coef                10.17            12.89
Left.B1.Coef                10.37            12.53
Right.B1.Coef               13.81            12.49
TimetoZero                  17.12              —
Kurtosis                     9.14            11.88
Skewness                     8.88            11.35
MSE.min                     10.49            12.70

These variable importance scores are calculated as the average
over all trees of a scoring measure. This scoring measure is computed
as the number of correctly classified cases when the feature matrix
values are evaluated on the grown tree minus the number of correctly
classified cases when the variable to be scored is permuted prior to
tree model evaluation.
Noise reduction techniques
The noise levels in OC data were reduced using two main
signal processing components: (1) Low-pass filtering and (2)
Outlier smoothing.
Low-pass filtering. Two common low-pass filtering techni-
ques were evaluated. A low-pass filter reduces the amplitude of
high frequencies while leaving low frequencies unchanged.
These two methods along with their parameters are briefly
described here. In addition, we discuss a goodness-of-fit
assessment to decide which of the filtering techniques performs
better for the measured OC kinetics curves.
The Savitzky–Golay (SG) filter, also called the least-squares
polynomial smoothing filter, is a finite impulse response
(FIR) filter [39]. The technique fits a polynomial of fixed degree
n to a small window of the data of size (2m + 1) to estimate
the midpoint, as shown in eqn (1) and (2). This process is
repeated by sliding the data window along the total span [39,40].
This type of convolution filter minimizes the least-squares error
of fitting a polynomial to window frames of the noisy data and
is quite popular in areas such as spectroscopy and analytical
chemistry because of its simplicity and speed [41,42]. If the
data are evenly spaced and continuous, then the smoothed
value ($y^*_t$) is the weighted summation of the points in the
window frame, as described in eqn (3). The original Savitzky–Golay
implementation truncates m points at the start and end of the
data signal, which cannot be smoothed. Therefore, extensions to the
Savitzky–Golay filter addressing initial and endpoint estimation
found in the literature were also implemented in this study [40,43].
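$$ y^*_t = \sum_{k=0}^{n} b_k t^k = b_0 + b_1 t + b_2 t^2 + \cdots + b_n t^n, \qquad t = [-m, -(m-1), \ldots, 0, \ldots, m] \tag{1} $$

$$ \frac{\partial}{\partial b_k} \left[ \sum_{t=-m}^{m} (y^*_t - y_t)^2 \right] = 0 \tag{2} $$

$$ y^*_j = \frac{\sum_{t=-m}^{m} c_t \, y_{j+t}}{N} \tag{3} $$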
In our study, a second-order polynomial fit was tested, as it is
commonly used in practice [41]. Another important parameter
needed in SG filtering is the window length (m). Common
values for this parameter are m = 11 and m = 21. We evaluated the
root-mean-squared error (RMSE) for a range of values under
both conditions (i.e., CP-A and CP-C), as shown in Fig. S2
(ESI†). Data filtering in this study was performed using a window
size of 11, since the smoothing performance was found to be
better than with m = 21 in terms of preservation of local signal
patterns.
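To make the sliding-window fit concrete, the following minimal sketch implements second-order Savitzky–Golay smoothing in plain Python. It relies on the closed-form least-squares solution for a symmetric window, which gives the midpoint weights directly. Several points are assumptions of the sketch rather than details from this work: the function name `savgol_quadratic`, the reading of "window size of 11" as a half-width of 5 points, and the decision to leave the first and last m points unchanged instead of using the endpoint extensions cited above.

```python
def savgol_quadratic(y, m=5):
    """Quadratic Savitzky-Golay smoothing with window size 2*m + 1.

    The first and last m points are returned unchanged (truncated edges).
    """
    ts = range(-m, m + 1)
    s0 = 2 * m + 1
    s2 = sum(t ** 2 for t in ts)
    s4 = sum(t ** 4 for t in ts)
    denom = s0 * s4 - s2 ** 2
    # Normalised convolution weights: the least-squares quadratic evaluated
    # at the window midpoint (t = 0).
    weights = [(s4 - s2 * t ** 2) / denom for t in ts]
    smoothed = list(y)
    for j in range(m, len(y) - m):
        smoothed[j] = sum(w * y[j + t] for w, t in zip(weights, ts))
    return smoothed

# A noise-free quadratic is reproduced exactly in the interior of the signal.
curve = [float(t * t) for t in range(12)]
print(savgol_quadratic(curve)[5])
```

For m = 2 the weights reduce to the familiar values (−3, 12, 17, 12, −3)/35, which is a convenient sanity check.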
The second filter we applied was the Exponentially
Weighted Moving Average (EWMA). It is an infinite impulse
response (IIR) filter and represents a special case of the
moving average filter where the weights of the data points to
be averaged decay exponentially with the distance from the
most recent data point (eqn (4)). The smoothed value of $y_t$ is
obtained through
$$ y^*_t = \lambda y_t + (1 - \lambda)\, y^*_{t-1} \tag{4} $$
where $\lambda$ represents the decay rate, with $0 \le \lambda \le 1$.
A small value of $\lambda$ gives more weight to older data and less to
new data, and vice versa [29,44]. To detect small signal changes,
$\lambda = 0.2$ was used during the smoothing of the data curves in
this study. An RMSE evaluation across a range of $\lambda$ values
was performed, as shown in Fig. S2 (ESI†). In practice,
$\lambda$ values between 0.2 and 0.3 are used [45].
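The recursion in eqn (4) is short enough to write out directly. The sketch below is illustrative only: the function name `ewma` and the initialization of the first smoothed value with the first raw value are assumptions, since the starting condition is not stated here.

```python
def ewma(y, lam=0.2):
    """Exponentially weighted moving average following eqn (4)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    smoothed = []
    for value in y:
        if not smoothed:
            smoothed.append(float(value))  # assumed start: y*_0 = y_0
        else:
            smoothed.append(lam * value + (1.0 - lam) * smoothed[-1])
    return smoothed

# A step from 10 to 0 is followed gradually when lam is small.
print(ewma([10.0, 10.0, 0.0, 0.0]))
```

With a decay rate of 0.2, each new observation contributes 20% of the smoothed value, so abrupt spikes are damped while slow trends are retained.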
To assess the performance of the EWMA and SG filtering
techniques, we evaluated the average root-mean-squared error
(RMSE) between smoothed and raw data as a goodness-of-fit
criterion. Goodness-of-fit statistics (e.g., the coefficient of
determination ($R^2$), mean squared error (MSE), and RMSE)
describe how well smoothed values fit experimental data.
Small values of the average RMSE indicate a good fit. Both
techniques showed similar performance for the commonly chosen
parameters, as displayed in Fig. S3 (ESI†).
Outlier detection and smoothing. The OC kinetics data
contained random sharp peaks in certain areas due to signal
loss or stochastic sensor intensity fluctuations. We detected
these outliers using traditional control chart theory with the
following equation
$$ L = \bar{x} \pm w\hat{\sigma} \tag{5} $$
where L represents the upper (+) and lower (−) control limits,
$\bar{x}$ is the mean value of the response, w is the parameter that
determines the width of the limits, and $\hat{\sigma}$ is an estimated value
of variation. Data points outside the limits calculated using
eqn (5) were considered outliers. $\hat{\sigma}$ was estimated through an
initial filtering step: each signal undergoes a filtering step such as
those described in the earlier subsection on low-pass filtering,
and the variation of the raw data points around the resulting
smoothed values is computed using the root-mean-squared error
(RMSE) metric. We assumed $\hat{\sigma}$ to be constant, which is not
necessarily true. However, because $\hat{\sigma}$ is utilized for the
detection of distant outliers only, this assumption is adequate.
To determine the w parameter (control width constant) we studied
several options. The value w = 2 was chosen because, on average,
about 10% of all data points within an OC kinetics curve are then
detected as outliers (Fig. S4, ESI†). As expected, higher values
of w (3, 4, and 5) flagged smaller fractions of the data, ranging
from 0% to ~5%, whereas a smaller value (w = 1) flagged a larger
fraction (~25%) (Fig. S4, ESI†). Hence, w = 2 seemed a
reasonable choice to reduce random noise due to outliers
without excluding too much of the actual signal data from the
analysis. After detection, the outliers were smoothed out by
using a simple 2-neighbor averaging procedure where the
outlier values are replaced with the average of their two
adjacent neighbors' values. The low-pass
filter was then re-applied to the entire dataset.
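The detection and replacement steps described in this subsection can be sketched as follows. This is one possible reading, with assumptions stated plainly: the limits of eqn (5) are centred on the smoothed curve at each time point (the text refers only to the mean value of the response), the variation estimate is the RMSE between raw and smoothed data, and at the ends of the signal the single available neighbour is used. All function names are hypothetical.

```python
import math

def rmse(raw, smoothed):
    """Root-mean-squared error between raw and smoothed values."""
    return math.sqrt(sum((r - s) ** 2 for r, s in zip(raw, smoothed)) / len(raw))

def detect_outliers(raw, smoothed, w=2.0):
    """Indices of points farther than w * sigma_hat from the smoothed curve."""
    sigma_hat = rmse(raw, smoothed)  # treated as constant along the curve
    return [i for i, (r, s) in enumerate(zip(raw, smoothed))
            if abs(r - s) > w * sigma_hat]

def smooth_outliers(raw, outliers):
    """Replace each flagged point by the mean of its two adjacent neighbours."""
    cleaned = list(raw)
    for i in outliers:
        # Assumption: at either end, the single available neighbour is reused.
        left = raw[i - 1] if i > 0 else raw[i + 1]
        right = raw[i + 1] if i < len(raw) - 1 else raw[i - 1]
        cleaned[i] = (left + right) / 2.0
    return cleaned

raw = [10.0, 9.0, 8.0, 20.0, 6.0, 5.0, 4.0]
smoothed = [10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0]
flagged = detect_outliers(raw, smoothed)
print(flagged, smooth_outliers(raw, flagged))
```

Neighbour values are taken from the raw signal, so two adjacent outliers would contaminate each other; a fuller implementation would need a rule for that case.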
Feature extraction models
Cumulative sum control (CUSUM) charts: change detection.
With cumulative sum (CUSUM) control charts,
small changes in the mean value are detected more efficiently
than with Shewhart control charts [29]. To apply the CUSUM
procedure, the OC curves were order-reversed to identify the
deviation from zero (tail). The OC response signals portray the
behavior of oxygen consumption over time. When a signal reaches
its minimum value (zero), it shows a constant behavior,
i.e., a tail of zeros from that time point on. Hence, the time
point at which the signal reaches zero can be obtained by
capturing a deviation within the constant region of zero values
that occurs at the end of the signal. Reversing the order of the
signal facilitates the application of CUSUM charts to detect
deviations from zero.
Two input parameters are needed to calculate the CUSUM
statistic ($C_k$): the subgroup size (k) and the in-control mean
(in this study $\mu_0 = 0$). $C_k$ is defined in eqn (6) by
k, $\mu_0$, and the computed mean of the sub-sample of size k ($\bar{x}_k$),
and is calculated along the entire sample range.
$$ C_k = \sum_{j=1}^{k} (\bar{x}_k - \mu_0) \tag{6} $$
Other parameters needed to determine when the process is
out of control (in this study $\mu_0 \neq 0$) are the decision interval and
the amount of shift to detect (slack). Recommended values for
these parameters are a decision interval of 5 and a slack
value of 3 [46–48].
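A reversed-signal CUSUM of this kind might look like the sketch below. Note that it uses the standard one-sided tabular CUSUM recursion rather than the plain cumulative sum of eqn (6), because the tabular form is the one in which slack and decision interval appear explicitly. The function name, the use of a noise estimate `sigma` to scale both parameters, and the returned index convention are all assumptions of this sketch.

```python
def zero_tail_start(curve, sigma, slack=3.0, decision=5.0, mu0=0.0):
    """Index of the first sample of the zero-valued tail, or None.

    The curve is scanned in reversed order with a one-sided tabular CUSUM.
    slack and decision are expressed in multiples of sigma (an assumption).
    """
    cusum = 0.0
    for steps_back, value in enumerate(reversed(curve)):
        cusum = max(0.0, cusum + (value - mu0) - slack * sigma)
        if cusum > decision * sigma:
            # This sample already deviates from zero; the tail starts after it.
            return len(curve) - steps_back
    return None  # the signal never left the in-control (zero) state

curve = [10.0, 8.0, 6.0, 4.0, 2.0, 0.0, 0.0, 0.0, 0.0]
print(zero_tail_start(curve, sigma=0.1))
```

Multiplying the returned index by the sampling interval would give a TimetoZero value, and truncating the curve at that index removes the zero-valued tail before model fitting.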
Piecewise linear regression model. The methodology implemented
in this paper for feature extraction consists of fitting a
piecewise linear regression model to each OC kinetics curve. In
general, piecewise linear regression is used to describe
nonlinear behavior by fitting the data to a number of linear
segments. In the methodology implemented here, two linear
regression models were constrained to connect at the same
breakpoint. We considered a special case of two linear regressions
intersecting at a single point at time $t_c$ (‘‘change-point’’),
as shown in eqn (7), with the indicator variable $I_{t \ge t_c} = 1$ when
$t \ge t_c$ [49]. Both linear regressions were described in one
function y, with the indicator variable $I_{t \ge t_c}$ defining
the two regression functions with constrained slopes $b_1$ and
$b_1 + b_2$, as shown in eqn (7). The slope parameters were
constrained to non-positive values because the oxygen
concentration in the microchambers decreases over time.
$$ y = b_0 + b_1 t + b_2 (t - t_c)\, I_{t \ge t_c} \tag{7} $$
$$ b_1 \le 0 \ \text{and} \ b_1 + b_2 \le 0 \quad \forall \ \text{curves} $$
To find the change-point, a likelihood method was used to
minimize the sum squared error (SSE) of the fit of the kinetics
data to two linear regressions. During the fit, an exhaustive
search was performed along the time axis to determine the
change-point and the coefficient estimates that minimize SSE.
Once the change-point was found, the features (Table 1) were
extracted from the piecewise linear model for different experi-
mental conditions (i.e., CP-A, CP-C). The fit to the
constrained piecewise linear regression with one-breakpoint
was statistically compared to the fit to a simple linear regression
model using an F test. To perform the F test, an F statistic is
computed as shown in eqn (8) where SSEModel1 and SSEModel2
refer to the sum squared error of the simple linear regression
and the constrained piecewise linear regression models respec-
tively. Other inputs in eqn (8) are p and n; p is the number of
parameters estimated for each model (i.e., Model1 or Model2)
and n is the total number of data points in the signal.
$$ F = \frac{(SSE_{Model1} - SSE_{Model2}) / (p_{Model2} - p_{Model1})}{SSE_{Model2} / (n - p_{Model2})} \tag{8} $$
If $F > F_{\alpha,\; p_{Model2} - p_{Model1},\; n - p_{Model2}}$, then Model2 performs better.
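The exhaustive change-point search and the F statistic of eqn (8) can be sketched in plain Python as shown below. This is a simplified illustration under stated assumptions: candidate change-points are restricted to the interior time points, the slope constraints are handled by discarding candidates whose unconstrained fit violates them (which is not a true constrained least-squares fit), and all function names are hypothetical. The parameter counts passed to `f_statistic` are left to the caller.

```python
def solve(a, b):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def least_squares(rows, y):
    """Ordinary least squares via the normal equations; returns (beta, SSE)."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    sse = sum((v - sum(b * x for b, x in zip(beta, r))) ** 2 for r, v in zip(rows, y))
    return beta, sse

def fit_change_point(t, y, tol=1e-9):
    """Exhaustive search over interior time points for the SSE-minimising t_c."""
    best = None
    for tc in t[1:-1]:
        rows = [[1.0, ti, max(ti - tc, 0.0)] for ti in t]
        (b0, b1, b2), sse = least_squares(rows, y)
        if b1 > tol or b1 + b2 > tol:
            continue  # violates the non-positive slope constraints of eqn (7)
        if best is None or sse < best["sse"]:
            best = {"tc": tc, "b0": b0, "b1": b1, "b2": b2, "sse": sse}
    return best

def f_statistic(sse1, p1, sse2, p2, n):
    """F statistic of eqn (8) comparing Model1 (simple) with Model2."""
    return ((sse1 - sse2) / (p2 - p1)) / (sse2 / (n - p2))

t = [float(i) for i in range(11)]
y = [10.0 - ti if ti <= 5 else 5.0 - 0.2 * (ti - 5.0) for ti in t]
print(fit_change_point(t, y))
```

The resulting F value would then be compared with the critical value of the F distribution at the chosen significance level, obtained from tables or a statistics library.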
The model comparison by an F test was performed for
every single curve, resulting in a multiple hypothesis testing
problem. A well-known problem in multiple hypothesis
testing is the increase in false positives. Several approaches, such
as the Bonferroni correction, exist to alleviate this
problem. This widely used technique is applied when multiple
statistical tests are computed simultaneously in order to reduce
false positives by reducing the value of $\alpha$, the significance
level of the test. Equivalently, instead of reducing $\alpha$, all the
p-values from the individual tests can be adjusted
as shown in eqn (9), where n is the number of
comparisons [31,50,51].
$$ \mathrm{pvalue.adjusted}[c] = \min(\mathrm{pvalue}[c] \times n,\ 1), \qquad c \in [1, n] \tag{9} $$
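Eqn (9) translates directly into code. The function name `bonferroni` and the example p-values below are illustrative only.

```python
def bonferroni(p_values):
    """Adjust p-values as in eqn (9): multiply by the number of tests, cap at 1."""
    n = len(p_values)
    return [min(p * n, 1.0) for p in p_values]

# Three hypothetical per-curve p-values from the F test.
print(bonferroni([0.001, 0.02, 0.5]))
```

An adjusted p-value is then compared with the original significance level, which is equivalent to comparing the raw p-value with the significance level divided by n.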
Comparisons and classification techniques
Statistical significance tests. The extracted features were
studied and compared between the two cell lines using traditional
statistical tools such as histograms, confidence intervals,
and statistical tests of the mean and median. The statistical
significance of the difference between the means was determined
using the analysis of variance (ANOVA) test, which
generalizes the t-test to more than two groups but relies on
several assumptions that may or may not be met for this
particular data structure. ANOVA was therefore applied with caution,
to get a general sense of the differences among group means under
the hypothesis shown in eqn (10). In addition to ANOVA, we
tested for differences between the
median values using nonparametric tests, which relax the strict
assumptions inherent to ANOVA. The median or rank test was
performed using the Mann–Whitney–Wilcoxon test [52,53] for
two groups and the Kruskal–Wallis test [54] for more
than two groups. Both tests are nonparametric approaches for
evaluating differences in the location shift of the distribution of x
for each group. Eqn (11) gives the analytical expression of
the Kruskal–Wallis statistic, where $n_i$ is the number of observations
in group i, $r_{ij}$ is the rank of observation j from group i, and N is
the total number of observations for all groups. The p-value
corresponding to a particular K is approximated through the
$\chi^2$ distribution [54].
$$ H_0: \mu_1 = \mu_2 = \cdots = \mu_n \tag{10} $$
$$ K = (N - 1)\, \frac{\sum_{i=1}^{g} n_i (\bar{r}_i - \bar{r})^2}{\sum_{i=1}^{g} \sum_{j=1}^{n_i} (r_{ij} - \bar{r})^2} \tag{11} $$
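The Kruskal–Wallis statistic of eqn (11) can be computed with a few lines of plain Python. In this sketch the group mean rank and the overall mean rank are computed explicitly, and tied observations share their average rank, which is the usual convention but is an assumption here. The function names are hypothetical, and the p-value step (the chi-squared approximation) is omitted.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks

def kruskal_wallis(groups):
    """Kruskal-Wallis statistic K of eqn (11) for a list of samples."""
    pooled = [v for group in groups for v in group]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    grand_mean = sum(ranks) / n_total
    between, offset = 0.0, 0
    for group in groups:
        group_ranks = ranks[offset:offset + len(group)]
        offset += len(group)
        group_mean = sum(group_ranks) / len(group)
        between += len(group) * (group_mean - grand_mean) ** 2
    total = sum((r - grand_mean) ** 2 for r in ranks)
    return (n_total - 1) * between / total

print(kruskal_wallis([[1, 2, 3], [4, 5, 6]]))
```

For two completely separated groups of three observations, as in the usage line, the statistic equals 27/7.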
Ensemble classifier: Random Forest. To further explore
potential relationships among several groups of OC curves,
we applied an ensemble classifier based on decision trees. The
two cell lines (CP-A and CP-C) at the single-cell or two-cell
levels (i.e., CP-A_1, CP-A_2, CP-C_1, and CP-C_2) were
defined as the four classes for the classifier model, with features
from the OC curves used as predictors. Decision trees
can be applied in almost all scenarios and therefore provide
a good starting point for modeling heterogeneous and
large data sets. They apply to either a numerical
or a categorical response and are nonlinear, simple, and fast.
Decision trees are also scale-invariant and robust to missing
values. However, a single tree is produced by a greedy
algorithm that generates an unstable model [34]. Consequently,
ensemble methods have been used to counteract the instability
of a single tree.
Supervised ensemble methods build a set of simple models
called base learners and use a weighted outcome for each base
learner in a voting scheme to predict future data. In other
words, ensemble methods merge outputs from multiple base
learners to create a voting committee to improve performance.
Many empirical studies have shown that ensemble methods
often outperform any single base learner [35].
The Random Forest classifier is an improved bagging
method that exploits the benefits of bootstrap
sampling through modeling. It grows a forest of random
decision trees on bagged samples, yielding accurate results
comparable with the best known classifiers [34]. An advantageous
property of Random Forest classifiers is that they limit
overfitting through embedded out-of-bag (OOB) error estimation.
The out-of-bag error estimate for the ith tree in the Random
Forest model is computed using the cases not used
in the learning of this ith tree. Other advantages of Random
Forest models are that they are simple to train and tune in many
applications, are computationally efficient, can handle a large number
of variables, provide variable importance scores, include an embedded
method to estimate missing data, generate a proximity
matrix among cases, handle variable interactions, can be
adapted to balance error due to datasets with unbalanced
numbers of samples, and can be extended to unlabeled data
for unsupervised clustering, data views, and outlier detection [34].
Algorithm: a simple pseudocode for Random Forest classifier
construction is shown below [34,35].
1. Select a number of cases independently, with replacement,
from the original dataset to build the training data.
2. Use the training data to grow a tree:
   (a) Select v variables at random from the total number of
   input variables (V), where v ≪ V.
   (b) Choose the best variable among the v predictors to maximize
   the information gain of the split.
   (c) Split the chosen node into two daughter nodes based on
   the best variable.
3. Repeat Steps 1 and 2 until all trees are built.
4. Output the ensemble of trees.
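The pseudocode above can be turned into a compact, runnable sketch. It is a toy illustration and not the implementation used in this study: splits are scored by weighted Gini impurity rather than information gain, trees are limited to a small depth, class labels are assumed to be strings, and the two-feature data set, the function names, and the parameter defaults are all hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y, features):
    """Best (score, feature, threshold) among the given candidate features."""
    best = None
    for f in features:
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [lab for row, lab in zip(X, y) if row[f] <= thr]
            right = [lab for row, lab in zip(X, y) if row[f] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, thr)
    return best

def grow_tree(X, y, v, rng, depth=0, max_depth=5):
    """Step 2: grow one tree, drawing v random candidate features per node."""
    majority = Counter(y).most_common(1)[0][0]
    if len(set(y)) == 1 or depth == max_depth:
        return majority
    split = best_split(X, y, rng.sample(range(len(X[0])), v))
    if split is None:  # the sampled features are constant in this node
        return majority
    _, f, thr = split
    left = [i for i, row in enumerate(X) if row[f] <= thr]
    right = [i for i, row in enumerate(X) if row[f] > thr]
    return (f, thr,
            grow_tree([X[i] for i in left], [y[i] for i in left], v, rng, depth + 1, max_depth),
            grow_tree([X[i] for i in right], [y[i] for i in right], v, rng, depth + 1, max_depth))

def predict_tree(tree, row):
    while isinstance(tree, tuple):  # internal nodes are tuples, leaves are labels
        f, thr, left, right = tree
        tree = left if row[f] <= thr else right
    return tree

def random_forest(X, y, n_trees=25, v=1, seed=0):
    """Steps 1-4: bootstrap the cases, grow a tree, repeat, return the ensemble."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]  # sampling with replacement
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], v, rng))
    return forest

def predict(forest, row):
    """Majority vote over all trees."""
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]

X = [[0.0, 1.0], [0.1, 1.2], [0.2, 0.9], [0.3, 1.1],
     [1.0, 3.0], [1.1, 3.2], [1.2, 2.9], [1.3, 3.1]]
y = ["A"] * 4 + ["B"] * 4
forest = random_forest(X, y)
print(predict(forest, [0.05, 1.0]), predict(forest, [1.15, 3.0]))
```

Cases left out of a given bootstrap sample are the out-of-bag cases for that tree; tracking `idx` per tree would allow the OOB error to be estimated without a separate validation set.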
Important features of Random Forest classifiers are OOB
sampling, variable importance, and proximity plots. OOB
sampling is analogous to cross-validation and, since the trees of a
Random Forest are grown independently, this validation can be
done along the way. Variable importance is a key feature of
Random Forests. The variables are ranked based on their
improvement in the empirical loss function among all trees,
meaning that variables that are chosen often in the trees
provide better predictive power, i.e., they minimize the loss
function. Proximity distances are measured by putting
all the data, training and out-of-bag, through the grown trees.
If instances i and j are in the same terminal node of a tree, their
proximity increases by one, and so on through all the trees [34].
The proximities are then normalized by the number of trees in
the model.
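The proximity computation described above can be sketched as follows, assuming the terminal node reached by every case in every tree has already been recorded. The input layout `leaf_ids[t][i]` and the function name are hypothetical.

```python
def proximity_matrix(leaf_ids):
    """Normalised proximities; leaf_ids[t][i] is the terminal node of case i in tree t."""
    n_trees = len(leaf_ids)
    n_cases = len(leaf_ids[0])
    prox = [[0.0] * n_cases for _ in range(n_cases)]
    for tree_leaves in leaf_ids:
        for i in range(n_cases):
            for j in range(n_cases):
                if tree_leaves[i] == tree_leaves[j]:
                    prox[i][j] += 1.0  # same terminal node in this tree
    return [[value / n_trees for value in row] for row in prox]

# Two trees, three cases: cases 0 and 1 share a leaf in the first tree,
# cases 1 and 2 share a leaf in the second.
print(proximity_matrix([[0, 0, 1], [2, 3, 3]]))
```

One minus the proximity can serve as a dissimilarity between cases, which is the kind of input that multidimensional scaling requires.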
State-of-the-art visualization methods such as multidimensional
scaling [36] are used to illustrate how well features discriminate
among different conditions. Multidimensional scaling represents
high-dimensional data in a lower-dimensional space (often two or
three dimensions) in order to better visualize any structure in the
data. The algorithm generates points in the lower-dimensional
space that approximately preserve the pair-wise distances between
the points in the high-dimensional space [55].
Conflict of Interest: none declared.
Acknowledgements
The authors would like to thank the personnel and support of
the Center for Biosignatures Discovery Automation in the
Biodesign Institute at Arizona State University. Funding: this
research is supported by the National Institutes of Health
(NIH), National Human Genome Research Institute
(NHGRI), Center of Excellence in Genomic Science (CEGS),
grant number 5 P50 HG002360 to Deirdre R. Meldrum.
References
1 M. Lidstrom and D. R. Meldrum, Life-on-a-chip, Nat. Rev.
Microbiol., 2003, 1, 158–164.
2 D. J. Wang and S. Bodovitz, Single cell analysis: the new frontier in
‘omics’, Trends Biotechnol., 2010, 28(6), 281–290.
3 T. Kalisky and S. R. Quake, Single-cell genomics, Nat. Methods,
2011, 8(4), 311–314.
4 N. Navin, J. Kendall, J. Troge, P. Andrews, L. Rodgers,
J. McIndoo, K. Cook, A. Stepansky, D. Levy, D. Esposito,
L. Muthuswamy, A. Krasnitz, W. R. McCombie, J. Hicks and
M. Wigler, Tumour evolution inferred by single-cell sequencing,
Nature, 2011, 472(7341), U90–U119.
5 E. J. Kostelich and T. Schreiber, Noise reduction in chaotic time-
series data: A survey of common methods, Phys. Rev. E: Stat. Phys.,
Plasmas, Fluids, Relat. Interdiscip. Top., 1993, 48, 1752–1763.
6 S. J. Orfanidis, Introduction to Signal Processing, Prentice-Hall,
Englewood Cliffs, NJ, 1996.
7 J. Brocker, U. Parlitz and M. Ogorzalek, Nonlinear Noise
Reduction, Proc. IEEE, 2002, 90(5), 898–918.
8 M. Schena, D. Shalon, R. W. Davis and P. O. Brown, Quantitative
monitoring of gene expression patterns with a complementary
DNA microarray, Science, 1995, 270(5235), 467–470.
9 D. A. Lashkari, J. L. DeRisi, J. H. McCusker, A. F. Namath,
C. Gentile, S. Y. Hwang, P. O. Brown and R. W. Davis, Yeast
microarrays for genome wide parallel genetic and gene expression
analysis, Proc. Natl. Acad. Sci. U. S. A., 1997, 94(24), 13057–13062.
10 V. G. Cheung, M. Morley, F. Aguilar, A. Massimi,
R. Kucherlapati and G. Childs, Making and reading microarrays,
Nat. Genet., 1999, 21, 15–19.
11 S. K. Moore, Making chips to probe genes, IEEE Spectrum, 2001,
38(3), 54–60.
12 W. Torres-Garcia, W. W. Zhang, R. Johnson, G. Runger and
D. R. Meldrum, Integrative analysis of transcriptomic, proteomic
data of Desulfovibrio vulgaris: a nonlinear model to predict abundance
of undetected proteins, Bioinformatics, 2009, 25, 1905–1914.
13 W. Torres-Garcia, S. D. Brown, R. H. Johnson, W. W. Zhang,
G. Runger and D. R. Meldrum, Integrative analysis of transcrip-
tomic and proteomic data of Shewanella oneidensis: missing value
imputation using temporal datasets, Mol. BioSyst., 2011, 7(4),
1093–1104.
14 M. L. T. Lee, F. C. Kuo, G. A. Whitmore and J. Sklar, Importance
of replication in microarray gene expression studies: Statistical
methods and evidence from repetitive cDNA hybridizations, Proc.
Natl. Acad. Sci., 2000, 97(18), 9834–9839.
15 D. E. Carter, J. F. Robinson, E. M. Allister, M. W. Huff and
R. A. Hegele, Quality assessment of microarray experiments, Clin.
Biochem., 2005, 38(7), 639–642.
16 J. Seo, M. Bakay, Y. W. Chen, S. Hilmer, B. Shneiderman and
E. P. Hoffman, Interactively optimizing signal-to-noise ratios in
expression profiling: project-specific algorithm selection and detection
p-value weighting in Affymetrix microarrays, Bioinformatics, 2004,
20(16), 2534–2544.
17 T. Howlader and Y. P. Chaubey, Noise Reduction of cDNA
Microarray Images Using Complex Wavelets, IEEE Trans. Image
Process., 2010, 19(8), 1953–1967.
18 Y. Saeys, I. Inza and P. Larrañaga, A review of feature selection
techniques in bioinformatics, Bioinformatics, 2007, 23(19),
2507–2517.
19 J. P. Stevens, Intermediate Statistics. A Modern Approach,
Lawrence Erlbaum Associates Publishers, Mahwah, NJ, Second edn,
1999.
20 J. X. Pan and K. T. Fang, Growth Curve Models and Statistical
Diagnostics, Springer Series in Statistics, 2002.
21 S. E. Maxwell and H. D. Delaney, Designing Experiments and
Analyzing Data: A Model Comparison Perspective, Lawrence
Erlbaum, Second edn, 2003.
22 S. Weerahandi, Generalized inference in repeated measures: Exact
methods in MANOVA and mixed models, Wiley-Interscience, 2004.
23 Applied regression analysis and other multivariable methods, ed.
D. G. Kleinbaum, L. L. Kupper and K. E. Muller, PWS Publishing
Co., Boston, MA, USA, 4th edn, 2008.
24 Y. Anis, M. Holl and D. Meldrum, Automated selection and
placement of single cells using vision-based feedback control, IEEE
Trans. Autom. Sci. Eng., 2010, 7(3), 598–606.
25 H. Zhu, M. Holl, T. Ray, S. Bhushan and D. R. Meldrum,
Characterization of deep wet etching of fused silica glass for single
cell and optical sensor deposition, J. Micromech. Microeng., 2009,
19, 6.
26 Y. Tian, B. R. Shumway, C. Youngbull, Y. Li, A. K. Y. Jen,
R. H. Johnson and D. R. Meldrum, Dually fluorescent sensing
of pH and dissolved oxygen using a membrane made from
polymerizable sensing monomers, Sens. Actuators, B, 2010, 147(2),
714–722.
27 S. Ashili, L. Kelbauskas, J. Houkal, D. Smith, Y. Tian,
C. Youngbull, H. Zhu, Y. Anis, M. Hupp, K. Lee, A. Kumar,
J. Vela, A. Shabilla, R. Johnson, M. Holl and D. Meldrum,
Automated platform for multiparameter stimulus response studies
of metabolic activity at the single-cell level, Proceedings Vol. 7929,
Microfluidics, BIOMEMS, and Medical Microsystems IX, 2011.
28 L. Kelbauskas, S. Ashili, J. Houkal, D. Smith, A. Mohammadreza,
K. Lee, A. Kumar, Y. Anis, T. Paulson, C. Youngbull, Y. Tian,
R. Johnson, M. Holl and D. Meldrum, A novel method for multiparameter
physiological phenotype characterization at the single-cell
level, Proceedings Vol. 7902, Imaging, Manipulation and Analysis of
Biomolecules, Cells, and Tissues IX, 2011.
29 D. Montgomery, Introduction to Statistical Quality Control,
Wiley Higher Education, 2005.
30 T. Molter, S. C. McQuaide, M. Zhang, M. R. Holl, L. W. Burgess,
M. E. Lidstrom and D. R. Meldrum, Algorithm advancements for
the measurement of single cell oxygen consumption rates, IEEE
International Conference CASE 2007, Automation Science and
Engineering, 2007, 386–391.
31 J. P. Shaffer, Multiple Hypothesis Testing, Annu. Rev. Psychol.,
1995, 46, 561–584.
32 J. K. Joseph, D. Bunnachak, T. J. Burke and R. W. Schrier,
A novel method of inducing and assuring total anoxia during in vitro
studies of O2 deprivation injury, J. Am. Soc. Nephrol., 1990, 1, 837–840.
33 K. C. Ho, J. K. Leach, K. Eley, R. B. Mikkelsen and P. S. Lin,
A simple method of producing low oxygen conditions with Oxyrase
for cultured cells exposed to radiation and Tirapazamine, Am. J. Clin.
Oncol., 2003, 26(4), e86–e91.
34 L. Breiman, Random forests, Mach. Learn., 2001, 45, 5–32.
35 T. Hastie, R. Tibshirani and J. H. Friedman, The Elements of
Statistical Learning—Data Mining, Inference, Prediction, Springer
Verlag, 2nd edn, 2009.
36 T. F. Cox and M. A. Cox, Multidimensional scaling, Chapman and
Hall, London, 1994.
37 L. Breiman, J. Friedman, R. A. Olshen and C. J. Stone, Classification
and Regression Trees, Wadsworth International, Belmont, CA, 1984.
38 S. J. Altschuler and L. F. Wu, Cellular Heterogeneity: Do Differences
Make a Difference?, Cell, 2010, 141(4), 559–563.
39 A. Savitzky and M. J. E. Golay, Smoothing and differentiation of
data by simplified least squares procedures, Anal. Chem., 1964,
36(8), 1627–1639.
40 R. A. Leach, C. A. Carter and J. M. Harrister, Least-squares
polynomial filters for initial point and slope estimation, Anal.
Chem., 1984, 56(13), 2304–2307.
41 P. Persson and G. Strang, Mathematical systems theory in biology,
communications, computation, and finance, Springer, 2002.
42 Z. B. Alfassi, Z. Boger and Y. Ronen, Statistical Treatment of
Analytical Data, CRC Press, Blackwell Science, Boca Raton, FL,
2005.
43 P. A. Gorry, General least-squares smoothing and differentiation
by the convolution (Savitzky–Golay) method, Anal. Chem., 1990,
62(6), 570–573.
44 B. Walczak, Wavelets in chemistry, Elsevier Science, 2000, vol. 22.
45 J. Hunter, The exponentially weighted moving average, J. Qual.
Technol., 1986, 18(4), 203–210.
46 J. Pignatiello and G. C. Runger, Comparison of multivariate
CUSUM charts, J. Qual. Technol., 1990, 22, 173–186.
47 S. S. Prabhu, G. C. Runger and D. C. Montgomery, Selection of
the subgroup size and sampling interval for a CUSUM control
chart, IIE Trans., 1997, 29, 451–457.
48 V. Golosnoy, S. Ragulin and W. Schmid, Multivariate CUSUM chart:
properties and enhancements, AStA Advances in Statistical Analysis,
Springer, 2009, vol. 93(3), 263–279.
49 R. A. Berk, Statistical Learning from a Regression Perspective,
Springer Science + Business Media, LLC, New York, 2008.
50 Y. Benjamini and Y. Hochberg, Controlling the false discovery rate: a
practical and powerful approach to multiple testing, J. R. Stat. Soc.
Ser. B, 1995, 57, 289–300.
51 Y. Benjamini and D. Yekutieli, The control of the false discovery
rate in multiple testing under dependency, Ann. Stat., 2001, 29,
1165–1188.
52 F. Wilcoxon, Individual comparisons by ranking methods,
Biometrics Bull., 1945, 6, 80–83.
53 H. B. Mann and D. R. Whitney, On a Test of Whether one of Two
Random Variables is Stochastically Larger than the Other, Ann.
Math. Stat., 1947, 18(1), 50–60.
54 W. H. Kruskal and W. A. Wallis, Use of ranks in one-criterion
variance analysis, J. Am. Stat. Assoc., 1952, 47(260), 583–621.
55 C. H. Chen, W. Hardle, A. Unwin, M. Cox and T. F. Cox,
Handbook of data visualization. In Springer Handbooks Comp.
Statistics, chapter Multidimensional Scaling, Springer, Berlin
Heidelberg, 2008, pp. 315–347.
A statistical framework for multiparameter analysis at the single cell level

  • 1. 804 Mol. BioSyst., 2012, 8, 804–817 This journal is c The Royal Society of Chemistry 2012 Cite this: Mol. BioSyst., 2012, 8, 804–817 A statistical framework for multiparameter analysis at the single-cell levelw Wandaliz Torres-Garcı´a,ab Shashanka Ashili,b Laimonas Kelbauskas,*b Roger H. Johnson,b Weiwen Zhang,*b George C. Runger*a and Deirdre R. Meldrumb Received 17th October 2011, Accepted 2nd December 2011 DOI: 10.1039/c2mb05429a Phenotypic characterization of individual cells provides crucial insights into intercellular heterogeneity and enables access to information that is unavailable from ensemble averaged, bulk cell analyses. Single-cell studies have attracted significant interest in recent years and spurred the development of a variety of commercially available and research-grade technologies. To quantify cell-to-cell variability of cell populations, we have developed an experimental platform for real-time measurements of oxygen consumption (OC) kinetics at the single-cell level. Unique challenges inherent to these single-cell measurements arise, and no existing data analysis methodology is available to address them. Here we present a data processing and analysis method that addresses challenges encountered with this unique type of data in order to extract biologically relevant information. We applied the method to analyze OC profiles obtained with single cells of two different cell lines derived from metaplastic and dysplastic human Barrett’s esophageal epithelium. In terms of method development, three main challenges were considered for this heterogeneous dynamic system: (i) high levels of noise, (ii) the lack of a priori knowledge of single-cell dynamics, and (iii) the role of intercellular variability within and across cell types. Several strategies and solutions to address each of these three challenges are presented. 
The features such as slopes, intercepts, breakpoint or change-point were extracted for every OC profile and compared across individual cells and cell types. The results demonstrated that the extracted features facilitated exposition of subtle differences between individual cells and their responses to cell–cell interactions. With minor modifications, this method can be used to process and analyze data from other acquisition and experimental modalities at the single-cell level, providing a valuable statistical framework for single-cell analysis. Introduction Cell-to-cell variability has been found to play a central role in a variety of physiological processes such as differentiation, proliferation, stress response and pathogenesis. Due to the stochastic nature of many intracellular processes, individual cells can exhibit significant phenotypic differences and respond differently to stimuli and changes in the microenvironment.1–4 The origin of many diseases is thought to be in several, or perhaps even one aberrant cell that acquires the capability to evade the cues regulating normal cell function and death. Early identification and detailed characterization of such abnormal cells bear the potential not only to provide deep insights into fundamental cell processes, but also to open new avenues for treatment and management of diseases with high morbidity and mortality, including cancer. Because of that, single-cell studies have been gaining momentum in the last decade facilitated by technological advances enabling reliable measurement of various biologically relevant parameters with high sensitivity and precision. To study cell signaling and metabolic pathways, one needs to be able to characterize simultaneously as many parameters of living single cells as possible. Multiparameter analysis could reveal the details of intracellular mechanisms, providing novel insights into systems biology of cells. 
Technological challenges such as extremely low amounts of biological material, small differential changes in metabolite concentrations and the fragility of cells have been hampering significant progress in single-cell analysis. One of the major limitations in single-cell experiments is the low signal-to-noise ratio. Reliable separation of meaningful data from noise represents a formidable challenge, one that is exacerbated by the absence of a priori knowledge of the dynamics of physiological processes that take place in individual cells. This is particularly true in experiments where single living cells need to be characterized with minimal perturbation of their normal function, further limiting the experimentalists’ choice among available methodology.
a School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ 85287-5906, USA. E-mail: George.Runger@asu.edu
b Center for Biosignatures Discovery Automation, The Biodesign Institute, Arizona State University, Tempe, AZ 85287-6501, USA. E-mail: Laimonas.Kelbauskas@asu.edu, Weiwen.Zhang@asu.edu
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c2mb05429a
Molecular BioSystems, www.rsc.org/molecularbiosystems. Downloaded by Arizona State University on 14 March 2012. Published on 05 January 2012 on http://pubs.rsc.org | doi:10.1039/C2MB05429A
  • 2. There are numerous methods available to remove white noise in dynamic data. Many of them have been suitably adapted for application in a variety of research fields such as chemistry, environmental studies and medicine.5–7 Nevertheless, reducing random perturbation in a signal is not a trivial task, since it is usually unclear how much noise can be removed without losing the “true” signal. Quality assurance problems in newly developed technologies are common. For example, in the 1990’s, when DNA microarrays started to gain interest in the scientific community, the importance and value of the unique information for understanding biological systems,8–13 as well as the need for quality assessment and noise reduction14,15 were clearly acknowledged. In light of these challenges, many noise reduction methods were proposed, and later developed into a mature and unified methodology.16,17 In a way analogous to noise reduction, the characterization of data and signals in the bioinformatics arena has been widely studied especially for data quality assessment purposes in microarray-based studies. Many feature selection techniques are commonly used for microarray data characterization, including selection of genes with significant expression levels in response to changes in conditions or experimental settings.18 Modeling of real-time data obtained from dynamical systems has been explored in the literature utilizing traditional statistical methods.19–23 The traditional methods tend to establish parametric assumptions which are often hard to justify in complex biological systems.
Hence, there exists a critical need to model real-time measurements in biological systems, including live cells, without a priori knowledge of the nature of underlying dynamical processes. However, so far none of these established methods have been applied to analyze data obtained from individual living cells. Here we present a study focused on the analysis of novel respiration kinetics data from individual cells. The cell metabolic analysis method entails manipulation24 and isolation of single cells25 and determination of their oxygen consumption (OC) kinetics in real-time.26–28 The data obtained from these measurements exhibit much higher levels of noise compared to bulk-cell experiments. The lack of a priori knowledge of single-cell dynamics makes it difficult to define characteristic features in these datasets, posing challenges in the extraction of biologically relevant information and its proper biological interpretation. The real-time nature of the measurements contributes additional complexity to the analysis. In this work we describe our initial efforts to develop statistical methodologies to address the challenges of noise reduction, data characterization through feature extraction, and biological comparison for respiration phenotype measurements in individual cells. We analyzed OC kinetics data of single cells obtained from two esophageal epithelial cell lines: metaplastic (CP-A) and early dysplastic (CP-C) Barrett’s esophageal cells. These cell lines were derived from biopsies taken from the corresponding regions in human esophagus and represent different stages of pre-neoplastic progression. Because of the clear delineation of the two cell types in terms of histopathology and their relevance to cancer, our findings may also be of interest to cancer biologists. Because they serve to define and extract elements of a disease biosignature, the statistical methodologies presented here could be used as a foundational framework for analyzing single-cell data.
Results
Data preprocessing
The data consist of OC kinetics in single human metaplastic (CP-A) and early dysplastic (CP-C) cells. Fig. 1 summarizes the challenges addressed in this unique data structure. The early exploration of OC measurements at the single-cell level indicated the need to reduce noise and unwanted perturbations in the signals. Reducing noise helps enhance the discovery of relevant features related to the “true” signal’s behavior.
Fig. 1 Statistical framework diagram. Sequential steps to process and analyze single-cell oxygen consumption data: smoothing, feature extraction and classification. Major challenges and proposed solution strategies to address each one of them are shown.
  • 3. In our approach two main stages of noise reduction were performed: (1) low-pass filtering and (2) outlier smoothing. Common filtering techniques were applied to the OC data in two different ways. First, a filter was applied to each of the OC curves to estimate a curve-specific metric of variation that was used to detect outliers in the unsmoothed data. This outlier smoothing process invoked traditional control charts in which any data point of the OC curves lying outside curve-specific control limits was considered an outlier and its value was smoothed by neighborhood averaging (see the Methods section). This step reduced the adverse influence of artifact caused by stochastic response of the microsensor or other measurement system components. After outlier smoothing, the resulting signal was processed through a low-pass filter (Fig. 2).
Feature extraction
After preprocessing, the data analysis procedure aimed to characterize the OC kinetics. The feature extraction step addresses challenge number two described in Fig. 1, which can be divided into two separate problems: (1) removal of redundant information characterized through the understanding of experimental limitations, and (2) extraction of distinctive features without a priori knowledge of the system.
Detection of the time needed to reach zero oxygen concentration in the microchambers. The removal of redundant information from further analysis is based on experimental considerations. During measurement, individual cells are hermetically isolated in sub-nanolitre volume chambers which results in a limited amount of oxygen being available for consumption by cells.27,28 Data collected after the oxygen concentration in the chambers reaches zero are not useful for OC kinetics analysis and can be discarded as extraneous or redundant.
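The two-stage noise reduction described under Data preprocessing (control-limit outlier smoothing followed by low-pass filtering) can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the centered moving average used as the filter, the 3-sigma control limits, the window size, and the function names `moving_average`, `smooth_outliers` and `denoise` are all assumptions introduced here.

```python
from statistics import mean, stdev


def moving_average(x, window):
    """Centered moving average (a simple low-pass filter); edges use a shorter window."""
    half = window // 2
    out = []
    for i in range(len(x)):
        lo, hi = max(0, i - half), min(len(x), i + half + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out


def smooth_outliers(x, window=5, n_sigma=3.0):
    """Replace points outside curve-specific control limits by a neighborhood average."""
    half = window // 2
    trend = moving_average(x, window)
    residuals = [xi - ti for xi, ti in zip(x, trend)]
    sigma = stdev(residuals)  # curve-specific metric of variation (assumed choice)
    cleaned = list(x)
    for i, r in enumerate(residuals):
        if abs(r) > n_sigma * sigma:  # outside the control limits
            lo, hi = max(0, i - half), min(len(x), i + half + 1)
            neighbours = [x[j] for j in range(lo, hi) if j != i]
            cleaned[i] = mean(neighbours)
    return cleaned


def denoise(x, window=5, n_sigma=3.0):
    """Outlier smoothing first, then low-pass filtering of the cleaned signal."""
    return moving_average(smooth_outliers(x, window, n_sigma), window)
```

A single large spike on an otherwise linear decline is flagged and replaced by the mean of its neighbors, while points inside the control limits are left unchanged before the final filtering pass.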
During an experiment, OC kinetics of nine individual cells was recorded simultaneously. The time needed for each cell to deplete the oxygen in the microchamber varied significantly from cell to cell due to the metabolic rate heterogeneity. The experiment was continued until oxygen concentration in all nine chambers reached zero, resulting in different amounts of redundant data collected for each cell.28 To address this issue, we proceeded to automatically detect the time point where each curve reached a zero value (0% oxygen) and to discard data collected after that time point (referred to as zero-value tails), excluding it from further analysis (Fig. 2). Hence, we define redundant information in the context of experimental conditions, namely by the limited amount of oxygen available for each cell to consume and the variable rate at which they consume it. Further experimental details related to the data used in this study can be found in the Methods section. Removing the zero-value tails from each of the OC curves facilitated robust modeling of these curves in regions of interest and allowed for reliable feature extraction from the kinetics. Removal of the zero-value tails would be a trivial problem if one had to analyze a small number of curves or if the time to reach zero was the same for all cells. For the analysis of hundreds of cells, however, we needed to develop an algorithm to automatically remove the tails to ensure consistency and rapid data processing. A cumulative sum (CUSUM) control chart is a commonly used statistical tool to detect small changes.29 Its application allowed us to automatically detect the time point at which each OC curve reaches a zero oxygen value and does not change significantly afterwards (Fig. 2). This statistical procedure was performed to detect a change time point, a feature called TimetoZero, for each sample. A summary of these detected time points across samples using the CUSUM procedure is shown in Fig. 3.
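One way to picture the CUSUM-based detection of TimetoZero is the sketch below. The text does not give the exact CUSUM configuration, so this example assumes a one-sided tabular CUSUM run backward in time from the end of the curve, with a target value of zero; the `allowance` and `threshold` parameters and the function names are hypothetical and would need tuning to the noise level of real data.

```python
def time_to_zero(times, oxygen, allowance, threshold):
    """Estimate when a curve settles at zero using a reverse-time tabular CUSUM.

    Scanning from the last sample backward, the statistic stays at zero while
    readings are within `allowance` of zero and accumulates once the curve
    rises above it. When it exceeds `threshold`, the most recent index at
    which the statistic was zero is reported as the change point.
    Returns None if no change is signalled or the curve never sits at zero.
    """
    cusum = 0.0
    last_zero_idx = None
    for i in range(len(oxygen) - 1, -1, -1):
        cusum = max(0.0, cusum + oxygen[i] - allowance)
        if cusum == 0.0:
            last_zero_idx = i
        elif cusum > threshold:
            return None if last_zero_idx is None else times[last_zero_idx]
    return None


def drop_zero_tail(times, oxygen, t_zero):
    """Discard samples recorded after the detected time-to-zero point."""
    kept = [(t, o) for t, o in zip(times, oxygen) if t <= t_zero]
    return [t for t, _ in kept], [o for _, o in kept]
```

For a noise-free curve that declines linearly and then stays at zero, the function returns the first time at which the curve reaches zero, and `drop_zero_tail` keeps only the samples up to that point.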
Stratification frequency plots of the times needed to reach 0% oxygen concentration for the entire set of data by cell lines are shown in Fig. 3 providing a general view of this TimetoZero feature distribution across different cell lines. We used this feature as a reference point to remove the data lying beyond this point as redundant. However, by determining the time-to-zero reference points, we captured a unique characteristic of the OC kinetics to better understand cell heterogeneity.
OC characterization and other features. Modeling OC curves is challenging since there is no a priori knowledge of single-cell respiration kinetics. Other than the notion that cells are heterogeneous, little is known about the characteristics of specific factors such as oxygen consumption and their relationship to different cell types and metabolic states.1,30
Fig. 2 Step-by-step statistical framework example. Main steps used to characterize the OC kinetics data are shown: (a) data filtering, (b) detection of feature, TimetoZero, using CUSUM, (c) removal of zero-valued tails, (d) identification of characteristic features using a spline model.
  • 4. We addressed this challenge by approximating OC kinetics with a constrained piecewise linear regression model. This spline model fits two continuous linear regressions with slopes constrained to be negative. Based upon preliminary data analysis which revealed this pattern, it seemed appropriate to study OC curves by means of fitting two linear models (Fig. 2). The two continuous regressions share a mutual breakpoint optimally detected through a likelihood method across the entire time span. This model allows us to capture features in different segments of the data. The spline model was compared with the simple linear regression model using a goodness-of-fit criterion. We performed the comparison of these two models by using an F test for each OC curve. These multiple comparisons raise a commonly known problem in multiple hypothesis testing: increased false-positives. To address this problem, we have corrected all computed p-values using the Bonferroni correction method.31 Through the evaluation of these tests, we found that 99.3% and 97.7% of the OC kinetics data obtained from CP-A and CP-C cells, respectively, could be fit better with the constrained piecewise linear (spline) regression than with the simple linear regression model at α = 0.001. Fig. 4 shows the percentage of curves that were fit more accurately with the spline model as a function of the level (α) of Type I error for the F test. In general, more than 90% of OC curves obtained with both cell types showed a statistically significant improvement of the fit at different values of α when using the constrained piecewise model as compared to simple linear regression.
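The spline fit and model comparison described above can be illustrated with a short sketch. It is written under several assumptions that are not taken from the paper: the breakpoint is found by a grid search over interior sample times (minimizing the residual sum of squares, which is equivalent to maximizing the likelihood under Gaussian errors), the negativity constraint is enforced simply by rejecting candidates with a non-negative slope, and the F statistic uses 2 and n − 4 degrees of freedom. All function names are hypothetical.

```python
def least_squares(X, y):
    """Ordinary least squares via the normal equations; returns (beta, sse)."""
    p = len(X[0])
    # Augmented normal-equation matrix [X'X | X'y].
    M = [[sum(r[a] * r[b] for r in X) for b in range(p)]
         + [sum(r[a] * yi for r, yi in zip(X, y))] for a in range(p)]
    for col in range(p):  # Gaussian elimination with partial pivoting
        piv = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, p):
            f = M[r][col] / M[col][col]
            for j in range(col, p + 1):
                M[r][j] -= f * M[col][j]
    beta = [0.0] * p
    for i in range(p - 1, -1, -1):  # back substitution
        beta[i] = (M[i][p] - sum(M[i][j] * beta[j] for j in range(i + 1, p))) / M[i][i]
    sse = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(X, y))
    return beta, sse


def fit_two_segment(t, y):
    """Continuous two-segment linear fit; breakpoint chosen by grid search."""
    best = None
    for c in t[2:-2]:  # candidate breakpoints: interior sample times
        X = [(1.0, ti, max(0.0, ti - c)) for ti in t]  # hinge term keeps the fit continuous
        (b0, b1, b2), sse = least_squares(X, y)
        left, right = b1, b1 + b2
        # Keep only candidates whose two slopes are both negative.
        if left < 0 and right < 0 and (best is None or sse < best["sse"]):
            best = {"breakpoint": c, "intercept": b0, "left_slope": left,
                    "right_slope": right, "sse": sse}
    return best  # None if no candidate satisfies the slope constraint


def f_statistic(sse_linear, sse_spline, n):
    """F statistic for spline vs. simple linear fit (assumed df: 2 and n - 4)."""
    return ((sse_linear - sse_spline) / 2) / (sse_spline / (n - 4))


def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

The p-value for each curve would be obtained from the upper tail of the F distribution (for example with `scipy.stats.f.sf`, if SciPy is available) before applying `bonferroni` across all curves.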
A slightly higher percentage of curves measured from CP-A compared to CP-C cells could be fit more reliably with the constrained piecewise model. The model enabled the extraction of relevant features that were used to characterize the OC kinetics. Besides the regular features from fitting linear regressions (intercepts and slopes), we were able to detect several other features (Table 1), such as time and oxygen concentration at which the first slope of the piecewise model is replaced by the second slope. All features were determined for each kinetics curve of both cell types (CP-A and CP-C), and the feature distributions within and across cell types were further analyzed. Whether or not piecewise linear regression represents a biologically relevant model of OC kinetics, it provided a good empirical fit to the experimental data with a simple structure, permitting feature extraction for comparative studies between different cell types and conditions.
Fig. 3 Features histogram and significance tests between CP-A and CP-C for the TimetoZero feature. (a) Distribution histograms for single CP-A and CP-C cells; (b) 95% confidence interval of the means of the feature for both cell types.
Fig. 4 Comparative multiple hypothesis testing between the spline model and linear regression fit. Percentage of OC curves per cell type that revealed a better fit with the spline model than with the linear regression shown as a function of different values of α (Type-I error). The Bonferroni correction was applied to the individual test p-values to alleviate the problem of false-positives when multiple comparisons are performed. Inset: zoom in on a range of [0, 0.05].
  • 5. Validation
Prior to using this statistical methodology for biological data interpretation, it was validated to assess its accuracy and robustness. For validation we used a model system based on enzymatic scavenging of oxygen by Oxyrase.32,33 Oxyrase is a preparation of membrane fragments from Escherichia coli and contains membrane monooxygenases and dioxygenases. When it comes in contact with lactic acid, Oxyrase removes oxygen rapidly from aqueous environments, including cell medium. Because of its enzymatic basis, oxygen removal kinetics by Oxyrase can be modeled using the Michaelis–Menten equation that describes enzymatic reaction rates as a function of substrate concentration. To reproduce data collection conditions as close as possible to actual experiments, we measured oxygen consumption kinetics of Oxyrase (no cells) using experimental settings identical to those used for single cells. This ensures that the signal-to-noise ratios are similar to single-cell data. We used four different Oxyrase concentrations, 50 µL, 150 µL, 200 µL, and 250 µL (ranging from 0.06–0.2% by volume) for more robust validation of the statistical framework. The features extracted from the OC kinetics data obtained with Oxyrase utilizing the statistical framework showed significant differences among signals measured with different Oxyrase concentrations. The application of a Random Forest classifier model34 to the extracted features revealed clear discrimination among the four different concentrations with out-of-bag error rates of 2% when all features were included in the model, and 11.1%, when TimetoZero (see Feature extraction) was removed from the data analysis.
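For reference, the Michaelis–Menten rate law mentioned in the validation above can be written for oxygen as the substrate. This is the standard textbook form; the symbols V_max (maximum removal rate, which scales with the amount of enzyme) and K_m (Michaelis constant) are introduced here for illustration and their values are not given in the text.

```latex
\[
  -\frac{d[\mathrm{O_2}]}{dt} \;=\; \frac{V_{\max}\,[\mathrm{O_2}]}{K_m + [\mathrm{O_2}]}
\]
```

When [O2] is much larger than K_m the rate approaches V_max, so the oxygen trace declines almost linearly, which is consistent with the expectation that different Oxyrase concentrations produce distinguishable slopes.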
Ensemble learners are predictive models that combine a collection of simpler classifiers yielding better predictive performance as an ensemble than any of the individual classifiers.35 The distinct discrimination among the different Oxyrase concentrations was visualized with the use of multidimensional scaling36 in panels (a) and (b) of Fig. 5. Each panel portrays the visualization patterns for the two Random Forest models discussed earlier: (a) a classifier with all features and (b) a classifier with all features except TimetoZero.
Table 1 Extracted features and their descriptions
Feature | Description
Change-point.Time | Time value at which the change in slopes in the piecewise linear fit takes place
Change-point.Oxygen | Oxygen consumption value at which the change in slopes in the piecewise linear fit takes place
Intercept coefficient (B0) | Intercept of left linear regression
Left slope coefficient (B1) | Slope of the linear regression before the Change-Point (a)
Right slope coefficient (B1) | Slope of the linear regression after the Change-Point (a)
Kurtosis | Measure of “peakedness”. Higher kurtosis means more of the variance is the result of infrequent extreme deviations, as opposed to frequent modestly sized deviations.
Skewness | Measure of the asymmetry.
Minimum MSE | The mean squared error value for the best piecewise linear regression fit.
TimetoZero | Time at which the oxygen concentration in the chamber reaches a value of zero
Brief description of features extracted from curves after application of smoothing and filtering techniques. (a) Slope magnitudes extracted from the spline model are divided by two for curves obtained with two cells per well.
Fig. 5 Multidimensional plots for Oxyrase enzymatic reaction for validation. This plot visualizes the scaling coordinates of the proximity matrix obtained with a Random Forest performed to classify four distinct Oxyrase concentration values. These Oxyrase measurements were gathered through the same semi-automated technology as the OC curves under study. They were used in validation since their behavior is well understood and differences are expected across features from Oxyrase curves from different concentrations.
  • 6. The resulting proximity matrix from the Random Forest classifier is used as input in multidimensional scaling to find a suitable 2D visual configuration that showcases the sample patterns. The axes, named scaling dimensions, represent the 2D coordinates in which these patterns are plotted. The ability to clearly differentiate varying reaction rates (slopes) obtained with different Oxyrase concentrations shows that our approach enables adequately robust and accurate characterization of dynamic processes. By capturing these differences among the signals known to have different kinetics using the statistical framework employed in this work, we validated our approach for application to single-cell OC data.
Biological inferences and interpretation
Comparison between different cell lines. Extracted quantitative features such as slopes, intercepts, breakpoint or change-point were compared across individual cells and cell types. To detect differences between CP-A and CP-C features we computed two sets of significance tests. A test of the statistical significance of differences between the means or the medians of the features of the two cell lines revealed significant differences for the TimetoZero and Change-point.Oxygen features (Table 1).
Fig. 6 Comparison of features between CP-A and CP-C cells by means of a spline model. Three main features were extracted using the constrained piecewise linear model: (a, b) oxygen concentration where the change of slopes in the fit occurs (change-point), (c, d) left (before slope change) and (e, f) right (after slope change) slopes. Figures on the left show feature frequency values and those on the right show 95% confidence interval of the features means.
  • 7. The distribution of the time point when each OC kinetics curve reaches an oxygen concentration value near zero (TimetoZero feature) exhibits a broad range of values in both cell types, as mentioned previously (Fig. 3). Statistical analysis revealed significant differences between both the means and the medians of the two cell types with p-values equal to 0.003 and 0.008, respectively (Fig. 3). Another feature of interest is the value of oxygen at the point where the two linear regressions of the spline model meet (Change-point.Oxygen). At the breakpoint of the spline model two features can be captured: oxygen concentration and time. Oxygen concentration when the change in slopes takes place is biologically relevant as it indicates a change in the oxygen consumption kinetics most likely caused by alterations in the energy production of the cell. The distributions of the Change-point.Oxygen feature within each cell type showed characteristics typical of a bimodal density function. Qualitatively the distribution histograms of the two cell types show significant similarity (Fig. 6) with a more clearly defined main peak at 6–6.5 ppm for CP-C cells. The distributions clearly indicate marked heterogeneity in OC kinetics within the same cell type. More subtle differences can be seen when comparing the two cell types (Fig. 6b). One of the most notable differences is the existence of a second, broader peak between 2–4 ppm in CP-C cells, which is less pronounced in CP-A cells. However, the statistical tests of the mean and median showed p-values of 0.053 and 0.061, respectively, indicating that neither parameter is statistically different at α = 0.05.
Two other features that we analyzed were the slopes (rates) of the OC kinetics measured in the study. Understanding how fast individual cells consume oxygen is of great interest as it is directly related to the energy production levels in the cell. The distributions of the slopes showed a long tail containing only a small number of cells, while the majority of the cells’ OC rates were concentrated in a relatively narrow range, [−0.02, 0] (Fig. 6). For both left and right slopes, no statistically significant differences between their means were found when comparing the two cell types (Fig. 6). However, the median values of the right slope were found to be statistically different between the two cell types with a p-value equal to 0.002 (Fig. 6). We further explored these comparisons as a classification problem with two classes (e.g. one cell type versus another) finding subtle differences between the two cell types using an ensemble-based classifier: Random Forest. The classification problem indicated an out-of-bag error rate of 30% when classifying single CP-A and CP-C cells based on the extracted features (Table 1). A multidimensional plot from the tested Random Forest (more details in the Methods section: Comparisons and classification techniques) is shown in Fig. 7. This plot shows differences among cell lines.
The role of intercellular interactions: comparison between OC kinetics in isolated single and interacting cells. To explore metabolic heterogeneity in the presence of intercellular interactions, OC kinetics curves were obtained with two cells of the same type placed into one microchamber. We compared features extracted from the OC data of single cells (i.e., CP-A_1 and CP-C_1) with those obtained with two cells per single chamber (i.e., CP-A_2 and CP-C_2). The same statistical methodology was applied to CP-A_2 and CP-C_2 OC curves as for the data acquired with single, non-interacting cells with only minor modifications to certain features.
To account for the number of cells (one or two) per microchamber the values of the slopes measured in microchambers with double occupancy were divided by two assuming equal OC for the two cells in a microwell, allowing comparisons with single-cell slopes. We first investigated the goodness-of-fit of the spline model applied to the OC kinetics data of interacting cells. We compared data fits obtained with the spline model and with simple linear regression using multiple hypothesis testing with Bonferroni correction as described in the Methods section. Similar to the results obtained with individual, non-interacting cells of both cell lines, the spline model fit was found to be statistically better than the simple linear regression model for all measurements with double-occupancy, interacting cells (Fig. S1, ESI†). A set of features from CP-A_1, CP-A_2, CP-C_1, and CP-C_2 curves were extracted using the constrained piecewise linear regression model. Distribution patterns similar to those obtained with single, non-interacting cells were found for the OC kinetics curves with interacting cells for features such as TimetoZero, Change-point.Oxygen, Left.Slope, and Right.Slope (description in Table 1). Statistically significant differences in both the mean and median were found for at least one of the four distinct groups of OC curves for the feature TimetoZero as shown in Fig. 8.
Fig. 7 Multidimensional scaling plot: a Random Forest classifier for single CP-A vs. CP-C cells. This plot visualizes the scaling coordinates of the proximity matrix obtained from a Random Forest to classify CP-A versus CP-C cells at the single-cell level. This graphical representation shows how the Random Forest classifier was able to find high-dimensional interactions between data features that cluster OC curves together.
  • 8. Fig. 8 The TimetoZero feature extracted from single- and double-cells for CP-A and CP-C oxygen consumption curves. Time to zero is a time feature extracted after removal of zero-valued tails using the CUSUM method. (a) Distribution histogram of the feature among single, non-interacting (CP-A_1 and CP-C_1) cells and for interacting (two cells per well; CP-A_2 and CP-C_2) cells. (b) 95% confidence interval plot of the means of TimetoZero for each experimental condition. Testing for statistically significant differences between the means or between the location shifts (e.g., medians) showed p-values equal to 0 in both cases.
Fig. 9 Other features of interest extracted from oxygen consumption kinetics of single, non-interacting and double, interacting CP-A and CP-C cells. The left panels show distribution histograms of the corresponding features; the right panels show 95% confidence interval of the means of the corresponding features. (a) and (b) Oxygen concentration values where the change of slopes in the spline model occurs. (c) and (d) Slope values of the first linear regression of the spline model (Left.Slope). (e) and (f) Slope values of the second linear regression (Right.Slope). See Table 1 for more detailed description of the slopes.
  • 9. With p-values close to zero, this feature may be an important discriminator among these non-interacting and interacting cells (less marked differences can be observed for CP-C_2 probably due to its small sample size). Other extracted features such as the ones presented in Fig. 9 (oxygen concentration at breakpoint, slopes before and after the breakpoint) portrayed less distinct differences among these groups but revealed empirical distribution patterns only available through the study of individual OC curves. For example, oxygen concentration at the breakpoint revealed significant differences for at least one group among all groups with p-values of 0.001 and 0.01 when testing for means and medians, respectively, suggesting CP-C_1 as more different for this feature (Fig. 9). In contrast, slope values (adjusted for interacting cells by dividing by two) did not differ as much across different cell groups besides the median of Right.Slope which showcased a p-value of 0.003 for at least one group being different among others (Fig. 9). These comparisons are possible through the application of the methodology presented in this work. The features extracted using the statistical framework allowed for multiple comparisons of different phenotypes. As seen before, the distributions of each of the features permitted comparisons and showcased subtle differences. To further analyze the OC curves through the extracted features, an ensemble classifier34,35 was applied with the objective of classifying the four groups of interest (CP-A_1, CP-A_2, CP-C_1, and CP-C_2). A Random Forest classifier34 (see Methods) was applied to the extracted features to unravel nonlinear relationships among the relevant features. Initially, we built Random Forest models for pairs of classes (i.e., CP-A_1 vs. CP-A_2, CP-C_1 vs. CP-C_2, etc.) obtaining error rates of ~20–30% for all pairs.
These models included all extracted features. When all four data classes were included in a single Random Forest model, the classification error rates were found to be around 40% when all features were used in the model and 50% for a Random Forest model that included all features except TimetoZero (Fig. 10). The TimetoZero feature was removed from the classification model to capture discriminant relationships among other features where differences might not be as clear or direct as in the case of TimetoZero. Table 2 shows the confusion matrices providing details on how many curves were misclassified using the models with or without the TimetoZero feature. Also shown in Table 2 is that the number of curves among the four different classes is unbalanced. To address this problem, down-sampling was performed on all Random Forest models applied here to lessen the sample size effect in the learning model. Down-sampling is a sampling technique that reduces the size of the majority class or the class with the greatest number of samples. It is widely used to balance the classes to minimize the overall error rate.37 In addition, Table 3 presents the feature importance scores for both Random Forest models. It can be seen that TimetoZero has the highest score for distinguishing between the different experimental classes. However, when the TimetoZero feature was removed, all features ranked similarly. Although their predictability measures are not high, the results obtained with the Random Forest models show semi-defined clusters within the same experimental condition or the cell type. Fig. 10 shows how the data points of the same type of experiment tend to agglomerate in regions partially overlapping with other experimental conditions. This Random Forest model extracts nonlinear patterns among the features to discriminate among different classes.
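The down-sampling step mentioned above can be sketched in a few lines. This is an illustrative sketch only: it assumes every class is randomly reduced to the size of the smallest class, which is one common way to balance classes and not necessarily the exact scheme used in the study; the function name `downsample` and the fixed `seed` are hypothetical.

```python
import random
from collections import defaultdict


def downsample(samples, labels, seed=0):
    """Randomly reduce every class to the size of the smallest class."""
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    n_min = min(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        for sample in rng.sample(items, n_min):  # draw without replacement
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels
```

The balanced feature vectors and labels returned here would then be passed to the classifier; repeating the draw with different seeds is a simple way to check that results do not depend on one particular subsample.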
The two cell lines used in the study represent different stages of pre-neoplastic progression in esophageal cancer and, thus, are closely related in their phenotypic and genotypic profiles. Therefore, it is likely that they will show similarities in terms of oxygen consumption as well, thus making the differentiation more difficult. More features either from the OC curves or any other biologically relevant data might be necessary to distinguish them clearly.
Fig. 10 Multidimensional scaling plots: a Random Forest model for non-interacting and interacting CP-A and CP-C cells. This plot visualizes the scaling coordinates of the proximity matrix obtained with a Random Forest performed to classify CP-A versus CP-C at the single- and double-cell level. (a) Results using all features as described in Table 1. (b) Results using all features with the TimetoZero feature excluded from the analysis.
Conclusion

The analysis and interpretation of intercellular heterogeneity data are of fundamental importance in cell biology. There is great interest in the scientific community in understanding the role of heterogeneity in cellular homeostasis and pathogenesis [28,38]. In recent years, innovative technologies have been developed to perform biological studies at the single-cell level [24–28], including single-cell oxygen consumption measurements. Despite the availability of these technologies, their real potential can only be exploited with effective analytical methods capable of performing robust de-noising and feature extraction on this novel type of information. Through preliminary studies, we have identified three major challenges when dealing with real-time phenotypic measurements at the single-cell level: random noise, the presence of multiple functional states, and reliable differentiation of cell behavior within and across different cell types (Fig. 1). In this study, using single-cell OC data as an example, we made an initial effort to establish a statistical framework for multiparameter analysis of experimental data at the single-cell level. In our approach we applied several sets of tools from signal processing and statistics for data modeling and feature extraction. The validation of the method showed that experimental data can be modeled and their features extracted reliably. The quantitative features extracted from the single-cell experimental data using our analysis method revealed subtle differences between non-interacting single cells as well as between interacting cells of both types. This demonstrates that the developed methodology can reliably process the measurement data and characterize oxygen consumption kinetics.
Because of its general applicability, our statistical framework can be used to address similar challenges that arise in other single-cell data acquisition and experimental modalities.

Methods

Dataset

Description of oxygen consumption measurements. As a first step in acquiring and analyzing multiparameter data, our center has developed an experimental platform for metabolic phenotype characterization, including oxygen consumption, at the single-cell level [27,28]. Single-cell oxygen consumption rates are on a scale of fmol min⁻¹ cell⁻¹. Because oxygen sensing is based on the dynamic quenching of sensor luminescence by oxygen, the signal-to-noise ratio of the measurement varies as a function of oxygen concentration in the microchamber. This factor needs to be taken into account, especially when applying signal processing algorithms for de-noising. Other sources of noise include detector readout noise, intensity variations of the excitation source, and stochastic sensor noise. For the two cell types studied in this work, the average time required for an isolated cell to consume all oxygen within the finite volume (~140 pL) of cell media ranges from 30 to 90 min. Noise levels resulting from the various sources can be significant, requiring the data to be analyzed with a rigorous statistical framework capable of reducing noise and extracting quantitative features. We analyzed several sets of oxygen consumption kinetics data from two Barrett's esophageal epithelial cell lines (metaplastic CP-A and dysplastic CP-C) obtained with the single-cell technology. The numbers of OC curves studied for CP-A and CP-C were 154 and 256, respectively. The cells were loaded into microwells and incubated for 15–30 hours before measurements were performed. The incubation time was selected based on previous studies of cell viability and morphology.
After incubation, microwells with cells were hermetically sealed with a lid containing an extracellular optical oxygen sensor. The sensor emission intensity was collected as a function of time until the oxygen concentration in the microchamber reached zero [27].

Table 2 Confusion matrices obtained with Random Forest classification models

(A) All features included:
True class (Num. curves)   Predicted class                        Class error (%)
                           CP-A_1   CP-A_2   CP-C_1   CP-C_2
CP-A_1 (154)               75       24       51       4           51.3
CP-A_2 (118)               4        81       1        32          31.4
CP-C_1 (256)               61       22       165      8           35.5
CP-C_2 (44)                5        20       2        17          61.4

(B) Without TimetoZero feature:
True class (Num. curves)   Predicted class                        Class error (%)
                           CP-A_1   CP-A_2   CP-C_1   CP-C_2
CP-A_1 (154)               74       29       45       6           51.9
CP-A_2 (118)               20       61       13       24          48.3
CP-C_1 (256)               60       28       142      26          44.5
CP-C_2 (44)                7        17       8        12          72.7

Individual error rates per cell type and number of cells within a microwell are shown for Random Forest models constructed using all features and with the TimetoZero feature excluded from the analysis. Each entry is the number of curves assigned to the given predicted class by the nonlinear model. The classification error is the percentage of curves of a true class that were misclassified. Misclassified signals (shown in gray boxes in the original table) are the off-diagonal entries.

Table 3 Variable importance scores from Random Forest classification models

Features               Mean decrease Gini (%)
                       All features   Without TimetoZero
Change-point.Time      9.03           12.93
Change-point.Oxygen    10.99          13.23
Left.B0.Coef           10.17          12.89
Left.B1.Coef           10.37          12.53
Right.B1.Coef          13.81          12.49
TimetoZero             17.12          -
Kurtosis               9.14           11.88
Skewness               8.88           11.35
MSE.min                10.49          12.70

These variable importance scores are calculated as the average over all trees of a scoring measure. This scoring measure is the number of cases correctly classified when the feature matrix is evaluated on the grown tree minus the number correctly classified when the variable to be scored is permuted prior to tree model evaluation.
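As a quick consistency check, the class errors in Table 2 can be recomputed from the confusion matrix entries. The sketch below uses matrix (A); the helper names `class_errors` and `overall_error` are illustrative and not taken from the paper.

```python
# Confusion matrix (A) from Table 2: rows are true classes, columns are
# predicted classes, both in the order CP-A_1, CP-A_2, CP-C_1, CP-C_2.
classes = ["CP-A_1", "CP-A_2", "CP-C_1", "CP-C_2"]
confusion_all = [
    [75, 24, 51, 4],
    [4, 81, 1, 32],
    [61, 22, 165, 8],
    [5, 20, 2, 17],
]


def class_errors(matrix):
    """Percentage of misclassified curves for each true class (row)."""
    return [100.0 * (sum(row) - row[i]) / sum(row)
            for i, row in enumerate(matrix)]


def overall_error(matrix):
    """Percentage of misclassified curves over all classes."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return 100.0 * (total - correct) / total


per_class = [round(e, 1) for e in class_errors(confusion_all)]
overall = round(overall_error(confusion_all), 1)
```

The per-class values reproduce the last column of Table 2(A), and the overall error of 40.9% agrees with the error rate of around 40% quoted for the model with all features.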
Noise reduction techniques

The noise levels in OC data were reduced using two main signal processing components: (1) low-pass filtering and (2) outlier smoothing.

Low-pass filtering. Two common low-pass filtering techniques were evaluated. A low-pass filter reduces the amplitude of high frequencies while leaving low frequencies unchanged. The two methods and their parameters are briefly described here. In addition, we discuss a goodness-of-fit assessment to decide which of the filtering techniques performs better for the measured OC kinetics curves.

The Savitzky–Golay (SG) filter, also called the least-squares polynomial smoothing filter, is a finite impulse response (FIR) filter [39]. The technique fits a polynomial of fixed degree n to a small window of the data of size (2m + 1) to estimate a midpoint, as shown in eqn (1) and (2). This process is repeated by sliding the data window along the total span [39,40]. This type of convolution filter minimizes the least-squares error of fitting a polynomial to window frames of the noisy data and is popular in areas such as spectroscopy and analytical chemistry because of its simplicity and speed [41,42]. If the data are evenly spaced and continuous, the smoothed value y*_t is the weighted summation of the points in the window frame, as described in eqn (3). The early implementation of the Savitzky–Golay method truncates m points at the start and end of the data signal, which cannot be smoothed. Therefore, extensions to the Savitzky–Golay filter that address initial-point and endpoint estimation were also implemented in this study [40,43].

$$y_t^{*} = \sum_{k=0}^{n} b_k t^k = b_0 + b_1 t + b_2 t^2 + \cdots + b_n t^n, \quad t = [-m, -(m-1), \ldots, 0, \ldots, m] \qquad (1)$$

$$\frac{\partial}{\partial b_k}\left[\sum_{t=-m}^{m} \left(y_t^{*} - y_t\right)^2\right] = 0 \qquad (2)$$

$$y_j^{*} = \frac{\sum_{t=-m}^{m} c_t\, y_{j+t}}{N} \qquad (3)$$

In our study, a second-order polynomial fit was tested, as it is commonly used in practice [41]. Another important parameter in SG filtering is the window length (m). Common values for this parameter are m = 11 and m = 21. We evaluated the root-mean-squared-error (RMSE) for a range of values under both conditions (i.e., CP-A and CP-C), as shown in Fig. S2 (ESI†). Data filtering in this study was performed using a window size of 11, since its smoothing performance was better than that of m = 21 in terms of preserving local signal patterns.

The second filter we applied was the Exponentially Weighted Moving Average (EWMA). It is an infinite impulse response (IIR) filter and represents a special case of the moving average filter in which the weights of the data points to be averaged decay exponentially with the distance from the most recent data point (eqn (4)). The smoothed value of y_t is obtained through

$$y_t^{*} = \lambda y_t + (1 - \lambda)\, y_{t-1}^{*} \qquad (4)$$

where λ represents the decay rate, with 0 ≤ λ ≤ 1. A small value of λ gives more weight to older data and less to new data, and vice versa [29,44]. To detect small signal changes, λ = 0.2 was used to smooth the data curves in this study. An RMSE evaluation across a range of λ values was performed, as shown in Fig. S2 (ESI†). In practice, λ values between 0.2 and 0.3 are used [45].

To assess the performance of the EWMA and SG filtering techniques, we evaluated the average root-mean-squared-error (RMSE) between smoothed and raw data as a goodness-of-fit criterion. Goodness-of-fit statistics (e.g., the coefficient of determination (R²), the mean squared error (MSE), and the RMSE) describe how well the smoothed values fit the experimental data. Small values of the average RMSE indicate a good fit. Both techniques showed similar performance for the commonly chosen parameters, as displayed in Fig. S3 (ESI†).

Outlier detection and smoothing.
The OC kinetics data contained random sharp peaks in certain areas due to signal loss or stochastic sensor intensity fluctuations. We detected these outliers using traditional control chart theory with the following equation

$$L = \bar{x} \pm w\hat{\sigma} \qquad (5)$$

where L represents the upper (+) and lower (−) control limits, x̄ is the mean value of the response, w is the parameter that determines the width of the limits, and σ̂ is an estimate of the variation. Data points outside the limits calculated using eqn (5) were considered outliers. σ̂ was estimated through an initial filtering step: each signal is filtered as described in the earlier subsection on low-pass filtering, and the variation of the raw data points around the resulting smoothed values is computed using the root-mean-squared-error (RMSE) metric. We assumed σ̂ to be constant, which is not necessarily true. However, because σ̂ is used only for the detection of distant outliers, this assumption is adequate.

To determine the w parameter (control width constant) we studied several options. The value w = 2 was chosen because, on average, about 10% of all data points within an OC kinetics curve are then detected as outliers (Fig. S4, ESI†). As expected, higher values of w (3, 4, and 5) resulted in smaller fractions of the data falling outside the imposed limits, ranging from 0% to ~5%, whereas a smaller value (w = 1) resulted in a larger fraction (~25%) of data points detected as outliers (Fig. S4, ESI†). Hence, w = 2 seemed a reasonable choice to reduce random noise due to outliers without excluding too much of the actual signal from the analysis.
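The detection step can be sketched as follows, using the EWMA of eqn (4) with λ = 0.2 as the initial filter and w = 2 in eqn (5). This is only an illustration under stated assumptions: the centre line x̄ is taken to be the smoothed value at each time point (the text does not say whether the mean is local or global), the EWMA is started at the first raw value, and the test signal is synthetic rather than measured data.

```python
import math


def ewma(values, lam=0.2):
    """Exponentially weighted moving average, eqn (4).

    Assumption: the recursion is initialised with the first raw value.
    """
    smoothed = [values[0]]
    for y in values[1:]:
        smoothed.append(lam * y + (1.0 - lam) * smoothed[-1])
    return smoothed


def find_outliers(values, w=2.0, lam=0.2):
    """Indices of points outside the control limits of eqn (5).

    Assumption: the centre line is the smoothed value at each time point;
    sigma_hat is the RMSE between the raw and the smoothed signal.
    """
    smoothed = ewma(values, lam)
    sigma_hat = math.sqrt(
        sum((y - s) ** 2 for y, s in zip(values, smoothed)) / len(values)
    )
    return [i for i, (y, s) in enumerate(zip(values, smoothed))
            if abs(y - s) > w * sigma_hat]


# Synthetic decaying signal with one sharp peak at index 20.
signal = [100.0 - 0.5 * i for i in range(40)]
signal[20] += 30.0
outliers = find_outliers(signal)
```

On this synthetic curve only the injected peak is flagged. On real curves the fraction of flagged points would depend on w, as discussed above.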
After detection, the outliers were smoothed out using a simple 2-neighbor averaging procedure in which each outlier value is replaced with the average of its two adjacent neighbors' values. The low-pass filter was re-applied to the entire dataset afterwards.

Feature extraction models

Cumulative sum control (CUSUM) charts: change detection. Cumulative sum (CUSUM) control charts detect small changes in the mean value more efficiently than Shewhart control charts [29]. To apply the CUSUM procedure, the OC curves were order-reversed to identify the deviation from zero (tail). The OC response signals portray the behavior of oxygen consumption over time. When a signal reaches its minimum value (zero), it shows constant behavior, i.e. a tail of zeros from that time point on. Hence, the time point at which the signal reaches zero can be obtained by capturing a deviation within the constant region of zero values that occurs at the end of the signal. Reversing the order of the signal facilitates the application of CUSUM charts to detect deviations from zero. Two input parameters are needed to calculate the CUSUM statistic (C_k): the subgroup size (k) and the in-control mean (in this study μ0 = 0). The parameter C_k is defined in eqn (6) by k, μ0, and the computed mean of the sub-sample of size k (x̄_k). C_k is calculated along the entire sample range.

$$C_k = \sum_{j=1}^{k} \left(\bar{x}_k - \mu_0\right) \qquad (6)$$

Other parameters that need to be determined when the process is out of control (in this study μ0 ≠ 0) are the decision interval and the amount of shift to detect (slack). Recommended values for these parameters are a decision interval of size 5 and a slack value of 3 [46–48].

Piecewise linear regression model. The methodology implemented in this paper for feature extraction consists of fitting a piecewise linear regression model to each OC kinetics curve.
In general, piecewise linear regression is used to describe nonlinear behavior by fitting the data to a number of linear segments. In the methodology implemented here, two linear regression models were constrained to connect at the same breakpoint. We considered the special case of two linear regressions intersecting at a single point at time t_c (the "change-point"), as shown in eqn (7), with the indicator variable I_{t ≥ t_c} = 1 when t ≥ t_c [49]. Both linear regressions are described by one function y, in which the indicator variable I_{t ≥ t_c} defines the two regression functions with slopes β1 and β1 + β2, respectively. The slope parameters were constrained to non-positive values because the oxygen concentration in the microchambers decreases.

$$y = \beta_0 + \beta_1 t + \beta_2 (t - t_c)\, I_{t \ge t_c} \qquad (7)$$

$$\beta_1 \le 0 \ \text{and} \ \beta_1 + \beta_2 \le 0 \quad \forall\ \text{curves}$$

To find the change-point, a likelihood method was used to minimize the sum squared error (SSE) of the fit of the kinetics data to two linear regressions. During the fit, an exhaustive search was performed along the time axis to determine the change-point and the coefficient estimates that minimize the SSE. Once the change-point was found, the features (Table 1) were extracted from the piecewise linear model for the different experimental conditions (i.e., CP-A, CP-C). The fit to the constrained piecewise linear regression with one breakpoint was statistically compared to the fit to a simple linear regression model using an F test. To perform the F test, an F statistic is computed as shown in eqn (8), where SSE_Model1 and SSE_Model2 refer to the sum squared error of the simple linear regression and the constrained piecewise linear regression models, respectively. The other inputs in eqn (8) are p and n; p is the number of parameters estimated for each model (i.e., Model1 or Model2) and n is the total number of data points in the signal.
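The exhaustive change-point search of eqn (7) and the F statistic of eqn (8) might be sketched as below. This is an illustration, not the authors' implementation: the constraints are handled by simply discarding candidate change-points whose unconstrained fit has a positive slope, the search keeps at least two points on each side of the change-point, the model is fitted through the normal equations, and the choice p = 4 for the piecewise model (counting t_c) is an assumption. All function names and the synthetic curve are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def least_squares(columns, y):
    """Ordinary least squares via the normal equations: (coefs, sse)."""
    k = len(columns)
    gram = [[sum(u * v for u, v in zip(columns[i], columns[j]))
             for j in range(k)] for i in range(k)]
    rhs = [sum(u * v for u, v in zip(columns[i], y)) for i in range(k)]
    coefs = solve(gram, rhs)
    fitted = [sum(coefs[i] * columns[i][r] for i in range(k))
              for r in range(len(y))]
    sse = sum((obs - fit) ** 2 for obs, fit in zip(y, fitted))
    return coefs, sse


def fit_piecewise(t, y):
    """Exhaustive change-point search for eqn (7): (tc, coefs, sse)."""
    ones = [1.0] * len(t)
    best = None
    for tc in t[2:-2]:  # keep at least two points on each side
        hinge = [max(ti - tc, 0.0) for ti in t]
        (b0, b1, b2), sse = least_squares([ones, t, hinge], y)
        if b1 > 0 or b1 + b2 > 0:
            continue  # simplification: skip fits violating the constraints
        if best is None or sse < best[2]:
            best = (tc, (b0, b1, b2), sse)
    return best


def f_statistic(sse1, p1, sse2, p2, n):
    """F statistic of eqn (8): Model1 is simple, Model2 is piecewise."""
    return ((sse1 - sse2) / (p2 - p1)) / (sse2 / (n - p2))


# Synthetic curve: slope -2 up to t = 10, slope -8 afterwards, small ripple.
t = [float(i) for i in range(20)]
y = [100.0 - 2.0 * ti - 6.0 * max(ti - 10.0, 0.0) + 0.1 * (-1) ** i
     for i, ti in enumerate(t)]
tc, coefs, sse2 = fit_piecewise(t, y)
_, sse1 = least_squares([[1.0] * len(t), t], y)
f_value = f_statistic(sse1, 2, sse2, 4, len(t))  # p2 = 4 is an assumption
```

For this synthetic curve the search recovers the change-point at t = 10, and the large F value favours the piecewise model over the single line. A constrained optimiser could replace the discard step when the unconstrained fit violates the slope constraints.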
$$F = \frac{\left(SSE_{Model1} - SSE_{Model2}\right) / \left(p_{Model2} - p_{Model1}\right)}{SSE_{Model2} / \left(n - p_{Model2}\right)} \qquad (8)$$

If F > F_{α, p_Model2 − p_Model1, n − p_Model2}, Model2 performs better.

The model comparison by an F test was performed for every single curve, resulting in a multiple hypothesis testing problem. A well-known problem in multiple hypothesis testing is the increase in false positives. Several approaches, such as the Bonferroni correction, exist to alleviate this problem. This widely used technique is applied when multiple statistical tests are computed simultaneously, and it reduces false positives by reducing the value of α, the significance level of the test. Equivalently, all the p-values from the individual tests can be adjusted as shown in eqn (9), where n is the number of comparisons [31,50,51].

$$p_{value.adjusted}[c] = \min\left(p_{value}[c] \times n,\ 1\right), \quad c \in [1, n] \qquad (9)$$

Comparisons and classification techniques

Statistical significance tests. The extracted features were studied and compared between the two cell lines using traditional statistical tools such as histograms, confidence intervals, and statistical tests of the mean and median. The statistical significance of the difference between the means was determined using the analysis of variance (ANOVA) test, which generalizes the t-test to more than two groups but relies on several assumptions that may or may not be met for this particular data structure. ANOVA was performed with caution to get a general sense of the group means, using the ANOVA hypothesis shown in eqn (10). In addition to ANOVA, we performed significance tests for the differences between the median values using nonparametric tests, which waive the strict assumptions inherent to ANOVA. The median or rank test was performed using the Mann–Whitney–Wilcoxon test [52,53] for a two-level group test and the Kruskal–Wallis test [54] for more than two groups.
Both tests are nonparametric approaches for evaluating differences in the location shift of the distribution of x for each group. Eqn (11) represents the analytical expression of the Kruskal–Wallis test, where n_i is the number of observations in group i, r_ij is the rank of observation j from group i, and N is the total number of observations for all groups. The p-value corresponding to a particular K is approximated through the χ² distribution [54].

$$H_0: \mu_1 = \mu_2 = \cdots = \mu_n \qquad (10)$$
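The Kruskal–Wallis statistic K of eqn (11) can be computed directly from the pooled ranks. The sketch below is a plain-Python illustration with tied values given their average rank; the helper names and the three small samples are made up for the example and are not data from this study.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def kruskal_wallis(groups):
    """Kruskal-Wallis statistic K of eqn (11) for a list of samples."""
    pooled = [x for g in groups for x in g]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    mean_rank = (n_total + 1) / 2.0
    between = 0.0
    offset = 0
    for g in groups:
        group_ranks = ranks[offset:offset + len(g)]
        offset += len(g)
        group_mean = sum(group_ranks) / len(g)
        between += len(g) * (group_mean - mean_rank) ** 2
    total = sum((r - mean_rank) ** 2 for r in ranks)
    return (n_total - 1) * between / total


# Made-up example: three groups of three observations each.
k_stat = kruskal_wallis([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

For this toy example the group mean ranks are 2, 5, and 8, which gives K = 7.2. In practice a library routine such as `scipy.stats.kruskal` returns both the statistic and the χ² p-value.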
$$K = (N - 1)\, \frac{\sum_{i=1}^{g} n_i \left(\bar{r}_i - \bar{r}\right)^2}{\sum_{i=1}^{g} \sum_{j=1}^{n_i} \left(r_{ij} - \bar{r}\right)^2} \qquad (11)$$

Here g is the number of groups, r̄_i is the mean rank of group i, and r̄ is the mean of all ranks.

Ensemble classifier: Random Forest. To further explore potential relationships among several groups of OC curves, we applied an ensemble classifier based on decision trees. The two cell lines (CP-A and CP-C) at the single-cell or two-cell levels (i.e., CP-A_1, CP-A_2, CP-C_1, and CP-C_2) were defined as the four classes for the classifier model, with features from the OC curves used as predictors. Decision trees can be applied in almost all scenarios and therefore provide a good starting point for modeling heterogeneous and large data sets. They apply to either a numerical or a categorical response and are nonlinear, simple, and fast. Decision trees are also scale-invariant and robust to missing values. However, a single tree is produced by a greedy algorithm that generates an unstable model [34]. Consequently, ensemble methods have been used to counteract the instability of a single tree. Supervised ensemble methods build a set of simple models called base learners and use a weighted outcome for each base learner in a voting scheme to predict future data. In other words, ensemble methods merge the outputs of multiple base learners into a voting committee to improve performance. Many empirical studies have shown that ensemble methods often outperform any single base learner [35]. The Random Forest classifier is an improved bagging method that exploits the benefits of bootstrap sampling. It grows a forest of random decision trees on bagged samples, yielding accurate results comparable with the best known classifiers [34]. An advantageous property of Random Forest classifiers is that they limit overfitting through embedded out-of-bag (OOB) error estimation.
The out-of-bag error estimate for the ith tree in the Random Forest model is computed using the cases that were not used in the learning of this ith tree. Other advantages of Random Forest models are that they are simple to train and tune in many applications, are computationally efficient, can handle a large number of variables, provide variable importance scores, include an embedded method to estimate missing data, generate a proximity matrix among cases, handle variable interactions, can be adapted to balance the error for datasets with unbalanced numbers of samples, and can be extended to unlabeled data for unsupervised clustering, data views, and outlier detection [34].

Algorithm: a simple pseudocode for Random Forest classifier construction is shown below [34,35].
1. Select a number of cases independently, with replacement, from the original dataset to build the training data.
2. Use the training data to grow a tree:
   - Select v variables at random from the total number of input variables (V), where v ≪ V.
   - The best variable among the v predictors is chosen to maximize the information gain of the split.
   - Split the chosen node into two daughter nodes based on the best variable.
3. Repeat Step 2 until all trees are built.
4. Output the ensemble of trees.

Important features of Random Forest classifiers are OOB sampling, variable importance, and proximity plots. OOB sampling serves the same purpose as cross-validation and, because the trees of a Random Forest are grown in parallel, this validation can be done along the way. Variable importance is a key feature of Random Forests. The variables are ranked based on their improvement in the empirical loss function among all trees, meaning that variables that are chosen often in the trees provide better predictive power, i.e. they minimize the loss function. Proximity distances between cases are measured by putting all the data, training and out-of-bag, through the grown trees.
If instances i and j are in the same terminal node, their proximity increases by one, and so on through all the trees [34]. The proximities are then normalized by the number of trees in the model. State-of-the-art visualization methods such as multidimensional scaling [36] are used to illustrate how well the features discriminate among different conditions. Multidimensional scaling represents high-dimensional data in a lower-dimensional space (often two or three dimensions) in order to better visualize any structure in the data. The algorithm generates points in the lower-dimensional space that approximately preserve the pair-wise distances between the points in the high-dimensional space [55].

Conflict of Interest: none declared.

Acknowledgements

The authors would like to thank the personnel and support of the Center for Biosignatures Discovery Automation in the Biodesign Institute at Arizona State University. Funding: this research is supported by the National Institutes of Health (NIH), National Human Genome Research Institute (NHGRI), Center of Excellence in Genomic Science (CEGS), grant number 5 P50 HG002360 to Deirdre R. Meldrum.

References

1 M. Lidstrom and D. R. Meldrum, Life-on-a-chip, Nat. Rev. Microbiol., 2003, 158, 164.
2 D. J. Wang and S. Bodovitz, Single cell analysis: the new frontier in 'omics', Trends Biotechnol., 2010, 28(6), 281–290.
3 T. Kalisky and S. R. Quake, Single-cell genomics, Nat. Methods, 2011, 8(4), 311–314.
4 N. Navin, J. Kendall, J. Troge, P. Andrews, L. Rodgers, J. McIndoo, K. Cook, A. Stepansky, D. Levy, D. Esposito, L. Muthuswamy, A. Krasnitz, W. R. McCombie, J. Hicks and M. Wigler, Tumour evolution inferred by single-cell sequencing, Nature, 2011, 472(7341), U90–U119.
5 E. J. Kostelich and T. Schreiber, Noise reduction in chaotic time-series data: A survey of common methods, Phys. Rev. E: Stat. Phys., Plasmas, Fluids, Relat. Interdiscip. Top., 1993, 48, 1752–1763.
6 S. J. Orfanidis, Introduction to Signal Processing, Prentice-Hall, Englewood Cliffs, NJ, 1996.
7 J. Brocker, U. Parlitz and M. Ogorzalek, Nonlinear Noise Reduction, Proc. IEEE, 2002, 90(5), 898–918.
8 M. Schena, D. Shalon, R. W. Davis and P. O. Brown, Quantitative monitoring of gene expression patterns with a complementary DNA microarray, Science, 1995, 270(5235), 467–470.
9 D. A. Lashkari, J. L. DeRisi, J. H. McCusker, A. F. Namath, C. Gentile, S. Y. Hwang, P. O. Brown and R. W. Davis, Yeast microarrays for genome wide parallel genetic and gene expression analysis, Proc. Natl. Acad. Sci. U. S. A., 1997, 94(24), 13057–13062.
10 V. G. Cheung, M. Morley, F. Aguilar, A. Massimi, R. Kucherlapati and G. Childs, Making and reading microarrays, Nat. Genet., 1999, 21, 15–19.
11 S. K. Moore, Making chips to probe genes, IEEE Spectrum, 2001, 38(3), 54–60.
12 W. Torres-Garcia, W. W. Zhang, R. Johnson, G. Runger and D. R. Meldrum, Integrative analysis of transcriptomic, proteomic data of Desulfovibrio vulgaris: a nonlinear model to predict abundance of undetected proteins, Bioinformatics, 2009, 25, 1905–1914.
13 W. Torres-Garcia, S. D. Brown, R. H. Johnson, W. W. Zhang, G. Runger and D. R. Meldrum, Integrative analysis of transcriptomic and proteomic data of Shewanella oneidensis: missing value imputation using temporal datasets, Mol. BioSyst., 2011, 7(4), 1093–1104.
14 M. L. T. Lee, F. C. Kuo, G. A. Whitmore and J. Sklar, Importance of replication in microarray gene expression studies: Statistical methods and evidence from repetitive cDNA hybridizations, Proc. Natl. Acad. Sci., 2000, 97(18), 9834–9839.
15 D. E. Carter, J. F. Robinson, E. M. Allister, M. W. Huff and R. A. Hegele, Quality assessment of microarray experiments, Clin. Biochem., 2005, 38(7), 639–642.
16 J. Seo, M. Bakay, Y. W. Chen, S. Hilmer, B. Shneiderman and E. P. Hoffman, Interactively optimizing signal-to-noise ratios in expression profiling: project-specific algorithm selection and detection p-value weighting in Affymetrix microarrays, Bioinformatics, 2004, 20(16), 2534–2544.
17 T. Howlader and Y. P. Chaubey, Noise Reduction of cDNA Microarray Images Using Complex Wavelets, IEEE Trans. Image Process., 2010, 19(8), 1953–1967.
18 Y. Saeys, I. Inza and P. Larrañaga, A review of feature selection techniques in bioinformatics, Bioinformatics, 2007, 23(19), 2507–2517.
19 J. P. Stevens, Intermediate Statistics. A Modern Approach, Lawrence Erlbaum Associates Publishers, Mahwah, NJ, Second edn, 1999.
20 J. X. Pan and K. T. Fang, Growth Curve Models and Statistical Diagnostics, Springer Series in Statistics, 2002.
21 S. E. Maxwell and H. D. Delaney, Designing Experiments and Analyzing Data: A Model Comparison Perspective, Lawrence Erlbaum, Second edn, 2003.
22 S. Weerahandi, Generalized inference in repeated measures: Exact methods in MANOVA and mixed models, Wiley-Interscience, 2004.
23 Applied regression analysis and other multivariable methods, ed. D. G. Kleinbaum, L. L. Kupper and K. E. Muller, PWS Publishing Co., Boston, MA, USA, 4th edn, 2008.
24 Y. Anis, M. Holl and D. Meldrum, Automated selection and placement of single cells using vision-based feedback control, IEEE Trans. Autom. Sci. Eng., 2010, 7(3), 598–606.
25 H. Zhu, M. Holl, T. Ray, S. Bhushan and D. R. Meldrum, Characterization of deep wet etching of fused silica glass for single cell and optical sensor deposition, J. Micromech. Microeng., 2009, 19, 6.
26 Y. Tian, B. R. Shumway, C. Youngbull, Y. Li, A. K. Y. Jen, R. H. Johnson and D. R. Meldrum, Dually fluorescent sensing of pH and dissolved oxygen using a membrane made from polymerizable sensing monomers, Sens. Actuators, B, 2010, 47(2), 714–722.
27 S. Ashili, L. Kelbauskas, J. Houkal, D. Smith, Y. Tian, C. Youngbull, H. Zhu, Y. Anis, M. Hupp, K. Lee, A. Kumar, J. Vela, A. Shabilla, R. Johnson, M. Holl and D. Meldrum, Automated platform for multiparameter stimulus response studies of metabolic activity at the single-cell level, Proceedings Vol. 7929, Microfluidics, BIOMEMS, and Medical Microsystems IX, 2011.
28 L. Kelbauskas, S. Ashili, J. Houkal, D. Smith, A. Mohammadreza, K. Lee, A. Kumar, Y. Anis, T. Paulson, C. Youngbull, Y. Tian, R. Johnson, M. Holl and D. Meldrum, A novel method for multiparameter physiological phenotype characterization at the single-cell level, Proceedings Vol. 7902, Imaging, Manipulation and Analysis of Biomolecules, Cells, and Tissues IX, 2011.
29 D. Montgomery, Introduction to Statistical Quality Control, Wiley Higher Education, 2005.
30 T. Molter, S. C. McQuaide, M. Zhang, M. R. Holl, L. W. Burgess, M. E. Lidstrom and D. R. Meldrum, Algorithm advancements for the measurement of single cell oxygen consumption rates, IEEE International Conference CASE 2007, Automation Science and Engineering, 2007, 386–391.
31 J. P. Shaffer, Multiple Hypothesis Testing, Annu. Rev. Psychol., 1995, 46, 561–584.
32 J. K. Joseph, D. Bunnachak, T. J. Burke and R. W. Schrier, A novel method of inducing and assuring total anoxia during in vitro studies of O2 deprivation injury, J. Am. Soc. Nephrol., 1990, 1, 837–840.
33 K. C. Ho, J. K. Leach, K. Eley, R. B. Mikkelsen and P. S. Lin, A simple method of producing low oxygen conditions with Oxyrase for cultured cells exposed to radiation and Tirapazamine, Am. J. Clin. Oncol., 2003, 26(4), e86–e91.
34 L. Breiman, Random forests, Mach. Learn., 2001, 45, 5–32.
35 T. Hastie, R. Tibshirani and J. H. Friedman, The Elements of Statistical Learning: Data Mining, Inference, Prediction, Springer Verlag, 2nd edn, 2009.
36 T. F. Cox and M. A. Cox, Multidimensional scaling, Chapman and Hall, London, 1994.
37 L. Breiman, J. Friedman, R. A. Olshen and C. J. Stone, Classification and Regression Trees, Wadsworth International, Belmont, CA, 1984.
38 S. J. Altschuler and L. F. Wu, Cellular Heterogeneity: Do Differences Make a Difference?, Cell, 2010, 141(4), 559–563.
39 A. Savitzky and M. J. E. Golay, Smoothing and differentiation of data by simplified least squares procedures, Anal. Chem., 1964, 36(8), 1627–1639.
40 R. A. Leach, C. A. Carter and J. M. Harrister, Least-squares polynomial filters for initial point and slope estimation, Anal. Chem., 1984, 56(13), 2304–2307.
41 P. Persson and G. Strang, Mathematical systems theory in biology, communications, computation, and finance, Springer, 2002.
42 Z. B. Alfassi, Z. Boger and Y. Ronen, Statistical Treatment of Analytical Data, CRC Press, Blackwell Science, Boca Raton, FL, 2005.
43 P. A. Gorry, General least-squares smoothing and differentiation by the convolution (Savitzky–Golay) method, Anal. Chem., 1990, 62(6), 570–573.
44 B. Walczak, Wavelets in chemistry, Elsevier Science, 2000, vol. 22.
45 J. Hunter, The exponentially weighted moving average, J. Qual. Technol., 1986, 18(4), 203–210.
46 J. Pignatiello and G. C. Runger, Comparison of multivariate CUSUM charts, J. Qual. Technol., 1990, 22, 173–186.
47 S. S. Prabhu, G. C. Runger and D. C. Montgomery, Selection of the subgroup size and sampling interval for a CUSUM control chart, IEEE Trans., 1997, 29, 451–457.
48 V. Golosnoy, S. Ragulin and W. Schmid, Multivariate CUSUM chart: properties and enhancements, AStA Advances in Statistical Analysis, Springer, 2009, vol. 93(3), 263–279.
49 R. A. Berk, Statistical Learning from a Regression Perspective, Springer Science + Business Media, LLC, New York, 2008.
50 Y. Benjamini and Y. Hochberg, Controlling the false discovery rate: a practical and powerful approach to multiple testing, J. R. Stat. Soc. Ser. B, 1995, 57, 289–300.
51 Y. Benjamini and D. Yekutieli, The control of the false discovery rate in multiple testing under dependency, Ann. Stat., 2001, 29, 1165–1188.
52 F. Wilcoxon, Individual comparisons by ranking methods, Biometrics Bull., 1945, 6, 80–83.
53 H. B. Mann and D. R. Whitney, On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other, Ann. Math. Stat., 1947, 18(1), 50–60.
54 W. H. Kruskal and W. A. Wallis, Use of ranks in one-criterion variance analysis, J. Am. Stat. Assoc., 1952, 47(260), 583–621.
55 C. H. Chen, W. Hardle, A. Unwin, M. Cox and T. F. Cox, Handbook of data visualization. In Springer Handbooks Comp. Statistics, chapter Multidimensional Scaling, Springer, Berlin Heidelberg, 2008, pp. 315–347.