A statistical framework for multiparameter analysis at the single cell level

804 Mol. BioSyst., 2012, 8, 804–817 This journal is © The Royal Society of Chemistry 2012
Cite this: Mol. BioSyst., 2012, 8, 804–817
A statistical framework for multiparameter analysis at the single-cell level†
Wandaliz Torres-García,ab
Shashanka Ashili,b
Laimonas Kelbauskas,*b
Roger H. Johnson,b
Weiwen Zhang,*b
George C. Runger*a
and Deirdre R. Meldrumb
Received 17th October 2011, Accepted 2nd December 2011
DOI: 10.1039/c2mb05429a
Phenotypic characterization of individual cells provides crucial insights into intercellular
heterogeneity and enables access to information that is unavailable from ensemble averaged, bulk
cell analyses. Single-cell studies have attracted significant interest in recent years and spurred the
development of a variety of commercially available and research-grade technologies. To quantify
cell-to-cell variability of cell populations, we have developed an experimental platform for
real-time measurements of oxygen consumption (OC) kinetics at the single-cell level. Unique
challenges inherent to these single-cell measurements arise, and no existing data analysis
methodology is available to address them. Here we present a data processing and analysis method
that addresses challenges encountered with this unique type of data in order to extract
biologically relevant information. We applied the method to analyze OC profiles obtained with
single cells of two different cell lines derived from metaplastic and dysplastic human Barrett’s
esophageal epithelium. In terms of method development, three main challenges were considered
for this heterogeneous dynamic system: (i) high levels of noise, (ii) the lack of a priori knowledge
of single-cell dynamics, and (iii) the role of intercellular variability within and across cell types.
Several strategies and solutions to address each of these three challenges are presented. Features
such as slopes, intercepts, and breakpoints (change-points) were extracted for every OC profile
and compared across individual cells and cell types. The results demonstrated that the extracted
features facilitated exposition of subtle differences between individual cells and their responses to
cell–cell interactions. With minor modifications, this method can be used to process and analyze
data from other acquisition and experimental modalities at the single-cell level, providing a
valuable statistical framework for single-cell analysis.
Introduction
Cell-to-cell variability has been found to play a central role in
a variety of physiological processes such as differentiation,
proliferation, stress response and pathogenesis. Due to the
stochastic nature of many intracellular processes, individual
cells can exhibit significant phenotypic differences and respond
differently to stimuli and changes in the microenvironment.1–4
The origin of many diseases is thought to be in several, or
perhaps even one aberrant cell that acquires the capability to
evade the cues regulating normal cell function and death.
Early identification and detailed characterization of such
abnormal cells bear the potential not only to provide deep
insights into fundamental cell processes, but also to open new
avenues for treatment and management of diseases with high
morbidity and mortality, including cancer. Because of that,
single-cell studies have been gaining momentum in the last
decade facilitated by technological advances enabling reliable
measurement of various biologically relevant parameters with
high sensitivity and precision. To study cell signaling and
metabolic pathways, one needs to be able to characterize
simultaneously as many parameters of living single cells
as possible. Multiparameter analysis could reveal the details
of intracellular mechanisms, providing novel insights into
systems biology of cells.
Technological challenges such as extremely low amounts of
biological material, small differential changes in metabolite
concentrations and the fragility of cells have been hampering
significant progress in single-cell analysis. One of the major
limitations in single-cell experiments is the low signal-to-noise
ratio. Reliable separation of meaningful data from noise
represents a formidable challenge, one that is exacerbated by
the absence of a priori knowledge of the dynamics of physio-
logical processes that take place in individual cells. This is
particularly true in experiments where single living cells need
a School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ 85287-5906, USA. E-mail: George.Runger@asu.edu
b Center for Biosignatures Discovery Automation, The Biodesign Institute, Arizona State University, Tempe, AZ 85287-6501, USA. E-mail: Laimonas.Kelbauskas@asu.edu, Weiwen.Zhang@asu.edu
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c2mb05429a
Published on 05 January 2012 on http://pubs.rsc.org | doi:10.1039/C2MB05429A
to be characterized with minimal perturbation of their normal
function, further limiting the experimentalists’ choice among
available methodology.
There are numerous methods available to remove white noise
in dynamic data. Many of them have been suitably adapted for
application in a variety of research fields such as chemistry,
environmental studies and medicine.5–7
Nevertheless, reducing
random perturbation in a signal is not a trivial task, since
it is usually unclear how much noise can be removed without
losing the “true” signal. Quality assurance problems in newly
developed technologies are common. For example, in the
1990s, when DNA microarrays started to gain interest in the
scientific community, the importance and value of the unique
information for understanding biological systems,8–13
as well as
the need for quality assessment and noise reduction14,15
were
clearly acknowledged. In light of these challenges, many noise
reduction methods were proposed, and later developed into a
mature and unified methodology.16,17
In a way analogous to
noise reduction, the characterization of data and signals in the
bioinformatics arena has been widely studied especially for data
quality assessment purposes in microarray-based studies. Many
feature selection techniques are commonly used for microarray
data characterization, including selection of genes with signifi-
cant expression levels in response to changes in conditions or
experimental settings.18
Modeling of real-time data obtained from dynamical
systems has been explored in the literature utilizing traditional
statistical methods.19–23
The traditional methods tend to
establish parametric assumptions which are often hard to
justify in complex biological systems. Hence, there exists a
critical need to model real-time measurements in biological
systems, including live cells, without a priori knowledge of the
nature of underlying dynamical processes. However, so far
none of these established methods have been applied to
analyze data obtained from individual living cells.
Here we present a study focused on the analysis of novel
respiration kinetics data from individual cells. The cell
metabolic analysis method entails manipulation24
and isolation
of single cells25
and determination of their oxygen consump-
tion (OC) kinetics in real-time.26–28
The data obtained from
these measurements exhibit much higher levels of noise com-
pared to bulk-cell experiments. The lack of a priori knowledge
of single-cell dynamics makes it difficult to define characteristic
features in these datasets, posing challenges in the
extraction of biologically relevant information and its proper
interpretation. The real-time nature of the measurements
contributes additional complexity to the analysis. In this
work we describe our initial efforts to develop statistical
methodologies to address the challenges of noise reduction,
data characterization through feature extraction, and biological
comparison for respiration phenotype measurements in
individual cells. We analyzed OC kinetics data of single cells
obtained from two esophageal epithelial cell lines: metaplastic
(CP-A) and early dysplastic (CP-C) Barrett’s esophageal cells.
These cell lines were derived from biopsies taken from the
corresponding regions in human esophagus and represent
different stages of pre-neoplastic progression. Because of the
clear delineation of the two cell types in terms of histopathology
and their relevance to cancer, our findings may also be of interest
to cancer biologists. Because they serve to define and extract
elements of a disease biosignature, the statistical methodologies
presented here could be used as a foundational framework for
analyzing single-cell data.
Results
Data preprocessing
The data consist of OC kinetics in single human metaplastic
(CP-A) and early dysplastic (CP-C) cells. Fig. 1 summarizes
the challenges addressed in this unique data structure. The
early exploration of OC measurements at the single-cell level
indicated the need to reduce noise and unwanted perturbations
in the signals. Reducing noise helps enhance the discovery of
Fig. 1 Statistical framework diagram. Sequential steps to process and analyze single-cell oxygen consumption data: smoothing, feature extraction
and classification. Major challenges and proposed solution strategies to address each one of them are shown.
relevant features related to the “true” signal’s behavior. In our
approach two main stages of noise reduction were performed:
(1) low-pass filtering and (2) outlier smoothing. Common
filtering techniques were applied to the OC data in two
different ways. First, a filter was applied to each of the OC
curves to estimate a curve-specific metric of variation that was
used to detect outliers in the unsmoothed data. This outlier
smoothing process invoked traditional control charts in which
any data point of the OC curves lying outside curve-specific
control limits was considered an outlier and its value was
smoothed by neighborhood averaging (see the Methods section).
This step reduced the adverse influence of artifacts caused by the
stochastic response of the microsensor or other measurement
system components. After outlier smoothing, the resulting signal
was processed through a low-pass filter (Fig. 2).
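The two preprocessing stages described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the running median used to estimate the trend, the moving average used as the low-pass filter, the window of 5 samples and the 3-sigma control limits are all hypothetical choices, and the function names are invented for this example.

```python
from statistics import fmean, median, pstdev

def _window(values, i, half):
    """Values in a centered window around index i, truncated at the edges."""
    return values[max(0, i - half): i + half + 1]

def smooth_outliers(signal, window=5, n_sigma=3.0):
    """Replace points outside curve-specific control limits by a neighborhood average."""
    half = window // 2
    # Running median as a robust trend estimate (assumed filter, not the authors').
    trend = [median(_window(signal, i, half)) for i in range(len(signal))]
    residuals = [x - t for x, t in zip(signal, trend)]
    sigma = pstdev(residuals)  # curve-specific metric of variation
    cleaned = list(signal)
    for i, r in enumerate(residuals):
        if abs(r) > n_sigma * sigma:  # point lies outside the control limits
            neighbors = [signal[j]
                         for j in range(max(0, i - half), min(len(signal), i + half + 1))
                         if j != i]
            cleaned[i] = fmean(neighbors)
    return cleaned

def low_pass(signal, window=5):
    """Centered moving average, used here as a simple low-pass filter."""
    half = window // 2
    return [fmean(_window(signal, i, half)) for i in range(len(signal))]

def preprocess(signal, window=5, n_sigma=3.0):
    """Outlier smoothing followed by low-pass filtering."""
    return low_pass(smooth_outliers(signal, window, n_sigma), window)
```

A robust trend estimate is used for the control limits so that a single spike does not inflate the residuals of its neighbors and cause them to be flagged as well.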
Feature extraction
After preprocessing, the data analysis procedure aimed to
characterize the OC kinetics. The feature extraction step
addresses challenge number two described in Fig. 1, which
can be divided into two separate problems: (1) removal of
redundant information characterized through the understanding
of experimental limitations, and (2) extraction of distinctive
features without a priori knowledge of the system.
Detection of the time needed to reach zero oxygen concen-
tration in the microchambers. The removal of redundant
information from further analysis is based on experimental
considerations. During measurement, individual cells are
hermetically isolated in sub-nanolitre volume chambers which
results in a limited amount of oxygen being available for
consumption by cells.27,28
Data collected after the oxygen
concentration in the chambers reaches zero are not useful
for OC kinetics analysis and can be discarded as extraneous
or redundant. During an experiment, OC kinetics of nine
individual cells was recorded simultaneously. The time needed
for each cell to deplete the oxygen in the microchamber varied
significantly from cell to cell due to the metabolic rate hetero-
geneity. The experiment was continued until oxygen concen-
tration in all nine chambers reached zero, resulting in different
amounts of redundant data collected for each cell.28
To
address this issue, we proceeded to automatically detect the
time point where each curve reached a zero value (0% oxygen)
and to discard data collected after that time point (referred to
as zero-value tails), excluding it from further analysis (Fig. 2).
Hence, we define redundant information in the context of
experimental conditions, namely by the limited amount of
oxygen available for each cell to consume and the variable rate
at which they consume it. Further experimental details related
to the data used in this study can be found in the Methods
section. Removing the zero-value tails from each of the OC
curves facilitated robust modeling of these curves in regions of
interest and allowed for reliable feature extraction from the
kinetics. Removal of the zero-value tails would be a trivial
problem if one had to analyze a small number of curves or if
the time to reach zero was the same for all cells. For the
analysis of hundreds of cells, however, we needed to develop
an algorithm to automatically remove the tails to ensure
consistency and rapid data processing.
A cumulative sum (CUSUM) control chart is a commonly
used statistical tool to detect small changes.29
Its application
allowed us to automatically detect the time point at which each
OC curve reaches a zero oxygen value and does not change
significantly afterwards (Fig. 2). This statistical procedure
was performed to detect a change time point, a feature called
TimetoZero, for each sample. A summary of the time points
detected with the CUSUM procedure is shown in Fig. 3, where
frequency plots of the times needed to reach 0% oxygen
concentration, stratified by cell line, give a general view of
the distribution of the TimetoZero feature across cell
lines. We used this feature as a reference point to remove the
data lying beyond this point as redundant. In addition, by
determining the time-to-zero reference points, we captured a
unique characteristic of the OC kinetics to better understand
cell heterogeneity.
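One way a CUSUM scheme can locate the start of the zero-value tail is sketched below. It is an illustration only and makes several assumptions that are not taken from the paper: the last 10% of each recording is assumed to lie on the zero plateau, the chart is a one-sided tabular CUSUM run backward in time from that plateau, and the allowance (`k_sigma`) and decision threshold (`h_sigma`) are hypothetical values.

```python
from statistics import fmean, pstdev

def time_to_zero(times, oxygen, tail_frac=0.1, k_sigma=0.5, h_sigma=5.0):
    """Estimate when a curve reaches its zero plateau with a backward one-sided CUSUM."""
    n = len(oxygen)
    n_tail = max(3, int(n * tail_frac))
    tail = oxygen[-n_tail:]           # assumed to be on the zero plateau
    mu0 = fmean(tail)                 # in-control mean (plateau level)
    sigma = pstdev(tail) or 1e-9      # guard against a perfectly flat tail
    k = k_sigma * sigma               # allowance (slack) per sample
    h = h_sigma * sigma               # decision threshold
    cusum = 0.0
    last_zero = n - 1                 # latest index (scanning backward) with CUSUM == 0
    for i in range(n - 1, -1, -1):
        cusum = max(0.0, cusum + (oxygen[i] - mu0 - k))
        if cusum == 0.0:
            last_zero = i
        elif cusum > h:
            # The upward drift began right after the last reset of the chart.
            return times[last_zero]
    return None                       # no departure from the plateau detected
```

Everything recorded after the returned time would then be treated as the zero-value tail and dropped before model fitting.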
OC characterization and other features. Modeling OC curves
is challenging since there is no a priori knowledge of single-cell
respiration kinetics. Other than the notion that cells are
Fig. 2 Step-by-step statistical framework example. Main steps used to characterize the OC kinetics data are shown: (a) data filtering, (b) detection
of feature, TimetoZero, using CUSUM, (c) removal of zero-valued tails, (d) identification of characteristic features using a spline model.
heterogeneous, little is known about the characteristics of specific
factors such as oxygen consumption and their relationship to
different cell types and metabolic states.1,30
We addressed this
challenge by approximating OC kinetics with a constrained
piecewise linear regression model. This spline model fits two
continuous linear regressions with slopes constrained to be
negative. Based upon preliminary data analysis which revealed
this pattern, it seemed appropriate to study OC curves by means
of fitting two linear models (Fig. 2). The two continuous regres-
sions share a mutual breakpoint optimally detected through a
likelihood method across the entire time span. This model allows
us to capture features in different segments of the data.
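The two-segment fit can be illustrated with a small grid search over candidate breakpoints. This sketch assumes independent Gaussian errors, so that maximizing the likelihood is equivalent to minimizing the sum of squared errors, and it enforces the negative-slope constraint by simply discarding candidate fits that violate it, which is a simplification of a properly constrained fit. The function names and returned keys are hypothetical.

```python
def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return None  # singular system
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            f = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= f * m[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        x[r] = (m[r][3] - sum(m[r][c] * x[c] for c in range(r + 1, 3))) / m[r][r]
    return x

def fit_two_segments(times, oxygen):
    """Continuous two-segment linear fit with both slopes required to be negative."""
    best = None
    for c in times[2:-2]:  # candidate breakpoints, keeping a few points per segment
        # Basis: intercept, time before the breakpoint, time elapsed after it.
        rows = [(1.0, min(t, c), max(t - c, 0.0)) for t in times]
        ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
        aty = [sum(r[i] * y for r, y in zip(rows, oxygen)) for i in range(3)]
        coef = solve3(ata, aty)
        if coef is None:
            continue
        b0, left, right = coef
        if left >= 0 or right >= 0:
            continue  # violates the negative-slope constraint
        sse = sum((y - (b0 + left * r[1] + right * r[2])) ** 2
                  for r, y in zip(rows, oxygen))
        if best is None or sse < best["sse"]:
            best = {"change_point_time": c,
                    "change_point_oxygen": b0 + left * c,
                    "intercept": b0, "left_slope": left, "right_slope": right,
                    "sse": sse, "mse": sse / len(times)}
    return best
```

Writing the basis as min(t, c) and max(t - c, 0) makes the two regression coefficients the left and right slopes directly, and the fitted line is continuous at the breakpoint by construction.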
The spline model was compared with the simple linear
regression model using a goodness-of-fit criterion. We
performed the comparison of these two models by using an
F test for each OC curve. These multiple comparisons raise a
commonly known problem in multiple hypothesis testing:
increased false-positives. To address this problem, we have
corrected all computed p-values using the Bonferroni correction
method.31
Through the evaluation of these tests, we found
that 99.3% and 97.7% of the OC kinetics data obtained from
CP-A and CP-C cells, respectively, could be fit better with the
constrained piecewise linear (spline) regression than with the
simple linear regression model at α = 0.001. Fig. 4 shows
the percentage of curves that were fit more accurately with the
spline model as a function of the level (α) of Type I error for the
F test. In general, more than 90% of OC curves obtained with
both cell types showed a statistically significant improvement of
the fit at different values of α when using the constrained
piecewise model as compared to simple linear regression.
A slightly higher percentage of curves measured from CP-A
compared to CP-C cells could be fit more reliably with the
constrained piecewise model.
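The model comparison can be sketched as a nested-model F statistic per curve followed by a Bonferroni adjustment across curves. The parameter counts used below (2 for the simple linear model, 4 for the two-segment model including the breakpoint) are an assumption made for illustration, and the helper names are hypothetical. Converting each F statistic into a p-value requires the F distribution, for example the survival function `scipy.stats.f.sf`, which is left out here to keep the sketch dependency-free.

```python
def f_statistic(sse_linear, sse_spline, n_points, p_linear=2, p_spline=4):
    """F statistic comparing a simple linear fit with a richer nested model."""
    numerator = (sse_linear - sse_spline) / (p_spline - p_linear)
    denominator = sse_spline / (n_points - p_spline)
    return numerator / denominator

def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def percent_better(p_values, alpha):
    """Percentage of curves whose adjusted p-value falls below alpha."""
    adjusted = bonferroni(p_values)
    return 100.0 * sum(p < alpha for p in adjusted) / len(adjusted)
```

Evaluating `percent_better` over a grid of α values gives the kind of curve plotted in Fig. 4.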
The model enabled the extraction of relevant features that
were used to characterize the OC kinetics. Besides the regular
features from fitting linear regressions (intercepts and slopes),
we were able to detect several other features (Table 1), such as
time and oxygen concentration at which the first slope of the
piecewise model is replaced by the second slope. All features
were determined for each kinetics curve of both cell types
(CP-A and CP-C), and the feature distributions within and
across cell types were further analyzed. Whether or not piece-
wise linear regression represents a biologically relevant model
Fig. 3 Features histogram and significance tests between CP-A and CP-C for the TimetoZero feature. (a) Distribution histograms for single CP-A
and CP-C cells; (b) 95% confidence interval of the means of the feature for both cell types.
Fig. 4 Comparative multiple hypothesis testing between the spline
model and linear regression fit. Percentage of OC curves per cell type
that revealed a better fit with the spline model than with the linear
regression shown as a function of different values of α (Type-I error).
The Bonferroni correction was applied to the individual test p-values
to alleviate the problem of false-positives when multiple comparisons
are performed. Inset: zoom in on a range of [0, 0.05].
of OC kinetics, it provided a good empirical fit to the
experimental data with a simple structure, permitting feature
extraction for comparative studies between different cell types
and conditions.
Validation
Prior to using this statistical methodology for biological data
interpretation, it was validated to assess its accuracy and
robustness. For validation we used a model system based on
enzymatic scavenging of oxygen by Oxyrase.32,33
Oxyrase is a
preparation of membrane fragments from Escherichia coli
and contains membrane monooxygenases and dioxygenases.
When it comes in contact with lactic acid, Oxyrase removes
oxygen rapidly from aqueous environments, including cell
medium. Because of its enzymatic basis, oxygen removal
kinetics by Oxyrase can be modeled using the Michaelis–
Menten equation that describes enzymatic reaction rates as
a function of substrate concentration. To reproduce data
collection conditions as close as possible to actual experiments,
we measured oxygen consumption kinetics of Oxyrase (no cells)
using experimental settings identical to those used for single
cells. This ensures that the signal-to-noise ratios are similar to
single-cell data. We used four different Oxyrase concentrations,
50 µL, 150 µL, 200 µL, and 250 µL (ranging from 0.06 to 0.2% by
volume) for more robust validation of the statistical framework.
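To make the expected behavior of the validation system concrete, the following sketch integrates a Michaelis–Menten rate law, d[O2]/dt = -Vmax[O2]/(Km + [O2]), with a simple Euler scheme. Treating dissolved oxygen as the substrate, and letting Vmax scale with the amount of enzyme, are assumptions made for illustration; the parameter values are hypothetical and are not taken from the paper.

```python
def simulate_depletion(o2_start, v_max, k_m, dt=1.0, n_steps=600):
    """Euler integration of oxygen depletion under Michaelis-Menten kinetics."""
    o2 = o2_start
    curve = [o2]
    for _ in range(n_steps):
        rate = v_max * o2 / (k_m + o2)   # Michaelis-Menten reaction rate
        o2 = max(0.0, o2 - rate * dt)    # concentration cannot go below zero
        curve.append(o2)
    return curve

# Hypothetical example: doubling v_max (more enzyme) depletes oxygen faster.
slow = simulate_depletion(o2_start=8.0, v_max=0.02, k_m=0.5)
fast = simulate_depletion(o2_start=8.0, v_max=0.04, k_m=0.5)
```

Curves generated this way are nearly linear while the oxygen concentration is well above Km and flatten as it approaches zero, which is why different enzyme concentrations are expected to yield different slopes and TimetoZero values.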
The features extracted from the OC kinetics data obtained
with Oxyrase utilizing the statistical framework showed signifi-
cant differences among signals measured with different
Oxyrase concentrations. The application of a Random Forest
classifier model34
to the extracted features revealed clear
discrimination among the four different concentrations with
out-of-bag error rates of 2% when all features were included
in the model, and 11.1%, when TimetoZero (see Feature
extraction) was removed from the data analysis. Ensemble
learners are predictive models that combine a collection of
simpler classifiers yielding better predictive performance as an
ensemble than any of the individual classifiers.35
The distinct
discrimination among the different Oxyrase concentrations
was visualized with the use of multidimensional scaling36
in
panels (a) and (b) of Fig. 5. Each panel portrays the visualization
Table 1 Extracted features and their descriptions
Feature | Description
Change-point.Time | Time value at which the change in slopes in the piecewise linear fit takes place
Change-point.Oxygen | Oxygen consumption value at which the change in slopes in the piecewise linear fit takes place
Intercept coefficient (B0) | Intercept of left linear regression
Left slope coefficient (B1) | Slope of the linear regression before the Change-Point (a)
Right slope coefficient (B1) | Slope of the linear regression after the Change-Point (a)
Kurtosis | Measure of “peakedness”. Higher kurtosis means more of the variance is the result of infrequent extreme deviations, as opposed to frequent modestly sized deviations.
Skewness | Measure of the asymmetry.
Minimum MSE | The mean squared error value for the best piecewise linear regression fit.
TimetoZero | Time at which the oxygen concentration in the chamber reaches a value of zero
Brief description of features extracted from curves after application of smoothing and filtering techniques.
(a) Slope magnitudes extracted from the spline model are divided by two for curves obtained with two cells per well.
Fig. 5 Multidimensional scaling plots for the Oxyrase enzymatic reaction used for validation. The plots visualize the scaling coordinates of the proximity matrix
obtained with a Random Forest trained to classify four distinct Oxyrase concentration values. The Oxyrase measurements were acquired
with the same semi-automated technology as the OC curves under study. They were used for validation because the behavior of Oxyrase is well understood and
differences are expected among features extracted from curves measured at different concentrations.
patterns for the two Random Forest models discussed earlier:
(a) a classifier with all features and (b) a classifier with all
features except TimetoZero. The resulting proximity matrix
from the Random Forest classifier is used as input to
multidimensional scaling to find a suitable 2D visual configuration
that showcases the sample patterns. The axes, labeled scaling
dimensions, are the 2D coordinates in which these
patterns are plotted. The ability to clearly differentiate varying
reaction rates (slopes) obtained with different Oxyrase concen-
trations shows that our approach enables adequately robust and
accurate characterization of dynamic processes. By capturing
these differences among the signals known to have different
kinetics using the statistical framework employed in this work,
we validated our approach for application to single-cell OC data.
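The visualization step, turning a proximity matrix into 2D scaling coordinates, can be sketched with classical (metric) multidimensional scaling. The sketch assumes that the proximity matrix has already been produced by a Random Forest, that dissimilarity is taken as 1 minus proximity, and that the leading eigenvalues of the double-centered matrix are positive and well separated, so that plain power iteration with deflation suffices. The function names are hypothetical.

```python
import math

def proximity_to_distance(prox):
    """Convert proximities in [0, 1] to dissimilarities (assumed: 1 - proximity)."""
    return [[1.0 - p for p in row] for row in prox]

def classical_mds(dist, n_dims=2, n_iter=200):
    """Classical MDS of a symmetric distance matrix via power iteration."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    total_mean = sum(row_mean) / n
    # Double centering: B = -1/2 * J * D^2 * J
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * n_dims for _ in range(n)]
    for dim in range(n_dims):
        v = [float(i + 1) for i in range(n)]  # deterministic start vector
        for _ in range(n_iter):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        norm_v = math.sqrt(sum(x * x for x in v))
        v = [x / norm_v for x in v]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigval = sum(v[i] * bv[i] for i in range(n))  # Rayleigh quotient
        scale = math.sqrt(max(eigval, 0.0))
        for i in range(n):
            coords[i][dim] = scale * v[i]
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigval * v[i] * v[j]
    return coords
```

Each row of the returned list holds the two scaling dimensions for one curve, which can then be plotted and colored by class.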
Biological inferences and interpretation
Comparison between different cell lines. Extracted quantita-
tive features such as slopes, intercepts, breakpoint or change-
point were compared across individual cells and cell types.
To detect differences between CP-A and CP-C features we
computed two sets of significance tests. A test of the statistical
Fig. 6 Comparison of features between CP-A and CP-C cells by means of a spline model. Three main features were extracted using the
constrained piecewise linear model: (a, b) oxygen concentration where the change of slopes in the fit occurs (change-point), (c, d) left (before slope
change) and (e, f) right (after slope change) slopes. Figures on the left show feature frequency values and those on the right show 95% confidence
intervals of the feature means.
significance of differences between the means or the medians of
the features of the two cell lines revealed significant differences
for the TimetoZero and Change-point.Oxygen features (Table 1).
The distribution of the time point when each OC kinetics
curve reaches an oxygen concentration value near zero
(TimetoZero feature) exhibits a broad range of values in both
cell types, as mentioned previously (Fig. 3). Statistical analysis
revealed significant differences between both the means
and the medians of the two cell types with p-values equal to
0.003 and 0.008, respectively (Fig. 3).
Another feature of interest is the value of oxygen at the
point where the two linear regressions of the spline model meet
(Change-point.Oxygen). At the breakpoint of the spline model
two features can be captured: oxygen concentration and time.
Oxygen concentration when the change in slopes takes place is
biologically relevant as it indicates a change in the oxygen
consumption kinetics most likely caused by alterations in the
energy production of the cell. The distributions of the Change-
point.Oxygen feature within each cell type showed character-
istics typical of a bimodal density function. Qualitatively the
distribution histograms of the two cell types show significant
similarity (Fig. 6) with a more clearly defined main peak at
6–6.5 ppm for CP-C cells. The distributions clearly indicate
marked heterogeneity in OC kinetics within the same cell type.
More subtle differences can be seen when comparing the two
cell types (Fig. 6b). One of the most notable differences is the
existence of a second, broader peak between 2–4 ppm in CP-C
cells, which is less pronounced in CP-A cells. However, the
statistical tests of the mean and median showed p-values of
0.053 and 0.061, respectively, indicating that neither
parameter differs significantly between the cell types at α = 0.05.
Two other features that we analyzed were the slopes (rates)
of the OC kinetics measured in the study. Understanding how
fast individual cells consume oxygen is of great interest as it is
directly related to the energy production levels in the cell. The
distributions of the slopes showed a long tail containing only a
small number of cells, while the majority of the cells’ OC rates
were concentrated in a relatively narrow range, [−0.02, 0] (Fig. 6).
For both the left and right slopes, no statistically
significant differences between their means were found when
comparing the two cell types (Fig. 6). However, the median
values of the right slope were found to be statistically different
between the two cell types with a p-value equal to 0.002
(Fig. 6).
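The specific tests for means and medians are described in the Methods section and are not reproduced here. As a stand-in, the sketch below uses a two-sided permutation test, which works with either statistic and makes no distributional assumption; it is not claimed to be the test the authors used, and the function name and defaults are hypothetical.

```python
import random
from statistics import fmean, median

def permutation_test(group_a, group_b, stat=fmean, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in a location statistic."""
    rng = random.Random(seed)
    observed = abs(stat(group_a) - stat(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabeling of the pooled values
        diff = abs(stat(pooled[:n_a]) - stat(pooled[n_a:]))
        if diff >= observed:
            count += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (count + 1) / (n_perm + 1)
```

Passing `stat=median` compares medians instead of means, which is the more robust choice for long-tailed features such as the slopes.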
We further explored these comparisons as a classification
problem with two classes (e.g. one cell type versus another)
finding subtle differences between the two cell types using an
ensemble-based classifier: Random Forest. The classifier
yielded an out-of-bag error rate of 30% when
classifying single CP-A and CP-C cells based on the
extracted features (Table 1). A multidimensional scaling plot from
the fitted Random Forest (more details in the Methods
section: Comparisons and classification techniques) is shown
in Fig. 7. This plot shows differences between the cell lines.
The role of intercellular interactions: comparison between OC
kinetics in isolated single and interacting cells. To explore
metabolic heterogeneity in the presence of intercellular inter-
actions, OC kinetics curves were obtained with two cells of
the same type placed into one microchamber. We compared
features extracted from the OC data of single cells (i.e., CP-A_1
and CP-C_1) with those obtained with two cells per single
chamber (i.e., CP-A_2 and CP-C_2). The same statistical
methodology was applied to CP-A_2 and CP-C_2 OC
curves as for the data acquired with single, non-interacting cells
with only minor modifications to certain features. To account
for the number of cells (one or two) per microchamber the
values of the slopes measured in microchambers with double
occupancy were divided by two assuming equal OC for the two
cells in a microwell, allowing comparisons with single-cell
slopes.
We first investigated the goodness-of-fit of the spline model
applied to the OC kinetics data of interacting cells. We
compared data fits obtained with the spline model and with
simple linear regression using multiple hypothesis testing with
Bonferroni correction as described in the Methods section.
Similar to the results obtained with individual, non-interacting
cells of both cell lines, the spline model fit was found to be
statistically better than the simple linear regression model for
all measurements with double-occupancy, interacting cells
(Fig. S1, ESI†).
A set of features from CP-A_1, CP-A_2, CP-C_1, and
CP-C_2 curves were extracted using the constrained piecewise
linear regression model. Distribution patterns similar to those
obtained with single, non-interacting cells were found for the
OC kinetics curves with interacting cells for features such as
TimetoZero, Change-point.Oxygen, Left.Slope, and Right.Slope
(description in Table 1). Statistically significant differences in
both the mean and median were found for at least one of the
four distinct groups of OC curves for the feature TimetoZero as
Fig. 7 Multidimensional scaling plot: a Random Forest classifier for
single CP-A vs. CP-C cells. This plot visualizes the scaling coordinates
of the proximity matrix obtained from a Random Forest to classify
CP-A versus CP-C cells at the single-cell level. This graphical repre-
sentation shows how the Random Forest classifier was able to find
high-dimensional interactions between data features that cluster OC
curves together.
Fig. 8 The TimetoZero feature extracted from oxygen consumption curves of single and paired CP-A and CP-C cells. TimetoZero is a
time feature extracted after removal of zero-valued tails using the CUSUM method. (a) Distribution histogram of the feature among single,
non-interacting (CP-A_1 and CP-C_1) cells and for interacting (two cells per well; CP-A_2 and CP-C_2) cells. (b) 95% confidence interval plot of
the means of TimetoZero for each experimental condition. Testing for statistically significant differences between the means or between the
location shifts (e.g., medians) showed p-values equal to 0 in both cases.
Fig. 9 Other features of interest extracted from oxygen consumption kinetics of single, non-interacting and double, interacting CP-A and CP-C
cells. The left panels show distribution histograms of the corresponding features; the right panels show 95% confidence intervals of the means of the
corresponding features. (a) and (b) Oxygen concentration values where the change of slopes in the spline model occurs. (c) and (d) Slope values of
the first linear regression of the spline model (Left.Slope). (e) and (f) Slope values of the second linear regression (Right.Slope). See Table 1 for
more detailed description of the slopes.
shown in Fig. 8. With p-values close to zero, this feature may
be an important discriminator among these non-interacting
and interacting cells (less marked differences can be observed
for CP-C_2 probably due to its small sample size). Other
extracted features such as the ones presented in Fig. 9 (oxygen
concentration at breakpoint, slopes before and after the break-
point) portrayed less distinct differences among these groups
but revealed empirical distribution patterns only available
through the study of individual OC curves. For example, oxygen
concentration at the breakpoint revealed significant differences
for at least one group among all groups, with p-values of 0.001
and 0.01 when testing for means and medians, respectively,
suggesting that CP-C_1 differs most from the other groups for this feature (Fig. 9). In
contrast, slope values (adjusted for interacting cells by dividing
by two) did not differ as much across the cell groups,
except for the median of Right.Slope, for which the test that at least
one group differs from the others gave a p-value of 0.003
(Fig. 9). These comparisons are possible through the application
of the methodology presented in this work.
The features extracted using the statistical framework
allowed for multiple comparisons of different phenotypes.
As seen before, the distributions of each of the features
permitted comparisons and showcased subtle differences. To
further analyze the OC curves through the extracted features,
an ensemble classifier34,35 was applied with the objective
of classifying the four groups of interest (CP-A_1, CP-A_2,
CP-C_1, and CP-C_2). A Random Forest classifier34 (see Methods)
was applied to the extracted features to unravel
nonlinear relationships among the relevant features. Initially,
we built Random Forest models for pairs of classes
(e.g., CP-A_1 vs. CP-A_2, CP-C_1 vs. CP-C_2), obtaining
error rates of ~20–30% for all pairs. These models included
all extracted features. When all four data classes were included
in a single Random Forest model, the classification error rates
were found to be around 40% when all features were used in
the model and 50% for a Random Forest model that included all
features except TimetoZero (Fig. 10). The TimetoZero feature
was removed from the classification model to capture discrimi-
nant relationships among other features where differences might
not be as clear or direct as in the case of TimetoZero.
Table 2 shows the confusion matrices providing details on
how many curves were misclassified using the models with or
without the TimetoZero feature. Also shown in Table 2 is that
the number of curves among the four different classes is
unbalanced. To address this problem, down-sampling was
performed on all Random Forest models applied here to lessen
the sample size effect in the learning model. Down-sampling is
a sampling technique that reduces the size of the majority class
or the class with the greatest number of samples. It is widely
used to balance the classes to minimize the overall error rate.37
In addition, Table 3 presents the feature importance scores for
both Random Forest models. It can be seen that TimetoZero
has the highest score for distinguishing between the different
experimental classes. However, when the TimetoZero feature
was removed, all features ranked similarly. Although their
predictability measures are not high, the results obtained with
the Random Forest models show semi-defined clusters within
the same experimental condition or the cell type. Fig. 10 shows
how the data points of the same type of experiment tend to
agglomerate in regions partially overlapping with other experi-
mental conditions. This Random Forest model extracts non-
linear patterns among the features to discriminate among
different classes. The two cell lines used in the study represent
different stages of pre-neoplastic progression in esophageal
cancer and, thus, are closely related in their phenotypic and
genotypic profiles. Therefore, it is likely that they will show
similarities in terms of oxygen consumption as well, thus
making the differentiation more difficult. More features either
from the OC curves or any other biologically relevant data
might be necessary to distinguish them clearly.
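The down-sampling step mentioned above can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: the function name `downsample`, the fixed seed, and the rule of reducing every class to the size of the smallest class are assumptions; only the class sizes are taken from Table 2.

```python
import random
from collections import defaultdict

def downsample(labels, seed=0):
    """Return sorted indices that keep the same number of cases per class.

    Every class is reduced to the size of the smallest class by random
    sampling without replacement (an assumed balancing rule).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    target = min(len(members) for members in by_class.values())
    kept = []
    for members in by_class.values():
        kept.extend(rng.sample(members, target))
    return sorted(kept)

# Class sizes taken from Table 2.
labels = (["CP-A_1"] * 154 + ["CP-A_2"] * 118
          + ["CP-C_1"] * 256 + ["CP-C_2"] * 44)
balanced = downsample(labels)
```

With these class sizes, each class contributes 44 curves to the balanced set, so the smallest group (CP-C_2) is used in full.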
Fig. 10 Multidimensional scaling plots: a Random Forest model for non-interacting and interacting CP-A and CP-C cells. This plot visualizes
the scaling coordinates of the proximity matrix obtained with a Random Forest performed to classify CP-A versus CP-C at the single- and
double-cell level. (a) Results using all features as described in Table 1. (b) Results using all features with the TimetoZero feature excluded from the
analysis.
Conclusion
The analysis and interpretation of intercellular heterogeneity
data are of fundamental importance in cell biology. There is
great interest in the scientific community in understanding
the role of heterogeneity in cellular homeostasis and
pathogenesis.28,38 In recent years, innovative technologies
have been developed to perform biological studies at the
single-cell level,24–28 including single-cell oxygen consumption
measurements. Despite the availability of these technologies,
their real potential can only be exploited utilizing effective
analytical methods capable of performing robust de-noising
and feature extraction steps on the novel type of information.
Through preliminary studies, we have identified three major
challenges when dealing with real-time phenotypic measure-
ments at the single-cell level: random noise, presence of
multiple functional states, and reliable differentiation of cell
behavior within and across different cell types (Fig. 1). In this
study, using single-cell OC data as an example, we made an
initial effort to establish a statistical framework for
multiparameter analysis of experimental data at the single-cell
level. To analyze the single-cell data, we applied
several sets of tools from signal processing
and statistics for data modeling and feature extraction. The
validation of the method showed that experimental data can be
modeled and their features extracted reliably. The quantitative
features extracted from the single-cell experimental data using
our analysis method revealed subtle differences between
non-interacting, single cells as well as between interacting cells of
both types. This demonstrates the feasibility of the developed
methodology to reliably process the measurement data and
characterize oxygen consumption kinetics. Because of its general
applicability, our statistical framework can be utilized to address
similar challenges that arise in other single-cell data acquisition
and experimental modalities.
Methods
Dataset
Description of oxygen consumption measurements. As a first
step in acquiring and analyzing multiparameter data, our
center has developed an experimental platform for metabolic
phenotype characterization, including oxygen consumption, at
the single-cell level.27,28
Single-cell oxygen consumption rates
are on a scale of fmol min⁻¹ cell⁻¹. Because oxygen sensing
is based on the dynamic quenching of sensor luminescence
by oxygen, the signal-to-noise ratio of the measurement varies
as a function of oxygen concentration in the microchamber.
This factor needs to be taken into account especially when
applying various signal processing algorithms for de-noising
purposes. In addition, other sources of noise include detector
readout noise, intensity variations of the excitation source, and
stochastic sensor noise. For the two cell types studied in this
work, the average time required for an isolated cell to consume
all oxygen within the finite volume (~140 pL) of cell media
ranges from 30 to 90 min. Noise levels resulting from the
various sources can be significant, requiring the data to be
analyzed with a rigorous statistical framework capable of
reducing noise and extracting quantitative features.
We analyzed several sets of oxygen consumption kinetics
data from two Barrett’s esophageal epithelial cell lines (meta-
plastic CP-A and dysplastic CP-C) obtained with the single-
cell technology. The number of OC curves studied for CP-A
and CP-C were 154 and 256, respectively. The cells were
loaded into microwells and incubated for 15–30 hours before
measurements were performed. The incubation time was
selected based on previous studies of cell viability and
morphology. After incubation, microwells with cells were
hermetically sealed with a lid containing an extracellular
optical oxygen sensor. The sensor emission intensity was
collected as a function of time until oxygen concentration in
the microchamber reached zero.27
Table 2 Confusion matrices obtained with Random Forest classification models

(A) All features included:

True class                  Predicted class                 Class
(Num. curves)     CP-A_1   CP-A_2   CP-C_1   CP-C_2         error (%)
CP-A_1 (154)        75       24       51        4            51.3
CP-A_2 (118)         4       81        1       32            31.4
CP-C_1 (256)        61       22      165        8            35.5
CP-C_2 (44)          5       20        2       17            61.4

(B) Without TimetoZero feature:

True class                  Predicted class                 Class
(Num. curves)     CP-A_1   CP-A_2   CP-C_1   CP-C_2         error (%)
CP-A_1 (154)        74       29       45        6            51.9
CP-A_2 (118)        20       61       13       24            48.3
CP-C_1 (256)        60       28      142       26            44.5
CP-C_2 (44)          7       17        8       12            72.7

Individual error rates per cell type and number of cells within
a microwell are shown for Random Forest models constructed using
all features and with the TimetoZero feature excluded from the
analysis. Each entry is the number of curves assigned to the
given predicted class by the nonlinear model. The class error
is the percentage of curves of that class that were misclassified.
Misclassified curves are the off-diagonal entries (gray boxes in the
typeset table).
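The class error column of Table 2 can be reproduced from the counts themselves. The short sketch below does this for panel (A); the function names are illustrative and the matrix is copied from the table, with rows as true classes and columns as predicted classes.

```python
CLASSES = ["CP-A_1", "CP-A_2", "CP-C_1", "CP-C_2"]

# Rows are true classes, columns are predicted classes (Table 2A).
CONFUSION_ALL = [
    [75, 24, 51, 4],
    [4, 81, 1, 32],
    [61, 22, 165, 8],
    [5, 20, 2, 17],
]

def class_errors(matrix):
    """Percentage of misclassified curves for each true class (row)."""
    return [100.0 * (sum(row) - row[i]) / sum(row)
            for i, row in enumerate(matrix)]

def overall_error(matrix):
    """Percentage of misclassified curves over all classes."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return 100.0 * (total - correct) / total
```

Applied to panel (A), the overall error evaluates to roughly 41%, in line with the error rate of around 40% quoted in the text for the model with all features.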
Table 3 Variable importance scores from Random Forest classification models

                             Mean decrease gini (%)
Features                 All features   Without TimetoZero
Change-point.Time             9.03            12.93
Change-point.Oxygen          10.99            13.23
Left.B0.Coef                 10.17            12.89
Left.B1.Coef                 10.37            12.53
Right.B1.Coef                13.81            12.49
TimetoZero                   17.12             n/a
Kurtosis                      9.14            11.88
Skewness                      8.88            11.35
MSE.min                      10.49            12.70

These variable importance scores are calculated as the average
over all trees of a scoring measure. This scoring measure is computed
as the number of correctly classified cases when the feature matrix
values are evaluated on the grown tree minus the number of correctly
classified cases when the variable to be scored is permuted prior to
tree model evaluation.
Noise reduction techniques
The noise levels in OC data were reduced using two main
signal processing components: (1) Low-pass filtering and (2)
Outlier smoothing.
Low-pass filtering. Two common low-pass filtering techni-
ques were evaluated. A low-pass filter reduces the amplitude of
high frequencies while leaving low frequencies unchanged.
These two methods along with their parameters are briefly
described here. In addition, we discuss a goodness-of-fit
assessment to decide which of the filtering techniques performs
better for the measured OC kinetics curves.
The Savitzky–Golay (SG) filter is also called least-squares
polynomial smoothing filter and is a finite impulse response
(FIR) filter.39
The technique fits a polynomial of fixed degree
n to a small window of the data of size (2m + 1) to estimate
a midpoint as shown in eqn (1) and (2). This process is
repeated by sliding the data window along the total span.39,40
This type of convolution filter minimizes the least-squares error
of fitting a polynomial to window frames of the noisy data and
is quite popular in areas such as spectroscopy and analytical
chemistry because of its simplicity and speed.41,42 If the
data are evenly spaced and continuous, then the smoothed
value ($y^*_t$) is the weighted summation of the points in the
window frame as described in eqn (3). The original Savitzky–Golay
implementation truncates m points at the start and end of the
data signal, which cannot be smoothed. Therefore, extensions to the
Savitzky–Golay filter addressing initial and endpoint estimation
found in the literature were also implemented in this study.40,43

$$y^*_t = \sum_{k=0}^{n} b_k t^k = b_0 + b_1 t + b_2 t^2 + \cdots + b_n t^n, \qquad t = [-m, -(m-1), \ldots, 0, \ldots, m] \tag{1}$$

$$\frac{\partial}{\partial b_k}\left[\sum_{t=-m}^{m} \left(y^*_t - y_t\right)^2\right] = 0 \tag{2}$$

$$y^*_j = \frac{\sum_{t=-m}^{m} c_t\, y_{j+t}}{N} \tag{3}$$

In our study, a second-order polynomial fit was tested, as it is
commonly used in practice.41 Another important parameter
needed in the SG filtering is the window length (m). Common
values for this parameter are m = 11 and m = 21. We evaluated
the root-mean-squared error (RMSE) for a range of values for
both conditions (i.e., CP-A and CP-C), as shown in Fig. S2
(ESI†). Data filtering in this study was performed using a window
size of 11, since it preserved local signal patterns better
than m = 21.
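To make eqn (3) concrete, the sketch below computes the integer weights c_t and the normalizer N for a second-order fit from their closed form and applies them to the interior points of a curve. It is a simplified illustration, not the filter used in the study: the function names are hypothetical, m is taken here as the half-width of the window (so an 11-point window corresponds to m = 5, which is an assumption about the notation), and the endpoint extensions cited above are not implemented, so the first and last m points are copied unchanged.

```python
def sg_quadratic_weights(m):
    """Integer convolution weights c_t and normalizer N of eqn (3).

    Closed form for a second-order polynomial fitted to 2m + 1 evenly
    spaced points; t runs from -m to m.
    """
    weights = [3 * m * m + 3 * m - 1 - 5 * t * t for t in range(-m, m + 1)]
    norm = (2 * m - 1) * (2 * m + 1) * (2 * m + 3) // 3
    return weights, norm

def sg_smooth(y, m):
    """Smooth interior points; the first and last m points are copied as-is."""
    weights, norm = sg_quadratic_weights(m)
    out = [float(v) for v in y]
    for j in range(m, len(y) - m):
        window = y[j - m:j + m + 1]
        out[j] = sum(c * v for c, v in zip(weights, window)) / norm
    return out
```

For a 5-point window (half-width 2) the weights are -3, 12, 17, 12, -3 with N = 35, which matches the tabulated Savitzky–Golay coefficients; for an 11-point window N = 429.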
The second filter we applied was the Exponentially
Weighted Moving Average (EWMA). It is an infinite impulse
response (IIR) filter and represents a special case of the
moving average filter where the weights of the data points to
be averaged decay exponentially with the distance from the
most recent data point (eqn (4)). The smoothed value of $y_t$ is
obtained through

$$y^*_t = \lambda y_t + (1 - \lambda)\, y^*_{t-1} \tag{4}$$

where λ represents the decay rate, with 0 ≤ λ ≤ 1.
A small value of λ gives more weight to older data and less to
new data, and vice versa.29,44 To detect small signal changes,
λ = 0.2 was used during the smoothing of the data curves in
this study. An RMSE evaluation across a range of λ values
was performed as shown in Fig. S2 (ESI†). In practice,
λ values between 0.2 and 0.3 are used.45
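The recursion in eqn (4) translates directly into a few lines of code. In the sketch below the function name `ewma` is illustrative, and seeding the recursion with the first observation is an assumption, since the starting value is not stated in the text.

```python
def ewma(y, lam=0.2):
    """Exponentially weighted moving average following eqn (4).

    The recursion is seeded with the first observation, which is an
    assumption; the text does not state the starting value.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    smoothed = []
    for value in y:
        previous = smoothed[-1] if smoothed else value
        smoothed.append(lam * value + (1.0 - lam) * previous)
    return smoothed
```

With lam = 0.2, a step from 0 to 10 is followed only gradually (the first smoothed value after the step is 2), which illustrates how a small decay rate weights older data more heavily.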
To assess the performance of EWMA and SG filtering
techniques, we evaluated the average root-mean-squared error
(RMSE) between smoothed and raw data as a goodness-of-fit
criterion. Goodness-of-fit statistics (e.g., the coefficient of
determination (R²), mean squared error (MSE), and RMSE) describe
how well smoothed values fit experimental data. Small values of
the average RMSE indicate a good fit. Both techniques showed similar
performances for the commonly chosen parameters as displayed
in Fig. S3 (ESI†).
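For reference, the three goodness-of-fit statistics named above can be written out as follows. The function names are illustrative, and R² is computed here in its usual form, one minus the ratio of the residual to the total sum of squares of the raw data, which is assumed to be the definition intended.

```python
import math

def mse(raw, smoothed):
    """Mean squared error between raw and smoothed values."""
    return sum((r - s) ** 2 for r, s in zip(raw, smoothed)) / len(raw)

def rmse(raw, smoothed):
    """Root-mean-squared error between raw and smoothed values."""
    return math.sqrt(mse(raw, smoothed))

def r_squared(raw, smoothed):
    """Coefficient of determination of the smoothed curve."""
    mean_raw = sum(raw) / len(raw)
    ss_res = sum((r - s) ** 2 for r, s in zip(raw, smoothed))
    ss_tot = sum((r - mean_raw) ** 2 for r in raw)
    return 1.0 - ss_res / ss_tot
```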
Outlier detection and smoothing. The OC kinetics data
contained random sharp peaks in certain areas due to signal
loss or stochastic sensor intensity fluctuations. We detected
these outliers using traditional control chart theory with the
following equation

$$L = \bar{x} \pm w\hat{\sigma} \tag{5}$$

where L represents the upper (+) and lower (−) control limits,
$\bar{x}$ is the mean value of the response, w is the parameter that
determines the width of the limits, and $\hat{\sigma}$ is an estimated
value of variation. Data points outside the limits calculated using
eqn (5) were considered outliers. $\hat{\sigma}$ was estimated through
an initial filtering step: each signal was filtered as described in
the earlier subsection on low-pass filtering, and the variation of the
raw data points around the resulting smoothed values was computed
using the RMSE metric. We assumed $\hat{\sigma}$ to be constant along
the curve, which is not necessarily true. However, because
$\hat{\sigma}$ is used for the detection of distant outliers only, this
assumption is adequate. To determine the w parameter
(control width constant) we studied several options. The value
w = 2 was chosen because, on average, about 10% of all data points
within an OC kinetics curve are then detected as outliers
(Fig. S4, ESI†). As expected, higher values of w (3, 4, and 5)
flagged smaller percentages, ranging from 0% to ~5%, and a smaller
value (w = 1) flagged a higher percentage (~25%) of the data points.
Hence, w = 2 seemed a reasonable choice to reduce random noise due
to outliers without excluding too much of the actual signal data
from the analysis. After detection, the outliers were smoothed out by
using a simple 2-neighbor averaging procedure where the
outlier values are replaced with the
average of their two adjacent neighbors' values. The low-pass
filter was re-applied to the entire dataset afterwards.
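A minimal sketch of the detection and 2-neighbor replacement steps is given below. Several details are assumptions made for illustration: the control limits are centered on the smoothed curve (the text refers only to the mean value of the response), the variation estimate is the RMSE between raw and smoothed values, flagged endpoints are replaced by their single neighbor, and the function name is hypothetical. The final re-application of the low-pass filter is left to the caller.

```python
import math

def smooth_outliers(raw, smoothed, w=2.0):
    """Flag points outside smoothed +/- w * sigma_hat and replace them.

    sigma_hat is the RMSE between raw and smoothed values. Centering the
    limits on the smoothed curve is an assumption of this sketch.
    """
    n = len(raw)
    sigma_hat = math.sqrt(
        sum((r - s) ** 2 for r, s in zip(raw, smoothed)) / n)
    flags = [abs(r - s) > w * sigma_hat for r, s in zip(raw, smoothed)]
    cleaned = list(raw)
    for i, is_outlier in enumerate(flags):
        if not is_outlier:
            continue
        # Average of the adjacent raw values (one neighbor at the ends).
        neighbors = [raw[j] for j in (i - 1, i + 1) if 0 <= j < n]
        if neighbors:
            cleaned[i] = sum(neighbors) / len(neighbors)
    return cleaned, flags
```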
Feature extraction models
Cumulative sum control (CUSUM) charts: change detection.
Cumulative sum (CUSUM) control charts detect
small changes in the mean value more efficiently
than Shewhart control charts.29 To apply the CUSUM
procedure, the OC curves were order-reversed to identify the
deviation from zero (tail). The OC response signals portray the
behavior of oxygen consumption over time. When it reaches
its minimum value (zero) the signal shows a constant behavior
or a tail of zeros from that time point on. Hence, the time
point at which the signal reaches zero can be obtained by
capturing a deviation within the constant region of zero values
which occur at the end of the signal. Reversing the order of the
signal facilitates the application of CUSUM charts to detect
deviations from zero.
Two input parameters are needed to calculate the CUSUM
statistic ($C_k$): the subgroup size (k) and the in-control mean
(in this study μ0 = 0). The parameter $C_k$ is defined in eqn (6) by
k, μ0, and the computed mean of the sub-sample of size k ($\bar{x}_k$).
$C_k$ is calculated along the entire sample range.

$$C_k = \sum_{j=1}^{k} \left(\bar{x}_k - \mu_0\right) \tag{6}$$

Other parameters needed to determine when the process is
out of control (in this study μ0 ≠ 0) are the decision interval and
the amount of shift to detect (slack). Recommended values for
these parameters are a decision interval of size 5 and a slack
value of 3 (ref. 46–48).
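The order-reversal idea can be illustrated with a one-sided tabular CUSUM, which is a standard variant and is used here in place of eqn (6) for brevity; it is not necessarily the exact chart used in the study. The function name is hypothetical, and the slack and decision interval are expressed in the units of the signal, which is an assumption because the text does not state the units of the recommended values.

```python
def zero_tail_start(signal, slack, decision_interval, mu0=0.0):
    """Estimate where the tail of zeros begins using a one-sided CUSUM.

    The signal is scanned from its end; the statistic accumulates
    deviations above mu0 + slack and signals once it exceeds the
    decision interval. Returns the index of the first sample after the
    flagged one (len(signal) if the very last sample is flagged), or
    None when no deviation from mu0 is detected.
    """
    cusum = 0.0
    n = len(signal)
    for r, value in enumerate(reversed(signal)):
        cusum = max(0.0, cusum + value - (mu0 + slack))
        if cusum > decision_interval:
            return n - r
    return None
```

Because a CUSUM accumulates evidence before signaling, the returned index can lag the true departure from zero by a few samples when the slack or decision interval is large relative to the signal.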
Piecewise linear regression model. The methodology imple-
mented in this paper for feature extraction consists of fitting a
piecewise linear regression model to each OC kinetics curve. In
general, the piecewise linear regression is used to describe a
nonlinear behavior by fitting the data to a number of linear
segments. In the methodology implemented here two linear
regression models were constrained to connect at the same
breakpoint. We considered a special case of two linear regressions
intersecting at a single point at time $t_c$ (“change-point”)
as shown in eqn (7), with the indicator variable $I_{t \ge t_c}$ = 1 when
t ≥ $t_c$.49 Both linear regressions were described in one
function y with the use of the indicator variable $I_{t \ge t_c}$ to define
both regression functions, each with constrained slopes, β1 and
β1 + β2, as shown in eqn (7). The slope parameters were
constrained to non-positive values due to decreasing oxygen
concentration in the microchambers.

$$y = \beta_0 + \beta_1 t + \beta_2 (t - t_c)\, I_{t \ge t_c} \tag{7}$$

$$\beta_1 \le 0 \quad \text{and} \quad \beta_1 + \beta_2 \le 0 \quad \forall\ \text{curves}$$

To find the change-point, a likelihood method was used to
minimize the sum squared error (SSE) of the fit of the kinetics
data to two linear regressions. During the fit, an exhaustive
search was performed along the time axis to determine the
change-point and the coefficient estimates that minimize SSE.
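The exhaustive search can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' code: candidate change-points are restricted to the interior measured time points, ordinary least squares is solved through the normal equations, and the slope constraints of eqn (7) are enforced simply by rejecting candidates that violate them. All function names are hypothetical.

```python
def solve(a, b):
    """Solve a small linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return None  # singular design matrix
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_at(t, y, tc):
    """Least-squares fit of eqn (7) for a fixed change-point tc."""
    rows = [[1.0, ti, max(0.0, ti - tc)] for ti in t]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
           for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    beta = solve(xtx, xty)
    if beta is None:
        return None
    sse = sum((yi - sum(b * x for b, x in zip(beta, r))) ** 2
              for r, yi in zip(rows, y))
    return beta, sse

def find_change_point(t, y):
    """Exhaustive search over interior time points for the minimum SSE."""
    best = None
    for tc in t[1:-1]:
        result = fit_at(t, y, tc)
        if result is None:
            continue
        beta, sse = result
        if beta[1] > 0 or beta[1] + beta[2] > 0:
            continue  # slope constraints of eqn (7), enforced by rejection
        if best is None or sse < best[2]:
            best = (tc, beta, sse)
    return best
```

The function returns the change-point, the coefficients (β0, β1, β2), and the SSE of the best admissible fit, or None if no candidate satisfies the constraints.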
Once the change-point was found, the features (Table 1) were
extracted from the piecewise linear model for different experi-
mental conditions (i.e., CP-A, CP-C). The fit to the
constrained piecewise linear regression with one-breakpoint
was statistically compared to the fit to a simple linear regression
model using an F test. To perform the F test, an F statistic is
computed as shown in eqn (8) where SSEModel1 and SSEModel2
refer to the sum squared error of the simple linear regression
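and the constrained piecewise linear regression models, respectively.
Other inputs in eqn (8) are p and n; p is the number of
parameters estimated for each model (i.e., Model1 or Model2)
and n is the total number of data points in the signal.

$$F = \frac{\left(\mathrm{SSE}_{\mathrm{Model1}} - \mathrm{SSE}_{\mathrm{Model2}}\right) / \left(p_{\mathrm{Model2}} - p_{\mathrm{Model1}}\right)}{\mathrm{SSE}_{\mathrm{Model2}} / \left(n - p_{\mathrm{Model2}}\right)} \tag{8}$$

If $F > F_{\alpha,\; p_{\mathrm{Model2}} - p_{\mathrm{Model1}},\; n - p_{\mathrm{Model2}}}$, Model2 performs better.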
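Eqn (8) is simple enough to state as a helper function. The name and argument order below are illustrative; the parameter counts of the two models are passed in explicitly because they are not listed in the text, and the critical value of the F distribution is assumed to come from a table or a statistics library.

```python
def f_statistic(sse_model1, sse_model2, p_model1, p_model2, n):
    """F statistic of eqn (8) for nested models (Model1 inside Model2)."""
    numerator = (sse_model1 - sse_model2) / (p_model2 - p_model1)
    denominator = sse_model2 / (n - p_model2)
    return numerator / denominator
```

As a worked example with made-up numbers, SSE values of 100 and 50, parameter counts of 2 and 4, and n = 54 give F = (50/2)/(50/50) = 25.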
The model comparison by an F test was performed for
every single curve, resulting in a multiple hypothesis testing
problem. A well-known problem in multiple hypothesis
testing is the increase in false positives. Several approaches, such
as the Bonferroni correction, exist to alleviate this
problem. This widely used technique is applied when multiple
statistical tests are computed simultaneously in order to reduce
false positives by reducing the value of α, the significance
level of the test. Equivalently, the p-values from the individual
tests can be adjusted as shown in eqn (9), where n is the number of
comparisons.31,50,51

$$\mathrm{pvalue.adjusted}[c] = \min\left(\mathrm{pvalue}[c] \times n,\ 1\right), \qquad c \in [1, n] \tag{9}$$

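Eqn (9) amounts to one line of code per p-value. The function name below is illustrative, and the example values are arbitrary.

```python
def bonferroni_adjust(p_values):
    """Adjusted p-values following eqn (9): min(p * n, 1)."""
    n = len(p_values)
    return [min(p * n, 1.0) for p in p_values]

# Three hypothetical tests: the largest p-value is capped at 1.
adjusted = bonferroni_adjust([0.125, 0.25, 0.5])
```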
Comparisons and classification techniques
Statistical significance tests. The extracted features were
studied and compared between the two cell lines using tradi-
tional statistical tools such as histograms, confidence intervals
and statistical tests of the mean and median. The statistical
significance of the difference between the means was deter-
mined using the analysis of variance (ANOVA) test which
generalizes the t-test for more than two groups but relies on
several assumptions that may or may not be met for this
particular data structure. ANOVA was performed with caution
to get a general sense of the groups’ mean from the ANOVA
hypothesis shown in eqn (10). In addition to ANOVA, we
performed significance tests for the differences between the
median values using nonparametric tests which waive the strict
assumptions inherent to ANOVA. The median or rank test was
performed using the Mann–Whitney–Wilcoxon test52,53
for a
two-level group test and the Kruskal–Wallis test54
for more
than two groups. Both tests are nonparametric approaches for
evaluating differences in the location shift of the distribution of x
for each group. Eqn (11) represents the analytical expression of
the Kruskal–Wallis test, where g is the number of groups, $n_i$ is
the number of observations in group i, $r_{ij}$ is the rank of
observation j from group i, $\bar{r}_i$ is the mean rank of group i,
$\bar{r}$ is the mean of all ranks, and N is
the total number of observations for all groups. The p-value
corresponding to a particular K is approximated through the
χ² distribution.54

$$H_0:\ \mu_1 = \mu_2 = \cdots = \mu_n \tag{10}$$

$$K = (N - 1)\,\frac{\sum_{i=1}^{g} n_i \left(\bar{r}_i - \bar{r}\right)^2}{\sum_{i=1}^{g} \sum_{j=1}^{n_i} \left(r_{ij} - \bar{r}\right)^2} \tag{11}$$

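The statistic in eqn (11) can be computed directly from the pooled ranks. The sketch below is illustrative (the function names are not from the original work) and assigns tied observations their mean rank, which is the usual convention; the p-value from the χ² approximation is not computed here.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def kruskal_wallis(groups):
    """Test statistic K of eqn (11) for a list of groups of observations."""
    pooled = [x for group in groups for x in group]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    mean_rank = sum(ranks) / n_total
    between, offset = 0.0, 0
    for group in groups:
        group_ranks = ranks[offset:offset + len(group)]
        group_mean = sum(group_ranks) / len(group)
        between += len(group) * (group_mean - mean_rank) ** 2
        offset += len(group)
    total = sum((r - mean_rank) ** 2 for r in ranks)
    return (n_total - 1) * between / total
```

For two fully separated groups of three observations each, such as [1, 2, 3] and [4, 5, 6], the statistic evaluates to 27/7, approximately 3.86.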
Ensemble classifier: Random Forest. To further explore
potential relationships among several groups of OC curves,
we applied an ensemble classifier based on decision trees. The
two cell lines (CP-A and CP-C) at the single-cell or two-cell
levels (i.e., CP-A_1, CP-A_2, CP-C_1, and CP-C_2) were
defined as the four classes for the classifier model with features
from the OC curves used as predictors. The decision trees
can be applied in almost all scenarios. Therefore, they provide
a good starting point for modeling heterogeneous and
large data sets. The decision trees apply to either a numerical
or categorical response and are nonlinear, simple, and fast.
The decision trees are scale-invariant and robust to missing
values. However, a single tree is produced by a greedy algo-
rithm that generates an unstable model.34
Consequently,
ensemble methods have been used to counteract the instability
of a single tree.
Supervised ensemble methods build a set of simple models
called base learners and use a weighted outcome for each base
learner in a voting scheme to predict future data. In other
words, ensemble methods merge outputs from multiple base
learners to create a voting committee to improve performance.
Many empirical studies have shown that ensemble methods
often outperform any single base learner.35
The Random Forest classifier is an improved bagging
method which basically exploits the benefits of bootstrapping
sampling through modeling. It grows a forest of random
decision trees on bagged samples yielding accurate results,
comparable with the best known classifiers.34
An advantageous
property of Random Forest classifiers is that they limit overfitting
through embedded out-of-bag (OOB) error estimation.
The out-of-bag error estimation for the ith tree in the Random
Forest model is computed using a percentage of cases not used
in the learning for this ith tree. Other advantages of Random
Forest models are: simple to train and tune in many appli-
cations, computationally efficient, can handle a large number
of variables, provide variable importance scores, embedded
method to estimate missing data, generation of a proximity
matrix among cases, handle variable interactions, can be
adapted to balance error due to datasets with unbalanced
numbers of samples, and capable of extending to unlabeled data
for unsupervised clustering, data views and outlier detection.34
Algorithm: a simple pseudocode for Random Forest classifier
construction is shown below.34,35
1. Select a number of cases independently, with replacement,
   from the original dataset to build the training data.
2. Use the training data to grow a tree:
   • Select v variables at random from the total number of
     input variables (V), where v ≪ V.
   • Choose the best variable among the v predictors to maximize
     the information gain of the split.
   • Split the chosen node into two daughter nodes based on
     the best variable.
3. Repeat Steps 1 and 2 until all trees are built.
4. Output the ensemble of trees.
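The pseudocode above can be turned into a compact, dependency-free sketch. It is meant only to make the steps concrete and is not the implementation used in this work: the split criterion is the Gini impurity (an assumption, chosen because Table 3 reports Gini-based scores), trees are grown until nodes are pure or too small, class labels must not be tuples because tuples represent internal nodes, and all names and default values are illustrative.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow_tree(rows, labels, v, rng, min_size=2):
    """Grow one tree; v features are drawn at random at every node."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or len(rows) < min_size:
        return majority
    best = None
    for f in rng.sample(range(len(rows[0])), v):
        for threshold in sorted(set(r[f] for r in rows))[1:]:
            left = [i for i, r in enumerate(rows) if r[f] < threshold]
            right = [i for i, r in enumerate(rows) if r[f] >= threshold]
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right]))
            if best is None or score < best[0]:
                best = (score, left, right, f, threshold)
    if best is None:  # no usable split among the sampled features
        return majority
    _, left, right, f, threshold = best
    children = [grow_tree([rows[i] for i in side],
                          [labels[i] for i in side], v, rng, min_size)
                for side in (left, right)]
    return (f, threshold, children[0], children[1])

def predict_tree(tree, row):
    """Follow the splits until a leaf (a class label) is reached."""
    while isinstance(tree, tuple):
        f, threshold, left, right = tree
        tree = left if row[f] < threshold else right
    return tree

def random_forest(rows, labels, n_trees=25, v=1, seed=0):
    """Steps 1-4: bootstrap, grow a tree, repeat, return the ensemble."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # with replacement
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], v, rng))
    return forest

def predict(forest, row):
    """Majority vote over all trees."""
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]
```

A forest is built with `random_forest(rows, labels)`, where `rows` is a list of feature vectors and `labels` the matching class names, and new curves are classified with `predict(forest, row)`.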
Important features of Random Forest classifiers are OOB
sampling, variable importance, and proximity plots. OOB
error estimation plays a role similar to cross-validation and, since
the trees of a Random Forest are grown independently, it can be
computed along the way as the trees are built. Variable importance
is a key feature of Random Forests. The variables are ranked based on
their improvement in the empirical loss function among all trees,
meaning that variables that are chosen often in the trees
provide better predictive power or they minimize the loss
function. Proximities between cases are measured by putting
all the data, training and out-of-bag, through the grown trees.
If instances i and j are in the same terminal node of a tree, their
proximity increases by one, and so on through all the trees.34
The proximities are then normalized by the number of trees in
the model.
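The proximity computation described above can be written as follows, assuming the terminal node reached by every case in every tree is already known. The input layout `leaf_ids[t][i]` and the function name are hypothetical choices for this sketch.

```python
def proximity_matrix(leaf_ids):
    """Proximity between cases from terminal-node assignments.

    leaf_ids[t][i] identifies the terminal node reached by case i in
    tree t. Two cases gain one unit of proximity for every tree in
    which they share a terminal node; the total is divided by the
    number of trees.
    """
    n_trees = len(leaf_ids)
    n_cases = len(leaf_ids[0])
    prox = [[0.0] * n_cases for _ in range(n_cases)]
    for tree_leaves in leaf_ids:
        for i in range(n_cases):
            for j in range(n_cases):
                if tree_leaves[i] == tree_leaves[j]:
                    prox[i][j] += 1.0
    return [[value / n_trees for value in row] for row in prox]
```

A common choice, assumed here rather than stated in the text, is to pass one minus the proximity to the scaling step as a dissimilarity between cases.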
State-of-the-art visualization methods such as multidimensional
scaling36
are used to illustrate how well features discriminate
among different conditions. Multidimensional scaling represents
high-dimensional data in a lower-dimensional space (often two or
three dimensions) in order to better visualize any structure in the
data. The algorithm generates points in the lower-dimensional
space that approximately preserve the pair-wise distances between
the points in the high-dimensional space.55
Conflict of Interest: none declared.
Acknowledgements
The authors would like to thank the personnel and support of
the Center for Biosignatures Discovery Automation in the
Biodesign Institute at Arizona State University. Funding: this
research is supported by the National Institutes of Health
(NIH), National Human Genome Research Institute
(NHGRI), Center of Excellence in Genomic Science (CEGS),
grant number 5 P50 HG002360 to Deirdre R. Meldrum.
References
1 M. Lidstrom and D. R. Meldrum, Life-on-a-chip, Nat. Rev.
Microbiol., 2003, 1, 158–164.
2 D. J. Wang and S. Bodovitz, Single cell analysis: the new frontier in
‘omics’, Trends Biotechnol., 2010, 28(6), 281–290.
3 T. Kalisky and S. R. Quake, Single-cell genomics, Nat. Methods,
2011, 8(4), 311–314.
4 N. Navin, J. Kendall, J. Troge, P. Andrews, L. Rodgers,
J. McIndoo, K. Cook, A. Stepansky, D. Levy, D. Esposito,
L. Muthuswamy, A. Krasnitz, W. R. McCombie, J. Hicks and
M. Wigler, Tumour evolution inferred by single-cell sequencing,
Nature, 2011, 472(7341), U90–U119.
5 E. J. Kostelich and T. Schreiber, Noise reduction in chaotic time-
series data: A survey of common methods, Phys. Rev. E: Stat. Phys.,
Plasmas, Fluids, Relat. Interdiscip. Top., 1993, 48, 1752–1763.
6 S. J. Orfanidis, Introduction to Signal Processing, Prentice-Hall,
Englewood Cliffs, NJ, 1996.
7 J. Brocker, U. Parlitz and M. Ogorzalek, Nonlinear Noise
Reduction, Proc. IEEE, 2002, 90(5), 898–918.
8 M. Schena, D. Shalon, R. W. Davis and P. O. Brown, Quantitative
monitoring of gene expression patterns with a complementary
DNA microarray, Science, 1995, 270(5235), 467–470.
9 D. A. Lashkari, J. L. DeRisi, J. H. McCusker, A. F. Namath,
C. Gentile, S. Y. Hwang, P. O. Brown and R. W. Davis, Yeast
microarrays for genome wide parallel genetic and gene expression
analysis, Proc. Natl. Acad. Sci. U. S. A., 1997, 94(24), 13057–13062.
10 V. G. Cheung, M. Morley, F. Aguilar, A. Massimi,
R. Kucherlapati and G. Childs, Making and reading microarrays,
Nat. Genet., 1999, 21, 15–19.
11 S. K. Moore, Making chips to probe genes, IEEE Spectrum, 2001,
38(3), 54–60.
12 W. Torres-Garcia, W. W. Zhang, R. Johnson, G. Runger and
D. R. Meldrum, Integrative analysis of transcriptomic, proteomic
data of Desulfovibrio vulgaris: a nonlinear model to predict abundance
of undetected proteins, Bioinformatics, 2009, 25, 1905–1914.
13 W. Torres-Garcia, S. D. Brown, R. H. Johnson, W. W. Zhang,
G. Runger and D. R. Meldrum, Integrative analysis of transcrip-
tomic and proteomic data of Shewanella oneidensis: missing value
imputation using temporal datasets, Mol. BioSyst., 2011, 7(4),
1093–1104.
14 M. L. T. Lee, F. C. Kuo, G. A. Whitmore and J. Sklar, Importance
of replication in microarray gene expression studies: Statistical
methods and evidence from repetitive cDNA hybridizations, Proc.
Natl. Acad. Sci., 2000, 97(18), 9834–9839.
15 D. E. Carter, J. F. Robinson, E. M. Allister, M. W. Huff and
R. A. Hegele, Quality assessment of microarray experiments, Clin.
Biochem., 2005, 38(7), 639–642.
16 J. Seo, M. Bakay, Y. W. Chen, S. Hilmer, B. Shneiderman and
E. P Hoffman, Interactively optimizing signal-to-noise ratios in
expression profiling: project-specific algorithm selection and detection
p-value weighting in Affymetrix microarrays, Bioinformatics, 2004,
20(16), 2534–2544.
17 T. Howlader and Y. P. Chaubey, Noise Reduction of cDNA
Microarray Images Using Complex Wavelets, IEEE Trans. Image
Process., 2010, 19(8), 1953–1967.
18 Y. Saeys, I. Inza and P. Larrañaga, A review of feature selection
techniques in bioinformatics, Bioinformatics, 2007, 23(19),
2507–2517.
19 J. P. Stevens, Intermediate Statistics. A Modern Approach,
Lawrence Erlbaum Associates Publishers, Mahwah, NJ, Second edn,
1999.
20 J. X. Pan and K. T. Fang, Growth Curve Models and Statistical
Diagnostics, Springer Series in Statistics, 2002.
21 S. E. Maxwell and H. D. Delaney, Designing Experiments and
Analyzing Data: A Model Comparison Perspective, Lawrence
Erlbaum, Second edn, 2003.
22 S. Weerahandi, Generalized inference in repeated measures: Exact
methods in MANOVA and mixed models, Wiley-Interscience, 2004.
23 Applied regression analysis and other multivariable methods, ed.
D. G. Kleinbaum, L. L. Kupper and K. E. Muller, PWS Publishing
Co., Boston, MA, USA, 4th edn, 2008.
24 Y. Anis, M. Holl and D. Meldrum, Automated selection and
placement of single cells using vision-based feedback control, IEEE
Trans. Autom. Sci. Eng., 2010, 7(3), 598–606.
25 H. Zhu, M. Holl, T. Ray, S. Bhushan and D. R. Meldrum,
Characterization of deep wet etching of fused silica glass for single
cell and optical sensor deposition, J. Micromech. Microeng., 2009,
19, 6.
26 Y. Tian, B. R. Shumway, C. Youngbull, Y. Li, A. K. Y. Jen,
R. H. Johnson and D. R. Meldrum, Dually fluorescent sensing
of pH and dissolved oxygen using a membrane made from polymerizable
sensing monomers, Sens. Actuators, B, 2010, 47(2),
714–722.
27 S. Ashili, L. Kelbauskas, J. Houkal, D. Smith, Y. Tian,
C. Youngbull, H. Zhu, Y. Anis, M. Hupp, K. Lee, A. Kumar,
J. Vela, A. Shabilla, R. Johnson, M. Holl and D. Meldrum,
Automated platform for multiparameter stimulus response studies
of metabolic activity at the single-cell level, Proceedings Vol. 7929,
Microfluidics, BIOMEMS, and Medical Microsystems IX, 2011.
28 L. Kelbauskas, S. Ashili, J. Houkal, D. Smith, A. Mohammadreza,
K. Lee, A. Kumar, Y. Anis, T. Paulson, C. Youngbull, Y. Tian,
R. Johnson, M. Holl and D. Meldrum, A novel method for multiparameter
physiological phenotype characterization at the single-cell
level, Proceedings Vol. 7902, Imaging, Manipulation and Analysis of
Biomolecules, Cells, and Tissues IX, 2011.
29 D. Montgomery, Introduction to Statistical Quality Control,
Wiley Higher Education, 2005.
30 T. Molter, S. C. McQuaide, M. Zhang, M. R. Holl, L. W. Burgess,
M. E. Lidstrom and D. R. Meldrum, Algorithm advancements for
the measurement of single cell oxygen consumption rates, IEEE
International Conference CASE 2007, Automation Science and
Engineering, 2007, 386–391.
31 J. P. Shaffer, Multiple Hypothesis Testing, Annu. Rev. Psychol.,
1995, 46, 561–584.
32 J. K. Joseph, D. Bunnachak, T. J. Burke and R. W. Schrier,
A novel method of inducing and assuring total anoxia during in vitro
studies of O2 deprivation injury, J. Am. Soc. Nephrol., 1990, 1, 837–840.
33 K. C. Ho, J. K. Leach, K. Eley, R. B. Mikkelsen and P. S. Lin,
A simple method of producing low oxygen conditions with Oxyrase
for cultured cells exposed to radiation and Tirapazamine, Am. J. Clin.
Oncol., 2003, 26(4), e86–e91.
34 L. Breiman, Random forests, Mach. Learn., 2001, 45, 5–32.
35 T. Hastie, R. Tibshirani and J. H. Friedman, The Elements of
Statistical Learning—Data Mining, Inference, Prediction, Springer
Verlag, 2nd edn, 2009.
36 T. F. Cox and M. A. Cox, Multidimensional scaling, Chapman and
Hall, London, 1994.
37 L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone, Classification
and Regression Trees, Wadsworth International, Belmont, CA, 1984.
38 S. J. Altschuler and L. F. Wu, Cellular Heterogeneity: Do Differences
Make a Difference?, Cell, 2010, 141(4), 559–563.
39 A. Savitzky and M. J. E. Golay, Smoothing and differentiation of
data by simplified least squares procedures, Anal. Chem., 1964,
36(8), 1627–1639.
40 R. A. Leach, C. A. Carter and J. M. Harris, Least-squares
polynomial filters for initial point and slope estimation, Anal.
Chem., 1984, 56(13), 2304–2307.
41 P. Persson and G. Strang, Mathematical systems theory in biology,
communications, computation, and finance, Springer, 2002.
42 Z. B. Alfassi, Z. Boger and Y. Ronen, Statistical Treatment of
Analytical Data, CRC Press, Blackwell Science, Boca Raton, FL,
2005.
43 P. A. Gorry, General least-squares smoothing and differentiation
by the convolution (Savitzky–Golay) method, Anal. Chem., 1990,
62(6), 570–573.
44 B. Walczak, Wavelets in chemistry, Elsevier Science, 2000, vol. 22.
45 J. Hunter, The exponentially weighted moving average, J. Qual.
Technol., 1986, 18(4), 203–210.
46 J. Pignatiello and G. C. Runger, Comparison of multivariate
CUSUM charts, J. Qual. Technol., 1990, 22, 173–186.
47 S. S. Prabhu, G. C. Runger and D. C. Montgomery, Selection of
the subgroup size and sampling interval for a CUSUM control
chart, IIE Trans., 1997, 29, 451–457.
48 V. Golosnoy, S. Ragulin and W. Schmid, Multivariate CUSUM chart:
properties and enhancements, AStA Advances in Statistical Analysis,
Springer, 2009, vol. 93(3), 263–279.
49 R. A. Berk, Statistical Learning from a Regression Perspective,
Springer Science + Business Media, LLC, New York, 2008.
50 Y. Benjamini and Y. Hochberg, Controlling the false discovery rate: a
practical and powerful approach to multiple testing, J. R. Stat. Soc.
Ser. B, 1995, 57, 289–300.
51 Y. Benjamini and D. Yekutieli, The control of the false discovery
rate in multiple testing under dependency, Ann. Stat., 2001, 29,
1165–1188.
52 F. Wilcoxon, Individual comparisons by ranking methods,
Biometrics Bull., 1945, 1(6), 80–83.
53 H. B. Mann and D. R. Whitney, On a Test of Whether one of Two
Random Variables is Stochastically Larger than the Other, Ann.
Math. Stat., 1947, 18(1), 50–60.
54 W. H. Kruskal and W. A. Wallis, Use of ranks in one-criterion
variance analysis, J. Am. Stat. Assoc., 1952, 47(260), 583–621.
55 C. H. Chen, W. Hardle, A. Unwin, M. Cox and T. F. Cox,
Handbook of data visualization. In Springer Handbooks Comp.
Statistics, chapter Multidimensional Scaling, Springer, Berlin
Heidelberg, 2008, pp. 315–347.