Process Design and Optimization of Bioprocesses with Quality by
Design Approach
Kyunghee Cho, Yunhao He
June 11, 2014
1 Introduction
In this project, we work with data from a pharmaceutical company that uses a new biochemical method
to produce drugs. The ultimate goal is to improve the process and develop models that aid decision making.
In general, we hope to extract more information from fewer measurements. Due to time constraints, we focus on
prediction in the classical statistical setting: predicting an output Y from X using
a model developed from data X1, X2, · · · , Xn and Y1, Y2, · · · , Yn. Many models exist for this kind of
prediction task; we use only a few of them in this project.
1.1 Data Description
In this part we briefly describe the data. The following shows what the raw data look like.
Batch.ID Run WD Ptime DO CO2 pH Stress GLC LAC GLN GLU NH4
1 1 539 0 0.0000000 50 46 7.12 2.2 7.00 0.67 1.88 1.32 3.55
2 1 539 1 0.8368056 50 21 7.10 2.2 5.95 1.52 1.29 1.67 3.97
3 1 539 2 1.7986111 50 23 6.97 2.2 5.13 2.06 1.31 1.80 4.11
4 1 539 3 2.8020833 50 33 6.97 2.2 3.85 2.16 1.28 2.64 4.57
5 1 539 4 3.8680556 50 39 6.98 2.2 3.02 1.93 1.73 2.57 4.45
6 1 539 5 4.8993056 50 40 7.02 2.2 2.27 1.71 2.07 3.53 4.20
OSM Xv Via Titer
1 322.0000 1.45 97.97297 0.03913624
2 315.4865 2.07 97.64151 0.05251863
3 308.0000 3.30 97.34513 0.05340000
4 305.0906 4.97 97.83465 0.10257388
5 302.0000 7.02 97.50000 0.14200000
6 298.0000 7.75 96.27329 0.19834530
In a Run of an experiment, all the variables are measured from day WD 1 to WD 10, sometimes 11 or
12. The controlled variables are DO, CO2, pH and Stress. Variables such as GLC, LAC, GLN,
GLU, NH4, OSM, Xv and Via are measured throughout to monitor the state of the production. Titer is
the output we are interested in.
1.2 Missing Data Imputation
The raw data contain quite a few missing values. To carry out a sensible statistical analysis, we imputed
missing data for both the input values and the output values. Missing Titer values were interpolated by our client.
Missing values in the inputs were imputed by MissForest, a recent method based on
random forests. See [1].
1.3 Data Standardization
It is often good practice to normalize the data to have mean 0 and variance 1. In our analysis,
the data are standardized for all models except the linear models, since we use more sophisticated customized
transformations for the linear regressions. As a result, the coefficients of the linear regressions and of MARS
always differ in scale.
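The standardization itself is just centring and scaling; a base-R sketch on a few made-up values:

```r
# Toy data standing in for the measured inputs (values invented).
dat <- data.frame(GLC = c(7.00, 5.95, 5.13, 3.85),
                  OSM = c(322, 315, 308, 305))

# scale() centres each column to mean 0 and scales it to variance 1.
dat.std <- as.data.frame(scale(dat))

colMeans(dat.std)       # ~ 0 for every column
apply(dat.std, 2, sd)   # 1 for every column
```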
2 Information on Models Used
2.1 Linear Model
Linear regression is the most mature and widely used method. Although it is quite simple and intuitive,
it sometimes has good prediction power. We can easily tell whether any model assumption is violated by
looking at the diagnostic plots of the linear fit.
To make a linear model work best, it is necessary to specify the predictors manually. So we have to
consider which transformations to apply, and which interactions or higher-order terms to include or leave out.
One easy way to do this is to fit a big model at the beginning and use stepwise selection by the AIC (or BIC)
criterion¹, which balances goodness of fit against model complexity.
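This fit-big-then-prune strategy can be sketched on a built-in dataset (mtcars stands in for the project data; step with the default k = 2 uses AIC, and k = log(n) gives the BIC penalty):

```r
# Full model with all available predictors.
full <- lm(mpg ~ ., data = mtcars)

# Backward/forward stepwise search; k = 2 is AIC (the default),
# k = log(n) gives the BIC penalty.
fit.aic <- step(full, trace = 0)
fit.bic <- step(full, trace = 0, k = log(nrow(mtcars)))

formula(fit.aic)   # the selected submodel
```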
To assess variable importance, a simple approach is to look at the p-values or t-values, provided the
model assumptions are met.
2.2 Decision Tree
A decision tree is a scale-independent statistical model. It is easy to implement and deals with interactions
naturally. Its biggest advantage is that it can be visualized easily. See [5].
2.3 Random Forest
All the information in this subsection is based on [4]. Random forest is an ensemble method that combines
many decision trees; in each tree, the observations and variables are subsampled so that only part of them
is used. Thanks to the implicit bootstrapping, the model suffers less from overfitting and has good prediction
power.
Random forest has built-in mechanisms to estimate importance of variables. The two measures are
described in the following way:
• The first measure is computed from permuting OOB data: for each tree, the prediction
  error on the out-of-bag portion of the data is recorded (error rate for classification,
  MSE for regression). Then the same is done after permuting each predictor variable.
  The differences between the two are then averaged over all trees, and normalized by the
  standard deviation of the differences. If the standard deviation of the differences is equal
  to 0 for a variable, the division is not done (but the average is almost always equal to 0
  in that case).
• The second measure is the total decrease in node impurities from splitting on the variable,
  averaged over all trees. For classification, the node impurity is measured by the Gini
  index. For regression, it is measured by the residual sum of squares.
The package randomForest has a function varImpPlot to plot the importance of variables easily.
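A self-contained sketch of both importance measures with the randomForest package on a built-in dataset (airquality stands in for the project data):

```r
library(randomForest)

set.seed(1)
aq <- na.omit(airquality)   # built-in data as a stand-in

# importance = TRUE requests the permutation-based measure as well.
rf <- randomForest(Ozone ~ ., data = aq, importance = TRUE)

importance(rf)   # columns: %IncMSE (permutation) and IncNodePurity
varImpPlot(rf)   # dot chart of both measures
```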
2.4 MARS
All the information in this subsection is based on [3].
MARS (multivariate adaptive regression splines) is an adaptive extension of linear regression. The
final model is a linear regression with terms of the form (xj − d)+ and (d − xj)+ and higher-order interactions
of such terms. The algorithm adds terms in a forward pass and then prunes the model back to the point where
the GCV is minimized.
MARS can also estimate the importance of variables. From the earth vignette [3]:
• The nsubsets criterion counts the number of model subsets that include the variable.
  Variables that are included in more subsets are considered more important.
  By "subsets" we mean the subsets of terms generated by the pruning pass. There is one
  subset for each model size (from 1 to the size of the selected model) and the subset is
  the best set of terms for that model size. (These subsets are specified in $prune.terms in
  earth's return value.) Only subsets that are smaller than or equal in size to the final
  model are used for estimating variable importance.
¹See http://stat.ethz.ch/R-manual/R-devel/library/stats/html/step.html. For more information about BIC, see
http://en.wikipedia.org/wiki/Bayesian_information_criterion
• The rss criterion first calculates the decrease in the RSS for each subset relative to the
  previous subset. (For multiple-response models, RSSs are calculated over all responses.)
  Then for each variable it sums these decreases over all subsets that include the variable.
  Finally, for ease of interpretation the summed decreases are scaled so the largest summed
  decrease is 100. Variables which cause larger net decreases in the RSS are considered
  more important.
• The gcv criterion is the same, but uses the GCV instead of the RSS. Adding a variable
  can increase the GCV, i.e., adding the variable has a deleterious effect on the model.
  When this happens, the variable could even have a negative total importance, and thus
  appear less important than unused variables.
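The fit and the importance criteria above can be sketched with the earth package on a built-in dataset (trees stands in for the project data):

```r
library(earth)   # CRAN package implementing MARS

# Built-in data as a stand-in for the project data.
fit <- earth(Volume ~ Girth + Height, data = trees)

summary(fit)   # linear model in hinge terms such as h(Girth - d)
evimp(fit)     # variable importance: nsubsets, gcv and rss columns
```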
2.5 Neural Network
A neural network is a flexible model which is in a sense an extension of linear regression. See
http://en.wikipedia.org/wiki/Artificial_neural_network.
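A minimal sketch with the nnet package (shipped with R), on standardized stand-in data; linout = TRUE is used here because this is a regression problem:

```r
library(nnet)

set.seed(1)
# Standardized built-in data as a stand-in for the project data.
aq <- as.data.frame(scale(na.omit(airquality)))

# One hidden layer with 4 units and weight decay for regularization.
nn <- nnet(Ozone ~ Wind + Temp + Solar.R, data = aq,
           size = 4, decay = 1e-3, linout = TRUE,
           maxit = 500, trace = FALSE)

mean((predict(nn) - aq$Ozone)^2)   # training MSE
```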
3 Overview of Three Main Tasks
This statistical analysis comprises three main tasks, in each of which all of the above-mentioned models
were used, in particular linear regression, random forest and MARS. Only for the third task was a different
dataset used, in which missing Titer values are interpolated by a logistic function fitted within each
run.
3.1 First Task: Blackbox Models
Maximizing the output of a useful product by controlling the experimental conditions is of primary interest.
The first blackbox models are fitted using only the four controlled variables DO, Stress, pHset and
Xv0 as inputs to predict the output variable, Titer at day 10. Secondly, all other input variables
at day 0 are used as inputs to predict Titer at day 10.
3.2 Second Task: Snapshot Models
In order to save the cost of measuring Titer, it is useful to predict its current value during
an experiment from the current values of the input variables, which are relatively cheap to measure.
In the snapshot approach each observation is considered independent, and the input variables at
day t are used to predict Titer at day t.
3.3 Third Task: History Models
The history approach can be regarded as an extension of both the blackbox models and the snapshot
models. For the history models, not only the current values of the input variables at day t are considered
as predictors but also how they have changed over time in the past, that is, the history of
the input variables. All future Titer values, as well as the one at day t, are to be predicted. In
other words, an (i, j) history model uses the input variables at days 0, 1, ..., i as predictors to predict
the Titer value at day j (j ≥ i).
4 Model Comparison
In order to compare the prediction performance between the three statistical models, the cross-validation
method was used.
1. First, the data is randomly split into a training set and a test set by Run ID; that is, 30 of the
122 runs are randomly sampled as the test set.
2. Then a model is fitted on the training set, i.e., the rest of the data.
3. The MSE (mean squared error) is calculated on the test set.
4. This is repeated 5 times, and from the 5 resulting MSEs the RMSECV is calculated, which is defined
as follows.
For blackbox or history models:

RMSECV = sqrt( Σ_{k=1..5} Σ_{i=1..30} (y_i^(k) − ŷ_i^(k))^2 / (5·30) )

where y_i^(k) and ŷ_i^(k) are the true and predicted Titer values of the ith sample run in the kth
cross-validation, respectively.
For snapshot models:

RMSECV = sqrt( Σ_{k=1..5} Σ_{i=1..30} Σ_{j=1..n_i} (y_ij^(k) − ŷ_ij^(k))^2 / (5·30·n_i) )

where y_ij^(k) and ŷ_ij^(k) are the true and predicted Titer values of the ith sample run in the kth
cross-validation at day j, respectively.
In case the target variable is transformed, the RMSECV is calculated based on back-transformed
values of both true and predicted values.
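The leave-30-runs-out procedure above can be sketched in base R on synthetic data (the data frame, the simple lm model and the noise level are placeholders for the project's actual data and models):

```r
set.seed(1)
# Synthetic stand-in: 122 runs with 3 observations each.
dat <- data.frame(Run = rep(1:122, each = 3), x = rnorm(366))
dat$Titer <- 0.5 * dat$x + rnorm(366, sd = 0.1)

K <- 5; n.test <- 30
sq.err <- numeric(0)
for (k in 1:K) {
  test.runs <- sample(unique(dat$Run), n.test)   # hold out whole runs
  train <- dat[!dat$Run %in% test.runs, ]
  test  <- dat[ dat$Run %in% test.runs, ]
  fit <- lm(Titer ~ x, data = train)             # any model fits in here
  sq.err <- c(sq.err, (predict(fit, newdata = test) - test$Titer)^2)
}
(RMSECV <- sqrt(mean(sq.err)))
```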
5 Blackbox Models
This is a prediction problem in a classical statistical setting. Before using any statistical models, it is
helpful to get an intuition of what the dataset looks like.
[Figure: Final Titer in different runs without transformation — final Titer (about 0.2 to 0.8) plotted against Run ID (about 600 to 1000).]
Since we are using only a small part of the raw data here, the number of observations is 122.
If only prediction power is of interest, you can jump to Section 5.6.
[Figure: MARS variable importance — nsubsets counts and normalized sqrt GCV/RSS; variables shown include DO, pHset, NH4, Via, GLC, LAC and OSM.]
5.3 Random Forest
Using Random Forest, we can easily see the importance of different variables.
5.3.1 Random Forest with 4 Controlled Variables
Call:
randomForest(formula = Titer ~ DO + pHset + Stress + Xv, data = dat.simp, importance = TRUE, mtry = 4)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 4
Mean of squared residuals: 0.008918479
% Var explained: 71.66
5.3.2 Random Forest with All Variables
We can see that the ranking of variable importance is similar to that of MARS.
[Figure: random forest variable importance (IncNodePurity); variables shown: Stress, pCO2, pH, GLU, GLC, GLN, Via, LAC, OSM, Xv, NH4, pHset, DO.]
> rf.simp
Call:
randomForest(formula = Titer ~ DO + pHset + Xv + Stress + pCO2 + Via + GLC + LAC + GLN + GLU + NH4 + OSM +
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 10
Mean of squared residuals: 0.007168825
% Var explained: 77.22
5.4 Decision Tree
For completeness, we also include a decision tree model. It cannot outperform random forest, which is an
ensemble extension of decision trees, but it is easy to interpret.
5.4.1 Decision Tree with 4 Controlled Variables
n= 119
node), split, n, deviance, yval
* denotes terminal node
1) root 119 3.74435600 0.5739803
2) DO< -1.683324 13 0.10235720 0.2245385 *
3) DO>=-1.683324 106 1.85988900 0.6168364
6) pHset< -0.7717448 10 0.22043140 0.3785372 *
7) pHset>=-0.7717448 96 1.01244000 0.6416593
14) pHset>=0.6474983 8 0.02327288 0.5111250 *
15) pHset< 0.6474983 88 0.84046160 0.6535260
[Figure: decision tree with splits on DO < −1.683, pHset < −0.7717, NH4 < 0.2659, OSM ≥ −0.1028, Via < 0.4786, Xv < −1.573 and Xv ≥ −1.653; leaf values 0.2245, 0.3785, 0.5505, 0.616, 0.6775, 0.641, 0.7552, 0.7873.]
5.5 Neural Network
A neural network is comparatively hard to interpret. We include it here to compare its prediction
accuracy with the other models using CV.
5.6 Cross Validation
Although many packages nowadays have built-in measures of test error, by implementing cross validation
ourselves we can compare the performance of the different models on the same ground. Here we use
leave-30-runs-out cross validation. Since this is a regression problem, the mean squared error is a good
indicator of performance.
The cross-validation errors (RMSECV) of the different methods are as follows.
RMSECV of Different Models Using All Input Variables
Linear Model Mars Random Forest Decision Tree Neural Network
[1,] 0.078494 0.1060308 0.09046751 0.1237991 0.1362411
RMSECV of Different Models Using 4 Input Variables
Linear Model Mars Random Forest Decision Tree Neural Network
[1,] 0.09764054 0.117845 0.09327159 0.0972197 0.1150896
A surprising fact is that the more complicated models perform worse here; the linear model performs quite
well.
6 Snapshot Models
6.1 Linear model
The following model was selected by the BIC model selection criterion. We can see that the number
of selected variables is now much higher: in the snapshot model, the effects are too complex to be
captured by only a few variables.
Call:
lm(formula = snlm$formula, data = ccFdata)
Residuals:
Min 1Q Median 3Q Max
-0.63033 -0.09694 0.00396 0.11494 0.48256
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.492e+00 1.705e+00 0.875 0.381944
I((DO - 50)^2) 9.907e-04 3.332e-04 2.973 0.003066 **
pCO2 1.522e-01 2.442e-02 6.233 8.54e-10 ***
I((pH - 7.1)^2) 3.698e+01 4.867e+00 7.598 1.14e-13 ***
log(Xv) 7.013e-01 1.506e-01 4.658 3.93e-06 ***
Via -3.025e-02 9.257e-03 -3.268 0.001144 **
GLC -2.973e-01 1.400e-01 -2.123 0.034175 *
LAC -2.858e-01 9.404e-02 -3.039 0.002479 **
GLN 5.613e-01 5.873e-02 9.557 < 2e-16 ***
GLU -1.950e+00 3.485e-01 -5.596 3.31e-08 ***
NH4 -1.106e-01 2.415e-02 -4.580 5.64e-06 ***
OSM -8.029e-03 3.901e-03 -2.058 0.039999 *
Stress -5.076e-02 1.549e-02 -3.276 0.001113 **
I((DO - 50)^2):pCO2 1.418e-05 4.040e-06 3.510 0.000482 ***
I((DO - 50)^2):Via -8.782e-06 2.193e-06 -4.005 6.98e-05 ***
I((DO - 50)^2):GLC 8.047e-05 1.697e-05 4.741 2.65e-06 ***
I((DO - 50)^2):GLU -1.739e-04 3.561e-05 -4.883 1.34e-06 ***
I((DO - 50)^2):NH4 -6.375e-05 2.582e-05 -2.469 0.013824 *
pCO2:Via -6.531e-04 1.354e-04 -4.824 1.78e-06 ***
pCO2:GLC 3.431e-03 6.722e-04 5.105 4.44e-07 ***
pCO2:GLN -6.369e-03 1.483e-03 -4.293 2.05e-05 ***
pCO2:OSM -2.974e-04 5.367e-05 -5.540 4.50e-08 ***
I((pH - 7.1)^2):LAC 2.711e+00 3.745e-01 7.240 1.36e-12 ***
I((pH - 7.1)^2):NH4 6.535e-01 2.225e-01 2.937 0.003435 **
I((pH - 7.1)^2):OSM -1.449e-01 1.812e-02 -7.997 6.46e-15 ***
log(Xv):GLC 1.177e-01 1.607e-02 7.323 7.75e-13 ***
log(Xv):GLU -2.613e-01 3.265e-02 -8.000 6.29e-15 ***
Via:GLC -4.150e-03 8.275e-04 -5.015 6.97e-07 ***
Via:GLU 1.534e-02 1.466e-03 10.463 < 2e-16 ***
Via:Stress 6.505e-04 1.614e-04 4.031 6.26e-05 ***
GLC:LAC -2.587e-02 7.330e-03 -3.529 0.000448 ***
GLC:NH4 1.571e-02 4.848e-03 3.241 0.001255 **
GLC:OSM 9.333e-04 3.859e-04 2.419 0.015863 *
LAC:NH4 -4.049e-02 8.656e-03 -4.678 3.58e-06 ***
LAC:OSM 1.038e-03 2.115e-04 4.911 1.16e-06 ***
GLN:Stress -4.219e-03 1.587e-03 -2.658 0.008059 **
GLU:OSM 4.736e-03 8.420e-04 5.625 2.83e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1743 on 608 degrees of freedom
(740 observations deleted due to missingness)
Multiple R-squared: 0.9526, Adjusted R-squared: 0.9498
F-statistic: 339.2 on 36 and 608 DF, p-value: < 2.2e-16
[1] 0.07678454
[Figure: MARS variable importance for the snapshot model — nsubsets and normalized sqrt GCV/RSS; variables shown: GLU, LAC, NH4, GLN, Xv, OSM, pH, Stress, Via, pCO2, DO, GLC.]
7 History Model (Naive Way)
Let us first use the input data only at time t and treat t simply as another predictor.
Again, the target here is the Titer on day 10.
We explore different models.
7.1 Linear Regression
Since we now have around 1300 observations, we can fit a model with more parameters. After this we
can select a model using step.
Call:
lm(formula = Titer ~ I(DO^2) + DO + pHset + I(pHset^2) + Via +
poly(GLC, 3) + poly(LAC, 3) + GLN + GLU + NH4 + OSM + poly(Ptime,
3) + DO:pHset + I(pHset^2):poly(Ptime, 3) + Via:poly(Ptime,
3) + poly(LAC, 3):poly(Ptime, 3) + GLN:poly(Ptime, 3) + GLU:poly(Ptime,
3) + NH4:poly(Ptime, 3) + OSM:poly(Ptime, 3), data = dat.agg)
Residuals:
Min 1Q Median 3Q Max
-0.296451 -0.047078 0.002983 0.044872 0.190200
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.049e+00 5.200e-01 -2.017 0.044626 *
I(DO^2) -3.644e-02 6.218e-03 -5.860 1.26e-08 ***
DO 1.747e-02 7.374e-03 2.369 0.018520 *
pHset 9.137e-03 7.090e-03 1.289 0.198499
Take a look at the outliers.
> dat.agg[c(104, 119, 323), ]
Run WD Ptime DO pCO2 pH Xv Via
104 630 0 0.0000000 0.159614 0.1237602 1.287431 -1.580576 0.3932955
119 635 0 0.0000000 1.388239 0.9739924 1.426846 -1.894227 0.4848425
323 1066 1 0.7791667 0.159614 -1.0665650 1.217723 -1.525812 0.1411053
GLC LAC GLN GLU NH4 OSM
104 1.0780028 -0.7579071 0.8434386 -1.671864 0.006689145 -0.045629237
119 1.3174515 -0.9828156 0.9841203 -1.820554 -0.292624973 -0.007493617
323 0.6821795 -0.4139295 0.2494495 -1.790816 0.262201197 0.068777622
Stress Xv0 pHset RunNo
104 -0.2096149 0.3312174 0.03925127 BIOS-3.5L - 630
119 -0.2096149 -2.2207835 -2.39373696 BIOS-3.5L - 635
323 -0.2096149 -1.7751961 0.03925127 TACI-3L-1066
Discr. Titer pdiff pCO2t
104 Standard conditions 0.3510000 0.6432812 0
119 DO 70% / pH 6.70 / seeding 1 mio 0.0858000 0.6542345 0
323 STD condition (Control - no loop) 0.8432867 0.5062897 0
7.3 Random Forest
The parameter mtry = 10 was determined by tuneRF to optimize performance.
> library(randomForest)
> rf.agg <- randomForest(Titer ~ DO + pHset + Xv + Stress + pCO2 + Via
+ + GLC + LAC + GLN + GLU + NH4 + OSM + pH
+ + pdiff + Ptime, data = dat.agg, importance = TRUE, mtry = 10)
> rf.agg
Call:
randomForest(formula = Titer ~ DO + pHset + Xv + Stress + pCO2 + Via + GLC + LAC + GLN + GLU + NH4 + OSM +
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 10
Mean of squared residuals: 0.005391989
% Var explained: 83.73
[Figure: OOB error of rf.agg versus number of trees (decreasing from about 0.016 to 0.006 over 500 trees), and variable importance by %IncMSE and IncNodePurity; DO, pHset, NH4 and LAC appear at the top under both measures.]
Under the two importance plots, the 7 most important variables are the same.
7.4 Decision Tree
plotcp is used to determine cp for the decision tree.
> library(rpart)
> tr.agg <- rpart(Titer ~ DO + pHset + Stress + Xv + pCO2
+ + Via + GLC + LAC + GLN + GLU + NH4 + OSM
+ + pH + pdiff + Ptime, data = dat.agg)
> (tr.agg <- prune(tr.agg, cp = 0.019))
n= 335
node), split, n, deviance, yval
* denotes terminal node
1) root 335 11.10534000 0.5719146
2) DO< -1.683324 39 0.30707160 0.2245385 *
3) DO>=-1.683324 296 5.47207100 0.6176837
6) pHset< -1.988239 18 0.17149480 0.2748954 *
7) pHset>=-1.988239 278 3.04856000 0.6398787
14) NH4< 0.3899572 202 1.52384000 0.6085717
28) LAC>=0.002812559 9 0.02199422 0.4344444 *
29) LAC< 0.002812559 193 1.21623800 0.6166916 *
15) NH4>=0.3899572 76 0.80050930 0.7230893 *
> plot(tr.agg)
> text(tr.agg)
[Figure: pruned decision tree with splits DO < −1.683, pHset < −1.988, NH4 < 0.39 and LAC ≥ 0.002813; leaf values 0.2245, 0.2749, 0.4344, 0.6167, 0.7231.]
7.5 Neural Network
> nn.agg <- nnet(Titer ~ DO + pHset + Xv + Stress + pCO2 + Via
+ + GLC + LAC + GLN + GLU + NH4 + OSM + pH + pdiff,
+ data = dat.agg, size = 4, skip = TRUE, decay = 4e-4,
+ linout = FALSE, maxit = 2000, trace = FALSE)
> nn.agg
a 14-4-1 network with 79 weights
inputs: DO pHset Xv Stress pCO2 Via GLC LAC GLN GLU NH4 OSM pH pdiff
output(s): Titer
options were - skip-layer connections decay=4e-04
7.6 Cross Validation
Because of the nature of our data, it is better to leave whole runs of the experiment out when performing
cross validation. We again use the MSE to assess performance.
Overall, the performance is better than in the first task.
RMSECVs of Different Models
Linear Model Mars Random Forest Decision Tree Neural Network
[1,] 0.09515759 0.09222569 0.07226036 0.1000236 0.09316403
8 History Models
Here we fit history models and perform CV on them to measure their prediction power.
The input of a history model always includes the 2 strictly controlled variables that are kept constant,
DO and Stress. For a history model using the history up to day i, we additionally use the 10 variables
pCO2, Xv, Via, GLC, pH, LAC, GLN, GLU, NH4 and OSM from day 0 to day i. This gives 2 + 10·(i + 1)
input variables.
We do this by unfolding the raw data. In case of missing values, we impute them by running missForest
on the unfolded historical input values from day 0 to day i. This is done in the function ArrHist.
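A base-R sketch of the unfolding step on invented data — one column per variable and day, so GLC at days 0..i becomes the predictors GLC.0, ..., GLC.i (ArrHist itself is the project's own function and is not reproduced here):

```r
# Long format: one row per run and day (values invented).
long <- data.frame(Run = rep(1:2, each = 3), WD = rep(0:2, 2),
                   GLC = c(7.0, 6.0, 5.1, 6.8, 5.9, 5.0))

# Unfold to wide format: one row per run, one column per day.
wide <- reshape(long, idvar = "Run", timevar = "WD",
                direction = "wide", sep = ".")

wide   # columns: Run, GLC.0, GLC.1, GLC.2
```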
We produce 2 tables for each statistical model: one is the table of variance explained in CV and the
other is the RMSECV. We calculate the variance explained as

1 − Σ_i (ŷ_i − y_i)^2 / Σ_i (y_i − ȳ)^2

in each run of the cross validation and average them to get the final result. Similarly, the RMSECV is
also the mean over all cross-validation runs.
Now we see how the models perform under this setting.
8.1 Linear model
The (i, j)th history linear model is defined as follows:

log(Titer_j) = β_0 + Σ_{t=0..i} [ β_{1,t}(DO_t − 50)^2 + β_{2,t}pCO2_t + β_{3,t}(pH_t − 7.1)^2
    + β_{4,t}log(Xv_t) + β_{5,t}Via_t
    + β_{6,t}GLC_t + β_{7,t}LAC_t + β_{8,t}GLN_t
    + β_{9,t}GLU_t + β_{10,t}NH4_t + β_{11,t}OSM_t + β_{12,t}Stress_t ]
And the following table shows the RMSECV values. The RMSECV value of the (i, j)th model is the
number at the ith row and the jth column.
RMSECV
[Figure: RMSECV heatmap for history models based on MARS — history used (X0 to X0_10) against predicted Titer day (Y0 to Y10).]
9 Evaluation of Results
All models perform reasonably well.
Random forest has the most stable performance, as expected, and it is also easy to implement.
There are some tuning parameters to consider, e.g., the number of trees and the size of the resamples,
but the default settings work well most of the time, and even if the number of trees is too large it is
not a big problem.
Linear regression also does fine. Its performance relies heavily on variable selection, interaction
specification and transformation, which is why implementing a linear regression is not as straightforward
as the other models. A compromise is to fit a large model at the beginning and use stepwise selection
to reduce the number of parameters. More recent methods use penalized linear regression, such as the
lasso or ridge regression. See [2].
MARS is easy to implement because it adjusts to nonlinearity automatically. Here its prediction
power is not as good as that of the other two, which might result from overfitting. It is possible to tune
its parameters. See [3].
References
[1] Daniel J. Stekhoven. Using the missForest package, 2011. Available from http://stat.ethz.ch/
education/semesters/ss2012/ams/paper/missForest_1.2.pdf.
[2] Peter Bühlmann and Martin Mächler. Computational Statistics, 2014. Available from http://stat.ethz.ch/
education/semesters/ss2014/CompStat/sk.pdf.
[3] Stephen Milborrow. Notes on the earth package, 2014. Available from http://cran.r-project.
org/web/packages/earth/vignettes/earth-notes.pdf.
[4] Leo Breiman (Fortran original), Andy Liaw and Matthew Wiener (R port), and Adele Cutler.
Package ‘randomForest’, 2012. Available from http://stat-www.berkeley.edu/users/breiman/
RandomForests.
[5] Terry M. Therneau and Elizabeth J. Atkinson, Mayo Foundation. An Introduction to Recursive
Partitioning Using the RPART Routines, 2013. Available from http://cran.r-project.org/web/packages/
rpart/vignettes/longintro.pdf.