Process Design and Optimization of Bioprocesses with Quality by
Design Approach
Kyunghee Cho, Yunhao He
June 11, 2014
1 Introduction
In this project, we work with data from a pharmaceutical company that uses a new biochemical method
to produce drugs. The ultimate goal is to improve the process and to develop models that aid decision
making. In general, we hope to extract more information from fewer measurements. Due to time constraints,
we focus on prediction in the classical statistical setting, that is, predicting an output Y from X using
a model developed from data X1, X2, · · · , Xn and Y1, Y2, · · · , Yn. Many models exist for this kind of
prediction task; we use only a few of them in this project.
1.1 Data Description
In this part we briefly describe the data. The following shows what the raw data look like.
Batch.ID Run WD Ptime DO CO2 pH Stress GLC LAC GLN GLU NH4
1 1 539 0 0.0000000 50 46 7.12 2.2 7.00 0.67 1.88 1.32 3.55
2 1 539 1 0.8368056 50 21 7.10 2.2 5.95 1.52 1.29 1.67 3.97
3 1 539 2 1.7986111 50 23 6.97 2.2 5.13 2.06 1.31 1.80 4.11
4 1 539 3 2.8020833 50 33 6.97 2.2 3.85 2.16 1.28 2.64 4.57
5 1 539 4 3.8680556 50 39 6.98 2.2 3.02 1.93 1.73 2.57 4.45
6 1 539 5 4.8993056 50 40 7.02 2.2 2.27 1.71 2.07 3.53 4.20
OSM Xv Via Titer
1 322.0000 1.45 97.97297 0.03913624
2 315.4865 2.07 97.64151 0.05251863
3 308.0000 3.30 97.34513 0.05340000
4 305.0906 4.97 97.83465 0.10257388
5 302.0000 7.02 97.50000 0.14200000
6 298.0000 7.75 96.27329 0.19834530
In a Run of the experiment, all variables are measured from day WD 1 to WD 10, sometimes 11 or
12. The controlled variables are DO, CO2, pH and Stress. Variables such as GLC, LAC, GLN,
GLU, NH4, OSM, Xv and Via are measured throughout to monitor the state of the production. Titer is
the output we are interested in.
1.2 Missing Data Imputation
The raw data contain quite a few missing values. To carry out a sensible statistical analysis, we imputed
missing data for both the input values and the output values. Titer values were interpolated by our client.
Missing values in the inputs were imputed by MissForest, a recent method based on
random forests. See [1].
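As a minimal sketch of this step (assuming the measured input columns sit in a hypothetical data frame `raw.inputs`), MissForest can be run as follows:

```r
library(missForest)  # iteratively fits a random forest per variable
                     # and predicts that variable's missing entries

set.seed(1)  # missForest is stochastic; fix the seed for reproducibility
# raw.inputs is a hypothetical data frame of the measured input columns
imp <- missForest(raw.inputs[, c("GLC", "LAC", "GLN", "GLU",
                                 "NH4", "OSM", "Xv", "Via")])
imputed <- imp$ximp  # the completed data
imp$OOBerror         # out-of-bag NRMSE, an internal imputation-error estimate
```

The OOB error gives a rough internal check of imputation quality without needing held-out complete cases.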
1.3 Data Standardization
It is often good practice to normalize the data to have mean 0 and variance 1. In our analysis,
the inputs to all models except the linear models are standardized, since we use more sophisticated
customized transformations for the linear regressions. As a result, the coefficients of the linear
regressions and of MARS always differ in scale.
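The standardization itself is the usual column-wise transformation; a minimal sketch, assuming `inputs` is a hypothetical numeric data frame:

```r
# Center each column to mean 0 and scale each column to variance 1
std.inputs <- as.data.frame(scale(inputs))
# Sanity checks: columns should now have mean approximately 0 and sd 1
round(colMeans(std.inputs), 10)
apply(std.inputs, 2, sd)
```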
2 Information on Models Used
2.1 Linear Model
Linear regression is the most mature and widely used method. Although it is quite simple and intuitive,
it sometimes has good prediction power. We can easily tell whether any model assumption is violated by
looking at the diagnostic plots of the linear fit.
To make a linear model work best, it is necessary to specify the predictors manually. So we have to
consider which transformations to apply, which interactions or polynomial orders to include, and which
to leave out. One easy way to do this is to fit a big model in the beginning and use stepwise selection by
the AIC (or BIC) criterion [1], which is a measure of the balance between goodness of fit and model
complexity.
To assess variable importance, a simple way is to look at p-values or t-values, given that the model
assumptions are met.
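The stepwise-selection step can be sketched with R's step(); setting k = log(n) replaces the AIC penalty with the BIC penalty. The starting formula below is illustrative, not the exact one used in the report.

```r
# Start from a deliberately large model with all pairwise interactions
full <- lm(Titer ~ (DO + pHset + Stress + Xv0)^2, data = if.data)
# Stepwise search; k = log(n) turns the AIC criterion into BIC
n <- nrow(if.data)
bic.fit <- step(full, k = log(n), trace = 0)
summary(bic.fit)  # t-values / p-values give a first look at importance
```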
2.2 Decision Tree
A decision tree is a scale-independent statistical model. It is easy to implement and deals with interactions
naturally. Its biggest advantage is that it can be visualized easily. See [5].
2.3 Random Forest
All the information in this subsection is based on [4]. Random forest is an ensemble method that combines
many decision trees, in each of which the observations and variables are subsampled so that only part of
them are used. Due to this implicit bootstrapping, the model suffers less from overfitting and has good
prediction power.
Random forest has built-in mechanisms to estimate importance of variables. The two measures are
described in the following way:
• The first measure is computed from permuting OOB data: For each tree, the prediction error on
the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression).
Then the same is done after permuting each predictor variable. The difference between the two are
then averaged over all trees, and normalized by the standard deviation of the differences. If the
standard deviation of the differences is equal to 0 for a variable, the division is not done (but the
average is almost always equal to 0 in that case).

• The second measure is the total decrease in node impurities from splitting on the variable,
averaged over all trees. For classification, the node impurity is measured by the Gini index. For
regression, it is measured by residual sum of squares.
The package randomForest has a function varImpPlot to plot the importance of variables easily.
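A minimal sketch of fitting a forest and extracting both importance measures (the all-variables formula here is illustrative):

```r
library(randomForest)
set.seed(1)  # the forest is stochastic; fix the seed for reproducibility
rf <- randomForest(Titer ~ ., data = dat.simp, importance = TRUE)
importance(rf)  # columns: %IncMSE (permutation measure) and IncNodePurity
varImpPlot(rf)  # dotcharts of both importance measures
```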
2.4 MARS
All the information in this subsection is based on [3].
MARS, multivariate adaptive regression splines, is an adaptive extension of linear regression. The
final model is a linear regression with terms of the form (xj − d)+, (d − xj)+ and higher-order
interactions of such terms. The algorithm adds terms in a forward pass and then prunes the model back
to the size where GCV is minimized.
MARS can also estimate the importance of variables. From the earth vignette [3]:
• The nsubsets criterion counts the number of model subsets that include the variable.
Variables that are included in more subsets are considered more important.
By "subsets" we mean the subsets of terms generated by the pruning pass. There is one
subset for each model size (from 1 to the size of the selected model) and the subset is
the best set of terms for that model size. (These subsets are speced in $prune.terms in
earth's return value.) Only subsets that are smaller than or equal in size to the final
model are used for estimating variable importance.
[1] See http://stat.ethz.ch/R-manual/R-devel/library/stats/html/step.html. For more information about BIC, see
http://en.wikipedia.org/wiki/Bayesian_information_criterion.
• The rss criterion first calculates the decrease in the RSS for each subset relative to the
previous subset. (For multiple response models, RSS's are calculated over all responses.)
Then for each variable it sums these decreases over all subsets that include the variable.
Finally, for ease of interpretation the summed decreases are scaled so the largest summed
decrease is 100. Variables which cause larger net decreases in the RSS are considered
more important.

• The gcv criterion is the same, but uses the GCV instead of the RSS. Adding a variable
can increase the GCV, i.e., adding the variable has a deleterious effect on the model.
When this happens, the variable could even have a negative total importance, and thus
appear less important than unused variables.
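In the earth package these criteria are reported by evimp(); a minimal sketch (the all-variables formula is illustrative):

```r
library(earth)
fit <- earth(Titer ~ ., data = dat.simp)
evimp(fit)  # one row per used variable: nsubsets, gcv and rss columns
```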
2.5 Neural Network
A neural network is a flexible model which is, in a sense, an extension of linear regression. See
en.wikipedia.org/wiki/Artificial_neural_network.
3 Overview of Three Main Tasks
This statistical analysis comprises three main tasks, in each of which the three above-mentioned models
were used, namely linear regression, random forest and MARS. Only for the third task was a different
dataset used, in which missing Titer values are interpolated by a logistic function fitted within each
run.
3.1 First Task: Blackbox Models
Maximizing the output of a useful product by controlling experimental conditions is of primary interest.
The first blackbox models are fitted using only the four controlled variables, namely DO, Stress, pHset
and Xv0, as inputs to predict the output variable, Titer at day 10. Secondly, all other input variables
at day 0 are used as inputs to predict Titer at day 10.
3.2 Second Task: Snapshot Models
In order to save the measurement cost of Titer, it is useful to predict the current value of Titer during
an experiment from the current values of the input variables, which are relatively cheap to measure.
The snapshot approach treats each observation as independent: input variables at day t are used to
predict Titer at day t.
3.3 Third Task: History Models
The history approach can be regarded as an extension of both the blackbox models and the snapshot
models. For the history models, not only the current values of the input variables at day t are considered
as predictors, but also how they have changed over time in the past, that is, the history of the input
variables. All Titer values in the future, as well as the one at day t, are to be predicted. In other
words, an (i, j) history model uses input variables at days 0, 1, ..., i as predictors to predict the Titer
value at day j (j ≥ i).
4 Model Comparison
In order to compare the prediction performance between the three statistical models, the cross-validation
method was used.
1. First, the data is randomly split into a training set and a test set by Run ID. That is, 30 runs
are randomly sampled as the test set from among the 122 runs.
2. Then a model is fitted using the training set, the rest of the data.
3. MSE (mean squared error) is calculated in the test set.
4. This is repeated 5 times and, using the 5 resulting MSEs, the RMSECV is calculated, which is defined
as follows.
For blackbox or history models:
\[
\mathrm{RMSECV} = \sqrt{\frac{\sum_{k=1}^{5}\sum_{i=1}^{30}\left(y_i^{(k)} - \hat{y}_i^{(k)}\right)^2}{5 \cdot 30}}
\]
where $y_i^{(k)}$ and $\hat{y}_i^{(k)}$ are the true Titer value and the predicted one of the ith sample run in the kth
cross-validation, respectively.
For snapshot models:
\[
\mathrm{RMSECV} = \sqrt{\frac{\sum_{k=1}^{5}\sum_{i=1}^{30}\sum_{j=1}^{n_i}\left(y_{ij}^{(k)} - \hat{y}_{ij}^{(k)}\right)^2}{5 \cdot 30 \cdot n_i}}
\]
where $y_{ij}^{(k)}$ and $\hat{y}_{ij}^{(k)}$ are the true Titer value and the predicted one of the ith sample run at day j in the kth
cross-validation, respectively.
In case the target variable is transformed, the RMSECV is calculated based on back-transformed
values of both true and predicted values.
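The procedure above can be sketched in R for the blackbox setting; `fit.fun` is a placeholder for any of the model-fitting functions, and the column names `Run` and `Titer` are taken from the data description.

```r
# Leave-30-runs-out cross-validation, repeated K times.
# fit.fun fits a model on a training data frame; the model must
# support predict(fit, newdata = ...).
rmsecv <- function(data, fit.fun, K = 5, n.test = 30) {
  runs <- unique(data$Run)
  sq.err <- c()
  for (k in seq_len(K)) {
    test.runs <- sample(runs, n.test)            # split by Run ID
    train <- data[!(data$Run %in% test.runs), ]
    test  <- data[  data$Run %in% test.runs, ]
    fit  <- fit.fun(train)
    pred <- predict(fit, newdata = test)
    sq.err <- c(sq.err, (test$Titer - pred)^2)   # collect squared errors
  }
  sqrt(mean(sq.err))                             # root mean squared CV error
}
# Example: rmsecv(if.data, function(d) lm(Titer ~ DO + pHset, data = d))
```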
5 Blackbox Models
This is a prediction problem in the classical statistical setting. Before we use any statistical models, it is
helpful to get an intuition of what the dataset looks like.
[Figure: scatterplot "Final Titer in Different Runs without Transformation"; dat.simp$Titer (0.2 to 0.8) plotted against dat.simp$Run (600 to 1000).]
Since we are using only a small part of the raw data here, the number of observations is 122.
If only prediction power is of interest, you can jump to Section 5.6.
5.1 Linear Model
5.1.1 Linear Model with 4 Controlled Variables
The following model was selected by the BIC model selection criterion. We can see that both DO and
pHset exhibit a quadratic effect.
Call:
lm(formula = bb4lm$formula, data = if.data)
Residuals:
Min 1Q Median 3Q Max
-0.59583 -0.10260 -0.00799 0.10758 0.84827
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.695e-01 7.932e-02 -3.397 0.000952 ***
I((DO - 50)^2) -7.425e-04 8.316e-05 -8.929 1.18e-14 ***
log(Xv0) -3.114e-01 1.810e-01 -1.721 0.088092 .
I((pHset - 7.1)^2) -7.436e+00 9.961e-01 -7.465 2.17e-11 ***
I((DO - 50)^2):log(Xv0) 5.670e-04 1.714e-04 3.309 0.001270 **
I((DO - 50)^2):I((pHset - 7.1)^2) -1.852e-03 7.084e-04 -2.614 0.010202 *
log(Xv0):I((pHset - 7.1)^2) 8.604e+00 1.811e+00 4.750 6.23e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2011 on 109 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.8462, Adjusted R-squared: 0.8377
F-statistic: 99.95 on 6 and 109 DF, p-value: < 2.2e-16
[1] 0.09764054
[Figure: standard lm diagnostic plots (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage); observations 43, 48 and 73 are flagged, with 37 additionally flagged in the leverage plot.]
The transformations for DO, Xv0 and pHset were chosen according to the termplot to give a reasonable fit.
> par(mfrow=c(2,2))
> termplot(bb4lmfit,partial.resid=TRUE)
[Figure: termplots with partial residuals for I((DO-50)^2) against DO, log(Xv0) against Xv0, and I((pHset-7.1)^2) against pHset.]
5.1.2 Linear Model with All Variables
The following model was selected by the BIC model selection criterion. Apart from the 4 controlled
variables, only NH4 and GLN were selected.
Call:
lm(formula = bblmbic$formula, data = if.data)
Residuals:
Min 1Q Median 3Q Max
-0.86766 -0.09680 0.01636 0.09528 0.59879
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.559e+00 2.359e-01 -6.606 1.49e-09 ***
I((DO - 50)^2) -5.932e-04 4.454e-05 -13.319 < 2e-16 ***
pCO2 5.530e-03 2.287e-03 2.418 0.01726 *
GLN 1.264e-01 5.030e-02 2.513 0.01343 *
NH4 8.313e-02 2.345e-02 3.545 0.00058 ***
log(Xv0) 5.981e-01 1.377e-01 4.343 3.16e-05 ***
I((pHset - 7.1)^2) -5.336e+00 5.264e-01 -10.137 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2224 on 109 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.8119, Adjusted R-squared: 0.8015
F-statistic: 78.41 on 6 and 109 DF, p-value: < 2.2e-16
[Figure: lm diagnostic plots for the all-variables model (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage); observations 47, 48 and 55 are flagged.]
5.2 MARS
To allow for flexibility and nonlinear effects, we use MARS, multivariate adaptive regression splines.
5.2.1 MARS with 4 Controlled Variables
> source(file.path(cwd, "mars_simp.R"))
> summary(mars.con)
Call: earth(formula=Titer~DO+pHset+Xv+Stress, data=dat.simp,
trace=0)
coefficients
(Intercept) 0.69954039
h(-1.06901-DO) -0.22496222
h(DO-0.159614) -0.05639779
h(pHset-0.444749) -0.12694532
h(0.444749-pHset) -0.11450147
Selected 5 of 13 terms, and 2 of 4 predictors
Importance: DO, pHset, Xv-unused, Stress-unused
Number of terms at each degree of interaction: 1 4 (additive model)
GCV 0.009779767 RSS 0.9944133 GRSq 0.6943894 RSq 0.7344234
> plot(mars.con, main = "Mars Using 4 Inputs.jpg")
[Figure: earth model-selection and residual diagnostic plots for the 4-input MARS model ("Mars Using 4 Inputs"); observations 42, 47 and 106 are flagged.]
5.2.2 MARS with All Variables
> summary(mars.simp)
Call: earth(formula=Titer~DO+pHset+Xv+Stress+pCO2+Via+
GLC+LAC+GLN+GLU+NH4+OSM+Stress+pH, data=dat.simp,
trace=0)
coefficients
(Intercept) 0.56859816
h(-1.06901-DO) -0.23177287
h(DO-0.159614) -0.06620347
h(pHset-0.444749) -0.09650702
h(0.444749-pHset) -0.09711387
h(Via-0.355972) 0.33786922
h(1.41519-GLC) 0.14153165
h(-0.949741-LAC) 0.36807897
h(NH4- -0.263424) 0.10067628
h(-0.0456292-OSM) 0.35674962
Selected 10 of 24 terms, and 7 of 13 predictors
Importance: DO, pHset, NH4, Via, GLC, LAC, OSM, Xv-unused, Stress-unused, ...
Number of terms at each degree of interaction: 1 9 (additive model)
GCV 0.007086922 RSS 0.5955396 GRSq 0.7785388 RSq 0.84095
> plot(mars.simp, main = "Mars Using All Inputs.jpg")
[Figure: earth model-selection and residual diagnostic plots for the all-inputs MARS model ("Mars Using All Inputs"); observations 42, 47 and 93 are flagged.]
Now we add the label pCO2t indicating whether or not CO2 is targeted to a certain level. As it turned out, this
label is not important.
> summary(mars.simp2)
Call: earth(formula=Titer~DO+pHset+Xv+Stress+pCO2+Via+
GLC+LAC+GLN+GLU+NH4+OSM+Stress+pH+pCO2t,
data=dat.simp, trace=0)
coefficients
(Intercept) 0.80438564
h(DO-0.159614) -0.04110043
h(0.159614-DO) -0.09742652
h(pHset- -1.58274) -0.11389203
h(0.444749-pHset) -0.17420137
h(Via-0.340761) 0.29860268
h(-0.962971-LAC) 0.32408837
h(GLU- -1.50335) -0.81614339
h(GLU- -1.42405) 1.31071055
h(NH4- -0.292625) 0.10170289
h(-0.0456292-OSM) 0.52167298
Selected 11 of 27 terms, and 7 of 16 predictors
Importance: pHset, DO, NH4, OSM, GLU, Via, LAC, Xv-unused, Stress-unused, ...
Number of terms at each degree of interaction: 1 10 (additive model)
GCV 0.007404181 RSS 0.5975609 GRSq 0.7686247 RSq 0.8404102
> plot(mars.simp2,
+ main = "Mars Using All inputs and Additional CO2 Target Label.jpg")
[Figure: earth model-selection and residual diagnostic plots for the MARS model with the additional CO2 target label; observations 42, 47 and 93 are flagged.]
When fitting MARS, records 42, 47 and 91 are identified as outliers.
> dat.simp[c(42, 47, 91), ]
Run WD Ptime DO pCO2 pH Xv Via GLC
42 630 0 0 0.159614 0.1237602 1.287431 -1.580576 0.3932955 1.078003
47 635 0 0 1.388239 0.9739924 1.426846 -1.894227 0.4848425 1.317452
91 768 0 0 0.159614 0.5488763 1.357139 -1.645298 0.4860406 1.156190
LAC GLN GLU NH4 OSM Stress
42 -0.7579071 0.8434386 -1.671864 0.006689145 -0.045629237 -0.2096149
47 -0.9828156 0.9841203 -1.820554 -0.292624973 -0.007493617 -0.2096149
91 -0.9629707 0.3432373 -1.334834 1.247747682 0.145048862 -0.2096149
Xv0 pHset RunNo Discr.
42 0.3312174 0.03925127 BIOS-3.5L - 630 Standard conditions
47 -2.2207835 -2.39373696 BIOS-3.5L - 635 DO 70% / pH 6.70 / seeding 1 mio
91 -0.1953860 0.03925127 BIOS-3.5L - 768 WGE 15 - SE 3
Titer pCO2t
42 0.3510 0
47 0.0858 0
91 0.7650 0
Also, we can see the importance of variables. The methods for importance estimation have been
discussed in Section 2.
[Figure: MARS variable importance (nsubsets, sqrt gcv, sqrt rss) for DO, pHset, NH4, Via, GLC, LAC and OSM.]
5.3 Random Forest
Using Random Forest, we can easily see the importance of different variables.
5.3.1 Random Forest with 4 Controlled Variables
Call:
randomForest(formula = Titer ~ DO + pHset + Stress + Xv, data = dat.simp, importance = TRUE, mtry = 4)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 4
Mean of squared residuals: 0.008918479
% Var explained: 71.66
5.3.2 Random Forest with All Variables
We can see the ranking of variable importances is similar to that of MARS.
[Figure: random forest variable importance (IncNodePurity) for rf.simp, increasing from Stress, pCO2, pH, GLU, GLC, GLN, Via, LAC, OSM, Xv, NH4 and pHset up to DO.]
> rf.simp
Call:
randomForest(formula = Titer ~ DO + pHset + Xv + Stress + pCO2 + Via + GLC + LAC + GLN + GLU + NH4 + OSM +
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 10
Mean of squared residuals: 0.007168825
% Var explained: 77.22
5.4 Decision Tree
For completeness, we also include a decision tree model. It cannot outperform random forest, which is an
ensemble extension of the decision tree, but it is easy to interpret.
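A minimal sketch of fitting and displaying such a tree with the rpart package, using the 4-controlled-variable formula:

```r
library(rpart)
tree <- rpart(Titer ~ DO + pHset + Stress + Xv, data = dat.simp)
print(tree)             # text form: node), split, n, deviance, yval
plot(tree); text(tree)  # the plotted tree with split labels
```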
5.4.1 Decision Tree with 4 Controlled Variables
n= 119
node), split, n, deviance, yval
* denotes terminal node
1) root 119 3.74435600 0.5739803
2) DO< -1.683324 13 0.10235720 0.2245385 *
3) DO>=-1.683324 106 1.85988900 0.6168364
6) pHset< -0.7717448 10 0.22043140 0.3785372 *
7) pHset>=-0.7717448 96 1.01244000 0.6416593
14) pHset>=0.6474983 8 0.02327288 0.5111250 *
15) pHset< 0.6474983 88 0.84046160 0.6535260
30) Stress< 0.3004491 79 0.70332710 0.6425430 *
31) Stress>=0.3004491 9 0.04395770 0.7499322 *
[Figure: plotted decision tree corresponding to the splits printed above.]
5.4.2 Decision Tree with All Variables
n= 119
node), split, n, deviance, yval
* denotes terminal node
1) root 119 3.74435600 0.5739803
2) DO< -1.683324 13 0.10235720 0.2245385 *
3) DO>=-1.683324 106 1.85988900 0.6168364
6) pHset< -0.7717448 10 0.22043140 0.3785372 *
7) pHset>=-0.7717448 96 1.01244000 0.6416593
14) NH4< 0.2658514 67 0.43613680 0.6091353
28) OSM>=-0.1028327 50 0.28442470 0.5858890
56) Via< 0.4785685 23 0.15613930 0.5504854 *
57) Via>=0.4785685 27 0.07489916 0.6160477 *
29) OSM< -0.1028327 17 0.04522386 0.6775066 *
15) NH4>=0.2658514 29 0.34168860 0.7168009
30) Xv< -1.573108 21 0.21895440 0.6899347
60) Xv>=-1.652965 12 0.13595400 0.6410000 *
61) Xv< -1.652965 9 0.01595145 0.7551809 *
31) Xv>=-1.573108 8 0.06778793 0.7873245 *
[Figure: plotted decision tree corresponding to the splits printed above.]
5.5 Neural Network
Neural networks are comparatively hard to interpret. We include one here to compare its prediction
accuracy with the other models using CV.
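As a minimal single-hidden-layer sketch with the nnet package; the size and decay values are illustrative assumptions, not the tuning used in the report.

```r
library(nnet)
set.seed(1)  # random initial weights; fix the seed for reproducibility
# linout = TRUE gives a linear output unit, as needed for regression;
# the inputs in dat.simp are already standardized
nn <- nnet(Titer ~ ., data = dat.simp, size = 5, decay = 0.01,
           linout = TRUE, trace = FALSE)
pred <- predict(nn, dat.simp)
```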
5.6 Cross Validation
Although many packages nowadays have built-in measures of test error, by implementing cross-validation
ourselves we can compare the performance of different models on the same ground. Here we use
leave-30-runs-out cross-validation. Since this is a regression problem, the mean squared error is a good
indicator of performance.
We can see the cross validation mean squared errors of different methods.
RMSECV of Different Models Using All Input Variables
Linear Model Mars Random Forest Decision Tree Neural Network
[1,] 0.078494 0.1060308 0.09046751 0.1237991 0.1362411
RMSECV of Different Models Using 4 Input Variables
Linear Model Mars Random Forest Decision Tree Neural Network
[1,] 0.09764054 0.117845 0.09327159 0.0972197 0.1150896
Surprisingly, the more complicated models perform even worse; the linear model performs quite
well.
6 Snapshot Models
6.1 Linear Model
The following model was selected by the BIC model selection criterion. We can see that the number
of selected variables is now much higher. In the snapshot model, the effects are too complex to be
captured by a few variables.
Call:
lm(formula = snlm$formula, data = ccFdata)
Residuals:
Min 1Q Median 3Q Max
-0.63033 -0.09694 0.00396 0.11494 0.48256
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.492e+00 1.705e+00 0.875 0.381944
I((DO - 50)^2) 9.907e-04 3.332e-04 2.973 0.003066 **
pCO2 1.522e-01 2.442e-02 6.233 8.54e-10 ***
I((pH - 7.1)^2) 3.698e+01 4.867e+00 7.598 1.14e-13 ***
log(Xv) 7.013e-01 1.506e-01 4.658 3.93e-06 ***
Via -3.025e-02 9.257e-03 -3.268 0.001144 **
GLC -2.973e-01 1.400e-01 -2.123 0.034175 *
LAC -2.858e-01 9.404e-02 -3.039 0.002479 **
GLN 5.613e-01 5.873e-02 9.557 < 2e-16 ***
GLU -1.950e+00 3.485e-01 -5.596 3.31e-08 ***
NH4 -1.106e-01 2.415e-02 -4.580 5.64e-06 ***
OSM -8.029e-03 3.901e-03 -2.058 0.039999 *
Stress -5.076e-02 1.549e-02 -3.276 0.001113 **
I((DO - 50)^2):pCO2 1.418e-05 4.040e-06 3.510 0.000482 ***
I((DO - 50)^2):Via -8.782e-06 2.193e-06 -4.005 6.98e-05 ***
I((DO - 50)^2):GLC 8.047e-05 1.697e-05 4.741 2.65e-06 ***
I((DO - 50)^2):GLU -1.739e-04 3.561e-05 -4.883 1.34e-06 ***
I((DO - 50)^2):NH4 -6.375e-05 2.582e-05 -2.469 0.013824 *
pCO2:Via -6.531e-04 1.354e-04 -4.824 1.78e-06 ***
pCO2:GLC 3.431e-03 6.722e-04 5.105 4.44e-07 ***
pCO2:GLN -6.369e-03 1.483e-03 -4.293 2.05e-05 ***
pCO2:OSM -2.974e-04 5.367e-05 -5.540 4.50e-08 ***
I((pH - 7.1)^2):LAC 2.711e+00 3.745e-01 7.240 1.36e-12 ***
I((pH - 7.1)^2):NH4 6.535e-01 2.225e-01 2.937 0.003435 **
I((pH - 7.1)^2):OSM -1.449e-01 1.812e-02 -7.997 6.46e-15 ***
log(Xv):GLC 1.177e-01 1.607e-02 7.323 7.75e-13 ***
log(Xv):GLU -2.613e-01 3.265e-02 -8.000 6.29e-15 ***
Via:GLC -4.150e-03 8.275e-04 -5.015 6.97e-07 ***
Via:GLU 1.534e-02 1.466e-03 10.463 < 2e-16 ***
Via:Stress 6.505e-04 1.614e-04 4.031 6.26e-05 ***
GLC:LAC -2.587e-02 7.330e-03 -3.529 0.000448 ***
GLC:NH4 1.571e-02 4.848e-03 3.241 0.001255 **
GLC:OSM 9.333e-04 3.859e-04 2.419 0.015863 *
LAC:NH4 -4.049e-02 8.656e-03 -4.678 3.58e-06 ***
LAC:OSM 1.038e-03 2.115e-04 4.911 1.16e-06 ***
GLN:Stress -4.219e-03 1.587e-03 -2.658 0.008059 **
GLU:OSM 4.736e-03 8.420e-04 5.625 2.83e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1743 on 608 degrees of freedom
(740 observations deleted due to missingness)
Multiple R-squared: 0.9526, Adjusted R-squared: 0.9498
F-statistic: 339.2 on 36 and 608 DF, p-value: < 2.2e-16
[1] 0.07678454
[Figure: lm diagnostic plots for the snapshot linear model; observations 90, 345 and 909 are flagged in the residual plots, and 530, 532 and 1377 in the leverage plot.]
6.2 Random Forest
The variable-importance ranking is now quite different from that of the black-box model. GLU, LAC, NH4
and GLN can be used together to estimate Titer, which gives a good “snapshot”.
[1] "Titer~DO+pCO2+Via+GLC+LAC+GLN+GLU+NH4+OSM+Stress+pH+Xv"
Call:
randomForest(formula = as.formula(snrf$formula), data = ccFdata, na.action = na.omit)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 4
Mean of squared residuals: 0.003390071
% Var explained: 92.32
[1] 0.06970167
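For a regression forest, the reported “% Var explained” is randomForest's pseudo R-squared, 100·(1 − MSE_oob/Var(y)). As a quick sanity check (an illustration, not part of the original analysis), we can back out the variance of Titer implied by the output above:

```python
# Back-calculate Var(Titer) implied by the randomForest output above:
# "% Var explained" = 100 * (1 - MSE_oob / Var(y)).
mse_oob = 0.003390071   # "Mean of squared residuals" from the output
pct_var = 92.32         # "% Var explained" from the output

implied_var_y = mse_oob / (1 - pct_var / 100)
assert 0.0441 < implied_var_y < 0.0442  # Var(Titer) is roughly 0.044
```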
[Figure: random forest variable-importance plot (IncNodePurity), with GLU, LAC, GLN and Xv ranked highest and Stress, DO, GLC lowest.]
6.3 MARS
MARS gives a variable-importance ranking similar to that of the random forest.
Call: earth(formula=as.formula(snmars$formula), data=subset(ccFdata,
!is.na(ccFdata$Titer)))
coefficients
(Intercept) 0.52558053
h(DO-70) 0.00384701
h(70-DO) 0.00081744
h(39-pCO2) -0.00387669
h(Via-95.1825) -0.01859399
h(1.64-GLC) -0.04381892
h(3.14-LAC) 0.06892429
h(GLN-2.09) 0.17083241
h(2.09-GLN) -0.07104502
h(GLU-4.49) 0.11840427
h(4.49-GLU) -0.13551714
h(NH4-5.19) -0.01831231
h(5.19-NH4) 0.04728241
h(OSM-305) 0.00128608
h(305-OSM) 0.00120199
h(Stress-21) -0.00161051
h(21-Stress) -0.00473324
h(7.11-pH) -0.20785734
h(Xv-2.96) -0.01598677
h(2.96-Xv) -0.10513829
Selected 20 of 25 terms, and 12 of 12 predictors
Importance: GLU, LAC, NH4, GLN, Xv, OSM, pH, Stress, Via, pCO2, DO, GLC
Number of terms at each degree of interaction: 1 19 (additive model)
GCV 0.004576446 RSS 2.605637 GRSq 0.8966136 RSq 0.9084545
[1] 0.07026124
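Each fitted term is a hinge function h(x − c) = max(0, x − c), so the model is additive in piecewise-linear pieces. As an illustration (not part of the original analysis), the two GLU terms from the output above combine as follows:

```python
# Illustration: how MARS hinge terms h(x - c) = max(0, x - c) combine.

def h(x):
    """MARS hinge function: max(0, x)."""
    return max(0.0, x)

# Contribution of the two GLU terms from the fitted model above.
def glu_contribution(glu):
    return 0.11840427 * h(glu - 4.49) - 0.13551714 * h(4.49 - glu)

# At the knot GLU = 4.49 both hinges vanish, so the contribution is zero;
# away from the knot, exactly one hinge is active on each side.
assert glu_contribution(4.49) == 0.0
assert abs(glu_contribution(5.49) - 0.11840427) < 1e-12
```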
[Figure: MARS model-selection plot, cumulative distribution of absolute residuals, Residuals vs Fitted and Normal Q-Q diagnostics (observations 586, 10 and 582 flagged), and variable-importance plot (nsubsets, sqrt gcv, sqrt rss) matching the ordering listed above.]
7 History Model (Naive Way)
Let us first use input data only at time t and treat t simply as another parameter.
Again, the target here is the Titer on day 10.
We explore the data with different models.
7.1 Linear Regression
Since we now have around 1300 observations, we can fit a model with more parameters. Afterwards we
can perform model selection using step.
Call:
lm(formula = Titer ~ I(DO^2) + DO + pHset + I(pHset^2) + Via +
poly(GLC, 3) + poly(LAC, 3) + GLN + GLU + NH4 + OSM + poly(Ptime,
3) + DO:pHset + I(pHset^2):poly(Ptime, 3) + Via:poly(Ptime,
3) + poly(LAC, 3):poly(Ptime, 3) + GLN:poly(Ptime, 3) + GLU:poly(Ptime,
3) + NH4:poly(Ptime, 3) + OSM:poly(Ptime, 3), data = dat.agg)
Residuals:
Min 1Q Median 3Q Max
-0.296451 -0.047078 0.002983 0.044872 0.190200
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.049e+00 5.200e-01 -2.017 0.044626 *
I(DO^2) -3.644e-02 6.218e-03 -5.860 1.26e-08 ***
DO 1.747e-02 7.374e-03 2.369 0.018520 *
pHset 9.137e-03 7.090e-03 1.289 0.198499
I(pHset^2) -3.318e-02 5.477e-03 -6.058 4.31e-09 ***
Via 1.341e-01 2.815e-02 4.763 3.02e-06 ***
poly(GLC, 3)1 -8.356e-01 1.905e-01 -4.387 1.62e-05 ***
poly(GLC, 3)2 -2.459e-01 1.455e-01 -1.690 0.092150 .
poly(GLC, 3)3 1.980e-01 1.425e-01 1.389 0.165837
poly(LAC, 3)1 -1.572e+02 4.936e+01 -3.184 0.001611 **
poly(LAC, 3)2 -1.203e+02 3.832e+01 -3.138 0.001878 **
poly(LAC, 3)3 -3.211e+01 1.041e+01 -3.084 0.002244 **
GLN 3.912e-02 8.812e-03 4.439 1.29e-05 ***
GLU 5.716e-02 2.163e-02 2.642 0.008686 **
NH4 5.605e-02 7.730e-03 7.252 3.82e-12 ***
OSM -2.508e-02 2.934e-02 -0.855 0.393294
poly(Ptime, 3)1 2.465e+01 7.910e+00 3.117 0.002013 **
poly(Ptime, 3)2 -4.044e+01 1.110e+01 -3.642 0.000321 ***
poly(Ptime, 3)3 5.017e+00 1.286e+00 3.903 0.000119 ***
DO:pHset 5.555e-03 2.814e-03 1.974 0.049324 *
I(pHset^2):poly(Ptime, 3)1 4.147e-01 8.829e-02 4.697 4.10e-06 ***
I(pHset^2):poly(Ptime, 3)2 1.152e-02 4.871e-02 0.236 0.813219
I(pHset^2):poly(Ptime, 3)3 -2.443e-01 1.301e-01 -1.878 0.061423 .
Via:poly(Ptime, 3)1 -1.151e+00 4.468e-01 -2.576 0.010507 *
Via:poly(Ptime, 3)2 2.055e+00 5.664e-01 3.628 0.000338 ***
Via:poly(Ptime, 3)3 -5.371e-01 3.308e-01 -1.624 0.105528
poly(LAC, 3)1:poly(Ptime, 3)1 2.404e+03 7.520e+02 3.197 0.001546 **
poly(LAC, 3)2:poly(Ptime, 3)1 1.869e+03 5.845e+02 3.199 0.001536 **
poly(LAC, 3)3:poly(Ptime, 3)1 5.054e+02 1.594e+02 3.172 0.001680 **
poly(LAC, 3)1:poly(Ptime, 3)2 -3.718e+03 1.095e+03 -3.395 0.000784 ***
poly(LAC, 3)2:poly(Ptime, 3)2 -2.955e+03 8.698e+02 -3.398 0.000776 ***
poly(LAC, 3)3:poly(Ptime, 3)2 -8.601e+02 2.569e+02 -3.347 0.000924 ***
poly(LAC, 3)1:poly(Ptime, 3)3 3.446e+02 1.065e+02 3.236 0.001353 **
poly(LAC, 3)2:poly(Ptime, 3)3 2.778e+02 8.433e+01 3.294 0.001110 **
poly(LAC, 3)3:poly(Ptime, 3)3 8.598e+01 2.539e+01 3.387 0.000805 ***
GLN:poly(Ptime, 3)1 2.287e-01 1.479e-01 1.547 0.123014
GLN:poly(Ptime, 3)2 -1.576e-01 1.730e-01 -0.911 0.363026
GLN:poly(Ptime, 3)3 -2.659e-01 1.555e-01 -1.710 0.088273 .
GLU:poly(Ptime, 3)1 -1.712e-01 3.530e-01 -0.485 0.628013
GLU:poly(Ptime, 3)2 -8.317e-01 4.445e-01 -1.871 0.062348 .
GLU:poly(Ptime, 3)3 -5.699e-01 2.756e-01 -2.068 0.039551 *
NH4:poly(Ptime, 3)1 -5.983e-01 1.399e-01 -4.278 2.57e-05 ***
NH4:poly(Ptime, 3)2 -3.588e-01 1.453e-01 -2.469 0.014148 *
NH4:poly(Ptime, 3)3 1.258e-01 1.830e-01 0.687 0.492460
OSM:poly(Ptime, 3)1 1.639e+00 3.637e-01 4.507 9.58e-06 ***
OSM:poly(Ptime, 3)2 1.621e+00 6.580e-01 2.464 0.014321 *
OSM:poly(Ptime, 3)3 2.464e-01 1.771e-01 1.391 0.165379
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.07481 on 288 degrees of freedom
Multiple R-squared: 0.8548, Adjusted R-squared: 0.8317
F-statistic: 36.87 on 46 and 288 DF, p-value: < 2.2e-16
[Figure: diagnostic plots for the fitted linear model — Residuals vs Fitted, Normal Q-Q, Scale-Location and Residuals vs Leverage; observations 104, 258 and 323 stand out in the residual plots, and 135, 136 and 196 in the leverage plot.]
The linear model does not capture all the effects. We can identify some outliers.
> (observations <- unique(dat.agg[which(abs(lm.agg$residuals) > 0.2), "Run"]))
[1] 630 770
7.2 MARS
We perform model fitting and variable selection with MARS.
Call: earth(formula=Titer~DO+pHset+Xv+Stress+pCO2+Via+
GLC+LAC+GLN+GLU+NH4+OSM+Stress+pH+pdiff+
Ptime, data=dat.agg, trace=0, ncross=3, nfold=10)
coefficients
(Intercept) 0.74650082
h(DO- -1.06901) -0.03339099
h(-1.06901-DO) -0.22191908
h(pHset-0.444749) -0.12856755
h(0.444749-pHset) -0.10199167
h(0.580131-Xv) -0.03825405
h(0.810513-Stress) -0.09597828
h(pCO2- -0.471402) 0.01252083
h(Via-0.390253) 0.18390409
h(GLC-1.1171) -0.05487600
h(GLN- -0.203858) 0.02302284
h(-0.203858-GLN) 0.03187502
h(NH4- -0.0225122) 0.15374637
h(NH4-1.01414) -0.17152919
h(pH- -0.664392) 0.04304223
h(pH-1.70568) -0.23513685
Selected 16 of 26 terms, and 10 of 15 predictors
Importance: DO, pHset, NH4, Stress, Xv, pH, GLC, GLN, Via, pCO2, ...
Number of terms at each degree of interaction: 1 15 (additive model)
GCV 0.006192977 RSS 1.708448 GRSq 0.8142985 RSq 0.8461598 cv.rsq 0.7821464
Note: the cross-validation sd's below are standard deviations across folds
Cross validation: nterms 16.90 sd 1.88 nvars 9.33 sd 0.92
cv.rsq sd MaxErr sd
0.78 0.069 -0.29 0.19
[Figure: MARS variable-importance, model-selection, cumulative residual distribution, Residuals vs Fitted and Normal Q-Q plots; observations 104, 119 and 323 stand out as outliers.]
We take a closer look at these outliers.
> dat.agg[c(104, 119, 323), ]
Run WD Ptime DO pCO2 pH Xv Via
104 630 0 0.0000000 0.159614 0.1237602 1.287431 -1.580576 0.3932955
119 635 0 0.0000000 1.388239 0.9739924 1.426846 -1.894227 0.4848425
323 1066 1 0.7791667 0.159614 -1.0665650 1.217723 -1.525812 0.1411053
GLC LAC GLN GLU NH4 OSM
104 1.0780028 -0.7579071 0.8434386 -1.671864 0.006689145 -0.045629237
119 1.3174515 -0.9828156 0.9841203 -1.820554 -0.292624973 -0.007493617
323 0.6821795 -0.4139295 0.2494495 -1.790816 0.262201197 0.068777622
Stress Xv0 pHset RunNo
104 -0.2096149 0.3312174 0.03925127 BIOS-3.5L - 630
119 -0.2096149 -2.2207835 -2.39373696 BIOS-3.5L - 635
323 -0.2096149 -1.7751961 0.03925127 TACI-3L-1066
Discr. Titer pdiff pCO2t
104 Standard conditions 0.3510000 0.6432812 0
119 DO 70% / pH 6.70 / seeding 1 mio 0.0858000 0.6542345 0
323 STD condition (Control - no loop) 0.8432867 0.5062897 0
7.3 Random Forest
The parameter mtry = 10 is determined by tuneRF to optimize its performance.
> library(randomForest)
> rf.agg <- randomForest(Titer ~ DO + pHset + Xv + Stress+ pCO2 + Via
+ + GLC + LAC + GLN + GLU + NH4 + OSM + Stress + pH
+ + pdiff + Ptime, data = dat.agg, importance = TRUE, mtry = 10)
> rf.agg
Call:
randomForest(formula = Titer ~ DO + pHset + Xv + Stress + pCO2 + Via + GLC + LAC + GLN + GLU + NH4 + OSM +
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 10
Mean of squared residuals: 0.005391989
% Var explained: 83.73
[Figure: random forest OOB error against the number of trees, and variable-importance plots (%IncMSE and IncNodePurity), with DO, pHset and NH4 ranked highest in both.]
The two importance plots agree on the seven most important variables.
7.4 Decision Tree
plotcp is used to determine the complexity parameter cp of the decision tree.
> library(rpart)
> tr.agg <- rpart(Titer ~ DO + pHset + Stress + Xv + pCO2
+ + Via + GLC + LAC + GLN + GLU + NH4 + OSM
+ + Stress + pH + pdiff + Ptime, data=dat.agg)
> (tr.agg <- prune(tr.agg, cp = 0.019))
n= 335
node), split, n, deviance, yval
* denotes terminal node
1) root 335 11.10534000 0.5719146
2) DO< -1.683324 39 0.30707160 0.2245385 *
3) DO>=-1.683324 296 5.47207100 0.6176837
6) pHset< -1.988239 18 0.17149480 0.2748954 *
7) pHset>=-1.988239 278 3.04856000 0.6398787
14) NH4< 0.3899572 202 1.52384000 0.6085717
28) LAC>=0.002812559 9 0.02199422 0.4344444 *
29) LAC< 0.002812559 193 1.21623800 0.6166916 *
15) NH4>=0.3899572 76 0.80050930 0.7230893 *
> plot(tr.agg)
> text(tr.agg)
[Figure: plot of the pruned decision tree, with the splits on DO, pHset, NH4 and LAC and the leaf values listed above.]
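For readability, the pruned tree printed above can be written out as nested conditions on the standardized inputs (an illustrative transcription, not part of the original analysis):

```python
# The pruned rpart tree above, transcribed as nested conditions.
# Inputs are on the standardized scale used in dat.agg.

def predict_titer(DO, pHset, NH4, LAC):
    if DO < -1.683324:
        return 0.2245385          # node 2
    if pHset < -1.988239:
        return 0.2748954          # node 6
    if NH4 < 0.3899572:
        if LAC >= 0.002812559:
            return 0.4344444      # node 28
        return 0.6166916          # node 29
    return 0.7230893              # node 15

# A run with very low dissolved oxygen falls into the first leaf.
assert predict_titer(DO=-2.0, pHset=0.0, NH4=0.0, LAC=0.0) == 0.2245385
```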
7.5 Neural Network
> nn.agg <- nnet(Titer ~ DO + pHset + Xv + Stress+ pCO2 + Via
+ + GLC + LAC + GLN + GLU + NH4 + OSM + Stress + pH + pdiff,
+ data = dat.agg, size = 4, skip = TRUE, decay = 4e-4,
+ linout = FALSE, maxit = 2000, trace = FALSE)
> nn.agg
a 14-4-1 network with 79 weights
inputs: DO pHset Xv Stress pCO2 Via GLC LAC GLN GLU NH4 OSM pH pdiff
output(s): Titer
options were - skip-layer connections decay=4e-04
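The reported 79 weights can be verified by counting connections (a quick arithmetic sanity check, assuming the standard parameterization of a single-hidden-layer network with biases and skip-layer connections):

```python
# Weight count of a 14-4-1 network with skip-layer connections,
# matching the "79 weights" reported by nnet above.
n_in, n_hidden, n_out = 14, 4, 1

hidden_weights = (n_in + 1) * n_hidden   # each hidden unit: 14 inputs + bias
output_weights = (n_hidden + 1) * n_out  # output unit: 4 hidden units + bias
skip_weights = n_in * n_out              # direct input-to-output connections

assert hidden_weights + output_weights + skip_weights == 79
```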
7.6 Cross Validation
Because of the nature of our data, it is better to leave a whole experimental run out when performing
cross validation. We again use MSE to assess performance.
Overall, the performance is better than in the first task.
RMSECVs of Different Models
Linear Model Mars Random Forest Decision Tree Neural Network
[1,] 0.09515759 0.09222569 0.07226036 0.1000236 0.09316403
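The leave-one-run-out scheme behind these RMSECV values can be sketched as follows (an illustration, not the report's code; fit and predict are placeholders for any of the models compared above):

```python
# Leave-one-run-out cross validation: whole experimental runs are held out
# together, since observations within a run are not independent.
import math

def rmsecv_by_run(data, fit, predict):
    runs = sorted({row["Run"] for row in data})
    sq_errors = []
    for run in runs:
        train = [r for r in data if r["Run"] != run]
        held_out = [r for r in data if r["Run"] == run]
        model = fit(train)
        sq_errors += [(predict(model, r) - r["Titer"]) ** 2 for r in held_out]
    return math.sqrt(sum(sq_errors) / len(sq_errors))

# Toy check with a "predict the training mean" model.
toy = [{"Run": 1, "Titer": 0.0}, {"Run": 2, "Titer": 2.0}]
mean_fit = lambda train: sum(r["Titer"] for r in train) / len(train)
assert rmsecv_by_run(toy, mean_fit, lambda m, r: m) == 2.0
```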
8 History Models
Here we fit history models and perform CV on them to measure their prediction power.
The input of a history model always includes the 2 strictly controlled variables that are kept constant,
namely DO and Stress. For a history model using the history up to day i, we use the other 10 variables
pCO2, Xv, Via, GLC, pH, LAC, GLN, GLU, NH4, OSM from day 0 to day i. This gives 2 + 10(i + 1)
input variables.
We do this by unfolding the raw data. In case of missing values, we impute them by using missForest
on the unfolded historical input values from day 0 to day i. This is done in function ArrHist.
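The unfolding step can be sketched as follows (a minimal illustration; the actual ArrHist function is not shown in the report, so the helper below is hypothetical):

```python
# Hypothetical helper, analogous to the ArrHist function mentioned above:
# unfold daily measurements into one wide input row per run, with columns
# "GLC_0", "GLC_1", ..., "GLC_i" for the history up to day i.

def unfold(rows, variables, up_to_day):
    """rows: list of dicts with keys 'Run', 'WD' and the variable names."""
    runs = {}
    for row in rows:
        if row["WD"] <= up_to_day:
            wide = runs.setdefault(row["Run"], {})
            for v in variables:
                wide[f"{v}_{row['WD']}"] = row[v]  # missing values imputed later
    return runs

rows = [
    {"Run": 539, "WD": 0, "GLC": 7.00},
    {"Run": 539, "WD": 1, "GLC": 5.95},
]
assert unfold(rows, ["GLC"], up_to_day=1)[539] == {"GLC_0": 7.00, "GLC_1": 5.95}
```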
We produce 2 tables for each statistical model: one gives the variance explained in CV, the other the
RMSECV. In each run of cross validation we compute the variance explained as
1 − Σ_i (ŷ_i − y_i)² / Σ_i (y_i − ȳ)²,
and we average over the runs to get the final result. Similarly, RMSECV is the mean over all
cross validation runs.
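The two per-fold summaries can be computed as in this sketch (an illustration of the formulas above, not the original code):

```python
# Per-fold summaries: variance explained, 1 - sum((yhat - y)^2) / sum((y - ybar)^2),
# and the root mean squared error.
import math

def variance_explained(y, yhat):
    ybar = sum(y) / len(y)
    ss_res = sum((p - t) ** 2 for p, t in zip(yhat, y))
    ss_tot = sum((t - ybar) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot

def rmse(y, yhat):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(yhat, y)) / len(y))

assert variance_explained([0.0, 2.0], [0.0, 2.0]) == 1.0  # perfect fit
assert rmse([0.0, 2.0], [1.0, 1.0]) == 1.0
```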
Now we see how the models perform under this setting.
8.1 Linear model
The (i, j)th history linear model is defined as follows:
log(Titer_j) = β_0 + Σ_{t=0}^{i} [ β_{1,t}(DO_t − 50)² + β_{2,t} pCO2_t + β_{3,t}(pH_t − 7.1)²
             + β_{4,t} log(Xv_t) + β_{5,t} Via_t + β_{6,t} GLC_t + β_{7,t} LAC_t + β_{8,t} GLN_t
             + β_{9,t} GLU_t + β_{10,t} NH4_t + β_{11,t} OSM_t + β_{12,t} Stress_t ]
And the following table shows the RMSECV values. The RMSECV value of the (i, j)th model is the
number at the ith row and the jth column.
RMSECV
Y0 Y1 Y2 Y3 Y4
X0 3.200970e-03 1.962524e-03 4.279477e-03 2.301885e-03 2.446558e-03
X0_1 1.545867e-02 1.659791e-02 2.354038e-03 1.230346e-03
X0_2 1.730547e-02 4.075473e-03 1.085479e-03
X0_3 4.041385e-03 2.398019e-03
X0_4 3.482082e-03
X0_5
X0_6
X0_7
X0_8
X0_9
X0_10
Y5 Y6 Y7 Y8 Y9
X0 2.051734e-03 3.416656e-03 6.081081e-03 7.219461e-03 1.941174e-02
X0_1 2.327897e-03 3.313077e-03 6.668536e-03 6.489212e-03 1.036755e-02
X0_2 3.283539e-03 1.628859e-03 5.512231e-03 3.368808e-03 7.914411e-03
X0_3 2.395535e-03 1.284514e-03 2.544108e-03 2.998240e-03 5.437181e-03
X0_4 1.389313e-03 1.658926e-03 2.458271e-03 2.497301e-03 1.091136e-02
X0_5 2.656460e-03 2.030364e-03 2.326792e-03 2.966961e-03 6.045417e-03
X0_6 2.639904e-03 2.232001e-03 3.808531e-03 1.056563e-02
X0_7 1.251427e-02 6.530875e-03 1.346802e-02
X0_8 3.184551e+02 7.913746e+13
X0_9 4.492133e+02
X0_10
Y10
X0 1.961133e-02
X0_1 5.580546e-02
X0_2 1.869011e-02
X0_3 1.359189e-02
X0_4 7.909511e-03
X0_5 1.306489e-02
X0_6 1.450043e-02
X0_7 3.412453e-02
X0_8 1.202324e+11
X0_9 1.613414e+00
X0_10 1.289649e+00
[Figure: heat map of RMSECV for the history model based on the linear model, by history used and target day.]
8.2 Random Forest
Variance explained in CV
Y0 Y1 Y2 Y3 Y4 Y5
X0 0.28324661 0.26155005 -0.09489543 -0.03088496 0.39081454 0.43354914
X0_1 0.37598610 0.21074302 0.35566190 0.31553364 0.37313831
X0_2 0.16472940 0.36469677 0.44168895 0.40601249
X0_3 0.25131256 0.35404242 0.51706127
X0_4 0.37300694 0.56438519
X0_5 0.54745573
X0_6
X0_7
X0_8
X0_9
X0_10
Y6 Y7 Y8 Y9 Y10
X0 0.44032714 0.55857068 0.57889201 0.64090502 0.55601625
X0_1 0.53608831 0.50026635 0.55488094 0.72923139 0.71702833
X0_2 0.69563541 0.67904955 0.69803823 0.66081810 0.71536101
X0_3 0.66675711 0.72303901 0.71583577 0.71933173 0.77114221
X0_4 0.77659188 0.77304970 0.77028128 0.80331752 0.74513898
X0_5 0.79546288 0.79299697 0.81030901 0.73912138 0.78821029
X0_6 0.76082620 0.76023416 0.79248812 0.82078319 0.79484015
X0_7 0.81875451 0.80675983 0.82533617 0.81928950
X0_8 0.86596990 0.82482254 0.80067588
X0_9 0.87802758 0.84462589
X0_10 0.81461628
RMSECV
Y0 Y1 Y2 Y3 Y4 Y5
X0 0.04226597 0.05005782 0.05790429 0.05367666 0.04880549 0.05140720
X0_1 0.03811918 0.05222506 0.03665541 0.04491827 0.04562487
X0_2 0.04978004 0.04001816 0.04554554 0.04125988
X0_3 0.04335484 0.03659673 0.04576400
X0_4 0.04272663 0.03664301
X0_5 0.03671531
X0_6
X0_7
X0_8
X0_9
X0_10
Y6 Y7 Y8 Y9 Y10
X0 0.06260631 0.07307958 0.08436972 0.09963318 0.10319883
X0_1 0.05939476 0.07320302 0.08885598 0.07992313 0.08665985
X0_2 0.04182959 0.06069709 0.06802562 0.09828617 0.08913791
X0_3 0.05082051 0.04949298 0.06647324 0.07765117 0.08482019
X0_4 0.03694516 0.05087756 0.05854120 0.06370206 0.08794341
X0_5 0.04183240 0.04704620 0.05592446 0.07192094 0.07872577
X0_6 0.03919570 0.05291936 0.05561075 0.07081846 0.07979003
X0_7 0.04348533 0.05150214 0.06694091 0.06891756
X0_8 0.04623633 0.06832347 0.07381787
X0_9 0.05664780 0.07072329
X0_10 0.06890835
[Figure: heat map of RMSECV for the history model based on the random forest, by history used and target day.]
8.3 MARS
Variance explained in CV
Y0 Y1 Y2 Y3 Y4
X0 0.069814132 -0.005095947 0.024849356 -0.034696230 -0.120319657
X0_1 0.007649867 -0.396661134 -0.275546932 -0.033114509
X0_2 -0.827848243 -0.204988340 -1.123241366
X0_3 -0.239974296 0.267732287
X0_4 -1.130404725
X0_5
X0_6
X0_7
X0_8
X0_9
X0_10
Y5 Y6 Y7 Y8 Y9
X0 0.363899116 0.330268801 0.508385257 0.305851870 0.599098171
X0_1 -0.150206083 0.147392088 0.375098045 0.089004911 0.342867469
X0_2 0.287235896 0.651413305 0.651947560 0.742131695 0.616545382
X0_3 0.233978160 0.517310655 0.669245506 0.804495843 0.768104648
X0_4 0.397724415 0.704167262 0.821355319 0.839638561 0.836162082
X0_5 0.512508645 0.686174900 0.814525758 0.846335693 0.780111923
X0_6 0.846201740 0.823482389 0.841185054 0.823600595
X0_7 0.763357781 0.908825907 0.814810324
X0_8 0.850118872 0.714621386
X0_9 0.598640731
X0_10
Y10
X0 -0.266671250
X0_1 0.099612543
X0_2 0.296458653
X0_3 0.673287674
X0_4 0.681350544
X0_5 0.706625242
X0_6 0.812505473
X0_7 0.789365935
X0_8 0.657600818
X0_9 0.742572597
X0_10 0.830697666
RMSECV
Y0 Y1 Y2 Y3 Y4 Y5
X0 0.05414481 0.06435345 0.04914698 0.04440181 0.04899524 0.04751189
X0_1 0.06349293 0.07040039 0.07104746 0.04787156 0.06328275
X0_2 0.08353868 0.05609517 0.06006556 0.05632823
X0_3 0.05677938 0.05112222 0.05670555
X0_4 0.07399238 0.04632260
X0_5 0.04234317
X0_6
X0_7
X0_8
X0_9
X0_10
Y6 Y7 Y8 Y9 Y10
X0 0.06136661 0.06598322 0.10313729 0.10811682 0.16777832
X0_1 0.06212712 0.09116696 0.10869171 0.12338118 0.17132883
X0_2 0.05033973 0.05957265 0.06570322 0.09632618 0.12433817
X0_3 0.05168125 0.06504277 0.06069211 0.07182945 0.10263122
X0_4 0.03954150 0.04350307 0.05051779 0.06939214 0.09432636
X0_5 0.04101469 0.04404178 0.04560743 0.06953788 0.08301654
X0_6 0.02986677 0.04677269 0.05327379 0.07087714 0.07679419
X0_7 0.04676462 0.04301543 0.07409887 0.07206619
X0_8 0.04281923 0.07792298 0.09770436
X0_9 0.09445278 0.08202713
X0_10 0.07658425
[Figure: heat map of RMSECV for the history model based on MARS, by history used and target day.]
9 Evaluation of Results
All models perform reasonably well.
Random forest has the most stable performance, as expected, and it is also easy to implement.
There are some tuning parameters to consider, e.g., the number of trees and the resampling size, but
the default settings work well most of the time, and even an unnecessarily large number of trees is
not a big problem.
Linear regression also does reasonably well. Its performance relies heavily on variable selection,
interaction specification and transformations, which is why implementing linear regression is not as
straightforward as the other models. A compromise is to fit a large model initially and use stepwise
selection to reduce the number of parameters. A more recent approach is penalized linear regression,
such as the lasso and ridge regression. See [2].
MARS is easy to implement because it adjusts to nonlinearity automatically. Here its prediction
power is not as good as that of the other two models, which might result from overfitting. It is
possible to tune its parameters. See [3].
References
[1] Daniel J. Stekhoven. Using the missForest package, 2011. Available from http://stat.ethz.ch/education/semesters/ss2012/ams/paper/missForest_1.2.pdf.
[2] Peter Bühlmann and Martin Mächler. Computational Statistics, 2014. Available from http://stat.ethz.ch/education/semesters/ss2014/CompStat/sk.pdf.
[3] Stephen Milborrow. Notes on the earth package, 2014. Available from http://cran.r-project.org/web/packages/earth/vignettes/earth-notes.pdf.
[4] Leo Breiman and Adele Cutler (Fortran original), Andy Liaw and Matthew Wiener (R port). Package ‘randomForest’, 2012. Available from http://stat-www.berkeley.edu/users/breiman/RandomForests.
[5] Terry M. Therneau and Elizabeth J. Atkinson, Mayo Foundation. An Introduction to Recursive Partitioning Using the RPART Routines, 2013. Available from http://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf.
Survey on semi supervised classification methods and feature selectionSurvey on semi supervised classification methods and feature selection
Survey on semi supervised classification methods and feature selection
 
Basic calculations of measurement uncertainty in medical testing
Basic calculations of measurement uncertainty in medical testingBasic calculations of measurement uncertainty in medical testing
Basic calculations of measurement uncertainty in medical testing
 

Similar to report

A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...IJRES Journal
 
Machine Learning.pdf
Machine Learning.pdfMachine Learning.pdf
Machine Learning.pdfBeyaNasr1
 
BPSO&1-NN algorithm-based variable selection for power system stability ident...
BPSO&1-NN algorithm-based variable selection for power system stability ident...BPSO&1-NN algorithm-based variable selection for power system stability ident...
BPSO&1-NN algorithm-based variable selection for power system stability ident...IJAEMSJORNAL
 
Sequential estimation of_discrete_choice_models__copy_-4
Sequential estimation of_discrete_choice_models__copy_-4Sequential estimation of_discrete_choice_models__copy_-4
Sequential estimation of_discrete_choice_models__copy_-4YoussefKitane
 
22_RepeatedMeasuresDesign_Complete.pptx
22_RepeatedMeasuresDesign_Complete.pptx22_RepeatedMeasuresDesign_Complete.pptx
22_RepeatedMeasuresDesign_Complete.pptxMarceloHenriques20
 
Data Science Interview Questions | Data Science Interview Questions And Answe...
Data Science Interview Questions | Data Science Interview Questions And Answe...Data Science Interview Questions | Data Science Interview Questions And Answe...
Data Science Interview Questions | Data Science Interview Questions And Answe...Simplilearn
 
An Improved Adaptive Multi-Objective Particle Swarm Optimization for Disassem...
An Improved Adaptive Multi-Objective Particle Swarm Optimization for Disassem...An Improved Adaptive Multi-Objective Particle Swarm Optimization for Disassem...
An Improved Adaptive Multi-Objective Particle Swarm Optimization for Disassem...IJRESJOURNAL
 
Supervised Learning.pdf
Supervised Learning.pdfSupervised Learning.pdf
Supervised Learning.pdfgadissaassefa
 
Multinomial Logistic Regression Analysis
Multinomial Logistic Regression AnalysisMultinomial Logistic Regression Analysis
Multinomial Logistic Regression AnalysisHARISH Kumar H R
 
2014-mo444-practical-assignment-04-paulo_faria
2014-mo444-practical-assignment-04-paulo_faria2014-mo444-practical-assignment-04-paulo_faria
2014-mo444-practical-assignment-04-paulo_fariaPaulo Faria
 
Performance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsPerformance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsDinusha Dilanka
 
German credit score shivaram prakash
German credit score shivaram prakashGerman credit score shivaram prakash
German credit score shivaram prakashShivaram Prakash
 
Parameter Optimisation for Automated Feature Point Detection
Parameter Optimisation for Automated Feature Point DetectionParameter Optimisation for Automated Feature Point Detection
Parameter Optimisation for Automated Feature Point DetectionDario Panada
 
Multiple Regression.ppt
Multiple Regression.pptMultiple Regression.ppt
Multiple Regression.pptTanyaWadhwani4
 
Machine learning Mind Map
Machine learning Mind MapMachine learning Mind Map
Machine learning Mind MapAshish Patel
 
Sequential estimation of_discrete_choice_models
Sequential estimation of_discrete_choice_modelsSequential estimation of_discrete_choice_models
Sequential estimation of_discrete_choice_modelsYoussefKitane
 
IEOR 265 Final Paper_Minchao Lin
IEOR 265 Final Paper_Minchao LinIEOR 265 Final Paper_Minchao Lin
IEOR 265 Final Paper_Minchao LinMinchao Lin
 
Statistics project2
Statistics project2Statistics project2
Statistics project2shri1984
 

Similar to report (20)

A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
 
Machine Learning.pdf
Machine Learning.pdfMachine Learning.pdf
Machine Learning.pdf
 
BPSO&1-NN algorithm-based variable selection for power system stability ident...
BPSO&1-NN algorithm-based variable selection for power system stability ident...BPSO&1-NN algorithm-based variable selection for power system stability ident...
BPSO&1-NN algorithm-based variable selection for power system stability ident...
 
Sequential estimation of_discrete_choice_models__copy_-4
Sequential estimation of_discrete_choice_models__copy_-4Sequential estimation of_discrete_choice_models__copy_-4
Sequential estimation of_discrete_choice_models__copy_-4
 
22_RepeatedMeasuresDesign_Complete.pptx
22_RepeatedMeasuresDesign_Complete.pptx22_RepeatedMeasuresDesign_Complete.pptx
22_RepeatedMeasuresDesign_Complete.pptx
 
Data Science Interview Questions | Data Science Interview Questions And Answe...
Data Science Interview Questions | Data Science Interview Questions And Answe...Data Science Interview Questions | Data Science Interview Questions And Answe...
Data Science Interview Questions | Data Science Interview Questions And Answe...
 
An Improved Adaptive Multi-Objective Particle Swarm Optimization for Disassem...
An Improved Adaptive Multi-Objective Particle Swarm Optimization for Disassem...An Improved Adaptive Multi-Objective Particle Swarm Optimization for Disassem...
An Improved Adaptive Multi-Objective Particle Swarm Optimization for Disassem...
 
Supervised Learning.pdf
Supervised Learning.pdfSupervised Learning.pdf
Supervised Learning.pdf
 
Multinomial Logistic Regression Analysis
Multinomial Logistic Regression AnalysisMultinomial Logistic Regression Analysis
Multinomial Logistic Regression Analysis
 
2014-mo444-practical-assignment-04-paulo_faria
2014-mo444-practical-assignment-04-paulo_faria2014-mo444-practical-assignment-04-paulo_faria
2014-mo444-practical-assignment-04-paulo_faria
 
Performance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsPerformance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning Algorithms
 
German credit score shivaram prakash
German credit score shivaram prakashGerman credit score shivaram prakash
German credit score shivaram prakash
 
Parameter Optimisation for Automated Feature Point Detection
Parameter Optimisation for Automated Feature Point DetectionParameter Optimisation for Automated Feature Point Detection
Parameter Optimisation for Automated Feature Point Detection
 
3.2 Measures of variation
3.2 Measures of variation3.2 Measures of variation
3.2 Measures of variation
 
Multiple Regression.ppt
Multiple Regression.pptMultiple Regression.ppt
Multiple Regression.ppt
 
Dm
DmDm
Dm
 
Machine learning Mind Map
Machine learning Mind MapMachine learning Mind Map
Machine learning Mind Map
 
Sequential estimation of_discrete_choice_models
Sequential estimation of_discrete_choice_modelsSequential estimation of_discrete_choice_models
Sequential estimation of_discrete_choice_models
 
IEOR 265 Final Paper_Minchao Lin
IEOR 265 Final Paper_Minchao LinIEOR 265 Final Paper_Minchao Lin
IEOR 265 Final Paper_Minchao Lin
 
Statistics project2
Statistics project2Statistics project2
Statistics project2
 

report

Titer values were interpolated by our client. Missing values among the input variables were imputed with MissForest, a recently developed method based on random forests. See [1].

1.3 Data Standardization

It is often good practice to standardize the data to have mean 0 and variance 1. In our analysis, the inputs to all models except the linear models are standardized, since we use more sophisticated, customized transformations for the linear regressions. As a result, the coefficients of the linear regressions and of MARS always differ in scale.
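The standardization step can be sketched in base R with scale() (the data frame below is a toy stand-in, not dat.simp; column names are illustrative):

```r
# Standardize each numeric column to mean 0 and variance 1, as done
# for the inputs of all models except the linear regressions.
dat <- data.frame(GLC = c(7.00, 5.95, 5.13, 3.85),
                  LAC = c(0.67, 1.52, 2.06, 2.16))
dat.std <- as.data.frame(scale(dat))  # center, then divide by column sd
colMeans(dat.std)                     # approximately 0 for every column
apply(dat.std, 2, sd)                 # exactly 1 for every column
```

Note that scale() records the column means and standard deviations as attributes, so predictions can later be mapped back to the original units.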
2 Information on Models Used

2.1 Linear Model

Linear regression is the most mature and widely used method. Although it is quite simple and intuitive, it sometimes has good prediction power, and we can easily tell whether any model assumption is violated by inspecting the diagnostic plots of the linear fit.

To make a linear model work best, it is necessary to specify the predictors manually. So we have to consider which transformations to take, and which interactions or higher-order terms to include and which to leave out. One easy way to do this is to fit a large model in the beginning and use stepwise selection by the AIC (or BIC) criterion¹, which is a measure that balances goodness of fit against model complexity.

To assess variable importance, a simple way is to look at the p-values or t-values, given that the model assumptions are met.

2.2 Decision Tree

A decision tree is a scale-independent statistical model. It is easy to implement and deals with interactions naturally. Its biggest advantage is that it can be visualized easily. See [5].

2.3 Random Forest

All the information in this subsection is based on [4]. Random forest is an ensemble method that combines many decision trees, in each of which the observations and variables are sampled so that only part of them are used. Due to this implicit bootstrapping, the model suffers less from overfitting and has good prediction power.

Random forest has built-in mechanisms to estimate the importance of variables. The two measures are described in the following way:

- The first measure is computed from permuting OOB data: For each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The difference between the two are then averaged over all trees, and normalized by the standard deviation of the differences.
If the standard deviation of the differences is equal to 0 for a variable, the division is not done (but the average is almost always equal to 0 in that case).

- The second measure is the total decrease in node impurities from splitting on the variable, averaged over all trees. For classification, the node impurity is measured by the Gini index. For regression, it is measured by the residual sum of squares.

The package randomForest has a function varImpPlot to plot the importance of variables easily.

2.4 MARS

All the information in this subsection is based on [3]. MARS, multivariate adaptive regression splines, is an adaptive extension of linear regression. The final model is a linear regression with terms like (xj − d)+, (d − xj)+ and higher-order interactions of such terms. The algorithm adds terms in a forward pass and then prunes the model back to the point where the GCV is minimized.

MARS can also estimate the importance of variables. From the earth vignette [3]:

- The nsubsets criterion counts the number of model subsets that include the variable. Variables that are included in more subsets are considered more important. By "subsets" we mean the subsets of terms generated by the pruning pass. There is one subset for each model size (from 1 to the size of the selected model) and the subset is the best set of terms for that model size. (These subsets are specified in $prune.terms in earth's return value.) Only subsets that are smaller than or equal in size to the final model are used for estimating variable importance.

¹ See http://stat.ethz.ch/R-manual/R-devel/library/stats/html/step.html. For more information about BIC, see http://en.wikipedia.org/wiki/Bayesian_information_criterion
- The rss criterion first calculates the decrease in the RSS for each subset relative to the previous subset. (For multiple-response models, RSS's are calculated over all responses.) Then for each variable it sums these decreases over all subsets that include the variable. Finally, for ease of interpretation the summed decreases are scaled so the largest summed decrease is 100. Variables which cause larger net decreases in the RSS are considered more important.

- The gcv criterion is the same, but uses the GCV instead of the RSS. Adding a variable can increase the GCV, i.e., adding the variable has a deleterious effect on the model. When this happens, the variable could even have a negative total importance, and thus appear less important than unused variables.

2.5 Neural Network

A neural network is a flexible model which is, in a sense, an extension of linear regression. See en.wikipedia.org/wiki/Artificial_neural_network.

3 Overview of Three Main Tasks

This statistical analysis comprises three main tasks, in each of which all of the above-mentioned models were used, in particular linear regression, random forest and MARS. Only for the third task was a different dataset used, in which missing Titer values are interpolated by a logistic function fit within each run.

3.1 First Task: Blackbox Models

Maximizing the output of a useful product by controlling the experimental conditions is of primary interest. The first blackbox models are fitted using only the four controlled variables, namely DO, Stress, pHset and Xv0, as input variables to predict the output variable, Titer at day 10. Secondly, all other input variables at day 0 are used as input variables to predict Titer at day 10.

3.2 Second Task: Snapshot Models

In order to save the measurement cost of Titer, it is useful to predict the current value of Titer during an experiment using the current values of the input variables, which are relatively cheap to measure.
The snapshot approach means that each observation is considered to be independent, and input variables at day t are used to predict Titer at day t.

3.3 Third Task: History Models

The history approach can be regarded as an extension of both the blackbox models and the snapshot models. For the history models, not only the current values of the input variables at day t are considered in the model as predictors, but also how they have changed over time in the past, that is, the history of the input variables. All Titer values in the future, as well as the one at day t, are to be predicted. In other words, an (i, j) history model uses input variables at days 0, 1, ..., i as predictors to predict the Titer value at day j (≥ i).

4 Model Comparison

In order to compare the prediction performance of the statistical models, cross-validation was used.

1. First, the data is randomly split into a training set and a test set by Run ID. That is, 30 runs are randomly sampled for a test set among the 122 runs.

2. Then a model is fitted using the training set, i.e. the rest of the data.

3. The MSE (mean squared error) is calculated on the test set.
4. This is repeated 5 times and, using the 5 resulting MSE's, the RMSECV is calculated, which is defined as follows.

For blackbox or history models:

RMSECV = \sqrt{ \frac{ \sum_{k=1}^{5} \sum_{i=1}^{30} \bigl( y_i^{(k)} - \hat{y}_i^{(k)} \bigr)^2 }{ 5 \cdot 30 } },

where y_i^{(k)} and \hat{y}_i^{(k)} are the true Titer value and the predicted one of the i-th sample run in the k-th cross-validation, respectively.

For snapshot models:

RMSECV = \sqrt{ \frac{ \sum_{k=1}^{5} \sum_{i=1}^{30} \sum_{j=1}^{n_i} \bigl( y_{ij}^{(k)} - \hat{y}_{ij}^{(k)} \bigr)^2 }{ 5 \cdot 30 \cdot n_i } },

where y_{ij}^{(k)} and \hat{y}_{ij}^{(k)} are the true Titer value and the predicted one of the i-th sample run in the k-th cross-validation at day j, respectively.

In case the target variable is transformed, the RMSECV is calculated based on the back-transformed values of both the true and the predicted values.

5 Blackbox Models

This is a prediction problem in the classical statistics setting. Before we use any statistical models, it is helpful to get an intuition of what the dataset looks like.

[Figure: Final Titer in Different Runs without Transformation; dat.simp$Titer (roughly 0.2 to 0.8) plotted against dat.simp$Run (roughly 600 to 1000)]

Since we are using only a small part of the raw data here, the number of observations is 122. If only prediction power is of interest, you can jump to Section 5.6.
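Written out as code, the RMSECV for the blackbox and history models of Section 4 can be computed as follows (a base-R sketch; the function name and the simulated matrices are illustrative, not objects from the original analysis):

```r
# RMSECV over K = 5 cross-validation repeats with 30 held-out runs each.
# y.true and y.pred are 5 x 30 matrices of observed and predicted final
# Titer values (on the back-transformed scale if a transform was used).
rmsecv <- function(y.true, y.pred) {
  sqrt(sum((y.true - y.pred)^2) / length(y.true))
}

set.seed(1)
y.true <- matrix(runif(5 * 30, 0.1, 0.9), nrow = 5)
y.pred <- y.true + rnorm(5 * 30, sd = 0.05)  # stand-in predictions
rmsecv(y.true, y.pred)                       # close to 0.05 here
```

The snapshot version only differs in that the squared errors are additionally summed over the measurement days within each run.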
5.1 Linear Model

5.1.1 Linear Model with 4 Controlled Variables

The following model was selected by the BIC model selection criterion. We can see that both DO and pHset exhibit a quadratic effect.

Call:
lm(formula = bb4lm$formula, data = if.data)

Residuals:
     Min       1Q   Median       3Q      Max
-0.59583 -0.10260 -0.00799  0.10758  0.84827

Coefficients:
                                    Estimate Std. Error t value Pr(>|t|)
(Intercept)                       -2.695e-01  7.932e-02  -3.397 0.000952 ***
I((DO - 50)^2)                    -7.425e-04  8.316e-05  -8.929 1.18e-14 ***
log(Xv0)                          -3.114e-01  1.810e-01  -1.721 0.088092 .
I((pHset - 7.1)^2)                -7.436e+00  9.961e-01  -7.465 2.17e-11 ***
I((DO - 50)^2):log(Xv0)            5.670e-04  1.714e-04   3.309 0.001270 **
I((DO - 50)^2):I((pHset - 7.1)^2) -1.852e-03  7.084e-04  -2.614 0.010202 *
log(Xv0):I((pHset - 7.1)^2)        8.604e+00  1.811e+00   4.750 6.23e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2011 on 109 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared: 0.8462, Adjusted R-squared: 0.8377
F-statistic: 99.95 on 6 and 109 DF, p-value: < 2.2e-16

[1] 0.09764054

[Figure: diagnostic plots of the fit: Residuals vs Fitted, Normal Q-Q, Scale-Location and Residuals vs Leverage (with Cook's distance contours); observations 37, 43, 48 and 73 are flagged]
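The selection procedure behind this fit can be sketched as follows (simulated stand-in data, not if.data; the coefficients used to generate Titer are taken from the fit above for realism, and step() with k = log(n) performs the BIC-based pruning described in Section 2.1):

```r
# Simulated stand-in for if.data with the report's variable names.
set.seed(42)
n  <- 116
df <- data.frame(DO    = runif(n, 10, 70),
                 Xv0   = runif(n, 1, 2),
                 pHset = runif(n, 6.7, 7.2))
df$Titer <- with(df, -0.27 - 7.4e-4 * (DO - 50)^2 - 0.31 * log(Xv0) -
                      7.44 * (pHset - 7.1)^2 + rnorm(n, sd = 0.2))

# Start from the transformed terms plus all two-way interactions,
# then prune by BIC (step() with k = log(n)).
full <- lm(Titer ~ (I((DO - 50)^2) + log(Xv0) + I((pHset - 7.1)^2))^2,
           data = df)
fit  <- step(full, k = log(n), trace = 0)
coef(summary(fit))
```

With k = 2 instead of k = log(n), step() would use AIC, typically retaining a somewhat larger model.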
The transformations for DO, Xv0 and pHset were chosen according to the termplot to give a reasonable fit.

> par(mfrow=c(2,2))
> termplot(bb4lmfit, partial.resid=TRUE)

[Figure: termplots with partial residuals for I((DO-50)^2), log(Xv0) and I((pHset-7.1)^2)]

5.1.2 Linear Model with All Variables

The following model was selected by the BIC model selection criterion. Apart from the 4 controlled variables, only pCO2, GLN and NH4 enter the model.

Call:
lm(formula = bblmbic$formula, data = if.data)

Residuals:
     Min       1Q   Median       3Q      Max
-0.86766 -0.09680  0.01636  0.09528  0.59879

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)        -1.559e+00  2.359e-01  -6.606 1.49e-09 ***
I((DO - 50)^2)     -5.932e-04  4.454e-05 -13.319  < 2e-16 ***
pCO2                5.530e-03  2.287e-03   2.418  0.01726 *
GLN                 1.264e-01  5.030e-02   2.513  0.01343 *
NH4                 8.313e-02  2.345e-02   3.545  0.00058 ***
log(Xv0)            5.981e-01  1.377e-01   4.343 3.16e-05 ***
I((pHset - 7.1)^2) -5.336e+00  5.264e-01 -10.137  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2224 on 109 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared: 0.8119, Adjusted R-squared: 0.8015
F-statistic: 78.41 on 6 and 109 DF, p-value: < 2.2e-16

[Figure: diagnostic plots of the fit: Residuals vs Fitted, Normal Q-Q, Scale-Location and Residuals vs Leverage (with Cook's distance contours); observations 47, 48 and 55 are flagged]

5.2 MARS

To allow for flexibility and nonlinear effects, we use MARS (multivariate adaptive regression splines).
5.2.1 MARS with 4 Controlled Variables

> source(file.path(cwd, "mars_simp.R"))
> summary(mars.con)
Call: earth(formula=Titer~DO+pHset+Xv+Stress, data=dat.simp, trace=0)

                  coefficients
(Intercept)         0.69954039
h(-1.06901-DO)     -0.22496222
h(DO-0.159614)     -0.05639779
h(pHset-0.444749)  -0.12694532
h(0.444749-pHset)  -0.11450147

Selected 5 of 13 terms, and 2 of 4 predictors
Importance: DO, pHset, Xv-unused, Stress-unused
Number of terms at each degree of interaction: 1 4 (additive model)
GCV 0.009779767  RSS 0.9944133  GRSq 0.6943894  RSq 0.7344234

> plot(mars.con, main = "Mars Using 4 Inputs.jpg")
[Figure: model selection, cumulative residual distribution, residuals vs fitted, and residual Q-Q plots for the 4-input MARS model; observations 42, 47 and 106 are flagged]

5.2.2 MARS with All Variables

> summary(mars.simp)
Call: earth(formula=Titer~DO+pHset+Xv+Stress+pCO2+Via+GLC+LAC+GLN+GLU+NH4+OSM+Stress+pH,
            data=dat.simp, trace=0)

                  coefficients
(Intercept)         0.56859816
h(-1.06901-DO)     -0.23177287
h(DO-0.159614)     -0.06620347
h(pHset-0.444749)  -0.09650702
h(0.444749-pHset)  -0.09711387
h(Via-0.355972)     0.33786922
h(1.41519-GLC)      0.14153165
h(-0.949741-LAC)    0.36807897
h(NH4- -0.263424)   0.10067628
h(-0.0456292-OSM)   0.35674962

Selected 10 of 24 terms, and 7 of 13 predictors
Importance: DO, pHset, NH4, Via, GLC, LAC, OSM, Xv-unused, Stress-unused, ...
Number of terms at each degree of interaction: 1 9 (additive model)
GCV 0.007086922  RSS 0.5955396  GRSq 0.7785388  RSq 0.84095

> plot(mars.simp, main = "Mars Using All Inputs.jpg")
[Figure: model selection, cumulative residual distribution, residuals vs fitted, and residual Q-Q plots for the all-input MARS model; observations 42, 47 and 93 are flagged]

Now we add the label (pCO2t) showing whether or not CO2 is targeted to a certain level. As it turned out, this label is not important.

> summary(mars.simp2)
Call: earth(formula=Titer~DO+pHset+Xv+Stress+pCO2+Via+GLC+LAC+GLN+GLU+NH4+OSM+Stress+pH+pCO2t,
            data=dat.simp, trace=0)

                   coefficients
(Intercept)          0.80438564
h(DO-0.159614)      -0.04110043
h(0.159614-DO)      -0.09742652
h(pHset- -1.58274)  -0.11389203
h(0.444749-pHset)   -0.17420137
h(Via-0.340761)      0.29860268
h(-0.962971-LAC)     0.32408837
h(GLU- -1.50335)    -0.81614339
h(GLU- -1.42405)     1.31071055
h(NH4- -0.292625)    0.10170289
h(-0.0456292-OSM)    0.52167298

Selected 11 of 27 terms, and 7 of 16 predictors
Importance: pHset, DO, NH4, OSM, GLU, Via, LAC, Xv-unused, Stress-unused, ...
Number of terms at each degree of interaction: 1 10 (additive model)
GCV 0.007404181  RSS 0.5975609  GRSq 0.7686247  RSq 0.8404102

> plot(mars.simp2,
+      main = "Mars Using All inputs and Additional CO2 Target Label.jpg")
[Figure: model selection, cumulative residual distribution, residuals vs fitted, and residual Q-Q plots for the MARS model with the additional CO2 target label; observations 42, 47 and 93 are flagged]

When fitting MARS, records 42, 47 and 91 are identified as outliers.

> dat.simp[c(42, 47, 91), ]
   Run WD Ptime       DO      pCO2       pH        Xv       Via      GLC
42 630  0     0 0.159614 0.1237602 1.287431 -1.580576 0.3932955 1.078003
47 635  0     0 1.388239 0.9739924 1.426846 -1.894227 0.4848425 1.317452
91 768  0     0 0.159614 0.5488763 1.357139 -1.645298 0.4860406 1.156190
          LAC       GLN       GLU          NH4          OSM     Stress
42 -0.7579071 0.8434386 -1.671864  0.006689145 -0.045629237 -0.2096149
47 -0.9828156 0.9841203 -1.820554 -0.292624973 -0.007493617 -0.2096149
91 -0.9629707 0.3432373 -1.334834  1.247747682  0.145048862 -0.2096149
          Xv0       pHset           RunNo                            Discr.
42  0.3312174  0.03925127 BIOS-3.5L - 630               Standard conditions
47 -2.2207835 -2.39373696 BIOS-3.5L - 635 DO 70% / pH 6.70 / seeding 1 mio
91 -0.1953860  0.03925127 BIOS-3.5L - 768                     WGE 15 - SE 3
    Titer pCO2t
42 0.3510     0
47 0.0858     0
91 0.7650     0

Also, we can see the importance of variables.
The methods for importance estimation have been discussed in Section 2.
[Figure: Importance of Variables; nsubsets, sqrt gcv and sqrt rss criteria for DO, pHset, NH4, Via, GLC, LAC and OSM]

5.3 Random Forest

Using random forest, we can easily see the importance of the different variables.

5.3.1 Random Forest with 4 Controlled Variables

Call:
 randomForest(formula = Titer ~ DO + pHset + Stress + Xv, data = dat.simp,
              importance = TRUE, mtry = 4)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 4

          Mean of squared residuals: 0.008918479
                    % Var explained: 71.66

5.3.2 Random Forest with All Variables

We can see that the ranking of variable importances is similar to that of MARS.
[Figure: rf.simp variable importance (IncNodePurity), increasing from Stress, pCO2, pH, GLU, GLC, GLN, Via, LAC, OSM, Xv, NH4, pHset to DO.]

> rf.simp

Call:
 randomForest(formula = Titer ~ DO + pHset + Xv + Stress + pCO2 + Via + GLC + LAC + GLN + GLU + NH4 + OSM +
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 10

          Mean of squared residuals: 0.007168825
                    % Var explained: 77.22

5.4 Decision Tree

For completeness, we also include a decision tree model. It cannot be expected to outperform random forest, which is an ensemble extension of decision trees, but it is easy to interpret.

5.4.1 Decision Tree with 4 Controlled Variables

n= 119

node), split, n, deviance, yval
      * denotes terminal node

 1) root 119 3.74435600 0.5739803
   2) DO< -1.683324 13 0.10235720 0.2245385 *
   3) DO>=-1.683324 106 1.85988900 0.6168364
     6) pHset< -0.7717448 10 0.22043140 0.3785372 *
     7) pHset>=-0.7717448 96 1.01244000 0.6416593
      14) pHset>=0.6474983 8 0.02327288 0.5111250 *
      15) pHset< 0.6474983 88 0.84046160 0.6535260
        30) Stress< 0.3004491 79 0.70332710 0.6425430 *
        31) Stress>=0.3004491 9 0.04395770 0.7499322 *

[Figure: tree with splits on DO, pHset and Stress; leaf values 0.2245, 0.3785, 0.5111, 0.6425, 0.7499.]

5.4.2 Decision Tree with All Variables

n= 119

node), split, n, deviance, yval
      * denotes terminal node

 1) root 119 3.74435600 0.5739803
   2) DO< -1.683324 13 0.10235720 0.2245385 *
   3) DO>=-1.683324 106 1.85988900 0.6168364
     6) pHset< -0.7717448 10 0.22043140 0.3785372 *
     7) pHset>=-0.7717448 96 1.01244000 0.6416593
      14) NH4< 0.2658514 67 0.43613680 0.6091353
        28) OSM>=-0.1028327 50 0.28442470 0.5858890
          56) Via< 0.4785685 23 0.15613930 0.5504854 *
          57) Via>=0.4785685 27 0.07489916 0.6160477 *
        29) OSM< -0.1028327 17 0.04522386 0.6775066 *
      15) NH4>=0.2658514 29 0.34168860 0.7168009
        30) Xv< -1.573108 21 0.21895440 0.6899347
          60) Xv>=-1.652965 12 0.13595400 0.6410000 *
          61) Xv< -1.652965 9 0.01595145 0.7551809 *
        31) Xv>=-1.573108 8 0.06778793 0.7873245 *
[Figure: all-variables tree with splits on DO, pHset, NH4, OSM, Via and Xv; leaf values 0.2245, 0.3785, 0.5505, 0.616, 0.6775, 0.641, 0.7552, 0.7873.]

5.5 Neural Network

Neural networks are comparatively hard to interpret. We include one here only to compare its prediction accuracy with the other models via cross validation.

5.6 Cross Validation

Although many packages nowadays have built-in measures of test error, by implementing cross validation ourselves we can compare the performance of the different models on the same footing. Here we use leave-30-runs-out cross validation. Since this is a regression problem, mean squared error is a good indicator of performance. The cross-validation root mean squared errors of the different methods are:

RMSECV of Different Models Using All Input Variables
     Linear Model      Mars Random Forest Decision Tree Neural Network
[1,]     0.078494 0.1060308    0.09046751     0.1237991      0.1362411

RMSECV of Different Models Using 4 Input Variables
     Linear Model     Mars Random Forest Decision Tree Neural Network
[1,]   0.09764054 0.117845    0.09327159     0.0972197      0.1150896

Surprisingly, the more complicated models perform worse here; the linear model does quite well.
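The leave-runs-out scheme can be sketched as follows. This is a minimal Python stand-in for the report's R implementation: whole runs, never individual records, are held out, and the test RMSE is averaged over repeated splits. The data here are synthetic, and plain least squares stands in for any of the fitted models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the real batches: 20 runs, 5 records each.
runs = np.repeat(np.arange(20), 5)
X = rng.normal(size=(100, 4))
y = X @ np.array([0.5, -0.3, 0.2, 0.1]) + rng.normal(scale=0.1, size=100)

def leave_runs_out_rmse(X, y, runs, n_test_runs=5, n_splits=10, seed=1):
    """Hold out whole runs and average the test RMSE over splits."""
    rng = np.random.default_rng(seed)
    run_ids = np.unique(runs)
    errs = []
    for _ in range(n_splits):
        test_runs = rng.choice(run_ids, size=n_test_runs, replace=False)
        test = np.isin(runs, test_runs)          # mask of held-out records
        beta, *_ = np.linalg.lstsq(X[~test], y[~test], rcond=None)
        errs.append(np.sqrt(np.mean((y[test] - X[test] @ beta) ** 2)))
    return float(np.mean(errs))

rmse = leave_runs_out_rmse(X, y, runs)
```

Holding out whole runs matters because records within a run are strongly correlated; record-level CV would leak information and make all models look better than they are.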
6 Snapshot Models

6.1 Linear Model

The following model was selected by the BIC model selection criterion. The number of selected variables is now much higher: in the snapshot model, the effects are too complex to be captured by only a few variables.

Call:
lm(formula = snlm$formula, data = ccFdata)

Residuals:
     Min       1Q   Median       3Q      Max
-0.63033 -0.09694  0.00396  0.11494  0.48256

Coefficients:
                      Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)          1.492e+00   1.705e+00    0.875  0.381944
I((DO - 50)^2)       9.907e-04   3.332e-04    2.973  0.003066 **
pCO2                 1.522e-01   2.442e-02    6.233  8.54e-10 ***
I((pH - 7.1)^2)      3.698e+01   4.867e+00    7.598  1.14e-13 ***
log(Xv)              7.013e-01   1.506e-01    4.658  3.93e-06 ***
Via                 -3.025e-02   9.257e-03   -3.268  0.001144 **
GLC                 -2.973e-01   1.400e-01   -2.123  0.034175 *
LAC                 -2.858e-01   9.404e-02   -3.039  0.002479 **
GLN                  5.613e-01   5.873e-02    9.557   < 2e-16 ***
GLU                 -1.950e+00   3.485e-01   -5.596  3.31e-08 ***
NH4                 -1.106e-01   2.415e-02   -4.580  5.64e-06 ***
OSM                 -8.029e-03   3.901e-03   -2.058  0.039999 *
Stress              -5.076e-02   1.549e-02   -3.276  0.001113 **
I((DO - 50)^2):pCO2  1.418e-05   4.040e-06    3.510  0.000482 ***
I((DO - 50)^2):Via  -8.782e-06   2.193e-06   -4.005  6.98e-05 ***
I((DO - 50)^2):GLC   8.047e-05   1.697e-05    4.741  2.65e-06 ***
I((DO - 50)^2):GLU  -1.739e-04   3.561e-05   -4.883  1.34e-06 ***
I((DO - 50)^2):NH4  -6.375e-05   2.582e-05   -2.469  0.013824 *
pCO2:Via            -6.531e-04   1.354e-04   -4.824  1.78e-06 ***
pCO2:GLC             3.431e-03   6.722e-04    5.105  4.44e-07 ***
pCO2:GLN            -6.369e-03   1.483e-03   -4.293  2.05e-05 ***
pCO2:OSM            -2.974e-04   5.367e-05   -5.540  4.50e-08 ***
I((pH - 7.1)^2):LAC  2.711e+00   3.745e-01    7.240  1.36e-12 ***
I((pH - 7.1)^2):NH4  6.535e-01   2.225e-01    2.937  0.003435 **
I((pH - 7.1)^2):OSM -1.449e-01   1.812e-02   -7.997  6.46e-15 ***
log(Xv):GLC          1.177e-01   1.607e-02    7.323  7.75e-13 ***
log(Xv):GLU         -2.613e-01   3.265e-02   -8.000  6.29e-15 ***
Via:GLC             -4.150e-03   8.275e-04   -5.015  6.97e-07 ***
Via:GLU              1.534e-02   1.466e-03   10.463   < 2e-16 ***
Via:Stress           6.505e-04   1.614e-04    4.031  6.26e-05 ***
GLC:LAC             -2.587e-02   7.330e-03   -3.529  0.000448 ***
GLC:NH4              1.571e-02   4.848e-03    3.241  0.001255 **
GLC:OSM              9.333e-04   3.859e-04    2.419  0.015863 *
LAC:NH4             -4.049e-02   8.656e-03   -4.678  3.58e-06 ***
LAC:OSM              1.038e-03   2.115e-04    4.911  1.16e-06 ***
GLN:Stress          -4.219e-03   1.587e-03   -2.658  0.008059 **
GLU:OSM              4.736e-03   8.420e-04    5.625  2.83e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1743 on 608 degrees of freedom
  (740 observations deleted due to missingness)
Multiple R-squared: 0.9526, Adjusted R-squared: 0.9498
F-statistic: 339.2 on 36 and 608 DF, p-value: < 2.2e-16

[1] 0.07678454
[Figure: diagnostic plots for the snapshot linear model (residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage); observations 90, 345 and 909 stand out in the residual plots, and 530, 532 and 1377 in the leverage plot.]

6.2 Random Forest

The importance ranking of the variables is now quite different from the black-box model: GLU, LAC, NH4 and GLN can be used together to estimate Titer, which makes a good "snapshot".

[1] "Titer~DO+pCO2+Via+GLC+LAC+GLN+GLU+NH4+OSM+Stress+pH+Xv"

Call:
 randomForest(formula = as.formula(snrf$formula), data = ccFdata, na.action = na.omit)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 4

          Mean of squared residuals: 0.003390071
                    % Var explained: 92.32

[1] 0.06970167
[Figure: snrffit variable importance (IncNodePurity), increasing from Stress, DO, GLC, pH, OSM, NH4, Via, pCO2, Xv, GLN, LAC to GLU.]

6.3 MARS

MARS has an importance plot similar to that of Random Forest.

Call: earth(formula=as.formula(snmars$formula), data=subset(ccFdata, !is.na(ccFdata$Titer)))

               coefficients
(Intercept)      0.52558053
h(DO-70)         0.00384701
h(70-DO)         0.00081744
h(39-pCO2)      -0.00387669
h(Via-95.1825)  -0.01859399
h(1.64-GLC)     -0.04381892
h(3.14-LAC)      0.06892429
h(GLN-2.09)      0.17083241
h(2.09-GLN)     -0.07104502
h(GLU-4.49)      0.11840427
h(4.49-GLU)     -0.13551714
h(NH4-5.19)     -0.01831231
h(5.19-NH4)      0.04728241
h(OSM-305)       0.00128608
h(305-OSM)       0.00120199
h(Stress-21)    -0.00161051
h(21-Stress)    -0.00473324
h(7.11-pH)      -0.20785734
h(Xv-2.96)      -0.01598677
h(2.96-Xv)      -0.10513829

Selected 20 of 25 terms, and 12 of 12 predictors
Importance: GLU, LAC, NH4, GLN, Xv, OSM, pH, Stress, Via, pCO2, DO, GLC
Number of terms at each degree of interaction: 1 19 (additive model)
GCV 0.004576446  RSS 2.605637  GRSq 0.8966136  RSq 0.9084545

[1] 0.07026124

[Figure: snapshot MARS model selection (GRSq/RSq against the number of terms), cumulative distribution of absolute residuals, residuals vs. fitted, and normal Q-Q plots; observations 586, 10 and 582 are flagged.]
[Figure: snapshot MARS variable importance (nsubsets, sqrt GCV, sqrt RSS): GLU, LAC, NH4, GLN, Xv, OSM, pH, Stress, Via, pCO2, DO, GLC.]

7 History Model (Naive Way)

Let us first use the input data only at time t and treat t simply as another predictor. Again, the target is the Titer on day 10. We explore several models.

7.1 Linear Regression

Since we now have around 1300 observations, we can fit a model with more parameters, and afterwards select a model using step.

Call:
lm(formula = Titer ~ I(DO^2) + DO + pHset + I(pHset^2) + Via +
    poly(GLC, 3) + poly(LAC, 3) + GLN + GLU + NH4 + OSM +
    poly(Ptime, 3) + DO:pHset + I(pHset^2):poly(Ptime, 3) +
    Via:poly(Ptime, 3) + poly(LAC, 3):poly(Ptime, 3) +
    GLN:poly(Ptime, 3) + GLU:poly(Ptime, 3) + NH4:poly(Ptime, 3) +
    OSM:poly(Ptime, 3), data = dat.agg)

Residuals:
      Min        1Q    Median        3Q       Max
-0.296451 -0.047078  0.002983  0.044872  0.190200

Coefficients:
              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept) -1.049e+00   5.200e-01   -2.017  0.044626 *
I(DO^2)     -3.644e-02   6.218e-03   -5.860  1.26e-08 ***
DO           1.747e-02   7.374e-03    2.369  0.018520 *
pHset        9.137e-03   7.090e-03    1.289  0.198499
I(pHset^2)                    -3.318e-02  5.477e-03  -6.058  4.31e-09 ***
Via                            1.341e-01  2.815e-02   4.763  3.02e-06 ***
poly(GLC, 3)1                 -8.356e-01  1.905e-01  -4.387  1.62e-05 ***
poly(GLC, 3)2                 -2.459e-01  1.455e-01  -1.690  0.092150 .
poly(GLC, 3)3                  1.980e-01  1.425e-01   1.389  0.165837
poly(LAC, 3)1                 -1.572e+02  4.936e+01  -3.184  0.001611 **
poly(LAC, 3)2                 -1.203e+02  3.832e+01  -3.138  0.001878 **
poly(LAC, 3)3                 -3.211e+01  1.041e+01  -3.084  0.002244 **
GLN                            3.912e-02  8.812e-03   4.439  1.29e-05 ***
GLU                            5.716e-02  2.163e-02   2.642  0.008686 **
NH4                            5.605e-02  7.730e-03   7.252  3.82e-12 ***
OSM                           -2.508e-02  2.934e-02  -0.855  0.393294
poly(Ptime, 3)1                2.465e+01  7.910e+00   3.117  0.002013 **
poly(Ptime, 3)2               -4.044e+01  1.110e+01  -3.642  0.000321 ***
poly(Ptime, 3)3                5.017e+00  1.286e+00   3.903  0.000119 ***
DO:pHset                       5.555e-03  2.814e-03   1.974  0.049324 *
I(pHset^2):poly(Ptime, 3)1     4.147e-01  8.829e-02   4.697  4.10e-06 ***
I(pHset^2):poly(Ptime, 3)2     1.152e-02  4.871e-02   0.236  0.813219
I(pHset^2):poly(Ptime, 3)3    -2.443e-01  1.301e-01  -1.878  0.061423 .
Via:poly(Ptime, 3)1           -1.151e+00  4.468e-01  -2.576  0.010507 *
Via:poly(Ptime, 3)2            2.055e+00  5.664e-01   3.628  0.000338 ***
Via:poly(Ptime, 3)3           -5.371e-01  3.308e-01  -1.624  0.105528
poly(LAC, 3)1:poly(Ptime, 3)1  2.404e+03  7.520e+02   3.197  0.001546 **
poly(LAC, 3)2:poly(Ptime, 3)1  1.869e+03  5.845e+02   3.199  0.001536 **
poly(LAC, 3)3:poly(Ptime, 3)1  5.054e+02  1.594e+02   3.172  0.001680 **
poly(LAC, 3)1:poly(Ptime, 3)2 -3.718e+03  1.095e+03  -3.395  0.000784 ***
poly(LAC, 3)2:poly(Ptime, 3)2 -2.955e+03  8.698e+02  -3.398  0.000776 ***
poly(LAC, 3)3:poly(Ptime, 3)2 -8.601e+02  2.569e+02  -3.347  0.000924 ***
poly(LAC, 3)1:poly(Ptime, 3)3  3.446e+02  1.065e+02   3.236  0.001353 **
poly(LAC, 3)2:poly(Ptime, 3)3  2.778e+02  8.433e+01   3.294  0.001110 **
poly(LAC, 3)3:poly(Ptime, 3)3  8.598e+01  2.539e+01   3.387  0.000805 ***
GLN:poly(Ptime, 3)1            2.287e-01  1.479e-01   1.547  0.123014
GLN:poly(Ptime, 3)2           -1.576e-01  1.730e-01  -0.911  0.363026
GLN:poly(Ptime, 3)3           -2.659e-01  1.555e-01  -1.710  0.088273 .
GLU:poly(Ptime, 3)1           -1.712e-01  3.530e-01  -0.485  0.628013
GLU:poly(Ptime, 3)2           -8.317e-01  4.445e-01  -1.871  0.062348 .
GLU:poly(Ptime, 3)3           -5.699e-01  2.756e-01  -2.068  0.039551 *
NH4:poly(Ptime, 3)1           -5.983e-01  1.399e-01  -4.278  2.57e-05 ***
NH4:poly(Ptime, 3)2           -3.588e-01  1.453e-01  -2.469  0.014148 *
NH4:poly(Ptime, 3)3            1.258e-01  1.830e-01   0.687  0.492460
OSM:poly(Ptime, 3)1            1.639e+00  3.637e-01   4.507  9.58e-06 ***
OSM:poly(Ptime, 3)2            1.621e+00  6.580e-01   2.464  0.014321 *
OSM:poly(Ptime, 3)3            2.464e-01  1.771e-01   1.391  0.165379
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.07481 on 288 degrees of freedom
Multiple R-squared: 0.8548, Adjusted R-squared: 0.8317
F-statistic: 36.87 on 46 and 288 DF, p-value: < 2.2e-16
[Figure: diagnostic plots for the naive history linear model (residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage); observations 104, 258 and 323 stand out in the residual plots, and 135, 136 and 196 in the leverage plot.]

The linear model does not capture all of the effects. We can identify some outlier runs:

> (observations <- unique(dat.agg[which(abs(lm.agg$residuals) > 0.2), "Run"]))
[1] 630 770

7.2 Mars

We perform model fitting and variable selection with MARS.

Call: earth(formula=Titer~DO+pHset+Xv+Stress+pCO2+Via+
            GLC+LAC+GLN+GLU+NH4+OSM+Stress+pH+pdiff+
            Ptime, data=dat.agg, trace=0, ncross=3, nfold=10)

                   coefficients
(Intercept)          0.74650082
h(DO- -1.06901)     -0.03339099
h(-1.06901-DO)      -0.22191908
h(pHset-0.444749)   -0.12856755
h(0.444749-pHset)   -0.10199167
h(0.580131-Xv)      -0.03825405
h(0.810513-Stress)  -0.09597828
h(pCO2- -0.471402)   0.01252083
h(Via-0.390253)      0.18390409
h(GLC-1.1171)       -0.05487600
h(GLN- -0.203858)    0.02302284
h(-0.203858-GLN)     0.03187502
h(NH4- -0.0225122)   0.15374637
h(NH4-1.01414)      -0.17152919
h(pH- -0.664392)     0.04304223
h(pH-1.70568)       -0.23513685

Selected 16 of 26 terms, and 10 of 15 predictors
Importance: DO, pHset, NH4, Stress, Xv, pH, GLC, GLN, Via, pCO2, ...
Number of terms at each degree of interaction: 1 15 (additive model)
GCV 0.006192977  RSS 1.708448  GRSq 0.8142985  RSq 0.8461598  cv.rsq 0.7821464

Note: the cross-validation sd's below are standard deviations across folds
Cross validation: nterms 16.90 sd 1.88  nvars 9.33 sd 0.92
cv.rsq 0.78 sd 0.069  MaxErr -0.29 sd 0.19
[Figure: naive history MARS variable importance (DO, pHset, NH4, Stress, Xv, pH, GLC, GLN, Via, pCO2), model selection, cumulative distribution of absolute residuals, residuals vs. fitted, and normal Q-Q plots; observations 104, 119 and 323 are flagged.]
Take a look at the outliers.

> dat.agg[c(104, 119, 323), ]
     Run WD     Ptime       DO       pCO2       pH        Xv       Via
104  630  0 0.0000000 0.159614  0.1237602 1.287431 -1.580576 0.3932955
119  635  0 0.0000000 1.388239  0.9739924 1.426846 -1.894227 0.4848425
323 1066  1 0.7791667 0.159614 -1.0665650 1.217723 -1.525812 0.1411053
          GLC        LAC       GLN       GLU          NH4          OSM
104 1.0780028 -0.7579071 0.8434386 -1.671864  0.006689145 -0.045629237
119 1.3174515 -0.9828156 0.9841203 -1.820554 -0.292624973 -0.007493617
323 0.6821795 -0.4139295 0.2494495 -1.790816  0.262201197  0.068777622
        Stress        Xv0       pHset           RunNo
104 -0.2096149  0.3312174  0.03925127 BIOS-3.5L - 630
119 -0.2096149 -2.2207835 -2.39373696 BIOS-3.5L - 635
323 -0.2096149 -1.7751961  0.03925127    TACI-3L-1066
                                Discr.     Titer     pdiff pCO2t
104                Standard conditions 0.3510000 0.6432812     0
119   DO 70% / pH 6.70 / seeding 1 mio 0.0858000 0.6542345     0
323  STD condition (Control - no loop) 0.8432867 0.5062897     0

7.3 Random Forest

The parameter mtry = 10 is determined by tuneRF to optimize performance.

> library(randomForest)
> rf.agg <- randomForest(Titer ~ DO + pHset + Xv + Stress + pCO2 + Via +
+                        GLC + LAC + GLN + GLU + NH4 + OSM + Stress + pH +
+                        pdiff + Ptime, data = dat.agg, importance = TRUE, mtry = 10)
> rf.agg

Call:
 randomForest(formula = Titer ~ DO + pHset + Xv + Stress + pCO2 + Via + GLC + LAC + GLN + GLU + NH4 + OSM +
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 10

          Mean of squared residuals: 0.005391989
                    % Var explained: 83.73
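tuneRF performs a greedy search around the default mtry, moving up or down by a step factor while the out-of-bag error keeps improving by a relative threshold. The search logic can be sketched as follows; this is a Python sketch in the spirit of tuneRF, where `oob_error` is a hypothetical stand-in for refitting the forest and reading off its OOB error:

```python
def tune_mtry(oob_error, start, p, step=2, improve=0.05):
    """Greedy mtry search: from `start`, try multiplying/dividing by `step`
    while the OOB error improves by at least `improve` (relative)."""
    best, best_err = start, oob_error(start)
    for move in (lambda m: min(p, m * step), lambda m: max(1, m // step)):
        m, err = best, best_err
        while True:
            m2 = move(m)
            if m2 == m:
                break                      # hit the boundary 1..p
            e2 = oob_error(m2)
            if e2 < err * (1 - improve):
                m, err = m2, e2            # keep moving in this direction
            else:
                break
        if err < best_err:
            best, best_err = m, err
    return best

# Hypothetical OOB error curve with its minimum at mtry = 10.
print(tune_mtry(lambda m: (m - 10) ** 2 + 1.0, start=5, p=15))  # -> 10
```

The real tuneRF evaluates each candidate by growing a new forest, which is why the search is deliberately greedy rather than exhaustive.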
[Figure: rf.agg out-of-bag error against the number of trees, and variable importance by %IncMSE and IncNodePurity; DO, pHset, NH4, LAC and Stress rank highest under both measures.]
In both importance plots, the seven most important variables are the same.

7.4 Decision Tree

plotcp is used to determine cp for the decision tree.

> library(rpart)
> tr.agg <- rpart(Titer ~ DO + pHset + Stress + Xv + pCO2 +
+                 Via + GLC + LAC + GLN + GLU + NH4 + OSM +
+                 Stress + pH + pdiff + Ptime, data = dat.agg)
> (tr.agg <- prune(tr.agg, cp = 0.019))
n= 335

node), split, n, deviance, yval
      * denotes terminal node

 1) root 335 11.10534000 0.5719146
   2) DO< -1.683324 39 0.30707160 0.2245385 *
   3) DO>=-1.683324 296 5.47207100 0.6176837
     6) pHset< -1.988239 18 0.17149480 0.2748954 *
     7) pHset>=-1.988239 278 3.04856000 0.6398787
      14) NH4< 0.3899572 202 1.52384000 0.6085717
        28) LAC>=0.002812559 9 0.02199422 0.4344444 *
        29) LAC< 0.002812559 193 1.21623800 0.6166916 *
      15) NH4>=0.3899572 76 0.80050930 0.7230893 *
> plot(tr.agg)
> text(tr.agg)

[Figure: pruned tree with splits on DO, pHset, NH4 and LAC; leaf values 0.2245, 0.2749, 0.4344, 0.6167, 0.7231.]
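The pruned five-leaf tree is small enough to transcribe directly, which is what makes it easy to interpret. As a readability check, here is a Python sketch of its prediction rule, with split points and leaf values taken from the rpart output above (inputs are on the standardized scale):

```python
def titer_pred(DO, pHset, NH4, LAC):
    """Prediction rule of the pruned 5-leaf tree (standardized inputs)."""
    if DO < -1.683324:           # low dissolved oxygen -> lowest Titer
        return 0.2245
    if pHset < -1.988239:        # very low pH setpoint
        return 0.2749
    if NH4 < 0.3899572:
        # low ammonium: lactate accumulation separates two leaves
        return 0.4344 if LAC >= 0.002812559 else 0.6167
    return 0.7231                # high ammonium -> highest Titer leaf

print(titer_pred(DO=0.0, pHset=0.0, NH4=0.0, LAC=-0.5))  # -> 0.6167
```

Reading the rule this way makes the process interpretation immediate: avoiding very low DO and pHset is the first-order requirement for a reasonable Titer.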
7.5 Neural Network

> nn.agg <- nnet(Titer ~ DO + pHset + Xv + Stress + pCO2 + Via +
+               GLC + LAC + GLN + GLU + NH4 + OSM + Stress + pH + pdiff,
+               data = dat.agg, size = 4, skip = TRUE, decay = 4e-4,
+               lin.out = FALSE, maxit = 2000, trace = FALSE)
> nn.agg
a 14-4-1 network with 79 weights
inputs: DO pHset Xv Stress pCO2 Via GLC LAC GLN GLU NH4 OSM pH pdiff
output(s): Titer
options were - skip-layer connections  decay=4e-04

7.6 Cross Validation

Because of the nature of our data, it is better to leave whole runs of the experiment out when performing cross validation. We again use MSE to assess performance. Overall, the performance is better than in the first task.

RMSECVs of Different Models
     Linear Model       Mars Random Forest Decision Tree Neural Network
[1,]   0.09515759 0.09222569    0.07226036     0.1000236     0.09316403

8 History Models

Here we fit historical models and perform CV on them to measure their predictive power. The inputs of a historical model always include the two strictly controlled variables that are kept constant, DO and Stress. For a historical model using the history up to day i, we additionally use the ten variables pCO2, Xv, Via, GLC, pH, LAC, GLN, GLU, NH4 and OSM from day 0 to day i. This gives 2 + 10(i + 1) input variables. We obtain them by unfolding the raw data. In case of missing values, we impute them using missForest on the unfolded historical input values from day 0 to day i. This is done in the function ArrHist. We produce two tables for each statistical model: one with the variance explained in CV and one with the RMSECV. We calculate the variance explained as 1 − (Σᵢ(ŷᵢ − yᵢ)²)/(Σᵢ(yᵢ − ȳ)²) in each run of cross validation and average these values to get the final result. Similarly, RMSECV is also the mean over all cross-validation runs. Now we see how the models perform under this setting.
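The per-fold variance-explained statistic defined above can be computed as in this minimal Python sketch (the report's actual computation is in R; the predictions here are hypothetical):

```python
def cv_variance_explained(y_true, y_pred):
    """1 - sum((yhat_i - y_i)^2) / sum((y_i - ybar)^2), per CV run."""
    y_true = list(y_true)
    ybar = sum(y_true) / len(y_true)
    ss_res = sum((p - t) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - ybar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Perfect predictions give 1; always predicting the mean gives 0.
print(cv_variance_explained([1, 2, 3], [1, 2, 3]))  # -> 1.0
print(cv_variance_explained([1, 2, 3], [2, 2, 2]))  # -> 0.0
```

Note that on held-out data this quantity can be negative, which is exactly what the tables below show for some of the weakest (i, j) models: the model then predicts worse than the constant mean.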
8.1 Linear Model

The (i, j)th history linear model is defined as follows:

log(Titer_j) = β₀ + Σ_{t=0}^{i} [ β_{1,t}(DO_t − 50)² + β_{2,t} pCO2_t + β_{3,t}(pH_t − 7.1)²
             + β_{4,t} log(Xv_t) + β_{5,t} Via_t + β_{6,t} GLC_t + β_{7,t} LAC_t
             + β_{8,t} GLN_t + β_{9,t} GLU_t + β_{10,t} NH4_t + β_{11,t} OSM_t
             + β_{12,t} Stress_t ]

The following table shows the RMSECV values; the RMSECV value of the (i, j)th model is the number in the ith row and jth column.
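Unfolding a run's day-wise history into one flat design row, two constant controls plus ten monitored variables per day 0..i, matching the 2 + 10(i + 1) count above, can be sketched as follows (Python sketch; `run` is a hypothetical record, with constants for the controls and a list of daily values per monitored variable):

```python
def unfold_history(run, i, controls=("DO", "Stress")):
    """Flatten one run's records into a single history-model input row."""
    monitored = ["pCO2", "Xv", "Via", "GLC", "pH",
                 "LAC", "GLN", "GLU", "NH4", "OSM"]
    row = [run[c] for c in controls]              # 2 constant controls
    for t in range(i + 1):                        # days 0 .. i
        row.extend(run[v][t] for v in monitored)  # 10 values per day
    return row

# Hypothetical run: constant controls plus 5 days of monitored values.
run = {"DO": 50, "Stress": 2.2,
       **{v: [0.1 * t for t in range(5)]
          for v in ["pCO2", "Xv", "Via", "GLC", "pH",
                    "LAC", "GLN", "GLU", "NH4", "OSM"]}}
print(len(unfold_history(run, i=3)))  # -> 2 + 10*4 = 42
```

The (i, j) model is then fit on these rows against log(Titer) on day j, with missing daily values imputed beforehand (missForest in the report).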
RMSECV

            Y0           Y1           Y2           Y3           Y4
X0    3.200970e-03 1.962524e-03 4.279477e-03 2.301885e-03 2.446558e-03
X0_1               1.545867e-02 1.659791e-02 2.354038e-03 1.230346e-03
X0_2                            1.730547e-02 4.075473e-03 1.085479e-03
X0_3                                         4.041385e-03 2.398019e-03
X0_4                                                      3.482082e-03

            Y5           Y6           Y7           Y8           Y9
X0    2.051734e-03 3.416656e-03 6.081081e-03 7.219461e-03 1.941174e-02
X0_1  2.327897e-03 3.313077e-03 6.668536e-03 6.489212e-03 1.036755e-02
X0_2  3.283539e-03 1.628859e-03 5.512231e-03 3.368808e-03 7.914411e-03
X0_3  2.395535e-03 1.284514e-03 2.544108e-03 2.998240e-03 5.437181e-03
X0_4  1.389313e-03 1.658926e-03 2.458271e-03 2.497301e-03 1.091136e-02
X0_5  2.656460e-03 2.030364e-03 2.326792e-03 2.966961e-03 6.045417e-03
X0_6               2.639904e-03 2.232001e-03 3.808531e-03 1.056563e-02
X0_7                            1.251427e-02 6.530875e-03 1.346802e-02
X0_8                                         3.184551e+02 7.913746e+13
X0_9                                                      4.492133e+02

            Y10
X0    1.961133e-02
X0_1  5.580546e-02
X0_2  1.869011e-02
X0_3  1.359189e-02
X0_4  7.909511e-03
X0_5  1.306489e-02
X0_6  1.450043e-02
X0_7  3.412453e-02
X0_8  1.202324e+11
X0_9  1.613414e+00
X0_10 1.289649e+00
[Figure: heatmap of RMSECV for the history models based on the linear model, by history used (0–10) and Titer target day (0–10).]

8.2 Random Forest

Variance explained in CV

             Y0          Y1           Y2           Y3          Y4          Y5
X0    0.28324661  0.26155005  -0.09489543  -0.03088496  0.39081454  0.43354914
X0_1               0.37598610   0.21074302   0.35566190  0.31553364  0.37313831
X0_2                            0.16472940   0.36469677  0.44168895  0.40601249
X0_3                                         0.25131256  0.35404242  0.51706127
X0_4                                                     0.37300694  0.56438519
X0_5                                                                 0.54745573

             Y6          Y7          Y8          Y9         Y10
X0    0.44032714  0.55857068  0.57889201  0.64090502  0.55601625
X0_1  0.53608831  0.50026635  0.55488094  0.72923139  0.71702833
X0_2  0.69563541  0.67904955  0.69803823  0.66081810  0.71536101
X0_3  0.66675711  0.72303901  0.71583577  0.71933173  0.77114221
X0_4  0.77659188  0.77304970  0.77028128  0.80331752  0.74513898
X0_5  0.79546288  0.79299697  0.81030901  0.73912138  0.78821029
X0_6  0.76082620  0.76023416  0.79248812  0.82078319  0.79484015
X0_7              0.81875451  0.80675983  0.82533617  0.81928950
X0_8                          0.86596990  0.82482254  0.80067588
X0_9                                      0.87802758  0.84462589
X0_10                                                 0.81461628

RMSECV
               Y0          Y1          Y2          Y3          Y4          Y5
X0     0.04226597  0.05005782  0.05790429  0.05367666  0.04880549  0.05140720
X0_1               0.03811918  0.05222506  0.03665541  0.04491827  0.04562487
X0_2                           0.04978004  0.04001816  0.04554554  0.04125988
X0_3                                       0.04335484  0.03659673  0.04576400
X0_4                                                   0.04272663  0.03664301
X0_5                                                               0.03671531
               Y6          Y7          Y8          Y9         Y10
X0     0.06260631  0.07307958  0.08436972  0.09963318  0.10319883
X0_1   0.05939476  0.07320302  0.08885598  0.07992313  0.08665985
X0_2   0.04182959  0.06069709  0.06802562  0.09828617  0.08913791
X0_3   0.05082051  0.04949298  0.06647324  0.07765117  0.08482019
X0_4   0.03694516  0.05087756  0.05854120  0.06370206  0.08794341
X0_5   0.04183240  0.04704620  0.05592446  0.07192094  0.07872577
X0_6   0.03919570  0.05291936  0.05561075  0.07081846  0.07979003
X0_7               0.04348533  0.05150214  0.06694091  0.06891756
X0_8                           0.04623633  0.06832347  0.07381787
X0_9                                       0.05664780  0.07072329
X0_10                                                  0.06890835

[Figure: heatmap of RMSECV for the history model based on Random Forest; x-axis: Titer day 0-10, y-axis: history used 0-10.]

8.3 MARS

Variance explained in CV

                Y0            Y1            Y2            Y3            Y4
X0     0.069814132  -0.005095947   0.024849356  -0.034696230  -0.120319657
X0_1                 0.007649867  -0.396661134  -0.275546932  -0.033114509
X0_2                              -0.827848243  -0.204988340  -1.123241366
X0_3                                            -0.239974296   0.267732287
X0_4                                                          -1.130404725
                Y5            Y6            Y7            Y8            Y9
X0     0.363899116   0.330268801   0.508385257   0.305851870   0.599098171
X0_1  -0.150206083   0.147392088   0.375098045   0.089004911   0.342867469
X0_2   0.287235896   0.651413305   0.651947560   0.742131695   0.616545382
X0_3   0.233978160   0.517310655   0.669245506   0.804495843   0.768104648
X0_4   0.397724415   0.704167262   0.821355319   0.839638561   0.836162082
X0_5   0.512508645   0.686174900   0.814525758   0.846335693   0.780111923
X0_6                 0.846201740   0.823482389   0.841185054   0.823600595
X0_7                               0.763357781   0.908825907   0.814810324
X0_8                                             0.850118872   0.714621386
X0_9                                                           0.598640731
               Y10
X0    -0.266671250
X0_1   0.099612543
X0_2   0.296458653
X0_3   0.673287674
X0_4   0.681350544
X0_5   0.706625242
X0_6   0.812505473
X0_7   0.789365935
X0_8   0.657600818
X0_9   0.742572597
X0_10  0.830697666

RMSECV

               Y0          Y1          Y2          Y3          Y4          Y5
X0     0.05414481  0.06435345  0.04914698  0.04440181  0.04899524  0.04751189
X0_1               0.06349293  0.07040039  0.07104746  0.04787156  0.06328275
X0_2                           0.08353868  0.05609517  0.06006556  0.05632823
X0_3                                       0.05677938  0.05112222  0.05670555
X0_4                                                   0.07399238  0.04632260
X0_5                                                               0.04234317
               Y6          Y7          Y8          Y9         Y10
X0     0.06136661  0.06598322  0.10313729  0.10811682  0.16777832
X0_1   0.06212712  0.09116696  0.10869171  0.12338118  0.17132883
X0_2   0.05033973  0.05957265  0.06570322  0.09632618  0.12433817
X0_3   0.05168125  0.06504277  0.06069211  0.07182945  0.10263122
X0_4   0.03954150  0.04350307  0.05051779  0.06939214  0.09432636
X0_5   0.04101469  0.04404178  0.04560743  0.06953788  0.08301654
X0_6   0.02986677  0.04677269  0.05327379  0.07087714  0.07679419
X0_7               0.04676462  0.04301543  0.07409887  0.07206619
X0_8                           0.04281923  0.07792298  0.09770436
X0_9                                       0.09445278  0.08202713
X0_10                                                  0.07658425
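The tables in this section all share the same layout: entry (X0_i, Yj) is the cross-validated error (or variance explained) when the titer on day j is predicted from all measurements up to and including day i, so only the triangle i ≤ j is filled. The project code is in R; the following is a minimal sketch of that evaluation grid in Python with scikit-learn, using purely synthetic stand-in data, so all names and dimensions here are illustrative assumptions rather than the project's actual code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

# Synthetic stand-in data: daily measurements and a (roughly increasing) titer.
rng = np.random.default_rng(0)
n_batches, n_days, n_vars = 40, 11, 8
X_daily = rng.normal(size=(n_batches, n_days, n_vars))       # days 0..10
titer = np.cumsum(rng.uniform(0.0, 0.1, size=(n_batches, n_days)), axis=1)

def rmsecv(history_used, target_day, n_splits=5):
    """CV RMSE for predicting the titer on `target_day` from all
    measurements up to and including day `history_used`."""
    X = X_daily[:, : history_used + 1, :].reshape(n_batches, -1)
    y = titer[:, target_day]
    sq_errors = []
    for train, test in KFold(n_splits, shuffle=True, random_state=0).split(X):
        model = RandomForestRegressor(n_estimators=30, random_state=0)
        model.fit(X[train], y[train])
        sq_errors.append((model.predict(X[test]) - y[test]) ** 2)
    return float(np.sqrt(np.mean(np.concatenate(sq_errors))))

# Triangular grid as in the tables: history can only use days <= target day.
rmse_grid = {(h, d): rmsecv(h, d) for d in range(n_days) for h in range(d + 1)}
```

Under this scheme the "variance explained in CV" entries correspond to the pseudo-R² 1 − MSE_CV / Var(y), which explains why they can be negative when a model does worse than predicting the mean.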
[Figure: heatmap of RMSECV for the history model based on MARS; x-axis: Titer day 0-10, y-axis: history used 0-10.]

9 Evaluation of Results

All models perform reasonably well. Random forest has the most stable performance, which is expected. It is also easy to implement. There are some tuning parameters to consider, e.g., the number of trees and the resampling size, but the default settings work well most of the time, and even if the number of trees is chosen too large this is not a big problem.

Linear regression also performs reasonably well. Its performance relies heavily on variable selection, interaction specification and transformations, which is why linear regression is not as straightforward to implement as the other models. A compromise is to fit a large model first and then use stepwise selection to reduce the number of parameters. A more recent approach is penalized linear regression, such as the lasso or ridge regression. See [2].

MARS is easy to implement, because it adjusts to nonlinearity automatically. Here its predictive power is not as good as that of the other two models, which might be a result of overfitting. It is possible to tune its parameters. See [3].

References

[1] Daniel J. Stekhoven. Using the missForest package, 2011. Available from http://stat.ethz.ch/education/semesters/ss2012/ams/paper/missForest_1.2.pdf.

[2] Peter Bühlmann and Martin Mächler. Computational statistics, 2014. Available from http://stat.ethz.ch/education/semesters/ss2014/CompStat/sk.pdf.

[3] Stephen Milborrow. Notes on the earth package, 2014. Available from http://cran.r-project.org/web/packages/earth/vignettes/earth-notes.pdf.
[4] Leo Breiman and Adele Cutler (Fortran original), Andy Liaw and Matthew Wiener (R port). Package 'randomForest', 2012. Available from http://stat-www.berkeley.edu/users/breiman/RandomForests.

[5] Terry M. Therneau and Elizabeth J. Atkinson, Mayo Foundation. An Introduction to Recursive Partitioning Using the RPART Routines, 2013. Available from http://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf.