Process Design and Optimization of Bioprocesses with Quality by
Design Approach
Kyunghee Cho, Yunhao He
June 11, 2014
1 Introduction
In this project, we work with data from a pharmaceutical company that uses a new biochemical method
to produce drugs. The ultimate goal is to improve the process and develop models that aid decision making.
In general, we hope to extract more information from fewer measurements. Due to time constraints, we focus on
prediction in the classical statistical setting: predicting an output Y from X using
a model developed from data X1, X2, · · · , Xn and Y1, Y2, · · · , Yn. Many models exist for this kind of
prediction task; we use only a few of them in this project.
1.1 Data Description
In this part we briefly describe the data. The following shows what the raw data look like.
Batch.ID Run WD Ptime DO CO2 pH Stress GLC LAC GLN GLU NH4
1 1 539 0 0.0000000 50 46 7.12 2.2 7.00 0.67 1.88 1.32 3.55
2 1 539 1 0.8368056 50 21 7.10 2.2 5.95 1.52 1.29 1.67 3.97
3 1 539 2 1.7986111 50 23 6.97 2.2 5.13 2.06 1.31 1.80 4.11
4 1 539 3 2.8020833 50 33 6.97 2.2 3.85 2.16 1.28 2.64 4.57
5 1 539 4 3.8680556 50 39 6.98 2.2 3.02 1.93 1.73 2.57 4.45
6 1 539 5 4.8993056 50 40 7.02 2.2 2.27 1.71 2.07 3.53 4.20
OSM Xv Via Titer
1 322.0000 1.45 97.97297 0.03913624
2 315.4865 2.07 97.64151 0.05251863
3 308.0000 3.30 97.34513 0.05340000
4 305.0906 4.97 97.83465 0.10257388
5 302.0000 7.02 97.50000 0.14200000
6 298.0000 7.75 96.27329 0.19834530
In a Run of an experiment, all the variables are measured from day WD 1 to WD 10, sometimes 11 or
12. The controlled variables are DO, CO2, pH and Stress. Variables such as GLC, LAC, GLN,
GLU, NH4, OSM, Xv and Via are measured throughout to monitor the state of the production. Titer is
the output we are interested in.
1.2 Missing Data Imputation
The raw data contain quite a few missing values. To carry out a sensible statistical analysis, we imputed
missing data for both the input values and the output values. Missing Titer values were interpolated by our client.
Missing values in the inputs were imputed by MissForest, a recent method based on
random forests. See [1].
1.3 Data Standardization
It is often good practice to normalize the data to have mean 0 and variance 1. In our analysis,
the data are standardized for all models except the linear models, since we use more sophisticated customized
transformations for the linear regressions. As a result, the coefficients of the linear regressions and of MARS
always differ in scale.
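The standardization itself is just centring and scaling; a base-R sketch on a few made-up values:

```r
# Toy data standing in for the measured inputs (values invented).
dat <- data.frame(GLC = c(7.00, 5.95, 5.13, 3.85),
                  OSM = c(322, 315, 308, 305))

# scale() centres each column to mean 0 and scales it to variance 1.
dat.std <- as.data.frame(scale(dat))

colMeans(dat.std)       # ~ 0 for every column
apply(dat.std, 2, sd)   # 1 for every column
```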
2 Information on Models Used
2.1 Linear Model
Linear regression is the most mature and widely used method. Although it is quite simple and intuitive,
it sometimes has good prediction power. We can easily tell whether any model assumption is violated by
looking at the diagnostic plots of the linear fit.
To make a linear model work best, it is necessary to specify the predictors manually. So we have to
consider which transformations to apply, and which interactions or higher-order terms to include or leave out.
One easy way to do this is to fit a big model at the beginning and use stepwise selection by the AIC (or BIC)
criterion¹, which balances goodness of fit against model complexity.
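This fit-big-then-prune strategy can be sketched on a built-in dataset (mtcars stands in for the project data; step with the default k = 2 uses AIC, and k = log(n) gives the BIC penalty):

```r
# Full model with all available predictors.
full <- lm(mpg ~ ., data = mtcars)

# Backward/forward stepwise search; k = 2 is AIC (the default),
# k = log(n) gives the BIC penalty.
fit.aic <- step(full, trace = 0)
fit.bic <- step(full, trace = 0, k = log(nrow(mtcars)))

formula(fit.aic)   # the selected submodel
```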
To assess variable importance, a simple approach is to look at the p-values or t-values, provided the
model assumptions are met.
2.2 Decision Tree
A decision tree is a scale-independent statistical model. It is easy to implement and deals with interactions
naturally. Its biggest advantage is that it can be visualized easily. See [5].
2.3 Random Forest
All the information in this subsection is based on [4]. Random forest is an ensemble method that combines
many decision trees; in each tree, the observations and variables are subsampled so that only part of them
is used. Thanks to the implicit bootstrapping, the model suffers less from overfitting and has good prediction
power.
Random forest has built-in mechanisms to estimate importance of variables. The two measures are
described in the following way:
• The first measure is computed from permuting OOB data: for each tree, the prediction
  error on the out-of-bag portion of the data is recorded (error rate for classification,
  MSE for regression). Then the same is done after permuting each predictor variable.
  The differences between the two are then averaged over all trees, and normalized by the
  standard deviation of the differences. If the standard deviation of the differences is equal
  to 0 for a variable, the division is not done (but the average is almost always equal to 0
  in that case).
• The second measure is the total decrease in node impurities from splitting on the variable,
  averaged over all trees. For classification, the node impurity is measured by the Gini
  index. For regression, it is measured by the residual sum of squares.
The package randomForest has a function varImpPlot to plot the importance of variables easily.
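A self-contained sketch of both importance measures with the randomForest package on a built-in dataset (airquality stands in for the project data):

```r
library(randomForest)

set.seed(1)
aq <- na.omit(airquality)   # built-in data as a stand-in

# importance = TRUE requests the permutation-based measure as well.
rf <- randomForest(Ozone ~ ., data = aq, importance = TRUE)

importance(rf)   # columns: %IncMSE (permutation) and IncNodePurity
varImpPlot(rf)   # dot chart of both measures
```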
2.4 MARS
All the information in this subsection is based on [3].
MARS (multivariate adaptive regression splines) is an adaptive extension of linear regression. The
final model is a linear regression with terms of the form (xj − d)+ and (d − xj)+ and higher-order interactions
of such terms. The algorithm adds terms in a forward pass and then prunes the model back to the point where
the GCV is minimized.
MARS can also estimate the importance of variables. From the earth vignette [3]:
• The nsubsets criterion counts the number of model subsets that include the variable.
  Variables that are included in more subsets are considered more important.
  By "subsets" we mean the subsets of terms generated by the pruning pass. There is one
  subset for each model size (from 1 to the size of the selected model) and the subset is
  the best set of terms for that model size. (These subsets are specified in $prune.terms in
  earth's return value.) Only subsets that are smaller than or equal in size to the final
  model are used for estimating variable importance.
¹See http://stat.ethz.ch/R-manual/R-devel/library/stats/html/step.html. For more information about BIC, see
http://en.wikipedia.org/wiki/Bayesian_information_criterion
• The rss criterion first calculates the decrease in the RSS for each subset relative to the
  previous subset. (For multiple-response models, RSSs are calculated over all responses.)
  Then for each variable it sums these decreases over all subsets that include the variable.
  Finally, for ease of interpretation the summed decreases are scaled so the largest summed
  decrease is 100. Variables which cause larger net decreases in the RSS are considered
  more important.
• The gcv criterion is the same, but uses the GCV instead of the RSS. Adding a variable
  can increase the GCV, i.e., adding the variable has a deleterious effect on the model.
  When this happens, the variable could even have a negative total importance, and thus
  appear less important than unused variables.
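The fit and the importance criteria above can be sketched with the earth package on a built-in dataset (trees stands in for the project data):

```r
library(earth)   # CRAN package implementing MARS

# Built-in data as a stand-in for the project data.
fit <- earth(Volume ~ Girth + Height, data = trees)

summary(fit)   # linear model in hinge terms such as h(Girth - d)
evimp(fit)     # variable importance: nsubsets, gcv and rss columns
```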
2.5 Neural Network
A neural network is a flexible model which is in a sense an extension of linear regression. See
http://en.wikipedia.org/wiki/Artificial_neural_network.
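A minimal sketch with the nnet package (shipped with R), on standardized stand-in data; linout = TRUE is used here because this is a regression problem:

```r
library(nnet)

set.seed(1)
# Standardized built-in data as a stand-in for the project data.
aq <- as.data.frame(scale(na.omit(airquality)))

# One hidden layer with 4 units and weight decay for regularization.
nn <- nnet(Ozone ~ Wind + Temp + Solar.R, data = aq,
           size = 4, decay = 1e-3, linout = TRUE,
           maxit = 500, trace = FALSE)

mean((predict(nn) - aq$Ozone)^2)   # training MSE
```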
3 Overview of Three Main Tasks
This statistical analysis comprises three main tasks, in each of which all of the above-mentioned models
were used, in particular linear regression, random forest and MARS. Only for the third task was a different
dataset used, in which missing Titer values are interpolated by a logistic function fitted within each
run.
3.1 First Task: Blackbox Models
Maximizing the output of a useful product by controlling the experimental conditions is of primary interest.
The first blackbox models are fitted using only the four controlled variables DO, Stress, pHset and
Xv0 as inputs to predict the output variable, Titer at day 10. Secondly, all other input variables
at day 0 are used as inputs to predict Titer at day 10.
3.2 Second Task: Snapshot Models
In order to save the cost of measuring Titer, it is useful to predict its current value during
an experiment from the current values of the input variables, which are relatively cheap to measure.
In the snapshot approach each observation is considered independent, and the input variables at
day t are used to predict Titer at day t.
3.3 Third Task: History Models
The history approach can be regarded as an extension of both the blackbox models and the snapshot
models. For the history models, not only the current values of the input variables at day t are considered
as predictors but also how they have changed over time in the past, that is, the history of
the input variables. All future Titer values, as well as the one at day t, are to be predicted. In
other words, an (i, j) history model uses the input variables at days 0, 1, ..., i as predictors to predict
the Titer value at day j (j ≥ i).
4 Model Comparison
In order to compare the prediction performance between the three statistical models, the cross-validation
method was used.
1. First, the data is randomly split into a training set and a test set by Run ID; that is, 30 of the
122 runs are randomly sampled as the test set.
2. Then a model is fitted on the training set, i.e., the rest of the data.
3. The MSE (mean squared error) is calculated on the test set.
4. This is repeated 5 times, and from the 5 resulting MSEs the RMSECV is calculated, which is defined
as follows.
For blackbox or history models:

RMSECV = sqrt( Σ_{k=1..5} Σ_{i=1..30} (y_i^(k) − ŷ_i^(k))^2 / (5·30) )

where y_i^(k) and ŷ_i^(k) are the true and predicted Titer values of the ith sample run in the kth
cross-validation, respectively.
For snapshot models:

RMSECV = sqrt( Σ_{k=1..5} Σ_{i=1..30} Σ_{j=1..n_i} (y_ij^(k) − ŷ_ij^(k))^2 / (5·30·n_i) )

where y_ij^(k) and ŷ_ij^(k) are the true and predicted Titer values of the ith sample run in the kth
cross-validation at day j, respectively.
In case the target variable is transformed, the RMSECV is calculated based on back-transformed
values of both true and predicted values.
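The leave-30-runs-out procedure above can be sketched in base R on synthetic data (the data frame, the simple lm model and the noise level are placeholders for the project's actual data and models):

```r
set.seed(1)
# Synthetic stand-in: 122 runs with 3 observations each.
dat <- data.frame(Run = rep(1:122, each = 3), x = rnorm(366))
dat$Titer <- 0.5 * dat$x + rnorm(366, sd = 0.1)

K <- 5; n.test <- 30
sq.err <- numeric(0)
for (k in 1:K) {
  test.runs <- sample(unique(dat$Run), n.test)   # hold out whole runs
  train <- dat[!dat$Run %in% test.runs, ]
  test  <- dat[ dat$Run %in% test.runs, ]
  fit <- lm(Titer ~ x, data = train)             # any model fits in here
  sq.err <- c(sq.err, (predict(fit, newdata = test) - test$Titer)^2)
}
(RMSECV <- sqrt(mean(sq.err)))
```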
5 Blackbox Models
This is a prediction problem in a classical statistical setting. Before using any statistical models, it is
helpful to get an intuition of what the dataset looks like.
[Figure: Final Titer in different runs without transformation — final Titer (about 0.2 to 0.8) plotted against Run ID (about 600 to 1000).]
Since we are using only a small part of the raw data here, the number of observations is 122.
If only prediction power is of interest, you can jump to Section 5.6.
[Figure: MARS variable importance — nsubsets counts and normalized sqrt GCV/RSS; variables shown include DO, pHset, NH4, Via, GLC, LAC and OSM.]
5.3 Random Forest
Using Random Forest, we can easily see the importance of different variables.
5.3.1 Random Forest with 4 Controlled Variables
Call:
randomForest(formula = Titer ~ DO + pHset + Stress + Xv, data = dat.simp, importance = TRUE, mtry = 4)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 4
Mean of squared residuals: 0.008918479
% Var explained: 71.66
5.3.2 Random Forest with All Variables
We can see that the ranking of variable importance is similar to that of MARS.
[Figure: random forest variable importance (IncNodePurity); variables shown: Stress, pCO2, pH, GLU, GLC, GLN, Via, LAC, OSM, Xv, NH4, pHset, DO.]
> rf.simp
Call:
randomForest(formula = Titer ~ DO + pHset + Xv + Stress + pCO2 + Via + GLC + LAC + GLN + GLU + NH4 + OSM +
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 10
Mean of squared residuals: 0.007168825
% Var explained: 77.22
5.4 Decision Tree
For completeness, we also include a decision tree model. It cannot outperform random forest, which is an
ensemble extension of decision trees, but it is easy to interpret.
5.4.1 Decision Tree with 4 Controlled Variables
n= 119
node), split, n, deviance, yval
* denotes terminal node
1) root 119 3.74435600 0.5739803
2) DO< -1.683324 13 0.10235720 0.2245385 *
3) DO>=-1.683324 106 1.85988900 0.6168364
6) pHset< -0.7717448 10 0.22043140 0.3785372 *
7) pHset>=-0.7717448 96 1.01244000 0.6416593
14) pHset>=0.6474983 8 0.02327288 0.5111250 *
15) pHset< 0.6474983 88 0.84046160 0.6535260
[Figure: decision tree with splits on DO < −1.683, pHset < −0.7717, NH4 < 0.2659, OSM ≥ −0.1028, Via < 0.4786, Xv < −1.573 and Xv ≥ −1.653; leaf values 0.2245, 0.3785, 0.5505, 0.616, 0.6775, 0.641, 0.7552, 0.7873.]
5.5 Neural Network
A neural network is comparatively hard to interpret. We include it here to compare its prediction
accuracy with the other models using CV.
5.6 Cross Validation
Although many packages nowadays have built-in measures of test error, by implementing cross validation
ourselves we can compare the performance of the different models on the same ground. Here we use
leave-30-runs-out cross validation. Since this is a regression problem, the mean squared error is a good
indicator of performance.
The cross-validation errors (RMSECV) of the different methods are as follows.
RMSECV of Different Models Using All Input Variables
Linear Model Mars Random Forest Decision Tree Neural Network
[1,] 0.078494 0.1060308 0.09046751 0.1237991 0.1362411
RMSECV of Different Models Using 4 Input Variables
Linear Model Mars Random Forest Decision Tree Neural Network
[1,] 0.09764054 0.117845 0.09327159 0.0972197 0.1150896
A surprising fact is that the more complicated models perform worse here; the linear model performs quite
well.
6 Snapshot Models
6.1 Linear model
The following model was selected by the BIC model selection criterion. We can see that the number
of selected variables is now much higher: in the snapshot model, the effects are too complex to be
captured by only a few variables.
Call:
lm(formula = snlm$formula, data = ccFdata)
Residuals:
Min 1Q Median 3Q Max
-0.63033 -0.09694 0.00396 0.11494 0.48256
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.492e+00 1.705e+00 0.875 0.381944
I((DO - 50)^2) 9.907e-04 3.332e-04 2.973 0.003066 **
pCO2 1.522e-01 2.442e-02 6.233 8.54e-10 ***
I((pH - 7.1)^2) 3.698e+01 4.867e+00 7.598 1.14e-13 ***
log(Xv) 7.013e-01 1.506e-01 4.658 3.93e-06 ***
Via -3.025e-02 9.257e-03 -3.268 0.001144 **
GLC -2.973e-01 1.400e-01 -2.123 0.034175 *
LAC -2.858e-01 9.404e-02 -3.039 0.002479 **
GLN 5.613e-01 5.873e-02 9.557 < 2e-16 ***
GLU -1.950e+00 3.485e-01 -5.596 3.31e-08 ***
NH4 -1.106e-01 2.415e-02 -4.580 5.64e-06 ***
OSM -8.029e-03 3.901e-03 -2.058 0.039999 *
Stress -5.076e-02 1.549e-02 -3.276 0.001113 **
I((DO - 50)^2):pCO2 1.418e-05 4.040e-06 3.510 0.000482 ***
I((DO - 50)^2):Via -8.782e-06 2.193e-06 -4.005 6.98e-05 ***
I((DO - 50)^2):GLC 8.047e-05 1.697e-05 4.741 2.65e-06 ***
I((DO - 50)^2):GLU -1.739e-04 3.561e-05 -4.883 1.34e-06 ***
I((DO - 50)^2):NH4 -6.375e-05 2.582e-05 -2.469 0.013824 *
pCO2:Via -6.531e-04 1.354e-04 -4.824 1.78e-06 ***
pCO2:GLC 3.431e-03 6.722e-04 5.105 4.44e-07 ***
pCO2:GLN -6.369e-03 1.483e-03 -4.293 2.05e-05 ***
pCO2:OSM -2.974e-04 5.367e-05 -5.540 4.50e-08 ***
I((pH - 7.1)^2):LAC 2.711e+00 3.745e-01 7.240 1.36e-12 ***
I((pH - 7.1)^2):NH4 6.535e-01 2.225e-01 2.937 0.003435 **
I((pH - 7.1)^2):OSM -1.449e-01 1.812e-02 -7.997 6.46e-15 ***
log(Xv):GLC 1.177e-01 1.607e-02 7.323 7.75e-13 ***
log(Xv):GLU -2.613e-01 3.265e-02 -8.000 6.29e-15 ***
Via:GLC -4.150e-03 8.275e-04 -5.015 6.97e-07 ***
Via:GLU 1.534e-02 1.466e-03 10.463 < 2e-16 ***
Via:Stress 6.505e-04 1.614e-04 4.031 6.26e-05 ***
GLC:LAC -2.587e-02 7.330e-03 -3.529 0.000448 ***
GLC:NH4 1.571e-02 4.848e-03 3.241 0.001255 **
GLC:OSM 9.333e-04 3.859e-04 2.419 0.015863 *
LAC:NH4 -4.049e-02 8.656e-03 -4.678 3.58e-06 ***
LAC:OSM 1.038e-03 2.115e-04 4.911 1.16e-06 ***
GLN:Stress -4.219e-03 1.587e-03 -2.658 0.008059 **
GLU:OSM 4.736e-03 8.420e-04 5.625 2.83e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1743 on 608 degrees of freedom
(740 observations deleted due to missingness)
Multiple R-squared: 0.9526, Adjusted R-squared: 0.9498
F-statistic: 339.2 on 36 and 608 DF, p-value: < 2.2e-16
[1] 0.07678454
[Figure: MARS variable importance for the snapshot model — nsubsets and normalized sqrt GCV/RSS; variables shown: GLU, LAC, NH4, GLN, Xv, OSM, pH, Stress, Via, pCO2, DO, GLC.]
7 History Model (Naive Way)
Let us first use the input data only at time t and treat t simply as another predictor.
Again, the target here is the Titer on day 10.
We explore different models.
7.1 Linear Regression
Since we now have around 1300 observations, we can fit a model with more parameters. After this we
can select a model using step.
Call:
lm(formula = Titer ~ I(DO^2) + DO + pHset + I(pHset^2) + Via +
poly(GLC, 3) + poly(LAC, 3) + GLN + GLU + NH4 + OSM + poly(Ptime,
3) + DO:pHset + I(pHset^2):poly(Ptime, 3) + Via:poly(Ptime,
3) + poly(LAC, 3):poly(Ptime, 3) + GLN:poly(Ptime, 3) + GLU:poly(Ptime,
3) + NH4:poly(Ptime, 3) + OSM:poly(Ptime, 3), data = dat.agg)
Residuals:
Min 1Q Median 3Q Max
-0.296451 -0.047078 0.002983 0.044872 0.190200
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.049e+00 5.200e-01 -2.017 0.044626 *
I(DO^2) -3.644e-02 6.218e-03 -5.860 1.26e-08 ***
DO 1.747e-02 7.374e-03 2.369 0.018520 *
pHset 9.137e-03 7.090e-03 1.289 0.198499
Take a look at the outliers.
> dat.agg[c(104, 119, 323), ]
Run WD Ptime DO pCO2 pH Xv Via
104 630 0 0.0000000 0.159614 0.1237602 1.287431 -1.580576 0.3932955
119 635 0 0.0000000 1.388239 0.9739924 1.426846 -1.894227 0.4848425
323 1066 1 0.7791667 0.159614 -1.0665650 1.217723 -1.525812 0.1411053
GLC LAC GLN GLU NH4 OSM
104 1.0780028 -0.7579071 0.8434386 -1.671864 0.006689145 -0.045629237
119 1.3174515 -0.9828156 0.9841203 -1.820554 -0.292624973 -0.007493617
323 0.6821795 -0.4139295 0.2494495 -1.790816 0.262201197 0.068777622
Stress Xv0 pHset RunNo
104 -0.2096149 0.3312174 0.03925127 BIOS-3.5L - 630
119 -0.2096149 -2.2207835 -2.39373696 BIOS-3.5L - 635
323 -0.2096149 -1.7751961 0.03925127 TACI-3L-1066
Discr. Titer pdiff pCO2t
104 Standard conditions 0.3510000 0.6432812 0
119 DO 70% / pH 6.70 / seeding 1 mio 0.0858000 0.6542345 0
323 STD condition (Control - no loop) 0.8432867 0.5062897 0
7.3 Random Forest
The parameter mtry = 10 was determined by tuneRF to optimize performance.
> library(randomForest)
> rf.agg <- randomForest(Titer ~ DO + pHset + Xv + Stress + pCO2 + Via
+ + GLC + LAC + GLN + GLU + NH4 + OSM + pH
+ + pdiff + Ptime, data = dat.agg, importance = TRUE, mtry = 10)
> rf.agg
Call:
randomForest(formula = Titer ~ DO + pHset + Xv + Stress + pCO2 + Via + GLC + LAC + GLN + GLU + NH4 + OSM +
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 10
Mean of squared residuals: 0.005391989
% Var explained: 83.73
[Figure: OOB error of rf.agg versus number of trees (decreasing from about 0.016 to 0.006 over 500 trees), and variable importance by %IncMSE and IncNodePurity; DO, pHset, NH4 and LAC appear at the top under both measures.]
Under the two importance plots, the 7 most important variables are the same.
7.4 Decision Tree
plotcp is used to determine cp for the decision tree.
> library(rpart)
> tr.agg <- rpart(Titer ~ DO + pHset + Stress + Xv + pCO2
+ + Via + GLC + LAC + GLN + GLU + NH4 + OSM
+ + pH + pdiff + Ptime, data = dat.agg)
> (tr.agg <- prune(tr.agg, cp = 0.019))
n= 335
node), split, n, deviance, yval
* denotes terminal node
1) root 335 11.10534000 0.5719146
2) DO< -1.683324 39 0.30707160 0.2245385 *
3) DO>=-1.683324 296 5.47207100 0.6176837
6) pHset< -1.988239 18 0.17149480 0.2748954 *
7) pHset>=-1.988239 278 3.04856000 0.6398787
14) NH4< 0.3899572 202 1.52384000 0.6085717
28) LAC>=0.002812559 9 0.02199422 0.4344444 *
29) LAC< 0.002812559 193 1.21623800 0.6166916 *
15) NH4>=0.3899572 76 0.80050930 0.7230893 *
> plot(tr.agg)
> text(tr.agg)
[Figure: pruned decision tree with splits DO < −1.683, pHset < −1.988, NH4 < 0.39 and LAC ≥ 0.002813; leaf values 0.2245, 0.2749, 0.4344, 0.6167, 0.7231.]
7.5 Neural Network
> nn.agg <- nnet(Titer ~ DO + pHset + Xv + Stress + pCO2 + Via
+ + GLC + LAC + GLN + GLU + NH4 + OSM + pH + pdiff,
+ data = dat.agg, size = 4, skip = TRUE, decay = 4e-4,
+ linout = FALSE, maxit = 2000, trace = FALSE)
> nn.agg
a 14-4-1 network with 79 weights
inputs: DO pHset Xv Stress pCO2 Via GLC LAC GLN GLU NH4 OSM pH pdiff
output(s): Titer
options were - skip-layer connections decay=4e-04
7.6 Cross Validation
Because of the nature of our data, it is better to leave whole runs of the experiment out when performing
cross validation. We again use the MSE to assess performance.
Overall, the performance is better than in the first task.
RMSECVs of Different Models
Linear Model Mars Random Forest Decision Tree Neural Network
[1,] 0.09515759 0.09222569 0.07226036 0.1000236 0.09316403
8 History Models
Here we fit history models and perform CV on them to measure their prediction power.
The input of a history model always includes the 2 strictly controlled variables that are kept constant,
DO and Stress. For a history model using the history up to day i, we additionally use the 10 variables
pCO2, Xv, Via, GLC, pH, LAC, GLN, GLU, NH4 and OSM from day 0 to day i. This gives 2 + 10·(i + 1)
input variables.
We do this by unfolding the raw data. In case of missing values, we impute them by running missForest
on the unfolded historical input values from day 0 to day i. This is done in the function ArrHist.
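A base-R sketch of the unfolding step on invented data — one column per variable and day, so GLC at days 0..i becomes the predictors GLC.0, ..., GLC.i (ArrHist itself is the project's own function and is not reproduced here):

```r
# Long format: one row per run and day (values invented).
long <- data.frame(Run = rep(1:2, each = 3), WD = rep(0:2, 2),
                   GLC = c(7.0, 6.0, 5.1, 6.8, 5.9, 5.0))

# Unfold to wide format: one row per run, one column per day.
wide <- reshape(long, idvar = "Run", timevar = "WD",
                direction = "wide", sep = ".")

wide   # columns: Run, GLC.0, GLC.1, GLC.2
```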
We produce 2 tables for each statistical model: one is the table of variance explained in CV and the
other is the RMSECV. We calculate the variance explained as

1 − Σ_i (ŷ_i − y_i)^2 / Σ_i (y_i − ȳ)^2

in each run of the cross validation and average them to get the final result. Similarly, the RMSECV is
also the mean over all cross-validation runs.
Now we see how the models perform under this setting.
8.1 Linear model
The (i, j)th history linear model is defined as follows:

log(Titer_j) = β_0 + Σ_{t=0..i} [ β_{1,t}(DO_t − 50)^2 + β_{2,t}pCO2_t + β_{3,t}(pH_t − 7.1)^2
    + β_{4,t}log(Xv_t) + β_{5,t}Via_t
    + β_{6,t}GLC_t + β_{7,t}LAC_t + β_{8,t}GLN_t
    + β_{9,t}GLU_t + β_{10,t}NH4_t + β_{11,t}OSM_t + β_{12,t}Stress_t ]
And the following table shows the RMSECV values. The RMSECV value of the (i, j)th model is the
number at the ith row and the jth column.
RMSECV
[Figure: RMSECV heatmap for history models based on MARS — history used (X0 to X0_10) against predicted Titer day (Y0 to Y10).]
9 Evaluation of Results
All models perform reasonably well.
Random forest has the most stable performance, as expected, and it is also easy to implement.
There are some tuning parameters to consider, e.g., the number of trees and the size of the resamples,
but the default settings work well most of the time, and even if the number of trees is too large it is
not a big problem.
Linear regression also does fine. Its performance relies heavily on variable selection, interaction
specification and transformation, which is why implementing a linear regression is not as straightforward
as the other models. A compromise is to fit a large model at the beginning and use stepwise selection
to reduce the number of parameters. More recent methods use penalized linear regression, such as the
lasso or ridge regression. See [2].
MARS is easy to implement because it adjusts to nonlinearity automatically. Here its prediction
power is not as good as that of the other two, which might result from overfitting. It is possible to tune
its parameters. See [3].
References
[1] Daniel J. Stekhoven. Using the missForest package, 2011. Available from http://stat.ethz.ch/
education/semesters/ss2012/ams/paper/missForest_1.2.pdf.
[2] Peter Bühlmann and Martin Mächler. Computational Statistics, 2014. Available from http://stat.ethz.ch/
education/semesters/ss2014/CompStat/sk.pdf.
[3] Stephen Milborrow. Notes on the earth package, 2014. Available from http://cran.r-project.
org/web/packages/earth/vignettes/earth-notes.pdf.
[4] Leo Breiman (Fortran original), Andy Liaw and Matthew Wiener (R port), and Adele Cutler.
Package ‘randomForest’, 2012. Available from http://stat-www.berkeley.edu/users/breiman/
RandomForests.
[5] Terry M. Therneau and Elizabeth J. Atkinson, Mayo Foundation. An Introduction to Recursive
Partitioning Using the RPART Routines, 2013. Available from http://cran.r-project.org/web/packages/
rpart/vignettes/longintro.pdf.