Data science with R - Clustering and Classification
Brett Keim
December 16, 2015
Abstract
Within data mining there are countless ways to explore a set, each yielding different information
about the data set as well as about future data points inside it. This research is primarily about
techniques relating to support vector machines and how they, together with kernels, can be utilized to
provide useful models and classifications. The research will also cover some elastic net regularized
regressions for the linear combinations which support vector machines do not incorporate.
1 Introduction to Support Vector Machines
It is wise to first start by looking into the definition of support vector machines and their most practical
uses. An SVM takes examples from a set and splits them into two categories. It then uses those categories to
assign new examples to one or the other without the use of probability. The two categories should be mapped
into groups so there is a clear gap between them, and that gap should be maximized. New samples are mapped
into the same space and labeled based on which side of the gap they fall on. This is how classification is
done using an SVM, but SVMs can also classify non-linearly using kernel methods, which will be discussed
later.
1.1 Obtaining Data for SVM Study
In order to study support vector machines I will need a data set that is complex enough to work on many
different layers of analysis, but at the same time small enough that I can easily troubleshoot. I have
pulled a dataset from the University of California Irvine Machine Learning Repository [UCI] and have chosen
to read it into R using the read.csv() function.
setwd("C:/Users/bwkeim/Desktop/Grad Classes/Data Mining Exploration")
fires <- read.csv("forestfires.csv")

# Assign each row a uniform random draw and split on it
rfires <- runif(nrow(fires))
fires.train <- fires[rfires >= 0.33,]  # roughly the top two thirds
fires.test <- fires[rfires < 0.22,]    # note: rows drawn in [0.22, 0.33) land in neither subset
dim(fires)
## [1] 517 13
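Because runif() is not seeded above, the split changes on every run. A reproducible variant that also keeps every row in exactly one subset might look like the sketch below (the seed value is an arbitrary choice of mine):

```r
# Fix the random stream so the train/test split is reproducible
set.seed(2015)
rfires <- runif(nrow(fires))
fires.train <- fires[rfires >= 0.33, ]  # roughly two thirds for training
fires.test  <- fires[rfires <  0.33, ]  # the remaining third for testing
nrow(fires.train) + nrow(fires.test) == nrow(fires)  # TRUE: no rows dropped
```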
1.2 Preliminary Testing
I will start by looking at the forest fire data from UCI to better understand how to properly use an SVM
on a data set. I need to look at some of the data first in order to understand what is inside the data file.
It is important to remember the dataset holds 517 observations of 13 attributes (6,721 values in total). I
have also split the set into two subsets: one for training and one for testing.
summary(fires)
## X Y month day FFMC
## Min. :1.000 Min. :2.0 aug :184 fri:85 Min. :18.70
## 1st Qu.:3.000 1st Qu.:4.0 sep :172 mon:74 1st Qu.:90.20
## Median :4.000 Median :4.0 mar : 54 sat:84 Median :91.60
## Mean :4.669 Mean :4.3 jul : 32 sun:95 Mean :90.64
## 3rd Qu.:7.000 3rd Qu.:5.0 feb : 20 thu:61 3rd Qu.:92.90
## Max. :9.000 Max. :9.0 jun : 17 tue:64 Max. :96.20
## (Other): 38 wed:54
## DMC DC ISI temp
## Min. : 1.1 Min. : 7.9 Min. : 0.000 Min. : 2.20
## 1st Qu.: 68.6 1st Qu.:437.7 1st Qu.: 6.500 1st Qu.:15.50
## Median :108.3 Median :664.2 Median : 8.400 Median :19.30
## Mean :110.9 Mean :547.9 Mean : 9.022 Mean :18.89
## 3rd Qu.:142.4 3rd Qu.:713.9 3rd Qu.:10.800 3rd Qu.:22.80
## Max. :291.3 Max. :860.6 Max. :56.100 Max. :33.30
##
## RH wind rain area
## Min. : 15.00 Min. :0.400 Min. :0.00000 Min. : 0.00
## 1st Qu.: 33.00 1st Qu.:2.700 1st Qu.:0.00000 1st Qu.: 0.00
## Median : 42.00 Median :4.000 Median :0.00000 Median : 0.52
## Mean : 44.29 Mean :4.018 Mean :0.02166 Mean : 12.85
## 3rd Qu.: 53.00 3rd Qu.:4.900 3rd Qu.:0.00000 3rd Qu.: 6.57
## Max. :100.00 Max. :9.400 Max. :6.40000 Max. :1090.84
##
The file contains 13 different attributes, as seen in the summary output above. When I think about forest
fires the first thing that comes to mind is what causes them, and then at what point of the year we are most
vulnerable to one occurring.
library(e1071)  # provides svm()
fit_default <- svm(month~.,fires.train)
fit_default
##
## Call:
## svm(formula = month ~ ., data = fires.train)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.05555556
##
## Number of Support Vectors: 281
I am using the svm() function inside the e1071 R package to begin my SVM learning. This function uses
a formula set by the user to classify data points from a given dataset. The user can also easily change the
kernel type as well as the margin through the kernel and cost arguments of the function. Changing the cost
argument changes the cost of misclassifying data points. The larger the cost, the smaller the margin
(separation of data points), because a higher price is placed on misclassification; this can lead to
overfitting. A smaller cost value allows the margin to widen, but does cause a higher level of bias. The
key is to find the right mix of the two, and when deciding on that mix it is best to see how different
levels perform on real data. There is also another argument of the function we must look at, which is
gamma. Gamma determines how much influence a single data value has on the training set. If gamma is too
large, the radius of influence of the support vectors includes only the support vectors themselves, and no
amount of cost regularization will allow us to avoid overfitting. If the gamma value is too small the model
will be too constricted and unable to capture the shape of the data. It is again important to get the
correct mix of the two by seeing how actual data reacts to the changes.
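Rather than trying cost and gamma values one at a time, e1071 can search a grid of both with cross-validation via its tune() function. A sketch of that search (the grid values here are arbitrary choices of mine):

```r
library(e1071)

# 10-fold cross-validated grid search over cost and gamma
tuned <- tune(svm, month ~ ., data = fires.train,
              ranges = list(cost  = 10^(-1:3),
                            gamma = 10^(-3:0)))
tuned$best.parameters  # the cost/gamma pair with the lowest CV error
```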
##
## Call:
## svm(formula = month ~ ., data = fires.train, cost = 10000)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 10000
## gamma: 0.05555556
##
## Number of Support Vectors: 246
##
## Call:
## svm(formula = month ~ ., data = fires.train, cost = 0.001)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 0.001
## gamma: 0.05555556
##
## Number of Support Vectors: 328
##
## Call:
## svm(formula = month ~ ., data = fires.train, gamma = 0.1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.1
##
## Number of Support Vectors: 283
##
## Call:
## svm(formula = month ~ ., data = fires.train, gamma = 0.001)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.001
##
## Number of Support Vectors: 329
Since I have played with the models more than what is shown, I think it is now time to start comparing
them against the test portion of my data set. I will generate predictions from my SVM model and compare
them to the test data using a simple table.
##
## Call:
## svm(formula = month ~ ., data = fires.train, cost = 10000, gamma = 0.01,
## kernel = "radial")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 10000
## gamma: 0.01
##
## Number of Support Vectors: 201
1.2.1 Initial Results
Using my fit_1 SVM I was able to organize my results into a table, which in this case is better known as a
confusion matrix. It shows the true values (columns) versus the predicted values (rows).
apr aug dec feb jan jul jun mar may nov oct sep
apr 0 0 0 3 0 0 0 0 0 0 0 0
aug 0 29 0 0 1 4 0 0 0 0 0 3
dec 0 0 1 0 0 0 0 0 0 0 0 0
feb 1 0 0 2 0 0 0 0 0 0 0 0
jan 0 0 0 0 0 0 0 0 0 0 0 0
jul 0 1 0 0 0 3 2 0 0 0 0 0
jun 0 0 0 0 0 0 3 0 0 0 0 0
mar 1 1 0 2 0 0 0 7 1 0 0 0
may 0 0 0 0 0 0 0 1 0 0 0 0
nov 0 0 0 0 0 0 0 0 0 0 0 0
oct 0 0 0 0 0 0 0 0 0 0 4 0
sep 0 4 0 0 0 0 0 0 0 0 0 40
Ideally we would like the diagonal to contain all of the data points; this would mean that our predictions
perfectly match the true values. Of course, there is no such thing as a perfect model, so some values are
left outside of the diagonal. These are known as misclassified points and are something we need to take
a closer look at.
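For reference, the matrix above comes from comparing predictions against the held-out labels. A sketch of how the cm object used below can be built, assuming fit_1 is the tuned radial-kernel model whose summary is shown above (month is column 3 of the data):

```r
# Predict months for the held-out rows, dropping the true month column
fit_1_pred <- predict(fit_1, fires.test[, -3])

# Rows = predicted month, columns = true month
cm <- table(pred = fit_1_pred, true = fires.test[, 3])
```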
correct.pred <- sum(diag(cm))
correct.pred
## [1] 89
total.pred <- sum(cm)
total.pred
## [1] 114
misclass <- (total.pred - correct.pred) / total.pred
misclass
## [1] 0.2192982
By the above code, the model correctly predicted 89 cases out of a total of 114 test points, yielding a
misclassification percentage of around 21.9 percent. From my point of view this is not a bad percentage.
This was using a radial kernel method, but there are other kernel methods used inside SVMs that can be
more effective and useful. I want to look specifically at the linear kernel now.
1.3 Linear Kernel SVMs
I will now use the same techniques as discussed earlier for finding the correct cost and gamma for my
SVM, but this time I will be using a linear kernel.
fit_2 <- svm(month~., fires.train, cost = 1000, kernel = 'linear')
summary(fit_2)
##
## Call:
## svm(formula = month ~ ., data = fires.train, cost = 1000, kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1000
## gamma: 0.05555556
##
## Number of Support Vectors: 157
##
## ( 21 9 39 38 5 17 11 1 8 7 1 )
##
##
## Number of Classes: 11
##
## Levels:
## apr aug dec feb jan jul jun mar may nov oct sep
fit_2_pred <- predict(fit_2, fires.test[,-3])
cm_2 <- table(pred = fit_2_pred, true = fires.test[,3])
sum(diag(cm_2))
## [1] 91
sum(cm_2)
## [1] 114
(sum(cm_2)-sum(diag(cm_2)))/sum(cm_2)
## [1] 0.2017544
The initial model I have made using a linear kernel is slightly better than the first model with the radial
kernel, as you can see, since we have about a 1 percent decrease in misclassification. I am going to play
with the cost level and see if I can't further improve the model.
fit_3 <- svm(month~., fires.train, cost = 100, kernel = 'linear')
fit_3_pred <- predict(fit_3, fires.test[,-3])
cm_3 <- table(pred = fit_3_pred, true = fires.test[,3])
sum(diag(cm_3))
## [1] 90
sum(cm_3)
## [1] 114
(sum(cm_3)-sum(diag(cm_3)))/sum(cm_3)
## [1] 0.2105263
1.3.1 Results of Linear Kernel
Below is the confusion matrix for the high cost svm.
apr aug dec feb jan jul jun mar may nov oct sep
apr 0 0 0 3 0 0 0 1 0 0 0 0
aug 0 31 0 0 0 3 0 0 0 0 0 3
dec 0 0 1 0 0 0 0 0 0 0 0 0
feb 1 0 0 2 1 0 0 0 0 0 0 0
jan 0 0 0 0 0 0 0 0 0 0 0 0
jul 0 0 0 0 0 4 3 0 0 0 0 0
jun 0 0 0 0 0 0 2 0 0 0 0 0
mar 1 1 0 2 0 0 0 6 1 0 0 0
may 0 0 0 0 0 0 0 1 0 0 0 0
nov 0 0 0 0 0 0 0 0 0 0 0 0
oct 0 0 0 0 0 0 0 0 0 0 4 0
sep 0 3 0 0 0 0 0 0 0 0 0 40
fit_4 <- svm(month~., fires.train, cost = 1, kernel = 'linear')
fit_4_pred <- predict(fit_4, fires.test[,-3])
cm_4 <- table(pred = fit_4_pred, true = fires.test[,3])
sum(diag(cm_4))
## [1] 90
sum(cm_4)
## [1] 114
(sum(cm_4)-sum(diag(cm_4)))/sum(cm_4)
## [1] 0.2105263
Below is the confusion matrix for the low cost svm.
apr aug dec feb jan jul jun mar may nov oct sep
apr 0 0 0 1 0 0 0 0 0 0 0 0
aug 0 31 0 0 0 4 0 0 0 0 0 3
dec 0 0 1 0 0 0 0 0 0 0 0 0
feb 1 0 0 2 0 0 0 0 0 0 0 0
jan 0 0 0 0 0 0 0 0 0 0 0 0
jul 0 0 0 0 0 2 2 0 0 0 0 0
jun 0 0 0 0 1 0 3 0 0 0 0 0
mar 1 1 0 4 0 0 0 7 1 0 0 0
may 0 0 0 0 0 0 0 1 0 0 0 0
nov 0 0 0 0 0 0 0 0 0 0 0 0
oct 0 0 0 0 0 0 0 0 0 0 4 0
sep 0 3 0 0 0 1 0 0 0 0 0 40
I have continued to mess around with the cost argument inside svm() and, because of the randomness of the
train/test split, I continue to get mixed results: sometimes the model with a higher cost is more accurate
and sometimes it is the other way around. I have shown two models above, one with a higher cost and one
with the default cost setting of 1. Their misclassification percentages are both about 21.1 percent.
1.4 Recap
After analyzing and trying different SVMs, I wanted to add a visual showing the count of fires per month
from the data set, to see if my predictions and overall study make sense according to the data set I am
testing.
[Figure: "Fire Counts by Month" — a bar chart of fire counts for each month, dominated by August and September.]
The visual speaks for itself. It is easily seen that the fires dominate two specific months, August and
September, as we predicted and saw in our confusion matrices. The misclassification in those months isn't
hard to believe given the sheer numbers. I feel content with my model having a roughly 80 percent success
rate for now.
2 Elastic Net Regression
What is elastic net regression? What does it mean? Where should we use it? These are the three questions
that drove me to write this section. I first want to define elastic net regression and discuss what the
definition means in relative terms. I will then use elastic net regressions on some real-life data and see
how they can be used in a real-world setting.
2.1 Definition and Meaning
It is worth mentioning there are two main types of elastic net regression, regularized and nonregularized.
Before I can define these two types I need to discuss the lasso and ridge regression techniques. Lasso is
an acronym for least absolute shrinkage and selection operator; it is essentially a method for shrinking
regression estimators by penalizing the sum of their absolute values, with larger values drawing larger
penalties. As the penalties grow larger and larger the estimators are driven toward zero, and in the lasso
method they are allowed to become exactly zero. This differs from the ridge technique, where the estimators
shrink as the penalties grow but always remain nonzero. Ridge (Tikhonov) regression is essentially the
penalizing of large parameter estimators when fitting a linear regression, again with the main difference
from lasso being that ridge estimators remain nonzero.
Now that I have discussed these preliminaries, I am ready to get down to elastic net regression. This
method linearly combines the absolute-value penalty from the lasso method and the squared penalty from
the ridge method to overcome shortcomings in each. Specifically, elastic net addresses problems with the
lasso when one has many parameters but a relatively small sample size, as well as when multiple parameters
are strongly correlated. The lasso would keep only one of those correlated parameters, but elastic net
accounts for the correlation by adding a quadratic term. The addition of a quadratic term does cause other
errors, the biggest being double shrinkage: the coefficients are shrunk twice, once by each penalty, which
is a problem unless it is corrected by a scaling factor. One of the main benefits of elastic net regression
is a convex objective, which allows for a single maximum/minimum.
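In symbols, using the parameterization popularized by the glmnet package (where the mixing parameter α blends the two penalties: α = 1 recovers the lasso, α = 0 recovers ridge), the elastic net estimator solves:

```latex
\hat{\beta} = \operatorname*{arg\,min}_{\beta}\;
  \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2
  + \lambda \left( \alpha \lVert \beta \rVert_1
  + \frac{1-\alpha}{2}\,\lVert \beta \rVert_2^2 \right)
```

Here λ controls the overall penalty strength, and the quadratic (L2) term is what handles the strongly correlated predictors mentioned above.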
2.2 Preliminary Testing
I want to first begin some tests using elastic net regression techniques on the same data set I used with
SVMs in the first section of this paper. I will keep the same initial question and goal for the regression,
which again is, "In what months are we most susceptible to forest fires?" I want to see if the elastic net
methods give me fewer misclassifications. I will first look at our original training and testing sets made
in section 1, along with a quick dimension check on 'fires' to make sure the sets have remained the same as
before.
summary(fires.train)
## X Y month day FFMC
## Min. :1.000 Min. :2.00 aug :129 fri:63 Min. :53.40
## 1st Qu.:2.000 1st Qu.:4.00 sep :109 mon:46 1st Qu.:90.20
## Median :4.000 Median :4.00 mar : 39 sat:66 Median :91.60
## Mean :4.681 Mean :4.31 jul : 21 sun:51 Mean :90.69
## 3rd Qu.:7.000 3rd Qu.:5.00 feb : 12 thu:45 3rd Qu.:92.90
## Max. :9.000 Max. :9.00 jun : 10 tue:39 Max. :96.20
## (Other): 25 wed:35
## DMC DC ISI temp
## Min. : 2.4 Min. : 9.3 Min. : 0.400 Min. : 2.20
## 1st Qu.: 61.1 1st Qu.:433.3 1st Qu.: 6.300 1st Qu.:15.90
## Median :108.4 Median :664.2 Median : 8.400 Median :19.60
## Mean :111.9 Mean :546.8 Mean : 8.868 Mean :19.12
## 3rd Qu.:145.4 3rd Qu.:713.0 3rd Qu.:11.000 3rd Qu.:22.90
## Max. :291.3 Max. :860.6 Max. :22.700 Max. :33.10
##
## RH wind rain area
## Min. :18.0 Min. :0.900 Min. :0.00000 Min. : 0.00
## 1st Qu.:32.0 1st Qu.:2.700 1st Qu.:0.00000 1st Qu.: 0.00
## Median :41.0 Median :4.000 Median :0.00000 Median : 0.47
## Mean :43.6 Mean :4.002 Mean :0.03015 Mean : 14.71
## 3rd Qu.:53.0 3rd Qu.:4.900 3rd Qu.:0.00000 3rd Qu.: 6.96
## Max. :96.0 Max. :9.400 Max. :6.40000 Max. :1090.84
##
summary(fires.test)
## X Y month day FFMC
## Min. :1.000 Min. :2.000 sep :43 fri:17 Min. :18.70
## 1st Qu.:3.000 1st Qu.:4.000 aug :35 mon:17 1st Qu.:90.20
## Median :4.000 Median :4.000 mar : 8 sat:11 Median :91.70
## Mean :4.649 Mean :4.272 feb : 7 sun:29 Mean :90.11
## 3rd Qu.:6.000 3rd Qu.:5.000 jul : 7 thu:10 3rd Qu.:92.88
## Max. :9.000 Max. :9.000 jun : 5 tue:14 Max. :96.20
## (Other): 9 wed:16
## DMC DC ISI temp
## Min. : 1.10 Min. : 7.9 Min. : 0.000 Min. : 4.20
## 1st Qu.: 65.08 1st Qu.:444.7 1st Qu.: 6.500 1st Qu.:14.82
## Median :103.10 Median :663.0 Median : 8.500 Median :18.05
## Mean :103.66 Mean :549.2 Mean : 9.376 Mean :18.23
## 3rd Qu.:135.10 3rd Qu.:714.8 3rd Qu.:10.925 3rd Qu.:22.55
## Max. :290.00 Max. :855.3 Max. :56.100 Max. :32.40
##
## RH wind rain area
## Min. : 15.00 Min. :0.400 Min. :0.000000 Min. : 0.000
## 1st Qu.: 35.00 1st Qu.:3.100 1st Qu.:0.000000 1st Qu.: 0.000
## Median : 42.50 Median :4.000 Median :0.000000 Median : 1.015
## Mean : 46.04 Mean :4.092 Mean :0.003509 Mean : 10.307
## 3rd Qu.: 54.00 3rd Qu.:5.400 3rd Qu.:0.000000 3rd Qu.: 5.785
## Max. :100.00 Max. :8.500 Max. :0.200000 Max. :278.530
##
dim(fires)
## [1] 517 13
Let us begin with the 'glmnet' package. I need to make sure I build the correct response vector (y)
and the correct input matrix (x).
names(fires.train)
## [1] "X" "Y" "month" "day" "FFMC" "DMC" "DC" "ISI"
## [9] "temp" "RH" "wind" "rain" "area"
fires.traiN <- fires.train[,-4]
names(fires.traiN)
## [1] "X" "Y" "month" "FFMC" "DMC" "DC" "ISI" "temp"
## [9] "RH" "wind" "rain" "area"
x <- fires.traiN[,-3]
y <- fires.traiN[,3]
names(x)
## [1] "X" "Y" "FFMC" "DMC" "DC" "ISI" "temp" "RH" "wind" "rain"
## [11] "area"
## [56,] 11 0.8347 0.011930
## [57,] 11 0.8348 0.010870
## [58,] 11 0.8349 0.009906
## [59,] 11 0.8351 0.009026
## [60,] 11 0.8351 0.008224
## [61,] 11 0.8352 0.007493
## [62,] 11 0.8353 0.006828
## [63,] 11 0.8353 0.006221
## [64,] 11 0.8354 0.005668
## [65,] 11 0.8354 0.005165
## [66,] 11 0.8354 0.004706
## [67,] 11 0.8355 0.004288
## [68,] 11 0.8355 0.003907
## [69,] 11 0.8355 0.003560
## [70,] 11 0.8355 0.003244
## [71,] 11 0.8355 0.002956
## [72,] 11 0.8355 0.002693
## [73,] 11 0.8355 0.002454
##
## $lambda.min
## [1] 0.01899853
##
## $lambda.1se
## [1] 0.1340308
##
## attr(,"class")
## [1] "cv.glmnet"
Now, I have a model from the elastic net function that I can use to predict values. I will use the test set
that I made earlier. I will insert the test matrix into the model and look at the output compared to the real
values given in the original dataset.
names(fires.test)
## [1] "X" "Y" "month" "day" "FFMC" "DMC" "DC" "ISI"
## [9] "temp" "RH" "wind" "rain" "area"
fires.tesT <- fires.test[,-4]
names(fires.tesT)
## [1] "X" "Y" "month" "FFMC" "DMC" "DC" "ISI" "temp"
## [9] "RH" "wind" "rain" "area"
xp <- fires.tesT[,-3]
yp <- fires.tesT[,3]
Yp <- gsub('jan','1',yp)
Yp <- gsub('feb', '2',Yp)
Yp <- gsub('mar', '3',Yp)
Yp <- gsub('apr', '4',Yp)
Yp <- gsub('may', '5',Yp)
Yp <- gsub('jun', '6',Yp)
Yp <- gsub('jul', '7',Yp)
Yp <- gsub('aug', '8',Yp)
Yp <- gsub('sep', '9',Yp)
Yp <- gsub('oct', '10',Yp)
Yp <- gsub('nov', '11',Yp)
Yp <- gsub('dec', '12',Yp)
class(Yp)
## [1] "character"
Yp <- as.numeric(Yp)
Yp <- as.integer(Yp)
class(Yp)
## [1] "integer"
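As an aside, the twelve gsub() calls above can be collapsed into one line using R's built-in month.abb vector; a sketch that should give the same integer coding:

```r
# month.abb is c("Jan", "Feb", ..., "Dec"); match() returns the position,
# so "jan" maps to 1, "feb" to 2, and so on
Yp <- match(as.character(yp), tolower(month.abb))
```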
xp[,1] <- as.numeric(xp[,1])
xp[,2] <- as.numeric(xp[,2])
xp[,3] <- as.numeric(xp[,3])
xp[,4] <- as.numeric(xp[,4])
xp[,5] <- as.numeric(xp[,5])
xp[,6] <- as.numeric(xp[,6])
xp[,7] <- as.numeric(xp[,7])
xp[,8] <- as.numeric(xp[,8])
xp[,9] <- as.numeric(xp[,9])
xp[,10] <- as.numeric(xp[,10])
xp[,11] <- as.numeric(xp[,11])
Xp <- as.matrix(xp)
PRED1 <- round(predict(results, Xp, s = results$lambda.min),0)  # predict.cv.glmnet takes the penalty via s=, not lambda=
It makes sense to show the outputs as tables so that we can see the different counts. It is important to
remember that PRED1 holds the predictions using the minimum lambda from cv.glmnet(), while Yp holds the
actual values from the test set.
table(PRED1)
## PRED1
## 3 4 5 6 7 8 9 10
## 3 15 4 4 8 31 44 5
table(Yp)
## Yp
## 1 2 3 4 5 6 7 8 9 10 12
## 1 7 8 2 1 5 7 35 43 4 1
This again was using the minimum lambda found by cv.glmnet(). I now wonder if we can't get a better
model by simplifying a little bit and using the $lambda.1se value instead of $lambda.min.
PRED2 <- round(predict(results, Xp, s = results$lambda.1se),0)  # again, the penalty is passed via s=
table(PRED2)
## PRED2
## 3 4 5 6 7 8 9 10
## 3 15 4 4 8 31 44 5
Let's see how the two models compare to the true data.
ypdf <- rbind(c(1,0),c(2,1),c(3,10),c(4,1),c(5,0),c(6,4),c(7,8),c(8,36),c(9,30),c(10,4),c(11,0),c(12,3))
The accuracy isn't as good as the SVM in part one, but we can see that we are getting the same two
"problem" months the SVM section did, August and September. I want to look more into the glmnet() function
now and explore some LASSO and ridge regressions. It shouldn't be a painful process, because all I have to
do is set the mixing parameter alpha to either 1 or 0 depending on which model I would like to use
(alpha = 1 gives the lasso, alpha = 0 gives ridge).
3 Ridge Regression
Now I will look into ridge regression a little bit and compare it with the elastic net regressions from
before. I can use the same glmnet() function as I did previously; I just need to set alpha to 0.
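A sketch of that switch with glmnet itself, assuming X and Y are the numeric design matrix and month response used for cv.glmnet() in section 2:

```r
library(glmnet)

# alpha = 0 gives pure ridge (L2 penalty only); alpha = 1 would give the lasso
ridge.cv <- cv.glmnet(X, Y, alpha = 0)
ridge.cv$lambda.min  # penalty strength chosen by cross-validation
```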
dim(fires.train)
## [1] 345 13
dim(fires.test)
## [1] 114 13
str(fires.train)
## 'data.frame': 345 obs. of 13 variables:
## $ X : int 7 7 7 8 8 8 8 7 7 6 ...
## $ Y : int 5 4 4 6 6 6 6 5 5 5 ...
## $ month: Factor w/ 12 levels "apr","aug","dec",..: 8 11 11 2 2 2 12 12 12 2 ...
## $ day : Factor w/ 7 levels "fri","mon","sat",..: 1 6 3 4 2 2 6 3 3 1 ...
## $ FFMC : num 86.2 90.6 90.6 92.3 92.3 91.5 91 92.5 92.8 63.5 ...
## $ DMC : num 26.2 35.4 43.7 85.3 88.9 ...
## $ DC : num 94.3 669.1 686.9 488 495.6 ...
## $ ISI : num 5.1 6.7 6.7 14.7 8.5 10.7 7 7.1 22.6 0.8 ...
## $ temp : num 8.2 18 14.6 22.2 24.1 8 13.1 17.8 19.3 17 ...
## $ RH : int 51 33 33 29 27 86 63 51 38 72 ...
## $ wind : num 6.7 0.9 1.3 5.4 3.1 2.2 5.4 7.2 4 6.7 ...
## $ rain : num 0 0 0 0 0 0 0 0 0 0 ...
## $ area : num 0 0 0 0 0 0 0 0 0 0 ...
fT <- fires.train
fT[,3] <- gsub('jan','1',fT[,3])
fT[,3] <- gsub('feb', '2',fT[,3])
fT[,3] <- gsub('mar', '3',fT[,3])
fT[,3] <- gsub('apr', '4',fT[,3])
fT[,3] <- gsub('may', '5',fT[,3])
fT[,3] <- gsub('jun', '6',fT[,3])
fT[,3] <- gsub('jul', '7',fT[,3])
fT[,3] <- gsub('aug', '8',fT[,3])
fT[,3] <- gsub('sep', '9',fT[,3])
fT[,3] <- gsub('oct', '10',fT[,3])
fT[,3] <- gsub('nov', '11',fT[,3])
fT[,3] <- gsub('dec', '12',fT[,3])
class(fT[,3])
FTS <- fTs[,-3]
str(FTS)
## 'data.frame': 114 obs. of 12 variables:
## $ X : int 8 7 8 7 7 6 7 4 4 5 ...
## $ Y : int 6 5 5 4 4 3 3 4 4 6 ...
## $ day : int 5 6 1 7 6 7 6 2 6 1 ...
## $ FFMC: num 91.7 92.5 84.9 94.3 90.2 91.7 90.6 94.8 92.5 90.9 ...
## $ DMC : num 33.3 88 32.8 96.3 110.9 ...
## $ DC : num 77.5 698.6 664.2 200 537.4 ...
## $ ISI : num 9 7.1 3 56.1 6.2 7.8 6.7 17 7.1 7 ...
## $ temp: num 8.3 22.8 16.7 21 19.5 17.7 17.8 16.6 19.6 14.7 ...
## $ RH : int 97 40 47 44 43 39 27 54 48 70 ...
## $ wind: num 4 4 4.9 4.5 5.8 3.6 4 5.4 2.7 3.6 ...
## $ rain: num 0.2 0 0 0 0 0 0 0 0 0 ...
## $ area: num 0 0 0 0 0 0 0 0 0 0 ...
FTS <- as.matrix(FTS)
class(FTS)
## [1] "matrix"
dim(FTS)
## [1] 114 12
dim(rmd)
## [1] 12 1
rpred <- FTS%*%rmd
I realize the top part might be confusing, but because I couldn't find an "easy" way to get predictions
from the model, I converted the coefficients into a matrix (rmd) and then used the fires.test set to make
another matrix so that I could multiply each coefficient into the correct column, producing rpred. The
predictions are decimals because of the way the R function lm.ridge does its calculations, so I convert
them to the nearest month by rounding and use table() to count the frequencies, which is displayed below
as freq; the other table holds the true values.
freq <- table(round(rpred,0))
freq
##
## -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 9
## 1 1 5 10 11 13 17 9 17 17 6 4 1 2
true <- table(fTs[,3])
true
##
## 1 2 3 4 5 6 7 8 9 10 12
## 1 7 8 2 1 5 7 35 43 4 1
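The negative values in freq hint that the intercept was left out of the matrix multiply above: coef() on an lm.ridge fit returns the intercept as its first element, ahead of the slopes. A sketch of the repaired prediction, assuming rmod is the hypothetical lm.ridge fit whose slope coefficients were stored in rmd:

```r
# coef() returns c(intercept, slopes); prepend a column of ones so the
# intercept is applied to every test row
rcoef <- coef(rmod)
rpred <- cbind(1, FTS) %*% rcoef
```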
I seem to be having trouble getting the "ridgelm" object returned by lm.ridge to behave the way I would
like it to, so I am going to try a different ridge function.
lp <- linearRidge(month~.,fT)  # linearRidge() comes from the 'ridge' package
pre <- table(round(predict(lp, fTs),0))
pre
##
## 2 3 4 5 6 7 8 9 10 11
## 1 8 10 4 5 7 28 41 8 2
true
##
## 1 2 3 4 5 6 7 8 9 10 12
## 1 7 8 2 1 5 7 35 43 4 1
The predictions are not too bad at all, except for the 3rd and 4th months (March and April), where the
two counts appear basically inverted relative to the true data. I am wondering why those two get such a
bad fit, and I am not sure how to answer that question.
These results are very similar to the elastic net regression from section 2. They are a little bit better
than the elastic net, but not much better, and they are certainly not better than the SVM models we used
in section 1.
This leads me to the next idea to focus on, which is a LASSO regression on the fires data. So far I have
done elastic net (a mixture of LASSO and ridge) and ridge (which keeps all the predictors and assigns a
penalty). The next logical place to go is LASSO, which allows for variable selection along with the
penalty factor.
4 LASSO Regression
A reminder that LASSO stands for least absolute shrinkage and selection operator. LASSO will shrink the
coefficients like the ridge regression did, but the main difference between the two is LASSO is going to force
some variables to exactly 0, thus eliminating them from the model entirely. This is a process commonly
known as variable selection.
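That selection can be read directly off the fitted coefficients: with glmnet, setting alpha = 1 gives the lasso, and any coefficient printed as '.' has been driven exactly to zero. A sketch, assuming the same X and Y matrices from section 2:

```r
library(glmnet)

# alpha = 1 is the pure lasso; zeroed coefficients drop out of the model
lasso.cv <- cv.glmnet(X, Y, alpha = 1)
coef(lasso.cv, s = "lambda.min")  # entries shown as '.' were set exactly to zero
```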
I want to continue to work with the fires data set from previous sections. I will start by looking at the
overall structure of the data sets again where fires.train and fires.test are the two subsets of fires for which
training and testing of a model are done.
str(fires.test)
## 'data.frame': 114 obs. of 13 variables:
## $ X : int 8 7 8 7 7 6 7 4 4 5 ...
## $ Y : int 6 5 5 4 4 3 3 4 4 6 ...
## $ month: Factor w/ 12 levels "apr","aug","dec",..: 8 12 11 7 2 12 11 2 12 12 ...
## $ day : Factor w/ 7 levels "fri","mon","sat",..: 1 3 2 4 3 4 3 6 3 2 ...
## $ FFMC : num 91.7 92.5 84.9 94.3 90.2 91.7 90.6 94.8 92.5 90.9 ...
## $ DMC : num 33.3 88 32.8 96.3 110.9 ...
## $ DC : num 77.5 698.6 664.2 200 537.4 ...
## $ ISI : num 9 7.1 3 56.1 6.2 7.8 6.7 17 7.1 7 ...
## $ temp : num 8.3 22.8 16.7 21 19.5 17.7 17.8 16.6 19.6 14.7 ...
## $ RH : int 97 40 47 44 43 39 27 54 48 70 ...
## $ wind : num 4 4 4.9 4.5 5.8 3.6 4 5.4 2.7 3.6 ...
## $ rain : num 0.2 0 0 0 0 0 0 0 0 0 ...
## $ area : num 0 0 0 0 0 0 0 0 0 0 ...
str(fires.train)
## 'data.frame': 345 obs. of 13 variables:
## $ X : int 7 7 7 8 8 8 8 7 7 6 ...
## $ Y : int 5 4 4 6 6 6 6 5 5 5 ...
## $ month: Factor w/ 12 levels "apr","aug","dec",..: 8 11 11 2 2 2 12 12 12 2 ...
## $ day : Factor w/ 7 levels "fri","mon","sat",..: 1 6 3 4 2 2 6 3 3 1 ...
## $ FFMC : num 86.2 90.6 90.6 92.3 92.3 91.5 91 92.5 92.8 63.5 ...
## $ DMC : num 26.2 35.4 43.7 85.3 88.9 ...
## $ DC : num 94.3 669.1 686.9 488 495.6 ...
## $ ISI : num 5.1 6.7 6.7 14.7 8.5 10.7 7 7.1 22.6 0.8 ...
## $ temp : num 8.2 18 14.6 22.2 24.1 8 13.1 17.8 19.3 17 ...
## $ RH : int 51 33 33 29 27 86 63 51 38 72 ...
## $ wind : num 6.7 0.9 1.3 5.4 3.1 2.2 5.4 7.2 4 6.7 ...
## $ rain : num 0 0 0 0 0 0 0 0 0 0 ...
## $ area : num 0 0 0 0 0 0 0 0 0 0 ...
I plan to use the lars() function from the R package 'lars', which requires a predictor matrix and a
response vector to train the model. Because I have already created these in previous sections, I can simply
reuse them, since they hold the same data I need here. From the elastic net work in section 2, X is the
predictor matrix and Y is the response vector (month).
library(lars)  # provides lars() and its predict method
lasso <- lars(X, Y)
Now that I have a LASSO model trained I can test my model against the fires.test data set we made
previously. Again, because elastic net also needed a prediction matrix and a response vector I can use the
matrix and vector formed in that section.
laspre <- predict(lasso, Xp,s=8)
t.laspre <- table(round(as.matrix(laspre$fit),0))
t.laspre
##
## 2 3 4 5 6 7 8 9 10
## 1 9 11 3 4 7 27 39 13
table(Yp)
## Yp
## 1 2 3 4 5 6 7 8 9 10 12
## 1 7 8 2 1 5 7 35 43 4 1
This one may just have been my best model yet. I can see that the groupings are close together, and in
fact my biggest misclassification is for the fourth month, at 7. I believe this is yielding the best
results so far. I will run it through the process a few more times to see if I can generally conclude that
I have reached my best model.
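A sketch of that repeated check: resplit, refit, and collect the test error across several random seeds (this assumes the lars package is loaded; the seed values are arbitrary, and s = 8 matches the fit above):

```r
errs <- sapply(1:10, function(i) {
  set.seed(i)
  r <- runif(nrow(fires))
  train <- fires[r >= 0.33, ]
  test  <- fires[r <  0.33, ]
  # rebuild the numeric design matrices (drop month and day) and month codes
  Xtr <- as.matrix(train[, -c(3, 4)])
  Xte <- as.matrix(test[,  -c(3, 4)])
  ytr <- match(as.character(train$month), tolower(month.abb))
  yte <- match(as.character(test$month),  tolower(month.abb))
  fit  <- lars(Xtr, ytr)
  pred <- round(predict(fit, Xte, s = 8)$fit, 0)
  mean(pred != yte)  # misclassification rate on this split
})
mean(errs)  # average error over the ten random splits
```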
5 References
UCI Machine Learning Repository (2008), 'Forest Fires', University of California, Irvine.
Zou, Hui and Hastie, Trevor (2004), 'Regularization and variable selection via the elastic net', Stanford
University.