Data Mining Exploration
Brett Keim
December 16, 2015
Abstract
In data mining there are countless ways to explore a data set, each yielding different information about the set itself as well as about future data points. This research is primarily about techniques relating to support vector machines and how they, together with kernels, can be used to provide useful models and classifications. The research also covers elastic net regularized regression, along with the ridge and lasso penalties it linearly combines, as an alternative that support vector machines do not incorporate.
1 Introduction to Support Vector Machines
It is wise to start by looking at the definition of a support vector machine (SVM) and its most practical uses. An SVM takes labeled examples from a training set that fall into two categories. It then uses those categories to assign new examples to one side or the other without the use of probability. The two categories are mapped into a space so that there is a clear gap between them, and the SVM chooses the boundary that maximizes that gap. New samples are mapped into the same space and labeled according to which side of the gap they fall on. This is how classification is done with an SVM, but SVMs can also classify non-linearly using kernel methods, which will be discussed later.
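For reference, the standard soft-margin formulation behind this description (the textbook form, not something derived in this paper) chooses the hyperplane by solving
\[
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad y_i\,(w^{\top}x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0,
\]
where the slack variables \xi_i allow points to sit inside or beyond the margin and C is the misclassification cost that reappears later as the cost argument of svm().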
1.1 Obtaining Data for SVM Study
In order to study support vector machines I will need a data set that is complex enough to support many different layers of analysis, but at the same time small enough that I can easily troubleshoot. I have pulled a data set from the University of California Irvine Machine Learning Repository [UCI] and read it into R with the read.csv() command.
setwd("C:/Users/bwkeim/Desktop/Grad Classes/Data Mining Exploration")
fires <- read.csv("forestfires.csv")
rfires <- runif(nrow(fires))
fires.train <- fires[rfires >= 0.33,]
fires.test <- fires[rfires < 0.22,]
dim(fires)
## [1] 517 13
1.2 Preliminary Testing
I will start by looking at the forest fire data from UCI to better understand how to properly use an SVM on a data set. I first need to look at some of the data in order to understand what is inside the file. It is important to remember the data set contains 517 observations of 13 attributes (6,721 individual values). I have also split the set into two subsets, one for training and one for testing.
summary(fires)
## X Y month day FFMC
## Min. :1.000 Min. :2.0 aug :184 fri:85 Min. :18.70
## 1st Qu.:3.000 1st Qu.:4.0 sep :172 mon:74 1st Qu.:90.20
## Median :4.000 Median :4.0 mar : 54 sat:84 Median :91.60
## Mean :4.669 Mean :4.3 jul : 32 sun:95 Mean :90.64
## 3rd Qu.:7.000 3rd Qu.:5.0 feb : 20 thu:61 3rd Qu.:92.90
## Max. :9.000 Max. :9.0 jun : 17 tue:64 Max. :96.20
## (Other): 38 wed:54
## DMC DC ISI temp
## Min. : 1.1 Min. : 7.9 Min. : 0.000 Min. : 2.20
## 1st Qu.: 68.6 1st Qu.:437.7 1st Qu.: 6.500 1st Qu.:15.50
## Median :108.3 Median :664.2 Median : 8.400 Median :19.30
## Mean :110.9 Mean :547.9 Mean : 9.022 Mean :18.89
## 3rd Qu.:142.4 3rd Qu.:713.9 3rd Qu.:10.800 3rd Qu.:22.80
## Max. :291.3 Max. :860.6 Max. :56.100 Max. :33.30
##
## RH wind rain area
## Min. : 15.00 Min. :0.400 Min. :0.00000 Min. : 0.00
## 1st Qu.: 33.00 1st Qu.:2.700 1st Qu.:0.00000 1st Qu.: 0.00
## Median : 42.00 Median :4.000 Median :0.00000 Median : 0.52
## Mean : 44.29 Mean :4.018 Mean :0.02166 Mean : 12.85
## 3rd Qu.: 53.00 3rd Qu.:4.900 3rd Qu.:0.00000 3rd Qu.: 6.57
## Max. :100.00 Max. :9.400 Max. :6.40000 Max. :1090.84
##
The file contains 13 different attributes, as seen in the summary output above. When I think about forest fires, the first things that come to mind are what causes them and at what point of the year we are most vulnerable to one occurring.
library(e1071)   # provides svm()
fit_default <- svm(month ~ ., fires.train)
fit_default
##
## Call:
## svm(formula = month ~ ., data = fires.train)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.05555556
##
## Number of Support Vectors: 281
I am using the svm() function from the e1071 R package to begin my SVM learning. The function fits a classifier to a given data set from a formula supplied by the user. The user can also easily change the kernel type and the margin through the kernel and cost arguments. Changing cost changes the penalty for misclassifying data points: the larger the cost, the smaller the margin (the separation between the classes), because a higher price is placed on misclassification, which can lead to overfitting; a smaller cost value allows the margin to widen but introduces more bias. The key is to find the right balance, and when deciding on it, it is best to see how different levels perform on real data. There is another argument we must look at as well, gamma, which determines how much influence a single training point has. If gamma is too large, the radius of influence of the support vectors includes only the support vectors themselves, and no amount of cost regularization will prevent overfitting. If gamma is too small, the model is too constrained and cannot capture the shape of the data. Again it is important to find the correct mix of the two by seeing how actual data reacts to the changes.
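As a concrete illustration of that tuning process, the grid search can be automated with tune.svm() from the same e1071 package; this is a sketch added for reference, and the grids below are illustrative rather than the values actually tried in this paper.
library(e1071)
# 10-fold cross-validation (the default) over a small grid of gamma and cost values
tuned <- tune.svm(month ~ ., data = fires.train,
                  gamma = 10^(-3:-1), cost = 10^(0:4))
summary(tuned)          # cross-validated error for every (gamma, cost) pair
tuned$best.parameters   # the combination with the lowest error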
##
## Call:
## svm(formula = month ~ ., data = fires.train, cost = 10000)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 10000
## gamma: 0.05555556
##
## Number of Support Vectors: 246
##
## Call:
## svm(formula = month ~ ., data = fires.train, cost = 0.001)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 0.001
## gamma: 0.05555556
##
## Number of Support Vectors: 328
##
## Call:
## svm(formula = month ~ ., data = fires.train, gamma = 0.1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.1
##
## Number of Support Vectors: 283
##
## Call:
## svm(formula = month ~ ., data = fires.train, gamma = 0.001)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.001
##
## Number of Support Vectors: 329
Since I have experimented with more models than are shown here, I think it is now time to start comparing their predictions to the test portion of my data set. I will generate predictions from my SVM model and compare them to the test data using a simple table.
##
## Call:
## svm(formula = month ~ ., data = fires.train, cost = 10000, gamma = 0.01,
## kernel = "radial")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 10000
## gamma: 0.01
##
## Number of Support Vectors: 201
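The chunk that produced this model, along with the confusion matrix object cm used below, is not echoed in the report. A plausible reconstruction matching the call shown above would be:
fit_1 <- svm(month ~ ., fires.train, cost = 10000, gamma = 0.01, kernel = "radial")
fit_1_pred <- predict(fit_1, fires.test[,-3])          # column 3 of fires is month
cm <- table(pred = fit_1_pred, true = fires.test[,3])  # rows = predicted, columns = true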
1.2.1 Initial Results
Using my fit_1 SVM I was able to organize my results into a table, which in this case is better known as a confusion matrix. It shows the true values (columns) versus the predicted values (rows).
apr aug dec feb jan jul jun mar may nov oct sep
apr 0 0 0 3 0 0 0 0 0 0 0 0
aug 0 29 0 0 1 4 0 0 0 0 0 3
dec 0 0 1 0 0 0 0 0 0 0 0 0
feb 1 0 0 2 0 0 0 0 0 0 0 0
jan 0 0 0 0 0 0 0 0 0 0 0 0
jul 0 1 0 0 0 3 2 0 0 0 0 0
jun 0 0 0 0 0 0 3 0 0 0 0 0
mar 1 1 0 2 0 0 0 7 1 0 0 0
may 0 0 0 0 0 0 0 1 0 0 0 0
nov 0 0 0 0 0 0 0 0 0 0 0 0
oct 0 0 0 0 0 0 0 0 0 0 4 0
sep 0 4 0 0 0 0 0 0 0 0 0 40
Ideally we would like the diagonal to contain all of the data points, which would mean our predictions perfectly match the true values. Of course there is no such thing as a perfect model, so some values fall outside the diagonal. These are known as misclassified points and are something we need to take a closer look at.
correct.pred <- sum(diag(cm))
correct.pred
## [1] 89
total.pred <- sum(cm)
total.pred
## [1] 114
misclass <- (total.pred - correct.pred) / total.pred
misclass
## [1] 0.2192982
By the code above, the model correctly predicted 89 of the 114 test points, yielding a misclassification rate of roughly 21.9 percent. That is not a bad rate from my point of view. This was with the radial kernel, but there are other kernel methods used inside SVMs that can be more effective and useful. I want to look specifically at the linear kernel now.
1.3 Linear Kernel SVM’s
I will now use the same approach as discussed earlier for finding a suitable cost and gamma for my SVM, but this time with a linear kernel.
fit_2 <- svm(month~., fires.train, cost = 1000, kernel = 'linear')
summary(fit_2)
##
## Call:
## svm(formula = month ~ ., data = fires.train, cost = 1000, kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1000
## gamma: 0.05555556
##
## Number of Support Vectors: 157
##
## ( 21 9 39 38 5 17 11 1 8 7 1 )
##
##
## Number of Classes: 11
##
## Levels:
## apr aug dec feb jan jul jun mar may nov oct sep
fit_2_pred <- predict(fit_2, fires.test[,-3])
cm_2 <- table(pred = fit_2_pred, true = fires.test[,3])
sum(diag(cm_2))
## [1] 91
sum(cm_2)
## [1] 114
(sum(cm_2)-sum(diag(cm_2)))/sum(cm_2)
## [1] 0.2017544
The initial model I made with the linear kernel is slightly better than the first model with the radial kernel, since misclassification drops by nearly two percentage points. I am going to experiment with the cost level and see if I can further improve the model.
fit_3 <- svm(month~., fires.train, cost = 100, kernel = 'linear')
fit_3_pred <- predict(fit_3, fires.test[,-3])
cm_3 <- table(pred = fit_3_pred, true = fires.test[,3])
sum(diag(cm_3))
## [1] 90
sum(cm_3)
## [1] 114
(sum(cm_3)-sum(diag(cm_3)))/sum(cm_3)
## [1] 0.2105263
1.3.1 Results of Linear Kernel
Below is the confusion matrix for the higher-cost (cost = 100) SVM.
apr aug dec feb jan jul jun mar may nov oct sep
apr 0 0 0 3 0 0 0 1 0 0 0 0
aug 0 31 0 0 0 3 0 0 0 0 0 3
dec 0 0 1 0 0 0 0 0 0 0 0 0
feb 1 0 0 2 1 0 0 0 0 0 0 0
jan 0 0 0 0 0 0 0 0 0 0 0 0
jul 0 0 0 0 0 4 3 0 0 0 0 0
jun 0 0 0 0 0 0 2 0 0 0 0 0
mar 1 1 0 2 0 0 0 6 1 0 0 0
may 0 0 0 0 0 0 0 1 0 0 0 0
nov 0 0 0 0 0 0 0 0 0 0 0 0
oct 0 0 0 0 0 0 0 0 0 0 4 0
sep 0 3 0 0 0 0 0 0 0 0 0 40
fit_4 <- svm(month~., fires.train, cost = 1, kernel = 'linear')
fit_4_pred <- predict(fit_4, fires.test[,-3])
cm_4 <- table(pred = fit_4_pred, true = fires.test[,3])
sum(diag(cm_4))
## [1] 90
sum(cm_4)
## [1] 114
(sum(cm_4)-sum(diag(cm_4)))/sum(cm_4)
## [1] 0.2105263
Below is the confusion matrix for the default-cost (cost = 1) SVM.
apr aug dec feb jan jul jun mar may nov oct sep
apr 0 0 0 1 0 0 0 0 0 0 0 0
aug 0 31 0 0 0 4 0 0 0 0 0 3
dec 0 0 1 0 0 0 0 0 0 0 0 0
feb 1 0 0 2 0 0 0 0 0 0 0 0
jan 0 0 0 0 0 0 0 0 0 0 0 0
jul 0 0 0 0 0 2 2 0 0 0 0 0
jun 0 0 0 0 1 0 3 0 0 0 0 0
mar 1 1 0 4 0 0 0 7 1 0 0 0
may 0 0 0 0 0 0 0 1 0 0 0 0
nov 0 0 0 0 0 0 0 0 0 0 0 0
oct 0 0 0 0 0 0 0 0 0 0 4 0
sep 0 3 0 0 0 1 0 0 0 0 0 40
I have continued to experiment with the cost argument of svm(), and because of the randomness of the train/test split I continue to get mixed results: sometimes the model with the higher cost is more accurate and sometimes it is the other way around. I have shown two models above, one with a higher cost and one with the default cost of 1. Their misclassification rates both come out to roughly 21.1 percent.
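Because a single random split gives noisy comparisons like this, a more stable way to pick the cost (a sketch added for reference, not part of the original runs) is to cross-validate over the whole data set with tune():
library(e1071)
set.seed(1)   # hypothetical seed, only to make the folds repeatable
cv_linear <- tune(svm, month ~ ., data = fires, kernel = "linear",
                  ranges = list(cost = c(0.1, 1, 10, 100, 1000)),
                  tunecontrol = tune.control(cross = 10))
summary(cv_linear)   # mean 10-fold cross-validated error for each cost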
1.4 Recap
After analyzing and trying different SVMs, I wanted to add a visual showing the count of fires per month in the data set, to see whether my predictions and the overall study make sense given the data I am testing.
[Figure: Fire Counts by Month — bar chart of the number of fires for each month (Count on the horizontal axis, 0 to 150+; Month on the vertical axis).]
The visual speaks for itself: the fires clearly dominate two specific months, August and September, as we predicted and saw in our confusion matrices. The misclassification in those months is not hard to believe given the sheer numbers. I feel content with my model having roughly an 80 percent success rate for now.
2 Elastic Net Regression
What is elastic net regression? What does it mean? Where should we use it? These are the three questions that drove me to write this part of the paper. I first want to define elastic net regression and discuss what the definition means in practical terms. I will then use elastic net regression on some real-life data and see how it can be applied in a real-world setting.
2.1 Definition and Meaning
It is worth noting that elastic net regression builds on two other regularized regression techniques, so before defining it I need to discuss the lasso and ridge methods. Lasso is an acronym for least absolute shrinkage and selection operator; it shrinks regression estimates by penalizing the sum of the absolute values of the coefficients, so large coefficients are penalized most. As the penalty grows, the estimates are driven toward zero, and in the lasso method they are allowed to become exactly zero. This differs from the ridge technique, where the estimates shrink as the penalty grows but remain nonzero. Ridge (Tikhonov) regression is essentially the penalizing of large parameter estimates in a linear regression through the sum of their squares, again with the main difference from lasso being that ridge estimates remain nonzero.

Now that I have covered these prerequisites I am ready to get to elastic net regression. This method linearly combines the absolute-value (lasso) penalty and the squared (ridge) penalty to overcome shortcomings of each. Specifically, elastic net addresses problems of the lasso when there are many parameters but a relatively small sample size, and when several parameters are strongly correlated: the lasso tends to keep only one of a group of correlated parameters, while elastic net accounts for that correlation by adding a quadratic term. The addition of a quadratic term does introduce another issue, the biggest being double shrinkage: the coefficients are shrunk twice, once by each penalty, which is a problem unless it is corrected by a rescaling factor. One of the main benefits of elastic net regression is that its objective is convex, which guarantees a single global minimum.
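In the parameterization used by the glmnet package (which is what I rely on below), the elastic net estimate for a continuous response solves
\[
\hat{\beta} \;=\; \arg\min_{\beta_0,\,\beta}\ \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i-\beta_0-x_i^{\top}\beta\bigr)^{2}
\;+\;\lambda\Bigl(\alpha\lVert\beta\rVert_1+\tfrac{1-\alpha}{2}\lVert\beta\rVert_2^{2}\Bigr),
\]
so \alpha = 1 recovers the lasso penalty, \alpha = 0 recovers the ridge penalty, and \lambda controls the overall amount of shrinkage.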
2.2 Preliminary Testing
I want to begin with some tests using elastic net regression on the same data set I used with the SVM in the first section of this paper. I will keep the same initial question and goal for the regression, which again is: in which months are we most susceptible to forest fires? I want to see whether the elastic net methods give me fewer misclassifications. I will first look at the original training and testing sets made in section 1, along with a quick dimension check on 'fires' to make sure the sets have remained the same as before.
summary(fires.train)
## X Y month day FFMC
## Min. :1.000 Min. :2.00 aug :129 fri:63 Min. :53.40
## 1st Qu.:2.000 1st Qu.:4.00 sep :109 mon:46 1st Qu.:90.20
## Median :4.000 Median :4.00 mar : 39 sat:66 Median :91.60
## Mean :4.681 Mean :4.31 jul : 21 sun:51 Mean :90.69
## 3rd Qu.:7.000 3rd Qu.:5.00 feb : 12 thu:45 3rd Qu.:92.90
## Max. :9.000 Max. :9.00 jun : 10 tue:39 Max. :96.20
## (Other): 25 wed:35
## DMC DC ISI temp
## Min. : 2.4 Min. : 9.3 Min. : 0.400 Min. : 2.20
## 1st Qu.: 61.1 1st Qu.:433.3 1st Qu.: 6.300 1st Qu.:15.90
## Median :108.4 Median :664.2 Median : 8.400 Median :19.60
## Mean :111.9 Mean :546.8 Mean : 8.868 Mean :19.12
## 3rd Qu.:145.4 3rd Qu.:713.0 3rd Qu.:11.000 3rd Qu.:22.90
## Max. :291.3 Max. :860.6 Max. :22.700 Max. :33.10
##
## RH wind rain area
## Min. :18.0 Min. :0.900 Min. :0.00000 Min. : 0.00
## 1st Qu.:32.0 1st Qu.:2.700 1st Qu.:0.00000 1st Qu.: 0.00
## Median :41.0 Median :4.000 Median :0.00000 Median : 0.47
## Mean :43.6 Mean :4.002 Mean :0.03015 Mean : 14.71
## 3rd Qu.:53.0 3rd Qu.:4.900 3rd Qu.:0.00000 3rd Qu.: 6.96
## Max. :96.0 Max. :9.400 Max. :6.40000 Max. :1090.84
##
summary(fires.test)
## X Y month day FFMC
## Min. :1.000 Min. :2.000 sep :43 fri:17 Min. :18.70
## 1st Qu.:3.000 1st Qu.:4.000 aug :35 mon:17 1st Qu.:90.20
## Median :4.000 Median :4.000 mar : 8 sat:11 Median :91.70
## Mean :4.649 Mean :4.272 feb : 7 sun:29 Mean :90.11
## 3rd Qu.:6.000 3rd Qu.:5.000 jul : 7 thu:10 3rd Qu.:92.88
## Max. :9.000 Max. :9.000 jun : 5 tue:14 Max. :96.20
## (Other): 9 wed:16
## DMC DC ISI temp
## Min. : 1.10 Min. : 7.9 Min. : 0.000 Min. : 4.20
## 1st Qu.: 65.08 1st Qu.:444.7 1st Qu.: 6.500 1st Qu.:14.82
## Median :103.10 Median :663.0 Median : 8.500 Median :18.05
## Mean :103.66 Mean :549.2 Mean : 9.376 Mean :18.23
## 3rd Qu.:135.10 3rd Qu.:714.8 3rd Qu.:10.925 3rd Qu.:22.55
## Max. :290.00 Max. :855.3 Max. :56.100 Max. :32.40
##
## RH wind rain area
## Min. : 15.00 Min. :0.400 Min. :0.000000 Min. : 0.000
## 1st Qu.: 35.00 1st Qu.:3.100 1st Qu.:0.000000 1st Qu.: 0.000
## Median : 42.50 Median :4.000 Median :0.000000 Median : 1.015
## Mean : 46.04 Mean :4.092 Mean :0.003509 Mean : 10.307
## 3rd Qu.: 54.00 3rd Qu.:5.400 3rd Qu.:0.000000 3rd Qu.: 5.785
## Max. :100.00 Max. :8.500 Max. :0.200000 Max. :278.530
##
dim(fires)
## [1] 517 13
Let us begin with the 'glmnet' package. I need to construct the correct response vector (y) and input matrix (x).
names(fires.train)
## [1] "X" "Y" "month" "day" "FFMC" "DMC" "DC" "ISI"
## [9] "temp" "RH" "wind" "rain" "area"
fires.traiN <- fires.train[,-4]
names(fires.traiN)
## [1] "X" "Y" "month" "FFMC" "DMC" "DC" "ISI" "temp"
## [9] "RH" "wind" "rain" "area"
x <- fires.traiN[,-3]
y <- fires.traiN[,3]
names(x)
## [1] "X" "Y" "FFMC" "DMC" "DC" "ISI" "temp" "RH" "wind" "rain"
## [11] "area"
Y <- gsub('jan','1',y)
Y <- gsub('feb', '2',Y)
Y <- gsub('mar', '3',Y)
Y <- gsub('apr', '4',Y)
Y <- gsub('may', '5',Y)
Y <- gsub('jun', '6',Y)
Y <- gsub('jul', '7',Y)
Y <- gsub('aug', '8',Y)
Y <- gsub('sep', '9',Y)
Y <- gsub('oct', '10',Y)
Y <- gsub('nov', '11',Y)
Y <- gsub('dec', '12',Y)
class(Y)
## [1] "character"
Y <- as.numeric(Y)
Y <- as.integer(Y)
class(Y)
## [1] "integer"
x[,1] <- as.numeric(x[,1])
x[,2] <- as.numeric(x[,2])
x[,3] <- as.numeric(x[,3])
x[,4] <- as.numeric(x[,4])
x[,5] <- as.numeric(x[,5])
x[,6] <- as.numeric(x[,6])
x[,7] <- as.numeric(x[,7])
x[,8] <- as.numeric(x[,8])
x[,9] <- as.numeric(x[,9])
x[,10] <- as.numeric(x[,10])
x[,11] <- as.numeric(x[,11])
head(x)
## X Y FFMC DMC DC ISI temp RH wind rain area
## 1 7 5 86.2 26.2 94.3 5.1 8.2 51 6.7 0 0
## 2 7 4 90.6 35.4 669.1 6.7 18.0 33 0.9 0 0
## 3 7 4 90.6 43.7 686.9 6.7 14.6 33 1.3 0 0
## 6 8 6 92.3 85.3 488.0 14.7 22.2 29 5.4 0 0
## 7 8 6 92.3 88.9 495.6 8.5 24.1 27 3.1 0 0
## 8 8 6 91.5 145.4 608.2 10.7 8.0 86 2.2 0 0
dim(x)
## [1] 345 11
length(Y)
## [1] 345
class(x)
## [1] "data.frame"
str(x)
## 'data.frame': 345 obs. of 11 variables:
## $ X : num 7 7 7 8 8 8 8 7 7 6 ...
## $ Y : num 5 4 4 6 6 6 6 5 5 5 ...
## $ FFMC: num 86.2 90.6 90.6 92.3 92.3 91.5 91 92.5 92.8 63.5 ...
## $ DMC : num 26.2 35.4 43.7 85.3 88.9 ...
## $ DC : num 94.3 669.1 686.9 488 495.6 ...
## $ ISI : num 5.1 6.7 6.7 14.7 8.5 10.7 7 7.1 22.6 0.8 ...
## $ temp: num 8.2 18 14.6 22.2 24.1 8 13.1 17.8 19.3 17 ...
## $ RH : num 51 33 33 29 27 86 63 51 38 72 ...
## $ wind: num 6.7 0.9 1.3 5.4 3.1 2.2 5.4 7.2 4 6.7 ...
## $ rain: num 0 0 0 0 0 0 0 0 0 0 ...
## $ area: num 0 0 0 0 0 0 0 0 0 0 ...
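For reference, the same response vector and input matrix could be built much more compactly; this is a sketch with my own object names, using the built-in month.abb vector of month abbreviations.
Y_alt <- match(as.character(y), tolower(month.abb))   # position of each month in jan..dec, i.e. 1-12
X_alt <- data.matrix(fires.traiN[, -3])               # coerces every remaining column to numeric
identical(Y_alt, Y)                                   # should be TRUE if the gsub() route above worked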
X <- as.matrix(x)
class(X)
## [1] "matrix"
library(glmnet)   # provides cv.glmnet()
results <- cv.glmnet(X, Y)
results
## $lambda
## [1] 1.990315408 1.813501273 1.652394818 1.505600617 1.371847207
## [6] 1.249976083 1.138931652 1.037752103 0.945561067 0.861560028
## [11] 0.785021409 0.715282271 0.651738565 0.593839908 0.541084807
## [16] 0.493016324 0.449218112 0.409310813 0.372948769 0.339817028
## [21] 0.309628620 0.282122067 0.257059120 0.234222696 0.213414997
## [26] 0.194455797 0.177180880 0.161440619 0.147098679 0.134030838
## [31] 0.122123908 0.111274757 0.101389414 0.092382258 0.084175273
## [36] 0.076697373 0.069883790 0.063675507 0.058018750 0.052864524
## [41] 0.048168186 0.043889057 0.039990074 0.036437466 0.033200462
## [46] 0.030251024 0.027563607 0.025114932 0.022883791 0.020850858
## [51] 0.018998525 0.017310748 0.015772909 0.014371687 0.013094946
## [56] 0.011931627 0.010871655 0.009905847 0.009025839 0.008224008
## [61] 0.007493410 0.006827716 0.006221160 0.005668490 0.005164917
## [66] 0.004706080 0.004288005 0.003907070 0.003559977 0.003243718
##
## $cvm
## [1] 5.1557896 4.5274734 3.9661191 3.5000809 3.1131751 2.7919656 2.5252979
## [8] 2.3039112 2.1201170 1.9675324 1.8408582 1.7356947 1.6483897 1.5759106
## [15] 1.5157400 1.4657880 1.4243193 1.3898935 1.3613145 1.3375895 1.3178942
## [22] 1.3015444 1.2879718 1.2767162 1.2682002 1.2602525 1.2383533 1.2062084
## [29] 1.1744998 1.1488916 1.1248634 1.1043967 1.0857760 1.0690638 1.0542104
## [36] 1.0383157 1.0235117 1.0107188 1.0003412 0.9921089 0.9851035 0.9795120
## [43] 0.9752539 0.9719607 0.9696518 0.9682564 0.9674289 0.9668369 0.9664844
## [50] 0.9661883 0.9660219 0.9661044 0.9662086 0.9664366 0.9666119 0.9669238
## [57] 0.9671213 0.9673236 0.9675142 0.9676450 0.9677656 0.9679127 0.9681149
## [64] 0.9682766 0.9683967 0.9685622 0.9686798 0.9688161 0.9689918 0.9690978
##
## $cvsd
## [1] 0.3052258 0.2965443 0.2813673 0.2717944 0.2664935 0.2643274 0.2643760
## [8] 0.2659282 0.2684540 0.2715701 0.2750053 0.2785719 0.2821425 0.2856328
## [15] 0.2889890 0.2921786 0.2951837 0.2979968 0.3006170 0.3030482 0.3052973
## [22] 0.3073728 0.3092844 0.3110387 0.3124359 0.3137672 0.3156819 0.3134652
## [29] 0.3089602 0.3044185 0.2989233 0.2934814 0.2875823 0.2818579 0.2760164
## [36] 0.2686802 0.2611669 0.2540860 0.2474377 0.2412654 0.2352859 0.2297747
## [43] 0.2247959 0.2203583 0.2162666 0.2124843 0.2089385 0.2057743 0.2028791
## [50] 0.2002374 0.1978727 0.1957308 0.1937396 0.1919263 0.1901875 0.1885591
## [57] 0.1870212 0.1855954 0.1842725 0.1830436 0.1819223 0.1808839 0.1799487
## [64] 0.1790834 0.1782956 0.1775804 0.1769257 0.1763354 0.1758300 0.1753255
##
## $cvup
## [1] 5.461015 4.824018 4.247486 3.771875 3.379669 3.056293 2.789674
## [8] 2.569839 2.388571 2.239103 2.115863 2.014267 1.930532 1.861543
## [15] 1.804729 1.757967 1.719503 1.687890 1.661931 1.640638 1.623192
## [22] 1.608917 1.597256 1.587755 1.580636 1.574020 1.554035 1.519674
## [29] 1.483460 1.453310 1.423787 1.397878 1.373358 1.350922 1.330227
## [36] 1.306996 1.284679 1.264805 1.247779 1.233374 1.220389 1.209287
## [43] 1.200050 1.192319 1.185918 1.180741 1.176367 1.172611 1.169364
## [50] 1.166426 1.163895 1.161835 1.159948 1.158363 1.156799 1.155483
## [57] 1.154143 1.152919 1.151787 1.150689 1.149688 1.148797 1.148064
## [64] 1.147360 1.146692 1.146143 1.145606 1.145152 1.144822 1.144423
##
## $cvlo
## [1] 4.8505637 4.2309291 3.6847518 3.2282865 2.8466816 2.5276382 2.2609219
## [8] 2.0379830 1.8516630 1.6959623 1.5658528 1.4571228 1.3662471 1.2902777
## [15] 1.2267510 1.1736094 1.1291356 1.0918967 1.0606975 1.0345412 1.0125969
## [22] 0.9941715 0.9786873 0.9656774 0.9557643 0.9464853 0.9226714 0.8927433
## [29] 0.8655396 0.8444731 0.8259401 0.8109153 0.7981937 0.7872059 0.7781940
## [36] 0.7696355 0.7623448 0.7566328 0.7529035 0.7508436 0.7498176 0.7497373
## [43] 0.7504579 0.7516025 0.7533852 0.7557721 0.7584904 0.7610626 0.7636052
## [50] 0.7659509 0.7681492 0.7703736 0.7724690 0.7745103 0.7764244 0.7783646
## [57] 0.7801001 0.7817282 0.7832417 0.7846014 0.7858433 0.7870289 0.7881662
## [64] 0.7891931 0.7901011 0.7909817 0.7917541 0.7924807 0.7931617 0.7937723
##
## $nzero
## s0 s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11 s12 s13 s14 s15 s16 s17
## 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## s18 s19 s20 s21 s22 s23 s24 s25 s26 s27 s28 s29 s30 s31 s32 s33 s34 s35
## 1 1 1 1 1 1 1 2 3 3 3 3 4 5 5 5 6 6
## s36 s37 s38 s39 s40 s41 s42 s43 s44 s45 s46 s47 s48 s49 s50 s51 s52 s53
## 6 6 6 6 6 6 6 7 7 7 7 9 9 9 9 10 10 10
## s54 s55 s56 s57 s58 s59 s60 s61 s62 s63 s64 s65 s66 s67 s68 s69
## 10 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
##
## $name
## mse
## "Mean-Squared Error"
##
## $glmnet.fit
##
## Call: glmnet(x = X, y = Y)
##
## Df %Dev Lambda
## [1,] 0 0.0000 1.990000
## [2,] 1 0.1301 1.814000
## [3,] 1 0.2381 1.652000
## [4,] 1 0.3277 1.506000
## [5,] 1 0.4022 1.372000
## [6,] 1 0.4640 1.250000
## [7,] 1 0.5153 1.139000
## [8,] 1 0.5579 1.038000
## [9,] 1 0.5932 0.945600
## [10,] 1 0.6226 0.861600
## [11,] 1 0.6470 0.785000
## [12,] 1 0.6672 0.715300
## [13,] 1 0.6840 0.651700
## [14,] 1 0.6980 0.593800
## [15,] 1 0.7095 0.541100
## [16,] 1 0.7191 0.493000
## [17,] 1 0.7271 0.449200
## [18,] 1 0.7338 0.409300
## [19,] 1 0.7393 0.372900
## [20,] 1 0.7438 0.339800
## [21,] 1 0.7476 0.309600
## [22,] 1 0.7508 0.282100
## [23,] 1 0.7534 0.257100
## [24,] 1 0.7555 0.234200
## [25,] 1 0.7573 0.213400
## [26,] 2 0.7593 0.194500
## [27,] 3 0.7664 0.177200
## [28,] 3 0.7747 0.161400
## [29,] 3 0.7817 0.147100
## [30,] 3 0.7874 0.134000
## [31,] 4 0.7922 0.122100
## [32,] 5 0.7986 0.111300
## [33,] 5 0.8040 0.101400
## [34,] 5 0.8086 0.092380
## [35,] 6 0.8128 0.084180
## [36,] 6 0.8164 0.076700
## [37,] 6 0.8193 0.069880
## [38,] 6 0.8218 0.063680
## [39,] 6 0.8238 0.058020
## [40,] 6 0.8255 0.052860
## [41,] 6 0.8269 0.048170
## [42,] 6 0.8281 0.043890
## [43,] 6 0.8290 0.039990
## [44,] 7 0.8299 0.036440
## [45,] 7 0.8306 0.033200
## [46,] 7 0.8312 0.030250
## [47,] 7 0.8316 0.027560
## [48,] 9 0.8322 0.025110
## [49,] 9 0.8326 0.022880
## [50,] 9 0.8331 0.020850
## [51,] 9 0.8334 0.019000
## [52,] 10 0.8337 0.017310
## [53,] 10 0.8340 0.015770
## [54,] 10 0.8343 0.014370
## [55,] 10 0.8345 0.013090
## [56,] 11 0.8347 0.011930
## [57,] 11 0.8348 0.010870
## [58,] 11 0.8349 0.009906
## [59,] 11 0.8351 0.009026
## [60,] 11 0.8351 0.008224
## [61,] 11 0.8352 0.007493
## [62,] 11 0.8353 0.006828
## [63,] 11 0.8353 0.006221
## [64,] 11 0.8354 0.005668
## [65,] 11 0.8354 0.005165
## [66,] 11 0.8354 0.004706
## [67,] 11 0.8355 0.004288
## [68,] 11 0.8355 0.003907
## [69,] 11 0.8355 0.003560
## [70,] 11 0.8355 0.003244
## [71,] 11 0.8355 0.002956
## [72,] 11 0.8355 0.002693
## [73,] 11 0.8355 0.002454
##
## $lambda.min
## [1] 0.01899853
##
## $lambda.1se
## [1] 0.1340308
##
## attr(,"class")
## [1] "cv.glmnet"
Now I have a model from the elastic net function that I can use to predict values. I will use the test set I made earlier, feed the test matrix into the model, and compare the output to the real values given in the original data set.
names(fires.test)
## [1] "X" "Y" "month" "day" "FFMC" "DMC" "DC" "ISI"
## [9] "temp" "RH" "wind" "rain" "area"
fires.tesT <- fires.test[,-4]
names(fires.tesT)
## [1] "X" "Y" "month" "FFMC" "DMC" "DC" "ISI" "temp"
## [9] "RH" "wind" "rain" "area"
xp <- fires.tesT[,-3]
yp <- fires.tesT[,3]
Yp <- gsub('jan','1',yp)
Yp <- gsub('feb', '2',Yp)
Yp <- gsub('mar', '3',Yp)
Yp <- gsub('apr', '4',Yp)
Yp <- gsub('may', '5',Yp)
Yp <- gsub('jun', '6',Yp)
Yp <- gsub('jul', '7',Yp)
Yp <- gsub('aug', '8',Yp)
Yp <- gsub('sep', '9',Yp)
Yp <- gsub('oct', '10',Yp)
Yp <- gsub('nov', '11',Yp)
Yp <- gsub('dec', '12',Yp)
class(Yp)
## [1] "character"
Yp <- as.numeric(Yp)
Yp <- as.integer(Yp)
class(Yp)
## [1] "integer"
xp[,1] <- as.numeric(xp[,1])
xp[,2] <- as.numeric(xp[,2])
xp[,3] <- as.numeric(xp[,3])
xp[,4] <- as.numeric(xp[,4])
xp[,5] <- as.numeric(xp[,5])
xp[,6] <- as.numeric(xp[,6])
xp[,7] <- as.numeric(xp[,7])
xp[,8] <- as.numeric(xp[,8])
xp[,9] <- as.numeric(xp[,9])
xp[,10] <- as.numeric(xp[,10])
xp[,11] <- as.numeric(xp[,11])
Xp <- as.matrix(xp)
PRED1 <- round(predict(results, Xp, lambda = results$lambda.min), 0)
# note: predict() for a cv.glmnet object selects the penalty through its 's' argument,
# so the 'lambda =' above is most likely ignored (see the note after the tables)
It makes sense to show the outputs as tables so that we can see the different counts. Keep in mind that PRED1 is intended to be the predictions at the minimum lambda from cv.glmnet(), while Yp holds the actual months from the test set.
table(PRED1)
## PRED1
## 3 4 5 6 7 8 9 10
## 3 15 4 4 8 31 44 5
table(Yp)
## Yp
## 1 2 3 4 5 6 7 8 9 10 12
## 1 7 8 2 1 5 7 35 43 4 1
That was intended to use the minimum lambda found by cv.glmnet(). I now wonder whether we can get a better model by simplifying a little and using the $lambda.1se value instead of $lambda.min.
PRED2 <- round(predict(results, Xp, lambda = results$lambda.1se),0)
table(PRED2)
## PRED2
## 3 4 5 6 7 8 9 10
## 3 15 4 4 8 31 44 5
Comparing the two sets of predictions to the true data, the tables for PRED1 and PRED2 turn out to be identical. This is most likely because predict() for a cv.glmnet object selects the penalty through its s argument, so the lambda argument used above was probably ignored and the default choice (lambda.1se) applied in both calls.
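For reference, a minimal sketch of the corrected calls (the object names PRED_min and PRED_1se are mine):
PRED_min <- round(predict(results, Xp, s = results$lambda.min), 0)   # lambda minimising CV error
PRED_1se <- round(predict(results, Xp, s = results$lambda.1se), 0)   # more heavily penalised choice
table(PRED_min)
table(PRED_1se)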
The accuracy is not as good as the SVM in part one, but we can see that we get the same two "problem" months as in the SVM section, August and September. I now want to look more closely at the glmnet() function and explore some LASSO and ridge regressions. It should not be a painful process, because all I have to do is set the mixing parameter alpha to either 0 or 1 depending on which model I would like to use.
3 Ridge Regression
Now I will look into ridge regression a little and compare it with the elastic net regression I used before. In principle I could use the same glmnet() function as previously and simply set alpha to 0, as in the sketch below; in what follows I instead fit the ridge model with lm.ridge(), which first requires converting the factor columns to numeric.
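A minimal sketch of that glmnet route, reusing the X, Y, and Xp objects from section 2 (not swapped in for the analysis that follows):
ridge_cv <- cv.glmnet(X, Y, alpha = 0)                          # alpha = 0 keeps only the ridge penalty
ridge_pred <- round(predict(ridge_cv, Xp, s = "lambda.min"), 0)
table(ridge_pred)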
dim(fires.train)
## [1] 345 13
dim(fires.test)
## [1] 114 13
str(fires.train)
## 'data.frame': 345 obs. of 13 variables:
## $ X : int 7 7 7 8 8 8 8 7 7 6 ...
## $ Y : int 5 4 4 6 6 6 6 5 5 5 ...
## $ month: Factor w/ 12 levels "apr","aug","dec",..: 8 11 11 2 2 2 12 12 12 2 ...
## $ day : Factor w/ 7 levels "fri","mon","sat",..: 1 6 3 4 2 2 6 3 3 1 ...
## $ FFMC : num 86.2 90.6 90.6 92.3 92.3 91.5 91 92.5 92.8 63.5 ...
## $ DMC : num 26.2 35.4 43.7 85.3 88.9 ...
## $ DC : num 94.3 669.1 686.9 488 495.6 ...
## $ ISI : num 5.1 6.7 6.7 14.7 8.5 10.7 7 7.1 22.6 0.8 ...
## $ temp : num 8.2 18 14.6 22.2 24.1 8 13.1 17.8 19.3 17 ...
## $ RH : int 51 33 33 29 27 86 63 51 38 72 ...
## $ wind : num 6.7 0.9 1.3 5.4 3.1 2.2 5.4 7.2 4 6.7 ...
## $ rain : num 0 0 0 0 0 0 0 0 0 0 ...
## $ area : num 0 0 0 0 0 0 0 0 0 0 ...
fT <- fires.train
fT[,3] <- gsub('jan','1',fT[,3])
fT[,3] <- gsub('feb', '2',fT[,3])
fT[,3] <- gsub('mar', '3',fT[,3])
fT[,3] <- gsub('apr', '4',fT[,3])
fT[,3] <- gsub('may', '5',fT[,3])
fT[,3] <- gsub('jun', '6',fT[,3])
fT[,3] <- gsub('jul', '7',fT[,3])
fT[,3] <- gsub('aug', '8',fT[,3])
fT[,3] <- gsub('sep', '9',fT[,3])
fT[,3] <- gsub('oct', '10',fT[,3])
fT[,3] <- gsub('nov', '11',fT[,3])
fT[,3] <- gsub('dec', '12',fT[,3])
class(fT[,3])
## [1] "character"
fT[,3] <- as.numeric(fT[,3])
fT[,3] <- as.integer(fT[,3])
class(fT[,3])
## [1] "integer"
str(fT)
## 'data.frame': 345 obs. of 13 variables:
## $ X : int 7 7 7 8 8 8 8 7 7 6 ...
## $ Y : int 5 4 4 6 6 6 6 5 5 5 ...
## $ month: int 3 10 10 8 8 8 9 9 9 8 ...
## $ day : Factor w/ 7 levels "fri","mon","sat",..: 1 6 3 4 2 2 6 3 3 1 ...
## $ FFMC : num 86.2 90.6 90.6 92.3 92.3 91.5 91 92.5 92.8 63.5 ...
## $ DMC : num 26.2 35.4 43.7 85.3 88.9 ...
## $ DC : num 94.3 669.1 686.9 488 495.6 ...
## $ ISI : num 5.1 6.7 6.7 14.7 8.5 10.7 7 7.1 22.6 0.8 ...
## $ temp : num 8.2 18 14.6 22.2 24.1 8 13.1 17.8 19.3 17 ...
## $ RH : int 51 33 33 29 27 86 63 51 38 72 ...
## $ wind : num 6.7 0.9 1.3 5.4 3.1 2.2 5.4 7.2 4 6.7 ...
## $ rain : num 0 0 0 0 0 0 0 0 0 0 ...
## $ area : num 0 0 0 0 0 0 0 0 0 0 ...
fT[,4] <- gsub('mon','1',fT[,4])
fT[,4] <- gsub('tue', '2',fT[,4])
fT[,4] <- gsub('wed', '3',fT[,4])
fT[,4] <- gsub('thu', '4',fT[,4])
fT[,4] <- gsub('fri', '5',fT[,4])
fT[,4] <- gsub('sat', '6',fT[,4])
fT[,4] <- gsub('sun', '7',fT[,4])
class(fT[,4])
## [1] "character"
fT[,4] <- as.numeric(fT[,4])
fT[,4] <- as.integer(fT[,4])
class(fT[,4])
## [1] "integer"
str(fT)
## 'data.frame': 345 obs. of 13 variables:
## $ X : int 7 7 7 8 8 8 8 7 7 6 ...
## $ Y : int 5 4 4 6 6 6 6 5 5 5 ...
## $ month: int 3 10 10 8 8 8 9 9 9 8 ...
## $ day : int 5 2 6 7 1 1 2 6 6 5 ...
## $ FFMC : num 86.2 90.6 90.6 92.3 92.3 91.5 91 92.5 92.8 63.5 ...
## $ DMC : num 26.2 35.4 43.7 85.3 88.9 ...
## $ DC : num 94.3 669.1 686.9 488 495.6 ...
## $ ISI : num 5.1 6.7 6.7 14.7 8.5 10.7 7 7.1 22.6 0.8 ...
## $ temp : num 8.2 18 14.6 22.2 24.1 8 13.1 17.8 19.3 17 ...
## $ RH : int 51 33 33 29 27 86 63 51 38 72 ...
## $ wind : num 6.7 0.9 1.3 5.4 3.1 2.2 5.4 7.2 4 6.7 ...
## $ rain : num 0 0 0 0 0 0 0 0 0 0 ...
## $ area : num 0 0 0 0 0 0 0 0 0 0 ...
After reconstructing the data frame into the correct structures I was able to finally find a model.
library(MASS)   # provides lm.ridge()
rmod <- lm.ridge(month ~ ., fT)
Now of course I want to test the model using the fires.test subset of the fires data. I have to reconstruct this set in the same way, so there are a few steps I must perform first.
fTs <- fires.test
str(fTs)
## 'data.frame': 114 obs. of 13 variables:
## $ X : int 8 7 8 7 7 6 7 4 4 5 ...
## $ Y : int 6 5 5 4 4 3 3 4 4 6 ...
## $ month: Factor w/ 12 levels "apr","aug","dec",..: 8 12 11 7 2 12 11 2 12 12 ...
## $ day : Factor w/ 7 levels "fri","mon","sat",..: 1 3 2 4 3 4 3 6 3 2 ...
## $ FFMC : num 91.7 92.5 84.9 94.3 90.2 91.7 90.6 94.8 92.5 90.9 ...
## $ DMC : num 33.3 88 32.8 96.3 110.9 ...
## $ DC : num 77.5 698.6 664.2 200 537.4 ...
## $ ISI : num 9 7.1 3 56.1 6.2 7.8 6.7 17 7.1 7 ...
## $ temp : num 8.3 22.8 16.7 21 19.5 17.7 17.8 16.6 19.6 14.7 ...
## $ RH : int 97 40 47 44 43 39 27 54 48 70 ...
## $ wind : num 4 4 4.9 4.5 5.8 3.6 4 5.4 2.7 3.6 ...
## $ rain : num 0.2 0 0 0 0 0 0 0 0 0 ...
## $ area : num 0 0 0 0 0 0 0 0 0 0 ...
fTs[,3] <- gsub('jan','1',fTs[,3])
fTs[,3] <- gsub('feb', '2',fTs[,3])
fTs[,3] <- gsub('mar', '3',fTs[,3])
fTs[,3] <- gsub('apr', '4',fTs[,3])
fTs[,3] <- gsub('may', '5',fTs[,3])
fTs[,3] <- gsub('jun', '6',fTs[,3])
fTs[,3] <- gsub('jul', '7',fTs[,3])
fTs[,3] <- gsub('aug', '8',fTs[,3])
fTs[,3] <- gsub('sep', '9',fTs[,3])
fTs[,3] <- gsub('oct', '10',fTs[,3])
fTs[,3] <- gsub('nov', '11',fTs[,3])
fTs[,3] <- gsub('dec', '12',fTs[,3])
class(fTs[,3])
## [1] "character"
fTs[,3] <- as.numeric(fTs[,3])
fTs[,3] <- as.integer(fTs[,3])
class(fTs[,3])
## [1] "integer"
str(fTs)
## 'data.frame': 114 obs. of 13 variables:
## $ X : int 8 7 8 7 7 6 7 4 4 5 ...
## $ Y : int 6 5 5 4 4 3 3 4 4 6 ...
## $ month: int 3 9 10 6 8 9 10 8 9 9 ...
## $ day : Factor w/ 7 levels "fri","mon","sat",..: 1 3 2 4 3 4 3 6 3 2 ...
## $ FFMC : num 91.7 92.5 84.9 94.3 90.2 91.7 90.6 94.8 92.5 90.9 ...
## $ DMC : num 33.3 88 32.8 96.3 110.9 ...
## $ DC : num 77.5 698.6 664.2 200 537.4 ...
## $ ISI : num 9 7.1 3 56.1 6.2 7.8 6.7 17 7.1 7 ...
## $ temp : num 8.3 22.8 16.7 21 19.5 17.7 17.8 16.6 19.6 14.7 ...
## $ RH : int 97 40 47 44 43 39 27 54 48 70 ...
## $ wind : num 4 4 4.9 4.5 5.8 3.6 4 5.4 2.7 3.6 ...
## $ rain : num 0.2 0 0 0 0 0 0 0 0 0 ...
## $ area : num 0 0 0 0 0 0 0 0 0 0 ...
fTs[,4] <- gsub('mon','1',fTs[,4])
fTs[,4] <- gsub('tue', '2',fTs[,4])
fTs[,4] <- gsub('wed', '3',fTs[,4])
fTs[,4] <- gsub('thu', '4',fTs[,4])
fTs[,4] <- gsub('fri', '5',fTs[,4])
fTs[,4] <- gsub('sat', '6',fTs[,4])
fTs[,4] <- gsub('sun', '7',fTs[,4])
class(fTs[,4])
## [1] "character"
fTs[,4] <- as.numeric(fTs[,4])
fTs[,4] <- as.integer(fTs[,4])
class(fTs[,4])
## [1] "integer"
str(fT)
## 'data.frame': 345 obs. of 13 variables:
## $ X : int 7 7 7 8 8 8 8 7 7 6 ...
## $ Y : int 5 4 4 6 6 6 6 5 5 5 ...
## $ month: int 3 10 10 8 8 8 9 9 9 8 ...
## $ day : int 5 2 6 7 1 1 2 6 6 5 ...
## $ FFMC : num 86.2 90.6 90.6 92.3 92.3 91.5 91 92.5 92.8 63.5 ...
## $ DMC : num 26.2 35.4 43.7 85.3 88.9 ...
## $ DC : num 94.3 669.1 686.9 488 495.6 ...
## $ ISI : num 5.1 6.7 6.7 14.7 8.5 10.7 7 7.1 22.6 0.8 ...
## $ temp : num 8.2 18 14.6 22.2 24.1 8 13.1 17.8 19.3 17 ...
## $ RH : int 51 33 33 29 27 86 63 51 38 72 ...
## $ wind : num 6.7 0.9 1.3 5.4 3.1 2.2 5.4 7.2 4 6.7 ...
## $ rain : num 0 0 0 0 0 0 0 0 0 0 ...
## $ area : num 0 0 0 0 0 0 0 0 0 0 ...
I believe I am ready to perform the prediction on the data. Here goes nothing.
class(rmod)
## [1] "ridgelm"
rmd <- matrix(rmod$coef)
# manual tweaks to a few of the coefficients (values entered by hand)
rmd[4,] <- 0.0271
rmd[6,] <- 0.01
rmd[5,] <- -.008
rmd[9,] <- -.01668
FTS <- fTs[,-3]
str(FTS)
## 'data.frame': 114 obs. of 12 variables:
## $ X : int 8 7 8 7 7 6 7 4 4 5 ...
## $ Y : int 6 5 5 4 4 3 3 4 4 6 ...
## $ day : int 5 6 1 7 6 7 6 2 6 1 ...
## $ FFMC: num 91.7 92.5 84.9 94.3 90.2 91.7 90.6 94.8 92.5 90.9 ...
## $ DMC : num 33.3 88 32.8 96.3 110.9 ...
## $ DC : num 77.5 698.6 664.2 200 537.4 ...
## $ ISI : num 9 7.1 3 56.1 6.2 7.8 6.7 17 7.1 7 ...
## $ temp: num 8.3 22.8 16.7 21 19.5 17.7 17.8 16.6 19.6 14.7 ...
## $ RH : int 97 40 47 44 43 39 27 54 48 70 ...
## $ wind: num 4 4 4.9 4.5 5.8 3.6 4 5.4 2.7 3.6 ...
## $ rain: num 0.2 0 0 0 0 0 0 0 0 0 ...
## $ area: num 0 0 0 0 0 0 0 0 0 0 ...
FTS <- as.matrix(FTS)
class(FTS)
## [1] "matrix"
dim(FTS)
## [1] 114 12
dim(rmd)
## [1] 12 1
rpred <- FTS%*%rmd
I realize the block above might be confusing. Because I could not find an "easy" way to get predictions from the model, I converted the coefficients into a matrix (rmd), built another matrix from the fires.test set, and multiplied the two so that each coefficient lines up with the correct column, producing rpred. The results are decimals because of the way lm.ridge does its calculations, so I convert them to the nearest month by rounding and use table() to count the frequencies, displayed below as freq; the other table shows the true values.
freq <- table(round(rpred,0))
freq
##
## -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 9
## 1 1 5 10 11 13 17 9 17 17 6 4 1 2
true <- table(fTs[,3])
true
##
## 1 2 3 4 5 6 7 8 9 10 12
## 1 7 8 2 1 5 7 35 43 4 1
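Part of the trouble is that rmod$coef holds the coefficients for the centred and scaled predictors and contains no intercept, which is why the raw products above land on such a strange scale (including negative months). lm.ridge() has no predict() method, but coef() returns the intercept plus coefficients on the original scale, so a sketch of the more standard workaround would be:
beta <- coef(rmod)                 # intercept first, then one coefficient per predictor
rpred2 <- cbind(1, FTS) %*% beta   # FTS already has its columns in model order
table(round(rpred2, 0))            # should now land roughly on the 1-12 month scale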
I seem to be having trouble getting objects of class "ridgelm" to behave the way I would like, so I am going to try a different ridge function.
library(ridge)   # provides linearRidge()
lp <- linearRidge(month ~ ., fT)
pre <- table(round(predict(lp, fTs),0))
pre
##
## 2 3 4 5 6 7 8 9 10 11
## 1 8 10 4 5 7 28 41 8 2
true
##
## 1 2 3 4 5 6 7 8 9 10 12
## 1 7 8 2 1 5 7 35 43 4 1
The predictions are not too bad at all, except for the 3rd and 4th months (March and April), where the counts look essentially swapped relative to the true data. I wonder why those two months get such a bad fit; I am not sure how to answer that question.

These results are very similar to the elastic net regression from section 2. They are a little better than the elastic net, but not by much, and they are certainly not better than the SVM models used in section 1.
This leads me to the next idea to focus on, which is a LASSO regression on the fires data. So far I have done elastic net (a mixture of LASSO and ridge) and ridge (which keeps all the predictors and assigns a penalty). The next logical step is LASSO, which allows for variable selection along with the penalty.
4 LASSO Regression
As a reminder, LASSO stands for least absolute shrinkage and selection operator. LASSO shrinks the coefficients as ridge regression did, but the main difference between the two is that LASSO forces some coefficients to exactly 0, eliminating those variables from the model entirely, a process commonly known as variable selection.

I want to continue working with the fires data set from the previous sections. I will start by looking at the overall structure of the data sets again, where fires.train and fires.test are the two subsets of fires used for training and testing the models.
str(fires.test)
## 'data.frame': 114 obs. of 13 variables:
## $ X : int 8 7 8 7 7 6 7 4 4 5 ...
## $ Y : int 6 5 5 4 4 3 3 4 4 6 ...
## $ month: Factor w/ 12 levels "apr","aug","dec",..: 8 12 11 7 2 12 11 2 12 12 ...
## $ day : Factor w/ 7 levels "fri","mon","sat",..: 1 3 2 4 3 4 3 6 3 2 ...
## $ FFMC : num 91.7 92.5 84.9 94.3 90.2 91.7 90.6 94.8 92.5 90.9 ...
## $ DMC : num 33.3 88 32.8 96.3 110.9 ...
## $ DC : num 77.5 698.6 664.2 200 537.4 ...
## $ ISI : num 9 7.1 3 56.1 6.2 7.8 6.7 17 7.1 7 ...
## $ temp : num 8.3 22.8 16.7 21 19.5 17.7 17.8 16.6 19.6 14.7 ...
## $ RH : int 97 40 47 44 43 39 27 54 48 70 ...
## $ wind : num 4 4 4.9 4.5 5.8 3.6 4 5.4 2.7 3.6 ...
## $ rain : num 0.2 0 0 0 0 0 0 0 0 0 ...
## $ area : num 0 0 0 0 0 0 0 0 0 0 ...
str(fires.train)
## 'data.frame': 345 obs. of 13 variables:
## $ X : int 7 7 7 8 8 8 8 7 7 6 ...
## $ Y : int 5 4 4 6 6 6 6 5 5 5 ...
## $ month: Factor w/ 12 levels "apr","aug","dec",..: 8 11 11 2 2 2 12 12 12 2 ...
## $ day : Factor w/ 7 levels "fri","mon","sat",..: 1 6 3 4 2 2 6 3 3 1 ...
## $ FFMC : num 86.2 90.6 90.6 92.3 92.3 91.5 91 92.5 92.8 63.5 ...
## $ DMC : num 26.2 35.4 43.7 85.3 88.9 ...
## $ DC : num 94.3 669.1 686.9 488 495.6 ...
## $ ISI : num 5.1 6.7 6.7 14.7 8.5 10.7 7 7.1 22.6 0.8 ...
## $ temp : num 8.2 18 14.6 22.2 24.1 8 13.1 17.8 19.3 17 ...
## $ RH : int 51 33 33 29 27 86 63 51 38 72 ...
## $ wind : num 6.7 0.9 1.3 5.4 3.1 2.2 5.4 7.2 4 6.7 ...
## $ rain : num 0 0 0 0 0 0 0 0 0 0 ...
## $ area : num 0 0 0 0 0 0 0 0 0 0 ...
I plan to use the lars() function from the R package 'lars', which requires a predictor matrix and a response vector to train the model. Because I already created those in previous sections I can simply reuse them, since they contain the same data I need here: from the elastic net section (section 2), X is the predictor matrix and Y is the response vector (month).
library(lars)   # provides lars() and cv.lars()
lasso <- lars(X, Y)
Now that I have a LASSO model trained, I can test it against the fires.test data set made previously. Again, because the elastic net also needed a predictor matrix and a response vector, I can reuse the matrix and vector formed in that section.
laspre <- predict(lasso, Xp, s = 8)   # s = 8 selects the 8th step of the LARS path (default mode = "step")
t.laspre <- table(round(as.matrix(laspre$fit),0))
t.laspre
##
## 2 3 4 5 6 7 8 9 10
## 1 9 11 3 4 7 27 39 13
table(Yp)
## Yp
## 1 2 3 4 5 6 7 8 9 10 12
## 1 7 8 2 1 5 7 35 43 4 1
This may just be my best model yet. I can see that the groupings are close together; the biggest misclassification I see is for the fourth month. I believe this is yielding the best results so far. I will run it through the process a few more times to see whether I can generally conclude that I have reached my best model.
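One way to choose the stopping point less arbitrarily than s = 8 (a sketch added for reference, using the cross-validation helper in the lars package) would be:
cv_out <- cv.lars(X, Y, K = 10)                   # 10-fold CV over fractions of the full L1 norm
best_frac <- cv_out$index[which.min(cv_out$cv)]   # fraction with the lowest CV error
laspre_cv <- predict(lasso, Xp, s = best_frac, mode = "fraction")
table(round(laspre_cv$fit, 0))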
5 References
UCI Machine Learning Repository (2008), 'Forest Fires' data set, University of California, Irvine.
Zou, Hui and Hastie, Trevor (2004), 'Regularization and variable selection via the elastic net', Stanford University.
23

More Related Content

What's hot

The caret package is a unified interface to a large number of predictive mode...
The caret package is a unified interface to a large number of predictive mode...The caret package is a unified interface to a large number of predictive mode...
The caret package is a unified interface to a large number of predictive mode...odsc
 
Ultra Violet-Visable Spectroscopy Analysis of Spin Coated Substrates
Ultra Violet-Visable Spectroscopy Analysis of Spin Coated SubstratesUltra Violet-Visable Spectroscopy Analysis of Spin Coated Substrates
Ultra Violet-Visable Spectroscopy Analysis of Spin Coated SubstratesNicholas Lauer
 
The caret Package: A Unified Interface for Predictive Models
The caret Package: A Unified Interface for Predictive ModelsThe caret Package: A Unified Interface for Predictive Models
The caret Package: A Unified Interface for Predictive ModelsNYC Predictive Analytics
 
Data Science Job Required Skill Analysis
Data Science Job Required Skill AnalysisData Science Job Required Skill Analysis
Data Science Job Required Skill AnalysisHarsh Kevadia
 
Machine teaching tbo_20190518
Machine teaching tbo_20190518Machine teaching tbo_20190518
Machine teaching tbo_20190518Yi-Fan Liou
 
Reinforcement learning Research experiments OpenAI
Reinforcement learning Research experiments OpenAIReinforcement learning Research experiments OpenAI
Reinforcement learning Research experiments OpenAIRaouf KESKES
 
Huong dan cu the svm
Huong dan cu the svmHuong dan cu the svm
Huong dan cu the svmtaikhoan262
 
The Beginnings Of A Search Engine
The Beginnings Of A Search EngineThe Beginnings Of A Search Engine
The Beginnings Of A Search EngineVirenKhandal
 
Ensemble methods in machine learning
Ensemble methods in machine learningEnsemble methods in machine learning
Ensemble methods in machine learningSANTHOSH RAJA M G
 

What's hot (9)

The caret package is a unified interface to a large number of predictive mode...
The caret package is a unified interface to a large number of predictive mode...The caret package is a unified interface to a large number of predictive mode...
The caret package is a unified interface to a large number of predictive mode...
 
Ultra Violet-Visable Spectroscopy Analysis of Spin Coated Substrates
Ultra Violet-Visable Spectroscopy Analysis of Spin Coated SubstratesUltra Violet-Visable Spectroscopy Analysis of Spin Coated Substrates
Ultra Violet-Visable Spectroscopy Analysis of Spin Coated Substrates
 
The caret Package: A Unified Interface for Predictive Models
The caret Package: A Unified Interface for Predictive ModelsThe caret Package: A Unified Interface for Predictive Models
The caret Package: A Unified Interface for Predictive Models
 
Data Science Job Required Skill Analysis
Data Science Job Required Skill AnalysisData Science Job Required Skill Analysis
Data Science Job Required Skill Analysis
 
Machine teaching tbo_20190518
Machine teaching tbo_20190518Machine teaching tbo_20190518
Machine teaching tbo_20190518
 
Reinforcement learning Research experiments OpenAI
Reinforcement learning Research experiments OpenAIReinforcement learning Research experiments OpenAI
Reinforcement learning Research experiments OpenAI
 
Huong dan cu the svm
Huong dan cu the svmHuong dan cu the svm
Huong dan cu the svm
 
The Beginnings Of A Search Engine
The Beginnings Of A Search EngineThe Beginnings Of A Search Engine
The Beginnings Of A Search Engine
 
Ensemble methods in machine learning
Ensemble methods in machine learningEnsemble methods in machine learning
Ensemble methods in machine learning
 

Viewers also liked

Traffic_Deaths
Traffic_DeathsTraffic_Deaths
Traffic_DeathsBrett Keim
 
poster_B-Cycle
poster_B-Cycleposter_B-Cycle
poster_B-CycleBrett Keim
 
Idioms & Phrases : Volume 001 (Mastering English Like A Ninja)
Idioms & Phrases : Volume 001 (Mastering English Like A Ninja)Idioms & Phrases : Volume 001 (Mastering English Like A Ninja)
Idioms & Phrases : Volume 001 (Mastering English Like A Ninja)Amit Kumar
 
Clases sociales de la edad medieval
Clases sociales de la edad medievalClases sociales de la edad medieval
Clases sociales de la edad medievalGiuliana Sanchez
 
The Snapys 2017 - Partner Pack
The Snapys 2017 - Partner PackThe Snapys 2017 - Partner Pack
The Snapys 2017 - Partner PackThe Snapys
 
Die Rechnung bitte! – UCS als Transaktionsplattform
Die Rechnung bitte! – UCS als TransaktionsplattformDie Rechnung bitte! – UCS als Transaktionsplattform
Die Rechnung bitte! – UCS als TransaktionsplattformUnivention GmbH
 

Viewers also liked (9)

Traffic_Deaths
Traffic_DeathsTraffic_Deaths
Traffic_Deaths
 
poster_B-Cycle
poster_B-Cycleposter_B-Cycle
poster_B-Cycle
 
Idioms & Phrases : Volume 001 (Mastering English Like A Ninja)
Idioms & Phrases : Volume 001 (Mastering English Like A Ninja)Idioms & Phrases : Volume 001 (Mastering English Like A Ninja)
Idioms & Phrases : Volume 001 (Mastering English Like A Ninja)
 
Ahmed kamal C.V
Ahmed kamal C.VAhmed kamal C.V
Ahmed kamal C.V
 
Reseña ieti ajc_2016
Reseña ieti ajc_2016Reseña ieti ajc_2016
Reseña ieti ajc_2016
 
Clases sociales de la edad medieval
Clases sociales de la edad medievalClases sociales de la edad medieval
Clases sociales de la edad medieval
 
The Snapys 2017 - Partner Pack
The Snapys 2017 - Partner PackThe Snapys 2017 - Partner Pack
The Snapys 2017 - Partner Pack
 
Assignment 1 uniqlo ppt.
Assignment 1 uniqlo ppt.Assignment 1 uniqlo ppt.
Assignment 1 uniqlo ppt.
 
Die Rechnung bitte! – UCS als Transaktionsplattform
Die Rechnung bitte! – UCS als TransaktionsplattformDie Rechnung bitte! – UCS als Transaktionsplattform
Die Rechnung bitte! – UCS als Transaktionsplattform
 

Similar to Data_Mining_Exploration

Workshop - Introduction to Machine Learning with R
Workshop - Introduction to Machine Learning with RWorkshop - Introduction to Machine Learning with R
Workshop - Introduction to Machine Learning with RShirin Elsinghorst
 
Predict Backorder on a supply chain data for an Organization
Predict Backorder on a supply chain data for an OrganizationPredict Backorder on a supply chain data for an Organization
Predict Backorder on a supply chain data for an OrganizationPiyush Srivastava
 
maXbox starter65 machinelearning3
maXbox starter65 machinelearning3maXbox starter65 machinelearning3
maXbox starter65 machinelearning3Max Kleiner
 
Clustering in Machine Learning.pdf
Clustering in Machine Learning.pdfClustering in Machine Learning.pdf
Clustering in Machine Learning.pdfSudhanshiBakre1
 
The Validity of CNN to Time-Series Forecasting Problem
The Validity of CNN to Time-Series Forecasting ProblemThe Validity of CNN to Time-Series Forecasting Problem
The Validity of CNN to Time-Series Forecasting ProblemMasaharu Kinoshita
 
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Yao Yao
 
Course Project for Coursera Practical Machine Learning
Course Project for Coursera Practical Machine LearningCourse Project for Coursera Practical Machine Learning
Course Project for Coursera Practical Machine LearningJohn Edward Slough II
 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learningAmAn Singh
 
Support Vector Machine Optimal Kernel Selection
Support Vector Machine Optimal Kernel SelectionSupport Vector Machine Optimal Kernel Selection
Support Vector Machine Optimal Kernel SelectionIRJET Journal
 
Predicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationPredicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationBoston Institute of Analytics
 
Data Assessment and Analysis for Model Evaluation
Data Assessment and Analysis for Model Evaluation Data Assessment and Analysis for Model Evaluation
Data Assessment and Analysis for Model Evaluation SaravanakumarSekar4
 
Observations
ObservationsObservations
Observationsbutest
 
Analytical study of feature extraction techniques in opinion mining
Analytical study of feature extraction techniques in opinion miningAnalytical study of feature extraction techniques in opinion mining
Analytical study of feature extraction techniques in opinion miningcsandit
 
Radial Basis Function Neural Network (RBFNN), Induction Motor, Vector control...
Radial Basis Function Neural Network (RBFNN), Induction Motor, Vector control...Radial Basis Function Neural Network (RBFNN), Induction Motor, Vector control...
Radial Basis Function Neural Network (RBFNN), Induction Motor, Vector control...cscpconf
 
ANALYTICAL STUDY OF FEATURE EXTRACTION TECHNIQUES IN OPINION MINING
ANALYTICAL STUDY OF FEATURE EXTRACTION TECHNIQUES IN OPINION MININGANALYTICAL STUDY OF FEATURE EXTRACTION TECHNIQUES IN OPINION MINING
ANALYTICAL STUDY OF FEATURE EXTRACTION TECHNIQUES IN OPINION MININGcsandit
 
Bigger Data v Better Math
Bigger Data v Better MathBigger Data v Better Math
Bigger Data v Better MathBrent Schneeman
 
Panoramic Imaging using SIFT and SURF
Panoramic Imaging using SIFT and SURFPanoramic Imaging using SIFT and SURF
Panoramic Imaging using SIFT and SURFEric Jansen
 
maXbox starter67 machine learning V
maXbox starter67 machine learning VmaXbox starter67 machine learning V
maXbox starter67 machine learning VMax Kleiner
 
AIRLINE FARE PRICE PREDICTION
AIRLINE FARE PRICE PREDICTIONAIRLINE FARE PRICE PREDICTION
AIRLINE FARE PRICE PREDICTIONIRJET Journal
 
Data science with R - Clustering and Classification
Data science with R - Clustering and ClassificationData science with R - Clustering and Classification
Data science with R - Clustering and ClassificationBrigitte Mueller
 

Similar to Data_Mining_Exploration (20)

Workshop - Introduction to Machine Learning with R
Workshop - Introduction to Machine Learning with RWorkshop - Introduction to Machine Learning with R
Workshop - Introduction to Machine Learning with R
 
Predict Backorder on a supply chain data for an Organization
Predict Backorder on a supply chain data for an OrganizationPredict Backorder on a supply chain data for an Organization
Predict Backorder on a supply chain data for an Organization
 
maXbox starter65 machinelearning3
maXbox starter65 machinelearning3maXbox starter65 machinelearning3
maXbox starter65 machinelearning3
 
Clustering in Machine Learning.pdf
Clustering in Machine Learning.pdfClustering in Machine Learning.pdf
Clustering in Machine Learning.pdf
 
The Validity of CNN to Time-Series Forecasting Problem
The Validity of CNN to Time-Series Forecasting ProblemThe Validity of CNN to Time-Series Forecasting Problem
The Validity of CNN to Time-Series Forecasting Problem
 
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
 
Course Project for Coursera Practical Machine Learning
Course Project for Coursera Practical Machine LearningCourse Project for Coursera Practical Machine Learning
Course Project for Coursera Practical Machine Learning
 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learning
 
Support Vector Machine Optimal Kernel Selection
Support Vector Machine Optimal Kernel SelectionSupport Vector Machine Optimal Kernel Selection
Support Vector Machine Optimal Kernel Selection
 
Predicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationPredicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project Presentation
 
Data Assessment and Analysis for Model Evaluation
Data Assessment and Analysis for Model Evaluation Data Assessment and Analysis for Model Evaluation
Data Assessment and Analysis for Model Evaluation
 
Observations
ObservationsObservations
Observations
 
Analytical study of feature extraction techniques in opinion mining
Analytical study of feature extraction techniques in opinion miningAnalytical study of feature extraction techniques in opinion mining
Analytical study of feature extraction techniques in opinion mining
 
Radial Basis Function Neural Network (RBFNN), Induction Motor, Vector control...
Radial Basis Function Neural Network (RBFNN), Induction Motor, Vector control...Radial Basis Function Neural Network (RBFNN), Induction Motor, Vector control...
Radial Basis Function Neural Network (RBFNN), Induction Motor, Vector control...
 
ANALYTICAL STUDY OF FEATURE EXTRACTION TECHNIQUES IN OPINION MINING
ANALYTICAL STUDY OF FEATURE EXTRACTION TECHNIQUES IN OPINION MININGANALYTICAL STUDY OF FEATURE EXTRACTION TECHNIQUES IN OPINION MINING
ANALYTICAL STUDY OF FEATURE EXTRACTION TECHNIQUES IN OPINION MINING
 
Bigger Data v Better Math
Bigger Data v Better MathBigger Data v Better Math
Bigger Data v Better Math
 
Panoramic Imaging using SIFT and SURF
Panoramic Imaging using SIFT and SURFPanoramic Imaging using SIFT and SURF
Panoramic Imaging using SIFT and SURF
 
maXbox starter67 machine learning V
maXbox starter67 machine learning VmaXbox starter67 machine learning V
maXbox starter67 machine learning V
 
AIRLINE FARE PRICE PREDICTION
AIRLINE FARE PRICE PREDICTIONAIRLINE FARE PRICE PREDICTION
AIRLINE FARE PRICE PREDICTION
 
Data science with R - Clustering and Classification
Data science with R - Clustering and ClassificationData science with R - Clustering and Classification
Data science with R - Clustering and Classification
 

Data_Mining_Exploration

  • 1. Data Mining Exploration Brett Keim December 16, 2015 Abstract Inside data mining there are infinite ways to explore a set with each way yielding different information about the data set as well as about future data points inside the data set. This research is primary about the techniques relating to support vector machines and how they can be utilized to provide useful models and classifications along with the use of kernels. The research will also cover some elastic net regularized regressions for the linear combinations which support vector machines do not incorporate. 1 Introduction to Support Vector Machines It is wise to first start by looking into the definition of support vector machines and the most practical uses of them. A SVM uses examples from a set and split them into two categories. It then uses the categories to assign new examples to one or the other without the use of probability. The two categories should be mapped into groups so there is a clear gap between them which should be maximized. The new samples are mapped into the same space and labeled based on what side of the gap they fall into. This is how classification is done using the SVM, but they can also classify non-linearly using kernel methods which will be discussed later. 1.1 Obtaining Data for SVM Study In order to study support vector machines I will need a data set that is complex enough to work on many different layers of analysis, but at the same time being small enough so I can easily troubleshoot. I have pulled a dataset from the University of California Irvine Machine Learning Repository [UCI]. I have chosen to read them into R using the read.csv command. setwd("C:/Users/bwkeim/Desktop/Grad Classes/Data Mining Exploration") fires <- read.csv("forestfires.csv") rfires <- runif(nrow(fires)) fires.train <- fires[rfires >= 0.33,] fires.test <- fires[rfires < 0.22,] dim(fires) ## [1] 517 13 1.2 Preliminary Testing I will start by looking at the forest fire data from UCI to better understand how to properly use a SVM on a data set. I need to look at some of the data first in order to understand what is inside the data file. It is important to remember there is 6,721 data points inside the dataset. I have also split the set into two subsets one for training and one for testing. summary(fires) 1
The first thing that comes to mind with forest fires is what causes them, and at what point in the year we are most vulnerable to one occurring.

fit_default <- svm(month~., fires.train)
fit_default

##
## Call:
## svm(formula = month ~ ., data = fires.train)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.05555556
##
## Number of Support Vectors: 281

I am using the svm() function from the e1071 R package to begin my SVM learning. This function uses a formula set by the user to classify data points from a given dataset. The user can also easily change the kernel type as well as the margin through the kernel and cost arguments of the function. Changing the cost argument changes the penalty for misclassifying a data point. The larger the cost, the smaller the margin (the separation between the classes), because misclassification is penalized more heavily; this can lead to overfitting. A smaller cost value allows the margin to widen, but causes a higher level of bias. The key is to find the right mix of the two, and when deciding on that mix it is best to see how different levels perform on real data. There is also another argument of the function we must look at: gamma.
Gamma determines how much influence a single training point has. If gamma is too large, the radius of influence of the support vectors includes only the support vectors themselves, and no amount of cost regularization will allow us to avoid overfitting. If gamma is too small, the model is too constrained and cannot capture the shape of the data. It is again important to get the correct mix of the two by seeing how actual data reacts to the changes.

##
## Call:
## svm(formula = month ~ ., data = fires.train, cost = 10000)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 10000
## gamma: 0.05555556
##
## Number of Support Vectors: 246

##
## Call:
## svm(formula = month ~ ., data = fires.train, cost = 0.001)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 0.001
## gamma: 0.05555556
##
## Number of Support Vectors: 328

##
## Call:
## svm(formula = month ~ ., data = fires.train, gamma = 0.1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.1
##
## Number of Support Vectors: 283

##
## Call:
## svm(formula = month ~ ., data = fires.train, gamma = 0.001)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.001
##
## Number of Support Vectors: 329
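Rather than adjusting cost and gamma one call at a time, this kind of search can be automated with cross-validation. The sketch below is an illustration rather than code from the original analysis; it assumes the tune() helper from the same e1071 package, and the grids of candidate values are arbitrary choices.

# Illustrative grid search over cost and gamma using e1071::tune()
# (10-fold cross-validation by default); the grids are arbitrary examples.
tuned <- tune(svm, month ~ ., data = fires.train,
              ranges = list(cost = 10^(-1:3), gamma = 10^(-3:-1)))
summary(tuned)        # cross-validated error for every cost/gamma pair
tuned$best.parameters # the combination with the lowest CV error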
Since I have played with the models more than what is shown, I think it is now time to start comparing them against the test portion of my data set. I will use predictions from my SVM model and compare them to the test data using a simple table.

##
## Call:
## svm(formula = month ~ ., data = fires.train, cost = 10000, gamma = 0.01,
##     kernel = "radial")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 10000
## gamma: 0.01
##
## Number of Support Vectors: 201

1.2.1 Initial Results

Using my fit_1 SVM I was able to organize my results into a table, which in this case is better known as a confusion matrix. It shows the true values (columns) versus the predicted values (rows).

      apr aug dec feb jan jul jun mar may nov oct sep
  apr   0   0   0   3   0   0   0   0   0   0   0   0
  aug   0  29   0   0   1   4   0   0   0   0   0   3
  dec   0   0   1   0   0   0   0   0   0   0   0   0
  feb   1   0   0   2   0   0   0   0   0   0   0   0
  jan   0   0   0   0   0   0   0   0   0   0   0   0
  jul   0   1   0   0   0   3   2   0   0   0   0   0
  jun   0   0   0   0   0   0   3   0   0   0   0   0
  mar   1   1   0   2   0   0   0   7   1   0   0   0
  may   0   0   0   0   0   0   0   1   0   0   0   0
  nov   0   0   0   0   0   0   0   0   0   0   0   0
  oct   0   0   0   0   0   0   0   0   0   0   4   0
  sep   0   4   0   0   0   0   0   0   0   0   0  40

Ideally we would like the diagonal to contain all of the data points, which would mean that our predictions perfectly match the true values. Of course, there is no such thing as a perfect model, so we have some values that fall outside of the diagonal. These are known as misclassified points and are something we need to take a closer look at.
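The code that produced this confusion matrix is not shown on this page; it was presumably built along the lines of the sketch below. The object names fit_1 and cm are assumptions based on how the later models are handled, and month is column 3 of the data.

# Presumed construction of the confusion matrix above (a reconstruction,
# not the original code): fit the model, predict on the held-out rows with
# the month column removed, then cross-tabulate predictions against truth.
fit_1 <- svm(month ~ ., fires.train, cost = 10000, gamma = 0.01, kernel = "radial")
fit_1_pred <- predict(fit_1, fires.test[, -3])
cm <- table(pred = fit_1_pred, true = fires.test[, 3])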
correct.pred <- sum(diag(cm))
correct.pred

## [1] 89

total.pred <- sum(cm)
total.pred

## [1] 114

misclass <- (total.pred - correct.pred) / total.pred
misclass

## [1] 0.2192982

By the above code, the model correctly predicted 89 cases out of a total of 114 test points, yielding a misclassification rate of roughly 21.9 percent. That is not a bad rate from my point of view. This was using a radial kernel, but there are other kernel methods used inside SVMs that can be more effective and useful. I want to look specifically at the linear kernel now.

1.3 Linear Kernel SVMs

I will now use the same techniques as discussed earlier for finding the correct cost and gamma for my SVM, but this time I will be using a linear kernel.

fit_2 <- svm(month~., fires.train, cost = 1000, kernel = 'linear')
summary(fit_2)

##
## Call:
## svm(formula = month ~ ., data = fires.train, cost = 1000, kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1000
## gamma: 0.05555556
##
## Number of Support Vectors: 157
##
## ( 21 9 39 38 5 17 11 1 8 7 1 )
##
##
## Number of Classes: 11
##
## Levels:
## apr aug dec feb jan jul jun mar may nov oct sep

fit_2_pred <- predict(fit_2, fires.test[,-3])
cm_2 <- table(pred = fit_2_pred, true = fires.test[,3])
sum(diag(cm_2))

## [1] 91

sum(cm_2)

## [1] 114

(sum(cm_2)-sum(diag(cm_2)))/sum(cm_2)

## [1] 0.2017544

The first model I have made using a linear kernel is slightly better than the radial kernel model, since the misclassification rate drops from roughly 21.9 percent to about 20.2 percent. I am going to play with the cost level and see if I can further improve the model.

fit_3 <- svm(month~., fires.train, cost = 100, kernel = 'linear')
fit_3_pred <- predict(fit_3, fires.test[,-3])
cm_3 <- table(pred = fit_3_pred, true = fires.test[,3])
sum(diag(cm_3))
## [1] 90

sum(cm_3)

## [1] 114

(sum(cm_3)-sum(diag(cm_3)))/sum(cm_3)

## [1] 0.2105263

1.3.1 Results of Linear Kernel

Below is the confusion matrix for the high cost SVM.

      apr aug dec feb jan jul jun mar may nov oct sep
  apr   0   0   0   3   0   0   0   1   0   0   0   0
  aug   0  31   0   0   0   3   0   0   0   0   0   3
  dec   0   0   1   0   0   0   0   0   0   0   0   0
  feb   1   0   0   2   1   0   0   0   0   0   0   0
  jan   0   0   0   0   0   0   0   0   0   0   0   0
  jul   0   0   0   0   0   4   3   0   0   0   0   0
  jun   0   0   0   0   0   0   2   0   0   0   0   0
  mar   1   1   0   2   0   0   0   6   1   0   0   0
  may   0   0   0   0   0   0   0   1   0   0   0   0
  nov   0   0   0   0   0   0   0   0   0   0   0   0
  oct   0   0   0   0   0   0   0   0   0   0   4   0
  sep   0   3   0   0   0   0   0   0   0   0   0  40

fit_4 <- svm(month~., fires.train, cost = 1, kernel = 'linear')
fit_4_pred <- predict(fit_4, fires.test[,-3])
cm_4 <- table(pred = fit_4_pred, true = fires.test[,3])
sum(diag(cm_4))

## [1] 90

sum(cm_4)

## [1] 114

(sum(cm_4)-sum(diag(cm_4)))/sum(cm_4)

## [1] 0.2105263
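Since this same misclassification computation is repeated for every confusion matrix in this section, a small helper function keeps the comparisons consistent. This helper is my own addition, not part of the original analysis:

# Helper (not in the original code) for the repeated misclassification computation
misclass_rate <- function(cm) (sum(cm) - sum(diag(cm))) / sum(cm)
misclass_rate(cm_3)  # ~0.211 for the higher-cost linear model
misclass_rate(cm_4)  # ~0.211 for the cost = 1 linear model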
Below is the confusion matrix for the low cost SVM.

      apr aug dec feb jan jul jun mar may nov oct sep
  apr   0   0   0   1   0   0   0   0   0   0   0   0
  aug   0  31   0   0   0   4   0   0   0   0   0   3
  dec   0   0   1   0   0   0   0   0   0   0   0   0
  feb   1   0   0   2   0   0   0   0   0   0   0   0
  jan   0   0   0   0   0   0   0   0   0   0   0   0
  jul   0   0   0   0   0   2   2   0   0   0   0   0
  jun   0   0   0   0   1   0   3   0   0   0   0   0
  mar   1   1   0   4   0   0   0   7   1   0   0   0
  may   0   0   0   0   0   0   0   1   0   0   0   0
  nov   0   0   0   0   0   0   0   0   0   0   0   0
  oct   0   0   0   0   0   0   0   0   0   0   4   0
  sep   0   3   0   0   0   1   0   0   0   0   0  40

I have continued to experiment with the cost argument inside svm(), and because of the randomness of the train/test split I continue to get mixed results: sometimes the model with a higher cost is more accurate and sometimes it is the other way around. I have shown two models above, one with a higher cost and one with the default cost setting of 1; here their misclassification percentages are both about 21.1 percent.
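One way to make these comparisons less sensitive to a particular run is to evaluate a whole range of cost values on the same train/test split. The loop below is an illustrative sketch of that idea, not code from the original analysis; it reuses the misclass_rate() helper defined above and an arbitrary grid of costs.

# Compare several cost values on one fixed split (illustrative sketch)
costs <- c(0.01, 0.1, 1, 10, 100, 1000)
rates <- sapply(costs, function(co) {
  fit <- svm(month ~ ., fires.train, cost = co, kernel = 'linear')
  misclass_rate(table(pred = predict(fit, fires.test[, -3]),
                      true = fires.test[, 3]))
})
data.frame(cost = costs, misclassification = rates)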
1.4 Recap

After analyzing and trying different SVMs, I wanted to add a visual showing the count of fires per month from the data set, to see whether my predictions and overall study make sense given the data I am testing on.

## Loading required package: tcltk

[Figure: bar chart of fire counts per month ("Fire Counts by Month"); x-axis: Month (apr through sep), y-axis: Count (0 to 150).]

The visual speaks for itself. It is easy to see that the fires are dominated by two specific months, August and September, as we predicted and saw in our confusion matrices. The misclassification in those months is not hard to believe given the sheer numbers. I feel content with my model having an 80 percent success rate for now.
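The code that produced this chart is not shown in the document, but a similar count-by-month bar chart can be reproduced in base R. This is my own minimal sketch, not the original figure code:

# Minimal reproduction of a fire-count-by-month bar chart (not the original code)
barplot(table(fires$month), xlab = "Month", ylab = "Count",
        main = "Fire Counts by Month")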
2 Elastic Net Regression

What is elastic net regression? What does it mean? Where should we use it? These are the three questions that drove me to create this paper. I first want to define elastic net regression and talk about what the definition means in relative terms. I will then use elastic net regression on some real-life data and see how it can be used in a real-world setting.

2.1 Definition and Meaning

It is worth mentioning there are two main types of elastic net regression, regularized and nonregularized. Before I can define these two types I need to discuss the lasso and ridge regression techniques. Lasso is an acronym for least absolute shrinkage and selection operator; it shrinks the sum of the absolute values of the estimators in a regression by assigning penalties to values that are large. As the penalties grow larger and larger the estimators are driven toward zero, and in the lasso method they are allowed to become exactly zero. This differs from the ridge technique in that, as the penalties grow and the estimators shrink, ridge estimators still remain nonzero. Ridge (Tikhonov) regression is essentially the penalizing of large parameter estimates when fitting a linear regression, again with the main difference from lasso being that ridge estimators remain nonzero. Now that I have discussed a couple of pre-topics, I am ready to get down to elastic net regression. This method linearly combines the absolute-value (L1) penalty from the lasso and the squared (L2) penalty from ridge to overcome shortcomings in each area. Specifically, the elastic net addresses problems with the lasso when one has many parameters but a (relatively) small sample size, as well as when multiple parameters are strongly correlated. The lasso would tend to keep only one of those correlated parameters, but the elastic net accounts for the correlation by adding a quadratic term. In the glmnet package used below, this mix is controlled by the parameter alpha: the penalty takes the form lambda * ((1 - alpha)/2 * sum(beta_j^2) + alpha * sum(|beta_j|)), so alpha = 1 gives the lasso and alpha = 0 gives ridge. The addition of the quadratic term does cause one notable issue, double shrinkage: the coefficients are shrunk by both penalties, which is a problem unless it is corrected by a rescaling factor. One of the main benefits of elastic net regression is that its objective is convex, which guarantees a single global minimum.

2.2 Preliminary Testing

I want to first run some tests using elastic net regression on the same data set I used with the SVM in the first section of this paper. I will keep the same initial question and goal for the regression, which again is, "When are we most susceptible to forest fires, and in which months?" I want to see if the elastic net methods give me fewer misclassifications. I will first look at our original training and testing sets made in section 1, as well as a quick dimension check on 'fires' to make sure the sets have remained the same as before.

summary(fires.train)

## X Y month day FFMC
## Min. :1.000 Min. :2.00 aug :129 fri:63 Min. :53.40
## 1st Qu.:2.000 1st Qu.:4.00 sep :109 mon:46 1st Qu.:90.20
## Median :4.000 Median :4.00 mar : 39 sat:66 Median :91.60
## Mean :4.681 Mean :4.31 jul : 21 sun:51 Mean :90.69
## 3rd Qu.:7.000 3rd Qu.:5.00 feb : 12 thu:45 3rd Qu.:92.90
## Max. :9.000 Max. :9.00 jun : 10 tue:39 Max. :96.20
## (Other): 25 wed:35
## DMC DC ISI temp
## Min. : 2.4 Min. : 9.3 Min. : 0.400 Min. : 2.20
## 1st Qu.: 61.1 1st Qu.:433.3 1st Qu.: 6.300 1st Qu.:15.90
## Median :108.4 Median :664.2 Median : 8.400 Median :19.60
## Mean :111.9 Mean :546.8 Mean : 8.868 Mean :19.12
## 3rd Qu.:145.4 3rd Qu.:713.0 3rd Qu.:11.000 3rd Qu.:22.90
## Max. :291.3 Max. :860.6 Max. :22.700 Max. :33.10
##
## RH wind rain area
## Min. :18.0 Min. :0.900 Min. :0.00000 Min. : 0.00
## 1st Qu.:32.0 1st Qu.:2.700 1st Qu.:0.00000 1st Qu.: 0.00
## Median :41.0 Median :4.000 Median :0.00000 Median : 0.47
## Mean :43.6 Mean :4.002 Mean :0.03015 Mean : 14.71
## 3rd Qu.:53.0 3rd Qu.:4.900 3rd Qu.:0.00000 3rd Qu.: 6.96
## Max. :96.0 Max. :9.400 Max. :6.40000 Max. :1090.84
##

summary(fires.test)

## X Y month day FFMC
## Min. :1.000 Min. :2.000 sep :43 fri:17 Min. :18.70
## 1st Qu.:3.000 1st Qu.:4.000 aug :35 mon:17 1st Qu.:90.20
## Median :4.000 Median :4.000 mar : 8 sat:11 Median :91.70
## Mean :4.649 Mean :4.272 feb : 7 sun:29 Mean :90.11
## 3rd Qu.:6.000 3rd Qu.:5.000 jul : 7 thu:10 3rd Qu.:92.88
## Max. :9.000 Max. :9.000 jun : 5 tue:14 Max. :96.20
## (Other): 9 wed:16
## DMC DC ISI temp
## Min. : 1.10 Min. : 7.9 Min. : 0.000 Min. : 4.20
## 1st Qu.: 65.08 1st Qu.:444.7 1st Qu.: 6.500 1st Qu.:14.82
## Median :103.10 Median :663.0 Median : 8.500 Median :18.05
## Mean :103.66 Mean :549.2 Mean : 9.376 Mean :18.23
## 3rd Qu.:135.10 3rd Qu.:714.8 3rd Qu.:10.925 3rd Qu.:22.55
## Max. :290.00 Max. :855.3 Max. :56.100 Max. :32.40
##
## RH wind rain area
## Min. : 15.00 Min. :0.400 Min. :0.000000 Min. : 0.000
## 1st Qu.: 35.00 1st Qu.:3.100 1st Qu.:0.000000 1st Qu.: 0.000
## Median : 42.50 Median :4.000 Median :0.000000 Median : 1.015
## Mean : 46.04 Mean :4.092 Mean :0.003509 Mean : 10.307
## 3rd Qu.: 54.00 3rd Qu.:5.400 3rd Qu.:0.000000 3rd Qu.: 5.785
## Max. :100.00 Max. :8.500 Max. :0.200000 Max. :278.530
##

dim(fires)

## [1] 517 13

Let us begin with the 'glmnet' package. I need to make sure I build the correct response vector (y) and the correct input matrix (x).

names(fires.train)

## [1] "X" "Y" "month" "day" "FFMC" "DMC" "DC" "ISI"
## [9] "temp" "RH" "wind" "rain" "area"

fires.traiN <- fires.train[,-4]
names(fires.traiN)

## [1] "X" "Y" "month" "FFMC" "DMC" "DC" "ISI" "temp"
## [9] "RH" "wind" "rain" "area"

x <- fires.traiN[,-3]
y <- fires.traiN[,3]
names(x)

## [1] "X" "Y" "FFMC" "DMC" "DC" "ISI" "temp" "RH" "wind" "rain"
## [11] "area"
Y <- gsub('jan','1',y)
Y <- gsub('feb', '2',Y)
Y <- gsub('mar', '3',Y)
Y <- gsub('apr', '4',Y)
Y <- gsub('may', '5',Y)
Y <- gsub('jun', '6',Y)
Y <- gsub('jul', '7',Y)
Y <- gsub('aug', '8',Y)
Y <- gsub('sep', '9',Y)
Y <- gsub('oct', '10',Y)
Y <- gsub('nov', '11',Y)
Y <- gsub('dec', '12',Y)
class(Y)

## [1] "character"

Y <- as.numeric(Y)
Y <- as.integer(Y)
class(Y)

## [1] "integer"

x[,1] <- as.numeric(x[,1])
x[,2] <- as.numeric(x[,2])
x[,3] <- as.numeric(x[,3])
x[,4] <- as.numeric(x[,4])
x[,5] <- as.numeric(x[,5])
x[,6] <- as.numeric(x[,6])
x[,7] <- as.numeric(x[,7])
x[,8] <- as.numeric(x[,8])
x[,9] <- as.numeric(x[,9])
x[,10] <- as.numeric(x[,10])
x[,11] <- as.numeric(x[,11])
head(x)

## X Y FFMC DMC DC ISI temp RH wind rain area
## 1 7 5 86.2 26.2 94.3 5.1 8.2 51 6.7 0 0
## 2 7 4 90.6 35.4 669.1 6.7 18.0 33 0.9 0 0
## 3 7 4 90.6 43.7 686.9 6.7 14.6 33 1.3 0 0
## 6 8 6 92.3 85.3 488.0 14.7 22.2 29 5.4 0 0
## 7 8 6 92.3 88.9 495.6 8.5 24.1 27 3.1 0 0
## 8 8 6 91.5 145.4 608.2 10.7 8.0 86 2.2 0 0

dim(x)

## [1] 345 11

length(Y)

## [1] 345

class(x)

## [1] "data.frame"
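As an aside, the month recoding and column coercion above can be written much more compactly. The two lines below are my own equivalent sketch, not the original code; month.abb is R's built-in vector of month abbreviations, lower-cased here to match the factor levels in this data set.

# Compact equivalent of the recoding above (a sketch, not the original code)
Y <- match(as.character(y), tolower(month.abb))  # "jan" -> 1, ..., "dec" -> 12
X <- data.matrix(x)                              # coerce all predictor columns to numeric at once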
  • 12. str(x) ## 'data.frame': 345 obs. of 11 variables: ## $ X : num 7 7 7 8 8 8 8 7 7 6 ... ## $ Y : num 5 4 4 6 6 6 6 5 5 5 ... ## $ FFMC: num 86.2 90.6 90.6 92.3 92.3 91.5 91 92.5 92.8 63.5 ... ## $ DMC : num 26.2 35.4 43.7 85.3 88.9 ... ## $ DC : num 94.3 669.1 686.9 488 495.6 ... ## $ ISI : num 5.1 6.7 6.7 14.7 8.5 10.7 7 7.1 22.6 0.8 ... ## $ temp: num 8.2 18 14.6 22.2 24.1 8 13.1 17.8 19.3 17 ... ## $ RH : num 51 33 33 29 27 86 63 51 38 72 ... ## $ wind: num 6.7 0.9 1.3 5.4 3.1 2.2 5.4 7.2 4 6.7 ... ## $ rain: num 0 0 0 0 0 0 0 0 0 0 ... ## $ area: num 0 0 0 0 0 0 0 0 0 0 ... X <- as.matrix(x) class(X) ## [1] "matrix" results <- cv.glmnet(X,Y) results ## $lambda ## [1] 1.990315408 1.813501273 1.652394818 1.505600617 1.371847207 ## [6] 1.249976083 1.138931652 1.037752103 0.945561067 0.861560028 ## [11] 0.785021409 0.715282271 0.651738565 0.593839908 0.541084807 ## [16] 0.493016324 0.449218112 0.409310813 0.372948769 0.339817028 ## [21] 0.309628620 0.282122067 0.257059120 0.234222696 0.213414997 ## [26] 0.194455797 0.177180880 0.161440619 0.147098679 0.134030838 ## [31] 0.122123908 0.111274757 0.101389414 0.092382258 0.084175273 ## [36] 0.076697373 0.069883790 0.063675507 0.058018750 0.052864524 ## [41] 0.048168186 0.043889057 0.039990074 0.036437466 0.033200462 ## [46] 0.030251024 0.027563607 0.025114932 0.022883791 0.020850858 ## [51] 0.018998525 0.017310748 0.015772909 0.014371687 0.013094946 ## [56] 0.011931627 0.010871655 0.009905847 0.009025839 0.008224008 ## [61] 0.007493410 0.006827716 0.006221160 0.005668490 0.005164917 ## [66] 0.004706080 0.004288005 0.003907070 0.003559977 0.003243718 ## ## $cvm ## [1] 5.1557896 4.5274734 3.9661191 3.5000809 3.1131751 2.7919656 2.5252979 ## [8] 2.3039112 2.1201170 1.9675324 1.8408582 1.7356947 1.6483897 1.5759106 ## [15] 1.5157400 1.4657880 1.4243193 1.3898935 1.3613145 1.3375895 1.3178942 ## [22] 1.3015444 1.2879718 1.2767162 1.2682002 1.2602525 1.2383533 1.2062084 ## [29] 1.1744998 1.1488916 1.1248634 1.1043967 1.0857760 1.0690638 1.0542104 ## [36] 1.0383157 1.0235117 1.0107188 1.0003412 0.9921089 0.9851035 0.9795120 ## [43] 0.9752539 0.9719607 0.9696518 0.9682564 0.9674289 0.9668369 0.9664844 ## [50] 0.9661883 0.9660219 0.9661044 0.9662086 0.9664366 0.9666119 0.9669238 ## [57] 0.9671213 0.9673236 0.9675142 0.9676450 0.9677656 0.9679127 0.9681149 ## [64] 0.9682766 0.9683967 0.9685622 0.9686798 0.9688161 0.9689918 0.9690978 ## ## $cvsd ## [1] 0.3052258 0.2965443 0.2813673 0.2717944 0.2664935 0.2643274 0.2643760 ## [8] 0.2659282 0.2684540 0.2715701 0.2750053 0.2785719 0.2821425 0.2856328 ## [15] 0.2889890 0.2921786 0.2951837 0.2979968 0.3006170 0.3030482 0.3052973 12
  • 13. ## [22] 0.3073728 0.3092844 0.3110387 0.3124359 0.3137672 0.3156819 0.3134652 ## [29] 0.3089602 0.3044185 0.2989233 0.2934814 0.2875823 0.2818579 0.2760164 ## [36] 0.2686802 0.2611669 0.2540860 0.2474377 0.2412654 0.2352859 0.2297747 ## [43] 0.2247959 0.2203583 0.2162666 0.2124843 0.2089385 0.2057743 0.2028791 ## [50] 0.2002374 0.1978727 0.1957308 0.1937396 0.1919263 0.1901875 0.1885591 ## [57] 0.1870212 0.1855954 0.1842725 0.1830436 0.1819223 0.1808839 0.1799487 ## [64] 0.1790834 0.1782956 0.1775804 0.1769257 0.1763354 0.1758300 0.1753255 ## ## $cvup ## [1] 5.461015 4.824018 4.247486 3.771875 3.379669 3.056293 2.789674 ## [8] 2.569839 2.388571 2.239103 2.115863 2.014267 1.930532 1.861543 ## [15] 1.804729 1.757967 1.719503 1.687890 1.661931 1.640638 1.623192 ## [22] 1.608917 1.597256 1.587755 1.580636 1.574020 1.554035 1.519674 ## [29] 1.483460 1.453310 1.423787 1.397878 1.373358 1.350922 1.330227 ## [36] 1.306996 1.284679 1.264805 1.247779 1.233374 1.220389 1.209287 ## [43] 1.200050 1.192319 1.185918 1.180741 1.176367 1.172611 1.169364 ## [50] 1.166426 1.163895 1.161835 1.159948 1.158363 1.156799 1.155483 ## [57] 1.154143 1.152919 1.151787 1.150689 1.149688 1.148797 1.148064 ## [64] 1.147360 1.146692 1.146143 1.145606 1.145152 1.144822 1.144423 ## ## $cvlo ## [1] 4.8505637 4.2309291 3.6847518 3.2282865 2.8466816 2.5276382 2.2609219 ## [8] 2.0379830 1.8516630 1.6959623 1.5658528 1.4571228 1.3662471 1.2902777 ## [15] 1.2267510 1.1736094 1.1291356 1.0918967 1.0606975 1.0345412 1.0125969 ## [22] 0.9941715 0.9786873 0.9656774 0.9557643 0.9464853 0.9226714 0.8927433 ## [29] 0.8655396 0.8444731 0.8259401 0.8109153 0.7981937 0.7872059 0.7781940 ## [36] 0.7696355 0.7623448 0.7566328 0.7529035 0.7508436 0.7498176 0.7497373 ## [43] 0.7504579 0.7516025 0.7533852 0.7557721 0.7584904 0.7610626 0.7636052 ## [50] 0.7659509 0.7681492 0.7703736 0.7724690 0.7745103 0.7764244 0.7783646 ## [57] 0.7801001 0.7817282 0.7832417 0.7846014 0.7858433 0.7870289 0.7881662 ## [64] 0.7891931 0.7901011 0.7909817 0.7917541 0.7924807 0.7931617 0.7937723 ## ## $nzero ## s0 s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11 s12 s13 s14 s15 s16 s17 ## 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ## s18 s19 s20 s21 s22 s23 s24 s25 s26 s27 s28 s29 s30 s31 s32 s33 s34 s35 ## 1 1 1 1 1 1 1 2 3 3 3 3 4 5 5 5 6 6 ## s36 s37 s38 s39 s40 s41 s42 s43 s44 s45 s46 s47 s48 s49 s50 s51 s52 s53 ## 6 6 6 6 6 6 6 7 7 7 7 9 9 9 9 10 10 10 ## s54 s55 s56 s57 s58 s59 s60 s61 s62 s63 s64 s65 s66 s67 s68 s69 ## 10 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 ## ## $name ## mse ## "Mean-Squared Error" ## ## $glmnet.fit ## ## Call: glmnet(x = X, y = Y) ## ## Df %Dev Lambda ## [1,] 0 0.0000 1.990000 ## [2,] 1 0.1301 1.814000 13
  • 14. ## [3,] 1 0.2381 1.652000 ## [4,] 1 0.3277 1.506000 ## [5,] 1 0.4022 1.372000 ## [6,] 1 0.4640 1.250000 ## [7,] 1 0.5153 1.139000 ## [8,] 1 0.5579 1.038000 ## [9,] 1 0.5932 0.945600 ## [10,] 1 0.6226 0.861600 ## [11,] 1 0.6470 0.785000 ## [12,] 1 0.6672 0.715300 ## [13,] 1 0.6840 0.651700 ## [14,] 1 0.6980 0.593800 ## [15,] 1 0.7095 0.541100 ## [16,] 1 0.7191 0.493000 ## [17,] 1 0.7271 0.449200 ## [18,] 1 0.7338 0.409300 ## [19,] 1 0.7393 0.372900 ## [20,] 1 0.7438 0.339800 ## [21,] 1 0.7476 0.309600 ## [22,] 1 0.7508 0.282100 ## [23,] 1 0.7534 0.257100 ## [24,] 1 0.7555 0.234200 ## [25,] 1 0.7573 0.213400 ## [26,] 2 0.7593 0.194500 ## [27,] 3 0.7664 0.177200 ## [28,] 3 0.7747 0.161400 ## [29,] 3 0.7817 0.147100 ## [30,] 3 0.7874 0.134000 ## [31,] 4 0.7922 0.122100 ## [32,] 5 0.7986 0.111300 ## [33,] 5 0.8040 0.101400 ## [34,] 5 0.8086 0.092380 ## [35,] 6 0.8128 0.084180 ## [36,] 6 0.8164 0.076700 ## [37,] 6 0.8193 0.069880 ## [38,] 6 0.8218 0.063680 ## [39,] 6 0.8238 0.058020 ## [40,] 6 0.8255 0.052860 ## [41,] 6 0.8269 0.048170 ## [42,] 6 0.8281 0.043890 ## [43,] 6 0.8290 0.039990 ## [44,] 7 0.8299 0.036440 ## [45,] 7 0.8306 0.033200 ## [46,] 7 0.8312 0.030250 ## [47,] 7 0.8316 0.027560 ## [48,] 9 0.8322 0.025110 ## [49,] 9 0.8326 0.022880 ## [50,] 9 0.8331 0.020850 ## [51,] 9 0.8334 0.019000 ## [52,] 10 0.8337 0.017310 ## [53,] 10 0.8340 0.015770 ## [54,] 10 0.8343 0.014370 ## [55,] 10 0.8345 0.013090 14
## [56,] 11 0.8347 0.011930
## [57,] 11 0.8348 0.010870
## [58,] 11 0.8349 0.009906
## [59,] 11 0.8351 0.009026
## [60,] 11 0.8351 0.008224
## [61,] 11 0.8352 0.007493
## [62,] 11 0.8353 0.006828
## [63,] 11 0.8353 0.006221
## [64,] 11 0.8354 0.005668
## [65,] 11 0.8354 0.005165
## [66,] 11 0.8354 0.004706
## [67,] 11 0.8355 0.004288
## [68,] 11 0.8355 0.003907
## [69,] 11 0.8355 0.003560
## [70,] 11 0.8355 0.003244
## [71,] 11 0.8355 0.002956
## [72,] 11 0.8355 0.002693
## [73,] 11 0.8355 0.002454
##
## $lambda.min
## [1] 0.01899853
##
## $lambda.1se
## [1] 0.1340308
##
## attr(,"class")
## [1] "cv.glmnet"

Now I have a model from the elastic net function that I can use to predict values. I will use the test set that I made earlier: I will feed the test matrix into the model and compare the output to the real values given in the original data set.

names(fires.test)

## [1] "X" "Y" "month" "day" "FFMC" "DMC" "DC" "ISI"
## [9] "temp" "RH" "wind" "rain" "area"

fires.tesT <- fires.test[,-4]
names(fires.tesT)

## [1] "X" "Y" "month" "FFMC" "DMC" "DC" "ISI" "temp"
## [9] "RH" "wind" "rain" "area"

xp <- fires.tesT[,-3]
yp <- fires.tesT[,3]
Yp <- gsub('jan','1',yp)
Yp <- gsub('feb', '2',Yp)
Yp <- gsub('mar', '3',Yp)
Yp <- gsub('apr', '4',Yp)
Yp <- gsub('may', '5',Yp)
Yp <- gsub('jun', '6',Yp)
Yp <- gsub('jul', '7',Yp)
Yp <- gsub('aug', '8',Yp)
Yp <- gsub('sep', '9',Yp)
Yp <- gsub('oct', '10',Yp)
Yp <- gsub('nov', '11',Yp)
Yp <- gsub('dec', '12',Yp)
class(Yp)

## [1] "character"

Yp <- as.numeric(Yp)
Yp <- as.integer(Yp)
class(Yp)

## [1] "integer"

xp[,1] <- as.numeric(xp[,1])
xp[,2] <- as.numeric(xp[,2])
xp[,3] <- as.numeric(xp[,3])
xp[,4] <- as.numeric(xp[,4])
xp[,5] <- as.numeric(xp[,5])
xp[,6] <- as.numeric(xp[,6])
xp[,7] <- as.numeric(xp[,7])
xp[,8] <- as.numeric(xp[,8])
xp[,9] <- as.numeric(xp[,9])
xp[,10] <- as.numeric(xp[,10])
xp[,11] <- as.numeric(xp[,11])
Xp <- as.matrix(xp)
PRED1 <- round(predict(results, Xp, lambda = results$lambda.min),0)

It makes sense to show the outputs as tables so that we can see the different counts. It is important to remember that PRED1 holds the predictions intended to use the minimum lambda from cv.glmnet(), and Yp holds the actual month codes from the test set.

table(PRED1)

## PRED1
## 3 4 5 6 7 8 9 10
## 3 15 4 4 8 31 44 5

table(Yp)

## Yp
## 1 2 3 4 5 6 7 8 9 10 12
## 1 7 8 2 1 5 7 35 43 4 1

This was using the minimum lambda found by cv.glmnet(). I now wonder whether we can get a better model by simplifying a little and using the $lambda.1se value instead of $lambda.min.

PRED2 <- round(predict(results, Xp, lambda = results$lambda.1se),0)
table(PRED2)

## PRED2
## 3 4 5 6 7 8 9 10
## 3 15 4 4 8 31 44 5
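A note on the identical tables above (my own observation, not from the original text): predict() on a cv.glmnet object chooses the penalty through its s argument, and an argument named lambda is silently ignored, so both calls above likely fell back to the same default. If that is what happened, the two penalties could be compared with something like the sketch below.

# Corrected calls (a sketch): select the penalty via 's' for cv.glmnet predictions
PRED1 <- round(predict(results, Xp, s = "lambda.min"), 0)
PRED2 <- round(predict(results, Xp, s = "lambda.1se"), 0)
mean(PRED1 == Yp)  # rough accuracy against the true month codes
mean(PRED2 == Yp)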
Let's see how the two models compare to the true data.

ypdf <- rbind(c(1,0),c(2,1),c(3,10),c(4,1),c(5,0),c(6,4),c(7,8),c(8,36),c(9,30),c(10,4),c(11,0),c(12,3))

CONTINUE TRYING DIFFERENT METHODS!!!

The accuracy is not as good as the SVM in part one, but we can see that we are getting the same two "problem" months as the SVM section did, which are August and September. I want to look more into the glmnet() function now and explore some LASSO and ridge regressions. It should not be a painful process, because all I have to do is set the mixing parameter alpha to either 0 or 1 depending on which model I would like to use.

3 Ridge Regression

Now I will look into ridge regression a little bit and compare it with the elastic net regression from before. I can use the same glmnet() function as I did previously; I just need to set alpha to 0.

dim(fires.train)

## [1] 345 13

dim(fires.test)

## [1] 114 13

str(fires.train)

## 'data.frame': 345 obs. of 13 variables:
## $ X : int 7 7 7 8 8 8 8 7 7 6 ...
## $ Y : int 5 4 4 6 6 6 6 5 5 5 ...
## $ month: Factor w/ 12 levels "apr","aug","dec",..: 8 11 11 2 2 2 12 12 12 2 ...
## $ day : Factor w/ 7 levels "fri","mon","sat",..: 1 6 3 4 2 2 6 3 3 1 ...
## $ FFMC : num 86.2 90.6 90.6 92.3 92.3 91.5 91 92.5 92.8 63.5 ...
## $ DMC : num 26.2 35.4 43.7 85.3 88.9 ...
## $ DC : num 94.3 669.1 686.9 488 495.6 ...
## $ ISI : num 5.1 6.7 6.7 14.7 8.5 10.7 7 7.1 22.6 0.8 ...
## $ temp : num 8.2 18 14.6 22.2 24.1 8 13.1 17.8 19.3 17 ...
## $ RH : int 51 33 33 29 27 86 63 51 38 72 ...
## $ wind : num 6.7 0.9 1.3 5.4 3.1 2.2 5.4 7.2 4 6.7 ...
## $ rain : num 0 0 0 0 0 0 0 0 0 0 ...
## $ area : num 0 0 0 0 0 0 0 0 0 0 ...

fT <- fires.train
fT[,3] <- gsub('jan','1',fT[,3])
fT[,3] <- gsub('feb', '2',fT[,3])
fT[,3] <- gsub('mar', '3',fT[,3])
fT[,3] <- gsub('apr', '4',fT[,3])
fT[,3] <- gsub('may', '5',fT[,3])
fT[,3] <- gsub('jun', '6',fT[,3])
fT[,3] <- gsub('jul', '7',fT[,3])
fT[,3] <- gsub('aug', '8',fT[,3])
fT[,3] <- gsub('sep', '9',fT[,3])
fT[,3] <- gsub('oct', '10',fT[,3])
fT[,3] <- gsub('nov', '11',fT[,3])
fT[,3] <- gsub('dec', '12',fT[,3])
class(fT[,3])
## [1] "character"

fT[,3] <- as.numeric(fT[,3])
fT[,3] <- as.integer(fT[,3])
class(fT[,3])

## [1] "integer"

str(fT)

## 'data.frame': 345 obs. of 13 variables:
## $ X : int 7 7 7 8 8 8 8 7 7 6 ...
## $ Y : int 5 4 4 6 6 6 6 5 5 5 ...
## $ month: int 3 10 10 8 8 8 9 9 9 8 ...
## $ day : Factor w/ 7 levels "fri","mon","sat",..: 1 6 3 4 2 2 6 3 3 1 ...
## $ FFMC : num 86.2 90.6 90.6 92.3 92.3 91.5 91 92.5 92.8 63.5 ...
## $ DMC : num 26.2 35.4 43.7 85.3 88.9 ...
## $ DC : num 94.3 669.1 686.9 488 495.6 ...
## $ ISI : num 5.1 6.7 6.7 14.7 8.5 10.7 7 7.1 22.6 0.8 ...
## $ temp : num 8.2 18 14.6 22.2 24.1 8 13.1 17.8 19.3 17 ...
## $ RH : int 51 33 33 29 27 86 63 51 38 72 ...
## $ wind : num 6.7 0.9 1.3 5.4 3.1 2.2 5.4 7.2 4 6.7 ...
## $ rain : num 0 0 0 0 0 0 0 0 0 0 ...
## $ area : num 0 0 0 0 0 0 0 0 0 0 ...

fT[,4] <- gsub('mon','1',fT[,4])
fT[,4] <- gsub('tue', '2',fT[,4])
fT[,4] <- gsub('wed', '3',fT[,4])
fT[,4] <- gsub('thu', '4',fT[,4])
fT[,4] <- gsub('fri', '5',fT[,4])
fT[,4] <- gsub('sat', '6',fT[,4])
fT[,4] <- gsub('sun', '7',fT[,4])
class(fT[,4])

## [1] "character"

fT[,4] <- as.numeric(fT[,4])
fT[,4] <- as.integer(fT[,4])
class(fT[,4])

## [1] "integer"

str(fT)

## 'data.frame': 345 obs. of 13 variables:
## $ X : int 7 7 7 8 8 8 8 7 7 6 ...
## $ Y : int 5 4 4 6 6 6 6 5 5 5 ...
## $ month: int 3 10 10 8 8 8 9 9 9 8 ...
## $ day : int 5 2 6 7 1 1 2 6 6 5 ...
## $ FFMC : num 86.2 90.6 90.6 92.3 92.3 91.5 91 92.5 92.8 63.5 ...
## $ DMC : num 26.2 35.4 43.7 85.3 88.9 ...
## $ DC : num 94.3 669.1 686.9 488 495.6 ...
## $ ISI : num 5.1 6.7 6.7 14.7 8.5 10.7 7 7.1 22.6 0.8 ...
## $ temp : num 8.2 18 14.6 22.2 24.1 8 13.1 17.8 19.3 17 ...
## $ RH : int 51 33 33 29 27 86 63 51 38 72 ...
## $ wind : num 6.7 0.9 1.3 5.4 3.1 2.2 5.4 7.2 4 6.7 ...
## $ rain : num 0 0 0 0 0 0 0 0 0 0 ...
## $ area : num 0 0 0 0 0 0 0 0 0 0 ...
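Before turning to lm.ridge() below, it is worth noting that the glmnet route mentioned at the start of this section is also available, with ridge corresponding to alpha = 0. The following is only an illustrative sketch of that alternative, not code from the original analysis; it reuses the X, Y, and Xp matrices built in section 2.

# Ridge regression via cv.glmnet (alpha = 0), an illustrative alternative
ridge_cv <- cv.glmnet(X, Y, alpha = 0)
ridge_pred <- round(predict(ridge_cv, Xp, s = "lambda.min"), 0)
table(ridge_pred)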
After reconstructing the data frame into the correct structures, I was finally able to fit a model.

rmod <- lm.ridge(month ~., fT)

Now, of course, I want to test the model using the fires.test subset of the fires data. I have to reconstruct this set as well, though, so there are some steps that I must perform.

fTs <- fires.test
str(fTs)

## 'data.frame': 114 obs. of 13 variables:
## $ X : int 8 7 8 7 7 6 7 4 4 5 ...
## $ Y : int 6 5 5 4 4 3 3 4 4 6 ...
## $ month: Factor w/ 12 levels "apr","aug","dec",..: 8 12 11 7 2 12 11 2 12 12 ...
## $ day : Factor w/ 7 levels "fri","mon","sat",..: 1 3 2 4 3 4 3 6 3 2 ...
## $ FFMC : num 91.7 92.5 84.9 94.3 90.2 91.7 90.6 94.8 92.5 90.9 ...
## $ DMC : num 33.3 88 32.8 96.3 110.9 ...
## $ DC : num 77.5 698.6 664.2 200 537.4 ...
## $ ISI : num 9 7.1 3 56.1 6.2 7.8 6.7 17 7.1 7 ...
## $ temp : num 8.3 22.8 16.7 21 19.5 17.7 17.8 16.6 19.6 14.7 ...
## $ RH : int 97 40 47 44 43 39 27 54 48 70 ...
## $ wind : num 4 4 4.9 4.5 5.8 3.6 4 5.4 2.7 3.6 ...
## $ rain : num 0.2 0 0 0 0 0 0 0 0 0 ...
## $ area : num 0 0 0 0 0 0 0 0 0 0 ...

fTs[,3] <- gsub('jan','1',fTs[,3])
fTs[,3] <- gsub('feb', '2',fTs[,3])
fTs[,3] <- gsub('mar', '3',fTs[,3])
fTs[,3] <- gsub('apr', '4',fTs[,3])
fTs[,3] <- gsub('may', '5',fTs[,3])
fTs[,3] <- gsub('jun', '6',fTs[,3])
fTs[,3] <- gsub('jul', '7',fTs[,3])
fTs[,3] <- gsub('aug', '8',fTs[,3])
fTs[,3] <- gsub('sep', '9',fTs[,3])
fTs[,3] <- gsub('oct', '10',fTs[,3])
fTs[,3] <- gsub('nov', '11',fTs[,3])
fTs[,3] <- gsub('dec', '12',fTs[,3])
class(fTs[,3])

## [1] "character"

fTs[,3] <- as.numeric(fTs[,3])
fTs[,3] <- as.integer(fTs[,3])
class(fTs[,3])

## [1] "integer"

str(fTs)

## 'data.frame': 114 obs. of 13 variables:
## $ X : int 8 7 8 7 7 6 7 4 4 5 ...
## $ Y : int 6 5 5 4 4 3 3 4 4 6 ...
## $ month: int 3 9 10 6 8 9 10 8 9 9 ...
## $ day : Factor w/ 7 levels "fri","mon","sat",..: 1 3 2 4 3 4 3 6 3 2 ...
## $ FFMC : num 91.7 92.5 84.9 94.3 90.2 91.7 90.6 94.8 92.5 90.9 ...
## $ DMC : num 33.3 88 32.8 96.3 110.9 ...
## $ DC : num 77.5 698.6 664.2 200 537.4 ...
## $ ISI : num 9 7.1 3 56.1 6.2 7.8 6.7 17 7.1 7 ...
## $ temp : num 8.3 22.8 16.7 21 19.5 17.7 17.8 16.6 19.6 14.7 ...
## $ RH : int 97 40 47 44 43 39 27 54 48 70 ...
## $ wind : num 4 4 4.9 4.5 5.8 3.6 4 5.4 2.7 3.6 ...
## $ rain : num 0.2 0 0 0 0 0 0 0 0 0 ...
## $ area : num 0 0 0 0 0 0 0 0 0 0 ...

fTs[,4] <- gsub('mon','1',fTs[,4])
fTs[,4] <- gsub('tue', '2',fTs[,4])
fTs[,4] <- gsub('wed', '3',fTs[,4])
fTs[,4] <- gsub('thu', '4',fTs[,4])
fTs[,4] <- gsub('fri', '5',fTs[,4])
fTs[,4] <- gsub('sat', '6',fTs[,4])
fTs[,4] <- gsub('sun', '7',fTs[,4])
class(fTs[,4])

## [1] "character"

fTs[,4] <- as.numeric(fTs[,4])
fTs[,4] <- as.integer(fTs[,4])
class(fTs[,4])

## [1] "integer"

str(fT)

## 'data.frame': 345 obs. of 13 variables:
## $ X : int 7 7 7 8 8 8 8 7 7 6 ...
## $ Y : int 5 4 4 6 6 6 6 5 5 5 ...
## $ month: int 3 10 10 8 8 8 9 9 9 8 ...
## $ day : int 5 2 6 7 1 1 2 6 6 5 ...
## $ FFMC : num 86.2 90.6 90.6 92.3 92.3 91.5 91 92.5 92.8 63.5 ...
## $ DMC : num 26.2 35.4 43.7 85.3 88.9 ...
## $ DC : num 94.3 669.1 686.9 488 495.6 ...
## $ ISI : num 5.1 6.7 6.7 14.7 8.5 10.7 7 7.1 22.6 0.8 ...
## $ temp : num 8.2 18 14.6 22.2 24.1 8 13.1 17.8 19.3 17 ...
## $ RH : int 51 33 33 29 27 86 63 51 38 72 ...
## $ wind : num 6.7 0.9 1.3 5.4 3.1 2.2 5.4 7.2 4 6.7 ...
## $ rain : num 0 0 0 0 0 0 0 0 0 0 ...
## $ area : num 0 0 0 0 0 0 0 0 0 0 ...

I believe I am ready to perform the prediction on the data. Here goes nothing.

class(rmod)

## [1] "ridgelm"

rmd <- matrix(rmod$coef)
rmd[4,] <- 0.0271
rmd[6,] <- 0.01
rmd[5,] <- -.008
rmd[9,] <- -.01668
FTS <- fTs[,-3]
str(FTS)

## 'data.frame': 114 obs. of 12 variables:
## $ X : int 8 7 8 7 7 6 7 4 4 5 ...
## $ Y : int 6 5 5 4 4 3 3 4 4 6 ...
## $ day : int 5 6 1 7 6 7 6 2 6 1 ...
## $ FFMC: num 91.7 92.5 84.9 94.3 90.2 91.7 90.6 94.8 92.5 90.9 ...
## $ DMC : num 33.3 88 32.8 96.3 110.9 ...
## $ DC : num 77.5 698.6 664.2 200 537.4 ...
## $ ISI : num 9 7.1 3 56.1 6.2 7.8 6.7 17 7.1 7 ...
## $ temp: num 8.3 22.8 16.7 21 19.5 17.7 17.8 16.6 19.6 14.7 ...
## $ RH : int 97 40 47 44 43 39 27 54 48 70 ...
## $ wind: num 4 4 4.9 4.5 5.8 3.6 4 5.4 2.7 3.6 ...
## $ rain: num 0.2 0 0 0 0 0 0 0 0 0 ...
## $ area: num 0 0 0 0 0 0 0 0 0 0 ...

FTS <- as.matrix(FTS)
class(FTS)

## [1] "matrix"

dim(FTS)

## [1] 114 12

dim(rmd)

## [1] 12 1

rpred <- FTS%*%rmd

I realize the code above might be confusing. Because I could not find an "easy" way to get predictions from the model, I converted the coefficients into a matrix (rmd) and turned the fires.test set into another matrix, so that I could multiply the coefficients against the correct columns; the result is stored in rpred. The values come out as decimals because of the way lm.ridge does its calculations, so I convert them to the nearest month by rounding and use table() to count the frequencies. The counts are displayed below as freq; the other table holds the true values.

freq <- table(round(rpred,0))
freq

##
## -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 9
## 1 1 5 10 11 13 17 9 17 17 6 4 1 2

true <- table(fTs[,3])
true

##
## 1 2 3 4 5 6 7 8 9 10 12
## 1 7 8 2 1 5 7 35 43 4 1

I seem to be having trouble getting the "ridgelm" class to behave the way I would like it to, so I am going to try a different ridge function.
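As an aside before switching functions: one likely route to predictions from the ridgelm object itself is MASS's coef() method, which, to the best of my understanding (this is an assumption, not part of the original analysis), returns an intercept plus coefficients on the original variable scale rather than the scaled values stored in rmod$coef. A sketch:

# Possible direct prediction from the ridgelm fit (a sketch, under the
# assumption that coef() returns the intercept followed by unscaled slopes)
beta <- coef(rmod)                            # intercept plus 12 slopes
rpred2 <- cbind(1, as.matrix(fTs[, -3])) %*% beta
table(round(rpred2, 0))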
lp <- linearRidge(month~.,fT)
pre <- table(round(predict(lp, fTs),0))
pre

##
## 2 3 4 5 6 7 8 9 10 11
## 1 8 10 4 5 7 28 41 8 2

true

##
## 1 2 3 4 5 6 7 8 9 10 12
## 1 7 8 2 1 5 7 35 43 4 1

The predictions are not too bad at all, except for the 3rd and 4th months (March and April); those two values look essentially inverted relative to the true data. I wonder why those two months get such a bad fit, and I am not sure how to answer that question. These results are very similar to the elastic net regression from section 2. It is a little better than the elastic net, but it is not much better, and it is certainly not better than the SVM models we used in section 1. This leads me to the next idea to focus on, which is a LASSO regression with the fires data. So far I have done elastic net (which is a mixture of LASSO and ridge) and ridge (which keeps all the predictors and assigns a penalty). The next logical place to go is LASSO (which allows for variable selection along with the penalty factor).

4 LASSO Regression

A reminder that LASSO stands for least absolute shrinkage and selection operator. LASSO will shrink the coefficients like ridge regression did, but the main difference between the two is that LASSO will force some coefficients to exactly 0, thus eliminating those variables from the model entirely. This is a process commonly known as variable selection. I want to continue to work with the fires data set from the previous sections. I will start by looking at the overall structure of the data sets again, where fires.train and fires.test are the two subsets of fires on which training and testing of a model are done.

str(fires.test)

## 'data.frame': 114 obs. of 13 variables:
## $ X : int 8 7 8 7 7 6 7 4 4 5 ...
## $ Y : int 6 5 5 4 4 3 3 4 4 6 ...
## $ month: Factor w/ 12 levels "apr","aug","dec",..: 8 12 11 7 2 12 11 2 12 12 ...
## $ day : Factor w/ 7 levels "fri","mon","sat",..: 1 3 2 4 3 4 3 6 3 2 ...
## $ FFMC : num 91.7 92.5 84.9 94.3 90.2 91.7 90.6 94.8 92.5 90.9 ...
## $ DMC : num 33.3 88 32.8 96.3 110.9 ...
## $ DC : num 77.5 698.6 664.2 200 537.4 ...
## $ ISI : num 9 7.1 3 56.1 6.2 7.8 6.7 17 7.1 7 ...
## $ temp : num 8.3 22.8 16.7 21 19.5 17.7 17.8 16.6 19.6 14.7 ...
## $ RH : int 97 40 47 44 43 39 27 54 48 70 ...
## $ wind : num 4 4 4.9 4.5 5.8 3.6 4 5.4 2.7 3.6 ...
## $ rain : num 0.2 0 0 0 0 0 0 0 0 0 ...
## $ area : num 0 0 0 0 0 0 0 0 0 0 ...
str(fires.train)

## 'data.frame': 345 obs. of 13 variables:
## $ X : int 7 7 7 8 8 8 8 7 7 6 ...
## $ Y : int 5 4 4 6 6 6 6 5 5 5 ...
## $ month: Factor w/ 12 levels "apr","aug","dec",..: 8 11 11 2 2 2 12 12 12 2 ...
## $ day : Factor w/ 7 levels "fri","mon","sat",..: 1 6 3 4 2 2 6 3 3 1 ...
## $ FFMC : num 86.2 90.6 90.6 92.3 92.3 91.5 91 92.5 92.8 63.5 ...
## $ DMC : num 26.2 35.4 43.7 85.3 88.9 ...
## $ DC : num 94.3 669.1 686.9 488 495.6 ...
## $ ISI : num 5.1 6.7 6.7 14.7 8.5 10.7 7 7.1 22.6 0.8 ...
## $ temp : num 8.2 18 14.6 22.2 24.1 8 13.1 17.8 19.3 17 ...
## $ RH : int 51 33 33 29 27 86 63 51 38 72 ...
## $ wind : num 6.7 0.9 1.3 5.4 3.1 2.2 5.4 7.2 4 6.7 ...
## $ rain : num 0 0 0 0 0 0 0 0 0 0 ...
## $ area : num 0 0 0 0 0 0 0 0 0 0 ...

I plan to use the lars() function from the R package 'lars', which requires a predictor matrix and a response vector to train the model. Because I have already created those in previous sections, I can simply reuse them, since they hold the same data I need here. From the elastic net work in section 2, X is the predictor matrix and Y is the response vector (month).

lasso <- lars(X, Y)

Now that I have a LASSO model trained, I can test it against the fires.test data set we made previously. Again, because the elastic net also needed a predictor matrix and a response vector for the test set, I can reuse the matrix (Xp) and vector (Yp) formed in that section; the s = 8 argument below selects the eighth step along the lars path (the default mode for predict.lars() is "step").

laspre <- predict(lasso, Xp,s=8)
t.laspre <- table(round(as.matrix(laspre$fit),0))
t.laspre

##
## 2 3 4 5 6 7 8 9 10
## 1 9 11 3 4 7 27 39 13

table(Yp)

## Yp
## 1 2 3 4 5 6 7 8 9 10 12
## 1 7 8 2 1 5 7 35 43 4 1

This one may just have been my best model yet. I can see that the groupings are close together, and in fact the biggest misclassification that I have is for the fourth month, at 7. I believe this is yielding the best results. I will run it through the process a few more times to see if I can generally conclude that I have reached my best model.

5 References

UCI Machine Learning Repository (2008), 'Forest Fires', University of California, Irvine.

Zou, Hui and Hastie, Trevor (2004), 'Regularization and variable selection via the elastic net', Stanford University.