4. LabeledPoint
Spark MLlib's abstraction for a labeled feature vector is the LabeledPoint,
which pairs an MLlib Vector of features with a target value,
here called the label.
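As a minimal sketch, the abstraction can be pictured with a simplified stand-in for MLlib's own LabeledPoint and Vector types (not the real API; a plain Array[Double] plays the role of the feature vector here):

```scala
// Simplified stand-in for org.apache.spark.mllib.regression.LabeledPoint:
// a Double label paired with a feature vector.
case class SimpleLabeledPoint(label: Double, features: Array[Double])

// A row from the covtype data: the last column is the target,
// the preceding columns are features.
def toLabeledPoint(row: Array[Double]): SimpleLabeledPoint =
  SimpleLabeledPoint(row.last, row.init)
```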
6. Covtype Dataset
The data set records the types of forest covering parcels of
land in Colorado, USA.
Name                                Data Type     Measurement
Elevation                           quantitative  meters
Aspect                              quantitative  azimuth
Slope                               quantitative  degrees
Horizontal_Distance_To_Hydrology    quantitative  meters
Vertical_Distance_To_Hydrology      quantitative  meters
Horizontal_Distance_To_Roadways     quantitative  meters
Hillshade_9am                       quantitative  0 to 255 index
Hillshade_Noon                      quantitative  0 to 255 index
Hillshade_3pm                       quantitative  0 to 255 index
Horizontal_Distance_To_Fire_Points  quantitative  meters
Wilderness_Area (4 binary columns)  qualitative   0 or 1
Soil_Type (40 binary columns)       qualitative   0 or 1
Cover_Type (7 types)                integer       1 to 7
2596,51,3,258,0,510,221,232,148,6279,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5
2590,56,2,212,-6,390,220,235,151,6225,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5
7. A First Decision Tree (1)
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val rawData = sc.textFile("hdfs:///user/ds/covtype.data")
val data = rawData.map { line =>
  val values = line.split(',').map(_.toDouble)
  // all columns but the last are features; MLlib's Vector is not parameterized
  val featureVector = Vectors.dense(values.init)
  // Cover_Type is 1..7 in the file; MLlib wants labels 0..6
  val label = values.last - 1
  LabeledPoint(label, featureVector)
}
val Array(trainData, cvData, testData) =
  data.randomSplit(Array(0.8, 0.1, 0.1))
trainData.cache(); cvData.cache(); testData.cache()
Note: for classification, labels should take values {0, 1, ..., numClasses-1}.
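The `values.last - 1` above implements exactly this requirement: Cover_Type runs 1 to 7 in the file, but MLlib expects class labels 0 to 6. A quick standalone check of that mapping:

```scala
// Cover_Type in covtype.data is 1-based (1..7); DecisionTree.trainClassifier
// expects 0-based labels (0..numClasses-1), hence the "- 1".
def toZeroBasedLabel(coverType: Double): Double = coverType - 1

val fileLabels  = (1 to 7).map(_.toDouble)
val mllibLabels = fileLabels.map(toZeroBasedLabel)
```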
8. A First Decision Tree (2)
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.tree.DecisionTree

// 7 classes, no categorical features, "gini" impurity,
// maximum depth 4, 100 bins
val model = DecisionTree.trainClassifier(trainData, 7, Map[Int,Int](),
  "gini", 4, 100)
val predictionsAndLabels = cvData.map(example =>
  (model.predict(example.features), example.label)
)
val metrics = new MulticlassMetrics(predictionsAndLabels)
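The "gini" argument selects the impurity measure the tree minimizes at each split; "entropy" (used later) is the alternative. As a standalone sketch of what these measures compute (not the MLlib internals):

```scala
// Gini impurity: 1 - sum of p_i^2 over classes, where p_i is the
// fraction of examples in class i. 0.0 means a pure node.
def gini(counts: Seq[Int]): Double = {
  val total = counts.sum.toDouble
  1.0 - counts.map { c => val p = c / total; p * p }.sum
}

// Entropy: -sum of p_i * log2(p_i); also 0.0 for a pure node.
def entropy(counts: Seq[Int]): Double = {
  val total = counts.sum.toDouble
  counts.filter(_ > 0).map { c =>
    val p = c / total
    -p * math.log(p) / math.log(2)
  }.sum
}
```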
12. Revising Categorical Features (1)
With a single 40-valued categorical feature, the decision tree can base
one decision on a group of categories, which can be more direct and
optimal. Conversely, representing one 40-valued categorical feature as
40 numeric features increases memory usage and slows training.
val data = rawData.map { line =>
  val values = line.split(',').map(_.toDouble)
  // decode the one-hot blocks back into single categorical features
  val wilderness = values.slice(10, 14).indexOf(1.0).toDouble
  val soil = values.slice(14, 54).indexOf(1.0).toDouble
  val featureVector =
    Vectors.dense(values.slice(0, 10) :+ wilderness :+ soil)
  val label = values.last - 1
  LabeledPoint(label, featureVector)
}
val Array(trainData, cvData, testData) = data.randomSplit(Array(0.8, 0.1, 0.1))
..1000.. => 0
..0100.. => 1
..0010.. => 2
..0001.. => 3
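The slice-and-indexOf trick can be checked in isolation; this hypothetical helper mirrors what the map over `rawData` does for the four Wilderness_Area columns:

```scala
// Decode a one-hot block: the position of the single 1.0 becomes the
// category index, e.g. 0,0,1,0 => 2. Returns -1.0 if no 1.0 is present.
def oneHotToIndex(block: Array[Double]): Double =
  block.indexOf(1.0).toDouble

// Toy row: two numeric features followed by a 4-column one-hot block.
val row = Array(2596.0, 51.0, 0.0, 0.0, 1.0, 0.0)
val wilderness = oneHotToIndex(row.slice(2, 6))
```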
13.
val evaluations =
  for (impurity <- Array("gini", "entropy");
       depth    <- Array(10, 20, 30);
       bins     <- Array(40, 300))
  yield {
    val model = DecisionTree.trainClassifier(trainData, 7,
      Map(10 -> 4, 11 -> 40), impurity, depth, bins)
    val predictionsAndLabels = cvData.map(example =>
      (model.predict(example.features), example.label))
    val accuracy = new MulticlassMetrics(predictionsAndLabels).precision
    ((impurity, depth, bins), accuracy)
  }
evaluations.sortBy { case ((impurity, depth, bins), accuracy) =>
  accuracy
}.reverse.foreach(println)

…
((entropy,30,300),0.9446513552658804)
((gini,30,300),0.9391509759293745)
((entropy,30,40),0.9389268225394855)
((gini,30,40),0.9355817642596042)
Revising Categorical Features (2)
vs. tuned 1-of-n encoding DT: ((entropy,20,300),0.9119046392195256)
Map storing arity of categorical features. E.g., an entry (n ->
k) indicates that feature n is categorical with k categories
indexed from 0: {0, 1, ..., k-1}.
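Concretely, the arity map passed to trainClassifier above can be built in plain Scala (no Spark needed) and read back to confirm what it declares:

```scala
// Arity map for the revised encoding: feature 10 (wilderness) is
// categorical with 4 categories (0..3); feature 11 (soil) is
// categorical with 40 categories (0..39).
val categoricalFeaturesInfo: Map[Int, Int] = Map(10 -> 4, 11 -> 40)
```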
14. CV set vs. Test set
If the purpose of the CV set was to evaluate parameters fit to the
training set, then the purpose of the test set is to evaluate
hyperparameters that were “fit” to the CV set. That is, the test
set ensures an unbiased estimate of the accuracy of the final,
chosen model and its hyperparameters.
val model = DecisionTree.trainClassifier(
  trainData.union(cvData), 7, Map[Int,Int](), "entropy", 20, 300)
val predictionsAndLabels = testData.map(example =>
  (model.predict(example.features), example.label))
val metrics = new MulticlassMetrics(predictionsAndLabels)
metrics.precision = 0.9161946933031271
15. Random Decision Forests
It would be great to have not one tree but many, each producing
reasonable but different, independent estimates of the right target
value. Their collective average prediction should fall closer to the
true answer than any individual tree's does. The randomness injected
into the building process is what creates this independence, and it is
the key to random decision forests.
import org.apache.spark.mllib.tree.RandomForest

// 20 trees; "auto" picks the number of features to try at each split
val model = RandomForest.trainClassifier(trainData, 7,
  Map(10 -> 4, 11 -> 40), 20, "auto", "entropy", 30, 300)
val predictionsAndLabels = cvData.map(example =>
  (model.predict(example.features), example.label))
val metrics = new MulticlassMetrics(predictionsAndLabels)
metrics.precision = 0.9630068932322555
vs. categorical-features DT: 0.9446513552658804
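The collective-vote idea can be sketched without Spark. For classification, a random forest combines its trees by majority vote over their individual predictions (this helper and the vote values are illustrative, not MLlib internals):

```scala
// Majority vote over individual tree predictions: the class predicted
// by the most trees wins. For regression, a forest would instead
// average the trees' numeric predictions.
def majorityVote(treePredictions: Seq[Double]): Double =
  treePredictions.groupBy(identity).maxBy(_._2.size)._1

val votes = Seq(4.0, 1.0, 4.0, 4.0, 2.0)  // five trees' predicted classes
```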