Apache Spark Machine Learning Decision Trees

© 2016 MapR Technologies
Machine Learning with Apache Spark

Agenda
• Brief overview of:
– Classification
– Clustering
– Collaborative Filtering
• Predicting Flight Delays using a Decision Tree

What is MLlib?
Spark SQL
• Structured data
• Querying with SQL/HQL
• DataFrames
Spark Streaming
• Processing of live streams
• Micro-batching
MLlib
• Machine learning
• Multiple types of ML algorithms
GraphX
• Graph processing
• Graph-parallel computations
Spark Core
• RDD transformations and actions
• Task scheduling
• Memory management
• Fault recovery
• Interacting with storage systems

MLlib Algorithms and Utilities
• Basic statistics: summary statistics, correlations, hypothesis testing, and random data generation
• Classification and regression: linear models, decision trees, and Naïve Bayes
• Collaborative filtering: model-based collaborative filtering using the alternating least squares (ALS) algorithm
• Clustering: K-means clustering
• Dimensionality reduction: singular value decomposition (SVD) and principal component analysis (PCA) on the RowMatrix class
• Feature extraction and transformation: several classes for common feature transformations

Examples of ML Algorithms
Supervised machine learning
• Classification
– Naïve Bayes
– SVM
– Random Decision Forests
• Regression
– Linear
– Logistic (despite the name, typically used for classification)
Unsupervised machine learning
• Clustering
– K-means
• Dimensionality reduction
– Principal Component Analysis
– SVD

Three Categories of Techniques for Machine Learning
• Classification
• Clustering
• Collaborative Filtering (Recommendation)

Machine Learning: Classification
Classification identifies the category an item belongs to.

Classification: Definition
Form of ML that:
• Identifies which category an item belongs to
• Uses supervised learning algorithms
– Data is labeled
• Example: sentiment analysis

If It Walks/Swims/Quacks Like a Duck … Then It Must Be a Duck
Features: walks, swims, quacks

Building and Deploying a Classifier Model
Image reference: O’Reilly, Learning Spark
Training Data
• Spam: "free money now!", "get this money", "free savings $$$"
• Non-spam: "how are you?", "that Spark job", "lunch plans"
Pipeline
1. Featurization: training data → feature vectors
2. Training: feature vectors → model
3. Model evaluation: candidate models → best model
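The featurization step in the pipeline above can be sketched in plain Scala, without Spark. This is only an illustration of turning text into a numeric feature vector by hashing words into a fixed number of buckets; the object name, the vector size, and the tokenization rule are assumptions, not part of the original slides. MLlib ships its own `HashingTF` transformer for this purpose, which is what a real pipeline would more likely use.

```scala
object SpamFeaturizer {
  // Hypothetical vector size; real pipelines typically use far more buckets.
  val NumFeatures = 16

  // Lowercase the text and split it into word tokens ('$' is kept as a word character).
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z0-9$]+").filter(_.nonEmpty).toSeq

  // Map each token to a bucket by its hash code and count the tokens per bucket.
  def featurize(text: String): Array[Double] = {
    val vector = Array.fill(NumFeatures)(0.0)
    for (token <- tokenize(text)) {
      // Normalize the remainder so the bucket index is never negative.
      val bucket = ((token.hashCode % NumFeatures) + NumFeatures) % NumFeatures
      vector(bucket) += 1.0
    }
    vector
  }

  def main(args: Array[String]): Unit = {
    // Label 1.0 = spam, 0.0 = non-spam, using the training data from the slide.
    val training = Seq(
      (1.0, "free money now!"), (1.0, "get this money"), (1.0, "free savings $$$"),
      (0.0, "how are you?"), (0.0, "that Spark job"), (0.0, "lunch plans")
    )
    training.foreach { case (label, text) =>
      println(s"$label -> ${featurize(text).mkString("[", ", ", "]")}")
    }
  }
}
```

Each labeled vector printed by `main` corresponds to one point in the "Feature Vectors" stage; a training algorithm would consume these instead of the raw text.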

Machine Learning: Clustering

Clustering: Definition
• Unsupervised learning task
• Groups objects into clusters of high similarity
– Search results grouping
– Grouping of customers
– Anomaly detection
– Text categorization

Clustering: Example
• Group similar objects
• Use MLlib K-means algorithm
1. Initialize the coordinates of the cluster centers (centroids)
2. Assign each point to its nearest centroid
3. Update each centroid to the center (mean) of its assigned points
4. Repeat until the stopping conditions are met
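The four steps above can be sketched in plain Scala on two-dimensional points. This is a minimal single-machine illustration, not MLlib's implementation; the object name, the choice of squared Euclidean distance, and the stopping rule (centroids unchanged, or a hypothetical iteration cap of 20) are assumptions. In Spark itself, `KMeans.train` from `org.apache.spark.mllib.clustering` does this work on an RDD of vectors.

```scala
object KMeansSketch {
  type Point = (Double, Double)

  def squaredDistance(a: Point, b: Point): Double = {
    val dx = a._1 - b._1
    val dy = a._2 - b._2
    dx * dx + dy * dy
  }

  // Step 2: index of the centroid nearest to a point.
  def nearest(p: Point, centroids: Seq[Point]): Int =
    centroids.indices.minBy(i => squaredDistance(p, centroids(i)))

  // Step 3: the center (mean) of a non-empty group of points.
  def mean(points: Seq[Point]): Point =
    (points.map(_._1).sum / points.size, points.map(_._2).sum / points.size)

  // Step 1 is the caller's choice of `initial`; steps 2 and 3 repeat (step 4)
  // until the centroids stop changing or maxIterations is reached.
  def fit(points: Seq[Point], initial: Seq[Point], maxIterations: Int = 20): Seq[Point] = {
    var centroids = initial
    var changed = true
    var iteration = 0
    while (changed && iteration < maxIterations) {
      val groups = points.groupBy(p => nearest(p, centroids))
      // A centroid with no assigned points keeps its previous position.
      val updated = centroids.indices.map(i => groups.get(i).map(mean).getOrElse(centroids(i)))
      changed = updated != centroids
      centroids = updated
      iteration += 1
    }
    centroids
  }
}
```

For example, `KMeansSketch.fit(Seq((1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (9.0, 10.0)), Seq((0.0, 0.0), (10.0, 10.0)))` moves the two starting centroids to the means of the two obvious groups.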

Machine Learning: Collaborative Filtering (Recommendation)

Collaborative Filtering with Spark
• Recommend items
– (Filtering)
• Based on user preferences data
– (Collaborative)
User Item Rating Matrix
        A   B   C
Ted     4   5   5
Carol       5   5
Bob         5   ?

Train a Model to Make Predictions
• Ted and Carol like movies B and C
• Bob likes movie B; what might he like?
• Bob likes movie B, so predict C
Training data → Algorithm → Model
New data → Model → Predictions
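The reasoning "Bob likes B, Ted and Carol like B and C, so predict C" can be sketched as a similarity-weighted average in plain Scala. This is a simple neighbor-based illustration and is not the ALS algorithm that MLlib uses; the object name and the similarity formula are assumptions, and the placement of Carol's and Bob's ratings in columns B and C is inferred from the slide text.

```scala
object RatingPredictor {
  // User -> (item -> rating), as read from the User Item Rating Matrix.
  val ratings: Map[String, Map[String, Double]] = Map(
    "Ted"   -> Map("A" -> 4.0, "B" -> 5.0, "C" -> 5.0),
    "Carol" -> Map("B" -> 5.0, "C" -> 5.0),
    "Bob"   -> Map("B" -> 5.0)
  )

  // Similarity in (0, 1]: 1.0 when two users agree exactly on every shared item.
  // Returns None when the users have no rated item in common.
  def similarity(a: Map[String, Double], b: Map[String, Double]): Option[Double] = {
    val shared = a.keySet.intersect(b.keySet)
    if (shared.isEmpty) None
    else {
      val meanAbsDiff = shared.toSeq.map(item => math.abs(a(item) - b(item))).sum / shared.size
      Some(1.0 / (1.0 + meanAbsDiff))
    }
  }

  // Predict a rating as the similarity-weighted average of other users' ratings for the item.
  def predict(user: String, item: String): Option[Double] = {
    val mine = ratings.getOrElse(user, Map.empty[String, Double])
    val weighted = for {
      (other, theirs) <- ratings.toSeq
      if other != user
      rating <- theirs.get(item).toSeq
      sim <- similarity(mine, theirs).toSeq
    } yield (sim, rating)
    if (weighted.isEmpty) None
    else Some(weighted.map { case (s, r) => s * r }.sum / weighted.map(_._1).sum)
  }
}
```

Calling `RatingPredictor.predict("Bob", "C")` fills in the `?` cell of the matrix from Ted's and Carol's ratings, which is the "collaborative" part of collaborative filtering.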

Predict Flight Delays

Use Case: Flight Data
• Predict if a flight is going to be delayed
• Use a Decision Tree for prediction
– Used for classification and regression
– Represents decisions as a tree of nodes
– Binary decision at each node
Flight Data

Parse Input
// Define the schema
case class Flight(dofM: String, dofW: String, carrier: String, tailnum: String,
  flnum: Int, org_id: String, origin: String, dest_id: String, dest: String,
  crsdeptime: Double, deptime: Double, depdelaymins: Double, crsarrtime: Double,
  arrtime: Double, arrdelay: Double, crselapsedtime: Double, dist: Int)

// Parse one comma-separated line into a Flight
def parseFlight(str: String): Flight = {
  val line = str.split(",")
  Flight(line(0), line(1), line(2), line(3), line(4).toInt, line(5), line(6),
    line(7), line(8), line(9).toDouble, line(10).toDouble, line(11).toDouble,
    line(12).toDouble, line(13).toDouble, line(14).toDouble, line(15).toDouble,
    line(16).toInt)
}

// Load the file into an RDD of lines (sc is the SparkContext provided by spark-shell)
val rdd = sc.textFile("flights.csv")
// Create an RDD of Flight objects
val flightsRDD = rdd.map(parseFlight).cache()
flightsRDD.take(1)
// Array(Flight(1,3,AA,N338AA,1,12478,JFK,12892,LAX,900.0,914.0,14.0,1225.0,1238.0,
// 13.0,385.0,2475))

Featurization: Training Data → Feature Vectors
• Delayed: Friday, LAX, AA
• Not delayed: Wednesday, BNA, Delta

Classification Learning Problem - Features
Label → delayed or not delayed (delayed if delay > 40 minutes)
Features → {day_of_month, weekday, crsdeptime, crsarrtime, carrier, crselapsedtime, origin, dest}

Transform non-numeric features into numeric values
// Create a map of airline carrier -> numeric index
val carrierMap: Map[String, Int] =
  flightsRDD.map(_.carrier).distinct.collect.zipWithIndex.toMap
// e.g. Map(DL -> 5, US -> 9, AA -> 6, UA -> 4, ...)

// Create a map of origin airport -> numeric index (needed when the features are built)
val originMap: Map[String, Int] =
  flightsRDD.map(_.origin).distinct.collect.zipWithIndex.toMap

// Create a map of destination airport -> numeric index
val destMap: Map[String, Int] =
  flightsRDD.map(_.dest).distinct.collect.zipWithIndex.toMap
// e.g. Map(JFK -> 214, LAX -> 294, ATL -> 273, MIA -> 175, ...)

MLlib Data Types
• Vector: contains the feature values of a data point
• LabeledPoint: contains a feature vector and its label

Define the features, create LabeledPoint with Vector
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Build one array per flight: the label first, then the eight features
val mlprep = flightsRDD.map(flight => {
  val monthday = flight.dofM.toInt - 1 // category, zero-based
  val weekday = flight.dofW.toInt - 1 // category, zero-based
  val crsdeptime1 = flight.crsdeptime.toInt
  val crsarrtime1 = flight.crsarrtime.toInt
  val carrier1 = carrierMap(flight.carrier) // category
  val crselapsedtime1 = flight.crselapsedtime
  val origin1 = originMap(flight.origin) // category
  val dest1 = destMap(flight.dest) // category
  // Label: 1.0 if the departure delay exceeds 40 minutes, else 0.0
  val delayed = if (flight.depdelaymins > 40) 1.0 else 0.0
  Array(delayed, monthday.toDouble, weekday.toDouble, crsdeptime1.toDouble,
    crsarrtime1.toDouble, carrier1.toDouble, crselapsedtime1, origin1.toDouble,
    dest1.toDouble)
})
mlprep.take(1)
// Array(Array(0.0, 0.0, 2.0, 900.0, 1225.0, 6.0, 385.0, 214.0, 294.0))

// Wrap each array in a LabeledPoint: x(0) is the label, x(1) to x(8) are the features
val mldata = mlprep.map(x => LabeledPoint(x(0),
  Vectors.dense(x(1), x(2), x(3), x(4), x(5), x(6), x(7), x(8))))
mldata.take(1)
// Array[LabeledPoint] = Array((0.0,[0.0,2.0,900.0,1225.0,6.0,385.0,214.0,294.0]))

Build Model
Split data into:
• Training data RDD (80%)
• Test data RDD (20%)
Data → Training Set → Build Model
Data → Test Set

Split Data
// Randomly split the RDD into training data (80%) and test data (20%)
val splits = mldata.randomSplit(Array(0.8, 0.2))
val trainingData = splits(0).cache()
val testData = splits(1).cache()
testData.take(1)
// Array[LabeledPoint] = Array((0.0,[18.0,6.0,900.0,1225.0,6.0,385.0,214.0,294.0]))
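A random split assigns each record to one side by drawing a random number, so the 80/20 proportions are approximate and the result changes from run to run unless the random seed is fixed. The plain-Scala sketch below illustrates that idea on an ordinary collection; the object and parameter names are assumptions. On an RDD, the same effect comes from passing a seed as the second argument of `randomSplit`.

```scala
import scala.util.Random

object SplitSketch {
  // Send each element to the training side with probability trainFraction,
  // otherwise to the test side. The same seed always gives the same split.
  def randomSplit[A](data: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
    val rng = new Random(seed)
    // Draw exactly one random number per element, in order.
    val tagged = data.toVector.map(x => (x, rng.nextDouble() < trainFraction))
    (tagged.filter(_._2).map(_._1), tagged.filterNot(_._2).map(_._1))
  }
}
```

Fixing the seed makes a reported error rate reproducible, which helps when comparing models trained with different parameters.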

Build Model
• Use the training set, with labels, to build a model

Build Model
import org.apache.spark.mllib.tree.DecisionTree

// Number of categories for each categorical feature, keyed by feature index
val categoricalFeaturesInfo = Map[Int, Int](
  0 -> 31,              // dofM: 31 categories
  1 -> 7,               // dofW: 7 categories
  4 -> carrierMap.size, // number of carriers
  6 -> originMap.size,  // number of origin airports
  7 -> destMap.size     // number of destination airports
)
val numClasses = 2      // delayed or not delayed
val impurity = "gini"
val maxDepth = 9
val maxBins = 7000      // must be at least the largest number of categories of any categorical feature
// Train a decision tree classifier on the training data; this returns the model
val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  impurity, maxDepth, maxBins)
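The `impurity = "gini"` setting selects Gini impurity as the measure a decision tree uses to compare candidate splits: a node containing a single class has impurity 0, and an even mix of two classes has impurity 0.5. The sketch below computes it in plain Scala as an illustration; the object and function names are assumptions, and MLlib's internal implementation works on aggregated statistics rather than raw label lists.

```scala
object GiniSketch {
  // Gini impurity of a set of labels: 1 minus the sum of squared class proportions.
  def gini(labels: Seq[Double]): Double =
    if (labels.isEmpty) 0.0
    else {
      val n = labels.size.toDouble
      val sumOfSquares = labels.groupBy(identity).values.map { group =>
        val p = group.size / n
        p * p
      }.sum
      1.0 - sumOfSquares
    }

  // Impurity of a binary split: the size-weighted average of the two sides.
  // Lower is better; a tree learner prefers the split that minimizes this value.
  def splitImpurity(left: Seq[Double], right: Seq[Double]): Double = {
    val n = (left.size + right.size).toDouble
    if (n == 0) 0.0
    else (left.size / n) * gini(left) + (right.size / n) * gini(right)
  }
}
```

A split that sends all delayed flights one way and all on-time flights the other has a split impurity of 0, which is the best possible value.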

Build Model
// Print out the decision tree
model.toDebugString
// Feature indices: 0 = dofM, 3 = crsarrtime, 4 = carrier, 6 = origin
res20: String =
DecisionTreeModel classifier of depth 9 with 919 nodes
  If (feature 0 in {11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0,20.0,21.0,22.0,23.0,24.0,25.0,26.0,27.0,30.0})
   If (feature 4 in {0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,13.0})
    If (feature 3 <= 1603.0)
     If (feature 0 in {11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0})
      If (feature 6 in {0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,10.0,11.0,12.0,13.0...

Get Predictions
Test data (without label) → Model → Predict delay or not

Get Predictions
// Get predictions: create an RDD of (test label, test prediction) pairs
val labelAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
labelAndPreds.take(1)
// (label, prediction)
// Array((0.0,0.0))

Test Model
// Get the instances where label != prediction
val wrongPrediction = labelAndPreds.filter {
  case (label, prediction) => label != prediction
}
val wrong = wrongPrediction.count()
// wrong: Long = 11040
val ratioWrong = wrong.toDouble / testData.count()
// ratioWrong: Double = 0.3157443157443157
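A single error ratio does not show which kind of mistake the model makes: predicting a delay that does not happen, or missing one that does. The plain-Scala sketch below counts the four outcomes from (label, prediction) pairs and derives accuracy, precision, and recall; the object, class, and method names are assumptions. For RDDs, MLlib provides evaluation classes in `org.apache.spark.mllib.evaluation` that would likely be the better choice.

```scala
object EvalSketch {
  // Outcome counts for a binary classifier (1.0 = delayed, 0.0 = not delayed).
  final case class Counts(tp: Long, fp: Long, tn: Long, fn: Long) {
    def total: Long = tp + fp + tn + fn
    def accuracy: Double = (tp + tn).toDouble / total
    def errorRate: Double = (fp + fn).toDouble / total
    // Of the flights predicted delayed, the fraction that really were delayed.
    def precision: Double = if (tp + fp == 0) 0.0 else tp.toDouble / (tp + fp)
    // Of the flights that really were delayed, the fraction the model caught.
    def recall: Double = if (tp + fn == 0) 0.0 else tp.toDouble / (tp + fn)
  }

  // Tally (label, prediction) pairs into the four outcome counts.
  def count(labelAndPreds: Seq[(Double, Double)]): Counts =
    labelAndPreds.foldLeft(Counts(0, 0, 0, 0)) { case (c, (label, prediction)) =>
      val actual = label == 1.0
      val predicted = prediction == 1.0
      if (actual && predicted) c.copy(tp = c.tp + 1)
      else if (!actual && predicted) c.copy(fp = c.fp + 1)
      else if (!actual && !predicted) c.copy(tn = c.tn + 1)
      else c.copy(fn = c.fn + 1)
    }
}
```

The `errorRate` here is the same quantity as `ratioWrong`; precision and recall add the breakdown that matters when delayed flights are much rarer than on-time ones.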

To Learn More:
• Download example code
– https://github.com/caroljmcdonald/sparkmldecisiontree
• Read explanation of example code
– https://www.mapr.com/blog/apache-spark-machine-learning-tutorial
• Engage with us!
– https://www.mapr.com/blog/author/carol-mcdonald
– https://community.mapr.com

Q&A
Engage with us!
• @mapr
• mapr-technologies
• https://www.mapr.com/blog/author/carol-mcdonald
