Email Classifier using Spark 1.3 MLlib / ML Pipeline
leoricklin@gmail.com
Inspired by: 201503 Email Classifier using Mahout on Hadoop
● Dataset from Apache SpamAssassin
o One file per email, with mail headers and HTML tags
o #spam = 501, #ham = 2501
● Output confusion matrix:
                Predicted spam   Predicted ham
  Actual spam   69 (TP)          1 (FN)
  Actual ham    1 (FP)           382 (TN)
  Recall = 98.5714%, Precision = 98.5714%, Accuracy = 99.5585%
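These figures follow directly from the confusion-matrix counts. As a small illustrative sketch (not part of the original deck), the helper below reproduces them in Scala:

// Illustrative helper: derive precision, recall, and accuracy
// from confusion-matrix counts.
def metrics(tp: Long, fp: Long, tn: Long, fn: Long): (Double, Double, Double) = {
  val precision = tp.toDouble / (tp + fp)
  val recall    = tp.toDouble / (tp + fn)
  val accuracy  = (tp + tn).toDouble / (tp + fp + tn + fn)
  (precision, recall, accuracy)
}
// Mahout result above: metrics(69, 1, 382, 1)
// => (0.985714..., 0.985714..., 0.995585...)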
Spam Sample
From 12a1mailbot1@web.de Thu Aug 22 13:17:22 2002
Return-Path: <12a1mailbot1@web.de>
Delivered-To: zzzz@localhost.spamassassin.taint.org
Received: from localhost (localhost [127.0.0.1])
by phobos.labs.spamassassin.taint.org (Postfix) with ESMTP id 136B943C32
for <zzzz@localhost>; Thu, 22 Aug 2002 08:17:21 -0400 (EDT)
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META content=3D"text/html; charset=3Dwindows-1252" http-equiv=3DContent-T=ype>
<META content=3D"MSHTML 5.00.2314.1000" name=3DGENERATOR></HEAD>
<BODY><!-- Inserted by Calypso -->
<TABLE border=3D0 cellPadding=3D0 cellSpacing=3D2 id=3D_CalyPrintHeader_ r=
<CENTER>Save up to 70% on Life Insurance.</CENTER></FONT><FONT color=3D#ff=
0000
face=3D"Copperplate Gothic Bold" size=3D5 PTSIZE=3D"10">
<CENTER>Why Spend More Than You Have To?
<CENTER><FONT color=3D#ff0000 face=3D"Copperplate Gothic Bold" size=3D5 PT=
SIZE=3D"10">
<CENTER>Life Quote Savings
(The raw file mixes email headers, HTML tags, and the email body; the "=3D" sequences are quoted-printable encoding, not corruption.)
Using Spark MLlib (1)
Tokenization, featurization, and training:

import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

val tf = new HashingTF(numFeatures = 100)
// One (file, text) pair per email
val spam = sc.wholeTextFiles("file:///home/leo/spam/20030228.spam")
val ham  = sc.wholeTextFiles("file:///home/leo/spam/20030228.easyham")
// Split on whitespace, hash each token list into a 100-dim term-frequency vector
val spamTrain = spam.map { case (file, text) => tf.transform(text.split(" ")) }
  .map(features => LabeledPoint(1, features))
val hamTrain = ham.map { case (file, text) => tf.transform(text.split(" ")) }
  .map(features => LabeledPoint(0, features))
val sampleData = spamTrain ++ hamTrain
sampleData.cache()
val trainData = sampleData.sample(false, 0.85, 707L)
val testData  = sampleData.sample(false, 0.15, 707L)
val lrLearner = new LogisticRegressionWithSGD()
val model = lrLearner.run(trainData)
#samples = 3002
#trainData = 2549 (#spam = 431, #ham = 2118)
#testData = 431 (#spam = 73, #ham = 358)
Baseline accuracy (always predict ham) = 83.0626% ( (0+358)/431 )
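Note that the two sample() calls above draw independently from the same RDD, so the train and test sets overlap. As a hedged alternative sketch (not in the original deck), randomSplit yields genuinely disjoint splits:

// randomSplit partitions an RDD into disjoint subsets by weight.
val Array(trainData2, testData2) =
  sampleData.randomSplit(Array(0.85, 0.15), seed = 707L)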
Using Spark MLlib (2)
Validation:

// Pair each test point's true label with the model's prediction
val validation = testData.map { lpoint => (lpoint.label, model.predict(lpoint.features)) }
// Encode each outcome as a TP/FP/TN/FN count vector, then sum them
val matrix = validation.map {
  case (1.0, 1.0) => Array(1, 0, 0, 0) // TP
  case (0.0, 1.0) => Array(0, 1, 0, 0) // FP
  case (0.0, 0.0) => Array(0, 0, 1, 0) // TN
  case (1.0, 0.0) => Array(0, 0, 0, 1) // FN
}.reduce { (ary1, ary2) =>
  Array(ary1(0)+ary2(0), ary1(1)+ary2(1), ary1(2)+ary2(2), ary1(3)+ary2(3))
}

matrix: Array[Int] = Array(37, 11, 347, 36) // TP, FP, TN, FN
Accuracy  = 89.0951% ( (37+347)/431 ) , vs. 99.5585% using Mahout
Precision = 77.0833% ( 37/(37+11) ) , vs. 98.5714% using Mahout
Recall    = 50.6849% ( 37/(37+36) ) , vs. 98.5714% using Mahout
Model Parameters
class org.apache.spark.mllib.feature.HashingTF
● val numFeatures: Int
  Number of features (default: 2^20)
class LogisticRegressionWithSGD
● val optimizer: GradientDescent
  The optimizer to solve the problem.
class GradientDescent
● def setNumIterations(iters: Int): GradientDescent.this.type
  Set the number of iterations for SGD.
● def setRegParam(regParam: Double): GradientDescent.this.type
  Set the regularization parameter.
How do we find the best combination of these parameters? (A hand-rolled search is sketched below; the ML Pipeline API automates it.)
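With the RDD-based API, tuning means writing the search loop yourself. A sketch under assumed names (evalAccuracy is an illustrative helper, not a Spark API):

import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithSGD}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Fraction of points the model labels correctly.
def evalAccuracy(model: LogisticRegressionModel, data: RDD[LabeledPoint]): Double =
  data.filter(p => model.predict(p.features) == p.label).count().toDouble / data.count()

// Exhaustively try (numIterations, regParam) pairs and keep the best.
val candidates = for {
  iters <- Seq(10, 20, 30, 50)
  reg   <- Seq(0.1, 0.01)
} yield {
  val learner = new LogisticRegressionWithSGD()
  learner.optimizer.setNumIterations(iters).setRegParam(reg)
  (iters, reg, evalAccuracy(learner.run(trainData), testData))
}
val best = candidates.maxBy(_._3)

Selecting parameters on the test set like this also leaks information; cross-validation folds, which the ML Pipeline's CrossValidator provides, avoid that. The next slides automate exactly this.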
ML Pipeline Concepts
Transformer
A feature transformer might take a dataset, read a column (e.g., text), convert it into a
new column (e.g., feature vectors), and output a new dataset with the mapped column appended.
A learning model might take a dataset, read the column containing feature vectors,
predict the label for each feature vector, and append the labels as a new column.
Estimator
An Estimator abstracts the concept of a learning algorithm, or any algorithm that fits or
trains on data.
Pipeline
A Pipeline is specified as a sequence of stages, and each stage is either a Transformer
or an Estimator. These stages are run in order, and the input dataset is modified as it
passes through each stage.
Spark 1.3.0 ML Programming Guide
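As a minimal sketch of that contract (variable names are illustrative; assumes DataFrames with "features" and "label" columns):

import org.apache.spark.ml.classification.LogisticRegression

val estimator = new LogisticRegression()   // Estimator
val model = estimator.fit(trainDF)         // fit() trains and returns a Model,
                                           // which is itself a Transformer
val scored = model.transform(testDF)       // transform() appends prediction columns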
Example ML Workflow
201503 (Spark Summit East) Practical Machine Learning Pipelines with MLlib
Using ML Pipeline (1)

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import sqlContext.implicits._ // for toDF()

case class Email(text: String)                       // unlabeled, for prediction
case class EmailLabeled(text: String, label: Double) // for training

val spamTrain = sc.wholeTextFiles("file:///home/leo/spam/20030228.spam").map {
  case (file, content) => EmailLabeled(content, 1.0) }
val hamTrain = sc.wholeTextFiles("file:///home/leo/spam/20030228.easyham").map {
  case (file, content) => EmailLabeled(content, 0.0) }
val sampleSet = (spamTrain ++ hamTrain).toDF()
sampleSet.cache()
// Same caveat as before: two sample() calls do not give disjoint splits
val trainSet = sampleSet.sample(false, 0.85, 100L)
val testSet  = sampleSet.sample(false, 0.15, 100L)
// Pipeline stages: text -> words -> hashed TF features -> logistic regression
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
// Cross-validate the whole pipeline over a grid of hyperparameters
val crossval = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .addGrid(lr.maxIter, Array(10, 20, 30, 50))
  .build()
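Note: this grid spans 3 × 2 × 4 = 24 parameter combinations; with the 3 folds set below, CrossValidator fits 24 × 3 = 72 models, then refits the best configuration on the full training set.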
#samples = 3002
#trainData = 2528 (#spam = 421, #ham = 2107)
#testData = 437 (#spam = 84, #ham = 353)
Baseline accuracy (always predict ham) = 80.7780% ( (0+353)/437 )
Using ML Pipeline (2)
import org.apache.spark.sql.Row

crossval.setEstimatorParamMaps(paramGrid).setNumFolds(3)
// Fit under cross-validation, then score the test set
val cvModel = crossval.fit(trainSet)
val validation = cvModel.transform(testSet)
// Count TP/FP/TN/FN from the (label, prediction) columns
val matrix = validation.select("label", "prediction").map {
  case Row(label: Double, prediction: Double) => (label, prediction) match {
    case (1.0, 1.0) => Array(1, 0, 0, 0) // TP
    case (0.0, 1.0) => Array(0, 1, 0, 0) // FP
    case (0.0, 0.0) => Array(0, 0, 1, 0) // TN
    case (1.0, 0.0) => Array(0, 0, 0, 1) // FN
  }
}.reduce { (ary1, ary2) =>
  Array(ary1(0)+ary2(0), ary1(1)+ary2(1), ary1(2)+ary2(2), ary1(3)+ary2(3))
}

matrix: Array[Int] = Array(84, 1, 352, 0) // TP, FP, TN, FN
Accuracy  = 99.7712% ( (84+352)/437 ) , vs. 99.5585% using Mahout
Precision = 98.8235% ( 84/(84+1) ) , vs. 98.5714% using Mahout
Recall    = 100% ( 84/(84+0) ) , vs. 98.5714% using Mahout
All in one: tokenization, featurization, model training, model validation, and prediction.
Best parameters found by cross-validation:
cvModel.bestModel.fittingParamMap = {
  LogisticRegression-3cb51fc7-maxIter: 20,
  HashingTF-cb518e45-numFeatures: 1000,
  LogisticRegression-3cb51fc7-regParam: 0.1 }
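The Email case class defined earlier supports the final step: scoring new, unlabeled messages. A hedged usage sketch (the "newmail" path is illustrative):

// Classify fresh emails with the cross-validated pipeline; no "label"
// column is needed at prediction time.
val newEmails = sc.wholeTextFiles("file:///home/leo/spam/newmail").map {
  case (file, content) => Email(content)
}.toDF()
cvModel.transform(newEmails).select("text", "prediction").show()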
Pipelines: Recap
201503 (Spark Summit East) Practical Machine Learning Pipelines with MLlib
