Email Classifier using Spark 1.3 MLlib / ML Pipeline
leoricklin@gmail.com
Inspired by
201503 Email Classifier using Mahout on Hadoop
● Dataset from Apache SpamAssassin
o One file per email, with mail headers and HTML tags
o #spam = 501, #ham = 2501
● Output of Confusion Matrix
Actual \ Predicted | spam  | ham
spam               | 69 TP | 1 FN
ham                | 1 FP  | 382 TN
Recall = 98.5714%, Precision = 98.5714%, Accuracy = 99.5585%
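The three figures follow from the four cells of the confusion matrix. A minimal Scala sketch of that arithmetic, assuming a hypothetical `Confusion` helper (the case class and its method names are illustrative and not part of the Mahout job):

```scala
// Hypothetical helper: derives the slide's metrics from the four cells.
final case class Confusion(tp: Int, fn: Int, fp: Int, tn: Int) {
  def total: Int = tp + fn + fp + tn
  def precision: Double = tp.toDouble / (tp + fp)   // TP / (TP + FP)
  def recall: Double = tp.toDouble / (tp + fn)      // TP / (TP + FN)
  def accuracy: Double = (tp + tn).toDouble / total // (TP + TN) / all
}

// Counts taken from the Mahout confusion matrix on this slide.
val mahout = Confusion(tp = 69, fn = 1, fp = 1, tn = 382)

// Format a ratio as a percentage with four decimals.
def pct(x: Double): String = f"${x * 100}%.4f%%"

println(s"Precision = ${pct(mahout.precision)}")
println(s"Recall    = ${pct(mahout.recall)}")
println(s"Accuracy  = ${pct(mahout.accuracy)}")
```

The same helper can be reused for the Spark results later in the deck by plugging in their counts.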
Spam Sample
From 12a1mailbot1@web.de Thu Aug 22 13:17:22 2002
Return-Path: <12a1mailbot1@web.de>
Delivered-To: zzzz@localhost.spamassassin.taint.org
Received: from localhost (localhost [127.0.0.1])
by phobos.labs.spamassassin.taint.org (Postfix) with ESMTP id 136B943C32
for <zzzz@localhost>; Thu, 22 Aug 2002 08:17:21 -0400 (EDT)
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META content=3D"text/html; charset=3Dwindows-1252" http-equiv=3DContent-T=ype>
<META content=3D"MSHTML 5.00.2314.1000" name=3DGENERATOR></HEAD>
<BODY><!-- Inserted by Calypso -->
<TABLE border=3D0 cellPadding=3D0 cellSpacing=3D2 id=3D_CalyPrintHeader_ r=
<CENTER>Save up to 70% on Life Insurance.</CENTER></FONT><FONT color=3D#ff=
0000
face=3D"Copperplate Gothic Bold" size=3D5 PTSIZE=3D"10">
<CENTER>Why Spend More Than You Have To?
<CENTER><FONT color=3D#ff0000 face=3D"Copperplate Gothic Bold" size=3D5 PT=
SIZE=3D"10">
<CENTER>Life Quote Savings
Email headers
HTML tags
Email body
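import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint

// Hash each space-separated token into one of 100 feature buckets
val tf = new HashingTF(numFeatures = 100)
val spam = sc.wholeTextFiles("file:///home/leo/spam/20030228.spam")
val ham = sc.wholeTextFiles("file:///home/leo/spam/20030228.easyham")
// Label spam as 1.0 and ham as 0.0
val spamTrain = spam.map { case (file, text) =>
  LabeledPoint(1.0, tf.transform(text.split(" "))) }
val hamTrain = ham.map { case (file, text) =>
  LabeledPoint(0.0, tf.transform(text.split(" "))) }
val sampleData = spamTrain ++ hamTrain
sampleData.cache()
// Note: these are two independent samples with the same seed, so the sets are not
// guaranteed to be disjoint; sampleData.randomSplit(Array(0.85, 0.15), 707L) would be
val trainData = sampleData.sample(false, 0.85, 707L)
val testData = sampleData.sample(false, 0.15, 707L)
val lrLearner = new LogisticRegressionWithSGD()
val model = lrLearner.run(trainData)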
Featurization
Using Spark MLlib (1)
#samples=3002
#trainData=2549
#spam=431, #ham=2118
#testData=431
#spam=73, #ham=358
AccuracyBase = 83.0626% ( (0+358)/431, i.e. predicting ham for every email )
Tokenization
training
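The featurization step relies on the hashing trick: each token is hashed into one of `numFeatures` buckets and the occurrences per bucket are counted. A rough sketch in plain Scala without Spark; `hashingTf` and `nonNegativeMod` here are illustrative helpers, and using `hashCode` as the hash function is an assumption, so the bucket indices need not match what Spark's `HashingTF` produces:

```scala
// Illustrative hashing-trick term frequency; not Spark's HashingTF itself.
def nonNegativeMod(x: Int, mod: Int): Int = {
  val raw = x % mod
  if (raw < 0) raw + mod else raw
}

// Returns bucket index -> term count, i.e. a sparse feature vector.
def hashingTf(tokens: Seq[String], numFeatures: Int): Map[Int, Double] =
  tokens
    .groupBy(t => nonNegativeMod(t.hashCode, numFeatures))
    .map { case (bucket, ts) => bucket -> ts.size.toDouble }

// Same naive tokenization as the slide: split on single spaces.
val tokens = "Save up to 70% on Life Insurance".split(" ").toSeq
val features = hashingTf(tokens, numFeatures = 100)
println(features.toSeq.sortBy(_._1))
```

With only 100 buckets, distinct tokens (including header fields and HTML tags) can collide in the same bucket, which is one plausible reason a larger `numFeatures` is worth trying.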
Using Spark MLlib (2)
val validation = testData.map{ lpoint => (lpoint.label, model.predict(lpoint.features)) }
val matrix = validation.map{
ret => ret match {
case (1.0, 1.0) => Array(1, 0, 0, 0) // TP
case (0.0, 1.0) => Array(0, 1, 0, 0) // FP
case (0.0, 0.0) => Array(0, 0, 1, 0) // TN
case (1.0, 0.0) => Array(0, 0, 0, 1) // FN
}
}.reduce{
(ary1, ary2) => Array(ary1(0)+ary2(0), ary1(1)+ary2(1), ary1(2)+ary2(2), ary1(3)+ary2(3))
}
matrix:Array[Int] = Array(37TP, 11FP, 347TN, 36FN)
Accuracy = 89.0951% ( (37+347)/431 ) , vs. 99.5585% using Mahout
Precision = 77.0833% ( 37/(37+11) ) , vs. 98.5714% using Mahout
Recall = 50.6849% ( 37/(37+36) ) , vs. 98.5714% using Mahout
validation
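The map/reduce above turns every (label, prediction) pair into a one-hot array and sums the arrays element-wise. The same tally on a local collection, as a sketch; `tally` and the sample pairs are made up for illustration only:

```scala
// Order of the counts matches the slide: Array(TP, FP, TN, FN).
def tally(pairs: Seq[(Double, Double)]): Array[Int] =
  pairs.map {
    case (1.0, 1.0) => Array(1, 0, 0, 0) // TP
    case (0.0, 1.0) => Array(0, 1, 0, 0) // FP
    case (0.0, 0.0) => Array(0, 0, 1, 0) // TN
    case (1.0, 0.0) => Array(0, 0, 0, 1) // FN
    case other      => sys.error(s"unexpected (label, prediction): $other")
  }.foldLeft(Array(0, 0, 0, 0)) { (acc, one) =>
    acc.zip(one).map { case (a, b) => a + b }
  }

// Made-up (label, prediction) pairs, only to exercise the function.
val sample = Seq((1.0, 1.0), (1.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 1.0))
val counts = tally(sample)
println(counts.mkString("Array(", ", ", ")"))
```

Folding from a zero array also covers an empty input, where `reduce` on a Scala collection would throw.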
Model Parameters
class org.apache.spark.mllib.feature.HashingTF
● val numFeatures: Int
number of features (default: 2^20)
class LogisticRegressionWithSGD
● val optimizer: GradientDescent
The optimizer to solve the problem.
class GradientDescent
● def setNumIterations(iters: Int): GradientDescent.this.type
Set the number of iterations for SGD.
● def setRegParam(regParam: Double): GradientDescent.this.type
Set the regularization parameter.
How to find the best combination of each parameter?
ML Pipeline Concepts
Transformer
A feature transformer might take a dataset, read a column (e.g., text), convert it into a
new column (e.g., feature vectors)
A learning model might take a dataset, read the column containing feature vectors,
predict the label for each feature vector, append the labels as a new column.
Estimators
An Estimator abstracts the concept of a learning algorithm or any algorithm which fits or
trains on data.
Pipeline
A Pipeline is specified as a sequence of stages, and each stage is either a Transformer
or an Estimator. These stages are run in order, and the input dataset is modified as it
passes through each stage.
Spark 1.3.0 ML Programming Guide
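To make the three concepts concrete, here is a conceptual sketch in plain Scala. The traits, the `Dataset` alias, and both stages are illustrative stand-ins and not the `spark.ml` API; the "model" is a deliberately trivial word-count threshold, and encoding stages as `Either` is just a convenience for the sketch:

```scala
// Conceptual sketch only: these types are illustrative, not the spark.ml API.
type Dataset = Seq[Map[String, Any]]

trait Transformer { def transform(ds: Dataset): Dataset }
trait Estimator { def fit(ds: Dataset): Transformer }

// Transformer: reads the "text" column and appends a "words" column.
object WhitespaceTokenizer extends Transformer {
  def transform(ds: Dataset): Dataset =
    ds.map(row => row + ("words" -> row("text").toString.split(" ").toSeq))
}

// Estimator: "learns" the mean word count and returns a Transformer (the
// model) that appends a "prediction" column.
object LengthThreshold extends Estimator {
  def fit(ds: Dataset): Transformer = {
    val lengths = ds.map(_("words").asInstanceOf[Seq[String]].size)
    val mean = lengths.sum.toDouble / lengths.size
    new Transformer {
      def transform(in: Dataset): Dataset = in.map { row =>
        val n = row("words").asInstanceOf[Seq[String]].size
        row + ("prediction" -> (if (n > mean) 1.0 else 0.0))
      }
    }
  }
}

// Pipeline: run the stages in order; each Estimator is fitted on the data
// as transformed by the stages before it.
def fitPipeline(stages: Seq[Either[Transformer, Estimator]],
                train: Dataset): Seq[Transformer] = {
  val (fitted, _) = stages.foldLeft((Vector.empty[Transformer], train)) {
    case ((done, ds), stage) =>
      val t = stage match {
        case Left(transformer) => transformer
        case Right(estimator)  => estimator.fit(ds)
      }
      (done :+ t, t.transform(ds))
  }
  fitted
}

val train: Dataset = Seq(
  Map("text" -> "save up to 70% on life insurance"),
  Map("text" -> "see you tomorrow"))
val model = fitPipeline(Seq(Left(WhitespaceTokenizer), Right(LengthThreshold)), train)
val scored = model.foldLeft(train)((ds, t) => t.transform(ds))
println(scored.map(_("prediction")))
```

The fitted pipeline is itself just a sequence of Transformers, which is why the same object can be applied to unseen data.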
Example ML Workflow
201503 (Spark Summit East) Practical Machine Learning Pipelines with MLlib
Using ML Pipeline (1)
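import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.Row
import sqlContext.implicits._ // needed for toDF(); spark-shell provides sqlContext

case class Email(text: String)
case class EmailLabeled(text: String, label: Double)
// Label spam as 1.0 and ham as 0.0
val spamTrain = sc.wholeTextFiles("file:///home/leo/spam/20030228.spam").map {
case (file, content) => EmailLabeled(content, 1.0) }
val hamTrain = sc.wholeTextFiles("file:///home/leo/spam/20030228.easyham").map {
case (file, content) => EmailLabeled(content, 0.0) }
val sampleSet = (spamTrain ++ hamTrain).toDF()
sampleSet.cache()
// Note: both samples use the same seed, so the train and test sets are not
// guaranteed to be disjoint
val trainSet = sampleSet.sample(false, 0.85, 100L)
val testSet = sampleSet.sample(false, 0.15, 100L)
// Stages: text -> words -> hashed term frequencies -> logistic regression
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages( Array(tokenizer, hashingTF, lr) )
val crossval = new CrossValidator().setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator)
// Candidate values for cross-validation: 3 x 2 x 4 = 24 combinations
val paramGrid = new ParamGridBuilder(
).addGrid( hashingTF.numFeatures, Array(10, 100, 1000)
).addGrid( lr.regParam, Array(0.1, 0.01)
).addGrid( lr.maxIter, Array(10, 20, 30, 50)
).build()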
#samples=3002
#trainData=2528
#spam=421, #ham=2107
#testData=437
#spam=84, #ham=353
AccuracyBase = 80.7780% ( (0+353)/437 )
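The grid passed to `ParamGridBuilder` on this slide is the cross product of the listed values. A small plain-Scala sketch that enumerates it; the `Combo` case class is hypothetical and only stands in for the parameter maps Spark builds:

```scala
// Illustrative enumeration of the grid from the slide; Combo is a made-up
// stand-in for one parameter map.
final case class Combo(numFeatures: Int, regParam: Double, maxIter: Int)

val grid: Seq[Combo] = for {
  nf  <- Seq(10, 100, 1000)   // hashingTF.numFeatures
  reg <- Seq(0.1, 0.01)       // lr.regParam
  it  <- Seq(10, 20, 30, 50)  // lr.maxIter
} yield Combo(nf, reg, it)

println(s"${grid.size} combinations")
grid.take(3).foreach(println)
```

With the 3 folds set on the next slide, cross-validation would train roughly 24 × 3 = 72 models, so the grid size directly drives the running time.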
Using ML Pipeline (2)
crossval.setEstimatorParamMaps(paramGrid).setNumFolds(3)
val cvModel = crossval.fit(trainSet)
val validation = cvModel.transform(testSet)
val matrix = validation.select("label","prediction").map{
case Row(label: Double, prediction: Double) => (label, prediction) match {
case (1.0, 1.0) => Array(1, 0, 0, 0) // TP
case (0.0, 1.0) => Array(0, 1, 0, 0) // FP
case (0.0, 0.0) => Array(0, 0, 1, 0) // TN
case (1.0, 0.0) => Array(0, 0, 0, 1) // FN
}
}.reduce{
(ary1, ary2) => Array(ary1(0)+ary2(0), ary1(1)+ary2(1), ary1(2)+ary2(2), ary1(3)+ary2(3))
}
matrix:Array[Int] = Array(84TP, 1FP, 352TN, 0FN)
Accuracy = 99.7712% ( (84+352)/437 ) , vs. 99.5585% using Mahout
Precision = 98.8235% ( 84/(84+1) ) , vs. 98.5714% using Mahout
Recall = 100% ( 84/(84+0) ) , vs. 98.5714% using Mahout
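`setNumFolds(3)` asks the cross-validator to evaluate each parameter combination on three train/validation splits of the training set. The splitting idea can be sketched over row indices in plain Scala; `kFold` is an illustrative helper and makes no claim about how `CrossValidator` partitions the data internally:

```scala
// Illustrative k-fold split over row indices; not CrossValidator's internals.
// Returns one (trainingIndices, validationIndices) pair per fold.
def kFold(n: Int, k: Int, seed: Long): Seq[(Seq[Int], Seq[Int])] = {
  val shuffled = new scala.util.Random(seed).shuffle((0 until n).toVector)
  (0 until k).map { fold =>
    // Every k-th position of the shuffled order goes to this fold's validation set.
    val (validation, training) =
      shuffled.zipWithIndex.partition { case (_, pos) => pos % k == fold }
    (training.map(_._1), validation.map(_._1))
  }
}

val folds = kFold(n = 12, k = 3, seed = 100L)
folds.foreach { case (train, valid) =>
  println(s"train=${train.size} validation=${valid.size}")
}
```

Each row lands in exactly one validation fold, so every training example is used for validation once and for training in the other folds.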
All in One
Tokenization, Featurization, Model Training, Model Validation, and Prediction
cvModel.bestModel.fittingParamMap = {
LogisticRegression-3cb51fc7-maxIter: 20,
HashingTF-cb518e45-numFeatures: 1000,
LogisticRegression-3cb51fc7-regParam: 0.1 }
Pipelines: Recap
201503 (Spark Summit East) Practical Machine Learning Pipelines with MLlib


Email Classifier using Spark 1.3 MLlib / ML Pipeline

  • 1. Email Classifier using Spark 1.3 MLlib / ML Pipeline
       leoricklin@gmail.com
  • 2. Inspired by 201503 Email Classifier using Mahout on Hadoop
       ● Dataset from Apache SpamAssassin
         o One file per email, with mail headers and HTML tags
         o #spam = 501, #ham = 2501
       ● Output of Confusion Matrix

         Actual \ Predicted    spam       ham
         spam                  69 (TP)    1 (FN)
         ham                   1 (FP)     382 (TN)

         Recall = 98.5714%, Precision = 98.5714%, Accuracy = 99.5585%
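The three percentages on this slide follow directly from the four confusion-matrix cells. As a minimal sketch, assuming spam is the positive class (the `ConfusionMetrics` object and `Counts` class are hypothetical names introduced here, not part of the deck), the arithmetic can be reproduced in plain Scala:

```scala
object ConfusionMetrics {
  // Cells of a binary confusion matrix, with spam as the positive class.
  final case class Counts(tp: Long, fp: Long, tn: Long, fn: Long) {
    def total: Long = tp + fp + tn + fn
    // Share of all emails that were classified correctly.
    def accuracy: Double = (tp + tn).toDouble / total
    // Share of predicted spam that really is spam (NaN if nothing was predicted spam).
    def precision: Double = tp.toDouble / (tp + fp)
    // Share of actual spam that was caught (NaN if there is no actual spam).
    def recall: Double = tp.toDouble / (tp + fn)
  }

  def main(args: Array[String]): Unit = {
    // Cell values taken from the Mahout confusion matrix on this slide.
    val mahout = Counts(tp = 69, fp = 1, tn = 382, fn = 1)
    println(f"accuracy=${mahout.accuracy * 100}%.4f%% " +
      f"precision=${mahout.precision * 100}%.4f%% " +
      f"recall=${mahout.recall * 100}%.4f%%")
  }
}
```

The same `Counts` values can be swapped for the Spark results on the later slides to check their percentages.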
  • 3. Spam Sample
       From 12a1mailbot1@web.de Thu Aug 22 13:17:22 2002
       Return-Path: <12a1mailbot1@web.de>
       Delivered-To: zzzz@localhost.spamassassin.taint.org
       Received: from localhost (localhost [127.0.0.1])
       by phobos.labs.spamassassin.taint.org (Postfix) with ESMTP id 136B943C32
       for <zzzz@localhost>; Thu, 22 Aug 2002 08:17:21 -0400 (EDT)
       <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
       <HTML><HEAD>
       <META content=3D"text/html; charset=3Dwindows-1252" http-equiv=3DContent-T=ype>
       <META content=3D"MSHTML 5.00.2314.1000" name=3DGENERATOR></HEAD>
       <BODY><!-- Inserted by Calypso -->
       <TABLE border=3D0 cellPadding=3D0 cellSpacing=3D2 id=3D_CalyPrintHeader_ r=
       <CENTER>Save up to 70% on Life Insurance.</CENTER></FONT><FONT color=3D#ff=
       0000
       face=3D"Copperplate Gothic Bold" size=3D5 PTSIZE=3D"10">
       <CENTER>Why Spend More Than You Have To?
       <CENTER><FONT color=3D#ff0000 face=3D"Copperplate Gothic Bold" size=3D5 PT=
       SIZE=3D"10">
       <CENTER>Life Quote Savings

       Slide callouts: Email headers, HTML tags, Email body
  • 4. Using Spark MLlib (1)
       import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
       import org.apache.spark.mllib.feature.HashingTF
       import org.apache.spark.mllib.regression.LabeledPoint

       // Tokenization (split on spaces) and featurization (hashed term frequencies)
       val tf = new HashingTF(numFeatures = 100)
       val spam = sc.wholeTextFiles("file:///home/leo/spam/20030228.spam")
       val ham = sc.wholeTextFiles("file:///home/leo/spam/20030228.easyham")
       val spamTrain = spam.map{ case (file, text) => tf.transform(text.split(" "))
         }.map(features => LabeledPoint( 1, features))   // label 1 = spam
       val hamTrain = ham.map{ case (file, text) => tf.transform(text.split(" "))
         }.map(features => LabeledPoint( 0, features))   // label 0 = ham
       val sampleData = spamTrain ++ hamTrain
       sampleData.cache()
       val trainData = sampleData.sample(false, 0.85, 707L)
       val testData = sampleData.sample(false, 0.15, 707L)

       // Training
       val lrLearner = new LogisticRegressionWithSGD()
       val model = lrLearner.run(trainData)

       #samples=3002
       #trainData=2549 (#spam=431, #ham=2118)
       #testData=431 (#spam=73, #ham=358)
       AccuracyBase = 83.0626% ( (0+358)/431 )
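Two details of this slide lend themselves to a small worked example. First, AccuracyBase is the accuracy of always predicting the majority class (ham): (0+358)/431. Second, the two `sample` calls are separate draws, so the resulting sets are not guaranteed to be disjoint or complementary; the reported sizes (2549 + 431 = 2980) do not add up to the 3002 samples. The sketch below, in plain Scala without Spark, shows one way to get a disjoint split and to compute the baseline. `SplitAndBaseline`, `disjointSplit`, and `majorityBaseline` are hypothetical helper names, not part of the deck; on an RDD, `randomSplit(Array(0.85, 0.15), seed)` is the Spark call that returns complementary parts.

```scala
import scala.util.Random

object SplitAndBaseline {
  // Shuffle once with a fixed seed, then cut: every item lands in exactly one part.
  def disjointSplit[A](items: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
    require(trainFraction >= 0.0 && trainFraction <= 1.0, "fraction must be in [0, 1]")
    val shuffled = new Random(seed).shuffle(items)
    val cut = math.round(items.size * trainFraction).toInt
    shuffled.splitAt(cut)
  }

  // Accuracy obtained by always predicting the most frequent label.
  def majorityBaseline(labels: Seq[Double]): Double = {
    require(labels.nonEmpty, "need at least one label")
    val largestClass = labels.groupBy(identity).values.map(_.size).max
    largestClass.toDouble / labels.size
  }

  def main(args: Array[String]): Unit = {
    // Test-set composition from the slide: 73 spam (1.0) and 358 ham (0.0).
    val testLabels = Seq.fill(73)(1.0) ++ Seq.fill(358)(0.0)
    println(f"baseline accuracy = ${majorityBaseline(testLabels) * 100}%.4f%%")

    val (train, test) = disjointSplit(1 to 3002, trainFraction = 0.85, seed = 707L)
    println(s"train=${train.size} test=${test.size}")
  }
}
```

A disjoint split matters because any test email that also appears in the training set makes the reported accuracy look better than it would on unseen mail.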
  • 5. Using Spark MLlib (2)
       // Validation: pair each actual label with the model's prediction
       val validation = testData.map{ lpoint => (lpoint.label, model.predict(lpoint.features)) }
       // Count each (label, prediction) pair into Array(TP, FP, TN, FN)
       val matrix = validation.map{ ret => ret match {
         case (1.0, 1.0) => Array(1, 0, 0, 0) // TP
         case (0.0, 1.0) => Array(0, 1, 0, 0) // FP
         case (0.0, 0.0) => Array(0, 0, 1, 0) // TN
         case (1.0, 0.0) => Array(0, 0, 0, 1) // FN
       } }.reduce{ (ary1, ary2) =>
         Array(ary1(0)+ary2(0), ary1(1)+ary2(1), ary1(2)+ary2(2), ary1(3)+ary2(3)) }

       matrix:Array[Int] = Array(37TP, 11FP, 347TN, 36FN)
       Accuracy = 89.0951% ( (37+347)/431 ) , vs. 99.5585% using Mahout
       Precision = 77.0833% ( 37/(37+11) ) , vs. 98.5714% using Mahout
       Recall = 50.6849% ( 37/(37+36) ) , vs. 98.5714% using Mahout
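The map-then-reduce tally on this slide can also be written as a single fold over (label, prediction) pairs. The following is a minimal sketch on a local `Seq` rather than an RDD; `ConfusionTally` and `Counts` are hypothetical names. Unlike the slide's `match`, it includes a catch-all case, so a pair other than 0.0/1.0 is skipped instead of raising a `MatchError`.

```scala
object ConfusionTally {
  // Running totals, in the same order as the slide's array: TP, FP, TN, FN.
  final case class Counts(tp: Int = 0, fp: Int = 0, tn: Int = 0, fn: Int = 0)

  // Each pair is (actual label, predicted label); 1.0 = spam, 0.0 = ham.
  def tally(pairs: Seq[(Double, Double)]): Counts =
    pairs.foldLeft(Counts()) {
      case (c, (1.0, 1.0)) => c.copy(tp = c.tp + 1) // spam predicted as spam
      case (c, (0.0, 1.0)) => c.copy(fp = c.fp + 1) // ham predicted as spam
      case (c, (0.0, 0.0)) => c.copy(tn = c.tn + 1) // ham predicted as ham
      case (c, (1.0, 0.0)) => c.copy(fn = c.fn + 1) // spam predicted as ham
      case (c, _)          => c                     // unexpected values are ignored
    }

  def main(args: Array[String]): Unit = {
    val pairs = Seq((1.0, 1.0), (0.0, 1.0), (0.0, 0.0), (0.0, 0.0), (1.0, 0.0))
    println(tally(pairs))
  }
}
```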
  • 6. Model Parameters
       class org.apache.spark.mllib.feature.HashingTF
       ● val numFeatures: Int
         number of features (default: 2^20)
       class LogisticRegressionWithSGD
       ● val optimizer: GradientDescent
         The optimizer to solve the problem.
       class GradientDescent
       ● def setNumIterations(iters: Int): GradientDescent.this.type
         Set the number of iterations for SGD.
       ● def setRegParam(regParam: Double): GradientDescent.this.type
         Set the regularization parameter.
       How to find the best combination of each parameter?
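To see why `numFeatures` matters, it helps to look at the hashing trick itself: each token is hashed to a bucket index in [0, numFeatures), and the feature vector holds per-bucket counts. With only 100 buckets, as on slide 4, many distinct tokens share a bucket. The sketch below is an illustration in plain Scala, not MLlib's implementation: it uses `String.hashCode` as the hash function, which is an assumption, and `HashingSketch` is a hypothetical name.

```scala
object HashingSketch {
  // Map any Int (including negative hash codes) into the range [0, mod).
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val r = x % mod
    if (r < 0) r + mod else r
  }

  // Sparse term-frequency vector: bucket index -> number of tokens hashed into it.
  def termFrequencies(tokens: Seq[String], numFeatures: Int): Map[Int, Double] = {
    require(numFeatures > 0, "numFeatures must be positive")
    tokens
      .groupBy(token => nonNegativeMod(token.hashCode, numFeatures))
      .map { case (bucket, hits) => bucket -> hits.size.toDouble }
  }

  def main(args: Array[String]): Unit = {
    val tokens = "Save up to 70% on Life Insurance. Why Spend More Than You Have To?".split(" ").toSeq
    // Fewer buckets means more collisions, so fewer distinct non-zero features.
    println(s"10 buckets:   ${termFrequencies(tokens, 10).size} non-zero features")
    println(s"1000 buckets: ${termFrequencies(tokens, 1000).size} non-zero features")
  }
}
```

Collisions merge unrelated tokens into one feature, which is one plausible reason a larger `numFeatures` can help the classifier.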
  • 7. ML Pipeline Concepts
       Transformer
         A feature transformer might take a dataset, read a column (e.g., text), and convert it into a new column (e.g., feature vectors).
         A learning model might take a dataset, read the column containing feature vectors, predict the label for each feature vector, and append the labels as a new column.
       Estimator
         An Estimator abstracts the concept of a learning algorithm or any algorithm which fits or trains on data.
       Pipeline
         A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input dataset is modified as it passes through each stage.
       Source: Spark 1.3.0 ML Programming Guide
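The three concepts can be illustrated with a toy model in plain Scala. This is a sketch of the idea only, not Spark's API: datasets are lists of column-name-to-value maps, and every name here (`PipelineSketch`, `WhitespaceTokenizer`, `MajorityLabel`, `fitPipeline`) is hypothetical.

```scala
object PipelineSketch {
  type Row = Map[String, Any]
  type Dataset = Seq[Row]

  // A Transformer appends or rewrites columns; an Estimator learns from data and yields a Transformer.
  trait Transformer { def transform(data: Dataset): Dataset }
  trait Estimator { def fit(data: Dataset): Transformer }

  // Feature transformer: reads the "text" column and appends a "words" column.
  object WhitespaceTokenizer extends Transformer {
    def transform(data: Dataset): Dataset =
      data.map(row => row + ("words" -> row("text").toString.toLowerCase.split("\\s+").toSeq))
  }

  // Toy estimator: learns the most frequent "label"; the fitted model appends it as "prediction".
  object MajorityLabel extends Estimator {
    def fit(data: Dataset): Transformer = {
      val majority = data.groupBy(row => row("label")).maxBy(_._2.size)._1
      new Transformer {
        def transform(d: Dataset): Dataset = d.map(row => row + ("prediction" -> majority))
      }
    }
  }

  // Run the stages in order; each Estimator is fitted on the data as transformed so far.
  def fitPipeline(stages: Seq[Either[Transformer, Estimator]], training: Dataset): Seq[Transformer] = {
    val (fitted, _) = stages.foldLeft((Vector.empty[Transformer], training)) {
      case ((done, data), stage) =>
        val transformer = stage match {
          case Left(t)  => t
          case Right(e) => e.fit(data)
        }
        (done :+ transformer, transformer.transform(data))
    }
    fitted
  }

  // Applying a fitted pipeline means applying each fitted stage in order.
  def applyPipeline(fitted: Seq[Transformer], data: Dataset): Dataset =
    fitted.foldLeft(data)((d, t) => t.transform(d))
}
```

After fitting, every stage is a Transformer, which is why a fitted pipeline can itself be applied to new data like any other Transformer.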
  • 8. Example ML Workflow
       Source: 201503 (Spark Summit East) Practical Machine Learning Pipelines with MLlib
  • 9. Using ML Pipeline (1)
       import org.apache.spark.ml.Pipeline
       import org.apache.spark.ml.classification.LogisticRegression
       import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
       import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
       import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
       import sqlContext.implicits._   // needed for toDF(); assumes the spark-shell's sqlContext

       case class Email(text:String)
       case class EmailLabeled(text:String, label:Double)
       val spamTrain = sc.wholeTextFiles("file:///home/leo/spam/20030228.spam").map {
         case (file, content) => EmailLabeled(content, 1.0)}
       val hamTrain = sc.wholeTextFiles("file:///home/leo/spam/20030228.easyham").map {
         case (file, content) => EmailLabeled(content, 0.0) }
       val sampleSet = (spamTrain ++ hamTrain).toDF()
       sampleSet.cache()
       val trainSet = sampleSet.sample(false, 0.85, 100L)
       val testSet = sampleSet.sample(false, 0.15, 100L)

       // Pipeline stages: tokenize, hash to term frequencies, fit logistic regression
       val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
       val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")
       val lr = new LogisticRegression().setMaxIter(10)
       val pipeline = new Pipeline().setStages( Array(tokenizer, hashingTF, lr) )
       val crossval = new CrossValidator().setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator)
       // Parameter combinations for the cross-validator to try
       val paramGrid = new ParamGridBuilder(
         ).addGrid( hashingTF.numFeatures, Array(10, 100, 1000)
         ).addGrid( lr.regParam, Array(0.1, 0.01)
         ).addGrid( lr.maxIter, Array(10, 20, 30, 50)
         ).build()

       #samples=3002
       #trainData=2528 (#spam=421, #ham=2107)
       #testData=437 (#spam=84, #ham=353)
       AccuracyBase = 80.7780% ( (0+353)/437 )
  • 10. Using ML Pipeline (2)
       All in one: Tokenization, Featurization, Model Training, Model Validation, and Prediction

       import org.apache.spark.sql.Row

       crossval.setEstimatorParamMaps(paramGrid).setNumFolds(3)
       val cvModel = crossval.fit(trainSet)
       val validation = cvModel.transform(testSet)
       // Count each (label, prediction) pair into Array(TP, FP, TN, FN)
       val matrix = validation.select("label","prediction").map{
         case Row(label: Double, prediction: Double) => (label, prediction) match {
           case (1.0, 1.0) => Array(1, 0, 0, 0) // TP
           case (0.0, 1.0) => Array(0, 1, 0, 0) // FP
           case (0.0, 0.0) => Array(0, 0, 1, 0) // TN
           case (1.0, 0.0) => Array(0, 0, 0, 1) // FN
         } }.reduce{ (ary1, ary2) =>
         Array(ary1(0)+ary2(0), ary1(1)+ary2(1), ary1(2)+ary2(2), ary1(3)+ary2(3)) }

       matrix:Array[Int] = Array(84TP, 1FP, 352TN, 0FN)
       Accuracy = 99.7712% ( (84+352)/437 ) , vs. 99.5585% using Mahout
       Precision = 98.8235% ( 84/(84+1) ) , vs. 98.5714% using Mahout
       Recall = 100% ( 84/(84+0) ) , vs. 98.5714% using Mahout

       cvModel.bestModel.fittingParamMap = {
         LogisticRegression-3cb51fc7-maxIter: 20,
         HashingTF-cb518e45-numFeatures: 1000,
         LogisticRegression-3cb51fc7-regParam: 0.1 }
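The size of the search on these two slides can be worked out by hand: the grid is the Cartesian product of 3 `numFeatures` values, 2 `regParam` values, and 4 `maxIter` values, and each combination is fitted once per fold. The sketch below enumerates the same grid in plain Scala to make the count explicit; `GridSketch` and `Params` are hypothetical names, and no Spark classes are involved.

```scala
object GridSketch {
  // One point in the search space, using the parameter names from the slides.
  final case class Params(numFeatures: Int, regParam: Double, maxIter: Int)

  // Cartesian product of the three value lists passed to ParamGridBuilder on slide 9.
  val grid: Seq[Params] = for {
    numFeatures <- Seq(10, 100, 1000)
    regParam    <- Seq(0.1, 0.01)
    maxIter     <- Seq(10, 20, 30, 50)
  } yield Params(numFeatures, regParam, maxIter)

  // Folds from setNumFolds(3): each combination is trained once per fold.
  val numFolds = 3
  val fitsDuringSearch: Int = grid.size * numFolds

  def main(args: Array[String]): Unit =
    println(s"${grid.size} combinations, $fitsDuringSearch pipeline fits during the search")
}
```

The reported best combination (maxIter 20, numFeatures 1000, regParam 0.1) is one of these 24 grid points; since 1000 is the largest `numFeatures` value tried, extending the grid upward might be worth testing.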
  • 11. Pipelines: Recap
       Source: 201503 (Spark Summit East) Practical Machine Learning Pipelines with MLlib