Multiclassification
with Decision Tree
in Spark MLlib 1.3
References
● 201504 Advanced Analytics with Spark
● 201409 (National Tsing Hua University) Data Mining and Big Data Analytics
● Apache Spark MLlib API
● UCI Machine Learning Repository
A simple Decision Tree
Figure 4-1. Decision tree: Is it spoiled?
LabeledPoint
The Spark MLlib abstraction for a feature vector is known
as a LabeledPoint, which consists of a Spark MLlib Vector
of features, and a target value, here called the label.
1-of-n Encoding
LabeledPoint can be used with categorical features, with appropriate encoding.
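As a minimal sketch of what 1-of-n encoding means, the plain-Scala helper below turns a category index into n binary features. The object name `OneHotSketch` and the function `oneHot` are hypothetical names chosen for illustration; they are not part of MLlib.

```scala
object OneHotSketch {
  // Encode category `index` (0-based) out of `n` possible values
  // as n binary features, exactly one of which is 1.0.
  def oneHot(index: Int, n: Int): Array[Double] = {
    require(index >= 0 && index < n, s"index $index out of range for $n categories")
    Array.tabulate(n)(i => if (i == index) 1.0 else 0.0)
  }
}
```

For example, `OneHotSketch.oneHot(2, 4)` yields `Array(0.0, 0.0, 1.0, 0.0)`, the same shape as the 4 binary Wilderness_Area columns in the data set used in the following slides.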
Covtype Dataset
The data set records the types of forest covering parcels of
land in Colorado, USA.
Name | Data Type | Measurement
Elevation | quantitative | meters
Aspect | quantitative | azimuth
Slope | quantitative | degrees
Horizontal_Distance_To_Hydrology | quantitative | meters
Vertical_Distance_To_Hydrology | quantitative | meters
Horizontal_Distance_To_Roadways | quantitative | meters
Hillshade_9am | quantitative | 0 to 255 index
Hillshade_Noon | quantitative | 0 to 255 index
Hillshade_3pm | quantitative | 0 to 255 index
Horizontal_Distance_To_Fire_Points | quantitative | meters
Wilderness_Area (4 binary columns) | qualitative | 0 or 1
Soil_Type (40 binary columns) | qualitative | 0 or 1
Cover_Type (7 types) | integer | 1 to 7
2596,51,3,258,0,510,221,232,148,6279, 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0 ,5
2590,56,2,212,-6,390,220,235,151,6225, 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0 ,5
A First Decision Tree (1)
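import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Load the raw CSV lines (sc is the SparkContext provided by spark-shell)
val rawData = sc.textFile("hdfs:///user/ds/covtype.data")
val data = rawData.map { line =>
  val values = line.split(',').map(_.toDouble)
  // Every column except the last is a feature
  val featureVector = Vectors.dense(values.init)
  // Cover_Type is 1 to 7; shift it to 0 to 6
  val label = values.last - 1
  LabeledPoint(label, featureVector)
}
// 80% training, 10% cross-validation (CV), 10% test
val Array(trainData, cvData, testData) =
  data.randomSplit(Array(0.8, 0.1, 0.1))
trainData.cache(); cvData.cache(); testData.cache()
Note: for classification, labels should take values {0, 1, ..., numClasses-1}.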
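To make the label shift concrete, here is a small Spark-free sketch that parses a single CSV row and checks that the resulting label falls in {0, 1, ..., numClasses-1}. `CovtypeRowSketch` and `parseLine` are hypothetical names, and the range check is an addition for illustration rather than something the slides perform.

```scala
object CovtypeRowSketch {
  // Split one CSV row into (label, features).
  // The last column is Cover_Type, which takes values 1 to numClasses.
  def parseLine(line: String, numClasses: Int = 7): (Double, Array[Double]) = {
    val values = line.split(',').map(_.trim.toDouble)
    val label = values.last - 1 // shift 1..numClasses to 0..numClasses-1
    require(label >= 0 && label < numClasses,
      s"label $label outside 0 to ${numClasses - 1}")
    (label, values.init)
  }
}
```

With a shortened, made-up row such as `"2596,51,3,1,0,5"`, `parseLine` returns the label `4.0` and the five remaining values as features.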
A First Decision Tree (2)
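import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.tree.DecisionTree

// Arguments: numClasses = 7, no categorical features (empty map),
// impurity = "gini", maxDepth = 4, maxBins = 100
val model = DecisionTree.trainClassifier(
  trainData, 7, Map[Int,Int](), "gini", 4, 100)
// Pair each prediction with the true label on the CV set
val predictionsAndLabels = cvData.map(example =>
  (model.predict(example.features), example.label))
val metrics = new MulticlassMetrics(predictionsAndLabels)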
A First Decision Tree (3)
println(model.toDebugString)
DecisionTreeModel classifier of depth 4 with 31 nodes
  If (feature 0 <= 3046.0)
   If (feature 0 <= 2497.0)
    If (feature 3 <= 0.0)
     If (feature 12 <= 0.0)
      Predict: 3.0
     Else (feature 12 > 0.0)
      Predict: 5.0
...

metrics.precision
= 0.6996101063190258

metrics.confusionMatrix (rows: actual class, columns: predicted class)
         cat0     cat1    cat2   cat3  cat4  cat5   cat6
cat0  14248.0   6615.0     5.0    0.0   0.0   1.0  422.0
cat1   5556.0  22440.0   355.0   19.0   0.0   4.0   41.0
cat2      0.0    452.0  3050.0   74.0   0.0  14.0    0.0
cat3      0.0      0.0   163.0  109.0   0.0   0.0    0.0
cat4      0.0    885.0    40.0    1.0   0.0   0.0    0.0
cat5      0.0    564.0  1091.0   37.0   0.0  53.0    0.0
cat6   1078.0     24.0     0.0    0.0   0.0   0.0  883.0
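For a multiclass problem, the overall precision is the fraction of examples that land on the diagonal of the confusion matrix. The plain-Scala sketch below shows that calculation on a small made-up 2x2 matrix; `ConfusionSketch` and `accuracy` are hypothetical names, and the code illustrates the arithmetic only, not how MLlib computes it internally.

```scala
object ConfusionSketch {
  // matrix(i)(j) = number of examples of actual class i predicted as class j.
  // Correct predictions sit on the diagonal.
  def accuracy(matrix: Array[Array[Double]]): Double = {
    val correct = matrix.indices.map(i => matrix(i)(i)).sum
    val total = matrix.map(_.sum).sum
    correct / total
  }
}
```

For the toy matrix `Array(Array(8.0, 2.0), Array(1.0, 9.0))`, 17 of 20 examples are on the diagonal, so the result is 0.85.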
Tuning Decision Trees (1)
// Try every combination of impurity, maxDepth, and maxBins,
// and score each resulting model on the CV set
val evaluations: Array[((String, Int, Int), Double)] =
  for (impurity <- Array("gini", "entropy");
       depth <- Array(1, 20);
       bins <- Array(10, 300)) yield {
    val model = DecisionTree.trainClassifier(
      trainData, 7, Map[Int,Int](), impurity, depth, bins)
    val predictionsAndLabels = cvData.map(example =>
      (model.predict(example.features), example.label))
    val accuracy = new MulticlassMetrics(predictionsAndLabels).precision
    ((impurity, depth, bins), accuracy)
  }
Tuning Decision Trees (2)
// Sort by accuracy, descending, and print
evaluations.sortBy {
  case ((impurity, depth, bins), accuracy) => accuracy
}.reverse.foreach(println)
...
((entropy,20,300),0.9119046392195256)
((gini,20,300),0.9058758867075454)
((entropy,20,10),0.8968585218391989)
((gini,20,10),0.89050342659865)
((gini,1,10),0.6330018378248399)
((gini,1,300),0.6323319764346198)
((entropy,1,300),0.48406932206592124)
((entropy,1,10),0.48406932206592124)
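The two impurity settings compared above correspond to the standard Gini impurity and entropy measures of how mixed the class labels at a tree node are. The sketch below computes both from per-class counts in plain Scala; `ImpuritySketch` and its functions are hypothetical names, and this is the textbook definition rather than MLlib's own code.

```scala
object ImpuritySketch {
  // Convert per-class counts at a node into non-zero class proportions.
  private def proportions(counts: Seq[Double]): Seq[Double] = {
    val total = counts.sum
    require(total > 0, "need at least one example")
    counts.filter(_ > 0).map(_ / total)
  }

  // Gini impurity: 1 - sum(p_i^2). Zero for a pure node.
  def gini(counts: Seq[Double]): Double =
    1.0 - proportions(counts).map(p => p * p).sum

  // Entropy in bits: -sum(p_i * log2(p_i)). Zero for a pure node.
  def entropy(counts: Seq[Double]): Double =
    -proportions(counts).map(p => p * math.log(p) / math.log(2)).sum
}
```

A node split evenly between two classes, `Seq(5.0, 5.0)`, has Gini impurity 0.5 and entropy 1 bit; a pure node such as `Seq(10.0, 0.0)` scores 0 on both.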
Revising Categorical Features (1)
With one 40-valued categorical feature, the decision tree can base a
single decision on groups of categories, which may be more direct and
closer to optimal. Representing one 40-valued categorical feature as
40 numeric features, by contrast, increases memory usage and slows
things down.
val data = rawData.map { line =>
  val values = line.split(',').map(_.toDouble)
  // Collapse the 4 binary Wilderness_Area columns into one index:
  //   1,0,0,0 => 0    0,1,0,0 => 1    0,0,1,0 => 2    0,0,0,1 => 3
  val wilderness = values.slice(10, 14).indexOf(1.0).toDouble
  // Likewise collapse the 40 binary Soil_Type columns into one index (0 to 39)
  val soil = values.slice(14, 54).indexOf(1.0).toDouble
  val featureVector = Vectors.dense(values.slice(0, 10) :+ wilderness :+ soil)
  val label = values.last - 1
  LabeledPoint(label, featureVector)
}
val Array(trainData, cvData, testData) = data.randomSplit(Array(0.8, 0.1, 0.1))
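The collapsing step can be tried without Spark. The sketch below applies the same `slice` and `indexOf` logic to one 55-value row (10 numeric columns, 4 + 40 binary columns, and the label) and returns the 12 resulting features. `CategoricalSketch` and `collapse` are hypothetical names used only for this illustration.

```scala
object CategoricalSketch {
  // Input: one full covtype row of 55 values (features plus trailing label).
  // Output: 12 features = 10 numeric values, Wilderness_Area index, Soil_Type index.
  def collapse(values: Array[Double]): Array[Double] = {
    require(values.length == 55, "expected 10 numeric + 4 + 40 binary columns + label")
    val wilderness = values.slice(10, 14).indexOf(1.0).toDouble
    val soil = values.slice(14, 54).indexOf(1.0).toDouble
    values.slice(0, 10) :+ wilderness :+ soil
  }
}
```

Note that `indexOf` returns -1 when no column in the slice equals 1.0, so a malformed row would produce a negative category index.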
Revising Categorical Features (2)
val evaluations = for (impurity <- Array("gini", "entropy");
                       depth <- Array(10, 20, 30);
                       bins <- Array(40, 300)) yield {
  // Features 10 and 11 are categorical, with 4 and 40 categories
  val model = DecisionTree.trainClassifier(
    trainData, 7, Map(10 -> 4, 11 -> 40), impurity, depth, bins)
  val predictionsAndLabels = cvData.map(example =>
    (model.predict(example.features), example.label))
  val accuracy = new MulticlassMetrics(predictionsAndLabels).precision
  ((impurity, depth, bins), accuracy)
}
evaluations.sortBy { case ((impurity, depth, bins), accuracy) => accuracy
}.reverse.foreach(println)
...
((entropy,30,300),0.9446513552658804)
((gini,30,300),0.9391509759293745)
((entropy,30,40),0.9389268225394855)
((gini,30,40),0.9355817642596042)
vs. tuned 1-of-n encoding DT
((entropy,20,300),0.9119046392195256)
Map storing arity of categorical features. E.g., an entry (n -> k) indicates that feature n is categorical with k categories indexed from 0: {0, 1, ..., k-1}.
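Given that contract, it can be useful to verify that every categorical feature value is an integer in {0, 1, ..., k-1} before training. The following plain-Scala check is a sketch under that assumption; `AritySketch` and `categoriesInRange` are hypothetical names, not an MLlib API.

```scala
object AritySketch {
  // arity maps a feature index n to its number of categories k,
  // in the same form as Map(10 -> 4, 11 -> 40).
  def categoriesInRange(features: Array[Double], arity: Map[Int, Int]): Boolean =
    arity.forall { case (n, k) =>
      val v = features(n)
      // The value must be a whole number between 0 and k-1
      v == math.floor(v) && v >= 0 && v < k
    }
}
```

With `Map(10 -> 4, 11 -> 40)`, a feature array whose positions 10 and 11 hold `3.0` and `39.0` passes the check, while `4.0` at position 10 fails it.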
CV set vs. Test set
If the purpose of the CV set was to evaluate parameters fit to the
training set, then the purpose of the test set is to evaluate
hyperparameters that were “fit” to the CV set. That is, the test
set ensures an unbiased estimate of the accuracy of the final,
chosen model and its hyperparameters.
// Retrain on training + CV data with the chosen hyperparameters,
// then evaluate once on the held-out test set
val model = DecisionTree.trainClassifier(
  trainData.union(cvData), 7, Map[Int,Int](), "entropy", 20, 300)
val predictionsAndLabels = testData.map(example =>
  (model.predict(example.features), example.label))
val metrics = new MulticlassMetrics(predictionsAndLabels)
metrics.precision
= 0.9161946933031271
Random Decision Forests
It would be great to have not one tree but many, each producing
reasonable but different and independent estimates of the right target
value. Their collective average prediction should fall closer to the true
answer than any individual tree’s does. It is the randomness in the
building process that helps create this independence. This is the
key to random decision forests.
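import org.apache.spark.mllib.tree.RandomForest

// Arguments: numClasses = 7, categorical arities, numTrees = 20,
// featureSubsetStrategy = "auto", impurity = "entropy", maxDepth = 30, maxBins = 300
val model = RandomForest.trainClassifier(
  trainData, 7, Map(10 -> 4, 11 -> 40), 20, "auto", "entropy", 30, 300)
val predictionsAndLabels = cvData.map(example =>
  (model.predict(example.features), example.label))
val metrics = new MulticlassMetrics(predictionsAndLabels)
metrics.precision
= 0.9630068932322555
vs. categorical-features DT
((entropy,30,300),0.9446513552658804)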
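For classification, one common way to combine the trees' outputs is a majority vote: each tree predicts a class, and the most frequent class wins. The plain-Scala sketch below illustrates that idea only; `ForestVoteSketch` and `majorityVote` are hypothetical names, and the code is not taken from MLlib's implementation.

```scala
object ForestVoteSketch {
  // Each tree casts one vote (its predicted class); the most frequent class wins.
  // Ties are broken arbitrarily in this sketch.
  def majorityVote(treePredictions: Seq[Double]): Double = {
    require(treePredictions.nonEmpty, "need at least one tree prediction")
    treePredictions.groupBy(identity).maxBy { case (_, votes) => votes.size }._1
  }
}
```

If five trees predict `Seq(1.0, 2.0, 1.0, 0.0, 1.0)`, class `1.0` receives three votes and is returned, even though two of the trees disagreed.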

More Related Content

What's hot

Intelligent System Optimizations
Intelligent System OptimizationsIntelligent System Optimizations
Intelligent System OptimizationsMartin Zapletal
 
R data mining-Time Series Analysis with R
R data mining-Time Series Analysis with RR data mining-Time Series Analysis with R
R data mining-Time Series Analysis with RDr. Volkan OBAN
 
Datamining R 4th
Datamining R 4thDatamining R 4th
Datamining R 4thsesejun
 
Time series-mining-slides
Time series-mining-slidesTime series-mining-slides
Time series-mining-slidesYanchang Zhao
 
Stata cheatsheet transformation
Stata cheatsheet transformationStata cheatsheet transformation
Stata cheatsheet transformationLaura Hughes
 
Stata cheat sheet: data transformation
Stata  cheat sheet: data transformationStata  cheat sheet: data transformation
Stata cheat sheet: data transformationTim Essam
 
R + Hadoop = Big Data Analytics. How Revolution Analytics' RHadoop Project Al...
R + Hadoop = Big Data Analytics. How Revolution Analytics' RHadoop Project Al...R + Hadoop = Big Data Analytics. How Revolution Analytics' RHadoop Project Al...
R + Hadoop = Big Data Analytics. How Revolution Analytics' RHadoop Project Al...Revolution Analytics
 
A walk in graph databases v1.0
A walk in graph databases v1.0A walk in graph databases v1.0
A walk in graph databases v1.0Pierre De Wilde
 
Dian Vitiana Ningrum ()6211540000020)
Dian Vitiana Ningrum  ()6211540000020)Dian Vitiana Ningrum  ()6211540000020)
Dian Vitiana Ningrum ()6211540000020)dian vit
 
Stata cheat sheet: data processing
Stata cheat sheet: data processingStata cheat sheet: data processing
Stata cheat sheet: data processingTim Essam
 
Killing The Unit test talk
Killing The Unit test talkKilling The Unit test talk
Killing The Unit test talkdor sever
 
Will it Blend? - ScalaSyd February 2015
Will it Blend? - ScalaSyd February 2015Will it Blend? - ScalaSyd February 2015
Will it Blend? - ScalaSyd February 2015Filippo Vitale
 
Stata Programming Cheat Sheet
Stata Programming Cheat SheetStata Programming Cheat Sheet
Stata Programming Cheat SheetLaura Hughes
 
Maneuvering target track prediction model
Maneuvering target track prediction modelManeuvering target track prediction model
Maneuvering target track prediction modelIJCI JOURNAL
 
r for data science 2. grammar of graphics (ggplot2) clean -ref
r for data science 2. grammar of graphics (ggplot2)  clean -refr for data science 2. grammar of graphics (ggplot2)  clean -ref
r for data science 2. grammar of graphics (ggplot2) clean -refMin-hyung Kim
 
Stata cheat sheet analysis
Stata cheat sheet analysisStata cheat sheet analysis
Stata cheat sheet analysisTim Essam
 
Machine Learning Live
Machine Learning LiveMachine Learning Live
Machine Learning LiveMike Anderson
 

What's hot (20)

Intelligent System Optimizations
Intelligent System OptimizationsIntelligent System Optimizations
Intelligent System Optimizations
 
R data mining-Time Series Analysis with R
R data mining-Time Series Analysis with RR data mining-Time Series Analysis with R
R data mining-Time Series Analysis with R
 
Datamining R 4th
Datamining R 4thDatamining R 4th
Datamining R 4th
 
Time series-mining-slides
Time series-mining-slidesTime series-mining-slides
Time series-mining-slides
 
Stata cheatsheet transformation
Stata cheatsheet transformationStata cheatsheet transformation
Stata cheatsheet transformation
 
Stata cheat sheet: data transformation
Stata  cheat sheet: data transformationStata  cheat sheet: data transformation
Stata cheat sheet: data transformation
 
R + Hadoop = Big Data Analytics. How Revolution Analytics' RHadoop Project Al...
R + Hadoop = Big Data Analytics. How Revolution Analytics' RHadoop Project Al...R + Hadoop = Big Data Analytics. How Revolution Analytics' RHadoop Project Al...
R + Hadoop = Big Data Analytics. How Revolution Analytics' RHadoop Project Al...
 
A walk in graph databases v1.0
A walk in graph databases v1.0A walk in graph databases v1.0
A walk in graph databases v1.0
 
Dian Vitiana Ningrum ()6211540000020)
Dian Vitiana Ningrum  ()6211540000020)Dian Vitiana Ningrum  ()6211540000020)
Dian Vitiana Ningrum ()6211540000020)
 
Stata cheat sheet: data processing
Stata cheat sheet: data processingStata cheat sheet: data processing
Stata cheat sheet: data processing
 
Functional programming in scala
Functional programming in scalaFunctional programming in scala
Functional programming in scala
 
Killing The Unit test talk
Killing The Unit test talkKilling The Unit test talk
Killing The Unit test talk
 
Ml all programs
Ml all programsMl all programs
Ml all programs
 
Will it Blend? - ScalaSyd February 2015
Will it Blend? - ScalaSyd February 2015Will it Blend? - ScalaSyd February 2015
Will it Blend? - ScalaSyd February 2015
 
Stata Programming Cheat Sheet
Stata Programming Cheat SheetStata Programming Cheat Sheet
Stata Programming Cheat Sheet
 
Maneuvering target track prediction model
Maneuvering target track prediction modelManeuvering target track prediction model
Maneuvering target track prediction model
 
r for data science 2. grammar of graphics (ggplot2) clean -ref
r for data science 2. grammar of graphics (ggplot2)  clean -refr for data science 2. grammar of graphics (ggplot2)  clean -ref
r for data science 2. grammar of graphics (ggplot2) clean -ref
 
ScalaMeter 2012
ScalaMeter 2012ScalaMeter 2012
ScalaMeter 2012
 
Stata cheat sheet analysis
Stata cheat sheet analysisStata cheat sheet analysis
Stata cheat sheet analysis
 
Machine Learning Live
Machine Learning LiveMachine Learning Live
Machine Learning Live
 

Viewers also liked

Random Forest for Big Data
Random Forest for Big DataRandom Forest for Big Data
Random Forest for Big Datatuxette
 
Decision tree for Predictive Modeling
Decision tree for Predictive ModelingDecision tree for Predictive Modeling
Decision tree for Predictive ModelingEdureka!
 

Viewers also liked (6)

Random Forest for Big Data
Random Forest for Big DataRandom Forest for Big Data
Random Forest for Big Data
 
Decision tree for Predictive Modeling
Decision tree for Predictive ModelingDecision tree for Predictive Modeling
Decision tree for Predictive Modeling
 
Random forest
Random forestRandom forest
Random forest
 
Step By Step Guide to Learn R
Step By Step Guide to Learn RStep By Step Guide to Learn R
Step By Step Guide to Learn R
 
Decision tree
Decision treeDecision tree
Decision tree
 
Decision tree
Decision treeDecision tree
Decision tree
 

Similar to Multiclassification with Decision Tree in Spark MLlib 1.3

Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache SparkCloudera, Inc.
 
Comparing Machine Learning Algorithms in Text Mining
Comparing Machine Learning Algorithms in Text MiningComparing Machine Learning Algorithms in Text Mining
Comparing Machine Learning Algorithms in Text MiningAndrea Gigli
 
Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017Gabriel Moreira
 
wk5ppt1_Titanic
wk5ppt1_Titanicwk5ppt1_Titanic
wk5ppt1_TitanicAliciaWei1
 
Building Machine Learning Pipelines
Building Machine Learning PipelinesBuilding Machine Learning Pipelines
Building Machine Learning PipelinesInMobi Technology
 
Leveraging R in Big Data of Mobile Ads (R在行動廣告大數據的應用)
Leveraging R in Big Data of Mobile Ads (R在行動廣告大數據的應用)Leveraging R in Big Data of Mobile Ads (R在行動廣告大數據的應用)
Leveraging R in Big Data of Mobile Ads (R在行動廣告大數據的應用)Craig Chao
 
Machine learning for_finance
Machine learning for_financeMachine learning for_finance
Machine learning for_financeStefan Duprey
 
Enhancing the performance of kmeans algorithm
Enhancing the performance of kmeans algorithmEnhancing the performance of kmeans algorithm
Enhancing the performance of kmeans algorithmHadi Fadlallah
 
Multiclass Logistic Regression: Derivation and Apache Spark Examples
Multiclass Logistic Regression: Derivation and Apache Spark ExamplesMulticlass Logistic Regression: Derivation and Apache Spark Examples
Multiclass Logistic Regression: Derivation and Apache Spark ExamplesMarjan Sterjev
 
Workshop - Introduction to Machine Learning with R
Workshop - Introduction to Machine Learning with RWorkshop - Introduction to Machine Learning with R
Workshop - Introduction to Machine Learning with RShirin Elsinghorst
 
How to use SVM for data classification
How to use SVM for data classificationHow to use SVM for data classification
How to use SVM for data classificationYiwei Chen
 
Machine Learning: Classification Concepts (Part 1)
Machine Learning: Classification Concepts (Part 1)Machine Learning: Classification Concepts (Part 1)
Machine Learning: Classification Concepts (Part 1)Daniel Chan
 
Clean, Learn and Visualise data with R
Clean, Learn and Visualise data with RClean, Learn and Visualise data with R
Clean, Learn and Visualise data with RBarbara Fusinska
 
Graduation project based on multi-label learning
Graduation project based on multi-label learningGraduation project based on multi-label learning
Graduation project based on multi-label learningxinyue Liu
 
Machine Learning Model for M.S admissions
Machine Learning Model for M.S admissionsMachine Learning Model for M.S admissions
Machine Learning Model for M.S admissionsOmkar Rane
 
8. Vectors data frames
8. Vectors data frames8. Vectors data frames
8. Vectors data framesExternalEvents
 

Similar to Multiclassification with Decision Tree in Spark MLlib 1.3 (20)

Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache Spark
 
Comparing Machine Learning Algorithms in Text Mining
Comparing Machine Learning Algorithms in Text MiningComparing Machine Learning Algorithms in Text Mining
Comparing Machine Learning Algorithms in Text Mining
 
Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017
 
wk5ppt1_Titanic
wk5ppt1_Titanicwk5ppt1_Titanic
wk5ppt1_Titanic
 
Building Machine Learning Pipelines
Building Machine Learning PipelinesBuilding Machine Learning Pipelines
Building Machine Learning Pipelines
 
Leveraging R in Big Data of Mobile Ads (R在行動廣告大數據的應用)
Leveraging R in Big Data of Mobile Ads (R在行動廣告大數據的應用)Leveraging R in Big Data of Mobile Ads (R在行動廣告大數據的應用)
Leveraging R in Big Data of Mobile Ads (R在行動廣告大數據的應用)
 
Building ML Pipelines
Building ML PipelinesBuilding ML Pipelines
Building ML Pipelines
 
Machine learning for_finance
Machine learning for_financeMachine learning for_finance
Machine learning for_finance
 
Enhancing the performance of kmeans algorithm
Enhancing the performance of kmeans algorithmEnhancing the performance of kmeans algorithm
Enhancing the performance of kmeans algorithm
 
Multiclass Logistic Regression: Derivation and Apache Spark Examples
Multiclass Logistic Regression: Derivation and Apache Spark ExamplesMulticlass Logistic Regression: Derivation and Apache Spark Examples
Multiclass Logistic Regression: Derivation and Apache Spark Examples
 
Workshop - Introduction to Machine Learning with R
Workshop - Introduction to Machine Learning with RWorkshop - Introduction to Machine Learning with R
Workshop - Introduction to Machine Learning with R
 
Xgboost
XgboostXgboost
Xgboost
 
How to use SVM for data classification
How to use SVM for data classificationHow to use SVM for data classification
How to use SVM for data classification
 
R and data mining
R and data miningR and data mining
R and data mining
 
Machine Learning: Classification Concepts (Part 1)
Machine Learning: Classification Concepts (Part 1)Machine Learning: Classification Concepts (Part 1)
Machine Learning: Classification Concepts (Part 1)
 
Clean, Learn and Visualise data with R
Clean, Learn and Visualise data with RClean, Learn and Visualise data with R
Clean, Learn and Visualise data with R
 
Learn Matlab
Learn MatlabLearn Matlab
Learn Matlab
 
Graduation project based on multi-label learning
Graduation project based on multi-label learningGraduation project based on multi-label learning
Graduation project based on multi-label learning
 
Machine Learning Model for M.S admissions
Machine Learning Model for M.S admissionsMachine Learning Model for M.S admissions
Machine Learning Model for M.S admissions
 
8. Vectors data frames
8. Vectors data frames8. Vectors data frames
8. Vectors data frames
 

More from leorick lin

How to prepare for pca certification 2021
How to prepare for pca certification 2021How to prepare for pca certification 2021
How to prepare for pca certification 2021leorick lin
 
1.5.ensemble learning with apache spark m llib 1.5
1.5.ensemble learning with apache spark m llib 1.51.5.ensemble learning with apache spark m llib 1.5
1.5.ensemble learning with apache spark m llib 1.5leorick lin
 
1.5.recommending music with apache spark ml
1.5.recommending music with apache spark ml1.5.recommending music with apache spark ml
1.5.recommending music with apache spark mlleorick lin
 
analyzing hdfs files using apace spark and mapreduce FixedLengthInputformat
analyzing hdfs files using apace spark and mapreduce FixedLengthInputformatanalyzing hdfs files using apace spark and mapreduce FixedLengthInputformat
analyzing hdfs files using apace spark and mapreduce FixedLengthInputformatleorick lin
 
Email Classifier using Spark 1.3 Mlib / ML Pipeline
Email Classifier using Spark 1.3 Mlib / ML PipelineEmail Classifier using Spark 1.3 Mlib / ML Pipeline
Email Classifier using Spark 1.3 Mlib / ML Pipelineleorick lin
 
Integrating data stored in rdbms and hadoop
Integrating data stored in rdbms and hadoopIntegrating data stored in rdbms and hadoop
Integrating data stored in rdbms and hadoopleorick lin
 

More from leorick lin (6)

How to prepare for pca certification 2021
How to prepare for pca certification 2021How to prepare for pca certification 2021
How to prepare for pca certification 2021
 
1.5.ensemble learning with apache spark m llib 1.5
1.5.ensemble learning with apache spark m llib 1.51.5.ensemble learning with apache spark m llib 1.5
1.5.ensemble learning with apache spark m llib 1.5
 
1.5.recommending music with apache spark ml
1.5.recommending music with apache spark ml1.5.recommending music with apache spark ml
1.5.recommending music with apache spark ml
 
analyzing hdfs files using apace spark and mapreduce FixedLengthInputformat
analyzing hdfs files using apace spark and mapreduce FixedLengthInputformatanalyzing hdfs files using apace spark and mapreduce FixedLengthInputformat
analyzing hdfs files using apace spark and mapreduce FixedLengthInputformat
 
Email Classifier using Spark 1.3 Mlib / ML Pipeline
Email Classifier using Spark 1.3 Mlib / ML PipelineEmail Classifier using Spark 1.3 Mlib / ML Pipeline
Email Classifier using Spark 1.3 Mlib / ML Pipeline
 
Integrating data stored in rdbms and hadoop
Integrating data stored in rdbms and hadoopIntegrating data stored in rdbms and hadoop
Integrating data stored in rdbms and hadoop
 

Recently uploaded

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsAndrey Dotsenko
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsPrecisely
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 

Recently uploaded (20)

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power Systems
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 

Multiclassification with Decision Tree in Spark MLlib 1.3

  • 2. References ● 201504 Advanced Analytics with Spark ● 201409 (清華大學) 資料挖礦與大數據分析 ● Apache Spark MLlib API ● MCI Machine Learning Repository
  • 3. A simple Decision Tree
      Figure 4-1. Decision tree: Is it spoiled?
  • 4. LabeledPoint
      The Spark MLlib abstraction for a feature vector is known as a LabeledPoint, which consists of a Spark MLlib Vector of features, and a target value, here called the label.
  • 5. 1-of-n Encoding
      LabeledPoint can be used with categorical features, with appropriate encoding.
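A minimal sketch of what 1-of-n (one-hot) encoding does to a single categorical value, in plain Scala with no Spark dependency. The helper name `oneOfN` is an assumption made for illustration and is not part of MLlib.

```scala
// Hypothetical helper: encode a category index as a 1-of-n (one-hot) array.
def oneOfN(index: Int, numCategories: Int): Array[Double] = {
  require(index >= 0 && index < numCategories, s"index $index out of range")
  // Exactly one position is 1.0; all others are 0.0.
  Array.tabulate(numCategories)(i => if (i == index) 1.0 else 0.0)
}

// A 4-valued feature such as Wilderness_Area, category 2 of {0, 1, 2, 3}.
val encoded = oneOfN(2, 4)
println(encoded.mkString(","))  // prints 0.0,0.0,1.0,0.0
```

Each category becomes its own numeric column, which is how the 4 Wilderness_Area and 40 Soil_Type columns of the Covtype data are laid out.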
  • 6. Covtype Dataset
      The data set records the types of forest covering parcels of land in Colorado, USA.

      Name                                Data Type     Measurement
      Elevation                           quantitative  meters
      Aspect                              quantitative  azimuth
      Slope                               quantitative  degrees
      Horizontal_Distance_To_Hydrology    quantitative  meters
      Vertical_Distance_To_Hydrology      quantitative  meters
      Horizontal_Distance_To_Roadways     quantitative  meters
      Hillshade_9am                       quantitative  0 to 255 index
      Hillshade_Noon                      quantitative  0 to 255 index
      Hillshade_3pm                       quantitative  0 to 255 index
      Horizontal_Distance_To_Fire_Points  quantitative  meters
      Wilderness_Area (4 binary columns)  qualitative   0 or 1
      Soil_Type (40 binary columns)       qualitative   0 or 1
      Cover_Type (7 types)                integer       1 to 7

      Sample records:
      2596,51,3,258,0,510,221,232,148,6279, 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0 ,5
      2590,56,2,212,-6,390,220,235,151,6225, 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0 ,5
  • 7. A First Decision Tree (1)
      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.mllib.regression.LabeledPoint

      val rawData = sc.textFile("hdfs:///user/ds/covtype.data")
      val data = rawData.map { line =>
        val values: Array[Double] = line.split(',').map(_.toDouble)
        // All columns except the last are features (MLlib's Vector is not generic).
        val featureVector = Vectors.dense(values.init)
        // For classification, labels should take values {0, 1, ..., numClasses-1},
        // so Cover_Type 1..7 is shifted to 0..6.
        val label = values.last - 1
        LabeledPoint(label, featureVector)
      }
      // 80% training, 10% cross-validation, 10% test
      val Array(trainData, cvData, testData) =
        data.randomSplit(Array(0.8, 0.1, 0.1))
      trainData.cache(); cvData.cache(); testData.cache()
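To see what the per-line parsing does to one record without a cluster, here is a plain-Scala sketch. The function name `parseLine` and the shortened four-column record are assumptions for illustration only; a real Covtype row has 55 columns.

```scala
// Plain-Scala version of the per-line parsing; no Spark needed.
def parseLine(line: String): (Double, Array[Double]) = {
  val values = line.split(',').map(_.trim.toDouble)
  val features = values.init      // every column except the last
  val label = values.last - 1     // Cover_Type 1..7 becomes 0..6
  (label, features)
}

// Hypothetical shortened record: three features followed by Cover_Type 5.
val (label, features) = parseLine("2596,51,3,5")
println(label)            // prints 4.0
println(features.length)  // prints 3
```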
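  • 8. A First Decision Tree (2)
      import org.apache.spark.mllib.evaluation.MulticlassMetrics
      import org.apache.spark.mllib.tree.DecisionTree

      // Arguments: input, numClasses = 7, categoricalFeaturesInfo (empty: all numeric),
      // impurity = "gini", maxDepth = 4, maxBins = 100
      val model = DecisionTree.trainClassifier(
        trainData, 7, Map[Int,Int](), "gini", 4, 100)
      // Pair each prediction on the CV set with the true label.
      val predictionsAndLabels = cvData.map(example =>
        (model.predict(example.features), example.label)
      )
      val metrics = new MulticlassMetrics(predictionsAndLabels)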
  • 9. A First Decision Tree (3)
      println(model.toDebugString)

      DecisionTreeModel classifier of depth 4 with 31 nodes
        If (feature 0 <= 3046.0)
         If (feature 0 <= 2497.0)
          If (feature 3 <= 0.0)
           If (feature 12 <= 0.0)
            Predict: 3.0
           Else (feature 12 > 0.0)
            Predict: 5.0

      metrics.precision = 0.6996101063190258
      metrics.confusionMatrix

      Actual \ Predict     cat0     cat1     cat2     cat3     cat4     cat5     cat6
      cat0              14248.0   6615.0      5.0      0.0      0.0      1.0    422.0
      cat1               5556.0  22440.0    355.0     19.0      0.0      4.0     41.0
      cat2                  0.0    452.0   3050.0     74.0      0.0     14.0      0.0
      cat3                  0.0      0.0    163.0    109.0      0.0      0.0      0.0
      cat4                  0.0    885.0     40.0      1.0      0.0      0.0      0.0
      cat5                  0.0    564.0   1091.0     37.0      0.0     53.0      0.0
      cat6               1078.0     24.0      0.0      0.0      0.0      0.0    883.0
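How a single accuracy figure relates to a confusion matrix can be sketched in plain Scala. The 3x3 counts below are hypothetical and are not taken from the Covtype run; the helper names `accuracy` and `recall` are likewise assumptions, not MLlib calls. Rows are actual classes and columns are predicted classes.

```scala
// Hypothetical counts: rows are actual classes, columns are predicted classes.
val confusion: Array[Array[Double]] = Array(
  Array(8.0, 2.0, 0.0),
  Array(1.0, 6.0, 1.0),
  Array(0.0, 1.0, 1.0)
)

// Overall accuracy: correctly classified examples (the diagonal) over all examples.
def accuracy(m: Array[Array[Double]]): Double = {
  val correct = m.indices.map(i => m(i)(i)).sum
  val total = m.map(_.sum).sum
  correct / total
}

// Recall for one class: diagonal entry over the row total (all actual members).
def recall(m: Array[Array[Double]], cls: Int): Double =
  m(cls)(cls) / m(cls).sum

println(accuracy(confusion))   // prints 0.75
println(recall(confusion, 2))  // prints 0.5
```

A per-class view like `recall` is what exposes rows where the diagonal is zero, such as cat4 in the matrix on this slide, which the overall figure hides.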
  • 10. Tuning Decision Trees (1)
      val evaluations: Array[((String, Int, Int), Double)] =
        for (impurity <- Array("gini", "entropy");
             depth    <- Array(1, 20);
             bins     <- Array(10, 300)
        ) yield {
          val model = DecisionTree.trainClassifier(
            trainData, 7, Map[Int,Int](), impurity, depth, bins)
          val predictionsAndLabels = cvData.map(example =>
            (model.predict(example.features), example.label)
          )
          val accuracy = new MulticlassMetrics(predictionsAndLabels).precision
          ((impurity, depth, bins), accuracy)
        }
  • 11. Tuning Decision Trees (2)
      // Sort by accuracy, descending, and print
      evaluations.sortBy {
        case ((impurity, depth, bins), accuracy) => accuracy
      }.reverse.foreach(println)
      ...
      ((entropy,20,300),0.9119046392195256)
      ((gini,20,300),0.9058758867075454)
      ((entropy,20,10),0.8968585218391989)
      ((gini,20,10),0.89050342659865)
      ((gini,1,10),0.6330018378248399)
      ((gini,1,300),0.6323319764346198)
      ((entropy,1,300),0.48406932206592124)
      ((entropy,1,10),0.48406932206592124)
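The two impurity settings being compared, "gini" and "entropy", can be illustrated with their standard textbook definitions over the class proportions at a node. This is a plain-Scala sketch, not MLlib's internal implementation, and the function names are chosen here for illustration.

```scala
// p holds the class proportions at a tree node and is assumed to sum to 1.

// Gini impurity: 1 minus the sum of squared proportions.
def gini(p: Seq[Double]): Double = 1.0 - p.map(x => x * x).sum

// Entropy in bits: sum of -p * log2(p), skipping empty classes.
def entropy(p: Seq[Double]): Double =
  p.filter(_ > 0.0).map(x => -x * math.log(x) / math.log(2.0)).sum

val pure  = Seq(1.0, 0.0)  // node containing a single class
val mixed = Seq(0.5, 0.5)  // node split evenly between two classes

println(gini(pure))   // prints 0.0
println(gini(mixed))  // prints 0.5
println(entropy(mixed))
```

Both measures are zero for a pure node and largest for an even mix, so a split is judged by how much it lowers impurity in the child nodes; which measure gives better accuracy is data dependent, as the results on this slide show.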
  • 12. Revising Categorical Features (1)
      With one 40-valued categorical feature, the decision tree can create decisions based on groups of categories in one decision, which may be more direct and optimal. On the other hand, having 40 numeric features represent one 40-valued categorical feature also increases memory usage and slows things down.

      val data = rawData.map { line =>
        val values = line.split(',').map(_.toDouble)
        // Collapse each 1-of-n block into the index of its single 1.0:
        //   ..1000.. => 0   ..0100.. => 1   ..0010.. => 2   ..0001.. => 3
        val wilderness = values.slice(10, 14).indexOf(1.0).toDouble
        val soil = values.slice(14, 54).indexOf(1.0).toDouble
        // 10 quantitative features, then the two categorical indices
        val featureVector =
          Vectors.dense(values.slice(0, 10) :+ wilderness :+ soil)
        val label = values.last - 1
        LabeledPoint(label, featureVector)
      }
      val Array(trainData, cvData, testData) =
        data.randomSplit(Array(0.8, 0.1, 0.1))
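The `indexOf(1.0)` step can be tried on its own in plain Scala. One detail worth noting is that `indexOf` returns -1 when no column is set, so this sketch adds an explicit check; the helper name `decodeOneHot` is an assumption for illustration.

```scala
// Hypothetical helper: collapse a 1-of-n slice into a single category index.
def decodeOneHot(columns: Array[Double]): Double = {
  val index = columns.indexOf(1.0)
  // indexOf returns -1 if no column is 1.0, which would be an invalid category.
  require(index >= 0, "no column is set to 1.0")
  index.toDouble
}

println(decodeOneHot(Array(0.0, 0.0, 1.0, 0.0)))  // prints 2.0
```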
  • 13. Revising Categorical Features (2)
      val evaluations =
        for (impurity <- Array("gini", "entropy");
             depth    <- Array(10, 20, 30);
             bins     <- Array(40, 300)
        ) yield {
          // Map storing arity of categorical features. E.g., an entry (n -> k)
          // indicates that feature n is categorical with k categories indexed
          // from 0: {0, 1, ..., k-1}.
          val model = DecisionTree.trainClassifier(
            trainData, 7, Map(10 -> 4, 11 -> 40), impurity, depth, bins)
          val predictionsAndLabels = cvData.map(example =>
            (model.predict(example.features), example.label)
          )
          val accuracy = new MulticlassMetrics(predictionsAndLabels).precision
          ((impurity, depth, bins), accuracy)
        }
      evaluations.sortBy {
        case ((impurity, depth, bins), accuracy) => accuracy
      }.reverse.foreach(println)
      …
      ((entropy,30,300),0.9446513552658804)
      ((gini,30,300),0.9391509759293745)
      ((entropy,30,40),0.9389268225394855)
      ((gini,30,40),0.9355817642596042)

      vs. tuned 1-of-n encoding DT: ((entropy,20,300),0.9119046392195256)
  • 14. CV set vs. Test set
      If the purpose of the CV set was to evaluate parameters fit to the training set, then the purpose of the test set is to evaluate hyperparameters that were “fit” to the CV set. That is, the test set ensures an unbiased estimate of the accuracy of the final, chosen model and its hyperparameters.

      // Retrain on training + CV data with the chosen hyperparameters,
      // then evaluate once on the held-out test set.
      val model = DecisionTree.trainClassifier(
        trainData.union(cvData), 7, Map[Int,Int](), "entropy", 20, 300)
      val predictionsAndLabels = testData.map(example =>
        (model.predict(example.features), example.label)
      )
      val metrics = new MulticlassMetrics(predictionsAndLabels)

      metrics.precision = 0.9161946933031271
  • 15. Random Decision Forests
      It would be great to have not one tree, but many trees, each producing reasonable but different and independent estimations of the right target value. Their collective prediction should fall closer to the true answer than any individual tree's does. It's the randomness in the process of building that helps create this independence. This is the key to random decision forests.

      import org.apache.spark.mllib.tree.RandomForest

      // Arguments: input, numClasses = 7, categoricalFeaturesInfo, numTrees = 20,
      // featureSubsetStrategy = "auto", impurity = "entropy", maxDepth = 30, maxBins = 300
      val model = RandomForest.trainClassifier(
        trainData, 7, Map(10 -> 4, 11 -> 40), 20, "auto", "entropy", 30, 300)
      val predictionsAndLabels = cvData.map(example =>
        (model.predict(example.features), example.label)
      )
      val metrics = new MulticlassMetrics(predictionsAndLabels)

      metrics.precision = 0.9630068932322555
      vs. categorical-features DT: ((entropy,30,300),0.9446513552658804)
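For classification, combining the trees amounts to a vote rather than a numeric average. A minimal plain-Scala sketch of majority voting follows; the helper name `majorityVote` and the per-tree predictions are hypothetical, and this is an illustration of the idea rather than MLlib's code.

```scala
// Hypothetical helper: pick the class predicted by the most trees.
// Ties are broken arbitrarily by maxBy.
def majorityVote(predictions: Seq[Double]): Double =
  predictions.groupBy(identity).maxBy(_._2.size)._1

// One predicted class per tree for a single example (hypothetical values).
val treePredictions = Seq(2.0, 1.0, 2.0, 2.0, 5.0)
println(majorityVote(treePredictions))  // prints 2.0
```

The vote only helps if the trees make different mistakes, which is why each tree is built with some randomness in the data and features it sees.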