SlideShare a Scribd company logo
Austin SIGKDD Spark talk
Random Forest and Decision Trees in
Spark MLlib
Tuhin Mahmud
Sep 20th, 2017
About Me
Tuhin Mahmud
Advisory Software Engineer , IBM
https://www.linkedin.com/in/tuhinmahmud/
Agenda
● Spark and MLLib Overview
● Decision Trees in Spark MlLib
● Random Forest in Spark MlLib
● Demo
What is MLlib?
● MLlib is Spark’s machine learning (ML) library.
● Its goal is to make practical machine learning scalable and easy.
● ML algorithms include Classification , Regression , Decision Trees and
random Forests, Recommendation, Clustering,Topic Modeling (LDA),
Distributed linear Algebra(SVD,PCA) and many more.
https://spark.apache.org/docs/latest/ml-guide.html
Spark MLlib Trajectory
https://www.slideshare.net/databricks/apache-spark-mllib-20-preview-data-science-and-production
Spark MlLib
● Initially developed by MLbase team in AmpLab (UC berkeley) - supported only Scala and Java
● Shipped to spark v0.8 ( Sep 2013)
● Current release Spark 2.2.0 released (Jul 11, 2017)
● MLLib 2.2:
○ DataFrame-based Api becomes the primary API for MLlib
■ org.apache.spark.ml and pysprak.ml
○ Ease of use
■ Java, Scala, Python, R
■ interoperates with NumPy in Python
■ Any Hadoop data source (e.g. HDFS, HBase, or local files), making it easy to plug into Hadoop workflows.
Spark Mllib in production
● ML persistence
○ Saving and loading
● Pipeline MLlib standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a
single pipeline, or workflow. This section covers the key concepts introduced by the Pipelines API, where the pipeline concept is
mostly inspired by the scikit-learn project.
Some concepts[5]
● DataFrame
● Transformer - one dataframe to another
● Estimator - fits on a dataframe to produce a transformer
● Pipeline A pipeline chains multiple transformers and Estimators to specify ML workflow.
MLlib 2.2.0 API
1. MLlib: RDD-based API ( maintanace mode)
2. MLlib: DataFrame-based API
Why is MLlib switching to the DataFrame-based API?
● DataFrames provide a more user-friendly API than RDDs.
a. The many benefits of DataFrames include Spark Datasources,
b. SQL/DataFrame queries, Tungsten and Catalyst optimizations,
c. uniform APIs across languages.
● The DataFrame-based API for MLlib provides a uniform API across ML algorithms and across multiple languages.
● DataFrames facilitate practical ML Pipelines, particularly feature transformations. See the Pipelines guide for details.
What is “Spark ML”?
● “Spark ML” is not an official name but occasionally used to refer to the MLlib DataFrame-based API.
https://spark.apache.org/docs/2.2.0/ml-guide.html
Agenda
● Spark and MLLib Overview
● Decision Tree in Spark MlLib
● Random Forest in Spark MlLib
● Demo
Decision Tree
http://www.saedsayad.com/decision_tree.htm
Decision Tree Applied
Machine learning with Random Forests And Decision Trees- A visual Guide for Beginner - by Scott Hartshorn
DecisionTrees packages in MLlib
1. from pyspark.mllib.tree import DecisionTree, DecisionTreeModel ( RDD
based)
2. from pyspark.ml.classification import DecisionTreeClassifier (Dataframe based)
https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#decision-trees
DecisionTree Classifier (MLlib Dataframe based API)
>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.ml.feature import StringIndexer
>>> df = spark.createDataFrame([
... (1.0, Vectors.dense(1.0)),
... (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
>>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
>>> si_model = stringIndexer.fit(df)
>>> td = si_model.transform(df)
>>> dt = DecisionTreeClassifier(maxDepth=2, labelCol="indexed")
>>> model = dt.fit(td)
>>> model.numNodes
https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.DecisionTreeClassifier
Decison Tree
hyper-parameters for decision tree in MlLib
● numClasses: How many classes are we trying to classify?
● categoricalFeaturesInfo: A specification whereby we declare what features are categorical features and should
not be treated as numbers
● impurity: A measure of the homogeneity of the labels at the node. Currently in Spark, there are two measures
of impurity with respect to classification: Gini and Entropy
● maxDepth: A stopping criterion which limits the depth of constructed trees. Generally, deeper trees lead to
more accurate results but run the risk of overfitting.
Decision Trees are prone to Overfitting
Reference: Machine Learning - Decision Trees and Random Forests - by Loonycorn
Spark MLlib -Decision Tree Example Notebook
1. Notebook using small dataset golf play
http://nbviewer.jupyter.org/github/tuhinmahmud/sigkdd_austin/blob/master/Sp
arkMllibPyspark.golf.ipynb
image:http://i2.cdn.turner.com/dr/pga/sites/default/files/articles/bro-hof-17th-bubba-072411-640x360.jpg?1311698162
Agenda
● Spark and MLLib Overview
● Decision Tree in Spark MlLib
● Random Forest in Spark MlLib
● Demo
Random Forest
Reference: Machine Learning - Decision Trees and Random Forests - by Loonycorn
Random Forest
The name “Random Forest” comes from combining the randomness that is used
to pick the subset of data with having a bunch of Decision trees.
A random forest (RF) is a collection of tree predictors
f ( x, T, Θk ), k = 1,…,K
where the Θk are i.i.d random vectors.
The trees are combined by
● voting (for classification)
● averaging (for regression).
Random Forest
http://www.math.usu.edu/adele/RandomForests/ENAR.pdf
Random Forest - Out of bag Error
Out-of-bag Error Estimate
● Average over the cases within each class to get a classwise out-of-bag error
rate.
● Average over all cases to get an overall out-of-bag error rate.
In random forests, there is no need for cross-
validation to get an unbiased estimate of the
test set error. It is estimated internally,
RandomForest in MLLib
DataFrame Based API
class pyspark.ml.classification.RandomForestClassifier(self, featuresCol="features", labelCol="label", predictionCol="prediction",
probabilityCol="probability", rawPredictionCol="rawPrediction", maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0,
maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, impurity="gini", numTrees=20, featureSubsetStrategy="auto",
seed=None, subsamplingRate=1.0)[source]¶
>>> import numpy>>> from numpy import allclose
>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.ml.feature import StringIndexer
>>> df = spark.createDataFrame([
... (1.0, Vectors.dense(1.0)),
... (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
>>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
>>> si_model = stringIndexer.fit(df)
>>> td = si_model.transform(df)
>>> rf = RandomForestClassifier(numTrees=3, maxDepth=2, labelCol="indexed", seed=42)
>>> model = rf.fit(td)
https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#random-forest-classifier
Random Forest - parameters
RandomForest - some parameters ( older RDD based but some apply to
dataframe based)
● numTrees: Number of trees in the resulting forest. Increasing the number of trees
decreases model variance.
● featureSubsetStrategy: Specifies a method which produces a number of how many
features are selected for training a single tree.
● seed: Seed for random generator initialization, since RandomForest depends on
random selection of features and rows
Random Forest -parameters
Spark also provides additional parameters to stop tree growing and produce fine-grained trees:
● minInstancesPerNode: A node is not split anymore, if it would provide left or right nodes which
would contain smaller number of observations than the value specified by this parameter. Default value is 1,
but typically for regression problems or large trees, the value should be higher.
● minInfoGain: Minimum information gain a split must get. Default value is 0.0.
MLlib - Labeled point vector (RDD based)
Labeled point vector
● Prior to running any supervised machine learning algorithm using Spark MLlib, we must convert our dataset
into a labeled point vector.
○ val higgs = response.zip(features).map {
case (response, features) =>
LabeledPoint(response, features) }
higgs.setName("higgs").cache()
● An example of a labeled point vector follows:
(1.0, [0.123, 0.456, 0.567, 0.678, ..., 0.789])
MLlib - data caching
Data caching
Many machine learning algorithms are iterative in nature and thus require multiple passes over the data.
Spark provides a way to persist the data in case we need to iterate over it. Spark also publishes several
StorageLevels to allow storing data with various options:
● NONE: No caching at all
● MEMORY_ONLY: Caches RDD data only in memory
● DISK_ONLY: Write cached RDD data to a disk and releases from memory
Making sense of a tragedy - titanic dataset
image:http://www.oceangate.com/images/expeditions/titanic/titanic-sinking-wikimedia-commons.jpg
● Notebook using titanic dataset and MLlib spark dataframe based apis for
Decision Tree and Random Forest
http://nbviewer.jupyter.org/github/tuhinmahmud/sigkdd_austin/blob/master/Spar
kMlLibTitanicNewDFbasedAPI.ipynb
Notebooks
1. Notebook using small dataset golf play
http://nbviewer.jupyter.org/github/tuhinmahmud/sigkdd_austin/blob/master/Sp
arkMllibPyspark.golf.ipynb
2. Notebook using titanic dataset and MLlib spark dataframe based apis for
Decision Tree and Random Forest
http://nbviewer.jupyter.org/github/tuhinmahmud/sigkdd_austin/blob/master/Sp
arkMlLibTitanicNewDFbasedAPI.ipynb
THANK YOU!
Back slide : Spark Stack
Reference
1. https://www.slideshare.net/databricks/apache-spark-mllib-20-preview-data-
science-and-production
2. https://spark.apache.org/mllib/
3. https://spark.apache.org/docs/latest/ml-guide.html
4. http://www.saedsayad.com/decision_tree.htm
5. http://www.math.usu.edu/adele/RandomForests/ENAR.pdf
6. https://spark.apache.org/docs
7. https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#random-forest-classifier
8. https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.DecisionTr

More Related Content

What's hot

Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Chris Fregly
 
Sempala - Interactive SPARQL Query Processing on Hadoop
Sempala - Interactive SPARQL Query Processing on HadoopSempala - Interactive SPARQL Query Processing on Hadoop
Sempala - Interactive SPARQL Query Processing on Hadoop
Alexander Schätzle
 
Quadrupling your elephants - RDF and the Hadoop ecosystem
Quadrupling your elephants - RDF and the Hadoop ecosystemQuadrupling your elephants - RDF and the Hadoop ecosystem
Quadrupling your elephants - RDF and the Hadoop ecosystem
Rob Vesse
 
ROCm and Distributed Deep Learning on Spark and TensorFlow
ROCm and Distributed Deep Learning on Spark and TensorFlowROCm and Distributed Deep Learning on Spark and TensorFlow
ROCm and Distributed Deep Learning on Spark and TensorFlow
Databricks
 
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
Chris Fregly
 
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...
Databricks
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Spark
felixcss
 
Apache Arrow and Pandas UDF on Apache Spark
Apache Arrow and Pandas UDF on Apache SparkApache Arrow and Pandas UDF on Apache Spark
Apache Arrow and Pandas UDF on Apache Spark
Takuya UESHIN
 
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...
Edureka!
 
Semantic Integration with Apache Jena and Stanbol
Semantic Integration with Apache Jena and StanbolSemantic Integration with Apache Jena and Stanbol
Semantic Integration with Apache Jena and Stanbol
All Things Open
 
Machine Learning by Example - Apache Spark
Machine Learning by Example - Apache SparkMachine Learning by Example - Apache Spark
Machine Learning by Example - Apache Spark
Meeraj Kunnumpurath
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big Data
DataWorks Summit
 
Pandas UDF and Python Type Hint in Apache Spark 3.0
Pandas UDF and Python Type Hint in Apache Spark 3.0Pandas UDF and Python Type Hint in Apache Spark 3.0
Pandas UDF and Python Type Hint in Apache Spark 3.0
Databricks
 
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Chris Fregly
 
JVM and OS Tuning for accelerating Spark application
JVM and OS Tuning for accelerating Spark applicationJVM and OS Tuning for accelerating Spark application
JVM and OS Tuning for accelerating Spark application
Tatsuhiro Chiba
 
Practical SPARQL Benchmarking Revisited
Practical SPARQL Benchmarking RevisitedPractical SPARQL Benchmarking Revisited
Practical SPARQL Benchmarking Revisited
Rob Vesse
 
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Christian Perone
 
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
J On The Beach
 
Why Spark Is the Next Top (Compute) Model
Why Spark Is the Next Top (Compute) ModelWhy Spark Is the Next Top (Compute) Model
Why Spark Is the Next Top (Compute) Model
Dean Wampler
 
Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark
Spark After Dark:  Real time Advanced Analytics and Machine Learning with SparkSpark After Dark:  Real time Advanced Analytics and Machine Learning with Spark
Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark
Chris Fregly
 

What's hot (20)

Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
 
Sempala - Interactive SPARQL Query Processing on Hadoop
Sempala - Interactive SPARQL Query Processing on HadoopSempala - Interactive SPARQL Query Processing on Hadoop
Sempala - Interactive SPARQL Query Processing on Hadoop
 
Quadrupling your elephants - RDF and the Hadoop ecosystem
Quadrupling your elephants - RDF and the Hadoop ecosystemQuadrupling your elephants - RDF and the Hadoop ecosystem
Quadrupling your elephants - RDF and the Hadoop ecosystem
 
ROCm and Distributed Deep Learning on Spark and TensorFlow
ROCm and Distributed Deep Learning on Spark and TensorFlowROCm and Distributed Deep Learning on Spark and TensorFlow
ROCm and Distributed Deep Learning on Spark and TensorFlow
 
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
 
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Spark
 
Apache Arrow and Pandas UDF on Apache Spark
Apache Arrow and Pandas UDF on Apache SparkApache Arrow and Pandas UDF on Apache Spark
Apache Arrow and Pandas UDF on Apache Spark
 
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...
 
Semantic Integration with Apache Jena and Stanbol
Semantic Integration with Apache Jena and StanbolSemantic Integration with Apache Jena and Stanbol
Semantic Integration with Apache Jena and Stanbol
 
Machine Learning by Example - Apache Spark
Machine Learning by Example - Apache SparkMachine Learning by Example - Apache Spark
Machine Learning by Example - Apache Spark
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big Data
 
Pandas UDF and Python Type Hint in Apache Spark 3.0
Pandas UDF and Python Type Hint in Apache Spark 3.0Pandas UDF and Python Type Hint in Apache Spark 3.0
Pandas UDF and Python Type Hint in Apache Spark 3.0
 
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
 
JVM and OS Tuning for accelerating Spark application
JVM and OS Tuning for accelerating Spark applicationJVM and OS Tuning for accelerating Spark application
JVM and OS Tuning for accelerating Spark application
 
Practical SPARQL Benchmarking Revisited
Practical SPARQL Benchmarking RevisitedPractical SPARQL Benchmarking Revisited
Practical SPARQL Benchmarking Revisited
 
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
 
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
 
Why Spark Is the Next Top (Compute) Model
Why Spark Is the Next Top (Compute) ModelWhy Spark Is the Next Top (Compute) Model
Why Spark Is the Next Top (Compute) Model
 
Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark
Spark After Dark:  Real time Advanced Analytics and Machine Learning with SparkSpark After Dark:  Real time Advanced Analytics and Machine Learning with Spark
Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark
 

Similar to Apache Spark MLlib - Random Foreset and Desicion Trees

Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...
Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...
Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...
CloudxLab
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
Ned Shawa
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Databricks
 
Big Data Science in Scala V2
Big Data Science in Scala V2 Big Data Science in Scala V2
Big Data Science in Scala V2
Anastasia Bobyreva
 
Joker'14 Java as a fundamental working tool of the Data Scientist
Joker'14 Java as a fundamental working tool of the Data ScientistJoker'14 Java as a fundamental working tool of the Data Scientist
Joker'14 Java as a fundamental working tool of the Data Scientist
Alexey Zinoviev
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Chetan Khatri
 
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Databricks
 
Spark ml streaming
Spark ml streamingSpark ml streaming
Spark ml streaming
Adam Doyle
 
Boosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of TechniquesBoosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of Techniques
Ahsan Javed Awan
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
zmhassan
 
Productionalizing Spark ML
Productionalizing Spark MLProductionalizing Spark ML
Productionalizing Spark ML
datamantra
 
Alpine innovation final v1.0
Alpine innovation final v1.0Alpine innovation final v1.0
Alpine innovation final v1.0
alpinedatalabs
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Holden Karau
 
Using Apache Spark with IBM SPSS Modeler
Using Apache Spark with IBM SPSS ModelerUsing Apache Spark with IBM SPSS Modeler
Using Apache Spark with IBM SPSS Modeler
Global Knowledge Training
 
BigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for SparkBigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for Spark
DESMOND YUEN
 
End-to-end working of Apache Spark
End-to-end working of Apache SparkEnd-to-end working of Apache Spark
End-to-end working of Apache Spark
Knoldus Inc.
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache Spark
DataWorks Summit/Hadoop Summit
 
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Databricks
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache Spark
Databricks
 
Spark
SparkSpark

Similar to Apache Spark MLlib - Random Foreset and Desicion Trees (20)

Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...
Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...
Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 
Big Data Science in Scala V2
Big Data Science in Scala V2 Big Data Science in Scala V2
Big Data Science in Scala V2
 
Joker'14 Java as a fundamental working tool of the Data Scientist
Joker'14 Java as a fundamental working tool of the Data ScientistJoker'14 Java as a fundamental working tool of the Data Scientist
Joker'14 Java as a fundamental working tool of the Data Scientist
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
 
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
 
Spark ml streaming
Spark ml streamingSpark ml streaming
Spark ml streaming
 
Boosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of TechniquesBoosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of Techniques
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
 
Productionalizing Spark ML
Productionalizing Spark MLProductionalizing Spark ML
Productionalizing Spark ML
 
Alpine innovation final v1.0
Alpine innovation final v1.0Alpine innovation final v1.0
Alpine innovation final v1.0
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016
 
Using Apache Spark with IBM SPSS Modeler
Using Apache Spark with IBM SPSS ModelerUsing Apache Spark with IBM SPSS Modeler
Using Apache Spark with IBM SPSS Modeler
 
BigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for SparkBigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for Spark
 
End-to-end working of Apache Spark
End-to-end working of Apache SparkEnd-to-end working of Apache Spark
End-to-end working of Apache Spark
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache Spark
 
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache Spark
 
Spark
SparkSpark
Spark
 

Recently uploaded

Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Vladimir Iglovikov, Ph.D.
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
ThomasParaiso2
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Nexer Digital
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
Alex Pruden
 

Recently uploaded (20)

Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
 

Apache Spark MLlib - Random Foreset and Desicion Trees

  • 1. Austin SIGKDD Spark talk Random Forest and Decision Trees in Spark MLlib Tuhin Mahmud Sep 20th, 2017
  • 2. About Me Tuhin Mahmud Advisory Software Engineer , IBM https://www.linkedin.com/in/tuhinmahmud/
  • 3. Agenda ● Spark and MLLib Overview ● Decision Trees in Spark MlLib ● Random Forest in Spark MlLib ● Demo
  • 4. What is MLlib? ● MLlib is Spark’s machine learning (ML) library. ● Its goal is to make practical machine learning scalable and easy. ● ML algorithms include Classification , Regression , Decision Trees and random Forests, Recommendation, Clustering,Topic Modeling (LDA), Distributed linear Algebra(SVD,PCA) and many more. https://spark.apache.org/docs/latest/ml-guide.html
  • 6. Spark MlLib ● Initially developed by MLbase team in AmpLab (UC berkeley) - supported only Scala and Java ● Shipped to spark v0.8 ( Sep 2013) ● Current release Spark 2.2.0 released (Jul 11, 2017) ● MLLib 2.2: ○ DataFrame-based Api becomes the primary API for MLlib ■ org.apache.spark.ml and pysprak.ml ○ Ease of use ■ Java, Scala, Python, R ■ interoperates with NumPy in Python ■ Any Hadoop data source (e.g. HDFS, HBase, or local files), making it easy to plug into Hadoop workflows.
  • 7. Spark Mllib in production ● ML persistence ○ Saving and loading ● Pipeline MLlib standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline, or workflow. This section covers the key concepts introduced by the Pipelines API, where the pipeline concept is mostly inspired by the scikit-learn project. Some concepts[5] ● DataFrame ● Transformer - one dataframe to another ● Estimator - fits on a dataframe to produce a transformer ● Pipeline A pipeline chains multiple transformers and Estimators to specify ML workflow.
  • 8. MLlib 2.2.0 API 1. MLlib: RDD-based API ( maintanace mode) 2. MLlib: DataFrame-based API Why is MLlib switching to the DataFrame-based API? ● DataFrames provide a more user-friendly API than RDDs. a. The many benefits of DataFrames include Spark Datasources, b. SQL/DataFrame queries, Tungsten and Catalyst optimizations, c. uniform APIs across languages. ● The DataFrame-based API for MLlib provides a uniform API across ML algorithms and across multiple languages. ● DataFrames facilitate practical ML Pipelines, particularly feature transformations. See the Pipelines guide for details. What is “Spark ML”? ● “Spark ML” is not an official name but occasionally used to refer to the MLlib DataFrame-based API. https://spark.apache.org/docs/2.2.0/ml-guide.html
  • 9. Agenda ● Spark and MLLib Overview ● Decision Tree in Spark MlLib ● Random Forest in Spark MlLib ● Demo
  • 11. Decision Tree Applied Machine learning with Random Forests And Decision Trees- A visual Guide for Beginner - by Scott Hartshorn
  • 12. DecisionTrees packages in MLlib 1. from pyspark.mllib.tree import DecisionTree, DecisionTreeModel ( RDD based) 2. from pyspark.ml.classification import DecisionTreeClassifier (Dataframe based) https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#decision-trees
  • 13. DecisionTree Classifier (MLlib Dataframe based API) >>> from pyspark.ml.linalg import Vectors >>> from pyspark.ml.feature import StringIndexer >>> df = spark.createDataFrame([ ... (1.0, Vectors.dense(1.0)), ... (0.0, Vectors.sparse(1, [], []))], ["label", "features"]) >>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed") >>> si_model = stringIndexer.fit(df) >>> td = si_model.transform(df) >>> dt = DecisionTreeClassifier(maxDepth=2, labelCol="indexed") >>> model = dt.fit(td) >>> model.numNodes https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.DecisionTreeClassifier
  • 14. Decison Tree hyper-parameters for decision tree in MlLib ● numClasses: How many classes are we trying to classify? ● categoricalFeaturesInfo: A specification whereby we declare what features are categorical features and should not be treated as numbers ● impurity: A measure of the homogeneity of the labels at the node. Currently in Spark, there are two measures of impurity with respect to classification: Gini and Entropy ● maxDepth: A stopping criterion which limits the depth of constructed trees. Generally, deeper trees lead to more accurate results but run the risk of overfitting.
  • 15. Decision Trees are prone to Overfitting Reference: Machine Learning - Decision Trees and Random Forests - by Loonycorn
  • 16. Spark MLlib -Decision Tree Example Notebook 1. Notebook using small dataset golf play http://nbviewer.jupyter.org/github/tuhinmahmud/sigkdd_austin/blob/master/Sp arkMllibPyspark.golf.ipynb image:http://i2.cdn.turner.com/dr/pga/sites/default/files/articles/bro-hof-17th-bubba-072411-640x360.jpg?1311698162
  • 17. Agenda ● Spark and MLLib Overview ● Decision Tree in Spark MlLib ● Random Forest in Spark MlLib ● Demo
  • 18. Random Forest Reference: Machine Learning - Decision Trees and Random Forests - by Loonycorn
  • 19. Random Forest The name “Random Forest” comes from combining the randomness that is used to pick the subset of data with having a bunch of Decision trees. A random forest (RF) is a collection of tree predictors f ( x, T, Θk ), k = 1,…,K where the Θk are i.i.d random vectors. The trees are combined by ● voting (for classification) ● averaging (for regression).
  • 21. Random Forest - Out of bag Error Out-of-bag Error Estimate ● Average over the cases within each class to get a classwise out-of-bag error rate. ● Average over all cases to get an overall out-of-bag error rate. In random forests, there is no need for cross- validation to get an unbiased estimate of the test set error. It is estimated internally,
  • 22. RandomForest in MLLib DataFrame Based API class pyspark.ml.classification.RandomForestClassifier(self, featuresCol="features", labelCol="label", predictionCol="prediction", probabilityCol="probability", rawPredictionCol="rawPrediction", maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0, maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, impurity="gini", numTrees=20, featureSubsetStrategy="auto", seed=None, subsamplingRate=1.0)[source]¶ >>> import numpy>>> from numpy import allclose >>> from pyspark.ml.linalg import Vectors >>> from pyspark.ml.feature import StringIndexer >>> df = spark.createDataFrame([ ... (1.0, Vectors.dense(1.0)), ... (0.0, Vectors.sparse(1, [], []))], ["label", "features"]) >>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed") >>> si_model = stringIndexer.fit(df) >>> td = si_model.transform(df) >>> rf = RandomForestClassifier(numTrees=3, maxDepth=2, labelCol="indexed", seed=42) >>> model = rf.fit(td) https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#random-forest-classifier
  • 23. Random Forest - parameters RandomForest - some parameters ( older RDD based but some apply to dataframe based) ● numTrees: Number of trees in the resulting forest. Increasing the number of trees decreases model variance. ● featureSubsetStrategy: Specifies a method which produces a number of how many features are selected for training a single tree. ● seed: Seed for random generator initialization, since RandomForest depends on random selection of features and rows
  • 24. Random Forest -parameters Spark also provides additional parameters to stop tree growing and produce fine-grained trees: ● minInstancesPerNode: A node is not split anymore, if it would provide left or right nodes which would contain smaller number of observations than the value specified by this parameter. Default value is 1, but typically for regression problems or large trees, the value should be higher. ● minInfoGain: Minimum information gain a split must get. Default value is 0.0.
  • 25. MLlib - Labeled point vector (RDD based) Labeled point vector ● Prior to running any supervised machine learning algorithm using Spark MLlib, we must convert our dataset into a labeled point vector. ○ val higgs = response.zip(features).map { case (response, features) => LabeledPoint(response, features) } higgs.setName("higgs").cache() ● An example of a labeled point vector follows: (1.0, [0.123, 0.456, 0.567, 0.678, ..., 0.789])
  • 26. MLlib - data caching Data caching Many machine learning algorithms are iterative in nature and thus require multiple passes over the data. Spark provides a way to persist the data in case we need to iterate over it. Spark also publishes several StorageLevels to allow storing data with various options: ● NONE: No caching at all ● MEMORY_ONLY: Caches RDD data only in memory ● DISK_ONLY: Write cached RDD data to a disk and releases from memory
  • 27. Making sense of a tragedy - titanic dataset image:http://www.oceangate.com/images/expeditions/titanic/titanic-sinking-wikimedia-commons.jpg ● Notebook using titanic dataset and MLlib spark dataframe based apis for Decision Tree and Random Forest http://nbviewer.jupyter.org/github/tuhinmahmud/sigkdd_austin/blob/master/Spar kMlLibTitanicNewDFbasedAPI.ipynb
  • 28. Notebooks 1. Notebook using small dataset golf play http://nbviewer.jupyter.org/github/tuhinmahmud/sigkdd_austin/blob/master/Sp arkMllibPyspark.golf.ipynb 2. Notebook using titanic dataset and MLlib spark dataframe based apis for Decision Tree and Random Forest http://nbviewer.jupyter.org/github/tuhinmahmud/sigkdd_austin/blob/master/Sp arkMlLibTitanicNewDFbasedAPI.ipynb
  • 30. Back slide : Spark Stack
  • 31. Reference 1. https://www.slideshare.net/databricks/apache-spark-mllib-20-preview-data- science-and-production 2. https://spark.apache.org/mllib/ 3. https://spark.apache.org/docs/latest/ml-guide.html 4. http://www.saedsayad.com/decision_tree.htm 5. http://www.math.usu.edu/adele/RandomForests/ENAR.pdf 6. https://spark.apache.org/docs 7. https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#random-forest-classifier 8. https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.DecisionTr