SlideShare a Scribd company logo
1 of 25
1© Cloudera, Inc. All rights reserved.
Random Decision Forests
- at Scale
Todd M. Boetticher| Solution Consultant
2© Cloudera, Inc. All rights reserved.
Overview
• Decision trees
• Introduction to decision tree’s
• Building a decision tree with Spark
• Tuning your model with Hyerparameters
• Random Decision Forests
• Introduction to Random Decision
Forests
• Handicapping individual learners
• Deploying random decision trees in
Spark
3© Cloudera, Inc. All rights reserved.
Decision Tree
The Decision tree is one of the most commonly
used classification techniques. According
to a recent survey, the decision tree is the
most common technique used today.
A decision tree is a collection of outcomes that
eventually lead to a decision. Like the game 21
questions, decision tree’s use features to split
the data into subsets that will give you the best
results.
4© Cloudera, Inc. All rights reserved.
Benefits and disadvantages of Decision Trees
Pros
• Computationally cheap to use
• Easy for humans to understand learned results
• Missing values OK
• Capable of dealing with irrelevant feature
Cons
• Prone to overfitting
• Lacks the performance available through other methods
• Small changes in the Data can have enormous impacts on the data
5© Cloudera, Inc. All rights reserved.
Iris Dataset
6.4, 3.2, 4.5, 1.5, Iris-versacolor
6© Cloudera, Inc. All rights reserved.
Possible Decision Trees
7© Cloudera, Inc. All rights reserved.
Interpreting Models
Simple graphic
explanation of how the
feature space can be
divided into decision
boundaries.
8© Cloudera, Inc. All rights reserved.
Decision Breakdown
def visualize_tree(tree, feature_names):
tree -- scikit-learn DecsisionTree.
feature_names -- list of feature names.
"""
with open("dt.dot", 'w') as f:
export_graphviz(tree, out_file=f,
feature_names=feature_names)
command = ["dot", "-Tpng", "dt.dot", "-o",
"dt.png"]
try:
subprocess.check_call(command)
except:
exit("Could not run dot, ie graphviz, to "
"produce visualization")
9© Cloudera, Inc. All rights reserved.
Building a Decision Tree
10© Cloudera, Inc. All rights reserved.
Name and Address Matching
• Potential uses in every industry
• Marketing
• Defense and law enforcement
• Bank Secrecy Act and Patriot Act Compliance
RN-KOMSOMOLSKY LLC,
352 DeerPath Ave SW,
Leesburg,
Virginia,
United States
KOMSOMOLSKY REFINERY,
1621 Parkcrest Cir,
Reston,
Virginia,
United States
VS
11© Cloudera, Inc. All rights reserved.
0.78, 0.0, 0.0, 0.78, 0.78, True Positive
Name and Address Matching
Name Address City State Country
0.81 0.0 0.0 0.78 1.0
0.6 0.6 0.2 0.0 0.0
0.2 0.0 0.0 0.0 0.0
0.91 0.91 0.91 0.91 1.0
0.78 0.0 0.0 0.78 0.78
0.4 0.4 0.36 0.0 1.0
1.0 0.0 0.0 0.0 0.0
Hits
True-Positive
False-Positive
False-Positive
True-Positive
True-Positive
False-Positive
False-Positive
12© Cloudera, Inc. All rights reserved.
Building our first decision tree in MLlib
13© Cloudera, Inc. All rights reserved.
Evaluating a decision Tree
.82243123866534172
~82.24% accuracy
14© Cloudera, Inc. All rights reserved.
Benchmark vs Random
.2718652198532764
~27.19% accuracy
15© Cloudera, Inc. All rights reserved.
Hyperparameters
Impurity: Gini, Impurity,
Variance
Maximum Depth Maximum Bin
Measures the expected
value of the information.
The calculation of impurity is
generally computed with
Gini impurity or entropy
Measures the expected
value of the information.
The calculation of impurity is
generally computed with
Gini impurity or entropy
Measures the expected
value of the information.
The calculation of impurity is
generally computed with
Gini impurity or entropy
trainClassifier(trainingData, numClasses,
categoricalFeaturesInfo, numTrees, featureSubsetStrategy,
impurity, maxDepth, maxBins)
16© Cloudera, Inc. All rights reserved.
Impurity: Gini, Impurity, Variance
1 − 𝑃𝑖2H= - 𝑖=1
𝑛
𝑃(𝑥𝑖)𝑙𝑜𝑔2 𝑃(𝑥𝑖)
Gini Impurity Entropy
Entropy is defined as the expected
value of information. First, we need
to define information. If you’re
classifying something that can take
on multiple values, the information
for symbol 𝑥 𝑖 is defined as
Gini impurity is a measure of how
often a randomly chosen element
from the set would be incorrectly
labeled if it were randomly
labeled according to the
distribution of labels in the subset
1
𝑛 𝑖=1
𝑛
(𝑦𝑖−𝜇)2
Variance
vi is label for an instance, N is the
number of instances and μ is the
mean given by 1N∑Ni=1xi1N∑i=1Nxi.
17© Cloudera, Inc. All rights reserved.
Maximum Depth
• Maximum tree depth is a limit to
stop further splitting of nodes when
the specified tree depth has been
reached during the building of the
initial decision tree.
• The absolute maximum depth
would be N−1, where N is the
number of training samples. You
can derive this by considering that
the least effective split would be
peeling off one training example
per node.
1.
2.
3.
18© Cloudera, Inc. All rights reserved.
Decision Trees to Random
Decision Forests
19© Cloudera, Inc. All rights reserved.
Wisdom of the crowds
• The wisdom of the crowd is the collective opinion of a group of individuals rather
than that of a single expert.
• A large group's aggregated answers to questions involving quantity estimation,
general world knowledge, and spatial reasoning has generally been found to be
as good as, and often better than, the answer given by any of the individuals
within the group. An explanation for this phenomenon is that there is
idiosyncratic noise associated with each individual judgment, and taking the
average over a large number of responses will go some way toward canceling the
effect of this noise.
20© Cloudera, Inc. All rights reserved.
What is Random Forest
Random Forests grows many
classification trees. To classify a new
object from an input vector, put the
input vector down each of the trees in
the forest. Each tree gives a classification,
and we say the tree "votes" for that class.
The forest chooses the classification
having the most votes (over all the trees
in the forest).
21© Cloudera, Inc. All rights reserved.
How to Create a Crowd?
22© Cloudera, Inc. All rights reserved.
Random Decision Forests with Spark MLlib
RandomForest.trainClassifier(trainingData, numClasses,
categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity,
maxDepth, maxBins)
Number of Trees Feature Subset Strategy
Number of trees in the forest. Increasing the number
of trees will decrease the variance in predictions,
improving the model’s test-time accuracy.
Training time increases roughly linearly in the number
of trees.
Number of features to use as candidates for splitting at
each tree node. The number is specified as a fraction or
function of the total number of features. Decreasing this
number will speed up training, but can sometimes
impact performance if too low.
~ 98.67 % Accuracy
23© Cloudera, Inc. All rights reserved.
Trees See Subsets of Examples
24© Cloudera, Inc. All rights reserved.
Or Subsets of Features
25© Cloudera, Inc. All rights reserved.
Thank you
tboetticher@cloudera.com

More Related Content

What's hot

Becoming Data-Driven Through Cultural Change
Becoming Data-Driven Through Cultural ChangeBecoming Data-Driven Through Cultural Change
Becoming Data-Driven Through Cultural ChangeCloudera, Inc.
 
The Five Markers on Your Big Data Journey
The Five Markers on Your Big Data JourneyThe Five Markers on Your Big Data Journey
The Five Markers on Your Big Data JourneyCloudera, Inc.
 
Transforming Insurance Analytics with Big Data and Automated Machine Learning

Transforming Insurance Analytics with Big Data and Automated Machine Learning
Transforming Insurance Analytics with Big Data and Automated Machine Learning

Transforming Insurance Analytics with Big Data and Automated Machine Learning
Cloudera, Inc.
 
2016 Cybersecurity Analytics State of the Union
2016 Cybersecurity Analytics State of the Union2016 Cybersecurity Analytics State of the Union
2016 Cybersecurity Analytics State of the UnionCloudera, Inc.
 
The Vortex of Change - Digital Transformation (Presented by Intel)
The Vortex of Change - Digital Transformation (Presented by Intel)The Vortex of Change - Digital Transformation (Presented by Intel)
The Vortex of Change - Digital Transformation (Presented by Intel)Cloudera, Inc.
 
Preparing for the Cybersecurity Renaissance
Preparing for the Cybersecurity RenaissancePreparing for the Cybersecurity Renaissance
Preparing for the Cybersecurity RenaissanceCloudera, Inc.
 
Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...
Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...
Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...Cloudera, Inc.
 
IoT-Enabled Predictive Maintenance
IoT-Enabled Predictive MaintenanceIoT-Enabled Predictive Maintenance
IoT-Enabled Predictive MaintenanceCloudera, Inc.
 
Optimizing Regulatory Compliance with Big Data
Optimizing Regulatory Compliance with Big DataOptimizing Regulatory Compliance with Big Data
Optimizing Regulatory Compliance with Big DataCloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud RealitiesWebinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud RealitiesDataStax
 
Delivering improved patient outcomes through advanced analytics 6.26.18
Delivering improved patient outcomes through advanced analytics 6.26.18Delivering improved patient outcomes through advanced analytics 6.26.18
Delivering improved patient outcomes through advanced analytics 6.26.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 
Enterprise Data Hub: The Next Big Thing in Big Data
Enterprise Data Hub: The Next Big Thing in Big DataEnterprise Data Hub: The Next Big Thing in Big Data
Enterprise Data Hub: The Next Big Thing in Big DataCloudera, Inc.
 
How Cloudera SDX can aid GDPR compliance 6.21.18
How Cloudera SDX can aid GDPR compliance 6.21.18How Cloudera SDX can aid GDPR compliance 6.21.18
How Cloudera SDX can aid GDPR compliance 6.21.18Cloudera, Inc.
 
How Virtual Reality and Machine Learning Are Powering the New Age of Network ...
How Virtual Reality and Machine Learning Are Powering the New Age of Network ...How Virtual Reality and Machine Learning Are Powering the New Age of Network ...
How Virtual Reality and Machine Learning Are Powering the New Age of Network ...DataStax
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Get Started with Cloudera’s Cyber Solution
Get Started with Cloudera’s Cyber SolutionGet Started with Cloudera’s Cyber Solution
Get Started with Cloudera’s Cyber SolutionCloudera, Inc.
 
Data Science in Enterprise
Data Science in EnterpriseData Science in Enterprise
Data Science in EnterpriseJosh Yeh
 

What's hot (20)

Becoming Data-Driven Through Cultural Change
Becoming Data-Driven Through Cultural ChangeBecoming Data-Driven Through Cultural Change
Becoming Data-Driven Through Cultural Change
 
The Five Markers on Your Big Data Journey
The Five Markers on Your Big Data JourneyThe Five Markers on Your Big Data Journey
The Five Markers on Your Big Data Journey
 
Transforming Insurance Analytics with Big Data and Automated Machine Learning

Transforming Insurance Analytics with Big Data and Automated Machine Learning
Transforming Insurance Analytics with Big Data and Automated Machine Learning

Transforming Insurance Analytics with Big Data and Automated Machine Learning

 
2016 Cybersecurity Analytics State of the Union
2016 Cybersecurity Analytics State of the Union2016 Cybersecurity Analytics State of the Union
2016 Cybersecurity Analytics State of the Union
 
The Vortex of Change - Digital Transformation (Presented by Intel)
The Vortex of Change - Digital Transformation (Presented by Intel)The Vortex of Change - Digital Transformation (Presented by Intel)
The Vortex of Change - Digital Transformation (Presented by Intel)
 
Preparing for the Cybersecurity Renaissance
Preparing for the Cybersecurity RenaissancePreparing for the Cybersecurity Renaissance
Preparing for the Cybersecurity Renaissance
 
Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...
Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...
Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...
 
IoT-Enabled Predictive Maintenance
IoT-Enabled Predictive MaintenanceIoT-Enabled Predictive Maintenance
IoT-Enabled Predictive Maintenance
 
Optimizing Regulatory Compliance with Big Data
Optimizing Regulatory Compliance with Big DataOptimizing Regulatory Compliance with Big Data
Optimizing Regulatory Compliance with Big Data
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud RealitiesWebinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud Realities
 
Demystifying ML & AI
Demystifying ML & AIDemystifying ML & AI
Demystifying ML & AI
 
Delivering improved patient outcomes through advanced analytics 6.26.18
Delivering improved patient outcomes through advanced analytics 6.26.18Delivering improved patient outcomes through advanced analytics 6.26.18
Delivering improved patient outcomes through advanced analytics 6.26.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 
Enterprise Data Hub: The Next Big Thing in Big Data
Enterprise Data Hub: The Next Big Thing in Big DataEnterprise Data Hub: The Next Big Thing in Big Data
Enterprise Data Hub: The Next Big Thing in Big Data
 
How Cloudera SDX can aid GDPR compliance 6.21.18
How Cloudera SDX can aid GDPR compliance 6.21.18How Cloudera SDX can aid GDPR compliance 6.21.18
How Cloudera SDX can aid GDPR compliance 6.21.18
 
How Virtual Reality and Machine Learning Are Powering the New Age of Network ...
How Virtual Reality and Machine Learning Are Powering the New Age of Network ...How Virtual Reality and Machine Learning Are Powering the New Age of Network ...
How Virtual Reality and Machine Learning Are Powering the New Age of Network ...
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Get Started with Cloudera’s Cyber Solution
Get Started with Cloudera’s Cyber SolutionGet Started with Cloudera’s Cyber Solution
Get Started with Cloudera’s Cyber Solution
 
Data Science in Enterprise
Data Science in EnterpriseData Science in Enterprise
Data Science in Enterprise
 

Similar to Random Decision Forests at Scale

Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...
Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...
Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...Edureka!
 
TreeNet Overview - Updated October 2012
TreeNet Overview  - Updated October 2012TreeNet Overview  - Updated October 2012
TreeNet Overview - Updated October 2012Salford Systems
 
Top 3 Considerations for Machine Learning on Big Data
Top 3 Considerations for Machine Learning on Big DataTop 3 Considerations for Machine Learning on Big Data
Top 3 Considerations for Machine Learning on Big DataDatameer
 
The Data Science Product Management Toolkit
The Data Science Product Management ToolkitThe Data Science Product Management Toolkit
The Data Science Product Management ToolkitJack Moore
 
Introduction to Random Forest
Introduction to Random Forest Introduction to Random Forest
Introduction to Random Forest Rupak Roy
 
Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science...
Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science...Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science...
Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science...Edureka!
 
unit 5 decision tree2.pptx
unit 5 decision tree2.pptxunit 5 decision tree2.pptx
unit 5 decision tree2.pptxssuser5c580e1
 
The Black Box: Interpretability, Reproducibility, and Data Management
The Black Box: Interpretability, Reproducibility, and Data ManagementThe Black Box: Interpretability, Reproducibility, and Data Management
The Black Box: Interpretability, Reproducibility, and Data Managementmark madsen
 
“Reinforcement Learning: a Practical Introduction,” a Presentation from Micro...
“Reinforcement Learning: a Practical Introduction,” a Presentation from Micro...“Reinforcement Learning: a Practical Introduction,” a Presentation from Micro...
“Reinforcement Learning: a Practical Introduction,” a Presentation from Micro...Edge AI and Vision Alliance
 
Know How to Create and Visualize a Decision Tree with Python.pdf
Know How to Create and Visualize a Decision Tree with Python.pdfKnow How to Create and Visualize a Decision Tree with Python.pdf
Know How to Create and Visualize a Decision Tree with Python.pdfData Science Council of America
 
Operationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEAOperationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEAAdam Doyle
 
expeditions praneeth_june-2021
expeditions praneeth_june-2021expeditions praneeth_june-2021
expeditions praneeth_june-2021Praneeth Vepakomma
 
Churn Modeling For Mobile Telecommunications
Churn Modeling For Mobile TelecommunicationsChurn Modeling For Mobile Telecommunications
Churn Modeling For Mobile TelecommunicationsSalford Systems
 
An Introduction to Random Forest and linear regression algorithms
An Introduction to Random Forest and linear regression algorithmsAn Introduction to Random Forest and linear regression algorithms
An Introduction to Random Forest and linear regression algorithmsShouvic Banik0139
 
TreeHuggr: Discovering Where Tree-based Classifiers are Vulnerable to Adversa...
TreeHuggr: Discovering Where Tree-based Classifiers are Vulnerable to Adversa...TreeHuggr: Discovering Where Tree-based Classifiers are Vulnerable to Adversa...
TreeHuggr: Discovering Where Tree-based Classifiers are Vulnerable to Adversa...Bobby Filar
 
How to not fail at security data analytics (by CxOSidekick)
How to not fail at security data analytics (by CxOSidekick)How to not fail at security data analytics (by CxOSidekick)
How to not fail at security data analytics (by CxOSidekick)Dinis Cruz
 
Software Faults, Failures and Their Mitigations | Turing100@Persistent
Software Faults, Failures and Their Mitigations | Turing100@PersistentSoftware Faults, Failures and Their Mitigations | Turing100@Persistent
Software Faults, Failures and Their Mitigations | Turing100@PersistentPersistent Systems Ltd.
 
Big data and computing grid
Big data and computing gridBig data and computing grid
Big data and computing gridThang Nguyen
 

Similar to Random Decision Forests at Scale (20)

Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...
Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...
Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...
 
TreeNet Overview - Updated October 2012
TreeNet Overview  - Updated October 2012TreeNet Overview  - Updated October 2012
TreeNet Overview - Updated October 2012
 
Top 3 Considerations for Machine Learning on Big Data
Top 3 Considerations for Machine Learning on Big DataTop 3 Considerations for Machine Learning on Big Data
Top 3 Considerations for Machine Learning on Big Data
 
Decision Trees
Decision TreesDecision Trees
Decision Trees
 
Random Forests Lightning Talk
Random Forests Lightning TalkRandom Forests Lightning Talk
Random Forests Lightning Talk
 
The Data Science Product Management Toolkit
The Data Science Product Management ToolkitThe Data Science Product Management Toolkit
The Data Science Product Management Toolkit
 
Introduction to Random Forest
Introduction to Random Forest Introduction to Random Forest
Introduction to Random Forest
 
Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science...
Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science...Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science...
Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science...
 
unit 5 decision tree2.pptx
unit 5 decision tree2.pptxunit 5 decision tree2.pptx
unit 5 decision tree2.pptx
 
The Black Box: Interpretability, Reproducibility, and Data Management
The Black Box: Interpretability, Reproducibility, and Data ManagementThe Black Box: Interpretability, Reproducibility, and Data Management
The Black Box: Interpretability, Reproducibility, and Data Management
 
“Reinforcement Learning: a Practical Introduction,” a Presentation from Micro...
“Reinforcement Learning: a Practical Introduction,” a Presentation from Micro...“Reinforcement Learning: a Practical Introduction,” a Presentation from Micro...
“Reinforcement Learning: a Practical Introduction,” a Presentation from Micro...
 
Know How to Create and Visualize a Decision Tree with Python.pdf
Know How to Create and Visualize a Decision Tree with Python.pdfKnow How to Create and Visualize a Decision Tree with Python.pdf
Know How to Create and Visualize a Decision Tree with Python.pdf
 
Operationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEAOperationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEA
 
expeditions praneeth_june-2021
expeditions praneeth_june-2021expeditions praneeth_june-2021
expeditions praneeth_june-2021
 
Churn Modeling For Mobile Telecommunications
Churn Modeling For Mobile TelecommunicationsChurn Modeling For Mobile Telecommunications
Churn Modeling For Mobile Telecommunications
 
An Introduction to Random Forest and linear regression algorithms
An Introduction to Random Forest and linear regression algorithmsAn Introduction to Random Forest and linear regression algorithms
An Introduction to Random Forest and linear regression algorithms
 
TreeHuggr: Discovering Where Tree-based Classifiers are Vulnerable to Adversa...
TreeHuggr: Discovering Where Tree-based Classifiers are Vulnerable to Adversa...TreeHuggr: Discovering Where Tree-based Classifiers are Vulnerable to Adversa...
TreeHuggr: Discovering Where Tree-based Classifiers are Vulnerable to Adversa...
 
How to not fail at security data analytics (by CxOSidekick)
How to not fail at security data analytics (by CxOSidekick)How to not fail at security data analytics (by CxOSidekick)
How to not fail at security data analytics (by CxOSidekick)
 
Software Faults, Failures and Their Mitigations | Turing100@Persistent
Software Faults, Failures and Their Mitigations | Turing100@PersistentSoftware Faults, Failures and Their Mitigations | Turing100@Persistent
Software Faults, Failures and Their Mitigations | Turing100@Persistent
 
Big data and computing grid
Big data and computing gridBig data and computing grid
Big data and computing grid
 

More from Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing Workload XM 8.7.18
Introducing Workload XM 8.7.18Introducing Workload XM 8.7.18
Introducing Workload XM 8.7.18Cloudera, Inc.
 
Get started with Cloudera's cyber solution
Get started with Cloudera's cyber solutionGet started with Cloudera's cyber solution
Get started with Cloudera's cyber solutionCloudera, Inc.
 

More from Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Cloudera SDX
Cloudera SDXCloudera SDX
Cloudera SDX
 
Introducing Workload XM 8.7.18
Introducing Workload XM 8.7.18Introducing Workload XM 8.7.18
Introducing Workload XM 8.7.18
 
Get started with Cloudera's cyber solution
Get started with Cloudera's cyber solutionGet started with Cloudera's cyber solution
Get started with Cloudera's cyber solution
 

Recently uploaded

What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Developmentvyaparkranti
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odishasmiwainfosol
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...OnePlan Solutions
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalLionel Briand
 
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfkalichargn70th171
 
Software Coding for software engineering
Software Coding for software engineeringSoftware Coding for software engineering
Software Coding for software engineeringssuserb3a23b
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfFerryKemperman
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanyChristoph Pohl
 

Recently uploaded (20)

What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive Goal
 
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
 
Software Coding for software engineering
Software Coding for software engineeringSoftware Coding for software engineering
Software Coding for software engineering
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
Advantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your BusinessAdvantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your Business
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 

Random Decision Forests at Scale

  • 1. 1© Cloudera, Inc. All rights reserved. Random Decision Forests - at Scale Todd M. Boetticher| Solution Consultant
  • 2. 2© Cloudera, Inc. All rights reserved. Overview • Decision trees • Introduction to decision tree’s • Building a decision tree with Spark • Tuning your model with Hyerparameters • Random Decision Forests • Introduction to Random Decision Forests • Handicapping individual learners • Deploying random decision trees in Spark
  • 3. 3© Cloudera, Inc. All rights reserved. Decision Tree The Decision tree is one of the most commonly used classification techniques. According to a recent survey, the decision tree is the most common technique used today. A decision tree is a collection of outcomes that eventually lead to a decision. Like the game 21 questions, decision tree’s use features to split the data into subsets that will give you the best results.
  • 4. 4© Cloudera, Inc. All rights reserved. Benefits and disadvantages of Decision Trees Pros • Computationally cheap to use • Easy for humans to understand learned results • Missing values OK • Capable of dealing with irrelevant feature Cons • Prone to overfitting • Lacks the performance available through other methods • Small changes in the Data can have enormous impacts on the data
  • 5. 5© Cloudera, Inc. All rights reserved. Iris Dataset 6.4, 3.2, 4.5, 1.5, Iris-versacolor
  • 6. 6© Cloudera, Inc. All rights reserved. Possible Decision Trees
  • 7. 7© Cloudera, Inc. All rights reserved. Interpreting Models Simple graphic explanation of how the feature space can be divided into decision boundaries.
  • 8. 8© Cloudera, Inc. All rights reserved. Decision Breakdown def visualize_tree(tree, feature_names): tree -- scikit-learn DecsisionTree. feature_names -- list of feature names. """ with open("dt.dot", 'w') as f: export_graphviz(tree, out_file=f, feature_names=feature_names) command = ["dot", "-Tpng", "dt.dot", "-o", "dt.png"] try: subprocess.check_call(command) except: exit("Could not run dot, ie graphviz, to " "produce visualization")
  • 9. 9© Cloudera, Inc. All rights reserved. Building a Decision Tree
  • 10. 10© Cloudera, Inc. All rights reserved. Name and Address Matching • Potential uses in every industry • Marketing • Defense and law enforcement • Bank Secrecy Act and Patriot Act Compliance RN-KOMSOMOLSKY LLC, 352 DeerPath Ave SW, Leesburg, Virginia, United States KOMSOMOLSKY REFINERY, 1621 Parkcrest Cir, Reston, Virginia, United States VS
  • 11. 11© Cloudera, Inc. All rights reserved. 0.78, 0.0, 0.0, 0.78, 0.78, True Positive Name and Address Matching Name Address City State Country 0.81 0.0 0.0 0.78 1.0 0.6 0.6 0.2 0.0 0.0 0.2 0.0 0.0 0.0 0.0 0.91 0.91 0.91 0.91 1.0 0.78 0.0 0.0 0.78 0.78 0.4 0.4 0.36 0.0 1.0 1.0 0.0 0.0 0.0 0.0 Hits True-Positive False-Positive False-Positive True-Positive True-Positive False-Positive False-Positive
  • 12. 12© Cloudera, Inc. All rights reserved. Building our first decision tree in MLlib
  • 13. 13© Cloudera, Inc. All rights reserved. Evaluating a decision Tree .82243123866534172 ~82.24% accuracy
  • 14. 14© Cloudera, Inc. All rights reserved. Benchmark vs Random .2718652198532764 ~27.19% accuracy
  • 15. 15© Cloudera, Inc. All rights reserved. Hyperparameters Impurity: Gini, Impurity, Variance Maximum Depth Maximum Bin Measures the expected value of the information. The calculation of impurity is generally computed with Gini impurity or entropy Measures the expected value of the information. The calculation of impurity is generally computed with Gini impurity or entropy Measures the expected value of the information. The calculation of impurity is generally computed with Gini impurity or entropy trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
  • 16. 16© Cloudera, Inc. All rights reserved. Impurity: Gini, Impurity, Variance 1 − 𝑃𝑖2H= - 𝑖=1 𝑛 𝑃(𝑥𝑖)𝑙𝑜𝑔2 𝑃(𝑥𝑖) Gini Impurity Entropy Entropy is defined as the expected value of information. First, we need to define information. If you’re classifying something that can take on multiple values, the information for symbol 𝑥 𝑖 is defined as Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset 1 𝑛 𝑖=1 𝑛 (𝑦𝑖−𝜇)2 Variance vi is label for an instance, N is the number of instances and μ is the mean given by 1N∑Ni=1xi1N∑i=1Nxi.
  • 17. 17© Cloudera, Inc. All rights reserved. Maximum Depth • Maximum tree depth is a limit to stop further splitting of nodes when the specified tree depth has been reached during the building of the initial decision tree. • The absolute maximum depth would be N−1, where N is the number of training samples. You can derive this by considering that the least effective split would be peeling off one training example per node. 1. 2. 3.
  • 18. 18© Cloudera, Inc. All rights reserved. Decision Trees to Random Decision Forests
  • 19. 19© Cloudera, Inc. All rights reserved. Wisdom of the crowds • The wisdom of the crowd is the collective opinion of a group of individuals rather than that of a single expert. • A large group's aggregated answers to questions involving quantity estimation, general world knowledge, and spatial reasoning has generally been found to be as good as, and often better than, the answer given by any of the individuals within the group. An explanation for this phenomenon is that there is idiosyncratic noise associated with each individual judgment, and taking the average over a large number of responses will go some way toward canceling the effect of this noise.
  • 20. 20© Cloudera, Inc. All rights reserved. What is Random Forest Random Forests grows many classification trees. To classify a new object from an input vector, put the input vector down each of the trees in the forest. Each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes (over all the trees in the forest).
  • 21. 21© Cloudera, Inc. All rights reserved. How to Create a Crowd?
  • 22. 22© Cloudera, Inc. All rights reserved. Random Decision Forests with Spark MLlib RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins) Number of Trees Feature Subset Strategy Number of trees in the forest. Increasing the number of trees will decrease the variance in predictions, improving the model’s test-time accuracy. Training time increases roughly linearly in the number of trees. Number of features to use as candidates for splitting at each tree node. The number is specified as a fraction or function of the total number of features. Decreasing this number will speed up training, but can sometimes impact performance if too low. ~ 98.67 % Accuracy
  • 23. 23© Cloudera, Inc. All rights reserved. Trees See Subsets of Examples
  • 24. 24© Cloudera, Inc. All rights reserved. Or Subsets of Features
  • 25. 25© Cloudera, Inc. All rights reserved. Thank you tboetticher@cloudera.com

Editor's Notes

  1. Entropy: Information Theory Concept – Claude Shannon Measures mixed-ness, unpredictability of population Lower is better Gini: 1 minus probability that random guess I (probability pi) is correct Lower is better
  2. Table of Contents Introduction 1 Why should I know this? 2 Concept – Example 2 Describe the Iris Data Set 2 Explain what machine learning has to do with the iris data set 2 Physical representation of the decision tree concept 2 Interpreting the Model 2 Building Decision Trees 2 Introduce data set 2 Implement a Decision tree with Data Set 2 Explain how it worked 2 Split the data 2 Calculate Entropy 2 Cache the training data 2 Show current accuracy 2 Benchmark against Random 2 Confusion Matrix 2 Tuning the model (Hyper parameter) 2 Impurity: 2 Maximum Depth: 3 Maximim Bins 3 Try multiple parameters 3 Can the data be improved? 3 Random Decision Forests 3 Wisdom of the crowd 3 Introduce concept of Decision Forests 3 How to create Diversity of opinion 3 Handicap each tree 3 Subset of features 3 Aggregating Results 3 How to implement that in Spark 3                 Introduction -   Good Evening, Today I would like to talk about Random Decision forests and how we can implement them at scale using the Hadoop Ecosystem. I believe that this is important discussion due to the method’s popularity, and its potential impact for most organizations.   Let’s Begin   Concept – Example   Before we can begin talking about Random Decision forests, I believe that it is important that we begin by discussing decision tree’s. If you have ever had the opportunity to play “21 questions” it is a game where an individual picks an object, or place and the competing individual slowly asks binary questions, attempting to shrink the possibilities for the object. This often begins by the person asking if it’s a person or a thing? Generally, these questions are mutually exclusive and collectively exhausted. The inquisitive individual would then ask a follow on question in order to lower the number of possibilities again and again until they know the answer or have run out of available questions. This is a perfect example of a decision tree. The decision tree is one of the most commonly used classification techniques and according to a recent survey, the technique is among most commonly implemented technique today.   The utilization of this technique ranges from agriculture to physics. In physics decision trees have been used for the detection of physical particles, or classification of particle signatures. In other sectors, Decision trees have even been used to build personal learning assistant and classification of sleep patterns.   Due to decisions tree’s ability to be utilized in both classification and regression, the possibilities are almost endless.   Pro’s vs Cons Pro’s - The decision tree is used in so many problems due to its many strengths and short list of disadvantages. One of the key benefits of decision trees is that they implicitly perform variable screening or feature selection naturally. Additionally, the technique is naturally computationally inexpensive and it requires very little effort from the user during data preparation. Lastly, the best feature of using trees for analytics - easy to interpret and explain to executives! This is increasingly important in fields where individuals have to explain why there black box makes specific decisions at certain periods of time. Con’s – Due to the overall flexibility of this technique, models using a decision tree can be prone to overfitting. Overfitting is where the model describes errors in the data and improperly fits a relationship that generally cannot be connected. In addition to their ability to be easily over fitted, the technique also lacks the performance available through other available techniques. Lastly Decision forests can be significantly impacted by small changes in Data over time.   Describe the Iris Data Set We don’t need to understand the relationship in order to model the relationships. We can learn them imperially from the dataset   Explain what machine learning has to do with the iris data set   Attempt to plot the iris Data-set   Graphical example of Knn Iris   Physical representation of the decision tree concept   Interpreting the Model   Introduction to Decision Trees Give Example   Building Decision Trees Introduce data set Implement a Decision tree with Data Set Explain how it worked     Split the data Cache the training data Show current accuracy Benchmark against Random Confusion Matrix   Tuning the model (Hyper parameter) Impurity: Measure how much decision decreases impurity using gini impurity (vs. entropy)   Maximum Depth: Calculates the total depth of the tree and set the number of decisions allowed before terminating a prediction.   Maximim Bins Calculates the total number of rules allowed Fix the training data   Try multiple parameters   Can the data be improved?   Random Decision Forests Wisdom of the crowd Introduce concept of Decision Forests Ensemble Learning   How to create Diversity of opinion Handicap each tree Subset of features Aggregating Results   Hyperparameters in a forest     How to implement that in Spark Determining results from Random Forest