Decision Trees
Legal Notices and Disclaimers
This presentation is for informational purposes only. INTEL MAKES NO WARRANTIES,
EXPRESS OR IMPLIED, IN THIS SUMMARY.
Intel technologies’ features and benefits depend on system configuration and may require
enabled hardware, software or service activation. Performance varies depending on system
configuration. Check with your system manufacturer or retailer or learn more at intel.com.
This sample source code is released under the Intel Sample Source Code License Agreement.
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
Copyright © 2017, Intel Corporation. All rights reserved.
Overview of Classifier Characteristics
(Figure: data points in the X-Y plane with a flexible K-Nearest Neighbors decision boundary)
• For K-Nearest Neighbors, the training data is the model
• Fitting is fast—just store the data
• Prediction can be slow—lots of distances to measure
• Decision boundary is flexible
Overview of Classifier Characteristics
(Figure: logistic curve of probability, from 0.0 to 1.0, versus X, given by y_β(x) = 1 / (1 + e^−(β0 + β1·x + ε)))
• For logistic regression, the model is just the parameters
• Fitting can be slow—must find the best parameters
• Prediction is fast—calculate the expected value
• Decision boundary is simple, less flexible
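To make the recap concrete, here is a minimal, illustrative sketch in scikit-learn (the synthetic dataset and parameter values are assumptions, not part of the deck):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)    # "fitting" mostly just stores the data
logit = LogisticRegression(max_iter=1000).fit(X, y)    # fitting searches for the best parameters

knn.predict(X[:5])      # measures distances to all stored training points
logit.predict(X[:5])    # one dot product pushed through the sigmoid
logit.coef_             # the whole logistic regression "model" is these parameters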
Introduction to Decision Trees
• Want to predict whether to play tennis based on temperature, humidity, wind, and outlook
• Segment the data based on features to predict the result
• Trees that predict categorical results are decision trees
(Figure: example decision tree built one split at a time. Internal nodes test features, e.g. "Temperature: >= Mild" and "Humidity: = Normal"; leaves hold the predictions "Play Tennis" or "No Tennis".)
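To ground the tennis example in code, a minimal sketch (the toy rows, labels, and the max_depth value below are made up for illustration; the deck does not show the underlying dataset):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data mirroring the features named above.
data = pd.DataFrame({
    "Temperature": ["Hot", "Mild", "Cool", "Mild", "Hot", "Cool"],
    "Humidity":    ["High", "Normal", "Normal", "High", "High", "Normal"],
    "Wind":        ["Weak", "Weak", "Strong", "Strong", "Weak", "Weak"],
    "Outlook":     ["Sunny", "Overcast", "Rain", "Sunny", "Rain", "Overcast"],
    "PlayTennis":  ["No", "Yes", "Yes", "No", "No", "Yes"],
})

# scikit-learn trees expect numeric inputs, so one-hot encode the categories.
X = pd.get_dummies(data.drop(columns="PlayTennis"))
y = data["PlayTennis"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
tree.predict(X.iloc[:2])   # predicts the categorical label ("Yes" / "No")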
Regression Trees Predict Continuous Values
• Example: use slope and elevation in the Himalayas
• Predict average precipitation (a continuous value)
• Values at the leaves are averages of their members
(Figure: regression tree whose nodes test "Elevation: < 7900 ft." and "Slope: < 2.5º" and whose leaves hold average precipitation values of 13.67 in., 48.50 in., and 55.42 in.)
Regression Trees Predict Continuous Values
(Figure: regression tree fits to noisy one-dimensional data for max_depth=2 and max_depth=5; x ranges from 0 to 5, predictions from about -2.0 to 2.0)
Source: http://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html
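A sketch along the lines of the scikit-learn example cited above, assuming the same noisy sine-style data: fit two regression trees of different depth and predict a continuous value.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)      # single feature on [0, 5]
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))            # add noise to every 5th point

shallow = DecisionTreeRegressor(max_depth=2).fit(X, y)
deep = DecisionTreeRegressor(max_depth=5).fit(X, y)

X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
y_shallow = shallow.predict(X_test)           # a few piecewise-constant steps
y_deep = deep.predict(X_test)                 # more steps, follows the noise more closely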
Building a Decision Tree
• Select a feature and split the data into a binary tree
• Continue splitting with the available features
How Long to Keep Splitting?
Until:
• Leaf nodes are pure (only one class remains)
• A maximum depth is reached
• A performance metric is achieved
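These stopping rules map directly onto DecisionTreeClassifier hyperparameters; a minimal sketch (the specific values are illustrative assumptions, not recommendations):

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=5,                 # stop at a maximum depth
    min_samples_leaf=10,         # never create leaves smaller than this
    min_impurity_decrease=0.01,  # stop when a split no longer improves the metric enough
)
# With the defaults, splitting also stops on its own once a node is pure
# (it contains only one class).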
Building the Best Decision Tree
• Use greedy search: find the best split at each step
• What defines the best split?
• The one that maximizes the information gained from the split
• How is information gain defined?
(Figure: the "Temperature: >= Mild" split again, with "Play Tennis" and "No Tennis" leaves)
Splitting Based on Classification Error
(Figure: the "Temperature: >= Mild" split. Parent node: 8 Yes, 4 No. "No Tennis" branch: 2 Yes, 2 No. "Play Tennis" branch: 6 Yes, 2 No.)
Classification error equation: E(t) = 1 − max_i [p(i|t)]
Classification error before the split: 1 − 8/12 = 0.3333
Classification error, left side: 1 − 2/4 = 0.5000 (information is lost on a small number of data points)
Classification error, right side: 1 − 6/8 = 0.2500
Classification error change: 0.3333 − (4/12)·0.5000 − (8/12)·0.2500 = 0
• Using classification error, no further splits would occur
• Problem: the end nodes are not homogeneous
• Try a different metric?
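The arithmetic above can be reproduced in a few lines of Python (the helper function is written here just for illustration):

def classification_error(counts):
    """1 - max_i p(i|t) for a node, given its class counts."""
    total = sum(counts)
    return 1 - max(counts) / total

parent, left, right = [8, 4], [2, 2], [6, 2]          # Yes/No counts from the slide

e_parent = classification_error(parent)               # 0.3333
e_left = classification_error(left)                   # 0.5000
e_right = classification_error(right)                 # 0.2500
change = e_parent - (4 / 12) * e_left - (8 / 12) * e_right
print(round(change, 4))                               # 0.0 -> no apparent gain, splitting stalls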
Splitting Based on Entropy
(Figure: the same "Temperature: >= Mild" split. Parent node: 8 Yes, 4 No. "No Tennis" branch: 2 Yes, 2 No. "Play Tennis" branch: 6 Yes, 2 No.)
Entropy equation: H(t) = −Σ_{i=1..n} p(i|t) · log2[p(i|t)]
Entropy before the split: −(8/12)·log2(8/12) − (4/12)·log2(4/12) = 0.9183
Entropy, left side: −(2/4)·log2(2/4) − (2/4)·log2(2/4) = 1.0000
Entropy, right side: −(6/8)·log2(6/8) − (2/8)·log2(2/8) = 0.8113
Entropy change (information gain): 0.9183 − (4/12)·1.0000 − (8/12)·0.8113 = 0.0441
• Splitting based on entropy allows further splits to occur
• Can eventually reach the goal of homogeneous nodes
• Why does this work with entropy but not classification error?
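The same split evaluated with entropy, again as a small illustrative script:

from math import log2

def entropy(counts):
    """-sum_i p(i|t) * log2 p(i|t) for a node, given its class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)

parent, left, right = [8, 4], [2, 2], [6, 2]          # Yes/No counts from the slide

h_parent = entropy(parent)                            # 0.9183
h_left = entropy(left)                                # 1.0000
h_right = entropy(right)                              # 0.8113
info_gain = h_parent - (4 / 12) * h_left - (8 / 12) * h_right
print(round(info_gain, 4))                            # 0.0441 -> positive gain, so keep splitting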
Classification Error vs Entropy
(Figure: classification error and cross entropy plotted against purity from 0.0 to 1.0; both peak at purity 0.5)
• Classification error, E(t) = 1 − max_i [p(i|t)], is a flat function with its maximum at the center
• The center represents ambiguity—a 50/50 split
• Splitting metrics favor results that are furthest away from the center
• Entropy, H(t) = −Σ_{i=1..n} p(i|t) · log2[p(i|t)], has the same maximum but is curved
• The curvature allows splitting to continue until the nodes are pure
• How does this work?
Information Gained by Splitting
• With classification error, the function is flat
• The final average classification error can be identical to the parent's
• This results in premature stopping
• With entropy, the function has a "bulge"
• The bulge allows the average information of the children to be less than the parent's
• This results in information gain and continued splitting
The Gini Index
(Figure: Gini index plotted alongside classification error and cross entropy against purity from 0.0 to 1.0)
Gini index equation: G(t) = 1 − Σ_{i=1..n} [p(i|t)]²
• In practice, the Gini index is often used for splitting
• The function is similar to entropy—it has a bulge
• It does not contain a logarithm
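To see why the curved measures keep splitting going, here is a small sketch that evaluates all three impurity functions for a two-class node as the fraction p of one class varies (a purely illustrative script):

import numpy as np

def misclassification(p):
    return 1 - np.maximum(p, 1 - p)

def cross_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)              # avoid log2(0)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def gini(p):
    return 1 - (p ** 2 + (1 - p) ** 2)

p = np.linspace(0.0, 1.0, 101)
curves = {"error": misclassification(p), "entropy": cross_entropy(p), "gini": gini(p)}
# All three peak at p = 0.5; entropy and Gini are curved ("bulged"), while
# classification error is piecewise linear, which is why averaging the child
# errors can exactly match the parent and stall further splitting.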
Decision Trees are High Variance
• Problem: decision trees tend to overfit
• Small changes in the data greatly affect the prediction (high variance)
• Solution: prune trees
Pruning Decision Trees
• How to decide which leaves to prune?
• Solution: prune based on a classification error threshold, E(t) = 1 − max_i [p(i|t)]
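The deck prunes on a classification error threshold; scikit-learn's built-in mechanism is cost-complexity pruning via the ccp_alpha parameter, which is a different but related technique. A sketch using an illustrative dataset:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate pruning strengths for this training set.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Larger ccp_alpha prunes more aggressively; keep the value that scores best on held-out data.
scores = [
    (alpha, DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
            .fit(X_train, y_train)
            .score(X_test, y_test))
    for alpha in path.ccp_alphas
]
best_alpha, best_score = max(scores, key=lambda s: s[1])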
Strengths of Decision Trees
• Easy to interpret and implement—"if … then … else" logic
• Handle any data category—binary, ordinal, continuous
• No preprocessing or scaling required
DecisionTreeClassifier: The Syntax
Import the class containing the classification method
from sklearn.tree import DecisionTreeClassifier
Create an instance of the class
DTC = DecisionTreeClassifier(criterion='gini',
                             max_features=10, max_depth=5)  # tree parameters
Fit the instance on the data and then predict the expected value
DTC = DTC.fit(X_train, y_train)
y_predict = DTC.predict(X_test)
Tune parameters with cross-validation. Use DecisionTreeRegressor for regression.
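One common way to do that tuning is a grid search with cross-validation; a minimal sketch (the dataset and parameter grid below are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

param_grid = {"max_depth": [2, 3, 5, None], "criterion": ["gini", "entropy"]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

search.best_params_                  # best hyperparameters found by 5-fold cross-validation
best_tree = search.best_estimator_   # already refit on all of X, y with those parameters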