An Automatic Pruning Method for Random Forests
1. Auto-CES:
An Automatic Pruning Method through
Clustering Ensemble Selection
Authors:
• Mojtaba Amiri Maskouni
• Saeid Hosseini
• Hadi Mohammadzadeh Abachi
• Mohammadreza Kangavari
• Xiaofang Zhou
2. Outline
• Background
– Ensemble Classification
– Ensemble Diversity
– Random Forests
• Clustering and Ensemble Diversity
– CLUB-DRF
– Experimental Study
• Summary and Future Work
3. Ensemble Learning:
Learning algorithms that construct a set of trained classifiers whose individual
decisions are combined to classify new examples.
Bagging, boosting, random subspaces, and random forests are among the major
approaches for building ensembles of classifiers.
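The combination step above is most often a simple majority vote. A minimal sketch (the function name and tie-breaking rule are illustrative assumptions, not from the paper):

```python
import numpy as np

def majority_vote(predictions):
    """Combine the class predictions of several classifiers by majority vote.

    predictions: 2-D array of shape (n_classifiers, n_samples), integer labels.
    Returns the most frequent label per sample (ties broken by lowest label).
    """
    predictions = np.asarray(predictions)
    n_samples = predictions.shape[1]
    return np.array([np.bincount(predictions[:, j]).argmax()
                     for j in range(n_samples)])

# Three classifiers vote on four samples; the majority label wins each column.
votes = [[0, 1, 1, 0],
         [0, 1, 0, 0],
         [1, 1, 1, 2]]
print(majority_vote(votes))  # [0 1 1 0]
```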
4. Diversity in Ensembles
• Definition:
– has no single, generally agreed definition
– the capability to maximize prediction correctness for a set of classifiers that are
combined into a single ensemble
– cannot by itself guarantee an accurate estimation outcome
– maximizes stability
• Benefits of augmenting the diversity of an ensemble:
– improves efficiency: higher diversity allows the elimination of similar classifiers
– promotes generalization performance
• Diversification methods:
– bootstrap sampling (bagging)
– random feature selection (random subspace)
5. Random Forests
• An ensemble classification and regression technique introduced by Leo Breiman.
• It generates a diversified ensemble of decision trees by adopting two methods:
– A bootstrap sample is used for the construction of each tree (bagging), resulting in
approximately 63.2% unique samples, with the rest being repeats.
– At each node split, only a random subset of the features is drawn to assess the goodness
of each feature/attribute (commonly √F or log₂F features, where F is the total number of features).
• Adding excessive classifiers to the forest does not improve the accuracy.
– Main challenge: finding the optimum number of classifiers.
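The 63.2% figure follows from sampling n points with replacement: the expected fraction of distinct samples is 1 − (1 − 1/n)ⁿ, which tends to 1 − 1/e ≈ 0.632. A quick simulation confirming this (the seed and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # training-set size

# Draw one bootstrap sample: n draws with replacement from n indices.
sample = rng.integers(0, n, size=n)
unique_fraction = np.unique(sample).size / n

# Expected fraction of distinct samples: 1 - (1 - 1/n)^n -> 1 - 1/e.
print(round(unique_fraction, 3))   # close to 0.632
print(round(1 - 1 / np.e, 3))      # 0.632
```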
6. Cluster Ensemble Selection (CES)
• Definition:
– A joint process that produces a small ensemble (pruning the others) that can perform
classification as effectively as, or even better than, the original ensemble.
– A smaller set can also perform more efficiently than the complete ensemble.
• These methods are two-fold:
– categorize homogeneous classifiers into clusters.
– select a subset from the clusters to maximize diversity between the chosen classifiers.
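The two-fold structure can be sketched with a deliberately simple grouping rule and an accuracy-based pick. This is an illustrative toy, not the clustering rule of any specific method discussed here; the function names and the seed-based grouping are assumptions:

```python
import numpy as np

def greedy_group(dis, eps):
    """Group classifiers whose dissimilarity to a group's seed is <= eps.

    dis: (n, n) symmetric dissimilarity matrix; eps: grouping threshold.
    Returns a list of index lists, one per group.
    """
    groups = []
    for i in range(dis.shape[0]):
        for g in groups:
            if dis[i, g[0]] <= eps:   # compare to the group's first (seed) member
                g.append(i)
                break
        else:
            groups.append([i])        # no close group found: start a new one
    return groups

def select_most_accurate(groups, acc):
    """Keep the single most accurate classifier from each group."""
    return [g[int(np.argmax([acc[i] for i in g]))] for g in groups]

# Classifiers 0 and 1 behave alike (low dissimilarity); 2 is different.
dis = np.array([[0.0, 0.1, 0.9],
                [0.1, 0.0, 0.9],
                [0.9, 0.9, 0.0]])
groups = greedy_group(dis, eps=0.2)                           # [[0, 1], [2]]
print(select_most_accurate(groups, acc=[0.70, 0.80, 0.60]))   # [1, 2]
```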
7. Random forest pruning algorithms based on CES
ERF (Bharathidason, 2014):
1. Sort all trees in descending order of AUC.
2. Select the top P trees with high AUC values.
3. Cluster these P selected trees into Q clusters.
4. Select the tree with the highest AUC from each cluster.
CLUB-DRF (Fawagreh, 2015):
1. Trees are clustered (K-Modes) according to their classification pattern.
2. One or more representatives are chosen from each cluster, either at random or by
high AUC.
Main challenge: both methods require manual parameter setting.
9. Auto-CES consists of the following two stages:
• Clustering: cluster the homogeneous trees based on predefined similarities.
• Selection: select the best tree from each cluster based on the cohesiveness measure.
• Novelties:
– grouping the trees in an automatic way
– defining the cohesiveness measure used to select the trees
10. Clustering step
• Find Epsilon:
\[ DF_{t_i,t_k} = \frac{N^{00}}{N} \]
\[ DIS_{t_i,t_k} = 1 - DF_{t_i,t_k} \]
\[ \varepsilon = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{k=i+1}^{n} DIS_{t_i,t_k} \]
Notation — Description:
• N⁰⁰: the number of jointly misclassified samples of the tree pair (tᵢ, tₖ)
• N: the total number of validation samples
• DF: the double-fault criterion
• DIS: the dissimilarity value
• ε: the epsilon threshold
• n: the total number of classifiers
• tᵢ: the i-th tree
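The epsilon threshold above is just the mean pairwise dissimilarity, since there are n(n−1)/2 pairs. A direct sketch of the computation (the function name and the 0/1 correctness-matrix input format are assumptions for illustration):

```python
import numpy as np

def epsilon_threshold(correct):
    """Mean double-fault dissimilarity over all tree pairs (the epsilon threshold).

    correct: (n_trees, N) 0/1 matrix over validation samples, 1 = correct.
    DF(i,k)  = N00 / N, where N00 counts samples both trees misclassify.
    DIS(i,k) = 1 - DF(i,k); epsilon averages DIS over all n(n-1)/2 pairs.
    """
    correct = np.asarray(correct)
    n, N = correct.shape
    wrong = 1 - correct
    dis_sum = 0.0
    for i in range(n - 1):
        for k in range(i + 1, n):
            df = np.sum(wrong[i] & wrong[k]) / N   # joint misclassification rate
            dis_sum += 1.0 - df
    return 2.0 / (n * (n - 1)) * dis_sum

# Three trees, four validation samples; every pair jointly misses one sample.
print(epsilon_threshold([[1, 1, 0, 0],
                         [1, 0, 0, 1],
                         [0, 1, 0, 1]]))  # 0.75
```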
11. Selection step
• Individual effect (θ_{tᵢ}): the average accuracy over training and validation of the i-th tree.
• General effect (DIV_{tᵢ}): the average of all dissimilarities for the i-th tree.
• Cohesiveness measure (ζ_{tᵢ}): the selection measure that considers both accuracy and diversity:
\[ \zeta_{t_i} = \theta_{t_i} \cdot DIV_{t_i} \]
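The cohesiveness score can be computed per tree as follows. This is a minimal sketch under the definitions above; the function signature and the assumption that `dis` is a precomputed dissimilarity matrix with zero diagonal are illustrative:

```python
import numpy as np

def cohesiveness(train_acc, val_acc, dis):
    """Cohesiveness zeta_i = theta_i * DIV_i for every tree.

    theta_i: mean of training and validation accuracy of tree i.
    DIV_i:   mean dissimilarity of tree i to the other n-1 trees, taken
             from the (n, n) dissimilarity matrix `dis` (zero diagonal).
    """
    theta = (np.asarray(train_acc) + np.asarray(val_acc)) / 2.0
    n = dis.shape[0]
    div = dis.sum(axis=1) / (n - 1)   # average over the other n-1 trees
    return theta * div

# Within each cluster, the tree with the largest zeta would then be kept.
dis = np.array([[0.0, 0.5],
                [0.5, 0.0]])
print(cohesiveness([0.8, 0.6], [0.8, 0.6], dis))  # [0.4 0.3]
```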
12. Testing the pruned RF
(Figure: experimental pipeline — generate the RF, prune it, and evaluate over the
train, validation, and test splits.)
Experimental setup and evaluation measures:
\[ Accuracy = \frac{Correct\ predictions}{All\ predictions} \]
\[ F_1\ score = 2 \times \frac{Precision \times Recall}{Precision + Recall} \]
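Both evaluation measures follow directly from the confusion-matrix counts. A small worked example (the helper names and counts are illustrative assumptions):

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = correct predictions / all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, fp, fn):
    """F1 = 2 * precision * recall / (precision + recall)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 100 predictions: 40 true positives, 45 true negatives, 5 FP, 10 FN.
print(accuracy(40, 45, 5, 10))   # 0.85
print(f1_score(40, 5, 10))       # ~0.8421 (precision 8/9, recall 0.8)
```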
13. Competing methods
1. Breiman’s CART-based RF (BC-RF)
2. CLUB-DRF, which applies the K-Modes clustering model
3. Auto-CES:
– based on both Accuracy and Diversity (B-A-D)
– based on Accuracy alone (B-A)
– based on Diversity alone (B-D)
(Table: description of the data sets.)
14. Performance comparison according to accuracy and test time
(Figures: impact of the number of trees on the accuracy and on the test time.)
• Our approaches always give the same or better results in terms of accuracy
compared with the other rivals.
• Except for the results shown on the Haberman dataset, Auto-CES gains the best
efficiency.
15. Performance comparison according to F-measure
(Figures: results without noise, with 10 percent noise, and with 20 percent noise.)
• At least one of our pruned models gives the same or even better effectiveness in
terms of F-measure.
• As a result, the selected trees create ensembles that achieve higher stability.
16. Impact of noise on the Wilt dataset
(Figures: accuracy without noise, with 10 percent noise, and with 20 percent noise.)
Results based on accuracy over the Wilt data set:
• The trees retrieved by our model gain more stability and robustness.
17. Discussion:
Two reasons support the good results of our method:
1. An essential component in calculating the cohesiveness of each tree is the
average accuracy θ, which is computed during both training and validation.
Hence, the trees that are selected have the highest stability among all trees.
2. The effects of θ and DIV are simultaneously employed to compute the
cohesiveness metric (ζ). As a result, the selected trees create ensembles
that achieve higher robustness.
Future work:
• Extend our algorithm to large-scale environments, including multi-cluster
Spark platforms.