2. Combination of K-Means Method with Davies
Bouldin Index and Decision Tree Method with
Parameter Optimization for Best Performance
Elly Muningsih1
Chandra Kesuma2
Aprih Widayanto3
Sunanto4
Suripah5
2nd International Conference on Advance Information Scientific Development
1,2,4,5 Universitas Bina Sarana Informatika, 3Universitas Nusa Mandiri
5. Clustering is one of the common data mining techniques used to explore hidden
structures in a dataset (Rohini & Suseendran, 2016).
The purpose of clustering is to divide the dataset into groups that share
similar characteristics (Muningsih et al., 2020).
Some popular clustering methods include K-Means, Fuzzy C-Means,
DBSCAN, and K-Medoids.
6. Classification, in contrast, is the process of assigning new observations to
predetermined classes, i.e. supervised learning (Gupta & Chandra,
2020).
Classification techniques work on labeled datasets and are helpful for
predicting the class labels into which new data is classified or categorized
(Diwathe & Dongare, 2017).
Some popular classification methods are Decision Tree, Naïve Bayes,
Neural Network, kNN, and Support Vector Machine.
8. This research develops a combination of the K-Means clustering method and
the Decision Tree classification method.
The K-Means method is used to group the dataset into clusters.
To overcome one of the shortcomings of the K-Means method, determining
the number of clusters, the Davies-Bouldin Index (DBI) is used, where the
smallest value indicates the optimal cluster count.
The clustering result is then used as a class label and classified using the
Decision Tree method with parameter optimization to obtain the highest
performance (accuracy, precision, and recall).
10. Similar research was conducted by Rohini & Suseendran (2016) to
analyze spirometry data, which is widely used in medical applications.
The methods used are the aggregated K-Means and Decision Tree
methods. The investigation showed that, for spirometry data, the
proposed aggregated K-Means and Decision Tree algorithms perform
better than other algorithms such as genetic algorithms, classification
training algorithms, and neural-network-based classification algorithms.
11. Other research, conducted by Khan & Mohamudally (2011), integrated the
K-Means clustering algorithm with the Decision Tree (ID3) algorithm.
Decision Tree (ID3) was chosen for interpreting the K-Means clusters
because it is more user-friendly, faster to generate, and simpler for
explaining understandable decision rules than other data mining
algorithms.
This research produced an efficient data mining algorithm using intelligent
agents, called the Learning Intelligent Agent (LIAgent), capable of performing
classification, clustering, and interpretation tasks on datasets.
13. The method used in this study is data mining with clustering and
classification functions.
Clustering uses the K-Means method and classification uses the
Decision Tree method.
Data processing for this study was conducted using RapidMiner tools.
14. K-Means is a simple and fast method, commonly used because it is easy to
implement and requires a relatively small number of iterations (Lin & Ji, 2020).
The Davies-Bouldin Index (DBI) is a method for evaluating cluster validity,
where the principle of DBI measurement is to maximize the distance between
clusters while minimizing the distance between points within a cluster
(Jumadi Dehotman Sitompul, Salim Sitompul, & Sihombing, 2019).
The smallest DBI value represents the best clustering among the candidates.
15. The Decision Tree is the most commonly used predictive model for
classification. It is a flowchart-like structure in which each node represents a
test on an attribute value, each branch represents a test result, and each
leaf represents a class or class distribution (Jain & Srivastava, 2013).
A Decision Tree contains a root node, branches, and leaf nodes.
The topmost node in the tree is referred to as the root node.
The main goal is to generate a model that predicts the value of a target
variable based on many input variables, which is also used by the decision
tree classification model for prediction-based rules (Rohini & Suseendran,
2016).
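As an illustration of these ideas (a minimal Python/scikit-learn sketch on the standard iris dataset, not the RapidMiner workflow or sales data used in this study):

```python
# Minimal decision tree sketch with scikit-learn (illustrative only).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# criterion and max_depth mirror the kinds of parameters tuned later in this study
tree = DecisionTreeClassifier(criterion="entropy", max_depth=5, random_state=0)
tree.fit(X, y)

# The topmost split (the root node) sits on the most informative attribute
print("root split on feature index:", tree.tree_.feature[0])
```

The root node reported here plays the same role as the root attribute (W8) identified in the results section.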
16. Dataset
The dataset used for processing is a weekly sales transaction dataset with 811
initial records and 104 attributes. The attributes are Product_Code, sales data
from 52 weeks, minimum sales data, maximum sales data, and normalized
weekly data values.
Data Pre-Processing
From the existing attributes, Product_Code (P1, P2, P3, P4, ..., P819) and the
52 weeks of sales data (W0, W1, W2, ..., W51), which contain integer data, were
selected. In the first data processing stage, using the K-Means clustering method,
Product_Code serves as the 'id'. The clustering result is then processed again
using the Decision Tree method, with an additional cluster attribute as the 'label'
whose values are cluster 0, cluster 1, and cluster 2.
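The attribute selection described above can be sketched in Python with pandas (the column names follow the description; a small toy frame stands in for the real 811-record dataset, which is processed in RapidMiner in this study):

```python
# Sketch of the pre-processing step with pandas (toy stand-in data).
import pandas as pd

# Toy stand-in for the weekly sales dataset: Product_Code plus W0..W51
data = {"Product_Code": [f"P{i}" for i in range(1, 6)]}
for w in range(52):
    data[f"W{w}"] = [w % 3, w % 5, w % 7, w % 2, w % 4]
df = pd.DataFrame(data)

ids = df["Product_Code"]              # used as 'id', not as a clustering feature
X = df[[f"W{w}" for w in range(52)]]  # 52 weeks of integer sales counts
print(X.shape)  # (5, 52)
```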
17. Modeling and Evaluation
This research consists of two parts, and data processing was conducted using
RapidMiner tools.
The first part is modelling the K-Means method, as shown in Figure 2.
The processed data is linked with the Replace Missing Value operator to
handle missing data. It is then connected to the Clustering operator, in this
case the K-Means method. The Performance operator is used to obtain the
Davies-Bouldin Index value, where the smallest value indicates the optimal
number of clusters. The test covers cluster counts from 3 to 10.
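The same cluster-count test can be sketched outside RapidMiner; the following Python/scikit-learn fragment runs K-Means for k = 3..10 on synthetic data (not the sales dataset of this study) and selects the k with the smallest DBI:

```python
# Sketch of the cluster-count test: smallest Davies-Bouldin Index wins.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # synthetic data

scores = {}
for k in range(3, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = davies_bouldin_score(X, labels)

best_k = min(scores, key=scores.get)  # smallest DBI = optimal cluster count
print(best_k)
```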
19. The second part is data processing using the Decision Tree
classification method, with the data divided into training data and testing
data at a ratio of 80:20. The operator used to optimize the parameters
for the best performance is Optimize Parameters (Grid).
The modelling for the Decision Tree with parameter optimization is shown in
Figure 3 through Figure 5.
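An analogous grid search can be sketched in Python with scikit-learn (an 80:20 split plus a cross-validated parameter grid; the iris dataset and the grid values are placeholders, not this study's data or final parameters):

```python
# Sketch of the parameter-optimization step, analogous to RapidMiner's
# Optimize Parameters (Grid) operator.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

param_grid = {
    "criterion": ["gini", "entropy"],  # 'entropy' corresponds to information_gain
    "max_depth": [10, 25, 50],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=8)
grid.fit(X_tr, y_tr)

print(grid.best_params_)
print(grid.score(X_te, y_te))  # testing accuracy
```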
21. Figure 4. Decision Tree Modelling Testing
Figure 5. Detail Modelling Decision Tree
23. In general, accuracy, precision, and recall are used to evaluate model
performance. Samples predicted as positive or negative can be identified
from the classification prediction report. At the same time, the data
category can be known, and the counts of the four basic indicators can
be obtained, as shown in Table 1.
          Positive   Negative
True      TP         TN
False     FP         FN
Table 1. Confusion Matrix
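The three reported metrics follow directly from the confusion-matrix counts in Table 1; a small Python check with illustrative numbers (not this study's results):

```python
# Accuracy, precision, and recall from confusion-matrix counts
# (illustrative counts, not the paper's data).
tp, tn, fp, fn = 90, 85, 5, 10

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of predicted positives, how many are correct
recall = tp / (tp + fn)     # of actual positives, how many are found

print(round(accuracy, 3), round(precision, 3), round(recall, 3))  # 0.921 0.947 0.9
```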
24. The first data processing uses the K-Means
method. From the performance tests on
cluster counts 3 to 10, the smallest
Davies-Bouldin Index value found is 0.626,
so the optimal number of clusters is 3,
as shown in Table 2:
Cluster   DBI Value
3         0.626
4         0.864
5         0.777
6         1.988
7         1.939
8         2.204
9         2.342
10        2.178
Table 2. Cluster and DBI Value
25. The clustering result is then processed by the Decision Tree method
with parameter optimization for the best performance; the resulting
parameters and values are:
a. Cross Validation.number_of_folds: 8
b. Decision Tree.criterion: information_gain
c. Decision Tree.maximal_depth: 50
d. Decision Tree.apply_pruning: true
e. Decision Tree.apply_prepruning: true
The resulting performance is given in Table 3:
26.
            Training Data   Testing Data
Accuracy    98.06           97.12
Precision   97.64           97.10
Recall      97.49           96.05
Table 3. Accuracy, Precision and Recall Values
Because the accuracy value of the tests conducted is greater than 90, the
classification in this study falls into the excellent classification category. From
the data processing, the resulting Decision Tree is shown in Figure 6, where
W8 is the root node, the attribute that most determines and affects the
decision tree.
30. Conclusion
1. From the data processing conducted, the K-Means method with the
Davies-Bouldin Index produced an optimal cluster count of 3.
2. The Decision Tree method produces its best performance with parameter
settings, namely Iteration, Cross Validation, Criterion, Maximal Depth,
Pruning and Pre-Pruning.
3. The combination of clustering and classification methods in this study
produced an excellent classification category because the accuracy
value in testing is > 90.
4. In future work, the second part of the study can be compared with other
classification methods such as Naïve Bayes, Neural Network, kNN and
Linear Regression.