Data Mining (Predict The Future)

Machine Learning for Data Mining
Dr. Dewan Md. Farid
Department of Computer Science & Engineering,
United International University, Bangladesh
December 01, 2016

Outline
Big Data Project
Rule-based Classifier
Class Imbalanced Problem
Active Learning
Ensemble Clustering
Hybrid Classifier

Data Mining: What is Data Mining?
Data mining (DM) is also known as Knowledge Discovery from
Data, or KDD for short, which turns a large collection of data into
knowledge. DM is a multidisciplinary field including machine learning,
artificial intelligence, pattern recognition, knowledge-based systems,
high-performance computing, database technology and data visualisation.
1. Data mining is the process of analysing data from different
perspectives and summarising it into useful information.
2. Data mining is the process of finding hidden information and
patterns in a huge database.
3. Data mining is the extraction of implicit, previously unknown, and
potentially useful information from data.

Machine Learning
Machine learning (ML) provides the technical basis of data mining,
which concerns the construction and study of systems that can learn
from data.
1. Supervised learning/ Classification - the supervision in the
learning comes from the labeled instances.
2. Unsupervised learning/ Clustering - the learning process is
unsupervised since the instances are not class labeled.
3. Semi-supervised learning - uses both labeled and unlabelled
instances when learning a model.
4. Active learning - lets users play an active role in the learning
process. It asks a user (e.g., a domain expert) to label an instance,
which may be from a set of unlabelled instances.

Learning Algorithms
Decision Tree (DT) Induction
Naïve Bayes (NB) Classifier
NBTree Classifier
RainForest and BOAT Classifier
k Nearest Neighbour (kNN) Classifier
Random Forest, Bagging and Boosting (AdaBoost)
Support Vector Machines (SVM)
k Means Clustering
Similarity based Clustering

Mining Big Data
Mining big data is the process of extracting knowledge and uncovering
hidden information from massive amounts of complex data or
databases. Big data is defined by the three V’s:
Volume - the quantity of data.
Variety - the category of data.
Velocity - the speed of data in and out.
A few more V’s are sometimes added to the mix:
Vision - having a purpose/ plan.
Verification - ensuring that the data conforms to a set of
specifications.
Validation - checking that its purpose is fulfilled.

Big Data Project
1. BRiDGEIris - Brussels Big Data Platform for Sharing and Discovery
in Clinical Genomics.
Hosted by IB2
(Interuniversity Institute of Bioinformatics in
Brussels).
Funded by INNOVIRIS
(Brussels Institute for Research and
Innovation).
2. FWO research project G004414N “Machine Learning for Data
Mining Applications in Cancer Genomics”.

BRiDGEIris Project
Brussels big data platform for sharing and discovery in clinical genomics
project aims to answer the research challenges by:
1. Design and creation of a multi-site clinical/phenomic and genomic
data warehouse.
2. Development of automated tools for extracting relevant information
from genetic data.
3. Use of the designed tools to extract new knowledge and transfer it
to the medical setting.

VUB AI Lab (CoMo)
Lab is particularly focused on the aspect of design and developing
strategy for information discovery on genomic and clinical big data
by employing an optimal ensemble method. Goal is to evaluate
ensemble predictive modelling techniques for:
1. Improving the prediction accuracy of variant identification/ genomic
variants classification.
2. Pathology classification tasks.
Developing new methods/ algorithms to deal with the following issues:
Multi-class classification
High-dimensional data
Class imbalanced data
Big data

Brugada syndrome
Brugada syndrome (BrS), also known as sudden adult death
syndrome (SADS), is a genetic disease. It increases the risk of sudden
cardiac death (SCD) at a young age. It is named after the Spanish
cardiologists Pedro Brugada and Josep Brugada.
BrS is detected by abnormal electrocardiogram (ECG) findings called
a type 1 Brugada ECG pattern, which is much more common in men.
BrS is a heart rhythm disorder.
Sudden cardiac death (SCD) is caused when the heart doesn’t pump
effectively and not enough blood travels to the rest of the body.
The Exome datasets of 148 patients have been analysed for Brugada syndrome
at UZ Brussels (Universitair Ziekenhuis Brussel) (www.uzbrussel.be/).

Knowledge Discovery from Genomic Data
Figure: The process of extracting knowledge from genomic data in data mining.
(Diagram labels: Exome 1, Exome 2, ..., Exome 148; Genomic Data Sets; Formatted Data; Gene Panel; Data Preprocessing; Feature Selection; Mining Algorithm; Knowledge Discovery from Genomic Data.)

Genomic Data of BrS
Table: Classiﬁcation of DNA variants for Brugada syndrome.
Class Label
Class I Nonpathogenic
Class II VUS1 - Unlikely pathogenic
Class III VUS2 - Unclear
Class IV VUS3 - Likely pathogenic
Class V Pathogenic

Gene Panel of BrS
Table: Gene panel of Brugada syndrome.
Chromosome Name of Gene
Chr 1 KCND3
Chr 3 SCN5A, GPD1L, SLMAP, CAV3, SCN10A
Chr 4 ANK2
Chr 7 CACNA2D1, AKAP9, KCNH2
Chr 10 CACNB2
Chr 11 KCNE3, SCN3B, SCN2B, KCNJ5, KCNQ1, SCN4B
Chr 12 CACNA1C, KCNJ8
Chr 15 HCN4
Chr 17 RANGRF, KCNJ2
Chr 19 SCN1B, TRPM4
Chr 20 SNTA1
Chr 21 KCNE1, KCNE2
Chr X KCNE1L

Chromosomes
Figure: Chromosomes in 148 Exome Datasets.
(Bar chart: number of variants, axis 0 to 500, for chromosomes 1, 3, 4, 7, 11, 12, 15, 17, 19, 21 and X.)

Genomic Data
Figure: Genomic Data: 148 Exome Datasets.
(Chart: number of variants, axis 0 to 280, for each of the 148 exome data sets; series: Annotated vcf File, Gene Panel, BrS Variants.)

Rule-based Classifier
A rule-based classifier makes it easy to deal with complex classification problems.
It has various advantages:
As highly expressive as a decision tree (DT)
Easy to interpret
Easy to generate
Can classify new instances rapidly
Performance comparable to DT
New rules can be added to existing rules without disturbing ones
already in there
Rules can be executed in any order

Adaptive Rule-based Classifier
It combines the random subspace and boosting approaches with
ensemble of decision trees to construct a set of classification rules for
multi-class classification of biological big data.
Random subspace method (or attribute bagging) to avoid
overfitting
Boosting approach for classifying noisy instances
Ensemble of decision trees to deal with class-imbalance data
It uses two popular classification techniques: decision tree (DT) and
k-nearest-neighbour (kNN) classifiers.
DTs are used for evolving classification rules from the training data.
kNN is used for analysing the misclassified instances and removing
vagueness between the contradictory rules.

Random Subspace & Boosting Method
Random subspace is an ensemble classifier. It consists of several
classifiers each operating in a subspace of the original feature space, and
outputs the class based on the outputs of these individual classifiers.
It has been used for decision trees (random decision forests).
It is an attractive choice for high dimensional data.
Boosting is designed specifically for classification.
It converts weak classifiers to strong ones.
It is an iterative process.
It uses voting for classification to combine the output of individual
classifiers.

Ensemble Classiﬁer
Figure: An example of an ensemble classiﬁer.

Decision Tree Induction
Decision tree (DT) induction is a top down recursive divide and
conquer algorithm for multi-class classiﬁcation task. The goal of DT is to
iteratively partition the data into smaller subsets until all the subsets
belong to a single class. It is easy to interpret and explain, and also
requires little prior knowledge.
Information Gain: ID3 (Iterative Dichotomiser) algorithm
Gain Ratio: C4.5 algorithm
Gini Index: CART algorithm
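
The three splitting criteria named above can be sketched in a few lines. This is a minimal illustration for nominal attributes only; the function names are hypothetical and are not taken from ID3, C4.5 or CART implementations.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gini(labels):
    """Gini index of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def split_by(values, labels):
    """Group the class labels by the value of one nominal attribute."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return list(groups.values())

def information_gain(values, labels):
    """Entropy before the split minus the weighted entropy after it."""
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in split_by(values, labels))
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    """Information gain normalised by the entropy of the attribute values."""
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0
```

The attribute with the highest information gain (or gain ratio), or the lowest weighted Gini index after the split, would be chosen as the splitting attribute.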

Algorithm 1 Decision Tree Induction
Input: D = {x1, · · · , xi , · · · , xN }
Output: A decision tree, DT.
Method:
1: DT = ∅;
2: ﬁnd the root node with best splitting, Aj ∈ D;
3: DT = create the root node;
4: DT = add arc to root node for each split predicate and label;
5: for each arc do
6: Dj created by applying splitting predicate to D;
7: if stopping point reached for this path, then
8: DT = create a leaf node and label it with cl ;
9: else
10: DT = DTBuild(Dj );
11: end if
12: DT = add DT to arc;
13: end for

K-Nearest-Neighbour (kNN) Classifier
The k-nearest-neighbour (kNN) is a simple classifier. It uses
distance measurement techniques that are widely used in pattern recognition.
kNN finds k instances, X = {x1, x2, · · · , xk } ∈ Dtraining that are closest to
the test instance, xtest and assigns the most frequent class label,
cl → xtest among the X. When a classification is to be made for a new
instance, xnew , its distance to each xi ∈ Dtraining must be determined.
Only the k closest instances, X ∈ Dtraining are considered further. The
closest is defined in terms of a distance metric, such as Euclidean
distance. The Euclidean distance between two points,
x1 = (x11, x12, · · · , x1n) and x2 = (x21, x22, · · · , x2n), is shown in Eq. 1
\[ dist(x_1, x_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2} \quad (1) \]

Algorithm 2 k-Nearest-Neighbour classifier
Input: D = {x1, · · · , xi , · · · , xn}
Output: kNN classifier, kNN.
Method:
1: find X ∈ D that identify the k nearest neighbours, regardless of class
label, cl .
2: out of these instances, X = {x1, x2, · · · , xk }, identify the number of
instances, ki , that belong to class cl , l = 1, 2, · · · , M. Obviously,
$\sum_i k_i = k$.
3: assign xtest to the class cl with the maximum number of ki of instances.
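
The three steps of Algorithm 2 can be sketched as follows. This is a minimal illustration, assuming numeric feature vectors and a training set given as (features, label) pairs; the names `euclidean` and `knn_classify` are hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors (Eq. 1)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(train, x_test, k=3):
    """train is a list of (feature_vector, class_label) pairs."""
    # Step 1: the k nearest neighbours, regardless of class label.
    neighbours = sorted(train, key=lambda pair: euclidean(pair[0], x_test))[:k]
    # Step 2: count how many of the neighbours belong to each class.
    votes = Counter(label for _, label in neighbours)
    # Step 3: assign the class with the largest count.
    return votes.most_common(1)[0][0]
```

Sorting the whole training set keeps the sketch short; for large data a partial selection or a spatial index would be a more sensible choice.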

Constructing Classification Rules
Extracting classification rules from DTs is an easy and well-known process.
Rules are as highly expressive as DTs, so the performance of a rule-based
classifier is comparable to that of a DT.
One rule is generated for each leaf of the DT.
Each path in the DT from the root node to a leaf node corresponds to
a rule.
The tree corresponds exactly to the classification rules.
DT vs. Rules
New rules can be added to an existing rule set without disturbing ones
already there, whereas to add to a tree structure may require reshaping
the whole tree. Rules can be executed in any order.
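
The path-to-rule correspondence can be sketched as below. The nested-dictionary tree layout and the names `extract_rules` and `classify` are assumptions made for illustration only.

```python
def extract_rules(node, conditions=()):
    """Return one (conditions, class_label) rule per leaf of the tree.

    An internal node is assumed to be a dict of the form
    {"attribute": name, "branches": {value: child}}; a leaf is a class label.
    """
    if not isinstance(node, dict):
        return [(list(conditions), node)]
    rules = []
    for value, child in node["branches"].items():
        test = (node["attribute"], value)
        rules.extend(extract_rules(child, conditions + (test,)))
    return rules

def classify(rules, instance):
    """Return the label of the first rule whose conditions all hold."""
    for conditions, label in rules:
        if all(instance.get(attr) == value for attr, value in conditions):
            return label
    return None
```

Because rules extracted from a single tree are mutually exclusive, evaluating them in a different order gives the same label.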

Algorithm: Adaptive rule-based (ARB) classifier
It considers a series of k iterations.
Initially, an equal weight, 1/N, is assigned to each training instance.
The weights of training instances are adjusted according to how they
are classified in every iteration.
In each iteration, a sub-dataset Dj is created from the original
training dataset D and previous sub-dataset Dj−1 with maximum
weighted instances. Only the sampling with replacement technique
is used to create the sub-dataset D1 from the original training data
D in the first iteration.
A tree DTj is built from the sub-dataset Dj with randomly selected
features in each iteration.
Each rule is generated for each leaf node of DTj .
Each path in DTj from the root to a leaf corresponds with a rule.

Algorithm 3 Adaptive rule-based classifier.
Input:
D = {x1, · · · , xi , · · · , xN }, training dataset;
k, number of iterations;
DT learning scheme;
Output: rule-set; // A set of classification rules.
Method:
1: rule-set = ∅;
2: for i = 1 to N do
3: weight of xi = 1/N ; // initialising weights of each xi ∈ D.
4: end for
5: for j = 1 to k do
6: if j==1 then
7: create Dj , by sampling D with replacement;
8: else
9: create Dj , by Dj−1 and D with maximum weighted X;
10: end if
11: build a tree, DTj ← Dj by randomly selected features;
12: compute error(DTj ); // the error rate of DTj .
13: if error(DTj ) ≥ threshold-value then
14: go back to step 6 and try again;
15: else
16: rules ← DTj ; // extracting the rules from DTj .
17: end if
18: for each xi ∈ Dj that was correctly classified do
19: multiply the weight of xi by error(DTj ) / (1 − error(DTj )); // update weights.
20: end for
21: normalise the weight of each xi ∈ Dj ;
22: rule-set = rule-set ∪ rules;
23: end for
24: return rule-set;
25: create sub-dataset, Dmisclassified with misclassified instances from Dj ;
26: analyse Dmisclassified employing algorithm 4.

Error Rate Calculation
The error rate of DTj is calculated as the sum of the weights of the misclassified
instances, as shown in Eq. 2, where err(xi ) is the misclassification
error of an instance xi . If an instance, xi is misclassified, then err(xi ) is
one. Otherwise, err(xi ) is zero (correctly classified).
\[ error(DT_j) = \sum_{i=1}^{n} w_i \times err(x_i) \quad (2) \]
If the error rate of DTj is less than the threshold-value, then rules are
extracted from DTj .
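
A small sketch of the weighted error rate and of the weight update in steps 18 to 21 of Algorithm 3. The function names are hypothetical, and the guard on the error range is an added assumption: the update factor is undefined when the error rate is 0 or 1.

```python
def weighted_error(weights, misclassified):
    """Sum of the weights of the misclassified instances (Eq. 2)."""
    return sum(w for w, err in zip(weights, misclassified) if err)

def update_weights(weights, misclassified):
    """Shrink the weights of correctly classified instances, then normalise."""
    error = weighted_error(weights, misclassified)
    if not 0.0 < error < 1.0:
        raise ValueError("error rate must lie strictly between 0 and 1")
    factor = error / (1.0 - error)
    updated = [w if err else w * factor for w, err in zip(weights, misclassified)]
    total = sum(updated)
    return [w / total for w in updated]
```

With four instances of weight 1/4 and one misclassification, the error rate is 0.25 and the misclassified instance ends up with half of the total weight, so it is more likely to be drawn into the next sub-dataset.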

Mining Big Data with Rules
Big data is so big (millions of instances) that we cannot process all
the instances together at the same time.
It is not possible to store all the data in the main memory at a time.
We can create several smaller samples (or subsets) of data from the
big data, each of which fits in main memory.
Each subset of data is used to construct a set of rules, resulting in
several sets of rules.
Then the rules are examined and merged together to
construct the final set of classification rules to deal with big data.
This is possible because new rules can be added to existing rules and
rules can be executed in any order.

Mining Big Data with Rules (con.)

Figure: Mining big data using adaptive rule-based classifier.
(Diagram: Big Data is split into Sub-data 1, Sub-data 2, ..., Sub-data N; an Adaptive Rule-based Classifier is applied to each sub-data; Integrating Rules; Final Classification Rules.)

Reduced-Error Pruning
Split the original data into two parts: (a) a growing set, and (b) a
pruning set.
Rules are generated using the growing set only. So, important rules
might be missed because some key instances had been assigned to the
pruning set.
A condition is deleted from a rule generated from the growing set, and the effect is
evaluated by trying out the truncated rule on the pruning set and
seeing whether it performs better than the original rule.
If the new truncated rule performs better, then this new rule is added
to the rule set.
This process continues for each rule and for each class.
The overall best rules are established by evaluating the rules on the
pruning set.

Algorithm: Analysing Misclassified Instances
To check the classes of misclassified instances we used the kNN classifier
with a feature selection and weighting approach.
We applied DT induction for feature selection and weighting.
We build a tree from the misclassified instances.
Each feature that is tested in the tree, Aj ∈ Dmisclassified , is assigned
a weight 1/d, where d is the depth of the tree.
We do not consider the features that are not tested in the tree for
the similarity measure of the kNN classifier.
We apply kNN classifier to classify each misclassified instance based
on the weighted features.
We update the class label of misclassified instances.
We check for the contradictory rules, if there is any.

Algorithm 4 Analysing misclassified instances
Input: D, original training data;
Dmisclassified , dataset with misclassified instances;
Output: A set of instances, X with right class labels.
Method:
1: build a tree, DT using Dmisclassified ;
2: for each Aj ∈ Dmisclassified do
3: if Aj is tested in DT then
4: assign weight 1/d to Aj , where d is the depth of DT;
5: else
6: do not consider Aj for similarity measure;
7: end if
8: end for
9: for each xi ∈ Dmisclassified do
10: find X ∈ D, with the similarity of weighted A =
{A1, · · · , Aj , · · · , An};
11: find the most frequent class, cl , in X;
12: assign xi ← cl ;
13: end for

Performance Measurement
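The classification accuracy:
\[ accuracy = \frac{\sum_{i=1}^{|X|} assess(x_i)}{|X|}, \quad x_i \in X \quad (3) \]
If xi is correctly classified then assess(xi ) = 1; if xi is misclassified then
assess(xi ) = 0.
\[ precision = \frac{TP}{TP + FP} \quad (4) \]
\[ recall = \frac{TP}{TP + FN} \quad (5) \]
\[ F\text{-}score = \frac{2 \times precision \times recall}{precision + recall} \quad (6) \]
Here TP, FP and FN are the numbers of true positives, false positives and false negatives.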
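
Equations 3 to 6 can be checked on a small example. This sketch computes the measures for a single class treated as positive; the weighted averaging over classes used in the result tables is not shown, and the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the true label (Eq. 3)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall_f(y_true, y_pred, positive):
    """Precision, recall and F-score for one class treated as positive (Eqs. 4-6)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f_score = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_score
```

For true labels [1, 1, 1, 0, 0] and predictions [1, 1, 0, 1, 0], TP = 2, FP = 1 and FN = 1, so precision, recall and F-score are all 2/3 while accuracy is 0.6.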

Experiments on Exome datasets
The performance of the proposed ARB classifier is compared against the RainForest, NB
and kNN classifiers on 148 Exome datasets. The ARB classifier correctly
classifies 91% of the gene variants for BrS using training data. We have
considered five iterations for the proposed ARB classifier on each Exome
dataset.
Table: The accuracy, precision, recall and F-score of RainForest, NB, kNN and
proposed ARB classifier using training data.
Algorithm        Classification   Precision         Recall            F-score
                 accuracy (%)     (weighted avg.)   (weighted avg.)   (weighted avg.)
RainForest       83.33            0.76              0.83              0.79
NB               83.33            0.79              0.83              0.78
kNN              75               0.56              0.75              0.64
ARB classifier   91.66            0.95              0.91              0.92

Experiments on Exome datasets (con.)
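The performance of the proposed ARB classifier is compared against the RainForest, NB
and kNN classifiers using 10-fold cross validation on 148 Exome
datasets.
Table: The accuracy, precision, recall and F-score of RainForest, NB, kNN and
proposed ARB classifier using 10-fold cross-validation.
Algorithm        Classification   Precision         Recall            F-score
                 accuracy (%)     (weighted avg.)   (weighted avg.)   (weighted avg.)
RainForest       58.33            0.46              0.58              0.51
NB               58.33            0.63              0.58              0.6
kNN              50               0.33              0.5               0.4
ARB classifier   75               0.73              0.75              0.68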

Experiments on Exome datasets (con.)
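The performance of the proposed ARB classifier is compared against the RainForest, NB
and kNN classifiers using unseen test variants of 45 Exome datasets;
103 Exome datasets were used for training the models.
Table: The accuracy, precision, recall and F-score of RainForest, NB, kNN and
proposed ARB classifier using testing data.
Algorithm        Classification   Precision         Recall            F-score
                 accuracy (%)     (weighted avg.)   (weighted avg.)   (weighted avg.)
RainForest       50               0.33              0.5               0.4
NB               50               0.25              0.5               0.62
kNN              50               0.25              0.5               0.33
ARB classifier   66.66            0.44              0.66              0.53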

Benchmark Life Sciences Datasets
Table: 10 real benchmark life sciences datasets from UCI (University of
California, Irvine) machine learning repository.
No. Datasets Instances No of Att. Att. Types Classes
1 Appendicitis 106 7 Numeric 2
2 Breast cancer 286 9 Nominal 2
3 Contraceptive 1473 9 Numeric 3
4 Ecoli 336 7 Numeric 8
5 Heart 270 13 Numeric 2
6 Pima diabetes 768 8 Numeric 2
7 Iris 150 4 Numeric 3
8 Soybean 683 35 Nominal 19
9 Thyroid 215 5 Numeric 2
10 Yeast 1484 8 Numeric 10

Classification Accuracy
Table: The classification accuracy (%) of C4.5, kNN, naïve Bayes (NB) and
proposed adaptive rule-based classifier with 10-fold cross validation.
Datasets C4.5 kNN NB Proposed classifier
Appendicitis 85.84 86.79 85.84 87.73
Breast cancer 75.52 73.42 71.67 75.52
Contraceptive 50.98 49.76 48.13 50.1
Ecoli 79.76 83.03 78.86 83.92
Heart 77.40 78.88 83.7 83.7
Pima diabetes 73.82 73.17 76.3 75.65
Iris 96 95.33 96 95.33
Soybean 91.50 90.19 92.97 91.94
Thyroid 98.13 97.2 98.13 98.13
Yeast 56.73 56.94 57.88 61.99

Classification Accuracy (con.)
Figure: Classification accuracy of C4.5, kNN, NB and the adaptive rule-based classifier on the UCI benchmark life sciences data sets (y-axis: classification accuracy, 45 to 100).

Accuracy having 20% noisy instances
Figure: Classification accuracy of C4.5, kNN, NB and the adaptive rule-based classifier on the UCI benchmark life sciences data sets with 20% noisy instances (y-axis: classification accuracy, 40 to 85).

Data Balancing Methods
Classification of multi-class imbalanced data is a difficult task, as real
data sets are noisy, high dimensional and of small sample size, which results in
overfitting and overlapping of classes.
Traditional machine learning algorithms are much more successful at
classifying majority class instances than minority class
instances.
The conventional data balancing methods alter the original data
distribution, so they might suffer from overfitting or drop some
potentially useful information.
We proposed a new method for dealing with multi-class imbalanced data
based on clustering and selecting the most informative instances from the
majority classes.

Classifying Imbalanced Data
Machine learning algorithms successfully classify majority class instances,
but misclassify the minority class instances in many high-dimensional
data sets.
The following methods are used for class imbalance problems:
1. Sampling methods
Under-sampling
Over-sampling
2. Cost-sensitive learning methods (diﬃcult to get the accurate
misclassiﬁcation cost)
3. Ensemble methods
Bagging
Boosting

Proposed Data Balancing Method
Initially, we cluster the majority class instances into several clusters.
Find the most informative instances in each cluster. The informative
instances are those close to the center of the cluster and those close to its border.
Then several data sets are created by combining the most informative
instances of these clusters with the instances of the minority classes.
Every data set should have an almost equal number of
minority and majority class instances.
Finally, multiple classifiers are trained using these data sets. The
voting technique is used to classify the existing/ new instances.
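
A rough sketch of the selection and combination steps, under several assumptions that the slides do not specify: the majority clusters are already given, instances are numeric tuples, "close to the border" is read as "farthest from the cluster centre", and m instances of each kind are kept per cluster. All names are hypothetical.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def informative_instances(cluster, m):
    """Keep the m instances nearest the centre and the m nearest the border."""
    centre = centroid(cluster)
    ranked = sorted(cluster, key=lambda x: math.dist(x, centre))
    picked = ranked[:m] + ranked[-m:]
    return list(dict.fromkeys(picked))  # drop duplicates, keep order

def balanced_datasets(majority_clusters, minority, m):
    """Build one roughly balanced data set per majority cluster."""
    datasets = []
    for cluster in majority_clusters:
        selected = informative_instances(cluster, m)
        data = [(x, "majority") for x in selected]
        data += [(x, "minority") for x in minority]
        datasets.append(data)
    return datasets
```

One classifier would then be trained per data set and their predictions combined by voting; choosing m close to half the number of minority instances keeps each data set roughly balanced.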

Proposed Data Balancing Method (con.)
Figure: Proposed data balancing method.
(Diagram: Imbalanced Data is divided into Majority Classes Instances and Minority Classes Instances; the majority instances are grouped into Cluster 1, Cluster 2, ..., Cluster N; Find Informative Instances in each cluster; Balanced Data 1, 2, ..., N; Classifier 1, 2, ..., N; Combine Votes; Prediction for New Data Instances.)

Performance of Data Balancing Methods
The performance of data balancing methods is measured using the area under the ROC
(Receiver Operating Characteristic) curve (AUC) on 2143 variants of
Brugada syndrome (BrS) of 148 Exome data sets.
Table: Average AUC values of 148 imbalanced Exome data sets for different
imbalance data handling methods.
Algorithm Average AUC value
Random Under-Sampling 0.8923
Random Over-Sampling 0.8673
Bagging 0.8915
Boosting 0.9136
Proposed Method 0.9317

Active Learning
It achieves high accuracy with few labeled instances: the number of instances
needed to learn a concept can often be much lower than the number required in typical
supervised learning.
It interactively queries a user/ expert for class labels of unlabeled
instances.
The objective is to train a classifier using as few labeled instances as
possible by selecting the most informative instances.
Let the data, D, contain both a set of labeled data, DL, and a set of
unlabeled data, DU . Initially, a model, M∗, is trained using DL. Then a
querying function is used to select unlabeled instances, XU ∈ DU , and
to request a user to label them, XU → XL. Then XL is added to DL and
M∗ is trained again. The process repeats until the user is satisfied.
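
The query loop can be sketched as below. The toy model (one centroid per class on one-dimensional data), the margin-based query function and the fixed number of rounds standing in for "until the user is satisfied" are all assumptions for illustration; the names are hypothetical.

```python
def active_learning(labeled, unlabeled, train, uncertainty, oracle, rounds):
    """labeled: list of (x, y) pairs; unlabeled: list of x; oracle: x -> y."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    model = train(labeled)                      # initial model M* from DL
    for _ in range(rounds):
        if not unlabeled:
            break
        # Query the instance the current model is least sure about.
        query = max(unlabeled, key=lambda x: uncertainty(model, x))
        unlabeled.remove(query)
        labeled.append((query, oracle(query)))  # XU -> XL, added to DL
        model = train(labeled)                  # retrain M*
    return model, labeled

def train_centroids(labeled):
    """Toy model: the mean of the 1-D instances of each class."""
    sums = {}
    for x, y in labeled:
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    return {y: total / count for y, (total, count) in sums.items()}

def margin_uncertainty(model, x):
    """Higher when the two nearest class centroids are almost equally far.

    Assumes the model contains at least two classes.
    """
    distances = sorted(abs(x - c) for c in model.values())
    return -(distances[1] - distances[0])
```

With labeled points 0.0 ("a") and 10.0 ("b") and unlabeled points 1.0, 4.9 and 9.0, the first query is 4.9, the point nearest the decision boundary.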

Active Learning (con.)
Figure: Active learning process.
(Diagram labels: Data, D; Labeled Data, DL; Unlabeled Data, DU; Unlabeled Instances, XU; User/ Oracle; Labeled Instances, XL; DL + XL; Ensemble Model, M*.)

Proposed Method
The naïve Bayes (NB) classifier and clustering are used to find the most
informative instances for labeling as part of active learning. The unlabeled
instances are selected for labeling using the following two strategies:
Instances close to the centers of clusters and the borders of clusters.
Instances whose posterior probabilities are equal/ very close.

Performance of Ensemble Methods
Adaptive boosting (the AdaBoost algorithm) is used with the NB classifier as
the base classifier.
Table: The accuracy and F-score of ensemble methods on 2143 DNA variants
of Brugada syndrome.
Algorithm Classification accuracy (%) F-score (weighted avg.)
Random Forest 92.3 0.93
Bagging 87.5 0.83
Boosting 91.66 0.9
AdaBoost with NB classifier 94.73 0.93

Clustering of high-dimensional big data
An ensemble clustering method with feature selection and grouping
approach.
K-means clustering.
Similarity-based clustering.
Biclustering (applied to each cluster generated by ensemble clustering
to find the sub-matrices).
Unlabelled genomic data of Brugada syndrome (148 Exome
datasets).
The proposed method selects the most relevant features in the dataset
and groups them into subsets of features to overcome the problems
associated with the traditional clustering methods.

Clustering
It is the process of grouping a set of instances into clusters (subsets or
groups) so that instances within a cluster have high similarity in
comparison to one another, but are very dissimilar to instances in other
clusters.
Let X be the unlabelled data set, that is,
\[ X = \{x_1, x_2, \cdots, x_N\} \quad (7) \]
The partition of X into k clusters, C1, · · · , Ck , is such that the following
conditions are met:
\[ C_i \neq \emptyset, \quad i = 1, \cdots, k \quad (8) \]
\[ \bigcup_{i=1}^{k} C_i = X \quad (9) \]
\[ C_i \cap C_j = \emptyset, \quad i \neq j, \quad i, j = 1, \cdots, k \quad (10) \]

Challenges
Pattern extracting from the genomic big data.
Genomic data is often too big and too messy.
Genomic data is also high-dimensional, so traditional distance
measures may be dominated by the noise in many dimensions.
In genomic data, we need to ﬁnd not only the clusters of instances
(genes), but for each cluster a set of features (conditions).

k-Means
It defines each cluster center as the mean value of the instances {xi1, xi2, · · · , xiN } ∈ Ci .
It randomly selects k instances, {xk1, xk2, · · · , xkN } ∈ X each of
which initially represents a cluster center.
Each remaining instance, xi ∈ X, is assigned to the cluster to which it is the most similar.
Similarity is measured based on the Euclidean distance between xi and
the center of Ci .
It iteratively improves the within-cluster variation.
A high degree of similarity among instances in clusters is obtained, while
a high degree of dissimilarity among instances in different clusters is
achieved simultaneously. The cluster mean of Ci = {xi1, xi2, · · · , xiN } is
defined in equation 11.
\[ Mean = \bar{C}_i = \frac{\sum_{j=1}^{N} x_{ij}}{N} \quad (11) \]

Algorithm 5 k-Means Clustering
Input: X = {x1, x2, · · · , xN } // A set of unlabelled instances.
k // the number of clusters
Output: A set of k clusters.
Method:
1: arbitrarily choose k number of instances, {xk1, xk2, · · · , xkN } ∈ X as
the initial k clusters center;
2: repeat
3: (re)assign each xi ∈ X to the cluster to which xi is the most similar, based
on the mean value of the instances in the cluster;
4: update the k means, that is, calculate the mean value of the instances
for each cluster;
5: until no change
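
Algorithm 5 can be sketched in plain Python as follows. As an assumption made only to keep the example deterministic, the first k instances are taken as the initial centres instead of an arbitrary choice; `max_iter` is an added safety limit. Requires Python 3.8 or later for `math.dist`.

```python
import math

def mean(points):
    """Component-wise mean of a list of equal-length numeric tuples (Eq. 11)."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def k_means(X, k, max_iter=100):
    """Return (centres, assignment) for the instances in X."""
    centres = [tuple(x) for x in X[:k]]  # assumption: first k instances as centres
    assignment = None
    for _ in range(max_iter):
        # Step 3: (re)assign every instance to its nearest centre.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(x, centres[c])) for x in X
        ]
        if new_assignment == assignment:  # Step 5: stop when nothing changes
            break
        assignment = new_assignment
        # Step 4: recompute the mean of each non-empty cluster.
        for c in range(k):
            members = [x for x, a in zip(X, assignment) if a == c]
            if members:
                centres[c] = mean(members)
    return centres, assignment
```

The result depends on the initial centres, which is one reason the ensemble strategy of running the same algorithm with different initialisations can help.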

Similarity-Based Clustering (SCM)
It is robust to initialisation and to the number of clusters.
It detects clusters of different volumes.
Let’s consider sim(xi , xl ) as the similarity measure between instance xi
and the lth cluster center xl . The goal is to find xl to maximise the total
similarity measure shown in Eq. 12.
\[ J_s(C) = \sum_{l=1}^{k} \sum_{i=1}^{N} f(sim(x_i, x_l)) \quad (12) \]
Where, f (sim(xi , xl )) is a reasonable similarity measure and
C = {C1, · · · , Ck }. In general, SCM uses feature values to check the
similarity between instances. However, any suitable distance measure can
be used to check the similarity between the instances.

Algorithm 6 Similarity-based Clustering
Input: X = {x1, x2, · · · , xN } // A set of unlabelled instances.
Output: A set of clusters, C = {C1, C2, · · · , Ck }.
Method:
1: C = ∅;
2: k = 1;
3: Ck = {x1};
4: C = C ∪ Ck ;
5: for i = 2 to N do
6: for l = 1 to k do
7: ﬁnd the lth cluster center xl ∈ Cl to maximize the similarity
measure, sim(xi , xl );
8: end for
9: if sim(xi , xl ) ≥ threshold value then
10: Cl = Cl ∪ {xi };
11: else
12: k = k + 1;
13: Ck = {xi };
14: C = C ∪ Ck ;
15: end if
16: end for
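
A minimal sketch of Algorithm 6. Two choices here are assumptions, not part of the slides: the similarity is taken as 1 / (1 + Euclidean distance), and the first member of each cluster serves as its centre. Requires Python 3.8 or later for `math.dist`.

```python
import math

def similarity(a, b):
    """Assumed similarity: 1 for identical points, falling towards 0 with distance."""
    return 1.0 / (1.0 + math.dist(a, b))

def similarity_clustering(X, threshold):
    """Return a list of clusters; the number of clusters is not fixed in advance."""
    clusters = [[X[0]]]                      # steps 1-4: first instance starts C1
    for x in X[1:]:
        # Steps 6-8: find the most similar existing cluster centre.
        best = max(clusters, key=lambda c: similarity(x, c[0]))
        if similarity(x, best[0]) >= threshold:
            best.append(x)                   # step 10: join that cluster
        else:
            clusters.append([x])             # steps 12-14: open a new cluster
    return clusters
```

Unlike k-means, the number of clusters is determined by the threshold rather than given as input.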

Ensemble Clustering
Ensemble clustering is a process of integrating multiple clustering
algorithms to form a single strong clustering approach that usually
provides better clustering results. It generates a set of clusters from a
given unlabelled data set and then combines the clusters into final
clusters to improve the quality of individual clustering.
No single cluster analysis method is optimal.
Different clustering methods may produce different clusters, because
they impose different structure on the data set.
Ensemble clustering performs more effectively in high dimensional
complex data.
It’s a good alternative when facing cluster analysis problems.

Ensemble clustering (con.)
Generally three strategies are applied in ensemble clustering:
1. Using different clustering algorithms on the same data set to create
heterogeneous clusters.
2. Using different samples/ subsets of the data with different clustering
algorithms to cluster them to produce component clusters.
3. Running the same clustering algorithm many times on same data set
with different parameters or initialisations to create homogeneous
clusters.
The main goal of the ensemble clustering is to integrate component
clustering into one final clustering with a higher accuracy.

Ensemble clustering on genomic/ biological data
Figure: Pattern extraction from genomic data applying ensemble clustering.
(Diagram labels: Big Biological Data; Preprocessing; Feature Selection; Feature Grouping; Data (four boxes); Ensemble Clustering; Biclustering; Hidden Patterns in Data.)

Data Pre-processing
It transforms raw data into an understandable format, which includes
several techniques:
Data cleaning is the process of dealing with missing values.
Data integration merges data from multiple sources into a
coherent data store like a data warehouse, or integrates metadata.
Data transformation includes the following: (a) normalisation, (b)
aggregation, (c) generalisation, and (d) feature construction.
Data reduction obtains a reduced representation of the data set
(eliminating redundant features/ instances).
Data discretisation reduces the number of values of
a continuous feature by dividing the range of the feature into intervals.

Feature Selection
It is the process of selecting a subset of relevant features from all the
original features in the data.
Feature selection is used mainly for the following three reasons:
Simplification of models
Shorter training times
Reducing overfitting
In biological data, features may contain false correlations and the
information they add is contained in other features. In this work, we have
applied an unsupervised feature selection approach based on measuring
similarities between features by maximum information compression index.
We have quantified the information loss in feature selection with entropy
measure technique. After selecting the subset of features from the data,
we have grouped them into two groups: nominal and numeric features.

Subspace Clustering
The subspace clustering ﬁnds subspace clusters in high-dimensional data.
It can be classiﬁed into three groups:
1. Subspace search methods.
2. Correlation-based clustering methods
3. Biclustering methods.
A subspace search method searches various subspaces for clusters (set
of instances that are similar to each other in a subspace) in the full
space. It uses two kinds of strategies:
Bottom-up approach - start from low-dimensional subspace and
search higher-dimensional subspaces.
Top-down approach - start with full space and search smaller
subspaces recursively.

Algorithm 7 δ-Biclustering
Input: E, a data matrix and δ ≥ 0, the maximum acceptable mean squared
residue score.
Output: EIJ , a δ-bicluster that is a submatrix of E with row set I and
column set J, with a score no larger than δ.
Initialization: I and J are initialized to the instance and feature sets in
the data and EIJ = E.
Deletion phase:
1: compute eiJ for all i ∈ I, eIj for all j ∈ J, eIJ , and H(I, J);
2: if H(I, J) ≤ δ then
3: return EIJ ;
4: end if
5: find the rows i ∈ I with $d(i) = \frac{1}{|J|} \sum_{j \in J} (e_{ij} - e_{iJ} - e_{Ij} + e_{IJ})^2$;
6: find the columns j ∈ J with $d(j) = \frac{1}{|I|} \sum_{i \in I} (e_{ij} - e_{iJ} - e_{Ij} + e_{IJ})^2$;
7: remove rows i ∈ I and columns j ∈ J with larger d;
Addition phase:
1: compute eiJ for all i, eIj for all j, eIJ , and H(I, J);
2: add the columns j ∉ J with $\frac{1}{|I|} \sum_{i \in I} (e_{ij} - e_{iJ} - e_{Ij} + e_{IJ})^2 \le H(I, J)$;
3: recompute eiJ , eIJ and H(I, J);
4: add the rows i ∉ I with $\frac{1}{|J|} \sum_{j \in J} (e_{ij} - e_{iJ} - e_{Ij} + e_{IJ})^2 \le H(I, J)$;
5: for each row i ∉ I do
6: if $\frac{1}{|J|} \sum_{j \in J} (e_{ij} - e_{iJ} - e_{Ij} + e_{IJ})^2 \le H(I, J)$ then
7: add inverse of i;
8: end if
9: end for
10: return EIJ ;
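
The quantities used in both phases can be computed as sketched below. The slides do not spell out H(I, J); this sketch assumes the usual mean squared residue, the average of the squared residues over all cells of the submatrix. The function name is hypothetical.

```python
def mean_squared_residue(E, rows, cols):
    """Return (H, d_row, d_col) for the submatrix of E given by rows and cols."""
    row_mean = {i: sum(E[i][j] for j in cols) / len(cols) for i in rows}   # e_iJ
    col_mean = {j: sum(E[i][j] for i in rows) / len(rows) for j in cols}   # e_Ij
    all_mean = sum(E[i][j] for i in rows for j in cols) / (len(rows) * len(cols))  # e_IJ
    residue = {
        (i, j): (E[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
        for i in rows for j in cols
    }
    H = sum(residue.values()) / (len(rows) * len(cols))
    d_row = {i: sum(residue[i, j] for j in cols) / len(cols) for i in rows}  # d(i)
    d_col = {j: sum(residue[i, j] for i in rows) / len(rows) for j in cols}  # d(j)
    return H, d_row, d_col
```

A matrix whose rows differ only by a constant shift, such as [[1, 2, 3], [2, 3, 4], [3, 4, 5]], has H = 0 and is therefore a δ-bicluster for any δ ≥ 0.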

Clustering of BrS variants
Distribution of BrS variants in clusters using proposed ensemble
clustering.

Experimental Method
To test the performance of clustering algorithms we have used an
unsupervised evaluation method that computes the Compactness (CP) of
clusters, as shown in Eq. 13.
\[ CP = \frac{1}{n} \sum_{l=1}^{k} n_l \left( \frac{\sum_{x_i, x_j \in C_l} d(x_i, x_j)}{n_l (n_l - 1)/2} \right) \quad (13) \]
Where d(xi , xj ) is the distance between two instances in cluster Cl and nl
is the number of instances in Cl . The smaller the CP for a clustering
result, the more compact and better the clustering result.
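
Eq. 13 can be evaluated as in the sketch below. It assumes that n is the total number of instances over all clusters, that d is the Euclidean distance, and that clusters with a single instance contribute nothing; none of these is stated on the slide. Requires Python 3.8 or later for `math.dist`.

```python
import math
from itertools import combinations

def compactness(clusters):
    """Size-weighted average pairwise distance within clusters (Eq. 13)."""
    n = sum(len(c) for c in clusters)         # assumption: n counts all instances
    total = 0.0
    for cluster in clusters:
        n_l = len(cluster)
        if n_l < 2:
            continue                          # no pairs in a singleton cluster
        pair_sum = sum(math.dist(a, b) for a, b in combinations(cluster, 2))
        total += n_l * pair_sum / (n_l * (n_l - 1) / 2)
    return total / n
```

For the clusters {(0,0), (0,2)} and {(5,5), (5,6), (5,7)}, the average pairwise distances are 2 and 4/3, giving CP = (2 × 2 + 3 × 4/3) / 5 = 1.6.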

Results
The proposed ensemble clustering is compared with following clustering
algorithms:
SimpleKMeans (clustering using the k-means method)
XMeans (extension of k-means)
DBScan (nearest-neighbor-based that automatically determines the
number of clusters)
MakeDensityBasedCluster (wrap a clusterer to make it return
distribution and density)
Table: Comparison of clustering results on 148 Exome data sets of BrS.
Clustering Method Compactness (CP)
SimpleKMeans 9.401
XMeans 8.297
MakeDensityBasedCluster 7.483
DBScan 6.351
Ensemble Clustering 5.647

Hybrid Decision Tree & Naïve Bayes Classifiers
The presence of noisy contradictory instances in the training data causes
the learning models to suffer from overfitting and decreases classification
accuracy.
Hybrid Decision Tree (DT) classifier - A naïve Bayes (NB)
classifier is used to remove the noisy troublesome instances from the
training data before the DT induction.
Hybrid Naïve Bayes (NB) classifier - A DT is used to select a
comparatively more important subset of features, to which the
naïve assumption of class conditional independence is applied. It is
extremely computationally expensive for a naïve Bayes classifier to
compute class conditional probabilities for high dimensional data
sets.

Algorithm 8 Decision Tree Induction
Input: D = {x1, x2, · · · , xn} // Training dataset, D, which contains a set
of training instances and their associated class labels.
Output: T, Decision tree.
Method:
1: for each class, Ci ∈ D, do
2: Find the prior probabilities, P(Ci ).
3: end for
4: for each attribute value, Aij ∈ D, do
5: Find the class conditional probabilities, P(Aij |Ci ).
6: end for
7: for each training instance, xi ∈ D, do
8: Find the posterior probability, P(Ci |xi )
9: if xi is misclassiﬁed, then
10: Remove xi from D;
11: end if
12: end for
13: T = ∅;
14: Determine best splitting attribute;
15: T = Create the root node and label it with the splitting attribute;
16: T = Add arc to the root node for each split predicate and label;
17: for each arc do
18: D = Dataset created by applying splitting predicate to D;
19: if stopping point reached for this path, then
20: T = Create a leaf node and label it with an appropriate class;
21: else
22: T = DTBuild(D);
23: end if
24: T = Add T to arc;
25: end for

Algorithm 9 Naïve Bayes classifier
Input: D = {x1, x2, · · · , xn} // Training data.
Output: A classiﬁcation Model.
Method:
1: T = ∅;
2: Determine the best splitting attribute;
3: T = Create the root node and label it with the splitting attribute;
4: T = Add arc to the root node for each split predicate and label;
5: for each arc do
6: D = Dataset created by applying splitting predicate to D;
7: if stopping point reached for this path, then
8: T = Create a leaf node and label it with an appropriate class;
9: else
10: T = DTBuild(D);
11: end if
12: T = Add T to arc;
13: end for
14: for each attribute, Ai ∈ D, do
15: if Ai is not tested in T, then
16: Wi = 0;
17: else
18: d as the minimum depth of Ai ∈ T, and Wi = 1/√d;
19: end if
20: end for
21: for each class, Ci ∈ D, do
22: Find the prior probabilities, P(Ci ).
23: end for
24: for each attribute, Ai ∈ D and Wi ≠ 0, do
25: for each attribute value, Aij ∈ Ai , do
26: Find the class conditional probabilities, P(Aij |Ci )^Wi .
27: end for
28: end for
29: for each instance, xi ∈ D, do
30: Find the posterior probability, P(Ci |xi );
31: end for
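
Steps 21 to 31 can be sketched as follows for nominal attributes. The attribute weights Wi are assumed to have been computed already from the tree (steps 14 to 20); no smoothing is applied, and the data layout and function names are hypothetical.

```python
from collections import Counter, defaultdict

def train_weighted_nb(X, y, weights):
    """X: list of dicts (attribute -> nominal value); weights: attribute -> Wi."""
    class_counts = Counter(y)
    priors = {c: n / len(y) for c, n in class_counts.items()}   # P(Ci)
    cond_counts = defaultdict(Counter)      # (attribute, class) -> value counts
    for xi, c in zip(X, y):
        for attr, value in xi.items():
            if weights.get(attr, 0) != 0:   # attributes with Wi = 0 are skipped
                cond_counts[attr, c][value] += 1
    return priors, class_counts, cond_counts

def posterior(model, weights, x):
    """Normalised P(Ci | x) with each P(Aij | Ci) raised to the power Wi."""
    priors, class_counts, cond_counts = model
    scores = {}
    for c, prior in priors.items():
        score = prior
        for attr, w in weights.items():
            if w != 0:
                p = cond_counts[attr, c][x[attr]] / class_counts[c]
                score *= p ** w
        scores[c] = score
    total = sum(scores.values())
    return {c: (s / total if total else 0.0) for c, s in scores.items()}
```

An attribute tested at depth d in the tree would receive Wi = 1/√d, so attributes near the root influence the posterior more than those tested deeper.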

Accuracy on Benchmark Datasets
Figure: Classiﬁcation accuracy on 10 datasets with 10-fold cross validation.

Novel Class Instances
Figure: Instances with a ﬁxed number of class labels (left) and instances of a
novel class arriving in the data stream (right).

Novel Class Instances (con.)
Figure: Flow chart of classiﬁcation and novel class detection.

Novel Class Instances (con.)

*** THANK YOU ***
