The document describes a framework for tackling class imbalance problems in machine learning on semantic web knowledge bases. It proposes combining sampling strategies with ensemble learning methods like bagging to generate multiple balanced training subsets. A technique called Terminological Random Forest is presented which uses terminological decision trees as weak learners. Experiments on several ontologies show the framework improves performance over single classifiers, with matching rates to a reasoner of up to 87% and low commission rates.
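The sampling-plus-bagging idea behind such frameworks can be sketched as follows. This is a minimal illustration, not the paper's actual Terminological Random Forest: the weak learners here are plain callables, and the balanced-subset strategy (bootstrap the minority class, undersample the majority) is one common choice among several.

```python
import random
from collections import Counter

def balanced_bootstrap(pos, neg, rng):
    """Draw a bootstrap sample of the minority class and an equally
    sized random sample of the majority class (undersampling)."""
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    boot_min = [rng.choice(minority) for _ in minority]
    boot_maj = rng.sample(majority, len(minority))
    return boot_min + boot_maj

def bagged_predict(models, x):
    """Majority vote over the ensemble's weak learners."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]
```

Each weak learner would be trained on its own balanced subset; at prediction time the ensemble votes.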
A cluster-based analysis to diagnose students’ learning achievements (Miguel R. Artacho)
The document describes a proposed methodology for diagnosing students' learning achievements using cluster-based analysis. The methodology involves using item response theory to assess students' skill levels on concepts, identifying weaknesses and misconceptions, and clustering students based on similar disabilities. The methodology aims to provide adaptive feedback to help students improve and inform teaching strategies. A software tool was developed to implement the diagnostic assessment and clustering.
This proposal aims to evaluate the effectiveness of computer-assisted pronunciation training (CAPT) tools for vocational college students in Taiwan. The study will involve an experimental group that receives blended CAPT and traditional teaching, and a control group that receives only traditional teaching. Both groups will complete pre- and post-tests to measure pronunciation quality improvements. The proposal outlines the background, purpose, research question, methodology including participants, design, instruments, procedures, and statistical analysis that will be used.
Using Knowledge Building Forums in EFL Classrooms - FIETxs2019 (ARGET URV)
1) The document describes a study that examined the impact of using Knowledge Building forums on the development of English language skills for Spanish students.
2) Sixty-seven Spanish students participated in the study, engaging with Knowledge Building forums and completing pre- and post-tests of their English abilities.
3) The results showed that collaborative writing in the forums significantly improved students' English writing skills and comprehension, but did not necessarily improve their vocabulary or specific grammar skills.
On the Effectiveness of Evidence-based Terminological Decision Trees (Giuseppe Rizzo)
The document presents a framework for evidence-based terminological decision trees (ETDTs) to predict class membership for individuals in description logics. ETDTs combine description logics, decision trees, and Dempster-Shafer theory. Experiments show ETDTs outperform previous approaches by assigning correct membership and limiting omission cases. While performance is similar to terminological decision trees when membership is definite, ETDTs induce better models. Future work includes further experiments, heuristics, combination rules and refinement operators.
Inducing Predictive Clustering Trees for Datatype Properties Values (Giuseppe Rizzo)
The document proposes inducing predictive clustering trees (PCTs) to approximate numerical datatype property values in knowledge bases. PCTs perform multi-target regression by clustering individuals based on description logic concept descriptions, then fitting a predictive model to each cluster. The approach is tested on datasets extracted from DBpedia, showing PCTs outperform alternative methods like terminological regression trees, k-NN, and linear regression in terms of accuracy and efficiency. Future work could explore new refinement operators, heuristics, and linear models at leaf nodes to further improve PCTs for predicting property values in semantic data.
Towards Evidence Terminological Decision Tree (Giuseppe Rizzo)
The document proposes extending Terminological Decision Trees (TDTs) with Dempster-Shafer Theory to handle uncertainty when predicting class membership in ontologies under the open world assumption. It introduces Dempster-Shafer Terminological Decision Trees (DST-TDTs) which associate each node with a concept description and basic belief assignment. An evaluation on several datasets shows DST-TDTs do not clearly outperform standard TDTs due to conservative combination rules and high variance, but the authors identify opportunities to improve the approach through pruning, alternative selection measures, and using linked data.
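The basic belief assignments attached to DST-TDT nodes are merged with Dempster's rule of combination. A minimal sketch of that rule follows; the representation (masses as frozenset-to-float dicts over a small frame) is an assumption of this example, not the paper's implementation.

```python
def dempster_combine(m1, m2):
    """Dempster's rule of combination for two basic belief assignments.
    Each assignment maps frozenset subsets of the frame to a mass;
    conflicting mass (empty intersection) is renormalised away."""
    combined = {}
    conflict = 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + ma * mb
            else:
                conflict += ma * mb
    norm = 1.0 - conflict
    return {s: v / norm for s, v in combined.items()}
```

The conservatism the evaluation mentions comes from exactly this renormalisation step: strongly conflicting sources inflate the remaining masses.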
Inductive Classification through Evidence-based Models and Their Ensemble (Giuseppe Rizzo)
The document presents an ensemble machine learning framework called Evidential Terminological Random Forests (ETRF) for inductive classification over semantic web data. ETRF combines evidential terminological decision trees and Dempster-Shafer theory to make probabilistic membership predictions that account for uncertainty. Experiments show ETRF improves over other models by achieving higher accuracy and lower variance in predictions across several ontologies, while addressing issues like class imbalance. Future work is proposed to further enhance the refinement operators, combination rules, and scale the approach to larger datasets.
Learning Analytics Special Track: A cluster-based analysis to diagnose stude... (Miguel Rodriguez Artacho)
The document proposes a diagnostic test methodology using cluster analysis to identify students' learning disabilities and weaknesses. It uses item response theory to assess students' skill levels on concepts, identifies misconceptions through relationships between test items and concepts, and clusters students based on similar disabilities. The methodology was implemented in a software tool that provides individualized feedback to students on their learning paths.
This document discusses sampling techniques and sample size. It defines key terms like population, sample, sampling frame, and sampling schemes. It describes different sampling methods like probability and non-probability sampling. Probability sampling methods allow results to be generalized to the population and include simple random sampling, systematic sampling, stratified sampling, cluster sampling, and multi-stage sampling. Sample size is determined based on desired confidence level, precision, and power to detect differences between groups. Sample size calculations are provided for estimating population parameters and comparing two groups.
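For instance, the standard textbook formula for the minimum sample size needed to estimate a population proportion, n = z²·p(1−p)/e², can be computed directly. This is a generic formula, not necessarily the exact one presented in the document.

```python
import math

def sample_size_proportion(z, p, e):
    """Minimum n to estimate a population proportion p with margin of
    error e at the confidence level given by z (e.g. 1.96 for 95%)."""
    return math.ceil(z ** 2 * p * (1 - p) / e ** 2)
```

Using the conservative p = 0.5 with a 5% margin at 95% confidence gives the familiar n = 385.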
This document provides an overview of research methods and statistical concepts. It discusses research design types including descriptive, historical, and experimental. Experimental design can be true experiments or quasi-experiments. It also discusses quantitative and qualitative research approaches and mixed methods. Key statistical concepts are defined, such as population, sample, probability and non-probability sampling, and levels of measurement. Common statistical tests are introduced along with important assumptions. The document provides guidance on how to measure learning experimentally using different research designs. It also discusses how to determine appropriate sample sizes and select statistical analyses based on the research questions.
This document discusses key concepts in quantitative techniques related to population, sample, sampling, and sample size calculation. It defines population as the total set of measurements of interest, and sample as a subset of the population. Probability and non-probability sampling methods are described. Probability sampling allows results to be generalized to the population, while non-probability sampling does not. Several probability sampling techniques are explained, including simple random sampling, systematic sampling, stratified sampling, cluster sampling, and multi-stage sampling. The document also covers concepts like sampling error, confidence level, statistical power, and formulas for calculating minimum sample sizes. Sample size determination depends on factors like confidence level, power, expected difference, and standard deviation. The formulas presented can be used to calculate these minimum sample sizes.
Heuristics for the Maximal Diversity Selection Problem (IJMER)
The problem of selecting k items from among a given set of N items such that the ‘diversity’ among the k items is maximum is a classical problem with applications in many diverse areas, such as forming committees, jury selection, product testing, surveys, plant breeding, ecological preservation, and capital investment. A suitably defined distance metric is used to determine the diversity. However, this is a hard problem, and the optimal solution is computationally intractable. In this paper we present the experimental evaluation of two approximation algorithms (heuristics) for the maximal diversity selection problem.
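A common greedy constructive heuristic for this problem can be sketched as follows. It is an illustrative approximation, not necessarily one of the two heuristics evaluated in the paper: start from the farthest pair, then repeatedly add the item that maximises total distance to the current selection.

```python
from itertools import combinations

def greedy_max_diversity(items, k, dist):
    """Greedy heuristic for maximal diversity selection: seed with the
    farthest pair, then add the item farthest from the chosen set."""
    a, b = max(combinations(range(len(items)), 2),
               key=lambda p: dist(items[p[0]], items[p[1]]))
    chosen = {a, b}
    while len(chosen) < k:
        best = max((i for i in range(len(items)) if i not in chosen),
                   key=lambda i: sum(dist(items[i], items[j]) for j in chosen))
        chosen.add(best)
    return [items[i] for i in chosen]
```

With a distance metric like absolute difference this runs in O(N²) per step, trading optimality for tractability.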
This document discusses various sampling methods for research. It begins by defining key terms like population, study population, and sample. It then describes and provides examples of both probability sampling methods like simple random sampling, systematic sampling, and multistage sampling as well as non-probability sampling methods like purposive sampling, snowball sampling, and quota sampling. The document explains how to determine sample size and discusses concepts like sampling error, bias, and representativeness. It emphasizes that the choice of sampling method depends on the research purpose and design.
This chapter discusses sampling and sampling distributions. It covers different sampling methods like simple random sampling, stratified sampling, and cluster sampling. The key concepts explained are sampling frame, sampling distribution, standard error of the mean, and the Central Limit Theorem. The Central Limit Theorem states that as the sample size increases, the sampling distribution of the mean will approach a normal distribution, even if the population is not normally distributed.
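The Central Limit Theorem behaviour described above is easy to check by simulation; the skewed toy population below is invented for illustration.

```python
import random
import statistics

def sampling_distribution_of_mean(population, n, reps, seed=0):
    """Simulate the sampling distribution of the sample mean by drawing
    `reps` samples of size n with replacement from the population."""
    rng = random.Random(seed)
    return [statistics.mean(rng.choices(population, k=n)) for _ in range(reps)]
```

Even for a strongly skewed population, the simulated means centre on the population mean with spread close to the standard error σ/√n, as the theorem predicts.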
This document summarizes ensemble classification methods including bagging, boosting, and random forests. It discusses discriminative vs generative models and reviews literature on various machine learning algorithms. It provides details on bagging, boosting, random forests algorithms and compares their pros and cons. It discusses empirical comparisons of algorithm performance on different datasets and problems.
The document discusses progressive decision trees, which aim to overcome some limitations of classical decision trees. Progressive decision trees break the classification problem into a sequence of simpler sub-problems using small decision trees. Three types of cascading progressive decision trees are described (Type A, B, C) which differ in how information is passed between trees. Experimental results on document layout recognition, hyperspectral imaging, brain tumour classification, and UCI datasets show that progressive decision trees can improve accuracy and reduce costs compared to single decision trees. Further research opportunities in progressive decision trees are also outlined.
Ensemble Learning Featuring the Netflix Prize Competition and ... (butest)
The document discusses ensemble learning methods for improving prediction accuracy. It provides an overview of using multiple models together (ensembles) and techniques like bagging and boosting to increase diversity among models. Bagging involves training models on different subsets of data, while boosting incrementally focuses on misclassified examples. The Netflix Prize is used as a case study, where top teams achieved over 5% better accuracy than Netflix by developing diverse ensembles of up to 100 models using different algorithms and inputs.
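A minimal form of the blending used by such ensembles is a weighted average of the individual models' predicted ratings. This is illustrative only; the actual Netflix Prize blends were far more elaborate, often learned by regression over hundreds of models.

```python
def blend(predictions, weights):
    """Weighted average of several models' predictions for one item.
    Weights would typically be fit on a held-out validation set."""
    total = sum(weights)
    return sum(p * w for p, w in zip(predictions, weights)) / total
```

Giving a stronger model double weight pulls the blended rating toward its prediction.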
A decision tree is a map of the possible outcomes of a series of related decisions. It allows an individual or organization to weigh possible actions against one another based on their costs, probabilities, and benefits. Decision trees can be used either to drive informal discussion or to map out an algorithm that mathematically predicts the best choice.
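The expected-value calculation such a tree supports can be sketched as follows; the nested-tuple node format is an assumption of this toy example.

```python
def expected_value(node):
    """Evaluate a small decision tree: leaves are payoffs (benefit minus
    cost), chance nodes average branches by probability, and decision
    nodes take the best branch."""
    kind = node[0]
    if kind == "leaf":
        return node[1]
    if kind == "chance":
        return sum(p * expected_value(child) for p, child in node[1])
    if kind == "decision":
        return max(expected_value(child) for child in node[1])
    raise ValueError(f"unknown node kind: {kind}")
```

For example, a launch with a 40% chance of a 100 payoff and a 60% chance of a -20 loss has expected value 28, so a decision node prefers it to a safe payoff of 10.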
Meaning & Definition of Population & Sampling, Types of Sampling - Probability & Non-Probability Sampling Techniques, Characteristics of Probability Sampling Techniques, Types of Probability Sampling Techniques, Characteristics of Non-Probability Sampling Techniques, Types of Non-Probability Sampling Techniques, Errors in Sampling, Size of sample, Application of Sampling Technique in Research
Research on multi-class imbalance from a number of researchers faces obstacles in the form of poor data diversity and a large number of classifiers. The Hybrid Approach Redefinition-Multiclass Imbalance (HAR-MI) method is a hybrid ensemble method developed from the Hybrid Approach Redefinition (HAR) method. This study compared its results with the Dynamic Ensemble Selection-Multiclass Imbalance (DES-MI) method in handling multiclass imbalance. In the HAR-MI method, the preprocessing stage is carried out using the random balance ensembles method and dynamic ensemble selection, and the processing stage is carried out using different contribution sampling and dynamic ensemble selection, to produce a candidate ensemble. The research was conducted using multi-class imbalance datasets sourced from the KEEL Repository. The results show that the HAR-MI method can overcome multi-class imbalance with better data diversity, a smaller number of classifiers, and better classifier performance compared to the DES-MI method. These results were tested with a Wilcoxon signed-rank statistical test, which showed the superiority of the HAR-MI method with respect to the DES-MI method.
CABT SHS Statistics & Probability - Sampling Distribution of Means (Gilbert Joseph Abueg)
This document is a presentation on sampling distributions of means for a Grade 11 Statistics and Probability lecture. It begins by defining populations and samples, and explaining how inferential statistics makes conclusions about populations based on sample data. It then discusses different sampling techniques like simple random sampling, systematic random sampling, stratified random sampling and cluster sampling. The key concepts of parameters, statistics, and sampling distributions are also introduced. Examples are provided to illustrate how to construct sampling distributions of means.
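A sampling distribution of means like those in such examples can be built exhaustively for a tiny population by enumerating every possible sample; the population values below are invented for illustration.

```python
from itertools import combinations
from statistics import mean

def sampling_distribution(population, n):
    """All possible sample means for samples of size n drawn without
    replacement, returned as a mean -> probability mapping."""
    samples = list(combinations(population, n))
    counts = {}
    for s in samples:
        m = mean(s)
        counts[m] = counts.get(m, 0) + 1
    return {m: c / len(samples) for m, c in counts.items()}
```

A key property to verify: the mean of the sampling distribution equals the population mean.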
Probability density estimation using Product of Conditional Experts (Chirag Gupta)
This document discusses probability density estimation using a product of conditional experts model. It summarizes that density estimation constructs a probability distribution function from observed data to understand the underlying pattern. A product of conditional experts model is proposed, where simple classification models like logistic regression are used as experts to estimate the conditional probability. The experts are combined by multiplying their probabilities. The model is trained using gradient ascent to maximize the log probability. When evaluated on artificial and real datasets, the product of conditional experts model is shown to learn distributions close to the true distributions and generalize better than linear and non-linear baseline models. The document also explores applying the model to outlier detection.
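The combination step, multiplying the experts' probabilities and renormalising, can be sketched as follows. This is a simplified illustration of the product-of-experts idea, not the paper's trained model.

```python
def product_of_experts(expert_probs):
    """Combine experts by multiplying their probabilities for each
    outcome, then renormalising so the result sums to one."""
    outcomes = expert_probs[0].keys()
    scores = {o: 1.0 for o in outcomes}
    for probs in expert_probs:
        for o in outcomes:
            scores[o] *= probs[o]
    z = sum(scores.values())
    return {o: s / z for o, s in scores.items()}
```

Because the probabilities multiply, any single confident expert can sharply veto an outcome, which is the characteristic behaviour of product models.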
Ensemble learning methods were very successful in the Netflix Prize competition to improve movie recommendations. These methods combine the predictions from multiple models to obtain better accuracy than single models. Popular ensemble techniques included bagging, boosting, and random forests. The winning teams in the Netflix Prize all used ensemble methods that blended the predictions of dozens or hundreds of individual models.
Statistics involves collecting, organizing, analyzing, and interpreting data. Descriptive statistics describe characteristics of a data set through measures like central tendency and variability. Inferential statistics draw conclusions about a population based on a sample. Key terms include population, sample, parameter, statistic, data types, levels of measurement, and sampling techniques like simple random sampling. Common data gathering methods are interviews, questionnaires, and registration records. Data can be presented textually, in tables, or graphically through charts, graphs, and maps.
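As a small illustration of the descriptive measures mentioned, Python's standard statistics module computes common measures of central tendency and variability; the data values here are invented.

```python
from statistics import mean, median, stdev

scores = [4, 8, 15, 16, 23, 42]
print("mean:", mean(scores))      # central tendency
print("median:", median(scores))  # robust central tendency
print("stdev:", stdev(scores))    # variability (sample standard deviation)
```

Note the mean (18) sits above the median (15.5) because the largest value skews the data.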
Similar to Tackling the Class Imbalance Learning Problem in Semantic Web Knowledge bases (20)
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...Marlon Dumas
This webinar discusses the limitations of traditional approaches for business process simulation based on had-crafted model with restrictive assumptions. It shows how process mining techniques can be assembled together to discover high-fidelity digital twins of end-to-end processes from event data.
Did you know that drowning is a leading cause of unintentional death among young children? According to recent data, children aged 1-4 years are at the highest risk. Let's raise awareness and take steps to prevent these tragic incidents. Supervision, barriers around pools, and learning CPR can make a difference. Stay safe this summer!
06-20-2024-AI Camp Meetup-Unstructured Data and Vector DatabasesTimothy Spann
Tech Talk: Unstructured Data and Vector Databases
Speaker: Tim Spann (Zilliz)
Abstract: In this session, I will discuss the unstructured data and the world of vector databases, we will see how they different from traditional databases. In which cases you need one and in which you probably don’t. I will also go over Similarity Search, where do you get vectors from and an example of a Vector Database Architecture. Wrapping up with an overview of Milvus.
Introduction
Unstructured data, vector databases, traditional databases, similarity search
Vectors
Where, What, How, Why Vectors? We’ll cover a Vector Database Architecture
Introducing Milvus
What drives Milvus' Emergence as the most widely adopted vector database
Hi Unstructured Data Friends!
I hope this video had all the unstructured data processing, AI and Vector Database demo you needed for now. If not, there’s a ton more linked below.
My source code is available here
https://github.com/tspannhw/
Let me know in the comments if you liked what you saw, how I can improve and what should I show next? Thanks, hope to see you soon at a Meetup in Princeton, Philadelphia, New York City or here in the Youtube Matrix.
Get Milvused!
https://milvus.io/
Read my Newsletter every week!
https://github.com/tspannhw/FLiPStackWeekly/blob/main/141-10June2024.md
For more cool Unstructured Data, AI and Vector Database videos check out the Milvus vector database videos here
https://www.youtube.com/@MilvusVectorDatabase/videos
Unstructured Data Meetups -
https://www.meetup.com/unstructured-data-meetup-new-york/
https://lu.ma/calendar/manage/cal-VNT79trvj0jS8S7
https://www.meetup.com/pro/unstructureddata/
https://zilliz.com/community/unstructured-data-meetup
https://zilliz.com/event
Twitter/X: https://x.com/milvusio https://x.com/paasdev
LinkedIn: https://www.linkedin.com/company/zilliz/ https://www.linkedin.com/in/timothyspann/
GitHub: https://github.com/milvus-io/milvus https://github.com/tspannhw
Invitation to join Discord: https://discord.com/invite/FjCMmaJng6
Blogs: https://milvusio.medium.com/ https://www.opensourcevectordb.cloud/ https://medium.com/@tspann
https://www.meetup.com/unstructured-data-meetup-new-york/events/301383476/?slug=unstructured-data-meetup-new-york&eventId=301383476
https://www.aicamp.ai/event/eventdetails/W2024062014
Tackling the Class Imbalance Learning Problem in Semantic Web Knowledge bases
1. Tackling the Class-Imbalance Learning Problem in
Semantic Web knowledge bases
19th International Conference on Knowledge Engineering and Knowledge
Management
Giuseppe Rizzo, Claudia d’Amato, Nicola Fanizzi and Floriana Esposito
Dipartimento di Informatica
Università degli Studi di Bari "Aldo Moro", Bari, Italy
November 24 - 28, 2014
G.Rizzo et al. (DIB - Univ. Aldo Moro) Tackling Class Imbalance Learning Problem November 24 - 28, 2014 1 / 20
2. Outline
1 Introduction & Motivations
2 The framework
3 Experiments
4 Conclusions and Extensions
3. Introduction & Motivations
Introduction
In the context of Semantic Web, procedures for deciding the
membership of an individual w.r.t. a query concept exploit automated
reasoning techniques
The quality of inferences can be affected by the uncertainty originating
from the distributed nature of the Semantic Web
the inherent incompleteness, due to the Open World Assumption
inconsistency, due to the diverse quality of the ontologies
4. Introduction & Motivations
Introduction
Machine learning algorithms can be employed to support query
answering tasks (e.g., class-membership prediction)
statistical regularities are exploited to infer new assertions
The quality of inductive approaches depends on the training set
composition
Given a query concept, it is easier to find uncertain-membership
examples than individuals that belong to the target concept (or to its
complement)
the quality of predictions can be poor
A problem of class-imbalance occurs
5. Introduction & Motivations
Motivations
In machine learning, most solutions are based on sampling methods
Undersampling methods are typically based on (random or
informed) procedures for discarding training instances
this can cause a loss of information
Oversampling methods require that some training instances are
replicated
the produced model can overfit the training data
These problems must be mitigated
6. The framework
The proposed approach
Combining the sampling strategy with ensemble learning methods
Ensemble learning methods require training a set of classifiers
(weak learners)
predictions are combined by a meta-learner for deciding the final
answer
Specifically, the proposed solution is based on bagging methods
several bootstrap samples are generated by sampling with
replacement
a model is induced for each sample
predictions are made by a voting procedure
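The bagging scheme above can be sketched as follows. This is a generic illustration, not the authors' implementation: `learn` is a placeholder for any weak-learner training routine, and a model is simply a callable on an instance.

```python
import random
from collections import Counter

def bootstrap_sample(examples, rng):
    """Draw |examples| items with replacement (one bootstrap sample)."""
    return [rng.choice(examples) for _ in examples]

def bagging_train(examples, n_models, learn, seed=0):
    """Train one weak learner per bootstrap sample."""
    rng = random.Random(seed)
    return [learn(bootstrap_sample(examples, rng)) for _ in range(n_models)]

def bagging_predict(models, x):
    """Combine the weak learners' answers by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]
```

In the TRF setting sketched later in the slides, `learn` would build a terminological decision tree from the rebalanced sample.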
7. The framework
Terminological Random Forests
In this work, we developed Terminological Random Forests (TRFs) for
class-membership prediction, which extend the Terminological Decision
Tree (TDT) model.
Given a knowledge base K = (T, A), a Terminological Decision Tree is a
binary tree where:
each node contains a conjunctive concept description D;
each departing edge is the result of an instance-check test w.r.t. D,
i.e., given an individual a, K |= D(a)?
if the node containing E is the parent of the node containing D, then
D is obtained through a refinement operator and one of the following
conditions must hold:
D introduces a new concept name (or its complement),
D is an existential restriction,
D is a universal restriction of one of its ancestors.
8. The framework
Terminological Random Forests
A TRF is an ensemble of TDTs such that:
each TDT is trained on a re-balanced subset of examples extracted
from the original training set
each TDT is built using a downward refinement operator and a
random selection of concept description candidates
a voting rule is employed to decide the membership
9. The framework
Learning Terminological Random Forests
In order to learn a TRF, given
a target concept C
the number of trees n
a training set Tr = ⟨Ps, Ns, Us⟩
Ps = {a ∈ Ind(A) | K |= C(a)}
Ns = {b ∈ Ind(A) | K |= ¬C(b)}
Us = {c ∈ Ind(A) | K ⊭ C(c) ∧ K ⊭ ¬C(c)}
the algorithm can be summarized as follows:
build n rebalanced bootstrap samples
learn a TDT model from each bootstrap sample
10. The framework
Learning Terminological Random Forests
Procedure for building the rebalanced bootstrap sample
In order to mitigate the drawback of the under-sampling procedure,
a two-step approach is employed.
First, a stratified sampling with replacement procedure is employed
in order to represent the minority-class instances in the bootstrap
sample.
Then, the majority-class instances (either positive or negative) and
the uncertain-membership instances are discarded.
11. The framework
Learning Terminological Random Forests
Learning TDTs
Given a bootstrap sample Di , a TDT is trained according to a
recursive strategy
Starting from the root, the method refines the concept description
installed in the current node
Various candidates are returned and a subset of concepts is selected by
randomly choosing its elements
Best concept: the one that maximizes the information gain w.r.t. the
previous level
Split the instances according to the results of the instance check test
uncertain-membership instances are replicated in both recursive calls
Stop conditions: the node is pure w.r.t. the membership
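The random candidate selection and the information-gain test can be sketched as follows. This is a simplification: `score` stands in for the gain computed on a candidate split, and the two-class entropy below ignores the uncertain-membership instances.

```python
import math
import random

def entropy(pos, neg):
    """Binary entropy of a node holding pos positive and neg negative
    instances (0.0 for a pure node)."""
    if pos == 0 or neg == 0:
        return 0.0
    p = pos / (pos + neg)
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def pick_best_candidate(candidates, score, k, seed=0):
    """Randomly choose k refinement candidates, keep the best-scoring one."""
    rng = random.Random(seed)
    subset = rng.sample(candidates, min(k, len(candidates)))
    return max(subset, key=score)
```

A pure node has zero entropy, which matches the stop condition above: once all instances in a node share the same membership, no further refinement is needed.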
12. The framework
Predicting unseen individuals
Given a forest F and a new individual a, the algorithm collects the
predictions returned by each TDT and decides the class according to
the majority vote rule
The class-membership returned by a TDT is decided by traversing
recursively the tree (until a leaf is reached) according to the instance
check test result.
For a concept description D installed in a node:
if K |= D(a) the left branch is followed
if K |= ¬D(a) the right branch is followed
if neither K |= D(a) nor K |= ¬D(a) holds, uncertain membership is
assigned to a
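The three-way traversal and the majority vote can be sketched as follows. This is a toy model under stated assumptions: concepts are plain strings and `entails` is a stand-in for the reasoner's instance check, so nothing here is tied to a real DL reasoner API.

```python
from collections import Counter

POS, NEG, UNC = +1, -1, 0  # definite memberships and the uncertain case

class TDTNode:
    def __init__(self, concept=None, left=None, right=None, leaf=None):
        self.concept = concept  # concept description D (string stand-in)
        self.left, self.right, self.leaf = left, right, leaf

def classify(node, a, entails):
    """Traverse a TDT: left branch if K |= D(a), right if K |= ¬D(a),
    otherwise the membership of a is uncertain."""
    if node.leaf is not None:
        return node.leaf
    if entails(node.concept, a):               # K |= D(a)
        return classify(node.left, a, entails)
    if entails("¬(" + node.concept + ")", a):  # K |= ¬D(a)
        return classify(node.right, a, entails)
    return UNC

def forest_predict(trees, a, entails):
    """Majority vote over the answers of all trees in the forest."""
    votes = Counter(classify(t, a, entails) for t in trees)
    return votes.most_common(1)[0][0]
```

Note that the vote is over three values, so a forest can still answer "uncertain" when most trees cannot commit to a definite membership.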
13. Experiments
Experiments
15 query concepts have been randomly generated
10-fold cross validation as the experimental design
number of randomly selected candidates: |ρ(·)|
Stratified sampling rates: no sampling, 50%, 70%, 80%
Using a reasoner to decide the ground truth:
match: rate of the test cases (individuals) for which the inductive
model and a reasoner predict the same membership (i.e. +1 | +1,
−1 | −1, 0 | 0);
commission: rate of the cases for which predictions are opposite (i.e.
+1 | −1, −1 | +1);
omission: rate of test cases for which the inductive method cannot
determine a definite membership (−1, +1) while the reasoner is able to
do it;
induction: rate of cases where the inductive method can predict a
membership while it is not logically derivable.
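Counting the four rates against a reasoner's answers can be sketched as follows, encoding the definite memberships as +1 and −1 and an undetermined answer as 0, as in the slide.

```python
def evaluate(inductive, deductive):
    """Compare inductive predictions with a reasoner's answers
    (+1, -1, or 0 for undetermined) and compute the four rates."""
    n = len(inductive)
    match = commission = omission = induction = 0
    for ind, ded in zip(inductive, deductive):
        if ind == ded:
            match += 1                 # same membership predicted
        elif ind != 0 and ded != 0:
            commission += 1            # opposite definite answers
        elif ind == 0:
            omission += 1              # reasoner decides, the model does not
        else:
            induction += 1             # model decides, not logically derivable
    return {k: v / n for k, v in
            (("match", match), ("commission", commission),
             ("omission", omission), ("induction", induction))}
```

The four rates partition the test cases, so they always sum to 1.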
18. Experiments
Considerations and Lessons Learnt
improvement w.r.t. TDTs
small changes in match rate as the number of trees increases
weak diversification (overlap) between trees when increasing the
number of trees
there is no need to set high values for these parameters
e.g., 10-tree TRFs with a sampling rate of 50% are accurate enough
the small-disjuncts problem, due to poorly discriminative concepts
generated by the refinement operator, is the cause of:
misclassification cases, mitigated by the presence of other trees
a bottleneck in the learning phase
execution times span from a few minutes to almost 10 hours
19. Conclusions and Extensions
Conclusions and Further Extensions
Development of further refinement operators
Further ensemble techniques and combination rules
Further experiments with ontologies extracted from the Linked Data
Cloud
Parallelization of the current implementation
20. Conclusions and Extensions
Thank you!
Questions?