Tackling the Class-Imbalance Learning Problem in
Semantic Web knowledge bases
19th International Conference on Knowledge Engineering and Knowledge
Management
Giuseppe Rizzo, Claudia d’Amato, Nicola Fanizzi and Floriana Esposito
Dipartimento di Informatica
Università degli Studi di Bari "Aldo Moro", Bari, Italy
November 24 - 28, 2014
G.Rizzo et al. (DIB - Univ. Aldo Moro) Tackling Class Imbalance Learning Problem November 24 - 28, 2014 1 / 20
Outline
1 Introduction & Motivations
2 The framework
3 Experiments
4 Conclusions and Extensions
Introduction & Motivations
Introduction
In the context of the Semantic Web, procedures for deciding the
membership of an individual w.r.t. a query concept exploit automated
reasoning techniques
The quality of inferences can be affected by uncertainty originating
from the distributed nature of the Semantic Web:
the inherent incompleteness, due to the Open World Assumption
inconsistency, due to the diverse quality of the ontologies
Introduction & Motivations
Introduction
Machine learning algorithms can be employed to support query
answering tasks (e.g., class-membership prediction)
statistical regularities are exploited to infer new assertions
The quality of inductive approaches depends on the training set
composition
Given a query concept, it is typically easier to find
uncertain-membership examples than individuals belonging to the
target concept (or to its complement)
the quality of predictions can be poor
A problem of class-imbalance occurs
Introduction & Motivations
Motivations
In machine learning, most solutions are based on sampling methods
Undersampling methods are typically based on procedures (random or
informed) for discarding training instances
a loss of information may occur
Oversampling methods require that some training instances are
replicated
the induced model can overfit the training data
These problems must be mitigated
The framework
The proposed approach
Combining the sampling strategy with ensemble learning methods
Ensemble learning methods require training a set of classifiers
(weak learners)
predictions are combined by a meta-learner for deciding the final
answer
Specifically, the proposed solution is based on bagging methods
various bootstrap samples are generated by sampling with
replacement
a model is induced for each sample
predictions are made by a voting procedure
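As a rough illustration, the bagging scheme above (bootstrap samples via sampling with replacement, one model per sample, majority vote) can be sketched as follows; `learn` and the shape of the training set are assumptions of this sketch, not the paper's API:

```python
import random
from collections import Counter

def bagging_predict(train, n_models, learn, x):
    """Generic bagging sketch: draw bootstrap samples with replacement,
    induce one model per sample, combine predictions by majority vote.
    `learn` is a hypothetical function mapping a sample to a classifier."""
    models = []
    for _ in range(n_models):
        # sampling with replacement yields one bootstrap sample
        sample = [random.choice(train) for _ in range(len(train))]
        models.append(learn(sample))
    # majority vote over the weak learners' predictions
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]
```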
The framework
Terminological Random Forests
In this work, we developed Terminological Random Forests (TRFs) for
class-membership prediction, extending the Terminological Decision
Tree (TDT) model.
Given a knowledge base K = (T, A), a Terminological Decision Tree is
a binary tree where:
each node contains a conjunctive concept description D;
each departing edge is the result of an instance-check test w.r.t. D,
i.e., given an individual a, K |= D(a)?
if a node containing E is the parent of a node containing D, then D
is obtained through a refinement operator and one of the following
conditions holds:
D introduces a new concept name (or its complement),
D is an existential restriction,
D is a universal restriction of one of its ancestors.
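The tree structure described above could be modeled, for illustration only, as a small data class (all field names are our own, not the paper's implementation):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TDTNode:
    """Illustrative node of a Terminological Decision Tree: a conjunctive
    concept description D plus two children selected by the result of
    the instance-check test."""
    concept: str = ""                  # concept description D, as text here
    left: Optional["TDTNode"] = None   # followed when K |= D(a)
    right: Optional["TDTNode"] = None  # followed when K |= ¬D(a)
    label: int = 0                     # membership at a leaf: +1, -1 or 0
```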
The framework
Terminological Random Forests
A TRF is an ensemble of TDTs such that:
each TDT is trained on a re-balanced subset of examples extracted
from the original training set
each TDT is built using a downward refinement operator and a
random selection of concept description candidates
a voting rule is employed to decide the membership
The framework
Learning Terminological Random Forests
In order to learn a TRF, given
a target concept C
the number of trees n
a training set Tr = ⟨Ps, Ns, Us⟩
Ps = {a ∈ Ind(A) | K |= C(a)}
Ns = {b ∈ Ind(A) | K |= ¬C(b)}
Us = {c ∈ Ind(A) | K ⊭ C(c) ∧ K ⊭ ¬C(c)}
the algorithm can be summarized as follows:
build n rebalanced bootstrap samples
learn a TDT model from each bootstrap sample
The framework
Learning Terminological Random Forests
Procedure for building the rebalanced bootstrap sample
To mitigate the drawbacks of the under-sampling procedure, a
two-step approach is employed.
First, a stratified sampling with replacement procedure is employed
so that the minority-class instances are represented in the
bootstrap sample.
Then, the exceeding majority-class instances (either positive or
negative) and the uncertain-membership instances are discarded.
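Under our reading of this two-step procedure, a minimal sketch might look as follows (function and parameter names are hypothetical):

```python
import random

def rebalanced_bootstrap(pos, neg, rng=random):
    """Sketch of the two-step rebalancing: stratified sampling with
    replacement from each class, drawing as many instances as the
    minority class holds, so the minority class is fully represented
    while the majority-class surplus (and the uncertain instances,
    never sampled here) is effectively discarded."""
    k = min(len(pos), len(neg))          # minority-class size
    sample_pos = [rng.choice(pos) for _ in range(k)]
    sample_neg = [rng.choice(neg) for _ in range(k)]
    return sample_pos, sample_neg
```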
The framework
Learning Terminological Random Forests
Learning TDTs
Given a bootstrap sample Di, a TDT is trained according to a
recursive strategy
Starting from the root, the method refines the concept description
installed in the current node
various candidates are returned and a subset is selected by randomly
choosing its elements
Best concept: the one that maximizes the information gain w.r.t. the
previous level
The instances are split according to the result of the instance-check
test
uncertain-membership instances are replicated in both recursive calls
Stop condition: the node is pure w.r.t. the membership
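A toy sketch of the candidate-selection step: information gain over a three-valued split, with uncertain-membership instances replicated in both branches. The `check` functions stand in for instance-check tests and are an assumption of this sketch:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def three_valued_split(check, examples):
    # uncertain-membership instances (check result 0) are replicated
    # in BOTH branches, as in the TDT growing step
    left  = [(a, y) for a, y in examples if check(a) in (+1, 0)]
    right = [(a, y) for a, y in examples if check(a) in (-1, 0)]
    return left, right

def best_candidate(candidates, examples):
    """Pick, among a randomly selected subset of refinements, the one
    whose split maximizes information gain; each candidate is modeled
    as a function a -> {+1, -1, 0} standing in for the instance check."""
    base = entropy([y for _, y in examples])
    def gain(check):
        left, right = three_valued_split(check, examples)
        n = len(left) + len(right)
        return base - sum(len(part) / n * entropy([y for _, y in part])
                          for part in (left, right))
    return max(candidates, key=gain)
```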
The framework
Predicting unseen individuals
Given a forest F and a new individual a, the algorithm collects the
predictions returned by each TDT and decides the class according to
the majority vote rule
The class-membership returned by a TDT is decided by recursively
traversing the tree (until a leaf is reached) according to the
instance-check test results.
For a concept description D installed in a node:
if K |= D(a) the left branch is followed
if K |= ¬D(a) the right branch is followed
if neither K |= ¬D(a) nor K |= D(a), the uncertain-membership is
assigned to a
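The traversal and voting described above might be sketched as follows (node fields and `check` are illustrative names, not the paper's implementation):

```python
from collections import Counter

def tdt_classify(node, a, check):
    """Recursive traversal of one TDT; `node` carries concept/left/right/
    label fields and `check(D, a)` returns +1, -1 or 0."""
    if node.left is None and node.right is None:
        return node.label                      # leaf: stored membership
    result = check(node.concept, a)
    if result == +1:
        return tdt_classify(node.left, a, check)
    if result == -1:
        return tdt_classify(node.right, a, check)
    return 0                                   # neither entailed: uncertain

def forest_predict(trees, a, check):
    """Majority vote over the per-tree memberships in {+1, -1, 0}."""
    votes = Counter(tdt_classify(t, a, check) for t in trees)
    return votes.most_common(1)[0][0]
```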
Experiments
Experiments
15 query concepts have been randomly generated
10-fold cross validation as experimental design
number of candidates randomly selected: |ρ(·)|
Stratified sampling rates: no sampling, 50%, 70%, 80%
Using a reasoner to decide the ground truth:
match: rate of the test cases (individuals) for which the inductive
model and a reasoner predict the same membership (i.e. +1 | +1,
−1 | −1, 0 | 0);
commission: rate of the cases for which predictions are opposite (i.e.
+1 | −1, −1 | +1);
omission: rate of test cases for which the inductive method cannot
determine a definite membership (−1, +1) while the reasoner is able to
do it;
induction: rate of cases where the inductive method can predict a
membership while it is not logically derivable.
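The four rates can be computed from paired predictions as in this sketch (the encoding +1/−1/0 follows the slide's notation, with 0 meaning unknown membership):

```python
def evaluation_rates(inductive, reasoner):
    """match/commission/omission/induction rates from paired predictions,
    each value in {+1, -1, 0}; the pairing i | r mirrors the slide."""
    n = len(inductive)
    pairs = list(zip(inductive, reasoner))
    match      = sum(i == r for i, r in pairs) / n              # +1|+1, -1|-1, 0|0
    commission = sum(i == -r and i != 0 for i, r in pairs) / n  # +1|-1, -1|+1
    omission   = sum(i == 0 and r != 0 for i, r in pairs) / n   # 0 | definite
    induction  = sum(i != 0 and r == 0 for i, r in pairs) / n   # definite | 0
    return match, commission, omission, induction
```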
Experiments
Experiments
Ontology  Index  TDT           No Sampling
                               10 trees     20 trees     30 trees
BCO       M%     80.44±11.01   87.99±07.85  87.82±13.86  79.33±22.41
          C%     07.56±08.08   04.32±04.68  02.77±04.77  01.64±02.36
          O%     05.04±04.28   00.09±00.27  00.02±00.04  10.38±19.28
          I%     06.96±05.97   07.61±06.82  09.40±13.93  08.65±14.03
BioPax    M%     66.63±14.60   75.93±17.05  75.49±17.05  75.30±16.23
          C%     31.03±12.95   22.11±16.54  18.54±17.80  18.74±17.80
          O%     00.39±00.61   00.00±00.00  00.00±00.00  00.00±00.00
          I%     01.95±07.13   01.97±07.16  01.97±07.16  01.97±07.16
NTN       M%     68.85±13.23   83.42±07.85  83.42±07.85  83.42±07.85
          C%     00.37±00.30   00.02±00.04  00.02±00.04  00.02±00.04
          O%     09.51±07.06   13.40±10.17  13.40±10.17  13.40±10.17
          I%     21.27±08.73   03.16±04.65  03.16±04.65  03.16±04.65
HD        M%     58.31±14.06   67.95±16.99  67.95±16.99  67.95±16.99
          C%     00.44±00.47   00.02±00.05  00.02±00.05  00.02±00.05
          O%     05.51±01.81   06.38±02.03  06.38±02.03  06.38±02.03
          I%     35.74±15.90   25.61±18.98  25.61±18.98  25.61±18.98
Experiments
Experiments
Ontology  Index  Sampling rate 50%
                 10 trees     20 trees     30 trees
BCO       M%     86.27±15.79  86.24±15.94  86.26±15.84
          C%     02.47±03.70  02.43±03.70  02.84±03.70
          O%     01.90±07.30  01.97±07.55  01.92±07.37
          I%     09.36±13.96  09.36±13.96  09.36±13.96
BioPax    M%     75.30±16.23  75.30±16.23  75.30±16.23
          C%     18.74±17.80  18.74±17.80  18.74±17.80
          O%     00.00±00.00  00.00±00.00  00.00±00.00
          I%     01.97±07.16  01.97±07.16  01.97±07.16
NTN       M%     83.41±07.85  83.42±07.85  83.42±07.85
          C%     00.02±00.04  00.02±00.04  00.02±00.04
          O%     13.40±10.17  13.40±10.17  13.40±10.17
          I%     03.17±04.65  03.16±04.65  03.16±04.65
HD        M%     68.00±16.98  68.00±16.99  67.98±16.99
          C%     00.02±00.05  00.02±00.05  00.02±00.05
          O%     06.38±02.03  06.38±02.03  06.38±02.03
          I%     25.59±18.98  25.59±18.98  25.62±18.98
Experiments
Experiments
Ontology  Index  Sampling rate 70%
                 10 trees     20 trees     30 trees
BCO       M%     84.12±18.27  85.70±16.98  85.52±17.09
          C%     02.16±03.09  02.32±03.39  02.30±03.38
          O%     04.50±12.59  02.65±09.93  02.86±10.04
          I%     09.23±13.98  09.33±13.97  09.31±13.91
BioPax    M%     75.30±16.23  75.30±16.23  75.30±16.23
          C%     18.74±17.80  18.74±17.80  18.74±17.80
          O%     00.00±00.00  00.00±00.00  00.00±00.00
          I%     01.97±07.16  01.97±07.16  01.97±07.16
NTN       M%     83.42±07.85  83.42±07.85  83.42±07.85
          C%     00.02±00.04  00.02±00.04  00.02±00.04
          O%     13.40±10.17  13.40±10.17  13.40±10.17
          I%     03.16±04.65  03.16±04.65  03.16±04.65
HD        M%     68.00±16.99  68.00±16.99  68.00±16.99
          C%     00.02±00.05  00.02±00.05  00.02±00.05
          O%     06.38±02.03  06.38±02.03  06.38±02.03
          I%     25.59±18.98  25.59±18.98  25.59±18.98
Experiments
Experiments
Ontology  Index  Sampling rate 80%
                 10 trees     20 trees     30 trees
BCO       M%     76.57±24.28  81.27±19.27  79.33±22.41
          C%     01.45±01.77  01.89±02.65  01.64±02.36
          O%     13.51±22.19  08.05±15.04  10.38±19.28
          I%     08.47±14.03  08.79±13.98  08.65±14.23
BioPax    M%     75.30±16.23  75.30±16.23  75.30±16.23
          C%     18.74±17.80  18.74±17.80  18.74±17.80
          O%     00.00±00.00  00.00±00.00  00.00±00.00
          I%     01.97±07.16  01.97±07.16  01.97±07.16
NTN       M%     83.41±07.85  83.42±07.85  83.42±07.85
          C%     00.02±00.04  00.02±00.04  00.02±00.04
          O%     13.40±10.17  13.40±10.17  13.40±10.17
          I%     03.17±04.65  03.16±04.65  03.16±04.65
HD        M%     68.00±16.99  68.00±16.99  68.00±16.99
          C%     00.02±00.05  00.02±00.05  00.02±00.05
          O%     06.38±02.03  06.38±02.03  06.38±02.03
          I%     25.59±18.98  25.59±18.98  25.59±18.98
Experiments
Considerations and Lessons Learnt
improvement w.r.t. TDTs
small changes in match rate as the number of trees grows
weak diversification (overlapping trees) when increasing the number
of trees
there is no need to set high values for these parameters
e.g., 10-tree TRFs with a sampling rate of 50% are accurate enough
the small-disjuncts problem, caused by poorly discriminative concepts
generated by the refinement operator, leads to:
misclassification cases, mitigated by the presence of the other trees
a bottleneck in the learning phase
execution times range from a few minutes to almost 10 hours
Conclusions and Extensions
Conclusions and Further Extensions
Development of further refinement operators
Further ensemble techniques and combination rules
Further experiments with ontologies extracted from the Linked Data
Cloud
Parallelization of the current implementation
Conclusions and Extensions
Thank you!
Questions?