Tackling the Class-Imbalance Learning Problem in
Semantic Web knowledge bases
19th International Conference on Knowledge Engineering and Knowledge
Management
Giuseppe Rizzo, Claudia d’Amato, Nicola Fanizzi and Floriana Esposito
Dipartimento di Informatica
Università degli Studi di Bari "Aldo Moro", Bari, Italy
November 24 - 28, 2014
Outline
1 Introduction & Motivations
2 The framework
3 Experiments
4 Conclusions and Extensions
Introduction & Motivations
Introduction
In the context of the Semantic Web, procedures for deciding the membership of an individual w.r.t. a query concept exploit automated reasoning techniques
The quality of the inferences can be affected by the uncertainty originating from the distributed nature of the Semantic Web:
the inherent incompleteness, due to the Open World Assumption
inconsistency, due to the diverse quality of the ontologies
Introduction & Motivations
Introduction
Machine learning algorithms can be employed to support query answering tasks (e.g. class-membership prediction)
statistical regularities are exploited to infer new assertions
The quality of inductive approaches depends on the composition of the training set
Given a query concept, it is often easier to find uncertain-membership examples than individuals that belong to the target concept (or to its complement)
the quality of the predictions can be poor
A class-imbalance problem occurs
Introduction & Motivations
Motivations
In machine learning, most solutions are based on sampling methods
Undersampling methods are typically based on (random or informed) procedures for discarding training instances
this may cause a loss of information
Oversampling methods require that some training instances be replicated
the resulting model can overfit the training data
These problems must be mitigated
The framework
The proposed approach
Combining the sampling strategy with ensemble learning methods
Ensemble learning methods require training a set of classifiers (weak learners)
predictions are combined by a meta-learner to decide the final answer
Specifically, the proposed solution is based on bagging methods (a minimal sketch follows this list)
various bootstrap samples are generated through sampling with replacement
a model is induced for each sample
predictions are made by a voting procedure
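Below is a minimal Python sketch of the generic bagging scheme just described; the names (bagging_fit, bagging_predict, base_learner) are illustrative assumptions, not the authors' implementation. In the TRF case the base learner is the TDT induction procedure and the bootstrap samples are additionally re-balanced, as detailed on the following slides.

import random
from collections import Counter

def bagging_fit(examples, n_models, base_learner):
    models = []
    for _ in range(n_models):
        # bootstrap sample: draw |examples| instances with replacement
        sample = [random.choice(examples) for _ in range(len(examples))]
        models.append(base_learner(sample))
    return models

def bagging_predict(models, x):
    # combine the answers of the weak learners by majority vote
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]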
The framework
Terminological Random Forests
In this work, we developed Terminological Random Forests (TRFs) for class-membership prediction, a model that extends Terminological Decision Trees (TDTs).
Given a knowledge base K = (T, A), a Terminological Decision Tree is a binary tree (see the sketch after this list) where:
each node contains a conjunctive concept description D;
each departing edge is the result of an instance-check test w.r.t. D, i.e., given an individual a, K |= D(a)?
if a node containing E is the parent of the node containing D, then D is obtained through a refinement operator and one of the following conditions holds:
D introduces a new concept name (or its complement),
D is an existential restriction,
D is a universal restriction of one of its ancestors.
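A minimal Python sketch of the node structure such a tree could use; the class name, the attributes and the leaf label are hypothetical assumptions made for the later sketches.

class TDTNode:
    def __init__(self, concept, label=None):
        self.concept = concept  # conjunctive concept description D installed in the node
        self.label = label      # definite membership (+1/-1) stored at a leaf (assumed)
        self.pos = None         # left child, followed when K |= D(a)
        self.neg = None         # right child, followed when K |= not D(a)

    def is_leaf(self):
        return self.pos is None and self.neg is None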
The framework
Terminological Random Forests
A TRF is an ensemble of TDTs such that:
each TDT is trained on a re-balanced subset of examples extracted from the original training set
each TDT is built by means of a downward refinement operator and a random selection of candidate concept descriptions
a voting rule is employed to decide the membership
The framework
Learning Terminological Random Forests
In order to learn a TRF, given
a target concept C
the number of trees n
a training set Tr = (Ps, Ns, Us) with
Ps = {a ∈ Ind(A) | K |= C(a)}
Ns = {b ∈ Ind(A) | K |= ¬C(b)}
Us = {c ∈ Ind(A) | K ⊭ C(c) ∧ K ⊭ ¬C(c)}
the algorithm can be summarized as follows (a sketch comes after this list):
build n rebalanced bootstrap samples
learn a TDT model from each bootstrap sample
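A Python sketch of this top-level loop, assuming the helpers rebalanced_bootstrap and induce_tdt sketched on the following slides (all identifiers are hypothetical).

def learn_trf(kb, target_concept, n_trees, pos, neg, unc, rate):
    forest = []
    for _ in range(n_trees):
        # 1) build a rebalanced bootstrap sample of the training individuals
        sample = rebalanced_bootstrap(pos, neg, unc, rate)
        # 2) induce a TDT from the sample, starting from the target concept
        forest.append(induce_tdt(kb, target_concept, sample))
    return forest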
The framework
Learning Terminological Random Forests
Procedure for building the rebalanced bootstrap sample
In order to mitigate the drawbacks of the under-sampling procedure, a two-step approach is employed (see the sketch below).
First, a stratified sampling with replacement procedure is employed so that the minority-class instances are represented in the bootstrap sample.
Then, majority-class instances (either positive or negative) and uncertain-membership instances are discarded.
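One possible reading of this two-step procedure, sketched in Python; the exact sampling policy (how many instances are drawn and discarded) is an assumption, not necessarily the authors' implementation.

import random

def rebalanced_bootstrap(pos, neg, unc, rate):
    # identify the minority and majority class among positive/negative examples
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    k = max(1, int(rate * len(minority)))
    # step 1: stratified sampling with replacement keeps the minority class represented
    sampled_minority = [random.choice(minority) for _ in range(k)]
    # step 2: surplus majority-class instances are discarded (only k are kept);
    # uncertain-membership instances (unc) are discarded altogether
    sampled_majority = [random.choice(majority) for _ in range(k)]
    return sampled_minority + sampled_majority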
The framework
Learning Terminological Random Forests
Learning TDTs
Given a bootstrap sample Di, a TDT is trained according to a recursive strategy (a condensed sketch follows this list)
Starting from the root, the method refines the concept description installed in the current node
various candidates are returned and a subset of concepts is selected by randomly choosing its elements
Best concept: the one that maximizes the information gain w.r.t. the previous level
Split the instances according to the results of the instance-check test
uncertain-membership instances are replicated in both recursive calls
Stop condition: the node is pure w.r.t. the membership
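A condensed Python sketch of this recursion; the refinement operator rho, the information_gain score, the kb entailment calls and the membership look-up are stubs standing in for the actual components, and the no-progress guard is an addition for the sketch.

import random

def induce_tdt(kb, concept, examples, n_candidates=5):
    # stop condition: the node is pure w.r.t. the definite membership
    labels = {kb.membership(ex) for ex in examples}   # +1, -1 or 0 (uncertain)
    if len(labels - {0}) <= 1:
        return TDTNode(concept, label=next(iter(labels - {0}), 0))
    # refine the concept installed in the current node and keep a random subset
    cands = rho(kb, concept)
    subset = random.sample(cands, min(n_candidates, len(cands)))
    # best concept: the one maximizing the information gain
    best = max(subset, key=lambda d: information_gain(kb, d, examples))
    left, right = [], []
    for ex in examples:
        if kb.entails(best, ex):          # K |= best(ex)
            left.append(ex)
        elif kb.entails_not(best, ex):    # K |= not best(ex)
            right.append(ex)
        else:                             # uncertain: replicated in both calls
            left.append(ex)
            right.append(ex)
    if len(left) == len(examples) and len(right) == len(examples):
        return TDTNode(best, label=0)     # the split made no progress: stop
    node = TDTNode(best)
    node.pos = induce_tdt(kb, best, left, n_candidates)
    node.neg = induce_tdt(kb, best, right, n_candidates)
    return node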
The framework
Predicting unseen individuals
Given a forest F and a new individual a, the algorithm collects the predictions returned by each TDT and decides the class according to the majority vote rule (sketched below)
The class-membership returned by a TDT is decided by recursively traversing the tree (until a leaf is reached) according to the result of the instance-check test.
For a concept description D installed in a node:
if K |= D(a), the left branch is followed
if K |= ¬D(a), the right branch is followed
if neither K |= D(a) nor K |= ¬D(a), the uncertain membership is assigned to a
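A Python sketch of the prediction step, reusing the hypothetical TDTNode and kb interfaces of the previous sketches.

from collections import Counter

def classify_with_tdt(kb, node, a):
    if node.is_leaf():
        return node.label                  # membership stored at the leaf (assumed)
    if kb.entails(node.concept, a):        # K |= D(a): follow the left branch
        return classify_with_tdt(kb, node.pos, a)
    if kb.entails_not(node.concept, a):    # K |= not D(a): follow the right branch
        return classify_with_tdt(kb, node.neg, a)
    return 0                               # neither is entailed: uncertain membership

def classify_with_trf(kb, forest, a):
    # collect the answer of each TDT and take the majority vote
    votes = Counter(classify_with_tdt(kb, tree, a) for tree in forest)
    return votes.most_common(1)[0][0]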
Experiments
Experiments
15 query concepts have been randomly generated
10-fold cross validation as the experimental design
number of randomly selected candidates: |ρ(·)|
Stratified sampling rates: no sampling, 50%, 70%, 80%
Using a reasoner to decide the ground truth, the following rates are computed (a sketch of the computation follows this list):
match: rate of test cases (individuals) for which the inductive model and the reasoner predict the same membership (i.e. +1 | +1, −1 | −1, 0 | 0);
commission: rate of cases for which the predictions are opposite (i.e. +1 | −1, −1 | +1);
omission: rate of test cases for which the inductive method cannot determine a definite membership (−1 or +1) while the reasoner is able to do so;
induction: rate of cases for which the inductive method can predict a membership while it is not logically derivable.
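A Python sketch of how the four rates can be computed from the per-individual answers of the inductive model and of the reasoner (+1, −1, or 0 for an undetermined membership); the function name and input format are illustrative.

def evaluation_rates(answers):
    # answers: list of (inductive, reasoner) pairs, each answer in {+1, -1, 0}
    n = len(answers)
    match = sum(1 for ind, rea in answers if ind == rea)
    commission = sum(1 for ind, rea in answers if ind != 0 and ind == -rea)
    omission = sum(1 for ind, rea in answers if ind == 0 and rea != 0)
    induction = sum(1 for ind, rea in answers if ind != 0 and rea == 0)
    return {"match": match / n, "commission": commission / n,
            "omission": omission / n, "induction": induction / n}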
Experiments
Experiments
Ontology  Index  |  TDT  |  TRF, no sampling: 10 trees  20 trees  30 trees
BCO
M% 80.44±11.01 87.99±07.85 87.82±13.86 79.33±22.41
C% 07.56±08.08 04.32±04.68 02.77±04.77 01.64±02.36
O% 05.04±04.28 00.09±00.27 00.02±00.04 10.38±19.28
I% 06.96±05.97 07.61±06.82 09.40±13.93 08.65±14.03
BioPax
M% 66.63±14.60 75.93±17.05 75.49±17.05 75.30±16.23
C% 31.03±12.95 22.11±16.54 18.54±17.80 18.74±17.80
O% 00.39±00.61 00.00±00.00 00.00±00.00 00.00±00.00
I% 01.95±07.13 01.97±07.16 01.97±07.16 01.97±07.16
NTN
M% 68.85±13.23 83.42±07.85 83.42±07.85 83.42±07.85
C% 00.37±00.30 00.02±00.04 00.02±00.04 00.02±00.04
O% 09.51±07.06 13.40±10.17 13.40±10.17 13.40±10.17
I% 21.27±08.73 03.16±04.65 03.16±04.65 03.16±04.65
HD
M% 58.31±14.06 67.95±16.99 67.95±16.99 67.95±16.99
C% 00.44±00.47 00.02±00.05 00.02±00.05 00.02±00.05
O% 05.51±01.81 06.38±02.03 06.38±02.03 06.38±02.03
I% 35.74±15.90 25.61±18.98 25.61±18.98 25.61±18.98
Experiments
Experiments
Ontology  Index  |  TRF, 50% sampling rate: 10 trees  20 trees  30 trees
BCO
M% 86.27±15.79 86.24±15.94 86.26±15.84
C% 02.47±03.70 02.43±03.70 02.84±03.70
O% 01.90±07.30 01.97±07.55 01.92±07.37
I% 09.36±13.96 09.36±13.96 09.36±13.96
BioPax
M% 75.30±16.23 75.30±16.23 75.30±16.23
C% 18.74±17.80 18.74±17.80 18.74±17.80
O% 00.00±00.00 00.00±00.00 00.00±00.00
I% 01.97±07.16 01.97±07.16 01.97±07.16
NTN
M% 83.41±07.85 83.42±07.85 83.42±07.85
C% 00.02±00.04 00.02±00.04 00.02±00.04
O% 13.40±10.17 13.40±10.17 13.40±10.17
I% 03.17±04.65 03.16±04.65 03.16±04.65
HD
M% 68.00±16.98 68.00±16.99 67.98±16.99
C% 00.02±00.05 00.02±00.05 00.02±00.05
O% 06.38±02.03 06.38±02.03 06.38±02.03
I% 25.59±18.98 25.59±18.98 25.62±18.98
Experiments
Experiments
Ontology  Index  |  TRF, 70% sampling rate: 10 trees  20 trees  30 trees
BCO
M% 84.12±18.27 85.70±16.98 85.52±17.09
C% 02.16±03.09 02.32±03.39 02.30±03.38
O% 04.50±12.59 02.65±09.93 02.86±10.04
I% 09.23±13.98 09.33±13.97 09.31±13.91
BioPax
M% 75.30±16.23 75.30±16.23 75.30±16.23
C% 18.74±17.80 18.74±17.80 18.74±17.80
O% 00.00±00.00 00.00±00.00 00.00±00.00
I% 01.97±07.16 01.97±07.16 01.97±07.16
NTN
M% 83.42±07.85 83.42±07.85 83.42±07.85
C% 00.02±00.04 00.02±00.04 00.02±00.04
O% 13.40±10.17 13.40±10.17 13.40±10.17
I% 03.16±04.65 03.16±04.65 03.16±04.65
HD
M% 68.00±16.99 68.00±16.99 68.00±16.99
C% 00.02±00.05 00.02±00.05 00.02±00.05
O% 06.38±02.03 06.38±02.03 06.38±02.03
I% 25.59±18.98 25.59±18.98 25.59±18.98
Experiments
Experiments
Ontology  Index  |  TRF, 80% sampling rate: 10 trees  20 trees  30 trees
BCO
M% 76.57±24.28 81.27±19.27 79.33±22.41
C% 01.45±01.77 01.89±02.65 01.64±02.36
O% 13.51±22.19 08.05±15.04 10.38±19.28
I% 08.47±14.03 08.79±13.98 08.65±14.23
BioPax
M% 75.30±16.23 75.30±16.23 75.30±16.23
C% 18.74±17.80 18.74±17.80 18.74±17.80
O% 00.00±00.00 00.00±00.00 00.00±00.00
I% 01.97±07.16 01.97±07.16 01.97±07.16
NTN
M% 83.41±07.85 83.42±07.85 83.42±07.85
C% 00.02±00.04 00.02±00.04 00.02±00.04
O% 13.40±10.17 13.40±10.17 13.40±10.17
I% 03.17±04.65 03.16±04.65 03.16±04.65
HD
M% 68.00±16.99 68.00±16.99 68.00±16.99
C% 00.02±00.05 00.02±00.05 00.02±00.05
O% 06.38±02.03 06.38±02.03 06.38±02.03
I% 25.59±18.98 25.59±18.98 25.59±18.98
Experiments
Considerations and Lessons Learnt
improvement w.r.t. TDTs
small changes in the match rate as the number of trees grows
weak diversification (overlap) among trees when increasing the number of trees
there is no need to set high values for these parameters
e.g. a 10-tree TRF with a sampling rate of 50% is accurate enough
the small-disjuncts problem, due to the poorly discriminative concepts generated by the refinement operator, is the cause of:
misclassification cases, mitigated by the presence of the other trees
a bottleneck in the learning phase
execution times span from a few minutes to almost 10 hours
Conclusions and Extensions
Conclusions and Further Extensions
Development of further refinement operators
Further ensemble techniques and combination rules
Further experiments with ontologies extracted from the Linked Data Cloud
Parallelization of the current implementation
Conclusions and Extensions
Thank you!
Questions?