The document describes a framework for tackling class imbalance problems in machine learning on semantic web knowledge bases. It proposes combining sampling strategies with ensemble learning methods like bagging to generate multiple balanced training subsets. A technique called Terminological Random Forest is presented which uses terminological decision trees as weak learners. Experiments on several ontologies show the framework improves performance over single classifiers, with matching rates to a reasoner of up to 87% and low commission rates.
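The sampling-plus-bagging idea behind such frameworks can be sketched as follows. This is a minimal illustration, not the paper's actual Terminological Random Forest: the weak learners here are plain callables, and the balanced-subset strategy (bootstrap the minority class, undersample the majority) is one common choice among several.

```python
import random
from collections import Counter

def balanced_bootstrap(pos, neg, rng):
    """Draw a bootstrap sample of the minority class and an equally
    sized random sample of the majority class (undersampling)."""
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    boot_min = [rng.choice(minority) for _ in minority]
    boot_maj = rng.sample(majority, len(minority))
    return boot_min + boot_maj

def bagged_predict(models, x):
    """Majority vote over the ensemble's weak learners."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]
```

Each weak learner would be trained on its own balanced subset; at prediction time the ensemble votes.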
A cluster-based analysis to diagnose students’ learning achievements (Miguel R. Artacho)
The document describes a proposed methodology for diagnosing students' learning achievements using cluster-based analysis. The methodology involves using item response theory to assess students' skill levels on concepts, identifying weaknesses and misconceptions, and clustering students based on similar disabilities. The methodology aims to provide adaptive feedback to help students improve and inform teaching strategies. A software tool was developed to implement the diagnostic assessment and clustering.
This proposal aims to evaluate the effectiveness of computer-assisted pronunciation training (CAPT) tools for vocational college students in Taiwan. The study will involve an experimental group that receives blended CAPT and traditional teaching, and a control group that receives only traditional teaching. Both groups will complete pre- and post-tests to measure pronunciation quality improvements. The proposal outlines the background, purpose, research question, methodology including participants, design, instruments, procedures, and statistical analysis that will be used.
Using Knowledge Building Forums in EFL Classrooms - FIETxs2019 (ARGET URV)
1) The document describes a study that examined the impact of using Knowledge Building forums on the development of English language skills for Spanish students.
2) Sixty-seven Spanish students participated in the study, engaging with Knowledge Building forums and completing pre- and post-tests of their English abilities.
3) The results showed that collaborative writing in the forums significantly improved students' English writing skills and comprehension, but did not necessarily improve their vocabulary or specific grammar skills.
On the Effectiveness of Evidence-based Terminological Decision Trees (Giuseppe Rizzo)
The document presents a framework for evidence-based terminological decision trees (ETDTs) to predict class membership for individuals in description logics. ETDTs combine description logics, decision trees, and Dempster-Shafer theory. Experiments show ETDTs outperform previous approaches by assigning correct membership and limiting omission cases. While performance is similar to terminological decision trees when membership is definite, ETDTs induce better models. Future work includes further experiments, heuristics, combination rules and refinement operators.
Inducing Predictive Clustering Trees for Datatype Properties Values (Giuseppe Rizzo)
The document proposes inducing predictive clustering trees (PCTs) to approximate numerical datatype property values in knowledge bases. PCTs perform multi-target regression by clustering individuals based on description logic concept descriptions, then fitting a predictive model to each cluster. The approach is tested on datasets extracted from DBpedia, showing PCTs outperform alternative methods like terminological regression trees, k-NN, and linear regression in terms of accuracy and efficiency. Future work could explore new refinement operators, heuristics, and linear models at leaf nodes to further improve PCTs for predicting property values in semantic data.
Towards Evidence Terminological Decision Tree (Giuseppe Rizzo)
The document proposes extending Terminological Decision Trees (TDTs) with Dempster-Shafer Theory to handle uncertainty when predicting class membership in ontologies under the open world assumption. It introduces Dempster-Shafer Terminological Decision Trees (DST-TDTs) which associate each node with a concept description and basic belief assignment. An evaluation on several datasets shows DST-TDTs do not clearly outperform standard TDTs due to conservative combination rules and high variance, but the authors identify opportunities to improve the approach through pruning, alternative selection measures, and using linked data.
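The basic belief assignments attached to DST-TDT nodes are merged with Dempster's rule of combination. A minimal sketch of that rule follows; the representation (masses as frozenset-to-float dicts over a small frame) is an assumption of this example, not the paper's implementation.

```python
def dempster_combine(m1, m2):
    """Dempster's rule of combination for two basic belief assignments.
    Each assignment maps frozenset subsets of the frame to a mass;
    conflicting mass (empty intersection) is renormalised away."""
    combined = {}
    conflict = 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + ma * mb
            else:
                conflict += ma * mb
    norm = 1.0 - conflict
    return {s: v / norm for s, v in combined.items()}
```

The conservatism the evaluation mentions comes from exactly this renormalisation step: strongly conflicting sources inflate the remaining masses.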
Inductive Classification through Evidence-based Models and Their Ensemble (Giuseppe Rizzo)
The document presents an ensemble machine learning framework called Evidential Terminological Random Forests (ETRF) for inductive classification over semantic web data. ETRF combines evidential terminological decision trees and Dempster-Shafer theory to make probabilistic membership predictions that account for uncertainty. Experiments show ETRF improves over other models by achieving higher accuracy and lower variance in predictions across several ontologies, while addressing issues like class imbalance. Future work is proposed to further enhance the refinement operators, combination rules, and scale the approach to larger datasets.
Learning Analytics Special Track: A cluster-based analysis to diagnose stude... (Miguel Rodriguez Artacho)
The document proposes a diagnostic test methodology using cluster analysis to identify students' learning disabilities and weaknesses. It uses item response theory to assess students' skill levels on concepts, identifies misconceptions through relationships between test items and concepts, and clusters students based on similar disabilities. The methodology was implemented in a software tool that provides individualized feedback to students on their learning paths.
This document discusses sampling techniques and sample size. It defines key terms like population, sample, sampling frame, and sampling schemes. It describes different sampling methods like probability and non-probability sampling. Probability sampling methods allow results to be generalized to the population and include simple random sampling, systematic sampling, stratified sampling, cluster sampling, and multi-stage sampling. Sample size is determined based on desired confidence level, precision, and power to detect differences between groups. Sample size calculations are provided for estimating population parameters and comparing two groups.
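For instance, the standard textbook formula for the minimum sample size needed to estimate a population proportion, n = z²·p(1−p)/e², can be computed directly. This is a generic formula, not necessarily the exact one presented in the document.

```python
import math

def sample_size_proportion(z, p, e):
    """Minimum n to estimate a population proportion p with margin of
    error e at the confidence level given by z (e.g. 1.96 for 95%)."""
    return math.ceil(z ** 2 * p * (1 - p) / e ** 2)
```

Using the conservative p = 0.5 with a 5% margin at 95% confidence gives the familiar n = 385.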
This document provides an overview of research methods and statistical concepts. It discusses research design types including descriptive, historical, and experimental. Experimental design can be true experiments or quasi-experiments. It also discusses quantitative and qualitative research approaches and mixed methods. Key statistical concepts are defined, such as population, sample, probability and non-probability sampling, and levels of measurement. Common statistical tests are introduced along with important assumptions. The document provides guidance on how to measure learning experimentally using different research designs. It also discusses how to determine appropriate sample sizes and select statistical analyses based on the research questions.
This document discusses key concepts in quantitative techniques related to population, sample, sampling, and sample size calculation. It defines population as the total set of measurements of interest, and sample as a subset of the population. Probability and non-probability sampling methods are described. Probability sampling allows results to be generalized to the population, while non-probability sampling does not. Several probability sampling techniques are explained, including simple random sampling, systematic sampling, stratified sampling, cluster sampling, and multi-stage sampling. The document also covers concepts like sampling error, confidence level, statistical power, and formulas for calculating minimum sample sizes. Sample size determination depends on factors like confidence level, power, expected difference, and standard deviation. The formulas presented can be used to calculate these minimum sample sizes.
Heuristics for the Maximal Diversity Selection Problem (IJMER)
The problem of selecting k items from among a given set of N items such that the ‘diversity’ among the k items is maximum is a classical problem with applications in many diverse areas, such as forming committees, jury selection, product testing, surveys, plant breeding, ecological preservation, and capital investment. A suitably defined distance metric is used to determine the diversity. However, this is a hard problem, and the optimal solution is computationally intractable. In this paper we present the experimental evaluation of two approximation algorithms (heuristics) for the maximal diversity selection problem.
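A common greedy constructive heuristic for this problem can be sketched as follows. It is an illustrative approximation, not necessarily one of the two heuristics evaluated in the paper: start from the farthest pair, then repeatedly add the item that maximises total distance to the current selection.

```python
from itertools import combinations

def greedy_max_diversity(items, k, dist):
    """Greedy heuristic for maximal diversity selection: seed with the
    farthest pair, then add the item farthest from the chosen set."""
    a, b = max(combinations(range(len(items)), 2),
               key=lambda p: dist(items[p[0]], items[p[1]]))
    chosen = {a, b}
    while len(chosen) < k:
        best = max((i for i in range(len(items)) if i not in chosen),
                   key=lambda i: sum(dist(items[i], items[j]) for j in chosen))
        chosen.add(best)
    return [items[i] for i in chosen]
```

With a distance metric like absolute difference this runs in O(N²) per step, trading optimality for tractability.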
This document discusses various sampling methods for research. It begins by defining key terms like population, study population, and sample. It then describes and provides examples of both probability sampling methods like simple random sampling, systematic sampling, and multistage sampling as well as non-probability sampling methods like purposive sampling, snowball sampling, and quota sampling. The document explains how to determine sample size and discusses concepts like sampling error, bias, and representativeness. It emphasizes that the choice of sampling method depends on the research purpose and design.
This chapter discusses sampling and sampling distributions. It covers different sampling methods like simple random sampling, stratified sampling, and cluster sampling. The key concepts explained are sampling frame, sampling distribution, standard error of the mean, and the Central Limit Theorem. The Central Limit Theorem states that as the sample size increases, the sampling distribution of the mean will approach a normal distribution, even if the population is not normally distributed.
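The Central Limit Theorem behaviour described above is easy to check by simulation; the skewed toy population below is invented for illustration.

```python
import random
import statistics

def sampling_distribution_of_mean(population, n, reps, seed=0):
    """Simulate the sampling distribution of the sample mean by drawing
    `reps` samples of size n with replacement from the population."""
    rng = random.Random(seed)
    return [statistics.mean(rng.choices(population, k=n)) for _ in range(reps)]
```

Even for a strongly skewed population, the simulated means centre on the population mean with spread close to the standard error σ/√n, as the theorem predicts.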
This document summarizes ensemble classification methods including bagging, boosting, and random forests. It discusses discriminative vs generative models and reviews literature on various machine learning algorithms. It provides details on bagging, boosting, random forests algorithms and compares their pros and cons. It discusses empirical comparisons of algorithm performance on different datasets and problems.
The document discusses progressive decision trees, which aim to overcome some limitations of classical decision trees. Progressive decision trees break the classification problem into a sequence of simpler sub-problems using small decision trees. Three types of cascading progressive decision trees are described (Type A, B, C) which differ in how information is passed between trees. Experimental results on document layout recognition, hyperspectral imaging, brain tumour classification, and UCI datasets show that progressive decision trees can improve accuracy and reduce costs compared to single decision trees. Further research opportunities in progressive decision trees are also outlined.
Ensemble Learning Featuring the Netflix Prize Competition and ... (butest)
The document discusses ensemble learning methods for improving prediction accuracy. It provides an overview of using multiple models together (ensembles) and techniques like bagging and boosting to increase diversity among models. Bagging involves training models on different subsets of data, while boosting incrementally focuses on misclassified examples. The Netflix Prize is used as a case study, where top teams achieved over 5% better accuracy than Netflix by developing diverse ensembles of up to 100 models using different algorithms and inputs.
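A minimal form of the blending used by such ensembles is a weighted average of the individual models' predicted ratings. This is illustrative only; the actual Netflix Prize blends were far more elaborate, often learned by regression over hundreds of models.

```python
def blend(predictions, weights):
    """Weighted average of several models' predictions for one item.
    Weights would typically be fit on a held-out validation set."""
    total = sum(weights)
    return sum(p * w for p, w in zip(predictions, weights)) / total
```

Giving a stronger model double weight pulls the blended rating toward its prediction.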
A decision tree is a map of the possible outcomes of a series of related decisions. It allows an individual or organization to weigh possible actions against one another based on their costs, probabilities, and benefits. Decision trees can be used either to drive informal discussion or to map out an algorithm that mathematically predicts the best choice.
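The expected-value calculation such a tree supports can be sketched as follows; the nested-tuple node format is an assumption of this toy example.

```python
def expected_value(node):
    """Evaluate a small decision tree: leaves are payoffs (benefit minus
    cost), chance nodes average branches by probability, and decision
    nodes take the best branch."""
    kind = node[0]
    if kind == "leaf":
        return node[1]
    if kind == "chance":
        return sum(p * expected_value(child) for p, child in node[1])
    if kind == "decision":
        return max(expected_value(child) for child in node[1])
    raise ValueError(f"unknown node kind: {kind}")
```

For example, a launch with a 40% chance of a 100 payoff and a 60% chance of a -20 loss has expected value 28, so a decision node prefers it to a safe payoff of 10.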
Meaning & Definition of Population & Sampling, Types of Sampling - Probability & Non-Probability Sampling Techniques, Characteristics of Probability Sampling Techniques, Types of Probability Sampling Techniques, Characteristics of Non-Probability Sampling Techniques, Types of Non-Probability Sampling Techniques, Errors in Sampling, Size of sample, Application of Sampling Technique in Research
Research on multi-class imbalance from a number of researchers faces obstacles in the form of poor data diversity and a large number of classifiers. The Hybrid Approach Redefinition-Multiclass Imbalance (HAR-MI) method is a hybrid ensemble method developed from the Hybrid Approach Redefinition (HAR) method. This study compared its results with the Dynamic Ensemble Selection-Multiclass Imbalance (DES-MI) method in handling multiclass imbalance. In the HAR-MI method, the preprocessing stage is carried out using the random balance ensembles method and dynamic ensemble selection, and the processing stage is carried out using different contribution sampling and dynamic ensemble selection, to produce a candidate ensemble. The research was conducted using multi-class imbalance datasets sourced from the KEEL Repository. The results show that the HAR-MI method can overcome multi-class imbalance with better data diversity, a smaller number of classifiers, and better classifier performance compared to the DES-MI method. These results were tested with a Wilcoxon signed-rank statistical test, which showed the superiority of the HAR-MI method with respect to the DES-MI method.
CABT SHS Statistics & Probability - Sampling Distribution of Means (Gilbert Joseph Abueg)
This document is a presentation on sampling distributions of means for a Grade 11 Statistics and Probability lecture. It begins by defining populations and samples, and explaining how inferential statistics makes conclusions about populations based on sample data. It then discusses different sampling techniques like simple random sampling, systematic random sampling, stratified random sampling and cluster sampling. The key concepts of parameters, statistics, and sampling distributions are also introduced. Examples are provided to illustrate how to construct sampling distributions of means.
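A sampling distribution of means like those in such examples can be built exhaustively for a tiny population by enumerating every possible sample; the population values below are invented for illustration.

```python
from itertools import combinations
from statistics import mean

def sampling_distribution(population, n):
    """All possible sample means for samples of size n drawn without
    replacement, returned as a mean -> probability mapping."""
    samples = list(combinations(population, n))
    counts = {}
    for s in samples:
        m = mean(s)
        counts[m] = counts.get(m, 0) + 1
    return {m: c / len(samples) for m, c in counts.items()}
```

A key property to verify: the mean of the sampling distribution equals the population mean.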
Probability density estimation using Product of Conditional Experts (Chirag Gupta)
This document discusses probability density estimation using a product of conditional experts model. It summarizes that density estimation constructs a probability distribution function from observed data to understand the underlying pattern. A product of conditional experts model is proposed, where simple classification models like logistic regression are used as experts to estimate the conditional probability. The experts are combined by multiplying their probabilities. The model is trained using gradient ascent to maximize the log probability. When evaluated on artificial and real datasets, the product of conditional experts model is shown to learn distributions close to the true distributions and generalize better than linear and non-linear baseline models. The document also explores applying the model to outlier detection.
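The combination step, multiplying the experts' probabilities and renormalising, can be sketched as follows. This is a simplified illustration of the product-of-experts idea, not the paper's trained model.

```python
def product_of_experts(expert_probs):
    """Combine experts by multiplying their probabilities for each
    outcome, then renormalising so the result sums to one."""
    outcomes = expert_probs[0].keys()
    scores = {o: 1.0 for o in outcomes}
    for probs in expert_probs:
        for o in outcomes:
            scores[o] *= probs[o]
    z = sum(scores.values())
    return {o: s / z for o, s in scores.items()}
```

Because the probabilities multiply, any single confident expert can sharply veto an outcome, which is the characteristic behaviour of product models.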
Ensemble learning methods were very successful in the Netflix Prize competition to improve movie recommendations. These methods combine the predictions from multiple models to obtain better accuracy than single models. Popular ensemble techniques included bagging, boosting, and random forests. The winning teams in the Netflix Prize all used ensemble methods that blended the predictions of dozens or hundreds of individual models.
Statistics involves collecting, organizing, analyzing, and interpreting data. Descriptive statistics describe characteristics of a data set through measures like central tendency and variability. Inferential statistics draw conclusions about a population based on a sample. Key terms include population, sample, parameter, statistic, data types, levels of measurement, and sampling techniques like simple random sampling. Common data gathering methods are interviews, questionnaires, and registration records. Data can be presented textually, in tables, or graphically through charts, graphs, and maps.
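As a small illustration of the descriptive measures mentioned, Python's standard statistics module computes common measures of central tendency and variability; the data values here are invented.

```python
from statistics import mean, median, stdev

scores = [4, 8, 15, 16, 23, 42]
print("mean:", mean(scores))      # central tendency
print("median:", median(scores))  # robust central tendency
print("stdev:", stdev(scores))    # variability (sample standard deviation)
```

Note the mean (18) sits above the median (15.5) because the largest value skews the data.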
Similar to Tackling the Class Imbalance Learning Problem in Semantic Web Knowledge bases (20)
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...Marlon Dumas
This webinar discusses the limitations of traditional approaches for business process simulation based on had-crafted model with restrictive assumptions. It shows how process mining techniques can be assembled together to discover high-fidelity digital twins of end-to-end processes from event data.
Did you know that drowning is a leading cause of unintentional death among young children? According to recent data, children aged 1-4 years are at the highest risk. Let's raise awareness and take steps to prevent these tragic incidents. Supervision, barriers around pools, and learning CPR can make a difference. Stay safe this summer!
06-20-2024-AI Camp Meetup-Unstructured Data and Vector DatabasesTimothy Spann
Tech Talk: Unstructured Data and Vector Databases
Speaker: Tim Spann (Zilliz)
Abstract: In this session, I will discuss the unstructured data and the world of vector databases, we will see how they different from traditional databases. In which cases you need one and in which you probably don’t. I will also go over Similarity Search, where do you get vectors from and an example of a Vector Database Architecture. Wrapping up with an overview of Milvus.
Introduction
Unstructured data, vector databases, traditional databases, similarity search
Vectors
Where, What, How, Why Vectors? We’ll cover a Vector Database Architecture
Introducing Milvus
What drives Milvus' Emergence as the most widely adopted vector database
Hi Unstructured Data Friends!
I hope this video had all the unstructured data processing, AI and Vector Database demo you needed for now. If not, there’s a ton more linked below.
My source code is available here
https://github.com/tspannhw/
Let me know in the comments if you liked what you saw, how I can improve and what should I show next? Thanks, hope to see you soon at a Meetup in Princeton, Philadelphia, New York City or here in the Youtube Matrix.
Get Milvused!
https://milvus.io/
Read my Newsletter every week!
https://github.com/tspannhw/FLiPStackWeekly/blob/main/141-10June2024.md
For more cool Unstructured Data, AI and Vector Database videos check out the Milvus vector database videos here
https://www.youtube.com/@MilvusVectorDatabase/videos
Unstructured Data Meetups -
https://www.meetup.com/unstructured-data-meetup-new-york/
https://lu.ma/calendar/manage/cal-VNT79trvj0jS8S7
https://www.meetup.com/pro/unstructureddata/
https://zilliz.com/community/unstructured-data-meetup
https://zilliz.com/event
Twitter/X: https://x.com/milvusio https://x.com/paasdev
LinkedIn: https://www.linkedin.com/company/zilliz/ https://www.linkedin.com/in/timothyspann/
GitHub: https://github.com/milvus-io/milvus https://github.com/tspannhw
Invitation to join Discord: https://discord.com/invite/FjCMmaJng6
Blogs: https://milvusio.medium.com/ https://www.opensourcevectordb.cloud/ https://medium.com/@tspann
https://www.meetup.com/unstructured-data-meetup-new-york/events/301383476/?slug=unstructured-data-meetup-new-york&eventId=301383476
https://www.aicamp.ai/event/eventdetails/W2024062014
Tackling the Class Imbalance Learning Problem in Semantic Web Knowledge bases
1. Tackling the Class-Imbalance Learning Problem in
Semantic Web knowledge bases
19th International Conference on Knowledge Engineering and Knowledge
Management
Giuseppe Rizzo, Claudia d’Amato, Nicola Fanizzi and Floriana Esposito
Dipartimento di Informatica
Università degli Studi di Bari "Aldo Moro", Bari, Italy
November 24 - 28, 2014
G.Rizzo et al. (DIB - Univ. Aldo Moro) Tackling Class Imbalance Learning Problem November 24 - 28, 2014 1 / 20
2. Outline
1 Introduction & Motivations
2 The framework
3 Experiments
4 Conclusions and Extensions
3. Introduction & Motivations
Introduction
In the context of Semantic Web, procedures for deciding the
membership of an individual w.r.t. a query concept exploit automated
reasoning techniques
The quality of inferences can be affected by the uncertainty originating
from the distributed nature of the Semantic Web
the inherent incompleteness, due to the Open World Assumption
inconsistency, due to the diverse quality of the ontologies
4. Introduction & Motivations
Introduction
Machine learning algorithms can be employed to support query
answering tasks (e.g., class-membership prediction)
statistical regularities are exploited to infer new assertions
The quality of inductive approaches depends on the training set
composition
Given a query concept, it is easier to find uncertain-membership
examples than individuals that belong to the target concept (or to its
complement)
the quality of predictions can be poor
A problem of class-imbalance occurs
5. Introduction & Motivations
Motivations
In machine learning, most solutions are based on sampling methods
Undersampling methods are typically based on (random or
informed) procedures for discarding training instances
this can cause a loss of information
Oversampling methods require that some training instances are
replicated
the produced model can overfit the training data
These problems must be mitigated
6. The framework
The proposed approach
Combining the sampling strategy with ensemble learning methods
Ensemble learning methods require training a set of classifiers
(weak learners)
predictions are combined by a meta-learner for deciding the final
answer
Specifically, the proposed solution is based on bagging methods
several bootstrap samples are generated by sampling with
replacement
a model is induced for each sample
predictions are made by a voting procedure
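The bagging scheme above can be sketched as follows. This is a generic illustration, not the authors' implementation: `learn` is a placeholder for any weak-learner training routine, and a model is simply a callable on an instance.

```python
import random
from collections import Counter

def bootstrap_sample(examples, rng):
    """Draw |examples| items with replacement (one bootstrap sample)."""
    return [rng.choice(examples) for _ in examples]

def bagging_train(examples, n_models, learn, seed=0):
    """Train one weak learner per bootstrap sample."""
    rng = random.Random(seed)
    return [learn(bootstrap_sample(examples, rng)) for _ in range(n_models)]

def bagging_predict(models, x):
    """Combine the weak learners' answers by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]
```

In the TRF setting sketched later in the slides, `learn` would build a terminological decision tree from the rebalanced sample.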
7. The framework
Terminological Random Forests
In this work, we developed Terminological Random Forests (TRFs) for
class-membership prediction, which extend the Terminological Decision
Tree (TDT) model.
Given a knowledge base K = (T, A), a Terminological Decision Tree is a
binary tree where:
each node contains a conjunctive concept description D;
each departing edge is the result of an instance-check test w.r.t. D,
i.e., given an individual a, K |= D(a)?
if the node containing E is the parent of the node containing D, then
D is obtained through a refinement operator and one of the following
conditions must hold:
D introduces a new concept name (or its complement),
D is an existential restriction,
D is a universal restriction of one of its ancestors.
8. The framework
Terminological Random Forests
A TRF is an ensemble of TDTs such that:
each TDT is trained on a re-balanced subset of examples extracted
from the original training set
each TDT is built using a downward refinement operator and a
random selection of concept description candidates
a voting rule is employed to decide the membership
9. The framework
Learning Terminological Random Forests
In order to learn a TRF, given
a target concept C
the number of trees n
a training set Tr = ⟨Ps, Ns, Us⟩
Ps = {a ∈ Ind(A) | K |= C(a)}
Ns = {b ∈ Ind(A) | K |= ¬C(b)}
Us = {c ∈ Ind(A) | K ⊭ C(c) ∧ K ⊭ ¬C(c)}
the algorithm can be summarized as follows:
build n rebalanced bootstrap samples
learn a TDT model from each bootstrap sample
10. The framework
Learning Terminological Random Forests
Procedure for building the rebalanced bootstrap sample
In order to mitigate the drawback of the under-sampling procedure,
a two-step approach is employed.
First, a stratified sampling with replacement procedure is employed
in order to represent the minority-class instances in the bootstrap
sample.
Then, the majority-class instances (either positive or negative) and
the uncertain-membership instances are discarded.
11. The framework
Learning Terminological Random Forests
Learning TDTs
Given a bootstrap sample Di , a TDT is trained according to a
recursive strategy
Starting from the root, the method refines the concept description
installed in the current node
Various candidates are returned and a subset of concepts is selected by
randomly choosing its elements
Best concept: the one that maximizes the information gain w.r.t. the
previous level
Split the instances according to the results of the instance check test
uncertain-membership instances are replicated in both recursive calls
Stop conditions: the node is pure w.r.t. the membership
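The random candidate selection and the information-gain test can be sketched as follows. This is a simplification: `score` stands in for the gain computed on a candidate split, and the two-class entropy below ignores the uncertain-membership instances.

```python
import math
import random

def entropy(pos, neg):
    """Binary entropy of a node holding pos positive and neg negative
    instances (0.0 for a pure node)."""
    if pos == 0 or neg == 0:
        return 0.0
    p = pos / (pos + neg)
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def pick_best_candidate(candidates, score, k, seed=0):
    """Randomly choose k refinement candidates, keep the best-scoring one."""
    rng = random.Random(seed)
    subset = rng.sample(candidates, min(k, len(candidates)))
    return max(subset, key=score)
```

A pure node has zero entropy, which matches the stop condition above: once all instances in a node share the same membership, no further refinement is needed.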
12. The framework
Predicting unseen individuals
Given a forest F and a new individual a, the algorithm collects the
predictions returned by each TDT and decides the class according to
the majority vote rule
The class-membership returned by a TDT is decided by traversing
recursively the tree (until a leaf is reached) according to the instance
check test result.
For a concept description D installed in a node:
if K |= D(a) the left branch is followed
if K |= ¬D(a) the right branch is followed
if neither K |= D(a) nor K |= ¬D(a) holds, uncertain membership is
assigned to a
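The three-way traversal and the majority vote can be sketched as follows. This is a toy model under stated assumptions: concepts are plain strings and `entails` is a stand-in for the reasoner's instance check, so nothing here is tied to a real DL reasoner API.

```python
from collections import Counter

POS, NEG, UNC = +1, -1, 0  # definite memberships and the uncertain case

class TDTNode:
    def __init__(self, concept=None, left=None, right=None, leaf=None):
        self.concept = concept  # concept description D (string stand-in)
        self.left, self.right, self.leaf = left, right, leaf

def classify(node, a, entails):
    """Traverse a TDT: left branch if K |= D(a), right if K |= ¬D(a),
    otherwise the membership of a is uncertain."""
    if node.leaf is not None:
        return node.leaf
    if entails(node.concept, a):               # K |= D(a)
        return classify(node.left, a, entails)
    if entails("¬(" + node.concept + ")", a):  # K |= ¬D(a)
        return classify(node.right, a, entails)
    return UNC

def forest_predict(trees, a, entails):
    """Majority vote over the answers of all trees in the forest."""
    votes = Counter(classify(t, a, entails) for t in trees)
    return votes.most_common(1)[0][0]
```

Note that the vote is over three values, so a forest can still answer "uncertain" when most trees cannot commit to a definite membership.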
13. Experiments
Experiments
15 query concepts have been randomly generated
10-fold cross validation as the experimental design
number of randomly selected candidates: |ρ(·)|
Stratified sampling rates: no sampling, 50%, 70%, 80%
Using a reasoner to decide the ground truth:
match: rate of the test cases (individuals) for which the inductive
model and a reasoner predict the same membership (i.e. +1 | +1,
−1 | −1, 0 | 0);
commission: rate of the cases for which predictions are opposite (i.e.
+1 | −1, −1 | +1);
omission: rate of test cases for which the inductive method cannot
determine a definite membership (−1, +1) while the reasoner is able to
do it;
induction: rate of cases where the inductive method can predict a
membership while it is not logically derivable.
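Counting the four rates against a reasoner's answers can be sketched as follows, encoding the definite memberships as +1 and −1 and an undetermined answer as 0, as in the slide.

```python
def evaluate(inductive, deductive):
    """Compare inductive predictions with a reasoner's answers
    (+1, -1, or 0 for undetermined) and compute the four rates."""
    n = len(inductive)
    match = commission = omission = induction = 0
    for ind, ded in zip(inductive, deductive):
        if ind == ded:
            match += 1                 # same membership predicted
        elif ind != 0 and ded != 0:
            commission += 1            # opposite definite answers
        elif ind == 0:
            omission += 1              # reasoner decides, the model does not
        else:
            induction += 1             # model decides, not logically derivable
    return {k: v / n for k, v in
            (("match", match), ("commission", commission),
             ("omission", omission), ("induction", induction))}
```

The four rates partition the test cases, so they always sum to 1.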
18. Experiments
Considerations and Lessons Learnt
improvement w.r.t. TDTs
small changes in match rate as the number of trees increases
weak diversification (overlap) between trees when increasing the
number of trees
there is no need to set high values for these parameters
e.g., 10-tree TRFs with a sampling rate of 50% are accurate enough
the small-disjuncts problem, due to poorly discriminative concepts
generated by the refinement operator, is the cause of:
misclassification cases, mitigated by the presence of other trees
a bottleneck in the learning phase
execution times span from a few minutes to almost 10 hours
19. Conclusions and Extensions
Conclusions and Further Extensions
Development of further refinement operators
Further ensemble techniques and combination rules
Further experiments with ontologies extracted from the Linked Data
Cloud
Parallelization of the current implementation
20. Conclusions and Extensions
Thank you!
Questions?