Current machine learning processes are data-driven and ignore existing background knowledge. This document proposes incorporating background knowledge into ML processes to address issues like long training times and heavy reliance on data quality. It suggests augmenting data by discovering new features from knowledge graphs, balancing classes to address data imbalances, and using PageRank for feature selection without requiring data. Combining data-driven and knowledge-driven approaches could leverage both data and expert knowledge to improve ML processes. The document outlines typical ML steps and how knowledge incorporation could enhance preprocessing, feature extraction, selection and modeling.
Enhancing ML with Semantic Knowledge
1. Enhancing white-box machine learning processes by incorporating semantic background knowledge
Gilles Vandewiele
Promotors: Femke Ongenae & Filip De Turck
Mentor: Agnieszka Ławrynowicz
2. Current ML processes are purely data-driven and ignore all existing knowledge
- Long training times → scales with #samples AND #features
- A lot of data is required to get good results → depends heavily on the quality of the data
3. There’s a lot of background & expert knowledge available in structured form!
(Diagram: experts populate a knowledge base, which in turn drives an expert system.)
5. Research question
“Can we combine the advantages of both data-driven and knowledge-driven approaches by incorporating prior knowledge into the steps of a (data-driven) ML process, and what is the impact of this incorporation?”
6. Outline
0. Introduction
1. Lay-out of a typical machine learning process & white box vs black box models
2. Pre-processing: augmenting and balancing the data
3. Feature selection with PageRank
4. Other possible incorporations & conclusion
13. White vs black box models
Instance-based explanation: “Why did you classify this sample as positive?”
↔
Model-based explanation: “What are the most important features?”, “Why do you classify this group of samples as positive?”, …
→ LIME [1], MFI [2], SHAP [3]
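The core idea behind model-agnostic explainers such as LIME [1] can be sketched in a few lines: perturb the sample, query the black-box model, and fit a distance-weighted linear surrogate whose coefficients serve as local feature importances. A minimal sketch — the function name, Gaussian perturbation, RBF kernel, and parameter values are illustrative assumptions, not the actual LIME implementation:

```python
import numpy as np

def local_explanation(predict_fn, x, n_samples=500, scale=0.1, seed=0):
    """Perturbation-based local explanation (LIME-style sketch).

    predict_fn: black box returning P(positive) for a batch of samples.
    Returns one weight per feature; larger magnitude means more
    influence on the prediction *near x*.
    """
    rng = np.random.default_rng(seed)
    # 1. Sample points in a neighbourhood around x.
    Z = x + rng.normal(0.0, scale, size=(n_samples, x.size))
    y = predict_fn(Z)
    # 2. Weight samples by proximity to x (RBF kernel).
    d2 = ((Z - x) ** 2).sum(axis=1)
    w = np.exp(-d2 / (2 * scale ** 2))
    # 3. Fit a weighted least-squares linear surrogate model.
    A = np.hstack([Z, np.ones((n_samples, 1))])   # add intercept column
    W = np.diag(w)
    coef, *_ = np.linalg.lstsq(A.T @ W @ A, A.T @ W @ y, rcond=None)
    return coef[:-1]  # per-feature local weights (drop intercept)

# Example: a black box that only looks at feature 0.
black_box = lambda Z: 1 / (1 + np.exp(-3 * Z[:, 0]))
weights = local_explanation(black_box, np.array([0.2, 0.5]))
```

The surrogate’s weights answer the slide’s question directly: feature 0 gets a large positive weight, feature 1 a weight near zero.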
14. White vs black box models
Model-based explanation:
- Model debugging
- Feature importances → selection
- …
- Faster adoption in critical domains
15. Outline
0. Introduction
1. Lay-out of a typical machine learning process & white box vs black box models
2. Pre-processing: augmenting and balancing the data
3. Feature selection with PageRank
4. Other possible incorporations & conclusion
17. (Table: the Zoo dataset — animals such as Bass, Bear, Boar, Calf, Deer, Girl, … described by features such as Hair?, Feathers?, Eggs?, Milk?, Predator?, Legs?, … — with links from each animal to an external knowledge-graph resource.)
18. (Same table as the previous slide.)
How to link this unstructured data with minimal user interaction?
19. Data augmentation: discovering new features
dbr:Bear → dbo:class → dbr:Mammal
dbr:Bear → dbo:order → dbr:Carnivora
dbr:Flea → dbo:class → dbr:Insect
dbr:Flea → dbo:order → dbr:Endopterygota
…
[3], [4], ...
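The augmentation step sketched above — following predicates such as dbo:class and dbo:order out of the knowledge graph to mint new columns — could look like this. A toy in-memory graph stands in for DBpedia; a real implementation would resolve these triples via a SPARQL endpoint, and the `augment` helper is an illustrative name, not an existing API:

```python
# Toy knowledge graph as (subject, predicate) -> object entries,
# mirroring the DBpedia snippets on the slide above.
KG = {
    ("dbr:Bear", "dbo:class"): "dbr:Mammal",
    ("dbr:Bear", "dbo:order"): "dbr:Carnivora",
    ("dbr:Flea", "dbo:class"): "dbr:Insect",
    ("dbr:Flea", "dbo:order"): "dbr:Endopterygota",
}

def augment(rows, predicates, kg=KG):
    """Add one new column per predicate by following KG links.

    rows: list of dicts, each with an 'entity' key already linked to a
    KG resource. A missing link yields None (a missing value).
    """
    out = []
    for row in rows:
        new = dict(row)
        for p in predicates:
            new[p] = kg.get((row["entity"], p))
        out.append(new)
    return out

data = [{"entity": "dbr:Bear", "hair": 1}, {"entity": "dbr:Flea", "hair": 0}]
augmented = augment(data, ["dbo:class", "dbo:order"])
# augmented[0] now also carries dbo:class = "dbr:Mammal"
```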
20. Data augmentation: open problems
How to find useful features in an immensely large graph?
- When to stop exploring the children of a certain node?
When is a feature useful?
- Not too many introduced missing values
- Can we gain information by adding the new feature?
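The last two criteria can be made concrete: keep a discovered feature only if it does not introduce too many missing values and if it yields information gain on the class labels. A minimal sketch — the thresholds and the `is_useful` name are illustrative choices, not part of the proposed method:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def is_useful(values, labels, max_missing=0.3, min_gain=0.01):
    """Keep a discovered feature only if it (a) does not introduce too
    many missing values and (b) yields information gain on the labels."""
    missing = sum(v is None for v in values) / len(values)
    if missing > max_missing:
        return False
    # Information gain = class entropy minus conditional entropy
    # given the candidate feature's values.
    total = entropy(labels)
    cond = 0.0
    for v in set(values):
        idx = [i for i, x in enumerate(values) if x == v]
        cond += len(idx) / len(values) * entropy([labels[i] for i in idx])
    return total - cond >= min_gain

# A feature that separates the classes is kept; a constant one is not.
keep = is_useful(["dbr:Mammal", "dbr:Mammal", "dbr:Insect"], ["pos", "pos", "neg"])
drop = is_useful(["dbr:Animal", "dbr:Animal", "dbr:Animal"], ["pos", "pos", "neg"])
```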
22. Data balancing approaches
- Custom objective function (higher penalty for the minority class)
- Oversampling of the minority class
- Undersampling of the majority class
+ Model-agnostic
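As a baseline for the sampling approaches, random oversampling duplicates minority-class samples until the classes are balanced; SMOTE [6] and ADASYN [7] refine this by synthesizing interpolated samples instead of duplicating. A minimal, model-agnostic sketch (the helper name and seed are illustrative):

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate minority-class samples until every class matches the
    size of the largest one — the simplest model-agnostic balancing."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = []
    for c, n in zip(classes, counts):
        members = np.flatnonzero(y == c)
        idx.append(members)                # keep all original samples
        if n < n_max:                      # top up by resampling
            idx.append(rng.choice(members, size=n_max - n, replace=True))
    idx = np.concatenate(idx)
    return X[idx], y[idx]

# 4-vs-1 imbalance becomes 4-vs-4 after oversampling.
X = np.arange(10).reshape(5, 2)
y = np.array([0, 0, 0, 0, 1])
Xb, yb = random_oversample(X, y)
```

The custom-objective alternative on the slide achieves a similar effect inside the model instead, by giving minority-class errors a higher weight in the loss.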
25. Outline
0. Introduction
1. Lay-out of a typical machine learning process & white box vs black box models
2. Pre-processing: augmenting and balancing the data
3. Feature selection with PageRank
4. Other possible incorporations & conclusion
27. PageRank Feature Selection: advantages
+ Requires no data
+ Can be used in unsupervised scenarios (e.g. clustering)
+ Fast (runtime is a function of #features, not #samples)
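A sketch of the idea: build a graph whose nodes are features and whose edge weights are semantic similarities taken from background knowledge (e.g. Lin similarity over an ontology), then rank the nodes with PageRank — no training data involved. The damping factor and iteration count below are the usual defaults, chosen here for illustration:

```python
import numpy as np

def pagerank_scores(S, d=0.85, n_iter=100):
    """Rank features by PageRank over a feature-similarity graph.

    S: symmetric matrix of pairwise semantic similarities between
    features (e.g. Lin similarity over an ontology) — no data needed.
    """
    # Column-normalise similarities into transition probabilities.
    P = S / S.sum(axis=0, keepdims=True)
    n = S.shape[0]
    r = np.full(n, 1.0 / n)
    for _ in range(n_iter):          # power iteration
        r = (1 - d) / n + d * (P @ r)
    return r

# Toy similarity graph over four features; feature 0 is most central.
S = np.array([[0, 1, 1, 1],
              [1, 0, 0, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 0]], dtype=float)
scores = pagerank_scores(S)
ranking = np.argsort(scores)[::-1]   # most- to least-central feature
```

Selecting the top-k entries of `ranking` then plays the role of the feature-selection step, with runtime depending only on the number of features.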
28. PageRank Feature Selection: preliminary results
Zoo dataset: 7 classes, 16 features (categorical)
Glass dataset: 6 classes, 9 features (continuous)
→ Task: CLUSTERING
→ Metric: V-Measure (the clustering analogue of F-Measure)
(Chart: features ranked by PageRank using Lin similarity.)
30. PageRank Feature Selection: open problems
How to initialize the edge weights?
- Which similarity/distance measures?
- Distance or similarity between: feature ↔ feature, class ↔ class, feature ↔ class?
What ranking algorithm to use?
- PageRank out of the box?
31. Outline
0. Introduction
1. Lay-out of a typical machine learning process & white box vs black box models
2. Pre-processing: augmenting and balancing the data
3. Feature selection with PageRank
4. Other possible incorporations & conclusion
32. Feature extraction: knowledge subgraph vector embedding
(Figure: the knowledge subgraph around dbr:Bear.)
How to identify the relevant subgraph in the immensely large knowledge graph?
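One way to embed such a subgraph, following RDF2Vec [8], is to enumerate random walks rooted at the entity and feed the resulting token sequences to a word-embedding model such as word2vec. A toy sketch with an in-memory graph — the walk count, depth, and graph contents are illustrative assumptions:

```python
import random
from collections import defaultdict

# Toy knowledge graph as adjacency lists of (predicate, object) pairs.
GRAPH = defaultdict(list, {
    "dbr:Bear": [("dbo:class", "dbr:Mammal"), ("dbo:order", "dbr:Carnivora")],
    "dbr:Mammal": [("rdf:type", "dbo:Species")],
})

def random_walks(entity, n_walks=10, depth=3, seed=0):
    """Sample the local subgraph around an entity as RDF2Vec-style
    random-walk sequences of alternating predicates and objects."""
    rng = random.Random(seed)
    walks = []
    for _ in range(n_walks):
        walk, node = [entity], entity
        for _ in range(depth):
            edges = GRAPH[node]
            if not edges:            # dead end: stop this walk early
                break
            pred, node = rng.choice(edges)
            walk += [pred, node]
        walks.append(walk)
    return walks

walks = random_walks("dbr:Bear")
# each walk starts at the entity and alternates predicate / object
```

Training an embedding model on these sequences yields one vector per entity, so the relevant subgraph only ever needs to be explored walk by walk rather than materialised whole.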
36. References
[1] Ribeiro, Marco Tulio, et al. "Model-agnostic interpretability of machine learning."
[2] Vidovic, Marina M-C., et al. "Feature Importance Measure for Non-linear Learning Algorithms."
[3] Lundberg, Scott, et al. "An unexpected unity among methods for interpreting model predictions."
[4] Paulheim, Heiko, et al. "Data mining with background knowledge from the web."
[5] Terziev, Yordan. "Feature Generation using Ontologies during Induction of Decision Trees on Linked Data."
[6] Chawla, Nitesh V., et al. "SMOTE: Synthetic minority over-sampling technique."
[7] He, Haibo, et al. "ADASYN: Adaptive synthetic sampling approach for imbalanced learning."
[8] Ristoski, Petar, and Heiko Paulheim. "RDF2Vec: RDF graph embeddings for data mining."
37. THANK YOU!
Acknowledgements:
- Reviewers & organizing committee
- My mentor: Agnieszka Ławrynowicz
- My promotors: Filip De Turck & Femke Ongenae
gilles.vandewiele@ugent.be
@Gillesvdwiele