Current machine learning processes are data-driven and ignore existing background knowledge. This document proposes incorporating background knowledge into ML processes to address issues like long training times and heavy reliance on data quality. It suggests augmenting data by discovering new features from knowledge graphs, balancing classes to address data imbalances, and using PageRank for feature selection without requiring data. Combining data-driven and knowledge-driven approaches could leverage both data and expert knowledge to improve ML processes. The document outlines typical ML steps and how knowledge incorporation could enhance preprocessing, feature extraction, selection and modeling.
Enhancing ML with Semantic Knowledge
1. Enhancing white-box machine learning processes by incorporating semantic background knowledge
Gilles Vandewiele
Promotors: Femke Ongenae & Filip De Turck
Mentor: Agnieszka Ławrynowicz
2. Current ML processes are purely data-driven and ignore all existing knowledge
- Long training times → scales with #samples AND #features
- A lot of data is required to get good results → depends heavily on the quality of the data
3. There’s a lot of background & expert knowledge available in structured form!
(Diagram: experts populate a knowledge base, which in turn drives an expert system.)
5. Research question
“Can we combine the advantages of both data-driven and knowledge-driven approaches by incorporating prior knowledge into the steps of a (data-driven) ML process, and what is the impact of this incorporation?”
6. Outline
0. Introduction
1. Lay-out of a typical machine learning process & white box vs black box models
2. Pre-processing: augmenting and balancing the data
3. Feature selection with PageRank
4. Other possible incorporations & conclusion
13. White vs black box models
Instance-based explanation: “Why did you classify this sample as positive?”
↔
Model-based explanation: “What are the most important features?”, “Why do you classify this group of samples as positive?”, …
→ LIME [1], MFI [2], SHAP [3]
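The core idea behind model-agnostic explainers such as LIME [1] can be sketched in a few lines: perturb the sample, query the black-box model, and fit a distance-weighted linear surrogate whose coefficients serve as local feature importances. A minimal sketch — the function name, Gaussian perturbation, RBF kernel, and parameter values are illustrative assumptions, not the actual LIME implementation:

```python
import numpy as np

def local_explanation(predict_fn, x, n_samples=500, scale=0.1, seed=0):
    """Perturbation-based local explanation (LIME-style sketch).

    predict_fn: black box returning P(positive) for a batch of samples.
    Returns one weight per feature; larger magnitude means more
    influence on the prediction *near x*.
    """
    rng = np.random.default_rng(seed)
    # 1. Sample points in a neighbourhood around x.
    Z = x + rng.normal(0.0, scale, size=(n_samples, x.size))
    y = predict_fn(Z)
    # 2. Weight samples by proximity to x (RBF kernel).
    d2 = ((Z - x) ** 2).sum(axis=1)
    w = np.exp(-d2 / (2 * scale ** 2))
    # 3. Fit a weighted least-squares linear surrogate model.
    A = np.hstack([Z, np.ones((n_samples, 1))])   # add intercept column
    W = np.diag(w)
    coef, *_ = np.linalg.lstsq(A.T @ W @ A, A.T @ W @ y, rcond=None)
    return coef[:-1]  # per-feature local weights (drop intercept)

# Example: a black box that only looks at feature 0.
black_box = lambda Z: 1 / (1 + np.exp(-3 * Z[:, 0]))
weights = local_explanation(black_box, np.array([0.2, 0.5]))
```

The surrogate’s weights answer the slide’s question directly: feature 0 gets a large positive weight, feature 1 a weight near zero.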
14. White vs black box models
Model-based explanation:
- Model debugging
- Feature importances → selection
- …
- Faster adoption in critical domains
15. Outline
0. Introduction
1. Lay-out of a typical machine learning process & white box vs black box models
2. Pre-processing: augmenting and balancing the data
3. Feature selection with PageRank
4. Other possible incorporations & conclusion
17. (Table: the Zoo dataset — animals such as Bass, Bear, Boar, Calf, Deer, Girl, … described by features such as Hair?, Feathers?, Eggs?, Milk?, Predator?, Legs?, … — with links from each animal to an external knowledge-graph resource.)
18. (Same table as the previous slide.)
How to link this unstructured data with minimal user interaction?
19. Data augmentation: discovering new features
dbr:Bear → dbo:class → dbr:Mammal
dbr:Bear → dbo:order → dbr:Carnivora
dbr:Flea → dbo:class → dbr:Insect
dbr:Flea → dbo:order → dbr:Endopterygota
…
[3], [4], ...
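The augmentation step sketched above — following predicates such as dbo:class and dbo:order out of the knowledge graph to mint new columns — could look like this. A toy in-memory graph stands in for DBpedia; a real implementation would resolve these triples via a SPARQL endpoint, and the `augment` helper is an illustrative name, not an existing API:

```python
# Toy knowledge graph as (subject, predicate) -> object entries,
# mirroring the DBpedia snippets on the slide above.
KG = {
    ("dbr:Bear", "dbo:class"): "dbr:Mammal",
    ("dbr:Bear", "dbo:order"): "dbr:Carnivora",
    ("dbr:Flea", "dbo:class"): "dbr:Insect",
    ("dbr:Flea", "dbo:order"): "dbr:Endopterygota",
}

def augment(rows, predicates, kg=KG):
    """Add one new column per predicate by following KG links.

    rows: list of dicts, each with an 'entity' key already linked to a
    KG resource. A missing link yields None (a missing value).
    """
    out = []
    for row in rows:
        new = dict(row)
        for p in predicates:
            new[p] = kg.get((row["entity"], p))
        out.append(new)
    return out

data = [{"entity": "dbr:Bear", "hair": 1}, {"entity": "dbr:Flea", "hair": 0}]
augmented = augment(data, ["dbo:class", "dbo:order"])
# augmented[0] now also carries dbo:class = "dbr:Mammal"
```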
20. Data augmentation: open problems
How to find useful features in an immensely large graph?
- When to stop exploring the children of a certain node?
When is a feature useful?
- Not too many introduced missing values
- Can we gain information by adding the new feature?
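The last two criteria can be made concrete: keep a discovered feature only if it does not introduce too many missing values and if it yields information gain on the class labels. A minimal sketch — the thresholds and the `is_useful` name are illustrative choices, not part of the proposed method:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def is_useful(values, labels, max_missing=0.3, min_gain=0.01):
    """Keep a discovered feature only if it (a) does not introduce too
    many missing values and (b) yields information gain on the labels."""
    missing = sum(v is None for v in values) / len(values)
    if missing > max_missing:
        return False
    # Information gain = class entropy minus conditional entropy
    # given the candidate feature's values.
    total = entropy(labels)
    cond = 0.0
    for v in set(values):
        idx = [i for i, x in enumerate(values) if x == v]
        cond += len(idx) / len(values) * entropy([labels[i] for i in idx])
    return total - cond >= min_gain

# A feature that separates the classes is kept; a constant one is not.
keep = is_useful(["dbr:Mammal", "dbr:Mammal", "dbr:Insect"], ["pos", "pos", "neg"])
drop = is_useful(["dbr:Animal", "dbr:Animal", "dbr:Animal"], ["pos", "pos", "neg"])
```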
22. Data balancing approaches
- Custom objective function (higher penalty for the minority class)
- Oversampling of the minority class
- Undersampling of the majority class
+ Model-agnostic
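As a baseline for the sampling approaches, random oversampling duplicates minority-class samples until the classes are balanced; SMOTE [6] and ADASYN [7] refine this by synthesizing interpolated samples instead of duplicating. A minimal, model-agnostic sketch (the helper name and seed are illustrative):

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate minority-class samples until every class matches the
    size of the largest one — the simplest model-agnostic balancing."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = []
    for c, n in zip(classes, counts):
        members = np.flatnonzero(y == c)
        idx.append(members)                # keep all original samples
        if n < n_max:                      # top up by resampling
            idx.append(rng.choice(members, size=n_max - n, replace=True))
    idx = np.concatenate(idx)
    return X[idx], y[idx]

# 4-vs-1 imbalance becomes 4-vs-4 after oversampling.
X = np.arange(10).reshape(5, 2)
y = np.array([0, 0, 0, 0, 1])
Xb, yb = random_oversample(X, y)
```

The custom-objective alternative on the slide achieves a similar effect inside the model instead, by giving minority-class errors a higher weight in the loss.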
25. Outline
0. Introduction
1. Lay-out of a typical machine learning process & white box vs black box models
2. Pre-processing: augmenting and balancing the data
3. Feature selection with PageRank
4. Other possible incorporations & conclusion
27. PageRank Feature Selection: advantages
+ Requires no data
+ Can be used in unsupervised scenarios (e.g. clustering)
+ Fast (runtime is a function of #features, not #samples)
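A sketch of the idea: build a graph whose nodes are features and whose edge weights are semantic similarities taken from background knowledge (e.g. Lin similarity over an ontology), then rank the nodes with PageRank — no training data involved. The damping factor and iteration count below are the usual defaults, chosen here for illustration:

```python
import numpy as np

def pagerank_scores(S, d=0.85, n_iter=100):
    """Rank features by PageRank over a feature-similarity graph.

    S: symmetric matrix of pairwise semantic similarities between
    features (e.g. Lin similarity over an ontology) — no data needed.
    """
    # Column-normalise similarities into transition probabilities.
    P = S / S.sum(axis=0, keepdims=True)
    n = S.shape[0]
    r = np.full(n, 1.0 / n)
    for _ in range(n_iter):          # power iteration
        r = (1 - d) / n + d * (P @ r)
    return r

# Toy similarity graph over four features; feature 0 is most central.
S = np.array([[0, 1, 1, 1],
              [1, 0, 0, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 0]], dtype=float)
scores = pagerank_scores(S)
ranking = np.argsort(scores)[::-1]   # most- to least-central feature
```

Selecting the top-k entries of `ranking` then plays the role of the feature-selection step, with runtime depending only on the number of features.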
28. PageRank Feature Selection: preliminary results
Zoo dataset: 7 classes, 16 features (categorical)
Glass dataset: 6 classes, 9 features (continuous)
→ Task: CLUSTERING
→ Metric: V-Measure (the clustering analogue of F-Measure)
(Chart: features ranked by PageRank using Lin similarity.)
30. PageRank Feature Selection: open problems
How to initialize the edge weights?
- Which similarity/distance measures?
- Distance or similarity between: feature ↔ feature, class ↔ class, feature ↔ class?
What ranking algorithm to use?
- PageRank out of the box?
31. Outline
0. Introduction
1. Lay-out of a typical machine learning process & white box vs black box models
2. Pre-processing: augmenting and balancing the data
3. Feature selection with PageRank
4. Other possible incorporations & conclusion
32. Feature extraction: knowledge subgraph vector embedding
(Figure: the knowledge subgraph around dbr:Bear.)
How to identify the relevant subgraph in the immensely large knowledge graph?
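One way to embed such a subgraph, following RDF2Vec [8], is to enumerate random walks rooted at the entity and feed the resulting token sequences to a word-embedding model such as word2vec. A toy sketch with an in-memory graph — the walk count, depth, and graph contents are illustrative assumptions:

```python
import random
from collections import defaultdict

# Toy knowledge graph as adjacency lists of (predicate, object) pairs.
GRAPH = defaultdict(list, {
    "dbr:Bear": [("dbo:class", "dbr:Mammal"), ("dbo:order", "dbr:Carnivora")],
    "dbr:Mammal": [("rdf:type", "dbo:Species")],
})

def random_walks(entity, n_walks=10, depth=3, seed=0):
    """Sample the local subgraph around an entity as RDF2Vec-style
    random-walk sequences of alternating predicates and objects."""
    rng = random.Random(seed)
    walks = []
    for _ in range(n_walks):
        walk, node = [entity], entity
        for _ in range(depth):
            edges = GRAPH[node]
            if not edges:            # dead end: stop this walk early
                break
            pred, node = rng.choice(edges)
            walk += [pred, node]
        walks.append(walk)
    return walks

walks = random_walks("dbr:Bear")
# each walk starts at the entity and alternates predicate / object
```

Training an embedding model on these sequences yields one vector per entity, so the relevant subgraph only ever needs to be explored walk by walk rather than materialised whole.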
36. References
[1] Ribeiro, Marco Tulio, et al. "Model-agnostic interpretability of machine learning."
[2] Vidovic, Marina M-C., et al. "Feature Importance Measure for Non-linear Learning Algorithms."
[3] Lundberg, Scott, et al. "An unexpected unity among methods for interpreting model predictions."
[4] Paulheim, Heiko, et al. "Data mining with background knowledge from the web."
[5] Terziev, Yordan. "Feature Generation using Ontologies during Induction of Decision Trees on Linked Data."
[6] Chawla, Nitesh V., et al. "SMOTE: Synthetic minority over-sampling technique."
[7] He, Haibo, et al. "ADASYN: Adaptive synthetic sampling approach for imbalanced learning."
[8] Ristoski, Petar, and Heiko Paulheim. "RDF2Vec: RDF graph embeddings for data mining."
37. THANK YOU!
Acknowledgements:
- Reviewers & organizing committee
- My mentor: Agnieszka Ławrynowicz
- My promotors: Filip De Turck & Femke Ongenae
gilles.vandewiele@ugent.be
@Gillesvdwiele