Ontology-driven KDD Process Composition

UNIVERSITA’ POLITECNICA DELLE MARCHE
DIIGA – Dipartimento di Ingegneria Informatica,
Gestionale e dell’Automazione
Ancona, Italy

Claudia Diamantini, Domenico Potena, Emanuele Storti
{diamantini, potena, storti}@diiga.univpm.it
www.diiga.univpm.it

IDA'09, Lyon, Aug 31

Introduction

 Knowledge Discovery in Databases is the non-trivial
process of identifying valid, novel, potentially useful, and
ultimately understandable patterns in data. [Fayyad et al., 1996]
 Many sources of complexity:
 iterative/interactive process
 many tasks and phases
 several algorithms available for each phase, with specific:
 characteristics, interfaces
 preconditions/postconditions
 performance

IDA'09, Lyon, Aug 31 Emanuele Storti

Need for systems that support users in composing algorithms to produce valid
and useful KDD processes


Aim of the work

 Idea: adding semantics to KDD algorithms for
supporting an automatic KDD process
composition procedure


 Formalizing knowledge of KDD experts into an
ontology for describing algorithms, their interfaces
and their relations


 Defining techniques for matching algorithms with
compatible interfaces


 Defining a goal-oriented composition procedure
which starts from user requests and produces a list
of valid processes ranked according to some criteria
(input: goal, dataset, constraints; output: processes)


Framework
 KDDVM project: service-oriented system for
sharing, discovering, accessing, executing Data
Mining and KDD tools

 Separation of information into 3 logical layers:

KDD Algorithm: abstract algorithm
KDD Tool: specific implementation of an algorithm
KDD Service: tool running on a specific machine

Algorithm level → output = prototype KDD processes


KDD Ontology (1)

 KDDONTO is an ontology formalizing the
domain of KDD algorithms:
 developed following a formal methodology [Noy, 2002]
(concept definition → logic modeling → translation into OWL → evaluation)

 taking into account quality requirements [Gruber, 1995]

Main classes and relations:
 Algorithm, Method
 Task, Phase
 Data, DataFeature
 Performance
 has_input/has_output
 ...


KDD Ontology (2)

 KDDONTO is conceived for supporting process
composition
 Properties useful for representing algorithm interfaces:
 has_condition → pre/postcondition for some input/output data
 in_module/out_module → suggestions about composable algorithms
 not_with/not_before → explicit incompatibilities between methods

 Properties useful for representing relations among data:
 part_of/has_part → relations between a compound datum and
its subcomponents
 in_contrast → explicit incompatibilities between conditions
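
As a rough illustration of how relations among data could be queried, the sketch below stores a few facts as plain (subject, property, object) triples and follows part_of links transitively. The triple store, the helper names `objects`, `is_part_of` and `in_contrast`, and every fact except "VQ part_of LVQ" are assumptions made for this example; KDDONTO itself is an OWL ontology and is not reproduced here.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

# Illustrative facts only. "VQ part_of LVQ" appears in the slides;
# the other two triples are assumed for the sake of the example.
TRIPLES: Set[Triple] = {
    ("VQ", "part_of", "LVQ"),
    ("Codebook", "part_of", "VQ"),                           # assumed
    ("no_missing_values", "in_contrast", "missing_values"),  # assumed
}


def objects(subject: str, prop: str, triples: Set[Triple] = TRIPLES) -> Set[str]:
    """Return every object o such that (subject, prop, o) is a stored triple."""
    return {o for s, p, o in triples if s == subject and p == prop}


def is_part_of(part: str, whole: str, triples: Set[Triple] = TRIPLES) -> bool:
    """True if `part` reaches `whole` through one or more part_of links."""
    seen: Set[str] = set()
    frontier = [part]
    while frontier:
        current = frontier.pop()
        for parent in objects(current, "part_of", triples):
            if parent == whole:
                return True
            if parent not in seen:
                seen.add(parent)
                frontier.append(parent)
    return False


def in_contrast(a: str, b: str, triples: Set[Triple] = TRIPLES) -> bool:
    """True if the two conditions are declared incompatible (either direction)."""
    return b in objects(a, "in_contrast", triples) or a in objects(b, "in_contrast", triples)
```

A reasoner over the real ontology would replace these helpers; the point here is only that part_of can be followed across several levels, while in_contrast is checked symmetrically.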


Algorithm Matchmaking
 Linking algorithms with compatible interfaces
Exact Match: interfaces share the same data
- equivalence only
Approximate Match: interfaces share similar data
- is-a and part-of relations
- inferential reasoning on KDDONTO

matchE({A1, A2}, B): $in_B^1 \equiv_o out_{A_1}^1$, $in_B^2 \equiv_o out_{A_1}^2$, $in_B^3 \equiv_o out_{A_2}^1$
matchA({A1, A2}, B): $VQ_B \;\mathrm{part\_of}\; LVQ_{A_1}$, $DATASET_B \equiv_o DATASET_{A_2}$
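
The two kinds of match can be sketched as follows. This is a minimal illustration under several assumptions: data are plain strings, the part_of and is-a facts are small dictionaries rather than an ontology, and a match succeeds only if every input of B is covered by some output of the candidate algorithms. The names `Algorithm`, `match_exact`, `match_approx` and the is-a fact are hypothetical; only "VQ part_of LVQ" is taken from the slide.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Algorithm:
    name: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]


# Assumed ontology fragments, stored as child -> parent.
PART_OF = {"VQ": "LVQ"}               # from the slide: VQ part_of LVQ
IS_A = {"LabeledDataset": "Dataset"}  # assumed for illustration


def _similar(needed: str, offered: str) -> bool:
    """Approximate compatibility: equal, or linked by part-of / is-a."""
    if needed == offered:
        return True
    if PART_OF.get(needed) == offered:   # needed datum is a component of the offered one
        return True
    if IS_A.get(offered) == needed:      # offered datum specialises the needed one
        return True
    return False


def _match(providers: List[Algorithm], consumer: Algorithm,
           compatible: Callable[[str, str], bool]) -> Optional[Dict[str, Tuple[str, str]]]:
    """Map each input of `consumer` to (provider name, output), or None if any input is uncovered."""
    mapping: Dict[str, Tuple[str, str]] = {}
    for needed in consumer.inputs:
        source = next(
            ((a.name, out) for a in providers for out in a.outputs if compatible(needed, out)),
            None,
        )
        if source is None:
            return None
        mapping[needed] = source
    return mapping


def match_exact(providers: List[Algorithm], consumer: Algorithm):
    return _match(providers, consumer, lambda needed, offered: needed == offered)


def match_approx(providers: List[Algorithm], consumer: Algorithm):
    return _match(providers, consumer, _similar)


# Example mirroring the slide: B needs a VQ and a DATASET.
a1 = Algorithm("A1", ("Dataset",), ("LVQ",))
a2 = Algorithm("A2", ("Dataset",), ("Dataset",))
b = Algorithm("B", ("VQ", "Dataset"), ("Model",))
```

With these definitions `match_exact([a1, a2], b)` finds no provider for VQ, while `match_approx([a1, a2], b)` links VQ to the LVQ output of A1 and Dataset to the output of A2.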


Composition Procedure (1)
 Goal-driven procedure for composing KDD processes,
exploiting KDDONTO and matching functionalities
 produces a subset of all possible valid processes

Three phases:
I. Definition of dataset, goal and user constraints
 Dataset: a Dataset type and a set of instances of the DataFeature class
e.g.: LabeledDataset {float, balanced, normalized, missing_values}
 Goal: an instance of the Task class
e.g.: CLASSIFICATION
 Constraints: pruning criteria
• max number of algorithms in a process
• max cost of a process
• max computational complexity



II. Process building
Starts from the task and goes backwards iteratively
At each iteration, algorithms are added to processes
by exploiting matching functionalities
[Figure: processes built backwards from the task towards the dataset (ds)]

Stop conditions: - no process can be further expanded
- some process constraints are violated
Output: only valid processes: - satisfying the user goal (task)
- compatible with the given dataset
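
The backward construction can be sketched as below. This is a simplified illustration, not the procedure implemented in the project: it assumes linear chains, one input and one output per algorithm, exact matching by string equality, and a single constraint (maximum number of algorithms). The class `Algorithm`, the function `compose` and the three sample algorithms are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Algorithm:
    name: str
    consumes: str                # type of the single input
    produces: str                # type of the single output
    task: Optional[str] = None   # task achieved by this algorithm, if any


def compose(algorithms: List[Algorithm], task: str, dataset: str,
            max_len: int) -> List[List[Algorithm]]:
    """Build linear processes backwards from `task` until they accept `dataset`."""
    # Seed: one partial process per algorithm that achieves the task.
    partial = [[a] for a in algorithms if a.task == task]
    valid: List[List[Algorithm]] = []
    while partial:
        process = partial.pop()
        head = process[0]
        if head.consumes == dataset:
            valid.append(process)          # compatible with the given dataset
        if len(process) >= max_len:
            continue                       # constraint reached: do not expand further
        for a in algorithms:
            # Prepend any algorithm whose output feeds the current head.
            if a.produces == head.consumes and a not in process:
                partial.append([a] + process)
    return valid


# Hypothetical algorithm catalogue for illustration.
CATALOGUE = [
    Algorithm("Normalizer", "LabeledDataset", "NormalizedDataset"),
    Algorithm("Classifier", "NormalizedDataset", "Model", task="CLASSIFICATION"),
    Algorithm("TreeClassifier", "LabeledDataset", "Model", task="CLASSIFICATION"),
]

processes = compose(CATALOGUE, "CLASSIFICATION", "LabeledDataset", max_len=5)
names = sorted([a.name for a in p] for p in processes)
```

The loop ends when no partial process can be expanded, and the length check plays the role of a violated constraint; only processes whose first algorithm accepts the dataset are returned.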



III. Process ranking
Cost function takes into account: kind of match (exact / approximate),
precondition relaxation, algorithm performance, ...
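
The slides name the factors of the cost function but not its form. The sketch below shows one possible additive cost, purely as an assumption: the `Step` fields, the two penalty weights and the functions `process_cost` and `rank` are hypothetical and chosen only to make the idea concrete.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    algorithm: str
    exact_match: bool               # False means the link was an approximate match
    relaxed_preconditions: int = 0  # number of preconditions that had to be relaxed
    performance_cost: float = 1.0   # assumed per-algorithm cost (lower is better)


# Assumed weights; the source does not specify any values.
APPROX_PENALTY = 0.5
RELAX_PENALTY = 0.25


def process_cost(steps: List[Step]) -> float:
    """Sum per-step costs, penalising approximate matches and relaxed preconditions."""
    return sum(
        s.performance_cost
        + (0.0 if s.exact_match else APPROX_PENALTY)
        + RELAX_PENALTY * s.relaxed_preconditions
        for s in steps
    )


def rank(processes: List[List[Step]]) -> List[List[Step]]:
    """Return processes ordered from lowest to highest cost."""
    return sorted(processes, key=process_cost)


p1 = [Step("A", True), Step("B", True)]
p2 = [Step("C", False, relaxed_preconditions=1)]
p3 = [Step("D", True)]
ranked = rank([p1, p2, p3])
```

Under these assumed weights a short process with an approximate match can still rank above a longer process made only of exact matches, which is the kind of trade-off a ranking criterion has to settle.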


KDDComposer
 A prototype implementing the composition
procedure
Example scenario:
Task: CLASSIFICATION
Dataset: LabeledDataset
Dataset features:
{float, normalized,
missing_values,...}
Constraints: max 5 algorithms, etc.


Results
A ranked list of many valid processes
Compared to a non-ontological approach → more valid processes (inference)
→ fewer invalid processes (ontological and
non-ontological pruning)


Conclusion
 Procedure for composing valid KDD processes
 semantic representation of algorithms and data

Advantages
 KDDONTO → resulting processes are valid;
supports complex pruning strategies
 Approximate Match → more valid results (novel w.r.t. other works in the literature)
 Ranking according to both ontological and non-ontological criteria
 Prototype processes can themselves be considered valid, previously unknown and useful
knowledge, valuable for both novice and expert users

Future work
 translating each prototype process into a concrete workflow of KDD Web Services

Project website: http://boole.diiga.univpm.it

