UNIVERSITA’ POLITECNICA DELLE MARCHE
                       DIIGA – Dipartimento di Ingegneria Informatica,
                               Gestionale e dell’Automazione
                                       Ancona, Italy




              Ontology-Driven
          KDD Process Composition

            Claudia Diamantini, Domenico Potena, Emanuele Storti
                  {diamantini, potena, storti}@diiga.univpm.it
                              www.diiga.univpm.it




IDA'09, Lyon, Aug 31
Introduction

   Knowledge Discovery in Databases is the non-trivial
    process of identifying valid, novel, potentially useful, and
    ultimately understandable patterns in data. [Fayyad et al., 1996]
   Many sources of complexity:
            iterative/interactive process
            many tasks and phases
            several algorithms available for each
             phase, with specific:
                characteristics, interfaces
                preconditions/postconditions
                performances





Need for systems that support users in composing algorithms to produce valid
and useful KDD processes

IDA'09, Lyon, Aug 31                 Emanuele Storti
Aim of the work

   Idea: adding semantics to KDD algorithms for
    supporting an automatic KDD process
    composition procedure
    Formalizing knowledge of KDD experts into an
     ontology for describing algorithms, their interfaces
     and their relations

    Defining techniques for matching algorithms with
     compatible interfaces

    Defining a goal-oriented composition procedure
     which starts from user requests (goal, dataset, constraints) and produces a list
     of valid processes ranked according to some criteria



Framework
   KDDVM project: service-oriented system for
    sharing, discovering, accessing, executing Data
    Mining and KDD tools

   Separation of information into three logical layers:

    KDD Algorithm       abstract algorithm

       KDD Tool         specific implementation of an algorithm

     KDD Service        tool running on a specific machine

Algorithm level: output = prototype KDD processes
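
The three layers can be pictured as nested records. This is a minimal sketch only: the class names, field names, tool label and host below are hypothetical and are not taken from the KDDVM project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KDDAlgorithm:
    """Abstract algorithm, independent of any implementation."""
    name: str


@dataclass(frozen=True)
class KDDTool:
    """Specific implementation of an algorithm."""
    algorithm: KDDAlgorithm
    implementation: str  # hypothetical label for the implementation


@dataclass(frozen=True)
class KDDService:
    """Tool running on a specific machine."""
    tool: KDDTool
    host: str  # hypothetical host name


# Illustrative instances; the tool label and host are made up.
lvq = KDDAlgorithm("LVQ")
lvq_tool = KDDTool(lvq, "example-lvq-implementation")
lvq_service = KDDService(lvq_tool, "node1.example.org")

# Composition at the algorithm level only needs the top layer.
print(lvq_service.tool.algorithm.name)
```

Working at the algorithm level means a composed process refers only to `KDDAlgorithm` records; tools and services would be bound later.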


KDD Ontology (1)

   KDDONTO is an ontology formalizing the
    domain of KDD algorithms:
       developed following a formal methodology [Noy, 2002]
    (concept definition → logic modeling → translation into OWL → evaluation)

       taking into account quality requirements [Gruber, 1995]

    Main classes and relations:
       Algorithm, Method
       Task, Phase
       Data, DataFeature
       Performance
       has_input/has_output
       ...


KDD Ontology (2)

   KDDONTO is conceived for supporting process
    composition
       Properties useful for representing algorithm's interfaces:
           has_condition        pre/postcondition for some input/output data
           in_module/out_module suggestions about composable algorithms
           not_with/not_before  explicit incompatibilities between methods

       Properties useful for representing relations among data:
           part_of/has_part        relations between a compound datum and
                                     its subcomponents
           in_constrast            explicit incompatibilities between conditions
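
A sketch of how these properties might annotate algorithms, using plain Python dictionaries rather than OWL. Only the property names come from the slide (in_constrast is spelled as it appears there); the algorithm names, data names and conditions are illustrative assumptions, not content of KDDONTO.

```python
# Hypothetical algorithm descriptions keyed by the slide's property names.
ALGORITHMS = {
    "Classifier": {
        "has_input": ["LabeledDataset"],
        "has_output": ["Model"],
        # has_condition: precondition attached to an input datum
        "has_condition": {"LabeledDataset": ["no_missing_values"]},
    },
    "MissingValueRemoval": {
        "has_input": ["LabeledDataset"],
        "has_output": ["LabeledDataset"],
        "has_condition": {},
    },
}

# in_constrast: pairs of explicitly incompatible conditions (assumed pair).
IN_CONSTRAST = {frozenset({"missing_values", "no_missing_values"})}


def precondition_ok(algorithm, datum, features):
    """True if no dataset feature contrasts with a required precondition."""
    required = ALGORITHMS[algorithm]["has_condition"].get(datum, [])
    return not any(
        frozenset({feature, condition}) in IN_CONSTRAST
        for feature in features
        for condition in required
    )


features = {"float", "normalized", "missing_values"}
print(precondition_ok("Classifier", "LabeledDataset", features))           # False
print(precondition_ok("MissingValueRemoval", "LabeledDataset", features))  # True
```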




Algorithm Matchmaking
       Linking algorithms with compatible interfaces

Exact Match: interfaces share the same data
 - equivalence only

Approximate Match: interfaces share similar data
 - is-a and part-of relations
 - inferential reasoning on KDDONTO

matchE({A1, A2}, B):
    $in_B^1 \equiv_o out_{A_1}^1$,   $in_B^2 \equiv_o out_{A_1}^2$,   $in_B^3 \equiv_o out_{A_2}^1$

matchA({A1, A2}, B):
    $VQ_B$ part_of $LVQ_{A_1}$,   $DATASET_B \equiv_o DATASET_{A_2}$
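
The two kinds of match can be sketched as follows. The slides do not give the formal definitions, so this is an assumption-laden illustration: ontological equivalence is reduced to name equality, is-a and part-of are plain sets of pairs with no transitive inference, the direction chosen for is-a (an output may be more specific than the required input) is an assumption, and the function names are hypothetical.

```python
def related(inp, out, is_a, part_of):
    """An input is served by an output that is equal to it, that is a
    subclass of it (assumed direction), or that contains it as a part."""
    return inp == out or (out, inp) in is_a or (inp, out) in part_of


def match_exact(provider_outputs, consumer_inputs):
    """Every input of B equals some output of the providing algorithms."""
    outputs = {o for outs in provider_outputs.values() for o in outs}
    return all(i in outputs for i in consumer_inputs)


def match_approx(provider_outputs, consumer_inputs, is_a, part_of):
    """Every input of B is equal or ontologically related to some output."""
    outputs = {o for outs in provider_outputs.values() for o in outs}
    return all(
        any(related(i, o, is_a, part_of) for o in outputs)
        for i in consumer_inputs
    )


# Mirrors the slide's approximate-match example: VQ part_of LVQ.
providers = {"A1": ["LVQ"], "A2": ["DATASET"]}
b_inputs = ["VQ", "DATASET"]
part_of = {("VQ", "LVQ")}

print(match_exact(providers, b_inputs))                 # False
print(match_approx(providers, b_inputs, set(), part_of))  # True
```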


Composition Procedure (1)
   Goal-driven procedure for composing KDD processes,
    exploiting KDDONTO and matching functionalities
     produces a subset of all possible valid processes


Three phases:
I. Definition of dataset, goal and user constraints

Dataset: a Dataset type and a set of instances of the DataFeature class
    e.g.: LabeledDataset {float, balanced, normalized, missing_values}

Goal: an instance of the Task class
    e.g.: CLASSIFICATION

Constraints: pruning criteria
    • max number of algorithms in a process;
    • max cost of a process;
    • max computational complexity
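
A user request for phase I could be held in a small record like the one below. This is a sketch under assumptions: the class and field names are hypothetical, and only two of the three pruning criteria are modelled, since the slides do not say how computational complexity is measured.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Constraints:
    """Pruning criteria; None means the criterion is not set."""
    max_algorithms: Optional[int] = None
    max_cost: Optional[float] = None

    def allows(self, n_algorithms: int, cost: float) -> bool:
        if self.max_algorithms is not None and n_algorithms > self.max_algorithms:
            return False
        if self.max_cost is not None and cost > self.max_cost:
            return False
        return True


@dataclass(frozen=True)
class Request:
    dataset_type: str            # a Dataset type
    features: FrozenSet[str]     # instances of the DataFeature class
    task: str                    # an instance of the Task class
    constraints: Constraints


request = Request(
    dataset_type="LabeledDataset",
    features=frozenset({"float", "balanced", "normalized", "missing_values"}),
    task="CLASSIFICATION",
    constraints=Constraints(max_algorithms=5),
)

print(request.constraints.allows(n_algorithms=5, cost=0.0))  # True
print(request.constraints.allows(n_algorithms=6, cost=0.0))  # False
```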

Composition Procedure (2)

II. Process building
Starts from the task and goes backwards iteratively
At each iteration, algorithms are added to processes
by exploiting matching functionalities
[Diagram: processes built backwards from task to dataset (ds)]

Stop conditions: - no process can be further expanded
                 - some process constraints are violated
Output: only valid processes: - satisfying the user goal (task)
                              - compatible with the given dataset

III. Process ranking
Cost function takes into account: kind of match (exact / approximate),
precondition relaxation, algorithm performances, ...
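
Phases II and III can be sketched on a toy catalogue. This is not the KDDComposer implementation: the catalogue, the is-a pair, the restriction to linear chains with one input and one output per algorithm, and the cost function (0 for an exact match, 1 for an approximate one) are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical catalogue of algorithms.
CATALOGUE = {
    "Classifier": {"input": "CleanDataset", "output": "Model",
                   "task": "CLASSIFICATION"},
    "RemoveMissingValues": {"input": "LabeledDataset",
                            "output": "CleanDataset", "task": None},
    "Imputation": {"input": "LabeledDataset",
                   "output": "ImputedDataset", "task": None},
}
# Assumed is-a relation enabling one approximate match.
IS_A = {("ImputedDataset", "CleanDataset")}


def compose(task, dataset, max_algorithms):
    """Build processes backwards from the task, then rank them by cost."""
    queue = deque((0, (name,)) for name, a in CATALOGUE.items()
                  if a["task"] == task)
    valid = []
    while queue:
        cost, process = queue.popleft()
        needed = CATALOGUE[process[0]]["input"]
        if needed == dataset:
            valid.append((cost, process))  # compatible with the dataset
            continue
        if len(process) >= max_algorithms:
            continue  # a constraint would be violated: prune
        for name, a in CATALOGUE.items():
            if name in process:
                continue  # avoid loops
            if a["output"] == needed:              # exact match
                queue.append((cost, (name,) + process))
            elif (a["output"], needed) in IS_A:    # approximate match
                queue.append((cost + 1, (name,) + process))
    return sorted(valid)  # lowest cost first


for cost, process in compose("CLASSIFICATION", "LabeledDataset", 5):
    print(cost, " -> ".join(process))
```

The loop ends when the queue is empty, that is, when no process can be further expanded; processes that exceed the length limit are dropped rather than returned.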

KDDComposer
   A prototype implementing the composition
    procedure
Example scenario:
Task:   CLASSIFICATION
Dataset: LabeledDataset
Dataset features:
    {float, normalized,
    missing_values,...}
Constraints: max 5 algorithms, etc.

Results
A ranked list of many valid processes
Compared to a non-ontological approach: more valid processes (inference)
                                        fewer invalid processes (ontological and
                                        non-ontological pruning)

Conclusion
   Procedure for composing valid KDD processes
       semantic representation of algorithms and data

Advantages
   KDDONTO: resulting processes are valid;
                  supports complex pruning strategies
   Approximate Match: more valid results (novel w.r.t. other works in the literature)
   Ranking according to both ontological and non-ontological criteria
   Prototype processes can themselves be considered valid, previously unknown and useful
    knowledge, valuable for both novice and expert users



Future work
   translating each prototype process into a concrete workflow of KDD Web Services



Project website




                         Project website: http://boole.diiga.univpm.it


