Supporting Users in KDD Processes Design: a Semantic Similarity Matching Approach

Full paper: http://boole.diiga.univpm.it/paper/planlearn2010.pdf

Data Mining has reached a fairly mature and sophisticated stage, with a plethora of techniques for complex data analysis tasks. In contrast, the capability of users to fully exploit these techniques has not increased proportionately. For this reason, the definition of methods and systems that support users in Knowledge Discovery in Databases (KDD) activities is gaining increasing attention among researchers. The present work fits into this line of research, proposing a methodology and a related system to support users in composing tools into valid and useful KDD processes. The basic pillar of the methodology is a similarity matching technique devised to recognize valid algorithmic sequences on the basis of their input/output pairs. Similarity is based on a semantic description of algorithms, their properties and interfaces, and is measured by a suitable evaluation function. This makes it possible to rank the candidate processes, so that users are given a criterion for choosing the most suitable process with respect to their requests.

  1. UNIVERSITA’ POLITECNICA DELLE MARCHE, Dipartimento di Ingegneria Informatica, Gestionale e dell’Automazione, Ancona, Italy
     Supporting Users in KDD Processes Design: a Semantic Similarity Matching Approach
     Claudia Diamantini, Domenico Potena, Emanuele Storti (storti@diiga.univpm.it)
     PlanLearn 2010, Lisbon, August 17
  2. Outline
     I. Introduction
        a) Aim of the work
        b) Scenario
     II. Methodology
        a) General approach
        b) KDD ontology
        c) Algorithm Matchmaking
        d) Process Composition
     III. Applications
        a) Our framework
        b) Software & services
     IV. Conclusion & Future Work
  3. Aim of the work
     • How to automate the Data Mining process? (Yang et al., 10 Challenging Problems for Data Mining Research, ICDM2005)
         • filling the gap between the knowledge hidden in data and the know-how needed for its extraction
     • New scenario: collaboration/distribution
         • virtual organizations
         • distributed teams and tools
     • Examples: KD for enterprises, E-science projects
  4. Aim of the work
     • KDD in a collaborative/distributed scenario
         • complexity: users have various expertise (Usability)
         • heterogeneity: tools have different interfaces (Integration)
     • KDDVM project: service-oriented platform for sharing, discovering, accessing and executing data analysis and knowledge discovery tools
         • KDD tools produced by different organizations are remotely accessible as basic services through standard protocols
         • Formalization of experts' knowledge in a conceptual semantic model, to support advanced services
         • auto-parameter setting, coordination management, service discovery, process composition
  5. Approach
     • Separation of information in different layers:
         • KDD algorithms: ID3, SVM
         • KDD services: ID3_v1.2, ID3_v2.0, SVM_v.1.0
     • Benefits: loose-coupling, reusability
     • Advanced services rely on such a layer:
         • service discovery
         • process composition
  6. Methodology in a nutshell
     • Formalizing knowledge of KDD experts into an ontology for describing algorithms, their interfaces and their relations
     • Defining techniques for matching algorithms with compatible interfaces
     • Defining a goal-oriented composition procedure which starts from user requests and produces a list of valid processes ranked according to some criteria
     • (Figure labels: goal, dataset, constraints, processes)
  7. KDD Ontology (1)
     • KDDONTO is an ontology formalizing the domain of KDD algorithms:
         • developed following a formal methodology
         • taking into account quality requirements
     • Main classes and relations:
         • Algorithm, Method
         • Task, Phase
         • Data, DataFeature
         • Performance
         • has_input/has_output
         • ...
  8. KDD Ontology (2)
     • KDDONTO is conceived for supporting process composition
     • Properties useful for representing algorithm interfaces:
         • has_condition: pre/postcondition for some input/output data
         • not_with/not_before: explicit incompatibilities between methods
     • Properties useful for representing relations among data:
         • part_of/has_part: relations between a compound datum and its subcomponents
         • in_contrast: explicit incompatibilities between conditions
  9. KDD Ontology (3)
     • Example: SOM (interface)
         • has_input:
             • input_type: UNLABELED_DATASET
             • has_precondition:
                 • condition_type: FLOAT
                 • condition_strenght: 0.4
             • has_precondition:
                 • condition_type: NO_MISSING_VALUES
                 • condition_strenght: 1.0
         • has_input:
             • input_type: VQ
         • has_input:
             • input_type: LEARNING_RATE
             • is_parameter: yes
         • has_output:
             • output_type: VQ
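The SOM interface above can be mirrored in a small data structure. The following is only a sketch: the class names, the `strength` field (standing in for the slide's `condition_strenght` property) and the reading of strength 1.0 as "cannot be relaxed" are assumptions made for illustration, not part of KDDONTO itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Precondition:
    condition_type: str
    strength: float  # assumption: 1.0 means mandatory, lower values may be relaxed


@dataclass
class Input:
    input_type: str
    is_parameter: bool = False
    preconditions: List[Precondition] = field(default_factory=list)


@dataclass
class AlgorithmInterface:
    name: str
    inputs: List[Input]
    outputs: List[str]


# The SOM example from the slide, transcribed field by field.
som = AlgorithmInterface(
    name="SOM",
    inputs=[
        Input(
            "UNLABELED_DATASET",
            preconditions=[
                Precondition("FLOAT", 0.4),
                Precondition("NO_MISSING_VALUES", 1.0),
            ],
        ),
        Input("VQ"),
        Input("LEARNING_RATE", is_parameter=True),
    ],
    outputs=["VQ"],
)


def mandatory_conditions(algo: AlgorithmInterface) -> List[str]:
    """Collect the preconditions that (under our assumption) cannot be relaxed."""
    return [
        p.condition_type
        for i in algo.inputs
        for p in i.preconditions
        if p.strength >= 1.0
    ]
```

Under these assumptions, `mandatory_conditions(som)` keeps only `NO_MISSING_VALUES`, while the `FLOAT` precondition (strength 0.4) is treated as relaxable.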
  10. Algorithm Matchmaking
     • Linking algorithms with compatible interfaces
     • A is compatible with B iff, for each input IN_B of B:
         • (either) IN_B is_parameter
         • (or) ∃ OUT_A such that:
             • OUT_A and IN_B are valid w.r.t. preconditions
             • OUT_A and IN_B are similar datatypes (is_a, part_of)
     • (Figure: can the output of A feed the input of B? Labels: LDS, part_of, L, UDS)
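The compatibility rule above can be sketched as a small predicate. Everything concrete here is hypothetical: the toy `IS_A` and `PART_OF` tables, the datatype names, and the simplification that a precondition is "valid" when every mandatory condition of the input appears among the postconditions of the output. The real matchmaker works on KDDONTO, not on dictionaries.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List

# Hypothetical toy fragment of a datatype hierarchy (not taken from KDDONTO).
IS_A: Dict[str, str] = {"FLOAT_DATASET": "UNLABELED_DATASET"}  # child -> parent
PART_OF: Dict[str, str] = {                                     # part -> whole
    "UNLABELED_DATASET": "LABELED_DATASET",
    "LABELS": "LABELED_DATASET",
}


@dataclass(frozen=True)
class Output:
    type: str
    postconditions: FrozenSet[str] = frozenset()


@dataclass(frozen=True)
class Input:
    type: str
    is_parameter: bool = False
    mandatory: FrozenSet[str] = frozenset()  # preconditions that cannot be relaxed


def ancestors(t: str, relation: Dict[str, str]) -> List[str]:
    """Follow a child -> parent relation upwards and return every node reached."""
    found = []
    while t in relation:
        t = relation[t]
        found.append(t)
    return found


def similar(out_type: str, in_type: str) -> bool:
    """Same type, OUT is a specialization of IN, or IN is a part of OUT."""
    return (
        out_type == in_type
        or in_type in ancestors(out_type, IS_A)
        or out_type in ancestors(in_type, PART_OF)
    )


def compatible(outputs_a: List[Output], inputs_b: List[Input]) -> bool:
    """Every input of B is a parameter or is served by some output of A."""
    return all(
        i.is_parameter
        or any(
            similar(o.type, i.type) and i.mandatory <= o.postconditions
            for o in outputs_a
        )
        for i in inputs_b
    )
```

For example, an algorithm producing a `LABELED_DATASET` without missing values would be accepted as a predecessor of one requiring an `UNLABELED_DATASET` with the same condition, because the latter type is registered as a part of the former in the toy table.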
  11. Matchmaking: cost
     • How to evaluate the cost of a match?
     • Degree of similarity between I/O
         • weighted distance between IN and OUT
         • weight(specialization) < weight(part_of)
     • Preconditions and their possible relaxation
         • the higher the condition_strenght, the higher the cost
     • Performance of algorithms
         • e.g.: the higher the complexity, the higher the cost
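One way to read the three cost factors above is as an additive score. The slide does not say how the factors are combined, so the additive form, the weight values and the function name below are all assumptions; the only property taken from the slide is that a specialization step weighs less than a part_of step.

```python
from typing import Iterable

# Hypothetical weights; the slide only states weight(specialization) < weight(part_of).
W_SPECIALIZATION = 1.0
W_PART_OF = 2.0


def match_cost(
    specialization_steps: int,
    part_of_steps: int,
    relaxed_strengths: Iterable[float] = (),
    complexity_penalty: float = 0.0,
) -> float:
    """Assumed additive cost: ontology distance + relaxed preconditions + performance."""
    distance = W_SPECIALIZATION * specialization_steps + W_PART_OF * part_of_steps
    relaxation = sum(relaxed_strengths)  # stronger relaxed conditions cost more
    return distance + relaxation + complexity_penalty
```

With these illustrative weights an exact match costs nothing, a match through one specialization step is cheaper than one through a part_of step, and relaxing a precondition of strength 0.4 adds 0.4 to the total.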
  12. Composition Procedure (1)
     • Goal-driven procedure for composing KDD processes, exploiting KDDONTO and matching functionalities
         • produces a subset of all possible valid processes
     • I. Definition of dataset, goal and user constraints
         • Dataset: a dataset type and a set of instances of the DataFeature class, e.g.: LabeledDataset {float, balanced, normalized, missing_values}
         • Goal: an instance of the Task class, e.g.: CLASSIFICATION
         • Pruning criteria:
             • max number of algorithms in a process
             • max cost of a process
             • max computational complexity
  13. Composition Procedure (2)
     • II. Process building
         • Starts from the task and goes backwards iteratively
         • At each iteration, algorithms are added to processes by exploiting matching functionalities
         • (Figure labels: A, task, ds)
     • Stop conditions:
         • no process can be further expanded
         • some process constraints are violated
     • Output only valid processes:
         • satisfying the user goal
         • compatible with the given dataset
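The backward process-building step can be sketched as a simple search. This is a minimal illustration, not the KDDComposer implementation: the catalog entries and datatype names are hypothetical, each algorithm has a single data input, and plain type equality stands in for the semantic matchmaking.

```python
from typing import List, NamedTuple, Optional


class Algo(NamedTuple):
    name: str
    in_type: str
    out_type: str
    task: Optional[str] = None  # task solved by the algorithm, if any


# Hypothetical catalog; datatype names are invented for the example.
CATALOG = [
    Algo("FillMissing", "LABELED_DATASET", "CLEAN_LABELED_DATASET"),
    Algo("Normalize", "CLEAN_LABELED_DATASET", "NORMALIZED_DATASET"),
    Algo("ID3", "CLEAN_LABELED_DATASET", "DECISION_TREE", task="CLASSIFICATION"),
    Algo("SVM", "NORMALIZED_DATASET", "CLASSIFIER_MODEL", task="CLASSIFICATION"),
]


def compose(task: str, dataset_type: str, max_len: int) -> List[List[str]]:
    """Build processes backwards from the task until they accept the dataset."""
    open_processes = [[a] for a in CATALOG if a.task == task]
    valid = []
    while open_processes:
        process = open_processes.pop()
        head = process[0]
        if head.in_type == dataset_type:
            # The first algorithm accepts the user's dataset: the process is valid.
            valid.append([a.name for a in process])
            continue
        if len(process) >= max_len:
            continue  # pruning: the length constraint would be violated
        for a in CATALOG:
            if a.out_type == head.in_type:  # stand-in for the semantic match
                open_processes.append([a] + process)
    return valid
```

Calling `compose("CLASSIFICATION", "LABELED_DATASET", max_len=3)` on this toy catalog yields the two chains ending in ID3 and SVM; lowering `max_len` to 2 prunes the longer SVM chain.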
  14. Composition Procedure (3)
     • III. Process ranking
     • Several possible ranking functions:
         • number of algorithms in the process
         • process cost: $\sum_{i=1}^{n} C_i$
         • easiness-of-usage (function of the number of user-parameters)
         • overall computational complexity (function of the max complexity among the algorithms in the process)
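The ranking functions listed above can be sketched as interchangeable sort keys. The `Step` fields, the sample values and the choice to rank in ascending order are assumptions for illustration; in particular, the cost C_i is assumed here to be attached to each step of the process, and complexity is represented by a made-up ordinal number.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    name: str
    cost: float           # assumed per-step cost C_i
    user_parameters: int  # parameters the user has to set
    complexity: int       # hypothetical ordinal complexity class


Process = List[Step]

# One key function per ranking criterion named on the slide (lower is better).
RANKING_FUNCTIONS: Dict[str, Callable[[Process], float]] = {
    "length": len,
    "cost": lambda p: sum(s.cost for s in p),
    "easiness": lambda p: sum(s.user_parameters for s in p),
    "complexity": lambda p: max(s.complexity for s in p),
}


def rank(processes: List[Process], criterion: str) -> List[Process]:
    """Sort candidate processes by the chosen criterion, best first."""
    return sorted(processes, key=RANKING_FUNCTIONS[criterion])


# Invented sample processes to show that criteria can disagree.
short = [Step("ID3", 2.5, 0, 2)]
long_ = [Step("Normalize", 0.5, 0, 1), Step("SVM", 1.0, 2, 3)]
```

In this sample the shorter process wins by length, easiness and complexity, while the longer one wins by total cost, which is why leaving the criterion to the user can matter.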
  15. KDDVM Framework
     • Basic services: BVQ_1.0, PCA_1.0, ..., id3_1.2
     • Resources: UDDI, KDDONTO
     • Advanced services and support services: WSMatch, Semantic Broker, ...
     • Clients: KDDComposer, KDDWebDesigner, BrokerClient, OntoViewer, BasicClient
  16. KDDComposer
     • A prototype implementing the composition procedure
     • Example scenario:
         • Task: CLASSIFICATION
         • Dataset: LabeledDataset
         • Dataset features: {float, normalized, missing_values,...}
         • Constraints: max 5 algorithms, ...
     • Results:
         • a ranked list of many valid processes
         • detailed information about each process, algorithm, match, connection
  17. WSMatch
     • A WS implementing the matchmaking functionality
     • (Figure: a WS client queries WSMatch, which relies on KDDONTO. Labels: match (A, B)? with cost; match (?, B) with match set={...})
  18. KDD WebDesigner
     • search services by name/algorithm (call to SemanticBroker)
     • check compatibility (WSMatch)
  19. Conclusion
     • Open environments and heterogeneous tools
         • different interfaces: need for a common representation (service)
         • abstraction for a high-level description of tools (algorithm)
     • Algorithm matchmaking
         • based on algorithms
         • different similarity relations: subsumption, part_of
         • verification of preconditions/postconditions
         • reusable for several applications
     • Process composition procedure
         • abstract processes are reusable:
             • steps to be performed with real tools
             • composition patterns for solving certain types of problems
         • valid and useful knowledge, valuable for both novice and expert users
  20. Future Work
     • Enhancement of KDDONTO's descriptive capabilities
         • add information about statistical characteristics of data
         • identify which algorithm is likely to perform best
     • Translation of abstract processes into concrete workflows
         • for each algorithm, find corresponding services
         • check for possible mismatches, evaluate syntactic compatibility, perform syntactic translations between different formats
     • Comprehensive tests
         • evaluate effectiveness of the composition procedure
         • evaluate ranking functions