KDD Process Design in Collaborative and Distributed Environments

Full thesis (open access): http://openarchive.univpm.it/jspui/handle/123456789/494

Abstract
-------------
Knowledge Discovery in Databases (KDD), as well as scientific experimentation in e-Science, is a complex and computationally intensive process aimed at gaining knowledge from a huge set of data. Often performed in distributed settings, KDD projects usually involve a deep interaction among heterogeneous tools and several users with specific expertise. Given the high complexity of the process, such users need effective support to achieve their goal of knowledge extraction.
This work presents the Knowledge Discovery in Databases Virtual Mart (KDDVM), a user- and knowledge-centric framework aimed at supporting the design of KDD processes in a highly distributed and collaborative scenario, in which computational resources and actors dynamically interoperate to share and elaborate knowledge. The contribution of the work is two-fold. First, a conceptual systematization of the relevant knowledge is provided, with the aim of formalizing, through semantic technologies, each element taking part in the design and execution of a KDD process, including computational resources, data and actors. Second, we propose an implementation of the framework as an open, modular and extensible Service-Oriented platform, in which several services are available both to perform basic data manipulation operations and to support more advanced functionalities. These include managing the deployment and activation of computational resources, service discovery, and service composition to build KDD processes. Since the cooperative design and execution of a distributed KDD process typically require several skills, both technical and managerial, collaboration can easily become a source of complexity if it is not supported by some form of coordination. For this reason, a set of platform functionalities is specifically designed to support collaboration within a distributed team, by providing an environment in which users can work on the same project and share processes, results and ideas.

KDD Process Design in Collaborative and Distributed Environments

  1. Università Politecnica delle Marche. Scuola di dottorato in Scienze dell’Ingegneria, Curriculum in Ingegneria Informatica, Gestionale e dell’Automazione. KDD Process Design in Collaborative and Distributed Environments. Emanuele Storti. Advisor: Prof. Claudia Diamantini. Curriculum supervisor: Prof. Sauro Longhi. 28 February 2012
  2. Summary. I. Introduction • Background & Motivation • Related Work • Research Question. II. Approach. III. Knowledge Layer. IV. Platform • Service Discovery • Process Composition • Collaboration features • Formal Verification of processes. V. Conclusion. 10/12/12 Emanuele Storti KDD Process Design in Collaborative and Distributed Environments
  3. Introduction. “Knowledge is the common wealth of humanity”* Access to vast collections of data is the key to understanding and responding to complex problems. Growing capability in data production (Data Deluge): • availability of cost-effective technologies • enhancement of communication infrastructures. Not only organizations, but also e-Science: • global problems call for data-intensive computation (genomics, physics, climatology) • global effort through world-wide scientific collaboration. Need for tools supporting distribution and collaboration in data analysis. * A. Samassekou, UN World Summit on the Information Society
  4. Background. Knowledge Discovery in Databases (KDD): process of identifying valid, novel, potentially useful patterns in data [Fayyad, 1996] • knowledge refinement • many steps, several iterations • strong user interaction • need for user knowledge. Support to: • decision making in organizations • e-Science experimentation
  5. Related Work. Early proposals (Intelligent Data Analysis systems): • local frameworks • single-user • predefined set of tools (little extensibility). Distribution of tools & computational aspects: • SOA [Ali, 2005] [Kumar, 2005] • Grid [Kgrid, 2003] • OGSA: SOA + Grid. Distribution of users & evolution of organizations: • user support in advanced activities [Bernstein, 2005] [Žáková, 2011] • collaboration [myExperiment, 2009]
  6. Motivation. Main issues in open, distributed and collaborative scenarios: • Localization of many available distributed tools • Integration of heterogeneous tools (interfaces, programming languages, OSs, transfer protocols, …) • Complexity in their usage (data preparation, I/O interpretation, precondition satisfaction, process design), with many possible combinations • Coordination in cooperative work (a source of complexity)
  7. Research Question. How to support a community of users in the design of a KDD project in an open, heterogeneous, distributed and collaborative scenario? 1. Which kind of support and functionalities should a platform offer? 2. Which principles should the platform be based on? 3. Which resources are involved in a KDD process and how to represent them?
  8. Functional Requirements. 1. Support functionalities that should be offered by the platform: • importing new heterogeneous tools for several KDD tasks • retrieving tools useful for certain purposes • designing a process by connecting tools together • understanding whether such connections are meaningful • suggesting tool sequences that are effective for a given goal • executing the process • supporting collaboration throughout the design process (co-operative design, communication among team members)
  9. Non-functional Requirements. 2. Principles on which the platform should be based. Main issues: • heterogeneity • complexity • distribution • coordination. Requirements: • Interoperability • Flexibility • Modularity • Reusability • Transparency • Usability
  10. Resources in a KDD project. 3. Identification of resources involved in a KDD process: • Computational units (gathering, preparation, modeling, visualization, …) • Data and Models (dataset, input parameters, intermediate results, final model) • Actors (domain experts, DB/DWH administrators, DM and KDD experts) • Computational processes
  11. Knowledge- and user-centric approach. Service-Oriented platform. Basic Services: services for every KDD phase; tools are • wrapped as services • deployed on a server • published in a common repository. Support Services: • back-end functionalities • advanced functionalities. Collaborative functionalities: • co-operation • communication facilities. Knowledge Layer: data model, i.e. a systematization of knowledge regarding each KDD resource, from both a theoretical and an operational perspective. (Diagram: a tool running as a service.)
  12. Knowledge Layer. Methodology: • degrees of abstraction for description of resources • semantic technologies for their representation • formal ontology building methodology [Fernandez, 1997] [Staab, 2001] • quality requirements [Gruber, 1995]. Abstraction levels: • Conceptual (abstract representation): domain ontologies • Concrete (actual resources of the platform): semantically annotated XML descriptors • Execution: logs and traces
  13. Knowledge Layer > Computational units [IDA 2009] [ECML 2009]. KDDONTO: • algorithms arranged in a taxonomy • method, task, KDD phase, performances • data arranged in a taxonomy • algorithms’ interfaces (I/O, pre/post-conditions) • … Implemented in OWL-DL (ALCOIF expressivity). (Diagram: KDDONTO classes Algorithm, Method, Task, Phase, Data and Performance Index; an algorithm is implemented as a tool, which runs as a service.)
  14. Knowledge Layer > Computational units. SAWSDL: Semantic Annotations for WSDL (W3C) • location (<xmls:impl>) • I/O <message>: type, syntax. Further extensions (eSAWSDL): • implemented <algorithm> • <performance> • <author> • … Descriptors are stored in a UDDI registry. (Diagram: an eSAWSDL descriptor with <location>, <I/O>, <algorithm>, <performance> and <author>; a UDDI SERVICE entry with Algorithm, WSDL url and QoS value; an algorithm is implemented as a tool, which runs as a service.)
  15. Knowledge Layer > Computational units. Mappings between abstraction levels: • KDDONTO provides a shared vocabulary which services refer to • such mappings can support process composition. (Diagram: KDDONTO concepts “Remove missing values” (algorithm), “Labeled Dataset” and “C4.5” (algorithm), linked to the <output>, <input> and <algorithm> elements of descriptors eSAWSDL1 and eSAWSDL2.)
  16. Knowledge Layer > Actors. TeamONTO is a formal ontology devoted to representing details of actors [CTS 2012]. (Diagram: TeamONTO classes Person, Algorithm, WebService, Domain, Publication, Organization and Project, connected by the properties hasSkill, writes, worksIn, memberOf and about.)
  17. Knowledge Layer > Processes. XML descriptor for concrete-level processes: • <services> included in the process • <connection> among their I/O interfaces • <users> in charge of setting service parameters • metadata: creation <date>/<time>, <author>, <comments>. Process descriptors are stored in a Process Repository. (Diagram: a project connecting the descriptors eSAWSDL 1 to eSAWSDL 4, stored in ProcRep with its index.)
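
As a rough illustration of the process descriptor on slide 17, the sketch below builds a small XML document with Python's standard `xml.etree.ElementTree`. Only the element names `services`, `connection`, `author` and `date` come from the slide; the root element `process`, the nested `service` element, the `name`, `from` and `to` attributes, and the sample values are assumptions made for the example, not the platform's actual schema.

```python
import xml.etree.ElementTree as ET


def build_process_descriptor(services, connections, author, date):
    """Build a toy process descriptor (hypothetical layout, not the KDDVM schema)."""
    process = ET.Element("process")  # root element name is an assumption
    ET.SubElement(process, "author").text = author
    ET.SubElement(process, "date").text = date
    services_el = ET.SubElement(process, "services")
    for name in services:
        # One child per service included in the process.
        ET.SubElement(services_el, "service", name=name)
    for source, target in connections:
        # A connection links the output of one service to the input of another.
        ET.SubElement(process, "connection", {"from": source, "to": target})
    return process


descriptor = build_process_descriptor(
    services=["RemoveMissingValues", "C4.5"],
    connections=[("RemoveMissingValues", "C4.5")],
    author="alice",
    date="2012-02-28",
)
xml_text = ET.tostring(descriptor, encoding="unicode")
```
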
  18. Knowledge Layer > Processes. Process Repository indexing method [SEBD 2011] [SAC 2012]. Application of graph-based hierarchical clustering [Jonyer, 2002]: • extracts the most frequent and common subprocesses • arranges them in a lattice. (Diagram: hierarchical clustering builds an index over ProcRep.)
  19. Knowledge Layer > Processes. Process Repository indexing method [SEBD 2011] [SAC 2012]. Application of graph-based hierarchical clustering [Jonyer, 2002]: • extracts the most frequent and common subprocesses • arranges them in a lattice. Support for process retrieval: subgraph isomorphism on the index is more efficient than on the whole repository. (Diagram: a user query is graph-matched against the ProcRep index.)
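
The idea of answering a query through an index rather than by scanning every stored process can be sketched as follows. This is a deliberately simplified stand-in: it replaces the graph-based hierarchical clustering of the slides with a single-edge inverted index, and it assumes each service appears at most once per process, so that subgraph matching reduces to edge-set containment. The repository contents and the `min_support` threshold are hypothetical.

```python
# Hypothetical repository: process id -> set of (source service, target service) edges.
REPOSITORY = {
    "p1": {("RemoveMissingValues", "C4.5"), ("C4.5", "Evaluate")},
    "p2": {("RemoveMissingValues", "C4.5"), ("C4.5", "Plot")},
    "p3": {("Normalize", "KMeans")},
}


def build_index(repository, min_support=2):
    """Map each edge shared by at least `min_support` processes to their ids."""
    index = {}
    for pid, edges in repository.items():
        for edge in edges:
            index.setdefault(edge, set()).add(pid)
    return {edge: pids for edge, pids in index.items() if len(pids) >= min_support}


def retrieve(query, repository, index):
    """Return the sorted ids of processes that contain every edge of `query`."""
    candidates = set(repository)
    for edge in query:
        if edge in index:
            candidates &= index[edge]  # the index narrows the candidate set
    # Only the surviving candidates are checked in full.
    return sorted(pid for pid in candidates if query <= repository[pid])


INDEX = build_index(REPOSITORY)
```

A query such as `retrieve({("RemoveMissingValues", "C4.5")}, REPOSITORY, INDEX)` is then checked in full only against the processes the index already associates with that edge.
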
  20. Knowledge Layer > Overview. (Diagram: the ontologies KDDONTO (Algorithm, Performance, Data) and TeamONTO (Algorithm, WebService, Person, Project); the descriptors eSAWSDL 1 and eSAWSDL 2 with <location>, <I/O>, <algorithm>, <performance> and <author>; UDDI SERVICE entries with Algorithm, WSDL url and QoS value; projects and processes stored in ProcRep with its index.) Advantages: loose coupling, modularity, reusability, support for advanced functions
  21. KDDVM Platform. Service-oriented platform for KDD process design [CTS 2010] [CTS 2011] [ISF]. (Diagram: platform architecture with collaborative functionalities and support services.)
  22. KDDVM Platform > KDDDesigner • integration point of other support services • platform front-end
  23. Service Discovery. Retrieval of KDD services satisfying user requirements: • syntactic service discovery (service name) • semantic service discovery (functionalities, goal, interface, …). 1. Search, in KDDONTO, for algorithms satisfying the requirements. 2. Search, inside UDDI, for services implementing such algorithms. (Diagram: KDDONTO (Algorithm, Data, Goal), an eSAWSDL descriptor and the corresponding UDDI SERVICE entry.)
  24. Service Discovery. Retrieval of KDD services satisfying user requirements: • syntactic service discovery (service name) • semantic service discovery (functionalities, goal, interface, …). 1. Search, in KDDONTO, for algorithms satisfying the requirements. 2. Search, inside UDDI, for services implementing such algorithms
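
The two-step semantic discovery can be sketched as two lookups chained together. In this minimal example a dictionary stands in for KDDONTO and a list of records stands in for the UDDI registry; the algorithm names, service names, URLs and the choice of "task" as the only requirement are all assumptions made for illustration, not the platform's actual API.

```python
# Hypothetical stand-in for KDDONTO: algorithm -> task it addresses.
ALGORITHM_TASK = {
    "C4.5": "Classification",
    "NaiveBayes": "Classification",
    "KMeans": "Clustering",
}

# Hypothetical stand-in for the UDDI registry of service descriptors.
REGISTRY = [
    {"name": "J48Service", "algorithm": "C4.5", "wsdl": "http://example.org/j48?wsdl"},
    {"name": "KMeansService", "algorithm": "KMeans", "wsdl": "http://example.org/kmeans?wsdl"},
    {"name": "BayesService", "algorithm": "NaiveBayes", "wsdl": "http://example.org/nb?wsdl"},
]


def find_algorithms(task):
    """Step 1: search the ontology for algorithms satisfying the requirement."""
    return sorted(alg for alg, t in ALGORITHM_TASK.items() if t == task)


def find_services(algorithms):
    """Step 2: search the registry for services implementing those algorithms."""
    wanted = set(algorithms)
    return [service for service in REGISTRY if service["algorithm"] in wanted]


def discover(task):
    """Chain the two steps: requirement -> algorithms -> services."""
    return find_services(find_algorithms(task))
```
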
  25. Process Composition. Verification of I/O data compatibility
  26. Process Composition: matchmaking. Checking the validity of the match [IDA 2009] [ITAIS 2009] [ECAI 2010]: • syntactic compatibility: comparison between service descriptors (same format? same syntax?). (Diagram: descriptors eSAWSDL1 and eSAWSDL2 compared through their <I/O> elements.)
  27. Process Composition: matchmaking. Checking the validity of the match [IDA 2009] [ITAIS 2009] [ECAI 2010]: • syntactic compatibility: comparison between service descriptors • semantic compatibility: comparison between ontological annotations of the services (kind of match between I/O, preconditions/postconditions, ...). For two data items data1 and data2 in KDDONTO: same concept? subconcept? part-of concept? Output: match cost, computed by a match evaluation function that considers • kind of match • preconditions • etc.
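
A minimal sketch of the semantic check on slide 27, classifying how an output concept relates to an input concept (same concept, subconcept, part-of) and attaching a cost to each kind of match. The toy taxonomy, the part-of relation and the cost values are hypothetical; the slides do not give the actual match evaluation function, and preconditions are not modelled here.

```python
# Hypothetical is-a taxonomy: concept -> parent concept.
IS_A = {
    "LabeledDataset": "Dataset",
    "NoMissingValuesDataset": "Dataset",
}

# Hypothetical part-of relation: part -> whole.
PART_OF = {"ClassLabels": "LabeledDataset"}

# Illustrative costs only: lower means a better match.
COSTS = {"exact": 0.0, "subconcept": 0.5, "part-of": 1.0}


def ancestors(concept):
    """Yield every concept reachable from `concept` by following is-a links."""
    while concept in IS_A:
        concept = IS_A[concept]
        yield concept


def match(output_concept, input_concept):
    """Return (kind, cost) when the output can feed the input, else None."""
    if output_concept == input_concept:
        return ("exact", COSTS["exact"])
    if input_concept in ancestors(output_concept):
        return ("subconcept", COSTS["subconcept"])
    if PART_OF.get(output_concept) == input_concept:
        return ("part-of", COSTS["part-of"])
    return None
```

Returning the kind of match together with its cost lets a caller both explain the result to the user and rank alternative connections.
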
  28. Process Composition: matchmaking. Usage of matchmaking: • provide the user with information about the validity of the match • find all services compatible with a given one • support an advanced functionality for semi-automatic process composition
  29. Process Composition: semi-automatic procedure. Procedure to generate conceptual processes [IDA 2009] [ITAIS 2009] [ECAI 2010]: • planning technique: goal-driven, backwards strategy • basic step based on matchmaking • pruning criteria, stop criteria, user constraints • results: abstract processes useful as templates, ranked according to a metric
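
The goal-driven, backwards strategy can be illustrated with a small recursive planner that starts from the desired output and works back to the available data. This is a sketch under strong assumptions: the algorithm interfaces are hypothetical, the basic step uses plain equality between data types instead of the matchmaking described earlier, `max_depth` is the only stop criterion, and plans are ranked simply by length.

```python
# Hypothetical algorithm interfaces: name -> (required input types, output type).
ALGORITHMS = {
    "RemoveMissingValues": (["RawDataset"], "CleanDataset"),
    "C4.5": (["CleanDataset"], "DecisionTree"),
    "KMeans": (["CleanDataset"], "Clusters"),
}


def compose(goal, available, max_depth=5):
    """Return all algorithm sequences producing `goal` from the `available` data."""
    if goal in available:
        return [[]]  # nothing to do: the data is already there
    if max_depth == 0:
        return []  # stop criterion: give up on overly long chains
    plans = []
    for name, (inputs, output) in ALGORITHMS.items():
        if output != goal:
            continue  # this algorithm does not produce the current goal
        partial = [[]]
        for needed in inputs:
            # Each required input becomes a sub-goal, solved backwards.
            sub_plans = compose(needed, available, max_depth - 1)
            partial = [p + s for p in partial for s in sub_plans]
        plans.extend(p + [name] for p in partial)
    return sorted(plans, key=len)  # rank: shortest plans first
```

For example, `compose("DecisionTree", {"RawDataset"})` walks back from `C4.5` to `RemoveMissingValues` and returns that two-step chain in execution order.
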
  30. Collaborative Features. Collaborative features for process design within a team [CTS 2011]: 1. Team building: to find users with certain skills & build a team 2. Multi-user process design and versioning 3. Communication 4. Task assignment. Team building: retrieval of users with certain competencies, inside TeamONTO
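
Retrieving users by competency can be sketched as a filter over `hasSkill` facts. The people, the skills and the greedy team-assembly step below are assumptions made for illustration; the slides only state that users with certain competencies are retrieved from TeamONTO, not how a team is then assembled.

```python
# Hypothetical hasSkill facts in the spirit of TeamONTO: person -> skills.
HAS_SKILL = {
    "alice": {"C4.5", "KMeans"},
    "bob": {"DataWarehousing"},
    "carol": {"C4.5", "DataWarehousing"},
}


def find_users(required_skills):
    """Return the users who hold every skill in `required_skills`."""
    required = set(required_skills)
    return sorted(p for p, skills in HAS_SKILL.items() if required <= skills)


def build_team(required_skills):
    """Greedily pick users until all skills are covered (an assumed strategy)."""
    uncovered = set(required_skills)
    team = []
    while uncovered:
        # Choose the person covering the most still-uncovered skills.
        best = max(sorted(HAS_SKILL), key=lambda p: len(HAS_SKILL[p] & uncovered))
        if not HAS_SKILL[best] & uncovered:
            return None  # no one holds any of the remaining skills
        team.append(best)
        uncovered -= HAS_SKILL[best]
    return team
```
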
  31. Collaborative Features. Collaborative features for process design within a team [CTS 2011]: 1. Team building: to find users with certain skills & build a team 2. Multi-user process design and versioning 3. Communication 4. Task assignment. Multi-user process design and versioning: asynchronous editing of a process, and its storage as a new version in the repository
  32. Collaborative Features. Collaborative features for process design within a team [CTS 2011]: 1. Team building: to find users with certain skills & build a team 2. Multi-user process design and versioning 3. Communication 4. Task assignment. Communication: talk page & comments attached to versions
  33. Collaborative Features. Collaborative features for process design within a team [CTS 2011]: 1. Team building: to find users with certain skills & build a team 2. Multi-user process design and versioning 3. Communication 4. Task assignment. Task assignment: the parameter setting for a service execution is assigned to a user of the team
  34. Formal Verification of Processes. 1. Process representation through REO (CWI, Amsterdam) [Arbab, 2004] • language for representation of interaction among services • formal and compositional semantics. 2. Specification of the desired behavior through mCRL2 [Groote, 2006]. 3. Model-checking to verify whether the process satisfies the behavior
  35. Formal Verification of Processes. 1. Process representation through REO (CWI, Amsterdam) [Arbab, 2004] • language for representation of interaction among services • formal and compositional semantics. 2. Specification of the desired behavior through mCRL2 [Groote, 2006]. 3. Model-checking to verify whether the process satisfies the behavior
  36. Formal Verification of Processes. 1. Process representation through REO (CWI, Amsterdam) [Arbab, 2004] • language for representation of interaction among services • formal and compositional semantics. 2. Specification of the desired behavior through mCRL2 [Groote, 2006]. 3. Model-checking to verify whether the process satisfies the behavior. Purposes [SEBD 2010]: • formal verification of KDD processes at design-time • reuse of typical pre-verified KDD subprocesses
  37. Conclusion. KDD process design in distributed and collaborative environments. • Distributed and heterogeneous settings: systematic use of semantic information throughout the process; loosely-coupled service architecture • Community-centered approach: support functionalities; users with various degrees of experience • General-purpose attitude: domain independence; generalization to experimental processes in e-Science
  38. References. [Fayyad, 1996] “From data mining to knowledge discovery: an overview”, American Association for Artificial Intelligence. [Ali, 2005] “Web services composition for distributed data mining”, Parallel Processing. [Kgrid, 2003] “The knowledge grid”, Comm. ACM. [Bernstein, 2005] “Towards Intelligent Assistance for a Data Mining Process: An Ontology Based Approach for Cost-Sensitive Classification”, IEEE TKDE. [Žáková, 2011] “Automating Knowledge Discovery Workflow Composition Through Ontology-Based Planning”, IEEE T. Automation Science and Engineering. [myExperiment, 2009] “The Design and Realisation of the Virtual Research Environment for Social Sharing of Workflows”, FGCS. [Fernandez, 1997] “METHONTOLOGY: from Ontological Art towards Ontological Engineering”, Proc. of the AAAI97 Spring Symposium. [Staab, 2001] “Knowledge Processes and Ontologies”, IEEE Intelligent Systems. [Gruber, 1995] “Toward principles for the design of ontologies used for knowledge sharing”, Int. J. Hum.-Comput. Stud. [Jonyer, 2002] “Graph-based hierarchical conceptual clustering”, J. Mach. Learn. Res. [Groote, 2006] “The formal specification language mCRL2”, in MMOSS. [Arbab, 2004] “Reo: a channel-based coordination model for component composition”, Mathematical Structures in Comp. Sci.
  39. Publications. [IDA 2009] “Ontology-driven KDD Process Composition”. LNCS, Springer, 2009. [ECML 2009] “KDDONTO: an Ontology for Discovery and Composition of KDD Algorithms”. ECML/PKDD09 Workshop. [ITAIS 2009] “Automatic Definition of KDD Prototype Process by Composition”. “Management of Interconnected World”, Springer. [CTS 2010] “Semantic-Driven Design and Management of KDD Processes”. CTS 2010, IEEE. [SEBD 2010] “Towards Coordination Patterns for Complex Experimentations in Data Mining”. SEBD 2010. [ECAI 2010] “Supporting Users in KDD Process Design. A Semantic Similarity Matching Approach”. Planning To Learn Workshop in ECAI 2010. [CTS 2011] “Semantic-Aided Designer for Knowledge Discovery”. CTS 2011, IEEE. [SEBD 2011] “Clustering of process schema by graph mining techniques”. SEBD 2011. [SAC 2012] “Mining Usage Patterns from a Repository of Scientific Workflows”. SAC 2012, ACM. [CTS 2012] “Semantically-supported Team Building in a KDD Virtual Environment”. CTS 2012, IEEE. [ISF] “A Virtual Mart for Knowledge Discovery in Databases”. Information Systems Frontiers, Springer, accepted.
