
The potentials of Cooperative Clustering for Graph Data



Research Proposal

Published in: Engineering, Technology, Education

Cooperative Based Graph Clustering

Ahmed Ibrahim
May 9, 2014
Department of Electrical & Computer Engineering, University of Waterloo

1 Introduction

Graph clustering (partitioning) is a well-studied problem, and as a result many fast heuristic algorithms exist for finding high-quality partitionings. They focus on discovering the structures and representative examples present in the raw graph data. However, the question of how to identify the algorithm best suited for a particular graph partitioning problem remains unanswered. It therefore seems intuitive to combine the strengths of various algorithms, so that more than one actor (e.g. algorithm) works together to achieve a clustering solution better than any one of those algorithms produces individually. This approach has been explored in various disciplines, but to our knowledge no work has been done for graph mining.

I developed this approach during my studies and named it Cooperative Clustering for Graphs (CC/G) to indicate that it is based on graphs. CC/G has been applied to software architecture recovery, and experimental work on open source software systems shows that forging cooperation between leading state-of-the-art algorithms produces better results than any single algorithm considered.

The algorithm makes no assumptions about the size or the number of clusters. In addition, it can make use of multiple clustering criterion functions. It also overcomes a drawback of individual clustering algorithms, namely that under different clustering criteria the result does not vary greatly unless the data change by a comparable amount.
[Figure 1: CC/G data flow diagram for software architecture recovery. Diagram labels: Software Program 1 … Software Program n; Extracting Program Structure (graph structure inherent in software); Structure Partitioning (Hill Climbing with three different configurations, Lattix Partitioning Algorithm); Clusters Set ({Ca1 … Cam}, {Cb1 … Cbm}, {Cd1 … Cdm}, {Ce1 … Cem}); CC/G; CC/G Clusters; Evaluation (Partitioning Quality, Similarity).]

2 Cooperative Clustering on Graph

Cooperative Clustering on Graphs is an unsupervised learning algorithm for clustering a graph network into k partitions based on intra-cluster density and inter-cluster sparsity. The main idea is to apply different clustering algorithms to the graph network. Each algorithm provides a different set of k clusters. A common agreement among those clusterings is then found. This agreement identifies the minimum number of clusters k that the graph network should have. The last step is to merge the remaining graph elements, those on which the clusterings disagreed, into the clusters that were initially determined, using optimum intra-cluster density and inter-cluster sparsity. These steps are illustrated by the following figure.

2.1 Illustrative Example

We used CC/G to partition a small mobile game application written in Java (2). We chose this mobile game because it is a nicely designed system with a small number of classes. Figure 2 shows the dependency graph for this mobile game. A dependency graph is a directed graph that represents the dependencies of several software artifacts on one another. The mobile game application consists of 9 classes, which have been labeled with letters for simplicity.

We start by partitioning the application dependency graph using three different clustering algorithms: Next Ascent Hill Climbing (NAHC), Steepest Ascent Hill Climbing (SAHC), and Bertin's reorderable matrix.
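The agreement and merge steps described in Section 2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the author's implementation: it assumes "agreement" means that every clustering places a group of nodes together, that nodes left alone by this rule are the disputed ones, and that each disputed node is merged greedily into the agreed cluster that maximizes TurboMQ, the quality measure of Equations (1) and (2). The function names and the five-node toy graph are hypothetical; the toy graph is not the mobile game's dependency graph.

```python
def turbo_mq(edges, clusters):
    """Sum of cluster factors CF_i over all clusters (TurboMQ)."""
    label = {node: i for i, members in enumerate(clusters) for node in members}
    intra = [0] * len(clusters)  # mu_i: edges inside cluster i
    cut = [0] * len(clusters)    # sum over j != i of (eps_ij + eps_ji)
    for src, dst in edges:
        a, b = label[src], label[dst]
        if a == b:
            intra[a] += 1
        else:
            cut[a] += 1
            cut[b] += 1
    # CF_i is 0 when mu_i is 0, so such clusters are skipped.
    return sum(mu / (mu + 0.5 * eps) for mu, eps in zip(intra, cut) if mu > 0)


def agreement(clusterings):
    """Split nodes into agreed cores and disputed nodes.

    Assumption: nodes agree when every clustering gives them the same
    label signature; a node with a unique signature is disputed.
    """
    groups = {}
    for node in clusterings[0]:
        signature = tuple(c[node] for c in clusterings)
        groups.setdefault(signature, set()).add(node)
    cores = [g for g in groups.values() if len(g) > 1]
    disputed = sorted(n for g in groups.values() if len(g) == 1 for n in g)
    return cores, disputed


def merge_disputed(edges, cores, disputed):
    """Greedily add each disputed node to the core that maximizes TurboMQ.

    Assumption: at least one core exists; edges touching nodes that are
    not placed yet are ignored while scoring.
    """
    clusters = [set(core) for core in cores]
    for node in disputed:
        placed = set().union(*clusters) | {node}
        visible = [(s, d) for s, d in edges if s in placed and d in placed]

        def score(i):
            trial = [c | {node} if j == i else c for j, c in enumerate(clusters)]
            return turbo_mq(visible, trial)

        best = max(range(len(clusters)), key=score)
        clusters[best].add(node)
    return clusters


# Hypothetical toy dependency graph (directed edges) and three clusterings,
# each given as a node -> cluster label mapping.
edges = [("a", "b"), ("b", "a"), ("c", "d"), ("d", "c"),
         ("e", "c"), ("e", "d"), ("a", "c")]
clusterings = [
    {"a": 0, "b": 0, "e": 0, "c": 1, "d": 1},
    {"a": 0, "b": 0, "c": 1, "d": 1, "e": 1},
    {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2},
]

cores, disputed = agreement(clusterings)
merged = merge_disputed(edges, cores, disputed)
print(cores, disputed, merged, turbo_mq(edges, merged))
```

In this toy input the three clusterings agree on two cores and disagree only about node e, which the merge step places with the cluster it depends on most heavily. Scoring placements with TurboMQ is one way to read "optimum intra-cluster density and inter-cluster sparsity"; other criterion functions could be substituted in `score`.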
Then, we measure the modularization quality using the following formula, which defines the cluster factor CF_i of cluster i:

\[
CF_i =
\begin{cases}
0 & \mu_i = 0 \\[4pt]
\dfrac{\mu_i}{\mu_i + \frac{1}{2} \sum_{j=1,\, j \neq i}^{k} (\varepsilon_{i,j} + \varepsilon_{j,i})} & \text{otherwise}
\end{cases}
\tag{1}
\]

Here \mu_i is the number of edges inside cluster i, \varepsilon_{i,j} is the number of edges from cluster i to cluster j, and k is the number of clusters. TurboMQ is given by:

\[
\mathrm{TurboMQ} = \sum_{i=1}^{k} CF_i
\tag{2}
\]

In the second step, we find the common agreement among those clusterings. In the last step, we merge the remaining graph elements, those that exhibited disagreement, into the clusters that were initially determined.

The three clustering algorithms provide different solutions with different modularization quality. They also disagree about the position of element e. CC/G provides a solution to both problems: it gives a better modularization quality and resolves the disagreement between the clustering solutions.

[Figure 2: CC/G illustrative example. Panel labels: Mobile Game Application; Applying 3 different clustering algorithms (D2, 3D, D1); CC/G Clustering for mobile game application. Values shown: MQ = 1.278, TurboMQ = 1.2441, TurboMQ = 0.76923, TurboMQ = 1.2380.]

3 Conclusion and Future Work

The new graph-based clustering approach (CC/G) appears promising for graph mining. It has been applied in the software architecture recovery research domain, where it provides a measurably superior clustering compared with that produced by any single approach.

Cooperative Clustering on Graph (CC/G) can be useful in many graph mining research challenges. It can be used to identify anomalies in graph-based
datasets, which can be useful in a wide range of fields including, but not limited to, network intrusion, fraud detection, health monitoring, and sensor networks. It can also be used as a technique for sub-graph matching in the context of dynamic graphs, i.e. datasets that represent the temporal evolution of relationships between entities.

3.1 For Information Systems Management (Information Technology)

Systems engineering of products, processes, and organizations requires tools and techniques for system decomposition and integration. A design structure matrix (DSM) provides a simple, compact, and visual representation of a complex system that supports innovative solutions to decomposition and integration problems. DSMs have been useful for modeling systems architecture, organizational relations, activity-based or schedule processes, and parameter-based relationships. DSMs are usually analyzed with clustering algorithms. To my knowledge, cooperative clustering has not been used to analyze DSMs.

3.2 For Structural Health Monitoring (Civil Engineering)

Structural health monitoring (SHM) is a process to detect damage in engineering structures and to determine the damage locations and types. Typically, damage in a structure is detected using measurements collected from sensors built for that purpose. SHM research is multidisciplinary in nature and aims to improve structural system performance by employing signal processing techniques, advanced clustering and classification methods, and decision fusion theories. Normally this research starts with an advanced signal processing technique that extracts frequency features from the vibration data; a transformation approach is used to generate a unique set of features that can be used for clustering and classification of the status of the structure's health. Second, a clustering algorithm is used to group the extracted sets of features into homogeneous classes of similar features.
Third, a soft computing approach is used as the decision fusion technique to resolve any conflict, reduce the level of uncertainty, and produce a trustworthy decision with a high level of confidence in the decision made by the clustering and classification algorithm. We argue that CC/G can help find groups of homogeneous features that may add accuracy to the third step. It will also help in discovering the features that have a negative impact on the accuracy of the models.

3.3 For Transportation Network Optimization (Supply Chain Management)

A transportation network is a dynamic, stochastic, and complex system. Modeled as a graph, nodes fall into categories that correspond to manufacturing
sources, distribution centers, and end customers.

The most common objective in transportation network optimization is to find the shortest-distance distribution on a network, i.e., to determine an optimal set of routes between suppliers and customers. Today there is growing interest in a new and much more sophisticated class of network solutions that involve multiple optimization factors such as profit, service level, fault tolerance (or resilience), and environmental footprint, with the optimal solution balancing the complex trade-offs among all these parameters simultaneously.

Designing a distribution network often involves planning routes over regions or deciding on locations for warehouses. Cluster analysis offers an alternative solution that categorizes locations in a systematic way and speeds up the process of exploring several different versions of the clusters.

Although clustering algorithms have been successfully applied to specific transportation network optimization problems, the question of how to identify the clustering algorithm best suited for a particular problem remains unanswered. We argue that CC/G can synthesize a solution from the results of an aggregation of constituent clustering algorithms, with multiple optimization factors such as profit, service level, fault tolerance (or resilience), and environmental footprint. We expect this to produce measurably better results than any one of those algorithms individually.

3.4 For Bug Triaging (Software Bug Classification)

In Storey's paper "The Impact of Social Media on Software Engineering Practices and Tools", I found the following research question very interesting: "What influence will social media have on the quality of software that is collaboratively authored?" I think bug triaging can have a major impact on the quality of software that is collaboratively authored.
An efficient bug triaging procedure has become an important precondition for the success of any large and/or open source collaborative software engineering project. We argue that CC/G can be an efficient and practical algorithm for identifying valid bug reports, i.e. reports that refer to actual software bugs. This can be done by using CC/G to automatically label the bug reporters' positions in the collaboration network based on their attributes. This automated labeling can easily be integrated into bug tracking platforms, where its performance can be analyzed.

3.5 For Software Effort Estimation (Software Engineering)

In Luiz's paper "Improving Effort Estimation by Voting Software Estimation Models", a limitation is that the models the author uses have too many inputs. Applying CC/G to the inputs may help find groups of homogeneous features that could add accuracy to the proposed approach. CC/G will also help in discovering the features that have a negative impact on the accuracy of the models.
Relevant Publications

1. A. Ibrahim, D. Rayside, and R. Kashef, "Cooperative Based Software Clustering on Dependency Graphs," IEEE Canadian Conference on Electrical and Computer Engineering (CCECE), Toronto, Canada, 2014.
2. R. Naseem, O. Maqbool, and S. Muhammad, "Cooperative clustering for software modularization," Journal of Systems and Software, 2013.
3. R. Kashef and M. S. Kamel, "Cooperative clustering," Pattern Recognition, vol. 43, no. 6, pp. 2315-2329, 2010.
4. W. Eberle and L. Holder, "Anomaly Detection in Data Represented as Graphs," Intelligent Data Analysis: An International Journal, vol. 11, no. 6, pp. 663-689, 2007.