
Re-engineering Pattern Extraction for Program Understanding
Technical Report CS-014-2004
Onaiza Maqbool, Haroon Babri
Computer Science Department
Lahore University of Management Sciences (LUMS)
Acknowledgment

We would like to thank the Lahore University of Management Sciences for providing funding for this research. Thanks to Rainer Koschke of the Bauhaus group at the University of Stuttgart for providing the RSF for Bash, and to Johannes Martin of the Rigi group, University of Victoria, for providing the RSF for Xfig.
Table of Contents

Acknowledgment
Table of Contents
1. Introduction
2. Related work
3. An overview of association rule mining
4. Re-engineering pattern extraction
5. Experiments and results
   5.1 The test systems
   5.2 Analysis of results
6. Conclusions
References
List of Tables

Table 1: A set of transactions
Table 2: Statistics of the Bash and Xfig systems
Table 3: Global variables accessed by maximum number of functions
Table 4: List of files containing frequently accessed functions (Xfig)
Table 5: Function calls made in various sub-systems
Table 6: Number of function calls made to functions in each sub-system
Table 7: Five or more function calls made to functions in each sub-system
Table 8: Accesses to user defined types in various Xfig sub-systems
List of Figures

Figure 1: Number of functions accessing a global variable (Bash)
Figure 2: Global variable access breakdown (Bash)
Figure 3: Number of functions accessing a global variable (Xfig)
Figure 4: Global variable access breakdown (Xfig)
Figure 5: Functions called by one function only (Bash)
Figure 6: Functions called by one function only (Xfig)
Figure 7: Functions that are called together (Bash)
Figure 8: Functions that are called together (Xfig)
Figure 9: Global variables that are accessed together with high confidence and support > 2 (Bash)
Figure 10: Association between global variables (Bash)
Figure 11: Global variables that are accessed together with high confidence and support > 2 (Xfig)
Figure 12: Association between global variables (Xfig)
Figure 13: User defined types that are accessed together with high confidence (Bash)
Figure 14: Association between user defined types (Bash)
Figure 15: User defined types that are accessed together with high confidence (Xfig)
Figure 16: Association between types (Xfig)
Figure 17: Number of functions accessing a type (Bash)
Figure 18: Number of functions accessing a type (Xfig)
Figure 19: Confidence 1 association between user defined types (Bash)
Figure 20: Confidence 1 association between user defined types (Xfig)
Figure 21: Confidence 1 association between globals and user defined types (Bash)
Figure 22: Confidence 1 association between globals and user defined types (Xfig)
Figure 23: Number of function calls made to functions (Bash)
Figure 24: Number of function calls made to functions (Xfig)
Figure 25: Number of function calls made to functions (Xfig d_* files sub-system)
Figure 26: Number of function calls made to functions (Xfig e_* files sub-system)
Figure 27: Number of function calls made to functions (Xfig f_* files sub-system)
Figure 28: Number of function calls made to functions (Xfig u_* files sub-system)
Figure 29: Number of function calls made to functions (Xfig w_* files sub-system)
Figure 30: Types accessed by a subsystem (d_* files)
Figure 31: Types accessed by a subsystem (e_* files)
Figure 32: Types accessed by a subsystem (f_* files)
Figure 33: Types accessed by a subsystem (u_* files)
Figure 34: Types accessed by a subsystem (w_* files)
1. Introduction

Legacy systems are old software systems that are crucial to the operation of a business. These systems are expected to have undergone changes in their lifetime due to changes in requirements, business conditions and technology. It is quite likely that such changes were made without proper regard to software engineering principles. The result is often a deteriorated structure, which is unstable but cannot be discarded because it is costly to do so. Another reason for retaining these legacy systems is that they embed business knowledge which is not documented elsewhere.

Since it is often not feasible to discard a system and develop a new one, techniques must be employed to improve the structure of the existing system. An effective strategy for change must be devised; re-engineering is one such strategy. Re-engineering is a process that re-implements legacy systems to make them more maintainable 5. According to 5, re-engineering is any activity that improves one’s understanding of software or prepares/improves the software itself, usually for increased maintainability, reusability or evolvability. Given that software maintenance usually accounts for over 50% of project effort 5,5,5, making it the single most expensive software engineering activity 5, and perhaps the most important life cycle phase 5, the need for re-engineering to ease the maintenance effort is justified. The re-engineering option should be chosen when system quality has been degraded by regular change but change is still required, i.e. the system under consideration has low quality but a high business value, and the re-engineering effort is less risky and less costly than system replacement.

The re-engineering effort starts with gaining an understanding of the software system, a process known as reverse engineering. Understanding is critical to many activities including maintenance, enhancement, reuse, design of a similar system and training 5. Reverse engineering has been heralded as one of the most promising technologies to combat the legacy systems problem 5. However, gaining system understanding is difficult because documentation for the system is often not available and source code files are the only source of information about the system. According to 5, system understanding takes up 47% of the software maintenance effort. Hall 5 places the system understanding effort at 47%-62%. Tools and techniques are thus required to make the program understanding task easier.

Tools provide automated support for system understanding at the procedural level by extracting the procedural design, or at a higher level by extracting the architectural design. Tools for procedural design extraction are based on techniques including knowledge based systems with inference rule engines 5, graph parsing 5 and program plan recognition 5. Techniques utilized by architectural design recovery tools include the composition of sub-system hierarchies using (k,2)-partite graphs 5, graph transformations using relational algebra 5 and view extraction and fusion using SQL 5.

In the past, the application of deductive techniques to different aspects of software engineering has been more frequent than the use of inductive techniques 5. Deductive techniques usually employ some knowledge base as their underlying technology, which is used to deduce relations within the software system. Great effort is required to build the knowledge base and continuously maintain it. Moreover, algorithms used for deductions may be computationally demanding 5. Researchers have thus started exploring the use of inductive or data mining techniques in software engineering.

Data mining is considered one of the most promising interdisciplinary developments in the information industry 5. There has been growing interest in the application of data mining techniques to gain better understanding of software systems. In recent years, researchers have applied data mining techniques in different contexts, e.g. for architecture recovery of legacy systems 5 - 5, to discover patterns for re-using library components 5, 5, to support software system maintenance 5, 5 - 5, to discover user interaction patterns 5, to aid program understanding 5, 5 and to facilitate software re-use 5 - 5. In this report, our focus is on the use of association rule mining for discovering patterns within the source code that are helpful in system understanding and improvement.

Patterns were first adopted by the software community as a way of documenting recurring solutions to design problems 5. However, their use is not restricted to design problems; they have been used as an effective means to communicate best practice in various aspects of software development including the development process, testing and re-engineering. Re-engineering patterns present solutions to re-engineering problems. It is interesting to note that although re-engineering may be carried out for a number of different reasons, the actual technical problems within legacy systems are often similar, and hence some general techniques or patterns can be utilized to aid in the re-engineering task 5. Re-engineering patterns for object-oriented legacy systems were identified based on experiences during the FAMOOS project 5, carried out to support the evolution of object-oriented software. In this report, we identify patterns for traditional legacy systems developed using the structured approach with functions as basic components. Given the source files of such a software system, we use association rule mining algorithms and tools to gain insight about the software. The understanding gained allows suggestions for making subsequent changes and optimizations to the source code for better maintainability.

The organization of this report is as follows. In section 2 we present related research. Section 3 gives an overview of association rule mining. Section 4 details our approach. Section 5 gives the results of applying association rule mining to two test systems. Finally, we present the conclusions.
2. Related work

The use of data mining for software understanding is gaining popularity, as evidenced by recent work in this area. One possible categorization of the work done would be according to the mining technique employed. Popular techniques include association rule mining, concept learning and cluster analysis. Another useful categorization is according to the application or purpose of the work. Since some researchers employ more than one technique to arrive at results, in this report we categorize research according to its purpose, e.g. architecture recovery, re-modularization, maintenance support, facilitating re-use, interaction pattern mining and program comprehension.

The architecture recovery of software systems using data mining techniques is discussed in 5 - 5. The data mining technique employed in these papers is primarily association rule mining. The identification of sub-systems based on associations (ISA) was proposed by Oca and Carver 5. They used association rule mining for extraction of data cohesive sub-systems by grouping together programs that use the same data files. Experiments were performed on a 25 KLOC Cobol system with 28 programs and 36 data files, and sub-systems were successfully identified. However, the results obtained were not compared with any existing documentation or decomposition. A very similar approach is that of 5, where the use of a representation model (RM) is discussed to represent the sub-systems identified using the ISA methodology in 5.

Sartipi et al. 5 discuss a technique for recovering the structural view of a legacy system’s architecture based on association rule mining, clustering and matches with architectural plans. Rather than using programs as sub-system entities as in 5, functions are used. Moreover, associations between functions are determined based on the variables and types they access, and the function calls they make. After closely associated function groups have been identified, a branch and bound algorithm is used for matching with architectural plans. In experiments performed with the CLIPS system (40 KLOC, 734 functions, 59 aggregate data types and 163 global variables), a precision level of 90% and a recall level of 60% is maintained between the conceptual and concrete architecture recovered through the defined process.

Tjortjis et al. 5 also employ association rule mining to arrive at decompositions of a system at the function level. Their rule mining approach is similar to the one discussed in 5, where attributes are variables, data types and calls to functions. Groups of functions, i.e. sub-systems, are created by finding common attributes participating in the same association rules. Experiments performed on a Cobol system, with programs of an average size of 1000 lines of code, show that the results are satisfactory when compared with a mental model constructed by a developer of the system.

Software re-modularization using clustering techniques is discussed in 5 - 5. Tzerpos and Holt 5 present the case for using clustering techniques to re-modularize software, after the techniques have been adapted to fit the peculiarities of the software domain. In 5, Wiggerts provides a framework to apply cluster analysis for re-modularization. The clustering process is described in some detail, along with commonly used similarity measures and clustering algorithms. Experiments with clustering as a re-modularization method are described in 5 and 5. Both papers conclude by recommending similarity measures and clustering algorithms which yield good experimental results for software artifacts. A theoretical explanation for some previous experimental results obtained by researchers in the area is provided in 5. The paper also describes a new clustering algorithm, which gives better results than the currently employed algorithms for clustering software. In 5, the weighted combined algorithm for software clustering is presented. This algorithm shows improvement in clustering results as compared to previously employed clustering algorithms.

Data mining techniques have also been employed to support the maintenance task 5, 5 - 5. The use of inductive techniques to extract a maintenance relevance relation (MRR) from the source code, maintenance logs and historical maintenance update records is discussed in 5, 5, 5. An MRR simply indicates that if a software engineer needs to understand file1, he/she probably also needs to understand file2. The problem has been presented as a concept learning problem, and a decision tree classifier is used for classifying file pairs as relevant, not relevant and potentially relevant. Relevance indicates that two files were modified in the same update, and potential relevance indicates that both files were looked at in the same update. Experiments performed on the SX2000 system, a large legacy telephony switching system with 1.9 MLOC and 4700 files, show that the 2-class problem, with relevant and non-relevant classes only, yields better results than the 3-class problem. It is also seen that combining text based features with syntactic features yields better results than using syntactic features alone.

Zimmermann et al. 5 apply association rule mining to ease maintenance by mining version histories. Association rule mining is used to predict likely changes after a change has been made, prevent errors due to incomplete changes and detect coupling between items which is not revealed by program analysis. Mining is carried out through the ROSE tool developed for this purpose. The tool reads a version archive and groups changes into transactions so that rules describing them are formed. ROSE was tested on 8 open source projects and was found to be a helpful tool in suggesting further changes and in warning about missing changes.

Software re-use can be facilitated by data mining techniques 5, 5, 5 - 5. Software library reuse patterns have been mined using generalized association rules in 5. The discovered patterns serve as guides to identify typical usage of the software library, i.e. the combination of classes and member functions that are typically re-used by applications. The paper extends earlier work by the author 5, in which association rules rather than generalized association rules were used. Re-use patterns for the KDE core libraries version 1.1.2 were mined by analyzing 76 applications. Reusable components are identified by finding similarities between components using lexical rather than structural techniques in 5. McCarey et al. 5 utilize collaborative filtering for recommending reusable components by enabling prediction of the utility of an item based on the user’s previous history and the opinions of other like-minded users. The building of a digital library for source code to facilitate reuse of code segments is discussed in 5. Garg 5 discusses the sharing of knowledge of re-usable components across multiple projects.

El-Ramly et al. 5 apply sequential data mining to user activities to discover interaction patterns depicting how users interact with systems. Interaction pattern mining is applied to legacy and web-based systems. For legacy systems, these patterns reflect active services. For web-based systems, the patterns can be used for re-engineering the website for easier and faster access.

Program comprehension can be aided by source code mining 5, 5. Balanyi and Ferenc 5 automatically search for design patterns in C++ code. A tool called Columbus is used to build an internal representation of the source code, which is compared with pattern descriptions written in DPML, a language based on XML. Four publicly available C++ projects were used for experiments. Some problems were faced because of rule violations in implementation, and when the structures of patterns were similar. However, the overall results were satisfactory. The comprehension of C++ programs by clustering together similar entities based on their attributes is proposed in 5. Four entities are used: classes, member data, member functions and parameters. Results of applying the process to three open source systems have been presented. Analysis of the results reveals correlations between classes, thus revealing portions of code that have common characteristics and are expected to change together.

A workshop on mining software repositories for assisting in program understanding and studying evolution was held recently 5. Papers presented in the workshop covered various aspects, including the infrastructure required for extraction of information, use of mining for program understanding, identification of change patterns, defect analysis, software re-use, and process and community analysis.
3. An overview of association rule mining

Association rule mining is a data mining technique that finds interesting association or correlation relationships among a large set of data items 5. Traditionally, association rule mining has been employed as a useful tool to support business decision making by discovering interesting relationships among business transaction records.

To illustrate the concept of association rule mining, consider a set of items I = {i1, i2, …, in}. Let D be a set of transactions, with each transaction T being a subset of I, i.e. T ⊆ I. An association rule is an implication of the form A ⇒ B where A ⊂ I, B ⊂ I and A ∩ B = ∅.

As an example, consider a set of computer accessories (CDs, memory sticks, microphones, speakers) that are available at a certain store. These accessories form the set of items I of interest to us. Every sale made represents a transaction T. Suppose the sales made are represented in the form of the following set of transactions D:

Transaction ID   Items sold
T1               CD, memory stick
T2               CD
T3               Microphone, speaker
T4               CD, speaker, microphone
T5               Memory stick, microphone, speaker

Table 1: A set of transactions

Association rules in the above case represent the items that tend to be sold together, e.g. the association rule CD ⇒ Speaker shows that those who buy CDs also tend to buy speakers. A large number of such association rules may exist in a given set of transactions and not all of them may be of interest. A pattern is said to be interesting if it is easily understood, valid, useful, novel or validates a hypothesis that the user sought to confirm 5. To find interesting rules, support and confidence are commonly used as objective measures of pattern interestingness. Support represents the percentage of transactions in D which contain both A and B. Confidence is the percentage of transactions in D containing A that also contain B. Another measure of interest is coverage. The coverage of an association rule is the proportion of transactions in D that have the items specified on the left hand side of the rule. Mathematically:

Support(A ⇒ B) = P(A ∪ B)
Confidence(A ⇒ B) = P(B | A)
Coverage(A ⇒ B) = P(A)

Here P(A ∪ B) denotes the probability that a transaction contains every item of both A and B. An association rule is said to be strong if it satisfies both a minimum support threshold and a minimum confidence threshold.
For the association rule CD ⇒ Speaker, support is 1/5, confidence is 1/3 and coverage is 3/5. An interesting association rule in this case is Microphone ⇒ Speaker, for which support is 3/5, confidence is 1 and coverage is 3/5.
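The figures quoted for these two rules can be checked mechanically. The following is a minimal Python sketch, not part of the original report, that recomputes support, confidence and coverage over the five transactions of Table 1. The names `TRANSACTIONS`, `rule_measures` and `is_strong` are illustrative assumptions.

```python
from fractions import Fraction

# Transactions from Table 1 (item names lower-cased for consistency).
TRANSACTIONS = {
    "T1": {"cd", "memory stick"},
    "T2": {"cd"},
    "T3": {"microphone", "speaker"},
    "T4": {"cd", "speaker", "microphone"},
    "T5": {"memory stick", "microphone", "speaker"},
}


def rule_measures(lhs, rhs, transactions):
    """Return (support, confidence, coverage) of the rule lhs => rhs."""
    lhs, rhs = set(lhs), set(rhs)
    total = len(transactions)
    # Transactions containing every item on the left hand side.
    with_lhs = sum(1 for items in transactions.values() if lhs <= items)
    # Transactions containing every item of both sides.
    with_both = sum(1 for items in transactions.values() if (lhs | rhs) <= items)
    support = Fraction(with_both, total)
    coverage = Fraction(with_lhs, total)
    # Confidence is undefined when no transaction contains the left hand side.
    confidence = Fraction(with_both, with_lhs) if with_lhs else None
    return support, confidence, coverage


def is_strong(lhs, rhs, transactions, min_support, min_confidence):
    """A rule is strong if it meets both minimum thresholds."""
    support, confidence, _ = rule_measures(lhs, rhs, transactions)
    return (confidence is not None
            and support >= min_support
            and confidence >= min_confidence)


if __name__ == "__main__":
    for lhs, rhs in [({"cd"}, {"speaker"}), ({"microphone"}, {"speaker"})]:
        s, c, cov = rule_measures(lhs, rhs, TRANSACTIONS)
        print(f"{sorted(lhs)} => {sorted(rhs)}: "
              f"support={s}, confidence={c}, coverage={cov}")
```

Exact fractions are used so that the results can be compared directly with the values 1/5, 1/3 and 3/5 given in the text. The thresholds passed to `is_strong` are left to the caller, since the report does not fix them.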
4. Re-engineering pattern extraction

To employ association rule mining for pattern extraction, the first step is to identify a set of items and transactions. The guiding principle is to choose items which facilitate understanding of the code and allow suggestions for re-structuring the code for greater maintainability. Most existing legacy software systems have been developed using the structured approach, with functions or routines forming basic components. Moreover, in legacy software, the use of global variables is often widespread, leading to difficulty in understanding the code. In view of these facts, we decided to use functions and global variables as items. We also decided to use user defined types, because user defined types become potential data objects when code is to be re-structured as object-oriented code. Thus, the transaction set that we use consists of the functions within a software system, with items being the global variables accessed, the user defined types accessed, and the function calls made by each function.

In the next step, we identify re-engineering patterns that help in identifying problems in legacy code and suggesting appropriate solutions. Whereas design patterns have to do with choosing a particular solution to a design problem, re-engineering patterns have to do with discovering an existing design, determining what problems it has and repairing these problems 5.

Traditionally, high thresholds of support, confidence and coverage have been used for finding interesting rules. We use both low and high thresholds of the three measures: coverage, support and confidence. It is apparent from the patterns below that low thresholds can reveal interesting facts about the software and provide insight into its structure. One can consider the range 0.7 to 1.0 as high for any measure and 0.0 to 0.3 as low.
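To make the transaction model concrete, the Python sketch below treats each function as a transaction whose items are tagged as globals, types or calls. It then computes the coverage of each global variable and labels it using the indicative ranges just given. This is only an illustration under stated assumptions: every function, variable and type name is hypothetical (none is taken from Bash or Xfig), and the "intermediate" label for values between the two ranges is our own addition.

```python
# Each function is one transaction. Its items are the global variables and
# user defined types it accesses and the functions it calls.
# All names below are hypothetical.
FUNCTION_TRANSACTIONS = {
    "main": {("call", "parse"), ("call", "report"), ("global", "verbose")},
    "parse": {("call", "next_token"), ("global", "verbose"),
              ("global", "line_no"), ("type", "Token")},
    "report": {("global", "verbose"), ("type", "Token")},
    "next_token": {("global", "verbose"), ("type", "Token")},
}

# Indicative bounds from the text; they should be tuned per system.
LOW, HIGH = 0.3, 0.7


def coverage(item, transactions):
    """Fraction of functions (transactions) that contain the given item."""
    hits = sum(1 for items in transactions.values() if item in items)
    return hits / len(transactions)


def label(value, low=LOW, high=HIGH):
    """Classify a measure as low, high or (our assumption) intermediate."""
    if value >= high:
        return "high"
    if value <= low:
        return "low"
    return "intermediate"


def global_coverage(transactions):
    """Map each global variable name to its coverage over all functions."""
    names = {name
             for items in transactions.values()
             for kind, name in items
             if kind == "global"}
    return {name: coverage(("global", name), transactions) for name in names}


if __name__ == "__main__":
    for name, value in sorted(global_coverage(FUNCTION_TRANSACTIONS).items()):
        print(f"{name}: coverage={value:.2f} ({label(value)})")
```

In this toy system `verbose` is accessed by all four functions and `line_no` by only one, so the two variables fall at opposite ends of the coverage scale.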
However, it is not meaningful to fix an absolute threshold, since system characteristics such as size, number of global variables, types and functions etc. can vary widely from system to system. Thus we recommend that high and low thresholds be determined subjectively, depending on the system under consideration. It is relevant to note that if we employ user defined types, functions and global variables as items, and use coverage, support and confidence criteria with values zero, low, high and one, the number of possible association rules is almost 1500. It is obvious that the chosen objective measures are not sufficient to guide the mining process. Useful patterns need to be filtered using subjective interestingness measures. These measures are based on user beliefs in data and find patterns interesting if they are unexpected or offer information on which the user can act 5. In this report, our emphasis is on selection of unexpected and actionable patterns in the context of program understanding. The associations that we describe are somewhat different from those of interest in conventional applications like market basket analysis. However, they are relevant in the software context and convey meaningful information, thus highlighting the need and benefits of adapting techniques to suit the peculiarities of specific domains. We present a small subset of the total possible association rules in this report. They are a representative sample of interesting association rules and have been selected because they help in understanding the design of legacy software and in identifying problem areas, whereas the related patterns offer suggestions for improvement. In the tables below, we list the interesting association rules and detail their implication. We also list benefits of employing the patterns, as well as related issues and problems. In some of the association rules identified, one out of the three interestingness measures is used. 
This indicates that the value of the other two measures does not influence the pattern. Patterns in which a combination of measures gives interesting results are also listed. Pattern 1:Reduce global variable usage 7
    • Association rule Coverage Global → Calling function Low Implication Only a small proportion of functions in the system use the global variable on the LHS. Suggestion Rather than using the variable as a global variable, pass it as a parameter within the relevant functions. If passing a global variable as a parameter is not convenient, its scope may be restricted to a single file by defining it as a static variable. Motivation & Benefits When a large number of variables are to be shared amongst functions, global variables are convenient. Moreover, global variables have a longer lifetime than automatic variables, making it simpler to share information between functions that do not call each other 5. However, unless the global data is read only, the use of global variables results in undesirable coupling between functions 5, leading to difficulties in program understanding and maintenance. By removing unnecessary global variables and restricting their usage, coupling among components is reduced, making it easier to trace faults and avoid unintentional changes to data. Issues & Problems Careful evaluation of each global variable is required to decide whether it should be passed as a parameter, declared to be static or left as it is. If the global variable has low coverage, it implies that only a small proportion of functions in the system use the global variable. However, for a large system, this small proportion can mean a large absolute number of functions e.g. if we consider a small system with 100 functions, a 30% coverage implies upto 30 functions accessing the global variable, which is not a small number. Moreover, even if the number of functions is considerably less, the designers of the system may have valid reasons for defining a variable as global. Pattern 2:Select appropriate storage classes Association rule Coverage Global → Calling function High Implication Global variable on LHS is used by most of the functions in the system. 
Suggestion: The global variable appears to be frequently used. Declare the global variable as a register variable.

Motivation & Benefits: Declaring a variable as a register variable suggests to the compiler that it be placed in a register instead of memory. This provides a certain degree of control over program efficiency, and may result in faster access and speed improvement 5.

Issues & Problems: The storage of program variables in registers may interfere with the compiler's conventional usage of a register, thus slowing down execution of a program 5. Knowledge of the architecture and compiler is therefore required before such declarations can be of use. It should be kept in mind that the declaration is only a suggestion to the compiler, which the compiler may choose to ignore 5. Also, not all application languages support the declaration of register variables.

Pattern 3: Increase locality of reference

Association rule: Called function → Calling function
Confidence: One

Implication: The function is called by one function only.

Suggestion: Place the called function in the same file as the calling function. If the function is small, it may be appropriate to make it inline.

Motivation & Benefits: Since the function is called by one function only, there appears to be a strong relation between the two. Functions that are related should be grouped together and placed in one file to promote understandability and information hiding 5. Such grouping promotes modular continuity 5 because changes in requirements are localized instead of resulting in system-wide changes. Making a function inline results in performance improvement and is clearly beneficial if the function expansion is shorter than the code for the calling sequence 5.

Issues & Problems: Large inline functions will save a small percentage of run time but will have a higher space penalty. Also, functions with loops should almost never be inlined 5 because the run time of a loop is likely to swamp the function call overhead. Large inlined functions may also make it difficult to understand the functionality of the calling function.

Association rule: Called function → Called function
Confidence: High
Support: High

Implication: Whenever one function is called, there is a high probability that the other function is also called.

Suggestion: Place the functions in the same file.

Motivation & Benefits: The fact that two functions are called together indicates that they perform related tasks. A file is a way to package related functions into a module 5.
Grouping related functions together is a good design practice because it promotes understandability. Moreover, such grouping promotes modular continuity 5 because changes in requirements are localized instead of resulting in system-wide changes. Localized changes ease the maintenance task.
    • Issues & Problems: A function may be associated with more than one function. These functions may be present in different files, making it difficult to decide in which file to place the function under consideration. The support of the association rule can serve as a useful indicator: the function should be placed in the same file as the function with which it is associated with the highest support. If support is the same, the decision depends on other factors, e.g. similarity between the functionality of the function and that of the functions in a certain file.

Pattern 4: Increase data modularity

Association rule: Global → Global
Confidence: High
Support: High

Implication: Whenever one global variable is accessed, there is a high probability that the other global variable is also accessed.

Suggestion: Examine the global variables to see if they form a coherent entity. If they do, combine them into a structure.

Association rule: Type → Type
Confidence: High
Support: High

Implication: Whenever one type is accessed, there is a high probability that the other type is also accessed.

Suggestion: Examine the types to see if they form a coherent entity. If they do, combine them into a structure.

Motivation & Benefits: Combining variables/types into structures leads to code that is easier to understand and change 5. If at some stage a shift is to be made to an object-oriented design paradigm, the structures become potential classes.

Issues & Problems: One global variable/type may be associated with a high degree of confidence with a number of other global variables/types, i.e. when the global variable/type is accessed, a number of other global variables/types are accessed. In this case, to avoid a large structure that hinders rather than promotes understandability, the software engineer should study the code and analyze which global variables/types are to be combined into a structure.

Pattern 5: Strengthen encapsulation
    • Association rule: Calling function → Type
Confidence: One

Implication: The functions access the type on the RHS.

Suggestion: If we are considering converting a ‘structured’ system to an ‘object-oriented’ system, consider the type as a candidate class and the functions as its member functions.

Association rule: Type → Type
Confidence: One

Implication: The types are used together by functions within the system.

Suggestion: Examine the types to see if they form a coherent entity. If they do, combine the types into a structure. If we are considering converting a ‘structured’ system to an ‘object-oriented’ system, the structure is a candidate class and the functions are its member functions.

Association rule: Global → Type
Confidence: One

Implication: Functions in the system always access the type and the global variable together.

Suggestion: If we are considering converting a ‘structured’ system to an ‘object-oriented’ system, treat the type as a candidate class, the global as a static data member and the functions as member functions.

Motivation & Benefits: A collection of functions and the common data set they access can be packaged to provide information hiding. Large programs that use information hiding have been found easier to modify by a factor of 4 than programs that don’t 5. Information hiding forms a foundation for both structured and object-oriented design.

Issues & Problems: Functions may access more than one type, in which case a careful study of the code is required to decide the type with which the function should be associated. The types accessed by a function may form an entity which can be transformed into a structure (Pattern 4).

Pattern 6: Identify utilities
    • Association rule: Called function → Calling function
Coverage: High

Implication: The function on the LHS is called by most of the functions.

Suggestion: Treat the function as a utility function. It may be useful to put groups of related utility functions in separate files.

Association rule: Called function → Calling function
Coverage: High (within a sub-system)

Implication: The function on the LHS is called by most of the functions within a sub-system.

Suggestion: Treat the function as a utility function for that sub-system. The utility functions for a particular sub-system may be placed in a separate file.

Motivation & Benefits: Utility functions represent re-usable components of a structured system. Re-usable components can lead to measurable benefits in terms of reduction in development cycle time and project cost, and increase in productivity 5.

Issues & Problems: In order for functions to be re-used effectively, they must be properly catalogued for easy reference, standardized for easy application and validated for easy integration 5. If the number of such functions is large, related functions need to be identified and grouped together, otherwise searching for the appropriate function may be time consuming.

Pattern 7: Localize structures

Association rule: Global → Calling function
Confidence: One

Implication: The global variables are used by one function only.

Suggestion: Place the global variables in one local structure.

Motivation & Benefits: It is recommended that all variables have the smallest scope possible. If a variable is used by one function only, there is no plausible reason for defining the variable as global. Localizing the variable promotes understandability and maintainability.

Issues & Problems: None

Pattern 8: Beware of side effects

Association rule: Calling function → Global
Confidence: One

Implication: When the rules are sorted by global variable, we get a list of functions that use the same global variable and thus are highly coupled.

Suggestion: Changes to the global variable, or to a function accessing it, should be made carefully, keeping in view all related functions.

Motivation & Benefits: Using global variables weakens modularity because functions cannot be understood on their own. Understanding the purpose and working of all functions accessing one global variable leads to reduced side effects when the functions or the global variable are changed.

Issues & Problems: If a large number of functions access a global variable, understanding their working may require time and effort. The best approach is to reduce global variable usage (see Pattern 1). However, if this is not possible, the effort is well justified because inadvertent changes are avoided.

Pattern 9: Generate alternative views

Association rule: Type → Calling function
Coverage: High (within a sub-system)

Implication: The type on the LHS is used by most of the functions within the sub-system.

Suggestion: The association of types with a certain sub-system should be highlighted.

Motivation & Benefits: Sub-systems and their inter-relationships represent the architecture of a software system. It is important to focus on the architecture from multiple perspectives so that large scale structural changes are easier to make 5. A study of the types associated with a sub-system provides an alternative ‘data’ view, which is particularly important if changes are to be made to types or functions within the sub-system.

Issues & Problems:
    • Different sub-systems of the software system may access the same types, making it difficult to associate a type with one sub-system only. In such a case, it may be feasible to use Pattern 5 on the entire software system in order to group together functions with related types and hence arrive at an alternative modularization based on data abstraction.

4. Experiments and results

5.1 The test systems

For conducting experiments, Xfig version 3.2.3 and a subset of Bash version 1.14.4 were used. Xfig is an open source drawing tool that runs under the X Window System. It is written in C and consists of around 75,000 lines of code. The design documentation of Xfig is not available, although the user manual and other useful information are available at the Xfig site 5. Bash is a Unix shell, and the subset with which we experimented consists of 38K lines of source code 5. Bash and Xfig have been used for architecture recovery experiments in 5, 5, 5. Our data mining experiments help in gaining an understanding of the test systems by providing alternative views of the software.

The source files for the Xfig and Bash systems have been parsed using the Rigi tool, and relevant ‘facts’ have been stored in an exchange format called the ‘Rigi Standard Format’ (RSF) 5, 5. The transaction set discussed in section 4 was developed from this fact set. Some useful statistics of the two systems are provided in the following table. It is relevant to note that the Xfig system consists of five major sub-systems, whose source code files can be identified by their names. We have experimented with the 95 files in these sub-systems, leaving out 4 files which have not been considered. We do not expect that the average figures and percentages obtained for Xfig will change substantially on inclusion of these 4 files.
    • System        Purpose                 Globals   Functions   Types
Xfig                                      1746      1661        828
  d_*files      Drawing tasks                       94
  e_*files      Editing tasks                       369
  f_*files      File related tasks                  139
  u_*files      Utility files                       422
  w_*files      Window related tasks                637
Bash                                      539       892         198

Table 2: Statistics of the Bash and Xfig systems

5.2 Analysis of results

In this section, we present the results of applying association rule mining to Xfig and Bash and analyze the results using the patterns identified in section 3.
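The three measures used throughout this analysis can be sketched in a few lines. The sketch below assumes the standard association rule definitions (support as the fraction of transactions containing both items, confidence as the fraction of transactions containing the LHS that also contain the RHS) and treats coverage as the fraction of all functions in the system that use an item. The report's own definitions appear in an earlier section, so the function names, the denominators and the tiny transaction set here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical transaction set: one transaction per function, holding the
# items (globals, types, called functions) that the function uses.
transactions = {
    "f1": {"g_a", "g_b"},
    "f2": {"g_a", "g_b"},
    "f3": {"g_a"},
    "f4": {"g_c"},
}
TOTAL_FUNCTIONS = 10  # assumed: includes functions that use no items at all


def coverage(item, transactions, total_functions):
    """Fraction of all functions in the system that use `item` (assumed definition)."""
    users = sum(1 for items in transactions.values() if item in items)
    return users / total_functions


def support(x, y, transactions):
    """Fraction of transactions that contain both x and y."""
    both = sum(1 for items in transactions.values() if x in items and y in items)
    return both / len(transactions)


def confidence(x, y, transactions):
    """Of the transactions containing x, the fraction that also contain y."""
    with_x = sum(1 for items in transactions.values() if x in items)
    both = sum(1 for items in transactions.values() if x in items and y in items)
    return both / with_x if with_x else 0.0


if __name__ == "__main__":
    print("coverage(g_a)        =", coverage("g_a", transactions, TOTAL_FUNCTIONS))
    print("support(g_a, g_b)    =", support("g_a", "g_b", transactions))
    print("confidence(g_a->g_b) =", confidence("g_a", "g_b", transactions))
    print("confidence(g_b->g_a) =", confidence("g_b", "g_a", transactions))
```

Note that confidence is not symmetric: in this toy data every function that uses `g_b` also uses `g_a`, but not the other way round.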
    • Pattern 1: Reduce global variable usage

[Figure. Panel title: Bash - Global variable usage; x-axis: Global variables; y-axis: Number of functions accessing a global variable]
Figure 1: Number of functions accessing a global variable (Bash)

Figure 1 shows the usage of global variables by functions in Bash, where 543 functions access global variables. The average number of functions accessing a global variable is 4.12 with a standard deviation of 5.70. The global variable rl_point is accessed by the maximum number of functions (62) with coverage 0.07.

[Figure. Panel title: Bash - Global variable usage; categories: globals accessed by more than 10 functions, by 2 - 9 functions, by 1 function; segment labels: 7%, 28%, 65%]
Figure 2: Global variable access breakdown (Bash)

As illustrated in Figure 2, 28% of the global variables are accessed by one function only. Similar results are obtained for Xfig (Figure 3, Figure 4), where the average number of functions accessing a global variable is 5.97 with a standard deviation of 15.73, and 22% of the global variables are accessed by one function only. The variable XtStrings^* is accessed by the maximum number of functions (231) with a coverage of 0.139.
    • [Figure. Panel title: Xfig - Global variable usage; x-axis: Global variables; y-axis: Number of functions accessing a global variable]
Figure 3: Number of functions accessing a global variable (Xfig)

[Figure. Panel title: Xfig - Global variable usage; categories: globals accessed by more than 10 functions, by 2 - 9 functions, by 1 function; segment labels: 22%, 10%, 68%]
Figure 4: Global variable access breakdown (Xfig)

It is evident from Figures 1 to 4 that a substantial proportion of global variables are accessed by very few functions in Bash and Xfig. It can also be observed that some global variables are accessed by a single function. The fact that these variables have been defined as global suggests a poor design or a design that has deteriorated over time. To enhance software quality, it will be useful to reduce such a large number of global variables by localizing them.

Pattern 2: Select appropriate storage classes

Table 3 shows the global variables accessed by the largest number of functions in Xfig and Bash. Only three global variables are shown for each system.

System   Global variable   Number of functions   Coverage
Bash     rl_point          62                    0.07
Bash     rl_end            46                    0.052
Bash     __ctype_b         39                    0.044
Xfig     XtStrings^*       231                   0.139
Xfig     _ArgCount^*       211                   0.127
Xfig     _ArgList^*        211                   0.127

Table 3: Global variables accessed by maximum number of functions
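As a worked check, the coverage values in Table 3 can be reproduced from the function totals in Table 2, assuming coverage is the number of accessing functions divided by the total number of functions in the system. The dictionary names below are illustrative; the counts are taken from the two tables.

```python
# Total functions per system (Table 2) and access counts (Table 3).
TOTAL_FUNCTIONS = {"Bash": 892, "Xfig": 1661}
ACCESS_COUNTS = {
    "Bash": {"rl_point": 62, "rl_end": 46, "__ctype_b": 39},
    "Xfig": {"XtStrings^*": 231, "_ArgCount^*": 211, "_ArgList^*": 211},
}


def coverage(system, variable):
    """Accessing functions divided by all functions in the system (assumed definition)."""
    return ACCESS_COUNTS[system][variable] / TOTAL_FUNCTIONS[system]


if __name__ == "__main__":
    for system, counts in ACCESS_COUNTS.items():
        for variable in counts:
            print(f"{system:5s} {variable:12s} {coverage(system, variable):.3f}")
```

Rounded to three decimal places, these ratios agree with the coverage column of Table 3, which supports reading coverage as a proportion of all functions rather than of the functions that access at least one global.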
    • If the functions accessing the global variables are called frequently, the global variables are candidate register variables.

Pattern 3: Increase locality of reference

[Figure. Panel title: Bash - Functions called by one function; categories: functions residing in different files (26%), functions residing in same file (74%)]
Figure 5: Functions called by one function only (Bash)

Figure 5 shows that Bash has 298 functions that are called by only one function. An examination of the code shows that 220 (74%) of these functions are present in the same file as the calling function, and 78 (26%) are in different files. It can be seen from Figure 6 that similar results are obtained for Xfig: 368 functions are called by only one function, of which 270 (73%) are present in the same file as the calling function and 98 (27%) are present in a different file.

[Figure. Panel title: Xfig - Functions called by one function; categories: functions residing in different files (27%), functions residing in same file (73%)]
Figure 6: Functions called by one function only (Xfig)

A valid reason for placing a called function in a different file from the calling function is that the called function may be a ‘utility’ function which has been placed together with other utility functions in a separate file. If this is not the case, the called function should be placed in the same file as the calling function to increase efficiency and ease maintenance by increasing locality of reference.
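The same-file versus different-file breakdown described above can be sketched as follows. This is a minimal illustration on hypothetical data: the call facts, the file assignments and all function names are invented, and the real analysis works from the RSF fact set rather than from Python tuples.

```python
from collections import defaultdict

# Hypothetical call facts (caller, callee) and file assignments.
calls = [
    ("main", "parse"),
    ("main", "helper"),
    ("parse", "helper"),
    ("parse", "lex"),
    ("main", "report"),
]
file_of = {
    "main": "main.c",
    "parse": "parse.c",
    "lex": "parse.c",
    "helper": "util.c",
    "report": "out.c",
}


def single_caller_functions(calls):
    """Map each function that has exactly one distinct caller to that caller."""
    callers = defaultdict(set)
    for caller, callee in calls:
        callers[callee].add(caller)
    return {callee: next(iter(cs)) for callee, cs in callers.items() if len(cs) == 1}


def split_by_location(calls, file_of):
    """Split single-caller functions by whether they share a file with their caller."""
    same, different = [], []
    for callee, caller in sorted(single_caller_functions(calls).items()):
        if file_of[callee] == file_of[caller]:
            same.append(callee)
        else:
            different.append(callee)
    return same, different


if __name__ == "__main__":
    same, different = split_by_location(calls, file_of)
    print("same file as caller:", same)
    print("different file, candidates for relocation:", different)
```

Functions in the second list are only candidates: as noted above, a function placed in a separate utility file may be there for a good reason.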
    • [Figure. Panel title: Bash - Functions residing in same file with high confidence and support > .001; categories: functions residing in different files (49%), functions residing in same file (51%)]
Figure 7: Functions that are called together (Bash)

3644 functions are called together with high confidence (0.7 – 1.0) in Bash. The association between the functions xrealloc and xmalloc carries the highest support, 0.045, i.e. 31 functions call them together. The two functions are present in the same file. It can be seen from Figure 7 that 51% of the functions called together with support > 0.001 (1) reside in the same file and 49% reside in different files. If we consider all associations with high confidence and ignore support, 43% of the functions called together reside in the same file and 57% reside in different files.

In Xfig, 3584 functions are called together with high confidence. The association between the functions cleanup and set_action_object carries the highest support, 0.045, i.e. 59 functions call them together. The two functions are present in the same file. It can be seen from Figure 8 that 52% of the functions called together with support > 0.001 (1) reside in the same file and 48% reside in different files. If we consider all associations with high confidence and ignore support, the same percentage figures are obtained.

[Figure. Panel title: Xfig - Functions called together with high confidence and support > 0.001; categories: functions residing in different files (48%), functions residing in same file (52%)]
Figure 8: Functions that are called together (Xfig)

Unless there are valid reasons for doing otherwise, functions called together by a large number of functions should be placed in the same file. Reasons for placing them in different files need to be examined with care.

Pattern 4: Increase data modularity

In Bash, 1635 global variables are accessed together with high confidence, with 1549 of these accessed together with a confidence of 1. The highest support is 0.077 (42) for the variables rl_end and rl_point, meaning that 42 functions access these variables together with high confidence.
    • [Figure. Panel title: Bash - Support for association between global variables; x-axis: Global variables; y-axis: Support]
Figure 9: Global variables that are accessed together with high confidence and support > 2 (Bash)

The support figures in Figure 9 help us to decide which global variables should preferably be grouped together. Without these figures the decision may be difficult, since a single global variable may be associated with a large number of other variables and it may not be possible to group them all together. Average support is 1.51 with a standard deviation of 1.61. Figure 10 shows the number of variables with which various global variables in Bash are highly associated. On average, a global variable is associated with 6.15 other global variables, with a standard deviation of 9.31.

[Figure. Panel title: Bash - Global associations; x-axis: Global variables; y-axis: Number of variables associated with high confidence]
Figure 10: Association between global variables (Bash)

In Xfig, 14983 global variables are associated with each other with high confidence, with 14030 of these associated with a confidence of 1.00. The highest support is 0.159 (211) for the variables _ArgList^* and _ArgCount^*, meaning that 211 functions access these variables together with high confidence. The support figures for Xfig are depicted in Figure 11. Average support is 2.13 with a standard deviation of 5.95. This is higher than the support figure for Bash.

[Figure. Panel title: Xfig - Support for association between global variables; x-axis: Global variables; y-axis: Support]
    • Figure 11: Global variables that are accessed together with high confidence and support > 2 (Xfig)

Figure 12 shows the number of variables with which various global variables in Xfig are highly associated. On average, a global variable is associated with 12.95 other global variables, with a standard deviation of 18.87. The average number of global variables with which a variable is associated is higher for Xfig than for Bash. For global variables that are associated with high confidence and accessed together by a large number of functions, it is suggested that they be placed in a single structure.

[Figure. Panel title: Xfig - Global associations; x-axis: Global variables; y-axis: Number of variables associated with high confidence]
Figure 12: Association between global variables (Xfig)

In Bash, 155 types are accessed together with high confidence, with 148 of these accessed together with a confidence of 1. The highest support is 0.92 (23) for the types KEYMAP_ENTRY_ARRAY and Keymap, meaning that 23 functions access these types together with high confidence.

[Figure. Panel title: Bash - Support for association between types; x-axis: User defined types; y-axis: Support]
Figure 13: User defined types that are accessed together with high confidence (Bash)

The support figures in Figure 13 help us to decide which user defined types should preferably be grouped together. Without these figures the decision may be difficult, since a single type may be associated with a large number of other types and it may not be possible to group them all together. The average support is 1.74 with a standard deviation of 2.53, very similar to the average support for global variables. Figure 14 shows the number of types with which various user defined types in Bash are highly associated. On average, a type is associated with 4.08 other types, with a standard deviation of 5.75. It can be seen that quite a large number of types are associated with one other type only. If the support for such an association is high, the two types can easily be combined to form a structure. For types associated with a number of other types, the support figures should be used to reach an appropriate decision.
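Using support figures to choose among several candidate groupings can be sketched as below. The type names and support counts are hypothetical, and the selection rule (pick the partner with the highest support, report ties rather than breaking them) is one plausible reading of the guidance above, not a procedure given in the report.

```python
# Hypothetical support counts: number of functions that access both types
# of a pair together with high confidence.
pair_support = {
    ("T_point", "T_line"): 12,
    ("T_point", "T_arc"): 5,
    ("T_line", "T_arc"): 5,
    ("T_text", "T_font"): 3,
}


def best_partners(item, pair_support):
    """Return (partners, support) for the associations of `item` with highest support.

    More than one partner is returned when several associations tie; the tie
    then has to be resolved by studying the code.
    """
    candidates = {}
    for (a, b), count in pair_support.items():
        if a == item:
            candidates[b] = count
        elif b == item:
            candidates[a] = count
    if not candidates:
        return [], 0
    top = max(candidates.values())
    return sorted(name for name, count in candidates.items() if count == top), top


if __name__ == "__main__":
    for type_name in ("T_point", "T_arc", "T_text"):
        partners, count = best_partners(type_name, pair_support)
        print(f"{type_name}: group with {partners} (support {count})")
```

A tie, as for `T_arc` in this data, is the case where support alone cannot decide and other factors have to be considered.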
    • [Figure. Panel title: Bash - Type associations; x-axis: User defined types; y-axis: Number of types associated with high confidence]
Figure 14: Association between user defined types (Bash)

In Xfig, 422 user defined types are associated with each other with high confidence, with 378 of these associated with a confidence of 1.00. The highest support is 0.156 (213) for the types Arg and WidgetList, meaning that 213 functions access these types together with high confidence. The support figures for Xfig are depicted in Figure 15. Average support is 4.34 with a standard deviation of 13.94. This is once again higher than the support figure for Bash.

[Figure. Panel title: Xfig - Support for association between types; x-axis: User defined types; y-axis: Support]
Figure 15: User defined types that are accessed together with high confidence (Xfig)

Figure 16 shows the number of types with which various types in Xfig are highly associated. On average, a type is associated with 4.44 other types, with a standard deviation of 5.12. The average number of types with which a type is associated is almost the same for Xfig and Bash. It can be seen that quite a large number of types are associated with one other type only, similar to the results for Bash.

[Figure. Panel title: Xfig - Type associations; x-axis: User defined types; y-axis: Number of types associated with high confidence]
Figure 16: Association between types (Xfig)
    • Pattern 5: Strengthen encapsulation

Figure 17 shows the number of functions that access a user defined type for Bash. The average number of functions that access a type is 10.84 with a standard deviation of 14.18. It can be seen that 71% of the types are accessed by 10 or fewer functions.

[Figure. Panel title: Bash - Type accesses; x-axis: Types; y-axis: Number of functions accessing the types]
Figure 17: Number of functions accessing a type (Bash)

In Xfig, the average is 26.42, with 67% of the types accessed by 10 or fewer functions. In order to promote information hiding, the types should be packaged with the functions accessing them. If the system is to be re-structured as an object-oriented system, the types are candidate classes and the accessing functions are member functions.

[Figure. Panel title: Xfig - Type accesses; x-axis: Types; y-axis: Number of functions accessing the types]
Figure 18: Number of functions accessing a type (Xfig)

Figure 19 shows the number of types with which various user defined types in Bash are associated with a confidence of 1. On average, a type is associated with 4.77 other types, with a standard deviation of 6.17. For Xfig, the average is 5.18 with a standard deviation of 5.59. The associations are depicted in Figure 20.
    • [Figure. Panel title: Bash - Type associations; x-axis: Types; y-axis: Number of associated types]
Figure 19: Confidence 1 association between user defined types (Bash)

[Figure. Panel title: Xfig - Type associations; x-axis: Types]
Figure 20: Confidence 1 association between user defined types (Xfig)

Figure 21 and Figure 22 show the number of types that are associated with each global variable with a confidence of 1 for Bash and Xfig. The number is large, and because a global is normally associated with more than one type, it may be difficult to identify the type for which the global is to become a static data member in case of conversion to an object-oriented design. It may be appropriate to apply Pattern 4 to determine highly associated data types. If the types with which a global is associated are highly associated with each other, they may first be grouped together to form a class.

[Figure. Panel title: Bash - Global Type associations; x-axis: Globals]
Figure 21: Confidence 1 association between globals and user defined types (Bash)
    • [Figure. Panel title: Xfig - Global Type associations; x-axis: Globals; y-axis: Number of types associated with a global]
Figure 22: Confidence 1 association between globals and user defined types (Xfig)

Pattern 6: Identify utilities

Figure 23 shows the calls made by functions in Bash, where 694 functions make function calls. The average number of calls made is 3.18 with a standard deviation of 8.46. The highest number of calls (195) is made to the xmalloc function, with a coverage of 0.219. An examination of the code shows that 75% of the functions to which 20 or more calls are made reside in the files general.c, error.c or variable.c. This indicates that utility functions are placed in separate files in Bash.

[Figure. Panel title: Bash - Function calls made; x-axis: Functions; y-axis: Number of functions calling a function]
Figure 23: Number of function calls made to functions (Bash)

As can be seen from Figure 24, similar results are obtained for Xfig (the average number of calls made is 5.00 with a standard deviation of 10.17). 979 functions make function calls. The highest number of calls (180) is made to the put_msg function, with a coverage of 0.096.

[Figure. Panel title: Xfig - Function calls made; x-axis: Functions; y-axis: Number of functions calling a function]
Figure 24: Number of function calls made to functions (Xfig)
    • An examination of the code shows that there are 36 functions to which 20 or more calls are made, and that they reside in 19 different files. A listing of these files, along with the number of such functions in each file, is given in Table 4 below:

File          Number of functions
Mathinline    2
string2       1
stdlib        2
f_util        2
mode          3
u_bound       1
u_elastic     1
u_markers     2
u_redraw      2
u_search      1
u_undo        2
w_color       1
w_cursor      3
w_drawprim    1
w_indpanel    2
w_modepanel   1
w_mousefun    2
w_msgpanel    3

Table 4: List of files containing frequently accessed functions (Xfig)

It can be seen that in Xfig, as compared to Bash, a larger number of files contain frequently accessed functions. The names of the files indicate that they contain utility functions. However, a more detailed look at the files is required to ascertain that this is indeed the case.

To identify the utility functions within the various sub-systems of Xfig, associations between functions within the software as a whole, and between functions within each sub-system, were noted. Figures 25 to 29 show the calls made within the various sub-systems. The graphs are similar, indicating that function calls in each sub-system follow a similar trend.

[Figure. Panel title: Xfig - Function calls made in the d_*files subsystem; x-axis: Functions; y-axis: Number of functions calling a function]
Figure 25: Number of function calls made to functions (Xfig d_*files sub-system)
    • [Figure. Panel title: Xfig - Function calls made in the e_*files subsystem; x-axis: Functions; y-axis: Number of functions calling a function]
Figure 26: Number of function calls made to functions (Xfig e_*files sub-system)

[Figure. Panel title: Xfig - Function calls made in the f_*files subsystem; x-axis: Functions; y-axis: Number of functions calling a function]
Figure 27: Number of function calls made to functions (Xfig f_*files sub-system)

[Figure. Panel title: Xfig - Function calls made in u_*files subsystem; x-axis: Functions; y-axis: Number of calls made to a function]
Figure 28: Number of function calls made to functions (Xfig u_*files sub-system)
    • [Figure. Panel title: Xfig - Function calls made in the w_*files subsystem; x-axis: Functions; y-axis: Number of functions calling a function]
Figure 29: Number of function calls made to functions (Xfig w_*files sub-system)

Table 5 shows some statistics of the function calls made. It can be observed that the average number of calls made is very similar in all the sub-systems.

Sub-system   Average number of calls made   Standard deviation   Highest number of calls made (function)   Coverage
d_*files     3.27                           5.00                 37 (draw_mousefun_canvas)                 0.394
e_*files     3.44                           5.19                 46 (set_cursor)                           0.125
f_*files     2.03                           2.98                 38 (file_msg)                             0.273
u_*files     3.09                           3.70                 33 (set_action_object)                    0.078
w_*files     3.02                           5.13                 69 (put_msg)                              0.108

Table 5: Function calls made in various sub-systems

Table 6 shows the function calls that are made to various functions by functions within a sub-system. It appears that the u_*files sub-system contains utility functions, because the largest number of calls is made to functions in this sub-system. This was our initial assumption, which has been confirmed by the results obtained. Functions in the d_*files sub-system make more calls to functions within the u_*files sub-system than to any other sub-system. Functions in the remaining sub-systems, however, make most calls to functions within their own sub-system, with the second highest number of calls going to the u_*files sub-system. Function calls are made to functions in all other sub-systems, but the d_*files sub-system appears to contain the fewest utilities. These figures show that the Xfig system is quite well structured, but its structure can be improved by further localizing calls to a function's own sub-system or the u_*files sub-system.

Table 7 shows call results for functions to which 5 or more calls are made. The results are similar to those in Table 6, with the largest number of calls made to functions in the u_*files sub-system and no calls to d_*files functions except from functions within d_*files.
    • Calls made to functions in:
Sub-system   d_*files   e_*files   f_*files   u_*files   w_*files   misc. files
d_*files     49         1          1          53         13         6
e_*files     6          224        16         177        51         11
f_*files     1          2          113        48         28         6
u_*files     2          10         9          302        43         10
w_*files     2          7          20         39         301        16
Total        60         244        159        619        436        49

Table 6: Number of function calls made to functions in each sub-system

Calls made to functions in:
Sub-system   d_*files   e_*files   f_*files   u_*files   w_*files   misc. files
d_*files     1          -          -          9          6          2
e_*files     -          12         1          55         12         7
f_*files     -          -          7          -          2          1
u_*files     -          -          -          43         11         6
w_*files     -          1          4          3          33         6
Total        1          13         12         110        64         22

Table 7: Five or more function calls made to functions in each sub-system

Pattern 7: Localize structures

An application of the association rule shows that in Bash, 115 global variables are used by only one function, whereas in Xfig there are 310 such global variables. This can also be verified from the statistics presented in Figure 2 and Figure 4. Thus a significant percentage of global variables in both systems is used by just one function. It is recommended that such global variables be localized.

Pattern 8: Beware of side effects

An application of the association rule results in a listing of functions that access the same global variables. Figure 1 shows the number of functions that access common global variables for Bash, and Figure 3 shows the number for Xfig. These functions are highly coupled, and any changes to the functions involving the global variables should be made carefully.

Pattern 9: Generate alternative views

Table 8 summarizes the statistics for types accessed by functions in various sub-systems of Xfig. Figures 30 to 34 show the types accessed within the sub-systems.
Table 8: Accesses to user defined types in various Xfig sub-systems
[Columns: Sub-system; Average number of accesses; Standard deviation; Highest number of accesses (Type); Coverage. Sub-systems listed: f_*files, e_*files, u_*files, d_*files, w_*files. The row alignment of the values is not recoverable. Averages and standard deviations: 6.49, 6.28, 16.74, 15.10, 17.34, 8.21, 5.92, 35.40, 25.33, 25.89. Highest number of accesses (coverage): 37 (0.266), 99 (0.268), 22 (0.234), 257 (0.403), 127 (0.301).]

Figure 30: Types accessed by a subsystem (d_*files). [Bar chart "Xfig - Types accessed within d_*files subsystem"; x-axis: types, y-axis: number of functions accessing a type.]
Figure 31: Types accessed by a subsystem (e_*files). [Bar chart "Xfig - Types accessed within e_*files subsystem"; x-axis: types, y-axis: number of functions accessing a type.]
Figure 32: Types accessed by a subsystem (f_*files). [Bar chart "Xfig - Types accessed within f_*files subsystem"; x-axis: types, y-axis: number of functions accessing a type.]
Figure 33: Types accessed by a subsystem (u_*files). [Bar chart "Xfig - Types accessed within u_*files subsystem"; x-axis: types, y-axis: number of functions accessing a type.]
Figure 34: Types accessed by a subsystem (w_*files). [Bar chart "Xfig - Types accessed by w_*files subsystem"; x-axis: types, y-axis: number of functions accessing a type.]

It can be seen that some of the types are associated with a certain sub-system only, whereas others are accessed by more than one sub-system; for example, the types F_line, F_spline, F_arc etc. are accessed by all sub-systems. The reason is that these shapes are drawn in the d_*files sub-system, edited in the e_*files sub-system, and so on. Thus, for an object-oriented view, these types will be associated with functions across sub-systems. On the other hand, types accessed only within a certain sub-system may be associated with functions within that sub-system.
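The distinction drawn above, between types confined to one sub-system and types accessed across sub-systems, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the report's extraction procedure: the input is assumed to be a list of (function, type) access pairs plus a mapping from each function to its sub-system, and the function names and the `MenuState` type in the example are hypothetical.

```python
from collections import defaultdict


def type_usage_by_subsystem(accesses, subsystem_of_function):
    """Group type accesses by type and then by the accessing sub-system.

    accesses              -- iterable of (function, type_name) pairs
    subsystem_of_function -- dict mapping each function to its sub-system
    Returns {type_name: {sub-system: set of accessing functions}}.
    """
    usage = defaultdict(lambda: defaultdict(set))
    for function, type_name in accesses:
        usage[type_name][subsystem_of_function[function]].add(function)
    return usage


def split_local_and_shared(usage):
    """Separate types accessed by one sub-system from those accessed by several."""
    local, shared = {}, {}
    for type_name, by_subsystem in usage.items():
        if len(by_subsystem) == 1:
            (only_subsystem,) = by_subsystem  # the single sub-system key
            local[type_name] = only_subsystem
        else:
            shared[type_name] = sorted(by_subsystem)
    return local, shared


# Hypothetical access data; only the type name F_line comes from the report.
example_accesses = [
    ("draw_line", "F_line"),
    ("edit_line", "F_line"),
    ("show_menu", "MenuState"),
]
example_subsystems = {
    "draw_line": "d_*files",
    "edit_line": "e_*files",
    "show_menu": "w_*files",
}

local_types, shared_types = split_local_and_shared(
    type_usage_by_subsystem(example_accesses, example_subsystems)
)
```

In an object-oriented view, each entry in `local_types` is a candidate class whose methods come from a single sub-system, while each entry in `shared_types` would gather functions from every sub-system listed for it.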
5. Conclusions

In this report we applied association rule mining to the problem of understanding a software system given only its source code. We analyzed the structure of two legacy systems, Bash and Xfig, and extracted association rules and patterns that provide useful insight into each system's overall structure. As illustrated, these patterns can be used to re-structure the code for maintainability and, if required, to re-modularize it, e.g. by converting a structured design to an object-oriented design. A manual inspection to carry out the same tasks would have taken much longer.

Our experiments with Xfig and Bash yield similar results in terms of the average and percentage values for the patterns discussed. This observation may help reveal characteristics and trends common to open source legacy systems.

In the future, we intend to mine associations between items other than the ones explored here, e.g. between the input and output parameters of functions. Furthermore, the patterns should be applied to other software systems in order to validate the results obtained with Xfig and Bash and perhaps reveal other interesting properties of legacy systems.