Software Systems Modularization
through Data Mining Techniques
Abstract
In this research we investigate an application of data mining techniques to
software modularization, a significant problem in the field of software
engineering. One goal of software modularization is to create the conditions
that make software maintenance easier. Clustering is the process of finding
groups of similar items in a given dataset; the dataset investigated in this
project is source code extracted from open-source software, and items placed
in the same cluster share similar features. To achieve software
modularization, a clustering method is investigated and applied: by grouping
related source code entities, useful results are provided for the
beneficiaries of this field, in particular software developers and
maintainers, who cooperate closely with each other.
Contents
1 Introduction
2 Related Work
3 Research Method
3.1 Original Dataset
3.2 Metadata
3.3 Data Quality Validation
4 Implementation
4.1 Preprocessing
4.2 Data Analysis
5 Result Evaluation
5.1 Rand Index
5.2 Precision and Recall
5.3 Comparison to Experts' Suggestions
5.4 Calculating Coupling, Cohesion and Cyclomatic Complexity
5.5 Discussion
6 Conclusion
References
1. Introduction
Careful project management is a central concern in software engineering, and
it is widely believed that a well-modularized software system is easier to
develop and maintain. Modularity means designing a system as a set of
functional units (called modules) that can be composed into a larger
application. A module represents a set of related concerns; modules are
independent of one another but can communicate with each other in a loosely
coupled fashion [1]. Software modularization has many benefits. It allows
distributed development, so different developers can work in parallel; it is
cost effective and reduces the time spent on development and maintenance;
modules encourage reuse; and a modular program is more readable and more
scalable, so programmers do not have to waste unnecessary time and effort. In
short, modularization makes a program more manageable, because individual
modules are easier to design, implement and test. Badly modularized software,
by contrast, is regarded as a source of comprehension problems and increases
the time needed for ongoing maintenance and testing.
2. Related Work
Various techniques, each with its own features and capabilities, have been
proposed in recent years. In this section we review several software
modularization methods. The work in [2] illustrates how requirement scenarios
can be clustered; the authors use attributes defined in the scenarios to
quantitatively assess software modularization.
A firefly-based method is presented in [3] to solve the software
modularization problem; it applies a metaheuristic to software clustering.
In [4], the authors describe a Genetic Algorithm technique that clusters a
software system automatically. The method treats clustering as an
optimization problem whose aim is to find an appropriate partition.
Another technique for software modularization is proposed in [5]. The
researchers consider automatic modularization as an optimization problem and
combine traditional hill climbing with a Genetic Algorithm. A clustering tool
called Bunch was developed to create system decompositions from the
clustering results.
A further technique is proposed in [6], where a multi-objective model is used
for the module clustering process. The aim of that investigation is to
improve module clustering, and the authors introduce the first Pareto-optimal
multi-objective formulation of automated software module clustering.
3. Research Method
3.1 Original Dataset
What the research requires is the complete source code of a project, in order
to prepare the dataset (Figure 1), converted into a file type suitable for
processing, as shown in Figure 2. The source code was extracted from an
open-source website.
Figure 1 Source Code Dataset
Figure 2 Source Code Dataset, after conversion to an Excel file.
After obtaining the source code, we need to capture the connections between
the source code files in this project. The main step is to find, for each
file, every calling function and every function it calls; from these pairs we
can deduce the relationships between functions. The next step is to assemble
this information into a table; part of it is listed in Table 1, and a
simplified extraction sketch follows the table.
Table 1
Format of the Original Dataset
File name Calling Function Called Function
USBusb_descript.c USBDESC_Init usbIO_GetEnumerateType
USBusb_descript.c USBDESC_Init Memcopy
USBusb_descript.c USBDESC_Init Memcopy
USBusb_descript.c USBDESC_Init Memcopy
USBusb_descript.c USBDESC_SendDescript USB_Write
USBusbIO usbIO_SetEPEventMask
USBusbIO usbIO_Task USB_InitHostCon
USBusbIO usbIO_Task USB_InitHostCon
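As an illustration of how such a table might be built, the sketch below scans
C source files with regular expressions and pairs each function definition
with the calls found in its body. This is a rough approximation of the
extraction step rather than the exact tooling used in the project; the source
directory and the regular expressions are assumptions.

import re
from pathlib import Path

# Heuristic patterns: a C function definition header and a function call.
# They will miss or misclassify some constructs; good enough for a sketch.
DEF_RE = re.compile(r'^[A-Za-z_][\w \t\*]*?\b([A-Za-z_]\w*)\s*\([^;{]*\)\s*\{', re.M)
CALL_RE = re.compile(r'\b([A-Za-z_]\w*)\s*\(')
KEYWORDS = {"if", "for", "while", "switch", "return", "sizeof"}

def extract_calls(src_dir="./src"):
    """Yield (file name, calling function, called function) triples."""
    for path in Path(src_dir).rglob("*.c"):
        text = path.read_text(errors="ignore")
        defs = list(DEF_RE.finditer(text))
        for i, d in enumerate(defs):
            caller = d.group(1)
            # Approximate the body as the text up to the next definition.
            end = defs[i + 1].start() if i + 1 < len(defs) else len(text)
            for m in CALL_RE.finditer(text[d.end():end]):
                callee = m.group(1)
                if callee not in KEYWORDS and callee != caller:
                    yield (path.name, caller, callee)

if __name__ == "__main__":
    for row in extract_calls():
        print(row)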
3.2 Metadata
In the original dataset the information is stored as strings, and the set
contains 25,570 records. It must be admitted that there are many duplicate
and missing values, because functions are called repeatedly and some
functions make no calls. Some data preprocessing was therefore expected; in
other words, data preprocessing is one of the steps applied in this work.
During the project we also asked several experts for advice on our dataset
and our results.
3.3 Data Quality Validation
To evaluate the quality of the dataset, we use two methods: one is to
evaluate the degree of data integrity of the relevant fields; the other is to
calculate what percentage of the dataset is actually used, in order to judge
the significance of the dataset.
4. Implementation
4.1 Preprocessing
The dataset is large, which makes it hard to focus on what we need and what
we want, so we have to shrink the dataset and add some useful features; in
particular, we have to deal with the duplicate records and the missing
information.
Regarding the duplicated records, we consider them useful information,
because they indicate the strength of the connection between functions, and
from them we can also infer the connections between files. Each time a
function in one file calls a function in another file, the call count between
those two files increases by one; if a file calls another repeatedly, the
call frequency grows accordingly. The relationship is also directional: as
the simple example in Figure 3 shows, calls from File A to File B are
different from calls from File B to File A, so the two directions are counted
separately.
Records with missing values can be deleted, because the corresponding
functions are simply never called; keeping them would only add complexity to
the mining step. We also find that some function calls in this dataset go to
built-in library functions. They are not written by the developers, so those
records can be deleted as well.
With this preprocessing we obtain the frequency of each function-calling
relationship and can add it as a new feature to the dataset. Part of the
preprocessed dataset is shown in Table 2, followed by a minimal code sketch
of this step.
Figure 3 Example of a Tiny System
Table 2
Preprocessed Dataset
File name Calling Function Called Function Frequency
USBusb_descript.c USBDESC_Init usbIO_GetEnumerateType 1
USBusb_descript.c USBDESC_SendDescript USB_Write 1
USBusbIO usbIO_Task USB_InitHostCon 2
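The sketch below shows this preprocessing step, assuming the raw triples of
Table 1 are loaded into a pandas DataFrame; the list of library prefixes used
to recognize built-in callees is a hypothetical placeholder.

import pandas as pd

# Hypothetical prefixes used to recognize built-in/library callees.
LIBRARY_PREFIXES = ("mem", "str", "printf")

def preprocess(raw: pd.DataFrame) -> pd.DataFrame:
    """Drop missing and library calls, then count call frequencies."""
    df = raw.dropna(subset=["Called Function"])    # functions that call nothing
    is_lib = df["Called Function"].str.lower().str.startswith(LIBRARY_PREFIXES)
    df = df[~is_lib]                               # drop calls into libraries
    return (df.groupby(["File name", "Calling Function", "Called Function"])
              .size()
              .reset_index(name="Frequency"))

raw = pd.DataFrame({
    "File name": ["USBusb_descript.c"] * 3 + ["USBusbIO"] * 3,
    "Calling Function": ["USBDESC_Init", "USBDESC_Init", "USBDESC_SendDescript",
                         "usbIO_SetEPEventMask", "usbIO_Task", "usbIO_Task"],
    "Called Function": ["usbIO_GetEnumerateType", "Memcopy", "USB_Write",
                        None, "USB_InitHostCon", "USB_InitHostCon"],
})
print(preprocess(raw))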
4.2 Data Analysis
We chose clustering as our implementation method and solution for
modularization because we want a modularization tool that fits any project,
while we are not necessarily familiar with the target project. We cannot be
sure whether the project has obvious functional areas, and we have no
predefined classes; all we have is the source code. What we can do is find
the relationships between the files, which may give clues for grouping them
together.
Now that the dataset carries a frequency, we can use these numbers to build a
matrix. Taking the tiny system of Figure 3 as an example, we can derive the
matrix below:
[ 0 2 1 0 ]
[ 1 0 0 0 ]
[ 0 1 0 0 ]
[ 0 0 0 0 ]
because, as Figure 3 illustrates,
C(A, B) = 2
C(A, C) = 1
C(B, A) = 1
C(C, B) = 1
where C(X, Y) is the number of times that X calls Y.
We then use this matrix to compute a proximity matrix. First, each pair of
counterpart entries is summed, so that the calls from A to B and the calls
from B to A are combined; the reason is simply that the relationship between
two files should be symmetric. We then use average linkage to perform the
agglomeration, which can be done with an existing tool such as the SciPy
library for Python, giving us every step of the clustering. Figure 4
illustrates the procedure of this research, and a code sketch of the
matrix-to-cluster steps follows the figure.
Figure 4 Flow Chart of the Procedure: find out the function call
relationships → build the dataset → preprocess (calculate frequency, delete
missing values, delete useless records) → transform the dataset into a
calling matrix → transform the calling matrix into a proximity matrix →
agglomerate → get the result.
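The sketch below runs this pipeline on the Figure 3 example, using SciPy's
hierarchical clustering as mentioned above. The file labels, the distance
transformation (1 / (1 + combined calls)) and the requested number of
clusters are assumptions made for illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

files = ["A", "B", "C", "D"]
# Calling matrix: calls[i][j] = number of times file i calls file j.
calls = np.array([
    [0, 2, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
], dtype=float)

# Symmetrize: combine the two call directions between each pair of files.
combined = calls + calls.T

# Turn call counts into distances: more calls -> smaller distance (assumed form).
dist = 1.0 / (1.0 + combined)
np.fill_diagonal(dist, 0.0)

# Average-linkage agglomerative clustering on the condensed distance matrix.
Z = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")

for name, lab in zip(files, labels):
    print(name, "-> cluster", lab)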
5. Result Evaluation
To evaluate the result, we use three methods: calculating precision and
recall, asking the developers for their assessment, and using open-source
tools to calculate the cohesion and coupling of the system.
5.1 Rand Index
The Rand index in data clustering is a measure of the similarity between two
clusterings of the same data. A variant adjusted for the chance grouping of
elements also exists, the adjusted Rand index. From a mathematical
standpoint, the Rand index is the percentage of pairwise decisions that are
correct, where two documents should be assigned to the same cluster if and
only if they are similar. The Rand index formula is:
RI = (TP + TN) / (TP + FP + FN + TN)
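As a sketch of how these counts can be obtained, the function below compares
two flat labelings pair by pair and computes TP, FP, FN, TN and the resulting
Rand index; the example labelings are made up for illustration.

from itertools import combinations

def pair_confusion(labels_true, labels_pred):
    """Count pairwise decisions: TP/TN where the two labelings agree,
    FP/FN where the predicted clustering gets the pair wrong."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_pred and same_true:
            tp += 1
        elif same_pred and not same_true:
            fp += 1
        elif not same_pred and same_true:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Example: expert grouping vs. cluster result (illustrative values only).
expert  = [1, 1, 2, 2, 3]
cluster = [1, 1, 2, 3, 3]
tp, fp, fn, tn = pair_confusion(expert, cluster)
print("RI =", (tp + tn) / (tp + fp + fn + tn))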
5.2 Precision and Recall
To evaluate the clustering results, precision and recall are calculated over
the relevance of pairs of nodes. For each pair of nodes that shares at least
one cluster in the clustering result, these measures estimate whether placing
the pair in the same cluster was correct with respect to the true underlying
categories of the data. Precision is the fraction of pairs placed in the same
cluster that actually belong together, and recall is the fraction of pairs
that belong together that were placed in the same cluster. Precision and
recall are based on an understanding and measure of relevance.
Figure 5 Precision and Recall
Precision and recall are therefore usually discussed together. In retrieval
terms, precision is the probability that a (randomly selected) retrieved
document is relevant, and recall is the probability that a (randomly
selected) relevant document is retrieved. They are two of our evaluation
measures and are calculated as follows:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
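Using the pair counts from the sketch in Section 5.1, both measures follow
directly (again on the illustrative labelings):

# Reusing tp, fp, fn from the pair_confusion example above.
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print("Precision =", precision, "Recall =", recall)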
5.3 Comparison to Experts' Suggestions
Since we are not familiar with the target project, we asked the developers,
or people who know the project well, to classify the system based on the
functionality of the files. We then calculate the coverage of the cluster
result against their classification. The coverage equation is:
Coverage = |A ∩ B| / |A ∪ B|
where
A = the components in a group classified manually
B = the components in a cluster
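A minimal sketch of this comparison is shown below, matching each cluster to
its best-overlapping manual group by this coverage measure; the file names
and group names are made up for illustration.

def coverage(a: set, b: set) -> float:
    """Coverage (Jaccard) between a manual group A and a cluster B."""
    return len(a & b) / len(a | b) if a | b else 0.0

manual_groups = {"usb": {"usb_descript.c", "usbIO.c"}, "ui": {"menu.c", "draw.c"}}
clusters = {1: {"usb_descript.c", "usbIO.c", "menu.c"}, 2: {"draw.c"}}

for cid, members in clusters.items():
    best = max(manual_groups, key=lambda g: coverage(manual_groups[g], members))
    print(f"cluster {cid}: best match '{best}', "
          f"coverage = {coverage(manual_groups[best], members):.2f}")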
The result of this evaluation is not good. The coverage values of the modules
lie in the range [0.09, 1], but only a few modules reach 1, the best value;
most of them lie near 0.5.
5.4 Calculating Coupling, Cohesion and Cyclomatic Complexity
A well-developed system should maximize the cohesion and minimize the
coupling of each module, and the cyclomatic complexity should be small. Hence
we use open-source tools to calculate these values for the clustered project.
The tools report many values per module, so we take their average as the
system-level result, shown in Table 3. On all three kinds of metric, the
system produced by clustering scores better than the manual one.
Table 3
Comparison of Coupling, Cohesion and Cyclomatic Complexity
between Expert Classification and Cluster Result
                       Coupling          Cohesion   Cyclomatic Complexity
                       cbo      rfc      lcom4      cyclo_o_avg  cyclo_o_max
Manual Classification  6.44     34.96    19.44      5.99         45.45
Cluster Result         3.77     27.17    16.42      4.08         28.19
5.5 Discussion
The cluster result and the experts' result are very different. The main
reason is that clustering only takes the functions' call relationships into
account, while the experts consider the functionality of the files. This
result suggests that the system's structure may be poor, since files with the
same functionality do not end up in the same cluster; we conclude that the
source code would need to be modified to obtain an easily maintained system.
Moreover, we also need to improve the accuracy of the clustering. If we
divide the system into layers, such as a GUI layer, or remove utility files
that behave like libraries and offer services to many files, the result
should be more accurate.
6. Conclusion
The significant application of the clustering method in the field of software
systems, especially for software developers and maintainers, is to modularize
source code by grouping items that are similar or related to each other. Most
software maintenance effort is devoted to comprehending the software system,
so the beneficiaries of this research are the software developers and the
maintenance companies who cooperate with each other: besides helping them
comprehend the software system, it offers noticeable savings in time and
money. On the other hand, software modularization is still an NP-complete
problem, and the degree of modularization is a concept that is difficult to
measure objectively. Badly modularized software is widely regarded as a
source of comprehension problems and increases the time needed for ongoing
maintenance. Accurate effort is therefore still demanded of researchers
designing approaches to this problem.
References
[1] Microsoft API and reference catalog, "Modularity",
https://msdn.microsoft.com/en-us/library/ff921069(v=pandp.20).aspx.
[2] T. Otaiby, M. AlSherif, W. P. Bond, "Toward Software Requirements
Modularization Using Hierarchical Clustering Techniques", in Proceedings of
the 43rd Annual Southeast Regional Conference, Kennesaw, Georgia, USA, March
18-20, 2005, Volume 2.
[3] A. Safari Mamaghani, M. Hajizadeh, "Software Modularization Using the
Modified Firefly Algorithm", 2014 8th Malaysian Software Engineering
Conference (MySEC), IEEE, 2014.
[4] D. Doval and S. Mancoridis, "Automatic Clustering of Software Systems
Using a Genetic Algorithm", in Proceedings of the IEEE International
Conference on Software Tools and Engineering Practice (STEP'99), 1999.
[5] S. Mancoridis, B. S. Mitchell, C. Rorres, Y. Chen and E. R. Gansner,
"Using Automatic Clustering to Produce High-Level System Organizations of
Source Code", in Proceedings of the International Workshop on Program
Understanding, 1998.
[6] T. Deepika, R. Brindha, "Multi Objective Functions for Software Module
Clustering with Module Properties", 2012 International Conference on
Communications and Signal Processing (ICCSP), 2012.