Software Systems Modularization
through Data Mining Techniques
Abstract
In this research we investigate an application of data mining techniques to
software modularization, a significant problem in the field of software
engineering. One goal of software modularization is to create the conditions
that make software maintenance easier. Clustering is the process of finding
groups of similar items in a given dataset; the dataset investigated in this
project is source code extracted from open-source software, and items placed
in the same cluster share similar features. To achieve software
modularization, a clustering method is investigated and applied: by grouping
related source code entities, useful results are provided for the
beneficiaries of this field, in particular software developers and
maintainers, who cooperate closely with each other.
Contents
1 Introduction
2 Related Work
3 Research Method
3.1 Original Dataset
3.2 Metadata
3.3 Data Quality Validation
4 Implementation
4.1 Preprocessing
4.2 Data Analysis
5 Result Evaluation
5.1 Rand Index
5.2 Precision and Recall
5.3 Comparison to Experts' Suggestions
5.4 Calculating Coupling, Cohesion and Cyclomatic Complexity
5.5 Discussion
6 Conclusion
References
1. Introduction
Careful project management is a central concern in software engineering, and
it is widely believed that a well-modularized software system is easier to
develop and maintain. Modularity means designing a system as a set of
functional units (called modules) that can be composed into a larger
application. A module represents a set of related concerns; modules are
independent of one another but can communicate with each other in a loosely
coupled fashion [1]. Software modularization has many benefits. It allows
distributed development, so different developers can work in parallel; it is
cost effective and reduces the time spent on development and maintenance;
modules encourage reuse; and a modular program is more readable and more
scalable, so programmers do not have to waste unnecessary time and effort. In
short, modularization makes a program more manageable, because individual
modules are easier to design, implement and test. Badly modularized software,
by contrast, is regarded as a source of comprehension problems and increases
the time needed for ongoing maintenance and testing.
2. Related Work
Various techniques, each with its own features and capabilities, have been
proposed in recent years. In this section we review several software
modularization methods. The work in [2] illustrates how requirement scenarios
can be clustered; the authors use attributes defined in the scenarios to
quantitatively assess software modularization.
A firefly-based method is presented in [3] to solve the software
modularization problem; it applies a metaheuristic to software clustering.
In [4], the authors describe a Genetic Algorithm technique that clusters a
software system automatically. The method treats clustering as an
optimization problem whose aim is to find an appropriate partition.
Another technique for software modularization is proposed in [5]. The
researchers consider automatic modularization as an optimization problem and
combine traditional hill climbing with a Genetic Algorithm. A clustering tool
called Bunch was developed to create system decompositions from the
clustering results.
A further technique is proposed in [6], where a multi-objective model is used
for the module clustering process. The aim of that investigation is to
improve module clustering, and the authors introduce the first Pareto-optimal
multi-objective formulation of automated software module clustering.
3. Research Method
3.1 Original Dataset
What the research requires is the complete source code of a project, in order
to prepare the dataset (Figure 1), converted into a file type suitable for
processing, as shown in Figure 2. The source code was extracted from an
open-source website.
Figure 1 Source Code Dataset
Figure 2 Source Code Dataset, after conversion to an Excel file.
After obtaining the source code, we need to capture the connections between
the source code files in this project. The main step is to find, for each
file, every calling function and every function it calls; from these pairs we
can deduce the relationships between functions. The next step is to assemble
this information into a table; part of it is listed in Table 1, and a
simplified extraction sketch follows the table.
Table 1
Format of the Original Dataset
File name Calling Function Called Function
USBusb_descript.c USBDESC_Init usbIO_GetEnumerateType
USBusb_descript.c USBDESC_Init Memcopy
USBusb_descript.c USBDESC_Init Memcopy
USBusb_descript.c USBDESC_Init Memcopy
USBusb_descript.c USBDESC_SendDescript USB_Write
USBusbIO usbIO_SetEPEventMask
USBusbIO usbIO_Task USB_InitHostCon
USBusbIO usbIO_Task USB_InitHostCon
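As an illustration of how such a table might be built, the sketch below scans
C source files with regular expressions and pairs each function definition
with the calls found in its body. This is a rough approximation of the
extraction step rather than the exact tooling used in the project; the source
directory and the regular expressions are assumptions.

import re
from pathlib import Path

# Heuristic patterns: a C function definition header and a function call.
# They will miss or misclassify some constructs; good enough for a sketch.
DEF_RE = re.compile(r'^[A-Za-z_][\w \t\*]*?\b([A-Za-z_]\w*)\s*\([^;{]*\)\s*\{', re.M)
CALL_RE = re.compile(r'\b([A-Za-z_]\w*)\s*\(')
KEYWORDS = {"if", "for", "while", "switch", "return", "sizeof"}

def extract_calls(src_dir="./src"):
    """Yield (file name, calling function, called function) triples."""
    for path in Path(src_dir).rglob("*.c"):
        text = path.read_text(errors="ignore")
        defs = list(DEF_RE.finditer(text))
        for i, d in enumerate(defs):
            caller = d.group(1)
            # Approximate the body as the text up to the next definition.
            end = defs[i + 1].start() if i + 1 < len(defs) else len(text)
            for m in CALL_RE.finditer(text[d.end():end]):
                callee = m.group(1)
                if callee not in KEYWORDS and callee != caller:
                    yield (path.name, caller, callee)

if __name__ == "__main__":
    for row in extract_calls():
        print(row)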
3.2 Metadata
In the original dataset the information is stored as strings, and the set
contains 25,570 records. It must be admitted that there are many duplicate
and missing values, because functions are called repeatedly and some
functions make no calls. Some data preprocessing was therefore expected; in
other words, data preprocessing is one of the steps applied in this work.
During the project we also asked several experts for advice on our dataset
and our results.
3.3 Data Quality Validation
To evaluate the quality of the dataset, we use two methods: one is to
evaluate the degree of data integrity of the relevant fields; the other is to
calculate what percentage of the dataset is actually used, in order to judge
the significance of the dataset.
4. Implementation
4.1 Preprocessing
The dataset is large, which makes it hard to focus on what we need and what
we want, so we have to shrink the dataset and add some useful features; in
particular, we have to deal with the duplicate records and the missing
information.
Regarding the duplicated records, we consider them useful information,
because they indicate the strength of the connection between functions, and
from them we can also infer the connections between files. Each time a
function in one file calls a function in another file, the call count between
those two files increases by one; if a file calls another repeatedly, the
call frequency grows accordingly. The relationship is also directional: as
the simple example in Figure 3 shows, calls from File A to File B are
different from calls from File B to File A, so the two directions are counted
separately.
Records with missing values can be deleted, because the corresponding
functions are simply never called; keeping them would only add complexity to
the mining step. We also find that some function calls in this dataset go to
built-in library functions. They are not written by the developers, so those
records can be deleted as well.
With this preprocessing we obtain the frequency of each function-calling
relationship and can add it as a new feature to the dataset. Part of the
preprocessed dataset is shown in Table 2, followed by a minimal code sketch
of this step.
Figure 3 Example of a Tiny System
Table 2
Preprocessed Dataset
File name Calling Function Called Function Frequency
USBusb_descript.c USBDESC_Init usbIO_GetEnumerateType 1
USBusb_descript.c USBDESC_SendDescript USB_Write 1
USBusbIO usbIO_Task USB_InitHostCon 2
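The sketch below shows this preprocessing step, assuming the raw triples of
Table 1 are loaded into a pandas DataFrame; the list of library prefixes used
to recognize built-in callees is a hypothetical placeholder.

import pandas as pd

# Hypothetical prefixes used to recognize built-in/library callees.
LIBRARY_PREFIXES = ("mem", "str", "printf")

def preprocess(raw: pd.DataFrame) -> pd.DataFrame:
    """Drop missing and library calls, then count call frequencies."""
    df = raw.dropna(subset=["Called Function"])    # functions that call nothing
    is_lib = df["Called Function"].str.lower().str.startswith(LIBRARY_PREFIXES)
    df = df[~is_lib]                               # drop calls into libraries
    return (df.groupby(["File name", "Calling Function", "Called Function"])
              .size()
              .reset_index(name="Frequency"))

raw = pd.DataFrame({
    "File name": ["USBusb_descript.c"] * 3 + ["USBusbIO"] * 3,
    "Calling Function": ["USBDESC_Init", "USBDESC_Init", "USBDESC_SendDescript",
                         "usbIO_SetEPEventMask", "usbIO_Task", "usbIO_Task"],
    "Called Function": ["usbIO_GetEnumerateType", "Memcopy", "USB_Write",
                        None, "USB_InitHostCon", "USB_InitHostCon"],
})
print(preprocess(raw))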
4.2 Data Analysis
We chose clustering as our implementation method and solution for
modularization because we want a modularization tool that fits any project,
while we are not necessarily familiar with the target project. We cannot be
sure whether the project has obvious functional areas, and we have no
predefined classes; all we have is the source code. What we can do is find
the relationships between the files, which may give clues for grouping them
together.
Now that the dataset carries a frequency, we can use these numbers to build a
matrix. Taking the tiny system of Figure 3 as an example, we can derive the
matrix below:
[ 0 2 1 0 ]
[ 1 0 0 0 ]
[ 0 1 0 0 ]
[ 0 0 0 0 ]
because, as Figure 3 illustrates,
C(A, B) = 2
C(A, C) = 1
C(B, A) = 1
C(C, B) = 1
where C(X, Y) is the number of times that X calls Y.
We then use this matrix to compute a proximity matrix. First, each pair of
counterpart entries is summed, so that the calls from A to B and the calls
from B to A are combined; the reason is simply that the relationship between
two files should be symmetric. We then use average linkage to perform the
agglomeration, which can be done with an existing tool such as the SciPy
library for Python, giving us every step of the clustering. Figure 4
illustrates the procedure of this research, and a code sketch of the
matrix-to-cluster steps follows the figure.
Figure 4 Flow Chart of the Procedure: find out the function call
relationships → build the dataset → preprocess (calculate frequency, delete
missing values, delete useless records) → transform the dataset into a
calling matrix → transform the calling matrix into a proximity matrix →
agglomerate → get the result.
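The sketch below runs this pipeline on the Figure 3 example, using SciPy's
hierarchical clustering as mentioned above. The file labels, the distance
transformation (1 / (1 + combined calls)) and the requested number of
clusters are assumptions made for illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

files = ["A", "B", "C", "D"]
# Calling matrix: calls[i][j] = number of times file i calls file j.
calls = np.array([
    [0, 2, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
], dtype=float)

# Symmetrize: combine the two call directions between each pair of files.
combined = calls + calls.T

# Turn call counts into distances: more calls -> smaller distance (assumed form).
dist = 1.0 / (1.0 + combined)
np.fill_diagonal(dist, 0.0)

# Average-linkage agglomerative clustering on the condensed distance matrix.
Z = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")

for name, lab in zip(files, labels):
    print(name, "-> cluster", lab)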
5. Result Evaluation
To evaluate the result, we use three methods: calculating precision and
recall, asking the developers for their assessment, and using open-source
tools to calculate the cohesion and coupling of the system.
5.1 Rand Index
The Rand index in data clustering is a measure of the similarity between two
clusterings of the same data. A variant adjusted for the chance grouping of
elements also exists, the adjusted Rand index. From a mathematical
standpoint, the Rand index is the percentage of pairwise decisions that are
correct, where two documents should be assigned to the same cluster if and
only if they are similar. The Rand index formula is:
RI = (TP + TN) / (TP + FP + FN + TN)
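As a sketch of how these counts can be obtained, the function below compares
two flat labelings pair by pair and computes TP, FP, FN, TN and the resulting
Rand index; the example labelings are made up for illustration.

from itertools import combinations

def pair_confusion(labels_true, labels_pred):
    """Count pairwise decisions: TP/TN where the two labelings agree,
    FP/FN where the predicted clustering gets the pair wrong."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_pred and same_true:
            tp += 1
        elif same_pred and not same_true:
            fp += 1
        elif not same_pred and same_true:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Example: expert grouping vs. cluster result (illustrative values only).
expert  = [1, 1, 2, 2, 3]
cluster = [1, 1, 2, 3, 3]
tp, fp, fn, tn = pair_confusion(expert, cluster)
print("RI =", (tp + tn) / (tp + fp + fn + tn))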
5.2 Precision and Recall
To evaluate the clustering results, precision and recall are calculated over
the relevance of pairs of nodes. For each pair of nodes that shares at least
one cluster in the clustering result, these measures estimate whether placing
the pair in the same cluster was correct with respect to the true underlying
categories of the data. Precision is the fraction of pairs placed in the same
cluster that actually belong together, and recall is the fraction of pairs
that belong together that were placed in the same cluster. Precision and
recall are based on an understanding and measure of relevance.
Figure 5 Precision and Recall
Precision and recall are therefore usually discussed together. In retrieval
terms, precision is the probability that a (randomly selected) retrieved
document is relevant, and recall is the probability that a (randomly
selected) relevant document is retrieved. They are two of our evaluation
measures and are calculated as follows:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
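Using the pair counts from the sketch in Section 5.1, both measures follow
directly (again on the illustrative labelings):

# Reusing tp, fp, fn from the pair_confusion example above.
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print("Precision =", precision, "Recall =", recall)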
5.3 Comparison to Experts' Suggestions
Since we are not familiar with the target project, we asked the developers,
or people who know the project well, to classify the system based on the
functionality of the files. We then calculate the coverage of the cluster
result against their classification. The coverage equation is:
Coverage = |A ∩ B| / |A ∪ B|
where
A = the components in a group classified manually
B = the components in a cluster
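A minimal sketch of this comparison is shown below, matching each cluster to
its best-overlapping manual group by this coverage measure; the file names
and group names are made up for illustration.

def coverage(a: set, b: set) -> float:
    """Coverage (Jaccard) between a manual group A and a cluster B."""
    return len(a & b) / len(a | b) if a | b else 0.0

manual_groups = {"usb": {"usb_descript.c", "usbIO.c"}, "ui": {"menu.c", "draw.c"}}
clusters = {1: {"usb_descript.c", "usbIO.c", "menu.c"}, 2: {"draw.c"}}

for cid, members in clusters.items():
    best = max(manual_groups, key=lambda g: coverage(manual_groups[g], members))
    print(f"cluster {cid}: best match '{best}', "
          f"coverage = {coverage(manual_groups[best], members):.2f}")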
The result of this evaluation is not good. The coverage values of the modules
lie in the range [0.09, 1], but only a few modules reach 1, the best value;
most of them lie near 0.5.
5.4 Calculating Coupling, Cohesion and Cyclomatic Complexity
A well-developed system should maximize the cohesion and minimize the
coupling of each module, and the cyclomatic complexity should be small. Hence
we use open-source tools to calculate these values for the clustered project.
The tools report many values per module, so we take their average as the
system-level result, shown in Table 3. On all three kinds of metric, the
system produced by clustering scores better than the manual one.
Table 3
Comparison of Coupling, Cohesion and Cyclomatic Complexity
between Expert Classification and Cluster Result
                       Coupling          Cohesion   Cyclomatic Complexity
                       cbo      rfc      lcom4      cyclo_o_avg  cyclo_o_max
Manual Classification  6.44     34.96    19.44      5.99         45.45
Cluster Result         3.77     27.17    16.42      4.08         28.19
5.5 Discussion
The cluster result and the experts' result are very different. The main
reason is that clustering only takes the functions' call relationships into
account, while the experts consider the functionality of the files. This
result suggests that the system's structure may be poor, since files with the
same functionality do not end up in the same cluster; we conclude that the
source code would need to be modified to obtain an easily maintained system.
Moreover, we also need to improve the accuracy of the clustering. If we
divide the system into layers, such as a GUI layer, or remove utility files
that behave like libraries and offer services to many files, the result
should be more accurate.
6. Conclusion
The significant application of the clustering method in the field of software
systems, especially for software developers and maintainers, is to modularize
source code by grouping items that are similar or related to each other. Most
software maintenance effort is devoted to comprehending the software system,
so the beneficiaries of this research are the software developers and the
maintenance companies who cooperate with each other: besides helping them
comprehend the software system, it offers noticeable savings in time and
money. On the other hand, software modularization is still an NP-complete
problem, and the degree of modularization is a concept that is difficult to
measure objectively. Badly modularized software is widely regarded as a
source of comprehension problems and increases the time needed for ongoing
maintenance. Accurate effort is therefore still demanded of researchers
designing approaches to this problem.
References
[1] Microsoft API and reference catalog, "Modularity",
https://msdn.microsoft.com/en-us/library/ff921069(v=pandp.20).aspx.
[2] T. Otaiby, M. AlSherif, W. P. Bond, "Toward Software Requirements
Modularization Using Hierarchical Clustering Techniques", in Proceedings of
the 43rd Annual Southeast Regional Conference, Kennesaw, Georgia, USA, March
18-20, 2005, Volume 2.
[3] A. Safari Mamaghani, M. Hajizadeh, "Software Modularization Using the
Modified Firefly Algorithm", 2014 8th Malaysian Software Engineering
Conference (MySEC), IEEE, 2014.
[4] D. Doval and S. Mancoridis, "Automatic Clustering of Software Systems
Using a Genetic Algorithm", in Proceedings of the IEEE International
Conference on Software Tools and Engineering Practice (STEP'99), 1999.
[5] S. Mancoridis, B. S. Mitchell, C. Rorres, Y. Chen and E. R. Gansner,
"Using Automatic Clustering to Produce High-Level System Organizations of
Source Code", in Proceedings of the International Workshop on Program
Understanding, 1998.
[6] T. Deepika, R. Brindha, "Multi Objective Functions for Software Module
Clustering with Module Properties", 2012 International Conference on
Communications and Signal Processing (ICCSP), 2012.