Machine Learning in Software Engineering

NEW TRENDS IN LEARNING FOR
SOFTWARE ENGINEERING
Alaa Hamouda
Department of Computer Engineering,
Engineering Faculty,
Al-Azhar University, Egypt

Agenda
• Introduction
• Software Engineering Phases
• Machine Learning Overview
• Applications of ML in SWE with each process:
– Project Planning
– Requirements
– Design
– Implementation
– Testing
– Maintenance
• Conclusion

Problem Definition
• There is a need to meet the challenge of
developing and maintaining large and complex
software systems.
• Machine learning methods have been playing
an increasingly important role in many
software development and maintenance
tasks.

Overview of ML
• Machine learning methods fall into the following
broad categories: supervised learning and
unsupervised learning. Supervised learning deals
with learning a target function from labeled
examples. Unsupervised learning attempts to
learn patterns and associations from a set of
objects that do not have attached class labels.
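• Supervised learners can be divided into eager
classifiers, which build a model before any query
arrives (e.g. decision trees), and lazy classifiers,
which defer generalization until query time
(e.g. k-nearest neighbors).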

[Table: the loan data (reproduced); class label: approved or not]

[Figure: a decision tree from the loan data, showing decision nodes and leaf nodes (classes)]
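Decision tree induction chooses, at each decision node, the attribute that best separates the classes. The sketch below shows that selection step using entropy and information gain (ID3-style). The loan records and the attribute names has_job and owns_house are hypothetical stand-ins, not the table from the slides.

```python
from collections import Counter
from math import log2

# Hypothetical loan records (NOT the table from the slides):
# (has_job, owns_house, approved)
LOANS = [
    (False, False, "No"),
    (False, False, "No"),
    (True,  False, "Yes"),
    (True,  True,  "Yes"),
    (False, True,  "Yes"),
    (True,  False, "No"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attr_index):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    labels = [row[-1] for row in rows]
    remainder = 0.0
    for value in {row[attr_index] for row in rows}:
        subset = [row[-1] for row in rows if row[attr_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

# The attribute with the highest gain becomes the root decision node.
gains = {name: information_gain(LOANS, i)
         for i, name in enumerate(["has_job", "owns_house"])}
best_attribute = max(gains, key=gains.get)
```

On this toy data the split on owns_house has the higher gain, so it would be placed at the root; the same step is then repeated on each branch until the leaves are pure.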


Project Planning
• Statistics report a failure rate of 70% for
software projects.
• The reported cost overrun is 189%.
• Research shows that inaccurate estimation is
the root cause of most software project
failures.

Size Estimation
• Size -- Effort - Cost
• twenty-eight out of the collected sixty
publications (almost 47%) deal with the issue
of how to build models to predict or estimate
certain property of software development
process or artifacts.
12

Function Point
• Internal Logical File (ILF): a file accessed and maintained by the
application under development.
• External Interface File (EIF): a file accessed by the processing logic but
maintained by another application.
• External Input (EI): an elementary process that processes data coming from
outside the application boundary.
– Maintains an ILF.
• External Output (EO): an elementary process that sends data outside the
application boundary.
– Presents information to the user through processing logic in addition to
retrieval of data.
• External Inquiry (EQ): an elementary process that sends data outside the
application boundary.
– Presents information to the user through retrieval of data from an ILF/EIF.
– No data manipulation or processing logic.
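As a rough illustration of how these five component types turn into a size figure, the sketch below computes unadjusted function points as a weighted count. The weights are the commonly published IFPUG complexity weights, which the slides do not list, and the component counts are hypothetical.

```python
# Commonly published IFPUG weights per component type and complexity level.
# These are an assumption here; the slides do not give the weights.
WEIGHTS = {
    "EI":  {"low": 3, "average": 4,  "high": 6},   # External Input
    "EO":  {"low": 4, "average": 5,  "high": 7},   # External Output
    "EQ":  {"low": 3, "average": 4,  "high": 6},   # External Inquiry
    "ILF": {"low": 7, "average": 10, "high": 15},  # Internal Logical File
    "EIF": {"low": 5, "average": 7,  "high": 10},  # External Interface File
}

def unadjusted_function_points(counts):
    """Sum weight * count over (component type, complexity) pairs."""
    return sum(WEIGHTS[kind][level] * n for (kind, level), n in counts.items())

# Hypothetical counts for a small application.
example_counts = {
    ("EI", "average"): 5,
    ("EO", "average"): 4,
    ("EQ", "low"): 3,
    ("ILF", "average"): 2,
    ("EIF", "low"): 1,
}
ufp = unadjusted_function_points(example_counts)  # 20 + 20 + 9 + 20 + 5 = 74
```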

Size Estimation (Cont’d)
Input:
• Function points
• Project domains
• Number of components by type:
– Number of menu components
– Number of input components
– Number of output components
ML Algorithm:
• Neural Network
Output:
• LOC to be fed to the cost estimation stage

Effort Estimation
Input:
• Lines of code (LOC, produced by the size estimation stage)
• Scale factors
• Cost Drivers
Algorithm:
• Fuzzy Inference Engine
Output:
• Estimated effort (e.g. man-hours)

Inputs (scale factors)
Factor Explanation
Precedentedness (PREC) Reflects the previous experience of the organization.
Development Flexibility (FLEX) Reflects the degree of flexibility in the development process.
Risk Resolution (RESL) Reflects the extent of risk analysis carried out.
Team Cohesion (TEAM) Reflects how well the development team members know each other and work together.
Process Maturity (PMAT) Reflects the process maturity of the organization.
LOC Lines of code

Inputs (Cost Drivers)
Attribute Type Description
RELY Product Required system reliability
CPLX Product Complexity of system modules
DOCU Product Extent of documentation required
DATA Product Size of database used
RUSE Product Required percentage of reusable components
TIME Computer Execution time constraint
PVOL Computer Volatility of development platform
STOR Computer Memory constraints
ACAP Personnel Capability of project analysts
PCON Personnel Personnel continuity
PCAP Personnel Programmer capability
PEXP Personnel Programmer experience in project domain
AEXP Personnel Analyst experience in project domain
LTEX Personnel Language and tool experience
TOOL Project Use of software tools
SCED Project Development schedule compression
SITE Project Extent of multisite working and quality of inter-site communications
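The scale factors and cost drivers above are the inputs of the COCOMO II model. The slides feed them to a fuzzy inference engine; the sketch below instead shows the standard closed-form COCOMO II effort equation, PM = A × Size^E × ∏EM with E = B + 0.01 × ΣSF, to make clear how the two groups of inputs act on the estimate. The constants are the COCOMO II.2000 calibration as usually published, and the ratings in the example are illustrative only.

```python
from math import prod

A = 2.94  # multiplicative constant (COCOMO II.2000 calibration, assumed here)
B = 0.91  # base exponent (COCOMO II.2000 calibration, assumed here)

def cocomo2_effort(ksloc, scale_factors, effort_multipliers):
    """Effort in person-months: A * size^E * product(EM), E = B + 0.01 * sum(SF)."""
    exponent = B + 0.01 * sum(scale_factors.values())
    return A * ksloc ** exponent * prod(effort_multipliers.values())

# Illustrative ratings only; real values come from the COCOMO II rating tables.
scale_factors = {"PREC": 3.72, "FLEX": 3.04, "RESL": 4.24, "TEAM": 3.29, "PMAT": 4.68}
# Cost drivers not listed are treated as nominal (multiplier 1.0).
effort_multipliers = {"RELY": 1.10, "CPLX": 1.17, "TOOL": 0.90}

effort_pm = cocomo2_effort(10.0, scale_factors, effort_multipliers)
```

Scale factors act on the exponent, so their influence grows with project size, while cost drivers scale the estimate linearly.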

Effort Estimation directly from UCP
In the previous method:
• FP (size) → LOC (size) → Effort
Another method:
• UCP (size) → Effort (directly)
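For reference, the classical Use Case Points calculation that maps UCP directly to effort can be sketched as follows. The weights, the technical and environmental factor formulas, and the 20 hours per UCP are Karner's original values as commonly cited; they are assumptions here, since the slides do not list them, and this is the baseline UCP model rather than the learned model discussed in the deck.

```python
# Karner's weights as commonly cited (assumed; not given in the slides).
ACTOR_WEIGHTS = {"simple": 1, "average": 2, "complex": 3}
USE_CASE_WEIGHTS = {"simple": 5, "average": 10, "complex": 15}

def use_case_points(actors, use_cases, tfactor, efactor):
    """UCP = (UAW + UUCW) * TCF * ECF."""
    uaw = sum(ACTOR_WEIGHTS[c] * n for c, n in actors.items())         # actor weight
    uucw = sum(USE_CASE_WEIGHTS[c] * n for c, n in use_cases.items())  # use case weight
    tcf = 0.6 + 0.01 * tfactor   # technical complexity factor
    ecf = 1.4 - 0.03 * efactor   # environmental complexity factor
    return (uaw + uucw) * tcf * ecf

def effort_hours(ucp, hours_per_ucp=20):
    """Direct conversion from size (UCP) to effort."""
    return ucp * hours_per_ucp

# Hypothetical project: 3 actors, 5 use cases.
ucp = use_case_points({"simple": 2, "complex": 1},
                      {"average": 4, "complex": 1},
                      tfactor=40, efactor=10)
hours = effort_hours(ucp)
```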

Project Complexity
• Level 1: The project team is familiar with this type of
project and has developed similar projects in the past.
The number and type of interfaces are simple. The project
will be installed under normal conditions, where high
security or safety factors are not required. Around 20% of
the design or implementation is reused (taken from
similar earlier projects).
• Level 2: Similar to Level 1, except that only about 10%
of the design or implementation is reused.

Project Complexity (Cont’d)
• Level 3: The technology, interfaces, and installation conditions
are normal, but no part of the project has been
previously designed or implemented.
• Level 4: The project must be installed on a
complicated topology/architecture, such as a distributed
system. The number of variables and interfaces is large.
• Level 5: Similar to Level 4, but with additional
constraints such as a special type of security or high safety
factors.

Effort Estimation (Cont’d)
The results show that the proposed ANN model
outperforms:
• Regression models by 8%
• UCP models by 50%


Requirements Analysis
Business Analysis → System Analysis

Requirements Analysis
Lexicon Phase I Phase II
User Noun Actor
fills Verb Action
the Article (none)
form Noun Object
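The two phases in the table can be sketched as a small pipeline: Phase I tags each word with its part of speech, and Phase II maps the tag to a requirements role. The hand-written lexicon below is a hypothetical stand-in for a real part-of-speech tagger, and the rule that a noun before the verb is the actor while a noun after it is the object is an assumption inferred from the table.

```python
# Hypothetical lexicon standing in for a real part-of-speech tagger (Phase I).
LEXICON = {"user": "Noun", "fills": "Verb", "the": "Article", "form": "Noun"}

def analyze(sentence):
    """Return (word, part of speech, role) triples for a simple
    subject-verb-object sentence."""
    rows = []
    seen_verb = False
    for word in sentence.lower().split():
        pos = LEXICON.get(word, "Unknown")
        # Phase II: map the part of speech to a requirements role.
        if pos == "Verb":
            role, seen_verb = "Action", True
        elif pos == "Noun":
            role = "Object" if seen_verb else "Actor"
        else:
            role = None  # articles and unknown words carry no role
        rows.append((word, pos, role))
    return rows

result = analyze("User fills the form")
```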

Requirements
• Reverse engineering applies where legacy systems
that are critical to the operation of the organization
using them must still be maintained.
• Most legacy systems were developed before software
engineering techniques were widely used. Thus they
may be poorly structured, and their documentation
may be out of date or non-existent.
• To make maintenance of a legacy system feasible,
the first task is to recover its design or
specification from its source or executable code.


Design
1. Finding fault-prone components for reuse
2. UI design

Components Re-use
• Software quality classification models can be
used to indicate which program modules are
fault-prone (FP) and which are not fault-prone (NFP).
• These models can be used to select the best
candidate modules for reuse.

Components Re-use
Attribute
U_1 Number of unique operators
N_1 Total number of operators
U_2 Number of unique operands
N_2 Total number of operands
V(G) McCabe’s cyclomatic complexity
N_L Number of logical operators
LOC Lines of code
ELOC Executable lines of code
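The first four attributes are the base counts of Halstead's metrics, from which derived measures are often computed before being passed to a classifier. The sketch below uses the standard Halstead formulas; whether the classification models in the slides use the raw counts or the derived values is not stated, so treat the derived measures as an assumption.

```python
from math import log2

def halstead(u1, n1, u2, n2):
    """Derived Halstead measures from the four base counts.

    u1: unique operators, n1: total operators,
    u2: unique operands,  n2: total operands.
    """
    vocabulary = u1 + u2
    length = n1 + n2
    volume = length * log2(vocabulary)
    difficulty = (u1 / 2) * (n2 / u2)
    effort = difficulty * volume
    return {"vocabulary": vocabulary, "length": length,
            "volume": volume, "difficulty": difficulty, "effort": effort}

# Hypothetical module: 4 unique operators used 10 times,
# 4 unique operands used 6 times.
metrics = halstead(u1=4, n1=10, u2=4, n2=6)
```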

User Interface Design
• Learnability is an important aspect of usability.
• Users lose up to 40% of their time to
“frustrating experiences” with computers;
among the most common causes are missing,
hard-to-find, and unusable features of the
software.

User Interface Design
• Nielsen characterizes a highly learnable system
as one “allowing users to reach a reasonable
level of usage proficiency within a short time”.
• A web usage map is mined using label
sequential rules.


Implementation
• Implementation is a core process in the software
engineering life cycle.
• One of the challenges in this phase is
modularization (or remodularization).
• Genetic algorithms have been successfully used
to address this problem.
• The objective is to improve the modularization
quality (MQ). All versions of MQ combine
cohesion and coupling into a single weighted
fitness function.
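A fitness function of this kind can be sketched as follows. The formula is one published variant, usually called TurboMQ and associated with the Bunch clustering tool: each cluster scores 2μ / (2μ + ε), where μ is its number of internal dependencies and ε the number of dependencies crossing its boundary, and MQ is the sum over clusters. The slides do not say which variant is meant, so this choice, the module names, and the dependency graph are assumptions.

```python
def modularization_quality(edges, cluster_of):
    """TurboMQ-style fitness: sum over clusters of 2*mu / (2*mu + eps)."""
    intra = {}  # cluster -> number of internal dependencies (cohesion)
    inter = {}  # cluster -> number of boundary-crossing dependencies (coupling)
    for src, dst in edges:
        a, b = cluster_of[src], cluster_of[dst]
        if a == b:
            intra[a] = intra.get(a, 0) + 1
        else:
            # A crossing dependency counts against both clusters it touches.
            inter[a] = inter.get(a, 0) + 1
            inter[b] = inter.get(b, 0) + 1
    mq = 0.0
    for cluster in set(cluster_of.values()):
        mu, eps = intra.get(cluster, 0), inter.get(cluster, 0)
        if mu > 0:  # clusters without internal dependencies contribute 0
            mq += 2 * mu / (2 * mu + eps)
    return mq

# Hypothetical dependency graph between five modules.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "e"), ("c", "d")]
good = modularization_quality(edges, {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2})
bad = modularization_quality(edges, {"a": 1, "d": 1, "b": 2, "c": 2, "e": 2})
```

A genetic algorithm would treat each assignment of modules to clusters as an individual and use this value as its fitness; here the first clustering scores higher because it keeps the a, b, c cycle together.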

Implementation (Cont’d)
• Clustering has also been applied to package
coupling, to reduce overall package size and to
explore the relationship between design and
code level software structure.
• Additional objectives might include closeness to
original module structure, business goals,
technical constraints, testability, and other
metrics that may be important in finding a good
module structure.

• Refactoring means rewriting existing source code
to improve its readability, reusability, or
structure without affecting its meaning or
behavior.
• Project managers want to know which locations
are likely to demand refactoring. Refactoring
improves the understandability of the code, but
it also requires development time.

• Researchers mine evolution data from the
versioning systems of open source projects.
• ArgoUML and the Spring framework are
examples developed in Java; they consist of
5000 and 10000 classes, respectively.
• In Java each class is usually placed in a separate
file, so the researchers treat files as equivalent
to classes and focus on files in their analysis.

The features used fall into several categories:
• Size
This category contains size measures, such as
lines of code, from an evolution perspective:
linesAdded, linesModified, or linesDeleted
relative to the total LOC (lines of code) of a file.

• Team
The number of authors of a file influences the way
software is developed. The more authors work on
the changes, the higher the expected probability of
rework and mistakes.
• Complexity of existing solution
According to the laws of software evolution, software
continuously becomes more complex. Changes become
harder to add as the software becomes harder to
understand and the contracts between existing parts
have to be retained. The researchers therefore consider
the changeCount in relation to the number of changes
during the entire history of each file.

• New Requirements
In software development projects, new classes are
usually added to object-oriented systems when new
requirements have to be satisfied. The researchers
use the information whether a file was newly
introduced during the prediction period.
• Relational Aspects
Among the most important features in this category
are couplings, such as the number of
changes/revisions in which the file was committed
together with other files.
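The feature categories above can be derived from commit records roughly as follows. This is a minimal sketch: the commit record layout, the file names, and the names authorCount and coChangedFiles are hypothetical; only linesAdded, linesModified, linesDeleted, and changeCount are named in the slides.

```python
def file_features(commits, file_name, total_loc):
    """Per-file evolution features from a list of commit records.

    Each commit is a dict with an "author" and a "files" mapping of
    file name -> (lines added, lines modified, lines deleted).
    """
    touching = [c for c in commits if file_name in c["files"]]
    added = sum(c["files"][file_name][0] for c in touching)
    modified = sum(c["files"][file_name][1] for c in touching)
    deleted = sum(c["files"][file_name][2] for c in touching)
    co_changed = {f for c in touching for f in c["files"]} - {file_name}
    return {
        # Size: line activity relative to the file's total LOC.
        "linesAdded": added / total_loc,
        "linesModified": modified / total_loc,
        "linesDeleted": deleted / total_loc,
        # Team: number of distinct authors who changed the file.
        "authorCount": len({c["author"] for c in touching}),
        # Complexity of existing solution: how often the file changed.
        "changeCount": len(touching),
        # Relational aspects: files committed together with this one.
        "coChangedFiles": len(co_changed),
    }

# Hypothetical history with three commits.
commits = [
    {"author": "ann", "files": {"A.java": (10, 2, 0), "B.java": (1, 0, 0)}},
    {"author": "bob", "files": {"A.java": (0, 5, 3)}},
    {"author": "ann", "files": {"C.java": (4, 0, 0)}},
]
features = file_features(commits, "A.java", total_loc=100)
```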

• With the described features, the number of
refactorings is predicted.

• Decision trees and neural networks are used as
classifiers.
• The F-measure was about 65%.
• Several features, such as the line activity rate
and the number of lines altered per commit, are
informative for predicting refactorings.
• The structure of the system is also crucial for
refactorings: the number of co-changed files
and the number of files introduced during
maintenance are relevant features.
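For readers unfamiliar with the F-measure quoted above, it is the harmonic mean of precision and recall. A minimal sketch, with hypothetical labels where True marks a file that needed refactoring:

```python
def f_measure(actual, predicted, positive=True):
    """F1 score: harmonic mean of precision and recall for one class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical ground truth and classifier output for six files.
actual = [True, True, True, False, False, False]
predicted = [True, True, False, True, False, False]
score = f_measure(actual, predicted)  # precision = recall = 2/3
```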


Testing
• Software quality models help ensure the
reliability of the delivered products.
• Early detection of fault-prone software
components enables verification experts to
concentrate their time and resources on the
problem areas of the software system under
development.
• Accurate prediction of fault-prone modules
lets verification and validation activities
focus on the critical software components.

Testing (Cont’d)
• Decision trees correctly predicted 79.3% of
the fault-prone modules with high development
effort (detection rate), while the trees
generated from the best parameter
combinations correctly identified 88.4% of
those modules on average.


Maintenance
• Software maintenance is widely recognized to be
the most expensive and time-consuming aspect
of the software process.
• A relevance relation maps a tuple of system
elements to a value indicating how related they
are.
• Software change repositories reflect the
history of the system, including actions that
create new relationships and strengthen
existing relationships in the software.

Maintenance (Cont’d)
• Software entities include documents, source files,
routines, modules, variables, and even the entire
software system.
• A relevance relation is a predictor that maps
tuples of two or more software entities to a value
r quantifying how relevant, that is, connected or
related, the entities are to each other.
• r shows the strength of relevance among the
entities.
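One simple way to obtain such a value r from a change repository is to measure how often two files change together. The sketch below uses the share of commits touching either file that touch both (a Jaccard-style ratio); this particular definition and the file names are assumptions for illustration, not the relation learned in the work the slides summarize.

```python
def co_change_relevance(change_sets, a, b):
    """r in [0, 1]: share of commits touching a or b that touch both."""
    both = sum(1 for files in change_sets if a in files and b in files)
    either = sum(1 for files in change_sets if a in files or b in files)
    return both / either if either else 0.0

# Hypothetical change history: one set of changed files per commit.
change_sets = [
    {"parser.c", "lexer.c"},
    {"parser.c", "lexer.c", "ast.h"},
    {"parser.c"},
    {"ui.c"},
]
r_related = co_change_relevance(change_sets, "parser.c", "lexer.c")
r_unrelated = co_change_relevance(change_sets, "parser.c", "ui.c")
```

A maintainer changing parser.c could then be pointed to the files with the highest r as likely candidates for inspection.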

Maintenance Effort Prediction
• If predictions are based on formal software
development effort prediction models, such as the
estimation part of Function Point Analysis, essential
differences between software development and
software maintenance are neglected.
• The focus of software development is the creation of
software, whereas the focus of software maintenance
is the change of software.
• The development of a software application is typically a
one-of-a-kind project, whereas the maintenance activities
on an application usually comprise a large number of tasks
carried out over a long period of time in a relatively stable
environment.

• Some researchers collected data on:
– 109 randomly selected maintenance tasks
– 70 applications
– The size of the applications varied from a few
thousand lines of code (LOC) to about 500,000 LOC
– The age of the applications varied from less than a
year to more than 20 years
– The functions of the applications included payroll,
order entry, billing and invoicing, inventory control,
service management, and personnel administration.

The following data was collected for each maintenance task:
• Type of maintenance task, i.e., corrective or perfective.
• Priority of task, i.e., high, medium or low priority.
• Maintainer’s knowledge and confidence about how to solve
the task immediately after having read or heard the task
specification.
• Years of experience as maintainer, and on the maintained
application.
• Education level of the maintainer.
• Work-hours (effort) spent on the task.
• Task size and programming language.
• Age and size of the changed application.

Most important features:
• Cause: Corrective maintenance = 0, otherwise = 1
• Change: More than 50% of the effort is believed to be
spent on updating code, as opposed to inserting and
deleting code = 0, otherwise = 1
• Mode: More than 50% of the effort is believed to be
spent on development of new modules (New module
mode) = 0, otherwise (Embedded mode) = 1
• Confidence: The maintainer believes he knows how to
solve the task when the task specification is read/heard
the first time = 0 (High confidence), otherwise = 1
(Medium or low confidence).
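The four binary features above can be encoded for a prediction model as in the sketch below. The function and parameter names are hypothetical; the 0/1 coding follows the definitions on the slide.

```python
def encode_task(corrective, mostly_updates_code, mostly_new_modules, high_confidence):
    """Encode one maintenance task as the four binary model inputs."""
    return {
        "Cause": 0 if corrective else 1,           # 0 = corrective maintenance
        "Change": 0 if mostly_updates_code else 1, # 0 = >50% effort on updating code
        "Mode": 0 if mostly_new_modules else 1,    # 0 = new module mode, 1 = embedded
        "Confidence": 0 if high_confidence else 1, # 0 = high confidence
    }

# Hypothetical task: a perfective change made mostly by editing existing
# code inside existing modules, by a maintainer who is confident about it.
task = encode_task(corrective=False, mostly_updates_code=True,
                   mostly_new_modules=False, high_confidence=True)
```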

Less influential features:
• Type of language
• Maintainer experience
• Task priority
• Application age
• Application size

• Neural networks and regression were used
for effort prediction.
• The prediction accuracy was considered acceptable (error of 60%).
• A recommended use of an effort prediction model is,
therefore, to support the expert predictions.
• Another important use of a formal prediction model
may be to support the collection and analysis of
maintenance data in order to enable improvement of
the maintenance process and product.
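The slides do not name the error measure behind the 60% figure. One measure commonly used for effort prediction is the mean magnitude of relative error (MMRE), sketched below purely as an assumption about what such an error figure could mean; the effort values are hypothetical.

```python
def mmre(actual, predicted):
    """Mean magnitude of relative error: mean of |actual - predicted| / actual."""
    errors = [abs(a - p) / a for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)

# Hypothetical work-hours for three maintenance tasks.
actual_hours = [10.0, 20.0, 40.0]
predicted_hours = [15.0, 10.0, 40.0]
error = mmre(actual_hours, predicted_hours)  # (0.5 + 0.5 + 0.0) / 3
```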

Open Problems
• Most of the presented work is immature, and
many related issues remain open.
• Machine learning can help in the
requirements engineering phase by developing
knowledge-based systems and ontologies to
manage the requirements and model problem
domains.

Open Problems (Cont’d)
• One of the most difficult problems is the
problem of transforming requirements into
architectures. Much research is needed in this
area to address the ever-increasing complexity
of functional and non-functional
requirements.

Open Problems (Cont’d)
• One area that has received some attention is
the use of automated algorithms with
machine learning to make repair assignments.
• In any case, more studies with respect to the
appropriate criteria for selecting assignment
policy, reward mechanisms and management
goals need to be undertaken.

Conclusion
• The existing work certainly proves that the
field of software engineering is a fertile
ground for the application of machine learning
methods.
• It is clear that there is an increased interest in
the niche area of machine learning and
software engineering.

Conclusion (cont’d)
• The strength of machine learning methods lies
in the fact that they have sound mathematical
and logical justifications.
• The power of machine learning methods does
not come from a particular induction method,
but instead from proper formulation of the
problems and from crafting the representation
to make learning tractable.

Conclusion (cont’d)
• Machine learning can play a useful role in the
different phases of software engineering: project
planning, requirements analysis, design,
implementation, testing, and even maintenance.
• Interest in applying machine learning to
software engineering tasks is expected to
increase significantly, especially with the
increasing interest in empirical software
engineering.
