Improving Software Maintenance using Unsupervised Machine Learning techniques

Dottorato in Scienze Computazionali e Informatiche, XXV Ciclo
Ph.D. Candidate:
Valerio Maggio
Thesis Advisors:
Dr. Sergio Di Martino
Dr. Anna Corazza
June 5th, 2013
UNSUPERVISED
MACHINE LEARNING
FOR SOFTWARE
MAINTENANCE

THESIS
G O A L
UNSUPERVISED
MACHINE LEARNING FOR
SOFTWARE MAINTENANCE
These solutions exploit (unsupervised) machine learning
techniques to mine information from the source code
Define and experimentally evaluate solutions (techniques and
prototype tools) for automatic software analysis to support
software maintenance activities.

There exist different types of
Software Maintenance.
“A software system must be continually adapted during its overall life cycle or it
progressively becomes less satisfactory.”
(cit. Lehman’s First Law of Software Evolution)

Software Maintenance represents the most
expensive, time consuming and challenging
phase of the whole development process.

Software Maintenance represents the most
expensive, time consuming and challenging
phase of the whole development process.
Software Maintenance could account up to the
85-90% of the total software costs.

ISSUES
SOFTWARE
MAINTENANCE
“Software Maintenance is about change!”
(cit. S. Jarzabek)
Change Analysis

ISSUES
SOFTWARE
MAINTENANCE
(cit. S. Jarzabek)
Change Analysis
Program
Comprehension
The documentation is usually
scarce or not up to date!

ISSUES
SOFTWARE
MAINTENANCE
The source code (usually) represents the most
reliable source of information about the system
(cit. S. Jarzabek)
Change Analysis
Program
Comprehension

ISSUES
SOFTWARE
MAINTENANCE
The source code (usually) represents the most
reliable source of information about the system
(cit. S. Jarzabek)
Change Analysis
Program
Comprehension
Reverse Engineering

REVERSE
E N G I N E
E R I N G
• Definition of tools and techniques to support maintenance activities
• Goal: Build higher-level software models in an automatic fashion
gathering information from the source code or any other document
• Goal: To aid the comprehension of the system

REVERSE
E N G I N E
E R I N G
• Definition of tools and techniques to support maintenance activities
• Goal: Build higher-level software models in an automatic fashion
gathering information from the source code or any other document
• Goal: To aid the comprehension of the system
STATIC
ANALYSIS
DYNAMIC
ANALYSIS

MACHINE
L E A R N
I N G
• Provides computational effective solutions to analyze large data sets

MACHINE
L E A R N
I N G
• Provides solutions that can be tailored to different tasks/domains

MACHINE
L E A R N
I N G
• Requires many efforts in:
• the definition of the relevant information best suited for the specific task/domain

MACHINE
L E A R N
I N G
• Requires many efforts in:
• the definition of the relevant information best suited for the specific task/domain
• the application of the learning algorithms to the considered data

UNSUPERVISEDLEARNING
• Supervised Learning:
• Learn from labelled samples
• Unsupervised Learning:
• Learn (directly) from the data
Learn by examples

UNSUPERVISEDLEARNING
• Supervised Learning:
• Learn from labelled samples
• Unsupervised Learning:
• Learn (directly) from the data
Learn by examples
(+) No cost of labeling samples
(-) Trade-off imposed on the quality of the data

THESIS
C O N T R
I B U T I
O N S
&

THESIS
C O N T R
I B U T I
O N S
Contributions to three relevant and related
open issues in Software Maintenance
&

THESIS
C O N T R
I B U T I
O N S
Contributions to three relevant and related
open issues in Software Maintenance
1. SOFTWARE
RE-MODULARIZATION
3. CLONE
DETECTION
2. SOURCE CODE
NORMALIZATION
&

ISSUE
#1
SOFTWARE
RE-MODULARIZATION

Software Classes
UI Process
Components
UI
Components
Data Access
Components
Data Helpers /
Utilities
Security
Operational Management
Communications
Business
Components
Application Facade
Buisiness
Workﬂows
Messages
Interfaces
Service Interfaces
Re-modularization provides a way to
support software maintainers by
automatically grouping together
(clustering) “related” software classes
SOFTWARE
RE-MODULARIZATION
PROBL
EM
S T A T E
M E N T

Re-modularization provides a way to
support software maintainers by
automatically grouping together
(clustering) “related” software classes
SOFTWARE
RE-MODULARIZATION
External
Systems
Service Consumers
Services
Service Interfaces
Messages
Interfaces
Cross Cutting
Security
OperationalManagement
Communications
Data
Data Access
Components
Data Helpers /
Utilities
Presentation
UI
Components
UI Process
Components
Business
Application Facade
Buisiness
Workﬂows
Business
Components
Clusters of Software Classes
PROBL
EM
S T A T E
M E N T

SOURCECODE
LEXICALINFORMATION (IDEA): Exploit the lexical information gathered from the source code
to produce clusters of classes that are lexically related.

SOURCECODE
LEXICALINFORMATION (IDEA): Exploit the lexical information gathered from the source code
to produce clusters of classes that are lexically related.
State of the Art: Information Retrieval (IR) based approaches.

Implicit assumption: The “same” words are used whenever a
particular concept occurs
IRINDEXING

1. Tokenization
2. Normalization
draw, the, are,
null, handl,
box, r,
rectangl, g,
graphic,
box,
display, box,
...
Draws, the, are,
NullHandle,
box, r,
Rectangle, g,
Graphics,
box,
displayBox,
...
Implicit assumption: The “same” words are used whenever a
IRINDEXING

SOURCECODEZONES
LEXICALINFORMATION

SOURCECODEZONES
LEXICALINFORMATION
Class Names

SOURCECODEZONES
LEXICALINFORMATION
Class Names
Attribute Names

SOURCECODEZONES
LEXICALINFORMATION
Class Names
Attribute Names
Method Names
Method Names

SOURCECODEZONES
LEXICALINFORMATION
Class Names
Attribute Names
Method Names
Parameter Names
Method Names
Parameter Names

SOURCECODEZONES
LEXICALINFORMATION
Class Names
Attribute Names
Method Names
Parameter Names
Comments
Comments
Method Names
Parameter Names
Comments

SOURCECODEZONES
LEXICALINFORMATION
Source Code
Class Names
Attribute Names
Method Names
Parameter Names
Comments
Comments
Method Names
Parameter Names
Source Code
Comments

ZONEINDEXING
RQ1: Do terms in different Zones provide different contributions?

1. Tokenization
2. Normalization
ZONEINDEXING
Draws, the, are,
NullHandle,
box, r,
draw, g,
Graphics,
color,
displayBox,
...
draw, the, are,
null, handl,
box, r,
draw, g,
graphic,
color,
display, box,
...
RQ1: Do terms in different Zones provide different contributions?

SOURCECODELEXICON JFreeChart JEdit
JUnit Xerces

SOURCECODELEXICON JFreeChart
• Good lexicon in every Zone!

SOURCECODELEXICON JEdit
• Very good in Comments
• Very poor in Method and Parameter names

SOURCECODELEXICON JUnit
• Good in Class and Attribute names
• No Comments at all

SOURCECODELEXICON Xerces
• Poor in Method, Parameter names and Comments
• Good in Method Code and Variable names

SOURCECODELEXICON JFreeChart JEdit
JUnit Xerces
RQ2: How to automatically weight the different contributions ?

CONTRI
BUTION
SOFTWARE RE-
MODULARIZATION
• We propose a source code Zone Indexing
Corazza A., Di Martino, S., Maggio, V., Scanniello, G.
Investigating the use of lexical information for software system clustering
(2011) Proceedings of the European Conference on Software Maintenance and Reengineering, CSMR, art. no. 5741257, pp. 35-44.
ISSN: 15345351 ISBN: 978-076954343-7
Corazza, A., Di Martino, S., Maggio, V., Scanniello, G.
Combining machine learning and information retrieval techniques for software clustering
(2012) Communications in Computer and Information Science, 255 CCIS, pp. 42-60. ISSN: 18650929 ISBN: 978-364228032-0

CONTRI
BUTION
SOFTWARE RE-
MODULARIZATION
• An approach to automatically weight the contribution of different terms in
Zones
• Maximum-Likelihood Estimation (MLE) Approach
ISSN: 15345351 ISBN: 978-076954343-7

CONTRI
BUTION
SOFTWARE RE-
MODULARIZATION
• An approach to automatically weight the contribution of different terms in
Zones
• Maximum-Likelihood Estimation (MLE) Approach
• A variant of the K-medoid clustering algorithm
• Tailored to the software clustering task
ISSN: 15345351 ISBN: 978-076954343-7

StackedVerticalBarRenderer
LineAndShapeRenderer
K-MEDOIDVARIANT
ArrayHolder
Plot
LinearPlotFitAlgorithm
LinearPlotFit
HorizontalNumberAxis
NumberAxisPropertyEditPanel
PlotFitAlgorithm
ChartFrame
GraphicsHolder
PropertyEditPanel
DateAxisPropertyEditPanel
StandardLegend
LegendItem
FigureLegend
VerticalNumberAxis
VerticalXYDataPlot
DataPlot
SampleXYDataset
SampleXYDatasetThread
VerticalPlot
FitAlgorithm

K-MEDOIDVARIANT
ArrayHolder
Plot
LinearPlotFit
PlotFitAlgorithm
ChartFrame
GraphicsHolder
PropertyEditPanel
StandardLegend
LegendItem
FigureLegend
VerticalNumberAxis
VerticalXYDataPlot
DataPlot
SampleXYDataset
Non-extreme Distribution
property
VerticalPlot
FitAlgorithm

K-MEDOIDVARIANT
ArrayHolder
Plot
LinearPlotFit
PlotFitAlgorithm
ChartFrame
GraphicsHolder
PropertyEditPanel
StandardLegend
LegendItem
FigureLegend
VerticalNumberAxis
VerticalXYDataPlot
DataPlot
SampleXYDataset
Non-extreme Distribution
property
VerticalPlot
FitAlgorithm
Extreme
Extreme

AUTHORITATIVENESS
RE
SULTS
Comparison of Clustering Results with an Authoritative Target Partition
Ant Lucene Tomcat Azureus Hibernate ITextdotNet jEdit JFreeChart JFTP JHotDraw JRefactory JUnit LiferayPortal Pmd Synapse TigerEnvelopes Velocity Xalan Xerces
0
0.2
0.4
0.6
0.8
1.0
19 target Open Source Java Systems FLAT INDEXING (NO ZONES) = STATE OF THE ART
FLAT INDEXING (NO ZONES) ZONE INDEXING ZONE INDEXING + MLE WEIGHTS

AUTHORITATIVENESS
RE
SULTS
Comparison of Clustering Results with an Authoritative Target Partition
19 target Open Source Java Systems FLAT INDEXING (NO ZONES) = STATE OF THE ART
0
0.2
0.4
0.6
0.8
1.0
Ant Lucene Tomcat Azureus Hibernate ITextdotNet jEdit JFreeChart JFTP JHotDraw JRefactory JUnit LiferayPortal Pmd Synapse TigerEnvelopes Velocity Xalan Xerces
FLAT INDEXING (NO ZONES) ZONE INDEXING ZONE INDEXING + MLE WEIGHTS

ISSUE
#2
SOURCE CODE
NORMALIZATION

SOURCECODE
LEXICALINFORMATION Implicit assumption: The “same” words are used whenever a

1. Tokenization
2. Normalization
1.5 Identiﬁer
Splitting
• Splitting algorithms based on naming conventions
• camelCase Splitter: r’(?<=!^)([A-Z][a-z]+)’
• NullHandle ==> Null | Handle
• displayBox ==> display | Box
draw, the, are,
box, r,
rectangl, g,
graphic,
color,
....
STATEOFTHEARTTOOLS
display, box
null, handl

1. Tokenization
2. Normalization
1.5 Identiﬁer
Splitting
draw, the, are,
box, r,
rectangl, g,
graphic,
color,
....
STATEOFTHEARTTOOLS
display, box
null, handl

1. Tokenization
2. Normalization
1.5 Identiﬁer
Splitting
• displayBox ==> display | Box
draw, the, are,
box, r,
rectangl, g,
graphic,
color,
....
STATEOFTHEARTTOOLS
null, handl
display, box

IDENTIFIERSSPLITTING

• drawXORRect ==> drawXOR | Rect

• drawXORRect ==> drawXOR | Rect
• drawxorrect ==> NO SPLIT
Splitting algorithms based on naming conventions
are not robust enough

ABBREVIATIONSEXPANSION

• Heavy use of Abbreviations in the source code
• r as for Rectangle
• rect as for Rectangle
ABBREVIATIONSEXPANSION

1. Tokenization
CODENORMALIZATION
2. Normalization
1.5 Code
Normalization
• AMAP (Hill and Pollock, 2008)
• SAMURAI (Enslen, et.al , 2011)
• TIDIER (Guerrouj, et.al , 2011)
• GenTest+Normalize (Lawrie and Binkley, 2011)
• LINSEN (Corazza, Di Martino and Maggio, 2012)
draw, the, are,
null, handl,
box,
rectangl,
graphic,
color,
display, box,
rectangl,
draw, xor,
rectangl

1. Tokenization
CODENORMALIZATION
2. Normalization
1.5 Code
Normalization
• AMAP (Hill and Pollock, 2008)
• SAMURAI (Enslen, et.al , 2011)
• TIDIER (Guerrouj, et.al , 2011)
• GenTest+Normalize (Lawrie and Binkley, 2011)
• LINSEN (Corazza, Di Martino and Maggio, 2012)
draw, the, are,
null, handl,
box,
rectangl,
graphic,
color,
display, box,
rectangl,
draw,xor,
rectangl

CONTRI
BUTION
SOURCE CODE
NORMALIZATION
• LINSEN: novel technique for code normalization
• Able to both Split Identifiers and Expand possible occurring
abbreviations
Corazza, A., Di Martino, S., Maggio, V.
LINSEN: An efficient approach to split identifiers and expand abbreviations
(2012) IEEE International Conference on Software Maintenance, ICSM, art. no. 6405277, pp. 233-242. ISBN: 978-146732312-3

CONTRI
BUTION
SOURCE CODE
NORMALIZATION
• LINSEN: novel technique for code normalization
• Able to both Split Identifiers and Expand possible occurring
abbreviations
• Based on an efficient String Matching technique:
Baeza-Yates&Perlberg Algorithm (BYP)
• Exploit different Sources of Information to find the matching
words
Corazza, A., Di Martino, S., Maggio, V.
LINSEN: An efficient approach to split identifiers and expand abbreviations
(2012) IEEE International Conference on Software Maintenance, ICSM, art. no. 6405277, pp. 233-242. ISBN: 978-146732312-3

SKETCHOFTHEALGORITHM
• NODES correspond to characters of the current identifier
Model: Weighted Directed Graph
Example: drawXORRect identifier
0
11
1
9
4
5
7
8
2 63
10

• ARCS corresponds to matchings between identifier substrings
and dictionary words
0
11
1
9
4
5
7
8
2 63
10

• ARCS corresponds to matchings between identifier substrings
and dictionary words
• Padding Arcs to ensure the Graph always connected
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“R”,C-MAX
6
“a”,C-MAX
3
“w”,C-MAX
“X”,C-MAX
“O”,C-MAX
“e”,C-MAX
10
“c”,C-MAX“t”,C-MAX
“raw”,c(“raw”)
“draw”,c(“draw”)
“or”,c(“or”)
“xor”,c(“xor”)
“rectangle”,c(“rectangle”)
“R”,C-MAX

• Every Arc is Labelled with the corresponding dictionary word
• Weights represent the “cost” of each matching
• Cost function [c(“word”)] favors longest words and domain-
related information
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“R”,C-MAX
6
“a”,C-MAX
3
“w”,C-MAX
“X”,C-MAX
“O”,C-MAX
“e”,C-MAX
10
“R”,C-MAX

• The final Mapping Solution corresponds to the sequence of labels
in the path with the minimum cost (Djikstra Algorithm)
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“R”,C-MAX
6
“a”,C-MAX
3
“w”,C-MAX
“X”,C-MAX
“O”,C-MAX
“e”,C-MAX
10
“R”,C-MAX

SPLITTING
Accuracy Rates for the
comparison with
DTW (Madani et. al 2010)
0
0.25
0.5
0.75
1
JhotDraw 5.1 Lynx 2.8.5
DTW LINSEN DTW LINSEN
DTW LINSEN
RE
SULTS
# 1
comparison with GenTest
(Lawrie and Binkley, 2011)
0
0.25
0.5
0.75
1
which 2.20 a2ps 4.14
GenTest LINSEN
DTW O(n3)

comparison with
AMAP (Hill and Pollock, 2008)
0
0.25
0.5
0.75
1
CW DL OO AC PR SL TOTAL (Avg)
AMAP
LINSEN
EXPANSION
RE
SULTS
# 2
CW: Combination Words
DL: Dropped Letters
OO: Others
AC: Acronyms
PR: Prefix
SL: Single Letters

PROBL
EM
S T A T E
M E N T
CLONE DETECTION

PROBL
EM
S T A T E
M E N T
CLONE DETECTION
Clones Textual Similarity

PROBL
EM
S T A T E
M E N T
CLONE DETECTION
Clones Functional Similarity

PROBL
EM
S T A T E
M E N T
CLONE DETECTION
Clones affect the reliability of the system!
Sneaky Bug!

Duplix
Scorpio
PMD
CCFinder
Dup
CPD
Duplix
Shinobi
Clone Detective
Gemini
iClones
KClone
ConQAT
Deckard
Clone Digger
JCCD
CloneDr
SimScan
CLICS
NiCAD
Simian
Duploc
Dude
SDD
STATEOFTHEARTTOOLS

Duplix
Scorpio
PMD
CCFinder
Dup
CPD
Duplix
Shinobi
Clone Detective
Gemini
iClones
KClone
ConQAT
Deckard
Clone Digger
JCCD
CloneDr
SimScan
CLICS
NiCAD
Simian
Duploc
Dude
SDD
Text Based Tools:
Text is compared line by line
STATEOFTHEARTTOOLS

Duplix
Scorpio
PMD
CCFinder
Dup
CPD
Duplix
Shinobi
Clone Detective
Gemini
iClones
KClone
ConQAT
Deckard
Clone Digger
JCCD
CloneDr
SimScan
CLICS
NiCAD
Simian
Duploc
Dude
SDD
Token Based Tools:
Token sequences are
compared to sequences
STATEOFTHEARTTOOLS

Duplix
Scorpio
PMD
CCFinder
Dup
CPD
Duplix
Shinobi
Clone Detective
Gemini
iClones
KClone
ConQAT
Deckard
Clone Digger
JCCD
CloneDr
SimScan
CLICS
NiCAD
Simian
Duploc
Dude
SDD
Syntax Based Tools:
Syntax subtrees are compared
to each other
STATEOFTHEARTTOOLS

Duplix
Scorpio
PMD
CCFinder
Dup
CPD
Duplix
Shinobi
Clone Detective
Gemini
iClones
KClone
ConQAT
Deckard
Clone Digger
JCCD
CloneDr
SimScan
CLICS
NiCAD
Simian
Duploc
Dude
SDD
Graph Based Tools:
(sub) graphs are compared to
each other
STATEOFTHEARTTOOLS

CONTRI
BUTION
CLONE DETECTION
• Combining different sources of information to improve the
effectiveness of the detection process
A Tree Kernel based approach for clone detection
(2010) IEEE International Conference on Software Maintenance, ICSM, art. no. 5609715. ISBN: 978-142448629-8
Corazza, A., Di Martino, S., Maggio, V., Moschitti, A., Passerini, A., Scanniello, G., Silvestri F.
Using Machine Learning and Information Retrieval Techniques to Improve Software Maintainability
(2013) Communications in Computer and Information Science, In Press

CONTRI
BUTION
CLONE DETECTION
• Structural and Lexical Information

CONTRI
BUTION
CLONE DETECTION
• Application of Kernel Methods to Source Code Structures

CONTRI
BUTION
CLONE DETECTION
• Application of Kernel Methods to Source Code Structures
• Typically applied in other field such as NLP or Bioinformatics

CODE
STRUCTURES
KERNELSFORSTRUCTURES
Computation of the dot product between (Graph) Structures
K( ),

CODE
STRUCTURES
KERNELSFORSTRUCTURES
Abstract Syntax Tree (AST)
Tree structure representing the syntactic structure of
the different instructions of a program (function)
Program Dependencies Graph (PDG)
(Directed) Graph structure representing the relationship
among the different statement of a program
Computation of the dot product between (Graph) Structures
K( ),

<
x y = =
x +
x 1
y -
y 1
while
block
while
block
block
if
>
b a = =
a +
a 1
b -
b 1
>
b 0 =
c 3
CODE AST
KERNELFORCLONES

<
x y = =
x +
x 1
y -
y 1
while
block
while
block
block
if
>
b a = =
a +
a 1
b -
b 1
>
b 0 =
c 3
CODE AST AST KERNEL
KERNELFORCLONES
<
block
while
= =
block
=
y -
=
x +
+
x 1
-
y 1
<
x y
>
b 0 =
c 3
if
block
>
b a
-
b 1
<
block
while
+
a 1
=
b -
=
a +

while
block<
x y
KERNELS
FOR CODE
STRUCTURES:
AST
KERNELFEATURES

while
block<
x y
KERNELS
FOR CODE
STRUCTURES:
AST
KERNELFEATURES Instruction Class (IC)
i.e., LOOP, CALL,
CONDITIONAL_STATEMENT

while
block<
x y
KERNELS
FOR CODE
STRUCTURES:
AST
i.e., LOOP, CALL,
Instruction (I)
i.e., FOR, IF, WHILE, RETURN

while
block<
x y
KERNELS
FOR CODE
STRUCTURES:
AST
i.e., LOOP, CALL,
Instruction (I)
Context (C)
i.e., Instruction Class of
the closer statement node

while
block<
x y
KERNELS
FOR CODE
STRUCTURES:
AST
i.e., LOOP, CALL,
Instruction (I)
Context (C)
Lexemes (Ls)
Lexical information gathered
(recursively) from leaves

while
block<
x y
KERNELS
FOR CODE
STRUCTURES:
AST
KERNELFEATURES
IC = Conditional-Expr
I = Less-operator
C = Loop
Ls= [x,y]
IC = Loop
I = while-loop
C = Function-Body
Ls= [x, y]
Instruction Class (IC)
i.e., LOOP, CALL,
Instruction (I)
Context (C)
Lexemes (Ls)
Lexical information gathered
(recursively) from leaves
IC = Block
I = while-body
C = Loop
Ls= [ x ]

CLONE DETECTION
• Comparison with another (pure) AST-based clone detector
• Comparison on a system with randomly seeded clones
0
0.25
0.5
0.75
1
Precision Recall F-measure
CloneDigger Tree Kernel Tool
RE
SULTS
Results refer to clones where code
fragments have been modified by adding/
removing or changing code statements

CONCLUSIONS
FUTURE RESEARCH DIRECTIONS
• Re-modularization:
• Integrate code normalization algorithm (LINSEN) to lexical analysis
• Code Normalization:
• Investigate the contributions of different sources of information
used for disambiguations
• Clone Detection:
• Combine Static (and Kernel methods) with Dynamic Analysis

THANK YOU
June 5th, 2013 - Ph.D. Defense
Valerio Maggio
Ph.D. Student, University of Naples “Federico II”
valerio.maggio@unina.it

Improving Software Maintenance using Unsupervised Machine Learning techniques

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (13)

Similar to Improving Software Maintenance using Unsupervised Machine Learning techniques

Similar to Improving Software Maintenance using Unsupervised Machine Learning techniques (20)

More from Valerio Maggio

More from Valerio Maggio (9)

Recently uploaded

Recently uploaded (20)

Improving Software Maintenance using Unsupervised Machine Learning techniques