[Figure: accuracy rates (0 to 1) for the comparison with AMAP (Hill and Pollock, 2008), by category: CW, DL, OO, AC, PR, SL, TOTAL (Avg)]
ISSUE
#3
CLONE
DETECTION
PROBLEM STATEMENT
CLONE DETECTION
Clones Textual Similarity
Clones Functional Similarity
Clones affect the reliability of the system!
Sneaky Bug!
Duplix
Scorpio
PMD
CCFinder
Dup
CPD
Duplix
Shinobi
Clone Detective
Gemini
iClones
KClone
ConQAT
Deckard
Clone Digger
JCCD
...
CONTRIBUTION
CLONE DETECTION
• Combining different sources of information to improve the
effectiveness of the detection p...
CODE STRUCTURES
KERNELS FOR STRUCTURES
Computation of the dot product between (Graph) Structures
K( , )
Abstract Syntax Tree (AST)
Tree structure representing the syntactic structure of
the...
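The idea of a kernel as a dot product between code structures can be illustrated with a deliberately simplified sketch. It is not the kernel defined in the thesis: it assumes Python's standard `ast` module as the parser and counts only node types, so it ignores the tree shape entirely. The function names `node_type_counts` and `ast_kernel` are hypothetical.

```python
import ast
from collections import Counter


def node_type_counts(source):
    """Count how many times each AST node type occurs in a code fragment."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def ast_kernel(source_a, source_b):
    """Dot product between the node-type count vectors of two fragments."""
    counts_a = node_type_counts(source_a)
    counts_b = node_type_counts(source_b)
    return sum(counts_a[t] * counts_b[t] for t in counts_a)


# Two fragments that differ only in identifier names, plus an unrelated one.
fragment_a = "while x < y:\n    x = x + 1\n    y = y - 1\n"
fragment_b = "while a < b:\n    a = a + 1\n    b = b - 1\n"
fragment_c = "c = 3\n"

print(ast_kernel(fragment_a, fragment_a))
print(ast_kernel(fragment_a, fragment_b))
print(ast_kernel(fragment_a, fragment_c))
```

Because identifier names do not affect node types, the renamed fragment scores exactly as high as the original against itself, while the unrelated assignment scores lower. A structural kernel would additionally compare subtrees rather than flat counts.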
KERNEL FOR CLONES
[Figure: CODE, AST and AST KERNEL panels. The ASTs show while loops with a comparison condition (x < y, b > a) and a block of two assignments (x = x + 1, y = y - 1; a = a + 1, b = b - 1), plus an if node with condition b > 0 and assignment c = 3]
KERNELS FOR CODE STRUCTURES: AST KERNEL FEATURES
[Figure: AST fragment with a while node, its condition < (operands x, y) and a block]
Instruction Class (IC)
i.e., LOOP, CALL,
CONDITIONAL_STAT...
IC = Conditional-Expr
I = Less-operator
C = Loop
Ls= [x,y...
CLONE DETECTION
• Comparison with another (pure) AST-based clone detector
• Comparison on a system with randomly seeded cl...
CONCLUSIONS
FUTURE RESEARCH DIRECTIONS
• Re-modularization:
• Integrate code normalization algorithm (LINSEN) to lexical a...
THANK YOU
June 5th, 2013 - Ph.D. Defense
Valerio Maggio
Ph.D. Student, University of Naples “Federico II”
valerio.maggio@u...

Improving Software Maintenance using Unsupervised Machine Learning techniques


"Improving Software Maintenance using Unsupervised Machine Learning techniques": Ph.D. defence presentation.

Unsupervised Machine Learning techniques have been used to address different software maintenance issues, such as Software Modularisation and Clone Detection.

Transcript of "Improving Software Maintenance using Unsupervised Machine Learning techniques"

  1. 1. Dottorato in Scienze Computazionali e Informatiche, XXV Ciclo Ph.D. Candidate: Valerio Maggio Thesis Advisors: Dr. Sergio Di Martino Dr. Anna Corazza June 5th, 2013 UNSUPERVISED MACHINE LEARNING FOR SOFTWARE MAINTENANCE
  2. 2. THESIS GOAL: UNSUPERVISED MACHINE LEARNING FOR SOFTWARE MAINTENANCE Define and experimentally evaluate solutions (techniques and prototype tools) for automatic software analysis to support software maintenance activities. These solutions exploit (unsupervised) machine learning techniques to mine information from the source code.
  3. 3. There exist different types of Software Maintenance. SOFTWARE MAINTENANCE “A software system must be continually adapted during its overall life cycle or it progressively becomes less satisfactory.” (cit. Lehman’s First Law of Software Evolution)
  4. 4. SOFTWARE MAINTENANCE “A software system must be continually adapted during its overall life cycle or it progressively becomes less satisfactory.” (cit. Lehman’s First Law of Software Evolution) Software Maintenance represents the most expensive, time-consuming and challenging phase of the whole development process.
  5. 5. SOFTWARE MAINTENANCE “A software system must be continually adapted during its overall life cycle or it progressively becomes less satisfactory.” (cit. Lehman’s First Law of Software Evolution) Software Maintenance represents the most expensive, time-consuming and challenging phase of the whole development process. Software Maintenance can account for up to 85-90% of the total software costs.
  6. 6. ISSUES SOFTWARE MAINTENANCE “Software Maintenance is about change!” (cit. S. Jarzabek) Change Analysis
  7. 7. ISSUES SOFTWARE MAINTENANCE “Software Maintenance is about change!” (cit. S. Jarzabek) Change Analysis Program Comprehension The documentation is usually scarce or not up to date!
  8. 8. ISSUES SOFTWARE MAINTENANCE The source code (usually) represents the most reliable source of information about the system “Software Maintenance is about change!” (cit. S. Jarzabek) Change Analysis Program Comprehension The documentation is usually scarce or not up to date!
  9. 9. ISSUES SOFTWARE MAINTENANCE The source code (usually) represents the most reliable source of information about the system “Software Maintenance is about change!” (cit. S. Jarzabek) Change Analysis Program Comprehension Reverse Engineering The documentation is usually scarce or not up to date!
  10. 10. REVERSE ENGINEERING • Definition of tools and techniques to support maintenance activities • Goal: Build higher-level software models automatically, gathering information from the source code or any other document • Goal: To aid the comprehension of the system
  11. 11. REVERSE ENGINEERING • Definition of tools and techniques to support maintenance activities • Goal: Build higher-level software models automatically, gathering information from the source code or any other document • Goal: To aid the comprehension of the system STATIC ANALYSIS DYNAMIC ANALYSIS
  13. 13. MACHINE LEARNING • Provides computationally effective solutions to analyze large data sets
  14. 14. MACHINE LEARNING • Provides computationally effective solutions to analyze large data sets • Provides solutions that can be tailored to different tasks/domains
  15. 15. MACHINE LEARNING • Provides computationally effective solutions to analyze large data sets • Provides solutions that can be tailored to different tasks/domains • Requires considerable effort in: • the definition of the relevant information best suited for the specific task/domain
  16. 16. MACHINE LEARNING • Provides computationally effective solutions to analyze large data sets • Provides solutions that can be tailored to different tasks/domains • Requires considerable effort in: • the definition of the relevant information best suited for the specific task/domain • the application of the learning algorithms to the considered data
  17. 17. UNSUPERVISED LEARNING • Supervised Learning: • Learn from labelled samples (learn by examples) • Unsupervised Learning: • Learn (directly) from the data
  19. 19. UNSUPERVISED LEARNING • Supervised Learning: • Learn from labelled samples (learn by examples) • Unsupervised Learning: • Learn (directly) from the data (+) No cost of labeling samples (-) Trade-off imposed on the quality of the data
  20. 20. THESIS CONTRIBUTIONS
  21. 21. THESIS CONTRIBUTIONS Contributions to three relevant and related open issues in Software Maintenance
  22. 22. THESIS CONTRIBUTIONS Contributions to three relevant and related open issues in Software Maintenance 1. SOFTWARE RE-MODULARIZATION 2. SOURCE CODE NORMALIZATION 3. CLONE DETECTION
  23. 23. ISSUE #1 SOFTWARE RE-MODULARIZATION
  24. 24. SOFTWARE RE-MODULARIZATION PROBLEM STATEMENT Re-modularization provides a way to support software maintainers by automatically grouping together (clustering) “related” software classes. Software Classes: UI Process Components, UI Components, Data Access Components, Data Helpers / Utilities, Security, Operational Management, Communications, Business Components, Application Facade, Business Workflows, Messages Interfaces, Service Interfaces
  25. 25. SOFTWARE RE-MODULARIZATION PROBLEM STATEMENT Re-modularization provides a way to support software maintainers by automatically grouping together (clustering) “related” software classes. Clusters of Software Classes: External Systems, Service Consumers; Services (Service Interfaces, Messages Interfaces); Cross Cutting (Security, Operational Management, Communications); Data (Data Access Components, Data Helpers / Utilities); Presentation (UI Components, UI Process Components); Business (Application Facade, Business Workflows, Business Components)
  26. 26. SOURCE CODE LEXICAL INFORMATION
  28. 28. SOURCE CODE LEXICAL INFORMATION (IDEA): Exploit the lexical information gathered from the source code to produce clusters of classes that are lexically related.
  29. 29. SOURCE CODE LEXICAL INFORMATION (IDEA): Exploit the lexical information gathered from the source code to produce clusters of classes that are lexically related. State of the Art: Information Retrieval (IR) based approaches.
  30. 30. IR INDEXING
  31. 31. IR INDEXING Implicit assumption: The “same” words are used whenever a particular concept occurs
  32. 32. IR INDEXING Implicit assumption: The “same” words are used whenever a particular concept occurs 1. Tokenization: Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ... 2. Normalization: draw, the, are, null, handl, box, r, rectangl, g, graphic, box, display, box, ...
  33. 33. SOURCE CODE ZONES LEXICAL INFORMATION
  34. 34. SOURCE CODE ZONES LEXICAL INFORMATION Class Names
  35. 35. SOURCE CODE ZONES LEXICAL INFORMATION Class Names, Attribute Names
  36. 36. SOURCE CODE ZONES LEXICAL INFORMATION Class Names, Attribute Names, Method Names
  37. 37. SOURCE CODE ZONES LEXICAL INFORMATION Class Names, Attribute Names, Method Names, Parameter Names
  38. 38. SOURCE CODE ZONES LEXICAL INFORMATION Class Names, Attribute Names, Method Names, Parameter Names, Comments
  39. 39. SOURCE CODE ZONES LEXICAL INFORMATION Class Names, Attribute Names, Method Names, Parameter Names, Comments, Source Code
  40. 40. ZONE INDEXING
  41. 41. ZONE INDEXING RQ1: Do terms in different Zones provide different contributions?
  42. 42. ZONE INDEXING RQ1: Do terms in different Zones provide different contributions? 1. Tokenization: Draws, the, are, NullHandle, box, r, draw, g, Graphics, color, displayBox, ... 2. Normalization: draw, the, are, null, handl, box, r, draw, g, graphic, color, display, box, ...
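A minimal sketch of zone indexing, under stated assumptions: terms are counted separately per zone and then combined into one weighted vector per class. The zone names, the sample text and the weights below are hypothetical and hand-picked for illustration; the deck estimates the weights automatically, and stemming is left out here to keep the example short.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercased alphabetic tokens (no stemming in this sketch)."""
    return [token.lower() for token in re.findall(r"[A-Za-z]+", text)]


def zone_index(zones):
    """Map each zone name to the term frequencies found in that zone."""
    return {zone: Counter(tokenize(text)) for zone, text in zones.items()}


def weighted_vector(index, weights):
    """Combine per-zone term frequencies into one weighted term vector."""
    vector = Counter()
    for zone, counts in index.items():
        for term, freq in counts.items():
            vector[term] += weights.get(zone, 0.0) * freq
    return dict(vector)


# Hypothetical zones of one class and hypothetical zone weights.
zones = {
    "class_names": "Rectangle Figure",
    "method_names": "draw display Box",
    "comments": "Draws the box",
}
weights = {"class_names": 3.0, "method_names": 2.0, "comments": 1.0}

vector = weighted_vector(zone_index(zones), weights)
print(vector)
```

With flat indexing every occurrence of a term would count the same; here the term "box" gets a contribution from both the method-name zone and the comment zone, each scaled by its own weight.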
  43. 43. SOURCE CODE LEXICON JFreeChart JEdit JUnit Xerces
  44. 44. SOURCE CODE LEXICON JFreeChart • Good lexicon in every Zone!
  45. 45. SOURCE CODE LEXICON JEdit • Very good in Comments • Very poor in Method and Parameter names
  46. 46. SOURCE CODE LEXICON JUnit • Good in Class and Attribute names • No Comments at all
  47. 47. SOURCE CODE LEXICON Xerces • Poor in Method, Parameter names and Comments • Good in Method Code and Variable names
  48. 48. SOURCE CODE LEXICON JFreeChart JEdit JUnit Xerces RQ2: How to automatically weight the different contributions?
  49. 49. CONTRIBUTION SOFTWARE RE-MODULARIZATION • We propose a source code Zone Indexing Corazza, A., Di Martino, S., Maggio, V., Scanniello, G. (2011). Investigating the use of lexical information for software system clustering. Proceedings of the European Conference on Software Maintenance and Reengineering (CSMR), art. no. 5741257, pp. 35-44. ISSN: 15345351, ISBN: 978-076954343-7. Corazza, A., Di Martino, S., Maggio, V., Scanniello, G. (2012). Combining machine learning and information retrieval techniques for software clustering. Communications in Computer and Information Science, 255 CCIS, pp. 42-60. ISSN: 18650929, ISBN: 978-364228032-0.
  50. 50. CONTRIBUTION SOFTWARE RE-MODULARIZATION • We propose a source code Zone Indexing • An approach to automatically weight the contribution of different terms in Zones • Maximum-Likelihood Estimation (MLE) Approach
  51. 51. CONTRIBUTION SOFTWARE RE-MODULARIZATION • We propose a source code Zone Indexing • An approach to automatically weight the contribution of different terms in Zones • Maximum-Likelihood Estimation (MLE) Approach • A variant of the K-medoid clustering algorithm • Tailored to the software clustering task
  52. 52. K-MEDOID VARIANT [Figure: software classes to be clustered: StackedVerticalBarRenderer, LineAndShapeRenderer, ArrayHolder, Plot, LinearPlotFitAlgorithm, LinearPlotFit, HorizontalNumberAxis, NumberAxisPropertyEditPanel, PlotFitAlgorithm, ChartFrame, GraphicsHolder, PropertyEditPanel, DateAxisPropertyEditPanel, StandardLegend, LegendItem, FigureLegend, VerticalNumberAxis, VerticalXYDataPlot, DataPlot, SampleXYDataset, SampleXYDatasetThread, VerticalPlot, FitAlgorithm]
  53. 53. K-MEDOID VARIANT [Same figure] Non-extreme Distribution property
  55. 55. K-MEDOID VARIANT [Same figure, with two groups labelled "Extreme"] Non-extreme Distribution property
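For orientation, the textbook alternating K-medoids procedure can be sketched as follows. This is not the variant proposed in the deck: the Non-extreme Distribution property is not implemented, and the distance matrix and initial medoids are made-up inputs. In the clustering setting above, the distances would presumably come from the lexical dissimilarity between classes.

```python
def k_medoids(dist, medoids, max_iter=100):
    """Plain alternating k-medoids on a precomputed distance matrix.

    dist[i][j] is the distance between items i and j; medoids holds the
    indices of the initial medoids. Assumes no two items coincide.
    """
    n = len(dist)
    medoids = list(medoids)
    clusters = {}
    for _ in range(max_iter):
        # Assignment step: each item joins its closest medoid.
        clusters = {m: [] for m in medoids}
        for i in range(n):
            closest = min(medoids, key=lambda m: dist[i][m])
            clusters[closest].append(i)
        # Update step: the new medoid of a cluster is the member with the
        # smallest total distance to the other members.
        new_medoids = [
            min(members, key=lambda c: sum(dist[c][j] for j in members))
            for members in clusters.values()
        ]
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return clusters


# Toy data: six items on a line, forming two obvious groups.
values = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in values] for a in values]
clusters = k_medoids(dist, medoids=[0, 1])
print(clusters)
```

Unlike K-means, the cluster representative is always one of the items, so the procedure only needs pairwise distances and never has to average term vectors.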
  57. 57. AUTHORITATIVENESS RESULTS Comparison of Clustering Results with an Authoritative Target Partition [Figure: authoritativeness scores (0 to 1.0) on 19 target Open Source Java Systems: Ant, Lucene, Tomcat, Azureus, Hibernate, ITextdotNet, jEdit, JFreeChart, JFTP, JHotDraw, JRefactory, JUnit, LiferayPortal, Pmd, Synapse, TigerEnvelopes, Velocity, Xalan, Xerces; series: FLAT INDEXING (NO ZONES) = STATE OF THE ART, ZONE INDEXING, ZONE INDEXING + MLE WEIGHTS]
  61. 61. ISSUE #2 SOURCE CODE NORMALIZATION
  62. 62. SOURCE CODE LEXICAL INFORMATION Implicit assumption: The “same” words are used whenever a particular concept occurs
  64. 64. STATE OF THE ART TOOLS 1. Tokenization 1.5 Identifier Splitting 2. Normalization • Splitting algorithms based on naming conventions • camelCase Splitter: r'(?<!^)([A-Z][a-z]+)' • NullHandle ==> Null | Handle • displayBox ==> display | Box Resulting terms: draw, the, are, null, handl, box, r, rectangl, g, graphic, color, display, box, ...
  68. 68. IDENTIFIERS SPLITTING • camelCase Splitter: r'(?<!^)([A-Z][a-z]+)'
  69. 69. IDENTIFIERS SPLITTING • camelCase Splitter: r'(?<!^)([A-Z][a-z]+)' • drawXORRect ==> drawXOR | Rect
  70. 70. IDENTIFIERS SPLITTING • camelCase Splitter: r'(?<!^)([A-Z][a-z]+)' • drawXORRect ==> drawXOR | Rect • drawxorrect ==> NO SPLIT Splitting algorithms based on naming conventions are not robust enough
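The behaviour of the camelCase splitter on these identifiers can be reproduced with a short sketch. It assumes the intended pattern uses a negative lookbehind for the start of the string, `(?<!^)`, so that a leading capitalized word is not split off; the helper name `split_camel_case` is hypothetical.

```python
import re

# Assumed pattern: a capitalized word that is not at the start of the string.
CAMEL_CASE = re.compile(r"(?<!^)([A-Z][a-z]+)")


def split_camel_case(identifier):
    """Insert a break before every capitalized word, then split on it."""
    return CAMEL_CASE.sub(r" \1", identifier).split()


for name in ("NullHandle", "displayBox", "drawXORRect", "drawxorrect"):
    print(name, "==>", " | ".join(split_camel_case(name)))
```

The pattern only fires where an uppercase letter is followed by lowercase letters, which is why the acronym in drawXORRect stays glued to "draw" and why an all-lowercase identifier is left untouched.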
  71. 71. ABBREVIATIONS EXPANSION Splitting algorithms based on naming conventions are not robust enough
  72. 72. ABBREVIATIONS EXPANSION Splitting algorithms based on naming conventions are not robust enough • Heavy use of Abbreviations in the source code • r stands for Rectangle • rect stands for Rectangle
  74. 74. CODE NORMALIZATION 1. Tokenization 1.5 Code Normalization 2. Normalization • AMAP (Hill and Pollock, 2008) • SAMURAI (Enslen et al., 2011) • TIDIER (Guerrouj et al., 2011) • GenTest+Normalize (Lawrie and Binkley, 2011) • LINSEN (Corazza, Di Martino and Maggio, 2012) Resulting terms: draw, the, are, null, handl, box, rectangl, graphic, color, display, box, rectangl, draw, xor, rectangl
  76. 76. CONTRIBUTION SOURCE CODE NORMALIZATION • LINSEN: novel technique for code normalization • Able to both split identifiers and expand any abbreviations they contain Corazza, A., Di Martino, S., Maggio, V. (2012). LINSEN: An efficient approach to split identifiers and expand abbreviations. IEEE International Conference on Software Maintenance (ICSM), art. no. 6405277, pp. 233-242. ISBN: 978-146732312-3.
  77. 77. CONTRIBUTION SOURCE CODE NORMALIZATION • LINSEN: novel technique for code normalization • Able to both split identifiers and expand any abbreviations they contain • Based on an efficient String Matching technique: Baeza-Yates and Perleberg Algorithm (BYP) • Exploits different Sources of Information to find the matching words
  78. 78. SKETCH OF THE ALGORITHM • Model: Weighted Directed Graph • Example identifier: drawXORRect • NODES correspond to characters of the current identifier [Figure: graph with nodes 0 to 11]
  79. 79. SKETCH OF THE ALGORITHM • Model: Weighted Directed Graph • Example identifier: drawXORRect • NODES correspond to characters of the current identifier • ARCS correspond to matchings between identifier substrings and dictionary words [Figure: graph with nodes 0 to 11]
  80. 80. SKETCH OF THE ALGORITHM • Model: Weighted Directed Graph • Example identifier: drawXORRect • NODES correspond to characters of the current identifier • ARCS correspond to matchings between identifier substrings and dictionary words • Padding arcs ensure that the graph is always connected [Figure: graph with nodes 0 to 11; padding arcs labelled with the single characters “d”, “r”, “a”, “w”, “X”, “O”, “R”, “R”, “e”, “c”, “t”, each with cost C-MAX; word arcs “draw”, “raw”, “xor”, “or”, “rectangle”, each with cost c(“word”)]
  81. 81. SKETCH OF THE ALGORITHM • Model: Weighted Directed Graph • Example identifier: drawXORRect • Every arc is labelled with the corresponding dictionary word • Weights represent the “cost” of each matching • The cost function c(“word”) favors the longest words and domain-related information [Figure: graph with nodes 0 to 11; padding arcs with cost C-MAX; word arcs “draw”, “raw”, “xor”, “or”, “rectangle”, each with cost c(“word”)]
  82. 82. SKETCH OF THE ALGORITHM • Model: Weighted Directed Graph • Example identifier: drawXORRect • The final mapping solution corresponds to the sequence of labels along the minimum-cost path (Dijkstra's algorithm) [Figure: graph with nodes 0 to 11; padding arcs with cost C-MAX; word arcs “draw”, “raw”, “xor”, “or”, “rectangle”, each with cost c(“word”)]
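The graph model sketched on these slides can be approximated in a few lines. The following is only an illustrative sketch under stated assumptions, not the LINSEN implementation: the value of `C_MAX`, the cost function `arc_cost`, and the prefix rule used to match abbreviations are all hypothetical, and a naive substring scan stands in for the Baeza-Yates and Perleberg matching used by LINSEN.

```python
import heapq

C_MAX = 100.0  # hypothetical cost of a padding arc (one unmatched character)


def arc_cost(word, domain_words):
    """Hypothetical cost: longer words and domain-related words are cheaper."""
    cost = C_MAX / (len(word) + 1)
    return cost * 0.5 if word in domain_words else cost


def build_graph(identifier, dictionary, domain_words=frozenset()):
    """Nodes are positions 0..n; arcs are (target, label, cost) triples."""
    s = identifier.lower()
    n = len(s)
    graph = {i: [] for i in range(n + 1)}
    for i in range(n):
        # Padding arc: keeps the graph connected whatever the dictionary holds.
        graph[i].append((i + 1, s[i], C_MAX))
        for j in range(i + 2, n + 1):
            sub = s[i:j]
            for word in dictionary:
                # Exact match, or (assumed rule) a substring of at least three
                # characters that is a prefix of a word, e.g. "rect".
                if word == sub or (len(sub) >= 3 and word.startswith(sub)):
                    graph[i].append((j, word, arc_cost(word, domain_words)))
    return graph


def cheapest_mapping(identifier, dictionary, domain_words=frozenset()):
    """Return the arc labels along the minimum-cost path (Dijkstra)."""
    graph = build_graph(identifier, dictionary, domain_words)
    target = len(identifier)
    best = {0: 0.0}
    back = {}
    heap = [(0.0, 0)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == target:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, label, weight in graph[node]:
            new_cost = cost + weight
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                back[nxt] = (node, label)
                heapq.heappush(heap, (new_cost, nxt))
    labels = []
    node = target
    while node != 0:
        node, label = back[node]
        labels.append(label)
    return labels[::-1]


words = {"draw", "raw", "or", "xor", "rectangle"}
print(cheapest_mapping("drawXORRect", words))
```

With this toy dictionary the cheapest path uses the arcs labelled draw, xor, and rectangle, so the identifier is split and the abbreviation is expanded in a single pass. Because every padding arc costs `C_MAX`, any path that leaves characters unmatched is penalized.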
  83. 83. SPLITTING RESULTS #1 • Accuracy rates for the comparison with DTW (Madani et al., 2010) on JHotDraw 5.1 and Lynx 2.8.5 [bar chart: DTW vs. LINSEN, accuracy axis from 0 to 1] • Accuracy rates for the comparison with GenTest (Lawrie and Binkley, 2011) on which 2.20 and a2ps 4.14 [bar chart: GenTest vs. LINSEN, accuracy axis from 0 to 1] • DTW: O(n^3)
  84. 84. EXPANSION RESULTS #2 • Accuracy rates for the comparison with AMAP (Hill and Pollock, 2008) [bar chart: AMAP vs. LINSEN, accuracy axis from 0 to 1, categories CW, DL, OO, AC, PR, SL, TOTAL (Avg)] • CW: Combination Words • DL: Dropped Letters • OO: Others • AC: Acronyms • PR: Prefix • SL: Single Letters
  85. 85. ISSUE #3: CLONE DETECTION
  86. 86. PROBLEM STATEMENT: CLONE DETECTION
  87. 87. PROBLEM STATEMENT: CLONE DETECTION • Clones: Textual Similarity
  88. 88. PROBLEM STATEMENT: CLONE DETECTION • Clones: Functional Similarity
  89. 89. PROBLEM STATEMENT: CLONE DETECTION • Clones affect the reliability of the system! • Sneaky Bug!
  90. 90. STATE OF THE ART TOOLS • Duplix, Scorpio, PMD, CCFinder, Dup, CPD, Shinobi, Clone Detective, Gemini, iClones, KClone, ConQAT, Deckard, Clone Digger, JCCD, CloneDr, SimScan, CLICS, NiCAD, Simian, Duploc, Dude, SDD
  91. 91. STATE OF THE ART TOOLS • Text Based Tools: text is compared line by line
  92. 92. STATE OF THE ART TOOLS • Token Based Tools: token sequences are compared to each other
  93. 93. STATE OF THE ART TOOLS • Syntax Based Tools: syntax subtrees are compared to each other
  94. 94. STATE OF THE ART TOOLS • Graph Based Tools: (sub)graphs are compared to each other
  95. 95. CONTRIBUTION: CLONE DETECTION • Combining different sources of information to improve the effectiveness of the detection process • References: Corazza, A., Di Martino, S., Maggio, V., Scanniello, G. A Tree Kernel based approach for clone detection (2010) IEEE International Conference on Software Maintenance, ICSM, art. no. 5609715. ISBN: 978-142448629-8 • Corazza, A., Di Martino, S., Maggio, V., Moschitti, A., Passerini, A., Scanniello, G., Silvestri, F. Using Machine Learning and Information Retrieval Techniques to Improve Software Maintainability (2013) Communications in Computer and Information Science, In Press
  96. 96. CONTRIBUTION: CLONE DETECTION • Combining different sources of information to improve the effectiveness of the detection process • Structural and Lexical Information
  97. 97. CONTRIBUTION: CLONE DETECTION • Combining different sources of information to improve the effectiveness of the detection process • Structural and Lexical Information • Application of Kernel Methods to Source Code Structures
  98. 98. CONTRIBUTION: CLONE DETECTION • Combining different sources of information to improve the effectiveness of the detection process • Structural and Lexical Information • Application of Kernel Methods to Source Code Structures • Typically applied in other fields such as NLP or Bioinformatics
  99. 99. KERNELS FOR STRUCTURES: CODE STRUCTURES • Computation of the dot product between (graph) structures: K(·, ·)
  100. 100. KERNELS FOR STRUCTURES: CODE STRUCTURES • Computation of the dot product between (graph) structures: K(·, ·) • Abstract Syntax Tree (AST): tree structure representing the syntactic structure of the different instructions of a program (function) • Program Dependence Graph (PDG): (directed) graph structure representing the relationships among the different statements of a program
  102. 102. KERNEL FOR CLONES • CODE
  103. 103. KERNEL FOR CLONES • CODE • AST [Figure: code fragments and their ASTs, built from while loops over x, y and a, b with assignment blocks, and an if statement testing b > 0 that assigns c = 3]
  104. 104. KERNEL FOR CLONES • CODE • AST • AST KERNEL [Figure: the same ASTs decomposed into the subtrees (block, while, assignment, and comparison fragments) that the AST kernel compares]
  105. 105. KERNELS FOR CODE STRUCTURES: AST KERNEL FEATURES [Figure: AST fragment with a while node whose children are the condition < (x, y) and a block]
  106. 106. KERNELS FOR CODE STRUCTURES: AST KERNEL FEATURES • Instruction Class (IC), e.g., LOOP, CALL, CONDITIONAL_STATEMENT
  107. 107. KERNELS FOR CODE STRUCTURES: AST KERNEL FEATURES • Instruction Class (IC), e.g., LOOP, CALL, CONDITIONAL_STATEMENT • Instruction (I), e.g., FOR, IF, WHILE, RETURN
  108. 108. KERNELS FOR CODE STRUCTURES: AST KERNEL FEATURES • Instruction Class (IC), e.g., LOOP, CALL, CONDITIONAL_STATEMENT • Instruction (I), e.g., FOR, IF, WHILE, RETURN • Context (C), i.e., the Instruction Class of the closest statement node
  109. 109. KERNELS FOR CODE STRUCTURES: AST KERNEL FEATURES • Instruction Class (IC), e.g., LOOP, CALL, CONDITIONAL_STATEMENT • Instruction (I), e.g., FOR, IF, WHILE, RETURN • Context (C), i.e., the Instruction Class of the closest statement node • Lexemes (Ls): lexical information gathered (recursively) from the leaves
  110. 110. KERNELS FOR CODE STRUCTURES: AST KERNEL FEATURES • Instruction Class (IC), e.g., LOOP, CALL, CONDITIONAL_STATEMENT • Instruction (I), e.g., FOR, IF, WHILE, RETURN • Context (C), i.e., the Instruction Class of the closest statement node • Lexemes (Ls): lexical information gathered (recursively) from the leaves • Example annotations: while node: IC = Loop, I = while-loop, C = Function-Body, Ls = [x, y] • < node: IC = Conditional-Expr, I = Less-operator, C = Loop, Ls = [x, y] • block node: IC = Block, I = while-body, C = Loop, Ls = [x]
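The four node features listed above can be turned into a simple kernel-style similarity between annotated trees. The sketch below is an assumption-laden illustration, not the kernel defined in the cited papers: the `Node` class, the unit weights in `node_similarity`, and the all-pairs summation in `tree_kernel` are hypothetical choices made only to show how IC, I, C, and Ls could contribute to a score.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    instruction_class: str  # IC
    instruction: str        # I
    context: str            # C
    lexemes: tuple = ()     # Ls
    children: list = field(default_factory=list)


def node_similarity(a, b):
    """Hypothetical node score: IC must match, then I, C, and Ls add to it."""
    if a.instruction_class != b.instruction_class:
        return 0.0
    score = 1.0
    if a.instruction == b.instruction:
        score += 1.0
    if a.context == b.context:
        score += 1.0
    la, lb = set(a.lexemes), set(b.lexemes)
    if la or lb:
        score += len(la & lb) / len(la | lb)  # Jaccard overlap of lexemes
    return score


def iter_nodes(root):
    """Yield every node of the tree rooted at root."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        stack.extend(node.children)


def tree_kernel(t1, t2):
    """Sum the node similarity over all pairs of nodes from the two trees."""
    nodes2 = list(iter_nodes(t2))
    return sum(node_similarity(a, b) for a in iter_nodes(t1) for b in nodes2)


def normalized_kernel(t1, t2):
    """Scale the kernel so that a tree compared with itself scores 1."""
    denom = math.sqrt(tree_kernel(t1, t1) * tree_kernel(t2, t2))
    return tree_kernel(t1, t2) / denom if denom else 0.0


def example_while_tree(left, right):
    """Annotated tree for the slide's while fragment, with given lexemes."""
    cond = Node("Conditional-Expr", "Less-operator", "Loop", (left, right))
    body = Node("Block", "while-body", "Loop", (left,))
    return Node("Loop", "while-loop", "Function-Body", (left, right), [cond, body])


print(normalized_kernel(example_while_tree("x", "y"), example_while_tree("b", "a")))
```

In this toy setting, two loops with identical structure but renamed variables still score high, because the structural features (IC, I, C) match even though the lexemes do not. This is the intuition behind combining structural and lexical information.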
  111. 111. CLONE DETECTION RESULTS • Comparison with another (pure) AST-based clone detector • Comparison on a system with randomly seeded clones [bar chart: Precision, Recall, and F-measure of CloneDigger vs. the Tree Kernel Tool, axis from 0 to 1] • Results refer to clones where code fragments have been modified by adding, removing, or changing code statements
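For reference, the three measures named on this slide can be computed from the set of detected clones and the set of seeded clones as follows. This is a generic sketch of the standard definitions; the function name `detection_scores` and the example identifiers are hypothetical and do not reproduce any result from the evaluation.

```python
def detection_scores(detected, seeded):
    """Return (precision, recall, f_measure) for two sets of clone ids."""
    true_positives = len(detected & seeded)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(seeded) if seeded else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Made-up example: 4 reported clones, 5 seeded clones, 3 in common.
print(detection_scores({1, 2, 3, 4}, {2, 3, 4, 5, 6}))
```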
  112. 112. CONCLUSIONS
  117. 117. CONCLUSIONS: FUTURE RESEARCH DIRECTIONS • Re-modularization: integrate the code normalization algorithm (LINSEN) into the lexical analysis • Code Normalization: investigate the contributions of the different sources of information used for disambiguation • Clone Detection: combine static analysis (and kernel methods) with dynamic analysis
  118. 118. THANK YOU June 5th, 2013 - Ph.D. Defense Valerio Maggio Ph.D. Student, University of Naples “Federico II” valerio.maggio@unina.it