Your SlideShare is downloading. ×
0
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Improving Software Maintenance using Unsupervised Machine Learning techniques

522

Published on

"Improving Software Maintenance using Unsupervised Machine Learning techniques": Ph.D. defence presentation. …

"Improving Software Maintenance using Unsupervised Machine Learning techniques": Ph.D. defence presentation.

Unsupervised Machine Learning techniques have been used to face different software maintenance issues such as Software Modularisation and Clone detection.

Published in: Technology, Education
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
522
On Slideshare
0
From Embeds
0
Number of Embeds
4
Actions
Shares
0
Downloads
22
Comments
0
Likes
1
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  1. Dottorato in Scienze Computazionali e Informatiche, XXV Ciclo Ph.D. Candidate: Valerio Maggio Thesis Advisors: Dr. Sergio Di Martino Dr. Anna Corazza June 5th, 2013 UNSUPERVISED MACHINE LEARNING FOR SOFTWARE MAINTENANCE
  2. THESIS G O A L UNSUPERVISED MACHINE LEARNING FOR SOFTWARE MAINTENANCE These solutions exploit (unsupervised) machine learning techniques to mine information from the source code Define and experimentally evaluate solutions (techniques and prototype tools) for automatic software analysis to support software maintenance activities.
  3. There exist different types of Software Maintenance. SOFTWARE MAINTENANCE “A software system must be continually adapted during its overall life cycle or it progressively becomes less satisfactory.” (cit. Lehman’s First Law of Software Evolution)
  4. SOFTWARE MAINTENANCE “A software system must be continually adapted during its overall life cycle or it progressively becomes less satisfactory.” (cit. Lehman’s First Law of Software Evolution) Software Maintenance represents the most expensive, time consuming and challenging phase of the whole development process.
  5. SOFTWARE MAINTENANCE “A software system must be continually adapted during its overall life cycle or it progressively becomes less satisfactory.” (cit. Lehman’s First Law of Software Evolution) Software Maintenance represents the most expensive, time consuming and challenging phase of the whole development process. Software Maintenance could account up to the 85-90% of the total software costs.
  6. ISSUES SOFTWARE MAINTENANCE “Software Maintenance is about change!” (cit. S. Jarzabek) Change Analysis
  7. ISSUES SOFTWARE MAINTENANCE “Software Maintenance is about change!” (cit. S. Jarzabek) Change Analysis Program Comprehension The documentation is usually scarce or not up to date!
  8. ISSUES SOFTWARE MAINTENANCE The source code (usually) represents the most reliable source of information about the system “Software Maintenance is about change!” (cit. S. Jarzabek) Change Analysis Program Comprehension The documentation is usually scarce or not up to date!
  9. ISSUES SOFTWARE MAINTENANCE The source code (usually) represents the most reliable source of information about the system “Software Maintenance is about change!” (cit. S. Jarzabek) Change Analysis Program Comprehension Reverse Engineering The documentation is usually scarce or not up to date!
  10. REVERSE E N G I N E E R I N G • Definition of tools and techniques to support maintenance activities • Goal: Build higher-level software models in an automatic fashion gathering information from the source code or any other document • Goal: To aid the comprehension of the system
  11. REVERSE E N G I N E E R I N G • Definition of tools and techniques to support maintenance activities • Goal: Build higher-level software models in an automatic fashion gathering information from the source code or any other document • Goal: To aid the comprehension of the system STATIC ANALYSIS DYNAMIC ANALYSIS
  12. REVERSE E N G I N E E R I N G • Definition of tools and techniques to support maintenance activities • Goal: Build higher-level software models in an automatic fashion gathering information from the source code or any other document • Goal: To aid the comprehension of the system STATIC ANALYSIS DYNAMIC ANALYSIS
  13. MACHINE L E A R N I N G • Provides computational effective solutions to analyze large data sets
  14. MACHINE L E A R N I N G • Provides computational effective solutions to analyze large data sets • Provides solutions that can be tailored to different tasks/domains
  15. MACHINE L E A R N I N G • Provides computational effective solutions to analyze large data sets • Provides solutions that can be tailored to different tasks/domains • Requires many efforts in: • the definition of the relevant information best suited for the specific task/domain
  16. MACHINE L E A R N I N G • Provides computational effective solutions to analyze large data sets • Provides solutions that can be tailored to different tasks/domains • Requires many efforts in: • the definition of the relevant information best suited for the specific task/domain • the application of the learning algorithms to the considered data
  17. UNSUPERVISEDLEARNING • Supervised Learning: • Learn from labelled samples • Unsupervised Learning: • Learn (directly) from the data Learn by examples
  18. UNSUPERVISEDLEARNING • Supervised Learning: • Learn from labelled samples • Unsupervised Learning: • Learn (directly) from the data Learn by examples
  19. UNSUPERVISEDLEARNING • Supervised Learning: • Learn from labelled samples • Unsupervised Learning: • Learn (directly) from the data Learn by examples (+) No cost of labeling samples (-) Trade-off imposed on the quality of the data
  20. THESIS C O N T R I B U T I O N S &
  21. THESIS C O N T R I B U T I O N S Contributions to three relevant and related open issues in Software Maintenance &
  22. THESIS C O N T R I B U T I O N S Contributions to three relevant and related open issues in Software Maintenance 1. SOFTWARE RE-MODULARIZATION 3. CLONE DETECTION 2. SOURCE CODE NORMALIZATION &
  23. ISSUE #1 SOFTWARE RE-MODULARIZATION
  24. Software Classes UI Process Components UI Components Data Access Components Data Helpers / Utilities Security Operational Management Communications Business Components Application Facade Buisiness Workflows Messages Interfaces Service Interfaces Re-modularization provides a way to support software maintainers by automatically grouping together (clustering) “related” software classes SOFTWARE RE-MODULARIZATION PROBL EM S T A T E M E N T
  25. Re-modularization provides a way to support software maintainers by automatically grouping together (clustering) “related” software classes SOFTWARE RE-MODULARIZATION External Systems Service Consumers Services Service Interfaces Messages Interfaces Cross Cutting Security OperationalManagement Communications Data Data Access Components Data Helpers / Utilities Presentation UI Components UI Process Components Business Application Facade Buisiness Workflows Business Components Clusters of Software Classes PROBL EM S T A T E M E N T
  26. SOURCECODE LEXICALINFORMATION
  27. SOURCECODE LEXICALINFORMATION
  28. SOURCECODE LEXICALINFORMATION (IDEA): Exploit the lexical information gathered from the source code to produce clusters of classes that are lexically related.
  29. SOURCECODE LEXICALINFORMATION (IDEA): Exploit the lexical information gathered from the source code to produce clusters of classes that are lexically related. State of the Art: Information Retrieval (IR) based approaches.
  30. IRINDEXING
  31. Implicit assumption: The “same” words are used whenever a particular concept occurs IRINDEXING
  32. 1. Tokenization 2. Normalization draw, the, are, null, handl, box, r, rectangl, g, graphic, box, display, box, ... Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ... Implicit assumption: The “same” words are used whenever a particular concept occurs IRINDEXING
  33. SOURCECODEZONES LEXICALINFORMATION
  34. SOURCECODEZONES LEXICALINFORMATION Class Names
  35. SOURCECODEZONES LEXICALINFORMATION Class Names Attribute Names
  36. SOURCECODEZONES LEXICALINFORMATION Class Names Attribute Names Method Names Method Names
  37. SOURCECODEZONES LEXICALINFORMATION Class Names Attribute Names Method Names Parameter Names Method Names Parameter Names
  38. SOURCECODEZONES LEXICALINFORMATION Class Names Attribute Names Method Names Parameter Names Comments Comments Method Names Parameter Names Comments
  39. SOURCECODEZONES LEXICALINFORMATION Source Code Class Names Attribute Names Method Names Parameter Names Comments Comments Method Names Parameter Names Source Code Comments
  40. ZONEINDEXING
  41. ZONEINDEXING RQ1: Do terms in different Zones provide different contributions?
  42. 1. Tokenization 2. Normalization ZONEINDEXING Draws, the, are, NullHandle, box, r, draw, g, Graphics, color, displayBox, ... draw, the, are, null, handl, box, r, draw, g, graphic, color, display, box, ... RQ1: Do terms in different Zones provide different contributions?
  43. SOURCECODELEXICON JFreeChart JEdit JUnit Xerces
  44. SOURCECODELEXICON JFreeChart • Good lexicon in every Zone!
  45. SOURCECODELEXICON JEdit • Very good in Comments • Very poor in Method and Parameter names
  46. SOURCECODELEXICON JUnit • Good in Class and Attribute names • No Comments at all
  47. SOURCECODELEXICON Xerces • Poor in Method, Parameter names and Comments • Good in Method Code and Variable names
  48. SOURCECODELEXICON JFreeChart JEdit JUnit Xerces RQ2: How to automatically weight the different contributions ?
  49. CONTRI BUTION SOFTWARE RE- MODULARIZATION • We propose a source code Zone Indexing Corazza A., Di Martino, S., Maggio, V., Scanniello, G. Investigating the use of lexical information for software system clustering (2011) Proceedings of the European Conference on Software Maintenance and Reengineering, CSMR, art. no. 5741257, pp. 35-44. ISSN: 15345351 ISBN: 978-076954343-7 Corazza, A., Di Martino, S., Maggio, V., Scanniello, G. Combining machine learning and information retrieval techniques for software clustering (2012) Communications in Computer and Information Science, 255 CCIS, pp. 42-60. ISSN: 18650929 ISBN: 978-364228032-0
  50. CONTRI BUTION SOFTWARE RE- MODULARIZATION • We propose a source code Zone Indexing • An approach to automatically weight the contribution of different terms in Zones • Maximum-Likelihood Estimation (MLE) Approach Corazza A., Di Martino, S., Maggio, V., Scanniello, G. Investigating the use of lexical information for software system clustering (2011) Proceedings of the European Conference on Software Maintenance and Reengineering, CSMR, art. no. 5741257, pp. 35-44. ISSN: 15345351 ISBN: 978-076954343-7 Corazza, A., Di Martino, S., Maggio, V., Scanniello, G. Combining machine learning and information retrieval techniques for software clustering (2012) Communications in Computer and Information Science, 255 CCIS, pp. 42-60. ISSN: 18650929 ISBN: 978-364228032-0
  51. CONTRI BUTION SOFTWARE RE- MODULARIZATION • We propose a source code Zone Indexing • An approach to automatically weight the contribution of different terms in Zones • Maximum-Likelihood Estimation (MLE) Approach • A variant of the K-medoid clustering algorithm • Tailored to the software clustering task Corazza A., Di Martino, S., Maggio, V., Scanniello, G. Investigating the use of lexical information for software system clustering (2011) Proceedings of the European Conference on Software Maintenance and Reengineering, CSMR, art. no. 5741257, pp. 35-44. ISSN: 15345351 ISBN: 978-076954343-7 Corazza, A., Di Martino, S., Maggio, V., Scanniello, G. Combining machine learning and information retrieval techniques for software clustering (2012) Communications in Computer and Information Science, 255 CCIS, pp. 42-60. ISSN: 18650929 ISBN: 978-364228032-0
  52. StackedVerticalBarRenderer LineAndShapeRenderer K-MEDOIDVARIANT ArrayHolder Plot LinearPlotFitAlgorithm LinearPlotFit HorizontalNumberAxis NumberAxisPropertyEditPanel PlotFitAlgorithm ChartFrame GraphicsHolder PropertyEditPanel DateAxisPropertyEditPanel StandardLegend LegendItem FigureLegend VerticalNumberAxis VerticalXYDataPlot DataPlot SampleXYDataset SampleXYDatasetThread VerticalPlot FitAlgorithm
  53. StackedVerticalBarRenderer LineAndShapeRenderer K-MEDOIDVARIANT ArrayHolder Plot LinearPlotFitAlgorithm LinearPlotFit HorizontalNumberAxis NumberAxisPropertyEditPanel PlotFitAlgorithm ChartFrame GraphicsHolder PropertyEditPanel DateAxisPropertyEditPanel StandardLegend LegendItem FigureLegend VerticalNumberAxis VerticalXYDataPlot DataPlot SampleXYDataset SampleXYDatasetThread Non-extreme Distribution property VerticalPlot FitAlgorithm
  54. StackedVerticalBarRenderer LineAndShapeRenderer K-MEDOIDVARIANT ArrayHolder Plot LinearPlotFitAlgorithm LinearPlotFit HorizontalNumberAxis NumberAxisPropertyEditPanel PlotFitAlgorithm ChartFrame GraphicsHolder PropertyEditPanel DateAxisPropertyEditPanel StandardLegend LegendItem FigureLegend VerticalNumberAxis VerticalXYDataPlot DataPlot SampleXYDataset SampleXYDatasetThread Non-extreme Distribution property VerticalPlot FitAlgorithm
  55. StackedVerticalBarRenderer LineAndShapeRenderer K-MEDOIDVARIANT ArrayHolder Plot LinearPlotFitAlgorithm LinearPlotFit HorizontalNumberAxis NumberAxisPropertyEditPanel PlotFitAlgorithm ChartFrame GraphicsHolder PropertyEditPanel DateAxisPropertyEditPanel StandardLegend LegendItem FigureLegend VerticalNumberAxis VerticalXYDataPlot DataPlot SampleXYDataset SampleXYDatasetThread Non-extreme Distribution property VerticalPlot FitAlgorithm Extreme Extreme
  56. StackedVerticalBarRenderer LineAndShapeRenderer K-MEDOIDVARIANT ArrayHolder Plot LinearPlotFitAlgorithm LinearPlotFit HorizontalNumberAxis NumberAxisPropertyEditPanel PlotFitAlgorithm ChartFrame GraphicsHolder PropertyEditPanel DateAxisPropertyEditPanel StandardLegend LegendItem FigureLegend VerticalNumberAxis VerticalXYDataPlot DataPlot SampleXYDataset SampleXYDatasetThread Non-extreme Distribution property VerticalPlot FitAlgorithm
  57. AUTHORITATIVENESS RE SULTS Comparison of Clustering Results with an Authoritative Target Partition Ant Lucene Tomcat Azureus Hibernate ITextdotNet jEdit JFreeChart JFTP JHotDraw JRefactory JUnit LiferayPortal Pmd Synapse TigerEnvelopes Velocity Xalan Xerces 0 0.2 0.4 0.6 0.8 1.0 19 target Open Source Java Systems FLAT INDEXING (NO ZONES) = STATE OF THE ART FLAT INDEXING (NO ZONES) ZONE INDEXING ZONE INDEXING + MLE WEIGHTS
  58. AUTHORITATIVENESS RE SULTS Comparison of Clustering Results with an Authoritative Target Partition Ant Lucene Tomcat Azureus Hibernate ITextdotNet jEdit JFreeChart JFTP JHotDraw JRefactory JUnit LiferayPortal Pmd Synapse TigerEnvelopes Velocity Xalan Xerces 0 0.2 0.4 0.6 0.8 1.0 19 target Open Source Java Systems FLAT INDEXING (NO ZONES) = STATE OF THE ART FLAT INDEXING (NO ZONES) ZONE INDEXING ZONE INDEXING + MLE WEIGHTS
  59. AUTHORITATIVENESS RE SULTS Comparison of Clustering Results with an Authoritative Target Partition Ant Lucene Tomcat Azureus Hibernate ITextdotNet jEdit JFreeChart JFTP JHotDraw JRefactory JUnit LiferayPortal Pmd Synapse TigerEnvelopes Velocity Xalan Xerces 0 0.2 0.4 0.6 0.8 1.0 19 target Open Source Java Systems FLAT INDEXING (NO ZONES) = STATE OF THE ART FLAT INDEXING (NO ZONES) ZONE INDEXING ZONE INDEXING + MLE WEIGHTS
  60. AUTHORITATIVENESS RE SULTS Comparison of Clustering Results with an Authoritative Target Partition 19 target Open Source Java Systems FLAT INDEXING (NO ZONES) = STATE OF THE ART 0 0.2 0.4 0.6 0.8 1.0 Ant Lucene Tomcat Azureus Hibernate ITextdotNet jEdit JFreeChart JFTP JHotDraw JRefactory JUnit LiferayPortal Pmd Synapse TigerEnvelopes Velocity Xalan Xerces FLAT INDEXING (NO ZONES) ZONE INDEXING ZONE INDEXING + MLE WEIGHTS
  61. ISSUE #2 SOURCE CODE NORMALIZATION
  62. SOURCECODE LEXICALINFORMATION Implicit assumption: The “same” words are used whenever a particular concept occurs
  63. SOURCECODE LEXICALINFORMATION Implicit assumption: The “same” words are used whenever a particular concept occurs
  64. 1. Tokenization 2. Normalization 1.5 Identifier Splitting • Splitting algorithms based on naming conventions • camelCase Splitter: r’(?<=!^)([A-Z][a-z]+)’ • NullHandle ==> Null | Handle • displayBox ==> display | Box draw, the, are, box, r, rectangl, g, graphic, color, .... STATEOFTHEARTTOOLS display, box null, handl
  65. 1. Tokenization 2. Normalization 1.5 Identifier Splitting • Splitting algorithms based on naming conventions • camelCase Splitter: r’(?<=!^)([A-Z][a-z]+)’ • NullHandle ==> Null | Handle • displayBox ==> display | Box draw, the, are, box, r, rectangl, g, graphic, color, .... STATEOFTHEARTTOOLS display, box null, handl
  66. 1. Tokenization 2. Normalization 1.5 Identifier Splitting • Splitting algorithms based on naming conventions • camelCase Splitter: r’(?<=!^)([A-Z][a-z]+)’ • NullHandle ==> Null | Handle draw, the, are, box, r, rectangl, g, graphic, color, .... STATEOFTHEARTTOOLS display, box null, handl
  67. 1. Tokenization 2. Normalization 1.5 Identifier Splitting • Splitting algorithms based on naming conventions • camelCase Splitter: r’(?<=!^)([A-Z][a-z]+)’ • NullHandle ==> Null | Handle • displayBox ==> display | Box draw, the, are, box, r, rectangl, g, graphic, color, .... STATEOFTHEARTTOOLS null, handl display, box
  68. • camelCase Splitter: r’(?<=!^)([A-Z][a-z]+)’ IDENTIFIERSSPLITTING
  69. • camelCase Splitter: r’(?<=!^)([A-Z][a-z]+)’ • drawXORRect ==> drawXOR | Rect IDENTIFIERSSPLITTING
  70. • camelCase Splitter: r’(?<=!^)([A-Z][a-z]+)’ • drawXORRect ==> drawXOR | Rect • drawxorrect ==> NO SPLIT Splitting algorithms based on naming conventions are not robust enough IDENTIFIERSSPLITTING
  71. Splitting algorithms based on naming conventions are not robust enough ABBREVIATIONSEXPANSION
  72. Splitting algorithms based on naming conventions are not robust enough • Heavy use of Abbreviations in the source code • r as for Rectangle • rect as for Rectangle ABBREVIATIONSEXPANSION
  73. Splitting algorithms based on naming conventions are not robust enough • Heavy use of Abbreviations in the source code • r as for Rectangle • rect as for Rectangle ABBREVIATIONSEXPANSION
  74. 1. Tokenization CODENORMALIZATION 2. Normalization 1.5 Code Normalization • AMAP (Hill and Pollock, 2008) • SAMURAI (Enslen, et.al , 2011) • TIDIER (Guerrouj, et.al , 2011) • GenTest+Normalize (Lawrie and Binkley, 2011) • LINSEN (Corazza, Di Martino and Maggio, 2012) draw, the, are, null, handl, box, rectangl, graphic, color, display, box, rectangl, draw, xor, rectangl
  75. 1. Tokenization CODENORMALIZATION 2. Normalization 1.5 Code Normalization • AMAP (Hill and Pollock, 2008) • SAMURAI (Enslen, et.al , 2011) • TIDIER (Guerrouj, et.al , 2011) • GenTest+Normalize (Lawrie and Binkley, 2011) • LINSEN (Corazza, Di Martino and Maggio, 2012) draw, the, are, null, handl, box, rectangl, graphic, color, display, box, rectangl, draw,xor, rectangl
  76. CONTRI BUTION SOURCE CODE NORMALIZATION • LINSEN: novel technique for code normalization • Able to both Split Identifiers and Expand possible occurring abbreviations Corazza, A., Di Martino, S., Maggio, V. LINSEN: An efficient approach to split identifiers and expand abbreviations (2012) IEEE International Conference on Software Maintenance, ICSM, art. no. 6405277, pp. 233-242. ISBN: 978-146732312-3
  77. CONTRI BUTION SOURCE CODE NORMALIZATION • LINSEN: novel technique for code normalization • Able to both Split Identifiers and Expand possible occurring abbreviations • Based on an efficient String Matching technique: Baeza-Yates&Perlberg Algorithm (BYP) • Exploit different Sources of Information to find the matching words Corazza, A., Di Martino, S., Maggio, V. LINSEN: An efficient approach to split identifiers and expand abbreviations (2012) IEEE International Conference on Software Maintenance, ICSM, art. no. 6405277, pp. 233-242. ISBN: 978-146732312-3
  78. SKETCHOFTHEALGORITHM • NODES correspond to characters of the current identifier Model: Weighted Directed Graph Example: drawXORRect identifier 0 11 1 9 4 5 7 8 2 63 10
  79. SKETCHOFTHEALGORITHM • ARCS corresponds to matchings between identifier substrings and dictionary words • NODES correspond to characters of the current identifier Model: Weighted Directed Graph Example: drawXORRect identifier 0 11 1 9 4 5 7 8 2 63 10
  80. SKETCHOFTHEALGORITHM • ARCS corresponds to matchings between identifier substrings and dictionary words • Padding Arcs to ensure the Graph always connected • NODES correspond to characters of the current identifier Model: Weighted Directed Graph Example: drawXORRect identifier 0 11 1 9 4 “r”,C-MAX “d”,C-MAX 5 7 8 2 “R”,C-MAX 6 “a”,C-MAX 3 “w”,C-MAX “X”,C-MAX “O”,C-MAX “e”,C-MAX 10 “c”,C-MAX“t”,C-MAX “raw”,c(“raw”) “draw”,c(“draw”) “or”,c(“or”) “xor”,c(“xor”) “rectangle”,c(“rectangle”) “R”,C-MAX
  81. SKETCHOFTHEALGORITHM • Every Arc is Labelled with the corresponding dictionary word • Weights represent the “cost” of each matching • Cost function [c(“word”)] favors longest words and domain- related information Model: Weighted Directed Graph Example: drawXORRect identifier 0 11 1 9 4 “r”,C-MAX “d”,C-MAX 5 7 8 2 “R”,C-MAX 6 “a”,C-MAX 3 “w”,C-MAX “X”,C-MAX “O”,C-MAX “e”,C-MAX 10 “c”,C-MAX“t”,C-MAX “raw”,c(“raw”) “draw”,c(“draw”) “or”,c(“or”) “xor”,c(“xor”) “rectangle”,c(“rectangle”) “R”,C-MAX
  82. SKETCHOFTHEALGORITHM • The final Mapping Solution corresponds to the sequence of labels in the path with the minimum cost (Djikstra Algorithm) Model: Weighted Directed Graph Example: drawXORRect identifier 0 11 1 9 4 “r”,C-MAX “d”,C-MAX 5 7 8 2 “R”,C-MAX 6 “a”,C-MAX 3 “w”,C-MAX “X”,C-MAX “O”,C-MAX “e”,C-MAX 10 “c”,C-MAX“t”,C-MAX “raw”,c(“raw”) “draw”,c(“draw”) “or”,c(“or”) “xor”,c(“xor”) “rectangle”,c(“rectangle”) “R”,C-MAX
  83. SPLITTING Accuracy Rates for the comparison with DTW (Madani et. al 2010) 0 0.25 0.5 0.75 1 JhotDraw 5.1 Lynx 2.8.5 DTW LINSEN DTW LINSEN DTW LINSEN RE SULTS # 1 Accuracy Rates for the comparison with GenTest (Lawrie and Binkley, 2011) 0 0.25 0.5 0.75 1 which 2.20 a2ps 4.14 GenTest LINSEN DTW O(n3)
  84. Accuracy Rates for the comparison with AMAP (Hill and Pollock, 2008) 0 0.25 0.5 0.75 1 CW DL OO AC PR SL TOTAL (Avg) AMAP LINSEN EXPANSION RE SULTS # 2 CW: Combination Words DL: Dropped Letters OO: Others AC: Acronyms PR: Prefix SL: Single Letters
  85. ISSUE #3 CLONE DETECTION
  86. PROBL EM S T A T E M E N T CLONE DETECTION
  87. PROBL EM S T A T E M E N T CLONE DETECTION Clones Textual Similarity
  88. PROBL EM S T A T E M E N T CLONE DETECTION Clones Functional Similarity
  89. PROBL EM S T A T E M E N T CLONE DETECTION Clones affect the reliability of the system! Sneaky Bug!
  90. Duplix Scorpio PMD CCFinder Dup CPD Duplix Shinobi Clone Detective Gemini iClones KClone ConQAT Deckard Clone Digger JCCD CloneDr SimScan CLICS NiCAD Simian Duploc Dude SDD STATEOFTHEARTTOOLS
  91. Duplix Scorpio PMD CCFinder Dup CPD Duplix Shinobi Clone Detective Gemini iClones KClone ConQAT Deckard Clone Digger JCCD CloneDr SimScan CLICS NiCAD Simian Duploc Dude SDD Text Based Tools: Text is compared line by line STATEOFTHEARTTOOLS
  92. Duplix Scorpio PMD CCFinder Dup CPD Duplix Shinobi Clone Detective Gemini iClones KClone ConQAT Deckard Clone Digger JCCD CloneDr SimScan CLICS NiCAD Simian Duploc Dude SDD Token Based Tools: Token sequences are compared to sequences STATEOFTHEARTTOOLS
  93. Duplix Scorpio PMD CCFinder Dup CPD Duplix Shinobi Clone Detective Gemini iClones KClone ConQAT Deckard Clone Digger JCCD CloneDr SimScan CLICS NiCAD Simian Duploc Dude SDD Syntax Based Tools: Syntax subtrees are compared to each other STATEOFTHEARTTOOLS
  94. Duplix Scorpio PMD CCFinder Dup CPD Duplix Shinobi Clone Detective Gemini iClones KClone ConQAT Deckard Clone Digger JCCD CloneDr SimScan CLICS NiCAD Simian Duploc Dude SDD Graph Based Tools: (sub) graphs are compared to each other STATEOFTHEARTTOOLS
  95. CONTRI BUTION CLONE DETECTION • Combining different sources of information to improve the effectiveness of the detection process Corazza, A., Di Martino, S., Maggio, V., Scanniello, G. A Tree Kernel based approach for clone detection (2010) IEEE International Conference on Software Maintenance, ICSM, art. no. 5609715. ISBN: 978-142448629-8 Corazza, A., Di Martino, S., Maggio, V., Moschitti, A., Passerini, A., Scanniello, G., Silvestri F. Using Machine Learning and Information Retrieval Techniques to Improve Software Maintainability (2013) Communications in Computer and Information Science, In Press
  96. CONTRI BUTION CLONE DETECTION • Combining different sources of information to improve the effectiveness of the detection process • Structural and Lexical Information Corazza, A., Di Martino, S., Maggio, V., Scanniello, G. A Tree Kernel based approach for clone detection (2010) IEEE International Conference on Software Maintenance, ICSM, art. no. 5609715. ISBN: 978-142448629-8 Corazza, A., Di Martino, S., Maggio, V., Moschitti, A., Passerini, A., Scanniello, G., Silvestri F. Using Machine Learning and Information Retrieval Techniques to Improve Software Maintainability (2013) Communications in Computer and Information Science, In Press
  97. CONTRI BUTION CLONE DETECTION • Combining different sources of information to improve the effectiveness of the detection process • Structural and Lexical Information • Application of Kernel Methods to Source Code Structures Corazza, A., Di Martino, S., Maggio, V., Scanniello, G. A Tree Kernel based approach for clone detection (2010) IEEE International Conference on Software Maintenance, ICSM, art. no. 5609715. ISBN: 978-142448629-8 Corazza, A., Di Martino, S., Maggio, V., Moschitti, A., Passerini, A., Scanniello, G., Silvestri F. Using Machine Learning and Information Retrieval Techniques to Improve Software Maintainability (2013) Communications in Computer and Information Science, In Press
  98. CONTRI BUTION CLONE DETECTION • Combining different sources of information to improve the effectiveness of the detection process • Structural and Lexical Information • Application of Kernel Methods to Source Code Structures • Typically applied in other field such as NLP or Bioinformatics Corazza, A., Di Martino, S., Maggio, V., Scanniello, G. A Tree Kernel based approach for clone detection (2010) IEEE International Conference on Software Maintenance, ICSM, art. no. 5609715. ISBN: 978-142448629-8 Corazza, A., Di Martino, S., Maggio, V., Moschitti, A., Passerini, A., Scanniello, G., Silvestri F. Using Machine Learning and Information Retrieval Techniques to Improve Software Maintainability (2013) Communications in Computer and Information Science, In Press
  99. CODE STRUCTURES KERNELSFORSTRUCTURES Computation of the dot product between (Graph) Structures K( ),
  100. CODE STRUCTURES KERNELSFORSTRUCTURES Abstract Syntax Tree (AST) Tree structure representing the syntactic structure of the different instructions of a program (function) Program Dependencies Graph (PDG) (Directed) Graph structure representing the relationship among the different statement of a program Computation of the dot product between (Graph) Structures K( ),
  101. CODE STRUCTURES KERNELSFORSTRUCTURES Abstract Syntax Tree (AST) Tree structure representing the syntactic structure of the different instructions of a program (function) Program Dependencies Graph (PDG) (Directed) Graph structure representing the relationship among the different statement of a program Computation of the dot product between (Graph) Structures K( ),
  102. CODE KERNELFORCLONES
  103. < x y = = x + x 1 y - y 1 while block while block block if > b a = = a + a 1 b - b 1 > b 0 = c 3 CODE AST KERNELFORCLONES
  104. < x y = = x + x 1 y - y 1 while block while block block if > b a = = a + a 1 b - b 1 > b 0 = c 3 CODE AST AST KERNEL KERNELFORCLONES < block while = = block = y - = x + + x 1 - y 1 < x y > b 0 = c 3 if block > b a - b 1 < block while + a 1 = b - = a +
  105. while block< x y KERNELS FOR CODE STRUCTURES: AST KERNELFEATURES
  106. while block< x y KERNELS FOR CODE STRUCTURES: AST KERNELFEATURES Instruction Class (IC) i.e., LOOP, CALL, CONDITIONAL_STATEMENT
  107. while block< x y KERNELS FOR CODE STRUCTURES: AST KERNELFEATURES Instruction Class (IC) i.e., LOOP, CALL, CONDITIONAL_STATEMENT Instruction (I) i.e., FOR, IF, WHILE, RETURN
  108. while block< x y KERNELS FOR CODE STRUCTURES: AST KERNELFEATURES Instruction Class (IC) i.e., LOOP, CALL, CONDITIONAL_STATEMENT Instruction (I) i.e., FOR, IF, WHILE, RETURN Context (C) i.e., Instruction Class of the closer statement node
  109. while block< x y KERNELS FOR CODE STRUCTURES: AST KERNELFEATURES Instruction Class (IC) i.e., LOOP, CALL, CONDITIONAL_STATEMENT Instruction (I) i.e., FOR, IF, WHILE, RETURN Context (C) i.e., Instruction Class of the closer statement node Lexemes (Ls) Lexical information gathered (recursively) from leaves
  110. while block< x y KERNELS FOR CODE STRUCTURES: AST KERNELFEATURES IC = Conditional-Expr I = Less-operator C = Loop Ls= [x,y] IC = Loop I = while-loop C = Function-Body Ls= [x, y] Instruction Class (IC) i.e., LOOP, CALL, CONDITIONAL_STATEMENT Instruction (I) i.e., FOR, IF, WHILE, RETURN Context (C) i.e., Instruction Class of the closer statement node Lexemes (Ls) Lexical information gathered (recursively) from leaves IC = Block I = while-body C = Loop Ls= [ x ]
  111. CLONE DETECTION • Comparison with another (pure) AST-based clone detector • Comparison on a system with randomly seeded clones 0 0.25 0.5 0.75 1 Precision Recall F-measure CloneDigger Tree Kernel Tool RE SULTS Results refer to clones where code fragments have been modified by adding/ removing or changing code statements
  112. CONCLUSIONS
  113. CONCLUSIONS
  114. CONCLUSIONS
  115. CONCLUSIONS
  116. CONCLUSIONS
  117. CONCLUSIONS FUTURE RESEARCH DIRECTIONS • Re-modularization: • Integrate code normalization algorithm (LINSEN) to lexical analysis • Code Normalization: • Investigate the contributions of different sources of information used for disambiguations • Clone Detection: • Combine Static (and Kernel methods) with Dynamic Analysis
  118. THANK YOU June 5th, 2013 - Ph.D. Defense Valerio Maggio Ph.D. Student, University of Naples “Federico II” valerio.maggio@unina.it

×