Data Mining for Software Engineering          Tao Xie                                     Jian PeiNorth Carolina State Uni...
Outline• Introduction• What software engineering tasks can be  helped by data mining?• What kinds of software engineering ...
Introduction• A large amount of data is produced in  software development   – Data from software repositories   – Data fro...
Examples• Data in software development   – Programming: versions of programs   – Testing: execution traces   – Deployment:...
Overviewprogramming        defect detection              testing            debugging       maintenance                   ...
Software Engineering Tasks•     Programming•     Static defect detection•     Testing•     Debugging•     Maintenance    T...
Software Categorization – Why?• SourceForge hosts 70,000+ software systems   – How can one find the software needed?   – H...
Software Categorization – What?• Organize software systems into categories   – Software systems in each category share a  ...
Version Comparison and Search• What does the current code segment look  like in previous versions?   – How have they been ...
Software Library Reuse• Issues in reusing software libraries   – Which components should I use?   – What is the right way ...
API Usage• How should an API be used correctly?   – An API may serve multiple functionalities   – Different styles of API ...
How Can Data Mining Help?• Identify characteristic usage of the library  automatically• Understand the reuse of library cl...
Software Engineering Tasks•     Programming•     Static defect detection•     Testing•     Debugging•     Maintenance    T...
Locating Matching Method Calls• Many bugs due to unmatched method calls   – E.g., fail to call free() to deallocate a data...
Inferring Errors from Source Code• A system must follow some correctness  rules   – Unfortunately, the rules are documente...
Inference in Large Systems• Execution traces     inferred properties    static  checker• Inference algorithms need to be s...
Detecting Copy-Paste and Bugs• Copy-pasted code is common in large  systems   – Code reuse• Prone to bugs   – E.g., identi...
Software Engineering Tasks•     Programming•     Static defect detection•     Testing•     Debugging•     Maintenance    T...
Inspecting Test Behavior• Automatically generated tests or field  executions lack test oracles   – Sample/summarize behavi...
Mining Object Behavior• Can we find the undocumented behavior of  classes? It may not be observed from  program source cod...
Mining Specifications• Specifications are very useful for testing   – test generation + test oracle• Major obstacle: proto...
Specification Helps                                           SpecificationDoes the above code follow the correct socket A...
Software Engineering Tasks•     Programming•     Static defect detection•     Testing•     Debugging•     Maintenance    T...
Fault Localization• Running tests produces execution traces   – Some tests fail and the other tests pass• Given many execu...
Analyzing Bug Repositories• Most open source software development projects  have bug repositories    – Report and track pr...
Stabilizing Buggy Applications• Users may report bugs in a program, can  those bug reports be used to prevent the  program...
Software Engineering Tasks•     Programming•     Static defect detection•     Testing•     Debugging•     Maintenance    T...
Guiding Software Changes• Programmers start changing some locations   – Suggest locations that other programmers have     ...
Aspect Mining• Discover crosscutting concerns that can be  potentially turned into one place (an aspect  in aspect-oriente...
Software Engineering Data•     Static code bases•     Software change history•     Profiled program states•     Profiled s...
Code Entities• Identifiers within a system [Kawaguchi et al. 04]    – E.g., variable names, function names• Statement sequ...
Relationships btw Code Entities• Membership relationships   – A class contains membership functions• Reuse relationships  ...
Software Engineering Data•     Static code bases•     Software change history•     Profiled program states•     Profiled s...
Concurrent Versions System (CVS)Comments                                          [Chen et al. 01] http://cvssearch.source...
CVS Comments                                                          RCS files:/repository/file.h,v• cvs log – displays  ...
Code Version Histories• CVS provides file versioning   – Group individual per-file changes into individual     transaction...
Software Engineering Data•     Static code bases•     Software change history•     Profiled program states•     Profiled s...
Method-Entry/Exit States• State of an object   – Values of transitively reachable fields• Method-entry state   – Receiver-...
Other Profiled Program States• Values of variables at certain code locations  [Hangal&Lam 02]   – Object/static field read...
Software Engineering Data•     Static code bases•     Software change history•     Profiled program states•     Profiled s...
Executed Structural Entities• Executed branches/paths, def-use pairs• Executed function/method calls   – Group methods inv...
Software Engineering Data•     Static code bases•     Software change history•     Profiled program states•     Profiled s...
Processing Bug Reports                  User                                     Triager                      Developer Bu...
Sample Bugzilla Bug Report• Bug report image• Overlay the triage questions       Assigned To: ?                           ...
Eclipse Bug Data                                                          • Defect counts are listed                      ...
Data Mining Techniques in SE•     Association rules and frequent patterns•     Classification•     Clustering•     Misc.  ...
Frequent Itemsets• Itemset: a set of items    – E.g., acm={a, c, m}                                  Transaction database ...
Association Rules• (Time∈{Fri, Sat}) ∧ buy(X, diaper)                      buy(X,  beer)   – Dads taking care of babies in...
A Road Map• Boolean vs. quantitative associations   – buys(x, “SQLServer”) ^ buys(x, “DMBook”)     buys(x, “DM Software”) ...
Frequent Pattern Mining Methods• Apriori and its variations/improvements• Mining frequent-patterns without candidate  gene...
A Simple Case• Finding highly correlated method call pairs• Confidence of pairs help   – Conf(<a,b>)=support(<a,b>)/suppor...
Conflicting Patterns• 999 out of 1000 times spin_unlock  follows spin_lock   – The single time that spin_unlock does not  ...
Frequent Library Reuse Patterns• Items: classes, member functions, reuse relationships  (e.g., inheritance, overriding, in...
MAPO: Mining Frequent API Patterns                                                          [Xie&Pei 06]T. Xie and J. Pei:...
Sequential Pattern Mining in MAPO• Use BIDE [Wang&Han 04] to mine closed sequential  patterns from the preprocessed method...
Detecting Copy-Paste Code• Apply closed sequential pattern mining techniques• Customizing the techniques   – A copy-paste ...
Find Bugs in Copy-Pasted Segments• For two copy-pasted segments, are the  modifications consistent?   – Identifier a in se...
Approximate Patterns for Inferences• Use an alternating template to find  interesting properties   – Example: template – (...
Context Handling                                                          Figure from “Perracotta: mining teporal API rule...
Cross-Checking of Execution Traces• Mine association rules or sequential  patterns S      F, where S is a statement and  F...
Emerging Patterns of Traces• A method executed only in failing runs is  likely to point to the defect   – Comparing the co...
Learning Object Behavior• Extracting models   – A static analysis identifies all side-effect-free     methods in the progr...
Data Mining Techniques in SE•     Association rules and frequent patterns•     Classification•     Clustering•     Misc.  ...
Classification: A 2-step Process• Model construction: describe a set of  predetermined classes   – Training dataset: tuple...
Model Construction                                                               Classification                           ...
Model Application                                                     Classifier                         Testing          ...
Supervised vs. UnsupervisedLearning• Supervised learning (classification)   – Supervision: objects in the training data se...
GUI-Application Stabilizer• Given a program state S and an event e, predict  whether e likely results in a bug   – Positiv...
Data Mining Techniques in SE•     Association rules and frequent patterns•     Classification•     Clustering•     Misc.  ...
What Is Clustering?• Group data into clusters   – Similar to one another within the same cluster   – Dissimilar to the obj...
Categories of ClusteringApproaches (1)• Partitioning algorithms   – Partition the objects into k clusters   – Iteratively ...
Categories of ClusteringApproaches (2)• Density-based methods   – Based on connectivity and density functions   – Filter o...
K-Means: Example                                                            10                                            ...
Clustering and Categorization• Software categorization   – Partitioning software systems into categories• Categories prede...
Software Categorization - MUDABlue• Understanding source code    – Use latent semantic analysis (LSA) to find similarity  ...
Overview of MUDABlue• Extract identifiers• Create identifier-by-software matrix• Remove useless identifiers• Apply LSA, an...
Data Mining Techniques in SE•     Association rules and frequent patterns•     Classification•     Clustering•     Misc.  ...
Searching Source Code/Comments• CVSSearch: searching using CVS  comments• Comments are often more stable than code  segmen...
Jungloid Mining• Given a query describing the input and output  types, synthesize code fragments automatically• Prospector...
Parsing a JAVA source code file in anFinding Jungloids                                         IFile object using the Ecli...
Sampling Programs• During the execution of a program, each  execution of a statement takes a probability  to be sampled   ...
Outline• Introduction• What software engineering tasks can be  helped by data mining?• What kinds of software engineering ...
Case Studies• MAPO: mining API usages from open source  repositories [Xie&Pei 06]   • Code bases                   sequenc...
Motivation• APIs in class libraries or frameworks are  popularly reused in software development.• An example programming t...
First Try: ClassGen Java API DocaddMethodpublic void addMethod(Method m)       Add a method to this class.       Parameter...
Second Try: Code Search EngineT. Xie and J. Pei: Data Mining for Software Engineering   86
MAPO Approach• Analyze code segments returned from code  search engines and disclose the inherent  usage patterns   – Inpu...
Sample Tool OutputInstructionList.<init>()InstructionFactory.createLoad(Type, int)InstructionList.append(Instruction)Instr...
Tool ArchitectureT. Xie and J. Pei: Data Mining for Software Engineering   89
Results A tool that integrates various components • Relevant code extractor       – download returns from code search engi...
Case Studies• MAPO: mining API usages from open source  repositories [Xie&Pei 06]   • Code bases                   sequenc...
Co-Change Pattern• Things that are frequently changed togetheroften form a pattern (a.k.a. co-change)    • E.g., co-added ...
DynaMine    revision            mine CVS                                  rank andhistory mining                          ...
Mining Patterns    revision            mine CVS                                  rank andhistory mining                   ...
Mining Method CallsFoo.java    o1.addListener()    1.12            o1.removeListener()Bar.java    1.47            o2.addLi...
Finding Pairs            o1.addListener()                                                      1 PairFoo.java    1.12     ...
Mining Method CallsFoo.java    o1.addListener()    1.12            o1.removeListener()Bar.java    1.47            o2.addLi...
Finding Patterns                Find “frequent itemsets” (with                Apriori) o.enterAlignment()                 ...
Ranking PatternsFoo.java    o1.addListener()    1.12            o1.removeListener()Bar.java            o2.addListener()   ...
Ranking PatternsFoo.java    o1.addListener()    1.12            o1.removeListener()Bar.java    1.47            o2.addListe...
Dynamic Validation    revision            mine CVS                                  rank andhistory mining                ...
Matches and Mismatches        Find and count matches and mismatches.        o.register(d)        o.deregister(d)     match...
Pattern classification                                    post-process                             v validations, e violat...
Experiments                                                since                                                          ...
Case Studies• MAPO: mining API usages from open source  repositories [Xie&Pei 06]   • Code bases                   sequenc...
Assigning a Bug• Many considerations   – who has the expertise?   – who is available?   – how quickly does this have to be...
Assigning a Bug Today                 bill@firefox.orgT. Xie and J. Pei: Data Mining for Software Engineering   Adapted fr...
Recommending assignment                  bill@firefox.com                  ted@gmail.com                  cindy-loo@whovil...
Overview of approach             Approach tuned using Eclipse and Firefox                                                 ...
Steps to the approach1. Characterize the reports2. Label the reports3. Select the reports4. Use a machine learning algorit...
Step 1: Characterizing a report• Based on two fields   – textual summary   – description• Use text categorization approach...
Step 2: Labeling a report• Must determine who really fixed it   – “Assigned-to” field is not accurate• Project-specific he...
Step 2: Labeling a report• Must determine who really fixed it    – “Assigned-to” field is not accurate• Project-specificIf...
Step 2: Labeling a report• Must determine who really fixed it   – “Assigned-to” field is not accurate                     ...
Step 2: Labeling a report• Must determine who really fixed it   – “Assigned-to” field is not accurate                     ...
Step 2: Labeling a report• Must determine who really fixed it   – “Assigned-to” field is not accurate• Project-specific he...
Step 3: Selecting the reports• Exclude those with no label• Include those of active developers          – developer profil...
Step 3: Selecting the reports4035302520                                                            3 reports / month1510 5...
Step 4: Use a ML algorithm• Supervised Algorithms   – Naïve Bayes   – C4.5   – Support Vector Machines• Unsupervised Algor...
Evaluating Recommenders            # of relevant recommenda tionsPrecision =             # of recommenda tions made       ...
Determining Possibly Relevant Developers                         Module C                         Module J                ...
Still not Straightforward (e.g., Firefox)                                                                                 ...
Precision vs. Recall              A small set of “right” developers (precision) more              important than the set o...
Overviewprogramming        defect detection              testing            debugging          maintenance                ...
Conclusions• Software development generates a large  amount of different types of data• Data mining and data analysis can ...
Challenges• Complexity in software development   – Specific data mining techniques are needed• Software development and ma...
Questions?Mining Software Engineering Data Bibliographyhttp://ase.csc.ncsu.edu/dmse/•What software engineering tasks can b...
Tao Xie, Jian Pei
Data Mining for Software Engineering

  1. Data Mining for Software Engineering
     Tao Xie, North Carolina State University, www.csc.ncsu.edu/faculty/xie, xie@csc.ncsu.edu
     Jian Pei, Simon Fraser University, www.cs.sfu.ca/~jpei, jpei@cs.sfu.ca
     An up-to-date version of this tutorial is available at http://ase.csc.ncsu.edu/dmse/dmse.pdf
     Attendees are kindly asked to download the latest version 2-3 days before the tutorial.
  2. Outline
     • Introduction
     • What software engineering tasks can be helped by data mining?
     • What kinds of software engineering data can be mined?
     • How are data mining techniques used in software engineering?
     • Case studies
     • Conclusions
     T. Xie and J. Pei: Data Mining for Software Engineering
  3. Introduction
     • A large amount of data is produced in software development
       – Data from software repositories
       – Data from program executions
     • Data mining techniques can be used to analyze software engineering data
       – Understand software artifacts or processes
       – Assist software engineering tasks
  4. Examples
     • Data in software development
       – Programming: versions of programs
       – Testing: execution traces
       – Deployment: error/bug reports
       – Reuse: open source packages
     • Software development needs data analysis
       – How should I use this class?
       – Where are the bugs?
       – How to implement a typical functionality?
  5. Overview
     • Software engineering tasks helped by data mining: programming, defect detection, testing, debugging, maintenance
     • Data mining techniques: association/patterns, classification, clustering, …
     • Software engineering data: code bases, change history, program states, structural entities, bug reports
  6. Software Engineering Tasks
     • Programming
     • Static defect detection
     • Testing
     • Debugging
     • Maintenance
  7. Software Categorization – Why?
     • SourceForge hosts 70,000+ software systems
       – How can one find the software needed?
       – How can developers collaborate effectively?
     • Why software categorization?
       – SourceForge categorizes software according to its primary function (editors, databases, etc.)
         • Software foundries – related software
       – Keep developers informed about related software
         • Learn the “best practice”
         • Promote software reuse
     [Kawaguchi et al. 04]
  8. Software Categorization – What?
     • Organize software systems into categories
       – Software systems in each category share a roughly common theme
       – A software system may belong to one or multiple categories
     • What are the categories?
       – Defined by domain experts manually
       – Discovered automatically
     • Example system: MUDABlue [Kawaguchi et al. 04]
  9. Version Comparison and Search
     • What did the current code segment look like in previous versions?
       – How has it changed across versions?
     • Using standard search tools, e.g., grep?
       – Source code may not be well documented
       – The code may be changed
     • Can we have source-code-friendly search engines?
       – E.g., www.koders.com, corp.krugle.com, demo.spars.info
  10. Software Library Reuse
      • Issues in reusing software libraries
        – Which components should I use?
        – What is the right way to use them?
        – Multiple components are often used in combination, e.g., Smalltalk’s Model/View/Controller
      • Frequent patterns help
        – Specifically, inheritance information is important
        – Example: most application classes inheriting from library class Widget tend to override its member function paint(); most application classes instantiating library class Painter and calling its member function begin() also call end()
      [Michail 99/00]
  11. API Usage
      • How should an API be used correctly?
        – An API may serve multiple functionalities
        – Different styles of API usage
      • “I know what type of object I need, but I don’t know how to write the code to get the object” [Mandelin et al. 05]
        – Can we synthesize jungloid code fragments automatically?
        – Given a simple query describing the desired code in terms of input and output types, return a code segment
      • “I know what method call I need, but I don’t know how to write code before and after this method call” [Xie & Pei 06]
  12. How Can Data Mining Help?
      • Identify characteristic usage of the library automatically
      • Understand the reuse of library classes from real-life applications instead of toy programs
      • Keep reuse patterns up to date w.r.t. the most recent version of the library and applications
      • General patterns may cover inheritance cases
  13. Software Engineering Tasks
      • Programming
      • Static defect detection
      • Testing
      • Debugging
      • Maintenance
  14. Locating Matching Method Calls
      • Many bugs are due to unmatched method calls
        – E.g., failing to call free() to deallocate a data structure
        – One-line code changes: many bugs can be fixed by changing only one line of source code
      • Problem: how to find highly correlated pairs of method calls
        – E.g., <fopen, fclose>, <malloc, free>
      [Li&Zhou 05, Livshits&Zimmermann 05, Yang et al. 06]
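The pair-finding problem on this slide can be illustrated with a simple co-occurrence count. This is a minimal sketch under stated assumptions, not the algorithm of the cited papers: each function body is reduced to the set of functions it calls (the `bodies` data is invented for illustration), and the confidence of an ordered pair (a, b) is taken to be the fraction of bodies calling a that also call b.

```python
from collections import Counter
from itertools import permutations


def pair_confidence(call_sets):
    """Return {(a, b): confidence} for every ordered pair of co-occurring calls.

    call_sets: list of sets of function names, one set per function body
    (an assumed, simplified view of the code base).
    """
    single = Counter()  # number of bodies that call a
    pair = Counter()    # number of bodies that call both a and b
    for calls in call_sets:
        single.update(calls)
        pair.update(permutations(sorted(calls), 2))
    return {(a, b): pair[(a, b)] / single[a] for (a, b) in pair}


def suspects(call_sets, a, b):
    """Indices of bodies that call a but never call b (candidate mismatches)."""
    return [i for i, calls in enumerate(call_sets) if a in calls and b not in calls]


# Hypothetical input: four function bodies and the calls they contain.
bodies = [
    {"fopen", "fclose"},
    {"fopen", "fclose", "malloc", "free"},
    {"malloc", "free"},
    {"malloc"},  # calls malloc without free
]
conf = pair_confidence(bodies)
print(conf[("fopen", "fclose")])          # 1.0
print(suspects(bodies, "malloc", "free"))  # [3]
```

A body that breaks a high-confidence pair is only a candidate for inspection; the missing call may legitimately live in a caller or a helper.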
  15. Inferring Errors from Source Code
      • A system must follow some correctness rules
        – Unfortunately, the rules are documented or specified in an ad hoc manner
      • Deriving the rules requires a lot of a priori knowledge
      • Can data mining detect some errors without knowing the rules?
      [Engler et al. 01]
  16. Inference in Large Systems
      • Execution traces → inferred properties → static checker
      • Inference algorithms need to scale with the size of the programs and the input traces
      • Only imperfect traces are available in industrial environments; how can they be used?
      • Many inferred properties may be uninteresting; it is hard for a developer to review those properties thoroughly for large programs
      [Yang et al. 06]
  17. Detecting Copy-Paste and Bugs
      • Copy-pasted code is common in large systems
        – Code reuse
      • Prone to bugs
        – E.g., identifiers are not changed consistently
      • How to detect copy-paste code?
        – How to scale up to large software?
        – How to handle minor modifications?
      [Li et al. 04]
  18. Software Engineering Tasks
      • Programming
      • Static defect detection
      • Testing
      • Debugging
      • Maintenance
  19. Inspecting Test Behavior
      • Automatically generated tests or field executions lack test oracles
        – Sample/summarize behavior for inspection
      • Examples:
        – Select tests (executions/outputs) for inspection
          • E.g., clustering path/branch profiles [Podgurski et al. 01, Bowring et al. 04]
        – Summarize object behavior [Xie&Notkin 04, Dallmeier et al. 06]
  20. Mining Object Behavior
      • Can we find the undocumented behavior of classes? It may not be observable directly from program source code
      • Figure: behavior model for the Java Vector class, from “Mining object behavior with ADABU” [Dallmeier et al. WODA 06]
  21. Mining Specifications
      • Specifications are very useful for testing
        – test generation + test oracle
      • Major obstacle: protocol specifications are often unavailable
        – Example: what is the right way to use the socket API?
      • How can data mining help?
        – If a protocol holds in well-tested programs (i.e., their executions), the protocol is likely valid
      [Ammons et al. 02]
  22. Specification Helps
      • [Figure: Specification]
      • Does the above code follow the correct socket API protocol? [Ammons et al. 02]
  23. Software Engineering Tasks
      • Programming
      • Static defect detection
      • Testing
      • Debugging
      • Maintenance
  24. Fault Localization
      • Running tests produces execution traces
        – Some tests fail and the other tests pass
      • Given many execution traces generated by tests, can we suggest likely faulty statements? [Liblit et al. 03/05, Liu et al. 05]
        – Some traces may lead to program failures
        – Better still, can we estimate how likely each statement is to be faulty?
      • For large programs, how can we collect traces effectively?
      • What if there are multiple faults?
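One way to make "how likely is a statement to be faulty" concrete is a coverage-based score. The sketch below uses a Tarantula-style ratio, which is a swapped-in, well-known heuristic and not the statistical method of the papers cited on this slide. The run data and statement numbers are invented for illustration.

```python
def suspiciousness(runs):
    """Score each statement by how strongly it is tied to failing runs.

    runs: list of (covered_statements, passed) pairs, where covered_statements
    is a set of statement ids and passed is a bool (assumed input format).
    Score = fail_ratio / (fail_ratio + pass_ratio), a Tarantula-style ratio.
    """
    total_pass = sum(1 for _, ok in runs if ok)
    total_fail = len(runs) - total_pass
    statements = set().union(*(cov for cov, _ in runs))
    scores = {}
    for s in statements:
        passed = sum(1 for cov, ok in runs if ok and s in cov)
        failed = sum(1 for cov, ok in runs if not ok and s in cov)
        fail_ratio = failed / total_fail if total_fail else 0.0
        pass_ratio = passed / total_pass if total_pass else 0.0
        denom = fail_ratio + pass_ratio
        scores[s] = fail_ratio / denom if denom else 0.0
    return scores


# Hypothetical traces: two passing runs and one failing run.
runs = [
    ({1, 2, 3}, True),
    ({1, 2, 4}, True),
    ({1, 3, 5}, False),
]
scores = suspiciousness(runs)
ranking = sorted(scores, key=lambda s: (-scores[s], s))
print(ranking)  # statement 5 ranks first: it is executed only by the failing run
```

With several faults present, a single ranking like this can mix evidence from different failures, which is the difficulty raised by the last bullet of the slide.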
  25. Analyzing Bug Repositories
      • Most open source software development projects have bug repositories
        – Report and track problems and potential enhancements
        – Valuable information for both developers and users
      • Bug repositories are often messy
        – Duplicate error reports; related errors
      • Challenge: how to analyze effectively?
        – Who is reporting, and at what rate?
        – How are reports resolved, and by whom?
      • Automatic bug report assignment & duplicate detection [Anvik et al. 06]
  26. Stabilizing Buggy Applications
      • Users may report bugs in a program. Can those bug reports be used to prevent the program from crashing?
        – When a user attempts an action that led to some errors before, a warning should be issued
      • Given a program state S and an event e, predict whether e likely results in a bug
        – Positive samples: past bugs
        – Negative samples: “not bug” reports
      [Michail&Xie 05]
  27. Software Engineering Tasks
      • Programming
      • Static defect detection
      • Testing
      • Debugging
      • Maintenance
  28. Guiding Software Changes
      • Programmers start changing some locations
        – Suggest locations that other programmers have changed together with this location
        – E.g., “Programmers who changed this function also changed …”
      • Mine association rules from change histories
        – Coarse-grained entities: directories, modules, files
        – Fine-grained entities: methods, variables, sections
      [Zimmermann et al. 04, Ying et al. 04]
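The "programmers who changed this also changed …" idea can be sketched as pairwise association rules over change transactions. This is an illustrative simplification, not the cited tools' implementation: the file names, the `min_support` and `min_conf` thresholds, and the restriction to single-entity rules are all assumptions.

```python
from collections import Counter
from itertools import combinations


def co_change_suggestions(transactions, changed, min_support=2, min_conf=0.5):
    """Suggest entities that were frequently changed together with `changed`.

    transactions: list of sets of entity names changed in one commit.
    Returns (entity, support, confidence) tuples for rules changed -> entity,
    sorted by descending confidence.
    """
    single = Counter()
    pair = Counter()
    for t in transactions:
        single.update(t)
        pair.update(frozenset(p) for p in combinations(sorted(t), 2))
    if single[changed] == 0:
        return []  # no history for this entity, nothing to suggest
    suggestions = []
    for other in single:
        if other == changed:
            continue
        both = pair[frozenset((changed, other))]
        confidence = both / single[changed]
        if both >= min_support and confidence >= min_conf:
            suggestions.append((other, both, confidence))
    return sorted(suggestions, key=lambda s: (-s[2], s[0]))


# Hypothetical change history at file granularity.
history = [
    {"a.c", "a.h"},
    {"a.c", "a.h", "b.c"},
    {"a.c", "doc.txt"},
    {"b.c"},
]
result = co_change_suggestions(history, "a.c")
print(result)  # a.h is suggested: changed with a.c in 2 of a.c's 3 transactions
```

The same counting works at method or variable granularity once each transaction lists fine-grained entities instead of files.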
  29. Aspect Mining
      • Discover crosscutting concerns that could be consolidated in one place (an aspect in aspect-oriented programs)
        – E.g., logging, timing, communication
      • Mine recurring execution patterns
        – Event traces [Breu&Krinke 04, Tonella&Ceccato 04]
        – Source code [Shepherd et al. 05]
  30. Software Engineering Data
      • Static code bases
      • Software change history
      • Profiled program states
      • Profiled structural entities
      • Bug reports
  31. Code Entities
      • Identifiers within a system [Kawaguchi et al. 04]
        – E.g., variable names, function names
      • Statement sequence within a basic block [Li et al. 04]
        – E.g., variables, operators, constants, functions, keywords
      • Element set within a function [Li&Zhou 05]
        – E.g., functions, variables, data types
      • Call sites within a function [Xie&Pei 05]
      • API signatures [Mandelin et al. 05]
      [Mandelin et al. 05] http://snobol.cs.berkeley.edu/prospector/index.jsp
  32. Relationships between Code Entities
      • Membership relationships
        – A class contains member functions
      • Reuse relationships
        – Class inheritance
        – Class instantiation
        – Function invocations
        – Function overriding
      [Michail 99/00] http://codeweb.sourceforge.net/ for C++
  33. Software Engineering Data
      • Static code bases
      • Software change history
      • Profiled program states
      • Profiled structural entities
      • Bug reports
  34. Concurrent Versions System (CVS) Comments
      [Chen et al. 01] http://cvssearch.sourceforge.net/
  35. CVS Comments
      • cvs log – displays all revisions of each file together with their comments
          RCS file: /repository/file.h,v
          Working file: file.h
          head: 1.5
          ...
          description:
          ----------------------------
          Revision 1.5
          Date: ...
          cvs comment ...
          ----------------------------
          ...
      • cvs diff – shows differences between different versions of a file
          RCS file: /repository/file.h,v
          …
          9c9,10
          < old line
          ---
          > new line
          > another new line
      [Chen et al. 01] http://cvssearch.sourceforge.net/
  36. Code Version Histories
      • CVS provides file versioning
        – Group per-file changes into transactions (atomic change sets): changes checked in by the same author, with the same check-in comment, close in time
      • CVS manages only files and line numbers
        – Associate syntactic entities with line ranges
      • Filter out long transactions not corresponding to meaningful atomic changes
        – E.g., feature requests, bug fixes, branch merging
      [Ying et al. 04]
      [Zimmermann et al. 04] http://www.st.cs.uni-sb.de/softevo/erose/
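The grouping rule on this slide (same author, same comment, close in time) can be sketched as a sliding time window. The slide does not give numbers, so the 200-second `window` and the `max_files` cutoff for overly long transactions are assumed values, and the `Change` record is a hypothetical stand-in for parsed `cvs log` output.

```python
from dataclasses import dataclass


@dataclass
class Change:
    file: str
    author: str
    comment: str
    timestamp: int  # seconds since some epoch


def group_transactions(changes, window=200, max_files=30):
    """Group per-file check-ins into transactions (atomic change sets).

    A change joins the latest open group with the same (author, comment) if it
    follows that group's previous change within `window` seconds (sliding
    window). Groups with more than `max_files` changes are filtered out.
    """
    groups = []
    latest = {}  # (author, comment) -> most recent group for that key
    for ch in sorted(changes, key=lambda c: c.timestamp):
        key = (ch.author, ch.comment)
        group = latest.get(key)
        if group is not None and ch.timestamp - group[-1].timestamp <= window:
            group.append(ch)
        else:
            group = [ch]
            groups.append(group)
            latest[key] = group
    return [g for g in groups if len(g) <= max_files]


# Hypothetical check-ins.
log = [
    Change("a.c", "ann", "fix bug", 100),
    Change("a.h", "ann", "fix bug", 150),
    Change("b.c", "bob", "add feature", 160),
    Change("a.c", "ann", "fix bug", 1000),
]
transactions = group_transactions(log)
print([[c.file for c in t] for t in transactions])
# [['a.c', 'a.h'], ['b.c'], ['a.c']]
```

A sliding window compares each change with the previous one in the group, so a long check-in stays in one transaction even if it spans more than `window` seconds overall.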
  37. Software Engineering Data
      • Static code bases
      • Software change history
      • Profiled program states
      • Profiled structural entities
      • Bug reports
  38. Method-Entry/Exit States
      • State of an object
        – Values of transitively reachable fields
      • Method-entry state
        – Receiver-object state, method argument values
      • Method-exit state
        – Receiver-object state, updated method argument values, method return value
      [Ernst et al. 02] http://pag.csail.mit.edu/daikon/
      [Dallmeier et al. 06] http://www.st.cs.uni-sb.de/models/
      [Henkel&Diwan 03] [Xie&Notkin 04/05]
  39. Other Profiled Program States
      • Values of variables at certain code locations [Hangal&Lam 02]
        – Object/static field read/write
        – Method-call arguments
        – Method returns
      • Sampled predicates on values of variables [Liblit et al. 03/05]
      [Hangal&Lam 02] http://diduce.sourceforge.net/
      [Liblit et al. 03/05] http://www.cs.wisc.edu/cbi/
  40. Software Engineering Data
      • Static code bases
      • Software change history
      • Profiled program states
      • Profiled structural entities
      • Bug reports
  41. Executed Structural Entities
      • Executed branches/paths, def-use pairs
      • Executed function/method calls
        – Group methods invoked on the same object
      • Profiling options
        – Execution hit vs. count
        – Execution order (sequences)
      [Dallmeier et al. 05] http://www.st.cs.uni-sb.de/ample/
      More related tools: http://www.csc.ncsu.edu/faculty/xie/research.htm#related
  42. Software Engineering Data
      • Static code bases
      • Software change history
      • Profiled program states
      • Profiled structural entities
      • Bug reports
  43. Processing Bug Reports
      • Roles: User, Triager, Developer
      • Bug report → Bug repository
      • Resolutions: Duplicate, Works For Me, Invalid, Won’t Fix
      Adapted from Anvik et al.’s slides
  44. Sample Bugzilla Bug Report
      • Bug report image
      • Overlay the triage questions
        – Assigned To: ?
        – Assignment? Duplicate? Reproducible?
      Bugzilla: open source bug tracking tool http://www.bugzilla.org/
      [Anvik et al. 06] http://www.cs.ubc.ca/labs/spl/projects/bugTriage.html
      Adapted from Anvik et al.’s slides
  45. 45. Eclipse Bug Data • Defect counts are listed as count at the plug-in, package, and compilation-unit levels. • The value field contains the actual number of pre-release ("pre") and post-release ("post") defects. • The average ("avg") and maximum ("max") values refer to the defects found in the compilation units ("compilationunits"). [Schröter et al. 06] http://www.st.cs.uni-sb.de/softevo/bug-data/eclipse/
  46. 46. Data Mining Techniques in SE • Association rules and frequent patterns • Classification • Clustering • Misc.
  47. 47. Frequent Itemsets • Itemset: a set of items – E.g., acm = {a, c, m} • Support of itemsets – Sup(acm) = 3 • Given min_sup = 3, acm is a frequent pattern • Frequent pattern mining: find all frequent patterns in a database
      Transaction database TDB
      TID   Items bought
      100   f, a, c, d, g, i, m, p
      200   a, b, c, f, l, m, o
      300   b, f, h, j, o
      400   b, c, k, s, p
      500   a, f, c, e, l, p, m, n
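As a minimal sketch of the support computation on this slide, the snippet below counts how many transactions of TDB contain every item of an itemset. The function names `support` and `is_frequent` are illustrative and do not come from the slides.

```python
# Transaction database copied from the slide (TID -> items bought).
TDB = {
    100: set("facdgimp"),
    200: set("abcflmo"),
    300: set("bfhjo"),
    400: set("bcksp"),
    500: set("afcelpmn"),
}

def support(itemset, tdb):
    """Count the transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for bought in tdb.values() if items <= bought)

def is_frequent(itemset, tdb, min_sup):
    """An itemset is frequent when its support reaches min_sup."""
    return support(itemset, tdb) >= min_sup

print(support("acm", TDB))         # transactions 100, 200 and 500 contain a, c, m
print(is_frequent("acm", TDB, 3))
```

A full miner such as Apriori would enumerate candidate itemsets level by level; this sketch only checks one given itemset.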
  48. 48. Association Rules • (Time ∈ {Fri, Sat}) ∧ buy(X, diaper) ⇒ buy(X, beer) – Dads taking care of babies on weekends drink beer • Itemsets should be frequent – The rule can be applied extensively • Rules should be confident – With strong prediction capability
  49. 49. A Road Map • Boolean vs. quantitative associations – buys(x, “SQLServer”) ∧ buys(x, “DMBook”) ⇒ buys(x, “DM Software”) [0.2%, 60%] – age(x, “30..39”) ∧ income(x, “42..48K”) ⇒ buys(x, “PC”) [1%, 75%] • Single-dimensional vs. multi-dimensional associations • Single-level vs. multi-level analysis – What brands of beers are associated with what brands of diapers?
  50. 50. Frequent Pattern Mining Methods • Apriori and its variations/improvements • Mining frequent patterns without candidate generation • Mining max-patterns and closed itemsets • Mining multi-dimensional, multi-level frequent patterns with flexible support constraints • Interestingness: correlation and causality
  51. 51. A Simple Case • Finding highly correlated method call pairs • The confidence of pairs helps – Conf(<a,b>) = support(<a,b>) / support(<a,a>) • Check the revisions (fixes to bugs) and find the pairs of method calls whose confidence is improved dramatically by frequently added fixes – Those are the matching method call pairs that programmers may often violate [Livshits&Zimmermann 05]
  52. 52. Conflicting Patterns • 999 out of 1000 times spin_unlock follows spin_lock – The single time that spin_unlock does not follow is likely an error • We can detect an error without knowing the correctness rule [Li&Zhou 05, Livshits&Zimmermann 05, Yang et al. 06]
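The idea can be sketched as follows, assuming each function body has already been reduced to a list of call names. The corpus, the helper `follow_stats`, and the 0.99 threshold are hypothetical and only mirror the 999-out-of-1000 example.

```python
def follow_stats(functions, first, second):
    """Return (followed, violations) for the rule "`second` follows `first`".

    followed   -- number of functions where `second` occurs after `first`
    violations -- names of functions that call `first` but never `second` afterwards
    """
    followed, violations = 0, []
    for name, calls in functions.items():
        if first not in calls:
            continue
        after = calls[calls.index(first) + 1:]
        if second in after:
            followed += 1
        else:
            violations.append(name)
    return followed, violations

# Hypothetical corpus: 999 functions pair the calls, one does not.
corpus = {f"f{i}": ["spin_lock", "work", "spin_unlock"] for i in range(999)}
corpus["suspect"] = ["spin_lock", "work"]

followed, violations = follow_stats(corpus, "spin_lock", "spin_unlock")
confidence = followed / (followed + len(violations))
if confidence >= 0.99:  # assumed threshold: the rule almost always holds
    print("likely errors:", violations)
```

The rule itself is never given to the checker; it is inferred from how often the pairing holds, and the rare exceptions are reported.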
  53. 53. Frequent Library Reuse Patterns • Items: classes, member functions, reuse relationships (e.g., inheritance, overriding, instantiation) • Transactions: for every application class A, the set of all items that are involved in a reuse relationship with A • Pruning – Uninteresting rules, e.g., a rule that holds for every class – Misleading rules, e.g., xy ⇒ z (conf: 60%) is pruned if y ⇒ z (conf: 80%) – Statistically insignificant rules: prune rules with a high p-value • Constrained rules – Rules involving a particular class – Rules that are violated in a particular application [Michail 99/00]
  54. 54. MAPO: Mining Frequent API Patterns [Xie&Pei 06]
  55. 55. Sequential Pattern Mining in MAPO • Use BIDE [Wang&Han 04] to mine closed sequential patterns from the preprocessed method-call sequences • Postprocessing in MAPO – Remove frequent sequences that do not contain the entities interesting to the user – Compress consecutive calls of the same method into one – Remove duplicate frequent sequences after the compression – Remove frequent sequences that are subsequences of some other frequent sequences [Xie&Pei 06]
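The four postprocessing steps can be sketched as below. This is not MAPO's code; the function names and the toy input are assumptions, and the mined sequences are taken as already produced by a sequential pattern miner.

```python
def compress(seq):
    """Collapse consecutive calls of the same method into one."""
    out = []
    for call in seq:
        if not out or out[-1] != call:
            out.append(call)
    return tuple(out)

def is_subsequence(short, long):
    """True if `short` can be obtained from `long` by deleting elements."""
    it = iter(long)
    return all(call in it for call in short)

def postprocess(sequences, interesting):
    # 1. Keep only sequences that mention an entity the user cares about.
    kept = [s for s in sequences if any(call in interesting for call in s)]
    # 2. Compress repeated calls; 3. remove duplicates, preserving order.
    unique = list(dict.fromkeys(compress(s) for s in kept))
    # 4. Drop sequences that are subsequences of another remaining sequence.
    return [s for s in unique
            if not any(s != t and is_subsequence(s, t) for t in unique)]

# Hypothetical mined sequences.
mined = [
    ("new", "append", "append", "addMethod"),
    ("new", "append", "addMethod"),
    ("append", "addMethod"),
    ("open", "close"),
]
result = postprocess(mined, interesting={"addMethod"})
print(result)
```

On the toy input only one pattern survives, because the first two sequences become identical after compression and the third is a subsequence of them.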
  56. 56. Detecting Copy-Paste Code • Apply closed sequential pattern mining techniques • Customizing the techniques – A copy-pasted segment typically does not have big gaps: use a maximum gap threshold to control this – Output the instances of patterns (i.e., the copy-pasted code segments) instead of the patterns – Use small copy-pasted segments to form larger ones – Prune false positives: tiny segments, unmappable segments, overlapping segments, and segments with large gaps [Li et al. 04]
  57. 57. Find Bugs in Copy-Pasted Segments • For two copy-pasted segments, are the modifications consistent? – Identifier a in segment S1 is changed to b in segment S2 3 times, but remains unchanged once – likely a bug – The heuristic may not be right all the time • The lower the unchanged rate of an identifier, the more likely there is a bug [Li et al. 04]
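A small sketch of the unchanged-rate heuristic, using the slide's own example (a renamed to b three times, left unchanged once). The input format, the function name, and the 0.5 cut-off are assumptions made for illustration.

```python
from collections import Counter

def unchanged_ratio(mapping_pairs, identifier):
    """Fraction of occurrences of `identifier` in S1 that keep the same name in S2.

    mapping_pairs -- (name in S1, name in S2) for each aligned occurrence
    """
    targets = Counter(dst for src, dst in mapping_pairs if src == identifier)
    total = sum(targets.values())
    return targets[identifier] / total if total else None

pairs = [("a", "b"), ("a", "b"), ("a", "b"), ("a", "a")]
ratio = unchanged_ratio(pairs, "a")
# A ratio of 0 (always renamed) or 1 (never renamed) is consistent;
# a small non-zero ratio suggests a forgotten rename. The 0.5 cut-off is assumed.
suspicious = ratio is not None and 0 < ratio < 0.5
print(ratio, suspicious)
```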
  58. 58. Approximate Patterns for Inferences • Use an alternating template to find interesting properties – Example: template (PS)*, an instance: loc.acq loc.rel • Handling imperfect traces – Instead of requiring perfect matches, check the ratio of matching – Explore contexts of matching [Yang et al. 06]
  59. 59. Context Handling • Figure from “Perracotta: mining temporal API rules from imperfect traces” [Yang et al. ICSE’06]
  60. 60. Cross-Checking of Execution Traces • Mine association rules or sequential patterns S ⇒ F, where S is a statement and F is the status of program failure • The higher the confidence, the more likely S is faulty or related to a fault • Using only one statement on the left side of the rule can be misleading, since a fault may be caused by a combination of statements – Frequent patterns can be used to improve this [Denmat et al. 05]
  61. 61. Emerging Patterns of Traces • A method executed only in failing runs is likely to point to the defect – Comparing the coverage of passing and failing program runs helps • Mining patterns frequent in failing program runs but infrequent in passing program runs – Sequential patterns may be used [Dallmeier et al. 05, Denmat et al. 05, Yang et al. 06]
  62. 62. Learning Object Behavior • Extracting models – A static analysis identifies all side-effect-free methods in the program – Some side-effect-free methods are selected as inspectors – The program is executed and inspectors are called to extract information about an object’s state: a vector of inspector values • Merge models of all objects in a program [Dallmeier et al. 06]
  63. 63. Data Mining Techniques in SE • Association rules and frequent patterns • Classification • Clustering • Misc.
  64. 64. Classification: A 2-step Process • Model construction: describe a set of predetermined classes – Training dataset: tuples for model construction • Each tuple/sample belongs to a predefined class – Classification rules, decision trees, or math formulae • Model application: classify unseen objects – Estimate accuracy of the model using an independent test set – Acceptable accuracy ⇒ apply the model to classify tuples with unknown class labels
  65. 65. Model Construction • Training Data → Classification Algorithms → Classifier (Model)
      Training Data
      Name   Rank        Years   Tenured
      Mike   Ass. Prof   3       No
      Mary   Ass. Prof   7       Yes
      Bill   Prof        2       Yes
      Jim    Asso. Prof  7       Yes
      Dave   Ass. Prof   6       No
      Anne   Asso. Prof  3       No
      Classifier (Model): IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
  66. 66. Model Application • Classifier applied to Testing Data and Unseen Data
      Testing Data
      Name      Rank        Years   Tenured
      Tom       Ass. Prof   2       No
      Merlisa   Asso. Prof  7       No
      George    Prof        5       Yes
      Joseph    Ass. Prof   7       Yes
      Unseen Data: (Jeff, Professor, 4) → Tenured?
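A minimal sketch of the two steps on the tenure example: the rule from the model-construction slide is applied to the testing data to estimate accuracy, then to the unseen tuple. The rank label "Prof" is used for full professors here, which is an assumption about how the table abbreviations map to the rule's ‘professor’.

```python
def predict_tenured(rank, years):
    """Rule from the model-construction slide: professor OR more than 6 years."""
    return rank == "Prof" or years > 6

# Testing data from the slide: (name, rank, years, tenured).
test_set = [
    ("Tom", "Ass. Prof", 2, False),
    ("Merlisa", "Asso. Prof", 7, False),
    ("George", "Prof", 5, True),
    ("Joseph", "Ass. Prof", 7, True),
]

correct = sum(predict_tenured(rank, years) == tenured
              for _, rank, years, tenured in test_set)
accuracy = correct / len(test_set)
print(accuracy)                    # Merlisa is the only misclassified tuple

# Unseen data: (Jeff, Professor, 4)
print(predict_tenured("Prof", 4))
```

Whether the measured accuracy is acceptable is a judgment left to the analyst; the slides do not fix a threshold.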
  67. 67. Supervised vs. Unsupervised Learning • Supervised learning (classification) – Supervision: objects in the training data set have labels – New data is classified based on the training set • Unsupervised learning (clustering) – The class labels of training data are unknown – Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
  68. 68. GUI-Application Stabilizer • Given a program state S and an event e, predict whether e likely results in a bug – Positive samples: past bugs – Negative samples: “not bug” reports • A k-NN based approach – Consider the k closest cases reported before – Compare Σ 1/d for bug cases and not-bug cases, where d is the distance between the current state and the reported states – If the current state is more similar to bugs, predict a bug [Michail&Xie 05]
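The voting step can be sketched as follows, treating d as a distance between states so that closer cases weigh more in Σ 1/d. The state encoding as bit tuples, the Hamming distance, and k = 3 are assumptions for illustration, not details taken from [Michail&Xie 05].

```python
def predict_bug(current, cases, distance, k=3):
    """Weighted k-NN vote.

    cases -- list of (state, is_bug) reported before
    Returns True when the bug neighbours outweigh the not-bug neighbours.
    """
    nearest = sorted(cases, key=lambda case: distance(current, case[0]))[:k]
    score = {True: 0.0, False: 0.0}
    for state, is_bug in nearest:
        d = distance(current, state)
        score[is_bug] += 1.0 / max(d, 1e-9)  # guard against d == 0
    return score[True] > score[False]

def hamming(a, b):
    """Number of positions where two equally long state vectors differ."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical reported cases: (state vector, was it a bug?).
cases = [
    ((1, 1, 0), True),
    ((1, 0, 0), True),
    ((0, 0, 1), False),
    ((0, 1, 1), False),
]
print(predict_bug((1, 1, 1), cases, hamming))
print(predict_bug((0, 1, 1), cases, hamming))
```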
  69. 69. Data Mining Techniques in SE • Association rules and frequent patterns • Classification • Clustering • Misc.
  70. 70. What Is Clustering? • Group data into clusters – Similar to one another within the same cluster – Dissimilar to the objects in other clusters – Unsupervised learning: no predefined classes [Figure: Cluster 1, Cluster 2, and outliers]
  71. 71. Categories of Clustering Approaches (1) • Partitioning algorithms – Partition the objects into k clusters – Iteratively reallocate objects to improve the clustering • Hierarchical algorithms – Agglomerative: each object starts as a cluster; merge clusters to form larger ones – Divisive: all objects start in one cluster; split it up into smaller clusters
  72. 72. Categories of Clustering Approaches (2) • Density-based methods – Based on connectivity and density functions – Filter out noise, find clusters of arbitrary shape • Grid-based methods – Quantize the object space into a grid structure • Model-based methods – Use a model to find the best fit of data
  73. 73. K-Means: Example (K=2) [Figure: five scatter plots on 0 to 10 axes showing the iterations] • Arbitrarily choose K objects as initial cluster centers • Assign each object to the most similar center • Update the cluster means • Reassign objects and update the cluster means again, repeating until assignments stop changing
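The loop in the figure can be sketched in plain Python as below. This is a textbook-style k-means under simple assumptions (squared Euclidean distance, a fixed random seed, points as tuples); the sample points are made up.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # arbitrarily choose K objects as centers
    clusters = []
    for _ in range(iterations):
        # Assign each object to the most similar (closest) center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])),
            )
            clusters[nearest].append(p)
        # Update the cluster means; an empty cluster keeps its old center.
        new_centers = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:   # assignments are stable, stop
            break
        centers = new_centers
    return centers, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = kmeans(points, k=2)
print(sorted(sorted(c) for c in clusters))
```

The result of k-means generally depends on the initial centers; on this well-separated toy data the two groups are recovered regardless of which two points are picked first.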
  74. 74. Clustering and Categorization • Software categorization – Partitioning software systems into categories • Categories predefined: a classification problem • Categories discovered automatically: a clustering problem
  75. 75. Software Categorization - MUDABlue • Understanding source code – Use latent semantic analysis (LSA) to find similarity between software systems – Use identifiers (e.g., variable names, function names) as features • “gtk_window” represents some window • The source code near “gtk_window” contains some GUI operation on the window • Extracting categories using frequent identifiers – “gtk_window”, “gtk_main”, and “gpointer” ⇒ GTK-related software system – Use LSA to find relationships between identifiers [Kawaguchi et al. 04]
  76. 76. Overview of MUDABlue • Extract identifiers • Create identifier-by-software matrix • Remove useless identifiers • Apply LSA, and retrieve categories • Make software clusters from identifier clusters • Title software clusters [Kawaguchi et al. 04]
  77. 77. Data Mining Techniques in SE • Association rules and frequent patterns • Classification • Clustering • Misc.
  78. 78. Searching Source Code/Comments • CVSSearch: searching using CVS comments • Comments are often more stable than code segments – Describe a segment of code – May hold for many future versions • Compare differences of successive versions – For two versions, associate a comment to the corresponding changes – Propagate changes over versions [Chen et al. 01]
  79. 79. Jungloid Mining • Given a query describing the input and output types, synthesize code fragments automatically • Prospector: using API method signatures and jungloids mined from a corpus of sample client programs • Elementary jungloids – Field access – Static method or constructor invocation – Instance method invocation – Widening reference conversion – Downcast (narrowing reference conversion) [Mandelin et al. 05]
  80. 80. Finding Jungloids • Example query: parsing a Java source code file in an IFile object using the Eclipse IDE framework • Use signatures of elementary jungloids and APIs to form a signature graph • Represent a solution as a path in the graph matching the constraints • Rank the paths by their lengths – short paths are preferred • Learn downcasts from sample programs [Mandelin et al. 05]
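The path search can be sketched with a breadth-first search, which returns a shortest path first and so matches the preference for short jungloids. The graph below is entirely hypothetical: the type names `In`, `A`, `B`, `C`, `Out` and the edge labels are placeholders, not Eclipse or Prospector APIs.

```python
from collections import deque

def shortest_jungloid(graph, source, target):
    """Breadth-first search over a signature graph.

    graph -- type name -> list of (elementary jungloid label, result type)
    Returns the shortest list of labels leading from source to target, or None.
    """
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for label, nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [label]))
    return None

# Hypothetical signature graph with two routes from In to Out.
graph = {
    "In": [("new B(In)", "B"), ("In.getA()", "A")],
    "A": [("A.toOut()", "Out")],
    "B": [("B.getC()", "C")],
    "C": [("(Out) C", "Out")],  # a downcast edge, as learned from sample programs
}
print(shortest_jungloid(graph, "In", "Out"))
```

A real tool would return several ranked candidates rather than a single path; this sketch only shows why path length is a natural ranking key.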
  81. 81. Sampling Programs • During the execution of a program, each execution of a statement is sampled with some probability – Sampling large programs becomes feasible – Many traces can be collected • Bug isolation by analyzing samples – Correlation between specific statements or function calls and program errors/crashes [Liblit et al. 03/05]
  82. 82. Outline • Introduction • What software engineering tasks can be helped by data mining? • What kinds of software engineering data can be mined? • How are data mining techniques used in software engineering? • Case studies • Conclusions
  83. 83. Case Studies • MAPO: mining API usages from open source repositories [Xie&Pei 06] – Code bases → sequence analysis → programming • DynaMine: finding common error patterns by mining software revision histories [Livshits&Zimmermann 05] – Change history → association rules → defect detection • BugTriage: Who should fix this bug? [Anvik et al. 06] – Bug reports → classification → debugging
  84. 84. Motivation • APIs in class libraries or frameworks are widely reused in software development. • An example programming task: “instrument the bytecode of a Java class by adding an extra method to the class” – org.apache.bcel.generic.ClassGen: public void addMethod(Method m)
  85. 85. First Try: ClassGen Java API Doc • addMethod: public void addMethod(Method m) – Add a method to this class. – Parameters: m - method to add
  86. 86. Second Try: Code Search Engine
  87. 87. MAPO Approach • Analyze code segments returned from code search engines and disclose the inherent usage patterns – Input: an API characterized by a method, class, or package – Code bases: open source repositories or proprietary source repositories – Output: a short list of frequent API usage patterns related to the API
  88. 88. Sample Tool Output
      InstructionList.<init>()
      InstructionFactory.createLoad(Type, int)
      InstructionList.append(Instruction)
      InstructionFactory.createReturn(Type)
      InstructionList.append(Instruction)
      MethodGen.setMaxStack()
      MethodGen.setMaxLocals()
      MethodGen.getMethod()
      ClassGen.addMethod(Method)
      InstructionList.dispose()
      • Mined from 36 Java source files, 1087 method sequences
  89. 89. Tool Architecture
  90. 90. Results • A tool that integrates various components – Relevant code extractor: downloads results returned from a code search engine (koders.com) – Code analyzer: a lightweight tool implemented for Java programs – Sequence preprocessor: employs various heuristics – Frequent sequence miner: reuses BIDE [Wang&Han ICDE 2004] – Frequent sequence postprocessor: employs various heuristics
  91. 91. Case Studies • MAPO: mining API usages from open source repositories [Xie&Pei 06] – Code bases → sequence analysis → programming • DynaMine: finding common error patterns by mining software revision histories [Livshits&Zimmermann 05] – Change history → association rules → defect detection • BugTriage: Who should fix this bug? [Anvik et al. 06] – Bug reports → classification → debugging
  92. 92. Co-Change Pattern • Things that are frequently changed together often form a pattern (a.k.a. co-change) • E.g., co-added method calls (addPartListener and removePartListener below are co-added) (Adapted from Livshits et al.’s slides)
      public void createPartControl(Composite parent) {
          ...
          // add listener for editor page activation
          getSite().getPage().addPartListener(partListener);
      }

      public void dispose() {
          ...
          getSite().getPage().removePartListener(partListener);
      }
  93. 93. DynaMine [Figure: pipeline in three stages] • Revision history mining: mine CVS histories → patterns → rank and filter • Dynamic analysis: instrument relevant method calls → run the application → post-process → usage patterns, error patterns, unlikely patterns • Reporting: report patterns, report bugs (Adapted from Livshits et al.’s slides)
  94. 94. Mining Patterns [Figure: the DynaMine pipeline (revision history mining → dynamic analysis → reporting); this part covers revision history mining: mine CVS histories → patterns → rank and filter] (Adapted from Livshits et al.’s slides)
  95. 95. Mining Method Calls • Method calls added in each check-in (file, revision): (Adapted from Livshits et al.’s slides)
      Foo.java 1.12   o1.addListener(), o1.removeListener()
      Bar.java 1.47   o2.addListener(), o2.removeListener(), System.out.println()
      Baz.java 1.23   o3.addListener(), o3.removeListener(), list.iterator(), iter.hasNext(), iter.next()
      Qux.java 1.41   o4.addListener(), System.out.println()
      Qux.java 1.42   o4.removeListener()
  96. 96. Finding Pairs • Pairs of method calls found in each check-in: (Adapted from Livshits et al.’s slides)
      Foo.java 1.12   o1.addListener(), o1.removeListener()                                                  1 pair
      Bar.java 1.47   o2.addListener(), o2.removeListener(), System.out.println()                            1 pair
      Baz.java 1.23   o3.addListener(), o3.removeListener(), list.iterator(), iter.hasNext(), iter.next()    2 pairs
      Qux.java 1.41   o4.addListener(), System.out.println()                                                 0 pairs
      Qux.java 1.42   o4.removeListener()                                                                    0 pairs
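One reading that reproduces the pair counts on this slide is that a pair consists of two calls added in the same check-in on the same receiver object. That reading is an assumption, not something the slide states, and the sketch below only illustrates it on the slide's data.

```python
from itertools import combinations

# Calls added per check-in (file, revision), copied from the slide.
checkins = {
    ("Foo.java", "1.12"): ["o1.addListener", "o1.removeListener"],
    ("Bar.java", "1.47"): ["o2.addListener", "o2.removeListener", "System.out.println"],
    ("Baz.java", "1.23"): ["o3.addListener", "o3.removeListener",
                           "list.iterator", "iter.hasNext", "iter.next"],
    ("Qux.java", "1.41"): ["o4.addListener", "System.out.println"],
    ("Qux.java", "1.42"): ["o4.removeListener"],
}

def pairs(calls):
    """Pair up method names that were added on the same receiver (assumed rule)."""
    by_receiver = {}
    for call in calls:
        receiver, method = call.rsplit(".", 1)
        by_receiver.setdefault(receiver, []).append(method)
    return [pair
            for methods in by_receiver.values()
            for pair in combinations(methods, 2)]

for checkin, calls in checkins.items():
    print(checkin, len(pairs(calls)), pairs(calls))
```

Under this reading the fix in Qux.java yields no pair, because addListener() and removeListener() arrive in different revisions; the later ranking slides treat that situation separately.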
  97. 97. Mining Method Calls • Same check-ins as on slide 95 (Foo.java 1.12, Bar.java 1.47, Baz.java 1.23, Qux.java 1.41 and 1.42) • Co-added calls often represent a usage pattern (Adapted from Livshits et al.’s slides)
  98. 98. Finding Patterns • Find “frequent itemsets” (with Apriori) • Four transactions, each containing o.enterAlignment(), o.exitAlignment(), o.redoAlignment(), iter.hasNext(), iter.next() • Resulting pattern: {enterAlignment(), exitAlignment(), redoAlignment()} (Adapted from Livshits et al.’s slides)
  99. 99. Ranking Patterns • Same check-ins as on slide 95 • Support count = # occurrences of a pattern • Confidence = strength of a pattern, P(A|B) (Adapted from Livshits et al.’s slides)
  100. 100. Ranking Patterns • Same check-ins as on slide 95 • Qux.java 1.41 adds o4.addListener() and System.out.println(); revision 1.42 then adds o4.removeListener(): this is a fix! • Rank removeListener() patterns higher (Adapted from Livshits et al.’s slides)
  101. 101. Dynamic Validation [Figure: the DynaMine pipeline (revision history mining → dynamic analysis → reporting); this part covers dynamic analysis: instrument relevant method calls → run the application → post-process → usage patterns, error patterns, unlikely patterns] (Adapted from Livshits et al.’s slides)
  102. 102. Matches and Mismatches • Find and count matches and mismatches [Figure: occurrences of o.register(d) and o.deregister(d) marked as matches or as a mismatch] • Static vs. dynamic counts (Adapted from Livshits et al.’s slides)
  103. 103. Pattern classification (post-process) • v validations, e violations • Usage patterns: e < v/10 • Error patterns: v/10 <= e <= 2v • Unlikely patterns: otherwise (Adapted from Livshits et al.’s slides)
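The thresholds on this slide translate directly into a small function. The function name and the sample counts are illustrative; only the three inequalities come from the slide.

```python
def classify_pattern(v, e):
    """Classify a mined pattern from v validations and e violations."""
    if e < v / 10:
        return "usage"      # almost always followed at run time
    if e <= 2 * v:
        return "error"      # followed often enough, violated often enough
    return "unlikely"       # violated far more often than followed

print(classify_pattern(v=100, e=5))
print(classify_pattern(v=100, e=50))
print(classify_pattern(v=100, e=500))
```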
  104. 104. Experiments • Total: 56 patterns (Adapted from Livshits et al.’s slides)
                      JEDIT      ECLIPSE
      since           2000       2001
      developers      92         112
      lines of code   700,000    2,900,000
      revisions       40,000     400,000
  105. 105. Case Studies • MAPO: mining API usages from open source repositories [Xie&Pei 06] – Code bases → sequence analysis → programming • DynaMine: finding common error patterns by mining software revision histories [Livshits&Zimmermann 05] – Change history → association rules → defect detection • BugTriage: Who should fix this bug? [Anvik et al. 06] – Bug reports → classification → debugging
  106. 106. Assigning a Bug • Many considerations – Who has the expertise? – Who is available? – How quickly does this have to be fixed? • Not always an obvious or correct assignment – Multiple developers may be suitable – Difficult to know what the bug is about – Bug fixes get delayed • Triage and fix rate indicate the ‘liveness’ of OSS projects (Adapted from Anvik et al.’s slides)
  107. 107. Assigning a Bug Today [Figure: a bug report assigned to bill@firefox.org] (Adapted from Anvik et al.’s slides)
  108. 108. Recommending assignment [Figure: recommended assignees bill@firefox.com, ted@gmail.com, cindy-loo@whoville.org] (Adapted from Anvik et al.’s slides)
  109. 109. Overview of approach • Resolved Bug Reports → Machine Learning Algorithm → Assignment Recommender → recommended developers (bill@firefox.com, ted@gmail.com, cindy-loo@whoville.org) • Approach tuned using Eclipse and Firefox (Adapted from Anvik et al.’s slides)
  110. 110. Steps to the approach 1. Characterize the reports 2. Label the reports 3. Select the reports 4. Use a machine learning algorithm (Adapted from Anvik et al.’s slides)
  111. 111. Step 1: Characterizing a report • Based on two fields – textual summary – description • Use text categorization approach – represent with a word vector – remove stop words – intra- and inter-document frequency (Adapted from Anvik et al.’s slides)
  112. 112. Step 2: Labeling a report • Must determine who really fixed it – “Assigned-to” field is not accurate • Project-specific heuristics (Adapted from Anvik et al.’s slides)
  113. 113. Step 2: Labeling a report • Must determine who really fixed it – “Assigned-to” field is not accurate • Project-specific heuristics – simple • Simple heuristics: – If a report is FIXED, label with who marked it as fixed. (Eclipse) – If a report is DUPLICATE, use the label of the report it duplicates. (Eclipse and Firefox) (Adapted from Anvik et al.’s slides)
  114. 114. Step 2: Labeling a report • Must determine who really fixed it – “Assigned-to” field is not accurate • Project-specific heuristics – simple – complex • Complex heuristic (Firefox): if the report is FIXED and has attachments that are approved by a reviewer, then – If there is one submitter of patches, use their name. – If there is more than one submitter, choose the name of whoever submitted the most patches. – If the submitters cannot be determined, label with the person assigned to the report. (Adapted from Anvik et al.’s slides)
  115. 115. Step 2: Labeling a report • Must determine who really fixed it – “Assigned-to” field is not accurate • Project-specific heuristics – simple – complex – unclassifiable • Unclassifiable (Firefox): reports marked as WONTFIX are often resolved after discussion and developers reaching a consensus. – Unknown who would have fixed the bug – Report is labeled unclassifiable. (Adapted from Anvik et al.’s slides)
  116. 116. Step 2: Labeling a report • Must determine who really fixed it – “Assigned-to” field is not accurate • Project-specific heuristics – simple – complex – unclassifiable (Adapted from Anvik et al.’s slides)
                       Eclipse   Firefox
      Simple           5         4
      Complex          2         1
      Unclassifiable   1         4
  117. 117. Step 3: Selecting the reports • Exclude those with no label • Include those of active developers – developer profiles [Figure: two developer-profile charts, vertical axis 0 to 40, horizontal axis months Sep-04 to Apr-05] (Adapted from Anvik et al.’s slides)
  118. 118. Step 3: Selecting the reports [Figure: developer-profile chart, vertical axis 0 to 40, horizontal axis months Sep-04 to Apr-05, annotated “3 reports / month”] (Adapted from Anvik et al.’s slides)
  119. 119. Step 4: Use a ML algorithm • Supervised Algorithms – Naïve Bayes – C4.5 – Support Vector Machines • Unsupervised Algorithms – Expectation Maximization • Incremental Algorithms – Naïve Bayes (Adapted from Anvik et al.’s slides)
  120. 120. Evaluating Recommenders • Precision = (# of relevant recommendations) / (# of recommendations made) • Recall = (# of relevant recommendations) / (# of possibly relevant developers) • How do we find the possibly relevant developers? (Adapted from Anvik et al.’s slides)
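Both measures can be computed from two sets, as in the sketch below. The function name is illustrative, and the developer names are reused from the neighbouring slides only as sample values; which developers actually count as relevant is the open question raised on this slide.

```python
def precision_recall(recommended, relevant):
    """Precision and recall of a set of recommended developers."""
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Sample values only: three recommendations, four possibly relevant developers.
recommended = ["paulw", "tryder", "axelf"]
relevant = ["paulw", "tryder", "stibbs", "vendger"]
precision, recall = precision_recall(recommended, relevant)
print(precision, recall)
```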
  121. 121. Determining Possibly Relevant Developers [Figure: Fixed Bug Report → CVS Repository → Modules touched by fix (Module C, Module J, Module Q, …) → CVS Usernames (paulw, tryder, stibbs, …) → Bug Repository Usernames (paulw@..., tryder@..., vendger@..., …)] (Adapted from Anvik et al.’s slides)
  122. 122. Still not Straightforward (e.g., Firefox) [Figure: Bug Report → Module List (Module C, Module J, Module Q, …) → CVS Usernames from the CVS Repository (paulw, tryder, stibbs, …), plus separate lists of Patch Submitters (paulw@..., bboggs@..., vendger@..., …; tryder@..., axelf@..., vendger@..., …; jlpicard@..., bchater@..., kpollac@..., …)] (Adapted from Anvik et al.’s slides)
  123. 123. Precision vs. Recall • A small set of “right” developers (precision) is more important than the set of all possible developers (recall) [Figure: two bar charts, Precision and Recall, 0% to 100%, for Multi. NB, C4.5, and SVM on Eclipse, Firefox, and gcc] (Adapted from Anvik et al.’s slides)
  124. 124. Overview • Software engineering tasks: programming, defect detection, testing, debugging, maintenance • Data mining techniques: association/patterns, classification, clustering, etc. • Software engineering data: code bases, change history, program states, structural entities, bug reports
  125. 125. Conclusions • Software development generates a large amount of different types of data • Data mining and data analysis can help software engineering substantially • Successful cases – What software engineering data can be mined? – What software engineering task can be helped? – How to conduct the mining?
  126. 126. Challenges • Complexity in software development – Specific data mining techniques are needed • Software development and maintenance are dynamic and user-centered – Interactive data mining – Visual data mining and analysis – Online, incremental mining
  127. 127. Questions? • Mining Software Engineering Data Bibliography: http://ase.csc.ncsu.edu/dmse/ – What software engineering tasks can be helped by data mining? – What kinds of software engineering data can be mined? – How are data mining techniques used in software engineering? – Resources
