Micro Interaction Metrics for Defect Prediction (ESEC/FSE 2011)
ESEC/FSE presentation

    Presentation Transcript

    • Micro Interaction Metrics for Defect Prediction. Taek Lee, Jaechang Nam, Dongyun Han, Sunghun Kim, Hoh Peter In. ESEC/FSE 2011, Hungary, Sep. 5-9
    • Outline: research motivation, the existing metrics, the proposed metrics, experiment results, threats to validity, conclusion
    • Defect prediction: why is it necessary?
    • Software quality assurance is inherently a resource constrained activity!
    • Predicting defect-prone software entities* is done to put the best labor effort on those entities (* functions or code files)
    • Indicators of defects: complexity of source code (Chidamber and Kemerer 1994), frequent code changes (Moser et al. 2008), previous defect information (Kim et al. 2007), code dependencies (Zimmermann 2007)
    • Indeed, where do defects come from?
    • Human error! Programmers make mistakes, defects are injected as a consequence, and software fails (human errors → bugs injected → software fails)
    • Programmer Interaction and Software Quality: “Errors are from cognitive breakdown while understanding and implementing requirements” (Ko et al. 2005); “Work interruptions or task switching may affect programmer productivity” (DeLine et al. 2006)
    • Don’t we also need to consider developers’ interactions as defect indicators?
    • …, but the existing indicators can NOT directly capture developers’ interactions
    • Using Mylyn data, we propose novel “Micro Interaction Metrics (MIMs)” capturing developers’ interactions
    • The Mylyn* data is stored as an attachment to the corresponding bug reports in XML format (* Eclipse plug-in storing and recovering task contexts)
    • <InteractionEvent … Kind=“ ” … StartDate=“ ” EndDate=“ ” … StructureHandle=“ ” … Interest=“ ” … > — the key attributes of a Mylyn interaction event (a parsing sketch follows below)
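    As a rough illustration only, the sketch below reads InteractionEvent elements carrying the attributes shown on the slide, assuming the attachment has been saved as a plain XML file; the element layout and timestamp format are assumptions, not the actual Mylyn schema.

        import xml.etree.ElementTree as ET
        from datetime import datetime

        # Assumed timestamp format; real Mylyn attachments may differ.
        DATE_FMT = "%Y-%m-%d %H:%M:%S"

        def load_interaction_events(path):
            """Read InteractionEvent elements (Kind, StartDate, EndDate,
            StructureHandle, Interest) from a task-context XML file."""
            events = []
            for e in ET.parse(path).getroot().iter("InteractionEvent"):
                events.append({
                    "kind": e.get("Kind"),
                    "start": datetime.strptime(e.get("StartDate"), DATE_FMT),
                    "end": datetime.strptime(e.get("EndDate"), DATE_FMT),
                    "handle": e.get("StructureHandle"),  # e.g. an edited file/element
                    "interest": float(e.get("Interest", "0")),
                })
            return events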
    • Two levels of MIMs design: file-level MIMs capture specific interactions for a file in a task (e.g., AvgTimeIntervalEditEdit); task-level MIMs capture property values shared over the whole task (e.g., TimeSpent). Example Mylyn task log: 10:30 Selection file A, 11:00 Edit file B, 12:30 Edit file B (a sketch of the file-level example follows below)
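    A minimal sketch of the file-level example AvgTimeIntervalEditEdit, assuming events parsed into dicts as in the sketch above (keys: kind, start, handle). The slide does not define the metric precisely, so averaging the gaps between consecutive Edit events on the same file is one plausible reading.

        from collections import defaultdict

        def avg_time_interval_edit_edit(events):
            """Per file: average gap (seconds) between consecutive Edit events
            within one task context (the 'edit' kind string is an assumption)."""
            edit_times = defaultdict(list)
            for e in sorted(events, key=lambda e: e["start"]):
                if e["kind"] == "edit":
                    edit_times[e["handle"]].append(e["start"])
            result = {}
            for handle, times in edit_times.items():
                if len(times) >= 2:
                    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
                    result[handle] = sum(gaps) / len(gaps)
            return result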
    • The Proposed Micro Interaction Metrics
    • For example, NumPatternSXEY captures this interaction: “How many times did a programmer Select a file of group X and then Edit a file of group Y in a task activity?”
    • Group X or Y: X if a file shows defect locality* properties, Y otherwise. Group H or L: H if a file has a high** DOI value, L otherwise. (* hinted by the paper [Kim et al. 2007]; ** threshold: median of degree of interest (DOI) values in a task; a counting sketch follows below)
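    A sketch of how NumPatternSXEY could be counted, assuming the group of each file (X or Y) has already been determined from defect-locality information. Whether “Select then Edit” means immediately consecutive events is not stated on the slide; immediate succession is assumed here.

        def num_pattern_sx_ey(events, group_of):
            """Count how often a Select of a group-X file is immediately
            followed by an Edit of a group-Y file within one task.
            group_of maps a file handle to 'X' or 'Y' (assumed precomputed);
            the 'selection'/'edit' kind strings are assumptions about the
            Mylyn log vocabulary."""
            ordered = sorted(events, key=lambda e: e["start"])
            count = 0
            for prev, cur in zip(ordered, ordered[1:]):
                if (prev["kind"] == "selection" and group_of.get(prev["handle"]) == "X"
                        and cur["kind"] == "edit" and group_of.get(cur["handle"]) == "Y"):
                    count += 1
            return count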
    • Bug Prediction Process
    • STEP1: Counting & Labeling Instances. All the Mylyn task data collectable from Eclipse subprojects (Dec 2005 ~ Sep 2010) are laid on a timeline and split at a point P; post-defects are counted in the period after P, considering only files edited within bug-fixing tasks (e.g., f1.java = 1, f2.java = 1, f3.java = 2). Labeling rule for a file instance: “buggy” if # of post-defects > 0, “clean” if # of post-defects = 0
    • STEP2: Extraction of MIMs. MIMs are computed from the tasks in the metrics extraction period (before P): each task in which a file is edited contributes a metric value for that file, e.g., MIMf3.java ← valueTask1, MIMf1.java ← valueTask2, MIMf2.java ← valueTask3. When a file appears in several tasks, its values are averaged, e.g., MIMf1.java ← (valueTask2 + valueTask4)/2 and MIMf2.java ← (valueTask3 + valueTask4)/2 (see the aggregation sketch below)
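    A small sketch of the averaging step, assuming one metric’s per-task values have already been computed; the data layout here is hypothetical, not the paper’s actual pipeline.

        from collections import defaultdict

        def aggregate_mims(per_task_values):
            """Average one metric's per-task values for each file.
            per_task_values: list of dicts (one per task in the extraction
            period) mapping file name -> metric value from that task."""
            sums, counts = defaultdict(float), defaultdict(int)
            for task_values in per_task_values:
                for fname, value in task_values.items():
                    sums[fname] += value
                    counts[fname] += 1
            return {fname: sums[fname] / counts[fname] for fname in sums}

        # e.g. aggregate_mims([{"f3.java": 4.0}, {"f1.java": 2.0}, {"f1.java": 6.0}])
        # -> {"f3.java": 4.0, "f1.java": 4.0}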
    • The Understand JAVA tool was used for extracting 32 source code metrics (CMs)* from the last CVS revision at time P (* Chidamber and Kemerer, and OO metrics)
    • Fifteen history metrics (HMs)* were collected from the CVS revisions in the corresponding repository (* Moser et al.)
    • STEP3: Creating a training corpus. For classifier training, each instance consists of (instance name, extracted MIMs, label); for regression training, each instance consists of (instance name, extracted MIMs, # of post-defects)
    • STEP4: Building prediction models. Classification and regression modeling with different machine learning algorithms using the WEKA* tool (* an open source data mining tool); a modeling sketch follows below
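    The paper uses WEKA; purely as an illustrative stand-in, here is a scikit-learn sketch of the classification side of this step. The choice of RandomForestClassifier and F-measure scoring are assumptions, not the algorithms reported in the paper.

        # The paper uses WEKA; this scikit-learn stand-in only illustrates the
        # workflow: MIM feature vectors plus buggy/clean labels -> classifier.
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_score

        def evaluate_classifier(X, y, folds=10):
            """X: instances x metric features, y: 1 for 'buggy', 0 for 'clean'.
            Returns the mean F-measure over k-fold cross validation."""
            clf = RandomForestClassifier(n_estimators=100, random_state=0)
            return float(np.mean(cross_val_score(clf, X, y, cv=folds, scoring="f1")))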
    • STEP5: Prediction Evaluation with classification measures. Precision(B): how many instances are really buggy among the buggy-predicted outcomes? Recall(B): how many instances are correctly predicted as ‘buggy’ among the real buggy ones? (formulas below)
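    For reference, the standard definitions behind these two questions, with TP, FP, FN counted over the buggy class; these are textbook formulas, consistent with the baseline example later in the talk:

        \[
        \mathrm{Precision}(B) = \frac{TP}{TP + FP}, \qquad
        \mathrm{Recall}(B) = \frac{TP}{TP + FN}, \qquad
        F(B) = \frac{2 \cdot \mathrm{Precision}(B) \cdot \mathrm{Recall}(B)}{\mathrm{Precision}(B) + \mathrm{Recall}(B)}
        \]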
    • STEP5: Prediction Evaluation with regression measures, computed between the # of real buggy instances and the # of instances predicted as buggy: correlation coefficient (-1~1), mean absolute error (0~1), root square error (0~1) (formulas below)
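    For completeness, the usual (non-relative) forms of these measures, with $y_i$ the real value and $\hat{y}_i$ the predicted one; the 0~1 ranges on the slide suggest WEKA’s relative variants may actually have been used, so treat these only as the underlying idea:

        \[
        r = \frac{\sum_i (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}
                 {\sqrt{\sum_i (y_i - \bar{y})^2}\,\sqrt{\sum_i (\hat{y}_i - \bar{\hat{y}})^2}},
        \qquad
        \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|,
        \qquad
        \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
        \]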
    • T-test with 100 runs of 10-fold cross validation: reject H0* and accept H1* if p-value < 0.05 (at the 95% confidence level). (* H0: no difference in average performance; H1: different (better!); see the sketch below)
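    A minimal sketch of the significance test, assuming the per-run performance scores of two metric sets are available; whether the paper used a paired or an unpaired t-test is not stated on the slide, so a paired test over matched cross-validation runs is assumed here.

        from scipy import stats

        def significantly_better(scores_a, scores_b, alpha=0.05):
            """Paired t-test over matched cross-validation runs: True if the
            mean of scores_a is significantly higher than that of scores_b."""
            t, p = stats.ttest_rel(scores_a, scores_b)
            return p < alpha and t > 0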
    • Result summary: MIMs improve prediction accuracy for (1) different Eclipse project subjects, (2) different machine learning algorithms, (3) different model training periods
    • Prediction for different project subjects (table: file instances and % of defects)
    • Prediction for different project subjects (MIM: the proposed metrics, CM: source code metrics, HM: history metrics)
    • Prediction for different project subjects. BASELINE: a dummy classifier predicts in a purely random manner; e.g., for 12.5% buggy instances, Precision(B)=12.5%, Recall(B)=50%, F-measure(B)=20% (worked out below)
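    Plugging the baseline numbers into the F-measure definition given under STEP5 confirms the 20% figure:

        \[
        F(B) = \frac{2 \cdot 0.125 \cdot 0.5}{0.125 + 0.5} = \frac{0.125}{0.625} = 0.20
        \]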
    • Prediction for different project subjects T-test results (significant figures are in bold, p-value < 0.05)
    • Prediction with different algorithms
    • Prediction with different algorithms T-test results (significant figures are in bold, p-value < 0.05)
    • Prediction in different training periods: the timeline (Dec 2005 ~ Sep 2010) is split into a model training period and a model testing period at 50%:50%, 70%:30%, and 80%:20%
    • Prediction in different training periods
    • Prediction in different training periods T-test results (significant figures are in bold, p-value < 0.05)
    • The top 42 metrics (37%) are from the MIMs, among a total of 113 metrics (MIMs+CMs+HMs)
    • Possible insight. TOP 1: NumLowDOIEdit, TOP 2: NumPatternEXSX, TOP 3: TimeSpentOnEdit. Chances are that more defects might be generated if a programmer (TOP 2) repeatedly edits and browses a file, especially one related to previous defects, (TOP 3) puts more weight on editing time, and especially (TOP 1) edits files that were accessed less frequently or less recently …
    • Performance comparison with regression modeling for predicting # of post-defects
    • Predicting Post-Defect Numbers
    • Predicting post-defect numbers: T-test results (significant figures are in bold, p-value < 0.05)
    • Threats to validity: the systems examined might not be representative; the systems are all open source projects; defect information might be biased
    • Conclusion: our findings exemplify that developers’ interactions can affect software quality; our proposed micro interaction metrics improve defect prediction accuracy significantly …
    • We believe future defect prediction models will use more developers’ direct and micro-level interaction information; MIMs are a first step towards it
    • Thank you! Any questions? Problem: can developers’ interaction information affect software quality (defects)? Approach: we proposed novel micro interaction metrics (MIMs) overcoming the popular static metrics. Result: MIMs significantly improve prediction accuracy compared to source code metrics (CMs) and history metrics (HMs)
    • Backup Slides
    • One possible ARGUMENT: some developers may not have used Mylyn to fix bugs
    • This creates a chance of error in counting post-defects and, as a result, biased labels (i.e., an incorrect % of buggy instances)
    • We repeated the experiment using the same instances but with a different defect-counting heuristic, the CVS-log-based approach* (* with keywords “fix”, “bug”, and the bug report ID in change logs; a matching sketch follows below)
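    A minimal sketch of such a keyword-based heuristic, assuming change logs are available as (file, message) pairs; the exact matching rules and bug-ID format used in the paper are not given on the slide.

        import re

        # Keyword heuristic over change logs; "#\d+" stands in for a bug
        # report ID reference, which is an assumption about the log format.
        FIX_PATTERN = re.compile(r"\b(fix(e[sd])?|bug)\b|#\d+", re.IGNORECASE)

        def count_post_defects_from_logs(change_logs):
            """change_logs: iterable of (file_name, commit_message) pairs.
            Returns {file_name: number of bug-fix commits that touched it}."""
            counts = {}
            for fname, message in change_logs:
                if FIX_PATTERN.search(message):
                    counts[fname] = counts.get(fname, 0) + 1
            return counts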
    • Prediction with the CVS-log-based approach: T-test results (significant figures are in bold, p-value < 0.05)
    • The CVS-log-based approach reported more additional post-defects (a higher % of buggy-labeled instances); MIMs failed to feature them due to the lack of corresponding Mylyn data
    • Note that it is difficult to 100% guarantee the quality of CVS change logs (e.g., no explicit bug ID, missing logs)