Evaluating the presence and impact of bias in bug-fix datasets
 


Empirical Software Engineering relies on reusable datasets to make it easier to replicate empirical studies and therefore build theories on top of those empirical results. An area where these reusable datasets are particularly useful is defect prediction. In this area, the goal is to predict which entities will be more error prone, so managers can take preventive actions to improve the quality of the delivered system. These reusable datasets contain information about source code files and their history, bug reports, and the bugs fixed in each of the files. However, some of the most used datasets in the Empirical Software Engineering community have been shown to be biased: many links between files and fixed bugs are missing. Previous research has already shown that this bias may affect the performance of defect prediction models. In this talk we will show how to use statistical techniques to evaluate the bias in datasets, and to estimate its impact on defect prediction.

Usage Rights

CC Attribution License

Presentation Transcript

    • Evaluating the presence and impact of bias in bug-fix datasets
      Israel Herraiz, UPM
      http://mat.caminos.upm.es/~iht
      Talk at University of California, Davis, April 11 2012
      This presentation is available at http://www.slideshare.net/herraiz/evaluating-the-presence-and-impact-of-bias-in-bugfix-datasets
    • Outline
      1. Who am I and what do I do
      2. The problem
      3. Preliminary results
      4. The road ahead
      5. Take away and discussion
    • 1. Who am I and what do I do
    • About me
      • PhD in Computer Science from Universidad Rey Juan Carlos (Madrid)
        • “A statistical examination of the evolution and properties of libre software”
        • http://herraiz.org/phd.html
      • Assistant Professor at the Technical University of Madrid
        • http://mat.caminos.upm.es/~iht
      • Visiting UC Davis from April to July, hosted by Prof. Devanbu
      • Kindly funded by a MECD “José Castillejo” grant (JC2011-0093)
    • What do I do?
    • 2. The problem
    • Replication in Empirical Software Engineering
      • Empirical Software Engineering studies are hard to replicate.
      • Verification and replication are crucial features of an empirical research discipline.
      • Reusable datasets lower the barrier for replication.
    • Reusable datasets
      • FLOSSMole
    • The case of the Eclipse dataset
      • Defects data for all packages in the releases 2.0, 2.1 and 3.0
      • Size and complexity metrics for all the files
      • http://www.st.cs.uni-saarland.de/softevo/bug-data/eclipse/
    • Bug-fix datasets
      • The Eclipse data is a bug-fix dataset
      • To cross correlate bugs with files, classes or packages, the data is extracted from
        • Bug tracking systems (fixed bug reports)
        • Version control system (commits)
      • Heuristics to detect relationships between bug-fix reports and commits
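
The linking heuristics mentioned on this slide can be illustrated with a minimal Python sketch. It assumes the common approach of scanning commit messages for bug identifiers; the regular expression, the `link_commits_to_bugs` helper and the sample data are hypothetical and are not taken from the talk or from any specific dataset.

```python
import re

# Hypothetical pattern: matches references such as "bug 123", "fix #123",
# "fixes 123" or "issue #123" in a commit message.
BUG_REF = re.compile(r"\b(?:bug|fix(?:es|ed)?|issue)\s*#?\s*(\d+)", re.IGNORECASE)


def link_commits_to_bugs(commits, fixed_bug_ids):
    """Map each fixed bug id to the set of files touched by commits citing it."""
    links = {}
    for commit in commits:
        for match in BUG_REF.finditer(commit["message"]):
            bug_id = int(match.group(1))
            # Only keep references to reports that the tracker marks as fixed.
            if bug_id in fixed_bug_ids:
                links.setdefault(bug_id, set()).update(commit["files"])
    return links


# Illustrative data only.
commits = [
    {"message": "Fix #101: null check in parser", "files": ["Parser.java"]},
    {"message": "Refactor scanner (also resolves a crash)", "files": ["Scanner.java"]},
    {"message": "bug 102 fixed in lexer and parser", "files": ["Lexer.java", "Parser.java"]},
]
fixed_bug_ids = {101, 102, 103}

links = link_commits_to_bugs(commits, fixed_bug_ids)
unlinked = fixed_bug_ids - set(links)  # fixed bugs with no recovered commit
```

In this toy example bug 103 is fixed but never cited in a commit message, so it ends up in `unlinked`. Missing links of this kind are the source of bias discussed in the rest of the talk.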
    • A study using the Eclipse dataset
    • The distribution of software faults
      • The distribution of software faults (over packages) is a Weibull distribution
      • This study can be easily replicated thanks to the Eclipse reusable bug-fix dataset
      • If the same data is obtained for other case studies, it can also be easily verified and extended
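
As a rough illustration of what checking a Weibull claim involves, the following Python sketch estimates the shape and scale parameters by least squares on a Weibull probability plot. This is a generic textbook technique, not necessarily the method used in the cited study. The function names and the plotting position `(i - 0.5) / n` are assumptions, and the fit requires strictly positive values, so packages with zero faults would need separate handling.

```python
import math


def weibull_ccdf(x, shape, scale):
    """Complementary CDF of a Weibull distribution: P(X > x)."""
    return math.exp(-((x / scale) ** shape))


def fit_weibull_plot(sample):
    """Estimate (shape, scale) by linear regression on the Weibull plot.

    Uses the linearisation ln(-ln(1 - F(x))) = shape * ln(x) - shape * ln(scale).
    All sample values must be strictly positive.
    """
    xs = sorted(sample)
    n = len(xs)
    points = []
    for i, x in enumerate(xs, start=1):
        p = (i - 0.5) / n  # assumed plotting position; other conventions exist
        points.append((math.log(x), math.log(-math.log(1.0 - p))))
    mean_u = sum(u for u, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    slope = (sum((u - mean_u) * (v - mean_v) for u, v in points)
             / sum((u - mean_u) ** 2 for u, _ in points))
    intercept = mean_v - slope * mean_u
    return slope, math.exp(-intercept / slope)


# Synthetic sample placed exactly on Weibull quantiles (shape 0.8, scale 2.0).
sample = [2.0 * (-math.log(1.0 - (i - 0.5) / 50)) ** (1 / 0.8) for i in range(1, 51)]
shape, scale = fit_weibull_plot(sample)
```

On real fault counts the points will not fall exactly on a line, and a goodness-of-fit check would be needed before accepting or rejecting the Weibull hypothesis.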
    • But…
    • What’s the difference between the two conflicting studies?
      • According to the authors there are methodological differences
        • Zhang uses Alberg diagrams
        • Concas et al. use CCDF plots to fit different distributions, and reason about the generative process as a model for software maintenance
      • What I suspect is a crucial difference
        • Zhang reused the Eclipse bug-fix dataset
        • Concas et al. gathered the data by themselves
      • So the bias in both datasets will be different
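
The two views of the data named on this slide can be computed from the same list of fault counts. The sketch below is a minimal Python illustration with hypothetical function names and made-up counts. It takes the CCDF as P(X ≥ x), which is one of two common conventions, and it assumes at least one fault in total.

```python
def alberg_points(fault_counts):
    """Alberg diagram: cumulative share of faults against share of modules,
    with modules ordered by decreasing number of faults."""
    ordered = sorted(fault_counts, reverse=True)
    total = sum(ordered)  # assumed to be greater than zero
    points = []
    cumulative = 0
    for rank, faults in enumerate(ordered, start=1):
        cumulative += faults
        points.append((rank / len(ordered), cumulative / total))
    return points


def ccdf_points(fault_counts):
    """Empirical CCDF: for each observed value x, the fraction of modules
    with at least x faults."""
    n = len(fault_counts)
    return [(x, sum(1 for c in fault_counts if c >= x) / n)
            for x in sorted(set(fault_counts))]


# Illustrative fault counts for six modules.
faults = [0, 0, 1, 1, 2, 6]
alberg = alberg_points(faults)
ccdf = ccdf_points(faults)
```

Both functions summarise the same counts, so a change in which bug-fix links are recovered would shift both curves.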
    • What’s wrong with the Eclipse bug-fix dataset?
    • Bug feature bias
      • There are other kinds of bias (commit features), but in the case of the two Eclipse papers, the distribution is about package features, not bug or commit features.
      • RQ1: Will this kind of bias hold for package / class / file features?
      • RQ2: What’s the impact on defect prediction?
    • Impact on prediction
    • Impact on prediction
      • J48 tree to classify files as defective or not
    • Conclusions so far
      • Developers only mark a subset of the bug-fix pairs, and so heuristics-based recovery methods only find a subset of the overall bug-fix pairs
      • The bias appears as a difference in the distribution of bug and commit features
      • The conflict between the two studies about the distribution of bugs in Eclipse is likely to be due to differences in the distributions caused by bias
      • The bias has a great impact on the accuracy of predictor models
    • 3. Preliminary results
    • The distribution of bugs over files
      • Number of bugs per file for the case of ZXing
    • The distribution of bugs over files
      • Number of bugs per file for the case of Eclipse
    • The distribution of bugs over files
      • Comparison between the ReLink and the biased bug-fix sets (results of the χ² test, p-values)
    • The distribution of bugs over files
      • Comparison between the ReLink and the biased bug-fix sets (results of the χ² test, p-values)
      • RQ1: Will this kind of bias hold for package / class / file features? Not supported by these examples
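
The slides do not say how the χ² test was set up. One plausible reading, sketched below in Python, is a test of homogeneity between two histograms of bugs per file: one from the biased set and one from the ReLink set. The binning, the function name and the numbers are assumptions for illustration only.

```python
def chi_square_homogeneity(counts_a, counts_b):
    """Chi-square statistic and degrees of freedom for a 2 x k table.

    counts_a[i] and counts_b[i] are the number of files falling in bin i
    (for example 0, 1, 2, 3+ bugs per file) in each dataset.
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("both datasets must use the same bins")
    total_a, total_b = sum(counts_a), sum(counts_b)
    grand_total = total_a + total_b
    statistic = 0.0
    used_bins = 0
    for a, b in zip(counts_a, counts_b):
        column_total = a + b
        if column_total == 0:
            continue  # an empty bin carries no information
        used_bins += 1
        for observed, row_total in ((a, total_a), (b, total_b)):
            expected = row_total * column_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic, used_bins - 1


# Illustrative histograms: files with 0, 1, 2 and 3+ bugs.
biased = [50, 30, 15, 5]
relink = [100, 60, 30, 10]
statistic, dof = chi_square_homogeneity(biased, relink)
```

The p-value is the upper tail of the χ² distribution with `dof` degrees of freedom evaluated at `statistic`, for instance `scipy.stats.chi2.sf(statistic, dof)` if SciPy is available. Here the two histograms have identical proportions, so the statistic is zero and the test finds no difference, which is the kind of outcome the slide reports for RQ1.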
    • Time over!
      • So there is no difference between the biased and non-biased datasets?
      • And how come the ReLink paper (and others) report improved accuracies when using the non-biased datasets?
      • What could explain these differences?
    • Impact on prediction accuracy
      • What is the prediction accuracy using different (biased and non-biased) datasets?
      • Three datasets
        • Biased datasets recovered using heuristics
        • “Golden” dataset manually recovered
          • By Sung Kim et al., not me!
        • Non-biased dataset obtained using the ReLink tool
      • J48 tree classifier, 10-fold cross-validation
      • Test datasets always extracted from the golden dataset
    • F-measure values
      • Procedure
        • Extract 100 subsamples of the same size for both datasets
        • Calculate F-measure using 10-fold cross-validation
        • The test set is always extracted from the “golden” set
        • Repeat for several subsample sizes
      • Only results for the case of OpenIntents so far
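
The procedure above can be sketched in Python as follows. This is a simplified illustration, not the setup used in the talk: it uses a single fixed test set instead of 10-fold cross-validation, and a trivial threshold rule stands in for the J48 tree. All names (`subsample_f_measures`, `train_threshold`) and all data are hypothetical.

```python
import random


def f_measure(actual, predicted):
    """Harmonic mean of precision and recall for the 'defective' class."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def train_threshold(sample):
    """Stand-in for J48: flag a file as defective when its size metric
    exceeds the mean size seen in the training sample."""
    mean_size = sum(size for size, _ in sample) / len(sample)
    return lambda size: size > mean_size


def subsample_f_measures(train_pool, test_set, train, size, repeats=100, seed=0):
    """Train on `repeats` random subsamples of `size` items and score each
    model against the same test set (taken from the golden set)."""
    rng = random.Random(seed)
    actual = [label for _, label in test_set]
    scores = []
    for _ in range(repeats):
        classify = train(rng.sample(train_pool, size))
        predicted = [classify(metric) for metric, _ in test_set]
        scores.append(f_measure(actual, predicted))
    return scores


# Illustrative (size metric, is_defective) pairs.
biased_pool = [(10, False), (20, False), (200, True), (300, True), (15, False), (250, True)]
golden_test = [(12, False), (220, True), (180, True), (30, False)]
scores = subsample_f_measures(biased_pool, golden_test, train_threshold, size=4)
```

Running the same function with each training pool and the same `size` gives F-measure distributions that can be compared without the dataset size acting as a confounding factor.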
    • RQ2: Impact on prediction
      • Not clear whether there is any impact
    • RQ2: Impact on prediction
      • Not clear whether there is any impact
      • Little warning! The size is not exactly the same for the three cases in each boxplot. The biased is always the smallest of the three. I have to repeat this using exactly the same size for the three datasets.
    • Preliminary conclusions
      • The biased dataset does not provide the worst accuracy when predicting fault proneness for a set of (supposedly) unbiased bug fixes and files
        • Contrary to what is reported in previous work
      • What is the cause of the reported differences in accuracy?
        • By definition, the size of the so-called biased dataset will always be smaller
        • Dataset size does have an impact on the F-measure
    • 4. The road ahead
    • My workplan at UC Davis
      • Discuss the ideas shown here
        • Is bias really a problem for defect prediction?
      • Extend the study to more cases
        • Do you have a dataset of files, bugs, commits, metrics? Please let me know!
      • Improve the study
        • What happens if we break down the data into more coherent subgroups?
        • Do the results change at different levels of granularity?
    • 5. Take away and conclusions
    • Take away
      • No observable difference in the statistical properties of the so-called biased dataset
      • Systematic difference in bug-fixes collected by heuristics
      • Ecological inference: What happens at other scales? With other subgroups?
      • Impact on prediction accuracy not clear