This document summarizes Martin Pinzger's research on predicting buggy methods using software repository mining. The key points are:
1. Pinzger and colleagues conducted experiments on 21 Java projects to predict buggy methods using source code and change metrics. Change metrics like authors and method histories performed best with up to 96% accuracy.
2. Predicting buggy methods at a finer granularity than files can save manual inspection and testing effort. Accuracy decreases as fewer methods are predicted but change metrics maintain higher precision.
3. Case studies on two classes show that method-level prediction achieves over 82% precision compared to only 17-42% at the file level. This demonstrates the benefit of finer-grained prediction.
3. Hmm, wait a minute
Can’t we learn “something” from that data?
4. Goal of software repository mining
Software Analytics
To obtain insightful and actionable information for completing various tasks around developing and maintaining software systems
Examples
Quality analysis and defect prediction
Detecting “hot-spots”
Preventing defects
Recommender (advisory) systems
Code completion
Suggesting good code examples
Helping in using an API
...
5. Examples from my mining research
My mining research
The relationship between developer contributions and failure-prone Microsoft Vista binaries (FSE 2008)
Predicting failure-prone methods (ESEM 2012)
Predicting Build Co-Changes with Source Code Change and Commit Categories (to appear at SANER 2016)
For more see: http://serg.aau.at/bin/view/MartinPinzger/Publications
Surveys on software repository mining
A survey and taxonomy of approaches for mining software repositories in the context of software evolution, Kagdi et al. 2007
Evaluating defect prediction approaches: a benchmark and an extensive comparison, D’Ambros et al. 2012
Conference: MSR 2016 http://msrconf.org/
7. Many existing studies to predict bug-prone files
A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction, Moser et al. 2008
Use of relative code churn measures to predict system defect density, Nagappan et al. 2005
Cross-project defect prediction: a large scale experiment on data vs. domain vs. process, Zimmermann et al. 2009
Predicting faults using the complexity of code changes, Hassan et al. 2009
8. Prediction granularity
[Diagram: classes 1 to n with 11 methods on average per class]
4 methods are bug-prone (ca. 36%)
Retrieving bug-prone methods saves manual inspection effort and testing effort
Large files are typically the most bug-prone files
9. Research questions
How accurately can we predict buggy methods?
Which characteristics (i.e., metrics) indicate bug-prone methods?
How does the accuracy vary if the number of buggy methods decreases?
10. Research questions
RQ1: What is the accuracy of bug prediction on method level?
RQ2: Which characteristics (i.e., metrics) indicate bug-prone methods?
RQ3: How does the accuracy vary if the number of buggy methods decreases?
14. Predicting bug-prone methods
Bug-prone vs. not bug-prone
Experimental Setup: Prior to model building and classification we labeled each method in our dataset either as bug-prone or not bug-prone as follows:

bugClass = { not bug-prone : #bugs = 0; bug-prone : #bugs >= 1 }

These two classes represent the binary target classes for training and validating the prediction models. Using 0 (respectively 1) as cut-point is a common approach applied in many studies covering bug prediction models, e.g., [30, 7, 4, 27, 37]. Other cut-points are applied in the literature, for instance, a statistical lower confidence bound [33] or the median [16]. Those varying cut-points as well as the div…
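To make the binning concrete, here is a minimal sketch of that labeling step (illustrative only; method names and bug counts are hypothetical, pandas assumed):

```python
import pandas as pd

# Hypothetical per-method bug counts mined from the version history.
methods = pd.DataFrame({
    "method": ["A.resolve", "B.configure", "B.process"],
    "bugs":   [2, 5, 0],
})

# Binary target as in the excerpt above: bug-prone iff #bugs >= 1.
methods["bugClass"] = (methods["bugs"] >= 1).map(
    {True: "bug-prone", False: "not bug-prone"}
)
print(methods)
```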
15. Models computed with change metrics (CM) perform best
authors and methodHistories are the most important measures
Accuracy of prediction models
Table 4: Median classification results over all projects per classifier and per model

            CM               SCM              CM&SCM
         AUC   P    R     AUC   P    R     AUC   P    R
RndFor   .95  .84  .88    .72  .50  .64    .95  .85  .95
SVM      .96  .83  .86    .70  .48  .63    .95  .80  .96
BN       .96  .82  .86    .73  .46  .73    .96  .81  .96
J48      .95  .84  .82    .69  .56  .58    .91  .83  .89

The AUC values of the code metrics model are approximately 0.7 for each classifier, which Lessmann et al. define as ”promising” [26]. However, the source code metrics suffer from considerably low precision values. The highest median precision…
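For illustration, AUC, precision, and recall values like the ones in Table 4 are typically computed along the following lines (a sketch with scikit-learn; the synthetic data merely stands in for per-method change metrics such as authors and methodHistories):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for per-method change metrics (CM);
# y marks the bug-prone methods.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)

print("AUC:      ", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
print("Precision:", precision_score(y_te, pred))
print("Recall:   ", recall_score(y_te, pred))
```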
16. Predicting bug-prone methods with diff. cut-points
Bug-prone vs. not bug-prone
p = 75%, 90%, 95% percentiles of #bugs in methods per project
-> predict the top 25%, 10%, and 5% bug-prone methods
…how the classification performance varies (RQ3) as the number of samples in the target class shrinks, and whether we observe similar findings as in Section 3.2 regarding the results of the change and code metrics (RQ2). For that we applied three additional cut-point values as follows:

bugClass = { not bug-prone : #bugs <= p; bug-prone : #bugs > p }

where p represents either the value of the 75%, 90%, or 95% percentile of the distribution of the number of bugs in methods per project. For example, using the 95% percentile as cut-point for prior binning would mean to predict the “top five percent” methods in terms of the number of bugs. To conduct this study we applied the same experimental setup as in Section 3.1, except for the differently chosen…
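A small sketch of how such percentile cut-points can be derived (numpy assumed; the bug counts are invented):

```python
import numpy as np

# Hypothetical distribution of #bugs per method for one project.
bugs = np.array([0, 0, 1, 0, 2, 5, 0, 1, 0, 3, 0, 0, 8, 1, 0, 0, 2, 0, 0, 1])

for pct in (75, 90, 95):
    p = np.percentile(bugs, pct)   # cut-point p from the bug distribution
    bug_prone = bugs > p           # bug-prone iff #bugs > p
    print(f"{pct}% percentile: p={p:.2f}, "
          f"{bug_prone.sum()}/{bugs.size} methods labeled bug-prone")
```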
17. Decreasing the number of bug-prone methods
Models trained with Random Forest (RndFor)
Change metrics (CM) perform best
Precision decreases (as expected)
Table 5: Median classification results for RndFor over all projects per cut-point and per model

            CM               SCM              CM&SCM
         AUC   P    R     AUC   P    R     AUC   P    R
GT0      .95  .84  .88    .72  .50  .64    .95  .85  .95
75%      .97  .72  .95    .75  .39  .63    .97  .74  .95
90%      .97  .58  .94    .77  .20  .69    .98  .64  .94
95%      .97  .62  .92    .79  .13  .72    .98  .68  .92

…precision in the case of the 95% percentile (median precision of .13). Looking at the change metrics and the combined model, the median precision is significantly higher for the…
18. Application: file level vs. method level prediction
JDT Core 3.0 - LocalDeclaration.class
Contains 6 methods / 1 affected by post-release bugs
LocalDeclaration.resolve(...) was predicted bug-prone with p=0.97
File-level: p=0.17 to guess the bug-prone method
Need to manually rule out 5 methods to reach >0.82 precision: 1 / (6-5) = 1.0
JDT Core 3.0 - Main.class
Contains 26 methods / 11 affected by post-release bugs
Main.configure(...) was predicted bug-prone with p=1.0
File-level: p=0.42 to guess a bug-prone method
Need to rule out 13 methods to reach >0.82 precision: 11 / (26-13) = 0.85
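The arithmetic behind these two cases, as a tiny plain-Python sketch (numbers taken from the slide):

```python
def precision_after_ruling_out(buggy: int, total: int, ruled_out: int) -> float:
    """Chance of hitting a buggy method in a file predicted bug-prone,
    after manually ruling out `ruled_out` non-buggy methods."""
    return buggy / (total - ruled_out)

# LocalDeclaration.class: 1 of 6 methods is buggy.
print(precision_after_ruling_out(1, 6, 0))    # 0.17 without inspection
print(precision_after_ruling_out(1, 6, 5))    # 1.0 after ruling out 5 methods

# Main.class: 11 of 26 methods are buggy.
print(precision_after_ruling_out(11, 26, 0))  # 0.42
print(precision_after_ruling_out(11, 26, 13)) # ~0.85, above 0.82
```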
19. What can we learn from that?
Large files are more likely to change and have bugs
Test large files more thoroughly - YES
Bugs are fixed through changes that again lead to bugs
Stop changing our systems - NO, of course not!
Test changing entities more thoroughly - YES
Are we not already doing that?
Do we really need (complex) prediction models for that?
Not sure - might be the reason why these models are not really used, yet
Microsoft started to add prediction models to their quality assurance tools - current status?
But use at least metric tools and keep track of your code quality
-> Continuous integration environments, SONAR
21. Team structure and post-release failures
Results of an initial study with MS Vista
#Authors and #Commits of binaries are correlated with the #post-release failures
We wanted to find out
Are binaries with fragmented contributions from many developers more likely to have post-release failures?
Should developers focus on one thing?
22. Study with MS Vista project
Data
Released in January, 2007
> 4 years of development
Several thousand developers
Several thousand binaries (*.exe, *.dll)
Several millions of commits
23. Approach in a nutshell
[Diagram: change logs and bug data are mined to build a developer-contribution network — developers Alice, Bob, Dan, Eric, Fu, Go, and Hin are linked to the binaries a, b, and c, with edge weights counting commits — followed by regression analysis and validation with data splitting]

Binary  #bugs  #centrality
a          12          0.9
b           7          0.5
c           3          0.2
26. Research questions
Are binaries with fragmented contributions more failure-prone?
Does more fragmentation also mean a higher number of post-release failures?
Which measures of fragmentation are useful for failure estimation?
27. Correlation analysis
Spearman rank correlation (all correlations are significant at the 0.01 level, 2-tailed):

             nrCommits  nrAuthors  Power  dPower  Closeness  Reach  Betweenness
Failures         0.700      0.699  0.692   0.740      0.747  0.746        0.503
nrCommits                   0.704  0.996   0.773      0.748  0.732        0.466
nrAuthors                          0.683   0.981      0.914  0.944        0.830
Power                                      0.756      0.732  0.714        0.439
dPower                                                0.943  0.964        0.772
Closeness                                                    0.990        0.738
Reach                                                                     0.773
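For illustration, Spearman rank correlations like these can be computed as follows (scipy and pandas assumed; the per-binary values are invented, since the Vista data is proprietary):

```python
import pandas as pd
from scipy.stats import spearmanr

# Invented per-binary measures.
df = pd.DataFrame({
    "Failures":  [12, 7, 3, 9, 1, 6],
    "nrCommits": [120, 80, 20, 95, 10, 60],
    "Closeness": [0.9, 0.5, 0.2, 0.7, 0.1, 0.4],
})

rho, pval = spearmanr(df["Failures"], df["nrCommits"])
print(f"Failures vs. nrCommits: rho={rho:.3f}, p={pval:.3f}")

# Full rank-correlation matrix, as in the table above:
print(df.corr(method="spearman"))
```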
28. How to predict failure-prone binaries?
Binary logistic regression of 50 random splits
4 principal components from 7 centrality measures
[Figure: box plots of Precision, Recall, and AUC over the 50 random splits; y-axis from 0.50 to 1.00]
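A rough sketch of this validation scheme (scikit-learn assumed; synthetic features stand in for the seven centrality measures):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the seven centrality measures per binary.
X, y = make_classification(n_samples=300, n_features=7, random_state=1)

aucs = []
for split in range(50):  # 50 random splits, as on the slide
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=split)
    model = make_pipeline(PCA(n_components=4),  # 4 principal components
                          LogisticRegression()).fit(X_tr, y_tr)
    aucs.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))

print("median AUC over 50 splits:", np.median(aucs))
```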
29. How to predict the number of failures?
Linear regression of 50 random splits
#Failures = b0 + b1*nCloseness + b2*nrAuthors + b3*nrCommits
All correlations are significant at the 0.01 level (2-tailed)
[Figure: box plots of R-Square, Pearson, and Spearman correlation over the 50 random splits; y-axis from 0.50 to 1.00]
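A minimal sketch of fitting that regression equation (scikit-learn assumed; the per-binary values are invented):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented per-binary values; the slide's model is
# #Failures = b0 + b1*nCloseness + b2*nrAuthors + b3*nrCommits
df = pd.DataFrame({
    "nCloseness": [0.9, 0.5, 0.2, 0.7, 0.1, 0.4],
    "nrAuthors":  [14, 9, 3, 11, 2, 7],
    "nrCommits":  [120, 80, 20, 95, 10, 60],
    "Failures":   [12, 7, 3, 9, 1, 6],
})

X = df[["nCloseness", "nrAuthors", "nrCommits"]]
reg = LinearRegression().fit(X, df["Failures"])
print("b0:", reg.intercept_, "b1..b3:", reg.coef_)
print("R-square:", reg.score(X, df["Failures"]))
```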
30. Which fragmentation measures to use?
[Figure: box plots of Spearman and R-Square over the 50 random splits; y-axis from 0.30 to 1.00; comparing a model with nrAuthors and nrCommits against a model with nCloseness, nrAuthors, and nrCommits]
31. Summary of results
Centrality measures can predict more than 83% of failure-prone Vista binaries
Closeness, nrAuthors, and nrCommits can predict the number of post-release failures
Closeness or Reach can improve prediction of the number of post-release failures by 32%
More information
Can Developer-Module Networks Predict Failures?, FSE 2008
33. What can we learn from that?
[Diagram: the developer-contribution network again — Bob contributes to both binaries a and b]
Re-organize/restrict developer contributions
Simply: fire Bob!
Find out the reasons why Bob is contributing to both binaries
At MS, few key developers helped in many places to get Vista running
34. What can we learn from that?
[Diagram: the developer-contribution network again — binaries a and b are central]
Re-factor central binaries
Check the contract between “a” and “b” - decouple them
E.g., analyze if “a” contains functionality that should be moved to “b” or a new binary
36. What did Microsoft do with the results?
Results were kept within MS Research (at least in 2007 and 2008)
Be careful - our findings were purely based on the data
Such findings often do not show the full picture, since not everything is recorded
My results triggered a lot more research on developer contributions at MS Research
38. Open source projects: pros
+ Provide tons of data
E.g., Eclipse project, Apache projects, etc.
+ Easy to access
Almost “all” data is publicly available, e.g., GitHub
+ No organizational obstacles to get access
+ You get ALL the data, not just parts of it
39. Open source projects: cons
- Research is (sometimes) difficult to motivate
Sometimes researchers analyze open source projects to try out some new techniques/algorithms but without knowing what actual problem they want to solve
Why are you doing this?
What is the value for research and industry?
- Difficult to get in touch with the developers to validate the results
10 years back, I showed our results to Mozilla - they only said “interesting” but that was it!
Some open source communities are more responsive: e.g., the Eclipse community
Still, you typically do not get the chance to meet them face-2-face
40. Industrial projects: pros
+ Usually provide real problems
But sometimes these problems are not “research” problems - and researchers want to do research
+ Provide contact to developers to obtain feedback on the findings
Industry as a laboratory
Helps to evaluate what is useful and what is not
+ Potential to see our research results used at least by developers
Processes, tools, algorithms, models, best practices, etc.
41. Industrial projects: cons
- Often expectations between researchers and developers differ a lot
Developers want to get things done - researchers want to publish
Solutions are too complex and/or only applicable to a very specific case study, therefore not useful
- Note, researchers often provide know-how, not ready-made tools
Is that a good idea? - Not always, but we are typically cheaper than most consultants
- Industry provides only partial case studies
Often not really useful to perform research on -> we need data and a lot of it
42. How I got in touch with Microsoft (Research)
Met them at the Microsoft developers conference
Got invited to Redmond to show them what we could do FOR them
Agreed on sending a researcher (me) to Redmond for three months
Fully paid by Microsoft
Once within Microsoft, I got access to the data and to some developers
My main contact was with MS Research
Microsoft invests in internships and visiting researchers
They use it to find and hire talented people
43. What is next on my research agenda?
Study defect prediction in industrial software projects
I am still looking for a good industrial partner (and a student)
Ease understanding of changes and their effects
What is the effect on the design?
What is the effect on the quality?
Recommender techniques
Identify the sources of problems
Recommend and perform refactorings to solve the problem
Provide advice on the effects of changes to prevent problems
For this I want and need to collaborate with industry!
44. Conclusions
Questions?
Martin Pinzger
martin.pinzger@aau.at
…the history of a software system to assemble the dataset for our experiments: (1) versioning data including lines modified (LM), (2) bug data, i.e., which files contained bugs and how many of them (Bugs), and (3) fine-grained source code changes (SCC).
Figure 1: Stepwise overview of the data extraction process. (1) Versioning Data: log entries are mined from CVS, SVN, or GIT with Evolizer and stored in the RHDB; (2) Bug Data: bug references (e.g., #bug123) in commit messages link revisions to bugs; (3) Source Code Changes (SCC): ChangeDistiller compares the ASTs of subsequent versions (e.g., 1.1 and 1.2); (4) Experiment: classification with a Support Vector Machine.
1. Versioning Data. We use EVOLIZER [14] to access the versioning repositories, e.g., CVS, SVN, or GIT. They provide log entries that contain information about revisions of files that belong to a system. From the log entries we extract the revision number (to identify the revisions of a file in correct temporal order), the revision timestamp, the name of the developer who checked-in the new revision, and the commit message. We then compute LM for a source file as the sum of lines added, lines deleted, and lines changed per file revision.
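As a rough stand-in for this LM computation (the paper uses Evolizer and the RHDB), added and deleted lines per file can be aggregated from a git checkout via git log --numstat; note that plain git does not report changed lines as a separate count:

```python
import subprocess
from collections import defaultdict

# Sum of added and deleted lines per file over all revisions.
out = subprocess.run(
    ["git", "log", "--numstat", "--format="],
    capture_output=True, text=True, check=True,
).stdout

lm = defaultdict(int)
for line in out.splitlines():
    parts = line.split("\t")
    # numstat lines look like "<added>\t<deleted>\t<path>";
    # binary files report "-" and are skipped here.
    if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
        added, deleted, path = parts
        lm[path] += int(added) + int(deleted)

for path, value in sorted(lm.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{value:8d}  {path}")
```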
2. Bug Data. Bug reports are stored in bug repositories such as Bugzilla. Traditional bug tracking and versioning repositories…
Excerpt of Table 1 (Eclipse dataset; column labels inferred from the surrounding text, last column truncated in the source):

Plugin         Files  Revisions       LM      SCC   Bugs  Since
Update Core      595      8’496  251’434   36’151    532  Oct0…
Debug UI       1’954     18’862  444’061   81’836  3’120  May…
JDT Debug UI     775      8’663  168’598   45’645  2’002  Nov…
Help             598      3’658   66’743   12’170    243  May…
JDT Core       1’705     63’038   2’814K  451’483  6’033  Jun0…
OSGI             748      9’866  335’253   56’238  1’411  Nov…
3. Source Code Changes (SCC). …single source code statements, e.g., method invocation statements, between two versions of a program by comparing their respective abstract syntax trees (AST). Each change represents a tree edit operation that is required to transform one version of the AST into the other. The algorithm is implemented in CHANGEDISTILLER [14] that pairwise compares the ASTs between all direct subsequent revisions of each file. Based on this information, we then count the number of different source code changes (SCC) per file revision.
The preprocessed data from step 1-3 is stored into the Release History Database (RHDB) [10]. From that data, we compute LM, SCC, and Bugs for each source file by aggregating the values over the given observation period.
3. EMPIRICAL STUDY
In this section, we present the empirical study that we performed to investigate the hypotheses stated in Section …. We discuss the dataset, the statistical methods and machine learning algorithms we used, and report on the results and findings of the experiments.
3.1 Dataset and Data Preparation
We performed our experiments on 15 plugins of the Eclipse platform. Eclipse is a popular open source system that has been studied extensively before [4, 27, 38, 39]. Table 1 gives an overview of the Eclipse dataset used in this study with the number of unique *.java files (Fi…
[Diagram: the developer-contribution network from the Vista study once more]
Academia wants/needs/must collaborate with industry
Industry should invest in such a collaboration