Impact of design complexity on software quality - A systematic review

Published on METRICON 2010, Germany

  • Today I would like to present my master thesis on the topic "The impact of design complexity on software cost and quality". The thesis was performed under the direct supervision of Marcus Ciolkowski and the general supervision of Professor Dieter Rombach.
  • Here is the agenda for the presentation. First, I will present the motivation for the research topic, including its importance for software practice and for the research community. Then the research problem is formally stated as research questions. In the research methodology part, I will present the approaches used to answer these questions. The next two parts are the research results and their interpretation. Finally, I would like to discuss some significant threats to the validity of our research and future work.
  • It is a common hypothesis that the structural features of a design, such as coupling, cohesion, and inheritance, have an impact on external quality attributes. The reasoning is that a complex design structure requires more effort from a developer or tester to understand, implement, and maintain, and could therefore lead to undesired software quality, such as increased fault proneness or reduced maintainability. Although many studies investigate the relationship between design complexity and cost and quality, it is unclear what we have learned from these studies, because no systematic synthesis exists to date.
  • This master thesis addresses the main research question: What is the impact of design complexity on software quality? This question (RQ) is divided into the five sub-questions (SQ1–SQ5) shown on the slide.
  • We use four research methods to answer these five sub-questions, as shown in the diagram. A literature review is used to get a quick impression of which types of cost and quality attributes are investigated. Then a systematic literature review is performed with a focus on the most common quality attributes in the literature. The data extracted from the systematic literature review is used as input for the synthesis methods. Two available quantitative synthesis methods are vote counting and meta-analysis. Vote counting is selected to answer sub-question 3: a design metric is a potential predictor of software quality if the majority of the studies that investigate their relationship vote for it. Meta-analysis is used to synthesize and quantify the global impact of a design metric on an external quality attribute, which answers SQ4. The meta-analysis procedure also includes an explanation of the disagreement between studies, which answers SQ5.
  • This slide presents the result of the study search and selection process. After searching three electronic databases, namely Scopus, IEEE Xplore, and the ACM Digital Library, we found 39 papers. The subsequent reference scan and search for grey literature yielded 18 more papers. In total, the systematic search resulted in 57 primary studies. These two charts show the distribution of the primary studies over publication year and publication channel. They reveal that the number of papers on the topic has been increasing over the last five years. Moreover, the selected papers mainly come from high-quality sources such as book chapters, international journals, and conferences.
  • Starting with this slide, I present the results that answer the research questions. The diagram shows the cost and quality attributes that are investigated in design complexity studies. The external quality attributes fall into three categories: reliability attributes such as fault proneness, fault density, and vulnerability; maintainability and its sub-categories such as testability and changeability; and development effort such as implementation cost, debugging effort, and refactoring effort. Note that the main portion of the studies focuses on fault proneness (45% of all studies) and maintainability (25% of the studies). Fault proneness is the probability of a class being faulty. Maintainability involves the effort necessary to maintain a class. Since only these two attributes are investigated in a sufficient number of studies, fault proneness and maintainability are considered for SQ3, SQ4, and SQ5.
  • This slide presents the result for SQ2. The most frequently proposed and used design metrics cover the coupling, cohesion, inheritance, scale, and polymorphism aspects. Coupling metrics form the largest group, followed by scale, inheritance, and cohesion metrics. Interestingly, this order is the same for both fault proneness and maintainability studies. In terms of individual design metrics, the Chidamber & Kemerer (CK) metric set is the most commonly used. Here I explain the definitions of those metrics: NOC is the number of children, DIT is the depth of the inheritance tree, and so on (see slide 9).
  • In this slide, we recall some basic concepts related to the topic. How do we measure the impact? How do we know whether the impact is strong or weak? How do we know the impact did not happen by chance? The impact of a design complexity metric on cost and quality is quantified by statistical correlation. Correlation analysis investigates the extent to which changes in the value of one variable (such as the value of a complexity metric in a class) are associated with changes in another variable (such as the number of defects in a class). The intensity of the correlation is called the effect size. Three effect sizes are commonly used in correlational studies: Spearman, Pearson, and odds ratios. For the purpose of demonstration, in the coming slides we consider the impact in terms of the Spearman correlation coefficient. The impact can be positive or negative: a positive impact means an increase in the value of one variable is associated with an increase in the value of the other variable; a negative impact means the opposite. The absolute value of the Spearman coefficient ranges from 0 to 1, and Cohen classified a coefficient as trivial (below 0.1), small, medium, or large according to this value. To know whether the impact happens by chance, we use a statistical index called the p-value. A p-value of 0.05, i.e. a significance level of 5%, means there is only a 5% probability that the measured impact happened by chance. Note that correlation does not imply causation, due to confounding factors; however, it is still an effective method to select candidate variables for a cause-effect relationship. (A small correlation sketch follows these notes.)
  • To find out whether a design metric is a potential predictor of external attributes, we test each design metric with the following hypothesis: H0: there is no positive impact of metric X on quality attribute Y. Vote counting says that H0 is rejected if the ratio of the number of reported positive significant effect sizes to the total number of reported effect sizes is larger than 0.5. The table shows the result of this hypothesis test for some metrics in fault proneness studies. The procedure is performed analogously for the hypothesis of a negative impact. (See the vote-counting sketch after these notes.)
  • The presence of high heterogeneity indicates that the effect sizes come from a heterogeneous population. In other words, there may be subgroups within the population whose true effects differ. In this case, the aggregation should take the between-subgroup variation into account as well. The calculation method that does this is called the random-effects model. The table shows the results of aggregating the Spearman coefficients for 6 design metrics and LOC. We found a high level of heterogeneity for all of these metrics and therefore use a random-effects model in all cases. The diagram shows a comparison of the 95% confidence intervals of the effect sizes of the 7 metrics. (A random-effects sketch follows these notes.)
  • The significance level can tell us whether a metric is theoretically correlated with an external quality attribute, but in order to be practically meaningful, the strength of the impact should also be large enough. Meta-analysis is applied here to quantify and synthesize the Spearman coefficients reported in different studies. An example of the estimation of the global Spearman coefficient of RFC in fault proneness studies is shown in the diagram. Each reported Spearman coefficient is weighted by the dataset size. Each rectangle represents the weight of an effect size, its position on the axis is the magnitude, the line through it is the confidence interval of that study, and the diamond is the aggregated effect size. We can see that all reported Spearman coefficients are larger than 0, which indicates a positive impact. I-squared is an index that represents the heterogeneity among the reported effect sizes; a value above 70% indicates a high level of heterogeneity.
  • In the previous questions, we found high heterogeneity in the populations of all investigated metrics, and we want to find an explanation for this. One available approach is subgroup analysis; that is, we attempt to find a moderator variable that is able to account for a significant part of the observed variation. The heterogeneity test is performed for each subgroup. Comparing the within-subgroup heterogeneity with the heterogeneity of the whole population yields ve, the percentage of variance explained by the moderator variable. We calculate the ve value for each suspected moderator variable and for each design metric; the moderator variables here are the dataset characteristics that we extracted earlier. The results show that the defect collection phase can explain more than 50% of the observed variance for 5 out of the 7 investigated metrics, and the business domain can explain 76% of the variance in the case of NOC. In some cases, for example RFC and WMC, the defect collection phase separates the 95% confidence intervals of pre-release and post-release defects: the correlations between the metrics and pre-release defects are stronger than those with post-release defects. The number of post-release defects is typically smaller than the number of pre-release defects because of the testing process; therefore, a faulty class is less likely to be correlated with design complexity, since its faults have a smaller probability of being detected. (See the subgroup-analysis sketch after these notes.)
  • In this slide, we show the comparison between our results and the perceptions in the literature. The results from vote counting and meta-analysis statistically confirm the common claims about the relationship between design metrics and software fault proneness. In general, our results agree with the intuitive perception of the relationships of the CK metrics, except for DIT and LCOM. It is surprising to us that the programming language cannot explain the differences in the effect of the CK metrics on fault proneness.
  • Threats to validity could come from the systematic review and meta-analysis procedures. Bias in the study selection is one threat, since the selection was performed by a single reviewer. The varying quality of the selected studies is a trade-off against the desire to collect all reported effect sizes. The limitations of observational and historical research designs are a general shortcoming of the research area. Threats to conclusion validity include the lack of information reported in the studies, such as raw data for univariate logistic regression and moderator variables; this suggests that the information reported in studies should be improved for the purpose of aggregation.
  • This slide summarizes the results of our research.
  • Compare before and after rework; influence of context setting.
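A few small Python sketches follow to illustrate the statistical machinery referenced in these notes. This one is a toy illustration (with invented per-class data, not data from the thesis) of the correlation analysis described above: it computes a Spearman coefficient and its p-value for a hypothetical complexity metric against defect counts, and labels the strength using Cohen's conventional thresholds.

```python
# Toy illustration of the correlation analysis described in the notes.
# The per-class data below is invented for demonstration only.
from scipy.stats import spearmanr

rfc_values = [12, 45, 7, 33, 21, 60, 15, 27, 9, 52]   # hypothetical RFC-like metric per class
defect_counts = [1, 4, 0, 2, 1, 5, 1, 3, 0, 4]        # hypothetical defects per class

rho, p_value = spearmanr(rfc_values, defect_counts)

def cohen_label(r: float) -> str:
    """Label |r| as trivial/small/medium/large using Cohen's 0.1/0.3/0.5 thresholds."""
    r = abs(r)
    if r < 0.1:
        return "trivial"
    if r < 0.3:
        return "small"
    if r < 0.5:
        return "medium"
    return "large"

# A p-value below 0.05 would indicate that the correlation is unlikely to be due to chance.
print(f"Spearman rho = {rho:.2f} ({cohen_label(rho)}), p = {p_value:.3f}")
```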
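Below is a minimal sketch (an assumed formulation, not the thesis tooling) of the vote-counting rule described in the notes: H0, "no positive impact of metric X on attribute Y", is rejected when more than half of all reported effect sizes for that metric are positive and statistically significant. All effect sizes and study counts are hypothetical.

```python
# Vote counting over hypothetical reported effect sizes (one entry per study).
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ReportedEffect:
    rho: float       # Spearman coefficient reported by one study
    p_value: float   # significance reported for that coefficient

def vote_count(effects: List[ReportedEffect], alpha: float = 0.05) -> Tuple[float, bool]:
    """Return the ratio of positive significant votes and whether H0 is rejected."""
    positive_significant = sum(1 for e in effects if e.rho > 0 and e.p_value < alpha)
    ratio = positive_significant / len(effects)
    return ratio, ratio > 0.5   # ratio above 50% -> potential predictor

cbo_effects = [ReportedEffect(0.42, 0.001), ReportedEffect(0.15, 0.20),
               ReportedEffect(0.35, 0.01), ReportedEffect(0.28, 0.03)]
ratio, is_potential_predictor = vote_count(cbo_effects)
print(f"positive significant ratio = {ratio:.0%}, potential predictor: {is_potential_predictor}")
```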
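Here is a sketch of the random-effects aggregation and the I-squared heterogeneity check discussed in the notes. It is an illustrative reimplementation using the standard Fisher-z transform and the DerSimonian-Laird estimator, not the thesis code, and the study data is hypothetical.

```python
# Random-effects meta-analysis of Spearman coefficients with an I^2 heterogeneity index.
import math

def random_effects_meta(rhos, sizes):
    """Pool per-study Spearman coefficients (rhos) given the study dataset sizes."""
    zs = [math.atanh(r) for r in rhos]           # Fisher z transform
    vs = [1.0 / (n - 3) for n in sizes]          # approximate within-study variance
    ws = [1.0 / v for v in vs]                   # fixed-effect (inverse-variance) weights

    z_fixed = sum(w * z for w, z in zip(ws, zs)) / sum(ws)
    q = sum(w * (z - z_fixed) ** 2 for w, z in zip(ws, zs))    # Cochran's Q
    df = len(rhos) - 1
    i_squared = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

    # DerSimonian-Laird estimate of the between-study variance tau^2
    c = sum(ws) - sum(w * w for w in ws) / sum(ws)
    tau2 = max(0.0, (q - df) / c)
    ws_re = [1.0 / (v + tau2) for v in vs]       # random-effects weights
    z_re = sum(w * z for w, z in zip(ws_re, zs)) / sum(ws_re)
    se = math.sqrt(1.0 / sum(ws_re))
    ci = (math.tanh(z_re - 1.96 * se), math.tanh(z_re + 1.96 * se))
    return math.tanh(z_re), ci, i_squared

# Hypothetical RFC effect sizes and dataset sizes from four studies.
rho, ci, i2 = random_effects_meta([0.25, 0.40, 0.18, 0.45], [120, 85, 200, 60])
print(f"pooled rho = {rho:.2f}, 95% CI = [{ci[0]:.2f}; {ci[1]:.2f}], I^2 = {i2:.0f}%")
```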

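And a sketch of the subgroup analysis used to look for moderator variables: it measures how much of the total heterogeneity (Cochran's Q over Fisher-z transformed coefficients) disappears when the studies are split by a moderator such as the defect collection phase. Treating the between-group share of Q as "variance explained" is an assumed formulation for illustration, and the data is hypothetical.

```python
# Subgroup analysis: share of total heterogeneity explained by a moderator variable.
import math

def cochran_q(rhos, sizes):
    """Cochran's Q for Fisher-z transformed Spearman coefficients."""
    zs = [math.atanh(r) for r in rhos]
    ws = [n - 3 for n in sizes]                  # inverse-variance weights
    z_bar = sum(w * z for w, z in zip(ws, zs)) / sum(ws)
    return sum(w * (z - z_bar) ** 2 for w, z in zip(ws, zs))

def variance_explained(groups):
    """groups maps a subgroup label to a (rhos, sizes) pair of study results."""
    all_rhos = [r for rhos, _ in groups.values() for r in rhos]
    all_sizes = [n for _, sizes in groups.values() for n in sizes]
    q_total = cochran_q(all_rhos, all_sizes)
    if q_total == 0:
        return 0.0
    q_within = sum(cochran_q(rhos, sizes) for rhos, sizes in groups.values())
    return max(0.0, (q_total - q_within) / q_total) * 100

# Hypothetical split of RFC studies by defect collection phase.
groups = {
    "pre-release":  ([0.42, 0.45, 0.38], [120, 85, 150]),
    "post-release": ([0.20, 0.25, 0.15], [200, 90, 110]),
}
print(f"variance explained by defect collection phase: {variance_explained(groups):.0f}%")
```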
    1. Master thesis presentation: Impact of design complexity on software quality. Student: Nguyen Duc Anh. First supervisor: Marcus Ciolkowski, Fraunhofer IESE. Second supervisor: Sebastian Barney, BTH. General supervisor: Prof. Dr. Dr. h.c. Dieter Rombach. Feb 21, 2013. © Fraunhofer IESE
    2. Agenda: Motivation; Problem statement; Research methodology; Research results; Threats to validity; Conclusion; Future work
    3. Motivation: High complexity leads to high cost and low quality
    4. Problem statement. Main RQ: What is the impact of design complexity on software cost & quality? SQ1: Which cost & quality attributes are predicted using design complexity metrics? SQ2: What (kind of) design complexity metrics are most frequently used in literature? SQ3: Which complexity metrics are potential predictors of quality attributes? SQ4: Is there an overall influence of these metrics on quality attributes? If yes, what are the impacts of those metrics on those attributes? SQ5: If no, what explains the inconsistency between studies? Is this explanation consistent across different metrics?
    5. Research methodology (guided by the main RQ: What is the impact of design complexity on software cost & quality?): Search for relevant publications; Extract information about design complexity metrics & quality attributes; Extract numerical representation of the impact relationship & context factors; Synthesize data & interpret results
    6. Study selection result. Search range: 1960 to 2010. Scope: object-oriented metrics.
    7. Research results. SQ1: Which quality attributes are predicted using software design metrics? Probability of a module being faulty; effort to maintain a software module; number of faults per LOC; probability of a module being changed. Cost (effort) is excluded due to an insufficient number of studies.
    8. Research results. SQ2: What kind of complexity metrics is most frequently used in literature? [Chart: number of studies per design complexity dimension.]
    9. Research results. SQ2: What complexity metrics are most frequently used in literature? Design complexity metric: the Chidamber & Kemerer (CK) metric set (*).
       Fault proneness studies: NOC (Number Of Children, inheritance) 28; DIT (Depth of Inheritance Tree, inheritance) 27; CBO (Coupling Between Objects, coupling) 22; LCOM (Lack of Cohesion between Methods, cohesion) 22; WMC (Weighted Method Count, scale) 22; RFC (Response For a Class, coupling) 21; …: 12.
       Maintainability studies: WMC (scale) 9; RFC (coupling) 8; DIT (inheritance) 7; NOC (inheritance) 6; CBO (coupling) 4; LCOM (cohesion) 3; …: 3.
       (*) S.R. Chidamber and C.F. Kemerer, "A Metrics Suite for Object Oriented Design," IEEE Trans. Softw. Eng., vol. 20, 1994, pp. 476-493.
    10. Research results. SQ3: Which complexity metrics are potential predictors of fault proneness? Potential prediction: statistical correlation analysis. Correlation coefficients: Spearman; odds ratios (estimated from a univariate logistic regression model). Significant correlation: vote counting, i.e. count the number of reported significant impacts over the total number of studies.
    11. Research results. SQ3: Which complexity metrics are potential predictors of fault proneness? (Example: vote counting for the Spearman correlation coefficient in fault proneness studies. Decision rule: ratio of positive significant effect sizes ≤ 50% means no positive impact, ≥ 50% means positive impact.)
       NOC: 19 studies, 6 positive significant, 1 negative, 12 not significant, ratio 32%, positive impact: No
       DIT: 14 studies, 2 positive significant, 0 negative, 12 not significant, ratio 14%, positive impact: No
       CBO: 17 studies, 10 positive significant, 0 negative, 7 not significant, ratio 59%, positive impact: Yes
       LCOM: 14 studies, 6 positive significant, 0 negative, 8 not significant, ratio 43%, positive impact: No
       WMC: 26 studies, 18 positive significant, 0 negative, 8 not significant, ratio 69%, positive impact: Yes
       RFC: 15 studies, 9 positive significant, 0 negative, 6 not significant, ratio 60%, positive impact: Yes
       WMC McCabe: 16 studies, 11 positive significant, 0 negative, 5 not significant, ratio 69%, positive impact: Yes
       SDMC, AMC, NIM, NCM, NTM: 6 studies each, all positive significant, ratio 100%, positive impact: Yes
       Except for NOC, DIT, and LCOM, the listed metrics are potential predictors of fault proneness!
    12. Research results. SQ3: Which complexity metrics are potential predictors of fault proneness? Strength of correlation (*): trivial, small, medium, large. Meta-analysis: synthesize the reported correlation coefficients; assess the agreement among studies about the aggregated result. (*) J. Cohen, Statistical Power Analysis for the Behavioral Sciences, Lawrence Erlbaum, Hillsdale, New Jersey, 1988.
    13. Research results. SQ4: Is there an overall influence of these metrics on fault proneness? [Chart: 95% confidence intervals of the aggregated correlation coefficient between each metric and fault proneness, on a trivial/small/medium/large scale.] Scale and coupling metrics are more strongly correlated than cohesion and inheritance metrics; LOC is the most strongly correlated with fault proneness.
    14. Research results. SQ4: Is there an overall influence of these metrics on fault proneness? (Example: meta-analysis for the Spearman coefficient of the metric RFC in fault proneness studies.) [Forest plot of RFC.] Aggregated results: global Spearman coefficient 0.31; 95% confidence interval [0.22; 0.40]; p-value 0.000.
    15. Research results. SQ4: Is there an overall influence of these metrics on fault proneness? (Example: meta-analysis for the Spearman coefficient of the metric RFC in fault proneness studies.) Is this result consistent across studies? I² test for heterogeneity: CBO 95%, DIT 83%, NOC 75%, LCOM 74%, RFC 78%, WMC 93%, LOC 84%. RFC: I² = 78%.
    16. Research results. SQ4*: How many cases are enough to draw a statistically significant conclusion? (Example: power analysis for the Spearman coefficient of the metric RFC in fault proneness studies.) α value: 0.1; tails: 2; expected effect size: 0.31; expected power: 80%. Number of cases needed: 60!
    17. Research results. SQ5: What explains the inconsistency between studies? Is this explanation consistent across different metrics? Moderator variables: programming language (C++ & Java); project type (open source, closed source academic & closed source industry); defect collection phase (pre-release & post-release defects); business domain (embedded system & information system); dataset size (small, medium & large). Are the correlations different across each moderator variable?
    18. Research results. SQ5: What explains the inconsistency between studies? Is this explanation consistent across different metrics? [Table: variance explained, in percent, by each moderator variable (programming language, project type, defect collection phase, business domain, dataset size) for each metric (CBO, DIT, NOC, LCOM, RFC, WMC, LOC); the defect collection phase column reads 83% for CBO, 60% for LCOM, 78% for RFC, 60% for WMC, and 51% for LOC.] Remaining inconsistency is still excessive; there is no consistent explanation for the heterogeneity across metrics!
    19. Comparison of results with perceptions in the literature (vote counting & meta-analysis vs. common claims in the literature); each claim is marked In literature / Ours:
       The more classes a given class is coupled to, the more likely that class is faulty: Yes / Yes
       The more methods that can potentially be executed in response to a message received by an object of a given class, the more likely that class is faulty: Yes / Yes
       The deeper the inheritance tree for a given class is, the more likely that class is faulty: Yes / No
       The more immediate sub-classes a given class has, the more likely that class is faulty: No / No
       The less similar the methods within a given class are, the more likely that class is faulty: Yes / No
       The more local methods a given class has, the more likely that class is faulty: Yes / Yes
       The larger the size of a given class is, the more likely that class is faulty: Yes / Yes
       Do the effects of CK metrics differ across different programming languages: Yes / No
       Do the effects of CK metrics differ …
    20. Limitations. Internal validity: selection of publications; quality of selected studies. External validity: limitation to models with a single complexity metric; limitation to object-oriented systems. Conclusion validity: lack of comparable studies; lack of reported context information.
    21. Conclusion. SQ1: Most common predicted attributes: fault proneness & maintainability. SQ2: Most common design complexity dimensions & metrics: coupling (CBO, RFC), scale (WMC), inheritance (DIT, NOC), cohesion (LCOM). SQ3, SQ4: Overall impact of design complexity on software quality: moderate impact of WMC, CBO, and RFC on fault proneness; LOC shows the strongest impact on fault proneness! SQ5: What explains the inconsistency between studies? We are not able to fully explain the inconsistency; the defect collection phase explains part of it.
    22. Interpretation. Look for a quality predictor in source code: LOC. Look for quality predictors in design: CBO, RFC, and WMC. Build different prediction models for pre-release and post-release defects. Context information is needed to increase predictive performance. Adapt the design metrics for any software system.
    23. Future work: construction of a generic quality prediction model; benchmarking. [Diagram: comparing System A and System B on their CBO, RFC, WMC, LCOM, and DIT values.]
    24. Q&A
