Statistical Distribution of Metrics

Presentation for the Seminar on Open Source Evolution 2013
http://informatique.umons.ac.be/genlog/SOS-Evol/SOS-Evol2013.html


  1. Statistical distributions of software metrics: do they matter?
     Israel Herraiz, Technical University of Madrid (UPM), israel.herraiz@upm.es
     Grab these slides from http://slideshare.net/herraiz/statistical-distributions-of-metrics
     Slide 1/17
  2. Outline
       1 Some background
       2 Statistical properties of software metrics
       3 Evidence of impact on quality
       4 Summary of findings and further work
  3. Outline (repeated as section divider): 1 Some background; 2 Statistical properties of software metrics; 3 Evidence of impact on quality; 4 Summary of findings and further work
  4. A (not so) long time ago...
     Statistical distribution of software metrics: software size follows a double Pareto distribution.
       "Towards a theoretical model for software growth", MSR 2007
     More recently: not only size, but some OO metrics too (and some complexity metrics).
       "On the Statistical Distribution of Object-Oriented System Properties", WETSoM 2012
  5. OK, but what is that double Pareto thing?
     [Figure: log-log plot of P[X > x] against SLOC, comparing the data with a double Pareto fit and a lognormal fit.]
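A plot like the one on this slide can be built from an empirical complementary cumulative distribution function (CCDF). The following is a minimal sketch, assuming synthetic lognormal file sizes; it does not implement the double Pareto fit, and the function names and sample parameters are illustrative only. On real data with a power-law tail, the empirical values for large files would be expected to sit above the lognormal curve.

```python
import math
import random


def empirical_ccdf(values):
    """Return sorted (x, P[X > x]) pairs estimated from the sample."""
    ordered = sorted(values)
    n = len(ordered)
    ccdf = {}
    for rank, x in enumerate(ordered, start=1):
        # For tied values the last occurrence overwrites earlier ones,
        # leaving the fraction of observations strictly greater than x.
        ccdf[x] = (n - rank) / n
    return sorted(ccdf.items())


def lognormal_ccdf(x, mu, sigma):
    """P[X > x] for a lognormal with log-mean mu and log-std sigma."""
    return 0.5 * math.erfc((math.log(x) - mu) / (sigma * math.sqrt(2)))


# Synthetic SLOC-like sample (assumption: purely lognormal, no power-law tail).
random.seed(1)
sizes = [random.lognormvariate(4.0, 1.2) for _ in range(5000)]

# Fit the lognormal by taking the mean and standard deviation of the logs.
logs = [math.log(s) for s in sizes]
mu = sum(logs) / len(logs)
sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))

# Print a few points of both curves; plot them on log-log axes to compare.
for x, p in empirical_ccdf(sizes)[::1000]:
    print(f"{x:10.1f}  empirical={p:.4f}  lognormal={lognormal_ccdf(x, mu, sigma):.4f}")
```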
  6. But does it matter? Most of the files are on the lognormal side.
     [Figure: bar chart of % Files by language (C, C++, Java, Python, Lisp).]
  7. But does it matter? Most of the files are on the lognormal side, but the power law minority matters a lot.
     [Figure: two bar charts by language (C, C++, Java, Python, Lisp), one of % Files and one of % SLOC.]
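The contrast between the share of files and the share of SLOC can be computed directly once a threshold is chosen. This is a small sketch with made-up file sizes and a hypothetical threshold `xmin`; the function name is illustrative, and treating "above xmin" as strictly greater than the threshold is an assumption.

```python
def tail_shares(sloc_per_file, xmin):
    """Return (share of files, share of SLOC) for files larger than xmin."""
    tail = [s for s in sloc_per_file if s > xmin]
    file_share = len(tail) / len(sloc_per_file)
    sloc_share = sum(tail) / sum(sloc_per_file)
    return file_share, sloc_share


# Made-up project: 99 small files and one very large file.
sizes = [50] * 99 + [3300]
file_share, sloc_share = tail_shares(sizes, xmin=1000)
print(f"{file_share:.0%} of the files hold {sloc_share:.0%} of the SLOC")
```

With these invented numbers a single file out of a hundred holds a large fraction of the code, which is the kind of imbalance the two charts contrast.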
  8. Large files have a large impact
     Size estimation models: some software size estimation models are based on the log-normality of size metrics. These models systematically underestimate the size of software.
     [Figure: four panels (C, C++, Java, Python) plotting RE against SLOC.]
     "On the distribution of source code file sizes", ICSOFT 2011
  9. Outline (repeated as section divider): 1 Some background; 2 Statistical properties of software metrics; 3 Evidence of impact on quality; 4 Summary of findings and further work
  10. Parameters of the statistical distribution
      Power law parameters: λ and xmin
      Transition from lognormal to power law
      [Figure: log-log plot of P[X > x] against SLOC, comparing the data with a double Pareto fit and a lognormal fit.]
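The slide names the power law parameters but does not say how they are estimated. As an illustration only, the sketch below uses the standard continuous maximum-likelihood estimator for the exponent α of a density p(x) ∝ x^(-α) above a given threshold. Whether α corresponds exactly to the slide's λ is an assumption, the threshold is taken as given rather than estimated, and the function name is hypothetical.

```python
import math
import random


def powerlaw_exponent_mle(values, xmin):
    """Estimate alpha for p(x) ~ x**(-alpha), using only values >= xmin."""
    tail = [v for v in values if v >= xmin]
    log_sum = sum(math.log(v / xmin) for v in tail)
    if log_sum == 0:
        raise ValueError("need at least one observation strictly above xmin")
    return 1.0 + len(tail) / log_sum


# Synthetic check: paretovariate(1.5) has density exponent 1.5 + 1 = 2.5
# and a minimum value of 1.0, so xmin = 1.0 is the true threshold here.
random.seed(0)
sample = [random.paretovariate(1.5) for _ in range(20000)]
print(f"estimated exponent: {powerlaw_exponent_mle(sample, xmin=1.0):.3f}")
```

In practice the threshold also has to be chosen from the data, for example by comparing the fitted tail with the empirical distribution for several candidate values.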
  11. Outline (repeated as section divider): 1 Some background; 2 Statistical properties of software metrics; 3 Evidence of impact on quality; 4 Summary of findings and further work
  12. Probability of finding defects
      We have seen that files above xmin account for 40% of total size while being only about 1% of the files. What about defects?
      Probability of finding defects in three software projects (using CYCLO as metric):

        Project       Below xmin   Above xmin
        Apache        .4178        .7708
        OpenIntents   .2500        .7500
        Zxing         .2143        .4161

      Data extracted from "ReLink: Recovering Links between Bugs and Changes", FSE 2011.
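The slide does not spell out how these probabilities are computed. One plausible reading, assumed here, is the fraction of files on each side of xmin that have at least one linked defect. The sketch below uses invented per-file records; the function name, the record layout, and the threshold value are all hypothetical.

```python
def defect_probability(files, xmin):
    """files: iterable of (metric_value, has_defect) pairs.

    Returns the fraction of defective files below and above xmin,
    or None for a side that contains no files.
    """
    counts = {"below": [0, 0], "above": [0, 0]}  # [defective, total]
    for metric, has_defect in files:
        side = "above" if metric > xmin else "below"
        counts[side][1] += 1
        if has_defect:
            counts[side][0] += 1
    return {
        side: (defective / total if total else None)
        for side, (defective, total) in counts.items()
    }


# Invented records: (cyclomatic complexity, file has a linked defect).
records = [
    (3, False), (5, True), (8, False), (12, False),
    (40, True), (55, True), (70, False), (90, True),
]
print(defect_probability(records, xmin=30))
```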
  13. Probability of finding defects (normalized metrics)
      Using CYCLO / WMC as metric (cyclomatic complexity per LOC):

        Project       Below xmin   Above xmin
        Apache        .4159        .6296
        OpenIntents   .2813        .5417
        Zxing         .3181        .2389
  14. Probability of finding defects: defect density (only pre-release defects)
      Using Number of Methods and the number of pre-release defects per LOC.
      [Figure: two histograms, one for files below xmin and one for files above xmin.]
      Average density: .2685 below xmin, .4565 above xmin.
      Data obtained from "Predicting Defects for Eclipse", PROMISE 2007.
  15. Probability of finding defects: defect density (only post-release defects)
      Using Number of Methods and the number of post-release defects per LOC.
      [Figure: two histograms, one for files below xmin and one for files above xmin.]
      Average density: .1437 below xmin, .2690 above xmin.
  16. Probability of finding defects: defect density (pre- plus post-release defects)
      Using CYCLO/SLOC and the number of total defects per LOC.
      [Figure: two log-log plots; the first shows Pr(X ≥ x) against x.]
      Average density: .3335 below xmin (>9000 files), .7747 above xmin (364 files).
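The average densities on the three preceding slides compare files below and above xmin. The slides do not state whether the average is taken over per-file densities or over pooled totals; the sketch below assumes the former. The record layout, function name, threshold, and data are invented for illustration.

```python
def average_defect_density(files, xmin):
    """files: iterable of (metric_value, defects, loc) triples.

    Returns the mean of per-file densities (defects / loc) below and
    above xmin, or None for a side with no usable files.
    """
    densities = {"below": [], "above": []}
    for metric, defects, loc in files:
        if loc <= 0:
            continue  # skip empty files to avoid dividing by zero
        side = "above" if metric > xmin else "below"
        densities[side].append(defects / loc)
    return {
        side: (sum(vals) / len(vals) if vals else None)
        for side, vals in densities.items()
    }


# Invented records: (normalized metric, number of defects, LOC).
records = [(0.1, 1, 100), (0.2, 0, 50), (0.9, 4, 200), (1.5, 3, 100)]
print(average_defect_density(records, xmin=0.5))
```

Averaging per-file densities gives every file equal weight; pooling defects and LOC before dividing would instead weight large files more heavily, so the two choices can give different numbers.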
  17. Outline (repeated as section divider): 1 Some background; 2 Statistical properties of software metrics; 3 Evidence of impact on quality; 4 Summary of findings and further work
  18. Summary and further work
      Summary of preliminary findings:
        • Some metrics have a transition from lognormal to power law
        • Clear relation between normalized metrics and defect density
        • Although the threshold might not be perfect (e.g., you might find a high defect density in a file on the lower side), it greatly reduces the search space for potentially problematic files
      Further work:
        • Verify in more projects. Do you have defect data at the file level?
        • Find an explanation for the transition and its influence on quality
        • How do the statistical parameters change over time? Do defects evolve accordingly?
