Software size distribution - Why we always underestimate software cost

Why we always underestimate software cost.

Presentation of the paper "On the distribution of source code file sizes", accepted for ICSOFT 2011 http://www.icsoft.org

Preprint available at http://oa.upm.es/6791/

Published in: Education, Technology

  1. 1. On the distribution of source code file sizes ● Israel Herraiz – Universidad Politécnica de Madrid, Spain ● Daniel German – University of Victoria, Canada ● Ahmed E. Hassan – Queen's University, Canada ● ICSOFT 2011, Sevilla, July 19th 2011 ● Preprint available at http://oa.upm.es/6791/ ● This presentation is available at http://slideshare.net/herraiz/on-the-distribution-of-source-code-file-sizes
  2. 2. Software size ● Important metric ● Estimation of effort and cost ● Examples ● COCOMO ● $\mathrm{Effort} = a \cdot \mathrm{KLOC}^{b}$ ● $\mathrm{Time} = c \cdot \mathrm{Effort}^{d}$ ● $\mathrm{People} = \mathrm{Effort} / \mathrm{Time}$ ● Function points ● Guess size and you will find out software cost
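
As a rough illustration of the COCOMO relations on slide 2, the sketch below evaluates them in Python. The default coefficients are the commonly quoted Basic COCOMO "organic" values (a = 2.4, b = 1.05, c = 2.5, d = 0.38); they are an assumption here, since the slide leaves a, b, c and d symbolic, and the function name `cocomo` is hypothetical.

```python
def cocomo(kloc, a=2.4, b=1.05, c=2.5, d=0.38):
    """Evaluate the three COCOMO relations from the slide.

    The default coefficients are an assumption (Basic COCOMO, organic
    mode); the slide itself only gives the symbolic form.
    """
    effort = a * kloc ** b      # Effort = a * KLOC^b   (person-months)
    time = c * effort ** d      # Time   = c * Effort^d (months)
    people = effort / time      # People = Effort / Time
    return effort, time, people


effort, time, people = cocomo(10.0)  # a hypothetical 10 KLOC project
print(f"effort={effort:.1f} PM, time={time:.1f} months, people={people:.1f}")
```

Because size enters as a power with b > 1, any underestimate of KLOC propagates to a proportionally larger underestimate of effort, which is why the size distribution matters for cost.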
  3. 3. Goal of this paper ● Find out the statistical distribution of software size ● Why is the distribution important? ● Estimate overall size ● Estimate the number of modules within a given size range ● Image extracted from http://en.wikipedia.org/wiki/Normal_distribution
  4. 4. The lognormal distribution ● Software size is believed to follow a lognormal distribution ● “The Distribution of Program Sizes and Its Implications: An Eclipse Case Study”. Hongyu Zhang, Hee Beng Kuan Tan, Michele Marchesi ● http://arxiv.org/abs/0905.2288 ● Images extracted from http://en.wikipedia.org/wiki/Lognormal_distribution
  5. 5. Lognormal vs. Double Pareto ● Contrary to previous results, we have found that software size follows a double Pareto distribution ● Large files are found more often than predicted by the lognormal distribution ● Image extracted from http://en.wikipedia.org/wiki/Lognormal_distribution
  6. 6. How did we find out? ● By measuring Debian 5.0.2, about 1.4M source code files ● Measured SLOC for different programming languages
  7. 7. The numbers
  8. 8. Estimation of the density function ● Looks lognormally distributed ● Figure is in logarithmic scale
  9. 9. But is it normal? ● Graphical normality test ● Compare the quantiles of the sample with the ideal quantiles of a normal distribution ● Quite lognormal, except for the tails
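
The graphical normality test on slide 9 can be sketched as a quantile-quantile comparison on the logarithm of the sizes. This is only a minimal illustration using the Python standard library: the sample is synthetic (not the Debian measurements), and the helper name `lognormal_qq` and the plotting positions `(i + 0.5) / n` are choices made for this example.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def lognormal_qq(sizes):
    """Return (theoretical, sample) quantile pairs for log(size).

    If the sizes are lognormal, the pairs fall close to the line y = x.
    """
    logs = sorted(math.log(s) for s in sizes)
    n = len(logs)
    fitted = NormalDist(mean(logs), stdev(logs))
    # Plotting positions (i + 0.5) / n stay strictly inside (0, 1).
    theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, logs))


random.seed(0)
# Synthetic stand-in for file sizes; the parameters are hypothetical.
sample = [random.lognormvariate(4.0, 1.0) for _ in range(2000)]
pairs = lognormal_qq(sample)

# Inspect the deviation from the diagonal in the body and in the upper tail.
body = pairs[len(pairs) // 4: 3 * len(pairs) // 4]
print("max body deviation:", max(abs(t - s) for t, s in body))
print("largest-file deviation:", pairs[-1][1] - pairs[-1][0])
```

With real data showing the behaviour described on the slide, the body deviations would stay small while the points in the tails would drift away from the diagonal.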
  10. 10. The density function from another point of view ● Complementary cumulative distribution function (CCDF) ● Easier to find shapes of known statistical distributions
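
For reference, an empirical CCDF like the one slide 10 refers to can be computed in a few lines. The sketch below uses the convention P(X ≥ x); the function name `empirical_ccdf` and the toy input are illustrative only.

```python
def empirical_ccdf(sizes):
    """Return sorted (x, P(X >= x)) pairs, one per distinct value."""
    data = sorted(sizes)
    n = len(data)
    points = []
    for i, x in enumerate(data):
        # Emit a point only at the first occurrence of each value, where
        # n - i observations are greater than or equal to x.
        if i == 0 or x != data[i - 1]:
            points.append((x, (n - i) / n))
    return points


print(empirical_ccdf([1, 2, 2, 4]))
# -> [(1, 1.0), (2, 0.75), (4, 0.25)]
```

Plotted on log-log axes, a power law appears as a straight line in the CCDF, while a lognormal curves downward, which is what makes the shapes easier to tell apart than in a density plot.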
  11. 11. Conclusions so far
  12. 12. Conclusions so far ● The shape of the actual distribution is very close to lognormal ● However, the match is not clear enough ● The tails deviate from lognormality, and do not show a clear shape in the CCDF plot
  13. 13. But are the tails important? ● The tails are only a minority of files ● We will come back to this plot later
  14. 14. Impact of the minority ● But the tails are an immense minority ● Impact of large files in the tails on the overall size of the system
  15. 15. Model fitting ● Two-part model fitting ● Lognormal ● Straightforward procedure ● The tails ● Probably power laws, a less straightforward procedure ● How do we decide where the lognormal body ends and the tails begin?
  16. 16. Maximum likelihood power law fitting ● Fitting power laws to empirical data ● Clauset et al. “Power-law distributions in empirical data” ● http://www.santafe.edu/~aaronc/powerlaws/ ● Estimate the parameters that minimize the Kolmogorov-Smirnov distance ● Maximum vertical distance in the CCDF between model and data ● Calculates a threshold value below which the data deviate from the power law model
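
The procedure on slide 16 can be sketched as follows. In the Clauset et al. approach, the exponent is estimated by maximum likelihood for each candidate threshold, and the threshold with the smallest Kolmogorov-Smirnov distance is kept. The code below is a simplified illustration, not the authors' implementation: it uses the continuous approximation of the estimator (SLOC counts are really discrete), a synthetic sample, and hypothetical names such as `fit_power_law_tail` and `min_tail`.

```python
import math
import random


def fit_power_law_tail(data, min_tail=50):
    """Return (xmin, alpha, ks) for the best-fitting power law tail.

    Continuous approximation: p(x) ~ x^(-alpha) for x >= xmin.
    """
    xs = sorted(data)
    best = None
    for start in range(len(xs) - min_tail + 1):
        xmin = xs[start]
        if start > 0 and xs[start - 1] == xmin:
            continue  # this threshold value was already tried
        tail = xs[start:]
        n = len(tail)
        log_sum = sum(math.log(x / xmin) for x in tail)
        if log_sum <= 0:
            continue  # degenerate tail: all values equal xmin
        alpha = 1.0 + n / log_sum  # maximum likelihood estimate
        # Kolmogorov-Smirnov distance between empirical and model CDFs.
        ks = 0.0
        for i, x in enumerate(tail):
            model_cdf = 1.0 - (x / xmin) ** (1.0 - alpha)
            ks = max(ks, abs(model_cdf - i / n), abs(model_cdf - (i + 1) / n))
        if best is None or ks < best[2]:
            best = (xmin, alpha, ks)
    return best


random.seed(42)
# Synthetic pure power law with alpha = 2.5 (paretovariate shape 1.5).
sample = [10.0 * random.paretovariate(1.5) for _ in range(1000)]
xmin, alpha, ks = fit_power_law_tail(sample, min_tail=100)
print(f"xmin={xmin:.1f} alpha={alpha:.2f} ks={ks:.3f}")
```

The scan is quadratic in the number of observations, which is fine for a demonstration but would need a faster implementation for a corpus of about 1.4M files.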
  17. 17. Example of model fitting ● The data and two models in the CCDF plot ● Showing only Lisp source code files
  18. 18. Results for all the languages ● Two languages do not have power law tails ● Shell and Perl
  19. 19. What about the lognormal body? ● Shell and Perl do not fit the lognormal model well either
  20. 20. Timeout! Conclusions so far
  21. 21. Conclusions so far ● Lognormal body + power law tail ● C, C++, Java, Python and Lisp ● Unknown distribution ● Shell and Perl ● Large files are more frequent than predicted by a lognormal model
  22. 22. Using the threshold value to show the impact of large files ● Even though large files are very scarce, they account for a large part of the overall size
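
The measurement behind slide 22 amounts to comparing two fractions: the share of files at or above the threshold, and the share of total size those files hold. A minimal sketch, with made-up numbers rather than the paper's data and a hypothetical helper name `tail_share`:

```python
def tail_share(sizes, threshold):
    """Return (fraction of files, fraction of total size) at or above threshold."""
    tail = [s for s in sizes if s >= threshold]
    return len(tail) / len(sizes), sum(tail) / sum(sizes)


# Made-up example: 98 small files and 2 very large ones.
sizes = [10] * 98 + [1000, 2000]
file_fraction, size_fraction = tail_share(sizes, threshold=500)
print(f"{file_fraction:.0%} of files hold {size_fraction:.0%} of the total size")
```

In this toy case, 2% of the files account for roughly three quarters of the total size, the same kind of imbalance the slide describes.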
  23. 23. Estimation errors using double Pareto and lognormal models ● The impact of large files causes a large error in the prediction of the lognormal model ● Showing relative error for Lisp
  24. 24. So what? ● Estimation techniques based on lognormal size models will always underestimate the size of software ● Because they underestimate the number of large files ● And large files have an impact >30% on the overall size
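
To see why a pure lognormal model predicts too few large files, the sketch below compares the CCDF of a lognormal with that of a model that keeps the lognormal body but continues with a power law above a threshold (the "lognormal body + power law tail" shape from slide 21). All parameter values (`MU`, `SIGMA`, `XMIN`, `ALPHA`) are hypothetical and chosen only for illustration; they are not fitted values from the paper.

```python
import math
from statistics import NormalDist

MU, SIGMA = 4.0, 1.0       # hypothetical lognormal parameters (log scale)
XMIN, ALPHA = 500.0, 2.5   # hypothetical tail threshold and exponent


def lognormal_ccdf(x, mu=MU, sigma=SIGMA):
    """P(X >= x) under a pure lognormal model."""
    return 1.0 - NormalDist(mu, sigma).cdf(math.log(x))


def body_plus_tail_ccdf(x, xmin=XMIN, alpha=ALPHA):
    """Lognormal below xmin, power law above it, continuous at xmin."""
    if x < xmin:
        return lognormal_ccdf(x)
    return lognormal_ccdf(xmin) * (x / xmin) ** (1.0 - alpha)


for x in (500.0, 1000.0, 5000.0, 20000.0):
    ratio = body_plus_tail_ccdf(x) / lognormal_ccdf(x)
    print(f"x={x:>7.0f}  power-law tail / lognormal = {ratio:.1f}")
```

With these parameters the two models agree at the threshold, and the ratio grows quickly as x increases, so the lognormal model assigns far less probability to very large files.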
  25. 25. Any more juice extracted from these oranges? ● More fuel for the programming languages holy war ● The power law parameters could be related to the properties of the different programming languages
  26. 26. And what about Shell and Perl? ● These languages are used to a great extent for package maintenance activities in Debian ● So they are of a different nature ● Does it mean that double Pareto is the signature of the programming process?
  27. 27. Further work ● Analysis over time ● How do files reach the threshold value? ● What happens when files get large? Do they split? Are they abandoned? ● Application domains ● How do the power law parameters change with the application domain? ● Can we find more “non-double Pareto” languages?
  28. 28. Take away ● Software size is an important metric (effort, cost) ● Size is not lognormal, it is double Pareto ● Lognormal models underestimate software size by design ● Double Pareto as the signature of programming? ● Preprint available at http://oa.upm.es/6791/
