Statistical distributions of software metrics: do
                      they matter?

                                     Israel Herraiz

                          Technical University of Madrid


                         israel.herraiz@upm.es


                               Grab these slides from
     http://slideshare.net/herraiz/statistical-distributions-of-metrics




Israel Herraiz, UPM       Statistical distributions of software metrics: do they matter?   1/17
Outline



1    Some background


2    Statistical properties of software metrics


3    Evidence of impact on quality


4    Summary of findings and further work




A (not so) long time ago...



Statistical distribution of software metrics
Software size follows a double Pareto distribution
Towards a theoretical model for software growth MSR 2007

More recently
Not only size, but some OO metrics too (and some complexity metrics)
On the Statistical Distribution of Object-Oriented System Properties WETSoM 2012




OK, but what is that double Pareto thing?
[Figure: P[X > x] against SLOC on logarithmic axes (SLOC from 1 to 10000, P[X > x] from 1e+00 down to 1e−04); curves: Data, Double Pareto, Lognormal]
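
The "Data" curve in a plot of this kind is an empirical complementary cumulative distribution function. A minimal sketch of how such a curve could be computed from per-file SLOC counts is given here; the function name `empirical_ccdf` and the sample values are illustrative assumptions, not taken from the slides.

```python
from collections import Counter


def empirical_ccdf(sizes):
    """Return (x, P[X > x]) pairs, one per distinct value in sizes."""
    n = len(sizes)
    counts = Counter(sizes)
    points = []
    remaining = n  # observations greater than or equal to the current x
    for x in sorted(counts):
        remaining -= counts[x]  # now: observations strictly greater than x
        points.append((x, remaining / n))
    return points


# Hypothetical SLOC counts for a handful of files (not real data).
sloc = [10, 10, 50, 200, 200, 1000, 12000]
for x, p in empirical_ccdf(sloc):
    print(f"{x:>6} {p:.3f}")
```

The largest value always gets probability 0, so that last point has to be dropped before drawing the curve on a logarithmic axis.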
But does it matter?




Most of the files are on the lognormal side
[Figure: bar chart of % Files per language (C, C++, Java, Python, Lisp)]

But the power law minority matters a lot
[Figure: bar chart of % SLOC per language (C, C++, Java, Python, Lisp)]
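
A minimal sketch of how the two percentages could be computed for one system, given per-file SLOC counts and a threshold `xmin`. The function name `tail_shares`, the choice of counting a file as "above" when its size is at least `xmin`, and the sample numbers are assumptions made for illustration only.

```python
def tail_shares(sizes, xmin):
    """Return (% of files, % of SLOC) contributed by files with size >= xmin."""
    tail = [s for s in sizes if s >= xmin]
    pct_files = 100.0 * len(tail) / len(sizes)
    pct_sloc = 100.0 * sum(tail) / sum(sizes)
    return pct_files, pct_sloc


# Hypothetical system: 99 files of 100 SLOC and one file of 6600 SLOC.
sizes = [100] * 99 + [6600]
pct_files, pct_sloc = tail_shares(sizes, xmin=1000)
print(f"{pct_files:.1f}% of the files hold {pct_sloc:.1f}% of the SLOC")
```

In this made-up system a single large file is 1% of the files but holds 40% of the code, which is the kind of imbalance the two charts contrast.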




Large files have a large impact

Size estimation models
Some software size estimation models are based on the log-normality of size
metrics. These models systematically underestimate the size of software.

[Figure: RE against SLOC, one panel each for C, C++, Java and Python; RE axis from −100 to 50]



On the distribution of source code file sizes ICSOFT 2011
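
To see why a lognormal assumption can underestimate size, here is a small sketch, not the model evaluated in the paper. It assumes RE stands for relative error, with negative values meaning underestimation, and it estimates the total size as the number of files times the mean of a lognormal fitted to the file sizes. All names and sample values are hypothetical.

```python
import math


def lognormal_total_estimate(sizes):
    """Estimate total size as n * E[X] under a lognormal fit to the sizes."""
    logs = [math.log(s) for s in sizes]
    n = len(logs)
    mu = sum(logs) / n
    var = sum((v - mu) ** 2 for v in logs) / n  # maximum-likelihood variance
    return n * math.exp(mu + var / 2.0)


def relative_error(estimate, actual):
    """Relative error in percent; negative values mean underestimation."""
    return 100.0 * (estimate - actual) / actual


# Hypothetical system: 99 small files and one very large one (not real data).
sizes = [10] * 99 + [100_000]
re_pct = relative_error(lognormal_total_estimate(sizes), sum(sizes))
print(f"RE = {re_pct:.1f}%")
```

With one very large file in the sample, the fitted lognormal mean falls far short of the actual average, so the relative error is strongly negative.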
Parameters of the statistical distribution

Power law parameters: λ and xmin
Transition from lognormal to power law
[Figure: P[X > x] against SLOC on logarithmic axes (SLOC from 1 to 10000, P[X > x] from 1e+00 down to 1e−04); curves: Data, Double Pareto, Lognormal]
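
A minimal sketch of how the exponent could be estimated once xmin is known. It assumes that λ is the exponent of the tail in the form P[X > x] = (x / xmin)^(−λ), which the slides do not state explicitly, and it uses the standard maximum-likelihood estimator for that form. Choosing xmin itself is a separate problem that this sketch does not address; `fit_tail_exponent` is a hypothetical name.

```python
import math


def fit_tail_exponent(values, xmin):
    """Maximum-likelihood exponent for a tail with P[X > x] = (x / xmin) ** -lam."""
    tail = [v for v in values if v >= xmin]
    log_sum = sum(math.log(v / xmin) for v in tail)
    if log_sum == 0:
        raise ValueError("need at least one value strictly above xmin")
    return len(tail) / log_sum


# Hypothetical metric values; only those at or above xmin enter the fit.
values = [5, 50, 100 * math.exp(0.5), 100 * math.exp(0.5)]
lam = fit_tail_exponent(values, xmin=100)
print(f"lambda = {lam:.3f}")
```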

Probability of finding defects


We have seen that files above xmin account for 40% of the total size while
being only about 1% of the files.
What about defects? Probability of finding defects in three software
projects (using CYCLO as the metric):

                      Project        Below xmin    Above xmin
                      Apache         0.4178        0.7708
                      OpenIntents    0.2500        0.7500
                      Zxing          0.2143        0.4161

* Data extracted from “ReLink: Recovering Links between Bugs and Changes” FSE
2011.
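
A minimal sketch of how such probabilities could be computed from file-level data. It assumes that the "probability of finding defects" is the share of files in a group that have at least one defect, and that a file counts as "above" when its metric is at least xmin; neither detail is stated in the slides. The function name and the sample records are hypothetical.

```python
def defect_probability(files, xmin):
    """Return (P[defect | metric < xmin], P[defect | metric >= xmin])."""
    below = [defects for metric, defects in files if metric < xmin]
    above = [defects for metric, defects in files if metric >= xmin]

    def share_defective(group):
        # Fraction of files in the group with at least one defect.
        if not group:
            return float("nan")
        return sum(1 for d in group if d > 0) / len(group)

    return share_defective(below), share_defective(above)


# Hypothetical (CYCLO, number of defects) pairs, one per file (not real data).
files = [(3, 0), (5, 1), (8, 0), (12, 0), (20, 1),
         (40, 2), (55, 0), (90, 1), (120, 3), (200, 1)]
p_below, p_above = defect_probability(files, xmin=30)
print(f"below xmin: {p_below:.2f}, above xmin: {p_above:.2f}")
```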



Probability of finding defects




Probability of finding defects (normalized metrics)
Using CYCLO / WMC as the metric (cyclomatic complexity per LOC)

                      Project        Below xmin    Above xmin
                      Apache         0.4159        0.6296
                      OpenIntents    0.2813        0.5417
                      Zxing          0.3181        0.2389
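
One way to compare the CYCLO table with the normalized one is the ratio of the probability above xmin to the probability below it. The sketch below copies the values from the two tables; the ratio itself is an illustrative summary added here, not a figure reported in the slides.

```python
# Probabilities copied from the two tables: (below xmin, above xmin).
raw = {
    "Apache": (0.4178, 0.7708),
    "OpenIntents": (0.2500, 0.7500),
    "Zxing": (0.2143, 0.4161),
}
normalized = {
    "Apache": (0.4159, 0.6296),
    "OpenIntents": (0.2813, 0.5417),
    "Zxing": (0.3181, 0.2389),
}


def above_to_below_ratio(table):
    """Ratio of the defect probability above xmin to the one below xmin."""
    return {project: above / below for project, (below, above) in table.items()}


raw_ratio = above_to_below_ratio(raw)
norm_ratio = above_to_below_ratio(normalized)
for project in raw:
    print(f"{project:<12} CYCLO: {raw_ratio[project]:.2f}  "
          f"normalized: {norm_ratio[project]:.2f}")
```

The ratio stays above 1 for Apache and OpenIntents in both tables, while for Zxing it drops below 1 once the metric is normalized.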




Probability of finding defects

Defects density (only pre-release defects)
Using Number of Methods and number of pre-release defects per LOC

[Figure: two histograms, one for files below xmin (horizontal axis 0 to 10, counts up to 12000) and one for files above xmin (horizontal axis 0 to 0.35, counts up to 300)]

                      Below xmin: Avg. Dens. = .2685
                      Above xmin: Avg. Dens. = .4565

* Data obtained from “Predicting Defects for Eclipse” PROMISE 2007
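
A minimal sketch of how an average defect density per group could be computed. It assumes the average is the mean of the per-file values defects / LOC; the slides do not say whether the average is taken per file or over pooled totals. The function name and the sample records are hypothetical.

```python
def average_defect_density(files, xmin):
    """Mean per-file defects/LOC for files below xmin and at or above xmin."""
    groups = {"below": [], "above": []}
    for metric, defects, loc in files:
        key = "above" if metric >= xmin else "below"
        groups[key].append(defects / loc)
    return {
        key: (sum(vals) / len(vals) if vals else float("nan"))
        for key, vals in groups.items()
    }


# Hypothetical (number of methods, pre-release defects, LOC) per file.
files = [(5, 1, 100), (8, 0, 200), (40, 6, 300), (60, 2, 400)]
density = average_defect_density(files, xmin=30)
print(density)
```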

Probability of finding defects

Defects density (only post-release defects)
Using Number of Methods and number of post-release defects per LOC

[Figure: two histograms, one for files below xmin (horizontal axis 0 to 10, counts up to 12000) and one for files above xmin (horizontal axis 0 to 0.35, counts up to 300)]

                      Below xmin: Avg. Dens. = .1437
                      Above xmin: Avg. Dens. = .2690

Probability of finding defects
Defects density (pre + post-release defects)
Using CYCLO/SLOC and number of total defects per LOC

[Figure: two panels on logarithmic axes. Left panel (Below xmin): Pr(X ≥ x) from 10^0 down to 10^−4 against x from 10^−1 to 10^5. Right panel (Above xmin): vertical axis from 10^−1 to 10^3, horizontal axis from 10^−1 to 10^5]

                      Below xmin: Avg. Dens. = .3335 (>9000 files)
                      Above xmin: Avg. Dens. = .7747 (364 files)
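
The figures on this slide allow a quick back-of-the-envelope check of how small the group above xmin is. The sketch below uses only the numbers reported here; because the slide gives the count below xmin as ">9000", the computed share of files is an upper bound. Variable names are illustrative.

```python
below_files_min = 9000  # the slide reports ">9000 files", so this is a lower bound
above_files = 364       # files above xmin
dens_below, dens_above = 0.3335, 0.7747  # average densities from the slide

# The below-xmin count is only a lower bound, so this share is an upper bound.
max_share_above = above_files / (below_files_min + above_files)
density_ratio = dens_above / dens_below

print(f"files above xmin: at most {100 * max_share_above:.2f}% of all files")
print(f"average density above / below: {density_ratio:.2f}")
```

The files above xmin are therefore under 4% of the files, yet their average defect density is more than twice that of the rest.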
Summary and further work

Summary of preliminary findings
    Some metrics have a transition from lognormal to power law
    Clear relation between normalized metrics and defects density
    Although the threshold might not be perfect (e.g., you might find a
    high defects density in a lower side file), it greatly reduces the search
    space for potentially problematic files

Further work
    Verify in more projects
        Do you have defects data at the file level?
    Find explanation for the transition and its influence on quality
    How do the statistical parameters change over time? Do defects
    evolve accordingly?

Israel Herraiz, UPM           Statistical distributions of software metrics: do they matter?   17/17

More Related Content

Similar to Statistical Distribution of Metrics

(ATS3-PLAT01) Recent developments in Pipeline Pilot
(ATS3-PLAT01) Recent developments in Pipeline Pilot(ATS3-PLAT01) Recent developments in Pipeline Pilot
(ATS3-PLAT01) Recent developments in Pipeline Pilot
BIOVIA
 
2011/2012 CAST report on Application Software Quality (CRASH)
2011/2012 CAST report on Application Software Quality (CRASH)2011/2012 CAST report on Application Software Quality (CRASH)
2011/2012 CAST report on Application Software Quality (CRASH)
CAST
 
Software Cost Contingency Development
Software Cost Contingency DevelopmentSoftware Cost Contingency Development
Software Cost Contingency Development
skillern
 
The Explosion of Petascale in the Race to Exascale
The Explosion of Petascale in the Race to ExascaleThe Explosion of Petascale in the Race to Exascale
The Explosion of Petascale in the Race to Exascale
Intel IT Center
 
Hedge Fund IT Challenges Financial Survey
Hedge Fund IT Challenges Financial SurveyHedge Fund IT Challenges Financial Survey
Hedge Fund IT Challenges Financial Survey
Avere Systems
 
Dallas Meloon BI
Dallas Meloon   BIDallas Meloon   BI
Dallas Meloon BI
Dallas_Meloon
 
WETSoM 2011
WETSoM 2011WETSoM 2011
WETSoM 2011
Bogdan Vasilescu
 
Revolution R Enterprise - 100% R and More Webinar Presentation
Revolution R Enterprise - 100% R and More Webinar PresentationRevolution R Enterprise - 100% R and More Webinar Presentation
Revolution R Enterprise - 100% R and More Webinar Presentation
Revolution Analytics
 
Introduction to Performance Testing Part 1
Introduction to Performance Testing Part 1Introduction to Performance Testing Part 1
Introduction to Performance Testing Part 1
C.T.Co
 
Data visualization short v1.1
Data visualization short v1.1Data visualization short v1.1
Data visualization short v1.1
Adam Winkler
 
C3 Citrix Cloud Center
C3 Citrix Cloud CenterC3 Citrix Cloud Center
C3 Citrix Cloud Center
Rui Lopes
 
Aggregating API Services with an API Gateway (BFF)
Aggregating API Services with an API Gateway (BFF)Aggregating API Services with an API Gateway (BFF)
Aggregating API Services with an API Gateway (BFF)
José Roberto Araújo
 
BPMN Usage Survey: Results
BPMN Usage Survey: ResultsBPMN Usage Survey: Results
BPMN Usage Survey: Results
Michele Chinosi
 
5 APM and Capacity Planning Imperatives for a Virtualized World
5 APM and Capacity Planning Imperatives for a Virtualized World5 APM and Capacity Planning Imperatives for a Virtualized World
5 APM and Capacity Planning Imperatives for a Virtualized World
Correlsense
 
Xen.org: The past, the present and exciting Future
Xen.org: The past, the present and exciting FutureXen.org: The past, the present and exciting Future
Xen.org: The past, the present and exciting Future
The Linux Foundation
 
Introduction to MATLAB
Introduction to MATLABIntroduction to MATLAB
Introduction to MATLAB
Ashish Meshram
 
201103 cuore forms2_adf v0.2
201103 cuore forms2_adf v0.2201103 cuore forms2_adf v0.2
201103 cuore forms2_adf v0.2Pedro Gallardo
 
Simple is Not Necessarily Better: Why Software Productivity Factors Can Lead...
Simple is Not Necessarily Better:  Why Software Productivity Factors Can Lead...Simple is Not Necessarily Better:  Why Software Productivity Factors Can Lead...
Simple is Not Necessarily Better: Why Software Productivity Factors Can Lead...
Michael Gallo
 
Empirical evaluation in 2020: how big, how beautiful?
Empirical evaluation in 2020: how big, how beautiful?Empirical evaluation in 2020: how big, how beautiful?
Empirical evaluation in 2020: how big, how beautiful?
Massimiliano Di Penta
 

Similar to Statistical Distribution of Metrics (20)

(ATS3-PLAT01) Recent developments in Pipeline Pilot
(ATS3-PLAT01) Recent developments in Pipeline Pilot(ATS3-PLAT01) Recent developments in Pipeline Pilot
(ATS3-PLAT01) Recent developments in Pipeline Pilot
 
2011/2012 CAST report on Application Software Quality (CRASH)
2011/2012 CAST report on Application Software Quality (CRASH)2011/2012 CAST report on Application Software Quality (CRASH)
2011/2012 CAST report on Application Software Quality (CRASH)
 
Software Cost Contingency Development
Software Cost Contingency DevelopmentSoftware Cost Contingency Development
Software Cost Contingency Development
 
The Explosion of Petascale in the Race to Exascale
The Explosion of Petascale in the Race to ExascaleThe Explosion of Petascale in the Race to Exascale
The Explosion of Petascale in the Race to Exascale
 
Hedge Fund IT Challenges Financial Survey
Hedge Fund IT Challenges Financial SurveyHedge Fund IT Challenges Financial Survey
Hedge Fund IT Challenges Financial Survey
 
Dallas Meloon BI
Dallas Meloon   BIDallas Meloon   BI
Dallas Meloon BI
 
WETSoM 2011
WETSoM 2011WETSoM 2011
WETSoM 2011
 
Itn no 06 06 application vendor evaluation matrix
Itn no 06 06 application vendor evaluation matrixItn no 06 06 application vendor evaluation matrix
Itn no 06 06 application vendor evaluation matrix
 
Revolution R Enterprise - 100% R and More Webinar Presentation
Revolution R Enterprise - 100% R and More Webinar PresentationRevolution R Enterprise - 100% R and More Webinar Presentation
Revolution R Enterprise - 100% R and More Webinar Presentation
 
Introduction to Performance Testing Part 1
Introduction to Performance Testing Part 1Introduction to Performance Testing Part 1
Introduction to Performance Testing Part 1
 
Data visualization short v1.1
Data visualization short v1.1Data visualization short v1.1
Data visualization short v1.1
 
C3 Citrix Cloud Center
C3 Citrix Cloud CenterC3 Citrix Cloud Center
C3 Citrix Cloud Center
 
Aggregating API Services with an API Gateway (BFF)
Aggregating API Services with an API Gateway (BFF)Aggregating API Services with an API Gateway (BFF)
Aggregating API Services with an API Gateway (BFF)
 
BPMN Usage Survey: Results
BPMN Usage Survey: ResultsBPMN Usage Survey: Results
BPMN Usage Survey: Results
 
5 APM and Capacity Planning Imperatives for a Virtualized World
5 APM and Capacity Planning Imperatives for a Virtualized World5 APM and Capacity Planning Imperatives for a Virtualized World
5 APM and Capacity Planning Imperatives for a Virtualized World
 
Xen.org: The past, the present and exciting Future
Xen.org: The past, the present and exciting FutureXen.org: The past, the present and exciting Future
Xen.org: The past, the present and exciting Future
 
Introduction to MATLAB
Introduction to MATLABIntroduction to MATLAB
Introduction to MATLAB
 
201103 cuore forms2_adf v0.2
201103 cuore forms2_adf v0.2201103 cuore forms2_adf v0.2
201103 cuore forms2_adf v0.2
 
Simple is Not Necessarily Better: Why Software Productivity Factors Can Lead...
Simple is Not Necessarily Better:  Why Software Productivity Factors Can Lead...Simple is Not Necessarily Better:  Why Software Productivity Factors Can Lead...
Simple is Not Necessarily Better: Why Software Productivity Factors Can Lead...
 
Empirical evaluation in 2020: how big, how beautiful?
Empirical evaluation in 2020: how big, how beautiful?Empirical evaluation in 2020: how big, how beautiful?
Empirical evaluation in 2020: how big, how beautiful?
 

More from Israel Herraiz

intensive metrics software evolution
intensive metrics software evolutionintensive metrics software evolution
intensive metrics software evolution
Israel Herraiz
 
Public Key Cryptography
Public Key CryptographyPublic Key Cryptography
Public Key Cryptography
Israel Herraiz
 
¿MATLAB? Yo uso Octave UPM
¿MATLAB? Yo uso Octave UPM¿MATLAB? Yo uso Octave UPM
¿MATLAB? Yo uso Octave UPM
Israel Herraiz
 
The Ultimate Debian Database
The Ultimate Debian DatabaseThe Ultimate Debian Database
The Ultimate Debian Database
Israel Herraiz
 
Evaluating the presence and impact of bias in bug-fix datasets
Evaluating the presence and impact of bias in bug-fix datasetsEvaluating the presence and impact of bias in bug-fix datasets
Evaluating the presence and impact of bias in bug-fix datasets
Israel Herraiz
 
Software size distribution - Why we always underestimate software cost
Software size distribution - Why we always underestimate software costSoftware size distribution - Why we always underestimate software cost
Software size distribution - Why we always underestimate software cost
Israel Herraiz
 
The dynamics of software evolution - EVOLUMONS 2011
The dynamics of software evolution - EVOLUMONS 2011The dynamics of software evolution - EVOLUMONS 2011
The dynamics of software evolution - EVOLUMONS 2011
Israel Herraiz
 
Public key cryptography
Public key cryptographyPublic key cryptography
Public key cryptographyIsrael Herraiz
 
Mining Software Repositories
Mining Software RepositoriesMining Software Repositories
Mining Software Repositories
Israel Herraiz
 

More from Israel Herraiz (9)

intensive metrics software evolution
intensive metrics software evolutionintensive metrics software evolution
intensive metrics software evolution
 
Public Key Cryptography
Public Key CryptographyPublic Key Cryptography
Public Key Cryptography
 
¿MATLAB? Yo uso Octave UPM
¿MATLAB? Yo uso Octave UPM¿MATLAB? Yo uso Octave UPM
¿MATLAB? Yo uso Octave UPM
 
The Ultimate Debian Database
The Ultimate Debian DatabaseThe Ultimate Debian Database
The Ultimate Debian Database
 
Evaluating the presence and impact of bias in bug-fix datasets
Evaluating the presence and impact of bias in bug-fix datasetsEvaluating the presence and impact of bias in bug-fix datasets
Evaluating the presence and impact of bias in bug-fix datasets
 
Software size distribution - Why we always underestimate software cost
Software size distribution - Why we always underestimate software costSoftware size distribution - Why we always underestimate software cost
Software size distribution - Why we always underestimate software cost
 
The dynamics of software evolution - EVOLUMONS 2011
The dynamics of software evolution - EVOLUMONS 2011The dynamics of software evolution - EVOLUMONS 2011
The dynamics of software evolution - EVOLUMONS 2011
 
Public key cryptography
Public key cryptographyPublic key cryptography
Public key cryptography
 
Mining Software Repositories
Mining Software RepositoriesMining Software Repositories
Mining Software Repositories
 

Recently uploaded

Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Thiyagu K
 
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCECLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
BhavyaRajput3
 
Embracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic ImperativeEmbracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic Imperative
Peter Windle
 
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
Thiyagu K
 
The geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideasThe geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideas
GeoBlogs
 
Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
Vikramjit Singh
 
A Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in EducationA Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in Education
Peter Windle
 
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
MysoreMuleSoftMeetup
 
Lapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdfLapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdf
Jean Carlos Nunes Paixão
 
2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...
Sandy Millin
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
Jisc
 
678020731-Sumas-y-Restas-Para-Colorear.pdf
678020731-Sumas-y-Restas-Para-Colorear.pdf678020731-Sumas-y-Restas-Para-Colorear.pdf
678020731-Sumas-y-Restas-Para-Colorear.pdf
CarlosHernanMontoyab2
 
Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
beazzy04
 
Palestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptxPalestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptx
RaedMohamed3
 
The Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official PublicationThe Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official Publication
Delapenabediema
 
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXXPhrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
MIRIAMSALINAS13
 
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
Levi Shapiro
 
Language Across the Curriculm LAC B.Ed.
Language Across the  Curriculm LAC B.Ed.Language Across the  Curriculm LAC B.Ed.
Language Across the Curriculm LAC B.Ed.
Atul Kumar Singh
 
Acetabularia Information For Class 9 .docx
Acetabularia Information For Class 9  .docxAcetabularia Information For Class 9  .docx
Acetabularia Information For Class 9 .docx
vaibhavrinwa19
 
Additional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdfAdditional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdf
joachimlavalley1
 

Recently uploaded (20)

Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
 
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCECLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
 
Embracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic ImperativeEmbracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic Imperative
 
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
 
The geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideasThe geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideas
 
Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
 
A Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in EducationA Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in Education
 
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
 
Lapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdfLapbook sobre os Regimes Totalitários.pdf
Statistical Distribution of Metrics

  • 1. Statistical distributions of software metrics: do they matter? Israel Herraiz Technical University of Madrid israel.herraiz@upm.es Grab these slides from http://slideshare.net/herraiz/statistical-distributions-of-metrics Israel Herraiz, UPM Statistical distributions of software metrics: do they matter? 1/17
  • 2. Outline: (1) Some background; (2) Statistical properties of software metrics; (3) Evidence of impact on quality; (4) Summary of findings and further work.
  • 3. Section 1: Some background
  • 4. A (not so) long time ago... Statistical distribution of software metrics: software size follows a double Pareto distribution ("Towards a theoretical model for software growth", MSR 2007). More recently: not only size, but some OO metrics too, and some complexity metrics ("On the Statistical Distribution of Object-Oriented System Properties", WETSoM 2012).
  • 5. OK, but what is that double Pareto thing? [Figure: P[X > x] against file size in SLOC on log-log axes, showing the data together with double Pareto and lognormal fits.]
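The plot on this slide is a complementary cumulative distribution, P[X > x], drawn on log-log axes. As a minimal sketch of how such a curve could be computed from raw file sizes (the function name `empirical_ccdf` and the sample SLOC values are illustrative assumptions, not taken from the talk):

```python
from bisect import bisect_right


def empirical_ccdf(sizes):
    """Return (x, P[X > x]) pairs for each distinct value in sizes."""
    ordered = sorted(sizes)
    n = len(ordered)
    points = []
    for x in sorted(set(ordered)):
        # bisect_right gives the number of observations <= x,
        # so the remainder is the number strictly greater than x.
        greater = n - bisect_right(ordered, x)
        points.append((x, greater / n))
    return points


# Hypothetical SLOC counts for seven files (illustrative only).
sloc = [12, 40, 40, 95, 300, 1200, 15000]
for x, p in empirical_ccdf(sloc):
    print(x, round(p, 3))
```

On log-log axes a power-law tail appears as an approximately straight line, which is one informal way to see where the data depart from a lognormal fit.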
  • 6. But does it matter? Most of the files are on the lognormal side. [Figure: bar chart of % Files by language (C, C++, Java, Python, Lisp).]
  • 7. But does it matter? Most of the files are on the lognormal side, but the power law minority matters a lot. [Figure: two bar charts by language (C, C++, Java, Python, Lisp), one of % Files and one of % SLOC.]
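The contrast between the two charts can be checked on any code base by comparing the share of files at or above a size threshold with the share of SLOC those files hold. A rough sketch, assuming an inclusive threshold and made-up file sizes (`tail_shares` is a hypothetical helper, not something from the talk):

```python
def tail_shares(sizes, threshold):
    """Return (fraction of files, fraction of total SLOC) at or above threshold."""
    tail = [s for s in sizes if s >= threshold]
    return len(tail) / len(sizes), sum(tail) / sum(sizes)


# Hypothetical file sizes in SLOC: four small files and one very large one.
sloc = [10, 20, 30, 40, 900]
file_share, sloc_share = tail_shares(sloc, threshold=100)
print(f"{file_share:.0%} of the files hold {sloc_share:.0%} of the SLOC")
```

In this toy data a single file out of five carries most of the code, which is the kind of imbalance the slide points at.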
  • 8. Large files have a large impact. Size estimation models: some software size estimation models are based on the log-normality of size metrics. These models systematically underestimate the size of software. [Figure: four panels (C, C++, Java, Python), each plotting RE against SLOC.] Source: "On the distribution of source code file sizes", ICSOFT 2011.
  • 9. Section 2: Statistical properties of software metrics
  • 10. Parameters of the statistical distribution. Power law parameters: λ and xmin. Transition from lognormal to power law. [Figure: P[X > x] against SLOC on log-log axes, showing the data together with double Pareto and lognormal fits.]
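The slide does not say how the parameters are fitted. One common choice, shown here only as an assumption, is the maximum-likelihood estimator for a continuous power law whose density is proportional to x^(-alpha) for x >= xmin. How this alpha relates to the slide's λ is not stated in the source. `tail_exponent` is a hypothetical name, and xmin is taken as given rather than estimated:

```python
import math


def tail_exponent(values, xmin):
    """Continuous power-law MLE: alpha = 1 + n / sum(ln(x / xmin)) over x >= xmin."""
    tail = [x for x in values if x >= xmin]
    log_sum = sum(math.log(x / xmin) for x in tail)
    if log_sum == 0:
        # Happens when no value lies strictly above xmin.
        raise ValueError("need at least one value strictly above xmin")
    return 1 + len(tail) / log_sum


# Hypothetical metric values; only those at or above xmin enter the estimate.
values = [3, 8, 120, 150, 400, 2500, 9000]
print(round(tail_exponent(values, xmin=100), 3))
```

Choosing xmin itself is a separate problem; this sketch only covers the exponent once a threshold has been fixed.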
  • 11. Section 3: Evidence of impact on quality
  • 12. Probability of finding defects. We have seen that files above xmin account for 40% of total size while being only about 1% of the files. What about defects? Probability of finding defects in three software projects (using CYCLO as metric):
      Project       Below xmin   Above xmin
      Apache        .4178        .7708
      OpenIntents   .2500        .7500
      Zxing         .2143        .4161
    * Data extracted from "ReLink: Recovering Links between Bugs and Changes", FSE 2011.
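A sketch of how the two columns of such a table might be computed, assuming "probability of finding defects" means the fraction of files with at least one linked defect, and that files at exactly xmin count as above it. Both are assumptions; `defect_probability` and the sample data are hypothetical:

```python
def defect_probability(files, xmin):
    """files: iterable of (metric_value, has_defect). Returns (p_below, p_above)."""
    files = list(files)
    below = [defect for metric, defect in files if metric < xmin]
    above = [defect for metric, defect in files if metric >= xmin]

    def share(flags):
        # Fraction of True flags; NaN when the group is empty.
        return sum(flags) / len(flags) if flags else float("nan")

    return share(below), share(above)


# Hypothetical (CYCLO, has_defect) pairs for eight files.
sample = [(1, False), (2, True), (3, False), (4, False),
          (50, True), (60, True), (70, False), (80, True)]
p_below, p_above = defect_probability(sample, xmin=10)
print(p_below, p_above)
```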
  • 13. Probability of finding defects (normalized metrics). Using CYCLO / WMC as metric (cyclomatic complexity per LOC):
      Project       Below xmin   Above xmin
      Apache        .4159        .6296
      OpenIntents   .2813        .5417
      Zxing         .3181        .2389
  • 14. Probability of finding defects. Defect density (only pre-release defects), using Number of Methods and number of pre-release defects per LOC. [Figure: two histograms, one for files below xmin and one for files above xmin.] Avg. dens. = .2685 below xmin, .4565 above xmin. * Data obtained from "Predicting Defects for Eclipse", PROMISE 2007.
  • 15. Probability of finding defects. Defect density (only post-release defects), using Number of Methods and number of post-release defects per LOC. [Figure: two histograms, one for files below xmin and one for files above xmin.] Avg. dens. = .1437 below xmin, .2690 above xmin.
  • 16. Probability of finding defects. Defect density (pre + post-release defects), using CYCLO/SLOC and number of total defects per LOC. [Figure: Pr(X ≥ x) against x on log-log axes, with panels for files below and above xmin.] Avg. dens. = .3335 below xmin (>9000 files), .7747 above xmin (364 files).
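The averages on slides 14 to 16 could be obtained along these lines, assuming "Avg. Dens." is the mean of per-file defects per LOC within each group. This is an assumption, since the source does not define it. `average_density` and the sample tuples are hypothetical:

```python
from statistics import mean


def average_density(files, xmin):
    """files: iterable of (metric_value, defects, loc). Returns (avg_below, avg_above)."""
    files = list(files)
    # Per-file defect density is defects divided by lines of code.
    below = [defects / loc for metric, defects, loc in files if metric < xmin]
    above = [defects / loc for metric, defects, loc in files if metric >= xmin]
    # statistics.mean raises StatisticsError if either group is empty.
    return mean(below), mean(above)


# Hypothetical (normalized metric, defect count, LOC) triples.
sample = [(0.1, 1, 4), (0.2, 3, 4), (5.0, 6, 8)]
print(average_density(sample, xmin=1))
```

Averaging per-file densities weights every file equally; dividing total defects by total LOC in each group would be a different, size-weighted figure.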
  • 17. Section 4: Summary of findings and further work
  • 18. Summary and further work. Summary of preliminary findings: some metrics have a transition from lognormal to power law; there is a clear relation between normalized metrics and defect density; although the threshold might not be perfect (e.g., you might find a high defect density in a file on the lower side), it greatly reduces the search space for potentially problematic files. Further work: verify in more projects (do you have defects data at the file level?); find an explanation for the transition and its influence on quality; how do the statistical parameters change over time, and do defects evolve accordingly?