Size Doesn’t Matter?

On the Value of Software Size
Features for Effort Estimation

   Ekrem Kocaguneli, Tim Menzies : WVU,USA
   Jairus Hihn : JPL, USA
   Byeong Ho Kang : UTAS, Aus
Sept
2012

                           Sound bites

       Size matters!

       But, lack of size features can be tolerated
          • caveat: need to first prune irrelevancies




PROMISE’12                                              2

             Role of Size Features in SEE
       Size features are at the
       heart of some of the most
       widely used SEE methods

                              COCOMO is based on LOC

                  Function points (FP) are based on
                  logical transactions


          Various others exist such as number of
          requirements, number of modules, number of
          web pages and so on…
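
To make the role of a size feature concrete, here is a minimal sketch of the Basic COCOMO form, effort = a * KLOC^b, using the Basic COCOMO coefficients published by Boehm (1981). The function name and the example project sizes are hypothetical and are not taken from the slides.

```python
# Basic COCOMO (Boehm, 1981): effort in person-months as a function of size.
# Each development mode has its own (a, b) coefficient pair.
BASIC_COCOMO = {
    "organic": (2.4, 1.05),
    "semi-detached": (3.0, 1.12),
    "embedded": (3.6, 1.20),
}


def basic_cocomo_effort(kloc, mode="organic"):
    """Return estimated effort (person-months) for a project of `kloc` thousand lines."""
    a, b = BASIC_COCOMO[mode]
    return a * kloc ** b


if __name__ == "__main__":
    # Hypothetical project sizes, for illustration only.
    for size in (10, 50, 100):
        print(size, "KLOC ->", round(basic_cocomo_effort(size), 1), "person-months")
```

Size (KLOC) is the only input here, so a model of this form cannot produce an estimate when the size feature is missing.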

 Role of Size Features in SEE (cont'd)
       Size features have their advantages and disadvantages


         LOC can be counted automatically and is accurate a
         posteriori, but it is difficult to estimate early on


         FP provides a size metric based on early design
         information; hence it is more accurate a priori
         FP cannot be automated and is subjective, although
         training reduces the variation in estimates


             Objections to Size Features
       Although particular size features may have their advantages
       in certain scenarios, there is strong opposition…

 “Measuring software productivity by lines of code is like
 measuring progress on an airplane by how much it weighs.”
 Bill Gates
                  “This (referring to LOC) is a very costly measuring unit because
                  it encourages the writing of insipid code, but today I am less
                  interested in how foolish a unit it is from even a pure business
                  point of view.” E. W. Dijkstra


       So we ask: under what conditions are size features
       actually a “must”, and can we compensate for their absence?





                  So let’s check…

             If we throw away size attributes,
                      what happens?



                    If we remove “size”,
                       what happens?
       Compare standard successful methods run on reduced and
       full data sets, using 7 error measures and 13 data sets…
       Full data sets include size features
       Reduced data sets lack size features

       Methods   Error Measures   Datasets
       CART      MAR              Cocomo81    Nasa93     Sdr
       1NN       MMRE             Cocomo81o   Nasa93c1   Desharnais
                 MdMRE            Cocomo81e   Nasa93c2   DesharnaisL1
                 Pred(25)         Cocomo81s   Nasa93c5   DesharnaisL2
                 MMER                                    DesharnaisL3
                 MBRE
                 MIBRE
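
The seven error measures named above can be sketched as follows, using their usual definitions in the effort estimation literature (MRE = |actual - predicted| / actual, MER divides by the prediction, BRE by the smaller of the two, IBRE by the larger). The function name is hypothetical, and the sketch assumes all actual and predicted efforts are positive.

```python
from statistics import mean, median


def error_measures(actuals, predictions):
    """Compute the seven error measures for paired actual/predicted efforts.

    Assumes every actual and predicted value is strictly positive.
    """
    pairs = list(zip(actuals, predictions))
    ar = [abs(a - p) for a, p in pairs]                      # absolute residuals
    mre = [r / a for r, (a, _) in zip(ar, pairs)]            # relative to actual
    mer = [r / p for r, (_, p) in zip(ar, pairs)]            # relative to estimate
    bre = [r / min(a, p) for r, (a, p) in zip(ar, pairs)]    # balanced relative error
    ibre = [r / max(a, p) for r, (a, p) in zip(ar, pairs)]   # inverted balanced relative error
    return {
        "MAR": mean(ar),
        "MMRE": mean(mre),
        "MdMRE": median(mre),
        # Percentage of estimates whose MRE is at most 25%.
        "Pred(25)": 100.0 * sum(m <= 0.25 for m in mre) / len(mre),
        "MMER": mean(mer),
        "MBRE": mean(bre),
        "MIBRE": mean(ibre),
    }
```

Pred(25) is the only one of the seven for which higher is better; for the other six, lower values indicate better estimates.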
                         Evaluation
                           (cont'd)
       Methods   Error Measures   Datasets
       pop1NN    MAR              Cocomo81    Nasa93     Sdr
       CART      MMRE             Cocomo81o   Nasa93c1   Desharnais
       1NN       MdMRE            Cocomo81e   Nasa93c2   DesharnaisL1
                 Pred(25)         Cocomo81s   Nasa93c5   DesharnaisL2
                 MMER                                    DesharnaisL3
                 MBRE
                 MIBRE

       • Compare pop1NN against CART & 1NN
       • Using 7 error measures
       • On multiple data sets collected via COCOMO, COCOMOII and FP
       • Mann-Whitney 95%
       • Why CART? Dejaeger et al., TSE 2012
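
The slides only state that comparisons use Mann-Whitney at 95% confidence. As a rough sketch of how a win/tie/loss verdict could be derived from two samples of per-project errors, the code below implements a two-sided Mann-Whitney U test with the normal approximation (tie-corrected, no continuity correction) and then compares medians. The function names and the median-based tie-break are assumptions, not the authors' procedure; for real analyses a library routine such as `scipy.stats.mannwhitneyu` is the safer choice.

```python
import math
from statistics import median


def mann_whitney_p(x, y):
    """Two-sided p-value of the Mann-Whitney U test (normal approximation)."""
    n1, n2 = len(x), len(y)
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(combined)
    tie_term = 0.0
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0      # average of 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t ** 3 - t            # tie correction for the variance
        i = j
    r1 = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u1 = r1 - n1 * (n1 + 1) / 2.0
    n = n1 + n2
    mu = n1 * n2 / 2.0
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return 1.0                        # all values identical
    z = (u1 - mu) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2))


def compare(errors_a, errors_b, alpha=0.05):
    """Return 'win', 'tie' or 'loss' for method A versus B (lower error is better)."""
    if mann_whitney_p(errors_a, errors_b) >= alpha:
        return "tie"
    return "win" if median(errors_a) < median(errors_b) else "loss"
```

For a measure where higher is better, such as Pred(25), the win and loss labels would need to be swapped.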
                        Results
       (full data has “size”, reduced has not)
        CART on reduced-dataset
        vs. CART on full-dataset

                                   Last column shows
                                   total loss count of
                                   CART run on reduced
                                   dataset (i.e. no size
                                   features)



                                   In 7 of 13 tests, taking
                                   out size makes CART
                                   perform worse
                        Results
       (full data has “size”, reduced has not)

                            Total loss counts of CART
                            and 1NN run on reduced
                            data vs. their variants run
                            on full data…



                           Standard methods do better when
                           the data sets include size
                           attributes, i.e. they cannot
                           compensate well for the lack of
                           size attributes
           (copied from last slide)




                         New idea

                If we prune data irrelevancies,
             can we survive losing size attributes?




                      Instance selection
       • Chang (1974)
          – Most of the instances are uninformative.
          – Reduced data sets of size 514, 150 and 66 to 34, 14 and 6 prototypes, respectively.
       • Li et al. (2009)
          – genetic algorithm for instance selection
       • Turhan et al. (2009)
          – instance selection as a filter for cross-company defect data
          – See also Kocaguneli et al. 2011
       • Kocaguneli et al. (2011) variance-based selection:
          – Dendrogram of clusters: prune sub-trees with large variances
       • Keung et al.’s (2011) Analogy-X
          – instance selection method for analogous entry
       • New idea, pop1NN: a very simple instance selector


             pop1NN : the urchin shape
  We propose that a “popularity”-based method can
  compensate for the lack of size features

                                 The “popularity” of an instance
                                 is the number of times it is the
                                 nearest neighbor of other
                                 instances

                                          A sea urchin is a good
                                          analogy for SEE data:
                                          a few popular central
                                          instances are the
                                          closest neighbors of many
                                          scattered instances

                     Formally, this is rNN
   • rNN = Reverse Nearest Neighbor
        – E.g. how many residential areas would find a new store to be their nearest choice?
        – E.g. to predict the popularity of a new cell phone plan, determine how many profiles
          have the plan as their best match, compared with the existing plans in the market.

   • Can be computed efficiently (rNN chaining)
        – see Lopez-Sastre et al., “Fast Reciprocal Nearest Neighbors Clustering”,
          Signal Processing, 2012, Vol. 92, pages 270-275
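
The store example above can be sketched as a brute-force reverse nearest neighbor query: count the residential areas that would be closer to the new store than to any existing one. The function name, the Euclidean distance, and the coordinates are illustrative assumptions; this is a plain quadratic scan, not the chaining method cited above.

```python
import math


def reverse_nearest_neighbors(candidate, existing_sites, clients):
    """Return the clients whose nearest site would be `candidate`.

    A client counts only if the candidate is strictly closer than
    every existing site (Euclidean distance).
    """
    won = []
    for client in clients:
        d_new = math.dist(client, candidate)
        if all(d_new < math.dist(client, site) for site in existing_sites):
            won.append(client)
    return won


if __name__ == "__main__":
    stores = [(0, 0), (10, 0)]                       # hypothetical existing stores
    residences = [(1, 0), (4, 0), (6, 0), (9, 0)]    # hypothetical residential areas
    print(reverse_nearest_neighbors((5, 0), stores, residences))
```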






                    So let’s check…

             If we (1) throw away size attributes
                    and (2) irrelevant rows,
                     then what happens?

                          Details:
                       pop1NN (cont'd)
   pop1NN is a 6-step procedure…

        1. Calculate distances between every pair of training instances
        2. Convert the distances of Step 1 into an ordering of neighbors
        3. Mark closest neighbors and calculate popularity
        4. Order training instances by decreasing popularity
        5. Decide which instances to select
           • Experiments with nearest neighbor on a hold-out set
        6. Return estimates for the test instances
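
The six steps can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes numeric independent features, Euclidean distance, mean MRE as the hold-out error, and a caller-supplied hold-out split. All function names are hypothetical.

```python
import math


def popularity_order(rows):
    """Steps 1-4: order row indices by how often each row is another row's nearest neighbor."""
    counts = [0] * len(rows)
    for i, row in enumerate(rows):
        # Steps 1-2: distances from row i to every other row, reduced to its closest neighbor.
        nearest = min((j for j in range(len(rows)) if j != i),
                      key=lambda j: math.dist(row, rows[j]))
        counts[nearest] += 1                      # Step 3: one popularity vote
    # Step 4: decreasing popularity; ties keep their original order.
    order = sorted(range(len(rows)), key=lambda j: -counts[j])
    return order, counts


def one_nn(query, rows, efforts, keep):
    """Return the effort of the row nearest to `query`, restricted to indices in `keep`."""
    best = min(keep, key=lambda j: math.dist(query, rows[j]))
    return efforts[best]


def pop1nn(train_x, train_y, holdout_x, holdout_y, test_x):
    """Return (estimates for test_x, indices of the selected training rows)."""
    order, _ = popularity_order(train_x)

    def holdout_error(n):
        # Mean MRE of 1NN on the hold-out set using the n most popular rows.
        keep = order[:n]
        return sum(abs(y - one_nn(x, train_x, train_y, keep)) / y
                   for x, y in zip(holdout_x, holdout_y)) / len(holdout_y)

    # Step 5: pick the smallest n with the lowest hold-out error.
    best_n = min(range(1, len(order) + 1), key=holdout_error)
    keep = order[:best_n]
    # Step 6: estimate each test instance from its nearest selected row.
    return [one_nn(x, train_x, train_y, keep) for x in test_x], keep
```

Popularity itself is computed from the independent features only; the effort values enter only when choosing how many popular instances to keep and when producing the final estimates.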


                             Results
                         (reduced data)

                                            Loss values of
                                            pop1NN (on reduced
                                            data) vs. CART and
                                            1NN (on full data)




             pop1NN loses 2 out of 13 data sets against 1NN
             pop1NN loses 4 out of 13 data sets against 1NN




             Discussion





                               Conclusions
 Successful methods (1NN & CART) cannot compensate for the
 lack of size attributes very well
        Lack of size features decreases their performance in the majority
        of the data sets

 When 1NN is augmented with a popularity-based preprocessor
 to form pop1NN
        Lack of size features can be tolerated in most of the data sets
        Caveat: need to first prune irrelevancies

 Size features are essential for standard learners
        Practitioners with enough resources to correctly collect size
        features should do so
        In the absence of such resources, pop1NN-like methods can
        compensate for the lack of size features

   Future Work
   • pop1NN as a feature selector?
        – Lipowezky (1998):
             • feature and case selection are
               similar tasks,
             • both remove cells in the
               hypercube of all instances
               times all features.

        – So it should be possible to convert
          a case selection mechanism into a
          feature selector.
             • Transpose data
             • Nearby columns are correlated
             • Keep columns that are near no other column
   • Active learning:
        – pop1NN does not use dependent variable information.
        – It can identify the popular instances of a data set and guide expert reflection on
          which dependent variable information to collect
             Questions?
             Comments?




GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 

Size Doesn’t Matter? On the Value of Software Size Features for Effort Estimation

  • 1. Size Doesn’t Matter? On the Value of Software Size Features for Effort Estimation. Ekrem Kocaguneli, Tim Menzies (WVU, USA); Jairus Hihn (JPL, USA); Byeong Ho Kang (UTAS, Aus)
  • 2. Sept 2012, PROMISE’12. Sound bites: Size matters! But lack of size features can be tolerated (caveat: need to first prune irrelevancies).
  • 3. Role of Size Features in SEE: Size features are at the heart of some of the most widely used software effort estimation (SEE) methods. COCOMO is based on LOC. Function points (FP) are based on logical transactions. Various others exist, such as number of requirements, number of modules, number of web pages and so on.
  • 4. Role of Size Features in SEE (contd.): Size features have their advantages and disadvantages. LOC counting can be automated and is good a posteriori, but LOC is difficult to estimate early on. FP provides a size metric based on early design information, hence it is more accurate a priori. FP cannot be automated and is subjective, even though training reduces the estimate variation.
  • 5. Objections to Size Features: Although particular size features may have their advantages in certain scenarios, there is strong opposition.
      “Measuring software productivity by lines of code is like measuring progress on an airplane by how much it weighs.” (Bill Gates)
      “This (referring to LOC) is a very costly measuring unit because it encourages the writing of insipid code, but today I am less interested in how foolish a unit it is from even a pure business point of view.” (E. W. Dijkstra)
      So we question: under what conditions are size features actually a “must”, and can we compensate for their absence?
  • 6. So let’s check: if we throw away size attributes, what happens?
  • 7. If we remove “size”, what happens? Compare standard successful methods run on reduced and full data sets, using 7 error measures and 13 data sets. Full data sets include size features; reduced data sets lack size features.
      Methods: CART, 1NN
      Error measures: MAR, MMRE, MdMRE, Pred(25), MMER, MBRE, MIBRE
      Datasets: Cocomo81, Cocomo81o, Cocomo81e, Cocomo81s, Nasa93, Nasa93c1, Nasa93c2, Nasa93c5, Sdr, Desharnais, DesharnaisL1, DesharnaisL2, DesharnaisL3
  • 8. Evaluation (contd.): Compare pop1NN against CART & 1NN on multiple data sets collected via COCOMO, COCOMOII and FP, using 7 error measures and Mann-Whitney (95%). Why CART? Dejaeger et al., TSE 2012.
      Methods: pop1NN, CART, 1NN
      Error measures: MAR, MMRE, MdMRE, Pred(25), MMER, MBRE, MIBRE
      Datasets: Cocomo81, Cocomo81o, Cocomo81e, Cocomo81s, Nasa93, Nasa93c1, Nasa93c2, Nasa93c5, Sdr, Desharnais, DesharnaisL1, DesharnaisL2, DesharnaisL3
  • 9. Results (full data has “size”, reduced has not): CART on reduced dataset vs. CART on full dataset. The last column shows the total loss count of CART run on the reduced dataset (i.e. no size features). In 7 of 13 tests, taking out size makes CART perform worse.
  • 10. Results (full data has “size”, reduced has not; copied from last slide): Total loss counts of CART and 1NN run on reduced data vs. their variants run on full data. Standard methods are better off with the size attributes of the data sets, i.e. they cannot compensate well for the lack of size attributes.
  • 11. New idea: if we prune data irrelevancies, can we survive losing size attributes?
  • 12. Instance selection
      • Chang (1974)
          – Most of the instances are uninformative.
          – Reduced data sets of size 514, 150, 66 to 34, 14, 6 prototypes.
      • Li et al. (2009)
          – genetic algorithm for instance selection
      • Turhan et al. (2009)
          – instance selection as a filter for cross-company defect data
          – See also Kocaguneli et al. 2011
      • Kocaguneli et al. (2011), variance-based selection:
          – Dendrogram of clusters: prune sub-trees with large variances
      • Keung et al.’s (2011) Analogy-X
          – instance selection method for analogous entry
      • New idea, pop1NN: a very simple instance selector
  • 13. pop1NN: the urchin shape. We propose that a “popularity”-based method can compensate for the lack of size features. The “popularity” of an instance is the number of times it is the nearest neighbor of other instances. A sea urchin is a good picture of SEE data: popular central instances that are the closest neighbors of scattered instances.
  • 14. Formally, this is rNN
      • rNN = Reverse Nearest Neighbor
          – E.g. how many residential areas would find a new store as their nearest choice.
          – E.g. to predict the popularity of a new cell phone plan, determine how many profiles have the plan as their best match, against the existing plans in the market.
      • Can be computed efficiently (rNN chaining)
          – see Lopez-Sastre et al., Fast Reciprocal Nearest Neighbors Clustering, Signal Processing, 2012, Vol. 92, pages 270-275
  • 15. So let’s check: if we throw away (1) size attributes and (2) irrelevant rows, then what happens?
  • 16. Details: pop1NN (contd.). pop1NN is a 6-step procedure:
      1. Calculate distances between every pair of training instances
      2. Convert the distances of Step 1 into an ordering of neighbors
      3. Mark closest neighbors and calculate popularity
      4. Order training instances in decreasing popularity
      5. Decide which instances to select
          • Experiments with nearest neighbor on a hold-out set
      6. Return estimates for the test instances
  • 17. Results (reduced data): Loss values of pop1NN (on reduced data) vs. CART and 1NN (on full data). pop1NN loses 2 out of 13 data sets against 1NN. pop1NN loses 4 out of 13 data sets against 1NN.
  • 18. Discussion
  • 19. Conclusions
      • Successful methods (1NN & CART) cannot compensate for the lack of size attributes very well
          – Lack of size features decreases their performance in the majority of the data sets
      • When 1NN is augmented with a popularity-based pre-processor to come up with pop1NN
          – Lack of size features can be tolerated in most of the data sets
          – Caveat: need to first prune irrelevancies
      • Size features are essential for standard learners
          – Practitioners with enough resources to correctly collect size features should do so
          – In the absence of such resources, pop1NN-like methods can compensate for the lack of size features
  • 20. Future Work
      • pop1NN as a feature selector?
          – Lipowezky (1998): feature and case selection are similar tasks; both remove cells in the hypercube of all instances times all features.
          – So it should be possible to convert a case selection mechanism into a feature selector:
              • Transpose the data
              • Nearby columns are correlated
              • Keep columns that are near no other
      • Active learning:
          – pop1NN does not use dependent variable information.
          – It can identify the popular instances of a data set and so guide expert reflection on collecting dependent variable information.
  • 21. Questions? Comments?
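
The slides name seven error measures (MAR, MMRE, MdMRE, Pred(25), MMER, MBRE, MIBRE) without defining them. The sketch below assumes the definitions commonly used in the effort estimation literature; the function name `error_measures` and the choice to report Pred(25) as a fraction rather than a percentage are illustrative assumptions, not taken from the talk.

```python
from statistics import mean, median


def error_measures(actual, predicted):
    """Seven error measures for paired actual/predicted efforts.

    Assumes all efforts are strictly positive (they appear as divisors).
    """
    pairs = list(zip(actual, predicted))
    ar = [abs(a - p) for a, p in pairs]                # absolute residual
    mre = [abs(a - p) / a for a, p in pairs]           # relative to the actual
    mer = [abs(a - p) / p for a, p in pairs]           # relative to the estimate
    bre = [abs(a - p) / min(a, p) for a, p in pairs]   # balanced relative error
    ibre = [abs(a - p) / max(a, p) for a, p in pairs]  # inverted balanced
    return {
        "MAR": mean(ar),
        "MMRE": mean(mre),
        "MdMRE": median(mre),
        # Fraction of estimates whose MRE is at most 25%.
        "Pred(25)": sum(m <= 0.25 for m in mre) / len(mre),
        "MMER": mean(mer),
        "MBRE": mean(bre),
        "MIBRE": mean(ibre),
    }


if __name__ == "__main__":
    for name, value in error_measures([100, 200], [80, 250]).items():
        print(f"{name}: {value:.3f}")
```

Lower is better for every measure except Pred(25), where higher is better, so any win/loss comparison has to flip the direction for that one measure.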
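
The 6-step pop1NN procedure on slide 16 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the Euclidean distance, the absence of feature normalization, and the step 5 rule (keep the top-n most popular instances, with n chosen to minimize mean MRE of 1NN on the hold-out set) are all assumptions, as are the function names.

```python
import math


def dist(a, b):
    """Euclidean distance between two feature vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def popularity(rows):
    """Steps 1-3: count how often each row is another row's nearest neighbor."""
    counts = [0] * len(rows)
    for i, row in enumerate(rows):
        others = [j for j in range(len(rows)) if j != i]
        nearest = min(others, key=lambda j: dist(row, rows[j]))
        counts[nearest] += 1
    return counts


def one_nn(query, rows, efforts):
    """Return the effort of the row closest to the query."""
    best = min(range(len(rows)), key=lambda j: dist(query, rows[j]))
    return efforts[best]


def pop1nn(train_x, train_y, hold_x, hold_y, test_x):
    pop = popularity(train_x)
    # Step 4: order training instances by decreasing popularity.
    order = sorted(range(len(train_x)), key=lambda i: -pop[i])
    # Step 5 (assumed rule): try each top-n prefix, keep the best on hold-out.
    best_n, best_err = 1, float("inf")
    for n in range(1, len(order) + 1):
        xs = [train_x[i] for i in order[:n]]
        ys = [train_y[i] for i in order[:n]]
        err = sum(abs(a - one_nn(q, xs, ys)) / a
                  for q, a in zip(hold_x, hold_y)) / len(hold_y)
        if err < best_err:
            best_n, best_err = n, err
    # Step 6: estimate the test instances from the selected instances only.
    xs = [train_x[i] for i in order[:best_n]]
    ys = [train_y[i] for i in order[:best_n]]
    return [one_nn(q, xs, ys) for q in test_x]


if __name__ == "__main__":
    train_x, train_y = [[0], [1], [2], [10]], [10, 12, 14, 100]
    print(popularity(train_x))
    print(pop1nn(train_x, train_y, [[1.2]], [12], [[9]]))
```

In the reduced-data setting of the talk, the feature vectors would simply omit the size columns. Note that the popularity computation never looks at the effort values, which is the property slide 20 points to for active learning.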