How to Find Relevant Data for Effort Estimation

                Ekrem Kocaguneli, Tim Menzies


     LCSEE, West Virginia University, Morgantown/USA



 1
http://goo.gl/j8F64


USD DOD military projects (last decade)


You must segment to find relevant data




2
http://goo.gl/j8F64
Domain Segmentations

        Q: What to do about rare zones?
        A: Select the nearest ones from the rest
           But how?

 3
http://goo.gl/j8F64

In the literature: within vs. cross = ??

Before
  Kitchenham et al., TSE 2007
     Within-company learning (just use local data)
     Cross-company learning (just use data from other companies)
  Results mixed
     No clear win from cross or within

This work
  Cross vs. within are not rigid boundaries
     They are soft borders
     And we can move a few examples across the border
     And after making those moves, “cross” is the same as “local”
 4
http://goo.gl/j8F64
Some data does not divide
neatly on existing dimensions




5
http://goo.gl/j8F64


The Locality(1) Assumption
    Data divides best on one attribute:
      1. development centers of developers;
      2. project type; e.g. embedded, etc.;
      3. development language;
      4. application type (MIS; GNC; etc.);
      5. targeted hardware platform;
      6. in-house vs. outsourced projects;
      7. etc.

    If Locality(1): hard to use data across these boundaries
        Then harder to build effort models:
        Need to collect local data (slow)

 6
http://goo.gl/j8F64


The Locality(N) Assumption
    Data divides best on a combination of attributes

    If Locality(N):
        Easier to use data across these boundaries
            Relevant data spread all around
            “little diamonds floating in the dust”
        (a tiny illustration of Locality(1) vs. Locality(N) follows below)
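
To make the two assumptions concrete, here is a tiny hypothetical Python sketch (all project names and numbers are invented, not from the paper): Locality(1) keeps only rows that match the new project on a single attribute, while Locality(N) ranks all rows by distance over a combination of attributes.

```python
import numpy as np

# Invented example data: development center, size (KLOC), team size, actual effort.
projects = [
    {"center": "A", "kloc": 10, "team": 4,  "effort": 120},
    {"center": "A", "kloc": 90, "team": 30, "effort": 2000},
    {"center": "B", "kloc": 12, "team": 5,  "effort": 150},
]
new = {"center": "A", "kloc": 11, "team": 4}

# Locality(1): divide on one attribute (here the development center).
local = [p for p in projects if p["center"] == new["center"]]

# Locality(N): ignore the single boundary; rank by distance over several attributes.
def dist(p):
    return np.linalg.norm([p["kloc"] - new["kloc"], p["team"] - new["team"]])
relevant = sorted(projects, key=dist)[:2]

print([p["effort"] for p in local])     # [120, 2000] -- same center, very different projects
print([p["effort"] for p in relevant])  # [120, 150]  -- nearest analogies cross the boundary
```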




 7
http://goo.gl/j8F64


Roadmap
         WHY:
           Motivation

         WHAT:
           Background (SEE = software effort estimation)

         HOW:
           Technology (TEAK)

         Results
           With TEAK, no distinction cross / within.

         Related Work

         SO WHAT:
           Conclusions


 8
http://goo.gl/j8F64


Roadmap
         WHY:
           Motivation

         WHAT:
           Background (SEE = software effort estimation)

         HOW:
           Technology (TEAK)

         Results
           With TEAK, no distinction cross / within.

         Related Work

         SO WHAT:
           Conclusions

 9
http://goo.gl/j8F64


What is SEE?


Software effort estimation (SEE) is
the activity of estimating the total
effort required to complete a
software project (Keung2008 [1]).



SEE has been heavily investigated since the early 80’s (Mendes2003 [2]), (Kemerer1987 [3]), (Boehm1981 [4])


10
http://goo.gl/j8F64


What is the SEE problem?



SEE as an industry problem:
   •  Software projects (60%-80%) encounter overruns
        •  Avg. overrun is 89% (Standish Group 2004)
        •  According to Jorgensen the amount is less (around 30%), but still dire (Jorgensen2011 [5])




11
http://goo.gl/j8F64


Active research area
 Jorgensen & Shepperd review 304 journal papers after filtering (Jorgensen2007 [8])
      •  For “software effort cost” in the “2000-2011” period, IEEE Xplore returns
          •  1098 conference papers
          •  161 journal papers

 Jorgensen & Shepperd’s literature review reveals (Jorgensen2007 [8])
     • Since the 80’s, 61% of SEE studies deal with proposing a new model and comparing it to old ones

12
http://goo.gl/j8F64


Roadmap
         WHY:
           Motivation

         WHAT:
           Background (SEE = software effort estimation)

         HOW:
           Technology (TEAK)

         Results
           With TEAK, no distinction cross / within.

         Related Work

         SO WHAT:
           Conclusions


 13
http://goo.gl/j8F64


TEAK = ABE0 + instance selection
  Kocaguneli et al. 2011, ASE journal
      17,000+ variants of analogy-based effort estimation

  ABE0 = analogy-based effort estimator, version 0
      Just the most commonly used analogy method
      Normalized numerics: min to max, 0 to 1
      Euclidean distance (ignoring dependent variables)
      Equal weighting to all attributes
      Return median effort of k-nearest neighbors
      (a minimal sketch follows below)

  Instance selection
      Smart way to adjust training data
 14
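
A minimal sketch of the ABE0 recipe listed above (hypothetical code, not the authors' implementation): min-max normalization of the independent attributes, unweighted Euclidean distance, and the median effort of the k nearest analogies.

```python
import numpy as np

def abe0_estimate(X_train, y_train, x_test, k=3):
    """ABE0: normalize to 0..1, equal-weight Euclidean distance, median effort of k nearest."""
    X = np.asarray(X_train, dtype=float)
    y = np.asarray(y_train, dtype=float)
    x = np.asarray(x_test, dtype=float)
    # Normalize every numeric attribute from min..max to 0..1 (dependent variable excluded).
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    Xn, xn = (X - lo) / span, (x - lo) / span
    # Equal weighting to all attributes: plain Euclidean distance to every training project.
    dists = np.linalg.norm(Xn - xn, axis=1)
    nearest = np.argsort(dists)[:k]
    return float(np.median(y[nearest]))

# Toy usage with made-up projects (independent attributes w, x, y, z; class = effort).
X_train = [[0, 1, 1, 1], [0, 1, 1, 1], [7, 7, 6, 2], [1, 9, 1, 8]]
y_train = [2, 3, 5, 8]
print(abe0_estimate(X_train, y_train, [0, 1, 1, 2], k=2))  # median of the two "similar" efforts
```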
http://goo.gl/j8F64


How to find relevant training data?
                                         independent
                                           attributes

          Use similar?                   w     x    y    z   class
                            similar 1     0    1    1    1     2
                            similar 2     0    1    1    1     3
                           different 1    7    7    6    2     5
                           different 2    1    9    1    8     8
     Use more variant?     different 3    5    4    2    6    10
                             alien 1     74   15   73   56    20
                             alien 2     77   45   13    6    40
                             alien 3     35   99   31   21    60
                             alien 4     49   55   37    4    80
            Use aliens ?




15
http://goo.gl/j8F64


Variance pruning
                                                independent
                                                  attributes
          KEEP !
                                                w     x    y    z   class
                                   similar 1     0    1    1    1     2
                                   similar 2     0    1    1    1     3
                                  different 1    7    7    6    2     5
                                  different 2    1    9    1    8     8
                                  different 3    5    4    2    6    10
                                    alien 1     74   15   73   56    20
                                    alien 2     77   45   13    6    40
                                    alien 3     35   99   31   21    60
                                    alien 4     49   55   37    4    80
          PRUNE !

1) Sort the clusters by “variance”
2) Prune the high-variance clusters
3) Estimate on the rest

“Easy path”: cull the examples that hurt the learner
(a minimal sketch of this loop follows below)
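
A small sketch of that three-step loop (hypothetical code; how the clusters are formed is assumed to come from elsewhere, e.g. a GAC tree as on the next slide):

```python
import numpy as np

def prune_by_variance(clusters, efforts, keep_fraction=0.5):
    """clusters: lists of row indices; keep the lowest effort-variance clusters, prune the rest."""
    efforts = np.asarray(efforts, dtype=float)
    ranked = sorted(clusters, key=lambda c: np.var(efforts[c]))  # low variance first
    n_keep = max(1, int(len(ranked) * keep_fraction))
    kept = ranked[:n_keep]                                       # KEEP the calm clusters
    return sorted(i for c in kept for i in c)                    # estimate on these rows only

# Toy data mirroring the table above: "similar", "different" and "alien" groups.
efforts  = [2, 3, 5, 8, 10, 20, 40, 60, 80]
clusters = [[0, 1], [2, 3, 4], [5, 6, 7, 8]]
print(prune_by_variance(clusters, efforts, keep_fraction=0.7))  # -> [0, 1, 2, 3, 4]; aliens pruned
```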
16
http://goo.gl/j8F64
TEAK: clustering + variance pruning
      (TSE, Jan 2011)

 • TEAK is a variance-based instance selector
 • It is built via GAC (greedy agglomerative clustering) trees

 • TEAK is a two-pass system
     • First pass selects low-variance relevant projects
     • Second pass retrieves projects to estimate from
 (a simplified sketch of the two passes follows below)
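
For flavor, a heavily simplified, hypothetical sketch of the two-pass idea (not the published TEAK algorithm; thresholds and tree details are guesses): pass one builds a GAC-style tree by greedy pairwise merging and keeps instances under below-median effort-variance subtrees; pass two rebuilds a tree on the survivors and walks toward the test case while variance keeps dropping.

```python
import numpy as np

class Node:
    def __init__(self, idx, left=None, right=None):
        self.idx = idx                      # indices of training rows under this node
        self.left, self.right = left, right

def gac_tree(X):
    """Greedy agglomerative clustering: repeatedly merge the two closest clusters (centroid distance)."""
    nodes = [Node([i]) for i in range(len(X))]
    cents = [X[i].astype(float) for i in range(len(X))]
    while len(nodes) > 1:
        best, pair = None, None
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                d = np.linalg.norm(cents[i] - cents[j])
                if best is None or d < best:
                    best, pair = d, (i, j)
        i, j = pair
        parent = Node(nodes[i].idx + nodes[j].idx, nodes[i], nodes[j])
        cent = X[parent.idx].mean(axis=0)
        del nodes[j], cents[j], nodes[i], cents[i]   # higher index deleted first
        nodes.append(parent)
        cents.append(cent)
    return nodes[0]

def subtrees(node):
    yield node
    if node.left:
        yield from subtrees(node.left)
        yield from subtrees(node.right)

def variance(node, y):
    return float(np.var(y[node.idx]))

def teak_estimate(X_train, y_train, x_test):
    X, y = np.asarray(X_train, float), np.asarray(y_train, float)
    x = np.asarray(x_test, float)
    # Pass 1: keep instances that sit under below-median effort-variance subtrees.
    inner = [n for n in subtrees(gac_tree(X)) if n.left]
    if not inner:                                    # tiny data: nothing to prune
        keep = list(range(len(X)))
    else:
        cutoff = np.median([variance(n, y) for n in inner])
        keep = sorted({i for n in inner if variance(n, y) <= cutoff for i in n.idx})
    X2, y2 = X[keep], y[keep]
    # Pass 2: rebuild a tree on the survivors; descend toward the nearer child
    # while effort variance keeps dropping, then return the median effort there.
    node = gac_tree(X2)
    while node.left:
        child = min((node.left, node.right),
                    key=lambda c: np.linalg.norm(X2[c.idx].mean(axis=0) - x))
        if variance(child, y2) >= variance(node, y2):
            break
        node = child
    return float(np.median(y2[node.idx]))
```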

17
http://goo.gl/j8F64


Essential point

  TEAK  finds local regions
  important to the estimation of
  particular cases

  TEAK   finds those regions via
  locality(N)
      Not locality(1)




 18
http://goo.gl/j8F64


Roadmap
         WHY:
           Motivation

         WHAT:
           Background (SEE = software effort estimation)

         HOW:
           Technology (TEAK)

         Results
           With TEAK, no distinction cross / within

         Related Work

         SO WHAT:
           Conclusions

 19
http://goo.gl/j8F64


Within and Cross Datasets
Out of 20 datasets, only 6 are found suitable for within/cross experiments




       Note: all Locality(1) divisions




20
http://goo.gl/j8F64
Experiment 1: Performance Comparison of Within and Cross-Source Data
              • TEAK on within & cross data for each
              dataset group (lines separate groups)
              • LOOCV used for runs
              • 20 runs performed for each treatment
              • Results evaluated w.r.t. MAR, MMRE, MdMRE and Pred(30)
                 (sketched below), but see http://goo.gl/6q0tw


              • If within data outperforms cross, the
              dataset is highlighted with gray
                   • See only 2 datasets highlighted

21
http://goo.gl/j8F64

Experiment 2: Retrieval Tendency of TEAK
from Within and Cross-Source Data




22
http://goo.gl/j8F64

Experiment 2: Retrieval Tendency of TEAK from Within and Cross-Source Data

                                  Diagonal (WC) vs. off-diagonal (CC) selection percentages, sorted

                    Percentiles of diagonals and off-diagonals



 23
http://goo.gl/j8F64


Roadmap
         WHY:
           Motivation

         WHAT:
           Background (SEE = software effort estimation)

         HOW:
           Technology (TEAK)

         Results
           With TEAK no distinction cross /within

         Related Work

         SO WHAT:
           Conclusions
 24
http://goo.gl/j8F64


Roadmap
         WHY:
           Motivation

         WHAT:
           Background (SEE = software effort estimation)

         HOW:
           Technology (TEAK)

         Results
           With TEAK, no distinction cross / within

         Related Work

         SO WHAT:
           Conclusions

 25
http://goo.gl/j8F64


Highlights
1.         Don’t listen to everyone
            When listening to a crowd, first
             filter the noise

2.         Once the noise clears: bits of
           me are similar to bits of you
            Probability of selecting cross or
             within instances is the same

3.         Cross-vs-within is not a
           useful distinction
            Locality(1) not informative
            Enables “cross-company”
             learning
  26
http://goo.gl/j8F64


Implications
                  Companies can learn from each other’s data

                  Business case for building a shared repository

                  Maybe there are general effects in SE
                      effects that transcend the boundaries of one company


27
http://goo.gl/j8F64


Future Work
              1.         Check external validity
                        Does cross == within (after
                         instance selection) in other data?

              2.         Build more repositories
                        More useful than previously
                         thought for effort estimation

              3.         Synonym discovery
                      Can only use cross-data if it has
                       the same ontology
                      auto-generate lexicons to map
                       terms between data sets?
28
http://goo.gl/j8F64
Questions?
Comments?




29
http://goo.gl/j8F64


References
1) J. W. Keung, “Theoretical Maximum Prediction Accuracy for Analogy-Based Software Cost Estimation,” 2008 15th Asia-Pacific Software Engineering
      Conference, pp. 495–502, 2008. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4724583
2) E. Mendes, I. D. Watson, C. Triggs, N. Mosley, and S. Counsell, “A comparative study of cost estimation models for web hypermedia applications,” Empirical
      Software Engineering, vol. 8, no. 2, pp. 163–196, 2003.
3) C. Kemerer, “An empirical validation of software cost estimation models,” Communications of the ACM, vol. 30, no. 5, pp. 416–429, May 1987.
4) B. W. Boehm, Software Engineering Economics. Upper Saddle River, NJ, USA: Prentice Hall PTR, 1981.
5) Magne Jørgensen, “Contrasting ideal and realistic conditions as a means to improve judgment-based software development effort estimation”, Information
      and Software Technology (July 2011) doi:10.1016/j.infsof.2011.07.001
6) Johann Rost, Robert L. Glass, “The Dark Side of Software Engineering: Evil on Computing Projects”, Wiley John & Sons Inc., 2011.
7) Magne Jørgensen, “Contrasting ideal and realistic conditions as a means to improve judgment-based software development effort estimation”, Information
      and Software Technology (July 2011) doi:10.1016/j.infsof.2011.07.001
8) M. Jorgensen and M. Shepperd, “A systematic review of software development cost estimation studies,” IEEE Trans. Softw. Eng., vol. 33, no. 1, pp. 33–53,
      2007.
9) M. Shepperd and G. F. Kadoda, “Comparing software prediction techniques using simulation,” IEEE Trans. Software Eng, vol. 27, no. 11, pp. 1014–1022, 2001.
10) I. Myrtveit, E. Stensrud, and M. Shepperd, “Reliability and validity in comparative studies of software prediction models,” IEEE Transactions on Software
      Engineering, vol. 31, no. 5, pp. 380–391, May 2005.
11) B. Kitchenham, E. Mendes, and G. H. Travassos. Cross versus Within-Company Cost Estimation Studies: A Systematic Review. IEEE Trans. Softw. Eng., 33(5):
      316–329, 2007.
12) T. Menzies, Z. Chen, J. Hihn, and K. Lum. Selecting Best Practices for Effort Estimation. IEEE Transaction on Software Engineering, 32(11):883–895, 2006.
13) K. Lum, J. Powell, and J. Hihn. Validation of Spacecraft Software Cost Estimation Models for Flight and Ground Systems. In ISPA Conference Proceedings,
      Software Modeling Track, May 2002.
14) J. Keung, E. Kocaguneli, and T. Menzies. A Ranking Stability Indicator for Selecting the Best Effort Estimator in Software Cost Estimation. Automated
      Software Engineering (submitted), 2011.
15) M. Shepperd and C. Schofield. Estimating Software Project Effort Using Analogies. IEEE Transactions on Software Engineering, 23(12), Nov. 1997.
16) L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees, 1984.
17) T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy. Cross-project defect prediction. ESEC/FSE’09, page 91, 2009.
18) B. Turhan, T. Menzies, A. Bener, and J. Di Stefano. On the relative value of cross-company and within-company data for defect prediction. Empirical
      Software Engineering, 14(5):540–578, 2009.
19) E. Kocaguneli, G. Gay, T. Menzies, Y. Yang, and J. W. Keung. When to use data from other projects for effort estimation. In ASE’10, pages 321–324, 2010.




30
http://goo.gl/j8F64

Related work

Cross-vs-within (defect prediction)
    Zimmermann et al. FSE’09
       pairs of projects (x, y)
       For 96% of pairs, predictors from “x” failed for “y”
       No relevancy filtering
    Opposite result: Turhan et al. ESE’09
       If nearest-neighbor filtering, predictors from “x” work well for “y”
       But no variance filtering

Other
    Keung et al. 2011
       90 effort estimators
       Best methods built multiple local models (CART, CBR)
       Single-dimensional models comparatively worse
    Instance selection
       Can discard 70 to 90% of data without hurting accuracy
       Since 1974, 100s of papers
       http://goo.gl/8iAUz
 31
