How to Find Relevant Data for Effort Estimation

Ekrem Kocaguneli, Tim Menzies

LCSEE, West Virginia University, Morgantown/USA

  1. How to Find Relevant Data for Effort Estimation. Ekrem Kocaguneli, Tim Menzies. LCSEE, West Virginia University, Morgantown, USA. http://goo.gl/j8F64
  2. [chart: USD, DOD military projects, last decade] You must segment to find relevant data.
  3. Domain segmentations. Q: What to do about rare zones? A: Select the nearest ones from the rest. But how?
  4. In the literature: within vs. cross = ??
      Before (Kitchenham et al., TSE 2007):
      • Within-company learning (just use local data)
      • Cross-company learning (just use data from other companies)
      • Results mixed: no clear win from cross or within
      This work:
      • Cross vs. within are not rigid boundaries; they are soft borders
      • We can move a few examples across the border
      • After making those moves, "cross" is the same as "local"
  5. Some data does not divide neatly on existing dimensions.
  6. The Locality(1) Assumption
      • Data divides best on one attribute:
        1) development centers of developers;
        2) project type (e.g. embedded);
        3) development language;
        4) application type (MIS, GNC, etc.);
        5) targeted hardware platform;
        6) in-house vs. outsourced projects;
        7) etc.
      • If Locality(1) holds, it is hard to use data across these boundaries,
        and therefore harder to build effort models: you need to collect local data (slow).
  7. The Locality(N) Assumption
      • Data divides best on a combination of attributes.
      • If Locality(N) holds, it is easier to use data across these boundaries:
        relevant data is spread all around ("little diamonds floating in the dust").
  8. Roadmap: WHY (motivation); WHAT (background: SEE = software effort estimation); HOW (technology: TEAK); results (with TEAK, no distinction between cross and within); related work; SO WHAT (conclusions).
  9. Roadmap (same as slide 8).
  10. What is SEE? Software effort estimation (SEE) is the activity of estimating the total effort required to complete a software project (Keung 2008 [1]). SEE has been heavily investigated since the early 80s (Mendes 2003 [2], Kemerer 1987 [3], Boehm 1981 [4]).
  11. What is the SEE problem? SEE as an industry problem:
      • 60%-80% of software projects encounter overruns.
      • The average overrun is 89% (Standish Group 2004).
      • According to Jorgensen the amount is lower (around 30%), but still dire (Jorgensen 2011 [5]).
  12. Active research area.
      • Jorgensen & Shepperd review 304 journal papers after filtering (Jorgensen 2007 [8]).
      • For "software effort cost" in the 2000-2011 period, IEEE Xplore returns 1098 conference papers and 161 journal papers.
      • The Jorgensen & Shepperd literature review reveals that, since the 80s, 61% of SEE studies propose a new model and compare it to old ones.
  13. Roadmap (same as slide 8).
  14. TEAK = ABE0 + instance selection.
      • Kocaguneli et al. 2011, ASE journal: 17,000+ variants of analogy-based effort estimation.
      • ABE0 = analogy-based effort estimator, version 0: just the most commonly used analogy method.
        - Normalized numerics: min to max, 0 to 1.
        - Euclidean distance (ignoring dependent variables).
        - Equal weighting of all attributes.
        - Return the median effort of the k nearest neighbors.
        (A minimal sketch of ABE0 follows this slide.)
      • Instance selection: a smart way to adjust the training data.
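ABE0, as described above, is simple enough to sketch directly. The Python below is a minimal illustration under stated assumptions, not the authors' code: it assumes training projects arrive as parallel lists of attribute vectors and effort values, and the helper names (normalize, euclidean, abe0_estimate) are made up for this sketch.

```python
import math
import statistics

def normalize(rows):
    """Min-max normalize each independent attribute to the range [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)]
            for row in rows]

def euclidean(a, b):
    """Unweighted Euclidean distance over the independent attributes only."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def abe0_estimate(train_features, train_efforts, test_row, k=3):
    """ABE0: return the median effort of the k nearest training projects."""
    norm = normalize(train_features + [test_row])   # normalize train and test together
    train_norm, test_norm = norm[:-1], norm[-1]
    ranked = sorted(zip(train_norm, train_efforts),
                    key=lambda pair: euclidean(pair[0], test_norm))
    return statistics.median(effort for _, effort in ranked[:k])

# Toy data shaped like the table on the next slide (w, x, y, z -> effort).
features = [[0, 1, 1, 1], [0, 1, 1, 1], [7, 7, 6, 2], [1, 9, 1, 8], [5, 4, 2, 6]]
efforts = [2, 3, 5, 8, 10]
print(abe0_estimate(features, efforts, [0, 1, 1, 2], k=2))  # 2.5, from the two "similar" projects
```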
  15. How to find relevant training data?
                      w    x    y    z   class (effort)
      similar 1       0    1    1    1     2
      similar 2       0    1    1    1     3
      different 1     7    7    6    2     5
      different 2     1    9    1    8     8
      different 3     5    4    2    6    10
      alien 1        74   15   73   56    20
      alien 2        77   45   13    6    40
      alien 3        35   99   31   21    60
      alien 4        49   55   37    4    80
      Use only the similar projects? Also use the more varied ("different") ones? Use the aliens?
  16. Variance pruning.
      • Same table as the previous slide: the "similar" and "different" rows are marked KEEP!, the "alien" rows are marked PRUNE!.
      • 1) Sort the clusters by variance. 2) Prune the high-variance ones. 3) Estimate on the rest.
      • The "easy path": cull the examples that hurt the learner.
  17. TEAK: clustering + variance pruning (TSE, Jan 2011).
      • TEAK is a variance-based instance selector.
      • It is built via GAC (greedy agglomerative clustering) trees.
      • TEAK is a two-pass system: the first pass selects low-variance relevant projects; the second pass retrieves the projects to estimate from.
        (A simplified sketch of this two-pass idea follows this slide.)
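The actual TEAK builds GAC trees in both passes and derives its pruning threshold from the tree structure; the sketch below is only a rough stand-in that mirrors the "prune high-variance regions, then retrieve" shape of slides 16-17. The neighborhood size, the keep fraction, and all function names are assumptions made for the illustration (normalization, as in the ABE0 sketch, is omitted for brevity).

```python
import math
import statistics

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def prune_high_variance(features, efforts, neighborhood=3, keep_fraction=0.5):
    """Pass 1 (a crude stand-in for TEAK's GAC-tree pruning): score every
    training project by the effort variance of its local neighborhood and
    keep only the lower-variance portion of the data."""
    scores = []
    for row in features:
        ranked = sorted(range(len(features)),
                        key=lambda j: euclidean(row, features[j]))
        local = [efforts[j] for j in ranked[:neighborhood + 1]]  # the project itself plus its neighbors
        scores.append(statistics.pvariance(local))
    cutoff = sorted(scores)[int(len(scores) * keep_fraction)]
    kept = [i for i, s in enumerate(scores) if s <= cutoff]
    return [features[i] for i in kept], [efforts[i] for i in kept]

def teak_like_estimate(features, efforts, test_row, k=3):
    """Pass 2: plain analogy-based retrieval (median of the k nearest efforts),
    but only from the low-variance projects that survived pass 1."""
    kept_features, kept_efforts = prune_high_variance(features, efforts)
    ranked = sorted(zip(kept_features, kept_efforts),
                    key=lambda pair: euclidean(pair[0], test_row))
    return statistics.median(effort for _, effort in ranked[:k])
```

On the toy table from slide 15, pass 1 would tend to drop the widely scattered "alien" projects, leaving the compact, low-variance region for pass 2 to retrieve from.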
  18. Essential point: TEAK finds the local regions that matter for estimating particular cases, and it finds those regions via Locality(N), not Locality(1).
  19. Roadmap (same as slide 8).
  20. Within and cross datasets. Out of 20 datasets, only 6 are found suitable for within/cross experiments. Note: all are Locality(1) divisions.
  21. Experiment 1: Performance comparison of within and cross-source data.
      • TEAK is run on within and cross data for each dataset group (lines separate the groups).
      • LOOCV is used for the runs; 20 runs are performed for each treatment.
      • Results are evaluated w.r.t. MAR, MMRE, MdMRE and Pred(30) (but see http://goo.gl/6q0tw); the measures are sketched after this slide.
      • If within data outperforms cross data, the dataset is highlighted in gray; only 2 datasets are highlighted.
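The four performance measures named above have standard definitions (MRE = |actual - predicted| / actual, and so on). The sketch below spells them out, together with a leave-one-out loop of the kind the experiment describes. The `estimator(train_features, train_efforts, test_row)` calling convention is an assumption carried over from the earlier sketches, not an interface from the authors' code.

```python
import statistics

def mre(actual, predicted):
    """Magnitude of relative error for a single project."""
    return abs(actual - predicted) / actual

def mar(actuals, predictions):
    """MAR: mean absolute residual."""
    return statistics.mean(abs(a - p) for a, p in zip(actuals, predictions))

def mmre(actuals, predictions):
    """MMRE: mean magnitude of relative error."""
    return statistics.mean(mre(a, p) for a, p in zip(actuals, predictions))

def mdmre(actuals, predictions):
    """MdMRE: median magnitude of relative error."""
    return statistics.median(mre(a, p) for a, p in zip(actuals, predictions))

def pred(actuals, predictions, level=0.30):
    """Pred(30): fraction of estimates whose MRE is at most 30%."""
    hits = sum(1 for a, p in zip(actuals, predictions) if mre(a, p) <= level)
    return hits / len(actuals)

def loocv(features, efforts, estimator):
    """Leave-one-out cross-validation: hold out each project in turn, estimate
    it from all the others, then score the collected predictions."""
    predictions = [estimator(features[:i] + features[i + 1:],
                             efforts[:i] + efforts[i + 1:],
                             features[i])
                   for i in range(len(features))]
    return {"MAR": mar(efforts, predictions),
            "MMRE": mmre(efforts, predictions),
            "MdMRE": mdmre(efforts, predictions),
            "Pred(30)": pred(efforts, predictions)}
```

For example, `loocv(features, efforts, abe0_estimate)` with the toy data from the ABE0 sketch would report all four measures for that baseline.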
  22. Experiment 2: Retrieval tendency of TEAK from within and cross-source data. [figure]
  23. Experiment 2: Retrieval tendency of TEAK from within and cross-source data.
      • [figure: diagonal (WC) vs. off-diagonal (CC) selection percentages, sorted]
      • [figure: percentiles of diagonals and off-diagonals]
  24. Roadmap (same as slide 8).
  25. Roadmap (same as slide 8).
  26. Highlights.
      1) Don't listen to everyone: when listening to a crowd, first filter the noise.
      2) Once the noise clears, bits of me are similar to bits of you: the probability of selecting cross instances is the same as that of selecting within instances.
      3) Cross-vs-within is not a useful distinction: Locality(1) is not informative. This enables "cross-company" learning.
  27. Implications.
      • Companies can learn from each other's data.
      • There is a business case for building a shared repository.
      • Maybe there are general effects in SE: effects that transcend the boundaries of any one company.
  28. Future work.
      1) Check external validity: does cross == within (after instance selection) hold in other data?
      2) Build more repositories: they are more useful than previously thought for effort estimation.
      3) Synonym discovery: cross-data can only be used if it shares the same ontology; can lexicons that map terms between data sets be auto-generated?
  29. Questions? Comments?
  30. References.
      1) J. W. Keung, "Theoretical maximum prediction accuracy for analogy-based software cost estimation," 15th Asia-Pacific Software Engineering Conference, pp. 495–502, 2008. Online: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4724583
      2) E. Mendes, I. D. Watson, C. Triggs, N. Mosley, and S. Counsell, "A comparative study of cost estimation models for web hypermedia applications," Empirical Software Engineering, vol. 8, no. 2, pp. 163–196, 2003.
      3) C. Kemerer, "An empirical validation of software cost estimation models," Communications of the ACM, vol. 30, no. 5, pp. 416–429, May 1987.
      4) B. W. Boehm, Software Engineering Economics. Upper Saddle River, NJ, USA: Prentice Hall PTR, 1981.
      5) M. Jørgensen, "Contrasting ideal and realistic conditions as a means to improve judgment-based software development effort estimation," Information and Software Technology, July 2011. doi:10.1016/j.infsof.2011.07.001
      6) J. Rost and R. L. Glass, The Dark Side of Software Engineering: Evil on Computing Projects. John Wiley & Sons, 2011.
      7) M. Jørgensen, "Contrasting ideal and realistic conditions as a means to improve judgment-based software development effort estimation," Information and Software Technology, July 2011. doi:10.1016/j.infsof.2011.07.001. Key: citeulike:9626144
      8) M. Jørgensen and M. Shepperd, "A systematic review of software development cost estimation studies," IEEE Transactions on Software Engineering, vol. 33, no. 1, pp. 33–53, 2007.
      9) M. Shepperd and G. F. Kadoda, "Comparing software prediction techniques using simulation," IEEE Transactions on Software Engineering, vol. 27, no. 11, pp. 1014–1022, 2001.
      10) I. Myrtveit, E. Stensrud, and M. Shepperd, "Reliability and validity in comparative studies of software prediction models," IEEE Transactions on Software Engineering, vol. 31, no. 5, pp. 380–391, May 2005.
      11) B. Kitchenham, E. Mendes, and G. H. Travassos, "Cross versus within-company cost estimation studies: a systematic review," IEEE Transactions on Software Engineering, vol. 33, no. 5, pp. 316–329, 2007.
      12) T. Menzies, Z. Chen, J. Hihn, and K. Lum, "Selecting best practices for effort estimation," IEEE Transactions on Software Engineering, vol. 32, no. 11, pp. 883–895, 2006.
      13) K. Lum, J. Powell, and J. Hihn, "Validation of spacecraft software cost estimation models for flight and ground systems," ISPA Conference Proceedings, Software Modeling Track, May 2002.
      14) J. Keung, E. Kocaguneli, and T. Menzies, "A ranking stability indicator for selecting the best effort estimator in software cost estimation," Automated Software Engineering (submitted), 2011.
      15) M. Shepperd and C. Schofield, "Estimating software project effort using analogies," IEEE Transactions on Software Engineering, vol. 23, no. 12, Nov. 1997.
      16) L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, 1984.
      17) T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, "Cross-project defect prediction," ESEC/FSE '09, p. 91, 2009.
      18) B. Turhan, T. Menzies, A. Bener, and J. Di Stefano, "On the relative value of cross-company and within-company data for defect prediction," Empirical Software Engineering, vol. 14, no. 5, pp. 540–578, 2009.
      19) E. Kocaguneli, G. Gay, T. Menzies, Y. Yang, and J. W. Keung, "When to use data from other projects for effort estimation," ASE '10, pp. 321–324, 2010.
  31. Related work.
      Cross-vs-within (defect prediction):
      • Zimmermann et al., FSE '09: for pairs of projects (x, y), predictors learned from x failed on y in 96% of pairs; no relevancy filtering.
      • Opposite result, Turhan et al., ESE '09: with nearest-neighbor filtering, predictors from x work well for y; but no variance filtering.
      Other:
      • Keung et al. 2011: 90 effort estimators; the best methods built multiple local models (CART, CBR); single-dimensional models did comparatively worse.
      • Instance selection: can discard 70 to 90% of the data without hurting accuracy; hundreds of papers since 1974; http://goo.gl/8iAUz
