A Principled Methodology: A Dozen Principles of Software Effort Estimation
Ekrem Kocaguneli, 11/07/2012
2 Agenda
• Introduction
• Publications
• What to Know
  • 8 Questions
• Answers
  • 12 Principles
• Validity Issues
• Future Work
3 Introduction
Software effort estimation (SEE) is the process of estimating the total effort required to complete a software project (Keung2008).
Successful estimation is critical for organizations:
• Over-estimation: killing promising projects
• Under-estimation: wasting entire effort! E.g. NASA's launch-control system was cancelled after the initial estimate of $200M was overrun by another $200M.
Among IT projects developed in 2009, only 32% were successfully completed within time with full functionality.
4 Introduction (cntd.)
We will discuss algorithms, but it would be irresponsible to say that SEE is merely an algorithmic problem.
Organizational factors are just as important, e.g. common experiences of data collection and user interaction in organizations operating in different domains.
5 Introduction (cntd.)
This presentation is not about a single algorithm or answer targeting a single problem, because there is not just one question.
It (unfortunately) does not cover everything about SEE.
It brings together critical questions and related solutions.
6 What to know?
1. When do I have perfect data?
2. What is the best effort estimation method?
3. Can I use multiple methods?
4. ABE methods are easy to use. How can I improve them?
5. What if I lack resources for local data?
6. I don't believe in size attributes. What can I do?
7. Are all attributes and all instances necessary?
8. How to experiment, which sampling method to use?
7 Publications
Journals
• E. Kocaguneli, T. Menzies, J. Keung, "On the Value of Ensemble Effort Estimation", IEEE Transactions on Software Engineering, 2011.
• E. Kocaguneli, T. Menzies, A. Bener, J. Keung, "Exploiting the Essential Assumptions of Analogy-based Effort Estimation", IEEE Transactions on Software Engineering, 2011.
• E. Kocaguneli, T. Menzies, J. Keung, "Kernel Methods for Software Effort Estimation", Empirical Software Engineering Journal, 2011.
• J. Keung, E. Kocaguneli, T. Menzies, "A Ranking Stability Indicator for Selecting the Best Effort Estimator in Software Cost Estimation", Journal of Automated Software Engineering, 2012.
Journals under review
• E. Kocaguneli, T. Menzies, J. Keung, "Active Learning for Effort Estimation", third round review at IEEE Transactions on Software Engineering.
• E. Kocaguneli, T. Menzies, E. Mendes, "Transfer Learning in Effort Estimation", submitted to ACM Transactions on Software Engineering.
• E. Kocaguneli, T. Menzies, "Software Effort Models Should be Assessed Via Leave-One-Out Validation", under second round review at Journal of Systems and Software.
• E. Kocaguneli, T. Menzies, E. Mendes, "Towards Theoretical Maximum Prediction Accuracy Using D-ABE", submitted to IEEE Transactions on Software Engineering.
Conference
• E. Kocaguneli, T. Menzies, J. Hihn, Byeong Ho Kang, "Size Doesn't Matter? On the Value of Software Size Features for Effort Estimation", Predictive Models in Software Engineering (PROMISE) 2012.
• E. Kocaguneli, T. Menzies, "How to Find Relevant Data for Effort Estimation", International Symposium on Empirical Software Engineering and Measurement (ESEM) 2011.
• E. Kocaguneli, G. Gay, Y. Yang, T. Menzies, "When to Use Data from Other Projects for Effort Estimation", International Conference on Automated Software Engineering (ASE) 2010, short paper.
8 Question 1: When do I have the perfect data?
Principle #1: Know your domain
• Domain knowledge is important in every step (Fayyad1996)
• Yet, this knowledge takes time and effort to gain, e.g. percentage commit information
Principle #2: Let the experts talk
• Initial results may be off according to domain experts
• Success is to create a discussion, interest and suggestions
Principle #3: Suspect your data
• "Curiosity" to question is a key characteristic (Rauser2011)
• E.g. in an SEE project, 200+ test cases, 0 bugs
Principle #4: Data collection is cyclic
• Any step from mining till presentation may be repeated
9 Question 2: What is the best effort estimation method?
• There is no agreed upon best estimation method (Shepperd2001)
• Methods change ranking w.r.t. conditions such as data sets, error measures (Myrtveit2005)
• Experimenting with: 90 solo-methods, 20 public data sets, 7 error measures
• Top 13 methods are CART & ABE methods (1NN, 5NN)
10 Question 3: How to use superior subset of methods?
• We have a set of superior methods to recommend
• Assembling solo-methods may be a good idea, e.g. fusion of 3 biometric modalities (Ross2003)
• But the previous evidence of assembling multiple methods in SEE is discouraging: Baker2007, Kocaguneli2009, Khoshgoftaar2009 failed to outperform solo-methods
• Combine top 2, 4, 8, 13 solo-methods via mean, median and IRWM
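The combination schemes named on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the solo-method estimates arrive ordered best-ranked first and that IRWM (inverse rank weighted mean) gives the best-ranked estimate the largest weight. The function names and the sample estimates are hypothetical.

```python
from statistics import mean, median


def irwm(estimates):
    """Inverse rank weighted mean.

    Assumes estimates are ordered best-ranked first: with n estimates the
    first gets weight n, the second n - 1, and the last gets weight 1.
    """
    n = len(estimates)
    weights = range(n, 0, -1)
    return sum(w * e for w, e in zip(weights, estimates)) / sum(weights)


def combine(estimates, top, scheme):
    """Combine the estimates of the `top` best solo-methods.

    `scheme` is one of "mean", "median" or "irwm" (hypothetical labels).
    """
    schemes = {"mean": mean, "median": median, "irwm": irwm}
    return schemes[scheme](estimates[:top])


# Hypothetical effort estimates from four solo-methods, best-ranked first.
solo_estimates = [100.0, 120.0, 90.0, 150.0]
for scheme in ("mean", "median", "irwm"):
    print(scheme, combine(solo_estimates, top=4, scheme=scheme))
```

With `top` set to 2, 4, 8 or 13 and the three schemes, this gives the twelve multi-method variants the slide describes.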
11 Question 2: What is the best effort estimation method? Question 3: How to use superior subset of methods?
Principle #5: Use a ranking stability indicator
• A method to identify successful methods using their rank changes
Principle #6: Assemble superior solo-methods
• A novel scheme for assembling solo-methods
• Multi-methods that outperform all solo-methods
This research is published at:
• E. Kocaguneli, T. Menzies, J. Keung, "On the Value of Ensemble Effort Estimation", IEEE Transactions on Software Engineering, 2011.
• J. Keung, E. Kocaguneli, T. Menzies, "A Ranking Stability Indicator for Selecting the Best Effort Estimator in Software Cost Estimation", Journal of Automated Software Engineering, 2012.
12 Question 4: How can we improve ABE methods?
Analogy based methods make use of similar past projects for estimation.
They are very widely used (Walkerden1999) as:
• No model-calibration to local data
• Can better handle outliers
• Can work with 1 or more attributes
• Easy to explain
Two promising research areas:
• Weighting the selected analogies (Mendes2003, Mosley2002)
• Improving design options (Keung2008)
13 How can we improve ABE methods? (cntd.)
Building on the previous research (Mendes2003, Mosley2002, Keung2008), we adopted two different strategies.
a) Weighting analogies
• We used kernel weighting to weigh selected analogies
• Compare performance of each k-value with and without weighting
• In none of the scenarios did we see a significant improvement
14 How can we improve ABE methods? (cntd.)
b) Designing ABE methods
• Easy-path: Remove training instances that violate assumptions
• TEAK will be discussed later
• D-ABE: Built on theoretical maximum prediction accuracy (TMPA) (Keung2008)
D-ABE
• Get best estimates of all training instances
• Remove all the training instances within half of the worst MRE (acc. to TMPA)
• Return closest neighbor's estimate to the test instance
[Figure: training instances a to f and test instance t. Instances close to the worst MRE are removed and the closest remaining neighbor's estimate is returned.]
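The three D-ABE steps above can be sketched as below. This is an illustrative reading under stated assumptions, not the published algorithm: the "best estimate" of a training instance is taken to be the effort of whichever other training instance yields the lowest MRE, and "within half of the worst MRE" is read as "best MRE at least half of the worst best-MRE". Euclidean distance and all function names are assumptions as well.

```python
import math


def distance(a, b):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mre(actual, predicted):
    """Magnitude of relative error."""
    return abs(actual - predicted) / actual


def best_mre(i, efforts):
    """Lowest MRE that any other training instance's effort achieves for i."""
    return min(mre(efforts[i], efforts[j])
               for j in range(len(efforts)) if j != i)


def d_abe_estimate(features, efforts, test_features):
    """Sketch of D-ABE: prune hard-to-estimate instances, then use 1NN."""
    best = [best_mre(i, efforts) for i in range(len(efforts))]
    worst = max(best)
    # Assumed reading of "within half of the worst MRE": drop instances
    # whose best achievable MRE is at least half of the worst one.
    kept = [i for i in range(len(efforts)) if best[i] < worst / 2]
    if not kept:  # nothing survives (e.g. worst == 0), so keep everything
        kept = list(range(len(efforts)))
    nearest = min(kept, key=lambda i: distance(features[i], test_features))
    return efforts[nearest]


# Hypothetical data: the last project cannot be estimated well by analogy.
features = [[1.0], [2.0], [3.0], [10.0]]
efforts = [10.0, 11.0, 12.0, 100.0]
print(d_abe_estimate(features, efforts, [9.0]))
```

In this toy data the test instance is closest to the fourth project, but that project is pruned, so the estimate comes from the third project instead.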
15 How can we improve ABE methods? (cntd.)
[Figures: D-ABE comparison to static k w.r.t. MMRE; D-ABE comparison to static k w.r.t. win, tie, loss.]
16 How can we improve ABE methods? (cntd.)
Principle #7: Weighting analogies is overelaboration
• Investigation of an unexplored and promising ABE option of kernel-weighting
• A negative result published at ESE Journal
Principle #8: Use easy-path design
• An ABE design option that can be applied to different ABE methods (D-ABE, TEAK)
This research is published at:
• E. Kocaguneli, T. Menzies, A. Bener, J. Keung, "Exploiting the Essential Assumptions of Analogy-based Effort Estimation", IEEE Transactions on Software Engineering, 2011.
• E. Kocaguneli, T. Menzies, J. Keung, "Kernel Methods for Software Effort Estimation", Empirical Software Engineering Journal, 2011.
17 Question 5: How to handle lack of local data?
• Finding enough local training data is a fundamental problem of SEE (Turhan2009)
• The merit of using cross-data from another company is questionable (Kitchenham2007)
• We use a relevancy filtering method called TEAK on public and proprietary data sets
[Figure: two groups of projects. Similar projects with dissimilar effort values, hence high variance; similar projects with similar effort values, hence low variance.]
• After TEAK's relevancy filtering, cross data works as well as within data for 6 out of 8 proprietary data sets and 19 out of 21 public data sets
18 How to handle lack of local data? (cntd.)
Principle #9: Use relevancy filtering
• A novel method to handle lack of local data
• Successful application on public as well as proprietary data
This research is published at:
• E. Kocaguneli, T. Menzies, "How to Find Relevant Data for Effort Estimation", International Symposium on Empirical Software Engineering and Measurement (ESEM) 2011.
• E. Kocaguneli, G. Gay, Y. Yang, T. Menzies, "When to Use Data from Other Projects for Effort Estimation", International Conference on Automated Software Engineering (ASE) 2010, short paper.
19 E(k) matrices & Popularity
• This concept helps the next 2 problems: size features and the essential content, i.e. the pop1NN and QUICK algorithms, respectively
• A similar concept is reverse nearest neighbor (RNN) in ML, used to find instances whose k-NNs include a specific query instance (Achtert2006)
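One way to picture popularity is as a reverse-nearest-neighbor count. The sketch below assumes that E(k) marks, for each instance, its k nearest neighbors, and that an instance's popularity is the number of other instances that list it among those neighbors. The slide does not spell out this definition, so treat it, the Euclidean metric and the function names as assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two rows (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def popularity(rows, k=1):
    """Count how often each row is among the k nearest neighbors of others.

    Equivalent to summing the columns of a 0/1 matrix E(k) in which
    E[i][j] = 1 when row j is one of the k nearest neighbors of row i.
    Distance ties are broken by row index.
    """
    n = len(rows)
    counts = [0] * n
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: euclidean(rows[i], rows[j]))
        for j in others[:k]:
            counts[j] += 1
    return counts


# Hypothetical one-feature data: the two middle rows are each other's
# nearest neighbors and are also nearest to the two outer rows.
print(popularity([[0.0], [1.0], [1.5], [10.0]]))
```

Rows with a count of zero are nobody's nearest neighbor, which is what makes them candidates for pruning.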
20 E(k) matrices & Popularity (cntd.)
Outlier pruning: sample steps
1. Calculate "popularity" of instances
2. Sort by popularity
3. Label one instance at a time
4. Find the stopping point
5. Return estimate from labeled training data
Finding the stopping point
1. If all popular instances are exhausted.
2. Or if there is no MRE improvement for n consecutive times.
3. Or if the ∆ between the best and the worst error of the last n instances is very small. (∆ = 0.1; n = 3)
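The three stopping conditions can be checked with a small helper like the one below. It is a sketch under assumptions: `errors` is taken to be the history of MRE values observed after each newly labeled instance, and "no improvement for n consecutive times" is read as "none of the last n errors beats the best error seen before them". The function and parameter names are hypothetical; ∆ = 0.1 and n = 3 come from the slide.

```python
def should_stop(errors, remaining_popular, n=3, delta=0.1):
    """Return True when any of the three stopping conditions fires.

    errors: MRE after each labeled instance, oldest first (assumed input).
    remaining_popular: number of popular instances not yet labeled.
    """
    # Condition 1: all popular instances are exhausted.
    if remaining_popular == 0:
        return True
    # Condition 2: no improvement over the earlier best for n labelings.
    if len(errors) > n:
        best_before = min(errors[:-n])
        if all(e >= best_before for e in errors[-n:]):
            return True
    # Condition 3: best and worst of the last n errors differ by < delta.
    if len(errors) >= n:
        last = errors[-n:]
        if max(last) - min(last) < delta:
            return True
    return False


# Hypothetical error histories.
print(should_stop([0.9, 0.6, 0.3, 0.1], remaining_popular=5))      # still improving
print(should_stop([0.9, 0.52, 0.50, 0.48], remaining_popular=5))   # errors have flattened
```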
21 E(k) matrices & Popularity (cntd.)
[Figure: error vs. number of labeled instances in the active pool.]
• Picking a random training instance is not a good idea
• More popular instances in the active pool decrease error
• One of the stopping point conditions fires
22 Question 6: Do I have to use size attributes?
• At the heart of widely accepted SEE methods lie the software size attributes
• COCOMO uses LOC (Boehm1981), whereas FP (Albrecht1983) uses logical transactions
• Size attributes are beneficial if used properly (Lum2002); e.g. DoD and NASA use them successfully
• Yet, the size attributes may not be trusted or may not be estimated at the early stages. That disrupts adoption of SEE methods.
"Measuring software productivity by lines of code is like measuring progress on an airplane by how much it weighs" - B. Gates
"This is a very costly measuring unit because it encourages the writing of insipid code" - E. Dijkstra
23 Do I have to use size attributes? (cntd.)
[Figure: pop1NN (w/o size) vs. CART and 1NN (w/ size).]
• Given enough resources for correct collection and estimation, size features are helpful
• If not, then outlier pruning helps
24 Do I have to use size attributes? (cntd.)
Principle #10: Use outlier pruning
• Promotion of SEE methods that can compensate for the lack of software size features
• A method called pop1NN that shows that size features are not a "must"
This research is published at:
• E. Kocaguneli, T. Menzies, J. Hihn, Byeong Ho Kang, "Size Doesn't Matter? On the Value of Software Size Features for Effort Estimation", Predictive Models in Software Engineering (PROMISE) 2012.
25 Question 7: What is the essential content of SEE data?
• SEE is populated with overly complex methods for marginal performance increase (Jorgensen2007)
• In a matrix of N instances and F features, the essential content is N′ ∗ F′
• QUICK is an active learning method that combines outlier removal and synonym pruning
Synonym pruning
1. Transpose the normalized matrix and calculate the popularity of features
2. Select non-popular features
• Removal of features based on distance seemed to be reserved for instances
• Similar tasks: both remove cells in the hypercube of all cases times all columns (Lipowezky1998)
• ABE method as a two dimensional reduction (Ahn2007)
• In our lab, a variance-based feature selector is used as a row selector
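The two synonym pruning steps can be sketched as follows. The sketch assumes min-max normalization per feature, Euclidean distance between feature columns, popularity counted with k = 1, and "non-popular" meaning a popularity of zero. None of these details are given on the slide, and the function names are hypothetical.

```python
import math


def normalize(matrix):
    """Min-max normalize each column to [0, 1] (assumed normalization)."""
    cols = list(zip(*matrix))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]  # avoid divide by zero
    return [[(v - lo) / sp for v, lo, sp in zip(row, lows, spans)]
            for row in matrix]


def transpose(matrix):
    """Turn rows into columns so that features can be treated as instances."""
    return [list(col) for col in zip(*matrix)]


def popularity(rows, k=1):
    """Count how often each row is among the k nearest neighbors of others."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    counts = [0] * len(rows)
    for i in range(len(rows)):
        others = sorted((j for j in range(len(rows)) if j != i),
                        key=lambda j: dist(rows[i], rows[j]))
        for j in others[:k]:
            counts[j] += 1
    return counts


def synonym_prune(matrix):
    """Return indices of features that are no other feature's nearest neighbor."""
    feature_rows = transpose(normalize(matrix))
    return [f for f, p in enumerate(popularity(feature_rows)) if p == 0]


# Hypothetical data: features 0 and 1 move together (synonyms), feature 2 does not.
data = [[1.0, 1.1, 9.0],
        [2.0, 2.1, 1.0],
        [3.0, 3.0, 5.0],
        [4.0, 4.2, 2.0]]
print(synonym_prune(data))
```

Applying the same popularity idea to rows (outlier pruning) and then to columns (synonym pruning) is what shrinks the N by F matrix toward N′ ∗ F′.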
26 What is the essential content of SEE data? (cntd.)
• At most 31% of all the cells
• On median 10%
"There is a consensus in the high-dimensional data analysis community that the only reason any methods work in very high dimensions is that, in fact, the data are not truly high-dimensional." (Levina & Bickel 2005)
Performance?
27 What is the essential content of SEE data? (cntd.)
[Figures: QUICK vs passiveNN (1NN); QUICK vs CART.]
• Only one data set where QUICK is significantly worse than passiveNN
• 4 such data sets when QUICK is compared to CART
28 What is the essential content of SEE data? (cntd.)
Principle #11: Combine outlier and synonym pruning
• An unsupervised method to find the essential content of SEE data sets and reduce the data needs
• Promoting research to elaborate on the data, not on the algorithm
This research is under 3rd round review:
• E. Kocaguneli, T. Menzies, J. Keung, "Active Learning for Effort Estimation", third round review at IEEE Transactions on Software Engineering.
29 Question 8: How should I choose the right SM?
[Figure: expected (Kitchenham2007) vs. observed bias and variance (B&V) values.]
• No significant difference for B&V values among 90 methods
• Only minutes of run time difference (<15)
• LOO is not probabilistic and results can be easily shared
30 How should I choose the right SM? (cntd.)
Principle #12: Be aware of sampling method trade-off
• The first investigation of B&V trade-off in SEE domain
• Recommendation based on experimental concerns
This research is under 2nd round review:
• E. Kocaguneli, T. Menzies, "Software Effort Models Should be Assessed Via Leave-One-Out Validation", under second round review at Journal of Systems and Software.
31 What to know?
• When do I have perfect data?
  1. Know your domain
  2. Let the experts talk
  3. Suspect your data
  4. Data collection is cyclic
• What is the best effort estimation method?
  5. Use a ranking stability indicator
• Can I use multiple methods?
  6. Assemble superior solo-methods
• ABE methods are easy to use. How can I improve them?
  7. Weighting analogies is over-elaboration
  8. Use easy-path design
• What if I lack resources for local data?
  9. Use relevancy filtering
• I don't believe in size attributes. What can I do?
  10. Use outlier pruning
• Are all attributes and all instances necessary?
  11. Combine outlier and synonym pruning
• How to experiment, which sampling method to use?
  12. Be aware of sampling method trade-off
32 Validity Issues
Construct validity, i.e. do we measure what we intend to measure?
• Use of previously recommended estimation methods, error measures and data sets
External validity, i.e. can we generalize results outside current specifications?
• Difficult to assert that results will definitely hold
• Yet we use almost all the publicly available SEE data sets
• Median value of projects used by the studies reviewed is 186 projects (Kitchenham2007)
• Our experimentation uses 1000+ projects
33 Future Work
• Application on publicly accessible big data sets
  • 300K projects, 2M users
  • 250K open source projects
• Smarter, larger scale algorithms for general conclusions
• Current methods may face scalability issues. Improving common ideas for scalability, e.g. linear time NN methods
• Application to different domains, e.g. defect prediction
• Combining intrinsic dimensionality techniques in ML for lower bound dimensions of SEE data sets (Levina2004)
36 References
• J. W. Keung, "Theoretical Maximum Prediction Accuracy for Analogy-Based Software Cost Estimation," 15th Asia-Pacific Software Engineering Conference, pp. 495–502, 2008.
• U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, "The kdd process for extracting useful knowledge from volumes of data," Commun. ACM, vol. 39, no. 11, pp. 27–34, Nov. 1996.
• J. Rauser, "What is a career in big data?" 2011. [Online]. Available: http://strataconf.com/stratany2011/public/schedule/speaker/10070
• M. Shepperd and G. Kadoda, "Comparing Software Prediction Techniques Using Simulation," IEEE Trans. Softw. Eng., vol. 27, no. 11, pp. 1014–1022, 2001.
• I. Myrtveit, E. Stensrud, and M. Shepperd, "Reliability and validity in comparative studies of software prediction models," IEEE Trans. Softw. Eng., vol. 31, no. 5, pp. 380–391, May 2005.
• E. Alpaydin, "Techniques for combining multiple learners," Proceedings of Engineering of Intelligent Systems, vol. 2, pp. 6–12, 1998.
• D. Baker, "A hybrid approach to expert and model-based effort estimation," Master's thesis, Lane Department of Computer Science and Electrical Engineering, West Virginia University, 2007.
• E. Kocaguneli, Y. Kultur, and A. Bener, "Combining multiple learners induced on multiple datasets for software effort prediction," in International Symposium on Software Reliability Engineering (ISSRE), 2009, student paper.
• T. M. Khoshgoftaar, P. Rebours, and N. Seliya, "Software quality analysis by combining multiple projects and learners," Software Quality Control, vol. 17, no. 1, pp. 25–49, 2009.
• F. Walkerden and R. Jeffery, "An empirical study of analogy-based software effort estimation," Empirical Software Engineering, vol. 4, no. 2, pp. 135–158, 1999.
• E. Mendes, I. D. Watson, C. Triggs, N. Mosley, and S. Counsell, "A comparative study of cost estimation models for web hypermedia applications," Empirical Software Engineering, vol. 8, no. 2, pp. 163–196, 2003.
• E. Mendes and N. Mosley, "Further investigation into the use of cbr and stepwise regression to predict development effort for web hypermedia applications," in International Symposium on Empirical Software Engineering, 2002.
• B. Turhan, T. Menzies, A. Bener, and J. Di Stefano, "On the relative value of cross-company and within-company data for defect prediction," Empirical Software Engineering, vol. 14, no. 5, pp. 540–578, 2009.
• B. A. Kitchenham, E. Mendes, and G. H. Travassos, "Cross versus within-company cost estimation studies: A systematic review," IEEE Trans. Softw. Eng., vol. 33, no. 5, pp. 316–329, 2007.
• B. W. Boehm, C. Abts, A. W. Brown, S. Chulani, B. K. Clark, E. Horowitz, R. Madachy, D. J. Reifer, and B. Steece, Software Cost Estimation with Cocomo II. Upper Saddle River, NJ, USA: Prentice Hall PTR, 2000.
• A. Albrecht and J. Gaffney, "Software function, source lines of code and development effort prediction: A software science validation," IEEE Trans. Softw. Eng., vol. 9, pp. 639–648, 1983.
• K. Lum, J. Powell, and J. Hihn, "Validation of spacecraft cost estimation models for flight and ground systems," in ISPA'02: Conference Proceedings, Software Modeling Track, 2002.
• M. Jorgensen and M. Shepperd, "A systematic review of software development cost estimation studies," IEEE Trans. Softw. Eng., vol. 33, no. 1, pp. 33–53, 2007.
• A. Ross, "Information fusion in biometrics," Pattern Recognition Letters, vol. 24, no. 13, pp. 2115–2125, Sep. 2003.
• R. P. L. Buse and T. Zimmermann, "Information needs for software development analytics," ICSE 2012, pp. 987–996.
• Spaceref.com, "Nasa to shut down checkout & launch control system," August 26, 2002. http://www.spaceref.com/news/viewnews.html?id=475
• Standish Group, CHAOS Report (Report). West Yarmouth, Massachusetts: Standish Group, 2004.
• U. Lipowezky, "Selection of the optimal prototype subset for 1-NN classification," Pattern Recognition Lett., vol. 19, pp. 907–918, 1998.
• H. Ahn, K. Kim, and I. Han, "A case-based reasoning system with the two-dimensional reduction technique for customer classification," Expert Systems with Applications, vol. 32, no. 4, pp. 1011–1019, May 2007, ISSN 0957-4174, 10.1016/j.eswa.2006.02.021.
• E. Achtert, C. Böhm, P. Kröger, P. Kunath, A. Pryakhin, and M. Renz, "Efficient reverse k-nearest neighbor search in arbitrary metric spaces," in Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data (SIGMOD '06), 2006.
• E. Levina and P. J. Bickel, "Maximum likelihood estimation of intrinsic dimension," in Advances in Neural Information Processing Systems, vol. 17, Cambridge, MA, USA: The MIT Press, 2004.
40 What is the best effort estimation method? (cntd.)
1. Rank methods acc. to win, loss and win-loss values
2. δr is the max. rank change
3. Sort methods acc. to loss and observe δr values
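The three steps can be sketched as below, assuming each method already has a win count and a loss count from pairwise comparisons. The sketch reads δr as the difference between a method's best and worst rank across the three rankings. Method names and counts are hypothetical, and ties are broken by insertion order.

```python
def rank(scores, higher_is_better):
    """Map each method to its rank (1 = best) under one criterion."""
    order = sorted(scores, key=scores.get, reverse=higher_is_better)
    return {method: r for r, method in enumerate(order, start=1)}


def rank_changes(wins, losses):
    """Return delta_r per method: max rank minus min rank over 3 rankings."""
    by_win = rank(wins, higher_is_better=True)
    by_loss = rank(losses, higher_is_better=False)
    by_diff = rank({m: wins[m] - losses[m] for m in wins},
                   higher_is_better=True)
    delta = {}
    for m in wins:
        ranks = (by_win[m], by_loss[m], by_diff[m])
        delta[m] = max(ranks) - min(ranks)
    return delta


# Hypothetical win and loss counts for three methods.
wins = {"A": 50, "B": 40, "C": 45}
losses = {"A": 5, "B": 10, "C": 30}
delta_r = rank_changes(wins, losses)

# Step 3: sort methods by loss and inspect their delta_r values.
for method in sorted(losses, key=losses.get):
    print(method, losses[method], delta_r[method])
```

A method with few losses and a small δr keeps roughly the same position no matter which criterion is used, which is the stability the indicator looks for.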
41 What is the best effort estimation method? (cntd.)
What about aggregate results reflecting on specific scenarios? (question of a reviewer)
• Sort methods according to increasing MdMRE
• Group MRE values that are statistically the same
• Highlighted are the cases where superior methods do not occur in the top group
• Note how superior solo-methods correspond to the best (lowest MRE) groups
42 How can we improve ABE methods? (cntd.)
We used kernel weighting with 4 kernels with 5 bandwidth values plus IRWM to weigh selected analogies (5 different k values).
A total of 2090 settings:
• No weighting: 19 datasets * 5 k values = 95
• Kernel weighting: 19 datasets * 5 k values * 4 kernels * 5 bandwidths = 1900
• IRWM: 19 datasets * 5 k values = 95
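Kernel weighting of analogies can be illustrated with a small sketch. The slide does not name the four kernels or the bandwidth values, so the Gaussian kernel, the bandwidth used here and the function names are assumptions chosen only to show the mechanism: closer analogies receive larger weights in the weighted mean of their efforts.

```python
import math


def gaussian_kernel(u):
    """Standard Gaussian kernel (assumed; the slide does not name the kernels)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)


def kernel_weighted_estimate(distances, efforts, bandwidth):
    """Weighted mean of the analogies' efforts.

    distances: distance from the test project to each selected analogy.
    efforts: actual effort of each selected analogy.
    bandwidth: kernel bandwidth (hypothetical value in the example).
    """
    weights = [gaussian_kernel(d / bandwidth) for d in distances]
    total = sum(weights)
    if total == 0:  # all weights underflowed, fall back to the plain mean
        return sum(efforts) / len(efforts)
    return sum(w * e for w, e in zip(weights, efforts)) / total


# Hypothetical k = 2 analogies: the closer one dominates the estimate.
print(kernel_weighted_estimate([0.1, 2.0], [10.0, 20.0], bandwidth=1.0))

# The setting counts on the slide add up as stated.
print(19 * 5 + 19 * 5 * 4 * 5 + 19 * 5)
```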
43 How can we improve ABE methods? (cntd.)
We used kernel weighting to weigh selected analogies.
Compare performance of each k-value with and without weighting:
• o = tie for 3 or more k values
• - = loss for 3 or more k values
• + = win for 3 or more k values
In none of the scenarios did we see a significant improvement.
44 How to handle lack of local data? (cntd.)
[Figures: TEAK on proprietary data; TEAK on public data.]
45 Do I have to use size attributes? (cntd.)
Can standard methods tolerate the lack of size attributes?
[Figures: CART w/o size vs. CART w/ size; CART and 1NN.]
46 Question 8: How should I choose the right SM?
Only one work (Kitchenham2007) discusses implications of sampling method (SM) on the bias and variance.
The expectation is:
• LOO: high variance, low bias
• 3Way: low variance, high bias
• 10Way: in between
Does the expectation hold?
What about run time and ease of replication?
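The replication point can be made concrete with a sketch of how the splits are produced. It assumes "3Way" and "10Way" mean N-way cross-validation over a randomly shuffled data set, while leave-one-out (LOO) needs no randomness at all. The function names and the seed parameter are hypothetical.

```python
import random


def loo_splits(n):
    """Leave-one-out: each instance is the test set exactly once."""
    return [([j for j in range(n) if j != i], [i]) for i in range(n)]


def n_way_splits(n, ways, seed):
    """N-way cross-validation over a shuffled index list.

    The partition depends on the random seed, so two researchers only get
    the same folds if they also share the seed (or the shuffled order).
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[f::ways] for f in range(ways)]
    return [([j for j in indices if j not in fold], fold) for fold in folds]


# LOO always yields the same train/test pairs for a given data set.
print(loo_splits(4))

# 3-way splits of a hypothetical 6-project data set for one seed.
for train, test in n_way_splits(6, ways=3, seed=1):
    print(sorted(train), sorted(test))
```

Because LOO produces the same splits on every run, its results can be shared and compared directly, which matches the "not probabilistic" remark in the main deck.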