How to Find Relevant Data for Effort Estimation

                Ekrem Kocaguneli, Tim Menzies


     LCSEE, West Virginia University, Morgantown/USA



 1
http://goo.gl/j8F64


USD DOD military projects (last decade)


You must segment to find relevant data




2
http://goo.gl/j8F64
Domain Segmentations

        Q: What to do about rare zones?
        A: Select the nearest ones from the rest
           But how?

 3
http://goo.gl/j8F64

In the literature: within vs. cross = ??

Before
  Kitchenham et al., TSE 2007
     Within-company learning (just use local data)
     Cross-company learning (just use data from other companies)
  Results mixed
     No clear win from cross or within

This work
  Cross vs. within are not rigid boundaries
     They are soft borders
     And we can move a few examples across the border
     And after making those moves, “cross” is the same as “local”
 4
http://goo.gl/j8F64
Some data does not divide
neatly on existing dimensions




5
http://goo.gl/j8F64


The Locality(1) Assumption
    Data divides best on one attribute:
      1. development centers of developers;
      2. project type; e.g. embedded, etc.;
      3. development language;
      4. application type (MIS; GNC; etc.);
      5. targeted hardware platform;
      6. in-house vs. outsourced projects;
      7. etc.

    If Locality(1): hard to use data across these boundaries
        Then harder to build effort models:
        Need to collect local data (slow)

 6
http://goo.gl/j8F64


The Locality(N) Assumption
    Data divides best on a combination of attributes

    If Locality(N):
        Easier to use data across these boundaries
            Relevant data spread all around
            “little diamonds floating in the dust”
        (a tiny illustration of Locality(1) vs. Locality(N) follows below)
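
To make the two assumptions concrete, here is a tiny hypothetical Python sketch (all project names and numbers are invented, not from the paper): Locality(1) keeps only rows that match the new project on a single attribute, while Locality(N) ranks all rows by distance over a combination of attributes.

```python
import numpy as np

# Invented example data: development center, size (KLOC), team size, actual effort.
projects = [
    {"center": "A", "kloc": 10, "team": 4,  "effort": 120},
    {"center": "A", "kloc": 90, "team": 30, "effort": 2000},
    {"center": "B", "kloc": 12, "team": 5,  "effort": 150},
]
new = {"center": "A", "kloc": 11, "team": 4}

# Locality(1): divide on one attribute (here the development center).
local = [p for p in projects if p["center"] == new["center"]]

# Locality(N): ignore the single boundary; rank by distance over several attributes.
def dist(p):
    return np.linalg.norm([p["kloc"] - new["kloc"], p["team"] - new["team"]])
relevant = sorted(projects, key=dist)[:2]

print([p["effort"] for p in local])     # [120, 2000] -- same center, very different projects
print([p["effort"] for p in relevant])  # [120, 150]  -- nearest analogies cross the boundary
```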




 7
http://goo.gl/j8F64


Roadmap
         WHY:
           Motivation

         WHAT:
           Background (SEE = software effort estimation)

         HOW:
           Technology (TEAK)

         Results
           With TEAK, no distinction cross / within.

         Related Work

         SO WHAT:
           Conclusions


 8
http://goo.gl/j8F64


Roadmap
         WHY:
           Motivation

         WHAT:
           Background (SEE = software effort estimation)

         HOW:
           Technology (TEAK)

         Results
           With TEAK, no distinction cross / within.

         Related Work

         SO WHAT:
           Conclusions

 9
http://goo.gl/j8F64


What is SEE?


Software effort estimation (SEE) is
the activity of estimating the total
effort required to complete a
software project (Keung2008 [1]).



SEE has been heavily investigated since the early 80’s (Mendes2003 [2]), (Kemerer1987 [3]), (Boehm1981 [4])


10
http://goo.gl/j8F64


What is the SEE problem?



SEE as an industry problem:
   •  Software projects (60%-80%) encounter overruns
        •  Avg. overrun is 89% (Standish Group 2004)
        •  According to Jorgensen the amount is less (around 30%), but still dire (Jorgensen2011 [5])




11
http://goo.gl/j8F64


Active research area
 Jorgensen & Shepperd review 304 journal papers after filtering (Jorgensen2007 [8])
      •  For “software effort cost” in the “2000-2011” period, IEEE Xplore returns
          •  1098 conference papers
          •  161 journal papers

 Jorgensen & Shepperd’s literature review reveals (Jorgensen2007 [8])
     • Since the 80’s, 61% of SEE studies deal with proposing a new model and comparing it to old ones

12
http://goo.gl/j8F64


Roadmap
         WHY:
           Motivation

         WHAT:
           Background (SEE = software effort estimation)

         HOW:
           Technology (TEAK)

         Results
           With TEAK, no distinction cross / within.

         Related Work

         SO WHAT:
           Conclusions


 13
http://goo.gl/j8F64


TEAK = ABE0 + instance selection
  Kocaguneli et al. 2011, ASE journal
      17,000+ variants of analogy-based effort estimation

  ABE0 = analogy-based effort estimator, version 0
      Just the most commonly used analogy method
      Normalized numerics: min to max, 0 to 1
      Euclidean distance (ignoring dependent variables)
      Equal weighting to all attributes
      Return median effort of k-nearest neighbors
      (a minimal sketch follows below)

  Instance selection
      Smart way to adjust training data
 14
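
A minimal sketch of the ABE0 recipe listed above (hypothetical code, not the authors' implementation): min-max normalization of the independent attributes, unweighted Euclidean distance, and the median effort of the k nearest analogies.

```python
import numpy as np

def abe0_estimate(X_train, y_train, x_test, k=3):
    """ABE0: normalize to 0..1, equal-weight Euclidean distance, median effort of k nearest."""
    X = np.asarray(X_train, dtype=float)
    y = np.asarray(y_train, dtype=float)
    x = np.asarray(x_test, dtype=float)
    # Normalize every numeric attribute from min..max to 0..1 (dependent variable excluded).
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    Xn, xn = (X - lo) / span, (x - lo) / span
    # Equal weighting to all attributes: plain Euclidean distance to every training project.
    dists = np.linalg.norm(Xn - xn, axis=1)
    nearest = np.argsort(dists)[:k]
    return float(np.median(y[nearest]))

# Toy usage with made-up projects (independent attributes w, x, y, z; class = effort).
X_train = [[0, 1, 1, 1], [0, 1, 1, 1], [7, 7, 6, 2], [1, 9, 1, 8]]
y_train = [2, 3, 5, 8]
print(abe0_estimate(X_train, y_train, [0, 1, 1, 2], k=2))  # median of the two "similar" efforts
```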
http://goo.gl/j8F64


How to find relevant training data?
                                         independent
                                           attributes

          Use similar?                   w     x    y    z   class
                            similar 1     0    1    1    1     2
                            similar 2     0    1    1    1     3
                           different 1    7    7    6    2     5
                           different 2    1    9    1    8     8
     Use more variant?     different 3    5    4    2    6    10
                             alien 1     74   15   73   56    20
                             alien 2     77   45   13    6    40
                             alien 3     35   99   31   21    60
                             alien 4     49   55   37    4    80
            Use aliens ?




15
http://goo.gl/j8F64


Variance pruning
                                                independent
                                                  attributes
          KEEP !
                                                w     x    y    z   class
                                   similar 1     0    1    1    1     2
                                   similar 2     0    1    1    1     3
                                  different 1    7    7    6    2     5
                                  different 2    1    9    1    8     8
                                  different 3    5    4    2    6    10
                                    alien 1     74   15   73   56    20
                                    alien 2     77   45   13    6    40
                                    alien 3     35   99   31   21    60
                                    alien 4     49   55   37    4    80
          PRUNE !

1) Sort the clusters by “variance”
2) Prune the high-variance clusters
3) Estimate on the rest

“Easy path”: cull the examples that hurt the learner
(a minimal sketch of this loop follows below)
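
A small sketch of that three-step loop (hypothetical code; how the clusters are formed is assumed to come from elsewhere, e.g. a GAC tree as on the next slide):

```python
import numpy as np

def prune_by_variance(clusters, efforts, keep_fraction=0.5):
    """clusters: lists of row indices; keep the lowest effort-variance clusters, prune the rest."""
    efforts = np.asarray(efforts, dtype=float)
    ranked = sorted(clusters, key=lambda c: np.var(efforts[c]))  # low variance first
    n_keep = max(1, int(len(ranked) * keep_fraction))
    kept = ranked[:n_keep]                                       # KEEP the calm clusters
    return sorted(i for c in kept for i in c)                    # estimate on these rows only

# Toy data mirroring the table above: "similar", "different" and "alien" groups.
efforts  = [2, 3, 5, 8, 10, 20, 40, 60, 80]
clusters = [[0, 1], [2, 3, 4], [5, 6, 7, 8]]
print(prune_by_variance(clusters, efforts, keep_fraction=0.7))  # -> [0, 1, 2, 3, 4]; aliens pruned
```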
16
http://goo.gl/j8F64
TEAK: clustering + variance pruning
      (TSE, Jan 2011)

 • TEAK is a variance-based instance selector
 • It is built via GAC (greedy agglomerative clustering) trees

 • TEAK is a two-pass system
     • First pass selects low-variance relevant projects
     • Second pass retrieves projects to estimate from
 (a simplified sketch of the two passes follows below)
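
For flavor, a heavily simplified, hypothetical sketch of the two-pass idea (not the published TEAK algorithm; thresholds and tree details are guesses): pass one builds a GAC-style tree by greedy pairwise merging and keeps instances under below-median effort-variance subtrees; pass two rebuilds a tree on the survivors and walks toward the test case while variance keeps dropping.

```python
import numpy as np

class Node:
    def __init__(self, idx, left=None, right=None):
        self.idx = idx                      # indices of training rows under this node
        self.left, self.right = left, right

def gac_tree(X):
    """Greedy agglomerative clustering: repeatedly merge the two closest clusters (centroid distance)."""
    nodes = [Node([i]) for i in range(len(X))]
    cents = [X[i].astype(float) for i in range(len(X))]
    while len(nodes) > 1:
        best, pair = None, None
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                d = np.linalg.norm(cents[i] - cents[j])
                if best is None or d < best:
                    best, pair = d, (i, j)
        i, j = pair
        parent = Node(nodes[i].idx + nodes[j].idx, nodes[i], nodes[j])
        cent = X[parent.idx].mean(axis=0)
        del nodes[j], cents[j], nodes[i], cents[i]   # higher index deleted first
        nodes.append(parent)
        cents.append(cent)
    return nodes[0]

def subtrees(node):
    yield node
    if node.left:
        yield from subtrees(node.left)
        yield from subtrees(node.right)

def variance(node, y):
    return float(np.var(y[node.idx]))

def teak_estimate(X_train, y_train, x_test):
    X, y = np.asarray(X_train, float), np.asarray(y_train, float)
    x = np.asarray(x_test, float)
    # Pass 1: keep instances that sit under below-median effort-variance subtrees.
    inner = [n for n in subtrees(gac_tree(X)) if n.left]
    if not inner:                                    # tiny data: nothing to prune
        keep = list(range(len(X)))
    else:
        cutoff = np.median([variance(n, y) for n in inner])
        keep = sorted({i for n in inner if variance(n, y) <= cutoff for i in n.idx})
    X2, y2 = X[keep], y[keep]
    # Pass 2: rebuild a tree on the survivors; descend toward the nearer child
    # while effort variance keeps dropping, then return the median effort there.
    node = gac_tree(X2)
    while node.left:
        child = min((node.left, node.right),
                    key=lambda c: np.linalg.norm(X2[c.idx].mean(axis=0) - x))
        if variance(child, y2) >= variance(node, y2):
            break
        node = child
    return float(np.median(y2[node.idx]))
```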

17
http://goo.gl/j8F64


Essential point

  TEAK  finds local regions
  important to the estimation of
  particular cases

  TEAK   finds those regions via
  locality(N)
      Not locality(1)




 18
http://goo.gl/j8F64


Roadmap
         WHY:
           Motivation

         WHAT:
           Background (SEE = software effort estimation)

         HOW:
           Technology (TEAK)

         Results
           With TEAK, no distinction cross / within

         Related Work

         SO WHAT:
           Conclusions

 19
http://goo.gl/j8F64


Within and Cross Datasets
Out of 20 datasets, only 6 are found suitable for within/cross experiments




       Note: all Locality(1) divisions




20
http://goo.gl/j8F64
Experiment 1: Performance Comparison of Within and Cross-Source Data
              • TEAK on within & cross data for each
              dataset group (lines separate groups)
              • LOOCV used for runs
              • 20 runs performed for each treatment
              • Results evaluated w.r.t. MAR, MMRE, MdMRE and Pred(30)
                 (sketched below), but see http://goo.gl/6q0tw


              • If within data outperforms cross, the
              dataset is highlighted with gray
                   • See only 2 datasets highlighted

21
http://goo.gl/j8F64

Experiment 2: Retrieval Tendency of TEAK
from Within and Cross-Source Data




22
http://goo.gl/j8F64

Experiment 2: Retrieval Tendency of TEAK from Within and Cross-Source Data

                                  Diagonal (WC) vs. off-diagonal (CC) selection percentages, sorted

                    Percentiles of diagonals and off-diagonals



 23
http://goo.gl/j8F64


Roadmap
         WHY:
           Motivation

         WHAT:
           Background (SEE = software effort estimation)

         HOW:
           Technology (TEAK)

         Results
           With TEAK no distinction cross /within

         Related Work

         SO WHAT:
           Conclusions
 24
http://goo.gl/j8F64


Roadmap
         WHY:
           Motivation

         WHAT:
           Background (SEE = software effort estimation)

         HOW:
           Technology (TEAK)

         Results
           With TEAK, no distinction cross / within

         Related Work

         SO WHAT:
           Conclusions

 25
http://goo.gl/j8F64


Highlights
1.         Don’t listen to everyone
            When listening to a crowd, first
             filter the noise

2.         Once the noise clears: bits of
           me are similar to bits of you
            Probability of selecting cross or
             within instances is the same

3.         Cross-vs-within is not a
           useful distinction
            Locality(1) not informative
            Enables “cross-company”
             learning
  26
http://goo.gl/j8F64


Implications
                  Companies can learn from each other’s data

                  Business case for building a shared repository

                  Maybe there are general effects in SE
                      effects that transcend the boundaries of one company


27
http://goo.gl/j8F64


Future Work
              1.         Check external validity
                        Does cross == within (after
                         instance selection) in other data?

              2.         Build more repositories
                        More useful than previously
                         thought for effort estimation

              3.         Synonym discovery
                      Can only use cross-data if it has
                       the same ontology
                      auto-generate lexicons to map
                       terms between data sets?
28
http://goo.gl/j8F64
Questions?
Comments?




29
http://goo.gl/j8F64


References
1) J. W. Keung, “Theoretical Maximum Prediction Accuracy for Analogy-Based Software Cost Estimation,” 2008 15th Asia-Pacific Software Engineering
      Conference, pp. 495–502, 2008. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4724583
2) E. Mendes, I. D. Watson, C. Triggs, N. Mosley, and S. Counsell, “A comparative study of cost estimation models for web hypermedia applications,” Empirical
      Software Engineering, vol. 8, no. 2, pp. 163–196, 2003.
3) C. Kemerer, “An empirical validation of software cost estimation models,” Communications of the ACM, vol. 30, no. 5, pp. 416–429, May 1987.
4) B. W. Boehm, Software Engineering Economics. Upper Saddle River, NJ, USA: Prentice Hall PTR, 1981.
5) Magne Jørgensen, “Contrasting ideal and realistic conditions as a means to improve judgment-based software development effort estimation”, Information
      and Software Technology (July 2011) doi:10.1016/j.infsof.2011.07.001
6) Johann Rost, Robert L. Glass, “The Dark Side of Software Engineering: Evil on Computing Projects”, Wiley John & Sons Inc., 2011.
7) Magne Jørgensen, “Contrasting ideal and realistic conditions as a means to improve judgment-based software development effort estimation”, Information
      and Software Technology (July 2011) doi:10.1016/j.infsof.2011.07.001
8) M. Jorgensen and M. Shepperd, “A systematic review of software development cost estimation studies,” IEEE Trans. Softw. Eng., vol. 33, no. 1, pp. 33–53,
      2007.
9) M. Shepperd and G. F. Kadoda, “Comparing software prediction techniques using simulation,” IEEE Trans. Software Eng, vol. 27, no. 11, pp. 1014–1022, 2001.
10) I. Myrtveit, E. Stensrud, and M. Shepperd, “Reliability and validity in comparative studies of software prediction models,” IEEE Transactions on Software
      Engineering, vol. 31, no. 5, pp. 380–391, May 2005.
11) B. Kitchenham, E. Mendes, and G. H. Travassos. Cross versus Within-Company Cost Estimation Studies: A Systematic Review. IEEE Trans. Softw. Eng., 33(5):
      316–329, 2007.
12) T. Menzies, Z. Chen, J. Hihn, and K. Lum. Selecting Best Practices for Effort Estimation. IEEE Transaction on Software Engineering, 32(11):883–895, 2006.
13) K. Lum, J. Powell, and J. Hihn. Validation of Spacecraft Software Cost Estimation Models for Flight and Ground Systems. In ISPA Conference Proceedings,
      Software Modeling Track, May 2002.
14) J. Keung, E. Kocaguneli, and T. Menzies. A Ranking Stability Indicator for Selecting the Best Effort Estimator in Software Cost Estimation. Automated
      Software Engineering (submitted), 2011.
15) M. Shepperd and C. Schofield. Estimating Software Project Effort Using Analogies. IEEE Transactions on Software Engineering, 23(12), Nov. 1997.
16) L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees, 1984.
17) T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy. Cross-project defect prediction. ESEC/FSE’09, page 91, 2009.
18) B. Turhan, T. Menzies, A. Bener, and J. Di Stefano. On the relative value of cross-company and within-company data for defect prediction. Empirical
      Software Engineering, 14(5):540–578, 2009.
19) E. Kocaguneli, G. Gay, T. Menzies, Y. Yang, and J. W. Keung. When to use data from other projects for effort estimation. In ASE’10, pages 321–324, 2010.




30
http://goo.gl/j8F64

Related work

Cross-vs-within (defect prediction)
    Zimmermann et al. FSE’09
       pairs of projects (x, y)
       For 96% of pairs, predictors from “x” failed for “y”
       No relevancy filtering
    Opposite result: Turhan et al. ESE’09
       If nearest-neighbor filtering, predictors from “x” work well for “y”
       But no variance filtering

Other
    Keung et al. 2011
       90 effort estimators
       Best methods built multiple local models (CART, CBR)
       Single-dimensional models comparatively worse
    Instance selection
       Can discard 70 to 90% of data without hurting accuracy
       Since 1974, 100s of papers
       http://goo.gl/8iAUz
 31
