1. LOCAL VS. GLOBAL MODELS FOR EFFORT ESTIMATION AND DEFECT PREDICTION
TIM MENZIES, ANDREW BUTCHER (WVU)
ANDRIAN MARCUS (WAYNE STATE)
THOMAS ZIMMERMANN (MICROSOFT)
DAVID COK (GRAMMATECH)
2. PREMISE
Something is very wrong with data mining research in software engineering:
• Need less “algorithm mining” and more “data mining”
• Handle “conclusion instability”
Need to do a different kind of data mining:
• Cluster, then learn
• Learning via “envy”
4. TOO MUCH MINING?
Porter & Selby, 1990:
• Evaluating Techniques for Generating Metric-Based Classification Trees, JSS
• Empirically Guided Software Development Using Metric-Based Classification Trees, IEEE Software
• Learning from Examples: Generation and Evaluation of Decision Trees for Software Resource Analysis, IEEE TSE
In 2011, Hall et al. (TSE, pre-print):
• reported 100s of similar studies
• L learners on D data sets in an M*N cross-val
What is your next paper?
• Hopefully not D*L*M*N
5. THE FIELD IS CALLED “DATA MINING”, NOT “ALGORITHM MINING”
To understand data mining, look at the data, not the algorithms.
Our results should be insights about data,
• not trivia about (say) decision tree algorithms.
Besides, the thing that most predicts for performance is the data, not the algorithm:
• Domingos & Pazzani, On the Optimality of the Simple Bayesian Classifier under Zero-One Loss, Machine Learning, Volume 29, pp. 103-130, 1997
8. CONCLUSION INSTABILITY: WHAT WORKS THERE DOES NOT WORK HERE
(See also Posnett et al. [2011].)
Zimmermann [2009]: learned defect predictors from 622 pairs of projects ⟨project1, project2⟩.
• In only 4% of pairs did project1’s predictors work for project2.
Kitchenham [2007]: studies comparing effort models learned from local versus imported data.
• 1/3 better, 1/3 same, 1/3 worse
Jørgensen [2004]: 15 studies comparing model-based to expert-based estimation.
• 1/3 better, 1/3 same, 1/3 worse
Mair [2005]: studies comparing regression to analogy methods for effort estimation.
• 7/20 better, 4/20 same, 9/20 worse
9. ROOT CAUSE OF CONCLUSION INSTABILITY?
HYPOTHESIS #1: Any one of…
• Over-generalization across different kinds of projects? Solve with “delphi localization”
• Noisy data?
• Too little data?
• Poor statistical technique?
• Stochastic choice within the data miner (e.g. random forests)?
• Insert idea here
HYPOTHESIS #2: SE is an inherently varied activity
• So conclusion instability can’t be fixed
• It must be managed
• Needs different kinds of data miners: cluster, then learn; learning via “envy”
10. SOLVE CONCLUSION INSTABILITY WITH “DELPHI LOCALIZATIONS”?
Restrict data mining to just related projects: ask an expert to find the right local context.
• Are we sure they’re right?
• Posnett et al. 2011: what is the right level for learning? Files or packages? Methods or classes? It changes from study to study.
And even if they are “right”:
• Should we use those contexts?
• What if there is not enough info in our own delphi localization?
11. DELPHI LOCALIZATIONS
Q: What to do about rare zones?
A: Select the nearest ones from the rest. But how?
13. KOCAGUNELI [2011]: CLUSTERING TO FIND “LOCAL”
TEAK: estimates from “k” nearest neighbors
• “k” auto-selected per test case
• Pre-processor to cluster data, remove worrisome regions
• IEEE TSE, Jan’11
ESEM’11:
• Train within one delphi localization, or train on all and see what it picks
• Result #1: usually, cross is as good as within
• Result #2: given a choice of both, TEAK picks “within” as much as “cross”
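To make the TEAK idea concrete, here is a heavily simplified Python analogue, not the published GAC-tree algorithm: per test case it auto-selects the k whose nearest neighbors show the least effort variance (the “worrisome region” intuition), then estimates with their median. All names are illustrative, and it assumes at least two training examples.

```python
# Hedged, simplified analogue of TEAK-style estimation (Kocaguneli [2011]):
# auto-select k per test case by preferring low-variance neighborhoods,
# then return the median effort of those k nearest neighbors.
import math
from statistics import median, pvariance

def dist(p, q):
    """Euclidean distance between two equal-length numeric feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def teak_like_estimate(test, train, efforts, ks=(2, 4, 8, 16)):
    """test: feature vector; train: list of vectors; efforts: parallel list."""
    order = sorted(range(len(train)), key=lambda i: dist(train[i], test))
    best_k = min(
        (k for k in ks if k <= len(order)),          # only feasible k values
        key=lambda k: pvariance([efforts[i] for i in order[:k]]),
    )
    return median(efforts[i] for i in order[:best_k])
```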
14. LESSON: DATA MAY NOT DIVIDE NEATLY ON RAW DIMENSIONS
The best description for SE projects may be synthesized dimensions extracted from the raw dimensions.
15. SYNTHESIZED DIMENSIONS
PCA: e.g. Nagappan [2006]
• Finds orthogonal “components”
• Transforms N correlated variables to fewer uncorrelated “components”
• Component[i] accounts for as much variability as possible
• Component[j > i] accounts for the remaining variability
• O(N²) to generate
Fastmap: Faloutsos [1995]
• O(2N) generation of an axis of large variability
• Pick any point W; find X furthest from W; find Y furthest from X
• Let c = dist(X,Y); every point has distances a, b to (X, Y)
• x = (a² + c² − b²) / (2c)
• y = √(a² − x²)
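The Fastmap recipe above translates directly into a few lines of Python. This is a minimal sketch, assuming plain numeric feature vectors, Euclidean distance, and at least two distinct points; the projection follows the x and y formulas on this slide.

```python
# Minimal sketch of Fastmap (Faloutsos [1995]) as described above:
# two passes pick the distant pivot points X and Y, then the cosine rule
# projects every point onto the X-Y axis.
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def fastmap_xy(points):
    w = points[0]                                          # pick any point W
    x_pivot = max(points, key=lambda p: dist(p, w))        # X: furthest from W
    y_pivot = max(points, key=lambda p: dist(p, x_pivot))  # Y: furthest from X
    c = dist(x_pivot, y_pivot)                             # assumes c > 0
    out = []
    for p in points:
        a, b = dist(p, x_pivot), dist(p, y_pivot)
        x = (a**2 + c**2 - b**2) / (2 * c)     # position along the X-Y axis
        y = math.sqrt(max(a**2 - x**2, 0.0))   # distance off that axis
        out.append((x, y))
    return out
```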
16. HIERARCHICAL PARTITIONING
Grow:
• Find two orthogonal dimensions
• Find median(x), median(y)
• Recurse on four quadrants
Prune:
• Combine quadtree leaves with similar densities
• Score each cluster by median score of class variable
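A minimal sketch of the Grow step and the cluster scoring, assuming each example is a dict carrying the synthesized Fastmap coordinates "x", "y" and a class value "klass" (illustrative names); the Prune step’s density test is omitted for brevity.

```python
# Grow: recursively split on median(x) and median(y) into four quadrants,
# stopping at small leaves. Score: median of a leaf's class variable.
from statistics import median

def grow(items, min_size=8):
    if len(items) <= min_size:
        return [items]                            # a quadtree leaf
    mx = median(it["x"] for it in items)
    my = median(it["y"] for it in items)
    quads = [[], [], [], []]
    for it in items:
        quads[(it["x"] > mx) + 2 * (it["y"] > my)].append(it)
    if any(len(q) == len(items) for q in quads):  # no progress: stop splitting
        return [items]
    return [leaf for q in quads if q for leaf in grow(q, min_size)]

def score(cluster):
    """Median score of the class variable, used to rank clusters."""
    return median(it["klass"] for it in cluster)
```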
17. Q: WHY CLUSTER VIA FASTMAP?
A1: Centroid-based methods (e.g. k-means) assume round clusters,
• but density-based clustering allows clusters to be any shape
A2: No need to pre-set the number of clusters
A3: The O(2N) heuristic is very fast, even in unoptimized Python
[Figure: runtimes of an unoptimized Python implementation]
19. Q: WHY TRAIN ON NEIGHBORING CLUSTERS WITH BETTER SCORES?
A1: Why learn from your own mistakes when there exists a smarter neighbor?
• The “grass is greener” principle
21. HIERARCHICAL PARTITIONING, CONTINUED
Grow and Prune as above; then ask: where is the grass greenest?
• C1 envies the neighbor C2 with max abs(score(C2) − score(C1))
• Train on C2, test on C1
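A minimal sketch of this envy step, under the same illustrative data layout as above; `neighbors`, `learn_rules`, and `evaluate` are assumed helpers, not the paper’s API.

```python
# "Envy": C1 picks the adjacent cluster C2 whose median class score differs
# most from its own; rules are then trained on C2 and tested back on C1.
# `score` is the median-class function sketched earlier.
def enviable_neighbor(c1, neighbors):
    return max(neighbors, key=lambda c2: abs(score(c2) - score(c1)))

# Sketch of the train/test protocol (learn_rules/evaluate are placeholders):
# c2 = enviable_neighbor(c1, neighbors(c1))
# rule = learn_rules(c2)   # train on the envied neighbor...
# evaluate(rule, c1)       # ...test back on the local cluster
```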
22. Q: HOW TO LEARN RULES FROM NEIGHBORING CLUSTERS?
A: It doesn’t really matter,
• but when comparing global and intra-cluster rules, use the same rule learner.
This study uses WHICH (Menzies [2010]):
• Customizable scoring operator
• Faster termination
• Generates very small rules (good for explanation)
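The details of WHICH are in Menzies [2010]; as a rough illustration only, here is a much-simplified, hedged sketch of a WHICH-style stochastic rule learner. Candidate rules are small sets of (feature, value) conditions, ranked by a customizable scoring operator `score_fn` (an assumption of this sketch), and high-ranked rules are repeatedly recombined. It assumes rows are dicts with a "klass" column and at least two distinct conditions.

```python
# Much-simplified, hedged sketch of a WHICH-style rule learner
# (Menzies [2010]); illustrative only, not the published implementation.
import random

def which(rows, score_fn, rounds=100, keep=20, seed=1):
    rng = random.Random(seed)
    # Seed the population with single-condition rules (skip the class column).
    pop = {frozenset([(f, v)]) for row in rows
           for f, v in row.items() if f != "klass"}

    def rank(rules):
        return sorted(rules, key=score_fn, reverse=True)[:keep]

    pop = rank(pop)
    for _ in range(rounds):
        a, b = rng.sample(pop[: min(10, len(pop))], 2)  # favor top-ranked rules
        pop = rank(set(pop) | {a | b})                  # combine and re-rank
    return pop[0]                                       # best (and very small) rule
```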
23. DATA FROM HTTP://PROMISEDATA.ORG/DATA
Effort reduction = { NasaCoc, China }: COCOMO or function point metrics
Defect reduction = { lucene, xalan, jedit, synapse, etc. }: CK (OO) metrics
Clusters have an untreated class distribution. Rules select a subset of the examples, generating a treated class distribution:
• treated with rules learned from all data (global)
• treated with rules learned from the neighboring cluster (local)
[Figure: distributions plotted as percentiles (25th, 50th, 75th, 100th) on a 0-100 scale, comparing untreated vs. global vs. local]
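As a sketch of how such a percentile comparison can be computed: this assumes rows carry a "klass" value and that `matches(rule, row)` is an assumed helper testing whether a rule selects a row; neither is the paper’s exact API.

```python
# Compare the untreated class distribution against the distributions of rows
# selected ("treated") by the global vs. the local rule.
from statistics import quantiles

def spread(values):
    """25th/50th/75th percentiles plus the worst case (100th).
    Assumes at least two values."""
    q25, q50, q75 = quantiles(values, n=4)
    return {"25th": q25, "50th": q50, "75th": q75, "100th": max(values)}

def treated(rows, rule, matches):
    """Class distribution of just the rows a rule selects."""
    return [row["klass"] for row in rows if matches(rule, row)]

# Usage sketch:
# untreated = [row["klass"] for row in cluster]
# report = {"untreated": spread(untreated),
#           "global": spread(treated(cluster, global_rule, matches)),
#           "local":  spread(treated(cluster, local_rule, matches))}
```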
24. BY ANY MEASURE, PER-CLUSTER LEARNING IS BEST
• Lower median efforts/defects (50th percentile)
• Greater stability (75th − 25th percentile)
• Decreased worst case (100th percentile)
25. CLUSTERS GENERATE DIFFERENT RULES
What works “here” does not work “there”.
• It is misguided to try to tame conclusion instability: it is inherent in the data.
• Don’t tame it, use it: build lots of local models.
27. RELATED WORK
Defect & effort prediction: 1,000 papers
• All about making predictions
• This work: learning controllers to change predictions
Outlier removal: Yin [2011], Yoon [2010], Kocaguneli [2011]
• Subsumed by this work
Clustering & case-based reasoning: Kocaguneli [2011], Turhan [2009], Cuadrado [2007]
• Nothing generated, so nothing to reflect on
• Needs indexing (runtime speed)
Structured literature reviews: Kitchenham [2007] + many more besides
• May be over-generalizing across cluster boundaries
Design of experiments:
• Don’t learn from immediate data; learn from better neighbors
• Here: train once per cluster (a small subset of the whole data)
• Orders of magnitude faster than N*M cross-val
Localizations:
• Expert-based, Petersen [2009]: how to know it is correct?
• Source code-based, ecological inference: Posnett [2011]
• This work: auto-learning of contexts; beneficial
29. THIS TALK
Something is fundamentally wrong with data mining research in software engineering:
• Needs more “data mining”, less “algorithm mining”
• Handle “conclusion instability”
Need to do a different kind of data mining:
• Cluster, then learn
• Learning via “envy”
30. NOT “ONE RING TO RULE THEM ALL”
Trite global statements about multiple SE projects are… trite.
Need effective ways to learn local lessons:
• Automatic clustering tools
• Rule learning (per cluster, using envy)
35. THE WISDOM OF THE COWS
• Seek the fence where the grass is greener on the other side
• Learn from there; test on here
• Don’t rely on trite definitions of “there” and “here”: cluster to find “here” and “there”