2. SOUND BITES
• Ye olde worlde SE
• “The” model of SE (defects, effort, etc.)
• 21st century SE
• Models (plural)
• No generality in models
• But perhaps there is generality in how we find those models
• Transfer learning
4. WHAT IS TRANSFER LEARNING?
• Source = old = Domain1 = <Eg1, P1>
• Target = new = Domain2 = <Eg2, P2>
• If we move from Domain1 to Domain2, do we have to start afresh?
• Or can we learn faster in the “new” …
• … using lessons learned from the “old”?
• NSF funding (2013–2017):
• Transfer learning in Software Engineering
• Menzies, Layman, Shull, Diep
5. WHO CARES? (WHAT’S AT STAKE?)
• “Transfer” is a core scientific issue
• Lack of transfer is the scandal of SE
• Replication in empirical SE is rare
• Conclusion instability
• “It all depends.”
• The full stop syndrome
• The result? A funding crisis
6. MANUAL TRANSFER (WAR STORIES)
• Brazil, SEL, 2002: need domain knowledge (but now gone?)
• NSF, SEL, 2006: need better automatic support
• Kitchenham, Mendes et al., TSE 2007: evidence for = evidence against
• Zimmermann, FSE 2009: cross-project learning works in only 4/600 cases
7. WAR STORIES (EFFORT ESTIMATION)
Effort = a · LOC^x · y
• Learned using Boehm’s methods
• 20 × 66% samples of NASA93
• COCOMO attributes
• Linear regression (log pre-processor; sketched below)
• Sort the coefficients found for each member of x, y
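Below, a minimal sketch of that regression step in Python, assuming NumPy and made-up values (not the actual NASA93 rows): taking logs turns the multiplicative COCOMO form into a linear one, ordinary least squares fits it, and the multiplier coefficients can then be sorted by magnitude.

    import numpy as np

    # Fit log(effort) = log(a) + x*log(loc) + sum_i b_i*log(em_i),
    # i.e. the linearized form of effort = a * loc^x * prod_i em_i^b_i.
    def fit_effort_model(loc, ems, effort):
        X = np.column_stack([np.ones(len(loc)),  # intercept -> log(a)
                             np.log(loc),        # exponent on loc
                             np.log(ems)])       # effort multipliers
        beta, *_ = np.linalg.lstsq(X, np.log(effort), rcond=None)
        return beta                              # [log(a), x, b_1, ..., b_k]

    # Illustrative values only (not NASA93):
    loc    = np.array([25.9, 24.6, 7.7, 8.2, 9.7, 2.2])
    ems    = np.array([[1.15, 0.88], [1.07, 1.00], [1.00, 0.94],
                       [0.88, 1.10], [1.15, 1.00], [1.00, 1.00]])
    effort = np.array([117.6, 117.6, 31.2, 36.0, 25.2, 8.4])
    coefs  = fit_effort_model(loc, ems, effort)
    # Sort the coefficients found for each effort multiplier:
    print(sorted(enumerate(coefs[2:]), key=lambda kv: abs(kv[1]), reverse=True))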
9. SMARTER TRANSFER (AUTOMATIC SUPPORT)
• Don’t use all available training data
• Use relevancy filtering [Turhan’09, ESE journal] (see the sketch after this list)
• Use variance pruning ← this talk
• Don’t use the raw attributes
• Most are rubbish anyway
• Feature selection [Chen’05, IEEE Software]
• Feature synthesis ← this talk
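For the relevancy-filtering bullet, here is a minimal sketch in the spirit of Turhan’09 (the paper’s exact procedure may differ; k=10 and Euclidean distance are assumptions): for each local test case, keep only its k nearest cross-company rows, then train on the union of the survivors.

    import math

    # For each local test row, keep only its k nearest cross-company rows
    # (Euclidean distance on the independent attributes), then train on
    # the union of the survivors. k=10 is an assumption.
    def relevancy_filter(cross_rows, local_tests, k=10):
        def dist(p, q):
            return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
        keep = set()
        for t in local_tests:
            nearest = sorted(range(len(cross_rows)),
                             key=lambda i: dist(cross_rows[i], t))[:k]
            keep.update(nearest)
        return [cross_rows[i] for i in sorted(keep)]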
10. ESEM, 2011: How to Find Relevant Data for Effort Estimation
TIM MENZIES, EKREM KOCAGUNELI
11. US DOD MILITARY PROJECTS (LAST DECADE)
You must segment to find relevant data
13. IN THE LITERATURE: WITHIN VS. CROSS = ?? (BEFORE THIS WORK)
Kitchenham et al., TSE 2007
• Within-company learning (just use local data)
• Cross-company learning (just use data from other companies)
Results mixed
• No clear win from cross or within
Cross vs. within are not rigid boundaries
• They are soft borders
• And we can move a few examples across the border
• After making those moves, “cross” is the same as “local”
14. SOME DATA DOES NOT DIVIDE NEATLY ON EXISTING DIMENSIONS
15. THE LOCALITY(1) ASSUMPTION
Data divides best on one attribute:
1. development center of the developers;
2. project type, e.g. embedded, etc.;
3. development language;
4. application type (MIS, GNC, etc.);
5. targeted hardware platform;
6. in-house vs. outsourced projects;
7. etc.
If Locality(1): hard to use data across these boundaries
• Then harder to build effort models:
• Need to collect local data (slow)
16. THE LOCALITY(N) ASSUMPTION
Data divides best on a combination of attributes
If Locality(N):
• Easier to use data across these boundaries
• Relevant data is spread all around
• Little diamonds floating in the dust
17. HOW TO FIND RELEVANT TRAINING DATA?

                 independent attributes
                 w    x    y    z  | class
    similar 1    0    1    1    1  |   2
    similar 2    0    1    1    1  |   3
    different 1  7    7    6    2  |   5
    different 2  1    9    1    8  |   8
    different 3  5    4    2    6  |  10
    alien 1     74   15   73   56  |  20
    alien 2     77   45   13    6  |  40
    alien 3     35   99   31   21  |  60
    alien 4     49   55   37    4  |  80

Use similar? Use more variant? Use aliens?
18. VARIANCE PRUNING
(Same table as slide 17: the “similar” and “different” rows are KEPT; the high-variance “alien” rows are PRUNED)
1) Sort the clusters by “variance”
2) Prune the high-variance clusters
3) Estimate on the rest
“Easy path”: cull the examples that hurt the learner (sketched below)
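A minimal sketch of those three steps, assuming the clusters are already formed and that “variance” means the variance of each cluster’s class (effort) values; the keep_fraction cut-off is an assumption, not the published rule.

    import statistics

    # clusters: list of lists of (attributes, class_value) pairs.
    def variance_prune(clusters, keep_fraction=0.5):
        def var(cluster):
            vals = [cv for _, cv in cluster]
            return statistics.pvariance(vals) if len(vals) > 1 else 0.0
        ranked = sorted(clusters, key=var)      # 1) sort by variance
        n_keep = max(1, int(len(ranked) * keep_fraction))
        return ranked[:n_keep]                  # 2) prune the rest; 3) estimate on these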
19. TEAK: CLUSTERING + VARIANCE PRUNING (TSE, JAN 2011)
• TEAK is a variance-based instance selector
• It is built via GAC trees (sketched below)
• TEAK is a two-pass system
• The first pass selects low-variance relevant projects
• The second pass retrieves projects to estimate from
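To make “built via GAC trees” concrete, here is a structural sketch of greedy agglomerative clustering (not the TSE’11 implementation): repeatedly merge the two closest roots under a new parent. TEAK’s first pass then keeps subtrees whose effort values have low variance, and its second pass retrieves estimates from the survivors.

    import math

    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    # Greedy agglomerative clustering: repeatedly merge the two closest
    # roots into a parent represented by their mean. O(N^3) as written;
    # fine for a sketch, not for production.
    def gac_tree(projects):
        nodes = [{"rep": p, "kids": []} for p in projects]
        while len(nodes) > 1:
            i, j = min(((i, j) for i in range(len(nodes))
                               for j in range(i + 1, len(nodes))),
                       key=lambda ij: dist(nodes[ij[0]]["rep"],
                                           nodes[ij[1]]["rep"]))
            right, left = nodes.pop(j), nodes.pop(i)   # pop j first (j > i)
            rep = [(a + b) / 2 for a, b in zip(left["rep"], right["rep"])]
            nodes.append({"rep": rep, "kids": [left, right]})
        return nodes[0]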
20. ESSENTIAL POINT
TEAK finds local regions important to the estimation of particular cases
TEAK finds those regions via Locality(N)
• Not Locality(1)
22. EXPERIMENT 1: PERFORMANCE COMPARISON OF WITHIN AND CROSS-SOURCE DATA
• TEAK run on within & cross data for each dataset group (lines separate groups)
• Leave-one-out cross-validation (LOOCV) used for the runs
• 20 runs performed for each treatment
• Results evaluated w.r.t. MAR, MMRE, MdMRE and Pred(30) (sketched below), but see http://goo.gl/6q0tw
• If within data outperforms cross data, the dataset is highlighted in gray
• Only 2 datasets are highlighted
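For reference, minimal sketches of the four measures named above, under their usual definitions (see the linked note for why MRE-based measures should be read with care):

    import statistics

    def mar(actual, predicted):
        # Mean absolute residual.
        return statistics.mean(abs(a - p) for a, p in zip(actual, predicted))

    def mre(actual, predicted):
        # Magnitude of relative error for each (actual, predicted) pair.
        return [abs(a - p) / a for a, p in zip(actual, predicted)]

    def mmre(actual, predicted):
        return statistics.mean(mre(actual, predicted))

    def mdmre(actual, predicted):
        return statistics.median(mre(actual, predicted))

    def pred(actual, predicted, level=0.30):
        # Fraction of estimates within `level` of the actual, e.g. Pred(30).
        errors = mre(actual, predicted)
        return sum(e <= level for e in errors) / len(errors)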
24. EXPERIMENT 2: RETRIEVAL TENDENCY OF TEAK FROM WITHIN AND CROSS-SOURCE DATA
[Figure: diagonal (within-company) vs. off-diagonal (cross-company) selection percentages, sorted; percentiles of the diagonals and off-diagonals]
25. HIGHLIGHTS
1. Don’t listen to everyone
• When listening to a crowd, first filter the noise
2. Once the noise clears: bits of me are similar to bits of you
• The probability of selecting cross or within instances is the same
3. Cross-vs-within is not a useful distinction
• Locality(1) is not informative
• Locality(N) enables “cross-company” learning
26. TSE, 2013: LOCAL VS. GLOBAL MODELS FOR EFFORT ESTIMATION AND DEFECT PREDICTION
TIM MENZIES, ANDREW BUTCHER (WVU)
ANDRIAN MARCUS (WAYNE STATE)
THOMAS ZIMMERMANN (MICROSOFT)
DAVID COK (GRAMMATECH)
28. ROOT CAUSE OF CONCLUSION INSTABILITY?
HYPOTHESIS #1: Any one of…
• Noisy data?
• Too little data?
• Poor statistical technique?
• Stochastic choice within the data miner (e.g. random forests)?
• etc.
HYPOTHESIS #2: SE is an inherently varied activity
• So conclusion instability can’t be fixed
• It must be managed
• Needs different kinds of data miners
30. ENVY = THE WISDOM OF THE COWS
• Seek the fence where the grass is greener on the other side
• Learn from “there”
• Test on “here”
• Cluster to find “here” and “there”
31. DATA = MULTI-DIMENSIONAL VECTORS
@attribute recordnumber real
@attribute projectname {de,erb,gal,X,hst,slp,spl,Y}
@attribute cat2 {Avionics, application_ground, avionicsmonitoring, … }
@attribute center {1,2,3,4,5,6}
@attribute year real
@attribute mode {embedded,organic,semidetached}
@attribute rely {vl,l,n,h,vh,xh}
@attribute data {vl,l,n,h,vh,xh}
…
@attribute equivphyskloc real
@attribute act_effort real
@data
1,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,25.9,117.6
2,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,24.6,117.6
3,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,7.7,31.2
4,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,8.2,36
5,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,9.7,25.2
6,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,2.2,8.4
…
32. CAUTION: DATA MAY NOT DIVIDE NEATLY ON RAW DIMENSIONS
The best description for SE projects may be synthesized dimensions extracted from the raw dimensions
33. FASTMAP
FastMap: Faloutsos [1995]
O(2N) generation of an axis of large variability:
• Pick any point W
• Find X, furthest from W
• Find Y, furthest from X
c = dist(X,Y)
Every point has distances a, b to (X,Y):
• x = (a² + c² − b²) / (2c)
• y = √(a² − x²)
Find median(x), median(y)
Recurse on the four quadrants
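A minimal sketch of one FastMap pass as described above (the random pivot choice and Euclidean distance are assumptions; Faloutsos’ heuristic allows further pivot refinement):

    import math, random

    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    # One FastMap pass: two scans over the data (hence O(2N)) project
    # every point onto the axis between two distant pivots.
    def fastmap(points):
        w = random.choice(points)                  # pick any point W
        X = max(points, key=lambda p: dist(p, w))  # X: furthest from W
        Y = max(points, key=lambda p: dist(p, X))  # Y: furthest from X
        c = dist(X, Y) or 1e-12                    # guard: all points equal
        coords = []
        for p in points:
            a, b = dist(p, X), dist(p, Y)
            x = (a**2 + c**2 - b**2) / (2 * c)     # position along the X-Y axis
            y = math.sqrt(max(a**2 - x**2, 0.0))   # distance off that axis
            coords.append((x, y))
        return coords                              # then split at the medians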
34. HIERARCHICAL PARTITIONING
Grow:
• Find two orthogonal dimensions
• Find median(x), median(y)
• Recurse on four quadrants
Prune:
• Combine quadtree leaves with similar densities
• Score each cluster by the median score of its class variable
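A sketch of the grow half, assuming each example is paired with its two FastMap coordinates; the prune half (merging leaves with similar densities, then scoring clusters) is omitted:

    import statistics

    # items: list of (example, (x, y)) pairs, where (x, y) are the two
    # FastMap coordinates. Recurse on the four quadrants defined by
    # median(x) and median(y); stop at small leaves.
    def grow(items, min_size=4):
        if len(items) <= min_size:
            return [items]
        mx = statistics.median(x for _, (x, _) in items)
        my = statistics.median(y for _, (_, y) in items)
        quadrants = {}
        for item in items:
            _, (x, y) = item
            quadrants.setdefault((x > mx, y > my), []).append(item)
        if len(quadrants) == 1:     # degenerate split: cannot divide further
            return [items]
        leaves = []
        for quadrant in quadrants.values():
            leaves.extend(grow(quadrant, min_size))
        return leaves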
36. ENVY = THE WISDOM OF THE COWS
• Seek the fence where the grass is greener on the other side
• Learn from “there”
• Test on “here”
• Cluster to find “here” and “there”
38. HIERARCHICAL PARTITIONING
(Grow and prune as in slide 34)
Where is the grass greenest?
• This cluster envies the neighbor with a better score and the maximum abs(score(this) − score(neighbor))
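A sketch of that envy rule, assuming clusters are already scored by the median of their class variable (lower effort/defects = better) and that each cluster’s neighbors are known:

    # scores: cluster -> median class score (lower is better here);
    # neighbors: cluster -> list of adjacent clusters. Returns the
    # neighbor that cluster c "envies", or None if none scores better.
    def enviable_neighbor(scores, neighbors, c):
        better = [n for n in neighbors[c] if scores[n] < scores[c]]
        if not better:
            return None
        return max(better, key=lambda n: abs(scores[c] - scores[n]))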
39. Q: HOW TO LEARN RULES FROM NEIGHBORING CLUSTERS?
A: It doesn’t really matter
• There are many competent rule learners
But to evaluate global vs. local rules:
• Use the same rule learner for local and global rule learning
This study uses WHICH (Menzies [2010]):
• Customizable scoring operator
• Faster termination
• Generates very small rules (good for explanation)
40. DATA FROM HTTP://PROMISEDATA.ORG/DATA
Effort reduction = {NasaCoc, China}: COCOMO or function points
Defect reduction = {lucene, xalan, jedit, synapse, etc.}: CK metrics (OO)
Clusters have an untreated class distribution.
Rules select a subset of the examples:
• generating a treated class distribution (see the sketch after this slide)
[Figure: the 25th, 50th, 75th and 100th percentiles (scale 0–100) of three class distributions: untreated, treated with rules learned from all data (global), and treated with rules learned from the neighboring cluster (local)]
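A sketch of how a treated distribution arises, with hypothetical row and rule shapes: a rule selects a subset of the examples, and we compare the percentiles of the class values before and after.

    def percentiles(values, qs=(25, 50, 75, 100)):
        # Rough percentile lookup on the sorted class values.
        vals = sorted(values)
        return {q: vals[min(len(vals) - 1, int(q / 100 * len(vals)))] for q in qs}

    def treated(rows, rule):
        # Rules select a subset of the examples; their class values
        # form the treated class distribution.
        return [row["class"] for row in rows if rule(row)]

    # Hypothetical usage: compare untreated vs. treated percentiles.
    rows = [{"loc": 10, "class": 5}, {"loc": 90, "class": 40},
            {"loc": 20, "class": 8}, {"loc": 70, "class": 30}]
    rule = lambda row: row["loc"] < 50      # a made-up one-condition rule
    print(percentiles([r["class"] for r in rows]))   # untreated
    print(percentiles(treated(rows, rule)))          # treated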
41. BY ANY MEASURE, LOCAL IS BETTER THAN GLOBAL
• Lower median efforts/defects (50th percentile)
• Greater stability (75th − 25th percentile)
• Decreased worst case (100th percentile)
42. RULES LEARNED IN EACH CLUSTER
What works best “here” does not work “there”
• It is misguided to try to tame conclusion instability
• It is inherent in the data
You can’t tame conclusion instability
• Instead, you can exploit it
• Learn local lessons that do better than overly generalized global theories
43. RELATED WORK
Other clustering methods:
• nbTree, Kohavi [1996]
• consensus clustering
Outlier removal:
• Yin [2011], Yoon [2010]
Clustering & case-based reasoning:
• Kocaguneli [2011], Turhan [2009], Cuadrado [2007]
Design of experiments:
• Learn via envy
• Faster than N*M cross-validation
Localizations:
• Expert-based, Petersen [2009]: how do we know it is correct?
• This work: automatic learning of contexts
Structured literature reviews:
• Kitchenham [2007] + others
• Over-generalizations?
• As is anything in any SE textbook
50. ENVY = THE WISDOM OF THE COWS
• Seek the fence where the grass is greener on the other side
• Learn from “there”
• Test on “here”
• Cluster to find “here” and “there”