The “Bellwether” Effect and Its Implications to Transfer Learning

Rahul Krishna (rkrish11@ncsu.edu), Tim Menzies, and Wei Fu

Today’s topic:
Transfer Learning
[Turhan09] Data from Turkish toasters can predict defects in NASA flight systems
WeTOSM ‘14

Today’s topic:
Simpler Transfer Learning with
“Bell… what?”

Definitions
Bellwether effect
• If a community builds many software projects
• There exists one ∈ many from which
• quality predictors can be built …
• … and used for all
Bellwether method
• Find the one
• Use it

Definitions
Note: the bellwether method is vastly simpler than other transfer learning methods [Turhan09, Turhan11, Nam13, etc.]

Outline
● Motivation
● Background
○ Evaluating Quality
○ Transfer Learning
○ The “Bellwether”
● Experimental Setup
○ Benchmark Data
○ Prediction Model
○ Statistical Measures
● Results
● Conclusions

The “Cold-Start” Problem
[Figure: Past Projects → Prediction Model → Upcoming releases]

The “Cold-Start” Problem
[Figure: the same diagram with a “?” added; labels: Past Projects, Prediction Model, ?, Upcoming releases]

Challenges:
Variable Datasets
• Growing volume of projects
... “New projects are always emerging, and old ones are being rewritten…”
… “the quality, representativeness, and volume of the training data have a major influence on the usefulness and stability of model performance…”
(Rahman et al. [Rah12])

Challenges:
Conclusion Instability
• Unstable conclusions are typical in SE [Menzies12]
• Findings on the usefulness of some lesson “X” contradict one another

Challenges:
Kitchenham et al. ‘07
• Are data from other organizations …
• … as useful as local data?
• Inconclusive
• 3 cases: just as good; 4 cases: worse.

Challenges:
Zimmermann et al. ‘09
• 622 pairs of projects
• Only 4% of pairs were useful

How to Reduce this Instability?
• Menzies et al. [Men12] offer several ways
• They ask for better experimental practice.
• Is there a better way?
• Yes! Look for the “Bellwether”
• As long as the bellwether continues to offer good quality predictions
• Then conclusions from one…
• ... are conclusions for all

Estimating Quality
Why not Static Analyzers?
• Rahman et al. [Rahman14] compared
• Code analysis tools: FindBugs, JLint, and PMD
• with static code defect predictors
• Found no difference (measurement: AUCEC)
• And
• Using lightweight parsers...
• … defect predictors can quickly jump to new languages
• The same is not true for static code analysis tools
• Fewer bugs, better software

Estimating Quality
Why not Static Analyzers?
• And
• Defect predictors work surprisingly well!
• [Ostrand04]: ~80% of the bugs are localized in 20% of the code

Estimating Quality:
Static Code Defect Prediction
1. Ubiquitous
• Researchers and industrial practitioners frequently use them, e.g. companies like Google [Lew14] and V&V books [Raktin01]
2. A lot of (ongoing) research
• Tremendous attention [Nam13]
• Better approaches are constantly being proposed
3. They are easy to use
• Software metrics can be collected fast
• Wide variety of tools and open source data miners [sklearn][weka]

Transfer Learning:
Introduction
• Extract knowledge from source (S) and apply to target (T)
• Data needs to be massaged before use [Zhang15]
• Careful sub-sampling
• Transformation
• Based on data source, TL is categorized as:
• Homogeneous vs. Heterogeneous
• Based on transformation [Nam13, Nam15, Jing15]
• Similarity vs. Dimensionality

Transfer Learning:
Categories
Homogeneous
• Source (S) and Target (T) are quantified using the same attributes
Heterogeneous
• Source (S) and Target (T) are quantified using different attributes
Similarity
• Learn from subsampled rows/columns of the source (S)
Dimensionality
• Manipulate rows/columns of source (S) to match target (T)

Transfer Learning:
Categories
This Talk
• Homogeneous: the same attributes
• Similarity: learn from subsampled rows/columns of the source (S)

Homogeneous TL:
Burak Filter
• Burak [Tur09] used relevancy filtering
• Filter using kNN
• Gather two sets of data
• Validation set (S): test data
• Candidate set (T): training data
• Use kNN
• Pick “similar” instances from T
• Filter T using S
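The filtering steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from [Tur09]: the function name `burak_filter`, the `(features, label)` row layout, Euclidean distance, and the default `k=10` are all assumptions made for the example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def burak_filter(validation_rows, candidate_rows, k=10):
    """Keep the candidate (training) rows that are among the k nearest
    neighbours of at least one validation (test) row.

    validation_rows: list of feature vectors (the set S on the slide)
    candidate_rows:  list of (features, label) pairs (the set T on the slide)
    k:               neighbours kept per validation row (assumed value)
    """
    keep = set()
    for v in validation_rows:
        # Rank candidate rows by distance to this validation row.
        ranked = sorted(range(len(candidate_rows)),
                        key=lambda i: euclidean(v, candidate_rows[i][0]))
        keep.update(ranked[:k])
    # Return the surviving rows in their original order.
    return [candidate_rows[i] for i in sorted(keep)]

# Hypothetical toy data: two candidates near the validation row, two far away.
candidates = [((0.0, 0.0), 0), ((0.1, 0.0), 1), ((5.0, 5.0), 0), ((9.0, 9.0), 1)]
validation = [(0.0, 0.1)]
filtered = burak_filter(validation, candidates, k=2)
```

A defect predictor would then be trained on `filtered` instead of on all candidate rows.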

Homogeneous TL:
Burak Filter
• First study on relevancy
• Their conclusion:
… The performances of defect predictors based on the NN-filtered data do not give necessary empirical evidence to make a strong conclusion …
… Sometimes NN data based models may perform better than WC data based models …

Homogeneous TL:
Mixed Model Learner
• Turhan et al. [Tur11] proposed a mixed-model learner
• Combine local data with curated non-local data
• Gather two sets of data
• Validation set (S): Pick a random 10% of local data
• Candidate set (T): Remaining 90% and non-local data
• For non-local data, they use the Burak filter [Tur09]
• Experiment with various 90%-10% splits
• 400 experiments were conducted to pick the best model

Homogeneous TL:
Mixed Model Learner
• Extension to Burak Filter
• Incorporated local data
Challenges
• Similar issues as Burak Filter
• Biased; unstable model.
• The authors report:
… mixed project models offer only limited improvements, i.e., in 3 out of 10 projects (Turhan ‘11)

Homogeneous TL:
Addressing the challenges
• Researchers have offered a bleak view of TL
• Zimmermann et al. [Zimm09]
• Transfer is not always consistent
• IE could learn from Firefox but not vice versa
• Rahman et al. [Rahman12]
• The “imprecision” of learning across projects
• Recent research has resorted to more complex approaches

More Transfer Learners …
WeTOSM ‘14

Is this complexity necessary?
• Short answer: No
• Just look for the “Bellwether”
• Use our bellwether method
• Build your model
• Et voilà!

The Bellwether Method
Generate
• Project pairs (Pi, Pj)
• Perform a leave-one-out test: train on Pi, test on Pj
• Pick the project with the best model
Apply
• Predict quality on future projects
Monitor
• When predictions fail, restart.
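The Generate and Monitor steps can be sketched as follows. This is a rough illustration under stated assumptions: `evaluate(train, test)` is a hypothetical callback that trains on one project and returns a score on another (lower is better), per-project scores are summarised by their median, and the restart `threshold` is an arbitrary placeholder. None of these choices are taken from the slides.

```python
from statistics import mean, median

def find_bellwether(projects, evaluate):
    """Generate step: train on each project, test on every other one,
    and pick the project whose models score best overall.

    projects: dict mapping project name -> dataset
    evaluate: hypothetical callback (train, test) -> score, lower is better
    Returns (bellwether_name, {name: median score across the other projects}).
    """
    summary = {}
    for src_name, src in projects.items():
        scores = [evaluate(src, tgt)
                  for tgt_name, tgt in projects.items()
                  if tgt_name != src_name]
        summary[src_name] = median(scores)  # median is an assumed aggregate
    best = min(summary, key=summary.get)
    return best, summary

def needs_restart(recent_scores, threshold):
    """Monitor step: signal a restart once the bellwether's recent
    median score is worse (higher) than an assumed threshold."""
    return median(recent_scores) > threshold

def toy_score(train, test):
    """Stand-in for 'train a model and measure its error': here just the
    gap between the means of two lists of numbers."""
    return abs(mean(train) - mean(test))

# Hypothetical community of three tiny "projects".
community = {"a": [1, 2, 3], "b": [2, 3, 4], "c": [10, 11, 12]}
bellwether, scores = find_bellwether(community, toy_score)
```

The Apply step is then ordinary prediction: the model built from the chosen project is used on new projects until `needs_restart` fires.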

Experiment Setup:
Benchmark Data
• 120 datasets from 4 communities
• Defects at 3 levels of granularity
• File, class, and function
• Open source and proprietary

Experiment Setup:
Benchmark Data
• BTW, Apache has local data
• Multiple versions
• Temporally ordered
• A total of 54 datasets

Experiment Setup:
Prediction Model
• We use Random Forests [Zimmerman08]
• Build several decision trees from random subsamples
• Use ensemble learning
• Samples are imbalanced [Pelayo07]
• More “clean” examples
• Use SMOTE [Chawla01] to rebalance data*
• Randomly down-sample “clean” instances
• Up-sample “buggy” instances
*Apply only to training data
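The rebalancing step can be illustrated with a small SMOTE-style sketch. It is not the reference SMOTE implementation: the function name `smote_like`, the choice to bring both classes to half the total size, the `k=5` neighbourhood, and the `(features, label)` layout are assumptions for illustration. Synthetic “buggy” rows are created by interpolating between a buggy row and one of its nearest buggy neighbours.

```python
import math
import random

def smote_like(rows, minority_label=1, k=5, seed=0):
    """Rebalance training rows: down-sample the majority class and
    up-sample the minority class with interpolated synthetic rows.

    rows: list of (features, label) pairs with numeric feature tuples.
    Only training data should be passed in; test data stays untouched.
    """
    rng = random.Random(seed)
    minority = [f for f, y in rows if y == minority_label]
    majority = [(f, y) for f, y in rows if y != minority_label]
    if len(minority) < 2 or not majority:
        return list(rows)  # nothing sensible to interpolate or rebalance
    target = (len(minority) + len(majority)) // 2  # assumed target size
    # Randomly down-sample the majority ("clean") class.
    if len(majority) > target:
        majority = rng.sample(majority, target)
    # Up-sample the minority ("buggy") class with synthetic rows.
    synthetic = []
    while len(minority) + len(synthetic) < target:
        i = rng.randrange(len(minority))
        base = minority[i]
        neighbours = sorted(
            (m for j, m in enumerate(minority) if j != i),
            key=lambda m: math.dist(base, m))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the line from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    balanced = majority + [(f, minority_label) for f in minority + synthetic]
    rng.shuffle(balanced)
    return balanced
```

Passing a fixed `seed` keeps a run repeatable, which matters when the whole experiment is repeated with different seeds.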

Experiment Setup:
Statistical Measures
• Prediction is usually measured using ROC
• ROC is a plot of Recall vs. False Alarm
• The plot requires several treatments
• Obtained by cross-validation.
• We refrain from cross-validation
• It tends to mix the test data with the bellwether
• Instead,
• We use Balance [Ma07]
Experiment Setup:
Statistical Measures
• Instead of a set of points for ROC,
• Produce one point.
• (X, Y) = (Pd (Recall), Pf (False Alarm))
• Balance is the weighted distance from the ideal point
• Ideal point: (Pd, Pf) = (1, 0)
• Balance =
• The lower the Balance, the better the performance
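The exact Balance formula did not survive on the slide, so the sketch below is only one plausible reading of “weighted distance from the ideal point”: a weighted Euclidean distance from (Pd, Pf) = (1, 0), normalised to the range 0 to 1. The function names and the default equal weights are assumptions, not the definition from [Ma07].

```python
import math

def pd_pf(tp, fn, fp, tn):
    """Recall (Pd) and false alarm rate (Pf) from confusion-matrix counts."""
    pd = tp / (tp + fn) if (tp + fn) else 0.0
    pf = fp / (fp + tn) if (fp + tn) else 0.0
    return pd, pf

def balance_distance(pd, pf, w_pd=1.0, w_pf=1.0):
    """Weighted, normalised distance from the ideal point (Pd, Pf) = (1, 0).

    0 means a perfect predictor; 1 means the worst case (Pd = 0, Pf = 1).
    The weights w_pd and w_pf are assumed, not taken from the slides.
    """
    squared = w_pd * (1.0 - pd) ** 2 + w_pf * (0.0 - pf) ** 2
    return math.sqrt(squared / (w_pd + w_pf))

# Hypothetical confusion matrix: 8 of 10 defects found, 1 false alarm in 10.
pd, pf = pd_pf(tp=8, fn=2, fp=1, tn=9)
score = balance_distance(pd, pf)
```

Under this reading a lower value is better, which matches the slide.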

Experiment Setup:
Statistical Measures
• The prediction model is inherently random
• Rerun the model 40 times with different seeds
• Collect the Balance measure in every run
• Use the Scott-Knott test to compare Balance values
• Scott-Knott ranks Balance values (best to worst)
• Rank -> Effect Size Test + Hypothesis Test
• Why SK?
• It has been used by recent high-profile papers at TSE [Mittas13] and ICSE [Ghotra15]
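To give a feel for the ranking step, here is a simplified Scott-Knott-style sketch. It sorts treatments by median, recursively splits them where the difference between group means is largest, and keeps a split only if the effect size is not small. It is deliberately incomplete: the hypothesis test mentioned on the slide is omitted, the Vargha-Delaney A12 statistic is a swapped-in effect-size measure, and the 0.6 threshold is an assumed value.

```python
from statistics import mean, median

def a12(xs, ys):
    """Vargha-Delaney A12: chance that a value drawn from xs exceeds
    one drawn from ys (ties count as half)."""
    more = sum(1 for x in xs for y in ys if x > y)
    same = sum(1 for x in xs for y in ys if x == y)
    return (more + 0.5 * same) / (len(xs) * len(ys))

def scott_knott(treatments, small_effect=0.6):
    """Rank treatments so that rank 1 holds the lowest scores.

    treatments: dict mapping name -> list of scores (e.g. 40 Balance values)
    Returns a dict mapping name -> rank; similar treatments share a rank.
    """
    ordered = sorted(treatments, key=lambda n: median(treatments[n]))
    ranks = {}
    next_rank = [1]

    def pooled(names):
        return [v for n in names for v in treatments[n]]

    def split(names):
        values = pooled(names)
        mu = mean(values)
        best_gain, cut = 0.0, None
        for i in range(1, len(names)):
            left, right = pooled(names[:i]), pooled(names[i:])
            # Between-group variance explained by cutting at position i.
            gain = (len(left) * (mean(left) - mu) ** 2 +
                    len(right) * (mean(right) - mu) ** 2) / len(values)
            if gain > best_gain:
                best_gain, cut = gain, i
        if cut is not None:
            effect = a12(pooled(names[cut:]), pooled(names[:cut]))
            if max(effect, 1.0 - effect) >= small_effect:
                split(names[:cut])
                split(names[cut:])
                return
        # No worthwhile split: everything here shares one rank.
        for n in names:
            ranks[n] = next_rank[0]
        next_rank[0] += 1

    split(ordered)
    return ranks
```

With Balance values, where lower is better, rank 1 would be the best-performing group.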

Results:
Research Questions
1. How rare are “Bellwethers”?
2. How does the bellwether fare against local models?
3. Is Bellwether better than other transfer learning methods?
4. Can we predict which data set will be bellwether?
5. How much of the “Bellwether” data is required?

Results:
Research Question 1
Research Answer
Our results suggest bellwethers are not rare.

Results:
Research Question 1
• Community: Apache; Bellwether: Lucene
• Community: NASA; Bellwether: MC
• Community: AEEEM; Bellwether: LC
• Community: ReLink; Bellwether: Safe

Results:
Research Question 2
Research Answer
For projects measured with the same quality metrics, training models with the bellwether is just as good as, if not better than, local models.

Results:
Research Question 3
Research Answer
The bellwether outperforms standard homogeneous transfer learners.

Results:
Research Question 4
Research Answer
This is non-trivial. Our attempts to statistically determine whether a project will be a bellwether were unsuccessful. This is open to further examination.

Results:
Research Question 5
How much data is required before detecting the “Bellwether”?
Research Answer
A few dozen defective samples from the bellwether are sufficient to build a reliable model.

Practical Implications
• The problem of generality in SE
• Reproducibility is hard to achieve.
• With bellwethers, transfer learners can
• Not only be reproducible
• But also be stable
• And reliable
• Identifying bellwethers earlier
• Would have changed the course of research
• More focus on coarse-grained analysis
• Less on relevancy filtering and model generation

Future Work
• Bellwethers in heterogeneous learners
• Promising heterogeneous transfer learners [Nam15][Jing15]
• Perform complex dimensionality mapping transforms
• Can bellwethers assist in finding the best mapping?
• Study and quantify bellwethers
• What makes a bellwether a bellwether?
• Bellwethers beyond defect prediction
• Are there bellwethers in other data?

In conclusion...
• Look for bellwethers
• To use as a baseline
• To justify the use of transfer learning
• Stabilize the pace of conclusions
• Not permanent conclusion stability
• Easy to find
• Look when necessary
• New data can be discarded
• Update only when the bellwether starts failing
