The “Bellwether” Effect and Its Implications to Transfer Learning

Transfer learning is the process of translating quality predictors learned in one data set to another. It has been the subject of much recent research. In practice, that research means changing models all the time, as transfer learners continually exchange new models into the current project. This paper offers a very simple bellwether transfer learner. Given N data sets, we find which one produces the best predictions on all the others. This bellwether data set is then used for all subsequent predictions (or until its predictions start failing, at which point it is wise to seek another bellwether). Bellwethers are interesting because they are very simple to find (just wrap a for-loop around standard data miners). They also simplify the task of making general policies in SE: as long as one bellwether remains useful, stable conclusions for N data sets can be achieved just by reasoning over that bellwether. From this, we conclude that (1) the bellwether method is a useful (and very simple) transfer learning method; (2) bellwethers are a baseline method against which future transfer learners should be compared; and (3) sometimes, when building increasingly complex automatic methods, researchers should pause and compare their supposedly more sophisticated method against simpler alternatives.
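
The "for-loop around standard data miners" idea in the abstract can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the nearest-centroid learner, the accuracy score, the median aggregation, and the names `train`, `accuracy`, `find_bellwether`, and `projects` are all hypothetical stand-ins (the talk itself uses Random Forests and a Balance measure).

```python
from statistics import median


def centroid(rows):
    """Mean feature vector of a non-empty list of feature rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def train(data):
    """Toy learner (an assumption, not the Random Forest used in the talk):
    a nearest-centroid classifier over (features, label) pairs, label 1 = buggy."""
    clean = centroid([x for x, y in data if y == 0])
    buggy = centroid([x for x, y in data if y == 1])

    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    def predict(x):
        return 1 if sq_dist(x, buggy) < sq_dist(x, clean) else 0

    return predict


def accuracy(model, data):
    """Fraction of rows classified correctly (stand-in for the talk's measure)."""
    return sum(model(x) == y for x, y in data) / len(data)


def find_bellwether(projects, train=train, score=accuracy):
    """Train on each project, test on every other project, and return the
    project whose model has the best median score (aggregation is an assumption)."""
    summary = {}
    for source, source_data in projects.items():
        model = train(source_data)
        scores = [score(model, target_data)
                  for target, target_data in projects.items()
                  if target != source]
        summary[source] = median(scores)
    return max(summary, key=summary.get), summary


# Hypothetical toy community: three projects, one numeric metric per row.
projects = {
    "A": [([1.0], 0), ([2.0], 0), ([8.0], 1), ([9.0], 1)],
    "B": [([1.0], 0), ([3.0], 0), ([7.0], 1), ([9.0], 1)],
    "C": [([1.0], 0), ([2.0], 0), ([3.0], 1), ([4.0], 1)],
}
best, summary = find_bellwether(projects)
```

Because `train` and `score` are passed in as parameters, any off-the-shelf learner and performance measure could be swapped in without touching the loop, which is the sense in which the method is "a for-loop around standard data miners".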

Published in: Software

The “Bellwether” Effect and Its Implications to Transfer Learning

  1. The “Bellwether” Effect and Its Implications to Transfer Learning • Rahul Krishna (rkrish11@ncsu.edu), Tim Menzies, and Wei Fu
  2. WeTOSM ‘14 • [Turhan09]: Data from Turkish toasters can predict defects in NASA flight systems • Today’s topic: Transfer Learning
  3. Today’s topic: Simpler Transfer Learning with “Bell… what?”
  4. Definitions • Bellwether effect: if a community builds many software projects, there exists one ∈ many from which quality predictors can be built … and used for all • Bellwether method: find the one; use it
  5. Definitions • Bellwether effect: if a community builds many software projects, there exists one ∈ many from which quality predictors can be built … and used for all • Bellwether method: find the one; use it • Note: vastly simpler than other transfer learning methods [Turhan09, Turhan11, Nam13, etc.]
  6. Outline ● Motivation ● Background ○ Evaluating Quality ○ Transfer Learning ○ The “Bellwether” ● Experimental Setup ○ Benchmark Data ○ Prediction Model ○ Statistical Measures ● Results ● Conclusions
  7. The “Cold-Start” Problem • Past Projects • Prediction Model • Upcoming releases
  8. The “Cold-Start” Problem • Past Projects • Prediction Model • ? • Upcoming releases
  9. Challenges: Variable Datasets • Growing volume of projects • “New projects are always emerging, and old ones are being rewritten…” • “… the quality, representativeness, and volume of the training data have a major influence on the usefulness and stability of model performance…” (Rahman et al. [Rah12])
  10. Challenges: Conclusion Instability • Unstable conclusions are typical in SE [Menzies12] • Usefulness of some lesson “X” is contradictory
  11. Challenges: Conclusion Instability • Unstable conclusions are typical in SE [Menzies12] • Usefulness of some lesson “X” is contradictory • Kitchenham et al. ‘07: Are data from other organizations as useful as local data? Inconclusive. 3 cases: just as good. 4 cases: worse.
  12. Challenges: Conclusion Instability • Unstable conclusions are typical in SE [Menzies12] • Usefulness of some lesson “X” is contradictory • Kitchenham et al. ‘07: Are data from other organizations as useful as local data? Inconclusive. 3 cases: just as good. 4 cases: worse. • Zimmermann et al. ‘09: 622 pairs of projects; only 4% of pairs were useful
  13. How to Reduce this Instability? • Menzies et al. [Men12] offer several ways • They ask for better experimental practice • Is there a better way? • Yes! Look for the “Bellwether” • As long as the bellwether continues to offer good quality predictions, conclusions from one … are conclusions for all
  14. Outline ● Motivation ● Background ○ Evaluating Quality ○ Transfer Learning ○ The “Bellwether” ● Experimental Setup ○ Benchmark Data ○ Prediction Model ○ Statistical Measures ● Results ● Conclusions
  15. Estimating Quality: Why not Static Analyzers? • Rahman et al. [Rahman14] compared code analysis tools (FindBugs, JLint, and PMD) with static code defect predictors • Found no difference (measurement: AUCEC) • And, using lightweight parsers, defect predictors can quickly jump to new languages • The same is not true for static code analysis tools • Fewer bugs, better software
  16. Estimating Quality: Why not Static Analyzers? • And they work surprisingly well! • [Ostrand04]: ~80% of the bugs localized in 20% of the code
  17. Estimating Quality: Static Code Defect Prediction • 1. Ubiquitous: researchers and industrial practitioners frequently use them, e.g. companies like Google [Lew14], V&V books [Raktin01] • 2. A lot of (ongoing) research: tremendous attention [Nam13]; better approaches are constantly being proposed • 3. Easy to use: software metrics can be collected fast; wide variety of tools and open source data miners [sklearn][weka]
  18. Outline ● Motivation ● Background ○ Evaluating Quality ○ Transfer Learning ○ The “Bellwether” ● Experimental Setup ○ Benchmark Data ○ Prediction Model ○ Statistical Measures ● Results ● Conclusions
  19. Transfer Learning: Introduction • Extract knowledge from source (S) and apply to target (T) • Data needs to be massaged before use [Zhang15]: careful sub-sampling, transformation • Based on data source, TL is categorized as homogeneous vs. heterogeneous • Based on transformation [Nam13, Nam15, Jing15], as similarity vs. dimensionality
  20. Transfer Learning: Categories • Homogeneous: source (S) and target (T) are quantified using the same attributes • Heterogeneous: source (S) and target (T) are quantified using different attributes • Similarity: learn from subsampled rows/columns of the source (S) • Dimensionality: manipulate rows/columns of source (S) to match target (T)
  21. Transfer Learning: Categories • Homogeneous: source (S) and target (T) are quantified using the same attributes • Heterogeneous: source (S) and target (T) are quantified using different attributes • Similarity: learn from subsampled rows/columns of the source (S) • Dimensionality: manipulate rows/columns of source (S) to match target (T) • This talk: Homogeneous, Similarity
  22. Homogeneous TL: Burak Filter • Burak [Tur09] used relevancy filtering • Filter using kNN • Gather two sets of data: validation set (S), the test data; candidate set (T), the training data • Use kNN to pick “similar” instances from T • Filter T using S
  23. Homogeneous TL: Burak Filter • First study on relevancy • Their conclusion: “… The performances of defect predictors based on the NN-filtered data do not give necessary empirical evidence to make a strong conclusion …” “… Sometimes NN data based models may perform better than WC data based models …”
  24. Homogeneous TL: Mixed Model Learner • Turhan et al. [Tur11] proposed a mixed-model learner • Combine local data with curated non-local data • Gather two sets of data: validation set (S), a random 10% of local data; candidate set (T), the remaining 90% plus non-local data • For non-local data, they use the Burak filter [Tur09] • Experiment with various 90%-10% splits • 400 experiments were conducted to pick the best model
  25. Homogeneous TL: Mixed Model Learner • Extension to Burak Filter • Incorporated local data • Challenges: similar issues as Burak Filter; biased, unstable model • The authors report: “… mixed project models offer only limited improvements i.e., 3 out 10 projects” (Turhan ‘11)
  26. Homogeneous TL: Addressing the Challenges • Researchers have offered a bleak view of TL • Zimmermann et al. [Zimm09]: transfer is not always consistent; IE could learn from Firefox but not vice versa • Rahman et al. [Rahman12]: the “imprecision” of learning across projects • Recent research has resorted to more complex approaches
  27. More Transfer Learners … (WeTOSM ‘14)
  28. Outline ● Motivation ● Background ○ Evaluating Quality ○ Transfer Learning ○ The “Bellwether” ● Experimental Setup ○ Benchmark Data ○ Prediction Model ○ Statistical Measures ● Results ● Conclusions
  29. Is this complexity necessary? • Short answer: No • Just look for the “Bellwether” • Use our bellwether method • Build your model • Et voilà!
  30. The Bellwether Method • Generate • Apply • Monitor
  31. The Bellwether Method • Generate: form project pairs Pi,j; perform a leave-one-out test (train on Pi, test on Pj); pick the project with the best model • Apply • Monitor
  32. The Bellwether Method • Generate • Apply: predict quality on future projects • Monitor
  33. The Bellwether Method • Generate • Apply • Monitor: when predictions fail, restart
  34. The Bellwether Method • Generate • Apply • Monitor
  35. Outline ● Motivation ● Background ○ Evaluating Quality ○ Transfer Learning ○ The “Bellwether” ● Experimental Setup ○ Benchmark Data ○ Prediction Model ○ Statistical Measures ● Results ● Conclusions
  36. Experiment Setup: Benchmark Data • 120 datasets from 4 communities • Defects in 3 levels of granularity: file, class, and function • Open source and proprietary
  37. Experiment Setup: Benchmark Data • BTW, Apache has local data: multiple versions, temporally ordered • A total of 54 datasets
  38. Outline ● Motivation ● Background ○ Evaluating Quality ○ Transfer Learning ○ The “Bellwether” ● Experimental Setup ○ Benchmark Data ○ Prediction Model ○ Statistical Measures ● Results ● Conclusions
  39. Experiment Setup: Prediction Model • We use Random Forests [Zimmerman08]: build several decision trees from random subsamples; use ensemble learning • Samples are imbalanced [Pelayo07]: more “clean” examples • Use SMOTE [Chawla01] to rebalance data*: randomly down-sample “clean” instances; up-sample “buggy” instances • *Apply only to training data
  40. Outline ● Motivation ● Background ○ Evaluating Quality ○ Transfer Learning ○ The “Bellwether” ● Experimental Setup ○ Benchmark Data ○ Prediction Model ○ Statistical Measures ● Results ● Conclusions
  41. Experiment Setup: Statistical Measures • Prediction is usually measured using ROC • ROC is a plot of Recall vs. False Alarm • The plot requires several treatments, obtained by cross-validation • We refrain from cross-validation: it tends to mix the test data with the bellwether • Instead, we use Balance [Ma07]
  42. Experiment Setup: Statistical Measures • Instead of a set of points for ROC, produce one point • X, Y = Pd (Recall), Pf (False Alarm) • Balance is the weighted distance from the ideal point • Ideal point => (Pd, Pf) = (1, 0) • Balance = • The lower the Balance, the better the performance
  43. Experiment Setup: Statistical Measures • Prediction model is inherently random • Rerun model 40 times with different seeds • Collect Balance measure in every run • Use Scott-Knott test to compare Balance values • Scott-Knott ranks Balance values (best to worst) • Rank -> Effect Size Test + Hypothesis Test • Why SK? It has been used by recent high-profile papers at TSE [Mittas13] and ICSE [Ghotra15]
  44. Outline ● Motivation ● Background ○ Evaluating Quality ○ Transfer Learning ○ The “Bellwether” ● Experimental Setup ○ Benchmark Data ○ Prediction Model ○ Statistical Measures ● Results ● Conclusions
  45. Results: Research Questions • How rare are “Bellwethers”? • How does the bellwether fare against local models? • Is Bellwether better than other transfer learning methods? • Can we predict which data set will be bellwether? • How much of the “Bellwether” data is required?
  46. Results: Research Question 1 • How rare are “Bellwethers”? • How does the bellwether fare against local models? • Is Bellwether better than other transfer learning methods? • Can we predict which data set will be bellwether? • How much of the “Bellwether” data is required?
  47. Results: Research Question 1 • How rare are “Bellwethers”? • Research Answer: Our results suggest bellwethers are not rare.
  48. Results: Research Question 1 • How rare are “Bellwethers”? • Community: Apache • Bellwether: Lucene
  49. Results: Research Question 1 • How rare are “Bellwethers”? • Community: NASA • Bellwether: MC
  50. Results: Research Question 1 • How rare are “Bellwethers”? • Community: AEEEM • Bellwether: LC
  51. Results: Research Question 1 • How rare are “Bellwethers”? • Community: ReLink • Bellwether: Safe
  52. Results: Research Question 2 • How rare are “Bellwethers”? • How does the bellwether fare against local models? • Is Bellwether better than other transfer learning methods? • Can we predict which data set will be bellwether? • How much of the “Bellwether” data is required?
  53. Results: Research Question 2 • How does the bellwether fare against local models? • Research Answer: For projects measured with the same quality metrics, training models with the bellwether is just as good as, if not better than, local models.
  54. Results: Research Question 3 • How rare are “Bellwethers”? • How does the bellwether fare against local models? • Is Bellwether better than other transfer learning methods? • Can we predict which data set will be bellwether? • How much of the “Bellwether” data is required?
  55. Results: Research Question 3 • Is Bellwether better than other transfer learning methods? • Research Answer: The bellwether outperforms standard homogeneous transfer learners.
  56. Results: Research Question 4 • How rare are “Bellwethers”? • How does the bellwether fare against local models? • Is Bellwether better than other transfer learning methods? • Can we predict which data set will be bellwether? • How much of the “Bellwether” data is required?
  57. Results: Research Question 4 • Can we predict which data set will be bellwether? • Research Answer: This is non-trivial. Trying to statistically determine if a project will be a bellwether was unsuccessful. This is open to further examination.
  58. Results: Research Question 5 • How rare are “Bellwethers”? • How does the bellwether fare against local models? • Is Bellwether better than other transfer learning methods? • Can we predict which data set will be bellwether? • How much of the “Bellwether” data is required?
  59. Results: Research Question 5 • How much data is required before detecting the “Bellwether”? • Research Answer: A few dozen defective samples from the bellwether are sufficient to build a reliable model.
  60. Outline ● Motivation ● Background ○ Evaluating Quality ○ Transfer Learning ○ The “Bellwether” ● Experimental Setup ○ Benchmark Data ○ Prediction Model ○ Statistical Measures ● Results ● Conclusions
  61. Practical Implications • The problem of generality in SE: reproducibility is hard to achieve • With bellwethers, transfer learners can be not only reproducible but also stable and reliable • Identifying bellwethers earlier would have changed the course of research: more focus on coarse-grain analysis, less on relevancy filtering and model generation
  62. Future Work • Bellwethers in heterogeneous learners: promising heterogeneous transfer learners [Nam15][Jing15] perform complex dimensionality mapping transforms; can bellwethers assist in finding the best mapping? • Study and quantify bellwethers: what makes a bellwether a bellwether? • Bellwethers beyond defect prediction: are there bellwethers in other data?
  63. In conclusion... • Look for bellwethers • To use as a baseline • To justify the use of transfer learning • Stabilize the pace of conclusions • Not permanent conclusion stability • Easy to find • Look when necessary • New data can be discarded • Updated only as they start failing
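
The Statistical Measures slides describe Balance as a distance from the ideal point (Pd, Pf) = (1, 0), but the formula itself did not survive in the transcript. The sketch below is one plausible reading, assuming equal weights on both axes and normalization by √2 so that the value lies in [0, 1]; the slide's actual weighting may differ. The function names `pd_pf` and `balance_distance` are hypothetical.

```python
import math


def pd_pf(actual, predicted):
    """Recall (Pd) and false alarm rate (Pf) for binary labels, 1 = buggy."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    pd = tp / (tp + fn) if (tp + fn) else 0.0
    pf = fp / (fp + tn) if (fp + tn) else 0.0
    return pd, pf


def balance_distance(pd, pf):
    """Assumed form: Euclidean distance from (pd, pf) to the ideal point (1, 0),
    divided by sqrt(2) so the worst case (pd=0, pf=1) scores 1. Lower is better."""
    return math.sqrt((1.0 - pd) ** 2 + (0.0 - pf) ** 2) / math.sqrt(2.0)


# Hypothetical example: one of two buggy modules is found, with no false alarms.
pd, pf = pd_pf([1, 1, 0, 0], [1, 0, 0, 0])
score = balance_distance(pd, pf)
```

Collapsing the ROC curve to a single point like this avoids the cross-validation that the slides say would mix test data with the bellwether, at the cost of fixing one operating threshold per model.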
