LOCAL VS. GLOBAL
MODELS FOR EFFORT
ESTIMATION AND DEFECT
PREDICTION
TIM MENZIES, ANDREW BUTCHER   (WVU)
ANDRIAN MARCUS                (WAYNE STATE)
THOMAS ZIMMERMANN             (MICROSOFT)
DAVID COK                     (GRAMMATECH)
PREMISE

Something is very wrong with data mining
research in software engineering
    •  Need less “algorithm mining” and more “data mining”
    •  Handle “conclusion instability”


Need to do a different kind of data mining
    •  Cluster, then learn
    •  Learning via “envy”




Less “algorithm mining”

       More “data mining”


TOO MUCH MINING?

 Porter & Selby, 1990
     •  Evaluating Techniques for Generating Metric-Based Classification Trees, JSS.
     •  Empirically Guided Software Development Using Metric-Based Classification
        Trees. IEEE Software
     •  Learning from Examples: Generation and Evaluation of Decision Trees for
        Software Resource Analysis. IEEE TSE

 In 2011, Hall et al. (TSE, pre-print)
           •  reported 100s of similar studies.
           •  L learners on D data sets in an M*N cross-val

 What is your next paper?
     •  Hopefully not D*L*M*N

THE FIELD IS CALLED “DATA MINING”,
 NOT “ALGORITHM MINING”

To understand data
mining, look at the data,
not the algorithms


Our results should be
insights about data,
  •  not trivia about (say)
     decision tree algorithms


Besides, the thing that
most predicts for
performance is the data,
not the algorithm,
  •  Domingos & Pazzani: Optimality of
     the Simple Bayesian Classifier under
     Zero-One Loss, Machine Learning,




                                            5
     Volume 29, [103-130, 1997
Handle
    “Conclusion instability”



CONCLUSION INSTABILITY:
WHAT WORKS THERE DOES NOT WORK HERE




Conclusion Instability:
what works there does not work here

Posnett et al. [2011]
Zimmermann [2009] : learned defect predictors from 622 pairs of
projects ⟨project1, project2⟩.
  •  In only 4% of pairs did project1’s predictors work for project2.
Kitchenham [2007] : studies comparing effort models learned from local
or imported data
  •  1/3 better, 1/3 same, 1/3 worse
Jørgensen [2004] :
15 studies comparing model-based to expert-based estimation.
  •  1/3 better, 1/3 same, 1/3 worse
Mair [2005] : studies comparing regression to analogy methods for
effort estimation
  •  7/20 better, 4/20 same, 9/20 worse
ROOT CAUSE OF
CONCLUSION INSTABILITY?

  HYPOTHESIS #1
  Any one of….
      •  Over-generalization across different kinds of projects?
            •  Solve with “delphi localization”
      •  Noisy data?
      •  Too little data?
      •  Poor statistical technique?
      •  Stochastic choice within data miner (e.g. random forests)
      •  Insert idea here

  HYPOTHESIS #2
  SE is an inherently varied activity
      •  So conclusion instability can’t be fixed
      •  It must be managed
      •  Needs different kinds of data miners
            •  Cluster, then learn
            •  Learning via “envy”

SOLVE CONCLUSION INSTABILITY
WITH “DELPHI LOCALIZATIONS” ?
Restrict data mining to just related projects


Ask an expert to find the right local context
    •  Are we sure they’re right?
    •  Posnett et al. 2011:
            •    What is right level for learning?
            •    Files or packages?
            •    Methods or classes?
            •    Changes from study to study



And even if they are “right”:
    •  Should we use those contexts?
    •  What if not enough info in our own delphi localization?




DELPHI LOCALIZATIONS

Q: What to do about rare zones?

A: Select the nearest ones from the rest

But how?
Cluster then learn




KOCAGUNELI [2011]
CLUSTERING TO FIND “LOCAL”
TEAK: estimates from “k”
nearest-neighbors
    •  “k” auto-selected
       per test case
    •  Pre-processor to cluster data,
       remove worrisome regions
    •  IEEE TSE, Jan’11




ESEM’11
    •    Train within one delphi localization
    •    Or train on all and see what it picks
    •    Result #1: usually, cross as good as within
    •    Result #2: given a choice of both, TEAK picks “within” as much as “cross”




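The slide above gives only the intuition behind TEAK. Below is a minimal Python sketch in the spirit of (but not identical to) the published algorithm: an analogy-based estimator that, per test case, prefers the neighborhood size k with the lowest variance in effort. The function names and toy data are ours.

```python
# Minimal sketch, NOT the published TEAK algorithm: analogy-based effort
# estimation where "k" is picked per test case by preferring the neighborhood
# whose efforts have the lowest variance.  Names and data are invented.
from statistics import median, pvariance

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def estimate(test_case, train, max_k=5):
    """train = list of (features, effort) pairs."""
    by_distance = sorted(train, key=lambda row: euclidean(test_case, row[0]))
    best_k, best_var = 2, float("inf")
    for k in range(2, min(max_k, len(by_distance)) + 1):
        efforts = [effort for _, effort in by_distance[:k]]
        var = pvariance(efforts)
        if var < best_var:                 # keep the least "worrisome" neighborhood
            best_k, best_var = k, var
    return median(effort for _, effort in by_distance[:best_k])

if __name__ == "__main__":
    train = [([1, 2], 10), ([1, 3], 12), ([8, 9], 90), ([9, 9], 95)]
    print(estimate([1, 2.5], train))       # -> 11.0 (the low-variance neighborhood)
```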
LESSON : DATA MAY NOT DIVIDE
NEATLY ON RAW DIMENSIONS




The best description for SE projects may be synthesized
dimensions extracted from the raw dimensions




SYNTHESIZED DIMENSIONS

PCA : e.g. Nagappan [2006]
Finds orthogonal “components”
    •  Transforms N correlated variables into
       fewer uncorrelated "components"
    •  Component[i] : accounts for as much
       variability as possible
    •  Component[j > i] : accounts for the
       remaining variability
    •  O(N²) to generate

Fastmap : Faloutsos [1995]
O(2N) generation of an axis of large variability
    •  Pick any point W
    •  Find X, the point furthest from W
    •  Find Y, the point furthest from X
    •  Let c = dist(X,Y)
    •  Every point has distances a, b to (X, Y):
         •  x = (a² + c² − b²) / 2c
         •  y = sqrt(a² − x²)
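As a worked example of the Fastmap formulas above, here is a minimal Python sketch of the projection step; the function name and the sample triangle are ours.

```python
# Minimal sketch of the Fastmap projection step: given the pivot distance
# c = dist(X, Y) and a point's distances a = dist(p, X), b = dist(p, Y),
# recover its coordinates on (and off) the synthesized X-Y axis.
from math import sqrt

def fastmap_xy(a, b, c):
    x = (a * a + c * c - b * b) / (2 * c)      # cosine rule: position along the X-Y axis
    y = sqrt(max(a * a - x * x, 0.0))          # distance off the axis (clamped for rounding)
    return x, y

if __name__ == "__main__":
    # A point 3 units from X and 4 units from Y, with pivots 5 units apart,
    # lands at x = 1.8, y = 2.4.
    print(fastmap_xy(3, 4, 5))
```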
HIERARCHICAL PARTITIONING

Grow
    •  Find two orthogonal dimensions
    •  Find median(x), median(y)
    •  Recurse on the four quadrants

Prune
    •  Combine quadtree leaves with similar densities
    •  Score each cluster by the median score of the class variable
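A minimal sketch of the grow-and-score steps above, under our own simplifying assumptions: the density-based merging of leaves is omitted, and the stopping rules are invented for illustration.

```python
# Minimal sketch of the "grow" step: recursively split the (x, y) plane at the
# medians of the two synthesized dimensions, stopping at small leaves, then
# score each leaf by the median of its class variable.
from statistics import median

def grow(points, min_size=4, depth=0, max_depth=6):
    """points = list of (x, y, class_value) tuples; returns a list of leaf clusters."""
    if len(points) <= min_size or depth >= max_depth:
        return [points]
    mx = median(p[0] for p in points)
    my = median(p[1] for p in points)
    quads = [[], [], [], []]
    for p in points:
        quads[(p[0] > mx) * 2 + (p[1] > my)].append(p)
    if any(len(q) == len(points) for q in quads):   # a split made no progress: stop
        return [points]
    leaves = []
    for q in quads:
        if q:
            leaves.extend(grow(q, min_size, depth + 1, max_depth))
    return leaves

def score(cluster):
    """Score a cluster by the median of its class variable (e.g. defects or effort)."""
    return median(p[2] for p in cluster)

if __name__ == "__main__":
    pts = [(i % 10, i // 10, i % 7) for i in range(100)]
    print([(len(c), score(c)) for c in grow(pts)][:4])
```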
Q: WHY CLUSTER VIA FASTMAP?

A1: Circular methods (e.g. k-means) assume
round clusters.
    •  But density-based clustering allows
      clusters to be any shape



A2: No need to pre-set the number of clusters



A3: the O(2N) heuristic
is very fast,
    •  even in unoptimized Python
       (a sketch follows)
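The sketch referred to above: our own unoptimized Python for the O(2N) heuristic, which finds the two pivots in two linear passes and then projects every point with the cosine-rule step from the earlier slide. This is an illustration, not the talk's original code.

```python
# Minimal sketch of the O(2N) pivot heuristic: pick any point W, find X furthest
# from W, then Y furthest from X (two linear passes), and project every point
# onto the X-Y axis.
from math import sqrt

def dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

def fastmap_axis(points):
    """Project each point onto the axis defined by two far-apart pivots."""
    w = points[0]                                            # pick any point W
    x_pivot = max(points, key=lambda p: dist(w, p))          # pass 1: X furthest from W
    y_pivot = max(points, key=lambda p: dist(x_pivot, p))    # pass 2: Y furthest from X
    c = dist(x_pivot, y_pivot) or 1.0                        # guard against identical pivots
    out = []
    for p in points:
        a, b = dist(p, x_pivot), dist(p, y_pivot)
        x = (a * a + c * c - b * b) / (2 * c)                # position along the axis
        y = sqrt(max(a * a - x * x, 0.0))                    # distance off the axis
        out.append((x, y))
    return out

if __name__ == "__main__":
    print(fastmap_axis([(0, 0), (10, 0), (5, 1), (2, 8)]))
```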
Learning via “envy”




Q: WHY TRAIN ON NEIGHBORING
CLUSTERS WITH BETTER SCORES?

A1: Why learn from
your own mistakes?
    •  When there exists
       a smarter
       neighbor?




    •  The “grass is
       greener” principle




HIERARCHICAL PARTITIONING

Grow
    •  Find two orthogonal dimensions
    •  Find median(x), median(y)
    •  Recurse on the four quadrants

Prune
    •  Combine quadtree leaves with similar densities
    •  Score each cluster by the median score of the class variable

Where is the grass greenest?
    •  C1 envies the neighbor C2 with max abs(score(C2) - score(C1))
    •  Train on C2, test on C1
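A minimal sketch of the envy step above; the cluster and score representations here are our own assumptions, not the paper's data structures.

```python
# Minimal sketch of "learning via envy": each cluster C1 picks the neighboring
# cluster C2 whose score differs most from its own; rules are then learned on
# C2 and tested on C1.
from statistics import median

def most_envied(c1, neighbors, score):
    """Return the neighbor C2 with the largest abs(score(C2) - score(C1))."""
    return max(neighbors, key=lambda c2: abs(score(c2) - score(c1)))

def envy_pairs(clusters, neighbors_of, score):
    """Yield (train_cluster, test_cluster) pairs: train on C2, test on C1."""
    for c1 in clusters:
        c2 = most_envied(c1, neighbors_of(c1), score)
        yield c2, c1

if __name__ == "__main__":
    # Toy clusters keyed by name, holding class values (e.g. defect counts).
    clusters = {"a": [3, 4, 5], "b": [1, 1, 2], "c": [9, 9, 8]}
    neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
    score = lambda name: median(clusters[name])
    for train, test in envy_pairs(clusters, lambda c: neighbors[c], score):
        print("train on", train, "-> test on", test)
```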
Q: HOW TO LEARN RULES FROM
NEIGHBORING CLUSTERS?

A: It doesn’t really matter
   •  But when comparing global & intra-cluster rules,
      use the same rule learner

This study uses WHICH (Menzies [2010])
   • Customizable scoring operator
   • Faster termination
   • Generates very small rules (good for explanation)




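For concreteness only, here is a generic sketch of what "a rule treats the data" means in this setting: a rule is a set of attribute ranges, applying it selects a subset of rows, and a customizable score (here, the drop in the median class value) ranks it. This is not the WHICH algorithm; names and data are invented.

```python
# Minimal sketch of applying and scoring a treatment rule; NOT WHICH itself,
# only the generic "a rule selects a subset, score the treated distribution" idea.
from statistics import median

def apply_rule(rule, rows):
    """rule = {attribute: (lo, hi)}; keep rows whose attributes fall in all ranges."""
    return [r for r in rows
            if all(lo <= r[attr] <= hi for attr, (lo, hi) in rule.items())]

def improvement(rule, rows, goal="defects"):
    """Score = drop in the median of the class variable after treatment."""
    treated = apply_rule(rule, rows)
    if not treated:
        return 0.0
    return median(r[goal] for r in rows) - median(r[goal] for r in treated)

if __name__ == "__main__":
    rows = [{"loc": 100, "cbo": 3, "defects": 1},
            {"loc": 900, "cbo": 12, "defects": 7},
            {"loc": 250, "cbo": 5, "defects": 2}]
    rule = {"cbo": (0, 6)}          # hypothetical rule: keep low-coupling classes
    print(improvement(rule, rows))  # -> 0.5 (median defects drop from 2 to 1.5)
```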
DATA FROM
HTTP://PROMISEDATA.ORG/DATA

Effort reduction = { NasaCoc, China } :
COCOMO or function points

Defect reduction = { lucene, xalan, jedit, synapse, etc. } :
CK metrics (OO)

Clusters have an untreated class distribution.
Rules select a subset of the examples:
    •  this generates a treated class distribution
    •  “global” = treated with rules learned from all the data
    •  “local”  = treated with rules learned from the neighboring cluster

[Figure: percentile plots (25th, 50th, 75th, 100th) of the untreated,
global, and local distributions, on a 0..100 scale.]
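A minimal sketch of the percentile summaries used in the comparison above; the sample distributions are invented for illustration.

```python
# Minimal sketch: summarize the untreated, global, and local (treated) class
# distributions by their 25th/50th/75th/100th percentiles.
from statistics import quantiles

def summary(values):
    q25, q50, q75 = quantiles(values, n=4)
    return {"25th": q25, "50th": q50, "75th": q75, "100th": max(values)}

if __name__ == "__main__":
    untreated = [5, 9, 12, 20, 33, 41, 60]
    global_   = [4, 6, 11, 15, 22]     # rows selected by rules learned from all data
    local     = [2, 3, 4, 6, 9]        # rows selected by the neighboring cluster's rules
    for name, dist in [("untreated", untreated), ("global", global_), ("local", local)]:
        print(name, summary(dist))
```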
BY ANY MEASURE,
PER-CLUSTER LEARNING IS BEST

Lower median efforts/defects (50th percentile)
Greater stability (75th – 25th percentile)
Decreased worst case (100th percentile)




CLUSTERS GENERATE
DIFFERENT RULES




What works “here” does not work “there”
    •  Misguided to try and tame conclusion instability
    •  Inherent in the data

Don’t tame it, use it: build lots of local models




Related work




RELATED WORK

Defect & effort prediction: 1,000 papers
  •  All about making predictions
  •  This work: learning controllers to change the prediction

Outlier removal:
  •  Yin [2011], Yoon [2010], Kocaguneli [2011]
  •  Subsumed by this work

Clustering & case-based reasoning:
  •  Kocaguneli [2011], Turhan [2009], Cuadrado [2007]
  •  Nothing is generated, so there is nothing to reflect on
  •  Needs indexing (runtime speed)

Structured literature reviews:
  •  Kitchenham [2007] + many more besides
  •  May be over-generalizing across cluster boundaries

Design of experiments:
  •  Don’t learn from the immediate data; learn from better neighbors
  •  Here: train once per cluster (a small subset of the whole data)
  •  Orders of magnitude faster than N*M cross-val

Localizations:
  •  Expert-based, Petersen [2009]: how do we know it is correct?
  •  Source-code-based, ecological inference: Posnett [2011]
  •  This work: automatic learning of contexts; shown to be beneficial
Conclusion




THIS TALK

Something is fundamentally wrong with data mining research in
software engineering
    •  Needs more “data mining”, less “algorithm mining”
    •  Handle “conclusion instability”


Need to do a different kind of data mining
    •  Cluster, then learn
    •  Learning via “envy”




NOT “ONE RING TO RULE THEM ALL”

Trite global statements about multiple SE
projects are… trite


Need effective ways to learn local lessons
    •  Automatic clustering tools
    •  Rule learning (per cluster, using envy)




THE WISDOM OF THE CROWDS




THE WISDOM OF THE COWS

•  Seek the fence where
   the grass is greener
   on the other side.
     •  Learn from there
     •  Test on here


•  Don’t rely on trite
   definitions of “there”
   and “here”
     •  Cluster to find
        “here” and “there”




