Predicting More from Less:
Synergies of Learning
Ekrem Kocaguneli, ekrem@kocaguneli.com
Bojan Cukic, bojan.cukic@mail.wvu.edu,
Huihua Lu, hlu3@mix.wvu.edu
RAISE'13 
2nd International NSF sponsored Workshop
on Realizing Artificial Intelligence Synergies in Software Engineering
5/25/2013
Collecting data is important
SourceForge currently hosts 324K projects with a user base of 3.4M [1]
GoogleCode hosts 250K open source projects [2]
[1] http://sourceforge.net/apps/trac/sourceforge/wiki/What%20is%20SourceForge.net
[2] https://developers.google.com/open-source/
There are also abundant SE repositories:
• ISBSG [1]
• PROMISE [2]
• Eclipse Bug Data [3]
• TukuTuku [4]
[1] C. Lokan, T. Wright, P. Hill, and M. Stringer. Organizational benchmarking using the ISBSG data repository. IEEE Software, 18(5):26–32, 2001.
[2] T. Menzies, B. Caglayan, E. Kocaguneli, J. Krall, F. Peters, and B. Turhan. The PROMISE repository of empirical software engineering data, June 2012.
[3] T. Zimmermann, R. Premraj, and A. Zeller. Predicting defects for Eclipse. In International Workshop on Predictor Models in Software Engineering, 2007. PROMISE’07: ICSE Workshops 2007.
[4] http://www.metriq.biz/tukutuku/
We have mountains of data,
but then what?
Abundance of data is promising for predictive
modeling and supervised learning
Yet, dependent variable information is
not always available!
Dependent variables (labels, effort values, etc.) may be missing, outdated, or available for only a limited number of instances
• Transfer learning: when an organization has no local data or the local data is outdated, transferring data helps
• Semi-supervised learning: when only a limited amount of data is labeled, we can use the existing labels to label other training instances
• Active learning: when no labels exist, we can request labels from experts at a cost
• Transfer learning: how to transfer data between domains and projects?
• Semi-supervised learning: how to accommodate prediction problems for which only a limited number of labeled instances is available?
• Active learning: how to handle prediction problems in which no instances have labels?
What is the current
state of the art?
Transfer learning is a set of learning methods that allow
the training and test sets to have different domains
and/or tasks (Ma2012 [1]).
Transfer learning - 1
[1] Y. Ma, G. Luo, X. Zeng, and A. Chen. Transfer learning for cross-company software defect prediction. Information and Software Technology, 54(3):248–256, 2012.
SE transfer learning studies (a.k.a. cross-company
learning) have the same task yet different domains
(data coming from different organizations or different
time frames).
Transfer learning results in SE report instability and
significant variability if data is used as-is
(Kitchenham2007 [1], Zimmermann2009[2])
Transfer learning - 2
[1] B. A. Kitchenham, E. Mendes, and G. H. Travassos. Cross versus within-company cost estimation studies: A systematic review. IEEE Trans. Softw. Eng., 33(5):316–329, 2007.
[2] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy. Cross-project defect prediction: A large scale experiment on data vs. domain vs. process. ESEC/FSE, pages 91–100, 2009.
[3] B. Turhan, T. Menzies, A. Bener, and J. Di Stefano. On the relative value of cross-company and within-company data for defect prediction. Empirical Software Engineering, 14(5):540–578, 2009.
[4] E. Kocaguneli and T. Menzies. How to find relevant data for effort estimation. In ESEM’11: International Symposium on Empirical Software Engineering and Measurement, 2011.
Filtering-based approaches support prior results
(Turhan2009[3], Kocaguneli2011[4])
• Transferring all cross data yields poor performance
• Filtering cross data significantly improves estimation
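To make the idea of relevancy filtering concrete, here is a minimal sketch. It assumes a plain nearest-neighbor rule (keep a cross row only if it is among the k closest rows to some within instance); this is an illustrative stand-in, not necessarily the filter used in the cited studies. The function names, the toy data, and the value of k are all hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def filter_cross_data(cross_rows, within_rows, k=2):
    """Keep only cross rows that are among the k nearest neighbors of some within row.

    cross_rows: list of (features, label) pairs from other organizations.
    within_rows: list of feature vectors from the local project (labels not needed).
    """
    selected = set()
    for target in within_rows:
        # Rank cross rows by distance to this within instance.
        ranked = sorted(range(len(cross_rows)),
                        key=lambda i: euclidean(cross_rows[i][0], target))
        selected.update(ranked[:k])
    # Return the surviving rows in their original order.
    return [cross_rows[i] for i in sorted(selected)]

# Toy data (assumed): two cross projects resemble the local one, two do not.
cross = [([1.0, 1.0], 10), ([1.2, 0.9], 12), ([9.0, 9.0], 90), ([8.5, 9.5], 95)]
within = [[1.1, 1.0]]
relevant = filter_cross_data(cross, within, k=2)
```

Note how the filter shrinks the transferred data: only two of the four cross rows survive, which is the trade-off listed later under the challenges of transfer learning.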
SSL methods are a group of machine learning algorithms
that learn from a set of training instances among which
only a small subset has pre-assigned labels [1].
Semi-supervised learning (SSL) -1
[1] O. Chapelle, B. Schölkopf, and A. Zien. Semi-supervised Learning. MIT Press, Cambridge, MA, USA, 2006.
SSL relaxes the reliance of supervised methods
on dependent variable information
Hence, we can supplement supervised
estimation methods.
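One common way to learn from a small labeled subset is self-training. The sketch below is an assumption chosen for illustration, not the specific SSL algorithm of any cited work: it repeatedly pseudo-labels the unlabeled instance that lies closest to an already labeled one, then treats that instance as labeled. The names `self_train` and `distance` and the toy data are hypothetical.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def self_train(labeled, unlabeled):
    """Pseudo-label unlabeled instances one at a time, most confident first.

    Confidence here is simply closeness to an already labeled instance.
    """
    labeled = list(labeled)   # (features, label) pairs; copied so the input is untouched
    pool = list(unlabeled)    # feature vectors without labels
    while pool:
        # Find the (distance, pool index, label) triple with the smallest distance.
        _, idx, label = min(
            ((distance(x, feats), i, lab)
             for i, x in enumerate(pool)
             for feats, lab in labeled),
            key=lambda t: t[0],
        )
        # Move the instance into the labeled set with its pseudo-label.
        labeled.append((pool.pop(idx), label))
    return labeled

# Toy data (assumed): two labeled modules and four unlabeled ones.
seed = [([0.0], "clean"), ([10.0], "defective")]
result = self_train(seed, [[1.0], [2.0], [9.0], [6.0]])
pseudo = {feats[0]: label for feats, label in result[len(seed):]}
```

Only two labels are supplied up front; the remaining four instances receive pseudo-labels that a supervised learner could then train on.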
Despite the promise, SSL appears to be
less than thoroughly investigated in SE
Semi-supervised learning (SSL) - 2
[1] Huihua Lu, Bojan Cukic, and Mark Culp. 2012. Software defect prediction using semi-supervised learning with dimension reduction. In
Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering (ASE 2012).
[2] M. Li, H. Zhang, R. Wu, and Z.-H. Zhou. Sample-based software defect prediction with active and semi-supervised learning. Automated
Software Engineering, 19:201–230, 2012.
Lu et al. use an SSL algorithm augmented with multi-dimensional scaling (MDS) as a pre-processor, which
outperforms corresponding supervised methods [1]
Li et al. developed a framework which
maps ensemble learning and random
forests into an SSL setting [2]
AL methods are unsupervised methods working on an
initially unlabeled data set.
Active Learning (AL) - 1
[1] M.-F. Balcan, A. Beygelzimer, and J. Langford. “Agnostic active learning”. Proceedings of the 23rd International Conference on Machine Learning - ICML ’06, pages 65–72, 2006.
AL methods can query an oracle, which can provide
labels. Yet, each label comes with a cost. Hence, we
need as few queries as possible.
For example, Balcan et al. show that AL provides the
same performance as a supervised
learner with substantially smaller
sample sizes [1]
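The query loop described above can be sketched as follows. This is a minimal illustration under stated assumptions: the learner picks the pool instance farthest from anything labeled so far (a simple proxy for uncertainty), asks the oracle for its label, and stops once a query budget is spent. The `expert` oracle, the budget, and the toy data are hypothetical, and this is not the algorithm analyzed by Balcan et al.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def active_learn(pool, oracle, budget):
    """Label at most `budget` instances from an initially unlabeled pool."""
    if budget < 1 or not pool:
        return []
    remaining = list(pool)
    first = remaining.pop(0)                  # arbitrary seed: nothing is labeled yet
    labeled = [(first, oracle(first))]
    while remaining and len(labeled) < budget:
        # Uncertainty proxy: distance to the closest labeled instance.
        pick = max(remaining,
                   key=lambda x: min(distance(x, feats) for feats, _ in labeled))
        remaining.remove(pick)
        labeled.append((pick, oracle(pick)))  # each oracle call costs one query
    return labeled

def predict(labeled, x):
    """Predict with the label of the nearest labeled instance (1-NN)."""
    return min(labeled, key=lambda pair: distance(pair[0], x))[1]

queries = []                                  # records every request sent to the oracle

def expert(x):
    """Hypothetical oracle standing in for a human expert."""
    queries.append(x)
    return "defective" if x[0] > 5 else "clean"

pool = [[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]]
labeled = active_learn(pool, expert, budget=2)
predictions = [predict(labeled, x) for x in pool]
```

On this toy pool, two oracle queries are enough for the nearest-neighbor predictor to agree with the expert on all six instances, which is the kind of saving the slide refers to.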
In SE, AL methods hold good
potential to reduce labeling costs
Active Learning (AL) - 2
[1] Huihua Lu and Bojan Cukic. 2012. An adaptive approach with active learning in software fault prediction. In Proceedings of the 8th
International Conference on Predictive Models in Software Engineering (PROMISE '12).
[2] Kocaguneli, E.; Menzies, T.; Keung, J.; Cok, D.; Madachy, R., "Active Learning and Effort Estimation: Finding the Essential Content of Software
Effort Estimation Data," Software Engineering, IEEE Transactions on , vol.PP, no.99, pp.1,1, 0
Lu et al. propose an AL-based fault prediction
method, which outperforms supervised techniques
by using 20% or less of the data [1]
Kocaguneli et al. use AL in software effort estimation (SEE). The proposed
method performs comparably to supervised
methods with 31% of the original data [2]
So what do we do?
Strengths and Weaknesses
Supervised Learning (SL)
Strengths
• Successfully used in SE for predictive
purposes.
• Provides successful estimation
performance.
Challenges
• Requires retrospective local data.
• Requires dependent variable
information.
Transfer Learning (TL)
Strengths
• Enables data to be transferred between
different organizations or time frames.
• Provides a solution to the lack of local data.
• After relevancy filtering, cross data can
perform as well as within data.
Challenges
• Using cross data as-is results in
unstable performance.
• TL filters relevant cross data, which reduces
the transferred cross data amount.
Semi-supervised Learning (SSL)
Strengths
• Enables learning from small sets of labeled
instances.
• Supplements the learning with unlabeled instances.
• Relaxes the requirement of dependent variables.
Challenges
• Still requires an initially labeled set of
training instances, however small.
• For datasets with a large number of independent
features, it requires feature subset selection.
Active Learning (AL)
Strengths
• Helps find the essential content of the data.
• Decreases the amount of dependent variable
information needed, thereby reducing the associated
data collection costs.
Challenges
• Susceptible to unbalanced class distributions
in classification problems.
Strengths and Weaknesses
Supervised Learning (SL)
• Requires retrospective local data.
Transfer Learning (TL)
• Provides a solution to the lack of local data.
• TL filters relevant cross data, which reduces
the transferred cross data amount.
Semi-supervised Learning (SSL)
• Enables learning from small sets of labeled
instances.
Active Learning (AL)
• Helps find the essential content of the data.
Synergy #1
Synergy #1 is already being pursued in SE,
with successful applications of
transferring data across:
• Domains
• Time frames
Filtering labeled cross data yields a very limited
amount of locally relevant data
SSL can use filtered cross data to provide pseudo-
labels for the unlabeled within data
Synergy #2
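A minimal sketch of this synergy for effort data, under stated assumptions: each unlabeled within instance receives as its pseudo-label the mean effort of its k nearest filtered cross rows, and a new within project is then estimated from the pseudo-labeled within data alone. The nearest-neighbor rule, the function names, and the toy numbers are hypothetical choices made for illustration, not the method used in the experiments.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(rows, x, k):
    """Return the k (features, effort) rows closest to x."""
    return sorted(rows, key=lambda row: distance(row[0], x))[:k]

def mean_effort(rows):
    """Average the effort values of a list of (features, effort) rows."""
    return sum(effort for _, effort in rows) / len(rows)

def pseudo_label(filtered_cross, within_unlabeled, k=2):
    """Give each unlabeled within instance the mean effort of its k nearest cross rows."""
    return [(x, mean_effort(nearest(filtered_cross, x, k))) for x in within_unlabeled]

# Toy data (assumed): a small set of filtered, labeled cross projects
# and two past within projects that have no effort values.
filtered_cross = [([1.0], 100.0), ([2.0], 200.0), ([3.0], 300.0)]
within_unlabeled = [[1.4], [2.6]]
within_pseudo = pseudo_label(filtered_cross, within_unlabeled, k=2)

# A new within project is estimated from the pseudo-labeled within data only.
estimate = mean_effort(nearest(within_pseudo, [1.5], k=1))
```

The design point is that the cross data is used only to produce pseudo-labels; the final estimate comes from local instances, which keeps the result interpretable in local terms.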
SE data (defect and effort) can be summarized
with its essential content
Transfer learning may benefit from using
essential content instead of all the data, which
may contain noise and outliers
Synergy #3
Did you try any
of the synergies?
[Figure: Experiments with Synergy #3. A four-step pipeline diagram whose labeled elements are: cross data, TEAK filter, filtered cross data; past within data (without labels), QUICK, essential within data; SSL, essential within data with pseudo labels; estimation method, within test project(s), estimate.]
Experiments with
Synergy #3
Estimation from
pseudo-labeled
within data
Within data is
summarized to at
most 15%
Opportunity for
within data to be
locally interpreted
What have we covered?

Predicting More from Less: Synergies of Learning

  • 1. Predicting More from Less: Synergies of Learning Ekrem Kocaguneli, ekrem@kocaguneli.com Bojan Cukic, bojan.cukic@mail.wvu.edu, Huihua Lu, hlu3@mix.wvu.edu RAISE'13 
2nd International NSF sponsored Workshop on Realizing Artificial Intelligence Synergies in Software Engineering 5/25/2013 RAISE'13
  • 2. Collecting data is important. SourceForge currently hosts 324K projects with a user base of 3.4M [1]. GoogleCode hosts 250K open source projects [2].
    [1] http://sourceforge.net/apps/trac/sourceforge/wiki/What%20is%20SourceForge.net
    [2] https://developers.google.com/open-source/
  • 3. Also, there is an abundant amount of SE repositories: ISBSG [1], PROMISE [2], Eclipse Bug Data [3], TukuTuku [4].
    [1] C. Lokan, T. Wright, P. Hill, and M. Stringer. Organizational benchmarking using the ISBSG data repository. IEEE Software, 18(5):26–32, 2001.
    [2] T. Menzies, B. Caglayan, E. Kocaguneli, J. Krall, F. Peters, and B. Turhan. The promise repository of empirical software engineering data, June 2012.
    [3] T. Zimmermann, R. Premraj, and A. Zeller. Predicting defects for eclipse. In International Workshop on Predictor Models in Software Engineering, 2007. PROMISE’07: ICSE Workshops 2007.
    [4] http://www.metriq.biz/tukutuku/
  • 4. We have mountains of data, but then what?
  • 5. Abundance of data is promising for predictive modeling and supervised learning. Yet, dependent variable information is not always available! Dependent variables (labels, effort values, etc.) may be missing, outdated, or available for only a limited number of instances.
  • 6. Transfer learning: when an organization has no local data or the local data is outdated, transferring data helps. Semi-supervised learning: when only a limited amount of data is labeled, we can use the existing labels to label other training instances. Active learning: when no labels exist, we can request labels from experts at a cost.
  • 7. Transfer learning: how to transfer data between domains and projects? Semi-supervised learning: how to accommodate prediction problems for which only a limited number of labeled instances are available? Active learning: how to handle prediction problems in which no instances have labels?
  • 8. What is the current state-of-the-art?
  • 9. Transfer learning - 1. Transfer learning is a set of learning methods that allow the training and test sets to have different domains and/or tasks (Ma2012 [1]). SE transfer learning studies (a.k.a. cross-company learning) have the same task yet different domains (data coming from different organizations or different time frames).
    [1] Y. Ma, G. Luo, X. Zeng, and A. Chen. Transfer learning for cross-company software defect prediction. Information and Software Technology, 54(3):248–256, 2012.
  • 10. Transfer learning - 2. Transfer learning results in SE report instability and significant variability if data is used as-is (Kitchenham2007 [1], Zimmermann2009 [2]). Filtering-based approaches support prior results (Turhan2009 [3], Kocaguneli2011 [4]):
    • Transferring all cross data yields poor performance
    • Filtering cross data significantly improves estimation
    [1] B. A. Kitchenham, E. Mendes, and G. H. Travassos. Cross versus within-company cost estimation studies: A systematic review. IEEE Trans. Softw. Eng., 33(5):316–329, 2007.
    [2] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy. Cross-project defect prediction: A large scale experiment on data vs. domain vs. process. ESEC/FSE, pages 91–100, 2009.
    [3] B. Turhan, T. Menzies, A. Bener, and J. Di Stefano. On the relative value of cross-company and within-company data for defect prediction. Empirical Software Engineering, 14(5):540–578, 2009.
    [4] E. Kocaguneli and T. Menzies. How to find relevant data for effort estimation. In ESEM’11: International Symposium on Empirical Software Engineering and Measurement, 2011.
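The idea of filtering cross data for relevance can be illustrated with a small nearest-neighbor filter. This is only a sketch in the spirit of the filtering-based approaches cited on the slide, not the exact procedure of any cited paper; the function names, the value of `k`, and the toy data are all assumptions made for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nn_filter(cross_rows, within_rows, k=2):
    """Keep the cross rows that are among the k nearest neighbours
    of at least one within instance (hypothetical helper)."""
    keep = set()
    for w in within_rows:
        ranked = sorted(
            range(len(cross_rows)),
            key=lambda i: euclidean(cross_rows[i][0], w),
        )
        keep.update(ranked[:k])
    return [cross_rows[i] for i in sorted(keep)]

# Toy cross data: (features, effort). Values are invented.
cross = [
    ((1.0, 1.0), 10.0),
    ((1.2, 0.9), 12.0),
    ((9.0, 9.0), 95.0),
    ((8.5, 9.5), 90.0),
    ((5.0, 5.0), 50.0),
]
# Toy within data: features only.
within = [(1.1, 1.0), (0.9, 1.2)]

filtered = nn_filter(cross, within, k=2)
print(filtered)
```

With these toy values only the two cross projects that resemble the within projects survive, which also shows why filtering shrinks the amount of transferred data.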
  • 11. Semi-supervised learning (SSL) - 1. SSL methods are a group of machine learning algorithms that learn from a set of training instances among which only a small subset has pre-assigned labels [1]. SSL helps relax the dependent variable dependence of supervised methods. Hence, we can supplement supervised estimation methods.
    [1] O. Chapelle, B. Schölkopf, and A. Zien. Semi-supervised Learning. MIT Press, Cambridge, MA, USA, 2006.
  • 12. Semi-supervised learning (SSL) - 2. Despite the promise, SSL appears to be less than thoroughly investigated in SE. Lu et al. use an SSL algorithm augmented with multi-dimensional scaling (MDS) as a pre-processor, which outperforms corresponding supervised methods [1]. Li et al. developed a framework which maps ensemble learning and random forests into an SSL setting [2].
    [1] Huihua Lu, Bojan Cukic, and Mark Culp. 2012. Software defect prediction using semi-supervised learning with dimension reduction. In Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering (ASE 2012).
    [2] M. Li, H. Zhang, R. Wu, and Z.-H. Zhou. Sample-based software defect prediction with active and semi-supervised learning. Automated Software Engineering, 19:201–230, 2012.
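To make the SSL setting concrete, here is a minimal self-training sketch: a small labeled set is grown by repeatedly giving the closest unlabeled instance the label of its nearest labeled neighbor. Self-training with a 1-nearest-neighbor rule is a simple stand-in chosen for illustration; it is not the MDS-based method of Lu et al. or the framework of Li et al., and all names and data are assumptions.

```python
def nearest_label(point, labeled):
    """Return (distance, label) of the labeled instance closest to point."""
    return min(
        (sum((a - b) ** 2 for a, b in zip(point, x)) ** 0.5, y)
        for x, y in labeled
    )

def self_train(labeled, unlabeled):
    """Grow the labeled set one pseudo-label at a time (hypothetical helper)."""
    labeled = list(labeled)
    pool = list(unlabeled)
    while pool:
        # Pick the unlabeled point that sits closest to any labeled point.
        best = min(pool, key=lambda p: nearest_label(p, labeled)[0])
        _, label = nearest_label(best, labeled)
        labeled.append((best, label))  # pseudo-label, not a ground-truth label
        pool.remove(best)
    return labeled

# Two labeled modules and four unlabeled ones; values are invented.
seed = [((0.0, 0.0), "clean"), ((10.0, 10.0), "defective")]
unlabeled = [(1.0, 1.0), (2.0, 2.0), (9.0, 9.0), (6.0, 6.0)]

result = dict(self_train(seed, unlabeled))
print(result)
```

Note that the sketch still needs the initial labeled seed, which matches the challenge the slides list for SSL.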
  • 13. Active Learning (AL) - 1. AL methods are unsupervised methods working on an initially unlabeled data set. AL methods can query an oracle, which can provide labels. Yet, each label comes with a cost. Hence, we need as few queries as possible. E.g., Balcan et al. show AL provides the same performance as a supervised learner with substantially smaller sample sizes [1].
    [1] M.-F. Balcan, A. Beygelzimer, and J. Langford. "Agnostic active learning". Proceedings of the 23rd international conference on Machine learning - ICML ’06, pages 65–72, 2006.
  • 14. Active Learning (AL) - 2. In SE, AL methods hold good potential to reduce labeling costs. Lu et al. propose an AL-based fault prediction method, which outperforms supervised techniques by using 20% or less of the data [1]. Kocaguneli et al. use AL in software effort estimation (SEE). The proposed method performs comparably to supervised methods with 31% of the original data [2].
    [1] Huihua Lu and Bojan Cukic. 2012. An adaptive approach with active learning in software fault prediction. In Proceedings of the 8th International Conference on Predictive Models in Software Engineering (PROMISE '12).
    [2] Kocaguneli, E.; Menzies, T.; Keung, J.; Cok, D.; Madachy, R., "Active Learning and Effort Estimation: Finding the Essential Content of Software Effort Estimation Data," Software Engineering, IEEE Transactions on, vol.PP, no.99, pp.1,1, 0
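The query-an-oracle loop described on these slides can be sketched as follows. The query strategy used here (ask about the point farthest from everything already labeled) is a deliberately simple stand-in, not the method of Lu and Cukic or of Kocaguneli et al.; the `oracle` function, the `budget` parameter, and the toy data are assumptions for illustration.

```python
def dist(a, b):
    """Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def active_learn(pool, oracle, budget):
    """Query at most `budget` labels, each time choosing the point
    farthest from the points labeled so far (hypothetical helper)."""
    labeled = {}
    remaining = list(pool)
    while remaining and len(labeled) < budget:
        if labeled:
            pick = max(remaining, key=lambda p: min(dist(p, q) for q in labeled))
        else:
            pick = remaining[0]  # nothing labeled yet: start anywhere
        labeled[pick] = oracle(pick)  # every oracle call has a cost
        remaining.remove(pick)
    return labeled

def predict(point, labeled):
    """Label a point with the label of its nearest queried instance."""
    nearest = min(labeled, key=lambda q: dist(point, q))
    return labeled[nearest]

queries = []  # records each oracle call so the cost is visible

def oracle(p):
    """Stand-in for a human expert; the rule is invented."""
    queries.append(p)
    return "defective" if p[0] + p[1] > 10 else "clean"

pool = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (9.0, 9.0), (8.0, 9.0), (9.0, 8.0)]
labeled = active_learn(pool, oracle, budget=2)
predictions = {p: predict(p, labeled) for p in pool}
print(len(queries), predictions)
```

On this toy pool, two queries are enough to label all six instances consistently with the oracle's rule, which is the kind of saving the slides refer to.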
  • 15. So what do we do?
  • 16. Strengths and Weaknesses
    Supervised Learning (SL)
      Strengths
        • Successfully used in SE for predictive purposes.
        • Provides successful estimation performance.
      Challenges
        • Requires retrospective local data.
        • Requires dependent variable information.
    Transfer Learning (TL)
      Strengths
        • Enables data to be transferred between different organizations or time frames.
        • Provides a solution to the lack of local data.
        • After relevancy filtering, cross data can perform as well as within data.
      Challenges
        • Using cross data as-is yields unstable performance results.
        • TL filters relevant cross data, which reduces the transferred cross data amount.
    Semi-supervised Learning (SSL)
      Strengths
        • Enables learning from small sets of labeled instances.
        • Supplements the learning with unlabeled instances.
        • Relaxes the requirement of dependent variables.
      Challenges
        • Still requires an initially labeled set of training instances, albeit a small one.
        • For datasets with a large number of independent features, it requires feature subset selection.
    Active Learning (AL)
      Strengths
        • Helps find the essential content of the data.
        • Reduces the amount of dependent variable information needed, thereby reducing the associated data collection costs.
      Challenges
        • Susceptible to unbalanced class distributions in classification problems.
  • 17. Strengths and Weaknesses [Diagram: markers 1, 2 and 3 link items across the methods below]
    Supervised Learning (SL)
        • Requires retrospective local data.
    Transfer Learning (TL)
        • Provides a solution to the lack of local data.
        • TL filters relevant cross data, which reduces the transferred cross data amount.
    Semi-supervised Learning (SSL)
        • Enables learning from small sets of labeled instances.
    Active Learning (AL)
        • Helps find the essential content of the data.
  • 18. Synergy #1. Synergy #1 is already being pursued in SE, with successful applications of transferring data among: • Domain • Time frame
  • 19. Synergy #2. Filtering labeled cross data yields a very limited amount of locally relevant data. SSL can use filtered cross data to provide pseudo-labels for the unlabeled within data.
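A rough sketch of how Synergy #2 might look in code: a handful of filtered, labeled cross projects provide pseudo effort values for unlabeled within projects. Averaging the effort of the k nearest cross rows is an assumption made here for illustration; the slides do not specify which SSL algorithm is used, and all names and numbers are invented.

```python
def dist(a, b):
    """Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def pseudo_label(within_rows, filtered_cross, k=2):
    """Assign each unlabeled within row the mean effort of its
    k nearest filtered cross rows (hypothetical helper)."""
    labeled = []
    for w in within_rows:
        nearest = sorted(filtered_cross, key=lambda row: dist(row[0], w))[:k]
        effort = sum(e for _, e in nearest) / len(nearest)
        labeled.append((w, effort))
    return labeled

# Filtered cross data: (features, effort). Values are invented.
filtered_cross = [((1.0, 1.0), 10.0), ((2.0, 2.0), 20.0), ((3.0, 3.0), 30.0)]
# Within data without labels.
within = [(1.5, 1.5), (2.9, 2.9)]

pseudo = pseudo_label(within, filtered_cross, k=2)
print(pseudo)
```

The pseudo-labeled within rows could then serve as training data for any supervised estimator.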
  • 20. Synergy #3. SE data (defect and effort) can be summarized with its essential content. Transfer learning may benefit from using essential content instead of all the data, which may contain noise and outliers.
  • 21. Did you try any of the synergies?
  • 22. Experiments with Synergy #3 [Flow diagram with steps numbered 1 to 4. Labels: Cross data; TEAK filter; Filtered cross data; Past within data (without labels); QUICK; Essential within data; SSL; Essential within data with pseudo labels; Estimation Method; Within test project(s); Estimate]
  • 23. Experiments with Synergy #3
    • Estimation from pseudo-labeled within data
    • Within data is summarized to at most 15%
    • Opportunity for within data to be locally interpreted
  • 24. What have we covered?