1. Predicting More from Less:
Synergies of Learning
Ekrem Kocaguneli, ekrem@kocaguneli.com
Bojan Cukic, bojan.cukic@mail.wvu.edu,
Huihua Lu, hlu3@mix.wvu.edu
RAISE'13: 2nd International NSF-sponsored Workshop on Realizing Artificial Intelligence Synergies in Software Engineering
5/25/2013
2. Collecting data is important
SourceForge currently hosts 324K projects with a user base of 3.4M [1]
GoogleCode hosts 250K open source projects [2]
[1] http://sourceforge.net/apps/trac/sourceforge/wiki/What%20is%20SourceForge.net
[2] https://developers.google.com/open-source/
3. Also, there is an abundance of SE repositories
ISBSG [1], PROMISE [2], Eclipse Bug Data [3], TukuTuku [4]
[1] C. Lokan, T. Wright, P. Hill, and M. Stringer. Organizational benchmarking using the ISBSG data repository. IEEE Software, 18(5):26–32, 2001.
[2] T. Menzies, B. Caglayan, E. Kocaguneli, J. Krall, F. Peters, and B. Turhan. The PROMISE repository of empirical software engineering data, June 2012.
[3] T. Zimmermann, R. Premraj, and A. Zeller. Predicting defects for Eclipse. In PROMISE'07: International Workshop on Predictor Models in Software Engineering, ICSE Workshops, 2007.
[4] http://www.metriq.biz/tukutuku/
5. The abundance of data is promising for predictive modeling and supervised learning
Yet, dependent variable information is not always available!
Dependent variables (labels, effort values, etc.) may be missing, outdated, or available for only a limited number of instances
6. When an organization has no local data, or the local data is outdated, transferring data helps (transfer learning)
When only a limited amount of data is labeled, we can use the existing labels to label other training instances (semi-supervised learning)
When no labels exist, we can request labels from experts at a cost (active learning)
7. How to transfer data between domains and projects? (transfer learning)
How to accommodate prediction problems for which only a limited amount of labeled instances is available? (semi-supervised learning)
How to handle prediction problems in which no instances have labels? (active learning)
9. Transfer learning - 1
Transfer learning is a set of learning methods that allow the training and test sets to have different domains and/or tasks (Ma2012 [1]).
SE transfer learning studies (a.k.a. cross-company learning) have the same task yet different domains (data coming from different organizations or different time frames).
[1] Y. Ma, G. Luo, X. Zeng, and A. Chen. Transfer learning for cross-company software defect prediction. Information and Software Technology, 54(3):248–256, 2012.
10. Transfer learning - 2
Transfer learning results in SE report instability and significant variability if data is used as-is (Kitchenham2007 [1], Zimmermann2009 [2])
Filtering-based approaches support prior results (Turhan2009 [3], Kocaguneli2011 [4]):
• Transferring all cross data yields poor performance
• Filtering cross data significantly improves estimation
[1] B. A. Kitchenham, E. Mendes, and G. H. Travassos. Cross versus within-company cost estimation studies: A systematic review. IEEE Trans. Softw. Eng., 33(5):316–329, 2007.
[2] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy. Cross-project defect prediction: A large scale experiment on data vs. domain vs. process. ESEC/FSE, pages 91–100, 2009.
[3] B. Turhan, T. Menzies, A. Bener, and J. Di Stefano. On the relative value of cross-company and within-company data for defect prediction. Empirical Software Engineering, 14(5):540–578, 2009.
[4] E. Kocaguneli and T. Menzies. How to find relevant data for effort estimation. In ESEM'11: International Symposium on Empirical Software Engineering and Measurement, 2011.
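The relevancy-filtering idea can be sketched as a k-nearest-neighbour filter in the spirit of the approach of Turhan et al.; the function name, toy data, and distance choice below are illustrative assumptions, not the actual filter from the cited studies:

```python
import numpy as np

def nn_filter(cross_X, within_X, k=2):
    """Keep only the cross-company rows that are among the k nearest
    neighbours (Euclidean distance) of at least one within-company row."""
    keep = set()
    for w in within_X:
        dists = np.linalg.norm(cross_X - w, axis=1)
        keep.update(np.argsort(dists)[:k].tolist())
    return cross_X[sorted(keep)]

# Toy feature matrices: two cross rows lie close to the within data,
# two lie far away and should be filtered out.
cross = np.array([[1.0, 1.0], [1.2, 0.9], [9.0, 9.0], [10.0, 8.0]])
within = np.array([[1.1, 1.0]])
filtered = nn_filter(cross, within, k=2)
print(len(filtered))  # 2 relevant cross instances survive
```

Only the relevant subset of the cross data is then passed to the estimator, matching the observation above that filtered cross data performs far better than cross data used as-is.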
11. Semi-supervised learning (SSL) - 1
SSL methods are a group of machine learning algorithms that learn from a set of training instances among which only a small subset has pre-assigned labels [1].
SSL helps relax the dependent-variable requirement of supervised methods
Hence, SSL can supplement supervised estimation methods.
[1] O. Chapelle, B. Schölkopf, and A. Zien. Semi-supervised Learning. MIT Press, Cambridge, MA, USA, 2006.
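A minimal self-training loop illustrates how a few pre-assigned labels can be propagated to unlabeled instances; the 1-NN learner and the distance-based confidence threshold are simplifying assumptions for this sketch, not the methods from the cited studies:

```python
import numpy as np

def self_train(X_lab, y_lab, X_unlab, rounds=3):
    """Minimal self-training: label each unlabelled instance with the class
    of its nearest labelled neighbour, fold it into the labelled pool when
    the match is confident (close), and repeat."""
    X, y = X_lab.copy(), y_lab.copy()
    pool = list(X_unlab)
    for _ in range(rounds):
        if not pool:
            break
        remaining = []
        for u in pool:
            d = np.linalg.norm(X - u, axis=1)
            i = int(np.argmin(d))
            if d[i] < 1.0:                       # confident pseudo-label
                X = np.vstack([X, u])
                y = np.append(y, y[i])
            else:                                # defer to a later round
                remaining.append(u)
        pool = remaining
    return X, y

X_lab = np.array([[0.0, 0.0], [5.0, 5.0]])       # two labelled seeds
y_lab = np.array([0, 1])
X_unlab = np.array([[0.4, 0.1], [5.2, 4.9], [0.8, 0.5]])
X, y = self_train(X_lab, y_lab, X_unlab)
print(y.tolist())  # [0, 1, 0, 1, 0]
```

Starting from just two labels, the loop ends with all five instances labelled, which is the sense in which SSL "relaxes" the dependent-variable requirement.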
12. Semi-supervised learning (SSL) - 2
Despite the promise, SSL appears to be less than thoroughly investigated in SE
Lu et al. use an SSL algorithm augmented with multi-dimensional scaling (MDS) as a pre-processor, which outperforms corresponding supervised methods [1]
Li et al. developed a framework which maps ensemble learning and random forests into an SSL setting [2]
[1] H. Lu, B. Cukic, and M. Culp. Software defect prediction using semi-supervised learning with dimension reduction. In ASE 2012: Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering, 2012.
[2] M. Li, H. Zhang, R. Wu, and Z.-H. Zhou. Sample-based software defect prediction with active and semi-supervised learning. Automated Software Engineering, 19:201–230, 2012.
13. Active Learning (AL) - 1
AL methods work on an initially unlabeled data set.
AL methods can query an oracle, which can provide labels. Yet, each label comes with a cost. Hence, we need as few queries as possible.
e.g., Balcan et al. show that AL provides the same performance as a supervised learner with substantially smaller sample sizes [1]
[1] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In ICML '06: Proceedings of the 23rd International Conference on Machine Learning, pages 65–72, 2006.
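The query-with-a-cost loop can be sketched as follows; the distance-based uncertainty proxy and the `oracle` function are illustrative assumptions, not Balcan et al.'s algorithm:

```python
import numpy as np

def active_sample(X, oracle, budget=3):
    """Greedy active-learning sketch: repeatedly query the oracle for the
    instance farthest from everything already labelled (a crude proxy for
    model uncertainty), stopping when the query budget runs out."""
    labeled_idx = [0]                       # seed with one labelled point
    labels = {0: oracle(X[0])}
    for _ in range(budget):
        # distance of every instance to its nearest labelled instance
        d = np.min(
            [np.linalg.norm(X - X[i], axis=1) for i in labeled_idx], axis=0)
        d[labeled_idx] = -1                 # never re-query labelled points
        q = int(np.argmax(d))
        labels[q] = oracle(X[q])            # each oracle call costs one label
        labeled_idx.append(q)
    return labels

X = np.array([[0.0], [0.1], [5.0], [5.1], [10.0]])
oracle = lambda x: int(x[0] > 2.5)          # hypothetical expert
labels = active_sample(X, oracle, budget=2)
print(sorted(labels))  # [0, 2, 4]: one query per distinct region
```

With a budget of two queries, the loop labels one representative of each distinct region instead of all five instances, which is how AL keeps the number of costly oracle calls small.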
14. Active Learning (AL) - 2
In SE, AL methods hold good potential to reduce labeling costs
Lu et al. propose an AL-based fault prediction method, which outperforms supervised techniques using 20% or less of the data [1]
Kocaguneli et al. use AL in software effort estimation (SEE). The proposed method performs comparably to supervised methods with 31% of the original data [2]
[1] H. Lu and B. Cukic. An adaptive approach with active learning in software fault prediction. In PROMISE '12: Proceedings of the 8th International Conference on Predictive Models in Software Engineering, 2012.
[2] E. Kocaguneli, T. Menzies, J. Keung, D. Cok, and R. Madachy. Active learning and effort estimation: Finding the essential content of software effort estimation data. IEEE Transactions on Software Engineering (preprint).
16. Strengths and Weaknesses
Supervised Learning (SL)
Strengths
• Successfully used in SE for predictive purposes.
• Provides successful estimation performance.
Challenges
• Requires retrospective local data.
• Requires dependent variable information.
Transfer Learning (TL)
Strengths
• Enables data to be transferred between different organizations or time frames.
• Provides a solution to the lack of local data.
• After relevancy filtering, cross data can perform as well as within data.
Challenges
• Using cross data in an as-is manner results in unstable performance.
• TL filters relevant cross data, which reduces the amount of transferred cross data.
Semi-supervised Learning (SSL)
Strengths
• Enables learning from small sets of labeled instances.
• Supplements the learning with unlabeled instances.
• Relaxes the requirement of dependent variables.
Challenges
• Although small, an initially labeled set of training instances is still required.
• For datasets with a large number of independent features, feature subset selection is required.
Active Learning (AL)
Strengths
• Helps find the essential content of the data.
• Decreases the amount of dependent variable information required, thereby reducing the associated data collection costs.
Challenges
• Susceptible to unbalanced class distributions in classification problems.
17. Strengths and Weaknesses
Supervised Learning (SL)
• Requires retrospective local data.
Transfer Learning (TL)
• Provides a solution to the lack of local data.
• TL filters relevant cross data, which reduces
the transferred cross data amount.
Semi-supervised Learning (SSL)
• Enables learning from small sets of labeled
instances.
Active Learning (AL)
• Helps find the essential content of the data.
18. Synergy #1
Synergy #1 is already being pursued in SE, with successful applications of transferring data across:
• Domains
• Time frames
19. Synergy #2
Filtering labeled cross data yields a very limited amount of locally relevant data
SSL can use the filtered cross data to provide pseudo-labels for the unlabeled within data
20. Synergy #3
SE data (defect and effort) can be summarized by its essential content
Transfer learning may benefit from using the essential content instead of all the data, which may contain noise and outliers
22. Experiments with Synergy #3
[Pipeline diagram]
1. TEAK filter: the labeled cross data is filtered into relevant cross data.
2. QUICK: the past within data (without labels) is summarized into essential within data.
3. SSL: the essential within data is given pseudo-labels from the filtered cross data.
4. Estimation method: the pseudo-labeled essential within data yields estimates for the within test project(s).
23. Experiments with Synergy #3
• Estimation from pseudo-labeled within data
• Within data is summarized to at most 15% of its original size
• Opportunity for the within data to be locally interpreted