Predicting More from Less: Synergies of Learning


Ekrem Kocaguneli, ekrem@kocaguneli.com
Bojan Cukic, bojan.cukic@mail.wvu.edu,
Huihua Lu, hlu3@mix.wvu.edu



  1. Predicting More from Less: Synergies of Learning
     Ekrem Kocaguneli, ekrem@kocaguneli.com
     Bojan Cukic, bojan.cukic@mail.wvu.edu
     Huihua Lu, hlu3@mix.wvu.edu
     RAISE'13, 2nd International NSF-sponsored Workshop on Realizing Artificial Intelligence Synergies in Software Engineering, 5/25/2013
  2. Collecting data is important. SourceForge currently hosts 324K projects with a user base of 3.4M [1]. Google Code hosts 250K open source projects [2].
     1. http://sourceforge.net/apps/trac/sourceforge/wiki/What%20is%20SourceForge.net
     2. https://developers.google.com/open-source/
  3. Also, there is an abundant amount of SE repositories: ISBSG [1], PROMISE [2], Eclipse Bug Data [3], TukuTuku [4].
     1. C. Lokan, T. Wright, P. Hill, and M. Stringer. Organizational benchmarking using the ISBSG data repository. IEEE Software, 18(5):26–32, 2001.
     2. T. Menzies, B. Caglayan, E. Kocaguneli, J. Krall, F. Peters, and B. Turhan. The PROMISE repository of empirical software engineering data, June 2012.
     3. T. Zimmermann, R. Premraj, and A. Zeller. Predicting defects for Eclipse. In PROMISE'07: International Workshop on Predictor Models in Software Engineering, ICSE Workshops 2007.
     4. http://www.metriq.biz/tukutuku/
  4. We have mountains of data, but then what?
  5. Abundance of data is promising for predictive modeling and supervised learning. Yet, dependent variable information is not always available! Dependent variables (labels, effort values, etc.) may be missing, outdated, or available for only a limited number of instances.
  6. When an organization has no local data or the local data is outdated, transferring data helps (transfer learning). When only a limited amount of data is labeled, we can use the existing labels to label other training instances (semi-supervised learning). When no labels exist, we can request labels from experts at a cost (active learning).
  7. How do we transfer data between domains and projects (transfer learning)? How do we accommodate prediction problems for which only a limited amount of labeled instances is available (semi-supervised learning)? How do we handle prediction problems in which no instances have labels (active learning)?
  8. What is the current state-of-the-art?
  9. Transfer learning - 1. Transfer learning is a set of learning methods that allow the training and test sets to have different domains and/or tasks (Ma2012 [1]). SE transfer learning studies (a.k.a. cross-company learning) have the same task yet different domains (data coming from different organizations or different time frames).
     [1] Y. Ma, G. Luo, X. Zeng, and A. Chen. Transfer learning for cross-company software defect prediction. Information and Software Technology, 54(3):248–256, 2012.
  10. Transfer learning - 2. Transfer learning results in SE report instability and significant variability when data is used as-is (Kitchenham2007 [1], Zimmermann2009 [2]). Filtering-based approaches support prior results (Turhan2009 [3], Kocaguneli2011 [4]):
      • Transferring all cross data yields poor performance.
      • Filtering cross data significantly improves estimation.
      [1] B. A. Kitchenham, E. Mendes, and G. H. Travassos. Cross versus within-company cost estimation studies: A systematic review. IEEE Trans. Softw. Eng., 33(5):316–329, 2007.
      [2] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy. Cross-project defect prediction: A large scale experiment on data vs. domain vs. process. ESEC/FSE, pages 91–100, 2009.
      [3] B. Turhan, T. Menzies, A. Bener, and J. Di Stefano. On the relative value of cross-company and within-company data for defect prediction. Empirical Software Engineering, 14(5):540–578, 2009.
      [4] E. Kocaguneli and T. Menzies. How to find relevant data for effort estimation. In ESEM'11: International Symposium on Empirical Software Engineering and Measurement, 2011.
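The filtering idea on this slide can be sketched in a few lines. Below is a minimal, hypothetical illustration in the spirit of nearest-neighbor relevancy filtering (as in Turhan2009): for each local (within) instance, keep only the k most similar cross-company instances. The feature values and the choice of plain Euclidean distance are assumptions for illustration, not the published algorithm.

```python
# Hypothetical sketch of relevancy filtering for cross-company data.
# Each row is a project described by numeric features, e.g. [size, team].

def nn_filter(cross, within, k=2):
    """Keep only the cross rows that rank among the k nearest
    neighbours (Euclidean distance) of at least one within row."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    kept = set()
    for w in within:
        ranked = sorted(range(len(cross)), key=lambda i: dist(cross[i], w))
        kept.update(ranked[:k])           # indices of the k closest rows
    return [cross[i] for i in sorted(kept)]

# Made-up data: most cross projects resemble the local ones, one does not.
cross_data = [[100, 5], [2000, 40], [120, 6], [90, 4], [5000, 90]]
within_data = [[110, 5], [95, 4]]

filtered = nn_filter(cross_data, within_data, k=2)
print(filtered)  # the dissimilar [2000, 40] and [5000, 90] are dropped
```

The point of the sketch is the slide's second bullet: training on `filtered` rather than on all of `cross_data` discards the projects that have no local analogue.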
  11. Semi-supervised learning (SSL) - 1. SSL methods are a group of machine learning algorithms that learn from a set of training instances among which only a small subset has pre-assigned labels [1]. SSL helps relax the dependent variable dependence of supervised methods. Hence, we can supplement supervised estimation methods.
      [1] O. Chapelle, B. Schölkopf, and A. Zien. Semi-supervised Learning. MIT Press, Cambridge, MA, USA, 2006.
  12. Semi-supervised learning (SSL) - 2. Despite the promise, SSL appears to be less than thoroughly investigated in SE. Lu et al. use an SSL algorithm augmented with multi-dimensional scaling (MDS) as a pre-processor, which outperforms corresponding supervised methods [1]. Li et al. developed a framework which maps ensemble learning and random forests into an SSL setting [2].
      [1] H. Lu, B. Cukic, and M. Culp. Software defect prediction using semi-supervised learning with dimension reduction. In Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering (ASE 2012).
      [2] M. Li, H. Zhang, R. Wu, and Z.-H. Zhou. Sample-based software defect prediction with active and semi-supervised learning. Automated Software Engineering, 19:201–230, 2012.
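One simple member of the SSL family mentioned on these slides is self-training: start from a small labeled set and repeatedly pseudo-label the unlabeled instance closest to something already labeled. The toy below is an assumption-laden illustration (one numeric feature, 1-NN labeling, invented values); it is not the MDS-based method of Lu et al. or the ensemble framework of Li et al.

```python
# Toy self-training sketch: grow a labelled set with pseudo-labels.
# Feature values and labels are hypothetical.

def self_train(labelled, unlabelled):
    """labelled: list of (feature, label); unlabelled: list of features.
    Returns the labelled list grown with 1-NN pseudo-labelled instances."""
    labelled = list(labelled)
    pool = list(unlabelled)
    while pool:
        # pick the unlabelled point closest to any labelled point
        best = min(pool, key=lambda u: min(abs(u - f) for f, _ in labelled))
        nearest = min(labelled, key=lambda fl: abs(fl[0] - best))
        labelled.append((best, nearest[1]))  # copy nearest neighbour's label
        pool.remove(best)
    return labelled

seed = [(1.0, "defective"), (10.0, "clean")]  # the small labelled subset
grown = self_train(seed, [2.0, 9.0, 1.5])     # three unlabelled modules
print(grown)
```

Labeling the closest point first matters: each new pseudo-label can help label the next point, which is exactly how SSL "supplements the learning with unlabeled instances".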
  13. Active learning (AL) - 1. AL methods are unsupervised methods working on an initially unlabeled data set. AL methods can query an oracle, which can provide labels. Yet, each label comes with a cost; hence, we need as few queries as possible. E.g., Balcan et al. show AL provides the same performance as a supervised learner with substantially smaller sample sizes [1].
      [1] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In Proceedings of the 23rd International Conference on Machine Learning (ICML '06), pages 65–72, 2006.
  14. Active learning (AL) - 2. In SE, AL methods hold good potential to reduce labeling costs. Lu et al. propose an AL-based fault prediction method, which outperforms supervised techniques by using 20% or less of the data [1]. Kocaguneli et al. use AL in software effort estimation; the proposed method performs comparably to supervised methods with 31% of the original data [2].
      [1] H. Lu and B. Cukic. An adaptive approach with active learning in software fault prediction. In Proceedings of the 8th International Conference on Predictive Models in Software Engineering (PROMISE '12).
      [2] E. Kocaguneli, T. Menzies, J. Keung, D. Cok, and R. Madachy. Active learning and effort estimation: Finding the essential content of software effort estimation data. IEEE Transactions on Software Engineering, early access.
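The query-with-a-budget loop described on the AL slides can be sketched as follows. This is a deliberately simple diversity heuristic (always query the point farthest from everything labeled so far); the papers above use more refined selection criteria. The oracle function and all values are hypothetical stand-ins for a human expert and a real project table.

```python
# Minimal active-learning sketch: buy as few labels as the budget allows.

def active_sample(pool, oracle, budget):
    """Query the oracle `budget` times, each time picking the pool
    instance farthest from everything labelled so far."""
    labelled = []
    pool = list(pool)
    for _ in range(budget):
        if not labelled:
            pick = pool[0]  # seed with an arbitrary first query
        else:
            pick = max(pool, key=lambda u: min(abs(u - f) for f, _ in labelled))
        labelled.append((pick, oracle(pick)))  # one costly expert query
        pool.remove(pick)
    return labelled

def oracle(x):  # pretend expert: large modules are defective
    return "defective" if x > 5 else "clean"

queried = active_sample([1.0, 2.0, 9.0, 8.5, 1.2], oracle, budget=2)
print(queried)  # two labels bought instead of five
```

With a budget of two, the loop covers both ends of this tiny feature space, which is the sense in which AL can match supervised performance on a fraction of the labels.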
  15. So what do we do?
  16. Strengths and Weaknesses
      Supervised Learning (SL)
      Strengths:
      • Successfully used in SE for predictive purposes.
      • Provides successful estimation performance.
      Challenges:
      • Requires retrospective local data.
      • Requires dependent variable information.
      Transfer Learning (TL)
      Strengths:
      • Enables data to be transferred between different organizations or time frames.
      • Provides a solution to the lack of local data.
      • After relevancy filtering, cross data can perform as well as within data.
      Challenges:
      • Using cross data in an as-is manner results in unstable performance.
      • TL filters relevant cross data, which reduces the transferred cross data amount.
      Semi-supervised Learning (SSL)
      Strengths:
      • Enables learning from small sets of labeled instances.
      • Supplements the learning with unlabeled instances.
      • Relaxes the requirement of dependent variables.
      Challenges:
      • Although small, it still requires an initially labeled set of training instances.
      • For datasets with a large number of independent features, it requires feature subset selection.
      Active Learning (AL)
      Strengths:
      • Helps find the essential content of the data.
      • Decreases the amount of dependent variable information needed, thereby reducing the associated data collection costs.
      Challenges:
      • Susceptible to unbalanced class distributions in classification problems.
  17. Strengths and Weaknesses (summary)
      • Supervised Learning (SL): requires retrospective local data.
      • Transfer Learning (TL): provides a solution to the lack of local data, but filtering reduces the transferred cross data amount.
      • Semi-supervised Learning (SSL): enables learning from small sets of labeled instances.
      • Active Learning (AL): helps find the essential content of the data.
  18. Synergy #1. Synergy #1 is already being pursued in SE, with successful applications of transferring data among:
      • Domains
      • Time frames
  19. Synergy #2. Filtering labeled cross data yields a very limited amount of locally relevant data. SSL can use filtered cross data to provide pseudo-labels for the unlabeled within data.
  20. Synergy #3. SE data (defect and effort) can be summarized with its essential content. Transfer learning may benefit from using essential content instead of all the data, which may contain noise and outliers.
  21. Did you try any of the synergies?
  22. Experiments with Synergy #3 (pipeline diagram):
      1. Cross data is passed through the TEAK filter, yielding filtered cross data relevant to the within test project(s).
      2. QUICK summarizes the past within data (without labels) into essential within data.
      3. SSL uses the filtered cross data to assign pseudo-labels to the essential within data.
      4. The estimation method trains on the essential within data with pseudo-labels and produces an estimate for the test project(s).
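The pipeline on this slide can be sketched end to end with trivial stand-ins: TEAK is replaced by a simple distance filter, QUICK by a near-duplicate remover, and both the SSL step and the estimator by 1-NN. These substitutions, the function names below, and all numbers are assumptions for illustration; the real TEAK and QUICK algorithms are far more involved.

```python
# Rough, hypothetical end-to-end sketch of the Synergy #3 pipeline.
# Rows are (size, effort) pairs; within data has sizes but no efforts.

def filter_cross(cross, tests, k=2):
    """TEAK stand-in: keep the k cross rows nearest each test size."""
    keep = set()
    for t in tests:
        keep.update(sorted(range(len(cross)),
                           key=lambda i: abs(cross[i][0] - t))[:k])
    return [cross[i] for i in sorted(keep)]

def summarize(within, tol=0.5):
    """QUICK stand-in: drop rows within tol of an already-kept row."""
    kept = []
    for w in within:
        if all(abs(w - k) > tol for k in kept):
            kept.append(w)
    return kept

def pseudo_label(filtered_cross, essential_within):
    """SSL step: label each within size with its nearest cross effort."""
    return [(w, min(filtered_cross, key=lambda c: abs(c[0] - w))[1])
            for w in essential_within]

def estimate(train, test):
    """Final step: 1-NN effort estimate for the test project."""
    return min(train, key=lambda fl: abs(fl[0] - test))[1]

cross = [(100, 12), (105, 14), (900, 80)]   # labelled cross-company data
within_unlabelled = [102.0, 102.2, 110.0]   # local sizes, no effort labels
test_project = 104.0

fc = filter_cross(cross, [test_project], k=2)   # step 1
ew = summarize(within_unlabelled)               # step 2
train = pseudo_label(fc, ew)                    # step 3
print(estimate(train, test_project))            # step 4
```

Note that the final estimate comes from pseudo-labeled local data, not directly from the cross data, which is what gives the within data a chance to be "locally interpreted" as the next slide puts it.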
  23. Experiments with Synergy #3 (observations):
      • Estimation from pseudo-labeled within data.
      • Within data is summarized to at most 15%.
      • Opportunity for within data to be locally interpreted.
  24. What have we covered?
