Rahul Krishna presents research on improving software defect prediction through data preprocessing and model tuning. The document discusses motivations for defect prediction, challenges in the field, and outlines four research questions. Experiments were conducted on eight defect prediction datasets and one security flaws dataset. The results show that data preprocessing with SMOTE often improves prediction accuracy, parameter tuning has limited benefits, and the combination of SMOTE and tuning yields the best results.
1. Search-based SE: without search, you won’t find a thing.
“Engineering is optimization and optimization is search.”
ai4se.net
On Strategies To Improve
Software Defect Prediction
Rahul Krishna
PhD Scholar
Dept. Computer Science
2. Overview
• Motivation
• Research Questions
• Background
• Data Sets
• Experimental Setup
• Experimental Results
3. MOTIVATION
4. Why Defect Prediction?
• Boehm and Papaccio[1] comment that early detection helps reduce the cost incurred to fix defects at a later stage “by a factor of up to 200”
• IEEE Metrics 2002 concluded that “Finding and fixing bugs after delivery is usually 100 times more expensive than doing so at the requirements and design phase”[2]
• Shull et al.[2] claim that “About 40-50% of the user programs enter use with nontrivial defects”
• In the agile world, code bases are more developed than tested
• The takeaway: Find Bugs Early!
[1] B. W. Boehm and P. N. Papaccio, “Understanding and controlling software costs,” IEEE Trans. Softw. Eng., vol. 14, no. 10, pp. 1462–1477, Oct. 1988.
[2] F. Shull, V. Basili, B. Boehm, A. W. Brown, P. Costa, M. Lindvall, D. Port, I. Rus, R. Tesoriero, and M. Zelkowitz, “What we have learned about fighting defects,” in Software Metrics, 2002. Proceedings. Eighth IEEE Symp. on. IEEE, pp. 249–258.
5. Easier said than done..
• No oracles or closed form mathematical models.
• Expert opinion would take too long.
• There is way too much data
– GitHub has over 9 million users and 21.1 million repositories.
• Develop efficient code analysis measures
• Use Machine Learning tools
– Algorithms are too generic and need optimization
• But real world data is skewed
– “80% of the defects lie in only 20% of the modules”
– Not enough defective samples in a project to learn meaningful patterns
6. Research Questions
• RQ1: Can techniques such as SMOTE be used to preprocess data to improve prediction accuracy?
• RQ2: Does tuning a data miner improve its prediction accuracy?
• RQ3: Can tuning be performed in conjunction with SMOTE to further improve the prediction accuracy?
• RQ4: Is SMOTE limited only to defect prediction?
7. BACKGROUND
8. Defect Prediction
• Models are hard to obtain, too complex, and aren’t reliable.
• Different regions of the same data have different properties[1]
• A plausible solution:
– Use Case Based Reasoning
– Learn from past data and reflect on new data
• They’re pretty neat
– Can work with partial data (useful at early stages)[2]
– Can work with sparse samples[3]
– Rather robust
[1] T. Menzies, A. Butcher, D. Cok, A. Marcus, L. Layman, F. Shull, B. Turhan, and T. Zimmermann, “Local versus global lessons for defect prediction and effort estimation,” Software Engineering, IEEE Transactions on, vol. 39, no. 6, pp. 822–834, June 2013.
[2] F. Walkerden and R. Jeffery, “An empirical study of analogy based software effort estimation,” Empirical software engineering, vol. 4, no. 2, pp. 135–158, 1999.
[3] I. Myrtveit, E. Stensrud, and M. Shepperd, “Reliability and validity in comparative studies of software prediction models,” Software
9. Defect Prediction
• Lessmann et al.[1] compared 21 different learners for software defect prediction.
• They found Random Forest to be the best and CART to be the worst
• That’s strange!
– They’re both tree based learners
– One is deterministic, the other is randomized
– But they surely can’t be on opposite ends of the spectrum. Can they?
• It’s probably the data
– It’s always the data
• Maybe the predictors need to be calibrated
[1] S. Lessmann, B. Baesens, C. Mues, and S. Pietsch, “Benchmarking classification models for software defect prediction: A proposed framework and novel findings,” Software Engineering, IEEE Transactions on, vol. 34, no. 4, pp. 485–496, July 2008.
10. Class Imbalance in Data
11. Class Imbalance in Data
• Too many samples of non-defective modules
• Trees constructed by CART and RF would be severely biased
• Use SMOTE[1] to preprocess training data
– Upsample minority class by creating “synthetic” samples
– Downsample majority class by randomly discarding samples
• My criterion (My infallible Engineering judgment)
– At least 50 samples from minority class
– At most 100 samples from majority class
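The rebalancing described on this slide can be sketched roughly as follows. This is a minimal, simplified SMOTE-style routine written for illustration, not the code used in the experiments: the function name `smote_like`, the neighbour count `k=5`, and the toy data are all assumptions. Only the 50/100 thresholds come from the slide.

```python
import math
import random


def smote_like(minority, majority, min_minority=50, max_majority=100, k=5, seed=1):
    """Rebalance two lists of numeric feature vectors (simplified SMOTE)."""
    rng = random.Random(seed)
    minority = [list(row) for row in minority]
    majority = [list(row) for row in majority]

    # Downsample: randomly discard majority rows beyond the cap.
    if len(majority) > max_majority:
        majority = rng.sample(majority, max_majority)

    # Upsample: create synthetic rows on the segment between an original
    # minority row and one of its k nearest original minority neighbours.
    originals = list(minority)
    while len(originals) > 1 and len(minority) < min_minority:
        base = rng.choice(originals)
        neighbours = sorted(
            (row for row in originals if row is not base),
            key=lambda row: math.dist(base, row),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        minority.append([b + gap * (o - b) for b, o in zip(base, other)])
    return minority, majority


# Toy data (hypothetical): 4 defective modules, 300 clean ones.
defective = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
clean = [[float(i), float(i % 7)] for i in range(300)]
up, down = smote_like(defective, clean)
print(len(up), len(down))  # 50 100
```

Only the training data should be passed through a routine like this; the test data keeps its original class distribution.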
12. Parameter Tuning
• SMOTE preprocesses the training data
• Tuning calibrates the predictor
• Automate calibration using metaheuristics
– Differential Evolution is a popular and simple optimizer
• Use training data to learn the best parameters for the predictor
• Test data must not be revealed
– Only datasets with 3 or more historic versions are used
– The last version is used for testing, all others are used for training
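The version-based split above can be sketched as below. The helper name `split_by_version` and the placeholder version labels are hypothetical; the sketch only assumes the versions are ordered oldest to newest.

```python
def split_by_version(versions):
    """Split release-ordered datasets into (training versions, test version).

    `versions` is a list ordered oldest to newest. Datasets with fewer
    than three historic versions are rejected, as on the slide.
    """
    if len(versions) < 3:
        raise ValueError("need at least 3 historic versions")
    return versions[:-1], versions[-1]


# Placeholder labels standing in for per-release datasets.
train, test = split_by_version(["v1", "v2", "v3", "v4"])
print(train, test)  # ['v1', 'v2', 'v3'] v4
```

Any tuning then sees only `train`, so the last version stays hidden until the final evaluation.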
13. Differential Evolution (in a nutshell)
1. Randomly choose attributes
2. Pick any two attributes and create a new attribute by interpolation
3. If the new attribute performs better than the old one, discard the old one
4. If not, discard the new one
5. Repeat 2-4
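The loop above can be sketched as follows, reading the slide's “attributes” as candidate parameter settings for the predictor (an interpretation, not something the slide states). Note the swap: this sketch uses the common DE/rand/1/bin variant, which builds each new candidate from three other candidates rather than the two mentioned on the slide. The function name, the defaults (`pop_size=20`, `f=0.7`, `cr=0.5`) and the `sphere` objective are illustrative assumptions; in a tuning setting the objective would be a predictor's error on training data.

```python
import random


def differential_evolution(score, bounds, pop_size=20, generations=100,
                           f=0.7, cr=0.5, seed=1):
    """Minimise score(candidate) inside box bounds [(lo, hi), ...]."""
    rng = random.Random(seed)
    dims = len(bounds)
    # Step 1: random initial candidates.
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [score(c) for c in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Step 2: combine three distinct candidates other than pop[i].
            a, b, c = rng.sample([p for j, p in enumerate(pop) if j != i], 3)
            forced = rng.randrange(dims)  # at least one value is always replaced
            trial = []
            for d, (lo, hi) in enumerate(bounds):
                if d == forced or rng.random() < cr:
                    value = a[d] + f * (b[d] - c[d])
                    value = min(max(value, lo), hi)  # clip back into the box
                else:
                    value = pop[i][d]
                trial.append(value)
            # Steps 3-4: keep whichever of old and new scores better.
            trial_score = score(trial)
            if trial_score <= scores[i]:
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]


def sphere(x):
    """Stand-in objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


best, best_score = differential_evolution(sphere, [(-5.0, 5.0)] * 3)
print(best, best_score)
```

Because a trial only replaces its parent when it scores at least as well, the best score in the population never gets worse from one generation to the next.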
14. DATASETS
15. Datasets
• 8 Defect Prediction Datasets:
1. Ant
2. Ivy
3. Jedit
4. Lucene
5. Poi
6. Synapse
7. Velocity
8. Xalan
• 1 Bugzilla dataset (Thanks Chris!)
16. The Metrics
17. EXPERIMENTAL SETUP
18. Statistical Measures
• Let A, B, C, D denote true negatives, false negatives, false positives, and true positives
• The standard measures:
• F and G measure both defects and non-defects at once. Recall and specificity each measure only one.
• G is especially useful: it is the harmonic mean of recall and specificity.
• G is low whenever either recall or specificity is low.
– High G implies both recall and specificity are high. Which is good!
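The formulas from the slide did not survive in this transcript, so the sketch below uses the conventional definitions of these measures with the slide's A, B, C, D naming. The function names and the example counts are assumptions and may not match the slide's formulas exactly.

```python
def harmonic_mean(x, y):
    """Harmonic mean of two non-negative rates; 0 when both are 0."""
    return 2 * x * y / (x + y) if (x + y) else 0.0


def measures(a, b, c, d):
    """Conventional measures from a confusion matrix.

    a = true negatives, b = false negatives,
    c = false positives, d = true positives.
    """
    recall = d / (b + d) if (b + d) else 0.0       # share of defects found
    specificity = a / (a + c) if (a + c) else 0.0  # share of clean modules kept clean
    precision = d / (c + d) if (c + d) else 0.0    # share of alarms that are real
    return {
        "recall": recall,
        "specificity": specificity,
        "fallout": 1.0 - specificity,
        "precision": precision,
        "F": harmonic_mean(precision, recall),
        "G": harmonic_mean(recall, specificity),
    }


# Hypothetical counts: 80 TN, 5 FN, 20 FP, 15 TP.
m = measures(80, 5, 20, 15)
print(round(m["recall"], 3), round(m["specificity"], 3), round(m["G"], 3))
# 0.75 0.8 0.774
```

A harmonic mean always lies between its two inputs and sits closer to the smaller one, so G stays low unless both recall and specificity are high.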
19. EXPERIMENTAL RESULTS
20. Defect Dataset
• RQ1: Can techniques such as SMOTE be used to preprocess data to improve prediction accuracy?
– RF was better than CART in 6 out of the 8 datasets.
– SMOTE helped improve the performance in 4 out of those 6 datasets.
• RQ2: Does tuning a data miner improve its prediction accuracy?
– Not always. Tuning alone didn’t help
• RQ3: Can tuning be performed in conjunction with SMOTE to further improve the prediction accuracy?
– Yes. In 6 out of the 8 datasets, SMOTE+Tuning surely helps
23. Security Flaws Dataset
24. Conclusion
• Defect Dataset
– SMOTEing is beneficial
– Tuning alone is not too useful
– The combination of both works even better.
• Security Flaws Dataset
– Improves sensitivity by 10 times
• In summary:
– Always reflect over the data
– Calibrate your predictor before use