Software Engineering Course 2009 - Mining Software Archives

Transcript

  • 1. SOME EXAM INFO
  • 2. Exam Admittance 50% ROOM ... Questions or problems: kim@cs.uni-saarland.de
  • 3. After-Exam Registration: Not registered = no after-exam. But please only register when you plan to participate.
  • 4. Exam Regulations ‣ Single-sided cheat sheet ‣ No dictionaries ‣ Ask supervision ‣ Bags to be left at entrance ‣ Hand in exam & cheat sheet ‣ Student ID on desk ‣ Additional paper only from supervision ‣ Name + MatNr. on every sheet (incl. cheat sheet) ‣ Stick to one language per exercise (German or English)
  • 5. Seminar on Code Modification at Runtime by Frank Padberg. Topics: ‣ Runtime optimization of byte code ‣ on-the-fly creation of classes ‣ self-modifying code ‣ ... AND MORE! Initial Meeting (Vorbesprechung): July 22 http://www.st.cs.uni-saarland.de/edu/codemod09/rcm09.html
  • 6. Current Assignment http://www.st.cs.uni-saarland.de/edu/se/2009/handouts/mutation_tyes.png
  • 7. MINING SOFTWARE REPOSITORIES Software Engineering Course 2009 Kim Herzig - Saarland University
  • 8. Books: Data Mining: Concepts and Techniques by Jiawei Han & Micheline Kamber; Data Mining: Practical Machine Learning Tools and Techniques by Ian H. Witten & Eibe Frank
  • 9. Imagine You as Quality Manager
  • 10. Imagine: You as Quality Manager. Your product: ‣ 30,000 classes ‣ ~5.5 million lines of code ‣ ~3,000 defects per release ‣ 700 developers
  • 11. Your Boss Test the system! You have 6 months, $500,000. And don’t miss any bug!
  • 12. The Problem ‣ Not enough time to test everything ‣ What to test? What to test first? ‣ Not enough money to pay enough testers ‣ To what extent? Central question: Where are the most defect-prone entities in my system?
  • 13. Your Testers
  • 14. Your Testers
  • 15. We need efficiency!
  • 16. We need efficiency!
  • 17. We need efficiency!
  • 18. Can we learn from history? ... to predict or estimate the future?
  • 19. data mining
  • 20. What is data mining? Data mining is the process of discovering actionable information from large sets of data.
  • 21. The Mining Model: Defining the problem → Preparing data → Exploring data → Building models → Validating models → Deploying and updating models http://technet.microsoft.com/en-us/library/ms174949.aspx
  • 22. Step 1: Defining the Problem ‣ Clearly define the problem ‣ What are you looking for? ‣ Scope of problem ‣ Types of relationships ‣ Define how to evaluate ‣ Prediction, recommendation or just patterns
  • 23. Defect Prediction Problem: Which source code entities should we test most?
  • 24. Defect Prediction Problem: Which source code entities should we test most? Which are the most defect-prone entities in my system?
  • 25. Defect Prediction Problem: Which source code entities should we test most? Which are the most defect-prone entities in my system? In the past, which entities had the most defects?
  • 26. Defect Prediction Problem: Which source code entities should we test most? Which are the most defect-prone entities in my system? In the past, which entities had the most defects? Which properties of source code entities correlate with defects?
  • 27. Data Sources: Bug Database, Version Archive, Source Code
  • 28. Data Sources: Bug Database → past defects per entity (quality); Version Archive; Source Code
  • 29. Data Sources: Bug Database → past defects per entity (quality); Version Archive; Source Code → source code properties (metrics)
  • 30. Data Sources: Heuristics. Bug Database → past defects per entity (quality); Version Archive: “... commit messages that contain fix and bug id ...”
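A minimal sketch of such a heuristic in Python. The fix keywords, the bug-id pattern, and the example commits are illustrative assumptions, not the exact rules used in the lecture; real studies tune these patterns per project.

```python
import re

# Hypothetical patterns: a commit is linked to a bug if its message
# contains a fix-related keyword and something that looks like a bug id.
FIX_KEYWORDS = re.compile(r"\b(fix(e[sd])?|bug|defect|patch)\b", re.IGNORECASE)
BUG_ID = re.compile(r"(bug\s*#?\d+|#\d+)", re.IGNORECASE)

def referenced_bug_ids(commit_message):
    """Return bug ids mentioned in a commit message, applying the heuristic."""
    if not FIX_KEYWORDS.search(commit_message):
        return []
    return [m.group(0) for m in BUG_ID.finditer(commit_message)]

# Usage: count past defects per file from (message, changed_files) pairs.
commits = [
    ("Fixed bug #4711 in the parser", ["src/Parser.java"]),
    ("Refactoring, no functional change", ["src/Parser.java", "src/Lexer.java"]),
]
defects_per_file = {}
for message, files in commits:
    if referenced_bug_ids(message):
        for f in files:
            defects_per_file[f] = defects_per_file.get(f, 0) + 1
print(defects_per_file)  # {'src/Parser.java': 1}
```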
  • 31. Data Sources: Metrics ‣ Complexity metrics ‣ McCabe, FanIn, FanOut, Couplings ‣ (see Lecture “Metrics and Estimation”) ‣ Time metrics ‣ How many changes ‣ How many different authors ‣ Age of code. Source Code → source code properties (metrics)
  • 32. Data Sources: Bug Database → past defects per entity (quality); Version Archive; Source Code → source code properties (metrics)
  • 33. Step 2: Prepare Data ‣ Highly distributed data: ‣ Version repository, bug database, time trackers, ... ‣ Data integration ‣ Excel, CSV, SQL, ARFF, ... ‣ Data cleaning ‣ missing values, noise, inter-correlations
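A small data-preparation sketch, assuming the integrated data has already been exported to a hypothetical CSV file with one row per entity, one column per metric, and a "defects" output column; the file and column names are made up for illustration.

```python
import pandas as pd

# Hypothetical mining file: one row per source code entity.
data = pd.read_csv("mining.csv")

# Data cleaning: drop rows where the output is missing,
# fill missing metric values with the column median.
data = data.dropna(subset=["defects"])
metrics = data.select_dtypes("number")
data[metrics.columns] = metrics.fillna(metrics.median())

# Inter-correlations: strongly correlated metric columns carry redundant
# information and are candidates for removal before building the model.
print(data[metrics.columns].corr()["defects"].sort_values(ascending=False))
```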
  • 34. Example Mining File
  • 35. Example Mining File entities
  • 36. Example Mining File ... entities data points
  • 37. Example Mining File ... entities data points output
  • 38. Example Mining File ... entities, data points, output. Careful! Large files! e.g. 300 columns, 5 million lines
  • 39. Step 3: Explore Data. You cannot validate the output if you don’t know the input ‣ Descriptive data summary ‣ max, min, mean, pareto, distribution ‣ Data selection ‣ Relevance of data ‣ Data reduction ‣ aggregation, subset selection
  • 40. Descriptive Data Summary ‣ How good can a prediction possibly be? ‣ Does it make sense to predict the top 20%? 20% of entities contain 80% of defects
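A possible way to compute such a summary and the 80/20 check, again assuming the hypothetical mining file and "defects" column from above; the result tells you how much a top-20% prediction could cover at best.

```python
import pandas as pd

data = pd.read_csv("mining.csv")  # hypothetical mining file

# Descriptive data summary: count, mean, min, max, quartiles.
print(data["defects"].describe())

# Pareto check: share of all defects in the top 20% of entities.
sorted_defects = data["defects"].sort_values(ascending=False)
top20 = sorted_defects.head(int(len(sorted_defects) * 0.2))
share = top20.sum() / sorted_defects.sum()
print(f"Top 20% of entities contain {share:.0%} of all defects")
```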
  • 41. Step 3: Explore Data. Data sufficiency ‣ Maybe the data will not help to solve the problem ‣ Redefine problem ‣ Search for alternatives ‣ Access different data
  • 42. Step 3: Explore Data. Data sufficiency ‣ Maybe the data will not help to solve the problem ‣ Redefine problem ‣ Search for alternatives ‣ Access different data
  • 43. Step 3: Explore Data: Bug Database → past defects per entity (quality); Version Archive; Source Code → source code properties (metrics). Does complexity correlate with defects?
  • 44. Step 3: Explore Data: Bug Database → past defects per entity (quality); Version Archive; Source Code → source code properties (metrics). Does complexity correlate with defects? YES!
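One way to explore that question is a rank correlation between a complexity metric and past defects, e.g. with scipy. The column names "mccabe" and "defects" are assumed for illustration; the slide's "YES!" refers to the lecture's own data, not to this sketch.

```python
import pandas as pd
from scipy.stats import spearmanr

data = pd.read_csv("mining.csv")  # hypothetical mining file

# Rank correlation between a complexity metric (e.g. McCabe)
# and the number of past defects per entity.
rho, p_value = spearmanr(data["mccabe"], data["defects"])
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```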
  • 45. Step 4: Build Model ‣ Mining model is only a container ‣ parameters and mining structure ‣ output value ‣ Now we need some statistics / machine learners
  • 46. Example Mining File ... entities data points output
  • 47. Building the Model ‣ Regression ‣ Predicting concrete, continuous values ‣ Difficult and very imprecise ‣ But desirable ‣ Classification ‣ Predicting class labels (e.g. more than X defects or not) ‣ Easier and more precise ‣ Vague information (how many defects in code?)
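A tiny sketch of the difference between the two targets, using the hypothetical mining file from above. The threshold X = 0 (defect prone means at least one defect) is an assumption, not the lecture's choice.

```python
import pandas as pd

data = pd.read_csv("mining.csv")  # hypothetical mining file

# Regression target: the concrete number of defects per entity.
y_regression = data["defects"]

# Classification target: a class label ("more than X defects or not").
X = 0  # assumed threshold
data["defect_prone"] = (data["defects"] > X).astype(int)
print(data["defect_prone"].value_counts())
```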
  • 48. Building the Model Step 1: Define the problem Step 2: Prepare Data Step 3: Explore Data Step 4: Building the Model Step 5: Validating the Model
  • 49. Building the Model: Rule-Based Classification, Support Vector Machine, Linear Regression, Lazy Learners, Decision Tree, Bayesian Network, Logistic Regression
  • 50. Training and Testing ‣ Training set ‣ The data set to train the model ‣ Which columns correlate with output values? ‣ Which columns correlate with each other? ‣ Testing set ‣ A data set independent of the training data set ‣ used to fine-tune the estimates of the model parameters
  • 51. Training and Testing: Random split + Only one version needed + No overlaps between training and testing entities - Does not reflect real life - Which random set is the best one? (because they are all different)
  • 52. Training and Testing: Random split. DATA SET → training data (2/3) + testing data (1/3). + Only one version needed + No overlaps between training and testing entities - Does not reflect real life - Which random set is the best one? (because they are all different)
  • 53. Training and Testing: Random split. DATA SET → training data (2/3) + testing data (1/3). + Only one version needed + No overlaps between training and testing entities - Does not reflect real life - Which random set is the best one? (because they are all different)
  • 54. Training and Testing: Forward estimation. DATA SET version N → training data; DATA SET version N+1 → testing data. + Reflects real life + Reproducible result - Two versions needed
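Both setups can be sketched in a few lines, assuming one mining file per version (hypothetical file names). The 2/3 vs. 1/3 random split follows the slide; the rest is illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Random split: 2/3 training, 1/3 testing from a single version.
data = pd.read_csv("mining_2_0.csv")  # hypothetical file for version 2.0
train, test = train_test_split(data, test_size=1 / 3, random_state=42)

# Forward estimation: train on version N, test on version N+1.
train_forward = pd.read_csv("mining_2_0.csv")  # version N
test_forward = pd.read_csv("mining_2_1.csv")   # version N+1
```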
  • 55. Step 4: Build Model
  • 56. Step 4: Build Model training set
  • 57. Step 4: Build Model: training set → machine learner (black box)
  • 58. Step 4: Build Model: training set → input → machine learner (black box)
  • 59. Step 4: Build Model: training set → input → machine learner (black box) → output → Prediction Model
  • 60. Step 4: Build Model: training set → input → machine learner (black box) → output → Prediction Model; testing set
  • 61. Step 4: Build Model: training set → input → machine learner (black box) → output → Prediction Model; testing set → input → Prediction Model
  • 62. Step 4: Build Model: training set → input → machine learner (black box) → output → Prediction Model; testing set → input → Prediction Model → output → Prediction
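A minimal end-to-end sketch of this pipeline, using scikit-learn as the black-box learner; the lecture does not prescribe a specific library, and the file names and feature column names are assumptions.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

train = pd.read_csv("train.csv")  # hypothetical training set
test = pd.read_csv("test.csv")    # hypothetical testing set
features = ["mccabe", "fan_in", "fan_out", "changes", "authors"]  # assumed columns

# Feed the training set into the machine learner (black box) ...
learner = DecisionTreeClassifier(random_state=42)
model = learner.fit(train[features], train["defect_prone"])

# ... the resulting prediction model maps the testing set to predictions.
predictions = model.predict(test[features])
```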
  • 63. Step 5: Validating Model ‣ Test data has same structure but different content ‣ Goal is to use model to correctly estimate output values ‣ Compare estimation with real values (fine tuning)
  • 64. Evaluation Step 1: Define the problem Step 2: Prepare Data Step 3: Explore Data Step 4: Building the Model Step 5: Validating the Model
  • 65. Evaluation: Never predict concrete numbers! Because people will take them for real!
  • 66. Evaluation: chart of real defects per entity vs. predicted defects per entity, sorted descending
  • 67. Evaluation: correctly predicted defect-prone modules (true positives), marked in the chart of real vs. predicted defects per entity
  • 68. Recall, Precision, Accuracy. Confusion matrix: Real defects Yes, Predicted Yes = true positives; Real Yes, Predicted No = false negatives; Real No, Predicted Yes = false positives; Real No, Predicted No = true negatives
  • 69. Recall, Precision, Accuracy. Precision = true positives / (true positives + false positives). Predicted defect-prone entities will be defect prone!
  • 70. Recall, Precision, Accuracy. Recall = true positives / (true positives + false negatives). All defect-prone entities get predicted as defect prone.
  • 71. Recall, Precision, Accuracy. Accuracy = (true positives + true negatives) / (true positives + true negatives + false negatives + false positives). The overall correctness.
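The three measures as plain functions, with a made-up confusion matrix as a usage example; the numbers are illustrative only.

```python
def precision(tp, fp):
    # Of all entities predicted as defect prone, how many really are?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of all really defect-prone entities, how many did we predict?
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):
    # Overall correctness over all predictions.
    return (tp + tn) / (tp + tn + fp + fn)

# Example confusion matrix (made-up numbers):
tp, fp, fn, tn = 40, 20, 110, 830
print(precision(tp, fp), recall(tp, fn), accuracy(tp, tn, fp, fn))
```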
  • 72. Step 6: Deploying Model ‣ Integrate model into development or quality assurance process ‣ Update model frequently (because change happens) ‣ Frequently validate the precision of your model
  • 73. Step 6: Deploying Model ‣ Integrate model into development or quality assurance process ‣ Update model frequently (because change happens) ‣ Frequently validate the precision of your model. Careful with cross-project models! Many models depend highly on project data!
  • 74. State of the Art
  • 75. State of the Art
  • 76. Prediction Results (predicting Java classes; classification: has bugs / has no bugs). Training 2.0: tested on 2.0 → Precision 0.692, Recall 0.265, Accuracy 0.876; on 2.1 → 0.478, 0.191, 0.890; on 3.0 → 0.613, 0.171, 0.861. Training 2.1: on 2.0 → 0.664, 0.203, 0.870; on 2.1 → 0.668, 0.160, 0.900; on 3.0 → 0.717, 0.139, 0.864. Training 3.0: on 2.0 → 0.578, 0.277, 0.866; on 2.1 → 0.528, 0.220, 0.894; on 3.0 → 0.675, 0.224, 0.869.
  • 77. Prediction Results (same table as slide 76).
  • 78. Prediction Results (same table). Complexity causes defects!
  • 79. Prediction Results (same table). Complexity causes defects! But not all defects come from complexity!
  • 80. What to mine?
  • 81. What to mine? Code, Changes, Bug Reports, e-mail, Profiles, Traces, Effort, Specification, Chats, Tests, Navigation, Models
  • 82. Code, Changes, Bug Reports, e-mail, Profiles, Traces, Effort, Specification, Chats, Tests, Navigation, Models
  • 83. Data Mining Input Sources: Models, Specs, Code, Traces, Profiles, Tests, e-mail, Bugs, Effort, Navigation, Changes, Chats
  • 84. People who changed function f() also changed ....
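A minimal sketch of the co-change idea behind that question, assuming the version archive has been reduced to per-commit sets of changed entities; the transactions shown are made up.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: the set of entities changed in each commit.
transactions = [
    {"f()", "g()", "h()"},
    {"f()", "g()"},
    {"h()"},
    {"f()", "g()", "init()"},
]

# Count how often pairs of entities were changed in the same commit.
co_changes = Counter()
for changed in transactions:
    for pair in combinations(sorted(changed), 2):
        co_changes[pair] += 1

# "People who changed function f() also changed ..."
for (a, b), count in co_changes.most_common():
    if "f()" in (a, b):
        other = b if a == "f()" else a
        print(f"f() co-changed with {other} in {count} commits")
```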
  • 85. Which modules should I test most?
  • 86. How long will it take to fix this bug?
  • 87. Should I use design A or B?
  • 88. This requirement is risky!
  • 89. Assistance
  • 90. Assistance Future environments will •mine patterns from program + process •apply rules to make predictions •provide assistance in all development decisions •adapt advice to project history
  • 91. Empirical SE 2.0
  • 92. Wikis Joy of Use Participation Usability Recommendation Social Software Collaboration Perpetual Beta Simplicity Empirical SE 2.0 Trust Economy Remixability The Long Tail DataDriven
  • 93. Bachelor/Master Theses in software mining
  • 94. Summary
  • 95. Summary
  • 96. Summary
  • 97. Summary
  • 98. Summary
  • 99. Summary