SOME EXAM INFO
Exam Admittance



                            50%
 ROOM ...




    Questions or problems:
   kim@cs.uni-saarland.de
After-Exam Registration




    Not registered = no after-exam
   But please register only if you plan to participate
Exam Regulations

‣ Single-sided cheat sheet
‣ Bags to be left at entrance
‣ Student ID on desk
‣ Name + MatNr. on every sheet (incl. cheat sheet)
‣ Stick to one language
  ‣ per exercise
  ‣ (German or English)
‣ No dictionaries
  ‣ ask supervision
‣ Hand in exam & cheat sheet
‣ Additional paper only from supervision
Seminar on Code Modification at Runtime by Frank Padberg

    Topics
‣   Runtime optimization of byte code
‣   on-the-fly creation of classes
‣   self-modifying code
‣   ... AND MORE!

    Initial meeting (Vorbesprechung): July 22

    http://www.st.cs.uni-saarland.de/edu/codemod09/rcm09.html
Current Assignment




http://www.st.cs.uni-saarland.de/edu/se/2009/handouts/mutation_tyes.png
MINING SOFTWARE REPOSITORIES
  Software Engineering Course 2009
   Kim Herzig - Saarland University
Books




‣ Data Mining: Concepts and Techniques, by Jiawei Han & Micheline Kamber
‣ Data Mining: Practical Machine Learning Tools and Techniques, by Ian H. Witten & Eibe Frank
Imagine

You as Quality Manager. Your product:
‣ 30,000 classes
‣ ~5.5 million lines of code
‣ ~3,000 defects per release
‣ 700 developers
Your Boss

      Test the system!
You have 6 months, $500,000.
  And don’t miss any bug!
The Problem

‣   Not enough time to test everything
    ‣   What to test? What to test first?


‣   Not enough money to pay enough testers
    ‣   To what extent?


    Central question:
    Where are the most defect-prone entities in my system?
Your Testers
We need efficiency!
Can we learn from history?
 ... to predict or estimate the future?
data mining
What is data mining?
    Data mining is the process of discovering
    actionable information from large sets of data.
The Mining Model
    Defining the problem → Preparing data → Exploring data → Building models
    → Validating models → Deploying and updating models → (back to defining the problem)

     http://technet.microsoft.com/en-us/library/ms174949.aspx
Step 1: Defining Problem

‣   Clearly define the problem
    ‣ What are you looking for?
    ‣ Scope of problem
    ‣ Types of relationships

‣   Define how to evaluate
    ‣   Prediction, recommendation
        or just patterns
Defect Prediction Problem
Step 1: Define the problem
Step 2: Prepare Data
Step 3: Explore Data
Step 4: Building the Model
Step 5: Validating the Model

‣ Which source code entities should we test most?
‣ Which are the most defect-prone entities in my system?
‣ In the past, which entities had the most defects?
‣ Which properties of source code entities correlate with defects?
Data Sources

‣ Bug Database
‣ Version Archive
‣ Source Code

Derived from these sources:
‣ past defects per entity (quality)
‣ source code properties (metrics)
Data Sources: Heuristics

‣ Bug Database + Version Archive: past defects per entity (quality)
‣ Heuristic: “... commit messages that contain fix and bug id ...”
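
The heuristic quoted above can be sketched roughly as follows. This is a minimal sketch, not the tooling used in the lecture: the keyword list, the bug id notation (`#42` or `bug 42`), the function name `linked_bug_ids`, and the sample messages are all assumptions made for illustration.

```python
import re

# Assumed patterns: real projects use different keywords and id formats.
FIX_KEYWORD = re.compile(r"\b(fix(e[sd])?|bug|defect)\b", re.IGNORECASE)
BUG_ID = re.compile(r"(?:#|\bbug\s+)(\d+)", re.IGNORECASE)

def linked_bug_ids(message):
    """Return the bug ids referenced by a commit message that looks like a fix."""
    if not FIX_KEYWORD.search(message):
        return []
    return [int(number) for number in BUG_ID.findall(message)]

# Hypothetical commit messages.
messages = [
    "Fixed NPE in parser, see bug 4711",
    "Refactored build scripts",
    "fix #42: off-by-one in scheduler",
]
for message in messages:
    print(linked_bug_ids(message), message)
```

Each returned id could then be looked up in the bug database to count past defects per changed entity. Such a heuristic will miss fixes whose messages do not mention an id.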
Data Sources: Metrics

Source Code: source code properties (metrics)

  ‣   Complexity metrics
       ‣   McCabe, FanIn, FanOut, Couplings
       ‣   (see Lecture “Metrics and Estimation”)

  ‣   Time metrics
       ‣ How many changes
       ‣ How many different authors
       ‣ Age of code
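
The three time metrics can be derived from a change log, as in the sketch below. The log format (entity, author, date), the sample entries, and the definition of age as "days since the first recorded change" are assumptions for illustration only.

```python
from datetime import date

# Hypothetical change log: (entity, author, commit date).
changes = [
    ("Parser.java", "alice", date(2008, 1, 10)),
    ("Parser.java", "bob", date(2008, 6, 2)),
    ("Parser.java", "alice", date(2009, 3, 15)),
    ("Lexer.java", "carol", date(2009, 2, 1)),
]

def time_metrics(changes, today):
    """Number of changes, distinct authors and age in days, per entity."""
    collected = {}
    for entity, author, day in changes:
        entry = collected.setdefault(
            entity, {"changes": 0, "authors": set(), "first": day}
        )
        entry["changes"] += 1
        entry["authors"].add(author)
        entry["first"] = min(entry["first"], day)
    return {
        entity: {
            "changes": entry["changes"],
            "authors": len(entry["authors"]),
            # Assumed definition: age = days since the first recorded change.
            "age_days": (today - entry["first"]).days,
        }
        for entity, entry in collected.items()
    }

result = time_metrics(changes, today=date(2009, 7, 1))
for entity, metrics in result.items():
    print(entity, metrics)
```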
Step 2: Prepare Data

‣   Highly distributed data:
    ‣   Version repository, bug database, time trackers, ...

‣   Data integration
    ‣   Excel, CSV, SQL, ARFF, ...

‣   Data cleaning
    ‣   Missing values, noise, inter-correlations
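
Integration and cleaning might look like the sketch below: two CSV exports are joined per entity and rows with missing metric values are dropped. The column names, the sample data, and the cleaning policy (drop incomplete rows, count missing bug entries as 0 defects) are assumptions; other policies are equally plausible.

```python
import csv
import io

# Hypothetical exports from two sources, keyed by entity name.
metrics_csv = "entity,loc,mccabe\nParser.java,1200,35\nLexer.java,400,\nUtil.java,150,4\n"
bugs_csv = "entity,defects\nParser.java,7\nUtil.java,1\n"

def read_rows(text):
    """Parse CSV text into a dict of rows keyed by entity."""
    return {row["entity"]: row for row in csv.DictReader(io.StringIO(text))}

def integrate(metrics_text, bugs_text):
    """Join both sources per entity and apply the assumed cleaning policy."""
    bugs = read_rows(bugs_text)
    table = []
    for entity, row in read_rows(metrics_text).items():
        if row["loc"] == "" or row["mccabe"] == "":
            continue  # incomplete metrics: skip instead of guessing a value
        defects = int(bugs.get(entity, {}).get("defects", 0))
        table.append({
            "entity": entity,
            "loc": int(row["loc"]),
            "mccabe": int(row["mccabe"]),
            "defects": defects,
        })
    return table

table = integrate(metrics_csv, bugs_csv)
for row in table:
    print(row)
```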
Example Mining File

    [Figure: example mining file, labelled: entities, data points, output]

    Careful! Large files!
    E.g.: 5 million lines, 300 columns
Step 3: Explore Data
    You cannot validate the output if you don’t know the input

‣   Descriptive data summary
    ‣   max, min, mean, Pareto, distribution

‣   Data selection
    ‣   Relevance of data

‣   Data reduction
    ‣   aggregation, subset selection
Descriptive Data Summary

  ‣ How good can a prediction possibly be?
  ‣ Does it make sense to predict the top 20%?

    20% of entities contain 80% of defects
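
Whether a data set follows such a Pareto distribution can be checked with a few lines. The sketch below uses made-up defect counts; the function name `defect_share_of_top` is hypothetical.

```python
def defect_share_of_top(defects, fraction=0.2):
    """Share of all defects found in the top `fraction` of entities,
    with entities ranked by defect count in descending order."""
    ranked = sorted(defects, reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    total = sum(ranked)
    return sum(ranked[:top_n]) / total if total else 0.0

# Made-up defect counts for ten entities.
defects_per_entity = [40, 25, 5, 4, 3, 2, 1, 1, 0, 0]
share = defect_share_of_top(defects_per_entity)
print(f"top 20% of entities contain {share:.0%} of defects")
```

If the share is close to the fraction itself, defects are spread evenly and ranking entities gains little.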
Step 3: Explore Data

    Data sufficiency

‣   Maybe the data will not help to solve the problem
‣   Redefine problem
‣   Search for alternatives
‣   Access different data
Step 3: Explore Data

‣ Bug Database + Version Archive: past defects per entity (quality)
‣ Source Code: source code properties (metrics)

    Does complexity correlate with defects?

    YES!
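
One way to check such a correlation on your own data is a correlation coefficient. The sketch below computes Pearson's r by hand on made-up values; the choice of Pearson (rather than a rank correlation) is an assumption, not something the slides prescribe.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)

# Made-up values: McCabe complexity and past defects for five entities.
mccabe = [2, 5, 9, 14, 30]
defects = [0, 1, 1, 3, 6]
r = pearson(mccabe, defects)
print(f"r = {r:.2f}")
```

A value near 1 suggests a strong positive linear relationship; it says nothing about causation.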
Step 4: Build Model

‣   The mining model is only a container
    ‣ parameters and mining structure
    ‣ output value

‣   Now we need some statistics / machine learners
Building the Model

‣ Regression
  ‣ Predicting concrete, continuous values
  ‣ Difficult and very imprecise
  ‣ But desirable

‣ Classification
  ‣ Predicting class labels (e.g. more than X defects or not)
  ‣ Easier and more precise
  ‣ Vague information (how many defects in code?)
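
The difference between the two targets can be shown on the output column itself: regression keeps the defect count, classification reduces it to a label. A minimal sketch with made-up counts and a hypothetical threshold X = 2:

```python
def to_class_labels(defect_counts, threshold):
    """Map defect counts to class labels: True if more than `threshold` defects."""
    return {entity: count > threshold for entity, count in defect_counts.items()}

# Made-up regression targets (defects per entity).
defect_counts = {"Parser": 7, "Lexer": 2, "Util": 0}

# Classification targets for an assumed threshold X = 2.
labels = to_class_labels(defect_counts, threshold=2)
print(labels)
```

The labels no longer say how many defects an entity has, which is the vagueness mentioned above.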
Building the Model

‣ Rule-Based Classification
‣ Support Vector Machine
‣ Linear Regression
‣ Lazy Learners
‣ Decision Tree
‣ Bayesian Network
‣ Logistic Regression
Training and Testing

‣   Training set
    ‣ The data set to train the model
    ‣ Which columns correlate with output values?
    ‣ Which columns correlate with each other?

‣   Testing set
    ‣ A data set independent of the training data set
    ‣ Used to fine-tune the estimates of the model parameters
Training and Testing

 Random split of one DATA SET: training data (2/3), testing data (1/3)
+ Only one version needed
+ No overlaps between training and testing entities
- Does not reflect real life
- Which random set is the best one? (because they are all different)
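
A random 2/3 to 1/3 split can be sketched as follows. The function name `random_split` and the entity names are hypothetical; the optional seed only makes one particular split repeatable and does not answer which split is the best one.

```python
import random

def random_split(entities, train_fraction=2 / 3, seed=None):
    """Shuffle the entities and cut them into training and testing sets."""
    rng = random.Random(seed)
    shuffled = list(entities)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical entities of a single version.
entities = [f"Class{i}" for i in range(9)]
training, testing = random_split(entities, seed=1)
print(len(training), "training entities,", len(testing), "testing entities")
```

Running the split with different seeds and comparing the resulting models gives a feeling for how much the result depends on the random choice.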
Training and Testing

 Forward estimation
+ Reflects real life
+ Reproducible result
- Two versions needed

 DATA SET version N: training data
 DATA SET version N+1: testing data
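
Forward estimation needs no randomness: the split follows the release history. A minimal sketch, assuming one row per entity and version with a hypothetical `version` field and made-up values:

```python
# Made-up rows: one per (entity, version) with a metric and the defect label.
rows = [
    {"entity": "Parser", "version": "N", "mccabe": 35, "defect_prone": True},
    {"entity": "Lexer", "version": "N", "mccabe": 8, "defect_prone": False},
    {"entity": "Parser", "version": "N+1", "mccabe": 38, "defect_prone": True},
    {"entity": "Lexer", "version": "N+1", "mccabe": 9, "defect_prone": False},
]

def forward_split(rows, train_version, test_version):
    """Train on one release and test on the following one."""
    training = [row for row in rows if row["version"] == train_version]
    testing = [row for row in rows if row["version"] == test_version]
    return training, testing

training, testing = forward_split(rows, "N", "N+1")
print(len(training), "training rows,", len(testing), "testing rows")
```

Because the split is deterministic, repeating the experiment yields the same result.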
Step 4: Build Model

    training set → input → machine learner (black box) → output → Prediction Model
    testing set → input → Prediction Model → output → Prediction
Step 5: Validating Model

‣   Test data has same structure but different content

‣   Goal is to use model to correctly estimate output values

‣   Compare estimation with real values (fine tuning)
Evaluation

              Never predict concrete numbers!
               Because people will take them for real!
Evaluation

    [Figure: real defects per entity and predicted defects per entity, both sorted descending.
     Entities at the top of both lists are the correctly predicted defect-prone modules (true positives).]
Recall, Precision, Accuracy

                        Predict defects?
                        Yes               No
Real defects?   Yes     true positives    false negatives
                No      false positives   true negatives
Recall, Precision, Accuracy

    Precision = true positives / (true positives + false positives)

    Predicted defect-prone entities will be defect prone!
Recall, Precision, Accuracy

    Recall = true positives / (true positives + false negatives)

    All defect-prone entities get predicted as defect prone.
Recall, Precision, Accuracy

    Accuracy = (true positives + true negatives) / (true positives + true negatives + false negatives + false positives)

    The overall correctness
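
The three measures can be computed directly from the confusion matrix. The sketch below uses made-up labels for ten entities; the function names are hypothetical, and returning 0.0 for an empty denominator is an assumed convention.

```python
def confusion_counts(real, predicted):
    """Count TP, FP, FN, TN from two equally long lists of booleans."""
    pairs = list(zip(real, predicted))
    tp = sum(1 for r, p in pairs if r and p)
    fp = sum(1 for r, p in pairs if not r and p)
    fn = sum(1 for r, p in pairs if r and not p)
    tn = sum(1 for r, p in pairs if not r and not p)
    return tp, fp, fn, tn

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def accuracy(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0

# Made-up data: three of ten entities really have defects.
real = [True, True, True, False, False, False, False, False, False, False]
predicted = [True, False, False, True, False, False, False, False, False, False]

tp, fp, fn, tn = confusion_counts(real, predicted)
print("precision", precision(tp, fp))
print("recall", recall(tp, fn))
print("accuracy", accuracy(tp, fp, fn, tn))
```

When most entities are defect free, accuracy can look good even though recall is low, so the three values should be read together.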
Step 6: Deploying Model

‣   Integrate model into development or quality assurance process

‣   Update model frequently (because change happens)

‣   Frequently validate the precision of your model

    Careful with cross-project models!
    Many models depend heavily on project data!
State of the Art
Prediction Results
Training   Testing   Precision   Recall   Accuracy
  2.0        2.0      0.692      0.265     0.876
  2.0        2.1      0.478      0.191     0.890
  2.0        3.0      0.613      0.171     0.861
  2.1        2.0      0.664      0.203     0.870
  2.1        2.1      0.668      0.160     0.900
  2.1        3.0      0.717      0.139     0.864
  3.0        2.0      0.578      0.277     0.866
  3.0        2.1      0.528      0.220     0.894
  3.0        3.0      0.675      0.224     0.869

    Predicting Java classes. Classification: has bugs, has no bugs
Prediction Results

    Complexity causes defects!
    But not all defects come from complexity!
What to mine?
    Code, e-mail, Bug Reports, Changes, Profiles, Traces, Effort,
    Specification, Chats, Tests, Navigation, Models
Data Mining Input Sources
    Models, Specs, Code, Traces, Profiles, Tests,
    e-mail, Bugs, Effort, Navigation, Change, Chats

‣ People who changed function f() also changed ....
‣ Which modules should I test most?
‣ How long will it take to fix this bug?
‣ Should I use design A or B?
‣ This requirement is risky!
Assistance

Future environments will
 • mine patterns from program + process
 • apply rules to make predictions
 • provide assistance in all development decisions
 • adapt advice to project history
Empirical SE 2.0
    Wikis, Joy of Use, Participation, Usability, Recommendation, Social Software,
    Collaboration, Perpetual Beta, Simplicity, Trust, Economy, Remixability,
    The Long Tail, Data-Driven
Bachelor/Master Theses
  in software mining
Summary

Software Engineering Course 2009 - Mining Software Archives

  • 2. Exam Admittance 50% ROOM ... Questions or problems: kim@cs.uni-saarland.de
  • 3. After-Exam Registration: Not registered = no after-exam. Please register only if you plan to participate.
  • 4. Exam Regulations ‣ Single-sided cheat sheet ‣ Bags to be left at entrance ‣ Student ID on desk ‣ Name + MatNr. on every sheet (incl. cheat sheet) ‣ Stick to one language per exercise (German or English) ‣ No dictionaries (ask supervision) ‣ Hand in exam & cheat sheet ‣ Additional paper only from supervision
  • 5. Seminar on Code Modification at Runtime by Frank Padberg. Topics: ‣ Runtime optimization of byte code ‣ on-the-fly creation of classes ‣ self-modifying code ‣ ... AND MORE! Initial meeting: July 22. http://www.st.cs.uni-saarland.de/edu/codemod09/rcm09.html
  • 7. MINING SOFTWARE REPOSITORIES Software Engineering Course 2009 Kim Herzig - Saarland University
  • 8. Books: Data Mining: Concepts and Techniques by Jiawei Han & Micheline Kamber; Data Mining: Practical Machine Learning Tools and Techniques by Ian H. Witten & Eibe Frank
  • 10. Imagine: You as Quality Manager. Your product: ‣ 30,000 classes ‣ ~5.5 million lines of code ‣ ~3,000 defects per release ‣ 700 developers
  • 11. Your Boss Test the system! You have 6 months, $500,000. And don’t miss any bug!
  • 13. The Problem ‣ Not enough time to test everything ‣ What to test? What to test first? ‣ Not enough money to pay enough testers ‣ To what extent? Central question: Where are the most defect-prone entities in my system?
  • 19. Can we learn from history? ... to predict or estimate the future?
  • 21. What is data mining? Data mining is the process of discovering actionable information from large sets of data.
  • 22. The Mining Model: Defining the problem, Preparing data, Exploring data, Building models, Validating models, Deploying and updating models. http://technet.microsoft.com/en-us/library/ms174949.aspx
  • 23. Step 1: Defining the Problem ‣ Clearly define the problem ‣ What are you looking for? ‣ Scope of problem ‣ Types of relationships ‣ Define how to evaluate models ‣ Prediction, recommendation or just patterns
  • 24. Defect Prediction Problem (Step 1: Define the problem, Step 2: Prepare Data, Step 3: Explore Data, Step 4: Building the Model, Step 5: Validating the Model) Which source code entities should we test most?
  • 25. Defect Prediction Problem: Which source code entities should we test most? Which are the most defect-prone entities in my system?
  • 26. Defect Prediction Problem: Which source code entities should we test most? Which are the most defect-prone entities in my system? In the past, which entities had the most defects?
  • 27. Defect Prediction Problem: Which source code entities should we test most? Which are the most defect-prone entities in my system? In the past, which entities had the most defects? Which properties of source code entities correlate with defects?
  • 28. Data Sources: Bug Database, Version Archive, Source Code
  • 29. Data Sources: Bug Database and Version Archive: past defects per entity (quality); Source Code
  • 30. Data Sources: Bug Database and Version Archive: past defects per entity (quality); Source Code: source code properties (metrics)
  • 31. Data Sources: Heuristics. Bug Database and Version Archive: past defects per entity (quality), found via “... commit messages that contain fix and bug id ...”
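The heuristic on slide 31 can be sketched in a few lines of Python. This is only an illustration under assumptions: the keyword pattern, the `#1234` bug id notation, and the commit data are hypothetical, and a real project would adapt them to its own commit conventions and bug tracker.

```python
import re
from collections import Counter

# Assumed conventions (not from the slides): a fix-like keyword
# and bug ids written as "#<number>" in the commit message.
FIX_KEYWORD = re.compile(r"\b(fix(e[sd])?|bug)\b", re.IGNORECASE)
BUG_ID = re.compile(r"#(\d+)")


def bug_ids(message):
    """Return the bug ids of a commit message that looks like a fix."""
    if not FIX_KEYWORD.search(message):
        return []
    return [int(number) for number in BUG_ID.findall(message)]


def defects_per_file(commits):
    """Count, per file, how many fix commits touched it."""
    counts = Counter()
    for message, files in commits:
        if bug_ids(message):
            counts.update(files)
    return counts


# Hypothetical version archive excerpt: (commit message, changed files).
commits = [
    ("Fixed bug #4711 in parser", ["Parser.java", "Lexer.java"]),
    ("Refactoring, no functional change", ["Parser.java"]),
    ("fix #42: null check", ["Parser.java"]),
]
print(defects_per_file(commits))
```

A stricter variant would also check that each id exists in the bug database and is marked as resolved, which reduces false links.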
  • 32. Data Sources: Metrics. Source Code: source code properties (metrics) ‣ Complexity metrics: McCabe, FanIn, FanOut, Couplings (see Lecture “Metrics and Estimation”) ‣ Time metrics: how many changes, how many different authors, age of code
  • 33. Data Sources: Bug Database and Version Archive: past defects per entity (quality); Source Code: source code properties (metrics)
  • 34. Step 2: Prepare Data ‣ Highly distributed data: version repository, bug database, time trackers, ... ‣ Data integration: Excel, CSV, SQL, ARFF, ... ‣ Data cleaning: missing values, noise, inter-correlations
  • 37. Example Mining File ... entities data points
  • 38. Example Mining File ... entities data points output
  • 39. Example Mining File ... entities data points output. Large files! Careful! e.g.: 5 million lines, 300 columns
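A mining file of the shape shown on slides 37 to 39 (one row per entity, metric columns as data points, defect count as output) could be assembled roughly as follows. The metric names, the example values, and the choice to treat entities without linked bugs as having zero defects are assumptions made for this sketch.

```python
def build_mining_table(metrics, defects):
    """Join metric values with defect counts, one row per entity."""
    table = []
    for entity, values in sorted(metrics.items()):
        # Data cleaning: skip entities with missing metric values.
        if any(value is None for value in values.values()):
            continue
        row = {"entity": entity}
        row.update(values)                       # input columns (data points)
        # Assumption: no linked bug report means zero known defects.
        row["defects"] = defects.get(entity, 0)  # output column
        table.append(row)
    return table


# Hypothetical inputs from source code analysis and the bug database.
metrics = {
    "Parser.java": {"mccabe": 31, "changes": 12, "authors": 4},
    "Lexer.java": {"mccabe": 9, "changes": 3, "authors": 1},
    "Util.java": {"mccabe": None, "changes": 1, "authors": 1},
}
defects = {"Parser.java": 5}

table = build_mining_table(metrics, defects)
for row in table:
    print(row)
```

Dropping incomplete rows is the simplest cleaning strategy; imputing missing values is an alternative when too many entities would be lost.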
  • 40. Step 3: Explore Data. You cannot validate the output if you don’t know the input ‣ Descriptive data summary: max, min, mean, pareto, distribution ‣ Data selection: relevance of data ‣ Data reduction: aggregation, subset selection
  • 41. Descriptive Data Summary ‣ How good can a prediction possibly be? ‣ Does it make sense to predict the top 20%? (20% of entities contain 80% of defects)
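Whether a data set actually follows the 20/80 pattern from slide 41 is easy to check before any model is built. The function below is a small sketch; the defect counts are made up for illustration.

```python
def defect_share_of_top(defect_counts, fraction=0.2):
    """Share of all defects found in the most defective `fraction` of entities."""
    ranked = sorted(defect_counts, reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    total = sum(ranked)
    return sum(ranked[:top_n]) / total if total else 0.0


# Hypothetical defect counts for ten entities.
defect_counts = [40, 25, 5, 4, 3, 2, 1, 0, 0, 0]
share = defect_share_of_top(defect_counts)
print(f"Top 20% of entities contain {share:.0%} of defects")
```

The result is also an upper bound for any ranking-based prediction: no model can find more defects in its top 20% than the real top 20% contain.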
  • 42. Step 3: Explore Data. Data sufficiency ‣ Maybe the data will not help to solve the problem ‣ Redefine problem ‣ Search for alternatives ‣ Access different data
  • 44. Step 3: Explore Data. Bug Database and Version Archive: past defects per entity (quality); Source Code: source code properties (metrics). Does complexity correlate with defects?
  • 45. Step 3: Explore Data. Bug Database and Version Archive: past defects per entity (quality); Source Code: source code properties (metrics). Does complexity correlate with defects? YES!
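One way to answer the correlation question for a concrete data set is a correlation coefficient between a metric column and the defect column. The sketch below uses Pearson's coefficient as an assumption; the slides do not say which measure was used, and a rank-based measure such as Spearman's is often preferred for skewed defect data. The numbers are invented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(variance_x * variance_y)


# Hypothetical per-entity values: McCabe complexity and past defects.
complexity = [2, 5, 9, 14, 20]
defects = [0, 1, 1, 3, 5]
r = pearson(complexity, defects)
print(f"r = {r:.3f}")
```

A value close to 1 suggests the metric is a useful input column; a value near 0 suggests looking for other properties.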
  • 46. Step 4: Build Model ‣ The mining model is only a container ‣ parameters and mining structure ‣ output value ‣ Now we need some statistics / machine learners
  • 47. Example Mining File ... entities data points output
  • 48. Building the Model ‣ Regression ‣ Predicting concrete, continuous values ‣ Difficult and very imprecise ‣ But desirable ‣ Classification ‣ Predicting class labels (e.g. more than X defects or not) ‣ Easier and more precise ‣ Vague information (how many defects in code?)
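The difference between the two problem types shows up in the output column. As a minimal sketch, the function below turns regression targets (defect counts) into classification targets (more than X defects or not); the threshold and the example counts are hypothetical.

```python
def to_class_labels(defect_counts, threshold):
    """Map each entity to True if it has more than `threshold` defects."""
    return {entity: count > threshold
            for entity, count in defect_counts.items()}


# Hypothetical defect counts per entity (the regression target).
defect_counts = {"Parser.java": 5, "Lexer.java": 1, "Util.java": 0}

# Classification target: "more than 1 defect or not".
labels = to_class_labels(defect_counts, threshold=1)
print(labels)
```

The labels are easier to predict, but they no longer say how many defects an entity has, which is the vagueness the slide points out.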
  • 49. Building the Model
  • 50. Building the Model: Rule-Based Classification, Support Vector Machine, Linear Regression, Lazy Learners, Bayesian Network, Decision Tree, Logistic Regression
  • 51. Training and Testing ‣ Training set ‣ The data set to train the model ‣ Which columns correlate with output values? ‣ Which columns correlate with each other? ‣ Testing set ‣ A data set independent of the training data set ‣ used to fine-tune the estimates of the model parameters
  • 52. Training and Testing: Random split of the DATA SET + Only one version needed + No overlaps between training and testing entities - Does not reflect real life - Which random set is the best one? (because they are all different)
  • 53. Training and Testing: Random split of the DATA SET into training data (2/3) and testing data (1/3) + Only one version needed + No overlaps between training and testing entities - Does not reflect real life - Which random set is the best one? (because they are all different)
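A random 2/3 to 1/3 split can be sketched as follows. The entity names are placeholders, and the fixed seed is an assumption added here so that the split can be reproduced; a different seed gives a different split, which is exactly the "which random set is the best one?" drawback named on the slide.

```python
import random


def random_split(entities, train_fraction=2 / 3, seed=0):
    """Shuffle the entities and cut them into training and testing sets."""
    shuffled = list(entities)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical entity names from a single version of the system.
entities = [f"Class{i}" for i in range(9)]
training, testing = random_split(entities)
print(len(training), "training entities,", len(testing), "testing entities")
```

Repeating the split with several seeds and averaging the results is a common way to reduce the dependence on one particular random set.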
  • 54. Training and Testing Random split + Only one version needed + No overlaps between DATA SET training and testing entities - Does not reflect real life - Which random set is the best one? (because they are all training data (2/3) different) testing data (1/3)
  • 55. Training and Testing: forward estimation, with data set version N as training data and data set version N+1 as testing data + Reflects real life + Reproducible result - Two versions needed
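The two strategies, random split and forward estimation, can be sketched as follows. This is an illustration under assumptions: the function names, the `seed` parameter, and the entity names are hypothetical, and the 2/3 training fraction is taken from the slide.

```python
import random


def random_split(entities, train_fraction=2 / 3, seed=0):
    """Random split: shuffle one data set and cut it into training and testing parts."""
    rng = random.Random(seed)  # a different seed gives a different split
    shuffled = list(entities)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def forward_split(data_by_version, train_version, test_version):
    """Forward estimation: train on version N, test on version N+1."""
    return data_by_version[train_version], data_by_version[test_version]


# Hypothetical entities of a single version.
entities = [f"Class{i}" for i in range(9)]
train, test = random_split(entities)

# Hypothetical data sets of two consecutive versions.
data_by_version = {"N": ["A", "B", "C"], "N+1": ["A", "B", "C", "D"]}
train_fw, test_fw = forward_split(data_by_version, "N", "N+1")
```

The random split depends on the seed, which is the "which random set is the best one?" problem; the forward split is deterministic but needs two versions.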
  • 56-63. Step 4: Build Model ‣ The training set is the input to the machine learner (black box) ‣ The learner's output is the prediction model ‣ The testing set is the input to the prediction model ‣ The model's output is the prediction
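The flow from training set to prediction can be sketched with a deliberately simple learner. The slides treat the learner as a black box; the one-metric threshold learner below (a decision stump) is a stand-in chosen for illustration, and all data values are hypothetical.

```python
def train_stump(training_set):
    """Learner: pick the threshold on one metric that classifies the training set best.

    training_set is a list of (metric_value, has_defects) pairs.
    The returned threshold is the prediction model.
    Only observed metric values are tried as thresholds.
    """
    best_threshold, best_correct = None, -1
    for threshold in sorted({metric for metric, _ in training_set}):
        correct = sum((metric > threshold) == label for metric, label in training_set)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def predict(threshold, metric_values):
    """Prediction model: an entity is defect prone if its metric exceeds the threshold."""
    return [metric > threshold for metric in metric_values]


# Training set (input to the learner): hypothetical complexity values and labels.
training_set = [(2, False), (5, False), (8, False), (12, True), (20, True), (30, True)]
model = train_stump(training_set)

# Testing set (input to the prediction model): complexity values only.
prediction = predict(model, [3, 15])
```

Any of the learners named on the earlier slide could take the place of `train_stump`; the surrounding flow stays the same.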
  • 64. Step 5: Validating Model ‣ Test data has the same structure but different content ‣ Goal is to use the model to correctly estimate output values ‣ Compare estimation with real values (fine tuning)
  • 65. Evaluation
  • 66. Evaluation ‣ Never predict concrete numbers, because people will take them for real!
  • 67. Evaluation ‣ Entities sorted descending by real defects per entity and by predicted defects per entity
  • 68. Evaluation ‣ Real defects per entity vs. predicted defects per entity ‣ Correctly predicted defect-prone modules (true positives)
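One way to read the two sorted lists is to compare their top entries: an entity that appears near the top of both the real and the predicted ranking is a correctly predicted defect-prone module. The sketch below assumes a cut-off `k` and hypothetical values; the slides do not state how the cut-off was chosen.

```python
def top_k(values_by_entity, k):
    """Return the k entities with the highest values (sorted descending)."""
    ranked = sorted(values_by_entity, key=values_by_entity.get, reverse=True)
    return set(ranked[:k])


def top_k_hits(real_defects, predicted_defects, k):
    """Entities in the top k of both rankings (true positives at cut-off k)."""
    return top_k(real_defects, k) & top_k(predicted_defects, k)


# Hypothetical real and predicted defects per entity.
real_defects = {"A": 9, "B": 5, "C": 0, "D": 2}
predicted_defects = {"A": 7.5, "B": 1.0, "C": 0.5, "D": 3.0}

hits = top_k_hits(real_defects, predicted_defects, k=2)
```

Only the ranking is compared here, not the predicted numbers themselves, which fits the advice not to report concrete numbers.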
  • 69. Recall, Precision, Accuracy ‣ Real defects and predicted defects: true positives ‣ Real defects but none predicted: false negatives ‣ No real defects but defects predicted: false positives ‣ No real defects and none predicted: true negatives
  • 70. Recall, Precision, Accuracy ‣ Precision = true positives / (true positives + false positives) ‣ Predicted defect-prone entities will be defect prone!
  • 71. Recall, Precision, Accuracy ‣ Recall = true positives / (true positives + false negatives) ‣ All defect-prone entities get predicted as defect prone.
  • 72. Recall, Precision, Accuracy ‣ Accuracy = (true positives + true negatives) / (true positives + true negatives + false negatives + false positives) ‣ The overall correctness
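The three measures follow directly from the four cells of the confusion matrix. A minimal sketch with hypothetical labels (the function names are illustrative, and the zero-denominator case is not handled):

```python
def confusion_counts(real, predicted):
    """Count true positives, false positives, false negatives, true negatives."""
    pairs = list(zip(real, predicted))
    tp = sum(1 for r, p in pairs if r and p)
    fp = sum(1 for r, p in pairs if not r and p)
    fn = sum(1 for r, p in pairs if r and not p)
    tn = sum(1 for r, p in pairs if not r and not p)
    return tp, fp, fn, tn


def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    return tp / (tp + fn)


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + tn + fn + fp)


# Hypothetical entities: True means "has defects".
real = [True, True, True, False, False, False, False, False]
predicted = [True, False, False, True, False, False, False, False]

tp, fp, fn, tn = confusion_counts(real, predicted)
p = precision(tp, fp)          # 1 / (1 + 1)
r = recall(tp, fn)             # 1 / (1 + 2)
a = accuracy(tp, fp, fn, tn)   # (1 + 4) / 8
```

In this toy data accuracy is higher than recall because most entities have no defects, so the many true negatives dominate the accuracy figure.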
  • 73. Step 6: Deploying Model ‣ Integrate model into development or quality assurance process ‣ Update model frequently (because change happens) ‣ Frequently validate the precision of your model
  • 74. Step 6: Deploying Model ‣ Careful with cross-project models! ‣ Many models are highly dependent on project data!
  • 77-80. Prediction Results (predicting Java classes; classification: has bugs, has no bugs)
      Training  Testing  Precision  Recall  Accuracy
      2.0       2.0      0.692      0.265   0.876
      2.0       2.1      0.478      0.191   0.890
      2.0       3.0      0.613      0.171   0.861
      2.1       2.0      0.664      0.203   0.870
      2.1       2.1      0.668      0.160   0.900
      2.1       3.0      0.717      0.139   0.864
      3.0       2.0      0.578      0.277   0.866
      3.0       2.1      0.528      0.220   0.894
      3.0       3.0      0.675      0.224   0.869
      ‣ Complexity causes defects!
      ‣ But not all defects come from complexity!
  • 82-83. What to mine? Code, e-mail, bug reports, changes, profiles, traces, effort, specification, chats, tests, navigation, models
  • 84. Data Mining Input Sources: models, specs, code, traces, profiles, tests, e-mail, bugs, effort, navigation, changes, chats
  • 85. People who changed function f() also changed ....
  • 86. Which modules should I test most?
  • 87. How long will it take to fix this bug?
  • 88. Should I use design A or B?
  • 89. This requirement is risky!
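The first of these questions, "People who changed function f() also changed ....", can be approached by counting which entities were changed together in the version archive. The sketch below assumes each commit is available as a list of changed entities; the function names and the commit data are hypothetical, and confidence is just one possible ranking measure.

```python
from collections import Counter
from itertools import combinations


def co_change_counts(transactions):
    """Count how often each entity and each pair of entities was changed."""
    item_counts = Counter()
    pair_counts = Counter()
    for changed in transactions:
        items = sorted(set(changed))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    return item_counts, pair_counts


def also_changed(entity, item_counts, pair_counts):
    """Return {other: confidence} for the rules "entity changed => other changed"."""
    result = {}
    for (a, b), together in pair_counts.items():
        if entity == a:
            result[b] = together / item_counts[entity]
        elif entity == b:
            result[a] = together / item_counts[entity]
    return result


# Hypothetical commits, each listing the functions changed together.
transactions = [["f", "g"], ["f", "g", "h"], ["f", "h"], ["g"]]
item_counts, pair_counts = co_change_counts(transactions)
rules_for_f = also_changed("f", item_counts, pair_counts)
rules_for_h = also_changed("h", item_counts, pair_counts)
```

Here `rules_for_f` says that in two of the three commits touching `f`, `g` was changed as well, and likewise for `h`.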
  • 91. Assistance: Future environments will ‣ mine patterns from program + process ‣ apply rules to make predictions ‣ provide assistance in all development decisions ‣ adapt advice to project history
  • 93. Empirical SE 2.0: Wikis, Joy of Use, Participation, Usability, Recommendation, Social Software, Collaboration, Perpetual Beta, Simplicity, Trust, Economy, Remixability, The Long Tail, Data-Driven
  • 94. Bachelor/Master Theses in software mining