Introduction
                            Data Mining
                      And the results are
A vision over the present and the future




             Mining Software Repositories
                            Improving software


                             Pere Urbón Bayes

                         Data Management Group
                    Dept. Arquitectura de Computadors
                    Universitat Politècnica de Catalunya
                              purbon@ac.upc.edu

                                 July of 2009



                       Pere Urbón Bayes            Mining Software Repositories


Index




        Introduction
        Data Mining
        The results
        The future






The problem



     Companies need to own highly available and reliable software.
     Low-quality software harms both clients and producers.
     Unfortunately, avoiding defects is a difficult task.

     Project leaders need to keep an eye on many projects.
     Software engineers tend not to document software in depth.
     The complexity of software projects is growing every day.






The software development process






Support tools


  Tools used to support software development:
      Version Control server.
      Bug Tracker server.
      Project Management server.
      Life cycle management software.
      ...

  These tools store a huge amount of information during the
  process. Why not use this information to improve our software?






Objectives and Applications


  Objectives:
      Analyse the application of data mining techniques to data stored
      in support tools, with the aim of improving software quality.
      Develop an experimental prototype tool.
  Applications:
      Reduce the error rate.
      Provide a previously unexploited source of documentation.
      Provide a new source of support tools for IDEs.






Data mining



  Type of database analysis that attempts to discover useful patterns
  or relationships in a group of data. The analysis uses advanced
  statistical methods, such as cluster analysis, and sometimes
  employs artificial intelligence or neural network techniques. A
  major goal of data mining is to discover previously unknown
  relationships among the data, especially when the data come from
  different databases.






Methods


  Types:
      Traditional data mining (K-Means, C4.5, Bayesian networks).
      Relational data mining (ILP, Markov logic networks,
      relational Bayesian methods, dependency networks).
  Categories:
      Clusterers.
      Classifiers.
      Association rules.
      Network models.
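
  To make the "clusterer" category concrete, here is a minimal pure-Python
  sketch of K-Means on one-dimensional data. The function name `kmeans_1d`,
  the lines-of-code values and the starting centroids are all hypothetical,
  chosen only for illustration; the prototype itself relies on Weka rather
  than on code like this.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Very small K-Means for 1-D data (illustrative sketch only)."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Hypothetical lines-of-code counts for six files.
loc_per_file = [120, 150, 180, 900, 950, 1000]
centroids, clusters = kmeans_1d(loc_per_file, centroids=[100, 500])
print(centroids)  # [150.0, 950.0]
print(clusters)   # [[120, 150, 180], [900, 950, 1000]]
```

  With fixed starting centroids the result is deterministic; real
  implementations usually pick the starting points at random and stop once
  the assignments no longer change.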





Issue detection

   LOC                          DefectAppearence2Month                      RevisionsAuthor
   LineAddedIRLAdd              ReportedI2Month                             Revision2Month
   LineAddedIRLDel              Revision3Month                              Releases
   AlterType                    DefectAppearence3Month                      ReportedI1Month
   AgeMonths                    ReportedI3Month                             ReportedIssues
   RevisionAge                  Revision5Month                              ReportedI5Month
   DefectReleases               DefectAppearence5Month
   Revision1Month               DefectAppearance1Month

  Question: Does this file contain an undetected error? The exact
  number of errors can be predicted too.
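
  The slide does not define how the attributes above are computed. As a
  rough sketch, assuming that names such as Revision1Month, Revision3Month
  and Revision5Month mean "number of revisions of the file in the last N
  months" (an assumption, not something the talk states), such features
  could be derived from commit dates like this. The helper
  `revisions_in_window`, the 30-day month and the sample dates are
  hypothetical.

```python
from datetime import date, timedelta


def revisions_in_window(revision_dates, reference, months):
    """Count revisions in the `months` * 30 days before `reference`.

    Assumption: one month is approximated as 30 days.
    """
    start = reference - timedelta(days=30 * months)
    return sum(1 for d in revision_dates if start < d <= reference)


# Hypothetical commit dates for a single file.
dates = [date(2009, 7, 10), date(2009, 6, 1),
         date(2009, 3, 20), date(2008, 12, 1)]
reference = date(2009, 7, 21)

# Feature names follow the slide; their meaning here is assumed.
features = {
    "Revision1Month": revisions_in_window(dates, reference, 1),
    "Revision3Month": revisions_in_window(dates, reference, 3),
    "Revision5Month": revisions_in_window(dates, reference, 5),
}
print(features)
# {'Revision1Month': 1, 'Revision3Month': 2, 'Revision5Month': 3}
```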





Other types of objectives

      Predict bugs related to a software developer.
      Predict bugs in software components.
  These techniques could be applied to different topics:
      Software understanding.
      Software evolution.
      Software visualization.
      Change propagation.
      Impact analysis.
      Software complexity.
      Fault prediction.



Error prediction

                                 Eclipse Project                  Firefox Project
   Correctly classified          94.65%                           94.822%
   Kappa statistic               0.893                            0.8883
   Precision                     0.9465                           0.9482
   Recall                        0.945                            0.949
   AUC ROC                       0.9682                           0.9808

                                 Eclipse-Firefox                  Firefox-Eclipse
   Correctly classified          82.0065%                         87.975%
   Kappa statistic               0.5976                           0.7595
   Precision                     0.818                            0.894
   Recall                        0.82                             0.88
   AUC ROC                       0.805                            0.83
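
  For readers unfamiliar with the measures in the table, the sketch below
  shows how accuracy ("correctly classified"), precision, recall and
  Cohen's kappa follow from a two-class confusion matrix. The counts are
  invented and do not come from the Eclipse or Firefox experiments. The
  function `binary_metrics` is hypothetical, and it reports precision and
  recall for the positive class only, whereas the figures above may be
  averaged over both classes.

```python
def binary_metrics(tp, fp, fn, tn):
    """Metrics from a 2x2 confusion matrix (no zero-division guards)."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Agreement expected by chance, from the row and column marginals.
    p_yes = ((tp + fp) / total) * ((tp + fn) / total)
    p_no = ((fn + tn) / total) * ((fp + tn) / total)
    expected = p_yes + p_no
    kappa = (accuracy - expected) / (1 - expected)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "kappa": kappa}


# Hypothetical counts: 100 files, 50 defective and 50 clean.
metrics = binary_metrics(tp=45, fp=5, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

  Kappa corrects accuracy for chance agreement, which is why it sits below
  the "correctly classified" percentage in every column of the table.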




The final application






The Prototype


  Software being used:
      Programming: Java.
      Database: MySQL and MonetDB.
      Data mining: Weka 3.6 and Proximity 4.3.
      XML: Apache Xerces 2.9.1.
      SVN, CVS: SVNKit 1.3.0 for SVN; for CVS, the netbeans-cvs
      library and a custom RCS file parser.
      Presentation: Prefuse Visualization Toolkit and Weka
      drawing facilities.





Could Python give us the same?
  Machine Learning:
      Orange: as of version 1.0, this library offers many interesting
      and useful methods for classification, regression and
      clustering. The most similar to Weka.
      PyML: only offers classification facilities.
      Shogun: only for Support Vector Machines.
      RPy: an interface to R.
  Databases:
      The most important relational databases are available via
      DB-API.
            ZODB: Zope Object Database.
            Metakit: an embedded database without a clearly defined paradigm.
            Pygr: Python graph database framework for bioinformatics.
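
  As an illustration of the DB-API style of access mentioned above, here
  is a minimal sketch using `sqlite3` from the standard library in place
  of MySQL or MonetDB, so that it runs without a server. The table
  `file_metrics`, its columns and the rows are hypothetical and only
  loosely inspired by the attributes shown earlier.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical schema holding a few per-file metrics.
cur.execute(
    "CREATE TABLE file_metrics ("
    "path TEXT PRIMARY KEY, loc INTEGER, "
    "revisions INTEGER, reported_issues INTEGER)"
)

rows = [
    ("src/Main.java", 1200, 45, 7),
    ("src/Util.java", 300, 5, 0),
    ("src/Parser.java", 800, 60, 3),
]
cur.executemany("INSERT INTO file_metrics VALUES (?, ?, ?, ?)", rows)
conn.commit()

# Files with at least one reported issue, most frequently changed first.
cur.execute(
    "SELECT path FROM file_metrics "
    "WHERE reported_issues > ? ORDER BY revisions DESC",
    (0,),
)
risky = [row[0] for row in cur.fetchall()]
conn.close()

print(risky)  # ['src/Parser.java', 'src/Main.java']
```

  Other DB-API drivers follow the same connect, cursor, execute and fetch
  pattern, although the placeholder style (`?` here) differs between
  drivers.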



Could Python give us the same?

  Presentation:
      Graph drawing: NetworkX, with nice results. There are some
      others, but they look incomplete.
      GUI: PyQt, wxWindows, PyGTK. It's a matter of taste XD!
  SVN, CVS processing:
      SVN: pysvn - Python interface to Subversion.
      CVS: It seems nothing is available.
      Git: PyGit - Pythonic git bindings targeted towards
      porcelains.
  XML processing can be done using built-in support or with any
  SAX or DOM parser.
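
  As a sketch of the built-in XML support, the example below parses a
  small log in the format produced by `svn log --xml -v` using
  `xml.etree.ElementTree` from the standard library and counts how many
  revisions touched each file. The log content, the authors and the
  function name `revisions_per_file` are invented for illustration; this
  is not how the prototype reads repositories (it uses SVNKit from Java).

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical excerpt in the shape of `svn log --xml -v` output.
SAMPLE_LOG = """<log>
<logentry revision="3">
<author>alice</author>
<date>2009-07-01T10:00:00.000000Z</date>
<paths>
<path action="M">/trunk/src/Main.java</path>
<path action="A">/trunk/src/Util.java</path>
</paths>
<msg>Fix parser bug</msg>
</logentry>
<logentry revision="2">
<author>bob</author>
<date>2009-06-15T09:30:00.000000Z</date>
<paths>
<path action="M">/trunk/src/Main.java</path>
</paths>
<msg>Refactor</msg>
</logentry>
</log>"""


def revisions_per_file(log_xml):
    """Return a Counter mapping each changed path to its revision count."""
    root = ET.fromstring(log_xml)
    counts = Counter()
    for entry in root.iter("logentry"):
        for path in entry.iter("path"):
            counts[path.text] += 1
    return counts


counts = revisions_per_file(SAMPLE_LOG)
print(counts["/trunk/src/Main.java"])  # 2
print(counts["/trunk/src/Util.java"])  # 1
```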



The future

  Known issues:
      Data preprocessing performance.
      Database performance: is the relational model suitable?
      Dynamic procedure addition.
  The to-do list:
      Develop new procedures for related topics, such as
      software visualization, change support, etc.
      Develop more mature software. Python could help in some
      parts. The software must be easily extensible.
      Improve the performance of the whole process.




The end



                                         Questions?


                               Pere Urbón Bayes
                           Data Management Group
                      Dept. Arquitectura de Computadors
                      Universitat Politècnica de Catalunya
                                 purbon@ac.upc.edu




Python Meetup Talk 21072009

  • 1. Introduction Data Mining And the results are A vision over the present and the future Mining Software Repositories Improving software Pere UrbĀ“n Bayes o Data Management Group Dept. Arquitectura de Computadors Universitat Polit`cnica de Catalunya e purbon@ac.upc.edu July of 2009 Pere UrbĀ“n Bayes o Mining Software Repositories
  • 2. Index
       Introduction
       Data Mining
       The results
       The future
  • 3. The problem
       Companies need highly available and reliable software.
       Low-quality software harms both clients and producers.
       Unfortunately, avoiding defects is a difficult task.
       Project leaders need to keep an eye on many projects.
       Software engineers tend not to document software in depth.
       The complexity of software projects is growing every day.
  • 4. The software development process [figure]
  • 5. Support tools
       Tools used to support software development:
         Version control server.
         Bug tracker server.
         Project management server.
         Life cycle management software.
         ...
       These tools store a huge amount of information during the process. Why not use this information to improve our software?
  • 6. Objective and Applications
       Objectives:
         Analyse the application of data mining technology to the data stored in support tools, with the aim of improving software quality.
         Develop an experimental prototype tool.
       Applications:
         Reduce the error rate.
         Provide an unexploited source of documentation.
         Provide a new source of support tools for IDEs.
  • 7. Data mining
       Type of database analysis that attempts to discover useful patterns or relationships in a group of data. The analysis uses advanced statistical methods, such as cluster analysis, and sometimes employs artificial intelligence or neural network techniques. A major goal of data mining is to discover previously unknown relationships among the data, especially when the data come from different databases.
  • 8. Methods
       Types:
         Traditional data mining (K-Means, C4.5, Bayesian networks).
         Relational data mining (ILP, Markov logic networks, relational Bayesian methods, dependency networks).
       Categories:
         Clusterers
         Classifiers
         Association rules
         Network models
  • 9. Data mining
       Type of database analysis that attempts to discover useful patterns or relationships in a group of data. The analysis uses advanced statistical methods, such as cluster analysis, and sometimes employs artificial intelligence or neural network techniques. A major goal of data mining is to discover previously unknown relationships among the data, especially when the data come from different databases.
  • 10. Issue detection
       Attributes: LOC, DefectAppearence2Month, RevisionsAuthor, LineAddedIRLAdd, ReportedI2Month, Revision2Month, LineAddedIRLDel, Revision3Month, Releases, AlterType, DefectAppearence3Month, ReportedI1Month, AgeMonths, ReportedI3Month, ReportedIssues, RevisionAge, Revision5Month, ReportedI5Month, DefectReleases, DefectAppearence5Month, Revision1Month, DefectAppearance1Month.
       Question: does this file contain an undetected error? The exact number of errors can be predicted too.
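A minimal sketch of how per-file attributes like the ones on the Issue detection slide might be derived from version control history. The talk does not define the attributes, so the meanings used here are assumptions: `RevisionNMonth` is taken as the number of revisions of a file in the last N months (a month approximated as 30 days) and `RevisionsAuthor` as the number of distinct authors. The commit records and the function name `revision_features` are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical commit records: (file path, author, commit time).
commits = [
    ("src/Parser.java", "alice", datetime(2009, 6, 30)),
    ("src/Parser.java", "bob", datetime(2009, 5, 25)),
    ("src/Parser.java", "alice", datetime(2009, 3, 15)),
    ("src/Lexer.java", "bob", datetime(2009, 7, 1)),
]


def revision_features(commits, path, now, windows=(1, 2, 3, 5)):
    """Build revision-count attributes for one file.

    Assumed semantics (not confirmed by the slides):
    RevisionNMonth  = revisions of `path` in the last N * 30 days,
    RevisionsAuthor = number of distinct authors who touched `path`.
    """
    times = [t for p, _, t in commits if p == path]
    authors = {a for p, a, _ in commits if p == path}
    features = {"RevisionsAuthor": len(authors)}
    for months in windows:
        cutoff = now - timedelta(days=30 * months)
        features[f"Revision{months}Month"] = sum(1 for t in times if t >= cutoff)
    return features


features = revision_features(commits, "src/Parser.java", datetime(2009, 7, 21))
print(features)
```

One such dictionary per file, plus a label saying whether a defect was later reported for it, would give the kind of table a classifier can be trained on.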
  • 11. Other types of objectives
       Predict bugs related to a software developer.
       Predict bugs in software components.
       These techniques could be applied to different topics:
         Software understanding.
         Software evolution.
         Software visualization.
         Change propagation.
         Impact analysis.
         Software complexity.
         Fault prediction.
  • 12. Error prediction

                               Eclipse Project    Firefox Project
       Correctly classified    94.65%             94.822%
       Kappa statistic         0.893              0.8883
       Precision               0.9465             0.9482
       Recall                  0.945              0.949
       AUC ROC                 0.9682             0.9808

                               Eclipse-Firefox    Firefox-Eclipse
       Correctly classified    82.0065%           87.975%
       Kappa statistic         0.5976             0.7595
       Precision               0.818              0.894
       Recall                  0.82               0.88
       AUC ROC                 0.805              0.83
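For readers unfamiliar with the measures on the Error prediction slide, the following sketch shows how accuracy ("correctly classified"), precision, recall and Cohen's kappa can be computed from a binary confusion matrix. The counts are made up for illustration and are not the talk's data. The slides do not say how their figures were aggregated; they may be weighted averages over both classes, whereas this sketch reports precision and recall for the positive class only.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute common classifier metrics from a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total   # share of correctly classified instances
    precision = tp / (tp + fp)     # of the predicted positives, how many are real
    recall = tp / (tp + fn)        # of the real positives, how many were found
    # Agreement expected by chance, from the marginal totals.
    chance_pos = ((tp + fp) / total) * ((tp + fn) / total)
    chance_neg = ((fn + tn) / total) * ((fp + tn) / total)
    expected = chance_pos + chance_neg
    # Cohen's kappa: agreement beyond chance, scaled to at most 1.
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "kappa": kappa,
    }


# Hypothetical counts: 45 defective files found, 5 missed,
# 5 false alarms, 45 clean files correctly left alone.
metrics = binary_metrics(tp=45, fp=5, fn=5, tn=45)
print(metrics)
```

Kappa is the reason accuracy alone is not enough: a classifier that agrees with the labels no more often than chance would score a kappa near 0 even when its accuracy looks high on unbalanced data.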
  • 13. The end App
  • 14. The Prototype
       Software being used:
         Programming: Java.
         Database: MySQL and MonetDB.
         Data mining: Weka 3.6 and Proximity 4.3.
         XML: Apache Xerces 2.9.1.
         SVN, CVS: svnkit 1.3.0 for SVN; for CVS, the netbeans-cvs lib and a custom RCS file parser.
         Presentation: Prefuse Visualization Toolkit and Weka drawing facilities.
  • 15. Could Python give us the same?
       Machine learning:
         Orange: as of 1.0 this library has many interesting and useful methods for classification, regression and clustering. The most similar to Weka.
         PyML: only has classifier facilities.
         Shogun: only for support vector machines.
         RPy: an interface to R.
       Databases:
         The most important relational databases are available via DB-API.
         ZODB: Zope Object Database.
         Metakit: an embedded database with no defined paradigm.
         Pygr: Python graph database framework for bioinformatics.
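As an illustration of the DB-API point above, here is a minimal sketch that stores revision records in a relational table and counts revisions per file. It uses the standard library's `sqlite3` module only because it needs no server; the talk's prototype used MySQL and MonetDB. The table layout and sample rows are hypothetical, and the `?` placeholder style is specific to drivers such as `sqlite3` (other DB-API drivers may use a different parameter style).

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical schema: one row per (file, revision).
cur.execute("CREATE TABLE revision (path TEXT, author TEXT, revision INTEGER)")

rows = [
    ("src/Lexer.java", "bob", 3),
    ("src/Parser.java", "alice", 1),
    ("src/Parser.java", "bob", 2),
]
cur.executemany("INSERT INTO revision VALUES (?, ?, ?)", rows)
conn.commit()

# Number of revisions per file, a typical input attribute for mining.
cur.execute("SELECT path, COUNT(*) FROM revision GROUP BY path ORDER BY path")
counts = cur.fetchall()
print(counts)

conn.close()
```

The same connect, cursor, execute and fetch calls are what DB-API drivers for other relational databases expose, which is what makes swapping the backend comparatively cheap.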
  • 16. Could Python give us the same?
       Presentation:
         Graph drawing: NetworkX, with nice results. There are some others, but they look incomplete.
         GUI: PyQT, wxWindows, pyGTK. It's a matter of taste XD!
       SVN, CVS processing:
         SVN: pysvn, a Python interface to Subversion.
         CVS: it seems nothing is available.
         GIT: PyGit, Pythonic git bindings targeted towards porcelains.
       XML processing can be done using the built-in support or with any SAX or DOM parser.
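To make the remark about built-in XML support concrete, the sketch below parses a Subversion log in the XML form produced by `svn log --xml -v`, using only the standard library's `xml.etree.ElementTree`. The sample log is invented for illustration, and the function name `parse_svn_log` is hypothetical; this is one possible approach, not the one used in the talk's prototype.

```python
import xml.etree.ElementTree as ET

# Invented sample in the shape of `svn log --xml -v` output.
SVN_LOG_XML = """<log>
<logentry revision="42">
<author>alice</author>
<date>2009-07-01T10:15:00.000000Z</date>
<paths>
<path action="M">/trunk/src/Parser.java</path>
<path action="A">/trunk/src/Lexer.java</path>
</paths>
<msg>Fix issue in parser</msg>
</logentry>
<logentry revision="41">
<author>bob</author>
<date>2009-06-28T09:00:00.000000Z</date>
<paths>
<path action="M">/trunk/src/Parser.java</path>
</paths>
<msg>Refactor tokens</msg>
</logentry>
</log>"""


def parse_svn_log(xml_text):
    """Turn each <logentry> into a plain dictionary."""
    entries = []
    for entry in ET.fromstring(xml_text).iter("logentry"):
        entries.append({
            "revision": int(entry.get("revision")),
            "author": entry.findtext("author"),
            "date": entry.findtext("date"),
            # (action, path) pairs: A = added, M = modified, D = deleted.
            "paths": [(p.get("action"), p.text) for p in entry.iter("path")],
            "message": entry.findtext("msg", default=""),
        })
    return entries


entries = parse_svn_log(SVN_LOG_XML)
for e in entries:
    print(e["revision"], e["author"], len(e["paths"]))
```

Feeding the command's output to such a parser avoids a binding library altogether, at the cost of spawning the `svn` client for every query.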
  • 17. The future
       Known issues:
         Data preprocessing performance.
         Database performance: is the relational model valid?
         Dynamic procedure addition.
       The to-do list:
         Develop new procedures for different related topics, such as software visualization, change support, etc.
         Develop more mature software. Python could help in some parts.
         The software must be easily extensible.
         Improve the performance of the whole process.
  • 18. The end
       Questions?
       Pere Urbón Bayes
       Data Management Group, Dept. Arquitectura de Computadors, Universitat Politècnica de Catalunya
       purbon@ac.upc.edu