Mining Software Archives to Support Software Development
 

Job application talk.

    Presentation Transcript

    • Mining Software Archives to Support Software Development Tom Zimmermann Saarland University
    • Software Development Hello Calgary! Build
    • Collaboration Comm. Archive, Version Archive, Bug Database Mining Software Archives
    • Mining Software Archives
    • Mining Software Archives eROSE BugCache Vulture
    • eROSE Related Changes (ICSE 2004, TSE 2005) Tom Zimmermann • Saarland University Peter Weißgerber • University of Trier Stephan Diehl • University of Trier Andreas Zeller • Saarland University
    • Developers who changed this function also changed...
    • eROSE: Guiding Developers Customers who bought this item also bought... Purchase History
    • eROSE: Guiding Developers Developers who changed this function also changed... (Version Archive) Customers who bought this item also bought... (Purchase History)
    • eROSE suggests further locations.
    • eROSE prevents incomplete changes.
    • Processing CVS data 1. Comparing files 2. Building transactions
    • Comparing Files
    • Comparing Files A() B() C() D() E()
    • Comparing Files Left: A() B() C() D() E() Right: A() F() B() D() E()
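
The file comparison on this slide can be sketched as a set difference over the entities found in two versions of a file. This is a minimal illustration, not eROSE's actual implementation: the function name `changed_entities` is hypothetical, reading the two entity lists on the slide as an older and a newer revision is an assumption, and extracting entities from source text is taken as given. The sketch only reports added and removed entities, not entities whose bodies changed.

```python
def changed_entities(old, new):
    """Return (added, removed) entity names between two revisions."""
    old_set, new_set = set(old), set(new)
    added = sorted(new_set - old_set)
    removed = sorted(old_set - new_set)
    return added, removed


# Entity lists as they appear on the slide (old/new reading is assumed).
old_revision = ["A()", "B()", "C()", "D()", "E()"]
new_revision = ["A()", "F()", "B()", "D()", "E()"]

added, removed = changed_entities(old_revision, new_revision)
print("added:", added, "removed:", removed)
```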
    • Building Transactions CVS 150,000
    • Building Transactions CVS 150,000 same author + message + time 2003-02-19 (aweinand): fixed #13332 createGeneralPage() createTextComparePage() fKeys[] initDefaults() buildnotes_compare.html PatchMessages.properties plugin.properties
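
The "same author + message + time" rule can be sketched as follows. CVS records changes per file, so per-file records have to be regrouped into transactions. This is a rough sketch under stated assumptions: the 200-second window, the record layout, the function name `build_transactions`, and the time of day in the sample data are all illustrative and not taken from the slides.

```python
from datetime import datetime, timedelta


def build_transactions(revisions, window=timedelta(seconds=200)):
    """Group (author, message, time, entity) records into transactions.

    A record joins an existing transaction when author and message match
    and the gap to that transaction's latest record is at most `window`
    (the window length is an assumed value).
    """
    transactions = []
    last_seen = {}  # (author, message) -> (time of latest record, transaction)
    for author, message, time, entity in sorted(revisions, key=lambda r: r[2]):
        key = (author, message)
        previous = last_seen.get(key)
        if previous and time - previous[0] <= window:
            transaction = previous[1]
        else:
            transaction = {"author": author, "message": message, "entities": []}
            transactions.append(transaction)
        transaction["entities"].append(entity)
        last_seen[key] = (time, transaction)
    return transactions


t0 = datetime(2003, 2, 19, 10, 0, 0)  # date from the slide; time of day is made up
revisions = [
    ("aweinand", "fixed #13332", t0, "fKeys[]"),
    ("aweinand", "fixed #13332", t0 + timedelta(seconds=5), "initDefaults()"),
    ("aweinand", "fixed #13332", t0 + timedelta(seconds=9), "plugin.properties"),
    # Hypothetical later record: same author and message, but far outside the window.
    ("aweinand", "fixed #13332", t0 + timedelta(hours=2), "fKeys[]"),
]
transactions = build_transactions(revisions)
print([t["entities"] for t in transactions])
```

Measuring the gap from the latest record rather than from the first one lets a long commit that touches many files stay in one transaction.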
    • Mining Associations User changes fKeys[] and initDefaults()
    • Mining Associations EROSE finds past transactions #756 #6721 #21078 #42432 #51345 #59998 #71003 #87264 #91220 #101823 #104223, each containing fKeys[] initDefaults() ...; all but one also contain plugin.properties {fKeys[], initDefaults()} ⇒ {plugin.properties} Support 10, Confidence 10/11 = 0.909
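
The support and confidence figures on this slide can be reproduced with a few lines. This is a minimal sketch: the function name `support_and_confidence` is hypothetical, and the transactions are reduced to the three entities named on the slide. Which of the eleven transactions lacks plugin.properties is not recoverable, so it is chosen arbitrarily here.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Support = transactions containing both sides of the rule;
    confidence = support / transactions containing the antecedent."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= set(t)]
    support = sum(1 for t in with_antecedent if consequent <= set(t))
    confidence = support / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence


# 11 past transactions changed fKeys[] and initDefaults();
# 10 of them also changed plugin.properties.
past = [{"fKeys[]", "initDefaults()", "plugin.properties"} for _ in range(10)]
past.append({"fKeys[]", "initDefaults()"})

support, confidence = support_and_confidence(
    past, {"fKeys[]", "initDefaults()"}, {"plugin.properties"}
)
print(support, round(confidence, 3))  # 10 0.909
```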
    • Evaluation (GIMP, PostgreSQL, KOffice, jEdit) EROSE predicts 33% of all changed entities (files: 44%). In 70% of all transactions, EROSE’s topmost three suggestions contain a changed entity (files: 72%). EROSE learns quickly (within 30 days).
    • eROSE Related Changes (ICSE 2004, TSE 2005) guides developers non-program elements (documentation) learns quickly
    • BugCache Predicting Defects (ASE 2006, ICSE 2007) Sung Kim • MIT Tom Zimmermann • Saarland University Jim Whitehead • Univ. of California SC Andreas Zeller • Saarland University
    • The Problem How should we allocate our resources for quality assurance?
    • One Solution: Cache List with elements that (will) have defects List is adaptive, i.e., it changes over time
    • The BugCache Model What is loaded in the cache? Cache size: 2 Hypothesis: Temporal locality between defects
    • The BugCache Model Cache size: 2 Miss, Hit, Miss Hit rate = #Hits / #Defects = 1/3 = 33.3%
    • The BugCache Model Cache size: 2 Miss, Hit, Miss, Miss
    • Loading Elements Temporal locality – as shown before Spatial locality – load “nearby” elements (i.e., co-changed before) Changed-entity locality – load changed elements New-entity locality – load new elements Initial pre-fetch – start with a loaded cache
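
The cache model with temporal locality only can be sketched as a small simulation. The least-recently-used replacement policy, the function name `simulate_cache`, and the defect sequence are assumptions for illustration. The other loading policies listed above (spatial, changed-entity, new-entity, pre-fetch) are not modeled.

```python
from collections import OrderedDict


def simulate_cache(defects, cache_size):
    """Replay a sequence of defective entities against a fixed-size cache."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    outcomes = []
    for entity in defects:
        if entity in cache:
            outcomes.append("Hit")
            cache.move_to_end(entity)  # mark as most recently used
        else:
            outcomes.append("Miss")
            cache[entity] = True  # load the defective entity
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict least recently used (assumed policy)
    return outcomes


# Hypothetical defect sequence chosen to give Miss, Hit, Miss.
outcomes = simulate_cache(["a", "a", "b"], cache_size=2)
hit_rate = outcomes.count("Hit") / len(outcomes)
print(outcomes, f"{hit_rate:.1%}")  # ['Miss', 'Hit', 'Miss'] 33.3%
```

A defect in a third, previously unseen entity would be another miss and would evict the least recently used entry.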
    • Evaluation Mozilla jEdit PostgreSQL Columba
    • Hit Rates (Cache size = 10%)
                     Methods                Files
      Project        BugCache   FixCache    BugCache   FixCache
      Apache 1.3     59.6%      61.5%       83.9%      81.5%
      Columba        58.9%      67.6%       83.5%      83.0%
      Eclipse        64.5%      71.6%       95.1%      95.0%
      JEdit          50.5%      48.9%       85.7%      85.4%
      Mozilla        49.3%      55.0%       93.3%      88.0%
      PostgreSQL     61.9%      59.2%       73.9%      71.0%
      Subversion     68.3%      43.8%       82.0%      81.3%
    • Reasons for Hits [pie chart] Temporal locality 60%, Initial pre-fetch 18%, Spatial locality 18%, Changed-entity locality and New-entity locality (no values shown)
    • Warning Developers “Safe” Location (not in FixCache) Risky Location (red, in FixCache)
    • BugCache Predicting Defects (ASE 2006, ICSE 2007) temporal locality adaptive hit rates of 71%~95%
    • Vulture Predicting Security Vulnerabilities (Work in Progress) Stephan Neuhaus • Saarland University Tom Zimmermann • Saarland University Andreas Zeller • Saarland University
    • Firefox/Mozilla >700 developers 228,365 commits 14,368 C/C++ files 1,012,512 revisions (10,452 components)
    • Vulnerabilities
    • Vulnerabilities Security Advisory 2005-12 Title: Livefeed bookmarks can steal cookies Impact: High Products: Firefox Description: Earlier versions of Firefox allowed javascript: and data: URLs as Livefeed bookmarks. When they updated the URL would be run in the context of the current page and could be used to steal cookies or data displayed on the page. If the user were on a page with elevated privileges (for example, about:config) when the Livefeed was updated, the feed URL could potentially run arbitrary code on the user's machine.
    • Vulnerabilities Security Advisory 2005-13 Title: Window Injection Spoofing Severity: Low Products: Firefox, Mozilla Suite Description: A website can inject content into a popup opened by another site if the target name of the popup window is known. An attacker who knows you are going to visit that other site could spoof the contents of the popup.
    • Vulnerabilities [slide stacks five Security Advisories on top of one another: 2005-14, 2005-15, 2005-16, 2005-41, 2006-76. Recoverable titles: Heap overflow possible in UTF8 to Unicode conversion; Spoofing download and security dialogs with overlapping windows; Privilege escalation via DOM property overrides; XSS using outer window's Function object; SSL "secure site" indicator spoofing. The overlaid descriptions are interleaved in the extracted text and cannot be reliably separated.]
    • Vulnerabilities 10,452 components 424 vulnerable 4.05%
    • Vulnerabilities What other components are vulnerable?
    • Vulnerabilities Is this new component likely to be vulnerable?
    • Vulture [diagram; recoverable labels: Vulnerability Database, Version Archive, Code, Vulture, Predictor, Component; slide note: Redo diagram]
    • Correlations
    • Correlations Programmer Code Complexity Language
    • Correlations Code Complexity Language
    • Correlations Language
    • Correlations Language Problem Domain
    • Imports
    • Imports GUI Database Certificates OS
    • Example (1) import nsIContent.h, nsIContentUtils.h, nsIScriptSecurityManager.h: 21 ✘, 1 ✔ (95.5% ✘)
    • Example (2) import nsIPrivateDOMEvent.h, nsReadableUtils.h: 19 ✘, 0 ✔ (100% ✘)
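
The percentages in the two examples can be read as the share of flagged components among all components that import a given set of headers. The sketch below assumes that ✘ marks a component with a known vulnerability; that reading, the function name `vulnerable_fraction`, and the generated component records are illustrative assumptions, not data from the study.

```python
def vulnerable_fraction(components, required_imports):
    """Share of vulnerable components among those importing all required headers."""
    required = set(required_imports)
    matching = [c for c in components if required <= c["imports"]]
    if not matching:
        return None
    return sum(c["vulnerable"] for c in matching) / len(matching)


headers = {"nsIContent.h", "nsIContentUtils.h", "nsIScriptSecurityManager.h"}

# 22 made-up components importing all three headers, 21 of them marked vulnerable.
components = [{"imports": set(headers), "vulnerable": i < 21} for i in range(22)]
# One made-up component that imports only one of the headers; it is not counted.
components.append({"imports": {"nsIContent.h"}, "vulnerable": False})

fraction = vulnerable_fraction(components, headers)
print(f"{fraction:.1%}")  # 95.5%
```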
    • Research Questions • How well do imports predict vulnerabilities? • Can imports be used for − classification (vulnerable or not) and for − regression (number of vulnerabilities)?
    • Input Data (one row per component; first column: number of vulnerability-related bug reports; remaining columns: one 0/1 flag per imported header; the rotated header names are not recoverable from the extraction)
      Component           Reports  Imports
      nsCOMArray          0        1 0 0 0 1 0 0
      nsIDocument.h       1        0 0 1 0 0 1 0
      nspr_md.h           0        0 1 1 0 0 1 0
      nsDOMClassInfo      10       0 0 1 0 1 0 0
      EmbedGTKTools       0        0 0 0 0 1 0 0
      MozillaControl.cpp  0        0 1 0 1 0 0 0
      nsDOMClassInfo has had 10 vulnerability-related bug reports
      nsDOMClassInfo imports “nsIXPConnect.h”
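
The input table pairs each component's vulnerability count with a 0/1 vector of its imports. A minimal sketch of building such rows follows. The function name `build_rows` and the headers `example_a.h` and `example_b.h` are made up; only the nsDOMClassInfo count of 10 and its import of nsIXPConnect.h come from the slide.

```python
def build_rows(imports_by_component, vulnerability_counts):
    """Return (columns, rows): rows map component -> (count, 0/1 import flags)."""
    columns = sorted(set().union(*imports_by_component.values()))
    rows = {}
    for component, imports in imports_by_component.items():
        flags = [1 if header in imports else 0 for header in columns]
        rows[component] = (vulnerability_counts.get(component, 0), flags)
    return columns, rows


imports_by_component = {
    "nsDOMClassInfo": {"nsIXPConnect.h", "example_a.h"},  # example_a.h is made up
    "nsCOMArray": {"example_b.h"},                        # example_b.h is made up
}
counts = {"nsDOMClassInfo": 10, "nsCOMArray": 0}

columns, rows = build_rows(imports_by_component, counts)
print(columns)
for component, (count, flags) in rows.items():
    print(component, count, flags)
```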
    • Distribution [two histograms: Distribution of MFSAs (x-axis: Number of MFSAs) and Distribution of Bug Reports (x-axis: Number of Bug Reports); y-axis: Number of Components]
    • Experiments • 40 random splits: 6,968 rows in training set, 3,484 rows in validation set • Classification: Train SVM, compute recall and precision • Regression: Train SVM, compute rank correlation on top 1% • SVM: linear kernel with default parameters, R implementation (up to 10GB of main memory)
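
The split sizes and the classification metrics can be sketched without the classifier itself. The slide states that an SVM in R was used; that part is omitted here. The 2/3 training fraction is inferred from 6,968 + 3,484 = 10,452 rows. The function names, the seed, and the small prediction vectors are illustrative assumptions.

```python
import random


def random_split(rows, train_fraction=2 / 3, seed=0):
    """Shuffle rows and cut them into a training and a validation part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def precision_recall(predicted, actual):
    """Precision and recall for boolean predictions against boolean labels."""
    true_pos = sum(1 for p, a in zip(predicted, actual) if p and a)
    predicted_pos = sum(1 for p in predicted if p)
    actual_pos = sum(1 for a in actual if a)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall


train, validation = random_split(range(10452))
print(len(train), len(validation))  # 6968 3484

# Made-up predictions and labels, only to exercise the metric.
precision, recall = precision_recall([1, 1, 0, 1, 0], [1, 0, 0, 1, 1])
print(precision, recall)
```

Repeating the split with 40 different seeds would give 40 precision/recall pairs, one per split.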
    • Results (a) Precision and Recall [scatter plot; x-axis: Recall, y-axis: Precision]: 2/3 of all vulnerable components detected; 45% (about 1/2) of predictions correct (b) Rank Correlation [plot; x-axis: Rank Correlation, y-axis: Cumulative Distribution]: moderately strong correlation (mostly significant at p < 0.01)
    • Ranking
      Rank  Component             Actual Rank
      1     nsDOMClassInfo        3
      2     SGridRowLayout        95
      3     xpcprivate            6
      4     jsxml                 2
      5     nsGenericHTMLElement  8
      6     jsgc                  3
      7     nsISEnvironment       12
      8     jsfun                 1
      9     nsHTMLLabelElement    18
      10    nsHttpTransaction     35
      ... (3,474 components)
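
Agreement between a predicted and an actual ranking is commonly measured with Spearman's rank correlation, which is the Pearson correlation of the two rank vectors, with tied values sharing an average rank. The sketch below applies it to the ten rows shown in the ranking. Whether the talk used exactly this coefficient or this tie handling is an assumption, and ten rows are only the visible part of the ranking, so the printed value is illustrative.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


predicted = list(range(1, 11))                  # "Rank" column
actual = [3, 95, 6, 2, 8, 3, 12, 1, 18, 35]     # "Actual Rank" column
print(round(spearman(predicted, actual), 3))
```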
    • Similar Results for Bugs Packages + Import relationships (ISESE 2006) Precision: 66.7% Recall: 69.4% Binaries + Dependencies (Internship @ Microsoft Research, 2006) Precision: 64.4% Recall: 75.3%
    • Vulture Predicting Security Vulnerabilities (Work in Progress) locates past + predicts new vulnerabilities problem domain
    • Future Work ?
    • #1: Mining across Projects • Complement source code search engines with mining techniques. • Large-scale mining (144,000 SF projects)
    • #2: Developer Buddy MOCKUP
    • eROSE BugCache Vulture: automatic, large-scale, tool-oriented
    • Empirical Software Engineering 2.0: automatic, large-scale, tool-oriented Thanks! Questions?