SlideShare a Scribd company logo
1 of 48
Download to read offline
More information available at
http://ase.csc.ncsu.edu/dmse/
Mining Software Engineering Data
Ahmed E. Hassan
Queen’s University
www.cs.queensu.ca/~ahmed
ahmed@cs.queensu.ca
Tao Xie
North Carolina State University
www.csc.ncsu.edu/faculty/xie
xie@csc.ncsu.edu
• NSERC/RIM Software Engineering Research
Chair Queen’s University, Canada
• Leads the SAIL research group at Queen’s
• Co-chair for Workshop on Mining Software
Repositories (MSR) from 2004-2006
• Chair of the steering committee for MSR
2
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
3
• Associate Professor at North Carolina State
University, USA
• Leads the ASE research group at NCSU
• PC Co-Chair of ICSM 2009, MSR 2011/2012
• Co-organizer of 2007 Dagstuhl Seminar on
Mining Programs and Processes
http://msrconf.org
An international effort to
make software repositories actionable
http://promisedata.org
• Transforms static record-
keeping repositories to active
repositories
• Makes repository data
actionable by uncovering
hidden patterns and trends
11
MailinglistBugzilla Crashes
Field logs CVS/SVN
1212
Field
Logs
Source Control
CVS/SVN
Bugzilla Mailing
lists
Crash
Repos
Historical Repositories Runtime Repos
Code Repos
Sourceforge
GoogleCode
Bugzilla CVS/SVNMailinglist Crashes
fixed
bug
discussions
Buggy change &
Fixing change
Field
crashes
Estimate fix effort
Mark duplicates
Suggest experts and fix!
New Bug Report
Bugzilla CVS/SVNMailinglist Crashes
fixed
bug
Field
crashes
Suggest APIs
Warn about risky code or bugs
Suggest locations to co-change
New Change
discussions
Buggy change &
Fixing change
Example Repositories:
Source Control and Bug Repositories
A. E. Hassan and T. Xie: Mining
Software Engineering Data
16
Source Control Repositories
• A source control system
tracks changes to
ChangeUnits
• Example of ChangeUnits:
– File (most common)
– Function
– Dependency (e.g., Call)
• Each ChangeUnit:
– Records the developer, change
time, change message, co-
changing Units
ChangeListDeveloper
Time
ChangeChangeUnit
Modify
Add
Remove
Change
Type
* .. *
ChangeList
Message
FI
FR
GM
ChangeList
Type
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
17
Determine
Initial Entity
To Change
Change
Entity
Determine
Other Entities
To Change
Consult
Guru for
Advice
New Req., Bug Fix
“How does a change in one source code
entity propagate to other entities?”
No More
Changes
For Each Entity
Suggested Entity
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
18
• We want:
– High Precision to avoid wasting time
– High Recall to avoid bugs
entitieschanged
changedwhichentitiespredicted
Recall 
entitiespredicted
changedwhichentitiespredicted
Precision 
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
19
• Mine association rules from change history
• Use rules to help propagate changes:
– Recall as high as 44%
– Precision around 30%
• High precision and recall reached in < 1mth
• Prediction accuracy improves prior to a
release (i.e., during maintenance phase)
• Better predictor than static dependencies
alone
[Zimmermann et al. 05]
[Hassan&Holt 04]
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
20
• Traditional dependency graphs and program
understanding models usually do not use
historical information
• Static dependencies capture only a static view
of a system – not enough detail!
• Development history can help understand the
current structure (architecture) of a software
system
[Hassan & Holt 04]
21
Hardware
Trans.
Kernel Fault
Handler
Pager
FileSystem
Virtual Addr.
Maint.
VM Policy
Subsystem
Depend Divergence
Hardware
Trans.
Kernel Fault
Handler
Pager
FileSystem
Virtual Addr.
Maint.
VM Policy
Convergence
Subsystem
Why? Who?
When? Where?
22
• Eight unexpected dependencies
• All except two dependencies existed since day one:
– Virtual Address Maintenance  Pager
– Pager  Hardware Translations
Which?
vm_map_entry_create (in src/sys/vm/Attic/vm_map.c)
depends on pager_map (in /src/sys/uvm/uvm_pager.c)
Who? cgd
When?
1993/04/09 15:54:59
Revision 1.2 of src/sys/vm/Attic/vm_map.c
Why?
from sean eric fagan:
it seems to keep the vm system from deadlocking the
system when it runs out of swap + physical memory.
prevents the system from giving the last page(s) to
anything but the referenced "processes" (especially
important is the pager process, which should never
have to wait for a free page).
Auto-generated
from CVS repository
• Conway’s Law:
“The structure of a software system is a direct
reflection of the structure of the development
team”
Conceptual
Architecture
Ownership
Architecture
Concrete
Architecture
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
25
import org.eclipse.jdt.internal.compiler.lookup.*;
import org.eclipse.jdt.internal.compiler.*;
import org.eclipse.jdt.internal.compiler.ast.*;
import org.eclipse.jdt.internal.compiler.util.*;
...
import org.eclipse.pde.core.*;
import org.eclipse.jface.wizard.*;
import org.eclipse.ui.*;
14% of all files that import ui packages,
had to be fixed later on.
71% of files that import compiler packages,
had to be fixed later on.
[Schröter et al. 06]
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
26
Percentage of bug-introducing changes for eclipse
Don’t program on Fridays ;-)
[Zimmermann et al. 05]
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
27
• Given a change can we warn a developer that
there is a bug in it?
– Recall/Precision in 50-60% range
[Kim et al. 06]
Example Repositories:
Project Communication – Mailing lists
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
29
• Most open source projects communicate
through mailing lists or IRC channels
• Rich source of information about the inner
workings of large projects
• Discussions cover topics such as future plans,
design decisions, project policies, code or
patch reviews
• Social network analysis could be performed on
discussion threads
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
30
• Study the content of messages before and after a release
• Use dimensions from a psychometric text analysis tool:
– After Apache 1.3 release there was a drop in optimism
– After Apache 2.0 release there was an increase in sociability
[Rigby & Hassan 07]
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
31
• Mailing list activity:
– strongly correlates with code
change activity
– moderately correlates with
document change activity
• Social network measures (in-
degree, out-degree,
betweenness) indicate that
committers play a more
significant role in the mailing list
community than non-
committers [Bird et al. 06]
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
32
• When will a developer be invited to join a
project?
– Expertise vs. interest
[Bird et al. 07]
Example Repositories:
Program Source Code
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
35
Source data Mined info
Variable names and function names Software categories
[Kawaguchi et al. 04]
Statement seq in a basic block Copy-paste code
[Li et al. 04]
Set of functions, variables, and data
types within a C function
Programming rules
[Li&Zhou 05]
Sequence of methods within a Java
method
API usages
[Xie&Pei 06]
API method signatures API Jungloids
[Mandelin et al. 05]
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
36
• How should an API be used correctly?
– An API may serve multiple functionalities
– Different styles of API usage
• “I know what type of object I need, but I don’t know
how to write the code to get the object” [Mandelin
et al. 05]
– Can we synthesize jungloid code fragments automatically?
– Given a simple query describing the desired code in terms
of input and output types, return a code segment
• “I know what method call I need, but I don’t know
how to write code before and after this method call”
[Xie&Pei 06]
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
37
• Mine framework reuse patterns [Michail 00]
– Membership relationships
• A class contains membership functions
– Reuse relationships
• Class inheritance/ instantiation
• Function invocations/overriding
• Mine software plagiarism [Liu et al. 06]
– Program dependence graphs
[Michail 99/00] http://codeweb.sourceforge.net/ for C++
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
38
• Apply closed sequential pattern mining techniques
• Customizing the techniques
– A copy-paste segment typically does not have big gaps – use
a maximum gap threshold to control
– Output the instances of patterns (i.e., the copy-pasted code
segments) instead of the patterns
– Use small copy-pasted segments to form larger ones
– Prune false positives: tiny segments, unmappable segments,
overlapping segments, and segments with large gaps
[Li et al. 04]
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
39
• For two copy-pasted segments, are the
modifications consistent?
– Identifier a in segment S1 is changed to b in
segment S2 3 times, but remains unchanged once
– likely a bug
– The heuristic may not be correct all the time
• The lower the unchanged rate of an identifier,
the more likely there is a bug
[Li et al. 04]
Example Repositories:
Program Execution Traces
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
41
• Goal: mine specifications (pre/post conditions) or
object behavior (object transition diagrams)
• State of an object
– Values of transitively reachable fields
• Method-entry state
– Receiver-object state, method argument values
• Method-exit state
– Receiver-object state, updated method argument values,
method return value
[Ernst et al. 02] http://pag.csail.mit.edu/daikon/
[Xie&Notkin 04/05][Dallmeier et al. 06] http://www.st.cs.uni-sb.de/models/
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
42
• Goal: detect or locate bugs
• Values of variables at certain code locations
[Hangal&Lam 02]
– Object/static field read/write
– Method-call arguments
– Method returns
• Sampled predicates on values of variables [Liblit
et al. 03/05][Liu et al. 05]
[Hangal&Lam 02] http://diduce.sourceforge.net/
[Liblit et al. 03/05] http://www.cs.wisc.edu/cbi/
[Liu et al. 05] http://www.ews.uiuc.edu/~chaoliu/sober.htm
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
43
• Goal: locate bugs
• Executed branches/paths, def-use pairs
• Executed function/method calls
– Group methods invoked on the same object
• Profiling options
– Execution hit vs. count
– Execution order (sequences)
[Dallmeier et al. 05] http://www.st.cs.uni-sb.de/ample/
More related tools: http://www.csc.ncsu.edu/faculty/xie/research.htm#related
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
44
• Given a program state S and an event e, predict
whether e likely results in a bug
– Positive samples: past bugs
– Negative samples: “not bug” reports
• A k-NN based approach
– Consider the k closest cases reported before
– Compare Σ 1/d for bug cases and not-bug cases, where d is
the similarity between the current state and the reported
states
– If the current state is more similar to bugs, predict a bug
[Michail&Xie 05]
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
45
• A method executed only in failing runs is likely
to point to the defect
– Comparing the coverage of passing and failing
program runs helps
• Mining patterns frequent in failing program
runs but infrequent in passing program runs
– Sequential patterns may be used
[Dallmeier et al. 05, Denmat et al. 05]
Must show value before
data quality improves
Correlation vs. Causation
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
48
• Very active research area in SE:
– MSR is the most attended ICSE event in last 8 yrs
• http://msrconf.org
– Special Issue of IEEE TSE 2005 on MSR:
• 15 % of all submissions of TSE in 2004
• Fastest review cycle in TSE history: 8 months
– Special Issue Empirical Software Engineering 2009, 2011
– MSR 2012!
A. E. Hassan and T. Xie:
Mining Software Engineering
Data
49
• Report the statistical significance of your results:
– Get a statistics book (one for social scientist, not for
mathematicians)
• Discuss any limitations of your findings based on the
characteristics of the studied repositories:
– Make sure you manually examine the repositories. Do not fully
automate the process!
– Use random sampling to resolve issues about data noise
• Relevant conferences/workshops:
– main SE conferences, ICSM, ISSTA, MSR, WODA, PROMISE …

More Related Content

Similar to Mining Software Engineering Data

Scalable and Cost-Effective Model-Based Software Verification and Testing
Scalable and Cost-Effective Model-Based Software Verification and TestingScalable and Cost-Effective Model-Based Software Verification and Testing
Scalable and Cost-Effective Model-Based Software Verification and TestingLionel Briand
 
V1_I2_2012_Paper3.doc
V1_I2_2012_Paper3.docV1_I2_2012_Paper3.doc
V1_I2_2012_Paper3.docpraveena06
 
Improvement of Software Maintenance and Reliability using Data Mining Techniques
Improvement of Software Maintenance and Reliability using Data Mining TechniquesImprovement of Software Maintenance and Reliability using Data Mining Techniques
Improvement of Software Maintenance and Reliability using Data Mining Techniquesijdmtaiir
 
Software Engineering Research: Leading a Double-Agent Life.
Software Engineering Research: Leading a Double-Agent Life.Software Engineering Research: Leading a Double-Agent Life.
Software Engineering Research: Leading a Double-Agent Life.Lionel Briand
 
The Path to Digital Engineering
The Path to Digital EngineeringThe Path to Digital Engineering
The Path to Digital EngineeringElizabeth Steiner
 
From Events to Networks: Time Series Analysis on Scale
From Events to Networks: Time Series Analysis on ScaleFrom Events to Networks: Time Series Analysis on Scale
From Events to Networks: Time Series Analysis on ScaleDr. Mirko Kämpf
 
Intelligent Software Engineering: Synergy between AI and Software Engineering
Intelligent Software Engineering: Synergy between AI and Software EngineeringIntelligent Software Engineering: Synergy between AI and Software Engineering
Intelligent Software Engineering: Synergy between AI and Software EngineeringTao Xie
 
Software engineering lecture notes
Software engineering   lecture notesSoftware engineering   lecture notes
Software engineering lecture notesGarima Singh
 
WINSEM2023-24_BCSE429L_TH_CH2023240501528_Reference_Material_III_S3-Homoheter...
WINSEM2023-24_BCSE429L_TH_CH2023240501528_Reference_Material_III_S3-Homoheter...WINSEM2023-24_BCSE429L_TH_CH2023240501528_Reference_Material_III_S3-Homoheter...
WINSEM2023-24_BCSE429L_TH_CH2023240501528_Reference_Material_III_S3-Homoheter...KumarSuman24
 
Cloudera Federal Forum 2014: EzBake, the DoDIIS App Engine
Cloudera Federal Forum 2014: EzBake, the DoDIIS App EngineCloudera Federal Forum 2014: EzBake, the DoDIIS App Engine
Cloudera Federal Forum 2014: EzBake, the DoDIIS App EngineCloudera, Inc.
 
Process and Project Metrics-1
Process and Project Metrics-1Process and Project Metrics-1
Process and Project Metrics-1Saqib Raza
 

Similar to Mining Software Engineering Data (20)

Scalable and Cost-Effective Model-Based Software Verification and Testing
Scalable and Cost-Effective Model-Based Software Verification and TestingScalable and Cost-Effective Model-Based Software Verification and Testing
Scalable and Cost-Effective Model-Based Software Verification and Testing
 
V1_I2_2012_Paper3.doc
V1_I2_2012_Paper3.docV1_I2_2012_Paper3.doc
V1_I2_2012_Paper3.doc
 
Improvement of Software Maintenance and Reliability using Data Mining Techniques
Improvement of Software Maintenance and Reliability using Data Mining TechniquesImprovement of Software Maintenance and Reliability using Data Mining Techniques
Improvement of Software Maintenance and Reliability using Data Mining Techniques
 
Scope of software engineering
Scope of software engineeringScope of software engineering
Scope of software engineering
 
Software Engineering Research: Leading a Double-Agent Life.
Software Engineering Research: Leading a Double-Agent Life.Software Engineering Research: Leading a Double-Agent Life.
Software Engineering Research: Leading a Double-Agent Life.
 
The Path to Digital Engineering
The Path to Digital EngineeringThe Path to Digital Engineering
The Path to Digital Engineering
 
Data structure design in SE
Data structure  design in SEData structure  design in SE
Data structure design in SE
 
From Events to Networks: Time Series Analysis on Scale
From Events to Networks: Time Series Analysis on ScaleFrom Events to Networks: Time Series Analysis on Scale
From Events to Networks: Time Series Analysis on Scale
 
Introduction Software engineering
Introduction   Software engineeringIntroduction   Software engineering
Introduction Software engineering
 
Intelligent Software Engineering: Synergy between AI and Software Engineering
Intelligent Software Engineering: Synergy between AI and Software EngineeringIntelligent Software Engineering: Synergy between AI and Software Engineering
Intelligent Software Engineering: Synergy between AI and Software Engineering
 
Software engineering lecture notes
Software engineering   lecture notesSoftware engineering   lecture notes
Software engineering lecture notes
 
Softwareproject planning
Softwareproject planningSoftwareproject planning
Softwareproject planning
 
WINSEM2023-24_BCSE429L_TH_CH2023240501528_Reference_Material_III_S3-Homoheter...
WINSEM2023-24_BCSE429L_TH_CH2023240501528_Reference_Material_III_S3-Homoheter...WINSEM2023-24_BCSE429L_TH_CH2023240501528_Reference_Material_III_S3-Homoheter...
WINSEM2023-24_BCSE429L_TH_CH2023240501528_Reference_Material_III_S3-Homoheter...
 
Se research update
Se research updateSe research update
Se research update
 
Presentation of se
Presentation of sePresentation of se
Presentation of se
 
Software Development Life Cycle
Software Development Life CycleSoftware Development Life Cycle
Software Development Life Cycle
 
Sdlc 4
Sdlc 4Sdlc 4
Sdlc 4
 
Data mining
Data miningData mining
Data mining
 
Cloudera Federal Forum 2014: EzBake, the DoDIIS App Engine
Cloudera Federal Forum 2014: EzBake, the DoDIIS App EngineCloudera Federal Forum 2014: EzBake, the DoDIIS App Engine
Cloudera Federal Forum 2014: EzBake, the DoDIIS App Engine
 
Process and Project Metrics-1
Process and Project Metrics-1Process and Project Metrics-1
Process and Project Metrics-1
 

More from SAIL_QU

Studying the Integration Practices and the Evolution of Ad Libraries in the G...
Studying the Integration Practices and the Evolution of Ad Libraries in the G...Studying the Integration Practices and the Evolution of Ad Libraries in the G...
Studying the Integration Practices and the Evolution of Ad Libraries in the G...SAIL_QU
 
Studying the Dialogue Between Users and Developers of Free Apps in the Google...
Studying the Dialogue Between Users and Developers of Free Apps in the Google...Studying the Dialogue Between Users and Developers of Free Apps in the Google...
Studying the Dialogue Between Users and Developers of Free Apps in the Google...SAIL_QU
 
Improving the testing efficiency of selenium-based load tests
Improving the testing efficiency of selenium-based load testsImproving the testing efficiency of selenium-based load tests
Improving the testing efficiency of selenium-based load testsSAIL_QU
 
Studying User-Developer Interactions Through the Distribution and Reviewing M...
Studying User-Developer Interactions Through the Distribution and Reviewing M...Studying User-Developer Interactions Through the Distribution and Reviewing M...
Studying User-Developer Interactions Through the Distribution and Reviewing M...SAIL_QU
 
Studying online distribution platforms for games through the mining of data f...
Studying online distribution platforms for games through the mining of data f...Studying online distribution platforms for games through the mining of data f...
Studying online distribution platforms for games through the mining of data f...SAIL_QU
 
Understanding the Factors for Fast Answers in Technical Q&A Websites: An Empi...
Understanding the Factors for Fast Answers in Technical Q&A Websites: An Empi...Understanding the Factors for Fast Answers in Technical Q&A Websites: An Empi...
Understanding the Factors for Fast Answers in Technical Q&A Websites: An Empi...SAIL_QU
 
Investigating the Challenges in Selenium Usage and Improving the Testing Effi...
Investigating the Challenges in Selenium Usage and Improving the Testing Effi...Investigating the Challenges in Selenium Usage and Improving the Testing Effi...
Investigating the Challenges in Selenium Usage and Improving the Testing Effi...SAIL_QU
 
Mining Development Knowledge to Understand and Support Software Logging Pract...
Mining Development Knowledge to Understand and Support Software Logging Pract...Mining Development Knowledge to Understand and Support Software Logging Pract...
Mining Development Knowledge to Understand and Support Software Logging Pract...SAIL_QU
 
Which Log Level Should Developers Choose For a New Logging Statement?
Which Log Level Should Developers Choose For a New Logging Statement?Which Log Level Should Developers Choose For a New Logging Statement?
Which Log Level Should Developers Choose For a New Logging Statement?SAIL_QU
 
Towards Just-in-Time Suggestions for Log Changes
Towards Just-in-Time Suggestions for Log ChangesTowards Just-in-Time Suggestions for Log Changes
Towards Just-in-Time Suggestions for Log ChangesSAIL_QU
 
The Impact of Task Granularity on Co-evolution Analyses
The Impact of Task Granularity on Co-evolution AnalysesThe Impact of Task Granularity on Co-evolution Analyses
The Impact of Task Granularity on Co-evolution AnalysesSAIL_QU
 
A Framework for Evaluating the Results of the SZZ Approach for Identifying Bu...
A Framework for Evaluating the Results of the SZZ Approach for Identifying Bu...A Framework for Evaluating the Results of the SZZ Approach for Identifying Bu...
A Framework for Evaluating the Results of the SZZ Approach for Identifying Bu...SAIL_QU
 
How are Discussions Associated with Bug Reworking? An Empirical Study on Open...
How are Discussions Associated with Bug Reworking? An Empirical Study on Open...How are Discussions Associated with Bug Reworking? An Empirical Study on Open...
How are Discussions Associated with Bug Reworking? An Empirical Study on Open...SAIL_QU
 
A Study of the Relation of Mobile Device Attributes with the User-Perceived Q...
A Study of the Relation of Mobile Device Attributes with the User-Perceived Q...A Study of the Relation of Mobile Device Attributes with the User-Perceived Q...
A Study of the Relation of Mobile Device Attributes with the User-Perceived Q...SAIL_QU
 
A Large-Scale Study of the Impact of Feature Selection Techniques on Defect C...
A Large-Scale Study of the Impact of Feature Selection Techniques on Defect C...A Large-Scale Study of the Impact of Feature Selection Techniques on Defect C...
A Large-Scale Study of the Impact of Feature Selection Techniques on Defect C...SAIL_QU
 
Studying the Dialogue Between Users and Developers of Free Apps in the Google...
Studying the Dialogue Between Users and Developers of Free Apps in the Google...Studying the Dialogue Between Users and Developers of Free Apps in the Google...
Studying the Dialogue Between Users and Developers of Free Apps in the Google...SAIL_QU
 
What Do Programmers Know about Software Energy Consumption?
What Do Programmers Know about Software Energy Consumption?What Do Programmers Know about Software Energy Consumption?
What Do Programmers Know about Software Energy Consumption?SAIL_QU
 
Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...
Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...
Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...SAIL_QU
 
Revisiting the Experimental Design Choices for Approaches for the Automated R...
Revisiting the Experimental Design Choices for Approaches for the Automated R...Revisiting the Experimental Design Choices for Approaches for the Automated R...
Revisiting the Experimental Design Choices for Approaches for the Automated R...SAIL_QU
 
Measuring Program Comprehension: A Large-Scale Field Study with Professionals
Measuring Program Comprehension: A Large-Scale Field Study with ProfessionalsMeasuring Program Comprehension: A Large-Scale Field Study with Professionals
Measuring Program Comprehension: A Large-Scale Field Study with ProfessionalsSAIL_QU
 

More from SAIL_QU (20)

Studying the Integration Practices and the Evolution of Ad Libraries in the G...
Studying the Integration Practices and the Evolution of Ad Libraries in the G...Studying the Integration Practices and the Evolution of Ad Libraries in the G...
Studying the Integration Practices and the Evolution of Ad Libraries in the G...
 
Studying the Dialogue Between Users and Developers of Free Apps in the Google...
Studying the Dialogue Between Users and Developers of Free Apps in the Google...Studying the Dialogue Between Users and Developers of Free Apps in the Google...
Studying the Dialogue Between Users and Developers of Free Apps in the Google...
 
Improving the testing efficiency of selenium-based load tests
Improving the testing efficiency of selenium-based load testsImproving the testing efficiency of selenium-based load tests
Improving the testing efficiency of selenium-based load tests
 
Studying User-Developer Interactions Through the Distribution and Reviewing M...
Studying User-Developer Interactions Through the Distribution and Reviewing M...Studying User-Developer Interactions Through the Distribution and Reviewing M...
Studying User-Developer Interactions Through the Distribution and Reviewing M...
 
Studying online distribution platforms for games through the mining of data f...
Studying online distribution platforms for games through the mining of data f...Studying online distribution platforms for games through the mining of data f...
Studying online distribution platforms for games through the mining of data f...
 
Understanding the Factors for Fast Answers in Technical Q&A Websites: An Empi...
Understanding the Factors for Fast Answers in Technical Q&A Websites: An Empi...Understanding the Factors for Fast Answers in Technical Q&A Websites: An Empi...
Understanding the Factors for Fast Answers in Technical Q&A Websites: An Empi...
 
Investigating the Challenges in Selenium Usage and Improving the Testing Effi...
Investigating the Challenges in Selenium Usage and Improving the Testing Effi...Investigating the Challenges in Selenium Usage and Improving the Testing Effi...
Investigating the Challenges in Selenium Usage and Improving the Testing Effi...
 
Mining Development Knowledge to Understand and Support Software Logging Pract...
Mining Development Knowledge to Understand and Support Software Logging Pract...Mining Development Knowledge to Understand and Support Software Logging Pract...
Mining Development Knowledge to Understand and Support Software Logging Pract...
 
Which Log Level Should Developers Choose For a New Logging Statement?
Which Log Level Should Developers Choose For a New Logging Statement?Which Log Level Should Developers Choose For a New Logging Statement?
Which Log Level Should Developers Choose For a New Logging Statement?
 
Towards Just-in-Time Suggestions for Log Changes
Towards Just-in-Time Suggestions for Log ChangesTowards Just-in-Time Suggestions for Log Changes
Towards Just-in-Time Suggestions for Log Changes
 
The Impact of Task Granularity on Co-evolution Analyses
The Impact of Task Granularity on Co-evolution AnalysesThe Impact of Task Granularity on Co-evolution Analyses
The Impact of Task Granularity on Co-evolution Analyses
 
A Framework for Evaluating the Results of the SZZ Approach for Identifying Bu...
A Framework for Evaluating the Results of the SZZ Approach for Identifying Bu...A Framework for Evaluating the Results of the SZZ Approach for Identifying Bu...
A Framework for Evaluating the Results of the SZZ Approach for Identifying Bu...
 
How are Discussions Associated with Bug Reworking? An Empirical Study on Open...
How are Discussions Associated with Bug Reworking? An Empirical Study on Open...How are Discussions Associated with Bug Reworking? An Empirical Study on Open...
How are Discussions Associated with Bug Reworking? An Empirical Study on Open...
 
A Study of the Relation of Mobile Device Attributes with the User-Perceived Q...
A Study of the Relation of Mobile Device Attributes with the User-Perceived Q...A Study of the Relation of Mobile Device Attributes with the User-Perceived Q...
A Study of the Relation of Mobile Device Attributes with the User-Perceived Q...
 
A Large-Scale Study of the Impact of Feature Selection Techniques on Defect C...
A Large-Scale Study of the Impact of Feature Selection Techniques on Defect C...A Large-Scale Study of the Impact of Feature Selection Techniques on Defect C...
A Large-Scale Study of the Impact of Feature Selection Techniques on Defect C...
 
Studying the Dialogue Between Users and Developers of Free Apps in the Google...
Studying the Dialogue Between Users and Developers of Free Apps in the Google...Studying the Dialogue Between Users and Developers of Free Apps in the Google...
Studying the Dialogue Between Users and Developers of Free Apps in the Google...
 
What Do Programmers Know about Software Energy Consumption?
What Do Programmers Know about Software Energy Consumption?What Do Programmers Know about Software Energy Consumption?
What Do Programmers Know about Software Energy Consumption?
 
Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...
Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...
Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...
 
Revisiting the Experimental Design Choices for Approaches for the Automated R...
Revisiting the Experimental Design Choices for Approaches for the Automated R...Revisiting the Experimental Design Choices for Approaches for the Automated R...
Revisiting the Experimental Design Choices for Approaches for the Automated R...
 
Measuring Program Comprehension: A Large-Scale Field Study with Professionals
Measuring Program Comprehension: A Large-Scale Field Study with ProfessionalsMeasuring Program Comprehension: A Large-Scale Field Study with Professionals
Measuring Program Comprehension: A Large-Scale Field Study with Professionals
 

Recently uploaded

What are the features of Vehicle Tracking System?
What are the features of Vehicle Tracking System?What are the features of Vehicle Tracking System?
What are the features of Vehicle Tracking System?Watsoo Telematics
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityNeo4j
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfPower Karaoke
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number SystemsJheuzeDellosa
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptkotipi9215
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
cybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningcybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningVitsRangannavar
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 

Recently uploaded (20)

What are the features of Vehicle Tracking System?
What are the features of Vehicle Tracking System?What are the features of Vehicle Tracking System?
What are the features of Vehicle Tracking System?
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdf
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number Systems
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
cybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningcybersecurity notes for mca students for learning
cybersecurity notes for mca students for learning
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 

Mining Software Engineering Data

  • 1. More information available at http://ase.csc.ncsu.edu/dmse/ Mining Software Engineering Data Ahmed E. Hassan Queen’s University www.cs.queensu.ca/~ahmed ahmed@cs.queensu.ca Tao Xie North Carolina State University www.csc.ncsu.edu/faculty/xie xie@csc.ncsu.edu
  • 2. • NSERC/RIM Software Engineering Research Chair Queen’s University, Canada • Leads the SAIL research group at Queen’s • Co-chair for Workshop on Mining Software Repositories (MSR) from 2004-2006 • Chair of the steering committee for MSR 2
  • 3. A. E. Hassan and T. Xie: Mining Software Engineering Data 3 • Associate Professor at North Carolina State University, USA • Leads the ASE research group at NCSU • PC Co-Chair of ICSM 2009, MSR 2011/2012 • Co-organizer of 2007 Dagstuhl Seminar on Mining Programs and Processes
  • 4.
  • 5.
  • 6.
  • 7.
  • 8.
  • 9.
  • 10. http://msrconf.org An international effort to make software repositories actionable http://promisedata.org
  • 11. • Transforms static record- keeping repositories to active repositories • Makes repository data actionable by uncovering hidden patterns and trends 11 MailinglistBugzilla Crashes Field logs CVS/SVN
  • 12. 1212 Field Logs Source Control CVS/SVN Bugzilla Mailing lists Crash Repos Historical Repositories Runtime Repos Code Repos Sourceforge GoogleCode
  • 13. Bugzilla CVS/SVNMailinglist Crashes fixed bug discussions Buggy change & Fixing change Field crashes Estimate fix effort Mark duplicates Suggest experts and fix! New Bug Report
  • 14. Bugzilla CVS/SVNMailinglist Crashes fixed bug Field crashes Suggest APIs Warn about risky code or bugs Suggest locations to co-change New Change discussions Buggy change & Fixing change
  • 15. Example Repositories: Source Control and Bug Repositories
  • 16. A. E. Hassan and T. Xie: Mining Software Engineering Data 16 Source Control Repositories • A source control system tracks changes to ChangeUnits • Example of ChangeUnits: – File (most common) – Function – Dependency (e.g., Call) • Each ChangeUnit: – Records the developer, change time, change message, co- changing Units ChangeListDeveloper Time ChangeChangeUnit Modify Add Remove Change Type * .. * ChangeList Message FI FR GM ChangeList Type
  • 17. A. E. Hassan and T. Xie: Mining Software Engineering Data 17 Determine Initial Entity To Change Change Entity Determine Other Entities To Change Consult Guru for Advice New Req., Bug Fix “How does a change in one source code entity propagate to other entities?” No More Changes For Each Entity Suggested Entity
  • 18. A. E. Hassan and T. Xie: Mining Software Engineering Data 18 • We want: – High Precision to avoid wasting time – High Recall to avoid bugs entitieschanged changedwhichentitiespredicted Recall  entitiespredicted changedwhichentitiespredicted Precision 
  • 19. A. E. Hassan and T. Xie: Mining Software Engineering Data 19 • Mine association rules from change history • Use rules to help propagate changes: – Recall as high as 44% – Precision around 30% • High precision and recall reached in < 1mth • Prediction accuracy improves prior to a release (i.e., during maintenance phase) • Better predictor than static dependencies alone [Zimmermann et al. 05] [Hassan&Holt 04]
  • 20. A. E. Hassan and T. Xie: Mining Software Engineering Data 20 • Traditional dependency graphs and program understanding models usually do not use historical information • Static dependencies capture only a static view of a system – not enough detail! • Development history can help understand the current structure (architecture) of a software system [Hassan & Holt 04]
  • 21. 21 Hardware Trans. Kernel Fault Handler Pager FileSystem Virtual Addr. Maint. VM Policy Subsystem Depend Divergence Hardware Trans. Kernel Fault Handler Pager FileSystem Virtual Addr. Maint. VM Policy Convergence Subsystem Why? Who? When? Where?
  • 22. 22 • Eight unexpected dependencies • All except two dependencies existed since day one: – Virtual Address Maintenance  Pager – Pager  Hardware Translations Which? vm_map_entry_create (in src/sys/vm/Attic/vm_map.c) depends on pager_map (in /src/sys/uvm/uvm_pager.c) Who? cgd When? 1993/04/09 15:54:59 Revision 1.2 of src/sys/vm/Attic/vm_map.c Why? from sean eric fagan: it seems to keep the vm system from deadlocking the system when it runs out of swap + physical memory. prevents the system from giving the last page(s) to anything but the referenced "processes" (especially important is the pager process, which should never have to wait for a free page). Auto-generated from CVS repository
  • 23. • Conway’s Law: “The structure of a software system is a direct reflection of the structure of the development team”
  • 25. A. E. Hassan and T. Xie: Mining Software Engineering Data 25 import org.eclipse.jdt.internal.compiler.lookup.*; import org.eclipse.jdt.internal.compiler.*; import org.eclipse.jdt.internal.compiler.ast.*; import org.eclipse.jdt.internal.compiler.util.*; ... import org.eclipse.pde.core.*; import org.eclipse.jface.wizard.*; import org.eclipse.ui.*; 14% of all files that import ui packages, had to be fixed later on. 71% of files that import compiler packages, had to be fixed later on. [Schröter et al. 06]
  • 26. A. E. Hassan and T. Xie: Mining Software Engineering Data 26 Percentage of bug-introducing changes for eclipse Don’t program on Fridays ;-) [Zimmermann et al. 05]
  • 27. A. E. Hassan and T. Xie: Mining Software Engineering Data 27 • Given a change can we warn a developer that there is a bug in it? – Recall/Precision in 50-60% range [Kim et al. 06]
  • 29. A. E. Hassan and T. Xie: Mining Software Engineering Data 29 • Most open source projects communicate through mailing lists or IRC channels • Rich source of information about the inner workings of large projects • Discussions cover topics such as future plans, design decisions, project policies, code or patch reviews • Social network analysis could be performed on discussion threads
  • 30. A. E. Hassan and T. Xie: Mining Software Engineering Data 30 • Study the content of messages before and after a release • Use dimensions from a psychometric text analysis tool: – After Apache 1.3 release there was a drop in optimism – After Apache 2.0 release there was an increase in sociability [Rigby & Hassan 07]
  • 31. A. E. Hassan and T. Xie: Mining Software Engineering Data 31 • Mailing list activity: – strongly correlates with code change activity – moderately correlates with document change activity • Social network measures (in- degree, out-degree, betweenness) indicate that committers play a more significant role in the mailing list community than non- committers [Bird et al. 06]
  • 32. A. E. Hassan and T. Xie: Mining Software Engineering Data 32 • When will a developer be invited to join a project? – Expertise vs. interest [Bird et al. 07]
  • 34. A. E. Hassan and T. Xie: Mining Software Engineering Data 35 Source data Mined info Variable names and function names Software categories [Kawaguchi et al. 04] Statement seq in a basic block Copy-paste code [Li et al. 04] Set of functions, variables, and data types within a C function Programming rules [Li&Zhou 05] Sequence of methods within a Java method API usages [Xie&Pei 06] API method signatures API Jungloids [Mandelin et al. 05]
  • 35. A. E. Hassan and T. Xie: Mining Software Engineering Data 36 • How should an API be used correctly? – An API may serve multiple functionalities – Different styles of API usage • “I know what type of object I need, but I don’t know how to write the code to get the object” [Mandelin et al. 05] – Can we synthesize jungloid code fragments automatically? – Given a simple query describing the desired code in terms of input and output types, return a code segment • “I know what method call I need, but I don’t know how to write code before and after this method call” [Xie&Pei 06]
  • 36. A. E. Hassan and T. Xie: Mining Software Engineering Data 37 • Mine framework reuse patterns [Michail 00] – Membership relationships • A class contains membership functions – Reuse relationships • Class inheritance/ instantiation • Function invocations/overriding • Mine software plagiarism [Liu et al. 06] – Program dependence graphs [Michail 99/00] http://codeweb.sourceforge.net/ for C++
  • 37. A. E. Hassan and T. Xie: Mining Software Engineering Data 38 • Apply closed sequential pattern mining techniques • Customizing the techniques – A copy-paste segment typically does not have big gaps – use a maximum gap threshold to control – Output the instances of patterns (i.e., the copy-pasted code segments) instead of the patterns – Use small copy-pasted segments to form larger ones – Prune false positives: tiny segments, unmappable segments, overlapping segments, and segments with large gaps [Li et al. 04]
  • 38. A. E. Hassan and T. Xie: Mining Software Engineering Data 39 • For two copy-pasted segments, are the modifications consistent? – Identifier a in segment S1 is changed to b in segment S2 3 times, but remains unchanged once – likely a bug – The heuristic may not be correct all the time • The lower the unchanged rate of an identifier, the more likely there is a bug [Li et al. 04]
  • 40. A. E. Hassan and T. Xie: Mining Software Engineering Data 41 • Goal: mine specifications (pre/post conditions) or object behavior (object transition diagrams) • State of an object – Values of transitively reachable fields • Method-entry state – Receiver-object state, method argument values • Method-exit state – Receiver-object state, updated method argument values, method return value [Ernst et al. 02] http://pag.csail.mit.edu/daikon/ [Xie&Notkin 04/05][Dallmeier et al. 06] http://www.st.cs.uni-sb.de/models/
  • 41. A. E. Hassan and T. Xie: Mining Software Engineering Data 42 • Goal: detect or locate bugs • Values of variables at certain code locations [Hangal&Lam 02] – Object/static field read/write – Method-call arguments – Method returns • Sampled predicates on values of variables [Liblit et al. 03/05][Liu et al. 05] [Hangal&Lam 02] http://diduce.sourceforge.net/ [Liblit et al. 03/05] http://www.cs.wisc.edu/cbi/ [Liu et al. 05] http://www.ews.uiuc.edu/~chaoliu/sober.htm
  • 42. A. E. Hassan and T. Xie: Mining Software Engineering Data 43 • Goal: locate bugs • Executed branches/paths, def-use pairs • Executed function/method calls – Group methods invoked on the same object • Profiling options – Execution hit vs. count – Execution order (sequences) [Dallmeier et al. 05] http://www.st.cs.uni-sb.de/ample/ More related tools: http://www.csc.ncsu.edu/faculty/xie/research.htm#related
  • 43. A. E. Hassan and T. Xie: Mining Software Engineering Data 44 • Given a program state S and an event e, predict whether e likely results in a bug – Positive samples: past bugs – Negative samples: “not bug” reports • A k-NN based approach – Consider the k closest cases reported before – Compare Σ 1/d for bug cases and not-bug cases, where d is the similarity between the current state and the reported states – If the current state is more similar to bugs, predict a bug [Michail&Xie 05]
  • 44. A. E. Hassan and T. Xie: Mining Software Engineering Data 45 • A method executed only in failing runs is likely to point to the defect – Comparing the coverage of passing and failing program runs helps • Mining patterns frequent in failing program runs but infrequent in passing program runs – Sequential patterns may be used [Dallmeier et al. 05, Denmat et al. 05]
  • 45.
  • 46. Must show value before data quality improves Correlation vs. Causation
  • 47. A. E. Hassan and T. Xie: Mining Software Engineering Data 48 • Very active research area in SE: – MSR is the most attended ICSE event in last 8 yrs • http://msrconf.org – Special Issue of IEEE TSE 2005 on MSR: • 15 % of all submissions of TSE in 2004 • Fastest review cycle in TSE history: 8 months – Special Issue Empirical Software Engineering 2009, 2011 – MSR 2012!
  • 48. A. E. Hassan and T. Xie: Mining Software Engineering Data 49 • Report the statistical significance of your results: – Get a statistics book (one for social scientist, not for mathematicians) • Discuss any limitations of your findings based on the characteristics of the studied repositories: – Make sure you manually examine the repositories. Do not fully automate the process! – Use random sampling to resolve issues about data noise • Relevant conferences/workshops: – main SE conferences, ICSM, ISSTA, MSR, WODA, PROMISE …