Mining Unstructured Software Repositories Using IR Models
Stephen W. Thomas
PhD Candidate
Queen’s University
Stephen W. Thomas
Mining Software Repositories with Topic Models.
ICSE 2011
Stephen W. Thomas, Hadi Hemmati, Ahmed E. Hassan, and Dorothea Blostein
Static Test Case Prioritization Using Topic Models.
Empirical Software Engineering, 2012
Stephen W. Thomas, Nicolas Bettenburg, Ahmed E. Hassan, and Dorothea Blostein
Talk and Work: Recovering the Relationship between Mailing List Discussions and Development Activity.
Empirical Software Engineering, 2nd round
Stephen W. Thomas, Meiyappan Nagappan, Ahmed E. Hassan, and Dorothea Blostein
The Impact of Classifier Configuration and Classifier Combination on Bug Localization.
IEEE Transactions on Software Engineering, 2nd round
Stephen W. Thomas, Bram Adams, Ahmed E. Hassan, and Dorothea Blostein
Validating the Use of Topic Models for Software Evolution.
SCAM 2010
Stephen W. Thomas, Bram Adams, Ahmed E. Hassan, and Dorothea Blostein
Modeling the Evolution of Topics in Source Code Histories.
MSR 2011
Stephen W. Thomas, Bram Adams, Ahmed E. Hassan, and Dorothea Blostein
Studying Software Evolution Using Topic Models.
Science of Computer Programming, 2012
code changes
logs
bugs
email
reqs
bug prediction
traceability linking
feature location
architecture recovery
change pattern detection
00:03:45: E22344, 76, 90.3,
00:03:46: E2f3a4, 82, 95.0,
00:03:56: E22345, 78, 96.6,
00:04:15: E22344, 23, 95.1,
00:04:35: E23348, 65, 95.7,
00:04:37: E2234b, 56, 93.1,
00:04:38: E2234b, 54, 95.0,
00:04:39: E22a34, 98, 95.1,
00:05:42: E353f4, 65, 94.7,
00:05:42: E3556j, 45, 95.2,
00:05:42: E3545g, 63, 92.8,
00:05:42: E354r4, 94, 95.6,
source code comments
bug reports
emails
requirement descriptions
forum and blog posts
commit messages
source code identifiers
NPE caused by no spashscreen handler service available
Provide unittests for link creation constraints, unit tests fail in standalone build
pricing
Conference
Service
Part 1
Part 2
Part 3
The research and practice of using IR models to
mine software repositories can be improved by
(i) considering additional software engineering
tasks, such as prioritizing test cases;
(ii) using advanced IR techniques, such as
combining multiple IR models; and
(iii) better understanding the assumptions and
parameters of IR models.
Test Case Prioritization
Less similar
Higher priority
Similarity
identifiers
comments
string literals
Part 1
[EMSE 2012]
structural-based IR-based
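The "less similar, higher priority" idea on this slide can be sketched as a greedy ordering. This is a minimal illustration under stated assumptions, not the technique evaluated in the EMSE paper: plain term-frequency vectors with cosine distance stand in for the topic model, and the test names and test texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercased word tokens; a stand-in for real identifier/comment extraction."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_distance(a, b):
    """1 - cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def prioritize(tests):
    """Greedy order: repeatedly pick the test farthest from those already chosen."""
    vectors = {name: Counter(tokens(src)) for name, src in tests.items()}
    remaining = sorted(vectors)
    # Start with the test that is, in total, least similar to all the others.
    first = max(remaining, key=lambda n: sum(cosine_distance(vectors[n], vectors[m]) for m in remaining))
    order = [first]
    remaining.remove(first)
    while remaining:
        # Max-min rule: score each candidate by its distance to the closest selected test.
        nxt = max(remaining, key=lambda n: min(cosine_distance(vectors[n], vectors[s]) for s in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Hypothetical test cases, reduced to the words in their identifiers and comments.
tests = {
    "test_login": "check user login password session",
    "test_logout": "check user logout session",
    "test_export": "write xml report to file",
}
order = prioritize(tests)
```

Here `test_export` shares no words with the other two tests, so it is scheduled first; the two near-duplicate login tests follow.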
Source code ↔ Email Interaction
cleaning and
preprocessing
identifiers
comments
string literals
mail
code
XML
printing
installation
GUI
Code
Mail
Time
Activity
XML
Monitoring project status
Software explanation
Training and documentation
Part 1
[EMSE 20XX]
Combining Multiple IR Models
identifiers
comments
string literals
Bug report
Similarity
title
description
Best individual
IR model
Random subset,
combined
Part 2
[TSE 20XX] sets had improved performance median improvement
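One simple way to combine the ranked lists produced by several IR models is a Borda count. The sketch below is illustrative only: the slide does not say which combination scheme was used, and the model names and file names are hypothetical.

```python
from collections import defaultdict


def borda_combine(rankings):
    """Combine ranked lists of file names with a Borda count.

    Each ranking lists the same candidate files, most relevant first.
    A file at position p (0-based) in a list of n files earns n - p points.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] += n - position
    # Highest total first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical outputs of three IR models for one bug report.
vsm_ranking = ["Parser.java", "Lexer.java", "Ui.java"]
lsi_ranking = ["Lexer.java", "Parser.java", "Ui.java"]
lda_ranking = ["Parser.java", "Ui.java", "Lexer.java"]

combined = borda_combine([vsm_ranking, lsi_ranking, lda_ranking])
```

A rank-based scheme like this avoids having to normalize similarity scores that different models report on different scales.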
XML concept
Swing concept
Encryption concept
Time
Popularity
Concept Evolution Models
identifiers
comments
string literals
Part 2
[SCP 2012]
[SCAM 2010]
accuracy of topic evolutions
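The popularity-over-time curves for each concept can be approximated by averaging topic memberships per version. This is a sketch under assumptions: the membership vectors would come from a topic model fitted elsewhere, the version labels and numbers are made up, and the averaging metric is one plausible choice rather than the one defined in the cited papers.

```python
def topic_popularity(memberships_by_version):
    """Average topic membership per version.

    memberships_by_version maps a version label to a list of per-document
    topic-membership vectors (each vector sums to roughly 1).
    Returns {version: [mean membership of topic 0, topic 1, ...]}.
    """
    evolution = {}
    for version, docs in memberships_by_version.items():
        n_topics = len(docs[0])
        totals = [0.0] * n_topics
        for doc in docs:
            for k, weight in enumerate(doc):
                totals[k] += weight
        evolution[version] = [t / len(docs) for t in totals]
    return evolution


# Hypothetical memberships for two topics in two versions.
memberships = {
    "v1.0": [[0.5, 0.5], [1.0, 0.0]],
    "v2.0": [[0.25, 0.75], [0.25, 0.75]],
}
evolution = topic_popularity(memberships)
```

Plotting `evolution[version][k]` against version for a fixed topic `k` gives one curve of the kind shown on the slide.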
Data Duplication Problem
identical
Part 3
[MSR 2011]
accuracy
sensitivity
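One way to keep identical copies of a file out of the model input is to feed it only what changed between consecutive versions. The sketch below uses Python's standard `difflib` for this; the function names and the sample history are hypothetical, and the details may differ from the approach in the MSR 2011 paper.

```python
import difflib


def changed_lines(old_source, new_source):
    """Return only the lines added or removed between two versions of a file."""
    diff = difflib.ndiff(old_source.splitlines(), new_source.splitlines())
    # ndiff prefixes: "+ " added, "- " removed, "  " unchanged, "? " hint lines.
    return [line[2:] for line in diff if line.startswith(("+ ", "- "))]


def delta_documents(versions):
    """Turn a file history into documents that contain only the changes."""
    docs = [versions[0].splitlines()]  # the first version is kept whole
    for old, new in zip(versions, versions[1:]):
        docs.append(changed_lines(old, new))
    return docs


# Hypothetical three-version history; the second version is an identical copy.
history = [
    "open file\nread data\nclose file",
    "open file\nread data\nclose file",
    "open file\nread data\nparse xml\nclose file",
]
docs = delta_documents(history)
```

The unchanged second version contributes an empty document, so it no longer adds duplicate evidence to the model.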
Preprocessing and Parameter Effects
Code representation
identifiers? comments?
past bug reports?
Bug report representation
title? description?
Preprocessing
split identifiers? remove stop words?
word stemming?
IR Model parameters
term weighting?
No. of topics? similarity measure?
No. of iterations?
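Two of the preprocessing choices listed above, splitting identifiers and removing stop words, can be sketched as switchable steps. The stop-word list and the function names are illustrative assumptions; stemming is left out because it normally relies on a dedicated stemmer.

```python
import re

# Illustrative stop-word list; real studies use much longer lists.
STOP_WORDS = {"the", "a", "an", "of", "to", "is", "in", "and"}


def split_identifier(identifier):
    """Split camelCase and snake_case identifiers into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [p.lower() for p in parts]


def preprocess(text, split=True, remove_stop_words=True):
    """Tokenize text, optionally splitting identifiers and dropping stop words."""
    words = []
    for token in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text):
        words.extend(split_identifier(token) if split else [token.lower()])
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    return words


result = preprocess("getXMLParser reads the input_file")
unsplit = preprocess("getXMLParser reads the input_file", split=False)
```

With splitting on, `getXMLParser` becomes three searchable words; with it off, the identifier stays one opaque token that is unlikely to match the wording of a bug report.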
Configuration matters!
worst:
best:
mean:
Part 3
[TSE 20XX]
“configuration”
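A "configuration" in this sense is one choice for every question on the slide. The sketch below enumerates the full grid so that each configuration can be evaluated and the worst, best, and mean results compared. The option values are hypothetical placeholders, not the ones studied in the paper.

```python
from itertools import product

# Hypothetical option values for each question on the slide.
OPTIONS = {
    "code_representation": ["identifiers", "comments", "identifiers+comments"],
    "bug_report_representation": ["title", "description", "title+description"],
    "split_identifiers": [True, False],
    "remove_stop_words": [True, False],
    "stemming": [True, False],
    "similarity": ["cosine", "overlap"],
}


def configurations(options):
    """Yield every combination of option values as a dict."""
    keys = list(options)
    for values in product(*(options[k] for k in keys)):
        yield dict(zip(keys, values))


all_configs = list(configurations(OPTIONS))
```

Even this small grid yields 144 configurations (3 x 3 x 2 x 2 x 2 x 2), which shows how quickly the space grows once model parameters such as the number of topics are added.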
Proposed and evaluated a technique to prioritize test cases
Proposed and evaluated a technique to analyze the interaction of source code and mailing lists
Described and evaluated a technique to analyze code histories using topic evolution models
Proposed and evaluated a framework for combining the results of disparate IR models
Overcame the data duplication problem in large source code histories
Analyzed the sensitivity of IR models to data preprocessing and IR model parameters

Hand Rolled Applicative User Validation Code Kata
Philip Schwarz
 
Oracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptxOracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptx
Remote DBA Services
 
APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)
Boni García
 
What is Augmented Reality Image Tracking
What is Augmented Reality Image TrackingWhat is Augmented Reality Image Tracking
What is Augmented Reality Image Tracking
pavan998932
 
Energy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina JonuziEnergy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina Jonuzi
Green Software Development
 
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
mz5nrf0n
 
Mobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona InfotechMobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona Infotech
Drona Infotech
 
openEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain SecurityopenEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain Security
Shane Coughlan
 
Empowering Growth with Best Software Development Company in Noida - Deuglo
Empowering Growth with Best Software  Development Company in Noida - DeugloEmpowering Growth with Best Software  Development Company in Noida - Deuglo
Empowering Growth with Best Software Development Company in Noida - Deuglo
Deuglo Infosystem Pvt Ltd
 
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissancesAtelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Neo4j
 
E-commerce Application Development Company.pdf
E-commerce Application Development Company.pdfE-commerce Application Development Company.pdf
E-commerce Application Development Company.pdf
Hornet Dynamics
 
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
Łukasz Chruściel
 
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeA Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
Aftab Hussain
 
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Crescat
 
GreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-JurisicGreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-Jurisic
Green Software Development
 
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Łukasz Chruściel
 
Webinar On-Demand: Using Flutter for Embedded
Webinar On-Demand: Using Flutter for EmbeddedWebinar On-Demand: Using Flutter for Embedded
Webinar On-Demand: Using Flutter for Embedded
ICS
 

Recently uploaded (20)

AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI AppAI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
 
Why Choose Odoo 17 Community & How it differs from Odoo 17 Enterprise Edition
Why Choose Odoo 17 Community & How it differs from Odoo 17 Enterprise EditionWhy Choose Odoo 17 Community & How it differs from Odoo 17 Enterprise Edition
Why Choose Odoo 17 Community & How it differs from Odoo 17 Enterprise Edition
 
Hand Rolled Applicative User Validation Code Kata
Hand Rolled Applicative User ValidationCode KataHand Rolled Applicative User ValidationCode Kata
Hand Rolled Applicative User Validation Code Kata
 
Oracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptxOracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptx
 
APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)
 
What is Augmented Reality Image Tracking
What is Augmented Reality Image TrackingWhat is Augmented Reality Image Tracking
What is Augmented Reality Image Tracking
 
Energy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina JonuziEnergy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina Jonuzi
 
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
 
Mobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona InfotechMobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona Infotech
 
openEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain SecurityopenEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain Security
 
Empowering Growth with Best Software Development Company in Noida - Deuglo
Empowering Growth with Best Software  Development Company in Noida - DeugloEmpowering Growth with Best Software  Development Company in Noida - Deuglo
Empowering Growth with Best Software Development Company in Noida - Deuglo
 
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissancesAtelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissances
 
E-commerce Application Development Company.pdf
E-commerce Application Development Company.pdfE-commerce Application Development Company.pdf
E-commerce Application Development Company.pdf
 
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
 
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeA Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
 
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
 
GreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-JurisicGreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-Jurisic
 
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
 
Webinar On-Demand: Using Flutter for Embedded
Webinar On-Demand: Using Flutter for EmbeddedWebinar On-Demand: Using Flutter for Embedded
Webinar On-Demand: Using Flutter for Embedded
 

Mining Unstructured Software Repositories Using IR Models

  • 1. Mining Unstructured Software Repositories Using IR Models. Stephen W. Thomas, PhD Candidate, Queen’s University
  • 2. Stephen W. Thomas. Mining Software Repositories with Topic Models. ICSE 2011. Stephen W. Thomas, Hadi Hemmati, Ahmed E. Hassan, and Dorothea Blostein. Static Test Case Prioritization Using Topic Models. Empirical Software Engineering, 2012. Stephen W. Thomas, Nicolas Bettenburg, Ahmed E. Hassan, and Dorothea Blostein. Talk and Work: Recovering the Relationship between Mailing List Discussions and Development Activity. Empirical Software Engineering, 2nd round. Stephen W. Thomas, Meiyappan Nagappan, Ahmed E. Hassan, and Dorothea Blostein. The Impact of Classifier Configuration and Classifier Combination on Bug Localization. IEEE Transactions on Software Engineering, 2nd round. Stephen W. Thomas, Bram Adams, Ahmed E. Hassan, and Dorothea Blostein. Validating the Use of Topic Models for Software Evolution. SCAM 2010. Stephen W. Thomas, Bram Adams, Ahmed E. Hassan, and Dorothea Blostein. Modeling the Evolution of Topics in Source Code Histories. MSR 2011. Stephen W. Thomas, Bram Adams, Ahmed E. Hassan, and Dorothea Blostein. Studying Software Evolution Using Topic Models. Science of Computer Programming, 2012.
  • 3. code changes, logs, bugs, email, reqs; bug prediction, traceability linking, feature location, architecture recovery, change pattern detection
  • 4. 00:03:45: E22344, 76, 90.3, 00:03:46: E2f3a4, 82, 95.0, 00:03:56: E22345, 78, 96.6, 00:04:15: E22344, 23, 95.1, 00:04:35: E23348, 65, 95.7, 00:04:37: E2234b, 56, 93.1, 00:04:38: E2234b, 54, 95.0, 00:04:39: E22a34, 98, 95.1, 00:05:42: E353f4, 65, 94.7, 00:05:42: E3556j, 45, 95.2, 00:05:42: E3545g, 63, 92.8, 00:05:42: E354r4, 94, 95.6; source code comments, bug reports, emails, requirement descriptions, forum and blog posts, commit messages, source code identifiers
  • 5. “NPE caused by no spashscreen handler service available”; “Provide unittests for link creation constraints, unit tests fail in standalone build”
  • 9. The research and practice of using IR models to mine software repositories can be improved by (i) considering additional software engineering tasks, such as prioritizing test cases; (ii) using advanced IR techniques, such as combining multiple IR models; and (iii) better understanding the assumptions and parameters of IR models.
  • 10. Test Case Prioritization (Part 1) [EMSE 2012]: identifiers, comments, string literals; similarity (less similar, higher priority); structural-based, IR-based
  • 11. Source code ↔ Email Interaction (Part 1) [EMSE 20XX]: cleaning and preprocessing; identifiers, comments, string literals; mail, code; XML, printing, installation, GUI; Code, Mail; Time, Activity; XML; Monitoring project status; Software explanation; Training and documentation
  • 13. Combining Multiple IR Models (Part 2) [TSE 20XX]: identifiers, comments, string literals; bug report title, description; similarity; best individual IR model; random subset, combined; sets had improved performance; median improvement
  • 14. Concept Evolution Models (Part 2) [SCP 2012] [SCAM 2010]: XML concept, Swing concept, Encryption concept; Time, Popularity; identifiers, comments, string literals; accuracy of topic evolutions
  • 16. Data Duplication Problem (Part 3) [MSR 2011]: identical; accuracy, sensitivity
  • 17. Preprocessing and Parameter Effects (Part 3) [TSE 20XX]: Code representation: identifiers? comments? past bug reports? Bug report representation: title? description? Preprocessing: split identifiers? remove stop words? word stemming? IR model parameters: term weighting? No. of topics? similarity measure? No. of iterations? Configuration matters! worst: best: mean: “configuration”
  • 18. New! Part 1: Proposed and evaluated a technique to prioritize test cases. Proposed and evaluated a technique to analyze the interaction of source code and mailing lists. Part 2: Described and evaluated a technique to analyze code histories using topic evolution models. Proposed and evaluated a framework for combining the results of disparate IR models. Part 3: Overcame the data duplication problem in large source code histories. Analyzed the sensitivity of IR models to data preprocessing and IR model parameters.

Editor's Notes

  1. This diagram describes the field of Mining Software Repositories. The overall goal is to take software repositories (which are readily available datasets about a software project, such as [list a few]), apply data mining and machine learning techniques, and come out with some actionable knowledge that will help developers in some way. For example: bug prediction, traceability linking, feature location, …
  2. In current research, the majority of the repositories that are mined are structured: call graphs, parse trees, execution logs. However, there are also many repositories that are unstructured: [name them] In fact, research has shown that about 80% of the content in software repositories is unstructured, meaning that we need to consider this data if we want to take full advantage of the software repositories.
  3. However, unstructured data brings with it many challenges. Consider these two seemingly innocent bug reports from one of my case studies. Here we see many difficulties, such as undefined acronyms; spelling errors and typos; inconsistent usages; no labels; vague wording. These problems exist because most unstructured data comes in the form of natural language text written by humans, which is notoriously difficult for a computer to deal with.
  4. In an attempt to deal with unstructured software repositories, researchers have begun to use IR. IR models come from the NLP community, and are a good fit for our problem because they were designed to handle many of the problems of unstructured data. IR models help you search, organize, and provide structure for your unstructured data. IR models use a simplifying assumption of the data, called the “bag of words” approach. This means that word order is not considered in IR models. By ignoring word order, analysis is simpler and faster, and the techniques can scale to large datasets. And we demonstrate that despite this simplifying assumption, IR models actually perform quite well in many scenarios. Initial successes: concept location; document clustering; new code metrics; code search engines; traceability linking.
  5. To understand how IR models have been used in MSR, I did a thorough literature review of all papers that use IR models to mine unstructured data. In all, there are about 67 papers. I analyzed the trends and common usages, and found three shortcomings of the state of the art, i.e., some areas where we could improve. My thesis proposes solutions to each of these three shortcomings.
  6. First shortcoming: most papers that use IR models only perform one of two software engineering tasks: concept location and traceability linking. There’s nothing wrong with these applications, but I propose that we can go beyond these two tasks and use IR models to perform new SE tasks, and help software developers even further. Second shortcoming: most papers use only the most basic IR models, such as the Vector Space Model (1975, 37 years ago). I propose that we use some of the more advanced, Superman-like IR techniques, which may bring better results and new capabilities to software developers. Third shortcoming: most papers use IR models as off-the-shelf black boxes, without fully understanding how their parameters work, what input is required, and what the output means. I propose that we develop a better understanding of how IR models work, which will allow us to take full advantage of their potential, and improve results for software developers.
  7. My thesis statement has a parallel structure: [read]
  8. In TCP, the goal is to take an unordered set of test cases, and provide an ordering such that more bugs are detected earlier in the testing process. By doing so, if the test suite must be stopped early, then you can rest assured that you have detected as many bugs as possible. Typically, TCP is tackled by using some sort of structural code coverage metric, that says: hey, how much code does this test case execute? If it executes a lot of code, then let’s give it a high priority. Otherwise, let’s give it a low priority. This is how it’s traditionally done. However, I propose that we can use IR models to solve the same problem, only with the additional advantage of not having to run the test case to collect the execution information. Here’s how. First, we extract the unstructured information from the source code: identifier names, comments, and string literals. Then, we compute the IR similarity between each pair of test cases. This will tell us if the test cases are textually similar or not. Then, if a test case is not very similar to other test cases, we give it a higher priority. The thought here is: if two test cases are exactly the same, then they will find the same bugs, so we don’t need to execute both. So we’re looking for test cases that are highly unlike any other test case, because they will detect unique bugs. We did a case study on five real-world systems, and found that our IR-based approach was as good as or better than existing approaches to prioritizing test cases.
  9. The first advanced technique I propose is that of combining multiple IR models. Let me explain this in the context of bug localization. […] A simple way to combine models is to just add the scores of each file from the various IR models. That way, if a file gets a high score in several models, it will shoot up to the top in the combined model. Another way is expert voting, where only the rank of each file is used, as opposed to the score. Either way, the end goal is to utilize the “expertise” of each model.
  10. If a manager or developer had a dashboard that magically told them what developers were working on, and when, at a high level, they would be very happy. This would keep them informed, allow them to perform retrospective analysis, and maybe even be part of a preemptive maintenance solution that automatically monitored the “health” of the source code over time. To achieve this goal, we use an advanced IR model called a topic evolution model. It works by [explain]. We input these versions into the topic evolution model, which gives us exactly what we’re looking for. A case study found that a large majority of the discovered evolutions were in sync with how developers described the project, and since this technique is automatic, it will be helpful to use in an automatic dashboard setting.
  11. During my research, I came across an issue which I now call the “data duplication problem”. When I tried to analyze the evolution of long-lived systems with many different versions, I found that the IR model was producing unusual and unexpected results. Things just didn’t make sense: the topics were weird, and something was off, but I didn’t know what. Upon further analysis, I learned that the cause of this problem was that in source code, hardly any of the words change between versions. A new version typically contains some bug fixes and some new features, but these only affect at most 1% of the lines of code, meaning that 99% of the data is exactly the same. It’s identical. This was throwing the IR models out of balance, and causing the problems that we experienced. The reason is, IR models weren’t originally designed for source code. They were designed for newspaper articles or books. So version 1 here might contain all the newspaper articles in January, and version 2 contains all the newspaper articles in February. Sure, there might be some overlap, but in general we do not expect that 99% of the articles in February are exact duplicates from January. I believe that someone would be fired from the newspaper if this happened. So I proposed a model that better handles this data duplication inherent to source code. It inputs only the differences between versions into the IR model. This keeps everything in balance because it meets the implicit assumptions made by the IR model. Our case studies showed that results are better when the duplication is removed.
  12. Another way to better understand IR models is to understand their parameters and configurations. IR models have a lot of dials, knobs, and switches that you can tweak. For example, … Currently, researchers don’t focus on these parameters, and just seem to randomly choose settings without fully understanding the associated consequences. To better understand the parameters, we ran a large empirical case study. We had 8000 bug reports, and we ran each of them through 3,168 IR model configurations. We found that there is a huge difference in performance between the various configurations. For example, the worst IR model could only achieve 1% accuracy; the best could get as high as 55%. And the mean was 23%. So the range was quite big, as was the variance. In addition, in this study we were able to determine which configurations were best, so that researchers, tool vendors, and developers could use these when building their own IR-based solutions.
  13. Let me conclude by summarizing the main contributions of this thesis. First, I proposed new applications of IR models in SE: TCP, and measuring the interaction of email and source code. I also proposed that we start using more advanced IR techniques in our work, such as topic evolution models and model combination. Finally, I proposed that if we increase our understanding of IR models, we can further improve results. The two studies have shown that by looking into the details of IR models, instead of treating them as black boxes, we can improve our techniques and get better results. My broader research vision is to provide better tools, techniques, and insights for software development teams, so that they can build better software at lower costs and have happier customers. In this thesis, I have taken a step towards that vision by proposing and evaluating ways to better utilize the unstructured elements of software repositories, which in turn provide new and better capabilities for software developers.
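
The prioritization idea in note 8 (extract text, compute pairwise similarity, run the least similar tests first) can be sketched in a few lines. This is a minimal illustration, not the technique evaluated in the case study: it assumes raw bag-of-words term counts and cosine similarity as the IR model, and the test names and text in `tests` are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; word order is deliberately discarded."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def prioritize(tests):
    """Order test names so the least similar (most distinctive) come first."""
    vectors = {name: bag_of_words(text) for name, text in tests.items()}

    def mean_similarity(name):
        others = [
            cosine(vectors[name], vec)
            for other, vec in vectors.items()
            if other != name
        ]
        return sum(others) / len(others) if others else 0.0

    return sorted(vectors, key=mean_similarity)


# Hypothetical text extracted from three test cases
# (identifier names, comments, string literals).
tests = {
    "test_login_ok": "login user password session valid",
    "test_login_bad": "login user password session invalid",
    "test_export_xml": "export xml file writer encoding",
}
order = prioritize(tests)
print(order)
```

Here the export test shares no words with the two login tests, so it is scheduled first; the two near-duplicate login tests follow.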
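
The two combination strategies in note 9 (adding scores, and voting on ranks only) can be sketched as follows. The file names and scores are hypothetical, and the Borda-style point scheme is an assumed way to implement rank voting, not necessarily the one used in the study. Score addition also assumes the models produce scores on comparable scales.

```python
def combine_by_score(model_scores):
    """Sum each file's score across models; highest total ranks first."""
    totals = {}
    for scores in model_scores:
        for path, score in scores.items():
            totals[path] = totals.get(path, 0.0) + score
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(totals, key=lambda p: (-totals[p], p))


def combine_by_rank(model_scores):
    """Borda-style voting: only each file's rank within a model counts."""
    points = {}
    for scores in model_scores:
        ranked = sorted(scores, key=lambda p: (-scores[p], p))
        for position, path in enumerate(ranked):
            points[path] = points.get(path, 0) + (len(ranked) - position)
    return sorted(points, key=lambda p: (-points[p], p))


# Hypothetical similarity scores between one bug report and three files,
# as produced by two different IR models.
model_a = {"Parser.java": 0.90, "Lexer.java": 0.20, "Ui.java": 0.10}
model_b = {"Parser.java": 0.30, "Lexer.java": 0.35, "Ui.java": 0.05}

by_score = combine_by_score([model_a, model_b])
by_rank = combine_by_rank([model_a, model_b])
print(by_score)
print(by_rank)
```

The two strategies can disagree: score addition lets one very confident model dominate, while rank voting treats every model's ordering equally.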
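
The fix for the data duplication problem in note 11 is to feed the IR model only what changed between versions. A minimal sketch of that preprocessing step is below. It assumes each version is a single string, that the first version is kept whole, and that both added and removed words count as the "difference"; those choices and the sample text are assumptions for illustration.

```python
import re
from collections import Counter


def tokens(source):
    """Bag of words for one version of a file."""
    return Counter(re.findall(r"[A-Za-z_]+", source))


def delta_documents(versions):
    """Keep the first version whole, then only the words that changed."""
    docs = [tokens(versions[0])]
    for old, new in zip(versions, versions[1:]):
        before, after = tokens(old), tokens(new)
        added = after - before    # Counter subtraction drops non-positive counts
        removed = before - after
        docs.append(added + removed)
    return docs


# Hypothetical two-version history in which most of the text is unchanged.
versions = [
    "open file read file close file",
    "open file read file close file log error",
]
docs = delta_documents(versions)
print(docs[1])
```

The second document contains only `log` and `error`, so the unchanged words no longer appear once per version in the model's input.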
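
The configuration study in note 12 amounts to enumerating every combination of preprocessing and model settings and evaluating each one. The sketch below shows only the enumeration step. The option names follow the slide for that study, but the specific values are hypothetical and do not reproduce the 3,168 configurations.

```python
from itertools import product

# Hypothetical option values; the study's actual settings are not listed here.
options = {
    "split_identifiers": [True, False],
    "remove_stop_words": [True, False],
    "stemming": [True, False],
    "term_weighting": ["tf", "tf-idf"],
    "similarity": ["cosine", "overlap"],
}

# One dict per configuration: the Cartesian product of all option values.
configurations = [
    dict(zip(options, values)) for values in product(*options.values())
]
print(len(configurations))
```

Each configuration would then be run over the bug reports and scored, so that the worst, best, and mean accuracy can be compared across the whole grid.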