Using IR Methods for Labeling Source Code Artifacts: Is It Worthwhile?
Andrea De Lucia
Massimiliano Di Penta
Rocco Oliveto
Annibale Panichella
Sebastiano Panichella
Context
• Source code is text too!
• Lexicon quality impacts software quality
• IR techniques used to analyze software
• Emerging application: label software artifacts
• Labeling packages [Kuhn et al., 2007]
• Labeling changes [Thomas et al., 2010]
• Relate topics in high-level artifacts and source code [Gethers et al., 2011]
ok... but...
Are these automatic
labelings meaningful?
Related study: Haiduc et al., 2010
Empirical Study
Goal: compare human-generated source code labelings with automatically generated ones
Quality focus: quality of automatically generated source code labelings
Perspective: researchers interested in developing source code labeling techniques
Research Questions
RQ1: How large is the overlap between the keywords identified by developers when describing a source code artifact and those identified by an automatic technique?
RQ2: Which characteristics of source code artifacts affect the overlap between automatic labeling techniques and the human-generated labels?
Context
Objects:
10 classes from eXVantage (industrial test data generation tool)
10 classes from JHotDraw
Subjects:
17 software engineering students (Bachelor's degree in CS, Univ. of Molise, second year)
Study Procedure
Procedure Overview
1. Participants' training on the system
2. Presentation of the experiment procedure
3. Manual labeling by participants
4. Automatic labeling
5. Comparison
Manual Labeling
• Subjects label each class by selecting 10 words from it
• Time spent on each class is recorded
• Offline study, lasted 2 weeks
Example: 10 words selected from a source code file: book, hotel, room, reservation, arrival, departure, smoking, double, card, breakfast
Aggregating manual labeling
Each artifact is labeled using terms selected
by at least 50% of the subjects
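Example with three subjects:
Subject 1: book, hotel, room, reservation, arrival, departure, smoking, double, card, breakfast
Subject 2: book, hotel, room, refund, arrival, check, parking, double, suite, group
Subject 3: confirmation, room, reservation, arrival, departure, date, bed, card, payment, spa
Terms selected by at least 2 of the 3 subjects (with counts): room 3, arrival 3, book 2, hotel 2, reservation 2, departure 2, double 2, card 2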
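The aggregation rule can be sketched in a few lines of Python. This is only an illustration using the three example word lists from the slide; the function name `aggregate_labels` and the `min_share` parameter are assumptions made for this sketch, not part of the study's tooling.

```python
from collections import Counter


def aggregate_labels(selections, min_share=0.5):
    """Return the terms chosen by at least `min_share` of the subjects."""
    # Count each term once per subject, even if a subject repeated it.
    counts = Counter(term for chosen in selections for term in set(chosen))
    threshold = min_share * len(selections)
    return {term for term, count in counts.items() if count >= threshold}


# The three example selections shown on the slide.
subjects = [
    ["book", "hotel", "room", "reservation", "arrival",
     "departure", "smoking", "double", "card", "breakfast"],
    ["book", "hotel", "room", "refund", "arrival",
     "check", "parking", "double", "suite", "group"],
    ["confirmation", "room", "reservation", "arrival", "departure",
     "date", "bed", "card", "payment", "spa"],
]

label = aggregate_labels(subjects)
print(sorted(label))
```

With three subjects the 50% threshold is 1.5, so a term needs at least two votes, which matches the counts listed on the slide.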
Automatic Labeling
Text Processing
• Extracted words from:
  • source code + comments
  • comments only
• Identifier splitting (camel case)
• Pruned stop words and programming language keywords
• Stemming (Porter)
• Term indexing using: tf or tf-idf
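The preprocessing steps listed above could look roughly like the following Python sketch. It is a simplified illustration, not the authors' pipeline: the stop-word and keyword lists are tiny hypothetical samples, the Java snippet is invented, and Porter stemming is left out to keep the sketch free of third-party dependencies.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny lists; a real study would use full ones.
STOP_WORDS = {"the", "a", "of", "to", "is", "and", "for", "in"}
JAVA_KEYWORDS = {"public", "private", "class", "void", "int",
                 "return", "new", "if", "else", "static"}


def split_identifier(identifier):
    """Split camelCase, PascalCase and snake_case identifiers into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [part.lower() for part in parts]


def extract_terms(source):
    """Tokenize source text, split identifiers, drop stop words and keywords."""
    terms = []
    for token in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source):
        for word in split_identifier(token):
            if word in STOP_WORDS or word in JAVA_KEYWORDS or word.isdigit():
                continue
            terms.append(word)
    return terms


def term_frequencies(source):
    """Raw term frequency (tf) of every remaining term."""
    return Counter(extract_terms(source))


# Invented example class, used only to exercise the functions.
snippet = """
// Books a room for the guest
public class RoomBooking {
    private int roomNumber;
    public void bookRoom(int roomNumber) { }
}
"""

tf = term_frequencies(snippet)
print(tf.most_common(3))
```

Because stemming is skipped here, `book`, `books` and `booking` stay separate terms; a stemmer would merge some of these variants.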
Labeling techniques
• Simple signature: words from class name, method names and parameters, attribute names
• VSM: terms ranked according to tf or tf-idf
• Latent Semantic Indexing (LSI)
  • Class methods considered as documents
  • Words having the highest weight in the LSI space
• Latent Dirichlet Allocation (LDA)
  • Different numbers of topics: 2, #Methods/2, #Methods
  • Core words: words having the highest probability over the overall set of topics
  • Core topics: words from the topic with the highest probability
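As an illustration of the VSM variant, the sketch below ranks the terms of one class by tf-idf and keeps the top k as its label. The slides do not state which tf-idf formula was used, so the common `tf * log(N / df)` weighting is an assumption here, and the corpus, class names and function name are hypothetical.

```python
import math
from collections import Counter


def tfidf_labels(documents, target, k=10):
    """Rank the terms of `documents[target]` by tf-idf and return the top k.

    `documents` maps a class name to its list of preprocessed terms.
    """
    n_docs = len(documents)
    # Document frequency: in how many classes each term occurs.
    df = Counter()
    for terms in documents.values():
        df.update(set(terms))
    tf = Counter(documents[target])
    weights = {term: tf[term] * math.log(n_docs / df[term]) for term in tf}
    # Sort by descending weight; break ties alphabetically for determinism.
    ranked = sorted(weights, key=lambda term: (-weights[term], term))
    return ranked[:k]


# Hypothetical three-class corpus.
corpus = {
    "RoomBooking": ["room", "room", "room", "book", "guest", "number"],
    "Invoice": ["payment", "card", "guest", "number", "total"],
    "Guest": ["guest", "name", "card"],
}

print(tfidf_labels(corpus, "RoomBooking", k=2))
```

In this toy corpus `guest` occurs in every class, so its idf is zero and it drops to the bottom of the ranking, whereas a plain tf ranking would have kept it level with `book` and `number`.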
Measurements: RQ1
Asymmetric Jaccard to avoid penalizing automatic approaches
Manual labeling of $C_i$: $K(C_i) = \{t_1, \ldots, t_m\}$
Automatic labeling of $C_i$ by technique $m_i$: $K_{m_i}(C_i) = \{t_1, \ldots, t_h\}$
$$\mathrm{overlap}_{m_i}(C_i) = \frac{|K(C_i) \cap K_{m_i}(C_i)|}{|K_{m_i}(C_i)|}$$
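A minimal Python sketch of this asymmetric overlap follows. The manual label reuses the hotel example from the aggregation slide, while the automatic label is invented for illustration; the function name is likewise an assumption.

```python
def overlap(manual, automatic):
    """Share of automatically selected terms that also appear in the manual label."""
    manual, automatic = set(manual), set(automatic)
    if not automatic:
        return 0.0  # nothing was proposed, so nothing can overlap
    return len(manual & automatic) / len(automatic)


manual_label = {"room", "arrival", "book", "hotel",
                "reservation", "departure", "double", "card"}
# Hypothetical output of an automatic technique.
automatic_label = ["room", "book", "hotel", "price", "guest"]

score = overlap(manual_label, automatic_label)
print(f"{score:.2f}")
```

Dividing by the size of the automatic label only, rather than by the size of the union, means a technique is not penalized when the manual label contains more terms than the technique returns.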
Measurements: RQ2
A) Ability of LDA and LSI to cluster related classes
Proceedings of the 2012 20th IEEE International Conference on Program Comprehension (ICPC 2012), June 11-13
Measurements: RQ2
A) Ability of LDA and LSI to cluster related classes, measured as entropy of terms in a class:
$$H(C_i) = \sum_{j=1}^{m} \frac{tf_j}{n} \cdot \log\left(\frac{n}{tf_j}\right), \qquad n = \sum_{k=1}^{m} tf_k$$
normalized as:
$$\tilde{H}(C_i) = \frac{H(C_i)}{\log(m)}$$
B) Correlation between overlap and time spent by subjects to label artifacts
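The normalized entropy can be computed directly from the term frequencies of a class. The sketch below assumes that m is the number of distinct terms in the class and returns 0 when there are fewer than two distinct terms (where log(m) would be zero); both the function name and that edge-case choice are assumptions of this sketch.

```python
import math
from collections import Counter


def normalized_entropy(terms):
    """Entropy of the term distribution of a class, scaled to [0, 1]."""
    tf = Counter(terms)
    m = len(tf)  # number of distinct terms
    if m < 2:
        return 0.0  # log(m) is 0 or undefined; treat as no uncertainty
    n = sum(tf.values())  # total number of term occurrences
    h = sum((f / n) * math.log(n / f) for f in tf.values())
    return h / math.log(m)


uniform = normalized_entropy(["a", "b", "c", "d"])
skewed = normalized_entropy(["a"] * 97 + ["b", "c", "d"])
print(round(uniform, 3), round(skewed, 3))
```

A value close to 1 means no term dominates the class, while a value close to 0 means a few terms account for most occurrences. Since the result is a ratio of logarithms, it does not depend on the logarithm base.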
Results
RQ1: eXVantage
[Bar chart: overlap (%, axis 0 to 80) per labeling technique, for Comments+Code and Comments only. Techniques: Signature (tf), Signature (tf-idf), VSM (tf), VSM (tf-idf), LSI (tf), LSI (tf-idf), LDA (n=M, core_tp), LDA (n=M, core_ts), LDA (n=M/2, core_tp), LDA (n=M/2, core_ts), LDA (n=2, core_tp), LDA (n=2, core_ts)]
RQ1: JHotDraw
[Bar chart: overlap (%, axis 0 to 80) per labeling technique, for Comments+Code and Comments only. Techniques: Signature (tf), Signature (tf-idf), VSM (tf), VSM (tf-idf), LSI (tf), LSI (tf-idf), LDA (n=M, core_tp), LDA (n=M, core_ts), LDA (n=M/2, core_tp), LDA (n=M/2, core_ts), LDA (n=2, core_tp), LDA (n=2, core_ts)]
Why LDA does not work well
[Plot: class distance in eXVantage, axes: topic 1 and topic 2]
RQ2: Entropy
Comments do not contain clearly dominant words
[Plots: entropy of terms in eXVantage and JHotDraw, Code+comments vs. Comments only]
RQ2: Effort to label artifacts
Pearson correlation
System      Class size   Comment verbosity
JHotDraw    0.6          -0.25
eXVantage   0            -0.13
Different comment verbosity in JHotDraw (6) and eXVantage (14)
RQ2: VSM vs. LSI
Overlap vs. effort needed to label a class
[Bar charts for JHotDraw and eXVantage: overlap (%, axis 0 to 80) of VSM and LSI for classes needing low vs. high labeling effort]
Conclusions
Cybersecurity Presentation PowerPoint!!!Cybersecurity Presentation PowerPoint!!!
Cybersecurity Presentation PowerPoint!!!
 
2023 Ukraine Crisis Media Center Financial Report
2023 Ukraine Crisis Media Center Financial Report2023 Ukraine Crisis Media Center Financial Report
2023 Ukraine Crisis Media Center Financial Report
 
2023 Ukraine Crisis Media Center Finance Balance
2023 Ukraine Crisis Media Center Finance Balance2023 Ukraine Crisis Media Center Finance Balance
2023 Ukraine Crisis Media Center Finance Balance
 
Prsentation for VIVA Welike project 1semester.pptx
Prsentation for VIVA Welike project 1semester.pptxPrsentation for VIVA Welike project 1semester.pptx
Prsentation for VIVA Welike project 1semester.pptx
 
Bridging the visual gap between cultural heritage and digital scholarship
Bridging the visual gap between cultural heritage and digital scholarshipBridging the visual gap between cultural heritage and digital scholarship
Bridging the visual gap between cultural heritage and digital scholarship
 
怎么办理(lincoln学位证书)英国林肯大学毕业证文凭学位证书原版一模一样
怎么办理(lincoln学位证书)英国林肯大学毕业证文凭学位证书原版一模一样怎么办理(lincoln学位证书)英国林肯大学毕业证文凭学位证书原版一模一样
怎么办理(lincoln学位证书)英国林肯大学毕业证文凭学位证书原版一模一样
 
SASi-SPi Science Policy Lab Pre-engagement
SASi-SPi Science Policy Lab Pre-engagementSASi-SPi Science Policy Lab Pre-engagement
SASi-SPi Science Policy Lab Pre-engagement
 
AWS User Group Torino 2024 #3 - 18/06/2024
AWS User Group Torino 2024 #3 - 18/06/2024AWS User Group Torino 2024 #3 - 18/06/2024
AWS User Group Torino 2024 #3 - 18/06/2024
 
Proposal: The Ark Project and The BEEP Inc
Proposal: The Ark Project and The BEEP IncProposal: The Ark Project and The BEEP Inc
Proposal: The Ark Project and The BEEP Inc
 
Presentation agenda of three-day conference
Presentation agenda of three-day conferencePresentation agenda of three-day conference
Presentation agenda of three-day conference
 
一比一原版(unc毕业证书)美国北卡罗来纳大学教堂山分校毕业证如何办理
一比一原版(unc毕业证书)美国北卡罗来纳大学教堂山分校毕业证如何办理一比一原版(unc毕业证书)美国北卡罗来纳大学教堂山分校毕业证如何办理
一比一原版(unc毕业证书)美国北卡罗来纳大学教堂山分校毕业证如何办理
 
2023 Ukraine Crisis Media Center Annual Report
2023 Ukraine Crisis Media Center Annual Report2023 Ukraine Crisis Media Center Annual Report
2023 Ukraine Crisis Media Center Annual Report
 
Genesis chapter 3 Isaiah Scudder.pptx
Genesis    chapter 3 Isaiah Scudder.pptxGenesis    chapter 3 Isaiah Scudder.pptx
Genesis chapter 3 Isaiah Scudder.pptx
 
Legislation And Regulations For Import, Manufacture,.pptx
Legislation And Regulations For Import, Manufacture,.pptxLegislation And Regulations For Import, Manufacture,.pptx
Legislation And Regulations For Import, Manufacture,.pptx
 
Gamify it until you make it Improving Agile Development and Operations with ...
Gamify it until you make it  Improving Agile Development and Operations with ...Gamify it until you make it  Improving Agile Development and Operations with ...
Gamify it until you make it Improving Agile Development and Operations with ...
 
Kalyan chart satta matka guessing result
Kalyan chart satta matka guessing resultKalyan chart satta matka guessing result
Kalyan chart satta matka guessing result
 
Data Processing in PHP - PHPers 2024 Poznań
Data Processing in PHP - PHPers 2024 PoznańData Processing in PHP - PHPers 2024 Poznań
Data Processing in PHP - PHPers 2024 Poznań
 
ServiceNow CIS-ITSM Exam Dumps & Questions [2024]
ServiceNow CIS-ITSM Exam Dumps & Questions [2024]ServiceNow CIS-ITSM Exam Dumps & Questions [2024]
ServiceNow CIS-ITSM Exam Dumps & Questions [2024]
 
ACTIVE IMPLANTABLE MEDICAL DEVICE IN EUROPE
ACTIVE IMPLANTABLE MEDICAL DEVICE IN EUROPEACTIVE IMPLANTABLE MEDICAL DEVICE IN EUROPE
ACTIVE IMPLANTABLE MEDICAL DEVICE IN EUROPE
 

Using IR methods for labeling source code artifacts: Is it worthwhile?

  • 1. Using IR Methods for Labeling Source Code Artifacts: is it Worthwhile? Andrea De Lucia, Massimiliano Di Penta, Rocco Oliveto, Annibale Panichella, Sebastiano Panichella
  • 2. Context • Source code is text too! • Lexicon quality impacts software quality • IR techniques used to analyze software • Emerging application: label software artifacts • Labeling packages [Kuhn et al., 2007] • Labeling changes [Thomas et al., 2010] • Relate topics in high-level artifacts and source code [Gethers et al., 2011]
  • 5. Related study: Haiduc et al., 2010
  • 6. Empirical Study Goal: compare human-generated source code labelings with automatically generated ones. Quality focus: quality of automatically generated source code labelings. Perspective: researchers interested in developing source code labeling techniques.
  • 7. Research Questions RQ1: How large is the overlap between the keywords identified by developers when describing a source code artifact and those identified by an automatic technique? RQ2: Which characteristics of source code artifacts affect the overlap of automatic labeling techniques with the human-generated labels?
  • 8. Context Objects: 10 classes from eXVantage (industrial test data generation tool) 10 classes from JHotDraw Subjects: 17 software engineering students (Bachelor degree in CS, Univ. of Molise, second year)
  • 10. Procedure Overview 1. Participants’ training on the system 2. Presentation of the experiment procedure 3. Manual labeling by participants 4. Automatic labeling 5. Comparison
  • 11. Manual Labeling • Subjects label each class by selecting 10 words from it • Time spent on each class annotated • Offline study, lasted 2 weeks [Figure: a source code file labeled with the words book, hotel, room, reservation, arrival, departure, smoking, double, card, breakfast]
  • 12. Aggregating manual labeling Each artifact is labeled using terms selected by at least 50% of the subjects. Example with three word lists: (1) book, hotel, room, reservation, arrival, departure, smoking, double, card, breakfast; (2) book, hotel, room, refund, arrival, check, parking, double, suite, group; (3) confirmation, room, reservation, arrival, departure, date, bed, card, payment, spa. Resulting term counts: room 3, arrival 3, book 2, hotel 2, reservation 2, departure 2, double 2, card 2
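
The 50% rule on this slide can be sketched in a few lines. This is only an illustration, assuming the thirty words shown on the slide are three subjects' selections of ten words each; the function name `aggregate_labels` and its `threshold` parameter are hypothetical and not taken from the study.

```python
from collections import Counter


def aggregate_labels(subject_labels, threshold=0.5):
    """Keep the terms chosen by at least `threshold` of the subjects."""
    # Count each term once per subject, even if a subject listed it twice.
    counts = Counter(term for labels in subject_labels for term in set(labels))
    needed = threshold * len(subject_labels)
    return {term for term, count in counts.items() if count >= needed}


# Word lists copied from the slide; grouping them by subject is an assumption.
subjects = [
    ["book", "hotel", "room", "reservation", "arrival",
     "departure", "smoking", "double", "card", "breakfast"],
    ["book", "hotel", "room", "refund", "arrival",
     "check", "parking", "double", "suite", "group"],
    ["confirmation", "room", "reservation", "arrival", "departure",
     "date", "bed", "card", "payment", "spa"],
]

label = aggregate_labels(subjects)
print(sorted(label))
```

With three subjects, the 50% threshold means a term must be picked by at least two of them, which reproduces the eight terms listed with counts of 2 or 3 on the slide.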
  • 14. Text Processing • Extracted words from • source code + comments • comments only • Identifier splitting (camel case) • Pruned stop words and programming language keywords • Stemming (Porter) • Term indexing using: tf or tf-idf
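
The text-processing steps on this slide might look roughly like the sketch below. It is not the authors' tooling: the stop-word and keyword sets are tiny illustrative samples, all function names are hypothetical, and Porter stemming is left out to keep the example dependency-free (a stemmer would be applied to each term before counting).

```python
import re
from collections import Counter

# Illustrative samples only; a real study would use complete lists.
STOP_WORDS = {"a", "an", "the", "of", "to", "and", "is", "in", "for"}
JAVA_KEYWORDS = {"public", "private", "class", "void", "int",
                 "return", "new", "static", "if", "else"}


def split_identifier(identifier):
    """Split a camel-case identifier into lower-case words."""
    parts = re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+", identifier)
    return [part.lower() for part in parts]


def term_frequencies(source):
    """Return raw term frequencies (tf) for code plus comments."""
    terms = []
    for identifier in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source):
        for word in split_identifier(identifier):
            # Prune stop words, language keywords, and one-letter fragments.
            if len(word) > 1 and word not in STOP_WORDS | JAVA_KEYWORDS:
                terms.append(word)
    return Counter(terms)


source = """
// Books a hotel room for the guest
public class RoomReservation {
    private int roomNumber;
    void bookRoom() { return; }
}
"""
tf = term_frequencies(source)
print(tf.most_common(3))
```

Without stemming, "books" and "book" remain distinct terms here, which is exactly the kind of split that the stemming step on the slide is meant to remove.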
  • 15. Labeling techniques • Simple signature: words from class name, method name and params, attribute names • VSM: terms ranked according to tf or tf-idf • Latent Semantic Indexing (LSI) • Class methods considered as documents • Words having the highest weight in the LSI space • Latent Dirichlet Allocation (LDA) • Different number of topics: 2, #Methods/2, #Methods • Core words: having highest probability on the overall set of topics • Core topics: words from the topic with highest probability
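
As a rough illustration of the VSM technique listed above, the following sketch ranks the terms of one class by tf-idf and keeps the top k. The slides do not state which tf-idf variant was used, so the weighting tf × log(N / df) is an assumption, and the function name `tfidf_labels` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_labels(documents, target, k=10):
    """Return the k terms of `target` with the highest tf-idf weight.

    `documents` maps a class name to its list of preprocessed terms.
    """
    n_docs = len(documents)
    # Document frequency: in how many classes does each term occur?
    df = Counter()
    for terms in documents.values():
        df.update(set(terms))
    tf = Counter(documents[target])
    scores = {term: tf[term] * math.log(n_docs / df[term]) for term in tf}
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


# Toy corpus, invented for illustration.
corpus = {
    "RoomReservation": ["room", "reservation", "room", "book", "hotel"],
    "Payment": ["card", "payment", "hotel", "refund"],
    "Guest": ["guest", "name", "hotel"],
}
print(tfidf_labels(corpus, "RoomReservation", k=3))
```

Because "hotel" occurs in every class of the toy corpus, its idf is zero and it drops to the bottom of the ranking, whereas a plain tf ranking would keep it level with the other single-occurrence terms.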
  • 16. Measurements: RQ1 Asymmetric Jaccard to avoid penalizing automatic approaches: $\mathrm{overlap}_{m_i}(C_i) = \frac{|K(C_i) \cap K_{m_i}(C_i)|}{|K_{m_i}(C_i)|}$, where $K(C_i) = \{t_1, \ldots, t_m\}$ is the manual labeling of $C_i$ and $K_{m_i}(C_i) = \{t_1, \ldots, t_h\}$ is the automatic labeling of $C_i$ by technique $m_i$
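
A minimal sketch of this overlap measure, read here as the fraction of automatically selected terms that also appear in the manual label (that is, the intersection divided by the size of the automatic label). The function name and the example term sets are hypothetical.

```python
def overlap(manual, automatic):
    """Asymmetric overlap: shared terms divided by the automatic label size."""
    manual, automatic = set(manual), set(automatic)
    if not automatic:
        return 0.0  # no automatic terms, so nothing can overlap
    return len(manual & automatic) / len(automatic)


# Invented example: 3 of the 5 automatic terms are in the manual label.
manual_label = {"room", "arrival", "book", "hotel",
                "reservation", "departure", "double", "card"}
automatic_label = ["room", "hotel", "book", "guest", "price"]
print(overlap(manual_label, automatic_label))
```

Dividing by the automatic label alone means a technique is not penalized when the human label contains more terms than the technique was asked to produce.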
  • 17. Measurements: RQ2 A) Ability of LDA and LSI to cluster related classes Open your book, page 0
  • 18. [Image: cover of the Proceedings of the 2012 20th IEEE International Conference on Program Comprehension (ICPC 2012), June 11-13, "Celebrating 20 Years"]
  • 19. Measurements: RQ2 A) Ability of LDA and LSI to cluster related classes, measured as entropy of terms in a class: $H(C_i) = \sum_{j=1}^{m} \frac{tf_j}{n} \cdot \log\left(\frac{n}{tf_j}\right)$ with $n = \sum_{k=1}^{m} tf_k$, normalized as $\widetilde{H}(C_i) = H(C_i) / \log(m)$ B) Correlation between overlap and time spent by subjects to label artifacts
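
The normalized entropy of terms in a class can be computed as in the sketch below, where each tf_j is the frequency of term j, n is the sum of all frequencies, and m is the number of distinct terms. The function name is hypothetical, and returning 0.0 for classes with fewer than two distinct terms is a convention chosen here, not something stated on the slide.

```python
import math


def normalized_entropy(term_frequencies):
    """Entropy of a class's term distribution, scaled to the range [0, 1]."""
    tfs = [tf for tf in term_frequencies if tf > 0]
    m = len(tfs)
    if m < 2:
        return 0.0  # log(m) would be zero; treat as no uncertainty
    n = sum(tfs)
    entropy = sum((tf / n) * math.log(n / tf) for tf in tfs)
    return entropy / math.log(m)


# Uniform frequencies: no dominant term, entropy close to 1.
print(normalized_entropy([2, 2, 2, 2]))
# One dominant term: entropy close to 0.
print(normalized_entropy([97, 1, 1, 1]))
```

A value near 1 means no term stands out in the class, while a value near 0 means a few terms dominate.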
  • 21. RQ1: eXVantage [Bar chart: overlap (axis 0-80) per labeling technique, with two series: Comments+Code and Comments only. Techniques: Signature (tf), Signature (tf-idf), VSM (tf), VSM (tf-idf), LSI (tf), LSI (tf-idf), LDA (n=M, core_tp), LDA (n=M, core_ts), LDA (n=M/2, core_tp), LDA (n=M/2, core_ts), LDA (n=2, core_tp), LDA (n=2, core_ts). Bar values in extraction order, mapping to techniques not recoverable: 59 61 59 57 50 46 52 53 59 61 0 0 59 60 56 56 53 52 63 58 69 70 77 76]
  • 22. RQ1: JHotDraw [Bar chart: overlap (axis 0-80) per labeling technique, with two series: Comments+Code and Comments only. Techniques: Signature (tf), Signature (tf-idf), VSM (tf), VSM (tf-idf), LSI (tf), LSI (tf-idf), LDA (n=M, core_tp), LDA (n=M, core_ts), LDA (n=M/2, core_tp), LDA (n=M/2, core_ts), LDA (n=2, core_tp), LDA (n=2, core_ts). Bar values in extraction order, mapping to techniques not recoverable: 52 53 52 53 46 58 44 48 54 53 0 0 62 59 52 60 55 59 55 54 65 60 75 74]
  • 23. Why LDA does not work well [Plot: class distance in eXVantage; labels: topic 1, topic 2]
  • 24. RQ2: Entropy Comments do not contain clearly dominant words [Plots: entropy of terms for eXVantage and JHotDraw, comparing Code+comments with Comments only]
  • 25. RQ2: Effort to label artifacts Pearson correlation by system: JHotDraw: class size 0.6, comment verbosity -0.25; eXVantage: class size 0, comment verbosity -0.13. Different comment verbosity in JHotDraw (6) and eXVantage (14)
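
For reference, a Pearson correlation like the ones reported on this slide can be computed as sketched below. The slide does not spell out which variables were paired, so correlating labeling time with class size is an assumption, and the sample numbers are invented for illustration rather than taken from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Hypothetical data: class size (lines of code) and minutes spent labeling.
class_sizes = [120, 340, 560, 80]
minutes_spent = [5, 9, 14, 4]
print(pearson(class_sizes, minutes_spent))
```

The coefficient ranges from -1 to 1, so a value such as 0.6 indicates a moderate positive association, while a value near 0 indicates no linear association.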
  • 26. RQ2: VSM vs. LSI Overlap vs. effort needed to label a class [Bar charts: overlap (axis 0-80) of VSM and LSI for classes needing Low and High effort. JHotDraw values in extraction order: 72 59 42 61. eXVantage values in extraction order: 68 68 52 72]