Submit Search
Upload
Extracting Code Segments and Their Descriptions from Research Articles
•
1 like
•
106 views
Preetha Chatterjee
Follow
Presentation from MSR 2017
Read less
Read more
Software
Slideshow view
Report
Share
Slideshow view
Report
Share
1 of 22
Recommended
Ijetr012045
Ijetr012045
ER Publication.org
Ethnograph 10 Jul07
Ethnograph 10 Jul07
Clara Kwan
International Journal of Biometrics and Bioinformatics(IJBB) Volume (4) Issu...
International Journal of Biometrics and Bioinformatics(IJBB) Volume (4) Issu...
CSCJournals
Review of plagiarism detection and control & copyrights in India
Review of plagiarism detection and control & copyrights in India
ijiert bestjournal
A Novel Approach for Keyword extraction in learning objects using text mining
A Novel Approach for Keyword extraction in learning objects using text mining
IJSRD
Study on security and quality of service implementations in p2 p overlay netw...
Study on security and quality of service implementations in p2 p overlay netw...
eSAT Publishing House
Ethnograph 11 Jul07
Ethnograph 11 Jul07
Clara Kwan
Author Identification of Source Code Segments Written by Multiple Authors Usi...
Author Identification of Source Code Segments Written by Multiple Authors Usi...
Parvez Mahbub
Recommended
Ijetr012045
Ijetr012045
ER Publication.org
Ethnograph 10 Jul07
Ethnograph 10 Jul07
Clara Kwan
International Journal of Biometrics and Bioinformatics(IJBB) Volume (4) Issu...
International Journal of Biometrics and Bioinformatics(IJBB) Volume (4) Issu...
CSCJournals
Review of plagiarism detection and control & copyrights in India
Review of plagiarism detection and control & copyrights in India
ijiert bestjournal
A Novel Approach for Keyword extraction in learning objects using text mining
A Novel Approach for Keyword extraction in learning objects using text mining
IJSRD
Study on security and quality of service implementations in p2 p overlay netw...
Study on security and quality of service implementations in p2 p overlay netw...
eSAT Publishing House
Ethnograph 11 Jul07
Ethnograph 11 Jul07
Clara Kwan
Author Identification of Source Code Segments Written by Multiple Authors Usi...
Author Identification of Source Code Segments Written by Multiple Authors Usi...
Parvez Mahbub
Performance analysis on secured data method in natural language steganography
Performance analysis on secured data method in natural language steganography
journalBEEI
Keyword extraction and clustering for document recommendation in conversations.
Keyword extraction and clustering for document recommendation in conversations.
LeMeniz Infotech
Speech recognition using neural + fuzzy logic
Speech recognition using neural + fuzzy logic
Snehal Patel
Ijet journal
Ijet journal
Fhiedt Takumi
AN ONTOLOGY FOR EXPLORING KNOWLEDGE IN COMPUTER NETWORKS
AN ONTOLOGY FOR EXPLORING KNOWLEDGE IN COMPUTER NETWORKS
ijcsa
H017445260
H017445260
IOSR Journals
Tcp performance Final Report
Tcp performance Final Report
ambitlick
Network security essentials applications and standards - 17376.pdf
Network security essentials applications and standards - 17376.pdf
DrBasemMohamedElomda
Block Library Driven Translation Validation for DataFlow Models in Safety Cri...
Block Library Driven Translation Validation for DataFlow Models in Safety Cri...
Marc Pantel
Info scince pp
Info scince pp
malakAlHarbi94
HTML, CSS and XML
HTML, CSS and XML
Aashish Jain
High performance computing
High performance computing
Guy Tel-Zur
Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...
Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...
Robert McDermott
Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...
Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...
Robert McDermott
Overview of SMB, NetBIOS and other network attacks
Overview of SMB, NetBIOS and other network attacks
David Sweigert
Ieee Cyber 2012 Late News Cfp
Ieee Cyber 2012 Late News Cfp
Terry Janssen
A.Levenchuk -- Complexity in Engineering
A.Levenchuk -- Complexity in Engineering
Anatoly Levenchuk
1 history computers
1 history computers
Jobi Mathai
Decipher openseminar (1)
Decipher openseminar (1)
Jae-Yun Kim
Open Cybersecurity Alliance Briefing at RSAC 2020
Open Cybersecurity Alliance Briefing at RSAC 2020
Carol Geyer
This final assignment will be a departure from the three previous.docx
This final assignment will be a departure from the three previous.docx
jwilliam16
From the classroom to the cloud a journey with node.js - Christopher Hogue
From the classroom to the cloud a journey with node.js - Christopher Hogue
Devopsdays
More Related Content
What's hot
Performance analysis on secured data method in natural language steganography
Performance analysis on secured data method in natural language steganography
journalBEEI
Keyword extraction and clustering for document recommendation in conversations.
Keyword extraction and clustering for document recommendation in conversations.
LeMeniz Infotech
Speech recognition using neural + fuzzy logic
Speech recognition using neural + fuzzy logic
Snehal Patel
Ijet journal
Ijet journal
Fhiedt Takumi
AN ONTOLOGY FOR EXPLORING KNOWLEDGE IN COMPUTER NETWORKS
AN ONTOLOGY FOR EXPLORING KNOWLEDGE IN COMPUTER NETWORKS
ijcsa
H017445260
H017445260
IOSR Journals
What's hot
(6)
Performance analysis on secured data method in natural language steganography
Performance analysis on secured data method in natural language steganography
Keyword extraction and clustering for document recommendation in conversations.
Keyword extraction and clustering for document recommendation in conversations.
Speech recognition using neural + fuzzy logic
Speech recognition using neural + fuzzy logic
Ijet journal
Ijet journal
AN ONTOLOGY FOR EXPLORING KNOWLEDGE IN COMPUTER NETWORKS
AN ONTOLOGY FOR EXPLORING KNOWLEDGE IN COMPUTER NETWORKS
H017445260
H017445260
Similar to Extracting Code Segments and Their Descriptions from Research Articles
Tcp performance Final Report
Tcp performance Final Report
ambitlick
Network security essentials applications and standards - 17376.pdf
Network security essentials applications and standards - 17376.pdf
DrBasemMohamedElomda
Block Library Driven Translation Validation for DataFlow Models in Safety Cri...
Block Library Driven Translation Validation for DataFlow Models in Safety Cri...
Marc Pantel
Info scince pp
Info scince pp
malakAlHarbi94
HTML, CSS and XML
HTML, CSS and XML
Aashish Jain
High performance computing
High performance computing
Guy Tel-Zur
Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...
Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...
Robert McDermott
Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...
Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...
Robert McDermott
Overview of SMB, NetBIOS and other network attacks
Overview of SMB, NetBIOS and other network attacks
David Sweigert
Ieee Cyber 2012 Late News Cfp
Ieee Cyber 2012 Late News Cfp
Terry Janssen
A.Levenchuk -- Complexity in Engineering
A.Levenchuk -- Complexity in Engineering
Anatoly Levenchuk
1 history computers
1 history computers
Jobi Mathai
Decipher openseminar (1)
Decipher openseminar (1)
Jae-Yun Kim
Open Cybersecurity Alliance Briefing at RSAC 2020
Open Cybersecurity Alliance Briefing at RSAC 2020
Carol Geyer
This final assignment will be a departure from the three previous.docx
This final assignment will be a departure from the three previous.docx
jwilliam16
From the classroom to the cloud a journey with node.js - Christopher Hogue
From the classroom to the cloud a journey with node.js - Christopher Hogue
Devopsdays
Customer Success Story: IEEE Xplore Saves Time
Customer Success Story: IEEE Xplore Saves Time
IEEE Xplore Digital Library
CIS505008VA0- COMMUNICATION TECHNOLOGIESWeek 2 Assignment 1 Su.docx
CIS505008VA0- COMMUNICATION TECHNOLOGIESWeek 2 Assignment 1 Su.docx
sleeperharwell
Analysis of network_security_threats_and_vulnerabilities_by_development__impl...
Analysis of network_security_threats_and_vulnerabilities_by_development__impl...
Tương Hoàng
IRJET- Automated Document Summarization and Classification using Deep Lear...
IRJET- Automated Document Summarization and Classification using Deep Lear...
IRJET Journal
Similar to Extracting Code Segments and Their Descriptions from Research Articles
(20)
Tcp performance Final Report
Tcp performance Final Report
Network security essentials applications and standards - 17376.pdf
Network security essentials applications and standards - 17376.pdf
Block Library Driven Translation Validation for DataFlow Models in Safety Cri...
Block Library Driven Translation Validation for DataFlow Models in Safety Cri...
Info scince pp
Info scince pp
HTML, CSS and XML
HTML, CSS and XML
High performance computing
High performance computing
Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...
Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...
Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...
Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...
Overview of SMB, NetBIOS and other network attacks
Overview of SMB, NetBIOS and other network attacks
Ieee Cyber 2012 Late News Cfp
Ieee Cyber 2012 Late News Cfp
A.Levenchuk -- Complexity in Engineering
A.Levenchuk -- Complexity in Engineering
1 history computers
1 history computers
Decipher openseminar (1)
Decipher openseminar (1)
Open Cybersecurity Alliance Briefing at RSAC 2020
Open Cybersecurity Alliance Briefing at RSAC 2020
This final assignment will be a departure from the three previous.docx
This final assignment will be a departure from the three previous.docx
From the classroom to the cloud a journey with node.js - Christopher Hogue
From the classroom to the cloud a journey with node.js - Christopher Hogue
Customer Success Story: IEEE Xplore Saves Time
Customer Success Story: IEEE Xplore Saves Time
CIS505008VA0- COMMUNICATION TECHNOLOGIESWeek 2 Assignment 1 Su.docx
CIS505008VA0- COMMUNICATION TECHNOLOGIESWeek 2 Assignment 1 Su.docx
Analysis of network_security_threats_and_vulnerabilities_by_development__impl...
Analysis of network_security_threats_and_vulnerabilities_by_development__impl...
IRJET- Automated Document Summarization and Classification using Deep Lear...
IRJET- Automated Document Summarization and Classification using Deep Lear...
More from Preetha Chatterjee
Interpersonal Trust in OSS: Exploring Dimensions of Trust in GitHub Pull Requ...
Interpersonal Trust in OSS: Exploring Dimensions of Trust in GitHub Pull Requ...
Preetha Chatterjee
Data Augmentation for Improving Emotion Recognition in Software Engineering C...
Data Augmentation for Improving Emotion Recognition in Software Engineering C...
Preetha Chatterjee
Automatic Identification of Informative Code in Stack Overflow Posts
Automatic Identification of Informative Code in Stack Overflow Posts
Preetha Chatterjee
Automatically Identifying the Quality of Developer Chats for Post Hoc Use
Automatically Identifying the Quality of Developer Chats for Post Hoc Use
Preetha Chatterjee
Finding Help with Programming Errors: An Exploratory Study of Novice Software...
Finding Help with Programming Errors: An Exploratory Study of Novice Software...
Preetha Chatterjee
Extracting Archival-Quality Information from Software-Related Chats
Extracting Archival-Quality Information from Software-Related Chats
Preetha Chatterjee
Mining Code Examples with Descriptive Text from Software Artifacts
Mining Code Examples with Descriptive Text from Software Artifacts
Preetha Chatterjee
Exploratory Study of Slack Q&A Chats as a Mining Source for Software Engineer...
Exploratory Study of Slack Q&A Chats as a Mining Source for Software Engineer...
Preetha Chatterjee
More from Preetha Chatterjee
(8)
Interpersonal Trust in OSS: Exploring Dimensions of Trust in GitHub Pull Requ...
Interpersonal Trust in OSS: Exploring Dimensions of Trust in GitHub Pull Requ...
Data Augmentation for Improving Emotion Recognition in Software Engineering C...
Data Augmentation for Improving Emotion Recognition in Software Engineering C...
Automatic Identification of Informative Code in Stack Overflow Posts
Automatic Identification of Informative Code in Stack Overflow Posts
Automatically Identifying the Quality of Developer Chats for Post Hoc Use
Automatically Identifying the Quality of Developer Chats for Post Hoc Use
Finding Help with Programming Errors: An Exploratory Study of Novice Software...
Finding Help with Programming Errors: An Exploratory Study of Novice Software...
Extracting Archival-Quality Information from Software-Related Chats
Extracting Archival-Quality Information from Software-Related Chats
Mining Code Examples with Descriptive Text from Software Artifacts
Mining Code Examples with Descriptive Text from Software Artifacts
Exploratory Study of Slack Q&A Chats as a Mining Source for Software Engineer...
Exploratory Study of Slack Q&A Chats as a Mining Source for Software Engineer...
Recently uploaded
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
harshavardhanraghave
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
Andolasoft Inc
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
kellynguyen01
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
VICTOR MAESTRE RAMIREZ
What is Binary Language? Computer Number Systems
What is Binary Language? Computer Number Systems
JheuzeDellosa
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
BradBedford3
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Christina Lin
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
kalichargn70th171
Professional Resume Template for Software Developers
Professional Resume Template for Software Developers
Vinodh Ram
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Delhi Whatsup 9873940964 Enjoy Unlimited Pleasure
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Call Girls In Delhi Whatsup 9873940964 Enjoy Unlimited Pleasure
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
kalichargn70th171
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
bodapatigopi8531
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
aagamshah0812
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
ComplianceQuest1
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
stazi3110
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
Evangelist Apps https://twitter.com/EvangelistSW/
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Alberto González Trastoy
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
soniya singh
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
OPEN KNOWLEDGE GmbH
Recently uploaded
(20)
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
What is Binary Language? Computer Number Systems
What is Binary Language? Computer Number Systems
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
Professional Resume Template for Software Developers
Professional Resume Template for Software Developers
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
Extracting Code Segments and Their Descriptions from Research Articles
1.
Preetha Chatterjee, Benjamin Gause, Hunter Hedinger, and Lori Pollock Computer & Information Science Extracting Code Segments and Their Descriptions from Research Articles 1
2.
Code is everywhere! 2
3.
Bug Reports Emails Blog Posts Q & A forums Code Reviews Documentation E-books Research Papers Course Materials Presentations Public Chats Benchmarks Research Papers DL Domain # of articles ACM DL
Computer Science > 300,000 IEEE Xplore Computer Science > 3,500,000 DBLP Mostly Computer Science > 3,729,582 https://en.wikipedia.org/wiki/IEEE_Xplore https://cacm.acm.org/magazines/2011/7/109905-acm-aggregates-publication- statistics-in-the-acm-digital-library/fulltext http://dblp.uni-trier.de/ 3
4.
70% of the articles contain one or more code segments, with an average of 3-4 code segments per article. 4
5.
Excerpt from a research article Indication of a problem in the code The functionality of the code Functionalities of individual method calls within the code The type of data structure used by the program Cause of the code issue presented earlier To understand the difficulty of fixing a memory leak, let us take a look at an example program in Fig. 1. This is a contrived example mimicking recurring leak patterns we found in real C programs. Procedure check_records checks whether there is any bad records in a large file, and the caller could either check all records, or specify a search condition to check only part of records. In this example, both get_next and search_for_next will allocate and return a heap structure, which is expected to be freed at line 12. However, the execution may break out the loop at line 10, causing a memory leak. The programming language of the source code 5
6.
Why extract code segments and their descriptions? Code recommendation Automatic comment generation Extension of documentation Learning from & reusing code examples 6
7.
Excerpt from a research article To understand the difficulty of fixing a memory leak, let us take a look at an example program in Fig. 1. This is a contrived example mimicking recurring leak patterns we found in real C programs. Procedure check_records checks whether there is any bad records in a large file, and the caller could either check all records, or specify a search condition to check only part of records. In this example, both get_next and search_for_next will allocate and return a heap structure, which is expected to be freed at line 12. However, the execution may break out the loop at line 10, causing a memory leak.
7 Code segments [Bacchelli et al. (ICPC’10) Tang et al. (KDD’05) Bettenburg et al. (MSR’08) Subramanian et al. (MSR’13) Rigby et al. (ICSE’13) Natural language text describing each code segment
8.
This Paper’s Contributions Automatically identifying & mapping text describing code segments in research articles A prototype, CoDesNPub Miner, that outputs XML-based representation associating code segments with their descriptions Evaluation of the effectiveness of code description identification techniques 8 •
Seeds • Neighbors
9.
Overview of CoDesNPub Miner 9
10.
Identify Seeds: References Figure Containing Code Fig.5 shows a typical test method of this pattern. The method tests a set of basic functionality of API class BasicAuthCache, including the method put, get, remove and clear. There are three test scenarios in the method: line 4-5, line 6-7, line 8-10. They share two data objects, cache and authScheme. Their method invocation sequences are not same and there is no unified test target method. But there is a common subsequence among three method invocation sequences, i.e., the invocations of get and HttpHost. 10 *****Code Segment appears here***** Listing 9 shows an example of three statements that were single statement blocks after the first phases, but can be merged into a single block because they have similar RHSs. *****Code Segment appears here*****
11.
Identify Seeds: Located Immediately Before or After Inlined Code Fig.5 shows a typical test method of this pattern. The method tests a set of basic functionality of API class BasicAuthCache, including the method put, get, remove and clear. There are three test scenarios in the method: line 4-5, line 6-7, line 8-10. They share two data objects, cache and authScheme. Their method invocation sequences are not same and there is no unified test target method. But there is a common subsequence among three method invocation sequences, i.e., the invocations of get and HttpHost. 11 *****Code Segment appears here***** A major obstacle to extracting API examples from test code is the multiple test scenarios in a test method. Fig. 1 depicts such a test method. Lines 2-4 are the declaration of some data objects. Lines 5-13 depict a test scenario that contains the usage of some API methods, such as keySetByValue, put, and getKey. Lines 14-22 depict another test scenario, which contains a similar usage to the previous one. Such multiple test scenarios are quite reasonable when aiming at covering testing input domains. But they bring redundant code for API users to read. In fact, there are actually 200+ code lines containing similar test scenarios in the test method in Fig.1. It is necessary to separate different test scenarios from one test method and cluster the similar usages to remove redundancy. *****Code Segment appears here*****
12.
Identify Seeds: Contains Code Identifiers Fig.5 shows a typical test method of this pattern. The method tests a set of basic functionality of API class BasicAuthCache, including the method put, get, remove and clear. There are three test scenarios in the method: line 4-5, line 6-7, line 8-10. They share two data objects, cache and authScheme. Their method invocation sequences are not same and there is no unified test target method. But there is a common subsequence among three method invocation sequences, i.e., the invocations of get and HttpHost. 12 *****Code Segment appears here*****
13.
Identify Seeds: References Code By Position 13 Fig.5 shows a typical test method of this pattern. The method tests a set of basic functionality of API class BasicAuthCache, including the method put, get, remove and clear. There are three test scenarios in the method: line 4-5, line 6-7, line 8-10. They share two data objects, cache and authScheme. Their method invocation sequences are not same and there is no unified test target method. But there is a common subsequence among three method invocation sequences, i.e., the invocations of get and HttpHost. This code snippet obtains a user name (user- Name) by invoking request.getParameter(“name”)and uses it to construct a query to be passed to a database for execution (con.execute (query)). This seemingly innocent piece of code may allow an attacker to gain access to unauthorized information: if an attacker has full control of string userName
obtained from an HTTP request, he can for example set it to 'OR 1 = 1;- -. Two dashes are used to indicate comments in the Oracle dialect of SQL, so the WHERE clause of the query effectively becomes the tautologyname = ' ' OR 1 = 1. This allows the attackerto circumvent the name check and get access to all user records in the database.
14.
Fig.5 shows a typical test method of this pattern. The method tests a set of basic functionality of API class BasicAuthCache, including the method put, get, remove and clear. There are three test scenarios in the method: line 4-5, line 6-7, line 8- 10. They share two data objects, cache and authScheme. Their method invocation sequences are not same and there is no unified test target method. But there is a common subsequence among three method invocation sequences, i.e., the invocations of get and HttpHost. ReferencesCodeFigure Score=3 ContainsCodeIdentifiers Score=2 ContainsCodeIdentifiers
Score=2 ReferencesCodeByPosition Score=2 ContainsCodeIdentifiers, Score=2 ReferencesCodeByPosition Score=2 TextBefore Score=1 14 Identify Seeds: Putting It All Together *****Code Segment appears here***** • Scoring sentences • Equal • Accuracy-based • Threshold Analysis Heuristic References CodeFigure Contains Code Identifiers References Code ByPosition Text Before, Text After Score 3 2 2 1
15.
Identifying Neighboring Code-related Text • Heuristic 1: At least 1 sentence is a seed • Heuristic 2: At least (25%, 50%, or 75%, respectively) sentences in the paragraph are seeds Fig.5 shows a typical test method of this pattern. The method tests a set of basic functionality of API class BasicAuthCache, including the method put, get, remove and clear. There are three test scenarios in the method: line 4-5, line 6-7, line 8-10. They share two data objects, cache and authScheme. Their method invocation sequences are not same and there is no unified test target method. But there is a common subsequence among three method invocation sequences, i.e., the invocations of get and HttpHost. 5
out of 6 (75%) sentences are seeds 15 Heuristic 1 Heuristic 2 whole paragraph is a description
16.
Evaluation Methodology • Research Question: How effective is our approach to automatically identify code descriptions in natural language text of research articles? • Subjects: 100 code segments from ACM DL and IEEE Xplore
journal and conference software engineering papers • Gold Set: • 10 Human annotators (non-authors) • Measures: • Overall code description identification: Precision and recall • Seed identification: Precision 16
17.
Evaluation Results Minimum # of Seeds Precision Recall 1-24% 39.05
70.20 >= 25% 53.41 50.33 >= 50% 66.04 28.45 >= 75% 68.30 20.53 Overall system effectiveness 17
18.
18 High Clarity Low Clarity Kinds of information in the code descriptions
19.
What cues are most prominent? 19
20.
Main Threats to Validity • Unable to distinguish between pseudocode and code fragments • Papers with no pseudocode, plan to extend the approach to identify both. •
Evaluation relies on human judges • Human judges with experiences in programming and research paper reading. • Each code segment judged by at least two judges. • Scaling to extensive evaluation set might lead to different results • Plan to expand the evaluation with more participants, and research papers containing more code segments. 20
21.
Related Work • Analyzing Collections of Research Articles: Cruzes et al. (ESEM’07), Siegmund et al. (ICSE’15) •
Code Segment Extraction: Bacchelli et al. (ICPC’10), Tang et al. (KDD’05), Bettenburg et al. (MSR’08), Subramanian et al. (MSR’13), Rigby et al. (ICSE’13) • Code Description Identification: Bug Reports: Panichella et al. (ICPC’12), Q&A: Vassallo et al. (ICPC’14), Wong et al. (ASE’13), Rahman et al. (SCAM’15) 21
22.
Summary • Automatically identify & map text describing code segments in research articles • CoDesNPub
Miner outputting XML-based representation associating code segments with their descriptions • Evaluation of the effectiveness of code description identification techniques • Precision = 68% • Recall = 21% 22 Future Work Improve recall and precision Fully automate preprocessing Expand the experiments