The document outlines Masud Rahman's PhD thesis proposal on supporting source code search with context-aware, analytics-driven query reformulation. The proposal discusses three research questions: 1) evaluating term weighting techniques for keyword selection from source code and bug reports, 2) incorporating bug report quality for local code search, and 3) leveraging crowd knowledge and data analytics to deliver query keywords. The contribution summary highlights techniques for term dependence, quality-aware bug localization, and using crowd knowledge and large data analytics.
Semantics and optimisation of the SPARQL 1.1 federation extension (Oscar Corcho)
Presentation given at ESWC 2011 for the paper "Semantics and optimisation of the SPARQL 1.1 federation extension". Buil-Aranda C, Arenas M, Corcho O. ESWC 2011, May 2011, Hersonissos, Greece.
Information access over linked data requires determining the subgraph(s) in linked data's underlying graph that correspond to the required information need. Usually, an information access framework can retrieve richer information by checking a large number of possible subgraphs. However, checking a large number of possible subgraphs increases information access complexity, which makes such frameworks less effective. Many contemporary linked data information access frameworks reduce this complexity by introducing different heuristics, but they then suffer in retrieving richer information; other frameworks do not address the complexity at all. A practically usable framework, however, should retrieve richer information with lower complexity. We hypothesize that pre-processed statistics of linked data can be used to efficiently check a large number of possible subgraphs, which helps retrieve comparatively richer information with lower data access complexity. A preliminary evaluation of our hypothesis shows promising performance.
Many Linked Data datasets model elements in their domains in the form of lists: a countable number of ordered resources.
When publishing these lists in RDF, an important concern is making them easy to consume.
Therefore, a well-known recommendation is to find an existing list modelling solution, and reuse it.
However, a specific domain model can be implemented in different ways and vocabularies may provide alternative solutions.
In this paper, we argue that a wrong decision could have a significant impact in terms of performance and, ultimately, the availability of the data.
We take the case of RDF Lists and make the hypothesis that the efficiency of retrieving sequential linked data depends primarily on how they are modelled (triple-store invariance hypothesis).
To demonstrate this, we survey different solutions for modelling sequences in RDF, and propose a pragmatic approach for assessing their impact on data availability.
Finally, we derive good (and bad) practices on how to publish lists as linked open data.
By doing this, we sketch the foundations of an empirical, task-oriented methodology for benchmarking linked data modelling solutions.
SSN-TC workshop talk at ISWC 2015 on Emrooz (Markus Stocker)
Slides for the talk describing the paper on Emrooz, a scalable database for sensor observations with semantics according to the Semantic Sensor Network ontology.
Towards efficient processing of RDF data streams (Alejandro Llaves)
Presentation of short paper submitted to OrdRing workshop, held at ISWC 2014 - http://streamreasoning.org/events/ordring2014.
In recent years, the amount of real-time data generated has increased substantially. Sensors attached to things are transforming how we interact with our environment. Extracting meaningful information from these data streams is essential for some application areas and requires processing systems that scale to varying conditions in data sources, complex queries, and system failures. This paper describes ongoing research on the development of a scalable RDF streaming engine.
Scientific Applications and Heterogeneous Architectures (insideBigData.com)
In this deck from ATPESC 2019, Michela Taufer from UT Knoxville presents: Scientific Applications and Heterogeneous Architectures.
"This talk discusses two emerging trends in computing (i.e., the convergence of data generation and analytics, and the emergence of edge computing) and how these trends can impact heterogeneous applications. Next-generation supercomputers, with their extremely heterogeneous resources and dramatically higher performance than current systems, will generate more data than we need or, even, can handle. At the same time, more and more data is generated at the "edge," requiring computing and storage to move closer and closer to data sources. The coordination of data generation and analysis across the spectrum of heterogeneous systems including supercomputers, cloud computing, and edge computing adds additional layers of heterogeneity to applications' workflows. More importantly, the coordination can neither rely on manual, centralized approaches as it is predominantly done today in HPC nor exclusively be delegated to be just a problem for commercial Clouds. This talk presents case studies of heterogeneous applications in precision medicine and precision farming that expand scientist workflows beyond the supercomputing center and shed our reliance on large-scale simulations exclusively, for the sake of scientific discovery."
Watch the video: https://wp.me/p3RLHQ-lq2
Learn more: https://extremecomputingtraining.anl.gov/archive/atpesc-2019/agenda-2019/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Linked lists represent a countable number of ordered values, and are among the most important abstract data types in computer science. With the advent of RDF as a highly expressive knowledge representation language for the Web, various implementations for RDF lists have been proposed. Yet, there is no benchmark so far dedicated to evaluating the performance of triple stores and SPARQL query engines on dealing with ordered linked data. Moreover, essential tasks for evaluating RDF lists, like generating datasets containing RDF lists of various sizes, or generating the same RDF list using different modelling choices, are cumbersome and unprincipled. In this paper, we propose List.MID, a systematic benchmark for evaluating systems serving RDF lists. List.MID consists of a dataset generator, which creates RDF list data in various models and of different sizes, and a set of SPARQL queries. The RDF list data is coherently generated from a large, community-curated base collection of Web MIDI files, rich in lists of musical events of arbitrary length. We describe the List.MID benchmark, and discuss its impact and adoption, reusability, design, and availability.
With the increasing adoption of NoSQL database systems like MongoDB or CouchDB, more and more applications store structured data according to a non-relational, document-oriented model. Exposing this structured data as Linked Data is currently inhibited by a lack of standards as well as tools, and requires the implementation of custom solutions. While recent efforts aim at expressing transformations of such data models into RDF in a standardized manner, there is a lack of approaches that facilitate SPARQL execution over mapped non-relational data sources. With SparqlMap-M we show how dynamic SPARQL access to non-relational data can be achieved. SparqlMap-M is an extension to our SPARQL-to-SQL rewriter SparqlMap that performs a (partial) transformation of SPARQL queries by using a relational abstraction over a document store. Further, duplicate data in the document store is used to reduce the number of joins, and custom optimizations are introduced. Our showcase scenario employs the Berlin SPARQL Benchmark (BSBM) with different adaptations to a document data model. We use this scenario to demonstrate the viability of our approach and compare it to different MongoDB setups and native SQL.
Jörg Unbehauen | AKSW, Universität Leipzig
Presentation at Semantics 2016 in Leipzig on the results of the LEDS project.
RAISE Lab at Dalhousie University
The lab aims to develop tools and technologies for intelligent automation in software engineering. An overview is presented by Dr. Masud Rahman, Assistant Professor, Faculty of Computer Science, Dalhousie University, Canada.
The Forgotten Role of Search Queries in IR-based Bug Localization: An Empiric... (Masud Rahman)
Being lightweight and cost-effective, IR-based approaches for bug localization have shown promise in finding software bugs. However, the accuracy of these approaches heavily depends on the bug reports they use. A significant number of bug reports contain only plain natural language texts. According to existing studies, IR-based approaches cannot perform well when they use these bug reports as search queries. On the other hand, recent evidence suggests that even these natural language-only reports contain enough good keywords that could help localize the bugs successfully. These findings suggest that natural language-only bug reports might be a sufficient source for good query keywords, but they also cast serious doubt on the query selection practices in IR-based bug localization. In this article, we attempt to clarify this issue by conducting an in-depth empirical study that critically examines the state-of-the-art query selection practices in IR-based bug localization. In particular, we use a dataset of 2,320 bug reports, employ ten existing approaches from the literature, exploit a Genetic Algorithm-based approach to construct optimal or near-optimal search queries from these bug reports, and then answer three research questions. We confirmed that the state-of-the-art query construction approaches are indeed not sufficient for constructing appropriate queries (for bug localization) from certain natural language-only bug reports. However, these bug reports do contain high-quality search keywords in their texts even though they might not contain explicit hints for localizing bugs (e.g., stack traces). We also demonstrate that optimal queries and non-optimal queries chosen from bug report texts are significantly different in terms of several keyword characteristics (e.g., frequency, entropy, position, part of speech).
Such an analysis has led us to four actionable insights on how to choose appropriate keywords from a bug report. Furthermore, we demonstrate 27%–34% improvement in the performance of non-optimal queries through the application of our actionable insights to them. Finally, we summarize our study findings with future research directions (e.g., machine intelligence in keyword selection).
Preprint: https://bit.ly/39nAoun
Publication URL: https://bit.ly/3xVUxlq
Replication package: https://bit.ly/36T8oxL
More details: https://web.cs.dal.ca/~masud
A Strategic Approach: GenAI in Education (Peter Windle)
Artificial Intelligence (AI) technologies such as Generative AI, image generators and Large Language Models have had a dramatic impact on teaching, learning and assessment over the past 18 months. The most immediate threat AI posed was to academic integrity, with Higher Education Institutes (HEIs) focusing their efforts on combating the use of GenAI in assessment. Guidelines were developed for staff and students, and policies were put in place. Innovative educators have forged paths in the use of Generative AI for teaching, learning and assessment, leading to pockets of transformation springing up across HEIs, often with little or no top-down guidance, support or direction.
This Gasta posits a strategic approach to integrating AI into HEIs to prepare staff, students and the curriculum for an evolving world and workplace. We will highlight the advantages of working with these technologies beyond the realm of teaching, learning and assessment by considering prompt engineering skills, industry impact, curriculum changes, and the need for staff upskilling. In contrast, not engaging strategically with Generative AI poses risks, including falling behind peers, missed opportunities and failing to ensure our graduates remain employable. The rapid evolution of AI technologies necessitates a proactive and strategic approach if we are to remain relevant.
Introduction to AI for Nonprofits with Tapp Network (TechSoup)
Dive into the world of AI! Experts Jon Hill and Tareq Monaur will guide you through AI's role in enhancing nonprofit websites and basic marketing strategies, making it easy to understand and apply.
Macroeconomics - Movie Location
This will be used as part of your Personal Professional Portfolio once graded.
Objective:
Prepare a presentation or a paper using research, basic comparative analysis, data organization and application of economic information. You will make an informed assessment of an economic climate outside of the United States to accomplish an entertainment industry objective.
This presentation covers the basics of PCOS, its pathology and treatment, along with the Ayurvedic correlation of PCOS and the Ayurvedic line of treatment mentioned in the classics.
How to Build a Module in Odoo 17 Using the Scaffold Method (Celine George)
Odoo provides an option for creating a module by using a single line command. By using this command the user can make a whole structure of a module. It is very easy for a beginner to make a module. There is no need to make each file manually. This slide will show how to create a module using the scaffold method.
Thinking of getting a dog? Be aware that breeds like Pit Bulls, Rottweilers, and German Shepherds can be loyal and dangerous. Proper training and socialization are crucial to preventing aggressive behaviors. Ensure safety by understanding their needs and always supervising interactions. Stay safe, and enjoy your furry friends!
This presentation was provided by Steph Pollock of The American Psychological Association’s Journals Program, and Damita Snow, of The American Society of Civil Engineers (ASCE), for the initial session of NISO's 2024 Training Series "DEIA in the Scholarly Landscape." Session One: 'Setting Expectations: a DEIA Primer,' was held June 6, 2024.
Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...
PhD proposal of Masud Rahman
1. SUPPORTING SOURCE CODE SEARCH WITH
CONTEXT-AWARE, ANALYTICS-DRIVEN
QUERY REFORMULATION
Masud Rahman
Department of Computer Science
University of Saskatchewan, Canada
Advisor: Dr. Chanchal Roy
@masud233
2. TALK OUTLINE
Part 1: Research Problem
Part 2: PhD Thesis
Part 3: Contribution Summary
Part 4: Q&A + Discussions
Masud Rahman, UofS
4. MCAS: A SOFTWARE BUG THAT KILLS
Boeing 737 MAX 8
MCAS
5. THE SEARCH FOR THE BUGGY CODE
[Diagram: a Boeing customer submits an MCAS bug report; a Boeing developer searches the Boeing codebase for the buggy code, supported by query suggestion and query reformulation]
6. SYSTEMATIC LITERATURE REVIEW
ACM DL
CrossRef
DBLP
Mendeley
Google Scholar
IEEE Xplore
ProQuest
ScienceDirect
SpringerLink
Web of Science
Wiley Online Lib
Initial results: 2871 → Impurity removal: 2317 → Filter by title: 562 → Filter by abstract: 195 → Filter by full texts: 93 → Merging & duplicate removal: 56 primary studies
Query reformulation, query expansion, query reduction, query formulation,
query refinement, automated query expansion, AQE, query suggestion,
query recommendation, term selection, query replacement, query difficulty,
query quality, keyword selection, keyword extraction, search term
identification, search query, search term, and search keyword.
7. I1: INAPPROPRIATE TERM WEIGHTING
TF-IDF(t, d) = (1 + log(tf(t, d))) × log(|D| / df(t))
• Different syntax
• Different semantics
• Different structures
RQ1: Can TF-IDF deliver appropriate search keywords either from source code or from bug reports? If not, how can we improve the keyword selection?
8. I2: LOW QUALITY OF BUG REPORTS
5000+ bug reports
Poor / Noisy / Rich
RQ2: Can we deliver appropriate keywords for IR-based bug localization (a.k.a. local code search) by incorporating the bug report quality?
Traditional Practices
9. I3: WORDNET FOR SEMANTIC SIMILARITY
RQ3: Can we deliver appropriate query keywords for the code search using crowd knowledge (Stack Overflow) and large data analytics (FastText)?
12. Graph-based Term Weighting
RQ1: Can TF-IDF deliver appropriate search keywords either from source code or from bug reports? If not, how can we improve the keyword selection?
13. TF-IDF: TERM IMPORTANCE (TRADITIONAL)
University of Saskatchewan
The Saskatchewan Huskies football team
represents the University of Saskatchewan
in U Sports football that competes in the
Canada West Universities Athletic
Association conference of U Sports. The
program has won the Vanier Cup national
championship three times, in 1990, 1996
and 1998.
The Saskatchewan Huskies
became only the second U Sports team to
advance to three consecutive Vanier Cup
games, after the Saint Mary's Huskies, but
lost all three games from 2004-2006. The
team has won the most Hardy Trophy
titles in Canada West, having won a total
of 20 times. The 2006 Saskatchewan
Huskies became only the third team to
play in a Vanier Cup that their school was
hosting, when the University of
Saskatchewan hosted the 42nd Vanier
Cup. The Toronto Varsity Blues were the
first when they won two Vanier Cups in
1965 and 1993. Saskatchewan also
became the first western school to host
the national championship game.
Term           TF    IDF     TF × IDF
Saskatchewan   6     0.01    0.06
Vanier         5     0.1     0.5
Won            4     0.1     0.4
Huskies        4     0.1     0.4
Cup            4     0.01    0.04
Team           4     0.01    0.04
Sports         3     0.02    0.06
Times          2     0.03    0.06
School         2     0.05    0.1
Championship   2     0.03    0.06
IDF = log(N / DF)
Saskatchewan Huskies
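The traditional weighting above can be sketched in a few lines of Python. This is a minimal illustration of the scheme shown on the slide, not the thesis tooling; the corpus and tokenization are toy assumptions:

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """Slide's scheme: TF-IDF(t, d) = (1 + log tf) * log(N / df)."""
    tf = Counter(doc_tokens)[term]
    if tf == 0:
        return 0.0
    df = sum(1 for d in corpus if term in d)  # documents containing the term
    return (1 + math.log(tf)) * math.log(len(corpus) / df)
```

Terms that appear often in one document but rarely across the corpus (like "Vanier" above) score highest, which is exactly the effect shown in the table.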
20. SEARCH QUERY FROM NOISY BUG REPORT
Bug 31637 – should be able to cast null
NullPointerException
[Figure: candidate classes/methods (Ci, Cj, Mk, Mn, Cp) with result ranks 53, 01, 20]
RQ2: High-quality keywords can be provided for IR-based bug localization (a.k.a. local code search) by considering bug report quality.
25. EXPERIMENT, DATASET & METRICS
Dataset: 5K+ bug reports, version history, ground truth
Metrics:
1. Hit@K
2. MAP@K
3. MRR@K
4. QE (Query Effectiveness)
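The metrics listed above can be computed directly from a ranked result list. This is a generic sketch of Hit@K, MRR@K, and QE (taken here as the rank of the first correct result), not the study's evaluation code:

```python
def hit_at_k(ranked, relevant, k):
    """1 if any ground-truth item appears within the top-k results."""
    return int(any(r in relevant for r in ranked[:k]))

def mrr_at_k(queries, k):
    """Mean reciprocal rank; queries is a list of (ranked, relevant) pairs."""
    total = 0.0
    for ranked, relevant in queries:
        for i, r in enumerate(ranked[:k], start=1):
            if r in relevant:
                total += 1.0 / i
                break
    return total / len(queries)

def query_effectiveness(ranked, relevant):
    """QE: rank of the first correct result (lower is better); None if absent."""
    for i, r in enumerate(ranked, start=1):
        if r in relevant:
            return i
    return None
```

MAP@K follows the same pattern, averaging precision at each relevant hit within the top-k.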
26. SEARCH CONTEXTS: LOCAL & INTERNET-SCALE
Local code search (e.g., bug localization): Boeing codebase
Internet-scale code search: GitHub
76%
27. CROWD KNOWLEDGE & DATA ANALYTICS FOR QUERY EXPANSION
Convert image to gray scale without losing transparency
BufferedImage Grayscale ImageEdit ColorConvertOp File Transparency ColorSpace BufferedImageOp Graphics ImageEffects
28. WHAT IS CROWD KNOWLEDGE?
29. RACK: QUERIES USING CROWD KNOWLEDGE
Example: query keywords {generate, MD5, hash} → suggested API class: MessageDigest
RQ3: Appropriate query keywords (e.g., relevant API classes) can be delivered for the code search using crowd knowledge (Stack Overflow).
Q* = Q + C (C from the Keyword-API Mapping DB)
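The idea Q* = Q + C can be illustrated with a toy keyword-API map. The mapping below is entirely hypothetical; RACK mines the real one from Stack Overflow questions and their accepted answers:

```python
from collections import Counter

# Hypothetical keyword-to-API co-occurrence map (RACK builds this
# from Stack Overflow Q&A; these entries are illustrative only).
KEYWORD_API_DB = {
    "generate": ["SecureRandom"],
    "md5": ["MessageDigest"],
    "hash": ["MessageDigest", "HashMap"],
}

def expand_query(query_keywords, db, top=3):
    """Q* = Q + C: vote for API classes across all query keywords."""
    votes = Counter()
    for kw in query_keywords:
        votes.update(db.get(kw, []))
    return list(query_keywords) + [api for api, _ in votes.most_common(top)]
```

API classes that co-occur with several query keywords collect more votes, so MessageDigest outranks the single-keyword candidates here.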
30. NLP2API: QUERIES WITH DATA ANALYTICS
Semantic Proximity: if proximity(Q, A) > proximity(Q, B), then Q* = Q + A
RQ3: Appropriate query keywords (e.g., relevant API classes) can be delivered for the code search using large-scale data analytics (FastText).
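The proximity test can be sketched with cosine similarity over word vectors. The 2-D vectors here are toys; NLP2API computes proximity over FastText embeddings trained on Stack Overflow text:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def choose_expansion(query_vec, candidates):
    """Expand with A rather than B when proximity(Q, A) > proximity(Q, B)."""
    name, _ = max(candidates, key=lambda c: cosine(query_vec, c[1]))
    return name
```

In practice the query vector would be an aggregate (e.g., the mean) of the embeddings of the query's keywords.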
40. TWO WORKING CONTEXTS: LOCAL & GLOBAL
Local code search (e.g., bug localization): Boeing codebase
Internet-scale code search: GitHub
41. S2: KEYWORD SELECTION FROM SOURCE CODE WITH CODERANK
resolveRuntimeClasspathEntry
Resolve Runtime Classpath Entry
S(v_i) = (1 - d) + d × Σ_{v_j ∈ In(v_i)} S(v_j) / |Out(v_j)|, where 0 < d < 1
RQ1 [Source Code]: Keywords selected by PageRank are more effective for local code searches (e.g., concept location) than those selected by TF-IDF.
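The recurrence above can be iterated to a fixed point. Here is a compact sketch over a token co-occurrence graph; building that graph from source code identifiers is simplified to a plain adjacency map:

```python
def code_rank(graph, d=0.85, iters=50):
    """Iterate S(v_i) = (1 - d) + d * sum over j in In(v_i) of S(v_j)/|Out(v_j)|.

    graph maps each token to the list of tokens it links to
    (e.g., co-occurrence edges between identifier words).
    """
    nodes = set(graph) | {w for outs in graph.values() for w in outs}
    incoming = {v: [] for v in nodes}
    for v, outs in graph.items():
        for w in outs:
            incoming[w].append(v)
    scores = dict.fromkeys(nodes, 1.0)
    for _ in range(iters):
        scores = {
            v: (1 - d) + d * sum(scores[j] / len(graph[j]) for j in incoming[v])
            for v in nodes
        }
    return scores
```

Tokens split from an identifier such as resolveRuntimeClasspathEntry would form the nodes; highly connected tokens accumulate score and become query keywords.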
42. HOW DID WE DO?
RQ3: Appropriate query keywords can be delivered for code search using Stack Overflow and FastText.
45. R4: GENETIC ALGORITHM FOR QUERIES
Method       Search Query                                           QE
Baseline     {title + description}                                   25
STRICT[140]  {tab classpath enabled buttons user entry}              86
TF-IDF       {button entry bootstrap enabled incorrectly moving}    177
GA           {open reflect tab bottom entry classpath}               01
Lower QE is better
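As a rough illustration of how a genetic algorithm can pick query keywords, here is a minimal sketch: individuals are bitmasks over candidate keywords, evolved with single-point crossover and point mutation. The fitness function, population size, and operators are illustrative assumptions, not R4's actual configuration (a real fitness would be retrieval effectiveness, e.g., QE, against the codebase).

```python
import random

def genetic_query_selection(keywords, fitness, pop_size=20, gens=30, seed=1):
    """Evolve a bitmask over candidate keywords, maximizing a fitness score."""
    rng = random.Random(seed)
    n = len(keywords)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    def decode(mask):
        return [k for k, bit in zip(keywords, mask) if bit]
    for _ in range(gens):
        pop.sort(key=lambda m: fitness(decode(m)), reverse=True)
        parents = pop[: pop_size // 2]           # selection: keep the fitter half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)            # single-point crossover
            child = a[:cut] + b[cut:]
            child[rng.randrange(n)] ^= 1         # point mutation
            children.append(child)
        pop = parents + children
    best = max(pop, key=lambda m: fitness(decode(m)))
    return decode(best)

# Toy fitness (hypothetical): reward overlap with terms known to retrieve
# the buggy file, lightly penalize query length.
relevant = {"tab", "classpath", "entry"}
fitness = lambda q: len(set(q) & relevant) - 0.1 * len(q)
query = genetic_query_selection(
    ["open", "tab", "classpath", "button", "entry", "moving"], fitness)
```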
46. SEARCH QUERY FROM NOISY BUG REPORT
Bug 31637 – should be able to cast null
NullPointerException
Result rank: 53 (baseline query) → 01 (reformulated query)
Hello everyone! Good afternoon! Thanks for attending this meeting.
My name is Masud Rahman. I am a PhD Candidate from Software Research Lab.
I work with Dr. Chanchal K. Roy.
Today, I will be talking about automated query reformulations for code search.
Today, my talk will be divided into four sections.
In the first section, I will discuss the research problem I am trying to solve in my PhD.
In the second section, I will discuss about my PhD Thesis proposals to solve that research problem.
In the third section, I will summarize my PhD contributions.
Finally, we will have a Q&A session and interesting discussions.
Part 1: Research Problem
You are looking at two aircraft -- Ethiopian Airlines and Lion Air (Indonesia).
Both are in what is called a nose-down situation. Due to these nose-down situations, there were two fatal crashes in a single calendar year.
These crashes took 346 precious human lives and cost billions of dollars.
Now, the culprit is MCAS. This is a software component that was added to the Boeing 737 MAX 8.
The bottom line is this: the component was faulty and poorly designed, and it ultimately led to the crashes.
That is why Boeing 737 MAX planes are grounded right now.
Now, let's say a Boeing customer has submitted a bug report.
A Boeing developer is now responsible for locating and repairing the faulty code triggering that bug.
As a frequent practice, the developer chooses a few important keywords and attempts to locate the buggy code within the Boeing codebase.
But the study shows that 88% of the keywords chosen by developers could be incorrect. That is, they do not return the buggy code.
So, the obvious next step is to reformulate the query through automated tool supports, so that the buggy code could be located.
There are also tools that take a bug report and suggest appropriate search queries in the first place.
So, we are interested in this part of the process, and my PhD focuses on it.
As you can also see that, Google does not have any jurisdiction in this case.
So, what did we do? We conducted a systematic literature survey of 56 primary studies on query reformulation for code search.
During this study, we found 3 major issues in the literature.
Now, this is a metric that has been in play since the last century; it was proposed in the 70s.
It is a good metric, but it was actually proposed for regular texts such as news articles or plain texts.
On the other hand, we are dealing with source code here.
Now, regular texts and source code have different semantics and different structures.
They are not the same
So, metrics for regular texts are not appropriate for source code -- this is our hypothesis.
So, here is our first research question: how does TF-IDF perform? And if not well, can we propose something better?
We did an empirical study with 5K+ bug reports in our ICSE poster.
And we discovered that bug reports could be very different in terms of quality.
There could be different types of bug reports.
It could be noisy, containing stack traces -- that is 16%.
It could be really poor, not containing any structured entities -- that is 30%.
Or it could be a rich bug report that includes source code, test cases and other artifacts -- that is 54%.
Now, what do existing studies do? They treat all these different types of bug reports the same.
So, in their approach, not everybody gets a chance to watch the game.
So, here is our second research question. Can we incorporate reporting quality into bug localization and deliver better queries?
Identifying similar words is very important during query reformulation.
We found that WordNet has been used extensively in the literature for finding similar words.
Now, it is good for regular texts. But again, we are dealing with source code here.
Evidence suggests that WordNet might not work well for source code.
However, those were the old days. Now we have Stack Overflow and advanced tools like FastText for semantic similarity calculation.
So, here is our third RQ. Can we deliver appropriate keywords during code search using Stack Overflow and FastText?
Now, we are done with Background concepts, Part 1.
Now, we are going into Part 2 -- PhD Thesis
So, this is our thesis statement. We hypothesize that we can improve query reformulation using:
graph-based term weighting rather than TF-IDF,
bug report quality and document contexts, and
crowdsourced knowledge (i.e., Stack Overflow) and data analytics such as word embeddings from FastText.
So, to evaluate these hypotheses, we conduct six studies in the PhD.
The first and second studies address RQ1, the third study addresses RQ2, and the rest answer RQ3.
Similarly, we can see the phrases and dependencies among the terms in the bug report texts as well.
Our job is to identify the keywords from these texts, right?
So, what did we do?
We consider the co-occurrences among the terms. That is, how terms occur with other terms within a certain context.
We encode such co-occurrences as edges, and transform the texts into a graph like this.
Besides term co-occurrences, we consider another aspect called syntactic dependencies.
For this, we used Jespersen's Rank Theory, a theory developed back in 1925.
According to this theory, the parts of speech of a sentence can be divided into three ranks:
nouns in the first rank, verbs and adjectives in the second rank, and the rest in the third rank.
According to Jespersen, verbs and adjectives modify nouns. That is, there are syntactic dependencies between these elements that convey the overall meaning of the sentence.
Now, we capture such syntactic dependencies as well, and transform the report texts into a POS graph as well.
So, we have created two graphs, right?
Now, we have two graphs developed from the bug report based on two different dimensions
--Word co-occurrence and syntactic dependence.
Once we have the graphs, we apply the famous PageRank algorithm, the backbone of Google search.
Now, the algorithmic details are a bit complex, but I will try to provide an overview here.
Why do you think this guy is laughing? Because he is getting the maximum votes.
Similarly, in the graph, the node that is connected to most of the nodes is the winner.
That is, a term’s importance will be determined by its connectivity with other nodes.
More importantly, since this is a recursive algorithm, a node's importance depends on the weights of the connected nodes as well.
Once the computation is done, we get a reformulation candidate from each graph.
What is the reformulation candidate? – a ranked list of keywords like this.
So, we collect two candidates from the two graphs, apply machine learning, and suggest the best one as our suggested query for the bug report.
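The pipeline described above -- build a term graph from co-occurrences within a sliding window, then rank terms with PageRank -- can be sketched as follows. The window size and damping factor here are common defaults, not necessarily the study's exact settings, and the syntactic-dependency graph is omitted for brevity.

```python
from collections import defaultdict

def build_term_graph(tokens, window=2):
    """Connect terms that co-occur within a sliding window."""
    graph = defaultdict(set)
    for i, term in enumerate(tokens):
        for other in tokens[max(0, i - window): i]:
            if other != term:
                graph[term].add(other)
                graph[other].add(term)
    return graph

def pagerank(graph, d=0.85, iters=50):
    """Iterative PageRank: a term's score depends on its neighbours' scores."""
    scores = {v: 1.0 for v in graph}
    for _ in range(iters):
        scores = {
            v: (1 - d) + d * sum(scores[u] / len(graph[u]) for u in graph[v])
            for v in graph
        }
    return scores

# Toy token stream (hypothetical bug-report terms).
tokens = "classpath entry resolve runtime classpath entry tab".split()
ranks = pagerank(build_term_graph(tokens))
top = sorted(ranks, key=ranks.get, reverse=True)  # reformulation candidate
```

Terms that co-occur with many other terms (here, "classpath" and "entry") end up with the highest scores, which is exactly the "most votes wins" intuition from the slides.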
Now once such items are extracted, we split them.
Now, as we see, these single terms carry partial semantics that combine to convey a broader meaning.
That is, they complement each other in this context.
Now, we capture such semantic dependencies in the source code, and develop a term graph like this.
So, first we take a bug report as input.
Then we apply regular expressions to identify the structured components.
We then classify whether this is:
a noisy report containing stack traces,
a poor bug report containing only regular texts, or
a rich bug report containing both source code and texts.
Once the quality level is identified, what’s the next step?
Well, we do query reformulation unlike the earlier studies.
We separate the signal from the noise in the noisy report, and feed the poor bug report with appropriate keywords.
We mostly keep the rich bug report as is.
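A minimal sketch of this regex-based classification might look like the following; the patterns are illustrative approximations, not the study's actual expressions.

```python
import re

# Illustrative patterns (assumptions, not the study's real regexes):
# a Java-style stack frame, and a rough code-snippet heuristic.
STACK_TRACE = re.compile(r'^\s+at\s+[\w.$]+\([\w.]*:?\d*\)', re.MULTILINE)
CODE_SNIPPET = re.compile(r'\{[^}]*;[^}]*\}|\b\w+\s*\([^)]*\)\s*;')

def classify_bug_report(text):
    """Classify a report as noisy (stack traces), rich (code), or poor (plain text)."""
    if STACK_TRACE.search(text):
        return "noisy"
    if CODE_SNIPPET.search(text):
        return "rich"
    return "poor"
```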
So, that is the equity approach.
So, from a noisy report, we extract
The report title
The encountered exception
The most important keywords from the stack traces.
Then we do the search with this newly constructed query.
For example, the baseline noisy query returns the result at 53rd position.
Whereas our query returns the correct result at the topmost position.
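The query construction for a noisy report -- title, plus the encountered exception, plus important stack-trace terms -- can be sketched as follows. Term weighting is simplified here to plain frequency counting over trace tokens; the real approach's weighting is more sophisticated.

```python
import re
from collections import Counter

def query_from_noisy_report(title, stack_trace, top_k=3):
    """Build a query from the title, the exception, and frequent trace terms."""
    exception = re.search(r'\b(\w+(?:Exception|Error))\b', stack_trace)
    # Split dotted stack frames into package/class/method tokens and count them.
    frames = re.findall(r'at\s+([\w.$]+)\(', stack_trace)
    tokens = Counter(tok for frame in frames for tok in frame.split('.'))
    keywords = [tok for tok, _ in tokens.most_common(top_k)]
    return title.split() + ([exception.group(1)] if exception else []) + keywords

# Hypothetical trace loosely modelled on the Bug 31637 slide.
trace = """java.lang.NullPointerException
    at org.eclipse.jdt.Cast.resolve(Cast.java:53)
    at org.eclipse.jdt.Cast.check(Cast.java:90)"""
q = query_from_noisy_report("should be able to cast null", trace)
```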
First, we construct a semantic hyperspace using Stack Overflow corpus.
What is hyperspace?
Well, if a space has more than three dimensions, we call it a hyperspace.
How do we do it?
First, we take the Stack Overflow data dump, which contains software-specific texts. Our corpus contains about 2.1 million questions and answers.
We do pre-processing and feed the contents to FastText. Now FastText generates a three-layer neural network model.
This model essentially represents the whole vocabulary like this in a hyperspace.
Now how does it help?
Here we see that burger is close to sandwich. Why? Because they are eaten together? I do that all the time.
Well, that is not the case.
They are mentioned in the similar contexts by the people across the whole corpus.
The model recognizes such occurrences and thus puts burger and sandwich close together.
Similarly, dumpling and ramen are close to each other.
Now, we propose this. This is the original query, and this is the reformulated query.
A good reformulated query will cluster together with the original query.
A bad reformulated query will NOT cluster with the original query.
So, clustering tendency within the hyperspace is our weapon here.
We calculate the Hopkins statistic and the Polygon Area to measure this clustering tendency.
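The Hopkins statistic itself can be sketched as follows: compare nearest-neighbour distances from uniformly random probe points against those from sampled data points. Values near 1 indicate strong clustering; values near 0.5 indicate a random spread. The sample size and seed below are illustrative choices.

```python
import numpy as np

def hopkins(points, m=None, seed=0):
    """Hopkins statistic: H near 1 means the points cluster strongly."""
    rng = np.random.default_rng(seed)
    X = np.asarray(points, dtype=float)
    n, d = X.shape
    m = m or max(1, n // 10)
    lo, hi = X.min(axis=0), X.max(axis=0)

    def nn_dists(probes, data):
        # Distance from each probe point to its nearest neighbour in `data`.
        dists = np.linalg.norm(data[None, :, :] - probes[:, None, :], axis=2)
        return np.sort(dists, axis=1)[:, 0]

    U = rng.uniform(lo, hi, size=(m, d))                 # uniform probe points
    idx = rng.choice(n, m, replace=False)                # sampled data points
    u = nn_dists(U, X).sum()                             # probe -> data
    w = nn_dists(X[idx], np.delete(X, idx, 0)).sum()     # data -> other data
    return u / (u + w)
```

For two tight, well-separated blobs, the probe-to-data distances dominate the data-to-data distances, so H comes out close to 1.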
Now, for the experiments, we chose 8 subject systems from Apache and Eclipse.
We collect about 3,000 bug reports and map them to the version control history on GitHub.
Through such mapping we extract the ground truth for the bug reports.
This is a standard process followed by the existing literature.
Now lets expand and generalize the problem a bit.
So far, we have discussed code search within a local codebase.
It could also take place in a large-scale open-source repository such as GitHub.
Now, based on these contexts, there are different challenges in query reformulation.
The local codebase is small, domain specific and organized.
On the contrary, GitHub is huge, cross-domain and very noisy.
So, yes, they need different strategies to suggest queries for them.
Now, I am not going to discuss those studies in detail.
But here is the glimpse.
Developers generally look for relevant code on the web using natural language query.
Please note that we are not talking about simple web search; rather, we are talking about source code repositories such as GitHub.
Now, GitHub provides this result. You see, it tries to match the query keywords with comments and identifiers.
But we are dealing with source code, right? So, we need a source-code-friendly query for a better result.
So, we identify relevant API classes against this natural language query through extensive data mining and data analytics.
And once again, Stack Overflow is our friend in this grand challenge.
Thanks for your time and attention.
Now, I am ready to take your questions.
For the API suggestion, we collect natural language queries from four tutorial sites such as KodeJava.
We collect 300+ queries, and we also collect the ground truth API classes for them.
Then we try to determine whether our approach can suggest appropriate API classes for those queries by mining crowd knowledge from Stack Overflow.
For the query reformulation part, we collect 4K code examples from GitHub and combine them with our ground truth code segments from the tutorial sites.
Then we determine whether our reformulated query actually works or not.
Now let me explain the metrics a bit since we will be using these a lot.
Hit@K is the percentage of the queries for which at least one ground truth is found within the top K results.
MAP combines precision with result position. The details are more complex, which I can discuss later.
MRR is the inverse of the rank of the first ground truth within the results.
QE, which stands for query effectiveness, is essentially the opposite of MRR: the rank of the first correct result, so lower is better.
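These four metrics can be sketched as follows. Here `runs` is a list of (ranked results, ground-truth set) pairs per query, and QE is taken as the rank of the first correct result (lower is better), consistent with the slides.

```python
def hit_at_k(runs, k):
    """Percentage of queries with a ground-truth doc in the top-k results."""
    hits = sum(1 for results, truth in runs if set(results[:k]) & truth)
    return 100.0 * hits / len(runs)

def reciprocal_rank(results, truth, k):
    for rank, doc in enumerate(results[:k], start=1):
        if doc in truth:
            return 1.0 / rank
    return 0.0

def mrr_at_k(runs, k):
    """Mean reciprocal rank of the first correct result."""
    return sum(reciprocal_rank(r, t, k) for r, t in runs) / len(runs)

def avg_precision(results, truth, k):
    hits, total = 0, 0.0
    for rank, doc in enumerate(results[:k], start=1):
        if doc in truth:
            hits += 1
            total += hits / rank
    return total / max(1, min(len(truth), k))

def map_at_k(runs, k):
    """Mean average precision over all queries."""
    return sum(avg_precision(r, t, k) for r, t in runs) / len(runs)

def query_effectiveness(results, truth):
    """QE: rank of the first correct result (lower is better)."""
    for rank, doc in enumerate(results, start=1):
        if doc in truth:
            return rank
    return len(results) + 1
```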
That is, each of the three people -- the customer, the past developer, and JOE -- has their own vocabulary to describe a certain problem or concept.
In fact, the probability that any two people will describe the same problem with the same vocabulary is only 15%-20%.
So, naturally, developer JOE finds it a great challenge to make a connection between the bug report and the buggy code.
This costs development time, money and valuable effort.
Now the question is, why is this so challenging?
The answer is the vocabulary mismatch problem. In fact, this is a common problem for any type of document search.
Here we see both guys are looking at the same object, but they are explaining it differently.
That is, they are both correct from their own perspective, but wrong from the other guy's perspective.
This actually happens with bug reports as well.
The probability that both the customer and the developer will explain the same problem using the same terminology is only 15%.
That is why selecting appropriate keywords from the bug report is very challenging.
Let us see an example.
This is a bug report; this is the title, and this is the description.
Now, developer JOE would use this bug report to localize the bug from source code.
Now, he chooses some ad hoc queries.
Which one is the best do you think, here? PAUSE!
Well, let's see. This one returns the correct result at this position. That means the developer needs to check 1300+ results before reaching the correct result if he tries this query.
… oh… this one is the best.
So, selecting appropriate keywords from the bug report is not that simple.