This document presents a comparative study of text models for information retrieval (IR)-based bug localization. It evaluates generic models such as the vector space model (VSM), latent semantic analysis (LSA), latent Dirichlet allocation (LDA), and the cluster-based document model (CBDM) on a dataset of 291 bugs from the iBUGS AspectJ repository. It finds that simpler models such as the unigram model and VSM perform comparably to, or better than, more complex topic-based models such as LDA and LSA. The study also compares the performance of IR-based localization to static and dynamic bug localization tools.
Exploiting Distributional Semantic Models in Question Answering
Pierpaolo Basile
This paper investigates the role of Distributional
Semantic Models (DSMs) in Question Answering (QA), and
specifically in a QA system called QuestionCube. QuestionCube is
a framework for QA that combines several techniques to retrieve
passages containing the exact answers for natural language questions.
It exploits Information Retrieval models to seek candidate
answers and Natural Language Processing algorithms for the
analysis of questions and candidate answers both in English and
Italian. The data source for the answer is an unstructured text
document collection stored in search indices.
In this paper we propose to exploit DSMs in the QuestionCube
framework. In DSMs words are represented as mathematical
points in a geometric space, also known as semantic space. Words
are similar if they are close in that space. Our idea is that
DSM approaches can help to compute relatedness between users’
questions and candidate answers by exploiting paradigmatic
relations between words. Results of an experimental evaluation
carried out on the CLEF 2010 QA dataset prove the effectiveness of
the proposed approach.
Learning to Rank for Recommender Systems - ACM RecSys 2013 tutorial
Alexandros Karatzoglou
The slides from the Learning to Rank for Recommender Systems tutorial given at ACM RecSys 2013 in Hong Kong by Alexandros Karatzoglou, Linas Baltrunas and Yue Shi.
In this natural language understanding (NLU) project, we implemented and compared various approaches for predicting the topics of paragraph-length texts. This paper explains our methodology and results for the following approaches: Naive Bayes, One-vs-Rest Support Vector Machine (OvR SVM) with GloVe vectors, Latent Dirichlet Allocation (LDA) with OvR SVM, Convolutional Neural Networks (CNN), and Long Short Term Memory networks (LSTM).
Selectivity Estimation for Hybrid Queries over Text-Rich Data Graphs
Wagner Andreas
Many databases today are text-rich, comprising not only structured but also textual data. Querying such databases involves predicates matching structured data combined with string predicates featuring textual constraints. Based on selectivity estimates for these predicates, query processing, as well as other tasks that can be solved through such queries, can be optimized. Existing work on selectivity estimation focuses either on string or on structured query predicates alone. Further, the probabilistic models proposed to incorporate dependencies between predicates are focused on the relational setting. In this work, we propose a template-based probabilistic model, which enables selectivity estimation for general graph-structured data. Our probabilistic model allows dependencies between structured data and its text-rich parts to be captured. With this general probabilistic solution, BN+, selectivity estimates can be obtained for queries over text-rich graph-structured data, which may contain structured and string predicates (hybrid queries). In our experiments on real-world data, we show that capturing dependencies between structured and textual data in this way greatly improves the accuracy of selectivity estimates without compromising efficiency.
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Leonardo Di Donato
Experimental work on the use of Topic Modeling for the implementation and improvement of some common tasks of Information Retrieval and Word Sense Disambiguation.
It first describes the scenario, the pre-processing pipeline realized, and the framework used. It then discusses the investigation of several different hyperparameter configurations for the LDA algorithm.
The work continues with the retrieval of relevant documents, mainly through two different approaches: inferring the topic distribution of the held-out document (or query) and comparing it against the collection's documents to retrieve similar ones, or through an approach driven by probabilistic querying. The last part of this work is devoted to the investigation of the word sense disambiguation task.
Filtering Inaccurate Entity Co-references on the Linked Open Data
ebrahim_bagheri
A method for identifying incorrect sameAs links on the Linked Open Data cloud
Details published in:
John Cuzzola, Ebrahim Bagheri, Jelena Jovanovic:
Filtering Inaccurate Entity Co-references on the Linked Open Data. DEXA (1) 2015: 128-143
Deep neural methods have recently demonstrated significant performance improvements in several IR tasks. In this lecture, we will present a brief overview of deep models for ranking and retrieval.
This is a follow-up lecture to "Neural Learning to Rank" (https://www.slideshare.net/BhaskarMitra3/neural-learning-to-rank-231759858)
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
IJDKP
This article introduces some approaches for improving text categorization models by integrating previously imported ontologies. From the Reuters Corpus Volume I (RCV1) dataset, some categories very similar in content and related to the telecommunications, Internet and computer areas were selected for the model experiments. Several domain ontologies covering these areas were built and integrated into the categorization models to improve them.
MSR presentation
Retrieval from Software Libraries for Bug Localization: A Comparative Study of Generic and Composite Text Models
Shivani Rao and Avinash Kak
School of ECE, Purdue University
May 21, 2011
Mining Software Repositories (MSR), Hawaii, 2011
Outline
1. Bug localization
2. IR (Information Retrieval)-based bug localization
3. Text models
4. Preprocessing of the source files
5. Evaluation metrics
6. Results
7. Conclusion
Bug localization
Bug localization means locating the files, methods, classes, etc., that are directly related to the problem causing abnormal execution behavior of the software.
IR-based bug localization means locating a bug from its textual description.
Background
Figure: A typical bug localization process
Figure: A typical bug report (JEdit)
Past work on IR-based bug localization
Authors/Paper       Model            Software dataset
Marcus et al. [1]   VSM              JEdit
Cleary et al. [2]   LM, LSA and CA   Eclipse JDT
Lukins et al. [3]   LDA              Mozilla, Eclipse, Rhino and JEdit
Drawbacks
1. None of the reported work has been evaluated on a standard dataset.
2. Inability to compare with static and dynamic techniques.
3. The number of bugs is on the order of 5-30.
iBUGS
Created by Dallmeier and Zimmermann [4], iBUGS contains a large number of real bugs with corresponding test suites in order to generate failing and passing test runs. We use the ASPECTJ software from iBUGS.
Software library size (number of files)   6546
Lines of code                             75 KLOC
Vocabulary size                           7553
Number of bugs                            291

Table: The iBUGS dataset after preprocessing
Figure: A typical bug report in the iBUGS repository
Text models
VSM : Vector Space Model
LSA : Latent Semantic Analysis Model
UM : Unigram Model
LDA : Latent Dirichlet Allocation Model
CBDM : Cluster-Based Document Model
Vector Space Model
If V is the vocabulary, then queries and documents are |V|-dimensional vectors:

sim(q, d_m) = (w_q · w_m) / (|w_q| |w_m|)

The space is sparse yet high-dimensional.
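As an illustrative sketch (toy vectors, not the paper's data), the cosine similarity above can be computed as:

```python
import numpy as np

def cosine_similarity(w_q, w_m):
    """sim(q, d_m) = (w_q . w_m) / (|w_q| |w_m|)."""
    denom = np.linalg.norm(w_q) * np.linalg.norm(w_m)
    return float(w_q @ w_m / denom) if denom else 0.0

# Toy |V|-dimensional term-frequency vectors for a query and a document.
w_q = np.array([1.0, 0.0, 2.0, 0.0])
w_m = np.array([2.0, 1.0, 4.0, 0.0])
score = cosine_similarity(w_q, w_m)
```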
Latent semantic analysis: eigen decomposition (SVD)

A = U Σ V^T
LSA-based models
Topic-based representation: w_K(m), a K-dimensional vector in the eigen space representing the mth document w_m:

w_K(m) = Σ_K^{-1} U_K^T w_m

q_K = Σ_K^{-1} U_K^T q

sim(q, d_m) = (q_K · w_K(m)) / (|q_K| |w_K(m)|)

LSA2: fold the K-dimensional representation back to a smoothed |V|-dimensional representation, w̃ = U_K Σ_K w_K, and compare it directly with the query q.

Combined representation: combines the LSA2 representation Ã with the VSM representation A using the mixture parameter λ:

A_combined = λA + (1 − λ)Ã
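A minimal numerical sketch of these projections, using a toy term-document matrix rather than the iBUGS data:

```python
import numpy as np

# Toy |V| x |D| term-document matrix (illustrative values only).
A = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 3., 1.],
              [0., 1., 2.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
K = 2
U_K, S_K = U[:, :K], np.diag(s[:K])
S_K_inv = np.diag(1.0 / s[:K])

def project(v):
    """K-dimensional eigen-space representation: Sigma_K^{-1} U_K^T v."""
    return S_K_inv @ U_K.T @ v

q = np.array([1., 0., 1., 0.])   # query term vector
q_K = project(q)
w_K = project(A[:, 0])           # first document
sim = float(q_K @ w_K) / (np.linalg.norm(q_K) * np.linalg.norm(w_K))

# LSA2: fold the K-dim representation back to a smoothed |V|-dim vector.
w_smoothed = U_K @ S_K @ w_K
```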
Unigram model: representing documents using probability distributions [5]
The term frequencies in a document are taken as its probability distribution.
The term frequencies in a query become the query's probability distribution.
Similarities are established by comparing the probability distributions using KL divergence.
For smoothing, we mix in the probability distribution over the entire source library:

p_uni(w | D_m) = µ · c(w, d_m) / |d_m| + (1 − µ) · Σ_{j=1}^{|D|} c(w, d_j) / Σ_{j=1}^{|D|} |d_j|

p_uni(w | q) = µ · c(w, q) / |q| + (1 − µ) · Σ_{j=1}^{|D|} c(w, d_j) / Σ_{j=1}^{|D|} |d_j|
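A small sketch of the smoothed unigram representation and its KL-divergence comparison, with toy counts and an illustrative µ:

```python
import numpy as np

def unigram(counts, collection_counts, mu=0.5):
    """Smoothed unigram distribution: mix the document's (or query's)
    relative frequencies with the collection-wide distribution."""
    doc = counts / counts.sum()
    coll = collection_counts / collection_counts.sum()
    return mu * doc + (1.0 - mu) * coll

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); the smoothing above keeps q strictly positive."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Toy counts over a 4-term vocabulary (illustrative only).
coll = np.array([10., 6., 3., 1.])
p_q = unigram(np.array([2., 0., 1., 0.]), coll)   # query
p_d = unigram(np.array([3., 1., 2., 0.]), coll)   # document
score = kl_divergence(p_q, p_d)   # lower = more similar
```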
Figure: LDA, a mixture model to represent documents using topics/concepts [6]
LDA-based models [7]
Topic-based representation: θ_m, a K-dimensional probability vector that indicates the topic proportions present in the mth document.

Maximum-likelihood representation folds back to the |V|-dimensional term space:

p_lda(w | D_m) = Σ_{t=1}^{K} p(w | z = t) p(z = t | D_m) = Σ_{t=1}^{K} φ(t, w) θ_m(t)

Combined representation combines the Unigram representation of the document and the MLE-LDA representation of the document:

p_combined(w | D_m) = λ p_lda(w | D_m) + (1 − λ) p_uni(w | D_m)
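A toy sketch of the MLE-LDA fold-back and the combined model, using hypothetical φ and θ_m values (not trained parameters):

```python
import numpy as np

# Hypothetical LDA parameters for K=2 topics over a |V|=4 vocabulary.
phi = np.array([[0.5, 0.3, 0.1, 0.1],    # p(w | z=t): one row per topic
                [0.1, 0.1, 0.4, 0.4]])
theta_m = np.array([0.7, 0.3])           # p(z=t | D_m): topic proportions

# MLE-LDA: fold topics back to a |V|-dimensional term distribution.
p_lda = theta_m @ phi                    # sum_t phi(t, w) * theta_m(t)

# Composite model: mix with a smoothed unigram distribution p_uni.
p_uni = np.array([0.4, 0.2, 0.2, 0.2])   # illustrative
lam = 0.9
p_combined = lam * p_lda + (1.0 - lam) * p_uni
```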
Cluster-Based Document Model (CBDM) [8]
Cluster the documents into K clusters using deterministic algorithms such as K-means, hierarchical, or agglomerative clustering.

Represent each cluster by a multinomial distribution over the terms in the vocabulary, commonly denoted p_ML(w | Cluster_j). The probability distribution for a word in a document d_m ∈ Cluster_j can then be expressed as:

p_cbdm(w | w_m) = λ_1 · w_m(w) / Σ_{n=1}^{|V|} w_m(n) + λ_2 · p_c(w) + λ_3 · p_ML(w | Cluster_j)   (1)
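A toy sketch of the CBDM mixture in equation (1), with illustrative term counts and distributions:

```python
import numpy as np

def cbdm(w_m, p_c, p_cluster, lambdas=(0.81, 0.09, 0.1)):
    """Cluster-based document model: mix the document's term frequencies,
    the collection model, and the document's cluster model."""
    l1, l2, l3 = lambdas
    p_doc = w_m / w_m.sum()
    return l1 * p_doc + l2 * p_c + l3 * p_cluster

w_m = np.array([3., 1., 0., 2.])             # term counts of d_m
p_c = np.array([0.4, 0.3, 0.2, 0.1])         # collection model p_c(w)
p_cluster = np.array([0.5, 0.2, 0.1, 0.2])   # p_ML(w | Cluster_j)
p = cbdm(w_m, p_c, p_cluster)
```

Because λ_1 + λ_2 + λ_3 = 1 and each component is a distribution, the mixture is again a valid distribution, with every term assigned a nonzero probability.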
Summary of Text Models used in the comparative study
Model     Representation                                   Similarity metric
VSM       frequency vector                                 Cosine similarity
LSA       K-dimensional vector in the eigen space          Cosine similarity
Unigram   |V|-dimensional probability vector (smoothed)    KL divergence
LDA       K-dimensional probability vector                 KL divergence
CBDM      |V|-dimensional combined probability vector      KL divergence or likelihood

Table: Generic models used in the comparative evaluation
Model     Representation                                   Similarity metric
LSA2      |V|-dimensional representation in term space     Cosine similarity
MLE-LDA   |V|-dimensional MLE-LDA probability vector       KL divergence or likelihood

Table: The variations on two of the generic models used in the comparative evaluation
Model           Representation                                        Similarity metric
Unigram + LDA   |V|-dimensional combined probability vector           KL divergence or likelihood
VSM + LSA       |V|-dimensional combined VSM and LSA representation   Cosine similarity

Table: The two composite models used
Preprocessing of the source files
If a patch file does not exist in /trunk, it is searched for and added to the source library from the other branches/tags of ASPECTJ.
The source library consists of “.java” files only. After this step, our library ended up with 6546 Java files.
The repository.xml file documents all the information related to a bug, including the BugID, the bug description, the relevant source files, and so on. We refer to this ground-truth information as relevance judgements.
Bugs documented in iBUGS that have no relevant source files in the library resulting from the previous step are eliminated. After this step, we are left with 291 bugs.
Preprocessing of the source files (contd.)
Hard words, camel-case words and soft words are handled using popular identifier-splitting methods [9, 10].
The stop list consists of the most commonly occurring words, for example “for,” “else,” “while,” “int,” “double,” “long,” “public,” “void,” etc. There are 375 such words in the iBUGS ASPECTJ software. We also drop all unicode strings from the vocabulary.
The vocabulary is pruned further by calculating the relative importance of terms and eliminating ubiquitous and rarely occurring terms.
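The identifier-splitting step can be sketched with a simple regex-based splitter (an approximation for illustration, not necessarily the exact method of [9, 10]):

```python
import re

def split_identifier(identifier):
    """Split a source-code identifier on hard separators such as '_'
    and on camel-case / acronym boundaries, returning lowercase words."""
    parts = re.split(r'[_$]+', identifier)
    words = []
    for part in parts:
        # ACRONYM runs, Capitalized words, lowercase runs, digit runs.
        words += re.findall(r'[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+|\d+', part)
    return [w.lower() for w in words if w]

split_identifier("getXMLParser_instance")   # -> get, xml, parser, instance
```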
Mean Average Precision (MAP)
Calculated using the following two sets:
retrieved(N_r): the set of the top N_r documents from a ranked list of documents retrieved vis-a-vis the query.
relevant: the set extracted from the relevance judgements available in repository.xml.

Precision and recall:

Precision(P@N_r) = |{relevant} ∩ {retrieved}| / |{retrieved}|

Recall(R@N_r) = |{relevant} ∩ {retrieved}| / |{relevant}|
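A direct sketch of P@N_r and R@N_r for one query, with hypothetical file names:

```python
def precision_recall_at(retrieved, relevant, n_r):
    """P@N_r and R@N_r for one query, given a ranked list and the
    relevance judgements (e.g., from repository.xml)."""
    top = set(retrieved[:n_r])
    hits = len(top & set(relevant))
    return hits / n_r, hits / len(relevant)

# Hypothetical ranked file list and ground truth for one bug query.
ranked = ["A.java", "B.java", "C.java", "D.java"]
truth = ["B.java", "D.java"]
p, r = precision_recall_at(ranked, truth, 2)
```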
Mean Average Precision (MAP) (cont.)
1. If we were to plot a typical P-R curve from the values of P@N_r and R@N_r, we would get a monotonically decreasing curve with high precision at low recall and vice versa.
2. The area under the P-R curve is called the Average Precision.
3. Taking the mean of the Average Precision over all the queries gives the Mean Average Precision (MAP).
4. Physical significance of MAP: the same as that of precision.
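A sketch of Average Precision, computed (as is standard) as the mean of P@k over the ranks of the relevant documents, which approximates the area under the P-R curve, and of MAP as its mean over queries:

```python
def average_precision(ranked, relevant):
    """Mean of P@k over the ranks k where a relevant document appears."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP: mean of the per-query average precisions."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Hypothetical ranking: relevant docs B and D at ranks 2 and 4.
ap = average_precision(["A", "B", "C", "D"], ["B", "D"])
```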
Rank of Retrieved Files [3]
The number of queries/bugs for which relevant source files were retrieved with ranks r_low ≤ R ≤ r_high is reported.
For the retrieval performance reported in [3], the rank ranges used are R = 1, 2 ≤ R ≤ 5, 6 ≤ R ≤ 10 and R > 10.
SCORE [11]
1. Indicates the proportion of the program that needs to be examined in order to locate or localize a fault.
2. For each range of this proportion (e.g., 10-20%), the number of test runs (bugs) is reported.
Results: models using LDA

Figure: MAP using the three LDA models for different values of K; the experimental parameters for the LDA+Unigram model are λ = 0.9, µ = 0.5, β = 0.01 and α = 50/K.
The combined LDA+Unigram model

Figure: MAP plotted for different values of the mixture proportions (λ and µ) of the LDA+Unigram combined model.
Models using LSA

Figure: MAP using the LSA model and its variations and combinations for different values of K. The experimental parameter for the LSA+VSM combined model is λ = 0.5.
CBDM

λ1      λ2      λ3     K=100      K=250     K=500     K=1000
0.25    0.25    0.5    0.093144   0.0914    0.08666   0.07664
0.15    0.35    0.5    0.0883     0.0897    0.0963    0.0932
0.81    0.09    0.1    0.143      0.102     0.108     0.09952
0.27    0.63    0.1    0.1306     0.117     0.111     0.0998
0.495   0.495   0.01   0.141      0.141     0.141     0.141
0.05    0.05    0.99   0.069      0.075     0.072     0.065

Table: Retrieval performance (MAP) with the CBDM, where λ1 + λ2 + λ3 = 1; λ1 weights the unigram model, λ2 the collection model, and λ3 the cluster model.
Rank-based metric

Figure: The height of the bars shows the number of queries (bugs) for which at least one relevant source file was retrieved at rank 1.
SCORE: IR-based bug localization tools
SCORE: comparison with AMPLE and FINDBUGS

SCORE with FINDBUGS: none of the bugs were localized correctly.

Figure: SCORE values calculated over 44 bugs in iBUGS ASPECTJ using AMPLE [12]
Conclusion
IR-based bug localization techniques are as effective as, or more effective than, static or dynamic bug localization tools.
Sophisticated models like LDA, LSA or CBDM do not outperform simpler models like Unigram or VSM for IR-based bug localization on large software systems.
An analysis of the spread of the word distributions over the source files, with the help of measures such as tf and idf, can give useful insights into the usability of topic- and cluster-based models for localization.
End of Presentation
Thanks
Questions?
Threats to validity
We have tested on a single dataset, iBUGS. How does this generalize?
We have eliminated XML files from those that are indexed and queried. Maybe not a valid assumption?
References
[1] A. Marcus, A. Sergeyev, V. Rajlich, and J. I. Maletic, “An Information Retrieval Approach to Concept Location in Source Code,” in Proceedings of the 11th Working Conference on Reverse Engineering (WCRE 2004), pp. 214-223, IEEE Computer Society, 2004.
[2] B. Cleary, C. Exton, J. Buckley, and M. English, “An Empirical Analysis of Information Retrieval based Concept Location Techniques in Software Comprehension,” Empirical Softw. Engg., vol. 14, no. 1, pp. 93-130, 2009.
[3] S. K. Lukins, N. A. Kraft, and L. H. Etzkorn, “Source Code Retrieval for Bug Localization using Latent Dirichlet Allocation,” in 15th Working Conference on Reverse Engineering, 2008.
[4] V. Dallmeier and T. Zimmermann, “Extraction of Bug Localization Benchmarks from History,” in ASE ’07: Proceedings of the Twenty-Second IEEE/ACM International Conference on Automated Software Engineering, (New York, NY, USA), pp. 433-436, ACM, 2007.
[5] C. Zhai and J. Lafferty, “A Study of Smoothing Methods for Language Models Applied to Information Retrieval,” ACM Transactions on Information Systems, pp. 179-214, 2004.
[6] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet Allocation,” Journal of Machine Learning Research, pp. 993-1022, 2003.
[7] X. Wei and W. B. Croft, “LDA-Based Document Models for Ad-hoc Retrieval,” in Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2006.
[8] X. Liu and W. B. Croft, “Cluster-Based Retrieval Using Language Models,” in ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2004.
[9] H. Feild, D. Binkley, and D. Lawrie, “An Empirical Comparison of Techniques for Extracting Concept Abbreviations from Identifiers,” in Proceedings of the IASTED International Conference on Software Engineering and Applications, 2006.
[10] E. Enslen, E. Hill, L. Pollock, and K. Vijay-Shanker, “Mining Source Code to Automatically Split Identifiers for Software Analysis,” in Proceedings of the 2009 6th IEEE International Working Conference on Mining Software Repositories (MSR ’09), (Washington, DC, USA), pp. 71-80, IEEE Computer Society, 2009.
[11] J. A. Jones and M. J. Harrold, “Empirical Evaluation of the Tarantula Automatic Fault-Localization Technique,” in Automated Software Engineering, 2005.
[12] V. Dallmeier and T. Zimmermann, “Automatic Extraction of Bug Localization Benchmarks from History,” tech. rep., Universität des Saarlandes, Saarbrücken, Germany, June 2007.