Audio Music Similarity is a task within Music Information Retrieval that deals with systems that retrieve songs musically similar to a query song according to their audio content. Evaluation experiments are the main scientific tool in Information Retrieval to determine which systems work better and advance the state of the art accordingly. It is therefore essential that the conclusions drawn from these experiments are both valid and reliable, and that we can reach them at a low cost. This dissertation studies these three aspects of evaluation experiments for the particular case of Audio Music Similarity, with the general goal of improving how these systems are evaluated. The traditional paradigm for Information Retrieval evaluation based on test collections is approached as a statistical estimator of certain probability distributions that characterize how users employ systems. In terms of validity, we study how well the measured system distributions correspond to the target user distributions, and how this correspondence affects the conclusions we draw from an experiment. In terms of reliability, we study the optimal characteristics of test collections and statistical procedures, and in terms of efficiency we study models and methods to greatly reduce the cost of running an evaluation experiment.
Evaluation in (Music) Information Retrieval through the Audio Music Similarity task, by Julián Urbano
Test-collection based evaluation in (Music) Information Retrieval has been used for half a century now as the means to evaluate and compare retrieval techniques and advance the state of the art. However, this paradigm makes certain assumptions that remain a research problem and that may invalidate our experimental results. In this talk I will approach this paradigm as an estimator of certain probability distributions that describe the final user experience. These distributions are estimated with a test collection, computing system-related distributions assumed to reliably correlate with the target user-related distributions.
Using the Audio Music Similarity task as an example, I will talk about issues with our current evaluation methods, the degree to which they are problematic, and how to analyze and improve the situation. In terms of validity, we will see how the measured system distributions correspond to the target user distributions, and how this correspondence affects the conclusions we draw from an experiment. In terms of reliability, we will discuss optimal characteristics of test collections and statistical procedures. In terms of efficiency, we will discuss models and methods to greatly reduce the annotation cost of an evaluation experiment.
4. Information Retrieval
• Automatic representation, storage and search of unstructured information
– Traditionally textual information
– Lately multimedia too: images, video, music
• A user has an information need and uses an IR system that retrieves the relevant or significant information from a collection of documents
5. Information Retrieval Evaluation
• IR systems are based on models to estimate relevance, implementing different techniques
• How good is my system? What system is better?
• Answered with IR Evaluation experiments
– “if you can’t measure it, you can’t improve it”
– But we need to be able to trust our measurements
• Research on IR Evaluation
– Improve our methods to evaluate systems
– Critical for the correct development of the field
6. History of IR Evaluation research
[Timeline figure, built up over slides 6–10, spanning 1960–2010: MEDLARS, Cranfield 2, SMART and SIGIR; later TREC, INEX, CLEF and NTCIR; and, on the music side, ISMIR, MIREX, MusiCLEF and the MSD Challenge]
11. Audio Music Similarity
• Song as input to system, audio signal
• Retrieve songs musically similar to it, by content
• Resembles traditional Ad Hoc retrieval in Text IR
• (most?) Important task in Music IR
– Music recommendation
– Playlist generation
– Plagiarism detection
• Annual evaluation in MIREX
14. The two questions
• How good is my system?
– What does good mean?
– What is good enough?
• Is system A better than system B?
– What does better mean?
– How much better?
• Efficiency? Effectiveness? Ease?
15. Measure user experience
• We are interested in user-measures
– Time to complete task, idle time, success rate, failure rate, frustration, ease to learn, ease to use …
– Their distributions describe user experience, fully
• User satisfaction is the bigger picture
– How likely is it that an arbitrary user, with an arbitrary query (and with an arbitrary document collection) will be satisfied by the system?
• This is the ultimate goal: the good, the better
16. The Cranfield Paradigm
• Estimate user-measure distributions
– Sample documents, queries and users
– Monitor user experience and behavior
– Representativeness, cost, ethics, privacy …
• Fix samples to allow reproducibility
– But cannot fix users and their behavior
– Remove users, but include a static user component, fixed across experiments: ground truth judgments
– Still need to include the dynamics of the process: user models behind effectiveness measures and scales
17. Test collections
• Our goal is the users:
user-measure = f(system)
• Cranfield measures systems:
system-effectiveness = f(system, measure, scale)
• Estimators of the distributions of user-measures
– Only source of variability is the systems themselves
– Reproducibility becomes easy
– Experiments are inexpensive (collections are not)
– Research becomes systematic
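To make the second mapping concrete, here is a minimal sketch (with made-up judgments, document ids and cutoff, not MIREX data) of how the same system output yields different effectiveness scores depending on the measure and the gain scale; the fixed judgments play the role of the static user component, while the measure and scale encode the assumed user model.

```python
from math import log2

# Hypothetical 3-level relevance judgments (0..2) and one system's ranking for a query
judgments = {"d1": 2, "d2": 0, "d3": 1, "d4": 2, "d5": 0}
ranking = ["d3", "d1", "d5", "d2", "d4"]

def gain(rel, scale="linear"):
    # Linear gain uses the relevance level as-is; exponential gain is 2^rel - 1
    return rel if scale == "linear" else 2 ** rel - 1

def cg_at_k(ranking, judgments, k=5, scale="linear"):
    # Cumulated gain at rank k
    return sum(gain(judgments.get(d, 0), scale) for d in ranking[:k])

def dcg_at_k(ranking, judgments, k=5, scale="linear"):
    # Discounted cumulated gain at rank k, with a log2 positional discount
    return sum(gain(judgments.get(d, 0), scale) / log2(i + 2)
               for i, d in enumerate(ranking[:k]))

# Same system and judgments, different f(system, measure, scale)
print(cg_at_k(ranking, judgments))                        # CG@5, linear gain
print(dcg_at_k(ranking, judgments, scale="exponential"))  # DCG@5, exponential gain
```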
18. Validity, Reliability and Efficiency
• Validity: are we measuring what we want to?
– How well are effectiveness and satisfaction correlated?
– How good is good and how better is better?
• Reliability: how repeatable are the results?
– How large do samples have to be?
– What statistical methods should be used?
• Efficiency: how inexpensive is it to get valid and reliable results?
– Can we estimate results with fewer judgments?
19. Goal of this dissertation
Study and improve the validity, reliability and efficiency of the methods used to evaluate AMS systems.
Additionally, improve meta-evaluation methods.
21. Outline
• Introduction
• Validity
– System Effectiveness and User Satisfaction
– Modeling Distributions
• Reliability
• Efficiency
• Conclusions and Future Work
22. Assumption of Cranfield
• Systems with better effectiveness are perceived by users as more useful, more satisfactory
• But different effectiveness measures and relevance scales produce different distributions
– Which one is better to predict user satisfaction?
• Map system effectiveness onto user satisfaction, experimentally
– If P@10 = 0.2, how likely is it that an arbitrary user will find the results satisfactory?
– What if DCG@20 = 0.46?
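A minimal sketch of how such a mapping could be estimated empirically, assuming we have pairs of an effectiveness score λ and a binary "user was satisfied" answer; the data below is fabricated and this is only the idea of the mapping, not the dissertation's exact procedure.

```python
import random
random.seed(0)

# Fabricated (effectiveness, satisfied?) observations: satisfaction made
# more likely as effectiveness grows, just to have something to bin
observations = []
for _ in range(2000):
    lam = random.random()
    observations.append((lam, random.random() < 0.2 + 0.7 * lam))

def p_sat_given_lambda(observations, bins=10):
    """Empirical P(Sat | lambda), one estimate per effectiveness bin."""
    counts = [[0, 0] for _ in range(bins)]            # [satisfied, total] per bin
    for lam, sat in observations:
        b = min(int(lam * bins), bins - 1)
        counts[b][0] += int(sat)
        counts[b][1] += 1
    return [((b + 0.5) / bins, s / t) for b, (s, t) in enumerate(counts) if t > 0]

for center, p in p_sat_given_lambda(observations):
    print(f"lambda ~ {center:.2f}: P(Sat) ~ {p:.2f}")
```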
36. What can we infer?
• Preference: difference noticed by user
– Positive: user agrees with evaluation
– Negative: user disagrees with evaluation
• Non-preference: difference not noticed by user
– Good: both systems are satisfactory
– Bad: both systems are not satisfactory
21
37. Data
• Queries, documents and judgments from MIREX
• 4115 unique and artificial examples
• 432 unique queries, 5636 unique documents
• Answers collected via Crowdsourcing
– Quality control with trap questions
• 113 unique subjects
22
38. Single system: how good is it?
• For 2045 examples (49%) users could not decide
which system was better
What do we expect?
23
40. Single system: how good is it?
• Large ℓmin thresholds underestimate satisfaction
24
41. Single system: how good is it?
• Users don’t pay attention to ranking?
25
42. Single system: how good is it?
• Exponential gain underestimates satisfaction
26
43. Single system: how good is it?
• Document utility independent of others
27
44. Two systems: which one is better?
• For 2090 examples (51%) users did prefer one
system over the other one
What do we expect?
28
46. Two systems: which one is better?
• Large differences are needed for users to notice them
29
47. Two systems: which one is better?
• More relevance levels are better to discriminate
30
48. Two systems: which one is better?
• Cascade and navigational user models are not
appropriate
31
49. Two systems: which one is better?
• Users do prefer the (supposedly) worse system
32
50. Summary
• Effectiveness and satisfaction are clearly correlated
– But there is a bias of 20% because of user disagreement
– Room for improvement through personalization
• Magnitude of differences does matter
– Just looking at rankings is very naive
• Be careful with statistical significance
– Need Δλ ≈ 0.4 for users to agree with effectiveness
• Historically, observed only 20% of the time in MIREX
• Differences among measures and scales
– Linear gain slightly better than exponential gain
– Informational and positional user models better than
navigational and cascade
– The more relevance levels, the better
33
53. Outline
• Introduction
• Validity
– System Effectiveness and User Satisfaction
– Modeling Distributions
• Reliability
• Efficiency
• Conclusions and Future Work
36
55. Evaluate in terms of user satisfaction
• So far, arbitrary users for a single query
– P(Sat | Ql@5 = 0.61) = 0.7
• Easily for n users and a single query
– P(Sat15 = 10 | Ql@5 = 0.61) = 0.21
• What about a sample of queries 𝒬?
– Map queries separately for the distribution of P(Sat)
– For easier mappings, P(Sat | λ) functions are
interpolated with simple polynomials
38
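A worked check of the numbers above, assuming users are independent of each other: if each user is satisfied with probability P(Sat | Ql@5 = 0.61) = 0.7, then the number of satisfied users out of 15 follows a Binomial(15, 0.7) distribution.

from scipy.stats import binom

# P(exactly 10 out of 15 independent users are satisfied), with P(Sat) = 0.7
print(binom.pmf(10, n=15, p=0.7))  # ~0.206, i.e. the 0.21 quoted above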
56. Expected probability of satisfaction
• Now we can compute point and interval estimates
of the expected probability of satisfaction
• Intuition fails when interpreting effectiveness
39
57. System success
• If P(Sat) ≥ threshold the system is successful
– Setting the threshold was rather arbitrary
– Now it is meaningful, in terms of user satisfaction
• Intuitively, we want the majority of users to find
the system satisfactory
– P(Succ) = P(P(Sat) > 0.5) = 1 − F_P(Sat)(0.5)
• Improving queries for which we do poorly is more
worthwhile than further improving those for which
we are already good
40
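A minimal sketch of the success computation, assuming we already have a per-query estimate of P(Sat) and use its empirical CDF; the values below are illustrative only.

import numpy as np

# Illustrative per-query probabilities of satisfaction for one system
p_sat_per_query = np.array([0.35, 0.80, 0.62, 0.15, 0.91, 0.57, 0.44, 0.73])

# F_P(Sat)(0.5) is the fraction of queries with P(Sat) <= 0.5
f_at_half = np.mean(p_sat_per_query <= 0.5)
p_succ = 1.0 - f_at_half   # P(Succ) = P(P(Sat) > 0.5)
print(p_succ)              # fraction of queries where the majority of users are satisfied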
58. Distribution of P(Sat)
• Need to estimate the cumulative distribution
function of user satisfaction: F_P(Sat)
• Not described by a typical distribution family
– ecdf converges, but what is a good sample size?
– Compare with Normal, Truncated Normal and Beta
• Compared on >2M random samples from MIREX
collections, at different query set sizes
• Goodness of fit measured with the Cramér-von Mises ω² statistic
41
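A minimal sketch of this kind of comparison, assuming a sample of per-query P(Sat) values is available: fit Normal and Beta candidates and compare their Cramér-von Mises ω² statistics (smaller is a better fit). The sample below is synthetic, the Truncated Normal is omitted, and the statistic is used only to compare fits, not as a formal test with fitted parameters.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
p_sat = rng.beta(2.0, 1.2, size=100)   # stand-in for per-query P(Sat) values

a, b, _, _ = stats.beta.fit(p_sat, floc=0, fscale=1)
mu, sigma = stats.norm.fit(p_sat)

cvm_beta = stats.cramervonmises(p_sat, lambda x: stats.beta.cdf(x, a, b)).statistic
cvm_norm = stats.cramervonmises(p_sat, lambda x: stats.norm.cdf(x, mu, sigma)).statistic
print(cvm_beta, cvm_norm)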
59. Estimated distribution of P(Sat)
• More than ≈25 queries in the collection
– ecdf approximates better
• Less than ≈25 queries in the collection
– Normal for graded scales, ecdf for binary scales
• Beta is always the best with the Fine scale
• The more levels in the relevance scale, the better
• Linear gain better than exponential gain
42
60. Intuition fails, again
• Intuitive conclusions based on effectiveness alone
contradict those based on user satisfaction
– E[Δλ] = −0.002
– E[ΔP(Sat)] = 0.001
– E[ΔP(Succ)] = 0.07
43
63. Historically, in MIREX
• Systems are not as satisfactory as we thought
• But they are more successful
– Good (or bad) for some kinds of queries
44
66. Outline
• Introduction
• Validity
– System Effectiveness and User Satisfaction
– Modeling Distributions
• Reliability
• Efficiency
• Conclusions and Future Work
47
67. Outline
• Introduction
• Validity
• Reliability
– Optimality of Statistical Significance Tests
– Test Collection Size
• Efficiency
• Conclusions and Future Work
48
68. Random error
• Test collections are just samples from larger,
possibly infinite, populations
• If we conclude system A is better than B, how
confident can we be?
– The observed Δλ over 𝒬 is just an estimate of the population mean μΔλ
• Usually employ some statistical significance test
for differences in location
• If it is statistically significant, we have confidence
that the true difference is at least that large
49
69. Statistical hypothesis testing
• Set two mutually exclusive hypotheses
– H0 : μΔλ = 0
– H1 : μΔλ ≠ 0
• Run the test, obtain the p-value = P(Δλ ≥ Δλ𝒬 | H0)
– p ≤ α: statistically significant, high confidence
– p > α: statistically non-significant, low confidence
• Possible errors in the binary decision
– Type I: incorrectly reject H0
– Type II: incorrectly accept H0
50
70. Statistical significance tests
• (Non-)parametric tests
– t-test, Wilcoxon test, Sign test
• Based on resampling
– Bootstrap test, permutation/randomization test
• They make certain assumptions about
distributions and sampling methods
– Often violated in IR evaluation experiments
– Which test behaves better, in practice, knowing that
assumptions are violated?
51
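A minimal sketch of these tests applied to paired per-query scores of two systems; the scores are synthetic, and the randomization test is implemented directly as a sign-flip permutation of the per-query differences.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.uniform(0.3, 0.9, size=50)                       # system A, per-query scores
b = np.clip(a - rng.normal(0.03, 0.1, size=50), 0, 1)    # system B, per-query scores
d = a - b

print(stats.ttest_rel(a, b).pvalue)                                     # paired t-test
print(stats.wilcoxon(a, b).pvalue)                                      # Wilcoxon signed-rank
print(stats.binomtest(int((d > 0).sum()), int((d != 0).sum())).pvalue)  # sign test

# Sign-flip randomization test on the mean difference
flips = rng.choice([-1, 1], size=(10000, d.size))
perm_means = (flips * d).mean(axis=1)
print(np.mean(np.abs(perm_means) >= abs(d.mean())))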
71. Optimality criteria
• Power
– Achieve significance as often as possible (low Type II)
– Usually increases Type I error rates
• Safety
– Minimize Type I error rates
– Usually decreases power
• Exactness
– Maintain Type I error rate at α level
– Permutation test is theoretically exact
52
72. Experimental design
• Randomly split query set in two
• Evaluate all systems with both subsets
– Simulating two different test collections
• Compare p-values with both subsets
– How well do statistical tests agree with themselves?
– At different α levels
• All systems and queries from MIREX 2007-2011
– >15M p-values
53
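A minimal sketch of the split-half design for one system pair and one test, with synthetic scores: split the query set at random, compute the p-value on each half, and record whether both halves reach the same decision at level α.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_queries = 100
scores_a = rng.uniform(0, 1, n_queries)
scores_b = np.clip(scores_a - rng.normal(0.02, 0.15, n_queries), 0, 1)

perm = rng.permutation(n_queries)
half1, half2 = perm[: n_queries // 2], perm[n_queries // 2 :]

p1 = stats.ttest_rel(scores_a[half1], scores_b[half1]).pvalue
p2 = stats.ttest_rel(scores_a[half2], scores_b[half2]).pvalue

alpha = 0.05
print(p1, p2, "agree" if (p1 <= alpha) == (p2 <= alpha) else "conflict")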
73. Power and success
• Bootstrap test is the most powerful
• Wilcoxon, bootstrap and permutation are the
most successful, depending on α level
54
74. Conflicts
• Wilcoxon and t-test are the safest at low α levels
• Wilcoxon is the most exact at low α levels, but
bootstrap is for usual levels
55
75. Optimal measure and scale
• Power: CGl@5, GAP@5, DCGl@5 and RBPl@5
• Success: CGl@5, GAP@5, DCGl@5 and RBPl@5
• Conflicts: very similar across measures
• Power: Fine, Broad and binary
• Success: Fine, Broad and binary
• Conflicts: very similar across scales
56
76. Outline
• Introduction
• Validity
• Reliability
– Optimality of Statistical Significance Tests
– Test Collection Size
• Efficiency
• Conclusions and Future Work
57
78. Acceptable sample size
• Reliability is higher with larger sample sizes
– But it is also more expensive
– What is an acceptable test collection size?
• Answer with Generalizability Theory
– G-Study: estimate variance components
– D-Study: estimate reliability of different sample sizes
and experimental designs
59
90. G-study: variance components
• Fully crossed experimental design: s × q
λq,A = λ + λA + λq + εq,A
σ² = σs² + σq² + σsq²
• Estimated with Analysis of Variance
• If σs² is small or σq² is large, we need more queries
60
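A minimal sketch of the G-study, assuming a complete systems-by-queries score matrix: the components of the fully crossed s × q design follow from the expected mean squares of a two-way ANOVA without replication, where the s × q interaction is confounded with error. The matrix below is synthetic.

import numpy as np

rng = np.random.default_rng(2)
n_s, n_q = 10, 50
X = np.clip(rng.normal(0.5, 0.10, (n_s, 1))       # system effects
            + rng.normal(0.0, 0.15, (1, n_q))     # query effects
            + rng.normal(0.0, 0.10, (n_s, n_q)),  # interaction / error
            0, 1)

grand = X.mean()
ms_s = n_q * np.sum((X.mean(axis=1) - grand) ** 2) / (n_s - 1)
ms_q = n_s * np.sum((X.mean(axis=0) - grand) ** 2) / (n_q - 1)
ss_e = np.sum((X - X.mean(axis=1, keepdims=True)
                 - X.mean(axis=0, keepdims=True) + grand) ** 2)
ms_e = ss_e / ((n_s - 1) * (n_q - 1))

var_e = ms_e                          # sigma^2_sq, confounded with error
var_s = max((ms_s - ms_e) / n_q, 0)   # sigma^2_s
var_q = max((ms_q - ms_e) / n_s, 0)   # sigma^2_q
print(var_s, var_q, var_e)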
91. D-study: variance ratios
• Stability of absolute scores
Φ(nq) = σs² / (σs² + (σq² + σe²)/nq)
• Stability of relative scores
Eρ²(nq) = σs² / (σs² + σe²/nq)
• We can easily estimate how many queries are
needed to reach some level of stability (reliability)
61
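A minimal sketch of the D-study, assuming the variance components have already been estimated (the values below are illustrative, not those of any MIREX collection): evaluate Φ(nq) and Eρ²(nq), and search for the smallest query set size that reaches a target stability.

var_s, var_q, var_e = 0.004, 0.02, 0.01   # illustrative sigma^2_s, sigma^2_q, sigma^2_e

def phi(n_q):      # stability of absolute scores
    return var_s / (var_s + (var_q + var_e) / n_q)

def e_rho2(n_q):   # stability of relative scores
    return var_s / (var_s + var_e / n_q)

target = 0.95
n_q_abs = next(n for n in range(1, 100000) if phi(n) >= target)
n_q_rel = next(n for n in range(1, 100000) if e_rho2(n) >= target)
print(n_q_abs, n_q_rel)   # queries needed for absolute and relative stability of 0.95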
93. Effect of query set size
• Average absolute stability Φ = 0.97
• ≈65 queries needed for Φ = 0.95, ≈100 in worst cases
• Fine scale slightly better than Broad and binary scales
• RBPl@5 and nDCGl@5 are the most stable
62
94. Effect of query set size
• Average relative stability Eρ² = 0.98
• ≈35 queries needed for Eρ² = 0.95, ≈60 in worst cases
• Fine scale better than Broad and binary scales
• CGl@5 and RBPl@5 are the most stable
63
95. Effect of cutoff k
• What if we use a deeper cutoff, k=10?
– From 100 queries and k=5 to 50 queries and k=10
– Should still have stable scores
– Judging effort should decrease
– Rank-based measures should become more stable
• Tested in MIREX 2012
– Apparently in 2013 too
64
96. Effect of cutoff k
• Judging effort reduced to 72% of the usual
• Generally stable
– From Φ = 0.81 to Φ = 0.83
– From Eρ² = 0.93 to Eρ² = 0.95
65
97. Effect of cutoff k
• Reliability given a fixed budget for judging?
– k=10 allows us to use fewer queries, about 70%
– Slightly reduced relative stability
66
98. Effect of assessor set size
• More assessors or simply more queries?
– Judging effort is multiplied
• Can be studied with MIREX 2006 data
– 3 different assessors per query
– Nested experimental design: s × (h:q)
67
99. Effect of assessor set size
• Broad scale: σs² ≈ σh:q²
• Fine scale: σs² ≫ σh:q²
• Always better to spend resources on queries
68
100. Summary
• MIREX collections generally larger than necessary
• For fixed budget
– More queries better than more assessors
– More queries slightly better than deeper cutoff
• Worth studying alternative user model?
• Employ G-Theory while building the collection
• Fine better than Broad, better than binary
• CGl@5 and DCGl@5 best for relative stability
• RBPl@5 and nDCGl@5 best for absolute stability
69
101. Outline
• Introduction
• Validity
• Reliability
– Optimality of Statistical Significance Tests
– Test Collection Size
• Efficiency
• Conclusions and Future Work
70
103. Probabilistic evaluation
• The MIREX setting is still expensive
– Need to judge all top k documents from all systems
– Takes days, even weeks sometimes
• Model relevance probabilistically
• Relevance judgments are random variables over
the space of possible assignments of relevance
• Effectiveness measures are also probabilistic
72
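A minimal sketch of what a probabilistic effectiveness measure means for a linear measure such as CG@k: if each document's relevance is a random variable and, as a simplifying assumption, documents are treated as independent, the mean and variance of CG@k follow directly from the per-document distributions. Gains and probabilities below are illustrative.

import numpy as np

levels = np.array([0.0, 0.5, 1.0])   # illustrative gain for each relevance level
# P(R_d = level) for the 5 documents retrieved for one query (illustrative)
p_rel = np.array([
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
    [0.4, 0.4, 0.2],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
])

e_gain = p_rel @ levels                       # E[gain] of each document
var_gain = p_rel @ levels**2 - e_gain**2      # Var[gain] of each document
print(e_gain.sum() / 5, var_gain.sum() / 25)  # E[CG@5] and Var[CG@5], normalized by k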
104. Probabilistic evaluation
• Accuracy increases as we make judgments
– E[Rd] ← rd
• Reliability increases too (confidence)
– Var[Rd] ← 0
• Iteratively estimate relevance and effectiveness
– If confidence is low, make judgments
– If confidence is high, stop
• Judge as few documents as possible
73
105. Learning distributions of relevance
• Uniform distribution is very uninformative
• Historical distribution in MIREX has high variance
• Estimate from a set of features: P(Rd = ℓ | θd)
– For each document separately
– Ordinal Logistic Regression
• Three sets of features
– Output-based, can always be used
– Judgment-based, to exploit known judgments
– Audio-based, to exploit musical similarity
74
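A minimal sketch of such a relevance model with ordinal logistic regression; the two features and the data are hypothetical stand-ins for the output-, judgment- and audio-based feature sets, and statsmodels' OrderedModel is just one possible implementation of the proportional-odds model.

import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(3)
n = 300
X = pd.DataFrame({
    "system_overlap": rng.uniform(0, 1, n),   # how many other systems also retrieved the document
    "same_artist":    rng.integers(0, 2, n),  # query and document by the same artist
})
latent = 2.0 * X["system_overlap"] + 1.5 * X["same_artist"] + rng.logistic(size=n)
y = pd.Series(pd.Categorical(np.digitize(latent, [1.0, 2.5]),   # 0=NS, 1=SS, 2=VS (Broad-like)
                             categories=[0, 1, 2], ordered=True))

res = OrderedModel(y, X, distr="logit").fit(method="bfgs", disp=False)
print(res.predict(X.iloc[:3]))   # P(R_d = level | features) for three documents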
106. Learned models
• Mout : can be used even without judgments
– Similarity between systems’ outputs
– Genre and artist metadata
• Genre is highly correlated with similarity
– Decent fit, R2 ≈ 0.35
• Mjud : can be used when there are judgments
– Similarity between systems’ outputs
– Known relevance of same system and same artist
• Artist is very strongly correlated with similarity
– Excellent fit, R2 ≈ 0.91
75
107. Estimation errors
• Actual vs. predicted by Mout
– 0.36 with Broad and 0.34 with Fine
• Actual vs. predicted by Mjud
– 0.14 with Broad and 0.09 with Fine
• Among assessors in MIREX 2006
– 0.39 with Broad and 0.31 with Fine
• Negligible under the current MIREX setting
76
110. Probabilistic effectiveness measures
• Effectiveness scores are also random variables
• Different approaches to compute estimates
– Deal with dependence of random variables
– Different definitions of confidence
• For measures based on ideal ranking (nDCGl@k
and RBPl@k) we do not have a closed form
– Approximated with Delta method and Taylor series
79
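For reference, these are the generic second-order Taylor (Delta method) approximations this kind of derivation relies on; the exact expansions used for nDCGl@k and RBPl@k in the dissertation may differ in their details.

% For a smooth function g of a random variable X with mean \mu and variance \sigma^2:
\[
\mathbb{E}[g(X)] \approx g(\mu) + \tfrac{1}{2} g''(\mu)\,\sigma^2,
\qquad
\operatorname{Var}[g(X)] \approx g'(\mu)^2\,\sigma^2 .
\]
% For a ratio such as nDCG = DCG / iDCG, with numerator X and denominator Z:
\[
\mathbb{E}\!\left[\frac{X}{Z}\right] \approx
\frac{\mu_X}{\mu_Z} - \frac{\operatorname{Cov}(X,Z)}{\mu_Z^{2}} + \frac{\mu_X \operatorname{Var}(Z)}{\mu_Z^{3}} .
\]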
111. Ranking without judgments
1. Estimate relevance with Mout
2. Estimate relative differences and rank systems
• Average confidence in the rankings is 94%
• Average accuracy of the ranking is 92%
80
112. Ranking without judgments
• Can we trust individual estimates?
– Ideally, we want X% accuracy when X% confidence
– Confidence slightly overestimated in [0.9, 0.99)
DCGl@5               Broad                     Fine
Confidence           In bin        Accuracy    In bin        Accuracy
[0.5, 0.6)           23 (6.5%)     0.826       22 (6.2%)     0.636
[0.6, 0.7)           14 (4%)       0.786       16 (4.5%)     0.812
[0.7, 0.8)           14 (4%)       0.571       11 (3.1%)     0.364
[0.8, 0.9)           22 (6.2%)     0.864       21 (6%)       0.762
[0.9, 0.95)          23 (6.5%)     0.87        19 (5.4%)     0.895
[0.95, 0.99)         24 (6.8%)     0.917       27 (7.7%)     0.926
[0.99, 1)            232 (65.9%)   0.996       236 (67%)     0.996
E[Accuracy]                        0.938                     0.921
81
113. Relative estimates with judgments
1. Estimate relevance with Mout
2. Estimate relative differences and rank systems
3. While confidence is low (<95%)
1. Select a document and judge it
2. Update relevance estimates with Mjud when possible
3. Update estimates of differences and rank systems
• What documents should we judge?
– Those that are the most informative
– Measure-dependent
82
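A minimal sketch of how confidence in a relative difference can be computed from probabilistic relevance, via Monte Carlo: sample relevance assignments from the per-document distributions, compute the measure for both systems, and report how often A beats B. The distributions below are synthetic, and the sketch ignores the dependence introduced by documents shared between the two systems, which the dissertation does handle.

import numpy as np

rng = np.random.default_rng(4)
levels = np.array([0.0, 0.5, 1.0])
n_queries, k, n_samples = 20, 5, 10000

def sample_mean_cg(p_rel):
    # p_rel: (queries, k, levels); returns n_samples draws of the mean CG@k over queries
    cum = p_rel.cumsum(axis=-1)
    u = rng.uniform(size=(n_samples, n_queries, k, 1))
    idx = (u > cum).sum(axis=-1)                     # sampled relevance level per document
    return levels[idx].mean(axis=-1).mean(axis=-1)   # mean over ranks, then over queries

p_a = rng.dirichlet([1, 2, 3], size=(n_queries, k))  # A tends to retrieve similar documents
p_b = rng.dirichlet([2, 2, 2], size=(n_queries, k))
confidence = np.mean(sample_mean_cg(p_a) > sample_mean_cg(p_b))
print(confidence)   # P(mean CG@5 of A > mean CG@5 of B) under the relevance estimates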
114. Relative estimates with judgments
• Judging effort dramatically reduced
– 1.3% with CGl@5, 9.7% with RBPl@5
• Average accuracy still 92%, but improved individually
– 74% of estimates with >99% confidence, 99.9% accurate
– Expected accuracy improves slightly from 0.927 to 0.931
83
115. Absolute estimates with judgments
1. Estimate relevance with Mout
2. Estimate absolute effectiveness scores
3. While confidence is low (expected error >±0.05)
1. Select a document and judge it
2. Update relevance estimates with Mjud when possible
3. Update estimates of absolute effectiveness scores
• What documents should we judge?
– Those that reduce variance the most
– Measure-dependent
84
116. Absolute estimates with judgments
• The stopping condition is overly confident
– Virtually no judgments are even needed (supposedly)
• But effectiveness is highly overestimated
– Especially with nDCGl@5 and RBPl@5
– Mjud, and especially Mout, tend to overestimate relevance
85
117. Absolute estimates with judgments
• Practical fix: correct variance
• Estimates are better, but at the cost of judging
– Need between 15% and 35% of judgments
86
118. Summary
• Estimate ranking of systems with no judgments
– 92% accuracy on average, trustworthy individually
– Statistically significant differences are always correct
• If we want more confidence, judge documents
– As few as 2% needed to reach 95% confidence
– 74% of estimates have >99% confidence and accuracy
• Estimate absolute scores, judging as necessary
– Around 25% needed to ensure error <0.05
87
121. Validity
• Cranfield tells us about systems, not about users
• Provide empirical mapping from system
effectiveness onto user satisfaction
• Room for personalization quantified at 20%
• Need large differences for users to notice them
• Consider full distributions, not just averages
• Conclusions based on effectiveness tend to
contradict conclusions based on user satisfaction
90
122. Reliability
• Different significance tests for different needs
– Bootstrap test is the most powerful
– Wilcoxon and t-test are the safest
– Wilcoxon and bootstrap test are the most exact
• Practical interpretation of p-values
• MIREX collections generally larger than needed
• Spend resources on queries, not on assessors
• User models with deeper cutoffs are feasible
• Employ G-Theory while building collections
91
123. Efficiency
• Probabilistic evaluation reduces cost, dramatically
• Two models to estimate document relevance
• System rankings 92% accurate without judgments
• 2% of judgments to reach 95% confidence
• 25% of judgments to reduce error to 0.05
92
124. Measures and scales
• Best measure and scale depends on situation
• But generally speaking
– CGl@5, DCGl@5 and RBPl@5
– Fine scale
– Model distributions as Beta
93
127. Validity
• User studies to understand user behavior
• What information to include in test collections
• Other forms of relevance judgment to better
capture document utility
• Explicitly define judging guidelines
• Similar mapping for Text IR
96
128. Reliability
• Corrections for Multiple Comparisons
• Methods to reliably estimate reliability while
building test collections
97
129. Efficiency
• Better models to estimate document relevance
• Correct variance when only a few relevance
judgments are available
• Estimate relevance beyond k=5
• Other stopping conditions and document weights
98