The Music Information Retrieval field has acknowledged the need for rigorous scientific evaluations for some time now. Several efforts set out to develop and provide the necessary infrastructure, technology and methodologies to carry out these evaluations, out of which the annual Music Information Retrieval Evaluation eXchange emerged. The community as a whole has gained enormously from this evaluation forum, but very little attention has been paid to reliability and correctness issues. From the standpoint of experimental validity, this paper surveys past meta-evaluation work in Text Information Retrieval, arguing that the music community still needs to address various issues concerning the evaluation of music systems and the IR cycle, and pointing out directions for further research and proposals along this line.
Statistical Significance Testing in Information Retrieval: An Empirical Analy... · Julián Urbano
Statistical significance testing is widely accepted as a means to assess how well a difference in effectiveness reflects an actual difference between systems, as opposed to random noise due to the selection of topics. According to recent surveys of SIGIR, CIKM, ECIR and TOIS papers, the t-test is the most popular choice among IR researchers. However, previous work has suggested computer-intensive tests like the bootstrap or the permutation test, based mainly on theoretical arguments. On empirical grounds, others have suggested non-parametric alternatives such as the Wilcoxon test. Indeed, the question of which tests we should use has accompanied IR and related fields for decades now. Previous theoretical studies on this matter were limited in that test assumptions are known not to be met in IR experiments, and empirical studies were limited in that we do not have the necessary control over the null hypotheses to compute actual Type I and Type II error rates under realistic conditions. Therefore, not only is it unclear which test to use, but also how much trust we should put in them. In contrast to past studies, in this paper we employ a recent simulation methodology, based on TREC data, to get around these limitations. Our study comprises over 500 million p-values computed for a range of tests, systems, effectiveness measures, topic set sizes and effect sizes, and for both the 2-tailed and 1-tailed cases. Having such a large supply of IR evaluation data with full knowledge of the null hypotheses, we are finally in a position to evaluate how well statistical significance tests really behave with IR data, and make sound recommendations for practitioners.
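As an illustration of the kind of paired tests discussed above, here is a minimal sketch (assuming SciPy and synthetic per-topic scores; this is not the paper's data or code) of how the t-test and the Wilcoxon test are typically applied to the per-topic effectiveness scores of two systems:

# Minimal sketch of paired significance testing on per-topic effectiveness
# scores, e.g. Average Precision of two systems over the same 50 topics.
# The scores below are synthetic; they are not the data used in the paper.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
topics = 50
scores_a = rng.beta(2, 5, size=topics)                                     # per-topic AP of system A
scores_b = np.clip(scores_a + rng.normal(0.02, 0.05, size=topics), 0, 1)   # per-topic AP of system B

# Paired t-test (2-tailed) on the per-topic differences
t_stat, p_t = stats.ttest_rel(scores_b, scores_a)

# Wilcoxon signed-rank test, a non-parametric alternative
w_stat, p_w = stats.wilcoxon(scores_b, scores_a)

print(f"t-test:   t = {t_stat:.3f}, p = {p_t:.4f}")
print(f"Wilcoxon: W = {w_stat:.3f}, p = {p_w:.4f}")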
Going through a PhD may be seen as a requirement for an academic career or a different kind of job, simply as "the next step" in education, as something to do "because why not?", or even just as a hobby you have on the side. What it really is, though, is a life-changing experience, something that can be terribly painful and amazingly rewarding at the same time. In that journey I learned a few lessons the hard way, lessons that I wish someone had told me about at the time. In this talk I'll try to do just that, and not talk about the content and process of a PhD, but rather about you, the person, during your PhD.
The Treatment of Ties in AP Correlation · Julián Urbano
The Kendall tau and AP correlation coefficients are very commonly used to compare two rankings over the same set of items. Even though Kendall tau was originally defined assuming that there are no ties in the rankings, two alternative versions were soon developed to account for ties in two different scenarios: measuring the accuracy of an observer with respect to a true and objective ranking, and measuring the agreement between two observers in the absence of a true ranking. These two variants prove useful in cases where ties are possible in either ranking, and may indeed result in very different scores. AP correlation was devised to incorporate a top-heaviness component into Kendall tau, penalizing differences between items at the top of the rankings more heavily, which makes it a very compelling coefficient in Information Retrieval settings. However, the treatment of ties in AP correlation remains an open problem. In this paper we fill this gap, providing closed analytical formulations of AP correlation under the two scenarios of ties contemplated in Kendall tau. In addition, we developed an R package that implements these coefficients.
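For illustration, a minimal sketch of the original AP correlation coefficient between two rankings without ties (the tie-aware formulations contributed by the paper, and its R package, are not reproduced here):

# Minimal sketch of the original AP correlation (tau_AP) between two rankings
# of the same items, assuming no ties. Rankings are lists of item ids, best first.
def tau_ap(true_ranking, estimated_ranking):
    pos_true = {item: r for r, item in enumerate(true_ranking)}
    n = len(estimated_ranking)
    total = 0.0
    for i in range(1, n):                     # ranks 2..n of the estimated ranking
        item_i = estimated_ranking[i]
        # items above rank i in the estimated ranking that the true ranking
        # also places above item_i (correctly ordered pairs)
        correct = sum(1 for item_j in estimated_ranking[:i]
                      if pos_true[item_j] < pos_true[item_i])
        total += correct / i
    return 2.0 * total / (n - 1) - 1.0

# Identical rankings give 1, reversed rankings give -1
print(tau_ap(["a", "b", "c", "d"], ["a", "b", "c", "d"]))   # 1.0
print(tau_ap(["a", "b", "c", "d"], ["d", "c", "b", "a"]))   # -1.0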
The Music Information Retrieval Evaluation eXchange (MIREX) is a valuable community service, having established standard datasets, metrics, baselines, methodologies, and infrastructure for comparing MIR methods. While MIREX has managed to successfully maintain operations for over a decade, its long-term sustainability is at risk without considerable ongoing financial support. The imposed constraint that input data cannot be made freely available to participants necessitates that all algorithms run on centralized computational resources, which are administered by a limited number of people. This incurs an approximately linear cost with the number of submissions, exacting significant tolls on both human and financial resources, such that the current paradigm becomes less tenable as participation increases. To alleviate the recurring costs of future evaluation campaigns, we propose a distributed, community-centric paradigm for system evaluation, built upon the principles of openness, transparency, reproducibility, and incremental evaluation. We argue that this proposal has the potential to reduce operating costs to sustainable levels. Moreover, the proposed paradigm would improve scalability, and eventually result in the release of large, open datasets for improving both MIR techniques and evaluation methods.
Crawling the Web for Structured Documents · Julián Urbano
Structured Information Retrieval has gained a lot of interest in recent years, as this kind of information is becoming an invaluable asset for professional communities such as Software Engineering. Most of the research has focused on XML documents, with initiatives like INEX bringing together and evaluating new techniques focused on structured information. Although XML documents are the immediate choice, the Web is filled with several other types of structured information, which account for millions of other documents. These documents may be collected directly using standard Web search engines like Google and Yahoo, or by following specific search patterns in online repositories like Sourceforge. This demo describes a distributed and focused web crawler for any kind of structured document, and with it we show how to exploit general-purpose resources to gather large amounts of real-world structured documents off the Web. This kind of tool could help build large test collections of other types of documents, such as Java source code for software-oriented search engines or RDF for semantic searching.
How Do Gain and Discount Functions Affect the Correlation between DCG and Use... · Julián Urbano
We present an empirical analysis of the effect that the gain and discount functions have on the correlation between DCG and user satisfaction. Through a large user study we estimate the relationship between satisfaction and the effectiveness computed with a test collection. In particular, we estimate the probabilities that users find a system satisfactory given a DCG score, and that they agree with a difference in DCG as to which of two systems is more satisfactory. We study this relationship for 36 combinations of gain and discount, and find that a linear gain and a constant discount are best correlated with user satisfaction.
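To make the setup concrete, here is a minimal sketch of DCG with pluggable gain and discount functions; the specific 36 combinations evaluated in the study are not reproduced, and the example relevance judgments are hypothetical:

# Minimal sketch of DCG with configurable gain and discount functions.
import math

def dcg(relevances, gain, discount):
    # relevances: graded judgments of a ranked list, best-ranked document first
    return sum(gain(rel) * discount(rank)
               for rank, rel in enumerate(relevances, start=1))

linear_gain       = lambda rel: rel
exponential_gain  = lambda rel: 2 ** rel - 1
constant_discount = lambda rank: 1.0
log_discount      = lambda rank: 1.0 / math.log2(rank + 1)

ranked_relevances = [3, 2, 3, 0, 1]   # hypothetical graded judgments
print(dcg(ranked_relevances, linear_gain, constant_discount))   # 9.0
print(dcg(ranked_relevances, exponential_gain, log_discount))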
A Comparison of the Optimality of Statistical Significance Tests for Informat... · Julián Urbano
Previous research has suggested the permutation test as the theoretically optimal statistical significance test for IR evaluation, and advocated for the discontinuation of the Wilcoxon and sign tests. We present a large-scale study comprising nearly 60 million system comparisons showing that in practice the bootstrap, t-test and Wilcoxon test outperform the permutation test under different optimality criteria. We also show that actual error rates seem to be lower than the theoretically expected 5%, further confirming that we may actually be underestimating significance.
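For reference, a minimal sketch of the paired randomization (permutation) test on per-topic score differences, with synthetic data rather than the actual system comparisons used in the study:

# Minimal sketch of a paired randomization (permutation) test on per-topic
# score differences between two systems; synthetic data, not the paper's runs.
import numpy as np

def permutation_test(scores_a, scores_b, n_perm=10000, seed=0):
    rng = np.random.default_rng(seed)
    diffs = scores_b - scores_a
    observed = abs(diffs.mean())
    count = 0
    for _ in range(n_perm):
        signs = rng.choice([-1, 1], size=diffs.size)   # randomly swap systems per topic
        if abs((signs * diffs).mean()) >= observed:
            count += 1
    return count / n_perm                               # 2-tailed p-value

rng = np.random.default_rng(1)
a = rng.beta(2, 5, size=50)
b = np.clip(a + rng.normal(0.02, 0.05, size=50), 0, 1)
print(permutation_test(a, b))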
MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres... · Julián Urbano
This short paper describes four submissions to the Symbolic Melodic Similarity task of the MIREX 2010 edition. All four submissions rely on a local-alignment approach between sequences of n-grams, and they differ mainly in the substitution score between two n-grams. This score is based on a geometric representation that shapes musical pieces as curves in the pitch-time plane. One of the systems described ranked first for all ten effectiveness measures used, and the other three ranked from second to fifth, depending on the measure.
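As a rough illustration of the general technique, the following sketch performs Smith-Waterman-style local alignment between two sequences of n-grams; the substitution score here is a simple placeholder, whereas the submissions derive it from the geometric pitch-time representation described above:

# Minimal sketch of local alignment between two sequences of n-grams.
# The substitution function is a placeholder, not the geometric score used
# by the submissions; the n-grams below are hypothetical interval trigrams.
def local_alignment(seq_a, seq_b, substitution, gap=-1.0):
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0.0] * cols for _ in range(rows)]
    best = 0.0
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                0.0,
                score[i - 1][j - 1] + substitution(seq_a[i - 1], seq_b[j - 1]),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            best = max(best, score[i][j])
    return best   # similarity of the best-matching local region

sub = lambda x, y: 1.0 if x == y else -1.0     # placeholder substitution score
query = [(0, 2, -2), (2, -2, 1)]
doc   = [(0, 2, -2), (2, -2, 1), (1, 1, -3)]
print(local_alignment(query, doc, sub))        # 2.0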
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track · Julián Urbano
This paper describes the participation of the uc3m team in both tasks of the TREC 2011 Crowdsourcing Track. For the first task we submitted three runs that used Amazon Mechanical Turk: one where workers made relevance judgments based on a 3-point scale, and two similar runs where workers provided an explicit ranking of documents. All three runs implemented a quality control mechanism at the task level based on a simple reading comprehension test. For the second task we also submitted three runs: one with a stepwise execution of the GetAnotherLabel algorithm and two others with a rule-based and an SVM-based model. According to the NIST gold labels, our runs performed very well in both tasks, ranking at the top for most measures.
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea... · Julián Urbano
Much research in MIR is based on descriptors computed from audio signals. Some music corpora use different audio encodings, some do not contain audio but descriptors already computed in some particular way, and sometimes we have to gather audio files ourselves. We thus assume that descriptors are robust to these changes and algorithms are not affected. We investigated this issue for MFCCs and Chroma: how do encoding quality, analysis parameters and musical characteristics affect their robustness?
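A minimal sketch of this kind of robustness check, assuming the librosa library and two hypothetical files holding the original and a re-encoded version of the same recording (the descriptors, parameters and corpora actually studied are described in the paper):

# Minimal sketch: compare MFCC and chroma descriptors of an original recording
# against a re-encoded version of it. File names are hypothetical.
import numpy as np
import librosa

y_orig, sr = librosa.load("track_original.wav", sr=22050)
y_enc, _ = librosa.load("track_reencoded.mp3", sr=22050)

mfcc_orig = librosa.feature.mfcc(y=y_orig, sr=sr, n_mfcc=13)
mfcc_enc = librosa.feature.mfcc(y=y_enc, sr=sr, n_mfcc=13)
chroma_orig = librosa.feature.chroma_stft(y=y_orig, sr=sr)
chroma_enc = librosa.feature.chroma_stft(y=y_enc, sr=sr)

# One simple robustness indicator: mean absolute difference per coefficient
n = min(mfcc_orig.shape[1], mfcc_enc.shape[1])
print(np.abs(mfcc_orig[:, :n] - mfcc_enc[:, :n]).mean(axis=1))
n = min(chroma_orig.shape[1], chroma_enc.shape[1])
print(np.abs(chroma_orig[:, :n] - chroma_enc[:, :n]).mean(axis=1))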
Evaluation in (Music) Information Retrieval through the Audio Music Similarit... · Julián Urbano
Test-collection based evaluation in (Music) Information Retrieval has been used for half a century now as the means to evaluate and compare retrieval techniques and advance the state of the art. However, this paradigm makes certain assumptions that remain a research problem and that may invalidate our experimental results. In this talk I will approach this paradigm as an estimator of certain probability distributions that describe the final user experience. These distributions are estimated with a test collection, computing system-related distributions assumed to reliably correlate with the target user-related distributions.
Using the Audio Music Similarity task as an example, I will talk about issues with our current evaluation methods, the degree to which they are problematic, how to analyze them and improve the situation. In terms of validity, we will see how the measured system distributions correspond to the target user distributions, and how this correspondence affects the conclusions we draw from an experiment. In terms of reliability, we will discuss optimal characteristics of test collections and statistical procedures. In terms of efficiency, we discuss models and methods to greatly reduce the annotation cost of an evaluation experiment.
Audio Music Similarity is a task within Music Information Retrieval that deals with systems that retrieve songs musically similar to a query song according to their audio content. Evaluation experiments are the main scientific tool in Information Retrieval to determine what systems work better and advance the state of the art accordingly. It is therefore essential that the conclusions drawn from these experiments are both valid and reliable, and that we can reach them at a low cost. This dissertation studies these three aspects of evaluation experiments for the particular case of Audio Music Similarity, with the general goal of improving how these systems are evaluated. The traditional paradigm for Information Retrieval evaluation based on test collections is approached as a statistical estimator of certain probability distributions that characterize how users employ systems. In terms of validity, we study how well the measured system distributions correspond to the target user distributions, and how this correspondence affects the conclusions we draw from an experiment. In terms of reliability, we study the optimal characteristics of test collections and statistical procedures, and in terms of efficiency we study models and methods to greatly reduce the cost of running an evaluation experiment.
On the Measurement of Test Collection Reliability · Julián Urbano
The reliability of a test collection is proportional to the number of queries it contains. But building a collection with many queries is expensive, so researchers have to find a balance between reliability and cost. Previous work on the measurement of test collection reliability relied on data-based approaches that contemplated random what-if scenarios, and provided indicators such as swap rates and Kendall tau correlations. Generalizability Theory was proposed as an alternative, founded on analysis of variance, that provides reliability indicators based on statistical theory. However, these reliability indicators are hard to interpret in practice, because they do not correspond to well-known indicators like Kendall tau correlation. We empirically establish these relationships based on data from over 40 TREC collections, thus filling the gap in the practical interpretation of Generalizability Theory. We also review the computation of these indicators, and show that they are extremely dependent on the sample of systems and queries used, so much so that the required number of queries to achieve a certain level of reliability can vary by orders of magnitude. We discuss the computation of confidence intervals for these statistics, providing a much more reliable tool to measure test collection reliability. Reflecting upon all these results, we review a wealth of TREC test collections, arguing that they are possibly not as reliable as generally accepted and that the common choice of 50 queries is insufficient even for stable rankings.
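As an illustration of the Generalizability Theory indicators mentioned above, here is a rough sketch that computes generalizability and dependability coefficients from a fully crossed system-by-query score matrix using standard variance-component formulas; the data is synthetic, not the TREC collections analyzed in the paper:

# Rough sketch of Generalizability Theory reliability indicators for a fully
# crossed system-by-query matrix of effectiveness scores (synthetic data).
import numpy as np

def g_study(scores, n_q_prime=None):
    """scores: matrix with one row per system and one column per query."""
    n_s, n_q = scores.shape
    n_q_prime = n_q if n_q_prime is None else n_q_prime
    grand = scores.mean()
    sys_means = scores.mean(axis=1)
    qry_means = scores.mean(axis=0)

    # Mean squares of the two-way crossed design with one observation per cell
    ms_s = n_q * np.sum((sys_means - grand) ** 2) / (n_s - 1)
    ms_q = n_s * np.sum((qry_means - grand) ** 2) / (n_q - 1)
    resid = scores - sys_means[:, None] - qry_means[None, :] + grand
    ms_e = np.sum(resid ** 2) / ((n_s - 1) * (n_q - 1))

    # Variance components (negative estimates clipped to zero)
    var_e = ms_e
    var_s = max((ms_s - ms_e) / n_q, 0.0)
    var_q = max((ms_q - ms_e) / n_s, 0.0)

    e_rho2 = var_s / (var_s + var_e / n_q_prime)             # relative (rankings)
    phi = var_s / (var_s + (var_q + var_e) / n_q_prime)      # absolute (scores)
    return e_rho2, phi

rng = np.random.default_rng(7)
scores = np.clip(rng.normal(0.35, 0.08, (20, 1))     # system effects
                 + rng.normal(0, 0.10, (1, 50))      # query effects
                 + rng.normal(0, 0.12, (20, 50)),    # interaction / error
                 0, 1)
print(g_study(scores))                  # reliability with the 50 observed queries
print(g_study(scores, n_q_prime=100))   # projected reliability with 100 queries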
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and... · Julián Urbano
Reliable evaluation of Information Retrieval systems requires large amounts of relevance judgments. Making these annotations is quite complex and tedious for many Music Information Retrieval tasks, so performing such evaluations requires too much effort. A low-cost alternative is the application of Minimal Test Collection algorithms, which offer quite reliable results while significantly reducing the annotation effort. The idea is to incrementally select what documents to judge so that we can compute estimates of the effectiveness differences between systems with a certain degree of confidence. In this paper we show a first approach towards its application to the evaluation of the Audio Music Similarity and Retrieval task, run by the annual MIREX evaluation campaign. An analysis with the MIREX 2011 data shows that the judging effort can be reduced to about 35% to obtain results with 95% confidence.
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo... · Julián Urbano
This notebook paper describes our participation in both tasks of the TREC 2011 Crowdsourcing Track. For the first one we submitted three runs that used Amazon Mechanical Turk: one where workers made relevance judgments based on a 3-point scale, and two similar runs where workers provided an explicit ranking of documents. All three runs implemented a quality control mechanism at the task level, which was based on a simple reading comprehension test. For the second task we submitted another three runs: one with a stepwise execution of the GetAnotherLabel algorithm by Ipeirotis et al., and two others with a rule-based and a SVM-based model. We also comment on several topics regarding the Track design and evaluation methods.
Audio Music Similarity and Retrieval: Evaluation Power and Stability · Julián Urbano
In this paper we analyze the reliability of the results in the evaluation of Audio Music Similarity and Retrieval systems. We focus on the power and stability of the evaluation, that is, how often a significant difference is found between systems and how often these significant differences are incorrect. We study the effect of using different effectiveness measures with different sets of relevance judgments, for varying number of queries and alternative statistical procedures. Different measures are shown to behave similarly overall, though some are much more sensitive and stable than others. The use of different statistical procedures does improve the reliability of the results, and it allows using as little as half the number of queries currently used in MIREX evaluations while still offering very similar reliability levels. We also conclude that experimenters can be very confident that if a significant difference is found between two systems, the difference is indeed real.
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ... · Julián Urbano
We describe a pilot experiment to update the program of an Information Retrieval course for Computer Science undergraduates. We have engaged the students in the development of a search engine from scratch, and they have been involved in the elaboration, also from scratch, of a complete test collection to evaluate their systems. With this methodology they get a complete view of the Information Retrieval process as they would find it in a real-world setting, and their direct involvement in the evaluation makes them realize the importance of these laboratory experiments in Computer Science. We show that this methodology is indeed reliable and feasible, and so we plan to improve it and keep using it in the coming years, leading to a public repository of resources for Information Retrieval courses.
Improving the Generation of Ground Truths based on Partially Ordered Lists · Julián Urbano
Ground truths based on partially ordered lists have been used for some years now to evaluate the effectiveness of Music Information Retrieval systems, especially in tasks related to symbolic melodic similarity. However, there has been practically no meta-evaluation to measure or improve the correctness of these evaluations. In this paper we revise the methodology used to generate these ground truths and disclose some issues that need to be addressed. In particular, we focus on the arrangement and aggregation of the relevant results, and show that it is not possible to ensure completely consistent lists. We develop a measure of consistency based on Average Dynamic Recall and propose several alternatives to arrange the lists, all of which prove to be more consistent than the original method. The results of the MIREX 2005 evaluation are revisited using these alternative ground truths.
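For context, a rough sketch of Average Dynamic Recall for a ground truth given as a partially ordered list (a sequence of groups of interchangeable documents); the consistency measure the paper builds on top of ADR is not reproduced, and the example lists are hypothetical:

# Rough sketch of Average Dynamic Recall (ADR) for a partially ordered ground
# truth: a sequence of groups of documents considered interchangeable.
def average_dynamic_recall(results, groups):
    n_relevant = sum(len(g) for g in groups)
    total = 0.0
    for i in range(1, n_relevant + 1):
        # documents allowed at rank i: the first groups covering at least i items
        allowed, covered = set(), 0
        for g in groups:
            allowed |= set(g)
            covered += len(g)
            if covered >= i:
                break
        retrieved = set(results[:i])
        total += len(retrieved & allowed) / i
    return total / n_relevant

groups = [["a"], ["b", "c"], ["d"]]     # hypothetical partially ordered list
results = ["a", "c", "b", "x", "d"]     # hypothetical system output
print(average_dynamic_recall(results, groups))   # 0.9375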
Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks · Julián Urbano
Music similarity tasks, where musical pieces similar to a query should be retrieved, are quite troublesome to evaluate. Ground truths based on partially ordered lists were developed to cope with problems regarding relevance judgment, but they require so much manpower to generate that the official MIREX evaluations had to turn to more affordable alternatives. However, in-house evaluations keep using these partially ordered lists because they are still more suitable for similarity tasks. In this paper we propose a cheaper alternative to generate these lists by using crowdsourcing to gather music preference judgments. We show that our method produces lists very similar to the original ones, while dealing with some defects of the original methodology. With this study, we show that crowdsourcing is a perfectly viable alternative to evaluate music systems without the need for experts.
Using the Shape of Music to Compute the Similarity between Symbolic Musical P... · Julián Urbano
We describe a new approach for computing the similarity between symbolic musical pieces, based on the differences in shape between the interpolating curves that define the pieces. We outline several requirements for a symbolic musical similarity system, and then show how our model addresses them. Results obtained by an early prototype, tested with the MIREX 2005 Symbolic Music Similarity collections, show that this approach performs well. This model opens a new and promising line for further research on the melodic similarity problem.
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain
1. Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain
Julián Urbano @julian_urbano
University Carlos III of Madrid
ISMIR 2011 · Miami, USA · October 26th
(Picture by Daniel Ray)
7. ISMIR 2001 resolution on the need to create standardized MIR test collections, tasks, and evaluation metrics for MIR research and development
[Timeline, 1960-2011: Cranfield 2 (1962-1966), MEDLARS (1966-1967), SMART (1961-1995), TREC (1992-today), NTCIR (1999-today), CLEF (2000-today), ISMIR (2000-today)]
3 workshops (2002-2003): The MIR/MDL Evaluation Project
8. ISMIR 2001 resolution on the need to create standardized MIR test collections, tasks, and evaluation metrics for MIR research and development
[Timeline, 1960-2011: Cranfield 2 (1962-1966), MEDLARS (1966-1967), SMART (1961-1995), TREC (1992-today), NTCIR (1999-today), CLEF (2000-today), ISMIR (2000-today), MIREX (2005-today), >1200 runs!]
3 workshops (2002-2003): The MIR/MDL Evaluation Project
follow the steps of the Text IR folks, but carefully: not everything applies to music
9. are we done already? Evaluation is not easy
nearly 2 decades of Meta-Evaluation in Text IR
[Timeline, 1960-2011: Cranfield 2 (1962-1966), MEDLARS (1966-1967), SMART (1961-1995), TREC (1992-today), NTCIR (1999-today), CLEF (2000-today), ISMIR (2000-today), MIREX (2005-today)]
positive impact on MIR
10. are we done already? Evaluation is not easy
nearly 2 decades of Meta-Evaluation in Text IR
[Timeline, 1960-2011: Cranfield 2 (1962-1966), MEDLARS (1966-1967), SMART (1961-1995), TREC (1992-today), NTCIR (1999-today), CLEF (2000-today), ISMIR (2000-today), MIREX (2005-today)]
some good practices inherited from here: positive impact on MIR
"not everything applies", but much of it does! a lot of things have happened here!
14. Experimental Validity
how well an experiment meets the well-grounded requirements of the scientific method: do the results fairly and actually assess what was intended?
Meta-Evaluation
analyze the validity of IR Evaluation experiments
[Table: types of validity (Construct, Content, Convergent, Criterion, Internal, External, Conclusion) marked against the components of the evaluation experiment they involve: Task, Queries, Documents, Ground truth, User model, Measures, Systems]
17. construct validity
#fail: measure quality of a Web search engine by the number of visits
what? do the variables of the experiment correspond to the theoretical meaning of the concept they purport to measure?
how? thorough selection and justification of the variables used
18. construct validity in IR
effectiveness measures and their user model
[Carterette, SIGIR2011]
set-based measures do not resemble real users
[Sanderson et al., SIGIR2010]
rank-based measures are better
[Järvelin et al., TOIS2002]
graded relevance is better
[Voorhees, SIGIR2001][Kekäläinen, IP&M2005]
other forms of ground truth are better
[Bennett et al., SIGIRForum2008]
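A toy sketch of the difference these user models make (the relevance grades below are made up, not from the cited work): score the same two rankings with a set-based measure and with a rank-based, graded one.

from math import log2

def precision_at_k(rels, k):
    # set-based: a document either is relevant (grade > 0) or is not; rank ignored
    return sum(1 for r in rels[:k] if r > 0) / k

def ndcg_at_k(rels, k):
    # rank-based with graded relevance: documents near the top weigh more
    dcg = sum(r / log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = sorted(rels, reverse=True)
    idcg = sum(r / log2(i + 2) for i, r in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# hypothetical relevance grades (0-3) of the documents at ranks 1..5 for two systems
sys_a = [3, 2, 0, 0, 1]
sys_b = [0, 1, 0, 2, 3]
print(precision_at_k(sys_a, 5), precision_at_k(sys_b, 5))              # identical: 0.6 and 0.6
print(round(ndcg_at_k(sys_a, 5), 3), round(ndcg_at_k(sys_b, 5), 3))    # ~0.976 vs ~0.557

Under the set-based view the two systems look equally good; once rank and grades enter the user model, system A is clearly preferable.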
19. content validity
#fail
measure reading comprehension
only with sci-fi books
what?
do the experimental units reflect and represent
the elements of the domain under study?
how?
careful selection of the experimental units
20. content validity in IR
tasks closely resembling real-world settings
systems completely fulfilling real-user needs
heavy user component, difficult to control
evaluate the system component instead
[Cleverdon, SIGIR2001][Voorhees, CLEF2002]
actual value of systems is really unknown
[Marchionini, CACM2006]
sometimes they just do not work with real users
[Turpin et al., SIGIR2001]
21. content validity in IR
documents resembling real-world settings
large and representative samples, especially for Machine Learning
careful selection of queries, diverse but reasonable
[Voorhees, CLEF2002][Carterette et al., ECIR2009]
random selection is not good
some queries are better at differentiating bad systems
[Guiver et al., TOIS2009][Robertson, ECIR2011]
22. convergent validity
#fail
measures of math skills not correlated
with abstract thinking
what?
do the results agree with others, theoretical or experimental, that they should be related to?
how?
careful examination and confirmation
of the relationship between the results
and others supposedly related
23. convergent validity in IR
ground truth data is subjective
differences across groups and over time
different results depending on who evaluates
absolute numbers change
relative differences stand still for the most part
[Voorhees, IP&M2000]
for large-scale evaluations or varying experience
of assessors, differences do exist
[Carterette et al., 2010]
24. convergent validity in IR
measures are precision- or recall-oriented
they should therefore be correlated with each other
but they actually are not (reliability?)
[Kekäläinen, IP&M2005][Sakai, IP&M2007]
better correlated with others than with themselves!
[Webber et al., SIGIR2008]
correlation with user satisfaction in the task
[Sanderson et al., SIGIR2010]
ranks, unconventional judgments, discounted gain…
[Bennett et al., SIGIRForum2008][Järvelin et al., TOIS2002]
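A minimal sketch of such a convergent-validity check (the system scores below are hypothetical): rank the same systems by two measures and compute the rank correlation between the two orderings.

def kendall_tau(x, y):
    # plain Kendall tau (assuming no ties) between two equal-length score lists
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# hypothetical mean scores of five systems under two different measures
map_scores  = [0.31, 0.27, 0.45, 0.22, 0.38]   # e.g. mean average precision
ndcg_scores = [0.52, 0.47, 0.60, 0.49, 0.55]   # e.g. mean nDCG
print(kendall_tau(map_scores, ndcg_scores))     # 0.8: the two measures largely agree here

A high correlation supports convergent validity; measures that claim to capture related notions of effectiveness yet rank systems very differently deserve a closer look.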
25. criterion validity
#fail
ask if the new drink is good
instead of better than the old one
what?
are the results correlated with those of
other experiments already known to be valid?
how?
careful examination and confirmation of the
correlation between our results and previous ones
26. criterion validity in IR
practical large-scale methodologies: pooling (see the sketch below)
[Buckley et al., SIGIR2004]
less effort, but same results?
judgments by non-experts
[Bailey et al., SIGIR2008]
crowdsourcing for low-cost
[Alonso et al., SIGIR2009][Carvalho et al., SIGIRForum2010]
estimate measures with fewer judgments
[Yilmaz et al., CIKM2006][Yilmaz et al., SIGIR2008]
select what documents to judge, by informativeness
[Carterette et al., SIGIR2006][Carterette et al., SIGIR2007]
use no relevance judgments at all
[Soboroff et al., SIGIR2001]
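A bare-bones sketch of depth-k pooling, the idea behind these low-cost methodologies (run names and rankings below are made up): only the union of the top-k documents of every run gets judged.

def depth_k_pool(runs, k):
    # union of the top-k documents of every run (a single topic)
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:k])
    return pool

# made-up rankings from three runs for one topic
runs = {
    "run_a": ["d3", "d1", "d7", "d2", "d9"],
    "run_b": ["d1", "d4", "d3", "d8", "d5"],
    "run_c": ["d6", "d3", "d1", "d9", "d4"],
}
print(sorted(depth_k_pool(runs, k=3)))
# ['d1', 'd3', 'd4', 'd6', 'd7']: only these 5 documents would be judged;
# everything outside the pool is treated as non-relevant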
27. internal validity
#fail
measure usefulness of Windows vs Linux vs iOS
only with Apple employees
what?
can the conclusions be rigorously drawn
from the experiment alone
and not from other overlooked factors?
how?
careful identification and control of possible
confounding variables and selection of design
28. internal validity in IR
inconsistency: performance depends on assessors
[Voorhees, IP&M2000][Carterette et al., SIGIR2010]
incompleteness: performance depends on pools
system reinforcement
[Zobel, SIGIR2008]
affects reliability of measures and overall results
[Sakai, JIR2008][Buckley et al., SIGIR2007]
train-test: same characteristics in queries and docs, especially for Machine Learning
improvements on the same collections: overfitting (see the sketch below)
[Voorhees, CLEF2002]
measures must be fair to all systems
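A toy sketch of the train-test point (the effectiveness function below is a hypothetical stand-in for an actual evaluation run): tune on one subset of topics, report on the held-out subset.

import random

rng = random.Random(7)
topics = list(range(50))
rng.shuffle(topics)
train, test = topics[:30], topics[30:]

def effectiveness(param, topic):
    # hypothetical stand-in for running a system with a given parameter on one
    # topic and scoring it; real scores would come from an evaluation run
    return 0.4 + 0.05 * param - 0.01 * param ** 2 + rng.gauss(0, 0.05)

# tune the parameter on the training topics only...
best_param = max(range(10), key=lambda p: sum(effectiveness(p, t) for t in train))
# ...and report effectiveness on the held-out topics
test_score = sum(effectiveness(best_param, t) for t in test) / len(test)
print(best_param, round(test_score, 3))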
29. external validity
#fail
study cancer treatment mostly with teenage males
what?
can the results be generalized
to other populations and experimental settings?
how?
careful design and justification
of sampling and selection methods
30. external validity in IR
weakest point of IR Evaluation
[Voorhees, CLEF2002]
large-scale is always incomplete
[Zobel, SIGIR2008][Buckley et al., SIGIR2004]
test collections are themselves an evaluation result
but they become hardly reusable
[Carterette et al., WSDM2010][Carterette et al., SIGIR2010]
31. external validity in IR
systems perform differently with different collections
cross-collection comparisons are unjustified
highly depends on test collection characteristics
[Bodoff et al., SIGIR2007][Voorhees, CLEF2002]
interpretation of results must be in terms of
pairwise comparisons, not absolute numbers
[Voorhees, CLEF2002]
do not claim anything about state of the art
based on a handful of experiments
baselines can be used to compare across collections (meaningful, not random!)
[Armstrong et al., CIKM2009]
32. conclusion validity
#fail
more access to the Internet in China than in the US
because of the larger total number of users
what?
are the conclusions justified based on the results?
how?
careful selection of the measuring instruments and
statistical methods used to draw grand conclusions
33. conclusion validity in IR
measures should be sensitive and stable
[Buckley et al., SIGIR2000]
and also powerful
[Voorhees et al., SIGIR2002][Sakai, IP&M2007]
with little effort
[Sanderson et al., SIGIR2005]
always bearing in mind
the user model and the task
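One rough way to probe stability, in the spirit of split-half analyses (the per-topic scores below are simulated, not taken from the cited papers): repeatedly split the topic set and check how often the two halves disagree on which system is better.

import random

def disagreement_rate(scores_a, scores_b, trials=1000, seed=0):
    # scores_a, scores_b: per-topic scores of two systems over the same topics
    rng = random.Random(seed)
    n = len(scores_a)
    idx = list(range(n))
    disagree = 0
    for _ in range(trials):
        rng.shuffle(idx)
        half1, half2 = idx[: n // 2], idx[n // 2:]
        d1 = sum(scores_a[i] - scores_b[i] for i in half1)
        d2 = sum(scores_a[i] - scores_b[i] for i in half2)
        if d1 * d2 < 0:          # the two halves point in opposite directions
            disagree += 1
    return disagree / trials

# hypothetical per-topic scores: system B is slightly better than A, plus noise
rng = random.Random(1)
a = [rng.random() for _ in range(20)]
b = [x + rng.gauss(0.02, 0.05) for x in a]
print(disagreement_rate(a, b))   # the closer to 0, the more stable the conclusion "B > A"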
34. conclusion validity in IR
statistical methods to compare score distributions
[Smucker et al., CIKM2007][Webber et al., CIKM2008]
correct interpretation of the statistics
hypothesis testing is troublesome
statistical significance ≠ practical significance
increasing #queries (sample size) increases power
to detect ever smaller differences (effect-size)
eventually, everything is statistically significant
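A small simulation of that last point (simulated per-query differences; a normal approximation stands in for a t-test): the true effect stays tiny, yet the p-value shrinks as the number of queries grows.

import random
from math import sqrt
from statistics import NormalDist, mean, stdev

def p_value_paired(diffs):
    # two-sided p-value for H0 "mean difference = 0", normal approximation
    z = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
    return 2 * (1 - NormalDist().cdf(abs(z)))

rng = random.Random(42)
for n_queries in (25, 100, 1000, 10000):
    # per-query score differences: a tiny true effect of 0.005 plus noise
    diffs = [rng.gauss(0.005, 0.1) for _ in range(n_queries)]
    print(n_queries, round(p_value_paired(diffs), 4))
# the true effect never changes, but the p-value tends toward 0 as n grows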
45. IR Research & Development Cycle
collections are too small and/or biased
lack of realistic, controlled public collections
standard formats and evaluation software to minimize bugs
can't replicate results, often leading to wrong conclusions
private, undescribed and unanalyzed collections emerge
47. IR Research & Development Cycle
undocumented measures, no accepted evaluation software
lack of baselines as lower bound (random is not a baseline!)
proper statistics
correct interpretation of statistics
49. IR Research & Development Cycle
raw musical material unknown
undocumented queries and/or documents
go back to private collections: overfitting!
53. collections
large, heterogeneous and controlled
not a hard endeavour, except for the damn copyright
Million Song Dataset!
still problematic (new features?, actual music)
standardize collections across tasks
better understanding and use of improvements
54. raw music data
essential for Learning and Improvement phases
use copyright-free data
Jamendo!
study possible biases
reconsider artificial material
55. evaluation model
let teams run their own algorithms
(needs public collections)
relief for IMIRSEL and promote wider participation
successfully used for 20 years in Text IR venues
adopted by MusiCLEF
only viable alternative in the long run
MIREX-DIY platforms still don’t allow full completion
of the IR Research & Development Cycle
56. organization
IMIRSEL plans, schedules and runs everything
add a 2nd tier of organizers, task-specific
logistics, planning, evaluation, troubleshooting…
format of large forums like TREC and CLEF
smooth the process and develop tasks that really
push the limits of the state of the art
57. overview papers
every year, by task organizers
detail the evaluation process, data, results
discussion to boost Interpretation and Learning
perfect wrap-up for team papers
rarely discuss results, and many are not even drafted
58. specific methodologies
MIR has unique methodologies and measures
meta-evaluate: analyze and improve
human effects on the evaluation
user satisfaction
59. standard evaluation software
bugs are inevitable
open evaluation software to everybody
gain reliability
speed up the development process
serve as documentation for newcomers
promote standardization of formats
60. baselines
help measure the overall progress of the field
standard formats + standard software +
public controlled collections + raw music +
task-specific organization
measure the state of the art
61. commitment
we need to acknowledge the current problems
MIREX should not only be a place to
evaluate and improve systems
but also a place to
meta-evaluate and improve how we evaluate
and a place to
design tasks that challenge researchers
analyze our evaluation methodologies
62. we all need to start
questioning
evaluation practices