Test-collection based evaluation in (Music) Information Retrieval has been used for half a century now as the means to evaluate and compare retrieval techniques and to advance the state of the art. However, this paradigm rests on certain assumptions that remain open research problems and that may invalidate our experimental results. In this talk I will approach the paradigm as an estimator of certain probability distributions that describe the final user experience. These distributions are estimated with a test collection by computing system-related distributions that are assumed to correlate reliably with the target user-related distributions. Using the Audio Music Similarity task as an example, I will discuss issues with our current evaluation methods, the degree to which they are problematic, how to analyze them, and how to improve the situation. In terms of validity, we will see how well the measured system distributions correspond to the target user distributions, and how this correspondence affects the conclusions we draw from an experiment. In terms of reliability, we will discuss the optimal characteristics of test collections and statistical procedures. In terms of efficiency, we will discuss models and methods to greatly reduce the annotation cost of an evaluation experiment.
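To make the estimator framing concrete, here is a minimal sketch in Python. It uses made-up per-query effectiveness scores for two hypothetical systems and an assumed logistic user-satisfaction model (none of this comes from the talk itself); it contrasts the system-related distribution a test collection measures directly with the user-related distribution we ultimately care about.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical per-query effectiveness scores for two systems, as a test
# collection would measure them (e.g. 100 queries). Purely illustrative data.
scores_a = rng.beta(6, 3, size=100)   # system A: somewhat more effective
scores_b = rng.beta(5, 4, size=100)   # system B: somewhat less effective

def satisfaction_probability(score, slope=8.0, midpoint=0.5):
    """Map a system-side score to the probability that the user is satisfied.

    This logistic mapping is an assumed user model, not an established one:
    the point of the talk is that the system-to-user correspondence must be
    measured, not taken for granted.
    """
    return 1.0 / (1.0 + np.exp(-slope * (score - midpoint)))

# System-related distribution: what the test collection measures directly.
print("mean score    A=%.3f  B=%.3f" % (scores_a.mean(), scores_b.mean()))

# User-related distribution: what we actually care about, here simulated by
# drawing a satisfied/unsatisfied outcome per query under the assumed model.
sat_a = rng.random(100) < satisfaction_probability(scores_a)
sat_b = rng.random(100) < satisfaction_probability(scores_b)
print("P(satisfied)  A=%.3f  B=%.3f" % (sat_a.mean(), sat_b.mean()))
```

The gap between the difference in mean scores and the difference in satisfaction rates is a toy version of the validity question raised above: whether conclusions drawn from system distributions carry over to the user distributions.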