Audio Music Similarity is a task within Music Information Retrieval concerned with systems that retrieve songs musically similar to a query song according to their audio content. Evaluation experiments are the main scientific tool in Information Retrieval to determine which systems work better and to advance the state of the art accordingly. It is therefore essential that the conclusions drawn from these experiments are both valid and reliable, and that we can reach them at a low cost. This dissertation studies these three aspects of evaluation experiments (validity, reliability and efficiency) for the particular case of Audio Music Similarity, with the general goal of improving how these systems are evaluated. The traditional paradigm for Information Retrieval evaluation based on test collections is approached as a statistical estimator of certain probability distributions that characterize how users employ systems. In terms of validity, we study how well the measured system distributions correspond to the target user distributions, and how this correspondence affects the conclusions we draw from an experiment. In terms of reliability, we study the optimal characteristics of test collections and statistical procedures. In terms of efficiency, we study models and methods to greatly reduce the cost of running an evaluation experiment.