- The document evaluates how well traditional and document preference-based IR evaluation measures align with users' search engine results page (SERP) preferences.
- The best document preference-based measures, wpref5 and wpref6, had a mean agreement rate of 78% with human judges, comparable to the median judge. However, they performed significantly worse than the best human judge, who achieved an 82% agreement rate.
- The best overall measures were nDCG and iRBU, with a mean agreement rate of 80%, comparable to the best human judge. These measures performed as well as or better than most human judges.
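The agreement rates above can be understood as the fraction of SERP pairs where a measure's preference (the SERP it scores higher) matches the human judge's preference. A minimal sketch of that computation, with made-up scores and preference judgments purely for illustration (the function name, data, and values are assumptions, not from the paper):

```python
def agreement_rate(measure_scores, human_prefs):
    """Fraction of SERP pairs where the measure's preferred SERP
    (the one with the higher score) matches the human-preferred SERP.

    measure_scores: dict mapping SERP id -> measure score (e.g., an nDCG value)
    human_prefs: list of (serp_a, serp_b, human_preferred_serp) tuples
    """
    agree = 0
    for a, b, preferred in human_prefs:
        # The measure "prefers" whichever SERP it assigns the higher score
        measure_pref = a if measure_scores[a] >= measure_scores[b] else b
        agree += (measure_pref == preferred)
    return agree / len(human_prefs)

# Toy example: hypothetical measure scores and human preference judgments
scores = {"serp1": 0.82, "serp2": 0.64, "serp3": 0.91}
prefs = [
    ("serp1", "serp2", "serp1"),  # human prefers the higher-scored SERP
    ("serp1", "serp3", "serp3"),  # human prefers the higher-scored SERP
    ("serp2", "serp3", "serp2"),  # human disagrees with the measure
]
print(agreement_rate(scores, prefs))  # agrees on 2 of 3 pairs
```

A measure with an 80% rate on this metric would match the human-preferred SERP in 4 of every 5 judged pairs, which is the sense in which nDCG and iRBU are said to rival the best individual judge.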