1. The document discusses evaluating learning-to-rank models with both offline and online methods. Offline evaluation builds a test set from labeled data and computes ranking metrics such as NDCG (see the sketch after this list), while online evaluation uses A/B testing and interleaving to measure user behavior and business metrics directly.
2. Common mistakes in offline evaluation include queries with only a single sample, queries whose documents all share one relevance label (so ranking metrics like NDCG become trivially perfect and say nothing about ordering), and test samples that are not representative of real traffic (see the test-set audit sketch below). Offline evaluation is efficient, but only online evaluation shows how real users interact with the model and how it moves key metrics.
3. The recommendation is to test models both offline and online, with online testing offering advantages such as measuring actual business outcomes and making the model's effects easier to interpret (see the interleaving sketch below).
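
For the offline side, a minimal sketch of computing mean NDCG@k on a labeled test set is shown below. It assumes graded relevance labels grouped by query id and model scores aligned with the test rows; the function names and data layout are illustrative, not taken from the document.

```python
import math
from collections import defaultdict

def dcg_at_k(labels, k):
    """Discounted cumulative gain with graded gains (2^rel - 1)."""
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(labels[:k]))

def ndcg_at_k(ranked_labels, k=10):
    """NDCG@k: DCG of the model's ranking divided by the ideal DCG."""
    ideal = dcg_at_k(sorted(ranked_labels, reverse=True), k)
    return dcg_at_k(ranked_labels, k) / ideal if ideal > 0 else 0.0

def mean_ndcg(test_rows, scores, k=10):
    """Average NDCG@k over queries.

    test_rows: iterable of (query_id, relevance_label) pairs (hypothetical schema).
    scores:    model scores aligned with test_rows.
    """
    by_query = defaultdict(list)
    for (qid, rel), score in zip(test_rows, scores):
        by_query[qid].append((score, rel))
    per_query = []
    for pairs in by_query.values():
        # Rank documents by model score, then read off their relevance labels.
        ranked_labels = [rel for _, rel in sorted(pairs, key=lambda p: -p[0])]
        per_query.append(ndcg_at_k(ranked_labels, k))
    return sum(per_query) / len(per_query)
```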
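
The offline pitfalls in point 2 can be caught before metrics are reported. The sketch below audits a test set for queries with a single document or a single distinct label, for which NDCG is trivially 1.0; the schema is assumed, not from the document.

```python
from collections import defaultdict

def audit_test_set(rows):
    """Flag queries that make offline ranking metrics meaningless.

    rows: iterable of (query_id, relevance_label) pairs (hypothetical schema).
    Returns query ids with fewer than two documents or fewer than two
    distinct relevance labels.
    """
    docs_per_query = defaultdict(int)
    labels_per_query = defaultdict(set)
    for qid, rel in rows:
        docs_per_query[qid] += 1
        labels_per_query[qid].add(rel)

    single_doc = [q for q, n in docs_per_query.items() if n < 2]
    single_label = [q for q, labels in labels_per_query.items() if len(labels) < 2]
    return single_doc, single_label

# Usage: drop or re-label the flagged queries before computing offline metrics.
single_doc, single_label = audit_test_set([("q1", 3), ("q2", 2), ("q2", 0)])
```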
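
For the online side, interleaving compares two rankers within a single result list instead of splitting traffic as in an A/B test. Below is a minimal team-draft interleaving sketch, assuming two ranked lists of document ids from a control and a candidate model; all names are illustrative and this is one common variant, not necessarily the exact method the document describes.

```python
import random

def next_unused(ranking, used):
    """Highest-ranked document from `ranking` not already shown."""
    for doc in ranking:
        if doc not in used:
            return doc
    return None

def team_draft_interleave(ranking_a, ranking_b, rng=None):
    """Team-draft interleaving: the model with fewer picks contributes its
    best unused document next (ties broken by coin flip) and is credited
    for that slot."""
    rng = rng or random.Random()
    interleaved, used = [], set()
    picks = {"A": 0, "B": 0}
    rankings = {"A": ranking_a, "B": ranking_b}
    while True:
        if picks["A"] != picks["B"]:
            turn = "A" if picks["A"] < picks["B"] else "B"
        else:
            turn = rng.choice(["A", "B"])
        doc = next_unused(rankings[turn], used)
        if doc is None:
            # This model's list is exhausted; let the other one contribute.
            turn = "B" if turn == "A" else "A"
            doc = next_unused(rankings[turn], used)
            if doc is None:
                break
        interleaved.append((doc, turn))  # remember which model placed this document
        used.add(doc)
        picks[turn] += 1
    return interleaved

def score_clicks(interleaved, clicked_docs):
    """Credit each click to the model that contributed the clicked document."""
    wins = {"A": 0, "B": 0}
    for doc, team in interleaved:
        if doc in clicked_docs:
            wins[team] += 1
    return wins
```

Aggregated over many queries, the model with more credited clicks is preferred by users, which is the kind of direct behavioral signal offline metrics cannot provide.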