This document discusses evaluating deep learning models, specifically for natural language processing tasks. It outlines the status quo of evaluation, which focuses on aggregate performance and can therefore miss failures on specific examples or distributions. The document then introduces Robustness Gym and SummVis as tools for more rigorous evaluation. Robustness Gym enables consolidated, fine-grained evaluation that better exposes model vulnerabilities and informs next steps, such as further analysis or patching. SummVis helps evaluate text summarization models while addressing issues like input contamination between pre-training and evaluation data. The goal of more robust evaluation is to gain a fuller picture of model performance and drive iterative improvement.
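To make the contrast between aggregate and fine-grained evaluation concrete, the sketch below scores a model on named slices (subpopulations) alongside the overall metric. It is a minimal, hypothetical illustration of slice-based evaluation: the example data, slice names, and `evaluate_slices` helper are assumptions for this sketch and do not reflect the actual Robustness Gym API.

```python
from collections import defaultdict

# Hypothetical examples: (text, gold_label, predicted_label, slice_name).
# Slice names are illustrative subpopulations, not real slice builders.
examples = [
    ("short review, clear sentiment", 1, 1, "short_inputs"),
    ("review with a negation ... not good", 0, 1, "negation"),
    ("review mentioning rare entity names", 1, 1, "rare_entities"),
    ("another negation case ... hardly great", 0, 1, "negation"),
    ("plain positive review", 1, 1, "short_inputs"),
]

def evaluate_slices(examples):
    """Report aggregate accuracy and per-slice accuracy."""
    per_slice = defaultdict(lambda: [0, 0])  # slice -> [correct, total]
    correct = 0
    for _, gold, pred, slice_name in examples:
        hit = int(gold == pred)
        correct += hit
        per_slice[slice_name][0] += hit
        per_slice[slice_name][1] += 1

    print(f"aggregate accuracy: {correct / len(examples):.2f}")
    for name, (c, n) in sorted(per_slice.items()):
        print(f"  slice {name!r}: {c / n:.2f} ({n} examples)")

evaluate_slices(examples)
# Aggregate accuracy looks passable (0.60), but the 'negation' slice
# scores 0.00 -- exactly the kind of vulnerability that aggregate
# metrics hide and fine-grained evaluation surfaces.
```

Reporting results per slice rather than only in aggregate is what lets a vulnerability like the negation failure above inform a concrete next step, such as targeted analysis or patching the model on that subpopulation.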