This document discusses challenges in natural language generation (NLG) evaluation and corpus development. It presents referenceless quality estimation as a way to evaluate NLG systems without gold-standard references, an approach that correlates better with human ratings than standard reference-based metrics. It also describes how crowdsourcing pictorial meaning representations elicits more natural language from annotators and yields higher-quality corpora. Finally, it introduces a new end-to-end NLG challenge corpus in the restaurant domain that provides a large training set and evaluation benchmarks.