1. The document presents a methodology and tool suite for evaluating the accuracy of combinations of statistical natural language processing engines.
2. It describes an "evaluation space" approach that supports varied comparisons between human-generated and machine-generated outputs at each stage of processing.
3. Example evaluation modules are discussed, including ones that use BLEU scores and ROC curves to evaluate the accuracy of specific engine pipelines, such as speech recognition feeding machine translation.
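
To make the BLEU-based evaluation concrete, here is a minimal, self-contained sketch of a BLEU-style score (modified n-gram precision with a brevity penalty) for comparing a machine-generated hypothesis against a human reference. This is an illustrative simplification, not the document's actual evaluation module; the function name and smoothing floor are assumptions for the example.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(reference, candidate, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of modified
    n-gram precisions (n = 1..max_n) times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        ref_counts = Counter(ngrams(reference, n))
        cand_counts = Counter(ngrams(candidate, n))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        # Small floor avoids log(0) for sentences with no n-gram overlap.
        precisions.append(max(overlap, 1e-9) / total)
    # Brevity penalty discourages overly short hypotheses.
    if len(candidate) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / max(len(candidate), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

reference = "the cat sat on the mat".split()
hypothesis = "the cat sat on the mat".split()
print(round(bleu(reference, hypothesis), 3))  # identical sentences score 1.0
```

In the document's framework, a module like this would occupy one point in the evaluation space, scoring machine translation output against human references downstream of a speech recognition engine.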