Around the currently available large piles of Big Data, there's happening quite a mixed gathering: Business Engineers define which insightswould be precious, Analysts build models, Hadoop programmers tame the flood of data, and Operations people setup machines and networks. It's exactly the interplay of all participants which is central to project success. This setup together with the distributed nature of processing poses new challenges to well-established models of assuring software artifact quality: How can non-programmers define acceptance criteria? How can functionalities be tested which depend on cluster execution, orchestration of, e.g., different hadoop jobs without delaying the development process? Which data selection is suited best for simulating the live environment? How can intermediate results in arbitrary serialization formats be inspected?
In this talk, experiences and best practices from approaching these problems in a large-scale log data analysis project will be presented. At 1&1, our team develops hadoop applications which process roughly 1 billion log events (~1 TB) per day. We will give an overview of the hard- and software setup of our quality assurance environment, which includes FitNesse as a wiki-style acceptance testing framework.Starting from a comparison with existing test frameworks like MRUnit, we will explain how we automate the parameterized deployment of our applications, choose test data sampling strategies, perform workflow management and orchestration of jobs / applications, and use Pig for inspection of intermediate results and definition of final acceptance criteria. Our conclusion is that test-driven development in the field of Big Data requires adaption of existing paradigms, but is crucial for maintaining high quality standards for the resulting applications.