Testing Hadoop jobs with MRUnit

© 2010 Eric Wendelin
Eric Wendelin
Hadooper at Return Path
Blog: eriwen.com
Twitter: @eriwen
What is MRUnit?

• Testing library for MapReduce
• Developed by Cloudera
• Easy integration between MapReduce
  and standard testing tools (e.g. JUnit)
Why do I need that?
Testing without MRUnit
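• Write tests that create JobConf or
  Configuration objects
 • conf.set('mapred.job.tracker', 'local')
• Developing new test input files stored
  alongside MapReduce test code
• Lots of work to validate output files
• External file I/O makes tests slooooow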
MRUnit makes testing
Hadoop jobs easier
Testing with MRUnit

• No external test input or output files
 • Programmatically specified
• Less test harness code (but also
  perhaps less control)
• Concise, fast tests
class ExampleTest {
  private Example.MyMapper mapper
  private Example.MyReducer reducer
  private MapReduceDriver driv...
class ExampleTest {
  private Example.MyMapper mapper
  private Example.MyReducer reducer
  private MapReduceDriver driv...
Test map and reduce
    separately
class ExampleTest {
  private Example.MyMapper mapper
  private MapDriver mDriver

    @Before void setUp() {
       map...
class ExampleTest {
  private Example.MyReducer reducer
  private ReduceDriver rDriver

    @Before void setUp() {
     ...
Counters!
driver.withInput(...)
driver.run()

def counters = driver.getCounters()

assertEquals(1, counters.findCounter
    ('foo', 'bar').getValue())
Verifying logging
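def messages = []
def appender = [
    append: { messages.add(it) },
    requiresLayout: { false }
  ] as AppenderSkeleton
Logger.getRootLogger().addAppender(appender)
driver.runTest()
assertTrue messages.find { it.getLevel().toString() == 'WARN' &&
    it.getMessage().contains('My err') }
Logger.getRootLogger().removeAppender(appender)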
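
The attach, run, assert, detach pattern above can also be sketched without log4j. The example below swaps in java.util.logging from the JDK so that it runs with no extra dependencies; it is not the log4j API used on the slide. The names LogCapture, CapturingHandler, codeUnderTest and the logger name "example.job" are all hypothetical, and codeUnderTest merely stands in for a mapper or reducer that logs a warning.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Sketch only: uses java.util.logging instead of the log4j appender from the slide.
final class LogCapture {

    /** Handler that stores every record it receives instead of printing it. */
    static final class CapturingHandler extends Handler {
        final List<LogRecord> records = new ArrayList<>();

        @Override public void publish(LogRecord rec) { records.add(rec); }
        @Override public void flush() { }
        @Override public void close() { }
    }

    private LogCapture() { }

    /** Hypothetical code under test: logs a warning, as a mapper might on bad input. */
    static void codeUnderTest(Logger log) {
        log.warning("My err: could not parse record");
    }

    /** Attaches a capturing handler, runs the code, and reports whether the warning was seen. */
    static boolean warningWasLogged() {
        Logger log = Logger.getLogger("example.job"); // assumed logger name
        CapturingHandler handler = new CapturingHandler();
        log.addHandler(handler);
        try {
            codeUnderTest(log);
        } finally {
            // Always detach the handler so later tests are not affected.
            log.removeHandler(handler);
        }
        for (LogRecord rec : handler.records) {
            if (rec.getLevel() == Level.WARNING && rec.getMessage().contains("My err")) {
                return true;
            }
        }
        return false;
    }
}
```

Detaching in a finally block is a small hardening over the slide version: if the run throws, the appender would otherwise stay attached to the logger and keep collecting messages from unrelated tests.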
Cool stuff I haven’t
         tried...
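• The PipelineMapReduceDriver allows
  testing a series of MapReduce passes
 • Just call addMapReduce(mapper, reducer)
• Mock objects: MockReporter, MockInputSplit,
  and MockOutputCollector
• Test combiners with
  myMapReduceDriver.setCombiner(myCombiner)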
Problems with MRUnit
runTest()  does not
    give meaningful
information on failure
Better to use run() and
      then assert
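driver.setInput(new Text('foo'),
    new Text('bar'))

def output = driver.run()

// The output pair holds Text objects, so compare their string forms
assertEquals 'baz', output[0].first.toString()
assertEquals 'jy', output[0].second.toString()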
Documentation is
 severely lacking
runXxx() calls setup()
for the new Hadoop API,
but not for the old API
Tests are not executed
 in a distributed way
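
To make "not executed in a distributed way" concrete, here is a minimal sketch in plain Java of what an in-process test run amounts to: map every input, group the intermediate values by key, reduce each group, and hand the results back as a list to assert on. This is not MRUnit's code or API. InMemoryMapReduce, Pair, run and wordCount are hypothetical names invented for this illustration, and the shuffle is assumed to be a simple sort by key.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical stand-in for an in-memory test driver; NOT MRUnit's implementation.
final class InMemoryMapReduce {

    /** A single key/value record. */
    static final class Pair {
        final String key;
        final String value;

        Pair(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    private InMemoryMapReduce() { }

    /**
     * Runs map, shuffle and reduce in the calling thread.
     * mapper:  one input record -> zero or more intermediate records.
     * reducer: (key, all values for that key) -> zero or more output records.
     */
    static List<Pair> run(List<Pair> inputs,
                          Function<Pair, List<Pair>> mapper,
                          BiFunction<String, List<String>, List<Pair>> reducer) {
        // Shuffle: group intermediate values by key; TreeMap keeps the keys sorted.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Pair input : inputs) {
            for (Pair mapped : mapper.apply(input)) {
                grouped.computeIfAbsent(mapped.key, k -> new ArrayList<>())
                       .add(mapped.value);
            }
        }
        // Reduce: one call per key, in sorted key order.
        List<Pair> output = new ArrayList<>();
        for (Map.Entry<String, List<String>> group : grouped.entrySet()) {
            output.addAll(reducer.apply(group.getKey(), group.getValue()));
        }
        return output;
    }

    /** Example job: counts how often each word occurs across all input values. */
    static List<Pair> wordCount(List<Pair> inputs) {
        return run(inputs,
            input -> {
                List<Pair> words = new ArrayList<>();
                for (String word : input.value.split("\\s+")) {
                    if (!word.isEmpty()) {
                        words.add(new Pair(word, "1"));
                    }
                }
                return words;
            },
            (word, ones) -> Collections.singletonList(
                new Pair(word, String.valueOf(ones.size()))));
    }
}
```

Everything happens in one thread and one JVM, which is why such tests are fast and need no input or output files. It is also why they say nothing about partitioning, serialization between nodes, or cluster configuration.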
In Summary, MRUnit...

• Makes testing your Hadoop jobs easier
• Abstracts away a lot of the boilerplate test
  setup you need
• Has its problems
 • but they are outweighed by the benefits
cloudera.com/hadoop-mrunit


Blog: eriwen.com
Twitter: @eriwen
Email: eric.wendelin@returnpath.net
© 2010 Eric Wendelin
Real-world examples and struggles with MRUnit testing Hadoop MapReduce jobs.

Published in: Technology

Testing Hadoop jobs with MRUnit

  1. Testing Hadoop jobs with MRUnit
     © 2010 Eric Wendelin
  2. Eric Wendelin
     Hadooper at Return Path
     Blog: eriwen.com
     Twitter: @eriwen
  3. What is MRUnit?
     • Testing library for MapReduce
     • Developed by Cloudera
     • Easy integration between MapReduce and standard testing tools (e.g. JUnit)
  4. Why do I need that?
  5. Testing without MRUnit
     • Write tests that create JobConf or Configuration objects
       • conf.set('mapred.job.tracker', 'local')
     • Developing new test input files stored alongside MapReduce test code
     • Lots of work to validate output files
     • External file I/O makes tests slooooow
  6. MRUnit makes testing Hadoop jobs easier
  7. Testing with MRUnit
     • No external test input or output files
       • Programmatically specified
     • Less test harness code (but also perhaps less control)
     • Concise, fast tests
  8. class ExampleTest {
       private Example.MyMapper mapper
       private Example.MyReducer reducer
       private MapReduceDriver driver

       @Before void setUp() {
         mapper = new Example.MyMapper()
         reducer = new Example.MyReducer()
         driver = new MapReduceDriver(mapper, reducer)
       }

       @Test void testMapReduce() {
         driver.withInput(new Text('a'), new Text('b'))
         driver.withOutput(new Text('c'), new Text('d'))
         driver.runTest()
       }
     }
  9. class ExampleTest {
       private Example.MyMapper mapper
       private Example.MyReducer reducer
       private MapReduceDriver driver

       @Before void setUp() {
         mapper = new Example.MyMapper()
         reducer = new Example.MyReducer()
         driver = new MapReduceDriver(mapper, reducer)
       }

       @Test void testMapReduce() {
         // Same test as above, written as one chained call
         driver.withInput(new Text('a'), new Text('b'))
               .withOutput(new Text('c'), new Text('d'))
               .runTest()
       }
     }
  10. Test map and reduce separately
  11. class ExampleTest {
        private Example.MyMapper mapper
        private MapDriver mDriver

        @Before void setUp() {
          mapper = new Example.MyMapper()
          mDriver = new MapDriver(mapper)
        }

        @Test void testMap() {
          mDriver.withInput(new Text('a'), new Text('b'))
          mDriver.withOutput(new Text('c'), new Text('d'))
          mDriver.runTest()
        }
      }
  12. class ExampleTest {
        private Example.MyReducer reducer
        private ReduceDriver rDriver

        @Before void setUp() {
          reducer = new Example.MyReducer()
          rDriver = new ReduceDriver(reducer)
        }

        @Test void testReduce() {
          // A reducer receives one key and the list of values for that key
          rDriver.withInput(new Text('a'), [new Text('foo'), new Text('bar')])
          rDriver.withOutput(new Text('c'), new Text('d'))
          rDriver.runTest()
        }
      }
  13. Counters!
  14. driver.withInput(...)
      driver.run()

      def counters = driver.getCounters()

      assertEquals(1, counters.findCounter('foo', 'bar').getValue())
  15. Verifying logging
  16. def messages = []
      def appender = [
          append: { messages.add(it) },
          requiresLayout: { false }
        ] as AppenderSkeleton
      Logger.getRootLogger().addAppender(appender)
      driver.runTest()
      assertTrue messages.find { it.getLevel().toString() == 'WARN' &&
          it.getMessage().contains('My err') }
      Logger.getRootLogger().removeAppender(appender)
  17. Cool stuff I haven't tried...
      • The PipelineMapReduceDriver allows testing a series of MapReduce passes
        • Just call addMapReduce(mapper, reducer)
      • Mock objects: MockReporter, MockInputSplit, and MockOutputCollector
      • Test combiners with myMapReduceDriver.setCombiner(myCombiner)
  18. Problems with MRUnit
  19. runTest() does not give meaningful information on failure
  20. Better to use run() and then assert
  21. driver.setInput(new Text('foo'), new Text('bar'))

      def output = driver.run()

      // The output pair holds Text objects, so compare their string forms
      assertEquals 'baz', output[0].first.toString()
      assertEquals 'jy', output[0].second.toString()
  22. Documentation is severely lacking
  23. runXxx() calls setup() for the new Hadoop API, but not for the old API
  24. Tests are not executed in a distributed way
  25. In Summary, MRUnit...
      • Makes testing your Hadoop jobs easier
      • Abstracts away a lot of the boilerplate test setup you need
      • Has its problems
        • but they are outweighed by the benefits
  26. cloudera.com/hadoop-mrunit
      Blog: eriwen.com
      Twitter: @eriwen
      Email: eric.wendelin@returnpath.net
      © 2010 Eric Wendelin