Testing Hadoop jobs with MRUnit

Boulder/Denver Hadoop Users Group
05.12.2010

© 2010 Eric Wendelin
Eric Wendelin
Hadooper @returnpath
Blog: eriwen.com
Twitter: @eriwen
What is MRUnit?

• Testing library for MapReduce
• Developed by Cloudera
• Easy integration between MapReduce
  and standard testing tools (e.g. JUnit)

  cloudera.com/hadoop-mrunit
Why do I need that?
Testing without MRUnit
• Write tests that create JobConf or
  Configuration objects
  • conf.set('mapred.job.tracker', 'local')
• Developing new test input files stored
  alongside MapReduce test code
• Lots of work to validate output files
  • External file I/O makes tests slooooow
MRUnit makes testing Hadoop jobs easier
Testing with MRUnit

• No external test input or output files
  • Inputs and expected outputs are specified programmatically
• Less test harness code (but also perhaps
  less control)
• Concise, fast tests
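
To make "specified programmatically" concrete, here is a rough plain-Java sketch of the idea behind an in-memory driver: run the mapper over in-memory pairs, group the intermediate output by key, then run the reducer, with no files involved. This is not MRUnit's implementation or API. The names InMemoryMapReduce, Mapper, Reducer, run and wordCount are hypothetical, and plain strings stand in for Hadoop's Writable types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;

// Conceptual sketch only: NOT MRUnit's code. All names here are hypothetical.
class InMemoryMapReduce {

    /** Stand-in for a Hadoop mapper: emits zero or more pairs per input pair. */
    interface Mapper {
        void map(String key, String value, BiConsumer<String, String> out);
    }

    /** Stand-in for a Hadoop reducer: sees all values collected for one key. */
    interface Reducer {
        void reduce(String key, List<String> values, BiConsumer<String, String> out);
    }

    /** Runs map, then a sorted group-by-key "shuffle", then reduce, all in memory. */
    static List<Map.Entry<String, String>> run(List<Map.Entry<String, String>> input,
                                               Mapper mapper, Reducer reducer) {
        // Map phase: collect intermediate values per key, sorted by key.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> in : input) {
            mapper.map(in.getKey(), in.getValue(),
                    (k, v) -> grouped.computeIfAbsent(k, x -> new ArrayList<>()).add(v));
        }
        // Reduce phase: one call per key, in key order.
        List<Map.Entry<String, String>> output = new ArrayList<>();
        for (Map.Entry<String, List<String>> group : grouped.entrySet()) {
            reducer.reduce(group.getKey(), group.getValue(),
                    (k, v) -> output.add(Map.entry(k, v)));
        }
        return output;
    }

    /** Example job: count the words in each input line. */
    static List<Map.Entry<String, String>> wordCount(List<Map.Entry<String, String>> input) {
        Mapper mapper = (key, line, out) -> {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    out.accept(word, "1");
                }
            }
        };
        Reducer reducer = (word, counts, out) -> {
            int sum = 0;
            for (String count : counts) {
                sum += Integer.parseInt(count);
            }
            out.accept(word, Integer.toString(sum));
        };
        return run(input, mapper, reducer);
    }

    public static void main(String[] args) {
        // Input is built in code; nothing is read from or written to disk.
        System.out.println(wordCount(List.of(Map.entry("0", "b a b"))));  // prints [a=1, b=2]
    }
}
```

A test then only needs to compare the returned list with the expected pairs, which is roughly what withInput/withOutput/runTest do for you in MRUnit.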
Example
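import org.apache.hadoop.io.Text
import org.apache.hadoop.mrunit.MapReduceDriver  // old API; the new API is in org.apache.hadoop.mrunit.mapreduce
import org.junit.Before
import org.junit.Test

class ExampleTest {
  private Example.MyMapper mapper
  private Example.MyReducer reducer
  private MapReduceDriver driver

  @Before void setUp() {
    mapper = new Example.MyMapper()
    reducer = new Example.MyReducer()
    driver = new MapReduceDriver(mapper, reducer)
  }

  @Test void testMapReduce() {
    // One input pair in, one expected output pair out
    driver.withInput(new Text('a'), new Text('b'))
    driver.withOutput(new Text('c'), new Text('d'))
    driver.runTest()
  }
}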
Example
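import org.apache.hadoop.io.Text
import org.apache.hadoop.mrunit.MapReduceDriver  // old API; the new API is in org.apache.hadoop.mrunit.mapreduce
import org.junit.Before
import org.junit.Test

class ExampleTest {
  private Example.MyMapper mapper
  private Example.MyReducer reducer
  private MapReduceDriver driver

  @Before void setUp() {
    mapper = new Example.MyMapper()
    reducer = new Example.MyReducer()
    driver = new MapReduceDriver(mapper, reducer)
  }

  @Test void testMapReduce() {
    // Same test, written with chained (fluent) calls
    driver.withInput(new Text('a'), new Text('b'))
        .withOutput(new Text('c'), new Text('d'))
        .runTest()
  }
}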
Test map and reduce separately
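import org.apache.hadoop.io.Text
import org.apache.hadoop.mrunit.MapDriver  // old API; the new API is in org.apache.hadoop.mrunit.mapreduce
import org.junit.Before
import org.junit.Test

class ExampleTest {
  private Example.MyMapper mapper
  private MapDriver driver

  @Before void setUp() {
    mapper = new Example.MyMapper()
    driver = new MapDriver(mapper)
  }

  @Test void testMap() {
    // Only the mapper runs: one input pair, one expected output pair
    driver.withInput(new Text('a'), new Text('b'))
    driver.withOutput(new Text('c'), new Text('d'))
    driver.runTest()
  }
}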
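import org.apache.hadoop.io.Text
import org.apache.hadoop.mrunit.ReduceDriver  // old API; the new API is in org.apache.hadoop.mrunit.mapreduce
import org.junit.Before
import org.junit.Test

class ExampleTest {
  private Example.MyReducer reducer
  private ReduceDriver driver

  @Before void setUp() {
    reducer = new Example.MyReducer()
    driver = new ReduceDriver(reducer)
  }

  @Test void testReduce() {
    // The reducer receives one key plus the list of values grouped under it
    driver.withInput(new Text('a'),
        [new Text('foo'), new Text('bar')])
    driver.withOutput(new Text('c'), new Text('d'))
    driver.runTest()
  }
}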
Counters!
driver.withInput(...)
driver.run()

def counters = driver.getCounters()

// getValue() returns a long, so compare against a long literal
assertEquals(1L, counters.findCounter('foo', 'bar').getValue())
Verifying logging
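import org.apache.log4j.AppenderSkeleton
import org.apache.log4j.Logger

// Collect every log event in memory using a map coerced to an appender
def messages = []
def appender = [
    append: { messages.add(it) },
    close: { },
    requiresLayout: { false }
  ] as AppenderSkeleton
Logger.getRootLogger().addAppender(appender)

driver.runTest()

// any() returns a boolean, which is what assertTrue expects
assertTrue messages.any {
    it.getLevel().toString() == 'WARN' &&
    it.getRenderedMessage().contains('My err') }

Logger.getRootLogger().removeAppender(appender)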
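
For comparison, the same capture-and-assert pattern can be sketched in plain Java. Note the swap: this version uses java.util.logging instead of log4j, purely so it runs without extra dependencies. LogCaptureSketch, CapturingHandler, codeUnderTest and warnsWith are hypothetical names, and codeUnderTest stands in for whatever mapper or reducer does the logging.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Sketch using java.util.logging (NOT log4j, which the slide uses).
class LogCaptureSketch {

    /** Keeps every published record in memory so a test can inspect it. */
    static class CapturingHandler extends Handler {
        final List<LogRecord> records = new ArrayList<>();

        @Override public void publish(LogRecord record) { records.add(record); }
        @Override public void flush() { }
        @Override public void close() { }
    }

    /** Hypothetical code under test: warns as a mapper might on bad input. */
    static void codeUnderTest(Logger log) {
        log.warning("My err: bad record");
    }

    /** True if the code under test logged a WARNING containing the fragment. */
    static boolean warnsWith(String fragment) {
        Logger log = Logger.getLogger("LogCaptureSketch");
        CapturingHandler handler = new CapturingHandler();
        log.addHandler(handler);
        try {
            codeUnderTest(log);
            return handler.records.stream().anyMatch(r ->
                    r.getLevel().equals(Level.WARNING)
                            && r.getMessage().contains(fragment));
        } finally {
            // Always detach the handler so later tests start clean.
            log.removeHandler(handler);
        }
    }

    public static void main(String[] args) {
        System.out.println(warnsWith("My err"));
    }
}
```

The try/finally is the one design choice worth copying back into the Groovy version: if the assertion throws, the appender is still removed.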
Cool stuff I haven’t tried...

• The PipelineMapReduceDriver allows
  testing a series of MapReduce passes
  • Just call addMapReduce(mapper, reducer)
• Mock objects: MockReporter,
  MockInputSplit, and MockOutputCollector
• Test combiners with
  myMapReduceDriver.setCombiner(myCombiner)
Problems with MRUnit
Not useful for streaming jobs
shell$ ./myMapper.py < test.input |
sort | ./myReducer.py > actual.out

shell$ diff expected.out actual.out
runTest() does not give meaningful information on failure
Better to use run() and then assert
driver.setInput(new Text('foo'),
    new Text('bar'))

def output = driver.run()

// Each output element is a pair of Text objects; compare their string forms
assertEquals 'baz', output[0].first.toString()
assertEquals 'jy', output[0].second.toString()
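
One way to get more useful failures out of the run-then-assert style is a small comparison helper that reports which pair differed. The sketch below is plain Java and not part of MRUnit: PairAssertions, firstDifference and assertPairs are hypothetical names, and the actual list stands in for whatever the driver's run() returned, converted to string pairs.

```java
import java.util.List;
import java.util.Map;

// Illustrative helper only; not part of MRUnit.
class PairAssertions {

    /** Returns null when both lists match, else a description of the first difference. */
    static String firstDifference(List<Map.Entry<String, String>> expected,
                                  List<Map.Entry<String, String>> actual) {
        int shared = Math.min(expected.size(), actual.size());
        for (int i = 0; i < shared; i++) {
            if (!expected.get(i).equals(actual.get(i))) {
                return "pair " + i + ": expected " + expected.get(i)
                        + " but was " + actual.get(i);
            }
        }
        if (expected.size() != actual.size()) {
            return "expected " + expected.size() + " pairs but got " + actual.size();
        }
        return null;
    }

    /** Fails with the descriptive message rather than a bare pass/fail. */
    static void assertPairs(List<Map.Entry<String, String>> expected,
                            List<Map.Entry<String, String>> actual) {
        String difference = firstDifference(expected, actual);
        if (difference != null) {
            throw new AssertionError(difference);
        }
    }

    public static void main(String[] args) {
        // 'actual' stands in for the output of a driver's run().
        List<Map.Entry<String, String>> actual = List.of(Map.entry("baz", "jy"));
        assertPairs(List.of(Map.entry("baz", "jy")), actual);  // passes silently
        // prints: pair 0: expected baz=xx but was baz=jy
        System.out.println(firstDifference(List.of(Map.entry("baz", "xx")), actual));
    }
}
```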
Documentation is severely lacking
runXxx() calls setup() for the new Hadoop API, but not the old API
Tests are not executed in a distributed way
In Summary, MRUnit...

• Makes testing your Hadoop jobs easier
• Abstracts away a lot of the boilerplate test
  setup you need
• Has its problems
  • but they are outweighed by the benefits
?
cloudera.com/hadoop-mrunit


Blog: eriwen.com
Twitter: @eriwen
Email: eric.wendelin@returnpath.net
