Unit Testing MapReduce Jobs in Hadoop


Speaker Details :

Anirudh Bhatnagar
     Senior Consultant-Xebia India
     abhatnagar@xebia.com
Sanchit Agarwal
     Senior Consultant-Xebia India
     sagarwal@xebia.com
Agenda
●   Hadoop Introduction
●   What is Map Reduce [Sample Code]
●   Map-Reduce Testing using Mockito [Sample Code]
●   Shortcomings with Mockito
●   MRUnit Test Harness [Sample Code]
●   Advantages of MRUnit
●   What Lies Ahead
What is Hadoop?
Why Hadoop?
How Does Hadoop Work?
What is Map Reduce
Map Reduce Execution
Sample Map Reduce Code
 ●   All examples and setup are for a single-node
     cluster

- map(LongWritable key, Text value, Context
context) {Mapper Class}

- reduce(Text key, Iterable<IntWritable>
values, Context context) {Reducer Class}
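
As a rough illustration of where these two signatures live, here is a minimal sketch against the new Hadoop API (org.apache.hadoop.mapreduce). The class names TokenCount, TokenMapper and SumReducer are hypothetical, and the assumption that each input line holds whitespace-separated tokens is ours, not the speakers'.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical class names; only the map/reduce signatures come from the slide.
public class TokenCount {

    /** Emits (token, 1) for every whitespace-separated token in a line. */
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // key is the byte offset of the line; value is the line itself.
            for (String token : value.toString().trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    context.write(new Text(token), ONE);
                }
            }
        }
    }

    /** Sums the 1s that the framework grouped under each token. */
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```

The overrides are declared public here (the base methods are protected) only so that a test in another class can call them directly.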
Problem Statement
Find the top-trending tag among all the tags
       that appear in the different user logs
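
One way to picture the problem before bringing in Hadoop is a plain-Java sketch of the same three phases: map each log line to its tags, group by tag (the shuffle), then reduce each group to a count and keep the largest. The log format (whitespace-separated tags per line) and all names below are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TopTagSketch {

    /** "Map" step: split one log line into its tags. */
    public static List<String> mapLine(String logLine) {
        List<String> tags = new ArrayList<>();
        for (String t : logLine.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tags.add(t);
            }
        }
        return tags;
    }

    /** "Reduce" step: sum the 1s grouped under a single tag. */
    public static int reduce(List<Integer> ones) {
        int sum = 0;
        for (int one : ones) {
            sum += one;
        }
        return sum;
    }

    /** Runs map, groups by tag, reduces, and returns the most frequent tag. */
    public static String topTag(List<String> logLines) {
        // TreeMap keeps tags sorted, much as reducer keys arrive sorted.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : logLines) {
            for (String tag : mapLine(line)) {
                grouped.computeIfAbsent(tag, k -> new ArrayList<>()).add(1);
            }
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int count = reduce(e.getValue());
            if (count > bestCount) { // ties keep the alphabetically first tag
                best = e.getKey();
                bestCount = count;
            }
        }
        return best; // null when the logs contain no tags
    }

    public static void main(String[] args) {
        // hadoop appears 3 times, java 2, pig 1
        System.out.println(topTag(Arrays.asList("hadoop java", "hadoop pig", "java hadoop")));
    }
}
```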
Sample Code: Unit Testing with Mockito
●   No MRUnit code used
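
The speakers' sample code is not part of this text. As a stand-in, a Mockito-only mapper test might look roughly like the sketch below: the Mapper.Context is mocked and the writes are verified. It assumes JUnit 4, Mockito and the new Hadoop API on the classpath; TokenMapper is a hypothetical mapper, not the one from the talk.

```java
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.times;
import static org.mockito.Mockito.verify;
import static org.mockito.Mockito.verifyNoMoreInteractions;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.junit.Test;

public class TokenMapperMockitoTest {

    /** Hypothetical mapper under test: emits (token, 1) per token. */
    static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    // Fresh objects per write: Mockito records argument
                    // references, so a reused, mutated Text would make
                    // every recorded call look like the last one.
                    context.write(new Text(token), new IntWritable(1));
                }
            }
        }
    }

    @Test
    @SuppressWarnings("unchecked")
    public void mapEmitsOnePairPerToken() throws Exception {
        // Mock the framework-provided context instead of running a job.
        Mapper<LongWritable, Text, Text, IntWritable>.Context context =
                mock(Mapper.Context.class);

        new TokenMapper().map(new LongWritable(0), new Text("hadoop java hadoop"), context);

        verify(context, times(2)).write(new Text("hadoop"), new IntWritable(1));
        verify(context, times(1)).write(new Text("java"), new IntWritable(1));
        verifyNoMoreInteractions(context);
    }
}
```

Note that the test has to know how the mapper talks to its context, which is one reason this style can feel indirect for MapReduce code.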
Shortcomings with Mockito
●   Not very intuitive for the MapReduce style of
    programming
●   MapReduce semantics differ in subtle ways
    from what a Mockito-based test models
●   Works just as well in some scenarios, but can
    fail to cover more complex ones
MRUnit Test Harness
●   Very intuitive for the MapReduce style of programming
●   MRUnit helps bridge the gap between MapReduce programs
    and JUnit by providing a set of interfaces and test harnesses,
    which allow MapReduce programs to be more easily tested
    using standard tools and practices.
●   Provides 4 drivers for separately testing MapReduce code
    –   MapDriver
    –   ReduceDriver
    –   MapReduceDriver
    –   PipelineMapReduceDriver
Sample Code with MRUnit
●   Used in combination with JUnit to get better
    control over log messages
●   Integrates easily with JUnit
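
The talk's own sample is not reproduced in this text. The sketch below shows roughly how three of the drivers are used from JUnit 4, assuming MRUnit's new-API drivers (package org.apache.hadoop.mrunit.mapreduce) are on the classpath. TokenMapper and SumReducer are hypothetical classes written for this example.

```java
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mapreduce.MapReduceDriver;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

public class TokenCountMRUnitTest {

    /** Hypothetical mapper: emits (token, 1) per whitespace-separated token. */
    static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    context.write(new Text(token), new IntWritable(1));
                }
            }
        }
    }

    /** Hypothetical reducer: sums the counts for one token. */
    static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    @Test
    public void mapperEmitsOnePairPerToken() throws Exception {
        // Expected outputs are listed in emission order.
        MapDriver.newMapDriver(new TokenMapper())
                .withInput(new LongWritable(0), new Text("hadoop java"))
                .withOutput(new Text("hadoop"), new IntWritable(1))
                .withOutput(new Text("java"), new IntWritable(1))
                .runTest();
    }

    @Test
    public void reducerSumsCounts() throws Exception {
        ReduceDriver.newReduceDriver(new SumReducer())
                .withInput(new Text("hadoop"),
                        Arrays.asList(new IntWritable(1), new IntWritable(1)))
                .withOutput(new Text("hadoop"), new IntWritable(2))
                .runTest();
    }

    @Test
    public void mapperAndReducerTogether() throws Exception {
        // The driver shuffles and sorts by key between the two phases.
        MapReduceDriver.newMapReduceDriver(new TokenMapper(), new SumReducer())
                .withInput(new LongWritable(0), new Text("hadoop java hadoop"))
                .withOutput(new Text("hadoop"), new IntWritable(2))
                .withOutput(new Text("java"), new IntWritable(1))
                .runTest();
    }
}
```

Each test reads as input, expected output, run, which is close to how a MapReduce step is usually described.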
Gotchas With MRUnit
●   MapDriver.withInput supports only one input
    pair; repeated calls replace the input, so only
    the last one is used
●   Handle runTest() and run() with care: runTest()
    runs the test, checks the expected output, and
    returns void, while run() executes the code and
    returns the list of output pairs
●   PipelineMapReduceDriver supports only the old
    Hadoop API
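
To make the run() versus runTest() point concrete, here is a small sketch that calls run() and asserts on the returned pairs by hand. It assumes the new-API MapDriver and MRUnit's Pair type (org.apache.hadoop.mrunit.types.Pair); TokenMapper is again a hypothetical mapper written for this example.

```java
import static org.junit.Assert.assertEquals;

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.types.Pair;
import org.junit.Test;

public class RunVersusRunTestExample {

    /** Hypothetical mapper: emits (token, 1) per whitespace-separated token. */
    static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    context.write(new Text(token), new IntWritable(1));
                }
            }
        }
    }

    @Test
    public void runReturnsTheOutputPairs() throws Exception {
        MapDriver<LongWritable, Text, Text, IntWritable> driver =
                MapDriver.newMapDriver(new TokenMapper());
        driver.withInput(new LongWritable(0), new Text("hadoop java"));

        // run() performs no comparison itself; the assertions are ours.
        List<Pair<Text, IntWritable>> output = driver.run();

        assertEquals(2, output.size());
        assertEquals(new Text("hadoop"), output.get(0).getFirst());
        assertEquals(new IntWritable(1), output.get(0).getSecond());
        assertEquals(new Text("java"), output.get(1).getFirst());
    }
}
```

With runTest(), the same check would be expressed through withOutput(...) calls and the driver would do the comparison.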
What Lies Ahead
●   MiniMRCluster and MiniDFSCluster classes
    offer full-blown in-memory MapReduce and
    HDFS clusters, and can launch multiple
    MapReduce and HDFS nodes
●   Best Practices and Debugging techniques for
    Map-Reduce
Questions?
Bibliography
●   Books
    –   Hadoop in Practice
    –   Hadoop: The Definitive Guide
    –   Hadoop in Action
●   Links
    –   http://hadoop.apache.org/
●   Blogs
    –   http://codingjunkie.net/testing-hadoop-programs-with-mrunit/
    –   http://java.dzone.com/articles/effective-testing-strategies
    –   https://github.com/alexholmes/blog/blob/master/_posts/2012-10-20-hadoop-unit-testing-with-minimrcluster.markdown

Agile NCR 2013- Anirudh Bhatnagar - Hadoop unit testing agile ncr
