An Introduction to
MapReduce with MongoDB
        Russell Smith
/usr/bin/whoami

•   Russell Smith

•   Consultant for UKD1 Limited

•   I specialise in helping companies going through rapid growth:

•   Code, architecture, infrastructure, devops, sysops, capacity planning, etc

•   <3 Gearman, MongoDB, Neo4j, MySQL, Riak, Kohana, PHP, Debian, Puppet, etc...
What is MongoDB

•   A scalable, high-performance, open source, document-oriented
    database.

•   Stores JSON-like documents

•   Indexable on any attribute (like MySQL)

•   Built-in MapReduce
Requirements

•   A running MongoDB server
    http://www.mongodb.org/downloads


•   Basic knowledge of MongoDB

•   Basic JavaScript
What is MapReduce?

•   Allows aggregating data in parallel

•   Some built-in aggregation functions exist:
    distinct, count

•   If you need to do something more, either query or use MapReduce
How does it work?
•   You write two functions

•   You write them in JavaScript (currently)

•   Map function:
    Called once per document; emits a key and a value

•   Reduce function:
    Called per emitted key with an array of that key's values. It may be
    called more than once for the same key, so its result must be usable
    as an input value

•   Optional finalize function for tidying up the reduced data (e.g. rounding)
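
The flow above can be imitated outside the database in a few lines of plain JavaScript. This is only a sketch: runMapReduce, the sample documents and the state field are made up for illustration. In MongoDB emit is a global inside map, whereas here it is passed in as an argument, and the real server may call reduce several times per key.

```javascript
// Hypothetical in-memory imitation of the map -> group -> reduce flow.
// runMapReduce is NOT a MongoDB function; it only mimics the idea.
function runMapReduce(docs, map, reduce) {
    var groups = {};

    // emit(key, value) collects the value under its key.
    function emit(key, value) {
        if (!groups.hasOwnProperty(key)) {
            groups[key] = [];
        }
        groups[key].push(value);
    }

    // Map phase: call map once per document, with the document as `this`.
    docs.forEach(function (doc) {
        map.call(doc, emit);
    });

    // Reduce phase: call reduce once per key with that key's values.
    var results = {};
    Object.keys(groups).forEach(function (key) {
        results[key] = reduce(key, groups[key]);
    });
    return results;
}

// Made-up sample documents.
var sampleDocs = [{ state: 'TX' }, { state: 'NY' }, { state: 'TX' }];

var counts = runMapReduce(
    sampleDocs,
    function (emit) { emit(this.state, 1); },
    function (key, values) {
        var total = 0;
        for (var i = 0; i < values.length; i++) {
            total += values[i];
        }
        return total;
    }
);

console.log(counts); // { TX: 2, NY: 1 }
```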
Some example data

•   I downloaded the H-1B (US temporary work visa) data
    http://www.flcdatacenter.com/CaseH1B.aspx


•   Imported the CSV data using the mongoimport command

•   Total imported documents ~335k
What do the documents look like?

•   Fields of interest:

•   LCA_CASE_EMPLOYER_STATE

•   STATUS

•   LCA_CASE_SUBMIT / Decision_Date

•   LCA_CASE_WAGE_RATE_FROM

    {
        "_id" : ObjectId("4db7c981e243a6e23725570f"),
        "LCA_CASE_NUMBER" : "I-200-09132-243675",
        "STATUS" : "CERTIFIED",
        "LCA_CASE_SUBMIT" : "7/14/2010 9:06:36",
        "VISA_CLASS" : "H-1B",
        "LCA_CASE_EMPLOYMENT_START_DATE" : "12/15/2010 0:00:00",
        "LCA_CASE_EMPLOYMENT_END_DATE" : "12/15/2013 0:00:00",
        "LCA_CASE_EMPLOYER_NAME" : "BRITISH SCHOOL OF AMERICA, LLC",
        "LCA_CASE_EMPLOYER_ADDRESS" : "4211 WATONGA BLVD.",
        "LCA_CASE_EMPLOYER_CITY" : "HOUSTON",
        "LCA_CASE_EMPLOYER_STATE" : "TX",
        "LCA_CASE_EMPLOYER_POSTAL_CODE" : 77092,
        "LCA_CASE_SOC_CODE" : "25-2022.00",
        "LCA_CASE_SOC_NAME" : "Middle School Teachers, Except Special and Vocatio",
        "LCA_CASE_JOB_TITLE" : "MIDDLE SCHOOL TEACHER/IB COORDINATOR",
        "LCA_CASE_WAGE_RATE_FROM" : 51577.63,
        "LCA_CASE_WAGE_RATE_UNIT" : "Year",
        "FULL_TIME_POS" : "Y",
        "TOTAL_WORKERS" : 1,
        "LCA_CASE_WORKLOC1_CITY" : "HOUSTON",
        "LCA_CASE_WORKLOC1_STATE" : "TX",
        "PW_1" : 47827,
        "PW_UNIT_1" : "Year",
        "PW_SOURCE_1" : "OES",
        "OTHER_WAGE_SOURCE_1" : "OFLC ONLINE DATA CENTER",
        "YR_SOURCE_PUB_1" : 2010,
        "LCA_CASE_NAICS_CODE" : 611110,
        "Decision_Date" : "7/20/2010 0:00:00r"
    }
What can we do with the data?

•   Work out:

•   Applications per state

•   Applications by status per state

•   Average time from submission to decision, by status
Applications by State


•   Key will be LCA_CASE_EMPLOYER_STATE

•   Assume (wrongly) one person per document
Map

•   this is equal to the current document

•   emit a value of 1, as we are assuming a single H1B app per document

    m = function () {
        emit(this.LCA_CASE_EMPLOYER_STATE, 1);
    }
Reduce

•   Return a value: the length of the array

•   This works only while every value in the array is 1; MongoDB can pass
    earlier reduce results back in, which would break the count

    r = function (k, v_arr) {
        return v_arr.length;
    }
Executing

•   This will execute the map/reduce

•   Output goes to a collection named workers_by_state

    db.text2010.mapReduce(m, r,
        {out: 'workers_by_state', keeptemp: true, verbose: true})
Result

    { "_id" : "NEW YORK", "value" : 512 }
    { "_id" : "IOWA", "value" : 15 }
    { "_id" : "KANSAS", "value" : 54 }
    ...
A more complex Map!

•   The last example assumed one worker per document...which is wrong.

•   We now emit each document's TOTAL_WORKERS, keyed by state

    m = function () {
        emit(this.LCA_CASE_EMPLOYER_STATE, this.TOTAL_WORKERS);
    }
Reduce

•   As the array now contains values other than 1, we have to iterate over it

•   This is standard JavaScript

    r = function (k, v_arr) {
        var total = 0;
        var len = v_arr.length;

        // Sum every emitted worker count for this key
        for (var i = 0; i < len; i++)
        {
            total = total + v_arr[i];
        }
        return total;
    }
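
MongoDB may run reduce on part of a key's values and later feed that partial result back into reduce together with the remaining values. A sum survives this, while the length-based reduce from the first example does not. The plain-JavaScript check below illustrates the difference with made-up numbers; sumReduce and lengthReduce are names chosen here, and no database is involved.

```javascript
// Summing reduce, same logic as the slide above.
function sumReduce(key, values) {
    var total = 0;
    for (var i = 0; i < values.length; i++) {
        total += values[i];
    }
    return total;
}

// Length-based reduce from the first example.
function lengthReduce(key, values) {
    return values.length;
}

// Made-up TOTAL_WORKERS values emitted for one state.
var emitted = [1, 3, 2, 1];

// Route 1: reduce everything in one call.
var sumOnce = sumReduce('TX', emitted);

// Route 2: reduce the first half, then re-reduce with the rest.
var sumPartial = sumReduce('TX', emitted.slice(0, 2));
var sumTwice = sumReduce('TX', [sumPartial].concat(emitted.slice(2)));

console.log(sumOnce, sumTwice); // 7 7

// The length-based version disagrees between the two routes.
var ones = [1, 1, 1, 1];
var lenOnce = lengthReduce('TX', ones);
var lenPartial = lengthReduce('TX', ones.slice(0, 2));
var lenTwice = lengthReduce('TX', [lenPartial].concat(ones.slice(2)));

console.log(lenOnce, lenTwice); // 4 3
```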
Average wage by VISA class and application status

•   Assumptions:

•   People work ~40 hour weeks

•   Weekly wages are paid every week rather than only the weeks worked

•   'Select Pay Range' seems to be the default option...

    m = function () {
        // Key: visa class plus application status
        var k = this.VISA_CLASS + ' ' + this.STATUS;

        // Normalise every wage to a yearly figure
        switch (this.LCA_CASE_WAGE_RATE_UNIT)
        {
            case 'Year':
                emit(k, this.LCA_CASE_WAGE_RATE_FROM);
                break;

            case 'Month':
                emit(k, this.LCA_CASE_WAGE_RATE_FROM * 12);
                break;

            case 'Bi-Weekly':
                emit(k, this.LCA_CASE_WAGE_RATE_FROM * 26);
                break;

            case 'Week':
                emit(k, this.LCA_CASE_WAGE_RATE_FROM * 52);
                break;

            case 'Hour':
                emit(k, this.LCA_CASE_WAGE_RATE_FROM * 40 * 52);
                break;

            // Anything else (e.g. 'Select Pay Range') counts as 0
            default:
                emit(k, 0);
        }
    }
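Reduce

•   Work out the average for each key

•   Add each of the elements up

•   Average them

•   Caveat: this is only correct if reduce runs once per key; an average of
    partial averages is not the overall average

    r = function (k, v_arr) {
        var tot = 0;
        var len = v_arr.length;

        for (var i = 0; i < len; i++)
        {
            tot += v_arr[i];
        }

        return tot / len;
    }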
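
Because MongoDB may call reduce more than once for the same key and pass earlier results back in, averaging inside reduce can drift from the true average. A common workaround, sketched here rather than taken from the slides, is to have map emit {total, count} objects, add both fields up in reduce, and divide only at the end. The names wageReduce, total and count and the wage figures are illustrative assumptions.

```javascript
// Reduce that keeps a running total and count instead of an average.
// Its output has the same shape as its inputs, so it can be re-reduced.
function wageReduce(key, values) {
    var result = { total: 0, count: 0 };
    for (var i = 0; i < values.length; i++) {
        result.total += values[i].total;
        result.count += values[i].count;
    }
    return result;
}

// Made-up yearly wages for one key; map would emit {total: wage, count: 1}.
var wages = [40000, 50000, 90000];
var emittedWages = wages.map(function (w) {
    return { total: w, count: 1 };
});

// One pass over everything.
var direct = wageReduce('H-1B CERTIFIED', emittedWages);

// Two passes: reduce the first two values, then re-reduce with the third.
var partial = wageReduce('H-1B CERTIFIED', emittedWages.slice(0, 2));
var rereduced = wageReduce('H-1B CERTIFIED', [partial, emittedWages[2]]);

console.log(direct.total / direct.count);       // 60000
console.log(rereduced.total / rereduced.count); // 60000

// For comparison, averaging the partial average with the last wage gives
// (45000 + 90000) / 2 = 67500, which is not the true average.
```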
Finalize

•   A finalize function may be run after reduction.

•   Called once per output object, i.e. once per key

•   The finalize function takes a key and a value, and returns a finalized
    value.
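
As a minimal sketch of such a function, the finalize below assumes that each reduced value is a {total, count} object (an assumption for illustration, not the shape produced by the reduce functions on the slides) and returns the average rounded to two decimal places. In the mongo shell it would be passed via the finalize option of mapReduce.

```javascript
// Hypothetical finalize: assumes each reduced value is {total, count}.
function finalizeAverage(key, value) {
    // Avoid dividing by zero for empty groups.
    if (value.count === 0) {
        return 0;
    }
    // Round the average to two decimal places.
    return Math.round((value.total / value.count) * 100) / 100;
}

var avg = finalizeAverage('H-1B CERTIFIED', { total: 100000, count: 3 });
console.log(avg); // 33333.33
```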
Options

•   Persist the output

•   Filtering input documents

•   Sorting input documents

•   JavaScript scope - allows you to pass in extra variables (cannot be
    changed at runtime?)
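
These options are passed as the third argument to mapReduce. The object below is a sketch of how they might be combined for the H1B collection; the output collection name, the filter and the scope variable are made up for illustration.

```javascript
// Options object for mapReduce; the values are illustrative only.
var options = {
    // Persist the output in this collection
    out: 'certified_by_state',
    // Filter: only feed matching documents to map
    query: { STATUS: 'CERTIFIED' },
    // Sort the input documents before they are mapped
    sort: { LCA_CASE_EMPLOYER_STATE: 1 },
    // Extra variables visible inside map, reduce and finalize
    scope: { HOURS_PER_WEEK: 40 }
};

// In the mongo shell, with m and r defined as on the earlier slides:
// db.text2010.mapReduce(m, r, options);
```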
Current limitations / Watch for

•   Single threaded per node (which sucks)
    https://jira.mongodb.org/browse/SERVER-463


•   Language is restricted to JavaScript (which sucks)
    https://jira.mongodb.org/browse/SERVER-699


•   Does not use secondaries in replica sets

•   From 1.7.3 on, you can reduce into an existing collection


•   Doesn't allow creation of full documents (which can be a pain for
    perm MR collections if using libraries)
    https://jira.mongodb.org/browse/SERVER-2517


•   Slow: roughly 20-30x slower than Hadoop as of 1.8
    https://jira.mongodb.org/browse/SERVER-3055
Using MongoDB with Hadoop

•   https://github.com/mongodb/mongo-hadoop

•   Open source

•   Requires knowledge of Java

•   Working Input and Output adapters for MongoDB are provided

•   Alpha quality from what I can tell
The future
1.9 / 2.0

•   V8 is replacing SpiderMonkey

•   Recent Hadoop provider

•   Sharded output collections

•   Improved yielding (concurrency)
> 2.0

•   Multi-threaded

•   Alternative languages
    https://jira.mongodb.org/browse/SERVER-699


•   ~2.2 native aggregation framework

•   Js only mode is faster for lighter jobs
    https://jira.mongodb.org/browse/SERVER-2976
Further reading
•   I’ve only touched on the details, but this should be enough to get you
    interested in / started with MongoDB MapReduce. Some of the missing
    pieces:

•   Finalize functions - http://bit.ly/gEfKOr

•   Some more examples - http://bit.ly/ig1Yfj
