
An Introduction to Map/Reduce with MongoDB


An introduction to Map/Reduce with MongoDB by Russell Smith from UKD1 Limited

Published in: Technology, Business

  1. An Introduction to MapReduce with MongoDB, by Russell Smith
  2. /usr/bin/whoami
     • Russell Smith
     • Consultant for UKD1 Limited
     • I specialise in helping companies going through rapid growth
     • Code, architecture, infrastructure, devops, sysops, capacity planning, etc.
     • <3 Gearman, MongoDB, Neo4j, MySQL, Riak, Kohana, PHP, Debian, Puppet, etc.
  3. What is MongoDB
     • A scalable, high-performance, open source, document-oriented database
     • Stores JSON-like documents
     • Indexable on any attribute (like MySQL)
     • Built-in MapReduce
  4. Requirements
     • A running MongoDB server
     • Basic knowledge of MongoDB
     • Basic JavaScript
  5. What is MapReduce
     • Allows aggregating data in parallel
     • Some built-in aggregation functions exist: distinct, count
     • If you need something more, use either a query or MapReduce
  6. How does it work?
     • You write two functions
     • You write them in JavaScript (currently)
     • Map function: called once per document; emits a key and a value
     • Reduce function: called per key emitted, with an array of values (MongoDB may call it more than once for the same key, feeding earlier results back in)
     • Optional finalize function: post-processes the reduced data, for example rounding it
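The flow described on this slide can be imitated in plain JavaScript without a server. The sketch below is an in-memory stand-in, not MongoDB's implementation: `runMapReduce`, `mapWorkers`, `sumValues` and the sample documents are made up for illustration, and `emit` is passed to the map function as an argument, whereas MongoDB provides it as a global.

```javascript
// In-memory imitation of a map/reduce run (hypothetical helper, not MongoDB code).
function runMapReduce(docs, map, reduce, finalize) {
  const groups = new Map(); // key -> array of emitted values

  function emit(key, value) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }

  // Map phase: one call per document, with `this` bound to the document.
  docs.forEach(function (doc) {
    map.call(doc, emit);
  });

  // Reduce phase: one result per key. Like MongoDB, skip reduce when a key
  // has only a single emitted value.
  const results = {};
  groups.forEach(function (values, key) {
    let value = values.length === 1 ? values[0] : reduce(key, values);
    if (finalize) value = finalize(key, value); // optional post-processing
    results[key] = value;
  });
  return results;
}

// Made-up sample documents using two field names from the H-1B data.
const sampleDocs = [
  { LCA_CASE_EMPLOYER_STATE: "TX", TOTAL_WORKERS: 1 },
  { LCA_CASE_EMPLOYER_STATE: "TX", TOTAL_WORKERS: 2 },
  { LCA_CASE_EMPLOYER_STATE: "CA", TOTAL_WORKERS: 5 }
];

function mapWorkers(emit) {
  emit(this.LCA_CASE_EMPLOYER_STATE, this.TOTAL_WORKERS);
}

function sumValues(key, values) {
  return values.reduce(function (a, b) { return a + b; }, 0);
}

const workersByState = runMapReduce(sampleDocs, mapWorkers, sumValues);
console.log(workersByState); // { TX: 3, CA: 5 }
```

Passing `emit` explicitly keeps the sketch free of globals; the map functions on the following slides rely on MongoDB's global `emit` instead.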
  7. Some example data
     • I downloaded the H-1B data (US temporary work visa applications)
     • Imported the CSV data using the mongoimport command
     • Total imported documents: ~335k
  8. What do the documents look like?
     Fields of interest:
     • LCA_CASE_EMPLOYER_STATE
     • STATUS
     • LCA_CASE_SUBMIT / Decision_Date
     • LCA_CASE_WAGE_RATE_FROM

     {
       "_id" : ObjectId("4db7c981e243a6e23725570f"),
       "LCA_CASE_NUMBER" : "I-200-09132-243675",
       "STATUS" : "CERTIFIED",
       "LCA_CASE_SUBMIT" : "7/14/2010 9:06:36",
       "VISA_CLASS" : "H-1B",
       "LCA_CASE_EMPLOYMENT_START_DATE" : "12/15/2010 0:00:00",
       "LCA_CASE_EMPLOYMENT_END_DATE" : "12/15/2013 0:00:00",
       "LCA_CASE_EMPLOYER_NAME" : "BRITISH SCHOOL OF AMERICA, LLC",
       "LCA_CASE_EMPLOYER_ADDRESS" : "4211 WATONGA BLVD.",
       "LCA_CASE_EMPLOYER_CITY" : "HOUSTON",
       "LCA_CASE_EMPLOYER_STATE" : "TX",
       "LCA_CASE_EMPLOYER_POSTAL_CODE" : 77092,
       "LCA_CASE_SOC_CODE" : "25-2022.00",
       "LCA_CASE_SOC_NAME" : "Middle School Teachers, Except Special and Vocatio",
       "LCA_CASE_JOB_TITLE" : "MIDDLE SCHOOL TEACHER/IB COORDINATOR",
       "LCA_CASE_WAGE_RATE_FROM" : 51577.63,
       "LCA_CASE_WAGE_RATE_UNIT" : "Year",
       "FULL_TIME_POS" : "Y",
       "TOTAL_WORKERS" : 1,
       "LCA_CASE_WORKLOC1_CITY" : "HOUSTON",
       "LCA_CASE_WORKLOC1_STATE" : "TX",
       "PW_1" : 47827,
       "PW_UNIT_1" : "Year",
       "PW_SOURCE_1" : "OES",
       "OTHER_WAGE_SOURCE_1" : "OFLC ONLINE DATA CENTER",
       "YR_SOURCE_PUB_1" : 2010,
       "LCA_CASE_NAICS_CODE" : 611110,
       "Decision_Date" : "7/20/2010 0:00:00r"
     }
  9. What can we do with the data?
     Work out the:
     • Applications per state
     • Applications by status per state
     • Average time from submission to decision, by status
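The third goal needs the two date strings turned into numbers first. The following is a rough sketch of one way to do that, assuming the "M/D/YYYY H:MM:SS" format seen in the sample document and treating the timestamps as UTC. `parseLcaDate` and `mapDecisionTime` are hypothetical names, and `emit` is passed as an argument here so the code runs outside MongoDB.

```javascript
// Parse "M/D/YYYY H:MM:SS" into UTC milliseconds. Trailing characters (such as
// the stray "r" on Decision_Date in the sample document) are ignored.
// Returns null when the text does not match the expected format.
function parseLcaDate(text) {
  const m = /^(\d{1,2})\/(\d{1,2})\/(\d{4})\s+(\d{1,2}):(\d{2}):(\d{2})/.exec(String(text));
  if (!m) return null;
  return Date.UTC(+m[3], +m[1] - 1, +m[2], +m[4], +m[5], +m[6]);
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Map-style function: emits (STATUS, { days, count }) when both dates parse.
function mapDecisionTime(emit) {
  const submitted = parseLcaDate(this.LCA_CASE_SUBMIT);
  const decided = parseLcaDate(this.Decision_Date);
  if (submitted === null || decided === null) return; // skip unparseable rows
  emit(this.STATUS, { days: (decided - submitted) / MS_PER_DAY, count: 1 });
}

// Run the map function against the dates from the sample document.
const emittedTimes = [];
mapDecisionTime.call(
  {
    STATUS: "CERTIFIED",
    LCA_CASE_SUBMIT: "7/14/2010 9:06:36",
    Decision_Date: "7/20/2010 0:00:00r"
  },
  function (key, value) { emittedTimes.push({ key: key, value: value }); }
);

console.log(emittedTimes[0].key, emittedTimes[0].value.days.toFixed(2)); // CERTIFIED 5.62
```

Emitting a `{ days, count }` pair rather than a bare number leaves room for a reduce that adds both fields and a finalize that divides them.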
  10. Applications by State
      • Key will be LCA_CASE_EMPLOYER_STATE
      • Assume (wrongly) one person per document
  11. Map
      • "this" is equal to the current document
      • Emit a value of 1, as we are assuming a single H-1B application per document

      m = function () {
          emit(this.LCA_CASE_EMPLOYER_STATE, 1);
      }
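  12. Reduce
      • Return a value: the length of the array
      • This works as each value in the array is 1, but only if reduce runs once per key; MongoDB may feed earlier reduce results back in, so summing the values is safer

      r = function (k, v_arr) {
          return v_arr.length;
      }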
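MongoDB may call reduce several times for one key and pass earlier results back in as values, so a reducer that returns the array length can undercount. The sketch below imitates that by reducing in batches; `reduceInBatches`, `countByLength` and `countBySum` are hypothetical names, and the batching is only a stand-in for what the server might do.

```javascript
// Reducer from the slide: counts by array length.
function countByLength(key, values) {
  return values.length;
}

// Alternative that stays correct when partial results are reduced again.
function countBySum(key, values) {
  let total = 0;
  for (let i = 0; i < values.length; i++) {
    total += values[i];
  }
  return total;
}

// Reduce the values in fixed-size batches, then reduce the partial results.
function reduceInBatches(reduce, key, values, batchSize) {
  const partials = [];
  for (let i = 0; i < values.length; i += batchSize) {
    partials.push(reduce(key, values.slice(i, i + batchSize)));
  }
  return reduce(key, partials);
}

const emittedOnes = [1, 1, 1, 1, 1, 1]; // six applications for one state

const lengthResult = reduceInBatches(countByLength, "TX", emittedOnes, 3);
const sumResult = reduceInBatches(countBySum, "TX", emittedOnes, 3);

console.log(lengthResult); // 2 (the two partial results were counted, not the six documents)
console.log(sumResult);    // 6
```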
  13. Executing
      • This will execute the map/reduce
      • Output goes to a collection named workers_by_state

      db.text2010.mapReduce(m, r, {out: "workers_by_state", keeptemp: true, verbose: true})
  14. Result
  15. A more complex Map!
      • The last example assumed one worker per document... which is wrong.
      • We now emit a numeric value (TOTAL_WORKERS) for each document's state

      m = function () {
          emit(this.LCA_CASE_EMPLOYER_STATE, this.TOTAL_WORKERS);
      }
  16. Reduce
      • As the array now contains values other than 1, we have to iterate over it
      • This is standard JavaScript

      r = function (k, v_arr) {
          var total = 0;
          var len = v_arr.length;
          for (var i = 0; i < len; i++) {
              total = total + v_arr[i];
          }
          return total;
      }
  17. Average wage by visa class and application status
      Assumptions:
      • People work ~40 hour weeks
      • Weekly wages are paid every week rather than only the weeks worked
      • "Select Pay Range" seems to be the default option...

      m = function () {
          // The key separator was lost in the transcript; a space is assumed here.
          var k = this.VISA_CLASS + " " + this.STATUS;
          switch (this.LCA_CASE_WAGE_RATE_UNIT) {
              case "Year":
                  emit(k, this.LCA_CASE_WAGE_RATE_FROM);
                  break;
              case "Month":
                  emit(k, this.LCA_CASE_WAGE_RATE_FROM * 12);
                  break;
              case "Bi-Weekly":
                  emit(k, this.LCA_CASE_WAGE_RATE_FROM * 26);
                  break;
              case "Week":
                  emit(k, this.LCA_CASE_WAGE_RATE_FROM * 52);
                  break;
              case "Hour":
                  emit(k, this.LCA_CASE_WAGE_RATE_FROM * 40 * 52);
                  break;
              default:
                  emit(k, 0);
          }
      }
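The unit conversion in this map function can also be written as a lookup table, which makes the assumptions (40-hour weeks, 52 paid weeks) easy to see and change. This is a sketch rather than the author's code: `PERIODS_PER_YEAR` and `annualWage` are hypothetical names, and returning `null` for an unknown unit (instead of emitting 0 as the slide does) is a deliberate variation so that unusable rows can be skipped rather than dragging an average down.

```javascript
// Multipliers to annualise a wage, following the slide's assumptions.
const PERIODS_PER_YEAR = {
  "Year": 1,
  "Month": 12,
  "Bi-Weekly": 26,
  "Week": 52,
  "Hour": 40 * 52 // ~40 hour weeks, paid every week of the year
};

// Returns the annualised wage, or null when the unit is not recognised
// (for example the "Select Pay Range" default mentioned on the slide).
function annualWage(doc) {
  const unit = doc.LCA_CASE_WAGE_RATE_UNIT;
  if (!Object.prototype.hasOwnProperty.call(PERIODS_PER_YEAR, unit)) {
    return null;
  }
  return doc.LCA_CASE_WAGE_RATE_FROM * PERIODS_PER_YEAR[unit];
}

console.log(annualWage({ LCA_CASE_WAGE_RATE_UNIT: "Hour", LCA_CASE_WAGE_RATE_FROM: 25 }));   // 52000
console.log(annualWage({ LCA_CASE_WAGE_RATE_UNIT: "Month", LCA_CASE_WAGE_RATE_FROM: 4000 })); // 48000
console.log(annualWage({ LCA_CASE_WAGE_RATE_UNIT: "Select Pay Range", LCA_CASE_WAGE_RATE_FROM: 0 })); // null
```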
  18. Reduce
      • Work out the average for each key
      • Add each of the elements up
      • Average them

      r = function (k, v_arr) {
          var tot = 0;
          var len = v_arr.length;
          for (var i = 0; i < len; i++) {
              tot += v_arr[i];
          }
          return tot / len;
      }
  19. Finalize
      • A finalize function may be run after reduction.
      • Called a single time per output object
      • The finalize function takes a key and a value, and returns a finalized value.
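Finalize is a natural place for the division when computing averages. Dividing inside reduce, as on slide 18, gives an average of averages if MongoDB reduces a key in more than one pass. A common alternative, sketched here under that assumption, is to carry a running sum and count through reduce and divide once in finalize. All function names and the wage figures below are made up for illustration.

```javascript
// Slide-style reducer: returns the mean of whatever values it is given.
function averageInReduce(key, values) {
  let tot = 0;
  for (let i = 0; i < values.length; i++) tot += values[i];
  return tot / values.length;
}

// Re-reduce-safe reducer: values are { sum, count } pairs, and so is the result.
function sumAndCount(key, values) {
  const out = { sum: 0, count: 0 };
  values.forEach(function (v) {
    out.sum += v.sum;
    out.count += v.count;
  });
  return out;
}

// Finalize: called once per key with the reduced value; divides at the very end.
function finalizeAverage(key, value) {
  return value.count > 0 ? value.sum / value.count : null;
}

// Imitate a two-pass reduction: [30000, 50000] and [100000] are reduced
// separately, then the two partial results are reduced together.
const naive = averageInReduce("H-1B CERTIFIED", [
  averageInReduce("H-1B CERTIFIED", [30000, 50000]),
  averageInReduce("H-1B CERTIFIED", [100000])
]);

const safe = finalizeAverage("H-1B CERTIFIED", sumAndCount("H-1B CERTIFIED", [
  sumAndCount("H-1B CERTIFIED", [{ sum: 30000, count: 1 }, { sum: 50000, count: 1 }]),
  sumAndCount("H-1B CERTIFIED", [{ sum: 100000, count: 1 }])
]));

console.log(naive); // 70000 (average of the partial averages 40000 and 100000)
console.log(safe);  // 60000 (true mean of the three wages)
```

With this shape the map function would emit `{ sum: wage, count: 1 }` instead of a bare number.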
  20. Options
      • Persist the output
      • Filtering input documents
      • Sorting input documents
      • JavaScript scope: allows you to pass in extra variables (cannot be changed at runtime?)
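These four options correspond to the `out`, `query`, `sort` and `scope` fields of the options document passed to `mapReduce`. The fragment below is only a sketch of how they might be combined: the option names are MongoDB's, but the collection name, filter, sort key and scope variable are illustrative assumptions.

```javascript
// Option names are MongoDB's; every value here is an illustrative assumption.
const options = {
  out: "wages_by_status",               // persist the output in this collection
  query: { STATUS: "CERTIFIED" },       // filter the input documents
  sort: { LCA_CASE_EMPLOYER_STATE: 1 }, // sort the input documents (expects an index on this field)
  scope: { hoursPerWeek: 40 }           // extra variables visible inside map, reduce and finalize
};

// In the mongo shell the object would be passed as the third argument:
//   db.text2010.mapReduce(m, r, options);
console.log(Object.keys(options).join(", ")); // out, query, sort, scope
```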
  21. Current limitations / Watch for
      • Single threaded per node (which sucks)
      • Language is restricted to JavaScript (which sucks)
      • Does not use secondaries in replica sets
      • From 1.7.3 on, you can reduce into an existing collection
  22. ...
      • Doesn't allow creation of full documents (which can be a pain for permanent MapReduce collections if using libraries)
      • Slow: roughly 20-30x slower than Hadoop as of 1.8
  23. Using MongoDB with Hadoop
      • Open source
      • Requires knowledge of Java
      • Working input and output adapters for MongoDB are provided
      • Alpha quality from what I can tell
  24. The future
  25. 1.9 / 2.0
      • V8 is replacing SpiderMonkey
      • Recent Hadoop provider
      • Sharded output collections
      • Improved yielding (concurrency)
  26. > 2.0
      • Multi-threaded
      • Alternative languages
      • ~2.2 native aggregation framework
      • JS-only mode is faster for lighter jobs
  27. Further reading
      I've only touched on the details, but this should be enough to get you interested in, and started with, MongoDB MapReduce. Some of the missing stuff:
      • Finalize functions
      • Some more examples