
An Introduction to Map/Reduce with MongoDB


An introduction to Map/Reduce with MongoDB by Russell Smith from UKD1 Limited

Published in: Technology, Business
  • Transcript of "An Introduction to Map/Reduce with MongoDB"

    1. An Introduction to MapReduce with MongoDB, by Russell Smith
    2. /usr/bin/whoami
       • Russell Smith
       • Consultant for UKD1 Limited
       • I specialise in helping companies going through rapid growth:
       • Code, architecture, infrastructure, devops, sysops, capacity planning, etc.
       • <3 Gearman, MongoDB, Neo4j, MySQL, Riak, Kohana, PHP, Debian, Puppet, etc.
    3. What is MongoDB
       • A scalable, high-performance, open source, document-oriented database
       • Stores JSON-like documents
       • Indexable on any attribute (like MySQL)
       • Built-in MapReduce
    4. Requirements
       • A running MongoDB server: http://www.mongodb.org/downloads
       • Basic knowledge of MongoDB
       • Basic JavaScript
    5. What is Map Reduce
       • Allows aggregating data in parallel
       • Some built-in aggregation functions exist: distinct, count
       • If you need to do something more, either write a query or use MapReduce
    6. How does it work?
       • You write two functions
       • You write them in JavaScript (currently)
       • Map function: called once per document; emits a key and a value
       • Reduce function: called per key emitted, with an array of values. It may be called more than once for the same key, so its return value must be usable as one of its inputs
       • Optional finalize function: post-processes the reduced value for each key (rounding, for example)
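
The flow described in this slide can be sketched in plain JavaScript, without a server. This is only an illustration: `localMapReduce` and the sample documents are made up for this sketch and are not part of MongoDB. In MongoDB, `emit` is provided as a global inside the map function; here it is passed in as an argument, and reduce runs at most once per key.

```javascript
// Hypothetical in-memory stand-in for the map -> group -> reduce -> finalize flow.
// localMapReduce is NOT a MongoDB API; it only mimics the order of the calls.
function localMapReduce(docs, map, reduce, finalize) {
  const groups = new Map(); // key -> array of emitted values

  // MongoDB supplies emit() itself; in this sketch it is a local closure.
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };

  // Map phase: call map once per document, with `this` bound to the document.
  for (const doc of docs) map.call(doc, emit);

  // Reduce phase, then the optional finalize step, once per key.
  const out = [];
  for (const [key, values] of groups) {
    // A key with a single value is passed through without calling reduce.
    let value = values.length === 1 ? values[0] : reduce(key, values);
    if (finalize) value = finalize(key, value);
    out.push({ _id: key, value: value });
  }
  return out;
}

// Made-up sample documents, for illustration only.
const sampleDocs = [
  { state: "TX", workers: 1 },
  { state: "TX", workers: 3 },
  { state: "NY", workers: 2 },
];

const workersPerState = localMapReduce(
  sampleDocs,
  function (emit) { emit(this.state, this.workers); },
  function (key, values) { return values.reduce((a, b) => a + b, 0); }
);
console.log(workersPerState);
```

The map function only sees one document at a time and the reduce function only sees one key at a time, which is what lets the real implementation spread the work out.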
    7. Some example data
       • I downloaded the H1B (US temporary work visa) data: http://www.flcdatacenter.com/CaseH1B.aspx
       • Imported the CSV data using the mongoimport command
       • Total imported documents: ~335k
    8. What do the documents look like?
       • LCA_CASE_EMPLOYER_STATE
       • STATUS
       • LCA_CASE_SUBMIT / Decision_Date
       • LCA_CASE_WAGE_RATE_FROM

           {
               "_id" : ObjectId("4db7c981e243a6e23725570f"),
               "LCA_CASE_NUMBER" : "I-200-09132-243675",
               "STATUS" : "CERTIFIED",
               "LCA_CASE_SUBMIT" : "7/14/2010 9:06:36",
               "VISA_CLASS" : "H-1B",
               "LCA_CASE_EMPLOYMENT_START_DATE" : "12/15/2010 0:00:00",
               "LCA_CASE_EMPLOYMENT_END_DATE" : "12/15/2013 0:00:00",
               "LCA_CASE_EMPLOYER_NAME" : "BRITISH SCHOOL OF AMERICA, LLC",
               "LCA_CASE_EMPLOYER_ADDRESS" : "4211 WATONGA BLVD.",
               "LCA_CASE_EMPLOYER_CITY" : "HOUSTON",
               "LCA_CASE_EMPLOYER_STATE" : "TX",
               "LCA_CASE_EMPLOYER_POSTAL_CODE" : 77092,
               "LCA_CASE_SOC_CODE" : "25-2022.00",
               "LCA_CASE_SOC_NAME" : "Middle School Teachers, Except Special and Vocatio",
               "LCA_CASE_JOB_TITLE" : "MIDDLE SCHOOL TEACHER/IB COORDINATOR",
               "LCA_CASE_WAGE_RATE_FROM" : 51577.63,
               "LCA_CASE_WAGE_RATE_UNIT" : "Year",
               "FULL_TIME_POS" : "Y",
               "TOTAL_WORKERS" : 1,
               "LCA_CASE_WORKLOC1_CITY" : "HOUSTON",
               "LCA_CASE_WORKLOC1_STATE" : "TX",
               "PW_1" : 47827,
               "PW_UNIT_1" : "Year",
               "PW_SOURCE_1" : "OES",
               "OTHER_WAGE_SOURCE_1" : "OFLC ONLINE DATA CENTER",
               "YR_SOURCE_PUB_1" : 2010,
               "LCA_CASE_NAICS_CODE" : 611110,
               "Decision_Date" : "7/20/2010 0:00:00r"
           }
    9. What can we do with the data?
       • Work out the:
       • Applications per state
       • Applications by status per state
       • Average time from submission to decision, by status
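
The slides do not go on to implement the submission-to-decision timing, so here is a rough sketch of the per-document calculation a map function would need. The helper names `parseUsDateTime` and `daysToDecision` are made up for this sketch, and it assumes every date string follows the "M/D/YYYY H:MM:SS" shape of the sample document and can be treated as UTC (so daylight-saving shifts are ignored).

```javascript
// Parse "M/D/YYYY H:MM:SS" as seen in the sample document.
// Anything after the seconds (such as a stray trailing character) is ignored.
function parseUsDateTime(text) {
  const m = /^(\d{1,2})\/(\d{1,2})\/(\d{4}) (\d{1,2}):(\d{2}):(\d{2})/.exec(text);
  if (!m) return null;
  const [month, day, year, hour, minute, second] = m.slice(1).map(Number);
  // Assumption: treat the timestamp as UTC to keep the arithmetic simple.
  return Date.UTC(year, month - 1, day, hour, minute, second);
}

// Days between LCA_CASE_SUBMIT and Decision_Date, or null if either is unparseable.
function daysToDecision(doc) {
  const submitted = parseUsDateTime(doc.LCA_CASE_SUBMIT);
  const decided = parseUsDateTime(doc.Decision_Date);
  if (submitted === null || decided === null) return null;
  return (decided - submitted) / 86400000; // milliseconds per day
}

// Field values copied from the sample document in the slides.
const sampleCase = {
  STATUS: "CERTIFIED",
  LCA_CASE_SUBMIT: "7/14/2010 9:06:36",
  Decision_Date: "7/20/2010 0:00:00r",
};
console.log(sampleCase.STATUS, daysToDecision(sampleCase));
```

A map function could then emit `this.STATUS` as the key and the number of days as the value, skipping documents for which the helper returns null.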
    10. Applications by State
        • Key will be LCA_CASE_EMPLOYER_STATE
        • Assume (wrongly) one person per document
    11. Map
        • this is equal to the current document
        • emit a value of 1, as we are assuming a single H1B app per document

            m = function () {
                emit(this.LCA_CASE_EMPLOYER_STATE, 1);
            }
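    12. Reduce
        • Return a value; the length of the array
        • This works as each value in the array is 1, provided reduce runs only once per key; if MongoDB re-reduces partial results, summing the values is the safe form

            r = function (k, v_arr) {
                return v_arr.length;
            }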
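
Because reduce can be called again on its own output, counting by array length and counting by summing can disagree. The sketch below simulates one re-reduce by hand in plain JavaScript; `reduceInBatches` is a made-up helper for this illustration, not something MongoDB exposes.

```javascript
// Two candidate reduce functions for counting emitted 1s per key.
const countByLength = (key, values) => values.length;
const countBySum = (key, values) => values.reduce((a, b) => a + b, 0);

// Simulate a re-reduce: reduce two partial batches separately,
// then reduce the two partial results together.
function reduceInBatches(reduce, key, batchA, batchB) {
  return reduce(key, [reduce(key, batchA), reduce(key, batchB)]);
}

const ones = [1, 1, 1, 1, 1]; // five documents emitted for the same key
const viaLength = reduceInBatches(countByLength, "TX", ones.slice(0, 3), ones.slice(3));
const viaSum = reduceInBatches(countBySum, "TX", ones.slice(0, 3), ones.slice(3));

console.log("length-based count:", viaLength);
console.log("sum-based count:", viaSum);
```

The length-based version counts the two partial results rather than the five original documents, while the sum-based version still arrives at five.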
    13. Executing
        • This will execute the map/reduce
        • Output goes to a collection named workers_by_state

            db.text2010.mapReduce(m, r, {out: "workers_by_state", keeptemp: true, verbose: true})
    14. Result

            { "_id" : "NEW YORK", "value" : 512 }
            { "_id" : "IOWA", "value" : 15 }
            { "_id" : "KANSAS", "value" : 54 }
            ...
    15. A more complex Map!
        • The last example assumed one worker per document...which is wrong
        • We now emit a numeric value per state

            m = function () {
                emit(this.LCA_CASE_EMPLOYER_STATE, this.TOTAL_WORKERS);
            }
    16. Reduce
        • As the array now contains values other than 1, we have to iterate over it
        • This is standard JavaScript

            r = function (k, v_arr) {
                var total = 0;
                var len = v_arr.length;
                for (var i = 0; i < len; i++) {
                    total = total + v_arr[i];
                }
                return total;
            }
    17. Average wage by VISA class and application status
        • Assumptions:
        • People work ~40 hour weeks
        • Weekly wages are paid every week rather than only the weeks worked
        • "Select Pay Range" seems to be the default option...

            m = function () {
                var k = this.VISA_CLASS + " " + this.STATUS;
                switch (this.LCA_CASE_WAGE_RATE_UNIT) {
                    case "Year":
                        emit(k, this.LCA_CASE_WAGE_RATE_FROM);
                        break;
                    case "Month":
                        emit(k, this.LCA_CASE_WAGE_RATE_FROM * 12);
                        break;
                    case "Bi-Weekly":
                        emit(k, this.LCA_CASE_WAGE_RATE_FROM * 26);
                        break;
                    case "Week":
                        emit(k, this.LCA_CASE_WAGE_RATE_FROM * 52);
                        break;
                    case "Hour":
                        emit(k, this.LCA_CASE_WAGE_RATE_FROM * 40 * 52);
                        break;
                    default:
                        emit(k, 0);
                }
            }
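
The wage conversion in this map function can also be written as a lookup table, which may be easier to extend. This is a sketch of an alternative, not the author's code: `PERIODS_PER_YEAR` and `annualWage` are names invented here, and the multipliers are the same assumptions as on the slide (40-hour weeks, 52 paid weeks).

```javascript
// Multipliers that turn a wage per unit into a yearly figure,
// using the slide's assumptions (40 hours a week, 52 paid weeks a year).
const PERIODS_PER_YEAR = new Map([
  ["Year", 1],
  ["Month", 12],
  ["Bi-Weekly", 26],
  ["Week", 52],
  ["Hour", 40 * 52],
]);

// Returns the yearly wage, or null when the unit is not one we recognise.
function annualWage(doc) {
  const factor = PERIODS_PER_YEAR.get(doc.LCA_CASE_WAGE_RATE_UNIT);
  if (factor === undefined) return null;
  return doc.LCA_CASE_WAGE_RATE_FROM * factor;
}

console.log(annualWage({ LCA_CASE_WAGE_RATE_UNIT: "Hour", LCA_CASE_WAGE_RATE_FROM: 10 }));
console.log(annualWage({ LCA_CASE_WAGE_RATE_UNIT: "Select Pay Range", LCA_CASE_WAGE_RATE_FROM: 10 }));
```

Returning null for an unknown unit is a deliberate difference from the slide, which emits 0: a map function could skip the emit in that case, so that unknown units do not pull the average down.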
    18. Reduce
        • Work out the average for each key
        • Add each of the elements up
        • Average them (only correct if reduce runs once per key; a re-reduce would average the averages)

            r = function (k, v_arr) {
                var tot = 0;
                var len = v_arr.length;
                for (var i = 0; i < len; i++) {
                    tot += v_arr[i];
                }
                return tot / len;
            }
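
One common way to make an average safe under re-reduce is to carry a running sum and count through reduce and divide only in finalize. The sketch below is plain JavaScript with invented names (`reduceWages`, `finalizeAverage`) and made-up wage figures; it assumes the map function emits values shaped like `{ sum: wage, count: 1 }`.

```javascript
// Reduce: add up sums and counts. The output has the same shape as the input,
// so it can safely be fed back into reduce.
function reduceWages(key, values) {
  const out = { sum: 0, count: 0 };
  for (const v of values) {
    out.sum += v.sum;
    out.count += v.count;
  }
  return out;
}

// Finalize: divide once, after all reducing is done.
function finalizeAverage(key, value) {
  return value.count === 0 ? 0 : value.sum / value.count;
}

// Made-up wages for a single key, in the assumed emitted shape.
const emitted = [40000, 50000, 60000, 90000].map((w) => ({ sum: w, count: 1 }));

// Reduce everything in one pass, and again in two batches followed by a re-reduce.
const onePass = reduceWages("H-1B CERTIFIED", emitted);
const twoPass = reduceWages("H-1B CERTIFIED", [
  reduceWages("H-1B CERTIFIED", emitted.slice(0, 3)),
  reduceWages("H-1B CERTIFIED", emitted.slice(3)),
]);

console.log(finalizeAverage("H-1B CERTIFIED", onePass));
console.log(finalizeAverage("H-1B CERTIFIED", twoPass));
```

Both paths give the same average, whereas averaging the two batch averages directly (50000 and 90000) would give 70000 instead of 60000.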
    19. Finalize
        • A finalize function may be run after reduction
        • Called a single time per object
        • The finalize function takes a key and a value, and returns a finalized value
    20. Options
        • Persist the output
        • Filtering input documents
        • Sorting input documents
        • JavaScript scope: allows you to pass in extra variables (cannot be changed at runtime?)
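
As a rough sketch, the options above map onto keys of the options object passed as the third argument to `mapReduce`. The output collection name, the filter, and the scope variable below are invented for illustration, and the actual shell call is left as a comment so the snippet runs without a server.

```javascript
// Hypothetical options object; every value here is an illustrative assumption.
const mapReduceOptions = {
  // Persist the output: replace the contents of a named collection.
  out: { replace: "avg_wage_by_class_status" },
  // Filtering input documents: only these documents reach the map function.
  query: { STATUS: "CERTIFIED" },
  // Sorting input documents before they are mapped.
  sort: { VISA_CLASS: 1 },
  // JavaScript scope: extra variables visible inside map, reduce and finalize.
  scope: { HOURS_PER_WEEK: 40 },
  // Optional finalize step, here rounding to a whole number.
  finalize: function (key, value) {
    return Math.round(value);
  },
};

// In the mongo shell, with m and r defined as in the earlier slides:
// db.text2010.mapReduce(m, r, mapReduceOptions);

console.log(Object.keys(mapReduceOptions));
```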
    21. Current limitations / Watch for
        • Single threaded per node (which sucks): https://jira.mongodb.org/browse/SERVER-463
        • Language is restricted to JavaScript (which sucks): https://jira.mongodb.org/browse/SERVER-699
        • Does not use secondaries in replica sets
        • From 1.7.3 on, you can reduce into an existing collection
    22. ...
        • Doesn't allow creation of full documents (which can be a pain for permanent MR collections if using libraries): https://jira.mongodb.org/browse/SERVER-2517
        • Slow; ~20-30x slower than Hadoop with 1.8: https://jira.mongodb.org/browse/SERVER-3055
    23. Using MongoDB with Hadoop
        • https://github.com/mongodb/mongo-hadoop
        • Open source
        • Requires knowledge of Java
        • Working input and output adapters for MongoDB are provided
        • Alpha quality from what I can tell
    24. The future
    25. 1.9 / 2.0
        • V8 is replacing SpiderMonkey
        • Recent Hadoop provider
        • Sharded output collections
        • Improved yielding (concurrency)
    26. > 2.0
        • Multi-threaded
        • Alternative languages: https://jira.mongodb.org/browse/SERVER-699
        • ~2.2 native aggregation framework
        • JS-only mode is faster for lighter jobs: https://jira.mongodb.org/browse/SERVER-2976
    27. Further reading
        • I've only touched on the details, but this should be enough to get you interested in, and started with, MongoDB Map Reduce. Some of the missing stuff:
        • Finalize functions: http://bit.ly/gEfKOr
        • Some more examples: http://bit.ly/ig1Yfj