Introduction to MapReduce using Disco



Presented to VanPyz, the Vancouver Python User Group, in June 2009.



  1. VanPyz, June 2, 2009. Introduction to MapReduce using Disco. Erlang and Python. By @JimRoepcke
  2. Computing at Google Scale: Databases and data streams need to be processed quickly and reliably. Thousands of commodity PCs are available in Google’s cluster for computations. Faults are statistically “guaranteed” to occur.
  3. Google’s Motivation: Google has thousands of programs to process user-generated data. Even simple computations were being obscured by the complex code required to run efficiently and reliably on their clusters. Engineers shouldn’t have to be experts in distributed systems to write scalable data-processing software.
  4. Why not just use threads? Threads only add concurrency, and only on one node. They do not scale to more than one node, a cluster, or a cloud. Coordinating work between nodes requires distribution middleware. MapReduce is distribution middleware. MapReduce scales linearly with cores / nodes.
  5. Hadoop: Apache Software Foundation project. Written in Java. Includes the Hadoop Distributed File System.
  6. Disco: Created by Ville Tuulos of the Nokia Research Center. Written in Erlang and Python. Does not include a distributed file system; provide your own data distribution mechanism.
  7. How MapReduce works
  8. The big scary diagram...
  9. [Figure 1: Execution overview. The user program forks the master and the workers (1). The master assigns map and reduce tasks (2). Map workers read the input splits (3) and write intermediate files to local disk (4). Reduce workers read those files remotely (5) and write the output files (6). Stages shown: input files, map phase, intermediate files (on local disks), reduce phase, output files.]
  10. It’s truly very simple...
  11. Master splits input: The (typically huge) input is split into chunks, one or more for each “map worker”.
  12. Splits fed to map workers: The master tells each map worker which split(s) it will process. A split is a file containing some number of input records. Each record has a key and its associated value.
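The splitting described in the two slides above can be sketched in a few lines of Python. This is only an illustration: the function name `split_input`, the use of in-memory lists instead of files, and the choice of contiguous, near-equal chunks are all assumptions, not something the talk or Disco prescribes.

```python
def split_input(records, num_splits):
    """Divide a list of (key, value) records into num_splits contiguous chunks.

    Hypothetical helper for illustration; a real master would split files.
    """
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    base, extra = divmod(len(records), num_splits)
    splits = []
    start = 0
    for i in range(num_splits):
        # The first `extra` splits each take one additional record.
        size = base + (1 if i < extra else 0)
        splits.append(records[start:start + size])
        start += size
    return splits


if __name__ == "__main__":
    records = [("doc%d" % i, "some text") for i in range(10)]
    for number, split in enumerate(split_input(records, 3)):
        print(number, [key for key, _ in split])
```

Each resulting chunk would then be handed to one map worker; a worker may receive more than one chunk if there are more splits than workers.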
  13. Map each input: The map worker executes your problem-specific map algorithm, called once for each record in its input.
  14. Map emits (Key,Value) pairs: Your map algorithm emits zero or more intermediate key-value pairs for each record processed. Let’s call these “(K,V) pairs” from now on. Keys and values are both strings.
  15. (K,V) Pairs hashed to buckets: Each map worker has its own set of buckets. Each (K,V) pair is placed into one of these buckets. Which bucket is determined by a hash function. Advanced: if you know the distribution of your intermediate keys is skewed, provide a custom hash function that distributes (K,V) pairs evenly.
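A minimal sketch of the bucketing step, assuming string keys and a CRC32-based hash. The names `default_partition` and `bucketize` are hypothetical, and CRC32 is used here only because Python's built-in `hash()` for strings varies between processes; the hash function a real framework uses may differ.

```python
import zlib


def default_partition(key, num_buckets):
    """Pick a bucket index for a string key using a stable CRC32 hash."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def bucketize(pairs, num_buckets, partition=default_partition):
    """Place each (K, V) pair into one of num_buckets lists.

    `partition` can be swapped for a custom function when the keys are skewed.
    """
    buckets = [[] for _ in range(num_buckets)]
    for key, value in pairs:
        buckets[partition(key, num_buckets)].append((key, value))
    return buckets


if __name__ == "__main__":
    pairs = [("apple", "1"), ("pear", "1"), ("apple", "1"), ("fig", "1")]
    for index, bucket in enumerate(bucketize(pairs, 3)):
        print(index, bucket)
```

Because the bucket depends only on the key, every pair with the same key lands in the same bucket number on every map worker, which is what lets a single reduce worker see all values for that key.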
  16. Buckets sent to Reducers: Once all map workers are finished, corresponding buckets of (K,V) pairs are sent to reduce workers. Example: Each map worker placed (K,V) pairs into its own buckets A, B, and C. Send bucket A from each map to reduce worker 1; send bucket B from each map to reduce worker 2; send bucket C from each map to reduce worker 3.
  17. Reduce inputs sorted: The reduce worker first concatenates the buckets it received into one file. Then the file of (K,V) pairs is sorted by K. Now the (K,V) pairs are grouped by key. This sorted list of (K,V) pairs is the input to the reduce worker.
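The concatenate, sort, and group sequence above might look like this in Python. It is a sketch under the assumption that all buckets fit in memory as lists of tuples; the function name `prepare_reduce_input` is made up for illustration, and a real reduce worker would operate on files.

```python
from itertools import groupby
from operator import itemgetter


def prepare_reduce_input(buckets):
    """Concatenate buckets, sort the pairs by key, and group values per key."""
    # Step 1: concatenate the buckets received from every map worker.
    merged = [pair for bucket in buckets for pair in bucket]
    # Step 2: sort by key so that equal keys become adjacent.
    merged.sort(key=itemgetter(0))
    # Step 3: collect the values of each run of equal keys.
    return [
        (key, [value for _, value in group])
        for key, group in groupby(merged, key=itemgetter(0))
    ]


if __name__ == "__main__":
    from_map_1 = [("pear", "1"), ("apple", "1")]
    from_map_2 = [("apple", "1")]
    print(prepare_reduce_input([from_map_1, from_map_2]))
```

Sorting first matters because `itertools.groupby` only groups adjacent items with equal keys.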
  18. Reduce the list of (K,V) pairs: The reduce worker executes your problem-specific reduce algorithm, called once for each key in its input. It writes whatever it wants to its output file.
  19. Output: The output of the MapReduce job is the set of output files generated by the reduce workers. What you do with this output is up to you. You might use this output as the input to another MapReduce job.
  20. Counting words (modified from source):
      def map(key, value):
          # key: document name (ignored)
          # value: words in document (list)
          for word in value:
              EmitIntermediate(word, "1")
      def reduce(key, values):
          # key: a word
          # values: a list of counts
          result = 0
          for v in values:
              result += int(v)
          print key, result
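The slide's code relies on a framework-supplied `EmitIntermediate` and is not runnable on its own. Below is a self-contained Python 3 sketch that runs the same word count in a single process, standing in for the master and workers. The names `map_words`, `reduce_counts`, and `run_job` are assumptions for illustration, as is the CRC32 bucketing; this is not Disco's API.

```python
import zlib
from collections import defaultdict


def map_words(key, value):
    """key: document name (ignored); value: list of words in the document."""
    for word in value:
        yield word, "1"


def reduce_counts(key, values):
    """key: a word; values: a list of counts, each a string."""
    return key, sum(int(v) for v in values)


def run_job(documents, num_reducers=2):
    """Run map, bucketing, and reduce in one process (no real distribution)."""
    # Map phase: every emitted pair is hashed into one bucket per reducer.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for name, words in documents.items():
        for word, count in map_words(name, words):
            index = zlib.crc32(word.encode("utf-8")) % num_reducers
            buckets[index][word].append(count)
    # Reduce phase: each bucket is processed in key order, once per key.
    results = {}
    for bucket in buckets:
        for word in sorted(bucket):
            key, total = reduce_counts(word, bucket[word])
            results[key] = total
    return results


if __name__ == "__main__":
    docs = {
        "a.txt": "the quick brown fox".split(),
        "b.txt": "the lazy dog the end".split(),
    }
    for word, total in sorted(run_job(docs).items()):
        print(word, total)
```

Values stay as strings between map and reduce, matching the talk's point that keys and values are both strings; the reduce step converts them with `int()`.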
  21. Stand up! Let’s do it! Organize yourselves into approximately equal numbers of map and reduce workers. I’ll be the master.
  22. Disco demonstration: Wanted to demonstrate a cool puzzle solver. No go, but I can show the code. It’s really simple! Instead you get count_words again, but scaled way up!
      python count_words.py disco://localhost