Topic 5: MapReduce Theory and Implementation
Transcript

  • 5: MapReduce Theory and Implementation
      Zubair Nabi
      zubair.nabi@itu.edu.pk
      April 18, 2013
  • Outline
      1. Introduction
      2. Programming Model
      3. Implementation
      4. Refinements
      5. Hadoop
  • Outline (Section 1: Introduction)
  • Common computations at Google
      – Process large amounts of data generated from crawled documents, web request logs, etc.
      – Compute the inverted index, the graph structure of web documents, summaries of pages crawled per host, etc.
      – Common properties:
          1. The computation is conceptually simple and is distributed across hundreds or thousands of machines to leverage parallelism
          2. The input data is large
          3. The originally simple computation is made complex by system-level code that deals with work assignment and distribution, and fault tolerance
  • Enter MapReduce
      – Based on the insights on the previous slide, two Google engineers, Jeff Dean and Sanjay Ghemawat, designed MapReduce in 2004
      – An abstraction that helps the programmer express simple computations
      – Hides the gory details of parallelization, fault tolerance, data distribution, and load balancing
      – Relies on user-provided map and reduce primitives, as found in functional languages
      – Leverages one key insight: most of the computation at Google involved applying a map operation to each logical record in the input dataset to obtain a set of intermediate key/value pairs, and then applying a reduce operation to all values with the same key, for aggregation
  • Outline (Section 2: Programming Model)
  • Programming Model
      – Input: a set of key/value pairs
      – Output: a set of key/value pairs
      – The user provides the entire computation in the form of two functions: map and reduce
  • User-defined functions
      1. Map: takes an input pair and produces a set of intermediate key/value pairs
          – The framework groups together the intermediate values by key for consumption by the reduce
      2. Reduce: takes as input a key and a list of associated values
          – In the common case, it merges these values into a smaller set of values
  • Example: Word Count
      – Counting the occurrences of each word in a large collection of documents
      1. Map: emits each word together with the value 1
      2. Reduce: sums together all counts emitted for a particular word
  • Example: Word Count (2)

        map(String key, String value):
            // key: document name
            // value: document contents
            for each word w in value:
                EmitIntermediate(w, "1");

        reduce(String key, Iterator values):
            // key: a word
            // values: a list of counts
            int result = 0;
            for each v in values:
                result += ParseInt(v);
            Emit(AsString(result));
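      – A minimal, runnable Python sketch of the same computation, with the shuffle simulated in-process; the names map_fn, reduce_fn, and run_word_count are illustrative, not part of any MapReduce API:

            from collections import defaultdict

            def map_fn(doc_name, contents):
                # Emit (word, 1) for every word in the document.
                for word in contents.split():
                    yield word, 1

            def reduce_fn(word, counts):
                # Sum together all counts emitted for this word.
                yield word, sum(counts)

            def run_word_count(documents):
                # Simulate the shuffle: group intermediate values by key.
                groups = defaultdict(list)
                for name, contents in documents.items():
                    for key, value in map_fn(name, contents):
                        groups[key].append(value)
                # Apply the reduce to each key and its list of values.
                return dict(pair for key, values in groups.items()
                            for pair in reduce_fn(key, values))

            if __name__ == "__main__":
                docs = {"d1": "the quick brown fox", "d2": "the lazy dog and the fox"}
                print(run_word_count(docs))  # e.g. {'the': 3, 'fox': 2, ...}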
  • Types
      – The user-supplied map and reduce functions have associated types:
      1. Map: map(k1, v1) → list(k2, v2)
      2. Reduce: reduce(k2, list(v2)) → list(v2)
  • More applications
      – Distributed grep
          1. Map: emits a line if it matches a user-provided pattern
          2. Reduce: identity function
      – Count of URL access frequency
          1. Map: similar to the Word Count map, but with URLs instead of words
          2. Reduce: similar to the Word Count reduce
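      – A rough Python sketch of the distributed-grep pair described above; the pattern and the function names grep_map / grep_reduce are illustrative assumptions:

            import re

            PATTERN = re.compile(r"ERROR")  # illustrative user-provided pattern

            def grep_map(offset, line):
                # Emit the line itself whenever it matches the pattern.
                if PATTERN.search(line):
                    yield offset, line

            def grep_reduce(key, lines):
                # Identity function: pass matching lines through unchanged.
                for line in lines:
                    yield key, line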
  • More applications (2)
      – Inverted index
          1. Map: emits a sequence of <word, document_ID> pairs
          2. Reduce: emits <word, list(document_ID)> pairs
      – Distributed sort
          1. Map: identity
          2. Reduce: identity
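      – A hedged Python sketch of the inverted-index pair; sorting and de-duplicating the posting list is a common refinement rather than something the slide requires:

            def index_map(doc_id, contents):
                # Emit a <word, document_ID> pair for every word in the document.
                for word in contents.split():
                    yield word, doc_id

            def index_reduce(word, doc_ids):
                # Emit <word, list(document_ID)>.
                yield word, sorted(set(doc_ids))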
  • Outline (Section 3: Implementation)
  • Cluster architecture
      – A large cluster of shared-nothing commodity machines connected via Ethernet
      – Each node is an x86 system running Linux, with local memory
      – Commodity networking hardware connected in a tree topology
      – As clusters consist of hundreds or thousands of machines, failure is common
      – Each machine has local hard drives
      – The Google File System (GFS) runs atop these disks and employs replication to ensure availability and reliability
      – Jobs are submitted to a scheduler, which maps the tasks within each job to available machines in the cluster
  • MapReduce architecture
      1. Master: in charge of all metadata, work scheduling and distribution, and job orchestration
      2. Workers: contain slots to execute map or reduce functions
  • Execution
      1. The user writes the map and reduce functions and stitches together a MapReduce specification with the location of the input dataset, the number of reduce tasks, and other attributes
      2. The master logically splits the input dataset into M splits, where M = (input dataset size) / (GFS block size); the GFS block size is typically a multiple of 64 MB
      3. It then earmarks M map tasks and assigns them to workers. Each worker has a configurable number of task slots, and each time a worker completes a task, the master assigns it more pending map tasks
      4. Once all map tasks have completed, the master assigns R reduce tasks to worker nodes
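      – To make step 2 concrete: a 10 GB input with a 64 MB block size yields roughly 160 map tasks. A minimal sketch of that calculation (the rounding up for a partial last split is an assumption, and num_map_tasks is an illustrative name):

            import math

            GFS_BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the typical block size mentioned above

            def num_map_tasks(input_size_bytes, block_size_bytes=GFS_BLOCK_SIZE):
                # M = input dataset size / GFS block size
                return math.ceil(input_size_bytes / block_size_bytes)

            print(num_map_tasks(10 * 1024**3))  # a 10 GB input yields 160 map tasks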
  • Mappers
      1. A map worker reads the contents of the input split that it has been assigned
      2. It parses the split into key/value pairs and invokes the user-defined map function for each pair
      3. The intermediate key/value pairs produced by the map logic are collected (buffered) in memory
      4. Once the buffered key/value pairs exceed a threshold, they are partitioned (using a partitioning function) into R partitions and written to local disk. The locations of these partitions are passed to the master
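      – A simplified sketch of the buffer-and-spill behaviour in steps 3 and 4, assuming Python's built-in hash as the partitioning function; the threshold, R, and the in-memory lists standing in for on-disk spill files are illustrative:

            R = 4                    # number of reduce tasks (illustrative)
            SPILL_THRESHOLD = 1000   # buffered pairs before spilling (illustrative)

            class MapOutputBuffer:
                def __init__(self):
                    self.buffer = []
                    # Stand-in for the R on-disk partition files of one map task.
                    self.partitions = [[] for _ in range(R)]

                def emit(self, key, value):
                    self.buffer.append((key, value))
                    if len(self.buffer) >= SPILL_THRESHOLD:
                        self.spill()

                def spill(self):
                    # Partition the buffered pairs with hash(key) mod R and "write" them out.
                    for key, value in self.buffer:
                        self.partitions[hash(key) % R].append((key, value))
                    self.buffer.clear()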
  • Reducers
      1. A reduce worker obtains the locations of its input partitions from the master and retrieves them via HTTP requests
      2. Once it has read all of its input, it sorts it by key to group together all occurrences of the same key
      3. It then invokes the user-defined reduce function for each key, passing it the key and its associated values
      4. The key/value pairs produced by the reduce logic are written to a final output file, which resides on the distributed filesystem
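      – The sort-then-group logic of steps 2 and 3 can be sketched with the standard library; run_reduce is an illustrative name and reduce_fn stands in for the user-defined reduce:

            from itertools import groupby
            from operator import itemgetter

            def run_reduce(fetched_pairs, reduce_fn):
                # Sort by key so all occurrences of the same key are adjacent, then
                # hand each key and its list of values to the user-defined reduce.
                fetched_pairs.sort(key=itemgetter(0))
                output = []
                for key, group in groupby(fetched_pairs, key=itemgetter(0)):
                    values = [value for _, value in group]
                    output.extend(reduce_fn(key, values))
                return output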
  • Book-keeping by the master
      – The master holds metadata for all jobs running in the cluster
      – For each map and reduce task, it stores the state (pending, in-progress, or completed) and, for in-progress tasks, the ID of the worker executing it
      – For each map task, it stores the locations and sizes of the partitions produced
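      – A toy data structure capturing this bookkeeping; the class and field names are illustrative assumptions, not the actual master implementation:

            from dataclasses import dataclass, field
            from enum import Enum
            from typing import Optional

            class State(Enum):
                PENDING = "pending"
                IN_PROGRESS = "in-progress"
                COMPLETED = "completed"

            @dataclass
            class TaskInfo:
                state: State = State.PENDING
                worker_id: Optional[str] = None   # set only while the task is in progress
                # For completed map tasks: (location, size) of each partition produced.
                partitions: list = field(default_factory=list)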
  • Fault tolerance
      – For large compute clusters, failures are the norm rather than the exception
      1. Worker: each worker sends a periodic heartbeat signal to the master
          – If the master does not receive a heartbeat from a worker within a certain amount of time, it marks the worker as failed
          – In-progress map and reduce tasks on that worker are simply re-executed on other nodes. The same goes for completed map tasks, as their output is lost when the machine fails
          – Completed reduce tasks are not re-executed, as their output already resides on the distributed filesystem
      2. Master: the entire computation is marked as failed
          – However, it is simple to keep the master's state soft (e.g., periodically checkpointed) and re-spawn it
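      – A simplified sketch of the heartbeat-based failure detection for workers; the timeout value and function names are illustrative assumptions:

            import time

            HEARTBEAT_TIMEOUT = 60.0   # seconds without a heartbeat before a worker is marked failed

            last_heartbeat = {}        # worker_id -> time of the most recent heartbeat

            def record_heartbeat(worker_id):
                last_heartbeat[worker_id] = time.monotonic()

            def failed_workers():
                # Workers whose last heartbeat is older than the timeout are considered
                # failed; their in-progress (and completed map) tasks would be rescheduled.
                now = time.monotonic()
                return [w for w, t in last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT]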
  • Locality
      – Network bandwidth is a scarce resource in typical clusters
      – GFS slices files into 64 MB blocks and stores 3 replicas of each block across the cluster
      – The master exploits this information by scheduling a map task near its input data; preference is given in the order node-local, rack/switch-local, and any
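      – One way to picture this preference is a simple ranking over candidate workers; the helper names and the rack_of lookup are assumptions made for illustration:

            def locality_rank(worker, replica_nodes, rack_of):
                # 0 = node-local, 1 = rack/switch-local, 2 = any (lower is preferred).
                if worker in replica_nodes:
                    return 0
                if rack_of(worker) in {rack_of(node) for node in replica_nodes}:
                    return 1
                return 2

            def pick_worker(idle_workers, replica_nodes, rack_of):
                # Choose the most local idle worker for a map task's input split.
                return min(idle_workers,
                           key=lambda w: locality_rank(w, replica_nodes, rack_of))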
  • Speculative re-execution
      – Every now and then the entire computation is held up by a "straggler" task
      – Stragglers can arise for a number of reasons, such as machine load, network traffic, or software/hardware bugs
      – To deal with stragglers, the master speculatively re-executes slow tasks on other machines
      – A task is marked as completed whenever either the primary or the backup copy finishes execution
  • Scalability
      – Runs at multiple scales: from single nodes to data centers with tens of thousands of nodes
      – Nodes can be added or removed on the fly to scale up or down
  • Outline (Section 4: Refinements)
  • Partitioning
      – By default, MapReduce uses hash partitioning to partition the key space: hash(key) mod R
      – Optionally, the user can provide a custom partitioning function to, say, mitigate skew or to ensure that certain keys always end up at a particular reduce worker
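      – A minimal sketch contrasting the default hash partitioner with a custom one; routing URLs by host is the classic example of keeping related keys on the same reducer, and the function names are illustrative:

            from urllib.parse import urlparse

            def default_partition(key, R):
                # Default behaviour: hash(key) mod R.
                return hash(key) % R

            def host_partition(url_key, R):
                # Custom partitioner: all URLs from the same host land on the same reducer.
                return hash(urlparse(url_key).netloc) % R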
  • Combiner function
      – For reduce functions which are commutative and associative, the user can additionally provide a combiner function, which is applied to the output of the map for local merging
      – Typically, the same reduce function is used as the combiner
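      – A sketch of applying the reduce locally, per key, to a single mapper's output before the shuffle; map_output is an illustrative in-memory stand-in for the mapper's buffered pairs, and reduce_fn is the word-count reduce from the earlier sketch:

            from collections import defaultdict

            def combine(map_output, reduce_fn):
                # Apply the (commutative, associative) reduce locally, per key,
                # to shrink one mapper's output before it is shuffled.
                groups = defaultdict(list)
                for key, value in map_output:
                    groups[key].append(value)
                combined = []
                for key, values in groups.items():
                    combined.extend(reduce_fn(key, values))
                return combined

            # [("the", 1), ("the", 1), ("fox", 1)] -> [("the", 2), ("fox", 1)]
            # when reduce_fn sums the counts, as in the word-count example.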
  • Input/output formats
      – The library supports a number of input/output formats out of the box; for instance, text as input and key/value pairs as output
      – Optionally, the user can specify custom input readers and output writers, for instance to read from or write to a database
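      – A hedged sketch of what a custom input reader for text might look like, yielding (byte offset, line) pairs for one split; this mirrors the concept only, not any specific reader API, and the function name is an assumption:

            def text_record_reader(path, start, length):
                # Yield (byte offset, line) pairs for one split of a plain-text file.
                # (Real readers also handle lines that straddle split boundaries.)
                with open(path, "rb") as f:
                    f.seek(start)
                    offset = start
                    while offset < start + length:
                        line = f.readline()
                        if not line:
                            break
                        yield offset, line.decode("utf-8", errors="replace").rstrip("\n")
                        offset += len(line)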
  • Outline (Section 5: Hadoop)
  • Hadoop
      – An open-source implementation of MapReduce, originally developed by Doug Cutting starting in 2004 and later taken up at Yahoo!
      – Now a top-level Apache open-source project
      – Implemented in Java (Google's in-house implementation is in C++)
      – Comes with an associated distributed filesystem, HDFS (a clone of GFS)
  • References
      – Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI '04), Vol. 6. USENIX Association, Berkeley, CA, USA.