Hadoop Map Reduce
  1. Hadoop Map Reduce
     Apurva Jadhav, Senior Software Engineer, TheFind Inc.
     (diagrams borrowed from various sources)
  2. Introduction
     - Open source project written in Java
     - Large-scale distributed data processing
     - Based on Google's MapReduce framework and the Google File System
     - Works on commodity hardware
     - Used by Google, Yahoo, Facebook, Amazon, and many other startups: http://wiki.apache.org/hadoop/PoweredBy
  3. Hadoop Core
     - Hadoop Distributed File System (HDFS)
       - Distributes and stores data across a cluster (brief intro only)
     - Hadoop Map Reduce (MR)
       - Provides a parallel programming model
       - Moves computation to where the data is
       - Handles scheduling and fault tolerance
       - Status reporting and monitoring
  4. Typical cluster
     - Nodes are Linux PCs
     - 4-8 GB RAM, hundreds of GB of IDE/SATA drive space
  5. Hadoop Distributed File System
     - Scales to petabytes across 1000s of nodes
     - Single namespace for the entire cluster
     - Files broken into 128 MB blocks
     - Block-level replication handles node failure
     - Optimized for write-once, read-many access
     - Writes are append-only
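The 128 MB block size above implies some simple sizing arithmetic. A minimal sketch in plain Java, assuming a hypothetical 1 GB file and a replication factor of 3 (the replication factor is an assumption for illustration, not stated on the slide):

```java
class HdfsBlockMath {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB, per the slide

    // Number of HDFS blocks a file of the given size occupies (ceiling division).
    static long blockCount(long fileSize) {
        return (fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    public static void main(String[] args) {
        long fileSize = 1_000_000_000L; // hypothetical 1 GB file
        int replication = 3;            // assumed replication factor
        System.out.println(blockCount(fileSize));               // 8 blocks
        System.out.println(blockCount(fileSize) * replication); // 24 block copies cluster-wide
    }
}
```

Replicating at the block level rather than the file level is what lets the namenode re-replicate only the blocks lost when a single datanode fails.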
  6. HDFS Architecture
     [Diagram: a Client performs metadata ops against the Namenode and reads/writes blocks on Datanodes; Datanodes replicate blocks among themselves.]
     - Namenode (master): stores FS metadata - namespace, block locations
     - Datanodes: store the data blocks as Linux files
  7. Hadoop Map Reduce
     - Why Map Reduce
     - Map Reduce Architecture - 1
     - Map Reduce Programming Model
     - Word count using Map Reduce
     - Map Reduce Architecture - 2
  8. Word Count Problem
     - Find the frequency of each word in a given corpus of documents
     - Trivial for small data
     - How to process more than a TB of data?
     - Doing it on one machine is very slow - takes days to finish!
     - Good news: it can be parallelized across a number of machines
  9. Why Map Reduce
     - How to scale large data processing applications?
     - Divide the data and process it on many nodes
     - Each such application has to handle:
       - Communication between nodes
       - Division and scheduling of work
       - Fault tolerance
       - Monitoring and reporting
     - Map Reduce handles and hides all these issues
     - Provides a clean abstraction for the programmer
  10. Map Reduce Architecture
      - Each node is part of an HDFS cluster
      - Input data is stored in HDFS, spread across nodes and replicated
      - Programmer submits a job (mapper, reducer, input) to the Job tracker
      - Job tracker (master):
        - splits the input data
        - schedules and monitors the various map and reduce tasks
      - Task trackers (slaves):
        - execute map and reduce tasks
      [Diagram: the client submits a job (mapper, reducer, input) to the Jobtracker, which assigns tasks to Tasktrackers; data is transferred between Tasktrackers.]
  11. Map Reduce Programming Model
      - Inspired by functional language primitives
      - map f list: applies a given function f to each element of list and returns a new list
          map square [1 2 3 4 5] = [1 4 9 16 25]
      - reduce g list: combines the elements of list using function g to generate a new value
          reduce sum [1 2 3 4 5] = 15
      - Map and reduce do not modify input data; they always create new data
      - A Hadoop Map Reduce job consists of a mapper and a reducer
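The functional primitives above can be sketched with Java streams, with no Hadoop dependency (the method names are illustrative):

```java
import java.util.List;
import java.util.stream.Collectors;

class FunctionalPrimitives {
    // map square [1 2 3 4 5] -> [1 4 9 16 25]: builds a new list, input untouched
    static List<Integer> mapSquare(List<Integer> xs) {
        return xs.stream().map(x -> x * x).collect(Collectors.toList());
    }

    // reduce sum [1 2 3 4 5] -> 15: folds the list into a single value
    static int reduceSum(List<Integer> xs) {
        return xs.stream().reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        System.out.println(mapSquare(List.of(1, 2, 3, 4, 5))); // [1, 4, 9, 16, 25]
        System.out.println(reduceSum(List.of(1, 2, 3, 4, 5))); // 15
    }
}
```

Note that, as the slide says, neither operation mutates its input list; both produce new values, which is what makes the primitives safe to run in parallel.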
  12. Map Reduce Programming Model
      - Mapper
        - Records (lines, database rows, etc.) are input as key/value pairs
        - Mapper outputs one or more intermediate key/value pairs for each input
        - map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
      - Reducer
        - After the map phase, all the intermediate values for a given output key are combined into a list
        - Reducer combines those intermediate values into one or more final key/value pairs
        - reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter)
      - Input and output key/value types can be different
  13. Word Count Map Reduce Job
      - Mapper
        - Input: <key: byte offset, value: line of a document>
        - Output: for each word w in the input line, output <key: w, value: 1>
            Input: (2133, The quick brown fox jumps over the lazy dog.)
            Output: (the, 1), (quick, 1), (brown, 1) ... (fox, 1), (the, 1)
      - Reducer
        - Input: <key: word, value: list<integer>>
        - Output: sum all the values in the input list for the given key and output <key: word, value: count>
            Input: (the, [1, 1, 1, 1, 1]), (fox, [1, 1, 1]) ...
            Output: (the, 5), (fox, 3)
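The word-count flow above can be simulated in plain Java without Hadoop; the method names and the in-memory `shuffle` are illustrative stand-ins for what the framework does between the map and reduce phases:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCountSim {
    // Map phase: emit an intermediate (word, 1) pair for each word in the line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.toLowerCase().split("\\W+"))
            if (!w.isEmpty()) out.add(Map.entry(w, 1));
        return out;
    }

    // Shuffle/sort stand-in: group all intermediate values by key, sorted by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return grouped;
    }

    // Reduce phase: sum the list of counts for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> grouped =
            shuffle(map("The quick brown fox jumps over the lazy dog"));
        // "the" occurs twice in the line, so its count is 2
        grouped.forEach((word, counts) -> System.out.println(word + "\t" + reduce(counts)));
    }
}
```

In real Hadoop the mapper and reducer run on different machines and the grouping is done by the framework's shuffle; only the `map` and `reduce` logic here corresponds to code a programmer would write.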
  14. Word Count using Map Reduce
  15. Map Reduce Architecture - 2
      - Map phase
        - Map tasks run in parallel and output intermediate key/value pairs
      - Shuffle and sort phase
        - Map task output is partitioned by hashing the output key
        - The number of partitions is equal to the number of reducers
        - Partitioning ensures that all key/value pairs sharing the same key belong to the same partition
        - Each map output partition is sorted by key to group all values for the same key
      - Reduce phase
        - Each partition is assigned to one reducer
        - Reducers also run in parallel
        - No two reducers process the same intermediate key
        - A reducer gets all the values for a given key at the same time
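The partitioning step above mirrors the logic of Hadoop's default HashPartitioner; a minimal sketch (the class name here is illustrative):

```java
class HashPartitionerSketch {
    // Same idea as Hadoop's default HashPartitioner: mask off the sign bit
    // so the result is non-negative, then mod by the number of reduce tasks.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4;
        // The same key always hashes to the same partition, so a single
        // reducer sees every value emitted for that key, from every mapper.
        System.out.println(getPartition("fox", reducers) == getPartition("fox", reducers)); // true
        System.out.println(getPartition("fox", reducers) < reducers);                      // true
    }
}
```

Because the partition depends only on the key, every mapper routes a given key to the same reducer, which is exactly the guarantee the reduce phase relies on.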
  16. Map Reduce Architecture - 2
  17. Map Reduce Architecture - 2
      - Job tracker
        - splits the input and assigns it to the various map tasks
        - schedules and monitors map tasks (heartbeat)
        - on completion, schedules reduce tasks
      - Task tracker
        - executes map tasks - calls the mapper for every input record
        - executes reduce tasks - calls the reducer for every (intermediate key, list of values) pair
        - handles partitioning of map outputs
        - handles sorting and grouping of reducer input
  18. Map Reduce Advantages
      - Locality
        - The Job tracker divides tasks based on the location of the data: it tries to schedule map tasks on the same machine that holds the physical data
      - Parallelism
        - Map tasks run in parallel, working on different input data splits
        - Reduce tasks run in parallel, working on different intermediate keys
        - Reduce tasks wait until all map tasks are finished
      - Fault tolerance
        - The Job tracker maintains a heartbeat with the task trackers
        - Failures are handled by re-execution
        - If a task tracker node fails, all tasks scheduled on it (completed or incomplete) are re-executed on another node
  19. Conclusion
      - Map Reduce greatly simplifies writing large-scale distributed applications
      - Used for building the search index at Google and Amazon
      - Widely used for analyzing user logs, data warehousing, and analytics
      - Also used for large-scale machine learning and data mining applications
  20. References
      - Hadoop. http://hadoop.apache.org/
      - Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. http://labs.google.com/papers/mapreduce.html
      - http://code.google.com/edu/parallel/index.html
      - http://www.youtube.com/watch?v=yjPBkvYh-ss
      - http://www.youtube.com/watch?v=-vD6PUdf3Js
      - S. Ghemawat, H. Gobioff, and S. Leung. The Google File System. http://labs.google.com/papers/gfs.html