Distributed Computing & MapReduce


Shared by Mansoor Mirza
Distributed Computing
What is it?
Why & when we need it?
Comparison with centralized computing
‘MapReduce’ (MR) Framework
Theory and practice
‘MapReduce’ in Action
Using Hadoop
Lab exercises

  • Course theme: learning multi-core programming using different paradigms and techniques, e.g. raw threads like POSIX threads, Threading Building Blocks, SIMD-style programming using the PS3, CUDA, etc. Today we will see another style of parallel programming, somewhat similar to SIMD, called SPMD (single program, multiple data). -In some sense you have been introduced to parallelization techniques from the most generic (and probably the most difficult to manage!) to the less generic (and probably less powerful but less difficult). -MapReduce was invented by Google and presented at OSDI in 2004. Over the past 6 or 7 years MapReduce has become a de facto way to crunch huge data.
  • Before we talk specifically about MapReduce, you need to learn some generic things about distributed computing: what clusters are and how they are built.
  • Let's do a 5-minute exercise by asking the audience what they think / know about distributed computing.
  • Distributed computing is probably one of those things that are difficult to define succinctly in a few lines. But let's give it a shot.
  • What is distributed computing, as compared to centralized computing? -The network and its intricacies ~ bandwidth, delay, jitter, etc. -Similarly, the software design and / or the software tools can have their effect on the whole distributed system. -The element of complexity, and how to tame it
  • Why on earth do we need distributed computing? Distributed programming is more difficult than centralized programming because of the network, failure models (some parts can fail while other parts stay alive), etc. -Over time a single machine keeps increasing in capability, and one should always stop and think whether it is enough. Go to distributed computing only out of need. -Example: John's census example; now thinking of moving the census setup to a single server
  • Problem: an MSN-like chat service with possibly millions of users. Partition users across multiple servers (+ only a portion of the user base will go down: graceful degradation). How should this partitioning be done? Based on country? Load?
  • Take away message is that as a professional when you are making a decision to use centralized or distributed infrastructure, base your decision on the real need.
  • Google and many others use an implementation of MapReduce which runs on a cluster of commodity computers (though other implementations are available as well, e.g. Stanford's Phoenix system, which uses the multiple cores within a single computer as compute nodes).
  • -It also depends on who the user of the cluster is. E.g. the Library of Congress uses blade servers, which are easy for them to manage because their core business is not computing; when something goes wrong in a blade, that server is pulled out and sent to the vendor for maintenance. -Similarly, AWS can get a beefy server and make many system virtual machines out of it. (We will talk a lot more about this in tomorrow's lecture.)
  • When you need more ‘power’ what will you do; grow a bigger ‘ox’ or use multiple oxen?
  • MPP = Massively Parallel Processor (think of a big ox!) Constellations = Sun Microsystems' (now acquired by Oracle) SPARC-based systems
  • Cost / $
  • So now the stage is set to introduce MapReduce
  • Generally, what is MapReduce? It is a framework for parallel programming that is good for a specific class of problems.

1. Distributed Computing & MapReduce
   Presented by: Abdul Qadeer

2. Today's Agenda
   - Distributed Computing
     - What is it?
     - Why & when we need it?
     - Comparison with centralized computing
   - 'MapReduce' (MR) Framework
     - Theory and practice
   - 'MapReduce' in Action
     - Using Hadoop
     - Lab exercises
   Feel free to comment / ask questions anytime!

3. Distributed Computing
4. Distributed Computing
   - A way of computing where:
     - The software system has multiple sub-components
     - Sub-components are loosely coupled using a 'network'
     - The placement of, and algorithms of interaction between, sub-components are constructed to meet system goals
     - System goals can be:
       - Fault tolerance
       - Scalability in some design dimension (e.g. increasing system load)
       - Reliability
       - Performance (e.g. many computers working together)
       - Cost (better yet, performance / $)
       - Usability / accessibility / ease of use (e.g. a distributed file system)
       - Easy and cost-effective software maintenance (e.g. ChromeBook)
       - Etc.

5. Distributed Computing - What
   - Everything on a single computer
     - E.g. an old POS system in a small shop
   - Use of multiple computers
     - Client-server model
     - Multiple tiers (2 .. N)
     - Logical / physical tiers
       - Depends on complexity, load, ...

6. Distributed Computing - Why
   - A single computer has limitations
     - Computation limits
     - Memory (storage) limits
     - Bandwidth limits
   - But these limitations are relative
     - Today a single computer is darn powerful!
       - E.g. Ethane - taking control of the enterprise
     - Multiple cores (8-core systems are common)
     - 6 to 8 GB memory (up to 3 TB hard disk)
     - 10 Gbps network interfaces

7. Distributed Computing - Why
   - Reliability
     - Failure model
     - Probability of failure
     - The concept of redundancy
   - Scalability in different dimensions
     - Load balancing (distribution or partitioning)
     - MSN messenger user example
     - Scalability always has dimensions
   - Cost
     - 1 beefy server vs. many commodity machines
     - Economy of scale
8. Distributed Computing
   - Can use 'toy problems' to learn the different concepts of distributed computing
   - Real-world use should be based on a real need
     - There are many real-world problems where distributed computing is needed
       - Large-scale web applications
         - Computers, cell phones, etc.
       - Data post-processing
         - Finding clues in data, weather forecasting
       - Many scientific and HPC applications
         - Discussed previously in another lecture of this course
9. Clusters

10. How Clusters are Built?
   - Two possible approaches
     - Big, expensive, capable server-class machines
       - Usually only governments used to have that much money!
       - There are limits beyond which a single machine can't go!
       - Any failure in a big machine means disruption for a large population of clients
     - Off-the-shelf, ordinary, cheap computers connected with ordinary Ethernet!
       - Cost-effective in terms of operational cost
       - Failure of one machine only disrupts a small portion of the operations
       - Most industry clusters are built like this!
11. How Clusters are Built?
   "In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers."
   — Grace Hopper
   http://www.flickr.com/photos/drurydrama/; http://www.fotosearch.com/photos-images/ox.html

12. How Clusters are Built?
   Table generated at: http://top500.org/stats/list/37/archtype

13. How Clusters are Built?
   - Mostly 64-bit processors are used

14. How Clusters are Built?
   - Ethernet holds a major share
15. How Clusters are Used?
   - Different ways depending on the consumer
     - Academic use
       - The end user logs in to 'head nodes'
       - Head nodes are beefy systems used for non-compute-intensive tasks, or tasks with a good mix of compute and I/O, e.g. compiling code
       - Cluster machines are usually not accessible directly; only via a head node
       - Machine acquisition, job submission, job monitoring (e.g. using the qsub command in a PBS-based system)

16. How Clusters are Used?
   - Different ways depending on the consumer
     - Commercial use
       - Infrastructure as a service
         - Raw machines
         - E.g. AWS selling machine instances
       - Platform as a service
         - Machine + OS + other software stack
         - E.g. Google's App Engine
       - Application as a service
         - Using Gmail, Google search, etc. via a browser
         - Google Docs
       - Google programmers sharing cluster machines using proprietary software
17. MapReduce

18. The Problem at Hand!
   - Many organizations are data-centric
     - Google, Yahoo, Bing, etc. (search engines)
     - Facebook, Twitter, MySpace, etc. (social networking, blogs)
     - NYT
     - Stock exchanges
   - Data is increasing at a high rate
     - New web pages are added each day
     - The NY stock exchange produces 1 TB of data every day!

19. The Goal
   - To "process" data:
     - In reasonable time
       - A single machine has capability limits!!!
     - Cost-effectively
       - Cost coming from hardware resources
         - No supercomputers!!
         - Not even high-end server machines
       - Programmer hours
         - Applicability (a framework for similar problems)
         - Simplicity
         - No specific / expensive programmer training

20. Elaboration by Example
   - Word count
   - Input:
     - Data size = 20 TB
   - Output:
     - A file as follows:
       <word1 12>
       <word2 30>
       <word3 34>
       ...
   - Solution?
     - Single-machine solutions
       - Don't scale well!!!
     - Use multiple machines (MPI?)
21. Solution 1
   - Pseudocode
     (a) Make a dynamically extendable hash table in which the word is the "key" and an integer count is its "value"
     (b) while (read a line == true)
       (c) parse the line on the basis of space, tab, newline
       (d) for each word parsed in step (c)
         (e) insert or update <word, oldval + 1> in the hash table
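The Solution 1 pseudocode above can be sketched in Python (a hypothetical minimal version, not from the slides; the slide's hash table becomes a dict):

```python
from collections import defaultdict

def word_count(lines):
    """Single-machine word count: one in-memory hash table."""
    counts = defaultdict(int)       # word -> running count, step (a)
    for line in lines:              # step (b)
        for word in line.split():   # step (c): split on spaces, tabs, newlines
            counts[word] += 1       # steps (d)-(e): insert or update <word, oldval + 1>
    return dict(counts)
```

For example, `word_count(["the quick brown fox", "the lazy dog"])` maps "the" to 2 and every other word to 1; the scalability problems the next slide lists (table too big for memory, slow disk reads) apply to this sketch unchanged.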
22. What is Wrong with Solution 1?
   (a) Make a dynamically extendable hash table in which the word is the "key" and an integer count is its "value"
   (b) while (read a line == true)
     (c) parse the line on the basis of space, tab, newline
     (d) for each word parsed in step (c)
       (e) insert or update <word, oldval + 1> in the hash table
   - A big hash table which might not fit in memory
   - Might end up using swap; a typical access pattern can cause thrashing!
   - Reading from disk is very slow (i.e. slow I/O)
   - Expect frequent cache misses and hence poor performance

23. How to Improve on Solution 1?
   - Multithreaded application
     - Pros:
       - Can fully utilize the disk bandwidth (~64 MB/s)
     - Cons:
       - Dividing the input file among threads
       - Locking on hot hash-table entries (e.g. "the")
         - One lock for the whole hash table (very poor performance!)
         - One lock per hash-table element (lots of lock state!)
         - Deadlock prevention (lock-free design)
   More cons than pros! No guarantee of speedup.
24. Solution 2
   - Pseudocode
     (a) while (read line == true) {
       (b) parse the line on white spaces, tabs, newlines
       (c) for each word in step (b) do
         (d) emit the tuple <word 1>
     } // end of while
     (e) sort all the tuples emitted in step (d) on the basis of word
     (f) for the list of tuples produced in step (e) { sum up similar words and emit <word final count> }
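The emit-sort-sum pipeline of Solution 2 might look like this in Python (a minimal single-machine sketch, not part of the original slides):

```python
def word_count_sorted(lines):
    """Emit <word, 1> tuples, sort by word, then sum each run of equal words."""
    tuples = []
    for line in lines:
        for word in line.split():
            tuples.append((word, 1))        # step (d): emit <word 1>
    tuples.sort(key=lambda kv: kv[0])       # step (e): sort on word
    result = []
    for word, one in tuples:                # step (f): sum up similar words
        if result and result[-1][0] == word:
            result[-1] = (word, result[-1][1] + one)
        else:
            result.append((word, one))
    return result
```

Note that no hash table is updated concurrently, which is why the next slides can drop locking from the cons list; the cost moves into the sort, which is exactly the bottleneck slide 26 calls out.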
25. Solution 2 Elaboration
   - Example text
     "the quick brown fox jumps over the lazy dog"
   - Step 1:
     <0, the quick brown fox jumps over the lazy dog>
     <the 1>, <quick 1>, <brown 1>, <fox 1>, <jumps 1>, <over 1>, <the 1>, <lazy 1>, <dog 1>
   - Step 2:
     <brown 1>, <dog 1>, <fox 1>, <jumps 1>, <lazy 1>, <over 1>, <quick 1>, <the 1>, <the 1>
     Emit output <key, value> pairs:
     <brown 1>, <dog 1>, <fox 1>, <jumps 1>, <lazy 1>, <over 1>, <quick 1>, <the 2>

26. What is Wrong with Solution 2?
   - Pros:
     - No locking
     - No concurrency to handle in the hash map
   - Cons:
     - Reading from disk (slow I/O again!)
     - Intermediate keys might not fit in memory (data too big to fit!)
     - An external sort might need to be used
     - The sorting might become a bottleneck

27. How to Improve Solution 2?
   - Most of the problems are related to "scalability"
   - There is only so much you can extract from one machine!
   - Use multiple networked machines?
     - Distributed-computing intricacies
       - How to divide work efficiently (who does what)
       - Scalability of the solution with increasing problem size
       - Reliability, fault tolerance
       - Network-related problems (link failures, delays, etc.)
   So ideally we need a solution which takes care of the messy / complicated parallelism and distributed-computing details
28. What is MapReduce?
   - MapReduce is a framework to
     - Process enormous data (multi-terabyte)
     - Efficiently
     - Using a cluster of ordinary machines
     - With linear scalability
     - Simple for end programmers
       - Only write two small pieces of code and that's it!
   - Not a general parallel programming model!
     - Not every parallel problem is solvable by it
       - E.g. producer-consumer problems
     - MR is good where sub-tasks either do not communicate with each other, or any communication can be handled at the "map end and reduce start" stage

29. Nuts and Bolts of MapReduce
   Fig. 1 taken from the OSDI 2004 paper: MapReduce: Simplified Data Processing on Large Clusters
   The application programmer writes mapper and reducer code; the rest is automatic! The messy details of parallelism, scalability, and fault tolerance are taken care of by the framework.

30. Word Count Example
   - Example text
     "the quick brown fox jumps over the lazy dog"
   - Step 1 - Split input:
     - Split usually on the basis of the size of the input data and the available machines (assume only one machine for this example!)
   - Step 2 - Map phase:
     <0, the quick brown fox jumps over the lazy dog>
     <the 1>, <quick 1>, <brown 1>, <fox 1>, <jumps 1>, <over 1>, <the 1>, <lazy 1>, <dog 1>
   - Step 3 - Distribute:
     - If there is only one reducer, all the intermediate key-value pairs are placed in a single intermediate file
   - Step 4 - Reduce:
     - Copies the intermediate file locally and sorts it on the key
       <brown 1>, <dog 1>, <fox 1>, <jumps 1>, <lazy 1>, <over 1>, <quick 1>, <the 1>, <the 1>
     - Emits output <key, value> pairs:
       <brown 1>, <dog 1>, <fox 1>, <jumps 1>, <lazy 1>, <over 1>, <quick 1>, <the 2>
31. Word Count Example
   - 2 mappers and 2 reducers
   - Example text
     "the quick brown fox jumps over the lazy dog"
   - Step 1 - Split input:
     - "the quick brown fox"
     - "jumps over the lazy dog"
   - Step 2 - Map phase:
     - Mapper 1:
       <0, the quick brown fox>
       <the 1>, <quick 1>, <brown 1>, <fox 1>
     - Mapper 2:
       <0, jumps over the lazy dog>
       <jumps 1>, <over 1>, <the 1>, <lazy 1>, <dog 1>

32. Word Count Example
   - Step 3 - Distribute:
     - Two intermediate files per mapper because there are 2 reducers
     - A hash function is used to place intermediate key-value pairs into one of the "buckets"
     - Words starting with A to M (capital or small) go in bucket A, others in B
     - For Mapper 1:
       - File A will have the pairs: <brown 1>, <fox 1>
       - File B will have the pairs: <the 1>, <quick 1>
     - For Mapper 2:
       - File A will have the pairs: <jumps 1>, <lazy 1>, <dog 1>
       - File B will have the pairs: <over 1>, <the 1>

33. Word Count Example
   - Step 4 - Reduce:
     - Reducer 1:
       - Fetches file A from both mapper 1 and mapper 2
       - Merges them and sorts on the key
         <brown 1>, <dog 1>, <fox 1>, <jumps 1>, <lazy 1>
       - Emits the final <key, value> pairs
     - Reducer 2:
       - Fetches file B from both mapper 1 and mapper 2
       - Merges them and sorts on the key
         <over 1>, <quick 1>, <the 1>, <the 1>
       - Emits:
         <over 1>, <quick 1>, <the 2>
     - So the final output has 2 files
       - Merge them or feed them to the next stage's mappers
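The four steps of the 2-mapper / 2-reducer walkthrough can be simulated in plain Python (a toy sketch, not from the slides; the names `mapper`, `partition`, and `reducer` are illustrative, and the first-letter hash mirrors the A-to-M rule above):

```python
def mapper(text):
    """Map phase: emit a <word, 1> pair for every word in this split."""
    return [(w, 1) for w in text.split()]

def partition(pairs):
    """Distribute: hash into bucket A (first letter a..m) or bucket B (others)."""
    a, b = [], []
    for k, v in pairs:
        (a if k[0].lower() <= "m" else b).append((k, v))
    return a, b

def reducer(files):
    """Reduce: fetch one bucket from every mapper, merge-sort on key, sum values."""
    merged = sorted(pair for f in files for pair in f)
    out = {}
    for k, v in merged:
        out[k] = out.get(k, 0) + v
    return sorted(out.items())

splits = ["the quick brown fox", "jumps over the lazy dog"]   # step 1: split input
m1_a, m1_b = partition(mapper(splits[0]))                     # steps 2-3, mapper 1
m2_a, m2_b = partition(mapper(splits[1]))                     # steps 2-3, mapper 2
r1 = reducer([m1_a, m2_a])   # <brown 1>, <dog 1>, <fox 1>, <jumps 1>, <lazy 1>
r2 = reducer([m1_b, m2_b])   # <over 1>, <quick 1>, <the 2>
```

Running this reproduces exactly the per-file and per-reducer outputs the slides trace by hand.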
34. Word Count Example - A Quiz
   - Assume:
     - I have a huge text file
     - 50,000 mappers, 2 reducers
     - The same hash function is used
   - Questions:
     - The pair <the 1> will always end up at Reducer 2 - true or false?
     - How many intermediate files will each mapper produce?
       (a) 50,000  (b) 2  (c) 100,000
35. Why Split Data?
   - Exploiting locality
     - Data on the same local disk (best case)
     - I/O operations are too slow
     - Somewhere in the same rack
     - Other nearness metrics
     - In GFS, by default each chunk of data is 64 MB long and each chunk is usually replicated at three different places
   - Utilizing many machines
     - Each split is given to a different machine to work on
     - Increasing parallelism
   - A fine-grained split is better
     - Faster machines can do more work

36. The Power of Splitting
37. Fault Tolerance
   - Worker failures
     - Machines can fail, disks can crash
     - Tasks are re-scheduled on other machines
     - Idempotent operations make this simple
   - Master failure
     - Might be a single point of failure
     - Simple mechanisms like writing checkpoints can solve the problem
     - Replicate it (the approach proposed by Apache Hadoop)
     - Presents an important point in the design space
       - Introduce only as much complexity in the system as necessary

38. Backup Tasks
   - Straggler jobs
     - Slow jobs due to a failing hard disk
       - 30 MB/s dropping to 1 MB/s
     - Some other job running on the same machine
       - Jobs competing for CPU / disk I/O
   - Run duplicate jobs
     - Whichever finishes first is taken
     - The other duplicate is killed

39. Performance due to Backup Tasks
   Fig. 3 taken from the OSDI 2004 paper: MapReduce: Simplified Data Processing on Large Clusters
40. Combiners
   - A way to condense the number of intermediate keys
     - If e.g. the keys are as follows:
       <the 1>, <the 1>, ... becomes <the n>
     - Reducing the needed bandwidth
   - Only applicable if the operation is commutative and associative
     - Counting is (summing up is commutative & associative)
     - Mean is not!
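A minimal sketch of why counting is combinable but mean is not (illustrative Python, not from the slides; `combine` and `reduce_sum` are hypothetical names):

```python
from collections import Counter

def combine(pairs):
    """Mapper-side combiner: collapse repeated <word 1> pairs into <word n>."""
    c = Counter()
    for k, v in pairs:
        c[k] += v
    return sorted(c.items())

def reduce_sum(pairs):
    """Reducer: sum values per key, combined or not."""
    c = Counter()
    for k, v in pairs:
        c[k] += v
    return dict(c)

raw = [("the", 1), ("the", 1), ("fox", 1), ("the", 1)]
# Summing is commutative and associative, so combining first changes nothing
# except the number of pairs shipped to the reducer:
assert reduce_sum(raw) == reduce_sum(combine(raw))   # {'the': 3, 'fox': 1}

# Mean is NOT directly combinable: mean([1, 2]) and mean([3]) average to 2.25,
# but the true mean of [1, 2, 3] is 2.0. To average with a combiner, ship
# (sum, count) pairs instead and divide only at the reducer.
```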
41. Example Uses of MapReduce
   - Search (grep)
   - Sort
   - Large-scale indexing
   - Count of URL access frequency
   - Reverse web-link graph
   - Term-vector per host
   - Inverted index
   - Etc.

42. Web Search
   - Google's web search index was changed to use MapReduce (actually, it was re-written)
   Fig. taken from the OSDI 2004 paper: MapReduce: Simplified Data Processing on Large Clusters

43. Yahoo Now Using MapReduce
   - Yahoo followed suit after 4 years

44. Case Studies
45. Case Study 1: New York Times
   - Put the 1851 to 1980 newspaper articles online
     - There are about 11 million articles
     - Articles were stored as scanned images
     - On each request these images were glued together on the fly
       - Can be slow
       - Can be stupid! (may be doing redundant work)
   - The new design
     - Glue up all the articles and make PDF documents
   - EC2, S3 and Hadoop
     - Uploaded 4 TB to S3
     - 100 EC2 instances did the work in 24 hours
     - 1.5 TB of output data, again stored in S3
     - Data served to clients from S3

46. Case Study 2: IPv4 Census
   - John Heidemann et al. conducted an IPv4 census
     - There are about 4 billion IPv4 addresses
     - Addresses are about to be exhausted
     - Seeing usability trends using pings and their responses
   - A Hilbert curve is used to present the 32-bit IP responses in 2 dimensions

47. Case Study 2: IPv4 Census
   - Lab machines were used to run Hadoop
     - Machines share resources with other processes
     - The Hadoop process has normal priority; it can be pre-empted by higher-priority processes
     - Machine sharing is the same as in Wisconsin's Condor project
     - The cluster has about 20 machines
     - File / data sharing using NFS and HDFS mounts
     - Improvised use of NFS to deploy the latest version of Hadoop
   - Census data
     - Each census file is about 37 GB
48. Hadoop

49. MapReduce in Action!
   - MapReduce's open-source implementation by Apache is called Hadoop
   - Hadoop:
     - 1 TB sorting record in 209 seconds (a little less than 3 minutes!) using 910 machines
   - Google's MapReduce implementation:
     - 1 TB of records sorted in 68 seconds using 1,000 machines
   - Question:
     - Why was Google able to sort 3 times faster than Hadoop?

50. Hadoop and HPCNL Cluster
   - Hadoop can be run on
     - Linux-like OSes
     - Windows with Cygwin
   - Hadoop's operating modes are:
     - Stand-alone
       - Good for development
     - Pseudo-distributed
       - Good for debugging / testing
     - Fully distributed
   - Mapper / reducer code can be written in:
     - Anything executable on a Linux shell!!!
       - C++ executables
       - Scripts like Python, shell scripts, etc.
       - Java programs
   Let's have a visual tour of Hadoop!
51. Cluster Summary

52. Worker State

53. Job Status

54. Graphical Progress Report

55. Speedy Machines do More Work

56. Backup Jobs

57. Failures are Frequent

58. Backwards Progress

59. Backwards Progress

60. Blacklisting

61. Slow Machines

62. Stragglers

63. Multiple Backup Tasks

64. Sequential WordCount
65. Word Counting Mapper and Reducer
   - Mapper code
   - Reducer code
66. Concluding Remarks
   - Master MapReduce / Hadoop because of its wide applicability
     - Today's lab exercises get you started
     - Utilize it in your term / final-year projects
     - Build something in your free time which uses MapReduce
   - MapReduce adds another tool to your repertoire of parallel programming
     - Using the right tool at the right time remains your responsibility!
     - MapReduce is widely applicable, but you can't use it for every problem!
     - Misusing MapReduce for a problem where some other tool might be better will most probably result in degraded performance
   Any more questions?