Shared by Mansoor Mirza
Distributed Computing
What is it?
Why & when we need it?
Comparison with centralized computing
‘MapReduce’ (MR) Framework
Theory and practice
‘MapReduce’ in Action
Using Hadoop
Lab exercises
Nuts and Bolts of MapReduce. Fig. 1 taken from the OSDI 2004 paper: MapReduce: Simplified Data Processing on Large Clusters. The application programmer writes mapper and reducer code; the rest is automatic! The messy details of parallelism, scalability, and fault tolerance are taken care of by the framework.
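To make the "programmer writes only mapper and reducer" idea concrete, here is a minimal sketch in Python using the paper's canonical word-count example. The `run_wordcount` driver is a toy stand-in for the framework's split/shuffle/group machinery, not the real implementation; all names are illustrative.

```python
from collections import defaultdict

def mapper(doc_name, doc_text):
    """User code: emit an intermediate (word, 1) pair for every word."""
    for word in doc_text.split():
        yield (word, 1)

def reducer(word, counts):
    """User code: sum all partial counts for one word."""
    return (word, sum(counts))

def run_wordcount(docs):
    """Toy stand-in for the framework: split, map, shuffle, reduce."""
    intermediate = defaultdict(list)      # the 'shuffle' groups values by key
    for name, text in docs.items():
        for key, value in mapper(name, text):
            intermediate[key].append(value)
    return dict(reducer(k, vs) for k, vs in intermediate.items())

result = run_wordcount({"d1": "the cat the dog", "d2": "the cat"})
# result == {"the": 3, "cat": 2, "dog": 1}
```

In the real framework the map and reduce calls run in parallel across many machines; only the two user functions stay the same.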
Course theme: learning multi-core programming using different paradigms and techniques, e.g. raw threads like POSIX Threads, Threading Building Blocks, and SIMD-style programming using the PS3, CUDA, etc. Today we will see another style of parallel programming, somewhat similar to SIMD, called SPMD (single program, multiple data). -In some sense you have been introduced to parallelization techniques ranging from the most generic (and probably the most difficult to manage!) to the less generic (and probably less powerful, but also less difficult to manage). -MapReduce was invented at Google and presented at OSDI in 2004. Over the past six or seven years, MapReduce has become the de facto way to crunch huge data sets.
Before we talk specifically about MapReduce, you need to learn about: some generic background on distributed computing; what clusters are and how they are built.
Let's do a five-minute exercise by asking the audience what they think / know about distributed computing.
Distributed computing is probably one of those things that are difficult to define succinctly in a few lines, but let's give it a shot.
What is distributed computing, as compared to centralized computing? -The network and its intricacies: bandwidth, delay, jitter, etc. -Similarly, the software design and/or the software tools can have an effect on the whole distributed system. -The element of complexity, and how to tame it.
Why on earth do we need distributed computing? Distributed programming is more difficult than centralized programming because of the network, partial-failure modes (some parts can fail while others stay alive), etc. -Over time, a single machine keeps growing in capability, and one should always stop and ask whether that is enough. Go to distributed computing only out of need. -Example: John's census system; he is now thinking of moving the census setup to a single server.
Problem: an MSN-like chat service with possibly millions of users. Partition users across multiple servers (+only a portion of the user base can go down at a time; graceful degradation). How should this partitioning be done? Based on country? Load?
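A hypothetical sketch of the two partitioning options raised above. The server count, hash scheme, and country table are illustrative assumptions, not a real chat-service design.

```python
import hashlib

NUM_SERVERS = 8  # assumed cluster size, purely illustrative

def server_for_user(user_id: str) -> int:
    """Hash-based partitioning: spreads users evenly across servers,
    but ignores geography and actual load."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SERVERS

def server_for_country(country_code: str, country_map: dict) -> int:
    """Country-based partitioning: locality-friendly, but load can be
    badly skewed if one country dominates the user base."""
    return country_map[country_code]
```

Either way, a single server failure takes down only its slice of users, which is the graceful-degradation property the note is after.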
The take-away message: as a professional, when you are deciding between centralized and distributed infrastructure, base your decision on the real need.
Google and many others use an implementation of MapReduce that runs on a cluster of commodity computers (though other implementations are available as well, e.g. Stanford's Phoenix system, which uses the multiple cores within a single computer as compute nodes).
-It also depends on who the user of the cluster is. E.g. the Library of Congress uses blade servers, which are easy for them to manage because their core business is not computing; when something goes wrong in a blade, that server is pulled out and sent to the vendor for maintenance. -Similarly, AWS can take a beefy server and carve many system virtual machines out of it. (We will talk a lot more about this in tomorrow's lecture.)
When you need more 'power', what will you do: grow a bigger 'ox' or use multiple oxen?
MPP = Massively Parallel Processor (think of one big ox!). Constellations = Sun Microsystems' (now acquired by Oracle) SPARC-based systems.
So now the stage is set to introduce MapReduce
Generally, what is MapReduce? It is a framework for parallel programming that works well for a specific class of problems.
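The "specific class of problems" is roughly: an embarrassingly parallel map over independent records, followed by aggregation of values grouped by key. Another textbook instance of this shape is building an inverted index (word → documents containing it). A minimal sketch, with an in-memory dict standing in for the framework's shuffle step; all names are illustrative.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Emit (word, doc_id) once per distinct word in the document."""
    for word in set(text.split()):
        yield (word, doc_id)

def reduce_phase(word, doc_ids):
    """Collect and sort the document list for one word."""
    return (word, sorted(doc_ids))

def inverted_index(docs):
    groups = defaultdict(list)        # stand-in for the shuffle/group step
    for doc_id, text in docs.items():
        for key, value in map_phase(doc_id, text):
            groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())

index = inverted_index({"d1": "map reduce", "d2": "reduce"})
# index == {"map": ["d1"], "reduce": ["d1", "d2"]}
```

Word count, inverted indexing, log analysis, and grep-style filtering all fit this map-then-group-then-reduce mold; problems with tight cross-record dependencies do not.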