Cloud Computing & MapReduce: Parallel Processing on a Massive Scale
Geoff Rothman (email@example.com)
March 27, 2010

Outline
1. Overview of Cloud Computing
  – Establish a general definition
2. Overview of Google MapReduce
  – Parallel programming with Cloud Computing
3. Debate between MapReduce & Parallel DBMS
  – Is one better than the other, or are they complementary?

“SPI Model - as a Service”
• Software as a Service (SaaS):
  – Application system (Salesforce, WebEx)
• Platform as a Service (PaaS):
  – Infrastructure pre-existing; simply code and deploy (Google AppEngine, MS Azure, Force.com)
• Infrastructure as a Service (IaaS):
  – Raw infrastructure, servers and storage provided on-demand (Amazon Web Services, GoGrid)

Cloud Deployment Models
• Private
  – Single tenant, owned and managed by company or service provider either on or off-premise; consumers are trusted
• Public
  – Single or multi-tenant (shared), owned by service provider off-premise; consumers are untrusted
• Managed
  – Single or multi-tenant (shared), located in org’s datacenter but managed and secured by service provider; consumers are trusted or untrusted
• Hybrid
  – Combination of public/private offering; “cloud burst”; consumers are trusted or untrusted

Why Use the Cloud? CFO View
• Operational vs. Capital Expenditures
• Better Cash Flow
• Limited Financial Risk
• Better Balance Sheet
• Outsource non-core competencies

Why Use the Cloud? CIO View
• Analytics
• Parallel Batch Processing
• Compute-intensive desktop apps
• Mobile Interactive Apps (GUI for mashups)
• Webserver uptime / redundancy
• Accelerate project rollouts

Cloud Computing & Parallel Batch Processing: Overview of Map/Reduce
• Developed by Google to perform simple computations on massive amounts of data (> 1 TB) in a substantially reduced amount of time
• Hides details for
  – Parallelization
  – Data distribution
  – Load balancing
  – Fault tolerance

MapReduce Programming Model
Input & Output: each a set of key/value pairs
Code two functions: map & reduce

map (in_key, in_value) -> list(out_key, intermediate_value)
• Processes input key/value pair
• Produces set of intermediate pairs

reduce (out_key, list(intermediate_value)) -> list(out_value)
• Combines all intermediate values for a particular key
• Produces a set of merged output values (usually just one)

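To make the two signatures above concrete, here is a minimal single-process sketch in Python. It is not Google's library: `run_mapreduce`, `parity_map` and `sum_reduce` are hypothetical names chosen for illustration, and the "shuffle" is just an in-memory dictionary that groups intermediate values by key.

```python
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Single-process stand-in for the MapReduce library (illustrative only)."""
    # Map phase plus shuffle: run map_fn on every input pair and group the
    # intermediate values by their output key.
    intermediate = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, intermediate_value in map_fn(in_key, in_value):
            intermediate[out_key].append(intermediate_value)
    # Reduce phase: one reduce_fn call per distinct key.
    return {key: list(reduce_fn(key, values))
            for key, values in sorted(intermediate.items())}


def parity_map(in_key, in_value):
    # Hypothetical example: in_value is a string of space-separated integers.
    for token in in_value.split():
        number = int(token)
        yield ("even" if number % 2 == 0 else "odd", number)


def sum_reduce(out_key, intermediate_values):
    # Merge all values for one key into a single output value.
    yield sum(intermediate_values)


totals = run_mapreduce([("doc1", "3 4"), ("doc2", "5")], parity_map, sum_reduce)
print(totals)  # {'even': [4], 'odd': [8]}
```

A real deployment would spread the map and reduce calls over many machines; the point here is only the shape of the two user-supplied functions.
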
Case 1: Word Count
Determine frequency of words in a file.

Map function (assign a value of 1 to every word):
- input is (file offset, various text)
- output is a key-value pair [(word,1)]

The MR library's shuffle step takes the map output and groups it by key using a hash function.

Reduce function (total counts per word):
- input is (word, [1,1,1])
- output is (word, count)

Word Count – Sample Code

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

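The pseudocode above can be approximated in runnable Python as follows. This is a sketch under assumptions: `map_word_count`, `reduce_word_count` and the `word_count` driver are hypothetical names, words are split on whitespace, and the grouping that the MR library performs is simulated with a dictionary. The counts are kept as strings between map and reduce to mirror the pseudocode.

```python
from collections import defaultdict


def map_word_count(key, value):
    # key: document name, value: document contents
    for word in value.split():
        yield word, "1"


def reduce_word_count(key, values):
    # key: a word, values: an iterable of counts (as strings)
    result = 0
    for v in values:
        result += int(v)
    yield str(result)


def word_count(documents):
    # Simulated shuffle: collect every "1" emitted for the same word.
    groups = defaultdict(list)
    for name, contents in documents.items():
        for word, one in map_word_count(name, contents):
            groups[word].append(one)
    # One reduce call per word; convert the single output back to an int.
    return {word: int(next(reduce_word_count(word, counts)))
            for word, counts in groups.items()}


counts = word_count({"File1": "i love to code", "File2": "to code is to love"})
print(counts)
```
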
Word Count Result
File 1: i love to code
File 2: to code is to love

Result:
code,2
i,1
is,1
love,2
to,3

Map tasks:
Map1 (File1): [(i,1)] [(love,1)] [(to,1)] [(code,1)]
Map2 (File2): [(to,1)] [(code,1)] [(is,1)] [(to,1)] [(love,1)]

The MR Library groups intermediate keys and values in the “Shuffle Phase”.

Reduce tasks:
Reducer1:
(code, [1,1]) -> (code,2)
(i, [1]) -> (i,1)
(is, [1]) -> (is,1)
Reducer2:
(love, [1,1]) -> (love,2)
(to, [1,1,1]) -> (to,3)

* File2 will have a key-value pair of (to,2) after map when using MR Combiner functionality

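The Combiner footnote can be illustrated with a small Python sketch. It assumes the combiner simply pre-sums counts inside one map task before anything is sent to the reducers; `map_task` and `combine` are hypothetical helper names, not part of any MapReduce API.

```python
from collections import Counter


def map_task(contents):
    # Plain map output: one (word, 1) pair per word occurrence.
    return [(word, 1) for word in contents.split()]


def combine(pairs):
    # Combiner: merge pairs with the same key locally, on the map side.
    partial = Counter()
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())


file2_pairs = map_task("to code is to love")
combined = combine(file2_pairs)
print(len(file2_pairs), len(combined))  # 5 4
print(dict(combined)["to"])             # 2
```

Summation is associative and commutative, so the reducers still arrive at (to,3); the only difference is that fewer intermediate pairs cross the network.
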
MapReduce Features
• Fault Tolerance
• Redundant Execution
• Locality Optimization
• Skip Bad Records
• Sort before Reduce
• Combiner

Case 2: Distributed Grep
Counts lines in all files that match a <regex> and displays counts.
Other uses include: analyzing web server access logs to find the top requested pages that match a given pattern.

Map function (establish a match):
- input is (file offset, char)
- output is either:
  1. an empty list [] (the line does not match ‘A’ or ‘C’)
  2. a key-value pair [(line, 1)] (if it matches)

Reduce function (total counts):
- input is (char, [1, 1, ...])
- output is (char, n) where n is the number of 1s in the list.

http://web.cs.wpi.edu/~cs4513/d08/OtherStuff/MapReduce-TeamC.ppt

Distributed Grep

File 1    File 2    Result
C         C         3 C
B         A         1 A
B
C

Map tasks:
File1
(0, C) -> [(C, 1)]
(2, B) -> []
(4, B) -> []
(6, C) -> [(C, 1)]
File2
(0, C) -> [(C, 1)]
(2, A) -> [(A, 1)]

Reduce tasks:
(A, [1]) -> (A, 1)
(C, [1, 1, 1]) -> (C, 3)

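A minimal Python sketch of this grep example follows. The pattern `[AC]` is an assumption standing in for the slide's "matches ‘A’ or ‘C’"; `map_grep`, `reduce_grep` and `distributed_grep` are hypothetical names, and the byte offsets are computed by assuming each line ends with a single newline character.

```python
import re
from collections import defaultdict

PATTERN = re.compile(r"[AC]")  # assumed regex: a line matches if it contains A or C


def map_grep(offset, line):
    # Emit (line, 1) for a matching line, otherwise an empty list.
    return [(line, 1)] if PATTERN.search(line) else []


def reduce_grep(key, ones):
    # n is the number of 1s collected for this key.
    return key, sum(ones)


def distributed_grep(files):
    groups = defaultdict(list)
    for lines in files.values():
        offset = 0
        for line in lines:
            for key, one in map_grep(offset, line):
                groups[key].append(one)
            offset += len(line) + 1  # +1 for the assumed newline
    return dict(reduce_grep(key, ones) for key, ones in groups.items())


result = distributed_grep({"File1": ["C", "B", "B", "C"], "File2": ["C", "A"]})
print(result)  # {'C': 3, 'A': 1}
```
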
Case 3: Max Speed Serve
Data analysis needed: for all professional tennis tournaments in the past 3 years, process log files to determine the fastest serve speed each year.

Map function (enumerate speeds for each year):
- input is (file offset, Year Speed)
- output is a key-value pair [(Year,Speed)]

Reduce function (determine max speed each year):
- input is (Year, [speed1, … speedN])
- output is (Year, Speed) where Speed is the fastest recorded that year.

Max Speed Serve Result

File 1      File 2      Result
2008 134    2008 136    2008 - 136
2009 126    2009 127    2009 - 132
2009 132    2010 124    2010 - 124

Map tasks:
File1: [(2008,134)] [(2009,126)] [(2009,132)]
File2: [(2008,136)] [(2009,127)] [(2010,124)]

Reduce tasks:
(2008, [134, 136]) -> (2008,136)
(2009, [126, 132, 127]) -> (2009,132)
(2010, [124]) -> (2010,124)

* A map task will drop a value when using MR Combiner functionality (for example, File 1's (2009,126), which is not that file's maximum for 2009)

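The max-speed job can be sketched in Python as below, using the File 1 and File 2 records from the table. The names `map_speed`, `reduce_max` and `fastest_serves` are hypothetical, each record is assumed to be a "Year Speed" string, and the record index is used as a stand-in for the file offset.

```python
from collections import defaultdict


def map_speed(offset, record):
    # record looks like "2008 134": emit (year, speed) as integers.
    year, speed = record.split()
    return [(int(year), int(speed))]


def reduce_max(year, speeds):
    # Keep only the fastest serve recorded for this year.
    return year, max(speeds)


def fastest_serves(files):
    groups = defaultdict(list)
    for records in files:
        for offset, record in enumerate(records):
            for year, speed in map_speed(offset, record):
                groups[year].append(speed)
    return dict(reduce_max(year, speeds) for year, speeds in sorted(groups.items()))


file1 = ["2008 134", "2009 126", "2009 132"]
file2 = ["2008 136", "2009 127", "2010 124"]
fastest = fastest_serves([file1, file2])
print(fastest)  # {2008: 136, 2009: 132, 2010: 124}
```

Because taking a maximum is associative and commutative, the same `reduce_max` logic could also serve as a combiner on each map task.
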
Case 4: Word Proximity
Find occurrences of pairs of words where word1 is located within 4 words of word2.

Map function (assign a value of 1 to every match):
- input is (file offset, various text)
- output is a key-value pair [(word1|word2,1)]

Reduce function (total count per match):
- input is (word1|word2, [1,1,1])
- output is (word1|word2, count)

Word Proximity
File 1: i have a piece of the pie
File 2: it is a piece of cake; it doesn’t even look like pie
Result: piece|pie,1

Word1 = “piece”    Word2 = “pie”

Map tasks:
(0, i have a piece of the pie) -> [(piece|pie,1)]
(0, it is a piece of cake; it doesn’t even look like pie) -> []

Reduce tasks:
(piece|pie, [1]) -> (piece|pie,1)

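A hedged Python sketch of the proximity job is shown here. It assumes "within 4 words" means the two token positions differ by at most 4, and that tokens are runs of letters, digits and apostrophes; both are illustrative choices, and `map_proximity`, `reduce_count` and `word_proximity` are hypothetical names.

```python
import re
from collections import defaultdict


def map_proximity(offset, text, word1, word2, window=4):
    # Tokenize (assumption: punctuation such as ";" is not part of a word).
    tokens = re.findall(r"[\w'’]+", text.lower())
    pos1 = [i for i, token in enumerate(tokens) if token == word1]
    pos2 = [i for i, token in enumerate(tokens) if token == word2]
    # Emit one pair for every (word1, word2) occurrence within the window.
    return [(f"{word1}|{word2}", 1)
            for i in pos1 for j in pos2 if abs(i - j) <= window]


def reduce_count(key, ones):
    return key, sum(ones)


def word_proximity(files, word1, word2):
    groups = defaultdict(list)
    for text in files:
        for key, one in map_proximity(0, text, word1, word2):
            groups[key].append(one)
    return dict(reduce_count(key, ones) for key, ones in groups.items())


files = [
    "i have a piece of the pie",
    "it is a piece of cake; it doesn’t even look like pie",
]
matches = word_proximity(files, "piece", "pie")
print(matches)  # {'piece|pie': 1}
```

In File 2 the two words are eight positions apart, so its map call emits nothing and only File 1 contributes to the count.
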
Case 5: Reverse Web-Link Graph
Given a list of website home pages (W1…W4) and every link on each page, point the destination sites back to the original source web site.

Map function:
- input is (adjacency list in format source: dest1, dest2..)
- output is a key-value pair (dest, source) for each link

Reduce function (create adjacency list with dest as key):
- input/output is (dest, [source1, source2])

Link Reversal

Input: Adjacency List    Output: reversed list
W1: W2,W4                W1: W2,W4
W2: W1,W3,W4             W2: W1
W3: W4                   W3: W2,W4
W4: W1,W3                W4: W1,W2,W3

Map tasks:
(W1,W2) -> (W2,W1)
(W1,W4) -> (W4,W1)
(W2,W1) -> (W1,W2)
(W2,W3) -> (W3,W2)
(W2,W4) -> (W4,W2)
(W3,W4) -> (W4,W3)
(W4,W1) -> (W1,W4)
(W4,W3) -> (W3,W4)

The MR Library groups intermediate keys and values in the “Shuffle Phase”.

Reduce tasks:
(W1,[W2,W4])
(W2,[W1])
(W3,[W2,W4])
(W4,[W1,W2,W3])

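The link reversal above can be sketched in a few lines of Python. The function names `map_links`, `reduce_links` and `reverse_links` are hypothetical, and the adjacency list is assumed to be already parsed into a dictionary from source page to its list of destinations.

```python
from collections import defaultdict


def map_links(source, dests):
    # Flip every (source -> dest) link into a (dest, source) pair.
    return [(dest, source) for dest in dests]


def reduce_links(dest, sources):
    # Build the reversed adjacency list entry for one destination.
    return dest, sorted(sources)


def reverse_links(adjacency):
    groups = defaultdict(list)
    for source, dests in adjacency.items():
        for dest, src in map_links(source, dests):
            groups[dest].append(src)
    return dict(reduce_links(dest, sources) for dest, sources in sorted(groups.items()))


graph = {
    "W1": ["W2", "W4"],
    "W2": ["W1", "W3", "W4"],
    "W3": ["W4"],
    "W4": ["W1", "W3"],
}
reversed_graph = reverse_links(graph)
print(reversed_graph)
```
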
Why Use MapReduce?
• Hides messy details of distributed infrastructure
• MapReduce simplifies programming paradigm to allow for easy parallel execution
• Easily scales to thousands of machines

MapReduce Jobs Run @ Google

                                Aug. 04    Mar. 06    Sep. 07
Number of jobs (1000s)               29        171      2,217
Avg. completion time (secs)         634        874        395
Machine years used                  217      2,002     11,081
map input data (TB)               3,288     52,254    403,152
map output data (TB)                758      6,743     34,774
reduce output data (TB)             193      2,970     14,018
Avg. machines per job               157        268        394
Unique implementations
  map                               395      1,958      4,083
  reduce                            269      1,208      2,418

Why Not Use a Parallel DBMS?
• Parallel DBMS:
  – multiple CPUs, multiple servers
  – classic parallel programming concepts
  – HUGE established industry $$$
• Parallel DBMS Vendors
  – Teradata (NCR), DB2 (IBM), Oracle (via Exadata), Greenplum, Vertica etc.

“MapReduce is a Major Step Backward”
Stonebraker & DeWitt attack on MR (1/17/08) [10,11]
  – a step backwards in database access
  – a poor implementation
  – not novel
  – missing features
  – incompatible with DBMS tools

“Comparison of Approaches to Large-Scale Data Analysis”
Stonebraker & DeWitt comparison of Hadoop MR vs. Vertica & DBMS-X (7/2009)
  – Hadoop
    • easy to install, get up & running
    • Maintenance of apps harder
    • Good for fault tolerance in queries
    • Slow because it reads the entire file each time and reducers pull files during the reduce step
  – Vertica & DBMS-X
    • much faster than Hadoop because of indexes, schema, column orientation, compression & “warm start-up at boot time”.

“MapReduce and Parallel DBMSs: Friends or Foes?”
DeWitt & Stonebraker update their position (1/2010)
  – Hadoop MR and Parallel DBMS are complementary
  – Use Hadoop MR for subsets of tasks
  – Use Parallel DBMS for all other applications
  – Hadoop still needs significant improvements

“MapReduce: A Flexible Data Processing Tool”
Jeffrey Dean & Sanjay Ghemawat (Google) rebuttal (1/2010)
  – MR can input data from heterogeneous environments
  – MR can use indices as input to MR
  – Useful for complex functions
  – “Protocol Buffers” parse much faster
  – MR pull model non-negotiable
  – Addresses performance concerns

Conclusions
• Hadoop MapReduce is a solid choice for leveraging the power of Cloud Computing when tackling specific parallel data processing tasks; use a PDBMS for all other tasks.
• MR and PDBMS can learn from each other
• Open source Hadoop MR continues to gain ground on performance and efficiency
• Battle of MR vs. PDBMS subsiding for now

References
[1] http://csrc.nist.gov/groups/SNS/cloud-computing/cloud-def-v15.doc
[2] http://en.wikipedia.org/wiki/File:Cloud_computing.svg
[3] http://news.cnet.com/8301-19413_3-10140278-240.html?tag=mncol;txt
[4] http://rationalsecurity.typepad.com/blog/2009/01/cloud-computing-taxonomy-ontology-please-review.html
[5] http://www.opencrowd.com/views/cloud.php/2Security
[6] http://berkeleyclouds.blogspot.com
[7] Forrester Research, Talking to Your CFO About Cloud Computing, Ted Schadler; Oct. 29, 2008.
[8] http://code.google.com/edu/parallel/mapreduce-tutorial.html
[9] http://labs.google.com/papers/mapreduce.html
[10] http://databasecolumn.vertica.com/database-innovation/mapreduce-a-major-step-backwards/
[11] http://databasecolumn.vertica.com/database-innovation/mapreduce-ii/
[12] “Comparison of Approaches to Large-Scale Data Analysis”, Pavlo, Abadi, Stonebraker, DeWitt, et al (7/2009)
[13] ACM, “MapReduce and Parallel DBMSs: Friends or Foes?”, Stonebraker, Abadi, DeWitt, et al (1/2010)
[14] ACM, “MapReduce: A Flexible Data Processing Tool”, Jeffrey Dean & Sanjay Ghemawat (1/2010)
[15] http://googlesystem.blogspot.com/2008/01/google-reveals-more-mapreduce-stats.html