Geoff Rothman Presentation on Parallel Processing

Presentation to University of Kentucky Computer Science graduate students on high-level Cloud Computing, how MapReduce works, and the current competition for Parallel Processing on a Massive Scale

Transcript of "Geoff Rothman Presentation on Parallel Processing"

  1. Cloud Computing & MapReduce: Parallel Processing on a Massive Scale
     Geoff Rothman (rothman@hp.com), March 27, 2010
  2. Outline
     1. Overview of Cloud Computing
        – Establish a general definition
     2. Overview of Google MapReduce
        – Parallel programming with Cloud Computing
     3. Debate between MapReduce & Parallel DBMS
        – Is one better than the other or are they complementary?
  3. Overview of Cloud Computing
  4. Cloud Computing: What Does It Mean?
     • On-demand network access to a shared pool of configurable computing resources [1] [2]
  5. NIST View of Cloud Computing
     • Five characteristics
     • Three service models
     • Four deployment models
  6. Cloud Computing Characteristics
     • On-Demand & Automated
     • Broad Network Access
     • Resource Pooling
     • Rapid Elasticity
     • Measured Service
  7. “SPI Model - as a Service”
     • Software as a Service (SaaS):
       – Application system (Salesforce, WebEx)
     • Platform as a Service (PaaS):
       – Infrastructure pre-existing; simply code and deploy (Google AppEngine, MS Azure, Force.com)
     • Infrastructure as a Service (IaaS):
       – Raw infrastructure, servers and storage provided on-demand (Amazon Web Services, GoGrid) [3]
  8. [4]
  9. [5]
  10. Cloud Deployment Models
      • Private
        – Single tenant, owned and managed by company or service provider either on- or off-premise; consumers are trusted
      • Public
        – Single or multi-tenant (shared), owned by service provider off-premise; consumers are untrusted
      • Managed
        – Single or multi-tenant (shared), located in org’s datacenter but managed and secured by service provider; consumers are trusted or untrusted
      • Hybrid
        – Combination of public/private offering; “cloud burst”; consumers are trusted or untrusted
  11. Why Use the Cloud? CFO View
      • Operational vs Capital Expenditures
      • Better Cash Flow
      • Limited Financial Risk
      • Better Balance Sheet
      • Outsource non-core competencies [7]
  12. Why Use the Cloud? CIO View
      • Analytics
      • Parallel Batch Processing
      • Compute-intensive desktop apps [6]
      • Mobile Interactive Apps (GUI for mashups) [6]
      • Webserver uptime / redundancy
      • Accelerate project rollouts
  13. Overview of Google MapReduce
  14. Cloud Computing & Parallel Batch Processing: Overview of Map/Reduce
      • Developed by Google to perform simple computations on massive amounts of data (> 1 TB) in a substantially reduced amount of time
      • Hides details for
        – Parallelization
        – Data distribution
        – Load balancing
        – Fault tolerance
  15. MapReduce Programming Model [8]
      Input & Output: each a set of key/value pairs
      Code two functions: map & reduce
      map (in_key, in_value) -> list(out_key, intermediate_value)
      • Processes input key/value pair
      • Produces set of intermediate pairs
      reduce (out_key, list(intermediate_value)) -> list(out_value)
      • Combines all intermediate values for a particular key
      • Produces a set of merged output values (usually just one)
  16. Case 1: Word Count
      Determine the frequency of words in a file.
      Map function (assign a value of 1 to every word):
      - input is (file offset, various text)
      - output is a key-value pair [(word,1)]
      The MR library's shuffle step takes the map output and groups it by key using a hash function.
      Reduce function (total counts per word):
      - input is (word, [1,1,1])
      - output is (word, count)
  17. Word Count – Sample Code [9]
      map(String key, String value):
        // key: document name
        // value: document contents
        for each word w in value:
          EmitIntermediate(w, "1");

      reduce(String key, Iterator values):
        // key: a word
        // values: a list of counts
        int result = 0;
        for each v in values:
          result += ParseInt(v);
        Emit(AsString(result));
  18. Word Count
      File 1: i love to code
      File 2: to code is to love
      Result: code,2  i,1  is,1  love,2  to,3

      Map tasks:
        Map1 (File1): [(i,1)] [(love,1)] [(to,1)] [(code,1)]
        Map2 (File2): [(to,1)] [(code,1)] [(is,1)] [(to,1)] [(love,1)]
      MR Library groups intermediate keys and values in the “Shuffle Phase”.
      Reduce tasks:
        Reducer1: (code,[1,1]) -> (code,2)
                  (i,[1]) -> (i,1)
                  (is,[1]) -> (is,1)
        Reducer2: (love,[1,1]) -> (love,2)
                  (to,[1,1,1]) -> (to,3)
      * File2 will have a key-value pair of (to,2) after map when using MR Combiner functionality.
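The word count walk-through above can be sketched in a single Python process. This is only an illustration, not Google's or Hadoop's implementation: `run_mapreduce` is a hypothetical stand-in for the MR library, and the map function emits integer counts rather than the string `"1"` used in the slide pseudocode.

```python
from collections import defaultdict


def map_fn(doc_name, contents):
    """Emit (word, 1) for every word in the document."""
    for word in contents.split():
        yield word, 1


def reduce_fn(word, counts):
    """Sum the counts collected for one word."""
    return word, sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical single-process stand-in for the MR library."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)  # shuffle: group values by key
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))


# The two input files from the slide.
files = [("File1", "i love to code"), ("File2", "to code is to love")]
word_counts = run_mapreduce(files, map_fn, reduce_fn)
print(word_counts)
```

In a real cluster the grouping step would hash each key to one of several reducers; here a dictionary plays that role.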
  19. MapReduce Features
      • Fault Tolerance
      • Redundant Execution
      • Locality Optimization
      • Skip Bad Records
      • Sort before Reduce
      • Combiner
  20. MapReduce System Flow [8]
  21. MapReduce Function Flow [8]
  22. Map & Reduce Parallel Execution [8]
  23. Case 2: Distributed Grep
      Counts the lines in all files that match a <regex> and displays the counts.
      Other uses include analyzing web server access logs to find the top requested pages that match a given pattern.
      Map function (establish a match):
      - input is (file offset, char)
      - output is either:
        1. an empty list [] (the line does not match ‘A’ or ‘C’)
        2. a key-value pair [(line, 1)] (if it matches)
      Reduce function (total counts):
      - input is (char, [1, 1, ...])
      - output is (char, n) where n is the number of 1s in the list.
      Source: http://web.cs.wpi.edu/~cs4513/d08/OtherStuff/MapReduce-TeamC.ppt
  24. Distributed Grep
      File 1 (one character per line): C, B, B, C
      File 2 (one character per line): C, A
      Result: 3 C, 1 A

      Map tasks:
        File1
          (0, C) -> [(C, 1)]
          (2, B) -> []
          (4, B) -> []
          (6, C) -> [(C, 1)]
        File2
          (0, C) -> [(C, 1)]
          (2, A) -> [(A, 1)]
      Reduce tasks:
        (A, [1]) -> (A, 1)
        (C, [1, 1, 1]) -> (C, 3)
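A minimal Python sketch of the distributed grep example follows. The pattern `^[AC]$` is an assumption standing in for the slide's unspecified `<regex>`, and the loop that tracks offsets and groups values is a hypothetical single-process substitute for the MR library.

```python
import re
from collections import defaultdict

# Assumed pattern: a line matches if it is exactly 'A' or 'C'.
PATTERN = re.compile(r"^[AC]$")


def grep_map(offset, line):
    """Emit (line, 1) when the line matches, otherwise emit nothing."""
    if PATTERN.search(line):
        yield line, 1


def grep_reduce(key, ones):
    """Count how many matching lines were seen for this key."""
    return key, sum(ones)


files = {"File1": ["C", "B", "B", "C"], "File2": ["C", "A"]}

groups = defaultdict(list)
for lines in files.values():
    offset = 0
    for line in lines:
        for key, value in grep_map(offset, line):
            groups[key].append(value)  # shuffle: group by matched line
        offset += len(line) + 1  # line plus newline: offsets 0, 2, 4, 6

grep_counts = dict(grep_reduce(k, v) for k, v in sorted(groups.items()))
print(grep_counts)
```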
  25. Case 3: Max Speed Serve
      Data analysis needed: for all professional tennis tournaments over the past 3 years, process log files to determine the fastest serve speed each year.
      Map function (enumerate speeds for each year):
      - input is (file offset, Year Speed)
      - output is a key-value pair [(Year,Speed)]
      Reduce function (determine max speed each year):
      - input is (Year, [speed1, … speedN])
      - output is (Year, Speed) where Speed is the fastest recorded that year.
  26. Max Speed Serve
      File 1      File 2      Result
      2008 134    2008 136    2008 - 136
      2009 126    2009 127    2009 - 132
      2009 132    2010 124    2010 - 124

      Map tasks:
        [(2008,136)] [(2009,126)] [(2009,132)]
        [(2008,134)] [(2009,127)] [(2010,124)]
      Reduce tasks:
        (2008, [136, 134]) -> (2008,136)
        (2009, [126,132,127]) -> (2009,132)
        (2010, [124]) -> (2010,124)
      * Will drop value when using MR Combiner functionality
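The max-speed case, including the Combiner note, can be sketched as below. The assignment of records to the two files is an assumption (the slide layout is ambiguous), `run` is a hypothetical single-process driver, and the "offset" passed to the map function is simply a line index. Because `max` is associative, applying it per file before the shuffle does not change the result; it only reduces how many values are shipped to the reducers.

```python
from collections import defaultdict


def serve_map(offset, record):
    """Parse a 'Year Speed' record and emit (year, speed)."""
    year, speed = record.split()
    yield int(year), int(speed)


def max_reduce(year, speeds):
    """Keep the fastest serve recorded for the year."""
    return year, max(speeds)


def run(files, use_combiner=False):
    """Hypothetical driver: returns (result, number of values shuffled)."""
    shuffled = defaultdict(list)
    for lines in files:
        local = defaultdict(list)  # output of one map task
        for offset, line in enumerate(lines):
            for year, speed in serve_map(offset, line):
                local[year].append(speed)
        for year, speeds in local.items():
            if use_combiner:
                speeds = [max(speeds)]  # combiner: local max per year
            shuffled[year].extend(speeds)
    shipped = sum(len(v) for v in shuffled.values())
    result = dict(max_reduce(y, s) for y, s in sorted(shuffled.items()))
    return result, shipped


# Assumed split of the six records across the two files.
files = [
    ["2008 134", "2009 126", "2009 132"],
    ["2008 136", "2009 127", "2010 124"],
]
plain_result, plain_shipped = run(files)
combined_result, combined_shipped = run(files, use_combiner=True)
print(plain_result, plain_shipped, combined_shipped)
```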
  27. Case 4: Word Proximity
      Find occurrences of pairs of words where word1 is located within 4 words of word2.
      Map function (assign a value of 1 to every match):
      - input is (file offset, various text)
      - output is a key-value pair [(word1|word2,1)]
      Reduce function (total count per match):
      - input is (word1|word2, [1,1,1])
      - output is (word1|word2, count)
  28. Word Proximity
      File 1: i have a piece of the pie
      File 2: it is a piece of cake; it doesn’t even look like pie
      Result: piece|pie,1
      Word1 = “piece”, Word2 = “pie”

      Map tasks:
        (0, i have a piece of the pie) -> (piece|pie,1)
        (0, it is a piece of cake; it doesn’t even look like pie) -> ()
      Reduce tasks:
        (piece|pie, [1]) -> (piece|pie,1)
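A small Python sketch of the word proximity case is given below. The tokenization (split on whitespace, strip common punctuation, lowercase) and the reading of "within 4 words" as a position difference of at most 4 are assumptions; the slides do not specify either.

```python
from collections import defaultdict


def proximity_map(offset, text, word1="piece", word2="pie", window=4):
    """Emit ('word1|word2', 1) for each pair of occurrences within the window."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    positions1 = [i for i, w in enumerate(words) if w == word1]
    positions2 = [i for i, w in enumerate(words) if w == word2]
    for i in positions1:
        for j in positions2:
            if abs(i - j) <= window:
                yield f"{word1}|{word2}", 1


def proximity_reduce(key, ones):
    """Total the matches for one word pair."""
    return key, sum(ones)


files = [
    "i have a piece of the pie",
    "it is a piece of cake; it doesn't even look like pie",
]

groups = defaultdict(list)
for text in files:
    for key, value in proximity_map(0, text):
        groups[key].append(value)  # shuffle: group by word pair

proximity_counts = dict(proximity_reduce(k, v) for k, v in groups.items())
print(proximity_counts)
```

In the second file "piece" and "pie" are eight positions apart, so the map task emits nothing for it.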
  29. Case 5: Reverse Web-Link Graph
      Given a list of website home pages (W1…W4) and every link on each page, point the destination sites back to the original source web site.
      Map function:
      - input is (adjacency list in format source: dest1, dest2, ...)
      - output is a key-value pair [(dest, source)]
      Reduce function (create adjacency list with dest as key):
      - input/output is (dest, [source1, source2])
  30. Link Reversal
      Input: adjacency list       Output: reversed list
      W1: W2,W4                   W1: W2,W4
      W2: W1,W3,W4                W2: W1
      W3: W4                      W3: W2,W4
      W4: W1,W3                   W4: W1,W2,W3

      Map tasks:
        (W1,W2) -> (W2,W1)
        (W1,W4) -> (W4,W1)
        (W2,W1) -> (W1,W2)
        (W2,W3) -> (W3,W2)
        (W2,W4) -> (W4,W2)
        (W3,W4) -> (W4,W3)
        (W4,W1) -> (W1,W4)
        (W4,W3) -> (W3,W4)
      MR Library groups intermediate keys and values in the “Shuffle Phase”.
      Reduce tasks:
        (W1,[W2,W4])
        (W2,[W1])
        (W3,[W2,W4])
        (W4,[W1,W2,W3])
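The link reversal example can be sketched in Python as follows. The dictionary-based input and the in-process grouping are assumptions made for illustration; sorting the sources in the reducer is an added choice so the output order is deterministic.

```python
from collections import defaultdict


def link_map(source, destinations):
    """Emit (dest, source) for every outgoing link of the source page."""
    for dest in destinations:
        yield dest, source


def link_reduce(dest, sources):
    """Collect every page that links to dest."""
    return dest, sorted(sources)


# Adjacency list from the slide: page -> pages it links to.
adjacency = {
    "W1": ["W2", "W4"],
    "W2": ["W1", "W3", "W4"],
    "W3": ["W4"],
    "W4": ["W1", "W3"],
}

groups = defaultdict(list)
for source, destinations in adjacency.items():
    for dest, src in link_map(source, destinations):
        groups[dest].append(src)  # shuffle: group sources by destination

reversed_links = dict(link_reduce(d, s) for d, s in sorted(groups.items()))
print(reversed_links)
```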
  31. Why Use MapReduce?
      • Hides messy details of distributed infrastructure
      • MapReduce simplifies programming paradigm to allow for easy parallel execution
      • Easily scales to thousands of machines
  32. MapReduce Jobs Run @ Google [15]
                                      Aug. 04    Mar. 06    Sep. 07
      Number of jobs (1000s)               29        171      2,217
      Avg. completion time (secs)         634        874        395
      Machine years used                  217      2,002     11,081
      Map input data (TB)               3,288     52,254    403,152
      Map output data (TB)                758      6,743     34,774
      Reduce output data (TB)             193      2,970     14,018
      Avg. machines per job               157        268        394
      Unique implementations
        map                               395      1,958      4,083
        reduce                            269      1,208      2,418
  33. Current Debate: MapReduce vs Parallel DBMS
  34. Why Not Use A Parallel DBMS?
      • Parallel DBMS:
        – multiple CPUs, multiple servers
        – classic parallel programming concepts
        – HUGE established industry $$$
      • Parallel DBMS Vendors
        – Teradata (NCR), DB2 (IBM), Oracle (via Exadata), Greenplum, Vertica, etc.
  35. “MapReduce is a Major Step Backward”
      Stonebraker & DeWitt attack on MR (1/17/08) [10,11]
      – a step backwards in database access
      – a poor implementation
      – not novel
      – missing features
      – incompatible with DBMS tools
  36. “Comparison of Approaches to Large-Scale Data Analysis”
      Stonebraker & DeWitt comparison of Hadoop MR vs Vertica & DBMS-X (7/2009) [12]
      – Hadoop
        • Easy to install, get up & running
        • Maintenance of apps harder
        • Good for fault tolerance in queries
        • Slow because of reading entire file each time & pull of file on reduce step
      – Vertica & DBMS-X
        • Much faster than Hadoop because of indexes, schema, column orientation, compression & “warm start-up at boot time”
  37. “MapReduce and Parallel DBMSs: Friends or Foes?”
      DeWitt & Stonebraker update their position (1/2010) [13]
      – Hadoop MR and Parallel DBMS are complementary
      – Use Hadoop MR for subsets of tasks
      – Use Parallel DBMS for all other applications
      – Hadoop still needs significant improvements
  38. “MapReduce: A Flexible Data Processing Tool”
      Jeffrey Dean & Sanjay Ghemawat (Google) rebuttal (1/2010) [14]
      – MR can input data from heterogeneous environments
      – MR can use indices as input to MR
      – Useful for complex functions
      – “Protocol Buffers” parse much faster
      – MR pull model non-negotiable
      – Addresses performance concerns
  39. Conclusions
      • Hadoop MapReduce is a solid choice for leveraging the power of Cloud Computing when tackling specific parallel data processing tasks; use PDBMS for all other tasks.
      • MR and PDBMS can learn from each other
      • Open source Hadoop MR continues to gain ground on performance and efficiency
      • Battle of MR vs PDBMS subsiding for now
  40. Questions???
  41. References
      [1] http://csrc.nist.gov/groups/SNS/cloud-computing/cloud-def-v15.doc
      [2] http://en.wikipedia.org/wiki/File:Cloud_computing.svg
      [3] http://news.cnet.com/8301-19413_3-10140278-240.html?tag=mncol;txt
      [4] http://rationalsecurity.typepad.com/blog/2009/01/cloud-computing-taxonomy-ontology-please-review.html
      [5] http://www.opencrowd.com/views/cloud.php/2Security
      [6] http://berkeleyclouds.blogspot.com
      [7] Forrester Research, Talking to Your CFO About Cloud Computing, Ted Schadler; Oct. 29, 2008.
      [8] http://code.google.com/edu/parallel/mapreduce-tutorial.html
      [9] http://labs.google.com/papers/mapreduce.html
      [10] http://databasecolumn.vertica.com/database-innovation/mapreduce-a-major-step-backwards/
      [11] http://databasecolumn.vertica.com/database-innovation/mapreduce-ii/
      [12] “Comparison of Approaches to Large-Scale Data Analysis”, Pavlo, Abadi, Stonebraker, DeWitt, et al. (7/2009)
      [13] ACM, “MapReduce and Parallel DBMSs: Friends or Foes?”, Stonebraker, Abadi, DeWitt, et al. (1/2010)
      [14] ACM, “MapReduce: A Flexible Data Processing Tool”, Jeffrey Dean & Sanjay Ghemawat (1/2010)
      [15] http://googlesystem.blogspot.com/2008/01/google-reveals-more-mapreduce-stats.html