Cloud Engineering

4,440 views

Published on

An introduction to datacenter and cloud computing for master students in engineering

Published in: Technology
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
4,440
On SlideShare
0
From Embeds
0
Number of Embeds
28
Actions
Shares
0
Downloads
145
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Cloud Engineering

  1. 1. Cloud Engineering Software in Datacenters Gwendal Simon Department of Computer Science Institut Telecom 2009
  2. 2. Gartner Hype Cycle 2009 2 / 43 Gwendal Simon Cloud Engineering
  3. 3. Literature A book: “The Datacenter as a Computer”, Luiz André Barroso and Urs Hölzle and scientific publications including: ACM Sigops (SOSP) and Usenix OSDI HPDC, EuroPar, OOPSLA, etc. blogs (e.g. http://perspectives.mvdirona.com/) 3 / 43 Gwendal Simon Cloud Engineering
  4. 4. Disclaimer No discussion about the impact of cloud computing: net neutrality interactions between CDNs and ISPs privacy and electronic human rights ... 4 / 43 Gwendal Simon Cloud Engineering
  5. 5. Disclaimer No discussion about the impact of cloud computing: net neutrality interactions between CDNs and ISPs privacy and electronic human rights ... Focus here on how cloud computing works 4 / 43 Gwendal Simon Cloud Engineering
  6. 6. Introduction 5 / 43 Gwendal Simon Cloud Engineering
  7. 7. In a nutshell Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction NIST: http://csrc.nist.gov/groups/SNS/cloud-computing/index.html 6 / 43 Gwendal Simon Cloud Engineering
  8. 8. Why Cloud Computing The ∗aaS paradigm: SaaS: Software as a Service applications for end-users (salesforce.com, Google, etc.) - email, office suite, photos sharing, video storage 7 / 43 Gwendal Simon Cloud Engineering
  9. 9. Why Cloud Computing The ∗aaS paradigm: SaaS: Software as a Service applications for end-users (salesforce.com, Google, etc.) - email, office suite, photos sharing, video storage PaaS: Platform as a Service services for web app developers (Azure, Google, etc.) - workflow facilities and various basic services (http, database) 7 / 43 Gwendal Simon Cloud Engineering
  10. 10. Why Cloud Computing The ∗aaS paradigm: SaaS: Software as a Service applications for end-users (salesforce.com, Google, etc.) - email, office suite, photos sharing, video storage PaaS: Platform as a Service services for web app developers (Azure, Google, etc.) - workflow facilities and various basic services (http, database) IaaS: Infrastructure as a Service resources for developers (Amazon, Joyent, etc.) - servers, network equipment, memory, CPU 7 / 43 Gwendal Simon Cloud Engineering
  11. 11. An Engineer Vision Benefits: Outsourcing Infrastructure reduce run time and response time minimize infrastructure risk ease deployment and upgrading 8 / 43 Gwendal Simon Cloud Engineering
  12. 12. An Engineer Vision Benefits: Outsourcing Infrastructure reduce run time and response time minimize infrastructure risk ease deployment and upgrading Challenge: Warehouse-Scale Computers thousands of individual computing nodes costly equipments (power, conditioning, cooling) large buildings with engineering teams 8 / 43 Gwendal Simon Cloud Engineering
  13. 13. Few Pictures 9 / 43 Gwendal Simon Cloud Engineering
  14. 14. Few Pictures 9 / 43 Gwendal Simon Cloud Engineering
  15. 15. Few Pictures 9 / 43 Gwendal Simon Cloud Engineering
  16. 16. Few Pictures 9 / 43 Gwendal Simon Cloud Engineering
  17. 17. Datacenter is Different Datacenter vs. Desktop Software: inherent parallelism software control platform homogeneity fault-free requirement 10 / 43 Gwendal Simon Cloud Engineering
  18. 18. Datacenter is Different Datacenter vs. Desktop Software: inherent parallelism software control platform homogeneity fault-free requirement Datacenter vs. High-Performance Computing unpredictable input high volume of data not only computing 10 / 43 Gwendal Simon Cloud Engineering
  19. 19. Basic Elements of a Datacenter Computing Architecture Energy Dealing with Failures 11 / 43 Gwendal Simon Cloud Engineering
  20. 20. Basic Elements of a Datacenter Computing Architecture Energy Dealing with Failures 12 / 43 Gwendal Simon Cloud Engineering
  21. 21. Architecture Basics 13 / 43 Gwendal Simon Cloud Engineering
  22. 22. Main Elements Storage: distributed file sys. (e.g. GFS) or NAS ? GFS is cheaper and faster for read operations 14 / 43 Gwendal Simon Cloud Engineering
  23. 23. Main Elements Storage: distributed file sys. (e.g. GFS) or NAS ? GFS is cheaper and faster for read operations Network: 1-Gbps switch with 48 ports in a rack 1 port per server, 8 ports for cluster rack → Oversubscription factor greater than 5 scarce cluster-level bandwidth attractive rack-level networking 14 / 43 Gwendal Simon Cloud Engineering
  24. 24. System Overview (in 2009) 15 / 43 Gwendal Simon Cloud Engineering
  25. 25. Basic Elements of a Datacenter Computing Architecture Energy Dealing with Failures 16 / 43 Gwendal Simon Cloud Engineering
  26. 26. Main Electric Components 17 / 43 Gwendal Simon Cloud Engineering
  27. 27. Cooling Challenge 18 / 43 Gwendal Simon Cloud Engineering
  28. 28. Evaluating Energy Efficiency computation efficiency = total energy total energy Power Usage Effectiveness = energy in equipment first generation datacenter PUE was poor (often ≥ 3.0) toward PUE around 1.2 19 / 43 Gwendal Simon Cloud Engineering
  29. 29. Evaluating Energy Efficiency computation efficiency = total energy total energy Power Usage Effectiveness = energy in equipment first generation datacenter PUE was poor (often ≥ 3.0) toward PUE around 1.2 critical component power Server PUE = total server power basic SPUE is 1.7 state-of-the-art servers reach 1.2 19 / 43 Gwendal Simon Cloud Engineering
  30. 30. Evaluating Energy Efficiency computation efficiency = total energy total energy Power Usage Effectiveness = energy in equipment first generation datacenter PUE was poor (often ≥ 3.0) toward PUE around 1.2 critical component power Server PUE = total server power basic SPUE is 1.7 state-of-the-art servers reach 1.2 and computing efficiency 19 / 43 Gwendal Simon Cloud Engineering
  31. 31. Computing Efficiency Benchmarking cluster-level efficiency: on-going work 20 / 43 Gwendal Simon Cloud Engineering
  32. 32. Computing Efficiency Benchmarking cluster-level efficiency: on-going work Benchmarking individual computer is easier 20 / 43 Gwendal Simon Cloud Engineering
  33. 33. Computing Efficiency Benchmarking cluster-level efficiency: on-going work Benchmarking individual computer is easier 20 / 43 Gwendal Simon Cloud Engineering
  34. 34. Toward Energy-Proportional Servers 21 / 43 Gwendal Simon Cloud Engineering
  35. 35. Basic Elements of a Datacenter Computing Architecture Energy Dealing with Failures 22 / 43 Gwendal Simon Cloud Engineering
  36. 36. Fault-Tolerant Application-Level Fault-Tolerant software infrastructure layer: masks failures of lower-layer levels reduces hardware cost eases operational procedures (e.g., upgrade) 23 / 43 Gwendal Simon Cloud Engineering
  37. 37. Fault-Tolerant Application-Level Fault-Tolerant software infrastructure layer: masks failures of lower-layer levels reduces hardware cost eases operational procedures (e.g., upgrade) But application-level still experiences failures service is in degraded mode service is unreachable service is corrupted (loss of data) 23 / 43 Gwendal Simon Cloud Engineering
  38. 38. Origin of Impacting Failures Cause % of events software 33 % configuration 28 % human 13 % network 12 % hardware 11 % other 3% 24 / 43 Gwendal Simon Cloud Engineering
  39. 39. Origin of Impacting Failures Cause % of events software 33 % configuration 28 % human 13 % network 12 % hardware 11 % other 3% hardware faults are masked by fault-tolerant software 24 / 43 Gwendal Simon Cloud Engineering
  40. 40. Failures and Crash Average machine availability is 99.9% 95% of machines restart less than once a month 80% of restart events last less than 10 minutes 25 / 43 Gwendal Simon Cloud Engineering
  41. 41. Failures and Crash Average machine availability is 99.9% 95% of machines restart less than once a month 80% of restart events last less than 10 minutes Software most frequent faults (in one year): DRAM soft-errors: 1% experience uncorrectable err disk soft-errors: 3% of drives see corrupted sectors 25 / 43 Gwendal Simon Cloud Engineering
  42. 42. Software Infrastructure Fundamentals Cluster-Level: MapReduce Application-Level: Web Search 26 / 43 Gwendal Simon Cloud Engineering
  43. 43. Software Infrastructure Fundamentals Cluster-Level: MapReduce Application-Level: Web Search 27 / 43 Gwendal Simon Cloud Engineering
  44. 44. Three Layers Infrastructure-level software: kernel, operating systems, networking libraries 28 / 43 Gwendal Simon Cloud Engineering
  45. 45. Three Layers Infrastructure-level software: kernel, operating systems, networking libraries Cluster-level software (middleware): specific software operating a pool of servers 28 / 43 Gwendal Simon Cloud Engineering
  46. 46. Three Layers Infrastructure-level software: kernel, operating systems, networking libraries Cluster-level software (middleware): specific software operating a pool of servers Application-level software: implementation of the Internet services 28 / 43 Gwendal Simon Cloud Engineering
  47. 47. Main Software Components replication 29 / 43 Gwendal Simon Cloud Engineering
  48. 48. Main Software Components replication partitioning 29 / 43 Gwendal Simon Cloud Engineering
  49. 49. Main Software Components replication partitioning load-balancing 29 / 43 Gwendal Simon Cloud Engineering
  50. 50. Main Software Components replication partitioning load-balancing health checking 29 / 43 Gwendal Simon Cloud Engineering
  51. 51. Main Software Components replication partitioning load-balancing health checking integrity check 29 / 43 Gwendal Simon Cloud Engineering
  52. 52. Main Software Components replication partitioning load-balancing health checking integrity check compression 29 / 43 Gwendal Simon Cloud Engineering
  53. 53. Main Software Components replication partitioning load-balancing health checking integrity check compression weak consistency 29 / 43 Gwendal Simon Cloud Engineering
  54. 54. Main Software Components replication MapReduce partitioning Dynamo load-balancing BigTable health checking Hadoop integrity check Sawzall compression Chubby weak consistency Dryad 29 / 43 Gwendal Simon Cloud Engineering
  55. 55. OS at a Cluster-Level Scale Resource Management: mapping tasks to resources should optimize energy usage 30 / 43 Gwendal Simon Cloud Engineering
  56. 56. OS at a Cluster-Level Scale Resource Management: mapping tasks to resources should optimize energy usage Hardware Abstraction: handling hardware elements should optimize performances 30 / 43 Gwendal Simon Cloud Engineering
  57. 57. OS at a Cluster-Level Scale Resource Management: mapping tasks to resources should optimize energy usage Hardware Abstraction: handling hardware elements should optimize performances Deployment Maintenance: upgrading and monitoring should reduce manual tasks 30 / 43 Gwendal Simon Cloud Engineering
  58. 58. OS at a Cluster-Level Scale Resource Management: mapping tasks to resources should optimize energy usage Hardware Abstraction: handling hardware elements should optimize performances Deployment Maintenance: upgrading and monitoring should reduce manual tasks Programming Frameworks: easing implementation should increase programmer productivity 30 / 43 Gwendal Simon Cloud Engineering
  59. 59. Software Infrastructure Fundamentals Cluster-Level: MapReduce Application-Level: Web Search 31 / 43 Gwendal Simon Cloud Engineering
  60. 60. Motivation Map/Reduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. 32 / 43 Gwendal Simon Cloud Engineering
  61. 61. Functional Programming Two fundamentals functions: map: apply a function to a list of elements map f [] = [] | map f [x::xs] = (f x) :: (map f xs) map square [1,2,5] → [1,4,25] 33 / 43 Gwendal Simon Cloud Engineering
  62. 62. Functional Programming Two fundamentals functions: map: apply a function to a list of elements map f [] = [] | map f [x::xs] = (f x) :: (map f xs) map square [1,2,5] → [1,4,25] reduce: build a value from a function and a list reduce f a [] = a | reduce f a [x::xs] = reduce f (f x a) xs reduce add 0 [1,3,6] → 10 33 / 43 Gwendal Simon Cloud Engineering
  63. 63. MapReduce Implementing two functions w.r.t data (key,val) map: smaller sub-problems distributed to nodes map (inKey, inVal) → list (outKey, v) produces intermediate values with an output key 34 / 43 Gwendal Simon Cloud Engineering
  64. 64. MapReduce Implementing two functions w.r.t data (key,val) map: smaller sub-problems distributed to nodes map (inKey, inVal) → list (outKey, v) produces intermediate values with an output key reduce: combines results of sub-problems reduce (outKey, list v) → outVal produces an output value from intermediate values 34 / 43 Gwendal Simon Cloud Engineering
  65. 65. Example: Word Count map(filename, content): for each w in content: emitInt(w, 1) 35 / 43 Gwendal Simon Cloud Engineering
  66. 66. Example: Word Count map(filename, content): for each w in content: emitInt(w, 1) reduce(word, partCount): int result = 0 for pc in partCount: result += pc emit(result) 35 / 43 Gwendal Simon Cloud Engineering
  67. 67. Example: Word Count map(file1, “hello me, goodbye me”)→ map(filename, content): <hello,1> <me,1> <goodbye,1> <me,1> for each w in content: emitInt(w, 1) map(file2, “hello you, bye you”)→ <hello,1> <you,1> <bye,1> <you,1> reduce(word, partCount): int result = 0 for pc in partCount: result += pc emit(result) 35 / 43 Gwendal Simon Cloud Engineering
  68. 68. Example: Word Count map(file1, “hello me, goodbye me”)→ map(filename, content): <hello,1> <me,1> <goodbye,1> <me,1> for each w in content: emitInt(w, 1) map(file2, “hello you, bye you”)→ <hello,1> <you,1> <bye,1> <you,1> reduce(word, partCount): int result = 0 a given key is allocated to a given server for pc in partCount: result += pc emit(result) 35 / 43 Gwendal Simon Cloud Engineering
  69. 69. Example: Word Count map(file1, “hello me, goodbye me”)→ map(filename, content): <hello,1> <me,1> <goodbye,1> <me,1> for each w in content: emitInt(w, 1) map(file2, “hello you, bye you”)→ <hello,1> <you,1> <bye,1> <you,1> reduce(word, partCount): int result = 0 a given key is allocated to a given server for pc in partCount: result += pc reduce(hello,<1,1>) → 2 emit(result) ... 35 / 43 Gwendal Simon Cloud Engineering
  70. 70. Example: Word Count map(file1, “hello me, goodbye me”)→ map(filename, content): <hello,1> <me,1> <goodbye,1> <me,1> for each w in content: emitInt(w, 1) map(file2, “hello you, bye you”)→ <hello,1> <you,1> <bye,1> <you,1> reduce(word, partCount): int result = 0 a given key is allocated to a given server for pc in partCount: result += pc reduce(hello,<1,1>) → 2 emit(result) ... <hello,2> <me,2> <you,2> <goodbye,1> <bye,1> 35 / 43 Gwendal Simon Cloud Engineering
  71. 71. MapReduce Implemented 36 / 43 Gwendal Simon Cloud Engineering
  72. 72. Software Infrastructure Fundamentals Cluster-Level: MapReduce Application-Level: Web Search 37 / 43 Gwendal Simon Cloud Engineering
  73. 73. Basics Input: the Web 100 billion file 400 terabytes the pagerank algorithm a query “w1 AND w2 AND · · · AND wn ” 38 / 43 Gwendal Simon Cloud Engineering
  74. 74. Basics Input: the Web 100 billion file 400 terabytes the pagerank algorithm a query “w1 AND w2 AND · · · AND wn ” Output: a list of files containing all words wi , i ∈ [1, n] sorted by the pagerank algorithm 38 / 43 Gwendal Simon Cloud Engineering
  75. 75. Implementation Overview Offline task: index management based on keywords: a word is associated with a table a table contains all occurrences in the web distributed on thousands of machines multiple copies and weak consistency 39 / 43 Gwendal Simon Cloud Engineering
  76. 76. Implementation Overview Offline task: index management based on keywords: a word is associated with a table a table contains all occurrences in the web distributed on thousands of machines multiple copies and weak consistency Online task: query management front-end Web servers to a subset of machines: compute and rank their local results all best results are combined intermediate servers to servers of file replica from file pointers to a set of metadata 39 / 43 Gwendal Simon Cloud Engineering
  77. 77. Discussion The user-perceived latency is less than one second: read-only operations high parallelism many thousands of queries per second traffic variations 40 / 43 Gwendal Simon Cloud Engineering
  78. 78. Discussion The user-perceived latency is less than one second: read-only operations high parallelism many thousands of queries per second traffic variations Networking: tiny size of data exchanges possible packet loss around the front-end servers 40 / 43 Gwendal Simon Cloud Engineering
  79. 79. Conclusion 41 / 43 Gwendal Simon Cloud Engineering
  80. 80. Key Challenges Time-scale: datacenter are expected to last 10 years Internet apps gain popularity in weeks 42 / 43 Gwendal Simon Cloud Engineering
  81. 81. Key Challenges Time-scale: datacenter are expected to last 10 years Internet apps gain popularity in weeks Hardware components: processors are faster and more energy efficient memory systems and networks are not 42 / 43 Gwendal Simon Cloud Engineering
  82. 82. Key Challenges Time-scale: datacenter are expected to last 10 years Internet apps gain popularity in weeks Hardware components: processors are faster and more energy efficient memory systems and networks are not Server evolution: more cores mean more parallelism 42 / 43 Gwendal Simon Cloud Engineering
  83. 83. Personal Thoughts A new era in the Internet: the industrial era of applications computer science does really matter 43 / 43 Gwendal Simon Cloud Engineering
  84. 84. Personal Thoughts A new era in the Internet: the industrial era of applications computer science does really matter About nano-datacenter using your always-on devices 43 / 43 Gwendal Simon Cloud Engineering

×