Cloud Computing: Past, Present, and Future
UC Berkeley
Professor Anthony D. Joseph*, UC Berkeley, Reliable Adaptive Distributed Systems (RAD) Lab
RWTH Aachen, 22 March 2010
http://abovetheclouds.cs.berkeley.edu/
*Director, Intel Research Berkeley

RAD Lab 5-year Mission
- Enable one person to develop, deploy, and operate a next-generation Internet application
- Key enabling technology: statistical machine learning
  - Debugging, monitoring, power management, auto-configuration, performance prediction, ...
- Highly interdisciplinary faculty and students
  - PIs: Patterson/Fox/Katz (systems/networks), Jordan (machine learning), Stoica (networks & P2P), Joseph (security), Shenker (networks), Franklin (databases)
  - 2 postdocs, ~30 PhD students, ~6 undergrads
- Grad/undergrad teaching integrated with research

Course Timeline
- Friday
  - 10:00-12:00 History of Cloud Computing: time-sharing, virtual machines, datacenter architectures, utility computing
  - 12:00-13:30 Lunch
  - 13:30-15:00 Modern Cloud Computing: economics, elasticity, failures
  - 15:00-15:30 Break
  - 15:30-17:00 Cloud Computing Infrastructure: networking, storage, computation models
- Monday
  - 10:00-12:00 Cloud Computing research topics: scheduling, multiple datacenters, testbeds

Nexus: A common substrate for cluster computing
- Joint work with Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Scott Shenker, and Ion Stoica

Recall: Hadoop on HDFS
[diagram: a namenode (running the namenode daemon) and a job submission node (running the jobtracker) coordinate many slave nodes, each running a tasktracker and a datanode daemon over the local Linux file system]
- Adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed Computing Seminar, 2007 (licensed under a Creative Commons Attribution 3.0 License)

Problem
- Rapid innovation in cluster computing frameworks
- No single framework is optimal for all applications
- Energy efficiency means maximizing cluster utilization
- Want to run multiple frameworks in a single cluster

What do we want to run in the cluster?
- Pregel, Apache Hama, Dryad, Pig

Why share the cluster between frameworks?
- Better utilization and efficiency (e.g., take advantage of diurnal patterns)
- Better data sharing across frameworks and applications

Solution
- Nexus is an "operating system" for the cluster over which diverse frameworks can run
- Nexus multiplexes resources between frameworks
- Frameworks control job execution

Goals
- Scalable
- Robust (i.e., simple enough to harden)
- Flexible enough for a variety of different cluster frameworks
- Extensible enough to encourage innovative future frameworks

Question 1: Granularity of Sharing
- Option: Coarse-grained sharing
  - Give a framework a (slice of a) machine for its entire duration
  - Data locality is compromised if a machine is held for a long time
  - Hard to account for new frameworks and changing demands -> hurts utilization and interactivity
[diagram: Hadoop 1, Hadoop 2, and Hadoop 3 each statically holding a slice of the machines]

Question 1: Granularity of Sharing
- Nexus: Fine-grained sharing
  - Support frameworks that use smaller tasks (in time and space) by multiplexing them across all available resources
  - Frameworks can take turns accessing data on each node
  - Can resize frameworks' shares to get utilization and interactivity
[diagram: tasks from Hadoop 1, 2, and 3 interleaved across every node]

Question 2: Resource Allocation
- Option: Global scheduler
  - Frameworks express needs in a specification language; a global scheduler matches resources to frameworks
  - Requires encoding a framework's semantics in the language, which is complex and can lead to ambiguities
  - Restricts frameworks whose needs the specification language did not anticipate
  - Designing a general-purpose global scheduler is hard

Question 2: Resource Allocation
- Nexus: Resource offers
  - Offer free resources to frameworks; let frameworks pick which resources best suit their needs
  - Keeps Nexus simple and allows us to support future jobs
  - Distributed decisions might not be optimal

Outline
- Nexus Architecture
- Resource Allocation
- Multi-Resource Fairness
- Implementation
- Results

Nexus Architecture

Overview
[diagram: Hadoop v19, Hadoop v20, and MPI jobs each bring their own scheduler; the schedulers talk to the Nexus master, which manages the Nexus slaves; each slave runs per-framework executors (Hadoop v19, Hadoop v20, MPI) that run the frameworks' tasks]

Resource Offers
[diagram: the Nexus master picks a framework (e.g., the MPI or Hadoop scheduler) to offer resources to and sends it a resource offer; the slaves run the executors and their tasks]
- offer = list of {machine, free_resources}
- Example: [ {node 1, <2 CPUs, 4 GB>}, {node 2, <2 CPUs, 4 GB>} ]
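
To make the offer model concrete, here is a minimal sketch of an offer and a framework scheduler choosing resources from it. All names (`Offer`, `TaskSpec`, `resource_offer`) are illustrative inventions for this transcript, not the actual Nexus API.

```python
# Minimal sketch of the resource-offer model described above (hypothetical names).
from dataclasses import dataclass

@dataclass
class Offer:
    machine: str
    cpus: int
    mem_gb: int

@dataclass
class TaskSpec:
    machine: str
    cpus: int
    mem_gb: int

class HadoopLikeScheduler:
    """A framework scheduler that picks resources from the offers it receives."""
    def __init__(self, cpus_per_task=2, mem_per_task_gb=4, pending_tasks=10):
        self.cpus_per_task = cpus_per_task
        self.mem_per_task_gb = mem_per_task_gb
        self.pending_tasks = pending_tasks

    def resource_offer(self, offers):
        """Called with free resources; returns tasks to launch.
        Resources it does not use are implicitly declined and re-offered elsewhere."""
        launched = []
        for o in offers:
            while (self.pending_tasks > 0
                   and o.cpus >= self.cpus_per_task
                   and o.mem_gb >= self.mem_per_task_gb):
                launched.append(TaskSpec(o.machine, self.cpus_per_task, self.mem_per_task_gb))
                o.cpus -= self.cpus_per_task
                o.mem_gb -= self.mem_per_task_gb
                self.pending_tasks -= 1
        return launched

# The offer from the slide: two nodes, each with <2 CPUs, 4 GB> free.
offers = [Offer("node1", 2, 4), Offer("node2", 2, 4)]
print(HadoopLikeScheduler().resource_offer(offers))   # one task on each node
```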

Resource Offers (continued)
[diagram: the framework scheduler performs framework-specific scheduling over the offered resources; the Nexus master launches and isolates executors on the slaves]

Resource Offer Details
- Min and max task sizes to control fragmentation
- Filters let a framework restrict the offers sent to it
  - By machine list
  - By quantity of resources
- Timeouts can be added to filters
- Frameworks can signal when to destroy filters, or when they want more offers
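
A hedged sketch of how a master might apply framework-installed filters with timeouts before sending offers. The filter concepts (machine list, quantity of resources, timeouts) come from the slide; the API and the dict-based offers are invented for illustration.

```python
# Illustrative offer filters with timeouts (not the actual Nexus filter code).
import time

class OfferFilter:
    def __init__(self, skip_machines=(), min_cpus=0, min_mem_gb=0, ttl_s=None):
        self.skip_machines = set(skip_machines)
        self.min_cpus = min_cpus
        self.min_mem_gb = min_mem_gb
        self.expires_at = time.time() + ttl_s if ttl_s is not None else None

    def expired(self):
        return self.expires_at is not None and time.time() > self.expires_at

    def suppresses(self, offer):
        """True if this offer should not be sent to the framework."""
        return (offer["machine"] in self.skip_machines
                or offer["cpus"] < self.min_cpus
                or offer["mem_gb"] < self.min_mem_gb)

def offers_to_send(offers, filters):
    """The master drops expired filters, then withholds offers the framework has
    said it is not interested in, cutting useless scheduling round trips."""
    live = [f for f in filters if not f.expired()]
    return [o for o in offers if not any(f.suppresses(o) for f in live)]

offers = [{"machine": "node1", "cpus": 1, "mem_gb": 2},
          {"machine": "node2", "cpus": 4, "mem_gb": 8}]
filters = [OfferFilter(min_cpus=2, ttl_s=60)]
print(offers_to_send(offers, filters))   # only node2 has enough CPUs to be worth offering
```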

Using Offers for Data Locality
- We found that a simple policy called delay scheduling can give very high locality:
  - The framework waits for offers on nodes that have its data
  - If it has waited longer than a certain delay, it starts launching non-local tasks
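
A sketch of the delay-scheduling policy just described, under stated assumptions: the 5-second delay is a guessed knob (the speaker notes report roughly 1 s for ~90% locality and 5 s for ~95%), and the data structures are illustrative.

```python
# Sketch of delay scheduling: prefer launching a task on a node that holds its
# input data; fall back to a non-local node only after waiting past a threshold.
import time

LOCALITY_DELAY_S = 5.0   # assumed knob; see the speaker notes for measured values

class DelayScheduler:
    def __init__(self):
        self.first_skipped = {}          # task_id -> time we first passed on an offer

    def pick_task(self, offered_node, pending):
        """pending: list of (task_id, preferred_nodes). Returns a task_id or None."""
        now = time.time()
        for task_id, preferred in pending:
            if offered_node in preferred:              # data-local: always accept
                self.first_skipped.pop(task_id, None)
                return task_id
        for task_id, _ in pending:                     # non-local: only after the delay
            waited = now - self.first_skipped.setdefault(task_id, now)
            if waited >= LOCALITY_DELAY_S:
                return task_id
        return None                                     # decline; hope for a better offer
```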

Framework Isolation
- The isolation mechanism is pluggable, due to the inherent performance/isolation tradeoff
- The current implementation supports Solaris projects and Linux containers
  - Both isolate CPU, memory, and network bandwidth
  - Linux developers are working on disk I/O isolation
- Other options: VMs, Solaris zones, policing
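
For flavor, a rough sketch of per-executor CPU and memory limits using Linux control groups, one building block underneath Linux containers. It assumes a cgroup v1 hierarchy mounted at /sys/fs/cgroup and uses the standard cgroup v1 files; it is not Nexus code.

```python
# Per-executor resource isolation with Linux cgroups (v1), as an illustration only.
import os

def create_executor_cgroup(name, cpu_shares, mem_limit_bytes, pid):
    cpu_dir = f"/sys/fs/cgroup/cpu/{name}"
    mem_dir = f"/sys/fs/cgroup/memory/{name}"
    os.makedirs(cpu_dir, exist_ok=True)
    os.makedirs(mem_dir, exist_ok=True)
    # Relative CPU weight under contention; idle CPU can still be used.
    with open(os.path.join(cpu_dir, "cpu.shares"), "w") as f:
        f.write(str(cpu_shares))
    # Hard memory cap for the executor and its tasks.
    with open(os.path.join(mem_dir, "memory.limit_in_bytes"), "w") as f:
        f.write(str(mem_limit_bytes))
    # Move the executor process into both cgroups.
    for d in (cpu_dir, mem_dir):
        with open(os.path.join(d, "tasks"), "w") as f:
            f.write(str(pid))
```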

Resource Allocation

Allocation Policies
- Nexus picks the framework to offer resources to, and hence controls how many resources each framework can get (but not which ones)
- Allocation policies are pluggable, through allocation modules, to suit organizational needs

Example: Hierarchical Fairshare Policy
[diagram: the cluster share policy for Facebook.com gives 80% to Ads and 20% to Spam; Spam's share is split 70%/30% between User 1 and User 2, i.e., 14% and 6% of the cluster, and each user's jobs are scheduled within that share over time]
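
A minimal sketch of the hierarchical fair-share calculation implied by the tree above (80/20 between Ads and Spam, 70/30 between Spam's users). The class names, the equal job weights under Ads, and the handling of inactive groups are assumptions.

```python
# Hierarchical weighted fair sharing: an inactive subtree's share is
# redistributed among its active siblings.
class Group:
    def __init__(self, name, weight, children=(), active=True):
        self.name, self.weight = name, weight
        self.children, self.active = list(children), active

def has_demand(g):
    return g.active if not g.children else any(has_demand(c) for c in g.children)

def allocate(g, share, out):
    if not g.children:
        out[g.name] = share
        return out
    live = [c for c in g.children if has_demand(c)]
    total = sum(c.weight for c in live)
    for c in live:
        allocate(c, share * c.weight / total, out)
    return out

# Weights mirror the slide; the two jobs under Ads are given equal (made-up) weights.
root = Group("facebook.com", 1.0, [
    Group("ads", 0.8, [Group("job1", 1), Group("job2", 1)]),
    Group("spam", 0.2, [Group("user1", 0.7), Group("user2", 0.3)]),
])
print(allocate(root, 1.0, {}))   # roughly {'job1': 0.4, 'job2': 0.4, 'user1': 0.14, 'user2': 0.06}
```

If Ads had no active jobs, `has_demand` would exclude it and Spam's users would split the whole cluster, which is the redistribution behavior a hierarchical policy is meant to give.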

Revocation
- Killing tasks to make room for other users
- Not the normal case, because fine-grained tasks enable quick reallocation of resources
- Sometimes necessary:
  - Long-running tasks that never relinquish resources
  - A buggy job running forever
  - A greedy user who decides to make his tasks long

Revocation Mechanism
- The allocation policy defines a safe share for each user
  - Users will get at least their safe share within a specified time
- Revoke only if a user is below its safe share and is interested in offers
- Revoke tasks from the users farthest above their safe share
- A framework is warned before its task is killed
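
A sketch of the victim-selection rule above: revoke only when some user below their safe share wants offers, and take resources from whoever is farthest above their own safe share. The data structures are illustrative, not the Nexus implementation.

```python
# Pick revocation victims: users farthest above their safe shares go first.
def pick_victims(allocations, safe_shares, needy_user, needed):
    """allocations/safe_shares: user -> fraction of the cluster.
    Returns a list of (user, amount_to_revoke)."""
    if allocations[needy_user] >= safe_shares[needy_user]:
        return []                                    # nobody is owed anything
    surpluses = sorted(((u, allocations[u] - safe_shares[u]) for u in allocations),
                       key=lambda x: x[1], reverse=True)
    victims, remaining = [], needed
    for user, surplus in surpluses:
        if remaining <= 0 or surplus <= 0:
            break
        take = min(surplus, remaining)
        victims.append((user, take))                 # framework is warned, then tasks are killed
        remaining -= take
    return victims

print(pick_victims({"ads": 0.7, "spam": 0.25, "mpi": 0.05},
                   {"ads": 0.4, "spam": 0.2, "mpi": 0.4},
                   needy_user="mpi", needed=0.35))
# -> revoke roughly 0.30 of the cluster from ads and 0.05 from spam
```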

How Do We Run MPI?
- Users are always told their safe share
  - They can avoid revocation by staying below it
- Giving each user a small safe share may not be enough if jobs need many machines
- Can run a traditional grid or HPC scheduler as a user with a larger safe share of the cluster, and have MPI jobs queue up on it
  - E.g., Torque gets 40% of the cluster

Example: Torque on Nexus
[diagram: the Facebook.com cluster policy gives Ads 40%, Spam 20%, and Torque a 40% safe share; Ads and Spam users run their jobs directly on Nexus, while MPI jobs queue up on Torque]

Multi-Resource Fairness

What is Fair?
- Goal: define a fair allocation of resources in the cluster between multiple users
- Example: suppose we have:
  - 30 CPUs and 30 GB RAM
  - Two users with equal shares
  - User 1 needs <1 CPU, 1 GB RAM> per task
  - User 2 needs <1 CPU, 3 GB RAM> per task
- What is a fair allocation?

Definition 1: Asset Fairness
- Idea: give weights to resources (e.g., 1 CPU = 1 GB) and equalize the value of the resources given to each user
- Algorithm: when resources are free, offer them to whoever has the least value
- Result:
  - User 1: 12 tasks: 12 CPUs, 12 GB ($24)
  - User 2: 6 tasks: 6 CPUs, 18 GB ($24)
- Problem: User 1 has less than 50% of both CPUs and RAM
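
A quick check of the slide's numbers: a greedy allocator that always gives the next task to the user with the least total asset value (1 CPU counted as worth 1 GB) reproduces the 12-task/6-task outcome.

```python
# Asset fairness on the example above: 30 CPUs, 30 GB RAM,
# user 1 tasks = <1 CPU, 1 GB>, user 2 tasks = <1 CPU, 3 GB>.
cpus, ram = 30, 30
demand = {"u1": (1, 1), "u2": (1, 3)}
tasks = {"u1": 0, "u2": 0}

def value(u):                        # total "asset value" held by user u
    c, r = demand[u]
    return tasks[u] * (c + r)

while True:
    u = min(tasks, key=value)        # offer free resources to the least-valued user
    c, r = demand[u]
    if c > cpus or r > ram:          # stop as soon as the next pick no longer fits
        break
    tasks[u] += 1
    cpus -= c
    ram -= r

print(tasks)   # {'u1': 12, 'u2': 6}: equal value ($24 each), yet user 1 holds
               # under half of both CPU and RAM, which motivates DRF below
```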

Lessons from Definition 1
- "You shouldn't do worse than if you ran a smaller, private cluster equal in size to your share"
- Thus, given N users, each user should get at least 1/N of his dominant resource (i.e., the resource he consumes most of)

Definition 2: Dominant Resource Fairness
- Idea: give every user an equal share of her dominant resource (i.e., the resource she consumes most of)
- Algorithm: when resources are free, offer them to the user with the smallest dominant share (i.e., the fractional share of her dominant resource)
- Result:
  - User 1: 15 tasks: 15 CPUs, 15 GB
  - User 2: 5 tasks: 5 CPUs, 15 GB
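
The same example under DRF, as a sketch: repeatedly grant a task to the user with the smallest dominant share. It reproduces the 15-task/5-task result, with both users ending at a 50% dominant share.

```python
# Dominant Resource Fairness on the 30 CPU / 30 GB example.
capacity = {"cpu": 30.0, "ram": 30.0}
demand = {"u1": {"cpu": 1, "ram": 1}, "u2": {"cpu": 1, "ram": 3}}
used = {u: {"cpu": 0.0, "ram": 0.0} for u in demand}
tasks = {u: 0 for u in demand}
free = dict(capacity)

def dominant_share(u):
    # Fraction of the cluster the user holds of the resource they use most of.
    return max(used[u][r] / capacity[r] for r in capacity)

while True:
    u = min(tasks, key=dominant_share)          # smallest dominant share goes first
    if any(demand[u][r] > free[r] for r in capacity):
        break
    for r in capacity:
        used[u][r] += demand[u][r]
        free[r] -= demand[u][r]
    tasks[u] += 1

print(tasks)   # {'u1': 15, 'u2': 5}: user 1's dominant resource is CPU (15/30),
               # user 2's is RAM (15/30), so both dominant shares are equal
```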

Fairness Properties

Implementation

Implementation Stats
- 7,000 lines of C++
- APIs in C, C++, Java, Python, and Ruby
- Executor isolation using Linux containers and Solaris projects

Frameworks
- Ported frameworks:
  - Hadoop (900-line patch)
  - MPI (160 lines of wrapper scripts)
- New frameworks:
  - Spark, a Scala framework for iterative jobs (1,300 lines)
  - Apache + haproxy, an elastic web server farm (200 lines)
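
The slides do not show the framework API itself; the following is a hypothetical callback interface, sketched only to make the division of labor concrete (Nexus decides which frameworks get offers, the framework decides what to run). It is not the actual C/C++/Java/Python/Ruby API.

```python
# Hypothetical framework-side interface, for illustration only.
class Scheduler:
    def registered(self, framework_id):
        """Called once the framework is registered with the Nexus master."""

    def resource_offer(self, offer_id, offers):
        """Called with free resources; return task descriptions to launch
        (possibly none), and optionally install filters for future offers."""
        return []

    def status_update(self, task_id, state):
        """Called when a task finishes, fails, or is killed (e.g., revoked)."""

class Executor:
    def launch_task(self, task):
        """Runs on a slave inside the isolation container; starts the task."""

    def kill_task(self, task_id):
        """Asked to stop a task, e.g., ahead of revocation."""
```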

Results

Overhead
- Less than 4% seen in practice

Dynamic Resource Sharing

Multiple Hadoops Experiment
[diagram: three Hadoop instances, first shown statically partitioned across the cluster, then with their tasks dynamically interleaved across all nodes]

Results with 16 Hadoops

Web Server Farm Framework

Web Framework Experiment
[diagram: httperf generates HTTP requests against the farm; a scheduler built around haproxy computes the load and responds to Nexus resource offers; Nexus slaves run web executors (Apache) and load-generation executors as tasks, with status updates flowing back to the Nexus master]
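
A sketch of the elastic scaling logic such a web-farm framework might use: grow or shrink the set of Apache tasks based on the request rate measured at the haproxy front end. The threshold, function names, and callbacks are assumptions, not taken from the actual 200-line framework.

```python
# Elastic web farm sizing, illustrative only.
import math

TARGET_RPS_PER_SERVER = 100        # assumed capacity of one Apache task

def desired_servers(total_rps):
    return max(1, math.ceil(total_rps / TARGET_RPS_PER_SERVER))

def on_load_report(total_rps, running_tasks, launch_more, stop_task):
    """Called periodically with the load measured at the haproxy front end."""
    want = desired_servers(total_rps)
    if want > len(running_tasks):
        # Ask Nexus for more resources (via future offers) and start Apache tasks;
        # haproxy is then reconfigured to include the new backends.
        launch_more(want - len(running_tasks))
    elif want < len(running_tasks):
        # Shrink: release surplus servers so other frameworks can use the nodes.
        for task in running_tasks[want:]:
            stop_task(task)
```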

Web Framework Results

Future Work
- Experiment with parallel programming models
- Further explore low-latency services on Nexus (web applications, etc.)
- Shared services (e.g., BigTable, GFS)
- Deploy to users and open source

Cloud Computing Testbeds

Open Cirrus™: Seizing the Open Source Cloud Stack Opportunity
- A joint initiative sponsored by HP, Intel, and Yahoo!
- http://opencirrus.org/

Proprietary Cloud Computing Stacks
- Google: applications; application frameworks (MapReduce, Sawzall, Google App Engine, Protocol Buffers); software infrastructure (VM management, job scheduling, and monitoring via Borg; storage management via GFS and BigTable); hardware infrastructure
- Amazon: applications; application frameworks (Elastic MapReduce / Hadoop); software infrastructure (VM management via EC2; storage management via S3 and EBS); hardware infrastructure
- Microsoft: applications; application frameworks (.NET Services); software infrastructure (VM management, job scheduling, and monitoring via the Fabric Controller; storage management via SQL Services, blobs, tables, and queues); hardware infrastructure
[the slide also marks the publicly accessible layer of each stack]

Open Cloud Computing Stacks
- Heavily fragmented today!
- Monitoring: Ganglia, Nagios, Zenoss, MON, Moara
- Storage management: HDFS, KFS, Gluster, Lustre, PVFS, MooseFS, HBase, Hypertable
- Application frameworks: Pig, Hadoop, MPI, Sprout, Mahout
- Job scheduling: Maui/Torque
- VM management: Eucalyptus, Enomalism, Tashi, Reservoir, Nimbus, oVirt
- Hardware infrastructure: PRS, Emulab, Cobbler, xCat

Open Cirrus™ Cloud Computing Testbed
- Shared: research, applications, infrastructure (12K cores), data sets
- Global services: sign-on, monitoring, storage; open-source stack (PRS, Tashi, Hadoop)
- Sponsored by HP, Intel, and Yahoo! (with additional support from NSF)
- 9 sites currently, with a target of around 20 in the next two years

Open Cirrus Goals
- Foster new systems and services research around cloud computing
- Catalyze an open-source stack and APIs for the cloud
- How are we unique?
  - Support for both systems research and applications research
  - Federation of heterogeneous datacenters

Open Cirrus Organization
- Central Management Office oversees Open Cirrus
  - Currently owned by HP
- Governance model
  - Research team
  - Technical team
  - New site additions
  - Support (legal (export, privacy), IT, etc.)
- Each site
  - Runs its own research and technical teams
  - Contributes individual technologies
  - Operates some of the global services
  - E.g., the HP site supports the portal and PRS, the Intel site is developing and supporting Tashi, and Yahoo! contributes to Hadoop

Intel BigData Open Cirrus Site
- http://opencirrus.intel-research.net
[diagram: rack-level layout of the site, including a mobile rack of 8 1U nodes (2 quad-core Xeon E5440 Harpertown, 16 GB DRAM, 2x1 TB disks each), a 3U rack of 5 storage nodes (12x1 TB disks each), two blade racks of 40 nodes, and 1U/2U racks of 15 nodes with a mix of single-core through quad-core Xeons (Irwindale, Woodcrest, Clovertown, Harpertown, Nehalem-EP), 4-16 GB DRAM per node, 1 Gb/s links into 24-48 Gb/s rack switches, a 45 Mb/s T3 uplink to the Internet, and PDUs with per-port power monitoring and control]

Open Cirrus Sites
[per-site summary table; the totals row reads 1,029 / 4 PB / 12,074 / 1,746 / 26.3 TB, including the roughly 12K cores shared across sites; the column headers are not recoverable from the transcript]

Testbed Comparison

Open Cirrus Stack
[diagram, built up across several slides: compute, network, and storage resources plus power, cooling, and a management and control subsystem are exposed as a Physical Resource Set (Zoni) service; PRS clients each get their own "physical data center"; virtual clusters (e.g., Tashi) run on top of Zoni; a BigData application runs on Hadoop, on a Tashi virtual cluster, on a PRS, on real hardware; the stack also provides NFS and HDFS storage services, experiment save/restore, platform services, and user services. Credit: John Wilkes (HP)]

System Organization
- Compute nodes are divided into dynamically allocated, VLAN-isolated PRS subdomains
- Applications switch back and forth between virtual and physical resources
[diagram: subdomains for open service research, applications running in a VM management infrastructure (e.g., Tashi), Tashi development, a production storage service, proprietary service research, and open workload monitoring and trace collection]

Open Cirrus Stack - Zoni
- Zoni service goals
  - Provide mini-datacenters to researchers
  - Isolate experiments from each other
  - Stable base for other research
- Zoni service approach
  - Allocate sets of physically co-located nodes, isolated inside VLANs
- Zoni code from HP is being merged into the Tashi Apache project and extended by Intel
  - Running on the HP site
  - Being ported to the Intel site
  - Will eventually run on all sites

Open Cirrus Stack - Tashi
- An open-source Apache Software Foundation project sponsored by Intel (with CMU, Yahoo, and HP)
  - Infrastructure for cloud computing on Big Data
  - http://incubator.apache.org/projects/tashi
- Research focus:
  - Location-aware co-scheduling of VMs, storage, and power
  - Seamless physical/virtual migration
- Joint with Greg Ganger (CMU), Mor Harchol-Balter (CMU), and Milan Milenkovic (CTG)

Tashi High-Level Design
- Services are instantiated through virtual machines
- Most decisions happen in the scheduler, which manages compute, storage, and power in concert
- Data location and power information is exposed to the scheduler and services
- The storage service aggregates the capacity of the commodity nodes to house Big Data repositories
- Cluster nodes are assumed to be commodity machines
- The Cluster Manager maintains databases and routes messages; decision logic is limited
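
An illustrative sketch, in the spirit of Tashi's location-aware co-scheduling: place a VM on a host that already stores its data blocks, breaking ties by reported power headroom. The structures and scoring rule are invented for this transcript, not Tashi code.

```python
# Location- and power-aware VM placement, illustrative only.
def score(host, vm_request, block_locations, power_headroom):
    # Count how many of the VM's data blocks already live on this host.
    local_blocks = sum(1 for b in vm_request["blocks"]
                       if host in block_locations.get(b, ()))
    return (local_blocks, power_headroom.get(host, 0))

def place(vm_request, hosts, free_cores, block_locations, power_headroom):
    candidates = [h for h in hosts if free_cores[h] >= vm_request["cores"]]
    if not candidates:
        return None
    # Highest score wins: most local data first, then most power headroom.
    return max(candidates,
               key=lambda h: score(h, vm_request, block_locations, power_headroom))
```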

Location Matters (calculated)

Open Cirrus Stack - Hadoop
- An open-source Apache Software Foundation project sponsored by Yahoo!
  - http://wiki.apache.org/hadoop/ProjectDescription
- Provides a parallel programming model (MapReduce) and a distributed file system (HDFS)
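
To make the MapReduce model on this slide concrete, here is a minimal word-count pair written for Hadoop Streaming, which pipes records through stdin/stdout. In practice the mapper and reducer live in separate scripts passed to the streaming jar; the exact invocation depends on the installation.

```python
# mapper.py: emit <word, 1> for every word on stdin
import sys

def mapper():
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

# reducer.py: Hadoop sorts by key, so all counts for a word arrive adjacent
def reducer():
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")
```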

What kinds of research projects are Open Cirrus sites looking for?
- Open Cirrus is seeking research in the following areas (different centers will weight these differently):
  - Datacenter federation
  - Datacenter management
  - Web services
  - Data-intensive applications and systems
- The following kinds of projects are generally not of interest:
  - Traditional HPC application development
  - Production applications that just need lots of cycles
  - Closed-source system development

How do users get access to Open Cirrus sites?
- Project PIs apply to each site separately
  - Contact names, email addresses, and web links for applications to each site will be available on the Open Cirrus web site (which goes live Q2 2009): http://opencirrus.org
- Each Open Cirrus site decides which users and projects get access to its site
- A global sign-on for all sites is being developed (Q2 2009)
  - Users will be able to log in to each Open Cirrus site for which they are authorized with the same login and password

Summary and Lessons
- Intel is collaborating with HP and Yahoo! to provide a cloud computing testbed for the research community
- Using the cloud as an accelerator for interactive streaming/big-data apps is an important usage model
- Primary goals are to:
  - Foster new systems research around cloud computing
  - Catalyze an open-source reference stack and APIs for the cloud: access model, local and global services, application frameworks
  - Explore location-aware and power-aware workload scheduling
  - Develop integrated physical/virtual allocations to combat cluster squatting
  - Design cloud storage models (GFS-style storage systems are not mature; the impact of SSDs is unknown)
  - Investigate new application-framework alternatives to MapReduce/Hadoop

Other Cloud Computing Research Topics: Isolation and DC Energy

Heterogeneity in Virtualized Environments
- VM technology isolates CPU and memory, but disk and network are shared
  - Full bandwidth when there is no contention
  - Equal shares when there is contention
- 2.5x performance difference observed across EC2 small instances

Isolation Research
- Need predictable variance, even over raw performance
- Some resources that people have run into problems with:
  - Power, disk space, disk I/O rate (drive, bus), memory space (user/kernel), memory bus, caches at all levels (TLB, etc.), hyperthreading, CPU rate, interrupts
  - Network: NIC (Rx/Tx), switch, cross-datacenter, cross-country
  - OS resources: file descriptors, ports, sockets

Datacenter Energy
- EPA, 8/2007:
  - 1.5% of total U.S. energy consumption
  - Growing from 60 to 100 billion kWh in 5 years
  - 48% of a typical IT budget spent on energy
- 75 MW of new DC deployments in PG&E's service area - that they know about! (expect another 2x)
- Microsoft: $500M new Chicago facility
  - Three substations with a capacity of 198 MW
  - 200+ shipping containers with 2,000 servers each
  - Overall growth of 20,000 servers/month

Power/Cooling Issues

First Milestone: DC Energy Conservation
- DCs are limited by power
  - For each dollar spent on servers, add $0.48 (2005) / $0.71 (2010) for power and cooling
  - The $26B spent to power and cool servers in 2005 grows to $45B in 2010
- Within DC racks, network equipment is often the "hottest" component in the hot spot

Thermal Image of Typical Cluster Rack
[thermal image of a cluster rack with the switch highlighted as a hot spot]
- M. K. Patterson, A. Pratt, P. Kumar, "From UPS to Silicon: an end-to-end evaluation of datacenter efficiency", Intel Corporation

DC Networking and Power
- Selectively power down ports/portions of network elements
- Enhanced power-awareness in the network stack
  - Power-aware routing and support for system virtualization
    - Support for datacenter "slice" power-down and restart
  - Application- and power-aware media access/control
    - Dynamic selection of full/half duplex
    - Directional asymmetry to save power, e.g., 10 Gb/s send, 100 Mb/s receive
  - Power-awareness in applications and protocols
    - Hard state (proxying), soft state (caching), protocol/data "streamlining" for power as well as bandwidth reduction
- Power implications for topology design
  - Tradeoffs in redundancy/high availability vs. power consumption
  - VLAN support for power-aware system virtualization

Summary
- Many areas for research into Cloud Computing!
  - Datacenter design, languages, scheduling, isolation, energy efficiency (at all levels)
- Opportunities to try out research at scale!
  - Amazon EC2, Open Cirrus, ...

Thank you!
adj@eecs.berkeley.edu
http://abovetheclouds.cs.berkeley.edu/

Speaker Notes
- Just mention briefly that there are things MapReduce and Dryad can't do, and that there are competing implementations; perhaps also note the need to share resources with other datacenter services. The excitement surrounding cluster computing frameworks like Hadoop continues to accelerate (e.g., Hadoop on EC2, Dryad in Azure). Startups, enterprises, and we researchers are bursting with ideas to improve these existing frameworks, and as we hit the limits of MapReduce we are building a shopping list for next-generation frameworks: new abstractions, new programming models, even new implementations of existing models (e.g., Disco, an Erlang MapReduce). We believe no single framework can best facilitate this innovation; instead, people will want to run existing and new frameworks on the same physical clusters at the same time.
- Sharing is useful even if you only use one framework: run isolated framework instances (production vs. test), or run multiple versions of a framework together.
- A global scheduler needs to make guesses about a lot more (job running times, etc.). Mention adaptive frameworks that may not know how many tasks they need in advance, and irregular-parallelism jobs that don't even know their DAG in advance. We are exploring resource offers but don't yet know their limits; they seem to work well for jobs with data-locality needs.
- ...multiple frameworks run concurrently: here we see a new framework, Dryad, running side by side with Hadoop, and Nexus multiplexing the slaves between both. Some slaves run Hadoop tasks, some Dryad, and some both.
- Waiting 1 s gives 90% locality; 5 s gives 95%.
- Linux containers can be either "application" containers, where an app shares the filesystem with the host (similar to Solaris projects), or "system" containers, where each container has its own filesystem (similar to Solaris zones); both types also prevent processes in a container from seeing those outside it.
- Transition to the next slide: when you have policy == SLAs.
- What to do with the rest of the resources?
- Mention the shared HDFS.
- 16 Hadoop instances running a synthetic filter job on 100 nodes with 4 slots per node; delay scheduling improves performance by 1.7x.
