8. Availability
“System availability refers to the accessibility of
system services to users. A system is available if it is
operational for an overwhelming fraction of the time.
Unlike reliability, availability is instantaneous.”
Saturday, October 2, 2010
9. Availability
“System reliability refers to the property of tolerating
constituent component failures, for the longest time. A
system is perfectly reliable if it never fails.”
11. Availability
• Availability & Reliability
• Mean time to failure (MTTF)
• Mean time to repair (MTTR)
• Durability
• Fault isolation
• Fault tolerance
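The two mean-time figures above combine into the usual steady-state availability estimate, MTTF / (MTTF + MTTR). A minimal sketch, assuming both values are measured in hours; the function names and the sample numbers are illustrative, not from the talk:

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


def annual_downtime_hours(avail: float) -> float:
    """Expected downtime per 365-day year implied by an availability figure."""
    return (1.0 - avail) * 365 * 24


if __name__ == "__main__":
    # Hypothetical system: fails every 999 hours on average, takes 1 hour to repair.
    a = availability(mttf_hours=999.0, mttr_hours=1.0)
    print(f"availability: {a:.3f}")
    print(f"downtime per year: {annual_downtime_hours(a):.2f} hours")
```

Shortening MTTR raises availability just as effectively as lengthening MTTF, which is one reason fast repair gets so much attention.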
12. Availability
• Uptime / Downtime
• Perceived
• Actual
13. Availability
• Probabilistic Risk Assessment
• Event Tree Analysis
• Fault Tree Analysis
Apthorpe (http://www.usenix.org/events/lisa01/tech/apthorpe/apthorpe.ps)
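As a rough illustration of fault tree analysis, the probability of a top-level failure can be computed from basic-event probabilities through AND and OR gates. This is a sketch that assumes all basic events are independent; the tree (one load balancer, two web servers) and its probabilities are hypothetical, not taken from Apthorpe's paper:

```python
from math import prod


def and_gate(probabilities):
    """Output fails only if every input fails (independent inputs assumed)."""
    return prod(probabilities)


def or_gate(probabilities):
    """Output fails if at least one input fails (independent inputs assumed)."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical tree: the service is down if the load balancer fails,
# OR if both web servers fail at the same time.
p_load_balancer = 0.001
p_web_server = 0.01

p_service_down = or_gate([
    p_load_balancer,
    and_gate([p_web_server, p_web_server]),
])

if __name__ == "__main__":
    print(f"P(service down) = {p_service_down:.7f}")
```

In this made-up tree the single load balancer dominates the result, which is the kind of finding the analysis is meant to surface.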
15. The Cloud
“It never gets easier, you just go faster.”
- Greg LeMond
16. The Cloud
• Abstraction
• Commoditization
• Homogeneous
• Ephemeral
17. The Cloud
• Costs
• Loss of Control
• Single Points of Failure
• Network Partitions / Data Locality
• Unreliable
• Performance
18. The Cloud
• Benefits
• API to everything
• Fast and Flexible Resource Mgmt
• “Unlimited” Resources
19. The Cloud
• Bootstrapping
• Time and Effort
Adam Jacob and Ezra Zygmuntowicz (http://blip.tv/file/2285124/)
20. The Cloud
• Nodes are stateless and disposable.
21. The Cloud
"Clouds are systems ... and with systems, you have to think hard and know how to deal with issues in that
environment. The scale is so much bigger, and you don't have the physical control. But we think people should
be optimistic about what we can do here. If we are clever about deploying cloud computing with a clear-eyed
notion of what the risk models are, maybe we can actually save the economy through technology."
- David Talbot, "Security in the Ether," MIT Technology Review, Jan/Feb 2010
22. What’s Next
• Distributed Systems
• Automation
• Data Driven Operations
23. Distributed Systems
Baran (http://www.rand.org/pubs/research_memoranda/RM3420/)
24. Distributed Systems
• RAID ain’t as redundant as it used to be.
Leventhal (http://queue.acm.org/detail.cfm?id=1670144)
25. Distributed Systems
• Redundancy
• Duplication
• Distribution
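To see why redundancy helps, consider a service that stays up as long as at least one replica is up. A minimal sketch, assuming replicas fail independently (shared power, network, or software bugs break this assumption in practice); the function name and the sample figures are illustrative:

```python
def redundant_availability(node_availability: float, replicas: int) -> float:
    """Availability of a group that is up while at least one replica is up.

    Assumes replica failures are independent of each other.
    """
    if not 0.0 <= node_availability <= 1.0:
        raise ValueError("node_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The group is down only when every replica is down at once.
    return 1.0 - (1.0 - node_availability) ** replicas


if __name__ == "__main__":
    # Hypothetical nodes that are each up 99% of the time.
    for n in (1, 2, 3):
        print(n, f"{redundant_availability(0.99, n):.6f}")
```

Each extra replica multiplies the unavailability by the single-node unavailability, so the gain holds only as long as failures really are uncorrelated.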
30. Distributed Systems
• Erlang
• Hot Code Upgrades
• Distributed Upgrades are HARD
31. Distributed Systems
• Future Work
• Erlang Supervision Trees
• PRA / FTA / ETA
Apthorpe (http://www.usenix.org/events/lisa01/tech/apthorpe/apthorpe.ps)
43. Data Driven Operations
• Modeling
• Analysis
• Universal Law of Computational Scalability
• Amdahl’s Law
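Both laws named above can be written as small functions of the node (or processor) count N. The sketch below uses the common formulations: Amdahl's Law as 1 / (s + (1 - s) / N) for serial fraction s, and the Universal Scalability Law as N / (1 + sigma (N - 1) + kappa N (N - 1)), with sigma for contention and kappa for coherency cost. The parameter values are made up for illustration; real values would come from fitting measured throughput:

```python
def amdahl_speedup(n: int, serial_fraction: float) -> float:
    """Amdahl's Law: speedup on n processors for a given serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)


def usl_capacity(n: int, sigma: float, kappa: float) -> float:
    """Universal Scalability Law: relative capacity at n nodes.

    sigma models contention (serialization); kappa models coherency
    (crosstalk) cost. With kappa == 0 this reduces to Amdahl's Law.
    """
    return n / (1.0 + sigma * (n - 1) + kappa * n * (n - 1))


if __name__ == "__main__":
    # Illustrative parameters only.
    sigma, kappa = 0.05, 0.001
    for n in (1, 8, 16, 31, 64):
        print(n,
              f"amdahl={amdahl_speedup(n, sigma):.2f}",
              f"usl={usl_capacity(n, sigma, kappa):.2f}")
```

With a nonzero kappa the USL curve peaks and then declines, so adding nodes past that point reduces capacity, something Amdahl's Law alone does not predict.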
44. Data Driven Operations
• Modeling isn’t just for capacity planning.
Montagne (http://queue.acm.org/detail.cfm?id=1862187)