Cloud Capacity Planning..an Oxymoron? - South Bay SRE Meetup Aug-09-2016

Coburn Watson, Director of Performance and Reliability Engineering at Netflix, discusses the differences between Cloud and traditional DC/On-Prem capacity planning models. He also covers some of the distinct methodologies applied at Netflix to improve the rate of innovation and overall reliability while keeping a pulse on efficiency.

Published in: Technology

  1. Cloud Capacity Planning South Bay SRE meetup - August 9th, 2016
  2. ● Cloud Capacity Planning..an Oxymoron? ● Santa Cloud: How Netflix Does Holiday Capacity Planning ● The Data Behind the Planning Presenting...
  3. Cloud Capacity Planning..an Oxymoron? South Bay SRE Meetup: August 9th, 2016
  4. ● > 83M households ● 190 Countries ● 35% of Internet traffic in US at peak ● Entirely on Cloud*, three regions ● Evacuate a region monthly...for 24 hours ● Capacity planning ~ 5 people! (in the room :-) * Content served from homegrown OpenConnect CDN
  5. Capacity Planning Concerns ● Facility considerations (Space, Power, Network, Cooling) ● Supply Chain Management Constraints and Relationships ● Hardware lifetime contour & failure rates (MTBF) ● Systems management staff ● Seasonal and unexpected burst considerations ● Workload colocation and performance demands ● Over-provisioning for reliability and rate of innovation ● Effective tooling ● Business continuity planning
  6. (Cloud) Capacity Planning Concerns ● Facility considerations (Power, Network, Cooling) ● Supply Chain Management Constraints and Relationships ● Hardware lifetime contour & failure rates (MTBF) ● Systems management staff ● Seasonal and unexpected burst considerations ● Workload colocation and performance demands ● Over-provisioning for reliability and rate of innovation ● Effective tooling ● Business continuity planning
  7. Cloud-specific CP Factors ● Capacity bounds..unknown (-) ● Vendor Decisions (-/+) ○ Hardware/Offering Evolution Timeline ○ Resource Demand (CPU/Mem/Disk/Net) Matrix ● On-Demand Capability (+)
  8. Netflix Model ● Depend on the AWS on-demand pool for elasticity ● Monitor insufficient capacity exceptions (ICEs) for boundaries ● Invest heavily in 3 year reservations ● Maintain relatively few, large reserved pools ● Cloud Capacity Analytics team develops tools for insight ● Leverage cross-account resource borrowing
  9. The Triad ● Innovation ● Reliability ● Efficiency (diagram labels: Cloud Impact, Default, Preferred)
  10. Considerations of Scale ● Capacity required for critical footprint might require “guarantees” ● API-based observability has limits ● All resources have capacity limits/throttles ● Resource limits by default set for lowest common denominator ● Get creative with unused, but paid for capacity ● Billing file size!
  11. Summary Capacity Planning
  12. Coburn Watson ● Director of Performance and Reliability at Netflix ○ Site Reliability Engineering, Performance and OS Engineering, Traffic Management, Chaos Engineering, Capacity Planning, Cloud Network Engineering ● @coburnw, cwatson@netflix.com ● Looking for some great capacity planning-minded folks ● Performance and Reliability YouTube Channel
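
The "Netflix Model" slide mentions monitoring insufficient capacity exceptions (ICEs) to find capacity boundaries. The slides do not show how this is done, so the following is only a minimal sketch under stated assumptions: launch attempts are already collected as simple records, a "pool" is taken to be an (instance type, availability zone) pair, and the 5% threshold is arbitrary. The `LaunchAttempt` record and both function names are hypothetical; `InsufficientInstanceCapacity` is the EC2 error code returned when a launch cannot be satisfied.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# EC2 error code returned when the requested capacity is not available.
ICE_CODE = "InsufficientInstanceCapacity"


@dataclass(frozen=True)
class LaunchAttempt:
    """Hypothetical record of one instance launch attempt."""
    instance_type: str
    zone: str
    error_code: Optional[str]  # None means the launch succeeded


def ice_rates(attempts):
    """Return the fraction of launch attempts that hit an ICE, per pool."""
    totals = Counter()
    ices = Counter()
    for attempt in attempts:
        pool = (attempt.instance_type, attempt.zone)
        totals[pool] += 1
        if attempt.error_code == ICE_CODE:
            ices[pool] += 1
    return {pool: ices[pool] / totals[pool] for pool in totals}


def pools_near_boundary(attempts, threshold=0.05):
    """List pools whose ICE rate meets or exceeds the (assumed) threshold."""
    rates = ice_rates(attempts)
    return sorted(pool for pool, rate in rates.items() if rate >= threshold)
```

A pool that shows up in `pools_near_boundary` is a hint that the on-demand pool for that instance type and zone is close to its limit, which is the kind of signal that could feed decisions about reservations or shifting to a different instance type.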
