At many high-growth companies, staying at the bleeding edge of innovation and maintaining the highest level of service availability often sideline financial efficiency. This problem is exacerbated in a microservice environment, where decentralized engineering teams can spin up thousands of instances at a moment’s notice, with no governing body tracking cost. By developing a cost-conscious culture and assigning the responsibility for efficiency to the appropriate business owners, you can deliver innovation efficiently and cost-effectively. At Netflix, the Finance and Operations Engineering teams bear the responsibility for ensuring that the rate of innovation is fast and that development is cost-effective. In this presentation, we’ll explore the building blocks of AWS cost management and discuss Netflix’s best practices.
2. What to Expect from the Session
• Managing efficiency vs. innovation & availability
• How and why cloud mgmt. has changed at Netflix
• Best practices & future goals
3. The Efficiency Challenge
Netflix: world’s largest subscription Internet TV business
Business Stats:
>60m members
2,000+ employees
80+ countries
>100m hours watched per day
Strategy: innovation & availability prioritized above efficiency
Engineering Stats:
1,400 Tech & Dev. engineers
40+ independent teams
500+ microservices
90,000+ instances (~15% autoscaling)
6. Our foundations
• Forming the Cloud Capacity Team
• Matching processes with strategy
• Developing transparency tools
7. The Who: Cloud Capacity Planners
“Serving customers, not keeping gates”
8. Main Responsibilities
Strategy & Operations
• Scalable cloud growth
• Ad-hoc internal consulting
• Capacity liaisons with AWS
Capacity Planning
• Purchasing capacity
• Planning with major teams
• Retroactive RI purchasing
9. Retroactive Reservation Purchasing
• Look-back purchases to cover on-demand usage
• Bi-weekly process includes rebalancing unused RIs
• Purchasing considerations:
10. Implementing Retroactive Purchasing
• Assumptions & processes required:
• Understanding of infrastructure & usage by account
• Irregular growth to be communicated
• Robust usage dashboards & cost tools
• Benefits:
• Lower deployment friction, accelerating innovation
• Reduced operational management overhead
• Potential reduction of “capacity sandbagging”
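As a rough illustration of the look-back idea on the two slides above, the sketch below sizes a reservation purchase from past on-demand usage. Everything in it is hypothetical: the function name, the input format (one running-instance count per hour of the look-back window), and the 75% break-even utilization threshold are assumptions for illustration, not Netflix's actual process or real AWS pricing.

```python
def reservations_to_buy(hourly_on_demand, break_even_utilization=0.75):
    """Size a retroactive reservation purchase from past on-demand usage.

    hourly_on_demand: list of on-demand instance counts, one per hour of
        the look-back window (hypothetical input format).
    break_even_utilization: assumed fraction of hours a reservation must
        be in use to be cheaper than on-demand (illustrative value only).

    Returns the largest N such that the N-th reservation would have been
    in use for at least that fraction of the window.
    """
    if not hourly_on_demand:
        return 0
    total_hours = len(hourly_on_demand)
    best = 0
    for n in range(1, max(hourly_on_demand) + 1):
        # Hours in which at least n instances were running on-demand.
        hours_used = sum(1 for count in hourly_on_demand if count >= n)
        if hours_used / total_hours >= break_even_utilization:
            best = n
        else:
            # Utilization can only fall as n grows, so stop here.
            break
    return best


# Ten hours of hypothetical usage for one instance type in one account.
usage = [10, 12, 8, 15, 9, 11, 10, 14, 7, 10]
print(reservations_to_buy(usage))
```

Running the same calculation on a regular cadence and comparing the result with reservations already held would also surface unused reservations that are candidates for rebalancing.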
12. General Cloud Capacity Strategies
• Strat: Service oriented architecture @ massive scale
• Process: centralized cloud capacity planning function
• Strat: Unconstrained deployment capabilities
• Process: develop contextual efficiency information via tooling
• Strat: Improve overall availability
• Process: dedicated failover capacity, critical services on general instance families
13. Transparency Through Tooling
• Invest in robust tooling for capacity team
• Reveal AWS usage & cost back to service teams
• Select a business metric to set growth context
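One way to read "set growth context" is to compare cost growth against growth in the chosen business metric. The sketch below is a minimal example under stated assumptions: the function name, the choice of streaming hours as the metric, and all figures are hypothetical and do not come from the presentation.

```python
def growth_in_context(prev_cost, cur_cost, prev_metric, cur_metric):
    """Compare cost growth with growth in a business metric.

    All inputs are hypothetical period totals, e.g. monthly cloud cost
    and monthly streaming hours. Returns fractional changes.
    """
    if prev_cost <= 0 or prev_metric <= 0 or cur_metric <= 0:
        raise ValueError("previous cost and both metric values must be positive")
    prev_unit_cost = prev_cost / prev_metric
    cur_unit_cost = cur_cost / cur_metric
    return {
        "cost_growth": cur_cost / prev_cost - 1,
        "metric_growth": cur_metric / prev_metric - 1,
        # Negative means each unit of the business metric got cheaper.
        "unit_cost_change": cur_unit_cost / prev_unit_cost - 1,
    }


# Illustrative numbers: cost rises 20% while the metric rises 50%.
report = growth_in_context(prev_cost=100.0, cur_cost=120.0,
                           prev_metric=1000.0, cur_metric=1500.0)
for name, value in report.items():
    print(f"{name}: {value:+.1%}")
```

In this made-up case a service team's bill grew, yet its cost per unit of the business metric fell, which is the kind of context a raw cost figure would hide.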
19. Netflix Today
• Decentralizing cloud cost responsibilities
• Active ROI analysis with largest service teams
• Exposing efficiency metrics beyond instance usage
21. “ROI” Based Mindset
• Cause: today’s scale requires thoughtful deployment strategy & service architecture
• Cloud capacity team engaged on a per-project basis
24. The Future of Cloud Cost Management
• Bin-packing at the service level through containers
• Dynamic traffic shifting between regions
• Automated ROI calculations at testing & deployment
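To make the bin-packing bullet concrete, here is a minimal sketch using the first-fit decreasing heuristic, a standard textbook approach and not necessarily what Netflix uses or plans to use. The service names, CPU demands, and host capacity are all hypothetical.

```python
def first_fit_decreasing(demands, capacity):
    """Pack services onto hosts with the first-fit decreasing heuristic.

    demands: dict mapping a (hypothetical) service name to its resource
        demand, e.g. vCPUs needed by its container.
    capacity: resource capacity of each identical host.

    Returns a list of hosts, each a list of service names.
    """
    hosts = []  # each entry: {"free": remaining capacity, "services": [...]}
    # Place the largest demands first; ties keep their input order.
    for name, size in sorted(demands.items(), key=lambda kv: kv[1], reverse=True):
        if size > capacity:
            raise ValueError(f"{name} does not fit on a single host")
        for host in hosts:
            if host["free"] >= size:
                host["free"] -= size
                host["services"].append(name)
                break
        else:
            # No existing host had room, so open a new one.
            hosts.append({"free": capacity - size, "services": [name]})
    return [host["services"] for host in hosts]


# Hypothetical vCPU demands packed onto 8-vCPU hosts.
demands = {"api": 4, "search": 3, "recs": 3, "logs": 2, "auth": 2, "cron": 1}
for i, services in enumerate(first_fit_decreasing(demands, capacity=8), start=1):
    print(f"host {i}: {services}")
```

First-fit decreasing is a heuristic, so it does not always find the minimum number of hosts, and a real scheduler would also weigh memory, network, and availability constraints.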
28. Related Sessions
• SPOT302 – Availability: The New Kind of Innovator’s Dilemma
• DVO203 – A Day in the Life of a Netflix Engineer using 37% of the Internet
• ISM301 – Engineering Netflix Global Operations in the Cloud