Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud

Surge 2013 presentation which covers how Netflix maximizes engineering velocity while keeping risks to scalability, reliability, and performance in check.

Published in: Technology
  • Maximum engineering velocity can only be achieved when deployment velocity is a non-factor…thousands of systems in the time it takes to get a coffee.
  • Chronos is the “go to” tool when something goes awry in production
  • Transcript of "Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud"

    1. Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud
       Coburn Watson, Manager, Cloud Performance, Netflix
       Surge '13
    2. Netflix, Inc.
       • World's leading internet television network
       • ~38 million subscribers in 40+ countries
       • Over a billion hours streamed per month
       • Approximately 33% of all US Internet traffic at night
       • Recent Notables
         • Increased Originals catalog
         • Large open source contribution
         • OpenConnect (homegrown CDN)
    3. About Me
       • Manage Cloud Performance Engineering Team
         • Sub-team of Cloud Solutions Organization
       • Focus on performance since 2000
         • Large-scale billing applications, eCommerce, datacenter mgmt., etc.
         • Genentech, McKesson, Amdocs, Mercury Int., HP, etc.
       • Passion for tackling performance at cloud-scale
       • Looking for great performance engineers
         • cwatson@netflix.com
    4. Freedom and Responsibility
       • Culture deck..a great read
       • Good performers: 2x, Top performers: 10x
       • What engineers dislike
         • cumbersome processes
         • deployment inefficiency
         • restricted access
         • restricted technical freedom
         • lack of trust
       • If removed…maximize:
         • Engineering velocity
         • Engineer satisfaction
    5. Maximizing: Engineering Velocity
    6. How
       • Implementation freedom
         • SCM, libraries, language
         • that said..platform benefits exist
       • Deployment freedom
         • Service team owns
           • push schedule, functionality, performance
           • operational activities (being paged)
       • On-demand cloud capacity
         • Thousands of instances at the push of a button
    7. Rapid Deployment? Impossible.. 3-6 Months?
    8. Rapid (Cloud) Deployment: 3-5 Minutes
    9. BaseAMI
       • Supply the foundation
         • Monitoring, java, apache, tomcat, etc.
       • Open source project: Aminator
    10. Pushing Code: Red-Black
        • Gracefully roll code in, or out, of production
        • Asgard is our AWS configuration mgmt. tool
    11. Not all Roses
        • Compounded risks with increased velocity
        • Risks: Decreased Reliability, Performance, and Scalability
    12. Goal: CI (Continuous Improvement)
    13. Maximizing: Reliability
    14. Fear (Revere) the Monkeys
        • Simulate
          • Latency
          • Errors
        • Initiate
          • Instance Termination
          • Availability Zone Failure
        • Identify
          • Configuration Drift
        … in Test and Production
    15. Tracking Change: Chronos
        • Aggregate Significant Events *
        • Current Sources:
          • Pushes (Asgard)
          • Production Change Requests (JIRA)
          • AWS Notifications
          • Dynamic Property Changes
          • ASG Scaling Events
        • Implementation
          • Simple REST-service; customized adapters
        * “can disrupt production service”
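The Chronos slide describes per-source adapters feeding one aggregated event stream. The following is a minimal in-memory sketch of that idea, not the Chronos implementation: `ChangeEvent`, `EventLog`, the adapter functions, and every payload field name are hypothetical, and the REST layer is left out.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ChangeEvent:
    """One normalized 'significant event' (hypothetical schema)."""
    timestamp: datetime
    source: str
    summary: str


class EventLog:
    """In-memory stand-in for an aggregated change-event store."""

    def __init__(self) -> None:
        self._adapters: Dict[str, Callable[[dict], ChangeEvent]] = {}
        self._events: List[ChangeEvent] = []

    def register_adapter(self, source: str,
                         adapter: Callable[[dict], ChangeEvent]) -> None:
        # Each source gets its own adapter that maps a raw payload
        # onto the common ChangeEvent shape.
        self._adapters[source] = adapter

    def ingest(self, source: str, payload: dict) -> ChangeEvent:
        event = self._adapters[source](payload)
        self._events.append(event)
        return event

    def events_between(self, start: datetime, end: datetime) -> List[ChangeEvent]:
        # Return every event in the window, oldest first, across all sources.
        hits = [e for e in self._events if start <= e.timestamp <= end]
        return sorted(hits, key=lambda e: e.timestamp)


def asgard_adapter(payload: dict) -> ChangeEvent:
    # Assumed payload shape for a code push; not Asgard's real format.
    return ChangeEvent(datetime.fromisoformat(payload["time"]),
                       "asgard", f"push of {payload['app']}")


def jira_adapter(payload: dict) -> ChangeEvent:
    # Assumed payload shape for a change request; not JIRA's real format.
    return ChangeEvent(datetime.fromisoformat(payload["created"]),
                       "jira", payload["title"])
```

Querying a time window around an incident then returns pushes and change requests interleaved in time order, which is the "what changed just before this broke?" view the slide notes describe.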
    16. Chronos, cont.
    17. Automated Canary Analysis
        • Identify regression between new and existing code
        • Point ACA to baseline (prod) and canary ASG
        • Typically analyze an hour's worth of time series data
        • Compare ratio of averages between canary and baseline
        • Evaluate range and noise; determine quality of signal
        • Bucket: Hot, Cold, Noisy, or OK
        • Multiple classifiers available
        • Multiple metric collections (e.g. hand-picked by service, general)
        • Rollup
          • Constrained: along metric dimensions
          • Final: Score the canary
        • Implementation: R-based analysis
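The ratio-of-averages comparison and the Hot/Cold/Noisy/OK bucketing on the Automated Canary Analysis slide can be illustrated roughly as follows. This is a sketch in Python rather than the R-based analysis the slide mentions. The tolerance, the noise threshold, the use of the coefficient of variation as the noise measure, and the "percentage of OK metrics" score are all assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def _coefficient_of_variation(samples: Sequence[float]) -> float:
    """Spread relative to the mean; infinite when the mean is zero."""
    avg = mean(samples)
    return float("inf") if avg == 0 else pstdev(samples) / abs(avg)


def classify_metric(baseline: Sequence[float], canary: Sequence[float],
                    tolerance: float = 0.2, max_noise: float = 0.5) -> str:
    """Bucket one metric as HOT, COLD, NOISY, or OK (assumed thresholds)."""
    noise = max(_coefficient_of_variation(baseline),
                _coefficient_of_variation(canary))
    if noise > max_noise:
        # Signal quality is too poor to trust the ratio.
        return "NOISY"
    ratio = mean(canary) / mean(baseline)
    if ratio > 1 + tolerance:
        return "HOT"
    if ratio < 1 - tolerance:
        return "COLD"
    return "OK"


def score_canary(buckets: Dict[str, str]) -> float:
    """Final rollup (assumed): percentage of metrics bucketed as OK."""
    ok = sum(1 for bucket in buckets.values() if bucket == "OK")
    return 100.0 * ok / len(buckets)


if __name__ == "__main__":
    baseline_latency = [100, 102, 98, 101]
    canary_latency = [150, 148, 152, 151]
    print(classify_metric(baseline_latency, canary_latency))
```

A series whose mean is zero is treated as NOISY here simply to avoid dividing by zero; a real classifier would need a deliberate policy for that case.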
    18. ACA: in Action
        [Figure: per-metric classifications (HOT, COLD, NOISY, OK), with the constrained rollup (dashed) and the final rollup]
    19. Hystrix: Defend Your App
        • Protection from downstream service failures
        • Functional (unavailable) or performance in nature
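Hystrix is a Java library, and the slide gives no code. As a language-neutral illustration of the protection it describes, here is a minimal circuit-breaker sketch with a fallback. The `CircuitBreaker` class, its parameters, and its thresholds are hypothetical and are not the Hystrix API. Slow calls are detected only after they return, whereas a production library would also enforce timeouts.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Stops calling a failing or slow dependency and serves a fallback."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 slow_call: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._failure_threshold = failure_threshold
        self._reset_after = reset_after    # seconds the circuit stays open
        self._slow_call = slow_call        # seconds before a call counts as slow
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                # Circuit open: skip the dependency entirely.
                return fallback()
            # Reset window elapsed: let one trial call through.
            self._opened_at = None
        start = self._clock()
        try:
            result = func()
        except Exception:
            # Functional failure (dependency unavailable or erroring).
            self._record_failure()
            return fallback()
        if self._clock() - start > self._slow_call:
            # Performance failure: the call succeeded but took too long.
            self._record_failure()
        else:
            self._failures = 0
        return result

    def _record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._failure_threshold:
            self._opened_at = self._clock()
```

Because the failure count is only cleared by a fast, successful call, a failed trial call after the reset window reopens the circuit immediately.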
    20. Maximizing: Scalability and Performance
    21. Dynamic Scaling
        EC2 footprint autoscales 2500-3500 instances per day
        • order of tens of thousands of EC2 instances
        • Larger ASG spans 200-900 m2.4xlarge daily
        Why:
        • Improved scalability during unexpected workloads
        • Absorb variance in service performance profile
        • Reactive chain of dependencies
        • Creates "reserved instance troughs" for batch activity
    22. Dynamic Scaling, cont.
        Example covers 3 services
        • 2 edge (A,B), 1 mid-tier (C)
        • C has more upstream services than simply A and B
        Multiple Autoscaling Policies
        • (A) System Load Average
        • (B,C) Request-Rate based
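A request-rate-based policy like the one named for services B and C can be reduced to a simple sizing rule: divide the aggregate request rate by a per-instance target and clamp the result to the group's bounds. The sketch below assumes that simplification. The function name and the per-instance target are hypothetical, and real AWS Auto Scaling policies are driven by alarms and step adjustments rather than a direct calculation like this.

```python
import math


def desired_instances(aggregate_rps: float, target_rps_per_instance: float,
                      min_size: int, max_size: int) -> int:
    """Instances needed to keep each one at or below its target request rate."""
    if target_rps_per_instance <= 0:
        raise ValueError("target_rps_per_instance must be positive")
    needed = math.ceil(aggregate_rps / target_rps_per_instance)
    # Never shrink below the floor or grow past the ceiling of the group.
    return max(min_size, min(max_size, needed))


if __name__ == "__main__":
    # Bounds borrowed from the 200-900 instance span quoted for the larger ASG;
    # the 100 requests/sec per-instance target is an invented figure.
    for rps in (10_000, 45_000, 200_000):
        print(rps, desired_instances(rps, 100, min_size=200, max_size=900))
```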
    23. Dynamic Scaling, cont.
    24. Dynamic Scaling, cont.
        • Response time variability greatest during scaling events
        • Average response time primarily between 75-150 msec
    25. Dynamic Scaling, cont.
        • Instance counts 3x, Aggregate requests 4.5x (not shown)
        • Average CPU utilization per instance: ~25-55%
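The two growth factors on this slide imply how much the load on each individual instance changed over the scaling cycle. A short worked calculation, using only the 3x and 4.5x figures from the slide:

```python
# Growth factors quoted on the slide, trough to peak.
instance_growth = 3.0
request_growth = 4.5

# Requests per instance scale by the ratio of the two factors.
per_instance_growth = request_growth / instance_growth
print(f"Per-instance request rate grows {per_instance_growth:.1f}x")
```

Aggregate traffic grows 4.5x, but each instance handles 1.5x its trough request rate at peak because the fleet triples at the same time.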
    26. Cassandra Performance
        Study performed:
        • 24 node C* SSD-based cluster (hi1.4xlarge)
        • mid-tier service load application
        • Targeting 2x production rates
        • Increase read ops from 30k to 70k in ~3 minutes
        • Increase write ops from 750 to 1500 in ~3 minutes
        Results:
        • 95th pctl response time increase: ~17 msec to 45 msec
        • 99th pctl response time increase: ~35 msec to 80 msec
    27. EVcache (memcached) Scalability
        • Response times consistent during 4x increase in load *
        * Due to upstream code change
    28. Cloud-scale Load Testing
        • Ad-Hoc or CI-based load test model
        • (CI) Run-over-run comparison; email on rule violation
        1. Jenkins initiates job
        2. JMeter instances apply load
        3. Results written to s3
        4. Instance metrics published to Atlas
        5. Raw data fetched and processed
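The run-over-run comparison in the CI load test model can be pictured as checking each metric from the current run against the previous run and flagging regressions that exceed a rule. The sketch below assumes that shape. The metric names, the rule format (maximum allowed relative increase), and `find_violations` itself are hypothetical, and the emailing step is left out.

```python
from typing import Dict, List


def find_violations(previous: Dict[str, float], current: Dict[str, float],
                    max_regression: Dict[str, float]) -> List[str]:
    """Describe every metric that got worse by more than its rule allows."""
    violations = []
    for metric, limit in max_regression.items():
        if metric not in previous or metric not in current:
            continue  # no run-over-run comparison possible
        before, after = previous[metric], current[metric]
        if before <= 0:
            continue  # relative change is undefined
        change = (after - before) / before
        if change > limit:
            violations.append(
                f"{metric}: {before:g} -> {after:g} "
                f"(+{change:.0%}, limit +{limit:.0%})"
            )
    return violations


if __name__ == "__main__":
    last_run = {"p95_ms": 40.0, "p99_ms": 80.0}
    this_run = {"p95_ms": 50.0, "p99_ms": 82.0}
    rules = {"p95_ms": 0.10, "p99_ms": 0.10}  # allow up to a 10% increase
    for line in find_violations(last_run, this_run, rules):
        print(line)  # a CI job would put these lines in the notification email
```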
    29. Conclusions
        • Continually accelerate engineering velocity
        • Evolve architecture and processes to mitigate risks
        • Stateless micro-service architectures win!
        • Remove barriers for engineers
        • Last option should be to reduce rate of change
        • Exercise failure and “thundering herd” scenarios
        • Cloud native scaling and resiliency are key factors
        • Leverage pre-existing OSS PaaS when possible
    30. Netflix Open Source
        Our Open Source Software simplifies mgmt at scale
        Great projects, stunning colleagues: jobs.netflix.com
    31. Q&A
        • cwatson@netflix.com
        • Netflix Tech Blog: http://techblog.netflix.com