On any given day DailyKos can see traffic peaks of up to five times our base traffic, sometimes requiring us to scale out to double our backend app server capacity within a 10-20 minute window, often at unpredictable times. In this talk, Susan Potter will discuss DailyKos's use of autoscaling in EC2, from the essential components to some gotchas learned along the way.
4. whoami
$ finger $(whoami)
Name: Susan Potter
Last login Sun Jan 18 18:30 1996 (GMT) on tty1
- 23 years writing software
- Server-side/backend/infrastructure engineering, mostly
- Likes: functional programming (e.g. Haskell)
Today:
- Build new backend services in Haskell
- I babysit a bloated Rails webapp
Previously: trading systems, SaaS products, CI/CD
5. In the cloud
Figure 1: Programming cloud infrastructures from the soybean fields
8. Legacy/History
• Deliver news, discussions, & campaigns to over two million users/day
• Traffic varies significantly during the day
• Heavy reads (Varnish saves our site every day)
• Writes go to content publishing backends, which are slow and expensive (Perl, Ruby)
• When news breaks or the newsletter is sent, our active users want to log in, comment, recommend, write their own stories, etc., all of which are WRITES
13. Related Problems
• Our deployment method (Capistrano) has horrific failure modes during scale-out/in events
• Chef converged less and less, while the work to maintain it kept increasing
• Moving to dynamic autoscaling didn't fix these directly, but our solution considered how it could help
16. Before: When I started (Sept 2016)
• Only one problematic service was in a static autoscaling group (no scaling policies, manually modified by a human :gasp:, "static")
• Services used atrophying AMIs that might not converge because external APT source dependencies had changed in significant ways :(
• AMIs often didn't successfully bootstrap within 15 minutes
19. Today: All services in dynamic autoscaling groups
• Frontend caching/routing layer
• Both content publishing backends
• Internal systems, e.g. logging, metrics, etc.
45. EC2 Instance Bootstrapping
• Chef converge bootstrapping took ~15 minutes
• Improved bootstrapping by an order of magnitude with fully baked AMIs
• Now we fully bake AMIs for each config and app change (about 5 minutes, once per release per environment, a constant factor, using NixOS)
Fully baking AMIs also gives us system reproducibility that convergent configuration systems like Chef couldn't give us.
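Once a fully baked image is registered with EC2, deploy tooling needs to find it later and know it is launchable. A minimal boto3 sketch of that step; the Release/Environment tag scheme and the finalize_baked_ami helper are hypothetical, not the actual DailyKos tooling:

import boto3

ec2 = boto3.client("ec2")

def finalize_baked_ami(ami_id: str, release: str, environment: str) -> None:
    """Tag a freshly registered, fully baked AMI and wait until it is usable."""
    # Tag the image so deploys can look it up by release + environment
    # (hypothetical tag keys).
    ec2.create_tags(
        Resources=[ami_id],
        Tags=[
            {"Key": "Release", "Value": release},
            {"Key": "Environment", "Value": environment},
        ],
    )
    # Launching an ASG against a still-pending image fails, so block until
    # EC2 reports the AMI as available.
    ec2.get_waiter("image_available").wait(ImageIds=[ami_id])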
48. Right-Size Instance Types per Service
• We used to use whatever instance type was already set, because $REASONS
• Now we inspect each service's resource usage in production at peak, typical, and overnight resting states to decide how to size that service's cluster
• We recommend this practice once you are on ASGs; otherwise you are dropping $$$ in AWS's lap and potentially hurting your product's UX
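One way to gather those peak/typical/overnight numbers is straight from CloudWatch. A minimal sketch using the stock AWS/EC2 CPU metric; the asg_name parameter and the hourly window are assumptions, and in practice you would look at memory, I/O, and app-level metrics too:

from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")

def cpu_profile(asg_name: str, days: int = 7) -> list:
    """Hourly average/max CPU for an ASG over the last `days` days."""
    now = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "AutoScalingGroupName", "Value": asg_name}],
        StartTime=now - timedelta(days=days),
        EndTime=now,
        Period=3600,  # one datapoint per hour shows peak vs. overnight resting
        Statistics=["Average", "Maximum"],
    )
    return sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])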
51. Find Leading Indicator Metric for Dynamic Scale Out/In
• Every service behaves differently under load
• We initially scaled dynamically using policies based purely on CPU (a start, but not good enough for us)
• Now we report custom metrics to AWS CloudWatch that are leading indicators that our cluster needs to scale out or in
This leads to more predictable site performance, even under traffic spikes.
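Publishing a leading indicator like this is a single CloudWatch API call that a scaling policy can then alarm on. A minimal sketch; the DailyKos/AppServers namespace, the RequestQueueDepth metric, and how queue depth is measured are all hypothetical:

import boto3

cloudwatch = boto3.client("cloudwatch")

def report_queue_depth(asg_name: str, queue_depth: int) -> None:
    """Push a custom leading-indicator metric for scaling policies to target."""
    cloudwatch.put_metric_data(
        Namespace="DailyKos/AppServers",  # hypothetical namespace
        MetricData=[
            {
                "MetricName": "RequestQueueDepth",  # hypothetical metric
                "Dimensions": [
                    {"Name": "AutoScalingGroupName", "Value": asg_name},
                ],
                "Value": float(queue_depth),
                "Unit": "Count",
            }
        ],
    )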
54. Fail-Safe Semantics for Deploy
• AMI artifacts built and tested
• AMIs for each service uploaded and registered with AWS EC2
• Brand-new ASG + launch configuration (LC) created, referring to the new AMI for the release
• Scaling policies copied from the current/live ASG over to the new ASG
• min, max, and desired capacities copied from current to new
• Wait for all desired instances to report app-level healthy
• Add the new ASG to the ALB alongside the current/old ASG
• Remove the current/old ASG from the ALB
• Set min=desired=0 in the old ASG
• Clean up the stale ASG (not the old one, but the one older than that)
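A rough boto3 sketch of the sequence above, under a few assumptions: the helper names (blue_green_swap, wait_until_healthy) and parameters are made up, the health gate here is ASG-level rather than the app-level check the slide calls for, and re-wiring CloudWatch alarms to the copied policies is omitted:

import time
import boto3

autoscaling = boto3.client("autoscaling")

def wait_until_healthy(asg_name: str, timeout: int = 600) -> None:
    """Crude stand-in for the app-level health gate: wait until every
    desired instance is InService and Healthy at the ASG level."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        group = autoscaling.describe_auto_scaling_groups(
            AutoScalingGroupNames=[asg_name])["AutoScalingGroups"][0]
        instances = group["Instances"]
        if (len(instances) >= group["DesiredCapacity"] and
                all(i["LifecycleState"] == "InService" and
                    i["HealthStatus"] == "Healthy" for i in instances)):
            return
        time.sleep(15)
    raise TimeoutError(f"{asg_name} never became healthy")

def blue_green_swap(old_asg: str, new_asg: str, lc_name: str, ami_id: str,
                    instance_type: str, target_group_arn: str) -> None:
    """Stand up a new ASG for this release, then swap it into the ALB."""
    current = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[old_asg])["AutoScalingGroups"][0]

    # Brand-new LC + ASG referring to the release AMI.
    autoscaling.create_launch_configuration(
        LaunchConfigurationName=lc_name, ImageId=ami_id,
        InstanceType=instance_type)
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName=new_asg,
        LaunchConfigurationName=lc_name,
        VPCZoneIdentifier=current["VPCZoneIdentifier"],
        # Copy min/max/desired capacities from the live group.
        MinSize=current["MinSize"],
        MaxSize=current["MaxSize"],
        DesiredCapacity=current["DesiredCapacity"])

    # Copy scaling policies from the live ASG onto the new one.
    for policy in autoscaling.describe_policies(
            AutoScalingGroupName=old_asg)["ScalingPolicies"]:
        params = {k: policy[k] for k in (
            "PolicyType", "AdjustmentType", "ScalingAdjustment", "Cooldown",
            "MetricAggregationType", "StepAdjustments",
            "EstimatedInstanceWarmup", "TargetTrackingConfiguration")
            if k in policy}
        autoscaling.put_scaling_policy(
            AutoScalingGroupName=new_asg,
            PolicyName=policy["PolicyName"], **params)

    wait_until_healthy(new_asg)

    # Swap at the load balancer: attach new alongside old, then detach old.
    autoscaling.attach_load_balancer_target_groups(
        AutoScalingGroupName=new_asg, TargetGroupARNs=[target_group_arn])
    autoscaling.detach_load_balancer_target_groups(
        AutoScalingGroupName=old_asg, TargetGroupARNs=[target_group_arn])

    # Drain the old ASG but keep it around for fast rollback.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=old_asg, MinSize=0, DesiredCapacity=0)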
64. Other stuff
• DONE Scripted rollback (~1 minute back to the previous version)
• TODO Implement canary deploy capability
• TODO Check that error rates and/or latencies haven't increased before removing the old ASG from the ALB
• REMINDER Your max capacity should be determined by your backend runtime dependencies (it's transitive: the app tier can only scale out as far as the databases and services beneath it can absorb)
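Because the old ASG is drained to zero rather than deleted, rollback is roughly the deploy swap run in reverse. A minimal sketch under the same assumptions as the deploy sketch (hypothetical names and parameters, ASG-level in-service check standing in for an app-level one):

import time
import boto3

autoscaling = boto3.client("autoscaling")

def rollback(previous_asg: str, broken_asg: str, target_group_arn: str,
             min_size: int, max_size: int, desired: int) -> None:
    """Resurrect the previous release's drained ASG and swap the bad one out."""
    # Scale the previous ASG back up to its pre-deploy capacities.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=previous_asg,
        MinSize=min_size, MaxSize=max_size, DesiredCapacity=desired)

    # Wait for the resurrected instances to come into service.
    deadline = time.time() + 600
    while time.time() < deadline:
        group = autoscaling.describe_auto_scaling_groups(
            AutoScalingGroupNames=[previous_asg])["AutoScalingGroups"][0]
        in_service = [i for i in group["Instances"]
                      if i["LifecycleState"] == "InService"]
        if len(in_service) >= desired:
            break
        time.sleep(10)

    # Reverse the ALB swap, then drain the broken ASG.
    autoscaling.attach_load_balancer_target_groups(
        AutoScalingGroupName=previous_asg, TargetGroupARNs=[target_group_arn])
    autoscaling.detach_load_balancer_target_groups(
        AutoScalingGroupName=broken_asg, TargetGroupARNs=[target_group_arn])
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=broken_asg, MinSize=0, DesiredCapacity=0)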