On any given day DailyKos can see traffic peaks of up to five times our base traffic, sometimes requiring us to scale out to double our backend app server capacity within a 10-20 minute window, often at unpredictable times. In this talk, Susan Potter will discuss DailyKos's use of autoscaling in EC2, from the essential components to some gotchas learned along the way.
4. whoami
$ finger $(whoami)
Name: Susan Potter
Last login Sun Jan 18 18:30 1996 (GMT) on tty1
- 23 years writing software
- Server-side/backend/infrastructure engineering, mostly
- Likes: functional programming (e.g. Haskell)
Today:
- Build new backend services in Haskell
- I babysit a bloated Rails webapp
Previously: trading systems, SaaS products, CI/CD
5. In the cloud
Figure 1: Programming cloud infrastructures from the soybean fields
8. Legacy/History
• Deliver news, discussions, & campaigns to over two million users/day
• Traffic varies significantly during the day
• Heavy reads (Varnish saves our site every day)
• Writes go to content publishing backends, which are slow and expensive (Perl, Ruby)
• When news breaks or the newsletter is sent, our active users want to log in, comment, recommend, write their own stories, etc., all of which are WRITES
13. Related Problems
• Our deployment method (Capistrano) has horrific failure modes during scale-out/in events
• Chef converged less and less, while the work to maintain it kept increasing
• Moving to dynamic autoscaling didn't fix these directly, but our solution considered how it could help
16. Before: When I started (Sept 2016)
• Only one problematic service was in a static autoscaling group (no scaling policies, manually modified by a human :gasp:, "static")
• Services used atrophying AMIs that might not converge because external APT source dependencies had changed in significant ways :(
• AMIs often didn't successfully bootstrap within 15 minutes
19. Today: All services in dynamic autoscaling groups
• Frontend caching/routing layer
• Both content publishing backends
• Internal systems, e.g. logging, metrics, etc.
45. EC2 Instance Bootstrapping
• Chef converge bootstrapping took ~15 minutes
• Improved bootstrapping by an order of magnitude with fully baked AMIs
• Now we fully bake AMIs for each config and app change (about 5 minutes, once per release per environment, a constant factor, using NixOS)
Fully baking AMIs also gives us system reproducibility that convergent configuration systems like Chef couldn't give us.
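Once a fully baked image is registered with EC2, deploy tooling needs to find it later and know it is launchable. A minimal boto3 sketch of that step; the Release/Environment tag scheme and the finalize_baked_ami helper are hypothetical, not the actual DailyKos tooling:

import boto3

ec2 = boto3.client("ec2")

def finalize_baked_ami(ami_id: str, release: str, environment: str) -> None:
    """Tag a freshly registered, fully baked AMI and wait until it is usable."""
    # Tag the image so deploys can look it up by release + environment
    # (hypothetical tag keys).
    ec2.create_tags(
        Resources=[ami_id],
        Tags=[
            {"Key": "Release", "Value": release},
            {"Key": "Environment", "Value": environment},
        ],
    )
    # Launching an ASG against a still-pending image fails, so block until
    # EC2 reports the AMI as available.
    ec2.get_waiter("image_available").wait(ImageIds=[ami_id])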
48. Right-Size Instance Types per Service
• We used to use whatever instance type was already set, because $REASONS
• Now we inspect each service's resource usage in production at peak, typical, and overnight resting states to decide how to size that service's cluster
• We recommend this practice once you are on ASGs; otherwise you are dropping $$$ in AWS's lap and potentially hurting your product's UX
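One way to gather those peak/typical/overnight numbers is straight from CloudWatch. A minimal sketch using the stock AWS/EC2 CPU metric; the asg_name parameter and the hourly window are assumptions, and in practice you would look at memory, I/O, and app-level metrics too:

from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")

def cpu_profile(asg_name: str, days: int = 7) -> list:
    """Hourly average/max CPU for an ASG over the last `days` days."""
    now = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "AutoScalingGroupName", "Value": asg_name}],
        StartTime=now - timedelta(days=days),
        EndTime=now,
        Period=3600,  # one datapoint per hour shows peak vs. overnight resting
        Statistics=["Average", "Maximum"],
    )
    return sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])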
51. Find Leading Indicator Metric for Dynamic Scale Out/In
• Every service behaves differently under load
• We initially scaled dynamically using policies based purely on CPU (a start, but not good enough for us)
• Now we report custom metrics to AWS CloudWatch that are leading indicators that our cluster needs to scale out or in
This leads to more predictable site performance, even under traffic spikes.
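Publishing a leading indicator like this is a single CloudWatch API call that a scaling policy can then alarm on. A minimal sketch; the DailyKos/AppServers namespace, the RequestQueueDepth metric, and how queue depth is measured are all hypothetical:

import boto3

cloudwatch = boto3.client("cloudwatch")

def report_queue_depth(asg_name: str, queue_depth: int) -> None:
    """Push a custom leading-indicator metric for scaling policies to target."""
    cloudwatch.put_metric_data(
        Namespace="DailyKos/AppServers",  # hypothetical namespace
        MetricData=[
            {
                "MetricName": "RequestQueueDepth",  # hypothetical metric
                "Dimensions": [
                    {"Name": "AutoScalingGroupName", "Value": asg_name},
                ],
                "Value": float(queue_depth),
                "Unit": "Count",
            }
        ],
    )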
54. Fail-Safe Semantics for Deploy
• AMI artifacts built and tested
• AMIs for each service uploaded and registered with AWS EC2
• Brand-new ASG + launch configuration (LC) created, referring to the new AMI for the release
• Scaling policies copied from the current/live ASG over to the new ASG
• min, max, and desired capacities copied from current to new
• Wait for all desired instances to report app-level healthy
• Add the new ASG to the ALB alongside the current/old ASG
• Remove the current/old ASG from the ALB
• Set min=desired=0 in the old ASG
• Clean up the stale ASG (not the old one, but the one older than that)
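A rough boto3 sketch of the sequence above, under a few assumptions: the helper names (blue_green_swap, wait_until_healthy) and parameters are made up, the health gate here is ASG-level rather than the app-level check the slide calls for, and re-wiring CloudWatch alarms to the copied policies is omitted:

import time
import boto3

autoscaling = boto3.client("autoscaling")

def wait_until_healthy(asg_name: str, timeout: int = 600) -> None:
    """Crude stand-in for the app-level health gate: wait until every
    desired instance is InService and Healthy at the ASG level."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        group = autoscaling.describe_auto_scaling_groups(
            AutoScalingGroupNames=[asg_name])["AutoScalingGroups"][0]
        instances = group["Instances"]
        if (len(instances) >= group["DesiredCapacity"] and
                all(i["LifecycleState"] == "InService" and
                    i["HealthStatus"] == "Healthy" for i in instances)):
            return
        time.sleep(15)
    raise TimeoutError(f"{asg_name} never became healthy")

def blue_green_swap(old_asg: str, new_asg: str, lc_name: str, ami_id: str,
                    instance_type: str, target_group_arn: str) -> None:
    """Stand up a new ASG for this release, then swap it into the ALB."""
    current = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[old_asg])["AutoScalingGroups"][0]

    # Brand-new LC + ASG referring to the release AMI.
    autoscaling.create_launch_configuration(
        LaunchConfigurationName=lc_name, ImageId=ami_id,
        InstanceType=instance_type)
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName=new_asg,
        LaunchConfigurationName=lc_name,
        VPCZoneIdentifier=current["VPCZoneIdentifier"],
        # Copy min/max/desired capacities from the live group.
        MinSize=current["MinSize"],
        MaxSize=current["MaxSize"],
        DesiredCapacity=current["DesiredCapacity"])

    # Copy scaling policies from the live ASG onto the new one.
    for policy in autoscaling.describe_policies(
            AutoScalingGroupName=old_asg)["ScalingPolicies"]:
        params = {k: policy[k] for k in (
            "PolicyType", "AdjustmentType", "ScalingAdjustment", "Cooldown",
            "MetricAggregationType", "StepAdjustments",
            "EstimatedInstanceWarmup", "TargetTrackingConfiguration")
            if k in policy}
        autoscaling.put_scaling_policy(
            AutoScalingGroupName=new_asg,
            PolicyName=policy["PolicyName"], **params)

    wait_until_healthy(new_asg)

    # Swap at the load balancer: attach new alongside old, then detach old.
    autoscaling.attach_load_balancer_target_groups(
        AutoScalingGroupName=new_asg, TargetGroupARNs=[target_group_arn])
    autoscaling.detach_load_balancer_target_groups(
        AutoScalingGroupName=old_asg, TargetGroupARNs=[target_group_arn])

    # Drain the old ASG but keep it around for fast rollback.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=old_asg, MinSize=0, DesiredCapacity=0)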
64. Other stuff
• DONE Scripted rollback (~1 minute back to the previous version)
• TODO Implement canary deploy capability
• TODO Check that error rates and/or latencies haven't increased before removing the old ASG from the ALB
• REMINDER Your max capacity should be determined by your backend runtime dependencies (it's transitive: the app tier can only scale out as far as the databases and services beneath it can absorb)
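Because the old ASG is drained to zero rather than deleted, rollback is roughly the deploy swap run in reverse. A minimal sketch under the same assumptions as the deploy sketch (hypothetical names and parameters, ASG-level in-service check standing in for an app-level one):

import time
import boto3

autoscaling = boto3.client("autoscaling")

def rollback(previous_asg: str, broken_asg: str, target_group_arn: str,
             min_size: int, max_size: int, desired: int) -> None:
    """Resurrect the previous release's drained ASG and swap the bad one out."""
    # Scale the previous ASG back up to its pre-deploy capacities.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=previous_asg,
        MinSize=min_size, MaxSize=max_size, DesiredCapacity=desired)

    # Wait for the resurrected instances to come into service.
    deadline = time.time() + 600
    while time.time() < deadline:
        group = autoscaling.describe_auto_scaling_groups(
            AutoScalingGroupNames=[previous_asg])["AutoScalingGroups"][0]
        in_service = [i for i in group["Instances"]
                      if i["LifecycleState"] == "InService"]
        if len(in_service) >= desired:
            break
        time.sleep(10)

    # Reverse the ALB swap, then drain the broken ASG.
    autoscaling.attach_load_balancer_target_groups(
        AutoScalingGroupName=previous_asg, TargetGroupARNs=[target_group_arn])
    autoscaling.detach_load_balancer_target_groups(
        AutoScalingGroupName=broken_asg, TargetGroupARNs=[target_group_arn])
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=broken_asg, MinSize=0, DesiredCapacity=0)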