9. BEFORE ECS
• Less Transparency
• Container Orchestration Issues
• Monitoring and Alerting - built our own with CloudWatch custom metrics
3RD-PARTY WORKER SOLUTION
10. BEFORE ECS
• Custom AMIs + ASGs
• One container per EC2 instance!
• Long, error-prone deployments
APPS & MICROSERVICES
12. EVALUATING ECS
• IAM Roles specific to each service running on the cluster - principle of least privilege
• Familiarity for DevOps team
• Integration with AWS services like CloudWatch
• Enterprise support
• Less cluster maintenance
14. INFRASTRUCTURE AS CODE
We Terraformed (almost) everything
• VPCs
• ECS Clusters
• SQS Queues
• ECS Task Definitions
• Docker image (tag)
• Resource reservations
• IAM Role
• Env vars
• CloudWatch alarms
15. INFRASTRUCTURE AS CODE
TF config and module structure
• DevOps developed base modules
• Development teams imported (terraform init) modules and tweaked them for each worker or service
• Terraform definitions for each worker or service are checked in with the app code
Project-specific variables
• Service name
• ECS Task definition
• CPU Allocation
• Hard/Soft memory limits
• CloudWatch alarm thresholds
• SQS max queue length
16. MIGRATION: VPC DESIGN
• For ECS, we introduced new VPCs in each environment
• qa, staging, prod
• We transitioned to new Application ELBs when bringing up an ECS service
• Service discovery is achieved through well-known ELB DNS names
17. MIGRATION: PORTING WORKER CODE
• Long running workers that process multiple tasks
• Support for container draining using a SIGINT signal
• Feature flip to switch between the 3rd-party queue and SNS/SQS
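A minimal sketch of the draining behavior listed above, assuming a worker that pulls tasks in a loop. The names `DrainingWorker` and `install_handler` and the in-memory task list are hypothetical stand-ins, not the deck's actual code; a real worker would poll a queue service instead of a local deque.

```python
import signal
from collections import deque


class DrainingWorker:
    """Long-running worker that stops taking new tasks once SIGINT arrives."""

    def __init__(self, tasks):
        self.pending = deque(tasks)  # hypothetical stand-in for a queue client
        self.processed = []
        self.draining = False

    def handle_sigint(self, signum, frame):
        # Only set a flag here; the task in flight is allowed to finish.
        self.draining = True

    def process(self, task):
        # Placeholder for real work on a single task.
        self.processed.append(task)

    def run(self):
        # Check the flag between tasks, never in the middle of one.
        while self.pending and not self.draining:
            self.process(self.pending.popleft())
        return self.processed


def install_handler(worker):
    """Route SIGINT to the worker (must be called from the main thread)."""
    signal.signal(signal.SIGINT, worker.handle_sigint)
```

Because the handler only flips a flag, a container that receives SIGINT finishes its current task, leaves the remaining tasks on the queue for other workers, and then exits the loop.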
19. MIGRATION: IAM ROLES
• Each ECS Task Definition has an attached IAM Role
• We started from our existing IAM Roles, consolidated them
• Tightened policies toward principle of least privilege
20. CONTAINER TAGGING
“Master image was deployed to production!”
“...but was it today’s master?!”
• Container images are tagged with semantic versions
• ex: v1.0.33
• Matches git tag of the release
• Provides easy roll-back
• Ops gets visibility into the version of every service running
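As an illustration of why version tags make roll-back easy, here is a small sketch that orders tags of the form `v1.0.33` numerically and picks the newest release older than the deployed one. The helper names `parse_tag` and `rollback_target` are hypothetical, and the tag format is assumed from the single example on this slide.

```python
import re

# Assumed tag format: "v" followed by MAJOR.MINOR.PATCH, e.g. v1.0.33
TAG_PATTERN = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def parse_tag(tag):
    """Turn 'v1.0.33' into (1, 0, 33) so tags compare numerically."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"not a release tag: {tag!r}")
    return tuple(int(part) for part in match.groups())


def rollback_target(deployed, available):
    """Return the newest tag in `available` that is older than `deployed`."""
    current = parse_tag(deployed)
    older = [tag for tag in available if parse_tag(tag) < current]
    if not older:
        return None
    return max(older, key=parse_tag)
```

A mutable tag such as `master` is rejected by `parse_tag`, which mirrors the point of the slide: a deployed tag should identify exactly one release.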
26. LESSONS: SCALING
1. Start scaling at 80% capacity
• Leaves headroom for container startup time
2. Use CPU Reservation (not Utilization)
3. Scaling is about more than resources
• Queue size threshold
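The three lessons above can be combined into one scale-out rule. The sketch below is an assumption about how such a rule might look, not the deck's actual alarm configuration: CPU reservation is computed as the CPU units reserved by running tasks divided by the CPU units registered by the cluster's instances, and the function names and the queue threshold value are hypothetical.

```python
def cpu_reservation_percent(reserved_cpu_units, registered_cpu_units):
    """Share of the cluster's registered CPU units that tasks have reserved."""
    if registered_cpu_units <= 0:
        raise ValueError("cluster has no registered CPU units")
    return 100.0 * reserved_cpu_units / registered_cpu_units


def should_scale_out(reserved_cpu_units, registered_cpu_units,
                     queue_length, queue_threshold,
                     reservation_threshold=80.0):
    """Scale out on high CPU reservation OR on a backed-up queue."""
    reservation = cpu_reservation_percent(reserved_cpu_units,
                                          registered_cpu_units)
    return reservation >= reservation_threshold or queue_length > queue_threshold
```

Reservation rather than utilization is the input because a cluster can be idle in CPU terms and still have no room left to place another task.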
27. LESSONS: NAMING
• Consistency across many projects
• Concise for DevOps, QA, Release Team and Developers
• Ease of identifying, searching and associating
• AWS resources :: code :: docker images
28. LESSONS: TERRAFORM
• Running terraform apply requires privileges across many AWS services
• We added a check to ensure the image tag was the latest
• Decouple unrelated modules to avoid unintended consequences
• This separates the remote state
30. COST SAVINGS: EXAMPLE
Old architecture, ASG-based worker service:
>=2 instances spread across AZs, per environment.
9 EC2 instances minimum for this service (many more during peak).
We traded efficient utilization for HA.
33. METRICS & MONITORING
• Pros:
• Benchmarking each service’s resource requirements
• ECS event stream is helpful for diagnosing errors
• Cons:
• Lack of container-level monitoring
• Basic CloudWatch metrics only come at 5-minute intervals
34. In summary…
• ECS is the foundation of the GoPro cloud platform
• In under a quarter, we migrated all of our worker services
• We’re realizing:
• Big cost savings, better deploys, stability
• Ops and developer happiness :-)