Migrating the GoPro Plus Cloud Service to Amazon ECS

Migrating GoPro Plus to ECS
August 22, 2017

ABOUT US
Zaven Boni, DevOps Manager
Lawrence Chin, Technical Lead

INTRODUCTION
• The GoPro Platform
• Choosing ECS
• Migration Process
• Lessons learned

PLATFORM COMPONENTS
• Media upload
• Photo and video processing
• User and subscription management
• Device management
• Web front-ends

MEDIA SERVICE ARCHITECTURE
[Diagram. Components: browser, mobile app, and desktop app; internet; webservice instances; Queue; media workers; AWS S3. Labels: "Media", "Processed", "Apps Query for Done".]

BEFORE ECS: 3RD-PARTY WORKER SOLUTION
• Less Transparency
• Container Orchestration Issues
• Monitoring and Alerting - built our own with CloudWatch custom metrics

BEFORE ECS: APPS & MICROSERVICES
• Custom AMIs + ASGs
• One container per EC2 instance!
• Long, error-prone deployments

EVALUATING ECS
• IAM Roles specific to each service running on the cluster - principle of least privilege
• Familiarity for DevOps team
• Integration with AWS services like CloudWatch
• Enterprise support
• Less cluster maintenance

INFRASTRUCTURE AS CODE
We Terraformed (almost) everything
• VPCs
• ECS Clusters
• SQS Queues
• ECS Task Definitions
• Docker image (tag)
• Resource reservations
• IAM Role
• Env vars
• CloudWatch alarms

INFRASTRUCTURE AS CODE
TF config and module structure
• DevOps developed base modules
• Development teams imported (tf init) modules and tweaked them for each worker or service
• Terraform definitions for each worker or service are checked in with the app code
Project-specific variables
• Service name
• ECS Task definition
• CPU Allocation
• Hard/Soft memory limits
• CloudWatch alarm thresholds
• SQS max queue length
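
As a rough illustration of the project-specific variables above, here is a minimal Python sketch of how a worker's settings could map onto an ECS container definition. The talk used Terraform for this; the class name, field names, and example values below are hypothetical. The dictionary keys (cpu, memory, memoryReservation) are the ECS container definition parameters for CPU units and the hard and soft memory limits.

```python
from dataclasses import dataclass


@dataclass
class WorkerConfig:
    """Hypothetical per-project settings, mirroring the variables listed above."""
    service_name: str
    image: str            # image name plus tag, e.g. "media-worker:v1.0.33"
    cpu_units: int        # ECS CPU units (1024 units = 1 vCPU)
    memory_hard_mib: int  # hard limit: the container is killed above this
    memory_soft_mib: int  # soft limit: the amount reserved for placement

    def container_definition(self) -> dict:
        # When both limits are set, ECS requires the hard limit to be
        # greater than the soft limit.
        if self.memory_soft_mib >= self.memory_hard_mib:
            raise ValueError("soft memory limit must be below the hard limit")
        return {
            "name": self.service_name,
            "image": self.image,
            "cpu": self.cpu_units,
            "memory": self.memory_hard_mib,
            "memoryReservation": self.memory_soft_mib,
        }
```

Keeping these values next to the app code, as the slide describes, means a change in resource needs ships with the change that caused it.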

MIGRATION: VPC DESIGN
• For ECS, we introduced new VPCs in each environment
• qa, staging, prod
• We transitioned to new Application ELBs when bringing up an ECS service
• Service discovery is achieved through well-known ELB DNS names

MIGRATION: PORTING WORKER CODE
• Long running workers that process multiple tasks
• Support for container draining using a SIGINT signal
• Feature flag to switch between the 3rd-party queue and SNS/SQS
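
The draining behaviour described above can be sketched roughly as follows. This is not the code from the talk: the Worker class and the fetch and handle callables are hypothetical stand-ins for the real queue client and task processor. The idea is that SIGINT only sets a flag, so the task in flight finishes before the loop exits.

```python
import signal


class Worker:
    """Long-running worker that finishes its current task before stopping."""

    def __init__(self, fetch, handle):
        self.fetch = fetch      # hypothetical: returns the next task or None
        self.handle = handle    # hypothetical: processes one task
        self.draining = False

    def request_drain(self, signum=None, frame=None):
        # Signal handler: only set a flag, never interrupt the current task.
        self.draining = True

    def run(self) -> int:
        processed = 0
        while not self.draining:
            task = self.fetch()
            if task is None:
                continue        # nothing available yet, poll again
            self.handle(task)
            processed += 1
        return processed


def install_drain_handler(worker: Worker) -> None:
    # Must be called from the main thread of the process.
    signal.signal(signal.SIGINT, worker.request_drain)
```

In a real worker the fetch callable would block on a long poll rather than spin, and the container would need to be configured so that the stop signal it receives is the one the worker listens for.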

MIGRATION: IAM ROLES
• Each ECS Task Definition has an attached IAM Role
• We started from our existing IAM Roles and consolidated them
• Tightened policies toward principle of least privilege
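
To make the least-privilege idea concrete, here is a small Python sketch that builds the two policy documents a task role typically needs. The trust policy principal (ecs-tasks.amazonaws.com) is the standard one for ECS task roles. The function names, the choice of actions, and the queue and bucket ARNs are illustrative assumptions, not the policies GoPro used.

```python
import json


def task_role_trust_policy() -> dict:
    """Trust policy that lets ECS tasks assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "ecs-tasks.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }


def worker_policy(queue_arn: str, bucket_arn: str) -> dict:
    """Hypothetical worker policy: one queue and one bucket, nothing else."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["sqs:ReceiveMessage", "sqs:DeleteMessage"],
                "Resource": queue_arn,
            },
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": bucket_arn + "/*",
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder ARNs for illustration only.
    policy = worker_policy(
        "arn:aws:sqs:us-west-2:123456789012:example-queue",
        "arn:aws:s3:::example-bucket",
    )
    print(json.dumps(policy, indent=2))
```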

CONTAINER TAGGING
“Master image was deployed to production!”
“...but was it today’s master?!”
• Container tags follow semantic versioning
• ex: v1.0.33
• Matches git tag of the release
• Provides easy roll-back
• Ops gets visibility into the version of every service running
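
A minimal sketch of why version tags make roll-back easy, assuming tags of the form v1.0.33 as in the example above. The function names are hypothetical. A tag such as "master" is rejected, and the roll-back target is the highest published version below the current one.

```python
import re

TAG_RE = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def parse_tag(tag: str) -> tuple:
    """Turn 'v1.0.33' into (1, 0, 33); reject anything else."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError("not a release tag: " + repr(tag))
    return tuple(int(part) for part in match.groups())


def previous_release(current: str, tags: list):
    """Return the highest tag strictly below `current`, or None."""
    current_version = parse_tag(current)
    older = [t for t in tags if parse_tag(t) < current_version]
    return max(older, key=parse_tag) if older else None
```

Comparing parsed tuples rather than raw strings matters here: as strings, "v1.0.9" sorts after "v1.0.10".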

AUTOSCALING
1. Cluster scaling (EC2)
• Standard CPU metrics
2. Task scaling (ECS)
• Workers - SQS queue length (ApproximateNumberOfMessagesVisible)
• Services - CPUUtilization
3. Scaling in (without interrupting tasks)
• Container Instance draining
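
For the worker case, the relationship between queue length and task count can be sketched as a simple proportional rule. This is an assumption for illustration: the talk does not say which policy type was used, and messages_per_task, min_tasks, and max_tasks are hypothetical tuning values. The input would be the ApproximateNumberOfMessagesVisible metric.

```python
import math


def desired_task_count(visible_messages: int,
                       messages_per_task: int,
                       min_tasks: int,
                       max_tasks: int) -> int:
    """Size the worker service from the backlog, clamped to a safe range."""
    wanted = math.ceil(visible_messages / messages_per_task)
    return max(min_tasks, min(max_tasks, wanted))
```

For example, with a backlog of 95 messages and a budget of 10 messages per task, the rule asks for 10 tasks, while an empty queue falls back to min_tasks.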

DRAINING
[Diagram: ECS cluster draining flow. The SQS queue reaches a threshold size, a CloudWatch Alarm fires, and ECS Tasks autoscale. On scale-in, EC2 signals a shutdown lifecycle event, and the lifecycle event triggers an AWS Lambda function. The function checks if any container is running on the EC2 instance; if none, it kills the EC2 instance.]
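
The draining flow above could look roughly like this inside the Lambda function. This is a sketch under assumptions, not the function from the talk. The ecs and autoscaling arguments are expected to be boto3 clients, passed in so the logic can be exercised with fakes. The cluster name and container instance ARN lookups are omitted, and the "WAIT" return value is a hypothetical convention for "invoke me again later".

```python
def handle_terminating_instance(ecs, autoscaling, cluster,
                                container_instance_arn, event):
    """Drain one container instance; release it once no tasks remain.

    `event` is the Auto Scaling lifecycle notification, which carries
    LifecycleHookName, AutoScalingGroupName and EC2InstanceId.
    """
    # Stop ECS from placing new tasks here (setting DRAINING is idempotent).
    ecs.update_container_instances_state(
        cluster=cluster,
        containerInstances=[container_instance_arn],
        status="DRAINING",
    )

    described = ecs.describe_container_instances(
        cluster=cluster,
        containerInstances=[container_instance_arn],
    )
    running = described["containerInstances"][0]["runningTasksCount"]
    if running > 0:
        return "WAIT"  # tasks still running: check again on a later call

    # Nothing is running, so let the Auto Scaling group terminate the instance.
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=event["LifecycleHookName"],
        AutoScalingGroupName=event["AutoScalingGroupName"],
        LifecycleActionResult="CONTINUE",
        InstanceId=event["EC2InstanceId"],
    )
    return "CONTINUE"
```

The lifecycle hook keeps the instance in a waiting state until the action is completed or the hook times out, which is what gives running tasks time to finish.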

MIGRATION PROCESS
[Diagram. Components: browser, mobile app, desktop app, internet, webservice, the old 3rd-party solution with its workers, SNS + SQS with the ECS workers, and AWS S3.]

LESSONS: SCALING
1. Start scaling at 80% capacity
• Container startup time
2. Use CPU Reservation (not Utilization)
3. Scaling is about more than resource usage
• Threshold Queue size
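
The first two lessons can be illustrated with a little arithmetic. CPU reservation is the share of the cluster's registered CPU units that running tasks have reserved, regardless of how busy those tasks actually are. The sketch below uses hypothetical function names and takes the 80% threshold from the slide; the instance sizes in the usage note are made up.

```python
def cpu_reservation_percent(task_cpu_units: list,
                            instance_cpu_units: list) -> float:
    """Reserved CPU units as a percentage of registered CPU units."""
    registered = sum(instance_cpu_units)
    if registered == 0:
        return 0.0
    return 100.0 * sum(task_cpu_units) / registered


def should_scale_out(task_cpu_units: list,
                     instance_cpu_units: list,
                     threshold_percent: float = 80.0) -> bool:
    """Add instances before the cluster is full, to cover startup time."""
    reserved = cpu_reservation_percent(task_cpu_units, instance_cpu_units)
    return reserved >= threshold_percent
```

For example, ten tasks reserving 512 units each on three instances of 2048 units reserve 5120 of 6144 units, about 83%, so the cluster should already be growing even if measured CPU utilization is low.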

LESSONS: NAMING
• Consistency across many projects
• Concise for DevOps, QA, Release Team and Developers
• Ease of identifying, searching and associating
• AWS resources :: code :: docker images

LESSONS: TERRAFORM
• Executing Terraform apply requires privileges across many AWS services
• We added a check to ensure the image tag was the latest
• Decouple unrelated modules to avoid unintended consequences
• This separates the remote state
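
The image tag check mentioned above might look something like this. The slide does not say how the check was implemented, so this is a guess at its shape: the function name is hypothetical, tags are assumed to look like v1.0.33, and "latest" is read as "not older than any published tag".

```python
def is_latest_tag(candidate: str, published_tags: list) -> bool:
    """True if `candidate` is at least as new as every published tag."""
    def version(tag: str) -> tuple:
        # 'v1.0.33' -> (1, 0, 33); raises ValueError on malformed tags.
        return tuple(int(part) for part in tag.lstrip("v").split("."))

    newest = max((version(tag) for tag in published_tags), default=None)
    return newest is None or version(candidate) >= newest
```

A pre-apply script could call this and abort when it returns False, so that an apply from a stale checkout does not silently deploy an older image.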

COST SAVINGS: EXAMPLE
Old architecture, ASG-based worker service:
>=2 instances spread across AZs, per environment.
9 EC2 instances minimum for this service (many more during peak).
We traded efficient utilization for HA.

COST SAVINGS: EXAMPLE
Old architecture, ASG-based worker service:
Vs. an ECS cluster today:

METRICS & MONITORING
• Pros:
• Benchmarking each service’s resource requirements
• ECS event stream is helpful for diagnosing errors
• Cons:
• Lack of container-level monitoring
• Basic CloudWatch metrics only come at 5-minute intervals

In summary…
• ECS is the foundation of the GoPro cloud platform
• In under a quarter, we migrated all of our worker services
• We’re realizing:
• Big cost savings, better deploys, stability
• Ops and developer happiness :-)

Container Instance Draining Ref.
