AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)


Batch computing is a common way for developers, scientists, and engineers to run a series of jobs on a large pool of shared compute resources, such as servers, virtual machines, and containers. Amazon ECS makes it easy to run and manage Docker-enabled applications across a cluster of Amazon EC2 instances. In this session, we will show you how to run batch jobs using Amazon ECS together with other AWS services, such as AWS Lambda and Amazon SQS. We will see how you can use Amazon EC2 Spot Instances to power your ECS cluster and easily scale your batch workloads. You'll hear from Mapbox on how they use ECS to power their entire batch processing architecture, collecting and processing over 100 million miles of sensor data per day to power their maps. Mapbox will also discuss how they optimize their batch processing framework on ECS using Spot Instances, and demo their open source framework that will help you get up and running with ECS in minutes.

Published in: Technology

  1. 1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Asha Chakrabarty, Senior Solutions Architect, AWS Will White, Engineering Lead, Mapbox December 1, 2016 Running Batch Processes on ECS CON310
  2. 2. What to Expect from the Session • Understand the challenges of running batch processes • Why Amazon ECS for Batch? • Architectural Design Patterns • Best Practices • Mapbox and Amazon ECS
  3. 3. Challenges of Running Batch Workloads • Typically resource intensive • Time constraint for completion • Potential impact to concurrent batch jobs • Scaling infrastructure resources • Ensuring effective resource utilization and cost savings • Fragile and unreliable
  4. 4. What Batch Workloads Need Reliable Easy Development Easy Deployment High Efficiency Low Ops Load Cost Effective
  5. 5. Why ECS for Batch Processing?
  6. 6. Cluster Management Made Easy Nothing to run Complete state Control and monitoring Scale
  7. 7. Performance at Scale
  8. 8. Flexible Container Placement Applications Batch jobs Multiple schedulers
  9. 9. Designed for Use with Other AWS Services Elastic Load Balancing Amazon Elastic Block Store Amazon Virtual Private Cloud AWS Identity and Access Management AWS CloudTrail
  10. 10. Security Your own EC2 instances in a VPC with all its security features to provide a high level of isolation.
  11. 11. Key Concepts
  12. 12. Tasks Containers Clusters Container Instances
  13. 13. Tasks Containers Clusters Container Instances
  14. 14. Task: A grouping of related containers Nginx Web Server Rails Application MySQL Database Log Collector
  15. 15. Task Definition { "family" : "my-website", "version" : "1.0", "containers" : [ <<CONTAINER DEFINITIONS>> ] }
  16. 16. Tasks Containers Clusters Container Instances
  17. 17. Container Definition Names and identifies your image Includes default runtime attributes for your container • Environment Variables • Port Mappings • Container entry point and commands • Resource constraints • Etc.
  18. 18. Example { "name" : "webServer", "image" : "nginx:latest", "cpu" : 512, "memory" : 128, "portMappings" : [ { "containerPort" : 9443, "hostPort" : 443 } ], "links" : ["rails"], "essential" : true }
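The slide JSON is simplified. As a rough sketch, the same container definition can be wrapped in the shape the ECS `RegisterTaskDefinition` API expects, where the container list lives under `containerDefinitions`. The helper name `build_task_definition` is hypothetical, and the actual registration call is left commented out because it needs boto3 and AWS credentials.

```python
import json


def build_task_definition(family, container_definitions):
    """Return keyword arguments shaped for ECS RegisterTaskDefinition (hypothetical helper)."""
    return {
        "family": family,
        "containerDefinitions": list(container_definitions),
    }


# The container definition from the slide, written as a Python dict.
web_server = {
    "name": "webServer",
    "image": "nginx:latest",
    "cpu": 512,       # CPU units reserved for the container
    "memory": 128,    # hard memory limit in MiB
    "portMappings": [{"containerPort": 9443, "hostPort": 443}],
    "links": ["rails"],  # a real task would presumably also define a "rails" container
    "essential": True,
}

task_definition = build_task_definition("my-website", [web_server])
print(json.dumps(task_definition, indent=2))

# To register it (not run here; requires boto3 and AWS credentials):
# import boto3
# boto3.client("ecs").register_task_definition(**task_definition)
```

Keeping the definition as plain data makes it easy to review or version before anything is sent to AWS.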
  19. 19. Tasks Containers Clusters Container Instances
  20. 20. Cluster Provides a pool of resources for your Tasks A grouping of Container Instances Starts empty, dynamically scalable
  21. 21. Tasks Containers Clusters Container Instances
  22. 22. Container Instance EC2 instance on which Tasks are scheduled We provide ECS-optimized AMI or you can download lightweight ECS Agent Registers into cluster upon launch Different EC2 instance types for variety in resource pool
  23. 23. Architectural Design Patterns
  24. 24. Trigger Batch Processing with Lambda Amazon ECS Availability Zone Availability Zone Container Instance Container Instance AutoScaling Group Task A AWS Lambda Amazon S3 Bucket (Source) ecs:RunTask Amazon S3 Bucket (Target) Amazon S3 Bucket Object Amazon CloudWatch AWS CloudTrail
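One way the Lambda trigger in this pattern might look is sketched below: a function turns the S3 event notification into `ecs:RunTask` parameters, and the handler passes them to boto3. The environment variable names (`CLUSTER`, `TASK_DEFINITION`, `CONTAINER_NAME`, `SOURCE_BUCKET`, `SOURCE_KEY`) are assumptions made for illustration, not something shown in the session.

```python
import os
from urllib.parse import unquote_plus


def run_task_params(event, cluster, task_definition, container_name):
    """Translate an S3 event notification into ecs:RunTask parameters."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = unquote_plus(record["s3"]["object"]["key"])
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "count": 1,
        "overrides": {
            "containerOverrides": [
                {
                    "name": container_name,
                    # Hypothetical variable names the batch task would read.
                    "environment": [
                        {"name": "SOURCE_BUCKET", "value": bucket},
                        {"name": "SOURCE_KEY", "value": key},
                    ],
                }
            ]
        },
    }


def handler(event, context):
    """Lambda entry point: start one ECS task per uploaded object."""
    import boto3  # imported lazily so the helper above stays testable offline

    params = run_task_params(
        event,
        os.environ["CLUSTER"],
        os.environ["TASK_DEFINITION"],
        os.environ["CONTAINER_NAME"],
    )
    response = boto3.client("ecs").run_task(**params)
    return [task["taskArn"] for task in response["tasks"]]
```

The Lambda execution role would need permission for `ecs:RunTask`, as the diagram indicates.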
  25. 25. Fleet of workers with ECS with SQS Amazon ECS Availability Zone Availability Zone SQS queue Container Instance Container Instance AutoScaling Group Task A AWS Lambda Amazon S3 DynamoDB Amazon Kinesis ecs:RunTask Amazon CloudWatch AWS CloudTrail
  26. 26. Long-running Batch Jobs • Utilize Spot Instances • EC2 Spot Blocks for Defined-Duration Workloads • ECS event stream for CloudWatch Events • Service Scaling and Monitoring Amazon ECS Availability Zone Availability Zone Container Instance Container Instance AutoScaling Group Task A Task B Task C Amazon CloudWatch AWS CloudTrail
  27. 27. Best Practices • Store state and inputs, outputs in S3 or another datastore • Minimize dependencies between task definitions (should be independent of each other) • Use Spot Instances and Spot fleets for long-running batch jobs • Monitor cluster state with ECS APIs • Share pools of resources • Auto Scaling, VPC, IAM, scheduled Reserved Instances
  28. 28. ECS at Mapbox
  29. 29. Maps Directions Geocoding Mobile Developer tools Analysis
  30. 30. 3 billion probes = 100 million miles per day
  31. 31. Similar pattern for batch processing • EC2 instances • SQS queue • Error handling / reporting
  32. 32. Introducing Watchbot
  33. 33. What is Watchbot? A library to help run a highly-scalable AWS service that performs data processing tasks in response to external events. You provide the messages and the logic to process them, while Watchbot handles making sure that your processing task is run at least once for each message.
  34. 34.
  35. 35. ECS Cluster SQS Watcher Container Running Tasks
  36. 36. Your task can do anything you want! • Your task can be anything that works in Docker • Use any language • Environment variables as input • bash exit codes to indicate success/failure/retry • Do any I/O • Save outputs to S3 or DynamoDB
  37. 37. Environment Variables (Name: Description) • Subject: the message's subject • Message: the message's body • MessageId: the message's ID defined by SQS • SentTimestamp: the time the message was sent • ApproximateFirstReceiveTimestamp: the time the message was first received • ApproximateReceiveCount: the number of times the message has been received
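A task might read these variables along the following lines. This is a minimal sketch: the function name `read_message`, the returned field names, and the demo values are hypothetical, and converting `ApproximateReceiveCount` to an integer assumes it arrives as a decimal string.

```python
import os


def read_message(env):
    """Collect the message fields a task receives through its environment."""
    return {
        "subject": env.get("Subject", ""),
        "body": env["Message"],
        "message_id": env.get("MessageId", ""),
        "sent_timestamp": env.get("SentTimestamp", ""),
        "first_received": env.get("ApproximateFirstReceiveTimestamp", ""),
        # Environment values are strings; assume a decimal count here.
        "receive_count": int(env.get("ApproximateReceiveCount", "1")),
    }


# Demo values for illustration only; a real task would pass os.environ.
DEMO_ENV = {
    "Subject": "process-tile",
    "Message": '{"tile": "12/654/1583"}',
    "MessageId": "demo-id",
    "ApproximateReceiveCount": "2",
}

message = read_message(DEMO_ENV)
print(message["subject"], message["receive_count"])
```

Inside a container the call would simply be `read_message(os.environ)`. A receive count above 1 tells the task it is handling a retry.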
  38. 38. Messages • Use any format as long as your task is equipped to handle it • JSON can capture more complex
  39. 39. Exit Codes (Exit code: Description; Outcome) • 0: completed successfully; message is removed from the queue without notification • 3: rejected the message; message is removed from the queue and a notification is sent • 4: no-op; message is returned to the queue without notification • other: failure; message is returned to the queue and a notification is sent
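The exit code rules above can be expressed as a small lookup. The sketch below is not Watchbot's actual implementation: `Outcome`, `outcome_for`, and `run_and_classify` are hypothetical names, and the child process is a stand-in for a real task.

```python
import subprocess
import sys
from typing import NamedTuple


class Outcome(NamedTuple):
    remove_from_queue: bool  # False means the message returns to the queue
    notify: bool             # True means a notification is sent


def outcome_for(exit_code: int) -> Outcome:
    """Map a task's exit code to the queue and notification outcome."""
    if exit_code == 0:   # completed successfully
        return Outcome(remove_from_queue=True, notify=False)
    if exit_code == 3:   # task rejected the message
        return Outcome(remove_from_queue=True, notify=True)
    if exit_code == 4:   # no-op, retry later without noise
        return Outcome(remove_from_queue=False, notify=False)
    return Outcome(remove_from_queue=False, notify=True)  # any other failure


def run_and_classify(command) -> Outcome:
    """Run a task command and classify its exit code."""
    completed = subprocess.run(command)
    return outcome_for(completed.returncode)


# A stand-in task that exits with code 4 (no-op).
demo = run_and_classify([sys.executable, "-c", "raise SystemExit(4)"])
print(demo)
```

Because the contract is only an exit code, the task itself can be written in any language that runs in Docker.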
  40. 40. More features! • Logging - write logs to CloudWatch LogGroup • Send alarms to SNS • Reduce mode - tracks progress of distributed tasks and runs a reduce task when everything finishes
  41. 41. Why not Lambda? Watchbot is similar in many regards to AWS Lambda, but is more configurable, more focused on data processing, and not subject to several of Lambda's limitations. • Full control over execution environment allows you to install anything you want • No limits on execution time • No memory limits • No concurrency limits or account-wide throttling • No DynamoDB Streams or Kinesis support
  42. 42. Gotcha: EBS Boot • ECS optimized instances are only available as EBS boot AMIs so consider rolling your own instance store AMI • EBS is more expensive - especially if you are running many instances on Spot • Slower than ephemeral disks
  43. 43. Gotcha: EBS Boot
  44. 44. Demo!
  45. 45.
  46. 46. 14 Data Processing Services 3500 Peak Container Instances 500 million Compute Hours Used This Year
  47. 47. Thank you!
  48. 48. Remember to complete your evaluations!