
Amazon ECS at Coursera: A unified execution framework while defending against untrusted code



In this talk, Frank Chen and Brennan Saeta discuss Coursera's use of Docker and Amazon ECS. We discuss the implementation of our unified processing framework and delve into the security challenges inherent in running untrusted code.

Published in: Software


  1. 1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Frank Chen, Coursera Brennan Saeta, Coursera October 2015 CMP406 Amazon ECS at Coursera Powering a general-purpose near-line execution microservice, while defending against untrusted code
  2. 2. What to Expect from the Session • Techniques for a unified near-line, batch, and scheduled micro-service powered by Amazon ECS • Security vulnerabilities and countermeasures when running untrusted code in Docker with Amazon ECS • Reasons to modify the Amazon ECS agent
  3. 3. Session Outline • Introduction to Coursera • Near-line, batch and scheduled job execution framework • Motivations and background • Amazon ECS benefits and limitations • Iguazú and its architecture • Evaluating programming assignments • System requirements • Security threat model • Attacks and defenses
  4. 4. Education at Scale 15 million learners worldwide 2.5 million course completions 1,300+ courses 125+ partners
  5. 5. A unified execution framework
  6. 6. Batch Processing Enables… Reporting Instructor Reports • Grade exports • Learner demographics • Course progress statistics Internal Reports • Business metrics • Payments reconciliation
  7. 7. Scheduled Processing Enables… Marketing • Recommendation emails • Targeted marketing / reactivation emails
  8. 8. Nearline Processing Enables… Pedagogical Innovations • Peer-review matching & analysis • Auto-graded programming assignments
  9. 9. The early days… January 2012
  10. 10. Bad Old Days of Batch Processing @ Coursera (the early days…) Cascade • PHP-based job runner • Originally ran in screen sessions • Polled APIs for new jobs • Forced restarts on a regular basis due to unidentified memory leaks • Fragile and unreliable
  11. 11. Bad Old Days of Batch Processing @ Coursera (the not-so-early days?) Saturn • Scala scheduled batch job runner • Powered by Quartz Scheduler library • Better than Cascade, but… • All jobs ran on the same JVM, causing interference
  12. 12. Looking for something better…
  13. 13. What We Wanted Reliable Easy Development Easy Deployment High Efficiency Low Ops Load Cost Effective
  19. 19. What Else Did We Look At? Home-grown Tech • Tried, but proved to be unreliable • Difficult to handle coordination and synchronization • Powerful, but hard to productionize • Needs developers with experience • Designed for GCE first • Not a managed service, higher Ops load
  20. 20. Amazon ECS to the Rescue Amazon re:Invent 2014 – Dr. Werner Vogels introducing Amazon ECS Screenshot from https://www.youtube.com/watch?v=LE5uBqNp2Ds by Amazon Web Services
  21. 21. Amazon ECS to the Rescue Little maintenance Integrated with rest of AWS Easy to develop for
  24. 24. However… Amazon ECS is a great building block, but we still need to build tools around it for our purposes.
  25. 25. What We Built: Iguazú • Batch Job Scheduler for Amazon ECS • Immediately • Deferred (run once at X time) • Scheduled recurring (cron-like) • Programmatically accessible internally via our standard APIs and clients • Named for Iguazú Falls • World’s largest waterfall by volume • We hope Iguazú handles a similar volume of jobs (Photo: Marissa Strniste, https://www.flickr.com/photos/mstrniste/5999464924, CC-BY-2.0)
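The three scheduling modes listed on this slide can be sketched as a single "when should this run next?" function. This is a minimal illustration, not Iguazú's implementation: `JobRequest`, `next_run`, and the field names are hypothetical, and a fixed interval stands in for the cron-like schedules the slide mentions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class JobRequest:
    """Hypothetical job request; not Iguazú's actual data model."""
    job_name: str
    run_at: Optional[datetime] = None   # set for deferred jobs
    every: Optional[timedelta] = None   # set for recurring jobs (stand-in for cron)


def next_run(request: JobRequest, now: datetime,
             last_run: Optional[datetime] = None) -> Optional[datetime]:
    """Return the next time the job should start, or None if it is finished."""
    if request.every is not None:
        # Recurring: first run at run_at (or now), then one interval after each run.
        if last_run is None:
            return request.run_at or now
        return last_run + request.every
    if last_run is not None:
        # Immediate and deferred jobs run exactly once.
        return None
    # Deferred jobs wait for run_at; immediate jobs start now.
    return request.run_at or now
```

A scheduler loop would call `next_run` after each completed run and enqueue the job again whenever the result is not `None`.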
  26. 26. Iguazú: Architecture (diagram labels: Devs, Users, Services, Iguazú Admin, Iguazú Frontend, Iguazú Scheduler, Iguazú Backend, Cassandra, SQS, ECS API, ECS Workers)
  34. 34. Developing Iguazú Jobs

          class Job extends AbstractJob with StrictLogging {
            override val reservedCpu = 1024    // 1 CPU core
            override val reservedMemory = 1024 // 1 GB RAM

            def run(parameters: JsValue) = {
              logger.info("I am running my job! ")
              expensiveComputationHere()
            }
          }
  35. 35. Running Jobs from Other Services

          // invoking a job with one function call
          // from another service via Naptime RPC/REST framework
          val invocationId = IguazuJobInvocationClient
            .create(IguazuJobInvocationRequest(
              jobName = "exportQuizGrades",
              parameters = quizParams))
  36. 36. Iguazú: Developer / Ops User Interface
  37. 37. Deploying Jobs Easy Deployment: 1. Developers merge into master. Done! Jenkins Build Steps: 1. Builds zip package from master 2. Prepares Docker image with zip file 3. Pushes image into Docker registry 4. Registers updated jobs with Amazon ECS API
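The last build step, registering a job with the Amazon ECS API, amounts to submitting a task definition. The sketch below only builds the request body; the field names (`family`, `containerDefinitions`, `cpu`, `memory`, `essential`) are those of the ECS RegisterTaskDefinition API, while the function name, the image URL, and the one-container-per-job layout are assumptions made for illustration.

```python
def task_definition_payload(job_name: str, image: str,
                            reserved_cpu: int, reserved_memory: int) -> dict:
    """Build a RegisterTaskDefinition request body for one job (illustrative)."""
    return {
        "family": job_name,
        "containerDefinitions": [
            {
                "name": job_name,
                "image": image,
                "cpu": reserved_cpu,        # CPU units; 1024 units = one core
                "memory": reserved_memory,  # hard memory limit in MiB
                "essential": True,
            }
        ],
    }


# Hypothetical image name; the slides do not say which registry or tag is used.
payload = task_definition_payload(
    "exportQuizGrades", "registry.example.com/iguazu-jobs:latest", 1024, 1024)
```

With boto3 installed and credentials configured, a payload like this could be passed as `boto3.client("ecs").register_task_definition(**payload)`.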
  38. 38. Logs • Logs are in /var/lib/docker/containers/* • Upload into log analysis service (Sumologic) • Wrapper prints out job name and job ID at the start for easy searching • Good enough for now
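The wrapper that prints the job name and job ID before the job starts could look roughly like this. It is a sketch under assumptions: the marker `IGUAZU_JOB_START` and the function name are made up here, and the actual wrapper's language and format are not described in the slides.

```python
import sys


def run_with_banner(job_name, job_id, job, stream=sys.stdout):
    """Print a searchable banner line, then run the job and return its result."""
    # One fixed-format line lets the log analysis service find a run by job ID.
    print(f"IGUAZU_JOB_START job_name={job_name} job_id={job_id}",
          file=stream, flush=True)
    return job()
```

Because the banner is the first line the container writes, it ends up at the top of that container's log file under /var/lib/docker/containers/.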
  39. 39. Metrics • Using third-party metrics collector (Datadog) • Metrics for both jobs and container instances • So long as the worker machines can talk to Internet, things will work out pretty well
  40. 40. Since April 2015… 65 jobs in production >1000 runs per day 44 different scheduled jobs
  41. 41. Evaluating Programming Assignments
  42. 42. Programming Assignments at Coursera
  43. 43. The Security Challenge: Compiling and running untrusted, arbitrary code in Amazon EC2. Would you like to compile and run C code from random people on the Internet on your servers?
  44. 44. 1st Generation System Class graders in separate AWS acct Custom grader systems on cloud providers Course grader under the instructor’s desk Learners Coursera Servers Queue Service
  45. 45. 1st Generation System: Weaknesses No Auto Scaling No standard security Graders crashed
  48. 48. Design Goals Cost Savings No Maintenance Near Real-time Secure Infrastructure
  52. 52. Threat Model Prevent submitted code from: • impacting the evaluation of other submissions. • disrupting the grading environment (e.g., DoS) • affecting the rest of the Coursera learning platform Additional goals: • Minimize exfiltration of information • Test cases, solutions, etc… • Minimize risk of submissions changing own scores • Avoid turning into bitcoin miners or part of botnet
  53. 53. Threat Model - Assumptions • Run arbitrary binaries • Instructor grading scripts may have vulnerabilities • ∴ Grading code is untrusted • Unknown vulnerabilities in Docker and Linux namespacing and/or container implementation
  54. 54. Attack / Vulnerability Classes Divided into 2 main categories: • Assuming basic containers are secure, prevent any negative impacts to running arbitrary code. • Assuming basic container technology is vulnerable, mitigate negative impacts as much as possible.
  55. 55. What We Built: GrID Patrick Hoesly (https://www.flickr.com/photos/zooboing/5665221326/) CC-BY-2.0 • Service + architecture for grading programming assignments • Builds on Amazon ECS and Iguazú • Named for Tron’s “digital frontier” • Backronym: Grading Inside Docker
  56. 56. High-level GrID Architecture (diagram labels: Learners, GrID, Iguazú, S3 Bucket, ECS APIs, Grading Machines, VPC Firewalls, Coursera Production Account, Coursera GrID Grading Account)
  60. 60. Attacks: Resource Exhaustion Defenses: • Docker / CGroups: • CPU quotas • Memory limits • Swap limits • Hard timeouts for container execution • btrfs limits • file system storage quotas • IOPS throttling
  61. 61. Attacks: Kernel Resource Exhaustion Defenses: • Open file limits per container (nofile) • nproc Process limits • Limit kernel memory per cgroup • Limit execution time
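Several of the limits on the two slides above map onto standard `docker run` flags. The sketch below builds such a command line; it shows the plain Docker CLI equivalents, not how Coursera wires the limits through Amazon ECS, and the function name and default values are assumptions.

```python
def limited_run_argv(image, command, *, cpus=1.0, memory_mb=1024,
                     nofile=256, nproc=64):
    """Build a `docker run` argv with CPU, memory, swap and ulimit caps."""
    period_us = 100_000                  # CFS scheduling period in microseconds
    quota_us = int(cpus * period_us)     # CPU time allowed per period
    return [
        "docker", "run", "--rm",
        "--cpu-period", str(period_us),
        "--cpu-quota", str(quota_us),
        "--memory", f"{memory_mb}m",
        # Setting --memory-swap equal to --memory leaves the container no swap.
        "--memory-swap", f"{memory_mb}m",
        "--ulimit", f"nofile={nofile}:{nofile}",   # open file descriptors
        "--ulimit", f"nproc={nproc}:{nproc}",      # processes
        image, *command,
    ]
```

A hard execution timeout needs separate handling: stopping the `docker run` client does not necessarily stop the container, so a supervisor would typically name the container and issue `docker kill` once the deadline passes.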
  62. 62. Attacks: Network attacks Attacks: • Bitcoin mining • DoS attacks on third-party systems • Access Amazon S3 and other AWS APIs Defense: • Deny network access
  63. 63. Modifying the ECS Agent: Network Modes • NetworkDisabled too restrictive • Some graders require local loopback • Feature also deprecated • --net=none + deny net_admin + audit network • Isolation via Docker creating an independent network stack for each container • github.com/coursera/amazon-ecs-agent
  64. 64. Attacks: Namespace / Container Vulnerabilities • AppArmor & Mandatory Access Control • Required modifying the Amazon ECS Agent • Allows auditing or denying access to a variety of subsystems • Drop capabilities • No need for NET_BIND_SERVICE, CAP_FOWNER • No root within container
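The network and namespace defenses on the two slides above were applied through a modified Amazon ECS agent. As a rough illustration only, the corresponding plain Docker CLI flags are shown below. The profile name `grid-grader`, the UID/GID `65534:65534`, and the function name are hypothetical.

```python
def isolated_run_argv(image, command, *, apparmor_profile="grid-grader",
                      user="65534:65534",
                      drop_caps=("NET_BIND_SERVICE", "FOWNER")):
    """Build a `docker run` argv with no network, a non-root user,
    an AppArmor profile, and selected capabilities dropped."""
    argv = [
        "docker", "run", "--rm",
        "--net=none",                                   # loopback only
        "--user", user,                                 # no root in container
        "--security-opt", f"apparmor={apparmor_profile}",
    ]
    for cap in drop_caps:
        argv += ["--cap-drop", cap]
    argv.append(image)
    argv.extend(command)
    return argv
```

Passing `drop_caps=("ALL",)` would be the stricter choice for graders that need no capabilities at all.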
  65. 65. Attacks: Root escalations within the container • We modify instructor grader images before allowing them to be run • Clears setuid • Inserts C wrapper to drop privileges from root and redirect stdin/stdout/stderr • Required Amazon ECS Agent modification • Grant root privileges • Map Docker socket into Docker containers to run Docker in Docker!
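Clearing setuid bits from a grader image's files can be sketched as a walk over an unpacked filesystem tree. The slides do not say how Coursera performs this step, so the approach and the function name below are assumptions.

```python
import os
import stat


def clear_setuid_bits(root):
    """Remove setuid/setgid bits from regular files under root.

    Returns the list of paths whose mode was changed.
    """
    special = stat.S_ISUID | stat.S_ISGID
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mode = os.lstat(path).st_mode
            if not stat.S_ISREG(mode):
                continue  # skip symlinks, devices, sockets, and so on
            if mode & special:
                os.chmod(path, stat.S_IMODE(mode) & ~special)
                changed.append(path)
    return changed
```

Without setuid binaries, a process that the wrapper has already switched to an unprivileged user has one less route back to root.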
  66. 66. Attacks: If all else fails… • Utilizes VPC security measures to further restrict network access • No public internet access • Security group to restrict inbound/outbound access • Network flow logs for auditing • Separate AWS account • Run in an Auto Scaling group • Regularly terminate all grading EC2 instances
  67. 67. Other Security Measures • Utilize AWS CloudTrail for audit logs • Third-party security monitoring (Threat Stack) • No one should log in, so any TTY is an alert • Penetration testing by third-party red team (Synack)
  68. 68. Technique: Co-process • Environment has no network, but has to get submissions in and results out • Python co-process watches Amazon ECS / Docker • Python co-process then: • Mounts a shared folder containing submission • Reads back the grade from the shared folder after container exits • Monitors and cleans up
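The co-process hand-off through a shared folder can be sketched as follows. The file names `submission` and `grade.json`, the function names, and the `run_container` callback are hypothetical; the real co-process watches Amazon ECS / Docker rather than launching the container itself.

```python
import json
import tempfile
from pathlib import Path


def grade_submission(submission: bytes, run_container) -> dict:
    """Pass a submission in and read a grade out via a shared folder.

    run_container(shared_dir) must block until the grader container exits.
    """
    with tempfile.TemporaryDirectory() as shared:
        shared_dir = Path(shared)
        (shared_dir / "submission").write_bytes(submission)

        run_container(shared_dir)  # the container sees shared_dir as a mount

        result_file = shared_dir / "grade.json"
        if not result_file.exists():
            return {"score": 0.0, "error": "grader produced no result"}
        return json.loads(result_file.read_text())
    # Leaving the with-block deletes the shared folder (the clean-up step).


def fake_grader(shared_dir: Path) -> None:
    """Stand-in for a real container run, used only for illustration."""
    size = len((shared_dir / "submission").read_bytes())
    (shared_dir / "grade.json").write_text(json.dumps({"score": 1.0, "bytes": size}))
```

Calling `grade_submission(b"print('hi')", fake_grader)` exercises the whole path without Docker. In production the callback would wait for the real grader container to exit.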
  69. 69. Future Improvements • Priority queues for different grading priorities • Re-grades vs on-demand grades • Better instructor tooling • Automated “unit-testing” for new graders • Better simulation of production environment on instructor machines • Support scheduling GPUs
  70. 70. Lessons Learned • Run the latest kernels • Latest security patches • btrfs wedging on older kernels • Default Ubuntu 14.04 kernel not new enough! • Carefully monitor disk usage • Docker-in-Docker can’t clean up after itself (yet) • Reliable deploy tooling pays for itself
  71. 71. Related Sessions Also from Coursera: • BDT404 - Building and Managing Large-Scale ETL Data Flows with AWS Data Pipeline and Dataduct - Friday Containers and Amazon ECS: • CMP302 - Amazon EC2 Container Service: Distributed Applications at Scale – Next timeslot in Venetian H
  72. 72. Thank you! Questions? Also, we are hiring! www.coursera.org/jobs tech.coursera.org Brennan Saeta github/saeta @bsaeta saeta@coursera.org Frank Chen github/frankchn @frankchn frankchn@coursera.org
  73. 73. Remember to complete your evaluations!
