ECS & Docker:
Secure Async Execution @ Coursera
Brennan Saeta
The Beginnings — 2012
10 courses
1 million learners worldwide
4 partners
Education at Scale
1,800 courses
18 million learners worldwide
140 partners
Outline
• Evolution of Coursera’s nearline execution systems
• Next-generation execution framework: Iguazú
• Iguazú application deep dive:
GrID — evaluating programming assignments
Key Takeaways
• What is nearline execution, and why it is useful
• Best practices for running containers in production
in the cloud
• Hardening techniques for securely operating
container infrastructure at scale
A history of nearline execution
Coursera Architecture (2012)
PHP
Monolith
Early days - Requirements
• Video re-encoding for distribution
• Grade computation for 100,000+ learners
• Pedagogical data exports for courses
Cascade Architecture
[Diagram: PHP Monolith (two instances), Queue, Cascade]
Upgrading to Scala
Re-architecting delayed execution for our 2nd generation
learning platform.
Upgrading to the JVM
• Leverage mature Scala & JVM ecosystems for code
sharing
• JVM much more reliable (no memory leaks)
• New job model: scheduled recurring jobs.
• Named: Saturn
Saturn Architecture
[Diagram: online serving on a Scala/micro-service architecture (Service A, Service B, Service C, Cassandra C*); Saturn leader with a ZooKeeper (ZK) ensemble and Cassandra (C*)]
Problems with Saturn
• Single master meant naïve implementation ran all
jobs in same JVM
• Huge CPU contention @ top of the hour
• OOM Exceptions & GC issues
Enter: Docker
Containers allow for resource isolation!
CC-by-2.0 https://www.flickr.com/photos/photohome_uk/1494590209
Supported Features
Platform                  Saturn  Docker  Amazon ECS  Iguazú
Run code                  ✅      ✅      ✅          ✅
Resource Isolation        ❌      ✅      ✅          ✅
Clusters / HA             ✅      ❌      ✅          ✅
Great developer workflow  ✅      ❌      ❌          ✅
Scheduled Jobs            ✅      ❌      ❌          ✅
Solution: Iguazú
• Framework & service for asynchronous execution
• Optimized Scala developer experience for Coursera
• Unified scheduler supports:
  • Immediate execution (nearline)
  • Scheduled recurring execution (cron-like)
  • Deferred execution (run once @ time X)
Marissa Strniste (https://www.flickr.com/photos/mstrniste/5999464924) CC-BY-2.0
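The three scheduling modes listed above can be modeled as one small type. This is a minimal sketch under stated assumptions, not Iguazú's actual scheduler: the names `Schedule`, `Immediate`, `Recurring`, `Deferred`, and `nextRun` are hypothetical, the recurring case is simplified to a fixed interval rather than a cron expression, and a deferred time that has already passed is treated as expired purely to keep the example short.

```scala
import java.time.{Duration, Instant}

object ScheduleSketch {
  sealed trait Schedule
  // Nearline: run as soon as possible.
  case object Immediate extends Schedule
  // Cron-like, simplified here to "first run, then every fixed interval".
  final case class Recurring(first: Instant, every: Duration) extends Schedule {
    require(every.toMillis > 0, "interval must be at least one millisecond")
  }
  // Run once at time X.
  final case class Deferred(at: Instant) extends Schedule

  /** Next time the job is due at or after `now`, or None if it will not run again. */
  def nextRun(schedule: Schedule, now: Instant): Option[Instant] = schedule match {
    case Immediate    => Some(now)
    case Deferred(at) => if (at.isBefore(now)) None else Some(at)
    case Recurring(first, every) =>
      if (!first.isBefore(now)) Some(first)
      else {
        val elapsed = Duration.between(first, now).toMillis
        val period  = every.toMillis
        val periods = (elapsed + period - 1) / period // ceiling division
        Some(first.plusMillis(periods * period))
      }
  }
}
```

Keeping all three modes behind a single `nextRun` function is one way a "unified" scheduler could treat nearline, recurring, and deferred work uniformly; the real design may differ.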
Iguazú Architecture
[Diagram: Devs and Users; Iguazú Admin, Iguazú Frontend, Iguazú Scheduler, Iguazú Backend, Iguazú Workers; Services; Cassandra; SQS Queue; ZK Ensemble; ECS API]
Autoscale, autoscale, autoscale!
Autoscaling ⇄ Iguazú ⇆ ECS
[Diagram: Autoscaling, Iguazú, the ECS API, and EC2 workers. Sequence shown: shutdown lifecycle notification, Iguazú polls worker job status, all finished, proceed, terminate EC2 worker.]
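The shutdown sequence on the autoscaling slide (notification, poll worker job status, proceed once everything has finished) can be sketched as a bounded polling loop. This is an illustration only: `awaitDrain`, `Decision`, and the `runningJobs` callback are hypothetical names, and the code that would receive the lifecycle notification or report the decision back to Auto Scaling is deliberately left out.

```scala
import scala.annotation.tailrec

object DrainSketch {
  sealed trait Decision
  case object Proceed extends Decision     // worker is idle, termination may go ahead
  case object KeepWaiting extends Decision // jobs were still running when polling stopped

  /** Polls `runningJobs` until it reports zero or `maxPolls` polls have been made. */
  def awaitDrain(runningJobs: () => Int, maxPolls: Int, sleepMillis: Long): Decision = {
    @tailrec
    def loop(remaining: Int): Decision =
      if (runningJobs() == 0) Proceed
      else if (remaining <= 1) KeepWaiting
      else {
        Thread.sleep(sleepMillis)
        loop(remaining - 1)
      }
    loop(maxPolls)
  }
}
```

Bounding the number of polls keeps a stuck job from blocking scale-in forever; what to do when the bound is hit (extend the wait or give up) is a policy choice the slides do not specify.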
Failure in Nearline Systems
• Most jobs are non-idempotent
• Iguazú: At most once execution
  • Time-bounded delay
• Future: At least once execution
  • With caveats
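At-most-once execution for non-idempotent jobs can be illustrated by claiming an invocation before running it and never retrying afterwards. The sketch below is an assumption-laden illustration, not Iguazú's implementation: `runAtMostOnce` and the `Outcome` values are hypothetical names, and the in-memory claim set stands in for whatever durable, shared store a real system would need.

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.util.control.NonFatal

object AtMostOnceSketch {
  sealed trait Outcome
  case object Succeeded extends Outcome
  case object FailedNoRetry extends Outcome
  case object AlreadyClaimed extends Outcome

  // Hypothetical in-memory claim set; it does not survive a process restart.
  private val claimed = ConcurrentHashMap.newKeySet[String]()

  /** Runs `job` only if `invocationId` has not been claimed before.
    * The claim is recorded before the job starts, so a failure part-way
    * through leaves the invocation claimed and it is never run again. */
  def runAtMostOnce(invocationId: String)(job: () => Unit): Outcome =
    if (!claimed.add(invocationId)) AlreadyClaimed
    else
      try {
        job()
        Succeeded
      } catch {
        case NonFatal(_) => FailedNoRetry
      }
}
```

Moving to at-least-once would mean recording completion instead of the claim and retrying unclaimed or unfinished work, which is only safe for idempotent jobs; that is presumably the caveat the slide alludes to.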
Iguazú adoption by the numbers
~100 jobs in production
>1000 runs per day
>100 different job schedules
Iguazú Applications
Nearline Jobs
• Pedagogical Instructor
Data Exports
• System Integrations
• Course Migrations
Scheduled Recurring Jobs
• Course Reminders
• System Integrations
• Payment reconciliation
• Course translations
• Housekeeping
• Build artifact archival
• A/B Experiments
While containers may help you on your journey, they are not themselves a destination.
CC-by-2.0 https://www.flickr.com/photos/usoceangov/5369581593
Writing an Iguazú Job
class AbReminderJob @Inject() (abClient: AbClient, email: EmailAPI)
    extends AbstractJob {
  override val reservedCpu = 1024    // 1 CPU core
  override val reservedMemory = 1024 // 1 GB RAM

  // Invoked with the parameters supplied for this run.
  def run(parameters: JsValue) = {
    val experiments = abClient.findForgotten()
    logger.info(s"Found ${experiments.size} forgotten experiments.")
    experiments.foreach { experiment =>
      // sendReminder is not shown on the slide.
      sendReminder(experiment.owners, experiment.description)
    }
  }
}
Testing an Iguazú job
The Hollywood Principle applies to distributed systems.
CC-by-2.0 https://www.flickr.com/photos/raindog808/354080327
Deploying a new Iguazú Job
• Developer
  • Merge into master… done
• Jenkins Build Steps
  • Compile & package job JAR
  • Prepare Docker image
  • Push image into registry
  • Register updated job with Amazon ECS API
Invoking an Iguazú Job
// invoking a job with one function call
// from another service via REST framework RPC
val invocationId = iguazuJobInvocationClient
  .create(IguazuJobInvocationRequest(
    jobName = "exportQuizGrades",
    parameters = quizParams))
A clean environment increases reliability.
CC-by-2.0 https://www.flickr.com/photos/raindog808/354080327
Evaluating Programming
Assignments
An application of Iguazú
Design Goals
• Elastic Infrastructure
• No Maintenance
• Near Real-time
• Secure Infrastructure
Solution: GrID
Patrick Hoesly (https://www.flickr.com/photos/zooboing/5665221326/) CC-BY-2.0
• Service + framework for grading
programming assignments
• Builds on Iguazú
• Named for Tron’s “digital frontier”
• Backronym: Grading Inside Docker
High-level GrID Architecture
[Diagram: Learners, GrID, Iguazú, S3 Bucket, ECS API, Grading Machines, VPC Firewalls; split across the Coursera Production Account and the Coursera GrID Grading Account]
Programming Assignments
The Security Challenge
Compiling and running untrusted, arbitrary code on
our cluster in near real time.
Would you like to compile and run C code from random
people on the Internet on your servers?
FROM redis
FROM ubuntu:latest
FROM jane’s-image
Security Assumptions
• Run arbitrary binaries
• Instructor grading scripts may have vulnerabilities
• ∴ Grading code is untrusted
• Unknown vulnerabilities in Docker and Linux
name-spacing and/or container implementation
Security Goals
Prevent submitted code from:
• impacting the evaluation of other submissions.
• disrupting the grading environment (e.g., DoS)
• affecting the rest of the Coursera learning platform
Grading assignment submissions
CC-by-2.0 https://www.flickr.com/photos/dherholz/4367511580/
[Diagram: Alice's, Bob's, and Mallory's containers, each holding a submission and a grader, share one host's CPU, RAM, kernel, and disk. Protections added step by step: CPU (cgroups), RAM (cgroups), disk (blkio limits & btrfs quotas).]
Attacks: Kernel Resource Exhaustion
• Open file limits per container (nofile)
• Process limits (nproc)
• Limit kernel memory per cgroup
• Limit execution time
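The last defense in the list, limiting execution time, can be sketched with the JDK's process API. This is a generic host-side illustration under assumptions, not GrID's code: `runWithTimeout` and its `Result` type are hypothetical, and in a grading setup the command would presumably be whatever launches the grader container.

```scala
import java.util.concurrent.TimeUnit

object TimeLimitSketch {
  sealed trait Result
  final case class Exited(code: Int) extends Result
  case object TimedOut extends Result

  /** Starts `command`, waits up to `timeoutMillis`, and kills it if it overruns. */
  def runWithTimeout(command: Seq[String], timeoutMillis: Long): Result = {
    val process = new ProcessBuilder(command: _*).inheritIO().start()
    if (process.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) Exited(process.exitValue())
    else {
      process.destroyForcibly() // forcibly terminate the overrunning process
      process.waitFor()         // wait until it is actually gone before returning
      TimedOut
    }
  }
}
```

For example, `TimeLimitSketch.runWithTimeout(Seq("sleep", "5"), 200L)` would be cut off after roughly 200 ms on a Unix-like host. A wall-clock limit like this complements the ulimit and cgroup limits above, which bound resources but not duration.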
[Diagram: the same host with kernel protections (cgroups, ulimits) added alongside CPU cgroups, RAM cgroups, and disk blkio limits & btrfs quotas; Network appears as a further shared resource.]
Attacks: Network attacks
Attacks:
• Bitcoin mining
• DoS attacks on other systems
• Access Amazon S3 and other AWS APIs
Defense:
• Deny network access
Docker Network Modes
• NetworkDisabled too restrictive
  • Some graders require local loopback
  • Feature also deprecated
• --net=none + deny net_admin + audit network
  • Isolation via Docker creating an independent network stack for each container
github.com/coursera/amazon-ecs-agent
CC-by-2.0 https://www.flickr.com/photos/valentinap/253659858
CC-by-2.0 https://www.flickr.com/photos/jessicafm/2834658255/
CC-by-2.0 https://www.flickr.com/photos/donnieray/11501178306/in/photostream/
Defense in Depth
• Mandatory Access Control (AppArmor)
  • Allows auditing or denying access to a variety of subsystems
• Drop capabilities from bounding set
  • No need for CAP_NET_BIND_SERVICE, CAP_FOWNER, CAP_MKNOD
• Deny root within container
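Several of the restrictions from the preceding slides map onto standard `docker run` flags. The sketch below only assembles such a command line; it is an assumption about how the pieces could fit together, not the configuration Coursera ran (their containers were launched through a modified ECS agent). `GraderSandboxSketch`, `Limits`, the image name, the profile name, and all numeric values are hypothetical.

```scala
object GraderSandboxSketch {
  // Hypothetical per-container limits; real values depend on the workload.
  final case class Limits(cpuShares: Int, memoryMb: Int, maxOpenFiles: Int, maxProcesses: Int)

  /** Builds a `docker run` argument list applying the restrictions named on the slides. */
  def dockerRunArgs(image: String, limits: Limits, appArmorProfile: String): Seq[String] =
    Seq(
      "docker", "run", "--rm",
      "--net=none",                                   // own network stack, loopback only
      "--security-opt", s"apparmor=$appArmorProfile", // mandatory access control profile
      "--cap-drop=NET_BIND_SERVICE",                  // capabilities the graders do not need
      "--cap-drop=FOWNER",
      "--cap-drop=MKNOD",
      s"--cpu-shares=${limits.cpuShares}",            // CPU cgroup weight
      s"--memory=${limits.memoryMb}m",                // RAM cgroup limit
      "--ulimit", s"nofile=${limits.maxOpenFiles}:${limits.maxOpenFiles}",
      "--ulimit", s"nproc=${limits.maxProcesses}:${limits.maxProcesses}",
      image
    )

  def main(args: Array[String]): Unit = {
    val argv = dockerRunArgs("grader-image", Limits(512, 1024, 256, 64), "grader-profile")
    println(argv.mkString(" "))
  }
}
```

Each flag covers one layer only; the point of defense in depth is that a gap in any single layer (for example an AppArmor profile that is too permissive) is still backed by the others.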
Deny Root Escalations
• We modify instructor grader images
before allowing them to be run
• Clears setuid
• Inserts C wrapper to drop privileges from
root and redirect stdin/stdout/stderr
• Run cleaning job on another Iguazú
cluster
• Run Docker in Docker!
• Docker 1.10 adds User Namespaces
If all else fails…
• Utilizes VPC security measures to
further restrict network access
• No public internet access
• Security group to restrict
inbound/outbound access
• Network flow logs for auditing
• Separate AWS account
• Run in an Auto Scaling group
• Regularly terminate all grading EC2
instances
Other Security Measures
• Utilize AWS CloudTrail for audit logs
• Third-party security monitoring
(Threat Stack)
• No one should log in, so any TTY is an alert
• Penetration testing by third-party red
team (Synack)
Lessons Learned - GrID
• Building a platform for code
execution is hard!
• Carefully monitor disk usage
• Run the latest kernels
• Latest security patches
• btrfs wedging on older kernels
• Default Ubuntu 14.04 kernel not new
enough!
Reliable deploy
tooling pays for itself.
Thank you!
Brennan Saeta
github/saeta
@bsaeta
saeta@coursera.org
Frank Chen
github/frankchn
@frankchn
frankchn@coursera.org
GrID lead Iguazú Lead
Questions?

Revolutionizing Visual Effects Mastering AI Face Swaps.pdfRevolutionizing Visual Effects Mastering AI Face Swaps.pdf
Revolutionizing Visual Effects Mastering AI Face Swaps.pdf
Undress Baby
 
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdfAutomated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
timtebeek1
 
What is Master Data Management by PiLog Group
What is Master Data Management by PiLog GroupWhat is Master Data Management by PiLog Group
What is Master Data Management by PiLog Group
aymanquadri279
 
Oracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptxOracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptx
Remote DBA Services
 
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian CompaniesE-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
Quickdice ERP
 
SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024
Hironori Washizaki
 
Fundamentals of Programming and Language Processors
Fundamentals of Programming and Language ProcessorsFundamentals of Programming and Language Processors
Fundamentals of Programming and Language Processors
Rakesh Kumar R
 
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit ParisNeo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j
 

Recently uploaded (20)

Graspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code AnalysisGraspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code Analysis
 
Using Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query PerformanceUsing Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query Performance
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Empowering Growth with Best Software Development Company in Noida - Deuglo
Empowering Growth with Best Software  Development Company in Noida - DeugloEmpowering Growth with Best Software  Development Company in Noida - Deuglo
Empowering Growth with Best Software Development Company in Noida - Deuglo
 
Transform Your Communication with Cloud-Based IVR Solutions
Transform Your Communication with Cloud-Based IVR SolutionsTransform Your Communication with Cloud-Based IVR Solutions
Transform Your Communication with Cloud-Based IVR Solutions
 
GraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph TechnologyGraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph Technology
 
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeA Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
 
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
 
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CDKuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
 
socradar-q1-2024-aviation-industry-report.pdf
socradar-q1-2024-aviation-industry-report.pdfsocradar-q1-2024-aviation-industry-report.pdf
socradar-q1-2024-aviation-industry-report.pdf
 
LORRAINE ANDREI_LEQUIGAN_HOW TO USE WHATSAPP.pptx
LORRAINE ANDREI_LEQUIGAN_HOW TO USE WHATSAPP.pptxLORRAINE ANDREI_LEQUIGAN_HOW TO USE WHATSAPP.pptx
LORRAINE ANDREI_LEQUIGAN_HOW TO USE WHATSAPP.pptx
 
What is Augmented Reality Image Tracking
What is Augmented Reality Image TrackingWhat is Augmented Reality Image Tracking
What is Augmented Reality Image Tracking
 
Revolutionizing Visual Effects Mastering AI Face Swaps.pdf
Revolutionizing Visual Effects Mastering AI Face Swaps.pdfRevolutionizing Visual Effects Mastering AI Face Swaps.pdf
Revolutionizing Visual Effects Mastering AI Face Swaps.pdf
 
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdfAutomated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
 
What is Master Data Management by PiLog Group
What is Master Data Management by PiLog GroupWhat is Master Data Management by PiLog Group
What is Master Data Management by PiLog Group
 
Oracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptxOracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptx
 
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian CompaniesE-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
 
SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024
 
Fundamentals of Programming and Language Processors
Fundamentals of Programming and Language ProcessorsFundamentals of Programming and Language Processors
Fundamentals of Programming and Language Processors
 
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit ParisNeo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
 

Docker & ECS: Secure Nearline Execution

Editor's Notes

  1. - General platform, not just for single course types. - Advance pedagogy - Transformative education?
  2. Let me paint a picture for you. It's the wild wild west of 2012 silicon valley. Like gold miners from yesteryear, the weight of hopes, dreams and promises of affordable high quality education pushed a small team of mostly Stanford undergrads to build a platform for global learning.
  3. Everyone was working around the clock, and we needed to get something shipped quickly. We started with a stateless PHP-based monolith backed by a sharded array of MySQL servers. This architecture enabled the small team to quickly build out the fundamental features of the learning platform. We built forums, video lectures, in-video quizzes, assessments, and more in this architecture. Thanks to some good engineering, it scaled beautifully and had great availability. But then, we started getting these weird feature requests that we couldn't effectively build in this monolithic architecture.
  4. Since joining Coursera, I've learned a few things. One of which is that instructors are humans. Another is that procrastination is a global phenomenon. Instructors would upload their video lectures hours before they needed to be released. We needed to quickly optimize them for distribution across the internet and to our low-bandwidth users. However, our web tier was not well suited for this long-running job. Additionally, as we built our platform, we wrote a function that would compute a user's grade as they progressed through the course. However, as courses ended, we needed to re-compute everyone's grades in order to issue certificates of completion. We had no way of doing this effectively within a web request. Finally, a key promise of MOOCs is pedagogical innovation derived from large learner behavior datasets. Our early instructional teams were begging us to release data on their own courses.
  5. The PHP monolith had a lot of really useful code. We had a sharded database abstraction, common data models, and libraries such as the grade computation function. We had so many new features to build, so we wanted to avoid re-writing all of that. So, we did the easy-expedient thing, ...
  6. Copy of online serving codebase polling a queue. Restarts required due to memory leaks in PHP runtime. Code updates were infrequent and painful.
  7. Already in 2012, we realized the need to move off of PHP. After many lengthy debates on the comparative merits of static types, concurrency, and performance, and after experimenting with toy Python, Go, and Java services, we eventually settled on Scala for our primary server-side technology. By 2013, we began completely re-architecting the learning platform from the ground up. As part of this migration, we re-built our nearline execution framework in Scala.
  8. Code sharing: - JARs - Packages - DI-abstractions, such as Guice Modules ... Now, as part of the migration, we changed the mental model for running a job. We realized that running some code on a regular cadence is a useful building block for platform features. Developers would write their jobs, and schedule them to run on a regular, recurring basis.
  9. As we moved to a modern, Scala, microservices-based architecture, we invested heavily in the tool-chain, from common libraries to automated deployment. We still were aggressively under-resourced, so we wanted to re-use as much of that as possible.
  10. As a result, Saturn is just another HTTP microservice, that serves no HTTP requests. When the server boots up, it forks a background thread to run the jobs. These jobs can easily interact with the other microservices in our architecture, just like any other microservice. For high availability, we always run at minimum 3 replicas of every service across 3 availability zones. While this works fine for the other microservices where each incoming request is sent to one replica, this is a big problem for Saturn. We do not want
  11. ... Now the conventional wisdom is that if you have a problem, and then you introduce ZooKeeper, you now have 2 problems. While ZooKeeper may be seen as an architecture anti-pattern, Saturn had much bigger issues.
  12. Saturn - https://upload.wikimedia.org/wikipedia/commons/c/c7/Saturn_during_Equinox.jpg
  13. Saturn - https://upload.wikimedia.org/wikipedia/commons/c/c7/Saturn_during_Equinox.jpg
  14. Saturn - https://upload.wikimedia.org/wikipedia/commons/c/c7/Saturn_during_Equinox.jpg
  15. Key point: minimal amount of work required to get their job done. Abstract away not just VMs / instances / clusters / etc., but also difficulties of code sharing & scheduling & deployment.
  16. Most important feature: great developer workflow. Developers care about the product features they need to ship. They don’t care if underneath the hood it’s running on containers, VMs or bare metal, so long as there is: - Easy development - Automated deployment - Reliable runtime
  17. Saturn - https://upload.wikimedia.org/wikipedia/commons/c/c7/Saturn_during_Equinox.jpg
  18. - Where Iguazu name comes from
  19. Nearline execution, or almost immediate execution of non-interactive jobs that interact with online serving systems.
  20. Now, I want to talk about an important implementation detail. In particular, why do we put this queue here right in the middle of a nice, clean, normal microservice? We do not need to have a queue for communication between the two halves of Iguazu. It could be a simple function call; when a request comes in, we could have the Iguazu microservice immediately turn around and schedule with the ECS API before responding. Recall, the big problem with Saturn is that at the top of the hour, dozens of jobs would kick off, and we’d exhaust all available resources. But a nearline system is intentionally not an online system. In an online system, requests must be served immediately. But in a nearline architecture, the framework and scheduler are allowed to delay the execution of the jobs. We leverage a queue to buffer the bursty stream of incoming jobs. As a result, a nearline system can be provisioned at less than peak capacity. In fact, a nearline cluster can be provisioned anywhere on a gradient between peak capacity and average capacity, allowing a tradeoff between latency and cost.
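The latency/cost tradeoff described in this note can be illustrated with a toy simulation. This is a minimal sketch, not Iguazu's implementation: the `simulate` function, the tick-based model, and the example workload are all assumptions made for illustration.

```python
from collections import deque


def simulate(arrivals, capacity):
    """Return (max_queue_depth, max_wait_ticks) for a fixed per-tick capacity.

    arrivals: jobs submitted at each tick (a hypothetical workload).
    capacity: jobs the cluster can start per tick (hypothetical unit).
    """
    if capacity < 1:
        raise ValueError("capacity must be at least 1 or the backlog never drains")
    queue = deque()  # holds the arrival tick of each waiting job
    max_depth = 0
    max_wait = 0
    tick = 0
    # Keep ticking after the last arrival until the backlog is drained.
    while tick < len(arrivals) or queue:
        if tick < len(arrivals):
            queue.extend([tick] * arrivals[tick])
        # Start as many queued jobs as the cluster has room for this tick.
        for _ in range(min(capacity, len(queue))):
            max_wait = max(max_wait, tick - queue.popleft())
        max_depth = max(max_depth, len(queue))
        tick += 1
    return max_depth, max_wait


# A "top of the hour" burst: peak of 12 jobs per tick, average of 2.
burst = [12, 0, 0, 0, 0, 0]
for capacity in (12, 4, 2):
    print(capacity, simulate(burst, capacity))
```

Provisioning for the peak (12) gives no queueing at all, provisioning for the average (2) makes the last job wait five ticks, and a value in between trades some delay for a smaller cluster.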
  21. When moving to a cloud-native architecture, you will be brainwashed into using autoscaling. There is a good reason for that: autoscaling is a really good practice for online, latency-sensitive microservices. Even more important than saving money, autoscaling enforces immutable infrastructure and high degrees of automation, resulting in a modern, flexible and highly available architecture. Those benefits translate over to nearline environments. We autoscale not just the control plane, but the worker pool as well. However, autoscaling a cluster with long-running jobs is much more challenging than autoscaling low-latency API servers. While scaling up is easy, scaling down safely is harder. You don’t want to terminate an EC2 instance that’s running a non-idempotent job! To solve this problem, we don’t use the default Amazon ECS scheduler. Instead, Iguazu has its own scheduler that is integrated with the Amazon Autoscaling API to avoid scheduling new jobs on instances scheduled for termination.
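The placement rule in this note (never put new work on an instance that is about to be terminated) can be sketched as follows. Everything here is hypothetical: the `Instance` type, the `draining` flag, and the best-fit tie-break are illustrative assumptions, and no AWS API is called. The talk does not say how Iguazu actually picks among eligible instances.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    instance_id: str
    free_cpu: int   # hypothetical resource units still available
    draining: bool  # True once the autoscaler has selected it for termination


def pick_instance(instances, cpu_needed):
    """Choose an instance for a new job, skipping any marked for termination.

    Returns None when nothing fits, so the job can simply stay on the queue.
    """
    candidates = [
        i for i in instances
        if not i.draining and i.free_cpu >= cpu_needed
    ]
    if not candidates:
        return None
    # Assumed policy: best fit, so lightly loaded instances can empty out
    # and become safe to scale down.
    return min(candidates, key=lambda i: i.free_cpu)


cluster = [
    Instance("i-a", free_cpu=8, draining=True),
    Instance("i-b", free_cpu=4, draining=False),
    Instance("i-c", free_cpu=2, draining=False),
]
chosen = pick_instance(cluster, cpu_needed=3)
print(chosen.instance_id if chosen else "stay queued")
```

Returning `None` rather than falling back to a draining instance relies on the queue from the previous note: a job that cannot be placed safely just waits.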
  22. Unfortunately, while we can work to avoid premature terminations, the reality is that jobs will fail to complete. The hardware could fail, power could go out, it could try and use too much memory, and there may be bugs. When designing distributed systems, you must architect for failure right from the start. In our experience, many of these nearline jobs make API calls, and have a large number of side effects (e.g. sending emails). Re-running a failed job could have serious consequences.
  23. Coursera is a very data-informed company; we always look to numbers to track our progress and validate our successes. Coursera developers have authored over an order of magnitude more jobs than in any of our previous systems. Developers take advantage of scheduled recurring jobs, and many jobs have multiple different schedules associated with them. As a result, we’re constantly running jobs on our cluster.
  24. While numbers can tell a very insightful story, I think in this context they are too difficult to interpret appropriately. I find it more illustrative to look at how we use Iguazu to truly understand how ubiquitously applicable nearline architectures can be.
  25. When you decide to build a new website, you almost never start with int main(). We always build on top of higher-level frameworks; there’s no need to re-write HTTP parsing libraries, cookie libraries, or database connection pools. The same principles apply to containers and nearline jobs. Saying “I’m using containers to build my app” is like saying “I’m using HTTP to build my app”. While it’s a great foundation, often a higher level of abstractions results in increased developer productivity. So, while containers may be an integral component of your architecture, or even necessary to the solution, they are not sufficient! Good architects should think about even higher levels of abstraction.
  26. While Iguazu can invoke and run arbitrary containers, in practice almost all jobs use the most important feature of Iguazu: the developer-optimized higher level framework. This is what a toy job looks like. Let’s break it down.
  27. The Hollywood principle says, “Don’t call me, I’ll call you.” Normally, you hear about it in the context of IoC frameworks, dependency injection, and UI or app toolkits. But it absolutely applies to distributed systems as well. Thinking back to Cascade (the initial PHP framework), if a developer wanted to test their new job, they must create a new queue, reconfigure their local copy of Cascade to talk to their new private queue, insert the job information into the queue, and wait for their job to eventually be run.
  28. At Coursera, we practice a DevOps (or actually NoOps) approach. All developers deploy their own code hundreds of times a week via automated tools and custom webapp tools.
  29. Now, back in 2012, we totally laughed at PHP for its horribly unreliable runtime full of memory leaks. But in Iguazu, we're actually worse. We don't just throw away the whole process, we throw away the whole file system, and the rest of the container. But, actually, this is a really good idea. Longer-running, resource-intensive jobs tend to leave a disproportionate amount of garbage in their wake. It's common to use temporary files on disk and a variety of other resources, for example as part of our pedagogical data exports. By allocating a new container instance from the container image, the system ensures a consistent environment and frees developers from file bookkeeping in the same way a garbage collector frees developers from memory management. PHP was on to something after all!!!
  30. Now, I'd like to delve into the flagship application of Iguazu: Evaluating programming assignments.
  31. Procrastination is a global phenomenon. We regularly see an order of magnitude increase in submission rates right before assignment deadlines. We needed an elastic service backed by a shared pool of resources to evaluate programming assignments efficiently and cost-effectively.
  32. Our online serving environment benefits greatly from immutable infrastructure and high degrees of automation to radically reduce operations and maintenance overhead. We wanted to apply these same lessons to evaluating programming assignments.
  33. For pedagogical reasons, we would like to provide feedback as quickly as possible. Ideally, we are able to execute fast graders and turn around their scores within 60 seconds at the 90th percentile.
  34. … Thanks to Iguazu, the GrID service itself is only ~1k LoC.
  35. Because we’re operating on a shared pool of resources, we need to bake security into the infrastructure. This also has the added benefit of making the system robust to less byzantine occurrences. But, what does “Secure Infrastructure” even mean?
  36. … By a show of hands, who of you would like to run arbitrary C code from random people on the internet on your servers? While you may think this insane security challenge only applies to these crazies from Coursera, it turns out that this applies far more broadly.
  37. Most Dockerfiles start with “from ubuntu”, or “from redis” or ”from jane-doe-on-github”. That one little innocent-looking line pulls in effectively arbitrary binaries & code to run on your container infrastructure. What this means is that: in practice, if you have container-based infrastructure at your organization, you should prepare to defend against arbitrary code running within your containers.
  38. Now, containers are very new, and security is sometimes very impenetrable. So, let’s instead talk about something that’s old, and much more straightforward. Babies. The first picture I have of a gaggle of small children is something along the lines of this picture. Each one warmly swaddled in their own … tub, happy as can be. When I initially thought of grading programming assignments, I had a similar image. Each submission happily running along within its own container. Reality will quickly disabuse you of these foolish notions. https://www.google.com/search?espv=2&biw=2560&bih=1468&tbm=isch&sa=1&q=babies+hospital+&oq=babies+hospital+&gs_l=img.3..0j0i30j0i5i30l3j0i8i30l5.4194.4194.0.4783.1.1.0.0.0.0.74.74.1.1.0....0...1c.1.64.img..0.1.74.mKcYVszmBgo#imgrc=BRbfAc8Wi9uf2M%3A
  39. Once we have all of these systems configured, graders can run happily within the containers. Now, some of you functional programmers may have picked up on something: grading is an idempotent operation. But as it turns out, with GrID, it's even better. Because we have hermetically sealed the grading containers, we have transformed the messy business of evaluating programming assignments into effectively a pure function in the functional programming sense. It has almost zero extra input from the outside world! Containers are really cool!
  40. https://www.flickr.com/photos/donnieray/11501178306/in/photostream/
  41. If you ignore all the name-spacing and container mumbo-jumbo, at the core, processes running within containers are just Linux processes, and so the standard security techniques apply.
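As one illustration of applying standard Linux process restrictions to a container, the sketch below builds a `docker run` command line with common hardening flags. The note does not list which techniques GrID uses, so this particular selection (no network, read-only root file system, dropped capabilities, unprivileged user, memory and process limits) is an assumption, and the helper name, image name, and default limits are hypothetical.

```python
def hardened_docker_args(image, command, memory="512m", pids=64):
    """Build an argv list for `docker run` with restrictive settings.

    The chosen flags are illustrative assumptions, not GrID's configuration.
    """
    return [
        "docker", "run", "--rm",
        "--network", "none",                    # no network access
        "--read-only",                          # read-only root file system
        "--cap-drop", "ALL",                    # drop all Linux capabilities
        "--security-opt", "no-new-privileges",  # block privilege escalation via setuid
        "--user", "65534:65534",                # run as an unprivileged uid:gid
        "--memory", memory,                     # cap memory use
        "--pids-limit", str(pids),              # cap process count (fork bombs)
        image, *command,
    ]


# Hypothetical image and entry point, shown only to illustrate the call.
print(" ".join(hardened_docker_args("grader:example", ["./grade.sh"])))
```

Returning an argv list instead of a shell string means it can be passed to `subprocess.run` without shell quoting concerns.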
  42. Now, there are a number of unknown vulnerabilities not included in this defense.
  43. Baby monitor graphics?
  44. Public Domain: https://www.flickr.com/photos/mustangjoe/20437315996/in/photolist-x8YA2b-4CHj67-8Cjveb-bC2UPc-ibCEkV-aswFR8-gmv5Vj-4r5sPk-4CHiyy-92qQGf-28i54x-5LfUcS-opNLAM-7QTwNd-d7HmTA-efZc4Y-brT6Uv-d7Hnfd-5sARbG-5vvzmv-aqn5Li-DTWCYi-7XMsUo-8m1fUK-uj58iZ-D2nADa-78SpzZ-6BJGaL-4BrcEY-ne6BDJ-9FhXQ6-9QALSm-4EP8Hb-6h14wn-5nTnpt-7groVi-4EP8VW-8Qv9zx-6bCq1k-a7E8EJ-adFoNW-5Rp7Pb-s8otHi-7xSqsJ-4JZiUA-qW6wFZ-7XJdzg-jiYBq5-9hJ5Vo-ySx3Uo