© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
ARC348
Seagull: A Highly Fault-Tolerant Distributed System for Concurrent Task Execution
Osman Sarood, Software Engineer, Yelp
October 2015
(Slide: Yelp by the numbers — monthly visitors, reviews, mobile searches, countries)
How Yelp:
• Runs millions of tests a day
• Downloads TBs of data in an extremely efficient manner
• Scales using our custom metric
What’s in it for me?
What to Expect
• High-level architecture talk
• No code
• Assumes basic understanding of Apache Mesos
A distributed system that allows concurrent task execution:
• at large scale
• while maintaining high cluster utilization
• and is highly fault tolerant and resilient
What Is Seagull?
The Developer
Run millions of tests each day
Seagull @ Yelp
• How does Seagull work?
• Major problem: Artifact downloads
• Auto Scaling and fault tolerance
What We’ll Cover
1: How Does Seagull Work?
• Each day, Seagull runs tests that would take 700 days (serially)!
• On average, 350 seagull-runs a day
• Each seagull-run has ~ 70,000 tests
• Each seagull-run would take 2 days to run serially
• Volatile but predictable demand
• 30% of load in 3 hours
• Predictable peak time (3PM-6PM)
What’s the Challenge?
• Run 3000 tests concurrently on a 200 machine cluster
• 2 days in serial => 14 mins!
• Running at such high scale involves:
• Downloading 36 TB a day (5 TB/hr peak)
• Up to 200 simultaneous downloads for a single large file
• 1.5 million Docker containers a day (210K/hr peak)
What Do We Do?
S3
Docker
Jenkins
Mesos
EC2
Elasticsearch
DynamoDB
Reporting
Monitoring
Seagull Ingredients
(Diagram: a Yelp developer kicks off a run; the Prioritizer, Schedulers 1…y, and Slaves 1…n run on EC2, alongside S3, DynamoDB, Elasticsearch, and a UI; steps 1–8, with 6(a)/6(b), trace a seagull-run through the system)
Seagull Overview
• Cluster of 7 r3.8xlarges
• Builds our largest service (the artifact)
• Uploads artifacts to Amazon S3
• Discovers tests
(Diagram: Yelp developer → Jenkins, which builds the artifact, uploads it to S3, and discovers tests)
Jenkins
• Our largest service, which forms the major part of the website
• Several hundreds of MB in size
• Takes ~10 mins to build
• Uses lots of memory
• Huge setup cost
• Build it once and download later
Yelp Artifact
• Takes the artifact and determines test names
• Parses Python code to extract test names
• Finishes in ~2 mins
• Separate test list for each of the 7 different suites
Test Discovery
(Diagram) Recap: Seagull overview
• Schedule longest tests first
• Historical test timing data from DynamoDB
• Fetch 25 million records/day
• Why DynamoDB:
• We don’t need to maintain it!
• Low cost (just $200/month!)
(Diagram: the Prioritizer combines the test list with historical timing data from DynamoDB)
Test Prioritization
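The longest-tests-first policy can be sketched as a sort over historical timings. Field names and the default estimate are invented for the sketch; Seagull's actual DynamoDB record schema is not shown in the talk:

```python
def prioritize(test_names, historical_timings, default_secs=60.0):
    """Order tests longest-first so the slowest tests start early.

    historical_timings: dict of test name -> average runtime in seconds
    (in Seagull these come from DynamoDB). Tests with no history get a
    default estimate so brand-new tests still get scheduled.
    """
    return sorted(
        test_names,
        key=lambda name: historical_timings.get(name, default_secs),
        reverse=True,
    )

timings = {"test_checkout": 300.0, "test_login": 5.0, "test_search": 45.0}
order = prioritize(["test_login", "test_checkout", "test_search"], timings)
```

Scheduling long tests first minimizes the chance that one straggler test, started last, extends the whole seagull-run.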
(Diagram) Recap: Seagull overview
• Run ~350 seagull-runs/day:
  • each run has ~70,000 tests (~25 million tests/day)
  • total serial time of 48 hours/run
• Challenging to run lots of tests at scale during peak times
(Chart: runs submitted per 10 mins, with a pronounced peak)
The Testing Problem
• Resource management system
• mesos-master: on the master node
• mesos-slave: on every slave
• Slaves register their resources with the Mesos master
• Schedulers subscribe to the Mesos master to consume resources
• The master offers resources to schedulers in a fair manner
(Diagram: Mesos master brokering slave resources to Scheduler 1 and Scheduler 2)
Apache Mesos
Seagull leverages the resource management abilities of Apache Mesos:
• Each run has a Mesos scheduler
• Each scheduler distributes work amongst ~600 workers (executors)
• 200 r3.8xlarge instances (32 vCPUs / 244 GB each)
Running Tests in Parallel
(Key: color-coded tests, grouped into bundles; C1 = Scheduler C1; S1 = Slave 1)
(Diagram: User1's and User2's runs get their own Seagull schedulers C1 and C2; the Mesos master places their bundles on Slaves S1 and S2 in the Seagull cluster)
Parallel Test Execution
2: Key Challenges: Artifact Downloads
• Each executor needs to have the artifact before running tests
• 18,000 requests per hour at peak
• Each request is for a large file (hundreds of MBs)
• A single executor (out of 600) taking long to download could delay the entire seagull-run
Why Is Artifact Download Critical?
(Diagram) Recap: Seagull overview
(Diagram) Seagull Executor (~10 mins on average): fetch artifact from Amazon S3 → start service in Docker → run tests → report results to Elasticsearch and Amazon DynamoDB
Seagull Executor
(Key: a test; a bundle = set of tests; C1 = Scheduler C1; S1 = Slave 1; A1 = artifact for Scheduler C1; "Exec C1 / A1" = executor of Scheduler C1)
Terminology (Key)
• Scheduler C1 starts and distributes work amongst 600 executors
• Each executor (a.k.a. task):
  • downloads its own artifact (independent)
  • runs for ~10 mins on average
• Each slave runs 15 executors (C1 uses a total of 40 slaves)
• 200 slaves * 15 executors * 6 (ten-minute slots per hour) = 18,000 reqs/hr! (13.5 TB/hour)
(Diagram: executors on Slave 1 … Slave 40 each pull artifact A1 directly from S3)
Artifact Handling
• Lots of requests took as long as 30 mins!
• We choked the NAT boxes with tons of requests
• Avoiding NAT would have required a bigger effort
• We wanted a quick solution
Slow Download Times
• Executors from the same scheduler can share artifacts
• Disadvantages:
  • Executors are no longer independent
  • Requires a locking implementation for downloading artifacts
• Still doesn't scale well
(Diagram: each slave downloads A1 and A2 from S3 once; the executors on that slave share them)
Sharing Artifacts
• A dedicated artifactcache of 9 r3.8xlarges
• Each artifact is replicated across all 9 caches
• Nginx distributes requests
• 10 Gbps network bandwidth helped
(Diagram: executors across the Seagull cluster fetch A1 and A2 from the artifactcache instead of S3)
Separate Artifactcache
(Charts: number of active schedulers per 10 min; download time (secs) per 10 min)
Artifact Download Metrics
• Why not use the network bandwidth we already have in our Amazon EC2 compute cluster?
• The entire cluster serves as the artifactcache
• Cache scales as the cluster scales
• Bandwidth comparison:
  • Centralized cache ~ 30 Mbps/executor
    [9 (# caches) * 10 (Gbps) / 3000 (# of executors)]
  • Distributed cache ~ 666 Mbps/executor
    [200 (# caches) * 10 (Gbps) / 3000 (# of executors)]
Distributed Artifactcache
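The comparison above is simple arithmetic — aggregate cache bandwidth spread evenly across all executors — and can be reproduced in a couple of lines:

```python
def per_executor_mbps(num_caches, gbps_per_cache, num_executors):
    """Aggregate cache bandwidth divided evenly across executors, in Mbps."""
    return num_caches * gbps_per_cache * 1000 / num_executors

centralized = per_executor_mbps(9, 10, 3000)    # dedicated 9-node artifactcache
distributed = per_executor_mbps(200, 10, 3000)  # whole 200-node cluster as cache
# centralized -> 30.0 Mbps, distributed -> ~666.7 Mbps
```

The ~22x difference comes entirely from the cache fleet growing from 9 machines to 200 while the executor count stays fixed.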
(Diagram: a random selector points each request at one slave's artifact pool; every slave in the Seagull cluster keeps its own pool)
Benefits of distributed artifact caching:
• Very scalable
• No extra machines to maintain
• Significant reduction in out-of-disk-space issues
• Fast downloads due to less contention
Distributed Artifact Caching
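One way to picture the random selector: pick a random slave, serve from its artifact pool on a hit, and fall back to S3 on a miss, populating the local pool either way. All names here are invented for the sketch, and it glosses over locking and the real transfer protocol:

```python
import random

class DistributedArtifactCache:
    """Every slave keeps a local pool; any slave can serve any artifact."""

    def __init__(self, slave_names, fetch_from_s3):
        # pools: slave name -> {artifact_id: bytes}
        self.pools = {name: {} for name in slave_names}
        self.fetch_from_s3 = fetch_from_s3  # fallback for cold artifacts

    def download(self, slave, artifact_id):
        # Ask a randomly chosen slave first; only hit S3 on a miss.
        peer = random.choice(list(self.pools))
        if artifact_id in self.pools[peer]:
            data = self.pools[peer][artifact_id]
        else:
            data = self.fetch_from_s3(artifact_id)
        self.pools[slave][artifact_id] = data  # this slave can now serve it too
        return data

calls = []
cache = DistributedArtifactCache(["slave1"], lambda a: calls.append(a) or b"bytes")
cache.download("slave1", "A1")  # miss: fetched from S3 once
cache.download("slave1", "A1")  # hit: served from the pool
```

Because every download seeds another pool, the cache warms itself: the more popular an artifact, the more slaves can serve it.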
(Charts: artifact download time (secs) per 10 min; number of downloads per 10 mins)
Can we improve download times further?
Distributed Artifactcache Performance
• At peak times:
  • Lots of downloads happen
  • Most artifacts end up downloaded on 90% of the slaves
• Once a machine downloads an artifact, it should serve other requests for that artifact
• Disadvantage: bookkeeping
Stealing Artifact
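The bookkeeping cost mentioned above is essentially a shared map from artifact to the slaves that already hold it. A sketch with invented names — the real system also has to cope with stale entries, crashes, and concurrent downloads:

```python
import random

class StealRegistry:
    """Tracks which slaves hold each artifact so peers can steal from them."""

    def __init__(self, fetch_from_s3):
        self.holders = {}  # artifact_id -> set of slave names holding it
        self.fetch_from_s3 = fetch_from_s3

    def acquire(self, slave, artifact_id):
        """Return where the artifact came from: a peer slave or 's3'."""
        sources = self.holders.setdefault(artifact_id, set()) - {slave}
        if sources:
            origin = random.choice(sorted(sources))  # steal from a peer
        else:
            origin = "s3"
            self.fetch_from_s3(artifact_id)
        self.holders[artifact_id].add(slave)  # bookkeeping: slave now serves it
        return origin

fetches = []
registry = StealRegistry(fetches.append)
first = registry.acquire("slave4", "A2")   # cold: comes from S3
second = registry.acquire("slave2", "A2")  # warm: stolen from slave4
```

Only the first request for an artifact ever reaches S3; every later request is served peer-to-peer, which is what flattens the peak-time download load.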
(Diagram) Steps:
1. Slave 4 gets A2
2. A bundle starts on Slave 2
3. Slave 2 pulls A2 from Slave 4
4. Bundles start on Slaves 1 & 3
5a. Slave 3 gets A2 from Slave 4
5b. Slave 1 steals A2 from Slave 2
(A random selector directs requests; each slave now holds A2 in its artifact pool)
Stealing Artifact
(Charts: artifact steal time per 10 min; number of steals per 10 min)
Performance: Stealing in Distributed Artifact Caching
Artifact Load-Balancing Viz
3: Auto Scaling and Fault Tolerance
• We used the Auto Scaling group provided by AWS, but it wasn't easy to 'select' which instances to terminate
• Mesos uses FIFO to assign work, whereas Auto Scaling also uses FIFO to terminate
• Example: 10% of slaves are doing work -> remove 10% -> Auto Scaling terminates the slaves doing the work
(Chart: runs submitted per 10 mins)
Auto Scaling
• CPU and memory demand is volatile
• Seagull tells Mesos to reserve the max amount of memory a task requires (m_t for task t)
• Total memory required to run a set T of tasks concurrently: R = Σ_{t ∈ T} m_t
Reserved Memory
• Total available memory for slave 'i': M_i
• Let S denote the set of all slaves in our cluster
• Total memory available: M = Σ_{i ∈ S} M_i
• Gull-load: ratio of total reserved memory to total memory available, R / M
Gull-load
Invoke Auto Scaling every 10 mins:
1. Calculate gull-load for each machine and for the cluster
2. If gull-load > 0.9: add 10% extra machines
3. If gull-load < 0.5: sort slaves on gull-load, select the 10% with the least gull-load, and terminate those slaves
4. Otherwise: do nothing

Gull-load (GL)    Action (# slaves)
0.5 < GL < 0.9    Nothing
GL > 0.9          Add 10%
GL < 0.5          Remove 10%
How Do We Scale Automatically?
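Putting the definitions together, the scaling decision reduces to a few lines. The thresholds are the ones on the slide; the per-slave numbers are purely illustrative:

```python
def gull_load(reserved_per_slave, available_per_slave):
    """Ratio of total reserved memory to total available memory."""
    return sum(reserved_per_slave.values()) / sum(available_per_slave.values())

def scaling_action(gl, lower=0.5, upper=0.9):
    """The slide's policy: add 10% above upper, remove 10% below lower."""
    if gl > upper:
        return "add 10%"
    if gl < lower:
        return "remove 10% (least-loaded slaves first)"
    return "nothing"

reserved = {"s1": 200, "s2": 40, "s3": 0}      # GB reserved by executors
available = {"s1": 244, "s2": 244, "s3": 244}  # GB per r3.8xlarge slave
gl = gull_load(reserved, available)            # 240 / 732 ≈ 0.33
```

Computing gull-load per machine is what lets Seagull terminate the *least-loaded* slaves, side-stepping the FIFO-termination problem described above.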
• Started with all Reserved Instances: too expensive!
• Shifted to all Spot; we always knew it was risky…
• One fine day, all the slaves were gone!
• Now: a mix of On-Demand (25%) and Spot (75%) Instances
Reserved, On-Demand, or Spot Instances?
Seagull provides fault tolerance at two levels:
• Hardware level: spreading our machines geographically (preventive)
• Infrastructure level: Seagull retries upon failure (corrective)
Fault Tolerance and Reliability
• Dividing machines roughly equally amongst AZs
  • us-west-2: a => 60, b => 66, c => 66
• Easy to terminate a slave and recreate it quickly
• In the event of losing Spot Instances:
  • Our seagull-runs keep running on the On-Demand instances
  • Add On-Demand instances until Spot Instances are available again (manual)
Preventive Fault Tolerance (Reliability)
• Lots of reasons for executors to fail:
  • Bad service
  • Docker problems (>100 concurrent containers/machine)
  • External partners (e.g., Sauce Labs)
• How we do it:
  • A Task Manager (inside each scheduler) tracks the life cycle of each executor/task
  • Fixed number of retries upon failure/timeout
Corrective Fault Tolerance
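The Task Manager's retry loop can be sketched as a small state machine over the life-cycle states from the terminology slide (queued, running, finished); the class shape and the `max_retries` value are invented for the sketch:

```python
class TaskManager:
    """Tracks each task's life cycle and re-queues failed tasks a fixed
    number of times before giving up."""

    def __init__(self, task_ids, max_retries=1):
        self.state = {t: "queued" for t in task_ids}
        self.attempts = {t: 0 for t in task_ids}
        self.max_retries = max_retries

    def launch(self, task_id):
        self.state[task_id] = "running"
        self.attempts[task_id] += 1

    def report(self, task_id, ok):
        if ok:
            self.state[task_id] = "finished"
        elif self.attempts[task_id] <= self.max_retries:
            self.state[task_id] = "queued"  # corrective: retry the bundle
        else:
            self.state[task_id] = "failed"  # out of retries, surface the failure

tm = TaskManager(["bundle1"], max_retries=1)
tm.launch("bundle1"); tm.report("bundle1", ok=False)  # attempt 1 fails -> re-queued
tm.launch("bundle1"); tm.report("bundle1", ok=False)  # attempt 2 fails -> give up
```

Because a re-queued bundle goes back through normal scheduling, it can land on any healthy slave — which is exactly how runs survive a crashed slave or a lost Spot Instance.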
(Key: a test; a bundle = set of tests; C1 = Scheduler C1; S1 = Slave 1; A1 = artifact for Scheduler C1; "Exec C1 / A1" = executor of Scheduler C1; the Task Manager tracks the life cycle of each task, i.e. queued, running, finished)
Terminology (Key)
(Diagram: User1's scheduler C1 runs bundles on S1 (us-west-2a) and S2 (us-west-2b); when S1 crashes, the Task Manager reruns its bundles)
Corrective Fault Tolerance
• How Seagull works and interacts with other systems
• An extremely efficient artifact hosting design
• Custom scaling policy and its use of gull-load
• Fault tolerance at scale using:
• AWS
• Executor retry logic
What Did We Learn?
• Sanitize code for open source
• Explore why Amazon S3 downloads are so slow:
  • Avoiding the NAT box
  • Using multiple buckets
  • Breaking our artifact into smaller files
• Improve scaling:
  • Ability to use other instance types
  • Reduce cost by choosing the Spot instance types with the lowest cost per GB
Future Work
Remember to complete your evaluations!
Thank you!
More Related Content

What's hot

Streaming Design Patterns Using Alpakka Kafka Connector (Sean Glover, Lightbe...
Streaming Design Patterns Using Alpakka Kafka Connector (Sean Glover, Lightbe...Streaming Design Patterns Using Alpakka Kafka Connector (Sean Glover, Lightbe...
Streaming Design Patterns Using Alpakka Kafka Connector (Sean Glover, Lightbe...
confluent
 

What's hot (20)

How to improve ELK log pipeline performance
How to improve ELK log pipeline performanceHow to improve ELK log pipeline performance
How to improve ELK log pipeline performance
 
Introducing Exactly Once Semantics in Apache Kafka with Matthias J. Sax
Introducing Exactly Once Semantics in Apache Kafka with Matthias J. SaxIntroducing Exactly Once Semantics in Apache Kafka with Matthias J. Sax
Introducing Exactly Once Semantics in Apache Kafka with Matthias J. Sax
 
Way to cloud
Way to cloudWay to cloud
Way to cloud
 
When it Absolutely, Positively, Has to be There: Reliability Guarantees in Ka...
When it Absolutely, Positively, Has to be There: Reliability Guarantees in Ka...When it Absolutely, Positively, Has to be There: Reliability Guarantees in Ka...
When it Absolutely, Positively, Has to be There: Reliability Guarantees in Ka...
 
Native container monitoring
Native container monitoringNative container monitoring
Native container monitoring
 
Kubernetes "Ubernetes" Cluster Federation by Quinton Hoole (Google, Inc) Huaw...
Kubernetes "Ubernetes" Cluster Federation by Quinton Hoole (Google, Inc) Huaw...Kubernetes "Ubernetes" Cluster Federation by Quinton Hoole (Google, Inc) Huaw...
Kubernetes "Ubernetes" Cluster Federation by Quinton Hoole (Google, Inc) Huaw...
 
Introduction to Akka-Streams
Introduction to Akka-StreamsIntroduction to Akka-Streams
Introduction to Akka-Streams
 
Micro services infrastructure with AWS and Ansible
Micro services infrastructure with AWS and AnsibleMicro services infrastructure with AWS and Ansible
Micro services infrastructure with AWS and Ansible
 
Introduction to akka actors with java 8
Introduction to akka actors with java 8Introduction to akka actors with java 8
Introduction to akka actors with java 8
 
Kubernetes at Datadog the very hard way
Kubernetes at Datadog the very hard wayKubernetes at Datadog the very hard way
Kubernetes at Datadog the very hard way
 
Running & Monitoring Docker at Scale
Running & Monitoring Docker at ScaleRunning & Monitoring Docker at Scale
Running & Monitoring Docker at Scale
 
Heat optimization
Heat optimizationHeat optimization
Heat optimization
 
How Parse Built a Mobile Backend as a Service on AWS (MBL307) | AWS re:Invent...
How Parse Built a Mobile Backend as a Service on AWS (MBL307) | AWS re:Invent...How Parse Built a Mobile Backend as a Service on AWS (MBL307) | AWS re:Invent...
How Parse Built a Mobile Backend as a Service on AWS (MBL307) | AWS re:Invent...
 
Streaming Design Patterns Using Alpakka Kafka Connector (Sean Glover, Lightbe...
Streaming Design Patterns Using Alpakka Kafka Connector (Sean Glover, Lightbe...Streaming Design Patterns Using Alpakka Kafka Connector (Sean Glover, Lightbe...
Streaming Design Patterns Using Alpakka Kafka Connector (Sean Glover, Lightbe...
 
Show Me Kafka Tools That Will Increase My Productivity! (Stephane Maarek, Dat...
Show Me Kafka Tools That Will Increase My Productivity! (Stephane Maarek, Dat...Show Me Kafka Tools That Will Increase My Productivity! (Stephane Maarek, Dat...
Show Me Kafka Tools That Will Increase My Productivity! (Stephane Maarek, Dat...
 
NetflixOSS Open House Lightning talks
NetflixOSS Open House Lightning talksNetflixOSS Open House Lightning talks
NetflixOSS Open House Lightning talks
 
Honest Performance Testing with "NDBench" (Vinay Chella, Netflix) | Cassandra...
Honest Performance Testing with "NDBench" (Vinay Chella, Netflix) | Cassandra...Honest Performance Testing with "NDBench" (Vinay Chella, Netflix) | Cassandra...
Honest Performance Testing with "NDBench" (Vinay Chella, Netflix) | Cassandra...
 
Integrating Puppet with Cloud Infrastructures-Remco Overdijk
Integrating Puppet with Cloud Infrastructures-Remco OverdijkIntegrating Puppet with Cloud Infrastructures-Remco Overdijk
Integrating Puppet with Cloud Infrastructures-Remco Overdijk
 
MicroServices at Netflix - challenges of scale
MicroServices at Netflix - challenges of scaleMicroServices at Netflix - challenges of scale
MicroServices at Netflix - challenges of scale
 
Kafka At Scale in the Cloud
Kafka At Scale in the CloudKafka At Scale in the Cloud
Kafka At Scale in the Cloud
 

Similar to (ARC348) Seagull: How Yelp Built A System For Task Execution

London devops logging
London devops loggingLondon devops logging
London devops logging
Tomas Doran
 
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
Lucidworks
 
The challenges of live events scalability
The challenges of live events scalabilityThe challenges of live events scalability
The challenges of live events scalability
Guy Tomer
 

Similar to (ARC348) Seagull: How Yelp Built A System For Task Execution (20)

Rails performance at Justin.tv - Guillaume Luccisano
Rails performance at Justin.tv - Guillaume LuccisanoRails performance at Justin.tv - Guillaume Luccisano
Rails performance at Justin.tv - Guillaume Luccisano
 
Cassandra Day Atlanta 2015: Recording the Web: High-Fidelity Storage and Play...
Cassandra Day Atlanta 2015: Recording the Web: High-Fidelity Storage and Play...Cassandra Day Atlanta 2015: Recording the Web: High-Fidelity Storage and Play...
Cassandra Day Atlanta 2015: Recording the Web: High-Fidelity Storage and Play...
 
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
 
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
 
London devops logging
London devops loggingLondon devops logging
London devops logging
 
Jolt: Distributed, fault-tolerant test running at scale using Mesos
Jolt: Distributed, fault-tolerant test running at scale using MesosJolt: Distributed, fault-tolerant test running at scale using Mesos
Jolt: Distributed, fault-tolerant test running at scale using Mesos
 
Anton Boyko "The future of serverless computing"
Anton Boyko "The future of serverless computing"Anton Boyko "The future of serverless computing"
Anton Boyko "The future of serverless computing"
 
Kubernetes at NU.nl (Kubernetes meetup 2019-09-05)
Kubernetes at NU.nl   (Kubernetes meetup 2019-09-05)Kubernetes at NU.nl   (Kubernetes meetup 2019-09-05)
Kubernetes at NU.nl (Kubernetes meetup 2019-09-05)
 
RedisConf17 - Redis in High Traffic Adtech Stack
RedisConf17 - Redis in High Traffic Adtech StackRedisConf17 - Redis in High Traffic Adtech Stack
RedisConf17 - Redis in High Traffic Adtech Stack
 
DjangoCon 2010 Scaling Disqus
DjangoCon 2010 Scaling DisqusDjangoCon 2010 Scaling Disqus
DjangoCon 2010 Scaling Disqus
 
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
 
Scality S3 Server: Node js Meetup Presentation
Scality S3 Server: Node js Meetup PresentationScality S3 Server: Node js Meetup Presentation
Scality S3 Server: Node js Meetup Presentation
 
Anton Boyko, "The evolution of microservices platform or marketing gibberish"
Anton Boyko, "The evolution of microservices platform or marketing gibberish"Anton Boyko, "The evolution of microservices platform or marketing gibberish"
Anton Boyko, "The evolution of microservices platform or marketing gibberish"
 
The challenges of live events scalability
The challenges of live events scalabilityThe challenges of live events scalability
The challenges of live events scalability
 
John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudy
 
Disenchantment: Netflix Titus, Its Feisty Team, and Daemons
Disenchantment: Netflix Titus, Its Feisty Team, and DaemonsDisenchantment: Netflix Titus, Its Feisty Team, and Daemons
Disenchantment: Netflix Titus, Its Feisty Team, and Daemons
 
Cloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark AnalyticsCloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark Analytics
 
mtl_rubykaigi
mtl_rubykaigimtl_rubykaigi
mtl_rubykaigi
 
Performance Benchmarking: Tips, Tricks, and Lessons Learned
Performance Benchmarking: Tips, Tricks, and Lessons LearnedPerformance Benchmarking: Tips, Tricks, and Lessons Learned
Performance Benchmarking: Tips, Tricks, and Lessons Learned
 
Azure Functions - the evolution of microservices platform or marketing gibber...
Azure Functions - the evolution of microservices platform or marketing gibber...Azure Functions - the evolution of microservices platform or marketing gibber...
Azure Functions - the evolution of microservices platform or marketing gibber...
 

More from Amazon Web Services

Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
Amazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
Amazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
Amazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
Amazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Recently uploaded (20)

Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 

(ARC348) Seagull: How Yelp Built A System For Task Execution

  • 1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. ARC348 Seagull Osman Sarood Software Engineer @ Yelp A highly Fault-tolerant Distributed System for Concurrent Task Execution
  • 2. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Seagull: A Highly Fault-Tolerant Distributed System for Concurrent Task Execution Osman Sarood, Software Engineer, Yelp ARC348 October 2015
  • 4. (Yelp stats slide: monthly visitors, reviews, mobile searches, countries)
  • 5. How Yelp: • Runs millions of tests a day • Downloads TBs of data in an extremely efficient manner • Scales using our custom metric What’s in it for me?
  • 6. What to Expect • High-level architecture talk • No code • Assumes basic understanding of Apache Mesos
  • 7. A distributed system that allows concurrent task execution: • at large scale • while maintaining high cluster utilization • and is highly fault tolerant and resilient What Is Seagull?
  • 8. Seagull @ Yelp (diagram: the developer, who runs millions of tests each day)
  • 9. • How does Seagull work? • Major problem: Artifact downloads • Auto Scaling and fault tolerance What We’ll Cover
  • 10. 1: How Does Seagull Work?
  • 11. • Each day, Seagull runs tests that would take 700 days (serially)! • On average, 350 seagull-runs a day • Each seagull-run has ~ 70,000 tests • Each seagull-run would take 2 days to run serially • Volatile but predictable demand • 30% of load in 3 hours • Predictable peak time (3PM-6PM) What’s the Challenge?
  • 12. What Do We Do? • Run 3,000 tests concurrently on a 200-machine cluster • 2 days in serial => 14 mins! • Running at such high scale involves: • downloading 36 TB a day (5 TB/hr peak) • up to 200 simultaneous downloads for a single large file • 1.5 million Docker containers a day (210K/hr peak)
  • 14. Seagull Overview (architecture diagram: a Yelp developer kicks off a run; the Prioritizer feeds schedulers 1…y, which launch work on slaves 1…n in EC2, backed by Elasticsearch, DynamoDB, S3, and a UI; steps numbered 1-8)
  • 15. Jenkins • Cluster of 7 r3.8xlarges • Builds our largest service (the artifact) • Uploads artifacts to Amazon S3 • Discovers tests (diagram: Yelp developer → Jenkins → build artifact / discover tests → S3)
  • 16. Yelp Artifact • The largest service, forming a major part of the website • Several hundred MB in size • Takes ~10 mins to build • Uses lots of memory • Huge setup cost • Build it once and download later
  • 17. • Takes the artifact and determines test names • Parse Python code to extract test names • Finishes in ~2 mins • Separate test list for each of the 7 different suites Test Discovery
  • 18. Recap (Seagull architecture diagram repeated, highlighting the current step)
  • 19. Test Prioritization • Schedule longest tests first • Historical test timing data from DynamoDB • Fetch 25 million records/day • Why DynamoDB: we don't need to maintain it, and it's low cost (just $200/month!) (diagram: test list + historical data from DynamoDB → Prioritizer)
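The longest-first ordering above amounts to a simple descending sort on historical durations. A minimal sketch (function and field names are illustrative, not Seagull's actual code), with an assumed default duration so tests with no history aren't starved:

```python
# Illustrative longest-first test ordering; DEFAULT_DURATION_SECS is an
# assumed fallback for tests with no timing history.
DEFAULT_DURATION_SECS = 60.0

def prioritize(test_names, historical_timings):
    """Return tests sorted so the longest-running ones are scheduled first."""
    return sorted(
        test_names,
        key=lambda name: historical_timings.get(name, DEFAULT_DURATION_SECS),
        reverse=True,
    )

# Hypothetical timings fetched from DynamoDB:
timings = {"test_search": 120.0, "test_login": 5.0, "test_checkout": 45.0}
ordered = prioritize(
    ["test_login", "test_new", "test_search", "test_checkout"], timings
)
```

Scheduling the longest tests first is the classic longest-processing-time heuristic: it keeps a few slow tests from dragging out the tail of the run.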
  • 20. Recap (Seagull architecture diagram repeated, highlighting the current step)
  • 21. The Testing Problem • Run ~350 seagull-runs/day: • each run has ~70,000 tests (~25 million tests/day) • total serial time of 48 hours/run • Challenging to run lots of tests at scale during peak times (chart: runs submitted per 10 mins, with a pronounced peak)
  • 22. Apache Mesos • Resource management system • mesos-master: runs on the master node • mesos-slave: runs on every slave • Slaves register their resources with the Mesos master • Schedulers subscribe to the Mesos master to consume resources • The master offers resources to schedulers in a fair manner (diagram: Mesos master brokering between slaves 1-2 and schedulers 1-2)
  • 23. Running Tests in Parallel • Seagull leverages the resource management abilities of Apache Mesos • Each run has a Mesos scheduler • Each scheduler distributes work amongst ~600 workers (executors) • 200 r3.8xlarge instances (32 cores/256 GB)
  • 24. Terminology (Key) • Test (color-coded per scheduler) • Set of tests (bundle) • C1: scheduler C1 • S1: slave 1
  • 25. Parallel Test Execution (diagram: Yelp devs' runs get schedulers C1 and C2, which register with the Mesos master; their test bundles are spread across slaves S1 and S2)
  • 26. 2: Key Challenges: Artifact Downloads
  • 27. Why Is Artifact Download Critical? • Each executor needs the artifact before it can run tests • 18,000 requests per hour at peak • Each request is for a large file (hundreds of MBs) • A single executor (out of 600) taking too long to download could delay the entire seagull-run
  • 28. Recap (Seagull architecture diagram repeated, highlighting the current step)
  • 29. Seagull Executor (diagram: fetch artifact from Amazon S3 → start service in Docker → run tests → report results to Elasticsearch/Amazon DynamoDB; takes ~10 mins on average)
  • 30. Terminology (Key) • Test • Set of tests (bundle) • C1: scheduler C1 • S1: slave 1 • A1: artifact for scheduler C1 • Exec C1/A1: executor of scheduler C1
  • 31. Artifact Handling • Scheduler C1 starts and distributes work amongst 600 executors • Each executor (a.k.a. task): • downloads its own artifact (independent) • runs for ~10 mins on average • Each slave runs 15 executors (C1 uses a total of 40 slaves) • 200 * 15 * 6 = 18,000 reqs/hr! (13.5 TB/hour) (diagram: every executor on every slave fetching the artifact from S3)
  • 32. Slow Download Times • Lots of requests took as long as 30 mins! • We choked NAT boxes with tons of requests • Avoiding NAT would have required a bigger effort • We wanted a quick solution
  • 33. Sharing Artifacts • Executors from the same scheduler can share artifacts • Disadvantages: • executors are no longer independent • a locking implementation is needed for downloading artifacts • Still doesn't scale well (diagram: executors on each slave sharing the per-scheduler artifacts A1/A2 from S3)
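The locking the slide alludes to can be done per slave with an exclusive file lock, so only the first executor downloads while the rest block and then reuse the cached file. A hedged sketch, assuming a POSIX filesystem; all names, including the `fetch` callable standing in for the S3 client, are illustrative:

```python
import fcntl
import os

def ensure_artifact(artifact_id, pool_dir, fetch):
    """Download an artifact at most once per slave. Concurrent executors
    serialize on a file lock; the first one fetches, the rest find the
    file already on disk. `fetch` writes the artifact to the given path."""
    path = os.path.join(pool_dir, artifact_id)
    lock_path = path + ".lock"
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # one downloader at a time
        try:
            if not os.path.exists(path):       # first executor downloads;
                fetch(path)                    # the rest reuse the file
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return path
```

This also illustrates the stated disadvantage: executors now depend on shared on-disk state instead of being independent.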
  • 34. Separate Artifactcache • Artifactcache consisting of 9 r3.8xlarges • Each artifact replicated across all 9 artifact caches • Nginx distributes requests • 10 Gbps network bandwidth helped (diagram: executors fetch artifacts A1/A2 from the artifactcache instead of S3)
  • 35. Artifact Download Metrics (charts per 10 min: number of active schedulers; download time in secs)
  • 36. Distributed Artifactcache • Why not use the plentiful network bandwidth of our Amazon EC2 compute fleet? • The entire cluster serves as the artifactcache • The cache scales as the cluster scales • Bandwidth comparison: • centralized cache ~ 30 Mbps/executor = 9 (# caches) * 10 (Gbps) / 3000 (# of executors) • distributed cache ~ 666 Mbps/executor = 200 (# caches) * 10 (Gbps) / 3000 (# of executors)
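The bandwidth comparison above is straightforward arithmetic; a quick sketch to make the units explicit (illustrative, using 1 Gbps = 1000 Mbps):

```python
GBPS_TO_MBPS = 1000  # 1 Gbps = 1000 Mbps

def per_executor_mbps(num_caches, gbps_per_cache, num_executors):
    """Aggregate cache bandwidth divided evenly across all executors."""
    return num_caches * gbps_per_cache * GBPS_TO_MBPS / num_executors

centralized = per_executor_mbps(9, 10, 3000)    # dedicated artifactcache
distributed = per_executor_mbps(200, 10, 3000)  # every slave serves artifacts
```

With the whole 200-slave cluster acting as the cache, each of the 3,000 executors gets roughly 22x the per-executor bandwidth of the 9-box artifactcache.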
  • 37. Distributed Artifact Caching (diagram: a random selector directs each download to the artifact pool of a slave holding A1) Benefits of distributed artifact caching: • Very scalable • No extra machines to maintain • Significant reduction in out-of-disk-space issues • Fast downloads due to less contention
  • 38. Distributed Artifactcache Performance (charts per 10 min: artifact download time in secs; number of downloads) Can we improve download times further?
  • 39. Stealing Artifact • At peak times: • lots of downloads happen • most artifacts end up being downloaded on 90% of slaves • Once a machine downloads an artifact, it should serve other requests for that artifact • Disadvantage: bookkeeping
  • 40. Stealing Artifact (diagram) 1. Slave 4 gets A2 2. A bundle starts on Slave 2 3. Slave 2 pulls A2 from Slave 4 4. Bundles start on Slaves 1 & 3 5a. Slave 3 gets A2 from Slave 4 5b. Slave 1 steals A2 from Slave 2
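The lookup order implied by the steps above can be sketched as: local disk first, then steal from a peer known to hold the artifact, else fall back to the central cache. The `holders` map stands in for the bookkeeping the previous slide calls a disadvantage; all names are illustrative, not Seagull's actual code:

```python
import random

def locate_artifact(artifact_id, my_slave, holders, fallback="artifactcache"):
    """Pick where to fetch an artifact from: local disk if we already have
    it, else 'steal' from a random slave known to hold it, else the central
    cache. `holders` maps artifact id -> set of slaves that have it."""
    have = holders.get(artifact_id, set())
    if my_slave in have:
        return my_slave                      # already on local disk
    if have:
        return random.choice(sorted(have))   # steal from a peer
    return fallback                          # nobody has it yet

# Hypothetical state after step 1 in the diagram: only Slave 4 holds A2.
holders = {"A2": {"slave4"}}
source = locate_artifact("A2", "slave2", holders)
```

Picking a random holder spreads steal traffic across peers instead of hammering the first slave that downloaded the artifact.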
  • 41. Performance: Stealing in Distributed Artifact Caching (charts per 10 min: artifact steal time; number of steals)
  • 43. 3: Auto Scaling and Fault Tolerance
  • 44. Auto Scaling • Used the Auto Scaling group provided by AWS, but it wasn't easy to 'select' which instances to terminate • Mesos uses FIFO to assign work, whereas Auto Scaling also uses FIFO to terminate • Example: 10% of slaves are working -> remove 10% -> the slaves doing work get terminated (chart: runs submitted per 10 mins)
  • 45. Reserved Memory • CPU and memory demand is volatile • Seagull tells Mesos to reserve the max amount of memory a task requires (m_max) • Total memory required to run a set of n tasks concurrently: M_reserved = n * m_max
  • 46. Gull-load • Total available memory for slave i: A_i • Let S denote the set of all slaves in our cluster • Total memory available: A = sum of A_i over all i in S • Gull-load: the ratio of total reserved memory to total memory available, GL = M_reserved / A
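In code form, gull-load is just total reserved memory over total available memory. A minimal sketch (names and the example numbers are illustrative, not Yelp's actual values):

```python
def gull_load(reserved_per_task, num_tasks, available_per_slave):
    """Gull-load: total reserved memory over total memory available.
    `reserved_per_task` is the per-task max-memory reservation (m_max),
    `available_per_slave` lists each slave's available memory."""
    total_reserved = reserved_per_task * num_tasks      # n * m_max
    total_available = sum(available_per_slave)          # sum of A_i
    return total_reserved / total_available

# Illustrative: 200 slaves with 256 GB each, 3000 tasks reserving 12 GB apiece
load = gull_load(12, 3000, [256] * 200)
```

Because the reservation is the task's maximum rather than its average usage, gull-load tracks committed capacity, which is what matters when deciding whether more tasks can be placed.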
  • 47. Gull-load (chart: gull-load over time, rising while running lots of executors)
  • 48. How Do We Scale Automatically? (every 10 mins) • Calculate gull-load for each machine • If gull-load > 0.9: invoke Auto Scaling to add 10% extra machines • If gull-load < 0.5: sort slaves on gull-load, select the 10% with the least gull-load, and terminate them • Otherwise (0.5 < GL < 0.9): do nothing
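The decision table can be sketched as a small policy function evaluated every 10 minutes (thresholds from the slide; names are illustrative):

```python
def scaling_action(gull_load, fleet_size):
    """Scaling policy from the slide: grow when overloaded, shrink when
    underloaded, otherwise do nothing. When shrinking, the caller would
    then pick the 10% of slaves with the least gull-load to terminate."""
    delta = max(1, fleet_size // 10)  # 10% of the fleet
    if gull_load > 0.9:
        return ("add", delta)
    if gull_load < 0.5:
        return ("remove", delta)
    return ("nothing", 0)             # 0.5 <= GL <= 0.9
```

Selecting the least-loaded slaves for termination is what the stock AWS Auto Scaling FIFO policy couldn't do, which is why the custom policy exists.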
  • 49. Reserved, On-Demand, or Spot Instances? • Started with all Reserved Instances: too expensive! • Shifted to all Spot: we always knew it was risky… • One fine day, all slaves were gone! • Now a mix of On-Demand (25%) and Spot (75%) instances
  • 50. Seagull provides fault tolerance at two levels • Hardware level: Spreading our machines geographically (preventive) • Infrastructure level: Seagull retries upon failure (corrective) Fault Tolerance and Reliability
  • 51. Preventive Fault Tolerance (Reliability) • Machines divided equally amongst AZs • us-west-2: a => 60, b => 66, c => 66 • Easy to terminate a slave and recreate it quickly • In the event of losing Spot Instances: • seagull-runs keep running on the On-Demand instances • add On-Demand instances until Spot Instances are available again (manual)
  • 52. • Lots of reasons for executors to fail: • Bad service • Docker problems (>100 concurrent containers/machine) • External partners (e.g., Sauce Labs) • How do we do it: • Task Manager (inside scheduler) tracks life cycle of each executor/task • Fixed number of retries upon failure/timeout Corrective Fault Tolerance
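The fixed-retry behavior the Task Manager applies can be sketched as follows; `run_once` is an illustrative stand-in for launching an executor and waiting for it, and `max_retries` is an assumed parameter (the slide doesn't state the actual count):

```python
def run_with_retries(task, run_once, max_retries=3):
    """Corrective fault tolerance sketch: rerun a failed or timed-out task
    a fixed number of times before giving up. `run_once` returns True on
    success. Returns (final_state, last_attempt_index)."""
    for attempt in range(1 + max_retries):   # initial try + retries
        if run_once(task):
            return ("finished", attempt)
    return ("failed", max_retries)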
  • 53. Terminology (Key) • Test • Set of tests (bundle) • C1: scheduler C1 • S1: slave 1 • A1: artifact for scheduler C1 • Exec C1/A1: executor of scheduler C1 • Task Manager: tracks the life cycle of each task, i.e., queued, running, finished
  • 54. Corrective Fault Tolerance (diagram: slave S1 in us-west-2a crashes; scheduler C1's Task Manager notices and reruns the affected bundles, e.g., on S2 in us-west-2b)
  • 55. • How Seagull works and interacts with other systems • An extremely efficient artifact hosting design • Custom scaling policy and its use of gull-load • Fault tolerance at scale using: • AWS • Executor retry logic What Did We Learn?
  • 56. Future Work • Sanitize code for open source • Explore why Amazon S3 downloads are so slow: • avoiding the NAT box • using multiple buckets • breaking our artifact into smaller files • Improve scaling: • ability to use other instance types • reduce cost by choosing the Spot instance types with the best GB/$