Efficiently parallelizing mutually independent tasks is a challenging problem at scale. Seagull, Yelp's recent in-house product, demonstrates how an intelligent scheduling system can combine several open-source products into a highly scalable and fault-tolerant distributed system. Learn how Yelp built Seagull on a variety of Amazon Web Services to concurrently execute thousands of tasks and greatly improve performance. Seagull combines open-source software such as Elasticsearch, Mesos, Docker, and Jenkins with Amazon Web Services (AWS) to parallelize Yelp's testing suite. Our current use case for Seagull is running Yelp's test suite, which has over 55,000 test cases, in a distributed fashion. Using smart scheduling, we can run one of our largest test suites, processing 42 hours of serial work in less than 10 minutes on 200 r3.8xlarge instances from Amazon Elastic Compute Cloud (Amazon EC2). Seagull consumes and produces data at very high rates: on a typical day, it writes 60 GB of data and consumes 20 TB. Although we currently use Seagull to parallelize test execution, it can efficiently parallelize other types of independent tasks.
5. How Yelp:
• Runs millions of tests a day
• Downloads TBs of data in an extremely efficient manner
• Scales using our custom metric
What’s in it for me?
6. What to Expect
• High-level architecture talk
• No code
• Assumes basic understanding of Apache Mesos
A distributed system that allows concurrent task execution:
• at large scale
• while maintaining high cluster utilization
• and is highly fault tolerant and resilient
What Is Seagull?
11. • Each day, Seagull runs tests that would take 700 days (serially)!
• On average, 350 seagull-runs a day
• Each seagull-run has ~ 70,000 tests
• Each seagull-run would take 2 days to run serially
• Volatile but predictable demand
• 30% of load in 3 hours
• Predictable peak time (3PM-6PM)
What’s the Challenge?
12. • Run 3000 tests concurrently on a 200 machine cluster
• 2 days in serial => 14 mins!
• Running at such high scale involves:
• Downloading 36 TB a day (5 TB/hr peak)
• Up to 200 simultaneous downloads for a single large file
• 1.5 million Docker containers a day (210K/hr peak)
What Do We Do?
15. • Cluster of 7 r3.8xlarges
• Builds our largest service (artifact)
• Uploads artifacts to Amazon S3
• Discovers tests
[Diagram: Yelp Developer → Jenkins → build artifact, discover tests → S3]
Jenkins
16. • The largest service, forming a major part of the website
• Several hundred MB in size
• Takes ~10 mins to build
• Uses lots of memory
• Huge setup cost
• Build it once and download later
Yelp Artifact
17. • Takes the artifact and determines test names
• Parses Python code to extract test names (see the sketch after this slide)
• Finishes in ~2 mins
• Separate test list for each of the 7 different suites
Test Discovery
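The deck doesn't show how discovery is implemented; here is a minimal sketch of one way to extract test names by parsing Python source with the standard ast module. The "tests/" layout, the "test_*.py" file pattern, and the naming conventions are assumptions, not Seagull's actual code.

```python
import ast
from pathlib import Path

def discover_tests(root):
    """Yield 'path::TestClass::test_name' style IDs found under root.

    A minimal sketch: the file layout and naming conventions are assumed.
    """
    for path in Path(root).rglob("test_*.py"):
        tree = ast.parse(path.read_text())
        for node in tree.body:
            # Top-level test functions: def test_foo(): ...
            if isinstance(node, ast.FunctionDef) and node.name.startswith("test"):
                yield "%s::%s" % (path, node.name)
            # Test methods inside classes: class TestBar: def test_baz(self): ...
            elif isinstance(node, ast.ClassDef) and node.name.startswith("Test"):
                for item in node.body:
                    if isinstance(item, ast.FunctionDef) and item.name.startswith("test"):
                        yield "%s::%s::%s" % (path, node.name, item.name)

if __name__ == "__main__":
    for test_id in discover_tests("tests/"):
        print(test_id)
```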
19. • Schedule longest tests first (see the bundling sketch after this slide)
• Historical test timing data from DynamoDB
• Fetch 25 million records/day
• Why DynamoDB:
• We don’t need to maintain it!
• Low cost (just $200/month!)
[Diagram: test list + historical timing data from DynamoDB → Prioritizer]
Test Prioritization
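Scheduling the longest tests first is classic longest-processing-time (LPT) bin balancing. Here is a minimal sketch of that idea; the {test_name: runtime} dictionary stands in for the data fetched from DynamoDB, whose table layout is not shown in the deck.

```python
import heapq

def bundle_tests(test_durations, num_bundles):
    """Greedy longest-first (LPT) bundling sketch.

    test_durations: assumed {test_name: historical_runtime_seconds} mapping.
    Longest tests are assigned first, each to the currently least-loaded
    bundle, which keeps bundle runtimes balanced.
    """
    # Min-heap of (total_runtime, bundle_index) so we always top up the
    # emptiest bundle.
    heap = [(0.0, i) for i in range(num_bundles)]
    bundles = [[] for _ in range(num_bundles)]
    for test, duration in sorted(test_durations.items(),
                                 key=lambda kv: kv[1], reverse=True):
        total, idx = heapq.heappop(heap)
        bundles[idx].append(test)
        heapq.heappush(heap, (total + duration, idx))
    return bundles
```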
21. • Run ~350 seagull-runs/day:
• each run ~70000 tests (~ 25 million tests/day)
• total serial time of 48 hours/run
• Challenging to run lots of tests at scale during peak times
[Chart: runs submitted per 10 mins, showing the daily peak]
The Testing Problem
22. • Resource management system
• mesos-master: on the master node
• mesos-slave: on every slave
• Slaves register their resources with the Mesos master
• Schedulers subscribe to the Mesos master to consume resources
• The master offers resources to schedulers in a fair manner
[Diagram: Mesos master offers resources from the slaves to Scheduler 1 and Scheduler 2]
Apache Mesos
23. Seagull leverages the resource management abilities of Apache Mesos
• Each run has its own Mesos scheduler
• Each scheduler distributes work amongst ~600 workers (executors); see the offer-handling sketch after this slide
• 200 r3.8xlarge instances (32 cores / 256 GB)
Running Tests in Parallel
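The talk stays at the architecture level, so here is a framework-agnostic sketch of what a per-run scheduler's offer handling might look like: pack pending test bundles onto the resources each Mesos offer provides, and give unused offers back. The offer shape, resource sizes, and bundle queue are assumptions for illustration, not Seagull's real API.

```python
def handle_offers(offers, pending_bundles, cpus_per_executor=2, mem_per_executor=4096):
    """Sketch of a scheduler reacting to Mesos resource offers.

    offers: assumed list of dicts like {'slave': 's1', 'cpus': 32, 'mem': 262144}.
    pending_bundles: test bundles still waiting for an executor.
    Returns (launches, declines); each launch pairs a bundle with a slave.
    """
    launches, declines = [], []
    for offer in offers:
        cpus, mem = offer['cpus'], offer['mem']
        used = False
        # Pack as many executors onto this offer as its resources allow.
        while pending_bundles and cpus >= cpus_per_executor and mem >= mem_per_executor:
            launches.append((pending_bundles.pop(), offer['slave']))
            cpus -= cpus_per_executor
            mem -= mem_per_executor
            used = True
        if not used:
            declines.append(offer)  # hand the offer back to Mesos
    return launches, declines
```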
24. [Legend: Test (color coded per scheduler) · Set of tests (bundle) · C1 = Scheduler C1 · S1 = Slave 1]
Terminology (Key)
27. • Each executor needs to have the artifact before running tests
• 18,000 requests per hour at peak
• Each request is for a large file (hundreds of MBs)
• A single executor (out of 600) taking too long to download could delay the entire seagull-run
Why Is Artifact Download Critical?
30. [Legend: Test · Set of tests (bundle) · C1 = Scheduler C1 · S1 = Slave 1 · A1 = artifact for Scheduler C1 · Exec C1 = executor of Scheduler C1]
Terminology (Key)
31. • Scheduler C1 starts and distributes work amongst 600 executors
• Each executor (a.k.a. task):
• has its own copy of the artifact (independent)
• runs for ~10 mins on average
• Each slave runs 15 executors (C1 uses a total of 40 slaves)
• 200 slaves × 15 executors × 6 (10-min tasks per hour) = 18,000 reqs/hr! (13.5 TB/hour)
[Diagram: S3 → Seagull cluster; on each of Slave 1 … Slave 40, every Exec C1 downloads artifact A1 directly from S3]
Artifact Handling
32. • Lots of requests took as long as 30 mins!
• We choked the NAT boxes with tons of requests
• Avoiding NAT would have required a bigger effort
• We wanted a quick solution
Slow Download Times
33. • Executors from the same scheduler can share artifacts
• Disadvantages:
• Executors are no longer independent
• Requires a locking implementation for downloading artifacts
Still doesn't scale well
[Diagram: S3 → Seagull cluster; executors on each slave (Slave 1 … Slave 40) share one copy of their scheduler's artifact (A1 for C1, A2 for C2)]
Sharing Artifacts
34. • Artifactcache consisting of 9 r3.8xlarges
• Replicate each artifact across all 9 artifact caches
• Nginx distributes requests
• 10 Gbps network bandwidth helped
[Diagram: separate Artifactcache tier serving A1 and A2 to the executors on Slave 1 … Slave 40 of the Seagull cluster]
Separate Artifactcache
35. [Charts: number of active schedulers per 10 min; download time (secs) per 10 min]
Artifact Download Metrics
36. • Why not use the network bandwidth we already have on our Amazon EC2 compute nodes?
• The entire cluster serves as the artifactcache
• The cache scales as the cluster scales
• Bandwidth comparison:
• Centralized cache: ~30 Mbps/executor [9 caches × 10 Gbps / 3000 executors]
• Distributed cache: ~666 Mbps/executor [200 caches × 10 Gbps / 3000 executors]
Distributed Artifactcache
37. [Diagram: a Random Selector picks a slave from the Seagull cluster; each slave (Slave 1–4) keeps an Artifact Pool, and A1 is served from a peer's pool]
Benefits of distributed artifact caching:
• Very scalable
• No extra machines to maintain
• Significant reduction in out-of-space disk issues
• Fast downloads due to less contention
Distributed Artifact Caching
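The deck doesn't show the selection logic, so here is a minimal sketch of the idea, assuming a registry that maps each artifact to the slaves already holding it. The registry interface, peer URL scheme, port, paths, and the S3 fallback bucket are all illustrative assumptions.

```python
import random
import socket
import subprocess

def fetch_artifact(artifact_id, registry, local_pool="/var/seagull/artifacts"):
    """Sketch of distributed artifact caching with a random peer selector.

    registry: assumed mapping {artifact_id: [slave hostnames that have it]}.
    If a peer already holds the artifact, download from a randomly chosen
    peer; otherwise fall back to S3. URLs and paths are illustrative.
    """
    dest = "%s/%s.tar.gz" % (local_pool, artifact_id)
    peers = registry.get(artifact_id, [])
    if peers:
        peer = random.choice(peers)                                  # random selector
        url = "http://%s:8080/artifacts/%s.tar.gz" % (peer, artifact_id)
    else:
        url = "https://example-bucket.s3.amazonaws.com/%s.tar.gz" % artifact_id
    subprocess.check_call(["curl", "-sf", "-o", dest, url])
    registry.setdefault(artifact_id, []).append(socket.gethostname())  # we can serve it now
    return dest
```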
38. [Charts: artifact download time (secs) per 10 min; number of downloads per 10 mins]
Can we improve download times further?
Distributed Artifactcache Performance
39. • At peak times:
• Lots of downloads happen
• Most artifacts end up being downloaded on 90% of the slaves
• Once a machine downloads an artifact, it should serve other requests for that artifact
• Disadvantage: bookkeeping
Stealing Artifact
40. 1. Slave 4 gets A2
2. Bundle starts on Slave 2
3. Slave 2 pulls A2 from Slave 4
4. Bundles start on Slave 1 & 3
5a. Slave 3 gets A2 from Slave 4
5b. Slave 1 steals A2 from Slave 2
[Diagram: Random Selector over the Seagull cluster; Slaves 1–4 each end up with A2 in their Artifact Pool while running Exec C2, with a steal arrow from Slave 2 to Slave 1]
Stealing Artifact
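The "bookkeeping" disadvantage mentioned above is essentially a shared registry of who holds what. As a rough sketch (not Seagull's actual code; the deck does not say where this state lives), a thread-safe map works for illustration:

```python
import threading

class ArtifactRegistry:
    """Sketch of the bookkeeping behind artifact stealing.

    Maps an artifact id to the slaves that already hold it. In practice this
    state would live in a shared store; an in-process dict keeps the sketch small.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._holders = {}            # artifact_id -> set of slave hostnames

    def holders(self, artifact_id):
        with self._lock:
            return sorted(self._holders.get(artifact_id, ()))

    def register(self, artifact_id, slave):
        # Called as soon as a slave finishes downloading, so later bundles
        # can steal the artifact from it instead of hitting S3 again.
        with self._lock:
            self._holders.setdefault(artifact_id, set()).add(slave)
```

A fetch routine like the one sketched after the previous slide would consult holders() before falling back to S3, and call register() once its own download completes.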
41. [Charts: artifact steal time per 10 min; number of steals per 10 min]
Performance: Stealing in Distributed Artifact Caching
44. • Used the Auto Scaling group provided by AWS, but it wasn't easy to ‘select’ which instances to terminate
• Mesos uses FIFO to assign work, and Auto Scaling also uses FIFO to terminate instances
• Example: 10% of slaves are working -> remove 10% of the cluster -> the slaves doing the work get terminated
[Chart: runs submitted per 10 mins]
Auto Scaling
45. • CPU and memory demand is volatile
• Seagull tells Mesos to reserve the maximum amount of memory a task requires (call it M)
• Total memory required to run a set T of tasks concurrently: |T| × M
Reserved Memory
46. • Total available memory for slave i: A_i
• Let S denote the set of all slaves in our cluster
• Total available memory: sum of A_i over all slaves i in S
• Gull-load: ratio of total reserved memory to total available memory, i.e. GL = (|T| × M) / (sum of A_i)
Gull-load
48. Auto scaling loop, invoked every 10 mins:
• Calculate the gull-load for each machine
• If gull-load > 0.9: add 10% extra machines
• If gull-load < 0.5: sort on gull-load, select the slaves with the least gull-load (10%), and terminate them
• If 0.5 < gull-load < 0.9: do nothing
Gull-load (GL) Action (# slaves)
0.5 < GL < 0.9 Nothing
GL > 0.9 Add 10%
GL < 0.5 Remove 10%
How Do We Scale Automatically?
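Putting the gull-load definition and the thresholds from the table together, one pass of the 10-minute loop might look like the sketch below. The memory figures are assumed to come as per-slave mappings; the real data source is not shown in the deck.

```python
def gull_load(reserved_by_slave, available_by_slave):
    """Gull-load: total reserved memory / total available memory.

    Both arguments are assumed {slave: memory_in_MB} mappings.
    """
    return sum(reserved_by_slave.values()) / sum(available_by_slave.values())

def autoscale_step(reserved_by_slave, available_by_slave):
    """One pass of the 10-minute scaling loop described above."""
    gl = gull_load(reserved_by_slave, available_by_slave)
    if gl > 0.9:
        return ("add", None)                              # add 10% extra machines
    if gl < 0.5:
        # Terminate the 10% of slaves carrying the least reserved work.
        per_slave = {
            s: reserved_by_slave.get(s, 0) / available_by_slave[s]
            for s in available_by_slave
        }
        idle_first = sorted(per_slave, key=per_slave.get)
        return ("remove", idle_first[:max(1, len(idle_first) // 10)])
    return ("nothing", None)
```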
49. • Started with all Reserved Instances. Too expensive!
• Shifted to all Spot. We always knew it was risky…
• One fine day, all our slaves were gone!
• Now we run a mix of On-Demand (25%) and Spot (75%) instances
Reserved, On-Demand, or Spot Instances?
50. Seagull provides fault tolerance at two levels:
• Hardware level: spreading our machines geographically (preventive)
• Infrastructure level: Seagull retries upon failure (corrective)
Fault Tolerance and Reliability
51. • Machines are divided roughly equally amongst AZs
• us-west-2: a => 60, b => 66, c => 66
• Easy to terminate a slave and recreate it quickly
• In the event of losing Spot Instances:
• Our seagull-runs keep running on the On-Demand instances
• We add On-Demand instances until Spot Instances are available again (manual)
Preventive Fault Tolerance (Reliability)
52. • Lots of reasons for executors to fail:
• A bad service build
• Docker problems (>100 concurrent containers/machine)
• External partners (e.g., Sauce Labs)
• How we handle it:
• A Task Manager (inside the scheduler) tracks the life cycle of each executor/task
• A fixed number of retries upon failure/timeout (see the sketch after this slide)
Corrective Fault Tolerance
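As a minimal sketch of fixed-retry handling (not Seagull's actual Task Manager; the launch_task callable, retry count, and timeout are assumptions):

```python
import time

def run_with_retries(launch_task, task, max_retries=3, timeout_s=1800):
    """Retry a failed or timed-out executor a fixed number of times.

    launch_task(task, timeout_s) is an assumed callable that returns True on
    success; the retry budget and timeout here are illustrative.
    """
    for attempt in range(1, max_retries + 1):
        started = time.time()
        try:
            if launch_task(task, timeout_s):
                return True
        except Exception:
            pass                                   # bad build, Docker issues, partner outage, ...
        elapsed = time.time() - started
        print("task %s attempt %d failed after %.0fs" % (task, attempt, elapsed))
    return False                                   # give up after the fixed retry budget
```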
53. [Legend: Test · Set of tests (bundle) · C1 = Scheduler C1 · S1 = Slave 1 · A1 = artifact for Scheduler C1 · Exec C1 = executor of Scheduler C1 · Task Manager = tracks the life cycle of each task (queued, running, finished)]
Terminology (Key)
55. • How Seagull works and interacts with other systems
• An extremely efficient artifact hosting design
• Custom scaling policy and its use of gull-load
• Fault tolerance at scale using:
• AWS
• Executor retry logic
What Did We Learn?
56. • Sanitize code for open source
• Explore why Amazon S3 downloads are so slow
• Avoiding the NAT boxes
• Using multiple buckets
• Breaking our artifact into smaller files
• Improve scaling:
• Ability to use other instance types
• Reduce cost by choosing Spot instance types with the lowest cost per GB
Future Work