Jolt: Distributed, fault-tolerant test running at scale using Mesos

In this presentation, Kyle Kelly, Sunil Shah and Timmy Zhu present Jolt, a system that aims to solve the problem of running integration tests with a high fixed resource cost at scale. We use Mesos plus a custom open source framework called Task Processing to run integration tests for the Yelp website in a massively parallel manner - taking test runs down from a few days to less than an hour.

  1. Jolt
  2. Who We Are
     • Kyle Kelly (kkelly@), Release Engineering
     • Sunil Shah (sunil@), Distributed Systems
     • Timmy Zhu (tzhu@), Release Engineering
  3. Release Engineering at Yelp
     • Focus on maximizing engineering productivity
     • Provide review, build, and test infrastructure for developers at Yelp
  4. Yelp’s Mission
     Connecting people with great local businesses.
  5. Yelp scale
  6. Why?
     • Yelp runs a lot of tests
     • The legacy monolith has 85,000+ tests
     • Other services have thousands of tests too
     • Deployments require running all tests
  7. Why?
     • Parallelizing test runs saves significant developer time
     • Allows us to push new versions of Yelp.com multiple times a day with confidence
  8. Why?
     We already have a working system: Seagull
     • ~350 test runs every day. Average run time ~10-15 mins.
     • ~2.5 million ephemeral containers every day.
     • Cluster scales from ~70 spot instances to ~450 spot instances.
     • ~25 million tests executed every day.
  9. Why?
     • Seagull was unnecessarily complex
     • Custom executor
     • Custom artifact management
     • Hard to reuse for other services’ tests
     • Built primarily to run yelp-main tests
  10. Features
      • Split tests into "bundles" of desired duration
      • Further grouped by runtime environment requirements
      • Bundles run on Mesos
      • Retry on unexpected failures
  11. Bundling
      • User specifies a target bundle execution time
      • Bin pack tests based on estimated duration
      • Uses a rolling historical average
      • Reports task durations after every Jolt run
  12. Example Invocation
      jolt test_runner.sh tests.list --artifact=minified.tar.gz --project=yelp-main --bundle-retries=3 --target-bundle-duration=300 --results=results.list --env TR_RUN_ID ymjkkelly-1509390913
  13. Other Supporting Infrastructure
      • Elasticsearch
      • Storage and retrieval of test durations
      • Test Results
      • Distributed reporting & summarization
      • Collected via a Kafka stream and indexed in Elasticsearch
      • Viewable/queryable via web application
      • Autoscaling hosts via Clusterman
  14. Task Processing
      • Jolt isn’t implementing an entire Mesos framework like Seagull does
      • Task Processing is an open source Python library
      • Uses the HTTP scheduler API via PyMesos
      • Intended to be composable
      • Basis for both Jolt and for running scheduled jobs using Tron
  15. Task Processing
      Generic TaskExecutor interface
      • run
      • kill
      • stop
      • get_event_queue
      • (i.e. users mostly shouldn’t care that they are using Mesos)
  16. Task Processing
      • Implementations are composable
      • We offer a few different types:
      • MesosExecutor
      • RetryingExecutor
      • TimeoutExecutor
  17. Loads are cyclical
      [Figure: load over time, labeled Weekend, Weekdays, Weekend, with $ cost markers]
  18. Loads are bursty
      [Figure: load over a day, labeled Euro code push, US office hours, Lunch time]
  19. Clusterman
      As part of Jolt, we’re building a next generation autoscaler (Clusterman) that does two things:
      1. Autoscaling of a pool of Mesos agents
      2. Simulations based on changing Spot Fleet Requests
  20. - Users bid for Amazon’s spare capacity
      - Lowest winning bid is the $$ paid
      [Diagram] Capacity: Used, Used, Used, Available, Available, Available, Available
      [Diagram] Bids: User A - $4, User A - $4, User B - $3, User C - $2, User C - $2, User D - $1, User D - $1, User D - $1
  21. - Users bid for Amazon’s spare capacity
      - Lowest winning bid is the $$ paid
      [Diagram] Capacity: Used, Used, Used, User A - $2, User A - $2, User B - $2, User C - $2
      [Diagram] Bids: User A - $4, User A - $4, User B - $3, User C - $2, User C - $2, User D - $1, User D - $1, User D - $1
  22. - Users bid for Amazon’s spare capacity
      - Lowest winning bid is the $$ paid
      [Diagram] Capacity: Used, Used, Used, User A - $3, User A - $3, User B - $3, User B - $3
      [Diagram] Bids: User A - $4, User A - $4, User B - $3, User B - $3, User C - $2, User C - $2, User D - $1, User D - $1
  23. Spot Fleet Requests (Clusterman)
      Spot Fleet Requests allow us to request a certain number of spot instances simultaneously:
      • Diversification via availability zone
      • Diversification via instance type
      Simulating how we might do after changing our bid prices helps us understand instance churn.
  24. Signals (Clusterman)
      Right now we autoscale based on two signals:
      • CPU utilisation
      • e.g. scale up if utilisation > 65% for 15 min, scale down if utilisation < 35% for 30 min
      • Test runs in-flight
      We can also operate on additional signals, for example:
      • Predicted load
  25. Instance termination (Clusterman)
      • AWS Spot Fleet does not allow us to specify which instances to terminate.
      • Clusterman finds and terminates the idle instances, and readjusts the Spot Fleet capacity.
  26. Cost savings
      [Chart: Integration Testing Infrastructure Cost]
      • 55% reduction in costs after initial transition to spot instances
      • Additional 60% savings after transition to spot+autoscaling complete
  27. Scaling issues (Challenges)
      • Mesos HTTP API is considerably less performant than Protobuf API.
      • HTTP API timeouts in production when running hundreds of applications on Marathon and fewer than 10 HTTP API schedulers.
  28. Defensive maintenance (Challenges)
      • Yelp-main tests are not fully containerised yet
      • Necessary to perform defensive cluster maintenance/healthiness in order to guard against bad actors.
  29. Defensive maintenance (Challenges)
      [Diagram: docker-reaper and Executor. Labels: "Creates a new Unix socket and sets $DOCKER_HOST to that socket."; Child process; Fork-exec; Create container API call; Create container API call; Container ID; Stores the container ID; Remove Container]
  30. Future Work
      • Mitigating setup and teardown time
      • Bidirectional communication between framework and executors
      • Cluster-wide resources
  31. Demo
      Link
  32. We are hiring
      ● Engineers or managers with dist-sys experience:
        ○ Strong knowledge of systems and application design.
        ○ Ability to work closely with information retrieval/machine learning experts on big-data problems.
        ○ Strong understanding of operating systems, file systems and networking.
        ○ Fluency in Python, C, C++, Java, or a similar language.
        ○ Technologies we use: Mesos, Marathon, Docker, ZooKeeper, Kafka, Cassandra, Flink, Spark, Elasticsearch
      Apply at https://www.yelp.com/careers or come say hi!
      Europe / San Francisco
  33. @YelpEngineering
      fb.com/YelpEngineers
      engineeringblog.yelp.com
      github.com/yelp
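
Slide 11 says tests are bin packed into bundles using estimated durations and a user-supplied target bundle time. The slides do not say which packing heuristic Jolt uses, so the following is only a minimal sketch using first-fit decreasing, a common greedy choice. The function name `bundle_tests` and the test names are hypothetical; the 300-second target mirrors `--target-bundle-duration=300` from slide 12.

```python
from typing import Dict, List


def bundle_tests(durations: Dict[str, float], target_seconds: float) -> List[List[str]]:
    """Greedy first-fit-decreasing packing of tests into bundles.

    Assumption: this heuristic is illustrative only; the slides do not
    describe Jolt's actual packing algorithm.
    """
    bundles: List[List[str]] = []
    totals: List[float] = []
    # Place the longest tests first; break ties by name so output is deterministic.
    for name, secs in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        for i, total in enumerate(totals):
            if total + secs <= target_seconds:
                bundles[i].append(name)
                totals[i] += secs
                break
        else:
            # No existing bundle has room (or the test alone exceeds the target).
            bundles.append([name])
            totals.append(secs)
    return bundles


# Hypothetical estimated durations in seconds.
estimates = {"test_a": 200, "test_b": 150, "test_c": 100, "test_d": 100, "test_e": 50}
bundles = bundle_tests(estimates, target_seconds=300)
```

A test whose estimate exceeds the target still gets its own bundle, so no test is dropped.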
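
Slides 15 and 16 describe a generic TaskExecutor interface (run, kill, stop, get_event_queue) whose implementations compose, for example a RetryingExecutor wrapped around another executor. The sketch below shows that composition idea in a toy, synchronous form. It is not the Task Processing library's API: the method signatures, the `(task_id, status)` event tuples, and the `FlakyExecutor` stand-in are all assumptions made for illustration.

```python
import queue
from abc import ABC, abstractmethod


class TaskExecutor(ABC):
    """Method names come from slide 15; the signatures are assumed."""

    @abstractmethod
    def run(self, task_id: str) -> None: ...

    @abstractmethod
    def kill(self, task_id: str) -> None: ...

    @abstractmethod
    def stop(self) -> None: ...

    @abstractmethod
    def get_event_queue(self) -> "queue.Queue": ...


class FlakyExecutor(TaskExecutor):
    """Hypothetical backend: each task fails a configured number of times."""

    def __init__(self, failures_before_success):
        self.remaining = dict(failures_before_success)
        self.events = queue.Queue()
        self.attempts = 0

    def run(self, task_id):
        self.attempts += 1
        if self.remaining.get(task_id, 0) > 0:
            self.remaining[task_id] -= 1
            self.events.put((task_id, "failed"))
        else:
            self.events.put((task_id, "finished"))

    def kill(self, task_id):
        pass

    def stop(self):
        pass

    def get_event_queue(self):
        return self.events


class RetryingExecutor(TaskExecutor):
    """Wraps any TaskExecutor and re-runs a task until it finishes or retries run out."""

    def __init__(self, downstream: TaskExecutor, retries: int):
        self.downstream = downstream
        self.retries = retries
        self.events = queue.Queue()

    def run(self, task_id):
        source = self.downstream.get_event_queue()
        for _ in range(self.retries + 1):
            self.downstream.run(task_id)
            _, status = source.get()
            if status == "finished":
                break
        # Callers see only the final outcome, not the intermediate failures.
        self.events.put((task_id, status))

    def kill(self, task_id):
        self.downstream.kill(task_id)

    def stop(self):
        self.downstream.stop()

    def get_event_queue(self):
        return self.events


backend = FlakyExecutor({"bundle-1": 2})
executor = RetryingExecutor(backend, retries=3)
executor.run("bundle-1")
final_event = executor.get_event_queue().get_nowait()
```

Because the wrapper exposes the same interface as the executor it wraps, callers do not need to know which backend, or how many layers, sit underneath.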
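
Slides 20 to 22 illustrate the spot pricing rule "lowest winning bid is the $$ paid": the highest bids fill the available capacity, and every winner pays the lowest bid among them. A minimal sketch of that rule as drawn on the slides follows; the function name `clear_spot_market` is hypothetical, and the model is a simplification of the slides' diagram rather than a description of AWS's actual pricing mechanism.

```python
from typing import List, Optional, Tuple

Bid = Tuple[str, int]  # (user, bid in dollars)


def clear_spot_market(bids: List[Bid], capacity: int) -> Tuple[List[Bid], Optional[int]]:
    """Fill `capacity` slots with the highest bids; all winners pay the lowest winning bid."""
    # sorted() is stable, so equal bids keep their original order.
    ranked = sorted(bids, key=lambda bid: -bid[1])
    winners = ranked[:capacity]
    if not winners:
        return [], None
    price = winners[-1][1]
    return [(user, price) for user, _ in winners], price


# Bids from slides 20 and 21: four available slots.
slide_21_bids = [("A", 4), ("A", 4), ("B", 3), ("C", 2), ("C", 2), ("D", 1), ("D", 1), ("D", 1)]
alloc_21, price_21 = clear_spot_market(slide_21_bids, capacity=4)

# Bids from slide 22: User B adds a second $3 bid, which raises the price for everyone.
slide_22_bids = [("A", 4), ("A", 4), ("B", 3), ("B", 3), ("C", 2), ("C", 2), ("D", 1), ("D", 1)]
alloc_22, price_22 = clear_spot_market(slide_22_bids, capacity=4)
```

The second case shows why simulating bid changes matters: one extra mid-priced bid displaces User C and raises the price from $2 to $3.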
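
Slide 24 gives an example CPU rule: scale up if utilisation stays above 65% for 15 minutes, and scale down if it stays below 35% for 30 minutes. The sketch below encodes only that example rule. It assumes one utilisation sample per minute and uses a hypothetical function name; it is not Clusterman's implementation and ignores the second signal (test runs in-flight).

```python
from typing import Sequence


def scaling_decision(
    cpu_per_minute: Sequence[float],
    up_threshold: float = 0.65,
    up_window: int = 15,
    down_threshold: float = 0.35,
    down_window: int = 30,
) -> str:
    """Return "scale_up", "scale_down", or "hold" from per-minute CPU samples.

    Assumption: samples are fractions in [0, 1], one per minute, oldest first.
    """
    recent_up = cpu_per_minute[-up_window:]
    if len(recent_up) == up_window and all(u > up_threshold for u in recent_up):
        return "scale_up"
    recent_down = cpu_per_minute[-down_window:]
    if len(recent_down) == down_window and all(u < down_threshold for u in recent_down):
        return "scale_down"
    return "hold"


busy = scaling_decision([0.70] * 15)
quiet = scaling_decision([0.20] * 30)
```

The longer window for scaling down makes the rule slower to remove capacity than to add it, which suits bursty load like that shown on slide 18.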
  • accavdar

    Nov. 15, 2017
