
Jolt: Distributed, fault-tolerant test running at scale using Mesos

In this presentation, Kyle Kelly, Sunil Shah, and Timmy Zhu present Jolt, a system that aims to solve the problem of running integration tests with a high fixed resource cost at scale. We use Mesos plus a custom open source framework called Task Processing to run integration tests for the Yelp website in a massively parallel manner, taking test runs down from a few days to less than an hour.


  1. Jolt
  2. Who We Are • Kyle Kelly (kkelly@), Release Engineering • Sunil Shah (sunil@), Distributed Systems • Timmy Zhu (tzhu@), Release Engineering
  3. Release Engineering at Yelp • Focus on maximizing engineering productivity • Provide review, build, and test infrastructure for developers at Yelp
  4. Yelp’s Mission: Connecting people with great local businesses.
  5. Yelp scale
  6. Why? • Yelp runs a lot of tests • The legacy monolith has 85,000+ tests • Other services have thousands of tests too • Deployments require running all tests
  7. Why? • Parallelizing test runs saves significant developer time • Allows us to push new versions of Yelp.com multiple times a day with confidence
  8. Why? We already have a working system: Seagull • ~350 test runs every day, with an average run time of ~10-15 mins • ~2.5 million ephemeral containers every day • Cluster scales from ~70 spot instances to ~450 spot instances • ~25 million tests executed every day
  9. Why? • Seagull was unnecessarily complex • Custom executor • Custom artifact management • Hard to reuse for other services’ tests • Built primarily to run yelp-main tests
  10. Features • Split tests into "bundles" of desired duration • Further grouped by runtime environment requirements • Bundles run on Mesos • Retry on unexpected failures
  11. Bundling • User specifies a target bundle execution time • Bin-pack tests based on estimated duration • Uses a rolling historical average • Reports task durations after every Jolt run
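A minimal sketch of the bin-packing step described above, assuming a greedy first-fit-decreasing pack against the user's target bundle duration (the function name and data shapes are illustrative, not the actual Jolt code):

```python
def bundle_tests(tests, target_duration):
    """Greedily pack tests into bundles whose estimated total run time
    stays near the target duration (first-fit decreasing).

    tests: list of (test_name, estimated_seconds) pairs, where each
    estimate is a rolling historical average of past runs."""
    bundles = []  # each bundle: {"tests": [...], "duration": float}
    for name, estimate in sorted(tests, key=lambda t: t[1], reverse=True):
        for bundle in bundles:
            if bundle["duration"] + estimate <= target_duration:
                bundle["tests"].append(name)
                bundle["duration"] += estimate
                break
        else:
            # No existing bundle has room (or the test alone exceeds the
            # target), so it starts a new bundle.
            bundles.append({"tests": [name], "duration": estimate})
    return bundles

# e.g. bundle_tests([("test_a", 120), ("test_b", 200), ("test_c", 90)],
#                   target_duration=300)
```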
  12. Example Invocation: jolt test_runner.sh tests.list --artifact=minified.tar.gz --project=yelp-main --bundle-retries=3 --target-bundle-duration=300 --results=results.list --env TR_RUN_ID ymjkkelly-1509390913
  13. Other Supporting Infrastructure • Elasticsearch: storage and retrieval of test durations • Test results: distributed reporting & summarization, collected via a Kafka stream and indexed in Elasticsearch, viewable/queryable via a web application • Autoscaling hosts via Clusterman
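As a rough illustration of the duration store, here is how per-test durations might be written to and averaged from Elasticsearch with the elasticsearch Python client (the index name, document fields, window size, and mappings are assumptions for this sketch, not Yelp's actual schema):

```python
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # assumed local cluster
INDEX = "jolt-test-durations"                  # hypothetical index name

def report_duration(test_name, seconds, run_id):
    """Index one test's observed duration after a Jolt run."""
    es.index(index=INDEX, body={
        "test": test_name,              # assumed keyword-mapped field
        "duration_seconds": seconds,
        "run_id": run_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })

def estimated_duration(test_name, default=60.0):
    """Rolling average over the most recent observations for a test."""
    resp = es.search(index=INDEX, body={
        "size": 20,
        "sort": [{"timestamp": {"order": "desc"}}],
        "query": {"term": {"test": test_name}},
    })
    hits = resp["hits"]["hits"]
    if not hits:
        return default
    return sum(h["_source"]["duration_seconds"] for h in hits) / len(hits)
```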
  14. Task Processing • Jolt doesn’t implement an entire Mesos framework the way Seagull does • Task Processing is an open source Python library • Uses the HTTP scheduler API via PyMesos • Intended to be composable • Basis for both Jolt and for running scheduled jobs using Tron
  15. Task Processing • Generic TaskExecutor interface: run, kill, stop, get_event_queue • (i.e. users mostly shouldn’t care that they are using Mesos)
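A rough sketch of what that generic interface might look like in Python (the method names come from the slide; the signatures, types, and docstrings are assumptions rather than the actual task_processing API):

```python
from abc import ABC, abstractmethod
from queue import Queue

class TaskExecutor(ABC):
    """Generic executor interface: callers submit tasks and consume
    lifecycle events without knowing which backend (e.g. Mesos) runs them."""

    @abstractmethod
    def run(self, task_config):
        """Submit a task for execution and return a task identifier."""

    @abstractmethod
    def kill(self, task_id):
        """Terminate a single running task."""

    @abstractmethod
    def stop(self):
        """Shut the executor down and release backend resources."""

    @abstractmethod
    def get_event_queue(self) -> Queue:
        """Queue of task lifecycle events (started, finished, failed, ...)."""
```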
  16. Task Processing • Implementations are composable • We offer a few different types: MesosExecutor, RetryingExecutor, TimeoutExecutor
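One plausible way such executors compose is by wrapping one another behind the same interface; the sketch below shows a simplified RetryingExecutor delegating to a downstream executor (constructor arguments and internals are illustrative assumptions; only the class names come from the slide):

```python
class RetryingExecutor(TaskExecutor):
    """Wraps another TaskExecutor and resubmits tasks that fail unexpectedly."""

    def __init__(self, downstream, retries=3):
        self.downstream = downstream
        self.retries = retries

    def run(self, task_config):
        # A real implementation would watch the downstream event queue
        # and call self.downstream.run() again on unexpected failure,
        # up to self.retries attempts.
        return self.downstream.run(task_config)

    def kill(self, task_id):
        return self.downstream.kill(task_id)

    def stop(self):
        return self.downstream.stop()

    def get_event_queue(self):
        return self.downstream.get_event_queue()

# Composition: timeouts and retries layered over the Mesos backend, e.g.
# executor = TimeoutExecutor(RetryingExecutor(MesosExecutor(...), retries=3),
#                            timeout=300)
```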
  17. Loads are cyclical (chart: daily spend over weekdays vs. weekends)
  18. Loads are bursty (chart: load spikes around the Euro code push, US office hours, and lunch time)
  19. Clusterman: As part of Jolt, we’re building a next generation autoscaler (Clusterman) that does two things: 1. Autoscaling of a pool of Mesos agents 2. Simulations based on changing Spot Fleet Requests
  20-22. How spot pricing works (diagrams) • Users bid for Amazon’s spare capacity • The lowest winning bid is the price paid • The diagrams step through a fixed pool of used and available instances as bids from $1 to $4 compete, showing how the clearing price shifts as capacity fills up
  23. Clusterman: Spot Fleet Requests • Spot Fleet Requests allow us to request a certain number of spot instances simultaneously • Diversification via availability zone • Diversification via instance type • Simulating how we might fare if we changed our bid prices helps us understand instance churn
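For illustration, a diversified Spot Fleet Request could be created with boto3 roughly like this (the AMI, instance types, availability zones, role ARN, capacity, and bid are placeholders; this sketches the AWS API call, not Yelp's configuration):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")  # assumed region

# Diversify the fleet across instance types and availability zones.
launch_specs = [
    {"ImageId": "ami-12345678", "InstanceType": itype,
     "Placement": {"AvailabilityZone": az}}
    for itype in ("m4.4xlarge", "c4.4xlarge")
    for az in ("us-west-2a", "us-west-2b", "us-west-2c")
]

response = ec2.request_spot_fleet(SpotFleetRequestConfig={
    "IamFleetRole": "arn:aws:iam::123456789012:role/spot-fleet-role",
    "AllocationStrategy": "diversified",   # spread across the specs above
    "TargetCapacity": 100,
    "SpotPrice": "0.50",                   # per-unit bid ceiling
    "LaunchSpecifications": launch_specs,
})
print(response["SpotFleetRequestId"])
```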
  24. Clusterman: Signals • Right now we autoscale based on two signals: CPU utilisation (e.g. scale up if utilisation > 65% for 15 min, scale down if utilisation < 35% for 30 min) and test runs in-flight • We also have the option to act on additional signals, for example predicted load
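A toy sketch of the CPU-utilisation rule above (the thresholds and windows come from the slide; the sample format and decision interface are assumptions for illustration):

```python
import time

SCALE_UP_THRESHOLD = 0.65    # scale up if utilisation > 65%...
SCALE_UP_WINDOW = 15 * 60    # ...sustained for 15 minutes
SCALE_DOWN_THRESHOLD = 0.35  # scale down if utilisation < 35%...
SCALE_DOWN_WINDOW = 30 * 60  # ...sustained for 30 minutes

def scaling_decision(samples, now=None):
    """samples: list of (unix_timestamp, cpu_utilisation) for the pool.
    Returns +1 (scale up), -1 (scale down), or 0 (hold)."""
    now = now if now is not None else time.time()

    def sustained(threshold, window, above):
        recent = [u for t, u in samples if now - t <= window]
        if not recent:
            return False
        return all(u > threshold for u in recent) if above else \
               all(u < threshold for u in recent)

    if sustained(SCALE_UP_THRESHOLD, SCALE_UP_WINDOW, above=True):
        return +1
    if sustained(SCALE_DOWN_THRESHOLD, SCALE_DOWN_WINDOW, above=False):
        return -1
    return 0
```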
  25. Clusterman: Instance termination • AWS Spot Fleet does not allow us to specify which instances to terminate • Clusterman finds and terminates idle instances itself, then readjusts the Spot Fleet capacity
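A hedged sketch of that terminate-then-resize dance using boto3 (the idle-instance selection, fleet request ID, and capacity accounting are placeholders; the ordering shown is one plausible approach, not necessarily Clusterman's exact logic):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")  # assumed region

def scale_down(fleet_request_id, idle_instance_ids, current_capacity):
    """Remove specific idle instances from a Spot Fleet.

    Spot Fleet won't let us pick which instances it terminates, so we
    first shrink the target capacity while telling the fleet not to
    terminate anything itself, then terminate our chosen instances."""
    new_capacity = current_capacity - len(idle_instance_ids)

    # 1. Lower the fleet's target capacity without letting it pick victims.
    ec2.modify_spot_fleet_request(
        SpotFleetRequestId=fleet_request_id,
        TargetCapacity=new_capacity,
        ExcessCapacityTerminationPolicy="noTermination",
    )

    # 2. Terminate the idle instances we selected ourselves.
    ec2.terminate_instances(InstanceIds=idle_instance_ids)
```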
  26. Cost savings (chart: integration testing infrastructure cost) • 55% reduction in costs after the initial transition to spot instances • An additional 60% savings after the transition to spot + autoscaling was complete
  27. Challenges: Scaling issues • The Mesos HTTP API is considerably less performant than the Protobuf API • We hit HTTP API timeouts in production when running hundreds of applications on Marathon and fewer than 10 HTTP API schedulers
  28. Challenges: Defensive maintenance • Yelp-main tests are not fully containerised yet • We need to perform defensive cluster maintenance and health checking to guard against bad actors
  29. Challenges: Defensive maintenance (diagram: docker-reaper executor) • The executor creates a new Unix socket and sets $DOCKER_HOST to that socket before fork-execing the child process • Container-create API calls from the child pass through the reaper, which stores the returned container IDs • The stored IDs let the reaper remove the containers afterwards
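The slide describes intercepting Docker API calls through a proxy socket; as a much simpler illustration of the same reaping idea, the sketch below tags a run with an ID and removes any labelled containers that outlive the test process, using the docker Python SDK (the label name, environment variable, and label-based approach are assumptions, not the actual docker-reaper implementation):

```python
import os
import subprocess
import uuid

import docker  # docker Python SDK

def run_with_reaper(cmd):
    """Run a test command, then remove any containers it left behind.

    Containers started by the command are assumed to carry the run-ID
    label (passed here via an environment variable the test harness
    would have to honour) so the reaper can find them afterwards."""
    run_id = str(uuid.uuid4())
    client = docker.from_env()
    env = {**os.environ, "JOLT_RUN_ID": run_id}

    try:
        subprocess.run(cmd, env=env, check=False)
    finally:
        # Reap: remove every container labelled with this run's ID.
        leftovers = client.containers.list(
            all=True, filters={"label": f"jolt.run_id={run_id}"}
        )
        for container in leftovers:
            container.remove(force=True)
```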
  30. Future Work • Mitigating setup and teardown time • Bidirectional communication between framework and executors • Cluster-wide resources
  31. Demo Link
  32. We are hiring ● Engineers or managers with dist-sys experience: ○ Strong knowledge of systems and application design ○ Ability to work closely with information retrieval/machine learning experts on big-data problems ○ Strong understanding of operating systems, file systems and networking ○ Fluency in Python, C, C++, Java, or a similar language ○ Technologies we use: Mesos, Marathon, Docker, ZooKeeper, Kafka, Cassandra, Flink, Spark, Elasticsearch ● Apply at https://www.yelp.com/careers or come say hi! (Europe / San Francisco)
  33. @YelpEngineering • fb.com/YelpEngineers • engineeringblog.yelp.com • github.com/yelp
