2. Credit where credit is due
● This has been a cross-team effort
○ Development
○ QA
○ Operations
○ L3
● Lots of people have helped
● This includes management (no suckup)
3. What are background jobs?
● Tasks to be performed in the background
(duh)
● May be handed off by the web
● May be handed off by other jobs
● May be scheduled at regular intervals
● Are typically expensive
4. At PeopleAdmin backgrounding is...
● Resque (Ruby API)
● Redis (Middleware)
● Jobs are put in queues
● Workers look at queues for
work
● Workers are grouped into
pools
● We have 1 pool per worker
server
● We have many worker servers
● Resque scheduler puts jobs
into queues at their scheduled
run time
10. So what were/are the problems?
● Visibility
● Performance
● Job Contention
● Technology limitations
● Technology reliability
● Deployment interruption
● Others...
11. No Visibility
● Resque was a black box
● Operations, L3 & Development had no view
into production
● Ability to diagnose problems was limited
● Also had no way to know if we were creating
more problems
12. No Visibility
● Instrumented jobs with Splunk
● Gave us sophisticated querying ability and
graphing of results
● Gave us view into life of each job
● Allowed view into usage patterns, time in
queue, time to perform and other metrics
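One way to get a view into the life of each job is to log a key=value line at each stage, which a log search tool can then query and graph. This is a minimal sketch only: the `job_event` helper, the event names, the field names, and the sample timestamps are all hypothetical and not the instrumentation actually used.

```ruby
require "time"

# Hypothetical helper: formats one job lifecycle event as key=value pairs,
# a shape that log search tools can usually extract fields from.
def job_event(event, job_id:, queue:, at:, **extra)
  fields = { event: event, job_id: job_id, queue: queue, at: at.utc.iso8601 }.merge(extra)
  fields.map { |key, value| "#{key}=#{value}" }.join(" ")
end

# Made-up sample timestamps for illustration.
enqueued_at = Time.utc(2013, 1, 1, 12, 0, 0)
started_at  = enqueued_at + 4   # picked up by a worker 4 seconds later
finished_at = started_at + 9    # performed in 9 seconds

lines = [
  job_event("enqueued", job_id: "abc123", queue: "emails", at: enqueued_at),
  job_event("started",  job_id: "abc123", queue: "emails", at: started_at,
            time_in_queue: started_at - enqueued_at),
  job_event("finished", job_id: "abc123", queue: "emails", at: finished_at,
            time_to_perform: finished_at - started_at)
]
puts lines
```

With events like these, time in queue and time to perform can be charted per queue or per job type.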
14. Performance
● Perceived performance is time in queue +
time to perform
● Some individual jobs were particularly slow
to perform
○ emails
○ system events
● These affected system as a whole
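The effect of one slow job on everything behind it can be sketched with a toy model. The assumptions here are mine, not from the production system: one worker, one FIFO queue, all jobs enqueued at the same moment, and a hypothetical quick "report" job waiting behind an email job.

```ruby
# Toy model (assumption): one worker, one FIFO queue, all jobs enqueued at t=0.
# Each job is [name, seconds_to_perform].
def perceived_times(jobs)
  clock = 0
  jobs.map do |name, perform|
    waited = clock          # time in queue: everything ahead had to finish first
    clock += perform
    { name: name, time_in_queue: waited, time_to_perform: perform,
      perceived: waited + perform }
  end
end

slow = perceived_times([["email", 23], ["report", 2]])
fast = perceived_times([["email", 9], ["report", 2]])

puts slow.last[:perceived]  # report job queued behind a 23-second email
puts fast.last[:perceived]  # same report job queued behind a 9-second email
```

The report job itself never changes, yet its perceived time drops from 25 to 11 seconds once the job ahead of it is faster.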
15. Performance
● Emails & system events targeted for
performance improvements
● Perform time for emails down from 23
seconds to 9 seconds
● Perform time for system events down from
32 seconds to 8 seconds
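For reference, the relative improvement implied by these perform times can be worked out directly. The helper name `percent_decrease` is just for illustration.

```ruby
# Percentage decrease from a "before" value to an "after" value,
# rounded to one decimal place.
def percent_decrease(before, after)
  ((before - after) / before.to_f * 100).round(1)
end

emails        = percent_decrease(23, 9)
system_events = percent_decrease(32, 8)

puts "emails: #{emails}%"                 # 60.9
puts "system events: #{system_events}%"   # 75.0
```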
17. Job Contention
● Non-prod jobs interfered
with production jobs
● So we separated prod &
non-prod queues
● Still have a few issues...
19. Job Contention
● Jobs of different types in
the same queue would
contend for workers
● So we reallocated jobs
into fine-grained queues
21. Technology Limitations
● Resque & Resque-Pool work, but are simple
● We are not simple
○ Multiple customers
○ Multiple groups
○ User activity dynamics
○ Flood possibility
● Best illustrated by example...
22. Technology Limitations
[Diagram: one Worker below three queues, left to right: Keyword Indexes, Emails, Imports]
● Jobs enter the queues
● Workers prioritize queues from left to right
● Worker proceeds down list of queues until it finds a job to be processed
● If no jobs are available, workers start back at the left of the list
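The left-to-right polling described above can be sketched in a few lines. This is a simplified in-memory model, not Resque's own code: the `reserve` helper, the queue keys, and the job names are assumptions for illustration.

```ruby
# Queues in priority order, leftmost first. Names follow the slide diagram.
QUEUE_ORDER = ["keyword_indexes", "emails", "imports"].freeze

# Returns [queue_name, job] for the first non-empty queue in the list,
# or nil when every queue is empty.
def reserve(queues, order)
  order.each do |name|
    job = queues[name].shift
    return [name, job] if job
  end
  nil
end

queues = {
  "keyword_indexes" => [],
  "emails"          => ["email-1"],
  "imports"         => ["import-1"]
}

# After each job the worker starts over from the left of the list.
picked = []
while (found = reserve(queues, QUEUE_ORDER))
  picked << found
end
p picked
```

The worker skips the empty keyword index queue, takes the email job, then the import job.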
32. Technology Limitations
[Diagram: queues Keyword Indexes, Emails, Imports holding jobs; Workers 1, 2 and 3]
● Sometimes we get floods of jobs
● Workers are dumb, they always start at left and move right
● Queues of a lower priority than the flooded queue get lonely
● Net result is a customer waiting while a job sits in a queue
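A small simulation shows how a flood in a high-priority queue delays a single job in a lower-priority one. The model is an assumption made for illustration: every job takes one tick, each worker takes one job per tick, and the job counts are invented.

```ruby
# Simplified model (assumption): every job takes one tick, and on each tick
# every worker takes one job from the first non-empty queue in its list.
# Queues are represented by job counts only.
def ticks_until_started(target, queues, worker_orders)
  tick = 0
  until queues.values.all?(&:zero?)
    tick += 1
    worker_orders.each do |order|
      name = order.find { |queue| queues[queue] > 0 }
      next if name.nil?
      queues[name] -= 1
      return tick if name == target
    end
  end
  nil # the target queue never had a job
end

# Three workers, all with the same left-to-right priority list.
same_order = Array.new(3) { ["keyword_indexes", "emails", "imports"] }

flooded = ticks_until_started(
  "imports", { "keyword_indexes" => 100, "emails" => 0, "imports" => 1 }, same_order
)
quiet = ticks_until_started(
  "imports", { "keyword_indexes" => 0, "emails" => 0, "imports" => 1 }, same_order
)

puts "import job starts on tick #{flooded} during a flood"
puts "import job starts on tick #{quiet} when queues are quiet"
```

In this model the lone import job starts on tick 1 when the system is quiet, but not until tick 34 when 100 keyword index jobs are ahead of it.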
34. Technology Limitations
● There was no existing solution to this
problem within the Resque ecosystem.
● Our options
○ Migrate to a different technology
○ Contribute enhancements to our current technology
● We opted for the latter (Qtrix)
35. Technology Limitations
Qtrix says, “Your priority is…”
● Our central Qtrix orchestrator tells each worker what their queue priorities are
● Workers still dumb, the lists are intelligently shuffled
● Every queue is the top priority of at least one worker
● Higher priority queues appear to left more often than lower priority queues
[Diagram: Workers 1, 2 and 3, each with a different queue ordering]
○ Keyword Indexes, Emails, Imports
○ Emails, Imports, Keyword Indexes
○ Imports, Keyword Indexes, Emails
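The idea of giving each worker a different priority list can be sketched with a plain rotation. This is not Qtrix's actual algorithm, only an assumed scheme that happens to produce the three orderings shown on the slide; `orderings_for` and the queue keys are hypothetical.

```ruby
BASE_ORDER = ["keyword_indexes", "emails", "imports"].freeze

# Assumption for illustration: worker i gets the base order rotated by i,
# wrapping around when there are more workers than queues.
def orderings_for(worker_count, base = BASE_ORDER)
  (0...worker_count).map { |i| base.rotate(i % base.size) }
end

orderings = orderings_for(3)
orderings.each_with_index do |order, i|
  puts "worker #{i + 1}: #{order.join(', ')}"
end

# Every queue is the top priority of at least one worker.
tops = orderings.map(&:first)
p tops
```

With three workers each queue leads exactly one list, so a flood in one queue cannot occupy every worker. With four workers the rotation wraps and `keyword_indexes` leads twice, which loosely mirrors higher priority queues appearing to the left more often.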
41. Technology Limitations
Qtrix also gives us...
● The ability to create different priority configurations for
different scenarios
● The ability to change to those configurations on the fly
● The ability to script these changes in reaction to
different events
● The ability to have this work elastically
We are not taking advantage of all of these things yet…
43. Technology Reliability
● Redis is memory bound
● Resque would leave a mess
● Redis was a single point of failure
● Solutions
○ Automated memory cleanup
○ Added Redis AOF backups
○ Added data replication but not failover (yet)
44. Deployment Interruption
● Jobs would be terminated
● Jobs sit idle while workers restart
● Scheduler would go down and scheduled execution
times would be missed
● Ditto employer method jobs, plus hung locks
45. Deployment Interruption
● Now…
○ All jobs finish gracefully
○ There is no window during which jobs are not getting
worked (includes employer method jobs)
○ Scheduler is not brought down during deploys
○ Employer method job locks are still a problem
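Letting jobs finish gracefully usually comes down to checking for a shutdown request between jobs rather than killing the worker mid-job. The sketch below is a generic illustration of that pattern, not the deployment setup described here; the `GracefulWorker` class and the job names are hypothetical, and in a real process the flag would typically be set from a signal handler.

```ruby
# Hypothetical worker that only honors a shutdown request between jobs,
# so the job in progress always runs to completion.
class GracefulWorker
  attr_reader :completed, :pending

  def initialize(jobs)
    @pending = jobs.dup
    @completed = []
    @shutdown = false
  end

  def request_shutdown
    @shutdown = true
  end

  def run
    until @shutdown || @pending.empty?
      job = @pending.shift
      yield(job, self) if block_given?  # perform the job
      @completed << job                 # reached even if shutdown was requested mid-job
    end
  end
end

worker = GracefulWorker.new(["job-1", "job-2", "job-3"])
worker.run do |job, w|
  w.request_shutdown if job == "job-2"  # deploy begins while job-2 is running
end

p worker.completed
p worker.pending
```

Here job-2 completes even though shutdown was requested while it was running, and job-3 stays in the queue for another worker to pick up.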
46. And here we are...
We have gained
● Diagnostic ability
● Performance metrics
● Better performance
● Less long-term & catastrophic risk
● Lowered resource needs
● Lower customer pain
Still issues
● Redis is single point of failure
● Resque scheduler reliability
● Scaling elastically
● Tidying up
47. Since June...
● Total time waiting on jobs decreased 31%
○ SystemEventWorker time decreased 72%
● Total time jobs enqueued decreased 68%
○ Production jobs enqueued time decreased 74%
● Redis memory use decreased ~70%
● “Stuck jobs” during floods decreased 100%
● Eliminated 1 worker server
The numbers tell the story
48. Thanks!
● For the opportunity to work on these fun,
challenging problems
● For the help along the way
● For the trust to be allowed to work
unrestrained
● For the patience & understanding when
things didn’t go according to plan