Rescuing Resque
Backgrounding Overhaul
Overview and Results
Credit where credit is due
● This has been a cross-team effort
○ Development
○ QA
○ Operations
○ L3
● Lots of people have helped
● This includes management (no suckup)
What are background jobs?
● Tasks to be performed in the background (duh)
● May be handed off by the web
● May be handed off by other jobs
● May be scheduled at regular intervals
● Are typically expensive
At PeopleAdmin backgrounding is...
● Resque (Ruby API)
● Redis (Middleware)
● Jobs are put in queues
● Workers look at queues for work
● Workers are grouped into pools
● We have 1 pool per worker server
● We have many worker servers
● Resque scheduler puts jobs into queues at their scheduled run time
So what do we use backgrounding for?
EVERYTHING
Specifically...
● Transitions of postings, applications, hiring proposals, etc.
● Emails
● Keyword indexing (search)
● Import jobs
● Export jobs
● Report generation (EEO)
● Employment task lifecycle
● Onboarding task lifecycle
● Marketplace integrations (job boards, background checks)
● Chore notifications
● Clearing cached data
● Promoting changes between customer environments
● Employer stats collection
● Et cetera
● Et cetera
● Et cetera
So, uh... If everything relies on this, wouldn’t changes be dangerous?
YES
But we are smart and daring (sometimes)
So what were/are the problems?
● Visibility
● Performance
● Job Contention
● Technology limitations
● Technology reliability
● Deployment interruption
● Others...
No Visibility
● Resque was a black box
● Operations, L3 & Development had no view into production
● Ability to diagnose problems was limited
● We also had no way to know if we were creating more problems
No Visibility
● Instrumented jobs with Splunk
● Gave us sophisticated querying ability and graphing of results
● Gave us a view into the life of each job
● Allowed a view into usage patterns, time in queue, time to perform and other metrics
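A minimal sketch of what instrumenting a job might look like, assuming each lifecycle event is written as a key=value log line for a log indexer such as Splunk to pick up. The function names, the field names and the injected `log` and `clock` parameters are hypothetical; the deck does not show the actual instrumentation.

```python
import time


def format_event(event, job_id, queue, **fields):
    """Render one job lifecycle event as a key=value log line."""
    pairs = {"event": event, "job_id": job_id, "queue": queue, **fields}
    return " ".join(f"{key}={value}" for key, value in pairs.items())


def run_instrumented(job_id, queue, enqueued_at, perform, log=print, clock=time.time):
    """Run perform() and log how long the job queued and how long it ran."""
    started_at = clock()
    log(format_event("start", job_id, queue,
                     time_in_queue=round(started_at - enqueued_at, 3)))
    result = perform()
    finished_at = clock()
    log(format_event("finish", job_id, queue,
                     time_to_perform=round(finished_at - started_at, 3)))
    return result
```

Passing `log` and `clock` in as parameters keeps the sketch testable; a real worker would presumably write to its normal logger and let the indexer collect the file.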
Performance
● Perceived performance is time in queue + time to perform
● Some individual jobs were particularly slow to perform
○ emails
○ system events
● These affected the system as a whole
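The "time in queue + time to perform" definition can be written down directly. This is a small sketch, assuming three timestamps are recorded per job; the `JobTiming` class and the numbers in the example are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class JobTiming:
    """Timestamps (in seconds) for one job's life."""
    enqueued_at: float
    started_at: float
    finished_at: float

    @property
    def time_in_queue(self):
        return self.started_at - self.enqueued_at

    @property
    def time_to_perform(self):
        return self.finished_at - self.started_at

    @property
    def perceived(self):
        # What the customer experiences: waiting plus working.
        return self.time_in_queue + self.time_to_perform


# Illustrative numbers only: 40 s waiting in the queue, then 20 s of work.
timing = JobTiming(enqueued_at=0.0, started_at=40.0, finished_at=60.0)
```

In this example the job itself takes 20 seconds, but the customer waits 60, which is why a slow job type hurts everything queued behind it.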
Performance
● Emails & system events targeted for performance improvements
● Perform time for emails down from 23 seconds to 9 seconds
● Perform time for system events down from 32 seconds to 8 seconds
Job Contention
● Non-prod jobs interfered with production jobs
● So we separated prod & non-prod queues
● Still have a few issues...
Job Contention
● Jobs of different types in the same queue would contend for workers
● So we reallocated jobs into fine-grained queues
Technology Limitations
● Resque & Resque-Pool work, but are simple
● We are not simple
○ Multiple customers
○ Multiple groups
○ User activity dynamics
○ Flood possibility
● Best illustrated by example...
Technology Limitations
[Animated diagram: queues Keyword Indexes, Emails, Imports and one worker; a job enters a queue and the worker picks it up]
Jobs enter the queues
Workers prioritize queues from left to right
Worker proceeds down list of queues until it finds a job to be processed
If no jobs are available, workers start back at the left of the list
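The left-to-right polling described on this slide can be sketched as follows. This mirrors the behaviour as the slide states it, not Resque's source; the `reserve` function and the queue names are illustrative.

```python
from collections import deque


def reserve(queues, priority):
    """Return (queue_name, job) from the first non-empty queue, or None.

    queues:   dict mapping queue name to a deque of pending jobs
    priority: queue names in the order the worker checks them
    """
    for name in priority:
        if queues[name]:
            return name, queues[name].popleft()
    return None  # nothing to do; the caller starts again from the left


queues = {
    "keyword_indexes": deque(),
    "emails": deque(["welcome_email"]),
    "imports": deque(["csv_import"]),
}
priority = ["keyword_indexes", "emails", "imports"]

first = reserve(queues, priority)
second = reserve(queues, priority)
third = reserve(queues, priority)
```

Here the worker skips the empty keyword queue, takes the email, then the import, and finally finds nothing.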
Technology Limitations
[Animated diagram: queues Keyword Indexes, Emails, Imports; a flood of jobs arrives and Workers 1, 2 and 3 drain it]
Sometimes we get floods of jobs
Workers are dumb: they always start at the left and move right
Queues with a lower priority than the flooded queue get lonely
Net result is a customer waiting while a job sits in a queue
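A toy simulation of the flood problem, under loud assumptions: every job takes exactly one round, each of three workers reserves at most one job per round, the flood of 11 jobs lands in the highest-priority queue, and a single customer import is waiting. None of these numbers or names come from production.

```python
from collections import deque


def reserve(queues, priority):
    """Take a job from the first non-empty queue in priority order."""
    for name in priority:
        if queues[name]:
            return name, queues[name].popleft()
    return None


def rounds_until_started(queues, worker_priorities, target_queue):
    """Count rounds until some worker picks up a job from target_queue."""
    rounds = 0
    while any(queues.values()):
        rounds += 1
        for priority in worker_priorities:
            picked = reserve(queues, priority)
            if picked and picked[0] == target_queue:
                return rounds
    return None


flooded = {
    "keyword_indexes": deque(range(11)),   # the flood
    "emails": deque(),
    "imports": deque(["customer_import"]),  # the job a customer is waiting on
}
# Every worker uses the same left-to-right order.
same_order = [["keyword_indexes", "emails", "imports"]] * 3

wait = rounds_until_started(flooded, same_order, "imports")
```

With these assumptions the import is not touched until round 4, after the whole flood has drained, even though three workers are running.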
Technology Limitations
● There was no existing solution to this problem within the Resque ecosystem.
● Our options
○ Migrate to a different technology
○ Contribute enhancements to our current technology
● We opted for the latter (Qtrix)
Technology Limitations
Qtrix says, “Your priority is…”
Our central Qtrix orchestrator tells each worker what their queue priorities are
Workers are still dumb; the lists are intelligently shuffled
Every queue is the top priority of at least one worker
Higher priority queues appear to the left more often than lower priority queues
Workers 1, 2 and 3 each receive one of these lists:
Keyword Indexes, Emails, Imports
Emails, Imports, Keyword Indexes
Imports, Keyword Indexes, Emails
Technology Limitations
[Animated diagram: queues Keyword Indexes, Emails, Imports; a flood of jobs drains as Workers 1, 2 and 3 pick up work]
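One way to produce per-worker queue lists with the two properties the Qtrix slides name (every queue leads at least one list, and higher priority queues sit to the left more often) is sketched below. This is not Qtrix's actual algorithm, which the deck does not describe; the weights, function names and the rotation-plus-weighted-shuffle approach are all assumptions.

```python
import random


def weighted_shuffle(weights, rng):
    """Order queues randomly, favouring heavier queues for earlier slots."""
    remaining = dict(weights)
    order = []
    while remaining:
        names = list(remaining)
        pick = rng.choices(names, weights=[remaining[n] for n in names])[0]
        order.append(pick)
        del remaining[pick]
    return order


def assign_orderings(weights, worker_count, seed=0):
    """Build one queue ordering per worker.

    The first len(weights) workers get rotations of the priority order, so
    each queue is first in one list. Any further workers get weighted
    shuffles, which put heavier queues to the left more often.
    """
    rng = random.Random(seed)
    by_priority = sorted(weights, key=weights.get, reverse=True)
    orderings = []
    for i in range(worker_count):
        if i < len(by_priority):
            orderings.append(by_priority[i:] + by_priority[:i])
        else:
            orderings.append(weighted_shuffle(weights, rng))
    return orderings


# Hypothetical weights: a higher number means a more important queue.
weights = {"keyword_indexes": 3, "emails": 2, "imports": 1}
three_workers = assign_orderings(weights, worker_count=3)
```

For three workers this yields the same three lists shown on the slide. The guarantee that every queue leads a list only holds when there are at least as many workers as queues.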
Technology Limitations
Qtrix also gives us...
● The ability to create different priority configurations for different scenarios
● The ability to change to those configurations on the fly
● The ability to script these changes in reaction to different events
● The ability to have this work elastically
We are not taking advantage of all of these things yet…
Technology Reliability
● Redis is memory bound
● Resque would leave a mess
● Redis was a single point of failure
● Solutions
○ Automated memory cleanup
○ Added Redis AOF backups
○ Added data replication but not failover (yet)
Deployment Interruption
● Jobs would be terminated
● Jobs sit idle while workers restart
● Scheduler would go down and execution times were missed
● Ditto employer method jobs, plus hung locks
Deployment Interruption
● Now…
○ All jobs finish gracefully
○ There is no delay during which jobs are not being worked (includes employer method jobs)
○ Scheduler is not brought down during deploys
○ Employer method job locks are still a problem
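"All jobs finish gracefully" usually means a worker that is asked to stop completes its current job before exiting. The sketch below shows that pattern in general terms, assuming a shutdown signal sets a flag that is only checked between jobs. The `Worker` class and the choice of SIGTERM are illustrative, not the deck's actual deploy mechanism.

```python
import signal


class Worker:
    """Runs jobs one at a time and stops only between jobs."""

    def __init__(self, jobs):
        self.jobs = list(jobs)
        self.completed = []
        self.shutdown_requested = False

    def request_shutdown(self, signum=None, frame=None):
        # Only record the request; the running job is never interrupted.
        self.shutdown_requested = True

    def run(self):
        while self.jobs and not self.shutdown_requested:
            job = self.jobs.pop(0)
            self.completed.append(job())
        return self.completed


def install_signal_handler(worker):
    """Hook the worker up to SIGTERM (call from the main thread)."""
    signal.signal(signal.SIGTERM, worker.request_shutdown)
```

Jobs left in `worker.jobs` after `run()` returns are the ones a restarted worker would pick up next.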
And here we are...
We have gained
● Diagnostic ability
● Performance metrics
● Better performance
● Less long-term & catastrophic risk
● Lowered resource needs
● Lower customer pain
Still issues
● Redis is a single point of failure
● Resque scheduler reliability
● Scaling elastically
● Tidying up
The numbers tell the story
Since June...
● Total time waiting on jobs decreased 31%
○ SystemEventWorker time decreased 72%
● Total time jobs enqueued decreased 68%
○ Production jobs enqueued time decreased 74%
● Redis memory use decreased ~70%
● “Stuck jobs” during floods decreased 100%
● Eliminated 1 worker server
Thanks!
● For the opportunity to work on these fun, challenging problems
● For the help along the way
● For the trust to be allowed to work unrestrained
● For the patience & understanding when things didn’t go according to plan
Questions?