SlideShare a Scribd company logo
1 of 79
Nonconformist
Resilience:
Database-backed Job Queues
John Mileham | @jmileham
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
betterment-delayed-job
Presented at QCon New York
www.qconnewyork.com
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
User Signup
with Email Confirmation
User Signup
with Email Confirmation
A feature so easy we’re still fighting about how to do it in 2017
Requirements:
Validate the user’s profile information
Store the user record to the database
Email a link
When the link is clicked, mark the user as verified
Requirements:
Validate the user’s profile information
Store the user record to the database
Email a link
When the link is clicked, mark the user as verified
Take 1:
Inline the email delivery
Take 1:
Inline the email delivery
… but it’s slow
Take 2:
Spin off a thread or use a thread pool
Take 2:
Spin off a thread or use a thread pool
… but it’s unreliable
Take 3:
Use a grown-up message bus
Take 3:
Use a grown-up message bus
… but it’s unreliable?
Commit-then-Enqueue
Commit
to DB
App Timeline
Enqueue
to bus
Customer Timeline
Deliver
to
ESP
Request Response
Email
Commit
to DB
App Timeline
Enqueue
to bus
Customer Timeline
Deliver
to
ESP
Request Response
Email
Enqueue-then-Commit
Enqueue
to bus
App Timeline
Commit
to DB
Customer Timeline
Deliver
to
ESP
Request Response
Email
Enqueue
to bus
App Timeline
Commit
to DB
Customer Timeline
Deliver
to
ESP
Request
Email
Response
Enqueue
to bus
App Timeline
Commit
to DB
Customer Timeline
Deliver
to
ESP
Request
Email
Build
Email
App Timeline
Response
You could make the enqueue and the database commit atomic via a distributed transaction manager,
but:
● Mature, robust, distributed transaction managers aren’t available for all platforms
● They’re usually proprietary
● Even where they exist, these tools have nuanced configuration, can have operational warts and
are another subsystem that requires care and feeding
● The additional network pings necessary to coordinate the commit between the datastores can
cause write performance problems
Distributed Transactions
Take 4:
Use the database as a queue
Take 4:
Use the database as a queue
… but it won’t scale
Commit
&
Enqueue
App Timeline
Customer Timeline
Deliver
to
ESP
Request Response
Email
Commit
&
Enqueue
App Timeline
Customer Timeline
Deliver
to
ESP
Request Response
Email
Commit
&
Enqueue
App Timeline
Customer Timeline
Deliver
to
ESP
Request Response
Email
Robust By Default
Addressing the pitfalls
Because everything is a tradeoff
DJ: Retry with Exponential Backoff
Two key columns: run_at, and attempts.
● Jobs are picked up oldest first
● Only jobs with a run_at in the past are workable
● When a job fails, a new future run_at is calculated from the previous run_at and number of
previous attempts
● After too many failures (days later), a job will stop being attempted
Message Bus Solution: DLQs
Messages don’t have a desired delivery time in a message bus, so exponential backoff isn’t feasible.
Message delivery will be attempted a preconfigured number of times, and then transferred to a
dead-letter queue, or a cascading set of queues to approximate exponential backoff.
DJ: Priority
Delayed::Job will work off the highest priority first.
Pickup is simply a matter of sorting on priority and then run_at.
We use priority to establish different service level objectives for different kinds of work.
Allows developer not to worry about resourcing their jobs, leaning into DJ.
Allows DJ to fully utilize its worker capacity.
Message Bus Solution: Topics
Message busses can’t as easily support priority.
To assure resource availability for important work, work is shunted to a specific topic or queue with its
own resource pool.
Strong assurance that one job type won’t exhaust resources of another type.
But you must resource each topic individually.
DJ’s got topics too ;)
Even though it’s not the only way to organize work, if you have a mission critical work stream that
must be processed no matter what, you can use a specialized queue to keep its workers separate.
Opt in for as much control as you need, only when you need it.
More Featureful, Not Less
Betterment’s Schema
Users
Deposits
Bank
Accounts
Goals
Investing
Accounts
Auto
Deposits
State-
ments
Betterment’s Schema
Users
Deposits
Bank
Accounts
Goals
Investing
Accounts
Auto
Deposits
State-
ments
Power Law Distribution
The Message Bus Isn’t a Silver Bullet
Coordinated Polling
● Your application chooses a global polling interval, say a half second.
● Every active worker process inserts itself into an active_workers table with a last_active_at
timestamp and maintains it every 30 seconds or so.
● Every few seconds, each worker queries the number of recently active workers.
● It then multiplies the global polling interval by the number of workers and adds random jitter to
prevent thundering herds
● Your app converges on the desired polling interval at arbitrary worker scale
When is a DB-backed
queue the right tool?
1. Should your app use a DB at all?
You should be using an ACID SQL DB if:
● You have a read-heavy usage pattern
● You value agility in supporting new use cases
● You aren’t launching directly into #webscale
● Or even if you are, your app doesn’t exist primarily to solve a graph problem
○ if you’re going big and still want to use SQL, your dataset must inherently shardable
2. Are your clients human?
If clients are interacting with your app like humans, i.e.:
● They do individual operations at a reasonable pace
● The don’t generate batches of 10,000 operations at once
Then you’re looking still looking good.
3. Are Your Bulk Operations Cool?
● Are there relatively few of them?
● Are they customer experience-impacting?
● Are they no more than daily?
All Yes? All Set.
Operating a DB-backed
Queue
Alerting Needs
Two key alerts:
1. Max attempt count
2. Max age
Both metrics are partitioned by job priority.
Max Attempt Count
Total backoff time function: n == 0 ? 0 : n ** 4 + 5 + backoff(n-1)
● First retry in 6 seconds
● Third retry in 2 minutes
● Fifth retry in 16 minutes
● Tenth retry in 7 hours
● Twentieth retry in 8 days
Our thresholds:
● INTERACTIVE errors after 2 attempts (~30 seconds)
● EVENTUAL errors after 8 attempts (~2.5 hours)
Max Age
Age is defined as now() - run_at.
This is your brain
on DJ
(your message
may vary)
Why not just use DJ?
Why not just use DJ?
Date
Author
June 27th, 2017
John Mileham | @jmileham
(is hiring)
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
betterment-delayed-job

More Related Content

More from C4Media

Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDC4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine LearningC4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at SpeedC4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsC4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsC4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerC4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleC4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeC4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereC4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing ForC4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreC4Media
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsC4Media
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechC4Media
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/awaitC4Media
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaC4Media
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayC4Media
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?C4Media
 
CockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseCockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseC4Media
 
A Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinA Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinC4Media
 

More from C4Media (20)

Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery Teams
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in Adtech
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/await
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven Utopia
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?
 
CockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseCockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL Database
 
A Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinA Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with Brooklin
 

Recently uploaded

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 

Recently uploaded (20)

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 

Nonconformist Resilience: DB-backed Job Queues

  • 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ betterment-delayed-job
  • 3. Presented at QCon New York www.qconnewyork.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  • 4.
  • 5.
  • 6.
  • 7.
  • 8.
  • 9.
  • 10.
  • 11.
  • 12.
  • 13.
  • 14.
  • 15.
  • 16.
  • 17.
  • 18. User Signup with Email Confirmation
  • 19. User Signup with Email Confirmation A feature so easy we’re still fighting about how to do it in 2017
  • 20. Requirements: Validate the user’s profile information Store the user record to the database Email a link When the link is clicked, mark the user as verified
  • 21. Requirements: Validate the user’s profile information Store the user record to the database Email a link When the link is clicked, mark the user as verified
  • 22. Take 1: Inline the email delivery
  • 23. Take 1: Inline the email delivery … but it’s slow
  • 24. Take 2: Spin off a thread or use a thread pool
  • 25. Take 2: Spin off a thread or use a thread pool … but it’s unreliable
  • 26. Take 3: Use a grown-up message bus
  • 27. Take 3: Use a grown-up message bus … but it’s unreliable?
  • 29. Commit to DB App Timeline Enqueue to bus Customer Timeline Deliver to ESP Request Response Email
  • 30. Commit to DB App Timeline Enqueue to bus Customer Timeline Deliver to ESP Request Response Email
  • 32. Enqueue to bus App Timeline Commit to DB Customer Timeline Deliver to ESP Request Response Email
  • 33. Enqueue to bus App Timeline Commit to DB Customer Timeline Deliver to ESP Request Email Response
  • 34. Enqueue to bus App Timeline Commit to DB Customer Timeline Deliver to ESP Request Email Build Email App Timeline Response
  • 35. You could make the enqueue and the database commit atomic via a distributed transaction manager, but: ● Mature, robust, distributed transaction managers aren’t available for all platforms ● They’re usually proprietary ● Even where they exist, these tools have nuanced configuration, can have operational warts and are another subsystem that requires care and feeding ● The additional network pings necessary to coordinate the commit between the datastores can cause write performance problems Distributed Transactions
  • 36. Take 4: Use the database as a queue
  • 37. Take 4: Use the database as a queue … but it won’t scale
  • 41.
  • 43. Addressing the pitfalls Because everything is a tradeoff
  • 44.
  • 45. DJ: Retry with Exponential Backoff Two key columns: run_at, and attempts. ● Jobs are picked up oldest first ● Only jobs with a run_at in the past are workable ● When a job fails, a new future run_at is calculated from the previous run_at and number of previous attempts ● After too many failures (days later), a job will stop being attempted
  • 46. Message Bus Solution: DLQs Messages don’t have a desired delivery time in a message bus, so exponential backoff isn’t feasible. Message delivery will be attempted a preconfigured number of times, and then transferred to a dead-letter queue, or a cascading set of queues to approximate exponential backoff.
  • 47. DJ: Priority Delayed::Job will work off the highest priority first. Pickup is simply a matter of sorting on priority and then run_at. We use priority to establish different service level objectives for different kinds of work. Allows developer not to worry about resourcing their jobs, leaning into DJ. Allows DJ to fully utilize its worker capacity.
  • 48. Message Bus Solution: Topics Message busses can’t as easily support priority. To assure resource availability for important work, work is shunted to a specific topic or queue with its own resource pool. Strong assurance that one job type won’t exhaust resources of another type. But you must resource each topic individually.
  • 49. DJ’s got topics too ;) Even though it’s not the only way to organize work, if you have a mission critical work stream that must be processed no matter what, you can use a specialized queue to keep its workers separate. Opt in for as much control as you need, only when you need it.
  • 50.
  • 52.
  • 56.
  • 57. The Message Bus Isn’t a Silver Bullet
  • 58.
  • 59. Coordinated Polling ● Your application chooses a global polling interval, say a half second. ● Every active worker process inserts itself into an active_workers table with a last_active_at timestamp and maintains it every 30 seconds or so. ● Every few seconds, each worker queries the number of recently active workers. ● It then multiplies the global polling interval by the number of workers and adds random jitter to prevent thundering herds ● Your app converges on the desired polling interval at arbitrary worker scale
  • 60.
  • 61.
  • 62. When is a DB-backed queue the right tool?
  • 63. 1. Should your app use a DB at all? You should be using an ACID SQL DB if: ● You have a read-heavy usage pattern ● You value agility in supporting new use cases ● You aren’t launching directly into #webscale ● Or even if you are, your app doesn’t exist primarily to solve a graph problem ○ if you’re going big and still want to use SQL, your dataset must inherently shardable
  • 64. 2. Are your clients human? If clients are interacting with your app like humans, i.e.: ● They do individual operations at a reasonable pace ● The don’t generate batches of 10,000 operations at once Then you’re looking still looking good.
  • 65. 3. Are Your Bulk Operations Cool? ● Are there relatively few of them? ● Are they customer experience-impacting? ● Are they no more than daily?
  • 66. All Yes? All Set.
  • 68. Alerting Needs Two key alerts: 1. Max attempt count 2. Max age Both metrics are partitioned by job priority.
  • 69. Max Attempt Count Total backoff time function: n == 0 ? 0 : n ** 4 + 5 + backoff(n-1) ● First retry in 6 seconds ● Third retry in 2 minutes ● Fifth retry in 16 minutes ● Tenth retry in 7 hours ● Twentieth retry in 8 days Our thresholds: ● INTERACTIVE errors after 2 attempts (~30 seconds) ● EVENTUAL errors after 8 attempts (~2.5 hours)
  • 70. Max Age Age is defined as now() - run_at.
  • 71. This is your brain on DJ
  • 73.
  • 74.
  • 75. Why not just use DJ?
  • 76. Why not just use DJ?
  • 77.
  • 78. Date Author June 27th, 2017 John Mileham | @jmileham (is hiring)
  • 79. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ betterment-delayed-job