4. GOALS OF PROJECT
Leverage Mongo
• Reduce ops overhead by reusing infrastructure
• Map queue semantics to Mongo’s strengths
Reliable
• Durable: support long-running processes
• Resilient to machine failure
• Narrow the window of failure / data loss
Centralized, distributed:
• Multiple producers
• Multiple consumers
5. ITERATION 0
Capped collection – not the perfect choice
• Tailing queue seems attractive, but…
• Need external sync to avoid double-consume
• Secondary indexes and updates are anti-patterns
Relaxing FIFO is OK
• No guarantee that first-popped is first done
• Multiple clients gain nothing if they must sync on execution order
• Race conditions on queue insertion have the same effect
Conclusion: the project doesn’t use a capped collection and relaxes FIFO.
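To make "first-popped is not first done" concrete, here is a small hypothetical simulation in plain JavaScript (Node.js, no MongoDB involved). Items are popped in strict FIFO order by two consumers; the per-item processing times are made-up values chosen only for illustration.

```javascript
// Hypothetical items with invented processing times (ms).
const queue = [
  { id: "A", workMs: 30 },
  { id: "B", workMs: 10 },
  { id: "C", workMs: 25 },
];

// Each item goes to whichever consumer becomes idle first,
// so the pop order is strictly FIFO.
function simulate(items, consumerCount) {
  const freeAt = new Array(consumerCount).fill(0); // when each consumer is idle again
  const popOrder = [];
  const done = [];
  for (const item of items) {
    let c = 0;
    for (let i = 1; i < consumerCount; i++) {
      if (freeAt[i] < freeAt[c]) c = i;
    }
    popOrder.push(item.id);
    freeAt[c] += item.workMs; // consumer c works on this item until freeAt[c]
    done.push({ id: item.id, finishedAt: freeAt[c] });
  }
  done.sort((a, b) => a.finishedAt - b.finishedAt);
  return { popOrder, doneOrder: done.map((d) => d.id) };
}

const result = simulate(queue, 2);
```

Here the pop order is A, B, C but the completion order is B, A, C. A strict FIFO pop would not produce FIFO completion, so enforcing it buys little once there is more than one consumer.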
6. PARANOID BY DESIGN
Network dies
Process dies
DB dies
Machine dies
Poison letter
Dead letter
8. ARE WE THERE YET?
Network dies
Process dies
DB dies
Machine dies
Poison letter
Dead letter
9. QUEUE SEMANTICS
Local / Memory Distributed
Push Put
Pop Get << visibility >>
<< exception >> Release << retry >>
Delete
<< exception >>
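The distributed verbs (Put, Get, Release, Delete) can be modeled as an in-memory sketch in plain JavaScript. The sketch assumes one common reading of the << visibility >> annotation: Get hides an item until a deadline instead of removing it, similar to a visibility timeout. The class name VisibilityQueue and the injectable clock are hypothetical. Note that the Mongo queries in the later iterations only set dq and do not re-match expired items; this models the semantics, not those queries.

```javascript
// Hypothetical in-memory model of Put / Get / Release / Delete.
class VisibilityQueue {
  constructor(now = () => Date.now()) {
    this.now = now;          // injectable clock (ms) so behavior is testable
    this.items = new Map();  // id -> { value, hiddenUntil }; keeps insertion order
    this.nextId = 1;
  }

  // Put: enqueue a value; it is visible immediately.
  put(value) {
    const id = this.nextId++;
    this.items.set(id, { value, hiddenUntil: 0 });
    return id;
  }

  // Get: return the oldest visible item and hide it for visibilityMs.
  // Unlike Pop, the item stays stored until Delete.
  get(visibilityMs) {
    const t = this.now();
    for (const [id, item] of this.items) {
      if (item.hiddenUntil <= t) {
        item.hiddenUntil = t + visibilityMs;
        return { id, value: item.value };
      }
    }
    return null; // nothing visible right now
  }

  // Release: on exception, make the item visible again for a retry.
  release(id) {
    const item = this.items.get(id);
    if (item) item.hiddenUntil = 0;
  }

  // Delete: on success, remove the item for good.
  delete(id) {
    return this.items.delete(id);
  }
}

// Usage with a fake clock: a consumer gets an item and then dies.
let clock = 0;
const q = new VisibilityQueue(() => clock);
q.put("job-1");
const first = q.get(60000); // job-1 is now hidden for 60 s
const none = q.get(60000);  // null: nothing else is visible
clock = 61000;              // no Delete or Release arrived; the deadline passed
const again = q.get(60000); // job-1 becomes visible again and is handed out
```

The point of the design is that a consumer crash between Get and Delete delays the item rather than losing it, which is exactly what Pop on a local in-memory queue cannot offer.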
10. ITERATION 2
db.q4foo.save({v: {f: 1}, dq: null})  // enqueue
db.q4foo.findAndModify({              // consume: claim the oldest unclaimed item
  query: {dq: null},
  sort: {_id: 1},
  update: {$set: {dq: later(60)}}})   // later(60): presumably now + 60 s
… If processing succeeds => delete.
Hot: if the client dies, the item remains in the queue. Data is not lost.
Not: the index on _id is less useful at high volume.
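The Iteration 2 protocol can be mirrored by a hypothetical in-memory stand-in for q4foo, runnable with Node.js and no mongod. It assumes later(s) returns a Date s seconds ahead and uses an incrementing integer in place of ObjectId's roughly ascending _id. The helper names save, consume, and remove are illustrative only.

```javascript
// Assumption: later(s) returns a Date s seconds in the future.
function later(seconds) {
  return new Date(Date.now() + seconds * 1000);
}

const q4foo = []; // stand-in for the collection
let seq = 0;      // stand-in for ascending _id values

// db.q4foo.save({v: ..., dq: null})
function save(v) {
  q4foo.push({ _id: ++seq, v, dq: null });
}

// findAndModify({query: {dq: null}, sort: {_id: 1}, update: {$set: {dq: later(60)}}})
// Returns the document as it was before the update, like findAndModify's default.
function consume() {
  const doc = q4foo
    .filter((d) => d.dq === null)
    .sort((a, b) => a._id - b._id)[0];
  if (!doc) return null;
  const before = { ...doc };
  doc.dq = later(60);
  return before;
}

// On success the consumer deletes the item by _id.
function remove(id) {
  const i = q4foo.findIndex((d) => d._id === id);
  if (i >= 0) q4foo.splice(i, 1);
}

save({ f: 1 });
save({ f: 2 });
const a = consume(); // consumer A claims _id 1, succeeds, deletes it
remove(a._id);
const b = consume(); // consumer B claims _id 2, then dies before deleting
const c = consume(); // null: _id 2 still exists, but {dq: null} no longer matches it
```

This shows both sides of the slide: the item owned by the dead client is still stored (no data loss), but as written nothing re-consumes it until its dq is reset, which is what the release step in Iteration 3 provides.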
11. ARE WE THERE YET?
Network dies
Process dies
DB dies
Machine dies
Poison letter
Dead letter
12. ITERATION 3
db.q4foo.save({v: {f: 1}, dq: null, pc: 0})       // enqueue
db.q4foo.findAndModify({
  query: {dq: null, pc: {$lt: 3}},
  sort: {_id: 1},
  update: {$set: {dq: later(60)}, $inc: {pc: 1}}}) // consume
db.q4foo.findAndModify({
  query: {_id: "..."},
  update: {$set: {dq: null}}})                      // release
Hot: a released item is retried automatically; pc counts consume attempts.
An item that exhausts its attempts (pc reaches 3) remains in the queue.
Not: not strict FIFO.
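The retry limit can be traced with a hypothetical in-memory mirror of the Iteration 3 queries (Node.js, no mongod). Assumptions: later(s) returns a Date s seconds ahead, MAX_TRIES = 3 corresponds to pc: {$lt: 3}, and exhausted() is an invented helper showing the kind of check an admin job could run.

```javascript
const MAX_TRIES = 3; // matches pc: {$lt: 3}
function later(seconds) {
  return new Date(Date.now() + seconds * 1000);
}

const q4foo = [];
let seq = 0;

// db.q4foo.save({v: ..., dq: null, pc: 0})
function save(v) {
  q4foo.push({ _id: ++seq, v, dq: null, pc: 0 });
}

// consume: query {dq: null, pc: {$lt: 3}}, sort {_id: 1},
// update {$set: {dq: later(60)}, $inc: {pc: 1}}; returns the pre-update document
function consume() {
  const doc = q4foo
    .filter((d) => d.dq === null && d.pc < MAX_TRIES)
    .sort((a, b) => a._id - b._id)[0];
  if (!doc) return null;
  const before = { ...doc };
  doc.dq = later(60);
  doc.pc += 1;
  return before;
}

// release: query {_id: id}, update {$set: {dq: null}}
function release(id) {
  const doc = q4foo.find((d) => d._id === id);
  if (doc) doc.dq = null;
}

// Hypothetical monitoring check: released items with no attempts left.
function exhausted() {
  return q4foo.filter((d) => d.dq === null && d.pc >= MAX_TRIES);
}

// Every attempt fails, so the consumer releases the item each time.
save({ f: 1 });
let attempts = 0;
while (true) {
  const job = consume();
  if (job === null) break;
  attempts += 1;
  release(job._id);
}
```

The loop stops after three attempts: the item is still stored with pc = 3, the consume query no longer matches it, and exhausted() surfaces it for the "Retries exhausted" monitoring job.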
13. ARE WE THERE? YES.
Network dies
Process dies
DB dies
Machine dies
Poison letter
Dead letter
14. ITERATION 4
Ensure your queue writes use the durability they need
• db.q4foo.save() + getLastError(…)
• db.q4foo.findAndModify() + getLastError(…)
Replica sets add durability only, not capacity or speed.
15. OTHER THOUGHTS
Create admin jobs to monitor queues:
• Growth
• Retries exhausted
Consider TTL risks (e.g., a client that fails before calling Release())
Consider idempotent operations when possible
Design clients to back off when polls come up empty
Separate queue per topic vs. one queue with an extra “topic” field
Consider a dedicated DB to limit write-lock scope
Capped vs. regular collection: capped collections can now have an _id index and in-place updates.
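Polling backoff can be sketched as a small state machine in plain JavaScript: double the wait after each empty poll, cap it, and reset as soon as a poll returns an item. The function name makeBackoff and the default values (100 ms base, 10 s cap) are hypothetical and would need tuning per workload.

```javascript
// Hypothetical exponential backoff for an idle queue consumer.
function makeBackoff(baseMs = 100, maxMs = 10000) {
  let delay = baseMs;
  return {
    // Call after each poll; returns how long to sleep before polling again.
    next(gotItem) {
      if (gotItem) {
        delay = baseMs; // work was found: reset and poll again right away
        return 0;
      }
      const wait = delay;
      delay = Math.min(delay * 2, maxMs); // grow the wait, but never past the cap
      return wait;
    },
  };
}

// Five empty polls, one hit, then another empty poll.
const backoff = makeBackoff(100, 1000);
const waits = [false, false, false, false, false, true, false].map((got) =>
  backoff.next(got)
);
```

With these values the waits are 100, 200, 400, 800, 1000, 0, 100 ms, so idle consumers stop hammering the queue collection while busy ones keep draining it. Adding random jitter to each wait is a common refinement when many consumers start at the same time.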