SlideShare a Scribd company logo
Chapman: Building a
Distributed Job Queue
in MongoDB
Rick Copeland
@rick446
rick@synapp.io
Getting to Know One
Another
Rick
Roadmap
Define the problem
Schema design & operations
Types of tasks
Reducing Polling
•
•
•
•
Requirements
Digital Ocean
Rackspace US
Rackspace UK
SMTP
Server
SMTP
Server
SMTP
Server
SMTP
Server
SMTP
Server
SMTP
Server
App
Server
Requirements
Group
check_smtp
Analyze
Results
Update
Reports
Pipeline
Requirements
MongoDB
(of course)
Basic Ideas
msg
Chapman Insecure
msg
msg
msg
msg
msg
Task
Task
Worker
Process
Job Queue Schema:
Message
{ _id: ObjectId(...),
task_id: ObjectId(...),
slot: 'run',
s: {
status: 'ready',
ts: ISODateTime(...),
q: 'chapman',
pri: 10,
w: '----------',
},
args: Binary(...),
kwargs: Binary(...),
send_args: Binary(...),
send_kwargs: Binary(...)
}
Task method to
be run
Destination
Task
Sched
SynchroMessage
arguments
Job Queue Schema:
TaskState
{ _id: ObjectId(...),
type: 'Group',
parent_id: ObjectId(...),
on_complete: ObjectId(...),
mq: [ObjectId(...), ...],
status: 'pending',
options: {
queue: 'chapman',
priority: 10,
immutable: false,
ignore_result: true,
}
result: Binary(...),
data: {...}
}
Python class
registered for
task
Parent task
(if any)
Message to be
sent on
completion
Message State: Naive
Approach
Reserve
Message
Try to
lock task
Un-
reserve
Message
Reserve
Message
Try to
lock task
Un-
reserve
Message
Reserve
Message
Try to
lock task
Un-
reserve
Message
Reserve
Message
Try to
lock task
Un-
reserve
Message
Reserve
Message
Try to
lock task
Un-
reserve
Message
Reserve
Message
Try to
lock task
Un-
reserve
Message
Reserve
Message
Try to
lock task
Un-
reserve
Message
Reserve
Message
Try to
lock task
Un-
reserve
Message
Reserve
Message
Try to
lock task
Un-
reserve
Message
Reserve
Message
Try to
lock task
Un-
reserve
Message
Reserve
Message
Try to
lock task
Un-
reserve
Message
Reserve
Message
Try to
lock task
Un-
reserve
Message
Reserve
Message
Try to
lock task
Un-
reserve
Message
Message States:
Reserve Message
findAndModify (‘ready’)
s.state => ‘q1’
s.w => worker_id
$push _id onto task’s mq field
If msg is first in mq, s.state =>
‘busy’
Start processing
•
•
•
•
•
•
ready
q1
busy
q2
next
pending
Message States:
Reserve Message
findAndModify (‘next’)
s.state => ‘busy’
s.w => worker_id
start processing
•
•
•
•
ready
q1
busy
q2
next
pending
Message States:
Retire Message
findAndModify TaskState
$pull message _id from ‘mq’
findAndModify new first message
in ‘mq’ if its s.state is in [‘q1’, ‘q2’]
s.state => ‘next’
•
•
•
•
ready
q1
busy
q2
next
pending
Task States
States mainly
advisory
success,failure
transitions trigger
on_complete
message
‘chained’ is a tail-
call optimization
•
•
•
pending
active
chainedfailuresuccess
Basic Tasks:
FunctionTask
Simplest task: run a function to
completion, set the result to the return
value
If a ‘Suspend’ exception is raised, move
the task to ‘chained’ status
Other exceptions set task to ‘failure’, save
•
•
•
@task(ignore_result=True, priority=50)
def track(event, user_id=None, **kwargs):
log.info('track(%s, %s...)', event, user_id)
# ...
Digression: Task
Chaining
Task state set to ‘chained’
New “Chain” task is created that will
Call the “next” task
When the “next” task completes, also
complete the “current” task
•
•
•
•
@task(ignore_result=True,
priority=50)
def function_task(*args, **kwargs):
# ...
Chain.call(some_other_task)
Composite Tasks
on_complete message for each subtask
with slot=retire_subtask, specifying subtask
position & the result of the subtask
Different composite tasks implement ‘run’
and ‘retire_subtask’ differently
•
•
task_state.update(
{ '_id': subtask_id },
{ $set: {
'parent_id': composite_id,
'data.composite_position': position,
'options.ignore_result': false
}}
)
Composite Task:
Pipeline
Run
Send a ‘run’ message to the subtask with
position=0
Retire_subtask(position, result)
Send a ‘run’ message with the previous
result to the subtask with position =
(position+1), OR retire the Pipeline if no
more tasks
•
•
•
•
Composite Task:
Group
Run
Send a ‘run’ message to all subtasks
Retire_subtask(position, result)
Decrement the num_waiting counter
If num_waiting is 0, retire the group
Collect subtask results, complete
•
•
•
•
•
•
Reducing Polling
Reserving messages is expensive
Use Pub/Sub system instead
Publish to the channel whenever a
message is ready to be handled
Each worker subscribes to the channel
•
•
•
•
Pub/Sub for MongoDB
Capped
Collection
Fixed size
Fast inserts
“Tailable”
cursors
•
•
•
Tailable
Cursor
Getting a Tailable
Cursor
def get_cursor(collection, topic_re, await_data=True):
options = { 'tailable': True }
if await_data:
options['await_data'] = True
cur = collection.find(
{ 'k': topic_re },
**options)
cur = cur.hint([('$natural', 1)]) # ensure we don't use any indexes
return cur
import re, time
while True:
cur = get_cursor(
db.capped_collection,
re.compile('^foo'),
await_data=True)
for msg in cur:
do_something(msg)
time.sleep(0.1)
Holds open cursor for a
while
Make curso
Don’t use indexes
Still some polling
producer, so do
too fast
Building in retry...
def get_cursor(collection, topic_re, last_id=-1, await_data=True):
options = { 'tailable': True }
spec = {
'id': { '$gt': last_id }, # only new messages
'k': topic_re }
if await_data:
options['await_data'] = True
cur = collection.find(spec, **options)
cur = cur.hint([('$natural', 1)]) # ensure we don't use any indexes
return cur
Integer au
Building auto-increment
class Sequence(object):
...
def next(self, sname, inc=1):
doc = self._db[self._name].find_and_modify(
query={'_id': sname},
update={'$inc': { 'value': inc } },
upsert=True,
new=True)
return doc['value']
Atomically $inc the
dedicated document
Ludicrous Speed
from pymongo.cursor import _QUERY_OPTIONS
def get_cursor(collection, topic_re, last_id=-1, await_data=True):
options = { 'tailable': True }
spec = {
'ts': { '$gt': last_id }, # only new messages
'k': topic_re }
if await_data:
options['await_data'] = True
cur = collection.find(spec, **options)
cur = cur.hint([('$natural', 1)]) # ensure we don't use any indexes
if await:
cur = cur.add_option(_QUERY_OPTIONS['oplog_replay'])
return cur
id ==> ts
Co-opt the
oplog_repla
option
Performance
Questions?
Rick Copeland
rick@synapp.io
@rick446

More Related Content

What's hot

Hadoop Interview Questions And Answers Part-2 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-2 | Big Data Interview Questions ...Hadoop Interview Questions And Answers Part-2 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-2 | Big Data Interview Questions ...
Simplilearn
 
Build your own discovery index of scholary e-resources
Build your own discovery index of scholary e-resourcesBuild your own discovery index of scholary e-resources
Build your own discovery index of scholary e-resources
Martin Czygan
 
Spark로 알아보는 빅데이터 처리
Spark로 알아보는 빅데이터 처리Spark로 알아보는 빅데이터 처리
Spark로 알아보는 빅데이터 처리
Jeong-gyu Kim
 
SeaweedFS introduction
SeaweedFS introductionSeaweedFS introduction
SeaweedFS introduction
chrislusf
 
C* Summit 2013: The World's Next Top Data Model by Patrick McFadin
C* Summit 2013: The World's Next Top Data Model by Patrick McFadinC* Summit 2013: The World's Next Top Data Model by Patrick McFadin
C* Summit 2013: The World's Next Top Data Model by Patrick McFadin
DataStax Academy
 
AF Ceph: Ceph Performance Analysis and Improvement on Flash
AF Ceph: Ceph Performance Analysis and Improvement on FlashAF Ceph: Ceph Performance Analysis and Improvement on Flash
AF Ceph: Ceph Performance Analysis and Improvement on Flash
Ceph Community
 
Introduction to the Disruptor
Introduction to the DisruptorIntroduction to the Disruptor
Introduction to the Disruptor
Trisha Gee
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
Hadoop
HadoopHadoop
How to build massive service for advance
How to build massive service for advanceHow to build massive service for advance
How to build massive service for advance
DaeMyung Kang
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
Wes McKinney
 
Anatomy of a Spring Boot App with Clean Architecture - Spring I/O 2023
Anatomy of a Spring Boot App with Clean Architecture - Spring I/O 2023Anatomy of a Spring Boot App with Clean Architecture - Spring I/O 2023
Anatomy of a Spring Boot App with Clean Architecture - Spring I/O 2023
Steve Pember
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
Databricks
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama
 
Vector databases and neural search
Vector databases and neural searchVector databases and neural search
Vector databases and neural search
Dmitry Kan
 
Working with JSON Data in PostgreSQL vs. MongoDB
Working with JSON Data in PostgreSQL vs. MongoDBWorking with JSON Data in PostgreSQL vs. MongoDB
Working with JSON Data in PostgreSQL vs. MongoDB
ScaleGrid.io
 
Data engineering design patterns
Data engineering design patternsData engineering design patterns
Data engineering design patterns
Valdas Maksimavičius
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
Ryan Blue
 
Getting Started with NgRx (Redux) Angular
Getting Started with NgRx (Redux) AngularGetting Started with NgRx (Redux) Angular
Getting Started with NgRx (Redux) Angular
Gustavo Costa
 

What's hot (20)

Hadoop Interview Questions And Answers Part-2 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-2 | Big Data Interview Questions ...Hadoop Interview Questions And Answers Part-2 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-2 | Big Data Interview Questions ...
 
Build your own discovery index of scholary e-resources
Build your own discovery index of scholary e-resourcesBuild your own discovery index of scholary e-resources
Build your own discovery index of scholary e-resources
 
Spark로 알아보는 빅데이터 처리
Spark로 알아보는 빅데이터 처리Spark로 알아보는 빅데이터 처리
Spark로 알아보는 빅데이터 처리
 
SeaweedFS introduction
SeaweedFS introductionSeaweedFS introduction
SeaweedFS introduction
 
C* Summit 2013: The World's Next Top Data Model by Patrick McFadin
C* Summit 2013: The World's Next Top Data Model by Patrick McFadinC* Summit 2013: The World's Next Top Data Model by Patrick McFadin
C* Summit 2013: The World's Next Top Data Model by Patrick McFadin
 
AF Ceph: Ceph Performance Analysis and Improvement on Flash
AF Ceph: Ceph Performance Analysis and Improvement on FlashAF Ceph: Ceph Performance Analysis and Improvement on Flash
AF Ceph: Ceph Performance Analysis and Improvement on Flash
 
Introduction to the Disruptor
Introduction to the DisruptorIntroduction to the Disruptor
Introduction to the Disruptor
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Hadoop
HadoopHadoop
Hadoop
 
How to build massive service for advance
How to build massive service for advanceHow to build massive service for advance
How to build massive service for advance
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Anatomy of a Spring Boot App with Clean Architecture - Spring I/O 2023
Anatomy of a Spring Boot App with Clean Architecture - Spring I/O 2023Anatomy of a Spring Boot App with Clean Architecture - Spring I/O 2023
Anatomy of a Spring Boot App with Clean Architecture - Spring I/O 2023
 
MongoDB
MongoDBMongoDB
MongoDB
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Vector databases and neural search
Vector databases and neural searchVector databases and neural search
Vector databases and neural search
 
Working with JSON Data in PostgreSQL vs. MongoDB
Working with JSON Data in PostgreSQL vs. MongoDBWorking with JSON Data in PostgreSQL vs. MongoDB
Working with JSON Data in PostgreSQL vs. MongoDB
 
Data engineering design patterns
Data engineering design patternsData engineering design patterns
Data engineering design patterns
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
Getting Started with NgRx (Redux) Angular
Getting Started with NgRx (Redux) AngularGetting Started with NgRx (Redux) Angular
Getting Started with NgRx (Redux) Angular
 

Similar to Building a High-Performance Distributed Task Queue on MongoDB

Streaming Way to Webscale: How We Scale Bitly via Streaming
Streaming Way to Webscale: How We Scale Bitly via StreamingStreaming Way to Webscale: How We Scale Bitly via Streaming
Streaming Way to Webscale: How We Scale Bitly via Streaming
All Things Open
 
Deep Learning for Computer Vision: Software Frameworks (UPC 2016)
Deep Learning for Computer Vision: Software Frameworks (UPC 2016)Deep Learning for Computer Vision: Software Frameworks (UPC 2016)
Deep Learning for Computer Vision: Software Frameworks (UPC 2016)
Universitat Politècnica de Catalunya
 
Debugging of (C)Python applications
Debugging of (C)Python applicationsDebugging of (C)Python applications
Debugging of (C)Python applications
Roman Podoliaka
 
Functional programming with Ruby - can make you look smart
Functional programming with Ruby - can make you look smartFunctional programming with Ruby - can make you look smart
Functional programming with Ruby - can make you look smart
Chen Fisher
 
Groovy On Trading Desk (2010)
Groovy On Trading Desk (2010)Groovy On Trading Desk (2010)
Groovy On Trading Desk (2010)
Jonathan Felch
 
Celery with python
Celery with pythonCelery with python
Celery with python
Alexandre González Rodríguez
 
Lambdas puzzler - Peter Lawrey
Lambdas puzzler - Peter LawreyLambdas puzzler - Peter Lawrey
Lambdas puzzler - Peter Lawrey
JAXLondon_Conference
 
Flux and InfluxDB 2.0 by Paul Dix
Flux and InfluxDB 2.0 by Paul DixFlux and InfluxDB 2.0 by Paul Dix
Flux and InfluxDB 2.0 by Paul Dix
InfluxData
 
Scala in Places API
Scala in Places APIScala in Places API
Scala in Places API
Łukasz Bałamut
 
Node Boot Camp
Node Boot CampNode Boot Camp
Node Boot Camp
Troy Miles
 
Making fitting in RooFit faster
Making fitting in RooFit fasterMaking fitting in RooFit faster
Making fitting in RooFit faster
Patrick Bos
 
Angular2 for Beginners
Angular2 for BeginnersAngular2 for Beginners
Angular2 for Beginners
Oswald Campesato
 
H2O World - What's New in H2O with Cliff Click
H2O World - What's New in H2O with Cliff ClickH2O World - What's New in H2O with Cliff Click
H2O World - What's New in H2O with Cliff Click
Sri Ambati
 
Scalding big ADta
Scalding big ADtaScalding big ADta
Scalding big ADta
b0ris_1
 
Parallel and Async Programming With C#
Parallel and Async Programming With C#Parallel and Async Programming With C#
Parallel and Async Programming With C#
Rainer Stropek
 
ELK stack at weibo.com
ELK stack at weibo.comELK stack at weibo.com
ELK stack at weibo.com
琛琳 饶
 
Unit-2 Getting Input from User.pptx
Unit-2 Getting Input from User.pptxUnit-2 Getting Input from User.pptx
Unit-2 Getting Input from User.pptx
Lovely Professional University
 
Getting Input from User
Getting Input from UserGetting Input from User
Getting Input from User
Lovely Professional University
 
Celery
CeleryCelery

Similar to Building a High-Performance Distributed Task Queue on MongoDB (20)

Streaming Way to Webscale: How We Scale Bitly via Streaming
Streaming Way to Webscale: How We Scale Bitly via StreamingStreaming Way to Webscale: How We Scale Bitly via Streaming
Streaming Way to Webscale: How We Scale Bitly via Streaming
 
Deep Learning for Computer Vision: Software Frameworks (UPC 2016)
Deep Learning for Computer Vision: Software Frameworks (UPC 2016)Deep Learning for Computer Vision: Software Frameworks (UPC 2016)
Deep Learning for Computer Vision: Software Frameworks (UPC 2016)
 
Debugging of (C)Python applications
Debugging of (C)Python applicationsDebugging of (C)Python applications
Debugging of (C)Python applications
 
Functional programming with Ruby - can make you look smart
Functional programming with Ruby - can make you look smartFunctional programming with Ruby - can make you look smart
Functional programming with Ruby - can make you look smart
 
Groovy On Trading Desk (2010)
Groovy On Trading Desk (2010)Groovy On Trading Desk (2010)
Groovy On Trading Desk (2010)
 
Celery with python
Celery with pythonCelery with python
Celery with python
 
Lambdas puzzler - Peter Lawrey
Lambdas puzzler - Peter LawreyLambdas puzzler - Peter Lawrey
Lambdas puzzler - Peter Lawrey
 
Flux and InfluxDB 2.0 by Paul Dix
Flux and InfluxDB 2.0 by Paul DixFlux and InfluxDB 2.0 by Paul Dix
Flux and InfluxDB 2.0 by Paul Dix
 
Scala in Places API
Scala in Places APIScala in Places API
Scala in Places API
 
Dynomite Nosql
Dynomite NosqlDynomite Nosql
Dynomite Nosql
 
Node Boot Camp
Node Boot CampNode Boot Camp
Node Boot Camp
 
Making fitting in RooFit faster
Making fitting in RooFit fasterMaking fitting in RooFit faster
Making fitting in RooFit faster
 
Angular2 for Beginners
Angular2 for BeginnersAngular2 for Beginners
Angular2 for Beginners
 
H2O World - What's New in H2O with Cliff Click
H2O World - What's New in H2O with Cliff ClickH2O World - What's New in H2O with Cliff Click
H2O World - What's New in H2O with Cliff Click
 
Scalding big ADta
Scalding big ADtaScalding big ADta
Scalding big ADta
 
Parallel and Async Programming With C#
Parallel and Async Programming With C#Parallel and Async Programming With C#
Parallel and Async Programming With C#
 
ELK stack at weibo.com
ELK stack at weibo.comELK stack at weibo.com
ELK stack at weibo.com
 
Unit-2 Getting Input from User.pptx
Unit-2 Getting Input from User.pptxUnit-2 Getting Input from User.pptx
Unit-2 Getting Input from User.pptx
 
Getting Input from User
Getting Input from UserGetting Input from User
Getting Input from User
 
Celery
CeleryCelery
Celery
 

More from MongoDB

MongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB SoCal 2020: Migrate Anything* to MongoDB AtlasMongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB
 
MongoDB SoCal 2020: Go on a Data Safari with MongoDB Charts!
MongoDB SoCal 2020: Go on a Data Safari with MongoDB Charts!MongoDB SoCal 2020: Go on a Data Safari with MongoDB Charts!
MongoDB SoCal 2020: Go on a Data Safari with MongoDB Charts!
MongoDB
 
MongoDB SoCal 2020: Using MongoDB Services in Kubernetes: Any Platform, Devel...
MongoDB SoCal 2020: Using MongoDB Services in Kubernetes: Any Platform, Devel...MongoDB SoCal 2020: Using MongoDB Services in Kubernetes: Any Platform, Devel...
MongoDB SoCal 2020: Using MongoDB Services in Kubernetes: Any Platform, Devel...
MongoDB
 
MongoDB SoCal 2020: A Complete Methodology of Data Modeling for MongoDB
MongoDB SoCal 2020: A Complete Methodology of Data Modeling for MongoDBMongoDB SoCal 2020: A Complete Methodology of Data Modeling for MongoDB
MongoDB SoCal 2020: A Complete Methodology of Data Modeling for MongoDB
MongoDB
 
MongoDB SoCal 2020: From Pharmacist to Analyst: Leveraging MongoDB for Real-T...
MongoDB SoCal 2020: From Pharmacist to Analyst: Leveraging MongoDB for Real-T...MongoDB SoCal 2020: From Pharmacist to Analyst: Leveraging MongoDB for Real-T...
MongoDB SoCal 2020: From Pharmacist to Analyst: Leveraging MongoDB for Real-T...
MongoDB
 
MongoDB SoCal 2020: Best Practices for Working with IoT and Time-series Data
MongoDB SoCal 2020: Best Practices for Working with IoT and Time-series DataMongoDB SoCal 2020: Best Practices for Working with IoT and Time-series Data
MongoDB SoCal 2020: Best Practices for Working with IoT and Time-series Data
MongoDB
 
MongoDB SoCal 2020: MongoDB Atlas Jump Start
 MongoDB SoCal 2020: MongoDB Atlas Jump Start MongoDB SoCal 2020: MongoDB Atlas Jump Start
MongoDB SoCal 2020: MongoDB Atlas Jump Start
MongoDB
 
MongoDB .local San Francisco 2020: Powering the new age data demands [Infosys]
MongoDB .local San Francisco 2020: Powering the new age data demands [Infosys]MongoDB .local San Francisco 2020: Powering the new age data demands [Infosys]
MongoDB .local San Francisco 2020: Powering the new age data demands [Infosys]
MongoDB
 
MongoDB .local San Francisco 2020: Using Client Side Encryption in MongoDB 4.2
MongoDB .local San Francisco 2020: Using Client Side Encryption in MongoDB 4.2MongoDB .local San Francisco 2020: Using Client Side Encryption in MongoDB 4.2
MongoDB .local San Francisco 2020: Using Client Side Encryption in MongoDB 4.2
MongoDB
 
MongoDB .local San Francisco 2020: Using MongoDB Services in Kubernetes: any ...
MongoDB .local San Francisco 2020: Using MongoDB Services in Kubernetes: any ...MongoDB .local San Francisco 2020: Using MongoDB Services in Kubernetes: any ...
MongoDB .local San Francisco 2020: Using MongoDB Services in Kubernetes: any ...
MongoDB
 
MongoDB .local San Francisco 2020: Go on a Data Safari with MongoDB Charts!
MongoDB .local San Francisco 2020: Go on a Data Safari with MongoDB Charts!MongoDB .local San Francisco 2020: Go on a Data Safari with MongoDB Charts!
MongoDB .local San Francisco 2020: Go on a Data Safari with MongoDB Charts!
MongoDB
 
MongoDB .local San Francisco 2020: From SQL to NoSQL -- Changing Your Mindset
MongoDB .local San Francisco 2020: From SQL to NoSQL -- Changing Your MindsetMongoDB .local San Francisco 2020: From SQL to NoSQL -- Changing Your Mindset
MongoDB .local San Francisco 2020: From SQL to NoSQL -- Changing Your Mindset
MongoDB
 
MongoDB .local San Francisco 2020: MongoDB Atlas Jumpstart
MongoDB .local San Francisco 2020: MongoDB Atlas JumpstartMongoDB .local San Francisco 2020: MongoDB Atlas Jumpstart
MongoDB .local San Francisco 2020: MongoDB Atlas Jumpstart
MongoDB
 
MongoDB .local San Francisco 2020: Tips and Tricks++ for Querying and Indexin...
MongoDB .local San Francisco 2020: Tips and Tricks++ for Querying and Indexin...MongoDB .local San Francisco 2020: Tips and Tricks++ for Querying and Indexin...
MongoDB .local San Francisco 2020: Tips and Tricks++ for Querying and Indexin...
MongoDB
 
MongoDB .local San Francisco 2020: Aggregation Pipeline Power++
MongoDB .local San Francisco 2020: Aggregation Pipeline Power++MongoDB .local San Francisco 2020: Aggregation Pipeline Power++
MongoDB .local San Francisco 2020: Aggregation Pipeline Power++
MongoDB
 
MongoDB .local San Francisco 2020: A Complete Methodology of Data Modeling fo...
MongoDB .local San Francisco 2020: A Complete Methodology of Data Modeling fo...MongoDB .local San Francisco 2020: A Complete Methodology of Data Modeling fo...
MongoDB .local San Francisco 2020: A Complete Methodology of Data Modeling fo...
MongoDB
 
MongoDB .local San Francisco 2020: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local San Francisco 2020: MongoDB Atlas Data Lake Technical Deep DiveMongoDB .local San Francisco 2020: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local San Francisco 2020: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB
 
MongoDB .local San Francisco 2020: Developing Alexa Skills with MongoDB & Golang
MongoDB .local San Francisco 2020: Developing Alexa Skills with MongoDB & GolangMongoDB .local San Francisco 2020: Developing Alexa Skills with MongoDB & Golang
MongoDB .local San Francisco 2020: Developing Alexa Skills with MongoDB & Golang
MongoDB
 
MongoDB .local Paris 2020: Realm : l'ingrédient secret pour de meilleures app...
MongoDB .local Paris 2020: Realm : l'ingrédient secret pour de meilleures app...MongoDB .local Paris 2020: Realm : l'ingrédient secret pour de meilleures app...
MongoDB .local Paris 2020: Realm : l'ingrédient secret pour de meilleures app...
MongoDB
 
MongoDB .local Paris 2020: Upply @MongoDB : Upply : Quand le Machine Learning...
MongoDB .local Paris 2020: Upply @MongoDB : Upply : Quand le Machine Learning...MongoDB .local Paris 2020: Upply @MongoDB : Upply : Quand le Machine Learning...
MongoDB .local Paris 2020: Upply @MongoDB : Upply : Quand le Machine Learning...
MongoDB
 

More from MongoDB (20)

MongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB SoCal 2020: Migrate Anything* to MongoDB AtlasMongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
 
MongoDB SoCal 2020: Go on a Data Safari with MongoDB Charts!
MongoDB SoCal 2020: Go on a Data Safari with MongoDB Charts!MongoDB SoCal 2020: Go on a Data Safari with MongoDB Charts!
MongoDB SoCal 2020: Go on a Data Safari with MongoDB Charts!
 
MongoDB SoCal 2020: Using MongoDB Services in Kubernetes: Any Platform, Devel...
MongoDB SoCal 2020: Using MongoDB Services in Kubernetes: Any Platform, Devel...MongoDB SoCal 2020: Using MongoDB Services in Kubernetes: Any Platform, Devel...
MongoDB SoCal 2020: Using MongoDB Services in Kubernetes: Any Platform, Devel...
 
MongoDB SoCal 2020: A Complete Methodology of Data Modeling for MongoDB
MongoDB SoCal 2020: A Complete Methodology of Data Modeling for MongoDBMongoDB SoCal 2020: A Complete Methodology of Data Modeling for MongoDB
MongoDB SoCal 2020: A Complete Methodology of Data Modeling for MongoDB
 
MongoDB SoCal 2020: From Pharmacist to Analyst: Leveraging MongoDB for Real-T...
MongoDB SoCal 2020: From Pharmacist to Analyst: Leveraging MongoDB for Real-T...MongoDB SoCal 2020: From Pharmacist to Analyst: Leveraging MongoDB for Real-T...
MongoDB SoCal 2020: From Pharmacist to Analyst: Leveraging MongoDB for Real-T...
 
MongoDB SoCal 2020: Best Practices for Working with IoT and Time-series Data
MongoDB SoCal 2020: Best Practices for Working with IoT and Time-series DataMongoDB SoCal 2020: Best Practices for Working with IoT and Time-series Data
MongoDB SoCal 2020: Best Practices for Working with IoT and Time-series Data
 
MongoDB SoCal 2020: MongoDB Atlas Jump Start
 MongoDB SoCal 2020: MongoDB Atlas Jump Start MongoDB SoCal 2020: MongoDB Atlas Jump Start
MongoDB SoCal 2020: MongoDB Atlas Jump Start
 
MongoDB .local San Francisco 2020: Powering the new age data demands [Infosys]
MongoDB .local San Francisco 2020: Powering the new age data demands [Infosys]MongoDB .local San Francisco 2020: Powering the new age data demands [Infosys]
MongoDB .local San Francisco 2020: Powering the new age data demands [Infosys]
 
MongoDB .local San Francisco 2020: Using Client Side Encryption in MongoDB 4.2
MongoDB .local San Francisco 2020: Using Client Side Encryption in MongoDB 4.2MongoDB .local San Francisco 2020: Using Client Side Encryption in MongoDB 4.2
MongoDB .local San Francisco 2020: Using Client Side Encryption in MongoDB 4.2
 
MongoDB .local San Francisco 2020: Using MongoDB Services in Kubernetes: any ...
MongoDB .local San Francisco 2020: Using MongoDB Services in Kubernetes: any ...MongoDB .local San Francisco 2020: Using MongoDB Services in Kubernetes: any ...
MongoDB .local San Francisco 2020: Using MongoDB Services in Kubernetes: any ...
 
MongoDB .local San Francisco 2020: Go on a Data Safari with MongoDB Charts!
MongoDB .local San Francisco 2020: Go on a Data Safari with MongoDB Charts!MongoDB .local San Francisco 2020: Go on a Data Safari with MongoDB Charts!
MongoDB .local San Francisco 2020: Go on a Data Safari with MongoDB Charts!
 
MongoDB .local San Francisco 2020: From SQL to NoSQL -- Changing Your Mindset
MongoDB .local San Francisco 2020: From SQL to NoSQL -- Changing Your MindsetMongoDB .local San Francisco 2020: From SQL to NoSQL -- Changing Your Mindset
MongoDB .local San Francisco 2020: From SQL to NoSQL -- Changing Your Mindset
 
MongoDB .local San Francisco 2020: MongoDB Atlas Jumpstart
MongoDB .local San Francisco 2020: MongoDB Atlas JumpstartMongoDB .local San Francisco 2020: MongoDB Atlas Jumpstart
MongoDB .local San Francisco 2020: MongoDB Atlas Jumpstart
 
MongoDB .local San Francisco 2020: Tips and Tricks++ for Querying and Indexin...
MongoDB .local San Francisco 2020: Tips and Tricks++ for Querying and Indexin...MongoDB .local San Francisco 2020: Tips and Tricks++ for Querying and Indexin...
MongoDB .local San Francisco 2020: Tips and Tricks++ for Querying and Indexin...
 
MongoDB .local San Francisco 2020: Aggregation Pipeline Power++
MongoDB .local San Francisco 2020: Aggregation Pipeline Power++MongoDB .local San Francisco 2020: Aggregation Pipeline Power++
MongoDB .local San Francisco 2020: Aggregation Pipeline Power++
 
MongoDB .local San Francisco 2020: A Complete Methodology of Data Modeling fo...
MongoDB .local San Francisco 2020: A Complete Methodology of Data Modeling fo...MongoDB .local San Francisco 2020: A Complete Methodology of Data Modeling fo...
MongoDB .local San Francisco 2020: A Complete Methodology of Data Modeling fo...
 
MongoDB .local San Francisco 2020: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local San Francisco 2020: MongoDB Atlas Data Lake Technical Deep DiveMongoDB .local San Francisco 2020: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local San Francisco 2020: MongoDB Atlas Data Lake Technical Deep Dive
 
MongoDB .local San Francisco 2020: Developing Alexa Skills with MongoDB & Golang
MongoDB .local San Francisco 2020: Developing Alexa Skills with MongoDB & GolangMongoDB .local San Francisco 2020: Developing Alexa Skills with MongoDB & Golang
MongoDB .local San Francisco 2020: Developing Alexa Skills with MongoDB & Golang
 
MongoDB .local Paris 2020: Realm : l'ingrédient secret pour de meilleures app...
MongoDB .local Paris 2020: Realm : l'ingrédient secret pour de meilleures app...MongoDB .local Paris 2020: Realm : l'ingrédient secret pour de meilleures app...
MongoDB .local Paris 2020: Realm : l'ingrédient secret pour de meilleures app...
 
MongoDB .local Paris 2020: Upply @MongoDB : Upply : Quand le Machine Learning...
MongoDB .local Paris 2020: Upply @MongoDB : Upply : Quand le Machine Learning...MongoDB .local Paris 2020: Upply @MongoDB : Upply : Quand le Machine Learning...
MongoDB .local Paris 2020: Upply @MongoDB : Upply : Quand le Machine Learning...
 

Building a High-Performance Distributed Task Queue on MongoDB

  • 1. Chapman: Building a Distributed Job Queue in MongoDB Rick Copeland @rick446 rick@synapp.io
  • 2. Getting to Know One Another Rick
  • 3. Roadmap Define the problem Schema design & operations Types of tasks Reducing Polling • • • •
  • 4. Requirements Digital Ocean Rackspace US Rackspace UK SMTP Server SMTP Server SMTP Server SMTP Server SMTP Server SMTP Server App Server
  • 8. Job Queue Schema: Message { _id: ObjectId(...), task_id: ObjectId(...), slot: 'run', s: { status: 'ready', ts: ISODateTime(...), q: 'chapman', pri: 10, w: '----------', }, args: Binary(...), kwargs: Binary(...), send_args: Binary(...), send_kwargs: Binary(...) } Task method to be run Destination Task Sched SynchroMessage arguments
  • 9. Job Queue Schema: TaskState { _id: ObjectId(...), type: 'Group', parent_id: ObjectId(...), on_complete: ObjectId(...), mq: [ObjectId(...), ...], status: 'pending', options: { queue: 'chapman', priority: 10, immutable: false, ignore_result: true, } result: Binary(...), data: {...} } Python class registered for task Parent task (if any) Message to be sent on completion
  • 10. Message State: Naive Approach Reserve Message Try to lock task Un- reserve Message Reserve Message Try to lock task Un- reserve Message Reserve Message Try to lock task Un- reserve Message Reserve Message Try to lock task Un- reserve Message Reserve Message Try to lock task Un- reserve Message Reserve Message Try to lock task Un- reserve Message Reserve Message Try to lock task Un- reserve Message Reserve Message Try to lock task Un- reserve Message Reserve Message Try to lock task Un- reserve Message Reserve Message Try to lock task Un- reserve Message Reserve Message Try to lock task Un- reserve Message Reserve Message Try to lock task Un- reserve Message Reserve Message Try to lock task Un- reserve Message
  • 11. Message States: Reserve Message findAndModify (‘ready’) s.state => ‘q1’ s.w => worker_id $push _id onto task’s mq field If msg is first in mq, s.state => ‘busy’ Start processing • • • • • • ready q1 busy q2 next pending
  • 12. Message States: Reserve Message findAndModify (‘next’) s.state => ‘busy’ s.w => worker_id start processing • • • • ready q1 busy q2 next pending
  • 13. Message States: Retire Message findAndModify TaskState $pull message _id from ‘mq’ findAndModify new first message in ‘mq’ if its s.state is in [‘q1’, ‘q2’] s.state => ‘next’ • • • • ready q1 busy q2 next pending
  • 14. Task States States mainly advisory success,failure transitions trigger on_complete message ‘chained’ is a tail- call optimization • • • pending active chainedfailuresuccess
  • 15. Basic Tasks: FunctionTask Simplest task: run a function to completion, set the result to the return value If a ‘Suspend’ exception is raised, move the task to ‘chained’ status Other exceptions set task to ‘failure’, save • • • @task(ignore_result=True, priority=50) def track(event, user_id=None, **kwargs): log.info('track(%s, %s...)', event, user_id) # ...
  • 16. Digression: Task Chaining Task state set to ‘chained’ New “Chain” task is created that will Call the “next” task When the “next” task completes, also complete the “current” task • • • • @task(ignore_result=True, priority=50) def function_task(*args, **kwargs): # ... Chain.call(some_other_task)
  • 17. Composite Tasks on_complete message for each subtask with slot=retire_subtask, specifying subtask position & the result of the subtask Different composite tasks implement ‘run’ and ‘retire_subtask’ differently • • task_state.update( { '_id': subtask_id }, { $set: { 'parent_id': composite_id, 'data.composite_position': position, 'options.ignore_result': false }} )
  • 18. Composite Task: Pipeline Run Send a ‘run’ message to the subtask with position=0 Retire_subtask(position, result) Send a ‘run’ message with the previous result to the subtask with position = (position+1), OR retire the Pipeline if no more tasks • • • •
  • 19. Composite Task: Group Run Send a ‘run’ message to all subtasks Retire_subtask(position, result) Decrement the num_waiting counter If num_waiting is 0, retire the group Collect subtask results, complete • • • • • •
  • 20. Reducing Polling Reserving messages is expensive Use Pub/Sub system instead Publish to the channel whenever a message is ready to be handled Each worker subscribes to the channel • • • •
  • 21. Pub/Sub for MongoDB Capped Collection Fixed size Fast inserts “Tailable” cursors • • • Tailable Cursor
  • 22. Getting a Tailable Cursor def get_cursor(collection, topic_re, await_data=True): options = { 'tailable': True } if await_data: options['await_data'] = True cur = collection.find( { 'k': topic_re }, **options) cur = cur.hint([('$natural', 1)]) # ensure we don't use any indexes return cur import re, time while True: cur = get_cursor( db.capped_collection, re.compile('^foo'), await_data=True) for msg in cur: do_something(msg) time.sleep(0.1) Holds open cursor for a while Make curso Don’t use indexes Still some polling producer, so do too fast
  • 23. Building in retry... def get_cursor(collection, topic_re, last_id=-1, await_data=True): options = { 'tailable': True } spec = { 'id': { '$gt': last_id }, # only new messages 'k': topic_re } if await_data: options['await_data'] = True cur = collection.find(spec, **options) cur = cur.hint([('$natural', 1)]) # ensure we don't use any indexes return cur Integer au
  • 24. Building auto-increment class Sequence(object): ... def next(self, sname, inc=1): doc = self._db[self._name].find_and_modify( query={'_id': sname}, update={'$inc': { 'value': inc } }, upsert=True, new=True) return doc['value'] Atomically $inc the dedicated document
  • 25. Ludicrous Speed from pymongo.cursor import _QUERY_OPTIONS def get_cursor(collection, topic_re, last_id=-1, await_data=True): options = { 'tailable': True } spec = { 'ts': { '$gt': last_id }, # only new messages 'k': topic_re } if await_data: options['await_data'] = True cur = collection.find(spec, **options) cur = cur.hint([('$natural', 1)]) # ensure we don't use any indexes if await: cur = cur.add_option(_QUERY_OPTIONS['oplog_replay']) return cur id ==> ts Co-opt the oplog_repla option