django-celery
a distributed task queue




                           alex@eftimie.ro
                               2013/04/11
problem
fast sites > slow sites

caching, load balancing, CDN, nosql, ajax, ...
special problems
send an email to 10,000 users

add watermark to an uploaded video

generate a big PDF
old-school solution: crontab
the good:
   - simple
   - easy to set up
the bad:
   - too simple
   - hard to scale
alternatives
scheduled (crontab-ish):
  - APScheduler

parallel work (celery-ish):
  - gearman
  - Huey
  - django-ztask
                              not a fan :-)
introducing Celery
asynchronous job queue/task queue
based on distributed message passing
focused on real time operation
works for scheduled tasks too!



open source and integrated with Django
how does it work?

[diagram: the website (a view) hands tasks to celery, which queues them on a
message broker (MB); workers 1…n consume the tasks and write results back]
task
@celery.task
def add(x, y):
    return x + y
...
add.delay(2, 2)  # somewhere in a view

atomic
ideally idempotent
same environment as the website
real-time or scheduled
task states
PENDING
STARTED
SUCCESS
FAILURE
RETRY
REVOKED
few tips about tasks
granularity
data locality
state


   “asserting the world is the responsibility of the task”
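The "state" tip (and "ideally idempotent" earlier) can be sketched in plain Python, with no broker involved: pass the object's id rather than the object itself, and re-fetch fresh state inside the task. The names here (USERS, send_welcome_email) are illustrative, not from the talk:

```python
# Illustrative only: a dict stands in for the database. The "task" takes an
# id, not an object, so it always sees fresh state (data locality), and it
# checks before acting, so retries are safe no-ops (idempotence).

USERS = {42: {"email": "alice@example.com", "welcomed": False}}

def send_welcome_email(user_id):
    user = USERS[user_id]      # re-fetch state inside the task
    if user["welcomed"]:       # already done: retrying changes nothing
        return False
    # ... actually send the email here ...
    user["welcomed"] = True
    return True
```

Running the task twice for the same user sends at most one email, which is exactly what you want when a worker dies mid-batch and the task is retried.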
subtasks
tasks spawned from within another task

calling:
add.subtask((1, 1)).delay()
add.s(1, 1).delay() # this is a shortcut


the primitives:
chains, groups, chords, maps, chunks
subtasks - partials
add.s(1, 1).delay()
1 + 1

partial:
partial = add.s(1) # incomplete signature
partial.delay(2)
2 + 1        # call-time args are prepended; same sum as add.s(1, 2).delay()
partial.delay(0)
0 + 1
subtasks: chains
subtasks running one after another
chain(add.s(4, 4), mul.s(8), mul.s(10))

or
add.s(4, 4) | mul.s(8) | mul.s(10)
((4 + 4) * 8) * 10
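As a sanity check, the chain above can be mimicked in plain Python: each step's return value is fed in as the first argument of the next step.

```python
def add(x, y): return x + y
def mul(x, y): return x * y

# add.s(4, 4) | mul.s(8) | mul.s(10): each result feeds the next signature
step1 = add(4, 4)        # 8
step2 = mul(step1, 8)    # 64
result = mul(step2, 10)  # 640
```

With celery the same pipeline runs asynchronously, each step potentially on a different worker.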
subtasks: groups
independent tasks running in parallel
g = group(add.s(2, 2), add.s(4, 4))
res = g()
res.get()
[4, 8]
subtasks: chords
same as groups, but apply callback on results
@task
def ts(numbers):
    return sum(numbers)

chord(add.s(i, i) for i in range(3))(ts.s())

6 # sum([0, 2, 4])


chord(headers)(callback)
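The chord semantics can also be checked synchronously in plain Python: run the header "tasks", collect their results, then apply the callback to the list.

```python
def add(x, y): return x + y
def ts(numbers): return sum(numbers)

# chord(headers)(callback): run the header tasks, then the callback on results
header_results = [add(i, i) for i in range(3)]   # [0, 2, 4] — the group part
total = ts(header_results)                       # the chord callback
```

The value celery adds is that the header tasks run in parallel on workers, and the callback fires only once all of them have finished.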
subtasks: chunks

split a long list of arguments into parts
res = add.chunks(zip(range(100), range(100)), 10)()
>>> res.get()
[[0, 2, 4, 6, 8, 10, 12, 14, 16, 18],
 [20, 22, 24, 26, 28, 30, 32, 34, 36, 38],
 [40, 42, 44, 46, 48, 50, 52, 54, 56, 58],
 [60, 62, 64, 66, 68, 70, 72, 74, 76, 78],
 [80, 82, 84, 86, 88, 90, 92, 94, 96, 98],
 [100, 102, 104, 106, 108, 110, 112, 114, 116, 118],
 [120, 122, 124, 126, 128, 130, 132, 134, 136, 138],
 [140, 142, 144, 146, 148, 150, 152, 154, 156, 158],
 [160, 162, 164, 166, 168, 170, 172, 174, 176, 178],
 [180, 182, 184, 186, 188, 190, 192, 194, 196, 198]]

                          (10 tasks of 10 adds each)
integrate with django project
tasks module inside django app
configuration setup
   - message broker
   - result backend
start a worker
$ python manage.py celery worker




  (start another, tune concurrency, monitor, ...)
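With django-celery (current at the time of this talk), the configuration step typically meant a few lines in settings.py; the broker URL and result backend below are example values, not the only options:

```python
# settings.py (django-celery era) — example values, adjust to your setup
import djcelery
djcelery.setup_loader()

BROKER_URL = 'amqp://guest:guest@localhost:5672//'  # RabbitMQ as message broker
CELERY_RESULT_BACKEND = 'database'                  # store results via Django ORM

INSTALLED_APPS = (
    # ... your apps ...
    'djcelery',
)
```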
hidden ad




            supervisord.org
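A minimal supervisord program entry for keeping the worker alive might look like this; all paths and names are placeholders:

```ini
; /etc/supervisor/conf.d/celery.conf — illustrative paths only
[program:celery]
command=/path/to/venv/bin/python manage.py celery worker --loglevel=INFO
directory=/path/to/project
user=www-data
autostart=true
autorestart=true
```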
case study: korect
quiz exam management and automatic paper
processing using OMR (optical mark recognition)


existing solution: desktop app, ~2 tests/minute
using ReportLab, PyPDF, OpenCV
django + celery version
~ 20 tests/minute
same machine, 4 worker processes
parallelized parts:
   - print file generation
   - paper scanning
   - correcting and grading
   - question usage report
example - generate print file
add a Download object to db

delay the task (not the real task):
chord([test_pdf.s(t) for t in tests])(merge_pdf.s())

update page with download status
...
at the end of merge_pdf:
      update status, flash user
last slide
a high-availability, high-performance solution
easy to set up
fun to use


                 celeryproject.org
?
