Keeping Your Workers In Line:
Brad Whitaker, Lisa Phillips
use TheSchwartz;
Once upon a time... in a galaxy far, far away
- Users wanted features like subscription-based notifications, pinging external services, and other things which love to tie up webserver processes and are generally too slow / blocking to execute synchronously with web requests
(Yes, that was really my best slide-fu, so bear with me)
So we needed to solve the problems these features create
Which can easily result in a mess
Contacting External Web Services
- In LJ's case: weblogs.com, updates.sixapart.com, event notifications to Mother Russia (seriously)
- Services need to be contacted whenever an entry / comment / asset is created
Processing Uploaded Media
- Pictures need to be scaled
- Videos need to be transcoded
- Shouldn't be done on the webserver
  - too slow
  - hogs CPU/memory resources
  - requires unnecessary libraries loaded into Apache
Initial Solution: GhettoQueue
- Some sort of buffer on disk/database
- Or worse, a queue which gets blocked when a single job repeatedly fails
- Cron or daemon to process the queue
- Gets really behind, pretty flaky, generally annoying
- Think 'qbufferd' in LJ, which was the bane of our existence for years
- Hard to administer!
Incoming data from lots of different transports
- Incoming emails
- Incoming SMS and outbound response
- Audio from Asterisk
- ... just to name a few
Initial Solution: Lots of Daemons!
- Lots of daemons! In LJ: phonepostd, mailgated
- Mostly consist of boilerplate code to read / manage a spool directory, daemonize, handle locking
- Very little code dedicated to processing the actual data
- Hard to administer!
Events, Subscriptions, Notifications?
- 1 event => many subscribers => many notifications
- Much too slow to find subscribers and issue notifications synchronously
- Existing GhettoQueue mechanism sucks, so that's not an option
- ... Needs a real solution for reliable job processing
Problems With These Approaches
- Everything is different for each service: monitoring, tools for administration, troubleshooting
- Operations people end up hating new features
- Each one brings a new set of headaches
- Fine for a while, but at some point becomes ridiculous
 
Implementation
- Perl + MySQL / SQLite
- SQLite mostly for test suite
- Python/Ruby/etc. people: don't worry! Plans are underway to make TheSchwartz language-agnostic
Topology
- One or more databases
- Worker machines, each running many worker processes
Schwartz Database
- Keeps track of: jobs and their args, errors, exit status
- ...That's it!
- Small schema, can mostly stay in memory
- Scalable: inserts are random to any database
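A minimal sketch of what "inserts are random to any database" could look like on the client side. This is not TheSchwartz's actual code: the DSNs, the `insert_job` helper, and the fall-through to the next database on failure are all assumptions made for illustration.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical DSNs; any number of Schwartz databases could be listed here.
my @databases = (
    'DBI:mysql:theschwartz;host=db1.example.com',
    'DBI:mysql:theschwartz;host=db2.example.com',
    'DBI:mysql:theschwartz;host=db3.example.com',
);

# Try the databases in random order and return the first one that accepts
# the job. $try_insert is a callback standing in for the real INSERT.
sub insert_job {
    my ($try_insert, @dbs) = @_;
    for my $dsn (shuffle @dbs) {
        return $dsn if $try_insert->($dsn);
    }
    return;    # every database refused the job
}

# Simulate db2 being down: the job still lands on one of the others.
my $chosen = insert_job( sub { $_[0] !~ /db2/ }, @databases );
print "job stored on $chosen\n";
```

Because no single database has to see every insert, capacity can grow by adding another entry to the list.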
Schwartz Workers
- TheSchwartz::Worker subclasses
- Know how to handle one or more job types
- Accept TheSchwartz::Job as single parameter
- ... That's it!
Full Topology: With Application
Request Cycle
1) Event to application
2) Application registers Schwartz job
3) Worker grabs Schwartz job
4) Worker does work (usually modifying application data)
5) Worker (optionally) stores result in Schwartz database
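The five steps can be walked through with an in-memory stand-in. Nothing here talks to a real Schwartz database: `@queue`, `%exit_status`, `%app_data`, `register_job`, `work_once`, and the `resize_picture` job type are hypothetical names used only to show the order of events.

```perl
use strict;
use warnings;

my @queue;          # stands in for the Schwartz job table
my %exit_status;    # stands in for the stored exit status
my %app_data;       # stands in for the application's own data
my $next_id = 1;

# Step 2: the application registers a job and returns to the user at once.
sub register_job {
    my ($funcname, $arg) = @_;
    my $id = $next_id++;
    push @queue, { id => $id, funcname => $funcname, arg => $arg };
    return $id;
}

# Hypothetical job handlers, keyed by job type.
my %handlers = (
    resize_picture => sub {
        my ($arg) = @_;
        # Step 4: do the work, modifying application data.
        $app_data{ $arg->{picture} } = "scaled to $arg->{width}px";
        return 0;    # exit status 0 means success
    },
);

# Steps 3 to 5: a worker grabs a job, does the work, records the result.
sub work_once {
    my $job = shift @queue or return 0;
    my $status = $handlers{ $job->{funcname} }->( $job->{arg} );
    $exit_status{ $job->{id} } = $status;
    return 1;
}

# Step 1: an event reaches the application (a user uploads a picture).
my $id = register_job( resize_picture => { picture => 'cat.jpg', width => 640 } );
1 while work_once();
print "job $id exit status $exit_status{$id}: $app_data{'cat.jpg'}\n";
```

The point of the split is that `register_job` is cheap, so the web request finishes without waiting for `work_once`.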
Let's look at some code...
Application code
Worker code
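A rough sketch of the shape a worker might take; it is not the code from the slide. The class name `MyApp::Worker::PingService`, the `ping` helper, and the `StubJob` test double are hypothetical. Only `arg`, `completed`, and `failed` are meant to mirror methods of TheSchwartz::Job, and the `use base` line is left as a comment so the example runs without the module installed.

```perl
use strict;
use warnings;

package MyApp::Worker::PingService;    # hypothetical worker class
# A real worker would add:  use base 'TheSchwartz::Worker';
# The application side would then queue work with something like:
#   $client->insert('MyApp::Worker::PingService', { url => $url });

# Hypothetical helper standing in for the actual HTTP ping;
# it simply pretends that any http(s) URL can be reached.
sub ping {
    my ($url) = @_;
    return $url =~ m{^https?://} ? 1 : 0;
}

# work() receives the job as its single parameter (after the class name).
sub work {
    my ($class, $job) = @_;
    my $arg = $job->arg;    # whatever the application stored with the job
    if ( ping( $arg->{url} ) ) {
        $job->completed;    # mark the job as done
    }
    else {
        $job->failed("could not ping $arg->{url}");    # record the error
    }
    return;
}

package StubJob;    # test double exposing the three job methods used above

sub new {
    my ($class, %fields) = @_;
    my $self = { %fields, state => 'new' };
    return bless $self, $class;
}
sub arg       { $_[0]{arg} }
sub completed { $_[0]{state} = 'completed' }
sub failed    { $_[0]{state} = "failed: $_[1]" }

package main;

my $job = StubJob->new( arg => { url => 'http://example.com/ping' } );
MyApp::Worker::PingService->work($job);
print "job state: $job->{state}\n";
```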
It's really that simple...
But it doesn't have to be
Workers can define other per-worker behaviors
- Retries:
    sub max_retries { 5 }  # 5 tries
    sub retry_delay {
        my ($class, $fail_ct) = @_;
        return 2 ** $fail_ct;
    }
- Max time a process can work on a job:
    sub grab_for { 300 }  # seconds
- Keeping exit status:
    sub keep_exit_status_for { 86400 }  # 1 day
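To see what the `retry_delay` above works out to, the two subs can be called directly. This sketch assumes `$fail_ct` is the number of failures so far, counting from 1; the class name is hypothetical and the `use base` line is commented out so the snippet runs standalone.

```perl
use strict;
use warnings;

package MyApp::Worker::SendEmail;    # hypothetical worker class
# A real worker would add:  use base 'TheSchwartz::Worker';

sub max_retries { 5 }

sub retry_delay {
    my ($class, $fail_ct) = @_;
    return 2 ** $fail_ct;            # seconds to wait after the Nth failure
}

package main;

# Compute the wait after each failure, assuming $fail_ct counts from 1.
my $class  = 'MyApp::Worker::SendEmail';
my @delays = map { $class->retry_delay($_) } 1 .. $class->max_retries;
print "wait after failure $_: $delays[ $_ - 1 ]s\n" for 1 .. @delays;
```

Under that assumption the waits double each time (2, 4, 8, 16, 32 seconds), so a flaky remote service is retried quickly at first and then left alone for longer.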
Other fun features
- Coalescing based on prefix
  - Coalescing field explicitly stated when job is inserted
  - “Give me all jobs that are sending email to Yahoo”
- Atomic job replacement
  - For splitting one job up into many, which other workers can immediately start working on
- Scheduling future jobs
  - Because we hate cron
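Coalescing by prefix can be pictured as a prefix match over a per-job field. The sketch below is an in-memory illustration only: the job list, the `jobs_with_prefix` helper, and the key format (reversed domain first, then the user) are assumptions chosen so that all mail for one provider shares a prefix.

```perl
use strict;
use warnings;

# Hypothetical queued jobs. The coalesce value is the recipient address with
# the domain reversed and placed first, so one provider shares one prefix.
my @jobs = (
    { id => 1, coalesce => 'com.yahoo@alice' },
    { id => 2, coalesce => 'net.cox@bob' },
    { id => 3, coalesce => 'com.yahoo@carol' },
);

# Return every job whose coalesce value starts with $prefix.
sub jobs_with_prefix {
    my ($prefix, @queue) = @_;
    return grep { index( $_->{coalesce}, $prefix ) == 0 } @queue;
}

# "Give me all jobs that are sending email to Yahoo"
my @yahoo = jobs_with_prefix( 'com.yahoo@', @jobs );
print "yahoo jobs: @{[ map { $_->{id} } @yahoo ]}\n";
```

A worker that grabs all matching jobs at once can deliver them over a single connection to the remote mail server instead of opening one per message.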
Using TheSchwartz in production
- LiveJournal currently handling over 100 jobs per second
Schwartz Database Configuration
- InnoDB
- Master-master replication, one side active
- Linux Heartbeat for shared VIP
- Automatic binlog purging
Schwartz Database Configuration
- ... and so on, adding clusters as needed
Tools and monitoring
- Schwartzmon
- Schwartz-rate
- LJWorkerctrl
- Nagios plugins for queues
- Triggers
Schwartzmon Example
lj@ljadmin1:~$ schwartzmon --dsn=DBI:mysql:theschwartz_livejournal\;host=10.191.90.101 --user=lj -f errors
Thu Jul 26 19:22:13 2007 [2116902910]: Connection failed to domain 'imagemenagerie.com', MXes: [imagemenagerie.com]
Thu Jul 26 19:22:14 2007 [2120335058]: Connection failed to domain 'thedashcat.net', MXes: [thedashcat.net]
Thu Jul 26 19:22:15 2007 [2126277932]: Connection failed to domain 'cox.net', MXes: [mx.west.cox.net mx.east.cox.net]
Thu Jul 26 19:22:16 2007 [2126007758]: Connection failed to domain 'cox.net', MXes: [mx.west.cox.net mx.east.cox.net]
Thu Jul 26 19:22:16 2007 [2126309296]: Permanent failure TO [hey2a@cs.com]: 550 MAILBOX NOT FOUND
Thu Jul 26 19:22:17 2007 [2126309446]: Permanent failure TO [joe_junkpan@hotmail.com]: 550 Requested action not taken: mailbox unavailable
Thu Jul 26 19:22:18 2007 [2126308836]: Error during DATAEND phase to [ourxtrees@yahoo.com]: 451 go ahead Message temporarily deferred - [250]
Thu Jul 26 19:22:18 2007 [2126188996]: Connection failed to domain 'buffyboarders.zzn.com', MXes: [c2mailmx.mailcentro.com c2mds.mailcentro.com]
Thu Jul 26 19:22:21 2007 [2125661508]: Connection failed to domain 'cox.net', MXes: [mx.east.cox.net mx.west.cox.net]
LJWorkerctrl example
Questions? http://code.sixapart.com/svn/TheSchwartz/trunk
