    1. 1. Keeping Your Workers In Line: <ul><li>Brad Whitaker </li></ul><ul><li>Lisa Phillips </li></ul>use TheSchwartz;
    2. 2. Once upon a time, in a galaxy far, far away, users wanted features like subscription-based notifications, pinging external services, and other things that love to tie up webserver processes and are generally too slow / blocking to execute synchronously with web requests
    4. 4. So we needed to solve the problems these features create
    5. 5. Which can easily result in a mess
    6. 6. Contacting External Web Services <ul><li>In LJ's case: </li></ul><ul><ul><li>weblogs.com </li></ul></ul><ul><ul><li>updates.sixapart.com </li></ul></ul><ul><ul><li>event notifications to Mother Russia </li></ul></ul><ul><ul><ul><li>seriously </li></ul></ul></ul><ul><li>Services need to be contacted whenever an entry / comment / asset is created </li></ul>
    7. 7. Processing Uploaded Media <ul><li>Pictures need to be scaled </li></ul><ul><li>Videos need to be transcoded </li></ul><ul><li>Shouldn’t be done on the webserver </li></ul><ul><ul><li>too slow </li></ul></ul><ul><ul><li>hogs CPU/memory resources </li></ul></ul><ul><ul><li>requires unnecessary libraries loaded into Apache </li></ul></ul>
    8. 8. Initial Solution: GhettoQueue <ul><li>Some sort of buffer on disk/database </li></ul><ul><ul><li>Or worse, a queue which gets blocked when a single job repeatedly fails </li></ul></ul><ul><li>Cron or daemon to process the queue </li></ul><ul><li>Gets really behind, pretty flaky, generally annoying </li></ul><ul><ul><li>Think 'qbufferd' in LJ </li></ul></ul><ul><ul><ul><li>Which was the bane of our existence </li></ul></ul></ul><ul><ul><ul><ul><li>For years. </li></ul></ul></ul></ul><ul><li>Hard to administer! </li></ul>
    9. 9. Incoming data from lots of different transports <ul><li>Incoming emails </li></ul><ul><li>Incoming SMS and outbound response </li></ul><ul><li>Audio from Asterisk </li></ul><ul><li>... Just to name a few </li></ul>
    10. 10. Initial Solution: Lots of Daemons! <ul><li>Lots of daemons! </li></ul><ul><li>In LJ: </li></ul><ul><ul><li>phonepostd </li></ul></ul><ul><ul><li>mailgated </li></ul></ul><ul><li>Mostly consist of boilerplate code to read / manage a spool directory, daemonize, handle locking </li></ul><ul><ul><li>Very little code dedicated to processing the actual data </li></ul></ul><ul><li>Hard to administer! </li></ul>
    11. 11. Events, Subscriptions, Notifications? <ul><li>1 event => many subscribers => many notifications </li></ul><ul><li>Much too slow to find subscribers and issue notifications synchronously </li></ul><ul><li>Existing GhettoQueue mechanism sucks, so that's not an option. </li></ul><ul><li>... </li></ul><ul><li>Needs a real solution to reliable job processing </li></ul>
    12. 12. Problems With These Approaches <ul><li>Everything is different for each service: </li></ul><ul><ul><li>Monitoring </li></ul></ul><ul><ul><li>Tools for Administration </li></ul></ul><ul><ul><li>Troubleshooting </li></ul></ul><ul><li>Operations people end up hating new features </li></ul><ul><ul><li>Each one brings a new set of headaches </li></ul></ul><ul><li>Fine for a while, but at some point becomes ridiculous </li></ul>
    13. 14. Implementation <ul><li>Perl + MySQL / SQLite </li></ul><ul><ul><li>SQLite mostly for test suite </li></ul></ul><ul><li>Python/Ruby/etc. people: Don't worry! Plans are underway to make TheSchwartz language-agnostic </li></ul>
    14. 15. Topology <ul><li>One or more databases </li></ul><ul><li>Worker machines </li></ul><ul><ul><li>Each running many worker processes </li></ul></ul>
    15. 16. Schwartz Database <ul><li>Keeps track of: </li></ul><ul><ul><li>Jobs and their args </li></ul></ul><ul><ul><li>Errors </li></ul></ul><ul><ul><li>Exit status </li></ul></ul><ul><ul><li>...That's it! </li></ul></ul><ul><li>Small schema, can mostly stay in memory </li></ul><ul><li>Scalable: each insert goes to a randomly chosen database </li></ul>
    16. 17. Schwartz Workers <ul><li>TheSchwartz::Worker subclasses </li></ul><ul><li>Know how to handle one or more job types </li></ul><ul><li>Accept TheSchwartz::Job as single parameter </li></ul><ul><li>... That's it! </li></ul>
    17. 18. Full Topology: With Application
    18. 19. Request Cycle <ul><li>1) Event to application </li></ul><ul><li>2) Application registers Schwartz job </li></ul><ul><li>3) Worker grabs Schwartz job </li></ul><ul><li>4) Worker does work (usually modifying application data) </li></ul><ul><li>5) Worker (optionally) stores result in Schwartz database </li></ul>
    20. 21. Application code
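The code shown on this slide did not survive in the transcript. As a rough sketch of step 2 of the request cycle, the application side might look something like the following. The worker class name, the `queue_notification` helper, the `event_id` argument, and the connection details are all placeholders invented for illustration; only `TheSchwartz->new` and `insert` come from the library itself.

```perl
use strict;
use warnings;
use TheSchwartz;

# Hypothetical worker class name; it has to match the class the workers register.
my $WORKER = 'MyApp::Worker::SendNotification';

# Queue one job and return quickly, so the web request is not blocked.
sub queue_notification {
    my ($client, $event_id) = @_;
    # insert() hands back a job handle on success, or undef if no database took the job.
    return $client->insert($WORKER, { event_id => $event_id });
}

unless (caller) {
    # Placeholder connection details for a single Schwartz database.
    my $client = TheSchwartz->new(
        databases => [
            { dsn => 'dbi:mysql:theschwartz', user => 'schwartz', pass => 'secret' },
        ],
    );
    my $handle = queue_notification($client, 42);
    print $handle ? "job queued\n" : "could not queue job\n";
}
```

Listing more than one entry under `databases` is how the client would spread inserts across several Schwartz databases.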
    21. 22. Worker code
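The code for this slide is also missing from the transcript. A minimal sketch of a worker, assuming a hypothetical `MyApp::Worker::SendNotification` class, an `event_id` argument, and placeholder database credentials, could look like this. The `run_worker` helper is an illustrative name, not part of TheSchwartz.

```perl
package MyApp::Worker::SendNotification;    # hypothetical worker class
use strict;
use warnings;
use base 'TheSchwartz::Worker';

# TheSchwartz calls work() as a class method with a TheSchwartz::Job.
sub work {
    my ($class, $job) = @_;
    my $args = $job->arg;    # whatever the application passed to insert()

    unless (ref($args) eq 'HASH' && defined $args->{event_id}) {
        # Bad input will never succeed, so skip any retries.
        $job->permanent_failure('missing event_id');
        return;
    }

    # ... the real work (finding subscribers, sending mail) would go here ...
    print "notifying subscribers of event $args->{event_id}\n";

    $job->completed;    # marks the job done and removes it from the queue
}

package main;
use TheSchwartz;

sub run_worker {
    # Placeholder connection details; use the same databases as the application.
    my $client = TheSchwartz->new(
        databases => [
            { dsn => 'dbi:mysql:theschwartz', user => 'schwartz', pass => 'secret' },
        ],
    );
    $client->can_do('MyApp::Worker::SendNotification');
    $client->work;    # loops forever, polling for jobs this process can do
}

run_worker() unless caller;
```

Nearly all of this is job-handling logic. The spool directory, daemonizing, and locking boilerplate from the old daemons has no counterpart here.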
    23. 24. But it doesn't have to be
    24. 25. Workers can define other per-worker behaviors <ul><li>Retries: </li></ul><ul><ul><li>sub max_retries { 5 } # 5 tries </li></ul></ul><ul><ul><li>sub retry_delay { </li></ul></ul><ul><ul><li>my ($class, $fail_ct) = @_; </li></ul></ul><ul><ul><li>return 2 ** $fail_ct; </li></ul></ul><ul><ul><li>} </li></ul></ul><ul><li>Max time a process can work on a job: </li></ul><ul><ul><li>sub grab_for { 300 } # seconds </li></ul></ul><ul><li>Keeping exit status: </li></ul><ul><ul><li>sub keep_exit_status_for { 86400 } # 1 day </li></ul></ul>
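To see what the retry settings above add up to, the following plain-Perl sketch prints the delay schedule they imply. `My::RetryPolicy` and `retry_schedule` are made-up names standing in for a real `TheSchwartz::Worker` subclass, and the sketch assumes the failure count passed to `retry_delay` runs from 1 up to `max_retries`.

```perl
use strict;
use warnings;

package My::RetryPolicy;    # hypothetical stand-in for a worker subclass

sub max_retries { 5 }

sub retry_delay {
    my ($class, $fail_ct) = @_;
    return 2 ** $fail_ct;    # exponential backoff, in seconds
}

package main;

# List the delay before each retry, assuming fail counts 1 .. max_retries.
sub retry_schedule {
    my ($class) = @_;
    return map { $class->retry_delay($_) } 1 .. $class->max_retries;
}

my @delays = retry_schedule('My::RetryPolicy');
for my $n (1 .. scalar @delays) {
    print "after failure $n: wait $delays[$n - 1]s\n";
}
```

Under that assumption the waits are 2, 4, 8, 16, and 32 seconds, so a job that keeps failing is given up on after roughly a minute of cumulative delay.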
    25. 26. Other fun features <ul><li>Coalescing based on prefix </li></ul><ul><ul><li>Coalescing field explicitly stated when job is inserted </li></ul></ul><ul><ul><li>“Give me all jobs that are sending email to Yahoo” </li></ul></ul><ul><li>Atomic job replacement </li></ul><ul><ul><li>For splitting one job up into many, which other workers can immediately start working on </li></ul></ul><ul><li>Scheduling future jobs </li></ul><ul><ul><li>Because we hate cron </li></ul></ul>
    26. 27. Using TheSchwartz in production <ul><li>LiveJournal is currently handling over 100 jobs per second </li></ul>
    27. 28. Schwartz Database Configuration <ul><li>InnoDB </li></ul><ul><li>Master-master replication </li></ul><ul><li>One side active </li></ul><ul><li>Linux Heartbeat for shared VIP </li></ul><ul><li>Automatic binlog purging </li></ul>
    28. 29. Schwartz Database Configuration … and so on, adding clusters as needed
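A hedged sketch of how these three features might fit together: one fan-out job is atomically replaced by per-recipient jobs, each tagged with a coalescing value and scheduled for later. The worker class names, the `jobs_for` helper, the `recipients` argument, and the host-first coalescing format are assumptions made for illustration; `coalesce`, `run_after`, and `replace_with` are the TheSchwartz pieces being shown.

```perl
package MyApp::Worker::FanOut;    # hypothetical worker that splits one job into many
use strict;
use warnings;
use base 'TheSchwartz::Worker';
use TheSchwartz::Job;

# Build one email job per recipient address.
sub jobs_for {
    my ($class, @recipients) = @_;
    return map {
        my ($user, $host) = split /\@/, $_, 2;
        TheSchwartz::Job->new(
            funcname  => 'MyApp::Worker::SendEmail',    # hypothetical worker class
            arg       => { to => $_ },
            coalesce  => lc("$host\@$user"),            # host first, so a prefix selects a domain
            run_after => time() + 60,                   # do not run until a minute from now
        );
    } @recipients;
}

sub work {
    my ($class, $job) = @_;
    # Atomically swap this job for the per-recipient jobs;
    # other workers can pick those up right away.
    $job->replace_with($class->jobs_for(@{ $job->arg->{recipients} }));
}
```

With values shaped like this, an email worker holding a client object could ask for `$client->find_job_with_coalescing_prefix('MyApp::Worker::SendEmail', 'yahoo.com@')` to pull another job bound for the same domain.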
    29. 30. Tools and monitoring <ul><li>Schwartzmon </li></ul><ul><li>Schwartz-rate </li></ul><ul><li>LJWorkerctrl </li></ul><ul><li>Nagios plugins for queues </li></ul><ul><li>Triggers </li></ul>
    30. 31. Schwartzmon Example
        lj@ljadmin1:~$ schwartzmon --dsn=DBI:mysql:theschwartz_livejournal;host= --user=lj -f errors
        Thu Jul 26 19:22:13 2007 [2116902910]: Connection failed to domain 'imagemenagerie.com', MXes: [imagemenagerie.com]
        Thu Jul 26 19:22:14 2007 [2120335058]: Connection failed to domain 'thedashcat.net', MXes: [thedashcat.net]
        Thu Jul 26 19:22:15 2007 [2126277932]: Connection failed to domain 'cox.net', MXes: [mx.west.cox.net mx.east.cox.net]
        Thu Jul 26 19:22:16 2007 [2126007758]: Connection failed to domain 'cox.net', MXes: [mx.west.cox.net mx.east.cox.net]
        Thu Jul 26 19:22:16 2007 [2126309296]: Permanent failure TO [hey2a@cs.com]: 550 MAILBOX NOT FOUND
        Thu Jul 26 19:22:17 2007 [2126309446]: Permanent failure TO [joe_junkpan@hotmail.com]: 550 Requested action not taken: mailbox unavailable
        Thu Jul 26 19:22:18 2007 [2126308836]: Error during DATAEND phase to [ourxtrees@yahoo.com]: 451 go ahead Message temporarily deferred - [250]
        Thu Jul 26 19:22:18 2007 [2126188996]: Connection failed to domain 'buffyboarders.zzn.com', MXes: [c2mailmx.mailcentro.com c2mds.mailcentro.com]
        Thu Jul 26 19:22:21 2007 [2125661508]: Connection failed to domain 'cox.net', MXes: [mx.east.cox.net mx.west.cox.net]
    31. 32. Ljworkerctrl example
    32. 33. Ljworkerctrl example <ul><li>… </li></ul>
    33. 34. Ljworkerctrl example <ul><li>… </li></ul>
    34. 35. Questions? http://code.sixapart.com/svn/TheSchwartz/trunk