Os Whitaker


Published in: Business, Technology

    1. 1. Keeping Your Workers In Line: use TheSchwartz; <ul><li>Brad Whitaker </li></ul><ul><li>Lisa Phillips </li></ul>
    2. 2. Once upon a time... in a galaxy far, far away, users wanted features like subscription-based notifications, pinging external services, and other things which love to tie up webserver processes and are generally too slow / blocking to execute synchronously with web requests.
    3. 3. (Yes, that was really my best slide-foo, so bear with me)
    4. 4. So we needed to solve the problems these features create
    5. 5. Which can easily result in a mess
    6. 6. Contacting External Web Services <ul><li>In LJ's case: </li></ul><ul><ul><li>weblogs.com </li></ul></ul><ul><ul><li>updates.sixapart.com </li></ul></ul><ul><ul><li>event notifications to Mother Russia </li></ul></ul><ul><ul><ul><li>seriously </li></ul></ul></ul><ul><li>Services need to be contacted whenever an entry / comment / asset is created </li></ul>
    7. 7. Processing Uploaded Media <ul><li>Pictures need to be scaled </li></ul><ul><li>Videos need to be transcoded </li></ul><ul><li>Shouldn’t be done on the webserver </li></ul><ul><ul><li>too slow </li></ul></ul><ul><ul><li>hogs CPU/memory resources </li></ul></ul><ul><ul><li>requires unnecessary libraries loaded into Apache </li></ul></ul>
    8. 8. Initial Solution: GhettoQueue <ul><li>Some sort of buffer on disk/database </li></ul><ul><ul><li>Or worse, a queue which gets blocked when a single job repeatedly fails </li></ul></ul><ul><li>Cron or daemon to process the queue </li></ul><ul><li>Gets really behind, pretty flaky, generally annoying </li></ul><ul><ul><li>Think 'qbufferd' in LJ </li></ul></ul><ul><ul><ul><li>Which was the bane of our existence </li></ul></ul></ul><ul><ul><ul><ul><li>For years. </li></ul></ul></ul></ul><ul><li>Hard to administer! </li></ul>
    9. 9. Incoming data from lots of different transports <ul><li>Incoming emails </li></ul><ul><li>Incoming SMS and outbound response </li></ul><ul><li>Audio from Asterisk </li></ul><ul><li>... Just to name a few </li></ul>
    10. 10. Initial Solution: Lots of Daemons! <ul><li>Lots of daemons! </li></ul><ul><li>In LJ: </li></ul><ul><ul><li>phonepostd </li></ul></ul><ul><ul><li>mailgated </li></ul></ul><ul><li>Mostly consist of boilerplate code to read / manage a spool directory, daemonize, handle locking </li></ul><ul><ul><li>Very little code dedicated to processing the actual data </li></ul></ul><ul><li>Hard to administer! </li></ul>
    11. 11. Events, Subscriptions, Notifications? <ul><li>1 event => many subscribers => many notifications </li></ul><ul><li>Much too slow to find subscribers and issue notifications synchronously </li></ul><ul><li>Existing GhettoQueue mechanism sucks, so that's not an option. </li></ul><ul><li>... </li></ul><ul><li>Needs a real solution to reliable job processing </li></ul>
    12. 12. Problems With These Approaches <ul><li>Everything is different for each service: </li></ul><ul><ul><li>Monitoring </li></ul></ul><ul><ul><li>Tools for Administration </li></ul></ul><ul><ul><li>Troubleshooting </li></ul></ul><ul><li>Operations people end up hating new features </li></ul><ul><ul><li>Each one brings a new set of headaches </li></ul></ul><ul><li>Fine for a while, but at some point becomes ridiculous </li></ul>
    13. 14. Implementation <ul><li>Perl + MySQL / SQLite </li></ul><ul><ul><li>SQLite mostly for test suite </li></ul></ul><ul><li>Python/Ruby/Etc people: Don't worry! Plans are underway to make TheSchwartz language agnostic </li></ul>
    14. 15. Topology <ul><li>One or more databases </li></ul><ul><li>Worker machines </li></ul><ul><ul><li>Each running many worker processes </li></ul></ul>
    15. 16. Schwartz Database <ul><li>Keeps track of: </li></ul><ul><ul><li>Jobs and their args </li></ul></ul><ul><ul><li>Errors </li></ul></ul><ul><ul><li>Exit status </li></ul></ul><ul><ul><li>...That's it! </li></ul></ul><ul><li>Small schema, can mostly stay in memory </li></ul><ul><li>Scalable: Inserts are random to any database </li></ul>
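The "inserts are random to any database" idea can be sketched in plain Perl. This is only an illustration of the technique, not TheSchwartz's internals: `insert_job`, the database names, and the `%down` set are hypothetical, and the callback stands in for a real SQL INSERT.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical pool of job databases; in production each would be a DSN.
my @databases = qw(schwartz_a schwartz_b schwartz_c);

# Try the databases in random order and return the name of the first one
# that accepts the job. $try_insert is a callback standing in for a real
# INSERT; it returns true on success.
sub insert_job {
    my ($job, $try_insert, @dbs) = @_;
    for my $db (shuffle @dbs) {
        return $db if $try_insert->($db, $job);
    }
    return;    # no database accepted the job
}

# Simulate one database being down: inserts still land on the others.
my %down = (schwartz_b => 1);
my %load;
for my $n (1 .. 300) {
    my $db = insert_job({ id => $n }, sub { !$down{ $_[0] } }, @databases);
    $load{$db}++;
}
printf "%s: %d jobs\n", $_, $load{$_} for sort keys %load;
```

Because any database can take any job, adding capacity is a matter of adding another entry to the pool, and a failed database only reduces the number of places a job can land.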
    16. 17. Schwartz Workers <ul><li>TheSchwartz::Worker subclasses </li></ul><ul><li>Know how to handle one or more job types </li></ul><ul><li>Accept TheSchwartz::Job as single parameter </li></ul><ul><li>... That's it! </li></ul>
    17. 18. Full Topology: With Application
    18. 19. Request Cycle <ul><li>1) Event to application </li></ul><ul><li>2) Application registers Schwartz job </li></ul><ul><li>3) Worker grabs Schwartz job </li></ul><ul><li>4) Worker does work (usually modifying application data) </li></ul><ul><li>5) Worker (optionally) stores result in Schwartz database </li></ul>
    19. 20. Let's look at some code...
    20. 21. Application code
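The code on this slide did not survive in the transcript. As a rough stand-in, here is a minimal sketch of the application side using the TheSchwartz module from CPAN. The DSN, credentials, the `entry_posted` helper, and the worker class name `MyApp::Worker::PingWeblogs` are placeholders, not the presenters' code.

```perl
use strict;
use warnings;
use TheSchwartz;

# Placeholder connection details; substitute your own job database(s).
my $client = TheSchwartz->new(
    databases => [
        { dsn => 'dbi:mysql:theschwartz', user => 'lj', pass => 'secret' },
    ],
);

# Hypothetical hook called from the web request once an entry is saved.
# The first argument to insert() names the worker class that will handle
# the job; the second is the argument structure handed to that worker.
sub entry_posted {
    my ($entry_id) = @_;
    my $handle = $client->insert(
        'MyApp::Worker::PingWeblogs',
        { entry_id => $entry_id },
    );
    warn "could not queue ping for entry $entry_id\n" unless $handle;
    return $handle;
}
```

The web request returns as soon as the job row is written; the slow work happens later in a worker process.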
    21. 22. Worker code
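The worker code on this slide is also missing from the transcript. A minimal sketch of what a worker could look like, again with placeholder names (`MyApp::Worker::PingWeblogs`, `ping_weblogs`, and the connection details are assumptions for illustration):

```perl
package MyApp::Worker::PingWeblogs;
use strict;
use warnings;
use base 'TheSchwartz::Worker';

# work() receives the worker class and a TheSchwartz::Job.
sub work {
    my ($class, $job) = @_;
    my $args = $job->arg;    # the hashref passed to insert()

    my $ok = eval { ping_weblogs($args->{entry_id}); 1 };
    if ($ok) {
        $job->completed;                     # remove the job from the queue
    }
    else {
        $job->failed("ping failed: $@");     # record the error
    }
    return;
}

# Hypothetical stand-in for the real HTTP call.
sub ping_weblogs { my ($entry_id) = @_; return 1 }

package main;
use TheSchwartz;

# Placeholder connection details, matching whatever the application uses.
my $client = TheSchwartz->new(
    databases => [
        { dsn => 'dbi:mysql:theschwartz', user => 'lj', pass => 'secret' },
    ],
);
$client->can_do('MyApp::Worker::PingWeblogs');    # job types this process handles
$client->work;                                    # loop, grabbing and running jobs
```

One worker process can register several job types with repeated `can_do` calls, which is how a single machine ends up running many kinds of work.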
    22. 23. It's really that simple...
    23. 24. But it doesn't have to be
    24. 25. Workers can define other per-worker behaviors <ul><li>Retries: </li></ul><ul><ul><li>sub max_retries { 5 } # 5 tries </li></ul></ul><ul><ul><li>sub retry_delay { </li></ul></ul><ul><ul><li>my ($class, $fail_ct) = @_; </li></ul></ul><ul><ul><li>return 2 ** $fail_ct; </li></ul></ul><ul><ul><li>} </li></ul></ul><ul><li>Max time a process can work on a job: </li></ul><ul><ul><li>sub grab_for { 300 } # seconds </li></ul></ul><ul><li>Keeping exit status: </li></ul><ul><ul><li>sub keep_exit_status_for { 86400 } # 1 day </li></ul></ul>
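To see what the `retry_delay` on this slide produces, the two subs can be exercised in isolation. This sketch does not load TheSchwartz; `MyWorker` is a hypothetical class, and it assumes the failure count passed to `retry_delay` starts at 1 for the first failure.

```perl
use strict;
use warnings;

package MyWorker;

# Same definitions as on the slide.
sub max_retries { 5 }

sub retry_delay {
    my ($class, $fail_ct) = @_;
    return 2 ** $fail_ct;    # exponential backoff, in seconds
}

package main;

# Delay before each retry, for failure counts 1 through max_retries.
my @delays = map { MyWorker->retry_delay($_) } 1 .. MyWorker->max_retries;
print "retry $_ waits $delays[$_ - 1]s\n" for 1 .. scalar @delays;
```

Under that assumption the waits are 2, 4, 8, 16, and 32 seconds, so a job that keeps failing backs off quickly instead of hammering a broken remote service.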
    25. 26. Other fun features <ul><li>Coalescing based on prefix </li></ul><ul><ul><li>Coalescing field explicitly stated when job is inserted </li></ul></ul><ul><ul><li>“Give me all jobs that are sending email to Yahoo” </li></ul></ul><ul><li>Atomic job replacement </li></ul><ul><ul><li>For splitting one job up into many, which other workers can immediately start working on </li></ul></ul><ul><li>Scheduling future jobs </li></ul><ul><ul><li>Because we hate cron </li></ul></ul>
    26. 27. Using TheSchwartz in production <ul><li>LiveJournal currently handling over 100 jobs per second </li></ul>
    27. 28. Schwartz Database Configuration <ul><li>InnoDB </li></ul><ul><li>Master-master replication </li></ul><ul><li>One side active </li></ul><ul><li>Linux Heartbeat for shared VIP </li></ul><ul><li>Automatic binlog purging </li></ul>
    28. 29. Schwartz Database Configuration ... And so on, adding clusters as needed
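A hedged sketch of how these three features look through the Perl API. The worker class names, the `subscribers_for` helper, the connection details, and the `domain@user` layout of the coalesce value are assumptions for illustration; only the TheSchwartz calls themselves come from the module.

```perl
use strict;
use warnings;
use TheSchwartz;

# Placeholder connection details.
my $client = TheSchwartz->new(
    databases => [
        { dsn => 'dbi:mysql:theschwartz', user => 'lj', pass => 'secret' },
    ],
);

# Coalescing and scheduling are both set when the job is inserted.
my $job = TheSchwartz::Job->new(
    funcname  => 'MyApp::Worker::SendEmail',
    arg       => { to => 'user@yahoo.com', subject => 'New comment' },
    coalesce  => 'yahoo.com@user',    # assumed layout: domain first, for prefix matching
    run_after => time + 3600,         # not eligible to run for another hour
);
$client->insert($job);

# "Give me a job that is sending email to Yahoo": match on the coalesce prefix.
my $yahoo_job = $client->find_job_with_coalescing_prefix(
    'MyApp::Worker::SendEmail', 'yahoo.com@',
);

# Atomic replacement: one event job becomes one job per subscriber.
package MyApp::Worker::FanOut;
use base 'TheSchwartz::Worker';

sub work {
    my ($class, $job) = @_;
    my $event    = $job->arg->{event};
    my @children = map {
        TheSchwartz::Job->new(
            funcname => 'MyApp::Worker::Notify',
            arg      => { event => $event, user => $_ },
        );
    } subscribers_for($event);
    $job->replace_with(@children);    # parent is removed and children inserted together
    return;
}

# Hypothetical lookup; a real one would query the application database.
sub subscribers_for { return qw(alice bob) }
```

The fan-out worker is one way to handle the "1 event, many subscribers, many notifications" case from earlier in the talk: finding subscribers is one job, and each notification becomes its own job that any idle worker can pick up.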
    29. 30. Tools and monitoring <ul><li>Schwartzmon </li></ul><ul><li>Schwartz-rate </li></ul><ul><li>LJWorkerctrl </li></ul><ul><li>Nagios plugins for queues </li></ul><ul><li>Triggers </li></ul>
    30. 31. Schwartzmon Example
        lj@ljadmin1:~$ schwartzmon --dsn=DBI:mysql:theschwartz_livejournal;host= --user=lj -f errors
        Thu Jul 26 19:22:13 2007 [2116902910]: Connection failed to domain 'imagemenagerie.com', MXes: [imagemenagerie.com]
        Thu Jul 26 19:22:14 2007 [2120335058]: Connection failed to domain 'thedashcat.net', MXes: [thedashcat.net]
        Thu Jul 26 19:22:15 2007 [2126277932]: Connection failed to domain 'cox.net', MXes: [mx.west.cox.net mx.east.cox.net]
        Thu Jul 26 19:22:16 2007 [2126007758]: Connection failed to domain 'cox.net', MXes: [mx.west.cox.net mx.east.cox.net]
        Thu Jul 26 19:22:16 2007 [2126309296]: Permanent failure TO [hey2a@cs.com]: 550 MAILBOX NOT FOUND
        Thu Jul 26 19:22:17 2007 [2126309446]: Permanent failure TO [joe_junkpan@hotmail.com]: 550 Requested action not taken: mailbox unavailable
        Thu Jul 26 19:22:18 2007 [2126308836]: Error during DATAEND phase to [ourxtrees@yahoo.com]: 451 go ahead Message temporarily deferred - [250]
        Thu Jul 26 19:22:18 2007 [2126188996]: Connection failed to domain 'buffyboarders.zzn.com', MXes: [c2mailmx.mailcentro.com c2mds.mailcentro.com]
        Thu Jul 26 19:22:21 2007 [2125661508]: Connection failed to domain 'cox.net', MXes: [mx.east.cox.net mx.west.cox.net]
    31. 32. Ljworkerctrl example
    32. 33. Ljworkerctrl example <ul><li>… </li></ul>
    33. 34. Ljworkerctrl example <ul><li>… </li></ul>
    34. 35. Questions? http://code.sixapart.com/svn/TheSchwartz/trunk