Os Whitaker Presentation Transcript

  • Keeping Your Workers In Line: use TheSchwartz;
    • Brad Whitaker
    • Lisa Phillips
  • Once upon a time... in a galaxy far, far away
    • Users wanted features
      • Like subscription-based notifications,
      • Pinging external services,
      • And other things which
        • Love to tie up webserver processes
        • And are generally too slow / blocking to execute synchronously with web requests
  • (Yes, that was really my best slide-foo, so bear with me)
  • So we needed to solve the problems these features create
  • Which can easily result in a mess
  • Contacting External Web Services
    • In LJ's case:
      • weblogs.com
      • updates.sixapart.com
      • event notifications to Mother Russia
        • seriously
    • Services need to be contacted whenever an entry / comment / asset is created
  • Processing Uploaded Media
    • Pictures need to be scaled
    • Videos need to be transcoded
    • Shouldn’t be done on the webserver
      • Too slow
      • Hogs CPU/memory resources
      • Requires unnecessary libraries loaded into Apache
  • Initial Solution: GhettoQueue
    • Some sort of buffer on disk/database
      • Or worse, a queue which gets blocked when a single job repeatedly fails
    • Cron or daemon to process the queue
    • Gets really behind, pretty flaky, generally annoying
      • Think 'qbufferd' in LJ
        • Which was the bane of our existence
          • For years.
    • Hard to administer!
  • Incoming data from lots of different transports
    • Incoming emails
    • Incoming SMS and outbound response
    • Audio from Asterisk
    • ... Just to name a few
  • Initial Solution: Lots of Daemons!
    • Lots of daemons!
    • In LJ:
      • phonepostd
      • mailgated
    • Mostly consist of boilerplate code to read / manage a spool directory, daemonize, handle locking
      • Very little code dedicated to processing the actual data
    • Hard to administer!
  • Events, Subscriptions, Notifications?
    • 1 event => many subscribers => many notifications
    • Much too slow to find subscribers and issue notifications synchronously
    • Existing GhettoQueue mechanism sucks, so that's not an option.
    • ...
    • Needs a real solution to reliable job processing
  • Problems With These Approaches
    • Everything is different for each service:
      • Monitoring
      • Tools for Administration
      • Troubleshooting
    • Operations people end up hating new features
      • Each one brings a new set of headaches
    • Fine for a while, but at some point becomes ridiculous
  • Implementation
    • Perl + MySQL / SQLite
      • SQLite mostly for test suite
    • Python/Ruby/Etc people: Don't worry! Plans are underway to make TheSchwartz language agnostic
  • Topology
    • One or more databases
    • Worker machines
      • Each running many worker processes
  • Schwartz Database
    • Keeps track of:
      • Jobs and their args
      • Errors
      • Exit status
      • ...That's it!
    • Small schema, can mostly stay in memory
    • Scalable: each insert goes to a randomly chosen database
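The random-insert idea on this slide can be modeled in a few lines. This is only a Python sketch with in-memory lists standing in for the databases; `JobStore`, `insert_job`, and the `schwartz0..2` names are hypothetical and are not part of TheSchwartz's Perl API.

```python
import random


class JobStore:
    """Hypothetical in-memory stand-in for one Schwartz database."""

    def __init__(self, name):
        self.name = name
        self.jobs = []


def insert_job(stores, funcname, arg, rng=random):
    """Insert a job into one randomly chosen store and return that store."""
    store = rng.choice(stores)
    store.jobs.append({"funcname": funcname, "arg": arg})
    return store


# Three stores; a seeded generator keeps the run repeatable.
stores = [JobStore(f"schwartz{i}") for i in range(3)]
rng = random.Random(42)
for n in range(300):
    insert_job(stores, "SendEmail", {"msg_id": n}, rng)

counts = {s.name: len(s.jobs) for s in stores}
```

Because no store is special, adding capacity in this model is just a matter of appending another store to the list.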
  • Schwartz Workers
    • TheSchwartz::Worker subclasses
    • Know how to handle one or more job types
    • Accept TheSchwartz::Job as single parameter
    • ... That's it!
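As a rough illustration of the worker contract described here (a subclass per job type, with the job as the only parameter), the following Python sketch models the shape of it. The class and function names (`Job`, `Worker`, `SendNotification`, `dispatch`) are assumptions made for the example; the real workers are Perl subclasses of TheSchwartz::Worker.

```python
class Job:
    """Hypothetical job object: a function name plus its arguments."""

    def __init__(self, funcname, arg):
        self.funcname = funcname
        self.arg = arg
        self.done = False

    def completed(self):
        # Mark the job as finished so it is not handed out again.
        self.done = True


class Worker:
    """Base class: subclasses implement work(job)."""

    @classmethod
    def work(cls, job):
        raise NotImplementedError


class SendNotification(Worker):
    sent = []  # records who was notified, for demonstration only

    @classmethod
    def work(cls, job):
        cls.sent.append(job.arg["user"])
        job.completed()


def dispatch(job, abilities):
    """Hand the job to the worker class registered for its function name."""
    abilities[job.funcname].work(job)


abilities = {"SendNotification": SendNotification}
job = Job("SendNotification", {"user": "brad"})
dispatch(job, abilities)
```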
  • Full Topology: With Application
  • Request Cycle
    • 1) Event to application
    • 2) Application registers Schwartz job
    • 3) Worker grabs Schwartz job
    • 4) Worker does work (usually modifying application data)
    • 5) Worker (optionally) stores result in Schwartz database
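The five steps above can be walked through with a toy in-memory queue. This is a sketch under stated assumptions, not the library's implementation: `queue`, `exit_status`, `app_data`, `handle_upload`, and `work_once` are all hypothetical names, and a real deployment stores jobs in the Schwartz database rather than in a Python deque.

```python
from collections import deque
from itertools import count

job_ids = count(1)
queue = deque()                 # stands in for the Schwartz job table
exit_status = {}                # stands in for the stored exit statuses
app_data = {"thumbnails": []}   # stands in for application data


def handle_upload(filename):
    """Steps 1-2: an event reaches the application, which registers a job."""
    job = {"id": next(job_ids), "funcname": "ScaleImage", "arg": {"file": filename}}
    queue.append(job)
    return job["id"]


def scale_image(job):
    """Step 4: do the work, modifying application data."""
    app_data["thumbnails"].append("thumb_" + job["arg"]["file"])


HANDLERS = {"ScaleImage": scale_image}


def work_once():
    """Steps 3-5: grab one job, run it, and record its exit status."""
    if not queue:
        return False
    job = queue.popleft()
    try:
        HANDLERS[job["funcname"]](job)
        exit_status[job["id"]] = 0
    except Exception:
        exit_status[job["id"]] = 1
    return True


job_id = handle_upload("cat.jpg")   # the web request returns right away
while work_once():                  # a separate worker process in real life
    pass
```

The point of the split is that `handle_upload` only records the job; the slow part runs later, outside the web request.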
  • Let's look at some code...
  • Application code
  • Worker code
  • It's really that simple...
  • But it doesn't have to be
  • Workers can define other per-worker behaviors
    • Retries:
      • sub max_retries { 5 } # 5 tries
      • sub retry_delay { my ($class, $fail_ct) = @_; return 2 ** $fail_ct; }
    • Max time a process can work on a job:
      • sub grab_for { 300 } # seconds
    • Keeping exit status:
      • sub keep_exit_status_for { 86400 } # 1 day
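To see what the retry settings above add up to, here is a small Python calculation of the backoff schedule. It assumes the failure count passed to `retry_delay` starts at 1 on the first failure and that the delay is in seconds; both are assumptions for this sketch rather than statements taken from the slides.

```python
MAX_RETRIES = 5  # mirrors max_retries in the slide


def retry_delay(fail_ct):
    """Exponential backoff: 2 ** (number of failures so far)."""
    return 2 ** fail_ct


def retry_schedule(max_retries=MAX_RETRIES):
    """Delay before each retry, assuming the failure count starts at 1."""
    return [retry_delay(n) for n in range(1, max_retries + 1)]


schedule = retry_schedule()
total_wait = sum(schedule)
```

Under those assumptions the delays are 2, 4, 8, 16, and 32, so a job that keeps failing is given up on after roughly a minute of waiting in total.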
  • Other fun features
    • Coalescing based on prefix
      • Coalescing field explicitly stated when job is inserted
      • “Give me all jobs that are sending email to Yahoo”
    • Atomic job replacement
      • For splitting one job up into many, which other workers can immediately start working on
    • Scheduling future jobs
      • Because we hate cron
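The three features on this slide can be modeled on a plain list of jobs. In this Python sketch the reversed-domain coalescing key (`com.yahoo@alice`), the `run_after` field as a number of seconds, and the helper names `grab_by_prefix` and `replace_with` are all hypothetical choices made for illustration.

```python
jobs = [
    {"id": 1, "funcname": "SendEmail", "coalesce": "com.yahoo@alice", "run_after": 0},
    {"id": 2, "funcname": "SendEmail", "coalesce": "com.hotmail@bob", "run_after": 0},
    {"id": 3, "funcname": "SendEmail", "coalesce": "com.yahoo@carol", "run_after": 0},
    {"id": 4, "funcname": "SendEmail", "coalesce": "com.yahoo@dave", "run_after": 3600},
]


def grab_by_prefix(jobs, funcname, prefix, now):
    """Coalescing: jobs of one type whose key starts with prefix and are due."""
    return [
        j for j in jobs
        if j["funcname"] == funcname
        and j["coalesce"].startswith(prefix)
        and j["run_after"] <= now
    ]


def replace_with(jobs, old_id, new_jobs):
    """Swap one job for several in a single list operation."""
    idx = next(i for i, j in enumerate(jobs) if j["id"] == old_id)
    jobs[idx:idx + 1] = new_jobs


# "Give me all jobs that are sending email to Yahoo", now and an hour from now.
yahoo_now = [j["id"] for j in grab_by_prefix(jobs, "SendEmail", "com.yahoo", now=0)]
yahoo_later = [j["id"] for j in grab_by_prefix(jobs, "SendEmail", "com.yahoo", now=3600)]

# Split job 2 into two smaller jobs that other workers could pick up.
replace_with(jobs, 2, [
    {"id": 5, "funcname": "SendEmail", "coalesce": "com.hotmail@bob", "run_after": 0},
    {"id": 6, "funcname": "SendEmail", "coalesce": "com.hotmail@erin", "run_after": 0},
])
ids_after_split = [j["id"] for j in jobs]
```

Job 4 is the scheduled one: it is invisible at `now=0` and only shows up once its `run_after` time has passed. In a real database the replacement would need a transaction to be atomic; the list slice here only models the outcome.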
  • Using TheSchwartz in production
    • LiveJournal is currently handling over 100 jobs per second
  • Schwartz Database Configuration
    • InnoDB
    • Master-master replication
    • One side active
    • Linux Heartbeat for shared VIP
    • Automatic binlog purging
  • Schwartz Database Configuration
    • ... And so on, adding clusters as needed
  • Tools and monitoring
    • Schwartzmon
    • Schwartz-rate
    • LJWorkerctrl
    • Nagios plugins for queues
    • Triggers
  • Schwartzmon Example
    • lj@ljadmin1:~$ schwartzmon --dsn=DBI:mysql:theschwartz_livejournal;host=10.191.90.101 --user=lj -f errors
    • Thu Jul 26 19:22:13 2007 [2116902910]: Connection failed to domain 'imagemenagerie.com', MXes: [imagemenagerie.com]
    • Thu Jul 26 19:22:14 2007 [2120335058]: Connection failed to domain 'thedashcat.net', MXes: [thedashcat.net]
    • Thu Jul 26 19:22:15 2007 [2126277932]: Connection failed to domain 'cox.net', MXes: [mx.west.cox.net mx.east.cox.net]
    • Thu Jul 26 19:22:16 2007 [2126007758]: Connection failed to domain 'cox.net', MXes: [mx.west.cox.net mx.east.cox.net]
    • Thu Jul 26 19:22:16 2007 [2126309296]: Permanent failure TO [hey2a@cs.com]: 550 MAILBOX NOT FOUND
    • Thu Jul 26 19:22:17 2007 [2126309446]: Permanent failure TO [joe_junkpan@hotmail.com]: 550 Requested action not taken: mailbox unavailable
    • Thu Jul 26 19:22:18 2007 [2126308836]: Error during DATAEND phase to [ourxtrees@yahoo.com]: 451 go ahead Message temporarily deferred - [250]
    • Thu Jul 26 19:22:18 2007 [2126188996]: Connection failed to domain 'buffyboarders.zzn.com', MXes: [c2mailmx.mailcentro.com c2mds.mailcentro.com]
    • Thu Jul 26 19:22:21 2007 [2125661508]: Connection failed to domain 'cox.net', MXes: [mx.east.cox.net mx.west.cox.net]
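Error output in this shape is easy to summarize with a short script. The Python sketch below tallies connection failures per domain from a few lines copied from the schwartzmon output above; the line layout it expects (timestamp, bracketed number, message) is inferred from that sample only, and reading the bracketed number as a job id is an assumption.

```python
import re
from collections import Counter

# Lines copied from the schwartzmon output above.
LOG = """\
Thu Jul 26 19:22:13 2007 [2116902910]: Connection failed to domain 'imagemenagerie.com', MXes: [imagemenagerie.com]
Thu Jul 26 19:22:15 2007 [2126277932]: Connection failed to domain 'cox.net', MXes: [mx.west.cox.net mx.east.cox.net]
Thu Jul 26 19:22:16 2007 [2126007758]: Connection failed to domain 'cox.net', MXes: [mx.west.cox.net mx.east.cox.net]
Thu Jul 26 19:22:16 2007 [2126309296]: Permanent failure TO [hey2a@cs.com]: 550 MAILBOX NOT FOUND
"""

# Timestamp, then a bracketed number (assumed to be a job id), then the message.
LINE = re.compile(r"^(?P<when>.+?) \[(?P<jobid>\d+)\]: (?P<msg>.*)$")
DOMAIN = re.compile(r"Connection failed to domain '([^']+)'")


def failures_by_domain(text):
    """Count 'Connection failed' messages per domain, skipping other errors."""
    counts = Counter()
    for line in text.splitlines():
        m = LINE.match(line)
        if not m:
            continue
        d = DOMAIN.search(m.group("msg"))
        if d:
            counts[d.group(1)] += 1
    return counts


by_domain = failures_by_domain(LOG)
```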
  • Ljworkerctrl example
  • Questions? http://code.sixapart.com/svn/TheSchwartz/trunk