SlideShare a Scribd company logo
1 of 115
Download to read offline
LiveJournal: Behind The Scenes
                                             Scaling Storytime

                                                       June 2007
                                                        USENIX



                                         Brad Fitzpatrick
                                         brad@danga.com

                          danga.com / livejournal.com / sixapart.com
       This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. To
            view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/1.0/ or send a letter to
                        Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.




http://danga.com/words/
                                                                                                                       1
The plan...

        Refer to previous presentations for more
    

        details...
              http://danga.com/words/
          


        Questions anytime! Yell. Interrupt.
    

        Part 0:
    

              show where talk will end up
          −
        Part I:
    

              What is LiveJournal? Quick history.
          −
              LJ’s scaling history
          −
        Part II:
    

              explain all our software,
          −
              explain all the moving parts
          −

http://danga.com/words/
                                                    2
net.
                            LiveJournal Backend: Today
                                                           (Roughly.)

    BIG-IP
                                                                                            Global Database
                            perlbal (httpd/proxy)
            bigip1                                        mod_perl
            bigip2                  proxy1                                                       master_a master_b
                                                            web1
                                    proxy2                  web2
                                    proxy3                                      Memcached
                                                            web3                            slave1 slave2     ...   slave5
                                    proxy4
  djabberd                                                                         mc1
                                                            web4
      djabberd                      proxy5                                         mc2
                                                              ...                               User DB Cluster 1
      djabberd
                                                                                   mc3             uc1a         uc1b
                                                            webN
                                                                                   mc4          User DB Cluster 2
                                                                                    ...            uc2a         uc2b
                                                    gearmand
                                                                                   mcN
 Mogile Storage Nodes                                   gearmand1                               User DB Cluster 3
                     sto2
     sto1                                               gearmandN                                  uc3a         uc3b
                                Mogile Trackers
                     sto8
      ...                        tracker1    tracker3                                           User DB Cluster N
                                                                                                   ucNa         ucNb
                                                                    “workers”
    MogileFS Database
                                                                       gearwrkN                 Job Queues (xN)
        mog_a           mog_b                                         theschwkN                    jqNa         jqNb


     slave1     slaveN
http://danga.com/words/
                                                                                                                             3
LiveJournal Overview

      college hobby project, Apr 1999
  

      4-in-1:
  

       − blogging
       − forums
       − social-networking (“friends”)
       − aggregator: “friends page”
           − “friends” can be external RSS/Atom
      10M+ accounts
  

      Open Source!
  

       − server,
       − infrastructure,
       − original clients,


http://danga.com/words/
                                                  4
Stuff we've built...
                             (all production, open source)



       memcached                                   djabberd
                                              

        − distributed caching                        − the super-extensible
       MogileFS                                         everything-is-a-plugin
   

        − distributed filesystem                        mod_perl/qpsmtpd/
       Perlbal                                          Eclipse of XMPP/Jabber
   

        − HTTP load balancer, web                       servers
          server, swiss-army knife                 .....
                                               

       gearman                                     OpenID
                                              

        − LB/HA/coalescing low-                          federated identity
                                                     

          latency function call                          protocol
          “router”
       TheSchwartz
   

        − reliable, async job
          dispatch system

http://danga.com/words/
                                                                                 5
“Uh, why?”

          NIH? (Not Invented Here?)
      

          Are we reinventing the wheel?
      




http://danga.com/words/
                                          6
Yes.

          We build wheels.
      

               ... when existing suck,
           −
               ... or don’t exist.
           −




http://danga.com/words/
                                           7
Yes.

          We build wheels.
      

               ... when existing suck,
           −
               ... or don’t exist.
           −




http://danga.com/words/
                                           7
Yes.

          We build wheels.
      

               ... when existing suck,
           −
               ... or don’t exist.
           −




http://danga.com/words/
                                           7
Yes.

          We build wheels.
      

               ... when existing suck,
           −
               ... or don’t exist.
           −




                                           (yes, arguably tires. sshh..)


http://danga.com/words/
                                                                           7
Part I
                          Quick Scaling History




http://danga.com/words/
                                                  8
Quick Scaling History

          1 server to hundreds...
      




          you can do all this with just 1 server!
      

               then you’re ready for tons of servers, without pain
           −
               don’t repeat our scaling mistakes
           −




http://danga.com/words/
                                                                     9
Terminology

          Scaling:
      

               NOT: “How fast?”
           −
               But: “When you add twice as many servers, are you
           −
               twice as fast (or have twice the capacity)?”
          Fast still matters,
      

               2x faster: 50 servers instead of 100...
           −
                     that’s some good money
                 


               but that’s not what scaling is.
           −




http://danga.com/words/
                                                                   10
Terminology

          “Cluster”
      

               varying definitions... basically:
           −
               making a bunch of computers work together for
           −
               some purpose
               what purpose?
           −
                     load balancing (LB),
                 

                     high availablility (HA)
                 


          Load Balancing?
      

          High Availability?
      

          Venn Diagram time!
      

               I love Venn Diagrams
           −




http://danga.com/words/
                                                               11
LB vs. HA




                          Load Balancing    High Availability




http://danga.com/words/
                                                                12
LB vs. HA


                                                 High
                            Load
                                               Availability
                          Balancing
                                                          LVS heartbeat,
                                          http
                round-robin DNS,                    cold/warm/hot spare,
                                     reverse proxy,
                                                                      ...
                data partitioning,    wackamole,
                ....                       ...




http://danga.com/words/
                                                                            13
Favorite Venn Diagram




             Times When               Times When I’m
           I’m Truly Happy             Wearing Pants




http://danga.com/words/
                                                       14
One Server

             Simple:
         




                                   mysql




                                   apache




http://danga.com/words/
                                            15
Two Servers




                                           apache




                                   mysql




http://danga.com/words/
                                                    16
Two Servers - Problems


             Two single points of failure!
         

             No hot or cold spares
         

             Site gets slow again.
         

                  CPU-bound on web node
              −
                  need more web nodes...
              −




http://danga.com/words/
                                                   17
Four Servers

             3 webs, 1 db
         

             Now we need to load-balance!
         

                  LVS, mod_backhand, whackamole, BIG-IP,
              

                  Alteon, pound, Perlbal, etc, etc..
                  ...
              −




http://danga.com/words/
                                                           18
Four Servers - Problems

             Now I/O bound...
         

             ... how to use another database?
         

              −




http://danga.com/words/
                                                    19
Five Servers
                          introducing MySQL replication

       We buy a new DB
   

       MySQL replication
   

       Writes to DB (master)
   

       Reads from both
   




http://danga.com/words/
                                                          20
More Servers




                             Chaos!

http://danga.com/words/
                                         21
net.

                                 Where we're at....

      BIG-IP
            bigip1
            bigip2
                          mod_proxy
                                         mod_perl
                               proxy1
                                           web1
                               proxy2
                                           web2
                               proxy3               Global Database
                                           web3
                                                                master
                                           web4
                                             ...
                                           web12
                                                    slave1 slave2     ...   slave6




http://danga.com/words/
                                                                                     22
Problems with Architecture
                                           or,
                                  “This don't scale...”

      DB master is SPOF
  

      Adding slaves doesn't scale
  

      well...
       − only spreads reads, not writes!




            500 reads/s
                                                               250 reads/s
                                                 250 reads/s


                                                               200 write/s
                                                 200 write/s
            200 writes/s


http://danga.com/words/
                                                                             23
Eventually...

             databases eventual only writing
         



          3 reads/s
            3 r/s                                                  3 reads/s
                                                                     3 r/s
                                       3 reads/s
                                         3 r/s
                          3 reads/s
                            3 r/s                                                3 reads/s
                                                                                   3 r/s       3 reads/s
                                                                                                 3 r/s
                                                     3 reads/s
                                                       3 r/s




            400                                                      400
                          400            400                                       400           400
                                                       400
         400 write/s                                              400 write/s
                       400 write/s    400 write/s                               400 write/s   400 write/s
                                                    400 write/s
           write/s                                                  write/s
                         write/s        write/s                                   write/s       write/s
                                                      write/s




http://danga.com/words/
                                                                                                            24
Spreading Writes

          Our database machines already did RAID
      

          We did backups
      

          So why put user data on 6+ slave machines?
      

          (~12+ disks)
               overkill redundancy
           −
               wasting time writing everywhere!
           −




http://danga.com/words/
                                                       25
Partition your data!

    Spread your databases out, into “roles”


     − roles that you never need to join between
          different users

          or accept you'll have to join in app

    Each user assigned to a numbered HA cluster


    Each cluster has multiple machines


     − writes self-contained in cluster (writing to 2-3 machines, not
       6)




http://danga.com/words/
                                                                        26
User Clusters




http://danga.com/words/
                                          27
User Clusters

     SELECT userid,
     clusterid FROM
     user WHERE
     user='bob'




http://danga.com/words/
                                          27
User Clusters

     SELECT userid,
     clusterid FROM
     user WHERE
     user='bob'




     userid: 839
     clusterid: 2




http://danga.com/words/
                                          27
User Clusters

                                          SELECT ....
     SELECT userid,
                                          FROM ...
     clusterid FROM
                                          WHERE
     user WHERE
                                          userid=839 ...
     user='bob'




     userid: 839
     clusterid: 2




http://danga.com/words/
                                                       27
User Clusters

                                          SELECT ....
     SELECT userid,
                                          FROM ...
     clusterid FROM
                                          WHERE
     user WHERE
                                          userid=839 ...
     user='bob'




                                            OMG i like
                                            totally hate
                                            my parents
     userid: 839                            they just
     clusterid: 2                           dont
                                            understand me
                                            and i h8 the
                                            world omg lol
                                            rofl *! :^-
                                            ^^;

                                            add me as a
                                            friend!!!
http://danga.com/words/
                                                          27
Details

    per-user numberspaces


     − don't use AUTO_INCREMENT
     − PRIMARY KEY (user_id, thing_id)
     − so:
    Can move/upgrade users 1-at-a-time:


     − per-user “readonly” flag
     − per-user “schema_ver” property
     − user-moving harness
         job server that coordinates, distributed long-

          lived user-mover clients who ask for tasks
     − balancing disk I/O, disk space

http://danga.com/words/
                                                           28
Shared Storage
                          (SAN, SCSI, DRBD...)

      Turn pair of InnoDB machines into a cluster
  

       − looks like 1 box to outside world. floating IP.
      One machine at a time mounting fs, running MySQL
  

      Heartbeat to move IP, {un,}mount filesystem, {stop,start}
  

      mysql
        filesystem repairs,

        innodb repairs,

        don’t lose any committed transactions.

      No special schema considerations
  

      MySQL 4.1 w/ binlog sync/flush options
  

       − good
       − The cluster can be a master or slave as well


http://danga.com/words/
                                                                  29
Shared Storage: DRBD

      Linux block device driver
  

                                                     floater ip
       − “Network RAID 1”
       − Shared storage without sharing!
                                             mysql                mysql
       − sits atop another block device
       − syncs w/ another machine's
                                             ext3                 ext3
         block device
            cross-over gigabit cable
                                             drbd                 drbd
             ideal. network is faster than
                                              sda                  sda
             random writes on your disks.
      InnoDB on DRBD: HA MySQL!
  

       − can hang slaves off HA pair,
       − and/or,
       − HA pair can be slave of a
         master
http://danga.com/words/
                                                                          30
MySQL Clustering Options:
                               Pros & Cons

          No magic bullet...
      

               Master/Slave
           −
                      doesn’t scale with writes
                 


               Master/Master
           −
                      special schemas
                 


               DRBD
           −
                      only HA, not LB
                 


               MySQL Cluster
           −
                      special-purpose
                 


               ....
           −
          lots of options!
      

           − :)
           − :(
http://danga.com/words/
                                                      31
Part II
                          Our Software




http://danga.com/words/
                                         32
Caching

        caching's key to performance
    

         − store result of a computation or I/O for quicker future
           access (classic space/time trade-off)
        Where to cache?
    

         − mod_perl/php internal caching
             memory waste (address space per apache child)

         − shared memory
             limited to single machine, same with Java/C#/

              Mono
         − MySQL query cache
             flushed per update, small max size

         − HEAP tables
             fixed length rows, small max size




http://danga.com/words/
                                                                     33
memcached
                          http://www.danga.com/memcached/


       our Open Source, distributed caching system
   

         implements a dictionary ADT, with network API

       run instances wherever free memory
   

       two-level hash
   

        − client hashes* to server,
        − server has internal dictionary (hash table)
       no “master node”, nodes aren’t aware of each
   

       other
       protocol simple, XML-free
   

        − clients: c, perl, java, c#, php, python, ruby, ...
       popular, fast
   

       scalable
   
http://danga.com/words/
                                                               34
Protocol Commands

          set, add, replace
      

          delete
      

          incr, decr
      

               atomic, returning new value
           −




http://danga.com/words/
                                              35
Picture




http://danga.com/words/
                                    36
Picture


   10.0.0.100:11211       10.0.0.101:11211   10.0.0.102:11211
             1GB                2GB                1GB




http://danga.com/words/
                                                                36
Picture


   10.0.0.100:11211       10.0.0.101:11211   10.0.0.102:11211
             1GB                2GB                1GB




http://danga.com/words/
                                                                36
Picture


   10.0.0.100:11211           10.0.0.101:11211       10.0.0.102:11211
             1GB                    2GB                    1GB


              0           1                      2           3




http://danga.com/words/
                                                                        36
Picture


   10.0.0.100:11211            10.0.0.101:11211       10.0.0.102:11211
             1GB                     2GB                    1GB


              0            1                      2           3




                  Client



http://danga.com/words/
                                                                         36
Picture


   10.0.0.100:11211               10.0.0.101:11211       10.0.0.102:11211
             1GB                        2GB                    1GB


              0               1                      2           3




                           $val = $client->get(“foo”)

                  Client



http://danga.com/words/
                                                                            36
Picture


   10.0.0.100:11211               10.0.0.101:11211       10.0.0.102:11211
             1GB                        2GB                    1GB


              0               1                      2           3




                           $val = $client->get(“foo”)
                             CRC32(“foo”) % 4 = 2
                  Client



http://danga.com/words/
                                                                            36
Picture


   10.0.0.100:11211               10.0.0.101:11211         10.0.0.102:11211
             1GB                        2GB                       1GB


              0               1                      2              3




                           $val = $client->get(“foo”)
                             CRC32(“foo”) % 4 = 2
                  Client
                             connect to server[2] (“10.0.0.101:11211”)



http://danga.com/words/
                                                                              36
Picture


   10.0.0.100:11211                        10.0.0.101:11211         10.0.0.102:11211
             1GB                                 2GB                       1GB


              0                        1                      2              3

                          GET foo



                                    $val = $client->get(“foo”)
                                      CRC32(“foo”) % 4 = 2
                  Client
                                      connect to server[2] (“10.0.0.101:11211”)



http://danga.com/words/
                                                                                       36
Picture


   10.0.0.100:11211                        10.0.0.101:11211         10.0.0.102:11211
             1GB                                    2GB                    1GB


              0                        1                      2              3

                          GET foo      (response)



                                    $val = $client->get(“foo”)
                                      CRC32(“foo”) % 4 = 2
                  Client
                                      connect to server[2] (“10.0.0.101:11211”)



http://danga.com/words/
                                                                                       36
Client hashing onto a memcacached
                            node

          Up to client how to pick a memcached node
      

          Traditional way:
      

               CRC32(<key>) % <num_servers>
           −
               (servers with more memory can own more slots)
           −
               CRC32 was least common denominator for all
           −
               languages to implement, allowing cross-language
               memcached sharing
               con: can’t add/remove servers without hit rate
           −
               crashing
          “Consistent hashing”
      

               can add/remove servers with minimal <key> to
           −
               <server> map changes

http://danga.com/words/
                                                                 37
memcached internals

          libevent
      

               epoll, kqueue...
           −
          event-based, non-blocking design
      

               optional multithreading, thread per CPU (not per
           −
               client)
          slab allocator
      

          referenced counted objects
      

               slow clients can’t block other clients from altering
           −
               namespace or data
          LRU
      

          all internal operations O(1)
      




http://danga.com/words/
                                                                      38
Perlbal




http://danga.com/words/
                                    39
Web Load Balancing

      BIG-IP, Alteon, Juniper, Foundry
  

       − good for L4 or minimal L7
       − not tricky / fun enough. :-)
      Tried a dozen reverse proxies
  

       − none did what we wanted or were fast enough
      Wrote Perlbal
  

       − fast, smart, manageable HTTP web server / reverse proxy / LB
       − can do internal redirects
            and dozen other tricks




http://danga.com/words/
                                                                        40
Perlbal

       Perl
   

         parts optionally in C with plugins

       single threaded, async event-based
   

        − uses epoll, kqueue, etc.
       console / HTTP remote management
   

        − live config changes
       handles dead nodes, smart balancing
   

       multiple modes
   

        − static webserver
        − reverse proxy
        − plug-ins (Javascript message bus.....)
       plug-ins
   

        − GIF/PNG altering, ....
http://danga.com/words/
                                                   41
Perlbal: Persistent Connections




http://danga.com/words/
                                                    42
Perlbal: Persistent Connections

    perlbal to backends (mod_perls)


          know exactly when a connection is ready for a new
      −
          request
                no complex load balancing logic: just use whatever's free.
            

                beats managing “weighted round robin” hell.
    clients persistent; not tied to a specific backend


    connection




http://danga.com/words/
                                                                             42
Perlbal: Persistent Connections

    perlbal to backends (mod_perls)


          know exactly when a connection is ready for a new
      −
          request
                no complex load balancing logic: just use whatever's free.
            

                beats managing “weighted round robin” hell.
    clients persistent; not tied to a specific backend


    connection


                                       PB




http://danga.com/words/
                                                                             42
Perlbal: Persistent Connections

    perlbal to backends (mod_perls)


          know exactly when a connection is ready for a new
      −
          request
                no complex load balancing logic: just use whatever's free.
            

                beats managing “weighted round robin” hell.
    clients persistent; not tied to a specific backend


    connection
                                                                    Apache
      Client
                                       PB

                                                                    Apache
      Client

http://danga.com/words/
                                                                             42
Perlbal: Persistent Connections

    perlbal to backends (mod_perls)


          know exactly when a connection is ready for a new
      −
          request
                no complex load balancing logic: just use whatever's free.
            

                beats managing “weighted round robin” hell.
    clients persistent; not tied to a specific backend


    connection
                            reqA1, A2             reqA1, B2         Apache
      Client
                                        PB
                          reqB1, B2

                                                                    Apache
      Client                                     reqB1, A2

http://danga.com/words/
                                                                             42
Perlbal: can verify new backend
                           connections
                          #include <sys/socket.h>
                          int listen(int sockfd, int backlog);


          connects to backends are often fast, but...
      

               are you talking to the kernel’s listen queue?
           

               or apache? (did apache accept() yet?)
           


          send OPTIONs request to see if apache is
      

          there
               Apache can reply to OPTIONS request quickly,
           −
               then Perlbal knows that conn is bound to an
           −
               apache process, not waiting in a kernel queue
          Huge improvement to user-visible latency!
      

               (and more fair/even load balancing)
           



http://danga.com/words/
                                                                 43
Perlbal: multiple queues

       high, normal, low priority queues
   

       paid users -> high queue
   

       bots/spiders/suspect traffic -> low queue
   




http://danga.com/words/
                                                     44
Perlbal: cooperative large file serving

           large file serving w/ mod_perl bad...
       

                mod_perl has better things to do than spoon-feed
            −
                clients bytes




http://danga.com/words/
                                                                   45
Perlbal: cooperative large file serving

        internal redirects
    

             mod_perl can pass off serving a big file to Perlbal
         −
                   either from disk, or from other URL(s)
               


             client sees no HTTP redirect
         −
             “Friends-only” images
         −
                   one, clean URL
               

                   mod_perl does auth, and is done.
               

                   perlbal serves.
               




http://danga.com/words/
                                                                   46
Internal redirect picture




http://danga.com/words/
                                                      47
And the reverse...

       Now Perlbal can buffer uploads as well..
   

       − Problems:
            LifeBlog uploading

              − cellphones are slow
            LiveJournal/Friendster photo uploads

              − cable/DSL uploads still slow
       − decide to buffer to “disk” (tmpfs, likely)
            on any of: rate, size, time

        blast at backend, only when full request is in




http://danga.com/words/
                                                          48
Palette Altering GIF/PNGs

          based on palette indexes, colors in URL,
      

          dynamically alter GIF/PNG palette table, then
          sendfile(2) the rest.




http://danga.com/words/
                                                          49
MogileFS




http://danga.com/words/
                                     50
oMgFileS




http://danga.com/words/
                                     51
MogileFS

         our distributed file system
     

         open source
     

         userspace
     

               based all around HTTP (NFS support now removed)
           


         hardly unique
     

               Google GFS
           −
               Nutch Distributed File System (NDFS)
           −
         production-quality
     

               lot of users
           −
               lot of big installs
           −




http://danga.com/words/
                                                                 52
MogileFS: Why

    alternatives at time were either:


     − closed, non-existent, expensive, in development,
       complicated, ...
     − scary/impossible when it came to data recovery
          new/uncommon/ unstudied on-disk formats

    because it was easy


     − initial version = 1 weekend! :)
     − current version = many, many weekends :)




http://danga.com/words/
                                                          53
MogileFS: Main Ideas

       files belong to classes,             multiple tracker
                                         −
   

       which dictate:                       databases
         − replication policy, min        − all share same
           replicas, ...                    database cluster
       tracks what disks files              (MySQL, etc..)
   

       are on                            big, cheap disks
                                     

         − set disk's state (up,          − dumb storage nodes
           temp_down, dead)                 w/ 12, 16 disks, no
           and host                         RAID
       keep replicas on devices
   

       on different hosts
         − (default class policy)
         − No RAID!

http://danga.com/words/
                                                                  54
MogileFS components

            clients
        

            mogilefsd (does all real work)
        

            database(s) (MySQL, .... abstract)
        

            storage nodes
        




http://danga.com/words/
                                                 55
MogileFS: Clients

          tiny text-based protocol
      

          Libraries available for:
      

               Perl
           −
                     tied filehandles
                 

                     MogileFS::Client
                 

                          my $fh = $mogc->new_file(“key”, [[$class], ...])
                      −

               Java
           −
               PHP
           −
               Python?
           −
               porting to $LANG is be trivial
           −
               future: no custom protocol. only HTTP
           −
          clients don't do database access
      




http://danga.com/words/
                                                                             56
MogileFS: Tracker
                                 (mogilefsd)

          The Meat
      

          event-based message bus
      

          load balances client requests, world info
      

          process manager
      

               heartbeats/watchdog, respawner, ...
           −
          Child processes:
      

               ~30x client interface (“query” process)
           −
                     interfaces client protocol w/ db(s), etc
                 


               ~5x replicate
           −
               ~2x delete
           −
               ~1x fsck, reap, monitor, ..., ...
           −



http://danga.com/words/
                                                                57
Trackers' Database(s)

          Abstract as of Mogile 2.x
      

           − MySQL
           − SQLite (joke/demo)
           − Pg/Oracle coming soon?
           − Also future:
               wrapper driver, partitioning any above

                  − small metadata in one driver (MySQL Cluster?),
                  − large tables partitioned over 2-node HA pairs
          Recommend config:
      

           − 2xMySQL InnoDB on DRBD
           − 2 slaves underneath HA VIP
               1 for backups

               read-only slave for during master failover window




http://danga.com/words/
                                                                     58
MogileFS storage nodes
                               (mogstored)
          HTTP transport
      

            − GET
            − PUT
            − DELETE
          mogstored listens on 2 ports...
      

             HTTP. --server={perlbal,lighttpd,...}

                  configs/manages your webserver of choice.

                  perlbal is default. some people like apache, etc

            − management/status:
                  iostat interface, AIO control, multi-stat() (for faster

                   fsck)
          files on filesystem, not DB
      

            − sendfile()! future: splice()
            − filesystem can be any filesystem

http://danga.com/words/
                                                                             59
Large file
          GET
        request




http://danga.com/words/
                          60
Auth: complex, but quick


       Large file
          GET
        request




http://danga.com/words/
                                            60
Spoonfeeding:
                                            slow, but event-
                                            based


                 Auth: complex, but quick


       Large file
          GET
        request




http://danga.com/words/
                                                           60
Gearman




http://danga.com/words/
                                    61
manaGer




http://danga.com/words/
                                    62
Manager
                           dispatches work,
              but doesn't do anything useful itself. :)




http://danga.com/words/
                                                          63
Gearman

    system to load balance function calls...


      scatter/gather bunch of calls in parallel,

      different languages,

      db connection pooling,

      spread CPU usage around your network,

      keep heavy libraries out of caller code,

      ...

      ...




http://danga.com/words/
                                                    64
Gearman Pieces

      gearmand
  

       − the function call router
       − event-loop (epoll, kqueue, etc)
      workers.
  

       − Gearman::Worker – perl/ruby
       − register/heartbeat/grab jobs
      clients
  

       − Gearman::Client[::Async] -- perl
           − also Ruby Gearman::Client
       − submit jobs to gearmand
           − opaque (to server) “funcname” string
           − optional opaque (to server) “args” string
           − opt coallescing key
http://danga.com/words/
                                                         65
Gearman Picture




http://danga.com/words/
                                            66
Gearman Picture


                          gearmand   gearmand   gearmand




http://danga.com/words/
                                                           66
Gearman Picture


                          gearmand   gearmand   gearmand




                                           Worker          Worker


http://danga.com/words/
                                                                    66
Gearman Picture


                          gearmand    gearmand         gearmand




                                                                  can_do(“funcA”)

                                     can_do(“funcA”)
                                     can_do(“funcB”)

                                                 Worker           Worker


http://danga.com/words/
                                                                                    66
Gearman Picture


                          gearmand    gearmand         gearmand




                                                                  can_do(“funcA”)

                                     can_do(“funcA”)
                                     can_do(“funcB”)

      Client                                     Worker           Worker


http://danga.com/words/
                                                                                    66
Gearman Picture


                          gearmand    gearmand         gearmand




 call(“funcA”)
                                                                  can_do(“funcA”)

                                     can_do(“funcA”)
                                     can_do(“funcB”)

      Client                                     Worker           Worker


http://danga.com/words/
                                                                                    66
Gearman Picture


                             gearmand    gearmand         gearmand




 call(“funcA”)
                                                                     can_do(“funcA”)

                                        can_do(“funcA”)
                                        can_do(“funcB”)

      Client              Client                    Worker           Worker


http://danga.com/words/
                                                                                       66
Gearman Picture


                             gearmand    gearmand         gearmand




 call(“funcA”)
                                                                     can_do(“funcA”)
                     call(“funcB”)
                                        can_do(“funcA”)
                                        can_do(“funcB”)

      Client              Client                    Worker           Worker


http://danga.com/words/
                                                                                       66
Gearman Protocol

          efficient binary protocol
      

          No XML
      

          but also line-based text protocol for admin
      

          commands
           − telnet to gearmand and get status
           − useful for Nagios plugins, etc




http://danga.com/words/
                                                        67
Gearman Uses

          Image::Magick outside of your mod_perls!
      

          DBI connection pooling (DBD::Gofer +
      

          Gearman)
          reducing load, improving visibility
      

          “services”
      

               can all be in different languages, too!
           −




http://danga.com/words/
                                                         68
Gearman Uses, cont..

           running code in parallel
       

                query ten databases at once
            −
           running blocking code from event loops
       

                DBI from POE/Danga::Socket apps
            −
           spreading CPU from ev loop daemons
       

           calling between different languages,
       

           ...
       




http://danga.com/words/
                                                    69
Gearman Misc

          Guarantees:
      

           − none! hah! :)
               please wait for your results.

               if client goes away, no promises

           − all retries on failures are done by client
               but server will notify client(s) if working worker

                goes away.
          No policy/conventions in gearmand
      

           − all policy/meaning between clients <-> workers
          ...
      




http://danga.com/words/
                                                                     70
Sick Gearman Demo

          Don’t actually use it like this... but:
      

                      use strict;
                      use DMap qw(dmap);
                      DMap->set_job_servers(quot;sammyquot;, quot;papagquot;);

                      my @foo = dmap { quot;$_ = quot; . `hostname` } (1..10);

                      print quot;dmap says:n @fooquot;;

                      $ ./dmap.pl
                      dmap says:
                       1 = sammy
                       2 = papag
                       3 = sammy
                       4 = papag
                       5 = sammy
                       6 = papag
                       7 = sammy
                       8 = papag
                       9 = sammy
                       10 = papag

http://danga.com/words/
                                                                         71
Gearman Summary

         Gearman is sexy.
     

              especially the coalescing
          −
         Check it out!
     

              it's kinda our little unadvertised secret
          −
                    oh crap, did I leak the secret?
                




http://danga.com/words/
                                                          72
TheSchwartz




http://danga.com/words/
                                        73
TheSchwartz

      Like gearman:
  

            job queuing system
        −
            opaque function name
        −
            opaque “args” blob
        −
            clients are either:
        −
                  submitting jobs
              

                  workers
              


      But unlike gearman:
  

            Reliable job queueing system
        −
            not low latency
        −
                  fire & forget (as opposed to gearman, where you wait for
              −
                  result)
      currently library, not network service
  



http://danga.com/words/
                                                                             74
TheSchwartz Primitives

          insert job
      

          “grab” job (atomic grab)
      

                for 'n' seconds.
           −
          mark job done
      

          temp fail job for future
      

                optional notes, rescheduling details..
           −
          replace job with 1+ other jobs
      

                atomic.
           −
          ...
      




http://danga.com/words/
                                                         75
TheSchwartz

     backing store:
 

           a database
       −
           uses Data::ObjectDriver
       −
                MySQL,
            

                Postgres,
            

                SQLite,
            

                ....
            


     but HA: you tell it @dbs, and it finds one to
 

     insert job into
           likewise, workers foreach (@dbs) to do work
       −




http://danga.com/words/
                                                         76
TheSchwartz uses

      outgoing email (SMTP client)
  

        − millions of emails per day
        − TheSchwartz::Worker::SendEmail
        − Email::Send::TheSchwartz
      LJ notifications
  

        − ESN: event, subscription, notification
             one event (new post, etc) -> thousands of emails, SMSes,

              XMPP messages, etc...
      pinging external services
  

      atomstream injection
  

      .....
  

      dozens of users
  

      shared farm for TypePad, Vox, LJ
  




http://danga.com/words/
                                                                         77
gearmand + TheSchwartz

       gearmand: not reliable, low-latency, no disks
   

       TheSchwartz: latency, reliable, disks
   

       In TypePad:
   

             TheSchwartz, with gearman to fire off TheSchwartz
         −
             workers.
                   disks, but low-latency
               

                   future: no disks, SSD/Flash, MySQL Cluster
               




http://danga.com/words/
                                                                 78
djabberd




http://danga.com/words/
                                     79
djabberd

       Our Jabber/XMPP server
   

             powers our “LJ Talk” service
         


       S2S: works with GoogleTalk, etc
   

       perl, event-based (epoll, etc)
   

       done 300,000+ conns
   

       tiny per-conn memory overhead
   

             release XML parser state if possible
         −




http://danga.com/words/
                                                    80
djabberd hooks

     everything is a hook
 

           not just auth! like, everything.
       −
                 auth,
             −
                 roster,
             −
                 vcard info (avatars),
             −
                 presence,
             −
                 delivery,
             −
                 inter-node cluster delivery,
             −
           ala mod_perl, qpsmtpd, etc.
       −
     async hooks
 

           hooks phases can take as long as they want before
       −
           they answer, or decline to next phase in hook chain...
           we use Gearman::Client::Async
       −


http://danga.com/words/
                                                                    81
Thank you!

                               Questions to:
                             brad@danga.com

                                  Software:
                              http://danga.com/
                          http://code.sixapart.com/




http://danga.com/words/
                                                      82
net.

                                                  Questions?
    BIG-IP
                                                                                            Global Database
                            perlbal (httpd/proxy)
            bigip1                                        mod_perl
            bigip2                  proxy1                                                       master_a master_b
                                                            web1
                                    proxy2                  web2
                                    proxy3                                      Memcached
                                                            web3                            slave1 slave2     ...   slave5
                                    proxy4
  djabberd                                                                         mc1
                                                            web4
      djabberd                      proxy5                                         mc2
                                                              ...                               User DB Cluster 1
      djabberd
                                                                                   mc3             uc1a         uc1b
                                                            webN
                                                                                   mc4          User DB Cluster 2
                                                                                    ...            uc2a         uc2b
                                                    gearmand
                                                                                   mcN
 Mogile Storage Nodes                                   gearmand1                               User DB Cluster 3
                     sto2
     sto1                                               gearmandN                                  uc3a         uc3b
                                Mogile Trackers
                     sto8
      ...                        tracker1    tracker3                                           User DB Cluster N
                                                                                                   ucNa         ucNb
                                                                    “workers”
    MogileFS Database
                                                                       gearwrkN                 Job Queues (xN)
        mog_a           mog_b                                         theschwkN                    jqNa         jqNb


     slave1     slaveN
http://danga.com/words/
                                                                                                                             83
Bonus Slides

             if extra time
         




http://danga.com/words/
                                            84
Data Integrity

          Databases depend on fsync()
      

               but databases can't send raw SCSI/ATA commands
           −
               to flush controller caches, etc
          fsync() almost never works work
      

               Linux, FS' (lack of) barriers, raid cards, controllers,
           −
               disks, ....
          Solution: test! & fix
      

               disk-checker.pl
           −
                     client/server
                 

                     spew writes/fsyncs, record intentions on alive machine,
                 

                     yank power, checks.



http://danga.com/words/
                                                                               85
Persistent Connection Woes

          connections == threads == memory
      

               My pet peeve:
           −
                     want connection/thread distinction in MySQL!
                 

                     w/ max-runnable-threads tunable
                 


          max threads
      

               limit max memory/concurrency
           −
          DBD::Gofer + Gearman
      

               Ask
           −
          Data::ObjectDriver + Gearman
      




http://danga.com/words/
                                                                    86

More Related Content

What's hot

Advanced mysql replication for the masses
Advanced mysql replication for the massesAdvanced mysql replication for the masses
Advanced mysql replication for the massesGiuseppe Maxia
 
MongoDB at the energy frontier
MongoDB at the energy frontierMongoDB at the energy frontier
MongoDB at the energy frontierValentin Kuznetsov
 
Dynamic Languages & Web Frameworks in GlassFish
Dynamic Languages & Web Frameworks in GlassFishDynamic Languages & Web Frameworks in GlassFish
Dynamic Languages & Web Frameworks in GlassFishIndicThreads
 
Rapid JCR applications development with Sling
Rapid JCR applications development with SlingRapid JCR applications development with Sling
Rapid JCR applications development with SlingBertrand Delacretaz
 
Lightweight Grids With Terracotta
Lightweight Grids With TerracottaLightweight Grids With Terracotta
Lightweight Grids With TerracottaPT.JUG
 
Inside mbga Open Platform API architecture
Inside mbga Open Platform API architectureInside mbga Open Platform API architecture
Inside mbga Open Platform API architectureToru Yamaguchi
 
Linux-HA with Pacemaker
Linux-HA with PacemakerLinux-HA with Pacemaker
Linux-HA with PacemakerKris Buytaert
 
Couchbase Server 2.0 - XDCR - Deep dive
Couchbase Server 2.0 - XDCR - Deep diveCouchbase Server 2.0 - XDCR - Deep dive
Couchbase Server 2.0 - XDCR - Deep diveDipti Borkar
 
Bio2 Rdf Presentation V3
Bio2 Rdf Presentation V3Bio2 Rdf Presentation V3
Bio2 Rdf Presentation V3nolmar01
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...npinto
 
The new ehcache 2.0 and hibernate spi
The new ehcache 2.0 and hibernate spiThe new ehcache 2.0 and hibernate spi
The new ehcache 2.0 and hibernate spiCyril Lakech
 
CloudStackユーザ会〜仮想ルータの謎に迫る
CloudStackユーザ会〜仮想ルータの謎に迫るCloudStackユーザ会〜仮想ルータの謎に迫る
CloudStackユーザ会〜仮想ルータの謎に迫るsamemoon
 
Status Quo and (current) Limitations of Library Linked Data
Status Quo and (current) Limitations of Library Linked DataStatus Quo and (current) Limitations of Library Linked Data
Status Quo and (current) Limitations of Library Linked DataDaniel Vila Suero
 
What makes JBoss AS7 tick?
What makes JBoss AS7 tick?What makes JBoss AS7 tick?
What makes JBoss AS7 tick?marius_bogoevici
 

What's hot (18)

HIGH AVAILABLE CLUSTER IN WEB SERVER WITH HEARTBEAT + DRBD + OCFS2
HIGH AVAILABLE CLUSTER IN WEB SERVER WITH  HEARTBEAT + DRBD + OCFS2HIGH AVAILABLE CLUSTER IN WEB SERVER WITH  HEARTBEAT + DRBD + OCFS2
HIGH AVAILABLE CLUSTER IN WEB SERVER WITH HEARTBEAT + DRBD + OCFS2
 
Advanced mysql replication for the masses
Advanced mysql replication for the massesAdvanced mysql replication for the masses
Advanced mysql replication for the masses
 
MongoDB at the energy frontier
MongoDB at the energy frontierMongoDB at the energy frontier
MongoDB at the energy frontier
 
Dynamic Languages & Web Frameworks in GlassFish
Dynamic Languages & Web Frameworks in GlassFishDynamic Languages & Web Frameworks in GlassFish
Dynamic Languages & Web Frameworks in GlassFish
 
Data Aggregation System
Data Aggregation SystemData Aggregation System
Data Aggregation System
 
DTrace and Drupal
DTrace and DrupalDTrace and Drupal
DTrace and Drupal
 
Rapid JCR applications development with Sling
Rapid JCR applications development with SlingRapid JCR applications development with Sling
Rapid JCR applications development with Sling
 
Lightweight Grids With Terracotta
Lightweight Grids With TerracottaLightweight Grids With Terracotta
Lightweight Grids With Terracotta
 
Inside mbga Open Platform API architecture
Inside mbga Open Platform API architectureInside mbga Open Platform API architecture
Inside mbga Open Platform API architecture
 
Linux-HA with Pacemaker
Linux-HA with PacemakerLinux-HA with Pacemaker
Linux-HA with Pacemaker
 
Couchbase Server 2.0 - XDCR - Deep dive
Couchbase Server 2.0 - XDCR - Deep diveCouchbase Server 2.0 - XDCR - Deep dive
Couchbase Server 2.0 - XDCR - Deep dive
 
Bio2 Rdf Presentation V3
Bio2 Rdf Presentation V3Bio2 Rdf Presentation V3
Bio2 Rdf Presentation V3
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
 
The new ehcache 2.0 and hibernate spi
The new ehcache 2.0 and hibernate spiThe new ehcache 2.0 and hibernate spi
The new ehcache 2.0 and hibernate spi
 
CloudStackユーザ会〜仮想ルータの謎に迫る
CloudStackユーザ会〜仮想ルータの謎に迫るCloudStackユーザ会〜仮想ルータの謎に迫る
CloudStackユーザ会〜仮想ルータの謎に迫る
 
Status Quo and (current) Limitations of Library Linked Data
Status Quo and (current) Limitations of Library Linked DataStatus Quo and (current) Limitations of Library Linked Data
Status Quo and (current) Limitations of Library Linked Data
 
What's new in JSR-283?
What's new in JSR-283?What's new in JSR-283?
What's new in JSR-283?
 
What makes JBoss AS7 tick?
What makes JBoss AS7 tick?What makes JBoss AS7 tick?
What makes JBoss AS7 tick?
 

Similar to Behind the Scenes at LiveJournal: Scaling Storytime

Why JRuby? - RubyConf 2012
Why JRuby? - RubyConf 2012Why JRuby? - RubyConf 2012
Why JRuby? - RubyConf 2012Charles Nutter
 
Seeley yonik solr performance key innovations
Seeley yonik   solr performance key innovationsSeeley yonik   solr performance key innovations
Seeley yonik solr performance key innovationsLucidworks (Archived)
 
Node.js Explained
Node.js ExplainedNode.js Explained
Node.js ExplainedJeff Kunkle
 
Replatforming Legacy Packaged Applications: Block-by-Block with Minecraft
Replatforming Legacy Packaged Applications: Block-by-Block with MinecraftReplatforming Legacy Packaged Applications: Block-by-Block with Minecraft
Replatforming Legacy Packaged Applications: Block-by-Block with MinecraftVMware Tanzu
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache KafkaShiao-An Yuan
 
Hong Qiangning in QConBeijing
Hong Qiangning in QConBeijingHong Qiangning in QConBeijing
Hong Qiangning in QConBeijingshen liu
 
State of the art - server side JavaScript - web-5 2012
State of the art - server side JavaScript - web-5 2012State of the art - server side JavaScript - web-5 2012
State of the art - server side JavaScript - web-5 2012Alexandre Morgaut
 
GlassFish can support multiple Ruby frameworks ... really ?
GlassFish can support multiple Ruby frameworks ... really ?GlassFish can support multiple Ruby frameworks ... really ?
GlassFish can support multiple Ruby frameworks ... really ?Arun Gupta
 
Glass fish rubyconf-india-2010-Arun gupta
Glass fish rubyconf-india-2010-Arun gupta Glass fish rubyconf-india-2010-Arun gupta
Glass fish rubyconf-india-2010-Arun gupta ThoughtWorks
 
Automated testing with OffScale and MongoDB
Automated testing with OffScale and MongoDBAutomated testing with OffScale and MongoDB
Automated testing with OffScale and MongoDBOmer Gertel
 
豆瓣技术架构的发展历程 @ QCon Beijing 2009
豆瓣技术架构的发展历程 @ QCon Beijing 2009豆瓣技术架构的发展历程 @ QCon Beijing 2009
豆瓣技术架构的发展历程 @ QCon Beijing 2009Qiangning Hong
 
Deploying with JRuby
Deploying with JRubyDeploying with JRuby
Deploying with JRubyJoe Kutner
 
State of the art: Server-Side JavaScript - WebWorkersCamp IV - Open World For...
State of the art: Server-Side JavaScript - WebWorkersCamp IV - Open World For...State of the art: Server-Side JavaScript - WebWorkersCamp IV - Open World For...
State of the art: Server-Side JavaScript - WebWorkersCamp IV - Open World For...Alexandre Morgaut
 
High Availability With DRBD & Heartbeat
High Availability With DRBD & HeartbeatHigh Availability With DRBD & Heartbeat
High Availability With DRBD & HeartbeatChris Barber
 
视觉中国的MongoDB应用实践(QConBeijing2011)
视觉中国的MongoDB应用实践(QConBeijing2011)视觉中国的MongoDB应用实践(QConBeijing2011)
视觉中国的MongoDB应用实践(QConBeijing2011)Night Sailer
 
State of the art: server-side javaScript - NantesJS
State of the art: server-side javaScript - NantesJSState of the art: server-side javaScript - NantesJS
State of the art: server-side javaScript - NantesJSAlexandre Morgaut
 
Cacheconcurrencyconsistency cassandra svcc
Cacheconcurrencyconsistency cassandra svccCacheconcurrencyconsistency cassandra svcc
Cacheconcurrencyconsistency cassandra svccsrisatish ambati
 

Similar to Behind the Scenes at LiveJournal: Scaling Storytime (20)

RocksDB meetup
RocksDB meetupRocksDB meetup
RocksDB meetup
 
Why JRuby? - RubyConf 2012
Why JRuby? - RubyConf 2012Why JRuby? - RubyConf 2012
Why JRuby? - RubyConf 2012
 
Seeley yonik solr performance key innovations
Seeley yonik   solr performance key innovationsSeeley yonik   solr performance key innovations
Seeley yonik solr performance key innovations
 
Node.js Explained
Node.js ExplainedNode.js Explained
Node.js Explained
 
Replatforming Legacy Packaged Applications: Block-by-Block with Minecraft
Replatforming Legacy Packaged Applications: Block-by-Block with MinecraftReplatforming Legacy Packaged Applications: Block-by-Block with Minecraft
Replatforming Legacy Packaged Applications: Block-by-Block with Minecraft
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
Hong Qiangning in QConBeijing
Hong Qiangning in QConBeijingHong Qiangning in QConBeijing
Hong Qiangning in QConBeijing
 
State of the art - server side JavaScript - web-5 2012
State of the art - server side JavaScript - web-5 2012State of the art - server side JavaScript - web-5 2012
State of the art - server side JavaScript - web-5 2012
 
GlassFish can support multiple Ruby frameworks ... really ?
GlassFish can support multiple Ruby frameworks ... really ?GlassFish can support multiple Ruby frameworks ... really ?
GlassFish can support multiple Ruby frameworks ... really ?
 
Glass fish rubyconf-india-2010-Arun gupta
Glass fish rubyconf-india-2010-Arun gupta Glass fish rubyconf-india-2010-Arun gupta
Glass fish rubyconf-india-2010-Arun gupta
 
NoSQL with MySQL
NoSQL with MySQLNoSQL with MySQL
NoSQL with MySQL
 
Automated testing with OffScale and MongoDB
Automated testing with OffScale and MongoDBAutomated testing with OffScale and MongoDB
Automated testing with OffScale and MongoDB
 
Open stack@ebay
Open stack@ebayOpen stack@ebay
Open stack@ebay
 
豆瓣技术架构的发展历程 @ QCon Beijing 2009
豆瓣技术架构的发展历程 @ QCon Beijing 2009豆瓣技术架构的发展历程 @ QCon Beijing 2009
豆瓣技术架构的发展历程 @ QCon Beijing 2009
 
Deploying with JRuby
Deploying with JRubyDeploying with JRuby
Deploying with JRuby
 
State of the art: Server-Side JavaScript - WebWorkersCamp IV - Open World For...
State of the art: Server-Side JavaScript - WebWorkersCamp IV - Open World For...State of the art: Server-Side JavaScript - WebWorkersCamp IV - Open World For...
State of the art: Server-Side JavaScript - WebWorkersCamp IV - Open World For...
 
High Availability With DRBD & Heartbeat
High Availability With DRBD & HeartbeatHigh Availability With DRBD & Heartbeat
High Availability With DRBD & Heartbeat
 
视觉中国的MongoDB应用实践(QConBeijing2011)
视觉中国的MongoDB应用实践(QConBeijing2011)视觉中国的MongoDB应用实践(QConBeijing2011)
视觉中国的MongoDB应用实践(QConBeijing2011)
 
State of the art: server-side javaScript - NantesJS
State of the art: server-side javaScript - NantesJSState of the art: server-side javaScript - NantesJS
State of the art: server-side javaScript - NantesJS
 
Cacheconcurrencyconsistency cassandra svcc
Cacheconcurrencyconsistency cassandra svccCacheconcurrencyconsistency cassandra svcc
Cacheconcurrencyconsistency cassandra svcc
 

More from SergeyChernyshev

Capturing speed of user experience using user timing api
Capturing speed of user experience using user timing apiCapturing speed of user experience using user timing api
Capturing speed of user experience using user timing apiSergeyChernyshev
 
Managing application performance by Kwame Thomison
Managing application performance by Kwame ThomisonManaging application performance by Kwame Thomison
Managing application performance by Kwame ThomisonSergeyChernyshev
 
Fastest request is never made
Fastest request is never madeFastest request is never made
Fastest request is never madeSergeyChernyshev
 
Designing speed with progressive enhancement
Designing speed with progressive enhancementDesigning speed with progressive enhancement
Designing speed with progressive enhancementSergeyChernyshev
 
Web performance tools @ WebPerf.camp 2016
Web performance tools @ WebPerf.camp 2016Web performance tools @ WebPerf.camp 2016
Web performance tools @ WebPerf.camp 2016SergeyChernyshev
 
Extending your applications to the edge with CDNs
Extending your applications to the edge with CDNsExtending your applications to the edge with CDNs
Extending your applications to the edge with CDNsSergeyChernyshev
 
What we can learn from CDNs about Web Development, Deployment, and Performance
What we can learn from CDNs about Web Development, Deployment, and PerformanceWhat we can learn from CDNs about Web Development, Deployment, and Performance
What we can learn from CDNs about Web Development, Deployment, and PerformanceSergeyChernyshev
 
Scalability vs. Performance
Scalability vs. PerformanceScalability vs. Performance
Scalability vs. PerformanceSergeyChernyshev
 
Web performance: beyond load testing
Web performance: beyond load testingWeb performance: beyond load testing
Web performance: beyond load testingSergeyChernyshev
 
Introduction to web performance @ IEEE
Introduction to web performance @ IEEEIntroduction to web performance @ IEEE
Introduction to web performance @ IEEESergeyChernyshev
 
Introduction to Web Performance
Introduction to Web PerformanceIntroduction to Web Performance
Introduction to Web PerformanceSergeyChernyshev
 
Performance anti patterns in ajax applications
Performance anti patterns in ajax applicationsPerformance anti patterns in ajax applications
Performance anti patterns in ajax applicationsSergeyChernyshev
 
Keeping track of your performance using Show Slow
Keeping track of your performance using Show SlowKeeping track of your performance using Show Slow
Keeping track of your performance using Show SlowSergeyChernyshev
 
Keeping track of your performance using show slow
Keeping track of your performance using show slowKeeping track of your performance using show slow
Keeping track of your performance using show slowSergeyChernyshev
 
Building your own CDN using Amazon EC2
Building your own CDN using Amazon EC2Building your own CDN using Amazon EC2
Building your own CDN using Amazon EC2SergeyChernyshev
 

More from SergeyChernyshev (20)

Capturing speed of user experience using user timing api
Capturing speed of user experience using user timing apiCapturing speed of user experience using user timing api
Capturing speed of user experience using user timing api
 
Managing application performance by Kwame Thomison
Managing application performance by Kwame ThomisonManaging application performance by Kwame Thomison
Managing application performance by Kwame Thomison
 
Fastest request is never made
Fastest request is never madeFastest request is never made
Fastest request is never made
 
Designing speed with progressive enhancement
Designing speed with progressive enhancementDesigning speed with progressive enhancement
Designing speed with progressive enhancement
 
Web performance tools @ WebPerf.camp 2016
Web performance tools @ WebPerf.camp 2016Web performance tools @ WebPerf.camp 2016
Web performance tools @ WebPerf.camp 2016
 
Extending your applications to the edge with CDNs
Extending your applications to the edge with CDNsExtending your applications to the edge with CDNs
Extending your applications to the edge with CDNs
 
Speed is feature #1
Speed is feature #1Speed is feature #1
Speed is feature #1
 
Tools of the trade 2014
Tools of the trade 2014Tools of the trade 2014
Tools of the trade 2014
 
What we can learn from CDNs about Web Development, Deployment, and Performance
What we can learn from CDNs about Web Development, Deployment, and PerformanceWhat we can learn from CDNs about Web Development, Deployment, and Performance
What we can learn from CDNs about Web Development, Deployment, and Performance
 
Scalability vs. Performance
Scalability vs. PerformanceScalability vs. Performance
Scalability vs. Performance
 
Web performance: beyond load testing
Web performance: beyond load testingWeb performance: beyond load testing
Web performance: beyond load testing
 
Introduction to web performance @ IEEE
Introduction to web performance @ IEEEIntroduction to web performance @ IEEE
Introduction to web performance @ IEEE
 
Introduction to Web Performance
Introduction to Web PerformanceIntroduction to Web Performance
Introduction to Web Performance
 
Performance anti patterns in ajax applications
Performance anti patterns in ajax applicationsPerformance anti patterns in ajax applications
Performance anti patterns in ajax applications
 
Taming 3rd party content
Taming 3rd party contentTaming 3rd party content
Taming 3rd party content
 
Velocity 2010 review
Velocity 2010 reviewVelocity 2010 review
Velocity 2010 review
 
Keeping track of your performance using Show Slow
Keeping track of your performance using Show SlowKeeping track of your performance using Show Slow
Keeping track of your performance using Show Slow
 
Keeping track of your performance using show slow
Keeping track of your performance using show slowKeeping track of your performance using show slow
Keeping track of your performance using show slow
 
In Search of Speed
In Search of SpeedIn Search of Speed
In Search of Speed
 
Building your own CDN using Amazon EC2
Building your own CDN using Amazon EC2Building your own CDN using Amazon EC2
Building your own CDN using Amazon EC2
 

Recently uploaded

What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 

Recently uploaded (20)

What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 

Behind the Scenes at LiveJournal: Scaling Storytime

  • 1. LiveJournal: Behind The Scenes Scaling Storytime June 2007 USENIX Brad Fitzpatrick brad@danga.com danga.com / livejournal.com / sixapart.com This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/1.0/ or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA. http://danga.com/words/ 1
  • 2. The plan... Refer to previous presentations for more  details... http://danga.com/words/  Questions anytime! Yell. Interrupt.  Part 0:  show where talk will end up − Part I:  What is LiveJournal? Quick history. − LJ’s scaling history − Part II:  explain all our software, − explain all the moving parts − http://danga.com/words/ 2
  • 3. net. LiveJournal Backend: Today (Roughly.) BIG-IP Global Database perlbal (httpd/proxy) bigip1 mod_perl bigip2 proxy1 master_a master_b web1 proxy2 web2 proxy3 Memcached web3 slave1 slave2 ... slave5 proxy4 djabberd mc1 web4 djabberd proxy5 mc2 ... User DB Cluster 1 djabberd mc3 uc1a uc1b webN mc4 User DB Cluster 2 ... uc2a uc2b gearmand mcN Mogile Storage Nodes gearmand1 User DB Cluster 3 sto2 sto1 gearmandN uc3a uc3b Mogile Trackers sto8 ... tracker1 tracker3 User DB Cluster N ucNa ucNb “workers” MogileFS Database gearwrkN Job Queues (xN) mog_a mog_b theschwkN jqNa jqNb slave1 slaveN http://danga.com/words/ 3
  • 4. LiveJournal Overview college hobby project, Apr 1999  4-in-1:  − blogging − forums − social-networking (“friends”) − aggregator: “friends page” − “friends” can be external RSS/Atom 10M+ accounts  Open Source!  − server, − infrastructure, − original clients, http://danga.com/words/ 4
  • 5. Stuff we've built... (all production, open source) memcached djabberd   − distributed caching − the super-extensible MogileFS everything-is-a-plugin  − distributed filesystem mod_perl/qpsmtpd/ Perlbal Eclipse of XMPP/Jabber  − HTTP load balancer, web servers server, swiss-army knife .....  gearman OpenID   − LB/HA/coalescing low- federated identity  latency function call protocol “router” TheSchwartz  − reliable, async job dispatch system http://danga.com/words/ 5
  • 6. “Uh, why?” NIH? (Not Invented Here?)  Are we reinventing the wheel?  http://danga.com/words/ 6
  • 7. Yes. We build wheels.  ... when existing suck, − ... or don’t exist. − http://danga.com/words/ 7
  • 8. Yes. We build wheels.  ... when existing suck, − ... or don’t exist. − http://danga.com/words/ 7
  • 9. Yes. We build wheels.  ... when existing suck, − ... or don’t exist. − http://danga.com/words/ 7
  • 10. Yes. We build wheels.  ... when existing suck, − ... or don’t exist. − (yes, arguably tires. sshh..) http://danga.com/words/ 7
  • 11. Part I Quick Scaling History http://danga.com/words/ 8
  • 12. Quick Scaling History 1 server to hundreds...  you can do all this with just 1 server!  then you’re ready for tons of servers, without pain − don’t repeat our scaling mistakes − http://danga.com/words/ 9
  • 13. Terminology Scaling:  NOT: “How fast?” − But: “When you add twice as many servers, are you − twice as fast (or have twice the capacity)?” Fast still matters,  2x faster: 50 servers instead of 100... − that’s some good money  but that’s not what scaling is. − http://danga.com/words/ 10
  • 14. Terminology “Cluster”  varying definitions... basically: − making a bunch of computers work together for − some purpose what purpose? − load balancing (LB),  high availablility (HA)  Load Balancing?  High Availability?  Venn Diagram time!  I love Venn Diagrams − http://danga.com/words/ 11
  • 15. LB vs. HA Load Balancing High Availability http://danga.com/words/ 12
  • 16. LB vs. HA High Load Availability Balancing LVS heartbeat, http round-robin DNS, cold/warm/hot spare, reverse proxy, ... data partitioning, wackamole, .... ... http://danga.com/words/ 13
  • 17. Favorite Venn Diagram Times When Times When I’m I’m Truly Happy Wearing Pants http://danga.com/words/ 14
  • 18. One Server Simple:  mysql apache http://danga.com/words/ 15
  • 19. Two Servers apache mysql http://danga.com/words/ 16
  • 20. Two Servers - Problems Two single points of failure!  No hot or cold spares  Site gets slow again.  CPU-bound on web node − need more web nodes... − http://danga.com/words/ 17
  • 21. Four Servers 3 webs, 1 db  Now we need to load-balance!  LVS, mod_backhand, whackamole, BIG-IP,  Alteon, pound, Perlbal, etc, etc.. ... − http://danga.com/words/ 18
  • 22. Four Servers - Problems Now I/O bound...  ... how to use another database?  − http://danga.com/words/ 19
  • 23. Five Servers introducing MySQL replication We buy a new DB  MySQL replication  Writes to DB (master)  Reads from both  http://danga.com/words/ 20
  • 24. More Servers Chaos! http://danga.com/words/ 21
  • 25. net. Where we're at.... BIG-IP bigip1 bigip2 mod_proxy mod_perl proxy1 web1 proxy2 web2 proxy3 Global Database web3 master web4 ... web12 slave1 slave2 ... slave6 http://danga.com/words/ 22
  • 26. Problems with Architecture or, “This don't scale...” DB master is SPOF  Adding slaves doesn't scale  well... − only spreads reads, not writes! 500 reads/s 250 reads/s 250 reads/s 200 write/s 200 write/s 200 writes/s http://danga.com/words/ 23
  • 27. Eventually... databases eventual only writing  3 reads/s 3 r/s 3 reads/s 3 r/s 3 reads/s 3 r/s 3 reads/s 3 r/s 3 reads/s 3 r/s 3 reads/s 3 r/s 3 reads/s 3 r/s 400 400 400 400 400 400 400 400 write/s 400 write/s 400 write/s 400 write/s 400 write/s 400 write/s 400 write/s write/s write/s write/s write/s write/s write/s write/s http://danga.com/words/ 24
  • 28. Spreading Writes Our database machines already did RAID  We did backups  So why put user data on 6+ slave machines?  (~12+ disks) overkill redundancy − wasting time writing everywhere! − http://danga.com/words/ 25
  • 29. Partition your data! Spread your databases out, into “roles”  − roles that you never need to join between  different users  or accept you'll have to join in app Each user assigned to a numbered HA cluster  Each cluster has multiple machines  − writes self-contained in cluster (writing to 2-3 machines, not 6) http://danga.com/words/ 26
  • 31. User Clusters SELECT userid, clusterid FROM user WHERE user='bob' http://danga.com/words/ 27
  • 32. User Clusters SELECT userid, clusterid FROM user WHERE user='bob' userid: 839 clusterid: 2 http://danga.com/words/ 27
  • 33. User Clusters SELECT .... SELECT userid, FROM ... clusterid FROM WHERE user WHERE userid=839 ... user='bob' userid: 839 clusterid: 2 http://danga.com/words/ 27
  • 34. User Clusters SELECT .... SELECT userid, FROM ... clusterid FROM WHERE user WHERE userid=839 ... user='bob' OMG i like totally hate my parents userid: 839 they just clusterid: 2 dont understand me and i h8 the world omg lol rofl *! :^- ^^; add me as a friend!!! http://danga.com/words/ 27
  • 35. Details per-user numberspaces  − don't use AUTO_INCREMENT − PRIMARY KEY (user_id, thing_id) − so: Can move/upgrade users 1-at-a-time:  − per-user “readonly” flag − per-user “schema_ver” property − user-moving harness  job server that coordinates, distributed long- lived user-mover clients who ask for tasks − balancing disk I/O, disk space http://danga.com/words/ 28
  • 36. Shared Storage (SAN, SCSI, DRBD...) Turn pair of InnoDB machines into a cluster  − looks like 1 box to outside world. floating IP. One machine at a time mounting fs, running MySQL  Heartbeat to move IP, {un,}mount filesystem, {stop,start}  mysql  filesystem repairs,  innodb repairs,  don’t lose any committed transactions. No special schema considerations  MySQL 4.1 w/ binlog sync/flush options  − good − The cluster can be a master or slave as well http://danga.com/words/ 29
  • 37. Shared Storage: DRBD Linux block device driver  floater ip − “Network RAID 1” − Shared storage without sharing! mysql mysql − sits atop another block device − syncs w/ another machine's ext3 ext3 block device  cross-over gigabit cable drbd drbd ideal. network is faster than sda sda random writes on your disks. InnoDB on DRBD: HA MySQL!  − can hang slaves off HA pair, − and/or, − HA pair can be slave of a master http://danga.com/words/ 30
  • 38. MySQL Clustering Options: Pros & Cons No magic bullet...  Master/Slave − doesn’t scale with writes  Master/Master − special schemas  DRBD − only HA, not LB  MySQL Cluster − special-purpose  .... − lots of options!  − :) − :( http://danga.com/words/ 31
  • 39. Part II Our Software http://danga.com/words/ 32
  • 40. Caching caching's key to performance  − store result of a computation or I/O for quicker future access (classic space/time trade-off) Where to cache?  − mod_perl/php internal caching  memory waste (address space per apache child) − shared memory  limited to single machine, same with Java/C#/ Mono − MySQL query cache  flushed per update, small max size − HEAP tables  fixed length rows, small max size http://danga.com/words/ 33
  • 41. memcached http://www.danga.com/memcached/ our Open Source, distributed caching system   implements a dictionary ADT, with network API run instances wherever free memory  two-level hash  − client hashes* to server, − server has internal dictionary (hash table) no “master node”, nodes aren’t aware of each  other protocol simple, XML-free  − clients: c, perl, java, c#, php, python, ruby, ... popular, fast  scalable  http://danga.com/words/ 34
  • 42. Protocol Commands set, add, replace  delete  incr, decr  atomic, returning new value − http://danga.com/words/ 35
  • 44. Picture 10.0.0.100:11211 10.0.0.101:11211 10.0.0.102:11211 1GB 2GB 1GB http://danga.com/words/ 36
  • 45. Picture 10.0.0.100:11211 10.0.0.101:11211 10.0.0.102:11211 1GB 2GB 1GB http://danga.com/words/ 36
  • 46. Picture 10.0.0.100:11211 10.0.0.101:11211 10.0.0.102:11211 1GB 2GB 1GB 0 1 2 3 http://danga.com/words/ 36
  • 47. Picture 10.0.0.100:11211 10.0.0.101:11211 10.0.0.102:11211 1GB 2GB 1GB 0 1 2 3 Client http://danga.com/words/ 36
  • 48. Picture 10.0.0.100:11211 10.0.0.101:11211 10.0.0.102:11211 1GB 2GB 1GB 0 1 2 3 $val = $client->get(“foo”) Client http://danga.com/words/ 36
  • 49. Picture 10.0.0.100:11211 10.0.0.101:11211 10.0.0.102:11211 1GB 2GB 1GB 0 1 2 3 $val = $client->get(“foo”) CRC32(“foo”) % 4 = 2 Client http://danga.com/words/ 36
  • 50. Picture 10.0.0.100:11211 10.0.0.101:11211 10.0.0.102:11211 1GB 2GB 1GB 0 1 2 3 $val = $client->get(“foo”) CRC32(“foo”) % 4 = 2 Client connect to server[2] (“10.0.0.101:11211”) http://danga.com/words/ 36
  • 51. Picture 10.0.0.100:11211 10.0.0.101:11211 10.0.0.102:11211 1GB 2GB 1GB 0 1 2 3 GET foo $val = $client->get(“foo”) CRC32(“foo”) % 4 = 2 Client connect to server[2] (“10.0.0.101:11211”) http://danga.com/words/ 36
  • 52. Picture 10.0.0.100:11211 10.0.0.101:11211 10.0.0.102:11211 1GB 2GB 1GB 0 1 2 3 GET foo (response) $val = $client->get(“foo”) CRC32(“foo”) % 4 = 2 Client connect to server[2] (“10.0.0.101:11211”) http://danga.com/words/ 36
  • 53. Client hashing onto a memcacached node Up to client how to pick a memcached node  Traditional way:  CRC32(<key>) % <num_servers> − (servers with more memory can own more slots) − CRC32 was least common denominator for all − languages to implement, allowing cross-language memcached sharing con: can’t add/remove servers without hit rate − crashing “Consistent hashing”  can add/remove servers with minimal <key> to − <server> map changes http://danga.com/words/ 37
  • 54. memcached internals libevent  epoll, kqueue... − event-based, non-blocking design  optional multithreading, thread per CPU (not per − client) slab allocator  referenced counted objects  slow clients can’t block other clients from altering − namespace or data LRU  all internal operations O(1)  http://danga.com/words/ 38
  • 56. Web Load Balancing BIG-IP, Alteon, Juniper, Foundry  − good for L4 or minimal L7 − not tricky / fun enough. :-) Tried a dozen reverse proxies  − none did what we wanted or were fast enough Wrote Perlbal  − fast, smart, manageable HTTP web server / reverse proxy / LB − can do internal redirects  and dozen other tricks http://danga.com/words/ 40
  • 57. Perlbal Perl   parts optionally in C with plugins single threaded, async event-based  − uses epoll, kqueue, etc. console / HTTP remote management  − live config changes handles dead nodes, smart balancing  multiple modes  − static webserver − reverse proxy − plug-ins (Javascript message bus.....) plug-ins  − GIF/PNG altering, .... http://danga.com/words/ 41
  • 59. Perlbal: Persistent Connections perlbal to backends (mod_perls)  know exactly when a connection is ready for a new − request no complex load balancing logic: just use whatever's free.  beats managing “weighted round robin” hell. clients persistent; not tied to a specific backend  connection http://danga.com/words/ 42
  • 60. Perlbal: Persistent Connections perlbal to backends (mod_perls)  know exactly when a connection is ready for a new − request no complex load balancing logic: just use whatever's free.  beats managing “weighted round robin” hell. clients persistent; not tied to a specific backend  connection PB http://danga.com/words/ 42
  • 61. Perlbal: Persistent Connections perlbal to backends (mod_perls)  know exactly when a connection is ready for a new − request no complex load balancing logic: just use whatever's free.  beats managing “weighted round robin” hell. clients persistent; not tied to a specific backend  connection Apache Client PB Apache Client http://danga.com/words/ 42
  • 62. Perlbal: Persistent Connections perlbal to backends (mod_perls)  know exactly when a connection is ready for a new − request no complex load balancing logic: just use whatever's free.  beats managing “weighted round robin” hell. clients persistent; not tied to a specific backend  connection reqA1, A2 reqA1, B2 Apache Client PB reqB1, B2 Apache Client reqB1, A2 http://danga.com/words/ 42
  • 63. Perlbal: can verify new backend connections #include <sys/socket.h> int listen(int sockfd, int backlog); connects to backends are often fast, but...  are you talking to the kernel’s listen queue?  or apache? (did apache accept() yet?)  send OPTIONs request to see if apache is  there Apache can reply to OPTIONS request quickly, − then Perlbal knows that conn is bound to an − apache process, not waiting in a kernel queue Huge improvement to user-visible latency!  (and more fair/even load balancing)  http://danga.com/words/ 43
  • 64. Perlbal: multiple queues high, normal, low priority queues  paid users -> high queue  bots/spiders/suspect traffic -> low queue  http://danga.com/words/ 44
  • 65. Perlbal: cooperative large file serving large file serving w/ mod_perl bad...  mod_perl has better things to do than spoon-feed − clients bytes http://danga.com/words/ 45
  • 66. Perlbal: cooperative large file serving internal redirects  mod_perl can pass off serving a big file to Perlbal − either from disk, or from other URL(s)  client sees no HTTP redirect − “Friends-only” images − one, clean URL  mod_perl does auth, and is done.  perlbal serves.  http://danga.com/words/ 46
  • 68. And the reverse... Now Perlbal can buffer uploads as well..  − Problems:  LifeBlog uploading − cellphones are slow  LiveJournal/Friendster photo uploads − cable/DSL uploads still slow − decide to buffer to “disk” (tmpfs, likely)  on any of: rate, size, time  blast at backend, only when full request is in http://danga.com/words/ 48
  • 69. Palette Altering GIF/PNGs based on palette indexes, colors in URL,  dynamically alter GIF/PNG palette table, then sendfile(2) the rest. http://danga.com/words/ 49
  • 72. MogileFS our distributed file system  open source  userspace  based all around HTTP (NFS support now removed)  hardly unique  Google GFS − Nutch Distributed File System (NDFS) − production-quality  lot of users − lot of big installs − http://danga.com/words/ 52
  • 73. MogileFS: Why alternatives at time were either:  − closed, non-existent, expensive, in development, complicated, ... − scary/impossible when it came to data recovery  new/uncommon/ unstudied on-disk formats because it was easy  − initial version = 1 weekend! :) − current version = many, many weekends :) http://danga.com/words/ 53
  • 74. MogileFS: Main Ideas files belong to classes, multiple tracker −  which dictate: databases − replication policy, min − all share same replicas, ... database cluster tracks what disks files (MySQL, etc..)  are on big, cheap disks  − set disk's state (up, − dumb storage nodes temp_down, dead) w/ 12, 16 disks, no and host RAID keep replicas on devices  on different hosts − (default class policy) − No RAID! http://danga.com/words/ 54
  • 75. MogileFS components clients  mogilefsd (does all real work)  database(s) (MySQL, .... abstract)  storage nodes  http://danga.com/words/ 55
  • 76. MogileFS: Clients tiny text-based protocol  Libraries available for:  Perl − tied filehandles  MogileFS::Client  my $fh = $mogc->new_file(“key”, [[$class], ...]) − Java − PHP − Python? − porting to $LANG is be trivial − future: no custom protocol. only HTTP − clients don't do database access  http://danga.com/words/ 56
  • 77. MogileFS: Tracker (mogilefsd) The Meat  event-based message bus  load balances client requests, world info  process manager  heartbeats/watchdog, respawner, ... − Child processes:  ~30x client interface (“query” process) − interfaces client protocol w/ db(s), etc  ~5x replicate − ~2x delete − ~1x fsck, reap, monitor, ..., ... − http://danga.com/words/ 57
  • 78. Trackers' Database(s) Abstract as of Mogile 2.x  − MySQL − SQLite (joke/demo) − Pg/Oracle coming soon? − Also future:  wrapper driver, partitioning any above − small metadata in one driver (MySQL Cluster?), − large tables partitioned over 2-node HA pairs Recommend config:  − 2xMySQL InnoDB on DRBD − 2 slaves underneath HA VIP  1 for backups  read-only slave for during master failover window http://danga.com/words/ 58
  • 79. MogileFS storage nodes (mogstored) HTTP transport  − GET − PUT − DELETE mogstored listens on 2 ports...   HTTP. --server={perlbal,lighttpd,...}  configs/manages your webserver of choice.  perlbal is default. some people like apache, etc − management/status:  iostat interface, AIO control, multi-stat() (for faster fsck) files on filesystem, not DB  − sendfile()! future: splice() − filesystem can be any filesystem http://danga.com/words/ 59
  • 80. Large file GET request http://danga.com/words/ 60
  • 81. Auth: complex, but quick Large file GET request http://danga.com/words/ 60
  • 82. Spoonfeeding: slow, but event- based Auth: complex, but quick Large file GET request http://danga.com/words/ 60
  • 85. Manager dispatches work, but doesn't do anything useful itself. :) http://danga.com/words/ 63
  • 86. Gearman system to load balance function calls...   scatter/gather bunch of calls in parallel,  different languages,  db connection pooling,  spread CPU usage around your network,  keep heavy libraries out of caller code,  ...  ... http://danga.com/words/ 64
  • 87. Gearman Pieces gearmand  − the function call router − event-loop (epoll, kqueue, etc) workers.  − Gearman::Worker – perl/ruby − register/heartbeat/grab jobs clients  − Gearman::Client[::Async] -- perl − also Ruby Gearman::Client − submit jobs to gearmand − opaque (to server) “funcname” string − optional opaque (to server) “args” string − opt coallescing key http://danga.com/words/ 65
  • 89. Gearman Picture gearmand gearmand gearmand http://danga.com/words/ 66
  • 90. Gearman Picture gearmand gearmand gearmand Worker Worker http://danga.com/words/ 66
  • 91. Gearman Picture gearmand gearmand gearmand can_do(“funcA”) can_do(“funcA”) can_do(“funcB”) Worker Worker http://danga.com/words/ 66
  • 92. Gearman Picture gearmand gearmand gearmand can_do(“funcA”) can_do(“funcA”) can_do(“funcB”) Client Worker Worker http://danga.com/words/ 66
  • 93. Gearman Picture gearmand gearmand gearmand call(“funcA”) can_do(“funcA”) can_do(“funcA”) can_do(“funcB”) Client Worker Worker http://danga.com/words/ 66
  • 94. Gearman Picture gearmand gearmand gearmand call(“funcA”) can_do(“funcA”) can_do(“funcA”) can_do(“funcB”) Client Client Worker Worker http://danga.com/words/ 66
  • 95. Gearman Picture gearmand gearmand gearmand call(“funcA”) can_do(“funcA”) call(“funcB”) can_do(“funcA”) can_do(“funcB”) Client Client Worker Worker http://danga.com/words/ 66
  • 96. Gearman Protocol efficient binary protocol  No XML  but also line-based text protocol for admin  commands − telnet to gearmand and get status − useful for Nagios plugins, etc http://danga.com/words/ 67
  • 97. Gearman Uses Image::Magick outside of your mod_perls!  DBI connection pooling (DBD::Gofer +  Gearman) reducing load, improving visibility  “services”  can all be in different languages, too! − http://danga.com/words/ 68
  • 98. Gearman Uses, cont.. running code in parallel  query ten databases at once − running blocking code from event loops  DBI from POE/Danga::Socket apps − spreading CPU from ev loop daemons  calling between different languages,  ...  http://danga.com/words/ 69
  • 99. Gearman Misc Guarantees:  − none! hah! :)  please wait for your results.  if client goes away, no promises − all retries on failures are done by client  but server will notify client(s) if working worker goes away. No policy/conventions in gearmand  − all policy/meaning between clients <-> workers ...  http://danga.com/words/ 70
  • 100. Sick Gearman Demo Don’t actually use it like this... but:  use strict; use DMap qw(dmap); DMap->set_job_servers(quot;sammyquot;, quot;papagquot;); my @foo = dmap { quot;$_ = quot; . `hostname` } (1..10); print quot;dmap says:n @fooquot;; $ ./dmap.pl dmap says: 1 = sammy 2 = papag 3 = sammy 4 = papag 5 = sammy 6 = papag 7 = sammy 8 = papag 9 = sammy 10 = papag http://danga.com/words/ 71
  • 101. Gearman Summary Gearman is sexy.  especially the coalescing − Check it out!  it's kinda our little unadvertised secret − oh crap, did I leak the secret?  http://danga.com/words/ 72
  • 103. TheSchwartz Like gearman:  job queuing system − opaque function name − opaque “args” blob − clients are either: − submitting jobs  workers  But unlike gearman:  Reliable job queueing system − not low latency − fire & forget (as opposed to gearman, where you wait for − result) currently library, not network service  http://danga.com/words/ 74
  • 104. TheSchwartz Primitives insert job  “grab” job (atomic grab)  for 'n' seconds. − mark job done  temp fail job for future  optional notes, rescheduling details.. − replace job with 1+ other jobs  atomic. − ...  http://danga.com/words/ 75
  • 105. TheSchwartz backing store:  a database − uses Data::ObjectDriver − MySQL,  Postgres,  SQLite,  ....  but HA: you tell it @dbs, and it finds one to  insert job into likewise, workers foreach (@dbs) to do work − http://danga.com/words/ 76
  • 106. TheSchwartz uses outgoing email (SMTP client)  − millions of emails per day − TheSchwartz::Worker::SendEmail − Email::Send::TheSchwartz LJ notifications  − ESN: event, subscription, notification  one event (new post, etc) -> thousands of emails, SMSes, XMPP messages, etc... pinging external services  atomstream injection  .....  dozens of users  shared farm for TypePad, Vox, LJ  http://danga.com/words/ 77
  • 107. gearmand + TheSchwartz gearmand: not reliable, low-latency, no disks  TheSchwartz: latency, reliable, disks  In TypePad:  TheSchwartz, with gearman to fire off TheSchwartz − workers. disks, but low-latency  future: no disks, SSD/Flash, MySQL Cluster  http://danga.com/words/ 78
  • 109. djabberd Our Jabber/XMPP server  powers our “LJ Talk” service  S2S: works with GoogleTalk, etc  perl, event-based (epoll, etc)  done 300,000+ conns  tiny per-conn memory overhead  release XML parser state if possible − http://danga.com/words/ 80
  • 110. djabberd hooks everything is a hook  not just auth! like, everything. − auth, − roster, − vcard info (avatars), − presence, − delivery, − inter-node cluster delivery, − ala mod_perl, qpsmtpd, etc. − async hooks  hooks phases can take as long as they want before − they answer, or decline to next phase in hook chain... we use Gearman::Client::Async − http://danga.com/words/ 81
  • 111. Thank you! Questions to: brad@danga.com Software: http://danga.com/ http://code.sixapart.com/ http://danga.com/words/ 82
  • 112. net. Questions? BIG-IP Global Database perlbal (httpd/proxy) bigip1 mod_perl bigip2 proxy1 master_a master_b web1 proxy2 web2 proxy3 Memcached web3 slave1 slave2 ... slave5 proxy4 djabberd mc1 web4 djabberd proxy5 mc2 ... User DB Cluster 1 djabberd mc3 uc1a uc1b webN mc4 User DB Cluster 2 ... uc2a uc2b gearmand mcN Mogile Storage Nodes gearmand1 User DB Cluster 3 sto2 sto1 gearmandN uc3a uc3b Mogile Trackers sto8 ... tracker1 tracker3 User DB Cluster N ucNa ucNb “workers” MogileFS Database gearwrkN Job Queues (xN) mog_a mog_b theschwkN jqNa jqNb slave1 slaveN http://danga.com/words/ 83
  • 113. Bonus Slides if extra time  http://danga.com/words/ 84
  • 114. Data Integrity Databases depend on fsync()  but databases can't send raw SCSI/ATA commands − to flush controller caches, etc fsync() almost never works work  Linux, FS' (lack of) barriers, raid cards, controllers, − disks, .... Solution: test! & fix  disk-checker.pl − client/server  spew writes/fsyncs, record intentions on alive machine,  yank power, checks. http://danga.com/words/ 85
  • 115. Persistent Connection Woes connections == threads == memory  My pet peeve: − want connection/thread distinction in MySQL!  w/ max-runnable-threads tunable  max threads  limit max memory/concurrency − DBD::Gofer + Gearman  Ask − Data::ObjectDriver + Gearman  http://danga.com/words/ 86