ZURG: PART 1 OF N
1


       Shuo Chen
       2012/04                                chenshuo.com
What is it?
2


       An example of muduo protorpc
         A toy C++ project that can be useful
         https://github.com/chenshuo/muduo-protorpc

       The several levels of deployment, monitoring, and process management
        in distributed systems (分布式系统部署、监控与进程管理的几重境界)
           http://www.cnblogs.com/Solstice/archive/2011/05/09/2041306.html

       Where multithreaded servers are applicable (多线程服务器的适用场合)
           http://blog.csdn.net/Solstice/article/details/5334243

       An engineering approach to developing distributed systems
        (分布式系统的工程化开发方法)
           http://blog.csdn.net/solstice/article/details/5950190   (slides)
           http://techparty.org/2010/10/19/2010q4summary/          (video)

    2012/04                                                              chenshuo.com
Overview
3


       Master-slave structure
         Master and slaves communicate via bi-directional RPC
         Command-line tool to view and change status
         A web frontend in the future, if I have time to learn web development

       Central configuration of service placement
         Zurg slave is stateless: it doesn’t store anything locally
         This differs from supervisord

       Also serves as a name server
       Master looks like a SPOF, but that can be overcome
Why not just run services as daemons?
4


       It’s fine to do so on 5 hosts, but how about 50? 500?
       Not easy to upgrade apps
         Usually you need to ssh to every host and restart the apps
       Not transparent
         Is every application running well?
       You have to deploy a monitoring system anyway
         And the notification of an app crash is not real time
       Auto-restarting daemons could hide the real
        problem and confuse the monitoring system
Zurg slave – functionalities
5


       Process management
         Run a command (short-lived child process)
         Start/stop a service (long-lived child process)
               Not standard system services, but programs you wrote yourself
         Detect child death in real time and report it to the master
               Not by polling pids or process names

       Collecting performance metrics
         Monitor system health
       Both regular heartbeats and event notifications
        to the master
Zurg slave – design decisions
6


       All-in-one single-threaded process
         Don’t keep running iostat/vmstat/top/netstat/XXXstat
         Replaces(?) nagios/monit/ganglia/munin/supervisord
               No plugins, just compile what you need into one binary

       C++ for efficiency and lower resource usage
         It runs on every host, so every little bit helps
         Monitoring tools* often use too many resources

       No local configuration, easy to deploy & upgrade
         Just point it to the master
       Start it from init.d, and it will take over everything else
Zurg slave – NOT in scope
7


       Configuration management
       System administration
         Use Puppet instead
       Deployment of in-house software
         Although it can be done with ‘wget’ followed by ‘tar xf’
Run a command
8


       Start a child process
       Wait until it finishes (asynchronously, of course)
       Capture stdout/stderr
         No other open files in the parent should leak
          to the child; set FD_CLOEXEC on every fd

       Sounds like reinventing Python’s subprocess module?
       Not exactly!
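The FD_CLOEXEC advice above can be sketched with fcntl(2). This is only an illustration, not code from zurg; the helper names `setCloseOnExec` and `isCloseOnExec` are made up for this example.

```cpp
#include <cassert>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: marks fd close-on-exec so that a later exec*()
// in a forked child does not inherit it. Returns true on success.
bool setCloseOnExec(int fd) {
  int flags = ::fcntl(fd, F_GETFD);
  if (flags < 0) return false;
  return ::fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == 0;
}

// Hypothetical helper: reports whether FD_CLOEXEC is currently set on fd.
bool isCloseOnExec(int fd) {
  int flags = ::fcntl(fd, F_GETFD);
  return flags >= 0 && (flags & FD_CLOEXEC) != 0;
}
```

Where the platform offers them, flags such as O_CLOEXEC on open(2) and pipe2(2) set the bit atomically at creation time, which avoids a window between creating the fd and calling fcntl().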
The easy part of process mgmt
9


       Start a new process
         fork(2)/exec*(2)
         How to get errno if exec() fails? It’s in the child process
         “The self-pipe trick” http://cr.yp.to/docs/selfpipe.html

       Get notified when a child terminates
         SIGCHLD, via either signalfd(2) or a legacy signal handler
         Signals are not reliable, so call wait(2) periodically (non-blocking)

       Get the exit status of a terminated child process
         wait4(2) tells everything, incl. memory/CPU usage
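One common way to get the errno of a failed exec() back into the parent is a close-on-exec pipe: if exec succeeds the kernel closes the write end and the parent reads EOF, otherwise the child writes errno into it. The sketch below assumes Linux (pipe2, wait4). The names `spawnWithExecErrno`, `reapWithUsage`, `SpawnResult` and `ChildUsage` are invented for illustration, and this may not be exactly how zurg does it.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

struct SpawnResult {
  pid_t pid;       // child pid, or -1 if pipe2()/fork() failed
  int exec_errno;  // 0 if exec succeeded, else the errno of the failure
};

SpawnResult spawnWithExecErrno(const char* path, char* const argv[]) {
  int fds[2];
  if (::pipe2(fds, O_CLOEXEC) < 0) return SpawnResult{-1, errno};
  pid_t pid = ::fork();
  if (pid < 0) {
    int saved = errno;
    ::close(fds[0]);
    ::close(fds[1]);
    return SpawnResult{-1, saved};
  }
  if (pid == 0) {  // child
    ::close(fds[0]);
    ::execv(path, argv);
    int err = errno;  // only reached when exec failed
    ssize_t ignored = ::write(fds[1], &err, sizeof err);
    (void)ignored;
    ::_exit(127);
  }
  ::close(fds[1]);  // parent keeps only the read end
  int err = 0;
  ssize_t n;
  do {
    n = ::read(fds[0], &err, sizeof err);
  } while (n < 0 && errno == EINTR);
  ::close(fds[0]);
  if (n != static_cast<ssize_t>(sizeof err)) err = 0;  // EOF: exec succeeded
  return SpawnResult{pid, err};
}

struct ChildUsage {
  int status;         // raw wait status, decode with WIFEXITED() etc.
  long maxrss_kb;     // peak resident set size (kilobytes on Linux)
  double user_sec;    // user CPU time
  double system_sec;  // system CPU time
};

// Blocks until the given child terminates and collects its resource usage.
bool reapWithUsage(pid_t pid, ChildUsage* out) {
  struct rusage ru;
  int status = 0;
  pid_t r;
  do {
    r = ::wait4(pid, &status, 0, &ru);
  } while (r < 0 && errno == EINTR);
  if (r != pid) return false;
  out->status = status;
  out->maxrss_kb = ru.ru_maxrss;
  out->user_sec = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
  out->system_sec = ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
  return true;
}
```

A single-threaded event loop would not block in wait4() like this; it would pass WNOHANG and call it when SIGCHLD arrives or on a periodic timer. The blocking form just keeps the sketch short.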
A simple challenge
10


        Limit the run time (wall-clock time, not CPU time) of a command
          Typical timeout is 60 seconds
          Remember the pid when the command starts
          Set up a timer, and kill(2) the process when it times out

        How do you know that the process you are about
         to kill is the one that you created for the command?
          Set a timer to kill pid 9527, 60 seconds later
          What if process 9527 dies just before the timer fires,
          and a new process is created with the same pid (?!)
Pid is unique, but not always
11


        Pids wrap around (in minutes or seconds)
          Pids are unique within a snapshot of all processes
          But they are not unique as time moves on

        The range of possible pids is small (1~32767)
          /proc/sys/kernel/pid_max      default     32768
          /proc/loadavg                 lastpid     3387
          /proc/stat                    processes   423666
        There is a tiny time window between the timer wakeup
         and the kill(2) call; anything could happen in between
          And there is no mutex or lock to guard against this race condition
How to kill a child properly?
12


        So it is not safe to kill by pid: you may kill
         someone else’s child process by mistake
        How about checking the ppid first?
          You may kill your own new child, if another
           RunCommand reuses the pid just before the timer fires
        The pid + start_time combination is unique in
         space and time
          Start time is in /proc/pid/stat, in jiffies since boot
          Remember the start time after fork()ing a child*
          Check the start time before killing the child
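A minimal sketch of the pid + start_time check, assuming Linux procfs. The start time is field 22 of /proc/<pid>/stat; proc(5) describes its unit as clock ticks on kernels since 2.6. The function names `readStartTime` and `killIfSameProcess` are hypothetical. For brevity this reads the child's start time from the parent, whereas the editor's note at the end of this deck says it must be read in the child and passed back.

```cpp
#include <cassert>
#include <csignal>
#include <fstream>
#include <sstream>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Returns field 22 (starttime) of /proc/<pid>/stat, or -1 on any failure.
long long readStartTime(pid_t pid) {
  std::ifstream in("/proc/" + std::to_string(pid) + "/stat");
  std::string line;
  if (!std::getline(in, line)) return -1;
  // Field 2 (comm) is parenthesized and may itself contain spaces or ')',
  // so parsing resumes after the last ')' on the line.
  std::string::size_type pos = line.rfind(')');
  if (pos == std::string::npos) return -1;
  std::istringstream rest(line.substr(pos + 1));
  std::string field;
  for (int i = 3; i <= 22; ++i) {  // the token after ')' is field 3
    if (!(rest >> field)) return -1;
  }
  try {
    return std::stoll(field);
  } catch (...) {
    return -1;
  }
}

// Sends sig only if pid still names the process whose start time was
// recorded earlier. Returns true if the signal was actually sent.
bool killIfSameProcess(pid_t pid, long long expected_start, int sig) {
  long long now = readStartTime(pid);
  if (now < 0 || now != expected_start) return false;
  return ::kill(pid, sig) == 0;
}
```

The check and the kill(2) are still two separate steps. For the caller's own child they cannot be separated by pid reuse, because an exited child keeps its pid as a zombie until the parent itself reaps it.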
Why is it safe?
13


        If two processes start at almost the same time,
         their pids must be different
        If two processes happen to have the same pid,
         their start times must be different
          It takes seconds to wrap around the pid space, and start time is monotonic
        Since zurg slave is single-threaded, there is no race
         condition between checking and killing (nothing else
         can reap the child and free its pid in between)
          Don’t run zurg slave as root (it quits if euid == 0)
          Don’t run two zurg slaves with the same uid on one box
Capture stdout & stderr, simple?
14


        Two pipes are needed: dup2() the write ends to fds 1 and 2
         in the child, and read the other ends in the parent
          Keep the data in memory and send it back when the command finishes
        The command ‘cat /dev/zero’ would blow up zurg slave
        We must limit the size of stdout and stderr
          The default limit is 1024 KiB
        Two approaches when the size exceeds the limit:
          Stop reading, i.e. block the writer, and wait until the timeout
          Close the read end of the pipe, i.e. kill the child with SIGPIPE
                Directly sending a SIGPIPE signal doesn’t work
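The second approach (close the read end once the limit is reached) can be sketched as follows. To stay self-contained, the child here is a stand-in that writes zeros to stdout forever, much like ‘cat /dev/zero’, instead of exec()ing a real command. The name `runNoisyChildWithLimit` and the `Captured` struct are made up for this example, and only stdout is captured.

```cpp
#include <algorithm>
#include <cassert>
#include <cerrno>
#include <csignal>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

struct Captured {
  std::string data;  // at most `limit` bytes of the child's stdout
  int status;        // raw wait status of the child, -1 if it never started
};

Captured runNoisyChildWithLimit(size_t limit) {
  Captured result{std::string(), -1};
  int fds[2];
  if (::pipe(fds) < 0) return result;
  pid_t pid = ::fork();
  if (pid < 0) {
    ::close(fds[0]);
    ::close(fds[1]);
    return result;
  }
  if (pid == 0) {  // child: stdout becomes the write end of the pipe
    ::signal(SIGPIPE, SIG_DFL);  // make sure SIGPIPE terminates the child
    ::dup2(fds[1], STDOUT_FILENO);
    ::close(fds[0]);
    ::close(fds[1]);
    char zeros[4096] = {0};
    for (;;) {  // never stops writing on its own
      if (::write(STDOUT_FILENO, zeros, sizeof zeros) < 0 && errno != EINTR) {
        ::_exit(1);
      }
    }
  }
  ::close(fds[1]);  // parent keeps only the read end
  char buf[4096];
  while (result.data.size() < limit) {
    size_t want = std::min(sizeof buf, limit - result.data.size());
    ssize_t n = ::read(fds[0], buf, want);
    if (n > 0) {
      result.data.append(buf, static_cast<size_t>(n));
    } else if (n == 0 || errno != EINTR) {
      break;  // EOF or a real error
    }
  }
  // Limit reached: closing the read end makes the child's next write(2)
  // raise SIGPIPE, which kills it under the default disposition.
  ::close(fds[0]);
  while (::waitpid(pid, &result.status, 0) < 0 && errno == EINTR) {
  }
  return result;
}
```

This only works while the child leaves SIGPIPE at its default disposition. A program that ignores SIGPIPE sees EPIPE from write(2) instead and may keep running, so a timeout is still needed as a backstop.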
Race condition at process exits
15


        When a child exits, all its open fds are closed
          The parent’s read(2) returns 0; it should then close the fd,
           otherwise POLLHUP will cause a busy loop
          A child could also close them on purpose before dying

        The “process exited” and “std{out,err} fds closed”
         events can arrive in any order
          Is there any in-flight data that has not been received yet?
        The lifetime management of Process/Pipe objects is
         also subtle, as fds are reused so aggressively
        Read the code to find out how to do it correctly
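One possible way to cope with the arbitrary event order is to reply only once all three events (child reaped, stdout EOF, stderr EOF) have been seen, whichever comes last. The sketch below is an assumption about a workable design, not a description of zurg's actual code; `CommandCompletion` and its method names are invented.

```cpp
#include <cassert>

// Tracks the three events that together mean "this command is finished".
// Each on*() call returns true exactly once: when it delivers the last
// missing event, i.e. when the response can be sent.
struct CommandCompletion {
  bool exited = false;
  bool stdout_closed = false;
  bool stderr_closed = false;
  bool replied = false;

  bool onExit() {
    exited = true;
    return tryFinish();
  }
  bool onStdoutClosed() {
    stdout_closed = true;
    return tryFinish();
  }
  bool onStderrClosed() {
    stderr_closed = true;
    return tryFinish();
  }

 private:
  bool tryFinish() {
    if (replied || !(exited && stdout_closed && stderr_closed)) return false;
    replied = true;
    return true;
  }
};
```

Waiting for EOF on both pipes means no in-flight data is dropped, but it also means a grandchild that inherited the pipes can delay completion after the child itself has exited, so a real implementation would likely need a timeout here as well.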
Run Command Request
16


message RunCommandRequest {
  required string command       = 1;
  optional string cwd           = 2 [default = "/tmp"];
  repeated string args          = 3;
  repeated string envs          = 4;
  optional bool   envs_only     = 5 [default = false];
  optional int32  max_stdout    = 6 [default = 1048576];  // bytes (1024 KiB)
  optional int32  max_stderr    = 7 [default = 1048576];  // bytes (1024 KiB)
  optional int32  timeout       = 8 [default = 60];       // seconds
  optional int32  max_memory_mb = 9 [default = 32768];
}
Run Command Response
17


message RunCommandResponse {
  required int32 error_code       = 1;
  optional int32 pid              = 2;
  optional int32 status           = 3;
  optional bytes std_output       = 4;
  optional bytes std_error        = 5;
  optional int64 start_time_us    = 16;
  optional int64 finish_time_us   = 17;
  optional float user_time        = 18;
  optional float system_time      = 19;
  optional int64 memory_maxrss_kb = 20;
  // optional int64 ctxsw = 21;
  optional int32 exit_status      = 30 [default = 0];
  optional int32 signaled         = 31 [default = 0];
  optional bool  coredump         = 32 [default = false];
}
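The last three fields of the response look like a decoded wait status. The sketch below shows how such fields could be derived with the standard wait macros. It assumes that `signaled` carries the terminating signal number (it is an int32 rather than a bool), which the slides do not confirm. `ExitFields` and `decodeStatus` are hypothetical names, not part of the generated protobuf code.

```cpp
#include <cassert>
#include <csignal>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Plain struct mirroring the exit_status/signaled/coredump fields.
struct ExitFields {
  int exit_status;  // valid when the child exited normally
  int signaled;     // assumed: terminating signal number, 0 if none
  bool coredump;    // true if the kernel reported a core dump
};

// Decodes a raw status as returned by wait(2), waitpid(2) or wait4(2).
ExitFields decodeStatus(int status) {
  ExitFields f{0, 0, false};
  if (WIFEXITED(status)) {
    f.exit_status = WEXITSTATUS(status);
  } else if (WIFSIGNALED(status)) {
    f.signaled = WTERMSIG(status);
#ifdef WCOREDUMP  // not specified by POSIX, but available on Linux
    f.coredump = WCOREDUMP(status) != 0;
#endif
  }
  return f;
}
```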
Run Script
18


        RunCommand with script file content provided
         in the request
        A programmatic way to run slightly different
         scripts on many hosts




Application management
19


        Start/monitor/stop applications
          Applications, a.k.a. services: long-running processes
          Apps can be written in C++/Java/Python/etc.

        Shares most functionality with RunCommand
          stdout/stderr are redirected to files, not captured
          No timeout
        Intrusive vs. non-intrusive
          Can zurg_slave manage any application?
          Should the managed application follow some rules?
How to detect app exiting
20


        Polling (pid and start time)
          Not real time, there is always a poll interval
          How do you know which process is the application?

        SIGCHLD
          Not 100% reliable, so call wait(2) periodically
        Pipe: leave the write end in the child process and read
         in zurg_slave; when the app exits, read(2) returns 0
          Reliable and prompt
          The application must not close the fd* (intrusive!)
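The pipe approach can be sketched like this: the child keeps the write end open for its whole life and the parent sees EOF as soon as the child is gone. The child below just sleeps as a stand-in for a real application, and the blocking read stands in for registering the fd with an event loop. `forkWithLifePipe` and `waitForEof` are invented names.

```cpp
#include <cassert>
#include <cerrno>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Forks a child that holds the write end of a pipe until it exits.
// On success returns the child's pid and stores the read end in *read_fd.
pid_t forkWithLifePipe(int* read_fd, useconds_t child_lifetime_us) {
  int fds[2];
  if (::pipe(fds) < 0) return -1;
  pid_t pid = ::fork();
  if (pid < 0) {
    ::close(fds[0]);
    ::close(fds[1]);
    return -1;
  }
  if (pid == 0) {  // child: a real app would exec() here, keeping fds[1] open
    ::close(fds[0]);
    ::usleep(child_lifetime_us);  // stand-in for the application's work
    ::_exit(0);                   // the kernel closes fds[1] on exit
  }
  // The parent must drop its own copy of the write end,
  // otherwise EOF would never be delivered.
  ::close(fds[1]);
  *read_fd = fds[0];
  return pid;
}

// Blocks until every write end of the pipe is closed. Returns true on EOF.
bool waitForEof(int fd) {
  char c;
  for (;;) {
    ssize_t n = ::read(fd, &c, 1);
    if (n == 0) return true;
    if (n < 0 && errno != EINTR) return false;
  }
}
```

EOF only says that nobody holds the write end any more. If the application closes the fd itself, or passes it on to its own children, the signal becomes early or late respectively, which is why the slide calls this approach intrusive.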
What if zurg_slave crashes?
21


        How to prevent starting duplicate services?
        SIGCHLD and pipe(2) are nonrenewable: a restarted zurg_slave is
         no longer the parent and has lost the pipe fds
        Sockets? The app reconnects to the zurg slave on localhost
          i.e. a heartbeat between the app and zurg slave
          Even more intrusive: retry logic is needed in every language

        Other thoughts?
          Another layer of indirection?
To be continued
22


        Collecting health & performance data
        Periodic heartbeats to the master
          Process status, performance metrics

        Zurg slave is 50% done as of the end of April 2012
Zurg Master
23


        A multithreaded program
        All of its status is retrievable from outside
          Easy to build Web/GUI frontends

        Coding has not started yet.


Editor's Notes

  • Slide 6 (Zurg slave – design decisions) * Written in scripting languages
  • Slide 12 (How to kill a child properly?) * Must be done in the child process and passed back to the parent