Data Driven Testing for Distributed Systems
Case study with Apache Helix

Kishore Gopalakrishna, @kishoreg1980
http://www.linkedin.com/in/kgopalak
  
	
  
Outline
•  Intro to Helix
•  Use case: Distributed data store
•  Traditional approach
•  Data driven testing
•  Q & A
  
What is Helix
•  Generic cluster management framework
    –  Partition management
    –  Failure detection and handling
    –  Elasticity
  
Terminologies

Node         A single machine
Cluster      Set of nodes
Resource     A logical entity, e.g. database, index, task
Partition    Subset of the resource
Replica      Copy of a partition
State        Status of a partition replica, e.g. Master, Slave
Transition   Action that lets replicas change status, e.g. Slave -> Master
  
Core concept: Augmented finite state machine

State Machine
•  States
    –  S1, S2, S3
•  Transitions
    –  S1 -> S2, S2 -> S1, S2 -> S3, S3 -> S1

Constraints
•  States
    –  S1: max=1, S2: min=2
•  Transitions
    –  Concurrent(S1 -> S2) across cluster < 5

Objectives
•  Partition placement
•  Failure semantics
  
Helix usage at LinkedIn

[Figure: Helix-managed systems at LinkedIn, including Espresso]
  
Use case: Distributed data store
•  Timeline-consistent partitioned data store
•  One master replica per partition
•  Even distribution of master/slave
•  On failure: promote slave to master

[Figure: partition replicas placed across three nodes]
Node 1:  P.1  P.2   P.3   P.4   P.5  P.6  P.9   P.10
Node 2:  P.5  P.6   P.7   P.8   P.1  P.2  P.11  P.12
Node 3:  P.9  P.10  P.11  P.12  P.3  P.4  P.7   P.8
  
Helix based solution

State Machine
•  States
    –  Offline, Slave, Master
•  Transitions
    –  O -> S, S -> M, M -> S, S -> O

Constraints
•  States
    –  M=1, S=2
•  Transitions
    –  concurrent(O -> S) < 5

Objectives
•  Partition placement
•  Failure semantics

[Figure: state diagram over O, S and M with transitions t1 to t4, where t1 ≤ 5]
S:  COUNT=2,  minimize( max_{n_j ∈ N} S(n_j) )
M:  COUNT=1,  minimize( max_{n_j ∈ N} M(n_j) )
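
The state machine and state-count constraints on this slide can be written down as plain data. The following is a minimal Python sketch, not Helix code: the names `STATES`, `TRANSITIONS`, `STATE_COUNT`, and `count_violations` are hypothetical and chosen only for illustration.

```python
# Illustrative only: none of these names are Helix APIs.
STATES = ("OFFLINE", "SLAVE", "MASTER")

# Legal transitions. SLAVE -> OFFLINE is taken from the transition
# table that appears later in the deck.
TRANSITIONS = {
    ("OFFLINE", "SLAVE"),
    ("SLAVE", "MASTER"),
    ("MASTER", "SLAVE"),
    ("SLAVE", "OFFLINE"),
}

# Per-partition state counts from the slide: M=1, S=2.
STATE_COUNT = {"MASTER": 1, "SLAVE": 2}


def count_violations(replica_states):
    """Return {state: count} for states that exceed the slide's limits.

    replica_states maps node name -> current state for one partition.
    States without a configured limit (e.g. OFFLINE) are never flagged.
    """
    counts = {}
    for state in replica_states.values():
        counts[state] = counts.get(state, 0) + 1
    return {
        state: n
        for state, n in counts.items()
        if n > STATE_COUNT.get(state, float("inf"))
    }
```

Keeping the model as data means the same definitions can drive both the cluster and the checks that are later run against its logs.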
  
Testing
•  Happy path functionality
    –  Meet SLA
        •  99th percentile latency etc.
    –  Writes to master
•  Non-happy path
    –  System failures
    –  Application failures
    –  How does the system behave in such scenarios?
  
        	
  
Non-happy path - Traditional approach
•  Identify scenarios of interest
    –  Node failure
    –  System upgrade
•  Tested each scenario in isolation via a test case
    –  All tests passed :)
•  Deployed in alpha
    –  First software upgrade failed … but we tested it
  
What was missing
•  Failures don’t happen in isolation
•  Induction principle does not work
    –  Something working once does not mean it will always work
•  Lack of tools to debug issues
    –  Could not identify the cause from one log file
•  Poor coverage
    –  Impossible to think of all possible test cases
  
What we learnt
•  Test with all components integrated
•  Simulate real production environment
    –  Generate load
    –  Random failures of multiple components
•  Better debugging tools
    –  Need to correlate messages from multiple logs
    –  A failure is a symptom; the actual reason lies in past logs of a different machine
  
Data driven testing
•  Instrument
    –  ZooKeeper, controller, participant logs
•  Simulate: chaos monkey
•  Analyze: invariants are
    –  Respect state transition constraints
    –  Respect state count constraints
    –  And so on
•  Debugging made easy
    –  Reproduce exact sequence of events
  
Chaos monkey
•  Select random component(s) to fail
•  How should it fail?
    –  Hard/soft failure
    –  Network partition
    –  Garbage collection
    –  Process freeze
  


    	
  
Automation of chaos monkey
•  Helix agent on each node
•  Modify the behavior of each service using Helix
    –  Component 1
        •  Node1: RUNNING
        •  Node2: STOPPED
        •  Node3: KILLED
    –  Component 2
        •  Node1: STOPPED

[Figure: agent state machine with states STOPPED, RUNNING, FREEZED, KILLED and transitions START, STOP, PAUSE, UNPAUSE, KILL]
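
The agent's state machine can be sketched as a lookup table. This is a hedged Python illustration: the state and action names come from the slide, but which state each arrow starts and ends in is an assumed reading of the diagram, and `AGENT_TRANSITIONS` and `apply_action` are hypothetical names.

```python
# Assumed reading of the slide's diagram; the arrow endpoints are a guess.
AGENT_TRANSITIONS = {
    ("STOPPED", "START"): "RUNNING",
    ("RUNNING", "STOP"): "STOPPED",
    ("RUNNING", "PAUSE"): "FREEZED",
    ("FREEZED", "UNPAUSE"): "RUNNING",
    ("RUNNING", "KILL"): "KILLED",
}


def apply_action(state, action):
    """Return the next component state, or raise if the action is not allowed."""
    try:
        return AGENT_TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"{action} is not allowed in state {state}") from None
```

A chaos monkey built this way only has to pick a target state per component and node; the table decides which action gets there.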
  
Pseudo test case

setup cluster
generate load
do
    (c, t) = components to fail and type of failure
    simulate failure
    verify system_is_stable
    restart failed components
while (verify system_is_stable)
Test case failed & here is the sequence of events
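
The loop above could look roughly like this in Python. It is a sketch under stated assumptions: `simulate`, `is_stable`, and `restart` are hypothetical callbacks supplied by the test harness, and `max_rounds` and `seed` are additions that the slide does not mention.

```python
import random


def run_chaos_test(components, failure_types, simulate, is_stable, restart,
                   max_rounds=100, seed=0):
    """Fail random components until the system is unstable or rounds run out.

    Returns (passed, events), where events is the ordered list of
    (component, failure_type) pairs that were injected.
    """
    rng = random.Random(seed)  # seeded so that a run can be repeated
    events = []
    for _ in range(max_rounds):
        component = rng.choice(components)
        failure = rng.choice(failure_types)
        events.append((component, failure))
        simulate(component, failure)
        if not is_stable():
            # Stop immediately and leave the cluster as-is for debugging.
            return False, events
        restart(component)
        if not is_stable():
            return False, events
    return True, events
```

On failure, the returned `events` list is the "sequence of events" the slide refers to.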
  	
  
   	
  	
  
Cluster verification
•  Verify all constraints are satisfied
    –  Is there a master for every partition?
    –  Is the slave replicating?
    –  A node/component being down should not matter
    –  Validate every action, not just the end result
        •  Having a master is not good enough if two nodes became master and one of them later died
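
The last point, validating every action rather than only the end result, can be illustrated with a short Python sketch. The event format `(time, node, state)` and the name `check_single_master` are assumptions made for this example.

```python
def check_single_master(events):
    """Replay (time, node, state) events for one partition, in order.

    Returns the times at which more than one node was MASTER.
    """
    current = {}
    violations = []
    for time, node, state in events:
        current[node] = state
        masters = [n for n, s in current.items() if s == "MASTER"]
        if len(masters) > 1:
            violations.append(time)
    return violations
```

For the history `[(1, "n1", "MASTER"), (2, "n2", "MASTER"), (3, "n1", "OFFLINE")]` the final state has exactly one master, so an end-state check passes, while the replay flags time 2.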
  
Log analysis
•  Log important events
    –  Becoming master from slave for this partition at this time
•  Tools to collect, merge & analyze logs
    –  Parsed ZooKeeper transaction logs
    –  Gathered Helix controller, participant logs
    –  Sorted on time
•  Helix provides these tools out of the box
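
The collect, merge, and sort-on-time steps can be sketched in Python as follows. This is not the Helix tooling itself: the `(timestamp, source, message)` event shape and the function name `merge_logs` are assumptions for illustration.

```python
import heapq


def merge_logs(*logs):
    """Merge per-source event lists into one timeline.

    Each log is a list of (timestamp, source, message) tuples that is
    already sorted by timestamp, as a single process's log normally is.
    """
    return list(heapq.merge(*logs, key=lambda event: event[0]))
```

`heapq.merge` reads the inputs lazily, so it suits logs that are individually ordered but too large to concatenate and re-sort.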
  
Structured Log File – sample
  
 timestamp      partition     instanceName                   sessionId                  state

1323312236368   TestDB_123   express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   OFFLINE

1323312236426   TestDB_123   express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   OFFLINE

1323312236530   TestDB_123   express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   OFFLINE

1323312236530   TestDB_91    express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   OFFLINE

1323312236561   TestDB_123   express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   SLAVE

1323312236561   TestDB_91    express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   OFFLINE

1323312236685   TestDB_123   express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   SLAVE

1323312236685   TestDB_91    express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   OFFLINE

1323312236685   TestDB_60    express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   OFFLINE

1323312236719   TestDB_123   express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   SLAVE

1323312236719   TestDB_91    express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   SLAVE

1323312236719   TestDB_60    express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   OFFLINE

1323312236814   TestDB_123   express1-md_16918   ef172fe9-09ca-4d77b05e-15a414478ccc   SLAVE
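
Rows in this format are easy to parse and turn into state changes. The Python sketch below assumes whitespace-separated columns in the order shown in the header; `Record`, `parse_line`, and `state_changes` are hypothetical names.

```python
from collections import namedtuple

Record = namedtuple("Record", "timestamp partition instance session state")


def parse_line(line):
    """Parse one whitespace-separated row of the sample log into a Record."""
    ts, partition, instance, session, state = line.split()
    return Record(int(ts), partition, instance, session, state)


def state_changes(records):
    """Yield (timestamp, partition, old_state, new_state) on each change.

    Rows that repeat a replica's previous state are skipped.
    """
    last = {}
    for r in records:
        key = (r.partition, r.instance)
        old = last.get(key)
        if old is not None and old != r.state:
            yield (r.timestamp, r.partition, old, r.state)
        last[key] = r.state
```

Applied to the sample, this reduces the repeated snapshots to the actual transitions, such as TestDB_123 moving from OFFLINE to SLAVE at 1323312236561.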

Benefits
•  Test case stops as soon as system is unstable
    –  The cluster is available for debugging
•  Provides exact sequence of events
    –  Makes it easy to debug and reproduce
    –  Best part: we auto-generated the test case
  	
  
    	
  	
  
Reproduce the issue

Start state
•  Helix brings the system to start state.

{
  "id" : "MyDataStore",
  "simpleFields" : {
    "IDEAL_STATE_MODE" : "CUSTOM",
    "NUM_PARTITIONS" : "2",
    "REPLICAS" : "3",
    "STATE_MODEL_DEF_REF" : "MasterSlave"
  },
  "mapFields" : {
    "MyDataStore_0" : {
      "node1" : "MASTER",
      "node2" : "OFFLINE",
      "node3" : "SLAVE"
    },
    "MyDataStore_1" : {
      "node1" : "SLAVE",
      "node2" : "OFFLINE",
      "node3" : "MASTER"
    }
  }
}

Orchestrate the sequence
•  Use Helix messaging API to replay the events

1. Node1:MyDataStore_0: Master-Slave
2. Node1:HARD KILL
3. Node2:START

Constraint violation

No more than R=2 slaves

Time     State     Number Slaves   Instance
42632    OFFLINE   0               10.117.58.247_12918
42796    SLAVE     1               10.117.58.247_12918
43124    OFFLINE   1               10.202.187.155_12918
43131    OFFLINE   1               10.220.225.153_12918
43275    SLAVE     2               10.220.225.153_12918
43323    SLAVE     3               10.202.187.155_12918
85795    MASTER    2               10.220.225.153_12918
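
The "Number Slaves" column can be recomputed by replaying the events. The Python sketch below assumes each row is a `(time, state, instance)` event for a single partition; `slave_counts` and its `max_slaves` parameter are hypothetical names.

```python
def slave_counts(events, max_slaves=2):
    """Replay (time, state, instance) events for one partition.

    Returns a list of (time, number_of_slaves, violated) tuples, where
    violated is True when the count exceeds max_slaves.
    """
    current = {}
    result = []
    for time, state, instance in events:
        current[instance] = state
        n = sum(1 for s in current.values() if s == "SLAVE")
        result.append((time, n, n > max_slaves))
    return result
```

Replaying the seven rows above gives the counts 0, 1, 1, 1, 2, 3, 2, with the only violation at time 43323.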

How long was it out of whack?

Number of Slaves    Time         Percentage
0                   1082319      0.5
1                   35578388     16.46
2                   179417802    82.99
3                   118863       0.05

83% of the time, there were 2 slaves to a partition
93% of the time, there was 1 master to a partition

Number of Masters   Time         Percentage
0                   15490456     7.164960359
1                   200706916    92.83503964
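
The percentages are each count's share of the total observed time. A minimal Python sketch, with `time_share` as a hypothetical helper name:

```python
def time_share(durations):
    """Turn {count: total_time} into {count: percentage of total time}."""
    total = sum(durations.values())
    return {count: 100.0 * t / total for count, t in durations.items()}
```

Both tables sum to the same total time (216197372), so the slave and master percentages describe the same observation window.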

Invariant 2: State Transitions

FROM       TO         COUNT
MASTER     SLAVE         55
OFFLINE    DROPPED        0
OFFLINE    SLAVE        298
SLAVE      MASTER       155
SLAVE      OFFLINE        0
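
A table like this can be produced by counting observed transitions and flagging any pair outside the legal set. In the Python sketch below, `LEGAL` is assumed to be exactly the FROM/TO pairs listed in the table, and `transition_report` is a hypothetical name.

```python
from collections import Counter

# Assumed legal set: the FROM/TO pairs listed in the table.
LEGAL = {
    ("MASTER", "SLAVE"),
    ("OFFLINE", "DROPPED"),
    ("OFFLINE", "SLAVE"),
    ("SLAVE", "MASTER"),
    ("SLAVE", "OFFLINE"),
}


def transition_report(transitions):
    """Count observed (from, to) pairs and list any that are not legal."""
    counts = Counter(transitions)
    illegal = sorted(pair for pair in counts if pair not in LEGAL)
    return counts, illegal
```

Legal pairs with a count of zero are also worth reading: they show which transitions the workload never exercised.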

Fun facts
•  For almost a month, the test could not run successfully for even one night
•  Most issues were found using one test case
•  Reproduced almost all failures
  
Conclusion
•  Traditional approach is not good enough
•  Data driven testing is the way to go
    –  Focus on workload and analysis
    –  Production system always in test mode
    –  Leverage tools built for testing to debug production issues
  
website   helix.incubator.apache.org
users     user@helix.incubator.apache.org
dev       dev@helix.incubator.apache.org
twitter   @apachehelix
  


Editor's Notes

  • #6 In this slide, we look at the problem from a different perspective and possibly redefine the cluster management problem. To recap: to solve DDS, we need to define the number of partitions and replicas, and each replica needs a role such as master or slave. One well-proven way to express such behavior is a state machine.
  • #7 Used in production to manage the core infrastructure components in the company. Operation is easy, and dev ops can easily operate multiple systems.
  • #23 Coverage
  • #24 Mention coverage