Data driven testing: Case study with Apache Helix

Case study of how we used Helix not only to build a distributed system but also to test it. We built a Chaos Monkey to simulate failures and developed tools in Helix to parse ZooKeeper transaction logs and controller and participant logs, and to reconstruct the exact sequence of steps that led to a failure. Once we have that sequence, we reproduce the events using Helix for orchestration.

  1. Data Driven Testing for Distributed Systems: Case study with Apache Helix
     Kishore Gopalakrishna, @kishoreg1980
     http://www.linkedin.com/in/kgopalak
  2. Outline
     • Intro to Helix
     • Use case: Distributed data store
     • Traditional approach
     • Data driven testing
     • Q & A
  3. What is Helix
     • Generic cluster management framework
       – Partition management
       – Failure detection and handling
       – Elasticity
  4. Terminologies
     Node        A single machine
     Cluster     Set of nodes
     Resource    A logical entity, e.g. database, index, task
     Partition   Subset of the resource
     Replica     Copy of a partition
     State       Status of a partition replica, e.g. Master, Slave
     Transition  Action that lets replicas change status, e.g. Slave -> Master
  5. Core concept: Augmented finite state machine
     • State Machine
       – States: S1, S2, S3
       – Transitions: S1 -> S2, S2 -> S1, S2 -> S3, S3 -> S1
     • Constraints
       – States: S1 -> max=1, S2 -> min=2
       – Transitions: Concurrent(S1 -> S2) across cluster < 5
     • Objectives
       – Partition placement
       – Failure semantics
  6. Helix usage at LinkedIn
     • Espresso
  7. Use case: Distributed data store
     • Timeline consistent partitioned data store
     • One master replica per partition
     • Even distribution of master/slave
     • On failure: promote slave to master
     [Figure: master and slave replicas of partitions P.1 to P.12 spread across Node 1, Node 2, and Node 3]
  8. Helix based solution
     • State Machine
       – States: Offline, Slave, Master
       – Transitions: O -> S, S -> M, M -> S
     • Constraints
       – States: M=1, S=2
       – Transitions: concurrent(O -> S) < 5
     • Objectives
       – Partition placement
       – Failure semantics
     [Figure: state diagram over O, S, and M with transitions t1 to t4, annotated with:]
       – S: COUNT=2, minimize(max_{n_j ∈ N} S(n_j))
       – M: COUNT=1, minimize(max_{n_j ∈ N} M(n_j))
       – t1 ≤ 5
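The state machine and its constraints can be written down as plain data. The following is a minimal Python sketch, not Helix's API: the names `LEGAL_TRANSITIONS`, `MAX_PER_PARTITION`, `is_legal`, and `count_violations` are hypothetical, `M=1, S=2` is read as an upper bound per partition, and the Slave -> Offline transition is an assumption taken from the transition counts shown later in the deck.

```python
from collections import Counter

# Hypothetical encoding of the MasterSlave model from the slide (not Helix's API).
LEGAL_TRANSITIONS = {
    ("OFFLINE", "SLAVE"),
    ("SLAVE", "MASTER"),
    ("MASTER", "SLAVE"),
    ("SLAVE", "OFFLINE"),  # assumption: appears in the later transition table
}

# State count constraints per partition, read as upper bounds: M=1, S=2.
MAX_PER_PARTITION = {"MASTER": 1, "SLAVE": 2}


def is_legal(from_state, to_state):
    """Return True if the state machine allows this transition."""
    return (from_state, to_state) in LEGAL_TRANSITIONS


def count_violations(replica_states):
    """Given {node: state} for one partition, return states over their limit."""
    counts = Counter(replica_states.values())
    return {
        state: counts[state]
        for state, limit in MAX_PER_PARTITION.items()
        if counts[state] > limit
    }
```

For example, `count_violations({"node1": "MASTER", "node2": "MASTER", "node3": "SLAVE"})` reports the double master, which is exactly the situation the verification step is meant to catch.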
  9. Testing
     • Happy path functionality
       – Meet SLA
         • 99th percentile latency etc.
       – Writes to master
     • Non happy path
       – System failures
       – Application failures
       – How does the system behave in such scenarios?
  10. Non happy path: Traditional approach
     • Identify scenarios of interest
       – Node failure
       – System upgrade
     • Tested each scenario in isolation via test case
       – All tests passed :)
     • Deployed in alpha
       – First software upgrade failed … but we tested it
  11. What was missing
     • Failures don't happen in isolation
     • Induction principle does not work
       – If something works once, that does not mean it will always work
     • Lack of tools to debug issues
       – Could not identify the cause from one log file
     • Poor coverage
       – Impossible to think of all possible test cases
  12. What we learnt
     • Test with all components integrated
     • Simulate real production environment
       – Generate load
       – Random failures of multiple components
     • Better debugging tools
       – Need to correlate messages from multiple logs
       – Failure is a symptom; the actual reason is in past logs of a different machine
  13. Data driven testing
     • Instrument: ZooKeeper, controller, participant logs
     • Simulate: Chaos monkey
     • Analyze: invariants are
       – Respect state transition constraints
       – Respect state count constraints
       – And so on
     • Debugging made easy
       – Reproduce exact sequence of events
  14. Chaos monkey
     • Select a random component(s) to fail
     • How should it fail
       – Hard/soft failure
       – Network partition
       – Garbage collection
       – Process freeze
  15. Automation of chaos monkey
     • Helix agent on each node
     • Modify the behavior of each service using Helix
       – Component 1
         • Node1: RUNNING
         • Node2: STOPPED
         • Node3: KILLED
       – Component 2
         • Node1: STOPPED
     [Figure: agent state machine with states STOPPED, RUNNING, FREEZED, KILLED and transitions START, STOP, PAUSE, UNPAUSE, KILL]
  16. Pseudo test case
       setup cluster
       generate load
       do
         (c,t) = components to fail and type of failure
         simulate failure
         verify system_is_stable
         restart failed components
       while(verify system_is_stable)
       Test case failed & here is the sequence of events
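The pseudo test case above can be sketched as a small Python loop. This is an illustration under stated assumptions, not the talk's implementation: `run_chaos_test`, `FakeCluster`, and the `FAILURE_TYPES` labels are hypothetical, and the fake cluster only stands in for a real deployment so the loop can run on its own. The loop returns as soon as the system is unstable, before restarting anything, so the cluster stays available for debugging.

```python
import random

# Hypothetical labels for the failure modes listed on the chaos monkey slide.
FAILURE_TYPES = ["hard_kill", "soft_stop", "network_partition", "gc_pause", "freeze"]


class FakeCluster:
    """Stand-in for a real cluster; goes unstable on any 'fatal' failure type."""

    def __init__(self, fatal_failures=()):
        self.fatal = set(fatal_failures)
        self.stable = True

    def components(self):
        return ["zookeeper", "controller", "participant"]

    def fail(self, component, failure):
        if failure in self.fatal:
            self.stable = False

    def restart(self, component):
        pass  # a real harness would bring the component back here

    def is_stable(self):
        return self.stable


def run_chaos_test(cluster, rng, max_rounds=100):
    """Fail random components until the cluster is unstable or rounds run out.

    Returns (passed, events), where events is the exact sequence of actions.
    """
    events = []
    for _ in range(max_rounds):
        component = rng.choice(cluster.components())
        failure = rng.choice(FAILURE_TYPES)
        cluster.fail(component, failure)
        events.append((component, failure))
        if not cluster.is_stable():
            return False, events  # stop here; leave the cluster as-is
        cluster.restart(component)
        events.append((component, "restart"))
        if not cluster.is_stable():
            return False, events
    return True, events
```

Passing a seeded `random.Random` keeps a failing run repeatable, and the returned event list is the raw material for replaying the failure later.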
  17. Cluster verification
     • Verify all constraints are satisfied
       – Is there a master for every partition?
       – Is the slave replicating?
       – Node/component down should not matter
       – Validate every action, not just the end result
         • Having a master is not good enough if two nodes became master and later one of them died
  18. Log analysis
     • Log important events
       – Becoming master from slave for this partition at this time
     • Tools to collect, merge & analyze logs
       – Parsed ZooKeeper transaction logs
       – Gathered Helix controller, participant logs
       – Sorted on time
     • Helix provides these tools out of the box
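The "sorted on time" step amounts to a k-way merge of per-source logs. The sketch below is not the Helix tooling; it assumes each source has already been parsed into `(timestamp, source, message)` tuples that are sorted within that source, and `merge_logs` is a hypothetical helper name.

```python
import heapq


def merge_logs(*logs):
    """Merge several time-sorted event lists into one timeline.

    Each log is a list of (timestamp, source, message) tuples, already
    sorted by timestamp within that source.
    """
    return list(heapq.merge(*logs, key=lambda event: event[0]))


# Hypothetical events, for illustration only.
zookeeper_log = [(100, "zookeeper", "session expired"), (400, "zookeeper", "session created")]
controller_log = [(150, "controller", "sent SLAVE->MASTER"), (300, "controller", "rebalance")]
participant_log = [(200, "participant", "became MASTER")]

timeline = merge_logs(zookeeper_log, controller_log, participant_log)
```

With the three sources interleaved on one timeline, a symptom in the participant log can be read next to the earlier ZooKeeper or controller event that caused it.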
  19. Structured log file: sample
     timestamp      partition   instanceName       sessionId                            state
     1323312236368  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
     1323312236426  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
     1323312236530  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
     1323312236530  TestDB_91   express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
     1323312236561  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  SLAVE
     1323312236561  TestDB_91   express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
     1323312236685  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  SLAVE
     1323312236685  TestDB_91   express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
     1323312236685  TestDB_60   express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
     1323312236719  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  SLAVE
     1323312236719  TestDB_91   express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  SLAVE
     1323312236719  TestDB_60   express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
     1323312236814  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  SLAVE
  20. Benefits
     • Test case stops as soon as system is unstable
       – The cluster is available for debugging
     • Provides exact sequence of events
       – Makes it easy to debug and reproduce
       – Best part: we auto-generated the test case
  21. Reproduce the issue
     • Start state: Helix brings the system to the start state
       {
         "id" : "MyDataStore",
         "simpleFields" : {
           "IDEAL_STATE_MODE" : "CUSTOM",
           "NUM_PARTITIONS" : "2",
           "REPLICAS" : "3",
           "STATE_MODEL_DEF_REF" : "MasterSlave"
         },
         "mapFields" : {
           "MyDataStore_0" : {
             "node1" : "MASTER",
             "node2" : "OFFLINE",
             "node3" : "SLAVE"
           },
           "MyDataStore_1" : {
             "node1" : "SLAVE",
             "node2" : "OFFLINE",
             "node3" : "MASTER"
           }
         }
       }
     • Orchestrate the sequence: use the Helix messaging API to replay the events
       1. Node1:MyDataStore_0: Master-Slave
       2. Node1:HARD KILL
       3. Node2:START
  22. Constraint violation: no more than R=2 slaves
     Time   State    Number Slaves  Instance
     42632  OFFLINE  0              10.117.58.247_12918
     42796  SLAVE    1              10.117.58.247_12918
     43124  OFFLINE  1              10.202.187.155_12918
     43131  OFFLINE  1              10.220.225.153_12918
     43275  SLAVE    2              10.220.225.153_12918
     43323  SLAVE    3              10.202.187.155_12918
     85795  MASTER   2              10.220.225.153_12918
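A violation like the one in this table can be found by replaying the structured log and recounting slaves after every event. The sketch below is a hedged illustration rather than the Helix analysis tool: `find_slave_violations` is a hypothetical name, events are assumed to be `(timestamp, partition, instance, state)` tuples already sorted by time, and the sample events are made up.

```python
from collections import defaultdict


def find_slave_violations(events, max_slaves=2):
    """Replay time-sorted events and report when a partition has too many slaves.

    events: iterable of (timestamp, partition, instance, state) tuples.
    Returns a list of (timestamp, partition, slave_count) for each violation.
    """
    current = defaultdict(dict)  # partition -> {instance: latest state}
    violations = []
    for timestamp, partition, instance, state in events:
        current[partition][instance] = state
        slaves = sum(1 for s in current[partition].values() if s == "SLAVE")
        if slaves > max_slaves:
            violations.append((timestamp, partition, slaves))
    return violations


# Hypothetical events: a third slave appears before one replica is promoted.
sample = [
    (1, "TestDB_0", "node1", "SLAVE"),
    (2, "TestDB_0", "node2", "SLAVE"),
    (3, "TestDB_0", "node3", "SLAVE"),
    (4, "TestDB_0", "node3", "MASTER"),
]
violations = find_slave_violations(sample)
```

Checking after every event, rather than only at the end, is what catches a transient violation that the final state would hide.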
  23. How long was it out of whack?
     Number of Slaves  Time       Percentage
     0                 1082319    0.5
     1                 35578388   16.46
     2                 179417802  82.99
     3                 118863     0.05

     Number of Masters  Time       Percentage
     0                  15490456   7.164960359
     1                  200706916  92.83503964

     • 83% of the time, there were 2 slaves to a partition
     • 93% of the time, there was 1 master to a partition
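The percentages follow directly from the time totals: each is the time spent at a given replica count divided by the sum of the column. A minimal Python sketch, assuming the totals from the slide as input; `durations` and `percentages` are hypothetical helper names, and `durations` additionally assumes a time-sorted list of `(timestamp, count)` samples.

```python
def durations(samples, end_time):
    """Accumulate time spent at each count from time-sorted (timestamp, count) samples."""
    totals = {}
    for (start, count), (next_start, _) in zip(samples, samples[1:] + [(end_time, None)]):
        totals[count] = totals.get(count, 0) + (next_start - start)
    return totals


def percentages(time_by_count):
    """Convert {count: time} into {count: percentage of total time}, 2 decimals."""
    total = sum(time_by_count.values())
    return {count: round(100.0 * t / total, 2) for count, t in time_by_count.items()}


# Time totals taken from the slide.
slave_time = {0: 1082319, 1: 35578388, 2: 179417802, 3: 118863}
master_time = {0: 15490456, 1: 200706916}

slave_pct = percentages(slave_time)    # share of time at 0, 1, 2, 3 slaves
master_pct = percentages(master_time)  # share of time at 0 or 1 masters
```

Both columns sum to the same total, 216197372, which is a quick sanity check that the slave and master timelines cover the same observation window.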
  24. Invariant 2: State transitions
     FROM     TO       COUNT
     MASTER   SLAVE    55
     OFFLINE  DROPPED  0
     OFFLINE  SLAVE    298
     SLAVE    MASTER   155
     SLAVE    OFFLINE  0
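A table like this can be derived from the structured log by comparing each replica's state with its previous one. The following Python sketch is illustrative only: `count_transitions` and `illegal_transitions` are hypothetical names, the legal set is assumed to be the five FROM/TO pairs in the table, and the last sample event is invented to show an illegal jump.

```python
from collections import Counter

# Assumption: the legal transitions are the FROM/TO pairs listed in the table.
LEGAL = {
    ("MASTER", "SLAVE"),
    ("OFFLINE", "DROPPED"),
    ("OFFLINE", "SLAVE"),
    ("SLAVE", "MASTER"),
    ("SLAVE", "OFFLINE"),
}


def count_transitions(events):
    """Count state changes per (from, to) pair across all replicas.

    events: time-sorted (timestamp, partition, instance, state) tuples.
    Repeated reports of the same state are not counted as transitions.
    """
    last_state = {}
    counts = Counter()
    for _, partition, instance, state in events:
        key = (partition, instance)
        previous = last_state.get(key)
        if previous is not None and previous != state:
            counts[(previous, state)] += 1
        last_state[key] = state
    return counts


def illegal_transitions(counts):
    """Return the observed transitions that the state machine does not allow."""
    return {pair: n for pair, n in counts.items() if pair not in LEGAL}


# First rows for TestDB_123 from the log sample, plus one invented bad jump.
events = [
    (1323312236368, "TestDB_123", "express1-md_16918", "OFFLINE"),
    (1323312236426, "TestDB_123", "express1-md_16918", "OFFLINE"),
    (1323312236561, "TestDB_123", "express1-md_16918", "SLAVE"),
    (1323312236900, "TestDB_7", "express1-md_16918", "OFFLINE"),  # hypothetical
    (1323312236950, "TestDB_7", "express1-md_16918", "MASTER"),   # hypothetical
]
counts = count_transitions(events)
bad = illegal_transitions(counts)
```

Any non-empty result from `illegal_transitions` means a replica skipped a state, for example going from OFFLINE straight to MASTER.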
  25. Fun facts
     • For almost a month, the test failed to run successfully for even one night
     • Most issues were found using one test case
     • Reproduced almost all failures
  26. Conclusion
     • Traditional approach is not good enough
     • Data driven testing is the way to go
       – Focus on workload and analysis
       – Production system always in test mode
       – Leverage tools built for testing to debug production issues
  27. Contact
     website  helix.incubator.apache.org
     users    user@helix.incubator.apache.org
     dev      dev@helix.incubator.apache.org
     twitter  @apachehelix
