Data driven testing: Case study with Apache Helix

Case study of how we used Helix not only to build the distributed system but also to test it. We built a Chaos Monkey to simulate failures and developed tools in Helix that parse ZooKeeper transaction logs and controller and participant logs to reconstruct the exact sequence of steps that led to a failure. Once we have the exact sequence of steps, we reproduce the events using Helix for orchestration.

Published in: Technology
Notes
  • In this slide, we look at the problem from a different perspective and possibly redefine the cluster management problem. To recap: to build a distributed data store we need to define the number of partitions and replicas, and each replica needs a role such as master or slave. One well-proven way to express such behavior is a state machine.
  • Helix is used in production to manage the core infrastructure components in the company. Operation is easy, and DevOps can operate multiple systems the same way.
  • Coverage
  • Mention coverage
  • Data driven testing: Case study with Apache Helix

    1. Data Driven Testing for Distributed Systems
       Case study with Apache Helix
       Kishore Gopalakrishna, @kishoreg1980
       http://www.linkedin.com/in/kgopalak
    2. Outline
       • Intro to Helix
       • Use case: Distributed data store
       • Traditional approach
       • Data driven testing
       • Q & A
    3. What is Helix
       • Generic cluster management framework
         – Partition management
         – Failure detection and handling
         – Elasticity
    4. Terminologies
       Node        A single machine
       Cluster     Set of nodes
       Resource    A logical entity, e.g. database, index, task
       Partition   Subset of the resource
       Replica     Copy of a partition
       State       Status of a partition replica, e.g. Master, Slave
       Transition  Action that lets replicas change status, e.g. Slave -> Master
    5. Core concept: Augmented finite state machine
       State Machine
       • States: S1, S2, S3
       • Transitions: S1->S2, S2->S1, S2->S3, S3->S1
       Constraints
       • States: S1: max=1, S2: min=2
       • Transitions: Concurrent(S1->S2) across cluster < 5
       Objectives
       • Partition placement
       • Failure semantics
    6. Helix usage at LinkedIn
       Espresso
    7. Use case: Distributed data store
       • Timeline consistent partitioned data store
       • One master replica per partition
       • Even distribution of master/slave
       • On failure: promote slave to master
       [Figure: replicas of partitions P.1 to P.12 distributed across Node 1, Node 2 and Node 3]
    8. Helix based solution
       State Machine
       • States: Offline, Slave, Master
       • Transitions: O->S, S->M, S->M, M->S
       Constraints
       • States: M=1, S=2
       • Transitions: concurrent(O->S) < 5
       Objectives
       • Partition placement
       • Failure semantics
       [State diagram: states O, S, M connected by transitions t1 to t4, with t1 ≤ 5. S: COUNT=2, minimize(max_{nj∈N} S(nj)). M: COUNT=1, minimize(max_{nj∈N} M(nj)).]
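The state count constraints on slide 8 can be written down as plain data and checked against a snapshot of one partition. The sketch below is only an illustration: `STATE_COUNT_LIMITS` and `check_partition` are hypothetical names, not part of the Helix API, and the limits are the ones stated on the slide (M=1, S=2).

```python
from collections import Counter

# Upper bound on replicas per state for one partition, from the slide (M=1, S=2).
# Hypothetical structure for illustration; Helix has its own state model format.
STATE_COUNT_LIMITS = {"MASTER": 1, "SLAVE": 2}


def check_partition(replica_states):
    """Return a list of constraint violations for one partition.

    replica_states maps a node name to the state of its replica.
    """
    counts = Counter(replica_states.values())
    violations = []
    for state, limit in STATE_COUNT_LIMITS.items():
        if counts[state] > limit:
            violations.append(f"{state}: {counts[state]} > {limit}")
    return violations


healthy = {"node1": "MASTER", "node2": "SLAVE", "node3": "SLAVE"}
two_masters = {"node1": "MASTER", "node2": "MASTER", "node3": "SLAVE"}
print(check_partition(healthy))      # []
print(check_partition(two_masters))  # ['MASTER: 2 > 1']
```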
    9. Testing
       • Happy path functionality
         – Meet SLA
           • 99th percentile latency etc.
         – Writes to master
       • Non happy path
         – System failures
         – Application failures
         – How does the system behave in such scenarios?
    10. Non happy path - Traditional approach
        • Identify scenarios of interest
          – Node failure
          – System upgrade
        • Tested each scenario in isolation via test case
          – All tests passed :)
        • Deployed in alpha
          – First software upgrade failed … but we tested it
    11. What was missing
        • Failures don't happen in isolation
        • Induction principle does not work
          – If something works once, that does not mean it will always work
        • Lack of tools to debug issues
          – Could not identify the cause from one log file
        • Poor coverage
          – Impossible to think of all possible test cases
    12. What we learnt
        • Test with all components integrated
        • Simulate real production environment
          – Generate load
          – Random failures of multiple components
        • Better debugging tools
          – Need to correlate messages from multiple logs
          – Failure is a symptom; the actual reason is in past logs of a different machine
    13. Data driven testing
        • Instrument
          – ZooKeeper, controller, participant logs
        • Simulate
          – Chaos monkey
        • Analyze: the invariants are
          – Respect state transition constraints
          – Respect state count constraints
          – And so on
        • Debugging made easy
          – Reproduce exact sequence of events
    14. Chaos monkey
        • Select random component(s) to fail
        • How should it fail
          – Hard/soft failure
          – Network partition
          – Garbage collection
          – Process freeze
    15. Automation of chaos monkey
        • Helix agent on each node
        • Modify the behavior of each service using Helix
          – Component 1
            • Node1: RUNNING
            • Node2: STOPPED
            • Node3: KILLED
          – Component 2
            • Node1: STOPPED
        [State machine diagram: states STOPPED, RUNNING, FREEZED, KILLED; transitions START, STOP, PAUSE, UNPAUSE, KILL]
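The agent's state machine on slide 15 can be pictured as a small transition table. The state and transition names below are the ones on the slide, but the transcript does not say which transition connects which states, so every edge in `TRANSITIONS` is an assumption, and `next_state` is a hypothetical helper rather than a Helix call.

```python
# State and command names come from the slide; the edges are assumed.
TRANSITIONS = {
    ("STOPPED", "START"): "RUNNING",
    ("RUNNING", "STOP"): "STOPPED",
    ("RUNNING", "PAUSE"): "FREEZED",
    ("FREEZED", "UNPAUSE"): "RUNNING",
    ("RUNNING", "KILL"): "KILLED",
    ("KILLED", "START"): "RUNNING",
}


def next_state(state, command):
    """Return the state reached by applying command, or raise if not allowed."""
    try:
        return TRANSITIONS[(state, command)]
    except KeyError:
        raise ValueError(f"cannot {command} from {state}") from None


state = "STOPPED"
for command in ["START", "PAUSE", "UNPAUSE", "KILL"]:
    state = next_state(state, command)
print(state)  # KILLED
```

Keeping the failure modes in a table like this means a chaos run can be described, logged and replayed as a list of (node, command) pairs.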
    16. Pseudo test case
        setup cluster
        generate load
        do
          (c,t) = components to fail and type of failure
          simulate failure
          verify system_is_stable
          restart failed components
        while(verify system_is_stable)
        Test case failed & here is the sequence of events
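The pseudo test case on slide 16 could be sketched in Python roughly as follows. This is not the talk's actual harness: `run_chaos_test` and its `inject`, `restart` and `is_stable` hooks are hypothetical, and the `max_rounds` bound is an addition so that the loop terminates when the system never becomes unstable.

```python
import random


def run_chaos_test(components, failure_types, inject, restart, is_stable,
                   max_rounds=100, rng=None):
    """Fail random components until the system becomes unstable.

    inject(component, failure), restart(component) and is_stable() are
    caller-supplied hooks. Returns (passed, history), where history is the
    sequence of (component, failure) pairs that were injected.
    """
    rng = rng or random.Random()
    history = []
    for _ in range(max_rounds):
        component = rng.choice(components)
        failure = rng.choice(failure_types)
        history.append((component, failure))
        inject(component, failure)
        if not is_stable():
            return False, history  # stop here so the cluster can be inspected
        restart(component)
        if not is_stable():
            return False, history
    return True, history


# Toy cluster: it counts as unstable whenever node3 is down.
down = set()
passed, history = run_chaos_test(
    ["node1", "node2", "node3"], ["hard kill", "freeze"],
    inject=lambda component, failure: down.add(component),
    restart=lambda component: down.discard(component),
    is_stable=lambda: "node3" not in down,
    rng=random.Random(0),
)
print(passed, history[-1])
```

Returning the history with the verdict is what makes a failed run reproducible: the same list can be fed back as a scripted test.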
    17. Cluster verification
        • Verify all constraints are satisfied
          – Is there a master for every partition?
          – Is the slave replicating?
          – A node or component being down should not matter
          – Validate every action, not just the end result
            • Having a master is not good enough if two nodes became master and one of them later died.
    18. Log analysis
        • Log important events
          – Becoming master from slave for this partition at this time
        • Tools to collect, merge & analyze logs
          – Parsed ZooKeeper transaction logs
          – Gathered Helix controller and participant logs
          – Sorted on time
        • Helix provides these tools out of the box
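The "collect, merge, sort on time" step from slide 18 can be illustrated with the standard library alone. This is a sketch of the idea, not the Helix tooling: `merge_logs` is a hypothetical helper, and the event tuples and messages below are invented for the example.

```python
import heapq


def merge_logs(*logs):
    """Merge per-source event lists into one timeline.

    Each log is a list of (timestamp_ms, source, message) tuples that is
    already sorted by timestamp, as log files normally are.
    """
    return list(heapq.merge(*logs, key=lambda event: event[0]))


# Invented sample events from three sources.
zookeeper = [(100, "zookeeper", "session expired: node1"),
             (130, "zookeeper", "session created: node1")]
controller = [(110, "controller", "send SLAVE->MASTER for P.1 to node2")]
participant = [(120, "participant", "node2 became MASTER for P.1")]

for event in merge_logs(zookeeper, controller, participant):
    print(event)
```

Once the events sit on one timeline, the cause of a failure on one machine can be read off the earlier entries from another.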
    19. Structured Log File – sample
        timestamp      partition   instanceName       sessionId                            state
        1323312236368  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
        1323312236426  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
        1323312236530  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
        1323312236530  TestDB_91   express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
        1323312236561  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  SLAVE
        1323312236561  TestDB_91   express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
        1323312236685  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  SLAVE
        1323312236685  TestDB_91   express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
        1323312236685  TestDB_60   express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
        1323312236719  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  SLAVE
        1323312236719  TestDB_91   express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  SLAVE
        1323312236719  TestDB_60   express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
        1323312236814  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  SLAVE
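A whitespace-separated log like the sample on slide 19 is easy to load for analysis. The sketch below assumes the five columns shown in the sample and uses a few of its rows; `parse_log` and `latest_state` are hypothetical helpers, not Helix tools.

```python
# A few rows copied from the slide's sample.
SAMPLE = """\
timestamp partition instanceName sessionId state
1323312236368 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE
1323312236561 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc SLAVE
1323312236561 TestDB_91 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE
1323312236685 TestDB_60 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE
1323312236719 TestDB_91 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc SLAVE
"""


def parse_log(text):
    """Turn the header line plus rows into a list of dicts."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    rows = [dict(zip(header, line.split())) for line in lines[1:]]
    for row in rows:
        row["timestamp"] = int(row["timestamp"])
    return rows


def latest_state(rows):
    """Return the most recent state per (partition, instance) pair."""
    latest = {}
    for row in sorted(rows, key=lambda r: r["timestamp"]):
        latest[(row["partition"], row["instanceName"])] = row["state"]
    return latest


for key, state in latest_state(parse_log(SAMPLE)).items():
    print(key, state)
```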
    20. Benefits
        • Test case stops as soon as the system is unstable
          – The cluster is available for debugging
        • Provides exact sequence of events
          – Makes it easy to debug and reproduce
          – Best part: we auto-generated the test case
    21. Reproduce the issue
        Start state
        • Helix brings the system to the start state.
          {
            "id" : "MyDataStore",
            "simpleFields" : {
              "IDEAL_STATE_MODE" : "CUSTOM",
              "NUM_PARTITIONS" : "2",
              "REPLICAS" : "3",
              "STATE_MODEL_DEF_REF" : "MasterSlave"
            },
            "mapFields" : {
              "MyDataStore_0" : {
                "node1" : "MASTER",
                "node2" : "OFFLINE",
                "node3" : "SLAVE"
              },
              "MyDataStore_1" : {
                "node1" : "SLAVE",
                "node2" : "OFFLINE",
                "node3" : "MASTER"
              }
            }
          }
        Orchestrate the sequence
        • Use the Helix messaging API to replay the events
          1. Node1:MyDataStore_0: Master-Slave
          2. Node1:HARD KILL
          3. Node2:START
    22. Constraint violation
        No more than R=2 slaves
        Time   State    Number Slaves  Instance
        42632  OFFLINE  0              10.117.58.247_12918
        42796  SLAVE    1              10.117.58.247_12918
        43124  OFFLINE  1              10.202.187.155_12918
        43131  OFFLINE  1              10.220.225.153_12918
        43275  SLAVE    2              10.220.225.153_12918
        43323  SLAVE    3              10.202.187.155_12918
        85795  MASTER   2              10.220.225.153_12918
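The "Number Slaves" column on slide 22 can be derived by replaying the state changes in time order, which is also how a violation of R=2 is spotted. The sketch below uses the rows from the slide; `slave_counts` is a hypothetical helper written for this example.

```python
R = 2  # maximum number of slaves per partition, from the slide

# (time, state, instance) rows copied from the slide's table.
EVENTS = [
    (42632, "OFFLINE", "10.117.58.247_12918"),
    (42796, "SLAVE", "10.117.58.247_12918"),
    (43124, "OFFLINE", "10.202.187.155_12918"),
    (43131, "OFFLINE", "10.220.225.153_12918"),
    (43275, "SLAVE", "10.220.225.153_12918"),
    (43323, "SLAVE", "10.202.187.155_12918"),
    (85795, "MASTER", "10.220.225.153_12918"),
]


def slave_counts(events):
    """Replay events in order and yield (time, number_of_slaves) after each."""
    current = {}  # instance -> latest known state
    for time, state, instance in events:
        current[instance] = state
        yield time, sum(1 for s in current.values() if s == "SLAVE")


violations = [(time, n) for time, n in slave_counts(EVENTS) if n > R]
print(violations)  # [(43323, 3)]
```

Checking after every event, rather than only at the end of the run, is what catches this case: by time 85795 the count is back to 2 and the end state alone looks fine.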
    23. How long was it out of whack?
        Number of Slaves  Time       Percentage
        0                 1082319    0.5
        1                 35578388   16.46
        2                 179417802  82.99
        3                 118863     0.05

        Number of Masters  Time       Percentage
        0                  15490456   7.164960359
        1                  200706916  92.83503964

        83% of the time, there were 2 slaves to a partition
        93% of the time, there was 1 master to a partition
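The percentages on slide 23 are time-weighted: each count is weighted by how long the partition stayed at that count. A minimal sketch of that calculation follows; `time_in_each_value` and `percentages` are hypothetical helpers, and the small timeline in the usage lines is invented.

```python
def time_in_each_value(timeline, end_time):
    """Sum how long each value was held.

    timeline is a time-sorted list of (time, value) change points; the last
    value is assumed to hold until end_time.
    """
    durations = {}
    following = timeline[1:] + [(end_time, None)]
    for (start, value), (next_start, _) in zip(timeline, following):
        durations[value] = durations.get(value, 0) + (next_start - start)
    return durations


def percentages(durations):
    """Convert {value: duration} into {value: percent of total time}."""
    total = sum(durations.values())
    return {value: round(100 * d / total, 2) for value, d in durations.items()}


# Durations per slave count, copied from the slide.
SLAVE_TIME = {0: 1082319, 1: 35578388, 2: 179417802, 3: 118863}
print(percentages(SLAVE_TIME))  # {0: 0.5, 1: 16.46, 2: 82.99, 3: 0.05}

# Invented timeline: count 0 for 10 units, 1 for 20, 2 for 10, then 1 for 10.
print(time_in_each_value([(0, 0), (10, 1), (30, 2), (40, 1)], end_time=50))
# {0: 10, 1: 30, 2: 10}
```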
    24. Invariant 2: State Transitions
        FROM     TO       COUNT
        MASTER   SLAVE    55
        OFFLINE  DROPPED  0
        OFFLINE  SLAVE    298
        SLAVE    MASTER   155
        SLAVE    OFFLINE  0
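A transition table like the one on slide 24 can be produced by counting consecutive state pairs in each replica's history. In the sketch below, `count_transitions` and `illegal_transitions` are hypothetical helpers, the sample history is invented, and treating every pair that is missing from the slide's table as illegal is an assumption.

```python
from collections import Counter

# The (FROM, TO) pairs listed in the slide's table.
KNOWN_TRANSITIONS = {
    ("MASTER", "SLAVE"),
    ("OFFLINE", "DROPPED"),
    ("OFFLINE", "SLAVE"),
    ("SLAVE", "MASTER"),
    ("SLAVE", "OFFLINE"),
}


def count_transitions(states):
    """Count (from, to) pairs in one replica's history, ignoring repeats."""
    return Counter((a, b) for a, b in zip(states, states[1:]) if a != b)


def illegal_transitions(counts):
    """Return observed pairs that are not in KNOWN_TRANSITIONS."""
    return sorted(pair for pair in counts if pair not in KNOWN_TRANSITIONS)


# Invented history for one replica.
history = ["OFFLINE", "OFFLINE", "SLAVE", "MASTER", "SLAVE", "MASTER", "OFFLINE"]
counts = count_transitions(history)
print(counts[("SLAVE", "MASTER")])  # 2
print(illegal_transitions(counts))  # [('MASTER', 'OFFLINE')]
```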
    25. Fun facts
        • For almost a month, the test could not run successfully for a single night
        • Most issues were found using one test case
        • Reproduced almost all failures
    26. Conclusion
        • Traditional approach is not good enough
        • Data driven testing is the way to go
          – Focus on workload and analysis
          – Production system always in test mode
          – Leverage tools built for testing to debug production issues
    27. website  helix.incubator.apache.org
        users    user@helix.incubator.apache.org
        dev      dev@helix.incubator.apache.org
        twitter  @apachehelix
