Data driven testing: Case study with Apache Helix

Case study of how we used Helix not only to build the distributed system but also to test it. We built a Chaos monkey to simulate failures and developed tools in Helix to parse zookeeper transaction logs, controller and participant logs and reconstructed the exact sequence of steps that led to a failure. Once we get the exact sequence of steps, we reproduce the events using Helix for orchestration.

  • In this slide, we will look at the problem from a different perspective and possibly redefine the cluster management problem. To recap: to build the distributed data store (DDS), we need to define the number of partitions and replicas, and each replica needs to take on a role such as master or slave. One well-proven way to express such behavior is a state machine.
  • Used in production to manage the core infrastructure components in the company. Operation is easy, and DevOps can readily operate multiple systems.
  • Coverage
  • Mention coverage

Presentation Transcript

  • Data Driven Testing for Distributed Systems: Case study with Apache Helix. Kishore Gopalakrishna, @kishoreg1980, http://www.linkedin.com/in/kgopalak
  • Outline
    • Intro to Helix
    • Use case: Distributed data store
    • Traditional approach
    • Data driven testing
    • Q & A
  • What is Helix
    • Generic cluster management framework
      – Partition management
      – Failure detection and handling
      – Elasticity
  • Terminologies
    Node        A single machine
    Cluster     Set of nodes
    Resource    A logical entity, e.g. database, index, task
    Partition   Subset of the resource
    Replica     Copy of a partition
    State       Status of a partition replica, e.g. Master, Slave
    Transition  Action that lets replicas change status, e.g. Slave -> Master
  • Core concept: Augmented finite state machine
    • State Machine
      – States: S1, S2, S3
      – Transitions: S1->S2, S2->S1, S2->S3, S3->S1
    • Constraints
      – States: S1: max=1, S2: min=2
      – Transitions: Concurrent(S1->S2) across cluster < 5
    • Objectives
      – Partition placement
      – Failure semantics
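The augmented state machine on this slide can be sketched as a small data structure: a set of states, a set of legal transitions, and per-state replica count limits. This is only an illustration under stated assumptions; `StateModel` and its methods are hypothetical names, not Helix's API, and the cluster-wide concurrency limit is left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class StateModel:
    """Hypothetical container for a state machine plus its count constraints."""
    states: set
    transitions: set                                     # allowed (from, to) pairs
    max_per_state: dict = field(default_factory=dict)    # state -> max replicas per partition
    min_per_state: dict = field(default_factory=dict)    # state -> min replicas per partition

    def is_legal_transition(self, src, dst):
        """True when the state machine allows src -> dst."""
        return (src, dst) in self.transitions

    def violations(self, replica_states):
        """Return the count constraints broken by one partition's replica states."""
        problems = []
        for state in sorted(self.states):
            count = replica_states.count(state)
            if state in self.max_per_state and count > self.max_per_state[state]:
                problems.append(f"{state}: {count} > max {self.max_per_state[state]}")
            if state in self.min_per_state and count < self.min_per_state[state]:
                problems.append(f"{state}: {count} < min {self.min_per_state[state]}")
        return problems


# Values taken from the slide: S1 max=1, S2 min=2.
model = StateModel(
    states={"S1", "S2", "S3"},
    transitions={("S1", "S2"), ("S2", "S1"), ("S2", "S3"), ("S3", "S1")},
    max_per_state={"S1": 1},
    min_per_state={"S2": 2},
)
```

Calling `model.violations(["S1", "S2", "S2"])` returns an empty list, while a partition with two replicas in S1 is reported against both limits.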
  • Helix usage at LinkedIn
    [Figure: the only label recovered is Espresso]
  • Use case: Distributed data store
    • Timeline consistent partitioned data store
    • One master replica per partition
    • Even distribution of master/slave
    • On failure: promote slave to master
    [Figure: partitions P.1 to P.12 and their replicas spread across Node 1, Node 2 and Node 3]
  • Helix based solution
    • State Machine
      – States: Offline, Slave, Master
      – Transitions: O->S, S->O, S->M, M->S
    • Constraints
      – States: M=1, S=2
      – Transitions: concurrent(O->S) < 5
    • Objectives
      – Partition placement
      – Failure semantics
    [Figure: state diagram over O, S and M with transitions t1 to t4, where t1 ≤ 5. S: COUNT=2, minimize(max_{nj∈N} S(nj)). M: COUNT=1, minimize(max_{nj∈N} M(nj)).]
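The placement objective on this slide, minimizing the largest number of masters (or slaves) on any node, can be made concrete with a toy example. The sketch below uses a naive round-robin assignment, which is an assumption chosen for illustration and not Helix's rebalancing algorithm; `place` and `max_per_node` are hypothetical helpers.

```python
from collections import Counter


def place(num_partitions, nodes, num_slaves=2):
    """Naive round-robin placement: one MASTER and num_slaves SLAVEs per partition."""
    if len(nodes) <= num_slaves:
        raise ValueError("need more nodes than slaves per partition")
    placement = {}
    for p in range(num_partitions):
        master = nodes[p % len(nodes)]
        slaves = [nodes[(p + k) % len(nodes)] for k in range(1, num_slaves + 1)]
        replicas = {master: "MASTER"}
        replicas.update({s: "SLAVE" for s in slaves})
        placement[f"P.{p + 1}"] = replicas
    return placement


def max_per_node(placement, state):
    """Objective value: the largest number of replicas in `state` on any single node."""
    counts = Counter(
        node
        for replicas in placement.values()
        for node, s in replicas.items()
        if s == state
    )
    return max(counts.values(), default=0)


# 12 partitions over 3 nodes, matching the use case slide.
placement = place(12, ["Node 1", "Node 2", "Node 3"])
masters_on_busiest_node = max_per_node(placement, "MASTER")
slaves_on_busiest_node = max_per_node(placement, "SLAVE")
```

With 12 partitions on 3 nodes, an even spread means no node holds more than 4 masters, which is the lower bound for this objective.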
  • Testing
    • Happy path functionality
      – Meet SLA
        • 99th percentile latency etc.
      – Writes to master
    • Non happy path
      – System failures
      – Application failures
      – How does the system behave in such scenarios?
  • Non happy path: Traditional approach
    • Identify scenarios of interest
      – Node failure
      – System upgrade
    • Tested each scenario in isolation via test case
      – All tests passed :)
    • Deployed in alpha
      – First software upgrade failed … but we tested it
  • What was missing
    • Failures don't happen in isolation
    • Induction principle does not work
      – If something works once, that does not mean it will always work
    • Lack of tools to debug issues
      – Could not identify the cause from one log file
    • Poor coverage
      – Impossible to think of all possible test cases
  • What we learnt
    • Test with all components integrated
    • Simulate real production environment
      – Generate load
      – Random failures of multiple components
    • Better debugging tools
      – Need to correlate messages from multiple logs
      – Failure is a symptom; the actual reason is in past logs of a different machine
  • Data driven testing
    • Instrument
      – Zookeeper, controller, participant logs
    • Simulate
      – Chaos monkey
    • Analyze: invariants are
      – Respect state transition constraints
      – Respect state count constraints
      – And so on
    • Debugging made easy
      – Reproduce exact sequence of events
  • Chaos monkey
    • Select random component(s) to fail
    • How should it fail?
      – Hard/soft failure
      – Network partition
      – Garbage collection
      – Process freeze
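The selection step could look roughly like the sketch below: pick one or more random components, then pick a failure type for each. The failure type names mirror the slide; the function name, the `max_failures` cap, and the component names are assumptions made for illustration.

```python
import random

# Failure modes listed on the slide, as hypothetical identifiers.
FAILURE_TYPES = [
    "hard_failure",
    "soft_failure",
    "network_partition",
    "garbage_collection",
    "process_freeze",
]


def pick_failures(components, rng, max_failures=2):
    """Choose 1..max_failures distinct components and a failure type for each."""
    count = rng.randint(1, min(max_failures, len(components)))
    victims = rng.sample(components, count)
    return [(component, rng.choice(FAILURE_TYPES)) for component in victims]


# A seeded generator makes a failing run repeatable.
rng = random.Random(7)
components = ["node1/storage", "node2/storage", "node3/storage", "controller"]
plan = pick_failures(components, rng)
```

Seeding the generator is a deliberate choice here: the same seed yields the same failure plan, which helps when a run needs to be repeated.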
  • Automation of chaos monkey
    • Helix agent on each node
    • Modify the behavior of each service using Helix
      – Component 1
        • Node1: RUNNING
        • Node2: STOPPED
        • Node3: KILLED
      – Component 2
        • Node1: STOPPED
    [Figure: agent state machine with states STOPPED, RUNNING, FREEZED, KILLED and transitions START, STOP, PAUSE, UNPAUSE, KILL]
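The agent's state machine can be sketched as a transition table keyed by (state, command). The state and command names come from the slide, but the edges between them are not recoverable from the transcript, so the wiring below (including restarting from KILLED) is an assumption.

```python
# Assumed edges: the slide names the states and commands but not how they connect.
TRANSITIONS = {
    ("STOPPED", "START"): "RUNNING",
    ("RUNNING", "STOP"): "STOPPED",
    ("RUNNING", "PAUSE"): "FREEZED",
    ("FREEZED", "UNPAUSE"): "RUNNING",
    ("RUNNING", "KILL"): "KILLED",
    ("KILLED", "START"): "RUNNING",
}


def apply_command(state, command):
    """Return the next state, or raise if the command is not valid in this state."""
    try:
        return TRANSITIONS[(state, command)]
    except KeyError:
        raise ValueError(f"{command} is not allowed in state {state}") from None


def run(commands, state="STOPPED"):
    """Apply a sequence of commands to one service, starting from `state`."""
    for command in commands:
        state = apply_command(state, command)
    return state


final_state = run(["START", "PAUSE", "UNPAUSE", "KILL"])
```

Expressing the chaos monkey's actions as state transitions is what lets the same mechanism that manages the cluster also drive the failures.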
  • Pseudo test case
      setup cluster
      generate load
      do
          (c, t) = components to fail and type of failure
          simulate failure
          verify system_is_stable
          restart failed components
      while (verify system_is_stable)
      Test case failed & here is the sequence of events
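The pseudo test case above translates fairly directly into a loop that records every action and stops at the first instability. In this sketch the four callables and the `max_rounds` bound are assumptions; cluster setup and load generation are expected to happen before the loop is called.

```python
def run_chaos_test(pick_failure, simulate, is_stable, restart, max_rounds=100):
    """Inject failures until the system becomes unstable or max_rounds is reached.

    Returns (passed, events), where events is the exact sequence of actions taken.
    """
    events = []
    for _ in range(max_rounds):
        component, failure = pick_failure()
        events.append(("fail", component, failure))
        simulate(component, failure)
        if not is_stable():
            return False, events          # stop immediately, keep the cluster as-is
        restart(component)
        events.append(("restart", component))
        if not is_stable():
            return False, events
    return True, events


# Stand-in cluster: reports unstable on the fourth stability check.
checks = {"count": 0}


def fake_is_stable():
    checks["count"] += 1
    return checks["count"] < 4


passed, events = run_chaos_test(
    pick_failure=lambda: ("node1", "hard_kill"),
    simulate=lambda component, failure: None,
    is_stable=fake_is_stable,
    restart=lambda component: None,
)
```

Returning the event list alongside the verdict is what makes the failure reproducible: the list is effectively an auto-generated test case.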
  • Cluster verification
    • Verify all constraints are satisfied
      – Is there a master for every partition?
      – Is the slave replicating?
      – A node/component being down should not matter
      – Validate every action, not just the end result
        • Having a master is not good enough if two nodes became master and one of them later died
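The point about validating every action can be illustrated by replaying state changes one at a time and checking the master count after each step, instead of inspecting only the final state. The event tuple layout and function name below are assumptions made for this sketch.

```python
def first_violation(events, max_masters=1):
    """Replay (timestamp, partition, instance, state) events in order.

    Returns (timestamp, partition, masters) for the first step at which a
    partition has more than max_masters masters, or None if that never happens.
    """
    current = {}  # partition -> {instance: state}
    for timestamp, partition, instance, state in events:
        current.setdefault(partition, {})[instance] = state
        masters = sorted(i for i, s in current[partition].items() if s == "MASTER")
        if len(masters) > max_masters:
            return timestamp, partition, masters
    return None


# Two nodes become master, then one dies: the end state looks healthy.
events = [
    (1, "P.1", "node1", "MASTER"),
    (2, "P.1", "node2", "MASTER"),
    (3, "P.1", "node1", "OFFLINE"),
]
violation = first_violation(events)
```

Here the final state has exactly one master, yet the replay flags the step at timestamp 2 where both nodes held mastership.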
  • Log analysis
    • Log important events
      – Becoming master from slave for this partition at this time
    • Tools to collect, merge & analyze logs
      – Parsed zookeeper transaction logs
      – Gathered helix controller, participant logs
      – Sorted on time
    • Helix provides these tools out of the box
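Merging several logs into one timeline sorted on time can be sketched with the standard library's `heapq.merge`. This is not the Helix tooling the slide refers to; the entry layout is hypothetical, each input is assumed to be sorted already, and clocks on the machines are assumed to be reasonably synchronized.

```python
import heapq


def merge_logs(*logs):
    """Merge already time-sorted logs into one timeline ordered by timestamp."""
    return list(heapq.merge(*logs, key=lambda entry: entry[0]))


# Hypothetical entries: (timestamp, source, message).
zookeeper = [(100, "zookeeper", "session created"), (250, "zookeeper", "node updated")]
controller = [(120, "controller", "sent O->S"), (300, "controller", "sent S->M")]
participant = [(180, "participant", "became SLAVE"), (320, "participant", "became MASTER")]

timeline = merge_logs(zookeeper, controller, participant)
```

Reading the merged timeline top to bottom shows cause and effect across machines, which a single log file cannot.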
  • Structured Log File: sample
    timestamp      partition   instanceName       sessionId                            state
    1323312236368  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
    1323312236426  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
    1323312236530  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
    1323312236530  TestDB_91   express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
    1323312236561  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  SLAVE
    1323312236561  TestDB_91   express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
    1323312236685  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  SLAVE
    1323312236685  TestDB_91   express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
    1323312236685  TestDB_60   express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
    1323312236719  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  SLAVE
    1323312236719  TestDB_91   express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  SLAVE
    1323312236719  TestDB_60   express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
    1323312236814  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  SLAVE
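A log in this shape is straightforward to load for analysis. The sketch below assumes whitespace-separated columns in the order shown in the sample header; `StateRecord` and `parse_structured_log` are hypothetical names, not part of Helix.

```python
from typing import NamedTuple


class StateRecord(NamedTuple):
    timestamp: int        # epoch milliseconds, as in the sample
    partition: str
    instance_name: str
    session_id: str
    state: str


def parse_structured_log(text):
    """Parse whitespace-separated rows, skipping the header and malformed lines."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5 or not fields[0].isdigit():
            continue  # header, blank, or malformed line
        ts, partition, instance, session, state = fields
        records.append(StateRecord(int(ts), partition, instance, session, state))
    return records


# Two rows copied from the sample above.
SAMPLE = """\
timestamp partition instanceName sessionId state
1323312236530 TestDB_91 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE
1323312236719 TestDB_91 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc SLAVE
"""

records = parse_structured_log(SAMPLE)
```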
  • Benefits
    • Test case stops as soon as system is unstable
      – The cluster is available for debugging
    • Provides exact sequence of events
      – Makes it easy to debug and reproduce
      – Best part: we auto-generated the test case
  • Reproduce the issue
    • Start state: Helix brings the system to start state.
        {
          "id" : "MyDataStore",
          "simpleFields" : {
            "IDEAL_STATE_MODE" : "CUSTOM",
            "NUM_PARTITIONS" : "2",
            "REPLICAS" : "3",
            "STATE_MODEL_DEF_REF" : "MasterSlave"
          },
          "mapFields" : {
            "MyDataStore_0" : { "node1" : "MASTER", "node2" : "OFFLINE", "node3" : "SLAVE" },
            "MyDataStore_1" : { "node1" : "SLAVE", "node2" : "OFFLINE", "node3" : "MASTER" }
          }
        }
    • Orchestrate the sequence: use the Helix messaging API to replay the events
      1. Node1:MyDataStore_0: Master-Slave
      2. Node1:HARD KILL
      3. Node2:START
  • Constraint violation: no more than R=2 slaves
    Time   State    Number Slaves  Instance
    42632  OFFLINE  0              10.117.58.247_12918
    42796  SLAVE    1              10.117.58.247_12918
    43124  OFFLINE  1              10.202.187.155_12918
    43131  OFFLINE  1              10.220.225.153_12918
    43275  SLAVE    2              10.220.225.153_12918
    43323  SLAVE    3              10.202.187.155_12918
    85795  MASTER   2              10.220.225.153_12918
  • How long was it out of whack?
    Number of Slaves  Time       Percentage
    0                 1082319    0.5
    1                 35578388   16.46
    2                 179417802  82.99
    3                 118863     0.05

    Number of Masters  Time       Percentage
    0                  15490456   7.164960359
    1                  200706916  92.83503964

    83% of the time, there were 2 slaves to a partition
    93% of the time, there was 1 master to a partition
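The percentages in these tables are time-weighted: each replica count is weighted by how long the partition stayed at that count. A minimal sketch of that arithmetic follows, assuming the input is a list of (time, count) change points; the function names are hypothetical, and the durations fed to `percentages` are the ones from the tables above.

```python
from collections import defaultdict


def time_in_each_count(samples, end_time):
    """Sum how long each count was in effect, given sorted (time, count) change points."""
    durations = defaultdict(int)
    following = samples[1:] + [(end_time, None)]
    for (start, count), (next_start, _) in zip(samples, following):
        durations[count] += next_start - start
    return dict(durations)


def percentages(durations):
    """Convert durations to percentages of total time, rounded to two decimals."""
    total = sum(durations.values())
    return {count: round(100 * d / total, 2) for count, d in durations.items()}


# Durations from the slide's tables.
slave_share = percentages({0: 1082319, 1: 35578388, 2: 179417802, 3: 118863})
master_share = percentages({0: 15490456, 1: 200706916})

# Small made-up timeline to show the change-point input format.
toy = time_in_each_count([(0, 0), (10, 1), (30, 2), (90, 3)], end_time=100)
```

Both tables sum to the same total time (216197372), which is a useful sanity check that they describe the same observation window.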
  • Invariant 2: State Transitions
    FROM     TO       COUNT
    MASTER   SLAVE    55
    OFFLINE  DROPPED  0
    OFFLINE  SLAVE    298
    SLAVE    MASTER   155
    SLAVE    OFFLINE  0
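A table like this can be derived from the structured log by tracking each replica's previous state and counting every change. In the sketch below the allowed set is the list of FROM/TO pairs from the table; the event layout and function names are assumptions.

```python
from collections import Counter

# FROM/TO pairs listed in the table above.
ALLOWED = {
    ("MASTER", "SLAVE"),
    ("OFFLINE", "DROPPED"),
    ("OFFLINE", "SLAVE"),
    ("SLAVE", "MASTER"),
    ("SLAVE", "OFFLINE"),
}


def count_transitions(events):
    """Count state changes per replica from (timestamp, partition, instance, state) events."""
    last = {}
    counts = Counter()
    for _, partition, instance, state in events:
        key = (partition, instance)
        previous = last.get(key)
        if previous is not None and previous != state:
            counts[(previous, state)] += 1
        last[key] = state
    return counts


def illegal_transitions(counts, allowed=ALLOWED):
    """Return the observed transitions that the state machine does not allow."""
    return {t: n for t, n in counts.items() if t not in allowed}


events = [
    (1, "P.1", "node1", "OFFLINE"),
    (2, "P.1", "node1", "SLAVE"),
    (3, "P.1", "node1", "SLAVE"),     # repeated state, not a transition
    (4, "P.1", "node1", "MASTER"),
    (5, "P.1", "node2", "OFFLINE"),
    (6, "P.1", "node2", "MASTER"),    # skips SLAVE, so it breaks the invariant
]
counts = count_transitions(events)
bad = illegal_transitions(counts)
```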
  • Fun facts
    • For almost a month, the test did not run successfully for even one night
    • Most issues were found using one test case
    • Reproduced almost all failures
  • Conclusion
    • Traditional approach is not good enough
    • Data driven testing is the way to go
      – Focus on workload and analysis
      – Production system always in test mode
      – Leverage tools built for testing to debug production issues
  • website  helix.incubator.apache.org
    users    user@helix.incubator.apache.org
    dev      dev@helix.incubator.apache.org
    twitter  @apachehelix