
Apache Helix presentation at VMware

Apache Helix presentation at VMware. Download the file to see animations.


  1. Building distributed systems using Helix (http://helix.incubator.apache.org). Apache Incubation, Oct 2012. Kishore Gopalakrishna, @kishoreg1980, http://www.linkedin.com/in/kgopalak. Twitter: @apachehelix
  2. Outline • Introduction • Architecture • How to use Helix • Tools • Helix usage
  3. Examples of distributed data systems
  4. Lifecycle of a distributed system, from single node to cluster. Single node: partitioning, discovery, co-location. Multi node: replication, fault detection, recovery. Fault tolerance and cluster expansion: re-distribution, throttled data movement.
  5. Zookeeper provides low-level primitives; we need high-level primitives. Zookeeper (consensus system): file system, locks, ephemeral nodes. Application: node, partition, replica, state, transition. Layers: Application → Framework → Consensus system (Zookeeper).
  6. (image slide)
  7. Outline • Introduction • Architecture • How to use Helix • Tools • Helix usage
  8. Terminologies:
        Node: a single machine
        Cluster: set of nodes
        Resource: a logical entity, e.g. database, index, task
        Partition: subset of the resource
        Replica: copy of a partition
        State: status of a partition replica, e.g. Master, Slave
        Transition: action that lets replicas change status, e.g. Slave → Master
  9. Core concept: state machine • Set of legal states: S1, S2, S3 • Set of legal state transitions: S1→S2, S2→S1, S2→S3, S3→S2
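     As a rough illustration (not from the slides), this state machine could be declared with the StateModelDefinition.Builder API that appears later in the deck; the model name "S1S2S3" and the choice of S1 as initial state are assumptions:

        // A minimal sketch; StateModelDefinition.Builder is from org.apache.helix.model.
        StateModelDefinition.Builder builder = new StateModelDefinition.Builder("S1S2S3");
        builder.addState("S1");
        builder.addState("S2");
        builder.addState("S3");
        builder.initialState("S1");        // assumption: S1 is the initial state
        builder.addTransition("S1", "S2");
        builder.addTransition("S2", "S1");
        builder.addTransition("S2", "S3");
        builder.addTransition("S3", "S2");
        StateModelDefinition stateModel = builder.build();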
  10. Core concept: constraints • Minimum and maximum number of replicas that should be in a given state: S3 → max=1, S2 → min=2 • Maximum concurrent transitions: per node, per resource, across cluster
  11. Core concept: objectives • Partition placement: even distribution of replicas in states S1, S2 across the cluster • Failure/expansion semantics: create new replicas and assign state, change state of existing replicas, even distribution of replicas
  12. Augmented finite state machine. State machine: states S1, S2, S3; transitions S1→S2, S2→S1, S2→S3, S3→S1. Constraints: S1 → max=1, S2 → min=2; concurrent S1→S2 transitions across cluster < 5. Objectives: partition placement, failure semantics.
  13. Message consumer group: problem. A partitioned queue consumed by a consumer group. Assignment (partition management): one consumer per queue, even distribution. Scaling (elasticity): re-distribute queues among consumers, minimize movement. Fault tolerance: re-distribute, minimize movement, limit max queues per consumer.
  14. Message consumer group: solution. ONLINE-OFFLINE state model: Offline → Online (start consumption), Online → Offline (stop consumption). Constraints: MAX=1 per partition; MAX 10 queues per consumer.
  15. Distributed data store. Partitions P.1–P.12 spread across Node 1–3, each with one MASTER and SLAVE replicas. Partition management: multiple replicas, 1 designated master, even distribution. Fault tolerance: fault detection, promote slave to master, even distribution, no SPOF. Elasticity: minimize downtime, minimize data movement, throttle data movement.
  16. Distributed data store: solution. MASTER-SLAVE state model, O → S → M, with transitions t1..t4 between the states. SLAVE: COUNT=2, objective minimize(max_{nj ∈ N} S(nj)). MASTER: COUNT=1, objective minimize(max_{nj ∈ N} M(nj)). Transition throttling, e.g. t1 ≤ 5.
  17. Distributed search service. Index shards P.1–P.6 with replicas across Node 1–3. Partition management: multiple replicas, even distribution, rack-aware placement. Fault tolerance: fault detection, auto-create replicas, controlled creation of replicas. Elasticity: re-distribute partitions, minimize movement, throttle data movement.
  18. Distributed search service: solution. Constraints: MAX per node = 5; MAX = 3 (number of replicas).
  19. Internals
  20. IDEALSTATE. Configuration: 3 nodes, 3 partitions, 2 replicas, StateMachine. Constraints: 1 master, 1 slave, even distribution. Resulting replica placement and state: P1 → N1:M, N2:S; P2 → N2:M, N3:S; P3 → N3:M, N1:S.
  21. CURRENT STATE. N1: P1:OFFLINE, P3:OFFLINE. N2: P2:MASTER, P1:MASTER. N3: P3:MASTER, P2:SLAVE.
  22. EXTERNAL VIEW. P1 → N1:O, N2:M; P2 → N2:M, N3:S; P3 → N3:M, N1:O.
  23. Helix-based system roles. Participants: the nodes hosting the partitions (P.1–P.12 across Node 1, Node 2, Node 3). Controller: compares the IDEAL STATE with the CURRENT STATE and issues COMMANDs to participants, which reply with RESPONSEs. Spectator: partition routing logic.
  24. Logical deployment
  25. Outline • Introduction • Architecture • How to use Helix • Tools • Helix usage
  26. Helix-based solution: 1. Define 2. Configure 3. Run
  27. Define: state model definition (applicable to each partition of a resource), e.g. MasterSlave (O → S → M). States: all possible states, each with a priority. Transitions: legal transitions, each with a priority.
  28. Define: state model

        StateModelDefinition.Builder builder = new StateModelDefinition.Builder("MASTERSLAVE");
        // Add states and their rank to indicate priority.
        builder.addState(MASTER, 1);
        builder.addState(SLAVE, 2);
        builder.addState(OFFLINE);
        // Set the initial state when the node starts.
        builder.initialState(OFFLINE);
        // Add transitions between the states.
        builder.addTransition(OFFLINE, SLAVE);
        builder.addTransition(SLAVE, OFFLINE);
        builder.addTransition(SLAVE, MASTER);
        builder.addTransition(MASTER, SLAVE);
  29. Define: constraints. Scopes at which state and transition constraints can be set:

        Scope      State  Transition
        Partition  Y      Y
        Resource   -      Y
        Node       Y      Y
        Cluster    -      Y

     Example (MasterSlave, partition scope): state constraint M=1, S=2, i.e. COUNT=1 master and COUNT=2 slaves per partition; no transition constraint.
  30. Define: constraints

        // Static constraint: at most one MASTER per partition.
        builder.upperBound(MASTER, 1);
        // Dynamic constraint: "R" resolves to the resource's configured replica
        // count, so the SLAVE bound scales with configuration.
        builder.dynamicUpperBound(SLAVE, "R");
        // Unconstrained: any number of replicas may be OFFLINE.
        builder.upperBound(OFFLINE, -1);
  31. Define: participant plug-in code
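     The slide shows a code screenshot; below is a hedged sketch of what participant plug-in code typically looks like, using Helix's StateModel and Transition annotations. The class name and the method bodies are illustrative, not from the slide:

        import org.apache.helix.NotificationContext;
        import org.apache.helix.model.Message;
        import org.apache.helix.participant.statemachine.StateModel;
        import org.apache.helix.participant.statemachine.StateModelInfo;
        import org.apache.helix.participant.statemachine.Transition;

        // One instance of this class is created per partition replica; Helix calls
        // the matching method for each state transition it fires.
        @StateModelInfo(initialState = "OFFLINE", states = {"MASTER", "SLAVE", "OFFLINE"})
        public class MasterSlaveStateModel extends StateModel {

          @Transition(from = "OFFLINE", to = "SLAVE")
          public void onBecomeSlaveFromOffline(Message message, NotificationContext context) {
            // e.g. open the partition's store and start replicating from the master
          }

          @Transition(from = "SLAVE", to = "MASTER")
          public void onBecomeMasterFromSlave(Message message, NotificationContext context) {
            // e.g. start serving writes for this partition
          }

          @Transition(from = "MASTER", to = "SLAVE")
          public void onBecomeSlaveFromMaster(Message message, NotificationContext context) {
            // e.g. stop serving writes and fall back to replicating
          }

          @Transition(from = "SLAVE", to = "OFFLINE")
          public void onBecomeOfflineFromSlave(Message message, NotificationContext context) {
            // e.g. close the partition's store
          }
        }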
  32. Step 2: Configure, with helix-admin --zkSvr <zkAddress>
        CREATE CLUSTER: --addCluster <clusterName>
        ADD NODE: --addNode <clusterName instanceId(host:port)>
        CONFIGURE RESOURCE: --addResource <clusterName resourceName partitions statemodel>
        REBALANCE (sets the IDEALSTATE): --rebalance <clusterName resourceName replicas>
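     For concreteness, a hypothetical session filling in the templates above (the cluster name and ZooKeeper address match the controller example on a later slide; the node and resource names are made up):

        helix-admin --zkSvr localhost:2181 --addCluster MyCluster
        helix-admin --zkSvr localhost:2181 --addNode MyCluster localhost:12913
        helix-admin --zkSvr localhost:2181 --addResource MyCluster MyDB 6 MasterSlave
        helix-admin --zkSvr localhost:2181 --rebalance MyCluster MyDB 3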
  33. Zookeeper view: IDEALSTATE (screenshot)
  34. Step 3: Run. START CONTROLLER: run-helix-controller --zkSvr localhost:2181 --cluster MyCluster. START PARTICIPANT: see the sketch below.
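     The slide does not show the participant start; a minimal sketch, assuming the MasterSlave model defined earlier and a hypothetical MasterSlaveStateModelFactory (a StateModelFactory producing the MasterSlaveStateModel sketched above):

        import org.apache.helix.HelixManager;
        import org.apache.helix.HelixManagerFactory;
        import org.apache.helix.InstanceType;

        HelixManager manager = HelixManagerFactory.getZKHelixManager(
            "MyCluster", "localhost_12913", InstanceType.PARTICIPANT, "localhost:2181");
        // Register the plug-in code for the MasterSlave model, then join the cluster.
        manager.getStateMachineEngine().registerStateModelFactory(
            "MasterSlave", new MasterSlaveStateModelFactory());
        manager.connect();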
  35. Zookeeper view (screenshot)
  36. Znode content: CURRENT STATE and EXTERNAL VIEW (screenshots)
  37. Spectator plug-in code
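     Again a screenshot on the slide; a hedged sketch of spectator code using Helix's RoutingTableProvider, with made-up cluster, resource, and partition names:

        import java.util.List;
        import org.apache.helix.HelixManager;
        import org.apache.helix.HelixManagerFactory;
        import org.apache.helix.InstanceType;
        import org.apache.helix.model.InstanceConfig;
        import org.apache.helix.spectator.RoutingTableProvider;

        // Connect as a SPECTATOR: observes the cluster without hosting partitions.
        HelixManager manager = HelixManagerFactory.getZKHelixManager(
            "MyCluster", "mySpectator", InstanceType.SPECTATOR, "localhost:2181");
        manager.connect();
        // RoutingTableProvider keeps a routing table in sync with the external view.
        RoutingTableProvider routingTable = new RoutingTableProvider();
        manager.addExternalViewChangeListener(routingTable);
        // Route a request for partition MyDB_0 to whichever instance is currently MASTER.
        List<InstanceConfig> masters = routingTable.getInstances("MyDB", "MyDB_0", "MASTER");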
  38. Helix execution modes
  39. IDEALSTATE. Configuration: 3 nodes, 3 partitions, 2 replicas, StateMachine. Constraints: 1 master, 1 slave, even distribution. Replica placement and state: P1 → N1:M, N2:S; P2 → N2:M, N3:S; P3 → N3:M, N1:S.
  40. Execution modes: who controls what.

        Mode            Replica placement  Replica state
        AUTO_REBALANCE  Helix              Helix
        AUTO            App                Helix
        CUSTOM          App                App
  41. Auto rebalance vs. Auto (side-by-side IDEALSTATE views)
  42. In action. Both modes start from MasterSlave, p=3, r=2, N=3, with placement Node1: P1:M, P2:S; Node2: P2:M, P3:S; Node3: P3:M, P1:S. Auto rebalance, on failure of Node1: auto-create replicas and assign state, e.g. Node1: P1:O, P2:O; Node2: P2:M, P3:S, P1:M; Node3: P3:M, P1:S, P2:S. Auto, on failure: only change states to satisfy the constraints; the surviving slave of P1 on Node3 is promoted (P1:S → P1:M) and no new replicas are created.
  43. Custom mode: example (screenshot)
  44. Custom mode: handling failure • Custom code invoker: code that lives on all nodes but is active in only one place; invoked when a node joins/leaves the cluster; computes the new idealstate • The Helix controller fires the transitions without violating constraints. Example: moving P1's master from N1 to N2 requires transitions (1) N1: M→S and (2) N2: S→M; running 1 and 2 in parallel would violate the single-master constraint, so Helix sends 2 only after 1 is finished.
  45. Controller deployment. Embedded: controller embedded within each participant; only one controller active; no extra process to manage; suitable for small clusters; upgrading the controller is costly; participant health impacts the controller. Separate: at least 2 separate controller processes to avoid SPOF; only one controller active; extra process to manage; upgrading the controller is easy; recommended for large clusters.
  46. Controller fault tolerance. Controllers 1–3 register with Zookeeper; one is LEADER, the others STANDBY. Zookeeper ephemeral-node based leader election decides the controller leader.
  47. Controller fault tolerance. When the leader fails (Controller 1 goes OFFLINE), another controller becomes the new leader.
  48. Managing the controllers
  49. Scaling the controller: Leader-Standby model. Controllers are themselves managed with a LeaderStandby state model (OFFLINE → STANDBY → LEADER): each cluster has one LEADER controller, with the other controllers on STANDBY for it.
  50. Scaling the controller: failure. When a controller fails, a STANDBY controller becomes LEADER for the clusters it was managing.
  51. Outline • Introduction • Architecture • How to use Helix • Tools • Helix usage
  52. Tools • Chaos monkey • Data-driven testing and debugging • Rolling upgrade • On-demand task scheduling and intra-cluster messaging • Health monitoring and alerts
  53. Data-driven testing • Instrument: Zookeeper, controller, participant logs • Simulate: Chaos monkey • Analyze: invariants such as respecting state-transition constraints, respecting state-count constraints, and so on • Debugging made easy: reproduce the exact sequence of events
  54. Structured log file: sample

        timestamp      partition   instanceName       sessionId                             state
        1323312236368  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  OFFLINE
        1323312236426  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  OFFLINE
        1323312236530  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  OFFLINE
        1323312236530  TestDB_91   express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  OFFLINE
        1323312236561  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  SLAVE
        1323312236561  TestDB_91   express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  OFFLINE
        1323312236685  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  SLAVE
        1323312236685  TestDB_91   express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  OFFLINE
        1323312236685  TestDB_60   express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  OFFLINE
        1323312236719  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  SLAVE
        1323312236719  TestDB_91   express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  SLAVE
        1323312236719  TestDB_60   express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  OFFLINE
        1323312236814  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  SLAVE
  55. No more than R=2 slaves

        Time   State    Number of slaves  Instance
        42632  OFFLINE  0                 10.117.58.247_12918
        42796  SLAVE    1                 10.117.58.247_12918
        43124  OFFLINE  1                 10.202.187.155_12918
        43131  OFFLINE  1                 10.220.225.153_12918
        43275  SLAVE    2                 10.220.225.153_12918
        43323  SLAVE    3                 10.202.187.155_12918
        85795  MASTER   2                 10.220.225.153_12918
  56. How long was it out of whack?

        Number of slaves  Time       Percentage
        0                 1082319    0.5
        1                 35578388   16.46
        2                 179417802  82.99
        3                 118863     0.05

     83% of the time, there were 2 slaves per partition; 93% of the time, there was 1 master per partition.

        Number of masters  Time       Percentage
        0                  15490456   7.16
        1                  200706916  92.84
  57. Invariant 2: state transitions

        FROM     TO       COUNT
        MASTER   SLAVE    55
        OFFLINE  DROPPED  0
        OFFLINE  SLAVE    298
        SLAVE    MASTER   155
        SLAVE    OFFLINE  0
  58. Outline • Introduction • Architecture • How to use Helix • Tools • Helix usage
  59. Helix usage at LinkedIn: Espresso
  60. In flight • Apache S4: partitioning, co-location, dynamic cluster expansion • Archiva: partitioned replicated file store, rsync-based replication • Others in evaluation: Bigtop
  61. Auto-scaling software deployment tool • States: Offline → Download → Configure → Start → Active/Standby • Constraint for each state: Download < 100, Active 1000, Standby 100 (a builder sketch follows below)
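     Not on the slide, but the same constraints could be declared with the builder API from earlier in the deck; the state names and the model name "Deployment" are assumptions:

        StateModelDefinition.Builder builder = new StateModelDefinition.Builder("Deployment");
        builder.addState("ACTIVE", 1);
        builder.addState("STANDBY", 2);
        builder.addState("START");
        builder.addState("CONFIGURE");
        builder.addState("DOWNLOAD");
        builder.addState("OFFLINE");
        builder.initialState("OFFLINE");
        builder.addTransition("OFFLINE", "DOWNLOAD");
        builder.addTransition("DOWNLOAD", "CONFIGURE");
        builder.addTransition("CONFIGURE", "START");
        builder.addTransition("START", "ACTIVE");
        builder.addTransition("ACTIVE", "STANDBY");
        builder.addTransition("STANDBY", "ACTIVE");
        // Constraints from the slide: at most 100 concurrent downloads,
        // 1000 ACTIVE instances, 100 STANDBY instances.
        builder.upperBound("DOWNLOAD", 100);
        builder.upperBound("ACTIVE", 1000);
        builder.upperBound("STANDBY", 100);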
  62. Summary • Helix: a generic framework for building distributed systems • Modifying/enhancing system behavior is easy: abstraction and modularity are key • Simple programming model: declarative state machine
  63. Roadmap • Span multiple data centers • Automatic load balancing • Distributed health monitoring • YARN generic application master for real-time apps • Standalone Helix agent
  64. Website: http://helix.incubator.apache.org • User list: user@helix.incubator.apache.org • Dev list: dev@helix.incubator.apache.org • Twitter: @apachehelix, @kishoreg1980
