HBase: How to get MTTR below 1 minute

Best practices for HBase Mean Time To Recovery (MTTR).


1. How to get the MTTR below 1 minute, and more
   Devaraj Das (ddas@hortonworks.com)
   Nicolas Liochon (nkeywal@gmail.com)
2. Outline
   • What is this? Why are we talking about this topic? Why does it matter?
   • HBase recovery – an overview
   • HDFS issues
   • Beyond MTTR (performance post recovery)
   • Conclusion / Future / Q & A
3. What is MTTR? Why is it important?
   • Mean Time To Recovery: the average time required to repair a failed component (courtesy: Wikipedia)
   • Enterprises want an MTTR of ZERO
     – Data should always be available, with no degradation of perceived SLAs
     – Practically hard to obtain, but yeah, it's the goal
   • Close-to-zero MTTR is especially important for HBase
     – Given that it is used in near-realtime systems
   • MTTR also matters in other NoSQL systems and databases
4. HBase basics
   • Strongly consistent
     – Writes are ordered with reads
     – Once written, the data will stay
   • Built on top of HDFS
   • When a machine fails, the cluster remains available, and so does its data
   • We are only talking about the piece of data that was handled by the failed machine
5. Write path
   • WAL – Write-Ahead Log
   • A write is finished once written on all HDFS nodes
   • The client communicates with the region servers
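
To make the write path concrete, here is a minimal client-side sketch using the 0.94/0.96-era Java API; the table and column names are hypothetical. The key point: put() returns only after the edit has been written to the WAL and acknowledged by every DataNode in its HDFS pipeline.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WritePathExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "usertable");  // hypothetical table
        Put put = new Put(Bytes.toBytes("row-1"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("v"));
        table.put(put);  // durable once the WAL edit reaches all replicas
        table.close();
      }
    }
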
  6. 6. We’re  in  a  distributed  system   •  You  can’t  disPnguish  a   slow  server  from  a   dead  server   •  Everything,  or,  nearly   everything,  is  based   on  Pmeout   •  Smaller  Pmeouts  means  more  false  posiPve   •  HBase  works  well  with  false  posiPve,  but  they   always  have  a  cost.   •  The  less  the  Pmeouts  the  be>er  
7. HBase components for recovery
8. Recovery in action
9. Recovery process
   • Failure detection: ZooKeeper heartbeats the servers and expires the session when a server does not reply
   • Region assignment: the master reallocates the regions to the other servers
   • Failure recovery: read the WAL and rewrite the data
   • The client stops using the connection to the dead server and switches to the new one
   [Diagram: ZooKeeper heartbeats the region servers/DataNodes; the master, RS and ZK handle region assignment; the region servers and DataNodes perform data recovery; the client talks to all of them.]
10. So...
   • Detect the failure as fast as possible
   • Reassign as fast as possible
   • Read / rewrite the WAL as fast as possible
   • That's obvious
11. The obvious – failure detection
   • Set the ZooKeeper session timeout to 30s instead of the old 180s default
     – Beware of GC pauses, but lower values are possible
     – ZooKeeper detects the errors sooner than the configured timeout
   • 0.96
     – HBase scripts clean the ZK node when the server is kill -9ed
       => detection time becomes 0
     – Can be used by any monitoring tool
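
A minimal sketch of that timeout change, assuming the standard HBase key zookeeper.session.timeout (in milliseconds); in practice you would set it in hbase-site.xml, but it is shown programmatically here for brevity.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class FailureDetectionConfig {
      public static Configuration create() {
        Configuration conf = HBaseConfiguration.create();
        // 30s session timeout instead of the old 180s default;
        // watch your GC pauses before going lower.
        conf.setInt("zookeeper.session.timeout", 30000);
        return conf;
      }
    }
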
12. The obvious – faster data recovery
   • Not so obvious, actually
   • Already distributed since 0.92
     – The larger the cluster, the better
   • Completely rewritten in 0.96
     – Recovery itself rewritten in 0.96
     – Will be covered in the second part
13. The obvious – faster assignment
   • Faster assignment
     – Just improving performance: parallelism and speed
     – Globally 'much' faster
     – Backported to 0.94
   • Still possible to do better for a huge number of regions
   • A few seconds for most cases
14. With this
   • Detection: from 180s to 30s
   • Data recovery: around 10s
   • Reassignment: from tens of seconds to seconds
15. Do you think we're better off with this?
   • The answer is NO
   • Actually yes, but if and only if HDFS is fine
     – And when you lose a region server, you've just lost a DataNode
16. A DataNode crash is expensive!
   • One replica of the WAL edits is on the crashed DataNode
     – 33% of the reads during the region server recovery will go to it
   • Many writes will go to it as well (the smaller the cluster, the higher that probability)
   • The NameNode re-replicates the data (maybe TBs) that was on this node to restore the replica count
     – The NameNode starts this work only after a long timeout (10 minutes by default)
17. HDFS – stale mode
   • Live: as today, used for reads and writes, using locality
   • Stale (after 30 seconds, can be less): not used for writes, used as a last resort for reads
   • Dead (after 10 minutes; don't change this): as today, not used
   • And actually, it's better to do the HBase recovery before HDFS re-replicates the TBs of data of this node
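
A sketch of the stale-node settings, assuming the hdfs-site.xml keys from the stale-DataNode feature (HDFS-3703/HDFS-3912); shown programmatically for brevity.

    import org.apache.hadoop.conf.Configuration;

    public class StaleModeConfig {
      public static Configuration create() {
        Configuration conf = new Configuration();
        // Mark a DataNode stale after 30s without a heartbeat...
        conf.setLong("dfs.namenode.stale.datanode.interval", 30000L);
        // ...then avoid it for writes, and read from it only as a last resort.
        conf.setBoolean("dfs.namenode.avoid.write.stale.datanode", true);
        conf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);
        return conf;
      }
    }
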
18. Results
   • We can do more reads/writes to HDFS during the recovery
   • Multiple failures are still possible
     – Stale mode will still play its role
     – Also set dfs.timeout to 30s
     – This limits the effect of two failures in a row: the cost of the second failure is 30s if you were unlucky
19. Are we done?
   • We're not bad
   • But there is still something
20. The client
   You left it waiting on the dead server
21. Here it is
22. The client
   • You want the client to be patient
     – Retrying when the system is already loaded is not good
   • You want the client to learn about region servers dying, and to be able to react immediately
   • You want this to scale
23. Solution
   • The master notifies the client
     – A cheap multicast message with the "dead servers" list, sent 5 times for safety
     – Off by default
     – On reception, the client immediately stops waiting on the TCP connection. You can now enjoy a large hbase.rpc.timeout
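
A hedged sketch of turning this on; the key names below match the status-publisher feature as merged into 0.96, but treat them as assumptions and check your version's defaults.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class DeadServerNotificationConfig {
      public static Configuration create() {
        Configuration conf = HBaseConfiguration.create();
        // Master publishes its status, including the dead-servers list,
        // over multicast; off by default.
        conf.setBoolean("hbase.status.published", true);
        // With the notification in place, a large RPC timeout is safe:
        // the client no longer relies on it to detect dead servers.
        conf.setInt("hbase.rpc.timeout", 60000);
        return conf;
      }
    }
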
24. Full workflow
   • t0: the RegionServer is serving reads and writes; the client reads and writes
   • t1: the RegionServer crashes
   • t2: the affected regions are reassigned; the client can write
   • t3: the data is recovered
   • t4: the client reads and writes again
25. Are we done?
   • In a way, yes
     – There are a lot of things around asynchronous writes and reads during recovery
     – That will be for another time, but there will be some nice things in 0.96
   • And a couple of them are presented in the second part of this talk!
26. Faster recovery
   • Previous algorithm
     – Read the WAL files
     – Write new HFiles
     – Tell the region server it got new HFiles
   • This puts pressure on the NameNode
     – Remember: don't put pressure on the NameNode
   • New algorithm:
     – Read the WAL
     – Write to the region server
     – We're done (we have seen great improvements in our tests)
     – TBD: assign the WAL to a RegionServer local to a replica
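
A sketch of enabling the new path, assuming the 0.96 distributed-log-replay switch (hbase.master.distributed.log.replay); verify the key against your release before relying on it.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class DistributedLogReplayConfig {
      public static Configuration create() {
        Configuration conf = HBaseConfiguration.create();
        // Replay WAL edits directly to the re-opened regions instead of
        // writing intermediate split files to HDFS first.
        conf.setBoolean("hbase.master.distributed.log.replay", true);
        return conf;
      }
    }
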
27. [Diagram: distributed log split. RegionServer1, RegionServer2 and RegionServer3 write WAL files to HDFS, each file interleaving edits for several regions. During recovery, RegionServer0, RegionServer_x and RegionServer_y read those WAL files and write one splitlog file per region back to HDFS.]
28. [Diagram: distributed log replay. Same layout, but RegionServer0, RegionServer_x and RegionServer_y replay the recovered edits for region1, region2 and region3 directly to the RegionServers now serving those regions, instead of writing intermediate files to HDFS.]
29. Writes during recovery
   • Hey, you can write during the WAL replay
   • For an event stream, your new recovery time is the failure detection time: max 30s, likely less!
30. MemStore flush
   • Real life: some tables are updated at a given moment, then left alone
     – With a non-empty MemStore
     – More data to recover
   • It's now possible to guarantee that we don't have a MemStore with old data
   • Improves real-life MTTR
   • Helps snapshots
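
A sketch of the knob involved, assuming the periodic-flush interval key hbase.regionserver.optionalcacheflushinterval; treat the exact name and default as assumptions for your version.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class PeriodicFlushConfig {
      public static Configuration create() {
        Configuration conf = HBaseConfiguration.create();
        // Flush any MemStore whose oldest edit is more than an hour old,
        // so an idle table does not keep old edits needing WAL replay.
        conf.setLong("hbase.regionserver.optionalcacheflushinterval", 3600000L);
        return conf;
      }
    }
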
31. .META.
   • .META.
     – There is no -ROOT- in 0.95/0.96
     – But .META. failures are critical
   • A lot of small improvements
     – The server now tells the client when a region has moved (the client can avoid going to .META.)
   • And a big one
     – The .META. WAL is managed separately, to allow an immediate recovery of .META.
     – Together with the new MemStore flush, this ensures a quick recovery
32. Data locality post recovery
   • HBase performance depends on data locality
   • After a recovery, you've lost it
     – Bad for performance
   • Here come region groups
   • Assign 3 favored RegionServers to every region
   • On failure, assign the region to one of the secondaries
   • The data locality issue is minimized on failures (see the sketch below)
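
A sketch of wiring this up, assuming the favored-nodes balancer that landed around 0.96 (HBASE-7932); the class name below comes from that work and should be verified against your release.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class FavoredNodesConfig {
      public static Configuration create() {
        Configuration conf = HBaseConfiguration.create();
        // Pick a balancer that assigns each region a primary plus two
        // secondary RegionServers colocated with its block replicas.
        conf.set("hbase.master.loadbalancer.class",
            "org.apache.hadoop.hbase.master.balancer.FavoredNodeLoadBalancer");
        return conf;
      }
    }
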
33. [Diagram: RegionServer1 serves three regions, and their StoreFile blocks are scattered across the cluster with one replica local to RegionServer1. After RegionServer1 fails, RegionServer4 takes over and has to read Block1, Block2 and Block3 remotely from the other racks.]
34. [Diagram: with favored nodes, RegionServer1 serves three regions and their StoreFile block replicas are placed on specific machines on the other racks. After RegionServer1 fails, the regions are reassigned to those machines: no remote reads.]
35. Conclusion
   • The target was "from often 10 minutes to always less than 1 minute"
     – We're almost there
   • Most of it is available in 0.96; some parts were backported
   • Real-life testing of the improvements is in progress
   • Room for more improvements
36. Q & A
   Thanks!
