Apache HBase Low Latency

  1. HBase Low Latency. Nick Dimiduk, Hortonworks (@xefyr); Nicolas Liochon, Scaled Risk (@nkeywal). HBaseCon, May 5, 2014
  2. Agenda • Latency: what it is, how to measure it • Write path • Read path • Next steps
  3. What’s low latency? Latency is about percentiles • Average != 50th percentile • There are often orders of magnitude between « average » and « 95th percentile » • Post-99% = the « magical 1% ». Work in progress here. • Meaning anything from microseconds (high-frequency trading) to seconds (interactive queries) • In this talk: milliseconds
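A minimal Java sketch (not from the talk) of why the average hides the tail: a handful of hypothetical latency samples with one slow outlier is enough to push the mean far above the median.

    import java.util.Arrays;

    public class PercentileDemo {
        // Nearest-rank percentile over a sorted array of latencies in microseconds.
        static double percentileMs(long[] sortedMicros, double p) {
            int idx = (int) Math.ceil(p * sortedMicros.length) - 1;
            return sortedMicros[Math.max(0, idx)] / 1000.0;
        }

        public static void main(String[] args) {
            // Hypothetical samples: mostly ~1 ms, one 50 ms outlier.
            long[] micros = {900, 950, 980, 1000, 1020, 1050, 1100, 1200, 9000, 50000};
            Arrays.sort(micros);
            double meanMs = Arrays.stream(micros).average().getAsDouble() / 1000.0;
            System.out.printf("mean=%.2f ms, p50=%.2f ms, p95=%.2f ms, p99=%.2f ms%n",
                    meanMs, percentileMs(micros, 0.50), percentileMs(micros, 0.95),
                    percentileMs(micros, 0.99));
        }
    }

With these made-up numbers the mean is about 6.7 ms while the median is about 1 ms, which is the gap the slide is describing.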
  4. Measure latency bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation • More HBase-related options: autoflush, replicas, … • Latency measured in microseconds • Easier for internal analysis YCSB - Yahoo! Cloud Serving Benchmark • Useful for comparisons between databases • Set of workloads already defined
  5. Write path • Two parts • Single put (WAL) • The client just sends the put • Multiple puts from the client (new behavior since 0.96) • The client is much smarter • Four stages to look at for latency • Start (establish TCP connections, etc.) • Steady: when expected conditions are met • Machine failure: expected as well • Overloaded system
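As a concrete reference for the single-put case, here is a hedged sketch using the 0.96-era HTable client API that the rest of this deck references; the table name "usertable" and column family "f" are made up for illustration.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SinglePutExample {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "usertable");   // hypothetical table name
            try {
                Put put = new Put(Bytes.toBytes("row-0001"));
                put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("value"));
                // One RPC: the server appends to the WAL, syncs it, then writes the memstore.
                table.put(put);
            } finally {
                table.close();
            }
        }
    }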
  6. Single put: communication & scheduling • Client: TCP connection to the server • Shared: multiple threads on the same client use the same TCP connection • Pooling is possible and does improve performance in some circumstances • hbase.client.ipc.pool.size • Server: multiple calls from multiple threads on multiple machines • Can become thousands of simultaneous queries • Scheduling is required
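A small, hedged sketch of turning on the client connection pool mentioned above; the pool size is illustrative, not a recommendation, and the property could equally be set in the client's hbase-site.xml.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class IpcPoolSketch {
        public static void main(String[] args) {
            Configuration conf = HBaseConfiguration.create();
            // Default is a single shared connection per server; a pool spreads the load.
            conf.setInt("hbase.client.ipc.pool.size", 5);
            System.out.println(conf.getInt("hbase.client.ipc.pool.size", 1));
        }
    }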
  7. Single put: real work • The server must • Write into the WAL queue • Sync the WAL queue (HDFS flush) • Write into the memstore • The WAL queue is shared between all the regions/handlers • Sync is avoided if another handler already did the work • You may flush more than expected
  8. Simple put: a small run
     Percentile   Time in ms
     Mean         1.21
     50%          0.95
     95%          1.50
     99%          2.12
  9. Latency sources • Candidate one: network • 0.5 ms within a datacenter • Much less between nodes in the same rack
     Percentile   Time in ms
     Mean         0.13
     50%          0.12
     95%          0.15
     99%          0.47
  10. Latency sources • Candidate two: HDFS flush • We can still do better: HADOOP-7714 & sons.
     Percentile   Time in ms
     Mean         0.33
     50%          0.26
     95%          0.59
     99%          1.24
  11. Latency sources • Millisecond world: everything can go wrong • JVM • Network • OS scheduler • File system • All this goes into the post-99% percentile • Requires monitoring • Usually, using the latest version helps.
  12. Latency sources • Split (and presplits) • Autosharding is great! • Puts have to wait • Impacts: seconds • Balance • Regions move • Triggers a retry for the client • hbase.client.pause = 100ms since HBase 0.96 • Garbage collection • Impacts: 10's of ms, even with a good config • Covered with the read path of this talk
  13. From steady to loaded and overloaded • The number of concurrent tasks is a function of • Number of cores • Number of disks • Number of remote machines used • Difficult to estimate • Queues are doomed to happen • hbase.regionserver.handler.count • So, for low latency • Replaceable scheduler since HBase 0.98 (HBASE-8884). Requires specific code. • RPC priorities: work in progress (HBASE-11048)
  14. From loaded to overloaded • MemStore takes too much room: flush, then block quite quickly • hbase.regionserver.global.memstore.size.lower.limit • hbase.regionserver.global.memstore.size • hbase.hregion.memstore.block.multiplier • Too many HFiles: block until compactions keep up • hbase.hstore.blockingStoreFiles • Too many WAL files: flush and block • hbase.regionserver.maxlogs
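A hedged sketch covering the knobs from this slide and the previous one, written as Java Configuration calls for readability; in a real deployment these are server-side settings in hbase-site.xml, and every value below is illustrative rather than a tuning recommendation.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class ServerTuningSketch {
        public static void main(String[] args) {
            Configuration conf = HBaseConfiguration.create();
            // RPC handlers available to serve requests (slide 13).
            conf.setInt("hbase.regionserver.handler.count", 30);
            // Global memstore pressure (slide 14): flushing starts at the lower limit,
            // writes block once usage reaches this fraction of the heap.
            conf.setFloat("hbase.regionserver.global.memstore.size", 0.4f);
            // hbase.regionserver.global.memstore.size.lower.limit is left at its default here.
            // Per-region blocking: block when a memstore grows past multiplier * flush size,
            // or when a store accumulates too many HFiles.
            conf.setInt("hbase.hregion.memstore.block.multiplier", 4);
            conf.setInt("hbase.hstore.blockingStoreFiles", 10);
            // Force flushes before the WAL count blocks writers.
            conf.setInt("hbase.regionserver.maxlogs", 32);
            System.out.println("handlers: " + conf.get("hbase.regionserver.handler.count"));
        }
    }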
  15. Machine failure • Failure • Detect • Reallocate • Replay WAL • Replaying the WAL is NOT required for puts • hbase.master.distributed.log.replay • (default true in 1.0) • Failure = Detect + Reallocate + Retry • That's in the range of ~1s for simple failures • Silent failures put you in the 10s range if the hardware does not help • zookeeper.session.timeout
  16. Single puts • Millisecond range • Spikes do happen in steady mode • 100ms • Causes: GC, load, splits
  17. Streaming puts HTable#setAutoFlushTo(false) HTable#put HTable#flushCommits • Like single puts, but • Puts are grouped and sent in the background • Load is taken into account • Does not block
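A hedged sketch of the streaming-put pattern named on this slide, with the same era's HTable API; the table and column names are made up, and the loop is only there to show that puts return quickly while flushCommits() drains whatever is still buffered.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class StreamingPutsExample {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "usertable");   // hypothetical table name
            try {
                table.setAutoFlushTo(false);                // buffer puts instead of one RPC each
                for (int i = 0; i < 10000; i++) {
                    Put put = new Put(Bytes.toBytes(String.format("row-%06d", i)));
                    put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v" + i));
                    table.put(put);                         // returns quickly; sent in the background
                }
                table.flushCommits();                       // drain anything still buffered
            } finally {
                table.close();
            }
        }
    }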
  18. Multiple puts hbase.client.max.total.tasks (default 100) hbase.client.max.perserver.tasks (default 5) hbase.client.max.perregion.tasks (default 1) • Decouples the client from a latency spike of a region server • Increases throughput by 50% compared to the old multiput • Makes splits and GC more transparent
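A short sketch making the client-side limits above explicit on the Configuration; the values are the defaults quoted on the slide.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class ClientTaskLimits {
        public static void main(String[] args) {
            Configuration conf = HBaseConfiguration.create();
            conf.setInt("hbase.client.max.total.tasks", 100);    // in-flight tasks across the cluster
            conf.setInt("hbase.client.max.perserver.tasks", 5);  // per RegionServer
            conf.setInt("hbase.client.max.perregion.tasks", 1);  // per region
            // A slow region or server can then only hold back its own share of the buffer.
            System.out.println(conf.get("hbase.client.max.total.tasks"));
        }
    }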
  19. Conclusion on the write path • Single puts can be very fast • It's not a « hard real time » system: there are spikes • Most latency spikes can be hidden when streaming puts • Failures are NOT that difficult for the write path • No WAL to replay
  20. And now for the read path
  21. Read path • Gets/short scans are assumed for low-latency operations • Again, two APIs • Single get: HTable#get(Get) • Multi-get: HTable#get(List<Get>) • Four stages, same as the write path • Start (TCP connection, …) • Steady: when expected conditions are met • Machine failure: expected as well • Overloaded system: you may need to add machines or tune your workload
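A hedged sketch of the two read APIs named above, with made-up table, family, and row names; the multi-get variant is the one the next slides break down (grouping by RegionServer on the client, then per-server execution).

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class MultiGetExample {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "usertable");   // hypothetical table name
            try {
                // Single get: one RPC to the region's server.
                Result single = table.get(new Get(Bytes.toBytes("row-0001")));

                // Multi-get: the client groups the Gets by RegionServer.
                List<Get> gets = new ArrayList<Get>();
                for (int i = 0; i < 100; i++) {
                    gets.add(new Get(Bytes.toBytes(String.format("row-%06d", i))));
                }
                Result[] results = table.get(gets);
                System.out.println(single.isEmpty() + " / " + results.length);
            } finally {
                table.close();
            }
        }
    }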
  22. Multi get / Client • Group the Gets by RegionServer • Execute them one by one
  23. Multi get / Server
  24. Multi get / Server
  25. Access latency magnitudes Storage hierarchy: a different view (Dean, 2009) • Memory is 100,000x faster than disk! • Disk seek = 10ms
  26. Known unknowns • For each candidate HFile • Exclude by file metadata • Timestamp • Rowkey range • Exclude by bloom filter StoreFileScanner#shouldUseScanner()
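The bloom-filter exclusion above only applies if blooms are enabled on the column family; this is a hedged sketch of doing that at table-creation time with the era's admin API (my illustration, not something shown in the talk), with made-up table and family names.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.regionserver.BloomType;

    public class BloomFilterSchemaSketch {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            try {
                HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("usertable"));
                HColumnDescriptor family = new HColumnDescriptor("f");
                family.setBloomFilterType(BloomType.ROW);   // row-level blooms; ROWCOL also exists
                desc.addFamily(family);
                admin.createTable(desc);
            } finally {
                admin.close();
            }
        }
    }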
  27. Unknown knowns • Merge-sort results polled from the Stores • Seek each scanner to a reference KeyValue • Retrieve candidate data from disk • Multiple HFiles => multiple seeks • hbase.storescanner.parallel.seek.enable=true • Short-circuit reads • dfs.client.read.shortcircuit=true • Block locality • Happy clusters compact! HFileBlock#readBlockData()
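A hedged sketch of the two settings above as plain Configuration entries; in practice the short-circuit flag belongs in hdfs-site.xml on the DataNodes and RegionServers, and the domain socket path shown is an assumption that varies per install.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class ReadSeekSettingsSketch {
        public static void main(String[] args) {
            Configuration conf = HBaseConfiguration.create();
            // Seek the scanners of a store's HFiles in parallel instead of one by one.
            conf.setBoolean("hbase.storescanner.parallel.seek.enable", true);
            // Short-circuit reads: let the RegionServer read local blocks directly,
            // bypassing the DataNode's data transfer protocol.
            conf.setBoolean("dfs.client.read.shortcircuit", true);
            conf.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket"); // assumed path
            System.out.println(conf.getBoolean("dfs.client.read.shortcircuit", false));
        }
    }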
  28. BlockCache • Reuse previously read data • Maximize cache hit rate • Larger cache • Temporal access locality • Physical access locality BlockCache#getBlock()
  29. BlockCache Showdown • LruBlockCache • Default, on-heap • Quite good most of the time • Evictions impact GC • BucketCache • Off-heap alternative • Serialization overhead • Large memory configurations http://www.n10k.com/blog/blockcache-showdown/ L2 off-heap BucketCache makes a strong showing
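A hedged sketch of enabling an off-heap BucketCache behind the default on-heap LRU, using the configuration keys of this era; the size is illustrative, and the RegionServer JVM also needs enough direct memory (-XX:MaxDirectMemorySize) to back it.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class BucketCacheSketch {
        public static void main(String[] args) {
            Configuration conf = HBaseConfiguration.create();
            // Keep the on-heap LruBlockCache and add an off-heap L2 behind it.
            conf.set("hbase.bucketcache.ioengine", "offheap");
            conf.setInt("hbase.bucketcache.size", 4096);   // MB of off-heap cache (illustrative)
            System.out.println(conf.get("hbase.bucketcache.ioengine"));
        }
    }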
  30. Latency enemies: Garbage Collection • Use heap. Not too much. With CMS. • Max heap • 30GB (compressed pointers) • 8-16GB if you care about 9's • Healthy cluster load • Regular, reliable collections • 25-100ms pause on a regular interval • An overloaded RegionServer suffers GC overmuch
  31. Off-heap to the rescue? • BucketCache (0.96, HBASE-7404) • Network interfaces (HBASE-9535) • MemStore et al. (HBASE-10191)
  32. Latency enemies: Compactions • Fewer HFiles => fewer seeks • Evict data blocks! • Evict index blocks!! • hfile.block.index.cacheonwrite • Evict bloom blocks!!! • hfile.block.bloom.cacheonwrite • OS buffer cache to the rescue • Compacted data is still fresh • Better than going all the way back to disk
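A hedged alternative view of the cache-on-write properties above: roughly equivalent per-family schema flags can be set on the column descriptor (my illustration, not something shown in the talk), so freshly compacted index and bloom blocks land back in the BlockCache instead of forcing disk seeks.

    import org.apache.hadoop.hbase.HColumnDescriptor;

    public class CacheOnWriteSketch {
        public static void main(String[] args) {
            HColumnDescriptor family = new HColumnDescriptor("f");   // hypothetical family
            // Cache the index and bloom blocks written during flushes/compactions.
            family.setCacheIndexesOnWrite(true);   // pairs with hfile.block.index.cacheonwrite
            family.setCacheBloomsOnWrite(true);    // pairs with hfile.block.bloom.cacheonwrite
            System.out.println(family);
        }
    }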
  33. Failure •  Detect  +  Reassign  +  Replay   •  Strong  consistency  requires  replay   •  Locality  drops  to  0   •  Cache  starts  from  scratch  
  34. Hedging our bets • HDFS hedged reads (2.4, HDFS-5776) • Reads on secondary DataNodes • Strongly consistent • Works at the HDFS level • Timeline consistency (HBASE-10070) • Reads on « Replica Regions » • Not strongly consistent
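A hedged sketch combining both options: the HDFS hedged-read properties introduced with HDFS-5776 and a timeline-consistent Get as the API landed with HBASE-10070 region replicas; the values, table, and row names are illustrative.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Consistency;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HedgedReadSketch {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            // HDFS-level hedging: fire a second read to another DataNode if the first is slow.
            conf.setInt("dfs.client.hedged.read.threadpool.size", 20);
            conf.setLong("dfs.client.hedged.read.threshold.millis", 10);

            // HBase-level timeline consistency: accept a possibly stale answer from a replica region.
            HTable table = new HTable(conf, "usertable");   // hypothetical table name
            try {
                Get get = new Get(Bytes.toBytes("row-0001"));
                get.setConsistency(Consistency.TIMELINE);
                Result result = table.get(get);
                System.out.println("stale? " + result.isStale());
            } finally {
                table.close();
            }
        }
    }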
  35. Read latency in summary • Steady mode • Cache hit: < 1 ms • Cache miss: +10 ms per seek • Writing while reading => cache churn • GC: 25-100ms pause on a regular interval Network request + (1 - P(cache hit)) * (10 ms * seeks) • Same long-tail issues as the write path • Overloaded: same scheduling issues as the write path • Partial failures hurt a lot
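As a worked example of the formula above, reusing numbers quoted earlier in the deck (so order-of-magnitude only): with a ~0.5 ms network round trip, a 95% cache hit rate, and one seek per miss, the expected latency is about 0.5 ms + 0.05 × 10 ms ≈ 1 ms, while an individual cache-miss request still costs roughly 10.5 ms.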
  36. HBase ranges for 99% latency
                Put                     Streamed Multiput   Get                     Timeline get
     Steady     milliseconds            milliseconds        milliseconds            milliseconds
     Failure    seconds                 seconds             seconds                 milliseconds
     GC         10's of milliseconds    milliseconds        10's of milliseconds    milliseconds
  37. What’s next • Less GC • Use fewer objects • Off-heap • Compressed BlockCache (HBASE-8894) • Preferred location (HBASE-4755) • The « magical 1% » • Most tools stop at the 99% latency • What happens after is much more complex
  38. Thanks! Nick Dimiduk, Hortonworks (@xefyr); Nicolas Liochon, Scaled Risk (@nkeywal). HBaseCon, May 5, 2014