Apache HBase Low Latency

A deeper look at the HBase read and write paths with a focus on request latency. We look at sources of latency and how to minimize them.

1. HBase Low Latency
   Nick Dimiduk, Hortonworks (@xefyr)
   Nicolas Liochon, Scaled Risk (@nkeywal)
   HBaseCon May 5, 2014
2. Agenda
   • Latency, what is it, how to measure it
   • Write path
   • Read path
   • Next steps
3. What's low latency
   Latency is about percentiles
   • Average != 50th percentile
   • There are often orders of magnitude between "average" and "95th percentile"
   • Post-99% = the "magical 1%". Work in progress here.
   • Can mean anything from microseconds (high-frequency trading) to seconds (interactive queries)
   • In this talk: milliseconds
4. Measure latency
   bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation
   • More options related to HBase: autoflush, replicas, …
   • Latency measured in microseconds
   • Easier for internal analysis
   YCSB - Yahoo! Cloud Serving Benchmark
   • Useful for comparisons between databases
   • Set of workloads already defined
5. Write path
   • Two parts
     • Single put (WAL)
       • The client just sends the put
     • Multiple puts from the client (new behavior since 0.96)
       • The client is much smarter
   • Four stages to look at for latency
     • Start (establish TCP connections, etc.)
     • Steady: when expected conditions are met
     • Machine failure: expected as well
     • Overloaded system
6. Single put: communication & scheduling
   • Client: TCP connection to the server
     • Shared: multiple threads on the same client use the same TCP connection
     • Pooling is possible and does improve performance in some circumstances
       • hbase.client.ipc.pool.size
   • Server: multiple calls from multiple threads on multiple machines
     • Can become thousands of simultaneous queries
     • Scheduling is required
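A minimal client-side sketch of the pooling knob mentioned above, assuming the 0.96-era connection API (HConnectionManager/HTableInterface); the pool size and table name are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HConnection;
    import org.apache.hadoop.hbase.client.HConnectionManager;
    import org.apache.hadoop.hbase.client.HTableInterface;

    Configuration conf = HBaseConfiguration.create();
    // By default all threads of a client share one TCP connection per RegionServer.
    // A small pool of connections can help in some circumstances (illustrative value).
    conf.setInt("hbase.client.ipc.pool.size", 5);

    HConnection connection = HConnectionManager.createConnection(conf);
    HTableInterface table = connection.getTable("usertable"); // hypothetical table name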
7. Single put: real work
   • The server must
     • Write into the WAL queue
     • Sync the WAL queue (HDFS flush)
     • Write into the memstore
   • The WAL queue is shared between all the regions/handlers
     • The sync is avoided if another handler already did the work
     • You may flush more than expected
8. Simple put: A small run
   Percentile    Time in ms
   Mean          1.21
   50%           0.95
   95%           1.50
   99%           2.12
9. Latency sources
   • Candidate one: network
     • 0.5ms within a datacenter
     • Much less between nodes in the same rack
   Percentile    Time in ms
   Mean          0.13
   50%           0.12
   95%           0.15
   99%           0.47
10. Latency sources
    • Candidate two: HDFS Flush
      • We can still do better: HADOOP-7714 & sons.
    Percentile    Time in ms
    Mean          0.33
    50%           0.26
    95%           0.59
    99%           1.24
11. Latency sources
    • Millisecond world: everything can go wrong
      • JVM
      • Network
      • OS scheduler
      • File system
    • All this goes into the post-99% percentile
    • Requires monitoring
    • Usually, using the latest version helps.
12. Latency sources
    • Split (and presplits)
      • Autosharding is great!
      • Puts have to wait
      • Impacts: seconds
    • Balance
      • Regions move
      • Triggers a retry for the client
      • hbase.client.pause = 100ms since HBase 0.96
    • Garbage collection
      • Impacts: 10's of ms, even with a good config
      • Covered in the read-path part of this talk
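Since puts stall while a region splits, pre-splitting at table creation avoids that pause on the first heavy load. A minimal sketch with the 0.96-era admin API; the table name, family, and split points are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("events"));
    desc.addFamily(new HColumnDescriptor("d"));

    // Pre-split into 5 regions so the initial load does not wait on autosharding.
    byte[][] splitKeys = {
        Bytes.toBytes("2"), Bytes.toBytes("4"), Bytes.toBytes("6"), Bytes.toBytes("8")
    };
    admin.createTable(desc, splitKeys);
    admin.close();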
13. From steady to loaded and overloaded
    • The number of concurrent tasks is a function of
      • Number of cores
      • Number of disks
      • Number of remote machines used
    • Difficult to estimate
    • Queues are doomed to happen
      • hbase.regionserver.handler.count
    • So, for low latency
      • Pluggable scheduler since HBase 0.98 (HBASE-8884). Requires specific code.
      • RPC priorities: work in progress (HBASE-11048)
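hbase.regionserver.handler.count is a server-side setting, normally placed in the RegionServer's hbase-site.xml. A sketch of sizing it relative to the hardware; the multiplier is an assumption used for illustration, not guidance from the talk:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    Configuration conf = HBaseConfiguration.create();
    // More handlers allow more concurrent RPCs before requests queue up,
    // at the cost of more contention on the WAL, MemStore, and disks.
    int cores = Runtime.getRuntime().availableProcessors();
    conf.setInt("hbase.regionserver.handler.count", cores * 3); // assumed heuristic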
14. From loaded to overloaded
    • MemStore takes too much room: flush, then block, quite quickly
      • hbase.regionserver.global.memstore.size.lower.limit
      • hbase.regionserver.global.memstore.size
      • hbase.hregion.memstore.block.multiplier
    • Too many HFiles: block until compactions keep up
      • hbase.hstore.blockingStoreFiles
    • Too many WAL files: flush and block
      • hbase.regionserver.maxlogs
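A back-of-envelope sketch of how the per-region blocking threshold combines the flush size with the block multiplier named above; the default values used here (128 MB flush size, multiplier of 4) are assumptions from the 0.96/0.98 era, so check your own hbase-site.xml:

    // Updates to a region are blocked when its MemStore exceeds
    // hbase.hregion.memstore.flush.size * hbase.hregion.memstore.block.multiplier.
    long flushSize = 128L * 1024 * 1024; // assumed default for hbase.hregion.memstore.flush.size
    int blockMultiplier = 4;             // assumed default for hbase.hregion.memstore.block.multiplier
    long perRegionBlockAt = flushSize * blockMultiplier;
    System.out.println("Puts on a region block around " + (perRegionBlockAt >> 20) + " MB of MemStore"); // 512 MB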
15. Machine failure
    • Failure
      • Detect
      • Reallocate
      • Replay WAL
    • Replaying the WAL is NOT required for puts
      • hbase.master.distributed.log.replay
      • (default true in 1.0)
    • Failure = Detect + Reallocate + Retry
      • That's in the range of ~1s for simple failures
      • Silent failures put you in the 10s range if the hardware does not help
        • zookeeper.session.timeout
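Detection time is dominated by the ZooKeeper session timeout named above. A minimal sketch of the trade-off, using a key that normally lives in hbase-site.xml; the 30s figure is an illustrative choice, not a recommendation from the talk:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    Configuration conf = HBaseConfiguration.create();
    // Lower timeout = faster detection of a dead RegionServer, but more risk of
    // declaring a long-GC-pausing (still alive) server dead.
    conf.setInt("zookeeper.session.timeout", 30000); // 30s, illustrative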
16. Single puts
    • Millisecond range
    • Spikes do happen in steady mode
      • 100ms
      • Causes: GC, load, splits
17. Streaming puts
    HTable#setAutoFlushTo(false)
    HTable#put
    HTable#flushCommits
    • Like simple puts, but
      • Puts are grouped and sent in the background
      • Load is taken into account
      • Does not block
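A minimal sketch of the streaming pattern above with the 0.96-era HTable API; table, family, and qualifier names are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "events");  // illustrative table name
    table.setAutoFlushTo(false);                // buffer puts on the client

    for (int i = 0; i < 100000; i++) {
      Put put = new Put(Bytes.toBytes("row-" + i));
      put.add(Bytes.toBytes("d"), Bytes.toBytes("q"), Bytes.toBytes(i));
      table.put(put);                           // grouped and sent in the background
    }

    table.flushCommits();                       // push anything still buffered
    table.close();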
18. Multiple puts
    hbase.client.max.total.tasks (default 100)
    hbase.client.max.perserver.tasks (default 5)
    hbase.client.max.perregion.tasks (default 1)
    • Decouples the client from a latency spike of a region server
    • Increases throughput by 50% compared to the old multiput
    • Makes splits and GC more transparent
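The three limits above are client-side settings, so they can be tuned on the client's Configuration. A minimal sketch, also showing the batched HTable#put(List<Put>) call; names and values are illustrative:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    Configuration conf = HBaseConfiguration.create();
    conf.setInt("hbase.client.max.total.tasks", 100);    // defaults shown on the slide
    conf.setInt("hbase.client.max.perserver.tasks", 5);
    conf.setInt("hbase.client.max.perregion.tasks", 1);

    HTable table = new HTable(conf, "events");           // illustrative table name
    List<Put> batch = new ArrayList<Put>();
    for (int i = 0; i < 1000; i++) {
      Put put = new Put(Bytes.toBytes("row-" + i));
      put.add(Bytes.toBytes("d"), Bytes.toBytes("q"), Bytes.toBytes(i));
      batch.add(put);
    }
    table.put(batch);  // one multiput; the limits above bound the parallelism
    table.close();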
19. Conclusion on write path
    • Single puts can be very fast
    • It's not a "hard real time" system: there are spikes
    • Most latency spikes can be hidden when streaming puts
    • Failures are NOT that difficult for the write path
      • No WAL to replay
20. And now for the read path
21. Read path
    • Get/short scan are assumed for low-latency operations
    • Again, two APIs
      • Single get: HTable#get(Get)
      • Multi-get: HTable#get(List<Get>)
    • Four stages, same as the write path
      • Start (TCP connection, …)
      • Steady: when expected conditions are met
      • Machine failure: expected as well
      • Overloaded system: you may need to add machines or tune your workload
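A minimal sketch of the two read APIs named above; table, family, and row keys are illustrative:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "events");

    // Single get: one round trip to one RegionServer.
    Result one = table.get(new Get(Bytes.toBytes("row-42")));

    // Multi-get: the client groups the Gets by RegionServer and batches them.
    List<Get> gets = new ArrayList<Get>();
    for (int i = 0; i < 100; i++) {
      gets.add(new Get(Bytes.toBytes("row-" + i)));
    }
    Result[] many = table.get(gets);
    table.close();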
22. Multi get / Client
    Group Gets by RegionServer
    Execute them one by one
23. Multi get / Server
24. Multi get / Server
25. Access latency magnitudes
    Storage hierarchy: a different view
    Dean/2009
    Memory is 100,000x faster than disk!
    Disk seek = 10ms
26. Known unknowns
    • For each candidate HFile
      • Exclude by file metadata
        • Timestamp
        • Rowkey range
      • Exclude by bloom filter
    StoreFileScanner#shouldUseScanner()
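File-metadata and bloom-filter exclusion only pay off if the schema and the query give them something to work with. A minimal sketch: a row-level bloom filter on the family, and a time range on the Get so HFiles outside that window can be skipped; names and values are illustrative:

    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.regionserver.BloomType;
    import org.apache.hadoop.hbase.util.Bytes;

    // Schema side: a ROW bloom filter lets StoreFileScanner#shouldUseScanner()
    // skip HFiles that cannot contain the requested row.
    HColumnDescriptor family = new HColumnDescriptor("d");
    family.setBloomFilterType(BloomType.ROW);

    // Query side: a time range lets HFiles be excluded on their timestamp metadata.
    Get get = new Get(Bytes.toBytes("row-42"));
    get.setTimeRange(1398902400000L, 1398988800000L); // illustrative window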
27. Unknown knowns
    • Merge sort results polled from Stores
      • Seek each scanner to a reference KeyValue
    • Retrieve candidate data from disk
      • Multiple HFiles => multiple seeks
      • hbase.storescanner.parallel.seek.enable=true
    • Short-circuit reads
      • dfs.client.read.shortcircuit=true
    • Block locality
      • Happy clusters compact!
    HFileBlock#readBlockData()
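A sketch of the two switches named on this slide, plus the extra HDFS key that short-circuit reads need; these belong in the RegionServer's hbase-site.xml / hdfs-site.xml, and the socket path is illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    Configuration conf = HBaseConfiguration.create();
    // Seek the scanners of the different HFiles in parallel.
    conf.setBoolean("hbase.storescanner.parallel.seek.enable", true);
    // Read local blocks directly from disk, bypassing the DataNode.
    conf.setBoolean("dfs.client.read.shortcircuit", true);
    conf.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket"); // illustrative path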
28. BlockCache
    • Reuse previously read data
    • Maximize cache hit rate
      • Larger cache
      • Temporal access locality
      • Physical access locality
    BlockCache#getBlock()
29. BlockCache Showdown
    • LruBlockCache
      • Default, on-heap
      • Quite good most of the time
      • Evictions impact GC
    • BucketCache
      • Off-heap alternative
      • Serialization overhead
      • Large memory configurations
    http://www.n10k.com/blog/blockcache-showdown/
    L2 off-heap BucketCache makes a strong showing
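A sketch of enabling the off-heap BucketCache as an L2 cache behind the on-heap LRU, as configured in the 0.96/0.98 era. These are RegionServer settings (hbase-site.xml, plus a -XX:MaxDirectMemorySize large enough for the bucket), and the sizes are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    Configuration conf = HBaseConfiguration.create();
    // L1 on-heap LRU cache share of the heap (illustrative).
    conf.setFloat("hfile.block.cache.size", 0.25f);
    // L2 BucketCache off the Java heap; size given in MB here (illustrative).
    conf.set("hbase.bucketcache.ioengine", "offheap");
    conf.setInt("hbase.bucketcache.size", 4096);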
30. Latency enemies: Garbage Collection
    • Use heap. Not too much. With CMS.
    • Max heap
      • 30GB (compressed pointers)
      • 8-16GB if you care about 9's
    • Healthy cluster load
      • Regular, reliable collections
      • 25-100ms pause on a regular interval
    • An overloaded RegionServer suffers GC overmuch
31. Off-heap to the rescue?
    • BucketCache (0.96, HBASE-7404)
    • Network interfaces (HBASE-9535)
    • MemStore et al. (HBASE-10191)
32. Latency enemies: Compactions
    • Fewer HFiles => fewer seeks
    • Evict data blocks!
    • Evict index blocks!!
      • hfile.block.index.cacheonwrite
    • Evict bloom blocks!!!
      • hfile.block.bloom.cacheonwrite
    • OS buffer cache to the rescue
      • Compacted data is still fresh
      • Better than going all the way back to disk
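A minimal sketch of the two cache-on-write switches above; they are RegionServer settings normally set in hbase-site.xml:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    Configuration conf = HBaseConfiguration.create();
    // Cache index and bloom blocks as freshly compacted HFiles are written,
    // so reads do not start from a cold cache after every compaction.
    conf.setBoolean("hfile.block.index.cacheonwrite", true);
    conf.setBoolean("hfile.block.bloom.cacheonwrite", true);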
33. Failure
    • Detect + Reassign + Replay
    • Strong consistency requires replay
    • Locality drops to 0
    • Cache starts from scratch
34. Hedging our bets
    • HDFS hedged reads (2.4, HDFS-5776)
      • Reads on secondary DataNodes
      • Strongly consistent
      • Works at the HDFS level
    • Timeline consistency (HBASE-10070)
      • Reads on "Replica Regions"
      • Not strongly consistent
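A sketch of turning on HDFS hedged reads for the RegionServer's DFS client (Hadoop 2.4+, HDFS-5776); the keys are the standard hedged-read settings and the values are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    Configuration conf = HBaseConfiguration.create();
    // A non-zero pool size enables hedged reads in the DFS client.
    conf.setInt("dfs.client.hedged.read.threadpool.size", 20);   // illustrative
    // If the first DataNode has not answered within this delay, try another replica.
    conf.setInt("dfs.client.hedged.read.threshold.millis", 10);  // illustrative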
35. Read latency in summary
    • Steady mode
      • Cache hit: < 1 ms
      • Cache miss: + 10 ms per seek
      • Writing while reading => cache churn
      • GC: 25-100ms pause on a regular interval
    Network request + (1 - P(cache hit)) * (10 ms * seeks)
    • Same long-tail issues as the write path
    • Overloaded: same scheduling issues as the write path
    • Partial failures hurt a lot
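A tiny worked example of the rule of thumb above, with the network cost, cache-hit rate, and seek count as assumed inputs:

    // Expected read latency ~ network + (1 - P(cache hit)) * (10 ms * seeks)
    double networkMs = 0.5;      // assumed in-datacenter round trip
    double cacheHitRate = 0.95;  // assumed BlockCache hit rate
    int seeks = 3;               // assumed HFiles touched on a miss
    double expectedMs = networkMs + (1 - cacheHitRate) * (10.0 * seeks);
    System.out.println("Expected get latency ~" + expectedMs + " ms"); // ~2.0 ms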
36. HBase ranges for 99% latency
                 Put                   Streamed Multiput     Get                   Timeline get
    Steady       milliseconds          milliseconds          milliseconds          milliseconds
    Failure      seconds               seconds               seconds               milliseconds
    GC           10's of milliseconds  milliseconds          10's of milliseconds  milliseconds
37. What's next
    • Less GC
      • Use fewer objects
      • Off-heap
    • Compressed BlockCache (HBASE-8894)
    • Preferred location (HBASE-4755)
    • The "magical 1%"
      • Most tools stop at the 99% latency
      • What happens after that is much more complex
38. Thanks!
    Nick Dimiduk, Hortonworks (@xefyr)
    Nicolas Liochon, Scaled Risk (@nkeywal)
    HBaseCon May 5, 2014
