Apache HBase Low Latency

A deeper look at the HBase read and write paths with a focus on request latency. We look at sources of latency and how to minimize them.

Transcript of "Apache HBase Low Latency"

1. HBase Low Latency
   Nick Dimiduk, Hortonworks (@xefyr)
   Nicolas Liochon, Scaled Risk (@nkeywal)
   HBaseCon, May 5, 2014
2. Agenda
   • Latency: what it is, how to measure it
   • Write path
   • Read path
   • Next steps
3. What's low latency
   Latency is about percentiles
   • Average != 50th percentile
   • There are often orders of magnitude between the « average » and the « 95th percentile »
   • Post-99% = the « magical 1% ». Work in progress here.
   • Meaning anything from microseconds (high-frequency trading) to seconds (interactive queries)
   • In this talk: milliseconds
4. Measure latency
   bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation
   • More options related to HBase: autoflush, replicas, …
   • Latency measured in microseconds
   • Easier for internal analysis
   YCSB - Yahoo! Cloud Serving Benchmark
   • Useful for comparison between databases
   • Set of workloads already defined
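For illustration only: a minimal client-side latency probe in the spirit of the tools above, assuming an existing table "usertable" with a family "f" (both hypothetical) and a 0.96-era client API. PerformanceEvaluation and YCSB do this far more thoroughly; this sketch just shows where such numbers come from.

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PutLatencyProbe {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "usertable");      // hypothetical test table
        int n = 10000;
        long[] micros = new long[n];
        long total = 0;
        for (int i = 0; i < n; i++) {
          Put put = new Put(Bytes.toBytes("row-" + i));
          put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("value-" + i));
          long start = System.nanoTime();
          table.put(put);                                  // one RPC per put: autoflush is on by default
          micros[i] = (System.nanoTime() - start) / 1000;
          total += micros[i];
        }
        Arrays.sort(micros);
        System.out.printf("mean=%dus p50=%dus p95=%dus p99=%dus%n",
            total / n, micros[n / 2], micros[(int) (n * 0.95)], micros[(int) (n * 0.99)]);
        table.close();
      }
    }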
5. Write path
   • Two parts
     • Single put (WAL): the client just sends the put
     • Multiple puts from the client (new behavior since 0.96): the client is much smarter
   • Four stages to look at for latency
     • Start (establish TCP connections, etc.)
     • Steady: when expected conditions are met
     • Machine failure: expected as well
     • Overloaded system
6. Single put: communication & scheduling
   • Client: TCP connection to the server
   • Shared: multiple threads on the same client use the same TCP connection
     • Pooling is possible and does improve performance in some circumstances
     • hbase.client.ipc.pool.size
   • Server: multiple calls from multiple threads on multiple machines
     • Can become thousands of simultaneous queries
     • Scheduling is required
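A hedged sketch of the pooling option named above: the property is set on the client-side Configuration before the HTable is created (the value 5 and the table name are illustrative, not recommendations).

    Configuration conf = HBaseConfiguration.create();
    conf.setInt("hbase.client.ipc.pool.size", 5);   // spread this client's calls over several connections per server
    HTable table = new HTable(conf, "usertable");   // hypothetical table; HTables built from this conf share the pool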
7. Single put: real work
   • The server must
     • Write into the WAL queue
     • Sync the WAL queue (HDFS flush)
     • Write into the MemStore
   • The WAL queue is shared between all the regions/handlers
     • Sync is avoided if another handler did the work
     • You may flush more than expected
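The WAL sync is often the costliest step in that list, and a put can opt out of waiting for it when the workload tolerates losing the most recent edits on a crash. A hedged sketch, reusing the hypothetical table and schema from the earlier probe:

    Put put = new Put(Bytes.toBytes("row-0001"));
    put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v"));
    put.setDurability(Durability.ASYNC_WAL);  // org.apache.hadoop.hbase.client.Durability:
                                              // append to the WAL queue but do not wait for the HDFS sync
    table.put(put);                           // default SYNC_WAL blocks on the shared WAL sync described above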
8. Simple put: a small run

   Percentile   Time in ms
   Mean         1.21
   50%          0.95
   95%          1.50
   99%          2.12
9. Latency sources
   • Candidate one: network
     • 0.5 ms within a datacenter
     • Much less between nodes in the same rack

   Percentile   Time in ms
   Mean         0.13
   50%          0.12
   95%          0.15
   99%          0.47
10. Latency sources
    • Candidate two: HDFS flush
    • We can still do better: HADOOP-7714 & sons.

    Percentile   Time in ms
    Mean         0.33
    50%          0.26
    95%          0.59
    99%          1.24
11. Latency sources
    • Millisecond world: everything can go wrong
      • JVM
      • Network
      • OS scheduler
      • File system
    • All this goes into the post-99% percentile
    • Requires monitoring
    • Usually, using the latest version helps.
12. Latency sources
    • Splits (and presplits)
      • Autosharding is great!
      • Puts have to wait
      • Impact: seconds
    • Balance
      • Regions move
      • Triggers a retry for the client
      • hbase.client.pause = 100ms since HBase 0.96
    • Garbage collection
      • Impact: 10's of ms, even with a good config
      • Covered with the read path of this talk
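The retry pacing mentioned above is client-side configuration; a hedged sketch (100 ms is the 0.96 default quoted on the slide, the retry count is illustrative):

    Configuration conf = HBaseConfiguration.create();
    conf.setInt("hbase.client.pause", 100);          // ms to wait before retrying after a region moved or split
    conf.setInt("hbase.client.retries.number", 35);  // illustrative; bounds how long a client keeps retrying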
13. From steady to loaded and overloaded
    • Number of concurrent tasks is a factor of
      • Number of cores
      • Number of disks
      • Number of remote machines used
      • Difficult to estimate
    • Queues are doomed to happen
      • hbase.regionserver.handler.count
    • So, for low latency
      • Replaceable scheduler since HBase 0.98 (HBASE-8884). Requires specific code.
      • RPC priorities: work in progress (HBASE-11048)
14. From loaded to overloaded
    • MemStore takes too much room: flush, then block quite quickly
      • hbase.regionserver.global.memstore.size.lower.limit
      • hbase.regionserver.global.memstore.size
      • hbase.hregion.memstore.block.multiplier
    • Too many HFiles: block until compactions keep up
      • hbase.hstore.blockingStoreFiles
    • Too many WAL files: flush and block
      • hbase.regionserver.maxlogs
15. Machine failure
    • Failure
      • Detect
      • Reallocate
      • Replay WAL
    • Replaying the WAL is NOT required for puts
      • hbase.master.distributed.log.replay
      • (default true in 1.0)
    • Failure = Detect + Reallocate + Retry
      • That's in the range of ~1s for simple failures
      • Silent failures put you in the 10s range if the hardware does not help
      • zookeeper.session.timeout
16. Single puts
    • Millisecond range
    • Spikes do happen in steady mode
      • 100ms
      • Causes: GC, load, splits
17. Streaming puts

    HTable#setAutoFlushTo(false)
    HTable#put
    HTable#flushCommits

    • As simple puts, but
      • Puts are grouped and sent in the background
      • Load is taken into account
      • Does not block
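A hedged sketch of that streaming pattern, again assuming the hypothetical "usertable"/"f" schema; puts return immediately and the buffered edits are pushed in the background or on the explicit flush:

    HTable table = new HTable(conf, "usertable");
    table.setAutoFlushTo(false);              // buffer puts client-side instead of one RPC per put
    for (int i = 0; i < 100000; i++) {
      Put put = new Put(Bytes.toBytes("row-" + i));
      put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v-" + i));
      table.put(put);                         // returns quickly; does not block on the network
    }
    table.flushCommits();                     // push whatever is still buffered
    table.close();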
18. Multiple puts

    hbase.client.max.total.tasks (default 100)
    hbase.client.max.perserver.tasks (default 5)
    hbase.client.max.perregion.tasks (default 1)

    • Decouples the client from a latency spike of a RegionServer
    • Increases throughput by 50% compared to the old multiput
    • Makes splits and GC more transparent
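The multi-put API itself is just HTable#put(List<Put>); the client groups the batch by region and server and applies the task limits above. A small sketch with the same assumed schema (java.util.ArrayList/List as in the earlier probe):

    List<Put> batch = new ArrayList<Put>();
    for (int i = 0; i < 1000; i++) {
      Put put = new Put(Bytes.toBytes("row-" + i));
      put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v-" + i));
      batch.add(put);
    }
    table.put(batch);   // one call; sent per-server in parallel, bounded by the max.*.tasks settings above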
19. Conclusion on write path
    • Single puts can be very fast
      • It's not a « hard real time » system: there are spikes
    • Most latency spikes can be hidden when streaming puts
    • Failures are NOT that difficult for the write path
      • No WAL to replay
20. And now for the read path
21. Read path
    • Gets/short scans are assumed for low-latency operations
    • Again, two APIs
      • Single get: HTable#get(Get)
      • Multi-get: HTable#get(List<Get>)
    • Four stages, same as the write path
      • Start (TCP connection, …)
      • Steady: when expected conditions are met
      • Machine failure: expected as well
      • Overloaded system: you may need to add machines or tune your workload
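The two APIs in code form, using the hypothetical schema from the earlier sketches:

    // Single get
    Get get = new Get(Bytes.toBytes("row-0001"));
    Result result = table.get(get);

    // Multi-get: one call, grouped by RegionServer on the client
    List<Get> gets = new ArrayList<Get>();
    for (int i = 0; i < 100; i++) {
      gets.add(new Get(Bytes.toBytes("row-" + i)));
    }
    Result[] results = table.get(gets);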
22. Multi get / Client
    Group Gets by RegionServer
    Execute them one by one
23. Multi get / Server
24. Multi get / Server
25. Access latency magnitudes
    Storage hierarchy: a different view
    Dean/2009
    Memory is 100,000x faster than disk!
    Disk seek = 10ms
26. Known unknowns
    • For each candidate HFile
      • Exclude by file metadata
        • Timestamp
        • Rowkey range
      • Exclude by bloom filter

    StoreFileScanner#shouldUseScanner()
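These exclusions work best when the request is narrow and the schema opts in to bloom filters. A hedged sketch of both sides (family name and time window are illustrative):

    // Narrow the Get so timestamp and bloom checks can skip HFiles
    Get get = new Get(Bytes.toBytes("row-0001"));
    get.addColumn(Bytes.toBytes("f"), Bytes.toBytes("q"));
    get.setTimeRange(System.currentTimeMillis() - 3600 * 1000L, System.currentTimeMillis());

    // At schema time: row-level bloom filter on the family
    HColumnDescriptor family = new HColumnDescriptor("f");
    family.setBloomFilterType(BloomType.ROW);   // org.apache.hadoop.hbase.regionserver.BloomType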
27. Unknown knowns
    • Merge sort results polled from Stores
      • Seek each scanner to a reference KeyValue
      • Retrieve candidate data from disk
    • Multiple HFiles => multiple seeks
      • hbase.storescanner.parallel.seek.enable=true
    • Short circuit reads
      • dfs.client.read.shortcircuit=true
    • Block locality
      • Happy clusters compact!

    HFileBlock#readBlockData()
28. BlockCache
    • Reuse previously read data
    • Maximize cache hit rate
      • Larger cache
      • Temporal access locality
      • Physical access locality

    BlockCache#getBlock()
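Hit rate is also protected by keeping bulk reads out of the cache; a small sketch of the per-request switch:

    // Point reads populate and reuse the BlockCache (the default)
    Get get = new Get(Bytes.toBytes("row-0001"));
    get.setCacheBlocks(true);

    // A large scan should not evict the hot working set
    Scan scan = new Scan();
    scan.setCacheBlocks(false);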
29. BlockCache showdown
    • LruBlockCache
      • Default, on-heap
      • Quite good most of the time
      • Evictions impact GC
    • BucketCache
      • Off-heap alternative
      • Serialization overhead
      • Large memory configurations

    http://www.n10k.com/blog/blockcache-showdown/
    L2 off-heap BucketCache makes a strong showing
30. Latency enemies: Garbage collection
    • Use heap. Not too much. With CMS.
    • Max heap
      • 30GB (compressed pointers)
      • 8-16GB if you care about 9's
    • Healthy cluster load
      • Regular, reliable collections
      • 25-100ms pause on a regular interval
    • An overloaded RegionServer suffers GC overmuch
31. Off-heap to the rescue?
    • BucketCache (0.96, HBASE-7404)
    • Network interfaces (HBASE-9535)
    • MemStore et al. (HBASE-10191)
32. Latency enemies: Compactions
    • Fewer HFiles => fewer seeks
    • Evict data blocks!
    • Evict index blocks!!
      • hfile.block.index.cacheonwrite
    • Evict bloom blocks!!!
      • hfile.block.bloom.cacheonwrite
    • OS buffer cache to the rescue
      • Compacted data is still fresh
      • Better than going all the way back to disk
33. Failure
    • Detect + Reassign + Replay
    • Strong consistency requires replay
      • Locality drops to 0
      • Cache starts from scratch
34. Hedging our bets
    • HDFS hedged reads (2.4, HDFS-5776)
      • Reads on secondary DataNodes
      • Strongly consistent
      • Works at the HDFS level
    • Timeline consistency (HBASE-10070)
      • Reads on « Replica Regions »
      • Not strongly consistent
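A hedged sketch of a timeline-consistent read, assuming a release that ships HBASE-10070 (1.0+) and a table configured with region replicas:

    Get get = new Get(Bytes.toBytes("row-0001"));
    get.setConsistency(Consistency.TIMELINE);   // allow the read to be served by a replica region
    Result result = table.get(get);
    if (result.isStale()) {
      // answered by a secondary replica; it may lag the primary
    }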
35. Read latency in summary
    • Steady mode
      • Cache hit: < 1 ms
      • Cache miss: + 10 ms per seek
      • Writing while reading => cache churn
      • GC: 25-100ms pause on a regular interval

    Network request + (1 - P(cache hit)) * (10 ms * seeks)

    • Same long tail issues as writes
    • Overloaded: same scheduling issues as writes
    • Partial failures hurt a lot
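As an illustrative reading of that formula (numbers assumed, not measured): with a 0.5 ms network round trip, a 90% cache hit rate and 3 HFiles to seek on a miss, the expected latency is roughly 0.5 ms + 0.1 × (10 ms × 3) ≈ 3.5 ms, while the unlucky misses pay the full ~30 ms — which is exactly the long tail described here.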
36. HBase ranges for 99% latency

             Put                    Streamed Multiput      Get                    Timeline get
    Steady   milliseconds           milliseconds           milliseconds           milliseconds
    Failure  seconds                seconds                seconds                milliseconds
    GC       10's of milliseconds   milliseconds           10's of milliseconds   milliseconds
37. What's next
    • Less GC
      • Use fewer objects
      • Off-heap
    • Compressed BlockCache (HBASE-8894)
    • Preferred location (HBASE-4755)
    • The « magical 1% »
      • Most tools stop at the 99% latency
      • What happens after is much more complex
38. Thanks!
    Nick Dimiduk, Hortonworks (@xefyr)
    Nicolas Liochon, Scaled Risk (@nkeywal)
    HBaseCon, May 5, 2014