How to Study Performance
• Hook up OpenTSDB and gather all the metrics.
• Learn as much as you can about your read/write patterns or the benchmark tool* you’re using.
• While testing on a single node is easier/cheaper, testing on a small cluster will stress the machines differently.
* LoadTestTool might not fit your use case right now, but patches are welcome wink wink
Write Path
• Use big MemStores.
• Experimented with flushes as big as 6GB:
  – Leave 1 region per RS.
  – Set MEMSTORE_FLUSHSIZE to >100GB.
  – Rely only on hbase.regionserver.global.memstore.lowerLimit, but leave enough room to not hit the upperLimit.
  – As good as it gets.
• Less compacting, but be wary of…
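As a sketch, the site-level side of this setup might look like the following in hbase-site.xml; the values shown are the era's defaults and are illustrative, not prescribed by the slide (MEMSTORE_FLUSHSIZE itself is a per-table attribute set through the shell, not a site property):

```xml
<!-- hbase-site.xml: rely on the global MemStore limits to drive flushes.
     Keep enough distance between the two so writers don't get blocked
     at the upper limit. Values are illustrative. -->
<property>
  <name>hbase.regionserver.global.memstore.lowerLimit</name>
  <value>0.35</value>
</property>
<property>
  <name>hbase.regionserver.global.memstore.upperLimit</name>
  <value>0.40</value>
</property>
```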
HBASE-3484: Replace memstore’s ConcurrentSkipListMap with our own implementation
Write Path
• MemStore size vs. HLog size
  – Forced flushing because of too many logs can be pretty nasty.
  – The ideal situation is hbase.regionserver.hlog.blocksize * hbase.regionserver.maxlogs (2GB by default) just a bit above hbase.regionserver.global.memstore.lowerLimit (35% of the heap, so 358.4MB on a default 1GB heap).
  – Mixed cases are cursed: tables with a slow rate of updates need to flush small files because busier tables fill the HLogs.
• The new balancer in trunk can be tuned to take read and write requests into account.
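A quick back-of-envelope check of the relationship above, using the defaults quoted on the slide (the 358.4MB figure assumes a 1GB region server heap):

```python
# Values from the slide's defaults; adjust for your deployment.
HEAP_MB = 1024                # assumed 1 GB region server heap
hlog_blocksize_mb = 64        # hbase.regionserver.hlog.blocksize
maxlogs = 32                  # hbase.regionserver.maxlogs
lower_limit = 0.35            # hbase.regionserver.global.memstore.lowerLimit

# Total HLog capacity before "too many logs" forces flushes:
hlog_capacity_mb = hlog_blocksize_mb * maxlogs    # 2048 MB (2 GB)

# Global MemStore footprint at which flushing normally starts:
memstore_lower_mb = HEAP_MB * lower_limit         # 358.4 MB

# If HLog capacity sits below the lower limit, the logs fill up before
# the MemStores do and force premature flushes.
print(hlog_capacity_mb > memstore_lower_mb)       # True with these defaults
```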
Write Path
• Too many regions/families
  – The global MemStore size will always be reached before the flush size, resulting in suboptimal small files being flushed. The more regions/families, the worse.
• Writing to many families with different data sizes
  – Currently the flush size is based on the whole region, not per family.
  – HBASE-3149 is about fixing that.
Write Path
• HBASE-4608: HLog compression
• Relies on a custom dictionary-based compression scheme, but it doesn’t compress the values.
• Writing to the WAL directly determines the write speed; anything that lowers the IO helps us.
• Available in 0.94.0, disabled by default since it’s experimental and breaks replication (see HBASE-5778).
• The real benefits come when the keys-to-values ratio is high. Counters are a good example.
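If you want to experiment with it, enabling WAL compression in 0.94 is a single site-level flag; a hedged sketch (verify the property name against your version's hbase-default.xml before relying on it):

```xml
<!-- hbase-site.xml: turn on dictionary-based HLog compression (HBASE-4608).
     Experimental in 0.94.0 and breaks replication (HBASE-5778). -->
<property>
  <name>hbase.regionserver.wal.enablecompression</name>
  <value>true</value>
</property>
```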
Write Path
• Use big MemStores (but watch out).
• Make sure HLogs are not forcing flushes.
• Control the number of regions/families being written to in order to not hit the global MemStore size.
• I know this sounds crazy, but use HLog compression maybe?
Read Path
• Our “LRU” block cache
  – It’s not ARC (Adaptive Replacement Cache), and we can’t use that anyway because it’s patented (we’d need a license).
  – It’s not an LRU either: it’s a map, and an eviction thread manages the LRU behavior.
  – My research shows that it’s sometimes worse than a plain LRU, sometimes better, and 2Q can really beat our cache algorithm on some use cases.
Read Path
• Sizing the BlockCache
  – Make sure your working data set fits in cache.
  – Evictions happen at 85% and bring the BC down to 75%; this might be too aggressive for your use case. LruBlockCache needs to be recompiled for this change though. Try 95% and 90%.

    /** Eviction thresholds */
    static final float DEFAULT_MIN_FACTOR = 0.75f;
    static final float DEFAULT_ACCEPTABLE_FACTOR = 0.85f;
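A toy, single-threaded sketch of the threshold behavior described above: the cache fills past the acceptable factor (85%), then the least-recently-used blocks are evicted until usage drops to the min factor (75%). This is my own illustration, not the real LruBlockCache, which runs eviction in a background thread and tracks block priorities:

```python
from collections import OrderedDict

class ThresholdLruCache:
    """Toy model of threshold-based LRU eviction (not the real LruBlockCache)."""

    MIN_FACTOR = 0.75         # eviction drains usage back down to this
    ACCEPTABLE_FACTOR = 0.85  # eviction kicks in once usage crosses this

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.size = 0
        self.blocks = OrderedDict()   # key -> block size, in LRU order

    def get(self, key):
        if key in self.blocks:
            self.blocks.move_to_end(key)   # touch: mark most recently used
            return key
        return None

    def put(self, key, block_size):
        self.blocks[key] = block_size
        self.blocks.move_to_end(key)
        self.size += block_size
        if self.size > self.ACCEPTABLE_FACTOR * self.capacity:
            self._evict()

    def _evict(self):
        target = self.MIN_FACTOR * self.capacity
        while self.size > target:
            _, evicted = self.blocks.popitem(last=False)  # drop LRU block
            self.size -= evicted

cache = ThresholdLruCache(capacity_bytes=100)
for i in range(10):
    cache.put(i, 10)   # crossing 85 bytes triggers eviction down to <= 75
print(cache.size)      # 80: blocks 0 and 1 were evicted along the way
```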
Read Path
• Disabling the BlockCache
  – Highly random read patterns benefit from not using the BC; it just churns and generates garbage.
  – Recommended: disable the BC on the families, BLOCKCACHE => ‘false’
  – Not recommended: hfile.block.cache.size=0, because the meta blocks (leaf, bloom) need to be in cache or your performance will be disastrous.
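A sketch of the recommended per-family route from the hbase shell; the table and family names are illustrative, and on versions without online schema change the table has to be disabled first:

```ruby
# hbase shell: disable the block cache on one family only
disable 'mytable'
alter 'mytable', {NAME => 'f1', BLOCKCACHE => 'false'}
enable 'mytable'
```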
Read Path
• SlabCache, aka the off-heap cache
  – Available since 0.92.0, very experimental.
  – My experiments show that it currently shouldn’t be used.
  – Common wisdom tells us that managing the cache outside of the JVM should help with GC, but as far as I can tell the BC only plays a minor role there.
  – The current implementation also double-caches everything, which might not be what you need.
  – Another complaint I have is that it’s not flexible with regard to block sizes.
Read Path
• Bloom filters
  – If you use >0.92.0, you should turn them on.
  – This should probably be on by default from now on.
  – This was made possible by Facebook’s work on HFileV2.
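Turning them on is a per-family attribute from the hbase shell; a sketch with illustrative table/family names (ROWCOL is the other supported type, for row+column lookups):

```ruby
# hbase shell: enable a row-level bloom filter on a family
alter 'mytable', {NAME => 'f1', BLOOMFILTER => 'ROW'}
```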