1. Millions of Regions in HBase: Size Matters
PRESENTED BY
Francis Liu | toffer@apache.org
Virag Kothari | virag@apache.org

2. HBase @ Y! Grid
▪ Off-stage processing
  › Batch
  › Near Real Time
▪ Store lots of data
▪ Hosted Multi-tenant
▪ Performance
  › Throughput
  › Latency
  › Availability
  › Scale!

3. Experience at Scale
[Architecture diagram: a gateway/launcher hosting HBase and MR clients, a compute cluster (JobTracker, TaskTrackers, DataNodes, M/R tasks) and an HBase cluster (Namenode, HBase Master, Zookeeper quorum, RegionServers, DataNodes)]
▪ 6 multitenant HBase clusters
▪ ~100k regions
▪ 50 - 700 nodes

4. Need to Scale
▪ Scale to Petabytes for Near Real-time
▪ Multi-tenant clusters still growing
▪ Web Crawl Cache
  › ~2.3PB table
  › Batch processing workload
  › 80GB regions -> 20GB regions

5. Region
▪ Subset of a table’s key space (see the pre-split sketch below)
▪ Unit of work
▪ Load distribution
▪ Availability
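
A region is just a contiguous slice of the table's key space, so the number of regions a table starts with is under operator control via pre-splitting. A minimal sketch against the 0.94/0.98-era client API; the table name, column family and split keys are made up for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitTable {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("crawl_cache"));
        desc.addFamily(new HColumnDescriptor("d"));
        // Each split key becomes a region boundary: three keys -> four regions
        // covering ["", "g"), ["g", "n"), ["n", "t") and ["t", "").
        byte[][] splitKeys = { Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("t") };
        admin.createTable(desc, splitKeys);
        admin.close();
      }
    }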

6. Unit of Work
▪ MapReduce split per region (sketch below)
  › Parallelism
  › Compute/recovery time
  › Skew
▪ Filters & coprocessors
  › Region boundaries
  › Sparse filters -> scan timeouts
    • 30 mins to scan an 80GB region
▪ Custom applications
  › Storm grouping, etc.
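
TableInputFormat hands MapReduce one input split per region, which is why region count and size translate directly into parallelism, per-task compute/recovery time and skew. A minimal sketch using the stock helper; the table name and the trivial counting mapper are placeholders, not something from the talk:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class ScanJob {
      // One map task is scheduled per region; the mapper runs once per row.
      static class RowCounter extends TableMapper<NullWritable, NullWritable> {
        @Override
        protected void map(ImmutableBytesWritable key, Result value, Context ctx) {
          ctx.getCounter("scan", "rows").increment(1);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "scan-crawl-cache");
        job.setJarByClass(ScanJob.class);
        Scan scan = new Scan();
        scan.setCaching(500);        // rows fetched per RPC
        scan.setCacheBlocks(false);  // MR scans should not churn the block cache
        TableMapReduceUtil.initTableMapperJob(
            "crawl_cache", scan, RowCounter.class,
            NullWritable.class, NullWritable.class, job);
        job.setNumReduceTasks(0);
        job.setOutputFormatClass(NullOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

With 80GB regions a single such task can run for tens of minutes (the slide cites 30 minutes for one 80GB region), which is what makes smaller regions attractive for recovery time and skew.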

7. Load Distribution
▪ Load balancing granularity
▪ Only as fast as the slowest region server
▪ Tasks per region server (i.e. MapReduce)
  › Limit running tasks (MAPREDUCE-5583); sketch below
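
MAPREDUCE-5583 adds job-level caps on the number of concurrently running tasks, which keeps one large scan job from occupying every region server at once. A sketch assuming the mapreduce.job.running.map.limit / mapreduce.job.running.reduce.limit properties that JIRA introduced; the values are arbitrary:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ThrottledScanJob {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cap simultaneously running tasks for this job (0 means unlimited).
        conf.setInt("mapreduce.job.running.map.limit", 200);
        conf.setInt("mapreduce.job.running.reduce.limit", 20);
        Job job = Job.getInstance(conf, "throttled-scan");
        // ... configure the table scan and submit as usual ...
      }
    }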

8. Compaction
▪ Optimization for reads
▪ The fewer files to read, the better
▪ Compactions contend for I/O
▪ Cache misses
▪ Write amplification
▪ Too many store files
  › Blocked flushes (90 secs; see the sketch below)
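
The "90 secs" above is HBase's default blocking wait: once a store has more than hbase.hstore.blockingStoreFiles files, writes to that region block for up to hbase.hstore.blockingWaitTime (90,000 ms by default) while compaction catches up. A sketch of the relevant knobs; the values shown are illustrative, not recommendations:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class CompactionKnobs {
      public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // Flushes block once a store holds more than this many store files...
        conf.setInt("hbase.hstore.blockingStoreFiles", 15);
        // ...and stay blocked for at most this long (ms) waiting for compaction:
        // the 90 seconds quoted on the slide.
        conf.setLong("hbase.hstore.blockingWaitTime", 90000L);
        // Minor compactions kick in once a store reaches this many files.
        conf.setInt("hbase.hstore.compactionThreshold", 3);
        System.out.println(conf.get("hbase.hstore.blockingWaitTime"));
      }
    }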

9. Regions and Compaction
▪ Optimization for reads
▪ The fewer files to read, the better
▪ Compactions contend for I/O
▪ Cache misses
▪ Write amplification

10. HDFS
▪ Storefiles are broken up into blocks

11. What size then?
▪ As a general rule, keep regions small-ish (sketch below)
▪ HDFS block size? (not there yet)
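
"Small-ish" is enforced through the split threshold: a region splits once its stores grow past hbase.hregion.max.filesize (or whatever the table's split policy decides). A sketch of setting it per table; the 20GB figure mirrors the Web Crawl Cache change earlier in the deck, and the table name is a placeholder:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class RegionSizing {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Cluster-wide default split threshold in bytes; a per-table value wins.
        conf.setLong("hbase.hregion.max.filesize", 20L * 1024 * 1024 * 1024);

        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("crawl_cache"));
        desc.addFamily(new HColumnDescriptor("d"));
        desc.setMaxFileSize(20L * 1024 * 1024 * 1024);  // split at roughly 20GB
        admin.createTable(desc);
        admin.close();
      }
    }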

12. Scaling Region Count
▪ Master region management
  › Creation, assign, balance, etc.
  › Meta table
▪ Metadata
  › HDFS scalability
  › Zookeeper
  › Region server density

13. ZK Region Assignment
▪ Master orchestrates region assignment
▪ Region mapping tracked in master memory, the meta table and zookeeper znodes
[Diagram: RegionServers, Master, Zookeeper and Meta all holding state for Region 1 and Region 2]

14. Region transition example
1. Master tries to assign the region
2. RS transitions the region to open
3. Master updates its in-memory state
4. RS persists the region state to META
[Diagram: RS, Master, Zookeeper and Meta, with the four steps numbered across them]

15. Observations with 1M regions
▪ Complex
  › 3-way communication
  › Split-brain problem
▪ Zookeeper
  › More storage
  › Operations like listing a znode are not efficient
[Diagram: RS, Master, Zookeeper and Meta tracking Region 1 and Region 2]

16. Enhancements - Assignment
▪ Assignment
  › ZK-less assignment (HBASE-11059)
  › No involvement of ZK; region assignment is controlled by the Master
  › Better APIs, e.g. scanning meta vs. ls on a znode (sketch below)
▪ Unlock region states (HBASE-11290)
  › Reduce CPU utilization
[Diagram: Master and RegionServers coordinating through the meta region only]
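
With ZK-less assignment the authoritative region-to-server mapping lives in hbase:meta, so "scanning meta" replaces listing znodes as the way to inspect assignment state (the toggle in the 1.x line is, if I recall correctly, the hbase.assignment.usezk flag). A minimal sketch that scans the info family of hbase:meta with the 0.98-era client API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ScanMeta {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable meta = new HTable(conf, TableName.valueOf("hbase:meta"));
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("info"));
        ResultScanner scanner = meta.getScanner(scan);
        for (Result row : scanner) {
          // The row key is table,start-key,timestamp.encoded-region-name;
          // info:server holds the host:port currently serving that region.
          byte[] server = row.getValue(Bytes.toBytes("info"), Bytes.toBytes("server"));
          System.out.println(Bytes.toString(row.getRow()) + " -> "
              + (server == null ? "unassigned" : Bytes.toString(server)));
        }
        scanner.close();
        meta.close();
      }
    }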

17. Performance Comparison
Assignment time for 1M regions:
  ZK (force-sync=yes): 1 hr 16 mins
  ZK (force-sync=no):  11 mins
  ZK-less:             11 mins

18. Single HOT meta
▪ Assignment info is persisted to meta
▪ 7GB in size for 1M regions
▪ Meta cannot split
▪ Large compactions
▪ Longer failover times
[Diagram: a single RS hosting Meta for the whole cluster, with Master and other RSs serving Region 1 and Region 2]

19. Enhancements – Split Meta
▪ Split meta (HBASE-11288)
  › Distributed IO load
  › Distributed caching
  › Shorter scan time
  › Distributed compaction
[Diagram: Master with meta regions spread across RegionServers alongside user regions]

20. Performance comparison
Split size: 200 MB
Meta split across 10 servers; each server has 5 meta regions
Assignment time for 3M regions:
  Single Meta: 18 mins
  Split Meta:  10 mins

21. Scaling namenode operations
▪ Longer time to create all region dirs under a single table dir (sketch below)
▪ Namenode limitation to hold a maximum of 6.3 million files
[Layout: TableDir containing RegionDir1, RegionDir2, ... RegionDirN]
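
Each region owns a directory under its table's directory in HDFS, so region count shows up directly as namenode objects and as create/list load on one parent directory. A small sketch that counts region directories with the plain HDFS client; it assumes the default /hbase/data/<namespace>/<table> layout and a made-up table name:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CountRegionDirs {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Default layout: one directory per region under the table directory.
        Path tableDir = new Path("/hbase/data/default/crawl_cache");
        int regions = 0;
        for (FileStatus status : fs.listStatus(tableDir)) {
          // Region dirs are named by encoded region name; skip bookkeeping
          // entries such as .tabledesc and .tmp.
          if (status.isDirectory() && !status.getPath().getName().startsWith(".")) {
            regions++;
          }
        }
        System.out.println(regions + " region directories under " + tableDir);
      }
    }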

22. Namenode create file ops during region init for a 5M-region normal table
[Chart omitted]

23. Enhancements - Hierarchical region dir
● Approach - buckets within the table directory ("humongous table")
● E.g. 3 letters of the bucket name gives 4k buckets (hypothetical sketch below)
[Layout: TableDir containing Bucket1 ... BucketM, each holding RegionDir1 ... RegionDirK]
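
The bucketing above is the Yahoo "humongous table" layout rather than stock HBase, so the exact mapping is an assumption on my part: the idea is to derive a short, fixed-width bucket name from the region's encoded name so that no single directory holds millions of children. A hypothetical sketch that takes the first three hex characters of the encoded region name (16^3 = 4,096 buckets):

    import org.apache.hadoop.fs.Path;

    public class HumongousLayout {
      // Hypothetical mapping: bucket = first 3 hex chars of the encoded region
      // name, giving 4k possible buckets under the table directory.
      static Path regionDir(Path tableDir, String encodedRegionName) {
        String bucket = encodedRegionName.substring(0, 3);
        return new Path(new Path(tableDir, bucket), encodedRegionName);
      }

      public static void main(String[] args) {
        Path tableDir = new Path("/hbase/data/default/crawl_cache");
        // Encoded region names are MD5 hex strings, so prefixes spread evenly.
        String encoded = "9f012a3c45d6e7890b1c2d3e4f5a6b7c";  // made-up value
        System.out.println(regionDir(tableDir, encoded));
        // -> /hbase/data/default/crawl_cache/9f0/9f012a3c45d6e7890b1c2d3e4f5a6b7c
      }
    }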

24. Namenode create file ops during region init for a 5M-region humongous table
[Chart omitted]

25. Performance results
Region dir creation time (4k buckets):
                    1M regions        5M regions         10M regions
  normal table      20 mins           4 hours 23 mins    Doesn't finish
  humongous table   15 mins 48 secs   1 hour 27 mins     2 hrs 53 mins

26. HBaseCon 2014
Thank You!
(We’re Hiring)
