HBaseCon 2015: HBase 2.0 and Beyond Panel

  1. 1 hbasecon.com HBase 2.0 and Beyond Panel Moderator: Jonathan Hsieh Panel: Matteo Bertozzi / Sean Busbey / Jingcheng Du / Lars Hofhansl / Enis Soztutar / Jimmy Xiang
  2. 2 hbasecon.com Who are we? • Matteo Bertozzi – HBase PMC, Cloudera • Sean Busbey – HBase PMC, Cloudera • Jingcheng Du – Intel • Lars Hofhansl – HBase PMC, 0.94.x RM, Salesforce.com • Jonathan Hsieh – HBase PMC • Enis Soztutar – HBase PMC, 1.0.0 RM, Hortonworks • Jimmy Xiang – HBase PMC, Cloudera
  3. 3 hbasecon.com Outline • Storing Larger Objects efficiently • Making DDL Operations fault tolerant • Better Region Assignment • Compatibility guarantees for our users • Improving Availability • Using all machine resources • Q+A
  4. 4 hbasecon.com Outline • Storing Larger Objects efficiently • Making DDL Operations fault tolerant • Better Region Assignment • Compatibility guarantees for our users • Improving Availability • Using all machine resources • Q+A
  5. 5 hbasecon.com Why Moderate Object Storage (MOB)? • A growing demand for the ability to store moderate-sized objects (MOBs) in HBase (100KB up to 10MB). • Write amplification created by compactions: write performance degrades as massive numbers of MOBs accumulate in HBase. • Too many store files -> frequent region compactions -> massive I/O -> slow compactions -> flush delay -> high memory usage -> blocking updates. [Charts: data insertion average latency of roughly 8.1s at 125G, 10.2s at 500G and 10.7s at 1T, plus 1T insertion latency over 8 hours; 5MB/record, 32 pre-split regions]
  6. 6 hbasecon.com How MOB I/O works [Diagram: on the write path, the client writes the MOB cell to the HRegionServer's HLog and memstore; on flush, the MOB cell is written to a separate MOB HFile and a reference cell is written to the regular HFile. On the read path, the client reads the reference cell from the memstore/HFile and follows it to the MOB cell in the MOB HFile.]
  7. 7 hbasecon.com Benefits • Moves the MOBs out of the main I/O path to make write amplification more predictable. • The same APIs are used to read and write MOBs. • Works with the HBase export/copy table, bulk load, replication and snapshot features. • Works with the HBase security mechanism. [Charts: with MOB enabled, data insertion average latency stays near 7.0s at 125G/500G/1T versus 8.1-10.7s with MOB disabled; in an R/W mixed workload (300G pre-load, 200G insertion) insertion latency drops from 10.6s to 6.2s and random gets from 58.0s to 33.9s; 5MB/record, 32 pre-split regions]
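
For context, here is a minimal sketch of how a MOB-enabled column family can be declared through the HBase Java client API (not from the slides; the table name "docs", family name "content", and the 100KB threshold are illustrative assumptions, and the descriptor methods assume an HBase release that ships the MOB feature):

    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;

    public class MobTableSketch {
        public static void main(String[] args) {
            // Hypothetical table with one MOB-enabled column family. Cells larger than
            // the threshold are flushed to separate MOB HFiles, keeping them out of the
            // normal store files and the main compaction path, as described on slide 6.
            HTableDescriptor table = new HTableDescriptor(TableName.valueOf("docs"));
            HColumnDescriptor family = new HColumnDescriptor("content");
            family.setMobEnabled(true);           // assumes a release with MOB support
            family.setMobThreshold(100 * 1024L);  // 100KB, the lower bound mentioned on slide 5
            table.addFamily(family);
            // A real program would now call admin.createTable(table) on a connected Admin.
        }
    }

From the client's point of view nothing else changes: the same Put/Get calls are used for MOB and non-MOB cells, which is the "same APIs" benefit listed above.
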
  8. 8 hbasecon.com Outline • Storing Larger Objects efficiently • Making DDL Operations fault tolerant • Better Region Assignment • Compatibility guarantees for our users • Improving Availability • Using all machine resources • Q+A
  9. 9 hbasecon.com Problem – Multi-Step Ops & Failures DDL and other operations consist of multiple steps, e.g. the Create Table handler: create regions on the FileSystem -> add regions to META -> assign -> cpHost.postCreateTableHandler() (ACLs). If we crash in between steps, we end up with a half state, e.g. FileSystem present but META not present; hbck MAY be able to repair it. If we crash in the middle of a single step (e.g. creating N regions on the fs), hbck does not have enough information to rebuild a correct state, and manual intervention is required to repair it.
  10. 10 hbasecon.com Solution – Multi-Step Ops & Failures Rewrite each operation to use a state machine, e.g. the Create Table handler: create regions on the FileSystem -> add regions to META -> assign -> cpHost.postCreateTableHandler() (ACLs). Each executed step is written to a store; if the machine goes down, we know what was pending and what should be rolled back, or how to continue and complete the operation.
  11. 11 hbasecon.com Procedure-v2/Notification-Bus • The Procedure v2/Notification Bus aims to provide a unified way to build: • Synchronous calls, with the ability to see the state/result in case of failure • Multi-step procedures with rollback/rollforward ability in case of failure (e.g. create/delete table) • Notifications across multiple machines (e.g. ACLs/Labels/Quota cache updates) • Coordination of long-running/heavy procedures (e.g. compactions, splits, …) • Procedures across multiple machines (e.g. Snapshots, Assignment) • Replication for Master operations (e.g. grant/revoke)
  12. 12 hbasecon.com Procedure-v2/Notification-Bus - Roadmap • Apache HBase 1.1 • Fault-tolerant Master operations (e.g. create/delete/…) • Sync client (we are still wire compatible, both ways) • Apache HBase 1.2 • Master WebUI • Notification bus, and at least Snapshot using it • Apache HBase 1.3+ or 2.0 (depending on how hard it is to keep Master/RS compatibility) • Replace cache updates, Assignment Manager, Distributed Log Replay, … • New features: coordinated compactions, Master ops replication (e.g. grant/revoke)
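
For context, here is a deliberately simplified sketch of the state-machine idea behind those slides (this is not the real org.apache.hadoop.hbase.procedure2 API; the state names, the persist() hook, and the rollback ordering are illustrative assumptions): every completed step is recorded durably, so after a crash the operation can either continue from the last persisted state or roll back what it already did.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Illustrative only: a toy "create table" operation modeled as a state machine,
    // mimicking the rollback/rollforward behavior described on slides 10-11.
    public class CreateTableSketch {
        enum State { CREATE_FS_LAYOUT, ADD_TO_META, ASSIGN_REGIONS, POST_CREATE_HOOKS, DONE }

        private State state = State.CREATE_FS_LAYOUT;
        private final Deque<State> executed = new ArrayDeque<>();

        /** Run the steps forward, persisting each transition before moving on. */
        public void execute() {
            while (state != State.DONE) {
                runStep(state);          // do the work for this step
                executed.push(state);    // remember it so it can be undone later
                state = next(state);
                persist(state);          // in HBase this would go to the procedure store
            }
        }

        /** On failure, undo the already-executed steps in reverse order. */
        public void rollback() {
            while (!executed.isEmpty()) {
                undoStep(executed.pop());
            }
        }

        private State next(State s) { return State.values()[s.ordinal() + 1]; }
        private void runStep(State s)  { System.out.println("execute " + s); }
        private void undoStep(State s) { System.out.println("rollback " + s); }
        private void persist(State s)  { /* durable write omitted in this sketch */ }
    }
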
  13. 13 hbasecon.com Outline • Storing Larger Objects efficiently • Making DDL Operations fault tolerant • Better Region Assignment • Compatibility guarantees for our users • Improving Availability • Using all machine resources • Q+A
  14. 14 hbasecon.com ZK-based Region Assignment • Region states could be inconsistent • Assignment info stored in both the meta table and ZooKeeper • Both Master and RegionServer can update them • Limited scalability and operations efficiency • ZooKeeper events used for coordination
  15. 15 hbasecon.com ZK-less Region Assignment • RPC based • Master is the true coordinator • Only the Master can update the meta table • All state changes are persisted • Follows the state machine • RegionServer does what it is told by the Master • Reports status to the Master • Each step needs acknowledgement from the Master
  16. 16 hbasecon.com Current Status • Off by default in 1.0 • Impact • Master is in the critical path • Meta should be co-located with Master • Procedure V2 could solve it (future work) • Deployment topology change • Master is a RegionServer, serves small system tables • Blog post has more info • https://blogs.apache.org/hbase/entry/hbase_zk_less_region_assignment
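
For completeness, the feature is a configuration switch rather than an API change. A minimal sketch, assuming the property name hbase.assignment.usezk described in the linked blog post (in practice the property belongs in the Master's hbase-site.xml; the Java Configuration call below is shown only to keep the examples in one language):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class ZkLessAssignmentConfigSketch {
        public static void main(String[] args) {
            Configuration conf = HBaseConfiguration.create();
            // Assumed property name from the ZK-less assignment blog post: setting it to
            // false makes the Master coordinate assignment over RPC and persist region
            // state transitions in the meta table instead of ZooKeeper.
            conf.setBoolean("hbase.assignment.usezk", false);
            System.out.println("hbase.assignment.usezk = " + conf.getBoolean("hbase.assignment.usezk", true));
        }
    }
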
  17. 17 hbasecon.com Outline • Storing Larger Objects efficiently • Making DDL Operations fault tolerant • Better Region Assignment • Compatibility guarantees for our users • Improving Availability • Using all machine resources • Q+A
  18. 18 hbasecon.com HBase Semantic Versioning The Return to Sanity
  19. 19 hbasecon.com Client Version? Server Version? Hadoop Version? Binary Compatibility? HFile Version? ARRGGHHH. Should be SIMPLE! Protobufs Client/Server Compatibility?
  20. 20 hbasecon.com Semantic Versioning Makes Things Simple
  21. 21 hbasecon.com HBase <Major>.<Minor>.<Patch>
  22. 22 hbasecon.com MAJOR version when you make incompatible API changes
  23. 23 hbasecon.com MINOR version when you add backwards-compatible functionality
  24. 24 hbasecon.com PATCH version when you make backwards-compatible bug fixes
  25. 25 hbasecon.com We are adopting this starting with HBase 1.0
  26. 26 hbasecon.com Compatibility Dimensions (the long version) • Client-Server wire protocol compatibility • Server-Server protocol compatibility • File format compatibility • Client API compatibility • Client Binary compatibility • Server-Side Limited API compatibility (taken from Hadoop) • Dependency Compatibility • Operational Compatibility
  27. 27 hbasecon.com TL;DR: • A patch upgrade is a drop-in replacement • A minor upgrade requires no application or client code modification • A major upgrade allows us - the HBase community - to make breaking changes.
  28. 28 hbasecon.com Simple
  29. 29 hbasecon.com Thanks http://semver.org/ http://hbase.apache.org/book.html#hbase.versioning
  30. 30 hbasecon.com Outline • Storing Larger Objects efficiently • Making DDL Operations fault tolerant • Better Region Assignment • Compatibility guarantees for our users • Improving Availability • Using all machine resources • Q+A
  31. 31 hbasecon.com Improving read availability • HBase is CP • When a node goes down, some regions are unavailable until recovery • Some classes of applications want high availability (for reads) • Region replicas • TIMELINE consistency reads
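
For context, here is a minimal sketch of what a TIMELINE-consistency read looks like with the HBase Java client (not from the slides; the table name "usertable" and the row key are illustrative). With region replicas enabled, such a Get may be answered by a secondary replica, and Result.isStale() tells the caller whether the answer may lag the primary:

    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Consistency;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TimelineReadSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection();
                 Table table = conn.getTable(TableName.valueOf("usertable"))) {  // illustrative table
                Get get = new Get(Bytes.toBytes("row-1"));                       // illustrative row key
                get.setConsistency(Consistency.TIMELINE);  // allow a secondary replica to serve the read
                Result result = table.get(get);
                if (result.isStale()) {
                    // Served by a secondary replica: the data may lag the primary region.
                    System.out.println("stale read from a region replica");
                }
            }
        }
    }
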
  32. 32 hbasecon.com Phase contents • Phase 1 • Region replicas • Stale data up to minutes (15 min) • In 1.0 • Phase 2 • Millisecond latencies for staleness (WAL replication) • Replicas for the meta table • Region splits and merges with region replicas • Scan support • In 1.1
  33. 33 hbasecon.com [Diagram: RegionServer 1 hosts Region1, Region2 and Region3; writes are appended to the WAL, which a ReplicaReplication endpoint tails, while flushes/compactions write hfiles to HDFS]
  34. 34 hbasecon.com [Diagram: the ReplicaReplication endpoint tails RegionServer 1's WAL and replays edits to the Region2 replica on RegionServer 15 and the Region1 replica on RegionServer 20; the replicas also read flushed hfiles from HDFS]
  35. 35 hbasecon.com Pluggable WAL Replication • Pluggable WAL replication endpoint • You can write your own replicators! • Similar to co-processors (runs in the same RS process) hbase> add_peer 'my_peer', ENDPOINT_CLASSNAME => 'org.hbase.MyReplicationEndpoint', DATA => { "key1" => 1 }, CONFIG => { "config1" => "value1", "config2" => "value2" }
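
For context, a rough sketch of what the class named in the add_peer command above might look like (this assumes the HBase 1.x ReplicationEndpoint API; the body of replicate() is left empty and the peer UUID handling is simplified):

    import java.util.UUID;
    import org.apache.hadoop.hbase.replication.BaseReplicationEndpoint;
    import org.apache.hadoop.hbase.wal.WAL;

    // Rough sketch of a custom replication endpoint, which runs inside the
    // RegionServer process much like a co-processor (as noted on the slide).
    public class MyReplicationEndpoint extends BaseReplicationEndpoint {

        @Override
        public UUID getPeerUUID() {
            // Identifies the sink; a real endpoint would return a stable cluster/sink id.
            return UUID.randomUUID();
        }

        @Override
        public boolean replicate(ReplicateContext context) {
            // Called with batches of WAL entries tailed from this RegionServer's WALs.
            for (WAL.Entry entry : context.getEntries()) {
                // Ship the edit to the custom sink here (omitted in this sketch).
            }
            return true; // true = batch handled, the source may advance its log position
        }

        @Override
        protected void doStart() { notifyStarted(); }

        @Override
        protected void doStop() { notifyStopped(); }
    }
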
  36. 36 hbasecon.com Outline • Storing Larger Objects efficiently • Making DDL Operations fault tolerant • Better Region Assignment • Compatibility guarantees for our users • Improving Availability • Using all machine resources • Q+A
  37. 37 hbasecon.com Workload Throughput Distributed work will eventually be limited by one of • CPU • Disk IO • Network IO
  38. 38 hbasecon.com HBase Under (synthetic) Load Now Not CPU Bound
  39. 39 hbasecon.com HBase Under (synthetic) Load Now Not Disk Bound
  40. 40 hbasecon.com HBase Under (synthetic) Load Now Not Network Bound
  41. 41 hbasecon.com Modest Gain: Multiple WALs • All regions write to one write-ahead log file (WAL). • Idea: let's have multiple write-ahead logs so that we can write more in parallel. • Follow-up work: taken to the limit, if we were on SSD we could have one WAL per region. [Diagram: with a single WAL the RegionServer leaves most of the DataNode's disks idle; with multiple WALs the writes spread across them]
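
For completeness, a minimal sketch of the configuration knobs involved (the property names hbase.wal.provider and hbase.wal.regiongrouping.numgroups are assumptions taken from the reference guide and should be verified for your release; in practice they belong in the RegionServers' hbase-site.xml, the Java call below just keeps the examples in one language):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class MultiWalConfigSketch {
        public static void main(String[] args) {
            Configuration conf = HBaseConfiguration.create();
            // Assumed property names: pick the region-grouping ("multiwal") provider and
            // set how many WAL pipelines each RegionServer should write in parallel.
            conf.set("hbase.wal.provider", "multiwal");
            conf.setInt("hbase.wal.regiongrouping.numgroups", 2);
            System.out.println(conf.get("hbase.wal.provider"));
        }
    }
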
  42. 42 hbasecon.com Future Solutions • Alternative WAL providers • Read path optimizations based on profiling • Better tuning
  43. 43 hbasecon.com Outline • Storing Larger Objects efficiently • Making DDL Operations fault tolerant • Better Region Assignment • Compatibility guarantees for our users • Improving Availability • Using all machine resources • Q+A
  44. 44 hbasecon.com Thanks!

Editor's Notes

  1. When working with a big mass of machines, your first optimization step has to be getting to the point where you exhaust one of these three resources. The specifics will depend on your workload, but right now we have big room for improvement.
  2. This is a mixed write/update/read workload after reaching a state where there are memstore flushes and compactions happening. It's mostly waiting on synchronization AFAICT.
  3. This is a mixed write/update/read workload after reaching a state where there are memstore flushes and compactions happening. It's mostly waiting on synchronization AFAICT.
  4. This is a mixed write/update/read workload after reaching a state where there are memstore flushes and compactions happening. It's mostly waiting on synchronization AFAICT.
  5. Historically one of the long poles in the tent has been the WAL, since all the regions served by a region server hit the same one. As of HBase 1.0, there are options to expand to multiple pipelines, but the gains are modest. As of HBase 1.1, we can make use of HDFS storage policies to keep just the WAL on SSD in mixed-disk deployments. We need more testing and operational feedback from the community though.
  6. Longer-term solutions that will start showing up in HBase 2.0 involve updates to both the read and write paths. For WAL limitations, we need to examine some base assumptions; HDFS is made for throughput of large blobs, not for many small writes. One option is a custom DFSClient in HBase to show value, then push it upstream; maybe it's best to defer to a system made for these kinds of writes, e.g. Kafka. Stack has recently done some excellent work profiling what happens in an HBase system under load, and some optimizations to better work with the JIT compiler have been landing as a result. Frankly, we have a huge number of tuning options now that can eat a lot of hardware, but they remain inaccessible. Documentation improvements and a round of updating defaults based on current machine specs are needed.