In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used is the specific durability guarantee that HBase's write-ahead log (WAL) requires, which HDFS provides correctly. However, with sufficient effort, HBase's use of HDFS for WALs can be replaced.
This talk covers the design of a "Log Service" which can be embedded inside HBase and provides the level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library implementation of the Raft consensus protocol in Java and is used to build this Log Service. The talk walks through the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, it covers how the Log Service "fits" into HBase and the changes to HBase needed to enable this. Finally, it discusses how the Log Service can simplify the operational burden of HBase.
NoSQL Day 2019 - Floating on a Raft - Apache HBase Durability with Apache Ratis
1. Floating on a Raft
HBase Durability with Apache Ratis
NoSQL Day 2019
Washington, D.C.
Ankit Singhal, Josh Elser
Apache, Apache HBase, HBase, Apache Ratis, Ratis are (registered) trademarks of the Apache Software Foundation.
2. Distributed Consensus
Problem: How does a collection of computers agree on state in the face of failures?
(Diagram: three nodes holding conflicting values A = 1, A = 2, A = 1. Computer icon: CC BY-SA 3.0, https://upload.wikimedia.org/wikipedia/commons/thumb/b/b2/Gnome-computer.svg/1024px-Gnome-computer.svg.png)
4. Raft
Easy to understand, easy to implement.
“New” (2013) -- Diego Ongaro, John Ousterhout
Proven correctness via TLA+
Paxos is “old” (1989), but still hard
5. Apache Ratis
Incubating project at the Apache Software Foundation
A library-oriented, Java implementation of Raft (not a service!)
Pluggable pieces (see the sketch after this slide):
● Transport (gRPC, Netty, Hadoop RPC)
● State Machine (your code!)
● Raft Log (In-memory, segmented files on disk)
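Not on the slide: a minimal sketch of how these pluggable pieces get wired together when embedding Ratis, assuming the RaftServer builder API from the Ratis releases of this era (exact method names may differ slightly by version):

import java.io.IOException;
import org.apache.ratis.conf.RaftProperties;
import org.apache.ratis.protocol.RaftGroup;
import org.apache.ratis.protocol.RaftPeerId;
import org.apache.ratis.server.RaftServer;
import org.apache.ratis.statemachine.StateMachine;

class EmbeddedRatisSketch {
  // Start one Raft server of a quorum: the transport defaults to gRPC,
  // the StateMachine is user code, and the Raft log is configured via properties.
  static RaftServer start(RaftPeerId id, RaftGroup group, StateMachine stateMachine)
      throws IOException {
    RaftProperties properties = new RaftProperties();
    RaftServer server = RaftServer.newBuilder()
        .setServerId(id)                // this peer's identity within the group
        .setGroup(group)                // the quorum (set of RaftPeers) to join
        .setProperties(properties)      // transport and Raft-log settings live here
        .setStateMachine(stateMachine)  // your code, e.g. the Arithmetic example below
        .build();
    server.start();
    return server;
  }
}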
6. Ratis State Machines
A StateMachine is the abstraction point for user code.
Interface to query and modify “state”
Ratis Arithmetic Example:
Maintain variables (e.g. a = 1) and apply mathematical operations.
Read expr’s: add, subtract, multiply, divide
Write expr’s: assignment
// Simplified from the Ratis arithmetic example; the real StateMachine API is
// asynchronous (CompletableFuture-based), but the idea is the same.
class Arithmetic implements StateMachine {
  private final Map<String, Double> variables = new HashMap<>();
  private final ReadWriteLock lock = new ReentrantReadWriteLock();

  // Read path: evaluate an expression against the current variables
  Message query(Message req) {
    Expression exp = parseReadExp(req);
    lock.readLock().lock();
    try {
      return exp.eval(variables);
    } finally {
      lock.readLock().unlock();
    }
  }

  // Write path: apply a committed entry (an assignment) to the variables
  Message update(Message req) {
    Expression exp = parseWriteExp(req);
    lock.writeLock().lock();
    try {
      return exp.eval(variables);
    } finally {
      lock.writeLock().unlock();
    }
  }
}
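For context (also not on the slide), a client drives a state machine like this by routing writes through the Raft log and reads through query(); the RaftClient calls shown follow the Ratis client API of this era and are an assumption:

import org.apache.ratis.client.RaftClient;
import org.apache.ratis.conf.RaftProperties;
import org.apache.ratis.protocol.Message;
import org.apache.ratis.protocol.RaftGroup;

class ArithmeticClientSketch {
  static void example(RaftGroup group) throws Exception {
    // Build a client for the same RaftGroup the Arithmetic servers joined.
    try (RaftClient client = RaftClient.newBuilder()
        .setProperties(new RaftProperties())
        .setRaftGroup(group)
        .build()) {
      client.send(Message.valueOf("a = 1"));          // replicated, then applied via update()
      client.send(Message.valueOf("b = 2"));
      client.sendReadOnly(Message.valueOf("a + b"));  // read-only, served via query()
    }
  }
}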
12. Does HBase want this?
Assumption: we can more efficiently run HBase in cloud environments without HDFS for WALs.
● Running HDFS is expensive, hard
○ Data is “heavy” (tens of minutes to hours to decommission)
○ Unexpected DataNode failure requires slow re-replication
● More things to monitor -- twice as many JVMs
Ideal Case:
● Scale up HBase by just adding more RegionServers, then balance
● Scale down gently (on the order of minutes) by removing RegionServers
13. Durability in HBase
(Write-path diagram: a Put/Delete/Increment arrives at the RegionServer; (1) the KVs are appended and synced to the WAL, (2) they are written to the MemStore of the relevant Store in Region1 through RegionN, and (3) MemStores are asynchronously flushed to generate HFiles / store files.)
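To make the durability knob concrete (not from the slides), this is roughly how an HBase client asks for the WAL-synced behavior in step (1); the table, family, and qualifier names are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

class DurablePutSketch {
  static void write() throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("t1"))) {  // "t1" is a placeholder
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v"));
      // SYNC_WAL: the edit is appended and synced to the WAL before the RPC
      // returns; the MemStore flush to store files happens later, asynchronously.
      put.setDurability(Durability.SYNC_WAL);
      table.put(put);
    }
  }
}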
14. Life cycle of a WAL
(Diagram: the RegionServer appends to the current WAL; the Log Roller rolls the WAL; flushes mark which WALs are no longer needed for recovery; ZooKeeper tracks WALs for replication and backup; a cleaner chore handles the archived WALs.)
15. RegionServer Recovery
Identification
- The Master (ServerManager) observes when a RegionServer is deemed dead because its ephemeral node is deleted
Splitting
- Reading the WAL and creating separate files for each region
Re-assignment
- Assigning the regions from the dead server to live RegionServers
Fencing
- Fencing the half-dead RegionServer (a server which undergoes a long GC pause and comes back after the GC finishes)
- Currently done by renaming the HDFS directory
Replaying
- Reading the recovered edits produced by WAL splitting and replaying the edits that were not flushed
16. RegionServer Recovery Refactoring
Identification
- Monitoring ephemeral RS nodes
- WALs available for the servers which are not live
Splitting
interface WALProvider {
  public Map<Region, WAL> split(WAL wal);
}
Re-assignment
- No change required, as this is independent of the WAL
Fencing
interface ServerFence {
  public void fence(ServerName server);
}
In the case of Ratis, the implementation could close the log to prevent further writes by the dead RegionServer (see the sketch after this slide).
Replaying
interface WALProvider {
  public Reader getRecoveredEditsReader(Region region);
}
Disclaimer: these interfaces are for reference only and may change during actual development.
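A hedged sketch of what a Ratis-backed fence could look like, using the proposed ServerFence interface above; the LogServiceClient type, its closeLog method, and the log naming are illustrative assumptions, not the actual Ratis LogService API:

import org.apache.hadoop.hbase.ServerName;

class RatisServerFence implements ServerFence {
  private final LogServiceClient logService;  // hypothetical client to the Ratis Log Service

  RatisServerFence(LogServiceClient logService) {
    this.logService = logService;
  }

  @Override
  public void fence(ServerName server) {
    // Closing the dead RegionServer's log makes any further appends fail, so a
    // "half dead" server returning from a long GC pause can no longer write.
    logService.closeLog("wal-" + server.getServerName());  // assumed one-log-per-server naming
  }
}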
17. Replication
- Async and Serial Replication rely on reading WALs
- Need long-term storage for WALs
- The Ratis LogService uses local disk
Proposed Solution
- Can we upload Ratis WALs to distributed, cheap storage?
- If we can hold onto WALs indefinitely, we don’t have to rewrite Replication.
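One shape the proposed solution could take (a sketch under assumptions; the s3a destination and the idea of copying closed segments off the LogService's local disk are illustrative, not an existing LogService feature):

import java.io.File;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class WalArchiverSketch {
  // Once a WAL is no longer needed for recovery, copy its closed segments to
  // cheap object storage so replication can still read them later.
  static void archive(File closedSegment, Configuration conf) throws IOException {
    Path dest = new Path("s3a://hbase-wal-archive/" + closedSegment.getName());  // placeholder bucket
    FileSystem fs = dest.getFileSystem(conf);
    fs.copyFromLocalFile(new Path(closedSegment.getAbsolutePath()), dest);
  }
}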
18. Why Ratis for WAL?
Choices are: Apache Kafka, Apache DistributedLog, Apache Ratis, HDFS, Amazon Kinesis, Azure Premium Storage
● Fully embeddable (no dependency on an external system)
● Low Latency
● High throughput
● Enable HBase for Hybrid cloud deployment
● Availability proportional to the number of nodes in a quorum
Disclaimer: we are not suggesting Ratis is the only solution; the HBase refactoring will be done in such a way that any storage is pluggable.
19. What’s next?
More testing for LogService
● Easy to cause leader-election storms
● Better insight/understanding into internals
A Ratis LogService WALProvider
● Wire up the LogService with the new WAL APIs
Advantages of cloud:
Cloud storage is economical
Easy migration
Elasticity
Disadvantages of cloud services:
Expensive
Limited options
Specific versions
Characteristics of a WAL:
Durable and highly available, as WALs are needed in case of a crash
Low latency and high throughput, because the WAL sits on the write path
Support for append and group commit
Decisions made for Ratis:
Generally the lifecycle of a WAL is pretty short; we no longer need it once a flush completes, but due to replication we need to keep WALs longer. Plan for the Ratis Log Service: if redundancy of the
Local disk
An SSD for the local disk will lower the latency further