In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used is the specific durability guarantee that HBase's write-ahead log (WAL) requires, which HDFS provides correctly. However, with sufficient effort, HBase's use of HDFS for WALs can be replaced.
This talk covers the design of a "Log Service" which can be embedded inside HBase and provides the level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library implementation of the Raft consensus protocol in Java and is used to build this Log Service. The talk walks through the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, it covers how the Log Service "fits" into HBase and the changes to HBase needed to enable this. Finally, it discusses how the Log Service can simplify the operational burden of HBase.
NoSQL Day 2019 - Floating on a Raft - Apache HBase Durability with Apache Ratis
1. Floating on a Raft
HBase Durability with Apache Ratis
NoSQL Day 2019
Washington, D.C.
Ankit Singhal, Josh Elser
Apache, Apache HBase, HBase, Apache Ratis, Ratis are (registered) trademarks of the Apache Software Foundation.
2. Distributed Consensus
Problem: How does a collection of computers agree on state in the face of failures?
(Diagram: three nodes holding conflicting values A = 1, A = 2, A = 1. Computer icon: CC BY-SA 3.0, https://upload.wikimedia.org/wikipedia/commons/thumb/b/b2/Gnome-computer.svg/1024px-Gnome-computer.svg.png)
4. Raft
Easy to understand, easy to implement.
“New” (2013) -- Diego Ongaro, John Ousterhout
Proven correctness via TLA+
Paxos is “old” (1989), but still hard
5. Apache Ratis
Incubating project at the Apache Software Foundation
A library-oriented, Java implementation of Raft (not a service!)
Pluggable pieces (see the sketch after this slide):
● Transport (gRPC, Netty, Hadoop RPC)
● State Machine (your code!)
● Raft Log (In-memory, segmented files on disk)
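Not on the slide: a minimal sketch of how these pluggable pieces get wired together when embedding Ratis, assuming the RaftServer builder API from the Ratis releases of this era (exact method names may differ slightly by version):

import java.io.IOException;
import org.apache.ratis.conf.RaftProperties;
import org.apache.ratis.protocol.RaftGroup;
import org.apache.ratis.protocol.RaftPeerId;
import org.apache.ratis.server.RaftServer;
import org.apache.ratis.statemachine.StateMachine;

class EmbeddedRatisSketch {
  // Start one Raft server of a quorum: the transport defaults to gRPC,
  // the StateMachine is user code, and the Raft log is configured via properties.
  static RaftServer start(RaftPeerId id, RaftGroup group, StateMachine stateMachine)
      throws IOException {
    RaftProperties properties = new RaftProperties();
    RaftServer server = RaftServer.newBuilder()
        .setServerId(id)                // this peer's identity within the group
        .setGroup(group)                // the quorum (set of RaftPeers) to join
        .setProperties(properties)      // transport and Raft-log settings live here
        .setStateMachine(stateMachine)  // your code, e.g. the Arithmetic example below
        .build();
    server.start();
    return server;
  }
}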
6. Ratis State Machines
A StateMachine is the abstraction point for user code.
Interface to query and modify “state”
Ratis Arithmetic Example:
Maintain variables (e.g. a = 1) and apply mathematical operations.
Read expr’s: add, subtract, multiply, divide
Write expr’s: assignment
// Simplified from the Ratis arithmetic example; the real StateMachine API is
// asynchronous (CompletableFuture-based), but the idea is the same.
class Arithmetic implements StateMachine {
  private final Map<String, Double> variables = new HashMap<>();
  private final ReadWriteLock lock = new ReentrantReadWriteLock();

  // Read path: evaluate an expression against the current variables
  Message query(Message req) {
    Expression exp = parseReadExp(req);
    lock.readLock().lock();
    try {
      return exp.eval(variables);
    } finally {
      lock.readLock().unlock();
    }
  }

  // Write path: apply a committed entry (an assignment) to the variables
  Message update(Message req) {
    Expression exp = parseWriteExp(req);
    lock.writeLock().lock();
    try {
      return exp.eval(variables);
    } finally {
      lock.writeLock().unlock();
    }
  }
}
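For context (also not on the slide), a client drives a state machine like this by routing writes through the Raft log and reads through query(); the RaftClient calls shown follow the Ratis client API of this era and are an assumption:

import org.apache.ratis.client.RaftClient;
import org.apache.ratis.conf.RaftProperties;
import org.apache.ratis.protocol.Message;
import org.apache.ratis.protocol.RaftGroup;

class ArithmeticClientSketch {
  static void example(RaftGroup group) throws Exception {
    // Build a client for the same RaftGroup the Arithmetic servers joined.
    try (RaftClient client = RaftClient.newBuilder()
        .setProperties(new RaftProperties())
        .setRaftGroup(group)
        .build()) {
      client.send(Message.valueOf("a = 1"));          // replicated, then applied via update()
      client.send(Message.valueOf("b = 2"));
      client.sendReadOnly(Message.valueOf("a + b"));  // read-only, served via query()
    }
  }
}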
12. Does HBase want this?
Assumption: we can more efficiently run HBase in cloud environments without HDFS for WALs.
● Running HDFS is expensive, hard
○ Data is “heavy” (tens of minutes to hours to decommission)
○ Unexpected DataNode failure requires slow re-replication
● More things to monitor -- twice as many JVMs
Ideal Case:
● Scale up HBase by just adding more RegionServers, then balance
● Scale down gently (on the order of minutes) by removing RegionServers
13. Durability in HBase
(Write-path diagram: a Put/Delete/Increment arrives at the RegionServer; (1) the KVs are appended and synced to the WAL, (2) they are written to the MemStore of the relevant Store in Region1 through RegionN, and (3) MemStores are asynchronously flushed to generate HFiles / store files.)
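To make the durability knob concrete (not from the slides), this is roughly how an HBase client asks for the WAL-synced behavior in step (1); the table, family, and qualifier names are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

class DurablePutSketch {
  static void write() throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("t1"))) {  // "t1" is a placeholder
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v"));
      // SYNC_WAL: the edit is appended and synced to the WAL before the RPC
      // returns; the MemStore flush to store files happens later, asynchronously.
      put.setDurability(Durability.SYNC_WAL);
      table.put(put);
    }
  }
}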
14. Life cycle of a WAL
(Diagram: the RegionServer appends to the current WAL; the Log Roller rolls the WAL; flushes mark which WALs are no longer needed for recovery; ZooKeeper tracks WALs for replication and backup; a cleaner chore handles the archived WALs.)
15. RegionServer Recovery
Identification
- The Master (ServerManager) observes when a RegionServer is deemed dead because its ephemeral node is deleted
Splitting
- Reading the WAL and creating separate files for each region
Re-assignment
- Assigning the regions from the dead server to live RegionServers
Fencing
- Fencing the half-dead RegionServer (a server which undergoes a long GC pause and comes back after the GC finishes)
- Currently done by renaming the HDFS directory
Replaying
- Reading the recovered edits produced by WAL splitting and replaying the edits that were not flushed
16. RegionServer Recovery Refactoring
Identification
- Monitoring ephemeral RS nodes
- WALs available for the servers which are not live
Splitting
interface WALProvider {
  public Map<Region, WAL> split(WAL wal);
}
Re-assignment
- No change required, as this is independent of the WAL
Fencing
interface ServerFence {
  public void fence(ServerName server);
}
In the case of Ratis, the implementation could close the log to prevent further writes by the dead RegionServer (see the sketch after this slide).
Replaying
interface WALProvider {
  public Reader getRecoveredEditsReader(Region region);
}
Disclaimer: these interfaces are for reference only and may change during actual development.
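A hedged sketch of what a Ratis-backed fence could look like, using the proposed ServerFence interface above; the LogServiceClient type, its closeLog method, and the log naming are illustrative assumptions, not the actual Ratis LogService API:

import org.apache.hadoop.hbase.ServerName;

class RatisServerFence implements ServerFence {
  private final LogServiceClient logService;  // hypothetical client to the Ratis Log Service

  RatisServerFence(LogServiceClient logService) {
    this.logService = logService;
  }

  @Override
  public void fence(ServerName server) {
    // Closing the dead RegionServer's log makes any further appends fail, so a
    // "half dead" server returning from a long GC pause can no longer write.
    logService.closeLog("wal-" + server.getServerName());  // assumed one-log-per-server naming
  }
}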
17. Replication
- Async and Serial Replication rely on reading WALs
- Need long-term storage for WALs
- The Ratis LogService uses local disk
Proposed Solution
- Can we upload Ratis WALs to distributed, cheap storage?
- If we can hold onto WALs indefinitely, we don’t have to rewrite Replication.
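One shape the proposed solution could take (a sketch under assumptions; the s3a destination and the idea of copying closed segments off the LogService's local disk are illustrative, not an existing LogService feature):

import java.io.File;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class WalArchiverSketch {
  // Once a WAL is no longer needed for recovery, copy its closed segments to
  // cheap object storage so replication can still read them later.
  static void archive(File closedSegment, Configuration conf) throws IOException {
    Path dest = new Path("s3a://hbase-wal-archive/" + closedSegment.getName());  // placeholder bucket
    FileSystem fs = dest.getFileSystem(conf);
    fs.copyFromLocalFile(new Path(closedSegment.getAbsolutePath()), dest);
  }
}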
18. Why Ratis for WAL?
Choices are: Apache Kafka, Apache DistributedLog, Apache Ratis, HDFS, Amazon Kinesis, Azure Premium Storage
● Fully embeddable (no dependency on an external system)
● Low Latency
● High throughput
● Enable HBase for Hybrid cloud deployment
● Availability proportional to the number of nodes in a quorum
Disclaimer: we are not suggesting Ratis is the only solution; the HBase refactoring will be done in such a way that any storage is pluggable.
19. What’s next?
More testing for LogService
● Easy to cause leader-election storms
● Better insight/understanding into internals
A Ratis LogService WALProvider
● Wire up the LogService with the new WAL APIs
Advantages of cloud:
Cloud storage is economical
Easy migration
Elasticity
Disadvantages of cloud services:
Expensive
Limited options
Specific versions
Characteristics of a WAL:
Durable and highly available, as WALs are needed in case of a crash
Low latency and high throughput, because the WAL sits on the write path
Support for append and group commit
Decisions made for Ratis:
Generally the lifecycle of a WAL is pretty short; we no longer need it once a flush completes, but due to replication we need to keep WALs longer. Plan for the Ratis Log Service: if redundancy of the
Local disk
An SSD for the local disk will lower the latency further