Alluxio Journal Evolution - Towards high availability and fault tolerance

Alluxio Journal Evolution -
Towards High Availability
and Fault Tolerance
ALLUXIO 1

Fault tolerance
- Workers can be replicated, act as a cache for the various UFS
- Many UFS have high availability, fault tolerance guarantees
- Master becomes single point of failure
3

Journal details
4
- Total order log of operations
- Recover by replay
- Snapshots to efficiently store state
- Faster recovery
- Smaller size

Basic fault tolerance
- Create a fault tolerant journal
- If the master crashes
- Stat a new master
- Replay the journal
- Start serving clients
- The system will be unavailable during this time
5

Basic high availability
- Run multiple masters
- A primary master will serve requests
- Secondary master(s) will replicate the state of the primary master, and take
over in case of failure
6

Basic highly available/ fault tolerant architecture
7

Problems to solve
- Ensure a single primary master running at all times
- Journal needs to be
- Fault tolerant
- Must agree on a valid order of journal entries
- Consensus
8

ALLUXIO 9
Zookeeper + UFS Journal

Ensure a single primary master running at a time
- Leader election using Zookeeper recipe
- Apache Zookeeper is an open-source server which enables highly reliable
distributed coordination
- File-system like abstraction built on top of an Atomic Broadcast (consensus) protocol
- Run on a cluster of nodes to provide fault tolerance/high availability
10

UFS Journal
- Write journal entries to the UFS
- Use the availability / fault tolerance / consistency guarantees of the UFS
- HDFS
11

Does leader election solve all our problems?
- Not quite
- Due to asynchrony two nodes
may believe they are leader at the
same time
- Concurrent writes to the journal
12

Zookeeper + UFS architecture
13

Issues
- Relies on multiple systems
- Each having their own fault tolerance/availability models
- More complicated
- Different UFS have different consistency models and performance
- May not be efficient for appending log entries
14

Additional details on the file system metadata
- RocksDB (optional)
- Log-structured merge tree
- Efficient inserts
- Key-value store
- Inode tree as a key-value map
- Efficient snapshots
- Alluxio adds in-memory cache for fast reads
15

ALLUXIO 16
The journal and a replicated state machine
Raft Journal

Raft - replicated state machine
- Clients interact with the state
machine as if it was a single
instance (linearizability)
- Send commands and receive
responses
- Fault tolerant and high
availability
https://ratis.apache.org
17

The Alluxio journal and a replicated state machine
- Raft simplifications
- Handles snapshotting and recovery, the
journal log, etc.
- Replicated state-machine =
key-value store of the file-system
meta-data
- Raft colocated with Alluxio masters
18

Primary master protocol
- Raft ensures a consistent and
highly available journal
- Still want a single primary master
- Update the UFS and Alluxio workers
- Serve clients
- Use leader election built into Raft.
- + Additional coordination layer
19

Alluxio + Raft architecture
20

Advantages
- Simplicity
- No external systems (Raft colocated with masters)
- Raft takes care of logging, snapshotting, recovery, etc.
- Performance
- Journal stored directly on masters
- RocksDB key-value store + cache
21

Alluxio Journal Evolution - Towards high availability and fault tolerance

More Related Content

What's hot

Similar to Alluxio Journal Evolution - Towards high availability and fault tolerance

More from Alluxio, Inc.

Recently uploaded

Alluxio Journal Evolution - Towards high availability and fault tolerance