Alluxio Journal Evolution -
Towards High Availability
and Fault Tolerance
ALLUXIO 1
Basic Alluxio Architecture
2
Fault tolerance
- Workers can be replicated, act as a cache for the various UFS
- Many UFS have high availability, fault tolerance guarantees
- Master becomes single point of failure
3
Journal details
4
- Total order log of operations
- Recover by replay
- Snapshots to efficiently store state
- Faster recovery
- Smaller size
Basic fault tolerance
- Create a fault tolerant journal
- If the master crashes
- Stat a new master
- Replay the journal
- Start serving clients
- The system will be unavailable during this time
5
Basic high availability
- Run multiple masters
- A primary master will serve requests
- Secondary master(s) will replicate the state of the primary master, and take
over in case of failure
6
Basic highly available/ fault tolerant architecture
7
Problems to solve
- Ensure a single primary master running at all times
- Journal needs to be
- Fault tolerant
- Must agree on a valid order of journal entries
- Consensus
8
ALLUXIO 9
Zookeeper + UFS Journal
Ensure a single primary master running at a time
- Leader election using Zookeeper recipe
- Apache Zookeeper is an open-source server which enables highly reliable
distributed coordination
- File-system like abstraction built on top of an Atomic Broadcast (consensus) protocol
- Run on a cluster of nodes to provide fault tolerance/high availability
10
UFS Journal
- Write journal entries to the UFS
- Use the availability / fault tolerance / consistency guarantees of the UFS
- HDFS
11
Does leader election solve all our problems?
- Not quite
- Due to asynchrony two nodes
may believe they are leader at the
same time
- Concurrent writes to the journal
12
Zookeeper + UFS architecture
13
Issues
- Relies on multiple systems
- Each having their own fault tolerance/availability models
- More complicated
- Different UFS have different consistency models and performance
- May not be efficient for appending log entries
14
Additional details on the file system metadata
- RocksDB (optional)
- Log-structured merge tree
- Efficient inserts
- Key-value store
- Inode tree as a key-value map
- Efficient snapshots
- Alluxio adds in-memory cache for fast reads
15
ALLUXIO 16
The journal and a replicated state machine
Raft Journal
Raft - replicated state machine
- Clients interact with the state
machine as if it was a single
instance (linearizability)
- Send commands and receive
responses
- Fault tolerant and high
availability
https://ratis.apache.org
17
The Alluxio journal and a replicated state machine
- Raft simplifications
- Handles snapshotting and recovery, the
journal log, etc.
- Replicated state-machine =
key-value store of the file-system
meta-data
- Raft colocated with Alluxio masters
18
Primary master protocol
- Raft ensures a consistent and
highly available journal
- Still want a single primary master
- Update the UFS and Alluxio workers
- Serve clients
- Use leader election built into Raft.
- + Additional coordination layer
19
Alluxio + Raft architecture
20
Advantages
- Simplicity
- No external systems (Raft colocated with masters)
- Raft takes care of logging, snapshotting, recovery, etc.
- Performance
- Journal stored directly on masters
- RocksDB key-value store + cache
21
ALLUXIO 22
Questions?

Alluxio Journal Evolution - Towards high availability and fault tolerance

  • 1.
    Alluxio Journal Evolution- Towards High Availability and Fault Tolerance ALLUXIO 1
  • 2.
  • 3.
    Fault tolerance - Workerscan be replicated, act as a cache for the various UFS - Many UFS have high availability, fault tolerance guarantees - Master becomes single point of failure 3
  • 4.
    Journal details 4 - Totalorder log of operations - Recover by replay - Snapshots to efficiently store state - Faster recovery - Smaller size
  • 5.
    Basic fault tolerance -Create a fault tolerant journal - If the master crashes - Stat a new master - Replay the journal - Start serving clients - The system will be unavailable during this time 5
  • 6.
    Basic high availability -Run multiple masters - A primary master will serve requests - Secondary master(s) will replicate the state of the primary master, and take over in case of failure 6
  • 7.
    Basic highly available/fault tolerant architecture 7
  • 8.
    Problems to solve -Ensure a single primary master running at all times - Journal needs to be - Fault tolerant - Must agree on a valid order of journal entries - Consensus 8
  • 9.
  • 10.
    Ensure a singleprimary master running at a time - Leader election using Zookeeper recipe - Apache Zookeeper is an open-source server which enables highly reliable distributed coordination - File-system like abstraction built on top of an Atomic Broadcast (consensus) protocol - Run on a cluster of nodes to provide fault tolerance/high availability 10
  • 11.
    UFS Journal - Writejournal entries to the UFS - Use the availability / fault tolerance / consistency guarantees of the UFS - HDFS 11
  • 12.
    Does leader electionsolve all our problems? - Not quite - Due to asynchrony two nodes may believe they are leader at the same time - Concurrent writes to the journal 12
  • 13.
    Zookeeper + UFSarchitecture 13
  • 14.
    Issues - Relies onmultiple systems - Each having their own fault tolerance/availability models - More complicated - Different UFS have different consistency models and performance - May not be efficient for appending log entries 14
  • 15.
    Additional details onthe file system metadata - RocksDB (optional) - Log-structured merge tree - Efficient inserts - Key-value store - Inode tree as a key-value map - Efficient snapshots - Alluxio adds in-memory cache for fast reads 15
  • 16.
    ALLUXIO 16 The journaland a replicated state machine Raft Journal
  • 17.
    Raft - replicatedstate machine - Clients interact with the state machine as if it was a single instance (linearizability) - Send commands and receive responses - Fault tolerant and high availability https://ratis.apache.org 17
  • 18.
    The Alluxio journaland a replicated state machine - Raft simplifications - Handles snapshotting and recovery, the journal log, etc. - Replicated state-machine = key-value store of the file-system meta-data - Raft colocated with Alluxio masters 18
  • 19.
    Primary master protocol -Raft ensures a consistent and highly available journal - Still want a single primary master - Update the UFS and Alluxio workers - Serve clients - Use leader election built into Raft. - + Additional coordination layer 19
  • 20.
    Alluxio + Raftarchitecture 20
  • 21.
    Advantages - Simplicity - Noexternal systems (Raft colocated with masters) - Raft takes care of logging, snapshotting, recovery, etc. - Performance - Journal stored directly on masters - RocksDB key-value store + cache 21
  • 22.