SlideShare a Scribd company logo
Zookeeper vs Raft: Stateful
distributed coordination with
HA and Fault Tolerance
Tyler Crain
ALLUXIO 1
Outline
1. Stateful distributed configuration services
2. What is Alluxio
3. High availability and fault tolerance in Alluxio
a. Implementation 1: Zookeeper + UFS (object store)
b. Implementation 2: Raft
2
ALLUXIO 3
Stateful distributed
coordination services
4
Source: kubernetes.io
Kubernetes - Container orchestration
Kafka - distributed event streaming
5
6
Source: Zhou, Jingyu, et al. "Foundationdb: A distributed unbundled
transactional key value store." Proceedings of the 2021 International
Conference on Management of Data. 2021.
Foundation DB - Transactional Key-Value store
7
Source: Corbett, James C., et al. "Spanner: Google’s globally distributed
database." ACM Transactions on Computer Systems (TOCS) 31.3 (2013): 1-22.
Google Spanner - distributed database
8
Source: Shute, Jeff, et al. "F1: A distributed SQL database that scales." (2013).
Google F1 (Spanner) - SQL Database
9
Source: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
HDFS - Distributed File System
What we have seen
● Metadata
● Configuration
● Cluster health/maintenance
● Naming/service discovery
● Leader Election
10
Small subset of system, but needs strong properties
● High availability
● Fault tolerance
● Strong consistency
ALLUXIO 11
Alluxio
Alluxio
12
Alluxio Architecture
13
● UFS (Under File System)
○ E.g. Amazon S3, HDFS, etc
○ Object stores
Fault tolerance
- Workers can be replicated, act as a cache for the various UFS
- Many UFS have high availability, fault tolerance guarantees
- Master becomes single point of failure
14
Journal details
15
- Total order log of state changes
- Can recover state by replaying the
operations
- Snapshots to efficiently store state
- Faster recovery
- Smaller size
Basic fault tolerance
- Create a fault tolerant journal
- If the master crashes
- Stat a new master
- Replay the journal
- Start serving clients
- Detecting a failure, starting a new node, and replaying the journal takes time
- The system will be unavailable during this time
16
Basic high availability
- Run multiple masters
- A primary master will serve requests
- Secondary master(s) will replicate the state of the primary master, and take
over in case of failure
17
Basic highly available/ fault tolerant architecture
18
Problems to solve
- Ensure a single primary master running at all times
- Journal needs to be
- Fault tolerant
- Masters must agree on a single valid order of journal entries
- Consensus
19
ALLUXIO 20
Implementation 1:
Zookeeper + UFS Journal
Ensure a single primary master running at a time
- Leader election using Zookeeper
- Apache Zookeeper an open-source server which enables highly reliable
distributed coordination
- Provides a file-system (hierarchical namespace) like abstraction built on top of an Atomic
Broadcast (consensus) protocol
- Run on a cluster of nodes to provide fault tolerance/high availability
21
Apache Zookeeper Curator leader election recipe
- Each master subscribes to the leader election recipe
- The recipe will elect one of the nodes leader
- If the leader fails or is unable to be reached due to network issues, the recipe
will elect a new leader
22
UFS Journal
- Write journal entries to the UFS
- Use the availability / fault tolerance guarantees of the UFS
23
Does leader election solve all our problems?
- Not quite
- The Zookeeper recipe will do its best to ensure only a single node is chosen
leader at a time, but due to asynchrony in the system two nodes may believe
they are leader at the same time
- Multiple nodes may be trying to write to the journal concurrently
- Note leader election is sufficient to provide consensus, but is not a consensus protocol itself
24
Defer to the UFS
- Can use the consistency guarantees of the UFS to ensure only a single writer
to the journal
25
- Alluxio Primary/
secondary master state
transition
Ensure a single journal writer to the UFS
- Use a mutual exclusion protocol based on log file names
- (1, 12) (13, 20) (21, 300) (301, MAX_INT)
- New primary master must ensure current log file is “complete” before writing
to a new log
- (1, 12) (13, 20) (21, 300) (301, 327)
- Create a new log
- (1, 12) (13, 20) (21, 300) (301, 327) (328, MAX_INT)
26
Zookeeper + UFS architecture
27
Issues
- Relies on multiple systems
- Each having their own fault tolerance/availability models
- More complicated
- Different UFS have different consistency models and performance
- May not be efficient for appending log entries
28
ALLUXIO 29
Implementation 2:
Raft Journal
The journal as a (deterministic) state machine
- Input: command
- Append
- State transition (deterministic):
- Add the journal entry to the log
30
Raft - replicated state machine
- Clients interact with the state
machine as if it was a single
instance (linearizability)
- Clients send commands to the
state machine and receive
responses
- But the state machine is fault
tolerant and has high
availability
- Apache Ratis (java
implementation of Raft)
31
Forget the Journal as a Log
32
Inode ID Inode
metadata
Worker location
0 … …
1 … …
Key-value store state machine commands:
- Write(key, value)
- Remove(key)
- Read(key) -> value
The Alluxio metadata as a replicated state machine
- Raft simplifies the task of
implementing the journal.
- It handles snapshotting and recovery
- The journal just becomes a
key-value store keeping track of the
file-system meta-data
- Raft is colocated with the Alluxio
masters
33
RocksDB as the key-value store
- Log-structured merge tree
- Efficient inserts
- Key-value store
- Inode tree as a key-value map
- Alluxio provides an additional in memory cache for fast reads
- Efficient snapshots
34
Primary master protocol
- Raft will ensure the journal is
consistent and highly available
- Still want a single primary master to
perform updates to UFS and Alluxio
workers and serve clients
- Take advantage of the leader election
built into Raft.
- An additional layer of coordination is
used along with raft to reduce the
likelihood of concurrent primary masters
35
Alluxio + Raft architecture
36
Advantages
- Simplicity
- No external systems (raft colocated with masters)
- Journal as a key-value store (higher level abstraction)
- Performance
- Journal operations stored locally on masters
- RocksDB as an efficient key-value store
37
From our earlier examples
- Kubernetes - etcd - key/value store run on Raft
- Kafka - coordinator running Zookeeper
- Foundation DB - coordinators running Paxos
- Generally a paxos based replicated state machine of a simple data store with
some custom application logic running over this
38
ALLUXIO 39
Questions?
More information on Alluxio
- Web - alluxio.io
- Github - https://github.com/Alluxio/alluxio
- Slack - alluxio-community.slack.com
40

More Related Content

Similar to zookeeer+raft-2.pdf

Linux High Availability Overview - openSUSE.Asia Summit 2015
Linux High Availability Overview - openSUSE.Asia Summit 2015 Linux High Availability Overview - openSUSE.Asia Summit 2015
Linux High Availability Overview - openSUSE.Asia Summit 2015
Roger Zhou 周志强
 
Building a Distributed File System for the Cloud-Native Era
Building a Distributed File System for the Cloud-Native EraBuilding a Distributed File System for the Cloud-Native Era
Building a Distributed File System for the Cloud-Native Era
Alluxio, Inc.
 
Realizing Linux Containers (LXC)
Realizing Linux Containers (LXC)Realizing Linux Containers (LXC)
Realizing Linux Containers (LXC)
Boden Russell
 
Introduction to OS LEVEL Virtualization & Containers
Introduction to OS LEVEL Virtualization & ContainersIntroduction to OS LEVEL Virtualization & Containers
Introduction to OS LEVEL Virtualization & Containers
Vaibhav Sharma
 
Ospresentation 120112074429-phpapp02 (1)
Ospresentation 120112074429-phpapp02 (1)Ospresentation 120112074429-phpapp02 (1)
Ospresentation 120112074429-phpapp02 (1)Vivian Vhaves
 
The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...
The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...
The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...
OpenStack
 
Linux26 New Features
Linux26 New FeaturesLinux26 New Features
Linux26 New Featuresguest491c69
 
storage-systems.pptx
storage-systems.pptxstorage-systems.pptx
storage-systems.pptx
ShimoFcis
 
Alluxio Use Cases at Strata+Hadoop World Beijing 2016
Alluxio Use Cases at Strata+Hadoop World Beijing 2016Alluxio Use Cases at Strata+Hadoop World Beijing 2016
Alluxio Use Cases at Strata+Hadoop World Beijing 2016
Alluxio, Inc.
 
ubantu ppt.pptx
ubantu ppt.pptxubantu ppt.pptx
ubantu ppt.pptx
MrGyanprakash
 
CS9222 ADVANCED OPERATING SYSTEMS
CS9222 ADVANCED OPERATING SYSTEMSCS9222 ADVANCED OPERATING SYSTEMS
CS9222 ADVANCED OPERATING SYSTEMS
Kathirvel Ayyaswamy
 
Real Time Operating System
Real Time Operating SystemReal Time Operating System
Real Time Operating System
Sharad Pandey
 
Parallel Processing (Part 2)
Parallel Processing (Part 2)Parallel Processing (Part 2)
Parallel Processing (Part 2)
Ajeng Savitri
 
Puppet Camp Chicago 2014: Running Multiple Puppet Masters (Beginner)
Puppet Camp Chicago 2014: Running Multiple Puppet Masters (Beginner) Puppet Camp Chicago 2014: Running Multiple Puppet Masters (Beginner)
Puppet Camp Chicago 2014: Running Multiple Puppet Masters (Beginner)
Puppet
 
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark StreamingNear Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Dibyendu Bhattacharya
 
Runos OpenFlow Controller (eng)
Runos OpenFlow Controller (eng)Runos OpenFlow Controller (eng)
Runos OpenFlow Controller (eng)
Alexander Shalimov
 
Process, Threads, Symmetric Multiprocessing and Microkernels in Operating System
Process, Threads, Symmetric Multiprocessing and Microkernels in Operating SystemProcess, Threads, Symmetric Multiprocessing and Microkernels in Operating System
Process, Threads, Symmetric Multiprocessing and Microkernels in Operating SystemLieYah Daliah
 

Similar to zookeeer+raft-2.pdf (20)

Studies
StudiesStudies
Studies
 
Chapter 6 os
Chapter 6 osChapter 6 os
Chapter 6 os
 
Linux High Availability Overview - openSUSE.Asia Summit 2015
Linux High Availability Overview - openSUSE.Asia Summit 2015 Linux High Availability Overview - openSUSE.Asia Summit 2015
Linux High Availability Overview - openSUSE.Asia Summit 2015
 
Building a Distributed File System for the Cloud-Native Era
Building a Distributed File System for the Cloud-Native EraBuilding a Distributed File System for the Cloud-Native Era
Building a Distributed File System for the Cloud-Native Era
 
Realizing Linux Containers (LXC)
Realizing Linux Containers (LXC)Realizing Linux Containers (LXC)
Realizing Linux Containers (LXC)
 
Introduction to OS LEVEL Virtualization & Containers
Introduction to OS LEVEL Virtualization & ContainersIntroduction to OS LEVEL Virtualization & Containers
Introduction to OS LEVEL Virtualization & Containers
 
Ospresentation 120112074429-phpapp02 (1)
Ospresentation 120112074429-phpapp02 (1)Ospresentation 120112074429-phpapp02 (1)
Ospresentation 120112074429-phpapp02 (1)
 
The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...
The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...
The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...
 
Linux26 New Features
Linux26 New FeaturesLinux26 New Features
Linux26 New Features
 
storage-systems.pptx
storage-systems.pptxstorage-systems.pptx
storage-systems.pptx
 
Alluxio Use Cases at Strata+Hadoop World Beijing 2016
Alluxio Use Cases at Strata+Hadoop World Beijing 2016Alluxio Use Cases at Strata+Hadoop World Beijing 2016
Alluxio Use Cases at Strata+Hadoop World Beijing 2016
 
ubantu ppt.pptx
ubantu ppt.pptxubantu ppt.pptx
ubantu ppt.pptx
 
CS9222 ADVANCED OPERATING SYSTEMS
CS9222 ADVANCED OPERATING SYSTEMSCS9222 ADVANCED OPERATING SYSTEMS
CS9222 ADVANCED OPERATING SYSTEMS
 
Real Time Operating System
Real Time Operating SystemReal Time Operating System
Real Time Operating System
 
Parallel Processing (Part 2)
Parallel Processing (Part 2)Parallel Processing (Part 2)
Parallel Processing (Part 2)
 
Puppet Camp Chicago 2014: Running Multiple Puppet Masters (Beginner)
Puppet Camp Chicago 2014: Running Multiple Puppet Masters (Beginner) Puppet Camp Chicago 2014: Running Multiple Puppet Masters (Beginner)
Puppet Camp Chicago 2014: Running Multiple Puppet Masters (Beginner)
 
T isv-tru64-oracle10g-htts-1.0
T isv-tru64-oracle10g-htts-1.0T isv-tru64-oracle10g-htts-1.0
T isv-tru64-oracle10g-htts-1.0
 
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark StreamingNear Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
 
Runos OpenFlow Controller (eng)
Runos OpenFlow Controller (eng)Runos OpenFlow Controller (eng)
Runos OpenFlow Controller (eng)
 
Process, Threads, Symmetric Multiprocessing and Microkernels in Operating System
Process, Threads, Symmetric Multiprocessing and Microkernels in Operating SystemProcess, Threads, Symmetric Multiprocessing and Microkernels in Operating System
Process, Threads, Symmetric Multiprocessing and Microkernels in Operating System
 

More from Chester Chen

SFBigAnalytics_SparkRapid_20220622.pdf
SFBigAnalytics_SparkRapid_20220622.pdfSFBigAnalytics_SparkRapid_20220622.pdf
SFBigAnalytics_SparkRapid_20220622.pdf
Chester Chen
 
SF Big Analytics 2022-03-15: Persia: Scaling DL Based Recommenders up to 100 ...
SF Big Analytics 2022-03-15: Persia: Scaling DL Based Recommenders up to 100 ...SF Big Analytics 2022-03-15: Persia: Scaling DL Based Recommenders up to 100 ...
SF Big Analytics 2022-03-15: Persia: Scaling DL Based Recommenders up to 100 ...
Chester Chen
 
SF Big Analytics talk: NVIDIA FLARE: Federated Learning Application Runtime E...
SF Big Analytics talk: NVIDIA FLARE: Federated Learning Application Runtime E...SF Big Analytics talk: NVIDIA FLARE: Federated Learning Application Runtime E...
SF Big Analytics talk: NVIDIA FLARE: Federated Learning Application Runtime E...
Chester Chen
 
A missing link in the ML infrastructure stack?
A missing link in the ML infrastructure stack?A missing link in the ML infrastructure stack?
A missing link in the ML infrastructure stack?
Chester Chen
 
Shopify datadiscoverysf bigdata
Shopify datadiscoverysf bigdataShopify datadiscoverysf bigdata
Shopify datadiscoverysf bigdata
Chester Chen
 
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
Chester Chen
 
SF Big Analytics 2019112: Uncovering performance regressions in the TCP SACK...
 SF Big Analytics 2019112: Uncovering performance regressions in the TCP SACK... SF Big Analytics 2019112: Uncovering performance regressions in the TCP SACK...
SF Big Analytics 2019112: Uncovering performance regressions in the TCP SACK...
Chester Chen
 
SFBigAnalytics_20190724: Monitor kafka like a Pro
SFBigAnalytics_20190724: Monitor kafka like a ProSFBigAnalytics_20190724: Monitor kafka like a Pro
SFBigAnalytics_20190724: Monitor kafka like a Pro
Chester Chen
 
SF Big Analytics 2019-06-12: Managing uber's data workflows at scale
SF Big Analytics 2019-06-12: Managing uber's data workflows at scaleSF Big Analytics 2019-06-12: Managing uber's data workflows at scale
SF Big Analytics 2019-06-12: Managing uber's data workflows at scale
Chester Chen
 
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
Chester Chen
 
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftSF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
Chester Chen
 
SFBigAnalytics- hybrid data management using cdap
SFBigAnalytics- hybrid data management using cdapSFBigAnalytics- hybrid data management using cdap
SFBigAnalytics- hybrid data management using cdap
Chester Chen
 
Sf big analytics: bighead
Sf big analytics: bigheadSf big analytics: bighead
Sf big analytics: bighead
Chester Chen
 
Sf big analytics_2018_04_18: Evolution of the GoPro's data platform
Sf big analytics_2018_04_18: Evolution of the GoPro's data platformSf big analytics_2018_04_18: Evolution of the GoPro's data platform
Sf big analytics_2018_04_18: Evolution of the GoPro's data platform
Chester Chen
 
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Chester Chen
 
2018 data warehouse features in spark
2018   data warehouse features in spark2018   data warehouse features in spark
2018 data warehouse features in spark
Chester Chen
 
2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3 2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3
Chester Chen
 
2018 02 20-jeg_index
2018 02 20-jeg_index2018 02 20-jeg_index
2018 02 20-jeg_index
Chester Chen
 
Index conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreathIndex conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreath
Chester Chen
 
Index conf sparkai-feb20-n-pentreath
Index conf sparkai-feb20-n-pentreathIndex conf sparkai-feb20-n-pentreath
Index conf sparkai-feb20-n-pentreath
Chester Chen
 

More from Chester Chen (20)

SFBigAnalytics_SparkRapid_20220622.pdf
SFBigAnalytics_SparkRapid_20220622.pdfSFBigAnalytics_SparkRapid_20220622.pdf
SFBigAnalytics_SparkRapid_20220622.pdf
 
SF Big Analytics 2022-03-15: Persia: Scaling DL Based Recommenders up to 100 ...
SF Big Analytics 2022-03-15: Persia: Scaling DL Based Recommenders up to 100 ...SF Big Analytics 2022-03-15: Persia: Scaling DL Based Recommenders up to 100 ...
SF Big Analytics 2022-03-15: Persia: Scaling DL Based Recommenders up to 100 ...
 
SF Big Analytics talk: NVIDIA FLARE: Federated Learning Application Runtime E...
SF Big Analytics talk: NVIDIA FLARE: Federated Learning Application Runtime E...SF Big Analytics talk: NVIDIA FLARE: Federated Learning Application Runtime E...
SF Big Analytics talk: NVIDIA FLARE: Federated Learning Application Runtime E...
 
A missing link in the ML infrastructure stack?
A missing link in the ML infrastructure stack?A missing link in the ML infrastructure stack?
A missing link in the ML infrastructure stack?
 
Shopify datadiscoverysf bigdata
Shopify datadiscoverysf bigdataShopify datadiscoverysf bigdata
Shopify datadiscoverysf bigdata
 
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
 
SF Big Analytics 2019112: Uncovering performance regressions in the TCP SACK...
 SF Big Analytics 2019112: Uncovering performance regressions in the TCP SACK... SF Big Analytics 2019112: Uncovering performance regressions in the TCP SACK...
SF Big Analytics 2019112: Uncovering performance regressions in the TCP SACK...
 
SFBigAnalytics_20190724: Monitor kafka like a Pro
SFBigAnalytics_20190724: Monitor kafka like a ProSFBigAnalytics_20190724: Monitor kafka like a Pro
SFBigAnalytics_20190724: Monitor kafka like a Pro
 
SF Big Analytics 2019-06-12: Managing uber's data workflows at scale
SF Big Analytics 2019-06-12: Managing uber's data workflows at scaleSF Big Analytics 2019-06-12: Managing uber's data workflows at scale
SF Big Analytics 2019-06-12: Managing uber's data workflows at scale
 
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
 
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftSF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
 
SFBigAnalytics- hybrid data management using cdap
SFBigAnalytics- hybrid data management using cdapSFBigAnalytics- hybrid data management using cdap
SFBigAnalytics- hybrid data management using cdap
 
Sf big analytics: bighead
Sf big analytics: bigheadSf big analytics: bighead
Sf big analytics: bighead
 
Sf big analytics_2018_04_18: Evolution of the GoPro's data platform
Sf big analytics_2018_04_18: Evolution of the GoPro's data platformSf big analytics_2018_04_18: Evolution of the GoPro's data platform
Sf big analytics_2018_04_18: Evolution of the GoPro's data platform
 
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
 
2018 data warehouse features in spark
2018   data warehouse features in spark2018   data warehouse features in spark
2018 data warehouse features in spark
 
2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3 2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3
 
2018 02 20-jeg_index
2018 02 20-jeg_index2018 02 20-jeg_index
2018 02 20-jeg_index
 
Index conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreathIndex conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreath
 
Index conf sparkai-feb20-n-pentreath
Index conf sparkai-feb20-n-pentreathIndex conf sparkai-feb20-n-pentreath
Index conf sparkai-feb20-n-pentreath
 

Recently uploaded

一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
zwunae
 
J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang,  ICLR 2024, MLILAB, KAIST AI.pdfJ.Yang,  ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
MLILAB
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation & Control
 
ethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.pptethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.ppt
Jayaprasanna4
 
ASME IX(9) 2007 Full Version .pdf
ASME IX(9)  2007 Full Version       .pdfASME IX(9)  2007 Full Version       .pdf
ASME IX(9) 2007 Full Version .pdf
AhmedHussein950959
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Dr.Costas Sachpazis
 
WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
AafreenAbuthahir2
 
block diagram and signal flow graph representation
block diagram and signal flow graph representationblock diagram and signal flow graph representation
block diagram and signal flow graph representation
Divya Somashekar
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Teleport Manpower Consultant
 
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
bakpo1
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
Amil Baba Dawood bangali
 
AP LAB PPT.pdf ap lab ppt no title specific
AP LAB PPT.pdf ap lab ppt no title specificAP LAB PPT.pdf ap lab ppt no title specific
AP LAB PPT.pdf ap lab ppt no title specific
BrazilAccount1
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
Kamal Acharya
 
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
ydteq
 
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdfGoverning Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
WENKENLI1
 
The role of big data in decision making.
The role of big data in decision making.The role of big data in decision making.
The role of big data in decision making.
ankuprajapati0525
 
Hierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power SystemHierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power System
Kerry Sado
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
MdTanvirMahtab2
 
Investor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptxInvestor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptx
AmarGB2
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
JoytuBarua2
 

Recently uploaded (20)

一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
 
J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang,  ICLR 2024, MLILAB, KAIST AI.pdfJ.Yang,  ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
 
ethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.pptethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.ppt
 
ASME IX(9) 2007 Full Version .pdf
ASME IX(9)  2007 Full Version       .pdfASME IX(9)  2007 Full Version       .pdf
ASME IX(9) 2007 Full Version .pdf
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
 
WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
 
block diagram and signal flow graph representation
block diagram and signal flow graph representationblock diagram and signal flow graph representation
block diagram and signal flow graph representation
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
 
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
 
AP LAB PPT.pdf ap lab ppt no title specific
AP LAB PPT.pdf ap lab ppt no title specificAP LAB PPT.pdf ap lab ppt no title specific
AP LAB PPT.pdf ap lab ppt no title specific
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
 
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
 
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdfGoverning Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
 
The role of big data in decision making.
The role of big data in decision making.The role of big data in decision making.
The role of big data in decision making.
 
Hierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power SystemHierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power System
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
 
Investor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptxInvestor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptx
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
 

zookeeer+raft-2.pdf

  • 1. Zookeeper vs Raft: Stateful distributed coordination with HA and Fault Tolerance Tyler Crain ALLUXIO 1
  • 2. Outline 1. Stateful distributed configuration services 2. What is Alluxio 3. High availability and fault tolerance in Alluxio a. Implementation 1: Zookeeper + UFS (object store) b. Implementation 2: Raft 2
  • 4. 4 Source: kubernetes.io Kubernetes - Container orchestration
  • 5. Kafka - distributed event streaming 5
  • 6. 6 Source: Zhou, Jingyu, et al. "Foundationdb: A distributed unbundled transactional key value store." Proceedings of the 2021 International Conference on Management of Data. 2021. Foundation DB - Transactional Key-Value store
  • 7. 7 Source: Corbett, James C., et al. "Spanner: Google’s globally distributed database." ACM Transactions on Computer Systems (TOCS) 31.3 (2013): 1-22. Google Spanner - distributed database
  • 8. 8 Source: Shute, Jeff, et al. "F1: A distributed SQL database that scales." (2013). Google F1 (Spanner) - SQL Database
  • 10. What we have seen ● Metadata ● Configuration ● Cluster health/maintenance ● Naming/service discovery ● Leader Election 10 Small subset of system, but needs strong properties ● High availability ● Fault tolerance ● Strong consistency
  • 13. Alluxio Architecture 13 ● UFS (Under File System) ○ E.g. Amazon S3, HDFS, etc ○ Object stores
  • 14. Fault tolerance - Workers can be replicated, act as a cache for the various UFS - Many UFS have high availability, fault tolerance guarantees - Master becomes single point of failure 14
  • 15. Journal details 15 - Total order log of state changes - Can recover state by replaying the operations - Snapshots to efficiently store state - Faster recovery - Smaller size
  • 16. Basic fault tolerance - Create a fault tolerant journal - If the master crashes - Stat a new master - Replay the journal - Start serving clients - Detecting a failure, starting a new node, and replaying the journal takes time - The system will be unavailable during this time 16
  • 17. Basic high availability - Run multiple masters - A primary master will serve requests - Secondary master(s) will replicate the state of the primary master, and take over in case of failure 17
  • 18. Basic highly available/ fault tolerant architecture 18
  • 19. Problems to solve - Ensure a single primary master running at all times - Journal needs to be - Fault tolerant - Masters must agree on a single valid order of journal entries - Consensus 19
  • 21. Ensure a single primary master running at a time - Leader election using Zookeeper - Apache Zookeeper an open-source server which enables highly reliable distributed coordination - Provides a file-system (hierarchical namespace) like abstraction built on top of an Atomic Broadcast (consensus) protocol - Run on a cluster of nodes to provide fault tolerance/high availability 21
  • 22. Apache Zookeeper Curator leader election recipe - Each master subscribes to the leader election recipe - The recipe will elect one of the nodes leader - If the leader fails or is unable to be reached due to network issues, the recipe will elect a new leader 22
  • 23. UFS Journal - Write journal entries to the UFS - Use the availability / fault tolerance guarantees of the UFS 23
  • 24. Does leader election solve all our problems? - Not quite - The Zookeeper recipe will do its best to ensure only a single node is chosen leader at a time, but due to asynchrony in the system two nodes may believe they are leader at the same time - Multiple nodes may be trying to write to the journal concurrently - Note leader election is sufficient to provide consensus, but is not a consensus protocol itself 24
  • 25. Defer to the UFS - Can use the consistency guarantees of the UFS to ensure only a single writer to the journal 25 - Alluxio Primary/ secondary master state transition
  • 26. Ensure a single journal writer to the UFS - Use a mutual exclusion protocol based on log file names - (1, 12) (13, 20) (21, 300) (301, MAX_INT) - New primary master must ensure current log file is “complete” before writing to a new log - (1, 12) (13, 20) (21, 300) (301, 327) - Create a new log - (1, 12) (13, 20) (21, 300) (301, 327) (328, MAX_INT) 26
  • 27. Zookeeper + UFS architecture 27
  • 28. Issues - Relies on multiple systems - Each having their own fault tolerance/availability models - More complicated - Different UFS have different consistency models and performance - May not be efficient for appending log entries 28
  • 30. The journal as a (deterministic) state machine - Input: command - Append - State transition (deterministic): - Add the journal entry to the log 30
  • 31. Raft - replicated state machine - Clients interact with the state machine as if it was a single instance (linearizability) - Clients send commands to the state machine and receive responses - But the state machine is fault tolerant and has high availability - Apache Ratis (java implementation of Raft) 31
  • 32. Forget the Journal as a Log 32 Inode ID Inode metadata Worker location 0 … … 1 … … Key-value store state machine commands: - Write(key, value) - Remove(key) - Read(key) -> value
  • 33. The Alluxio metadata as a replicated state machine - Raft simplifies the task of implementing the journal. - It handles snapshotting and recovery - The journal just becomes a key-value store keeping track of the file-system meta-data - Raft is colocated with the Alluxio masters 33
  • 34. RocksDB as the key-value store - Log-structured merge tree - Efficient inserts - Key-value store - Inode tree as a key-value map - Alluxio provides an additional in memory cache for fast reads - Efficient snapshots 34
  • 35. Primary master protocol - Raft will ensure the journal is consistent and highly available - Still want a single primary master to perform updates to UFS and Alluxio workers and serve clients - Take advantage of the leader election built into Raft. - An additional layer of coordination is used along with raft to reduce the likelihood of concurrent primary masters 35
  • 36. Alluxio + Raft architecture 36
  • 37. Advantages - Simplicity - No external systems (raft colocated with masters) - Journal as a key-value store (higher level abstraction) - Performance - Journal operations stored locally on masters - RocksDB as an efficient key-value store 37
  • 38. From our earlier examples - Kubernetes - etcd - key/value store run on Raft - Kafka - coordinator running Zookeeper - Foundation DB - coordinators running Paxos - Generally a paxos based replicated state machine of a simple data store with some custom application logic running over this 38
  • 40. More information on Alluxio - Web - alluxio.io - Github - https://github.com/Alluxio/alluxio - Slack - alluxio-community.slack.com 40