1 © Hortonworks Inc. 2011–2018. All rights reserved
High throughput data replication over RAFT
Mukul Kumar Singh, Staff Software Engineer, Hortonworks
Lokesh Jain, Software Engineer, Hortonworks
Speakers
Mukul Kumar Singh
• msingh@apache.org
• Staff Software Engineer, Hortonworks
• ASF
• Committer for Apache Hadoop
• Committer for Apache Ratis
• MS from Carnegie Mellon University, Pittsburgh
Lokesh Jain
• ljain@apache.org
• Software Engineer, Hortonworks
• ASF
• Committer for Apache Ratis
• BE (Hons) Computer Science & M.Sc. (Hons) Mathematics from BITS Pilani
Raft
• Raft is a consensus algorithm
• Works as long as a majority of the nodes in the cluster are alive
• i.e., it can tolerate the loss of a minority of the nodes.
• “In Search of an Understandable Consensus Algorithm”
• by Diego Ongaro and John Ousterhout
• USENIX ATC’14, https://raft.github.io
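The majority rule above comes down to simple arithmetic. A minimal sketch follows; the function names are illustrative and do not come from any Raft library.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of servers that forms a majority of the cluster."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """Largest number of servers that can be lost while a majority survives."""
    return cluster_size - majority(cluster_size)
```

For a 3-node group the majority is 2, so one node may fail. Going from 3 to 4 nodes does not raise the number of tolerated failures, which is why odd group sizes are the usual choice.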
Raft Library
• Our Motivations
• Use Raft in Ozone
• “In Search of a Usable Raft Library”
• A long list of Raft implementations is available
• None of them is a general library ready to be consumed by other projects.
• Most of them are tied to another project or a part of another project.
• We need a Raft library!
Raft Basics
• Leader Election
• Servers start as Followers
• A Follower that times out after a randomized interval becomes a Candidate and starts a leader election
• Candidate sends requestVote to other servers
• It becomes the leader once it gets a majority of the votes.
• Append Entries
• Clients send requests to the Leader
• Leader forwards the requests to the Followers
• Leader sends appendEntries to Followers
• When there are no client requests, the Leader still sends empty appendEntries (heartbeats) to Followers to maintain leadership
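The election steps above can be sketched in a few lines. This is a simplified illustration, not Ratis code: the 150 to 300 ms timeout range is an assumed value, and terms, log comparison and RPCs are left out entirely.

```python
import random


def pick_candidate(server_ids, rng):
    """Give every follower a randomized election timeout and return the
    server whose timeout fires first; that server becomes the candidate."""
    # Assumed range in milliseconds; real deployments tune this.
    timeouts = {sid: rng.uniform(150, 300) for sid in server_ids}
    return min(timeouts, key=timeouts.get)


def wins_election(votes_granted_by_peers: int, cluster_size: int) -> bool:
    """A candidate votes for itself and wins once it holds a majority."""
    votes = 1 + votes_granted_by_peers
    return votes > cluster_size // 2
```

Randomizing the timeout makes it unlikely that two followers become candidates at the same moment, which is what keeps split votes rare.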
Apache Ratis
Data Intensive Applications
• In Raft,
• All transactions and the data are written in the log
• Not suitable for data intensive applications
• In Ratis
• An application can choose not to write all of its data to the log
• State machine data and log data can be separately managed
• See the FileStore example in ratis-examples
• See the ContainerStateMachine as an implementation in Apache Hadoop Ozone.
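The idea of keeping bulk data out of the log can be illustrated with a toy store. This is a sketch under stated assumptions, not the FileStore or ContainerStateMachine code: `LogEntry` and `DataSeparatingStore` are hypothetical names, and a dictionary stands in for files on disk.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    """Small metadata record; the payload itself is not stored here."""
    index: int
    path: str
    length: int
    checksum: str


class DataSeparatingStore:
    def __init__(self):
        self.log = []    # replicated log: metadata only
        self.data = {}   # stands in for files managed by the state machine

    def write(self, path: str, payload: bytes) -> LogEntry:
        # The bulk data is stored once, by the state machine.
        self.data[path] = payload
        # The log records only what is needed to identify and verify it.
        entry = LogEntry(
            index=len(self.log),
            path=path,
            length=len(payload),
            checksum=hashlib.sha256(payload).hexdigest(),
        )
        self.log.append(entry)
        return entry
```

With this split, a 1 MB write adds 1 MB to the data store but only a few dozen bytes of metadata to the log.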
Ratis: Standard Raft Features
• Leader Election + Log Replication
• Automatically elect a leader among the servers in a Raft group
• Randomized timeout for avoiding split votes
• Log is replicated in the Raft group
• Membership Changes
• Members in a Raft group can be reconfigured at runtime
• Replication factor can be changed at runtime
• Log Compaction
• Snapshot is taken periodically
• Send snapshot instead of a long log history.
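Log compaction can be pictured with a small key-value log. The class below is a hypothetical sketch, not the Ratis snapshot mechanism: it folds applied entries into a snapshot and then discards them.

```python
class CompactingKeyValueLog:
    def __init__(self):
        self.snapshot = {}        # state as of snapshot_index
        self.snapshot_index = 0   # index of the last entry covered by the snapshot
        self.entries = []         # (index, key, value) entries after the snapshot

    def append(self, key, value):
        index = self.snapshot_index + len(self.entries) + 1
        self.entries.append((index, key, value))
        return index

    def take_snapshot(self):
        """Fold all log entries into the snapshot, then drop them."""
        for index, key, value in self.entries:
            self.snapshot[key] = value
            self.snapshot_index = index
        self.entries.clear()

    def state(self):
        """Current state: the snapshot plus any entries appended since."""
        state = dict(self.snapshot)
        for _, key, value in self.entries:
            state[key] = value
        return state
```

A follower that has fallen far behind can be sent the snapshot and only the entries after `snapshot_index`, rather than the whole history.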
Ratis: Pluggability
• Pluggable state machine
• Application must define its state machine
• Example: a key-value map
• Pluggable RPC
• Users may provide their own RPC implementation
• Default implementations: gRPC, Netty, Hadoop RPC
• gRPC allows native clients to be implemented
• Pluggable Raft log
• Users may provide their own log implementation
• The default implementation stores the log in local files
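To make the pluggable state machine idea concrete, here is a sketch of the key-value map example. The `StateMachine` base class and its `apply` method are illustrative stand-ins, not the Ratis interface.

```python
from abc import ABC, abstractmethod


class StateMachine(ABC):
    """What the replication layer needs from the application (hypothetical)."""

    @abstractmethod
    def apply(self, command):
        """Apply one committed command and return its result."""


class KeyValueStateMachine(StateMachine):
    def __init__(self):
        self.store = {}

    def apply(self, command):
        op, key, *rest = command
        if op == "put":
            self.store[key] = rest[0]
            return rest[0]
        if op == "delete":
            return self.store.pop(key, None)
        raise ValueError(f"unknown operation: {op}")


def replay(log, state_machine):
    """Apply every command of a log, in order, to a state machine."""
    for command in log:
        state_machine.apply(command)
    return state_machine
```

Because `apply` is deterministic, two replicas that replay the same log end up with the same map, which is the property replication relies on.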
Ratis: Asynchronous/Synchronous APIs
• Using gRPC bi-directional stream API
• Netty and Hadoop RPC could support async, but this is not yet implemented
• Server-to-server
• Asynchronous append entries
• Client-to-server
• Asynchronous client requests
General Ratis Use Cases
• You want to:
• (1) replicate the server log/states to multiple machines
• The replication number/cluster membership can be changed at runtime
• It can tolerate server failures.
• or
• (2) have an HA (highly available) service
• When a server fails, another server will automatically take over.
• Clients automatically fail over to the new server.
• Apache Ratis is for you!
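Client failover, as described in use case (2), can be sketched as a retry loop. Everything here is hypothetical: `Server`, `NotLeaderError` and `send_with_failover` are illustrative names, and a real client would also cache the leader and back off between attempts.

```python
class NotLeaderError(Exception):
    """Raised by a server that is alive but is not the current leader."""


class Server:
    def __init__(self, name, is_leader=False, alive=True):
        self.name = name
        self.is_leader = is_leader
        self.alive = alive

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is unreachable")
        if not self.is_leader:
            raise NotLeaderError(self.name)
        return f"{self.name} handled {request}"


def send_with_failover(servers, request):
    """Try each server in turn until the current leader answers."""
    for server in servers:
        try:
            return server.handle(request)
        except (ConnectionError, NotLeaderError):
            continue
    raise RuntimeError("no leader reachable")
```

If the old leader has failed and another server has taken over, the same call succeeds against the new leader without any change on the caller's side.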
API
• Client Side APIs
• send/sendReadOnly
• Read-only commands do not change the state of the Raft server.
• Async versions are also available (sendAsync, sendReadOnlyAsync)
• Server Side APIs
• applyTransaction
• Applies the transaction to the state machine
• writeStateMachineData
• An optimization to avoid the double write penalty for data intensive applications.
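The difference between state-changing and read-only requests can be shown with a toy counter server. This is an illustrative sketch, not the Ratis client or server API: the class, its method names and the two commands are assumptions made for the example.

```python
class SketchServer:
    def __init__(self):
        self.log = []      # stands in for the replicated Raft log
        self.counter = 0   # state machine state

    def send(self, command):
        """State-changing request: recorded in the log, then applied."""
        if command != "INCREMENT":
            raise ValueError(f"unknown command: {command}")
        self.log.append(command)
        return self._apply_transaction(command)

    def send_read_only(self, query):
        """Read-only request: answered from state, nothing is logged."""
        if query != "GET":
            raise ValueError(f"unknown query: {query}")
        return self.counter

    def _apply_transaction(self, command):
        self.counter += 1
        return self.counter
```

Only `send` grows the log; `send_read_only` leaves both the log and the state untouched.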
High Throughput Data Pipeline
Building a high performance data pipeline
• Requirements
• High data write throughput
• Parallelism/async interface
• Large number of transactions per second
• Configurable parameters
• Support for security
Building a high performance data pipeline
• Optimizations
• Separate user data from the raft log
– Avoids double write penalty for data
• Efficient batching of raft log entries
– High write performance during local disk write
– Efficient network replication
• Async processing of operations
– Client ops
– Append entries to followers
– StateMachine implementation
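Batching of log entries can be sketched as grouping consecutive entries up to a size limit, so that each group costs one disk write or one network message. The function and its `max_batch_bytes` parameter are hypothetical and do not reflect how Ratis sizes its batches.

```python
def batch_entries(entries, max_batch_bytes):
    """Group consecutive entries into batches whose total size stays within
    max_batch_bytes. An entry larger than the limit gets a batch of its own.
    Entry order is preserved."""
    batches, current, current_size = [], [], 0
    for entry in entries:
        if current and current_size + len(entry) > max_batch_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(entry)
        current_size += len(entry)
    if current:
        batches.append(current)
    return batches
```

Fewer, larger writes suit a sequential journal on disk, and the same grouping reduces the number of appendEntries messages sent to followers.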
FileStoreStateMachine
• Located at org.apache.ratis.examples.filestore
• Simple state machine implementation to write bytes to a file
• Separates file data from raft log.
• File data written is persisted to disk
• Client generates random bytes of the specified file size
• Client uses writeAsync
Performance Benchmarking
• Setup: 3 nodes, each with
• Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
• 256GiB System memory
• 10 Gigabit Network Connection
• 4 HGST (HUS726060AL4210) HDD of 5.5TB each
Performance – Write Throughput
[Chart: Write throughput for 1GB. X-axis: File Size in KB, from 128000 down to 100. Y-axis: Data throughput in MB/s, from 0 to 300.]
Performance – Transactions per second
[Chart: Number of transactions with 100000 files. X-axis: File size in bytes (100000, 10000, 1000, 100, 10). Y-axis: Number of transactions per second, from 0 to 12000.]
Ozone
[Diagram: Ozone Client, Ozone Master, Storage Container Manager, and three datanodes (DN) replicating via Ratis. Labeled interactions: Get Block; Get Container Location (List of DNs); Write Data.]
Terminology
• OM – Ozone Master
• Namespace manager inside Ozone; manages the key name to block ID mapping.
• Also manages volume, bucket and key namespaces
• SCM – Storage Container Manager
• Block manager; manages cluster membership, container location information, and containers
• Datanode
• Used to store user data; a Ratis server is spawned inside the datanode
• Ozone datanodes persist containers; blocks are allocated out of containers.
Storage Container
• Hadoop Distributed Data Storage (HDDS) introduces Storage Containers
• Provide generic data storage functionalities.
• Configurable Size (2GB - 16GB+)
• Unit of management and replication in SCM.
• Blocks are allocated from a container
• BID = CID + LocalID
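The relation BID = CID + LocalID can be read as a two-part identifier. The sketch below assumes integer IDs and a sequential local counter; the slides do not say how Ozone actually generates local IDs, so `BlockID` and `Container` here are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockID:
    container_id: int   # CID: which container holds the block
    local_id: int       # LocalID: identifies the block within that container


class Container:
    def __init__(self, container_id: int):
        self.container_id = container_id
        self._next_local_id = 1   # assumed: simple sequential allocation

    def allocate_block(self) -> BlockID:
        block = BlockID(self.container_id, self._next_local_id)
        self._next_local_id += 1
        return block
```

Because the container ID is part of the block ID, locating a block only requires finding its container, which is the unit SCM manages and replicates.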
Use of Ratis in Ozone
• Replicating data in open containers
• Replication of user data using Ratis
• Support HA in Storage Container Manager
• Work in Progress
• Support HA in Ozone Manager
• Work in Progress
Ozone Ratis Commands
• The Ozone data pipeline involves interaction between the client and the datanode.
• Commands are marked as read-only if they do not change the state of the datanode.
• Read-only: GetKey, ReadChunk, ReadContainer
• State-changing: WriteChunk, PutKey, CreateContainer, etc.
• The Ozone client sends container commands to the leader datanode using the Ratis protocol (gRPC as the underlying RPC)
Command Replication on Containers
[Diagram: a Leader and two Followers. Labels: Write Chunk, CSM, Response.]
Open Container Replication using Ratis
• Ratis is used for replication of data being written to Ozone Datanodes.
• Ratis replicates container commands on open containers.
• Ozone Datanode provides its own state machine implementation
• This implementation handles various datanode commands (write chunk, put key, create container)
• Performance optimizations
• To avoid writing data to disk twice, the state machine implementation separates user data from block/chunk metadata.
• Multiple chunks are written in parallel.
• Append requests from the Leader to Followers are asynchronous, which allows multiple appends in parallel.
• Raft journal on a separate disk: fast contiguous writes without seeking
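Writing multiple chunks in parallel can be sketched with a thread pool. This is not the ContainerStateMachine code: the function names, the chunk naming and the worker count are assumptions made for the example.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_chunk(directory, name, data):
    """Write one chunk to its own file and report how many bytes were written."""
    path = os.path.join(directory, name)
    with open(path, "wb") as f:
        f.write(data)
    return name, len(data)


def write_chunks_in_parallel(directory, chunks, max_workers=4):
    """Submit every chunk write to a thread pool and wait for all of them."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(write_chunk, directory, name, data)
            for name, data in chunks.items()
        ]
        return dict(f.result() for f in futures)
```

Since each chunk goes to its own file, the writes do not depend on one another and can overlap; any ordering that matters is left to the metadata commands that follow.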
Ozone Data Write Performance
• The performance numbers were taken for different key sizes with 10 clients writing in parallel.
• Measures end-to-end throughput
• Key allocation in OM and block allocation in SCM are also included in the total throughput.
• Ozone Client
• Uses sync APIs to write data to the datanodes
• ContainerStateMachine implementation
• Parallelize write chunk operations
Key Size             10 MB    100 MB
Throughput (MB/s)    81.3     110.3
Summary
• Ratis is a Java-based implementation of the Raft protocol
• Essentially a replicated state machine.
• Suitable for data intensive applications.
• Features
• Sync/Async client APIs
• Pluggable StateMachine
• Pluggable Raft Log Implementation
• Performance
• Write throughput: 250–300 MB/s
• IOPS: 10,000 txns/s
Contributors
• A big thanks to all the contributors to Apache Ratis, Apache Hadoop and Ozone
• Animesh Trivedi, Anu Engineer, Arpit Agarwal, Brent,
• Chen Liang, Chris Nauroth, Devaraj Das, Enis Soztutar,
• garvit, Hanisha Koneru, Hugo Louro, Jakob Homan,
• Jian He, Jing Chen, Jing Zhao, Jitendra Pandey, Junping Du,
• kaiyangzhang, Karl Heinz Marbaise, Li Lu, Lokesh Jain,
• Marton Elek, Mayank Bansal, Mingliang Liu,
• Mukul Kumar Singh, Sen Zhang, Shashikant Banerjee, Sriharsha Chintalapani, Tsz Wo Nicholas Sze,
• Uma Maheswara Rao G, Venkat Ranganathan, Wangda Tan,
• Weiqing Yang, Will Xu, Xiaobing Zhou, Xiaoyu Yao, Yubo Xu,
• yue liu, Zhiyuan Yang
Apache Ratis & Apache Hadoop Ozone
• Contributions are welcome!
• Ratis
• http://ratis.incubator.apache.org
• dev@ratis.incubator.apache.org
• Ozone
• http://hadoop.apache.org
• hdfs-dev@hadoop.apache.org
Questions?
Thank you

Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
 

High throughput data replication over RAFT

  • 1. 1 © Hortonworks Inc. 2011–2018. All rights reserved High throughput data replication over RAFT Mukul Kumar Singh, Staff Software Engineer, Hortonworks Lokesh Jain, Software Engineer, Hortonworks
  • 2. Speakers • Mukul Kumar Singh • msingh@apache.org • Staff Software Engineer, Hortonworks • ASF • Committer for Apache Hadoop • Committer for Apache Ratis • MS from Carnegie Mellon University, Pittsburgh • Lokesh Jain • ljain@apache.org • Software Engineer, Hortonworks • ASF • Committer for Apache Ratis • BE(Hons) Computer Science & M.Sc. (Hons) Mathematics from BITS Pilani
  • 3. Raft
  • 4. Raft • Raft is a consensus algorithm • Works when a majority of the nodes in the cluster are alive • i.e. it can tolerate the loss of a minority of the nodes • “In Search of an Understandable Consensus Algorithm” • by Diego Ongaro and John Ousterhout • USENIX ATC’14, https://raft.github.io
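The majority rule on slide 4 can be made concrete with a little arithmetic. The following is a minimal sketch, not code from Ratis; the function names `majority` and `tolerated_failures` are made up for illustration.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """Nodes that can be lost while a majority is still alive."""
    return cluster_size - majority(cluster_size)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n} nodes: majority={majority(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

Under this rule, growing a group from 3 to 4 nodes does not increase the number of failures it can tolerate, which is one reason odd group sizes are common.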
  • 5. Raft Library • Our Motivations • Use Raft in Ozone • “In Search of a Usable Raft Library” • A long list of Raft implementations is available • None of them is a general library ready to be consumed by other projects • Most of them are tied to, or are part of, another project • We need a Raft library!
  • 6. Raft Basics • Leader Election • Servers start as Followers • A Follower times out after a randomized interval, becomes a Candidate and starts a leader election • The Candidate sends requestVote to the other servers • It becomes the Leader once it gets a majority of the votes • Append Entries • Clients send requests to the Leader • The Leader forwards the requests to the Followers • The Leader sends appendEntries to the Followers • When there are no client requests, the Leader still sends empty appendEntries (heartbeats) to the Followers to maintain leadership
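The election steps on slide 6 can be sketched as a single simplified round. This is an illustration under strong assumptions, not the Ratis implementation: terms and log comparison are omitted, every reachable server grants its vote, and the 150 to 300 ms timeout range and the name `run_election` are chosen only for the example.

```python
import random


def run_election(server_ids, reachable, rng, timeout_range_ms=(150, 300)):
    """Simulate one simplified election round and return the leader or None."""
    live = [s for s in server_ids if s in reachable]
    if not live:
        return None
    # Every live follower draws a randomized election timeout.
    timeouts = {s: rng.uniform(*timeout_range_ms) for s in live}
    # The follower whose timeout fires first becomes the candidate.
    candidate = min(timeouts, key=timeouts.get)
    # The candidate votes for itself and sends requestVote to the others;
    # in this sketch every reachable server grants its vote.
    votes = 1 + sum(1 for s in live if s != candidate)
    needed = len(server_ids) // 2 + 1
    return candidate if votes >= needed else None


if __name__ == "__main__":
    servers = ["s1", "s2", "s3"]
    print(run_election(servers, set(servers), random.Random()))
```

The randomized timeout makes it unlikely that two followers become candidates at the same moment, which is how split votes are avoided in practice.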
  • 7. Apache Ratis
  • 8. Data Intensive Applications • In Raft • All transactions and the data are written to the log • Not suitable for data intensive applications • In Ratis • An application can choose not to write all of its data to the log • State machine data and log data can be managed separately • See the FileStore example in ratis-examples • See ContainerStateMachine in Apache Hadoop Ozone for an implementation
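The idea on slide 8 of keeping bulk data out of the log can be sketched as follows. This is a hypothetical illustration, not the FileStore or ContainerStateMachine code: `LogEntry`, `SplitStore` and the in-memory `files` dictionary (a stand-in for files on disk) are assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    """Metadata only: the payload bytes are never stored in the log."""
    index: int
    path: str
    offset: int
    length: int
    sha256: str


class SplitStore:
    def __init__(self):
        self.log = []     # small metadata entries
        self.files = {}   # path -> bytearray, stand-in for data files on disk

    def write(self, path: str, offset: int, data: bytes) -> LogEntry:
        # The payload is written once, directly to the data file.
        buf = self.files.setdefault(path, bytearray())
        if len(buf) < offset:
            buf.extend(b"\0" * (offset - len(buf)))
        buf[offset:offset + len(data)] = data
        # The log records only where the data went and a checksum of it.
        entry = LogEntry(len(self.log), path, offset, len(data),
                         hashlib.sha256(data).hexdigest())
        self.log.append(entry)
        return entry
```

Because the log entry carries only a path, offset, length and checksum, its size stays constant no matter how large the payload is, so the payload is written to disk once rather than twice.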
  • 9. Ratis: Standard Raft Features • Leader Election + Log Replication • Automatically elect a leader among the servers in a Raft group • Randomized timeout for avoiding split votes • Log is replicated in the Raft group • Membership Changes • Members in a Raft group can be reconfigured at runtime • Replication factor can be changed at runtime • Log Compaction • Snapshot is taken periodically • Send a snapshot instead of a long log history
  • 10. Ratis: Pluggability • Pluggable state machine • Application must define its state machine • Example: a key-value map • Pluggable RPC • Users may provide their own RPC implementation • Default implementations: gRPC, Netty, Hadoop RPC • gRPC allows native clients to be implemented • Pluggable Raft log • Users may provide their own log implementation • The default implementation stores the log in local files
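Slide 10 gives a key-value map as an example of an application-defined state machine. A minimal sketch of that idea follows. It is not the Ratis StateMachine interface: the class name, the `apply_transaction` method and the tuple command format are assumptions made for illustration.

```python
class KeyValueStateMachine:
    """Applies committed log entries, in order, to an in-memory map."""

    def __init__(self):
        self.data = {}
        self.last_applied = -1

    def apply_transaction(self, index, command):
        # Entries must be applied exactly once and in log order.
        if index != self.last_applied + 1:
            raise ValueError(f"expected index {self.last_applied + 1}, got {index}")
        op, key, *args = command
        if op == "put":
            self.data[key] = args[0]
            result = args[0]
        elif op == "delete":
            result = self.data.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
        self.last_applied = index
        return result


if __name__ == "__main__":
    log = [("put", "a", 1), ("put", "b", 2), ("delete", "a")]
    replica = KeyValueStateMachine()
    for i, cmd in enumerate(log):
        replica.apply_transaction(i, cmd)
    print(replica.data)
```

Any two replicas that apply the same log in the same order end up with the same map, which is the property the replicated log exists to provide.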
  • 11. Ratis: Asynchronous/Synchronous APIs • Using the gRPC bi-directional stream API • Netty and Hadoop RPC can support async, but it is not yet implemented • Server-to-server • Asynchronous append entries • Client-to-server • Asynchronous client requests
  • 12. General Ratis Use Cases • You want to: • (1) replicate the server log/states to multiple machines • The replication number/cluster membership can be changed at runtime • It can tolerate server failures • or • (2) have an HA (highly available) service • When a server fails, another server will automatically take over • Clients automatically fail over to the new server • Apache Ratis is for you!
  • 13. API • Client Side APIs • Send/SendReadOnly • Read-only commands do not change the state of the Raft server • Async versions are also available (sendAsync, sendReadOnlyAsync) • Server Side APIs • applyTransaction • Applies the transaction to the state machine • writeStateMachineData • An optimization to avoid the double write penalty for data intensive applications
  • 14. High Throughput Data Pipeline
  • 15. Building a high performance data pipeline • Requirements • High data write throughput • Parallelism/async interface • Large number of transactions per second • Configurable parameters • Support for security
  • 16. Building a high performance data pipeline • Optimizations • Separate user data from the raft log – Avoids double write penalty for data • Efficient batching of raft log entries – High write performance during local disk write – Efficient network replication • Async processing of operations – Client ops – Append entries to followers – StateMachine implementation
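The batching optimization on slide 16 can be sketched as a buffer that groups small log entries into larger writes. This is a hypothetical illustration, not the Ratis log implementation: the `LogBatcher` class, its `max_bytes` limit and the `flush` callback are assumptions.

```python
class LogBatcher:
    """Buffers small entries and hands them to `flush` as one larger write."""

    def __init__(self, flush, max_bytes=4096):
        self._flush = flush          # callback that performs the actual write
        self._max_bytes = max_bytes  # flush before a batch would exceed this
        self._pending = []
        self._pending_bytes = 0

    def append(self, entry: bytes) -> None:
        # Flush first if adding this entry would overflow the current batch.
        if self._pending and self._pending_bytes + len(entry) > self._max_bytes:
            self.flush()
        self._pending.append(entry)
        self._pending_bytes += len(entry)

    def flush(self) -> None:
        if self._pending:
            self._flush(b"".join(self._pending))
            self._pending = []
            self._pending_bytes = 0


if __name__ == "__main__":
    writes = []
    batcher = LogBatcher(writes.append, max_bytes=10)
    for entry in (b"aaaa", b"bbbb", b"cccc", b"dd"):
        batcher.append(entry)
    batcher.flush()
    print(len(writes), "writes for 4 entries")
```

The same grouping helps on the network side: one appendEntries message carrying several entries costs fewer round trips than one message per entry.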
  • 17. FileStoreStateMachine • Located at org.apache.ratis.examples.filestore • Simple state machine implementation to write bytes to a file • Separates file data from raft log. • File data written is persisted to disk • Client generates random bytes of the specified file size • Client uses writeAsync
  • 18. Performance Benchmarking • Setup: 3 nodes with • Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz • 256GiB system memory • 10 Gigabit network connection • 4 HGST (HUS726060AL4210) HDDs of 5.5TB each
  • 19. Performance – Write Throughput • [Chart: Write throughput for 1GB. X-axis: file size in KB, from 128000 down to 100. Y-axis: data throughput in MB/s, 0 to 300.]
  • 20. Performance – Transactions per second • [Chart: Number of transactions with 100000 files. X-axis: file size in bytes, from 100000 down to 10. Y-axis: number of transactions per second, 0 to 12000.]
  • 21. Ozone
  • 22. [Architecture diagram. Components: Ozone Client, three DNs (datanodes), RATIS, Ozone Master, Storage Container Manager. Flow labels: Get Block, Get Container Location (List of DNs), Write Data.]
  • 23. Terminologies • OM – Ozone Master • Namespace manager inside Ozone; manages the key name to block id mapping • Also manages volume, bucket and key namespaces • SCM – Storage Container Manager • Block manager; manages cluster membership, container location information and containers • Datanode • Used to store user data; a Ratis server is spawned inside the datanode • Ozone datanodes persist containers; blocks are allocated out of containers
  • 24. Storage Container • Hadoop Distributed Data Storage (HDDS) introduces Storage Containers • Provide generic data storage functionalities • Configurable size (2GB - 16GB+) • Unit of management and replication in SCM • Blocks are allocated from containers • BID = CID + LocalID
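Slide 24 states that a block ID is composed of a container ID and a local ID. The sketch below reads the "+" as composition of the two parts into a pair rather than numeric addition. That reading is an assumption, as are the `BlockID` and `Container` names and the sequential local ID counter. It does not reflect the actual HDDS encoding.

```python
from typing import NamedTuple


class BlockID(NamedTuple):
    container_id: int  # which container holds the block
    local_id: int      # identifies the block within that container


class Container:
    def __init__(self, container_id: int):
        self.container_id = container_id
        self._next_local_id = 0

    def allocate_block(self) -> BlockID:
        """Allocate a block out of this container."""
        block = BlockID(self.container_id, self._next_local_id)
        self._next_local_id += 1
        return block


if __name__ == "__main__":
    container = Container(7)
    print(container.allocate_block(), container.allocate_block())
```

With this composition, locating a block needs only a lookup of which datanodes hold the container, because the container ID is part of the block ID.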
  • 25. Use of Ratis in Ozone • Replicating data in open containers • Replication of user data using Ratis • Support HA in Storage Container Manager • Work in Progress • Support HA in Ozone Manager • Work in Progress
  • 26. Ozone Ratis Commands • The Ozone data pipeline involves interaction between the client and the datanodes • Commands are marked as read-only if they do not change the state of the datanode • Read-only: GetKey, ReadChunk, ReadContainer • State-changing: WriteChunk, PutKey, CreateContainer, etc. • The Ozone client sends container commands to the leader datanode using the Ratis protocol (gRPC as the underlying RPC)
  • 27. Command Replication on Containers • [Diagram: one Leader and two Followers. Labels: Write Chunk, CSM, Response.]
  • 28. Open Container Replication using Ratis • Ratis is used for replication of data being written to Ozone datanodes • Ratis replicates container commands on open containers • The Ozone datanode provides its own state machine implementation • This implementation handles various datanode commands (write chunk, put key, create container) • Performance optimizations • To avoid writing data to disk twice, the state machine implementation separates user data from block/chunk metadata • Multiple chunks are written in parallel • Append requests from the Leader to the Followers are made async, which allows multiple appends in parallel • Raft journal on a separate disk – fast contiguous writes without seeking
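The asynchronous appends described on slide 28 can be sketched with `asyncio`: the leader sends to all followers concurrently and treats the entry as committed once a majority, counting itself, has acknowledged. This is an illustrative model only; the function names and the use of `asyncio.sleep` as a stand-in for network and disk latency are assumptions, and retries and failures are not modelled.

```python
import asyncio


async def append_to_follower(name, entry, delay_s):
    # Stand-in for sending appendEntries and waiting for the follower's ack.
    await asyncio.sleep(delay_s)
    return name


async def replicate(entry, follower_delays):
    """Return the nodes that had acknowledged when the entry became committed."""
    needed = (len(follower_delays) + 1) // 2 + 1
    # All appends are issued up front, so they proceed in parallel.
    tasks = [asyncio.create_task(append_to_follower(n, entry, d))
             for n, d in follower_delays.items()]
    acked = ["leader"]  # the leader's own local write
    committed_after = None
    for finished in asyncio.as_completed(tasks):
        acked.append(await finished)
        if committed_after is None and len(acked) >= needed:
            committed_after = list(acked)
    return committed_after


if __name__ == "__main__":
    print(asyncio.run(replicate(b"write chunk", {"f1": 0.01, "f2": 0.05})))
```

In this model the commit latency is set by the fastest majority rather than by the slowest follower, and the slow follower still receives the entry afterwards.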
  • 29. Ozone Data Write Performance • The performance numbers were taken for different key sizes and 10 client writes in parallel • Measures end-to-end throughput • Key allocation in OM and block allocation in SCM also count towards the total throughput • Ozone Client • Uses sync APIs to write data to the datanodes • ContainerStateMachine implementation • Parallelizes write chunk operations • Results: 10 MB keys: 81.3 MB/s; 100 MB keys: 110.3 MB/s
  • 30. Summary • Ratis is a Java-based implementation of the Raft protocol • Essentially a replicated state machine • Suitable for data intensive applications • Features • Sync/Async client APIs • Pluggable StateMachine • Pluggable Raft Log Implementation • Performance • Write throughput - 250 MB/s to 300 MB/s • IOPS - 10,000 txns/s
  • 31. Contributors • A big thanks to all the contributors to Apache Ratis, Apache Hadoop and Ozone • Animesh Trivedi, Anu Engineer, Arpit Agarwal, Brent, Chen Liang, Chris Nauroth, Devaraj Das, Enis Soztutar, garvit, Hanisha Koneru, Hugo Louro, Jakob Homan, Jian He, Jing Chen, Jing Zhao, Jitendra Pandey, Junping Du, kaiyangzhang, Karl Heinz Marbaise, Li Lu, Lokesh Jain, Marton Elek, Mayank Bansal, Mingliang Liu, Mukul Kumar Singh, Sen Zhang, Shashikant Banerjee, Sriharsha Chintalapani, Tsz Wo Nicholas Sze, Uma Maheswara Rao G, Venkat Ranganathan, Wangda Tan, Weiqing Yang, Will Xu, Xiaobing Zhou, Xiaoyu Yao, Yubo Xu, yue liu, Zhiyuan Yang
  • 32. Apache Ratis & Apache Hadoop Ozone • Contributions are welcome! • Ratis • http://ratis.incubator.apache.org • dev@ratis.incubator.apache.org • Ozone • http://hadoop.apache.org • hdfs-dev@hadoop.apache.org
  • 33. Questions?
  • 34. Thank you

Editor's Notes

  1. TALK TRACK Hortonworks Powers the Future of Data: data-in-motion, data-at-rest, and Modern Data Applications. [NEXT SLIDE]