Floating on a Raft
HBase Durability with Apache Ratis
NoSQL Day 2019
Washington, D.C.
Ankit Singhal, Josh Elser
Apache, Apache HBase, HBase, Apache Ratis, Ratis are (registered) trademarks of the Apache Software Foundation.
Distributed Consensus
Problem: How does a collection of computers agree on state in the face of failures?
A = 1
A = 2
A = 1
CC BY-SA 3.0 https://upload.wikimedia.org/wikipedia/commons/thumb/b/b2/Gnome-computer.svg/1024px-Gnome-computer.svg.png
Distributed Consensus
Goals: Low-latency, high-throughput, fault-tolerant
Algorithms: Paxos, Raft, ZooKeeper Atomic Broadcast (ZAB), Viewstamped Replication
Variants: Multi-Paxos, Fast Paxos, Byzantine Paxos, MultiRaft
Implementations: Chubby, Apache ZooKeeper, etcd, CockroachDB, Apache Kudu, Apache Ratis, HashiCorp Raft/Consul, RethinkDB, Akka Raft, Hazelcast Raft, Neo4j, WANdisco...
Raft
Easy to understand, easy to implement.
“New” (2013) -- Diego Ongaro, John Ousterhout
Proven correctness via TLA+
Paxos is “old” (1989), but still hard
Apache Ratis
Incubating project at the Apache Software Foundation
A library-oriented, Java implementation of Raft (not a service!)
Pluggable pieces:
● Transport (gRPC, Netty, Hadoop RPC)
● State Machine (your code!)
● Raft Log (In-memory, segmented files on disk)
Ratis State Machines
A StateMachine is the abstraction point for user code: an interface to query and modify “state”.
Ratis Arithmetic Example: maintain variables (e.g. a = 1) and apply mathematical operations.
Read expressions: add, subtract, multiply, divide
Write expressions: assignment
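class Arithmetic implements StateMachine {
  // Replicated state: variable name -> current value.
  Map<String,Double> variables;

  // Read path: evaluate an expression without modifying the variables.
  Message query(Message req) {
    Expression exp = parseReadExp(req);
    try (ReadLock rlock = getReadLock()) {
      return exp.eval(variables);
    }
  }

  // Write path: apply an assignment to the variables.
  Message update(Message req) {
    Expression exp = parseWriteExp(req);
    try (WriteLock wlock = getWriteLock()) {
      return exp.eval(variables);
    }
  }
}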
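The slide code above leans on helpers that are not shown (parseReadExp, getReadLock, and so on). As a rough, self-contained illustration of the same read/write split, here is a minimal sketch in plain Java. It does not use the Ratis API: the class name ArithmeticSketch, the string-based operands, and the resolve helper are all assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for the slide's Arithmetic state machine (no Ratis dependency).
class ArithmeticSketch {
    private final Map<String, Double> variables = new HashMap<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // Read path: evaluate "left op right" under the read lock; state is not modified.
    double query(String left, char op, String right) {
        lock.readLock().lock();
        try {
            double a = resolve(left);
            double b = resolve(right);
            switch (op) {
                case '+': return a + b;
                case '-': return a - b;
                case '*': return a * b;
                case '/': return a / b;
                default: throw new IllegalArgumentException("unknown operator: " + op);
            }
        } finally {
            lock.readLock().unlock();
        }
    }

    // Write path: assignment under the write lock.
    double update(String name, double value) {
        lock.writeLock().lock();
        try {
            variables.put(name, value);
            return value;
        } finally {
            lock.writeLock().unlock();
        }
    }

    // An operand is either a known variable name or a numeric literal.
    // An unknown, non-numeric token makes Double.parseDouble throw NumberFormatException.
    private double resolve(String token) {
        Double v = variables.get(token);
        return v != null ? v : Double.parseDouble(token);
    }
}
```

For example, after update("a", 1.0), a call to query("a", '+', "2") evaluates a + 2 against the current variables. In a replicated setting only the write path would need to go through the Raft log.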
Ratis LogService
Recipe that provides a facade of a log (append-only, immutable bytes)
Maintain little-to-no state. Storage “provided” by the Raft Log.
interface Reader {
  void seek(long offset);
  byte[] readMsg();
  List<byte[]> readBulk(int numMsgs);
}
interface Writer {
  long write(byte[] msg);
  List<Long> writeBulk(List<byte[]> msgs);
}
interface Client {
  List<String> list();
  Log getLog(String name);
  void archive(String name);
  void close(String name);
  void delete(String name);
}
interface Log {
  Reader createReader();
  Writer createWriter();
  Metadata getMetadata();
  void addListener();
}
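To make the Reader and Writer contracts above more concrete, here is a minimal in-memory sketch. It is not the Ratis LogService implementation: the names LogReader, LogWriter, and InMemoryLog are hypothetical (renamed to avoid clashing with java.io), and two behaviors are assumptions because the slide does not state them: write returns the zero-based position of the record, and readMsg returns null at the end of the log.

```java
import java.util.ArrayList;
import java.util.List;

// Same method shapes as the slide's Reader and Writer, under hypothetical names.
interface LogReader {
    void seek(long offset);
    byte[] readMsg();
    List<byte[]> readBulk(int numMsgs);
}

interface LogWriter {
    long write(byte[] msg);
    List<Long> writeBulk(List<byte[]> msgs);
}

// Append-only list of immutable byte records, kept in memory only.
class InMemoryLog {
    private final List<byte[]> records = new ArrayList<>();

    LogWriter createWriter() {
        return new LogWriter() {
            @Override
            public long write(byte[] msg) {
                synchronized (records) {
                    records.add(msg.clone());   // copy so callers cannot mutate stored bytes
                    return records.size() - 1;  // assumed meaning: position of the new record
                }
            }

            @Override
            public List<Long> writeBulk(List<byte[]> msgs) {
                List<Long> ids = new ArrayList<>();
                for (byte[] m : msgs) {
                    ids.add(write(m));
                }
                return ids;
            }
        };
    }

    LogReader createReader() {
        return new LogReader() {
            private int position = 0;

            @Override
            public void seek(long offset) {
                position = (int) offset;
            }

            @Override
            public byte[] readMsg() {
                synchronized (records) {
                    // Assumed end-of-log behavior: return null when nothing is left.
                    return position < records.size() ? records.get(position++).clone() : null;
                }
            }

            @Override
            public List<byte[]> readBulk(int numMsgs) {
                List<byte[]> out = new ArrayList<>();
                byte[] msg;
                while (out.size() < numMsgs && (msg = readMsg()) != null) {
                    out.add(msg);
                }
                return out;
            }
        };
    }
}
```

Each reader keeps its own position, so several readers can consume the same log independently while a writer keeps appending.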
Ratis LogService Architecture
[Figure: architecture diagram with a Client, a Metadata service, and Workers; example log names: transactions, gps_coordinates, sensors, query_durations]
LogService Testing
Docker-compose simplicity: 3 metadata services, >=3 workers
$ mvn package assembly:single && ./build-docker.sh
$ docker-compose up -d
$ ./client-env.sh
Utilities: interactive shell, verification tool
$ ./bin/shell -q <...>
$ ./bin/load-test -q <...>
LogService Testing
Goal: Generate some non-trivial data sizes
Environment:
● Intel i5-5250U
● 16GB of RAM
● Samsung SSD 850 M.2
● Gentoo Linux: Kernel 4.19.27
● Docker 18.09.4
● Write ~50MB per scenario
● Single client program, one log/thread, no batching
● JDK8, 3GB LogWorker heaps (no other tuning)
LogService Testing Results
Logs/Threads  Value Size (bytes)  Num Records  Duration
1             50                  1,100,000    5h+
4             50                  275,000      35m
5             100                 105,000      13m 30s
5             500                 22,000       2m 48s
8             100                 66,000       16m 20s
8             500                 13,200       2m 30s
4             1000                13,200       1m 40s
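The rows above are easier to compare once converted to totals and rates. The sketch below does that arithmetic under two assumptions that the slides do not state outright: value sizes are in bytes, and "Num Records" is counted per log. The assumptions fit the "~50MB per scenario" target (for example, 4 logs × 50 bytes × 275,000 records = 55,000,000 bytes). The class and method names are hypothetical.

```java
// Hypothetical helper for reading the results table; not part of the LogService tooling.
class LoadTestMath {
    // Total payload bytes written in one scenario (records are assumed to be per log).
    static long totalBytes(int logs, int valueSizeBytes, long recordsPerLog) {
        return (long) logs * valueSizeBytes * recordsPerLog;
    }

    // Aggregate records per second across all logs in the scenario.
    static double recordsPerSecond(int logs, long recordsPerLog, long durationSeconds) {
        return (double) logs * recordsPerLog / durationSeconds;
    }
}
```

Under these assumptions, the 4-log scenarios come out at a similar record rate despite very different value sizes: 4 × 275,000 records in 35 minutes (2,100 s) is about 524 records per second, and 4 × 13,200 records in 1m 40s (100 s) is 528 records per second. That would suggest per-record overhead, not payload size, dominated these runs.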
Does HBase want this?
Assumption: we can run HBase more efficiently in cloud environments without HDFS for WALs.
● Running HDFS is expensive and hard
○ Data is “heavy” (tens of minutes to hours to decommission)
○ Unexpected DataNode failure requires slow re-replication
● More things to monitor -- twice as many JVMs
Ideal Case:
● Scale up HBase by just adding more RegionServers, then balance
● Scale down by gently (on the order of minutes) removing RegionServers
Durability in HBase
[Figure: write path. Put/Delete/Incr KVs arrive at the RegionServer, are appended and synced to the WAL (1), and are written to the MemStore of a Store in Region1 ... RegionN (2); MemStores are flushed asynchronously to generate HFiles (Store Files) (3).]
Life cycle of WAL
[Figure: WAL life cycle. Labels: RegionServer, WAL, Log Roller (roll WAL), Flush, ZooKeeper (tracking for replication), Backup, Cleaner chore, archived WALs.]
RegionServer Recovery
Identification
- The Master (ServerManager) observes that a RegionServer is deemed dead when its ephemeral node is deleted
Splitting
- Reading the WAL and creating separate files for each region
Re-assignment
- Assigning the regions from the dead server to live RegionServers
Fencing
- Fencing off a half-dead RegionServer (a server which undergoes a long GC pause and comes back after the GC finishes)
- Currently done by renaming an HDFS directory
Replaying
- Reading the recovered edits produced by WAL splitting and replaying the edits that were not flushed
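The splitting and replaying steps above can be sketched in a few lines. This is an illustration only, not HBase code: WalEdit and WalSplitSketch are hypothetical names, edits are held in memory instead of files, and the sequence-id comparison used to decide which edits "were not flushed" is an assumption made for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical WAL entry: the region it belongs to and its position in the write order.
class WalEdit {
    final String region;
    final long sequenceId;

    WalEdit(String region, long sequenceId) {
        this.region = region;
        this.sequenceId = sequenceId;
    }
}

class WalSplitSketch {
    // Splitting: group the edits of one server-wide WAL by region, preserving write order.
    static Map<String, List<WalEdit>> split(List<WalEdit> wal) {
        Map<String, List<WalEdit>> perRegion = new LinkedHashMap<>();
        for (WalEdit edit : wal) {
            perRegion.computeIfAbsent(edit.region, r -> new ArrayList<>()).add(edit);
        }
        return perRegion;
    }

    // Replaying: keep only edits newer than the last sequence id the region had flushed.
    static List<WalEdit> unflushed(List<WalEdit> regionEdits, long lastFlushedSequenceId) {
        List<WalEdit> toReplay = new ArrayList<>();
        for (WalEdit edit : regionEdits) {
            if (edit.sequenceId > lastFlushedSequenceId) {
                toReplay.add(edit);
            }
        }
        return toReplay;
    }
}
```

The point of the grouping step is that each region can then be recovered independently on whichever live server it is re-assigned to.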
RegionServer Recovery Refactoring
Identification
- Monitoring ephemeral RS nodes
- WALs available for the servers which are not live
Splitting
interface WALProvider {
  public Map<Region, WAL> split(WAL wal);
}
Re-assignment
- No change is required, as this step is independent of the WAL
Fencing
interface ServerFence {
  public void fence(ServerName server);
}
With Ratis, an implementation could close the log to prevent further writes by the dead RegionServer.
Replaying
interface WALProvider {
  public Reader getRecoveredEditsReader(Region region);
}
Disclaimer: These interfaces are for reference only and may change during the actual development.
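The fencing idea (close the log so a half-dead server can no longer write) can be illustrated with a small in-memory sketch. Everything here is hypothetical: ServerFenceSketch mirrors the shape of the proposed ServerFence interface but takes a plain String instead of ServerName, and LogClosingFence only tracks a set of closed logs. In a real system the log service itself would have to enforce the closure.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Same shape as the proposed ServerFence, with a String standing in for ServerName.
interface ServerFenceSketch {
    void fence(String serverName);
}

// Hypothetical fence that "closes" a server's log; later appends from that server fail.
class LogClosingFence implements ServerFenceSketch {
    private final Set<String> closedLogs = ConcurrentHashMap.newKeySet();

    @Override
    public void fence(String serverName) {
        closedLogs.add(serverName);
    }

    // Stand-in for a WAL append issued by the given server.
    void append(String serverName, byte[] edit) {
        if (closedLogs.contains(serverName)) {
            throw new IllegalStateException("log closed for fenced server: " + serverName);
        }
        // A real implementation would hand the edit to the log here.
    }
}
```

A server that resumes after a long GC pause would then see its first append rejected, which is the signal that it has been declared dead and must stop serving.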
Replication
- Async and serial replication rely on reading WALs
- Need long-term storage for WALs
- Ratis LogService uses local disk
Proposed Solution
- Can we upload Ratis WALs to distributed, cheap storage?
- If we can hold onto WALs indefinitely, we don’t have to rewrite Replication.
Why Ratis for WAL?
Choices are: Apache Kafka, Distributed Log, Apache Ratis, HDFS, Amazon Kinesis, Azure premium storage
● Fully embeddable (no dependency on an external system)
● Low latency
● High throughput
● Enables HBase for hybrid cloud deployments
● Availability proportional to the number of nodes in a quorum
Disclaimer: We are not suggesting Ratis is the only solution; the HBase refactoring will be done in such a way that any storage is pluggable.
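The availability bullet above follows from how Raft commits: an entry is committed once a majority of the group has stored it, so the group stays available as long as a majority of its nodes is reachable. A small sketch of that arithmetic (QuorumMath is a hypothetical name used only for this example):

```java
// Majority-quorum arithmetic for a Raft group of a given size.
class QuorumMath {
    // Smallest number of nodes that forms a majority.
    static int majority(int nodes) {
        return nodes / 2 + 1;
    }

    // Number of nodes that can fail while a majority is still reachable.
    static int tolerableFailures(int nodes) {
        return (nodes - 1) / 2;
    }
}
```

A 3-node quorum needs 2 nodes and tolerates 1 failure; a 5-node quorum needs 3 and tolerates 2. Going from 3 to 4 nodes does not raise the number of tolerated failures, which is why odd group sizes are the usual choice.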
What’s next?
More testing for LogService
● Easy to cause leader-election storms
● Better insight/understanding into internals
A Ratis LogService WALProvider
● Wire up the LogService with the new WAL APIs
References
Ratis LogService
● https://github.com/apache/incubator-ratis/tree/master/ratis-logservice
HBase WAL Refactoring
● https://issues.apache.org/jira/browse/HBASE-20951
● https://issues.apache.org/jira/browse/HBASE-20952
Authors
● ankit@apache.org, elserj@apache.org

NoSql day 2019 - Floating on a Raft - Apache HBase durability with Apache Ratis

  • 1. Floating on a Raft HBase Durability with Apache Ratis NoSQL Day 2019 Washington, D.C. Ankit Singhal, Josh Elser Apache, Apache HBase, HBase, Apache Ratis, Ratis are (registered) trademarks of the Apache Software Foundation.
  • 2. Distributed Consensus Problem: How do a collection of computers agree on state in the face of failures? A = 1 A = 2 A = 1 CC BY-SA 3.0 https://upload.wikimedia.org/wikipedia/commons/thumb/b/b2/Gnome-computer.svg/1024px-Gnome-computer.svg.png
  • 3. Distributed Consensus Goals: Low-latency, high-throughput, fault-tolerant Algorithms: Paxos, Raft, ZooKeeper Atomic Broadcast (ZAB), Viewstamped Replication Variants: Multi-Paxos, Fast Paxos, Byzantine Paxos, MultiRaft Implementations: Chubby, Apache ZooKeeper, etcd, CockroachDB, Apache Kudu, Apache Ratis, HashiCorp Raft/Consul, RethinkDB, Akka Raft, Hazelcast Raft, Neo4j, WANdisco...
  • 4. Easy to understand, easy to implement. “New” (2013) -- Diego Ongaro, John Ousterhout Proven correctness via TLA+ Paxos is “old” (1989), but still hard Raft
  • 5. Apache Ratis Incubating project at the Apache Software Foundation A library-oriented, Java implementation of Raft (not a service!) Pluggable pieces: ● Transport (gRPC, Netty, Hadoop RPC) ● State Machine (your code!) ● Raft Log (In-memory, segmented files on disk)
  • 6. Ratis State Machines A StateMachine is the abstraction point for user-code: an interface to query and modify “state”. Ratis Arithmetic Example: Maintain variables (e.g. a = 1) and apply mathematical operations. Read expr’s: add, subtract, multiply, divide. Write expr’s: assignment.
       class Arithmetic implements StateMachine {
         Map<String,Double> variables;
         Message query(Message req) {
           Expression exp = parseReadExp(req);
           try (ReadLock rlock = getReadLock()) {
             return exp.eval(variables);
           }
         }
         Message update(Message req) {
           Expression exp = parseWriteExp(req);
           try (WriteLock wlock = getWriteLock()) {
             return exp.eval(variables);
           }
         }
       }
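The slide 6 pseudocode leans on Ratis types that are not shown in the deck. As a rough, self-contained illustration of the same idea (reads under a read lock, assignments under a write lock), the sketch below uses only the JDK. The class name `ArithmeticSketch`, the string-based expression format, and the method signatures are assumptions made for this example; they are not the Ratis `StateMachine` interface.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in for the slide's Arithmetic state machine; not the Ratis API. */
class ArithmeticSketch {
    private final Map<String, Double> variables = new HashMap<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    /** Read expression such as "a + b" or "b / 2"; evaluated under the read lock. */
    double query(String expr) {
        String[] t = expr.trim().split("\\s+");
        if (t.length != 3) {
            throw new IllegalArgumentException("expected: <operand> <op> <operand>");
        }
        lock.readLock().lock();
        try {
            double left = operand(t[0]);
            double right = operand(t[2]);
            switch (t[1]) {
                case "+": return left + right;
                case "-": return left - right;
                case "*": return left * right;
                case "/": return left / right;
                default: throw new IllegalArgumentException("unknown operator: " + t[1]);
            }
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Write expression such as "a = 1"; applied under the write lock. */
    double update(String expr) {
        String[] t = expr.trim().split("\\s+");
        if (t.length != 3 || !t[1].equals("=")) {
            throw new IllegalArgumentException("expected: <name> = <number>");
        }
        double value = Double.parseDouble(t[2]);
        lock.writeLock().lock();
        try {
            variables.put(t[0], value);
            return value;
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Resolves a token to a stored variable, falling back to a numeric literal. */
    private double operand(String token) {
        Double v = variables.get(token);
        return v != null ? v : Double.parseDouble(token);
    }
}
```

In a replicated setting only `update` calls would need to go through the Raft log, since `query` does not change state; that split between read and write expressions is what the slide's two methods express.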
  • 7. Ratis LogService Recipe that provides a facade of a log (append-only, immutable bytes). Maintain little-to-no state. Storage “provided” by the Raft Log.
       interface Reader {
         void seek(long offset);
         byte[] readMsg();
         List<byte[]> readBulk(int numMsgs);
       }
       interface Writer {
         long write(byte[] msg);
         List<Long> writeBulk(List<byte[]> msgs);
       }
       interface Client {
         List<String> list();
         Log getLog(String name);
         void archive(String name);
         void close(String name);
         void delete(String name);
       }
       interface Log {
         Reader createReader();
         Writer createWriter();
         Metadata getMetadata();
         void addListener();
       }
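To make the Reader and Writer contracts on slide 7 concrete, here is a minimal in-memory sketch. It mirrors the method names from the slide, but everything else is an assumption for illustration: the class name `InMemoryLog`, the choice of a record index as the "offset", and returning `null` at the end of the log. The real LogService stores records in the Raft log, not in a Java list.

```java
import java.util.ArrayList;
import java.util.List;

/** In-memory stand-in for one LogService log; method names follow the slide, behavior is assumed. */
class InMemoryLog {
    private final List<byte[]> records = new ArrayList<>();

    /** Appends a record and returns its offset (assumed here to be the record index). */
    synchronized long write(byte[] msg) {
        records.add(msg.clone()); // copy so the caller cannot mutate stored bytes
        return records.size() - 1;
    }

    /** Appends several records in order and returns their offsets. */
    synchronized List<Long> writeBulk(List<byte[]> msgs) {
        List<Long> offsets = new ArrayList<>();
        for (byte[] m : msgs) {
            offsets.add(write(m));
        }
        return offsets;
    }

    /** Returns a copy of the record at the offset, or null past the end of the log. */
    private synchronized byte[] get(long offset) {
        return offset < records.size() ? records.get((int) offset).clone() : null;
    }

    Reader createReader() {
        return new Reader();
    }

    /** A reader holds only a cursor; the records themselves live in the log. */
    class Reader {
        private long position = 0;

        void seek(long offset) {
            if (offset < 0) {
                throw new IllegalArgumentException("negative offset");
            }
            position = offset;
        }

        /** Returns the next record and advances, or returns null at the end of the log. */
        byte[] readMsg() {
            byte[] msg = get(position);
            if (msg != null) {
                position++;
            }
            return msg;
        }

        /** Reads up to numMsgs records, stopping early at the end of the log. */
        List<byte[]> readBulk(int numMsgs) {
            List<byte[]> out = new ArrayList<>();
            byte[] msg;
            while (out.size() < numMsgs && (msg = readMsg()) != null) {
                out.add(msg);
            }
            return out;
        }
    }
}
```

Note how little state the reader keeps (a single cursor), which matches the slide's "maintain little-to-no state" point.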
  • 8. Ratis LogService Architecture [Diagram: a Client, Metadata services, and Workers; example log names: transactions, gps_coordinates, sensors, query_durations]
  • 9. LogService Testing Docker-compose simplicity: 3 metadata services, >=3 workers
       $ mvn package assembly:single && ./build-docker.sh
       $ docker-compose up -d
       $ ./client-env.sh
       Utilities: interactive shell, verification tool
       $ ./bin/shell -q <...>
       $ ./bin/load-test -q <...>
  • 10. LogService Testing Goal: Generate some non-trivial data sizes Environment: ● Intel i5-5250U ● 16GB of RAM ● Samsung SSD 850 M.2 ● Gentoo Linux: Kernel 4.19.27 ● Docker 18.09.4 ● Write ~50MB per scenario ● Single client program, one log/thread, no batching ● JDK8, 3GB LogWorker heaps (no other tuning)
  • 11. LogService Testing Results
       Logs/Threads | Value Size | Num Records | Duration
       1            | 50         | 1,100,000   | 5h+
       4            | 50         | 275,000     | 35m
       5            | 100        | 105,000     | 13m 30s
       5            | 500        | 22,000      | 2m 48s
       8            | 100        | 66,000      | 16m 20s
       8            | 500        | 13,200      | 2m 30s
       4            | 1000       | 13,200      | 1m 40s
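The slide 11 results do not state units. One reading that is consistent with slide 10's "Write ~50MB per scenario" is that Value Size is in bytes and Num Records is counted per log/thread. The sketch below only does that arithmetic; the class and method names are made up for this example, and the units are an assumption rather than something the deck confirms.

```java
/** Arithmetic check on the slide 11 results; the units are an assumption, not stated on the slide. */
class LoadTestVolume {
    /** Total payload in bytes, assuming valueSize is in bytes and numRecords is per log/thread. */
    static long totalBytes(int logs, int valueSize, long numRecords) {
        return (long) logs * valueSize * numRecords;
    }

    /** Average payload throughput in bytes per second over a run of the given duration. */
    static double bytesPerSecond(long totalBytes, long durationSeconds) {
        return (double) totalBytes / durationSeconds;
    }
}
```

Under those assumptions, `totalBytes(4, 1000, 13200)` is 52,800,000 bytes written in 100 seconds (about 528 kB/s), while `totalBytes(4, 50, 275000)` is 55,000,000 bytes written in 35 minutes (about 26 kB/s). Every row lands between 52.5 MB and 55 MB, and the larger values finish far faster for the same volume, which suggests the cost is dominated by the number of records rather than by the bytes.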
  • 12. Does HBase want this? Assumption: we can more efficiently run HBase in cloud environments without HDFS for WALs. ● Running HDFS is expensive, hard ○ Data is “heavy” (tens of minutes to hours to decommission) ○ Unexpected DataNode failure requires slow re-replication ● More things to monitor -- twice as many JVMs Ideal Case: ● Scale up HBase by just adding more RegionServers, then balance ● Scale down by gently (on the order of minutes) removing RegionServers
  • 13. Durability in HBase: Write Path [Diagram: Put, Delete, and Incr operations arrive at the RegionServer; (1) append and sync KVs to the WAL; (2) write to the MemStore of a Store in Region1..RegionN; (3) asynchronous flushing to generate HFiles (Store Files)]
  • 14. Life cycle of WAL [Diagram labels: RegionServer, WAL, WALs, ZooKeeper, Flush, Log Roller, Roll WAL, Tracking for Replication, Backup, Cleaner chore, WALs Archived]
  • 15. Regionserver Recovery
       Identification - The Master (ServerManager) deems a region server dead when its ephemeral node is deleted
       Splitting - Reading the WAL and creating separate files for each region
       Re-assignment - Assigning the regions from the dead server to live regionservers
       Fencing - Fencing off a half-dead region server (a server that undergoes a long GC pause and comes back after the GC finishes); currently done by renaming an HDFS directory
       Replaying - Reading the recovered edits produced by WAL splitting and replaying the edits that were not flushed
  • 16. Regionserver Recovery Refactoring
       Identification - Monitoring ephemeral RS nodes; WALs are available for the servers which are not live
       Splitting - interface WALProvider { public Map<Region, WAL> split(WAL wal); }
       Re-assignment - No change is required, as it is independent of the WAL
       Fencing - interface ServerFence { public void fence(ServerName server); } In the case of Ratis, an implementation could close the log to prevent further writes by the dead regionserver.
       Replaying - interface WALProvider { public Reader getRecoveredEditsReader(Region region); }
       Disclaimer: These interfaces are for reference only and may change during the actual development
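The splitting and fencing steps on slide 16 can be pictured with a small in-memory model. This is only a sketch under stated assumptions: `WalRecoverySketch`, `Entry`, and the string-typed server and region names are hypothetical, and none of this is HBase or Ratis code. It shows the two behaviors the slide describes: fencing closes a dead server's log so late writes are rejected, and splitting groups that server's WAL entries by region.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory model of WAL splitting and fencing; not HBase or Ratis code. */
class WalRecoverySketch {
    /** One WAL entry: the region it belongs to plus an opaque edit. */
    static final class Entry {
        final String region;
        final String edit;

        Entry(String region, String edit) {
            this.region = region;
            this.edit = edit;
        }
    }

    private final Map<String, List<Entry>> walByServer = new LinkedHashMap<>();
    private final Set<String> fenced = new HashSet<>();

    /** Appends to the server's WAL; a fenced (closed) log rejects further writes. */
    void append(String server, Entry e) {
        if (fenced.contains(server)) {
            throw new IllegalStateException("log closed for " + server);
        }
        walByServer.computeIfAbsent(server, s -> new ArrayList<>()).add(e);
    }

    /** Fencing: close the dead server's log so a half-dead server cannot keep writing. */
    void fence(String server) {
        fenced.add(server);
    }

    /** Splitting: group the server's entries per region, preserving append order. */
    Map<String, List<String>> split(String server) {
        Map<String, List<String>> perRegion = new LinkedHashMap<>();
        for (Entry e : walByServer.getOrDefault(server, new ArrayList<>())) {
            perRegion.computeIfAbsent(e.region, r -> new ArrayList<>()).add(e.edit);
        }
        return perRegion;
    }
}
```

A recovery flow in this model would call `fence` first and `split` second, so that no entry can be appended between the split and the re-assignment of regions.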
  • 17. Replication - Async and Serial Replication rely on reading WALs - Need a long-term storage for WALs - Ratis LogService uses local disk Proposed Solution - Can we upload Ratis WALs to distributed, cheap storage? - If we can hold onto WALs indefinitely, we don’t have to rewrite Replication.
  • 18. Why Ratis for WAL? Choices are: Apache Kafka, Distributed Log, Apache Ratis, HDFS, Amazon Kinesis, Azure premium storage ● Fully embeddable (no dependency on an external system) ● Low latency ● High throughput ● Enables HBase for hybrid cloud deployment ● Availability proportional to no. of nodes in a quorum Disclaimer: We are not suggesting Ratis is the only solution; the HBase refactoring will be done in such a way that any storage is pluggable
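The "availability proportional to no. of nodes in a quorum" point on slide 18 follows from Raft needing a majority of servers to make progress. The sketch below spells out that arithmetic; the class and method names are chosen for this example only.

```java
/** Majority-quorum arithmetic for a Raft group of n servers. */
class QuorumMath {
    /** Smallest number of servers that forms a majority of n. */
    static int majority(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be positive");
        }
        return n / 2 + 1;
    }

    /** Number of servers that may fail while a majority remains reachable. */
    static int toleratedFailures(int n) {
        return n - majority(n);
    }
}
```

For example, a group of 3 needs 2 servers and tolerates 1 failure, and a group of 5 needs 3 and tolerates 2. A group of 4 still tolerates only 1 failure, which is why odd group sizes are the usual choice.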
  • 19. What’s next? More testing for LogService ● Easy to cause leader-election storms ● Better insight/understanding into internals A Ratis LogService WalProvider ● Wire up the LogService with the new WAL APIs
  • 20. References Ratis LogService ● https://github.com/apache/incubator-ratis/tree/master/ratis-logservice HBase WAL Refactoring ● https://issues.apache.org/jira/browse/HBASE-20951 ● https://issues.apache.org/jira/browse/HBASE-20952 Authors ● ankit,elserj@apache.org

Editor's Notes

  1. Advantages of cloud: cloud storage is economical; easy migration; elasticity. Disadvantages of cloud services: expensive; limited options; specific versions.
  2. Characteristics of a WAL: durable and highly available, as WALs are needed in case of a crash; latency and throughput matter because the WAL is on the write path; support for append and group commit.
  3. Decisions made for Ratis
  4. Generally the life cycle of a WAL is short: it is no longer needed once the flush completes, but because of replication we need to keep WALs longer. Plan for Ratis log service, that if redundancy of the
  5. Local disk: using an SSD for the local disk will lower the latency further.