SlideShare a Scribd company logo
Enabling Native Integration of
NoSQL HBase with Cloud
Providers
Ankit Singhal, Wellington Chevreuil, Stephen Wu
2
© 2023 Cloudera, Inc. All rights reserved.
● TCO
○ Instance Sizing
○ Automatic scaling
○ Storage Optimization
● Storage Options
● Security Simplification
● Availability and Fault Tolerance
● Benchmark
● Cost Case study
Agenda
3
© 2023 Cloudera, Inc. All rights reserved.
Considering Cloud Native
* https://www.redhat.com/en/topics/cloud-native-apps
● Elasticity with increased efficiency
● Flexible and configurable Availability
● Cost Reduction
4
© 2023 Cloudera, Inc. All rights reserved.
Instance sizing
● Per role type: Masters, RegionServers
● Per load (with RS Grouping)
○ System tables (meta, namespace, phoenix)
○ Use case specific load
TCO
5
© 2023 Cloudera, Inc. All rights reserved.
Metrics based automatic scale up/down
● Read/Write latency/throughput based
● RPC latency based
● Compaction queue size
● Overall request load
● Available cache space (Cloud Storage deployments)
TCO
6
© 2023 Cloudera, Inc. All rights reserved.
Storage Optimizations
● Block vs Cloud Storage
● Ephemeral Storage for BucketCache
TCO
7
© 2023 Cloudera, Inc. All rights reserved.
Cloud Storage
● Benefits
○ Cost efficient for large volumes
○ Available on major cloud providers
○ Decoupled compute from storage
● Challenges
○ Increased latency
○ HBase native compatibility
■ S3 lacks atomic rename (for now)
Storage Options
8
© 2023 Cloudera, Inc. All rights reserved.
Cloud Storage
● Mitigating latency
○ File based bucket cache over ephemeral storage
■ Initial cache warmup required
■ Cache must be resilient to RS crashes/restarts
● HBASE-27686, HBASE-27743
■ Region balancer must consider impacts to the cache
● HBASE-27389
○ Cloud provider specific tunings
■ Throttling, connection/thread pooling configs
■ Request hints (random/sequential)
Storage Options
9
© 2023 Cloudera, Inc. All rights reserved.
Cloud Storage
● HBase S3 Integration
○ HBase originally designed for hierarchical file systems
■ HBase internal write operations are two phased:
● New files created on a temp dir
● Rename new files to final dir at commit time
■ Amazon S3 lacks atomic renames
● Requires locking the whole dir content
● Individual files rename calls to S3 (suboptimal)
Storage Options
10
© 2023 Cloudera, Inc. All rights reserved.
Cloud Storage
Storage Options
Original Flow
11
© 2023 Cloudera, Inc. All rights reserved.
Cloud Storage
● HBase S3 Integration
○ Redesign HBase to not (only) rely on renames
■ Store File Tracking: HBASE-26067
● Additional layer for tracking store files
● Write operations delegate file path decision to the
tracking layer
● Configurable tracker implementations
○ Default: still uses renames
○ File Based Tracker: keeps list of valid files in
meta files
Storage Options
12
© 2023 Cloudera, Inc. All rights reserved.
Cloud Storage
SFT Flow
Storage Options
13
© 2023 Cloudera, Inc. All rights reserved.
Cloud Storage
SFT Flow
(detailed)
Storage Options
14
© 2023 Cloudera, Inc. All rights reserved.
Cloud Storage
● Limitations
○ No durable syncs (fsync/hflush)
■ WAL files require durable file sync to overcome crashes
■ Low latency critical for write performance
● Deploy HDFS for WAL
○ Separate WAL from data directory
○ WALs are temporary, so limited space usage
○ Use the instances attached disks for HDFS data
Storage Options
15
© 2023 Cloudera, Inc. All rights reserved.
Block Storage
● Benefits
○ Higher performance *
● Challenges
○ More costly
○ Compute/Storage tightly coupled
Storage Options
16
© 2023 Cloudera, Inc. All rights reserved.
Block Storage
● Block storage volumes attached to the cluster nodes
● HDFS deployed over the nodes
○ Both WALs and HBase data on HDFS
● Less cost efficient
■ Require additional/higher volumes attached to instances
■ HDFS replication factor requires more storage
■ Compute and Storage scaling together
● Higher Performance
○ Avg 5x better latency/throughput
Storage Options
17
© 2023 Cloudera, Inc. All rights reserved.
Cloud vs Block
Storage Type Pros Cons
Cloud Storage More cost efficient
Separate compute from storage:
● Optimal scaling
● Data management and availability
Lower performance
● Mitigated with bucket cache
Block Storage Better overall performance Higher costs
Couples compute and storage
● Can't scale independently
Storage Options
18
© 2023 Cloudera, Inc. All rights reserved.
Kerberos vs JWT
● Kerberos
○ HBase support since early versions
○ Strong client/server authentication (KDC based)
○ SASL for the RPC authentication
○ Requires clients to be known by the KDC
○ Complex to integrate with other services
Security Simplification
19
© 2023 Cloudera, Inc. All rights reserved.
Kerberos vs JWT
● JWT
○ Token based authentication
■ Custom SASL auth provider (HBASE-23347)
■ HBase builtin works ongoing (HBASE-26553)
○ TLS for RPC (encrypts the token and further info)
■ HBASE-26666 recently added TLS for RPC support
○ Centralised authentication service
■ Easily reusable by clients/other services
Security Simplification
20
© 2023 Cloudera, Inc. All rights reserved.
● HBase built-in
capabilities
○ Master HA
○ Table sharding
(Regions)
○ Meta region replica
Availability and Fault Tolerance
21
© 2023 Cloudera, Inc. All rights reserved.
Availability and Fault Tolerance (cont.)
● * Multiple
Availability Zones
(Multi AZ)
○ Master
instances on
different zones
○ RSes spread
evenly
throughout
zones
*https://blog.cloudera.com/high-availability-multi-az-for-cdp-operational-database/
22
© 2023 Cloudera, Inc. All rights reserved.
Availability and Fault Tolerance (cont.)
● Limitations
○ Cache reload (Cloud Storage deployments) when RSs
crashes
■ HBASE-27313 Persist list of HFiles after prefetch
■ HBASE-27743 Enhancement for persistent cache
■ HBASE-27389 Cost function that considers bucket
cache as part of the plan before region movement
○ Multi AZ increases latency (Block Storage deployments)
23
© 2023 Cloudera, Inc. All rights reserved.
YCSB Workloads
● Dataset size: 20 Billion rows = ~20TB
● Workload C: 100% Read
● Workload A: 50% read, 50% write
● Workload F: 50% read, 25% update, 25% read-modify-update
● Two common deployments, HDFS vs Cloud storage
Benchmark
24
© 2023 Cloudera, Inc. All rights reserved.
Clusters definitions
● AWS
○ HDFS
■ 20 RSes m5.2xlarge storage with HDD
■ 2 Masters m5.8xlarge
○ S3
■ 20 RSes i3.2xlarge storage as S3
■ 2 Masters m5.2xlarge
● Azure
○ HDFS
■ 20 RSes Standard_D8_V3
■ 2 Masters Standard_D32_V3
○ ABFS
■ 20 RSes Standard_L8s_V2
■ 2 Masters Standard_D8a_V4
Benchmark
25
© 2023 Cloudera, Inc. All rights reserved.
AWS with HDFS vs S3 (Higher is better)
Benchmark
26
© 2023 Cloudera, Inc. All rights reserved.
AWS with HDFS vs S3 (Lower is better)
Benchmark
There is an outlier
on P95+, where the
P95 latency is 28.3
ms vs 3.3 ms
27
© 2023 Cloudera, Inc. All rights reserved.
Azure with HDFS vs ABFS (Higher is better)
Benchmark
28
© 2023 Cloudera, Inc. All rights reserved.
Benchmark
Azure with HDFS vs ABFS (Lower is better)
29
© 2023 Cloudera, Inc. All rights reserved.
Average comparison against HDFS on Block Storage
Workload Type Latency S3 vs
HDFS
Throughput S3 vs
HDFS
Latency ABFS vs
HDFS
Throughput
S3 vs HDFS
Read only workload 47% lower 86% higher 48% lower 91% higher
50% Read, 50% Write R: 52% lower
W: 28% higher
36% higher R: 46% lower
W: 35% lower
66% higher
50% read, 25% update, 25%
read-modify-update
R: 87% lower
W: 29% lower
66% higher R: 48% lower
W: 35% lower
66% lower
Benchmark
30
© 2023 Cloudera, Inc. All rights reserved.
Sample comparison for a 50TB read/write workload on AWS:
50TB read/write workload S3 with ephemeral HDFS in block storage
Number of instances 43 43
Instance Types 3 x m5.2xlarge
40 x i3.2xlarge
3 x m5.8xlarge
40 x m5.2xlarge
Initial capacity with
performance parity
64TB ephemeral storage for
bucket cache
190TB EBS volumes for
HDFS
Cost of instances $10,778.58 $7,467.51
Cost of storage *$1,428.60 $9,216.00
Total cost (monthly) $12,207.18 (26.8%) $16,683.51
Cost Case Study
* we estimated the amount of list/put/get requests to S3 is 50 millions per month, the actual cost may be slightly
higher
31
© 2023 Cloudera, Inc. All rights reserved.
Q/A
32
© 2023 Cloudera, Inc. All rights reserved.
● Wellington Chevreuil
○ SW Engineer at Cloudera,
HBase PMC
● Stephen Wu
○ SW Engineer at Cloudera,
HBase PMC
Speaker information
Cloudera Enabling Native Integration of NoSQL HBase with Cloud Providers.pdf

More Related Content

Similar to Cloudera Enabling Native Integration of NoSQL HBase with Cloud Providers.pdf

Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit
 
Hadoop Storage in the Cloud Native Era
Hadoop Storage in the Cloud Native EraHadoop Storage in the Cloud Native Era
Hadoop Storage in the Cloud Native Era
DataWorks Summit
 
Self-service Big Data Analytics on Microsoft Azure
Self-service Big Data Analytics on Microsoft AzureSelf-service Big Data Analytics on Microsoft Azure
Self-service Big Data Analytics on Microsoft Azure
Cloudera, Inc.
 
Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017
Junping Du
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
DataWorks Summit
 
Cloudera Operational DB (Apache HBase & Apache Phoenix)
Cloudera Operational DB (Apache HBase & Apache Phoenix)Cloudera Operational DB (Apache HBase & Apache Phoenix)
Cloudera Operational DB (Apache HBase & Apache Phoenix)
Timothy Spann
 
Cloud Data Warehousing with Cloudera Altus 7.24.18
Cloud Data Warehousing with Cloudera Altus 7.24.18Cloud Data Warehousing with Cloudera Altus 7.24.18
Cloud Data Warehousing with Cloudera Altus 7.24.18
Cloudera, Inc.
 
Part 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud World
Part 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud WorldPart 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud World
Part 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud World
Cloudera, Inc.
 
IaaS for DBAs in Azure
IaaS for DBAs in AzureIaaS for DBAs in Azure
IaaS for DBAs in Azure
Kellyn Pot'Vin-Gorman
 
Deep Dive on Backup.pdf
Deep Dive on Backup.pdfDeep Dive on Backup.pdf
Deep Dive on Backup.pdf
Amazon Web Services
 
Five Tips for Running Cloudera on AWS
Five Tips for Running Cloudera on AWSFive Tips for Running Cloudera on AWS
Five Tips for Running Cloudera on AWS
Cloudera, Inc.
 
Crossplane and a story about scaling Kubernetes custom resources.pdf
Crossplane and a story about scaling Kubernetes custom resources.pdfCrossplane and a story about scaling Kubernetes custom resources.pdf
Crossplane and a story about scaling Kubernetes custom resources.pdf
Richárd Kovács
 
[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma
[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma
[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma
Newton Alex
 
Ozone - Evolution of hdfs scalability
Ozone - Evolution of hdfs scalabilityOzone - Evolution of hdfs scalability
Ozone - Evolution of hdfs scalability
Dinesh Chitlangia
 
Deep Dive on Backup
Deep Dive on BackupDeep Dive on Backup
Deep Dive on Backup
Amazon Web Services
 
Azure storage deep dive
Azure storage deep diveAzure storage deep dive
Azure storage deep dive
Sergio Navarro Pino
 
Big data journey to the cloud 5.30.18 asher bartch
Big data journey to the cloud 5.30.18   asher bartchBig data journey to the cloud 5.30.18   asher bartch
Big data journey to the cloud 5.30.18 asher bartch
Cloudera, Inc.
 
ClickHouse on Plug-n-Play Cloud, by Som Sikdar, Kodiak Data
ClickHouse on Plug-n-Play Cloud, by Som Sikdar, Kodiak DataClickHouse on Plug-n-Play Cloud, by Som Sikdar, Kodiak Data
ClickHouse on Plug-n-Play Cloud, by Som Sikdar, Kodiak Data
Altinity Ltd
 
DDN GS7K - Easy-to-deploy, High Performance Scale-Out Parallel File System Ap...
DDN GS7K - Easy-to-deploy, High Performance Scale-Out Parallel File System Ap...DDN GS7K - Easy-to-deploy, High Performance Scale-Out Parallel File System Ap...
DDN GS7K - Easy-to-deploy, High Performance Scale-Out Parallel File System Ap...
inside-BigData.com
 
Cloud Architecture best practices
Cloud Architecture best practicesCloud Architecture best practices
Cloud Architecture best practices
Omid Vahdaty
 

Similar to Cloudera Enabling Native Integration of NoSQL HBase with Cloud Providers.pdf (20)

Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
Hadoop Storage in the Cloud Native Era
Hadoop Storage in the Cloud Native EraHadoop Storage in the Cloud Native Era
Hadoop Storage in the Cloud Native Era
 
Self-service Big Data Analytics on Microsoft Azure
Self-service Big Data Analytics on Microsoft AzureSelf-service Big Data Analytics on Microsoft Azure
Self-service Big Data Analytics on Microsoft Azure
 
Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
 
Cloudera Operational DB (Apache HBase & Apache Phoenix)
Cloudera Operational DB (Apache HBase & Apache Phoenix)Cloudera Operational DB (Apache HBase & Apache Phoenix)
Cloudera Operational DB (Apache HBase & Apache Phoenix)
 
Cloud Data Warehousing with Cloudera Altus 7.24.18
Cloud Data Warehousing with Cloudera Altus 7.24.18Cloud Data Warehousing with Cloudera Altus 7.24.18
Cloud Data Warehousing with Cloudera Altus 7.24.18
 
Part 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud World
Part 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud WorldPart 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud World
Part 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud World
 
IaaS for DBAs in Azure
IaaS for DBAs in AzureIaaS for DBAs in Azure
IaaS for DBAs in Azure
 
Deep Dive on Backup.pdf
Deep Dive on Backup.pdfDeep Dive on Backup.pdf
Deep Dive on Backup.pdf
 
Five Tips for Running Cloudera on AWS
Five Tips for Running Cloudera on AWSFive Tips for Running Cloudera on AWS
Five Tips for Running Cloudera on AWS
 
Crossplane and a story about scaling Kubernetes custom resources.pdf
Crossplane and a story about scaling Kubernetes custom resources.pdfCrossplane and a story about scaling Kubernetes custom resources.pdf
Crossplane and a story about scaling Kubernetes custom resources.pdf
 
[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma
[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma
[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma
 
Ozone - Evolution of hdfs scalability
Ozone - Evolution of hdfs scalabilityOzone - Evolution of hdfs scalability
Ozone - Evolution of hdfs scalability
 
Deep Dive on Backup
Deep Dive on BackupDeep Dive on Backup
Deep Dive on Backup
 
Azure storage deep dive
Azure storage deep diveAzure storage deep dive
Azure storage deep dive
 
Big data journey to the cloud 5.30.18 asher bartch
Big data journey to the cloud 5.30.18   asher bartchBig data journey to the cloud 5.30.18   asher bartch
Big data journey to the cloud 5.30.18 asher bartch
 
ClickHouse on Plug-n-Play Cloud, by Som Sikdar, Kodiak Data
ClickHouse on Plug-n-Play Cloud, by Som Sikdar, Kodiak DataClickHouse on Plug-n-Play Cloud, by Som Sikdar, Kodiak Data
ClickHouse on Plug-n-Play Cloud, by Som Sikdar, Kodiak Data
 
DDN GS7K - Easy-to-deploy, High Performance Scale-Out Parallel File System Ap...
DDN GS7K - Easy-to-deploy, High Performance Scale-Out Parallel File System Ap...DDN GS7K - Easy-to-deploy, High Performance Scale-Out Parallel File System Ap...
DDN GS7K - Easy-to-deploy, High Performance Scale-Out Parallel File System Ap...
 
Cloud Architecture best practices
Cloud Architecture best practicesCloud Architecture best practices
Cloud Architecture best practices
 

More from wchevreuil

HBase System Tables / Metadata Info
HBase System Tables / Metadata InfoHBase System Tables / Metadata Info
HBase System Tables / Metadata Info
wchevreuil
 
HDFS client write/read implementation details
HDFS client write/read implementation detailsHDFS client write/read implementation details
HDFS client write/read implementation details
wchevreuil
 
HBase RITs
HBase RITsHBase RITs
HBase RITs
wchevreuil
 
HBase tales from the trenches
HBase tales from the trenchesHBase tales from the trenches
HBase tales from the trenches
wchevreuil
 
Hbasecon2019 hbck2 (1)
Hbasecon2019 hbck2 (1)Hbasecon2019 hbck2 (1)
Hbasecon2019 hbck2 (1)
wchevreuil
 
Web hdfs and httpfs
Web hdfs and httpfsWeb hdfs and httpfs
Web hdfs and httpfs
wchevreuil
 
HBase replication
HBase replicationHBase replication
HBase replication
wchevreuil
 
Hadoop tuning
Hadoop tuningHadoop tuning
Hadoop tuning
wchevreuil
 
I nd t_bigdata(1)
I nd t_bigdata(1)I nd t_bigdata(1)
I nd t_bigdata(1)
wchevreuil
 
Hadoop - TDC 2012
Hadoop - TDC 2012Hadoop - TDC 2012
Hadoop - TDC 2012
wchevreuil
 

More from wchevreuil (10)

HBase System Tables / Metadata Info
HBase System Tables / Metadata InfoHBase System Tables / Metadata Info
HBase System Tables / Metadata Info
 
HDFS client write/read implementation details
HDFS client write/read implementation detailsHDFS client write/read implementation details
HDFS client write/read implementation details
 
HBase RITs
HBase RITsHBase RITs
HBase RITs
 
HBase tales from the trenches
HBase tales from the trenchesHBase tales from the trenches
HBase tales from the trenches
 
Hbasecon2019 hbck2 (1)
Hbasecon2019 hbck2 (1)Hbasecon2019 hbck2 (1)
Hbasecon2019 hbck2 (1)
 
Web hdfs and httpfs
Web hdfs and httpfsWeb hdfs and httpfs
Web hdfs and httpfs
 
HBase replication
HBase replicationHBase replication
HBase replication
 
Hadoop tuning
Hadoop tuningHadoop tuning
Hadoop tuning
 
I nd t_bigdata(1)
I nd t_bigdata(1)I nd t_bigdata(1)
I nd t_bigdata(1)
 
Hadoop - TDC 2012
Hadoop - TDC 2012Hadoop - TDC 2012
Hadoop - TDC 2012
 

Recently uploaded

ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
ALGIT - Assembly Line for Green IT - Numbers, Data, FactsALGIT - Assembly Line for Green IT - Numbers, Data, Facts
ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
Green Software Development
 
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissancesAtelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Neo4j
 
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdfVitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke
 
Lecture 2 - software testing SE 412.pptx
Lecture 2 - software testing SE 412.pptxLecture 2 - software testing SE 412.pptx
Lecture 2 - software testing SE 412.pptx
TaghreedAltamimi
 
Using Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query PerformanceUsing Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query Performance
Grant Fritchey
 
SQL Accounting Software Brochure Malaysia
SQL Accounting Software Brochure MalaysiaSQL Accounting Software Brochure Malaysia
SQL Accounting Software Brochure Malaysia
GohKiangHock
 
Using Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional SafetyUsing Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional Safety
Ayan Halder
 
Hand Rolled Applicative User Validation Code Kata
Hand Rolled Applicative User ValidationCode KataHand Rolled Applicative User ValidationCode Kata
Hand Rolled Applicative User Validation Code Kata
Philip Schwarz
 
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Crescat
 
UI5con 2024 - Bring Your Own Design System
UI5con 2024 - Bring Your Own Design SystemUI5con 2024 - Bring Your Own Design System
UI5con 2024 - Bring Your Own Design System
Peter Muessig
 
Odoo ERP Vs. Traditional ERP Systems – A Comparative Analysis
Odoo ERP Vs. Traditional ERP Systems – A Comparative AnalysisOdoo ERP Vs. Traditional ERP Systems – A Comparative Analysis
Odoo ERP Vs. Traditional ERP Systems – A Comparative Analysis
Envertis Software Solutions
 
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CDKuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
rodomar2
 
How to write a program in any programming language
How to write a program in any programming languageHow to write a program in any programming language
How to write a program in any programming language
Rakesh Kumar R
 
Unveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdfUnveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdf
brainerhub1
 
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdfAutomated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
timtebeek1
 
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
UI5con 2024 - Boost Your Development Experience with UI5 Tooling ExtensionsUI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
Peter Muessig
 
How Can Hiring A Mobile App Development Company Help Your Business Grow?
How Can Hiring A Mobile App Development Company Help Your Business Grow?How Can Hiring A Mobile App Development Company Help Your Business Grow?
How Can Hiring A Mobile App Development Company Help Your Business Grow?
ToXSL Technologies
 
Fundamentals of Programming and Language Processors
Fundamentals of Programming and Language ProcessorsFundamentals of Programming and Language Processors
Fundamentals of Programming and Language Processors
Rakesh Kumar R
 
GreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-JurisicGreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-Jurisic
Green Software Development
 
Empowering Growth with Best Software Development Company in Noida - Deuglo
Empowering Growth with Best Software  Development Company in Noida - DeugloEmpowering Growth with Best Software  Development Company in Noida - Deuglo
Empowering Growth with Best Software Development Company in Noida - Deuglo
Deuglo Infosystem Pvt Ltd
 

Recently uploaded (20)

ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
ALGIT - Assembly Line for Green IT - Numbers, Data, FactsALGIT - Assembly Line for Green IT - Numbers, Data, Facts
ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
 
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissancesAtelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissances
 
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdfVitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdf
 
Lecture 2 - software testing SE 412.pptx
Lecture 2 - software testing SE 412.pptxLecture 2 - software testing SE 412.pptx
Lecture 2 - software testing SE 412.pptx
 
Using Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query PerformanceUsing Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query Performance
 
SQL Accounting Software Brochure Malaysia
SQL Accounting Software Brochure MalaysiaSQL Accounting Software Brochure Malaysia
SQL Accounting Software Brochure Malaysia
 
Using Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional SafetyUsing Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional Safety
 
Hand Rolled Applicative User Validation Code Kata
Hand Rolled Applicative User ValidationCode KataHand Rolled Applicative User ValidationCode Kata
Hand Rolled Applicative User Validation Code Kata
 
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
 
UI5con 2024 - Bring Your Own Design System
UI5con 2024 - Bring Your Own Design SystemUI5con 2024 - Bring Your Own Design System
UI5con 2024 - Bring Your Own Design System
 
Odoo ERP Vs. Traditional ERP Systems – A Comparative Analysis
Odoo ERP Vs. Traditional ERP Systems – A Comparative AnalysisOdoo ERP Vs. Traditional ERP Systems – A Comparative Analysis
Odoo ERP Vs. Traditional ERP Systems – A Comparative Analysis
 
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CDKuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
 
How to write a program in any programming language
How to write a program in any programming languageHow to write a program in any programming language
How to write a program in any programming language
 
Unveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdfUnveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdf
 
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdfAutomated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
 
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
UI5con 2024 - Boost Your Development Experience with UI5 Tooling ExtensionsUI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
 
How Can Hiring A Mobile App Development Company Help Your Business Grow?
How Can Hiring A Mobile App Development Company Help Your Business Grow?How Can Hiring A Mobile App Development Company Help Your Business Grow?
How Can Hiring A Mobile App Development Company Help Your Business Grow?
 
Fundamentals of Programming and Language Processors
Fundamentals of Programming and Language ProcessorsFundamentals of Programming and Language Processors
Fundamentals of Programming and Language Processors
 
GreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-JurisicGreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-Jurisic
 
Empowering Growth with Best Software Development Company in Noida - Deuglo
Empowering Growth with Best Software  Development Company in Noida - DeugloEmpowering Growth with Best Software  Development Company in Noida - Deuglo
Empowering Growth with Best Software Development Company in Noida - Deuglo
 

Cloudera Enabling Native Integration of NoSQL HBase with Cloud Providers.pdf

  • 1. Enabling Native Integration of NoSQL HBase with Cloud Providers Ankit Singhal, Wellington Chevreuil, Stephen Wu
  • 2. 2 © 2023 Cloudera, Inc. All rights reserved. ● TCO ○ Instance Sizing ○ Automatic scaling ○ Storage Optimization ● Storage Options ● Security Simplification ● Availability and Fault Tolerance ● Benchmark ● Cost Case study Agenda
  • 3. 3 © 2023 Cloudera, Inc. All rights reserved. Considering Cloud Native * https://www.redhat.com/en/topics/cloud-native-apps ● Elasticity with increased efficiency ● Flexible and configurable Availability ● Cost Reduction
  • 4. 4 © 2023 Cloudera, Inc. All rights reserved. Instance sizing ● Per role type: Masters, RegionServers ● Per load (with RS Grouping) ○ System tables (meta, namespace, phoenix) ○ Use case specific load TCO
  • 5. 5 © 2023 Cloudera, Inc. All rights reserved. Metrics based automatic scale up/down ● Read/Write latency/throughput based ● RPC latency based ● Compaction queue size ● Overall request load ● Available cache space (Cloud Storage deployments) TCO
  • 6. 6 © 2023 Cloudera, Inc. All rights reserved. Storage Optimizations ● Block vs Cloud Storage ● Ephemeral Storage for BucketCache TCO
  • 7. 7 © 2023 Cloudera, Inc. All rights reserved. Cloud Storage ● Benefits ○ Cost efficient for large volumes ○ Available on major cloud providers ○ Decoupled compute from storage ● Challenges ○ Increased latency ○ HBase native compatibility ■ S3 lacks atomic rename (for now) Storage Options
  • 8. 8 © 2023 Cloudera, Inc. All rights reserved. Cloud Storage ● Mitigating latency ○ File based bucket cache over ephemeral storage ■ Initial cache warmup required ■ Cache must be resilient to RS crashes/restarts ● HBASE-27686, HBASE-27743 ■ Region balancer must consider impacts to the cache ● HBASE-27389 ○ Cloud provider specific tunings ■ Throttling, connection/thread pooling configs ■ Request hints (random/sequential) Storage Options
  • 9. 9 © 2023 Cloudera, Inc. All rights reserved. Cloud Storage ● HBase S3 Integration ○ HBase originally designed for hierarchical file systems ■ HBase internal write operations are two phased: ● New files created on a temp dir ● Rename new files to final dir at commit time ■ Amazon S3 lacks atomic renames ● Requires locking the whole dir content ● Individual files rename calls to S3 (suboptimal) Storage Options
  • 10. 10 © 2023 Cloudera, Inc. All rights reserved. Cloud Storage Storage Options Original Flow
  • 11. 11 © 2023 Cloudera, Inc. All rights reserved. Cloud Storage ● HBase S3 Integration ○ Redesign HBase to not (only) rely on renames ■ Store File Tracking: HBASE-26067 ● Additional layer for tracking store files ● Write operations delegate file path decision to the tracking layer ● Configurable tracker implementations ○ Default: still uses renames ○ File Based Tracker: keeps list of valid files in meta files Storage Options
  • 12. 12 © 2023 Cloudera, Inc. All rights reserved. Cloud Storage SFT Flow Storage Options
  • 13. 13 © 2023 Cloudera, Inc. All rights reserved. Cloud Storage SFT Flow (detailed) Storage Options
  • 14. 14 © 2023 Cloudera, Inc. All rights reserved. Cloud Storage ● Limitations ○ No durable syncs (fsync/hflush) ■ WAL files require durable file sync to overcome crashes ■ Low latency critical for write performance ● Deploy HDFS for WAL ○ Separate WAL from data directory ○ WALs are temporary, so limited space usage ○ Use the instances attached disks for HDFS data Storage Options
  • 15. 15 © 2023 Cloudera, Inc. All rights reserved. Block Storage ● Benefits ○ Higher performance * ● Challenges ○ More costly ○ Compute/Storage tightly coupled Storage Options
  • 16. 16 © 2023 Cloudera, Inc. All rights reserved. Block Storage ● Block storage volumes attached to the cluster nodes ● HDFS deployed over the nodes ○ Both WALs and HBase data on HDFS ● Less cost efficient ■ Require additional/higher volumes attached to instances ■ HDFS replication factor requires more storage ■ Compute and Storage scaling together ● Higher Performance ○ Avg 5x better latency/throughput Storage Options
  • 17. 17 © 2023 Cloudera, Inc. All rights reserved. Cloud vs Block Storage Type Pros Cons Cloud Storage More cost efficient Separate compute from storage: ● Optimal scaling ● Data management and availability Lower performance ● Mitigated with bucket cache Block Storage Better overall performance Higher costs Couples compute and storage ● Can't scale independently Storage Options
  • 18. 18 © 2023 Cloudera, Inc. All rights reserved. Kerberos vs JWT ● Kerberos ○ HBase support since early versions ○ Strong client/server authentication (KDC based) ○ SASL for the RPC authentication ○ Requires clients to be known by the KDC ○ Complex to integrate with other services Security Simplification
  • 19. 19 © 2023 Cloudera, Inc. All rights reserved. Kerberos vs JWT ● JWT ○ Token based authentication ■ Custom SASL auth provider (HBASE-23347) ■ HBase builtin works ongoing (HBASE-26553) ○ TLS for RPC (encrypts the token and further info) ■ HBASE-26666 recently added TLS for RPC support ○ Centralised authentication service ■ Easily reusable by clients/other services Security Simplification
  • 20. 20 © 2023 Cloudera, Inc. All rights reserved. ● HBase built-in capabilities ○ Master HA ○ Table sharding (Regions) ○ Meta region replica Availability and Fault Tolerance
  • 21. 21 © 2023 Cloudera, Inc. All rights reserved. Availability and Fault Tolerance (cont.) ● * Multiple Availability Zones (Multi AZ) ○ Master instances on different zones ○ RSes spread evenly throughout zones *https://blog.cloudera.com/high-availability-multi-az-for-cdp-operational-database/
  • 22. 22 © 2023 Cloudera, Inc. All rights reserved. Availability and Fault Tolerance (cont.) ● Limitations ○ Cache reload (Cloud Storage deployments) when RSs crashes ■ HBASE-27313 Persist list of HFiles after prefetch ■ HBASE-27743 Enhancement for persistent cache ■ HBASE-27389 Cost function that considers bucket cache as part of the plan before region movement ○ Multi AZ increases latency (Block Storage deployments)
  • 23. 23 © 2023 Cloudera, Inc. All rights reserved. YCSB Workloads ● Dataset size: 20 Billion rows = ~20TB ● Workload C: 100% Read ● Workload A: 50% read, 50% write ● Workload F: 50% read, 25% update, 25% read-modify-update ● Two common deployments, HDFS vs Cloud storage Benchmark
  • 24. 24 © 2023 Cloudera, Inc. All rights reserved. Clusters definitions ● AWS ○ HDFS ■ 20 RSes m5.2xlarge storage with HDD ■ 2 Masters m5.8xlarge ○ S3 ■ 20 RSes i3.2xlarge storage as S3 ■ 2 Masters m5.2xlarge ● Azure ○ HDFS ■ 20 RSes Standard_D8_V3 ■ 2 Masters Standard_D32_V3 ○ ABFS ■ 20 RSes Standard_L8s_V2 ■ 2 Masters Standard_D8a_V4 Benchmark
  • 25. 25 © 2023 Cloudera, Inc. All rights reserved. AWS with HDFS vs S3 (Higher is better) Benchmark
  • 26. 26 © 2023 Cloudera, Inc. All rights reserved. AWS with HDFS vs S3 (Lower is better) Benchmark There is an outlier on P95+, where the P95 latency is 28.3 ms vs 3.3 ms
  • 27. 27 © 2023 Cloudera, Inc. All rights reserved. Azure with HDFS vs ABFS (Higher is better) Benchmark
  • 28. 28 © 2023 Cloudera, Inc. All rights reserved. Benchmark Azure with HDFS vs ABFS (Lower is better)
  • 29. 29 © 2023 Cloudera, Inc. All rights reserved. Average comparison against HDFS on Block Storage Workload Type Latency S3 vs HDFS Throughput S3 vs HDFS Latency ABFS vs HDFS Throughput S3 vs HDFS Read only workload 47% lower 86% higher 48% lower 91% higher 50% Read, 50% Write R: 52% lower W: 28% higher 36% higher R: 46% lower W: 35% lower 66% higher 50% read, 25% update, 25% read-modify-update R: 87% lower W: 29% lower 66% higher R: 48% lower W: 35% lower 66% lower Benchmark
  • 30. 30 © 2023 Cloudera, Inc. All rights reserved. Sample comparison for a 50TB read/write workload on AWS: 50TB read/write workload S3 with ephemeral HDFS in block storage Number of instances 43 43 Instance Types 3 x m5.2xlarge 40 x i3.2xlarge 3 x m5.8xlarge 40 x m5.2xlarge Initial capacity with performance parity 64TB ephemeral storage for bucket cache 190TB EBS volumes for HDFS Cost of instances $10,778.58 $7,467.51 Cost of storage *$1,428.60 $9,216.00 Total cost (monthly) $12,207.18 (26.8%) $16,683.51 Cost Case Study * we estimated the amount of list/put/get requests to S3 is 50 millions per month, the actual cost may be slightly higher
  • 31. 31 © 2023 Cloudera, Inc. All rights reserved. Q/A
  • 32. 32 © 2023 Cloudera, Inc. All rights reserved. ● Wellington Chevreuil ○ SW Engineer at Cloudera, HBase PMC ● Stephen Wu ○ SW Engineer at Cloudera, HBase PMC Speaker information