Dissecting Real-World
Database Performance
Dilemmas
Felipe Mendes, Solution Architect
Noelly Medina, Technical Support Team Lead
Guilherme da Silva Nogueira, Senior Solutions Architect
+ For data-intensive applications that require high
throughput and predictable low latencies
+ Close-to-the-metal design takes full advantage of
modern infrastructure
+ >5x higher throughput
+ >20x lower latency
+ >75% TCO savings
+ Compatible with Apache Cassandra and Amazon
DynamoDB
+ DBaaS/Cloud, Enterprise and Open Source
solutions
The Database for Gamechangers
“ScyllaDB stands apart...It’s the rare product
that exceeds my expectations.”
– Martin Heller, InfoWorld contributing editor and reviewer
“For 99.9% of applications, ScyllaDB delivers all the
power a customer will ever need, on workloads that other
databases can’t touch – and at a fraction of the cost of
an in-memory solution.”
– Adrian Bridgewater, Forbes senior contributor
+400 Gamechangers Leverage ScyllaDB
Seamless experiences
across content + devices
Digital experiences at
massive scale
Corporate fleet
management
Real-time analytics
2,000,000 SKU e-commerce
management
Video recommendation
management
Threat intelligence service
using JanusGraph
Real time fraud detection
across 6M transactions/day
Uber scale, mission critical
chat & messaging app
Network security threat
detection
Power ~50M X1 DVRs with
billions of reqs/day
Precision healthcare via
Edison AI
Inventory hub for retail
operations
Property listings and
updates
Unified ML feature store
across the business
Cryptocurrency exchange
app
Geography-based
recommendations
Global operations: Avon,
Body Shop + more
Predictable performance for
on sale surges
GPS-based exercise
tracking
Serving dynamic live
streams at scale
Powering India's top
social media platform
Personalized
advertising to players
Distribution of game
assets in Unreal Engine
Introductions
Felipe Mendes, Solution Architect
+ Helps teams solve their most challenging problems
+ Author of Database Performance at Scale
Noelly Medina, Technical Support Team Lead
+ Years of experience in customer-facing roles
+ Always seeking new challenges
Guilherme da Silva Nogueira, Senior Solutions Architect
+ Years of experience with Linux and distributed systems
+ Just got a new puppy
Puppy time!
Meet Baunilha
(Portuguese for Vanilla)
Agenda
+ Hunting down a Latency Problem
+ Ticking Time-Series Bomb
+ Scale, scale… SCALE!
+ Hot Facts, Cold Insights
Hunting Down a Latency
Problem
Uncovering a Multi-Region Performance Challenge
Customer evaluated ScyllaDB and was happy with the results
+ Initial testing:
+ 3 node cluster (AWS us-east-1)
+ P99 < 15ms
+ Using gocql driver:
+ Following query best-practices
+ Making use of gocql.DataCentreHostFilter
+ All application queries using a LOCAL_* ConsistencyLevel
A Latency Sensitive Workload for AdTech
Final production requirements were multi-region (AWS us-west-1)
+ Followed the Adding a New DC to a ScyllaDB Cluster procedure
+ Latency went through the roof!
The PROBLEM
Why?!
+ We realized that latencies only affected a specific ScyllaDB scheduling class
A Few Data Points
+ A path for resolution: main query class P99 driven by the Network RTT
+ nodetool setlogginglevel query_processing trace:
Whoops! Tracking down individual queries
+ Seems like we are getting close. Ideas?
for i in *; do grep -EHi "SELECT|UPDATE|INSERT|DELETE" "$i" | awk -F':' '{ print $1, $NF }'; done | sort -V | uniq -c
215 (cached) "SELECT * FROM system_auth.roles WHERE role = ?" (cassandra)
1 (cached) "SELECT * FROM system_distributed.service_levels;" ()
304 (cached) "SELECT * FROM system_auth.roles WHERE role = ?" (cassandra)
3 (cached) "SELECT * FROM system_distributed.service_levels;" ()
323 (cached) "SELECT * FROM system_auth.roles WHERE role = ?" (cassandra)
2 (cached) "SELECT * FROM system_distributed.service_levels;" ()
1 (cached) "SELECT id, data, written_at, version FROM system.batchlog LIMIT 128" ()
7 (cached) "SELECT * FROM system_distributed.service_levels;" ()
1 (cached) "SELECT id, data, written_at, version FROM system.batchlog LIMIT 128" ()
6 (cached) "SELECT * FROM system_distributed.service_levels;" ()
1 (cached) "SELECT id, data, written_at, version FROM system.batchlog LIMIT 128" ()
6 (cached) "SELECT * FROM system_distributed.service_levels;" ()
+ From ScyllaDB:
The answer lies within the code (and in our docs!)
+ From Enable Authorization:
db::consistency_level
password_authenticator::consistency_for_user(std::string_view role_name) {
if (role_name == DEFAULT_USER_NAME) {
return db::consistency_level::QUORUM;
}
return db::consistency_level::LOCAL_ONE;
}
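The function above is where the cross-region latency comes from: the default user is looked up at QUORUM, and QUORUM counts replicas across every datacenter rather than only the local one. A minimal Python sketch of that arithmetic follows. The replication factor of 3 per datacenter is a hypothetical value chosen for illustration (the slides do not state the real one), and the helper names are made up for this example.

```python
def quorum(replication: dict) -> int:
    """Replicas that must answer a QUORUM request: a majority of all replicas."""
    return sum(replication.values()) // 2 + 1


def needs_remote_replica(replication: dict, local_dc: str) -> bool:
    """True when the local datacenter alone cannot satisfy QUORUM."""
    return quorum(replication) > replication.get(local_dc, 0)


# Hypothetical replication factors, not taken from the slides.
single_region = {"us-east-1": 3}
multi_region = {"us-east-1": 3, "us-west-1": 3}

print(quorum(single_region), needs_remote_replica(single_region, "us-east-1"))
print(quorum(multi_region), needs_remote_replica(multi_region, "us-east-1"))
```

Under these assumptions the single-region cluster needs 2 of 3 local replicas, while the two-region cluster needs 4 of 6, so at least one answer has to cross the region boundary and the lookup pays the network RTT. Any non-default role is read at LOCAL_ONE and stays in the local datacenter.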
Ticking Time-series bomb
Rushing against time
Log retention use case in production
+ ScyllaDB Cloud cluster in GCP
+ 6 node cluster
+ Latency within bounds, except during repairs
Major Streaming Company
Cluster repairs were taking a long time to complete
+ Some nodes started to run out of disk space
+ Customer was under a time-sensitive "freeze" period
The PROBLEM
4% free space!
+ Compaction was unable to keep up with the rate of incoming files:
An interesting symptom
What would you do?
+ What we know thus far:
+ Nodes are running out of disk space
+ Compaction is unable to catch up
+ Repair is taking a long time
Data Modeling Review
+ Making use of Jaeger as an integration:
+ Business-critical application metrics
+ Time to Live (TTL) for automatic data expiration
+ Spreads data using a "bucket" technique, sorted by time
+ Circling back to the basics:
Data Modeling Review
$ cqlsh -e "DESC SCHEMA" | egrep -i "compaction =" | sort -h | uniq -c
43 AND compaction = {'class': 'IncrementalCompactionStrategy'}
33 AND compaction = {'class': 'SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
55 AND compaction = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_size': '1', 'compaction_window_unit': 'HOURS'}
+ 131 tables to run through … 😢
+ … but TWCS tables seem interesting 💪
+ TTL 2592000 (in seconds) == 30 days
+ Split in 1 hour buckets
+ 30 * 24 buckets = 720 windows!
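The window count above follows directly from dividing the TTL by the compaction window size. A small Python sketch of that calculation, assuming a 30-day TTL (2,592,000 seconds) and 1-hour windows as in the schema; the function name and the daily-window comparison are illustrative only, since the slides do not say which window size was eventually chosen.

```python
import math

# Seconds per unit for the compaction_window_unit values used in this sketch.
UNIT_SECONDS = {"MINUTES": 60, "HOURS": 3600, "DAYS": 86400}


def twcs_window_count(ttl_seconds: int, window_size: int, window_unit: str) -> int:
    """Number of time windows that hold live data before the TTL expires it."""
    window_seconds = window_size * UNIT_SECONDS[window_unit]
    return math.ceil(ttl_seconds / window_seconds)


ttl = 30 * 24 * 3600  # 2,592,000 seconds, i.e. 30 days

print(twcs_window_count(ttl, 1, "HOURS"))  # 720
print(twcs_window_count(ttl, 1, "DAYS"))   # 30
```

Each window keeps its own SSTables until the data expires, so the number of windows is a rough lower bound on how many SSTables the table carries at any time.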
Diving Deeper
CREATE TABLE x.y (
    <...>
) WITH crc_check_chance = 1.0
    AND compaction = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_size': '1', 'compaction_window_unit': 'HOURS'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND default_time_to_live = 2592000
    AND gc_grace_seconds = 10800;
Understanding the problem
TWCS
Picture from: https://www.pythian.com/blog/proposal-for-a-new-cassandra-cluster-key-compaction-strategy
After schema changes
+ Repair completed, memory utilization reduced, performance improved!
+ Customer explained they were using Jaeger integration "stock" settings
+ We reported and addressed jaegertracing/jaeger/4561 upstream:
+ "With ScyllaDB (and likely Cassandra too), increasing TTL leads to large numbers of sstables"
+ Felipe introduced the "twcs_max_window_count" tunable in ScyllaDB:
+ Safemode - Introduce TimeWindowCompactionStrategy Guardrails
+ Available starting at ScyllaDB Open Source 5.2 & Enterprise 2023.1
+ Takeaways
+ Be suspicious of activities taking a long time to complete
+ Review how third-party integrations work under the hood
+ Rest assured with ScyllaDB Cloud expertise to back you up!
Post event findings (and diligence)
Scale, scale… SCALE!
Should the sky really be the limit?
+ This is a fairly common scenario:
+ Decades of user data stored to be retrieved live
+ Long-term Cassandra deployment
+ Massive data growth YoY, dozens of Terabytes per cluster
+ Multiple clusters per geo
+ Some part of the dataset replicated EU/Asia
+ Rest of the dataset siloed per region
+ ~260TB raw dataset
+ This is a Cassandra cluster, so we should just scale horizontally, right? 🚀
Use Case Review
+ Multiple clusters scaling horizontally to keep up with data growth
+ From 9 nodes, to 12, 18, 30, 45, 90, … 200, 400!? 😱
+ Managing multiple clusters of hundreds of nodes is challenging
+ From a technical and cost perspective 💸
+ Each node has CPU and RAM, which sums up to a stratospheric amount 🛰
+ Assuming 64 vCPUs and 512GB RAM per node, 400 nodes = ~25K vCPUs and ~200TB RAM
Node sprawl
So, what next?
+ Current technology restricted storage density per node
+ Does not make efficient use of the hardware
+ Cannot tackle nodes denser in storage
Look for alternatives
+ From: 64 vCPUs x 400 nodes = 25K vCPUs
+ To: 21 x i3en.24xlarge (96 vCPUs, 60TB storage) = 2K vCPUs
+ ~12x reduction in vCPUs
+ ~20x reduction in node count
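The reduction figures above can be checked with a few lines of arithmetic. The Python sketch below uses only the node counts and vCPU sizes quoted on the slide; the `Fleet` class is a hypothetical helper introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class Fleet:
    """A homogeneous group of database nodes."""
    nodes: int
    vcpus_per_node: int

    @property
    def total_vcpus(self) -> int:
        return self.nodes * self.vcpus_per_node


before = Fleet(nodes=400, vcpus_per_node=64)  # the sprawling deployment
after = Fleet(nodes=21, vcpus_per_node=96)    # 21 x i3en.24xlarge

print(before.total_vcpus)                                # 25600
print(after.total_vcpus)                                 # 2016
print(round(before.total_vcpus / after.total_vcpus, 1))  # 12.7
print(round(before.nodes / after.nodes, 1))              # 19.0
```

The exact ratios come out at about 12.7x fewer vCPUs and about 19x fewer nodes, which the slide rounds to ~12x and ~20x.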
How can ScyllaDB help?
Hot Facts, Cold Insights
Data Thermodynamics & ScyllaDB
Engaged with us after losing data in HBase
+ On-prem bare-metal deployment
+ HUGE storage footprint – Petabyte range
+ Tiered storage or similar requirement:
+ "hot" – frequently accessed data or;
+ "cold" – least accessed data
+ No retention periods, data is forever stored
Worldwide Feed Aggregator
Which deployment and replication strategies to follow?
+ Single cluster, dual hot/cold DC?
+ Separate clusters, observable replication?
+ Should we use CDC for replication?
+ To dual-write or not?
+ How to evict data from the "hot" cluster?
Challenges
Let's talk
strategies
Reasons were plenty:
+ An on-premise deployment makes it fairly difficult to effectively leverage it
+ Decommissioning the previous HBase infrastructure wasn't an option
+ Performance:
+ "Local" Object Storage latencies were suboptimal
+ Cloud ones would result in:
+ Network RTT penalty
+ Internet traffic ($$)
+ Need to come up with data tiering/replication strategies anyway…
+ … and still figure out how to enable their applications to work with an Object Store
Why not just use Object Storage?
Plan ahead:
+ ScyllaDB's specialized cache (LSA) fully allocates the server's memory
+ Important on-disk components (SSTable metadata) also need to be held in memory
+ RAM-to-disk ratio: 1:30 for performance, 1:100 as an upper bound
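To make the ratios concrete, the Python sketch below reads them as RAM-to-disk ratios (an interpretation of the slide) and computes how much storage a node could carry for a given amount of memory. The 512 GB figure is borrowed from the node size mentioned earlier in the deck purely as an example, and the function name is hypothetical.

```python
def disk_limits_tb(ram_gb: float, recommended_ratio: int = 30, max_ratio: int = 100):
    """Return (recommended, upper-bound) disk capacity in TB for a node.

    Assumes the ratios mean 1 unit of RAM per N units of disk,
    and uses decimal units (1 TB = 1000 GB).
    """
    recommended_tb = ram_gb * recommended_ratio / 1000
    upper_bound_tb = ram_gb * max_ratio / 1000
    return recommended_tb, upper_bound_tb


recommended, upper = disk_limits_tb(512)
print(f"recommended: {recommended:.2f} TB, upper bound: {upper:.2f} TB")
```

With 512 GB of RAM this gives roughly 15 TB of disk at 1:30 and about 51 TB at 1:100, which is why a petabyte-range footprint has to be planned around memory as much as around disks.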
Memory and Storage Limits
Ultimately, they decided to apply the following strategy:
+ Application primarily writes to "hot" DC;
+ Replication from "hot" to "cold" is asynchronously handled by ScyllaDB
+ After their cut-off period, simply remove all replicas from the "hot" DC:
ScyllaDB Replication
ALTER KEYSPACE replicated_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'hot': 0, 'cold': 3};
+ Similar strategy already employed in HBase, zero friction point
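The two phases of this strategy differ only in the per-datacenter replication factors. The Python sketch below builds the corresponding statements as plain strings. The helper name is hypothetical, the pre-cut-off value of 3 replicas in the "hot" DC is an assumption (the slides only show the final statement), and the keyspace name is not validated or escaped.

```python
def alter_replication(keyspace: str, replicas_per_dc: dict) -> str:
    """Build an ALTER KEYSPACE statement for NetworkTopologyStrategy."""
    options = {"class": "NetworkTopologyStrategy", **replicas_per_dc}
    # repr() quotes strings with single quotes and leaves integers bare.
    body = ", ".join(f"'{key}': {value!r}" for key, value in options.items())
    return f"ALTER KEYSPACE {keyspace} WITH replication = {{{body}}};"


# Before the cut-off: data lives in both DCs (hot RF of 3 is assumed).
print(alter_replication("replicated_keyspace", {"hot": 3, "cold": 3}))

# After the cut-off: drop every replica from the "hot" DC.
print(alter_replication("replicated_keyspace", {"hot": 0, "cold": 3}))
```

The second statement printed matches the one on the slide. Generating it from a dictionary keeps the datacenter names and replica counts in one place if the cut-off is applied to several keyspaces.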
Performance requires a holistic view:
+ An overlooked setting may be the culprit behind performance problems
+ Integrations won't always follow best practices
+ Be sure to make the most of your underlying (and available) infrastructure
+ Context is fundamental to guide your decisions
Summary
Q&A
ScyllaDB Cloud
Start free trial
scylladb.com/cloud
Feb 14-15 | VIRTUAL EVENT
scylladb.com/summit
Virtual Workshop
January 25, 2024
scylladb.com/events
Thank you
for joining us today.
@scylladb scylladb/
slack.scylladb.com
@scylladb company/scylladb/
scylladb/

costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 

Recently uploaded (20)

Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 

Dissecting Real-World Database Performance Dilemmas

  • 1. Dissecting Real-World Database Performance Dilemmas Felipe Mendes, Solution Architect Noelly Medina, Technical Support Team Lead Guilherme da Silva Nogueira, Senior Solutions Architect
  • 2. The Database for Gamechangers + For data-intensive applications that require high throughput and predictable low latencies + Close-to-the-metal design takes full advantage of modern infrastructure + >5x higher throughput + >20x lower latency + >75% TCO savings + Compatible with Apache Cassandra and Amazon DynamoDB + DBaaS/Cloud, Enterprise and Open Source solutions “ScyllaDB stands apart...It’s the rare product that exceeds my expectations.” – Martin Heller, InfoWorld contributing editor and reviewer “For 99.9% of applications, ScyllaDB delivers all the power a customer will ever need, on workloads that other databases can’t touch – and at a fraction of the cost of an in-memory solution.” – Adrian Bridgewater, Forbes senior contributor
  • 3. +400 Gamechangers Leverage ScyllaDB Seamless experiences across content + devices Digital experiences at massive scale Corporate fleet management Real-time analytics 2,000,000 SKU e-commerce management Video recommendation management Threat intelligence service using JanusGraph Real time fraud detection across 6M transactions/day Uber scale, mission critical chat & messaging app Network security threat detection Power ~50M X1 DVRs with billions of reqs/day Precision healthcare via Edison AI Inventory hub for retail operations Property listings and updates Unified ML feature store across the business Cryptocurrency exchange app Geography-based recommendations Global operations- Avon, Body Shop + more Predictable performance for on sale surges GPS-based exercise tracking Serving dynamic live streams at scale Powering India's top social media platform Personalized advertising to players Distribution of game assets in Unreal Engine
  • 4. Introductions Felipe Mendes, Solution Architect + Helps teams solve their most challenging problems + Author of Database Performance at Scale Noelly Medina, Technical Support Team Lead + Years of experience in customer-facing roles + Always seeking new challenges Guilherme da Silva Nogueira, Senior Solutions Architect + Years of experience with Linux and distributed systems + Just got a new puppy
  • 6. Agenda + Hunting down a Latency Problem + Ticking Time-Series Bomb + Scale, scale… SCALE! + Hot Facts, Cold Insights
  • 7. Hunting Down a Latency Problem: Uncovering a Multi-Region Performance Challenge
  • 8. A Latency Sensitive Workload for AdTech: Customer evaluated ScyllaDB and was happy with the results + Initial testing: + 3 node cluster (AWS us-east-1) + P99 < 15ms + Using gocql driver: + Following query best-practices + Making use of gocql.DataCentreHostFilter + All application queries using a LOCAL_* ConsistencyLevel
  • 9. The PROBLEM: Final production requirements were multi-region (AWS us-west-1) + Followed the "Adding a New DC to a ScyllaDB Cluster" procedure + Latency went through the roof!
  • 11. A Few Data Points + We realized that latencies only affected a specific ScyllaDB scheduling class + A path for resolution: main query class P99 driven by the network RTT
  • 12. Whoops! Tracking down individual queries
    + nodetool setlogginglevel query_processing trace:
        for i in *; do egrep -Hi "SELECT|UPDATE|INSERT|DELETE" $i | awk -F':' '{ print $1, $NF }'; done | sort -V | uniq -c
        215 (cached) "SELECT * FROM system_auth.roles WHERE role = ?" (cassandra)
          1 (cached) "SELECT * FROM system_distributed.service_levels;" ()
        304 (cached) "SELECT * FROM system_auth.roles WHERE role = ?" (cassandra)
          3 (cached) "SELECT * FROM system_distributed.service_levels;" ()
        323 (cached) "SELECT * FROM system_auth.roles WHERE role = ?" (cassandra)
          2 (cached) "SELECT * FROM system_distributed.service_levels;" ()
          1 (cached) "SELECT id, data, written_at, version FROM system.batchlog LIMIT 128" ()
          7 (cached) "SELECT * FROM system_distributed.service_levels;" ()
          1 (cached) "SELECT id, data, written_at, version FROM system.batchlog LIMIT 128" ()
          6 (cached) "SELECT * FROM system_distributed.service_levels;" ()
          1 (cached) "SELECT id, data, written_at, version FROM system.batchlog LIMIT 128" ()
          6 (cached) "SELECT * FROM system_distributed.service_levels;" ()
    + Seems like we are getting close. Ideas?
  • 13. The answer lies within the code (and in our docs!)
    + From ScyllaDB:
        db::consistency_level password_authenticator::consistency_for_user(std::string_view role_name) {
            if (role_name == DEFAULT_USER_NAME) {
                return db::consistency_level::QUORUM;
            }
            return db::consistency_level::LOCAL_ONE;
        }
    + From Enable Authorization: (documentation excerpt shown on the slide)
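The effect of QUORUM after adding a second data center can be checked with a little arithmetic. The sketch below is illustrative only: it assumes `system_auth` is replicated with a replication factor of 3 in each data center, which the slides do not state, and the function names are hypothetical.

```python
def quorum(total_replicas: int) -> int:
    """Replicas that must respond for a QUORUM read: floor(n / 2) + 1."""
    return total_replicas // 2 + 1


def remote_replicas_needed(rf_per_dc: dict[str, int], local_dc: str) -> int:
    """Minimum number of remote-DC replicas a QUORUM read has to wait for."""
    needed = quorum(sum(rf_per_dc.values()))
    return max(0, needed - rf_per_dc[local_dc])


# Assumed replication factors (not given in the slides).
single_dc = {"us-east-1": 3}
multi_dc = {"us-east-1": 3, "us-west-1": 3}

print(remote_replicas_needed(single_dc, "us-east-1"))  # 0
print(remote_replicas_needed(multi_dc, "us-east-1"))   # 1
```

Under these assumptions, every role lookup made as the default `cassandra` user has to wait for at least one replica in the other region, so the cross-region RTT shows up in the latency. A non-default role is looked up at LOCAL_ONE, per the ScyllaDB snippet, and stays within the local data center.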
  • 15. Major Streaming Company: Log retention use case in production + ScyllaDB Cloud cluster in GCP + 6 node cluster + Latency within bounds, except during repairs
  • 16. The PROBLEM: Cluster repairs were taking a long time to complete + Some nodes started to run out of disk space (4% free space!) + Customer was under a time-sensitive "freeze" period
  • 17. An interesting symptom + Compaction was unable to keep up with the rate of incoming files
  • 19. Data Modeling Review + What we know thus far: + Nodes are running out of disk space + Compaction is unable to catch up + Repair is taking a long time
  • 20. Data Modeling Review
    + Making use of Jaeger as an integration:
      + Business-critical application metrics
      + Time to Live (TTL) for automatic data expiration
      + Spreads data using a "bucket" technique, sorted by time
    + Circling back to the basics:
        $ cqlsh -e "DESC SCHEMA" | egrep -i "compaction =" | sort -h | uniq -c
        43 AND compaction = {'class': 'IncrementalCompactionStrategy'}
        33 AND compaction = {'class': 'SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
        55 AND compaction = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_size': '1', 'compaction_window_unit': 'HOURS'}
    + 131 tables to run through … 😢
    + … but TWCS tables seem interesting 💪
  • 21. Diving Deeper
    + TTL 2592000 (in seconds) == 30 days
    + Split in 1 hour buckets
    + 30 * 24 buckets = 720 windows!
        CREATE TABLE x.y ( <...> ) WITH crc_check_chance = 1.0
            AND compaction = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_size': '1', 'compaction_window_unit': 'HOURS'}
            AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
            AND default_time_to_live = 2592000
            AND gc_grace_seconds = 10800
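The window arithmetic on this slide can be reproduced with a short sketch. The helper names are hypothetical, and the target of 50 windows is only an illustrative limit chosen for this example; the slides do not say which limit was applied.

```python
import math


def twcs_window_count(ttl_seconds: int, window_seconds: int) -> int:
    """Number of TWCS time windows that can hold live data for a given TTL."""
    return math.ceil(ttl_seconds / window_seconds)


def min_window_hours(ttl_seconds: int, max_windows: int) -> int:
    """Smallest whole-hour window size that keeps the window count <= max_windows."""
    return math.ceil(ttl_seconds / (max_windows * 3600))


TTL = 2_592_000  # default_time_to_live from the schema: 30 days in seconds

print(twcs_window_count(TTL, 3600))  # 720 windows with 1-hour buckets
print(min_window_hours(TTL, 50))     # 15 (hours), assuming a 50-window target
```

With 15-hour windows the same 30-day TTL spans 48 windows instead of 720, which is the kind of reduction a schema change to the window size is after.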
  • 22. Understanding the problem: TWCS. Picture from: https://www.pythian.com/blog/proposal-for-a-new-cassandra-cluster-key-compaction-strategy
  • 23. After schema changes + Repair completed, memory utilization reduced, performance improved!
  • 24. Post event findings (and diligence) + Customer explained they were using Jaeger integration "stock" settings + We reported and addressed jaegertracing/jaeger/4561 upstream: + "With ScyllaDB (and likely Cassandra too), increasing TTL leads to large numbers of sstables" + Felipe introduced the "twcs_max_window_count" tunable in ScyllaDB: + Safemode - Introduce TimeWindowCompactionStrategy Guardrails + Available starting at ScyllaDB Open Source 5.2 & Enterprise 2023.1 + Takeaways + Be suspicious of activities taking a long time to complete + Review how third-party integrations work under the hood + Rest assured with ScyllaDB Cloud expertise to back you up!
  • 25. Scale, scale… SCALE! Should the sky really be the limit?
  • 26. Use Case Review + This is a fairly common scenario: + Decades of user data stored to be retrieved live + Long-term Cassandra deployment + Massive data growth YoY, dozens of Terabytes per cluster + Multiple clusters per geo + Some part of the dataset replicated EU/Asia + Rest of the dataset siloed per region + ~260TB raw dataset + This is a Cassandra cluster, so we should just scale horizontally, right? 🚀
  • 27. Node sprawl + Multiple clusters scaling horizontally to keep up with data growth + From 9 nodes, to 12, 18, 30, 45, 90, … 200, 400!? 😱 + Managing multiple clusters of 100's of nodes is challenging + From a technical and cost perspective 💸 + Each node has CPU and RAM, which sums up to a stratospheric amount 🛰 + Assuming 64 vCPUs and 512GB RAM per node, 400 nodes = ~25K vCPUs and ~200TB RAM
  • 29. Look for alternatives + Current technology restricted storage density per node + Does not make efficient use of the hardware + Cannot tackle nodes denser in storage
  • 30. How can ScyllaDB help? + From: 64 vCPUs x 400 nodes = 25K vCPUs + To: 21 x i3en.24xlarge (96 vCPUs, 60TB storage) = 2K vCPUs + ~12x reduction in vCPUs + ~20x reduction in node count
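The before/after figures can be recomputed directly. This is a minimal sketch that uses only the node counts and per-node vCPUs quoted on the slides; the `Fleet` class is a hypothetical helper introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fleet:
    nodes: int
    vcpus_per_node: int

    @property
    def total_vcpus(self) -> int:
        return self.nodes * self.vcpus_per_node


before = Fleet(nodes=400, vcpus_per_node=64)  # Cassandra fleet from the slides
after = Fleet(nodes=21, vcpus_per_node=96)    # 21 x i3en.24xlarge

print(before.total_vcpus)                                # 25600
print(after.total_vcpus)                                 # 2016
print(round(before.total_vcpus / after.total_vcpus, 1))  # 12.7
print(round(before.nodes / after.nodes, 1))              # 19.0
```

The exact ratios (12.7x fewer vCPUs, 19x fewer nodes) line up with the rounded "~12x" and "~20x" on the slide.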
  • 31. Hot Facts, Cold Insights: Data Thermodynamics & ScyllaDB
  • 32. Worldwide Feed Aggregator: Engaged with us after losing data in HBase + On-prem bare-metal deployment + HUGE storage footprint – Petabyte range + Tiered storage or similar requirement: + "hot" – frequently accessed data or; + "cold" – least accessed data + No retention periods, data is stored forever
  • 33. Challenges: Which deployment and replication strategies to follow? + Single cluster, dual hot/cold DC? + Separate clusters, observable replication? + Should we use CDC for replication? + To dual-write or not? + How to evict data from the "hot" cluster?
  • 35. Why not just use Object Storage? Reasons were plenty: + An on-premise deployment makes it fairly difficult to effectively leverage it + Decommissioning the previous HBase infrastructure wasn't an option + Performance: + "Local" Object Storage latencies were suboptimal + Cloud ones would result in: + Network RTT penalty + Internet traffic ($$) + Need to come up with data tiering/replication strategies anyway… + … and still figure out how to enable their applications to work with an Object Store
  • 36. Memory and Storage Limits: Plan ahead: + ScyllaDB's specialized cache (LSA) fully allocates the server's memory + Important on-disk components (SSTable metadata) need to be kept in memory + RAM-to-disk ratio: 1:30 for performance, 1:100 as an upper bound
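To make the ratios concrete, the sketch below computes how much RAM a node would need for a given amount of storage. The 60TB figure is borrowed from the i3en.24xlarge example earlier in the deck purely for illustration, and `min_ram_gb` is a hypothetical helper name; decimal units (1TB = 1000GB) are assumed.

```python
def min_ram_gb(storage_tb: float, disk_per_ram: int) -> float:
    """RAM (GB) needed so that storage is at most `disk_per_ram` times the RAM."""
    return storage_tb * 1000 / disk_per_ram


STORAGE_TB = 60  # illustrative node size, not the customer's actual hardware

print(min_ram_gb(STORAGE_TB, 30))   # 2000.0 GB for the 1:30 performance ratio
print(min_ram_gb(STORAGE_TB, 100))  # 600.0 GB at the 1:100 upper bound
```

Running the same calculation against candidate hardware shows quickly whether a dense "cold" node stays within the 1:100 limit.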
  • 37. ScyllaDB Replication: Ultimately, they decided to apply the following strategy:
    + Application primarily writes to "hot" DC;
    + Replication from "hot" to "cold" is asynchronously handled by ScyllaDB
    + After their cut-off period, simply remove all replicas from the "hot" DC:
        ALTER KEYSPACE replicated_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'hot': 0, 'cold': 3};
    + Similar strategy already employed in HBase, zero friction point
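If the cut-off step is applied to many keyspaces, the statement can be generated rather than typed by hand. This is a minimal sketch with a hypothetical helper name; it only builds the CQL string and does not talk to a cluster.

```python
def alter_replication_cql(keyspace: str, rf_per_dc: dict[str, int]) -> str:
    """Build an ALTER KEYSPACE statement for NetworkTopologyStrategy.

    The keyspace and DC names are interpolated as-is, so they must come
    from trusted configuration, never from user input.
    """
    options = ", ".join(f"'{dc}': {rf}" for dc, rf in rf_per_dc.items())
    return (
        f"ALTER KEYSPACE {keyspace} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', {options}}};"
    )


# Reproduces the statement from the slide: drop all "hot" replicas, keep 3 in "cold".
print(alter_replication_cql("replicated_keyspace", {"hot": 0, "cold": 3}))
```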
  • 38. Summary: Performance requires a holistic view: + An overlooked setting may be the culprit behind performance problems + Integrations won't always follow best practices + Be sure to make the most of your underlying (and available) infrastructure + Context is fundamental to guide your decisions
  • 39. Q&A ScyllaDB Cloud Start free trial scylladb.com/cloud Feb 14-15 | VIRTUAL EVENT scylladb.com/summit Virtual Workshop January 25, 2024 scylladb.com/events
  • 40. Thank you for joining us today. @scylladb scylladb/ slack.scylladb.com @scylladb company/scylladb/ scylladb/