Scylla 5.0 New Features, Part 2
Asias He, Principal Software Engineer & Raphael S. Carvalho, Software Engineer
Repair Based
Node Operations
Asias He
Principal Software Engineer
Asias He
■ Asias He is a long-time open source developer who previously worked on the Debian Project, the Solaris kernel, KVM virtualization for Linux, and the OSv unikernel. He now works on Seastar and Scylla.
Principal Software Engineer
Agenda
■ Repair based node operations
■ Off-strategy compaction
■ Gossip free node operations
Repair Based Node Operations
What is RBNO
■ Uses row-level repair, instead of streaming, as the underlying mechanism to sync data between nodes
■ A single mechanism for all node operations
• Bootstrap / replace / rebuild / decommission / removenode / repair
Benefits of RBNO
Significant improvements in performance and data safety
■ Resumable
• Resumes from a previously failed bootstrap instead of starting over
■ Consistency
• The latest replica data is guaranteed
■ Simplified
• No need to run repair before or after node operations like replace and removenode
■ Unified
• All node operations use the same underlying mechanism
Towards using RBNO by default
■ Enabled by default for replace operations
■ More operations will use RBNO by default in the future
■ All node operations are supported
■ Options turn RBNO on for specific node operations (config sketch below)
• E.g., --enable-repair-based-node-ops true
• E.g., --allowed-repair-based-node-ops replace,bootstrap
■ Ongoing I/O scheduler improvements reduce the latency impact
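These options can also be set persistently; a minimal sketch, assuming the scylla.yaml keys mirror the command-line flags above (check your version's documentation for the exact names):
# scylla.yaml -- keys assumed to mirror the flags shown above
enable_repair_based_node_ops: true
allowed_repair_based_node_ops: replace,bootstrap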
Off-strategy compaction
Introduction
Make compaction during node operations more efficient
■ What is it
• SSTables generated by node operations are kept in a separate data set
• When the node operation is done, they are compacted together and integrated into the main set
■ Benefits
• Less compaction work during node operations
• Faster to complete node operations
Current status
■ Enabled for all node operations
• repair, bootstrap, replace, decommission, rebuild
■ Normal trigger for node operations
• Triggered at the end of the node operation
■ Smart trigger for repair
• Waits for more repairs to arrive so more off-strategy compaction can be batched
Gossip Free Node Operations
What is it
■ A new feature that uses RPC verbs, instead of gossip status, to perform node operations such as adding or removing a node.
Benefits
■ Requires all nodes to participate by default
• Avoids data inconsistency if nodes are network partitioned
• Users can explicitly ignore dead nodes (usage sketch below)
• nodetool removenode --ignore-dead-nodes
• scylla --ignore-dead-nodes-for-replace
■ Automatically reverts to the previous state in case of error
■ Detects user operation mistakes
• Detects and rejects multiple node operations run in parallel
■ Each operation is assigned a UUID
• Makes individual node operations easier to identify
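A usage sketch for ignoring dead nodes; the host IDs are placeholders, and the exact flag syntax may vary by version:
# remove a node while explicitly ignoring unreachable nodes
nodetool removenode --ignore-dead-nodes <dead-node-host-id-1>,<dead-node-host-id-2> <host-id-of-node-to-remove>
# replace a node while ignoring unreachable nodes
scylla --ignore-dead-nodes-for-replace <dead-node-host-id-1>,<dead-node-host-id-2>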
How to use it
■ No action is needed from the user
■ Enabled for bootstrap, replace, decommission and removenode.
Thank you!
Stay in touch
Asias He
@asias_he
asias@scylladb.com
Repair Based Tombstone GC
Asias He
Principal Software Engineer
Asias He
■ Asias He is a long-time open source developer who previously worked on the Debian Project, the Solaris kernel, KVM virtualization for Linux, and the OSv unikernel. He now works on Seastar and Scylla.
Principal Software Engineer
Agenda
■ Background of tombstones
■ Timeout based tombstone GC
■ Repair based tombstone GC
Background
■ Tombstones are used to delete data (example below)
■ Tombstones can't be kept forever
■ Tombstone GC happens when
• The tombstone and the data it covers can be compacted away together
• The tombstone is old enough, i.e., older than gc_grace_seconds
■ A tombstone might be missed on some replicas
■ Data resurrection then happens
• Replica nodes with the tombstone GC it
• Replica nodes without the tombstone still contain the data that should have been deleted
• Reads return the deleted data to the user
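As a concrete illustration (using the ks.cf table from the examples later in this talk), a delete writes a tombstone rather than erasing the data in place:
-- the row is now shadowed by a tombstone; the space is reclaimed
-- only when compaction is allowed to GC that tombstone
DELETE FROM ks.cf WHERE key = 0x01;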
Current timeout based tombstone gc
■ A full cluster-wide repair must run within gc_grace_seconds
■ Not robust enough
• Nothing guarantees repair can finish in time
• Repair is a low-priority maintenance operation
■ Puts pressure on the people who operate Scylla
■ Performance impact during critical periods
Introduce repair based tombstone gc
■ The idea
• GC a tombstone only after repair is performed
■ Main benefits
• No need to tune and find a proper gc_grace_seconds
• No data resurrection if a cluster-wide repair couldn't be performed within gc_grace_seconds
• Less pressure on users operating Scylla clusters to run repairs in a timely manner
• Repair intensity can be throttled even further
• Reducing the latency impact on the user workload
• Since there is no longer a hard requirement to finish repair in time
• If repair is performed more frequently than gc_grace_seconds
• Tombstones can be garbage-collected sooner
• Improving performance
How to use it
■ ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'};
• The mode can be one of {timeout, repair, disabled, immediate}
• timeout = GC tombstones after gc_grace_seconds
• repair = GC tombstones after repair
• disabled = never GC tombstones
• immediate = GC tombstones immediately
■ CREATE TABLE ks.cf (key blob PRIMARY KEY, val blob) WITH tombstone_gc = {'mode':'repair'};
More considerations
■ When to use mode = immediate
• Use it for TWCS with no user deletes
• Safer than gc_grace_seconds = 0
• Deletes are rejected when mode = immediate
■ When to use mode = disabled
• For tools that may load Scylla with out-of-order writes or writes in the past, e.g., sstableloader and the CDC replicator
• Disable tombstone GC while such tools are in progress
■ What happens if mode = repair but repair cannot finish for some reason
• A new RESTful API can fake the repair history
• Use it as an emergency measure to allow GC
• Or switch back to mode = timeout (statement below)
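The fallback is just another mode change, for example:
ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'};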
How to upgrade from existing cluster
■ A gossip feature, TOMBSTONE_GC_OPTIONS, is added
■ The tombstone_gc option cannot be used until the cluster is fully upgraded
• E.g., in a mixed cluster:
• cqlsh> ALTER TABLE ks.table WITH tombstone_gc = {'mode':'repair'};
• ConfigurationException: tombstone_gc option not supported by the cluster
■ To keep maximum compatibility and introduce fewer surprises to users
• All tables default to mode = timeout (same as without this feature)
• Users have to set mode = repair explicitly
Thank you!
Stay in touch
Asias He
@asias_he
asias@scylladb.com
Compaction Enhancements: Increased Storage Density and Time Series Made Much Easier
Raphael S. Carvalho
Software Engineer
Raphael S. Carvalho
■ Member of the ScyllaDB storage team
■ Responsible for the compaction subsystem
■ Previously worked on Syslinux and OSv
Software Engineer at ScyllaDB
Agenda
■ Space optimization for incremental compaction
■ “Bucketless” time series, i.e. time series made much easier for you
■ Upcoming improvements
Let’s take a look back
■ Incremental Compaction Strategy (ICS) was introduced in enterprise release 2019.1.4
■ Known for combining techniques from both size-tiered and leveled strategies
■ Fixes the 100% space overhead problem of size-tiered compaction, increasing disk utilization significantly
Is it enough?
■ The space overhead of tiered compaction was efficiently fixed, however…
■ Incremental (ICS) and size-tiered (STCS) strategies share the same space amplification (~2-4x) with overwrite workloads:
• They cover a similar region in the three-dimensional efficiency space, also known as the RUM conjecture trade-offs.
[Diagram: RUM trade-off triangle with READ, WRITE, and SPACE vertices; STCS and ICS occupy a similar region]
Turns out it’s not enough. But can we do better?
■ The leveled strategy and size-tiered (or ICS) cover different regions
• Interesting regions cannot be reached with either strategy alone
• But interesting regions can be reached by combining the data layouts of both strategies
• i.e., a hybrid (tiered + leveled) approach
[Diagram: RUM trade-off triangle with READ, WRITE, and SPACE vertices; STCS, ICS, and LCS cover different regions]
Let’s work to optimize space efficiency then
■ A few high-level goals:
• Optimize space efficiency with overwrite workloads
• Ensure write and read latency meet SLA requirements
■ That’s the Space Amplification Goal (SAG) for you.
■ Increased storage density per node? YES.
■ Reduced costs? YES.
A few facts about this feature
■ This feature (available since Scylla Enterprise 2020.1.6) can only be used with Incremental Compaction
■ Compaction dynamically adapts to the workload to meet requirements
■ Under heavy write load, the strategy works to meet the write latency requirement
■ Otherwise, the strategy works to optimize space efficiency to the desired extent
■ This translates into:
• Storage density per node ++
• Costs --
• Scale ++
[Diagram: RUM trade-off triangle with READ, WRITE, and SPACE vertices; ICS+SAG reaches a previously uncovered region]
Enabling the space optimization (SAG)
■ The statement below enables the feature with a space amplification goal of 1.5
■ The lower the configured value, the lower the space amplification but the higher the write amplification
■ The adaptive approach minimizes the impact of the extra write amplification
■ Gives the user control to reach interesting regions in the three-dimensional efficiency space
ALTER TABLE keyspace.table
WITH compaction = {
'class': 'IncrementalCompactionStrategy',
'space_amplification_goal': '1.5'
};
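The goal can equally be set when a table is created; a sketch reusing the placeholder names above:
CREATE TABLE keyspace.table (key blob PRIMARY KEY, val blob)
WITH compaction = {
'class': 'IncrementalCompactionStrategy',
'space_amplification_goal': '1.5'
};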
Space optimization in action…
A common schema for time series looked like…
CREATE TABLE billy.readings (
sensor_id int,
date date,
time time,
temperature int,
PRIMARY KEY ((sensor_id, date), time)
)
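With this bucketed schema, even a simple two-day scan of one sensor needs one query per (sensor_id, date) partition, and the application must merge the results itself; an illustrative sketch with hypothetical values:
SELECT * FROM billy.readings WHERE sensor_id = 1 AND date = '2022-02-09';
SELECT * FROM billy.readings WHERE sensor_id = 1 AND date = '2022-02-10';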
Why bucket in time series?
■ Large partitions were known to cause all sorts of problems
• Index inefficiency when reading from the middle of a large partition
• Latency issues when repairing large partitions
• High resource usage and read amplification when querying multiple time windows
• Reactor stalls, which caused higher P99 latencies
• And so on…
■ Consequently, applications were forced to “bucket” partitions to keep their size within a limit.
Bucketed vs Unbucketed time series
[Diagram: with bucketing, a single time series is split across multiple smaller partitions, one per time window (sstable 1–4); without bucketing, the same series is one partition spanning all windows]
Time series made much easier for you!
■ But those bad days are gone!
• Scylla allows a large partition to be efficiently indexed: O(log N)
• Scylla’s row-level repair allows large partitions to be efficiently repaired
• TimeWindowCompactionStrategy can now efficiently query multiple time windows
• by discarding SSTable files whose time range is irrelevant to the query
• and incrementally opening the relevant files to reduce resource overhead
• Therefore, the read amplification and resource usage problems are fixed
A schema for time series can now look like…
■ There is no longer a date field in the schema, meaning that:
• The application won’t have to create new partitions on a fixed interval for a time series
• Querying a time series is much easier, as only a single partition is involved
■ The bucketing days are potentially gone!
■ Lots of complexity is removed from the application side
CREATE TABLE billy.readings (
sensor_id int,
time time,
temperature int,
PRIMARY KEY (sensor_id, time)
)
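Any time range for a sensor is now a single-partition query; an illustrative sketch with hypothetical values:
SELECT * FROM billy.readings
WHERE sensor_id = 1 AND time >= '08:00:00' AND time < '09:00:00';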
Upcoming improvements
■ Compaction is becoming more resilient and performant overall:
• Changes were recently made to make cleanup and major compactions more resilient when the system is running out of disk space
• Dynamic control of compaction fan-in to increase overall compaction efficiency
• Based on the observation that efficiency is a function of the number of input files and their relative sizes
• Only jobs whose efficiency is greater than or equal to that of ongoing jobs are submitted, so less efficient jobs don’t dilute the overall efficiency
• Tests show that write amplification is reduced under heavy write load while keeping space and read amplification within bounds
• Makes the system adapt even better to changing workloads
• More stability. More performance.
Upcoming improvements
■ Reducing compaction aggressiveness through:
• Improvements in the I/O scheduler (Pavel Emelyanov covers this in depth in his talk)
• Improvements in the compaction backlog controller
• Aiming to improve tail latency and overall system stability
■ Off-strategy compaction (covered in more depth in Asias He’s talk)
• Makes repair-based node operations more efficient and faster
• Consequently, better elasticity
Thank you!
Stay in touch
Raphael S. Carvalho
@raphael_scarv
raphaelsc@scylladb.com
Editor's Notes
  1. Hello everyone and welcome to my talk about SSTable compaction enhancements
  2. My name is Raphael Carvalho and I have been working on the ScyllaDB storage layer since its early days. Enough about me; let’s move on to the interesting part.
  3. In this session, we’ll describe the space optimization for incremental compaction that allows the storage density of nodes to increase. Additionally, we’ll show how Scylla makes it much easier to model time series data, without having to rely on old techniques like data bucketing, which is commonly used to avoid running into large-partition performance issues. Last but not least, we’ll talk about upcoming improvements that will make compaction better for you.
  4. Now, let’s take a look back. Incremental compaction strategy, or ICS, was introduced back in 2019, to solve the large space overhead that affected the users. Before its existence, users were left with no choice but to leave 50% of free disk space for compactions to succeed.
  5. But was it enough? Well, the aforementioned space overhead was efficiently fixed by Incremental Compaction; however, it still suffered from bad space amplification when facing overwrite-heavy workloads. That’s because the compaction strategy wasn’t efficient at removing the data redundancy accumulated across the tiers. We use a theoretical model called the RUM conjecture to reason about compaction efficiency. It states that a compaction strategy cannot be optimal at all three efficiency goals: read, write, and space. That’s why we have different strategies available, each better suited to a particular use case. If we look at the three-dimensional efficiency space, which represents the RUM conjecture trade-offs, we’ll see that incremental and size-tiered strategies cover a similar region.
  6. Turns out Incremental Compaction can do much better than fixing the space overhead problem. We know for a fact that leveled and size tiered strategies cover completely different regions in the efficiency space. Also, we know that interesting regions cannot be reached with either of them. However, very interesting regions in the efficiency space can be reached by combining the concepts of both strategies. We call it a hybrid approach.
  7. What do we actually want to accomplish with this hybrid approach? Let’s set a few goals. First, we want to optimize space efficiency for overwrite workloads, while ensuring write and read latencies meet service-level requirements. In other words, performance must be sufficient to meet the needs, but (space) efficiency should be as good as possible to allow for scale.
  8. That’s Space Amplification Goal for you. A feature that will help you increase storage density per node, therefore reducing costs. Who doesn’t like that?
  9. It is only available in Scylla Enterprise and can be used with our incremental compaction only. Everything was carefully implemented. To ensure latency will meet service-level requirements, compaction will dynamically adapt to the workload. Under heavy write load, the compaction strategy will enter a write-optimized mode to make sure the system can keep up with the write rate. Otherwise, the strategy will be continuously working to optimize space efficiency. The coexistence of both modes is the reason we call this a hybrid approach. The adaptive approach, combined with the hybrid one, is what makes this feature unique in the compaction world.
  10. Let’s get to a bit of action. How to enable the space optimization? That’s simply a matter of specifying a value between 1 and 2 for the strategy option named space_amplification_goal. The lower the value, the lower the space amplification but the higher the write amplification. 1.5 is a good value to start with. In order to optimize space efficiency, we’re willing to trade off extra write amplification. However, the adaptive approach minimizes the impact of the extra amplification, given that the strategy will switch between the modes, that is, write and space, whenever it has to. To conclude, this nicely gives the user control to reach interesting regions in the efficiency space, allowing the strategy to perform better for your particular use case.
  11. Now comes the interesting part… the optimization in action. We can clearly see in the graph that the lower the configured value, the lower the disk usage will be. In the example, with a value of 1.25, the space amplification reached a maximum of 100%, but eventually went below the 50% mark. If the system isn’t under heavy write load, then space amplification will meet the goal faster. As expected, the system optimizes for space once performance objectives are achieved.
  12. Now let’s switch gears and talk about how Scylla made time series less painful for application developers. Please look at that CREATE TABLE statement. That’s a typical schema for time series data. Note how the date field composes the partition key along with the sensor id. That’s a technique called data bucketing.
  13. This bucketing technique is mainly used to prevent large partitions from being created, as they were known to create all sorts of performance issues. For example: Scylla was very inefficient when reading from the middle of a large partition. Repair was an enemy of large partitions too. And Time Window Compaction Strategy wasn’t optimized for reading a large partition spanning multiple time windows.
  14. In the picture, each individual line represents a partition. In the bucketed case, the time series is split into multiple smaller partitions. While in the unbucketed case, the time series is kept in a single large partition. One of the main problems with bucketing is that lots of complexity is pushed to the application side. The application will have to keep track of all partitions that belong to a particular time series. Also, aggregation is more complex as the application has to figure out which partitions store a particular time range, query each one of them individually and finally aggregate the results.
  15. Fortunately, those bad days are gone. Scylla fixed all the problems aforementioned. Large partitions can be efficiently indexed, row-level repair was introduced to solve the problem of repairing large partitions, and time window compaction strategy can now efficiently read large partitions stored across multiple time windows. When a table uses the time window strategy, its SSTable files do not overlap in timestamp range, so a specialized reader was implemented that will discard irrelevant files and efficiently read the relevant ones. This reduces resource consumption and read amplification, making the queries much more efficient.
  16. With all those problems fixed, the schema for a time series application can now look much simpler. Queries become simpler. Your application becomes simpler. Please only make sure you have enough time series (partitions) to avoid hot spots, where a subset of shards may be processing much more data than its counterparts. For example, if your application is monitoring millions of devices, where each has its own time series, then you will not run into imbalance issues. But if you only have a few time series in your application, it’s better to rely on the old bucketing technique, to guarantee proper balancing.
  17. As for upcoming improvements, cleanup and major compaction will now be more resilient when the system is running out of disk space. The compaction manager will now dynamically control the compaction fan-in, essentially a threshold on the minimum number of input files for compaction, to increase overall compaction efficiency, which translates into lower write amplification. This decision is based on the observation that compaction efficiency is a function of the number of input files for compaction, and also their relative sizes. That’s why size-tiered and ICS favor similar-sized files to be compacted together. Essentially, we will increase the overall efficiency by not diluting it with compaction jobs that are less efficient than the ongoing ones. The system becomes more stable and performant as a result of this change.
  18. Last but not least, the I/O scheduler is being nicely enhanced by Pavel Emelyanov. The enhancements will make the system more stable, allowing compaction to have less impact on the other activities in the system, like user queries, streaming, and so on. To learn more about this, you can watch Pavel’s talk. And off-strategy compaction was written with the goal of making compaction less aggressive during node operations like bootstrap, regular repair, etc., allowing them to complete faster. Consequently the system will be able to scale faster, making Scylla elasticity even better.
  19. Thank you for watching! See you around.