How Scylla Manager Handles Backups
Michał Matczuk, Scylla Manager Team Leader
One to rule them all
One to find them
One to repair them
One to backup them
One to manage them...
Presenter
Michał Matczuk
Team Leader of Scylla Manager
Contributor to C* Go driver
Author of C* ORM scylladb/gocqlx
Father of 3
Scylla Manager
1.x (current)
+ Repair
+ Healthcheck
+ SD for Monitoring
+ SSH node access
2.0 (in a few weeks)
+ Backup to S3
+ Agent HTTPS node access
2.x (Q1/Q2 2020)
+ Backup to Azure, Google, Dropbox,...
+ Faster repair
+ Maintenance windows
+ Clone cluster
3.x (Q3/Q4 2020)
+ Integrated Monitoring
+ Basic UI
+ Out/down scale
+ Rolling upgrade
+ Rolling config update
+ Log collection
Scylla Manager 2.0
No sidecar in 1.x
It was a nice hack:
+ SSH server as a reverse-proxy
+ HTTP client with libssh session dial
+ No aux ports opened
Go SSH goodies are now open-sourced: github.com/scylladb/go-sshtools
Deployment in 1.x
[Diagram: sctool -> REST API (localhost) -> Scylla Manager server -> SSH -> Scylla REST API on the Scylla node]
Problems with SSH
Notorious scyllamgr_ssh_setup bash script
When it works it works.
Field issues:
+ Wild SSHd configurations
+ Timeouts on reverse DNS
+ Proxying - keepalives
+ User model mismatch
Deployment in 2.0
[Diagram: sctool -> REST API (localhost) -> Scylla Manager server -> HTTPS -> Scylla REST API on the Scylla node]
Agent: the benefits
Low impact on Scylla nodes
+ CPU pinning
+ Systemd slices
Just works™
+ Installs with Scylla
+ Auto-configures from Scylla API
Simplified security
+ Bearer Authentication
+ Mutual TLS
+ Scylla data dir jail
Agent integrates Rclone https://rclone.org/
+ "rsync for cloud storage"
+ Virtually any Object Storage, SFTP, localdir
+ Vibrant community
+ Mature, written in Go
+ MIT licensed
Extensibility!
Rclone - are we done?
CLI
+ Global state
+ "Operation at a time" assumption
+ Lots of small bugs when run as a server
Transfer cancelation #3257
Polluting Linux page cache #3413
Performance audit #3419
Accounting of in-flight transfers only
File based configuration
Backup
Parallel snapshot & upload
+ Control parallelism level
+ Control upload speed
Retention policy
Deduplication
Retries (rclone, http, scheduler)
--snapshot-parallel
--upload-parallel
--rate-limit
--rate-limit 10,us-east:20,us-west:30
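The DC-aware value shown for --rate-limit can be read as a comma-separated list in which an entry without a DC prefix is the default. The sketch below parses that shape under exactly that assumption; it is not Scylla Manager's parser, the function names are hypothetical, and the slide does not state the unit of the numbers, so none is assumed here.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseRateLimit parses a value such as "10,us-east:20,us-west:30" into a
// map from datacenter name to limit. An entry without a "dc:" prefix is
// stored under the empty key and treated as the default (an assumption).
func parseRateLimit(s string) (map[string]int, error) {
	limits := make(map[string]int)
	for _, item := range strings.Split(s, ",") {
		item = strings.TrimSpace(item)
		dc, value := "", item
		if i := strings.LastIndex(item, ":"); i >= 0 {
			dc, value = item[:i], item[i+1:]
		}
		n, err := strconv.Atoi(value)
		if err != nil || n < 0 {
			return nil, fmt.Errorf("invalid rate limit %q", item)
		}
		limits[dc] = n
	}
	return limits, nil
}

// limitFor returns the limit for dc, falling back to the default entry.
func limitFor(limits map[string]int, dc string) int {
	if n, ok := limits[dc]; ok {
		return n
	}
	return limits[""]
}

func main() {
	limits, err := parseRateLimit("10,us-east:20,us-west:30")
	if err != nil {
		panic(err)
	}
	fmt.Println(limitFor(limits, "us-east")) // 20
	fmt.Println(limitFor(limits, "us-west")) // 30
	fmt.Println(limitFor(limits, "eu-west")) // 10, the default entry
}
```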
DC aware backup location
DC filtering with glob patterns
Flexible keyspace/table filtering with glob patterns
--location us-east:s3:east-bucket,us-west:s3:west-bucket,default-bucket
--dc 'us-*'
-K 'shopping_cart*,!old_orders'
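The glob filters above can be illustrated with a small sketch. It assumes one plausible reading of the syntax: a name is selected when it matches at least one plain pattern and no pattern prefixed with "!". Scylla Manager's real matching rules may differ; selectNames and matchAny are hypothetical names, and Go's path.Match stands in for the actual glob engine.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// selectNames applies a comma-separated pattern list such as
// "shopping_cart*,!old_orders" to names. Patterns starting with "!" exclude.
// Assumed semantics: include if any plain pattern matches and no "!" pattern does.
func selectNames(patterns string, names []string) ([]string, error) {
	var include, exclude []string
	for _, p := range strings.Split(patterns, ",") {
		p = strings.TrimSpace(p)
		switch {
		case p == "":
			continue
		case strings.HasPrefix(p, "!"):
			exclude = append(exclude, p[1:])
		default:
			include = append(include, p)
		}
	}

	var selected []string
	for _, name := range names {
		in, err := matchAny(include, name)
		if err != nil {
			return nil, err
		}
		out, err := matchAny(exclude, name)
		if err != nil {
			return nil, err
		}
		if in && !out {
			selected = append(selected, name)
		}
	}
	return selected, nil
}

// matchAny reports whether name matches at least one glob pattern.
func matchAny(patterns []string, name string) (bool, error) {
	for _, p := range patterns {
		ok, err := path.Match(p, name)
		if err != nil {
			return false, err
		}
		if ok {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	dcs, _ := selectNames("us-*", []string{"us-east", "us-west", "eu-central"})
	fmt.Println(dcs) // [us-east us-west]

	keyspaces, _ := selectNames("shopping_*,!shopping_old",
		[]string{"shopping_cart", "shopping_old", "users"})
	fmt.Println(keyspaces) // [shopping_cart]
}
```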
Low impact on node and cluster
+ CPU pinning
+ Systemd slices
+ Fadvise free pages
+ Adjustable parallelism levels
+ Upload Rate limit
Secure
API
+ Authentication token
+ Mutual TLS
Rclone
+ Credentials agnostic (IAM Roles)
+ Scylla data dir jail (read from API)
CLI backup task list
CLI backup progress
--details flag to show a per-host breakdown
Restore
S3 buckets are source of truth
Paths encode information
/backup/meta/cluster/48b9f020-be12-4c4e-8009-aee16131b51f/dc/dc1/node/4ebbf3b1-9871-44b1-bfef-7c993ff8e933/keyspace/backuptest_purge/table/big_table/task/b81dd5e7-34d8-4dee-b611-42fe34f02b10/tag/sm_20191025123826UTC/445cda50f72411e98345000000000001/manifest.json
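Because the path alternates labels and values, the information it encodes can be recovered by plain string splitting. The sketch below does that for the example path above. The label set is read off that one example, the segment between the tag and manifest.json is left uninterpreted, and parseBackupPath is a hypothetical helper rather than anything from Scylla Manager.

```go
package main

import (
	"fmt"
	"strings"
)

// labels are the path segments that are followed by a value,
// as seen in the example path on the slide.
var labels = []string{"cluster", "dc", "node", "keyspace", "table", "task", "tag"}

// examplePath is the manifest path shown on the slide.
const examplePath = "/backup/meta/cluster/48b9f020-be12-4c4e-8009-aee16131b51f" +
	"/dc/dc1/node/4ebbf3b1-9871-44b1-bfef-7c993ff8e933" +
	"/keyspace/backuptest_purge/table/big_table" +
	"/task/b81dd5e7-34d8-4dee-b611-42fe34f02b10" +
	"/tag/sm_20191025123826UTC/445cda50f72411e98345000000000001/manifest.json"

// parseBackupPath extracts the label/value pairs from a manifest path.
// It returns an error if any expected label is absent.
func parseBackupPath(p string) (map[string]string, error) {
	isLabel := make(map[string]bool, len(labels))
	for _, l := range labels {
		isLabel[l] = true
	}

	parts := strings.Split(strings.Trim(p, "/"), "/")
	fields := make(map[string]string, len(labels))
	for i := 0; i+1 < len(parts); i++ {
		if isLabel[parts[i]] {
			fields[parts[i]] = parts[i+1]
			i++ // skip the value so it is never mistaken for a label
		}
	}

	for _, l := range labels {
		if _, ok := fields[l]; !ok {
			return nil, fmt.Errorf("missing %q segment in %q", l, p)
		}
	}
	return fields, nil
}

func main() {
	fields, err := parseBackupPath(examplePath)
	if err != nil {
		panic(err)
	}
	for _, l := range labels {
		fmt.Printf("%-9s %s\n", l, fields[l])
	}
}
```

Keeping this metadata in the object path is what allows the bucket to act as the source of truth: listing objects is enough to find which cluster, node, table and snapshot tag a manifest belongs to.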
+ List backup
  + grouped by target (keyspaces and tables)
  + keyspace/table filtering with glob patterns
+ List S3 paths for given snapshot tag
  + keyspace/table filtering with glob patterns
CLI backup list
Vision
Scylla Manager 4.x
+ Kubernetes Operator (WIP)
+ All (centralised components) in one
+ Scylla UI
+ Scylla cloud on premise
+ Security
+ Admin Security
+ LDAP integration
+ Role Based Auth
+ Audit
Thank you
Stay in touch
Any questions?
Michał Matczuk
michal@scylladb.com
@michalmatczuk


Editor's Notes

  • #5 Repair: recurrent and ad hoc repairs; pause and resume; flexible keyspace and DC selection using glob patterns; progress information (CLI, metrics and alerts); Scylla shard aware; resilient to temporary failures. Healthcheck: REST and CQL; credentials agnostic; automatically detects TLS.
  • #7 Was it simple, easy.
  • #9 UseDNS no; maybe the DNS config needs fixing.
  • #12 Free testing
  • #13 Has HTTP API out of the box.
    We need to have the ability to execute the following operations on the Scylla nodes:
    + Determine disk usage
    + List (recursively) files and directories within the Scylla data directory
    + Copy a single file to and from the remote location (only if it's missing)
    + Copy a directory to and from the remote location with file filtering (only if it's missing)
    + Delete files on both nodes and remote locations
    + Manage bandwidth limits for the file transfers
    + Get progress for the running transfers
    Performance analysis:
    + Unnecessary hash calculation
    + Parallelizing hash calculation
    + Local to local file copy of 1G: from 16s to 1s
    Praise Nick