Michał Matczuk, Scylla Manager Lead Developer
Tzach Livyatan, Product Manager
WEBINAR
Introducing Scylla Manager: Cluster Management and Task Automation
Michał Matczuk is a software engineer working on Scylla
management. He's a Go enthusiast and a contributor to many
open source projects, with a background in network
programming. Prior to joining ScyllaDB he worked at
StratoScale and NTT.
Tzach Livyatan is ScyllaDB's Product Manager and has had a 15-year
career in development, systems engineering, and product
management.
In the past he worked in the telecom domain, focusing on carrier-grade
systems, signalling, policy, and charging applications.
Why do we need a recurrent repair solution?
Scylla Manager status and road map
Scylla Manager 101
+ Next-generation NoSQL database
+ Drop-in replacement for Cassandra
+ 10X the performance & low tail latency
+ Open source and enterprise editions
+ Founded by the creators of KVM hypervisor
+ HQs: Palo Alto, CA; Herzelia, Israel
+ Production Scylla Release
+ Scylla Manager
+ Enterprise-Only Features
+ 24/7 Mission Critical Support
+ Bug Escalation
+ Hot Fixes
+ Long-Term Support
+ Commercial License
Scylla Enterprise
Scylla nodes may go out of sync over time, increasing entropy:
+ Network issues
+ Node issues
+ Rolling upgrades
This will impact your business
+ Inconsistent data
+ Data Resurrection
Anti-Entropy tools:
+ Repair
+ Read Repair
+ Hinted Handoff (Scylla 2.2)
Repair is an offline process that synchronizes data between nodes so that,
eventually, all replicas hold the same data.
It is highly recommended to run a full cluster repair once a week (shorter than
the default GC grace period).
Nodetool repair - A CLI command to run repair on a single node
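As a rough sketch of the status quo this deck is addressing (not part of the original slides; the keyspace name and cron schedule are placeholders): a manual or home-grown weekly repair typically means running nodetool on each node, staggered so nodes do not repair at the same time:
$ nodetool repair -pr my_keyspace                     # repair the primary token ranges owned by this node
$ crontab -l
0 2 * * 0 /usr/bin/nodetool repair -pr my_keyspace    # naive weekly schedule; must be maintained and staggered per node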
[Diagram: Scylla token ring with tokens T-0 through T-11, and nodetool repair running against a single node]
Data resurrection:
1. A delete fails to propagate to all nodes
2. After gc_grace_seconds, the tombstones are removed
3. A subsequent read brings the deleted data back to life
Keep the repair period under gc_grace_seconds, or I will eat your brain.
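To see where that knob lives: gc_grace_seconds is a per-table CQL property, and the weekly repair recommendation above simply keeps repairs inside that window. A quick check and adjustment from cqlsh might look like this (keyspace and table names are placeholders; 864000 seconds is the 10-day default):
$ cqlsh -e "DESC TABLE my_keyspace.my_table" | grep gc_grace_seconds
$ cqlsh -e "ALTER TABLE my_keyspace.my_table WITH gc_grace_seconds = 864000"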
1. Run recurrent repair using 3rd party solution
2. Run recurrent repair using home grown solution
3. Manually running repair from time to time
4. Running repair only after a failure
5. Do not use repair on my production cluster
6. What is repair?
+ Runs on one node - no cluster-wide control
+ No recurrent operation support
+ Runs in node-level bulks
+ No good way to track progress
+ No way to pause, resume or retry a repair
+ No failure handling
+ Shard ignorant - may overload one shard at a time and take much longer to complete
+ Lacks HA
+ Hard to integrate with external tools - no metrics, no API
Centralized, highly available control of multiple Scylla clusters.
One Ring Manager to Rule Them All
+ Highly Available, centralized controller
+ sctool - User Friendly CLI tool
+ Automatically set and run recurrent repair on Scylla clusters
+ Run ad-hoc repair on all or one table
+ Allow pause / restart / retry of repair tasks
+ Shard aware - optimized for Scylla
+ Highly Available - stateless, uses Scylla as a backend
+ Installs a local Scylla Enterprise backend by default
+ Can use any Scylla cluster as a remote backend
+ sctool - CLI tool
+ REST API (swagger documented)
+ Grafana Dashboard (Manager 1.1)
+ Implemented in Go
+ Backed by Scylla Enterprise
+ Backed by github.com/scylladb/gocqlx
+ Exports Prometheus metrics
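As an example, once the server is running you can pull its Prometheus endpoint directly. The listen address comes from the Prometheus setting in the Manager's config file (scylla-manager.yaml), so the host, port, and metric-name filter below are assumptions to adapt to your setup:
$ curl -s http://scylla-manager.example.com:56090/metrics | grep -i repair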
+ Sctool
+ Cluster
+ Repair
+ Task
+ POSIX compliant
+ BASH completion
+ Prints tables :)
$ sctool repair -h
Manage repairs
Usage:
sctool repair [command]
Available Commands:
progress Shows repair progress
schedule Schedule repair of a unit
unit Manage repair units
Flags:
-h, --help help for repair
Global Flags:
--api-url URL URL of Scylla Manager server (defau
-c, --cluster name target cluster name or ID
Use "sctool repair [command] --help" for more information
Register a cluster to Scylla Manager
$ sctool cluster add --name=test-cluster --hosts=172.16.1.10,172.16.1.2 --shard-count=16
30a538a0-9bdb-4276-a69d-60e19197fd93
__
/  Cluster added, to set it as a default run:
@ @ export SCYLLA_MANAGER_CLUSTER=30a538a0-9bdb-4276-a69d-60e19197fd93
| |
|| |/ Repair will run on 20 Apr 18 00:00 UTC and will be repeated every 7 days.
|| || To see the repair units run: sctool repair unit list -c 30a538a0-9bdb-4276-a69d-60e19197fd93
|_/|
___/
$ export SCYLLA_MGMT_CLUSTER=30a538a0-9bdb-4276-a69d-60e19197fd93
Register a cluster to Scylla Manager
$ sctool repair unit list
╭──────────────────────────────────────┬──────────────────────┬──────────────────────┬────────╮
│ unit id │ name │ keyspace │ tables │
├──────────────────────────────────────┼──────────────────────┼──────────────────────┼────────┤
│ 27b75126-605e-4238-9e6c-a2cec097d2c4 │ super_important_data │ super_important_data │ [] │
│ 40f9a7f3-4c63-4abb-8044-5397e36e1960 │ important_data │ important_data │ [] │
│ 4a694d40-4e50-4b05-9a45-3f0fe2c0f00a │ very_important_data │ very_important_data │ [] │
╰──────────────────────────────────────┴──────────────────────┴──────────────────────┴────────╯
Register a cluster to Scylla Manager
$ sctool task list
╭───────────────────────────────────────────────────────────┬─────────────────────┬──────┬──────┬────────────┬───────────┬──────── ╮
│ task │ start date │ int. │ ret. │ properties │ run start │ status │
├───────────────────────────────────────────────────────────┼─────────────────────┼──────┼──────┼────────────┼───────────┼────────┤
│ repair_auto_schedule/e6cd058c-a3ad-4b26-8e75-1bc66946280b │ 21 Apr 18 00:00 UTC │ 7 │ 6 │ - │ - │ - │
╰───────────────────────────────────────────────────────────┴─────────────────────┴──────┴──────┴────────────┴───────────┴──────── ╯
$ sctool task start repair_auto_schedule/e6cd058c-a3ad-4b26-8e75-1bc66946280b
$ sctool task list
╭───────────────────────────────────────────────────────────┬─────────────────────┬──────┬──────┬───────────────────────────┬─────────────────────┬─────────╮
│ task │ start date │ int. │ ret. │ properties │ run start │ status │
├───────────────────────────────────────────────────────────┼─────────────────────┼──────┼──────┼───────────────────────────┼─────────────────────┼─────────┤
│ repair/274cee81-bcc6-44b4-82f2-39b1483397a2 │ 20 Apr 18 11:33 UTC │ 0 │ 864 │ unit_id:27b75126-605e-... │ - │ - │
│ repair/7b8bf21c-af89-4c2d-876c-8195179de685 │ 20 Apr 18 11:33 UTC │ 0 │ 864 │ unit_id:40f9a7f3-4c63-... │ - │ - │
│ repair/b8cb6881-12e9-4205-bdc8-8fcc070803e9 │ 20 Apr 18 11:33 UTC │ 0 │ 864 │ unit_id:4a694d40-4e50-... │ - │ - │
│ repair_auto_schedule/e6cd058c-a3ad-4b26-8e75-1bc66946280b │ 21 Apr 18 00:00 UTC │ 7 │ 6 │ - │ 20 Apr 18 09:33 UTC │ stopped │
╰───────────────────────────────────────────────────────────┴─────────────────────┴──────┴──────┴───────────────────────────┴─────────────────────┴─────────╯
Running repair
$ sctool repair schedule important_data --start-date now
repair/ab60e3c0-ee53-4872-8436-f8590779b675
$ sctool task list
╭───────────────────────────────────────────────────────────┬─────────────────────┬──────┬──────┬────────────────────────┬─────────────────────┬─────────╮
│ task │ start date │ int. │ ret. │ properties │ run start │ status │
├───────────────────────────────────────────────────────────┼─────────────────────┼──────┼──────┼────────────────────────┼─────────────────────┼─────────┤
│ repair/274cee81-bcc6-44b4-82f2-39b1483397a2 │ 20 Apr 18 11:33 UTC │ 0 │ 864 │ unit_id:27b75126-605e- │ - │ - │
│ repair/7b8bf21c-af89-4c2d-876c-8195179de685 │ 20 Apr 18 11:33 UTC │ 0 │ 864 │ unit_id:40f9a7f3-4c63- │ - │ - │
│ repair/ab60e3c0-ee53-4872-8436-f8590779b675 │ 20 Apr 18 10:06 UTC │ 0 │ 3 │ unit_id:important_data │ 20 Apr 18 10:09 UTC │ running │
│ repair/b8cb6881-12e9-4205-bdc8-8fcc070803e9 │ 20 Apr 18 11:33 UTC │ 0 │ 864 │ unit_id:4a694d40-4e50- │ - │ - │
│ repair_auto_schedule/e6cd058c-a3ad-4b26-8e75-1bc66946280b │ 21 Apr 18 00:00 UTC │ 7 │ 6 │ - │ 20 Apr 18 09:33 UTC │ stopped │
╰───────────────────────────────────────────────────────────┴─────────────────────┴──────┴──────┴────────────────────────┴─────────────────────┴─────────╯
Running repair
$ sctool repair progress repair/ab60e3c0-ee53-4872-8436-f8590779b675
Status: running
Start time: 20 Apr 18 10:09 UTC
Duration: 1m14s
Progress: 8%
╭─────────────┬──────────┬─────────────────╮
│ host │ progress │ failed segments │
├─────────────┼──────────┼─────────────────┤
│ 172.16.1.2 │ - │ - │
│ 172.16.1.3 │ 24 │ 0 │
│ 172.16.1.10 │ - │ - │
╰─────────────┴──────────┴─────────────────╯
Repair progress
$ sctool task stop repair/ab60e3c0-ee53-4872-8436-f8590779b675
$ sctool repair progress repair/ab60e3c0-ee53-4872-8436-f8590779b675
Status: stopped
Start time: 20 Apr 18 10:09 UTC
End time: 20 Apr 18 10:10 UTC
Duration: 1m36s
Progress: 10%
╭─────────────┬──────────┬─────────────────╮
│ host │ progress │ failed segments │
├─────────────┼──────────┼─────────────────┤
│ 172.16.1.2 │ - │ - │
│ 172.16.1.3 │ 31 │ 0 │
│ 172.16.1.10 │ - │ - │
╰─────────────┴──────────┴─────────────────╯
Repair stop
$ sctool task start repair/ab60e3c0-ee53-4872-8436-f8590779b675
$ sctool repair progress repair/ab60e3c0-ee53-4872-8436-f8590779b675
Status: running
Start time: 20 Apr 18 10:19 UTC
Duration: 3s
Progress: 10%
╭─────────────┬──────────┬─────────────────╮
│ host │ progress │ failed segments │
├─────────────┼──────────┼─────────────────┤
│ 172.16.1.2 │ - │ - │
│ 172.16.1.3 │ 32 │ 0 │
│ 172.16.1.10 │ - │ - │
╰─────────────┴──────────┴─────────────────╯
Repair resume
www.scylladb.com/enterprise-download/#manager
Scylla Manager 1.1:
+ Prometheus integration
+ Repairing failed segments
Scylla Manager 1.1 (April 2018)
- Prometheus integration
- Repairing failed segments
Scylla Manager 1.2
- Repair: MultiDC support
- Repair: Auto config (msb, shards)
- Repair a subset of nodes
Scylla Manager 2.0
- Cluster setup (including monitoring)
Scylla Manager 2.1
- Cluster Management
- Scale out / scale down
- Rolling Upgrade
- Rolling cfg update
Scylla Manager 2.x
- Log Collection
- Recurrent backup and health status
Scylla Manager 3.x
- Cloud Management (AWS)
- Unified packaging
- Sampler
- Admin Security
- Kerberos, LDAP integration
- Role Based Auth
- Audit
- Cloud Management (GCE)
- Migration Tools
- Admin Console (UI)
Timeline: CQ2’18, CQ3’18, CQ4’18
United States | Israel
www.scylladb.com
@scylladb