Michał Matczuk, Scylla Manager Lead Developer
Tzach Livyatan, Product Manager
WEBINAR
Introducing Scylla Manager: Cluster Management and Task Automation
Michał Matczuk is a software engineer working on Scylla
management. He's a Go enthusiast and a contributor to many
open source projects, with a background in network
programming. Prior to joining ScyllaDB he worked at
StratoScale and NTT.
Tzach Livyatan is ScyllaDB's Product Manager and has had a 15-year
career in development, systems engineering, and product
management.
In the past he worked in the telecom domain, focusing on carrier-grade
systems, signalling, policy, and charging applications.
Why do we need a recurrent repair solution?
Scylla Manager status and road map
Scylla Manager 101
+ Next-generation NoSQL database
+ Drop-in replacement for Cassandra
+ 10X the performance & low tail latency
+ Open source and enterprise editions
+ Founded by the creators of KVM hypervisor
+ HQs: Palo Alto, CA; Herzelia, Israel
+ Production Scylla Release
+ Scylla Manager
+ Enterprise-Only Features
+ 24/7 Mission Critical Support
+ Bug Escalation
+ Hot Fixes
+ Long-Term Support
+ Commercial License
Scylla Enterprise
Scylla nodes may go out of sync over time, increasing entropy:
+ Network issues
+ Node issues
+ Rolling upgrades
This will impact your business
+ Inconsistent data
+ Data Resurrection
Anti-Entropy tools:
+ Repair
+ Read Repair
+ Hinted Handoff (Scylla 2.2)
Repair is an offline process that synchronizes data between nodes so that,
eventually, all replicas hold the same data.
It is highly recommended to run a full cluster repair once a week (shorter than
the default GC grace period).
Nodetool repair - A CLI command to run repair on a single node
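As a rough sketch of the status quo this deck is addressing (not part of the original slides; the keyspace name and cron schedule are placeholders): a manual or home-grown weekly repair typically means running nodetool on each node, staggered so nodes do not repair at the same time:
$ nodetool repair -pr my_keyspace                     # repair the primary token ranges owned by this node
$ crontab -l
0 2 * * 0 /usr/bin/nodetool repair -pr my_keyspace    # naive weekly schedule; must be maintained and staggered per node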
[Diagram: Scylla token ring with tokens T-0 through T-11, and nodetool repair running against a single node]
Data resurrection:
1. A delete fails to propagate to all nodes
2. After gc_grace_seconds, the tombstones are removed
3. A subsequent read brings the deleted data back to life
Keep the repair period under gc_grace_seconds, or I will eat your brain.
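To see where that knob lives: gc_grace_seconds is a per-table CQL property, and the weekly repair recommendation above simply keeps repairs inside that window. A quick check and adjustment from cqlsh might look like this (keyspace and table names are placeholders; 864000 seconds is the 10-day default):
$ cqlsh -e "DESC TABLE my_keyspace.my_table" | grep gc_grace_seconds
$ cqlsh -e "ALTER TABLE my_keyspace.my_table WITH gc_grace_seconds = 864000"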
1. Run recurrent repair using 3rd party solution
2. Run recurrent repair using home grown solution
3. Manually running repair from time to time
4. Running repair only after a failure
5. Do not use repair on my production cluster
6. What is repair?
+ Runs on one node - no cluster-wide control
+ No recurrent operation support
+ Runs in node-level bulks
+ No good way to track progress
+ No way to pause, resume or retry a repair
+ No failure handling
+ Shard ignorant - may overload one shard at a time and take much longer to complete
+ Lacks HA
+ Hard to integrate with external tools - no metrics, no API
Centralized, highly available control of multiple Scylla clusters.
One Ring Manager to Rule Them All
+ Highly Available, centralized controller
+ sctool - User Friendly CLI tool
+ Automatically set and run recurrent repair on Scylla clusters
+ Run ad-hoc repair on all or one table
+ Allow pause / restart / retry of repair tasks
+ Shard aware - optimized for Scylla
+ Highly Available - stateless, uses Scylla as a backend
+ Installs a local Scylla Enterprise backend by default
+ Can use any Scylla cluster as a remote backend
+ sctool - CLI tool
+ REST API (swagger documented)
+ Grafana Dashboard (Manager 1.1)
+ Implemented in Go
+ Backed by Scylla Enterprise
+ Backed by github.com/scylladb/gocqlx
+ Exports Prometheus metrics
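As an example, once the server is running you can pull its Prometheus endpoint directly. The listen address comes from the Prometheus setting in the Manager's config file (scylla-manager.yaml), so the host, port, and metric-name filter below are assumptions to adapt to your setup:
$ curl -s http://scylla-manager.example.com:56090/metrics | grep -i repair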
+ Sctool
+ Cluster
+ Repair
+ Task
+ POSIX compliant
+ BASH completion
+ Prints tables :)
$ sctool repair -h
Manage repairs
Usage:
sctool repair [command]
Available Commands:
progress Shows repair progress
schedule Schedule repair of a unit
unit Manage repair units
Flags:
-h, --help help for repair
Global Flags:
--api-url URL URL of Scylla Manager server (defau
-c, --cluster name target cluster name or ID
Use "sctool repair [command] --help" for more information
Register a cluster to Scylla Manager
$ sctool cluster add --name=test-cluster --hosts=172.16.1.10,172.16.1.2 --shard-count=16
30a538a0-9bdb-4276-a69d-60e19197fd93
__
/  Cluster added, to set it as a default run:
@ @ export SCYLLA_MANAGER_CLUSTER=30a538a0-9bdb-4276-a69d-60e19197fd93
| |
|| |/ Repair will run on 20 Apr 18 00:00 UTC and will be repeated every 7 days.
|| || To see the repair units run: sctool repair unit list -c 30a538a0-9bdb-4276-a69d-60e19197fd93
|_/|
___/
$ export SCYLLA_MGMT_CLUSTER=30a538a0-9bdb-4276-a69d-60e19197fd93
Register a cluster to Scylla Manager
$ sctool repair unit list
╭──────────────────────────────────────┬──────────────────────┬──────────────────────┬────────╮
│ unit id │ name │ keyspace │ tables │
├──────────────────────────────────────┼──────────────────────┼──────────────────────┼────────┤
│ 27b75126-605e-4238-9e6c-a2cec097d2c4 │ super_important_data │ super_important_data │ [] │
│ 40f9a7f3-4c63-4abb-8044-5397e36e1960 │ important_data │ important_data │ [] │
│ 4a694d40-4e50-4b05-9a45-3f0fe2c0f00a │ very_important_data │ very_important_data │ [] │
╰──────────────────────────────────────┴──────────────────────┴──────────────────────┴────────╯
Register a cluster to Scylla Manager
$ sctool task list
╭───────────────────────────────────────────────────────────┬─────────────────────┬──────┬──────┬────────────┬───────────┬──────── ╮
│ task │ start date │ int. │ ret. │ properties │ run start │ status │
├───────────────────────────────────────────────────────────┼─────────────────────┼──────┼──────┼────────────┼───────────┼────────┤
│ repair_auto_schedule/e6cd058c-a3ad-4b26-8e75-1bc66946280b │ 21 Apr 18 00:00 UTC │ 7 │ 6 │ - │ - │ - │
╰───────────────────────────────────────────────────────────┴─────────────────────┴──────┴──────┴────────────┴───────────┴──────── ╯
$ sctool task start repair_auto_schedule/e6cd058c-a3ad-4b26-8e75-1bc66946280b
$ sctool task list
╭───────────────────────────────────────────────────────────┬─────────────────────┬──────┬──────┬───────────────────────────┬─────────────────────┬─────────╮
│ task │ start date │ int. │ ret. │ properties │ run start │ status │
├───────────────────────────────────────────────────────────┼─────────────────────┼──────┼──────┼───────────────────────────┼─────────────────────┼─────────┤
│ repair/274cee81-bcc6-44b4-82f2-39b1483397a2 │ 20 Apr 18 11:33 UTC │ 0 │ 864 │ unit_id:27b75126-605e-... │ - │ - │
│ repair/7b8bf21c-af89-4c2d-876c-8195179de685 │ 20 Apr 18 11:33 UTC │ 0 │ 864 │ unit_id:40f9a7f3-4c63-... │ - │ - │
│ repair/b8cb6881-12e9-4205-bdc8-8fcc070803e9 │ 20 Apr 18 11:33 UTC │ 0 │ 864 │ unit_id:4a694d40-4e50-... │ - │ - │
│ repair_auto_schedule/e6cd058c-a3ad-4b26-8e75-1bc66946280b │ 21 Apr 18 00:00 UTC │ 7 │ 6 │ - │ 20 Apr 18 09:33 UTC │ stopped │
╰───────────────────────────────────────────────────────────┴─────────────────────┴──────┴──────┴───────────────────────────┴─────────────────────┴─────────╯
Running repair
$ sctool repair schedule important_data --start-date now
repair/ab60e3c0-ee53-4872-8436-f8590779b675
$ sctool task list
╭───────────────────────────────────────────────────────────┬─────────────────────┬──────┬──────┬────────────────────────┬─────────────────────┬─────────╮
│ task │ start date │ int. │ ret. │ properties │ run start │ status │
├───────────────────────────────────────────────────────────┼─────────────────────┼──────┼──────┼────────────────────────┼─────────────────────┼─────────┤
│ repair/274cee81-bcc6-44b4-82f2-39b1483397a2 │ 20 Apr 18 11:33 UTC │ 0 │ 864 │ unit_id:27b75126-605e- │ - │ - │
│ repair/7b8bf21c-af89-4c2d-876c-8195179de685 │ 20 Apr 18 11:33 UTC │ 0 │ 864 │ unit_id:40f9a7f3-4c63- │ - │ - │
│ repair/ab60e3c0-ee53-4872-8436-f8590779b675 │ 20 Apr 18 10:06 UTC │ 0 │ 3 │ unit_id:important_data │ 20 Apr 18 10:09 UTC │ running │
│ repair/b8cb6881-12e9-4205-bdc8-8fcc070803e9 │ 20 Apr 18 11:33 UTC │ 0 │ 864 │ unit_id:4a694d40-4e50- │ - │ - │
│ repair_auto_schedule/e6cd058c-a3ad-4b26-8e75-1bc66946280b │ 21 Apr 18 00:00 UTC │ 7 │ 6 │ - │ 20 Apr 18 09:33 UTC │ stopped │
╰───────────────────────────────────────────────────────────┴─────────────────────┴──────┴──────┴────────────────────────┴─────────────────────┴─────────╯
Running repair
$ sctool repair progress repair/ab60e3c0-ee53-4872-8436-f8590779b675
Status: running
Start time: 20 Apr 18 10:09 UTC
Duration: 1m14s
Progress: 8%
╭─────────────┬──────────┬─────────────────╮
│ host │ progress │ failed segments │
├─────────────┼──────────┼─────────────────┤
│ 172.16.1.2 │ - │ - │
│ 172.16.1.3 │ 24 │ 0 │
│ 172.16.1.10 │ - │ - │
╰─────────────┴──────────┴─────────────────╯
Repair progress
$ sctool task stop repair/ab60e3c0-ee53-4872-8436-f8590779b675
$ sctool repair progress repair/ab60e3c0-ee53-4872-8436-f8590779b675
Status: stopped
Start time: 20 Apr 18 10:09 UTC
End time: 20 Apr 18 10:10 UTC
Duration: 1m36s
Progress: 10%
╭─────────────┬──────────┬─────────────────╮
│ host │ progress │ failed segments │
├─────────────┼──────────┼─────────────────┤
│ 172.16.1.2 │ - │ - │
│ 172.16.1.3 │ 31 │ 0 │
│ 172.16.1.10 │ - │ - │
╰─────────────┴──────────┴─────────────────╯
Repair stop
$ sctool task start repair/ab60e3c0-ee53-4872-8436-f8590779b675
$ sctool repair progress repair/ab60e3c0-ee53-4872-8436-f8590779b675
Status: running
Start time: 20 Apr 18 10:19 UTC
Duration: 3s
Progress: 10%
╭─────────────┬──────────┬─────────────────╮
│ host │ progress │ failed segments │
├─────────────┼──────────┼─────────────────┤
│ 172.16.1.2 │ - │ - │
│ 172.16.1.3 │ 32 │ 0 │
│ 172.16.1.10 │ - │ - │
╰─────────────┴──────────┴─────────────────╯
Repair resume
www.scylladb.com/enterprise-download/#manager
Scylla Manager 1.1:
+ Prometheus integration
+ Repairing failed segments
Scylla Manager 1.1 (April 2018)
- Prometheus integration
- Repairing failed segments
Scylla Manager 1.2
- Repair: MultiDC support
- Repair: Auto config (msb, shards)
- Repair a subset of nodes
Scylla Manager 2.0
- Cluster setup (including monitoring)
Scylla Manager 2.1
- Cluster Management
- Scale out / scale down
- Rolling Upgrade
- Rolling cfg update
Scylla Manager 2.x
- Log Collection
- Recurrent backup and health status
Scylla Manager 3.x
- Cloud Management (AWS)
- Unified packaging
- Sampler
- Admin Security
- Kerberos, LDAP integration
- Role Based Auth
- Audit
- Cloud Management (GCE)
- Migration Tools
- Admin Console (UI)
Timeline: CQ2’18, CQ3’18, CQ4’18
United States | Israel
www.scylladb.com
@scylladb