
1 sysadmin vs 250 storage clusters


OVHcloud has been using Ceph for five years for some of its storage needs. Although this infrastructure comprises 2,000 physical servers and 20,000 containers, its day-to-day operations (RUN) are handled by a single person. This talk presents the different means put in place to make that possible, along with feedback from experience.



  1. 1 sysadmin vs 250 clusters (Etienne Menguy, SysadminDays, November 19, 2019)
  2. OVHcloud: 1,500,000 customers, 2,200 employees, 380,000 bare-metal servers
  3. Ceph at OVHcloud: additional disks for Public Cloud virtual machines, and Cloud Disk Array as a service
  4. Evolution. 2015: 4 devs, 1 ops, 8 clusters, 4 regions. 2019: 9 devs, 250 clusters, 10 regions
  5. Daily work. 1 sysadmin: monitoring, production pushes, support, training, deploying regions and servers, and the daily surprises. 8 devs: Ceph as a service, infra as code, code review, tests, R&D
  6. Ceph setup: each bare-metal server has a 40 Gbps NIC, 12 HDDs and an NVMe drive split into 12 partitions; every HDD is fronted by an NVMe flashcache partition and served from its own LXC data container
  7. Ceph as a service. Autonomous users: creating clusters, managing users, pools and rights, managing the network, growing clusters. Backup management: 500 TB/day, Ceph -> Swift, Ceph -> Ceph. Managing our infrastructure: cluster upgrades, deploying new Ceph versions, managing tasks, host management, network management, container management
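Slide 7 only states the backup volumes and directions (500 TB/day, Ceph -> Swift and Ceph -> Ceph), not how the copies are made. The snippet below is a minimal sketch of one possible Ceph -> Swift path, copying RADOS objects from a pool into a Swift container with the librados Python bindings and python-swiftclient; the pool name, container name, credentials and auth URL are placeholders, and a real 500 TB/day pipeline would work at a higher level (e.g. RBD exports) with parallelism, retries and incremental logic.

```python
import rados                                  # librados Python bindings
from swiftclient import client as swift       # python-swiftclient

def backup_pool_to_swift(pool, container):
    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx(pool)
    # Placeholder credentials; a real setup would use Keystone and proper secrets.
    conn = swift.Connection(authurl="https://auth.example/auth/v1.0",
                            user="backup", key="secret")
    conn.put_container(container)
    try:
        for obj in ioctx.list_objects():
            size, _mtime = ioctx.stat(obj.key)
            conn.put_object(container, f"{pool}/{obj.key}",
                            contents=ioctx.read(obj.key, length=size))
    finally:
        ioctx.close()
        cluster.shutdown()
        conn.close()
```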
  8. Infrastructure components: servers, containers, VMs, instances, database, Puppet, Python API, OVH API, RabbitMQ, Celery
  9. Task management: RabbitMQ and Celery (https://github.com/ovh/celery-dyrygent) for complex, reliable workflows with monitoring and a web interface. Planned tasks such as NVMe replacement; self-healing tasks triggered by monitoring probes that can execute any operation
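The deck names RabbitMQ, Celery and celery-dyrygent but shows no code. The sketch below only illustrates the basic Celery-over-RabbitMQ pattern behind the planned and self-healing tasks; the task name, broker URL and queue are hypothetical, and celery-dyrygent would be layered on top for the complex workflows the slide mentions.

```python
# Minimal Celery + RabbitMQ sketch; names and URLs are placeholders.
from celery import Celery

app = Celery("ceph_tasks", broker="amqp://guest:guest@rabbitmq:5672//")

@app.task(bind=True, max_retries=3)
def replace_nvme(self, host, device):
    """Planned task: prepare a host for an NVMe swap (placeholder body)."""
    try:
        # A real implementation would drain the flashcache partitions on `device`
        # and take the affected OSDs out of the cluster before the swap.
        print(f"preparing {host}:{device} for NVMe replacement")
    except Exception as exc:
        raise self.retry(exc=exc, countdown=60)

# A monitoring probe (self healing) or a scheduler (planned task) enqueues it:
# replace_nvme.apply_async(args=["host-42", "/dev/nvme0n1"], queue="maintenance")
```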
  10. Example workflow: start -> check operation safety -> lower disk weight -> wait for cluster_health_ok -> if the weight is not yet 0, lower it again; once the weight equals 0, remove the disk from the cluster
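As a rough translation of the slide-10 flow into code, here is a sketch assuming the ceph CLI is available where the task runs. The weight decrement, the polling interval, the health-JSON field name and the use of `ceph osd purge` for the final removal are assumptions, and the initial health check stands in for the "check operation safety" box.

```python
import json
import subprocess
import time

STEP = 0.2       # CRUSH weight removed per iteration (assumption)
POLL = 30        # seconds between health checks (assumption)

def ceph(*args):
    """Run a ceph CLI command and return its stdout."""
    return subprocess.run(["ceph", *args], check=True,
                          capture_output=True, text=True).stdout

def wait_cluster_health_ok():
    """Block until the cluster reports HEALTH_OK ('wait cluster_health_ok')."""
    while json.loads(ceph("health", "--format", "json")).get("status") != "HEALTH_OK":
        time.sleep(POLL)

def drain_and_remove_osd(osd_id, weight):
    """Lower the OSD's CRUSH weight step by step, then remove it from the cluster."""
    wait_cluster_health_ok()                             # crude 'check operation safety'
    while weight > 0:                                    # loop until 'weight equals 0'
        weight = max(0.0, round(weight - STEP, 2))
        ceph("osd", "crush", "reweight", f"osd.{osd_id}", str(weight))  # 'lower disk weight'
        wait_cluster_health_ok()
    ceph("osd", "out", str(osd_id))                      # stop receiving data
    ceph("osd", "purge", str(osd_id), "--yes-i-really-mean-it")  # 'remove disk from cluster'
```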
  11. Continuous delivery with CDS (https://github.com/ovh/cds). Each pull request: lint and unit tests. Daily production pushes: all tests executed
  12. Infra as code
  13. Inconsistent hardware. Hardware profiles: 12 profiles in production (CPU, NVMe, HDD), plus firmware and Ceph version differences. Generic tools; 1 profile = 1 cluster
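One way to read "generic tools, 1 profile = 1 cluster" is that the tooling never special-cases hosts: it derives its behaviour from the single hardware profile a cluster is built on. The sketch below is purely illustrative; the profile names and fields are made up.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HardwareProfile:
    name: str
    cpu: str
    nvme: str
    hdd_count: int

# Hypothetical profiles; the deck mentions 12 in production but lists none.
PROFILES = {
    "profile-a": HardwareProfile("profile-a", "cpu-model-a", "nvme-model-a", 12),
    "profile-b": HardwareProfile("profile-b", "cpu-model-b", "nvme-model-b", 12),
}

def osd_layout(cluster_profile: str) -> dict:
    """Generic tooling derives its settings from the cluster's single profile."""
    profile = PROFILES[cluster_profile]
    return {"osds_per_host": profile.hdd_count, "flashcache_device": profile.nvme}
```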
  14. Monitoring: automatic downtimes set by tasks, some alarms only during working hours, service/host aggregation. 143,000 services, 25,000 hosts, 3 infrastructures (6 masters, 12 satellites)
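The slide does not name the monitoring stack, but "automatic downtimes by tasks" with masters and satellites suggests something Nagios-compatible. Assuming that, a task could silence a host before maintenance by writing a SCHEDULE_HOST_DOWNTIME external command; the command-file path below is a placeholder.

```python
import time

COMMAND_FILE = "/var/lib/nagios/rw/nagios.cmd"   # placeholder path

def schedule_host_downtime(host, duration_s, author="task-runner", comment="maintenance"):
    """Write a SCHEDULE_HOST_DOWNTIME external command so alarms stay quiet."""
    now = int(time.time())
    cmd = (f"[{now}] SCHEDULE_HOST_DOWNTIME;{host};{now};{now + duration_s};"
           f"1;0;{duration_s};{author};{comment}\n")
    with open(COMMAND_FILE, "w") as fifo:        # the command file is a named pipe
        fifo.write(cmd)

# e.g. schedule_host_downtime("host-42", 3600) before rebooting host-42.
```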
  15. Metrics. Cluster metrics: usage, latency. Hardware: CPU and memory usage, cache hit ratio. Service: KPIs, usage per OpenStack region. Performance: IO/s, latency. Metrics Data Platform (https://www.ovh.com/fr/data-platforms/metrics/): 13 million series, 13 billion points per day
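The Metrics Data Platform accepts several ingestion protocols; the sketch below assumes its Warp10-compatible HTTP endpoint and uses placeholder endpoint and token values, so treat it as an illustration of how a probe could push a cluster metric rather than the exact pipeline used in the talk.

```python
import time
import requests

ENDPOINT = "https://metrics.example.ovh/api/v0/update"   # placeholder
TOKEN = "WRITE_TOKEN"                                     # placeholder

def push_gts(classname, labels, value):
    """Push one data point in Warp10 GTS input format."""
    ts_us = int(time.time() * 1_000_000)                  # Warp10 expects microseconds
    label_str = ",".join(f"{k}={v}" for k, v in labels.items())
    payload = f"{ts_us}// {classname}{{{label_str}}} {value}"
    requests.post(ENDPOINT, data=payload,
                  headers={"X-Warp10-Token": TOKEN}, timeout=5).raise_for_status()

# push_gts("ceph.cluster.commit_latency_ms", {"cluster": "cluster-001"}, 4.2)
```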
  16. Logs. Infrastructure: OS, Ceph. Applications: CaaS (Ceph as a service), Celery/RabbitMQ, unique step/task IDs. API. Logs Data Platform (https://www.ovh.com/fr/data-platforms/logs/): 15,000 logs/second, Graylog, Filebeat
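To make the "unique step/task ID" point concrete: tagging every log line with the workflow's ID lets Graylog correlate all steps of one operation once Filebeat has shipped the file. The logger name, file path and field names below are illustrative, not the ones used in the talk.

```python
import logging
import uuid

logger = logging.getLogger("ceph_tasks")
handler = logging.FileHandler("worker.log")   # a file Filebeat would watch (illustrative)
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s task_id=%(task_id)s step=%(step)s %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

task_id = uuid.uuid4().hex                    # one ID for the whole workflow
logger.info("lowering osd weight", extra={"task_id": task_id, "step": "reweight"})
logger.info("waiting for HEALTH_OK", extra={"task_id": task_id, "step": "wait_health"})
```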
  17. Conclusion
  18. Questions?
