1 sysadmin vs 250 clusters
Etienne Menguy
SysadminDays
November 19, 2019
OVHcloud
D at e
F o o t er can b e p er so n alized as
fo llo w : In ser t / H ead er an d fo o t er
2
1 500 000 customers
2200 employees
380 000 Bare-metal servers
Ceph at OVHcloud
Public Cloud
• Virtual machines
• Additional disks
Cloud Disk Array
• As a service
Evolution
„2015
• 4 dev
• 1 ops
• 8 clusters
• 4 regions
„2019
• 9 dev
• 250 clusters
• 10 regions
Daily work
„1 sysadmin
• Monitoring
• Production deployments
• Support
• Training
• Deploying regions, servers
• And the daily surprises
8 devs
• Ceph as a service
• Infra as code
• Code review
• Tests
• R&D
Ceph setup
[Diagram: a bare-metal server with a 40Gbps NIC. Components shown: NVME partitions (x12), HDDs (x12), and, repeated per disk, Flashcache, an LXC container, and its data.]
Ceph as a service
„Autonomous users
• Creating cluster
• Managing users, pools, rights
• Managing network
• Cluster growth
„Backup management
• 500TB/day
• Ceph -> Swift
• Ceph -> Ceph
„Managing our infrastructure
• Cluster upgrade
• Deploy new Ceph versions
• Manage tasks
• Host management
• Network management
• Containers management
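As a rough order of magnitude, the 500TB/day backup figure above can be turned into a sustained rate. This is only a sketch: it assumes decimal units (1 TB = 10**12 bytes) and traffic spread evenly over 24 hours, neither of which the slide states.

```python
# Back-of-the-envelope: sustained throughput implied by 500 TB of backups per day.
# Assumptions (not from the slides): decimal terabytes, perfectly even load.
TB = 10**12
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate(bytes_per_day):
    """Return (gigabytes per second, gigabits per second)."""
    bytes_per_second = bytes_per_day / SECONDS_PER_DAY
    return bytes_per_second / 10**9, bytes_per_second * 8 / 10**9


gb_per_s, gbit_per_s = sustained_rate(500 * TB)
print(f"{gb_per_s:.2f} GB/s, {gbit_per_s:.1f} Gbit/s")  # 5.79 GB/s, 46.3 Gbit/s
```

Under these assumptions the backups alone average a little more than the capacity of one 40Gbps NIC, spread across all clusters.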
Infrastructure
[Diagram components: servers, containers, VMs, instances, database (BDD), Puppet, Python API, OVH API, RabbitMQ, Celery]
Task management
„ RabbitMQ
„ Celery
• https://github.com/ovh/celery-dyrygent
• Complex workflow
• Reliable
• Monitoring
• Web interface
• Planned tasks
• NVME replacement
• Self healing
• Triggered by monitoring probe
• Executes any operation
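The self-healing idea above (a monitoring probe triggers a task) can be sketched as a small dispatch table. Everything here is hypothetical: the alert names, the `remediation` decorator and the handlers are illustrations, and where the real setup would queue a task through Celery and RabbitMQ, this sketch simply calls a function.

```python
# Hypothetical registry mapping a probe alert name to a remediation task.
REMEDIATIONS = {}


def remediation(alert_name):
    """Register the decorated function as the handler for one alert name."""
    def register(func):
        REMEDIATIONS[alert_name] = func
        return func
    return register


@remediation("nvme_failing")  # alert name is an assumption
def replace_nvme(host):
    return f"scheduled NVMe replacement on {host}"


@remediation("osd_down")  # alert name is an assumption
def restart_osd(host):
    return f"restart requested for OSD on {host}"


def on_probe_alert(alert_name, host):
    """Run the registered remediation, or fall back to a human."""
    task = REMEDIATIONS.get(alert_name)
    if task is None:
        return f"no automatic remediation for {alert_name}, escalating"
    return task(host)


print(on_probe_alert("nvme_failing", "host-42"))
```

Keeping the mapping in one place makes it easy to see which alerts are handled automatically and which still need a person.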
Example
[Flowchart: start → Check operation safety → Lower disk weight → Wait cluster_health_ok → Weight equals 0? (Yes / No) → Remove disk from cluster]
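The flowchart above can be read as a simple drain loop. The sketch below is one possible interpretation, not the production workflow: it assumes the "No" branch loops back to lowering the weight, and `FakeCluster` with all its methods is a hypothetical stand-in for a real cluster client.

```python
import time


class FakeCluster:
    """Stand-in for a real cluster client; every method is hypothetical."""

    def __init__(self, weights):
        self.weights = dict(weights)
        self.history = []   # every weight applied, for inspection
        self.removed = []

    def is_safe_to_operate(self):
        return True

    def get_weight(self, disk):
        return self.weights[disk]

    def set_weight(self, disk, weight):
        self.weights[disk] = weight
        self.history.append(weight)

    def health_ok(self):
        return True

    def remove_disk(self, disk):
        del self.weights[disk]
        self.removed.append(disk)


def drain_and_remove(cluster, disk, step=0.25, poll_seconds=5.0):
    # Check operation safety
    if not cluster.is_safe_to_operate():
        raise RuntimeError("operation is not safe right now")
    while True:
        # Lower disk weight (never below zero)
        new_weight = max(0.0, cluster.get_weight(disk) - step)
        cluster.set_weight(disk, new_weight)
        # Wait cluster_health_ok
        while not cluster.health_ok():
            time.sleep(poll_seconds)
        # Weight equals 0? Yes: leave the loop. No: lower it again.
        if new_weight == 0:
            break
    # Remove disk from cluster
    cluster.remove_disk(disk)


cluster = FakeCluster({"disk-3": 1.0})
drain_and_remove(cluster, "disk-3")
print(cluster.history, cluster.removed)
```

Lowering the weight in steps and waiting for a healthy cluster between steps limits how much data moves at once.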
Continuous delivery
„CDS
• https://github.com/ovh/cds
„Each pull request
• Lint
• Unit test
„Daily production deployment
• All tests executed
Infra as code
Inconsistent hardware
„Hardware profile
• 12 profiles in production
• CPU
• NVME
• HDD
„Firmwares
„Ceph versions
• Generic tools
• 1 profile = 1 cluster
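The "1 profile = 1 cluster" rule above lends itself to an automated check. This is a minimal sketch with invented data: the `HardwareProfile` fields mirror the CPU / NVME / HDD items on the slide, but the class, the function name and every value are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HardwareProfile:
    """Hypothetical description of one hardware profile."""
    name: str
    cpu: str
    nvme: str
    hdd: str


def clusters_mixing_profiles(hosts):
    """hosts: iterable of (hostname, cluster, HardwareProfile).

    Return the sorted names of clusters that contain more than one profile.
    """
    profiles_by_cluster = {}
    for _hostname, cluster, profile in hosts:
        profiles_by_cluster.setdefault(cluster, set()).add(profile)
    return sorted(c for c, p in profiles_by_cluster.items() if len(p) > 1)


# Invented example inventory.
p1 = HardwareProfile("profile-a", "cpu-x", "nvme-x", "hdd-4t")
p2 = HardwareProfile("profile-b", "cpu-y", "nvme-y", "hdd-6t")
inventory = [
    ("host-1", "cluster-1", p1),
    ("host-2", "cluster-1", p1),
    ("host-3", "cluster-2", p1),
    ("host-4", "cluster-2", p2),
]
print(clusters_mixing_profiles(inventory))  # ['cluster-2']
```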
Monitoring
„ Downtimes set automatically by tasks
„ Some alarms only during working hours
„ Services/hosts aggregation
„ 143 000 services
„ 25 000 hosts
„ 3 infrastructures
• 6 masters
• 12 satellites
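Restricting some alarms to working hours, as mentioned above, comes down to a small filter in front of the notifier. The sketch below is an illustration only: the 09:00 to 18:00 weekday window and the severity names are assumptions, not values from the talk.

```python
from datetime import datetime, time

# Assumed working window; the slides do not give the real one.
WORK_START = time(9, 0)
WORK_END = time(18, 0)


def should_notify(severity, now):
    """Critical alarms always notify; others only on weekdays in work hours."""
    if severity == "critical":
        return True
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and WORK_START <= now.time() < WORK_END


print(should_notify("warning", datetime(2019, 11, 19, 10, 0)))  # True
print(should_notify("warning", datetime(2019, 11, 19, 23, 0)))  # False
```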
Metrics
„ Clusters metrics
• Usage
• Latency
„ Hardware
• CPU, memory usage
• Cache hit ratio
„ Service
• KPI
• Usage per OpenStack region
„ Metrics Data Platform
• https://www.ovh.com/fr/data-platforms/metrics/
„ 13 million series
„ 13 billion points per day
„ Performance
• IO/s
• Latency
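Taking the two volume figures above at face value, a short calculation gives the average sampling interval and ingest rate they imply. It assumes points are spread evenly across series and across the day, which the slide does not claim.

```python
# Figures from the slide.
SERIES = 13_000_000
POINTS_PER_DAY = 13_000_000_000
SECONDS_PER_DAY = 86_400

# Assumption: points are evenly distributed over series and over time.
points_per_series_per_day = POINTS_PER_DAY / SERIES
seconds_between_points = SECONDS_PER_DAY / points_per_series_per_day
points_per_second = POINTS_PER_DAY / SECONDS_PER_DAY

print(f"{points_per_series_per_day:.0f} points per series per day")
print(f"one point every {seconds_between_points:.1f} s per series")
print(f"about {points_per_second:,.0f} points per second overall")
```

That works out to 1000 points per series per day, or roughly one point every 86 seconds, and about 150,000 points per second in total.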
Logs
„ Infrastructure
• OS
• Ceph
„ Applications
• CAAS
• Celery / RabbitMQ
• Unique step/task ID
„ API
„ Logs Data Platform
• https://www.ovh.com/fr/data-platforms/logs/
„ 15 000 logs/second
„ Graylog
„ Filebeat
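The unique step/task ID mentioned above is what lets every log line of a workflow be found with a single search. Here is a minimal sketch using only Python's standard `logging` module; the logger name, field names and message are assumptions, and shipping the lines to Graylog (for example through Filebeat) is out of scope.

```python
import io
import logging
import uuid


def build_logger(name, stream):
    """Create a logger whose format carries the task and step identifiers."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(levelname)s task=%(task_id)s step=%(step)s %(message)s"))
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # keep the sketch's output in one place
    return logger


def step_logger(base, task_id, step):
    """Bind task_id and step so callers cannot forget to include them."""
    return logging.LoggerAdapter(base, {"task_id": task_id, "step": step})


stream = io.StringIO()
base = build_logger("caas.tasks", stream)  # hypothetical logger name
task_id = uuid.uuid4().hex                 # one ID for the whole workflow
step_logger(base, task_id, "lower_weight").info("weight set to 0.5")
step_logger(base, task_id, "wait_health").info("cluster healthy")
print(stream.getvalue(), end="")
```

Both lines carry the same `task=` value, so filtering on it returns the full history of that task.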
Conclusion
Questions?
