
1 sysadmin vs 250 storage clusters


OVHcloud has been using Ceph for five years for some of its storage needs. Although this infrastructure comprises 2,000 physical servers and 20,000 containers, its day-to-day operations (RUN) are handled by a single person. This talk presents the different means put in place to make that possible, along with feedback from experience.



  1. 1 sysadmin vs 250 clusters (Etienne Menguy, SysadminDays, November 19, 2019)
  2. OVHcloud: 1,500,000 customers, 2,200 employees, 380,000 bare-metal servers
  3. Ceph at OVHcloud: additional disks for Public Cloud virtual machines, and Cloud Disk Array as a service
  4. Evolution. 2015: 4 devs, 1 ops, 8 clusters, 4 regions. 2019: 9 devs, 250 clusters, 10 regions
  5. Daily work. 1 sysadmin: monitoring, production pushes, support, training, deploying regions and servers, and the daily surprises. 8 devs: Ceph as a service, infra as code, code review, tests, R&D
  6. Ceph setup: each bare-metal server has a 40 Gbps NIC, 12 HDDs and an NVMe drive split into 12 partitions; every HDD is fronted by an NVMe flashcache partition and served from its own LXC data container
  7. Ceph as a service. Autonomous users: creating clusters, managing users, pools and rights, managing the network, growing clusters. Backup management: 500 TB/day, Ceph -> Swift, Ceph -> Ceph. Managing our infrastructure: cluster upgrades, deploying new Ceph versions, managing tasks, host management, network management, container management
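Slide 7 only states the backup volumes and directions (500 TB/day, Ceph -> Swift and Ceph -> Ceph), not how the copies are made. The snippet below is a minimal sketch of one possible Ceph -> Swift path, copying RADOS objects from a pool into a Swift container with the librados Python bindings and python-swiftclient; the pool name, container name, credentials and auth URL are placeholders, and a real 500 TB/day pipeline would work at a higher level (e.g. RBD exports) with parallelism, retries and incremental logic.

```python
import rados                                  # librados Python bindings
from swiftclient import client as swift       # python-swiftclient

def backup_pool_to_swift(pool, container):
    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx(pool)
    # Placeholder credentials; a real setup would use Keystone and proper secrets.
    conn = swift.Connection(authurl="https://auth.example/auth/v1.0",
                            user="backup", key="secret")
    conn.put_container(container)
    try:
        for obj in ioctx.list_objects():
            size, _mtime = ioctx.stat(obj.key)
            conn.put_object(container, f"{pool}/{obj.key}",
                            contents=ioctx.read(obj.key, length=size))
    finally:
        ioctx.close()
        cluster.shutdown()
        conn.close()
```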
  8. Infrastructure components: servers, containers, VMs, instances, database, Puppet, Python API, OVH API, RabbitMQ, Celery
  9. Task management: RabbitMQ and Celery (https://github.com/ovh/celery-dyrygent) for complex, reliable workflows with monitoring and a web interface. Planned tasks such as NVMe replacement; self-healing tasks triggered by monitoring probes that can execute any operation
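The deck names RabbitMQ, Celery and celery-dyrygent but shows no code. The sketch below only illustrates the basic Celery-over-RabbitMQ pattern behind the planned and self-healing tasks; the task name, broker URL and queue are hypothetical, and celery-dyrygent would be layered on top for the complex workflows the slide mentions.

```python
# Minimal Celery + RabbitMQ sketch; names and URLs are placeholders.
from celery import Celery

app = Celery("ceph_tasks", broker="amqp://guest:guest@rabbitmq:5672//")

@app.task(bind=True, max_retries=3)
def replace_nvme(self, host, device):
    """Planned task: prepare a host for an NVMe swap (placeholder body)."""
    try:
        # A real implementation would drain the flashcache partitions on `device`
        # and take the affected OSDs out of the cluster before the swap.
        print(f"preparing {host}:{device} for NVMe replacement")
    except Exception as exc:
        raise self.retry(exc=exc, countdown=60)

# A monitoring probe (self healing) or a scheduler (planned task) enqueues it:
# replace_nvme.apply_async(args=["host-42", "/dev/nvme0n1"], queue="maintenance")
```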
  10. Example workflow: start -> check operation safety -> lower disk weight -> wait for cluster_health_ok -> if the weight is not yet 0, lower it again; once the weight equals 0, remove the disk from the cluster
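As a rough translation of the slide-10 flow into code, here is a sketch assuming the ceph CLI is available where the task runs. The weight decrement, the polling interval, the health-JSON field name and the use of `ceph osd purge` for the final removal are assumptions, and the initial health check stands in for the "check operation safety" box.

```python
import json
import subprocess
import time

STEP = 0.2       # CRUSH weight removed per iteration (assumption)
POLL = 30        # seconds between health checks (assumption)

def ceph(*args):
    """Run a ceph CLI command and return its stdout."""
    return subprocess.run(["ceph", *args], check=True,
                          capture_output=True, text=True).stdout

def wait_cluster_health_ok():
    """Block until the cluster reports HEALTH_OK ('wait cluster_health_ok')."""
    while json.loads(ceph("health", "--format", "json")).get("status") != "HEALTH_OK":
        time.sleep(POLL)

def drain_and_remove_osd(osd_id, weight):
    """Lower the OSD's CRUSH weight step by step, then remove it from the cluster."""
    wait_cluster_health_ok()                             # crude 'check operation safety'
    while weight > 0:                                    # loop until 'weight equals 0'
        weight = max(0.0, round(weight - STEP, 2))
        ceph("osd", "crush", "reweight", f"osd.{osd_id}", str(weight))  # 'lower disk weight'
        wait_cluster_health_ok()
    ceph("osd", "out", str(osd_id))                      # stop receiving data
    ceph("osd", "purge", str(osd_id), "--yes-i-really-mean-it")  # 'remove disk from cluster'
```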
  11. Continuous delivery with CDS (https://github.com/ovh/cds). Each pull request: lint and unit tests. Daily production pushes: all tests executed
  12. Infra as code
  13. Inconsistent hardware. Hardware profiles: 12 profiles in production (CPU, NVMe, HDD), plus firmware and Ceph version differences. Generic tools; 1 profile = 1 cluster
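One way to read "generic tools, 1 profile = 1 cluster" is that the tooling never special-cases hosts: it derives its behaviour from the single hardware profile a cluster is built on. The sketch below is purely illustrative; the profile names and fields are made up.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HardwareProfile:
    name: str
    cpu: str
    nvme: str
    hdd_count: int

# Hypothetical profiles; the deck mentions 12 in production but lists none.
PROFILES = {
    "profile-a": HardwareProfile("profile-a", "cpu-model-a", "nvme-model-a", 12),
    "profile-b": HardwareProfile("profile-b", "cpu-model-b", "nvme-model-b", 12),
}

def osd_layout(cluster_profile: str) -> dict:
    """Generic tooling derives its settings from the cluster's single profile."""
    profile = PROFILES[cluster_profile]
    return {"osds_per_host": profile.hdd_count, "flashcache_device": profile.nvme}
```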
  14. Monitoring: automatic downtimes set by tasks, some alarms only during working hours, service/host aggregation. 143,000 services, 25,000 hosts, 3 infrastructures (6 masters, 12 satellites)
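The slide does not name the monitoring stack, but "automatic downtimes by tasks" with masters and satellites suggests something Nagios-compatible. Assuming that, a task could silence a host before maintenance by writing a SCHEDULE_HOST_DOWNTIME external command; the command-file path below is a placeholder.

```python
import time

COMMAND_FILE = "/var/lib/nagios/rw/nagios.cmd"   # placeholder path

def schedule_host_downtime(host, duration_s, author="task-runner", comment="maintenance"):
    """Write a SCHEDULE_HOST_DOWNTIME external command so alarms stay quiet."""
    now = int(time.time())
    cmd = (f"[{now}] SCHEDULE_HOST_DOWNTIME;{host};{now};{now + duration_s};"
           f"1;0;{duration_s};{author};{comment}\n")
    with open(COMMAND_FILE, "w") as fifo:        # the command file is a named pipe
        fifo.write(cmd)

# e.g. schedule_host_downtime("host-42", 3600) before rebooting host-42.
```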
  15. Metrics. Cluster metrics: usage, latency. Hardware: CPU and memory usage, cache hit ratio. Service: KPIs, usage per OpenStack region. Performance: IO/s, latency. Metrics Data Platform (https://www.ovh.com/fr/data-platforms/metrics/): 13 million series, 13 billion points per day
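The Metrics Data Platform accepts several ingestion protocols; the sketch below assumes its Warp10-compatible HTTP endpoint and uses placeholder endpoint and token values, so treat it as an illustration of how a probe could push a cluster metric rather than the exact pipeline used in the talk.

```python
import time
import requests

ENDPOINT = "https://metrics.example.ovh/api/v0/update"   # placeholder
TOKEN = "WRITE_TOKEN"                                     # placeholder

def push_gts(classname, labels, value):
    """Push one data point in Warp10 GTS input format."""
    ts_us = int(time.time() * 1_000_000)                  # Warp10 expects microseconds
    label_str = ",".join(f"{k}={v}" for k, v in labels.items())
    payload = f"{ts_us}// {classname}{{{label_str}}} {value}"
    requests.post(ENDPOINT, data=payload,
                  headers={"X-Warp10-Token": TOKEN}, timeout=5).raise_for_status()

# push_gts("ceph.cluster.commit_latency_ms", {"cluster": "cluster-001"}, 4.2)
```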
  16. Logs. Infrastructure: OS, Ceph. Applications: CaaS (Ceph as a service), Celery/RabbitMQ, unique step/task IDs. API. Logs Data Platform (https://www.ovh.com/fr/data-platforms/logs/): 15,000 logs/second, Graylog, Filebeat
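To make the "unique step/task ID" point concrete: tagging every log line with the workflow's ID lets Graylog correlate all steps of one operation once Filebeat has shipped the file. The logger name, file path and field names below are illustrative, not the ones used in the talk.

```python
import logging
import uuid

logger = logging.getLogger("ceph_tasks")
handler = logging.FileHandler("worker.log")   # a file Filebeat would watch (illustrative)
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s task_id=%(task_id)s step=%(step)s %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

task_id = uuid.uuid4().hex                    # one ID for the whole workflow
logger.info("lowering osd weight", extra={"task_id": task_id, "step": "reweight"})
logger.info("waiting for HEALTH_OK", extra={"task_id": task_id, "step": "wait_health"})
```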
  17. Conclusion
  18. Questions?
