SlideShare a Scribd company logo
Automatic	
  Operation	
  Bot	
  for	
  Ceph
You,	
  Ji
youji@ebay.com
• Ceph	
  in	
  eBay
• Ops	
  for	
  Ceph	
  Cluster
• Automatic	
  opertion	
  Bot
Cluster Capacit
y
Node
s
Disks SSD
pool
Raw
Usage
Provisi
on
Usage
Used IOPS
(Avg/Max)
LVS01 1.4PB 18 400 0 4.6% 13% 1.10K/7.5K
LVS02 3.24PB 54 950 3.3T 12.7% 60.7% 11.5K/24K
PHX02 3.49PB 55 1000 6.6T 8.8% 41.9% 9K/16K
SLC07 3.77PB 58 1100 6T 12.5% 36.4% 14K/41.3K
CAL
CephFS
18T 36 36 18T 19.19% 29.6% 3K/5K
3	
  copies	
  for	
  all	
  cluster
0
2
4
6
8
10
12
14
Total
Summary
Raw Capacity Provisioned Size Raw Usage
All	
  the	
  size	
  has	
  multiplied	
  3
PB
• Ceph	
  in	
  eBay
• Ops	
  for	
  Ceph	
  Cluster
• Automatic	
  opertion	
  Bot
foreman
puppet	
  
code
repo
puppet	
  
config
repo
git	
  
server
...
...
Deploy
• Use	
  code	
  to	
  deploy
• Differences	
  kept	
  in	
  Config	
  Repo
• puppet	
  auto	
  add	
  free	
  disks
• NO auto	
  weight	
  for	
  new	
  OSD
Monitoring	
  &	
  Alerts
node-­‐exporter
ceph	
  node
CPU Mem Disk Net
admin-­‐sockets
megacli
iostat
ceph-­‐exporter
pool	
  usage
cluster	
  reliability
cluster	
  availability
PG	
  status
ceph	
  admin	
  node
Prometheus	
  local	
  disk Prometheus	
  remote	
  disk
AlertManager Grafana
Remediation
• ~ 95%	
  warning/error	
  caused	
  by	
  bad	
  disks.
• ~3%	
  caused	
  by	
  host	
  down
• ~2%	
  other	
  errors,	
  such	
  as	
  network,	
  electric	
  power,	
  human	
  error	
  
Percentage
Error Disk Host Down net/power/human error
• Ceph	
  in	
  eBay
• Ops	
  for	
  Ceph	
  Cluster
• Automatic	
  opertion	
  Bot
• Handbook	
  for	
  operation
• Interactive	
  chat	
  bot
• Automatic	
  operation	
  bot
How	
  to	
  build	
  a	
  automatic	
  bot?
ceph	
  pg	
  dump	
  |	
  grep	
  inconsistent
1 2
pg	
  id
3
osd.1 osd.2 osd.3
4
5
6
Swap	
  Disk
$	
  remove	
  OSD.2	
  from	
  ceph
$	
  Offline	
  disk	
  from	
  HW
$	
  Light	
  LED	
  on	
  the	
  Disk.
Write	
  operations	
  on	
  the	
  doc.
• Record	
  every	
  failure	
  of	
  disks/node	
  and	
  related	
  docs.
• Handbook	
  for	
  operation
• Interactive	
  chat	
  bot
• Automatic	
  operation	
  bot
How	
  to	
  build	
  a	
  automatic	
  bot?
Monitoring	
  &	
  Alerts
node-­‐exporter
ceph	
  node
CPU Mem Disk Net
admin-­‐sockets
megacli
iostat
ceph-­‐exporter
pool	
  usage
cluster	
  reliability
cluster	
  availability
PG	
  status
ceph	
  admin	
  node
Prometheus	
  local	
  disk
AlertManager
Demo
• Handbook	
  for	
  operation
• Interactive	
  chat	
  bot
• Automatic	
  operation	
  bot
How	
  to	
  build	
  a	
  automatic	
  bot?
node-­‐exporter
ceph	
  node
CPU Mem Disk Net
admin-­‐sockets
megacli
iostat
ceph-­‐exporter
pool	
  usage
cluster	
  reliability
cluster	
  availability
PG	
  status
ceph	
  admin	
  node
Prometheus	
  local	
  disk
AlertManager
node-­‐exporter
ceph	
  node
CPU Mem Disk Net
admin-­‐sockets
megacli
iostat
ceph-­‐exporter
pool	
  usage
cluster	
  reliability
cluster	
  availability
PG	
  status
ceph	
  admin	
  node
Prometheus	
  local	
  disk
AlertManager
Task	
  Queue
Executor
Q	
  &	
  A
Thanks
youji@ebay.com

More Related Content

What's hot

Erasure Code at Scale - Thomas William Byrne
Erasure Code at Scale - Thomas William ByrneErasure Code at Scale - Thomas William Byrne
Erasure Code at Scale - Thomas William Byrne
Ceph Community
 
Accelerating Ceph with iWARP RDMA over Ethernet - Brien Porter, Haodong Tang
Accelerating Ceph with iWARP RDMA over Ethernet - Brien Porter, Haodong TangAccelerating Ceph with iWARP RDMA over Ethernet - Brien Porter, Haodong Tang
Accelerating Ceph with iWARP RDMA over Ethernet - Brien Porter, Haodong Tang
Ceph Community
 
Ceph Goes on Online at Qihoo 360 - Xuehan Xu
Ceph Goes on Online at Qihoo 360 - Xuehan XuCeph Goes on Online at Qihoo 360 - Xuehan Xu
Ceph Goes on Online at Qihoo 360 - Xuehan Xu
Ceph Community
 
Making Ceph awesome on Kubernetes with Rook - Bassam Tabbara
Making Ceph awesome on Kubernetes with Rook - Bassam TabbaraMaking Ceph awesome on Kubernetes with Rook - Bassam Tabbara
Making Ceph awesome on Kubernetes with Rook - Bassam Tabbara
Ceph Community
 
Common Support Issues And How To Troubleshoot Them - Michael Hackett, Vikhyat...
Common Support Issues And How To Troubleshoot Them - Michael Hackett, Vikhyat...Common Support Issues And How To Troubleshoot Them - Michael Hackett, Vikhyat...
Common Support Issues And How To Troubleshoot Them - Michael Hackett, Vikhyat...
Ceph Community
 
RBD: What will the future bring? - Jason Dillaman
RBD: What will the future bring? - Jason DillamanRBD: What will the future bring? - Jason Dillaman
RBD: What will the future bring? - Jason Dillaman
Ceph Community
 
AF Ceph: Ceph Performance Analysis and Improvement on Flash
AF Ceph: Ceph Performance Analysis and Improvement on FlashAF Ceph: Ceph Performance Analysis and Improvement on Flash
AF Ceph: Ceph Performance Analysis and Improvement on Flash
Ceph Community
 
RADOS improvements and roadmap - Greg Farnum, Josh Durgin, Kefu Chai
RADOS improvements and roadmap - Greg Farnum, Josh Durgin, Kefu ChaiRADOS improvements and roadmap - Greg Farnum, Josh Durgin, Kefu Chai
RADOS improvements and roadmap - Greg Farnum, Josh Durgin, Kefu Chai
Ceph Community
 
Ceph on All Flash Storage -- Breaking Performance Barriers
Ceph on All Flash Storage -- Breaking Performance BarriersCeph on All Flash Storage -- Breaking Performance Barriers
Ceph on All Flash Storage -- Breaking Performance Barriers
Ceph Community
 
Build a High Available NFS Cluster Based on CephFS - Shangzhong Zhu
Build a High Available NFS Cluster Based on CephFS - Shangzhong ZhuBuild a High Available NFS Cluster Based on CephFS - Shangzhong Zhu
Build a High Available NFS Cluster Based on CephFS - Shangzhong Zhu
Ceph Community
 
Ceph Performance Profiling and Reporting
Ceph Performance Profiling and ReportingCeph Performance Profiling and Reporting
Ceph Performance Profiling and Reporting
Ceph Community
 
Ceph Day Beijing - Ceph RDMA Update
Ceph Day Beijing - Ceph RDMA UpdateCeph Day Beijing - Ceph RDMA Update
Ceph Day Beijing - Ceph RDMA Update
Danielle Womboldt
 
Basic and Advanced Analysis of Ceph Volume Backend Driver in Cinder - John Haan
Basic and Advanced Analysis of Ceph Volume Backend Driver in Cinder - John HaanBasic and Advanced Analysis of Ceph Volume Backend Driver in Cinder - John Haan
Basic and Advanced Analysis of Ceph Volume Backend Driver in Cinder - John Haan
Ceph Community
 
OpenStack and Ceph case study at the University of Alabama
OpenStack and Ceph case study at the University of AlabamaOpenStack and Ceph case study at the University of Alabama
OpenStack and Ceph case study at the University of Alabama
Kamesh Pemmaraju
 
Tuning the Kernel for Varnish Cache
Tuning the Kernel for Varnish CacheTuning the Kernel for Varnish Cache
Tuning the Kernel for Varnish Cache
Per Buer
 
Ceph Day Beijing - Our journey to high performance large scale Ceph cluster a...
Ceph Day Beijing - Our journey to high performance large scale Ceph cluster a...Ceph Day Beijing - Our journey to high performance large scale Ceph cluster a...
Ceph Day Beijing - Our journey to high performance large scale Ceph cluster a...
Danielle Womboldt
 
MySQL Head-to-Head
MySQL Head-to-HeadMySQL Head-to-Head
MySQL Head-to-Head
Patrick McGarry
 
Scaling Apache Pulsar to 10 Petabytes/Day
Scaling Apache Pulsar to 10 Petabytes/DayScaling Apache Pulsar to 10 Petabytes/Day
Scaling Apache Pulsar to 10 Petabytes/Day
ScyllaDB
 
OSv presentation from Linux Foundation Collaboration Summit
OSv presentation from Linux Foundation Collaboration SummitOSv presentation from Linux Foundation Collaboration Summit
OSv presentation from Linux Foundation Collaboration Summit
Don Marti
 
QCT Ceph Solution - Design Consideration and Reference Architecture
QCT Ceph Solution - Design Consideration and Reference ArchitectureQCT Ceph Solution - Design Consideration and Reference Architecture
QCT Ceph Solution - Design Consideration and Reference Architecture
Patrick McGarry
 

What's hot (20)

Erasure Code at Scale - Thomas William Byrne
Erasure Code at Scale - Thomas William ByrneErasure Code at Scale - Thomas William Byrne
Erasure Code at Scale - Thomas William Byrne
 
Accelerating Ceph with iWARP RDMA over Ethernet - Brien Porter, Haodong Tang
Accelerating Ceph with iWARP RDMA over Ethernet - Brien Porter, Haodong TangAccelerating Ceph with iWARP RDMA over Ethernet - Brien Porter, Haodong Tang
Accelerating Ceph with iWARP RDMA over Ethernet - Brien Porter, Haodong Tang
 
Ceph Goes on Online at Qihoo 360 - Xuehan Xu
Ceph Goes on Online at Qihoo 360 - Xuehan XuCeph Goes on Online at Qihoo 360 - Xuehan Xu
Ceph Goes on Online at Qihoo 360 - Xuehan Xu
 
Making Ceph awesome on Kubernetes with Rook - Bassam Tabbara
Making Ceph awesome on Kubernetes with Rook - Bassam TabbaraMaking Ceph awesome on Kubernetes with Rook - Bassam Tabbara
Making Ceph awesome on Kubernetes with Rook - Bassam Tabbara
 
Common Support Issues And How To Troubleshoot Them - Michael Hackett, Vikhyat...
Common Support Issues And How To Troubleshoot Them - Michael Hackett, Vikhyat...Common Support Issues And How To Troubleshoot Them - Michael Hackett, Vikhyat...
Common Support Issues And How To Troubleshoot Them - Michael Hackett, Vikhyat...
 
RBD: What will the future bring? - Jason Dillaman
RBD: What will the future bring? - Jason DillamanRBD: What will the future bring? - Jason Dillaman
RBD: What will the future bring? - Jason Dillaman
 
AF Ceph: Ceph Performance Analysis and Improvement on Flash
AF Ceph: Ceph Performance Analysis and Improvement on FlashAF Ceph: Ceph Performance Analysis and Improvement on Flash
AF Ceph: Ceph Performance Analysis and Improvement on Flash
 
RADOS improvements and roadmap - Greg Farnum, Josh Durgin, Kefu Chai
RADOS improvements and roadmap - Greg Farnum, Josh Durgin, Kefu ChaiRADOS improvements and roadmap - Greg Farnum, Josh Durgin, Kefu Chai
RADOS improvements and roadmap - Greg Farnum, Josh Durgin, Kefu Chai
 
Ceph on All Flash Storage -- Breaking Performance Barriers
Ceph on All Flash Storage -- Breaking Performance BarriersCeph on All Flash Storage -- Breaking Performance Barriers
Ceph on All Flash Storage -- Breaking Performance Barriers
 
Build a High Available NFS Cluster Based on CephFS - Shangzhong Zhu
Build a High Available NFS Cluster Based on CephFS - Shangzhong ZhuBuild a High Available NFS Cluster Based on CephFS - Shangzhong Zhu
Build a High Available NFS Cluster Based on CephFS - Shangzhong Zhu
 
Ceph Performance Profiling and Reporting
Ceph Performance Profiling and ReportingCeph Performance Profiling and Reporting
Ceph Performance Profiling and Reporting
 
Ceph Day Beijing - Ceph RDMA Update
Ceph Day Beijing - Ceph RDMA UpdateCeph Day Beijing - Ceph RDMA Update
Ceph Day Beijing - Ceph RDMA Update
 
Basic and Advanced Analysis of Ceph Volume Backend Driver in Cinder - John Haan
Basic and Advanced Analysis of Ceph Volume Backend Driver in Cinder - John HaanBasic and Advanced Analysis of Ceph Volume Backend Driver in Cinder - John Haan
Basic and Advanced Analysis of Ceph Volume Backend Driver in Cinder - John Haan
 
OpenStack and Ceph case study at the University of Alabama
OpenStack and Ceph case study at the University of AlabamaOpenStack and Ceph case study at the University of Alabama
OpenStack and Ceph case study at the University of Alabama
 
Tuning the Kernel for Varnish Cache
Tuning the Kernel for Varnish CacheTuning the Kernel for Varnish Cache
Tuning the Kernel for Varnish Cache
 
Ceph Day Beijing - Our journey to high performance large scale Ceph cluster a...
Ceph Day Beijing - Our journey to high performance large scale Ceph cluster a...Ceph Day Beijing - Our journey to high performance large scale Ceph cluster a...
Ceph Day Beijing - Our journey to high performance large scale Ceph cluster a...
 
MySQL Head-to-Head
MySQL Head-to-HeadMySQL Head-to-Head
MySQL Head-to-Head
 
Scaling Apache Pulsar to 10 Petabytes/Day
Scaling Apache Pulsar to 10 Petabytes/DayScaling Apache Pulsar to 10 Petabytes/Day
Scaling Apache Pulsar to 10 Petabytes/Day
 
OSv presentation from Linux Foundation Collaboration Summit
OSv presentation from Linux Foundation Collaboration SummitOSv presentation from Linux Foundation Collaboration Summit
OSv presentation from Linux Foundation Collaboration Summit
 
QCT Ceph Solution - Design Consideration and Reference Architecture
QCT Ceph Solution - Design Consideration and Reference ArchitectureQCT Ceph Solution - Design Consideration and Reference Architecture
QCT Ceph Solution - Design Consideration and Reference Architecture
 

Similar to Automatic Operation Bot for Ceph - You Ji

Ceph barcelona-v-1.2
Ceph barcelona-v-1.2Ceph barcelona-v-1.2
Ceph barcelona-v-1.2
Ranga Swami Reddy Muthumula
 
Micro control idsecconf2010
Micro control idsecconf2010Micro control idsecconf2010
Micro control idsecconf2010
idsecconf
 
Emulating With JavaScript
Emulating With JavaScriptEmulating With JavaScript
Emulating With JavaScript
alexanderdickson
 
Oracle Performance On Linux X86 systems
Oracle  Performance On Linux  X86 systems Oracle  Performance On Linux  X86 systems
Oracle Performance On Linux X86 systems Baruch Osoveskiy
 
Performance
PerformancePerformance
Performance
Christophe Marchal
 
QCon London.pdf
QCon London.pdfQCon London.pdf
QCon London.pdf
Monica Beckwith
 
Nick Fisk - low latency Ceph
Nick Fisk - low latency CephNick Fisk - low latency Ceph
Nick Fisk - low latency Ceph
ShapeBlue
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
Shiao-An Yuan
 
Couchbase live 2016
Couchbase live 2016Couchbase live 2016
Couchbase live 2016
Pierre Mavro
 
Tips from Support: Always Carry a Towel and Don’t Panic!
Tips from Support: Always Carry a Towel and Don’t Panic!Tips from Support: Always Carry a Towel and Don’t Panic!
Tips from Support: Always Carry a Towel and Don’t Panic!
Perforce
 
Build an High-Performance and High-Durable Block Storage Service Based on Ceph
Build an High-Performance and High-Durable Block Storage Service Based on CephBuild an High-Performance and High-Durable Block Storage Service Based on Ceph
Build an High-Performance and High-Durable Block Storage Service Based on Ceph
Rongze Zhu
 
Your 1st Ceph cluster
Your 1st Ceph clusterYour 1st Ceph cluster
Your 1st Ceph cluster
Mirantis
 
[OpenStack Days Korea 2016] Track1 - All flash CEPH 구성 및 최적화
[OpenStack Days Korea 2016] Track1 - All flash CEPH 구성 및 최적화[OpenStack Days Korea 2016] Track1 - All flash CEPH 구성 및 최적화
[OpenStack Days Korea 2016] Track1 - All flash CEPH 구성 및 최적화
OpenStack Korea Community
 
XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix
XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, CitrixXPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix
XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix
The Linux Foundation
 
php & performance
 php & performance php & performance
php & performance
simon8410
 
(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014
(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014
(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014
Amazon Web Services
 
3.INTEL.Optane_on_ceph_v2.pdf
3.INTEL.Optane_on_ceph_v2.pdf3.INTEL.Optane_on_ceph_v2.pdf
3.INTEL.Optane_on_ceph_v2.pdf
hellobank1
 
Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...
Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...
Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...
Danielle Womboldt
 
Ceph Day Beijing - Optimizing Ceph performance by leveraging Intel Optane and...
Ceph Day Beijing - Optimizing Ceph performance by leveraging Intel Optane and...Ceph Day Beijing - Optimizing Ceph performance by leveraging Intel Optane and...
Ceph Day Beijing - Optimizing Ceph performance by leveraging Intel Optane and...
Ceph Community
 

Similar to Automatic Operation Bot for Ceph - You Ji (20)

ceph-barcelona-v-1.2
ceph-barcelona-v-1.2ceph-barcelona-v-1.2
ceph-barcelona-v-1.2
 
Ceph barcelona-v-1.2
Ceph barcelona-v-1.2Ceph barcelona-v-1.2
Ceph barcelona-v-1.2
 
Micro control idsecconf2010
Micro control idsecconf2010Micro control idsecconf2010
Micro control idsecconf2010
 
Emulating With JavaScript
Emulating With JavaScriptEmulating With JavaScript
Emulating With JavaScript
 
Oracle Performance On Linux X86 systems
Oracle  Performance On Linux  X86 systems Oracle  Performance On Linux  X86 systems
Oracle Performance On Linux X86 systems
 
Performance
PerformancePerformance
Performance
 
QCon London.pdf
QCon London.pdfQCon London.pdf
QCon London.pdf
 
Nick Fisk - low latency Ceph
Nick Fisk - low latency CephNick Fisk - low latency Ceph
Nick Fisk - low latency Ceph
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
Couchbase live 2016
Couchbase live 2016Couchbase live 2016
Couchbase live 2016
 
Tips from Support: Always Carry a Towel and Don’t Panic!
Tips from Support: Always Carry a Towel and Don’t Panic!Tips from Support: Always Carry a Towel and Don’t Panic!
Tips from Support: Always Carry a Towel and Don’t Panic!
 
Build an High-Performance and High-Durable Block Storage Service Based on Ceph
Build an High-Performance and High-Durable Block Storage Service Based on CephBuild an High-Performance and High-Durable Block Storage Service Based on Ceph
Build an High-Performance and High-Durable Block Storage Service Based on Ceph
 
Your 1st Ceph cluster
Your 1st Ceph clusterYour 1st Ceph cluster
Your 1st Ceph cluster
 
[OpenStack Days Korea 2016] Track1 - All flash CEPH 구성 및 최적화
[OpenStack Days Korea 2016] Track1 - All flash CEPH 구성 및 최적화[OpenStack Days Korea 2016] Track1 - All flash CEPH 구성 및 최적화
[OpenStack Days Korea 2016] Track1 - All flash CEPH 구성 및 최적화
 
XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix
XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, CitrixXPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix
XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix
 
php & performance
 php & performance php & performance
php & performance
 
(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014
(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014
(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014
 
3.INTEL.Optane_on_ceph_v2.pdf
3.INTEL.Optane_on_ceph_v2.pdf3.INTEL.Optane_on_ceph_v2.pdf
3.INTEL.Optane_on_ceph_v2.pdf
 
Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...
Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...
Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...
 
Ceph Day Beijing - Optimizing Ceph performance by leveraging Intel Optane and...
Ceph Day Beijing - Optimizing Ceph performance by leveraging Intel Optane and...Ceph Day Beijing - Optimizing Ceph performance by leveraging Intel Optane and...
Ceph Day Beijing - Optimizing Ceph performance by leveraging Intel Optane and...
 

Recently uploaded

Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 

Recently uploaded (20)

Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 

Automatic Operation Bot for Ceph - You Ji

  • 1. Automatic  Operation  Bot  for  Ceph You,  Ji youji@ebay.com
  • 2. • Ceph  in  eBay • Ops  for  Ceph  Cluster • Automatic  opertion  Bot
  • 3. Cluster Capacit y Node s Disks SSD pool Raw Usage Provisi on Usage Used IOPS (Avg/Max) LVS01 1.4PB 18 400 0 4.6% 13% 1.10K/7.5K LVS02 3.24PB 54 950 3.3T 12.7% 60.7% 11.5K/24K PHX02 3.49PB 55 1000 6.6T 8.8% 41.9% 9K/16K SLC07 3.77PB 58 1100 6T 12.5% 36.4% 14K/41.3K CAL CephFS 18T 36 36 18T 19.19% 29.6% 3K/5K 3  copies  for  all  cluster
  • 4. 0 2 4 6 8 10 12 14 Total Summary Raw Capacity Provisioned Size Raw Usage All  the  size  has  multiplied  3 PB
  • 5. • Ceph  in  eBay • Ops  for  Ceph  Cluster • Automatic  opertion  Bot
  • 6. foreman puppet   code repo puppet   config repo git   server ... ... Deploy • Use  code  to  deploy • Differences  kept  in  Config  Repo • puppet  auto  add  free  disks • NO auto  weight  for  new  OSD
  • 7. Monitoring  &  Alerts node-­‐exporter ceph  node CPU Mem Disk Net admin-­‐sockets megacli iostat ceph-­‐exporter pool  usage cluster  reliability cluster  availability PG  status ceph  admin  node Prometheus  local  disk Prometheus  remote  disk AlertManager Grafana
  • 8. Remediation • ~ 95%  warning/error  caused  by  bad  disks. • ~3%  caused  by  host  down • ~2%  other  errors,  such  as  network,  electric  power,  human  error   Percentage Error Disk Host Down net/power/human error
  • 9. • Ceph  in  eBay • Ops  for  Ceph  Cluster • Automatic  opertion  Bot
  • 10. • Handbook  for  operation • Interactive  chat  bot • Automatic  operation  bot How  to  build  a  automatic  bot?
  • 11. ceph  pg  dump  |  grep  inconsistent 1 2 pg  id 3 osd.1 osd.2 osd.3 4 5 6 Swap  Disk $  remove  OSD.2  from  ceph $  Offline  disk  from  HW $  Light  LED  on  the  Disk.
  • 12. Write  operations  on  the  doc. • Record  every  failure  of  disks/node  and  related  docs.
  • 13. • Handbook  for  operation • Interactive  chat  bot • Automatic  operation  bot How  to  build  a  automatic  bot?
  • 14. Monitoring  &  Alerts node-­‐exporter ceph  node CPU Mem Disk Net admin-­‐sockets megacli iostat ceph-­‐exporter pool  usage cluster  reliability cluster  availability PG  status ceph  admin  node Prometheus  local  disk AlertManager
  • 15. Demo
  • 16. • Handbook  for  operation • Interactive  chat  bot • Automatic  operation  bot How  to  build  a  automatic  bot?
  • 17. node-­‐exporter ceph  node CPU Mem Disk Net admin-­‐sockets megacli iostat ceph-­‐exporter pool  usage cluster  reliability cluster  availability PG  status ceph  admin  node Prometheus  local  disk AlertManager
  • 18. node-­‐exporter ceph  node CPU Mem Disk Net admin-­‐sockets megacli iostat ceph-­‐exporter pool  usage cluster  reliability cluster  availability PG  status ceph  admin  node Prometheus  local  disk AlertManager Task  Queue Executor