TROUBLESHOOTING CEPH
Brad Hubbard
Senior Software Maintenance Engineer
05-11-2015
What sort of trouble?
Identify your problem domain
Ceph is a mature, resilient and robust piece of software but, when things do go
wrong, empower yourself to identify which of these specific areas is implicated and
analyse it using common Linux, and Ceph-specific, tooling.
● Performance
● “Hang”
● Crash
● Unexpected or undesirable behaviour
Performance
Establish a baseline and re-test regularly
● rados bench
● ceph tell osd.N bench
● fio – rbd ioengine
● fio – libaio ioengine
● pblio - https://github.com/pblcache/pblcache/wiki/Pblio
● netperf – test all network segments
● dd
● pcp, sysstat, collectl, insert favourite tool here...
● The Ceph Benchmarking Tool - https://github.com/ceph/cbt
● Be mindful of the cache and its effects
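For example, a minimal baseline run might look like the following sketch. The pool name "testbench" and rbd image "testimg" are placeholders for a scratch pool and image you create yourself; adjust runtimes, block sizes and queue depths to match your workload.
# rados bench -p testbench 60 write --no-cleanup
# rados bench -p testbench 60 seq
# ceph tell osd.0 bench
# fio --ioengine=rbd --clientname=admin --pool=testbench --rbdname=testimg --rw=randwrite --bs=4k --iodepth=32 --runtime=60 --name=baseline
Record the throughput and latency numbers alongside the cluster state (ceph -s) so later runs can be compared like-for-like.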
Performance
Specifically, poor performance
Zero in on the problem area by identifying if it is specific to a particular host, or hosts,
or if a particular sub-domain is implicated.
● HEALTH_OK ?
● Re-use the tools mentioned in the previous slide as well as host specific tools
● ss, netstat and friends
● tcpdump
● iostat
● top
● pcp, sar, collectl
● free, vmstat
● Increase ceph logging verbosity
● $ gawk '/ERR/||/WRN/' /var/log/ceph/*log
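A quick first pass on a suspect host might look like the following sketch (default tool locations assumed; interpreting the output is where the real work is):
# ceph health detail        (which OSDs/PGs are implicated?)
# ceph osd perf             (per-OSD commit/apply latency outliers, on recent releases)
# iostat -x 5               (saturated or slow disks on the suspect host)
# sar -n DEV 5              (network throughput per interface)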
Performance
Slow requests
When Ceph detects a request that is too slow (tunable) it will issue a warning.
{date} {osd.num} [WRN] slow request 30.005692 seconds old, received at {date-time}:
osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304]
0.69848840) v4 currently waiting for subops from [610]
● HEALTH_OK ?
● Check performance statistics on implicated hosts
● Turn up debugging on the implicated OSDs
• # ceph tell osd.N injectargs '--debug_osd 20 --debug_ms 1'
● Gather information about slow ops (see the example sequence below)
• # ceph --admin-daemon /var/run/ceph/ceph-osd.N.asok dump_historic_ops
• # ceph --admin-daemon /var/run/ceph/ceph-osd.N.asok perf dump
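One possible sequence against the implicated OSD (N is a placeholder; remember to lower the debug levels again afterwards):
# ceph tell osd.N injectargs '--debug_osd 20 --debug_ms 1'
# ceph --admin-daemon /var/run/ceph/ceph-osd.N.asok dump_ops_in_flight     (what is queued right now)
# ceph --admin-daemon /var/run/ceph/ceph-osd.N.asok dump_historic_ops      (recent worst offenders)
# ceph tell osd.N injectargs '--debug_osd 0/5 --debug_ms 0/5'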
Hang
Is it really a hang?
Sometimes situations described as a “hang” turn out to be something different such
as code stuck in a tight loop, a dead-lock, firewall problems, etc.
● Use strace to check if the process is still making progress (may miss hung threads)
● Check for a high load average and/or high %iowait on the CPUs
● Use ps to check for ceph processes in D-state (uninterruptible sleep)
● Use ps to find the ceph threads that are sleeping and what function they are sleeping in
• # ps axHo stat,tid,ppid,comm,wchan
● Check syslog and dmesg for “hung_task_timeout” warnings
● Use gstack or gcore to figure out where we are in the ceph code and what subsystems in the kernel we are exercising
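For example, to list only the ceph threads currently in uninterruptible sleep, and to capture a stack per ceph-osd process (a sketch; process names and paths are the defaults):
# ps axHo stat,tid,ppid,comm,wchan | awk '/ceph/ && $1 ~ /D/'
# for p in $(pidof ceph-osd); do gstack $p > /tmp/osd-$p.stack; done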
Hang
Is it really a hang?
Note that if everything points to uninterruptible threads in kernel space this is a kernel
problem, but it still has the potential to severely degrade ceph performance and
needs to be identified and fixed.
● Use sysrq to dump the kernel thread stacks. The dump goes to syslog; search for “ D ”
• # echo 1 > /proc/sys/kernel/sysrq
• # echo 't' > /proc/sysrq-trigger
• # sleep 20
• # echo 't' > /proc/sysrq-trigger
• # echo 0 > /proc/sys/kernel/sysrq
xfssyncd/dm-2 D 0000000000000011 0 3207 2 0x00000080
● The sysrq data may implicate a certain subsystem, help to identify a known issue or confirm suspicions
● May require a vmcore to be collected and analysed
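Once the traces are in syslog they can be pulled back out with something like the following (the field layout varies between kernel versions, and the file is /var/log/syslog on Debian-based systems):
# grep -A 25 -E ' D [0-9a-f]{16}' /var/log/messages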
Hang
Is it really a hang?
What at first appears to be a hang may in fact be a thread, or threads, caught in a
tight loop due to some logic condition and failing to make progress. To the user the
process seems “hung” but it is actually running. We need to identify where the
process is spending the bulk of its time.
● Look for high CPU usage by Ceph processes
● Check strace and/or ltrace output for hints at what the process may be doing
● Employ the “Poor Man's Profiler” technique - http://poormansprofiler.org/ (an aggregating variant is sketched below)
• # for x in `seq 1 5`; do for pid in `pidof ceph-mon ceph-osd`; do gstack $pid; echo; done; done > /tmp/ceph-stacks
• This can potentially generate a lot of data, so you may want to target only a single process, the one(s) with high CPU utilisation
● Visually inspect the relevant source code to work out why it might not make progress
● More advanced techniques such as scripting gdb or systemtap probes
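The aggregation step of the Poor Man's Profiler (roughly as described at poormansprofiler.org) counts how many threads share an identical stack, which makes a hot or stuck code path stand out. A sketch, targeting a single busy ceph-osd:
# pid=$(pidof ceph-osd | awk '{print $1}')
# gdb -batch -ex 'set pagination 0' -ex 'thread apply all bt' -p $pid 2>/dev/null | \
    awk 'BEGIN{s=""} /^Thread/{if(s!="")print s; s=""} /^#/{if(s=="") s=$4; else s=s","$4} END{if(s!="")print s}' | \
    sort | uniq -c | sort -rn | head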
Hang
Is it really a hang?
Dead-lock or live-lock.
● gcore and/or gstack
● Visually inspect relevant source code
● Might need some help with this one
Crash
Where did ceph go?
If ceph crashes it will attempt to log details of the crash. Code in handle_fatal_signal()
and __ceph_assert_fail() will try to dump the stack as well as relevant information and
a debug log of recent events. Search the logs for “objdump”.
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
ceph version 0.80.8-84-gb5a67f0 (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047)
1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)
+0x2a9) [0x9acc49]
2: (ceph::HeartbeatMap::is_healthy()+0xb6) [0x9ad4b6]
3: (OSD::_is_healthy()+0x21) [0x5fde61]
4: (OSD::tick()+0x498) [0x64d978]
...
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.
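A quick way to locate crash dumps across all of the daemon logs on a host:
# grep -l -e 'objdump' -e 'Caught signal' -e 'FAILED assert' /var/log/ceph/*log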
Crash
asserts
The ceph code base includes thousands of asserts. An assert aborts the program if
the asserted condition evaluates to false.
● Conditions that are considered fatal
● Memory corruption
● Result of on-disk corruption
● Intentional aborts
75 was = h->suicide_timeout.read();
76 if (was && was < now) {
77 ldout(m_cct, 1) << who << " '" << h->name << "'"
78 << " had suicide timed out after " << h->suicide_grace << dendl;
79 assert(0 == "hit suicide timeout");
80 }
Crash
Fatal signals
Indicate a fatal error such as a segmentation fault, bus error or abort. Search for
“objdump” or “*** Caught signal”
● Indicative of a programming error
● Usually a memory accounting/access error
● Check for existing bugs with the same signature or open a new tracker or Bugzilla
Crash
Example
0> 2015-09-24 04:14:49.345105 7fea04f79700 -1 *** Caught signal (Aborted) **
in thread 7fea04f79700
ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
1: /usr/bin/ceph-osd() [0x9f63f2]
2: (()+0xf130) [0x7fea14462130]
3: (gsignal()+0x37) [0x7fea12e7c5d7]
4: (abort()+0x148) [0x7fea12e7dcc8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fea137809b5]
6: (()+0x5e926) [0x7fea1377e926]
7: (()+0x5e953) [0x7fea1377e953]
8: (()+0x5eb73) [0x7fea1377eb73]
9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x137) [0xb697b7]
10: (OSDMap::decode_classic(ceph::buffer::list::iterator&)+0x605) [0xab1a35]
...
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Crash
Extracting more information
Crashes can be tricky to diagnose completely but, if you are up for the challenge or
would just like to gather more information, eu-addr2line, gdb or objdump may provide
further insight.
# eu-addr2line -e /usr/bin/ceph-osd 0xb697b7
include/buffer.h:224
# objdump -rdS /usr/bin/ceph-osd
# gdb `which ceph-osd`
(gdb) disass /m 0xb697b7
Dump of assembler code for function ceph::buffer::list::iterator::copy(unsigned int,
char*):
...
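On RPM-based systems the matching debuginfo package makes the gdb and objdump output far more useful. A sketch only; package and repository names vary by distribution and release:
# debuginfo-install ceph          (from yum-utils; needs the debuginfo repo enabled)
# gdb -batch -ex 'disassemble /m 0xb697b7' /usr/bin/ceph-osd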
Unexpected or undesirable behaviour
That doesn't seem right?
Sometimes ceph may not do what you expect or want
● Identify the expected or desirable behaviour
● Figure out if this is by design or the result of a corner case or error
● What behaviour do you see? (when I do X, I see Y)
● Timestamp an instance of the error/behaviour
● Increase debugging and trace the transaction through the logs (a sketch below)
● If this is Openstack behaviour, trace it via the Nova, Glance, Cinder and rbd logs
● If this is Rados gateway behaviour, trace the httpd logs and match these with the rgw and Ceph logs
● Start at the user end and work back towards ceph
● Timestamps help a lot!
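A crude but effective way to line things up is to grep every layer for the same timestamp window (the log paths below are common defaults and may differ on your deployment):
# ts='2015-09-24 04:14'
# grep -H "$ts" /var/log/nova/*.log /var/log/cinder/*.log /var/log/ceph/*log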
Debug logging
OSD
● debug ms = 1
● debug osd = 20
● debug objecter = 20
● debug monc = 20
● debug journal = 20
● debug filestore = 20
● debug newstore = 30
● debug objclass = 20
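These settings normally live in the [osd] section of ceph.conf (daemon restart required), or they can be injected at runtime as shown on the "Without restart" slide. A minimal example, enabling only the subsystems under suspicion:
[osd]
    debug ms = 1
    debug osd = 20
    debug filestore = 20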
Debug logging
MON
● debug mon = 20
● debug ms = 1
● debug paxos = 20
● debug auth = 20
Debug logging
RADOS Gateway
● debug rgw = 20
● debug ms = 1
Debug logging
MDS
● debug ms = 1
● debug mds = 20
● debug auth = 20
● debug monc = 20
● mds debug scatterstat = true
● mds verify scatter = true
● mds log max segments = 2
Debug logging
Client
[client] # Section, can also be global since it is inherited
debug ms = 1
debug rbd = 20
debug objectcacher = 20
debug objecter = 20
log file = /var/log/ceph/rbd.log
# touch /var/log/ceph/rbd.log
# chmod 777 /var/log/ceph/rbd.log
Debug logging
Openstack
Turn up logging verbosity for whichever is relevant: Nova, Glance, Cinder, rbd, or all of
the above.
● Trace the error/behaviour down through the logs from high level (Nova) to low level (rbd and the ceph cluster)
● Try running the relevant commands from a lower level
● Make sure it isn't an Openstack problem
Debug logging
Linux kernel (krbd) client
The kernel RBD client logs to syslog and/or dmesg
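If the running kernel was built with CONFIG_DYNAMIC_DEBUG, the rbd and libceph modules can be made much chattier at runtime (a sketch; remember to switch the flags back off with -p when finished):
# mount -t debugfs none /sys/kernel/debug          (if not already mounted)
# echo 'module rbd +p' > /sys/kernel/debug/dynamic_debug/control
# echo 'module libceph +p' > /sys/kernel/debug/dynamic_debug/control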
Debug logging
Without restart
Turn debug logging on
● ceph tell osd.* injectargs '--debug_osd 20 --debug_ms 1'
Turn debug logging off
● ceph tell osd.* injectargs '--debug_osd 0/5 --debug_ms 0/5'
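To confirm the injected settings took effect on a particular daemon, read them back via the admin socket:
# ceph --admin-daemon /var/run/ceph/ceph-osd.N.asok config show | grep -E 'debug_(osd|ms)'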
Source code
Use the source Luke!
Upstream source
# ceph -v
ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
● git clone https://github.com/ceph/ceph.git
● git checkout e4bfad3a3c51054df7e537a724c8d0bf9be972ff
● git checkout v0.94.1
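With the matching tag checked out, the assert from the earlier slide can be read in context (the path is src/common/HeartbeatMap.cc in trees of this era; exact line numbers differ between releases):
# git show v0.94.1:src/common/HeartbeatMap.cc | sed -n '70,85p'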
Source code
Use the source Luke!
Downstream source
● yumdownloader --archlist=src --enablerepo=rhel-7-server-rhceph-1.3-*source-rpms ceph
● rpm -ivh ceph-0.94.1-19.el7cp.src.rpm
● rpmbuild -bp --nodeps rpmbuild/SPECS/ceph.spec
● cd rpmbuild/BUILD/ceph-0.94.1/
● Ubuntu equivalent commands (a sketch below)
● Use your favourite editor (yes, of course it's “vi”) to browse the source files
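The rough Ubuntu equivalent, assuming deb-src entries are enabled in /etc/apt/sources.list:
$ apt-get source ceph
$ cd ceph-*/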
Resources
Additional sources of help
If all else fails (or even as a first resort) seek help.
● Email:
• ceph-users@ceph.com
● IRC:
• irc.oftc.net #ceph
● Known issues
• http://tracker.ceph.com/
• https://bugzilla.redhat.com/
● Documentation
• http://docs.ceph.com/docs/master/
● Red Hat support
• https://access.redhat.com
• https://access.redhat.com/support
THANK YOU
plus.google.com/+RedHat
linkedin.com/company/red-hat
youtube.com/user/RedHatVideos
facebook.com/redhatinc
twitter.com/RedHatNews


Editor's Notes

  1. Hello, My name is Brad Hubbard. I work for Red Hat Support Delivery supporting our distributed storage products, Red Hat Gluster Storage and Red Hat Ceph Storage. I&amp;apos;d like to talk to you today about troubleshooting Ceph problems and I hope everyone can take something away from this talk that can help them in working with Ceph in the future.
  2. Whilst Ceph is a robust piece of software and is designed to be autonomous, self-healing and intelligent, like any large software application it can encounter problems and can be mysterious to the uninitiated. Hopefully I can provide some guidance on how to handle problems when they do arise. I&amp;apos;ve broken the issues into four main areas. Performance &amp;quot;Hangs&amp;quot; In inverted commas because quite often a hang is actually not a hang, but a process still running that is not making progress. Crashes and everything else Pause?
  3. The first area I&amp;apos;m going to talk about is performance. It makes sense to establish baselines for performance and re-test periodically to detect any performance degradation. The tools I&amp;apos;ve listed on this slide can help with that. They are well known and, for the most part, well documented, however it is worth mentioning a couple of them in more detail. Ceph tell osd bench takes arguments for bytes per write and total bytes and does a simple benchmark. The Fio tool has a built in rbd engine which uses librbd to talk to the Ceph cluster directly. PBLIO was new to me when I heard about it a few days ago and it uses an online transaction processing enterprise workload to stress storage systems. It uses the open source Netapp workload generator and seems well suited to testing enterprise storage. I have not used it personally but I&amp;apos;ve heard good things recently. CBT is in development and is a testing harness for ceph which can automate some of the tasks and makes use of fio, rados bench, etc. as well as being able to collect statistics from tools such as blktrace, perf, valgrind and others at the same time.
  4. If you do experience degraded performance the first thing to do is make sure the cluster&amp;apos;s health is ok and that you are firing on all cylinders so you are not comparing apples with basketballs. You can use the tools listed here in addition to the tools on the previous slide to check for errors or statistical anomalies You can quickly check logs for warnings or errors using the gawk command at the bottom of the slide or your own equivalent. Pause?
  5. A request is considered slow if it takes greater than 30 seconds to complete although that is tunable as of a recent commit. If you are seeing these in the logs you should check the health of the cluster and check the indicated hosts for problems that may be effecting performance. The historic ops command will show a collection of the worst performing recent operations. Perf dump will list performance counters and both of these can offer hints at where the issue may lie. Pause?
  6. As i mentioned previously, not everything that appears to be a hang actually is one. However, if you are seeing a true hang you are likely to see processes in prolonged D-state in ps output Strace may list output from threads that are running but none from threads that are stuck so you should use this in conjunction with ps thread output to verify all threads are making progress. gstack and gcore can help you look at the ceph stack traces to try and work out what they are doing and/or what they are waiting on.
  7. Sysrq invoked with the &amp;apos;t&amp;apos; trigger outputs a stack trace for all threads executing in kernel space to syslog. Execute twice with a delay in between to verify the threads in question are in “long term” d-state since it can be a transient state and threads waiting on resources can be in d-state frequently for short periods during normal operation. The line in blue is a real-world example and here we see we may have a problem with the XFS file system or storage layer below it. Vmcores are beyond the scope of this discussion and may require assistance from your support organisation unless you have those skills in-house.
  8. What appears to be a hung process may actually be a process spinning on the CPU in a tight loop, and we need to try to work out where in the code this is happening. The poor man's profiler dumps out stacks and can provide statistics on how many identical threads are seen, etc. Look for threads that aren't “waiting” (unless that's the problem of course, which may be the case when dealing with a lock contention issue). The stack traces need to be interpreted and this can require a decent understanding of the issue, but they are definitely worth looking at as they can provide excellent context. With gdb scripts or systemtap probes you can gather considerable data on the state of the running program at regular intervals.
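  A rough poor man's profiler can be put together with nothing more than gdb in batch mode; for example (again, 12345 is a placeholder PID, and five samples two seconds apart is an arbitrary choice):
    # for i in $(seq 1 5); do gdb -batch -ex 'set pagination 0' -ex 'thread apply all bt' -p 12345 > /tmp/stacks.$i; sleep 2; done
  Comparing the samples shows which threads are stuck on the same frame and which are making progress.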
  9. Dead-lock or live-lock issues can be very tricky to debug, so you may want to refer these up the line of support. These are usually lock synchronisation issues where one or more threads can't make progress. Pause?
  10. When Ceph detects a crash or a failed assert it will try to dump as much information as it can about the issue to the log of the process. The note about objdump should always be logged, so searching for "objdump" should find crash dumps in the logs quickly. Here we've hit the “suicide timeout” because a thread has not been able to make progress. We can see this is an assert. The hash in green represents the git commit this Ceph version was built from and the version number is shown there in blue. You can also see this information by running “ceph -v”. We can also see that the assert occurred on line 79 of HeartbeatMap.cc.
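  For example:
    # grep -l objdump /var/log/ceph/*.log    # which log files contain a crash dump?
    # ceph -v                                # prints the version number and the git sha1 it was built from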
  11. Here we can see the line of code where the assert was called. The code logic has checked the suicide timeout and determined it has been exceeded; this is considered a fatal problem and the process cannot continue, so we programmatically terminate it. There are many thousands of assert calls like this in the code base. They protect the process against corruption and unexpected values which may cause further corruption, and they should not be seen under normal operating conditions.
  12. Fatal signals are usually an indication of a programming error, although there can definitely be other explanations. Ceph installs a signal handler to capture these so it can do a log dump similar to that for a failed assert. These are usually a memory accounting or memory access error, although they could be indicative of other memory problems, perhaps memory exhaustion or even problems with the memory hardware itself. I would say these can go straight to a bug report if one doesn't already exist.
  13. This is an example where a SIGABRT has been sent to the process as indicated by the top red line. The function in red in frame nine is the most interesting as it is the last Ceph function to execute before the program entered the glibc abort code which is what is executing in frames eight through one. The hex value here in green is the offset into the function for the current instruction and the blue value is the return address for the frame.
  14. There are some tools we can use to get more information on the crash based on the information we extracted from the log crash dump in the previous slide. eu-addr2line gives the source code line the memory address points to. Depending on inlining and the amount of assembly interpretation required, objdump output can be tricky to interpret, and I find the gdb output provides the best information in most cases, as it will dump out the entire function surrounding the memory address with source code, provided you have the debuginfo package for Ceph installed, or you are using an unstripped build. Debuginfo packages should be available for all Ceph binaries and most are in the ceph-debuginfo package, at least on rpm based distros. Pause?
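  As a sketch, with the binary path and address taken from your own crash dump (the address shown here is purely illustrative):
    # eu-addr2line -e /usr/bin/ceph-osd 0x84a4b5       # map the address to a source file and line
    # objdump -rdS /usr/bin/ceph-osd > /tmp/osd.dis    # full disassembly with interleaved source
    # gdb /usr/bin/ceph-osd
    (gdb) list *0x84a4b5                               # source surrounding the address
    (gdb) disassemble 0x84a4b5                         # the whole function containing it
    # yum install ceph-debuginfo                       # on rpm based distros; needed for source and line info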
  15. In the case of unexpected behaviour it is important to identify what the expected behaviour was and what actual behaviour occurred. Always try to get a timestamp of when the problematic behaviour occurred as it will make it a lot easier to try to trace the issue in the logs. Start at the user end where the error or behaviour is seen and work back towards the ceph cluster tracing the process in each log as you go. Pause?
  16. These are the debug logging options for OSDs and their recommended values, with the top two or three being the first to try as they are likely to show any issues. Enabling too many of these options is not recommended as it can flood the logs with data, making them difficult to interpret, so you really don't want to turn all of these on at once. If there is an indication a specific area is suspect then enabling that option may be warranted. Pause.
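  As a sketch, the first few would typically go into the [osd] section of ceph.conf along these lines (the levels shown are the commonly suggested ones, not a definitive list):
    [osd]
        debug osd = 20
        debug ms = 1
        debug filestore = 20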
  17. The same goes for the MONs, and once again the top three are probably all you'll need. Pause
  18. These are the debug options specific to the RADOS gateway.
  19. And these are the debug options for the MDSs. Pause
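  By way of illustration, the equivalent sections follow the same pattern (the rgw section name is instance specific, so treat it as a placeholder, and only a subset of the options on the slides is shown):
    [mon]
        debug mon = 20
        debug paxos = 20
        debug ms = 1
    [client.radosgw.gateway]
        debug rgw = 20
        debug ms = 1
    [mds]
        debug mds = 20
        debug ms = 1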
  20. For client debugging we can specify a log file and set permissions on it so the client process can write to it. Instead of changing the mode we could change the owner and group to "nova", for example, if we were trying to debug a nova process that is using rbd to access the Ceph cluster. I generally find it easier to just set its permissions to 777, but your mileage may vary. Pause.
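  For instance, assuming the nova case mentioned above (path, debug options and ownership are illustrative):
    [client]
        log file = /var/log/ceph/client.log
        debug rbd = 20
        debug rados = 20

    # touch /var/log/ceph/client.log
    # chown nova:nova /var/log/ceph/client.log    # or, quick and dirty: chmod 777 /var/log/ceph/client.log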
  21. OpenStack issues can sometimes be difficult to debug due to the number of separate processes and the amount of logs involved. Sometimes the actual error does not get passed up to the higher level intact, so the error you end up with is not really representative of the issue. A timestamp is pretty much essential in these situations so you can follow the entire transaction through the logs from end to end. You may need to turn up logging verbosity for all processes involved, or each in turn, in order to get to the bottom of the issue. You may find the problem is in OpenStack code rather than it being a Ceph issue.
  22. The krbd kernel module logs to dmesg and syslog so it should hopefully be obvious if you are seeing an issue with it. If not, this module may require instrumenting with systemtap in order to establish what the issue is.
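  For example:
    # dmesg | grep -i rbd
    # grep -i rbd /var/log/messages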
  23. The debugging options shown above need to be put into the local copy of ceph.conf and the daemons restarted for them to take effect. In the case of rbd clients this is generally not a problem since client processes are generally short lived. In order to change the debug logging options immediately, without requiring a restart, you can use the "ceph tell" command. This invocation would enable debug logging for all OSDs in the cluster. You can restrict it to individual OSDs if you want by specifying their IDs rather than using a wildcard. The second parameter in the command to reset debug logging to the default is the in-memory log level. Ceph keeps a copy of the most recent log entries at the specified verbosity in memory, and this gets dumped into the log in the event of a crash. Pause?
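  As a sketch (0/5 is the usual default for these two options, meaning level 0 to the log file and level 5 to the in-memory log):
    # ceph tell osd.* injectargs '--debug_osd 20 --debug_ms 1'       # raise verbosity on every OSD
    # ceph tell osd.* injectargs '--debug_osd 0/5 --debug_ms 0/5'    # back to the defaults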
  24. To access the upstream source code we just clone the git repository and then we can check out the version of the source we need either by the sha1 hash or by the version tag. This gives you the source code as it was at the time of that release and should match the source of the binaries you are using (if you are using that version of course :) )
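  For example (the tag shown is just an illustration; use the tag or sha1 that matches your binaries):
    $ git clone https://github.com/ceph/ceph.git
    $ cd ceph
    $ git checkout v0.94.3    # or: git checkout <sha1 from the crash dump / ceph -v>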
  25. For the downstream source you can download the source package, install it and use the rpmbuild command to prep the source and apply the patches carried in the rpm. I'm afraid I'm fuzzy on how this works for deb based distros but I'm sure there are equivalent commands to accomplish the same end. Has anyone done this? Is it pretty similar? Can you explain it to me? :) Maybe later ;) Once you have the source you can grep for error messages, inspect functions of interest and generally browse the source. I find vim with the gtag plugin pretty good for browsing the Ceph source code and jumping straight to functions, macros, structs and classes. It allows you to navigate the code base and jump between files quickly and efficiently.
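  On an rpm based distro that might look something like this (the package version is illustrative, and yumdownloader comes from yum-utils):
    $ yumdownloader --source ceph
    $ rpm -ivh ceph-0.94.3-*.src.rpm
    $ rpmbuild -bp ~/rpmbuild/SPECS/ceph.spec    # prep the source tree and apply the patches, no build (add --nodeps if build dependencies are missing)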
  26. I hope I've given you some ideas for looking into Ceph issues as they arise, but it is important to realise you are not alone and there are various sources of help available. Don't be afraid to give a shout out on the mailing list or the IRC channel. You can also check the Ceph bug tracker and Bugzilla to see if your problem is a known issue. There are also some excellent troubleshooting docs under the Ceph storage cluster section on docs.ceph.com, as part of the comprehensive documentation available there. ... and of course if you have a Red Hat account you can check out the knowledgebase and talk to support directly about your issue.
  27. Welcome to the Ceph community. We think you're going to enjoy your stay :D Thank you for your time and enjoy the rest of the day.