2. What sort of trouble?
Identify your problem domain
Ceph is a mature, resilient and robust piece of software but, when things do go
wrong, empower yourself to identify these specific areas and analyse them using
common Linux, and Ceph-specific, tooling.
● Performance
● “Hang”
● Crash
● Unexpected or undesirable behaviour
3. Performance
Establish a baseline and re-test regularly
● rados bench
● ceph tell osd.N bench
● fio – rbd ioengine
● fio – libaio ioengine
● pblio - https://github.com/pblcache/pblcache/wiki/Pblio
● netperf – test all network segments
● dd
● pcp, sysstat, collectl, insert favourite tool here...
● The Ceph Benchmarking Tool - https://github.com/ceph/cbt
● Be mindful of the cache and its effects
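A minimal baseline sketch using two of the tools above (the pool name “testpool” is a placeholder; the osd bench arguments are total bytes followed by bytes per write):
# rados bench -p testpool 30 write --no-cleanup
# rados bench -p testpool 30 seq
# ceph tell osd.0 bench 1073741824 4194304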
4. Performance
Specifically, poor performance
Zero in on the problem area by identifying if it is specific to a particular host, or hosts,
or if a particular sub-domain is implicated.
● HEALTH_OK?
● Re-use the tools mentioned in the previous slide as well as host-specific tools
● ss, netstat and friends
● tcpdump
● iostat
● top
● pcp, sar, collectl
● free, vmstat
● Increase ceph logging verbosity
● $ gawk '/ERR/||/WRN/' /var/log/ceph/*log
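A minimal triage sequence using a few of the tools above (intervals illustrative):
# ceph health detail
# iostat -xm 5
# sar -n DEV 5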
5. Performance
Slow requests
When Ceph detects a request that is taking too long (the threshold is tunable) it will issue a warning.
{date} {osd.num} [WRN] slow request 30.005692 seconds old, received at {date-
time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write
0~4194304] 0.69848840) v4 currently waiting for subops from [610]
● HEALTH_OK?
● Check performance statistics on implicated hosts
● Turn up debugging on the implicated OSDs
• # ceph tell osd.N injectargs '--debug_osd 20 --debug_ms 1'
● Gather information about slow ops
• # ceph --admin-daemon /var/run/ceph/ceph-osd.N.asok dump_historic_ops
• # ceph --admin-daemon /var/run/ceph/ceph-osd.N.asok perf dump
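The threshold for these warnings is controlled by the osd_op_complaint_time option (30 seconds by default); a sketch of lowering it temporarily to surface marginal requests:
# ceph tell osd.* injectargs '--osd_op_complaint_time 10'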
6. Hang
Is it really a hang?
Sometimes situations described as a “hang” turn out to be something different such
as code stuck in a tight loop, a dead-lock, firewall problems, etc.
● Use strace to check if the process is still making progress (may miss hung threads)
● Check for a high load average and/or high %iowait on the CPUs
● Use ps to check for ceph processes in D-state (uninterruptible sleep)
● Use ps to find the ceph threads that are sleeping and what function they are sleeping in
• # ps axHo stat,tid,ppid,comm,wchan
● Check syslog and dmesg for “hung_task_timeout” warnings
● Use gstack or gcore to figure out where we are in the ceph code and what subsystems in the kernel we are exercising
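A quick filter for D-state threads, building on the ps invocation above:
# ps axHo stat,tid,ppid,comm,wchan | awk '$1 ~ /^D/'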
7. Hang
Is it really a hang?
Note that if everything points to uninterruptible threads in kernel space this is a kernel
problem but it obviously still has the potential to severely degrade ceph performance
and needs to be identified and fixed.
● Sysrq to dump kernel thread stacks. Dumps to syslog; search for “ D “
• # echo 1 > /proc/sys/kernel/sysrq
• # echo 't' > /proc/sysrq-trigger
• # sleep 20
• # echo 't' > /proc/sysrq-trigger
• # echo 0 > /proc/sys/kernel/sysrq
xfssyncd/dm-2 D 0000000000000011 0 3207 2 0x00000080
● sysrq data may implicate a certain subsystem or help to identify a known issue or confirm suspicions
● May require a vmcore be collected and analysed
8. Hang
Is it really a hang?
What at first appears to be a hang may in fact be a thread, or threads, caught in a
tight loop due to some logic condition and failing to make progress. To the user that
process seems “hung” but it is actually running. We need to identify where the
process is spending the bulk of its time.
● Look for high CPU usage of Ceph processes
● Check strace and/or ltrace output for hints at what the process may be doing
● Employ the “Poor Man's Profiler” technique - http://poormansprofiler.org/
• # for x in `seq 1 5`; do for pid in `pidof ceph-mon ceph-osd`; do gstack $pid; echo; done; done > /tmp/ceph-stacks
• This can potentially generate a lot of data so you may want to target only a single process, the one(s) with high CPU utilisation
● Visually inspect the relevant source code to work out why it might not make progress
● More advanced techniques such as scripting gdb or systemtap probes
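To aggregate the collected stacks, the poormansprofiler.org one-liner can be adapted along these lines (a sketch assuming a single ceph-osd on the host):
# gdb -ex 'set pagination 0' -ex 'thread apply all bt' --batch -p $(pidof ceph-osd) | \
awk 'BEGIN { s = "" } /^Thread/ { print s; s = "" } /^#/ { s = (s == "") ? $4 : s "," $4 } END { print s }' | \
sort | uniq -c | sort -rn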
9. Hang
Is it really a hang?
Dead-lock or live-lock.
● gcore and/or gstack
● Visually inspect relevant source code
● Might need some help with this one
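A sketch for capturing state to inspect lock waits (paths illustrative, again assuming a single ceph-osd on the host):
# gcore -o /tmp/ceph-osd $(pidof ceph-osd)
# gdb /usr/bin/ceph-osd /tmp/ceph-osd.$(pidof ceph-osd)
(gdb) thread apply all bt
Look for several threads parked in pthread_mutex_lock() or futex wait and work out which thread holds the lock they are queued on.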
10. Crash
Where did ceph go?
If ceph crashes it will attempt to log details of the crash. Code in handle_fatal_signal()
and __ceph_assert_fail() will try to dump the stack as well as relevant information and
a debug log of recent events. Search the logs for “objdump”.
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
ceph version 0.80.8-84-gb5a67f0 (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047)
1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)
+0x2a9) [0x9acc49]
2: (ceph::HeartbeatMap::is_healthy()+0xb6) [0x9ad4b6]
3: (OSD::_is_healthy()+0x21) [0x5fde61]
4: (OSD::tick()+0x498) [0x64d978]
...
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.
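A quick way to locate crash dumps across the logs:
# grep -l objdump /var/log/ceph/*.log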
11. Crash
asserts
The ceph code base includes thousands of asserts. An assert aborts the program if
the condition it tests evaluates to false.
● Conditions that are considered fatal
● Memory corruption
● Result of on-disk corruption
● Intentional aborts
75 was = h->suicide_timeout.read();
76 if (was && was < now) {
77 ldout(m_cct, 1) << who << " '" << h->name << "'"
78 << " had suicide timed out after " << h->suicide_grace << dendl;
79 assert(0 == "hit suicide timeout");
80 }
12. Crash
Fatal signals
Indicate a fatal error such as a segmentation fault, bus error or abort. Search for
“objdump” or “*** Caught signal”
● Indicative of a programming error
● Usually a memory accounting/access error
● Check for existing bugs with the same signature or open a new tracker or Bugzilla
13. Crash
Example
0> 2015-09-24 04:14:49.345105 7fea04f79700 -1 *** Caught signal (Aborted) **
in thread 7fea04f79700
ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
1: /usr/bin/ceph-osd() [0x9f63f2]
2: (()+0xf130) [0x7fea14462130]
3: (gsignal()+0x37) [0x7fea12e7c5d7]
4: (abort()+0x148) [0x7fea12e7dcc8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fea137809b5]
6: (()+0x5e926) [0x7fea1377e926]
7: (()+0x5e953) [0x7fea1377e953]
8: (()+0x5eb73) [0x7fea1377eb73]
9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x137) [0xb697b7]
10: (OSDMap::decode_classic(ceph::buffer::list::iterator&)+0x605) [0xab1a35]
...
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
14. Crash
Extracting more information
Crashes can be tricky to diagnose completely but, if you are up for the challenge or
would just like to gather more information, eu-addr2line, gdb or objdump may provide
further insight.
# eu-addr2line -e /usr/bin/ceph-osd 0xb697b7
include/buffer.h:224
# objdump -rdS /usr/bin/ceph-osd
# gdb `which ceph-osd`
(gdb) disass /m 0xb697b7
Dump of assembler code for function ceph::buffer::list::iterator::copy(unsigned int,
char*):
...
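gdb needs debug symbols to produce output like the above; on rpm-based distros they are typically installed with something like the following (package names may vary):
# debuginfo-install ceph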
15. Unexpected or undesirable behaviour
That doesn't seem right?
Sometimes ceph may not do what you expect or want
● Identify the expected or desirable behaviour
● Figure out if this is by design or the result of a corner case or error
● What behaviour do you see? (when I do X, I see Y)
● Timestamp an instance of the error/behaviour
● Increase debugging and trace the transaction through the logs
● If this is OpenStack behaviour, trace it via the Nova, Glance, Cinder and rbd logs
● If this is RADOS Gateway behaviour, trace the httpd logs and match these with the rgw and Ceph logs
● Start at the user end and work back towards ceph
● Timestamps help a lot!
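A sketch of tracing by timestamp once you have one (paths illustrative; timestamp borrowed from the example on slide 13):
# grep '2015-09-24 04:14' /var/log/nova/*.log /var/log/cinder/*.log /var/log/ceph/*.log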
20. Debug logging
Client
[client] # Section, can also be global since it is inherited
debug ms = 1
debug rbd = 20
debug objectcacher = 20
debug objecter = 20
log file = /var/log/ceph/rbd.log
# touch /var/log/ceph/rbd.log
# chmod 777 /var/log/ceph/rbd.log
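With that section in place any librbd client should start writing to the log; a quick smoke test (assuming a pool named rbd):
# rbd ls rbd
# tail /var/log/ceph/rbd.log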
21. Debug logging
Openstack
Turn up logging verbosity for whichever is relevant: Nova, Glance, Cinder, rbd, or all of
the above
● Trace the error/behaviour down through the logs from high level (Nova) to low level (rbd and the ceph cluster)
● Try running relevant commands from a lower level
● Make sure it isn't an OpenStack problem
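For example, to test below OpenStack directly against the cluster (the pool names and client IDs here follow common OpenStack conventions and are assumptions):
# rbd --id cinder -p volumes ls
# rbd --id glance -p images ls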
24. Source code
Use the source Luke!
Upstream source
# ceph -v
ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
● git clone https://github.com/ceph/ceph.git
● git checkout e4bfad3a3c51054df7e537a724c8d0bf9be972ff
● git checkout v0.94.1
25. Source code
Use the source Luke!
Downstream source
● yumdownloader --archlist=src --enablerepo=rhel-7-server-rhceph-1.3-*source-rpms ceph
● rpm -ivh ceph-0.94.1-19.el7cp.src.rpm
● rpmbuild -bp --nodeps rpmbuild/SPECS/ceph.spec
● cd rpmbuild/BUILD/ceph-0.94.1/
● Ubuntu equivalent commands (see the sketch below)
● Use your favourite editor (yes, of course it's “vi”) to browse the source files
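For the “Ubuntu equivalent commands” bullet, a likely equivalent is the following, offered as an assumption rather than something covered here (it requires deb-src entries in your APT sources):
$ apt-get source ceph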
26. Resources
Additional sources of help
If all else fails (or even as a first resort) seek help.
● Email:
• ceph-users@ceph.com
● IRC:
• irc.oftc.net #ceph
● Known issues:
• http://tracker.ceph.com/
• https://bugzilla.redhat.com/
● Documentation:
• http://docs.ceph.com/docs/master/
● Red Hat support:
• https://access.redhat.com
• https://access.redhat.com/support
Hello,
My name is Brad Hubbard. I work for Red Hat Support Delivery supporting our distributed storage products, Red Hat Gluster Storage and Red Hat Ceph Storage. I'd like to talk to you today about troubleshooting Ceph problems and I hope everyone can take something away from this talk that can help them in working with Ceph in the future.
Whilst Ceph is a robust piece of software and is designed to be autonomous, self-healing and intelligent, like any large software application it can encounter problems and can be mysterious to the uninitiated. Hopefully I can provide some guidance on how to handle problems when they do arise. I've broken the issues into four main areas.
Performance
"Hangs", in inverted commas because quite often a hang is actually not a hang, but a process still running that is not making progress.
Crashes
and everything else
Pause?
The first area I&apos;m going to talk about is performance.
It makes sense to establish baselines for performance and re-test periodically to detect any performance degradation.
The tools I&apos;ve listed on this slide can help with that. They are well known and, for the most part, well documented, however it is worth mentioning a couple of them in more detail.
Ceph tell osd bench takes arguments for bytes per write and total bytes and does a simple benchmark.
The fio tool has a built-in rbd engine which uses librbd to talk to the Ceph cluster directly.
PBLIO was new to me when I heard about it a few days ago. It stresses storage systems with an online transaction processing enterprise workload, built on the open source NetApp workload generator, and seems well suited to testing enterprise storage. I have not used it personally but I've heard good things recently.
CBT is in development and is a testing harness for ceph which can automate some of the tasks and makes use of fio, rados bench, etc. as well as being able to collect statistics from tools such as blktrace, perf, valgrind and others at the same time.
If you do experience degraded performance the first thing to do is make sure the cluster's health is OK and that you are firing on all cylinders so you are not comparing apples with basketballs.
You can use the tools listed here in addition to the tools on the previous slide to check for errors or statistical anomalies
You can quickly check logs for warnings or errors using the gawk command at the bottom of the slide or your own equivalent.
Pause?
A request is considered slow if it takes more than 30 seconds to complete, although that is tunable as of a recent commit. If you are seeing these in the logs you should check the health of the cluster and check the indicated hosts for problems that may be affecting performance.
The historic ops command will show a collection of the worst performing recent operations.
Perf dump will list performance counters and both of these can offer hints at where the issue may lie.
Pause?
As I mentioned previously, not everything that appears to be a hang actually is one.
However, if you are seeing a true hang you are likely to see processes in prolonged D-state in ps output
Strace may list output from threads that are running but none from threads that are stuck so you should use this in conjunction with ps thread output to verify all threads are making progress.
gstack and gcore can help you look at the ceph stack traces to try and work out what they are doing and/or what they are waiting on.
Sysrq invoked with the 't' trigger outputs a stack trace for all threads executing in kernel space to syslog. Execute twice with a delay in between to verify the threads in question are in “long term” D-state since it can be a transient state and threads waiting on resources can be in D-state frequently for short periods during normal operation.
The line in blue is a real-world example and here we see we may have a problem with the XFS file system or storage layer below it.
Vmcores are beyond the scope of this discussion and may require assistance from your support organisation unless you have those skills in-house.
What appears to be a hung process may actually be a process spinning on the CPU in a tight loop and we need to try to work out where in the code this is happening.
Poor man's profiler dumps out stacks and can provide statistics on how many of each identical thread are seen, etc. Look for threads that aren't “waiting” (unless that's the problem of course, which may be the case when dealing with a lock contention issue). The stack traces need to be interpreted and this can require a decent understanding of the issue, but they are definitely worth looking at as they can provide excellent context.
With gdb scripts or systemtap probes you can gather considerable data on the state of the running program at regular intervals.
For dead-lock or live-lock issues you may want to refer them up the line of support as they can be very tricky to debug. These are usually lock synchronisation issues where one or more threads can't make progress.
Pause?
When Ceph detects a crash or a failed assert it will try to dump as much information as it can about the issue to the log of the process.
The note about objdump should always be logged, so searching for “objdump” should find crash dumps in the logs quickly.
Here we've hit the “suicide timeout” because a thread has not been able to make progress.
We can see this is an assert.
The hash in green represents the git commit this ceph version was built from and the version number is shown there in blue. You can also see this information by running “ceph -v”.
We can also see that the assert occurred on line 79 of HeartbeatMap.cc.
Here we can see the line of code where the assert was called. The code logic has checked the suicide timeout and determined it has been exceeded; this is considered to be a fatal problem from which the process cannot continue, so we programmatically terminate the process.
There are many thousands of examples of assert calls like this in the code base.
They protect the process against corruption and unexpected values which may cause further corruption, and they should not be seen under normal operating conditions.
Fatal signals are usually an indication of a programming error although there can definitely be other explanations. Ceph installs a signal handler to capture these so it can do a log dump similar to a failed assert.
These are usually a memory accounting or a memory access error although they could be indicative of other memory problems, perhaps memory exhaustion or even problems with the memory hardware itself.
I would say these can go straight to a bug report if one doesn't already exist.
This is an example where a SIGABRT has been sent to the process as indicated by the top red line.
The function in red in frame nine is the most interesting as it is the last Ceph function to execute before the program entered the glibc abort code which is what is executing in frames eight through one. The hex value here in green is the offset into the function for the current instruction and the blue value is the return address for the frame.
There are some tools we can use to get more information on the crash based on the information we extracted from the log crash dump in the previous slide.
Eu-addr2line gives the source code line the memory address points to.
Depending on inlining and the amount of assembly interpretation required, objdump output can be tricky to interpret. I find the gdb output provides the best information in most cases, as it will dump out the entire function surrounding the memory address with source code, provided you have the debuginfo package for ceph loaded, or you are using an unstripped build. Debuginfo packages should be available for all ceph binaries and most are in the ceph-debuginfo package, at least on rpm based distros.
Pause?
In the case of unexpected behaviour it is important to identify what the expected behaviour was and what actual behaviour occurred. Always try to get a timestamp of when the problematic behaviour occurred as it will make it a lot easier to try to trace the issue in the logs. Start at the user end where the error or behaviour is seen and work back towards the ceph cluster tracing the process in each log as you go.
Pause?
These are the debug logging options for OSDs and their recommended values, with the top two or three being the first to try as they are likely to show any issues. Enabling too many of these options is not recommended as it can flood the logs with data, making them difficult to interpret, so you really don't want to turn all of these on at once. If there is an indication a specific area is suspect then enabling that option may be warranted. Pause.
Same for the MONs, and once again the top three are probably all you'll need.
Pause
These are the debug options specific to the RADOS gateway
and these are the debug options for MDSs
Pause
For client debugging we can specify a log file and set permissions on it so the client process can write to it. So instead of changing the mode we could change the owner and group to “nova”, for example, if we were trying to debug a nova process that is using rbd to access the ceph cluster. I generally find it easier to just set its permissions to 777 but your mileage may vary.
Pause.
OpenStack issues can be difficult to debug sometimes due to the number of separate processes and the amount of logs involved. Sometimes the actual error does not get passed up to the higher level intact, so the error you end up with is not really representative of the issue. A timestamp is pretty much essential in these situations so you can follow the entire transaction through the logs from end to end. You may need to turn up logging verbosity for all processes involved, or each in turn, in order to get to the bottom of the issue. You may find the problem is in OpenStack code rather than it being a Ceph issue.
The krbd kernel module logs to dmesg and syslog so it should hopefully be obvious if you are seeing an issue with it. If not, this module may require instrumenting with systemtap in order to establish what the issue is.
The debugging options shown above need to be put into the local copy of ceph.conf and the daemons restarted for them to take effect. In the case of rbd clients this is generally not a problem since client processes are generally short lived.
In order to change the debug logging options immediately, without requiring a restart, you can use the "ceph tell" command. This invocation would enable debug logging for all OSDs in the cluster. You can restrict it to individuals if you want by only specifying their IDs rather than using a wildcard. The second parameter in the command to reset debug logging to the default is the in-memory log level. Ceph creates a copy of the most recent log entries at the specified verbosity in memory and this gets dumped into the log in the event of a crash.
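A sketch of that invocation, and its reverse (the reset values are assumed defaults; check your own configuration):
# ceph tell osd.* injectargs '--debug_osd 20 --debug_ms 1'
# ceph tell osd.* injectargs '--debug_osd 1/5 --debug_ms 0/5'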
Pause?
To access the upstream source code we just clone the git repository and then we can check out the version of the source we need either by the sha1 hash or by the version tag. This gives you the source code as it was at the time of that release and should match the source of the binaries you are using (if you are using that version of course :) )
For the downstream source you can download the source package, install it and use the rpmbuild command to prep the source and apply the necessary patches in the rpm.
I'm afraid I'm fuzzy on how this works for deb based distros but I'm sure there are equivalent commands on those to accomplish the same end.
Has anyone done this? Is it pretty similar? Can you explain it to me? :) Maybe later ;)
Once you have the source you can grep for error messages, inspect functions of interest and generally browse the source.
I find vim with the gtags plugin pretty good for browsing the Ceph source code and jumping straight to functions, macros, structs, and classes. It allows you to navigate the code base and jump between files quickly and efficiently.
I hope I've given you some ideas for looking into Ceph issues as they arise but it is important to realise you are not alone and there are various sources of help available. Don't be afraid to give a shout out on the mailing list or the IRC channel.
You can also check the Ceph bug tracker and Bugzilla to see if your problem is a known issue.
There are also some excellent troubleshooting docs under the ceph storage cluster section on docs.ceph.com as part of the comprehensive documentation available there.
... and of course if you have a Red Hat account you can check out the knowledgebase and talk to support directly about your issue.
Welcome to the Ceph community. We think you're going to enjoy your stay :D
Thank you for your time and enjoy the rest of the day.