
Troubleshooting common oslo.messaging and RabbitMQ issues

This talk focuses on troubleshooting of common oslo.messaging and RabbitMQ issues in OpenStack environments. Co-presented at the OpenStack Summit Austin in April 2016.

  1. Identifying (and fixing) oslo.messaging & RabbitMQ issues
     Michael Klishin, Pivotal
     Dmitry Mescheryakov, Mirantis
  2. What is oslo.messaging?
     ● Library for
       ○ building RPC clients/servers
       ○ emitting/handling notifications
     ● Supports several backends:
       ○ RabbitMQ
         ■ based on Kombu - the oldest and best known (the one we discuss here)
         ■ based on Pika - a recent addition
       ○ AMQP 1.0
  3. Spawning a VM in Nova
     [diagram: a client sends an HTTP request to one of several nova-api instances; nova-api, nova-conductor, nova-scheduler, and nova-compute instances then communicate over RPC]
  4. Examples
     Internal:
     ● nova-compute sends a report to nova-conductor every minute
     ● nova-conductor sends a command to spawn a VM to nova-compute
     ● neutron-l3-agent requests the router list from neutron-server
     ● …
     External:
     ● Every OpenStack service sends notifications to Ceilometer
  5. Where is RabbitMQ in this picture?
     [diagram: an RPC request flows from nova-conductor through the RabbitMQ queue compute.node-1.domain.tld to nova-compute; the response flows back through the queue reply_b6686f7be58b4773a2e0f5475368d19a]
  6. Spotting oslo.messaging logs
     2016-04-15 11:16:57.239 16181 DEBUG nova.service [req-d83ae554-7ef5-4299-82ce-3f70b00b6490 - - - - -] Creating RPC server for service scheduler start /usr/lib/python2.7/dist-packages/nova/service.py:218
     2016-04-15 11:16:57.258 16181 DEBUG oslo.messaging._drivers.pool [req-d83ae554-7ef5-4299-82ce-3f70b00b6490 - - - - -] Pool creating new connection create /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/pool.py:109
  7. My favorite oslo.messaging exception
     ...
     File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 420, in _send
       result = self._waiter.wait(msg_id, timeout)
     File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 318, in wait
       message = self.waiters.get(msg_id, timeout=timeout)
     File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 223, in get
       'to message ID %s' % msg_id)
     MessagingTimeout: Timed out waiting for a reply to message ID 9e4a677887134a0cbc134649cd46d1ce
  8. oslo.messaging operations
     ● Cast - fire an RPC request and forget about it
     ● Notify - the same, only the format differs
     ● Call - send an RPC request and wait for the reply
     Call raises a MessagingTimeout exception when a reply is not received within a certain amount of time.
  9. Making a Call
     1. Client -> request -> RabbitMQ
     2. RabbitMQ -> request -> Server
     3. Server processes the request and produces the response
     4. Server -> response -> RabbitMQ
     5. RabbitMQ -> response -> Client
     If the process gets stuck at any step from 2 to 5, the client gets a MessagingTimeout exception.
  10. Debug shows the truth (examples from Mitaka)
      L3 Agent log:
      CALL msg_id: ae63b165611f439098f1461f906270de exchange: neutron topic: q-reports-plugin
      received reply msg_id: ae63b165611f439098f1461f906270de
      Neutron Server log:
      received message msg_id: ae63b165611f439098f1461f906270de reply to: reply_df2405440ffb40969a2f52c769f72e30
      REPLY msg_id: ae63b165611f439098f1461f906270de reply queue: reply_df2405440ffb40969a2f52c769f72e30
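      With debug enabled, a call and its reply can be correlated by msg_id across service logs. A quick way to do this with standard tools (log paths are assumptions, adjust for your distribution):

          grep -r 'ae63b165611f439098f1461f906270de' /var/log/neutron/ /var/log/nova/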
  11. Enabling debug
      [DEFAULT]
      debug=true
      default_log_levels=...,oslo.messaging=DEBUG,...
  12. If you don't have debug enabled
      ● Examine the stack trace
      ● Find which operation failed
      ● Guess the destination service
      ● Try to find correlating log entries around the time the request was made
      File "/opt/stack/neutron/neutron/agent/dhcp/agent.py", line 571, in _report_state
        self.state_rpc.report_state(ctx, self.agent_state, self.use_call)
      File "/opt/stack/neutron/neutron/agent/rpc.py", line 86, in report_state
        return method(context, 'report_state', **kwargs)
      File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 158, in call
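      For example, if the trace above points at report_state, the destination is neutron-server, and entries around the failure time can be pulled from its log (path and timestamp are illustrative):

          grep '2016-04-15 11:16' /var/log/neutron/neutron-server.log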
  13. Diagnosing issues through RabbitMQ
      ● # rabbitmqctl list_queues consumers name
        0 consumers indicates that nobody listens to the queue
      ● # rabbitmqctl list_queues messages consumers name
        If a queue has consumers but messages still accumulate in it, the corresponding service cannot process messages in time, is stuck in a deadlock, or the cluster is partitioned
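      A one-liner for spotting queues with a backlog but no consumers (a sketch using standard awk; assumes queue names contain no spaces):

          rabbitmqctl list_queues name messages consumers | awk '$2 > 0 && $3 == 0 {print $1}'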
  14. Checking RabbitMQ cluster for integrity
      # rabbitmqctl cluster_status
      Check that its output contains all the nodes in the cluster. You might find that your cluster is partitioned. Partitioning is a good reason for some messages to get stuck in queues.
  15. How to fix such issues
      ● For RabbitMQ issues, including partitioning, see the RabbitMQ docs
      ● Restarting the affected services helps in most cases
      ● Force-close connections using `rabbitmqctl` or the HTTP API
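      For example, a misbehaving connection can be force-closed like this (the pid comes from `list_connections`; the HTTP variant assumes the management plugin with default port and credentials, and the connection name must be URL-encoded):

          rabbitmqctl list_connections pid peer_host peer_port
          rabbitmqctl close_connection "<pid>" "closed by operator"
          curl -u guest:guest -X DELETE "http://localhost:15672/api/connections/<connection-name>"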
  16. Never set amqp_auto_delete = true
      Use a queue expiration policy instead, with a TTL of at least 1 minute.
      Starting with Mitaka, all queues that used to be auto-delete by default were replaced with expiring ones.
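      A sketch of such a policy (vhost, policy name, and queue pattern are assumptions; `expires` is given in milliseconds, so 60000 is the 1-minute minimum):

          rabbitmqctl set_policy -p my-vhost --apply-to queues expiry 'reply_.*' '{"expires": 60000}'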
  17. Why not amqp_auto_delete?
      [diagram: nova-conductor sends a message to the queue compute.node-1.domain.tld declared with auto-delete = true; a network hiccup disconnects nova-compute, the auto-delete queue is deleted, and the in-flight message is lost]
  18. Queue mirroring is quite expensive
      Our testing shows a 2x drop in throughput on a 3-node cluster with the 'ha-mode: all' policy compared with non-mirrored queues.
      RPC can live without mirroring, but notifications might be too important to lose (e.g. if used for billing).
      In the latter case, enable mirroring for notification queues only (example in Fuel).
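      A sketch of mirroring notification queues only (the policy name and pattern are assumptions; match them to the notification topics actually in use):

          rabbitmqctl set_policy --apply-to queues ha-notif '^notifications\.' '{"ha-mode": "all"}'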
  19. Use different backends for RPC and Notifications (available starting from Mitaka)
      ● Different drivers
      ● Same driver, for example:
        ○ RPC messages go through one RabbitMQ cluster
        ○ Notification messages go through another RabbitMQ cluster
      The implementation is not documented.
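      A minimal config sketch of the same-driver variant, assuming the Mitaka-era oslo.messaging options (hostnames and credentials are placeholders):

          [DEFAULT]
          transport_url = rabbit://user:pass@rpc-cluster:5672/

          [oslo_messaging_notifications]
          transport_url = rabbit://user:pass@notification-cluster:5672/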
  20. Part 2
  21. Erlang VM process disappears
      ● Syslog, kern.log, /var/log/messages: grep for "killed process"
      ● "Cannot allocate 1117203264527168 bytes of memory (of type …)" - move to Erlang 17.5 or 18.3
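      A quick check for OOM killer activity (log locations vary by distribution):

          grep -i 'killed process' /var/log/kern.log /var/log/messages /var/log/syslog 2>/dev/null
          dmesg | grep -i 'out of memory'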
  22. RAM usage
      ● `rabbitmqctl status`
      ● `rabbitmqctl list_queues name messages memory consumers`
  23. Stats DB overload
      ● Connections, channels, queues, and nodes emit stats on a timer
      ● With a lot of those, the stats DB collector can fall behind
      ● `rabbitmqctl status` reports most RAM used by `mgmt_db`
      ● You can reset it: `rabbitmqctl eval 'exit(erlang:whereis(rabbit_mgmt_db), please_terminate).'`
      ● Resetting is a safe thing to do but may confuse your monitoring tools
      ● A new, better-parallelized event collector is coming in RabbitMQ 3.6.2
  24. RAM usage (continued)
      ● rabbitmq_top
      ● `rabbitmqctl list_connections | wc -l`
      ● `rabbitmqctl list_channels | wc -l`
      ● Reduce TCP buffer size: see the RabbitMQ Networking guide
      ● To enforce a per-connection channel limit, use `rabbit.channel_max`
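      A sketch of both knobs in classic /etc/rabbitmq/rabbitmq.config Erlang-term syntax (the buffer sizes and channel limit are illustrative, not recommendations):

          [
            {rabbit, [
              {tcp_listen_options, [{sndbuf, 32768}, {recvbuf, 32768}]},
              {channel_max, 64}
            ]}
          ].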
  25. Unresponsive nodes
      ● `rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'`
      ● Pivotal & Erlang Solutions contributed a few Mnesia deadlock fixes in Erlang/OTP 18.3.1 and 19.0
  26. TCP connections are rejected
      ● Ensure traffic on RabbitMQ ports is accepted by the firewall
      ● Ensure RabbitMQ listens on the correct network interfaces
      ● Check the open file handle limit (defaults on Linux are completely inadequate)
      ● TCP connection backlog size: rabbitmq.tcp_listen_options.backlog, net.core.somaxconn
      ● Consult RabbitMQ logs for authentication and authorization errors
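      A few of these checks in command form (values are illustrative; the fd limit is commonly raised in /etc/default/rabbitmq-server on Debian/Ubuntu):

          rabbitmqctl status | grep -A 4 file_descriptors
          sysctl -w net.core.somaxconn=4096
          # in /etc/default/rabbitmq-server:
          #   ulimit -n 65536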
  27. TLS connections fail
      ● Deserves a talk of its own
      ● See log files
      ● `openssl s_client` (`man 1 s_client`)
      ● `openssl s_server` (`man 1 s_server`)
      ● Ensure the peer CA certificate is trusted and the verification depth is sufficient
      ● Troubleshooting TLS guide on rabbitmq.com
      ● Run Erlang 17.5 or 18.3.1
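      For example, the TLS listener can be probed from a client machine with s_client (hostname, port, and CA path are placeholders):

          openssl s_client -connect rabbit.example.com:5671 -CAfile /etc/rabbitmq/ssl/ca.pem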
  28. Message payload inspection
      ● Message tracing: `rabbitmqctl trace_on -p my-vhost`, amq.rabbitmq.trace
      ● rabbitmq_tracing
      ● Tracing puts *very* high load on the system
      ● Wireshark (tcpdump, …)
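      A sketch of a tracing session with the rabbitmq_tracing plugin (the vhost is an example; given the load tracing adds, turn it off as soon as you are done):

          rabbitmq-plugins enable rabbitmq_tracing
          rabbitmqctl trace_on -p my-vhost
          # reproduce the problem, inspect messages via amq.rabbitmq.trace
          rabbitmqctl trace_off -p my-vhost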
  29. Higher than expected latency
      ● Wireshark (tcpdump, …)
      ● strace, DTrace, …
      ● Erlang VM scheduler-to-core binding (pinning)
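      For instance, traffic on the AMQP port can be captured for offline analysis in Wireshark, and scheduler binding can be requested via an Erlang VM flag (interface, port, and flag value are assumptions; verify the flag against your Erlang release):

          tcpdump -i eth0 -w /tmp/amqp.pcap 'port 5672'
          # RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+sbt db"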
  30. General remarks
      ● Guessing is not effective (or efficient)
      ● Use tools to gather more data
      ● Always consult log files
      ● Ask on rabbitmq-users
  31. Thank you
      @michaelklishin
      rabbitmq-users
