What we've learned from running thousands of production RabbitMQ clusters - Lovisa Johansson


Watch the full lecture on YouTube:

Since 2012 CloudAMQP has been running dedicated and shared RabbitMQ clusters for customers around the world, in seven different clouds. In this talk Lovisa addresses the most common misconceptions, misconfigurations and anti-patterns in RabbitMQ usage, and how they can be avoided. She explains how you can increase RabbitMQ reliability and performance, and she also covers common RabbitMQ use cases among CloudAMQP’s customers.

The first RabbitMQ Summit connected RabbitMQ users and developers from around the world in London on November 12, 2018. Learn what's happening in and around RabbitMQ, and how top companies utilize RabbitMQ to power their services.

RabbitMQ Summit was organized by:
- Erlang Solutions, offering world-leading RabbitMQ Consultancy, Support, Health Checks & Tuning solutions
- CloudAMQP, offering fully managed RabbitMQ clusters

RabbitMQ Summit 2018 was sponsored by the following companies.

Platinum sponsors:

Gold sponsors:

Silver sponsor:
Cogin Queue Explorer

Published in: Technology

  1. What we've learned from running thousands of production RabbitMQ clusters - Lovisa Johansson
  2. 3000 emails
  3. ● Unstable RabbitMQ version
     ● Unoptimized configuration for a specific use case
       ➢ High availability
       ➢ High performance
     ● Users (you?) are using RabbitMQ in a bad way
     ● Client libraries are using RabbitMQ in a bad way
     ● Things are not done in an optimal way
     ● Customer use cases
     ● Configuration mistakes
     ● Common mistakes
     Client side problems. Server side problems.
  4. What we've learned from running thousands of production RabbitMQ clusters
  5. Lovisa Johansson: Marketing Manager, Support Engineer, RabbitMQ Engineer. Umeå, Sweden
  6. Largest provider of managed RabbitMQ servers
     ● 23000 running instances
     ● 7 clouds
     ● 75 regions
     ● Headquarters: Stockholm, Sweden
  7. CONNECTIONS AND CHANNELS. Recommendation number 1: Don’t use too many connections or channels
     ● Keep the connection and channel count low
     ● Each connection uses about 100 KB of RAM
     ● Thousands of connections can be a heavy burden on a RabbitMQ server
     ● Channel and connection leaks are among the most common errors that we see
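To put the 100 KB figure into perspective, here is a minimal back-of-the-envelope sketch. The constant comes from the slide; the function name and the sample connection counts are only illustrative, and real memory use depends on the broker version and configuration.

```python
# Assumption taken from the slide: each connection costs roughly 100 KB of RAM.
RAM_PER_CONNECTION_KB = 100


def connection_ram_kb(connection_count: int) -> int:
    """Approximate broker RAM (in KB) held by open connections alone."""
    if connection_count < 0:
        raise ValueError("connection_count must not be negative")
    return connection_count * RAM_PER_CONNECTION_KB


if __name__ == "__main__":
    # Illustrative connection counts, not measurements.
    for count in (100, 5_000, 50_000):
        mib = connection_ram_kb(count) / 1024
        print(f"{count:>6} connections = about {mib:,.0f} MiB")
```

A leak that slowly grows the connection count moves a server along this curve without any increase in useful traffic.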
  8. CONNECTIONS AND CHANNELS. Recommendation number 2: Don’t open and close connections or channels repeatedly
     ● Use long-lived connections
     ● Don’t open a channel every time you publish
     ● AMQP connection: 7 TCP packets
     ● AMQP channel: 2 TCP packets
     ● AMQP publish: 1 TCP packet
     ● AMQP close channel: 2 TCP packets
     ● AMQP close connection: 2 TCP packets
     Total: 14-19 packets (+ acks)
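The packet counts above make the cost of connection churn easy to estimate. The sketch below uses the lower bound of the slide's total (14 packets, ignoring TCP acks) and compares opening and closing everything for each message against a single long-lived connection and channel. The function names are made up for this illustration.

```python
# Packet counts per AMQP operation, as listed on the slide (TCP acks ignored).
OPEN_CONNECTION = 7
OPEN_CHANNEL = 2
PUBLISH = 1
CLOSE_CHANNEL = 2
CLOSE_CONNECTION = 2


def packets_with_churn(messages: int) -> int:
    """Every message opens and closes its own connection and channel."""
    per_message = (OPEN_CONNECTION + OPEN_CHANNEL + PUBLISH
                   + CLOSE_CHANNEL + CLOSE_CONNECTION)
    return messages * per_message


def packets_long_lived(messages: int) -> int:
    """One connection and one channel are reused for all messages."""
    setup_and_teardown = (OPEN_CONNECTION + OPEN_CHANNEL
                          + CLOSE_CHANNEL + CLOSE_CONNECTION)
    return setup_and_teardown + messages * PUBLISH


if __name__ == "__main__":
    n = 1000
    print(f"churn: {packets_with_churn(n)} packets, "
          f"long-lived: {packets_long_lived(n)} packets")
```

For 1000 messages this gives 14000 packets with churn against 1013 with a reused connection, which is why long-lived connections matter so much for publishers.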
  9. AMQProxy
     ● Some clients can’t keep long-lived connections (looking at you, PHP)
     ● Avoid connection churn by using a proxy that pools connections and channels for reuse
     ● Our benchmarks show that the proxy increases publishing speed by an order of magnitude or more
  10. CONNECTIONS AND CHANNELS. Recommendation number 3: Separate connections for publishers and consumers
      ● Back pressure: RabbitMQ can apply back pressure on the TCP connection when the publisher is sending too many messages
      ● Flow control: you might not be able to consume if the connection is in flow control
  11. QUEUES. Recommendation number 4: Don't have too large queues
      ● Keep fewer than 10 000 messages in one queue
      ● Heavy load on RAM usage
        ○ In order to free up RAM, RabbitMQ starts paging out messages to disk
        ○ This blocks the queue from processing messages
      ● Time-consuming to restart a cluster
      ● Limit queue size with TTL or max-length
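As a minimal sketch of the last bullet: RabbitMQ accepts the optional queue arguments `x-max-length` (maximum number of messages) and `x-message-ttl` (per-message time to live in milliseconds) when a queue is declared. The helper below only builds that argument dict; its name is hypothetical, and the 60 second TTL in the usage line is an illustrative value, not a recommendation from the talk.

```python
from typing import Optional


def bounded_queue_arguments(max_length: Optional[int] = None,
                            message_ttl_ms: Optional[int] = None) -> dict:
    """Build optional queue arguments that keep a queue from growing unbounded."""
    arguments = {}
    if max_length is not None:
        if max_length < 0:
            raise ValueError("max_length must not be negative")
        arguments["x-max-length"] = max_length
    if message_ttl_ms is not None:
        if message_ttl_ms < 0:
            raise ValueError("message_ttl_ms must not be negative")
        arguments["x-message-ttl"] = message_ttl_ms
    return arguments


if __name__ == "__main__":
    # 10 000 messages follows the slide; the TTL is an assumed example value.
    print(bounded_queue_arguments(max_length=10_000, message_ttl_ms=60_000))
```

Most client libraries take such a dict when declaring a queue; the same limits can also be set through a policy instead of per-queue arguments.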
  12. QUEUES. Recommendation number 5: Enable lazy queues to get predictable performance
      ● Lazy queues were added in RabbitMQ 3.6
      ● Messages are written to disk immediately, which spreads the work out over time instead of risking a performance hit somewhere down the road
      ● More predictable and smooth performance curve
        ○ Messages are only loaded into memory when they are needed
      Enable lazy queues if...
      ● the publisher is sending many messages at once
      ● the consumers are not keeping up with the speed of the publishers all the time
      Ignore lazy queues if...
      ● you require high performance
      ● queues are always short
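One way to enable lazy mode in the RabbitMQ versions this talk covers is a policy whose definition sets `queue-mode` to `lazy`. The sketch below builds such a policy document in the shape used by the management HTTP API; the helper name and the `^bulk\.` queue-name pattern are assumptions made for the example.

```python
import json


def lazy_queue_policy(pattern: str) -> dict:
    """Policy document that switches all matching queues to lazy mode."""
    return {
        "pattern": pattern,                    # regex matched against queue names
        "definition": {"queue-mode": "lazy"},  # page messages to disk right away
        "apply-to": "queues",
    }


if __name__ == "__main__":
    # Hypothetical naming scheme: queues that receive bulk imports start with "bulk."
    print(json.dumps(lazy_queue_policy(r"^bulk\."), indent=2))
```

Using a policy rather than a declare-time argument means the setting can be changed later without redeclaring the queues.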
  13. QUEUES. Recommendation number 6: Don’t set RabbitMQ Management statistics rate mode to detailed
      ● The RabbitMQ management plugin collects and calculates metrics for every queue, connection, and channel in the cluster
      ● This slows down the server if you have thousands upon thousands of active queues or consumers
  14. QUEUES. Recommendation number 7.1: Split queues over different cores, and route messages to multiple queues
      ● A queue is single threaded
        ○ 50k messages/s
      ● Queue performance is limited to one CPU core
      ● All messages routed to a specific queue will end up on the node where that queue resides
      Plugins:
      ● The consistent hash exchange plugin
      ● RabbitMQ sharding
  15. QUEUES. Recommendation number 7.2: The consistent hash exchange plugin
      ● Load-balance messages between queues
      ● Messages are consistently and equally distributed across many queues
      ● Consume from all queues
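To illustrate the idea behind consistent hashing, here is a small, self-contained hash ring. It is a simplified sketch of the general technique, not the plugin's actual implementation; the class name, the use of SHA-256, and the number of points per queue are all choices made for this example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a stable 64-bit integer."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Each queue owns many points on a ring; a key goes to the next point clockwise."""

    def __init__(self, queues, points_per_queue=100):
        self._ring = sorted(
            (_hash(f"{queue}#{i}"), queue)
            for queue in queues
            for i in range(points_per_queue)
        )
        self._hashes = [point for point, _ in self._ring]

    def queue_for(self, routing_key: str) -> str:
        # Wrap around to the first point when the key hashes past the last one.
        index = bisect.bisect_right(self._hashes, _hash(routing_key)) % len(self._ring)
        return self._ring[index][1]


if __name__ == "__main__":
    ring = HashRing(["q1", "q2", "q3"])
    counts = {}
    for n in range(3000):
        queue = ring.queue_for(f"order-{n}")
        counts[queue] = counts.get(queue, 0) + 1
    print(counts)
```

The same routing key always lands on the same queue, while different keys spread roughly evenly, so each queue (and CPU core) gets a share of the load. Consumers still have to consume from every queue.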
  16. QUEUES. Recommendation number 7.3: RabbitMQ sharding
      ● Automatic partitioning of queues
      ● Queues are created on every cluster node and messages are sharded across them
      ● Shows one queue to the consumer, but there could be many queues running behind it in the background
  17. QUEUES. Recommendation number 8: Limit the use of priority queues
      ● Each priority level uses an internal queue on the Erlang VM, which takes up resources
      ● In most use cases it's sufficient to have no more than 5 priority levels
  18. QUEUES. Recommendation number 9: Send persistent messages and use durable queues
      ● Messages, exchanges, and queues that are not durable and persistent are lost during a broker restart
      ● For high performance, use transient messages and temporary or non-durable queues
  19. PREFETCH. Recommendation number 10.1: Adjust prefetch value
      ● Limits how many messages the client can receive before acknowledging a message
      ● RabbitMQ default prefetch value: unlimited buffer
      ● RabbitMQ 3.7
        ○ Option to adjust the default prefetch
        ○ CloudAMQP servers have a default prefetch of 1000
  20. PREFETCH. Recommendation number 10.2: Too small prefetch value
      ● RabbitMQ spends most of the time waiting for permission to send more messages
  21. PREFETCH. Recommendation number 10.3: Too large prefetch value
  22. PREFETCH. Recommendation number 10.4: Prefetch
      ● One single or few consumers with short processing time
        ○ Prefetch many messages at once
      ● About the same processing time and a stable network
        ○ Estimate the prefetch value by dividing the total round trip time by the processing time on the client for each message
      ● Many consumers, and short processing time
        ○ A lower prefetch value than for one single or few consumers
      ● Many consumers, and/or long processing time
        ○ Set prefetch count to 1 so that messages are evenly distributed among all your workers
      ● The prefetch value has no effect if your client auto-acks messages
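The rule of thumb for a stable network (round trip time divided by per-message processing time) can be written down directly. This is a minimal sketch: the function name is hypothetical, rounding up and the floor of 1 are choices made here, and the timings in the usage lines are made-up inputs.

```python
import math


def estimate_prefetch(round_trip_ms: float, processing_ms: float) -> int:
    """Prefetch estimate: total round trip time / processing time per message."""
    if round_trip_ms < 0 or processing_ms <= 0:
        raise ValueError("round_trip_ms must be >= 0 and processing_ms must be > 0")
    # Round up so the consumer never sits idle, and never go below 1.
    return max(1, math.ceil(round_trip_ms / processing_ms))


if __name__ == "__main__":
    # Assumed example: 125 ms round trip, 5 ms of work per message.
    print(estimate_prefetch(125, 5))
    # Slow consumers: processing dominates, so the estimate falls back to 1.
    print(estimate_prefetch(10, 50))
```

With a 125 ms round trip and 5 ms of processing per message the estimate is 25; when processing takes longer than the round trip, it collapses to 1, matching the advice for long processing times.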
  23. HiPE. Recommendation number 11: HiPE
      ● HiPE increases server throughput at the cost of increased start-up time
        ○ Increases throughput by 20-80%
        ○ Increases start-up time by about 1 to 3 minutes
      ● HiPE is recommended if you require high performance
      ● We no longer consider HiPE experimental
  24. ACKS AND CONFIRMS. Recommendation number 12: Acknowledgments and Confirms
      ● Pay attention to where in your consumer logic you’re acknowledging messages
      ● For the fastest possible throughput, manual acks should be disabled
      ● Publish confirm is required if the publisher needs messages to be processed at least once
  25. VERSION. Recommendation number 13: Use a stable RabbitMQ version
      Great improvements are made to RabbitMQ, all the time <3
      ● 3.7
        ○ Default prefetch
        ○ Individual vhost message stores
      ● 3.6
        ○ Many memory problems, up to version 3.6.14
        ○ Lazy queues
      ● 3.5
        ○ Still many customers on 3.5.7
      Backward compatibility is really good in RabbitMQ
  26. Plugins. Recommendation number 14: Disable plugins you are not using
      ● Some plugins consume lots of resources
      ● Make sure to disable plugins that you are not using
  27. Unused queues. Recommendation number 15: Delete unused queues
      ● Unused queues take up some resources: queue index, management statistics, etc.
      ● Temporary queues should be auto deleted
  28. VHOST. Recommendation number 16: Enable HA-vhost policy on custom vhosts
      ● Message loss on netsplits
      ● Needed to be able to upgrade without losing messages at CloudAMQP
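As a rough sketch of what an HA policy per vhost could look like, the helper below builds the path and body for a policy in the shape used by the management HTTP API, assuming classic mirrored queues as available in the RabbitMQ 3.x versions this talk covers. The policy name `ha-two`, the catch-all pattern, and the vhost names are assumptions; mirroring to exactly two nodes follows the High Availability summary in this deck.

```python
from urllib.parse import quote


def ha_policy_request(vhost: str, policy_name: str = "ha-two") -> tuple:
    """Return (path, body) for a policy that mirrors every queue in a vhost to 2 nodes."""
    # The vhost must be URL-encoded; the default vhost "/" becomes "%2F".
    path = f"/api/policies/{quote(vhost, safe='')}/{quote(policy_name, safe='')}"
    body = {
        "pattern": ".*",  # assumed: apply to all queues in the vhost
        "definition": {
            "ha-mode": "exactly",
            "ha-params": 2,
            "ha-sync-mode": "automatic",
        },
        "apply-to": "queues",
    }
    return path, body


if __name__ == "__main__":
    # Hypothetical vhosts: a policy is per vhost, so custom vhosts need their own.
    for vhost in ("/", "orders", "billing"):
        path, body = ha_policy_request(vhost)
        print("PUT", path, body["definition"])
```

Because policies are scoped to a vhost, a policy on the default vhost does not cover vhosts created later, which is the gap this recommendation is about.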
  29. Summary: Overall. Server side problems
      ● Short queues
      ● Long lived connections
      ● Limited use of priority queues
      ● Use multiple queues and consumers
      ● Split your queues over different cores
      ● Stable Erlang and RabbitMQ version
      ● Disable plugins you are not using
      ● Channels on all connections
  30. Summary: Overall. Server side problems
      ● Separate connections for publishers and consumers
      ● Management statistics rate mode
      ● Delete unused queues
      ● Temporary queues should be auto deleted
  31. Summary: High Performance. Server side problems
      ● Short queues
        ○ max-length if possible
      ● Do not use lazy queues
      ● Send transient messages
      ● Disable manual acks and publish confirms
      ● Avoid multiple nodes (HA)
      ● Enable RabbitMQ HiPE
  32. Summary: High Availability. Server side problems
      ● Enable lazy queues
      ● RabbitMQ HA - 2 nodes
        ○ HA-policy on all vhosts
      ● Persistent messages, durable queues
      ● Do not enable HiPE
  34. DIAGNOSTIC TOOL: Diagnostics Tool
      ● RabbitMQ and Erlang version
      ● Queue length
      ● Unused queues
      ● Persistent messages in durable queues
      ● No mirrored auto delete queues
      ● Limited use of priority queues
      ● Long lived connections
      ● Connection and channel leak
      ● Channels on all connections
      ● Insecure connections
      ● Client library
      ● AMQP Heartbeats
      ● Channel prefetch
      ● Management statistics rate mode
      ● Ensure that you are not using topic exchange as fanout
      ● Ensure that all published messages are routed
      ● Ensure that you have a HA-policy on all vhosts
      ● Auto delete on temporary queues
      ● No transient messages in mirrored queues
      ● Separate connections for publishers and consumers
  35. It should be easier to do things right!
  36. Questions? Visit blog site, documentation and FAQ for more info