RabbitMQ Best Practice with CloudAMQP

Erlang Solutions
CloudAMQP
RabbitMQ Best Practice
Experiences from running CloudAMQP,
the largest RabbitMQ fleet in the world

Background
● Carl Hörberg, CEO of 84codes
● CloudAMQP - RabbitMQ as a Service
● Largest provider of managed RabbitMQ servers
● Thousands of RabbitMQ nodes
● 6 different clouds

RabbitMQ
Client side problems - You’re using it wrong
Server side configurations - It can be optimized for you

RabbitMQ/Erlang versions
● Great improvements have been made lately
● RabbitMQ 3.7.x has a lot of new features
● 3.6.x had a lot of memory problems up to 3.6.14
but is now very good
● 3.5.7 was stable (as single nodes at least) but
lacked many good features (e.g. lazy queues)
● Update Erlang too
● Make sure to use updated client libraries
https://www.rabbitmq.com/which-erlang.html
https://www.rabbitmq.com/changelog.html

Connections and channels
● Use separate connections for publish/consume
○ Publishes may be TCP back-pressured
○ Then the server won’t receive AMQP Acks either
● Keep connection/channel count low
○ Reuse connections
○ 1 connection for publishing
○ 1 connection for consuming
○ 1 channel per thread (don’t share)
● Every connection uses:
○ TCP buffer space (auto-tuned, but may need to be
decreased)
○ CPU for metric collection by RabbitMQ mgmt UI

Connections and channels
● Don’t open and close connections or channels
repeatedly; every step costs TCP packets:
○ TLS connection: 5 TCP packets
○ AMQP connection: 7 TCP packets
○ AMQP channel: 2 TCP packets
○ AMQP publish: 1 TCP packet (more for larger
messages)
○ AMQP close channel: 2 TCP packets
○ AMQP close connection: 2 TCP packets
○ Total: 14 packets without TLS, 19 with TLS (+ Acks)
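The totals follow directly from the per-step counts. A small Python sketch of that arithmetic, using only the figures from this slide (the dictionary and function names are illustrative, not part of any RabbitMQ client):

```python
# TCP packets per step, as listed on the slide (TCP ACKs not counted).
PACKETS = {
    "tls_handshake": 5,
    "amqp_connection_open": 7,
    "amqp_channel_open": 2,
    "amqp_publish": 1,  # more for larger messages
    "amqp_channel_close": 2,
    "amqp_connection_close": 2,
}


def packets_per_publish(use_tls: bool, reuse_connection: bool) -> int:
    """Packets needed to publish one small message."""
    if reuse_connection:
        # Connection and channel are already open: only the publish itself.
        return PACKETS["amqp_publish"]
    total = sum(PACKETS.values())
    if not use_tls:
        total -= PACKETS["tls_handshake"]
    return total


if __name__ == "__main__":
    print(packets_per_publish(use_tls=False, reuse_connection=False))  # 14
    print(packets_per_publish(use_tls=True, reuse_connection=False))   # 19
    print(packets_per_publish(use_tls=True, reuse_connection=True))    # 1
```

A client that opens and closes a connection for every message therefore pays 14 to 19 packets for what a long-lived connection does in one.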

Queues
● Short queues are fast queues, because
messages can be cached or may not hit the disk at all
● Use lazy queue-mode if they’re long (RabbitMQ
>= 3.6)
● Limit queue size, with TTL or max-length
● Problems with long queues
○ Small msgs are embedded in the queue index
○ They take a long time to sync between nodes
○ Starting a server with many msgs is time
consuming
https://www.rabbitmq.com/lazy-queues.html

Queues
● Queues are single-threaded
● Max performance: One queue per core
○ Consistent hash exchange plugin
○ RabbitMQ sharding
● Consume (push), don’t poll (pull) for messages
● Auto-ack or ack every X msgs instead of every msg
● RabbitMQ management interface collects and
stores stats for all queues
https://github.com/rabbitmq/rabbitmq-consistent-hash-exchange
https://github.com/rabbitmq/rabbitmq-sharding
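"Ack every X msgs" can be sketched as below. This is a minimal illustration, not a client library API: `ack` is a hypothetical callable standing in for whatever your AMQP client exposes for `basic.ack`, and the sketch assumes messages on one channel are processed in delivery order.

```python
from typing import Callable


class BatchAcker:
    """Acknowledge every `batch_size` deliveries with one cumulative ack.

    `ack` is a hypothetical callable taking (delivery_tag, multiple); with a
    real client it would wrap the library's basic.ack call.
    """

    def __init__(self, ack: Callable[[int, bool], None], batch_size: int) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be >= 1")
        self._ack = ack
        self._batch_size = batch_size
        self._pending = 0
        self._last_tag = 0

    def processed(self, delivery_tag: int) -> None:
        """Call after a message has been fully processed."""
        self._pending += 1
        self._last_tag = delivery_tag
        if self._pending >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Ack everything processed so far (call on shutdown too)."""
        if self._pending:
            # multiple=True acks all deliveries up to and including this tag.
            self._ack(self._last_tag, True)
            self._pending = 0
```

The trade-off: if the consumer dies between two batch acks, the messages processed since the last ack are redelivered, so handlers should tolerate duplicates.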

Persistent messages
● For a message to survive a server restart
○ Durable exchange (most are)
○ Durable queue
○ Persistent message (delivery_mode=2)
● For throughput, use temporary or non-durable
queues
https://www.cloudamqp.com/blog/2017-03-14-how-to-persist-messages-during-RabbitMQ-broker-restart.html

Acknowledgements and Confirms
● Message acknowledgements and publisher confirms
ensure at-least-once delivery.
● On publish, the server can confirm that it has
received the message.
● Acknowledge a message once you’ve processed
it. If the consumer disconnects before acking,
the message is redelivered.
● But for throughput, and when at-least-once
delivery is not required, auto-ack is much faster.
Common mistake:
● Enabling publish confirms, but not waiting for the
confirmation and/or not retrying the publish.
http://www.rabbitmq.com/confirms.html
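Avoiding the common mistake means doing two things: waiting for the confirm and retrying when it does not arrive. A hedged Python sketch of that loop follows; `publish_and_wait` is a hypothetical callable you would implement on top of your client library, and the attempt count and backoff are arbitrary illustrative defaults.

```python
import time
from typing import Callable


def publish_with_confirm(
    publish_and_wait: Callable[[bytes], bool],
    body: bytes,
    max_attempts: int = 3,
    backoff_seconds: float = 0.5,
) -> bool:
    """Publish `body` and retry until the broker confirms it.

    `publish_and_wait` is a hypothetical callable that publishes the message,
    blocks until the broker answers, and returns True on a confirm (ack) and
    False on a nack. It may raise ConnectionError if the connection is lost.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            if publish_and_wait(body):
                return True
        except ConnectionError:
            pass  # treated like a missing confirm: retry
        if attempt < max_attempts:
            # Simple linear backoff between attempts.
            time.sleep(backoff_seconds * attempt)
    return False
```

Retrying can publish the same message twice (for example when the confirm was lost, not the message), which is exactly what at-least-once delivery implies; consumers should be prepared for duplicates.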

Prefetch
The prefetch value specifies how many messages are
sent to the consumer before the client has to
acknowledge one.
● Optimal prefetch is (roundtrip latency divided by
message processing time) + 1, so that the client
doesn’t have to wait for deliveries
● All unacked messages have to reside in both the
server’s and the client’s RAM; too many can
cause out-of-memory (OOM) failures
Common mistakes:
● Too large or unlimited prefetch value
● Not rejecting (nor acking) messages that fail
https://www.rabbitmq.com/consumer-prefetch.html
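The prefetch rule of thumb from this slide, written out as a small Python helper. The function name is illustrative, and rounding the quotient up to a whole message is an assumption of this sketch; the slide only gives the ratio plus one.

```python
import math


def suggested_prefetch(round_trip_ms: float, processing_ms: float) -> int:
    """Prefetch = round-trip latency / processing time per message, plus 1."""
    if round_trip_ms < 0 or processing_ms <= 0:
        raise ValueError("latency must be >= 0 and processing time > 0")
    # Round up (an assumption of this sketch) so the consumer never runs dry.
    return math.ceil(round_trip_ms / processing_ms) + 1


if __name__ == "__main__":
    # 50 ms round trip, 4 ms per message: ceil(12.5) + 1
    print(suggested_prefetch(50, 4))  # 14
    # Slow consumer on a fast network: a very small prefetch is enough.
    print(suggested_prefetch(1, 20))  # 2
```

A fast consumer far from the broker needs a larger prefetch; a slow consumer close to the broker needs almost none, which also keeps unacked messages out of RAM.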

Clustering
For HA and/or scalability
● HA if queues are mirrored (policy or per queue)
● HA if the client fails over
● Failover methods
○ Multiple addresses in client’s host array
○ DNS load balancing (Short TTL)
○ Load balancer (PROXY protocol in 3.7)
● For scalability only if you control which node
queues are created on and you connect there
Common mistakes:
● Not configuring mirrored queues
● Not understanding partition handling modes
● Client doesn’t automatically reconnect
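The "multiple addresses in the client's host array" method can be sketched as follows. This is an assumption-laden illustration: `connect` is a hypothetical callable that opens a connection to one host with your client library and raises `OSError` (or a subclass such as `ConnectionError`) on failure.

```python
from typing import Callable, Optional, Sequence, TypeVar

Conn = TypeVar("Conn")


def connect_with_failover(
    hosts: Sequence[str],
    connect: Callable[[str], Conn],
) -> Conn:
    """Return a connection to the first host that accepts one."""
    last_error: Optional[Exception] = None
    for host in hosts:
        try:
            return connect(host)
        except OSError as exc:
            # Node is down or unreachable: try the next address.
            last_error = exc
    raise ConnectionError(f"no reachable node among {list(hosts)}") from last_error
```

To cover the "client doesn't automatically reconnect" mistake, the same function would be called again (ideally with a delay) whenever an established connection is lost.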

Clustering
RabbitMQ uses Erlang’s clustering functionality, which
is not designed for net-splits. “Net-splits” can occur:
● When “pings/net-ticks” can’t be sent between
nodes
● At 100% CPU or RAM usage
● When ha-sync-batch-size is too large (decrease it)
● When the network is slow
● Tip: increase net_ticktime to 90-120s (default
60s)
https://www.rabbitmq.com/partitions.html
RabbitMQ will have better partition tolerance in 4.x

Filesystem
We use and recommend XFS.
GP2/IO1 drives on EC2; with many queues, IOPS matter
more than sequential performance.

HiPE
High performance Erlang
● Increased throughput 20-80%
● “Experimental”, but 3% of our customers use it
● Increases boot time

Resilience
Reduce restart time after a crash by limiting crash dump writing to 1 second:
export ERL_CRASH_DUMP_SECONDS=1
Make RAM usage much more stable:
export RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+hmqd off_heap"
http://erlang.org/doc/man/erl.html#+hmqd

Resilience
[{rabbit, [
  {queue_index_embed_msgs_below, 0},
  {channel_max, 256},
  {tcp_listen_options, [
    {keepalive, true},
    {nodelay, true},
    {exit_on_close, true},
    {linger, {true, 0}}
  ]},
  …
]},
http://www.rabbitmq.com/networking.html
● Embedding small msgs (<4KB) in the queue index
improves throughput a lot, but disabling it
makes RAM usage much more consistent
and allows you to start the server with more
messages in the queues than the server has RAM
● A channel max of 65536 is often far too high; a
single misbehaving client can crash the whole
server
● Use TCP keepalive, not AMQP heartbeats

AMQP heartbeats vs. TCP keepalive
net.ipv4.tcp_keepalive_time=60
net.ipv4.tcp_keepalive_intvl=15
net.ipv4.tcp_keepalive_probes=4
http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/overview.html
Rationale: Detect dead peers, but don’t let
proxies/NATs/routers close idle connections.
TCP keepalives work at the kernel level; client-side
support is not required.
AMQP heartbeats can be missed if the client is busy,
single-threaded, buggy, etc.
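What the three sysctl values above mean in practice can be worked out with a little arithmetic. The sketch below assumes standard Linux keepalive behaviour (first probe after the idle time, then one probe per interval); the function name is illustrative.

```python
def dead_peer_detection_seconds(keepalive_time: int, keepalive_intvl: int,
                                keepalive_probes: int) -> int:
    """Approximate time until the kernel drops a silent, idle connection.

    The first probe is sent after `keepalive_time` seconds of idleness; the
    connection is dropped once `keepalive_probes` probes, sent
    `keepalive_intvl` seconds apart, have gone unanswered.
    """
    return keepalive_time + keepalive_intvl * keepalive_probes


if __name__ == "__main__":
    # Values from this slide: 60 s idle, 15 s interval, 4 probes.
    print(dead_peer_detection_seconds(60, 15, 4))  # 120
```

With these settings an idle connection carries a probe every 60 seconds, which keeps NAT and proxy state alive as long as their idle timeout is longer than that, and a dead peer is noticed after roughly two minutes.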

TLS
{ssl_options, [
  {cacertfile, "/etc/rabbitmq/ca.pem"},
  {certfile, "/etc/rabbitmq/cert.pem"},
  {keyfile, "/etc/rabbitmq/key.pem"},
  {honor_cipher_order, true},
  {ciphers, [
    "ECDHE-RSA-AES128-GCM-SHA256",
    "ECDHE-RSA-AES256-GCM-SHA384",
    "ECDHE-RSA-AES128-SHA",
    "ECDHE-RSA-AES256-SHA",
    "ECDHE-RSA-AES128-SHA256",
    "ECDHE-RSA-AES256-SHA384",
    "DHE-RSA-AES128-GCM-SHA256",
    "DHE-RSA-AES256-GCM-SHA384",
    "DHE-RSA-AES128-SHA",
    "DHE-RSA-AES256-SHA",
    "DHE-RSA-AES128-SHA256",
    "DHE-RSA-AES256-SHA256"
  ]},
http://www.rabbitmq.com/ssl.html
Disable RSA key exchanges and other insecure
ciphers

Many connections optimization
{tcp_listen_options, [
...,
{sndbuf, 8192},
{recbuf, 8192}
]}
http://www.rabbitmq.com/networking.html
Default values are auto-tuned by the kernel, but
decrease the TCP buffer sizes if you need to handle
tens/hundreds of thousands of connections.
Also set rates_mode in the mgmt interface to none.
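Why small buffers matter at this scale is easiest to see with numbers. A rough Python sketch, assuming each connection reserves exactly the configured `sndbuf` plus `recbuf` (real kernel accounting differs, so treat the result as a nominal figure; the function name is illustrative):

```python
def socket_buffer_bytes(connections: int, sndbuf: int = 8192,
                        recbuf: int = 8192) -> int:
    """Nominal memory reserved for TCP buffers across all connections."""
    return connections * (sndbuf + recbuf)


if __name__ == "__main__":
    total = socket_buffer_bytes(100_000)
    print(total)  # 1638400000
    print(round(total / 2**30, 2), "GiB")
```

At 100,000 connections even 8 KiB buffers add up to about 1.5 GiB, so larger auto-tuned buffers can easily dominate the server's RAM.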

RabbitMQ best practice for High Performance
● Make sure your queues stay short
● Limit queue length
● Do not enable lazy queues
● Use transient messages
● Use multiple queues and consumers
● Split your queues over different cores
● Disable manual acks and publish confirms
● Avoid multiple nodes (HA)
● Enable RabbitMQ HiPE
● Disable plugins you are not using

RabbitMQ best practice for High Availability
● Make sure your queues stay short
● Enable lazy queues
● Cluster setup (RabbitMQ HA with 2 or more
nodes)
● Make sure your clients will failover
● Use persistent messages and durable queues
● Federation between clouds

Questions
mladen.miliksic@erlang-solutions.com
www.erlang-solutions.com
www.cloudamqp.com
contact@cloudamqp.com