2. What does going deep mean?
And when should you, and when shouldn't you, do it?
3. A good queue is an empty queue
MQ is designed to allow applications to asynchronously communicate, acting as a buffer to smooth workload
peaks and temporary application outages
4. A good queue is an empty queue, but application failures happen!
MQ is designed to allow applications to asynchronously communicate, acting as a buffer to smooth workload
peaks and temporary application outages
Failure of a putting application has no real effect on MQ
Failure of a getting application will result in messages building up on its queues until the application restarts and
begins processing again
5. A good queue is an empty queue, but application failures happen!
MQ is designed to allow applications to asynchronously communicate, acting as a buffer to smooth workload
peaks and temporary application outages
Failure of a putting application has no real effect on MQ
Failure of a getting application will result in messages building up on its queues until the application restarts and
begins processing again
If the getting application suffers an extended outage, queues can fill up completely, so the putting application
needs a strategy for dealing with a full queue
6. The aim of this presentation
This presentation aims to explore how deep a queue can go, and things you should bear in mind when that
happens
7. Are deep queues a valid use case for MQ?
MQ is designed to allow applications to asynchronously communicate, acting as a buffer to smooth workload
peaks and temporary application outages
These outages might be minutes, hours or even days, but they are always temporary
We are not advocating using MQ as a database, i.e. keeping data on queues forever. MQ is not optimised for
that use case
Use a database instead!
8. Are deep queues the only answer?
No, but they are often the simplest from an application perspective. As we will see, though, really deep queues
require some careful thought
Some alternatives include:
• Putting application detecting MQRC_Q_FULL and pausing until space becomes available on the queue
• Putting application detecting MQRC_Q_FULL and diverting new messages to a database, file, or another queue
• Starting up a temporary getting application to move the messages off somewhere else
• Failing!
In addition to having to code these solutions, they often come with challenges such as preserving message
ordering
9. Life-cycle of a getting application outage
It is up to you to decide the maximum getting application outage duration that you want to tolerate
Some customers are now looking at multiple-day outages with tens of millions of messages a day building up
However, that decision needs to be within the bounds of what is possible for MQ. There is a limit to the amount
of data a queue manager can store
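As a rough sizing sketch (the message rate, size and outage length here are illustrative assumptions, not figures from the presentation), the backlog that builds up is simply rate × size × duration:

```python
# Rough backlog sizing for an extended getting application outage.
# The rate, message size and duration used below are illustrative only.
def backlog_bytes(msgs_per_day: int, avg_msg_bytes: int, outage_days: int) -> int:
    """Data that accumulates on the queue while the getter is down."""
    return msgs_per_day * avg_msg_bytes * outage_days

# e.g. 20 million 1 KB messages a day, 3-day outage
backlog = backlog_bytes(20_000_000, 1024, 3)
print(f"{backlog / 1024**3:.1f} GB")  # 57.2 GB
```

Comparing this number against the storage limits discussed later tells you whether the outage you want to tolerate is even feasible.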
[Diagram: queue depth against time — depth climbs during the getting application outage to a peak queue depth,
then falls back to normal as the restarted getting application processes the backlog; the duration of recovery
follows the duration of the getting application outage, with normal running either side]
10. Other important things
• When you decide what the queue size limits are, make sure you enforce them via configuration: MAXDEPTH,
MAXMSGL
• Make sure you monitor for a getting application outage, perhaps via DIS QSTATUS, or by service interval
events, so you can restart the getting application as soon as possible
• Monitor for the queue filling up, e.g. queue depth full, or high events, for similar reasons
• As the queue fills up you might start to see degraded putting application performance; it's worth
understanding what form this might take so you can plan for it
11. Other important things
• When the getting application starts up it is going to need to catch up. How fast can it do this? E.g. with a 2 day
outage and a getting application that can get messages twice as fast as they are being put, it will take a total of
4 days from the start of the application outage to get back to normal
• Do you need to allow for a getting application outage occurring during this recovery window?
• Can the messages expire? And are you relying on that to keep the queue depth low?
• Be particularly aware of potential getting application inefficiencies. For example, getting using a message
selector is relatively slow; this will be particularly noticeable on a deep queue
• Make sure you test the deep queue scenario, including recovery, to make sure everything works in the
timeframes you expect!
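The catch-up arithmetic in the first bullet above can be sketched as a minimal model assuming constant put and get rates:

```python
# Catch-up model assuming constant put and get rates. While the getter
# is down, a backlog of `outage_days` worth of puts builds up; once it
# restarts, the backlog drains at (get_multiple - 1) days-of-puts per day.
def total_disruption_days(outage_days: float, get_multiple: float) -> float:
    if get_multiple <= 1:
        raise ValueError("getter must be faster than the putter to catch up")
    recovery_days = outage_days / (get_multiple - 1)
    return outage_days + recovery_days

# The slide's example: 2-day outage, getter twice as fast as the putter
print(total_disruption_days(2, 2))  # 4.0
```

Note how sensitive the recovery window is to the get rate: a getter only 25% faster than the putter turns a 2-day outage into 10 days of total disruption.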
12. So, how deep can I go?

Approximate number of 1 KB messages:
  MQ on z/OS private queue              16.8 million
  MQ on z/OS shared queue, all in CF    613.6 million* **
  MQ on z/OS shared queue on SMDS       1.4 billion* **
  MQ on distributed                     68.4 billion*

We will see where these numbers come from later
* We haven't tested to the end of these limits
** Going this deep will mean you can't recover in the case of a structure failure. So non-persistent messages only!
14. Private queues – buffer pools and page sets
Messages on private queues are stored in memory in buffer pools and might be moved into a page set on disk
Data in buffer pools and page sets is accessed in 4 KB pages. A single page might contain the queue spine,
message metadata, message data for at most one message, or space usage information (a space map page)
A private queue is associated with a single buffer pool / page set pair so the size of a page set is the constraining
factor for private queues. A page set can be a maximum of 64 GB in size, which allows for ~16.8 million 1KB
messages. This is a rough calculation, ignoring space maps, spine pages, etc
To get a rough idea for the maximum number of messages of a given size that can be stored in a page set, round
up the size of the message to a multiple of 4 KB and divide 64 GB by that number
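The rough calculation described above can be written down directly (ignoring space map and queue spine pages, as the text says):

```python
# Rough page-set capacity: round the message size up to a whole number
# of 4 KB pages, then divide the 64 GB page set limit by that.
PAGE_BYTES = 4 * 1024
PAGE_SET_MAX_BYTES = 64 * 1024**3

def max_messages_per_page_set(msg_bytes: int) -> int:
    pages_per_msg = -(-msg_bytes // PAGE_BYTES)  # ceiling division
    return PAGE_SET_MAX_BYTES // (pages_per_msg * PAGE_BYTES)

print(f"{max_messages_per_page_set(1024) / 1e6:.1f} million")  # 16.8 million
```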
[Diagram: messages pass through a buffer pool in memory to a page set on disk]
15. Private queues – recommendations
If you want to allow a private queue to deal with an extensive getting application outage, consider putting it in
its own page set and ideally its own buffer pool. This allows for accurate sizing. You don't want two very deep
queues on the same page set at the same time
There is obviously a limit to how much separation you can have here given that there can be at most 100 page
sets and buffer pools in a single z/OS queue manager
16. Indexing your queues
On z/OS you can specify the INDXTYPE attribute on local queues to tell the queue manager how to index the
queue: no index; by message ID; by correlation ID; by group ID
An index makes no difference if just getting the next available message off a queue, but makes a significant
difference otherwise, especially if the queue is deep
INDXTYPE only supports a single value, so choose it based on the most common approach for getting messages
from the queue.
However, for private queues you can still use any approach regardless of the index; it will just be less efficient.
The queue manager will tell you if you should consider indexing your queues
[Diagram: an indexed queue with entries MessageID=A, MessageID=B, MessageID=C]
17. Indexing your queues
The index for private queues is maintained in the queue manager in 64-bit storage
Each message on an indexed queue uses 136 bytes to maintain the index. 10 million messages therefore use
1,360 MB of 64-bit storage
So if you are going to have deep private queues which are indexed make sure you account for it in the
MEMLIMIT attribute of your *MSTR JCL
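Using the 136 bytes per message figure above, the index cost is easy to estimate when sizing MEMLIMIT (reported in decimal MB here, matching the slide's 1,360 MB figure):

```python
# 64-bit storage used by a queue index: 136 bytes per message, per the
# slide. Reported in decimal MB to match the 1,360 MB figure above.
INDEX_BYTES_PER_MSG = 136

def index_storage_mb(messages: int) -> float:
    return messages * INDEX_BYTES_PER_MSG / 1_000_000

print(f"{index_storage_mb(10_000_000):.0f} MB")  # 1360 MB
```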
19. Shared queues – storage
Shared queues are stored in a coupling facility (CF)
Messages may be entirely stored in the CF if the message size < 63KB, or a pointer to the message can be
stored in the CF and the remainder of the message offloaded to Db2 (not recommended, and not discussed
further) or in shared message data sets (SMDS)
The maximum supported CF structure size is 1TB. This is all real storage, so is relatively expensive
If SMDS is used each queue manager gets its own SMDS data set for the structure. The maximum size of a
single SMDS is 16 TB
20. Shared queues – small messages
Shared queues perform best when the message is held entirely in the CF, as that both minimises code path
length and removes the need to interact with DASD
For messages which are < 63KB in size the best approach is therefore to keep them in the CF and only offload
them to SMDS when the queue starts getting deep during a getting application outage
This gives the best of both worlds, fast message access normally, but the ability to store lots of messages in the
worst case
If you do decide to keep small messages in the CF and not to offload, the maximum number of 1KB messages
that can be stored in a single structure is approximately 613.6 million, based on the max CF size of 1TB
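Working backwards from the slide's figure (this per-message cost is inferred, not stated in the presentation): 1 TB / 613.6 million ≈ 1,792 bytes per message, i.e. 7 × 256 bytes, consistent with one 256-byte entry plus six 256-byte elements holding a 1 KB message and its headers:

```python
# Inferred CF capacity sketch. The 1,792 bytes per 1 KB message
# (1 entry + 6 elements of 256 bytes) is reverse-engineered from the
# slide's 613.6 million figure, not an officially stated cost.
CF_MAX_BYTES = 1024**4          # 1 TB maximum structure size
BYTES_PER_1KB_MSG = 7 * 256     # assumption: entry + 6 data elements

print(f"{CF_MAX_BYTES / BYTES_PER_1KB_MSG / 1e6:.1f} million")  # 613.6 million
```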
21. Shared queues – offload rules
Each MQ CFSTRUCT definition has three offload rules associated with it
Each rule specifies the maximum size of message that can be stored in the structure when the structure is over a
given percentage full
The default rules assume that no messages < 63KB get offloaded until the structure is very full
These defaults are not likely to be good for a getting application outage, where you want to maximise the number
of messages you can store and minimise the amount of CF storage used. Instead you might want to start offloading
all messages when the structure is, say, 10% full, as in the DEEPSTRUCT example below
DEFINE CFSTRUCT(SHALLOWSTRUCT)
  CFLEVEL(5) …
  OFFLOAD(SMDS)
  OFFLD1TH(70) OFFLD1SZ(32K)
  OFFLD2TH(80) OFFLD2SZ(4K)
  OFFLD3TH(90) OFFLD3SZ(0K)

DEFINE CFSTRUCT(DEEPSTRUCT)
  CFLEVEL(5) …
  OFFLOAD(SMDS)
  OFFLD1TH(10) OFFLD1SZ(0K)
  OFFLD2TH(10) OFFLD2SZ(0K)
  OFFLD3TH(10) OFFLD3SZ(0K)
22. Shared queues – offloaded messages
Each offloaded message requires a message pointer in the CF. The CF is used to maintain queueing semantics;
the pointer allows the message to be located in SMDS
Offloaded messages require you to consider both the CF structure size and the space used in SMDS
At most, a structure can contain 1.4 billion message pointers; you then need to consider how much space those
messages will occupy in SMDS
A single SMDS data set can be up to 16TB in size. This can easily take 1.4 billion 1KB messages, but only 16
million 1MB messages
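These two limits can be combined: capacity is whichever runs out first, the CF's pointer space (1 TB at 768 bytes per pointer, i.e. one 256-byte entry plus two 256-byte elements) or the 16 TB SMDS:

```python
# Offloaded message capacity: the lesser of how many 768-byte message
# pointers fit in a 1 TB CF structure and how many messages fit in a
# 16 TB SMDS data set.
CF_MAX_BYTES = 1024**4            # 1 TB structure
POINTER_BYTES = 768               # 1 entry + 2 elements, 256 bytes each
SMDS_MAX_BYTES = 16 * 1024**4     # 16 TB per SMDS

def max_offloaded_messages(msg_bytes: int) -> int:
    pointer_limit = CF_MAX_BYTES // POINTER_BYTES
    smds_limit = SMDS_MAX_BYTES // msg_bytes
    return min(pointer_limit, smds_limit)

print(f"{max_offloaded_messages(1024) / 1e9:.1f} billion 1 KB messages")     # 1.4 billion
print(f"{max_offloaded_messages(1024**2) / 1e6:.1f} million 1 MB messages")  # 16.8 million
```

The 1 KB case is pointer-limited (hence the slide's 1.4 billion figure), while the 1 MB case is SMDS-limited (~16 million).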
[Diagram: each offloaded message leaves a message pointer in the CF structure: 1 entry of 256 bytes plus
2 elements of 256 bytes each, 768 bytes in total]
23. Shared queues – backups
CF structures are in-memory structures. In the rare cases where they fail, they need to be rebuilt
With MQ this is done by periodically taking backups of the structure. If a structure failure occurs, the structure
can be recovered from the backup plus the logs of the queue managers that have accessed the structure since
the backup was taken
Deep queues have important implications for this process:
• Size and number of active and archive logs
• Backup time
• Backup frequency
• Recovery time
[Diagram: BACKUP CFSTRUCT(DEEPSTRUCT) writes to the active and archive logs; RECOVER
CFSTRUCT(DEEPSTRUCT) reads them back]
24. Shared queues – backups – archive and active logs
Backing up a structure involves writing the structure contents and the contents of the SMDS to a queue
manager's active logs. Over time the active logs then get written to the archive logs
Therefore the backup might be only in the active logs, in a mixture of active and archive logs, or just in the
archive logs
As the diagram below shows, the contents of the active logs normally mainly overlap the contents of the archive
logs. Therefore the limiting factor for a backup is the number and size of the archive logs. The biggest backup
you can have is ~4,096 GB. NB this is smaller than the maximum size of an SMDS!
[Diagram: log archiving over time — data moves from the current active log (0) through previous active logs
(-1, -2, -3) into archive logs; the oldest data exists only in the archive logs. Up to 310 × 4 GB active logs;
up to 1,000 × 4 GB archive logs]
25. Shared queues – backups – archive and active logs
In order to recover a backup it needs to be accessible to the queue managers, so you need to make sure that the
start of the backup remains in the available archive logs for that queue manager, otherwise you can't recover it!
You don't just need one backup: you need to be able to safely transition from one backup to the next, should
anything fail while the backup is in progress. The usable and unusable backups in the diagram below illustrate
this
The conclusion of all this is that you really don't want a backup to be more than about a quarter of the available
archive logs of a queue manager, i.e. < 1 TB at the most
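Putting numbers on that quarter rule of thumb, using the stated maxima of 1,000 archive logs of 4 GB each:

```python
# The "quarter of the archive logs" rule of thumb: with up to
# 1,000 archive logs of 4 GB each, a backup should stay under ~1 TB.
MAX_ARCHIVE_LOGS = 1000
LOG_GB = 4

max_backup_gb = MAX_ARCHIVE_LOGS * LOG_GB / 4
print(f"{max_backup_gb:.0f} GB")  # 1000 GB, i.e. just under 1 TB
```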
For this extreme case you should also consider a separate queue manager just to perform the backups.
This removes the risk of application messages pushing the backup out of the archive logs, and also means the
backup process can't affect applications
[Diagram: archive logs from 0 (newest available) back to -999 (oldest available) — a backup is usable only while
both it and the messages logged since it remain within the available archive logs; RECOVER CFSTRUCT works
for usable backups and fails once a backup has aged out of the archive logs]
26. Shared queues – backups – time and frequency
Backing up a structure takes time. The best backup is one that contains minimal data, as that will be fast. IBM
recommends taking a backup of every structure at least every hour to minimise the amount of time recovery will
take
Care is needed here with deep queues!
Backing up a large structure will take a long time, and as discussed it will take a lot of log space
If your getting application outage can be several days, continuing with your normal backup strategy while the
structure contains lots and lots of messages might not be a good idea
27. Shared queues – backups – time and frequency
It is worth considering adjusting your backup strategy during extended getting application outages
1) During normal running back up every hour or so (whatever is normal for your site)
2) When the queue starts to get very deep, pause backups, or make them less frequent, until the getting
application restarts and the queue depth has reduced
This approach minimises the repeated cost of taking a large backup, both in terms of CPU and log usage.
However it does mean that any recovery will rely on reading a potentially large amount of log data across the
queue managers in your group. Therefore you need to accurately size the active and archive logs across your
group too, and consider how long recovery might take
A good time to pause is when the size of the backup starts to become several times the amount of data
that would normally get written between backups
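That pause heuristic can be sketched as a simple check (the threshold multiple here is an illustrative assumption; pick one appropriate for your site):

```python
# Sketch of the backup pause heuristic: pause structure backups when a
# backup would be several times the log data normally written between
# backups. The default threshold is an illustrative assumption.
def should_pause_backups(backup_size_gb: float,
                         normal_inter_backup_log_gb: float,
                         threshold_multiple: float = 3.0) -> bool:
    return backup_size_gb > threshold_multiple * normal_inter_backup_log_gb

print(should_pause_backups(200, 20))  # True  (10x normal, pause backups)
print(should_pause_backups(25, 20))   # False (close to normal, keep backing up)
```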
[Diagram: queue depth over time — regular backups during normal running, backups paused while the queue is
deep during the getting application outage, regular backups resuming once the getting application restarts and
the depth falls]
28. Shared queues – backups – recovery
Frequent backups are recommended to minimise the amount of time it takes to recover from a backup
Recovery involves reading the logs in the backup and scanning the logs of all queue managers that used the
structure since the point of the backup. This results in reading the active / archive logs backwards which is
typically slower than reading them forwards
With deep queues this could take a significant period of time. But it does require multiple failures: a getting
application outage for a significant period of time, followed by a CF structure failure
It might be worth considering CF duplexing to reduce the chance of a structure failure, but bear in mind this will
result in increased CF CPU utilization, and more CF storage being required
[Diagram: as above, but the structure fails while backups are paused and the queue is deep, so recovery has to
replay all the log data written since the last backup]
30. Queue files
MQ on distributed platforms takes a different approach from z/OS
Each queue gets its own in-memory set of buffers for temporarily staging messages. Separate sets of buffers are
used for persistent and non-persistent messages
These buffers can be tuned up to a maximum size of 100 MB; they default to 128 KB for non-persistent messages
and 256 KB for persistent ones. These settings aren't as fully externalized as buffer pools on z/OS, but similar
performance considerations apply
The buffers asynchronously get written to the queue file, and there is one queue file per queue
[Diagram: queues Q1, Q2 and Q3, each with its own non-persistent message buffers, persistent message buffers,
and queue file]
31. Queue files
Queue file size is the ultimate upper limit for queue depth on distributed. The default maximum size is ~2 TB
From 9.2.0 adjusting this is simple, as shown below. The maximum value is ~255 TB
Messages are stored in queue files in blocks. If the queue file is < 2 TB in size, the block size is 512 bytes
Above 2 TB the block size is 4 KB, which means a 1 KB message will use a whole block
The maximum number of 1 KB messages on a single queue is therefore ~68.4 billion
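The ~68.4 billion figure follows directly from the maximum file size and the 4 KB block size:

```python
# Queue file capacity on distributed: above 2 TB the block size is 4 KB,
# so each 1 KB message occupies a whole block of the ~255 TB maximum file.
QUEUE_FILE_MAX_BYTES = 255 * 1024**4   # ~255 TB
BLOCK_BYTES = 4 * 1024                 # block size for files > 2 TB

max_1kb_messages = QUEUE_FILE_MAX_BYTES // BLOCK_BYTES
print(f"{max_1kb_messages / 1e9:.2f} billion")  # 68.45 billion, i.e. the slide's ~68.4 billion
```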
DEFINE QL(NEWQUEUE) MAXFSIZE(500)
  Create a queue with a maximum file size of 500 MB
ALTER QL(EXISTINGQUEUE) MAXFSIZE(1000)
  Alter an existing queue to have a maximum file size of 1000 MB

DIS QSTATUS(NEWQUEUE) CURFSIZE CURMAXFS
QUEUE(NEWQUEUE) CURFSIZE(39) CURMAXFS(500)
  The queue is using 39 MB of its 500 MB
32. Logging
Distributed supports both linear and circular logging
Linear logging is of interest with deep queues as it allows you to periodically back up queue files onto the log, i.e.
create a media image
As with shared queues, on distributed you need to think carefully as to when you create a media image of a very
large queue file as it will consume a lot of log space, which you will have to maintain
Queue managers can be configured to create media images automatically based on time, or amount of log usage
since the last image. If you use this you might want to switch it off for long getting application outages
[Diagram: queue depth over time — regular media images during normal running, the last media image taken
before the getting application outage, no images while the queue is deep, regular media images resuming after
the getting application restarts]
34. Recommended reading
For z/OS much of this information, along with some example values, is in the capacity planning and tuning
guide (MP16). I strongly recommend reading it
http://ibm-messaging.github.io/mqperf/mp16.pdf
For distributed, take a look at
https://ibm-messaging.github.io/mqperf/MQ_Performance_Best_Practices_v1.0.1.pdf