Three gotchas with the Azure Service Bus
I’ve been writing some fresh code using Azure Service Bus Queues in recent weeks. Overall I’m very impressed. The platform is good, stable and the Client APIs (at least in the form of Microsoft.ServiceBus.dll that I’ve used) is quite modern in design and layout. It’s only slightly annoying that the Client APIs seem to use the old fashioned Begin/End async pattern that was perhaps more in vogue back in the .NET 1.0 to 2.0 days. Why not just return TPL Tasks?
However, there have been a few larger gotchas that I’ve discovered which can quite easily turn into non-trivial problems for a developer to safely work around. These are the sort of problems that can inherently change the way your application is designed.
Deferring messages via
I’m of the opinion that a Service Bus should take care of message redelivery mechanisms itself. On the most part, Azure Service Bus does this really well. But it supports this slightly bizarre type of return notification called deferral (invoked via a
BeginDefer() method). This basically sets a flag on the message internally so that it will never be implicitly redelivered by the queue to your application. But the message will fundamentally still exist inside the queue and you can even still
Receive() it by asking for it by its
SequenceId explicitly. That’s all good and everything but it leaves your application with a bigger problem. Where does it durably store those
SequenceId‘s so that it knows what messages it has deferred? Sure you could hold them in-memory, that would be the naive approach and seems to be the approach taken by the majority of Azure books and documentation. But that is, frankly, a ridiculous idea and its insulting that authors in the fault-tolerant distributed systems space can even suggest such rubbish. The second problem is of course what sort of retry strategy do you adopt for that queue of deferred
SequenceId‘s. Then you have to think about the transaction costs (i.e. money!) involved of whatever retry strategy you employ. What if your system has deferred hundreds of thousands of millions of messages? Consider that those deferred messages were outbound e-mails and they were being deferred because your mail server is down for 2 hours. If you were to retry those messages every 5 seconds, that is a lot of Service Bus transactions that you’ll get billed for.
One wonders why the
Defer() method doesn’t support some sort of time duration or absolute point in time as a parameter that could indicate to the Service Bus when you actually want that message to be redelivered. It would certainly be a great help and I can’t imagine it would require that much work in the back-end for the Azure guys.
So how do you actually solve this problem?
For now, I have completely avoided the use of
Defer() in my system. When I need to defer a message I will simply not perform any return notification for the message and I will allow the PeekLock to expire by its own accord (which the Service Bus handles itself). This approach has the following application design side affects:
- The deferral and retry logic is performed by the Service Bus entirely. My application does not need to worry about such things and the complexities involved.
- The deferral retry time is constant and is defined at queue description level. It cannot be controlled dynamically on a per message basis.
- Your queue’s
DefaultTimeToLiveparameters will become inherently coupled and will need to be explicitly controlled.
(MaxDeliveryCount x LockDuration)will determine how long a message can be retried for and at what interval. If your
LockDurationis 1.5 minutes and you want to retry the message for 1 day then
MaxDeliveryCount = (1 day / 1.5 minutes) = 960.
This is a good stop-gap measure whilst I am iterating quickly. For small systems it can perhaps even be a permanent solution. But sooner or later it will cause problems for me and will need to be refactored.
I think the key to solving this problem is gaining better understanding over the reason why the message is being deferred in the first place, therefore providing you with more control. In my particular application it can only be caused when for instance an e-mail server is down or unreachable etc. So maybe I need some sort of watchdog in my application that (eventually) detects when the e-mail server is down and then actively stops trying to send messages, and indeed maybe even puts the brakes on actually
Receive()‘ing messages from the queue in the first place. For those messages that have been received already then maybe there should be a separate queue called something like “email-outbox-deferred” (note the suffix). Messages queued on this would not actually be the real message but simply a pointer record that points back to the SequenceId of the real one on the “email-outbox” queue. When the watchdog detects that the e-mail server has come back up then it can start opening up the taps again. Firstly it would perform a
Receive() loop on the “email-outbox-deferred” queue and attempt to reprocess those messages by following the SequenceId pointers back to the real queue. If it manages to successfully send the e-mail then it can issue a Complete() on both the deferred pointer message and the real message; to entirely remove it from the job queue. Otherwise it can
Abandon() them both and the watchdog can start from square one by waiting to gain confidence of the e-mail servers health before retrying again.
The key to this approach is the watchdog. The watchdog must act as a type of valve that can open and close the
Receive() loops on the two queues. Without this component you are liable to create long unconstrained loops or even infinite-like loops that will cause you to have potentially massive Service Bus transaction costs on your next bill from Azure.
I believe what I have described here is considered to be a SEDA or “Staged event-driven architecture“. Documentation of this pattern is a bit thin on the ground at the moment. Hopefully this will start to change as enterprise-level PaaS cloud applications gain more and more traction. But if anyone has any good book recommendations… ping me a message.
I’d be interested in learning more about message deferral and retry strategies, so please comment!
Transient fault retry logic is not built into the Client API
Transient faults are those that imply there is probably nothing inherently wrong with your message. It’s just that the Service Bus is perhaps too busy or network conditions dictate that it can’t be handled at this time. Fortunately the Client API includes a nice
IsTransient boolean property on every
MessagingException. Making good use of this property is harder than it first appears though.
All the Azure documentation that I’ve found makes use of (the rather hideous) Enterprise Library Transient Fault Block pattern. That’s all fine and good. But who honestly wants to be wrapping up every Client API action they do in that? Sure you can abstract it away again by yourself but where does it end?
It seems odd that the Client API doesn’t have this built in. Why when you invoke some operation like
Receive() can’t you specify a transient fault retry strategy as an optional parameter? Or hell, why can’t you just specify this retry strategy at a
I remain hopeful that this is something the Azure guys will fix soon.
Dead lettering is not the end
You may think that once you’ve dead lettered a message that you’ll not need to worry about it again from your application. Wrong.
When you dead letter a message it is actually just moved to a special sub-queue of your queue. If left untouched, it will remain in that sub-queue forever. Forever.
Yes, forever. Yes, a memory leak. Eventually this will bring down your application because your queue will run into its memory limit (which can only be a maximum of 5GB). Annoyingly most developers are simply not aware of the dead letter sub-queues existence because it does not show up as a queue on the Server Explorer pane in Visual Studio. Bit of an oversight that one!
Having a human flush this queue out every now and then is not an acceptable solution for most systems. What if your system has a sudden spike in dead letters. Maybe a rogue system was submitting messages to your queues using an old serialization format or something? What if there were millions of these messages? Your application is going to be offline quicker than any human can react. So you need to build this logic into your application itself. This can be done by a watchdog process that keeps track of how many messages are being put onto the dead letter queue and actively ensures it is frequently pruned. This is very much a non-trivial problem.
Alternatively you can avoid the use of dead lettering entirely. This seems drastic but it may not be such a bad idea actually. You should consider if you actually care enough about retaining that carbon-copy of a message to keep it around as-is. Ask yourself whether just some simple and traditional trace/log output of the problem and approximate message content would be sufficient? Dead lettering is inherently a human concept that is analogous to “lost and found” or a “spam folder”. So perhaps with fully automated systems that desire as little human influence or administrative effort as possible then avoiding dead lettering entirely is the best choice.