Nathan Evans' Nemesis of the Moment

Three gotchas with the Azure Service Bus

Posted in .NET Framework, Distributed Systems, Software Design by Nathan B. Evans on March 28, 2013

I’ve been writing some fresh code using Azure Service Bus Queues in recent weeks. Overall I’m very impressed. The platform is good and stable, and the Client APIs (at least in the form of Microsoft.ServiceBus.dll that I’ve used) are quite modern in design and layout. It’s only slightly annoying that the Client APIs seem to use the old-fashioned Begin/End async pattern that was perhaps more in vogue back in the .NET 1.0 to 2.0 days. Why not just return TPL Tasks?

However, there have been a few larger gotchas that I’ve discovered which can quite easily turn into non-trivial problems for a developer to safely work around. These are the sort of problems that can inherently change the way your application is designed.

Deferring messages via Defer()

I’m of the opinion that a Service Bus should take care of message redelivery mechanisms itself. For the most part, Azure Service Bus does this really well. But it supports this slightly bizarre type of return notification called deferral (invoked via a Defer() or BeginDefer() method). This basically sets a flag on the message internally so that it will never be implicitly redelivered by the queue to your application. But the message will fundamentally still exist inside the queue and you can even still Receive() it by asking for it explicitly by its SequenceId (exposed on the message as the SequenceNumber property).

That’s all good and everything but it leaves your application with a bigger problem. Where does it durably store those SequenceIds so that it knows what messages it has deferred? Sure, you could hold them in-memory; that would be the naive approach and seems to be the approach taken by the majority of Azure books and documentation. But an in-memory list is lost the moment your process crashes or restarts, leaving the deferred messages stranded in the queue. It is, frankly, a ridiculous idea and it’s insulting that authors in the fault-tolerant distributed systems space can even suggest such rubbish.

The second problem is of course what sort of retry strategy you adopt for that queue of deferred SequenceIds. Then you have to think about the transaction costs (i.e. money!) of whatever retry strategy you employ. What if your system has deferred hundreds of thousands, or even millions, of messages? Consider that those deferred messages were outbound e-mails and they were being deferred because your mail server is down for 2 hours. If you were to retry those messages every 5 seconds, that is a lot of Service Bus transactions that you’ll get billed for.
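To put a rough number on the e-mail example, here is a minimal sketch in Python. It assumes each retry of a deferred message costs one Service Bus transaction and that retries continue at a fixed interval for the whole outage. The function name and the figure of 100,000 deferred messages are illustrative assumptions, not anything from the Service Bus API or its pricing.

```python
def retry_transactions(deferred_messages: int,
                       outage_seconds: int,
                       retry_interval_seconds: int) -> int:
    """Count retry attempts, assuming one transaction per attempt (an assumption)."""
    # Every deferred message is retried once per interval for the whole outage.
    retries_per_message = outage_seconds // retry_interval_seconds
    return deferred_messages * retries_per_message


# Hypothetical workload: 100,000 deferred e-mails, a 2 hour outage, 5 second retries.
total = retry_transactions(deferred_messages=100_000,
                           outage_seconds=2 * 60 * 60,
                           retry_interval_seconds=5)
print(total)  # 144000000
```

Under these assumptions a single message accounts for 1,440 retries, so the cost is driven almost entirely by the retry interval.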

One wonders why the Defer() method doesn’t support some sort of time duration or absolute point in time as a parameter that could indicate to the Service Bus when you actually want that message to be redelivered. It would certainly be a great help and I can’t imagine it would require that much work in the back-end for the Azure guys.

So how do you actually solve this problem?

For now, I have completely avoided the use of Defer() in my system. When I need to defer a message I will simply not perform any return notification for the message and I will allow the PeekLock to expire of its own accord (which the Service Bus handles itself). This approach has the following application design side effects:

  • The deferral and retry logic is performed by the Service Bus entirely. My application does not need to worry about such things and the complexities involved.
  • The deferral retry time is constant and is defined at queue description level. It cannot be controlled dynamically on a per message basis.
  • Your queue’s MaxDeliveryCount, LockDuration and DefaultMessageTimeToLive parameters will become inherently coupled and will need to be explicitly controlled.
    LockDuration sets the retry interval, and (MaxDeliveryCount × LockDuration) determines how long a message can be retried for. If your LockDuration is 1.5 minutes and you want to retry the message for 1 day then MaxDeliveryCount = (1 day / 1.5 minutes) = 960. The time-to-live has to cover that whole window too, otherwise the message expires before its retries are used up.
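The arithmetic in the last bullet can be sketched as a small helper, written here in Python. The function names are hypothetical. The only assumption is the one stated above: that each lock expiry triggers exactly one redelivery.

```python
import math
from datetime import timedelta


def max_delivery_count(retry_window: timedelta, lock_duration: timedelta) -> int:
    """Deliveries needed for lock expiry to keep retrying across the whole window."""
    # Dividing two timedeltas yields a float ratio; round up to cover a partial interval.
    return math.ceil(retry_window / lock_duration)


def ttl_covers_window(time_to_live: timedelta, retry_window: timedelta) -> bool:
    """The message must outlive its retry window, or it expires mid-retry."""
    return time_to_live >= retry_window


print(max_delivery_count(timedelta(days=1), timedelta(minutes=1.5)))  # 960
```

Changing any one of the three values means recomputing the others, which is the coupling the bullet warns about.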

This is a good stop-gap measure whilst I am iterating quickly. For small systems it can perhaps even be a permanent solution. But sooner or later it will cause problems for me and will need to be refactored.

I think the key to solving this problem is gaining a better understanding of why the message is being deferred in the first place, which in turn gives you more control. In my particular application it can only happen when, for instance, an e-mail server is down or unreachable. So maybe I need some sort of watchdog in my application that (eventually) detects when the e-mail server is down and then actively stops trying to send messages, and indeed maybe even puts the brakes on actually Receive()-ing messages from the queue in the first place.

For those messages that have been received already, maybe there should be a separate queue called something like “email-outbox-deferred” (note the suffix). Messages queued on this would not actually be the real message but simply a pointer record that points back to the SequenceId of the real one on the “email-outbox” queue.

When the watchdog detects that the e-mail server has come back up, it can start opening up the taps again. Firstly it would perform a Receive() loop on the “email-outbox-deferred” queue and attempt to reprocess those messages by following the SequenceId pointers back to the real queue. If it manages to successfully send the e-mail then it can issue a Complete() on both the deferred pointer message and the real message, to entirely remove it from the job queue. Otherwise it can Abandon() them both and the watchdog can start from square one by waiting to gain confidence in the e-mail server’s health before retrying again.

The key to this approach is the watchdog. The watchdog must act as a type of valve that can open and close the Receive() loops on the two queues. Without this component you are liable to create long unconstrained loops or even infinite-like loops that will cause you to have potentially massive Service Bus transaction costs on your next bill from Azure.
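As a rough illustration of the valve idea, here is a minimal Python sketch. Everything in it is hypothetical: the `Watchdog` class, the `pump_once` helper, and the `receive` and `send_email` callables are stand-ins for the real queue client and mail client. It closes the valve after a run of consecutive failures and allows a single probe once a cool-down has elapsed.

```python
import time
from typing import Callable, Optional


class Watchdog:
    """Valve over the receive loop: closes on repeated failures, reopens after a cool-down."""

    def __init__(self, failure_threshold: int = 3, cooldown_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._consecutive_failures = 0
        self._closed_at: Optional[float] = None

    def is_open(self) -> bool:
        if self._closed_at is None:
            return True
        # Once the cool-down has passed, let one attempt through as a health probe.
        return self._clock() - self._closed_at >= self.cooldown_seconds

    def record_success(self) -> None:
        self._consecutive_failures = 0
        self._closed_at = None

    def record_failure(self) -> None:
        self._consecutive_failures += 1
        if self._consecutive_failures >= self.failure_threshold:
            # (Re)start the cool-down; a failed probe lands here again.
            self._closed_at = self._clock()


def pump_once(watchdog: Watchdog, receive: Callable[[], Optional[str]],
              send_email: Callable[[str], None]) -> str:
    """One iteration of the receive loop, gated by the watchdog."""
    if not watchdog.is_open():
        return "paused"  # no receive call is made, so nothing is billed
    message = receive()
    if message is None:
        return "idle"
    try:
        send_email(message)
    except ConnectionError:
        watchdog.record_failure()
        return "abandoned"  # the caller would Abandon() the message here
    watchdog.record_success()
    return "completed"  # the caller would Complete() the message here
```

The important property is that a closed valve returns before any receive call is made, so a long outage costs a handful of probes rather than a retry for every message.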

I believe what I have described here is considered to be a SEDA or “Staged event-driven architecture”. Documentation of this pattern is a bit thin on the ground at the moment. Hopefully this will start to change as enterprise-level PaaS cloud applications gain more and more traction. But if anyone has any good book recommendations… ping me a message.

I’d be interested in learning more about message deferral and retry strategies, so please comment!

Transient fault retry logic is not built into the Client API

Transient faults are those that imply there is probably nothing inherently wrong with your message. It’s just that the Service Bus is perhaps too busy or network conditions dictate that it can’t be handled at this time. Fortunately the Client API includes a nice IsTransient boolean property on every MessagingException. Making good use of this property is harder than it first appears though.

All the Azure documentation that I’ve found makes use of (the rather hideous) Enterprise Library Transient Fault Handling Application Block. That’s all fine and good. But who honestly wants to be wrapping up every Client API action they do in that? Sure, you can abstract it away again by yourself, but where does it end?

It seems odd that the Client API doesn’t have this built in. Why when you invoke some operation like Receive() can’t you specify a transient fault retry strategy as an optional parameter? Or hell, why can’t you just specify this retry strategy at a QueueClient level?
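For illustration, this is roughly the kind of wrapper you end up writing yourself, sketched in Python rather than against the real client. `MessagingError` is a hypothetical stand-in for MessagingException with its IsTransient flag, and the exponential back-off policy is my own assumption, not something the Client API prescribes.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class MessagingError(Exception):
    """Hypothetical stand-in for the client's MessagingException."""

    def __init__(self, message: str, is_transient: bool):
        super().__init__(message)
        self.is_transient = is_transient


def with_transient_retry(operation: Callable[[], T],
                         max_attempts: int = 4,
                         base_delay: float = 0.5,
                         sleep: Callable[[float], None] = time.sleep) -> T:
    """Run operation, retrying only transient faults with exponential back-off."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except MessagingError as error:
            # Give up at once on non-transient faults or when attempts run out.
            if not error.is_transient or attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
    raise RuntimeError("unreachable")  # the loop always returns or raises
```

Every call site then has to be written as `with_transient_retry(lambda: ...)`, which is exactly the wrapping that a retry strategy configured once on the client would make unnecessary.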

I remain hopeful that this is something the Azure guys will fix soon.

Dead lettering is not the end

You may think that once you’ve dead lettered a message that you’ll not need to worry about it again from your application. Wrong.

When you dead letter a message it is actually just moved to a special sub-queue of your queue. If left untouched, it will remain in that sub-queue forever. Forever. Yes, forever. Yes, effectively a memory leak. Eventually this will bring down your application because your queue will run into its size limit (which can only be a maximum of 5GB). Annoyingly, most developers are simply not aware of the dead letter sub-queue’s existence because it does not show up as a queue on the Server Explorer pane in Visual Studio. Bit of an oversight, that one!

Having a human flush this queue out every now and then is not an acceptable solution for most systems. What if your system has a sudden spike in dead letters? Maybe a rogue system was submitting messages to your queues using an old serialization format or something? What if there were millions of these messages? Your application is going to be offline quicker than any human can react. So you need to build this logic into your application itself. This can be done by a watchdog process that keeps track of how many messages are being put onto the dead letter queue and actively ensures it is frequently pruned. This is very much a non-trivial problem.
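The pruning half of such a watchdog might look something like the following Python sketch. It is deliberately simplified: the `deque` stands in for the dead letter sub-queue, `popleft()` stands in for receiving and completing a message from it, and `prune_dead_letters` and its parameters are hypothetical names.

```python
from collections import deque
from typing import Callable, Deque


def prune_dead_letters(dead_letters: Deque[str],
                       max_retained: int,
                       archive: Callable[[str], None]) -> int:
    """Remove the oldest dead letters until at most max_retained remain.

    Returns the number of messages removed.
    """
    removed = 0
    while len(dead_letters) > max_retained:
        message = dead_letters.popleft()  # stand-in for Receive() + Complete()
        archive(message)  # e.g. write a trace/log entry before the copy is gone
        removed += 1
    return removed


# Usage: keep only the most recent dead letter, logging the rest.
log = []
dlq = deque(["a", "b", "c", "d"])
print(prune_dead_letters(dlq, max_retained=1, archive=log.append))  # 3
```

A real watchdog would run this on a timer and would also want to alert on the rate of incoming dead letters, since a sudden spike usually points at a fault upstream.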

Alternatively you can avoid the use of dead lettering entirely. This seems drastic but it may not be such a bad idea actually. You should consider whether you actually care enough about retaining that carbon-copy of a message to keep it around as-is. Ask yourself whether just some simple and traditional trace/log output of the problem and approximate message content would be sufficient. Dead lettering is inherently a human concept that is analogous to “lost and found” or a “spam folder”. So for fully automated systems that want as little human influence or administrative effort as possible, avoiding dead lettering entirely may be the best choice.


11 Responses


  1. […] Three gotchas with the Azure Service Bus (Nathan Evans) […]

  2. Reading Notes 2013-04-08 | Matricis said, on April 8, 2013 at 6:44 PM

    […] Three gotchas with the Azure Service Bus – Very interesting post that goes deeper than usual and explains clearly some details of the Azure service bus. […]

  3. Nathan Alden, Sr. said, on June 3, 2013 at 4:39 PM

    Another big issue: there is no support for transactions that span multiple queues or that span a queue and a database. The recommended solution? Queuing messages in an “outbox” table in the database. This is a huge omission, IMO. Yes, I get that MSDTC is a single point of failure, but MS really needs to provide an MSDTC-for-Azure, so-to-speak. As it is, I’m forced to accept eventual consistency (even if it’s not appropriate) or use an external system like NServiceBus.

    • Kartik paramasivam said, on December 11, 2013 at 5:07 AM

      Service Bus does support transactions across queues. For example, within a transaction you can send messages to multiple queues via a source queue.

      In a transaction you can also complete a message that you have received from a queue and send another message to a destination queue via the source queue.

  4. Michael said, on August 29, 2013 at 2:25 PM

    I found your blog post when searching for solutions to these exact problems that are driving me nuts. The deferral with no time option is half-baked. I ended up making my own retry strategy, where it will mark a message complete, and just clone/create a new one with a ScheduledEnqueueTimeUtc in the future, based on my own rules. I don’t feel like I should have to do this though. Transient fault handling should be built in to the .NET SDK, instead of requiring an application block. At least I’m not the only one hitting this stuff. I agree with Nathan as well. I only store the partition and row key in my queue message, which gets updated when the message is processed. It would be nice if that could be in some kind of transaction.

  5. Kartik paramasivam said, on December 11, 2013 at 5:09 AM

    With Azure SDK 2.1 and above, the Service Bus SDK does support transient fault handling by default.

  6. Daniel Ryan said, on November 20, 2014 at 11:38 PM

    Although this article is well written, I think people should be aware that there is quite a large amount of misunderstanding of how Service Bus works. In particular, the assumption that the dead-letter queue contains a journal of all messages. In fact, only messages that have failed (and failed any retries) will end up in this queue. When this is understood, there is no need to implement the complicated strategy mentioned (using the defer and a separate queue containing the IDs).

    My take on things is to use the transient fault handling block to implement a first level of retries for a short period of time. As Hillary mentions, this is now baked into the SDK. If necessary a second level of retries can be performed over a longer period of time using the queue parameters mentioned in this article. If this fails, then the letters will end up in the dead-letter queue and will need to be resubmitted once the issue(s) have been fixed.

    Regarding the email server being down, this sounds like an ideal candidate for the circuit breaker strategy, for which there is lots of information available.

    Oh, and the transient fault handling block is actually very nice. It can also be easily customised to support any scenario where you want to retry something for a period of time before giving up.

  7. Gooey D said, on February 17, 2016 at 8:20 PM

    Microsoft provides a pretty good breakdown on how to make use of the built-in retry policies:

  8. Justin J Stark said, on October 21, 2016 at 3:46 PM

    Another option to deferring or letting the peek-lock expire is to clone the message, requeue it with a future enqueue timestamp, and then complete the message. This allows finer-grained control of the retry schedule and allows a retry interval of more than 5 minutes, which is the maximum peek-lock expiration.
