Nathan Evans' Nemesis of the Moment

WebSockets versus REST… fight!

Posted in .NET Framework, Software Design by Nathan B. Evans on December 16, 2011

Where will WebSockets be on this InfoQ chart, in three years’ time?

On 8th December 2011, a little-known (but increasingly talked-about) standard called “WebSockets” was upgraded to W3C Candidate Recommendation status. That is one small step shy of becoming a fully ratified web standard. And just to remove any remaining doubt: next-gen platforms and frameworks such as Windows 8 and .NET 4.5 (at three levels: System.Net, WCF and ASP.NET) already have deeply integrated support, and they aren’t even in beta yet!

After reading up about the standard in detail and absorbing the various online discussion around it, it is becoming increasingly clear to me that this standard is going to steal a large chunk of mind share from RESTful web services. What I mean is that there will come a stage in product development where somebody will have to ask the question:

Right guys, shall we use WebSockets or REST for this project?

I expect that WebSockets will, within a year or two, begin stunting the growth of RESTful web services – at least as we know them today.

What are WebSockets?

They are an overdue and rather elegant protocol extension for HTTP 1.1 that allows what is fundamentally a bi-directional TCP data stream to be tunnelled over an HTTP session. They provide built-in message framing on top of the TCP stream, so developers don’t need to write that kind of boilerplate code when designing their application protocol.
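
To make the “protocol extension for HTTP” point a little more concrete, here is a minimal sketch of the one calculation a server performs during the opening handshake, going by the final protocol specification (RFC 6455). The client sends a random Sec-WebSocket-Key header with its upgrade request, and the server proves it understood the request by replying with a derived Sec-WebSocket-Accept value. The function name is my own invention; the GUID constant and the sample key come from the RFC.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for the opening handshake.
WEBSOCKET_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def compute_accept_key(sec_websocket_key: str) -> str:
    """Derive the Sec-WebSocket-Accept header value from the client's key."""
    combined = (sec_websocket_key + WEBSOCKET_GUID).encode("ascii")
    digest = hashlib.sha1(combined).digest()
    return base64.b64encode(digest).decode("ascii")


# Sample key taken from the example handshake in RFC 6455.
print(compute_accept_key("dGhlIHNhbXBsZSBub25jZQ=="))  # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```

Once that exchange has completed, the connection stops speaking HTTP and both ends are free to send framed messages whenever they like.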

Why are WebSockets a threat to RESTful web services?

From the last few years of working on projects that expose RESTful web services, I have noticed a few shortcomings. I should probably make clear that I’m not claiming that WebSockets answers all those shortcomings. I’m merely suggesting that REST is not the silver bullet solution that it is often hyped up to be. What I am saying is that there is definitely space for another player that can still operate at “web scale”. WebSockets have more scope to be a little more like a black box or quick’n’dirty solution than REST which requires a more design up-front approach due to versioning and public visibility concerns. Always use the correct tool for the job, as they say.

Sub-par frameworks

They might claim REST support but they still haven’t truly “grokked” it yet, in my opinion. WCF REST is a good example of this. Admittedly, the WCF Web API for .NET is starting to get close to where things should be, but it is not yet production ready.

Perhaps even more serious is the lack of widespread cross-platform RESTful clients that work in the way that Roy prescribed; of presenting an entry point resource that allows the client to automatically discover and autonomously navigate between further nested resources in a sort of state machine fashion. A single client framework that can operate with hundreds of totally different RESTful web services from different organisations. This does not exist yet, today. This is why so many big providers of RESTful web services end up seeding their own open source projects in various programming languages to provide the essential REST client.

Enterprise loves SOAP (and other RPCs)

Third parties that want to use your web services often prefer SOAP over REST. Many haven’t even heard of REST! WebSockets are a message-based protocol, allowing for the SOAP-like RPC protocols that enterprises seem to adore so much. Hell, there’s nothing stopping actual SOAP envelopes being transferred over a WebSocket!

This might not be the case if you’re operating in an extremely leading edge market such as perhaps cloud computing where everyone is speaking the same awesomesauce.

Complex domain models

Mapping out complex domain models onto REST can be slow and laborious. You’ll find yourself constantly having to work around its architectural constraints. Transactions, for example, can be a particular problem. Of course, this is partly related to the first problem (sub-par frameworks) but one cannot reasonably expect transaction support in a REST framework. What is probably needed is a set of common design patterns for mapping domain models to REST. And then an extension library for the framework that provides reusable implementations of those patterns. But alas, none of this exists yet.

Text-based formats

JSON/XML (for reasons unknown) are commonly used with REST and these are of course text-based formats. This is great for interoperability and cross-platform characteristics. But it is not so great for memory and bandwidth usage. This especially has implications on mobile devices.

You’ll find yourself running into walls if you try to use something that isn’t JSON or XML, at least that is my experience with current frameworks.

Request-response architecture of HTTP

Fundamentally, REST is nothing more than a re-specification of the way HTTP works and a proposal of a design pattern to build applications on top of HTTP. This means it retains the same statelessness and sessionless characteristics of HTTP. It therefore precludes REST from being bi-directional where the server could act as the requester of some resource from the client, or sender of some message to the client. As a result it requires “hacks” to be used to emulate server-side events, and these hacks have bad characteristics such as high latency (round trip time) and are wasteful of battery life.

Public visibility, versioning concerns

Sometimes having everything publicly visible is not what you want. People start using APIs that you don’t want them to use yet. You have to design everything to the nth degree much more. Have a proper versioning strategy in place. It encourages a more discerning approach to software development, that is for sure. Whilst these are usually good things, they can be a hindrance on early stage “lean agile” projects.

What can WebSockets do that is so amazing?

The fact that there will soon be a second player in this space suggests that there will be rebalancing of use-cases. WebSockets will prove to be disruptive for several reasons:

True bi-directional capability and server-side events, no hacks

Comet, push technology, long-polling etc. in web apps are slow, inefficient, inelegant and have a greater potential for unreliability. They often work by requesting a resource from the server, causing the server to block until such a time that an event (or events) needs to be transferred back to the client. They can be unreliable because the TCP connection could be torn down by an intermediate router while it is waiting for the response. Or worse, a proxy server might deliberately time out the long-running request. As such, many implementations of this hack will use some kind of self-timeout mechanism so that perhaps every 60 seconds they will reissue the request to the server anyway. This has implications for both bandwidth and battery usage.
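
The long-polling hack described above can be imitated in a few lines. This is only a toy simulation under my own assumptions: an in-process queue stands in for the server's pending events, the function names are hypothetical, and the 60-second self-timeout is scaled down to a fraction of a second so that it runs quickly. No real HTTP is involved.

```python
import queue
import threading


def long_poll(events: queue.Queue, timeout_s: float):
    """Server side: block until an event is available or the self-timeout expires."""
    try:
        return events.get(timeout=timeout_s)
    except queue.Empty:
        return None  # empty response; the client has to ask again


def client_loop(events: queue.Queue, timeout_s: float, wanted: int):
    """Client side: keep re-issuing the request until `wanted` events have arrived."""
    received, requests_made = [], 0
    while len(received) < wanted:
        requests_made += 1  # each iteration stands for one full HTTP round trip
        event = long_poll(events, timeout_s)
        if event is not None:
            received.append(event)
    return received, requests_made


events = queue.Queue()
# Hypothetical server-side event that fires after 0.3 seconds.
threading.Timer(0.3, events.put, args=("new-message",)).start()
received, requests_made = client_loop(events, timeout_s=0.05, wanted=1)
print(received, "delivered after", requests_made, "requests")
```

Every request that comes back empty is a round trip (headers and all) that carried no useful data, which is where the bandwidth and battery cost comes from.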

The true bi-directional capability offered by WebSockets is a first for any HTTP-borne protocol. It is something that neither SOAP nor REST have. And which Comet/push/long-polling can only emulate, inefficiently. The bi-directional capability is inherently so good that you could tunnel a real-time TCP protocol such as Remote Desktop or VNC over a WebSocket, if you wanted.

Great firewall penetration characteristics

WebSockets can tunnel out of heavily firewalled or proxied environments far easier than many other RPC designs. I’m sure I’m not alone in observing that enterprise environments rarely operate their SOAP services on port 80 or 443.

If you can access the web on port 80 without a proxy, WebSockets will work.

If you can access the web on port 80 with a proxy, WebSockets should work as long as the proxy software isn’t in the 1% that are broken and incompatible.

If you can access the web on port 443 with or without a proxy, WebSockets will work.

I strongly suspect that there will be a whole raft of new Logmein/Remote Desktop and VPN solutions that are built on top of WebSockets, purely because of the great tunnelling characteristics.

Lightweight application protocols and TCP tunnelling

There is the potential for extremely lightweight application protocols, in respect of performance, bandwidth and battery usage. Like REST, the application schema/protocol isn’t defined by the standard; it is left completely wide open. WebSockets can transfer either text strings or binary data. It is clear that the text string support was included to aid in transferring JSON messages to JavaScript engines which lack the concept of byte arrays. The binary support, meanwhile, will be most useful for tunnelling TCP streams or for custom RPC implementations. After a WebSocket session is established, the overhead per message can be as small as just two bytes (!) for a short server-to-client message; client-to-server messages carry an extra four-byte masking key. Compare that to REST which has a huge HTTP header to attach to every single request and response.
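
To show where the two-byte figure comes from, here is a minimal sketch of the frame layout from the final protocol specification (RFC 6455). It only encodes a single, unfragmented, unmasked frame (the server-to-client direction); frames sent by a client must additionally carry a four-byte masking key, which this sketch leaves out. The function name is hypothetical.

```python
import struct


def encode_frame(payload: bytes, opcode: int = 0x2) -> bytes:
    """Encode one unmasked, unfragmented WebSocket frame."""
    header = bytes([0x80 | opcode])  # FIN bit set, plus opcode (0x1 text, 0x2 binary)
    length = len(payload)
    if length <= 125:
        header += bytes([length])  # length fits in the second byte
    elif length <= 0xFFFF:
        header += bytes([126]) + struct.pack("!H", length)  # 16-bit extended length
    else:
        header += bytes([127]) + struct.pack("!Q", length)  # 64-bit extended length
    return header + payload


frame = encode_frame(b"hello")
print(len(frame) - len(b"hello"))  # 2
```

Payloads longer than 125 bytes need a 4-byte header and payloads over 64 KiB a 10-byte one, which is still tiny next to a typical set of HTTP request headers.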

How will the use-cases of REST change?

I believe that REST will lose a certain degree of its lustre. Project teams will less eagerly adopt it if they can get away with a bare bones WebSocket implementation. REST will probably remain the default choice for projects that need highly visible and cross-platform interoperable web services.

Projects without those requirements will probably opt for WebSockets instead and either run JSON over it, or use a bespoke wire protocol. They will particularly be used by web and mobile applications for their back-end communications i.e. data retrieval and push-events. Windows 8 “Metro” applications will need to use them extensively.

I suppose you could summarise that as:

  • REST will be (and remain) popular for publicly visible interfaces.
  • WebSockets will be popular for private, internal or “limited eyes only” interfaces.

Note: By “public” and “private” I am not referring literally to some form of paid/subscription/membership web service. I am referring to the programming API contract and its level of exposure to eyes outside of your development team or company.

Conclusion

Even though they are competing, the good thing is that REST and WebSockets can actually co-exist with one another. In fact, because they are both built upon HTTP fundamentals they will actually complement each other. A RESTful hypermedia resource could actually refer to a WebSocket as though it were another resource through a ws:// URL. This will pave the way for new RESTful design patterns and framework features. It will allow REST to remedy some of its shortcomings, such as with transaction support; because a WebSocket session could act as the transactional unit of work.

The next year is going to be very interesting on the web.

Building automated two-way applications on top of SMS text messaging

Posted in Software Design by Nathan B. Evans on November 30, 2011

For the past 8 years of my life I have been engrossed in the development of fully automated applications that use two-way SMS text messaging as their communication layer. SMS started life as nothing more than what was basically the “ICMP protocol” of GSM networks. It used to be fairly hidden away in the menus of those early Nokia phones. And even then it was very much akin to sending an “ICMP ping” message to your friend, and then he pinged you back. I guess that’s where modern services like “PingChat” got their name!

SMS is a very simple protocol; there are only three essential things you need to understand:

  1. It is limited to 160 characters per message, if you use the GSM 03.38 7-bit character set.
  2. It is limited to 70 characters per message, if you use the UCS-2 (a.k.a. Unicode, UTF-16) character set.
  3. Multiple messages can be joined together to form a multi-part message by including a special concatenation header, which eats up 7 or 3 characters of each part (depending on whether you’re using the GSM or UCS-2 character set), leaving 153 or 67 characters per part. Most phones these days refer to this concept on their GUI as “pages”.
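
The three rules above boil down to a small piece of arithmetic. The sketch below is only illustrative: the function and table names are mine, and it assumes the commonly quoted per-part capacities for concatenated messages of 153 characters (GSM) and 67 characters (UCS-2). It also treats every character as one unit, so it ignores GSM extension characters such as “€” that occupy two slots.

```python
import math

# Capacities in characters. The per-part figures are an assumption based on the
# commonly quoted limits for concatenated messages.
LIMITS = {
    "gsm7": {"single": 160, "per_part": 153},
    "ucs2": {"single": 70, "per_part": 67},
}


def count_parts(char_count: int, encoding: str) -> int:
    """Return how many SMS parts ("pages") a message of this length needs."""
    limits = LIMITS[encoding]
    if char_count <= limits["single"]:
        return 1  # fits in one message, no concatenation header needed
    return math.ceil(char_count / limits["per_part"])


print(count_parts(160, "gsm7"))  # 1
print(count_parts(161, "gsm7"))  # 2
print(count_parts(71, "ucs2"))   # 2
```

Note the cliff edge: a single extra character past the one-message limit doubles the number of messages sent.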

Unfortunately the protocol is severely handicapped when it comes to building automated two-way applications, and here’s why:

It does not provide any facility, not even an extension standard or extension point, for performing reply correlation.

What do I mean by “reply correlation”? It is a simple concept. Assume that you send a question to a buddy, and then he responds to you with the answer. One might expect that the message containing the answer carries some sort of ID code, token or cookie (hidden away in its header information, of course) that relates it to the original question message. Unfortunately, it does not and this is the problem; SMS does not include any such ID/token/cookie, anywhere. It simply wasn’t included in either the original standard or any subsequent revisions or extensions of the standard.

It is not necessarily the creators’ fault because clearly they couldn’t foresee how ubiquitous SMS would become. But there is evidence that they did recognise its popularity in the late 1990s and responded very quickly by publishing new standard extensions that built upon SMS, such as multi-part messages and WAP. So one can only wonder why they didn’t make an extension that would allow replies to be correlated with their original message. Unfortunately the window of opportunity to get this sorted out closed over a decade ago, so we’re pretty much stuck and will have to make do with it.

This is a big problem for SMS. It makes the process of building two-way fully automated applications much more difficult. Very, very few companies have actually managed to solve the problem, and those that have tend to be very small or operating in niche markets. I don’t understand this at all, because the possibilities and prospects for building two-way SMS applications are absolutely huge, almost endless.

One of my key responsibilities over the last 8 years has been in devising production-ready solutions that work around this problem and this blog post is going to summarise all of them.

An overview of the solution

The key to solving this problem lies with two fields contained within the header information of every SMS message: the source and destination address. Or what I call the “address pairing”.

By using the address pairing in an intelligent way we can find the right compromise for a particular two-way application. Essentially, whenever the application needs to send a question to a mobile phone number it must ensure that no existing question is already outstanding on the same address pairing.

There are several ways that an application can be designed around this basic concept.

Solution #1: My application only has one source address

The application must be designed to “serialise” the transmission of questions. It can use a mutual exclusion mechanism that will prevent itself sending a further question to the same mobile phone number if an outstanding question is still waiting for a response. It can be expected that some characteristics of a “transaction” or “transactional unit of work” would be adopted in the design of the application to model this mutual exclusion concept.

My past implementations of this pattern were based on a database table with a composite primary key between both the “source” and “destination” columns. The application would try to insert the address pairing into this table and, if successful, it continues sending the message. But if the insertion were to fail then it would realise that a question is already outstanding with the mobile phone number, prompting it to give up and retry later. Or rather than retrying later based on some timer mechanism, you might enqueue it as a job somewhere; so that when a response for the outstanding question is received the application can check the queue for further jobs for that address pairing and dequeue/execute the top job.
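
A minimal sketch of that table-based lock, using SQLite purely for illustration. The table name, column names, function names and sample addresses are all my own assumptions rather than a description of any production system. The composite primary key is what does the work: the second insert for the same pairing fails, which tells the application that a question is already outstanding.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE outstanding_question (
        source      TEXT NOT NULL,
        destination TEXT NOT NULL,
        PRIMARY KEY (source, destination)
    )
""")


def try_acquire(source: str, destination: str) -> bool:
    """Claim the address pairing; False means a question is already outstanding."""
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO outstanding_question (source, destination) VALUES (?, ?)",
                (source, destination),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def release(source: str, destination: str) -> None:
    """Free the pairing once the reply has been received."""
    with db:
        db.execute(
            "DELETE FROM outstanding_question WHERE source = ? AND destination = ?",
            (source, destination),
        )


print(try_acquire("60001", "+447700900123"))  # True: question can be sent
print(try_acquire("60001", "+447700900123"))  # False: retry later or enqueue
release("60001", "+447700900123")             # reply arrived
print(try_acquire("60001", "+447700900123"))  # True again
```

Letting the database enforce uniqueness means two application servers racing to send to the same phone cannot both win.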

There is a caveat with this solution however, and it comes as a side effect of “serialising” the questions one after the other. What if it takes days or weeks for the person to respond to the question? The questions that are queued up waiting to acquire a lock on the address pairing are going to get pushed back and back. They could get pushed back so far that the premise of the question has been entirely voided (e.g. an appointment reminder/confirmation).

The solution to this problem is to introduce a further concept of a “timeout” value. This will ensure that any question sent to the mobile phone can only be outstanding for up to a designated time period. You would probably typically set this to around 24-48 hours, but some questions that contain more time sensitive content may use a lower value of between 1-4 hours.
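
One way to picture the timeout is as an expiry time stored alongside each lock. The sketch below keeps the locks in memory and lets the caller pass in the current time so that the behaviour is easy to follow; the class and method names are hypothetical, and a real system would persist this state.

```python
import time


class PairingLocks:
    """In-memory address-pairing locks that expire after a per-question timeout."""

    def __init__(self):
        self._expires_at = {}  # (source, destination) -> expiry time in seconds

    def try_acquire(self, source, destination, timeout_s, now=None):
        now = time.time() if now is None else now
        key = (source, destination)
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False  # an earlier question is still outstanding
        self._expires_at[key] = now + timeout_s  # new lock, or replaces an expired one
        return True

    def release(self, source, destination):
        """Free the pairing early because the reply has arrived."""
        self._expires_at.pop((source, destination), None)


locks = PairingLocks()
one_hour = 60 * 60
print(locks.try_acquire("60001", "+447700900123", one_hour, now=0))         # True
print(locks.try_acquire("60001", "+447700900123", one_hour, now=1800))      # False
print(locks.try_acquire("60001", "+447700900123", one_hour, now=one_hour))  # True
```

A time-sensitive question simply passes a smaller timeout_s, so it holds up the queue behind it for less time.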

It is important (though not essential) when implementing the timeout concept that you also use the “Validity Period” field that is available in every outbound SMS message. You should set the validity period to roughly match the timeout value for that question. This helps ensure messaging integrity. If, for example, the mobile phone is turned off for a week, you don’t want your “expired” questions to be delivered when it is turned back on, because your back-end application will already have timed out the workflow that was running for that question.

Solution #2: My application can have multiple source addresses

The idea is that you would have a relatively large pool of source addresses, perhaps as many as 50 or 100. Your application would, as with Solution #1, maintain some kind of database table or data structure that prevents duplicate address pairings. The application would then have some logic that enables it to “select” a free source address i.e. a source address that is not “in use” for the destination mobile phone number.
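
The “select a free source address” logic might look something like the sketch below. It is a simplification under my own assumptions: the pairings in use are held in an in-memory set rather than a database table, the pool contains made-up short codes, and the selection policy (first free address in pool order) is just the simplest one to write down.

```python
def select_free_source(pool, in_use, destination):
    """Return a pool address with no outstanding question for `destination`, or None."""
    for source in pool:
        if (source, destination) not in in_use:
            return source
    return None  # every source address is busy for this phone


pool = ["60001", "60002", "60003"]  # hypothetical source addresses
in_use = set()                      # outstanding (source, destination) pairings

for question in ("Q1", "Q2", "Q3", "Q4"):
    source = select_free_source(pool, in_use, "+447700900123")
    if source is None:
        print(question, "must wait: pool exhausted for this phone")
    else:
        in_use.add((source, "+447700900123"))
        print(question, "sent from", source)
```

With a pool of 50 or 100 addresses, running out for any single phone becomes very unlikely, which is why the timeout can afford to be so much longer here.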

It would still be advisable to implement some kind of “timeout” mechanism, as with Solution #1, but the advantage would be that you would be able to have substantially greater timeout periods. Possibly in the order of weeks or months. Really the timeout mechanism here would be acting more as a type of garbage collector, than as a question expedite governor as in Solution #1.

I’ve always considered that this solution is better suited to applications that provide a “shared” or cloud service of some kind. Simply because setting up a large pool of dedicated source addresses for each of your application’s customers is surely going to get painful.

This solution does have the disadvantage that end-users will potentially be communicating with lots of different source addresses, even though it is really the same company/application at the other end. It can mess up the user’s normal “texting” experience: it robs them of their iPhone’s “bubble chat” style of presentation and of the ability to create a Contact list entry for a regular contact. Obviously there are things you can do to try to minimise this risk, such as always trying to select the source address with the lowest index. But really I think that will just make things worse. At some point you WILL want to send multiple questions to a mobile phone number, and there’s no getting around that fact. If you’ve got a large pool of source addresses then you’re going to want to use them.

Solution #3: My application only has one source address, but I need to send concurrent questions to the same mobile phone

You can’t. Well you can, but I don’t recommend it at all. I tried it once, on an early version of our system, and our customers didn’t like it.

Essentially you combine the concepts detailed in Solution #1 and then rely on some text processing logic in your response handling code. So rather than perhaps phrasing your question like “Are you attending the meeting tomorrow? Reply with A=Yes or B=No.” You’d phrase it as “… Reply with A1=Yes or B1=No”. Notice the “1” digit in there? That’s the key bit. That digit refers to a transaction code that will be used for correlation. My implementation of this basically went from zero to nine, so you could have a total of 10 concurrent questions open with the same mobile number.
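
The text processing involved is roughly what the sketch below does. The pattern and function name are hypothetical, and it assumes single-letter options followed by a single transaction digit, as in the “A1”/“B1” example above.

```python
import re

# One option letter followed by one transaction digit, e.g. "A1" or "b 7".
REPLY_PATTERN = re.compile(r"^\s*([A-Za-z])\s*([0-9])\s*$")


def parse_reply(text: str):
    """Split a reply such as 'A1' into (option, transaction_code); None if it doesn't fit."""
    match = REPLY_PATTERN.match(text)
    if match is None:
        return None
    return match.group(1).upper(), int(match.group(2))


print(parse_reply("A1"))   # ('A', 1)
print(parse_reply(" b1"))  # ('B', 1)
print(parse_reply("A"))    # None: no digit, so nothing to correlate with
print(parse_reply("Yes"))  # None
```

Anything that comes back as None cannot be tied to a question at all, which is the heart of the problem with this approach.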

I don’t like this solution for the following reasons:

  • Many end-users forget to include the essential digit in their reply. They might reply “A” instead of “A1”. I’ve seen this happen in the wild.
  • Accessing digits on mobile phones when typing an SMS message is often an unintuitive process. Even an iPhone needs you to access a sub-keyboard screen. BlackBerrys need you to hit the ALT key.
  • It prohibits your application from accepting literal text responses. Many users would simply reply “Yes” rather than “A” or “A1”. If they do this, your application would be screwed because it wouldn’t have the essential digit to correlate the reply with the original question. I’ve seen this happen in the wild.
  • It prohibits your application from accepting “freeform” text responses. You might want to send a question like “What is your full name?”. There’s no way you can tag on the end of that a list of options. It simply doesn’t make sense.
  • It reveals implementation details onto the user interface of your application. Not good.
  • It compromises messaging integrity. An end-user might inadvertently reply (or possibly even deliberately!) with an incorrect digit.
  • It requires both the “reply analysis/text processing” and “reply correlation” concerns of your application to be interdependent on each other, when really they should not be – at least not to perform something so simple.

Experimental solutions

On the last bullet point of Solution #3 I suggested that your application’s “reply analysis” and “reply correlation” concerns shouldn’t be linked together. This I believe is true for something as simple as what was described in that solution. However, there is plenty of mileage to be explored in adopting this approach for more advanced designs.

When you send a question with a constrained set of response options such as “Yes, No, Maybe”, you might want to record these as part of your address pairing in the database or data structure (as described in Solution #1). Then if you need to send a further question (to the same mobile phone, whilst the first question is still outstanding) you can check if the set of response options are different. This question might be looking for a “Good, Bad, Ugly” response. In which case there is no conflict, is there? So a lock on the address pairing, based upon those expected response options, can be allowed to be acquired. Obviously this wouldn’t be possible (or at least would have ramifications on your overall design) if you were expecting a “freeform” response.
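
As an experiment, that idea could be sketched like this. Everything here is hypothetical (class name, method names, question IDs); the point is only that a second question is allowed onto the same address pairing when its expected replies don't overlap with those of any question already outstanding, and that a freeform question (options=None) conflicts with everything.

```python
class OptionAwareLocks:
    """Address-pairing locks keyed on the set of replies each question expects."""

    def __init__(self):
        self._held = {}  # (source, destination) -> list of (question_id, options)

    def try_acquire(self, source, destination, question_id, options=None):
        wanted = None if options is None else frozenset(o.strip().lower() for o in options)
        held = self._held.setdefault((source, destination), [])
        for _, existing in held:
            # Freeform questions (None) or overlapping option sets are conflicts.
            if existing is None or wanted is None or existing & wanted:
                return False
        held.append((question_id, wanted))
        return True

    def correlate(self, source, destination, reply):
        """Return the ID of the question this reply answers and release its lock."""
        held = self._held.get((source, destination), [])
        answer = reply.strip().lower()
        for entry in held:
            question_id, options = entry
            if options is None or answer in options:
                held.remove(entry)
                return question_id
        return None  # reply doesn't match anything outstanding


locks = OptionAwareLocks()
phone = ("60001", "+447700900123")
print(locks.try_acquire(*phone, "q1", ["Yes", "No", "Maybe"]))  # True
print(locks.try_acquire(*phone, "q2", ["Good", "Bad", "Ugly"]))  # True
print(locks.try_acquire(*phone, "q3", ["Yes", "No"]))            # False
print(locks.correlate(*phone, "ugly"))                           # q2
```

The reply analysis and the reply correlation are now deliberately intertwined, so the trade-off has to be worth it for your application.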

Another possible avenue to be explored is an area of computer science called “natural language processing”. The idea is that when you ask a question like “What is your name?” then you would prime your NLP engine to be expecting a reply that looks like somebody’s name. Anything that arrives from that mobile phone that doesn’t look like a person’s name can be assumed to not be related to the outstanding question. Obviously if you want to ask a concurrent question like “What is your wife’s name?” then you’re back to square one, because that would be a conflict and you’d need to serialise the questions as described in Solution #1. This (NLP and SMS applications) is an active area of research for me, so I may blog about it in more detail at a later time.

Conclusion

Solution #1 is the best, for now. It strikes the right level of compromise without sacrificing either messaging integrity or user friendliness. If you desperately need to send multiple concurrent questions to a mobile phone then I would suggest that you should rethink your approach. Perhaps logically separating your business departments and/or workflow concerns onto different source addresses would be a solution in this case. That way you can send out an urgent question, perhaps relating to a missed bill payment, on a source address that is dedicated to that purpose.

Solution #2 is usable, and I can think of several use-cases. But I feel it is not as good for frequent one-to-one contact between a company and their customers. It has serious disadvantages in user friendliness. It is best suited to a hosted cloud service of some kind, where everyone shares the same pool of source addresses and where contact is expected to be infrequent.