Building automated two-way applications on top of SMS text messaging
For the past 8 years of my life I have been engrossed in the development of fully automated applications that use two-way SMS text messaging as their communication layer. SMS started life as little more than the “ICMP protocol” of GSM networks. It used to be fairly hidden away in the menus of those early Nokia phones. And even then it was very much akin to sending an “ICMP ping” message to your friend, who then pinged you back. I guess that’s where modern services like “PingChat” got their name!
SMS is a very simple protocol; there are only three essential things you need to understand:
- It is limited to 160 characters per message, if you use the GSM 03.38 7-bit character set.
- It is limited to 70 characters per message, if you use the UCS-2 character set (the 16-bit Unicode encoding, often loosely called UTF-16).
- Multiple messages can be joined together to form a multi-part message by including a special concatenation header. With the usual 6-byte header this eats up 7 characters of each part in the GSM character set or 3 characters in UCS-2, leaving 153 or 67 characters per part. Most phones these days refer to this concept on their GUI as “pages”.
Unfortunately the protocol is severely handicapped when it comes to building automated two-way applications, and here’s why:
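To make the arithmetic concrete, here is a minimal sketch that estimates how many parts a message body will need. It assumes the common 6-byte concatenation header (153 GSM or 67 UCS-2 characters per part). The alphabet tables are typed out by hand for illustration, and `sms_parts` is a name I made up for this example. It also ignores the rule that an escaped character cannot be split across two parts, so treat the result as an estimate.

```python
import math

# GSM 03.38 default alphabet (basic table) and its extension table.
# Typed out by hand for this sketch; check against the spec before relying on it.
GSM_BASIC = set(
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
GSM_EXTENDED = set("^{}\\[~]|€\f")  # each costs two septets (escape + character)


def sms_parts(text):
    """Return (encoding, units, parts) for a message body."""
    if all(ch in GSM_BASIC or ch in GSM_EXTENDED for ch in text):
        encoding, single, multi = "GSM-7", 160, 153
        units = sum(2 if ch in GSM_EXTENDED else 1 for ch in text)
    else:
        encoding, single, multi = "UCS-2", 70, 67
        # Count 16-bit code units, so characters outside the BMP count twice.
        units = len(text.encode("utf-16-be")) // 2
    parts = 1 if units <= single else math.ceil(units / multi)
    return encoding, units, parts
```

A single non-GSM character anywhere in the text switches the whole message to UCS-2, which is why one stray emoji or curly quote can more than double the number of parts.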
It does not provide any facility, not even an extension standard or extension point, for performing reply correlation.
What do I mean by “reply correlation”? It is a simple concept. Assume that you send a question to a buddy, and then he responds to you with the answer. One might expect that the message containing the answer contains some sort of ID code, token or cookie (hidden away in its header information, of course) that relates it to the original question message. Unfortunately, it does not, and this is the problem: SMS does not include any such ID/token/cookie, anywhere. It was included in neither the original standard nor any subsequent revisions or extensions of the standard.
It is not necessarily the creators’ fault, because clearly they couldn’t foresee how ubiquitous SMS would become. But there is evidence that they did recognise its popularity in the late 1990s and responded very quickly by publishing new standard extensions that built upon SMS, such as multi-part messages and WAP. So one can only wonder why they didn’t make an extension that would allow replies to be correlated with their original message. And unfortunately the window of opportunity to actually get this sorted out closed over a decade ago, so we’re pretty much screwed and will have to make do with it.
This is a big problem for SMS. It makes the process of building two-way fully automated applications much more difficult. Very few companies have actually managed to solve the problem, and those that have tend to be very small or operating in niche markets. I don’t understand this at all, because the possibilities and prospects for building two-way SMS applications are absolutely huge, almost endless.
One of my key responsibilities over the last 8 years has been devising production-ready solutions that work around this problem, and this blog post is going to summarise all of them.
An overview of the solution
The key to solving this problem lies with two fields contained within the header information of every SMS message: the source and destination address. Or what I call the “address pairing”.
By using the address pairing in an intelligent way we can find the right compromise for a particular two-way application. Essentially, whenever the application needs to send a question to a mobile phone number it must ensure that no existing question is already outstanding on the same address pairing.
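The rule is easier to see in code. Below is a minimal in-memory sketch, with hypothetical function names and made-up addresses. The point to notice is that a reply arrives with the two addresses swapped, so the lookup flips them back.

```python
# Outstanding questions, keyed by (source, destination) as seen on the OUTBOUND message.
outstanding = {}


def send_question(source, destination, question_id):
    """Register a question; refuse if the address pairing is already busy."""
    pairing = (source, destination)
    if pairing in outstanding:
        return False
    outstanding[pairing] = question_id
    return True


def correlate_reply(reply_source, reply_destination):
    """A reply carries the addresses swapped, so flip them to find the question."""
    pairing = (reply_destination, reply_source)
    return outstanding.pop(pairing, None)
```

Here `send_question("60123", "+447700900001", "q1")` succeeds, a second question on the same pairing is refused, and `correlate_reply("+447700900001", "60123")` hands back `"q1"` and frees the pairing.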
There are several ways that an application can be designed around this basic concept.
Solution #1: My application only has one source address
The application must be designed to “serialise” the transmission of questions. It can use a mutual exclusion mechanism that will prevent itself sending a further question to the same mobile phone number if an outstanding question is still waiting for a response. It can be expected that some characteristics of a “transaction” or “transactional unit of work” would be adopted in the design of the application to model this mutual exclusion concept.
My past implementations of this pattern were based on a database table with a composite primary key across the “source” and “destination” columns. The application would try to insert the address pairing into this table and, if successful, carry on and send the message. If the insertion failed, it would know that a question is already outstanding with that mobile phone number, prompting it to give up and retry later. Rather than retrying later based on some timer mechanism, you might instead enqueue the question as a job somewhere, so that when a response to the outstanding question is received the application can check the queue for further jobs for that address pairing and dequeue/execute the top job.
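A minimal sketch of this pattern, assuming SQLite and two hypothetical tables (`outstanding` for the locks, `pending` for the queued jobs). It is an illustration of the idea rather than production code, and the actual sending of the message is left as a comment.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE outstanding ("
    " source TEXT NOT NULL, destination TEXT NOT NULL, question TEXT NOT NULL,"
    " PRIMARY KEY (source, destination))"
)
db.execute(
    "CREATE TABLE pending ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " source TEXT NOT NULL, destination TEXT NOT NULL, question TEXT NOT NULL)"
)


def ask(source, destination, question):
    """Try to take the lock on the pairing; queue the question if it is taken."""
    try:
        with db:
            db.execute("INSERT INTO outstanding VALUES (?, ?, ?)",
                       (source, destination, question))
        return "sent"  # a real application would hand the message to its gateway here
    except sqlite3.IntegrityError:
        # The composite primary key rejected the row: a question is outstanding.
        with db:
            db.execute(
                "INSERT INTO pending (source, destination, question) VALUES (?, ?, ?)",
                (source, destination, question))
        return "queued"


def on_reply(source, destination):
    """Release the lock, then send the oldest queued question for the pairing.

    `source` and `destination` are the pairing as stored, i.e. the application's
    address and the phone number, not the swapped addresses on the reply itself.
    """
    with db:
        db.execute("DELETE FROM outstanding WHERE source = ? AND destination = ?",
                   (source, destination))
        row = db.execute(
            "SELECT id, question FROM pending WHERE source = ? AND destination = ?"
            " ORDER BY id LIMIT 1", (source, destination)).fetchone()
        if row is None:
            return None
        db.execute("DELETE FROM pending WHERE id = ?", (row[0],))
    ask(source, destination, row[1])
    return row[1]
```

Letting the primary key do the mutual exclusion means two workers racing to send to the same phone cannot both succeed; the database picks the winner.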
There is a caveat with this solution however, and it comes as a side effect of “serialising” the questions one after the other. What if it takes days or weeks for the person to respond to the question? The questions that are queued up waiting to acquire a lock on the address pairing are going to get pushed back and back. They could get pushed back so far that the premise of the question has been entirely voided (e.g. an appointment reminder/confirmation).
The solution to this problem is to introduce a further concept of a “timeout” value. This will ensure that any question sent to the mobile phone can only be outstanding for up to a designated time period. You would probably typically set this to around 24-48 hours, but some questions that contain more time sensitive content may use a lower value of between 1-4 hours.
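As a minimal sketch of the timeout idea, the lock on an address pairing can simply carry an expiry time. The names below are hypothetical, and the current time is passed in explicitly so the behaviour is easy to follow.

```python
from datetime import datetime, timedelta

# (source, destination) -> moment at which the outstanding question expires
locks = {}


def acquire(source, destination, timeout, now):
    """Take the pairing lock unless an unexpired question still holds it."""
    pairing = (source, destination)
    expires_at = locks.get(pairing)
    if expires_at is not None and now < expires_at:
        return False
    locks[pairing] = now + timeout
    return True


# Usage: a 24 hour timeout on a made-up pairing.
start = datetime(2024, 1, 1, 9, 0)
day = timedelta(hours=24)
first = acquire("60123", "+447700900001", day, start)
too_soon = acquire("60123", "+447700900001", day, start + timedelta(hours=1))
after_expiry = acquire("60123", "+447700900001", day, start + timedelta(hours=25))
```

The second attempt is refused because the first question is still within its 24 hours; the third succeeds because the lock has lapsed.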
It is important (though not essential) when implementing the timeout concept that you also use the “Validity Period” field that is available in every outbound SMS message. You should set the validity period to roughly match the timeout value for that question. This helps ensure messaging integrity. For example, if the mobile phone is turned off for a week, you don’t want your “expired” questions to be delivered when it is turned back on, because your back-end application has already timed out the workflow that was running for that question.
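How you set the validity period depends on your gateway; many accept a plain duration. If yours wants the raw “relative” validity period octet, the sketch below converts a timeout into that octet, assuming the relative format as I understand it from GSM 03.40 (5 minute steps up to 12 hours, then 30 minute steps up to 24 hours, then days, then weeks). The function names are my own. It rounds down, so the message never outlives the timeout.

```python
from datetime import timedelta


def vp_to_duration(vp):
    """Decode a relative validity period octet (0-255) into a timedelta."""
    if vp <= 143:
        return timedelta(minutes=(vp + 1) * 5)
    if vp <= 167:
        return timedelta(hours=12, minutes=(vp - 143) * 30)
    if vp <= 196:
        return timedelta(days=vp - 166)
    return timedelta(weeks=vp - 192)


def timeout_to_vp(timeout):
    """Pick the largest octet whose validity period does not exceed the timeout.

    Timeouts shorter than 5 minutes fall back to 0, the shortest value available.
    """
    best = 0
    for vp in range(256):
        if vp_to_duration(vp) <= timeout:
            best = vp
    return best
```

With these rules a 24 hour timeout maps to octet 167 and a 1 hour timeout maps to octet 11.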
Solution #2: My application can have multiple source addresses
The idea is that you would have a relatively large pool of source addresses, perhaps as many as 50 or 100. Your application would, as with Solution #1, maintain some kind of database table or data structure that prevents duplicate address pairings. The application would then have some logic that enables it to “select” a free source address i.e. a source address that is not “in use” for the destination mobile phone number.
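A minimal sketch of the selection logic, assuming a tiny pool of made-up short codes and an in-memory set standing in for the database table. It always picks the lowest free index in the pool.

```python
POOL = ["60101", "60102", "60103"]  # hypothetical source addresses
in_use = set()  # (source, destination) pairings with an outstanding question


def select_source(destination):
    """Return the first pool address not busy with this destination, or None."""
    for source in POOL:
        if (source, destination) not in in_use:
            in_use.add((source, destination))
            return source
    return None  # every address in the pool is busy for this phone


def release(source, destination):
    """Free the pairing once the reply arrives or the question times out."""
    in_use.discard((source, destination))
```

An address that is busy for one phone is still free for every other phone, so the pool size only limits concurrent questions per destination, not overall throughput.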
It would still be advisable to implement some kind of “timeout” mechanism, as with Solution #1, but the advantage is that you can afford substantially longer timeout periods, possibly in the order of weeks or months. Really the timeout mechanism here acts more as a type of garbage collector than as a way of keeping queued questions moving, as it does in Solution #1.
I’ve always considered that this solution is better suited to applications that provide a “shared” or cloud service of some kind. Simply because setting up a large pool of dedicated source addresses for each of your application’s customers is surely going to get painful.
This solution does have the disadvantage that end-users on their mobile phone will be communicating, potentially, with lots of different source addresses even though it is really the same company/application at the other end. It can mess up the user’s normal “texting” experience: it robs them of their iPhone’s “bubble chat” GUI style of presentation and of the ability to create a Contact list entry for a regular contact. Obviously there are things you can do to try to minimise this risk, such as always trying to select the source address with the lowest index. But really I think that will just make things worse. At some point you WILL want to send multiple questions to a mobile phone number, and there’s no getting around that fact. If you’ve got a large pool of source addresses then you’re going to want to use them.
Solution #3: My application only has one source address, but I need to send concurrent questions to the same mobile phone
You can’t. Well you can, but I don’t recommend it at all. I tried it once, on an early version of our system, and our customers didn’t like it.
Essentially you combine the concepts detailed in Solution #1 and then rely on some text processing logic in your response handling code. So rather than perhaps phrasing your question like “Are you attending the meeting tomorrow? Reply with A=Yes or B=No.” You’d phrase it as “… Reply with A1=Yes or B1=No”. Notice the “1” digit in there? That’s the key bit. That digit refers to a transaction code that will be used for correlation. My implementation of this basically went from zero to nine, so you could have a total of 10 concurrent questions open with the same mobile number.
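A minimal sketch of the response handling this implies, assuming replies of exactly one letter followed by one digit. The pattern and function name are made up for illustration.

```python
import re

# One option letter followed by one transaction digit, e.g. "A1" or "b 7".
REPLY_PATTERN = re.compile(r"^\s*([A-Za-z])\s*([0-9])\s*$")


def parse_tagged_reply(text):
    """Split a reply such as 'A1' into (option, transaction_code), or None."""
    match = REPLY_PATTERN.match(text)
    if match is None:
        return None  # no digit, or free text: the reply cannot be correlated
    return match.group(1).upper(), int(match.group(2))
```

`parse_tagged_reply("A1")` yields the option and the transaction code, whereas `"A"` and `"Yes"` both come back as `None`, which is exactly the weakness of this approach.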
I don’t like this solution for the following reasons:
- Many end-users forget to include the essential digit in their reply. They might reply “A” instead of “A1”. I’ve seen this happen in the wild.
- Accessing digits on mobile phones when typing an SMS message is often an unintuitive process. Even an iPhone needs you to access a sub-keyboard screen. BlackBerrys need you to hit the ALT key.
- It prohibits your application from accepting literal text responses. Many users would simply reply “Yes” rather than “A” or “A1”. If they do this, your application would be screwed because it wouldn’t have the essential digit to correlate the reply with the original question. I’ve seen this happen in the wild.
- It prohibits your application from accepting “freeform” text responses. You might want to send a question like “What is your full name?”. There’s no way you can tag on the end of that a list of options. It simply doesn’t make sense.
- It reveals implementation details onto the user interface of your application. Not good.
- It compromises messaging integrity. An end-user might inadvertently (or possibly even deliberately!) reply with an incorrect digit.
- It requires both the “reply analysis/text processing” and “reply correlation” concerns of your application to be interdependent on each other, when really they should not be – at least not to perform something so simple.
On the last bullet point of Solution #3 I suggested that your application’s “reply analysis” and “reply correlation” concerns shouldn’t be linked together. This I believe is true for something as simple as what was described in that solution. However, there is plenty of mileage to be explored in adopting this approach for more advanced designs.
When you send a question with a constrained set of response options such as “Yes, No, Maybe”, you might want to record these as part of your address pairing in the database or data structure (as described in Solution #1). Then if you need to send a further question (to the same mobile phone, whilst the first question is still outstanding) you can check if the set of response options are different. This question might be looking for a “Good, Bad, Ugly” response. In which case there is no conflict, is there? So a lock on the address pairing, based upon those expected response options, can be allowed to be acquired. Obviously this wouldn’t be possible (or at least would have ramifications on your overall design) if you were expecting a “freeform” response.
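A minimal sketch of that check, using hypothetical names and an in-memory dictionary in place of the database. Each address pairing holds a list of expected option sets, and a new question is allowed only if its options overlap with none of them. A freeform question (passed as `None`) conflicts with everything.

```python
# (source, destination) -> option sets of the questions currently outstanding
expected = {}


def normalise(options):
    return {option.strip().lower() for option in options}


def try_ask(source, destination, options):
    """Allow the question only if its options share nothing with outstanding ones.

    Pass options=None for a freeform question, which conflicts with everything.
    """
    current = expected.setdefault((source, destination), [])
    new = None if options is None else normalise(options)
    for held in current:
        if held is None or new is None or held & new:
            return False
    current.append(new)
    return True


def match_reply(source, destination, text):
    """Return the option set the reply belongs to and release it, or None."""
    answer = text.strip().lower()
    current = expected.get((source, destination), [])
    for held in current:
        if held is None or answer in held:
            current.remove(held)
            return held
    return None
```

With a “Yes, No, Maybe” question outstanding, a “Good, Bad, Ugly” question is accepted, while a “Yes, Later” question is refused because “Yes” would be ambiguous.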
Another possible avenue to be explored is an area of computer science called “natural language processing”. The idea is that when you ask a question like “What is your name?” you would prime your NLP engine to expect a reply that looks like somebody’s name. Anything that arrives from that mobile phone that doesn’t look like a person’s name can be assumed to be unrelated to the outstanding question. Obviously if you want to ask a concurrent question like “What is your wife’s name?” then you’re back to square one, because that would be a conflict and you’d need to serialise the questions as described in Solution #1. This (NLP and SMS applications) is an active area of research for me, so I may blog about it in more detail at a later time.
Solution #1 is the best, for now. It strikes the right level of compromise without sacrificing either messaging integrity or user friendliness. If you desperately need to send multiple concurrent questions to a mobile phone then I would suggest that you rethink your approach. Perhaps logically separating your business departments and/or workflow concerns onto different source addresses would be a solution in this case. That way you can send out an urgent question, perhaps relating to a missed bill payment, on a source address that is dedicated to that purpose.
Solution #2 is usable, and I can think of several use-cases. But I feel it is not as good for frequent one-to-one contact between a company and their customers. It has serious disadvantages in user friendliness. It is best suited to a hosted cloud service of some kind, where everyone shares the same pool of source addresses and where contact is expected to be infrequent.