Handling Edge Cases

The other day I decided to try Uber. I already had a registration, so, while connected to the Wi-Fi at the house I was in, I requested the car. It came quickly, so I went out and got in. The driver (who didn’t speak proper English at all, although we were in New York) confirmed the address and we were on our way. A minute later he pointed to his phone and said (or meant to say, comprendo?) that the ride was over and the fee was only 8 dollars (rather than the estimated 35 I saw in the app). I could not open my Uber app, because I didn’t have mobile data (mobile roaming in the US when coming from the EU is quite expensive). Then the receipt email I got confirmed that the ride had lasted only 45 seconds (ending only two blocks away from the starting point). My dashboard does not show any cancellation, either from my side or from the driver’s, and my email to Uber’s support resulted in something like “we have no idea what happened, we are sorry”.

Since they have no idea what happened, I will be speculating at best, but I would assume that the problem was due to my disconnect from the internet (it could be that the driver did something malicious, and pretended not to speak English, but that is just another aspect of the same problem I’m going to describe). So, let’s talk about edge cases. (This post is not meant to be Uber-bashing; after all, I’m not sure what exactly happened. It’s just that my Uber experience triggered it.)

Handling edge cases might seem like the obvious thing to do, but (if my assumption above is right) a company with a 40 billion dollar valuation didn’t get it right (in addition to the story above, a friend of mine shared a story about Uber and poor connectivity that resulted in an inconsistent registration). And incidentally, I’ve been working on edge cases with device-to-server connectivity for the past few weeks, so I have a few tips.

If your application (a smartphone app or a rich web client) relies on communication with a server, it must hold a connection state. It should know when it’s “Connected” or “Disconnected”. This can be achieved in a couple of ways, one of which is to invoke a “ping” endpoint on the server every X seconds, and if one request (or two consecutive requests) fails, set the current state to “Disconnected”. Knowing what state you are in is key to adequate edge-case handling.
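To make the “ping” idea a bit more concrete, here is a minimal sketch in Python. The class name, the ping callable and the threshold of two failures are all my own assumptions for illustration; a real app would plug in its own HTTP call and run the check on a timer.

```python
import time
from typing import Callable


class ConnectionMonitor:
    """Tracks "Connected"/"Disconnected" based on periodic ping results."""

    def __init__(self, ping: Callable[[], bool], max_failures: int = 2):
        # `ping` is a hypothetical callable: it returns True if the server
        # answered, and returns False or raises OSError otherwise.
        self._ping = ping
        self._max_failures = max_failures
        self._failures = 0
        self.state = "Connected"

    def check_once(self) -> str:
        """Run one ping and update the state accordingly."""
        try:
            ok = self._ping()
        except OSError:
            ok = False
        if ok:
            self._failures = 0
            self.state = "Connected"
        else:
            self._failures += 1
            # Only flip the state after enough consecutive failures.
            if self._failures >= self._max_failures:
                self.state = "Disconnected"
        return self.state

    def run(self, interval_seconds: float, rounds: int) -> None:
        """Ping every `interval_seconds`, `rounds` times in total."""
        for _ in range(rounds):
            self.check_once()
            time.sleep(interval_seconds)
```

A single failed ping does not change the state here; only the second consecutive failure does, and one successful ping brings it back to “Connected”.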

Then, if you have something to send to the server, you need a queue of pending operations. The command pattern comes in handy here, but you can simply use a single-threaded executor service and a synchronization aid to block until the connection is back. Then you proceed with executing the queued commands. The Gmail app for Android is a very positive example of that. It works perfectly both online and offline, and synchronizes the contents without any issues upon getting reconnected.
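Here is roughly what I mean by a single-threaded executor plus a synchronization aid, sketched in Python with `ThreadPoolExecutor` and `threading.Event`. The `PendingOperations` class and its method names are made up for this example, and a real client would also persist the queue so it survives an app restart.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable


class PendingOperations:
    """Runs commands one at a time, and only while connected."""

    def __init__(self) -> None:
        # One worker thread, so commands run in submission order.
        self._executor = ThreadPoolExecutor(max_workers=1)
        # The "synchronization aid": set while we believe we are online.
        self._connected = threading.Event()

    def set_connected(self, connected: bool) -> None:
        if connected:
            self._connected.set()
        else:
            self._connected.clear()

    def submit(self, command: Callable[[], object]) -> Future:
        def run() -> object:
            # Block the worker until the connection is back.
            self._connected.wait()
            return command()

        return self._executor.submit(run)

    def shutdown(self) -> None:
        # Waits for queued commands, so call it while connected.
        self._executor.shutdown(wait=True)
```

Commands submitted while offline simply wait in the executor’s queue; calling `set_connected(True)` releases them in order. This sketch does not cover a connection that drops in the middle of a command, which still needs a retry.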

One very important note is that your server should not rely on a 100% stable connection, and therefore disconnects alone should not trigger any business logic. In the above example with Uber, it might be that after 30 seconds of an unresponsive client app, the server decided the ride was over (e.g. on the assumption that I was trying to trick it).
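As an illustration only (I have no idea how Uber’s backend actually works), the server side could keep “is the client reachable” separate from “what is the status of the ride”. The `RideSession` class, its fields and the 30-second timeout below are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RideSession:
    last_seen: float                  # time of the last heartbeat, in seconds
    status: str = "IN_PROGRESS"       # changes only on explicit actions
    client_reachable: bool = True     # connectivity is tracked separately
    events: list = field(default_factory=list)

    def heartbeat(self, now: float) -> None:
        """Called whenever the client app checks in."""
        self.last_seen = now
        if not self.client_reachable:
            self.client_reachable = True
            self.events.append((now, "client reconnected"))

    def check_silence(self, now: float, timeout: float = 30.0) -> None:
        """Record that the client went silent, but do NOT end the ride."""
        if self.client_reachable and now - self.last_seen > timeout:
            self.client_reachable = False
            self.events.append((now, "client unreachable"))

    def end_ride(self, now: float) -> None:
        """Only an explicit action completes the ride."""
        self.status = "COMPLETED"
        self.events.append((now, "ride ended explicitly"))
```

The silence is recorded, so it can be investigated later, but the ride stays in progress until somebody explicitly ends it.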

Another aspect is poor connection quality: the user may be connected, but not all requests may succeed, or they may take a lot of time. Tweaking timeouts and having the ability to retry failed operations is key. You don’t want to assume there is no connection if a request fails once, so retry it a couple of times (and only then add it to the “pending” queue). At the same time, on the server, you should use some sort of transactions. E.g. in the case of Uber registration on a poor connection, my friend received an email confirmation for the registration, but his account was not actually created (and the activation link failed). Maybe the client made two requests, one of which failed, but the server assumed that if one of them went through, the other one did too.
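The client-side part (retry a couple of times, then park the operation) could look something like the sketch below. The function name, the plain list standing in for the “pending” queue and the doubling delay between attempts are my own choices, not something prescribed.

```python
import time
from typing import Callable, List


def send_with_retry(
    operation: Callable[[], None],
    pending: List[Callable[[], None]],
    attempts: int = 3,
    delay_seconds: float = 0.5,
) -> bool:
    """Try `operation` a few times; if it keeps failing, queue it for later.

    Returns True if the operation succeeded, False if it was queued.
    """
    for attempt in range(attempts):
        try:
            operation()
            return True
        except (TimeoutError, ConnectionError):
            # Wait a bit longer before each new attempt.
            if attempt < attempts - 1:
                time.sleep(delay_seconds * (2 ** attempt))
    # Still failing: keep it so it can be replayed once we are back online.
    pending.append(operation)
    return False
```

Keep in mind that a retried request may reach the server twice (the first attempt might have succeeded with only the response getting lost), which is one more reason for the server to handle requests transactionally.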

Edge cases are of course not limited to connectivity ones, and my list of “tips” is not at all exhaustive. Edge cases also include, for example, attempts from users to trick the system, so these must also be accommodated in a sensible way – e.g. do not assume by default that the user is trying to trick the system, but do give yourself the facilities to investigate that afterwards. Having adequate logging, both on the server and on the client, is very important, so that after an unforeseen edge case happens, you can investigate, rather than reply “we have no idea” (those were not the exact words, but it practically meant that).

While handling edge cases, though, we must not forget our default use case. Optimize for the default use case, be sure that your “happy flow” makes users happy. But things break, and “unhappy flows” must not leave users unhappy.

Handling all edge cases might seem like cluttering your code. It will be filled with if-clauses, event handling, “retry pending” constructs, and what not. But that’s okay. It’s actually a sign of a mature product. As Spolsky has pointed out, all these ugly-looking pieces of code are actually there to serve a particular use case that you could not have thought of when starting from scratch.

Unfortunately, many of these edge cases cannot be tested automatically. So extensive manual testing is needed to ensure that the application works well, and not only in a perfect environment.

Tired of reading obvious things? Well, go and implement them.