Expected RESTful communication latencies and reliability of the internet

I have an application in which I have replaced a direct USB connection to a control system with the imp. The application currently determines imp connectivity by polling (an HTTP POST/response pair) whenever there is no other activity going on. Polling happens every few seconds while idle, and the application waits up to 10 seconds for a response before deciding that a poll has failed. Somewhat arbitrarily, I set a limit of 5 consecutive failed responses before declaring the imp service unavailable (the agent actually handles the polling, not the imp).
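The detection rule described above (a poll every few seconds, up to 10 seconds per response, "unavailable" after 5 consecutive failures) reduces to a small counter. A minimal Python sketch, assuming the driver reports each poll as a plain success or failure; the class name `LinkMonitor` is hypothetical and not part of any imp API:

```python
class LinkMonitor:
    """Declare the link down after max_failures consecutive failed polls."""

    def __init__(self, max_failures: int = 5) -> None:
        self.max_failures = max_failures
        self.consecutive_failures = 0

    def record(self, success: bool) -> None:
        # Any successful poll resets the count; an error or a timeout adds one.
        if success:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1

    @property
    def connected(self) -> bool:
        return self.consecutive_failures < self.max_failures


# Usage: four failures in a row are tolerated, the fifth flips the state.
monitor = LinkMonitor(max_failures=5)
for _ in range(4):
    monitor.record(False)
print(monitor.connected)  # True
monitor.record(False)
print(monitor.connected)  # False
monitor.record(True)
print(monitor.connected)  # True
```

With this rule, any effect that makes 5 polls in a row exceed the fixed timeout is indistinguishable from a real outage, so the choice of timeout matters as much as the failure count.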

I have found that, after anywhere from a few minutes to a few hours, the application always ends up marking itself as disconnected, even when the imp service is available.

In the long run I will probably adopt a more sophisticated mechanism for determining connectivity, but this raises the question of what expectations I should have when communicating with the imp service from anywhere. I know these are not imp-specific questions, but what experience do people have with the following:

  1. What is a reasonable maximum latency to expect for responses to HTTP POSTs?
  2. Can HTTP POSTs or responses be “lost” (i.e. sent but never received)? I know TCP is supposed to be reliable, but based on logging packets on both ends I seem to see some data lost (or, of course, I could have a bug).
  3. Does the path between an application and the imp service sometimes go away momentarily, apart from outright failures of backbone routing equipment?

In general, I guess I’m asking how a designer should think about the reliability of an internet connection when designing a reliable communication protocol on top of the imp. I have a lot of embedded experience but no RESTful experience.

I’m also open to other ideas for determining connectivity between an application and the imp agent if anyone wants to share.

There have been occasional issues with HTTP input. An improved solution is currently in testing and will hopefully be deployed shortly, which should help with reliability.

There is a known issue with the current release of the imp software which means imps can drop offline unless Squirrel code runs regularly. The workaround is simple - just add this to your device code:

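```squirrel
// Re-arm a 60-second wakeup each time it fires, so Squirrel code runs at least once a minute
function w() { imp.wakeup(60, w); }
w();
```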

…which causes the imp to wake regularly and works around the problem. Could that explain what you’ve seen?

Hi Hugo,

I’m not exactly sure where things get lost. My rudimentary debugging (logs printed from the agent and from my application’s imp driver) suggests that sometimes the agent doesn’t see the HTTP POST and sometimes my code doesn’t see a response from the agent. The agent acts as an intermediary and should always return a response, even if the imp is not present. I know about the current imp bug, and that isn’t the problem (my imp is running code constantly, e.g. every few ms).

I wasn’t necessarily thinking there was a bug with your server but I’m interested if that’s what I’m seeing. In general, however, I am interested in understanding the reliability of RESTful communication so I can design a reliable protocol between my application and the imp. I’m curious what your thoughts are about this and what other users have found.

Generally, when dealing with any internet APIs, you have to be prepared for any HTTP request to fail; however, this should be a rare occurrence.
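Being "prepared for any HTTP request to fail" usually means wrapping each request in a bounded retry. A minimal Python sketch, assuming network errors surface as `OSError` (which covers `ConnectionError`, `TimeoutError` and urllib's `URLError`); `with_retries` and its parameters are hypothetical names, and the doubling delay is one common choice, not something the imp service requires:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    send: Callable[[], T],
    attempts: int = 3,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call send(); on a network error, wait and try again.

    The wait doubles after each failure (0.5 s, 1 s, 2 s, ...).
    The last error is re-raised if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return send()
        except OSError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay_s * 2 ** attempt)
    raise AssertionError("unreachable")


# Usage with a fake request that fails twice before succeeding.
calls = []

def flaky_post() -> str:
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("simulated dropped connection")
    return "ok"

delays = []
print(with_retries(flaky_post, sleep=delays.append))  # ok
print(delays)  # [0.5, 1.0]
```

Retrying a POST can deliver it twice if the original arrived and only the response was lost, so this is only safe when repeating the request is harmless, or when each request carries a sequence number that the agent uses to discard duplicates (a hypothetical scheme you would implement yourself).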

It sounds like you may have run into issues with our current HTTP input architecture, which is creaking a bit with all the increased load; this was only a functional placeholder until the real solution arrives.

How often are you issuing requests?

Imps can drop offline without the watchdog code Hugo mentioned. They also drop offline when there are bugs in our own code during development, which is more likely to be the case here. There is no error report; the imp just goes offline, for many possible reasons. Once you get the code running reliably, it never misses an http.request or agent.on call, even over a satellite modem. On a fast connection the ping times are very consistent: exactly 100 ms round trip, without any missed pings. My code has been pinging every 15 seconds for weeks. I think you’re doing something wrong?

Hugo,

I do see the vast majority of traffic succeed.

Currently the driver is single-threaded. It issues one HTTP request at a time and waits for a response (or a timeout). It can pack multiple data packets into each request, and under load it can issue a new HTTP request immediately after getting a response. When it is only pinging, it waits about 2 seconds between requests (plus the response latency, which often seems fairly low).

What kinds of maximum response latency do people think will occur? 10 seconds (as I have now)? >10 seconds? Perhaps part of my problem is that I’m not giving the outliers a chance to make it back.
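Instead of guessing a fixed limit, one option is to borrow TCP's own retransmission-timeout estimator from RFC 6298 (smoothed round-trip time plus four times the round-trip variation) and feed it the response times the driver actually measures. A sketch under that assumption; applying it to whole HTTP exchanges is an adaptation, the floor, ceiling and initial values are arbitrary, and the RFC's clock-granularity term G is omitted:

```python
from typing import Optional


class TimeoutEstimator:
    """Adaptive response timeout using the RFC 6298 smoothing rules.

    RFC 6298 defines this for TCP retransmissions; here it is applied to
    whole HTTP request/response times measured by the application.
    """

    ALPHA = 1 / 8  # gain for the smoothed round-trip time (SRTT)
    BETA = 1 / 4   # gain for the round-trip variation (RTTVAR)
    K = 4          # how many variations of headroom to allow

    def __init__(self, floor_s: float = 1.0, ceiling_s: float = 60.0,
                 initial_s: float = 10.0) -> None:
        self.floor_s = floor_s
        self.ceiling_s = ceiling_s
        self.initial_s = initial_s
        self.srtt: Optional[float] = None
        self.rttvar = 0.0

    def observe(self, sample_s: float) -> None:
        """Feed in the measured latency of one successful request."""
        if self.srtt is None:
            # First sample: SRTT = R, RTTVAR = R / 2
            self.srtt = sample_s
            self.rttvar = sample_s / 2
        else:
            # RTTVAR uses the previous SRTT, so it is updated first
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - sample_s)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * sample_s

    def timeout(self) -> float:
        """Current timeout: SRTT + K * RTTVAR, clamped to [floor_s, ceiling_s]."""
        if self.srtt is None:
            return self.initial_s  # no measurements yet
        rto = self.srtt + self.K * self.rttvar
        return min(max(rto, self.floor_s), self.ceiling_s)


est = TimeoutEstimator()
for sample in (0.12, 0.09, 0.15, 0.11):
    est.observe(sample)
print(round(est.timeout(), 2))  # 1.0 (the floor: these samples are small and steady)
```

The floor matters: with steady sub-second replies the raw estimate drops well below a second, and an occasional slow reply would then be counted as a failure.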

10 seconds seems like a very long time; generally you’d only see that type of time if there was packet loss on your connection and you ended up doing TCP retries.
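Some rough arithmetic shows why 10 seconds points to packet loss. RFC 6298 recommends an initial TCP retransmission timeout of 1 second, doubling after each expiry. Assuming the same segment is lost repeatedly and the timeout starts at that 1-second value (real stacks adjust it from measured round-trip times), the waits add up as follows; the helper name is made up for illustration:

```python
def cumulative_retransmit_wait(initial_rto_s: float, losses: int) -> float:
    """Seconds spent waiting before the retransmission that follows `losses`
    consecutive losses, if the timeout starts at initial_rto_s and doubles."""
    return sum(initial_rto_s * 2 ** i for i in range(losses))


for n in range(1, 5):
    print(n, cumulative_retransmit_wait(1.0, n))
# 1 1.0
# 2 3.0
# 3 7.0
# 4 15.0
```

Three losses of the same segment already cost 7 seconds, and a fourth pushes the exchange past a 10-second application timeout.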

However, I should ask - what are you doing that you need to poll that fast? You can long poll and only return data when there’s something to send.

i.e. your computer issues an HTTP request to the agent. The agent notes the request but does not send a reply until new data is available; at that point the reply is sent, and the computer issues a new HTTP request for subsequent data. This way you have much lower load (for you and for the agent) and likely better latency for getting new data to your computer.
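From the computer's side, this is a loop that always keeps exactly one request outstanding. A minimal Python sketch; `fetch`, `handle` and `keep_going` are hypothetical callbacks standing in for whatever HTTP client and data handling the application framework provides:

```python
from typing import Callable, Optional


def long_poll(
    fetch: Callable[[], Optional[str]],
    handle: Callable[[str], None],
    keep_going: Callable[[], bool],
) -> None:
    """Issue one request at a time and re-issue it as soon as it completes.

    fetch() blocks until the agent replies. It returns the payload, or None
    when the request ended without data (e.g. a server-side hold timeout).
    """
    while keep_going():
        payload = fetch()
        if payload is not None:
            handle(payload)
        # No sleep here: the next request goes out immediately, so the agent
        # almost always has a pending request it can answer.


# Usage with a fake fetch: None stands for a request that timed out empty.
replies = iter(["temp=21", None, "temp=22"])
received = []
long_poll(lambda: next(replies), received.append, lambda: len(received) < 2)
print(received)  # ['temp=21', 'temp=22']
```

In a real driver, `fetch` would also need the error handling discussed earlier in the thread, since a long-held request can fail like any other.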

I’ll take a look at the actual TCP traffic.

The constant pings serve two purposes.

  1. It lets both the imp and the application know the connectivity state. The imp has an LED that shows whether an application is connected. The application is an existing piece of code (with a new “imp” communication driver) that needs to know when the thing it is controlling is disconnected.

  2. Currently the pings also allow the remote device a chance to get data back to the application.

That being said, it seems possible that both functions could be handled by long polling (although how does the imp know when the application is gone?). As I said initially, I am a neophyte at this kind of communication, so the underlying technology of long polling is new to me. I did some web searching today but mostly came up with other people’s libraries (like Comet). The application runs on a PC and is written using an existing framework that provides me with TCP/HTTP services.

I did a little experimenting this morning. I changed the agent to return a response only when it has something to say (like incoming data from the imp), and I removed the timeout from my application. I now send an HTTP POST and then simply wait for a response. My socket notifies me that the other side has disconnected 60 seconds after the POST (I do not know whether this is your server disconnecting or something on my side, but the socket reports it as closed). Responses the agent generates more than 60 seconds after the POST never reach the application; responses generated within 60 seconds of the POST arrive successfully. So what I implemented obviously isn’t long polling.

I realize that this isn’t imp specific but can you point me to anything on the web that explains how long polling works in terms of the underlying socket and RESTful communication? I think I would have to implement my own version.

Hmm, I had been thinking of the outbound long-poll timeout, which is something like 10 minutes. 60s is indeed the long poll timeout for incoming requests - so yes, you implemented long polling (and a connection every minute is still a lot nicer than furiously doing short sessions).

In terms of implementation, you connect, send your HTTP request, and wait for an answer - there’s no magic (which is why it works without modification for most clients), it’s just the server doesn’t send a reply for “a while”.
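To make the "no magic" point concrete: in the self-contained Python sketch below, the client is an ordinary urllib POST with a generous timeout, and all of the long-polling behaviour lives in the server, which simply delays its reply. A local test server stands in for the agent; the 5-second hold limit, the 204 status for "nothing new" and all names are illustrative assumptions, not how the imp service behaves:

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Tuple

new_data = threading.Event()
latest = {"body": b""}


class HoldingHandler(BaseHTTPRequestHandler):
    """Hold each POST until data arrives or the hold limit expires."""

    def do_POST(self):
        self.rfile.read(int(self.headers.get("Content-Length", 0)))  # consume body
        if new_data.wait(timeout=5):  # the "long" part of long polling
            new_data.clear()
            body = latest["body"]
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(204)  # nothing new before the hold limit
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet


def publish(body: bytes) -> None:
    """Stand-in for data arriving from the imp."""
    latest["body"] = body
    new_data.set()


def demo() -> Tuple[int, bytes]:
    new_data.clear()
    server = ThreadingHTTPServer(("127.0.0.1", 0), HoldingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_address[1]}/"
    # Pretend the imp produces a reading 0.5 s after the request goes out.
    threading.Timer(0.5, publish, args=(b"temp=21",)).start()
    try:
        # An ordinary POST; the only long-poll-specific detail is the timeout.
        with urllib.request.urlopen(url, data=b"poll", timeout=10) as resp:
            return resp.status, resp.read()
    finally:
        server.shutdown()
        server.server_close()


print(demo())  # (200, b'temp=21')
```

Against the real agent, the server-side limit is the 60-second incoming timeout mentioned above, so the client's own timeout should be somewhat longer than that, and a request that ends without data is simply reissued.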

I’ll put a request in to get the incoming timeout matching the outbound timeout.

You could implement the “application connected” notifier by just noting if there had been an incoming HTTP connection in the last n seconds, where n>long poll interval?
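In practice this check would live in the agent's Squirrel code; the logic is modelled in Python here. A sketch assuming the agent records the arrival time of every incoming request; `PresenceTracker` and the 90-second window (chosen to exceed the 60-second incoming hold time) are hypothetical:

```python
import time
from typing import Callable, Optional


class PresenceTracker:
    """Report the application as connected if it made a request within window_s.

    window_s must exceed the long-poll hold time, so that a client which is
    simply waiting inside a held request is not reported as gone.
    """

    def __init__(self, window_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.window_s = window_s
        self.clock = clock
        self.last_seen: Optional[float] = None

    def note_request(self) -> None:
        # Call this whenever an HTTP request arrives from the application.
        self.last_seen = self.clock()

    def connected(self) -> bool:
        return self.last_seen is not None and self.clock() - self.last_seen <= self.window_s


# Usage with a fake clock (seconds).
now = [0.0]
tracker = PresenceTracker(window_s=90, clock=lambda: now[0])
print(tracker.connected())  # False: nothing seen yet
tracker.note_request()
now[0] = 60.0
print(tracker.connected())  # True
now[0] = 91.0
print(tracker.connected())  # False
```

This also answers the question of how the imp can learn that the application is gone: the agent can pass the result of `connected()` down to the device, which drives the LED.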

Ok. I can understand that you’d prefer the agents not to be busy with a lot of unproductive work and communication. I can redesign the driver to be less chatty while idle.

Here is a related question. How expensive is communication between the imp and agent relative to the expense of communication between the agent and app? Are you using the same methods behind the scenes (e.g. behind the new device and agent API)?

imp and agent comms is much “cheaper” - well, it takes power on the imp, but the overheads are much lower. The pipe is up already, so it’s just data flowing to and fro.

An agent HTTPS call is a TCP setup, then a TLS setup, then data flow.
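One rough way to compare the two modes is to count round trips. A back-of-the-envelope Python sketch, assuming the 100 ms round-trip time quoted earlier in the thread, full handshakes with no session resumption, and ignoring DNS lookups and server processing; the function name is made up for illustration:

```python
def first_byte_latency_ms(rtt_ms: int, new_connection: bool, tls13: bool = False) -> int:
    """Approximate time until the response starts arriving, counted in round trips."""
    round_trips = 1  # the request/response exchange itself
    if new_connection:
        round_trips += 1  # TCP three-way handshake
        round_trips += 1 if tls13 else 2  # full TLS 1.3 vs TLS 1.2 handshake
    return round_trips * rtt_ms


print(first_byte_latency_ms(100, new_connection=False))  # 100: pipe already up
print(first_byte_latency_ms(100, new_connection=True))  # 400: TCP + TLS 1.2 + request
print(first_byte_latency_ms(100, new_connection=True, tls13=True))  # 300
```

Under these assumptions, a fresh HTTPS request costs several times the latency of a message over a connection that is already established, before counting the CPU cost of the TLS handshake.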

So no, very different modes of operation.