Today's Server Update

The server update about 6 hours ago (3:40 UTC) really knocked my devices around. The kerfuffle was over in about 3 minutes but only after many disconnects and reconnects. Is that how most server updates will pan out in the future? I’m not complaining, it’s just helpful to know as I want to be able to manage it as smoothly as possible. For better or worse, my agent and device pairs require a lot of synchronisation. I’m trying to make it as solid as possible, but I do strike problematic fringe cases when there are lots of unsolicited agent restarts or device disconnects. In the aftermath, I’m glad I’ve got lots of logging data to sift through.

Is that how most server updates will pan out in the future?

We try to make the deploys as smooth as possible, and we’re getting better with each deploy, but you should endeavour to make your imp/agent code as robust as possible in these cases.
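For example, on the device side, a non-blocking send policy plus a reconnect handler covers most of the disruption (a minimal sketch using the standard device APIs; the timeout and back-off values are arbitrary):

```squirrel
// Device code: don't let a dropped server link block Squirrel mid-send.
server.setsendtimeoutpolicy(RETURN_ON_ERROR, WAIT_TIL_SENT, 10);

// Retry the connection with a short back-off after an unexpected drop
// (e.g. during a deploy), rather than leaving the imp offline.
function reconnect() {
    if (server.isconnected()) return;
    server.connect(function(result) {
        if (result != SERVER_CONNECTED) imp.wakeup(5, reconnect);
    }, 30);
}

server.onunexpecteddisconnect(function(reason) {
    imp.wakeup(5, reconnect);
});
```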

During yesterday’s deploy, there were four issues that might have affected you:

  1. The default imp server (imp.electricimp.com) is on an Elastic IP. When we updated it during the deploy to point to the new backend server (imp01b), there was a small race condition between the Elastic IP update and the corresponding server group update (server groups control device/server affinity). This would have affected devices using that server (most developer devices; a small number of specific production devices) for up to 4 minutes. We’re examining what we can do to resolve this in future deploys.

Fortunately, one of the features rolled out in yesterday’s deploy makes this less painful. In the past, we assumed that if a device didn’t go to the server it was supposed to, it was having DNS caching issues. For earlier impOS versions, the only way to resolve this was to trigger an impOS update (which round-trips through the bootrom, rather than the improm, thus clearing the cache properly). This problem was resolved in release 30, and the server behaviour was changed so that it no longer sends devices for unneeded OS upgrades. Devices having difficulty following a redirect now stay on the original server.

  2. If a device is disconnected during the deploy, we move its agent from the old server to an arbitrarily-assigned new server, chosen by load (currently agent count). When the device reconnects and discovers that its old server is unavailable, it goes to imp.electricimp.com and is redirected to a new server, which may be different from the one its agent is now on. In that case the agent is moved again to follow the device, so an agent can end up being moved twice because of the deploy.

In a future release, we’re considering device-follows-agent, rather than agent-follows-device. This will allow us to load-balance more reliably (by agent count), and agents will be moved only once during a deploy.
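To illustrate the difference: with device-follows-agent, placement is decided once, purely by load. The sketch below is written in Squirrel for familiarity, but the real backend isn’t Squirrel, so read it strictly as pseudocode:

```squirrel
// Pseudocode: choose the least-loaded server (by agent count) once,
// move the agent there, and simply redirect the device to the same
// server when it reconnects - one agent move per deploy.
function pickServer(servers) {
    local best = null;
    foreach (s in servers) {
        if (best == null || s.agentCount < best.agentCount) best = s;
    }
    return best;
}
```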

  3. There is a small race condition where your squirrel code can call agent.send() before the imp is redirected. During a deploy, your agent is moved to (e.g.) imp02b [move 1]. Then your device discovers that its preferred server is down and connects to imp.electricimp.com. Because of the race condition, it’s possible that your agent is moved to imp.electricimp.com (imp01b) before the device is redirected [move 2]. Then, when the device does get redirected to (e.g.) imp03b, the agent is moved to follow it [move 3].

We’re working on this right now, and we hope that it will be fixed in the next release.
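Until then, device code can treat a failed agent.send() as retryable rather than fatal (a minimal sketch, assuming the RETURN_ON_ERROR send policy from the earlier example; the queue and message handling are illustrative):

```squirrel
// Device code: with RETURN_ON_ERROR in effect, agent.send() returns a
// non-zero status instead of blocking when the link is mid-redirect,
// so failed sends can be queued and retried shortly afterwards.
local pending = [];

function sendToAgent(name, data) {
    if (agent.send(name, data) != 0) {
        pending.append({ name = name, data = data });
        imp.wakeup(5, flushPending);
    }
}

function flushPending() {
    while (pending.len() > 0) {
        local msg = pending[0];
        if (agent.send(msg.name, msg.data) != 0) {
            imp.wakeup(5, flushPending);  // still unsettled; retry later
            return;
        }
        pending.remove(0);
    }
}
```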

  4. Yesterday’s deploy also started enforcing that each device can have only one agent. If a device changes plan (account), the stale agent is stopped. However, during the deploy, we found an issue where some live agents were stopped instead of the corresponding stale ones.

This is unlikely to have affected most people (devices tend to stay on the same plan), but I mention it for completeness.

Moreover, if your device reconnected while in this state, the correct agent would have been started, and the stale one would have been stopped.

Once we became aware of the problem, we ensured that the correct agents were running for currently-connected devices, and for devices that had been seen in the last 31 days.

This was fixed as part of yesterday’s deploy, so subsequent deploys won’t hit this problem.

My personal apologies for any inconvenience caused.


Roger Lipscombe
Backend Engineering Lead

Roger, thanks for your comprehensive response. No apology is necessary. The IoT space is new territory and I know that there are challenges in getting the cloud side both hardened and scalable.
We are endeavouring to make our imp/agent code as robust as possible in these cases. It’s just hard to simulate this event, so in that respect, yesterday was useful. server.restart() has only recently been able to restart the agent; before then it was even harder to reproduce conditions similar to an update. No harm done, all of our test devices recovered, but not before a flurry of events. Next time round, I hope it will be smoother for us as well.
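For anyone else trying to rehearse a deploy: the closest approximation we’ve found is having the agent restart itself on demand (a sketch; the message name is just our own convention):

```squirrel
// Agent code: restart this agent when the device asks, approximating
// the agent restart that happens during a server deploy.
device.on("simulate.deploy", function(ignored) {
    server.log("Simulating deploy: restarting agent");
    server.restart();
});
```

The device side just fires agent.send("simulate.deploy", null) whenever we want to exercise the reconnect paths.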

I have another small question about agent updates.

If I deploy an agent/device update to my devices, the update deploys to both immediately IFF the device is online. If the device is offline, the agent will not update until the device comes back online again.
All of our agents talk with a central server. Sometimes, changes to the agent include protocol tweaks. It would be nice to be able to update all of the agents in one swoop, so that I don’t need to manage agents that continue to use a superseded protocol with our central server for days/weeks until their device pops up again. Is there some way to force the agent to update? Or do I need to handle this with versioning and maintain support for deprecated API calls for up to 31 days?
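If versioning turns out to be the answer, I imagine something like this on the agent side, with the central server accepting both versions during the overlap window (a sketch; the URL, header name, and version value are made up):

```squirrel
// Agent code: tag every request with the protocol version this agent
// speaks, so the central server can keep serving not-yet-updated agents
// until their devices reconnect and pick up the new build.
const PROTOCOL_VERSION = "2";

function postToCentral(path, body) {
    local headers = {
        "X-Protocol-Version": PROTOCOL_VERSION,
        "Content-Type": "application/json"
    };
    local request = http.post("https://central.example.com" + path,
                              headers, http.jsonencode(body));
    request.sendasync(function(response) {
        if (response.statuscode != 200) {
            server.error("Central server returned " + response.statuscode);
        }
    });
}
```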

Sorry, maybe I was wrong about that last point. It seems it’s my agent code that suspends when the device is offline. I guess I’m waiting to see “Agent Restart” in the console, and it isn’t there when there’s a build update.

Hmm, agents should be restarting on a deploy even if the device is offline (and the correct behaviour should be that the device gets the update when it next connects). There’s a lot of subtlety involved in “polite” updates which we’re still working through - those are the ones where the device gets notified that an update (OS or squirrel) is pending and can choose when to accept it.
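To give a flavour of where this might land on the device side, here’s a sketch assuming the server.onshutdown() device hook (check the API documentation for availability on your impOS release; treat the details as illustrative):

```squirrel
// Device code: be told that a new Squirrel build or impOS update is
// pending, finish any critical work, then accept the restart.
function finishCriticalWork() {
    // Placeholder: flush buffers, park hardware safely, etc.
    server.flush(10);
}

server.onshutdown(function(reason) {
    if (reason == SHUTDOWN_NEWSQUIRREL || reason == SHUTDOWN_NEWFIRMWARE) {
        finishCriticalWork();
        server.restart();  // allow the pending update to proceed
    }
});
```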

Are you looking to restart an individual agent via the API, without touching the device? This was possible with the original build API, but isn’t in the current beta build API.

I think this is my bad. Somewhere in the past few months, I changed the agent so it suspends a full initialisation until it sees the device connect. I mistook the lack of messages in the console for a refusal to update.
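For anyone else doing the same thing, the pattern looks roughly like this (a sketch; initialise() stands in for my real setup code):

```squirrel
// Agent code: defer full initialisation until the device is first seen.
local initialised = false;

function initialise() {
    // Stand-in for the real setup (handlers, timers, HTTP endpoints).
    server.log("Agent initialised");
}

device.onconnect(function() {
    if (!initialised) {
        initialised = true;
        initialise();
    }
});
```

A deploy restarts the agent while the device is offline, so nothing gets logged until the device reconnects - which is why the console looked to me like the agent hadn’t updated.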

I can see where you’re going with updates. I like the idea of notifications. In some cases, updates will change the interaction between agent and device. I would want some sort of coordination between them before an update took place. At other times, I wouldn’t care if one updated before the other.

Yep, the intention is to keep the agent and device versions as locked together as possible; there will always be corner cases, though (e.g. a device connects and, before listening for a reply from the server confirming it has the latest device firmware, immediately sends a message in an old format to the new agent).
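The pragmatic defence on the agent side is to sniff the message shape before trusting it (a minimal sketch; the message name and field names are made up):

```squirrel
// Agent code: a freshly-deployed agent may briefly receive old-format
// messages from a device that hasn't taken the new build yet.
device.on("reading", function(msg) {
    local value;
    if (typeof msg == "table" && "value" in msg) {
        value = msg.value;   // new format: { value = ..., units = ... }
    } else {
        value = msg;         // old format: a bare value
    }
    server.log("Reading: " + value);
});
```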

Quite hard to come up with a universal fix for those :-)