Changes to Device/Agent

coverdriven · September 14, 2015, 8:42pm

Will there be an update to explain whether any changes were rolled out after the server maintenance today? I note a few unfamiliar messages in the console log. I’ve found one of the trickier challenges has been managing synchronisation between the device and agent. All the restarts over the past few hours have been somewhat helpful, as they show how the agent and device cope with unsolicited restarts.

tguidon · September 14, 2015, 9:37pm

I’m getting this message in my log, but I don’t have any code writing this:

2015-09-14 17:35:47 UTC-4 [Device] Agent restarted: reload.
2015-09-14 17:35:47 UTC-4 [Device] Agent started.

Are you getting that too? Anyone know what this is? Having lots of trouble with my devices in the wild today.

hugo · September 15, 2015, 3:51am

We ran into issues during the maintenance interval which affected quite a few devices.

One of the changes in the deploy was an indication of when agents are restarted to make this more obvious.

Swieter · September 15, 2015, 10:40am

I’m curious, is there a way for a developer to simulate the Agent restart in order to test our Agent/Device interaction?

coverdriven · September 15, 2015, 12:01pm

A brutal way of restarting the Agent is by forcing it to run out of heap space. Aron made that helpful suggestion on the forum about a year ago. Works for the Device too.
There is also an API call available, which I use for testing, but I don’t believe it’s officially released yet, so use it at your own risk.

Coordination between the Device and Agent in the midst of random restarts is an ongoing challenge for me. I’ve recently been doing quite a bit of work with wifi gateways that use cellular backhaul. This can sometimes add an extra 30 seconds of latency to message passing between them.

hugo · September 15, 2015, 3:58pm

We are adding the ability for the build API to restart just the agent (the old API had this, but it is deprecated and will go away soon). If you want to restart the agent manually, you can use server.restart() within the agent. Maybe connect this to a specific URL being hit in your http.onrequest() handler?
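For anyone wanting to picture the shape of that hook, here is a rough, platform-neutral sketch in Python (not Squirrel, and not imp API code). The `/diag/restart` path, the `key` parameter, `make_handler` and the `restart` callback are all made-up names for illustration; on a real agent, the callback is where server.restart() would be called.

```python
def make_handler(restart, secret):
    """Build a request handler that triggers `restart` for one diagnostics path.

    `restart` is a hypothetical callback standing in for the platform's
    restart call; `secret` is a shared key that guards the route.
    """
    def handle(path, query):
        if path == "/diag/restart":
            # Refuse the request unless the caller supplies the right key.
            if query.get("key") != secret:
                return 403, "Forbidden"
            restart()
            return 200, "Restarting"
        # Any other path is outside this sketch.
        return 404, "Not found"

    return handle


# Usage: record restarts in a list instead of restarting anything real.
restarts = []
handler = make_handler(lambda: restarts.append("restart"), secret="s3cret")
status, body = handler("/diag/restart", {"key": "s3cret"})
```

Guarding the route with a key is only a precaution so that a stray request cannot bounce the agent; whether that is needed depends on how the diagnostics URL is exposed.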

FrancoisBourdon · September 15, 2015, 8:57pm

Thank you Hugo for sharing this undocumented server.restart() API.

A few days ago, I implemented a fix for my own issue related to agent-only restarts. Then I got it “tested” successfully during your last update, yesterday. But now, thanks to your lead, I have attached server.restart() to my own diagnostics URL, so I can include this as part of my test suite.

Oh, and BTW, thanks for your great technology.

François

hugo · September 15, 2015, 10:45pm

Migrating means the device has moved server, due to load balancing. Right now this happens more than it should, but that is being addressed (devices are not “sticky” enough at this point).

server.restart() is documented at https://electricimp.com/docs/api/server/restart/. The docs are actually a bit incorrect, in that server.restart() is implemented but onshutdown() is not yet.

coverdriven · September 15, 2015, 9:40pm

I’m seeing some unusual behaviour. It hits a couple of my devices each day: an Agent is randomly(?) restarted and “migrated”. I’m coping with it, but an unsolicited agent restart will force a restart of the device, to make sure that their states are in sync. The log messages that are not in bold are mine.

2015-09-16 07:21:12 UTC+12 [Device] (39860) [wifi] connected (Puss)
2015-09-16 07:21:12 UTC+12 [Agent] OnDisconnect
2015-09-16 07:21:49 UTC+12 [Device] Agent started.
2015-09-16 07:21:49 UTC+12 [Agent] (539283) Agent Setup
2015-09-16 07:21:49 UTC+12 [Agent] OnConnect
2015-09-16 07:21:49 UTC+12 [Agent] OnDeviceConnect
2015-09-16 07:21:49 UTC+12 [Device] Agent stopped: migrating.
2015-09-16 07:21:49 UTC+12 [Status] Device connected
2015-09-16 07:21:49 UTC+12 [Device] (39028) [wifi] connected (Puss)
2015-09-16 07:21:49 UTC+12 [Agent] (580951) restart
2015-09-16 07:21:49 UTC+12 [Status] Device connected
2015-09-16 07:21:49 UTC+12 [Agent] OnConnect
2015-09-16 07:21:50 UTC+12 [Agent] OnDeviceConnect
2015-09-16 07:21:50 UTC+12 [Device] (58960) Device Setup
2015-09-16 07:21:50 UTC+12 [Device] (50380) [wifi] connected (Puss)

The second “Device connected” event is triggered by my Agent. It senses that it and its Device are out of sync and forces a server.restart() of the Device.
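One possible way to detect that kind of mismatch, sketched in Python as a language-neutral model (an assumption for illustration, not necessarily how the agent above does it): give each agent instance a fresh boot ID, have the device remember the ID it last synced with, and treat a changed ID as a cue to restart. The `Agent` and `Device` classes and their methods are hypothetical and do not correspond to imp API objects.

```python
import uuid


class Agent:
    """Model of an agent that gets a fresh boot ID every time it starts."""

    def __init__(self):
        self.boot_id = uuid.uuid4().hex

    def check(self, device_known_boot_id):
        """Return True if the device's view of the agent is current (or brand new)."""
        return device_known_boot_id in (None, self.boot_id)


class Device:
    """Model of a device that remembers which agent instance it last synced with."""

    def __init__(self):
        self.known_boot_id = None
        self.restart_count = 0

    def connect(self, agent):
        if not agent.check(self.known_boot_id):
            # The agent restarted since the last sync: count a device restart
            # and forget the stale ID, as a real restart would.
            self.restart_count += 1
            self.known_boot_id = None
        # Record the agent instance we are now in sync with.
        self.known_boot_id = agent.boot_id


# Usage: a second agent instance stands in for an unsolicited agent restart.
device = Device()
device.connect(Agent())
device.connect(Agent())
```

In this model a first connection never triggers a restart, and reconnecting to the same agent instance is harmless; only a changed boot ID does.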

FrancoisBourdon · September 15, 2015, 10:59pm

I have used the word undocumented because it does not appear in the right side navigation bar of the API Reference pages, so without knowledge of the actual name of the method, one could not find documentation on this API.

That’s why I was thanking you for the lead, since the "This page is a placeholder … " statement made me conclude that this was not formally documented yet.

But more importantly, I am now using it and happy about it.

hugo · September 16, 2015, 5:10am

Ah, we’ll get that link fixed then, it should be on the sidebar…

jamesb · January 29, 2016, 12:36am

I’ve had some trial devices working offline the past week, and I couldn’t reconnect when I got back to them. I had to power cycle and lose all the data. I noticed the log has:

2016-01-22 10:18:13 UTC+11 [Status] Device disconnected
2016-01-27 03:15:00 UTC+11 [Status] Agent started.
2016-01-27 03:15:00 UTC+11 [Status] Agent stopped: migrating.
20

I guess the issue was that the agent had stopped due to migration and not restarted? So is this a case that would be fixed by a call to server.restart()? And is it true that I can call this through an HTTP request even if the agent has stopped?

Thanks.

hugo · January 29, 2016, 8:13pm

At a deploy, the agent can get moved. We did a deploy at about that time (see status.electricimp.com).

The agent did actually restart, it’s just the messages aren’t guaranteed in order as it happened within the same second (ie stopped happened first, then started). Can you PM the device ID? Are you saying the agent wasn’t servicing requests?

jamesb · January 30, 2016, 1:55am

Thanks for the reply Hugo. The device was on a farm and only tries to connect to wifi every hour, so I might just have been unlucky that the cellular connection on my tethered phone wasn’t good enough the hour I was there. I was assuming it was something up with the agent because of the “stopped” message in the log.

Still not sure, do I need a server.restart() in my agent code to cope with any migrating issues?

hugo · January 30, 2016, 6:17am

No, you don’t need any code in your agent to deal with migration - it happens automatically. At some point we’ll be making this process more polite (ie your agent will get notified that it’s going to be moved so it can tidy up a bit before being killed) though.

If the device usually connects once per hour, it seems a bit strange that it had not been seen since 1/22, no?

jamesb · January 30, 2016, 6:44am

Ah, I only take a 3G-to-wifi router to the field once a week to upload all the data. The device tries to connect every hour; it’s just that there’s usually no connection available.

rogerlipscombe · February 1, 2016, 3:37pm

“At a deploy, the agent can get moved.”

We reserve the option to move an agent for other reasons, such as to balance the agent load across the servers. As Hugo says, we hope to make this process more polite, but you should always be sure to explicitly save any data, just in case, for example, we lose an entire host.

Note that calling server.save in a tight loop will result in throttling, and your data is actually at more risk of not being written.
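As a rough illustration of saving explicitly without hammering the store, here is a language-neutral sketch in Python (not Squirrel). It coalesces frequent updates into at most one write per interval. The `CoalescingSaver` class, the `write` callback and the 10-second interval are all assumptions for the example; the actual throttling limits are not stated here, and on an agent the callback would be where server.save gets called.

```python
class CoalescingSaver:
    """Coalesce frequent updates into at most one write per min_interval seconds."""

    def __init__(self, write, min_interval=10.0):
        self._write = write              # hypothetical persistence callback
        self._min_interval = min_interval
        self._last_write = None          # time of the last actual write
        self._pending = None             # newest unsaved snapshot, if any

    def update(self, data, now):
        """Record the newest data and write it if the interval has elapsed."""
        self._pending = dict(data)
        return self.flush_if_due(now)

    def flush_if_due(self, now):
        """Write the pending snapshot if one exists and the interval has passed."""
        if self._pending is None:
            return False
        if self._last_write is not None and now - self._last_write < self._min_interval:
            return False
        self._write(self._pending)
        self._last_write = now
        self._pending = None
        return True


# Usage: five updates in quick succession produce a single immediate write;
# the newest snapshot is written once the interval has elapsed.
writes = []
saver = CoalescingSaver(writes.append, min_interval=10.0)
for t in range(5):
    saver.update({"n": t}, now=float(t))
saver.flush_if_due(now=10.0)
```

The trade-off is that the newest snapshot can sit unsaved for up to one interval, so the interval should be chosen with the cost of losing that window in mind.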

“The agent did actually restart, it's just the messages aren't guaranteed in order as it happened within the same second (ie stopped happened first, then started).”

It’s actually more complicated than that: the messages aren’t guaranteed in order at all, because they’re sent from different backend servers. They might actually arrive out of order, even if they are in different seconds. We’ve looked at options for establishing causality between the two messages, but decided that it wasn’t a good use of development effort for this one case.
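For the curious, one textbook way to establish causality between events from different hosts is a Lamport logical clock. The Python sketch below is purely illustrative: it is not what our backend does, and the idea of a handoff message carrying a stamp from the old host to the new one is an assumption made for the example. `LamportClock` and `order_events` are hypothetical names.

```python
class LamportClock:
    """Minimal Lamport logical clock."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Advance the clock for a local event and return the new stamp."""
        self.time += 1
        return self.time

    def receive(self, remote_time):
        """Merge a stamp received from another host, then advance."""
        self.time = max(self.time, remote_time) + 1
        return self.time


def order_events(events):
    """Sort (stamp, message) pairs by logical stamp rather than arrival order."""
    return [message for _, message in sorted(events)]


# Usage: the old host stamps the stop, and the (assumed) handoff carries that
# stamp to the new host, so the start is always stamped later than the stop.
old_host, new_host = LamportClock(), LamportClock()
stopped = (old_host.tick(), "Agent stopped: migrating.")
started = (new_host.receive(stopped[0]), "Agent started.")
arrived = [started, stopped]          # delivered to the log out of order
ordered = order_events(arrived)
```

Wall-clock timestamps from separate servers cannot give that guarantee, which is why the log lines can appear swapped.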

Distributed systems are fun.