2 Devices offline for no apparent reason

hvacspei · April 21, 2015, 6:08pm

I just noticed two of my devices are offline for no apparent reason (one on this account and one on another account):
Device ID: 235f37b930728cee
Device MAC: 0c2a69003fe8

Device ID: 237289b930728cee
Device MAC: 0c2a690463fd

They both seem to have gone offline between 11:00 EDT and 11:30 EDT. They are at different remote locations, and I have other devices at both locations that are still online, which rules out power/Internet service interruptions.

Thoughts?

hvacspei · April 21, 2015, 6:20pm

Make that 3 devices:
Device ID: 232e36038fb7bdee

I believe all logged “firmware update triggered”

hvacspei · April 21, 2015, 6:38pm

235f37b930728cee came back online and then disconnected, again.

hugo · April 22, 2015, 2:13am

Just looking in the logs:
235f37b930728cee is online, and seems ok (did not log a firmware update triggered, see other thread for why this was happening)

237289b930728cee is online (but was sent to firmware update, and didn’t come back until 23:17 UTC - did you power cycle it?)

232e36038fb7bdee is offline right now, and had been sent to firmware update, but appears to be barely hanging onto a signal - it’s reconnecting very often. I suspect this one is just on the edge of range, maybe?

We can likely remove our “device is flapping, send it to get new firmware in order to give it a clean reboot” code on our servers at this point, as that was to work around bugs in old (<=release 27) OSes…

vedecoid · April 22, 2015, 1:31pm

Quite a lot of our (development server) devices disconnected frequently between 2pm yesterday and 4am last night (CET). They are all on very strong WiFi signals. I guess this had to do with the scheduled server maintenance, but it lasted for more than 12 hours instead of the announced 2 hours?

vedecoid · April 22, 2015, 1:31pm

one example is 236ea6b930728cee

vedecoid · April 22, 2015, 1:51pm

Another one, 23509b4cead3dbee, went through an OS upgrade but, as far as I can tell, was ‘upgraded’ to the same release (31.0; this was our test device to try out the new program size expansion feature coming officially in 32). Any reason for such upgrades?

hvacspei · April 22, 2015, 5:16pm

Yes, I did power cycle 237289b930728cee and it came back online right away and it’s been solid ever since (as usual) with an RSSI of -61.

232e36038fb7bdee was power cycled and came back online solid with an RSSI of -63.

It’s a bit perplexing, as I’m generally careful not to deploy unless I have a sufficiently strong RSSI (-73 or better). That being said, I do have a few with weaker signals, but they weren’t affected.

hugo · April 22, 2015, 6:30pm

The actual server flap for development devices was only ~4 minutes, but if a device was detected as flapping and sent to upgrade, there is a chance (with versions before release 32, in certain circumstances) that it gets stuck there.

As I said in the previous post, “being sent to upgrade” just forces a total reboot, which we found would help devices that had been given bad DNS by a local server and had cached it; hence it would solve the connectivity issue in some cases.
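For readers wondering what “detected as flapping” might look like in practice, here is a minimal Python sketch of one common approach: counting reconnects inside a sliding time window. This is not the actual server code, which isn’t shown in this thread; the `FlapDetector` class, its method name, and both thresholds are hypothetical and chosen only for illustration.

```python
from collections import deque


class FlapDetector:
    """Flags a device as flapping when it reconnects too often within a sliding window."""

    def __init__(self, max_reconnects=5, window_seconds=60.0):
        # Both thresholds are illustrative assumptions, not real server settings.
        self.max_reconnects = max_reconnects
        self.window_seconds = window_seconds
        self._events = {}  # device ID -> deque of reconnect timestamps (seconds)

    def record_reconnect(self, device_id, timestamp):
        """Record a reconnect; return True if the device now counts as flapping."""
        events = self._events.setdefault(device_id, deque())
        events.append(timestamp)
        # Drop reconnects that have aged out of the window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) > self.max_reconnects


# Hypothetical device with made-up timestamps: four reconnects in three seconds.
detector = FlapDetector(max_reconnects=3, window_seconds=10.0)
flags = [detector.record_reconnect("example-device", t) for t in (0, 1, 2, 3)]
print(flags)  # [False, False, False, True]
```

In a scheme like this, a device that trips the threshold would be the one sent off for the forced reboot described above, while a device that reconnects only occasionally never accumulates enough events in the window.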

@hvacspei 232e36038fb7bdee was doing a lot of reconnects - not “our end detected an issue and closed the connection”; these were all initiated by the remote end. It could be power related too, maybe? Do you use wifi powersave mode?

rogerlipscombe · April 22, 2015, 6:37pm

To expand on Hugo’s point, the maintenance was scheduled to start at 15:00 UTC; we actually started at 15:12 UTC and finished at exactly 17:00 UTC. At 15:17:32 UTC, a configuration issue resulted in a relatively small number of devices being detected as flapping (and these were sent to the upgrade server). This was resolved at 15:20:07 UTC, and the maintenance window then proceeded according to plan.
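As a quick sanity check on that timeline, the short Python sketch below converts the 11:00 to 11:30 EDT window reported at the top of the thread into UTC and compares it with the configuration-issue window. The calendar date (April 21, 2015) is an assumption inferred from the post timestamps, and the variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

EDT = timezone(timedelta(hours=-4), "EDT")  # US Eastern Daylight Time is UTC-4


def utc(hour, minute, second=0):
    # Assumed date: April 21, 2015 (inferred from the thread, not stated outright).
    return datetime(2015, 4, 21, hour, minute, second, tzinfo=timezone.utc)


# Window in which the devices were reported to have dropped offline.
observed_start = datetime(2015, 4, 21, 11, 0, tzinfo=EDT).astimezone(timezone.utc)
observed_end = datetime(2015, 4, 21, 11, 30, tzinfo=EDT).astimezone(timezone.utc)

# Configuration-issue window from the maintenance timeline.
issue_start = utc(15, 17, 32)
issue_end = utc(15, 20, 7)

issue_seconds = (issue_end - issue_start).total_seconds()
inside_observed = observed_start <= issue_start and issue_end <= observed_end

print(observed_start.time(), observed_end.time())  # 15:00:00 15:30:00
print(issue_seconds)    # 155.0
print(inside_observed)  # True
```

Under that date assumption, the issue lasted 155 seconds and falls entirely inside the reported 11:00 to 11:30 EDT window.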

We’ve identified the steps necessary to avoid this happening in future.

hvacspei · April 22, 2015, 6:55pm

232e36038fb7bdee is located at a customer site. I’m not aware of any issues–we didn’t see any issues with other devices at the same location, but they could have been on different WiFi access point(s).

I doubt it was a power issue and the device code does not reference powersave mode. When I visited the location, no one mentioned any network/power issues, but that’s not conclusive. That being said, I use the same model for battery and plug-in devices (portable ad hoc temperature-humidity sensors). I do put the device to sleep for 60 seconds between readings. Could this result in the “reconnects” that you’re referencing?

!!!UPDATE - YIKES!!! I wrote this code so long ago that I had forgotten I called “imp.setpowersave(true);”

hugo · April 24, 2015, 6:27am

Yep, powersave can be less reliable than normal mode, depending on the router in use. Some router-side implementations of IEEE 802.11 power save (PS) are better than others…