2 Devices offline for no apparent reason

I just noticed two of my devices are offline for no apparent reason (one on this account and one on another account):
Device ID: 235f37b930728cee
Device MAC: 0c2a69003fe8

Device ID: 237289b930728cee
Device MAC: 0c2a690463fd

They both seem to have gone offline between 11:00 EDT and 11:30 EDT. They are at different remote locations, and I have other devices at both locations that are still online, which rules out power or Internet service interruptions.


Make that 3 devices:
Device ID: 232e36038fb7bdee

I believe all of them logged “firmware update triggered”.

235f37b930728cee came back online and then disconnected again.

Just looking in the logs:
235f37b930728cee is online, and seems ok (did not log a firmware update triggered, see other thread for why this was happening)

237289b930728cee is online (but was sent to firmware update, and didn’t come back until 23:17 UTC - did you power cycle it?)

232e36038fb7bdee is offline right now, and had been sent to firmware update, but appears to be barely hanging onto a signal - it’s reconnecting very often. I suspect this one is just on the edge of range, maybe?

We can likely remove our “device is flapping, send it to get new firmware in order to give it a clean reboot” code on our servers at this point, as that was to work around bugs in old (<=release 27) OSes…
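For anyone wondering what “device is flapping” detection could look like, here is a minimal sketch: count reconnects per device inside a sliding time window and flag the device once the count passes a threshold. This is not the actual server code; the class name `FlapDetector`, the method `record_reconnect`, and both threshold values are made up for illustration.

```python
from collections import deque


class FlapDetector:
    """Flags a device as flapping when it reconnects too often within a window.

    Hypothetical sketch: the names and default thresholds are assumptions,
    not the real server-side implementation.
    """

    def __init__(self, max_reconnects=5, window_seconds=300):
        self.max_reconnects = max_reconnects
        self.window_seconds = window_seconds
        self._events = {}  # device_id -> deque of reconnect timestamps (seconds)

    def record_reconnect(self, device_id, timestamp):
        """Record one reconnect; return True if the device now counts as flapping."""
        events = self._events.setdefault(device_id, deque())
        events.append(timestamp)
        # Drop reconnects that have aged out of the sliding window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) > self.max_reconnects


if __name__ == "__main__":
    detector = FlapDetector(max_reconnects=3, window_seconds=60)
    for t in (0, 10, 20, 30):
        flapping = detector.record_reconnect("232e36038fb7bdee", t)
        print(t, flapping)
```

With a rule like this, a device on the edge of WiFi range and a device hit by a brief server-side problem look the same, which is why a short server flap can sweep up otherwise healthy devices.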

Quite a lot of our (development server) devices disconnected frequently between 2pm yesterday and 4am last night (CET). They are all on very strong wifi signals. I guess this had to do with the scheduled server maintenance, but it lasted for more than 12 hours instead of the announced 2 hours?

One example is 236ea6b930728cee.

Another one, 23509b4cead3dbee, went through an OS upgrade but, as far as I can tell, was ‘upgraded’ to the same release (31.0; this was our test device to try out the new program size expansion feature coming officially in 32). Any reason for such upgrades?

Yes, I did power cycle 237289b930728cee. It came back online right away and has been solid ever since (as usual), with an RSSI of -61.

232e36038fb7bdee was power cycled and came back online solid, with an RSSI of -63.

It’s a bit perplexing, as I’m generally careful not to deploy unless I have a sufficiently strong RSSI (-73 or better). That being said, I do have a few with weaker signals, but they weren’t affected.

The actual server flap for development devices was only ~4 minutes, but if a device was detected as flapping and sent to upgrade, there is a chance (with versions before release 32, in certain circumstances) that it gets stuck there.

As I said in the previous post, “being sent to upgrade” just forces a total reboot, which we found would help devices that had been given bad DNS by a local server and had cached it. Hence it would solve the connectivity issue in some cases.

@hvacspei 232e36038fb7bdee was doing a lot of reconnects - not “our end detected an issue and closed the connection”; these were all initiated by the remote end. It could be power related too, maybe? Do you use wifi powersave mode?

To expand on Hugo’s point, the maintenance was scheduled to start at 15:00 UTC; we actually started at 15:12 UTC and finished at exactly 17:00 UTC. At 15:17:32 UTC, a configuration issue resulted in a relatively small number of devices being detected as flapping (and these were sent to the upgrade server). This was resolved at 15:20:07 UTC, and the maintenance window then proceeded according to plan.
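To put those timestamps side by side, here is a small sketch that works out the durations. The times are the ones quoted above; the helper `utc` and the variable names are just for illustration, and the date that `strptime` fills in is an arbitrary placeholder since only the differences matter.

```python
from datetime import datetime, timezone


def utc(hms):
    """Parse HH:MM or HH:MM:SS as a UTC time on a placeholder date."""
    fmt = "%H:%M:%S" if hms.count(":") == 2 else "%H:%M"
    return datetime.strptime(hms, fmt).replace(tzinfo=timezone.utc)


scheduled_start = utc("15:00")
actual_start = utc("15:12")
finish = utc("17:00")
flap_start = utc("15:17:32")
flap_end = utc("15:20:07")

start_delay = actual_start - scheduled_start   # how late the window opened
maintenance_length = finish - actual_start     # actual length of the window
flap_length = flap_end - flap_start            # how long devices were misdetected

print("start delay:       ", start_delay)
print("maintenance length:", maintenance_length)
print("flap length:       ", flap_length)
```

That gives a 12-minute late start, 1 hour 48 minutes of actual maintenance, and 2 minutes 35 seconds during which devices were being detected as flapping.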

We’ve identified the steps necessary to avoid this happening in future.

232e36038fb7bdee is located at a customer site. I’m not aware of any issues; we didn’t see any issues with other devices at the same location, but they could have been on different WiFi access point(s).

I doubt it was a power issue and the device code does not reference powersave mode. When I visited the location, no one mentioned any network/power issues, but that’s not conclusive. That being said, I use the same model for battery and plug-in devices (portable ad hoc temperature-humidity sensors). I do put the device to sleep for 60 seconds between readings. Could this result in the “reconnects” that you’re referencing?

!!!UPDATE - YIKES!!! I wrote this code so long ago that I had forgotten I called “imp.setpowersave(true);”

Yep, powersave can be less reliable than normal mode, depending on the router in use. Some router-side implementations of IEEE PS are better than others…