First time ever - power cycle required to get device online again

I’ve got 20 devices under the same model. Yesterday I updated the agent code and used the API to reprogram all of them with the update. Afterwards, 2 devices did not come back online (a 3rd had dropped off due to unrelated issues, I think).

The devices are relatively local so I was able to visit them and cycle power to get them online.

Any thoughts on how they got into this state?

Generally a code deploy does not make a device go offline - the connection stays up (no disconnect/reconnect) unless your code does that at startup.

Without seeing your device code, it’s hard to say. It’s possible you have a bug in your connection handling: do these devices poll, or do they stay online? If they stay online, what code are you using to do this?

We have people who deploy to 100k devices at a time without issues so we aren’t aware of any general problems.

I actually didn’t even touch the device code - it has been stable for a long time. I was updating the agent: I wrote some code that got caught by the syntax checker, corrected that, and uploaded. Next I used Python to run a script that I have been using to push the update to all 20 devices.

At this point it’s no big deal, because I was able to physically access the devices and get them online with just a power cycle. I don’t understand why it was necessary though because, as I wrote, that has never been the case before.

Generally, pushing updates will restart both device and agent even if only the agent has changed - but if the device code didn’t change, then nothing would actually be sent to the device (hashes would match).
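
To make the "hashes would match" point concrete, here is a minimal Python 3 sketch of a hash comparison that decides whether code needs to be re-sent. The thread does not say which hash the server uses or how it stores it, so SHA-256 and the function names `code_hash` and `needs_push` are assumptions for illustration only.

```python
import hashlib


def code_hash(source: str) -> str:
    """Return a hex digest of the source text (SHA-256 is an assumption)."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def needs_push(deployed_source: str, new_source: str) -> bool:
    """True only when the new build differs from what is already deployed."""
    return code_hash(deployed_source) != code_hash(new_source)


# Hypothetical code snippets standing in for one build's device and agent code.
device_v1 = 'server.log("hello");'
agent_v1 = "http.onrequest(handler);"
agent_v2 = "http.onrequest(newHandler);"

# Agent changed, device did not: only the agent would receive new code.
print(needs_push(agent_v1, agent_v2))    # True
print(needs_push(device_v1, device_v1))  # False
```

Under this model an agent-only change still restarts the device, but no new device code is transferred.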

Got a mac address of a device that worked, and one that didn’t for comparison?

These 16 restarted OK

0c2a690be980
0c2a690be68d
0c2a690be05e
0c2a690be5cb
0c2a690be6bc
0c2a690be45a
0c2a690be65d
0c2a690be940
0c2a690be73f
0c2a690be032
0c2a690be713
0c2a690be62d
0c2a690be725
0c2a690be46c
0c2a690be59e
0c2a690be72a

These two required power cycle on-site to come online
0c2a690be7c4
0c2a690be71c

This one had gone offline and is remote - I don’t know what the problem is - maybe just unplugged from power at the site
0c2a690be76a

they are all imp003 modules - Kris Winer’s Tindie breakout board

//=============================================================================

Below is the python code used to restart them

//=============================================================================

import json, httplib, base64
apiKey = base64.b64encode('--------------------------------') # API key designed for our devices
headers = {"Authorization": "Basic " + apiKey}
connection = httplib.HTTPSConnection('build.electricimp.com')
connection.connect()

# restart by model

model_id = '------------' # this is a set of 24 devices, 19 typically always ON

connection.request('POST', '/v4/models/' + model_id + '/restart', '', headers)
results = json.loads(connection.getresponse().read())
print results
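
For anyone repeating this on Python 3, where `httplib` has become `http.client`, the following is a rough equivalent of the script above. It sends the same request to the same host and path, and additionally returns the HTTP status so a failed restart request is visible. The helper names (`build_headers`, `restart_path`, `restart_model`) and the 30 second timeout are assumptions added for illustration, not part of the original script.

```python
import base64
import http.client
import json

API_HOST = "build.electricimp.com"  # host taken from the original script


def build_headers(api_key: str) -> dict:
    """Build the Basic auth header the same way the original script does."""
    token = base64.b64encode(api_key.encode("ascii")).decode("ascii")
    return {"Authorization": "Basic " + token}


def restart_path(model_id: str) -> str:
    """Path of the restart-by-model endpoint used in the original script."""
    return "/v4/models/" + model_id + "/restart"


def restart_model(api_key: str, model_id: str, timeout: float = 30.0):
    """POST the restart request and return (HTTP status, decoded body)."""
    connection = http.client.HTTPSConnection(API_HOST, timeout=timeout)
    try:
        connection.request("POST", restart_path(model_id), "", build_headers(api_key))
        response = connection.getresponse()
        body = response.read().decode("utf-8")
        try:
            parsed = json.loads(body)
        except ValueError:
            parsed = body  # keep the raw text if the reply is not JSON
        return response.status, parsed
    finally:
        connection.close()
```

A call would look like `status, body = restart_model(my_api_key, my_model_id)`, with both values supplied by the caller; nothing is sent until that function is invoked.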

I can see quite a few updates to devices over the last few days; can you narrow down when this one was? (UTC ideally)

I can only see 21 devices getting updated; some have been updated more than once in the past week:

(number of deploys / address)
9 30000c2a690be032
5 30000c2a690be05e
6 30000c2a690be45a
7 30000c2a690be46c
8 30000c2a690be59e
6 30000c2a690be5cb
7 30000c2a690be62d
8 30000c2a690be65d
1 30000c2a690be68a
7 30000c2a690be68d
7 30000c2a690be6bc
8 30000c2a690be713
4 30000c2a690be71c
7 30000c2a690be725
8 30000c2a690be72a
7 30000c2a690be73f
1 30000c2a690be743
1 30000c2a690be76a
4 30000c2a690be7c4
6 30000c2a690be940
16 30000c2a690be980

Once I know the actual event I can look a bit deeper and see how the update fanned out (or didn’t, and possibly why)
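
The 19 MAC addresses reported earlier and the 21 device IDs in the deploy listing can be cross-checked mechanically. The Python 3 sketch below does that with the values copied from this thread. It assumes the device ID is the MAC address with a 4-character prefix, which is how the two lists appear to line up here.

```python
# MAC addresses reported in the thread: 16 OK, 2 power-cycled, 1 remote.
reported = """
0c2a690be980 0c2a690be68d 0c2a690be05e 0c2a690be5cb 0c2a690be6bc 0c2a690be45a
0c2a690be65d 0c2a690be940 0c2a690be73f 0c2a690be032 0c2a690be713 0c2a690be62d
0c2a690be725 0c2a690be46c 0c2a690be59e 0c2a690be72a
0c2a690be7c4 0c2a690be71c
0c2a690be76a
""".split()

# "deploy count, device ID" pairs from the server-side listing.
deploys_text = """
9 30000c2a690be032 5 30000c2a690be05e 6 30000c2a690be45a 7 30000c2a690be46c
8 30000c2a690be59e 6 30000c2a690be5cb 7 30000c2a690be62d 8 30000c2a690be65d
1 30000c2a690be68a 7 30000c2a690be68d 7 30000c2a690be6bc 8 30000c2a690be713
4 30000c2a690be71c 7 30000c2a690be725 8 30000c2a690be72a 7 30000c2a690be73f
1 30000c2a690be743 1 30000c2a690be76a 4 30000c2a690be7c4 6 30000c2a690be940
16 30000c2a690be980
"""

tokens = deploys_text.split()
# Map MAC (last 12 hex digits of the device ID) to its deploy count.
deploys = {tokens[i + 1][-12:]: int(tokens[i]) for i in range(0, len(tokens), 2)}

reported_set = set(reported)
unreported = sorted(mac for mac in deploys if mac not in reported_set)
never_deployed = sorted(reported_set - set(deploys))

print(unreported)      # ['0c2a690be68a', '0c2a690be743']
print(never_deployed)  # []
print(deploys["0c2a690be7c4"], deploys["0c2a690be71c"])  # 4 4
```

So the two extra devices in the server-side listing are 0c2a690be68a and 0c2a690be743, each with a single deploy, and both devices that needed a power cycle show 4 deploys.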

Found something interesting at 2017/02/17 01:43:59Z for 30000c2a690be7c4.

It got sent new code at that point, and this caused it to be load balanced to another server - however, it didn’t turn up there. At 01:49:43 I see some (what look like) manual reloads, but it isn’t seen again until 11:11:17.

Is this device on a different network than the others? If you could email us your device code, we can have a look at your connection handling.

2017-02-16 19:43:39 UTC-6 [Status] Device disconnected
2017-02-16 19:43:39 UTC-6 [Status] Agent restarted: reload.
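
The log lines above are stamped in UTC-6 while the server-side times in this thread are in UTC, so it helps to put them on one clock. This short Python 3 sketch converts the disconnect time and compares it with the 01:43:59Z and 11:11:17 events mentioned earlier. It assumes those server times are both on 2017-02-17 UTC, as the thread implies.

```python
from datetime import datetime, timedelta, timezone

# Device log line, recorded at a fixed UTC-6 offset.
log_tz = timezone(timedelta(hours=-6))
disconnect_local = datetime(2017, 2, 16, 19, 43, 39, tzinfo=log_tz)

# Server-side events quoted in the thread (the trailing Z means UTC).
server_event = datetime(2017, 2, 17, 1, 43, 59, tzinfo=timezone.utc)
seen_again = datetime(2017, 2, 17, 11, 11, 17, tzinfo=timezone.utc)

disconnect_utc = disconnect_local.astimezone(timezone.utc)
print(disconnect_utc.isoformat())     # 2017-02-17T01:43:39+00:00
print(server_event - disconnect_utc)  # 0:00:20
print(seen_again - server_event)      # 9:27:18
```

The logged disconnect is 20 seconds before the server-side event, and the device was then unseen for about nine and a half hours.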

There are 10 devices in one building, so 8 of 10 worked OK. They should all be on the same SSID, but since there are multiple SSIDs programmed to try, a device could spend some time trying other networks if for some reason it fails to connect to the one that is cached.

The reason it came back online is because I did a power cycle at the site (yes, I was up at 5AM) : )

The other devices are on the other side of the country and they seem to have come online OK.

I could be wrong here and @hugo can confirm, but I believe that the imp enters a “firmware upgrade” mode when told it has new code. If the upgrade fails to complete, squirrel doesn’t run. The imp simply tries the upgrade again or backs off for a bit.

I’ve seen this on tenuous wifi connections (we have a few!), leading to repeated attempts to upgrade and no squirrel until the new code is loaded.

If your imp supports multiple ssids, be aware that without squirrel your imp will be locked onto whatever you last set with imp.setwificonfiguration().

You’re right in that there’s only one space to store squirrel, so if it has erased this space then it can’t run anything until new squirrel has been received - but the link was up in order to get the notification of new code hence it should be able to get the code.

If interrupted at this point, it would be in suspend_on_error mode, which means it’d keep trying to connect for 60s and then sleep for 9 mins if it failed, until it did manage to connect.
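
As a rough illustration of the retry pattern just described (60 seconds of connection attempts, then 9 minutes of sleep), the Python 3 sketch below lists the first few attempt windows and counts how many full cycles fit into the outage seen in this thread. Starting at the 01:43:59Z server event and using a 9 h 27 min 18 s outage are assumptions taken from the times quoted earlier. Whether the device was actually in this mode is not established here.

```python
from datetime import datetime, timedelta, timezone

CONNECT_WINDOW = timedelta(seconds=60)  # time spent trying to connect
SLEEP_WINDOW = timedelta(minutes=9)     # sleep after a failed attempt


def retry_windows(start, count):
    """Yield (attempt_start, attempt_end) for successive failed attempts."""
    attempt_start = start
    for _ in range(count):
        attempt_end = attempt_start + CONNECT_WINDOW
        yield attempt_start, attempt_end
        attempt_start = attempt_end + SLEEP_WINDOW


start = datetime(2017, 2, 17, 1, 43, 59, tzinfo=timezone.utc)
windows = list(retry_windows(start, 3))
for begin, end in windows:
    print(begin.time(), "to", end.time())

# Number of complete 10-minute cycles that fit in the observed outage.
outage = timedelta(hours=9, minutes=27, seconds=18)
cycles = outage // (CONNECT_WINDOW + SLEEP_WINDOW)
print(cycles)  # 56
```

On this model a device stuck in that mode would have made dozens of attempts on the cached network before the on-site power cycle.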

However, the erase only happens when new squirrel is about to be received. Here, the device got a reload command, re-identified, and then was load balanced to a different server; it would not have had the new firmware notification yet (that’s the next step) so should have been continuing to run the firmware it already had.

(of course, @peter may correct me on this, but that’s my understanding)

@jhtna does your code do anything specific with wifi settings, eg does it have several sets and iterate through them or anything?

The code does iterate through a list of possible SSIDs and passwords. However, in this case it should have picked up the original network. If that failed in that cycle due to bandwidth, it would change WiFi settings, sleep for a while, and would (or should) eventually come online.

There wasn’t new device code to load, it was only restarting due to the Agent getting new code. I don’t know if this is a unique situation, but there are 10 devices in one room all within ~ 30ft of each other. Since I used Python to trigger the update there would have been a burst of activity in a relatively short span.

I wouldn’t want to be over-confident about the hardware (no way to prove hardware didn’t hit a corner case) but I can say these devices have been in constant use since November 2016, are always powered on and sending data and I have not noticed any outages in that time.

btw, I sent the Device code yesterday through email.

thanks for the notes, it’s interesting.

Mark