Upgrade is in contant Fail loop

Scarx · June 7, 2017, 2:09pm

My device has been running for weeks without issue. Now I just get a solid green light, then nothing.
Looks like the Firmware OS upgrade is failing. How do I fix this?

2017-06-07 07:05:40 -07:00 [Exit Code] imp OS upgraded
2017-06-07 07:05:40 -07:00 [Status] firmware update triggered
2017-06-07 07:05:48 -07:00 [Status] Device disconnected
2017-06-07 07:06:47 -07:00 [Status] Device disconnected
2017-06-07 07:06:03 -07:00 [Exit Code] imp OS upgraded
2017-06-07 07:06:03 -07:00 [Status] firmware update triggered
2017-06-07 07:06:10 -07:00 [Status] Device disconnected
2017-06-07 07:06:23 -07:00 [Exit Code] imp OS upgraded
2017-06-07 07:06:23 -07:00 [Status] firmware update triggered
2017-06-07 07:06:24 -07:00 [Status] Device disconnected
2017-06-07 07:06:42 -07:00 [Exit Code] imp OS upgraded
2017-06-07 07:06:42 -07:00 [Status] firmware update triggered
2017-06-07 07:07:00 -07:00 [Exit Code] imp OS upgraded
2017-06-07 07:07:00 -07:00 [Status] firmware update triggered
2017-06-07 07:07:02 -07:00 [Status] Device disconnected
2017-06-07 07:07:18 -07:00 [Exit Code] imp OS upgraded
2017-06-07 07:07:18 -07:00 [Status] firmware update triggered
2017-06-07 07:07:26 -07:00 [Status] Device disconnected

phil · June 7, 2017, 2:47pm

What’s the MAC address of the imp please?

Scarx · June 7, 2017, 3:00pm

Hello,
It funny this has been failing all night, and right as I wrote this to the Forum, looks like the device finally succeeded.
.
.
.
2017-06-07 07:35:28 -07:00 [Status] firmware update triggered
2017-06-07 07:35:29 -07:00 [Status] Device disconnected
2017-06-07 07:35:47 -07:00 [Exit Code] imp OS upgraded
2017-06-07 07:35:47 -07:00 [Status] firmware update triggered
2017-06-07 07:35:48 -07:00 [Status] Device disconnected
2017-06-07 07:36:01 -07:00 [Exit Code] imp OS upgraded
2017-06-07 07:36:01 -07:00 [Status] Downloading new code; 31.42% program storage used
2017-06-07 07:36:13 -07:00 [Status] Device disconnected
2017-06-07 07:36:13 -07:00 [Device] Sleeping until 2017-06-07 14:38:11Z
2017-06-07 07:38:15 -07:00 [Status] Device disconnected
2017-06-07 07:38:15 -07:00 [Status] Device connected

JHT · June 9, 2017, 1:44pm

I have the same issue: 0c2a690224ae

has not been online since before May 2017 - here is the log from latest time it was powered on:

2017-06-09 03:32:37 -05:00 [Status] firmware update triggered
2017-06-09 03:33:03 -05:00 [Status] Device disconnected
2017-06-09 03:33:19 -05:00 [Exit Code] imp OS upgraded
2017-06-09 03:33:19 -05:00 [Status] firmware update triggered
2017-06-09 03:42:22 -05:00 [Agent] Result: device is NOT connected
2017-06-09 03:42:47 -05:00 [Status] Device disconnected
2017-06-09 03:43:07 -05:00 [Agent] Result: device is NOT connected
2017-06-09 03:44:15 -05:00 [Agent] Result: device is NOT connected
2017-06-09 03:44:35 -05:00 [Agent] Result: device is NOT connected
2017-06-09 03:56:56 -05:00 [Status] Device connected
2017-06-09 03:56:56 -05:00 [Device] connected 0 OFFICEHIDDEN
2017-06-09 03:56:56 -05:00 [Agent] address is 20000c2a690224ae
2017-06-09 03:56:56 -05:00 [Agent] cals requested; 20000c2a690224ae
2017-06-09 03:56:56 -05:00 [Agent] { “previous_time”: 1492557275, “T”: 20.9257, “Energy”: 1.94074e+08, “reference”: { “boot”: 1.94074e+08, “day”: 1.93252e+08, “manual”: 0, “week”: 1.92852e+08, “Hours”: 0 } }
2017-06-09 03:56:56 -05:00 [Device] start reporter from connection routine, next report in 30s
2017-06-09 03:57:01 -05:00 [Status] Device disconnected
2017-06-09 03:57:11 -05:00 [Exit Code] imp OS upgraded
2017-06-09 03:57:11 -05:00 [Status] firmware update triggered
2017-06-09 03:57:16 -05:00 [Status] Device disconnected
2017-06-09 03:57:25 -05:00 [Exit Code] imp OS upgraded
2017-06-09 03:57:25 -05:00 [Status] firmware update triggered
2017-06-09 03:57:30 -05:00 [Status] Device disconnected
2017-06-09 03:57:40 -05:00 [Exit Code] imp OS upgraded
2017-06-09 03:57:40 -05:00 [Status] firmware update triggered
2017-06-09 03:57:47 -05:00 [Status] Device disconnected
2017-06-09 03:57:55 -05:00 [Exit Code] imp OS upgraded
2017-06-09 03:57:55 -05:00 [Status] firmware update triggered
2017-06-09 03:58:06 -05:00 [Status] Device disconnected
2017-06-09 03:58:12 -05:00 [Exit Code] imp OS upgraded
2017-06-09 03:58:12 -05:00 [Status] firmware update triggered
2017-06-09 03:58:15 -05:00 [Status] Device disconnected
2017-06-09 03:58:36 -05:00 [Exit Code] imp OS upgraded
2017-06-09 03:58:36 -05:00 [Status] firmware update triggered
2017-06-09 03:58:43 -05:00 [Status] Device disconnected
2017-06-09 03:58:55 -05:00 [Exit Code] imp OS upgraded
2017-06-09 03:58:55 -05:00 [Status] firmware update triggered
2017-06-09 03:58:58 -05:00 [Status] Device disconnected
2017-06-09 03:59:13 -05:00 [Exit Code] imp OS upgraded
2017-06-09 03:59:13 -05:00 [Status] firmware update triggered
2017-06-09 03:59:19 -05:00 [Status] Device disconnected
2017-06-09 03:59:38 -05:00 [Exit Code] imp OS upgraded
2017-06-09 03:59:38 -05:00 [Status] firmware update triggered
2017-06-09 03:59:46 -05:00 [Status] Device disconnected
2017-06-09 03:59:58 -05:00 [Exit Code] imp OS upgraded
2017-06-09 03:59:58 -05:00 [Status] firmware update triggered
2017-06-09 04:00:06 -05:00 [Status] Device disconnected
2017-06-09 04:00:15 -05:00 [Exit Code] imp OS upgraded
2017-06-09 04:00:15 -05:00 [Status] firmware update triggered
2017-06-09 04:00:21 -05:00 [Status] Device disconnected
2017-06-09 04:00:31 -05:00 [Exit Code] imp OS upgraded
2017-06-09 04:00:31 -05:00 [Status] firmware update triggered
2017-06-09 04:00:37 -05:00 [Status] Device disconnected
2017-06-09 04:00:48 -05:00 [Exit Code] imp OS upgraded
2017-06-09 04:00:48 -05:00 [Status] firmware update triggered
2017-06-09 04:01:04 -05:00 [Exit Code] imp OS upgraded
2017-06-09 04:01:04 -05:00 [Status] firmware update triggered
2017-06-09 04:01:10 -05:00 [Status] Device disconnected
2017-06-09 04:01:20 -05:00 [Exit Code] imp OS upgraded
2017-06-09 04:01:20 -05:00 [Status] firmware update triggered
2017-06-09 04:01:22 -05:00 [Status] Device disconnected
2017-06-09 04:01:37 -05:00 [Exit Code] imp OS upgraded
2017-06-09 04:01:37 -05:00 [Status] firmware update triggered
2017-06-09 04:01:42 -05:00 [Status] Device disconnected
2017-06-09 04:01:55 -05:00 [Exit Code] imp OS upgraded
2017-06-09 04:01:55 -05:00 [Status] firmware update triggered
2017-06-09 04:11:21 -05:00 [Status] Device disconnected

harryreiss · June 9, 2017, 2:37pm

Hi JHT - We’ve put your imp upgrade on hold while we investigate this. If you power it up now, it should stop looping.

harryreiss · June 9, 2017, 3:15pm

We’ve identified the problem and are fixing it. Once it’s resolved, I’ll make sure your imp is queued again for the impOS 36 upgrade.

harryreiss · June 9, 2017, 3:55pm

This should be resolved now, and your imp upgrade to impOS 36 when it connects.

JHT · June 14, 2017, 10:44am

Thank you!

It is fixed now : )

jamesb · December 28, 2017, 11:32pm

I’ve got a device at a remote location that was failing updates:

|2017-11-30 04:36:34 +11:00|[Status]|Device disconnected|
|---|---|---|
|2017-11-30 04:36:34 +11:00|[Exit Code]|imp OS upgraded|
|2017-11-30 04:36:34 +11:00|[Status]|firmware update triggered|
|2017-11-30 04:37:40 +11:00|[Status]|Device disconnected|
|2017-11-30 04:37:40 +11:00|[Exit Code]|imp OS upgraded|
|2017-11-30 04:37:40 +11:00|[Status]|firmware update triggered|
|2017-11-30 04:38:42 +11:00|[Status]|Device disconnected|
|2017-11-30 04:38:42 +11:00|[Exit Code]|imp OS upgraded|
|2017-11-30 04:38:42 +11:00|[Status]|firmware update triggered|
|2017-11-30 04:48:14 +11:00|[Status]|Device disconnected|

That was the last I saw of it. 20000c2a690a342b
I had three other devices at the same site that are working fine. Any clues as to why this one failed?

rogerlipscombe · December 29, 2017, 2:48pm

This device upgraded from 36.11 to 36.13 at 2017-12-29 04:22:12.110Z.

jamesb · December 29, 2017, 6:03pm

Thanks @rogerlipscombe
Can you give me any clues as to why it seemed dead as far as I could tell for a month, and then suddenly yesterday it updated? Did it require intervention? It’s battery is nearly flat now, so seems it has been online and doing something sometime in the last month. I’ve got another device that’s behaving similar, so wondering if it’s related.
Thanks,
James

hugo · December 30, 2017, 11:02am

It seems to be very much on the edge of coverage; when it upgraded, it had an RSSI of -87 which is on the hairy edge of workable (depends a lot on the access point). This was attached to the AP with BSSID ending 5B:58.

Recently, RSSI has been in the -60 to -69 area, which is much better; it attached to a different AP with BSSID ending 46:03.

One other thing - lthough the steady battery voltage is ~3.2v, the unit is reporting brownouts to the 2.8v level. How are the devices powered, exactly? You may need some more bulk capacitance.

jamesb · January 1, 2018, 9:08pm

There RSSI should have been better than -70 dBm since about 20-Dec. So I’m still curious what the device was doing between then and 29-Dec when it suddenly upgraded and came back online.
Also, many of our devices are in fields with low signal strength (which is fine for data logging applications where data points that fail to upload are stored in memory). Is it possible to force them not to upgrade until a more reliable connection available to avoid devices being bricked during upgrade?

Thanks for the note on brownouts - I’ll look into it. Is this debug info available to users? The device is powered from AA batteries using the same buck regulator as on the imp002 board.

hugo · January 2, 2018, 5:26am

Dug even deeper into the logs and can see some things:

really having problems connecting on 11/27 22:33UTC; many attempts (with a 58s connect timeout which I guess is in your code as that’s not a system timeout). Lots of failed associations to 5B:58 but then a connect to 46:03 in the same session. What happened on 12/20? Seems like there was a “better” AP available on 11/27 but it was not wanting to use it even after multiple timeouts on 5B:58.
Upgraded from 36.10 to 36.11 at 2017-11-28T17:34:22. This version (and 36.12) are now known to have had some issues with connectivity to some APs (see other theads), however the device kept on with its hourly connects after this - mostly (but not exclusively) to the low RSSI AP.
Due to an unrelated issue discovered when testing 36.11, we returned all units to 36.10 just before the device came back at 2017-11-29T17:35:23 - and it was told to downgrade.
Each time, the downgrade failed and the device came back (and was told to downgrade again). Logs show no contact to the upgrade server. We don’t have blackbox logs of why this failed as that device did not upgrade its bootrom to one with blackbox enabled - we have not yet forced this bootrom update for devices in the field.
No sign of the device between 2017-11-29T17:48:14 (when it was last told to upgrade) and 2017-12-29 04:20:54 - and no attempted upgrade fetches either. That’s a suspiciously month-long period, which is the maximum programmatic sleep you can do in squirrel.
On 12/29 it connected via the weak AP (still running 36.11) and was told to upgrade to 36.13 (which was out by that point). Upgraded ok, though signal very weak - reported -90
Sidenote: the AP 46:03 appears to be changing channel (saw it on both ch9 and ch10 in different connects)
Break in connects between 2017-12-29 05:22:52 and 2017-12-29 22:22:38, at which point it resumed its hourly reports. First connect reported lots of struggling to connect to the low RSSI AP before it finally got onto the strong one.

My observations:

I think the month long gap could be a bug in your squirrel, just simply because that’s a suspiciously precise time to be away when there’s connectivity present. Can you post it (privately if necessary)?
There’s no sign that the failed upgrade bricked the device in any way - it came back as if nothing had happened (aside from the long sleep). The fact it came back with 36.11 intact on 12/29 indicates that it never managed to fetch any of 36.10.
There’s definitely something strange happening with the weak AP. The device is scanning for APs but for some reason either isn’t seeing, or isn’t picking up the strong AP. We can’t see why this decision is being made in our logging though.

On upgrades, we assume that if the device has managed to connect and do the whole TLS key exchange, it can likely fetch an upgrade file. We have discussed things we could do to detect repeated attempts and back off & notify the user via some not-yet-existing pathways, but nothing on the roadmap yet.

Brownout info is not currently exposed, no - can you post any more info on your power supply? How many AAs in series? Any bulk bypass across the AA’s? Is the buck layout close to the recommended layout? (typically layout issues manifest themselves as things like bad transient response which could indeed show up as brownouts when the WiFi radio kicks in).

jamesb · January 4, 2018, 4:28am

Thanks for the detailed info @hugo, much appreciated!

I found out there are two APs at the site, unfortunately both with the same SSID. I’m hoping to fix that tomorrow.

The buck layout is pretty much identical to the imp002 breakout board layout. We’ve got 10uF on VIN, 24uF on V3P3. There are 6 AAs in this device.

The wakeup code is something like this (I’ll think more about if there are other parts of the code that might be influencing the wakeup time in this case - maybe one of the variables got set to the maximum possible value somewhere somehow):

nv.next_reading_time <- time() + 60*nv.log_minutes;
if (nv.next_reading_time - time() < 50) //less than one minute
    nv.next_reading_time = time() + 60*1; //set to one minute from now
....
local next_hour = date(nv.next_reading_time).hour;
local next_minute = date(nv.next_reading_time).min;
local next_second = date(nv.next_reading_time).sec;
server.sleepuntil(next_hour, next_minute, next_second);

hugo · January 4, 2018, 6:26am

6xAA - I presume in series… it has got to be hard to get a brownout on that; if you want to send the gerbers by PM we can take a quick look, though.

As you’re doing an HMS sleep I don’t believe there’s any way that could go for longer than a day; if you were calculating seconds and doing a sleepfor() though, that’s where you could max out at a month.