My device has been running for weeks without issue. Now I just get a solid green light, then nothing.
Looks like the Firmware OS upgrade is failing. How do I fix this?
Thanks @rogerlipscombe
Can you give me any clues as to why it seemed dead as far as I could tell for a month, and then suddenly yesterday it updated? Did it require intervention? It’s battery is nearly flat now, so seems it has been online and doing something sometime in the last month. I’ve got another device that’s behaving similar, so wondering if it’s related.
Thanks,
James
It seems to be very much on the edge of coverage; when it upgraded, it had an RSSI of -87 which is on the hairy edge of workable (depends a lot on the access point). This was attached to the AP with BSSID ending 5B:58.
Recently, RSSI has been in the -60 to -69 area, which is much better; it attached to a different AP with BSSID ending 46:03.
One other thing - lthough the steady battery voltage is ~3.2v, the unit is reporting brownouts to the 2.8v level. How are the devices powered, exactly? You may need some more bulk capacitance.
There RSSI should have been better than -70 dBm since about 20-Dec. So I’m still curious what the device was doing between then and 29-Dec when it suddenly upgraded and came back online.
Also, many of our devices are in fields with low signal strength (which is fine for data logging applications where data points that fail to upload are stored in memory). Is it possible to force them not to upgrade until a more reliable connection available to avoid devices being bricked during upgrade?
Thanks for the note on brownouts - I’ll look into it. Is this debug info available to users? The device is powered from AA batteries using the same buck regulator as on the imp002 board.
Dug even deeper into the logs and can see some things:
really having problems connecting on 11/27 22:33UTC; many attempts (with a 58s connect timeout which I guess is in your code as that’s not a system timeout). Lots of failed associations to 5B:58 but then a connect to 46:03 in the same session. What happened on 12/20? Seems like there was a “better” AP available on 11/27 but it was not wanting to use it even after multiple timeouts on 5B:58.
Upgraded from 36.10 to 36.11 at 2017-11-28T17:34:22. This version (and 36.12) are now known to have had some issues with connectivity to some APs (see other theads), however the device kept on with its hourly connects after this - mostly (but not exclusively) to the low RSSI AP.
Due to an unrelated issue discovered when testing 36.11, we returned all units to 36.10 just before the device came back at 2017-11-29T17:35:23 - and it was told to downgrade.
Each time, the downgrade failed and the device came back (and was told to downgrade again). Logs show no contact to the upgrade server. We don’t have blackbox logs of why this failed as that device did not upgrade its bootrom to one with blackbox enabled - we have not yet forced this bootrom update for devices in the field.
No sign of the device between 2017-11-29T17:48:14 (when it was last told to upgrade) and 2017-12-29 04:20:54 - and no attempted upgrade fetches either. That’s a suspiciously month-long period, which is the maximum programmatic sleep you can do in squirrel.
On 12/29 it connected via the weak AP (still running 36.11) and was told to upgrade to 36.13 (which was out by that point). Upgraded ok, though signal very weak - reported -90
Sidenote: the AP 46:03 appears to be changing channel (saw it on both ch9 and ch10 in different connects)
Break in connects between 2017-12-29 05:22:52 and 2017-12-29 22:22:38, at which point it resumed its hourly reports. First connect reported lots of struggling to connect to the low RSSI AP before it finally got onto the strong one.
My observations:
I think the month long gap could be a bug in your squirrel, just simply because that’s a suspiciously precise time to be away when there’s connectivity present. Can you post it (privately if necessary)?
There’s no sign that the failed upgrade bricked the device in any way - it came back as if nothing had happened (aside from the long sleep). The fact it came back with 36.11 intact on 12/29 indicates that it never managed to fetch any of 36.10.
There’s definitely something strange happening with the weak AP. The device is scanning for APs but for some reason either isn’t seeing, or isn’t picking up the strong AP. We can’t see why this decision is being made in our logging though.
On upgrades, we assume that if the device has managed to connect and do the whole TLS key exchange, it can likely fetch an upgrade file. We have discussed things we could do to detect repeated attempts and back off & notify the user via some not-yet-existing pathways, but nothing on the roadmap yet.
Brownout info is not currently exposed, no - can you post any more info on your power supply? How many AAs in series? Any bulk bypass across the AA’s? Is the buck layout close to the recommended layout? (typically layout issues manifest themselves as things like bad transient response which could indeed show up as brownouts when the WiFi radio kicks in).
Thanks for the detailed info @hugo, much appreciated!
I found out there are two APs at the site, unfortunately both with the same SSID. I’m hoping to fix that tomorrow.
The buck layout is pretty much identical to the imp002 breakout board layout. We’ve got 10uF on VIN, 24uF on V3P3. There are 6 AAs in this device.
The wakeup code is something like this (I’ll think more about if there are other parts of the code that might be influencing the wakeup time in this case - maybe one of the variables got set to the maximum possible value somewhere somehow):
nv.next_reading_time <- time() + 60*nv.log_minutes;
if (nv.next_reading_time - time() < 50) //less than one minute
nv.next_reading_time = time() + 60*1; //set to one minute from now
....
local next_hour = date(nv.next_reading_time).hour;
local next_minute = date(nv.next_reading_time).min;
local next_second = date(nv.next_reading_time).sec;
server.sleepuntil(next_hour, next_minute, next_second);
6xAA - I presume in series… it has got to be hard to get a brownout on that; if you want to send the gerbers by PM we can take a quick look, though.
As you’re doing an HMS sleep I don’t believe there’s any way that could go for longer than a day; if you were calculating seconds and doing a sleepfor() though, that’s where you could max out at a month.