Interpreting hardware.wakereason()

coverdriven · January 10, 2015, 11:36pm

I’ve been monitoring a few of my imps recently and would like to understand how hardware.wakereason() determines which value to return. I call hardware.wakereason() as part of my device initialisation and log it for later review. If I deploy new FW to an imp by clicking on “Build and Run”, I can expect to see an imp restart with the value WAKEREASON_NEW_SQUIRREL (4). If an imp encounters a run-time error that forces a restart, I’ll get a WAKEREASON_TIMER (1) instead. However, if I deploy to a single imp when there other other imps using the same “Model” , it seems to be the case that the next restart that occurs for the others will also be logged as a WAKEREASON_NEW_SQUIRREL, rather than the actual reason why it restarted. This is moderately frustrating. I can understand that the OS might want to take advantage of the restart to upgrade the firmware, but it doesn’t seem right to mask out the real reason for the restart. Wouldn’t it be better to use a single bit in the value hardware.wakereason() return value to indicate the new FW and/or a new OS were downloaded without overwriting the original reason?

dheadrick · January 15, 2015, 2:42pm

I added some code to log disconnects and thought it would be handy to add some info to see the reason. I have found the info to not be helpful at all. Here is the log entries I’m seeing:

2015-01-12 08:02:21 UTC-5 [Status] Device disconnected 2015-01-12 08:02:27 UTC-5 [Status] Device Booting; 9.45% program storage used 2015-01-12 08:02:27 UTC-5 [Device] -------------------------------------------------- Disconnected 1 times! 2015-01-12 08:02:27 UTC-5 [Device] Wifi Signal Strength: -60dBm (5 bars) 2015-01-12 08:02:27 UTC-5 [Device] Reason for Waking/Reboot: New Squirrel code downloaded 2015-01-13 08:03:09 UTC-5 [Status] Device disconnected 2015-01-13 08:03:13 UTC-5 [Status] Device Booting; 9.45% program storage used 2015-01-13 08:03:13 UTC-5 [Device] -------------------------------------------------- Disconnected 1 times! 2015-01-13 08:03:13 UTC-5 [Device] Wifi Signal Strength: -54dBm (5 bars) 2015-01-13 08:03:13 UTC-5 [Device] Reason for Waking/Reboot: New Squirrel code downloaded 2015-01-13 20:16:01 UTC-5 [Status] Device disconnected 2015-01-14 08:13:15 UTC-5 [Status] Device Booting; 9.45% program storage used 2015-01-14 08:13:15 UTC-5 [Device] -------------------------------------------------- Disconnected 1 times! 2015-01-14 08:13:15 UTC-5 [Device] Wifi Signal Strength: -52dBm (5 bars) 2015-01-14 08:13:15 UTC-5 [Device] Reason for Waking/Reboot: New Squirrel code downloaded 2015-01-14 13:06:57 UTC-5 [Status] Device disconnected 2015-01-14 13:07:45 UTC-5 [Status] Device Booting; 9.45% program storage used 2015-01-14 13:07:45 UTC-5 [Device] -------------------------------------------------- Disconnected 1 times! 2015-01-14 13:07:45 UTC-5 [Device] Wifi Signal Strength: -55dBm (5 bars) 2015-01-14 13:07:45 UTC-5 [Device] Reason for Waking/Reboot: New Squirrel code downloaded

These are the functions called during disconnects:

`
function disconnection_handler(reason)
{
if (reason != SERVER_CONNECTED) {
disconnectFlag = true;
disconnectCount++;
server.connect(disconnection_handler, 60);
} else {
server.log("-------------------------------------------------- Disconnected " + disconnectCount + " times!" );
reportRSSI();
logDeviceOnline();
disconnectCount = 0;
}
}

function reportRSSI()
{
local rssi = imp.rssi()
local bars = 5

if (rssi < -87) {
    bars = 0
} else if (rssi < -82) {
    bars = 1
} else if (rssi < -77) {
    bars = 2
} else if (rssi < -72) {
    bars = 3
} else if (rssi < -67) {
    bars = 4
}

if (bars == 1) {
    server.log("Wifi Signal Strength: " + rssi + "dBm (1 bar)")
} else {
    server.log("Wifi Signal Strength: " + rssi + "dBm (" + bars + " bars)")
}

}

function logDeviceOnline()
{
local reasonString = "Unknown"
switch(hardware.wakereason())
{
case WAKEREASON_POWER_ON:
reasonString = "The power was turned on"
break
case WAKEREASON_SW_RESET:
reasonString = "A software reset took place"
break
case WAKEREASON_TIMER:
reasonString = "An event timer fired"
break
case WAKEREASON_PIN1:
reasonString = "Pulse detected on Wakeup Pin"
break
case WAKEREASON_NEW_SQUIRREL:
reasonString = "New Squirrel code downloaded"
break
case WAKEREASON_SQUIRREL_ERROR:
reasonString = "Squirrel runtime error"
break
case WAKEREASON_NEW_FIRMWARE:
reasonString = "impOS update"
break
case WAKEREASON_SNOOZE:
reasonString = "A snooze-and-retry event"
break
case WAKEREASON_HW_RESET:
// imp003 only
reasonString = “Hardware reset”
}
server.log("Reason for Waking/Reboot: " + reasonString)
}
`

The code has not been changed by me. The reason is always “New Squirrel code downloaded”. I have yet to see any other reasons.

coverdriven · January 15, 2015, 5:28pm

The main issue I raise with my post here is that a pending firmware update appears to override hardware.wakereason(), meaning that I get WAKEREASON_NEW_SQUIRREL when I expected something else.

For reasons not clear to me, a software fault that causes the device to restart will result in WAKEREASON_TIMER being returned, not WAKEREASON_SW_RESET. I have a number of devices on test and if there is a reset I want to discriminate between a standard sw error (potentially trappable and recoverable) and an OS runtime error (fatal).

peter · January 16, 2015, 12:54pm

Could you go into more details? A software fault that causes the device to restart, is of course a bug, and we would like to fix it.

What’s happening in your case, is that the imp restarts, so its last-wake-reason is SW_RESET. As part of the reconnection process, it asks the server whether there’s any new Squirrel – which there then is. So it restarts again to get it, at which point its last wake-reason is NEW_SQUIRREL and the one before that is lost.

Peter

coverdriven · January 16, 2015, 4:30pm

Thanks Peter. I assume that there is no Squirrel execution between the SW_RESET and the NEW_SQUIRREL restarts? As I can’t seem to get the former wakereason. I understand that there is new functionality in the pipeline that will provide more Squirrel control over fw updates. This may address my minor frustration.

Related to your first point, I’m trying very hard to identify why I’m getting intermittent out-of-memory restarts on some of the devices. As soon as I have something concrete and repeatable, I’ll put it forward.

DrJack · January 17, 2015, 5:37am

You could regularly query your remaining memory to keep an eye on it and see when it is getting lost?

coverdriven · January 17, 2015, 9:31pm

All my main test devices POST remaining memory free to Loggly every minute. It’s not a slow memory leak that I’m chasing. I’ve seen a device report 41K free then immediately report an out-of-memory condition.

Rather than finding the cause (which is proving elusive), I’m now simply changing code to avoid the likely cause. I can only be confident of that if I see a lack of device restarts in the event logs that I can attribute to an out-of-memory condition. However, the stats are hard to collect if a fw update overwrites the wakereason.

One solution is to make sure the fw for all units is up to date. But it’s not easy if they are not all online at the same time.

hugo · January 19, 2015, 1:31am

When you push firmware to blessed devices they’ll get the new FW immediately if they are online, or on next connect if they’re not; I guess that doesn’t help much for the first cycle of debug for offline units, but should later?

peter · January 20, 2015, 6:21am

Does this device use the sampler or the fixed-frequency DAC at all?

Peter

coverdriven · January 20, 2015, 2:10pm

No, I’m just using UARTS heavily. There is also a lot of table and string use, with constant creation/deletion of ~200 character strings and juggling of many nested tables (3x deep).

peter · January 21, 2015, 11:15am

It’s possible you’re suffering from heap fragmentation – but my first guess (without having seen the code) would be that in some error condition, or after some UART corruption, the code is trying to allocate an impossibly long string or blob.

Peter

coverdriven · January 21, 2015, 6:07pm

UART corruption is very unlikely. The operations I’ve been performing on them have been rock solid for months. All the UART buffers are pre-allocated and not expandable. My recent changes have been around introducing a cache inside the device. The cache manipulates a lot of tables and strings. The largest single thing I manipulate is an array that slowly expands from 0 to about 60 entries (about ~480 bytes) and that is often the moment where I get the out-of-memory condition. At the time it appends a single entry to that array, getmemoryfree() reports around 20K free, sometimes more.

I also observe that a failure is much more likely to occur immediately after a live firmware update. It is less likely to occur after the out-of-memory condition has reset the device. Is the device heap in a different state for both situations?

coverdriven · January 21, 2015, 11:06pm

Peter, could you please explain the difference between:

[Exit Code] imp restarted, reason: memory error
and
[Exit Code] imp restarted, reason: out of memory

I triggered a download of new firmware to a group of test devices and, while most updated without a problem, a few returned the first message and and a few returned with the second (but only after running for a few for 4-10s).

For the record, I do not disconnect from wifi at startup. I’m mindful of the note about the 3K of RAM reallocated when the imp reconnects.

peter · January 22, 2015, 4:24am

“Out of memory” means that a heap allocation failed because Squirrel data has used up all the available RAM. “Memory error”, on the other hand, is very interesting and means that you’ve found an improm bug: a way of confusing the improm itself into accessing memory locations which it shouldn’t. (Older improm releases, before 30.x, didn’t notice such errors.) I’d very much like to see the Squirrel code that triggers “memory error”: if you’d rather not post it to the forums, could you perhaps email it to peter@electricimp.com?

It’s even possible that a bug which can cause “memory error” is in fact corrupting the heap, which could then cause spurious “out of memory” restarts.

Peter

peter · January 22, 2015, 5:22am

Oh wait, is this that Squirrel that ends up doing lots of recursive f.call()? Too many nested f.call()s can exhaust the stack, which will also appear as a “memory error”.

Peter

coverdriven · January 22, 2015, 5:51am

I use function.acall() in both the device and agent, but it isn’t called recursively, nor are there nested calls or acalls. However, it does surprise me to hear that they would have a different effect on the stack to a standard invocation.

The only recursion I use is my own json encoder, which is never invoked directly through acall() or call(), although it may be invoked through the callee.

peter · January 23, 2015, 1:09pm

Does the device use imp.scanwifinetworks() at all?

Peter

coverdriven · January 23, 2015, 5:01pm

Ahhh, yes, very interesting. I’d just starting looking at the wifi stuff last night. I will email you with more detail.

peter · January 26, 2015, 6:51am

Ah, so serendipity has struck. On Friday I found a bug in a third-party library, whose effect was that imp.scanwifinetworks() can leak memory. And worse, leak it by losing it altogether, so that it’s still counted in imp.getfreememory() but can’t be reused. The symptom would be, if you call imp.scanwifinetworks() a lot without ever restarting the imp, you can end up in a situation where you get an out-of-memory restart despite imp.getfreememory() looking quite healthy.

This will of course be fixed in the next release, and there are checks in place that will stop any future bugs of this type escaping into production. Sadly for the time being there isn’t much of a workaround: either don’t call imp.scanwifinetworks() very much, or periodically restart the imp using imp.deepsleepfor(0) or similar. Unfortunately it has to be a full reboot, via either deep sleep or a power-cycle (or out-of-memory): a mere Squirrel restart (such as by throwing an uncaught exception) doesn’t reset the heap state.

Peter

coverdriven · January 26, 2015, 4:37pm

OK, good to know. This is likely to be the root cause of at least some of my woes.
I can’t eliminate the use of scanwifinetworks() completely, but will certainly use it more sparingly.