Recursive wakeup issue in agent

I have an agent function called checkTime() that’s called recursively every 30 seconds using the statement local ref = imp.wakeup(30, checkTime); inside the function. I check ref and log an error if it is null.

While it works fine for what seems like a few weeks (or maybe months), the function eventually stops being triggered. No error is logged, but when I finally notice it’s broken and restart the code, it carries on working for another few weeks. Having such a long time between failures makes it awkward to test, which is why I’m a bit vague on the details. I’ve only just had this happen again, for the third time.

Might there be some limit on recursion that I’m running into? I’m sure I’ve got device code running recursively like this without any problems.

As per the docs, agents are limited to 20 timers, whereas devices are limited only by the amount of available memory, so it sounds like you’re hitting that limit.

If you can give some context for the code that sets checkTime() to run in 30 seconds’ time, we might be able to suggest a solution.
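For reference, this is roughly how the limit shows up in an agent (a sketch for illustration only): once the timer slots are all in use, imp.wakeup() should simply return null, which is exactly what your null check is there to catch.

// Illustration only: schedule more timers than an agent allows and watch
// imp.wakeup() start returning null once the limit is reached
for (local i = 0; i < 25; i++)
{
    local ref = imp.wakeup(60, function() {});
    if (ref == null)
    {
        server.error("Timer " + i + " could not be scheduled - limit reached");
    }
}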

It may be the 20-timer limit, although the imp.wakeup() that eventually goes missing is simply re-triggering itself, as in this stripped-down equivalent I have tested:

count <- 0;
function checkTime()
{
    // Reschedule immediately; imp.wakeup() returns null if the timer could not be set
    local ref = imp.wakeup(0.01, checkTime);
    if (ref == null)
    {
        server.error("Checktime timer failed");
    }
    // Log every 100th iteration, with the current minute from date()
    local now = date();
    if (count % 100 == 0) server.log(count + " " + now.min);
    count++;
}
checkTime(); // start

This sped-up test passes, of course, and by “passing” I mean it carries on beyond the point at which the full code failed. There were roughly 22 days between my last update of the full code and when I noticed the failure, which is in the order of 63,360 thirty-second callbacks; that count made me suspicious of a 2^16 limit, but the test above runs well past it.

My full agent code does have other timers that are not so easily tested (they’re part of an HTTP request queue), so I’ll wrap some more debug around them.
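For example, something along these lines (the wrapper and counter names are just a sketch, not my real code) should show whether the rest of the application ever creeps up towards the limit:

activeTimers <- 0;

// Sketch of a debug wrapper: count timers that have been scheduled but have not yet fired
function trackedWakeup(interval, callback)
{
    local ref = imp.wakeup(interval, function()
    {
        activeTimers--;
        callback();
    });
    if (ref == null)
    {
        server.error("imp.wakeup returned null with " + activeTimers + " timers outstanding");
    }
    else
    {
        activeTimers++;
    }
    return ref;
}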

What was perplexing me was that the checkTime() function could apparently just vanish without reporting that it was unable to schedule another callback of itself. Incidentally, I appreciate that this isn’t truly recursive, so there should be no resource limitation on one imp.wakeup() callback scheduling another instance of itself like this.

There’s no recursion happening here; when checkTime() exits, you’re back to zero in terms of stack depth.

If this is the only code, then there’s no chance of hitting the 20 timer limit (that’d require checkTime() to be called from multiple places, each one starting a new chain of retriggering timers).
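For example (purely hypothetical), something like this would quietly start an extra chain of retriggering timers every time the device reconnected:

// Hypothetical: kicking off checkTime() again on every reconnect starts one
// more independent chain of retriggering timers each time
device.onconnect(function()
{
    checkTime();
});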

Can you give timestamps for when things stopped working for you? I’d like to see whether this coincided with server deploys or anything. Many customers rely on timers firing consistently for their agents to work, and though we did have an issue with this a couple of years ago (in certain circumstances), we’re pretty sure there are no lurking issues left.

As @hugo mentioned, there were some issues in previous years with timers being lost. I’m pleased to say that they have been very reliable since then. If they weren’t, my devices would collapse in a heap.

What timeout policy are you using? SUSPEND_ON_ERROR or RETURN_ON_ERROR? WAIT_TIL_SENT or WAIT_FOR_ACK?
Calls to server.log() or server.error() have the potential to block if you don’t wrap them in a test for a server connection beforehand.
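On the device side that looks something like this (a sketch, not taken from your code): RETURN_ON_ERROR keeps the device running when the link drops, and the connection check stops the log calls from blocking.

// Device-side sketch: don't suspend on connection errors, and only log
// when we actually have a server connection
server.setsendtimeoutpolicy(RETURN_ON_ERROR, WAIT_TIL_SENT, 10);

function safeLog(message)
{
    if (server.isconnected())
    {
        server.log(message);
    }
}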

@Hugo, there are other agent timers in the full application code and they’re potentially exceeding the limit, so I’m tracking their usage next. I haven’t yet got any info on the exact time this one particular wakeup timer callback goes AWOL, because I always get a non-null reference, so there’s been no error to log.

@coverdriven, the default SUSPEND_ON_ERROR policy is used. I’m pretty sure that some more debug will shine a light on an unnoticed error that breaks the chain of execution. I just need to offload some of the other logging so I can see the important event before it scrolls off the end of the world.
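In the meantime, and assuming the break really is caused by a runtime error somewhere in the callback, I’ll probably restructure it along these lines so any exception at least gets logged rather than silently ending the chain:

function checkTime()
{
    // Reschedule first so nothing later in the body can break the chain
    local ref = imp.wakeup(30, checkTime);
    if (ref == null) server.error("Checktime timer failed");

    try
    {
        // ... the real 30-second work goes here ...
    }
    catch (err)
    {
        server.error("checkTime body threw: " + err);
    }
}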

Note that the developer servers now have a 100-timer limit. If your code exceeds this, it now throws an error (versus silently dropping the wakeup). I don’t believe this version has reached production yet, but it should with the next deploy.
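Once that’s live, a guard along these lines (a sketch, assuming the limit error surfaces as a catchable exception) would make the failure visible in the logs instead of letting the chain end silently:

// Sketch: report both failure modes - an exception from imp.wakeup()
// or a null return value
local ref = null;
try
{
    ref = imp.wakeup(30, checkTime);
}
catch (err)
{
    server.error("imp.wakeup threw: " + err);
}
if (ref == null) server.error("checkTime was not rescheduled");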

@iceled if you’re seeing issues on a developer device then it’s not likely due to the timer limit.