Agent Reboot

apineda · November 15, 2013, 1:57am

If this is documented somewhere, please point me in the right direction; I could not find any mention of this in the discussions.
Brief background:
My imp is connected to an external device that sends serial data every 10 seconds to the imp serial port. The imp successfully reads the serial data, logs it to the server via a “server.log” and sends the data to the Agent. The Agent then sends the data via an HTTP Post to a remote server after it also does a server.log.

This process runs just fine for several days and then the Agent stops responding. The imp server.log continues to record incoming data via the serial port on the console, but the Agent is just silent. The Agent does not log the incoming data from the imp, nor does the Agent respond to an incoming HTTP command from the remote host, so I think the Agent code is stuck somehow (it might have had an uncaught exception, even though I have many try/catch blocks). I checked memory on the imp by logging available memory on every request and the memory is not going down, even after a day or so; it is fairly constant. I don’t think it is running out of memory, but I could be wrong.

Rebooting the Agent through the IDE clears everything and it runs again for a few days.
I am sure that I will find the bug sooner or later, but if this happens after many deployments, I would like a generic way of rebooting the Agent from the server somehow if the Agent does not respond. I have already implemented a “ping” between the Agent and the imp and would like something similar from the Server to the Agent. It goes without saying that the server needs to be able to reboot the Agent if it stops talking.

Any suggestions?

controlCloud · November 15, 2013, 3:05am

None that spring to mind; it does seem a bit odd.

Maybe a watchdog ping from the agent to your host; that way you know your agent is sort of functioning. I do something similar and persist imp data in the agent. I send on change in the imp and every hour from the agent to my host.

apineda · November 15, 2013, 3:47pm

Thanks for your thoughts. The host already sends HTTP requests to the Agent on a regular basis and will know if there is a problem. However, I tried sending various HTTP requests to the Agent from the host and the Agent does not reply (the core problem). As it is now, I get an SMS text message from the host that there is a problem, but I have to go to the IDE and reboot the Agent. I was looking for a way for the Server to send a ping to the Agent and reboot the Agent if it did not respond.
Looks like “server.restart()” would force a reboot, but the question is how to trigger that API when the Agent is hosed?

Hmmm, just thought of something. Maybe the imp itself can send the “server.restart()” if it does not get a ping from the Agent after so many seconds. I’ll give it a shot.

hugo · November 15, 2013, 10:29pm

Very strange; if an agent crashes, or runs out of memory, it’s automatically restarted. We’re not aware of this ever failing to work. How sure are you that the agent is not running? If you were erroneously doing a lot of HTTP activity to the agent you may run into rate limiting (which will make it look like it’s not working as requests will be queued then dropped).

Have you tried pinging the agent from the imp to see if the agent will reply? eg you have this on the agent:

device.on("ping", function(v) { device.send("pong", 0); });

…so that any time you send it a “ping” it will immediately reply with a “pong”.
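
A minimal sketch of the device-side counterpart to that handler, in case it helps with testing. The function name sendPing and the 5-second interval are illustrative assumptions, not something specified in this thread:

```squirrel
// Device side: send a "ping" to the agent and log each "pong" that comes back.
// sendPing and the 5-second interval are hypothetical choices for illustration.
function sendPing() {
    agent.send("ping", 0);
    imp.wakeup(5.0, sendPing);   // schedule the next ping
}

agent.on("pong", function(v) {
    server.log("Agent replied with pong");
});

sendPing();
```

If the "pong" log lines stop appearing while the device keeps logging, that would suggest the agent itself has stopped handling messages.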

apineda · November 16, 2013, 12:51am

Hi Hugo,
I do have a ping going back and forth every 5 seconds. I modified the ping logic and added the "checkAgent" logic as follows to see if this will reset the Agent if the imp does not get a response. The Agent notifies the remote Host that the imp is down, and the imp tries to reset the Agent if it stops responding.
I’ll see if this will function as a watchdog to keep both sides operational.

```
/// On Agent //////////////////////////
function awake() {
    if (!imponline && !devicedownhostnotified) {
        // Notify host that imp is down
        curtime <- time();
        addgateway(agenturl, "0", "0", curtime, "DOWN"); // function to send http to host
        devicedownhostnotified <- true; // keep from flooding host
        // Note: devicedownhostnotified gets reset when the imp comes online.
    }
    imponline <- false;
    device.send("ping", 0);
    imp.wakeup(5.0, awake);
}

device.on("ping", function(code) {
    imponline <- true; // indicate imp is responding
    // server.log("device ping");
});

imp.wakeup(5.0, awake);
```

```
/// On Device //////////////////////////
agent.on("ping", function(code) {
    agent.send("ping", 0);
    agentdown <- 1; // Indicate Agent is up
});

function checkAgent() {
    if (agentdown == 0) {
        server.restart();
        server.log("Restart Agent");
    }
    agentdown <- 0; // If no Agent ping, this stays zero
    imp.wakeup(10, checkAgent); // check every 10 seconds for stuck Agent
}

imp.wakeup(10, checkAgent);
```

apineda · November 16, 2013, 3:39pm

Found it!

First of all, the logic above works great. If the imp stops getting a “ping” from the agent every 5 seconds, the imp forces a “server.restart”. The only change to the previous comment’s code was to allow more time for the Agent to reboot. I had to change the “imp.wakeup(10, checkAgent);” at the end of the device code to 30 seconds. Less time might work, but the imp got into a restart loop because the Agent was taking longer than 10 seconds to come back up.
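
A sketch of how the device-side watchdog might look with the longer interval. The constant name CHECK_INTERVAL and the initial value given to agentdown are assumptions added for illustration; logging before the restart call is also a choice made here, on the assumption that statements after server.restart() may not get a chance to run:

```squirrel
// Device side. CHECK_INTERVAL is a hypothetical name, not from the original code.
const CHECK_INTERVAL = 30;  // seconds; 10 was too short for the Agent to come back up

agentdown <- 1;  // assumption: treat the Agent as up until a full interval passes silently

agent.on("ping", function(code) {
    agent.send("ping", 0);
    agentdown <- 1;  // Agent pinged us during this interval
});

function checkAgent() {
    if (agentdown == 0) {
        server.log("Restart Agent");  // log first, then restart
        server.restart();
    }
    agentdown <- 0;  // stays zero unless a ping arrives before the next check
    imp.wakeup(CHECK_INTERVAL, checkAgent);
}

imp.wakeup(CHECK_INTERVAL, checkAgent);
```

Using the same interval for every check keeps it simple; a shorter steady-state interval with a longer grace period right after a restart would be another option.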

Root Cause:
My Agent code had an instruction to parse through the data packet (a blob) that came in from the imp. In most places I have try/catch, but this one did not. The incoming data packet from the external device had a “start of packet” character as the last character of a message. I was not checking for that, and that was my fault.
However, the real issue is that the Agent code read a character, took it for the beginning of a packet, and tried to read the next byte with the “alen <- uartb.readn('b');” instruction. It got an end-of-buffer I/O error and got stuck. From that point forward, the imp kept passing up more incoming data packets from the serial port, but the Agent was in a frozen state and would never respond. As mentioned before, even an incoming HTTP request from the remote host to the agent would be ignored and time out.
The logic to have the imp restart the Agent if it does not get a “ping” within a certain time limit will get me going, but the imp Server is not catching the fact that the Agent had an internal error and restarting it automatically.
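
For anyone hitting the same thing, here is a minimal sketch of a parser that checks how many bytes are left before each read, so a start-of-packet character at the very end of the buffer cannot trigger a read past the end. The marker value SOP and the [marker][length][payload] layout are assumptions for illustration; the actual packet format is not shown in this thread:

```squirrel
// Hypothetical start-of-packet marker; the real value is not given in the thread.
const SOP = 0x7E;

// Returns an array of payload blobs found in uartb.
// Assumed layout: [SOP][1-byte length][payload bytes].
function parsePackets(uartb) {
    local packets = [];
    uartb.seek(0);
    while (!uartb.eos()) {
        if (uartb.readn('b') != SOP) continue;      // skip bytes until a marker

        // Marker was the last byte of the buffer: nothing more to read.
        if (uartb.len() - uartb.tell() < 1) break;
        local alen = uartb.readn('b');

        // Payload is truncated: stop rather than read past the end.
        if (alen > uartb.len() - uartb.tell()) break;

        if (alen > 0) packets.push(uartb.readblob(alen));
    }
    return packets;
}
```

Wrapping the call to a function like this in try/catch in the agent's message handler would still be sensible as a second line of defence.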

How to reproduce the condition:
I added a test routine in the Agent code to periodically cause the Agent to read past the end of a blob that was passed up from the imp without a try/catch. Sure enough, the Agent froze and the imp kept sending up received serial port packets, but the only way to get the Agent going was a manual restart.
The checkAgent logic handles this condition by forcing a restart of the server if the Agent does not respond to the imp within a certain amount of time.

mlseim · November 16, 2013, 3:59pm

This is a great post about troubleshooting and correcting a problem. I’m glad you provided a descriptive post about what you did wrong, and how you fixed it. Even though it was your own fault, every one of us does the same kind of thing from time to time. Beginners to the Imp/Squirrel will benefit from learning the thought process on how to deal with these kinds of situations.

I think it would be great to have a “Fail” section on this forum to actually post about our faults and failures and how we corrected them. I know that I learn more about the Imp in my failures than I do in my successes.

hugo · November 16, 2013, 4:06pm

Very interesting - if you could provide minimal code that replicates the agent hang then we’d love this so we can find/fix the issue.

I’m not quite sure what the issue is, apart from what you describe: an error is thrown, but then the agent gets “stuck”?