Device Disconnecting Randomly

ansynclabs · July 10, 2019, 5:27pm

The firmware is expected to keep device code running without interruption, even when there is no internet connectivity or the connection drops unexpectedly. We have a status LED that “should” update using APIs such as…

server.onunexpecteddisconnect(function (reason) {
    server.log("Device unexpectedly disconnected...");
    Device.LEDState = LED_STATE.OFFLINE;
});

I am also using a specific timeout policy to keep code running for up to 1 hour if no connection is made.

server.setsendtimeoutpolicy(RETURN_ON_ERROR, WAIT_TIL_SENT, 3600);

Am I doing something wrong here?

hugo · July 10, 2019, 8:13pm

That’s not quite how those APIs work.

3600 in the send timeout means “device will attempt to send for up to an hour before returning from any call that tries to use the network” - ie, it’ll keep trying to attach to wifi etc during that period, blocking your code from running.

You should use a shorter send timeout - a couple of seconds probably - and then it’s simplest to use the connection manager library to deal with reconnections and so on. See https://developer.electricimp.com/libraries/utilities/connectionmanager

peter · July 11, 2019, 12:40pm

If you’re using RETURN_ON_ERROR mode, then that server.log is doomed to fail, as, in that mode, server.log does not itself cause a connection attempt. Once you’ve had an unexpected disconnect, you must eventually call server.connect to get back in touch with the server.

See our wifi state diagram, in which the “onunexpecteddisconnect” handler represents a transition from ACTIVE_ONLINE to ACTIVE_OFFLINE.

Peter
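
For readers following along, the flow described above (no logging while offline, then an explicit server.connect to get back online) could be sketched roughly as below. This is a minimal sketch, not code from the thread: the three timing constants and the function names tryConnect and onConnectResult are assumptions chosen for illustration.

```squirrel
// Assumed illustrative values; tune for your own application
const SEND_TIMEOUT_S    = 10;  // how long a send may block before returning an error
const CONNECT_TIMEOUT_S = 60;  // how long one connection attempt may take
const RETRY_DELAY_S     = 10;  // pause between connection attempts

// Keep code running when the network is unavailable
server.setsendtimeoutpolicy(RETURN_ON_ERROR, WAIT_TIL_SENT, SEND_TIMEOUT_S);

// Hypothetical helper: start an asynchronous connection attempt
function tryConnect() {
    if (server.isconnected()) return;
    server.connect(onConnectResult, CONNECT_TIMEOUT_S);
}

// Called by impOS with the outcome of the attempt
function onConnectResult(reason) {
    if (reason == SERVER_CONNECTED) {
        // Only log once we know we are online again
        server.log("Reconnected");
    } else {
        // Still offline: schedule another attempt
        imp.wakeup(RETRY_DELAY_S, tryConnect);
    }
}

// Fires once, on the transition from online to offline
server.onunexpecteddisconnect(function(reason) {
    // No server.log() here: in RETURN_ON_ERROR mode it will not reconnect for us
    imp.wakeup(RETRY_DELAY_S, tryConnect);
});
```

The retry decision here depends only on the reason code passed by impOS and on server.isconnected(), not on any application-level flag.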

ansynclabs · July 16, 2019, 11:14pm

During my random disconnections I do see my server.onunexpecteddisconnect() callback being triggered. This callback sets one of my LEDs to a specific color.

Here is my full code for handling offline mode / handling disconnections. I’m unsure why this isn’t working.

// I'll check if either of these are false... imp's API takes ~45 seconds?
function reconnect() {
    if(!Device.Connected || !server.isconnected()) {
        server.connect(disconnectionHandler, 5);
    } else {
        disconnectionHandler(SERVER_CONNECTED);
    }
}
/*
Callback checks its flash for blinkup...
If blinkedUp we should try to reconnect && not connected.
*/
function disconnectionHandler(reason) {
    //If we're blinkedup... try to reconnect.
    if(table.blinkUp && !Device.Connected) {
        Device.LEDState = LED_STATE.DISCONNECTED;
        imp.wakeup(5, reconnect);
    } else {
        //Connected
    }
}
/*
RETURN_ON_ERROR: Code still runs, callback is called...
*/
server.setsendtimeoutpolicy(RETURN_ON_ERROR, WAIT_TIL_SENT, 60);
server.onunexpecteddisconnect(disconnectionHandler);

Is there a way to force a server.onunexpecteddisconnect() trigger without tampering with the device physically? The server.disconnect() API does not work for this.

hugo · July 16, 2019, 11:36pm

Unexpecteddisconnect does not get called unless the disconnection was unexpected - a server.disconnect() is an intentional disconnect and hence not unexpected.

What is “Device.Connected”? Where is it set and cleared? This is likely a problem, because in “disconnectionHandler()” you check Device.Connected and only call reconnect if this is false… but as no other code will have been executed on a disconnect before the disconnect handler is called, it’ll likely still indicate true and so the reconnect will be skipped (however server.isconnected() will have been updated at this point).

Also, a 5 second connect timeout is far too short; increase this to 60 seconds. Yes, 5s may work on good networks, but there are networks where it’ll take 5s just to get a wifi association.

As noted in a previous reply, just use the ConnectionManager library. There are many subtleties going on here, and you can make your life a lot simpler by using proven code to deal with reconnections.

ansynclabs · July 17, 2019, 4:24pm

Device.Connected is a global flag driven by a ping/pong from device -> agent. If the device doesn’t get a pong back after 10 seconds then it is considered offline, and I update the LED to red. Otherwise it is white to show it is connected.

Same exact method on the agent. If it doesn’t get a ping from the device after 10 seconds it will mark the device offline in my database.

I’m going to give this ConnectionManager library a shot, although it looks like it’s going to increase the code size by a lot…

hugo · July 17, 2019, 5:25pm

So, as I said, device.connected may well not have been updated by this point. The unexpected disconnect handler only fires ONCE, at the time impOS decides the connection is no longer working. If device.connected is still showing true at that point, your device will be offline and will not be making any attempts to connect, because the only place you attempt to connect is in the unexpecteddisconnect handler when device.connected says false.

Bear in mind that aggressive pinging like that is likely to give false triggers. Any slight packet loss and you could miss a 10s interval, and you will be burning through a lot of your 1MB/day device data allowance (wifi/ethernet), or paying for extra cell data, with this.

impOS does connection maintenance itself. Yes, you may not get a report from the agent that the device is offline for several minutes (because TCP has to be given a chance to work) but this does work.

Is there a reason you need to know within ~20s that the device connection hasn’t passed any packets?

ansynclabs · July 17, 2019, 6:08pm

The reason is that my mobile application and web services should know very quickly if the device has gone offline. There is a lot that can go wrong if users aren’t notified quickly enough if the device goes offline.

I have the agent connected to my database which my mobile app is connected to so it can update an “online” flag. I have the mobile app hitting an agent api to check whether the device has pinged the agent in the last 10-15 seconds.

I’m planning to change a lot of this as I’m going to implement the ConnectionManager.

ansynclabs · July 17, 2019, 6:27pm

Is there a way to cause an unexpected disconnection via software?

hugo · July 17, 2019, 6:39pm

No, there’s no way to cause an unexpected disconnection via software. You can turn off your router (this will kick in quickly as WiFi beacons will vanish) or unplug the ethernet from the router (will take a while to kick in as it’ll just see packet loss).

As noted, you will certainly run into issues with such a short notification period. Packet loss happens a lot in the real world - most ISPs suffer from time to time. If you’re only doing this fast check when the user is in the mobile app, then that’s fine, but I’d recommend slowing it down a lot when the user isn’t actually watching the state intently - to maybe every minute or two vs 10s.

ansynclabs · July 17, 2019, 7:18pm

Thank you for being so prompt it’s really helping with development.

I’ve removed all of the pinging and am now using only the connection manager callbacks to handle updating state. I have a small code snippet of what is running right now. Is there a “reason” that is passed to the onDisconnect callback, or is it just the expected flag?

I would like to further debug it. I don’t have any other agent.send() methods being called besides a 5 second interval to send my device data up to the agent to my cloud.

I’m not calling server.connect/disconnect anywhere in the code right now. It is currently working as intended… although I don’t see why it should be experiencing so many unexpected disconnections.

Edit #1: I’m unsure how someone is supposed to send an offline status to the agent using the onDisconnect callbacks if you’re already disconnected by the time you want to update your agent. Seems like some type of pinging is required.

Edit #2: Seems like device.isconnected() doesn’t necessarily catch devices right away when powered down. Seems to be expected behavior to filter out small connection losses. Maybe it is not a big deal for right now and it will suffice.

Code

cm.onDisconnect(function(expected) {
    if(expected) {
        //During the expected disconnection turn led blue(config).
        Device.LEDState = LED_STATE.CONFIG;
        cm.log("cmlog: this was expected");
        cm.connect();
    } else {
        cm.error("cmerror: this was unexpected");
        //Turn the LED purple during unexpected disconnections.
        Device.LEDState = LED_STATE.DISCONNECTED;
        cm.connect();
    }
});

cm.onConnect(function() {
    //agent.send("deviceConnected", null);
    Device.LEDState = LED_STATE.ONLINE;
});

cm.onNextConnect(function() {
     //agent.send("deviceConnected", null);
     Device.LEDState = LED_STATE.ONLINE;
});
//Test the expected disconnection.
imp.wakeup(10, function() {
    cm.disconnect();
});

Logs

2019-07-17T18:26:18.079 +00:00 [Device] Version: 3
2019-07-17T18:26:18.103 +00:00 [Device] INIT DONE.
2019-07-17T18:26:27.467 +00:00 [Status] Device disconnected
2019-07-17T18:26:29.427 +00:00 [Device] 1563387987 - cmlog: this was expected
2019-07-17T18:52:45.867 +00:00 [Status] Device disconnected
2019-07-17T18:52:45.876 +00:00 [Device] ERROR: 1563389563 - cmerror: this was unexpected
2019-07-17T18:53:41.377 +00:00 [Status] Device disconnected
2019-07-17T18:53:41.385 +00:00 [Device] ERROR: 1563389619 - cmerror: this was unexpected
2019-07-17T18:58:07.365 +00:00 [Status] Device disconnected
2019-07-17T18:58:07.372 +00:00 [Device] ERROR: 1563389885 - cmerror: this was unexpected
2019-07-17T18:58:47.525 +00:00 [Status] Device disconnected
2019-07-17T18:58:47.533 +00:00 [Device] ERROR: 1563389925 - cmerror: this was unexpected
2019-07-17T18:59:02.564 +00:00 [Status] Device disconnected
2019-07-17T18:59:02.572 +00:00 [Device] ERROR: 1563389940 - cmerror: this was unexpected

hugo · July 18, 2019, 2:40pm

The ondisconnect callback (in the connection manager) is not intended for people to send status to the agent - as you note, it’s too late then as you’re disconnected. However, there are other things people may want to do when a disconnect happens.

In terms of “catching” a power down, note that the default state of TCP is not to send anything for an active connection. This is the way the internet works - there is no magic thread that will get pulled when a device vanishes from a network. Silence on a TCP connection can either be “everything is just fine” or “the other end no longer exists”. TCP provides keepalives, which force traffic to be sent (ACKing the last sequence number), and can tear down an unresponsive connection - we enable these and typically an unresponsive but idle connection is torn down within about 90s, which will signal to the agent that the device is offline… however, if there’s data outstanding (not yet ACKed) on the connection, keepalives do not function any more (this is a “feature” of TCP) and a higher level system keepalive will catch the offline event on average 4.5 minutes after the device vanishes (but up to 9 minutes).
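
As an illustration of relying on impOS's own connection maintenance rather than pinging, agent code could react to the connect and disconnect notifications described above along these lines. This is a sketch under assumptions: updateOnlineFlag is a hypothetical helper standing in for whatever call updates your own database, and the notification delays are those described in the post above, not something this code controls.

```squirrel
// AGENT CODE (sketch)

// Hypothetical helper: replace the body with a request to your own backend
function updateOnlineFlag(isOnline) {
    server.log("Device is " + (isOnline ? "online" : "offline"));
}

// Called when the server sees the device connect
device.onconnect(function() {
    updateOnlineFlag(true);
});

// Called when the server decides the device connection has gone away,
// for example after TCP keepalives have torn down an unresponsive connection
device.ondisconnect(function() {
    updateOnlineFlag(false);
});

// Record the current state when the agent starts
updateOnlineFlag(device.isconnected());
```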

If you’re wanting to stay connected, you don’t need to call reconnect yourself - the best way is to just initialize the connection manager with stayConnected=true and it will deal with this itself. You likely also want startBehavior=CM_START_CONNECTED as if the internet is not available at code boot time (eg device power cycles during an outage) then you want connection manager to try connecting.
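
A minimal sketch of that initialization is shown below. The library version in the #require line is an assumption (check the library page for the current one), and the callbacks only log; stayConnected and startBehavior are the two settings named above.

```squirrel
// Version number is an assumption; use the one listed on the library page
#require "ConnectionManager.lib.nut:3.1.1"

cm <- ConnectionManager({
    // Try to connect at boot, even if the network is down at that moment
    "startBehavior": CM_START_CONNECTED,
    // Let the library handle reconnection after a disconnect
    "stayConnected": true
});

cm.onDisconnect(function(expected) {
    // No cm.connect() call needed here when stayConnected is true
    cm.log(expected ? "Expected disconnect" : "Unexpected disconnect");
});

cm.onConnect(function() {
    cm.log("Connected");
});
```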

As to why you keep disconnecting, if you give me your device ID (by PM if needed) I can look at the logs; this could be either a network issue - eg router/firewall enforced NAT timeout - or related to sending lots of data upstream and the sends not being able to complete within the send timeout period - the default policy with connection manager is RETURN_ON_ERROR which means if a send (server.log or agent.send) can’t get into the TCP transmit buffer within the specified timeout period, it’ll return with SEND_ERROR_TIMEOUT and disconnect.

ansynclabs · July 18, 2019, 6:32pm

ondisconnect is understood now. For catching a power down, a ping every 10 seconds should be sufficient. I would rather just use the connectionManager for this type of thing.

The device needs to be able to run both offline and online, so I can’t set stayConnected to true; from what I’ve read it would keep retrying the connection and potentially block other code.

I believe the bigger issue at this point is the random/repeated disconnects. Here is my device ID: 40000c2a69167554. It is currently offline; if it needs to be online I will put it online, but I’m sure you’re just going to check out some logs. Appreciate the feedback!

hugo · July 19, 2019, 5:50pm

The issue appears to be - as suspected - that you have a 1s send timeout, and are sending enough data that the TCP buffers have filled. Every time you hit this 1s timeout, the device will disconnect as it hits the send timeout.

To address this, increase the timeout and/or increase the TCP send buffer size (I’ve no idea how much data you’re sending so don’t know which one makes more sense in your application).
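
A rough sketch of both adjustments follows. The values are placeholders, not recommendations, and the library version is an assumption. The ackTimeout setting name is given from my understanding of ConnectionManager's options (the send timeout it passes to server.setsendtimeoutpolicy()), so confirm it against the library documentation before relying on it.

```squirrel
// Version number is an assumption; use the one listed on the library page
#require "ConnectionManager.lib.nut:3.1.1"

// Assumed illustrative values; pick them based on how much data you send
const SEND_BUFFER_BYTES = 8192;
const ACK_TIMEOUT_S     = 10;

// Enlarge the TCP transmit buffer so bursts of sends are less likely to fill it
imp.setsendbuffersize(SEND_BUFFER_BYTES);

cm <- ConnectionManager({
    // Assumed setting name: the send timeout used by the library
    "ackTimeout": ACK_TIMEOUT_S
});
```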

Note that you can work this out yourself if you check the return codes for all agent.send and server.logs and use cm.log() to log these when they report a failure.
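
For example, the sends could be wrapped in small helpers like the ones below. This is a sketch: checkedSend and checkedLog are hypothetical names, and cm is assumed to be an already-constructed ConnectionManager instance as in the code earlier in the thread.

```squirrel
// Hypothetical wrapper: send to the agent and record any failure
function checkedSend(name, data) {
    local result = agent.send(name, data);
    if (result != 0) {
        // cm.log() is used so the message is not lost if we are now offline
        cm.log("agent.send(" + name + ") failed with code " + result);
    }
    return result;
}

// Hypothetical wrapper: same idea for log messages
function checkedLog(message) {
    local result = server.log(message);
    if (result != 0) {
        cm.log("server.log failed with code " + result);
    }
    return result;
}
```

A non-zero code logged just before a disconnect would point at the send timeout rather than at the network link itself.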

Here’s a snippet of the black box log for one of the disconnections:

40000c2a69167554     2019-07-19 00:50:13.00 03:24:26.89 NET CON ERROR: ERR_WOULDBLOCK in netconn write
40000c2a69167554     2019-07-19 00:50:13.99 03:24:27.88 WRITE TIMEOUT 1000ms, WAIT_TIL_SENT
40000c2a69167554     2019-07-19 00:50:13.99 03:24:27.88 SEND ERROR: RETURN_ON_ERROR:WAIT_TIL_SENT Timeout 1000ms
40000c2a69167554     2019-07-19 00:50:13.99 03:24:27.88 DISCONNECTED

ansynclabs · September 6, 2019, 11:20pm

I’m still experiencing this after removing some of the tick functions that sent requests. I think the only one I have running concurrently is a 2.5 second tick to send state updates to the cloud.

Could you please take a look at logs where any previous unexpected disconnections are coming in? I’m using the connection manager to handle unexpected disconnections. Here is a snippet of what I am using. I don’t know why it should be unexpectedly disconnecting at the moment.

Device ID: 40000c2a69166362

hugo · September 6, 2019, 11:38pm

You will have this issue any time the transmit buffer is unable to accept your new packet for your timeout time. Any interruption in the connection can cause this, at any point in the chain (wifi, router, isp, internet). I explained this above.

If you want more resilience to this - by making the system wait longer for the buffer to clear when the interruption goes away - just increase your timeout to something bigger. 1s is not very long. The root cause of the issues is packet loss or network congestion at some point in the path, though.