Time on device very unreliable

dmutterer · January 24, 2017, 8:30pm

My 2 IMP projects rely heavily on time-stamps, sleeping and logging data to a Linux host. I’ve done extensive testing of time-stamp logging and the time from the device itself is always way off from real time, ie: agent and Linux.

I’ve done many MANY things to try and get the IMP to sync with the agent but nothing seems to work. I’ve even gone as far as putting in an agent.send() to a dummy function as the very first thing from a wake-up, to try to force a sync, but no luck.

I don’t know what other tricks to try. On one project, I’ve seen the time off as much a 90 seconds, with 1 hour sleeps. This project below is more around 5 seconds off, with 1 minute sleeps, but the really odd thing is that, so far, the device is always in the future. If it was lagging the agent, I would say it was time to fire-up the WiFi and such.

Anyone else experience this?? or am I the only one with such a strict time requirement? Should we not rely on the IMP for time-stamps? Should I wire in an external RTC chip.?

Device: Tue Jan 24 15:27:03 EST 2017 (Imp)
Agent: Tue Jan 24 15:26:57 EST 2017 (Agent)
Host: Tue Jan 24 15:26:57 EST 2017 (Linux Host)

hugo · January 24, 2017, 10:28pm

We’ve been investigating this and have determined the cause; code added in release 34 to deal with time jitter on busy imp<>agent links (for permanently connected devices) had a side effect of causing the time not to get set on sleepy devices.

Agent time comes from the agent hosts, which are ntp synced.

Things that will cause imp time to resync with agent time:

cold boot
staying online for an extended period
imp clock running slower than agent clock

Right now, the time can drift up to 120s ahead on the imp (never behind) on a device that connects only periodically.

The fix will be in release 36; we may be able to do a server workaround here too, I’ll see what we can do.

Summary: we now know the root cause, and it’ll be fixed in the next release if not sooner.

hugo · January 25, 2017, 12:47am

Edit: A server workaround appears possible (works for me) but that needs to be run past the real engineers tomorrow

dmutterer · January 25, 2017, 1:15am

I wasn’t sure that the time was always “in the future”. If that was guaranteed, then I could send a “trigger” to the agent… have the agent wait till the appropriate time, and send a trigger back to the imp to continue processing.

Now… even though I looked at that, I also looked at other options.

I might be fine using the time-stamp from the agent, although I’m not sure of the latency, but probably too small for me to care, OR just wait for your fix to be applied, as long as it’s not months away.

Thanks.

hugo · January 25, 2017, 6:27am

The slip is only allowed with the device ahead of the agent. To be honest, timestamping with the agent is likely better if only because the agent has NTP-synced time down to milliseconds, whereas the device is whole seconds only at this point.

It won’t be months, others have been suffering too hence the recent investigation.

rogerlipscombe · January 25, 2017, 5:10pm

We’re deploying a new version of the developer server today/tomorrow; it includes this workaround. For the record, the real engineers went “ick” when they saw it, but it’ll have to do.

That “today/tomorrow” is a little misleading:

We do blue/green deploys (ish). The new developer server is up (including all of the required monitoring). I’ll be adding it to the developer server group shortly, at which point it’ll start taking load. You might notice devices reconnecting and agents being restarted as they migrate.

Once we’re happy with it, we’ll deactivate the old developer server.

The new server includes some performance improvements in the agent VM (that you probably won’t notice, but that we’re keen to see being used).

hugo · January 25, 2017, 10:58pm

@dmutterer as your device is connecting/disconnecting, you’re likely on the new server already. How is the time looking?

dmutterer · January 26, 2017, 3:33am

Thank you.

Here is the exact moment that it started to be reasonable.

Device: Wed Jan 25 12:17:03 EST 2017
Agent: Wed Jan 25 12:16:53 EST 2017
Host: Wed Jan 25 12:16:53 EST 2017

Device: Wed Jan 25 12:18:01 EST 2017
Agent: Wed Jan 25 12:18:01 EST 2017
Host: Wed Jan 25 12:18:02 EST 2017

BUT, only as long as I keep the dummy agent call active as one of the first lines run after wake-up.

Device:

agent.send(“dummy”, “dummy”);

Agent:

function dummy(dummy) {}
device.on(“dummy”, dummy);

hugo · January 26, 2017, 5:28am

Sounds a bit like you weren’t causing the device to connect before printing, eg:

local d = date(); server.log(format("%02d:%02d:%02d", d.hour, d.min, d.sec);

…at the start of your code will print the local RTC time, not with any server-side correction, because it’s the server.log that caused the imp to connect. By that time, “d” has been loaded with the RTC-sourced date.

Your agent.send will be doing the same thing (I’m assuming you’re running in SUSPEND_ON_ERROR mode, not RETURN_ON_ERROR).

Are you saying you still see incorrect times if you don’t have that? I’d have expected the RTC to be set so you’d only have the normal up to 30ppm drift from the crystal accuracy.

rogerlipscombe · March 2, 2017, 11:41am

We’re delaying our next full deploy for a while, so I’ve hotpatched the developer and production servers to include the server-side workaround for this issue.