Development Agent server down?

iceled · August 15, 2017, 10:26pm

Right now HTTP requests to all my Agents are timing out or completing very late. Is there a server problem? There are no status reports in the IDE, and code editing is OK.

hugo · August 15, 2017, 11:20pm

Yes, there was an issue with outbound HTTP from the developer server; I’ll let @rogerlipscombe give some details.

rogerlipscombe · August 16, 2017, 6:55am

Something triggered a deadlock/slowlock in the agent scheduling queue. The issue definitely affected HTTP in and out. It might have affected wakeups.

We grabbed a core dump and restarted the daemon. We’ll continue to investigate today.

This incident only affected the developer server. Unfortunately, there are some production devices on that server for historical reasons. We’ll be looking to resolve that today or tomorrow.

We’re also examining why we didn’t get paged when the incident started. It looks like some required monitoring wasn’t in place. We’ll be resolving that this morning.

My apologies.

rogerlipscombe · August 16, 2017, 7:20am

See also http://status.electricimp.com/incidents/6lrn5mj1154z.

iceled · August 16, 2017, 9:22am

Great… I’m sure it’s just a coincidence, but at almost that exact time I was working with an imp01 in a relatively hostile electrical environment, and its power supply may have been glitching quite severely. I’m highly doubtful that this could have had such an impact, but I thought I ought to 'fess up anyway.

Initially I put the trouble I was having with HTTP access on that imp down to the electrical issues, but I soon spotted separate applications that were similarly affected.

jamesb · August 16, 2017, 9:37pm

I was wondering what happened too. Good to know it’s being looked into.

hugo · August 17, 2017, 3:13am

Not long after Roger’s post, the issue was located and a maintenance window was opened, during which we deployed the new code to the developer environment.

We’re going to beef up the tests in this area; unit tests run under helgrind, but the system tests that could have caught this issue pre-deployment do not. Things can get quite complex