Server / agent not responding? Code 7?

foxt · December 6, 2015, 3:32pm

I have had an imp running as a temp sensor for about 4 weeks, and all has been well. The imp wakes up every 15 mins and sends the temp to the agent. Overnight last night, the agent returned a code 504 when the device tried to send data, and has subsequently returned a code 7 on each new try. I can’t find any reference to code 7. The imp site indicates that the agent server status is “green”. Can anyone point me in the right direction as to what the issue may be, and how to resolve it?

2015-12-05 23:00:03 UTC-5 [Device] sleeping until 1449375301000
2015-12-05 23:00:08 UTC-5 [Agent] PUSH: 200 - 1 success
2015-12-05 23:15:08 UTC-5 [Status] Device connected
2015-12-05 23:15:09 UTC-5 [Device] sleeping until 1449376201000
2015-12-05 23:17:16 UTC-5 [Agent] PUSH: 504 -

504 Gateway Time-out

The server didn’t respond in time.

2015-12-05 23:30:03 UTC-5 [Status] Device connected
2015-12-05 23:30:03 UTC-5 [Device] sleeping until 1449377101000
2015-12-05 23:31:06 UTC-5 [Agent] PUSH: 7 -
2015-12-05 23:45:02 UTC-5 [Status] Device connected
2015-12-05 23:45:03 UTC-5 [Device] sleeping until 1449378001000
2015-12-05 23:46:07 UTC-5 [Agent] PUSH: 7 -
2015-12-06 00:00:03 UTC-5 [Status] Device connected
2015-12-06 00:00:03 UTC-5 [Device] sleeping until 1449378901000
2015-12-06 00:01:07 UTC-5 [Agent] PUSH: 7 -

foxt · December 6, 2015, 4:10pm

I took a look at my agent code, and determined that the agent is running correctly - it seems that it is the sparkfun stream that is having a problem saving the data I am trying to save. Off to research more about what may be amiss with sparkfun …

foxt · December 6, 2015, 6:30pm

Direct requests to the sparkfun stream are working properly. I have discovered that when http.get returns a code in the range 0-99, the code has the same meaning as the corresponding libcurl error code. The error code 7 that I am getting indicates that the agent is not able to connect to the sparkfun stream to execute the http.get. Since I verified that I can execute the http.get directly to sparkfun from my server, my problem must be that the agent cannot connect to sparkfun.

Why would this be happening? Is something broken between the imp agent server and sparkfun?

kingjez · December 6, 2015, 6:31pm

Hi, I am also having the same problem. I have a very similar setup - the agent sends data (in my case every minute) to the sparkfun data stream (data.sparkfun.com). The data seemed to stop getting logged around 4am UTC this morning.

I’ve tested my sparkfun datastream and it seems to be working fine. I’m able to send data via my browser for example. It seems like something between the electricImp servers and sparkfun phant server is amiss.

foxt · December 6, 2015, 7:35pm

Well, sorry to hear that you have the issue too, but it suggests that there is something systemic rather than unique to either of us. We’ll have to wait and see … unless there is a way to raise a ticket with imp?

foxt · December 7, 2015, 1:03pm

FYI - The agent regained its ability to post to the sparkfun datastream, for me, around 06:45 UTC this morning. Is anyone able to shed any light on the nature of the outage?

hugo · December 8, 2015, 7:15am

As you noted, low error numbers are from libcurl, which is used to issue the requests, see http://curl.haxx.se/libcurl/c/libcurl-errors.html - error 7 is that the connection couldn’t be established.
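For anyone logging these codes, here is a minimal Python sketch of the distinction described above: values below 100 are treated as libcurl codes, anything else as an HTTP status. The function name `describe_status` and the category labels are made up for illustration, and the table is only a small subset of the libcurl list linked above.

```python
# Small subset of libcurl codes; see the libcurl-errors page for the full list.
LIBCURL_CODES = {
    0: "CURLE_OK",
    6: "CURLE_COULDNT_RESOLVE_HOST",
    7: "CURLE_COULDNT_CONNECT",
    28: "CURLE_OPERATION_TIMEDOUT",
    35: "CURLE_SSL_CONNECT_ERROR",
}


def describe_status(code):
    """Return a (category, description) pair for a logged status code.

    Hypothetical helper: codes 0-99 are looked up as libcurl codes,
    everything else is treated as an HTTP status.
    """
    if 0 <= code < 100:
        return "libcurl", LIBCURL_CODES.get(code, "unlisted libcurl code")
    if 200 <= code < 300:
        return "http", "success"
    return "http", "error"


if __name__ == "__main__":
    for logged in (200, 504, 7, 35):
        print(logged, describe_status(logged))
```

With this split, the 504 in the log is an HTTP response that came back from the remote side, while the 7 means no connection was established at all.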

I can see in our logs that all connects were timing out (60+ seconds) for an extended period; it started coming back at 2015-12-07T06:30:04.520Z but was still patchy until about 06:31:35. Didn’t see any other http out issues for other hosts.

It sounds like sparkfun had issues with their service (504 is a returned error from their end) - maybe they took it down for maintenance then fixed the problem? There weren’t any imp issues at that point, but the error 7’s could also be routing between our AWS region and their server, wherever that is (though the IP address is AWS-ish). Their IP didn’t change either side of the outage, so it wasn’t a server move… or they have elastic IP set up.

foxt · December 10, 2015, 3:02am

I was able to directly post to sparkfun during the period where I was getting the RC 7 from the agent, so I don’t think sparkfun was down. Maybe it was an AWS issue? Regardless, thanks for following up with us and letting us know what you saw on your side!

hugo · December 10, 2015, 1:15pm

If I’d spotted this thread whilst it was still happening I could have done more to look at it (eg traceroute from the servers in question); that’s the problem with connectivity issues, looking at them when everything is working doesn’t help much.

(commercial customers get access to our ticketing system, which is a bit more immediate than the forums - not that this is much help in your case, but just for whoever else may be reading this!)

foxt · December 10, 2015, 3:29pm

I was wondering how tickets were managed, since there was no obvious way to submit one. Makes total sense that you support commercial customers who are running in a production environment through tickets, and that those of us who are tinkering don’t have access to the ticket system. I for one accept the fact that I am not operating in a production environment, and I am grateful for the support / service that you do offer while I evaluate the platform.

If the outage happens again, I’ll just post here again …

mlseim · December 11, 2015, 3:59pm

There is another option. You can purchase your own shared webhost account with a domain name for about $40-$75 per year. With that, you have all the PHP, Perl, MySQLi database at your disposal to do whatever you want. And of course, you can create your own web pages / website. This is an inexpensive shared host I’ve used with good service ( www.cleverdot.com ). There are other webhosts, like GoDaddy, etc.

You do whatever you want with the imp and your website. Collect all the data you want and use it however you want.

foxt · January 1, 2016, 3:29pm

I’m getting the RC=7 again over the last day or so, perhaps I will put up my own datastream collection host as mlseim suggests … but by any chance if Hugo or someone else from imp is watching at the moment, the issue is occurring now …

hugo · January 1, 2016, 5:57pm

Taking a look now…

hugo · January 1, 2016, 6:03pm

There appear to be plenty of requests succeeding right now (several per second); they have had some patches where the connection was established but then appeared to be very slow (like, 1000+ seconds for the transaction to complete), and some where for some reason the IP doesn’t appear to resolve (though these are on the same server where other requests are working fine, which is freaky).

Can you PM me your agent ID so I can narrow it down to just your requests? There are well over 100 agents posting to sparkfun regularly.

hugo · January 1, 2016, 6:08pm

The really strange thing is that the times the connection doesn’t appear to resolve/complete are happening every 5 minutes, for a small patch (generally 2-3 seconds), eg:

2016-01-01T04:15:03-05
2016-01-01T04:20:01-04
2016-01-01T04:25:01-03
2016-01-01T04:30:02-06
2016-01-01T04:35:01-06

…I wonder if their server is doing something high load every 5 minutes, or whether actually everyone is synchronizing their posts to 5 minute intervals, and hence they see waves of server load (pretty sure it’s not our side).

Tip: try making your 15 minute wakeups happen at 2.5 minutes past the hour (and in 15 min intervals after that) and see if it’s happier?
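A rough Python sketch of that offset calculation, assuming the device has a Unix timestamp in seconds available. The function name and parameters are illustrative, not part of the imp API; on the device the same arithmetic would be done in Squirrel before going to sleep.

```python
def seconds_until_next_wake(now_s, period_s=900, offset_s=150):
    """Seconds from now_s until the next time t with t % period_s == offset_s.

    Defaults give a 15 minute period shifted 2.5 minutes (150 s) past each
    quarter hour. Unix time is aligned to whole hours, so the result holds
    in any whole-hour timezone.
    """
    remaining = (offset_s - now_s) % period_s
    # If we are exactly on a slot, wait a full period rather than zero.
    return remaining if remaining > 0 else period_s


if __name__ == "__main__":
    # 1449375301 is the first "sleeping until" value from the log, in seconds.
    print(seconds_until_next_wake(1449375301))
```

The "sleeping until" values in the log at the top of the thread land one second past a quarter hour, so those posts would have gone out right around the 5 minute marks.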

Rfarmer · January 1, 2016, 11:37pm

I’m not sure if this is related but I’ll mention it anyway. My imp has been posting data every few minutes to Sparkfun/Phant all afternoon with inconsistent results. I finally put a counter on result.err. In the last three hours, Sparkfun has returned 72 statuses of "1 success" and 36 statuses of { "error": "", "code": 35 }. My device/agent code is working as far as I can tell, and I haven’t changed anything since I added the error counter. Successful data posts immediately appear at analog.io. Does anybody know what the “35” codes from Sparkfun mean, and is this the same problem as mentioned at the beginning of this thread?

hugo · January 2, 2016, 12:37am

Code 35 is a TLS error (error when setting up the secure connection). See http://curl.haxx.se/libcurl/c/libcurl-errors.html for the list of all codes.

These do appear to be coming from the remote end, ie I don’t believe we have an issue at our end. Possibly they’re getting more and more loaded?

@rfarmer a workaround, if you’re not worried about the security of the data you’re sending to sparkfun, might be to just move to http vs https in your posting URLs.
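As a sketch of that workaround in Python, the scheme swap can be done on the URL string before the request is made. The helper name and the example URL (including `PUBLIC_KEY` and the `temp` field) are placeholders, not a real stream; the data then travels unencrypted, so this only makes sense when that is acceptable.

```python
from urllib.parse import urlsplit, urlunsplit


def downgrade_to_http(url):
    """Return url with an https scheme replaced by http; other URLs unchanged.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    if parts.scheme != "https":
        return url
    return urlunsplit(parts._replace(scheme="http"))


if __name__ == "__main__":
    # Placeholder URL, not a real stream.
    print(downgrade_to_http("https://data.sparkfun.com/input/PUBLIC_KEY?temp=21.5"))
```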

hugo · January 2, 2016, 3:27am

I can see quite a few errors now (all on https). Approximately 34% of all https connections to data.sparkfun.com are failing.

Looking at packet captures, they’re just hanging up after we sent the client hello. 66% of the time they’re happy, 34% they hang up.

I suspect there’s a problem with maybe one of 3 load balanced servers behind data.sparkfun.com… I’ve emailed their generic address with tech details, will update this thread if I hear back.
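As a quick sanity check on the one-in-three idea, the counts Rfarmer reported earlier in the thread (72 successes, 36 code 35 errors) can be turned into a failure fraction. This is only a back-of-the-envelope Python sketch; the function name is made up, and a matching ratio is consistent with one bad server out of three but does not prove it.

```python
from fractions import Fraction


def failure_fraction(successes, failures):
    """Fraction of all attempts that failed."""
    return Fraction(failures, successes + failures)


if __name__ == "__main__":
    frac = failure_fraction(72, 36)
    print(frac, float(frac))
```

Those counts reduce to exactly 1/3, close to the roughly 34% seen across all https connections.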

kingjez · January 3, 2016, 1:54pm

I’d like to reiterate foxt’s note about support - it’s greatly appreciated. Thank you Hugo for looking at this. I have been seeing 100% of my https connection attempts to sparkfun failing (error code 35) since 13:28 UTC on 2nd Jan. http attempts appear to be fine.

Rfarmer · January 4, 2016, 3:48am

Sure enough, all https connections now fail, but http succeeds.