first time ever - power cycle required to get device online again

edited February 17 in General
I've got 20 devices under the same model. Yesterday I updated the agent code and used the API to reprogram all of them with the update. After this, 2 devices did not come back online (a 3rd had dropped off due to unrelated issues, I think).

The devices are relatively local so I was able to visit them and cycle power to get them online.

Any thoughts on how they got into this state?

Comments

  • Generally a code deploy does not make a device go offline - the connection stays up (no disconnect/reconnect) unless your code does that at startup.

    Without seeing your device code, it's hard to say. It's possible you have a bug in your connection handling: do these devices poll, or do they stay online? If they stay online, what code are you using to do this?

    We have people who deploy to 100k devices at a time without issues so we aren't aware of any general problems.
  • I actually didn't even touch the device code - it has been stable for a long time. I was updating the agent: I wrote some code that got caught by the syntax checker, corrected that, and uploaded. Next I used Python to run a script that I have been using to push the update to all 20 devices.

    At this point it's no big deal, because I was able to physically access the devices and get them online with just a power cycle. I don't understand why it was necessary though because, as I wrote, that has never been the case before.
  • Generally pushing updates will restart both device & agent even if only agent has changed - but if the code didn't change, then nothing would be actually sent to the device (hashes would match).

    Got a mac address of a device that worked, and one that didn't for comparison?
  • These 16 restarted OK

    0c2a690be980
    0c2a690be68d
    0c2a690be05e
    0c2a690be5cb
    0c2a690be6bc
    0c2a690be45a
    0c2a690be65d
    0c2a690be940
    0c2a690be73f
    0c2a690be032
    0c2a690be713
    0c2a690be62d
    0c2a690be725
    0c2a690be46c
    0c2a690be59e
    0c2a690be72a

    These two required an on-site power cycle to come online:
    0c2a690be7c4
    0c2a690be71c

    This one had gone offline and is remote - I don't know what the problem is - maybe just unplugged from power at the site
    0c2a690be76a

    they are all imp003 modules - Kris Winer's Tindie breakout board


    //=============================================================================

    Below is the python code used to restart them

    //=============================================================================



    import json, httplib, base64
    apiKey = base64.b64encode('--------------------------------') # API key designed for our devices
    headers = {"Authorization":"Basic " + apiKey}
    connection = httplib.HTTPSConnection('build.electricimp.com')
    connection.connect()



    # restart by model

    model_id = '------------' # this is a set of 24 devices, 19 typically always ON

    connection.request('POST', '/v4/models/' + model_id + '/restart', '', headers)
    results = json.loads(connection.getresponse().read())
    print results
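For anyone running this on Python 3, where `httplib` has become `http.client` and `base64.b64encode` expects bytes, the same restart-by-model request might look roughly like the sketch below. The host, path and header scheme are copied from the script above. The function names `build_restart_request` and `restart_model` are hypothetical, and the key and model id remain placeholders you would supply.

```python
import base64
import http.client
import json

API_HOST = "build.electricimp.com"  # host used in the script above


def build_restart_request(api_key, model_id):
    """Return (method, path, body, headers) for a restart-by-model call."""
    # Mirrors the original script: the base64-encoded key follows "Basic ".
    token = base64.b64encode(api_key.encode("ascii")).decode("ascii")
    headers = {"Authorization": "Basic " + token}
    return "POST", "/v4/models/" + model_id + "/restart", "", headers


def restart_model(api_key, model_id, timeout=30):
    """Send the restart request and return (HTTP status, parsed JSON body)."""
    method, path, body, headers = build_restart_request(api_key, model_id)
    connection = http.client.HTTPSConnection(API_HOST, timeout=timeout)
    try:
        connection.request(method, path, body, headers)
        response = connection.getresponse()
        payload = json.loads(response.read().decode("utf-8"))
        return response.status, payload
    finally:
        connection.close()
```

Returning the HTTP status alongside the parsed body is a small addition over the original script; it makes it easier to tell a rejected request from one the server accepted.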
  • edited February 20
    I can see quite a few updates to devices over the last few days; can you narrow down when this one was? (UTC ideally)

    I can only see 21 devices getting updated; some have been updated more than once in the past week:

    (number of deploys / address)
    9 30000c2a690be032
    5 30000c2a690be05e
    6 30000c2a690be45a
    7 30000c2a690be46c
    8 30000c2a690be59e
    6 30000c2a690be5cb
    7 30000c2a690be62d
    8 30000c2a690be65d
    1 30000c2a690be68a
    7 30000c2a690be68d
    7 30000c2a690be6bc
    8 30000c2a690be713
    4 30000c2a690be71c
    7 30000c2a690be725
    8 30000c2a690be72a
    7 30000c2a690be73f
    1 30000c2a690be743
    1 30000c2a690be76a
    4 30000c2a690be7c4
    6 30000c2a690be940
    16 30000c2a690be980


    Once I know the actual event I can look a bit deeper and see how the update fanned out (or didn't, and possibly why)
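The 21 addresses in the deploy list can be cross-checked against the 19 MAC addresses posted earlier in the thread with a small set comparison. This is only a sketch: it assumes a device address is the MAC with a short prefix in front (which is how the two lists appear to line up), and the helper names `mac_of` and `unreported` are made up for illustration.

```python
# Addresses from the server-side deploy list in this comment.
server_ids = """
30000c2a690be032 30000c2a690be05e 30000c2a690be45a 30000c2a690be46c
30000c2a690be59e 30000c2a690be5cb 30000c2a690be62d 30000c2a690be65d
30000c2a690be68a 30000c2a690be68d 30000c2a690be6bc 30000c2a690be713
30000c2a690be71c 30000c2a690be725 30000c2a690be72a 30000c2a690be73f
30000c2a690be743 30000c2a690be76a 30000c2a690be7c4 30000c2a690be940
30000c2a690be980
""".split()

# MAC addresses from the earlier comment: 16 restarted OK, 2 needed a
# power cycle, 1 was already offline at a remote site.
reported_macs = """
0c2a690be980 0c2a690be68d 0c2a690be05e 0c2a690be5cb 0c2a690be6bc
0c2a690be45a 0c2a690be65d 0c2a690be940 0c2a690be73f 0c2a690be032
0c2a690be713 0c2a690be62d 0c2a690be725 0c2a690be46c 0c2a690be59e
0c2a690be72a 0c2a690be7c4 0c2a690be71c 0c2a690be76a
""".split()


def mac_of(device_id):
    # Assumption: the last 12 hex characters of the address are the MAC.
    return device_id[-12:].lower()


def unreported(ids, macs):
    """Addresses the server saw that are missing from the posted MAC list."""
    known = {m.lower() for m in macs}
    return sorted(d for d in ids if mac_of(d) not in known)


extra = unreported(server_ids, reported_macs)
print(extra)  # ['30000c2a690be68a', '30000c2a690be743']
```

Both of the extra addresses show a single deploy in the list above, so they may simply be devices that were not part of the original report.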
  • Found something interesting at 2017/02/17 01:43:59Z for 30000c2a690be7c4.

    It got sent new code at that point, and this caused it to be load balanced to another server - however, it didn't turn up there. At 01:49:43 I see some (what look like) manual reloads, but it isn't seen again until 11:11:17.

    Is this device on a different network than the others? If you could email us with your device code we can have a look at your connection handling maybe.
  • 2017-02-16 19:43:39 UTC-6 [Status] Device disconnected
    2017-02-16 19:43:39 UTC-6 [Status] Agent restarted: reload.

    There are 10 devices in one building, so 8 of 10 worked OK. They should all be on the same SSID, but since there are multiple SSIDs programmed to try, a device could spend some time trying other networks if for some reason it fails to connect to the one that is cached.

    The reason it came back online is because I did a power cycle at the site (yes, I was up at 5AM) : )

    The other devices are on the other side of the country and they seem to have come online OK.
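To line the device log up with the server-side timestamp quoted earlier in the thread (2017/02/17 01:43:59Z), the UTC-6 log time can be converted to UTC. A minimal sketch using only the standard library; the helper name `to_utc` is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

UTC_MINUS_6 = timezone(timedelta(hours=-6))  # offset shown in the device log


def to_utc(local_text, tz):
    """Parse a 'YYYY-MM-DD HH:MM:SS' log time in zone tz and return it in UTC."""
    naive = datetime.strptime(local_text, "%Y-%m-%d %H:%M:%S")
    return naive.replace(tzinfo=tz).astimezone(timezone.utc)


device_log = to_utc("2017-02-16 19:43:39", UTC_MINUS_6)
server_event = datetime(2017, 2, 17, 1, 43, 59, tzinfo=timezone.utc)
gap_seconds = (server_event - device_log).total_seconds()

print(device_log.isoformat())  # 2017-02-17T01:43:39+00:00
print(gap_seconds)             # 20.0
```

So the "Device disconnected" log line sits about 20 seconds before the server-side event, which suggests both describe the same reload.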
  • I could be wrong here and @hugo can confirm, but I believe that the imp enters a "firmware upgrade" mode when told it has new code. If the upgrade fails to complete, squirrel doesn't run. The imp simply tries the upgrade again or backs off for a bit.

    I've seen this on tenuous wifi connections (we have a few!), leading to repeated attempts to upgrade and no squirrel until the new code is loaded.

    If your imp supports multiple ssids, be aware that without squirrel your imp will be locked onto whatever you last set with imp.setwificonfiguration().
  • You're right in that there's only one space to store squirrel, so if it has erased this space then it can't run anything until new squirrel has been received - but the link was up in order to get the notification of new code hence it should be able to get the code.

    If interrupted at this point, it would be in suspend_on_error mode, which means it'd keep trying to connect for 60s and then sleep for 9 mins if it failed, until it did manage to connect.

    However, the erase only happens when new squirrel is about to be received. Here, the device got a reload command, re-identified, and then was load balanced to a different server; it would not have had the new firmware notification yet (that's the next step) so should have been continuing to run the firmware it already had.

    (of course, @peter may correct me on this, but that's my understanding)

    @jhtna does your code do anything specific with wifi settings, eg does it have several sets and iterate through them or anything?
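As a rough sanity check on the retry timing described above (60 s of connection attempts followed by a 9 minute sleep), the number of retry cycles that would fit into the observed outage can be estimated. This is a back-of-the-envelope sketch: it uses the timestamps quoted earlier in the thread, assumes the 11:11:17 sighting was on the same UTC day, and the function name `retry_cycles` is hypothetical.

```python
from datetime import datetime, timezone

CONNECT_WINDOW_S = 60   # keeps trying to connect for 60 s
SLEEP_S = 9 * 60        # then sleeps for 9 minutes


def retry_cycles(start, end, connect_s=CONNECT_WINDOW_S, sleep_s=SLEEP_S):
    """Number of complete connect-then-sleep cycles between start and end."""
    elapsed = (end - start).total_seconds()
    return int(elapsed // (connect_s + sleep_s))


# Timestamps from earlier in the thread; same-day date is an assumption.
lost = datetime(2017, 2, 17, 1, 43, 59, tzinfo=timezone.utc)
seen_again = datetime(2017, 2, 17, 11, 11, 17, tzinfo=timezone.utc)

print(retry_cycles(lost, seen_again))  # 56
```

If the device had been in that mode, it would have had dozens of chances to reconnect before the power cycle, which is part of why a simple retry loop does not obviously explain the outage.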
  • The code does iterate through a list of possible SSIDs and passwords. However, in this case it should have picked up the original network. If that failed in that cycle due to bandwidth, then it would change wifi settings, sleep for a while, and it would/should eventually come online.

    There wasn't new device code to load, it was only restarting due to the Agent getting new code. I don't know if this is a unique situation, but there are 10 devices in one room all within ~ 30ft of each other. Since I used Python to trigger the update there would have been a burst of activity in a relatively short span.

    I wouldn't want to be over-confident about the hardware (no way to prove hardware didn't hit a corner case) but I can say these devices have been in constant use since November 2016, are always powered on and sending data and I have not noticed any outages in that time.

    btw, I sent the Device code yesterday through email.

    thanks for the notes, it's interesting.

    Mark