Drone

Agent launched by autoscaler stuck at StateStaging

The autoscaler logs show the following, repeatedly:

{"log":"12:11AM DBG connecting to docker ip=ww.xx.yy.zz name=agent-ABC\n","stream":"stderr","time":"2019-09-11T00:11:08.842005132Z"}
{"log":"12:11AM DBG cannot connect, retry in 1m0s error=\"Cannot connect to the Docker daemon at https://ww.xx.yy.zz:2376. Is the docker daemon running?\" ip=ww.xx.yy.zz name=agent-ABC\n","stream":"stderr","time":"2019-09-11T00:11:08.843114371Z"}

I checked the agent, and it is in StateStaging:

$ drone server info agent-ABC
Name:    agent-ABC
Address: ww.xx.yy.zz
Region:  us-east-1f
Size:    m5a.2xlarge
State:   staging

I SSH’d into the agent, and it seems the Docker installation failed; nothing related to Drone itself.
While in StateStaging, it appears to retry the installation over and over.


Not sure how to go about fixing this. It seems these agents should be moved to StateError after a while and left for the reaper to handle. They’re not marked for termination because that logic only applies to StateRunning agents.

When pinging the Docker daemon fails, the autoscaler performs a backoff and, after an hour (I believe), eventually marks the instance as errored. Have you waited an hour to see if it marks the server as errored as expected?

It was up for 15 hours, and I don’t see any pinger logs for that agent, even though it is configured to ping every 10 minutes. It looks like the pinger only considers StateRunning servers (https://github.com/drone/autoscaler/blob/master/engine/pinger.go#L41), unless I’m missing something.

I would need to see all logs with name=agent-ABC to assess further.

{"log":"10:58PM DBG instance create image=ami-43a15f3e name=agent-ABC region=us-east-1 size=m5a.large\n","stream":"stderr","time":"2019-09-10T22:58:33.66337939Z"}
{"log":"10:58PM INF instance create success image=ami-43a15f3e name=agent-ABC region=us-east-1 size=m5a.large\n","stream":"stderr","time":"2019-09-10T22:58:35.112429734Z"}
{"log":"10:58PM DBG check instance network image=ami-43a15f3e name=agent-ABC region=us-east-1 size=m5a.large\n","stream":"stderr","time":"2019-09-10T22:58:35.112658704Z"}
{"log":"10:58PM DBG instance network ready image=ami-43a15f3e ip=ww.xx.yy.zz name=agent-ABC region=us-east-1 size=m5a.large\n","stream":"stderr","time":"2019-09-10T22:58:35.223177267Z"}
{"log":"10:58PM DBG provisioned server server=agent-ABC\n","stream":"stderr","time":"2019-09-10T22:58:35.22326976Z"}
{"log":"10:58PM DBG check docker connectivity ip=ww.xx.yy.zz name=agent-ABC\n","stream":"stderr","time":"2019-09-10T22:58:37.282570616Z"}
{"log":"10:58PM DBG connecting to docker ip=ww.xx.yy.zz name=agent-ABC\n","stream":"stderr","time":"2019-09-10T22:58:37.282646072Z"}
{"log":"10:59PM DBG cannot connect, retry in 1m0s error=\"Cannot connect to the Docker daemon at https://ww.xx.yy.zz:2376. Is the docker daemon running?\" ip=ww.xx.yy.zz name=agent-ABC\n","stream":"stderr","time":"2019-09-10T22:59:08.714011185Z"}

The last two lines repeat over and over for hours; the full log exceeds the maximum characters allowed for the post.

It looks like we were missing a timeout in the install routine. I have pushed an update to include such a timeout, so the install will now fail after approximately 1 hour of failed connectivity attempts.

We typically only see these errors when you first configure the autoscaler, and they generally indicate a problem with networking configuration or firewall rules. They should therefore be very rare in a properly configured production installation. For this reason I think a 60 minute timeout is good enough.

This is great, thanks for the quick responses as always!

So it seems the timeout did work, but the instances are not being killed; they are just stuck at StateStaging, for 5 hours now.

{"log":"12:25PM DBG instance create image=ami-43a15f3e name=agent-ABC region=us-east-1 size=m5a.large\n","stream":"stderr","time":"2019-09-12T12:25:58.366453276Z"}
{"log":"12:26PM INF instance create success image=ami-43a15f3e name=agent-ABC region=us-east-1 size=m5a.large\n","stream":"stderr","time":"2019-09-12T12:26:00.061732851Z"}
{"log":"12:26PM DBG check instance network image=ami-43a15f3e name=agent-ABC region=us-east-1 size=m5a.large\n","stream":"stderr","time":"2019-09-12T12:26:00.061917248Z"}
{"log":"12:26PM DBG instance network ready image=ami-43a15f3e ip=ww.xx.yy.zz name=agent-ABC region=us-east-1 size=m5a.large\n","stream":"stderr","time":"2019-09-12T12:26:00.176977084Z"}
{"log":"12:26PM DBG provisioned server server=agent-ABC\n","stream":"stderr","time":"2019-09-12T12:26:00.177125506Z"}
{"log":"12:26PM DBG check docker connectivity ip=ww.xx.yy.zz name=agent-ABC\n","stream":"stderr","time":"2019-09-12T12:26:07.649957862Z"}
{"log":"12:26PM DBG connecting to docker ip=ww.xx.yy.zz name=agent-ABC\n","stream":"stderr","time":"2019-09-12T12:26:07.6499868Z"}
{"log":"12:26PM DBG cannot connect, retry in 1m0s error=\"Cannot connect to the Docker daemon at https://ww.xx.yy.zz:2376. Is the docker daemon running?\" ip=ww.xx.yy.zz name=agent-ABC\n","stream":"stderr","time":"2019-09-12T12:26:39.274166689Z"}
...
{"log":"1:24PM DBG cannot connect, retry in 1m0s error=\"Cannot connect to the Docker daemon at https://ww.xx.yy.zz:2376. Is the docker daemon running?\" ip=ww.xx.yy.zz name=agent-ABC\n","stream":"stderr","time":"2019-09-12T13:24:39.380819484Z"}
{"log":"1:25PM DBG connecting to docker ip=ww.xx.yy.zz name=agent-ABC\n","stream":"stderr","time":"2019-09-12T13:25:39.38156451Z"}
{"log":"1:25PM DBG cannot connect, retry in 1m0s error=\"Cannot connect to the Docker daemon at https://ww.xx.yy.zz:2376. Is the docker daemon running?\" ip=ww.xx.yy.zz name=agent-ABC\n","stream":"stderr","time":"2019-09-12T13:25:39.382501389Z"}
{"log":"1:26PM DBG connection timeout ip=ww.xx.yy.zz name=agent-ABC\n","stream":"stderr","time":"2019-09-12T13:26:07.650015758Z"}
$ drone server info agent-ABC
Name:    agent-ABC
Address: ww.xx.yy.zz
Region:  us-east-1f
Size:    m5a.large
State:   staging