Autoscaler Agents remain in Creating state


We are seeing an issue where agents remain in the Creating state when there is a burst in the number of agents created by the autoscaler.

If agents are created gracefully, 1-4 at a time, there is generally no issue. But if we go from our minimum to requesting 10+ agents within a parallelised build, we see that some agents remain in the Creating state indefinitely.

Logs on both the autoscaler and the agent don’t seem to show anything wrong. The VM in GCP gets created, but we can see there are no Docker processes running yet.

Any hint on where to look for issues would be great. The current theory is that the autoscaler is waiting for the agent to report some sort of status or there is a concurrency issue when multiple agents get requested.
We’ve been looking into this to try to understand what response the autoscaler expects before it updates the status and marks the agent as ready to start taking builds:

Please post the autoscaler logs with TRACE logging enabled. The logs will help us determine where in the process the autoscaler is waiting so that we can suggest possible root causes. If the logs are insufficient, we can add more.

This is not quite how it works. First, the autoscaler provisions the instance and then makes an API call to describe the instance and check the instance network. Once the instance is successfully provisioned, the status is changed from creating to staging.

Next, the autoscaler tries to docker ping the Docker daemon on the instance to verify it is initialized (using a backoff). Finally, once it is able to ping the instance, it installs and starts the agent (using docker create and docker start). Once the agent is successfully installed, the status is changed from staging to running.
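To make the lifecycle easier to reason about, here is a minimal sketch of it as a lookup table, reading the description above as creating, then staging, then running. It is illustrative only: the names `State`, `PENDING_STEP`, `NEXT_STATE`, `advance` and `where_to_look` are hypothetical and are not taken from the autoscaler source.

```python
from enum import Enum


class State(Enum):
    """Simplified instance states, as described in this thread."""
    CREATING = "creating"
    STAGING = "staging"
    RUNNING = "running"


# What the autoscaler is still working on while an instance sits in each
# state (wording is illustrative, not copied from the autoscaler).
PENDING_STEP = {
    State.CREATING: "provision the VM, then describe it and check its network",
    State.STAGING: "docker ping with backoff, then docker create and docker start the agent",
    State.RUNNING: "nothing; the agent is installed and can take builds",
}

# Forward transitions in this simplified model.
NEXT_STATE = {
    State.CREATING: State.STAGING,
    State.STAGING: State.RUNNING,
}


def advance(state: State) -> State:
    """Return the state that follows `state`, or raise if it is terminal."""
    if state not in NEXT_STATE:
        raise ValueError(f"{state.value} has no next state in this model")
    return NEXT_STATE[state]


def where_to_look(state_name: str) -> str:
    """Map a status string shown for an instance to the step it is waiting on."""
    return PENDING_STEP[State(state_name)]


if __name__ == "__main__":
    # An instance stuck in "creating" has not finished the provisioning step.
    print(where_to_look("creating"))
```

Read this way, an instance that never leaves creating has not yet reached the Docker ping at all, which would be consistent with seeing no Docker processes on the VM.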

Since you see the agent stuck in the creating status, we can narrow this down to a problem with instance creation. It sounds like it might be stuck in the waitZoneOperation backoff. The backoff is subject to a 1 hour timeout, which ultimately propagates to the waitZoneOperation call through its context. The autoscaler repeats this backoff until GCP reports a status of DONE or until the API returns an error (which includes a timeout error).
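The shape of such a wait loop can be sketched as follows. This is not the autoscaler's Go code: `wait_for_done`, `OperationError` and the delay values are hypothetical, and only the DONE status and the 1 hour timeout come from the description above. The point it illustrates is that a loop like this stays silent until the operation completes, errors, or hits the deadline, which would look like an agent that remains in creating for a long time.

```python
import time


class OperationError(Exception):
    """Raised by the (hypothetical) status callback when the API reports an error."""


def wait_for_done(get_status, timeout=3600.0, initial_delay=1.0, max_delay=60.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() with exponential backoff until it returns "DONE".

    Returns the number of polls made. Raises TimeoutError when the deadline
    passes; any exception raised by get_status() propagates to the caller.
    The sleep and clock parameters exist so the loop can be tested without
    real waiting.
    """
    deadline = clock() + timeout
    delay = initial_delay
    attempts = 0
    while True:
        attempts += 1
        if get_status() == "DONE":
            return attempts
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError(f"operation not DONE after {timeout} seconds")
        # Never sleep past the deadline; double the delay up to max_delay.
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay)


if __name__ == "__main__":
    # Simulated operation that completes on the third poll.
    statuses = iter(["PENDING", "RUNNING", "DONE"])
    polls = wait_for_done(lambda: next(statuses), initial_delay=0.01)
    print(f"operation finished after {polls} polls")
```

With TRACE logging enabled, the equivalent loop in the autoscaler should show whether it is still polling, which is why the logs requested above are the most useful next step.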

The agent is ready once the autoscaler has successfully connected to the Docker daemon on the machine, executed a docker ping, and installed the agent using docker create and docker start. But as mentioned above, it sounds like you are not getting past the instance creation and verification step.