Builds are Stuck in Pending Status

This section will help triage why builds are stuck in a pending state.

In our experience this issue is always related to configuration, so to triage it we need to see your configuration details and logs. Please take the following actions and provide the following data:

  1. Provide your server configuration
  2. Provide your agent configuration
  3. Enable DRONE_LOGS_TRACE=true on the server
  4. Enable DRONE_LOGS_TRACE=true on the agent
  5. Provide the agent logs with trace enabled
  6. Provide the server logs with trace enabled
  7. Provide the YAML configuration file

What does a successful connection look like?

Before we discuss troubleshooting connectivity issues, we should first examine what a successful connection looks like. When debug mode is enabled on the server, and when agents successfully connect, you will see an entry in your server logs that looks like this:

{
  "arch": "amd64",
  "kernel": "",
  "level": "debug",
  "msg": "manager: request queue item",
  "os": "linux",
  "time": "2019-04-28T16:00:47-07:00",
  "variant": ""
}

The manager: request queue item entry is proof that the agent is successfully connecting to the server. If you do not see these log entries, you can be certain that the agents are failing to connect to the server.
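If the server logs are large, a short script can check for these entries. The following is a minimal sketch, assuming the logs were saved to a file with one JSON object per line (the entry above is pretty-printed for readability). The function name count_queue_requests is hypothetical and not part of Drone.

```python
import json

# The message the server logs when an agent asks the queue for work.
QUEUE_REQUEST_MSG = "manager: request queue item"


def count_queue_requests(lines):
    """Count log entries showing that an agent requested a queue item."""
    count = 0
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip plain-text log lines
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed entries
        if isinstance(entry, dict) and entry.get("msg") == QUEUE_REQUEST_MSG:
            count += 1
    return count
```

Passing an open file works, for example count_queue_requests(open("server.log")), where server.log is a placeholder path. A result of zero suggests that no agent has reached the server.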

Networking Problems

The most common root cause is network connectivity issues. The best way to triage connectivity issues is to pass DRONE_LOGS_TRACE=true to the agent. This will provide detailed logs for the HTTP requests the agent makes to the server.

If the agent cannot establish a connection to the server, you will see agent logs like the following. Please note that this indicates a problem with your agent configuration, your server configuration, or your network configuration (DNS, etc). This does not indicate a bug in Drone.

2019/04/28 16:05:57 [ERR] POST http://localhost:8080/rpc/v1/request request failed: Post http://localhost:8080/rpc/v1/request: dial tcp [::1]:8080: connect: connection refused
2019/04/28 16:05:57 [DEBUG] POST http://localhost:8080/rpc/v1/request: retrying in 2s (29 left)
2019/04/28 16:05:57 [ERR] POST http://localhost:8080/rpc/v1/request request failed: Post http://localhost:8080/rpc/v1/request: dial tcp [::1]:8080: connect: connection refused
2019/04/28 16:05:57 [DEBUG] POST http://localhost:8080/rpc/v1/request: retrying in 2s (29 left)
2019/04/28 16:05:57 [ERR] POST http://localhost:8080/rpc/v1/request request failed: Post http://localhost:8080/rpc/v1/request: dial tcp [::1]:8080: connect: connection refused
2019/04/28 16:05:57 [DEBUG] POST http://localhost:8080/rpc/v1/request: retrying in 2s (29 left)
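To rule out basic reachability problems, you can test the TCP connection from the machine or container where the agent runs. This is a minimal sketch using only the Python standard library. The function name can_connect is hypothetical, and the host and port you pass should be the ones from your own agent configuration.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the hostname and tries each address,
        # so DNS failures and refused connections both end up as OSError.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For the logs above, can_connect("localhost", 8080) would be the matching check. A True result only shows that something is listening at that address. It does not prove that the endpoint is a Drone server or that the shared secret is correct.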

Invalid Endpoint, Proxy Problems

Another common root cause is specifying an invalid endpoint, or a reverse proxy that routes the request incorrectly. This will manifest as an error that includes HTML in the error message, for example:

2019/04/28 16:12:03 [DEBUG] POST https://drone.company.com/rpc/v1/request
{
  "arch": "amd64",
  "error": "\u003c!DOCTYPE html\u003e\n\u003chtml

You should also check to ensure you provide the correct server address, including the scheme (http vs https). If you are using the http address, and your reverse proxy automatically redirects to https, it can result in connectivity issues.
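In the JSON logs the HTML is escaped, so \u003c and \u003e stand for < and >. The following is a rough sketch for spotting such entries, assuming each log entry is available as a complete JSON string with an "error" field as in the example above. The function name error_is_html is hypothetical.

```python
import json


def error_is_html(log_entry_json):
    """Return True if the entry's "error" field starts with an HTML document."""
    entry = json.loads(log_entry_json)  # decoding turns \u003c back into "<"
    error = str(entry.get("error", "")).lstrip().lower()
    return error.startswith("<!doctype html") or error.startswith("<html")
```

An HTML body in the error usually means the request reached a web page (a proxy error page, a login page, or a redirect target) rather than the server's RPC endpoint.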

Incorrect Secret

Unfortunately, a shared secret mismatch between the agent and server is the most difficult error to debug because it does not produce a useful error message. You should take care to ensure you are passing the correct secret to both the server and agent. Make sure the characters match exactly. If you are using cat to read the secret from a file, be careful: the file may end with a newline or contain other stray whitespace, which becomes part of the secret and can be difficult to troubleshoot.
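One way to check for this kind of mismatch is to compare the two values exactly as each process receives them. The sketch below is illustrative only: the function name diagnose_secret is hypothetical, and how you obtain the two strings depends on your deployment.

```python
def diagnose_secret(server_secret, agent_secret):
    """Compare two secrets and report whether they differ only by whitespace."""
    if server_secret == agent_secret:
        return "match"
    # A trailing newline from reading a file is a typical cause of this case.
    if server_secret.strip() == agent_secret.strip():
        return "whitespace mismatch"
    return "mismatch"
```

If you need to inspect a value, repr(secret) makes hidden characters such as "\n" visible. Avoid pasting the output anywhere public.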

Undefined Platform when using Arm or Arm64

Drone assumes all pipelines are amd64 unless otherwise specified. If you are using Drone with arm or arm64 agents, please be sure to specify the architecture so that builds are routed to the correct agent. The lines marked with + below show the section to add.

kind: pipeline
name: default

+platform:
+  os: linux
+  arch: arm

steps: ...

Beware of False Positives

When you enable trace logs it is easy to misinterpret the results. The Agent uses long polling to request builds from the server queue. The agent connects to the server for up to 30 seconds. If the agent does not receive a build from the queue after 30 seconds, it terminates the connection and then reconnects. The connection is terminated after 30 seconds to prevent timeouts (from reverse proxies, load balancers, etc). It is therefore completely normal to see 524 status codes and context deadline exceeded errors in the trace logs.

The following trace log entries are therefore completely normal:

{
  "arch": "amd64",
  "level": "debug",
  "machine": "bradleys-mbp.lan",
  "msg": "runner: polling queue",
  "os": "linux",
  "time": "2019-04-28T16:22:16-07:00"
}
2019/04/28 16:22:16 [DEBUG] POST http://localhost:8080/rpc/v1/request
2019/04/28 16:22:46 [DEBUG] POST http://localhost:8080/rpc/v1/request (status: 524): retrying in 1s (30 left)
2019/04/28 16:23:16 [ERR] POST http://localhost:8080/rpc/v1/request request failed: Post http://localhost:8080/rpc/v1/request: context deadline exceeded
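When reading through a long trace log, it can help to separate these expected entries from genuine failures. The following is a rough sketch that classifies a line using only the message fragments shown in this post. The marker lists and the function name classify_trace_line are assumptions for illustration and are not exhaustive.

```python
# Fragments that appear during normal long polling (see the logs above).
EXPECTED_MARKERS = ("status: 524", "context deadline exceeded")
# Fragments that indicate the agent could not reach the server at all.
FAILURE_MARKERS = ("connection refused",)


def classify_trace_line(line):
    """Label a trace log line as 'failure', 'expected', or 'other'."""
    if any(marker in line for marker in FAILURE_MARKERS):
        return "failure"
    if any(marker in line for marker in EXPECTED_MARKERS):
        return "expected"
    return "other"
```

Lines labeled "expected" can be ignored when looking for the cause of pending builds. Lines labeled "other" still need to be read by hand.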