Drone

Fix gRPC DeadlineExceeded error

I have encountered a lot of gRPC errors. In the drone-agent, the logs are as follows:

grpc error: wait(): code: DeadlineExceeded: rpc error: code = DeadlineExceeded desc = context deadline exceeded
grpc error: wait(): code: DeadlineExceeded: rpc error: code = DeadlineExceeded desc = context deadline exceeded

This error has been around for a while, and I have checked several discussion posts about it on the forum.

Many people have encountered similar problems, and some have not been able to solve them. In my scenario, this error does not affect the build process; it only appears in the drone-agent log and is not fatal.

In previous posts, some people say it is a gRPC communication problem between drone-agent and drone-server, but that is not necessarily the case (since the build process works normally, communication cannot be the problem). I can reproduce this error, and I hope to submit a pull request.

How to reproduce

1. Set the repository timeout in Drone to 1 minute.

2. Add a step to the pipeline that takes more than 1 minute (I use sleep 60).

3. Trigger a build.

The timeout is set to 1 minute and the step does not finish within that time, so the build is cancelled due to the timeout. Then check the drone-agent log and you will see the output below. This line is printed every second, so the log keeps growing over time.

grpc error: wait(): code: DeadlineExceeded: rpc error: code = DeadlineExceeded desc = context deadline exceeded

In drone-server, you can see the following log:

grpc: Server.processUnaryRPC failed to write status stream error: code = DeadlineExceeded desc = "context deadline exceeded"

How to solve this problem

During a build, the drone-agent performs a Wait operation, blocking until the drone-server side of the call returns. However, the Wait operation uses the same timeout context as the pipeline engine.

When the build is cancelled because of the timeout, the drone-server tries to return data, but the gRPC call has already timed out, so nothing can be returned. At that moment the drone-agent gets no useful information from the drone-server, only the gRPC timeout error.

We can look at the implementation of the Wait method:

func (c *client) Wait(ctx context.Context, id string) (err error) {
	req := new(proto.WaitRequest)
	req.Id = id
	for {
		_, err = c.client.Wait(ctx, req)
		if err == nil {
			break
		} else {
			log.Printf("grpc error: wait(): code: %v: %s", grpc.Code(err), err)
		}
		switch grpc.Code(err) {
		case
			codes.Aborted,
			codes.DataLoss,
			codes.DeadlineExceeded,
			codes.Internal,
			codes.Unavailable:
			// non-fatal errors
		default:
			return err
		}
		<-time.After(backoff)
	}
	return nil
}

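Once the gRPC context has timed out, the Wait operation keeps looping, because the DeadlineExceeded error is treated as non-fatal and retried. Every retry reuses the same expired context, so it fails again immediately and logs the same error.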
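To illustrate the looping behaviour outside of Drone, here is a minimal sketch that uses only the Go standard library. The names fakeWait, retryWait and countRetries are hypothetical, and fakeWait merely stands in for the gRPC call; the maxAttempts cap exists only so the demo terminates, whereas the loop quoted above has no such cap.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// fakeWait is a hypothetical stand-in for the gRPC Wait call: it blocks until
// the context is done and then returns the context's error.
func fakeWait(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// retryWait mirrors the shape of the retry loop: DeadlineExceeded is treated
// as non-fatal and retried after a backoff. maxAttempts is an assumption added
// for this demo so that it stops.
func retryWait(ctx context.Context, backoff time.Duration, maxAttempts int) (attempts int, err error) {
	for attempts < maxAttempts {
		attempts++
		err = fakeWait(ctx)
		if err == nil {
			return attempts, nil
		}
		if !errors.Is(err, context.DeadlineExceeded) {
			return attempts, err // fatal error, stop retrying
		}
		time.Sleep(backoff)
	}
	return attempts, err
}

// countRetries runs the loop against a context that expires after timeoutMs
// milliseconds and reports how many attempts were made.
func countRetries(timeoutMs, maxAttempts int) int {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeoutMs)*time.Millisecond)
	defer cancel()
	n, _ := retryWait(ctx, time.Millisecond, maxAttempts)
	return n
}

func main() {
	// The context expires once, yet every later attempt fails the same way,
	// so the loop only stops because of the demo cap.
	fmt.Println("attempts:", countRetries(10, 5))
}
```

With an expired context, only the attempt cap ends the loop in this sketch, which matches the endlessly repeated log line seen in the drone-agent.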

Therefore, to solve this problem, the pipeline timeout must not affect the drone-agent's gRPC call to the drone-server. That is, use separate timeout contexts instead of sharing one, and make the timeout of the Wait operation slightly longer.
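The idea of separate contexts can be sketched with the standard library alone. This is an illustration under assumptions, not Drone's code: contextsAfterTimeout, pipelineCtx, waitCtx and the grace value are hypothetical names chosen for the example.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// contextsAfterTimeout derives two contexts from the same parent: one for the
// pipeline and one for the Wait call with some extra grace time. It waits for
// the pipeline deadline to pass and reports which contexts have expired.
func contextsAfterTimeout(timeoutMs, graceMs int) (pipelineExpired, waitExpired bool) {
	timeout := time.Duration(timeoutMs) * time.Millisecond
	grace := time.Duration(graceMs) * time.Millisecond

	parent := context.Background()

	pipelineCtx, cancelPipeline := context.WithTimeout(parent, timeout)
	defer cancelPipeline()

	// The Wait context gets its own, slightly later deadline.
	waitCtx, cancelWait := context.WithTimeout(parent, timeout+grace)
	defer cancelWait()

	<-pipelineCtx.Done() // the pipeline deadline has passed
	return pipelineCtx.Err() != nil, waitCtx.Err() != nil
}

func main() {
	p, w := contextsAfterTimeout(10, 1000)
	fmt.Println("pipeline expired:", p, "wait expired:", w)
}
```

Because the Wait context outlives the pipeline context, a call bound to it still has time to receive the server's response after the pipeline has timed out.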

Fix code

The code fix is very simple, just one line (change cmd/drone-agent/agent.go#214).

go func() {
	logger.Debug().
		Msg("listen for cancel signal")

	// fix here
	ctx, cancel := context.WithTimeout(ctxmeta, timeout+1*time.Minute)

	if werr := r.client.Wait(ctx, work.ID); werr != nil {
		cancelled.SetTo(true)
		logger.Warn().
			Err(werr).
			Msg("cancel signal received")

		cancel()
	} else {
		logger.Debug().
			Msg("stop listening for cancel signal")
	}
}()

I rebuilt the drone-agent, and after deploying it the error no longer appears. I have run many builds to test it and found no problems, so I will submit a pull request. I hope it is accepted and that it helps others.

Have you tried setting the keepalive time on the server and agent? This has been the solution for most people who have encountered issues with gRPC connectivity.

You pass the following to the server:
DRONE_KEEPALIVE_MIN_TIME=20s

And the following to the agent:
DRONE_KEEPALIVE_TIME=45s
DRONE_KEEPALIVE_TIMEOUT=30s

Note that you may need to adjust the values based on your load balancer, reverse proxy, or network settings to ensure the keepalive is timed correctly.
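When adjusting these values, one basic sanity check can be sketched as follows. It assumes, based only on the variable names, that DRONE_KEEPALIVE_MIN_TIME is the shortest ping interval the server accepts and DRONE_KEEPALIVE_TIME is the agent's ping interval; that mapping is not confirmed in this thread. The function keepaliveCompatible is hypothetical. In gRPC generally, a server that enforces a minimum ping interval may close connections from clients that ping more often than that.

```go
package main

import (
	"fmt"
	"time"
)

// keepaliveCompatible reports whether the agent's ping interval is at least
// as long as the server's minimum accepted interval. Both arguments use Go
// duration syntax such as "20s" or "1m".
func keepaliveCompatible(serverMinTime, agentTime string) (bool, error) {
	minTime, err := time.ParseDuration(serverMinTime)
	if err != nil {
		return false, fmt.Errorf("server min time: %w", err)
	}
	interval, err := time.ParseDuration(agentTime)
	if err != nil {
		return false, fmt.Errorf("agent time: %w", err)
	}
	return interval >= minTime, nil
}

func main() {
	// Values from the settings above (assumed mapping, see the lead-in).
	ok, err := keepaliveCompatible("20s", "45s")
	fmt.Println(ok, err)
}
```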

I am honored to discuss this issue with you. However, I have already tried changing the environment variables you mentioned, and it did not help. I have even added gRPC health monitoring. Such settings make no difference because the network connection is always healthy. The problem I want to fix with this pull request is not caused by a network failure.

@bradrydzewski I'm also facing a similar issue. We have many builds that take between 5 and 15 minutes to complete. Will setting the keepalive values you mentioned cause those builds to time out?

Note: We run Drone inside k8s, and the drone agent communicates with the drone server through a k8s service endpoint.

gRPC was removed from Drone in version 1.0. The solution to any gRPC issues is to upgrade to the latest stable version of Drone, which is 1.6.4 at the time of writing this comment.

@bradrydzewski Yes, we will be moving to version 1.0. But for now, will setting these values affect us, considering that our builds take between 5 and 15 minutes?