Kubernetes runner intermittently fails steps

I’m not sure whether this is a problem with the runner or something larger in my cluster, but over the last month or two these failures have become more frequent. Using the drone-kube-runner, a pipeline step will sometimes fail, usually with no output at all, or with the only output being something like:

```
unable to retrieve container logs for docker://226be60fecae14fa720572ccae3062f530f1d71c9a1dbf67d9b10029b9458c0d
```

A restart usually passes, but lately it has taken several attempts, since on a retry a different step will sometimes fail the same way. I have logging turned up on the main Drone process and on the drone-kube-runner, but I don’t see any notable errors in either. I was using drone/drone-runner-kube:latest until today, when I switched to drone/drone-runner-kube:1.0.0-beta.3, but I experienced the issue with that as well. The main Drone version is 1.7.0.

I’m not seeing anything helpful in the Kubernetes logs (kube-apiserver, kube-scheduler, kube-controller-manager, or even Docker itself), but it’s a bit like looking for the proverbial needle in a haystack. I’m not really sure where to go from here. The problem only started after switching from the legacy Kubernetes runner to the drone-kube-runner; both ran on the same Kubernetes cluster.

I guess I should have posted this under Support, but the number of transient build failures this is causing is quite high. Any pointers on which logging to turn up would be appreciated.

Hi Matt,
I did some research on this particular error and came across an interesting thread [1] from knative (which shares some similarities with Drone), where they found the issue was caused by the container being deleted before the Kubernetes client had enough time to fetch the logs. I wonder if this could be the root cause here? Do the failing steps run very quickly?

[1] https://github.com/knative/docs/issues/300
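
For context, here is a minimal client-go sketch of what streaming a step’s logs looks like. This is an illustration, not the runner’s actual code, and the namespace, pod, and container names are made up. If the container is torn down before or while the stream is open, the call fails with an error much like the one quoted above:

```go
package main

import (
	"context"
	"fmt"
	"io"
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes this runs inside the cluster; use clientcmd for out-of-cluster.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Namespace, pod, and container names here are hypothetical.
	if err := streamStepLogs(context.Background(), client, "default", "drone-job-abc123", "drone-step-1"); err != nil {
		fmt.Fprintln(os.Stderr, "log streaming failed:", err)
	}
}

// streamStepLogs follows a container's logs until it exits. If the pod is
// deleted before or while this runs, Stream (or the subsequent copy) returns
// an error similar to the "unable to retrieve container logs" message above.
func streamStepLogs(ctx context.Context, client kubernetes.Interface, ns, pod, container string) error {
	req := client.CoreV1().Pods(ns).GetLogs(pod, &corev1.PodLogOptions{
		Container: container,
		Follow:    true,
	})
	rc, err := req.Stream(ctx)
	if err != nil {
		return fmt.Errorf("stream logs: %w", err)
	}
	defer rc.Close()
	_, err = io.Copy(os.Stdout, rc)
	return err
}
```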

Thanks for looking into this.

Yes, most of these steps take only a handful of seconds. Drone usually reports around 4 seconds of total run time for a given step when it succeeds. I believe that time also includes the image pull, so the actual time the container spends in a running state is probably under a second.

I think that covers most of the intermittent failures, but sometimes the clone step fails with no output, and that step takes roughly 40 seconds when successful. But I think you’re on to something…

I think there are a couple of things I would like to attempt:

  1. Retry getting the logs on error, with a backoff
  2. Delay Pod deletion to give the system enough time to fetch the logs (both are sketched below)
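
Roughly, the two mitigations could look like this. This is only a sketch (it reuses the hypothetical streamStepLogs helper from the snippet above); the real feature-flag implementation in the runner may differ:

```go
package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// streamLogsWithRetry retries the streamStepLogs helper from the earlier
// sketch with a simple exponential backoff: 250ms, 500ms, 1s, 2s, 4s.
func streamLogsWithRetry(ctx context.Context, client kubernetes.Interface, ns, pod, container string) error {
	backoff := 250 * time.Millisecond
	var err error
	for attempt := 0; attempt < 5; attempt++ {
		if err = streamStepLogs(ctx, client, ns, pod, container); err == nil {
			return nil
		}
		time.Sleep(backoff)
		backoff *= 2
	}
	return err
}

// deletePodAfterDelay waits before tearing the pod down, so that any
// in-flight log streams have a chance to finish.
func deletePodAfterDelay(ctx context.Context, client kubernetes.Interface, ns, pod string, delay time.Duration) error {
	time.Sleep(delay)
	return client.CoreV1().Pods(ns).Delete(ctx, pod, metav1.DeleteOptions{})
}
```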

If I enable both of the above with feature flags, would you be open to testing since you are able to reproduce with some consistency?

Yes, absolutely. I would love it if the deletion delay were configurable; I would find a lot of value in being able to inspect the pods during general troubleshooting. I’d probably set it to about 24 hours.

Hi Matt,

We added some feature flags to drone/drone-runner-kube:latest that I hope will help:

The first flag configures the system to retry streaming the logs on error. I recommend enabling this flag first and monitoring its behavior. If it solves the problem, we can implement a more permanent fix where we automatically retry.

The second flag delays removing the Pod by 30 seconds. If the underlying root cause is the Pod being deleted before logs are streamed, setting this flag to true should work around the issue by giving the system enough time to stream the logs.

The third flag disables removing the Pod entirely. If enabled, this requires manually removing the resources associated with the pipeline (Pod, ConfigMaps, Secrets, etc.). If streaming the logs fails consistently, we should be able to disable Pod deletion and try streaming the logs with kubectl to see whether the error is reproducible.
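
To tie the three flags together, here is a rough sketch of how a runner might read them. The environment variable names below are placeholders for illustration only; they are not the runner’s real configuration keys:

```go
package main

import (
	"os"
	"time"
)

// featureFlags mirrors the three toggles described above. The environment
// variable names are PLACEHOLDERS, not the runner's real configuration keys.
type featureFlags struct {
	retryLogs       bool // retry streaming logs on error
	delayTeardown   bool // wait ~30s before deleting the pod
	disableTeardown bool // never delete the pod; clean up manually
}

func loadFlags() featureFlags {
	return featureFlags{
		retryLogs:       os.Getenv("PLACEHOLDER_RETRY_LOGS") == "true",
		delayTeardown:   os.Getenv("PLACEHOLDER_DELAY_TEARDOWN") == "true",
		disableTeardown: os.Getenv("PLACEHOLDER_DISABLE_TEARDOWN") == "true",
	}
}

// teardownDelay returns how long to wait before deleting the pipeline pod.
func (f featureFlags) teardownDelay() time.Duration {
	if f.delayTeardown {
		return 30 * time.Second
	}
	return 0
}
```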

Thanks very much. I’ll throw the first flag in now and give it a few days.