Drone

Parallel steps in kubernetes runner

When a pipeline has parallel steps and any two of them complete at roughly the same time, multiple steps fail with the following error:

default: Operation cannot be fulfilled on pods "drone-0upypmgcjeq7yt2inqe5": the object has been modified; please apply your changes to the latest version and try again.

See Kubernetes runner: operation cannot be fulfilled; the object has been modified
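
For readers unfamiliar with the message: it is the Kubernetes API server's standard response to a write that carries a stale resourceVersion. The sketch below is a toy in-memory model of that check, not the runner's code. `FakeStore`, `Conflict`, and `update_with_retry` are hypothetical names, and the only point is to show why two near-simultaneous updates to one pod object collide and how re-reading before retrying resolves it.

```python
import copy


class Conflict(Exception):
    """Raised when a write carries a stale resource version."""


class FakeStore:
    """Toy stand-in for the API server, holding a single pod object."""

    def __init__(self, obj):
        self._obj = dict(obj, resourceVersion=1)

    def get(self):
        return copy.deepcopy(self._obj)

    def update(self, obj):
        # Reject the write if the caller's copy is older than the stored one.
        if obj["resourceVersion"] != self._obj["resourceVersion"]:
            raise Conflict("the object has been modified")
        stored = copy.deepcopy(obj)
        stored["resourceVersion"] += 1
        self._obj = stored


def update_with_retry(store, mutate, attempts=5):
    """Re-read the latest object and reapply the change on conflict."""
    for _ in range(attempts):
        latest = store.get()
        mutate(latest)
        try:
            store.update(latest)
            return True
        except Conflict:
            continue
    return False


store = FakeStore({"name": "drone-pod", "labels": {}})

# Two writers read the same version of the pod, then both try to write.
a = store.get()
b = store.get()
a["labels"]["step-a"] = "done"
b["labels"]["step-b"] = "done"

store.update(a)  # first writer wins
try:
    store.update(b)  # second writer holds a stale copy
    second_write_failed = False
except Conflict:
    second_write_failed = True

# Retrying against the latest version succeeds.
retried = update_with_retry(store, lambda pod: pod["labels"].update({"step-b": "done"}))
final = store.get()
```

In this model the second write fails only because it was prepared from an older copy of the object, which matches the "complete at roughly the same time" symptom described above.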

This seems to prevent that error, but concurrent steps still fail for me: they either hang forever or fail with zero logs. Now I see this in the kubelet logs:

147.75.64.155: E0423 00:09:56.347725    5495 status_manager.go:402] Status update on pod ci/drone-h4ozy45e6uldqyfbe4hb aborted: terminated container drone-5hxb5687jy44py8lbj3e attempted illegal transition to non-terminated state

EDIT:

As I dig into it, the above error may not be a concern, but I’m still not able to get parallel steps to work.

Judging by the logs, the steps appear to finish successfully, but the step is red: https://ci.dev.talos-systems.io/talos-systems/talos/8551/1/16

I took a look at the raw API response and I see that the step returned an exit code of 2:
https://ci.dev.talos-systems.io/api/repos/talos-systems/talos/builds/8551

This means the step is definitely failing and the step should be red. Is it possible your make command (or whatever is running inside the container) is failing silently with an exit code of 2? I can say with certainty that Drone would not assign an exit code of 2 unless it came directly from the Kubernetes API.
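
One way to check for a silent failure is to run the step's commands outside Drone and look at the exit status directly. GNU make documents an exit status of 2 when it encounters errors, so a 2 is consistent with a failing make target. The helper below is a minimal sketch: `run_step` is a hypothetical name and the commands are placeholders, not the real pipeline's commands.

```python
import subprocess


def run_step(commands):
    """Run each command in order with sh, stopping at the first failure.

    Returns (exit_code, failed_command); failed_command is None on success.
    """
    for command in commands:
        result = subprocess.run(["sh", "-c", command])
        if result.returncode != 0:
            return result.returncode, command
    return 0, None


# Placeholder commands; the middle one exits the way a broken make target would.
code, failed = run_step(["echo lint", "exit 2", "echo never-runs"])
print(f"exit code {code} from {failed!r}")
```

Substituting the step's actual commands shows which one, if any, produces the non-zero status even when its output looks clean.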

All I can add to this is that this exact same pipeline/step works when we use the deprecated built-in Kubernetes runner. Also, the failures seem to be random. For example, the talos-local step fails here: https://ci.dev.talos-systems.io/talos-systems/talos/8550/1/14. The strange thing is that https://ci.dev.talos-systems.io/api/repos/talos-systems/talos/builds/8551 says it was skipped. Again, this all works when using the built-in k8s runner. Here is our last successful run, with the exact same pipeline, except without type: kubernetes: https://ci.dev.talos-systems.io/talos-systems/talos/8530

EDIT:

I ran the markdown lint step locally, and it is indeed failing. Will fix that and try again. Regardless, in some cases it is green and some it is not. Something feels off with the reporting of the success of a step.

Are you sure the step can be parallelized? Perhaps there is a race condition where the step is reading or writing files that are also being accessed by another step. This could explain why the failures would be random …

But let’s step back for a moment and consider the following …

IF we assume:

  1. Drone creates a standard Kubernetes Pod and
  2. Each step is a standard Kubernetes container and
  3. Your commands are a simple bash script (as shown here) and
  4. the bash script is the container entrypoint and
  5. Drone gets the exit code directly from Kubernetes

Then: How could Drone cause a container to fail or return the wrong exit code?
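
To test assumption 5 independently of Drone, the exit code Kubernetes recorded can be read from the pod itself. In the Kubernetes Pod API it sits under `status.containerStatuses[].state.terminated.exitCode`. The sketch below parses an object of that shape; `terminated_exit_codes` is a hypothetical helper and the sample pod is made-up data, shaped like the output of `kubectl get pod <name> -o json`.

```python
def terminated_exit_codes(pod):
    """Map container name to exit code for containers that have terminated."""
    codes = {}
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = status.get("state", {}).get("terminated")
        if terminated is not None:
            codes[status["name"]] = terminated["exitCode"]
    return codes


# Made-up pod object for illustration only.
pod = {
    "metadata": {"name": "example-pod"},
    "status": {
        "containerStatuses": [
            {"name": "step-1", "state": {"terminated": {"exitCode": 0}}},
            {"name": "step-2", "state": {"terminated": {"exitCode": 2}}},
            {"name": "step-3", "state": {"running": {}}},
        ]
    },
}

codes = terminated_exit_codes(pod)
print(codes)
```

If the code recorded here matches what Drone reports, the failure originates inside the container; if it differs, that points at the runner.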

As a next step, I would recommend that you triage the code and try to find the root cause (bonus points for sending a patch). Since you are able to reproduce the issue (albeit infrequently) you would be in the best position to debug the problem. You can find the relevant code at https://github.com/drone-runners/drone-runner-kube/blob/master/engine/engine_impl.go

I’m absolutely certain the step can be parallelized. Again, this pipeline has been working for months. The only thing that has changed is the Kubernetes runner. Take https://ci.dev.talos-systems.io/api/repos/talos-systems/talos/builds/8613 as an example. This PR fails randomly, and I can reproduce it far more than infrequently: it has happened in 5 of the last 6 runs. Each time a different step has failed (see 8610, 8611, and 8614), with nothing to indicate why (no logs for most).

https://ci.dev.talos-systems.io/api/repos/talos-systems/talos/builds/8614 is interesting because the setup-ci logs look perfectly fine, yet the exit code is 2, and at this point there is no parallelization: everything after setup-ci depends on it. Similarly, https://ci.dev.talos-systems.io/talos-systems/talos/8615/1/8 shows logs that look normal, yet the exit code is 2.
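
Comparing several builds by hand is tedious, so a small script over the build JSON can list which steps failed in each run. This is a sketch under assumptions: it expects the build response to contain `stages`, each with `steps` carrying `name`, `status`, and `exit_code` fields, and the sample below is made-up data, not the contents of any build linked above.

```python
def failed_steps(build):
    """Return (stage, step, status, exit_code) for every step that did not pass."""
    failures = []
    for stage in build.get("stages", []):
        for step in stage.get("steps", []):
            exit_code = step.get("exit_code", 0)
            if step.get("status") == "failure" or exit_code != 0:
                failures.append(
                    (stage.get("name"), step.get("name"), step.get("status"), exit_code)
                )
    return failures


# Made-up build object for illustration only.
build = {
    "number": 1,
    "stages": [
        {
            "name": "default",
            "steps": [
                {"name": "setup", "status": "success", "exit_code": 0},
                {"name": "lint", "status": "failure", "exit_code": 2},
                {"name": "test", "status": "skipped", "exit_code": 0},
            ],
        }
    ],
}

failures = failed_steps(build)
for stage, step, status, exit_code in failures:
    print(f"{stage}/{step}: {status} (exit code {exit_code})")
```

Running this over the JSON of consecutive builds would make it easier to see whether the failing step really changes from run to run.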

If you believe this is an issue with the kubernetes runner, the next step is for you to debug the code and submit a patch to drone-runners/drone-runner-kube. We provide this guide to contributing to the kubernetes runner.