[wontfix] Failing build steps reporting as success on highly parallel builds

We ran into this issue whilst attempting to run a pipeline that has highly parallel build steps.
In this instance Drone is responsible for deploying an application across a number of contexts.
The pipeline ran, but numerous build steps were incorrectly marked as Success. These build steps were not run, did not deploy the service and left no logs. I have attached an image of one of the failing build steps. Our investigation determined that these errors occurred when running highly parallel builds on a system limited by thread count.
Given that the deployment was left in a very inconsistent and in fact partial state, it should never have reported GREEN.


We were able to mitigate the issue by limiting ‘MAX_PROCS’ so that parallel pipelines no longer caused the error, but this came at the cost of concurrency and the gains it provides.

We recently fixed a bug where referencing an invalid step name in depends_on, or referencing the step name before the step was defined (e.g. order of operations) would result in the exact behavior you described where steps fail to execute but are labeled as passing.
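To make the failure mode concrete, here is a minimal Go sketch of the kind of check that catches both cases: a dependency on an undefined step name, and a dependency on a step that is only defined later. The `Step` type and `validateDependencies` function are hypothetical names used for illustration and are not taken from the runner's code.

```go
package main

import "fmt"

// Step is a hypothetical, simplified pipeline step: a name plus the names
// of the steps it depends on.
type Step struct {
	Name      string
	DependsOn []string
}

// validateDependencies returns an error if any step depends on a name that
// is undefined, or that is only defined later in the list.
func validateDependencies(steps []Step) error {
	defined := make(map[string]bool, len(steps))
	for _, s := range steps {
		for _, dep := range s.DependsOn {
			if !defined[dep] {
				return fmt.Errorf("step %q depends on %q, which is not defined before it", s.Name, dep)
			}
		}
		defined[s.Name] = true
	}
	return nil
}

func main() {
	steps := []Step{
		{Name: "build"},
		{Name: "deploy", DependsOn: []string{"biuld"}}, // typo on purpose
	}
	if err := validateDependencies(steps); err != nil {
		fmt.Println("invalid pipeline:", err)
	}
}
```

Rejecting the pipeline up front, rather than letting the step silently never run, is what keeps a bad reference from showing up as a passing step.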

We also added some improvements to the runner’s logging to surface more information about edge cases that one might encounter.

I would therefore recommend running the latest version of the docker runner, version 1.5.2, to make sure you have these improvements. If the issue persists, please enable trace logging on the runner, then gather the trace logs and provide a copy of the yaml so that we can analyze and attempt to reproduce.

Hi Brad,
Thanks for your help.
We have upgraded the runners to the latest version so will get better logs anon.

We have ruled out the DAG or invalid step names as these failures are inconsistent.
Furthermore, we only see these failures when running too many procs; if we mitigate by limiting this, they cease.

We will continue to monitor the situation and update here with new information.

We use a semaphore to limit the maximum number of concurrent steps. Looking through the code the other day, I noticed we silently exit if we fail to acquire the semaphore, which could be a root cause of the behavior you described. I patched this behavior (see below link), and the patch is included in v1.5.2. The main reason we would expect acquiring a semaphore to error is that the pipeline was canceled by the end user or timed out, but we now handle unexpected failure scenarios as well.

The semaphore code is only executed when max procs are configured. The fact that you are experiencing this issue only when max procs are configured helps narrow this down (thanks for this) and gives me some confidence the behavior described above may have been the root cause. Hopefully you will find this solved in v1.5.2, but if not, there should be more logging in place to help further narrow this down.
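For anyone following along, here is a rough Go sketch of the difference between silently exiting on a failed semaphore acquire and propagating the error. It is an illustration only: the channel-based `semaphore` type and the `runStepSilently`, `runStep` and `demo` functions are hypothetical and are not the runner's actual implementation.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// semaphore is a minimal counting semaphore built on a buffered channel.
type semaphore chan struct{}

// acquire blocks until a slot is free or ctx is done.
func (s semaphore) acquire(ctx context.Context) error {
	// Check cancellation first so that a done context always wins.
	if err := ctx.Err(); err != nil {
		return err
	}
	select {
	case s <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func (s semaphore) release() { <-s }

// runStepSilently mimics the problematic pattern: a failed acquire returns
// nil, so the caller cannot tell a skipped step from a successful one.
func runStepSilently(ctx context.Context, sem semaphore, step func() error) error {
	if err := sem.acquire(ctx); err != nil {
		return nil // step never ran, yet no error is reported
	}
	defer sem.release()
	return step()
}

// runStep propagates the acquire error so the step can be marked as failed.
func runStep(ctx context.Context, sem semaphore, step func() error) error {
	if err := sem.acquire(ctx); err != nil {
		return fmt.Errorf("step not started: %w", err)
	}
	defer sem.release()
	return step()
}

// demo runs one step against an already canceled context and reports
// whether the step ran and what each variant returned.
func demo() (ran bool, silentErr, propagatedErr error) {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	sem := make(semaphore, 1)
	step := func() error { ran = true; return nil }
	silentErr = runStepSilently(ctx, sem, step)
	propagatedErr = runStep(ctx, sem, step)
	return ran, silentErr, propagatedErr
}

func main() {
	ran, silentErr, propagatedErr := demo()
	fmt.Println("step ran:", ran)
	fmt.Println("silent variant error:", silentErr)
	fmt.Println("propagating variant error:", propagatedErr)
	fmt.Println("is cancellation:", errors.Is(propagatedErr, context.Canceled))
}
```

In the silent variant the step never executes and no error comes back, which is how a step can end up with no logs and a passing status.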

Hi Brad,

Thanks for your help. I think you may have misunderstood what I was saying.

We were seeing this issue when running highly parallel builds but WITHOUT the max proc set.
This is the opposite of what you mentioned above.

To be clear. We were seeing the issue when we had no value configured. Setting the max to, in our specific example, 8 stopped the issue from occurring. At 16 it occurred much more often.

Nonetheless, we have updated to v1.5.2 and switched on tracing, so we will have logs to look at soon.

Thanks again,
Brian

Thanks for the clarification. We made two improvements, which are present in 1.5.2, including capturing oom kill errors and re-connecting if the docker wait command exits unexpectedly before the container exits. Under significant load, either of these situations could be possible, and could have contributed to the unexpected behavior you observed. Hopefully you will no longer experience this behavior with 1.5.2. However, if the behavior persists, hopefully the trace logs will help further narrow this down.
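As a rough illustration of the two improvements, the Go sketch below re-issues the wait when it returns while the container is still running, and reports an out-of-memory kill as an error instead of a normal exit. Everything here is an assumption made for the example: the `engine` interface, the `state` type, `errOOMKilled`, `fakeEngine` and `simulate` are hypothetical and do not mirror the Docker client API or the runner's code.

```go
package main

import (
	"errors"
	"fmt"
)

// errOOMKilled is a hypothetical sentinel error for out-of-memory kills.
var errOOMKilled = errors.New("container killed: out of memory")

// state is a hypothetical, trimmed-down view of a container's status.
type state struct {
	running   bool
	oomKilled bool
	exitCode  int
}

// engine is a hypothetical abstraction over the container engine.
type engine interface {
	Wait(id string) error             // may return early if the connection drops
	Inspect(id string) (state, error) // reports the container's current status
}

// waitForExit waits for the container to exit. If a wait returns while the
// container is still running, it waits again, up to maxAttempts times.
func waitForExit(e engine, id string, maxAttempts int) (int, error) {
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		lastErr = e.Wait(id)
		st, err := e.Inspect(id)
		if err != nil {
			return 0, err
		}
		if st.running {
			continue // wait ended early: reconnect instead of assuming success
		}
		if st.oomKilled {
			return st.exitCode, errOOMKilled
		}
		return st.exitCode, nil
	}
	return 0, fmt.Errorf("container %s still running after %d waits: %v", id, maxAttempts, lastErr)
}

// fakeEngine drops the wait connection a fixed number of times before the
// container exits with the configured final state.
type fakeEngine struct {
	drops  int
	exited bool
	final  state
}

func (f *fakeEngine) Wait(id string) error {
	if f.drops > 0 {
		f.drops--
		return errors.New("wait connection dropped")
	}
	f.exited = true
	return nil
}

func (f *fakeEngine) Inspect(id string) (state, error) {
	if !f.exited {
		return state{running: true}, nil
	}
	return f.final, nil
}

// simulate runs waitForExit against a fakeEngine.
func simulate(drops int, oom bool, maxAttempts int) (int, error) {
	final := state{}
	if oom {
		final = state{oomKilled: true, exitCode: 137}
	}
	return waitForExit(&fakeEngine{drops: drops, final: final}, "step-1", maxAttempts)
}

func main() {
	code, err := simulate(2, false, 5)
	fmt.Println("after two dropped waits:", code, err)
	code, err = simulate(0, true, 5)
	fmt.Println("after an oom kill:", code, err)
}
```

The point of the sketch is that an early return from the wait is checked against the container's real status rather than being treated as a successful exit.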

Here is maybe a related issue where max_procs causes new step canceling behavior on failure.

Dunno if it's fixed by anything mentioned here or not.

@thomasf if I am understanding correctly, the behavior described in the linked thread is the expected behavior. If the pipeline is in a failure state, subsequent steps are skipped unless they are configured to run on failure. If you limit max procs and the step has not started yet due to concurrency limits, and the pipeline is in a failure state by the time it starts, it is skipped. The logic is consistently applied. I am happy to discuss further, but let’s move back to the original thread to keep this thread focused on pipeline steps having a passing status.
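The skip rule described above can be sketched in a few lines of Go. This is a simplified model, not the runner's code: the `Step` type, its `OnSuccess` and `OnFailure` fields, and `shouldRun` are hypothetical names. The key point is that the decision is made when the step is about to start, so a step delayed by the concurrency limit sees the pipeline state at that later moment.

```go
package main

import "fmt"

// Step is a hypothetical, simplified step with the pipeline states in which
// it is configured to run.
type Step struct {
	Name      string
	OnSuccess bool
	OnFailure bool
}

// shouldRun is evaluated when the step is about to start, using the
// pipeline state at that moment.
func shouldRun(s Step, pipelineFailed bool) bool {
	if pipelineFailed {
		return s.OnFailure
	}
	return s.OnSuccess
}

func main() {
	deploy := Step{Name: "deploy", OnSuccess: true}
	notify := Step{Name: "notify", OnSuccess: true, OnFailure: true}

	// Assume another step failed while these two were queued behind the
	// concurrency limit.
	pipelineFailed := true
	for _, s := range []Step{deploy, notify} {
		fmt.Printf("%s: run=%v\n", s.Name, shouldRun(s, pipelineFailed))
	}
}
```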