Drone

Cancelling a build leaves stages with status='pending'

Environment

Docker image sha256:78f6fe3bb6e585fa137439a9eae4ba056ee22a75cca3294fa0089fadd2b0a63d
Server version: 1.4.0

Autoscaler version:

    {
     source: "https://github.com/drone/autoscaler.git",
     version: "latest",
     commit: "a8ce5f982e0b516cb88c3335cbfad8344fbf4dbb"
    }

Problem

drone_ci=> select stage_id, stage_status from stages s join builds b on s.stage_build_id = b.build_id where b.build_status not in ('running') and s.stage_status='pending';
 stage_id | stage_status
----------+--------------
    18141 | pending
    18127 | pending
    18119 | pending
    18118 | pending
    18117 | pending
    18116 | pending
    18115 | pending
(7 rows)

As we use autoscaler, these pending stages cause agents to keep running. How can it happen and is there way to work around?


this PR will fix this behaviour.

beware the PR only adjusts MySQL driver, you may need to copy the code for the SQLite driver.

We’re running this in prod for months (with sqlite) and it definitely fixes this and we did not notice any adverse affects

@genisd thx for the pointer. Yes, it looks like the exact issue I experienced.

@bradrydzewski As discussed in https://github.com/drone/drone/pull/2653 this may burn a lot of :moneybag: as autoscaler thinks there are pending stages but they are never dispatched. Monkey patch is not a desirable way as the codebase rapidly evolves. Please consider to fix this. I think the integrity is more important than the performance.

We have not experienced this problem at cloud.drone.io and we have pretty significant build volume. If we were able to experience the issue it would be easier for us to troubleshoot and patch, however, this has not been the case.

Regarding the pull request, the author wrote the following in the description:

Another approach would be to properly kill those pending stages, but my Golang skills are currently too limited to accomplish this. Also further debugging would be needed to determine why it didn’t worked in the first place.

As the author notes, the pull request does not fix the root cause of the problem and they did not have time to research and identify the root cause. While I do appreciate the pull request (most people just complain and never send a pull request) it does not address the root cause. It is a quick hack. I have no desire to merge hacks and thus introduce technical debt. Send a pull request that addresses the root cause, and I will merge it. Or identify and document the root cause and I will be happy to patch.

@bradrydzewski Thank you for the clarification. I’ll try to create a drone.yml to reproduce the error, but in the meanwhile, I’m going to run the SQL script periodically to fix the inconsistency in Postgres.

update stages set stage_status='killed' 
where stage_id in 
(select stage_id from stages s 
 join builds b on s.stage_build_id = b.build_id 
 where b.build_status not in ('running') and s.stage_status='pending');

I understand you need to reproduce the situation to debug, but I’d appreciate if you know I need to update other tables. I get no result from drone queue ls after running the script so I presume it is sufficient.

@bradrydzewski I could reproduce the issue in
cloud.drone.io.

https://cloud.drone.io/k2n/drone-multi-pipeline-test/16/26/1

Created 32 pipelines and 7 pipelines are stuck in pending state.

I checked the page again and noticed the pipelines in pending status are now cancelled. Is there a periodic task to clean up the state?

I could reproduce again after several attempts.
Here is the screenshot.

https://cloud.drone.io/k2n/drone-multi-pipeline-test/21/31/1