So I’m having some issues with grpc and my agents.
I think an agent is working fine, then hits some gRPC issues that it doesn't "recover" from, and then gets a build that "doesn't finish". In reality the build is no longer running on the agent (no containers besides the agent container are running), but when I hit `/varz` the agent still shows the build as running and as having exceeded its timeout. So the agent becomes "unhealthy".
But gRPC then starts working again, and since I have `DRONE_MAX_PROCS` set to 3 the agent continues to take builds. The build that isn't actually running but has timed out is holding 1 slot, so 2 slots are free to keep taking and running builds. So now I have an agent that is working, but in a degraded state.
So I go and try to clean it up. I run `drone kill -s SIGINT agent` and see `ctrl+c received, terminating process` in the log. The agent doesn't take any more jobs, but hitting the `/varz` endpoint I still see that "hung" job that isn't really running listed as "running". So the agent never exits. At this point I force remove the agent container and start it again.
So, a couple of questions/thoughts I had after all this:
- Why does the agent continue to track a build that has timed out? Why doesn't it cancel/clean up the build once it has reached its timeout? If the agent knows the build has timed out, I think it should automatically try to stop the build's containers and remove it from "running".
- Would it be considered a bug that `drone kill -s SIGINT agent` doesn't work properly on an unhealthy agent? If so, I can open an issue. That said, if you agree with my first point, fixing that would also take care of this issue.