marko-gacesa recently joined the team and will be focusing on bringing the Kubernetes runner to GA. He authored this pull request, which redesigns how we watch for updates and adds periodic polling as a fallback in case the runner fails to receive an event from Kubernetes (for whatever reason), in the hope of making the runner much more resilient. We ask everyone reading this thread to begin testing with the latest runner image. Please remember that the latest runner image is an unstable build and should not be tested in production.
This particular issue has been very difficult for us to triage because it is inconsistent: some people use the runner in production without issues, while others periodically see pipelines or steps stuck in a running state. Going forward, when someone reports an issue, we are going to ask for cluster details so that we can try to mirror your cluster internally and reproduce the problem. We may also ask to schedule a Zoom call to triage live. We would love to have the runner reach GA by early July, but that will depend on community testing and feedback, so we look forward to your help.
If you continue to experience issues with pipelines or steps getting stuck with the Kubernetes runner, please let us know and do the following:
- Ensure you are using the latest Kubernetes runner image (please avoid reporting issues with previous versions)
- Enable debug and trace logging
- Publish your runner logs to a gist and provide a link
- Publish your yaml to a gist and provide a link (bonus points for providing a simplified yaml that we can use to easily reproduce)
- Publish your cluster details to a gist and provide a link; provide whatever information it takes for us to re-create your cluster in our environment
If you experience issues with the Kubernetes runner unrelated to pipelines or steps stuck in pending or running status, please create a separate thread and we will triage and address them separately. Thanks!
We uncovered an issue with an unknown container error being returned and causing steps to fail sporadically. We are investigating. See "Latest drone-runner-kube fails successful steps and does not display output".
We uncovered an issue where an invalid or nonexistent image would cause the pipeline to hang due to an uncaught error. This has been patched and a new version of the runner image is available. https://github.com/drone-runners/drone-runner-kube/pull/58