Kube runner, handling exitCode=2 reason=Error

Hello! We are at the point where we are about to officially release version 1.0 of Kubernetes runner. But, unfortunately, we can’t proceed with that until we resolve the final blocking issue… And we need help from the community with this.

When the runner watches a pod for container state changes, it sometimes gets for a container state=terminated, exitCode=2, reason="Error". I just can’t figure out when or why this is happening. It happens intermittently.

Earlier the runner failed an entire stage when such a situation occurred. But that created too many failed builds for too many people. Then I added a timeout, first just 15 seconds, but currently a minute. So, if a container stays in the state longer than that the runner fails the stage.

Then, @sethpollack gave me feedback that the situation is not an error at all, and that we should just allow stages to continue. Basically, it means that Kubernetes is downloading an image in the background or something like that. He works with extremely large images that are downloaded every time and such a short timeout presented an issue for him.

So, I removed the timeout. But, then @mknapcok in Drone-runner-kube 1.0.0-beta.12 - Failed step blocks pipeline reports stuck builds or, when I reintroduced the timeout, the “container failed to start” error.

I need help to figure what does it mean state=terminated, exitCode=2, reason="Error" and how to treat the situation. Does anybody have a clue?

Hi @marko-gacesa ,

I’m running drone/drone-runner-kube@1.0.0-rc.1 and encounter the error you are describing.

Until now it was working fine. I have no idea what is causing the issue…

Here are more details about the logs:

  1. For pod="drone-runner-kube-7c5485cdd-wttqh"
time="2021-10-12T19:08:56Z" level=trace msg="http: no content returned: re-connect and re-try"
time="2021-10-12T19:08:40Z" level=debug msg="done listening for cancellations" build.id=358 build.number=284 repo.id=6 repo.name=my-repo repo.namespace=my-namespace stage.id=358 stage.name=default stage.number=1 thread=2
time="2021-10-12T19:08:40Z" level=trace msg="http: context canceled"
time="2021-10-12T19:08:40Z" level=debug msg="poller: request stage from remote server" thread=2
time="2021-10-12T19:08:40Z" level=debug msg="updated stage to complete" build.id=358 build.number=284 duration=166 repo.id=6 repo.name=my-repo repo.namespace=my-namespace stage.id=358 stage.name=default stage.number=1 thread=2
time="2021-10-12T19:08:40Z" level=debug msg="successfully destroyed the pipeline environment" build.id=358 build.number=284 repo.id=6 repo.name=my-repo repo.namespace=my-namespace stage.id=358 stage.name=default stage.number=1 thread=2
time="2021-10-12T19:08:40Z" level=trace msg="PodWatcher terminated" namespace=default pod=drone-lqpwz3p6bhtw2tqjvvvw
time="2021-10-12T19:08:40Z" level=debug msg="PodWatcher: Wait finished. Duration=2.52s" container= error="<nil>" pod=drone-lqpwz3p6bhtw2tqjvvvw stepState=finished
time="2021-10-12T19:08:40Z" level=trace msg="PodWatcher: Event" event=DELETED pod=drone-lqpwz3p6bhtw2tqjvvvw
time="2021-10-12T19:08:40Z" level=trace msg="PodWatcher: Event" event=MODIFIED pod=drone-lqpwz3p6bhtw2tqjvvvw
time="2021-10-12T19:08:39Z" level=trace msg="PodWatcher: Event" event=MODIFIED pod=drone-lqpwz3p6bhtw2tqjvvvw
time="2021-10-12T19:08:38Z" level=trace msg="PodWatcher: Event" event=MODIFIED pod=drone-lqpwz3p6bhtw2tqjvvvw
time="2021-10-12T19:08:38Z" level=debug msg="PodWatcher: Waiting..." container= pod=drone-lqpwz3p6bhtw2tqjvvvw stepState=finished
time="2021-10-12T19:08:38Z" level=trace msg="deleted pod" namespace=default pod=drone-lqpwz3p6bhtw2tqjvvvw
time="2021-10-12T19:08:38Z" level=trace msg="deleted secret" namespace=default pod=drone-lqpwz3p6bhtw2tqjvvvw
time="2021-10-12T19:08:38Z" level=trace msg="deleted pull secret" namespace=default pod=drone-lqpwz3p6bhtw2tqjvvvw
time="2021-10-12T19:08:33Z" level=debug msg="destroying the pipeline environment" build.id=358 build.number=284 repo.id=6 repo.name=my-repo repo.namespace=my-namespace stage.id=358 stage.name=default stage.number=1 thread=2
time="2021-10-12T19:08:32Z" level=debug msg="PodWatcher: Wait finished. Duration=61.91s" container=drone-4m8p4rpnx3rri3hg7f19 error="kubernetes has failed: container failed to start: id=drone-4m8p4rpnx3rri3hg7f19 exitcode=2 reason=Error image=docker.io/library/node:16.4" pod=drone-lqpwz3p6bhtw2tqjvvvw stepState=finished
time="2021-10-12T19:08:32Z" level=debug msg="PodWatcher: Container failed." container=drone-4m8p4rpnx3rri3hg7f19 exitCode=2 image="docker.io/library/node:16.4" pod=drone-lqpwz3p6bhtw2tqjvvvw ready=true reason=Error restartCount=1 state=TERMINATED
time="2021-10-12T19:08:32Z" level=trace msg="PodWatcher: Event" event=MODIFIED pod=drone-lqpwz3p6bhtw2tqjvvvw
time="2021-10-12T19:08:27Z" level=trace msg="PodWatcher: Container failed. Waiting to recover..." container=drone-4m8p4rpnx3rri3hg7f19 exitCode=2 image="docker.io/library/node:16.4" pod=drone-lqpwz3p6bhtw2tqjvvvw ready=true reason=Error restartCount=1 state=TERMINATED
time="2021-10-12T19:08:27Z" level=trace msg="PodWatcher: Periodic container state check" namespace=default pod=drone-lqpwz3p6bhtw2tqjvvvw
time="2021-10-12T19:08:16Z" level=trace msg="http: no content returned: re-connect and re-try"
time="2021-10-12T19:08:16Z" level=trace msg="http: no content returned: re-connect and re-try"
time="2021-10-12T19:08:07Z" level=trace msg="PodWatcher: Container failed. Waiting to recover..." container=drone-4m8p4rpnx3rri3hg7f19 exitCode=2 image="docker.io/library/node:16.4" pod=drone-lqpwz3p6bhtw2tqjvvvw ready=true reason=Error restartCount=1 state=TERMINATED
time="2021-10-12T19:08:07Z" level=trace msg="PodWatcher: Periodic container state check" namespace=default pod=drone-lqpwz3p6bhtw2tqjvvvw
time="2021-10-12T19:07:47Z" level=trace msg="PodWatcher: Container failed. Waiting to recover..." container=drone-4m8p4rpnx3rri3hg7f19 exitCode=2 image="docker.io/library/node:16.4" pod=drone-lqpwz3p6bhtw2tqjvvvw ready=true reason=Error restartCount=1 state=TERMINATED
time="2021-10-12T19:07:47Z" level=trace msg="PodWatcher: Periodic container state check" namespace=default pod=drone-lqpwz3p6bhtw2tqjvvvw
time="2021-10-12T19:07:36Z" level=trace msg="http: no content returned: re-connect and re-try"
time="2021-10-12T19:07:36Z" level=trace msg="http: no content returned: re-connect and re-try"
time="2021-10-12T19:07:31Z" level=trace msg="PodWatcher: Container failed. Trying recovery..." container=drone-4m8p4rpnx3rri3hg7f19 exitCode=2 image="docker.io/library/node:16.4" pod=drone-lqpwz3p6bhtw2tqjvvvw ready=true reason=Error restartCount=1 state=TERMINATED
time="2021-10-12T19:07:31Z" level=trace msg="PodWatcher: Event" event=MODIFIED pod=drone-lqpwz3p6bhtw2tqjvvvw
time="2021-10-12T19:07:30Z" level=debug msg="PodWatcher: Waiting..." container=drone-4m8p4rpnx3rri3hg7f19 pod=drone-lqpwz3p6bhtw2tqjvvvw stepState=finished
time="2021-10-12T19:07:28Z" level=debug msg="PodWatcher: Wait finished. Duration=1.46s" container=drone-4m8p4rpnx3rri3hg7f19 error="<nil>" pod=drone-lqpwz3p6bhtw2tqjvvvw stepState=running
time="2021-10-12T19:07:28Z" level=debug msg="PodWatcher: Container state changed" container=drone-4m8p4rpnx3rri3hg7f19 image="docker.io/library/node:16.4" pod=drone-lqpwz3p6bhtw2tqjvvvw restartCount=1 state=RUNNING stepState=running
time="2021-10-12T19:07:28Z" level=trace msg="PodWatcher: Event" event=MODIFIED pod=drone-lqpwz3p6bhtw2tqjvvvw
time="2021-10-12T19:07:27Z" level=trace msg="PodWatcher: Periodic container state check" namespace=default pod=drone-lqpwz3p6bhtw2tqjvvvw
  1. For pod="drone-dfb4b9b5c-79wnc"
{"fields.time":"2021-10-12T19:08:43Z","latency":268775,"level":"debug","method":"GET","msg":"","remote":"100.100.100.100:50270","request":"/welcome","request-id":"1zQ1uufUGb5cbaShR7x4u3y28fk","time":"2021-10-12T19:08:43Z"}
{"fields.time":"2021-10-12T19:08:43Z","latency":34374,"level":"debug","method":"GET","msg":"","remote":"100.100.100.100:50268","request":"/","request-id":"1zQ1uw7IWrievhcVAnNDckIRRA8","time":"2021-10-12T19:08:43Z"}
{"fields.time":"2021-10-12T19:08:33Z","latency":362883,"level":"debug","method":"GET","msg":"","remote":"100.100.100.100:50224","request":"/welcome","request-id":"1zQ1tjmRGzbtROv8WpbDfBHS4wx","time":"2021-10-12T19:08:33Z"}
{"fields.time":"2021-10-12T19:08:33Z","latency":39754,"level":"debug","method":"GET","msg":"","remote":"100.100.100.100:50222","request":"/","request-id":"1zQ1tjLgHsXOxeOz2v8uwko9hdA","time":"2021-10-12T19:08:33Z"}
{"build.id":358,"build.number":284,"level":"debug","msg":"manager: build is finished, teardown","repo.id":6,"stage.id":358,"time":"2021-10-12T19:08:33Z"}
{"level":"debug","msg":"manager: stage is complete. teardown","stage.id":358,"time":"2021-10-12T19:08:32Z"}
{"error":"stream: not found","level":"warning","msg":"manager: cannot teardown log stream","step.id":1165,"step.name":"notify","step.status":"skipped","time":"2021-10-12T19:08:32Z"}
{"level":"debug","msg":"manager: updating step status","step.id":1165,"step.name":"notify","step.status":"skipped","time":"2021-10-12T19:08:32Z"}
{"error":"stream: not found","level":"warning","msg":"manager: cannot teardown log stream","step.id":1164,"step.name":"deploy","step.status":"skipped","time":"2021-10-12T19:08:32Z"}
{"level":"debug","msg":"manager: updating step status","step.id":1164,"step.name":"deploy","step.status":"skipped","time":"2021-10-12T19:08:32Z"}
{"error":"stream: not found","level":"warning","msg":"manager: cannot teardown log stream","step.id":1163,"step.name":"publish","step.status":"skipped","time":"2021-10-12T19:08:32Z"}
{"level":"debug","msg":"manager: updating step status","step.id":1163,"step.name":"publish","step.status":"skipped","time":"2021-10-12T19:08:32Z"}
{"level":"debug","msg":"manager: updating step status","step.id":1162,"step.name":"lint","step.status":"error","time":"2021-10-12T19:08:32Z"}
{"arch":"","kernel":"","kind":"pipeline","level":"debug","msg":"manager: request queue item","os":"","time":"2021-10-12T19:08:26Z","type":"kubernetes","variant":""}
{"fields.time":"2021-10-12T19:08:23Z","latency":234481,"level":"debug","method":"GET","msg":"","remote":"100.100.100.100:50180","request":"/welcome","request-id":"1zQ1sPC6EEeVeSDWoDvmFvgvvvR","time":"2021-10-12T19:08:23Z"}
{"fields.time":"2021-10-12T19:08:23Z","latency":40446,"level":"debug","method":"GET","msg":"","remote":"100.100.100.100:50178","request":"/","request-id":"1zQ1sR4SSf8WvBlCOpRYLnoYVdR","time":"2021-10-12T19:08:23Z"}
{"arch":"","kernel":"","kind":"pipeline","level":"debug","msg":"manager: context canceled","os":"","time":"2021-10-12T19:08:16Z","type":"kubernetes","variant":""}
{"fields.time":"2021-10-12T19:08:13Z","latency":222459,"level":"debug","method":"GET","msg":"","remote":"100.100.100.100:50138","request":"/welcome","request-id":"1zQ1r8u9FDz73IBNPvve6U6Fu27","time":"2021-10-12T19:08:13Z"}
{"fields.time":"2021-10-12T19:08:13Z","latency":36277,"level":"debug","method":"GET","msg":"","remote":"100.100.100.100:50136","request":"/","request-id":"1zQ1r89L6UaLxUntT57jn8dw3b7","time":"2021-10-12T19:08:13Z"}
{"fields.time":"2021-10-12T19:08:03Z","latency":247846,"level":"debug","method":"GET","msg":"","remote":"100.100.100.100:50084","request":"/welcome","request-id":"1zQ1pxCddX6W7e4czWjukocP1fq","time":"2021-10-12T19:08:03Z"}
{"fields.time":"2021-10-12T19:08:03Z","latency":85690,"level":"debug","method":"GET","msg":"","remote":"100.100.100.100:50082","request":"/","request-id":"1zQ1puom5y2HQzMy29nhCClhl57","time":"2021-10-12T19:08:03Z"}
{"fields.time":"2021-10-12T19:07:53Z","latency":382929,"level":"debug","method":"GET","msg":"","remote":"100.100.100.100:50042","request":"/welcome","request-id":"1zQ1ob1t13fybQWQDcchDQRGub8","time":"2021-10-12T19:07:53Z"}
{"fields.time":"2021-10-12T19:07:53Z","latency":35427,"level":"debug","method":"GET","msg":"","remote":"100.100.100.100:50040","request":"/","request-id":"1zQ1oeXBrc3FJC3u0ZnM1XSWD1j","time":"2021-10-12T19:07:53Z"}
{"arch":"","kernel":"","kind":"pipeline","level":"debug","msg":"manager: request queue item","os":"","time":"2021-10-12T19:07:46Z","type":"kubernetes","variant":""}
{"fields.time":"2021-10-12T19:07:43Z","latency":289925,"level":"debug","method":"GET","msg":"","remote":"100.100.100.100:50000","request":"/welcome","request-id":"1zQ1nRG1KDsK0CrTH6TfEab1KfH","time":"2021-10-12T19:07:43Z"}
{"fields.time":"2021-10-12T19:07:43Z","latency":26561,"level":"debug","method":"GET","msg":"","remote":"100.100.100.100:49998","request":"/","request-id":"1zQ1nPd8SoJLCYgoqc7ZYnXIV3D","time":"2021-10-12T19:07:43Z"}
{"arch":"","kernel":"","kind":"pipeline","level":"debug","msg":"manager: context canceled","os":"","time":"2021-10-12T19:07:36Z","type":"kubernetes","variant":""}
{"fields.time":"2021-10-12T19:07:33Z","latency":337555,"level":"debug","method":"GET","msg":"","remote":"100.100.100.100:49954","request":"/welcome","request-id":"1zQ1m4zmP0aE7KJqP7bxQ6MLLQi","time":"2021-10-12T19:07:33Z"}
{"fields.time":"2021-10-12T19:07:33Z","latency":52909,"level":"debug","method":"GET","msg":"","remote":"100.100.100.100:49952","request":"/","request-id":"1zQ1m5xAfuZSSKIKHDAxVepEgRG","time":"2021-10-12T19:07:33Z"}

And here is a slightly simplified version of my .drone.yml:

kind: pipeline
type: kubernetes
name: default

steps:
  - name: build
    image: node:16.4
    commands:
      - npm ci --no-optional
    when:
      event:
        exclude:
          - promote

  - name: lint
    image: node:16.4
    commands:
      - npm run lint
    when:
      event:
        exclude:
          - promote

I don’t think the error is related to the docker image, because the build step pass with success…