Kubernetes runner stuck in running state / bringing the Kubernetes runner to GA

Switching from latest to 1.0.0-beta.7

I had to run it 10 times, but I got this build to get stuck:

It is:

name: default
kind: pipeline
type: kubernetes

volumes:
  - name: cache
    host:
      path: /tmp/drone/cache/v1

services:
  - name: postgres
    image: bitnami/postgresql:11
    environment:
      POSTGRESQL_POSTGRES_CONNECTION_LIMIT: 500
      POSTGRESQL_LOG_CONNECTIONS: YES
      POSTGRESQL_PASSWORD: "pass123"
      POSTGRESQL_USERNAME: postgres
  - name: mongodb
    image: mongo:3
    command: [ --smallfiles ]
  - name: redis
    image: redis
  - name: stripe-mock
    image: stripemock/stripe-mock:v0.84.0
  - name: geoip
    image: fiorix/freegeoip

basic_env_vars: &basic_env_vars
  PG_URL: "postgres://postgres:pass123@postgres:5432/test?ssl=false"
  MONGO_HOST: "mongodb"
  REDIS_HOST: redis
  REDIS_URL: "redis://redis:6379"
  CACHE_TEST_REDIS_URL: "redis://redis:6379/2"


steps:
  - name: restore-cache
    image: drillster/drone-volume-cache
    volumes:
      - name: cache
        path: /cache
    settings:
      restore: true
      ttl: 7
      cache_key: [ DRONE_REPO_NAME, DRONE_BRANCH ]
      mount:
        - ./deps
        - ./_build

  - name: db-connections
    image: debian
    pull: always
    environment:
      <<: *basic_env_vars
    commands:
      - sleep 60

  - name: install-deps
    image: elixir:latest
    pull: always
    environment:
      <<: *basic_env_vars
    commands:
      - echo "Do some elixir stuff"
    depends_on:
      - restore-cache
      - db-connections
  - name: rebuild-cache
    image: drillster/drone-volume-cache
    volumes:
      - name: cache
        path: /cache
    settings:
      rebuild: true
      cache_key: [ DRONE_REPO_NAME, DRONE_BRANCH ]
      mount:  
        - ./deps
        - ./_build
    depends_on:
      - install-deps

  - name: k-tests
    image: elixir:latest
    resources:
      limits:
        memory: 2048Mi
      requests:
        memory: 2048Mi
    environment:
      <<: *basic_env_vars
    commands:
      - echo "Some app tests"
    depends_on:
      - install-deps
  - name: t-tests
    image: elixir:latest
    resources:
      limits:
        memory: 2048Mi
      requests:
        memory: 2048Mi
    environment:
      <<: *basic_env_vars
    commands:
      - echo "More tests"
    depends_on:
      - install-deps
  - name: other-tests-and-credo
    image: elixir:latest
    environment:
      <<: *basic_env_vars
    commands:
      - echo "other tests"
    depends_on:
      - install-deps
  - name: build-image
    image: debian
    privileged: true
    environment:
      SOME_SECRET:
        from_secret: SOME_SECRET
    commands:
      - echo "Build image and push it"
   
  - name: deploy-image
    image: debian
    environment:
      OLLY_SECRET:
        from_secret: OLLY_SECRET
    commands:
      - echo "Push chart to chart museum"
    depends_on:
      - other-tests-and-credo
      - t-tests
      - k-tests
      - build-image

You need to create the directory /tmp/drone/cache/v1 on the host to run it
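A minimal sketch of that prerequisite; the path comes from the volumes: section of the pipeline above:

```shell
# Per the note above, the cache directory must be created on every
# Kubernetes node that may schedule this pipeline, before running it.
mkdir -p /tmp/drone/cache/v1
```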

Thanks @michalzxc … Do you have the runner’s log?

Sure, that is for the “5. Just sleep” build - time="2021-06-25T13:40:00Z" level=debug msg="stage details fetched" build.id=442 - Pastebin.com

I’ve come across a strange thing… First, some background: the Kubernetes runner creates a pod for each stage, and each step in a stage is a container inside the pod. Containers are initially created with a placeholder image, which is supposed to run forever and never terminate or fail. Once a step is ready for execution, the placeholder image is replaced with the real image for the step.
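For context, a rough kubectl-level sketch of that swap (not the runner’s actual Go code; the pod name and container index are illustrative, taken from the logs in this thread). A container’s image is one of the few fields Kubernetes lets you patch on a running pod:

```shell
# Illustrative only: the runner performs this through the Kubernetes API.
# Swapping a step's placeholder for the real image amounts to a JSON patch
# of spec.containers[i].image on the stage pod.
PATCH='[{"op":"replace","path":"/spec/containers/0/image","value":"docker.io/library/redis:latest"}]'
echo "kubectl patch pod drone-kbb5ik90ow4tg45l1kfd --type=json -p '$PATCH'"
```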

So, I was able to reproduce the stuck-stage problem on my machine. The cause was that a container (a service step, the redis image in my case) reverted from the redis image back to the placeholder image. It also terminated with a reason simply equal to “Error” (@michalzxc the same thing happened to you, from what I can see in the latest log output you provided, also with redis).

Here is my log related to the pod and the container:

DEBU[0020] Engine: Starting step: "redis"                container=drone-cz86n60u6swo5fa0wutw image="docker.io/library/redis:latest" placeholder="drone/placeholder:1" pod=drone-kbb5ik90ow4tg45l1kfd step.name=redis
DEBU[0021] PodWatcher: Waiting...                        container=drone-cz86n60u6swo5fa0wutw pod=drone-kbb5ik90ow4tg45l1kfd state=RUNNING
DEBU[0028] PodWatcher: Container state changed           container=drone-cz86n60u6swo5fa0wutw image="redis:latest" pod=drone-kbb5ik90ow4tg45l1kfd state=RUNNING stateInfo=
DEBU[0028] PodWatcher: Wait finished. Duration=6.69s     container=drone-cz86n60u6swo5fa0wutw error="<nil>" pod=drone-kbb5ik90ow4tg45l1kfd state=RUNNING
DEBU[0048] PodWatcher: Placeholder terminated abnormally  container=drone-cz86n60u6swo5fa0wutw image="drone/placeholder:1" pod=drone-kbb5ik90ow4tg45l1kfd state=TERMINATED stateInfo=Error

As you can see, at [0028] redis is running, but then at [0048] the same container has reverted back to the placeholder image and is TERMINATED.

A snippet from “kubectl describe pod” output, after the runner got stuck:

  drone-cz86n60u6swo5fa0wutw:
    Container ID:   docker://5e2dbeee3edb0dfedf444522b22b8f84dd609d7fb6ad2ee71395894728c7e0ca
    Image:          docker.io/library/redis:latest
    Image ID:       docker-pullable://drone/placeholder@sha256:1b791b37e0589c7f1470fe0210ec0e7ef1e1de479c8f742d25942d2b4617bc07
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Fri, 25 Jun 2021 15:45:49 +0200
      Finished:     Fri, 25 Jun 2021 15:46:06 +0200
    Ready:          False
    Restart Count:  0

How is this possible? How is it possible that Kubernetes changed the container image back to the placeholder? If redis somehow failed, shouldn’t the image remain “redis:latest”?

:wave: @sethpollack wondering if you can read the above thread and let us know if you have any thoughts on the described behavior? Did you come across this when creating the initial version of this runner?

I’ve had this before and I don’t think that it reverted back, I think it failed to update the image to redis.

I think it had to do with the docker container going away and k8s thinking it was still there.

Take a look at the cri-o ticket I mentioned before - Different container state between kubernetes and crictl · Issue #4693 · cri-o/cri-o · GitHub
There was some underlying cri-o issue, but after patching it, it seems the problem is with kubelet.
So in this case cri-o did run the correct container (after the Drone operation replaced the placeholder), and it didn’t revert back, but kubelet didn’t switch its tracking from the placeholder container to the proper one. As a result, kubelet kept sending container status requests about the placeholder container (which was already gone at the cri-o level; in reality it had been successfully replaced, which could be confirmed with crictl commands - the proper container started, did something, and eventually ended).
But kubelet problems aside, cri-o was perfectly aware of the new container, and of it being part of the specific pod.

I have no idea how difficult that might be, but maybe it would be possible to get the status from the cri-o/docker/containerd sockets directly, not depending on kubelet?

Thanks for the research and for providing an update!

I think we should document this as a known issue that can occur when using the runner with cri-o and we can link back to the cri-o bug. If the cri-o team is able to confirm the bug I am sure they will issue a patch. We definitely want to rely solely on the kubernetes API and avoid vendor-specific code and workarounds as much as possible.

I do wonder if we could detect whether or not the step has reverted back to the placeholder and, if we detect this, fail the step with a descriptive error message (that we can include in our docs). This would prevent the pipeline from hanging indefinitely, which would be a big step in the right direction.
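A rough sketch of that detection condition, assuming it reduces to “a container whose status simultaneously reports the placeholder image and a terminated state” (this is not the runner’s code, which watches the API from Go; the helper name is made up):

```shell
# Hypothetical helper: given a 'kubectl get pod -o json' dump saved to a
# file, print the names of containers that show the placeholder image in
# a terminated state -- the signature of the stuck condition above.
detect_reverted() {
  python3 - "$1" <<'PY'
import json, sys
pod = json.load(open(sys.argv[1]))
for cs in pod.get("status", {}).get("containerStatuses", []):
    if "placeholder" in cs.get("image", "") and "terminated" in cs.get("state", {}):
        print(cs["name"])
PY
}
```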

In the meantime, let’s continue tracking issue #4693 and see where it goes. Thanks again!

Actually it is a kubelet bug that happens with both docker and cri-o. What I meant to say is that the image-change operation works fine at the cri-o level, and possibly at the docker level as well, but there is a bug in kubelet that sometimes makes it not refresh, for some reason. This is what the GitHub issue shows: cri-o successfully changes the image, but kubelet remains unaware. And there is also someone commenting that they see the same with docker (Different container state between kubernetes and crictl · Issue #4693 · cri-o/cri-o · GitHub).
I was suspecting cri-o, but during the investigation in that ticket it turned out that the image is successfully changed there, and the problem is on the kubelet end alone.

So it did not revert: the actual image started at the cri-o/docker level, but kubelet lost track of this change and, as a result, keeps sending status requests about the placeholder container to the CRI socket, when it has already been successfully replaced and the placeholder is long gone.

I believe all CRI (Container Runtime Interface) implementations are API compatible, so you could have a generic solution asking docker/cri-o/containerd for container status updates.
Since you were able to replicate the bug on docker, look for the replacement container with docker ps when it gets stuck; you should find it there as well, successfully running (regardless of kubelet losing track of it and reverting to the no-longer-existing placeholder container).
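Checking the runtime’s view directly might look something like this (crictl for cri-o/containerd, docker for dockershim; the container name is illustrative, taken from the logs earlier in the thread, and each command is guarded so it is a no-op where the CLI is absent):

```shell
# Compare what the container runtime itself reports with what kubelet
# reports for the same container.
CID=drone-cz86n60u6swo5fa0wutw   # illustrative container name from the logs
if command -v crictl >/dev/null 2>&1; then
  crictl ps -a | grep "$CID" || true   # CRI-level view (cri-o/containerd)
fi
if command -v docker >/dev/null 2>&1; then
  docker ps -a | grep "$CID" || true   # docker daemon's view
fi
```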

As far as the CRI is concerned, kubelet asks it to remove the placeholder container and replace it, and later, out of the blue (when this issue happens), kubelet starts asking about the placeholder container’s status (when it is already gone), but there is no operation reverting it back at the CRI level. The container has been successfully replaced, the replacement successfully started, and it was left running until it finished.

Thanks for clarifying. The near-term solution we plan to implement is to identify when the container image reverts back to the placeholder and, when this happens, fail the pipeline with an error state so that it doesn’t hang indefinitely. Simultaneously, we can report the issue to the Kubernetes team and advocate to get this permanently fixed upstream.

Since it looks like some kind of kubelet performance/edge-case issue related to multiple replace-image operations happening at the same time, do you think it would be possible to add some rate limit to them, before Kubernetes fixes it on their end?
Like replacing one image at a time with some delay in between? Maybe it will do the trick.
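Abstractly, the suggestion amounts to something like this (not runner code; just the shape of the idea, with a hypothetical helper name and a placeholder body):

```shell
# Hypothetical sketch of serializing image swaps: one container at a time
# with a small delay, instead of patching all step images at once.
swap_serially() {
  for c in "$@"; do
    echo "swap image for container $c"   # stands in for the real API call
    sleep 1                              # rate-limit delay between swaps
  done
}
```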

Kubernetes runner v1.0.0-beta.8 is out…

Thank you,
:tada: Upgraded


Either I am being lucky or it is better: I ran 15 builds at once, and only 2 failed

Btw, I am using my own restarter script at the moment; it restarts jobs when it sees errors about container execution in the runner logs, which makes the entire experience much better

#!/bin/bash
# Restart Drone builds whose containers failed to start, based on errors
# found in the Kubernetes runner's recent logs.

NAMESPACE=drone-ci

POD="$(kubectl --namespace "$NAMESPACE" get pods -l "app.kubernetes.io/name=drone-runner-kube" --no-headers | awk '{print $1}')"
LOGS="$(kubectl --namespace "$NAMESPACE" logs "$POD" --since=300s)"

# Container IDs mentioned in "container failed to start" errors.
FAILED_CONTAINERS="$(echo "$LOGS" | grep "container failed to start" | grep -Eo "container=drone-[0-9a-zA-Z]+" | grep -Eo "drone-[0-9a-zA-Z]+")"

restart() {
	BUILDID="$(echo "$LOGS" | grep "$1" | grep -Eo "build.id=[0-9]+" | grep -Eo "[0-9]+")"
	if [ -z "$BUILDID" ]; then
		echo "No build ID for $1, too old"
		return 0
	fi
	BUILDNUMBER="$(echo "$LOGS" | grep -E "build.id=$BUILDID" | grep -Eo "build.number=[0-9]+" | head -n 1 | grep -Eo "[0-9]+")"
	REPO="$(echo "$LOGS" | grep -E "build.id=$BUILDID" | grep -Eo "repo.name=[A-Za-z0-9]+" | head -n 1 | cut -d= -f2)"
	REPONAMESPACE="$(echo "$LOGS" | grep -E "build.id=$BUILDID" | grep -Eo "repo.namespace=[A-Za-z0-9]+" | head -n 1 | cut -d= -f2)"
	echo "Stopping and restarting $REPONAMESPACE/$REPO $BUILDNUMBER"
	drone build stop "$REPONAMESPACE/$REPO" "$BUILDNUMBER"
	drone build restart "$REPONAMESPACE/$REPO" "$BUILDNUMBER"
}

if [ -z "$FAILED_CONTAINERS" ]; then
	echo "$(date): No failed containers"
	exit 0
fi

for c in $FAILED_CONTAINERS; do
	restart "$c"
done

I run it as a cron job every 5 minutes
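For completeness, the schedule is a plain crontab entry along these lines (the script path and log path are assumptions):

```
*/5 * * * * /usr/local/bin/drone-build-restarter.sh >> /var/log/drone-restarter.log 2>&1
```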

Thanks @michalzxc for this! It shouldn’t be just luck. Many situations caused by erratic Kubernetes behavior that previously left builds stuck will now fail with an error. The situation from your latest screenshot, the error “container failed to start”, is one such situation. The runner detected an inconsistent event coming from the pod and failed the entire build. If you examine such a pod you can see all kinds of strange behavior: terminated containers, containers waiting to start indefinitely, and others. I can’t guarantee that a build will never get stuck again, but it should be far less likely to happen. Thanks again!

1 Like

After 5 days I think I can say pretty confidently that it is not getting stuck anymore; it fails instantly

I got one “failed to start in timely manner” error in the GUI

But it is not present in the runner log, so my script wasn’t able to detect it and restart the job

kubectl --namespace $NAMESPACE logs $POD |grep drone-oi1oeqbztbww7sq0fgxo 
time="2021-07-06T08:43:39Z" level=debug msg="PodWatcher started" build.id=5099 build.number=3349 pod=drone-oi1oeqbztbww7sq0fgxo repo.id=106 repo.name=Tournesol repo.namespace=dicefm stage.id=5482 stage.name=default stage.number=1 step=clone step.name=clone thread=39
time="2021-07-06T08:43:39Z" level=debug msg="Engine: Starting step: \"clone\"" build.id=5099 build.number=3349 container=drone-ebn7y8x1svcmxcxjv9uo image="drone/git:latest" placeholder="drone/placeholder:1" pod=drone-oi1oeqbztbww7sq0fgxo repo.id=106 repo.name=Tournesol repo.namespace=dicefm stage.id=5482 stage.name=default stage.number=1 step.name=clone thread=39
time="2021-07-06T08:43:57Z" level=debug msg="PodWatcher: Container state changed" container=drone-ebn7y8x1svcmxcxjv9uo image="docker.io/drone/placeholder:1" pod=drone-oi1oeqbztbww7sq0fgxo state=RUNNING stateInfo=
time="2021-07-06T08:43:58Z" level=debug msg="PodWatcher: Container state changed" container=drone-ebn7y8x1svcmxcxjv9uo image="docker.io/drone/git:latest" pod=drone-oi1oeqbztbww7sq0fgxo state=RUNNING stateInfo=
time="2021-07-06T08:43:58Z" level=debug msg="PodWatcher: Wait finished. Duration=18.31s" container=drone-ebn7y8x1svcmxcxjv9uo error="<nil>" pod=drone-oi1oeqbztbww7sq0fgxo state=RUNNING
time="2021-07-06T08:44:06Z" level=debug msg="PodWatcher: Container state changed" container=drone-ebn7y8x1svcmxcxjv9uo image="docker.io/drone/git:latest" pod=drone-oi1oeqbztbww7sq0fgxo state=TERMINATED stateInfo=Completed
time="2021-07-06T08:44:06Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container=drone-ebn7y8x1svcmxcxjv9uo error="<nil>" pod=drone-oi1oeqbztbww7sq0fgxo state=TERMINATED
time="2021-07-06T08:44:06Z" level=debug msg="Engine: Starting step: \"redis\"" build.id=5099 build.number=3349 container=drone-sta2li7h9zkoym727zzh image="docker.io/library/redis:latest" placeholder="drone/placeholder:1" pod=drone-oi1oeqbztbww7sq0fgxo repo.id=106 repo.name=Tournesol repo.namespace=dicefm stage.id=5482 stage.name=default stage.number=1 step.name=redis thread=39
time="2021-07-06T08:44:06Z" level=debug msg="Engine: Starting step: \"db-connections\"" build.id=5099 build.number=3349 container=drone-h1ssmlbo775fzk3b8nmp image="ecr.ad.dice.fm/base:waitdbs-latest" placeholder="drone/placeholder:1" pod=drone-oi1oeqbztbww7sq0fgxo repo.id=106 repo.name=Tournesol repo.namespace=dicefm stage.id=5482 stage.name=default stage.number=1 step.name=db-connections thread=39
time="2021-07-06T08:44:06Z" level=debug msg="Engine: Starting step: \"postgres\"" build.id=5099 build.number=3349 container=drone-vpejr7i60qt7h9y4417w image="docker.io/bitnami/postgresql:11" placeholder="drone/placeholder:1" pod=drone-oi1oeqbztbww7sq0fgxo repo.id=106 repo.name=Tournesol repo.namespace=dicefm stage.id=5482 stage.name=default stage.number=1 step.name=postgres thread=39
time="2021-07-06T08:44:06Z" level=debug msg="Engine: Starting step: \"kafka\"" build.id=5099 build.number=3349 container=drone-2hcpymiaa41ymdhzf1w0 image="docker.io/spotify/kafka:latest" placeholder="drone/placeholder:1" pod=drone-oi1oeqbztbww7sq0fgxo repo.id=106 repo.name=Tournesol repo.namespace=dicefm stage.id=5482 stage.name=default stage.number=1 step.name=kafka thread=39
time="2021-07-06T08:44:06Z" level=debug msg="PodWatcher: Container state changed" container=drone-2hcpymiaa41ymdhzf1w0 image="docker.io/drone/placeholder:1" pod=drone-oi1oeqbztbww7sq0fgxo state=RUNNING stateInfo=
time="2021-07-06T08:44:06Z" level=debug msg="PodWatcher: Container state changed" container=drone-h1ssmlbo775fzk3b8nmp image="docker.io/drone/placeholder:1" pod=drone-oi1oeqbztbww7sq0fgxo state=RUNNING stateInfo=
time="2021-07-06T08:44:06Z" level=debug msg="PodWatcher: Container state changed" container=drone-sta2li7h9zkoym727zzh image="docker.io/drone/placeholder:1" pod=drone-oi1oeqbztbww7sq0fgxo state=RUNNING stateInfo=
time="2021-07-06T08:44:06Z" level=debug msg="PodWatcher: Container state changed" container=drone-vpejr7i60qt7h9y4417w image="docker.io/drone/placeholder:1" pod=drone-oi1oeqbztbww7sq0fgxo state=RUNNING stateInfo=
time="2021-07-06T08:44:09Z" level=debug msg="PodWatcher: Container state changed" container=drone-2hcpymiaa41ymdhzf1w0 image="docker.io/spotify/kafka:latest" pod=drone-oi1oeqbztbww7sq0fgxo state=RUNNING stateInfo=
time="2021-07-06T08:44:09Z" level=debug msg="PodWatcher: Container state changed" container=drone-h1ssmlbo775fzk3b8nmp image="ecr.ad.dice.fm/base:waitdbs-latest" pod=drone-oi1oeqbztbww7sq0fgxo state=RUNNING stateInfo=
time="2021-07-06T08:44:09Z" level=debug msg="PodWatcher: Container state changed" container=drone-sta2li7h9zkoym727zzh image="docker.io/library/redis:latest" pod=drone-oi1oeqbztbww7sq0fgxo state=RUNNING stateInfo=
time="2021-07-06T08:44:09Z" level=debug msg="PodWatcher: Container state changed" container=drone-vpejr7i60qt7h9y4417w image="docker.io/bitnami/postgresql:11" pod=drone-oi1oeqbztbww7sq0fgxo state=RUNNING stateInfo=
time="2021-07-06T08:44:09Z" level=debug msg="PodWatcher: Wait finished. Duration=2.38s" container=drone-h1ssmlbo775fzk3b8nmp error="<nil>" pod=drone-oi1oeqbztbww7sq0fgxo state=RUNNING
time="2021-07-06T08:44:09Z" level=debug msg="PodWatcher: Wait finished. Duration=2.38s" container=drone-vpejr7i60qt7h9y4417w error="<nil>" pod=drone-oi1oeqbztbww7sq0fgxo state=RUNNING
time="2021-07-06T08:44:09Z" level=debug msg="PodWatcher: Wait finished. Duration=2.38s" container=drone-sta2li7h9zkoym727zzh error="<nil>" pod=drone-oi1oeqbztbww7sq0fgxo state=RUNNING
time="2021-07-06T08:44:09Z" level=debug msg="PodWatcher: Wait finished. Duration=2.38s" container=drone-2hcpymiaa41ymdhzf1w0 error="<nil>" pod=drone-oi1oeqbztbww7sq0fgxo state=RUNNING
time="2021-07-06T08:44:10Z" level=debug msg="PodWatcher: Container state changed" container=drone-2hcpymiaa41ymdhzf1w0 image="docker.io/drone/placeholder:1" pod=drone-oi1oeqbztbww7sq0fgxo state=TERMINATED stateInfo=Error
time="2021-07-06T08:44:10Z" level=debug msg="PodWatcher: Container state changed" container=drone-h1ssmlbo775fzk3b8nmp image="ecr.ad.dice.fm/base:waitdbs-latest" pod=drone-oi1oeqbztbww7sq0fgxo state=TERMINATED stateInfo=Completed
time="2021-07-06T08:44:10Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container=drone-h1ssmlbo775fzk3b8nmp error="<nil>" pod=drone-oi1oeqbztbww7sq0fgxo state=TERMINATED
time="2021-07-06T08:44:10Z" level=debug msg="Engine: Starting step: \"prefect-build-and-test\"" build.id=5099 build.number=3349 container=drone-bko9qgsw1mintwa5ty65 image="ecr.ad.dice.fm/base:imagebuilder-latest" placeholder="drone/placeholder:1" pod=drone-oi1oeqbztbww7sq0fgxo repo.id=106 repo.name=Tournesol repo.namespace=dicefm stage.id=5482 stage.name=default stage.number=1 step.name=prefect-build-and-test thread=39
time="2021-07-06T08:44:10Z" level=debug msg="Engine: Starting step: \"dam-build-and-test\"" build.id=5099 build.number=3349 container=drone-xzl5bxi1cvriexuvgi6y image="ecr.ad.dice.fm/base:imagebuilder-latest" placeholder="drone/placeholder:1" pod=drone-oi1oeqbztbww7sq0fgxo repo.id=106 repo.name=Tournesol repo.namespace=dicefm stage.id=5482 stage.name=default stage.number=1 step.name=dam-build-and-test thread=39
time="2021-07-06T08:44:10Z" level=debug msg="Engine: Starting step: \"tournesol-build-and-test\"" build.id=5099 build.number=3349 container=drone-k8x2ukndhv09lhrsgc9j image="ecr.ad.dice.fm/base:imagebuilder-latest" placeholder="drone/placeholder:1" pod=drone-oi1oeqbztbww7sq0fgxo repo.id=106 repo.name=Tournesol repo.namespace=dicefm stage.id=5482 stage.name=default stage.number=1 step.name=tournesol-build-and-test thread=39
time="2021-07-06T08:44:11Z" level=debug msg="PodWatcher: Container state changed" container=drone-bko9qgsw1mintwa5ty65 image="docker.io/drone/placeholder:1" pod=drone-oi1oeqbztbww7sq0fgxo state=RUNNING stateInfo=
time="2021-07-06T08:44:11Z" level=debug msg="PodWatcher: Container state changed" container=drone-k8x2ukndhv09lhrsgc9j image="docker.io/drone/placeholder:1" pod=drone-oi1oeqbztbww7sq0fgxo state=RUNNING stateInfo=
time="2021-07-06T08:44:11Z" level=debug msg="PodWatcher: Container state changed" container=drone-xzl5bxi1cvriexuvgi6y image="docker.io/drone/placeholder:1" pod=drone-oi1oeqbztbww7sq0fgxo state=RUNNING stateInfo=
time="2021-07-06T08:47:47Z" level=debug msg="PodWatcher: Container state changed" container=drone-2hcpymiaa41ymdhzf1w0 image="docker.io/spotify/kafka:latest" pod=drone-oi1oeqbztbww7sq0fgxo state=TERMINATED stateInfo=Completed
time="2021-07-06T08:47:47Z" level=debug msg="PodWatcher: Container state changed" container=drone-bko9qgsw1mintwa5ty65 image="ecr.ad.dice.fm/base:imagebuilder-latest" pod=drone-oi1oeqbztbww7sq0fgxo state=TERMINATED stateInfo=Error
time="2021-07-06T08:47:47Z" level=debug msg="PodWatcher: Container state changed" container=drone-k8x2ukndhv09lhrsgc9j image="ecr.ad.dice.fm/base:imagebuilder-latest" pod=drone-oi1oeqbztbww7sq0fgxo state=TERMINATED stateInfo=Error
time="2021-07-06T08:47:47Z" level=debug msg="PodWatcher: Container state changed" container=drone-sta2li7h9zkoym727zzh image="docker.io/library/redis:latest" pod=drone-oi1oeqbztbww7sq0fgxo state=TERMINATED stateInfo=Completed
time="2021-07-06T08:47:47Z" level=debug msg="PodWatcher: Container state changed" container=drone-vpejr7i60qt7h9y4417w image="docker.io/bitnami/postgresql:11" pod=drone-oi1oeqbztbww7sq0fgxo state=TERMINATED stateInfo=Completed
time="2021-07-06T08:47:47Z" level=debug msg="PodWatcher: Container state changed" container=drone-xzl5bxi1cvriexuvgi6y image="ecr.ad.dice.fm/base:imagebuilder-latest" pod=drone-oi1oeqbztbww7sq0fgxo state=TERMINATED stateInfo=Completed
time="2021-07-06T08:47:47Z" level=debug msg="PodWatcher: Wait finished. Duration=216.62s" container=drone-bko9qgsw1mintwa5ty65 error="<nil>" pod=drone-oi1oeqbztbww7sq0fgxo state=RUNNING
time="2021-07-06T08:47:47Z" level=debug msg="PodWatcher: Wait finished. Duration=216.62s" container=drone-k8x2ukndhv09lhrsgc9j error="<nil>" pod=drone-oi1oeqbztbww7sq0fgxo state=RUNNING
time="2021-07-06T08:47:47Z" level=debug msg="PodWatcher: Wait finished. Duration=216.62s" container=drone-xzl5bxi1cvriexuvgi6y error="<nil>" pod=drone-oi1oeqbztbww7sq0fgxo state=RUNNING
time="2021-07-06T08:47:56Z" level=debug msg="PodWatcher: Wait finished. Duration=40.04s" container= error="<nil>" pod=drone-oi1oeqbztbww7sq0fgxo state=TERMINATED
time="2021-07-06T08:47:56Z" level=debug msg="PodWatcher terminated" namespace=drone-ci-exec pod=drone-oi1oeqbztbww7sq0fgxo

@michalzxc thanks for reporting this. The issue is fixed in the latest commit on master.

This job got stuck, ignoring the 15-minute limit for the repo

It is the first example since we started using this image


time="2021-07-14T12:36:31Z" level=debug msg="stage details fetched" build.id=5479 build.number=899 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=5862 stage.name=default stage.number=1 thread=24
time="2021-07-14T12:36:32Z" level=debug msg="updated stage to running" build.id=5479 build.number=899 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=5862 stage.name=default stage.number=1 thread=24
time="2021-07-14T12:36:33Z" level=debug msg="PodWatcher started" build.id=5479 build.number=899 pod=drone-z8ewas8s49dq044bxihf repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=5862 stage.name=default stage.number=1 step=clone step.name=clone thread=24
time="2021-07-14T12:36:33Z" level=debug msg="Engine: Starting step: \"clone\"" build.id=5479 build.number=899 container=drone-gq132jpdx888j4stie4e image="drone/git:latest" placeholder="drone/placeholder:1" pod=drone-z8ewas8s49dq044bxihf repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=5862 stage.name=default stage.number=1 step.name=clone thread=24
time="2021-07-14T12:36:36Z" level=debug msg="PodWatcher: Container state changed" container=drone-gq132jpdx888j4stie4e image="docker.io/drone/placeholder:1" pod=drone-z8ewas8s49dq044bxihf state=RUNNING stateInfo=
time="2021-07-14T12:36:37Z" level=debug msg="PodWatcher: Container state changed" container=drone-gq132jpdx888j4stie4e image="docker.io/drone/git:latest" pod=drone-z8ewas8s49dq044bxihf state=RUNNING stateInfo=
time="2021-07-14T12:36:37Z" level=debug msg="PodWatcher: Wait finished. Duration=4.24s" container=drone-gq132jpdx888j4stie4e error="<nil>" pod=drone-z8ewas8s49dq044bxihf state=RUNNING
time="2021-07-14T12:36:39Z" level=debug msg="PodWatcher: Container state changed" container=drone-gq132jpdx888j4stie4e image="docker.io/drone/git:latest" pod=drone-z8ewas8s49dq044bxihf state=TERMINATED stateInfo=Completed
time="2021-07-14T12:36:40Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container=drone-gq132jpdx888j4stie4e error="<nil>" pod=drone-z8ewas8s49dq044bxihf state=TERMINATED
time="2021-07-14T12:36:40Z" level=debug msg="received exit code 0" build.id=5479 build.number=899 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=5862 stage.name=default stage.number=1 step.name=clone thread=24
time="2021-07-14T12:36:40Z" level=debug msg="Engine: Starting step: \"build-image\"" build.id=5479 build.number=899 container=drone-ogb18fj43ld51z45tdcg image="ecr.ad.dice.fm/base:imagebuilder-latest" placeholder="drone/placeholder:1" pod=drone-z8ewas8s49dq044bxihf repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=5862 stage.name=default stage.number=1 step.name=build-image thread=24
time="2021-07-14T12:36:40Z" level=debug msg="Engine: Starting step: \"geoip\"" build.id=5479 build.number=899 container=drone-mg0nmfuxk38aqm8cvmwq image="docker.io/fiorix/freegeoip:latest" placeholder="drone/placeholder:1" pod=drone-z8ewas8s49dq044bxihf repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=5862 stage.name=default stage.number=1 step.name=geoip thread=24
time="2021-07-14T12:36:40Z" level=debug msg="Engine: Starting step: \"db-connections\"" build.id=5479 build.number=899 container=drone-kxvf0lg7syhdo1sh0kyj image="ecr.ad.dice.fm/base:waitdbs-latest" placeholder="drone/placeholder:1" pod=drone-z8ewas8s49dq044bxihf repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=5862 stage.name=default stage.number=1 step.name=db-connections thread=24
time="2021-07-14T12:36:40Z" level=debug msg="Engine: Starting step: \"redis\"" build.id=5479 build.number=899 container=drone-x1505xskbmhrf5z7dqk2 image="docker.io/library/redis:latest" placeholder="drone/placeholder:1" pod=drone-z8ewas8s49dq044bxihf repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=5862 stage.name=default stage.number=1 step.name=redis thread=24
time="2021-07-14T12:36:40Z" level=debug msg="Engine: Starting step: \"mongodb\"" build.id=5479 build.number=899 container=drone-fjmhwzbotalnw7ii0nzh image="docker.io/library/mongo:3" placeholder="drone/placeholder:1" pod=drone-z8ewas8s49dq044bxihf repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=5862 stage.name=default stage.number=1 step.name=mongodb thread=24
time="2021-07-14T12:36:40Z" level=debug msg="Engine: Starting step: \"postgres\"" build.id=5479 build.number=899 container=drone-44w95yhf19l9z39vos5d image="docker.io/bitnami/postgresql:11" placeholder="drone/placeholder:1" pod=drone-z8ewas8s49dq044bxihf repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=5862 stage.name=default stage.number=1 step.name=postgres thread=24
time="2021-07-14T12:36:40Z" level=debug msg="Engine: Starting step: \"restore-cache\"" build.id=5479 build.number=899 container=drone-4lfamzjw0dpsqt0uamhu image="docker.io/drillster/drone-volume-cache:latest" placeholder="drone/placeholder:1" pod=drone-z8ewas8s49dq044bxihf repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=5862 stage.name=default stage.number=1 step.name=restore-cache thread=24
time="2021-07-14T12:36:40Z" level=debug msg="Engine: Starting step: \"stripe-mock\"" build.id=5479 build.number=899 container=drone-jw5gm90x156yu6a1vl97 image="docker.io/stripemock/stripe-mock:v0.84.0" placeholder="drone/placeholder:1" pod=drone-z8ewas8s49dq044bxihf repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=5862 stage.name=default stage.number=1 step.name=stripe-mock thread=24
time="2021-07-14T12:36:41Z" level=debug msg="PodWatcher: Container state changed" container=drone-44w95yhf19l9z39vos5d image="docker.io/drone/placeholder:1" pod=drone-z8ewas8s49dq044bxihf state=RUNNING stateInfo=
time="2021-07-14T12:36:41Z" level=debug msg="PodWatcher: Container state changed" container=drone-4lfamzjw0dpsqt0uamhu image="docker.io/drone/placeholder:1" pod=drone-z8ewas8s49dq044bxihf state=RUNNING stateInfo=
time="2021-07-14T12:36:41Z" level=debug msg="PodWatcher: Container state changed" container=drone-fjmhwzbotalnw7ii0nzh image="docker.io/drone/placeholder:1" pod=drone-z8ewas8s49dq044bxihf state=RUNNING stateInfo=
time="2021-07-14T12:36:41Z" level=debug msg="PodWatcher: Container state changed" container=drone-jw5gm90x156yu6a1vl97 image="docker.io/drone/placeholder:1" pod=drone-z8ewas8s49dq044bxihf state=RUNNING stateInfo=
time="2021-07-14T12:36:41Z" level=debug msg="PodWatcher: Container state changed" container=drone-kxvf0lg7syhdo1sh0kyj image="docker.io/drone/placeholder:1" pod=drone-z8ewas8s49dq044bxihf state=RUNNING stateInfo=
time="2021-07-14T12:36:41Z" level=debug msg="PodWatcher: Container state changed" container=drone-mg0nmfuxk38aqm8cvmwq image="docker.io/drone/placeholder:1" pod=drone-z8ewas8s49dq044bxihf state=RUNNING stateInfo=
time="2021-07-14T12:36:41Z" level=debug msg="PodWatcher: Container state changed" container=drone-ogb18fj43ld51z45tdcg image="docker.io/drone/placeholder:1" pod=drone-z8ewas8s49dq044bxihf state=RUNNING stateInfo=
time="2021-07-14T12:36:41Z" level=debug msg="PodWatcher: Container state changed" container=drone-x1505xskbmhrf5z7dqk2 image="docker.io/drone/placeholder:1" pod=drone-z8ewas8s49dq044bxihf state=RUNNING stateInfo=
time="2021-07-14T12:36:44Z" level=debug msg="PodWatcher: Container state changed" container=drone-44w95yhf19l9z39vos5d image="docker.io/bitnami/postgresql:11" pod=drone-z8ewas8s49dq044bxihf state=RUNNING stateInfo=
time="2021-07-14T12:36:44Z" level=debug msg="PodWatcher: Container state changed" container=drone-4lfamzjw0dpsqt0uamhu image="docker.io/drillster/drone-volume-cache:latest" pod=drone-z8ewas8s49dq044bxihf state=RUNNING stateInfo=
time="2021-07-14T12:36:44Z" level=debug msg="PodWatcher: Container state changed" container=drone-fjmhwzbotalnw7ii0nzh image="docker.io/library/mongo:3" pod=drone-z8ewas8s49dq044bxihf state=RUNNING stateInfo=
time="2021-07-14T12:36:44Z" level=debug msg="PodWatcher: Container state changed" container=drone-jw5gm90x156yu6a1vl97 image="docker.io/stripemock/stripe-mock:v0.84.0" pod=drone-z8ewas8s49dq044bxihf state=RUNNING stateInfo=
time="2021-07-14T12:36:44Z" level=debug msg="PodWatcher: Wait finished. Duration=2.75s" container=drone-44w95yhf19l9z39vos5d error="<nil>" pod=drone-z8ewas8s49dq044bxihf state=RUNNING
time="2021-07-14T12:36:44Z" level=debug msg="PodWatcher: Wait finished. Duration=2.75s" container=drone-jw5gm90x156yu6a1vl97 error="<nil>" pod=drone-z8ewas8s49dq044bxihf state=RUNNING
time="2021-07-14T12:36:44Z" level=debug msg="PodWatcher: Wait finished. Duration=2.75s" container=drone-fjmhwzbotalnw7ii0nzh error="<nil>" pod=drone-z8ewas8s49dq044bxihf state=RUNNING
time="2021-07-14T12:36:44Z" level=debug msg="PodWatcher: Wait finished. Duration=2.75s" container=drone-4lfamzjw0dpsqt0uamhu error="<nil>" pod=drone-z8ewas8s49dq044bxihf state=RUNNING
time="2021-07-14T12:36:44Z" level=debug msg="PodWatcher: Container state changed" container=drone-kxvf0lg7syhdo1sh0kyj image="ecr.ad.dice.fm/base:waitdbs-latest" pod=drone-z8ewas8s49dq044bxihf state=RUNNING stateInfo=
time="2021-07-14T12:36:44Z" level=debug msg="PodWatcher: Container state changed" container=drone-mg0nmfuxk38aqm8cvmwq image="docker.io/fiorix/freegeoip:latest" pod=drone-z8ewas8s49dq044bxihf state=RUNNING stateInfo=
time="2021-07-14T12:36:44Z" level=debug msg="PodWatcher: Wait finished. Duration=2.75s" container=drone-kxvf0lg7syhdo1sh0kyj error="<nil>" pod=drone-z8ewas8s49dq044bxihf state=RUNNING
time="2021-07-14T12:36:44Z" level=debug msg="PodWatcher: Container state changed" container=drone-ogb18fj43ld51z45tdcg image="ecr.ad.dice.fm/base:imagebuilder-latest" pod=drone-z8ewas8s49dq044bxihf state=RUNNING stateInfo=
time="2021-07-14T12:36:44Z" level=debug msg="PodWatcher: Wait finished. Duration=2.75s" container=drone-mg0nmfuxk38aqm8cvmwq error="<nil>" pod=drone-z8ewas8s49dq044bxihf state=RUNNING
time="2021-07-14T12:36:44Z" level=debug msg="PodWatcher: Wait finished. Duration=2.75s" container=drone-ogb18fj43ld51z45tdcg error="<nil>" pod=drone-z8ewas8s49dq044bxihf state=RUNNING
time="2021-07-14T12:36:44Z" level=debug msg="PodWatcher: Container state changed" container=drone-x1505xskbmhrf5z7dqk2 image="docker.io/library/redis:latest" pod=drone-z8ewas8s49dq044bxihf state=RUNNING stateInfo=
time="2021-07-14T12:36:44Z" level=debug msg="PodWatcher: Wait finished. Duration=2.75s" container=drone-x1505xskbmhrf5z7dqk2 error="<nil>" pod=drone-z8ewas8s49dq044bxihf state=RUNNING
time="2021-07-14T12:36:45Z" level=debug msg="PodWatcher: Container state changed" container=drone-4lfamzjw0dpsqt0uamhu image="docker.io/drillster/drone-volume-cache:latest" pod=drone-z8ewas8s49dq044bxihf state=TERMINATED stateInfo=Completed
time="2021-07-14T12:36:45Z" level=debug msg="PodWatcher: Container state changed" container=drone-kxvf0lg7syhdo1sh0kyj image="ecr.ad.dice.fm/base:waitdbs-latest" pod=drone-z8ewas8s49dq044bxihf state=TERMINATED stateInfo=Completed
time="2021-07-14T12:36:45Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container=drone-4lfamzjw0dpsqt0uamhu error="<nil>" pod=drone-z8ewas8s49dq044bxihf state=TERMINATED
time="2021-07-14T12:36:45Z" level=debug msg="received exit code 0" build.id=5479 build.number=899 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=5862 stage.name=default stage.number=1 step.name=restore-cache thread=24
time="2021-07-14T12:36:45Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container=drone-kxvf0lg7syhdo1sh0kyj error="<nil>" pod=drone-z8ewas8s49dq044bxihf state=TERMINATED
time="2021-07-14T12:36:45Z" level=debug msg="received exit code 0" build.id=5479 build.number=899 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=5862 stage.name=default stage.number=1 step.name=db-connections thread=24
time="2021-07-14T12:36:45Z" level=debug msg="Engine: Starting step: \"install-deps\"" build.id=5479 build.number=899 container=drone-w5rgyy0cj9hinw0vu8wb image="docker.io/dicefm/elixir-image:elixir-1.8.2" placeholder="drone/placeholder:1" pod=drone-z8ewas8s49dq044bxihf repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=5862 stage.name=default stage.number=1 step.name=install-deps thread=24
time="2021-07-14T12:36:46Z" level=debug msg="PodWatcher: Container state changed" container=drone-w5rgyy0cj9hinw0vu8wb image="docker.io/drone/placeholder:1" pod=drone-z8ewas8s49dq044bxihf state=RUNNING stateInfo=
time="2021-07-14T12:36:48Z" level=debug msg="PodWatcher: Container state changed" container=drone-w5rgyy0cj9hinw0vu8wb image="docker.io/dicefm/elixir-image:elixir-1.8.2" pod=drone-z8ewas8s49dq044bxihf state=RUNNING stateInfo=
time="2021-07-14T12:36:48Z" level=debug msg="PodWatcher: Wait finished. Duration=1.88s" container=drone-w5rgyy0cj9hinw0vu8wb error="<nil>" pod=drone-z8ewas8s49dq044bxihf state=RUNNING
time="2021-07-14T12:38:14Z" level=debug msg="PodWatcher: Container state changed" container=drone-w5rgyy0cj9hinw0vu8wb image="docker.io/dicefm/elixir-image:elixir-1.8.2" pod=drone-z8ewas8s49dq044bxihf state=TERMINATED stateInfo=Completed
time="2021-07-14T12:38:14Z" level=debug msg="PodWatcher: Wait finished. Duration=0.04s" container=drone-w5rgyy0cj9hinw0vu8wb error="<nil>" pod=drone-z8ewas8s49dq044bxihf state=TERMINATED
time="2021-07-14T12:38:14Z" level=debug msg="received exit code 0" build.id=5479 build.number=899 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=5862 stage.name=default stage.number=1 step.name=install-deps thread=24
time="2021-07-14T12:38:14Z" level=debug msg="Engine: Starting step: \"other-tests-and-credo\"" build.id=5479 build.number=899 container=drone-bnvvwhazicdyvqcfgl9o image="docker.io/dicefm/elixir-image:elixir-1.8.2" placeholder="drone/placeholder:1" pod=drone-z8ewas8s49dq044bxihf repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=5862 stage.name=default stage.number=1 step.name=other-tests-and-credo thread=24
time="2021-07-14T12:38:14Z" level=debug msg="Engine: Starting step: \"kim-tests\"" build.id=5479 build.number=899 container=drone-9kdjq5xrcgxsjn0t2r7p image="docker.io/dicefm/elixir-image:elixir-1.8.2" placeholder="drone/placeholder:1" pod=drone-z8ewas8s49dq044bxihf repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=5862 stage.name=default stage.number=1 step.name=kim-tests thread=24
time="2021-07-14T12:38:14Z" level=debug msg="Engine: Starting step: \"ticketing-tests\"" build.id=5479 build.number=899 container=drone-mvh5aiqvoeodrv8z65p9 image="docker.io/dicefm/elixir-image:elixir-1.8.2" placeholder="drone/placeholder:1" pod=drone-z8ewas8s49dq044bxihf repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=5862 stage.name=default stage.number=1 step.name=ticketing-tests thread=24
time="2021-07-14T12:38:14Z" level=debug msg="Engine: Starting step: \"rebuild-cache\"" build.id=5479 build.number=899 container=drone-m2qih86byhtry44ac2ad image="docker.io/drillster/drone-volume-cache:latest" placeholder="drone/placeholder:1" pod=drone-z8ewas8s49dq044bxihf repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=5862 stage.name=default stage.number=1 step.name=rebuild-cache thread=24
time="2021-07-14T12:38:14Z" level=debug msg="PodWatcher: Container state changed" container=drone-9kdjq5xrcgxsjn0t2r7p image="docker.io/drone/placeholder:1" pod=drone-z8ewas8s49dq044bxihf state=RUNNING stateInfo=
time="2021-07-14T12:38:14Z" level=debug msg="PodWatcher: Container state changed" container=drone-bnvvwhazicdyvqcfgl9o image="docker.io/drone/placeholder:1" pod=drone-z8ewas8s49dq044bxihf state=RUNNING stateInfo=
time="2021-07-14T12:38:14Z" level=debug msg="PodWatcher: Container state changed" container=drone-m2qih86byhtry44ac2ad image="docker.io/drone/placeholder:1" pod=drone-z8ewas8s49dq044bxihf state=RUNNING stateInfo=
time="2021-07-14T12:38:14Z" level=debug msg="PodWatcher: Container state changed" container=drone-mvh5aiqvoeodrv8z65p9 image="docker.io/drone/placeholder:1" pod=drone-z8ewas8s49dq044bxihf state=RUNNING stateInfo=
time="2021-07-14T12:38:17Z" level=debug msg="PodWatcher: Container state changed" container=drone-44w95yhf19l9z39vos5d image="docker.io/drone/placeholder:1" pod=drone-z8ewas8s49dq044bxihf state=TERMINATED stateInfo=Error
time="2021-07-14T12:38:17Z" level=debug msg="PodWatcher: Container state changed" container=drone-9kdjq5xrcgxsjn0t2r7p image="docker.io/dicefm/elixir-image:elixir-1.8.2" pod=drone-z8ewas8s49dq044bxihf state=RUNNING stateInfo=
time="2021-07-14T12:38:17Z" level=debug msg="PodWatcher: Container state changed" container=drone-bnvvwhazicdyvqcfgl9o image="docker.io/dicefm/elixir-image:elixir-1.8.2" pod=drone-z8ewas8s49dq044bxihf state=RUNNING stateInfo=
time="2021-07-14T12:38:17Z" level=debug msg="PodWatcher: Container state changed" container=drone-m2qih86byhtry44ac2ad image="docker.io/drillster/drone-volume-cache:latest" pod=drone-z8ewas8s49dq044bxihf state=TERMINATED stateInfo=Completed
time="2021-07-14T12:38:17Z" level=debug msg="PodWatcher: Container state changed" container=drone-mvh5aiqvoeodrv8z65p9 image="docker.io/dicefm/elixir-image:elixir-1.8.2" pod=drone-z8ewas8s49dq044bxihf state=RUNNING stateInfo=
time="2021-07-14T12:38:17Z" level=debug msg="PodWatcher: Wait finished. Duration=2.56s" container=drone-bnvvwhazicdyvqcfgl9o error="<nil>" pod=drone-z8ewas8s49dq044bxihf state=RUNNING
time="2021-07-14T12:38:17Z" level=debug msg="PodWatcher: Wait finished. Duration=2.56s" container=drone-mvh5aiqvoeodrv8z65p9 error="<nil>" pod=drone-z8ewas8s49dq044bxihf state=RUNNING
time="2021-07-14T12:38:17Z" level=debug msg="PodWatcher: Wait finished. Duration=2.56s" container=drone-m2qih86byhtry44ac2ad error="<nil>" pod=drone-z8ewas8s49dq044bxihf state=RUNNING
time="2021-07-14T12:38:17Z" level=debug msg="PodWatcher: Wait finished. Duration=2.56s" container=drone-9kdjq5xrcgxsjn0t2r7p error="<nil>" pod=drone-z8ewas8s49dq044bxihf state=RUNNING
time="2021-07-14T12:38:17Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container=drone-m2qih86byhtry44ac2ad error="<nil>" pod=drone-z8ewas8s49dq044bxihf state=TERMINATED
time="2021-07-14T12:38:17Z" level=debug msg="received exit code 0" build.id=5479 build.number=899 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=5862 stage.name=default stage.number=1 step.name=rebuild-cache thread=24

The logs in the screenshot show that all the affected steps completed successfully. I manually cancelled the job a moment after taking the screenshot.
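To cross-check which containers might be the stuck ones, one could scan the runner's debug output for containers whose last observed PodWatcher state is still RUNNING (i.e. no matching TERMINATED line ever arrived). This is just a diagnostic sketch assuming the log format shown above; the regex and helper names are mine, not part of the runner:

```python
import re

# Matches the "container=... state=RUNNING|TERMINATED" fields in the
# PodWatcher debug lines shown above. "stateInfo=" does not match because
# the pattern requires "state=" to be followed directly by the state word.
STATE_RE = re.compile(
    r'container=(?P<container>\S+).*?state=(?P<state>RUNNING|TERMINATED)'
)

def last_states(log_lines):
    """Return the last observed state for each container id."""
    states = {}
    for line in log_lines:
        m = STATE_RE.search(line)
        if m:
            states[m.group('container')] = m.group('state')
    return states

def stuck_containers(log_lines):
    """Containers whose final observed state is still RUNNING --
    candidates for the steps the UI shows as stuck."""
    return sorted(c for c, s in last_states(log_lines).items()
                  if s == 'RUNNING')
```

Running this over the log excerpt above would flag the step containers (e.g. the test steps started at 12:38:14) that got a RUNNING transition but never a TERMINATED one before the job was cancelled.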