Kubernetes runner stuck in running state / bringing the Kubernetes runner to GA

marko-gacesa recently joined the team and will be focusing on bringing the Kubernetes runner to GA. He authored this pull request, which redesigns how we watch for updates and adds periodic polling as a fallback in case the runner fails to receive an event from Kubernetes (for whatever reason), with the goal of making the runner much more resilient. We ask everyone reading this thread to begin testing with the latest runner image. Please remember that the latest runner image is an unstable build; it should not be tested in production.
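
To illustrate the idea (a rough sketch of the pattern only, not the runner's actual implementation; the namespace, pod name, and 30-second interval below are made up): react to events as they stream in, but also re-check the state on a timer, so a single lost event cannot strand a step in a running state forever.

#!/bin/bash
# Sketch of the watch-plus-poll pattern (illustration only, assumed names).
NS=drone POD=drone-example

# Normal path: stream pod updates as Kubernetes reports them.
kubectl -n "$NS" get pod "$POD" --watch-only &
WATCH_PID=$!

# Fallback path: every 30 seconds (interval assumed), fetch the state
# directly, so a missed watch event cannot leave us waiting forever.
until kubectl -n "$NS" get pod "$POD" -o jsonpath='{.status.phase}' | grep -q 'Succeeded\|Failed'; do
    sleep 30
done
kill "$WATCH_PID"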

This particular issue has been very difficult for us to triage because it is inconsistent: some people use the runner in production without issues, while others periodically see pipelines or steps stuck in a running state. Going forward, when someone reports an issue we are going to ask for cluster details so that we can try to mirror your cluster internally and reproduce. We may also ask to schedule a Zoom call to triage live. We would love to have the runner reach GA by early July, but that will depend on community testing and feedback, so we look forward to your help.

If you continue to experience issues with pipelines or steps getting stuck with the Kubernetes runner, please let us know and please provide the following:

  1. Ensure you are using the latest Kubernetes runner image (please avoid reporting issues with previous versions)
  2. Enable debug and trace logging (see the example commands after this list)
  3. Publish your runner logs to a gist and provide a link
  4. Publish your yaml to a gist and provide a link (bonus points for providing a simplified yaml that we can use to easily reproduce)
  5. Publish your cluster details to a gist and provide a link; provide whatever information it takes for us to re-create your cluster in our environment
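
For items 2, 3, and 5, commands along these lines work (a sketch only; the drone namespace and drone-runner-kube deployment name are assumptions, so adjust them to your install; DRONE_DEBUG and DRONE_TRACE are the runner's logging toggles):

# Item 2: enable debug and trace logging on the runner.
kubectl -n drone set env deployment/drone-runner-kube DRONE_DEBUG=true DRONE_TRACE=true

# Item 3: capture the runner logs for a gist.
kubectl -n drone logs deploy/drone-runner-kube --since=2h > runner.log

# Item 5: basic cluster details for a gist.
kubectl version --short > cluster.txt
kubectl get nodes -o wide >> cluster.txt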

If you experience issues with the Kubernetes runner unrelated to pipelines or steps stuck in pending or running status, please create a separate thread and we will triage and address it separately. Thanks!


EDIT 2021-06-08

We uncovered an issue where an unknown container error is returned, causing steps to fail sporadically. We are investigating. See Latest drone-runner-kube fails successful steps and does not display output


EDIT 2021-06-15

We uncovered an issue where an invalid or non-existent image would cause the pipeline to hang due to an uncaught error. This has been patched and a new version of the runner image is available. https://github.com/drone-runners/drone-runner-kube/pull/58


I’ve built the runner with the latest changes and it is running now on our cluster… Will monitor it today and will come back with some results.

Thanks!

Any chance for a 1.0.0-beta.7 version to be published to Docker Hub?

@alonsodomin yes, we plan on tagging 1.0.0-beta.7 at the end of this week if we do not receive any new issue reports. In the meantime, we continue to ask everyone to test the latest Kubernetes runner image (we just pushed another update) and to keep reporting any issues they encounter related to stuck pipelines.
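
If the runner is deployed in-cluster, one way to make sure you are really on the freshest latest (namespace and deployment name assumed; requires imagePullPolicy: Always on the runner container) is:

# Force a fresh pull of the latest runner image and restart.
kubectl -n drone rollout restart deployment/drone-runner-kube
kubectl -n drone rollout status deployment/drone-runner-kube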

I thought the new version was already perfect, but I noticed an issue after I re-enabled a repo that has over 20 steps. One step got stuck on running:

  - name: db-connections
    image: ecr.ad.dice.fm/base:waitdbs-latest
    pull: always
    environment:
      <<: *basic_env_vars
    commands:
      - waitpostgres $PG_URL
      - waitmongo $MONGO_HOST
      - waitredis $REDIS_HOST

The Redis check is a simple script, and it clearly finished:

#!/bin/bash
failure=0
while [ true ]; do
    redis-cli -h $1 ping
    if [ $? -eq 0 ];then
        exit 0
    fi
    sleep 1
    failure=$[$failure+1]
    if [ $failure -gt 120 ]; then
        exit 1
    fi
    echo "Waiting for redis..."
done

It said “PONG”, which means it connected and finished.

Normally it takes 8s, but in this case it stayed in Running until the job timeout.

Jobs with fewer steps have been running without a single failure since I started using the new latest runner image.

It looked exactly like this, with the difference that it never finished. But everything looked the same in the step logs.

@michalzxc thanks for the report. Is it possible to provide a yaml that can be used to reproduce, as well as your Kubernetes runner logs with trace logging enabled? @marko-gacesa any other information that would be helpful?

I only enabled trace afterwards, so I am trying to reproduce it now myself (also, I wasn’t sure if my latest runner image was really the latest, so I restarted the runner pod with pull always).
As said, the new version had been working perfectly until now, and this was the first time in a week that a step got stuck, so maybe it was just an accident that it happened to this particular job on my first try running it.
I will get back to you when I have the logs collected, thank you!

I tried to remove all private images while keeping all the Drone-specific parts (persistent volumes, secrets used, service-type containers), and this is what I got:

name: default
kind: pipeline
type: kubernetes

volumes:
  - name: cache
    host:
      path: /tmp/drone/cache/v1

services:
  - name: postgres
    image: bitnami/postgresql:11
    environment:
      POSTGRESQL_POSTGRES_CONNECTION_LIMIT: 500
      POSTGRESQL_LOG_CONNECTIONS: YES
      POSTGRESQL_PASSWORD: "pass123"
      POSTGRESQL_USERNAME: postgres
  - name: mongodb
    image: mongo:3
    command: [ --smallfiles ]
  - name: redis
    image: redis
  - name: stripe-mock
    image: stripemock/stripe-mock:v0.84.0
  - name: geoip
    image: fiorix/freegeoip

basic_env_vars: &basic_env_vars
  PG_URL: "postgres://postgres:pass123@postgres:5432/test?ssl=false"
  MONGO_HOST: "mongodb"
  REDIS_HOST: redis
  REDIS_URL: "redis://redis:6379"
  CACHE_TEST_REDIS_URL: "redis://redis:6379/2"


steps:
  - name: restore-cache
    image: drillster/drone-volume-cache
    volumes:
      - name: cache
        path: /cache
    settings:
      restore: true
      ttl: 7
      cache_key: [ DRONE_REPO_NAME, DRONE_BRANCH ]
      mount:
        - ./deps
        - ./_build

  - name: db-connections
    image: debian
    pull: always
    environment:
      <<: *basic_env_vars
    commands:
      - apt update && apt install -y postgresql-client redis-tools
      - wget -qO - https://www.mongodb.org/static/pgp/server-4.4.asc | apt-key add -
      - echo "deb http://repo.mongodb.org/apt/debian buster/mongodb-org/4.4 main" | tee /etc/apt/sources.list.d/mongodb-org-4.4.list
      - apt update && apt install -y mongodb-org
      - while [ true ]; do mongo --norc --quiet --host=$MONGO_HOST:27017 <<< "db.getMongo()";     if [ $? -eq 0 ];then         exit 0;     fi;     sleep 1;     failure=$[$failure+1];     if [ $failure -gt 120 ]; then         exit 1;     fi;     echo "Waiting for mongo..."; done
      - while [ true ]; do     DB="$(echo -e ''"${PG_URL}"''|sed 's/?ssl=false//g'|sed 's/?ssl=true//g')";     psql -Atx ''"${DB}"'' -c "select current_date";     if [ $? -eq 0 ];then         exit 0;     fi;     sleep 1;     failure=$[$failure+1];     if [ $failure -gt 120 ]; then         exit 1;     fi;     echo "Waiting for postgres..."; done
      - while [ true ]; do     redis-cli -h $REDIS_HOST ping;     if [ $? -eq 0 ];then         exit 0;     fi;     sleep 1;     failure=$[$failure+1];     if [ $failure -gt 120 ]; then         exit 1;     fi;     echo "Waiting for redis..."; done

  - name: install-deps
    image: elixir:latest
    pull: always
    environment:
      <<: *basic_env_vars
    commands:
      - echo "Do some elixir stuff"
    depends_on:
      - restore-cache
      - db-connections
  - name: rebuild-cache
    image: drillster/drone-volume-cache
    volumes:
      - name: cache
        path: /cache
    settings:
      rebuild: true
      cache_key: [ DRONE_REPO_NAME, DRONE_BRANCH ]
      mount:  
        - ./deps
        - ./_build
    depends_on:
      - install-deps

  - name: k-tests
    image: elixir:latest
    resources:
      limits:
        memory: 2048Mi
      requests:
        memory: 2048Mi
    environment:
      <<: *basic_env_vars
    commands:
      - echo "Some app tests"
    depends_on:
      - install-deps
  - name: t-tests
    image: elixir:latest
    resources:
      limits:
        memory: 2048Mi
      requests:
        memory: 2048Mi
    environment:
      <<: *basic_env_vars
    commands:
      - echo "More tests"
    depends_on:
      - install-deps
  - name: other-tests-and-credo
    image: elixir:latest
    environment:
      <<: *basic_env_vars
    commands:
      - echo "other tests"
    depends_on:
      - install-deps
  - name: build-image
    image: debian
    privileged: true
    environment:
      SOME_SECRET:
        from_secret: SOME_SECRET
    commands:
      - echo "Build image and push it"
   
  - name: deploy-image
    image: debian
    environment:
      OLLY_SECRET:
        from_secret: OLLY_SECRET
    commands:
      - echo "Push chart to chart museum"
    depends_on:
      - other-tests-and-credo
      - t-tests
      - k-tests
      - build-image

@michalzxc thank you very much! It would be really great if you could get trace logs of a stuck runner…

At which step was the runner stuck? What was the step doing? Can you provide some details about your Kubernetes setup?

Does the issue happen only with this repo (the one with a lot of steps)? Are other repos, with fewer steps in their pipelines, not causing any problems?

It was the db-connections one; it kept “running” after redis connected and displayed the “PONG” result.
I haven’t been able to replicate it so far. With the old version of the runner it was happening quite often, but this time I have seen it only once. I am restarting jobs randomly today in the hope of replicating it at some point.
This is bare-metal Kubernetes on OVH dedicated hosts, the INFRA-3 ones (Processor: AMD Epyc 7371 - 16 c / 32 t - 3.1 GHz / 3.8 GHz, with 256 GB of memory and hardware SSD RAID). It is Kubernetes 1.18 using Calico networking. It uses CRI-O to manage containers, with this patch I got when I was asking for help there with a previous version of the drone runner - Different container state between kubernetes and crictl · Issue #4693 · cri-o/cri-o · GitHub
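
In case it helps, this is the kind of comparison from that issue, i.e. what Kubernetes reports for a pod's containers versus what crictl reports on the node (pod name and namespace are placeholders):

# Kubelet's view of the container states.
kubectl -n drone get pod drone-example -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'
# CRI-O's view of the same pod, run on the node hosting it.
crictl ps -a --pod "$(crictl pods --name drone-example -q)"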

I got something else this time

After I noticed that the “db-connections” step failed like this, I cancelled the job and restarted it:

time="2021-06-24T14:33:30Z" level=debug msg="stage details fetched" build.id=4279 build.number=418 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4662 stage.name=default stage.number=1 thread=66
time="2021-06-24T14:33:31Z" level=debug msg="updated stage to running" build.id=4279 build.number=418 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4662 stage.name=default stage.number=1 thread=66
time="2021-06-24T14:33:32Z" level=debug msg="PodWatcher: Started" pod=drone-rfsvxz58fu7bnipy0q8x
time="2021-06-24T14:33:32Z" level=debug msg="Engine: Starting step: \"clone\"" container=drone-b8xnm0v4wi67znzpbi8b image="drone/git:latest" placeholder="drone/placeholder:1" pod=drone-rfsvxz58fu7bnipy0q8x
time="2021-06-24T14:33:33Z" level=debug msg="PodWatcher: Waiting..." container=drone-b8xnm0v4wi67znzpbi8b pod=drone-rfsvxz58fu7bnipy0q8x state=RUNNING
time="2021-06-24T14:33:40Z" level=debug msg="PodWatcher: Wait finished. Duration=7.66s" container=drone-b8xnm0v4wi67znzpbi8b error="<nil>" pod=drone-rfsvxz58fu7bnipy0q8x state=RUNNING
time="2021-06-24T14:33:50Z" level=debug msg="PodWatcher: Waiting..." container=drone-b8xnm0v4wi67znzpbi8b pod=drone-rfsvxz58fu7bnipy0q8x state=TERMINATED
time="2021-06-24T14:33:50Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container=drone-b8xnm0v4wi67znzpbi8b error="<nil>" pod=drone-rfsvxz58fu7bnipy0q8x state=TERMINATED
time="2021-06-24T14:33:50Z" level=debug msg="received exit code 0" build.id=4279 build.number=418 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4662 stage.name=default stage.number=1 step.name=clone thread=66
time="2021-06-24T14:33:50Z" level=debug msg="Engine: Starting step: \"build-image\"" container=drone-abg3r89abnbdzlg9i258 image="ecr.ad.dice.fm/base:imagebuilder-latest" placeholder="drone/placeholder:1" pod=drone-rfsvxz58fu7bnipy0q8x
time="2021-06-24T14:33:50Z" level=debug msg="Engine: Starting step: \"redis\"" container=drone-aw5mytoznqji4kqdkmf4 image="docker.io/library/redis:latest" placeholder="drone/placeholder:1" pod=drone-rfsvxz58fu7bnipy0q8x
time="2021-06-24T14:33:50Z" level=debug msg="Engine: Starting step: \"postgres\"" container=drone-ynqhyrea0ftwqf51270v image="docker.io/bitnami/postgresql:11" placeholder="drone/placeholder:1" pod=drone-rfsvxz58fu7bnipy0q8x
time="2021-06-24T14:33:50Z" level=debug msg="Engine: Starting step: \"db-connections\"" container=drone-sxt7effysqf6n01a15g6 image="ecr.ad.dice.fm/base:waitdbs-latest" placeholder="drone/placeholder:1" pod=drone-rfsvxz58fu7bnipy0q8x
time="2021-06-24T14:33:50Z" level=debug msg="Engine: Starting step: \"restore-cache\"" container=drone-zbg97fzhxezfx2lsip66 image="docker.io/drillster/drone-volume-cache:latest" placeholder="drone/placeholder:1" pod=drone-rfsvxz58fu7bnipy0q8x
time="2021-06-24T14:33:50Z" level=debug msg="Engine: Starting step: \"mongodb\"" container=drone-9slbly9y70ivg9dhok2g image="docker.io/library/mongo:3" placeholder="drone/placeholder:1" pod=drone-rfsvxz58fu7bnipy0q8x
time="2021-06-24T14:33:50Z" level=debug msg="Engine: Starting step: \"stripe-mock\"" container=drone-kjv8whfgek2k5ntr5r91 image="docker.io/stripemock/stripe-mock:v0.84.0" placeholder="drone/placeholder:1" pod=drone-rfsvxz58fu7bnipy0q8x
time="2021-06-24T14:33:50Z" level=debug msg="Engine: Starting step: \"geoip\"" container=drone-t43uzmbdmf0b8s8wdryw image="docker.io/fiorix/freegeoip:latest" placeholder="drone/placeholder:1" pod=drone-rfsvxz58fu7bnipy0q8x
time="2021-06-24T14:33:50Z" level=debug msg="PodWatcher: Waiting..." container=drone-abg3r89abnbdzlg9i258 pod=drone-rfsvxz58fu7bnipy0q8x state=RUNNING
time="2021-06-24T14:33:50Z" level=debug msg="PodWatcher: Waiting..." container=drone-aw5mytoznqji4kqdkmf4 pod=drone-rfsvxz58fu7bnipy0q8x state=RUNNING
time="2021-06-24T14:33:50Z" level=debug msg="PodWatcher: Waiting..." container=drone-ynqhyrea0ftwqf51270v pod=drone-rfsvxz58fu7bnipy0q8x state=RUNNING
time="2021-06-24T14:33:50Z" level=debug msg="PodWatcher: Waiting..." container=drone-sxt7effysqf6n01a15g6 pod=drone-rfsvxz58fu7bnipy0q8x state=RUNNING
time="2021-06-24T14:33:50Z" level=debug msg="PodWatcher: Waiting..." container=drone-zbg97fzhxezfx2lsip66 pod=drone-rfsvxz58fu7bnipy0q8x state=RUNNING
time="2021-06-24T14:33:51Z" level=debug msg="PodWatcher: Waiting..." container=drone-kjv8whfgek2k5ntr5r91 pod=drone-rfsvxz58fu7bnipy0q8x state=RUNNING
time="2021-06-24T14:33:51Z" level=debug msg="PodWatcher: Waiting..." container=drone-t43uzmbdmf0b8s8wdryw pod=drone-rfsvxz58fu7bnipy0q8x state=RUNNING
time="2021-06-24T14:33:51Z" level=debug msg="PodWatcher: Wait finished. Duration=1.35s" container=drone-abg3r89abnbdzlg9i258 error="<nil>" pod=drone-rfsvxz58fu7bnipy0q8x state=RUNNING
time="2021-06-24T14:33:52Z" level=debug msg="PodWatcher: Waiting..." container=drone-9slbly9y70ivg9dhok2g pod=drone-rfsvxz58fu7bnipy0q8x state=RUNNING
time="2021-06-24T14:33:54Z" level=debug msg="PodWatcher: Wait finished. Duration=3.16s" container=drone-zbg97fzhxezfx2lsip66 error="<nil>" pod=drone-rfsvxz58fu7bnipy0q8x state=RUNNING
time="2021-06-24T14:33:54Z" level=debug msg="PodWatcher: Wait finished. Duration=3.32s" container=drone-aw5mytoznqji4kqdkmf4 error="<nil>" pod=drone-rfsvxz58fu7bnipy0q8x state=RUNNING
time="2021-06-24T14:33:54Z" level=debug msg="PodWatcher: Wait finished. Duration=3.27s" container=drone-ynqhyrea0ftwqf51270v error="<nil>" pod=drone-rfsvxz58fu7bnipy0q8x state=RUNNING
time="2021-06-24T14:33:54Z" level=debug msg="PodWatcher: Wait finished. Duration=3.22s" container=drone-sxt7effysqf6n01a15g6 error="container failed to start (invalid image?)" pod=drone-rfsvxz58fu7bnipy0q8x state=RUNNING
time="2021-06-24T14:33:54Z" level=debug msg="PodWatcher: Waiting..." container=drone-zbg97fzhxezfx2lsip66 pod=drone-rfsvxz58fu7bnipy0q8x state=TERMINATED
time="2021-06-24T14:33:54Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container=drone-zbg97fzhxezfx2lsip66 error="<nil>" pod=drone-rfsvxz58fu7bnipy0q8x state=TERMINATED
time="2021-06-24T14:33:54Z" level=debug msg="received exit code 0" build.id=4279 build.number=418 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4662 stage.name=default stage.number=1 step.name=restore-cache thread=66
time="2021-06-24T14:33:56Z" level=debug msg="PodWatcher: Wait finished. Duration=3.63s" container=drone-9slbly9y70ivg9dhok2g error="<nil>" pod=drone-rfsvxz58fu7bnipy0q8x state=RUNNING
time="2021-06-24T14:33:56Z" level=debug msg="PodWatcher: Wait finished. Duration=4.23s" container=drone-t43uzmbdmf0b8s8wdryw error="<nil>" pod=drone-rfsvxz58fu7bnipy0q8x state=RUNNING
time="2021-06-24T14:33:56Z" level=debug msg="PodWatcher: Wait finished. Duration=4.63s" container=drone-kjv8whfgek2k5ntr5r91 error="<nil>" pod=drone-rfsvxz58fu7bnipy0q8x state=RUNNING
time="2021-06-24T14:34:43Z" level=debug msg="PodWatcher: Waiting..." container=drone-vzivcajpr2mmprx7kn9f pod=drone-v5h2umymvxc752qyy7ot state=TERMINATED
time="2021-06-24T14:34:43Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container=drone-vzivcajpr2mmprx7kn9f error="<nil>" pod=drone-v5h2umymvxc752qyy7ot state=TERMINATED
time="2021-06-24T14:34:43Z" level=debug msg="received exit code 0" build.id=4277 build.number=3044 repo.id=106 repo.name=Tournesol repo.namespace=dicefm stage.id=4660 stage.name=default stage.number=1 step.name=tournesol-build-and-test thread=83
time="2021-06-24T14:34:43Z" level=debug msg="Engine: Starting step: \"tournesol-deployment\"" container=drone-pr0qw6hlqjeexlgspugo image="ecr.ad.dice.fm/base:imagebuilder-latest" placeholder="drone/placeholder:1" pod=drone-v5h2umymvxc752qyy7ot
time="2021-06-24T14:34:44Z" level=debug msg="PodWatcher: Waiting..." container=drone-pr0qw6hlqjeexlgspugo pod=drone-v5h2umymvxc752qyy7ot state=RUNNING
time="2021-06-24T14:34:45Z" level=debug msg="PodWatcher: Wait finished. Duration=1.28s" container=drone-pr0qw6hlqjeexlgspugo error="<nil>" pod=drone-v5h2umymvxc752qyy7ot state=RUNNING
time="2021-06-24T14:34:45Z" level=debug msg="PodWatcher: Waiting..." container=drone-pr0qw6hlqjeexlgspugo pod=drone-v5h2umymvxc752qyy7ot state=TERMINATED
time="2021-06-24T14:34:45Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container=drone-pr0qw6hlqjeexlgspugo error="<nil>" pod=drone-v5h2umymvxc752qyy7ot state=TERMINATED
time="2021-06-24T14:34:45Z" level=debug msg="received exit code 0" build.id=4277 build.number=3044 repo.id=106 repo.name=Tournesol repo.namespace=dicefm stage.id=4660 stage.name=default stage.number=1 step.name=tournesol-deployment thread=83
time="2021-06-24T14:34:45Z" level=debug msg="Engine: Starting step: \"prefect-deployment\"" container=drone-o2j0sjimgpuvkxg7go76 image="ecr.ad.dice.fm/base:imagebuilder-latest" placeholder="drone/placeholder:1" pod=drone-v5h2umymvxc752qyy7ot
time="2021-06-24T14:34:45Z" level=debug msg="Engine: Starting step: \"dam-deployment\"" container=drone-00tu1zxwlhzfg5yuz1rz image="ecr.ad.dice.fm/base:imagebuilder-latest" placeholder="drone/placeholder:1" pod=drone-v5h2umymvxc752qyy7ot
time="2021-06-24T14:34:45Z" level=debug msg="PodWatcher: Waiting..." container=drone-o2j0sjimgpuvkxg7go76 pod=drone-v5h2umymvxc752qyy7ot state=RUNNING
time="2021-06-24T14:34:45Z" level=debug msg="PodWatcher: Waiting..." container=drone-00tu1zxwlhzfg5yuz1rz pod=drone-v5h2umymvxc752qyy7ot state=RUNNING
time="2021-06-24T14:34:48Z" level=debug msg="PodWatcher: Wait finished. Duration=2.96s" container=drone-o2j0sjimgpuvkxg7go76 error="<nil>" pod=drone-v5h2umymvxc752qyy7ot state=RUNNING
time="2021-06-24T14:34:48Z" level=debug msg="PodWatcher: Wait finished. Duration=2.92s" container=drone-00tu1zxwlhzfg5yuz1rz error="<nil>" pod=drone-v5h2umymvxc752qyy7ot state=RUNNING
time="2021-06-24T14:34:48Z" level=debug msg="PodWatcher: Waiting..." container=drone-00tu1zxwlhzfg5yuz1rz pod=drone-v5h2umymvxc752qyy7ot state=TERMINATED
time="2021-06-24T14:34:48Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container=drone-00tu1zxwlhzfg5yuz1rz error="<nil>" pod=drone-v5h2umymvxc752qyy7ot state=TERMINATED
time="2021-06-24T14:34:48Z" level=debug msg="received exit code 0" build.id=4277 build.number=3044 repo.id=106 repo.name=Tournesol repo.namespace=dicefm stage.id=4660 stage.name=default stage.number=1 step.name=dam-deployment thread=83
time="2021-06-24T14:35:06Z" level=debug msg="stage received" stage.id=4663 stage.name=default stage.number=1 thread=68
time="2021-06-24T14:35:06Z" level=debug msg="stage received" stage.id=4663 stage.name=default stage.number=1 thread=84
time="2021-06-24T14:35:06Z" level=debug msg="stage accepted" stage.id=4663 stage.name=default stage.number=1 thread=68
time="2021-06-24T14:35:06Z" level=debug msg="stage accepted by another runner" stage.id=4663 stage.name=default stage.number=1 thread=84
time="2021-06-24T14:35:06Z" level=debug msg="poller: request stage from remote server" thread=84
time="2021-06-24T14:35:06Z" level=debug msg="stage details fetched" build.id=4280 build.number=3045 repo.id=106 repo.name=Tournesol repo.namespace=dicefm stage.id=4663 stage.name=default stage.number=1 thread=68
time="2021-06-24T14:35:06Z" level=debug msg="updated stage to running" build.id=4280 build.number=3045 repo.id=106 repo.name=Tournesol repo.namespace=dicefm stage.id=4663 stage.name=default stage.number=1 thread=68
time="2021-06-24T14:35:07Z" level=debug msg="PodWatcher: Started" pod=drone-zjpllesl0kcrfdrv8ao9
time="2021-06-24T14:35:07Z" level=debug msg="Engine: Starting step: \"clone\"" container=drone-mov6r84ugyl04dw5k7jc image="drone/git:latest" placeholder="drone/placeholder:1" pod=drone-zjpllesl0kcrfdrv8ao9
time="2021-06-24T14:35:08Z" level=debug msg="PodWatcher: Waiting..." container=drone-mov6r84ugyl04dw5k7jc pod=drone-zjpllesl0kcrfdrv8ao9 state=RUNNING
time="2021-06-24T14:35:08Z" level=debug msg="received cancellation" build.id=4279 build.number=418 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4662 stage.name=default stage.number=1 thread=66
time="2021-06-24T14:35:10Z" level=debug msg="stage received" stage.id=4664 stage.name=default stage.number=1 thread=70
time="2021-06-24T14:35:10Z" level=debug msg="stage received" stage.id=4664 stage.name=default stage.number=1 thread=85
time="2021-06-24T14:35:10Z" level=debug msg="stage accepted" stage.id=4664 stage.name=default stage.number=1 thread=70
time="2021-06-24T14:35:10Z" level=debug msg="stage accepted by another runner" stage.id=4664 stage.name=default stage.number=1 thread=85
time="2021-06-24T14:35:10Z" level=debug msg="poller: request stage from remote server" thread=85
time="2021-06-24T14:35:10Z" level=debug msg="stage details fetched" build.id=4281 build.number=419 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4664 stage.name=default stage.number=1 thread=70
time="2021-06-24T14:35:10Z" level=debug msg="updated stage to running" build.id=4281 build.number=419 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4664 stage.name=default stage.number=1 thread=70
time="2021-06-24T14:35:11Z" level=debug msg="PodWatcher: Started" pod=drone-alt3npdar9ttei61w56p
time="2021-06-24T14:35:11Z" level=debug msg="Engine: Starting step: \"clone\"" container=drone-zpga7msqmvbo9viioqdy image="drone/git:latest" placeholder="drone/placeholder:1" pod=drone-alt3npdar9ttei61w56p
time="2021-06-24T14:35:11Z" level=debug msg="PodWatcher: Waiting..." container=drone-zpga7msqmvbo9viioqdy pod=drone-alt3npdar9ttei61w56p state=RUNNING
time="2021-06-24T14:35:18Z" level=debug msg="destroying the pipeline environment" build.id=4279 build.number=418 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4662 stage.name=default stage.number=1 thread=66
time="2021-06-24T14:35:23Z" level=debug msg="PodWatcher: Waiting..." container= pod=drone-rfsvxz58fu7bnipy0q8x state=TERMINATED
time="2021-06-24T14:35:23Z" level=debug msg="PodWatcher: Wait finished. Duration=0.00s" container= error="context canceled" pod=drone-rfsvxz58fu7bnipy0q8x state=TERMINATED
time="2021-06-24T14:35:23Z" level=debug msg="successfully destroyed the pipeline environment" build.id=4279 build.number=418 repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4662 stage.name=default stage.number=1 thread=66
time="2021-06-24T14:35:23Z" level=debug msg="stage failed" build.id=4279 build.number=418 duration=108 error="1 error occurred:\n\t* Optimistic Lock Error\n\n" repo.id=239 repo.name=kimlexir repo.namespace=dicefm stage.id=4662 stage.name=default stage.number=1 thread=66

I think there is some pattern in that it only happens to this job. Maybe the use of volumes, or the number of steps :thinking:

There are other jobs with fewer steps that use the “db-connections” container the same way, and not a single issue since I updated to the latest runner last week.

Wow @michalzxc ! Thank you very much. This is super useful.

In the log I noticed an error: ‘error=“container failed to start (invalid image?)”’. It happened twice! Both times for the same “db-connections” step. This error happens when Kubernetes fails to run a container. As the error text explains, this can happen if there is a typo in the image name or the image doesn’t exist, but it can also happen for other reasons. Do you have any idea why this container would fail to start? Memory requirements? Some other limits?
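
If it happens again, the kubelet's view of the pod usually names the underlying reason (ErrImagePull, ImagePullBackOff, CreateContainerError, and so on). Something along these lines, with the pod name taken from the log above and an assumed namespace, should capture it before the pod is destroyed:

# Why did the container fail to start?
kubectl -n drone describe pod drone-rfsvxz58fu7bnipy0q8x
kubectl -n drone get events --field-selector involvedObject.name=drone-rfsvxz58fu7bnipy0q8x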

So after the job restart it was fine (so the image is there, and there was no typo). I noticed it has “pull: always”, so maybe some transient registry issue?
Removing pull: always now.

I can’t imagine it being a resource issue; Drone has 2 very big hosts just for itself:

free -m
              total        used        free      shared  buff/cache   available
Mem:         257890        8151       86191         644      163546      250510
Swap:             0           0           0
w
 16:51:17 up 30 days, 54 min,  1 user,  load average: 3.95, 1.96, 1.03

Out of 32 threads (16 physical cores)

All it runs are 3 scripts:

waitmongo:

#!/bin/bash
failure=0
while [ true ]; do
    mongo --norc --quiet --host=$1:27017 <<< "db.getMongo()"
    if [ $? -eq 0 ];then
        exit 0
    fi
    sleep 1
    failure=$[$failure+1]
    if [ $failure -gt 240 ]; then
        exit 1
    fi
    echo "Waiting for mongo..."
done

waitpostgres:

#!/bin/bash
failure=0
while [ true ]; do
    DB="$(echo -e ''"${1}"''|sed 's/?ssl=false//g'|sed 's/?ssl=true//g')"
    psql -Atx ''"${DB}"'' -c "select current_date"
    if [ $? -eq 0 ];then
        exit 0
    fi
    sleep 1
    failure=$[$failure+1]
    if [ $failure -gt 240 ]; then
        exit 1
    fi
    echo "Waiting for postgres..."
done

waitredis:

#!/bin/bash
failure=0
while [ true ]; do
    redis-cli -h $1 ping
    if [ $? -eq 0 ];then
        exit 0
    fi
    sleep 1
    failure=$[$failure+1]
    if [ $failure -gt 240 ]; then
        exit 1
    fi
    echo "Waiting for redis..."
done

These exist in order to block the pipeline test steps until mongo, redis, and postgres (the Drone “services”) have started.