[solved] Immutable Field Change Error on Pod After Kubernetes v1.19.12 Upgrade

I have been asked to create a new thread from: Kubernetes runner stuck in running state / bringing the Kubernetes runner to GA - #45 by jim5252

We have ran into an issue, this has been noticed after upgrading our kubernetes cluster from v1.18.18 to v1.19.12

This occurred when on runner version: v1.0.0-beta.6
and remained after upgrading to version: v1.0.0-beta.9

Error:

default: Pod "drone-rygfen8qb2yve8jmcxnh" is invalid: spec: Forbidden: pod updates may not change fields other than `spec.containers[*].image`, `spec.initContainers[*].image`, `spec.activeDeadlineSeconds` or `spec.tolerations` (only additions to existing tolerations) core.PodSpec{ ... // 10 identical fields AutomountServiceAccountToken: nil, NodeName: "ip-10-250-6-105.eu-west-2.compute.internal", SecurityContext: &core.PodSecurityContext{ ... // 11 identical fields
default – clone: Error

I can not see anything in the changelog from kubernetes that would suggest this issue with a new immutable field.

This error occurs after every build restart on what appears to be every pipeline in the cluster.

We do use kubernetes features such as tolerations and secrets.

Please find the gist with requested information here: Kubernetes v1.19.12 Upgrade Drone Error · GitHub

Thanks for this. But the log listing that you provided contains a different error from what you are describing above. The error message in the log is:

"kubernetes error: aborting due to error: kubernetes error: container failed to start: id=drone-t5iqrgubssdwya1tg8bv exitcode=2 reason=Error"

This error happens when Kubernetes fails to start a container. You can easily reproduce it by providing a non-existing image.

Hi Marko,

Yes, I can only see the error in the API. I believe the error exists in ephemeral pods which only live for a second when the build errors.

I have re-ran the build and we can see the error again ini the runner:

time="2021-09-01T14:28:13Z" level=debug msg="PodWatcher started" build.id=287 build.number=24 pod=drone-lg48feaoo6ttlromjajr repo.id=1801 repo.name=test-2 repo.namespace=willem.veerman stage.id=287 stage.name=default stage.number=1 step=clone step.name=clone thread=18
time="2021-09-01T14:28:13Z" level=debug msg="Engine: Starting step" build.id=287 build.number=24 container=drone-xn9a4ytptdxfvzs269gz image="drone/git:latest" placeholder="drone/placeholder:1" pod=drone-lg48feaoo6ttlromjajr repo.id=1801 repo.name=test-2 repo.namespace=willem.veerman stage.id=287 stage.name=default stage.number=1 step=clone step.name=clone thread=18
time="2021-09-01T14:28:13Z" level=trace msg="PodWatcher: Event" event=ADDED pod=drone-lg48feaoo6ttlromjajr
time="2021-09-01T14:28:13Z" level=debug msg="Launched containers. Duration=0.03s" count=1 failed=0 success=1
time="2021-09-01T14:28:13Z" level=trace msg="PodWatcher: Event" event=MODIFIED pod=drone-lg48feaoo6ttlromjajr
time="2021-09-01T14:28:13Z" level=debug msg="destroying the pipeline environment" build.id=287 build.number=24 repo.id=1801 repo.name=test-2 repo.namespace=willem.veerman stage.id=287 stage.name=default stage.number=1 thread=18
time="2021-09-01T14:28:15Z" level=trace msg="PodWatcher: Event" event=MODIFIED pod=drone-lg48feaoo6ttlromjajr
time="2021-09-01T14:28:19Z" level=trace msg="PodWatcher: Event" event=MODIFIED pod=drone-lg48feaoo6ttlromjajr
time="2021-09-01T14:28:19Z" level=trace msg="PodWatcher: Waiting..." container= pod=drone-lg48feaoo6ttlromjajr state=TERMINATED
time="2021-09-01T14:28:19Z" level=trace msg="PodWatcher: Event" event=MODIFIED pod=drone-lg48feaoo6ttlromjajr
time="2021-09-01T14:28:20Z" level=trace msg="PodWatcher: Event" event=MODIFIED pod=drone-lg48feaoo6ttlromjajr
time="2021-09-01T14:28:21Z" level=trace msg="PodWatcher: Event" event=MODIFIED pod=drone-lg48feaoo6ttlromjajr
time="2021-09-01T14:28:21Z" level=trace msg="PodWatcher: Event" event=DELETED pod=drone-lg48feaoo6ttlromjajr
time="2021-09-01T14:28:21Z" level=debug msg="PodWatcher: Wait finished. Duration=2.44s" container= error="<nil>" pod=drone-lg48feaoo6ttlromjajr state=TERMINATED
time="2021-09-01T14:28:21Z" level=debug msg="PodWatcher terminated" namespace=drone-bg5ze7d8ucm1b6mbk60n pod=drone-lg48feaoo6ttlromjajr
time="2021-09-01T14:28:21Z" level=debug msg="successfully destroyed the pipeline environment" build.id=287 build.number=24 repo.id=1801 repo.name=test-2 repo.namespace=willem.veerman stage.id=287 stage.name=default stage.number=1 thread=18
time="2021-09-01T14:28:21Z" level=debug msg="updated stage to complete" build.id=287 build.number=24 duration=1 repo.id=1801 repo.name=test-2 repo.namespace=willem.veerman stage.id=287 stage.name=default stage.number=1 thread=18
time="2021-09-01T14:28:21Z" level=debug msg="poller: request stage from remote server" thread=18
time="2021-09-01T14:28:21Z" level=trace msg="http: context canceled"
time="2021-09-01T14:28:21Z" level=debug msg="done listening for cancellations" build.id=287 build.number=24 repo.id=1801 repo.name=test-2 repo.namespace=willem.veerman stage.id=287 stage.name=default stage.number=1 thread=18

I have also tested the same pipeline via a drone runner in a k8s v1.18 cluster which works.

I can also confirm the image exists and I can pull it from the v1.19 node in our cluster:

jimhamill@ip-10-250-5-145 ~ $ sudo docker pull python:3.9.2
3.9.2: Pulling from library/python
004f1eed87df: Pull complete
5d6f1e8117db: Pull complete
48c2faf66abe: Pull complete
234b70d0479d: Pull complete
6fa07a00e2f0: Pull complete
04a31b4508b8: Pull complete
1823d93d9698: Pull complete
2eee1b8e2538: Pull complete
b48672c0140e: Pull complete
Digest: sha256:d5d25f8ddcf983c0164bdcdc87b330d31417e2ce302dbd3e1d0e90fddf3ddff1
Status: Downloaded newer image for python:3.9.2
docker.io/library/python:3.9.2

@jim5252, but there is no error in the log output above.

The runner really does a simple thing. It creates a pod with a bunch of containers, one for each step, all using a placeholder image. Then, one by one (but with services and detached steps can be several at once) it replaces the placeholder with a real image when a step is ready for execution.

Hi Marko,

We just ran another build but this time removed the delete namepace/pod permissions from the service account cluster role so the pod is not cleaned up.

time="2021-09-01T15:14:31Z" level=debug msg="Launched containers. Duration=0.03s" count=1 failed=0 success=1
time="2021-09-01T15:14:31Z" level=debug msg="destroying the pipeline environment" build.id=290 build.number=25 repo.id=1801 repo.name=test-2 repo.namespace=willem.veerman stage.id=290 stage.name=default stage.number=1 thread=95

As can be seen here, the logs do not show the failure even though the UI does.

It moves from Launch containers step straight to the pipeline cleanup in the next step.

The job failure is instant, and these are shown in k8s audit logs:
500 (comes first):

{
  "_index": "kubernetes-audit-acp-test-2021.09.01",
  "_type": "doc",
  "_id": "Nuz3oXsBqwUb303WCRKC",
  "_score": 1,
  "_source": {
    "tags": [
      "kubernetes_audit_filtered"
    ],
    "@version": "1",
    "index_prefix": "kubernetes-audit-acp-test",
    "audit_json": {
      "stage": "ResponseComplete",
      "userAgent": "drone-runner-kube/v0.0.0 (linux/amd64) kubernetes/$Format",
      "stageTimestamp": "2021-09-01T15:25:25.830153Z",
      "kind": "Event",
      "requestReceivedTimestamp": "2021-09-01T15:25:25.820873Z",
      "responseStatus": {
        "status": "Failure",
        "code": 500,
        "reason": "ServerTimeout",
        "metadata": {}
      },
      "auditID": "7159b97d-5bff-4823-80b4-ad579d5bd35d",
      "user": {
        "uid": "399f3b36-7df6-47f6-a35a-eb4f20792ec8",
        "username": "system:serviceaccount:drone-ci:drone-gl-runner",
        "groups": [
          "system:serviceaccounts",
          "system:serviceaccounts:drone-ci",
          "system:authenticated"
        ]
      },
      "apiVersion": "audit.k8s.io/v1",
      "verb": "create",
      "sourceIPs": [
        "10.250.7.193"
      ],
      "annotations": {
        "authorization.k8s.io/decision": "allow",
        "authentication.k8s.io/legacy-token": "system:serviceaccount:drone-ci:drone-gl-runner",
        "authorization.k8s.io/reason": "RBAC: allowed by ClusterRoleBinding \"drone-gl-runner\" of ClusterRole \"drone-gl-runner\" to ServiceAccount \"drone-gl-runner/drone-ci\""
      },
      "level": "Metadata",
      "objectRef": {
        "resource": "pods",
        "apiVersion": "v1",
        "namespace": "drone-bleqfkaay7k3rhbqwz8i",
        "name": "drone-zx079dmtb3jo13rl4vwz"
      },
      "requestURI": "/api/v1/namespaces/drone-bleqfkaay7k3rhbqwz8i/pods"
    },
    "s3Key": "drone-bleqfkaay7k3rhbqwz8i",
    "@timestamp": "2021-09-01T15:25:25.830Z"
  },
  "fields": {
    "audit_json.stageTimestamp": [
      "2021-09-01T15:25:25.830Z"
    ],
    "audit_json.requestReceivedTimestamp": [
      "2021-09-01T15:25:25.820Z"
    ],
    "@timestamp": [
      "2021-09-01T15:25:25.830Z"
    ]
  }
}

422 afterwards:

{
  "_index": "kubernetes-audit-acp-test-2021.09.01",
  "_type": "doc",
  "_id": "uuz3oXsBqwUb303WEBKh",
  "_score": 1,
  "_source": {
    "audit_json": {
      "userAgent": "drone-runner-kube/v0.0.0 (linux/amd64) kubernetes/$Format",
      "responseStatus": {
        "metadata": {},
        "code": 422,
        "reason": "Invalid",
        "status": "Failure"
      },
      "apiVersion": "audit.k8s.io/v1",
      "level": "Metadata",
      "verb": "update",
      "objectRef": {
        "uid": "6defb1b6-b2d2-4d39-8a0e-786958e36a69",
        "apiVersion": "v1",
        "resourceVersion": "513423363",
        "name": "drone-zx079dmtb3jo13rl4vwz",
        "resource": "pods",
        "namespace": "drone-bleqfkaay7k3rhbqwz8i"
      },
      "user": {
        "username": "system:serviceaccount:drone-ci:drone-gl-runner",
        "uid": "399f3b36-7df6-47f6-a35a-eb4f20792ec8",
        "groups": [
          "system:serviceaccounts",
          "system:serviceaccounts:drone-ci",
          "system:authenticated"
        ]
      },
      "stageTimestamp": "2021-09-01T15:25:27.334165Z",
      "requestURI": "/api/v1/namespaces/drone-bleqfkaay7k3rhbqwz8i/pods/drone-zx079dmtb3jo13rl4vwz",
      "annotations": {
        "authorization.k8s.io/reason": "RBAC: allowed by ClusterRoleBinding \"drone-gl-runner\" of ClusterRole \"drone-gl-runner\" to ServiceAccount \"drone-gl-runner/drone-ci\"",
        "authorization.k8s.io/decision": "allow",
        "authentication.k8s.io/legacy-token": "system:serviceaccount:drone-ci:drone-gl-runner"
      },
      "requestReceivedTimestamp": "2021-09-01T15:25:27.312148Z",
      "stage": "ResponseComplete",
      "auditID": "7153b7b2-a74a-43b9-90f3-2f89d713ca3b",
      "kind": "Event",
      "sourceIPs": [
        "10.250.7.193"
      ]
    },
    "index_prefix": "kubernetes-audit-acp-test",
    "@timestamp": "2021-09-01T15:25:27.334Z",
    "s3Key": "drone-bleqfkaay7k3rhbqwz8i",
    "tags": [
      "kubernetes_audit_filtered"
    ],
    "@version": "1"
  },
  "fields": {
    "audit_json.stageTimestamp": [
      "2021-09-01T15:25:27.334Z"
    ],
    "audit_json.requestReceivedTimestamp": [
      "2021-09-01T15:25:27.312Z"
    ],
    "@timestamp": [
      "2021-09-01T15:25:27.334Z"
    ]
  }
}

These two logs from the drone server relate to the job failure:

{"error":"stream: not found","level":"warning","msg":"manager: cannot teardown log stream","step.id":4908,"step.name":"step_2","step.status":"skipped","time":"2021-09-01T14:28:13Z"}
{"build.id":288,"build.number":157,"error":"Cannot transition status via :enqueue from :pending (Reason(s): Status cannot transition via \"enqueue\")","level":"warning","msg":"manager: cannot publish status","repo.id":1802,"stage.id":288,"time":"2021-09-01T14:57:34Z"}

Hi @marko-gacesa , @bradrydzewski

Please let me know if there are any updates with this provided information.

Thanks,

Jim

I really don’t know why you experience this error. No matter what I do I can’t reproduce it.

But you helped me find a bug in the runner code. Thanks! I didn’t handle an error when a pod update fails (to update a container’s image and pod’s metadata annotations). You can try with the latest image. I suspect you should see an error that the launching of containers failed. If not, please report and I’ll keep searching…

Hi Marko,

I built the runner off master with your recent PR and tested again. The same error appears in the UI however there are no new logs or errors in the runner output, as can be seen below:

time="2021-09-02T16:40:27Z" level=debug msg="updated stage to running" build.id=293 build.number=16 repo.id=3707 repo.name=repo-owner-test repo.namespace=acp stage.id=293 stage.name=default stage.number=1 thread=38
time="2021-09-02T16:40:29Z" level=debug msg="destroying the pipeline environment" build.id=293 build.number=16 repo.id=3707 repo.name=repo-owner-test repo.namespace=acp stage.id=293 stage.name=default stage.number=1 thread=38
time="2021-09-02T16:40:34Z" level=debug msg="successfully destroyed the pipeline environment" build.id=293 build.number=16 repo.id=3707 repo.name=repo-owner-test repo.namespace=acp stage.id=293 stage.name=default stage.number=1 thread=38
time="2021-09-02T16:40:34Z" level=debug msg="updated stage to complete" build.id=293 build.number=16 duration=1 repo.id=3707 repo.name=repo-owner-test repo.namespace=acp stage.id=293 stage.name=default stage.number=1 thread=38
time="2021-09-02T16:40:34Z" level=debug msg="poller: request stage from remote server" thread=38
time="2021-09-02T16:40:34Z" level=trace msg="http: context canceled"
time="2021-09-02T16:40:34Z" level=debug msg="done listening for cancellations" build.id=293 build.number=16 repo.id=3707 repo.name=repo-owner-test repo.namespace=acp stage.id=293 stage.name=default stage.number=1 thread=38

Hi,

Any updates from my last message?

Thanks,

Jim

I have updated the test runner to the latest version and built from master. We are seeing the extra logging although it is not very indicative of our problem. It outputs the same error seen in the API:

time="2021-09-08T08:18:20Z" level=trace msg="PodWatcher started" build.id=358 build.number=43 container=drone-lq51vjglxix64e23fjqc image="drone/git:latest" namespace=drone-sdrrr0bt42gzy2m0rzmx placeholder="drone/placeholder:1" pod=drone-ex1w4p7g2j6gbqkn6oda repo.id=1801 repo.name=test-2 repo.namespace=willem.veerman stage.id=358 stage.name=default stage.number=1 step=clone step.name=clone thread=6
time="2021-09-08T08:18:20Z" level=debug msg="Engine: Starting step" build.id=358 build.number=43 container=drone-lq51vjglxix64e23fjqc image="drone/git:latest" namespace=drone-sdrrr0bt42gzy2m0rzmx placeholder="drone/placeholder:1" pod=drone-ex1w4p7g2j6gbqkn6oda repo.id=1801 repo.name=test-2 repo.namespace=willem.veerman stage.id=358 stage.name=default stage.number=1 step=clone step.name=clone thread=6
time="2021-09-08T08:18:20Z" level=trace msg="PodWatcher: Event" event=ADDED pod=drone-ex1w4p7g2j6gbqkn6oda
time="2021-09-08T08:18:20Z" level=debug msg="Launch of 1 containers failed. Duration=0.03s" error="Pod \"drone-ex1w4p7g2j6gbqkn6oda\" is invalid: spec: Forbidden: pod updates may not change fields other than `spec.containers[*].image`, `spec.initContainers[*].image`, `spec.activeDeadlineSeconds` or `spec.tolerations` (only additions to existing tolerations)\n  core.PodSpec{\n  \t... // 10 identical fields\n  \tAutomountServiceAccountToken: nil,\n  \tNodeName:                     \"ip-10-250-7-43.eu-west-2.compute.internal\",\n  \tSecurityContext: &core.PodSecurityContext{\n  \t\t... // 11 identical fields\n  \t\tFSGroupChangePolicy: nil,\n  \t\tSysctls:             nil,\n- \t\tSeccompProfile:      nil,\n+ \t\tSeccompProfile:      &core.SeccompProfile{Type: \"RuntimeDefault\"},\n  \t},\n  \tImagePullSecrets: []core.LocalObjectReference{{Name: \"drone-81clqfo4fvz7f53oqibd\"}},\n  \tHostname:         \"\",\n  \t... // 15 identical fields\n  }\n"

Is there any further input on what this issue could be?

Yes, this is the error that you earlier saw on the UI. Now it is logged to the runner’s log too. However, I still have no clue what could cause the error.

The error, after I cleaned it, looks like this:

Pod "drone-ex1w4p7g2j6gbqkn6oda" is invalid: spec: Forbidden: pod updates may not change fields other than
`spec.containers[*].image`, `spec.initContainers[*].image`, `spec.activeDeadlineSeconds` or `spec.tolerations` (only additions to existing tolerations)

  core.PodSpec{
  	... // 10 identical fields
  	AutomountServiceAccountToken: nil,
  	NodeName:                     "ip-10-250-7-43.eu-west-2.compute.internal",
  	SecurityContext: &core.PodSecurityContext{
  		... // 11 identical fields
  		FSGroupChangePolicy: nil,
  		Sysctls:             nil,
- 		SeccompProfile:      nil,
+ 		SeccompProfile:      &core.SeccompProfile{Type: "RuntimeDefault"},
  	},
  	ImagePullSecrets: []core.LocalObjectReference{{Name: "drone-81clqfo4fvz7f53oqibd"}},
  	Hostname:         "",
  	... // 15 identical fields
  }

So, it appears that it’s complaining about the update to the SeccompProfile field of the SecurityContext in the core.PodSpec. But the runner does not change that field when it’s updating a pod to change a container image… So I really don’t know what’s going on…

Hi, @jim5252. Can you please try again with the latest version of Kubernetes runner? The version is 1.0.0-beta.11. I’ve just released it.

I’ve noticed that we used an older version of the Kubernetes library, that doesn’t support the SeccompProfile field. The latest version of the runner uses the latest version of the library, so it might be able to solve your problem.

Hi @marko-gacesa , the new runner version has worked! Thanks for your help.

1 Like