[drone-runner-kube] Not correctly setting resource requests for Steps

I’ve noticed some unusual behaviour when setting Resource Requests for Pipeline Steps. In our runner we provide the following two env vars:

    DRONE_RESOURCE_REQUEST_CPU: 200m
    DRONE_RESOURCE_REQUEST_MEMORY: 100MiB
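
For context, these are supplied as plain environment variables on the runner’s Kubernetes Deployment. A minimal sketch of the relevant fragment is below (the container name and surrounding Deployment fields are illustrative):

    # fragment of the runner Deployment spec (names illustrative)
    containers:
    - name: drone-runner-kube
      image: drone/drone-runner-kube
      env:
      - name: DRONE_RESOURCE_REQUEST_CPU
        value: "200m"
      - name: DRONE_RESOURCE_REQUEST_MEMORY
        value: "100MiB"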

I assumed (perhaps wrongly) that these would be the default CPU & Memory Requests applied to each individual Step in the absence of Resource Requests defined in the Pipeline. Instead, they appear to be the maximum (total) Requests applied to the entire Pipeline, and they cannot be overridden. So every Step in the Pipeline ends up with:

    Requests:
      cpu:     1m
      memory:  4Mi

But the first Step (usually the git clone) gets whatever remains of that total after the above values are subtracted for the other containers. So if you have 4 more Steps after the git clone, the clone Step reserves:

    Requests:
      cpu:     196m
      memory:  84Mi
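
That works out as the stage total minus the per-Step minimum applied to the other four containers:

    cpu:     200m  - (4 × 1m)  = 196m
    memory:  100Mi - (4 × 4Mi) = 84Mi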

If this is intentional behaviour then the docs should probably be updated to explain that setting these vars on the runner defines the total reserved resources for the entire Pod that is scheduled (overriding any specification for individual Steps in a user’s Pipeline), split across each Container.

I’ll test removing both env variables to confirm that each Step can then define its own Requests. It would still be nice to set a minimum applied to all Steps (at the runner level), so that they are given a buffer of reserved resources, reducing the likelihood of over-allocating Jobs on a single Node (another problem we hit).
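
For reference, by a Step defining its own Requests we mean something like the following in the pipeline manifest (the step name, image and values are illustrative, and the exact syntax may differ from what the runner currently supports):

    kind: pipeline
    type: kubernetes
    name: default

    steps:
    - name: build                  # illustrative step
      image: golang:1.14           # illustrative image
      commands:
      - go build ./...
      resources:
        requests:
          cpu: 250                 # millicores (assumption about the expected unit)
          memory: 250MiB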

It’s related to this logic, which was merged in December but is not yet in an officially tagged release: drone-runner-kube/compiler.go at master · drone-runners/drone-runner-kube · GitHub

I’ve also tested the following two env vars (in isolation) introduced in that commit:

  • DRONE_RESOURCE_MIN_REQUEST_CPU
  • DRONE_RESOURCE_MIN_REQUEST_MEMORY

Their behaviour overrides the Resource Requests for each Step, regardless of what the Step has defined. I would expect it to apply the higher of the two values: the env var above or whatever is defined in the Step. For example, if the env var sets 200MiB for Memory and the Step requests 250MiB, the latter should end up in the Spec.
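
In other words, taking the Memory example above (the expected value reflects the behaviour we would like, the observed value what our tests show):

    DRONE_RESOURCE_MIN_REQUEST_MEMORY: 200MiB   # runner-level minimum
    Step request in pipeline manifest: 250MiB
    Expected in the Pod spec:          250Mi    # the higher of the two
    Observed in the Pod spec:          200Mi    # env var applied regardless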

This thread explains the changes we are making to how the Kubernetes runner works:

Please keep in mind the Kubernetes runner is in Beta and we do not have a stable specification. We are still making breaking changes which may result in inconsistencies in the docs (despite our best efforts to keep them up to date). If you need something more stable, we recommend the Docker runner.

Thanks for pointing us to the post that sets out the current resource strategy. According to our tests with the current master (5abe9d7), there doesn’t seem to be any way of overriding DRONE_RESOURCE_MIN_REQUEST_CPU or MEMORY values in pipeline manifests.

Each container gets the resources specified by the MIN_REQUEST env vars, regardless of the requests specified in the pipeline manifest. We tested the four scenarios below and checked the actual resource allocations reported by the K8s API in each case (an example env combination for the last scenario is sketched after the list):

  • no env vars are set in runner deployment
  • only minimum request values are set
  • only stage request values are set
  • both minimum and stage values are set.
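
As an example, the runner env for the last scenario looked roughly like this (the values themselves are illustrative):

    # both minimum (per-step) and stage (whole-Pod) request values set
    DRONE_RESOURCE_MIN_REQUEST_CPU: 100m
    DRONE_RESOURCE_MIN_REQUEST_MEMORY: 200MiB
    DRONE_RESOURCE_REQUEST_CPU: 1000m
    DRONE_RESOURCE_REQUEST_MEMORY: 1GiB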

The pipeline manifests used for testing contained four steps (in addition to the git clone); a condensed sketch follows the list:

  • first container does not specify any requests
  • second requests resources that are lower than the MIN_REQUEST values set in the runner deployment
  • third requests resources that are higher than MIN_REQUEST but lower than stage values set in the runner deployment
  • fourth requests resources that are higher than stage values set in the runner deployment.
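
A condensed sketch of that manifest, assuming the illustrative minimum values of 100m / 200MiB and stage values of 1000m / 1GiB from above (step names, images and the exact request values are also illustrative):

    steps:
    - name: no-requests              # 1: no requests specified
      image: alpine:3.11
      commands: [ "sleep 10" ]
    - name: below-min                # 2: below the MIN_REQUEST values
      image: alpine:3.11
      commands: [ "sleep 10" ]
      resources:
        requests: { cpu: 50, memory: 100MiB }
    - name: between-min-and-stage    # 3: above MIN_REQUEST, below the stage values
      image: alpine:3.11
      commands: [ "sleep 10" ]
      resources:
        requests: { cpu: 500, memory: 500MiB }
    - name: above-stage              # 4: above the stage values
      image: alpine:3.11
      commands: [ "sleep 10" ]
      resources:
        requests: { cpu: 2000, memory: 2GiB }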

Looking at the post you linked, we are not clear whether this is the intended behaviour. Should we be able to set requests for individual pipeline steps, or should we only be setting limits per step instead?

We are also observing that the stage approach does not have the desired effect of reserving enough resources for each given step. A single step can go above its reserved amount (4Mi or the MIN_REQUEST value) and get OOMKilled, even when the stage env vars reserve plenty of resources for the entire Pod.

Do you have any suggestions?

@bradrydzewski would you be able to advise on the above investigation please?