Drone-runner-kube slow to process spike in pipeline requests

I am using the kube-runner and am having some issues with very slow start times for builds due to a high number of pipelines being thrown into the job queue all at once. We have determined that there is a particular repo’s .drone.yml that is creating 67 parallel builds which is causing the job queue to fill up and significantly delay builds for all other repos.

We have tried increasing the DRONE_RUNNER_CAPACITY and DRONE_RUNNER_MAX_PROCS which seems to help the runner pull things off the queue faster, but this does not seem to help at all in terms of processing the builds and starting the underlying build containers.

I would appreciate any help with the following questions:
Is there a way to increase throughput with higher job concurrency for the kube runner and/or scale the kube runner horizontally?
Does anyone have any guidance about concurrency limits/expectations that should be documented or that I can potentially enforce through policy?

We are running the following:

One question I have is whether the bottleneck is Drone or Kubernetes. When the Kubernetes runner receives a pipeline from the server, it immediately attempts to create and schedule the Pod for execution. However, this can block if your cluster does not have enough capacity. Have you confirmed your cluster has adequate capacity when you see this behavior?

Is there a way to increase throughput with higher job concurrency for the kube runner and/or scale the kube runner horizontally?

Yes, the Kubernetes runner can be deployed horizontally. No special configuration is required. If you connect multiple Kubernetes runners to the server it will distribute workloads accordingly.

We have tried increasing the DRONE_RUNNER_CAPACITY and DRONE_RUNNER_MAX_PROCS

DRONE_RUNNER_MAX_PROCS should probably remain unset for the Kubernetes runner. By setting this value you are placing a global limit on the number of steps that can run concurrently across all repositories and pipelines, which is going to limit throughput. This setting is primarily intended for use by the Docker runner.

DRONE_RUNNER_CAPACITY has a default value of 100 and can be increased without issue. This value limits the number of pipelines a single runner can process at any given time. The sole purpose of this limit is to prevent someone from accidentally overwhelming a cluster – not the runner. A runner should be able to process much higher volumes.

which seems to help the runner pull things off the queue faster, but this does not seem to help at all in terms of processing the builds and starting the underlying build containers.

The behavior you describe makes it sound like the cluster does not have adequate resources to schedule the pipeline and is blocking until it can find an available node.

Hey Brad :wave: , thanks for the feedback!

I pre-warmed a sandbox k8s cluster with enough capacity to support my test case, which is a .drone.yml with 50 parallel pipelines that just run a basic image build pipeline.

When running with a single kube-runner replica, start times balloon pretty quickly to 10x+ their normal start times and build containers are scheduled in what looks to be something of a sequential process.

Great, I did not know this was supported! I just tested this with a few iterations using the test method described above. When running with 3 kube-runner replicas, start times seemed to decrease by ~1/3, and when running with 20 kube-runner replicas, start times seem to return to their normal expected values.

If scaling the kube-runner requires no special consideration, then we will probably just look at throwing more replicas at it for now. I suppose we could also look into setting up an HPA with custom metrics to scale using the drone_running_jobs metric. Do you think that would be the appropriate metric to determine if the kube-runner needs more capacity?

Thanks for testing this out. We will carve out some time to stress test the runner and see if we can reproduce. Out of curiosity, what is the start time you are seeing?

I would recommend using a fixed number for now. The problem with autoscaling is that Kubernetes could end up terminating a runner while pipelines are executing, which would result in those pipelines getting stuck in a zombie state.

@marko-gacesa let’s carve out some time later next week to try and reproduce :point_up_2:

I don’t have a great way to measure this at the moment TBH. Anecdotally, we typically see clone steps start within a few seconds of being triggered. Under load (but still within the DRONE_RUNNER_CAPACITY) this can start to take minutes. This happens largely when the cluster has available capacity, but I’m sure you’re also right that to some degree there are delays due to k8s scaling cloud infrastructure. We’ll start looking for better ways to measure these various indicators.
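
One rough way to put numbers on start times, sketched here under some assumptions: the build JSON returned by the Drone API is assumed to include a `stages` list whose entries carry `created` and `started` Unix timestamps, and the difference between the two is treated as the start delay. The `sample_build` data and the helper names `stage_start_delays` and `summarize` are hypothetical and only for illustration. For stages that depend on other stages, this delay would also include time spent waiting on those dependencies.

```python
from statistics import median

# Hypothetical sample shaped like a Drone build response; the field names
# ("stages", "name", "created", "started") are assumptions about the API payload.
sample_build = {
    "number": 42,
    "stages": [
        {"name": "build-a", "created": 1600700000, "started": 1600700004},
        {"name": "build-b", "created": 1600700000, "started": 1600700095},
        {"name": "build-c", "created": 1600700000, "started": 0},  # not started yet
    ],
}


def stage_start_delays(build):
    """Map stage name -> seconds between stage creation and stage start."""
    delays = {}
    for stage in build.get("stages", []):
        created = stage.get("created", 0)
        started = stage.get("started", 0)
        # Skip stages that have not started (timestamp is 0 or missing).
        if created and started:
            delays[stage["name"]] = started - created
    return delays


def summarize(delays):
    """Return count, median, and max delay, or None if nothing has started."""
    values = sorted(delays.values())
    if not values:
        return None
    return {"count": len(values), "median": median(values), "max": values[-1]}


delays = stage_start_delays(sample_build)
print(delays)
print(summarize(delays))
```

Collecting this for a build under normal load and again during a spike would give a comparable before/after figure instead of an anecdotal one.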

@colinhoglund, can you please take a look at the runner’s log and tell me whether you see messages like the one below and, if so, how many of them there are (the duration will obviously vary):

I0921 ... Waited for 1.002105589s due to client-side throttling, not priority and fairness

Thanks!

Hey Marko, thanks for taking a look at this. I did start to notice these messages after upgrading the kube runner earlier today to version v1.0.0-beta.12 with the newer client-go version. There are about 5-10 of these messages/hour with a pretty typical load of roughly 20-50 builds/hour.

For more context, this cluster is currently running k8s 1.18.
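
For readers unfamiliar with the throttling message above: it is logged by client-go's client-side rate limiter, which behaves like a token bucket and defaults to 5 requests per second with a burst of 10 when the client does not configure it. Whether the runner overrides those defaults is not confirmed in this thread. The sketch below is a simplified model, not the runner's actual code. It assumes every API request is queued at the same moment, that work is split evenly across runner replicas, and that each replica has its own limiter. The figure of 4 API calls per pipeline is a made-up illustrative number.

```python
import math

# client-go defaults when QPS and Burst are left unset on the REST config.
DEFAULT_QPS = 5.0
DEFAULT_BURST = 10


def wait_for_request(index, qps=DEFAULT_QPS, burst=DEFAULT_BURST):
    """Seconds the index-th request (0-based) waits if all arrive at t=0.

    The first `burst` requests go out immediately; after that the bucket
    releases one request every 1/qps seconds.
    """
    if index < burst:
        return 0.0
    return (index - burst + 1) / qps


def time_to_issue_all(total_requests, replicas=1,
                      qps=DEFAULT_QPS, burst=DEFAULT_BURST):
    """Seconds until the busiest replica has issued its last request."""
    per_replica = math.ceil(total_requests / replicas)
    if per_replica == 0:
        return 0.0
    return wait_for_request(per_replica - 1, qps, burst)


# Hypothetical workload: 50 pipelines, 4 API calls each (assumed number).
total = 50 * 4
for replicas in (1, 3, 20):
    print(replicas, time_to_issue_all(total, replicas))
```

Under these assumptions a single replica needs 38 seconds to get all 200 requests out, 3 replicas need about 11.4 seconds, and 20 replicas stay within the burst and do not wait at all. This only illustrates how a per-client rate limit could make starts look sequential; it is not a measurement of the runner.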