High Availability Drone Server?

Is there any way today to run Drone Server in an HA setup? Either built-in, or via some third-party duct-tape-and-rubber-bands tooling?

Thanks!

High availability can be an overloaded term. Can you describe what it means to you in this context? I would think configuring multiple instances of Drone server with the same configuration and data source with some reverse proxy would suffice. Of course, that doesn’t cover the data source itself as the single point of failure. Making your data source HA would take a bit more work.

It is not currently possible to place Drone behind a load balancer, because Drone uses a central queue embedded in the server.
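
To illustrate why an embedded queue gets in the way, here is a minimal sketch. It is not Drone's code: the `Server` and `RoundRobinBalancer` classes are hypothetical stand-ins, and it assumes a simple round-robin balancer. Each instance keeps its queue in its own memory, so a build enqueued on one instance is invisible to an agent whose poll is routed to the other.

```python
import itertools
import queue


class Server:
    """Hypothetical stand-in for a server whose build queue lives in its own memory."""

    def __init__(self, name):
        self.name = name
        self._builds = queue.Queue()  # in-process only; other instances cannot see it

    def enqueue(self, build_id):
        self._builds.put(build_id)

    def poll(self):
        """Return the next queued build, or None if this instance has nothing queued."""
        try:
            return self._builds.get_nowait()
        except queue.Empty:
            return None


class RoundRobinBalancer:
    """Hypothetical load balancer that alternates between its backends."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def next_server(self):
        return next(self._cycle)


def demo():
    a, b = Server("a"), Server("b")
    lb = RoundRobinBalancer([a, b])
    lb.next_server().enqueue("build-1")  # the webhook request lands on server a
    polled = lb.next_server().poll()     # the agent's poll lands on server b
    return polled, a.poll()              # the build is still sitting on server a
```

In this sketch the agent's poll comes back empty even though a build is waiting, which is the failure mode a shared, external queue would avoid.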

I answered a similar question in a GitHub issue, which I have pasted below for convenience. Basically, I am focusing on making agent-to-server communication more reliable and resilient, which I think is the most important step toward minimizing service disruptions.

I do not believe that putting Drone behind a load balancer will make it highly available. This might make the user interface highly available … but will not guarantee you have a build environment that is available to process builds without interruption.

In my experience running CI at scale, the biggest problems we face are losing build agents and network disruptions that interfere with agent-to-server communication. These will be the biggest causes of outages, and neither of these issues is fully solved by putting the server behind a load balancer.

I believe the best way to achieve a highly available build environment is therefore to focus on handling agent failures, which is something I’m actively working on. Enabling a highly available user interface (putting Drone server behind a load balancer) is therefore of lesser priority at this time because it poses less risk. Note that I’m not saying it isn’t important, I’m just explaining why it isn’t a top priority for me at this time.
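
As a rough illustration of what resilient agent-to-server communication can look like, here is a minimal sketch of retrying a request with exponential backoff and jitter. This is a general technique, not a description of how Drone's agent is implemented; `call_with_backoff` and its parameters are hypothetical names chosen for the example.

```python
import random
import time


def call_with_backoff(request, retries=5, base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Call request() until it succeeds, backing off exponentially on ConnectionError.

    `request` is any zero-argument callable (hypothetical: e.g. the agent's
    "fetch next build" call). The last failure is re-raised once retries run out.
    """
    for attempt in range(retries):
        try:
            return request()
        except ConnectionError:
            if attempt == retries - 1:
                raise
            # Double the delay on each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 50% random jitter so many agents do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

The jitter matters when many agents lose the server at once: without it they would all reconnect at the same instants and could overload the server just as it comes back.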

I also received an update along the same lines on a GitHub issue, in comments from the Drone team:

Currently the server is not directly stateless: the queue is still processed in memory, but there are ongoing efforts to outsource it to queues provided by cloud providers.