I’m running Drone server 1.0.0-rc5 on AWS ECS (EC2), and occasionally the server starts failing to open connections with a “too many open files” error.
As a Hail Mary fix, I reduced the connection timeout for all http.Client instances to 10 minutes (rather than the default of no timeout), in case there were dangling connections, but that didn’t seem to address it. My original sense was that agents were making too many connections over time. Eventually, health checks start failing and ECS restarts the task, but that doesn’t happen very quickly.
Has this been seen before?
We’re also running about 16 agents, plus a configuration plugin (yaml/jsonnet) and a secret plugin.