Intermittent server error (too many open files)

I’m running drone server 1.0.0-rc5 on AWS ECS (EC2) and occasionally we see the server start to fail to open connections with a “too many open files” error.

As a Hail Mary fix, I reduced the connection timeout for all http.Client instances to 10 minutes (rather than the indefinite default), in case there were dangling connections, but that didn’t seem to address it. My original sense was that agents were making too many connections over time. Eventually, health checks start failing and ECS restarts the task, but it doesn’t happen very quickly.

Has this been seen before?

We’re also running about 16 agents, plus a configuration (yaml/jsonnet) plugin and a secret plugin.

is it possible SSE connections are not being closed? These are the only persistent connections I can think of that could build up over time and do not have an explicit timeout. Drone expects to get notified when the http connection is closed, but if that does not work as expected (proxy, lb, etc) I could see this causing problems.

edit: looking at the code, we do set a maximum 1hr timeout for event streams, but you could still confirm this theory by checking for “events: stream closed” entries in your logs to ensure the streams are being closed.

can you confirm it is the Drone server process that owns all the open file descriptors and not some other process on the OS? Typically leaking file descriptors in a Go program would be caused by a) inbound http requests not receiving a close event and hanging (possible but unlikely because we seem to implement timeouts everywhere) or b) http response body not being closed somewhere in the codebase

Yes, I was thinking SSE too, but we don’t have too many users. Even if they’re refreshing often, I think it’d be a while before hitting any limit.

There’s nothing else running on the machine and it’s hosted in an ECS container. It’s possible that it could be caused by another process, so I’ll dig more.

my thinking is that since it is not something we are seeing at, if it is caused by some leak in the codebase, it would have to be somewhere in the code not being used by The first thing that comes to mind are plugins (secret, registry, etc) which make outbound http requests.

Yes, however, those are running on entirely different instances and in Fargate. Hmm.

I have been having this issue on our ec2 Drone instance as well for a while. I found it occurred less often after manually upgrading Docker but still happens about twice a month, down from once a week.

@donatj is this for 1.0 as well, or were you experiencing this with 0.8?