Drone

Drone-runner-exec. Log flooding (~600msg/sec) if drone-server answers wrong. Log file and stdout/stderr at the same time

Environment:

  • OS: Ubuntu 18.04
  • drone-runner-exec version: v1.0.0-beta.8
  • drone-server is behind reverse proxy (nginx)

Problems

  1. Flood in logs can “eat” all disk space fast if drone-server answers wrong
  2. DRONE_LOG_FILE is defined, but logs still goes to stdout/stderr too, instead of to log file only

Steps to reproduce “flood”

  1. Run the drone-runner-exec with successful registration on the drone-server
  2. Make the drone-server answer with any error (e.g. 403)
  3. Look in logs. You’ll see flood
  4. Restart the drone-runner-exec -> You’ll see 1 message per second in logs
  5. Make the drone-server available, and repeat from step 2 -> You’ll see flood again

Flood example

tail -f /var/log/drone-runner-exec/drone-runner-exec.log
...
time="2019-12-05T21:22:13Z" level=error msg="cannot request stage" error="<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body bgcolor=\"white\">\r\n<center><h1>403 Forbidden</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n" thread=1
time="2019-12-05T21:22:13Z" level=error msg="cannot request stage" error="<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body bgcolor=\"white\">\r\n<center><h1>403 Forbidden</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n" thread=4
time="2019-12-05T21:22:13Z" level=error msg="cannot request stage" error="<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body bgcolor=\"white\">\r\n<center><h1>403 Forbidden</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n" thread=3
time="2019-12-05T21:22:13Z" level=error msg="cannot request stage" error="<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body bgcolor=\"white\">\r\n<center><h1>403 Forbidden</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n" thread=2
...

root@cnt1:~# grep '2019-12-05T21:22:13Z' /var/log/drone-runner-exec/drone-runner-exec.log | wc -l
611

Steps to reproduce continuing messages to stdout/stderr with DRONE_LOG_FILE

  1. Add to config LOG params
  2. Check stdout/stderr and logs

Config example

DRONE_RPC_PROTO=https
DRONE_RPC_HOST=****
DRONE_RPC_SECRET=****

DRONE_RUNNER_NAME=cnt1
DRONE_RUNNER_MAX_PROCS=4
DRONE_RUNNER_CAPACITY=4
DRONE_RUNNER_LABELS=instance:cnt1
DRONE_RUNNER_ENVIRON=env:production
DRONE_RUNNER_PATH=/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/sbin:/bin
DRONE_LIMIT_TRUSTED=false
DRONE_LIMIT_EVENTS=push,tag,promote,rollback

DRONE_LOG_FILE=/var/log/drone-runner-exec/drone-runner-exec.log
DRONE_LOG_FILE_MAX_SIZE=50
DRONE_LOG_FILE_MAX_AGE=30
DRONE_LOG_FILE_MAX_BACKUPS=7

Workaround

First part of workaround above in config example(LOG rotate options). Second part is disable logs in systemd unit:

[Unit]
Description=Drone Exec Runner
ConditionFileIsExecutable=/usr/local/bin/drone-runner-exec

[Service]
StartLimitInterval=5
StartLimitBurst=10
User=drone
Group=drone
RuntimeDirectory=drone-runner-exec
RuntimeDirectoryMode=0755
StandardOutput=null
StandardError=null
SyslogIdentifier=drone-runner-exec
ExecStart=/sbin/start-stop-daemon -Smp /var/run/drone-runner-exec/drone-runner-exec.pid -x /usr/local/bin/drone-runner-exec -- service run --config /etc/drone-runner-exec/config
Restart=always
RestartSec=120

[Install]
WantedBy=multi-user.target

hmm, it does not look like this response is coming from the Drone server. We do not have any code that returns a response matching this output:

<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body bgcolor=\"white\">\r\n<center><h1>403 Forbidden</h1></center>\r\n<hr <center>nginx</center>\r\n</body>\r\n</html>

Do you have a remote proxy in place and have you considered adjusting the configuration to avoid returning html for json requests?

This seems to be an edge case related to using a reverse proxy, because I cannot otherwise reproduce the issue. When the server is down, and no reverse proxy is in place, the http request times out after 30 seconds and then re-attempts. This limits the logs to once every N seconds.

For a more permanent solution you might consider sending a patch that accounts for html responses returned from a reverse proxy when the server is down. See https://github.com/drone/runner-go/blob/master/client/http.go#L202

Thanks for your reply!

Do you have a remote proxy in place …?

Sorry for not full information. Yes, the drone-server in our setup is behind the reverse-proxy (nginx).

and have you considered adjusting the configuration to avoid returning html for json requests

No, we haven’t. I think we can do that, but it will be like one more workaround, but not a solution.

hmm, it does not look like this response is coming from the Drone server. We do not have any code that returns a response matching this output:

Yes, it’s Nginx before Drone’s server.

Have you considered adjusting the configuration to avoid returning html for json requests?

We’ll do. But before this incident we didn’t even know about such problem.

When the server is down, and no reverse proxy is in place, the http request times out after 30 seconds and then re-attempts. This limits the logs to once every N seconds.

I’m sorry, but what does it explain? We have a situation, when HTTP client got non-2xx response and flooded all logs (~9k messages for 30sec). This behavior is clearly destructive and in short time lead to downtime. I.e. it’s at least a bug worth fixing.

For a more permanent solution you might consider sending a patch that accounts for html responses returned from a reverse proxy when the server is down.

In case of code written in Golang we can help only with reporting and troubleshooting. Sorry, but it’s all how we can be useful for now :wink:

And few more things.

This seems to be an edge case related to using a reverse proxy, because I cannot otherwise reproduce the issue. When the server is down, and no reverse proxy is in place, the http request times out after 30 seconds and then re-attempts. This limits the logs to once every N seconds.

This abused not only client-side (where was runner), but also a server-side (where was server). We got ~140Gb of Nginx’s access logs from this “edge case”.

For a more permanent solution you might consider sending a patch that accounts for html responses returned from a reverse proxy when the server is down.

And why we’re talking about a body of the response when we got 403? I.e. we got non-success status code, which means that our request is failed. Why content of the response’s body even matter in this case?

As I mentioned, I was unable to reproduce the issue but perhaps I am misunderstanding or perhaps this is a difference in setup. I recommend patching the code to solve the issue and sending a pull request. You can find the relevant source here. We will evaluate the pull request and provide any feedback during the code review process.

As I mentioned, I was unable to reproduce the issue but perhaps I am misunderstanding or perhaps this is a difference in setup.

As I wrote above, we’re happy to help with debugging or any other additional information that may or would be needed.

I recommend patching the code to solve the issue and sending a pull request.

Also, as I wrote before, this isn’t an option. We definitely would if we could. But right now we don’t have anyone around who can write in Go.

PS: BTW, we removed HTML part from the answer in favor of 403 with ‘{}’ in body (still valid JSON?). It didn’t change a bit - exactly the same flood in logs.

The 403 status code is only returned when the server is configured when an empty rpc secret. You can find the code here with the relevant snippet pasted below for your convenience.

	// prevents system administrators from accidentally
	// exposing drone without credentials.
	if token == "" {
		w.WriteHeader(403)
	} 

What I do not understand is how the runner was able to establish a connection with the server in the first place. The first thing the runner does is ping the server. The runner blocks until the ping is successful. If the server is configured with an empty rpc secret, the ping will fail with a 403 and the runner will eventually exit after multiple attempts.

Either way this seems to be a configuration issue. The root cause is the server was configured with an empty rpc secret.