Drone

Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

One root cause of this error is that the Docker socket is not properly mounted into the agent. Here are some things to check:

  1. check that the agent is started with --volume=/var/run/docker.sock:/var/run/docker.sock
  2. make sure the mapping is correct (host path on the left, container path on the right)
  3. make sure the socket exists on the host machine. Some Linux distributions, such as CentOS and CoreOS, may write the socket at a different location.
  4. make sure Drone does not start before the Docker socket is initialized
  5. if Docker restarts, you must restart the agent. Note that Docker can restart unexpectedly due to a panic (there are documented issues in the moby issue tracker)
  6. make sure you are running the latest Docker version (and the latest Drone version, for that matter). This avoids troubleshooting issues that have already been resolved.

Another common root cause is people trying to use agents, but not properly configuring the server for multi-machine mode. You must explicitly configure the Drone server to use agents, otherwise the Drone server defaults to single-machine mode. When running in single-machine mode, the Drone server attempts to connect to the host machine Docker socket to launch builds instead of delegating to agents. If you are trying to configure Drone with agents, please double-check your configuration and ensure you are passing DRONE_AGENTS_ENABLED=true to the server.
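
The default described above can be modelled in a few lines of Go. This is only a sketch of the documented behaviour, not Drone's actual code: the function name serverMode is hypothetical, and parsing the value with strconv.ParseBool is an assumption about how the variable is interpreted.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// serverMode models the behaviour described in the post: unless
// DRONE_AGENTS_ENABLED is set to a true value, the server runs builds
// itself through the local Docker socket. (Hypothetical helper; the
// use of strconv.ParseBool is an assumption.)
func serverMode(agentsEnabled string) string {
	enabled, err := strconv.ParseBool(agentsEnabled)
	if err != nil || !enabled {
		return "single-machine"
	}
	return "multi-machine"
}

func main() {
	mode := serverMode(os.Getenv("DRONE_AGENTS_ENABLED"))
	fmt.Println("DRONE_AGENTS_ENABLED =>", mode)
}
```

Run with the server's environment, an unset variable reports single-machine mode, which matches the symptom of the server itself trying to reach /var/run/docker.sock.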



Thanks to all contributors for Drone.
I have drone/drone:1.0.0 and drone/agent:1.0.0 running via docker-compose on a Google Cloud n1-standard-1 (1 vCPU, 3.75 GB memory) Instance, 20GB free disk.
I’m looking for suggestions on how to diagnose the following:

  • The instance has been semi-usable for me for several days. On each github repo push, a task is correctly created, but about 6 out of 8 times the task stops immediately with the “default: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?” message.
  • I click restart for up to 10 times, until I see that the cloning begins. It seems I’m always able to make the [clone, postgres service, and the build steps] run, as expected, provided I’m patient enough with restarting.

Facts:

  • volume is mounted
  • socket is present:
    srw-rw---- 1 root docker 0 Mar 22 11:25 /var/run/docker.sock
  • uname -a:
    linux drone-runner-1 4.9.0-8-amd64 #1 SMP Debian 4.9.144-3.1 (2019-02-19) x86_64 GNU/Linux

Because I can always eventually get a successful build, configuration seems to be fine. So what causes the unreliability?

Does anyone see the same behaviour? What should I be looking at, to diagnose and resolve this?

This error comes directly from the Docker Client source code. The client will throw the error if any of the following happens:

  1. there is a request timeout to the Docker daemon
  2. there is a connection refused error when trying to connect to the daemon (I think this error is only when using TCP)
  3. there is a “dial unix: no such file or directory” error

You can trace the error in the Docker client code here and here.

If we look at the Drone source code we see there is very little surface area. For example, below is the code used to connect to Docker [1]. At just 4 lines of code there is very little opportunity for error within the Drone codebase.

cli, err := docker.NewEnvClient()
if err != nil {
  return nil, err
}

I do recall one individual solved this problem by upgrading Docker to the very latest version. There are documented problems in the moby issue tracker with people getting this error [2]. And there are documented issues in the GitLab issue tracker where they have seen the same error with the GitLab runner [3].

So given that a) I cannot reproduce this problem locally or at cloud.drone.io and b) users of other systems (e.g. gitlab) are experiencing the problem, I am operating under the assumption that this is most likely an issue with Docker or perhaps a host machine configuration issue. I am ready to help if there are actionable improvements we can make to Drone, however, unless we identify an issue with Drone there is unfortunately little I can do on my end.

[1] https://github.com/drone/drone-runtime/blob/master/engine/docker/docker.go#L40:L46
[2] https://github.com/moby/moby/search?q=Cannot+connect+to+the+Docker+daemon+at&type=Issues
[3] https://gitlab.com/gitlab-org/gitlab-runner/issues/1986

Thanks, Brad, for the precise and valued response.
I will follow your suggestions, and report progress here.
–r

@topiaruss also I updated the original post to include this second common root cause. You might want to check to see if this applies to your installation.

Bingo!
I did NOT have that flag set.
Adding it seems to have made an improvement. I’ll confirm later, after some more experience.
I needed to docker-compose down, then up, to ensure new env variables. Then it seemed fine.
Thanks for coming back to this!
–r.

Yes! My intermittent problem has gone, since I added:

DRONE_AGENTS_ENABLED=true

to the server environment.

Thanks again. Really enjoying Drone.

The setting DRONE_AGENTS_ENABLED is not even mentioned in the docs at https://docs.drone.io/reference/server/. Or is there some other, more recent documentation?

It is definitely mentioned in the docs, including the example installation commands and installation reference sections:
https://docs.drone.io/installation/github/multi-machine/#start-the-server
https://docs.drone.io/installation/github/multi-machine/#drone-agents-enabled

It certainly needs to be included in the reference section. However, if you are following the official installation instructions, it should be clear that this variable needs to be set. If you believe the installation instructions do not make this clear, please provide actionable feedback and suggestions on how they could be made clearer.

Thx for the quick reply. Well, we are using bitbucket cloud in multimachine setup, and I can’t find it in the installation docs there (https://docs.drone.io/installation/bitbucket-cloud/multi-machine/). I guess if we can add/update it also in the non-github docs, that would help. Thx.

BTW: Haven’t checked all other installation docs if it is mentioned there.

I’ve switched back to CentOS default docker package instead of docker-ce and it works so far. There is an “issue” with SELinux and passing the docker socket to a container because SELinux blocks the access to /var/run/docker.sock. This is also an expected behavior:

The aggravating thing is, this is exactly what we want SELinux to prevent. If a container process got to the point of talking to the /var/run/docker.sock, you know this is a serious security issue. Giving a container access to the Docker socket, means you are giving it full root on your system.

Full article

To allow connections anyway, you have to set up a custom SELinux policy or run the agent container (if you are not using a single-server setup) in privileged mode.