How to debug "Cannot connect to the Docker daemon"?

#1

I am getting the following error when attempting to build, but not every time, and I haven’t figured out the pattern yet. Here is the Drone configuration I am using; maybe there is a misconfiguration in it?

And here is the log from the server:

{
  "arch": "arm64",
  "build": 16,
  "error": "Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?",
  "level": "info",
  "machine": "drone-server-6f6495cf77-nk4pt",
  "msg": "runner: execution failed",
  "os": "linux",
  "pipeline": "default",
  "repo": "jmreicha/dockerfiles",
  "stage": 2,
  "stage-id": 19,
  "time": "2019-03-25T16:17:33Z"
}

I’m not really sure what to look at next. I’m thinking about debugging the Docker process on the host. Any other ideas?
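Since the failure is intermittent, one way to look for a pattern is to tally the failed runs from the server log. This is only a minimal sketch: it assumes the server writes one JSON object per line (the entry above is pretty-printed) with the same field names, and `tally_failures` is a hypothetical helper, not part of Drone.

```python
import json
from collections import Counter


def tally_failures(lines):
    """Count 'runner: execution failed' entries per (repo, stage, error)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(entry, dict):
            continue
        if entry.get("msg") != "runner: execution failed":
            continue
        counts[(entry.get("repo"), entry.get("stage"), entry.get("error"))] += 1
    return counts


# Example input: a compact form of the log entry shown above.
sample = [
    '{"build": 16, "msg": "runner: execution failed", '
    '"repo": "jmreicha/dockerfiles", "stage": 2, '
    '"error": "Cannot connect to the Docker daemon at '
    'unix:///var/run/docker.sock. Is the docker daemon running?"}'
]
for (repo, stage, error), n in tally_failures(sample).most_common():
    print(n, repo, stage, error)
```

Feeding it a longer stretch of log lines would show whether the failures cluster on a particular stage or repository.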


#2

I think debugging Docker on the host would be the next step. This error message implies that either Drone was unable to connect to Docker, or the connection was reset or broken. Either way, this needs to be debugged at the Docker daemon level.

For example, if Docker crashes and restarts due to an internal error, it breaks the connection, which results in a connection error: https://github.com/moby/moby/issues/38735

Also make sure you are running the latest version of Docker and that you restart your Drone agent after you upgrade Docker.
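As a rough way to check the daemon from the same place the agent runs, something like the following could be used. It is a sketch, not a Drone or Docker tool: `docker_ping` is a hypothetical helper that opens the Unix socket from the error message and sends a request to the Docker Engine API `/_ping` endpoint. The default path and timeout are assumptions to adjust for your setup.

```python
import socket


def docker_ping(socket_path="/var/run/docker.sock", timeout=3.0):
    """Return True if GET /_ping on the Unix socket answers with HTTP 200."""
    request = b"GET /_ping HTTP/1.1\r\nHost: docker\r\nConnection: close\r\n\r\n"
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            sock.connect(socket_path)
            sock.sendall(request)
            response = sock.recv(4096)
    except OSError:
        # Missing socket file, permission denied, refused connection, timeout.
        return False
    # The status line looks like b"HTTP/1.1 200 OK".
    parts = response.split(b"\r\n", 1)[0].split()
    return len(parts) >= 2 and parts[1] == b"200"


print("Docker daemon reachable:", docker_ping())
```

Running this in a loop while a build fails would help tell a daemon that is down apart from a socket that is simply not visible to the agent.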


#3

It’s strange: the Drone build dies right away and I never see anything show up in the Docker logs. Docker version is 18.06.1-ce.

I am seeing the same result on all of the hosts in my cluster.


#4

This would sort of make sense. If Drone cannot connect to Docker, Docker never receives the API call, so it has nothing to log.

FWIW we are running 18.09 at cloud.drone.io and have not experienced any issues.


#5

I think I’m going to try upgrading Docker versions.

Also, this is an arm64 cluster, so who knows what kinds of issues are lurking in there :smile:


#6

Just a quick side note (in case this ever happens to anyone else): I switched Drone back to the Kubernetes configuration and that seems to work.

I am still digging into what the problem might be with the normal drone-agent configuration.


#7

I upgraded Docker to 18.09.3 on the hosts and upgraded the docker/dind image to 18.06-dind, but I am still seeing the same error: "Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?" This happens even though the dind container reports that it is connected to the Docker socket, and I don’t see any other errors in the dind logs.

I’m starting to run low on ideas :sob:


#8

If you are going to use Agents, I recommend installing them on standalone virtual machines (e.g. an AWS instance). It will probably take just a few minutes to set up, compared to hours or days of debugging Kubernetes.


#9

+1

At this point the debugging is pretty much just to scratch my own itch of getting it working. The Kubernetes integration works well, so I can run with that for now and look at getting some standalone agents set up later.

On a side note, do you know if there is a way to set a TTL on the Kubernetes jobs so that they get cleaned up automatically after finishing?


#10

I do remember reports of a race condition: if the agent starts before Docker, the connection is never established. This can happen if you run a Docker-in-Docker container in the same pod as the agent, so it is worth looking into …
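One possible workaround for that kind of startup race is to hold the agent back until the socket accepts connections. The sketch below is only an illustration under that assumption: `wait_for_socket` is a hypothetical helper, and the path, timeout, and polling interval are placeholders, not Drone settings.

```python
import socket
import time


def wait_for_socket(path="/var/run/docker.sock", timeout=60.0, interval=1.0):
    """Poll until a Unix socket at `path` accepts a connection.

    Returns True as soon as a connection succeeds, or False once
    `timeout` seconds have passed without one.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
                sock.settimeout(interval)
                sock.connect(path)
            return True
        except OSError:
            # Socket file missing or nothing listening yet: retry until the deadline.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A wrapper entrypoint could call `wait_for_socket()` and start the agent only once it returns True, exiting with an error otherwise so that the pod gets restarted.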

> At this point the debugging is pretty much just to scratch my own itch of getting it working. The Kubernetes integration works well, so I can run with that for now and look at getting some standalone agents set up later.

For more context, the reason I am not a huge fan of running Agents on Kubernetes is that the Drone resource scheduler and the Kubernetes resource scheduler conflict with each other. Kubernetes also does unusual things with networking and does not always work well with user-defined Docker networks. I have seen people waste a lot of time dealing with these issues.

The native Kubernetes runtime avoids many of the issues but it is still very experimental. It runs quite well on some Kubernetes distros (DigitalOcean, MiniKube) and quite poorly on others (EKS, OpenShift) and in some cases (GKE) varies depending on version and container runtime (containerd, docker, etc). Eventually this will be the recommended way to run pipelines on Kubernetes, and one day, it may be the only way. I could definitely see disabling agents from running on Kubernetes entirely in favor of the native Kubernetes runtime.

> On a side note, do you know if there is a way to set a TTL on the Kubernetes jobs so that they get cleaned up automatically after finishing?

Yes, you have to enable this feature gate:
https://docs.drone.io/installation/github/kubernetes/#job-cleanup
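For illustration only, and assuming the feature gate in question is the Kubernetes `TTLAfterFinished` gate: with it enabled, a Job can carry a `ttlSecondsAfterFinished` value in its spec, and Kubernetes deletes the Job that long after it finishes. The sketch below builds such a manifest by hand to show where the field lives. The job name, image, and TTL value are hypothetical, and `job_with_ttl` is not a Drone function.

```python
import json


def job_with_ttl(name, image, ttl_seconds):
    """Build a batch/v1 Job manifest that is cleaned up after it finishes."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Seconds to keep the finished Job before it is deleted.
            "ttlSecondsAfterFinished": ttl_seconds,
            "template": {
                "spec": {
                    "containers": [{"name": name, "image": image}],
                    "restartPolicy": "Never",
                }
            },
        },
    }


# Hypothetical values; the JSON output can be applied like any other manifest.
print(json.dumps(job_with_ttl("example-job", "busybox", 300), indent=2))
```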


#11

Perfect, thanks again for all the help. I am working on a blog post now with the things I’ve discovered.
