Drone

Permission denied when trying to execute script

I’m running 1.0.0-rc5 on Kubernetes on GKE using the chart at helm/stable. Everything installs fine, I can see the job getting created, the services spinning up and running fine, but then I run into problems.

I’m trying to execute a script as part of my build, but it appears any script from my repo gets a Permission denied error with no further details. Looking at the logs, I see this:

{
  arch: "amd64"
  build: 27
  error: "provision-elasticsearch : exit code 126"
  level: "info"
  machine: "gke-staging-preemptible-pool-edc959a7-wf24"
  msg: "runner: execution failed"
  os: "linux"
  pipeline: "default"
  repo: "myorg/myrepo"
  stage: 1
}

The exit code suggests the script is not executable, but it definitely has the right file permissions. I thought I might not be root, but whoami said otherwise. This behaviour only started with version 1.0.0; the same setup worked fine in 0.8. I’ve also tried changing my workspace directory to /tmp, as suggested elsewhere, to no avail.
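For context, exit code 126 is what a POSIX shell reports when it finds a command but cannot execute it (a missing command gives 127). The sketch below reproduces that code locally by clearing the execute bit. The temporary paths are illustrative only; on a mount with noexec the same code shows up even when the bit is set.

```shell
#!/bin/sh
# Reproduce exit code 126 ("found but cannot execute") in a throwaway directory.
workdir=$(mktemp -d)
printf '#!/bin/sh\necho hello\n' > "$workdir/script.sh"

# Without the execute bit the shell refuses to run the file.
chmod 644 "$workdir/script.sh"
status=0
"$workdir/script.sh" 2>/dev/null || status=$?
echo "without +x: exit $status"

# With the execute bit set, the result depends on the mount options of $workdir.
chmod 755 "$workdir/script.sh"
fixed=0
"$workdir/script.sh" >/dev/null 2>&1 || fixed=$?
echo "with +x: exit $fixed"

rm -rf "$workdir"
```

If the second run also reports 126, the file permissions are probably not the problem and the mount options are worth checking.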

I’ve also tried changing the Kubernetes Node image to containerd as has been suggested elsewhere, but still no luck.

Please help!

Hey there, I recommend taking a look at this thread, which provides instructions for debugging Kubernetes issues: Contributing to Drone for Kubernetes. Given the experimental nature of the Kubernetes runtime, we are asking everyone to get hands-on to help debug issues.

So I’ve just followed the instructions for running the runtime locally, and it’s the same thing when run against my GKE cluster. When I run it in Docker for Mac with Kubernetes, it seems to be fine.

It seems anything written to the host volume is just not executable. Files there can be read and written, but not executed. I’ve tried changing the permissions of the file, the directory and the parent directory, all to no avail.

I thought it could have been due to the Helm chart, but I went back to raw YAML resource definitions, and it’s the exact same behaviour.

It seems like others are having issues too: Drone 1.0.0-rc5 on kubernetes - shebang header not working

I was not able to reproduce. I should have some time in ~2 weeks to investigate this further, but unfortunately I have other matters to attend to right now. In the meantime, the fastest way to get this resolved would be to look into the underlying source code and send a patch.

I’ve tried to take a look. Good work with what’s been achieved so far!

First things first: I’ll try running a pod directly on Kubernetes with a host volume mounted and see if I can execute anything from that volume. If not, I’ll investigate why that might be. I’d guess it’s just a GKE security thing.
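A rough sketch of that test, assuming a plain Pod with a hostPath volume is close enough to what the runtime creates. The pod name, image and host path below are made up for illustration and are not taken from drone-runtime.

```shell
#!/bin/sh
# Print a throwaway Pod manifest that writes a script to a hostPath volume
# and then tries to execute it. Name, image and path are hypothetical.
hostpath_exec_test_manifest() {
    cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: noexec-test
spec:
  restartPolicy: Never
  containers:
  - name: test
    image: alpine:3
    command: ["sh", "-c"]
    args:
    - |
      printf '#!/bin/sh\necho ok\n' > /work/t.sh
      chmod +x /work/t.sh
      /work/t.sh
      echo "exit code: $?"
    volumeMounts:
    - name: work
      mountPath: /work
  volumes:
  - name: work
    hostPath:
      path: /tmp/noexec-test
      type: DirectoryOrCreate
EOF
}

# Usage (needs access to the cluster):
#   hostpath_exec_test_manifest | kubectl apply -f -
#   kubectl logs noexec-test
```

If the pod log shows exit code 126, the host path itself is not executable and Drone is not the culprit.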

If that is the case, then the next route would be persistent volumes. Have you made any more progress on that other than what’s in the drone-runtime repo?

Unfortunately I have not gotten the chance to work on persistent volumes. I am currently focused on hardening Drone for Nomad in preparation for our 1.0 release. Once complete, I should be able to shift my attention back to Kubernetes.

On GKE, to execute scripts, I have to copy the files to a path outside the workspace’s mounted volume.
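Something along these lines is one way to do the copy inside a pipeline step. It is only a sketch: run_outside_workspace is a hypothetical helper, not a Drone feature, and it assumes the container’s own temporary directory is not mounted noexec.

```shell
#!/bin/sh
# Copy a script out of the (possibly noexec) workspace and run the copy.
# run_outside_workspace is a hypothetical helper name.
run_outside_workspace() {
    src=$1
    shift
    tmpdir=$(mktemp -d) || return 1
    copy="$tmpdir/$(basename "$src")"
    cp "$src" "$copy" || return 1
    chmod +x "$copy"
    "$copy" "$@"
    rc=$?
    rm -rf "$tmpdir"
    return "$rc"
}

# Usage inside a step:
#   run_outside_workspace ./scripts/provision.sh arg1 arg2
```

Passing the script to its interpreter explicitly (for example sh ./script.sh) usually avoids the problem as well, since only the interpreter binary has to be executable.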

I took a look at the drone-runtime repo and had a play around. Here’s what I’ve learnt. Please can you confirm whether I’m on the right track?

  1. We want to use persistent volumes instead of host volumes so that pods aren’t restricted to running on one Kubernetes node.
  2. Most cloud-provided persistent volumes are restricted to nodes within the same zone, meaning normal persistent volumes face similar issues to host volumes.
  3. We want a new persistent volume provisioned for each new build, meaning a persistent volume claim is required.
  4. Since we don’t want to restrict the pods to executing on a single Kubernetes node, we need a type of persistent volume which supports claims from pods in any zone and which has the ReadWriteMany access mode.
  5. The only way to achieve this on a local cluster or in a cloud-agnostic way is to use an NFS persistent volume.

Does that sound about right?

I think there is perhaps a slight misunderstanding. We are not trying to enable the execution of steps across multiple nodes. This is not really a goal. Having steps share the same node is sufficient for the majority of workflows, which execute serially and expect fast local storage shared among all steps. Persistent volumes simplify some of our internal logic and will allow Drone to run in environments that restrict HostPath volumes.

Ah! That simplifies things! :) However, would there be a use case where parallel workloads would need access to the same data?

However, would there be a use case where parallel workloads would need access to the same data?

Yes, many projects use single machine parallelism to split up unit tests, which requires shared access to source code, dependencies and artifacts downloaded or generated in previous steps. There are other use cases, but this is the most common.

Conversely, very few projects benefit from splitting their pipeline across multiple machines. I would go as far as to say that single machine execution is optimal for 99% of public repositories on GitHub. There are of course exceptions, such as huge mono-repositories. For such projects Drone provides a mechanism for multi-machine pipelines [1], which execute on separate machines, as opposed to defining a single pipeline that executes on a single machine.

[1] https://docs.drone.io/user-guide/pipeline/multi-machine/

The root cause of the inability to execute any binaries is the noexec option on the mount point. This setup is running on GKE.

Even when additional volumes are defined under volumes in the Drone YAML, it still doesn’t work.

Drone Yaml

volumes:
- name: test-cache
  temp: {}
- name: test-tmp
  host:
    path: /tmp

Mount Output from Drone Job

tmpfs on /test-cache type tmpfs (rw,nosuid,nodev,noexec)
tmpfs on /test-tmp type tmpfs (rw,nosuid,nodev,noexec)
tmpfs on /drone/src type tmpfs (rw,nosuid,nodev,noexec)

Host of Agent/Docker

tmpfs on /tmp type tmpfs (rw,nosuid,nodev,noexec)
/dev/sda1 on /mnt/stateful_partition type ext4 (rw,nosuid,nodev,noexec,relatime,commit=30,data=ordered)
/dev/sda1 on /home type ext4 (rw,nosuid,nodev,noexec,relatime,commit=30,data=ordered)
/dev/sda1 on /var type ext4 (rw,nosuid,nodev,noexec,relatime,commit=30,data=ordered)
shm on /var/lib/docker/containers/<random_id>/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,size=65536k)
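A quick way to pull the affected mount points out of listings like the ones above is to filter the mount output on the noexec option. This is a small sketch; noexec_mounts is just a helper name chosen here, and it assumes the usual "device on /path type fs (options)" line format with no spaces in the mount point.

```shell
#!/bin/sh
# List mount points whose options include noexec.
# Reads mount-style lines ("dev on /path type fs (opts)") from stdin.
noexec_mounts() {
    awk '$2 == "on" && $NF ~ /[(,]noexec[,)]/ { print $3 }'
}

# Usage, on the node or inside a pipeline step:
#   mount | noexec_mounts
```

Running it both on the node and inside a step shows whether the workspace inherits noexec from the host path it is backed by.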

I’m still trying to identify whether this is a GKE configuration issue or a problem between Drone and Docker.

What additional information would you need?
Are there any options to force exec on the mount points?

I’ve identified the cause: the Drone jobs controller places files within /tmp/drone on the node, while on the default GKE image type /tmp is a tmpfs mounted with noexec.

As a temporary workaround, a DaemonSet can set up symlinks on the nodes. In my case, I’ve created a symlink from /mnt/stateful_partition/var/tmp/drone to /tmp/drone on the GKE nodes.

Issues:

  • GKE sets /tmp as tmpfs, which limits available space for /tmp/drone.
  • GKE sets /tmp as noexec, which limits execution of binaries leveraged during pipeline steps.


Thanks for the details. I definitely hope to add persistent volumes, which should completely solve this issue. In the meantime I’m glad to see we have a documented workaround. Thanks!

Hello everyone,

tl;dr: use the Ubuntu image type for node pools rather than the default COS.

I encountered the same issue on a new installation of Drone on k8s and wanted to share what I learned in case anyone else sees this. I was using the COS (Container-Optimized OS, the default) image type. I didn’t catch it on the first reading, but the comment at the top of @sigtrap’s linked Medium article fixed everything for me:
“GKE has released the ubuntu image to replace container-vm. This image has xfsprogs pre-installed. Use this instead.”

I recreated my cluster nodes with Ubuntu and everything is working well now.
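For anyone wanting to try the same thing, adding an Ubuntu node pool with gcloud looks roughly like the sketch below. The cluster and pool names are placeholders, create_ubuntu_pool is just a wrapper name chosen here, and the accepted image type values depend on the GKE version (newer clusters may need a containerd variant such as UBUNTU_CONTAINERD), so treat the default as an assumption.

```shell
#!/bin/sh
# Add a node pool that uses an Ubuntu node image instead of COS.
# Cluster and pool names are placeholders. Override IMAGE_TYPE if your
# GKE version expects a different value (e.g. UBUNTU_CONTAINERD).
create_ubuntu_pool() {
    cluster=$1
    pool=$2
    gcloud container node-pools create "$pool" \
        --cluster "$cluster" \
        --image-type "${IMAGE_TYPE:-UBUNTU}"
}

# Usage (a zone or region flag may also be needed, depending on gcloud config):
#   create_ubuntu_pool my-cluster ubuntu-pool
```

Once the new pool is up, the old COS pool can be drained and deleted so builds only land on Ubuntu nodes.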