Drone Step Retryable

We have a use case for a feature that we were wondering if the community would be open to:

Several of our users have pipeline steps that fail intermittently for any number of reasons: a timeout waiting for an external dependency, say, or a hiccup in network communication. Whatever the cause, we were wondering how the community would feel about implementing a retry feature for Drone. We think this is how it would appear in the .drone.yml:

pipeline:
  foo:
    image: alpine
    retries: 5
    commands:
      - echo foo

By default, Drone does not retry any step (native behavior), but once a retries attribute is added, it will retry the step on failure. We feel that Drone itself should set sane limits on the number of retries and on the wait time before each retry. A user would still be able to override the defaults (to an extent), like so:

pipeline:
  foo:
    image: alpine
    retries: 5
    retry_wait: 10
    commands:
      - echo foo

In the above example, the step will be retried up to 5 times, and the agent will wait 10 seconds before each retry.
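To make the intended semantics concrete, here is a minimal sketch of the retry loop in Python. It is only an illustration, not Drone code: `run_with_retries` and its parameters are hypothetical names, and it assumes that a step counts as failed when it returns a non-zero exit code and that `retries: 5` means up to 6 runs in total (the first run plus 5 retries).

```python
import time
from typing import Callable


def run_with_retries(
    step: Callable[[], int],
    retries: int = 0,
    retry_wait: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run `step` once, then retry up to `retries` more times while it fails.

    `step` returns an exit code; 0 means success (an assumption of this sketch).
    `retry_wait` is the pause in seconds before each retry.
    """
    exit_code = step()
    for _ in range(retries):
        if exit_code == 0:
            break  # the step succeeded, no further retries needed
        sleep(retry_wait)
        exit_code = step()
    return exit_code
```

With the default `retries=0` the step runs exactly once, which matches the current behavior.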

To make this configurable for whoever manages the Drone installation, we feel that each limit could be sourced from an environment variable, such as DRONE_RETRY_LIMIT and DRONE_RETRY_WAIT_LIMIT. The Drone server would still ship with its own default limits, so a server started without those variables set is still protected.
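As a rough sketch of how the limits could be resolved (in Python, for illustration only): the two variable names come from the proposal, but the default values, the function names, and the choice to ignore malformed or negative values are all assumptions of this example.

```python
import os
from typing import Mapping, Tuple

# Hypothetical built-in defaults; the proposal does not name concrete numbers.
DEFAULT_RETRY_LIMIT = 5
DEFAULT_RETRY_WAIT_LIMIT = 60  # seconds


def load_limit(env: Mapping[str, str], name: str, default: int) -> int:
    """Return the integer stored in `env[name]`, or `default` if unset or invalid."""
    raw = env.get(name, "")
    try:
        value = int(raw)
    except ValueError:
        return default  # unset or not a number: keep the built-in default
    return value if value >= 0 else default


def load_retry_limits(env: Mapping[str, str] = os.environ) -> Tuple[int, int]:
    """Return (retry limit, retry wait limit in seconds) for the server."""
    return (
        load_limit(env, "DRONE_RETRY_LIMIT", DEFAULT_RETRY_LIMIT),
        load_limit(env, "DRONE_RETRY_WAIT_LIMIT", DEFAULT_RETRY_WAIT_LIMIT),
    )
```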

Another consideration is what should happen if a user sets the retry count or the retry wait period above the configured limits. We see two possible outcomes:

  • The step will error out with a message notifying them they set an invalid retries or retry_wait value
  • The step would just inherit the defaults from the environment variables
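The two outcomes could be sketched as a single check with a switch between them. This is a hypothetical Python illustration: `apply_limit` and its `strict` flag are made-up names, and the second outcome is interpreted here as falling back to the limit value.

```python
def apply_limit(name: str, requested: int, limit: int, strict: bool) -> int:
    """Validate a user-supplied `retries` or `retry_wait` value against a limit.

    strict=True  -> first outcome: fail with a message naming the bad value.
    strict=False -> second outcome: silently fall back to the limit.
    """
    if requested < 0:
        raise ValueError(f"{name} must not be negative, got {requested}")
    if requested <= limit:
        return requested
    if strict:
        raise ValueError(f"invalid {name} value {requested}: the limit is {limit}")
    return limit
```

The strict variant surfaces a misconfiguration immediately, while the fallback variant keeps the build going but may surprise a user who expected more retries.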

Looking for comments, questions, concerns etc…

We’d be willing to do the PRs for the work! /cc @jmccann


I would suggest we use the built-in Docker retry mechanism and docker-compose syntax:

      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 120s

In 0.9 we are moving to the following libraries for yaml parsing and runtime execution, so if we choose to go this route, it should be added here so it can land in 0.9:

In terms of limits, any retries would be subject to the overall build timeout, so we would have some initial safety in place. I would probably prefer the initial implementation avoid adding any extra global configurations (e.g. retry limits). We can always come back and put global limits in place if we observe real-world issues.
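A simplified Python sketch of how the build timeout bounds retries, for illustration only. The parameter names `max_attempts` and `delay` mirror the restart_policy keys, with `max_attempts` counting restarts after the first run; `run_until_deadline`, `build_timeout`, and the injected `clock` and `sleep` are hypothetical. It also assumes the deadline is checked only between attempts, whereas a real build timeout would interrupt a running step as well.

```python
import time
from typing import Callable


def run_until_deadline(
    step: Callable[[], int],
    max_attempts: int,
    delay: float,
    build_timeout: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Restart a failing step up to `max_attempts` times within `build_timeout` seconds."""
    deadline = clock() + build_timeout
    exit_code = step()
    restarts = 0
    while exit_code != 0 and restarts < max_attempts:
        if clock() + delay >= deadline:
            break  # the next attempt could not start before the build times out
        sleep(delay)
        exit_code = step()
        restarts += 1
    return exit_code
```

Even with a very large `max_attempts`, the loop stops once the deadline is reached, which is the initial safety net described above.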

The only gotcha could be tailing container logs and waiting for containers to exit (both of which are Docker endpoints). I am not sure how these endpoints behave on restart, which could throw a wrench into things and determine how complex the change becomes.

Did this ever get implemented? It would be a handy feature. We sometimes get failures on build steps when dependencies are pulled, because the download timed out or something similar, so an automatic retry would be nice and would save us from manually rerunning the build.


We would also be very much interested in this. Did it ever get implemented?