Prometheus scrapes intermittently timing out after 10 seconds

Hello,

We are running Drone v2.0.6 and have noticed that Prometheus scrapes intermittently time out after 10 seconds, whereas a normal scrape typically takes under a second.

This is leading to gaps in our data wherever the scrape failures occur.

Our scrape configuration:

        - job_name: drone-server
          # bearer-token was generated manually following https://docs.drone.io/server/user/machine/
          bearer_token: "%{hiera('drone_prometheus_bearer_token')}"
          ec2_sd_configs:
            # dev
            - region: eu-west-1
              role_arn: arn:aws:iam::01234556789:role/prometheus_ec2_sd
              port: 80
            # foundation-prod
            - region: eu-west-1
              role_arn: arn:aws:iam::1234567890:role/prometheus_ec2_sd
              port: 80
          relabel_configs:
            - source_labels: [__meta_ec2_tag_Name]
              regex: drone-server.*
              action: keep
            - source_labels: [__meta_ec2_instance_id]
              target_label: instance
            - source_labels: [__meta_ec2_tag_Name]
              target_label: instance_name
            - source_labels: [__meta_ec2_tag_Vpc]
              target_label: vpc
            - source_labels: [__meta_ec2_tag_Rootaccount]
              target_label: root_account
            - source_labels: [__meta_ec2_tag_Subaccount]
              target_label: sub_account
            - source_labels: [__meta_ec2_tag_Owner]
              target_label: owner
            - source_labels: [__meta_ec2_tag_Subsystem]
              target_label: sub_system

I’ve taken a look at our drone-server and, as far as I can tell, there is no high resource utilization that could be contributing to the timeouts. We suspect the Drone server process is sometimes locking resources when the /metrics endpoint is hit.

You probably just need to increase the timeout. When you hit the metrics endpoint, Drone executes multiple SQL queries that count rows, which can take some time, especially if you are using Postgres, which is known to have slow counts relative to other database systems (see https://wiki.postgresql.org/wiki/Slow_Counting).
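
As a rough sketch of what that could look like in the job above: Prometheus accepts a per-job scrape_timeout, which defaults to 10s and may not exceed the job's scrape_interval. The 30s and 1m values here are illustrative assumptions, not recommendations, so adjust them to the scrape durations you actually observe.

        - job_name: drone-server
          # Assumed interval; it must be greater than or equal to scrape_timeout.
          scrape_interval: 1m
          # Illustrative value; raises the limit above the 10s default.
          scrape_timeout: 30s
          # bearer_token, ec2_sd_configs and relabel_configs stay as they are.

Setting these on the job rather than in the global section keeps the longer timeout from applying to other targets.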

Thanks Brad, that did seem to be the case. We’re using MySQL, but we’re seeing our scrapes take around 11-12 seconds now. I suppose this is a byproduct of how much history we’re storing in the database now.