Connection issues with services (1.0.0-rc.2, k8s)


#1

Services do not work in k8s (EKS).
This is my k8s configuration for Drone (I have an external gateway with SSL for incoming connections):

---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: drone-rbac
subjects:
  - kind: ServiceAccount
    # Reference to upper's `metadata.name`
    name: default
    # Reference to upper's `metadata.namespace`
    namespace: default
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io

---
apiVersion: v1
kind: Service
metadata:
  name: drone

spec:
  selector:
    role: service
    app: drone

  ports:
  - port: 80
    targetPort: http

  clusterIP: None

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: drone
  labels:
    role: service
    app: drone

spec:
  replicas: 1
  selector:
    matchLabels:
      role: service
      app: drone

  template:
    metadata:
      labels:
        role: service
        app: drone

    spec:
      nodeSelector:
        role: general

      containers:
      - name: drone
        image: drone/drone:1.0.0-rc.2
        env:
        - name: DRONE_KUBERNETES_ENABLED
          value: "true"
        - name: DRONE_KUBERNETES_NAMESPACE
          value: "default"
        - name: DRONE_GITHUB_SERVER
          value: "https://github.com"
        - name: DRONE_GITHUB_CLIENT_ID
          value: "********************"
        - name: DRONE_GITHUB_CLIENT_SECRET
          value: "****************************************"
        - name: DRONE_ORGS
          value: "ticketscloud"
        - name: DRONE_ADMIN
          value: "zzzsochi"
        - name: DRONE_OPEN
          value: "true"
        - name: DRONE_RPC_SECRET
          value: "2443f5bb10a7004f7faa42b3d4e21f98"
        - name: DRONE_SERVER_HOST
          value: "drone2.******.***"
        - name: DRONE_SERVER_PROTO
          value: "https"
        - name: DRONE_TLS_AUTOCERT
          value: "false"
        - name: DRONE_DATABASE_DRIVER
          value: "postgres"
        - name: DRONE_DATABASE_DATASOURCE
          value: "postgres://drone:drone@drone-db.default:5432/postgres?sslmode=disable"

        ports:
          - name: http
            containerPort: 80

---
apiVersion: v1
kind: Service
metadata:
  name: drone-db
  labels:
    name: drone-db

spec:
  selector:
    role: service
    add: drone-db
  clusterIP: None
  ports:
  - port: 5432
    targetPort: postgres

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: drone-db

spec:
  storageClassName: general
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 5Gi

---
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: drone-db

spec:
  replicas: 1

  selector:
    matchLabels:
      role: service
      add: drone-db

  template:
    metadata:
      labels:
        role: service
        add: drone-db

    spec:
      nodeSelector:
        role: general

      containers:
        - name: postgres
          image: postgres:9.6-alpine
          env:
            - name: POSTGRES_USER
              value: "drone"
            - name: POSTGRES_PASSWORD
              value: "drone"
            - name: POSTGRES_DB
              value: "drone"
          ports:
            - name: postgres
              containerPort: 5432
          volumeMounts:
            - name: postgres
              mountPath: /var/lib/postgres

      volumes:
        - name: postgres
          persistentVolumeClaim:
            claimName: drone-db

Here are three pipelines that demonstrate the problem:

kind: pipeline
name: mongo

workspace:
  base: /tmp
  path: "."

volumes:
- name: mongo
  temp: {}

services:
- name: mongo
  image: mongo:3.6
  ports:
  - 27017
  volumes:
  - name: mongo
    path: /data

steps:
- name: sleep
  image: busybox
  commands:
  - echo "Waiting for start"
  - sleep 30

- name: check-dns
  image: alpine
  commands:
  - apk add -U bind-tools
  - host mongo

- name: connect
  image: mongo:3.6
  commands:
  - mongo --host mongo --eval 'db.stats()'

---
kind: pipeline
name: redis

workspace:
  base: /tmp
  path: "."

volumes:
- name: redis
  temp: {}

services:
- name: redis
  image: redis:4
  ports:
  - 6379
  volumes:
  - name: redis
    path: /data

steps:
- name: sleep
  image: busybox
  commands:
  - echo "Waiting for start"
  - sleep 30

- name: check-dns
  image: alpine
  commands:
  - apk add -U bind-tools
  - host redis

- name: connect
  image: redis:4
  commands:
  - redis-cli -h redis keys '*'

---
kind: pipeline
name: postgres

workspace:
  base: /tmp
  path: "."

volumes:
- name: postgres
  temp: {}

services:
- name: postgres
  image: postgres:9.6-alpine
  environment:
    POSTGRES_USER: test
    POSTGRES_PASSWORD: test
    POSTGRES_DB: test
  ports:
  - 5432
  volumes:
  - name: postgres
    path: /var/lib/postgres

steps:
- name: sleep
  image: busybox
  commands:
  - echo "Waiting for start"
  - sleep 30

- name: check-dns
  image: alpine
  commands:
  - apk add -U bind-tools
  - host postgres

- name: connect
  image: postgres:9.6-alpine
  commands:
  - echo "Must be auth error, but connection error"
  - psql -h postgres test test -c 'select * from pg_catalog.pg_config;'
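
As an aside, the fixed `sleep 30` in the pipelines above only guesses when a service is ready. A minimal sketch in Go of polling the service port instead is shown below. `waitForTCP` and `listenLocal` are hypothetical helpers written for this illustration, not part of Drone, and the sketch only makes a connection failure explicit; it does not address the cause of the failure discussed in this thread.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// waitForTCP is a hypothetical helper: it dials addr until a TCP connection
// succeeds or the attempts run out, pausing for delay between attempts.
// attempts is expected to be at least 1.
func waitForTCP(addr string, attempts int, delay time.Duration) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
		if err == nil {
			conn.Close()
			return nil
		}
		lastErr = err
		time.Sleep(delay)
	}
	return fmt.Errorf("%s not reachable after %d attempts: %v", addr, attempts, lastErr)
}

// listenLocal opens a listener on a free local port so the demo below has
// something to connect to; it stands in for a real service container.
func listenLocal() (string, func(), error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", nil, err
	}
	return ln.Addr().String(), func() { ln.Close() }, nil
}

func main() {
	addr, stop, err := listenLocal()
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	defer stop()
	// In a pipeline step the address would be something like "redis:6379".
	fmt.Println("wait result:", waitForTCP(addr, 5, 100*time.Millisecond))
}
```

With a check like this, a step fails with a clear "not reachable" message instead of relying on the client tool's own timeout.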

This does not work. All pipelines fail with connection errors. :frowning:

The logs of all three services are below.

Mongo:

2018-12-20T20:54:46.316+0000 I CONTROL [initandlisten] MongoDB starting : pid=1 port=27017 dbpath=/data/db 64-bit host=kacb4tdn8osfdjasmm6uk5u54qezjhpb
2018-12-20T20:54:46.316+0000 I CONTROL [initandlisten] db version v3.6.9
2018-12-20T20:54:46.316+0000 I CONTROL [initandlisten] git version: 167861a164723168adfaaa866f310cb94010428f
2018-12-20T20:54:46.317+0000 I CONTROL [initandlisten] OpenSSL version: OpenSSL 1.1.0f 25 May 2017
2018-12-20T20:54:46.317+0000 I CONTROL [initandlisten] allocator: tcmalloc
2018-12-20T20:54:46.317+0000 I CONTROL [initandlisten] modules: none
2018-12-20T20:54:46.317+0000 I CONTROL [initandlisten] build environment:
2018-12-20T20:54:46.317+0000 I CONTROL [initandlisten] distmod: debian92
2018-12-20T20:54:46.317+0000 I CONTROL [initandlisten] distarch: x86_64
2018-12-20T20:54:46.317+0000 I CONTROL [initandlisten] target_arch: x86_64
2018-12-20T20:54:46.317+0000 I CONTROL [initandlisten] options: { net: { bindIpAll: true } }
2018-12-20T20:54:46.317+0000 E STORAGE [initandlisten] Failed to set up listener: SocketException: Permission denied
2018-12-20T20:54:46.317+0000 I CONTROL [initandlisten] now exiting
2018-12-20T20:54:46.317+0000 I CONTROL [initandlisten] shutting down with code:48

Redis:

1:C 20 Dec 2018 21:00:31.634 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 20 Dec 2018 21:00:31.634 # Redis version=5.0.3, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 20 Dec 2018 21:00:31.634 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
1:M 20 Dec 2018 21:00:31.638 * Running mode=standalone, port=6379.
1:M 20 Dec 2018 21:00:31.638 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:M 20 Dec 2018 21:00:31.638 # Server initialized
1:M 20 Dec 2018 21:00:31.638 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
1:M 20 Dec 2018 21:00:31.638 * Ready to accept connections

Postgres:

The files belonging to this database system will be owned by user "test".
This user must also own the server process.
The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".
Data page checksums are disabled.
fixing permissions on existing directory /var/lib/postgresql/data ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
running bootstrap script ... ok
sh: locale: not found
performing post-bootstrap initialization ... No usable system locales were found.
Use the option "--debug" to see details.
ok
syncing data to disk ... ok
Success. You can now start the database server using:
    pg_ctl -D /var/lib/postgresql/data -l logfile start
WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or --auth-local and --auth-host, the next time you run initdb.
waiting for server to start....
LOG: database system was shut down at 2018-12-21 19:34:18 UTC
LOG: MultiXact member wraparound protections are now enabled
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
done
server started
CREATE DATABASE
/usr/local/bin/docker-entrypoint.sh: ignoring /docker-entrypoint-initdb.d/*
waiting for server to shut down...
LOG: received fast shutdown request
LOG: aborting any active transactions.
LOG: autovacuum launcher shutting down
LOG: shutting down
LOG: database system is shut down
done
server stopped
PostgreSQL init process complete; ready for start up.
LOG: database system was shut down at 2018-12-21 19:34:20 UTC
LOG: MultiXact member wraparound protections are now enabled
LOG: database system is ready to accept connections 
LOG: autovacuum launcher started

Mongo shut down immediately, but its icon in the web UI is still green.
After the failure on the third step, the logs of redis and postgres disappear from the web interface (but mongo's do not).

P.S. I really want to use this setup and am ready to live with many issues (e.g. my problems with plugins/ecr or interface bugs), but this tool must work for my simple cases. :-/


#2

Oh! I forgot the failure logs!

Mongo:

+ mongo --host mongo --eval 'db.stats()'
MongoDB shell version v3.6.9
connecting to: mongodb://mongo:27017/
2018-12-21T21:20:45.401+0000 W NETWORK [thread1] Failed to connect to 172.20.127.223:27017 after 5000ms milliseconds, giving up.
2018-12-21T21:20:45.402+0000 E QUERY [thread1] Error: couldn't connect to server mongo:27017, connection attempt failed : connect@src/mongo/shell/mongo.js:257:13 @(connect):1:6
exception: connect failed

Redis:

+ redis-cli -h redis keys '*'
Could not connect to Redis at redis:6379: Connection timed out

Postgres:

+ echo "Must be auth error, but connection error"
Must be auth error, but connection error
+ psql -h postgres test test -c 'select * from pg_catalog.pg_config;'
psql: could not connect to server: Operation timed out
Is the server running on host "postgres" (172.20.246.174) and accepting
TCP/IP connections on port 5432?

#3

Please see my post here: Contributing to Drone for Kubernetes

I was able to successfully test the configuration below on Kubernetes (Digital Ocean and Minikube):

kind: pipeline
name: default

steps:
- name: test
  image: redis
  commands:
  - sleep 5
  - redis-cli -h $REDIS_SERVICE_HOST ping

services:
- name: redis
  image: redis
  ports:
  - 6379

I am not able to reproduce issues with services on Kubernetes and my time is currently allocated to other tasks, so this is not something I will be able to investigate further. I would, however, encourage you to look at the source code and see if you can identify any issues with the implementation and / or send a patch: Contributing to Drone for Kubernetes


#4

Sorry, this does not work in real life. I tried it on EKS and DO, and I found out why.

This is the service:

$ kubectl -n kxg205txrrqiibke5rwld4y55nd9es88 describe service/redis
...
Selector:          io.drone.step.name=redis
...
Endpoints:         <none>
...

But the pod running the redis service does not have the label io.drone.step.name=redis:

$ kubectl -n kxg205txrrqiibke5rwld4y55nd9es88 describe pod/138fwoy30msir2pxpzjw3b0d8fdlo2pm
...
Labels:             io.drone.build.number=11
                    io.drone.repo.name=ticketscloud
                    io.drone.repo.namespace=ticketscloud
                    io.drone.stage.number=1
                    io.drone.stage.ttl=1545526292
...

If I add this label manually, everything works.
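
To illustrate why the Endpoints list is empty: an equality-based Service selector only picks pods that carry every key/value pair in the selector. A minimal sketch in Go, using the labels from the pod description above; `selectorMatches` is a hypothetical helper that imitates the matching rule, not Kubernetes code:

```go
package main

import "fmt"

// selectorMatches reports whether every key/value pair in a (non-empty)
// selector is present in labels, imitating equality-based label selection.
func selectorMatches(selector, labels map[string]string) bool {
	for key, want := range selector {
		if got, ok := labels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	selector := map[string]string{"io.drone.step.name": "redis"}
	// Labels copied from the pod description above.
	pod := map[string]string{
		"io.drone.build.number":   "11",
		"io.drone.repo.name":      "ticketscloud",
		"io.drone.repo.namespace": "ticketscloud",
		"io.drone.stage.number":   "1",
		"io.drone.stage.ttl":      "1545526292",
	}
	fmt.Println("before manual label:", selectorMatches(selector, pod))
	pod["io.drone.step.name"] = "redis" // the manual fix
	fmt.Println("after manual label:", selectorMatches(selector, pod))
}
```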

I have not found the place where these labels are created:


Somewhere in drone? Not in drone-yaml. I could add something like step.Metadata.Labels["io.drone.step.name"] = findName(*step), but that is not the right fix.

P.S. Am I the first user of this feature in production? :slight_smile:


#5

There are a number of teams testing Drone on Kubernetes. I am running this on Digital Ocean and I cannot reproduce any issues with services. Did you try running drone-runtime --kube-debug .drone.json as specified in my development guide to see what values are being set when creating the kubernetes resources?

---
kind: Service
metadata:
  creationTimestamp: null
  name: redis
  namespace: 07l5xq6ivufj1fymnceim5hroow9cutg
spec:
  ports:
  - port: 6379
    targetPort: 6379
  selector:
    io.drone.step.name: redis
  type: ClusterIP
status:
  loadBalancer: {}
---
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    io.drone.step.name: redis
  name: 3iadqvd5i3hijw76ztcx04tj4ae3sf4y
  namespace: 07l5xq6ivufj1fymnceim5hroow9cutg
spec:
  ...

I highly encourage reading through the development guide and exploring the drone/drone-runtime and drone/drone-yaml repositories. These two repositories handle everything.

See https://github.com/drone/drone-yaml/search?q=io.drone.step.name&unscoped_q=io.drone.step.name


#6

Where did these wrong labels on my pods come from?


#7

I see, drone-yaml creates the correct labels, and drone-runtime works correctly.

  "steps": [
    {
      "metadata": {
        "uid": "ap5q5imwuan72aejahzjg21cma564qnf",
        "namespace": "6qx89gsv87nbeqeu6gz7duxv0zx82idx",
        "name": "redis",
        "labels": {
          "io.drone.step.name": "redis"
        }
      },
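
A check like the one above can be scripted. The following Go sketch lists the labels of each step in a compiled pipeline file. It models only the fields visible in the fragment above and assumes `steps` is a top-level key; `stepLabels` is a hypothetical helper, and the real file contains many more fields.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// spec models only the fields visible in the fragment above.
// Assumption: "steps" sits at the top level of the compiled JSON.
type spec struct {
	Steps []struct {
		Metadata struct {
			Name   string            `json:"name"`
			Labels map[string]string `json:"labels"`
		} `json:"metadata"`
	} `json:"steps"`
}

// stepLabels returns the labels of each step, keyed by step name.
func stepLabels(data []byte) (map[string]map[string]string, error) {
	var s spec
	if err := json.Unmarshal(data, &s); err != nil {
		return nil, err
	}
	out := make(map[string]map[string]string, len(s.Steps))
	for _, step := range s.Steps {
		out[step.Metadata.Name] = step.Metadata.Labels
	}
	return out, nil
}

func main() {
	data := []byte(`{"steps":[{"metadata":{"name":"redis","labels":{"io.drone.step.name":"redis"}}}]}`)
	labels, err := stepLabels(data)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for name, l := range labels {
		fmt.Println(name, l)
	}
}
```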

I’ve looked at the source code:
https://github.com/drone/drone-yaml/search?q=io.drone.step.number&unscoped_q=io.drone.step.number
The search is empty. But my service pods have this label. How?


#8

Confirming similar behavior with rc3 on AWS EKS. I do not see service pods being labeled with io.drone.step.number:

[root@ip-172-31-40-147 ~]# kubectl -n n4box0w83s4o882hskqc13nyqsosccql describe pods 06sf5pfp334zltjbn0fyzlugoz8g73lu
Name:         06sf5pfp334zltjbn0fyzlugoz8g73lu
Namespace:    n4box0w83s4o882hskqc13nyqsosccql
Node:         ip-192-168-64-60.us-west-2.compute.internal/192.168.64.60
Start Time:   Thu, 03 Jan 2019 14:46:31 +0000
Labels:       io.drone.build.number=23
              io.drone.repo.name=terraform-provider-spotinst
              io.drone.repo.namespace=kmcgrath
              io.drone.stage.number=1
              io.drone.stage.ttl=1546533991
Annotations:  <none>
Status:       Running
IP:           192.168.108.57

Manually adding the label connects the service:

kubectl -n n4box0w83s4o882hskqc13nyqsosccql label pods 06sf5pfp334zltjbn0fyzlugoz8g73lu io.drone.step.name=website

#9

@kmcgrath thanks for the analysis, this really helped.

The io.drone.step.number was present in my tests, however, your sample output included a number of labels that were not present in my tests. Basically everything below:

Labels:       io.drone.build.number=23
              io.drone.repo.name=terraform-provider-spotinst
              io.drone.repo.namespace=kmcgrath
              io.drone.stage.number=1
              io.drone.stage.ttl=1546533991

I introduced a regression a few weeks ago when I added these labels. Instead of appending them to the existing set of labels, I was replacing the whole set. This resulted in io.drone.step.number being removed. I adjusted the code accordingly:

+		if s.Step.Metadata.Labels == nil {
			s.Step.Metadata.Labels = map[string]string{}
+		}
		s.Step.Metadata.Labels["io.drone.build.number"] = fmt.Sprint(m.Build.Number)
		s.Step.Metadata.Labels["io.drone.repo.namespace"] = m.Repo.Namespace
		s.Step.Metadata.Labels["io.drone.repo.name"] = m.Repo.Name
		s.Step.Metadata.Labels["io.drone.stage.number"] = fmt.Sprint(m.Stage.Number)

I patched the existing rc.3 image. Can you force pull and let me know if the problem is resolved?
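
The difference between replacing and appending can be shown with a small standalone Go sketch. The function names `replaceLabels` and `mergeLabels` are hypothetical and simplified to a single added label; they only imitate the before and after behavior of the patch above.

```go
package main

import "fmt"

// replaceLabels imitates the regression: it always starts from a fresh map,
// so any label the step already carried is lost. The existing argument is
// deliberately ignored.
func replaceLabels(existing map[string]string, buildNumber int) map[string]string {
	labels := map[string]string{}
	labels["io.drone.build.number"] = fmt.Sprint(buildNumber)
	return labels
}

// mergeLabels imitates the fix: a map is allocated only when none exists,
// and the new label is added alongside the existing ones.
func mergeLabels(existing map[string]string, buildNumber int) map[string]string {
	if existing == nil {
		existing = map[string]string{}
	}
	existing["io.drone.build.number"] = fmt.Sprint(buildNumber)
	return existing
}

func main() {
	before := replaceLabels(map[string]string{"io.drone.step.name": "redis"}, 23)
	after := mergeLabels(map[string]string{"io.drone.step.name": "redis"}, 23)
	fmt.Println("replaced:", before)
	fmt.Println("merged:  ", after)
}
```

In the replaced map the step label is gone, which is why the Service selector stops matching the pod.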


#10

Fix works! Thank you!