
Sonobuoy Conformance on EKS-A on Baremetal shows failures. #3423

Open
elamaran11 opened this issue Sep 22, 2022 · 8 comments
Assignees: jacobweinstock
Labels: area/providers/tinkerbell, kind/bug, team/providers

Comments


elamaran11 commented Sep 22, 2022

I was trying to run a conformance test using Sonobuoy against EKS-A deployed on bare metal with partner hardware from Equinix. The Sonobuoy validation failed with the following errors when I ran it with Sonobuoy v0.56.10.

When I ran with Sonobuoy v0.50.0, the validation never made any progress; that is a separate problem I want to report.

I'd appreciate it if you could take a look into these failures and let us know whether they can be ignored or whether a bug fix is needed in the EKS-A version. Thanks in advance.
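(For context, the exact command isn't shown above; a typical invocation for a full conformance run, as a minimal sketch, would be:)

```sh
# Assumed invocation: run the certified-conformance suite against the
# cluster selected by the current kubeconfig and block until it finishes.
sonobuoy run --mode certified-conformance --wait
```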


[Fail] [sig-scheduling] SchedulerPredicates [Serial] [It] validates that NodeSelector is respected if not matching  [Conformance]
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/scheduling/predicates.go:436

[Fail] [sig-scheduling] SchedulerPredicates [Serial] [It] validates that there exists conflict between pods with same hostPort and protocol but one using 0.0.0.0 hostIP [Conformance]
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/scheduling/predicates.go:1068

[Fail] [sig-apps] Daemon set [Serial] [It] should rollback without unnecessary restarts [Conformance]
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/apps/daemon_set.go:432

[Fail] [sig-scheduling] SchedulerPredicates [Serial] [It] validates resource limits of pods that are allowed to run  [Conformance]
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/scheduling/predicates.go:323
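(To iterate on just these specs rather than re-running the full suite, Sonobuoy's Ginkgo focus filter can help; the regex below is illustrative, not taken from the original report:)

```sh
# Re-run only the failing scheduler-predicate and daemon-set specs.
# --e2e-focus takes a Ginkgo focus regex; adjust as needed.
sonobuoy run --wait \
  --e2e-focus 'SchedulerPredicates|rollback without unnecessary restarts'
```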
@jacobweinstock jacobweinstock self-assigned this Sep 23, 2022
@jacobweinstock jacobweinstock added kind/bug Something isn't working area/providers/tinkerbell Tinkerbell provider related tasks and issues team/providers labels Sep 23, 2022
jacobweinstock (Member) commented:

Hey @elamaran11, thanks for reporting this. I'm not seeing any failures testing on bare metal hardware with v0.56.10. My initial impression is that it may be something specific to Equinix Metal and/or the deployment options.

Would you mind sharing the details of your cluster creation, please (cluster spec, hardware csv, etc)? Are you following the Equinix guide here?

I will test on Equinix Metal to see about reproducing. Thanks again for the report!


elamaran11 commented Sep 25, 2022

@jacobweinstock Yes, I'm following that exact Equinix guide. Please give us an update as soon as you can reproduce and fix the issue.

Here is my hardware.csv file:

root@eksa-admin:~# cat hardware.csv
hostname,vendor,mac,ip_address,gateway,netmask,nameservers,disk,labels
eksa-node-cp-001,Equinix,10:70:fd:86:eb:f6,147.75.90.243,147.75.90.241,255.255.255.240,8.8.8.8,/dev/sda,type=cp
eksa-node-dp-001,Equinix,10:70:fd:7f:94:9e,147.75.90.244,147.75.90.241,255.255.255.240,8.8.8.8,/dev/sda,type=dp

And here is my cluster spec:
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: my-eksa-cluster
  namespace: default
spec:
  bundlesRef:
    apiVersion: anywhere.eks.amazonaws.com/v1alpha1
    name: bundles-15
    namespace: eksa-system
  clusterNetwork:
    cniConfig:
      cilium: {}
    pods:
      cidrBlocks:
      - 192.168.0.0/16
    services:
      cidrBlocks:
      - 10.96.0.0/12
  controlPlaneConfiguration:
    count: 1
    endpoint:
      host: 147.75.90.254
    machineGroupRef:
      kind: TinkerbellMachineConfig
      name: my-eksa-cluster-cp
  datacenterRef:
    kind: TinkerbellDatacenterConfig
    name: my-eksa-cluster
  kubernetesVersion: "1.23"
  managementCluster:
    name: my-eksa-cluster
  workerNodeGroupConfigurations:
  - count: 1
    machineGroupRef:
      kind: TinkerbellMachineConfig
      name: my-eksa-cluster
    name: md-0
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: TinkerbellDatacenterConfig
metadata:
  name: my-eksa-cluster
  namespace: default
spec:
  tinkerbellIP: 147.75.90.253
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: TinkerbellMachineConfig
metadata:
  annotations:
    anywhere.eks.amazonaws.com/control-plane: "true"
  name: my-eksa-cluster-cp
  namespace: default
spec:
  hardwareSelector:
    type: cp
  osFamily: bottlerocket
  templateRef:
    kind: TinkerbellTemplateConfig
    name: my-eksa-cluster
  users:
  - name: ec2-user
    sshAuthorizedKeys:
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: TinkerbellMachineConfig
metadata:
  name: my-eksa-cluster
  namespace: default
spec:
  hardwareSelector:
    type: dp
  osFamily: bottlerocket
  templateRef:
    kind: TinkerbellTemplateConfig
    name: my-eksa-cluster
  users:
  - name: ec2-user
    sshAuthorizedKeys:
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: TinkerbellTemplateConfig
metadata:
  name: my-eksa-cluster
  namespace: default
spec:
  template:
    global_timeout: 6000
    id: ""
    name: my-eksa-cluster
    tasks:
    - actions:
      - environment:
          COMPRESSED: "true"
          DEST_DISK: /dev/sda
          IMG_URL: https://anywhere-assets.eks.amazonaws.com/releases/bundles/15/artifacts/raw/1-23/bottlerocket-v1.23.7-eks-d-1-23-4-eks-a-15-amd64.img.gz
        image: public.ecr.aws/eks-anywhere/tinkerbell/hub/image2disk:6c0f0d437bde2c836d90b000312c8b25fa1b65e1-eks-a-15
        name: stream-image
        timeout: 600
      - environment:
          CONTENTS: |
            # Version is required, it will change as we support
            # additional settings
            version = 1

            # "eno1" is the interface name
            # Users may turn on dhcp4 and dhcp6 via boolean
            [enp1s0f0np0]
            dhcp4 = true
            dhcp6 = false
            # Define this interface as the "primary" interface
            # for the system.  This IP is what kubelet will use
            # as the node IP.  If none of the interfaces has
            # "primary" set, we choose the first interface in
            # the file
            primary = true
          DEST_DISK: /dev/sda12
          DEST_PATH: /net.toml
          DIRMODE: "0755"
          FS_TYPE: ext4
          GID: "0"
          MODE: "0644"
          UID: "0"
        image: public.ecr.aws/eks-anywhere/tinkerbell/hub/writefile:6c0f0d437bde2c836d90b000312c8b25fa1b65e1-eks-a-15
        name: write-netplan
        pid: host
        timeout: 90
      - environment:
          BOOTCONFIG_CONTENTS: |
            kernel {
                console = "ttyS1,115200n8"
            }
          DEST_DISK: /dev/sda12
          DEST_PATH: /bootconfig.data
          DIRMODE: "0700"
          FS_TYPE: ext4
          GID: "0"
          MODE: "0644"
          UID: "0"
        image: public.ecr.aws/eks-anywhere/tinkerbell/hub/writefile:6c0f0d437bde2c836d90b000312c8b25fa1b65e1-eks-a-15
        name: write-bootconfig
        pid: host
        timeout: 90
      - environment:
          DEST_DISK: /dev/sda12
          DEST_PATH: /user-data.toml
          DIRMODE: "0700"
          FS_TYPE: ext4
          GID: "0"
          HEGEL_URLS: http://147.75.90.242:50061,http://147.75.90.253:50061
          MODE: "0644"
          UID: "0"
        image: public.ecr.aws/eks-anywhere/tinkerbell/hub/writefile:6c0f0d437bde2c836d90b000312c8b25fa1b65e1-eks-a-15
        name: write-user-data
        pid: host
        timeout: 90
      - image: public.ecr.aws/eks-anywhere/tinkerbell/hub/reboot:6c0f0d437bde2c836d90b000312c8b25fa1b65e1-eks-a-15
        name: reboot-image
        pid: host
        timeout: 90
        volumes:
        - /worker:/worker
      name: my-eksa-cluster
      volumes:
      - /dev:/dev
      - /dev/console:/dev/console
      - /lib/firmware:/lib/firmware:ro
      worker: '{{.device_1}}'
    version: "0.1"


jacobweinstock commented Oct 4, 2022

Hey @elamaran11. Here are the results from my conformance test. I wasn't able to reproduce the failures you posted. I did get one failure, but only because my cluster had a single worker node. One thing that did stand out was the difference in Bottlerocket and Kubernetes versions.
Yours: bottlerocket-v1.23.7-eks-d-1-23-4-eks-a-15-amd64
Mine: bottlerocket-v1.23.9-eks-d-1-23-5-eks-a-17-amd64

@elamaran11, would you mind doing another run on your side?

If you have only one worker node, Sonobuoy will throw the following error: [sig-apps] Daemon set [Serial] should rollback without unnecessary restarts [Conformance] -- ref
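(A quick way to confirm whether a cluster is in that single-worker situation, as a sketch; this assumes control-plane nodes carry the standard node-role.kubernetes.io/control-plane label:)

```sh
# Count schedulable worker nodes; the daemon-set rollback test needs
# more than one node to exercise a rollback without restarts.
kubectl get nodes -l '!node-role.kubernetes.io/control-plane' --no-headers | wc -l
```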

I followed the guide from here, https://github.com/equinix-labs/terraform-equinix-metal-eks-anywhere, to set up the cluster.

cd terraform-equinix-metal-eks-anywhere/examples/deploy
terraform init
terraform apply

Then, on the admin node, I ran the conformance test with Sonobuoy v0.56.10.

sonobuoy run --wait
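
(The summary below comes from inspecting the results tarball; a minimal sketch of that step:)

```sh
results=$(sonobuoy retrieve)   # download the results tarball; prints its path
sonobuoy results "$results"    # print the per-plugin pass/fail summary
```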

Results

Plugin: e2e
Status: failed
Total: 7050
Passed: 343
Failed: 1
Skipped: 6706

Failed tests:
[sig-apps] Daemon set [Serial] should rollback without unnecessary restarts [Conformance]

Plugin: systemd-logs
Status: passed
Total: 2
Passed: 2
Failed: 0
Skipped: 0

Run Details:
API Server version: v1.23.9-eks-68c1cba
Node health: 2/2 (100%)
Pods health: 35/36 (97%)
Details for failed pods:
sonobuoy/sonobuoy-e2e-job-62d8ed75dd74406a Ready:False: ContainersNotReady: containers with unready status: [e2e sonobuoy-worker]
Errors detected in files:
Errors:
1705 podlogs/kube-system/cilium-jdlsz/logs/cilium-agent.txt
1347 podlogs/kube-system/kube-controller-manager-139.178.68.19/logs/kube-controller-manager.txt
 588 podlogs/sonobuoy/sonobuoy-e2e-job-62d8ed75dd74406a/logs/e2e.txt
 107 podlogs/kube-system/kube-apiserver-139.178.68.19/logs/kube-apiserver.txt
  70 podlogs/kube-system/kube-scheduler-139.178.68.19/logs/kube-scheduler.txt
   8 podlogs/kube-system/kube-proxy-vkptp/logs/kube-proxy.txt
   8 podlogs/kube-system/kube-proxy-tp52s/logs/kube-proxy.txt
   5 podlogs/kube-system/cilium-6thdk/logs/cilium-agent.txt
   1 podlogs/kube-system/etcd-139.178.68.19/logs/etcd.txt
   1 podlogs/kube-system/kube-vip-139.178.68.19/logs/kube-vip.txt
Warnings:
486 podlogs/kube-system/kube-controller-manager-139.178.68.19/logs/kube-controller-manager.txt
379 podlogs/kube-system/cilium-jdlsz/logs/cilium-agent.txt
103 podlogs/kube-system/kube-apiserver-139.178.68.19/logs/kube-apiserver.txt
 37 podlogs/kube-system/kube-scheduler-139.178.68.19/logs/kube-scheduler.txt
 14 podlogs/sonobuoy/sonobuoy-e2e-job-62d8ed75dd74406a/logs/e2e.txt
 10 podlogs/kube-system/cilium-6thdk/logs/cilium-agent.txt
  4 podlogs/kube-system/etcd-139.178.68.19/logs/etcd.txt
  2 podlogs/sonobuoy/sonobuoy/logs/kube-sonobuoy.txt

Here is the final generated EKSA cluster config:

apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: my-eksa-cluster
spec:
  clusterNetwork:
    cniConfig:
      cilium: {}
    pods:
      cidrBlocks:
        - 192.168.0.0/16
    services:
      cidrBlocks:
        - 10.96.0.0/12
  controlPlaneConfiguration:
    count: 1
    endpoint:
      host: "139.178.68.30"
    machineGroupRef:
      kind: TinkerbellMachineConfig
      name: my-eksa-cluster-cp
  datacenterRef:
    kind: TinkerbellDatacenterConfig
    name: my-eksa-cluster
  kubernetesVersion: "1.23"
  managementCluster:
    name: my-eksa-cluster
  workerNodeGroupConfigurations:
    - count: 1
      machineGroupRef:
        kind: TinkerbellMachineConfig
        name: my-eksa-cluster
      name: md-0
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: TinkerbellDatacenterConfig
metadata:
  name: my-eksa-cluster
spec:
  tinkerbellIP: "139.178.68.29"
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: TinkerbellMachineConfig
metadata:
  name: my-eksa-cluster-cp
spec:
  hardwareSelector:
    type: cp
  osFamily: bottlerocket
  templateRef:
    kind: TinkerbellTemplateConfig
    name: cp-my-eksa-cluster-m3-small-x86
  users:
    - name: ec2-user
      sshAuthorizedKeys:
        - ssh-rsa AA...
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: TinkerbellMachineConfig
metadata:
  name: my-eksa-cluster
spec:
  hardwareSelector:
    type: dp
  osFamily: bottlerocket
  templateRef:
    kind: TinkerbellTemplateConfig
    name: dp-my-eksa-cluster-m3-small-x86
  users:
    - name: ec2-user
      sshAuthorizedKeys:
        - ssh-rsa AA...
---
{}
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: TinkerbellTemplateConfig
metadata:
  name: cp-my-eksa-cluster-m3-small-x86
spec:
  template:
    global_timeout: 6000
    id: ""
    name: cp-my-eksa-cluster-m3-small-x86
    tasks:
    - actions:
      - environment:
          COMPRESSED: "true"
          DEST_DISK: /dev/sda
          IMG_URL: https://anywhere-assets.eks.amazonaws.com/releases/bundles/17/artifacts/raw/1-23/bottlerocket-v1.23.9-eks-d-1-23-5-eks-a-17-amd64.img.gz
        image: public.ecr.aws/eks-anywhere/tinkerbell/hub/image2disk:6c0f0d437bde2c836d90b000312c8b25fa1b65e1-eks-a-17
        name: stream-image
        timeout: 600
      - environment:
          CONTENTS: |
            # Version is required, it will change as we support
            # additional settings
            version = 1

            # "eno1" is the interface name
            # Users may turn on dhcp4 and dhcp6 via boolean
            [enp1s0f0np0]
            dhcp4 = true
            dhcp6 = false
            # Define this interface as the "primary" interface
            # for the system.  This IP is what kubelet will use
            # as the node IP.  If none of the interfaces has
            # "primary" set, we choose the first interface in
            # the file
            primary = true
          DEST_DISK: /dev/sda12
          DEST_PATH: /net.toml
          DIRMODE: "0755"
          FS_TYPE: ext4
          GID: "0"
          MODE: "0644"
          UID: "0"
        image: public.ecr.aws/eks-anywhere/tinkerbell/hub/writefile:6c0f0d437bde2c836d90b000312c8b25fa1b65e1-eks-a-17
        name: write-netplan
        pid: host
        timeout: 90
      - environment:
          BOOTCONFIG_CONTENTS: |
            kernel {
                console = "ttyS1,115200n8"
            }
          DEST_DISK: /dev/sda12
          DEST_PATH: /bootconfig.data
          DIRMODE: "0700"
          FS_TYPE: ext4
          GID: "0"
          MODE: "0644"
          UID: "0"
        image: public.ecr.aws/eks-anywhere/tinkerbell/hub/writefile:6c0f0d437bde2c836d90b000312c8b25fa1b65e1-eks-a-17
        name: write-bootconfig
        pid: host
        timeout: 90
      - environment:
          DEST_DISK: /dev/sda12
          DEST_PATH: /user-data.toml
          DIRMODE: "0700"
          FS_TYPE: ext4
          GID: "0"
          HEGEL_URLS: http://139.178.68.18:50061,http://139.178.68.29:50061
          MODE: "0644"
          UID: "0"
        image: public.ecr.aws/eks-anywhere/tinkerbell/hub/writefile:6c0f0d437bde2c836d90b000312c8b25fa1b65e1-eks-a-17
        name: write-user-data
        pid: host
        timeout: 90
      - image: public.ecr.aws/eks-anywhere/tinkerbell/hub/reboot:6c0f0d437bde2c836d90b000312c8b25fa1b65e1-eks-a-17
        name: reboot-image
        pid: host
        timeout: 90
        volumes:
        - /worker:/worker
      name: cp-my-eksa-cluster-m3-small-x86
      volumes:
        - /dev:/dev
        - /dev/console:/dev/console
        - /lib/firmware:/lib/firmware:ro
      worker: '{{.device_1}}'
    version: "0.1"
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: TinkerbellTemplateConfig
metadata:
  name: dp-my-eksa-cluster-m3-small-x86
spec:
  template:
    global_timeout: 6000
    id: ""
    name: dp-my-eksa-cluster-m3-small-x86
    tasks:
    - actions:
      - environment:
          COMPRESSED: "true"
          DEST_DISK: /dev/sda
          IMG_URL: https://anywhere-assets.eks.amazonaws.com/releases/bundles/17/artifacts/raw/1-23/bottlerocket-v1.23.9-eks-d-1-23-5-eks-a-17-amd64.img.gz
        image: public.ecr.aws/eks-anywhere/tinkerbell/hub/image2disk:6c0f0d437bde2c836d90b000312c8b25fa1b65e1-eks-a-17
        name: stream-image
        timeout: 600
      - environment:
          CONTENTS: |
            # Version is required, it will change as we support
            # additional settings
            version = 1

            # "eno1" is the interface name
            # Users may turn on dhcp4 and dhcp6 via boolean
            [enp1s0f0np0]
            dhcp4 = true
            dhcp6 = false
            # Define this interface as the "primary" interface
            # for the system.  This IP is what kubelet will use
            # as the node IP.  If none of the interfaces has
            # "primary" set, we choose the first interface in
            # the file
            primary = true
          DEST_DISK: /dev/sda12
          DEST_PATH: /net.toml
          DIRMODE: "0755"
          FS_TYPE: ext4
          GID: "0"
          MODE: "0644"
          UID: "0"
        image: public.ecr.aws/eks-anywhere/tinkerbell/hub/writefile:6c0f0d437bde2c836d90b000312c8b25fa1b65e1-eks-a-17
        name: write-netplan
        pid: host
        timeout: 90
      - environment:
          BOOTCONFIG_CONTENTS: |
            kernel {
                console = "ttyS1,115200n8"
            }
          DEST_DISK: /dev/sda12
          DEST_PATH: /bootconfig.data
          DIRMODE: "0700"
          FS_TYPE: ext4
          GID: "0"
          MODE: "0644"
          UID: "0"
        image: public.ecr.aws/eks-anywhere/tinkerbell/hub/writefile:6c0f0d437bde2c836d90b000312c8b25fa1b65e1-eks-a-17
        name: write-bootconfig
        pid: host
        timeout: 90
      - environment:
          DEST_DISK: /dev/sda12
          DEST_PATH: /user-data.toml
          DIRMODE: "0700"
          FS_TYPE: ext4
          GID: "0"
          HEGEL_URLS: http://139.178.68.18:50061,http://139.178.68.29:50061
          MODE: "0644"
          UID: "0"
        image: public.ecr.aws/eks-anywhere/tinkerbell/hub/writefile:6c0f0d437bde2c836d90b000312c8b25fa1b65e1-eks-a-17
        name: write-user-data
        pid: host
        timeout: 90
      - image: public.ecr.aws/eks-anywhere/tinkerbell/hub/reboot:6c0f0d437bde2c836d90b000312c8b25fa1b65e1-eks-a-17
        name: reboot-image
        pid: host
        timeout: 90
        volumes:
        - /worker:/worker
      name: dp-my-eksa-cluster-m3-small-x86
      volumes:
        - /dev:/dev
        - /dev/console:/dev/console
        - /lib/firmware:/lib/firmware:ro
      worker: '{{.device_1}}'
    version: "0.1"

Here is the hardware.csv:

hostname,vendor,mac,ip_address,gateway,netmask,nameservers,disk,labels
eksa-gi3g9q-node-cp-001,Equinix,10:70:fd:7f:99:a2,139.178.68.19,139.178.68.17,255.255.255.240,8.8.8.8,/dev/sda,type=cp
eksa-gi3g9q-node-dp-001,Equinix,10:70:fd:86:ee:aa,139.178.68.20,139.178.68.17,255.255.255.240,8.8.8.8,/dev/sda,type=dp

jacobweinstock (Member) commented:

Hey @displague and @cprivitere, would either of you, by chance, have any thoughts or insights on this?


displague commented Oct 4, 2022

Looks like this was the only failed test, as you pointed out, because of the limited cluster size. What does it test?

Failed tests: [sig-apps] Daemon set [Serial] should rollback without unnecessary restarts [Conformance] 

We've released v0.3.2, but I can't think of any significant changes you'd encounter relative to the previous builds.

jacobweinstock (Member) commented:

> Looks like this was the only failed test, as you pointed out, because of the limited cluster size. What does it test?
>
> Failed tests: [sig-apps] Daemon set [Serial] should rollback without unnecessary restarts [Conformance]
>
> We've released v0.3.2, but I can't think of any significant changes you'd encounter relative to the previous builds.

Hey @displague, thanks for the response. Any insight into @elamaran11's original failures at the very top, by chance?


elamaran11 commented Dec 8, 2022

Team, any updates on this issue? We installed EKS-A on Dell hardware for a customer, ran Sonobuoy, and hit the same failures.

elamaran11 (Author) commented:

Team, any updates on this issue? We installed EKS-A on Intel/Dell hardware for a customer, ran Sonobuoy, and still see the following failures:

Summarizing 3 Failures:

[Fail] [sig-apps] Daemon set [Serial] [It] should rollback without unnecessary restarts [Conformance] 
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/apps/daemon_set.go:432

[Fail] [sig-scheduling] SchedulerPredicates [Serial] [It] validates that there exists conflict between pods with same hostPort and protocol but one using 0.0.0.0 hostIP [Conformance] 
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/scheduling/predicates.go:1068

[Fail] [sig-scheduling] SchedulerPredicates [Serial] [It] validates resource limits of pods that are allowed to run  [Conformance] 
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/scheduling/predicates.go:323

Ran 346 of 7044 Specs in 8646.423 seconds
FAIL! -- 343 Passed | 3 Failed | 0 Pending | 6698 Skipped
--- FAIL: TestE2E (8653.07s)
