/var/lib/dockershim and /var/lib/cri-dockerd should not be ephemeral #2993

Closed
Oats87 opened this issue Jul 27, 2022 · 7 comments

Comments

@Oats87
Contributor

Oats87 commented Jul 27, 2022

RKE version:
RKE <= 1.3.12?
Docker version: (docker version,docker info preferred)
Doesn't matter
Operating system and kernel: (cat /etc/os-release, uname -r preferred)
Doesn't matter
Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
Doesn't matter
cluster.yml file:
Doesn't matter
Steps to Reproduce:
The /var/lib/dockershim (/var/lib/cri-dockerd for >= 1.24.x) directory for the kubelet container should not be ephemeral. The kubelet/dockershim relies on this directory to store runtime metadata such as portMappings for management of pods.

We ran into this issue when debugging a case where hostport pods were not being properly cleaned up on tear down (with duplicate iptables rules being found) after Kubernetes version upgrades.

This was a combination of several problems. One was that we updated the CNI plugins from v0.8.6 to v1.0.0, which brought in a change (see: containernetworking/plugins#509) that performs a NOOP if portMappings are not specified. The other was that the kubelet/dockershim was sending an empty portMappings after a Kubernetes version upgrade: the kubelet container was replaced, the /var/lib/dockershim folder was empty, and so a proper portMappings struct could not be constructed for teardown.
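
A rough way to look for the duplicate iptables rules mentioned above is to check iptables-save output for rule lines that occur more than once. This is only a sketch: the function name duplicate_rules is made up for illustration, and it assumes the duplicates show up as byte-identical rule lines.

```shell
# Hypothetical helper (not part of RKE): print each rule line that occurs
# more than once in iptables-save output read from stdin.
duplicate_rules() {
    # Keep only rule lines ("-A CHAIN ..."), then print each repeated line once.
    grep '^-A ' | sort | uniq -d
}

# Example usage on a node (requires root):
#   iptables-save -t nat | duplicate_rules
```

Empty output means no identical rule lines were found in that table; it does not rule out stale rules that differ in some field.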

A workaround for this issue is to specify an extra_bind for the kubelet service, i.e.

services:
  kubelet:
    extra_binds:
      - "/var/lib/dockershim:/var/lib/dockershim"

and rebooting the host after an rke up (if already dealing with duplicate iptables rules).
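
To confirm the bind actually landed on the kubelet container after rke up, something like the following could be used. It is a sketch under assumptions: has_bind is a hypothetical helper name, and the container is assumed to be named "kubelet" on the node.

```shell
# Hypothetical helper: succeed if the bind list on stdin contains the given
# host path as a bind source. The input is expected to look like the JSON
# array printed by: docker inspect --format '{{json .HostConfig.Binds}}'
has_bind() {
    # Fixed-string match on the opening quote, the path, and the ':' separator.
    grep -qF "\"$1:"
}

# Example usage on a node, assuming the container is named "kubelet":
#   docker inspect --format '{{json .HostConfig.Binds}}' kubelet \
#       | has_bind /var/lib/dockershim && echo "bind present"
```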

RKE will not do this automatically for Kubernetes v1.23 and older, but it will for v1.24 and newer.

SURE-4702

@sowmyav27

sowmyav27 commented Jul 27, 2022

Reproduced the issue on Rancher 2.6-head, commit id: 78017e64e

On an RKE1 Linode node driver cluster (1 etcd/cp, 2 worker nodes) upgraded from 1.20.15-rancher2-1 to 1.21.14-rancher1-1 to 1.22.11-rancher1-1, ingress works after the upgrade to 1.21.14-rancher1-1 but doesn't work after the upgrade to 1.22.11-rancher1-1.

Other use cases tested:

  • Upgrade from 1.21.14-rancher1-1 to 1.22.11-rancher1-1 --> Ingress works
  • Upgrade from 1.21.14-rancher1-1 to 1.22.11-rancher1-1 to 1.23.8-rancher1-1 --> Ingress works
  • Upgrade from v1.23.7 to 1.24.2 --> works
  • 1.21.13-rancher1-1 to 1.22.11-rancher1-1 --> Works
  • 1.21.13-rancher1-1 to 1.22.11-rancher1-1 to 1.23.8-rancher1-1 --> Works
  • Upgrade from 1.21.14-rancher1-1 to 1.22.11-rancher1-1 to 1.24.2 --> Works

@sowmyav27

On 2.6.7-rc5

On a Linode node driver cluster - 1 etcd/cp and 3 worker nodes

  • v1.21.14-rancher1-1 to v1.22.11 --> Ingress works after this upgrade.
  • v1.21.14-rancher1-1 to v1.22.11 + a redeploy of ingress --> ingress doesn't work after
  • v1.21.10-rancher1-1 to v1.22.7-rancher1-2 --> Ingress works after this upgrade.
  • v1.21.10-rancher1-1 to v1.22.7-rancher1-2 + a redeploy of ingress --> ingress doesn't work after

@Oats87
Contributor Author

Oats87 commented Aug 5, 2022

@markusewalker I have added the testing template to the PR 3001 -- testing this should be very simple.

This issue also ties into #2999 and testing that one is a little more nuanced.

@markusewalker

markusewalker commented Aug 5, 2022

Verified that this is addressed on v1.3.13-rc6.

ENVIRONMENT DETAILS

  • Client machine: Ubuntu 20.04
  • RKE version
    • Reproduced version: v1.3.10
    • Verified version: v1.3.13-rc6

TEST RESULT
PASS

REPRODUCTION STEPS

  1. On a client machine, downloaded RKE v1.3.10 and set it up.
  2. Provisioned a downstream RKE1 cluster.
  3. Once successfully provisioned, verified that the v1.22.9 cluster DOES NOT have directory /var/lib/cri-dockerd:
# find . -name cri-dockerd
./var/lib/docker/volumes/e17334c258e428cf67ac1d67258890058ad001d6637cde6fb3ad6ac5fb62c6fd/_data/bin/cri-dockerd
./var/lib/docker/volumes/973a0989702535ff0cff799663fadc20e2d1cf9e1470d1f45b4ae39f033a602c/_data/bin/cri-dockerd
./var/lib/docker/overlay2/025c09d38f9260920975a8d10051e7deb17ad00bbb91a6009a01046050d060eb/diff/opt/rke-tools/bin/cri-dockerd
./var/lib/docker/overlay2/1a59d3f30972346abe1ccd51256e806f3e5d1b3f9da6a1a02d1a81008ce31a8d/diff/opt/rke-tools/bin/cri-dockerd
./var/lib/docker/overlay2/1b6a9b1951b1489f4344b27423236c356fa27260691f6bc6952c633d7808fa3f/diff/opt/rke-tools/bin/cri-dockerd
./var/lib/docker/overlay2/3a604b3223a62e4d7f450d30f8405028c2a1764ef49ac6ec42bbd5ec97fcc7eb/merged/opt/rke-tools/bin/cri-dockerd

VERIFICATION STEPS

  1. On a client machine, downloaded RKE v1.3.13-rc6 and set it up.
  2. Provisioned two downstream RKE1 clusters: 1 with version v1.23.8, 1 with version v1.24.2.
    • Ensured that cri-dockerd is enabled on these clusters.
  3. Once successfully provisioned, verified that the 1.23.8 cluster DOES NOT have directory /var/lib/cri-dockerd:
# find . -name cri-dockerd
./var/lib/docker/volumes/8ef0676b2f9919105ae88bb8b491e54b4ee9090d0d20bcd152e09c9c73fb1409/_data/bin/cri-dockerd
./var/lib/docker/volumes/66c5bf0cc1a94fe52e5b846b31e74a231b6c55ef8aa2a74e6712d2460df6b40e/_data/bin/cri-dockerd
./var/lib/docker/volumes/91eecb22bf8ae5cf82b8143c006fbfd78bdb1cc3d3332855b1f697bef6a23556/_data/bin/cri-dockerd
./var/lib/docker/overlay2/0933d9a4af2eeb1187b9bd7e70c670c4d5b4e57ebce57a54dcc65bcd3719eebe/diff/opt/rke-tools/bin/cri-dockerd
./var/lib/docker/overlay2/95c61e6d4d227a6e8205f3eeef3394c388d4526508ba8fbea95a1f87cba10387/merged/opt/rke-tools/bin/cri-dockerd
./var/lib/docker/overlay2/2fef0490da342a6f7062092856c56a6c6c83d4677f8b4c30b09c1b11d7766cc3/diff/opt/rke-tools/bin/cri-dockerd
  4. Verified that the 1.24.2 cluster DOES have directory /var/lib/cri-dockerd:
# find . -name cri-dockerd
./var/lib/docker/volumes/377fe302bff78102d6bddc107a425a4af344d5152807c07b7d5dfe5561583233/_data/bin/cri-dockerd
./var/lib/docker/volumes/27570f4436f8392cb87b5c9645235e1c957d44a22f470641c629da90ebfc2237/_data/bin/cri-dockerd
./var/lib/docker/volumes/11a269383a6f0aaea434134bb4ebb744a0356f5ccbb8aa5178cd9d5d949b6dae/_data/bin/cri-dockerd
./var/lib/docker/overlay2/025c09d38f9260920975a8d10051e7deb17ad00bbb91a6009a01046050d060eb/diff/opt/rke-tools/bin/cri-dockerd
./var/lib/docker/overlay2/1b6a9b1951b1489f4344b27423236c356fa27260691f6bc6952c633d7808fa3f/diff/opt/rke-tools/bin/cri-dockerd
./var/lib/docker/overlay2/37bc3b00627c4048e1eb016cd52a18729a689f29b03848f53b7845f1e631f5dc/merged/opt/rke-tools/bin/cri-dockerd
./var/lib/docker/overlay2/2d1f252a62daa1c08ecba31eb053b25a95dca3617d19b404820e92c034d996d2/merged/var/lib/cri-dockerd
./var/lib/docker/overlay2/2d1f252a62daa1c08ecba31eb053b25a95dca3617d19b404820e92c034d996d2/diff/var/lib/cri-dockerd
./var/lib/cri-dockerd
# cd /var/lib/cri-dockerd/sandbox
# ls
07c7c340a7c95b527ff3c32fbbc067419e80b83f1ba86953c2935495468c57bb  67abce04b418c758f1f33e9050183b0f1dda1bd4d55958d07f3836341f1696a3
0f951c85ab7ddc4484e94eadb93cd40402ddfff9bcdb74d4628041473aab18a3  6bffa9aab2ee43ce16bd7438b1e48d52fbeebbd8e507ea181c40b0cc5b997b9e
19a7805b608daf15476822fb7ef55b70b19cb58af938e93a5e6703837cc2b8b2  78a37dc7b0877760bfff2b3b934e9a0ea0663649ed4ce4bf55cfd0205b36bfc6
346fedc21d4540b36d38999b8c8b363ed7b6ae9fa81bfd0d1e3068ada01827ca  84979f09d9ef67e1cb23b58f80cbdce6deb9b407f29d7a13b09c20b3c22a69e9
3def26408acc29501cc12227f6c16a9e1e241e3628d6b72ed7a93c9941d80df6  99505a4dc56bb03f99c08ed5ec62a62a73b1dc8cbd9cf6aadcfb96872e0c09d0
60606291d5ff383e362c8467f10f865c23f5142a7347dd30d3d2f59855e318b0  aa575ed98820a488b682d08a884f170bca62a2d57f5b00dce5c0da057ff99d50
  5. Upgraded the 1.23.8 cluster to v1.24.2 and confirmed that the upgrade was successful.
  6. Once finished, verified that /var/lib/cri-dockerd, previously absent, is now present after the upgrade:
# find . -name cri-dockerd
./var/lib/docker/volumes/42ce57e0c5be0f4fc4333ec3d82420fdd871ea81be90938658270a1d09bb6645/_data/bin/cri-dockerd
./var/lib/docker/volumes/66c5bf0cc1a94fe52e5b846b31e74a231b6c55ef8aa2a74e6712d2460df6b40e/_data/bin/cri-dockerd
./var/lib/docker/volumes/91eecb22bf8ae5cf82b8143c006fbfd78bdb1cc3d3332855b1f697bef6a23556/_data/bin/cri-dockerd
./var/lib/docker/overlay2/0933d9a4af2eeb1187b9bd7e70c670c4d5b4e57ebce57a54dcc65bcd3719eebe/diff/opt/rke-tools/bin/cri-dockerd
./var/lib/docker/overlay2/0b2e428f3e5a1d563a7da137b7770a972042a79ba7d3dee0eb1e0bbd1508fdbb/merged/opt/rke-tools/bin/cri-dockerd
./var/lib/docker/overlay2/2fef0490da342a6f7062092856c56a6c6c83d4677f8b4c30b09c1b11d7766cc3/diff/opt/rke-tools/bin/cri-dockerd
./var/lib/docker/overlay2/ec06257c4b66a0dc62712483d034fb61561e7a04d2cac5dbb705b40cc8198301/merged/var/lib/cri-dockerd
./var/lib/docker/overlay2/ec06257c4b66a0dc62712483d034fb61561e7a04d2cac5dbb705b40cc8198301/diff/var/lib/cri-dockerd
./var/lib/cri-dockerd
# cd /var/lib/cri-dockerd/sandbox/
# ls
# 

@Oats87 as noted in step 6 of the verification steps, after the upgrade /var/lib/cri-dockerd persists, but sandbox contains no data. Is this expected behavior? If so, I will proceed with closing this ticket. For now, I'll move it back to reopened.

UPDATE
Tried on a different setup and the data propagated as expected. Further investigation suggests the empty sandbox was environmental on my end, so this is a non-issue at this time.
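
For a quicker check than reading find output, the empty-sandbox condition above can be tested by counting the entries under the sandbox directory. This is a sketch only; sandbox_count is a hypothetical helper name, and it relies on the -mindepth/-maxdepth options found in GNU and BSD find.

```shell
# Hypothetical helper: print the number of entries under DIR/sandbox, or
# print "missing" and return non-zero if that directory does not exist.
sandbox_count() {
    if [ ! -d "$1/sandbox" ]; then
        echo "missing"
        return 1
    fi
    # Count direct children only; tr strips the padding some wc builds add.
    find "$1/sandbox" -mindepth 1 -maxdepth 1 | wc -l | tr -d ' '
}

# Example usage on a node:
#   sandbox_count /var/lib/cri-dockerd
```

A result of 0 corresponds to the empty ls output seen in step 6.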

@sowmyav27

Reopening for more validations

@sowmyav27

sowmyav27 commented Aug 8, 2022

Can we also validate these k8s versions on the latest 2.6-head?

  • v1.22.11-rancher1-1 to 1.23.8-rancher1-1, and check whether the data persists (cri_dockerd disabled, which is the default)
  • 1.23.8 (cri_dockerd set to false) to 1.24.2, and verify whether data persists between /var/lib/dockershim/sandbox and /var/lib/cri-dockerd/sandbox
  • From Rancher, test a 1.23 to 1.24 upgrade use case (pre-upgrade checks run on 1.23 and post-upgrade checks on 1.24) (this cluster has 3 etcd, 2 cp and 3 worker nodes and this is an RKE1 cluster)
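
One possible way to compare the two sandbox directories from the second bullet is to list the entries that exist in one but not the other. This is a sketch with a hypothetical helper name (missing_sandboxes); the thread does not say whether sandbox IDs are expected to carry over unchanged, so treat the output as a starting point rather than a pass/fail result.

```shell
# Hypothetical helper: print entries present in the first directory but
# missing from the second.
missing_sandboxes() {
    old_list=$(mktemp) && new_list=$(mktemp) || return 1
    ls -1 "$1" | sort > "$old_list"
    ls -1 "$2" | sort > "$new_list"
    # comm -23 prints lines that appear only in the first sorted file.
    comm -23 "$old_list" "$new_list"
    rm -f "$old_list" "$new_list"
}

# Example usage on a node after the upgrade:
#   missing_sandboxes /var/lib/dockershim/sandbox /var/lib/cri-dockerd/sandbox
```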

@markusewalker

In response to the additional validation tests requested here (https://github.com//issues/2993#issuecomment-1208666694), please find the results below.

ENVIRONMENT DETAILS

  • Rancher Install: Docker
  • Rancher version: v2.6.7-rc7
  • Browser: Chrome

TEST SCENARIO 1

  1. Set up Rancher and navigated to the UI in a browser.
  2. Created standard user and logged into Rancher as that user.
  3. Provisioned downstream RKE1 cluster (v1.22.11-rancher1-1); verified the cluster and node came up as Active.
  4. Upgraded to v1.23.8-rancher1-1; verified the upgrade was successful. Cluster and node came up as Active.
  5. In a separate client machine, SSH'ed into the node that Rancher provisioned.
  6. Verified that /var/lib/cri-dockerd DOES NOT get persisted:
# find . -name cri-dockerd
./var/lib/docker/overlay2/84b8d5874fb2749cf8c0c435b32e27026f9f0247ed09c46ce8d4aa969f5be13f/diff/opt/rke-tools/bin/cri-dockerd
./var/lib/docker/volumes/4ee9328d8c4b26a2663a2c215b4c9aaac332d089bd1150fd775be2acdf870ed8/_data/bin/cri-dockerd
# cd /var/lib/
# ls
alsa    calico      dbus    etcd             misc       private  sudo                   ubuntu-release-upgrader  update-notifier
amazon  cloud       dhcp    grub             os-prober  python   systemd                ucf                      usbutils
apt     cni         docker  initramfs-tools  pam        rancher  tpm                    unattended-upgrades
boltd   containerd  dpkg    kubelet          polkit-1   snapd    ubuntu-drivers-common  update-manager

TEST SCENARIO 2

  1. Set up Rancher and navigated to the UI in a browser.
  2. Created standard user and logged into Rancher as that user.
  3. Provisioned downstream RKE1 cluster (v1.23.8-rancher1-1); verified the cluster and node came up as Active.
  4. Upgraded to v1.24.2-rancher1-1; verified the upgrade was successful. Cluster and node came up as Active.
  5. In a separate client machine, SSH'ed into the node that Rancher provisioned.
  6. Verified that /var/lib/cri-dockerd DOES get persisted:
# find . -name cri-dockerd
./var/lib/cri-dockerd
./var/lib/docker/overlay2/fc18063cd3ed2697d34bb61c31f24167130bca1ac2d98839e248d73a921cfce2/diff/opt/rke-tools/bin/cri-dockerd
./var/lib/docker/overlay2/3e9a8e50a3f48a7e318c9509aa80887c47f6c5ea5d6cb7d3d934e186ab344a09/merged/var/lib/cri-dockerd
./var/lib/docker/overlay2/3e9a8e50a3f48a7e318c9509aa80887c47f6c5ea5d6cb7d3d934e186ab344a09/diff/var/lib/cri-dockerd
./var/lib/docker/volumes/d7f3492253203fa4f4b7e097d71f510eeea4eb4d27578a074933f0ba051efeb2/_data/bin/cri-dockerd
# cd /var/lib/cri-dockerd/sandbox/
# ls
b4e7be53148e2ddfd8e325f5ec6909b00002d826f39e1078c29456735066e8ed  e159eded5e721d4b738c3da64606f575e69df1ab495a17e1f35779eb98aff9a4
