
Cannot successfully create CephObjectStore after the store has been deleted. #14763

Closed
denppa opened this issue Sep 24, 2024 · 3 comments
denppa commented Sep 24, 2024

  • Bug Report

Deviation from expected behavior:

The rgw pod does not get created after the CephObjectStore CR has been deleted.

Expected behavior:

The CephObjectStore and its rgw pod should spin up even if an object store with the same name existed before.

How to reproduce it (minimal and precise):

Create the following CephObjectStore resource:

apiVersion: ceph.rook.io/v1
kind: CephObjectStore
metadata:
  name: nextcloud-obj-store
  namespace: rook-ceph
spec:
  metadataPool:
    #failureDomain: host
    replicated:
      size: 3
  dataPool:
    #failureDomain: host
    replicated:
      size: 3
    quotas:
      maxSize: 5Ti
  preservePoolsOnDelete: true
  gateway:
    #sslCertificateRef:
    port: 80
    # securePort: 443
    instances: 1
kubectl create -f crd.yml

Then delete it:

kubectl delete -f crd.yml
# In a second shell
kubectl -n rook-ceph edit cephobjectstores.ceph.rook.io nextcloud-obj-store
# comment out the finalizer
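
Equivalently, the finalizer can be cleared non-interactively. A minimal sketch using the standard kubectl patch for clearing finalizers; note this is exactly the step that causes the problem described below, since Rook is then unable to clean up the pools:

# Clear all finalizers so the pending delete completes immediately
# (bypasses Rook's graceful pool cleanup -- use with caution)
kubectl -n rook-ceph patch cephobjectstore nextcloud-obj-store \
  --type merge -p '{"metadata":{"finalizers":null}}'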

Then create the object store again, and it will fail: the PHASE column of the watch command below flips between Progressing and Failure in a loop, and the rgw pod is never created.

Monitor commands:

kubectl -n rook-ceph get pods --watch
kubectl -n rook-ceph get cephobjectstores.ceph.rook.io --watch

Logs to submit:

ceph-object-controller-detect-version-w995k         0/1     Pending           0             0s
ceph-object-controller-detect-version-w995k         0/1     Pending           0             0s
ceph-object-controller-detect-version-w995k         0/1     Init:0/1          0             0s
ceph-object-controller-detect-version-w995k         0/1     Init:0/1          0             0s
ceph-object-controller-detect-version-w995k         0/1     Init:0/1          0             1s
ceph-object-controller-detect-version-w995k         0/1     PodInitializing   0             3s
ceph-object-controller-detect-version-w995k         0/1     PodInitializing   0             3s
ceph-object-controller-detect-version-w995k         0/1     Terminating       0             3s
ceph-object-controller-detect-version-w995k         0/1     Terminating       0             4s
ceph-object-controller-detect-version-w995k         0/1     Terminating       0             4s
ceph-object-controller-detect-version-w995k         0/1     Completed         0             4s
ceph-object-controller-detect-version-w995k         0/1     Completed         0             5s
ceph-object-controller-detect-version-w995k         0/1     Completed         0
NAME                  PHASE         ENDPOINT                                                    SECUREENDPOINT   AGE
nextcloud-obj-store   Progressing   http://rook-ceph-rgw-nextcloud-obj-store.rook-ceph.svc:80                    43s
nextcloud-obj-store   Failure                                                                                    53s
nextcloud-obj-store   Progressing   http://rook-ceph-rgw-nextcloud-obj-store.rook-ceph.svc:80                    53s
nextcloud-obj-store   Failure
  • Operator's logs, if necessary
Sorry, the operator restarted and the logs were cleared.

Cluster Status to submit:
All healthy.

Environment:

  • OS (e.g. from /etc/os-release): debian bookworm
  • Kernel (e.g. uname -a): Debian 6.1.106-3
  • Cloud provider or hardware configuration: baremetal
  • Rook version (use rook version inside of a Rook Pod): v1.15.2
  • Storage backend version (e.g. for ceph do ceph -v): ceph version 18.2.4 ... reef (stable)
  • Kubernetes version (use kubectl version):
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): baremetal k8s
  • Storage backend status: all healthy (see "Cluster Status" above)

sh-5.1$ rook version
2024/09/24 18:47:01 maxprocs: Leaving GOMAXPROCS=96: CPU quota undefined
rook: v1.15.2
go: go1.22.7

Temporary Solution

Recreation succeeds if I delete every pool used by the rgw service, including the .rgw.root pool.
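
For reference, a sketch of that cleanup from the rook-ceph-tools pod. The pool names are assumptions based on Rook's usual <store-name>.rgw.* naming; verify them against ceph osd pool ls, and note that pool deletion must be enabled first:

# List the leftover rgw pools to confirm their names
ceph osd pool ls | grep rgw

# Allow pool deletion, then remove each rgw pool
ceph config set mon mon_allow_pool_delete true
ceph osd pool delete .rgw.root .rgw.root --yes-i-really-really-mean-it
ceph osd pool delete nextcloud-obj-store.rgw.meta nextcloud-obj-store.rgw.meta --yes-i-really-really-mean-it
# ...repeat for the remaining nextcloud-obj-store.rgw.* pools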

@denppa denppa added the bug label Sep 24, 2024
@sp98 sp98 self-assigned this Sep 25, 2024
sp98 (Contributor) commented Sep 25, 2024

@denppa Rook deletes the pools when the objectstore CR is deleted. Can you please try to reproduce this again and share the Rook operator logs?

BlaineEXE (Member) commented:

If the finalizer is removed from the CephObjectStore resource while it is deleting, Rook will not be able to gracefully clean up the pools, because they could still contain user data. This can then prevent the CephObjectStore from being recreated. Likely, this is what occurred. If true, this is a matter of Rook working as intended to ensure user data safety.
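
One way to check whether this is what happened is to look for the previous store's pools from the rook-ceph-tools pod; a sketch, assuming the usual rgw pool naming:

# Pools left behind by the bypassed finalizer will still be listed here
ceph osd pool ls detail | grep rgw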


denppa commented Sep 26, 2024

> If the finalizer is removed from the CephObjectStore resource while it is deleting, Rook will not be able to gracefully clean up the pools, because they could still contain user data. This can then prevent the CephObjectStore from being recreated. Likely, this is what occurred. If true, this is a matter of Rook working as intended to ensure user data safety.

Yes. Because the delete command was hanging, I assumed it was stuck on the finalizer, which is why I removed it. However, it makes sense that Rook disallows overwriting the pools to protect their data.

For future reference, wait for the delete command to complete gracefully to avoid this issue.
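
A sketch of the safe flow, relying on the fact that kubectl delete blocks until all finalizers have been processed:

# Delete and let Rook run its cleanup; do not touch the finalizer
kubectl -n rook-ceph delete cephobjectstore nextcloud-obj-store --wait=true
# If deletion was started elsewhere, wait for the resource to disappear
kubectl -n rook-ceph wait --for=delete cephobjectstore/nextcloud-obj-store --timeout=10m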

@denppa denppa closed this as completed Sep 26, 2024