Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added retry for snapshot and s3 upload, verify etcd running on host #2952

Conversation

jakefhyde
Copy link
Contributor

@jakefhyde jakefhyde commented Jun 10, 2022

rancher/rancher#37639

  • Added option to quiet noisy container logs
    • In other words, some rke operations (such as taking a snapshot and uploading it) produce very few logs, so printing these logs for containers that have a potentially long run time floods the logs and isn't very helpful by itself. Now, for taking a snapshot, the stdout and stderr of the container are only logged when they are different from previous iterations.
  • Improved logic to ensure that saving a snapshot is considered successful when at least one snapshot is taken, and at least one node was able to upload to s3. The failures are still logged, however previously this would've resulted in snapshots that failed to upload to s3 being deleted from the node (fixed in rke-tools), so rke has been updated to reflect this. Additionally, rke will now fall back to using the snapshot on the node. This may change if rancher is to separate local and s3 backups for rke1 like rke2, but will have considerations regarding retention (although more than likely we will track retention separately for local/s3 snapshots much like how we are for failed snapshots (Needs Discussion)). There are already considerations regarding the desync between the rancher etcdbackup objects, and the actual backups in both s3 and on the nodes, and a solutation to backpopulate these objects similar to rke2 may actually be desired here in order to minimize the risk of a user restoring an impossible snapshot.
  • Renamed IsContainerRunning to DoesContainerExist which is what was previously being checked; added IsContainerRunning for when you really want to make sure a container is running (like etcd before taking a snapshot). There are most likely several places where IsContainerRunning or an equivalent check on a specific container would be useful that checking if it exists (I expect there would likely be some breakages today if some containers were manually stopped), but I considered this out of the scope of this PR because there are already significant changes and identifying which containers would benefit is likely more trouble than it's worth.

Depends on rancher/rke-tools#153 and subsequent KDM update.

@jakefhyde jakefhyde marked this pull request as ready for review June 10, 2022 22:36
@jakefhyde jakefhyde requested review from a team and snasovich June 10, 2022 22:36
@jakefhyde jakefhyde force-pushed the 37639-rke1-etcd-snapshot-trying-too-frequently branch from 0e2f14b to f0b0b31 Compare June 10, 2022 22:38
docker/docker.go Outdated Show resolved Hide resolved
pki/deploy.go Outdated Show resolved Hide resolved
pki/deploy.go Show resolved Hide resolved
pki/deploy.go Outdated Show resolved Hide resolved
services/services.go Outdated Show resolved Hide resolved
docker/docker.go Outdated Show resolved Hide resolved
services/etcd.go Outdated Show resolved Hide resolved
@jiaqiluo jiaqiluo requested a review from a team June 13, 2022 22:52
@jakefhyde jakefhyde requested a review from jiaqiluo June 14, 2022 16:04
cluster/etcd.go Outdated Show resolved Hide resolved
services/etcd.go Show resolved Hide resolved
@jakefhyde jakefhyde requested review from a-blender and a team June 16, 2022 22:39
@a-blender
Copy link
Contributor

Super nit: In the desc, printing these logs for containers that have a potentially long run time floods the logs and isn't very useless by itself did you mean helpful?

Copy link
Contributor

@a-blender a-blender left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments to be addressed

@jakefhyde jakefhyde requested review from a-blender and a team June 21, 2022 16:13
Copy link
Member

@jiaqiluo jiaqiluo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but will approve after the CI failure is fixed.

@jakefhyde
Copy link
Contributor Author

@annablender @jiaqiluo we may have to figure something out here, because golangci-lint itself has a bug in it, and I can't reproduce it in debug with the exact same version of golangci-lint from source.

@jakefhyde jakefhyde force-pushed the 37639-rke1-etcd-snapshot-trying-too-frequently branch from 286ec62 to 1b67efc Compare June 23, 2022 16:41
@jakefhyde jakefhyde requested review from jiaqiluo and a team June 23, 2022 19:58
@jiaqiluo
Copy link
Member

LGTM. Please squash commits before merging

@jakefhyde jakefhyde merged commit 8aa6283 into rancher:release/v1.3 Jun 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants