Special failsafe feature #16185
base: master
Conversation
Thanks so much for working on this! Just wanted to clarify: if a special device is removed or fails (with no redundancy), does the feature flag remain enabled? It sounds like it shouldn't need to remain enabled (and would therefore need to be re-enabled before adding a replacement, non-redundant special device), but I could be misunderstanding what's going on behind the scenes. I'm guessing that in this case (special device fails/is removed and then replaced) the new special device will be fully empty and no existing data will be copied onto it by any means (either preemptively or reactively)? Absolutely fine if so, especially if it makes it easier to get the feature implemented at all; I just wanted to check as I didn't see any notes about this case, though it might be worth mentioning in the documentation either way.
Yes, the SPA_FEATURE_SPECIAL_FAILSAFE feature flag will remain "active" if a special device fails or is removed, which I think is what you're referring to. This matches the behavior of SPA_FEATURE_ALLOCATION_CLASSES.
What is the difference between this feature and adding an SSD L2ARC with secondarycache=metadata?
@vaclavskala There's a lot of overlap, and in many use cases you could use either L2ARC+secondarycache=metadata, or special+special_failsafe interchangeably. There are some differences:
The "irrationally large" comment here makes me think we can't just scale the L2ARC to be arbitrarily large (unlike special). |
feec657
to
67edb03
Compare
I would maybe also add to that list:
Of course, if the special device is improperly sized, L2ARC may be better/more adaptable. But with the proposed special failsafe you have the option of trying the special device first; if you determine it's too small, you can remove it and re-add it as an L2ARC instead.
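A sketch of that "try special first, fall back to L2ARC" path (hypothetical names; assumes the pool layout permits device removal):

```sh
zpool remove tank /dev/nvme0n1        # evacuate and remove the special vdev
zpool add tank cache /dev/nvme0n1     # re-add the same SSD as L2ARC
zfs set secondarycache=metadata tank  # keep only metadata in it
```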
This is a very interesting feature. It would be even better if the feature could be turned on and off at will, which would ideally cause the special data to be backed up to the pool when it transitioned from off to on, and deleted from the pool when transitioned from on to off. The use case for this enhancement to the enhancement is performing some expensive metadata write operations, e.g. |
@lachesis that would be nice, but it would also require Block Pointer Rewrite, which I don't think is going to happen anytime soon.
This feature is extremely interesting, as it would address L2ARC limitations regarding metadata handling and therefore remove the need for either adding a zoo of L2ARC workaround/improvement features or an L2ARC redesign. Some comments regarding the feature description:
In case this is not just a high-level description but reflects the detailed implementation, we should think seriously about the situation where only a single (i.e. non-redundant) special device is used. Remark: to me, "backup" implies something performed after the metadata was written to the special devices. At least with a non-redundant special device, the metadata should be written to the pool (the non-special devices) in parallel with the special devices, and should not be flagged as written until it has also been successfully committed to stable pool storage, if this isn't already implemented that way.
Can we reduce the risk of accidents by e.g.:
Can some additional information be provided on this? How/why would it differ from using dRAID without special devices (assuming the special devices are only used for metadata, and not small files, dedup, etc.)?
This is the current behavior. The feature flag has to be enabled in order to set the pool property. And you can't add non-redundant special devices to a pool if the pool property isn't set.
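A minimal sketch of that ordering, as I understand it (hypothetical pool and device names):

```sh
zpool set feature@special_failsafe=enabled tank  # the feature flag must be enabled first
zpool set special_failsafe=on tank               # only then can the property be set
zpool add tank special /dev/sdX                  # only then is a non-redundant special allowed
```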
I think that's an interesting point. I will have to think about the tradeoffs in doing that.
Some info on write inflation from an older version of this PR: #16073 (comment). I don't think it would differ from using dRAID without special devices: you'd still get the write inflation from the tiny metadata. That's why people often use dRAID + special.
Sounds good, thank you for the clarification.
According to...
...the user could permanently lose metadata redundancy and would be forced to delete and recreate the pool. Therefore I expect that this safety feature would be worth the trade-offs.
Does this mean that the "backup" of this feature is implemented by basically "mirroring" the pool ("without" special devices) with the special device(s) of the pool in the case of metadata? And that therefore the standard ZFS mechanisms will ensure that the metadata is always properly written to the (non-special-device) pool drives?
The description of this feature mentions backup, but not much about the failure case and later restore. What is the behaviour in a special-device failure case? Does the pool simply freeze, change to read-only, or does the pool continue to work nominally with the metadata only written to the (typically redundant) pool drives? What is the restore behaviour? Can the metadata be recreated on a replaced special device (drive/partition/slice), e.g. by resilver? If not, a failed special device would require pool deletion and recreation from backup.
It's conceptually: "whenever I write to the special device, I will also write an additional copy of that data to the main pool". If you're familiar with the
The pool would behave as if it lost some disks but still had redundancy. That is, the pool would still import read-write even if all the special devices had failed. For example, here's how it looks after I deleted all the special devices and did an import with special_failsafe=on:
You could then do the normal
Yeah, I see what you're getting at. We do let users make pools with non-redundant special devices, but they have to force it with
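Presumably this refers to zpool's usual override for mismatched replication levels (my assumption, since the comment above is truncated; names are hypothetical):

```sh
# Without special_failsafe, adding a lone special vdev to a mirrored
# pool has to be forced:
zpool add -f tank special /dev/sdX
```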
Just to confirm, does this mean that zfs will repopulate the replacement
If I only want (all) metadata to be read from my SSD, which one should I use?
Will … Also, how does one get the metadata into the cache? When I added my
As you say, the contents of L2ARC depend upon what has been loaded into ARC (and later evicted), meaning you have limited control over what ends up in there. You can use tricks like

A

It's worth noting that the special failsafe isn't required for this; the benefit of special failsafe is that you can safely use a special device without having to match the redundancy of the rest of your pool, i.e. you could use a single disk without worrying it will take down the entire pool if it fails.
When you first add a

Currently the only way to force a

Populating a special device without sending will require a feature of its own, I think? Though I'm actually not 100% sure if it's dependent upon block pointer rewrite or similar.
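A hedged sketch of the send/receive approach mentioned above, which rewrites a dataset so its blocks are allocated fresh (and thus become eligible for the special vdev); dataset names are hypothetical:

```sh
zfs snapshot tank/data@migrate                         # freeze the source dataset
zfs send tank/data@migrate | zfs receive tank/data2    # rewrite into a new dataset
# After verifying tank/data2, the old dataset could be destroyed or renamed away.
```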
@Haravikk Thanks for the answers.
You mean when the L2ARC is full, right?
Why? If I don't wanna store data (
New records meaning completely new or just different? Because
I mean I don't even necessarily need that, as I probably have some files that I'll never even access. I just wanna speed up some common operations like … To be honest, I don't even really trust that
I believe it will repopulate with the backed-up special blocks.
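If that's right, the restore path would presumably be the ordinary one (a sketch only, not confirmed by this thread; device names are hypothetical):

```sh
zpool replace tank /dev/old-ssd /dev/new-ssd  # swap in the new special device
zpool status tank                             # watch the resilver repopulate it
```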
I'd like to be wrong, but I am not sure it is currently possible to replace a completely lost top-level vdev. It might just not be implemented though, since normally missing top-levels are fatal and it is normally impossible to even import pools in that state. But sure, such functionality could be interesting in light of the tunable added not so long ago to allow such imports.
I had to change some of the import code, but it is possible with this PR. That is, with special_failsafe enabled, if you have a special top-level vdev and all its children are corrupted or missing, you can still import the pool RW without any data loss. One of the test cases I added verifies this:
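The test output itself isn't quoted above. Purely as an illustration of the scenario (this is not the actual test case; the file-vdev paths are made up):

```sh
truncate -s 1G /tmp/d1 /tmp/d2 /tmp/sp
zpool create -o feature@special_failsafe=enabled -o special_failsafe=on \
    tank mirror /tmp/d1 /tmp/d2 special /tmp/sp
zpool export tank
rm /tmp/sp                 # simulate total loss of the special vdev
zpool import -d /tmp tank  # with this PR, still imports read-write
```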
Special failsafe is a feature that allows your special allocation class vdevs ('special' and 'dedup') to fail without losing any data. It works by automatically backing up all special data to the pool. This has the added benefit that you can safely create pools with non-matching alloc class redundancy (like a mirrored pool with a single special device).

This behavior is controlled via two properties:

1. feature@special_failsafe - This feature flag enables the special failsafe subsystem. It prevents the backed-up pool from being imported read/write on an older version of ZFS that does not support special failsafe.

2. special_failsafe - This pool property is the main on/off switch to control special failsafe. If you want to use special failsafe, simply turn it on either at creation time or with `zpool set` prior to adding a special alloc class device. After special devices have been added, you can either leave the property on or turn it off, but once it's off you can't turn it back on again.

Note that special failsafe may create a performance penalty over pure alloc class writes due to the extra backup copy write to the pool. Alloc class reads should not be affected as they always read from DVA 0 first (the copy of the data on the special device). It can also inflate disk usage on dRAID pools.

Closes: openzfs#15118
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
I just rebased this on the latest master. I will need to update the special failsafe code to work with the latest DDT stuff:
Since the discussion seems to have cooled down, I hope it's okay to ask this here, as I plan on switching once this hits stable:
I don't think the first one is achieved, because I regularly run
It depends. If you have a big special device, you could set
I don't want any data in the L2ARC, just metadata.
Motivation and Context
Allow your special allocation class vdevs ('special' and 'dedup') to fail without data loss.
Description
Special failsafe is a new feature that allows your special allocation class vdevs ('special' and 'dedup') to fail without losing any data. It works by automatically backing up all special data to the main pool. This has the added benefit that you can safely create pools with non-matching alloc class redundancy (like a mirrored pool with a single special device).
This behavior is controlled via two properties:
feature@special_failsafe - This feature flag enables the special failsafe subsystem. It prevents the backed-up pool from being imported read/write on an older version of ZFS that does not support special failsafe.
special_failsafe - This pool property is the main on/off switch to control special failsafe. If you want to use special failsafe, simply turn it on either at creation time or with `zpool set` prior to adding a special alloc class device. After special devices have been added, you can either leave the property on or turn it off, but once it's off you can't turn it back on again.

Note that special failsafe may create a performance penalty over pure alloc class writes due to the extra backup copy write to the pool. Alloc class reads should not be affected as they always read from DVA 0 first (the copy of the data on the special device). It can also inflate disk usage on dRAID pools.
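An illustrative end-to-end lifecycle based on the description above (pool and device names are hypothetical):

```sh
# Create the pool with the feature enabled and the property on:
zpool create -o feature@special_failsafe=enabled -o special_failsafe=on \
    tank mirror /dev/sda /dev/sdb
zpool add tank special /dev/nvme0n1   # a single special device is now safe to add

# Later the property can be turned off, but not back on again:
zpool set special_failsafe=off tank
```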
Closes: #15118
Note: This is a simpler, more elegant version of my older PR: #16073
How Has This Been Tested?
Test cases added