Special failsafe feature #16185

Open · wants to merge 1 commit into base: master

Conversation

tonyhutter
Contributor

@tonyhutter commented May 10, 2024

Motivation and Context

Allow your special allocation class vdevs ('special' and 'dedup') to fail without data loss.

Description

Special failsafe is a new feature that allows your special allocation class vdevs ('special' and 'dedup') to fail without losing any data. It works by automatically backing up all special data to the main pool. This has the added benefit that you can safely create pools with non-matching alloc class redundancy (like a mirrored pool with a single special device).

This behavior is controlled via two properties:

  1. feature@special_failsafe - This feature flag enables the special failsafe subsystem. It prevents the backed-up pool from being imported read/write on an older version of ZFS that does not support special failsafe.

  2. special_failsafe - This pool property is the main on/off switch controlling special failsafe. If you want to use special failsafe, simply turn it on either at creation time or with zpool set prior to adding a special alloc class device. After special devices have been added, you can either leave the property on or turn it off, but once it's off you can't turn it back on again.
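For example, enabling it could look roughly like this (a minimal sketch using this PR's property names and hypothetical device names, not commands taken from the PR itself):

  # At creation time: enable the feature and the pool property; a single
  # (non-redundant) special vdev is then acceptable because its data is
  # also backed up to the main pool.
  zpool create -o feature@special_failsafe=enabled -o special_failsafe=on \
      tank mirror /dev/sda /dev/sdb special /dev/nvme0n1

  # Or on an existing pool, set the property *before* adding the special vdev:
  zpool set special_failsafe=on tank
  zpool add tank special /dev/nvme0n1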

Note that special failsafe may incur a performance penalty compared to pure alloc class writes, due to the extra backup copy written to the pool. Alloc class reads should not be affected, as they always read from DVA 0 first (the copy of the data on the special device). It can also inflate disk usage on dRAID pools.

Closes: #15118

Note: This is a simpler, more elegant version of my older PR: #16073

How Has This Been Tested?

Test cases added

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@tonyhutter force-pushed the special_failsafe branch 5 times, most recently from f73ddb3 to 0b0ecbb on May 14, 2024 22:47
@tonyhutter force-pushed the special_failsafe branch 2 times, most recently from 912657d to 0819183 on May 21, 2024 22:06
@Haravikk

Thanks so much for working on this!

Just wanted to clarify: if a special device is removed or fails (with no redundancy), does the feature flag remain enabled? It sounds like it shouldn't need to remain enabled (and would therefore need to be re-enabled before adding a replacement, non-redundant special device), but I could be misunderstanding what's going on behind the scenes?

I'm guessing that in this case (special device fails/is removed and then replaced) the new special device will be fully empty and no existing data will be copied onto it by any means (either preemptively or reactively)? Absolutely fine if so, especially if it makes it easier to get the feature implemented at all, just wanted to check as I didn't see any notes about this case – though it might be worth mentioning in documentation either way?

@tonyhutter
Contributor Author

@Haravikk

if a special device is removed or fails (with no redundancy), does the feature flag remain enabled?

Yes, the SPA_FEATURE_SPECIAL_FAILSAFE feature flag will remain "active" if a special device fails or is removed, which I think is what you're referring to. This matches the behavior of SPA_FEATURE_ALLOCATION_CLASSES.

@tonyhutter force-pushed the special_failsafe branch 4 times, most recently from 49d50d4 to feec657 on June 4, 2024 00:47
@vaclavskala
Contributor

What is the difference between this feature and adding an SSD L2ARC with secondarycache=metadata?
In both cases writes will be limited by the speed of the main pool, and reads will be handled by the SSD. L2ARC needs some time to be filled, but with persistent L2ARC that is no longer a problem.
And when the L2ARC/special device is smaller than the metadata size, L2ARC can even be faster because it will hold only hot metadata.

@tonyhutter
Contributor Author

@vaclavskala There's a lot of overlap, and in many use cases you could use either L2ARC+secondarycache=metadata, or special+special_failsafe interchangeably. There are some differences:

  1. This PR gives you more flexibility. Consider if you could only buy two NVMe drives to speed up your redundant pool. Prior to this PR, you could not use one NVMe for L2ARC and one NVMe for special, since special wouldn't be redundant enough. With this PR you have that option. You could then have a pool with hot large blocks on L2ARC while still guaranteeing that all metadata reads will be fast from special.

  2. L2ARC doesn't let you separate dedup data from metadata, whereas special alloc class devices do. If you have a heavily dedup'd pool, it may make more sense to dedicate all your NVMe to dedup+special_failsafe rather than L2ARC.

  3. You can set special_small_blocks on datasets for special, but you can only set the less granular secondarycache=[all|none|metadata] to get it on L2ARC.

  4. You can set l2arc_exclude_special to have L2ARC exclude special data. This could be useful if you're using both L2ARC with special + special_small_blocks.

  5. special_failsafe is per-pool, but persistent L2ARC is controlled by a module parameter (l2arc_rebuild_enabled).

  6. You may be super paranoid about your special/dedup data and simply want another copy on the pool. That way you have alloc class device data on two different mediums: NVMe (special/dedup) and HDDs (main pool). So if your NVMe PCIe switch goes down during a firmware update, you can still import the pool from the HDDs without downtime.

  7. One downside of L2ARC is that its headers take up ARC memory. From man/man4/zfs.4:

l2arc_meta_percent=33% (uint)

Percent of ARC size allowed for L2ARC-only headers. Since L2ARC buffers are not evicted on
memory pressure, too many headers on a system with an irrationally large L2ARC can render it
slow or unusable. This parameter limits L2ARC writes and rebuilds to achieve the target.

The "irrationally large" comment here makes me think we can't just scale the L2ARC to be arbitrarily large (unlike special).

@Haravikk

Haravikk commented Jun 4, 2024

I would maybe also add to that list:

  1. The contents of the special device are a lot more predictable – if properly sized and configured, and added at creation time, a special device is guaranteed to contain all special blocks, so these will always be accessed from the faster device. Compare this to ARC/L2ARC: we don't actually have much control over what stays in ARC/L2ARC beyond metadata-only, all, or nothing, and it tends not to retain infrequently used records for very long, so you'll almost certainly have to go to other devices to retrieve those. There are two cases I like to use that illustrate the benefits of this:
    • Loading the contents of infrequently accessed directories – this can also be thought of as find performance, as a find search may require you to stat every entry in a directory (or directory tree), plus extended attributes in some cases. Unless the bulk of these are in ARC/L2ARC this process can be extremely slow, as it's pretty much a worst case for spinning disks (lots of often randomly distributed, tiny records). If your workload includes anything like this then you want that offloaded to an SSD.
    • ZVOLs can be tuned nicely using a special device; since a ZVOL stores "blocks" of a predictable size (effectively a minimum record size for most blocks), you can exclude them while storing everything else (ZFS' metadata) on the special device. While this will be pretty much the same as an L2ARC set to secondarycache=metadata, again you can guarantee that it's all there on the special device, and never gets evicted. This means that operations for your ZVOL(s) should predictably send all "block" activity to your main pool, and all other activity to the special device – though obviously not total separation in the context of special failsafe (since the metadata is also written through to the rest of the pool).

Of course if the special device is improperly sized, L2ARC may be better/more adaptable, but with the proposed special failsafe you should actually have the option of trying with the special device at first, and if you determine that it's too small, you can remove it and re-add it as an L2ARC instead.
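As a rough illustration of the ZVOL point above (hypothetical names; assumes a special vdev is already present, and with special failsafe the metadata copies also land on the main pool):

  # Small file blocks and all metadata for regular datasets go to the special vdev...
  zfs set special_small_blocks=32K tank
  # ...while the ZVOL's "blocks" stay on the main pool: with special_small_blocks=0
  # on the volume, only its metadata lands on the special device.
  zfs create -V 100G -o volblocksize=64K -o special_small_blocks=0 tank/vm-disk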

@lachesis

This is a very interesting feature. It would be even better if the feature could be turned on and off at will, which would ideally cause the special data to be backed up to the pool when it transitioned from off to on, and deleted from the pool when transitioned from on to off.

The use case for this enhancement to the enhancement is performing some expensive metadata write operations, e.g. zfs receive-ing a large dataset containing many small files, rsyncing in a lot of files, etc. Then you could disable the flag, perform the operation while relying on the special device for metadata, and then re-enable the flag once the operation is complete to restore the additional redundancy.

@tonyhutter
Contributor Author

This is a very interesting feature. It would be even better if the feature could be turned on and off at will, which would ideally cause the special data to be backed up to the pool when it transitioned from off to on, and deleted from the pool when transitioned from on to off.

@lachesis that would be nice, but it would also require Block Pointer Rewrite, which I don't think is going to happen anytime soon.

@zfsuser

zfsuser commented Aug 8, 2024

This feature is extremely interesting, as it would solve L2ARC limitations regarding metadata handling and therefore remove the need for either adding a zoo of L2ARC workaround/improvement features or an L2ARC redesign.

Some comments regarding the feature description:

It works by automatically backing up all special data to the main pool.

In case this is not just a high-level description but reflects the detailed implementation, we should think carefully about the situation where only a single (i.e. non-redundant) special device is used.

Remark: to me, "backup" implies something that is performed after the metadata has already been written to the special devices.

At least in the case of non-redundant special devices, the metadata should be written to the pool (non-special devices) in parallel with the special devices, and it should not be flagged as written until it has also been successfully written to stable pool storage - if this is not already implemented like that or similarly.

If you want to use special failsafe, simply turn it on either at creation time or with zpool set prior to adding a special alloc class device. After special devices have been added, you can either leave the property on or turn it off, but once it's off you can't turn it back on again.

Can we reduce the risk of accidents by e.g.:

  • Only allowing the addition of non-redundant special devices when the feature is enabled?
  • Only allowing to disable the feature when no non-redundant special devices are used?

It can also inflate disk usage on dRAID pools.

Can some additional information be provided for this? How/why would it differ from using DRAID without special devices (assuming the special devices are only used for metadata, and not small files, dedup, etc.)?

@tonyhutter
Copy link
Contributor Author

Can we reduce the risk of accidents by e.g.:

Only allowing the addition of non-redundant special devices when the feature is enabled?

This is the current behavior. The feature flag has to be enabled in order to set the pool property. And you can't add non-redundant special devices to a pool if the pool property isn't set.
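Sketched out, the intended guard rails look roughly like this (hypothetical pool and devices; error messages omitted rather than invented):

  zpool create tank mirror /dev/sda /dev/sdb
  zpool add tank special /dev/nvme0n1     # expected to be rejected: non-redundant
                                          # special vdev, special_failsafe not set
  zpool set special_failsafe=on tank      # requires feature@special_failsafe enabled
  zpool add tank special /dev/nvme0n1     # now allowed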

Only allowing to disable the feature when no non-redundant special devices are used?

I think that's an interesting point. I will have to think about the tradeoffs in doing that.

Can some additional information be provided for this? How/why would it differ from using DRAID without special devices (assuming the special devices are only used for metadata, and not small files, dedup, etc.)?

Some info on write-inflation from an older version of this PR: #16073 (comment)

I don't think it would differ from using DRAID without special devices - you'd still get the write inflation from the tiny metadata. That's why people often use DRAID + special.

@zfsuser

zfsuser commented Aug 14, 2024

Only allowing the addition of non-redundant special devices when the feature is enabled?

This is the current behaviour. The feature flag has to be enabled in order to set the pool property. And you can't add non-redundant special devices to a pool if the pool property isn't set.

Sounds good, thank you for the clarification.

Only allowing to disable the feature when no non-redundant special devices are used?

I think that's an interesting point. I will have to think about the trade-offs in doing that.

According to...

you can [...] turn it off, but once it's off you can't turn it back on again.

...the user could permanently lose metadata redundancy and would be forced to delete and recreate the pool. Therefore I expect this safety feature would be worth the trade-offs.

I don't think it would differ from using DRAID without special devices - you'd still get the write inflation from the tiny metadata. That's why people often use DRAID + special.

Does this mean that the "backup" of this feature is implemented by basicly "mirroring" the pool ("without" special devices) with the special device(s) of the pool in case of metadata? And therefore the standard zfs mechanism will ensure that the metadata will always be properly written to the (non special device) pool drives?

@zfsuser

zfsuser commented Aug 14, 2024

The description of this feature mentions backup, but not much about the failure case and later restore.

What is the behaviour in a special-device failure case? Does the pool simply freeze, change to read-only, or does it continue to work nominally, with the metadata only being written to the (typically redundant) pool drives?

What is the restore behaviour? Can the metadata be recreated on a replaced special device (drive/partition/slice), e.g. by resilver? If not, a failed special device would require pool deletion and recreation from backup.

@tonyhutter
Contributor Author

Does this mean that the "backup" of this feature is implemented by basicly "mirroring" the pool ("without" special devices) with the special device(s) of the pool in case of metadata? And therefore the standard zfs mechanism will ensure that the metadata will always be properly written to the (non special device) pool drives?

It's conceptually: "whenever I write to the special device, I will also write an additional copy of that data to the main pool". If you're familiar with the copies dataset property, then you can think of it as forcing copies=2, with the extra copy going on the main pool.
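One hedged way to see that extra copy on a throwaway test pool (assumes special_failsafe=on with a special vdev present, and that zdb's block pointer output format hasn't changed):

  zfs set special_small_blocks=16K tank/fs    # route this small file's data to the special vdev
  dd if=/dev/urandom of=/tank/fs/smallfile bs=8k count=1
  zpool sync tank
  # With special failsafe, that data block should carry a second DVA pointing at
  # the main pool (the backup copy), much like copies=2 would.  (ZFS metadata
  # normally has multiple DVAs anyway, so look at the file's L0 data blocks.)
  zdb -ddddd tank/fs | grep 'L0 DVA' | head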

The description of this feature mentions backup, but not much about the failure case and later restore.

The pool would behave as if it lost some disks but still had redundancy. That is, the pool would still import read-write even if all the special devices had failed.

For example, here's how it looks after I deleted all the special devices and did an import with special_failsafe=on:

  pool: tank
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
	the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-2Q
config:

	NAME                     STATE     READ WRITE CKSUM
	tank                     DEGRADED     0     0     0
	  mirror-0               ONLINE       0     0     0
	    /tmp/file1           ONLINE       0     0     0
	    /tmp/file2           ONLINE       0     0     0
	special	
	  mirror-1               UNAVAIL      0     0     0  insufficient replicas
	    168996602107661217   UNAVAIL      0     0     0  was /tmp/file3
	    5802159686745897429  UNAVAIL      0     0     0  was /tmp/file4

errors: No known data errors

You could then do a normal zpool replace to replace the bad special disks.
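For example, using the GUIDs from the status output above and hypothetical replacement files (a sketch, not output from an actual run):

  zpool replace tank 168996602107661217 /tmp/file3_new
  zpool replace tank 5802159686745897429 /tmp/file4_new
  zpool status tank    # check resilver progress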

you can [...] turn it off, but once it's off you can't turn it back on again.

...the user could permanently lose metadata redundancy and would be forced to delete and recreate the pool. Therefore I expect this safety feature would be worth the trade-offs.

Yea, I see what you're getting at. We do let users make pools with non-redundant special devices, but they have to force it with -f, so there's precedent for having a safety like you're describing. Let me think about it.

@Haravikk

Haravikk commented Aug 15, 2024

You could then do the normal zpool replace to replace a bad special disks.

Just to confirm, does this mean that zfs will repopulate the replacement special disk(s) with the backed-up special blocks, or will it be treated the same as adding a new special device for the first time (i.e. it will be empty until new data is written that fits the special block size)?

@Anuskuss

Anuskuss commented Sep 6, 2024

If I only want (all) metadata to be read from my SSD, which one should I use? CACHE + secondarycache=metadata + l2arc_rebuild_enabled=1 (+ l2arc_noprefetch=0 + l2arc_headroom=0) or SPECIAL + special_failsafe=on (+ special_small_blocks=0)? AFAICT L2ARC works off (soon-to-be-evicted) data from ARC, meaning if the data never makes it to ARC it will never make it to L2ARC. Will SPECIAL (once the redundancy problem is gone) be the preferable method to cache metadata then?

One downside of L2ARC is that its headers take up ARC memory.

Will SPECIAL be more space efficient holding the same information, or does it have its own caveats?

Also how does one get the metadata into the cache? When I added my CACHE I ran find - will this also work for SPECIAL? Sorry for all of the questions but I always wanted a SPECIAL but got scared by the whole "you'll lose everything" worst case.

@Haravikk

Haravikk commented Sep 7, 2024

If I only want (all) metadata to be read from my SSD, which one should I use? CACHE + secondarycache=metadata + l2arc_rebuild_enabled=1 (+ l2arc_noprefetch=0 + l2arc_headroom=0) or SPECIAL + special_failsafe=on (+ special_small_blocks=0)? AFAICT L2ARC works off (soon-to-be-evicted) data from ARC, meaning if the data never makes it to ARC it will never make it to L2ARC. Will SPECIAL (once the redundancy problem is gone) be the preferable method to cache metadata then?

As you say, the contents of L2ARC depend upon what has been loaded into ARC (and later evicted), meaning you have limited control over what ends up in there. You can use tricks like find to stat everything and pull metadata into cache, but it's not guaranteed to stay there, meaning you may have to periodically repeat the command.

A special device by comparison can be guaranteed to contain all of your metadata so long as it's big enough and you set the special block size correctly.

It's worth noting that special failsafe isn't required for this; the benefit of special failsafe is that you can safely use a special device without having to match the redundancy of the rest of your pool, i.e. you could use a single disk without worrying it will take down the entire pool if it fails.

Also how does one get the metadata into the cache? When I added my CACHE I ran find - will this also work for SPECIAL? Sorry for all of the questions but I always wanted a SPECIAL but got scared by the whole "you'll lose everything" worst case.

When you first add a special device it contains nothing – data only starts being written to it as you start creating new records (and metadata) up to the special record size. For this reason it's best to plan ahead and add a special device when creating the pool, because otherwise there will be data that should be in there but won't be until you re-write it.

Currently the only way to force a special device to be populated with all special records is to copy your dataset(s), e.g. send to another pool then send back, or send to the same pool under a new name if you have enough space.
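Roughly like this (a sketch with hypothetical dataset names; assumes enough free space, and glosses over snapshots, clones, mountpoints and received properties):

  zfs snapshot -r tank/data@migrate
  zfs send -R tank/data@migrate | zfs receive tank/data_rewritten
  # The re-received records (and their metadata) are freshly written, so they
  # land on the special device where eligible.  After verifying, swap names:
  zfs destroy -r tank/data
  zfs rename tank/data_rewritten tank/data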

Populating a special device without sending will require a feature of its own, I think? Though I'm actually not 100% sure whether it's dependent upon block pointer rewrite or similar.

@Anuskuss

Anuskuss commented Sep 7, 2024

@Haravikk Thanks for the answers.

but it's not guaranteed to stay there

You mean when the L2ARC is full, right?

set the special block size correctly

Why? If I don't wanna store data (special_small_blocks) the default should be sufficient, right?

data only starts being written to it as you start creating new records

New records meaning completely new or just different? Because cating everything will change the atime and that will cause new records to be written (and cached).

force a special device to be populated

I mean I don't even necessarily need that as I probably have some files that I'll never even access ever. I just wanna speed up some common operations like find.

To be honest I don't even really trust that CACHE is doing its thing (although I have never benchmarked it) because some actions take longer than I would expect. Like if it's in cache I'd expect everything (except reading the actual data) to be instant, which it's clearly not. Also it's only using 7 GB for 70 TB of data, which seems rather low.

@tonyhutter
Contributor Author

Just to confirm, does this mean that zfs will repopulate the replacement special disk(s) with the backed up special blocks, or will it be treated the same as adding a new special device for the first time

I believe it will repopulate with the backed-up special blocks.

@amotin
Member

amotin commented Sep 10, 2024

I believe it will repopulate with the backed-up special blocks.

I'd like to be wrong, but I am not sure it is currently possible to replace a completely lost top-level vdev. It might just not be implemented, since missing top-levels are normally fatal and it is normally impossible to even import pools in that state. But such functionality could certainly be interesting in light of the tunable added not so long ago to allow such imports.

@tonyhutter
Contributor Author

I'd like to be wrong, but I am not sure it is possible now to replace completely lost top-level vdev.

I had to change some of the import code, but it is possible with this PR. That is, with special_failsafe enabled, if you have a special top-level vdev and all its children are corrupted or missing, you can still import the pool RW without any data loss. One of the test cases I added verifies this:

# Our pool is imported but has all its special devices zeroed out. Try

@tonyhutter
Contributor Author

I just rebased this on the latest master. I will need to update the special failsafe code to work with the latest DDT stuff:

[ 7848.550770] VERIFY3(s == bp_ndvas) failed (1 == 2)
[ 7848.552130] PANIC at ddt.c:700:ddt_phys_extend()
[ 7848.553913] Showing stack for process 360822
[ 7848.556822] CPU: 0 PID: 360822 Comm: z_wr_iss Tainted: P           OE      6.10.12-200.fc40.x86_64 #1
[ 7848.561008] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 7848.565359] Call Trace:
[ 7848.566352]  <TASK>
[ 7848.567501]  dump_stack_lvl+0x5d/0x80
[ 7848.569300]  spl_panic+0xf4/0x10b [spl]
[ 7848.570855]  ddt_phys_extend+0x390/0x6c0 [zfs]
[ 7848.577241]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 7848.578792]  ? metaslab_alloc+0x1c3/0x740 [zfs]
[ 7848.580689]  zio_ddt_child_write_ready+0xfa/0x190 [zfs]
[ 7848.582721]  zio_ready+0x93/0x790 [zfs]
[ 7848.584358]  zio_nowait+0x104/0x2b0 [zfs]
[ 7848.586107]  zio_ddt_write+0x391/0xbb0 [zfs]
[ 7848.587891]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 7848.589466]  zio_execute+0xde/0x200 [zfs]
[ 7848.591212]  taskq_thread+0x27c/0x5a0 [spl]
[ 7848.592578]  ? __pfx_default_wake_function+0x10/0x10
[ 7848.594257]  ? __pfx_zio_execute+0x10/0x10 [zfs]
[ 7848.596136]  ? __pfx_taskq_thread+0x10/0x10 [spl]
[ 7848.597671]  kthread+0xd2/0x100
[ 7848.598798]  ? __pfx_kthread+0x10/0x10
[ 7848.600092]  ret_from_fork+0x34/0x50
[ 7848.601358]  ? __pfx_kthread+0x10/0x10
[ 7848.602613]  ret_from_fork_asm+0x1a/0x30

@Anuskuss

Since the discussion seems to have cooled down I hope it's okay to ask this here since I plan on switching once this hits stable:
I have a CACHE device with secondarycache=metadata because I'm trying to accomplish two things:

  • Speed up browsing the filesystem
  • Let my HDDs sleep as much as possible (until data is actually read)

I don't think the first one is achieved, because I regularly run find over the entire filesystem and it's not as fast as I'd like it to be (it doesn't improve), and entering some directories I haven't visited in a while also takes time (longer than recently visited ones). Will SPECIAL improve this?
The second one is difficult to test because there aren't many days where my filesystem is truly inactive, but I know that it did enter standby a few times (recently added files take a few seconds to be accessible), though I wouldn't know if the metadata would've been ready. With SPECIAL, once the new metadata is backed up onto the HDDs, there won't be any write activity, but will it be read from SPECIAL without needing to read from the HDDs (e.g. for metadata consistency checks), i.e. without waking them up?

@tonyhutter
Contributor Author

Let my HDDs sleep as much as possible (until data is actually read)

Will SPECIAL improve this?

It depends. If you have a big special device, you could set special_small_blocks to a big value, and your reads wouldn't have to go to the HDDs as much. You could also set l2arc_exclude_special to exclude special data from the L2ARC, which in turn would allow more HDD data to stay in the L2ARC.
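For instance (a short sketch; the module parameter path assumes Linux):

  zfs set special_small_blocks=128K tank    # keep small/medium block reads off the HDDs
  echo 1 > /sys/module/zfs/parameters/l2arc_exclude_special   # don't double-cache special data in L2ARC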

@Anuskuss

special_small_blocks

I don't want any data in the L2ARC, just metadata.

Labels: Status: Code Review Needed (Ready for review and testing)
Projects: None yet
Development: Successfully merging this pull request may close these issues: Write-through special device
7 participants