
Add vdev property to toggle adding io to vdev queue #17358

Open · wants to merge 1 commit into master

Conversation

MigeljanImeri (Contributor)

This PR is very similar to #16591, but I have made some significant changes, so I have opened this one instead. I have reverted to a vdev property, as it allows more granular control over which vdevs will queue I/O. I am open to changing this to a module parameter, although we would then need some other mechanism for per-vdev control.

Motivation and Context

Described in #16591.

Description

Described in #16591.

How Has This Been Tested?

Along with the testing described in #16591, I have done additional testing to see if this property could lead to starvation. I tested on two separate nodes, one with gen 3 drives (INTEL SSDPE21K375GA) and the other with gen 5 drives (KIOXIA KCMYXRUG7T68). The testing consisted of mixed reads/writes, both buffered and direct, while simultaneously launching zpool trims, scrubs, and syncs in an attempt to cause starvation. I did not see deadman events on either node. I don't think this property should be disabled for spinning-disk-backed vdevs, and I have added a warning in the man page that it's not recommended.

In regard to the deadman events, bypassing the queue should still allow these events to occur, albeit not through the spa_deadman path. In zio_wait, the parent zio will trigger the deadman code if any of its children are hung. Because of this, we do not need to keep track of which zios are active in a list, avoiding the performance hit of taking a lock to add to such a list.
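For context, here is a heavily simplified C sketch of the zio_wait() behaviour described above; it is paraphrased for illustration and is not the verbatim OpenZFS code. The point is that the deadman check fires from the waiting parent zio, so no per-queue bookkeeping is required:

```c
/*
 * Illustrative paraphrase of zio_wait()'s deadman path (not verbatim
 * OpenZFS code).  The parent zio wakes up periodically while waiting
 * for its children; if the zio tree has been outstanding longer than
 * the deadman threshold, zio_deadman() is invoked even though the zio
 * never passed through a vdev queue.
 */
int
zio_wait(zio_t *zio)
{
	int error;

	__zio_execute(zio);

	mutex_enter(&zio->io_lock);
	while (zio->io_executor != NULL) {
		/* Wake up periodically instead of blocking forever. */
		error = cv_timedwait_io(&zio->io_cv, &zio->io_lock,
		    ddi_get_lbolt() + MSEC_TO_TICK(zfs_deadman_ziotime_ms));

		/*
		 * If the zio has been outstanding longer than the deadman
		 * threshold, report it; no vdev queue entry is needed for
		 * this to fire.
		 */
		if (zfs_deadman_enabled && error == -1 &&
		    gethrtime() - zio->io_queued_timestamp >
		    spa_deadman_ziotime(zio->io_spa))
			zio_deadman(zio, FTAG);
	}
	mutex_exit(&zio->io_lock);

	error = zio->io_error;
	zio_destroy(zio);
	return (error);
}
```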

Additionally, the tests that were failing, zed_slow_io_many_vdevs and zed_slow_io, failed because file-backed vdevs were being labeled as non-rotational, causing the queue to be bypassed.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@behlendorf added the "Status: Code Review Needed" (Ready for review and testing) label on May 20, 2025
@behlendorf (Contributor) left a comment

Thanks for refreshing this and testing the deadman case. After looking at that code again, I agree we should be okay there.

```c
@@ -922,6 +922,12 @@ vdev_queue_io(zio_t *zio)
	zio->io_flags |= ZIO_FLAG_DONT_QUEUE;
	zio->io_timestamp = gethrtime();

	if (!zio->io_vd->vdev_queue_io &&
	    !(zio->io_flags & ZIO_FLAG_NODATA)) {
```
Contributor:

Is this ZIO_FLAG_NODATA check correct? If so, can you add a brief comment explaining why we can't skip the queue in this case.

Contributor Author (MigeljanImeri):

Done

```c
	error = EINVAL;
	break;
}
if (vd->vdev_ops->vdev_op_leaf) {
```
Contributor:

Only allowing this to be set on leaves has the complication that when the drive is replaced it will revert back to the default value for the property. The VDEV_PROP_CHECKSUM/IO/SLOW_N/T properties handle this with a workaround which inherits non-default values if they're set on any of the parents (see vdev_prop_get_inherited). We should consider doing the same thing. The only wrinkle would be that you'd want to cache the value in the vdev_t.
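To make that suggestion concrete, here is a rough sketch of inheriting a non-default value from a parent and caching it in the vdev_t. This is not the actual patch; the vdev_queue_io_set "explicitly set" flag and the helper name are hypothetical:

```c
/*
 * Rough sketch of the inheritance workaround described above: walk up
 * from the leaf and adopt the first explicitly set value, caching the
 * result in the vdev_t so the I/O hot path never does a property
 * lookup.  The vdev_queue_io_set flag and this helper are hypothetical.
 */
static void
vdev_cache_queue_io_prop(vdev_t *vd)
{
	vdev_t *pvd;

	vd->vdev_queue_io = B_TRUE;	/* default: keep queueing enabled */

	for (pvd = vd; pvd != NULL; pvd = pvd->vdev_parent) {
		if (pvd->vdev_queue_io_set) {	/* explicitly set here? */
			vd->vdev_queue_io = pvd->vdev_queue_io;
			break;
		}
	}
}
```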

@snajpa (Contributor) commented May 21, 2025

I'm just wondering whether you've tried and exhausted all the tuning possibilities for the vdev queues. It seems the default values just don't match modern NVMe devices at all; all of the values IMHO deserve attention, and if they are tuned properly, there is AFAIK no need for "hacks" like this one at all...

Are you sure you've tried everything w.r.t. tuning zfs_vdev_async|sync_read|write_min|max_active, zfs_vdev_max_active, and others?

To me it seems like reaching for code changes without first trying to get the existing mechanism working properly, as it was intended...

@snajpa (Contributor) commented May 21, 2025

I mean, I've been there with vq_lock, yes, but I solved it by tuning what could be tuned and didn't feel any need for patching, so I'm wondering why you did: did you try everything before reaching for the code editor?

@snajpa (Contributor) commented May 21, 2025

I think I have gone through the discussion in #16591 thoroughly enough, and I'm not seeing any attempt to tune the queues right. NVMe differs from SATA: it can have up to 1k requests in flight, whereas the OpenZFS defaults correspond to the SATA maximum, where only 32 requests can be queued. That's "only" a 32x difference... I mean, I am almost certain, speaking from my experience and having invested quite some time into it, that it is just a matter of tuning the queues right.

I agree with @robn's comment in the original PR:

> I don't think we should merge this without it being supported by evidence - flamegraphs alone without any tuning attempts are just evidence those tuning attempts are needed.

@MigeljanImeri (Contributor, Author)

> I think I have gone through the discussion in #16591 thoroughly enough, and I'm not seeing any attempt to tune the queues right. NVMe differs from SATA: it can have up to 1k requests in flight, whereas the OpenZFS defaults correspond to the SATA maximum, where only 32 requests can be queued. That's "only" a 32x difference... I mean, I am almost certain, speaking from my experience and having invested quite some time into it, that it is just a matter of tuning the queues right.
>
> I agree with @robn's comment in the original PR:
>
> > I don't think we should merge this without it being supported by evidence - flamegraphs alone without any tuning attempts are just evidence those tuning attempts are needed.

I just did some quick testing, and with just `echo 1000 > zfs_vdev_sync_read_max_active` I was able to get around 401k IOPS; with this property that went up to 441k, about a 10% improvement in random read IOPS. I do agree that the tuning helps significantly and gets us almost all the way there, but there is still some merit to this property. I am also working on a few other patches to help improve IOPS performance, and they all build on top of this vdev queue bypass. From the testing I've done, this vq_lock seems to be the biggest source of contention, and with it still in place it just overshadows any other optimizations I can make.

@snajpa (Contributor) commented May 21, 2025

@MigeljanImeri thanks for the explanation; also, huge thanks for taking on perf tuning in the codebase (perhaps I should have opened with that, sorry :D)

What does the perf report look like with zfs_vdev_max_active=1023 and zfs_vdev_sync_read_max_active=1023?

(assuming your /sys/class/nvme/nvmeX/nvmeXnY/queue/nr_requests is 1023)

IIRC there's also some uplift to be had if you raise the _min_active numbers somewhat (but not too high).

Btw, it's even more interesting with the maximum queues and queued entries for NVMe: the spec actually allows for 64k queues and 64k requests in each of those queues. I didn't know it could go that high :D

@MigeljanImeri (Contributor, Author)

> @MigeljanImeri thanks for the explanation; also, huge thanks for taking on perf tuning in the codebase (perhaps I should have opened with that, sorry :D)
>
> What does the perf report look like with zfs_vdev_max_active=1023 and zfs_vdev_sync_read_max_active=1023?
>
> (assuming your /sys/class/nvme/nvmeX/nvmeXnY/queue/nr_requests is 1023)
>
> IIRC there's also some uplift to be had if you raise the _min_active numbers somewhat (but not too high).
>
> Btw, it's even more interesting with the maximum queues and queued entries for NVMe: the spec actually allows for 64k queues and 64k requests in each of those queues. I didn't know it could go that high :D

Definitely an uplift; we get up to 422k with zfs_vdev_max_active and zfs_vdev_sync_read_max_active at 1023 and zfs_vdev_sync_read_min_active increased to 50. The perf report is showing a ton of contention in vdev_queue_io. Yeah, the blk-mq layer can go really fast.

[perf report attachment: zfs_master_256jobs_1023_max_active]

@snajpa (Contributor) commented May 21, 2025

Btw, what happens if you just set all of the _max_active variants to ridiculously high values, and the _min_active values to the actual maximum of the device? That also, in effect, takes out the osq_lock spinning, doesn't it? At that point, the perf data should show where to focus the optimizations, without the immediate need to bypass the queueing entirely. Am I wrong?

I think wanting vdev properties that affect the queueing mechanism in a way suitable for whatever storage medium is underneath is very legit. Having multiple separate pools, with rotational SATA, SATA SSDs, or some fast NVMe drives, in one machine calls for different queue sizes for those pools in some way. So vdev properties sound right for that.

I'm not sure there will, in the end, be a real need for the vdev queueing bypass; it might be that high numbers do the job too, and if there's other stuff in the code that can be optimized beyond just the locking, that would also benefit SATA-SSD-based pools and systems.

(I'm not against the bypass as such; I'm just wondering how far we can go with what we already have, and whether we can tackle the technological variability even within one system.)

@robn (Member) commented May 21, 2025

> I agree with @robn's comment in the original PR

Since I was invoked, I will say that I am not, and never was, against this change as such. What I was (softly) against in #16591 was any change that binds us to a future that assumes a particular behaviour from the queue, or even a queue at all. That's because I have a couple of alternate queue models on the bench for NVMe that are very different from the current queue, and I don't want a stored property that assumes "traditional queue or bust", as it leaves me nowhere to go without a breaking change. Thus my ask upthread for a queue type selector, which gives me room to move.

On tuning: you cannot tune your way past vq_lock once you have enough load. All the limits are used under the lock; the lock itself is the problem, as it's taken for every I/O on issue and on interrupt. There's no question that it's a problem. Taking it out of the picture isn't the whole story, as there are other locks that can be heavily contended on "up" vs "down" (dd_lock/ds_lock/dp_lock, vd_expand_lock, etc.). We'll have to tackle them all in time, but that's no reason not to start with this one.
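For readers following along, a simplified paraphrase of the issue and completion paths (not the verbatim OpenZFS code) shows why the per-vdev mutex is hit for every I/O regardless of how the class limits are tuned:

```c
/*
 * Simplified paraphrase of the vdev queue issue/completion paths (not
 * the verbatim OpenZFS code).  Both paths serialize on the same
 * per-vdev vq_lock, so raising the *_max_active limits does not remove
 * the contention point itself.
 */
zio_t *
vdev_queue_io(zio_t *zio)
{
	vdev_queue_t *vq = &zio->io_vd->vdev_queue;
	zio_t *nio;

	mutex_enter(&vq->vq_lock);		/* once per issued I/O */
	vdev_queue_io_add(vq, zio);
	nio = vdev_queue_io_to_issue(vq);	/* class limits checked here */
	mutex_exit(&vq->vq_lock);

	return (nio);
}

void
vdev_queue_io_done(zio_t *zio)
{
	vdev_queue_t *vq = &zio->io_vd->vdev_queue;

	mutex_enter(&vq->vq_lock);		/* again on every completion */
	vdev_queue_pending_remove(vq, zio);
	/* ...issue more queued I/Os while still holding the lock... */
	mutex_exit(&vq->vq_lock);
}
```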

@snajpa (Contributor) commented May 21, 2025

> All the limits are used under the lock; the lock itself is the problem

Honestly, I haven't looked at the code, since I covered everything I needed just by tuning the queues. Generally what I need is ~100k IOPS peak while the ARC covers the rest, though I understand that's missing from the chain with Direct I/O.

The bypass is awesome for removing the ceiling altogether, but how does it interact with other features, like trim and scrub/resilver? What happens with such a high-performance system on those occasions, and how is it managed? The original PR tried to address this; what @tonyhutter proposed seemed reasonable.

> there are other locks that can be heavily contended

On that topic: how about using atomics for the stuff needed in the fast paths, where overly wide osq_locks are problematic?

@MigeljanImeri (Contributor, Author)

> > All the limits are used under the lock; the lock itself is the problem
>
> Honestly, I haven't looked at the code, since I covered everything I needed just by tuning the queues. Generally what I need is ~100k IOPS peak while the ARC covers the rest, though I understand that's missing from the chain with Direct I/O.
>
> The bypass is awesome for removing the ceiling altogether, but how does it interact with other features, like trim and scrub/resilver? What happens with such a high-performance system on those occasions, and how is it managed? The original PR tried to address this; what @tonyhutter proposed seemed reasonable.

Currently, any queuing would be left to the device itself to do. I haven't run into any issues with running trims and scrubs concurrently while also doing reads/writes in my testing. If it does become an issue, I can always adjust the should_queue function to detect if anything like that is happening and queue if so.
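Purely as an illustration of that fallback idea (the function below is hypothetical and not code from this patch): the bypass could defer to the queue whenever a scrub or TRIM is active, for example:

```c
/*
 * Hypothetical sketch of the fallback mentioned above, not code from
 * this patch: keep bypassing the queue in the common case, but fall
 * back to queueing while a scrub or TRIM is running so those
 * workloads retain the scheduler's protection.
 */
static boolean_t
vdev_should_queue_io(zio_t *zio)
{
	vdev_t *vd = zio->io_vd;
	spa_t *spa = vd->vdev_spa;

	if (vd->vdev_queue_io)		/* property left at its default */
		return (B_TRUE);

	if (dsl_scan_scrubbing(spa_get_dsl(spa)))	/* scrub in progress */
		return (B_TRUE);

	if (vd->vdev_trim_state == VDEV_TRIM_ACTIVE)	/* TRIM in progress */
		return (B_TRUE);

	return (B_FALSE);
}
```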

> > there are other locks that can be heavily contended
>
> On that topic: how about using atomics for the stuff needed in the fast paths, where overly wide osq_locks are problematic?

I do plan on trying to utilize atomics in areas with high contention; I just need to do more testing to see if it helps.

Added vdev property to toggle queueing io to vdev queue when
reading / writing from / to vdev. The intention behind this
property is to improve IOPS performance when using o_direct.

Signed-off-by: MigeljanImeri <[email protected]>
@amotin (Member) commented May 22, 2025

> Currently, any queuing would be left to the device itself to do.

That's fine if you know you can submit all the requests you may have through the block layer and to the device at the same time. That may be true for NVMe (assuming the device actually runs all the requests you send it rather than marinating them on a request queue), it is questionable for SAS, doubtful for SATA, and just not so for old-style USB. It has nothing to do with whether the device is rotational or not. We should somehow get the maximum queue depth for the device from the OS, if possible, though for SCSI/SAS it might be difficult, since it is not reported directly. On top of that, rotational devices add their own specifics. But that would not be a problem if we could specify a device priority for each request to handle the desired level of service; then the device's own scheduler could do the magic. Unfortunately, that functionality was really rudimentary in hardware last time I checked.

Even before looking into the current patch: instead of calling this an I/O queue control, I think we should call it an I/O scheduler control. The queue is just a way to implement a scheduler, and the user should understand that they are disabling the I/O scheduler.

@robn (Member) commented May 22, 2025

> Even before looking into the current patch: instead of calling this an I/O queue control, I think we should call it an I/O scheduler control.

+1. I didn't want to quibble; I felt like I'd bikeshedded (bikeshod?) enough on this and the previous PR. But if the property were called scheduler=, I'd be delighted.

@snajpa (Contributor) commented May 22, 2025

> If it does become an issue, I can always adjust the should_queue function to detect if anything like that is happening and queue if so.

I would expect this to start to matter as you get closer to the ~2.5M IOPS this drive can supposedly do; until you can make it fully busy, it can be postponed. This is also how it might make sense to present it to the user: something like "scheduler=off is for drives which ZFS can't saturate", or so.

With easily up to 24 of these in one system... this seems like a fun ride. I'll try to get some of these drives too (Kioxia; I've wanted to try them for a while).

Labels: Status: Code Review Needed (Ready for review and testing)
Projects: none
6 participants