Add vdev property to toggle adding io to vdev queue #17358
base: master
Conversation
Thanks for refreshing this and testing the deadman case. After looking at that code again, I agree we should be okay there.
module/zfs/vdev_queue.c
```diff
@@ -922,6 +922,12 @@ vdev_queue_io(zio_t *zio)
 	zio->io_flags |= ZIO_FLAG_DONT_QUEUE;
 	zio->io_timestamp = gethrtime();
 
+	if (!zio->io_vd->vdev_queue_io &&
+	    !(zio->io_flags & ZIO_FLAG_NODATA)) {
```
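For orientation, here is a minimal sketch of the check under discussion, written as a standalone predicate rather than the exact hunk above. The per-vdev vdev_queue_io field is the cached property value this PR introduces; everything else uses existing zio/vdev definitions.

```c
#include <sys/zio.h>
#include <sys/vdev_impl.h>

/*
 * Sketch only, not part of the patch: a zio may skip the vdev queue
 * when the vdev's queueing property is off and the zio actually
 * carries data. ZIO_FLAG_NODATA zios still take the queued path,
 * which is the case the reviewer asks to have documented below.
 */
static boolean_t
vdev_queue_io_bypass(const zio_t *zio)
{
	return (!zio->io_vd->vdev_queue_io &&
	    !(zio->io_flags & ZIO_FLAG_NODATA));
}
```

Since vdev_queue_io() returns the zio to issue next (or NULL once the zio has been queued), the bypass presumably amounts to returning the zio to the caller unqueued inside this branch.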
Is this ZIO_FLAG_NODATA check correct? If so, can you add a brief comment explaining why we can't skip the queue in this case.
Done
tests/zfs-tests/tests/functional/cli_root/zpool_set/vdev_set_queue_io.ksh
```c
			error = EINVAL;
			break;
		}
		if (vd->vdev_ops->vdev_op_leaf) {
```
Only allowing this to be set on leaves has the complication that when the drive is replaced it will revert to the default value for the property. The VDEV_PROP_CHECKSUM/IO/SLOW_N/T properties handle this with a workaround which inherits non-default values if they're set on any of the parents (see vdev_prop_get_inherited). We should consider doing the same thing. The only wrinkle would be that you'd want to cache the value in the vdev_t.
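A rough sketch of the kind of inheritance walk being suggested here, assuming the setting is cached in a per-vdev field. vdev_queue_io_setting and VDEV_QUEUE_IO_DEFAULT are hypothetical placeholder names, not identifiers from the patch or from vdev_prop_get_inherited.

```c
#include <sys/vdev_impl.h>

/* Placeholder sentinel meaning "not explicitly set". */
#define	VDEV_QUEUE_IO_DEFAULT	UINT64_MAX

/*
 * Hypothetical sketch: climb from the leaf toward the root vdev and
 * use the first explicitly-set (non-default) value found, so a
 * replacement leaf picks up a value that was set on its parent.
 * The field name and default below are placeholders.
 */
static uint64_t
vdev_queue_io_get_inherited(vdev_t *vd)
{
	for (vdev_t *pvd = vd; pvd != NULL; pvd = pvd->vdev_parent) {
		if (pvd->vdev_queue_io_setting != VDEV_QUEUE_IO_DEFAULT)
			return (pvd->vdev_queue_io_setting);
	}
	return (VDEV_QUEUE_IO_DEFAULT);
}
```

Caching the resolved value in the vdev_t at open/property-set time, as the comment suggests, would keep this walk off the I/O path.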
I'm just wondering whether you've tried and exhausted all the tuning possibilities for the vdev queues. It would seem that the default values just don't mirror modern NVMe devices at all; all of the values IMHO deserve attention, and if tuned properly there AFAIK is no need for "hacks" like this one at all... Are you sure you've tried everything w.r.t. tuning? To me it seems like going for changing the code without first trying to get it working properly as it was intended...

I mean, I've been there,

I think I have gone through the discussion in #16591 thoroughly enough and I'm not seeing any attempt to tune the queues right. NVMe differs from SATA: it can have up to 1k requests in flight, whereas the OpenZFS defaults correspond to SATA max values, where it's only possible to have 32 queued requests. That's "only" a 32x difference... I mean, I am almost certain - because I'm speaking from my experience and I've invested quite some time into it - that it is just a matter of tuning the queues right. I agree with @robn's comment in the original PR: I don't think we should merge this without it being supported by evidence; flamegraphs alone, without any tuning attempts, are just evidence that those tuning attempts are needed.
I just did some quick testing and with just echo
@MigeljanImeri thanks for the explanation; also, huge thanks for taking on perf tuning in the codebase (perhaps should have opened with that, sorry :D)

how does the (assuming your

IIRC there's also some uplift to be had if you raise the

btw it's even more interesting with the max queues and queued entries with NVMe; actually the specs allow for 64k queues and 64k requests in each of those queues, didn't know it could go that high :D
Definitely an uplift, we get up to 422k with
Btw what happens if you just, you know, set all of the

I think wanting to have vdev properties that affect the queueing mechanism in a way that is suitable for whatever storage medium is underneath is very legit. Just wanting to have multiple separate pools with rotational SATA, SATA SSDs or some fast NVMes in one machine calls for different queue sizes for those pools in some way. So vdev properties sound right for that. Not sure if there will in the end be a real need for the vdev queueing bypass; it might be that high numbers do the job too, and if there's some other stuff that can be optimized in the code, other than just the locking, it would also benefit SATA SSD based pools/systems. (I'm not against the bypass as such, I'm just wondering how far we can go with what we already have and if we can tackle the technological variability even within one system)
Since I was invoked, I will say that I am not and never was against this change as such. What I was (softly) against in #16591 was any change that binds us to a future that assumes a particular behaviour from the queue, or even a queue at all. And that's because I have a couple of alternate queue models on the bench for NVMe that are very different to the current queue, and don't want a stored property that assumes "traditional queue or bust", as it leaves me nowhere to go without a breaking change. Thus my ask upthread for a queue type selector, which gives me room to move.

On tuning, you cannot tune your way past
Haven't looked at the code honestly, since I covered everything I needed just by tuning the queues. Generally what I need is like ~100k IOPS peak while the ARC covers the rest, though I understand that's missing in the chain with direct IO. The bypass is awesome for removing the ceiling altogether, but how does it interact with other features, like trim and scrub/resilver? What happens with such a high-performance system on those occasions, and how is it managed? The original PR tried to address this; what @tonyhutter proposed seemed reasonable.
on that topic - how about using atomics for stuff needed in the fast paths, when too wide
Currently, any queuing would be left to the device itself to do. I haven't run into any issues with running trims and scrubs concurrently while also doing reads / writes in my testing. If it does become an issue, I can always adjust the

I do plan on trying to utilize those in areas with high contention, just need to do more testing to see if it helps.
Added vdev property to toggle queueing io to vdev queue when reading / writing from / to vdev. The intention behind this property is to improve IOPS performance when using o_direct. Signed-off-by: MigeljanImeri <[email protected]>
That's fine if you know that you can submit all the requests you may have through the block layer and to the device at the same time. It may be true for NVMe (assuming the device actually runs all the requests you send to it and doesn't marinate them on a request queue); it is questionable for SAS, doubtful for SATA, and just not so for old-style USB. It has nothing to do with whether the device is rotational or not. We should somehow get the maximum queue depth for the device from the OS, if possible, though for SCSI/SAS it might be difficult, since it is not reported directly. On top of that, rotational devices add their own specifics. But that would not be a problem if we could specify device priorities for each request to handle the desired level of service; then the device's own scheduler could do the magic. But unfortunately that functionality was really rudimentary in hardware last time I checked.

Before I even looked into the current patch: I think that instead of calling this an I/O queue control, we should call it an I/O scheduler control. The queue is just a way to implement a scheduler, and the user should understand that they are disabling the I/O scheduler.
+1. I didn't want to quibble, felt like I'd bikeshedded (bikeshod?) enough on this and the previous PR. But if the property was called
I would expect this will start to matter as you get closer to those ~2.5M IOPS this drive can supposedly do - until you can make it fully busy, it can be postponed - this is also how it might make sense to present it towards the user - like

With easily up to 24 of these in one system... this seems like a fun ride, I'll try to get some of these drives too (Kioxia, wanted to try them for a while)
This PR is very similar to the other one (#16591), although I have made some significant changes and so I have opened this one instead. I have reverted to a vdev property, as it allows for more granular control over which vdevs will queue io. I am open to changing this to a module parameter, although we would need something to allow for more granular control over the vdevs it affects.
Motivation and Context
Described in #16591.
Description
Described in #16591.
How Has This Been Tested?
Along with the testing described in #16591, I have done additional testing to see if this property could lead to starvation. I have tested on two separate nodes, one with gen 3 drives (INTEL SSDPE21K375GA) and the other with gen 5 drives (KIOXIA KCMYXRUG7T68). The testing consisted of doing mixed reads / writes, both buffered and direct, along with simultaneously launching zpool trims, scrubs and syncs in an attempt to cause starvation. I did not see deadman events on either node. I don't think this property should be disabled for spinning disk backed vdevs, and I have added a warning in the man page that it's not recommended.

In regard to the deadman events, bypassing the queue should still allow these events to occur, albeit not through the spa_deadman path. In zio_wait, the parent zio will trigger the deadman code if any of its children are hung. Because of this, we do not need to keep track of which zios are active in a list, avoiding the performance hit of having to take a lock to add to that list.

Additionally, the tests that were failing, zed_slow_io_many_vdevs along with zed_slow_io, were caused by file backed vdevs being labeled as non-rotational, causing the queue to be bypassed.
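As a rough illustration of that reasoning (a simplified sketch, not the actual zio_wait() implementation, which differs in detail), the parent's wait can be pictured as a bounded timed wait that reports the hung zio tree whenever the deadman threshold expires, with no per-queue bookkeeping required for the bypassed children:

```c
#include <sys/zfs_context.h>
#include <sys/zio.h>

extern uint64_t zfs_deadman_ziotime_ms;	/* existing deadman tunable */

/*
 * Simplified sketch of the mechanism described above. The parent
 * waits for its children with a timed wait; whenever the wait times
 * out, the hung I/O tree is reported via zio_deadman(), so hangs are
 * still detected even though bypassed zios never sit on a vdev queue.
 */
static void
zio_wait_for_children_sketch(zio_t *zio)
{
	hrtime_t timeout = MSEC2NSEC(zfs_deadman_ziotime_ms);

	mutex_enter(&zio->io_lock);
	while (zio->io_executor != NULL) {
		if (cv_timedwait_hires(&zio->io_cv, &zio->io_lock,
		    timeout, MSEC2NSEC(1), 0) < 0) {
			/* The timed wait expired: report the hung zios. */
			mutex_exit(&zio->io_lock);
			zio_deadman(zio, FTAG);
			mutex_enter(&zio->io_lock);
		}
	}
	mutex_exit(&zio->io_lock);
}
```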