Linux: wire up copy_file_range, FICLONE, etc to block cloning #15050
Conversation
I think the remaining test failures are not mine. This should be good to go.
Yeah, the remaining test failures look like what @amotin was describing here: https://openzfs.slack.com/archives/C052RGXL5/p1689265971254869
Can't access this link. My system info:
If that all looks right, then please show all the output of setting up your pool and using clonefile, e.g.:
Thanks for your reply! The difference is in the zfs-kmod version. Could you show me how to upgrade the zfs-kmod version? Thanks!
If you just want to do it temporarily, you can run it from the zfs you built yourself. Otherwise, you need to install it. Instructions are here: https://openzfs.github.io/openzfs-docs/Developer%20Resources/Building%20ZFS.html
Thanks for breaking this up into logically separate commits to facilitate the review. This looks great, and it passed all the manual testing I was able to throw at it, as well as 100 iterations of the new test cases. I only posted one comment with a trivial nit.
Just silencing the warning about large allocations. Signed-off-by: Rob Norris <[email protected]> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc.
@robn and it looks like there's potentially one other large allocation.
bv_entcount can be a relatively large allocation (see comment for BRT_RANGESIZE), so get it from the big allocator. Signed-off-by: Rob Norris <[email protected]> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc.
Done, see additional commit. I wasn't able to reproduce it because I don't have sufficiently large vdevs to work with, but it seems fairly clear from the comment on `BRT_RANGESIZE`. I checked the other kmem allocations in brt.c (just some back-of-napkin math) and they all seem like they can almost never be very big - definitely nowhere near that size.
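For a rough sense of scale, here is a back-of-napkin sketch of the bv_entcount sizing. The 16 MiB range size, 16-bit entry width and 100 TiB vdev are illustrative assumptions only (the real values live in the brt.c comments), so treat the numbers as an order-of-magnitude check, not the code's actual math.

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* Assumed values for illustration only; see brt.c for the real ones. */
	const uint64_t range_bytes = 16ULL * 1024 * 1024;             /* per-entry BRT range */
	const uint64_t entry_bytes = sizeof(uint16_t);                /* one counter per range */
	const uint64_t vdev_bytes  = 100ULL * 1024 * 1024 * 1024 * 1024; /* 100 TiB vdev */

	uint64_t entries = vdev_bytes / range_bytes;
	printf("entries:     %llu\n", (unsigned long long)entries);
	printf("bv_entcount: %.1f MiB\n",
	    (double)(entries * entry_bytes) / (1024 * 1024));
	return 0;
}
```

Even under these assumed numbers the array lands in the megabytes for a large vdev, far beyond an ordinary kmem allocation, which is why it goes through the big allocator.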
Checked the most important code and ran my tests against it as well; I found the same errors, and if we apply #14995 the errors are gone. If I have time to add tests for my test case I will, but I need some time to understand how the test framework works exactly and how to add a test.
Good code, and the separate commits are easy to read. Nice work.
Can you describe exactly how to reproduce #14995 against this PR? I have never seen it, never been able to reproduce it with your method, and it doesn't really make sense to me either.
Yes, there are two different problems. First, my setup. Set up the code and pool:
```
sh autogen.sh
./configure --enable-debug --enable-debuginfo --enable-debug-kmem --enable-debug-kmem-tracking
make -s -j$(nproc) && sleep 5 && make install; ldconfig; depmod
# remove the module (rmmod zfs or reboot)
zpool create -f tank /dev/sdb && zfs create tank/test && dd if=/dev/random bs=4M status=progress count=1000 of=/tank/test/test.img
# in most cases I check some stuff out, wait a little bit and while doing this everything is synced to the virtual disk
```
First test:
```
while true; do /usr/bin/cp -fv /tank/test/test.img /tank/test/test.img2 && date; done
```
Just leave it; after a few seconds (mostly about 5 to 10 seconds) you get the result/error. I can reproduce it really well, and based on the workload of the CPU I can tell pretty well when this error will occur. First I just used my own reflink implementation (less complete than yours). After debugging and reading through how the code works, the second test I made was:
```
while true; do /usr/bin/cp -fv /tank/test/test.img /tank/test/test.img2 && sleep 2 && sha256sum /tank/test/test.img2; done
```
This takes a little bit more time, and I think the disk speed is important (the virtual disk is stored on a stable zfs pool on SSDs), but I didn't try it on slower disks. It looks like if you are fast enough to read the data, the ASSERT is false:
```
ASSERT(db->db_state == DB_CACHED || db->db_state == DB_NOFILL ||
    (db->db_state == DB_READ &&
    dr->dt.dl.dr_brtwrite == B_TRUE));
```
Please don't ask why I'm doing some weird things in there. If you have any more questions, please feel free to ask. I hope I didn't forget anything and you can reproduce the error.
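If it helps, here is a rough standalone C rendition of the same clone-then-read-back loop. It is only a sketch of the test described above, not the reporter's actual tooling: the paths are placeholders, it uses FICLONE directly rather than relying on cp choosing reflink, and it loops until interrupted or an error occurs.

```c
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>           /* FICLONE */

int main(void)
{
	const char *src = "/tank/test/test.img";      /* placeholder paths */
	const char *dst = "/tank/test/test.img2";
	char buf[64 * 1024];

	for (;;) {
		int sfd = open(src, O_RDONLY);
		int dfd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
		if (sfd < 0 || dfd < 0) {
			perror("open");
			return 1;
		}
		/* Clone the whole file, like cp --reflink=always. */
		if (ioctl(dfd, FICLONE, sfd) < 0) {
			perror("FICLONE");
			return 1;
		}
		close(dfd);
		close(sfd);

		/* Immediately read the clone back, like the sha256sum step. */
		int rfd = open(dst, O_RDONLY);
		ssize_t n;
		while ((n = read(rfd, buf, sizeof(buf))) > 0)
			;
		if (n < 0)
			perror("read");
		close(rfd);
		sleep(2);
	}
}
```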
Thanks for the detail, I will consider it more closely tomorrow. Just to clarify one point:
Can you reproduce this against my implementation? For the purposes of this PR that's all I'm interested in. If you can, then we might be looking at something real. If not, then it's more likely to be something in your implementation.
Yes, I have tested it with your code. While I was writing the explanation I tested everything with my test setup again using your PR (also tested against other people's reflink wires).
bv_entcount can be a relatively large allocation (see comment for BRT_RANGESIZE), so get it from the big allocator. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Kay Pedersen <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes openzfs#15050
dbuf_undirty() will (correctly) only remove dirty records for the given (open) txg. If there is a dirty record for an earlier closed txg that has not been synced out yet, then db_dirty_records will still have entries on it, tripping the assertion. Instead, change the assertion to only consider the current txg. To some extent this is redundant, as it's really just saying "did dbuf_undirty() work?", but it doesn't hurt and accurately expresses our expectations. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Kay Pedersen <[email protected]> Signed-off-by: Rob Norris <[email protected]> Original-patch-by: Kay Pedersen <[email protected]> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes openzfs#15050
Block cloning introduced a new state transition from DB_NOFILL to DB_READ. This occurs when a block is cloned and then read on the current txg. In this case, the clone will move the dbuf to DB_NOFILL, and then the read will be issued for the overridden block pointer. If that read is still outstanding when it comes time to write, the dbuf will be in DB_READ, which is not handled by the checks in dbuf_sync_leaf, thus tripping the assertions. This updates those checks to allow DB_READ as a valid state iff the dirty record is for a BRT write and there is an override block pointer. This is a safe situation because the block already exists, so there's nothing that could change from underneath the read. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Kay Pedersen <[email protected]> Signed-off-by: Rob Norris <[email protected]> Original-patch-by: Kay Pedersen <[email protected]> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes openzfs#15050
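To make the relaxed condition easier to follow, here is a small standalone C model of the check the commit describes. The enum and struct are simplified stand-ins, not the real OpenZFS dbuf types, so read it as a sketch of the logic rather than the actual dbuf_sync_leaf() code.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the real dbuf states and dirty record. */
typedef enum { DB_UNCACHED, DB_READ, DB_CACHED, DB_NOFILL } db_state_t;

typedef struct {
	bool dr_brtwrite;       /* dirty record came from a block clone */
} dirty_record_t;

/*
 * Sketch of the relaxed expectation: DB_READ is now acceptable, but
 * only when the dirty record is a BRT (clone) write, i.e. the block
 * already exists and the outstanding read cannot race with new data.
 */
static bool
state_ok_for_sync(db_state_t state, const dirty_record_t *dr)
{
	return (state == DB_CACHED || state == DB_NOFILL ||
	    (state == DB_READ && dr->dr_brtwrite));
}

int main(void)
{
	dirty_record_t clone = { .dr_brtwrite = true };
	dirty_record_t plain = { .dr_brtwrite = false };

	assert(state_ok_for_sync(DB_CACHED, &plain));
	assert(state_ok_for_sync(DB_READ, &clone));    /* clone then read: allowed */
	assert(!state_ok_for_sync(DB_READ, &plain));   /* ordinary read: still a bug */
	return 0;
}
```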
This implements the Linux VFS ops required to service the file copy/clone APIs: .copy_file_range (4.5+) .clone_file_range (4.5-4.19) .dedupe_file_range (4.5-4.19) .remap_file_range (4.20+) Note that dedupe_file_range() and remap_file_range(REMAP_FILE_DEDUP) are hooked up here, but are not implemented yet. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Kay Pedersen <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes openzfs#15050
Prior to Linux 4.5, the FICLONE etc ioctls were specific to BTRFS, and were implemented as regular filesystem-specific ioctls. This implements those ioctls directly in OpenZFS, allowing cloning to work on older kernels. There's no need to gate these behind version checks; on later kernels Linux will simply never deliver these ioctls, instead calling the appropriate VFS op. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Kay Pedersen <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes openzfs#15050
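For completeness, this is roughly how userspace exercises those ioctls once they are wired up; it is an illustration only, with placeholder file names, and on kernels or pools without block cloning support the calls are simply expected to fail with an error.

```c
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>           /* FICLONE, FICLONERANGE, struct file_clone_range */

int main(void)
{
	int src = open("src.dat", O_RDONLY);            /* placeholder paths */
	int dst = open("clone.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (src < 0 || dst < 0) {
		perror("open");
		return 1;
	}

	/* Whole-file clone, what cp --reflink=always issues. */
	if (ioctl(dst, FICLONE, src) < 0)
		perror("FICLONE");

	/* Clone just the first 1 MiB of src into the same offset of dst. */
	struct file_clone_range fcr = {
		.src_fd = src,
		.src_offset = 0,
		.src_length = 1024 * 1024,
		.dest_offset = 0,
	};
	if (ioctl(dst, FICLONERANGE, &fcr) < 0)
		perror("FICLONERANGE");

	close(src);
	close(dst);
	return 0;
}
```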
Redhat have backported copy_file_range and clone_file_range to the EL7 kernel using an "extended file operations" wrapper structure. This connects all that up to let cloning work there too. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Kay Pedersen <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes openzfs#15050
Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Kay Pedersen <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes openzfs#15050 Closes openzfs#405 Closes openzfs#13349
Is it expected that
I would suspect you're running into what I mentioned about it failing to activate the feature in the first place, guessing blindly, assuming you marked the feature as "enabled" already.
Well, that is a well-deserved facepalm for me. Thank you. I'll choose to blame GitHub's "xxx hidden comments..." disclosure for me not even knowing there was a pool feature, rather than a feature in the broader sense, even though I know it was not solely GitHub's fault. 😄 Sorry for the noise!
Motivation and Context
The recent addition of block cloning to OpenZFS was not initially available on Linux. This adds the missing pieces.
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Closes: #405
Closes: #13349
Description
This implements the necessary interfaces to allow the `copy_file_range` syscall and the `FICLONE`, `FICLONERANGE` and `FIDEDUPERANGE` ioctls to be properly routed through to OpenZFS, and provides implementations of all but dedup (a userspace sketch of these entry points follows the list):
- the `.copy_file_range`, `.clone_file_range`, `.dedupe_file_range` and `.remap_file_range` VFS ops
- the `FICLONE`, `FICLONERANGE` and `FIDEDUPERANGE` ioctls
- the EL7 "extended" `.copy_file_range` and `.clone_file_range` VFS ops
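For reference, a minimal userspace sketch of the syscall side of this (the paths are placeholders; `copy_file_range()` has been exposed by glibc since 2.27). Whether the bytes end up block-cloned or plainly copied is up to the filesystem, which is exactly the ambiguity discussed in the next paragraph.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

int main(void)
{
	int in = open("src.dat", O_RDONLY);             /* placeholder paths */
	int out = open("dst.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	struct stat st;

	if (in < 0 || out < 0 || fstat(in, &st) < 0) {
		perror("setup");
		return 1;
	}

	/*
	 * Ask the kernel to move the whole file; the filesystem may satisfy
	 * this with a block clone or fall back to an ordinary copy.
	 */
	off_t remaining = st.st_size;
	while (remaining > 0) {
		ssize_t n = copy_file_range(in, NULL, out, NULL, remaining, 0);
		if (n <= 0) {
			perror("copy_file_range");
			return 1;
		}
		remaining -= n;
	}

	close(in);
	close(out);
	return 0;
}
```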
Note that I've wired up the dedup calls for completeness, but currently they return `EOPNOTSUPP` or `ENOTTY` as appropriate. Implementing it is pretty involved, and beyond the scope of this PR.
Note that this does not attempt to address the issues surrounding cross-dataset cloning in Linux (I'm not even sure there's much we can really do anyway). The short version is that only `copy_file_range` (since 5.3) can clone across filesystems, but there's no way to know from its return if it did a clone, a regular copy or a bit of both. coreutils 9+ will use `copy_file_range` for `cp --reflink=auto` (the default), but `FICLONE` for `cp --reflink=always` (previously it always used `FICLONE`). It's hard to say whether or not users will find this confusing. It might require documentation improvements, or real effort to make it work. I suggest it's out of scope for this PR too; we can consider options later if it becomes clear that cross-dataset cloning is in high demand.
How Has This Been Tested?
I wrote a test program: https://github.com/robn/clonefile
I've tested on the following kernels/distributions:
All performed as I would expect: `FICLONE`/`FICLONERANGE` worked on all, and `copy_file_range` worked on all but Debian 8 / 3.16 (the syscall doesn't exist there). When block cloning is disabled, all calls fail correctly except `copy_file_range`, which falls back to a regular file copy. Determining if the file was cloned or not is just a matter of looking at the L0 DVAs for each file and comparing them.
Cloning smaller file ranges appears to work within the existing constraints of `zfs_clone_range()`, but I have not tested extensively. I've incorporated some of this into the test suite. The tests are not very comprehensive, but should be enough of a starting point; I'd appreciate feedback.