Linux: wire up copy_file_range, FICLONE, etc to block cloning #15050
Conversation
I think the remaining test failures are not mine. This should be good to go.
Yeah, the remaining test failures look like what @amotin was describing here: https://openzfs.slack.com/archives/C052RGXL5/p1689265971254869
Can't access this link. My system info:
If that all looks right, then please show all the output of setting up your pool and using clonefile, e.g.:
Thanks for your reply! The difference is in the zfs-kmod version. Could you show me how to upgrade the zfs-kmod version? Thanks!
If you just want to do it temporarily, you can run it from the zfs you built yourself. Otherwise, you need to install it. Instructions are here: https://openzfs.github.io/openzfs-docs/Developer%20Resources/Building%20ZFS.html
Thanks for breaking this up into logically separate commits to facilitate the review. This looks great, and it passed all the manual testing I was able to throw at it, as well as 100 iterations of the new test cases. I only posted one comment with a trivial nit.
Just silencing the warning about large allocations. Signed-off-by: Rob Norris <[email protected]> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc.
@robn and it looks like there's potentially one other large allocation.
bv_entcount can be a relatively large allocation (see comment for BRT_RANGESIZE), so get it from the big allocator. Signed-off-by: Rob Norris <[email protected]> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc.
Done, see additional commit. I wasn't able to reproduce it because I don't have sufficiently large vdevs to work with, but it seems fairly clear from the comment on `BRT_RANGESIZE`. I checked the other kmem allocations in brt.c (just some back-of-napkin math) and they all seem like they can almost never be very big - definitely nowhere near that size.
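For a rough sense of scale, here is a back-of-napkin sketch of the bv_entcount sizing. The 16 MiB range size, 16-bit entry width and 100 TiB vdev are illustrative assumptions only (the real values live in the brt.c comments), so treat the numbers as an order-of-magnitude check, not the code's actual math.

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* Assumed values for illustration only; see brt.c for the real ones. */
	const uint64_t range_bytes = 16ULL * 1024 * 1024;             /* per-entry BRT range */
	const uint64_t entry_bytes = sizeof(uint16_t);                /* one counter per range */
	const uint64_t vdev_bytes  = 100ULL * 1024 * 1024 * 1024 * 1024; /* 100 TiB vdev */

	uint64_t entries = vdev_bytes / range_bytes;
	printf("entries:     %llu\n", (unsigned long long)entries);
	printf("bv_entcount: %.1f MiB\n",
	    (double)(entries * entry_bytes) / (1024 * 1024));
	return 0;
}
```

Even under these assumed numbers the array lands in the megabytes for a large vdev, far beyond an ordinary kmem allocation, which is why it goes through the big allocator.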
Checked the most important code and ran my tests against it as well; I found the same errors, and if we apply #14995 the errors are gone. If I have time to add tests for my test case I will, but I need some time to understand how the test framework works exactly and how to add a test.
Good code, and the separate commits are easy to read. Nice work.
Can you describe exactly how to reproduce #14995 against this PR? I have never seen it, never been able to reproduce it with your method, and it doesn't really make sense to me either.
Yes, there are two different problems. First, my setup. Set up the code and pool:
```
sh autogen.sh
./configure --enable-debug --enable-debuginfo --enable-debug-kmem --enable-debug-kmem-tracking
make -s -j$(nproc) && sleep 5 && make install; ldconfig; depmod
# remove the module (rmmod zfs or reboot)
zpool create -f tank /dev/sdb && zfs create tank/test && dd if=/dev/random bs=4M status=progress count=1000 of=/tank/test/test.img
# in most cases I check some stuff out, wait a little bit and while doing this everything is synced to the virtual disk
```
First test:
```
while true; do /usr/bin/cp -fv /tank/test/test.img /tank/test/test.img2 && date; done
```
Just leave it; after a few seconds (mostly about 5 to 10 seconds) you get the result/error. I can reproduce it really well, and based on the workload of the CPU I can tell pretty well when this error will occur. First I just used my own reflink implementation (less complete than yours). After debugging and reading through how the code works, the second test I made was:
```
while true; do /usr/bin/cp -fv /tank/test/test.img /tank/test/test.img2 && sleep 2 && sha256sum /tank/test/test.img2; done
```
This takes a little bit more time, and I think the disk speed is important (the virtual disk is stored on a stable zfs pool on SSDs), but I didn't try it on slower disks. It looks like if you are fast enough to read the data, the ASSERT is false:
```
ASSERT(db->db_state == DB_CACHED || db->db_state == DB_NOFILL ||
    (db->db_state == DB_READ &&
    dr->dt.dl.dr_brtwrite == B_TRUE));
```
Please don't ask why I'm doing some weird things in there. If you have any more questions, please feel free to ask. I hope I didn't forget anything and you can reproduce the error.
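If it helps, here is a rough standalone C rendition of the same clone-then-read-back loop. It is only a sketch of the test described above, not the reporter's actual tooling: the paths are placeholders, it uses FICLONE directly rather than relying on cp choosing reflink, and it loops until interrupted or an error occurs.

```c
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>           /* FICLONE */

int main(void)
{
	const char *src = "/tank/test/test.img";      /* placeholder paths */
	const char *dst = "/tank/test/test.img2";
	char buf[64 * 1024];

	for (;;) {
		int sfd = open(src, O_RDONLY);
		int dfd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
		if (sfd < 0 || dfd < 0) {
			perror("open");
			return 1;
		}
		/* Clone the whole file, like cp --reflink=always. */
		if (ioctl(dfd, FICLONE, sfd) < 0) {
			perror("FICLONE");
			return 1;
		}
		close(dfd);
		close(sfd);

		/* Immediately read the clone back, like the sha256sum step. */
		int rfd = open(dst, O_RDONLY);
		ssize_t n;
		while ((n = read(rfd, buf, sizeof(buf))) > 0)
			;
		if (n < 0)
			perror("read");
		close(rfd);
		sleep(2);
	}
}
```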
Thanks for the detail, I will consider it more closely tomorrow. Just to clarify one point:
Can you reproduce this against my implementation? For the purposes of this PR that's all I'm interested in. If you can, then we might be looking at something real. If not, then it's more likely to be something in your implementation.
Yes, I have tested it with your code. While I was writing the explanation I tested everything with my test setup again using your PR (also tested against other people's reflink wires).
bv_entcount can be a relatively large allocation (see comment for BRT_RANGESIZE), so get it from the big allocator. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Kay Pedersen <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes openzfs#15050
dbuf_undirty() will (correctly) only remove dirty records for the given (open) txg. If there is a dirty record for an earlier closed txg that has not been synced out yet, then db_dirty_records will still have entries on it, tripping the assertion. Instead, change the assertion to only consider the current txg. To some extent this is redundant, as it's really just saying "did dbuf_undirty() work?", but it doesn't hurt and accurately expresses our expectations. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Kay Pedersen <[email protected]> Signed-off-by: Rob Norris <[email protected]> Original-patch-by: Kay Pedersen <[email protected]> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes openzfs#15050
Block cloning introduced a new state transition from DB_NOFILL to DB_READ. This occurs when a block is cloned and then read on the current txg. In this case, the clone will move the dbuf to DB_NOFILL, and then the read will be issued for the overridden block pointer. If that read is still outstanding when it comes time to write, the dbuf will be in DB_READ, which is not handled by the checks in dbuf_sync_leaf, thus tripping the assertions. This updates those checks to allow DB_READ as a valid state iff the dirty record is for a BRT write and there is an override block pointer. This is a safe situation because the block already exists, so there's nothing that could change from underneath the read. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Kay Pedersen <[email protected]> Signed-off-by: Rob Norris <[email protected]> Original-patch-by: Kay Pedersen <[email protected]> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes openzfs#15050
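To make the relaxed condition easier to follow, here is a small standalone C model of the check the commit describes. The enum and struct are simplified stand-ins, not the real OpenZFS dbuf types, so read it as a sketch of the logic rather than the actual dbuf_sync_leaf() code.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the real dbuf states and dirty record. */
typedef enum { DB_UNCACHED, DB_READ, DB_CACHED, DB_NOFILL } db_state_t;

typedef struct {
	bool dr_brtwrite;       /* dirty record came from a block clone */
} dirty_record_t;

/*
 * Sketch of the relaxed expectation: DB_READ is now acceptable, but
 * only when the dirty record is a BRT (clone) write, i.e. the block
 * already exists and the outstanding read cannot race with new data.
 */
static bool
state_ok_for_sync(db_state_t state, const dirty_record_t *dr)
{
	return (state == DB_CACHED || state == DB_NOFILL ||
	    (state == DB_READ && dr->dr_brtwrite));
}

int main(void)
{
	dirty_record_t clone = { .dr_brtwrite = true };
	dirty_record_t plain = { .dr_brtwrite = false };

	assert(state_ok_for_sync(DB_CACHED, &plain));
	assert(state_ok_for_sync(DB_READ, &clone));    /* clone then read: allowed */
	assert(!state_ok_for_sync(DB_READ, &plain));   /* ordinary read: still a bug */
	return 0;
}
```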
This implements the Linux VFS ops required to service the file copy/clone APIs: .copy_file_range (4.5+) .clone_file_range (4.5-4.19) .dedupe_file_range (4.5-4.19) .remap_file_range (4.20+) Note that dedupe_file_range() and remap_file_range(REMAP_FILE_DEDUP) are hooked up here, but are not implemented yet. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Kay Pedersen <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes openzfs#15050
Prior to Linux 4.5, the FICLONE etc ioctls were specific to BTRFS, and were implemented as regular filesystem-specific ioctls. This implements those ioctls directly in OpenZFS, allowing cloning to work on older kernels. There's no need to gate these behind version checks; on later kernels Linux will simply never deliver these ioctls, instead calling the appropriate VFS op. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Kay Pedersen <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes openzfs#15050
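For completeness, this is roughly how userspace exercises those ioctls once they are wired up; it is an illustration only, with placeholder file names, and on kernels or pools without block cloning support the calls are simply expected to fail with an error.

```c
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>           /* FICLONE, FICLONERANGE, struct file_clone_range */

int main(void)
{
	int src = open("src.dat", O_RDONLY);            /* placeholder paths */
	int dst = open("clone.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (src < 0 || dst < 0) {
		perror("open");
		return 1;
	}

	/* Whole-file clone, what cp --reflink=always issues. */
	if (ioctl(dst, FICLONE, src) < 0)
		perror("FICLONE");

	/* Clone just the first 1 MiB of src into the same offset of dst. */
	struct file_clone_range fcr = {
		.src_fd = src,
		.src_offset = 0,
		.src_length = 1024 * 1024,
		.dest_offset = 0,
	};
	if (ioctl(dst, FICLONERANGE, &fcr) < 0)
		perror("FICLONERANGE");

	close(src);
	close(dst);
	return 0;
}
```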
Redhat have backported copy_file_range and clone_file_range to the EL7 kernel using an "extended file operations" wrapper structure. This connects all that up to let cloning work there too. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Kay Pedersen <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes openzfs#15050
Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Kay Pedersen <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes openzfs#15050 Closes openzfs#405 Closes openzfs#13349
Is it expected that
I would suspect you're running into what I mentioned about it failing to activate the feature in the first place, guessing blindly, assuming you marked the feature as "enabled" already.
Well, that is a well-deserved facepalm for me. Thank you. I'll choose to blame GitHub's "xxx hidden comments..." disclosure for me not even knowing there was a pool feature, rather than a feature in the broader sense, even though I know it was not solely GitHub's fault. 😄 Sorry for the noise!
Motivation and Context
The recent addition of block cloning to OpenZFS was not initially available on Linux. This adds the missing pieces.
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Closes: #405
Closes: #13349
Description
This implements the necessary interfaces to allow the `copy_file_range` syscall and the `FICLONE`, `FICLONERANGE` and `FIDEDUPERANGE` ioctls to be properly routed through to OpenZFS, and provides implementations of all but dedup (a userspace sketch of these entry points follows the list):
- the `.copy_file_range`, `.clone_file_range`, `.dedupe_file_range` and `.remap_file_range` VFS ops
- the `FICLONE`, `FICLONERANGE` and `FIDEDUPERANGE` ioctls
- the EL7 "extended" `.copy_file_range` and `.clone_file_range` VFS ops
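For reference, a minimal userspace sketch of the syscall side of this (the paths are placeholders; `copy_file_range()` has been exposed by glibc since 2.27). Whether the bytes end up block-cloned or plainly copied is up to the filesystem, which is exactly the ambiguity discussed in the next paragraph.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

int main(void)
{
	int in = open("src.dat", O_RDONLY);             /* placeholder paths */
	int out = open("dst.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	struct stat st;

	if (in < 0 || out < 0 || fstat(in, &st) < 0) {
		perror("setup");
		return 1;
	}

	/*
	 * Ask the kernel to move the whole file; the filesystem may satisfy
	 * this with a block clone or fall back to an ordinary copy.
	 */
	off_t remaining = st.st_size;
	while (remaining > 0) {
		ssize_t n = copy_file_range(in, NULL, out, NULL, remaining, 0);
		if (n <= 0) {
			perror("copy_file_range");
			return 1;
		}
		remaining -= n;
	}

	close(in);
	close(out);
	return 0;
}
```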
Note that I've wired up the dedup calls for completeness, but currently they return `EOPNOTSUPP` or `ENOTTY` as appropriate. Implementing it is pretty involved, and beyond the scope of this PR.
Note that this does not attempt to address the issues surrounding cross-dataset cloning in Linux (I'm not even sure there's much we can really do anyway). The short version is that only `copy_file_range` (since 5.3) can clone across filesystems, but there's no way to know from its return if it did a clone, a regular copy or a bit of both. coreutils 9+ will use `copy_file_range` for `cp --reflink=auto` (the default), but `FICLONE` for `cp --reflink=always` (previously it always used `FICLONE`). It's hard to say whether or not users will find this confusing. It might require documentation improvements, or real effort to make it work. I suggest it's out of scope for this PR too; we can consider options later if it becomes clear that cross-dataset cloning is in high demand.
How Has This Been Tested?
I wrote a test program: https://github.com/robn/clonefile
I've tested on the following kernels/distributions:
All performed as I would expect: `FICLONE`/`FICLONERANGE` worked on all, and `copy_file_range` worked on all but Debian 8 / 3.16 (the syscall doesn't exist there). When block cloning is disabled, all calls fail correctly except `copy_file_range`, which falls back to a regular file copy. Determining if the file was cloned or not is just a matter of looking at the L0 DVAs for each file and comparing them.
Cloning smaller file ranges appears to work within the existing constraints of `zfs_clone_range()`, but I have not tested extensively. I've incorporated some of this into the test suite. The tests are not very comprehensive, but should be enough of a starting point; I'd appreciate feedback.