Linux: wire up copy_file_range, FICLONE, etc to block cloning #15050

Closed
wants to merge 8 commits into from

robn
Member

@robn robn commented Jul 11, 2023

Motivation and Context

The recent addition of block cloning to OpenZFS was not initially available on Linux. This adds the missing pieces.

Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.

Closes: #405
Closes: #13349

Description

This implements the necessary interfaces to allow the copy_file_range syscall and the FICLONE, FICLONERANGE and FIDEDUPERANGE ioctls to be properly routed through to OpenZFS, and provides implementations of all but dedup:

  • For Linux 4.5+, implementing the .copy_file_range, .clone_file_range, .dedupe_file_range and .remap_file_range VFS ops
  • For older Linux, implementing compatible handlers for FICLONE, FICLONERANGE and FIDEDUPERANGE
  • For EL7 kernels, implementing the extended .copy_file_range and .clone_file_range VFS ops

Note that I've wired up the dedup calls for completeness, but currently they return EOPNOTSUPP or ENOTTY as appropriate. Implementing it is pretty involved, and beyond the scope of this PR.

Note that this does not attempt to address the issues surrounding cross-dataset cloning in Linux (I'm not even sure there's much we can really do anyway). The short version is that only copy_file_range since 5.3 can clone across filesystems, but there's no way to know from its return value whether it did a clone, a regular copy or a bit of both. coreutils 9+ will use copy_file_range for cp --reflink=auto (default), but FICLONE for cp --reflink=always (previously it always used FICLONE). It's hard to say whether or not users will find this confusing. It might require documentation improvements, or real effort to make it work. I suggest it's out of scope for this PR too; we can consider options later if it becomes clear that cross-dataset cloning is in high demand.
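
To illustrate the point about the return value, here is a minimal userspace sketch (not code from this PR) that copies a whole file with copy_file_range(2). The helper name copy_whole_file is hypothetical. The only thing the caller learns is a byte count; nothing in it says whether blocks were shared or copied.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy all of src_fd into dst_fd using
 * copy_file_range(2). Returns the number of bytes transferred,
 * or -1 with errno set.
 */
static ssize_t
copy_whole_file(int src_fd, int dst_fd)
{
	struct stat st;
	off_t off_in = 0, off_out = 0;
	ssize_t total = 0;

	if (fstat(src_fd, &st) != 0)
		return (-1);

	while (off_in < st.st_size) {
		ssize_t n = copy_file_range(src_fd, &off_in, dst_fd, &off_out,
		    (size_t)(st.st_size - off_in), 0);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return (-1);
		}
		if (n == 0)
			break;	/* source shrank while we were copying */
		/* n is only a byte count: clone and plain copy look the same */
		total += n;
	}
	return (total);
}
```

A caller that must know whether a clone really happened cannot rely on this call alone; it would have to use FICLONE (which fails rather than copying) or inspect the on-disk blocks afterwards.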

How Has This Been Tested?

I wrote a test program: https://github.com/robn/clonefile

I've tested on the following kernels/distributions:

  • kernel.org: 5.10.170, 6.1.38, 6.4.2
  • Debian 12.0: 6.1.0-9-amd64 (6.1.27-1)
  • Debian 11.7: 5.10.0-20-amd64 (5.10.158-2)
  • Debian 10.11: 4.19.0-24-amd64 (4.19.282-1)
  • Debian 8.11: 3.16.0-6-amd64 (3.16.56-1+deb8u1)
  • CentOS 7.9.2009: 3.10.0-1160.90.1.el7.x86_64

All performed as I would expect: FICLONE/FICLONERANGE worked on all, copy_file_range worked on all but Debian 8 / 3.16 (syscall doesn't exist there).

When block cloning is disabled, all calls fail correctly except copy_file_range, which falls back to a regular file copy.
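
The difference in failure behaviour can be sketched from the caller's side. The following is an illustrative userspace wrapper, not part of this PR: try_ficlone and the ficlone_result names are hypothetical, and the set of errno values treated as "cloning not available" is an assumption based on the errors mentioned in this thread plus the usual ioctl conventions.

```c
#include <errno.h>
#include <linux/fs.h>	/* FICLONE */
#include <sys/ioctl.h>

/* Hypothetical result codes for this sketch. */
enum ficlone_result { FC_CLONED, FC_UNSUPPORTED, FC_ERROR };

/*
 * Ask the filesystem to make dst_fd a clone of src_fd.
 * Unlike copy_file_range(2), FICLONE never silently falls back to
 * copying: it either shares the blocks or fails.
 */
static enum ficlone_result
try_ficlone(int src_fd, int dst_fd)
{
	if (ioctl(dst_fd, FICLONE, src_fd) == 0)
		return (FC_CLONED);

	switch (errno) {
	case EOPNOTSUPP:	/* filesystem or feature cannot clone */
	case ENOTTY:		/* ioctl not recognised at all */
	case EXDEV:		/* source and destination on different mounts */
	case EINVAL:		/* arguments not cloneable as given */
		return (FC_UNSUPPORTED);	/* assumed "not available" set */
	default:
		return (FC_ERROR);
	}
}
```

A tool that wants cp --reflink=auto style behaviour could call something like this first and only fall back to a byte copy on FC_UNSUPPORTED.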

Determining if the file was cloned or not is just looking at the L0 DVAs for each file and comparing them.
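
As a rough sketch of that comparison (not the tooling actually used here), assume the L0 DVAs of each file have already been extracted into arrays of strings, one per block. The function name same_l0_dvas and the string form of a DVA are both hypothetical; the check is simply that both files reference the same blocks in the same order.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: returns true when both files list exactly the
 * same L0 DVAs, block for block, which is what a full clone looks like.
 * Each entry is an opaque DVA string taken from whatever tool dumped it.
 */
static bool
same_l0_dvas(const char *const *a, size_t na,
    const char *const *b, size_t nb)
{
	if (na != nb)
		return (false);
	for (size_t i = 0; i < na; i++) {
		if (strcmp(a[i], b[i]) != 0)
			return (false);
	}
	return (true);
}
```

A regular copy allocates new blocks, so at least one DVA differs; a partial clone would need a per-range comparison rather than this all-or-nothing check.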

Cloning smaller file ranges appears to work within the existing constraints of zfs_clone_range(), but I have not tested extensively.

I've incorporated some of this into the test suite. The new tests are not very comprehensive, but should be enough of a starting point. I'd appreciate feedback.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@robn robn force-pushed the block-cloning-linux branch 4 times, most recently from 0360980 to d828c5b Compare July 13, 2023 06:49
@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Jul 13, 2023
@robn robn force-pushed the block-cloning-linux branch 4 times, most recently from f85dc61 to ca42718 Compare July 15, 2023 06:41
@robn
Member Author

robn commented Jul 16, 2023

I think the remaining test failures are not mine. This should be good to go.

@allanjude
Contributor

Yeah, the remaining test failures look like what @amotin was describing here: https://openzfs.slack.com/archives/C052RGXL5/p1689265971254869

@dreamice2012

dreamice2012 commented Jul 18, 2023

Yeah, the remaining test failures look like what @amotin was describing here: https://openzfs.slack.com/archives/C052RGXL5/p1689265971254869

I can't access this link.
I used clonefile to test the patch. Only the "-f" option executes OK; the others fail as follows:
using FICLONERANGE
ioctl(FICLONERANGE): Operation not supported
using FIDEDUPERANGE
ioctl(FIDEDUPERANGE): Operation not supported

my system info:
root@zfstest:/mypool# uname -a
Linux zfstest 5.19.0-46-generic #47~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Jun 21 15:35:31 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

@robn
Member Author

robn commented Jul 18, 2023

my system info:
root@zfstest:/mypool# uname -a
Linux zfstest 5.19.0-46-generic #47~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Jun 21 15:35:31 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

I used clonefile to test the patch. Only the "-f" option executes OK; the others fail as follows:
using FICLONERANGE
ioctl(FICLONERANGE): Operation not supported
using FIDEDUPERANGE
ioctl(FIDEDUPERANGE): Operation not supported

FIDEDUPERANGE won't work by design (see opening comment) but FICLONERANGE definitely should. Please confirm that you've built OpenZFS with these commits, that it's properly installed and loaded into the kernel, and that the feature@block_cloning pool property is enabled.

If that all looks right, then please show all the output of setting up your pool and using clonefile, eg:

# zfs version
zfs-2.2.99-1
zfs-kmod-2.2.99-1

# zpool get feature@block_cloning tank
NAME  PROPERTY               VALUE                  SOURCE
tank  feature@block_cloning  enabled                local

# dd if=/dev/urandom of=/tank/file bs=128K count=4
4+0 records in
4+0 records out
524288 bytes (524 kB, 512 KiB) copied, 0.00214827 s, 244 MB/s

# clonefile -c /tank/file /tank/file2
using FICLONE
file offsets: src=0/524288; dst=0/524288

@dreamice2012

Thanks for your reply! The difference is:
root@zfstest:/home/zfs/lib# zfs version
zfs-2.2.99-1
zfs-kmod-2.1.5-1ubuntu6

Could you show me how to upgrade the zfs-kmod version? Thanks~

@allanjude
Contributor

Thanks for your reply! The difference is:
root@zfstest:/home/zfs/lib# zfs version
zfs-2.2.99-1
zfs-kmod-2.1.5-1ubuntu6

Could you show me how to upgrade the zfs-kmod version? Thanks~

If you just want to do it temporarily, from the zfs you build yourself, run ./scripts/zfs.sh -v -r from the zfs source code directory, and it will unload the old module, and load the module you just built.

Otherwise, you need to install it. Instructions are here: https://openzfs.github.io/openzfs-docs/Developer%20Resources/Building%20ZFS.html

Contributor

@behlendorf behlendorf left a comment

Thanks for breaking this up into logically separate commits to facilitate the review. This looks great, and it passed all the manual testing I was able to throw at it as well as 100 iterations of the new test cases. I only posted one comment with a trivial nit.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Jul 21, 2023
Just silencing the warning about large allocations.

Signed-off-by: Rob Norris <[email protected]>
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
@behlendorf
Contributor

@robn and it looks like there's potentially one other large kmem_alloc/free that should be converted to a vmem_alloc/free.

[3734433.153401] Large kmem_alloc(952320, 0x1000), please file an issue at:
                 https://github.com/openzfs/zfs/issues/new
[3734433.166717] CPU: 4 PID: 160637 Comm: txg_sync Kdump: loaded Tainted: P           OE  X --------- -  - 4.18.0-477.10.1.1toss.t4.x86_64 #1
[3734433.180594] Hardware name: Intel Corporation S2600WTTR/S2600WTTR, BIOS SE5C610.86B.01.01.0024.021320181901 02/13/2018
[3734433.192632] Call Trace:
[3734433.195556]  dump_stack+0x41/0x60
[3734433.199461]  spl_kmem_zalloc.cold.2+0x17/0x1c [spl]
[3734433.205107]  brt_vdev_realloc+0xa4/0x400 [zfs]
[3734433.210372]  brt_pending_apply+0x2f6/0x7d0 [zfs]
[3734433.215799]  spa_sync+0x85/0x1360 [zfs]
[3734433.241340]  txg_sync_thread+0x2bc/0x540 [zfs]
[3734433.257166]  thread_generic_wrapper+0x78/0xc0 [spl]
[3734433.262808]  kthread+0x14c/0x170
[3734433.271466]  ret_from_fork+0x35/0x40

bv_entcount can be a relatively large allocation (see comment for
BRT_RANGESIZE), so get it from the big allocator.

Signed-off-by: Rob Norris <[email protected]>
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
@robn robn force-pushed the block-cloning-linux branch from ca42718 to fbe5cc3 Compare July 22, 2023 00:58
@robn
Member Author

robn commented Jul 22, 2023

@robn and it looks like there's potentially one other large kmem_alloc/free that should be converted to a vmem_alloc/free.

Done, see additional commit.

I wasn't able to reproduce it because I don't have sufficiently large vdevs to work with. It seems fairly clear from the comment on BRT_RANGESIZE that it can get pretty big though, so it makes sense.

I checked the other kmem allocations in brt.c (just some back-of-napkin math) and they all seem like they can almost never be very big - definitely nowhere near spl_kmem_alloc_warn anyway.
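
The back-of-napkin check can be pictured as a simple size test. This is a userspace toy, not SPL code: the function name should_use_vmem is hypothetical, and the 64 KiB constant is only an assumed stand-in for the spl_kmem_alloc_warn tunable, whose real value depends on the system.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * ASSUMPTION: 64 KiB stands in for the spl_kmem_alloc_warn tunable.
 * The real threshold is configurable and may differ.
 */
#define	MODEL_KMEM_ALLOC_WARN	((size_t)64 * 1024)

/*
 * Hypothetical helper: an allocation larger than the warn threshold
 * should come from the big (vmem) allocator instead of kmem.
 */
static bool
should_use_vmem(size_t size)
{
	return (size > MODEL_KMEM_ALLOC_WARN);
}
```

Under that assumed threshold, the 952320-byte request from the warning earlier in this thread is far over the limit, while a page-sized allocation is not.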

Contributor

@oromenahar oromenahar left a comment

Checked the most important code and ran my tests against it as well. I found the same errors, and if we apply #14995 the errors are gone. If I have time to add tests for my test case, I will, but I need some time to understand exactly how the test framework works and how to add tests.
Good code, and the separate commits are simple to read. Nice work.

@robn
Member Author

robn commented Jul 22, 2023

Checked the most important code and ran my tests against it as well. I found the same errors, and if we apply #14995 the errors are gone.

Can you describe exactly how to reproduce #14995 against this PR? I have never seen it, and never been able to reproduce with your method, and it doesn't really make sense to me either.

@oromenahar
Contributor

oromenahar commented Jul 22, 2023

Yes there are two different problems.

First my setup:
I have a virtual machine running on linux and qemu to test. The machine has two virtual disks, one for OS (no zfs) and one for the zfs tests.

setup the code and pool:
I compile the code using these lines:

sh autogen.sh
./configure --enable-debug --enable-debuginfo --enable-debug-kmem --enable-debug-kmem-tracking
make -s -j$(nproc) && sleep 5 && make install; ldconfig; depmod
# remove the module (rmmod zfs or reboot)
zpool create -f tank /dev/sdb && zfs create tank/test && dd if=/dev/random bs=4M status=progress count=1000 of=/tank/test/test.img
# in most cases I check some stuff out, wait a little bit and while doing this everything is synced to the virtual disk

first test:

while true; do /usr/bin/cp -fv /tank/test/test.img /tank/test/test.img2 && date; done

just leave it, after a few seconds (mostly about 5 to 10 seconds) you get the result/error:

Sat Jul 22 14:20:58 CEST 2023
'/tank/test/test.img' -> '/tank/test/test.img2'

Message from syslogd@localhost at Jul 22 14:20:58 ...
 kernel:VERIFY(list_head(&db->db_dirty_records) == NULL) failed

Message from syslogd@localhost at Jul 22 14:20:58 ...
 kernel:PANIC at dbuf.c:2704:dmu_buf_will_clone()

I can reproduce it reliably, and can predict fairly well from the CPU load when the error will occur. First I just used my own reflink implementation (less complete than yours). After debugging and reading exactly how cp handles the copy/truncate/rm/reflink, I think the assert does not make sense.
cp opens the file with the O_TRUNC flag, and ZFS truncates the file internally. If I understand everything correctly, that transaction group is then closed but not necessarily synced to disk yet. cp isn't finished and continues its work, using the open file and the clone range syscalls. cp makes some more checks before and after, but nothing that matters for this error. list_head(&db->db_dirty_records) returns an entry whose txg id is smaller than the current one (I wrote the ids to the kernel log), so I think this must be a dirty record from a previous transaction group. I'm unsure whether I debugged and understood everything correctly; I'm fairly new to the code base.

The second test I made:
Same setup as in the first test, but now:

while true; do /usr/bin/cp -fv /tank/test/test.img /tank/test/test.img2 && sleep 2 && sha256sum /tank/test/test.img2; done

and the result:

'/tank/test/test.img' -> '/tank/test/test.img2'

Message from syslogd@localhost at Jul 22 14:44:58 ...
 kernel:VERIFY(db->db_state == DB_CACHED || db->db_state == DB_NOFILL) failed

Message from syslogd@localhost at Jul 22 14:44:58 ...
 kernel:PANIC at dbuf.c:4461:dbuf_sync_leaf()

This takes a little more time, and I think the disk speed is important (the virtual disk is stored on a stable zfs pool on SSDs), but I haven't tried it on slower disks.

It looks like the ASSERT fails if you are fast enough to read the data. db->db_state will be DB_READ, but dr->dt.dl.dr_brtwrite is not synced to disk yet and still dirty. As far as I understand, this only matters for the debug code; the state doesn't really matter at that point.
So if (db->db_state == DB_READ && dr->dt.dl.dr_brtwrite == B_TRUE) holds, it is fine to continue (in debug mode).

	ASSERT(db->db_state == DB_CACHED || db->db_state == DB_NOFILL ||
		    (db->db_state == DB_READ &&
		    dr->dt.dl.dr_brtwrite == B_TRUE));

Please don't ask why I'm running weird while-true loops with cp on the same file, or where I got the idea for those loops.
As I wrote, I'm fairly new to the code base. I would really appreciate some feedback on whether I understood everything correctly.

If you have any more questions, please feel free to ask. I hope I didn't forget anything and that you can reproduce the error.

@robn
Member Author

robn commented Jul 22, 2023

Thanks for the detail, I will consider it more closely tomorrow. Just to clarify one point:

First I just used my own reflink implementation

Can you reproduce this against my implementation? For the purposes of this PR that's all I'm interested in. If you can, then we might be looking at something real. If not, then it's more likely to be something in your implementation.

@oromenahar
Contributor

oromenahar commented Jul 22, 2023

Can you reproduce this against my implementation? For the purposes of this PR that's all I'm interested in. If you can, then we might be looking at something real. If not, then it's more likely to be something in your implementation.

Yes, I have tested it with your code. While I was writing the explanation, I tested everything with my test setup again using your PR (and also tested against other people's reflink wire-ups).

behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Jul 25, 2023
bv_entcount can be a relatively large allocation (see comment for
BRT_RANGESIZE), so get it from the big allocator.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Kay Pedersen <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Closes openzfs#15050
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Jul 25, 2023
dbuf_undirty() will (correctly) only remove dirty records for the given
(open) txg. If there is a dirty record for an earlier closed txg that
has not been synced out yet, then db_dirty_records will still have
entries on it, tripping the assertion.

Instead, change the assertion to only consider the current txg. To some
extent this is redundant, as it's really just saying "did dbuf_undirty()
work?", but it doesn't hurt and accurately expresses our
expectations.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Kay Pedersen <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Original-patch-by: Kay Pedersen <[email protected]>
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Closes openzfs#15050
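
The reasoning in the commit message above can be shown with a toy model (this is not the OpenZFS dbuf code). The struct dirty_record type and the find_dirty and undirty helpers are hypothetical stand-ins: a buffer carries a list of dirty records, one per txg, and undirtying removes only the record for the given txg.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy dirty record: one entry per txg, newest first. */
struct dirty_record {
	uint64_t txg;
	struct dirty_record *next;
};

/* Return the record for exactly this txg, or NULL if there is none. */
static struct dirty_record *
find_dirty(struct dirty_record *head, uint64_t txg)
{
	for (struct dirty_record *dr = head; dr != NULL; dr = dr->next) {
		if (dr->txg == txg)
			return (dr);
	}
	return (NULL);
}

/* Unlink the record for this txg only; returns the new list head. */
static struct dirty_record *
undirty(struct dirty_record *head, uint64_t txg)
{
	struct dirty_record **pp = &head;

	while (*pp != NULL) {
		if ((*pp)->txg == txg) {
			*pp = (*pp)->next;
			break;
		}
		pp = &(*pp)->next;
	}
	return (head);
}
```

With records for txg 10 (open) and txg 9 (closed, not yet synced), undirty(head, 10) leaves the list non-empty, so an "is the list empty?" assertion trips, while "is there a record for txg 10?" correctly finds nothing.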
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Jul 25, 2023
Block cloning introduced a new state transition from DB_NOFILL to
DB_READ. This occurs when a block is cloned and then read on the
current txg.

In this case, the clone will move the dbuf to DB_NOFILL, and then the
read will be issued for the overridden block pointer. If that read is
still outstanding when it comes time to write, the dbuf will be in
DB_READ, which is not handled by the checks in dbuf_sync_leaf, thus
tripping the assertions.

This updates those checks to allow DB_READ as a valid state iff the
dirty record is for a BRT write and there is an override block pointer.
This is a safe situation because the block already exists, so there's
nothing that could change from underneath the read.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Kay Pedersen <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Original-patch-by: Kay Pedersen <[email protected]>
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Closes openzfs#15050
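
The relaxed check described in the commit message above can be written as a small predicate. This is a standalone model, not the dbuf_sync_leaf code: the enum values and the sync_leaf_state_ok name are hypothetical, and the two boolean parameters stand for "the dirty record is a BRT write" and "an override block pointer is present".

```c
#include <stdbool.h>

/* Toy subset of dbuf states, named to avoid clashing with real ones. */
enum model_db_state {
	MODEL_DB_CACHED,
	MODEL_DB_NOFILL,
	MODEL_DB_READ,
	MODEL_DB_FILL
};

/*
 * Returns true when the state is acceptable at sync time:
 * CACHED and NOFILL always are; READ only for a cloned block whose
 * read is still outstanding (BRT write with an override pointer).
 */
static bool
sync_leaf_state_ok(enum model_db_state state, bool is_brt_write,
    bool has_override)
{
	if (state == MODEL_DB_CACHED || state == MODEL_DB_NOFILL)
		return (true);
	return (state == MODEL_DB_READ && is_brt_write && has_override);
}
```

The READ case is only let through when both conditions hold, which matches the "iff" wording: a plain read in flight during sync is still treated as a bug.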
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Jul 25, 2023
This implements the Linux VFS ops required to service the file
copy/clone APIs:

  .copy_file_range    (4.5+)
  .clone_file_range   (4.5-4.19)
  .dedupe_file_range  (4.5-4.19)
  .remap_file_range   (4.20+)

Note that dedupe_file_range() and remap_file_range(REMAP_FILE_DEDUP) are
hooked up here, but are not implemented yet.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Kay Pedersen <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Closes openzfs#15050
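
The version ranges in the commit message above amount to a small lookup. The sketch below is illustrative only: clone_entry_point is a hypothetical name, the returned strings are just labels, and vendor backports (such as the EL7 extended file operations) are deliberately ignored.

```c
/*
 * Hypothetical lookup: which entry point services a FICLONE-style
 * clone request on a mainline kernel of the given version.
 *   4.20+      .remap_file_range
 *   4.5-4.19   .clone_file_range
 *   older      filesystem-specific ioctl handler
 */
static const char *
clone_entry_point(int major, int minor)
{
	if (major > 4 || (major == 4 && minor >= 20))
		return (".remap_file_range");
	if (major == 4 && minor >= 5)
		return (".clone_file_range");
	return ("ioctl handler");
}
```

For the mainline kernels in the test matrix this gives .remap_file_range for 5.10 and 6.x, .clone_file_range for 4.19, and the ioctl handler for 3.16.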
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Jul 25, 2023
Prior to Linux 4.5, the FICLONE etc ioctls were specific to BTRFS, and
were implemented as regular filesystem-specific ioctls. This implements
those ioctls directly in OpenZFS, allowing cloning to work on older
kernels.

There's no need to gate these behind version checks; on later kernels
Linux will simply never deliver these ioctls, instead calling the
appropriate VFS op.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Kay Pedersen <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Closes openzfs#15050
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Jul 25, 2023
Red Hat have backported copy_file_range and clone_file_range to the EL7
kernel using an "extended file operations" wrapper structure. This
connects all that up to let cloning work there too.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Kay Pedersen <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Closes openzfs#15050
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Jul 25, 2023
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Kay Pedersen <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Closes openzfs#15050
Closes openzfs#405
Closes openzfs#13349
behlendorf pushed 8 commits that referenced this pull request Jul 26, 2023
@DHowett

DHowett commented Aug 19, 2023

My plan is to pull it in to a 2.2.0-rc3 release this week for broader testing.

Is it expected that FICLONE works on 2.2.0-rc3? In my limited testing, I am seeing EOPNOTSUPP for both same-dataset and cross-dataset clones despite running 2.2.0-rc3.

@rincebrain
Contributor

I would suspect you're running into what I mentioned about it failing to activate the feature in the first place, guessing blindly, assuming you marked the feature as "enabled" already.

@DHowett

DHowett commented Aug 19, 2023

Well, that is a well-deserved facepalm for me. Thank you.

I'll choose to blame GitHub's "xxx hidden comments..." disclosure for me not even knowing there was a pool feature, rather than a feature in the broader sense, even though I know it was not solely GitHub's fault. 😄

Sorry for the noise!

lundman pushed 8 commits to openzfsonwindows/openzfs that referenced this pull request Dec 12, 2023
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Development

Successfully merging this pull request may close these issues.

FAST-Tracking REFLINK and Offline Deduplication, first for LINUX only COW cp (--reflink) support
8 participants