Background
The present in-order scan code has three possible behaviors with respect to the DDT, in 1:1 correspondence with the DDT classes (see the sketch after this list):
class 0: enqueue block pointers from the autoditto DDT class only; this is necessary because these ditto pointers may not exist in the ordinary block pointer tree
class 1: enqueue block pointers from the autoditto DDT and >1-reference DDT classes
class 2: enqueue block pointers from all DDT classes
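For orientation, here is a minimal sketch of the class enumeration this mapping refers to, with a hypothetical helper expressing the "harvest classes 0 through N" rule; the enum names follow OpenZFS's ddt.h as I recall them, so treat the exact spellings as an assumption:

```c
/*
 * Sketch of the DDT class enumeration (names per OpenZFS's ddt.h, from
 * memory) and the "class max" rule: a class-N scan walks the DDT ZAPs for
 * every class <= N and enqueues the block pointers found there.
 */
#include <stdbool.h>

typedef enum ddt_class {
	DDT_CLASS_DITTO = 0,	/* automatic extra copies ("autoditto") */
	DDT_CLASS_DUPLICATE,	/* entries with more than one reference */
	DDT_CLASS_UNIQUE,	/* entries with exactly one reference */
	DDT_CLASSES
} ddt_class_t;

/* Hypothetical helper: does a class-max-N traversal harvest this class? */
static bool
scan_harvests_ddt_class(ddt_class_t scn_ddt_class_max, ddt_class_t class)
{
	return (class <= scn_ddt_class_max);
}
```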
After harvesting block pointers from the DDTs, class 0 and class 1 traversals must probe the DDT (using ddt_class_contains) whenever they prepare to enqueue a block pointer during the usual tree walk, so as not to scan those blocks more than once. Class 2 scans, by virtue of knowing that the entire DDT has already been read in, are free to simply skip all deduplicated blocks during the usual tree walk, which can dramatically save I/O (this logic is in fact nicely encapsulated in ddt_class_contains, so no special case is needed in the scan code).
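For concreteness, a minimal sketch of that per-block-pointer decision, written against the ddt_class_contains(spa, max_class, bp) signature as I understand it; the wrapper function and its name are illustrative, not existing scan code:

```c
/*
 * Illustrative sketch only (not the actual dsl_scan.c code): the decision a
 * class 0/1/2 traversal has to make for each block pointer it meets during
 * the ordinary tree walk. For classes 0 and 1 this is a real DDT probe; for
 * class 2, ddt_class_contains() is true for every deduplicated block, so all
 * of them are skipped here and scanned from the DDT walk instead.
 */
#include <sys/spa.h>
#include <sys/ddt.h>

static boolean_t
scan_should_skip_bp(spa_t *spa, ddt_class_t scn_ddt_class_max,
    const blkptr_t *bp)
{
	if (!BP_GET_DEDUP(bp))
		return (B_FALSE);	/* never in the DDT: enqueue normally */

	/*
	 * The block lives in a DDT class this scan already harvested (or
	 * will harvest), so enqueueing it again would scan it twice.
	 */
	return (ddt_class_contains(spa, scn_ddt_class_max, bp));
}
```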
All of these iterations over the DDT ZAPs happen in checksum order, which is uncorrelated with on-disk layout of the blocks in question. Thus, a very large DDT will, statistically, create many sparse regions in the in-order scan sio queues, as we are essentially picking on-disk locations at random. Worse, as we are forced to shed queue memory and actually scan blocks, we will pick the largest, least-gapful sio entries. That is, we're likely to evict the very entries that have, by chance, managed to accumulate runs of disk blocks and so cause later sio queue entries to be even sparser on average! Thus, it is unlikely that large files, even if written sequentially on disk, will be scanned in anything resembling disk order; instead, several passes may need to be done over the same region of the medium, with different sectors considered to be in the gaps each time. This is, at the very least, somewhat antithetical to the intention of the in-order scan code!
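To make the "essentially picking on-disk locations at random" point concrete, here is a small userland toy (not OpenZFS code; mix64() is a stand-in for a checksum) that visits sequentially laid-out blocks in hash order and counts how many consecutive visits are still disk-adjacent and therefore mergeable into one extent:

```c
/*
 * Toy illustration (not OpenZFS code): blocks are laid out sequentially on
 * disk, but visited in "checksum order", simulated here with mix64(). We
 * count how many consecutive visits are still adjacent on disk, i.e. how
 * often a sorted scan queue could merge neighbours into one extent.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define	NBLOCKS	(1 << 20)
#define	BLKSIZE	131072ULL		/* 128K records */

struct rec {
	uint64_t key;			/* "checksum" used for visit order */
	uint64_t offset;		/* on-disk location */
};

static uint64_t
mix64(uint64_t x)			/* stand-in for a cryptographic hash */
{
	x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
	x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
	x ^= x >> 33;
	return (x);
}

static int
cmp_key(const void *a, const void *b)
{
	const struct rec *ra = a, *rb = b;
	return ((ra->key < rb->key) ? -1 : (ra->key > rb->key));
}

int
main(void)
{
	struct rec *recs = malloc(NBLOCKS * sizeof (*recs));
	uint64_t adjacent = 0;

	if (recs == NULL)
		return (1);

	for (uint64_t i = 0; i < NBLOCKS; i++) {
		recs[i].offset = i * BLKSIZE;	/* sequential on disk */
		recs[i].key = mix64(i);		/* visit in hash order */
	}
	qsort(recs, NBLOCKS, sizeof (*recs), cmp_key);

	/* How many consecutive visits could merge into one extent? */
	for (uint64_t i = 1; i < NBLOCKS; i++) {
		if (recs[i].offset == recs[i - 1].offset + BLKSIZE)
			adjacent++;
	}

	printf("disk-adjacent consecutive visits: %llu of %u\n",
	    (unsigned long long)adjacent, (unsigned)(NBLOCKS - 1));
	free(recs);
	return (0);
}
```

For a random visit order over 2^20 blocks, the count of disk-adjacent consecutive visits should come out vanishingly small (on the order of one in a million pairs), which is the sparse, gap-ful queue behavior described above.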
Note that this scan-fragmenting behavior emerges even if the entire DDT (or a given DDT class) fits in (ARC) memory; what matters here is how much of the DDT (class) fits in the in-order scan queues, which are likely to be significantly more limited than the ARC (by default, the scan queues are capped at a tenth of the ARC's size; while there is subtlety about the relative size of an in-core DDT ZAP entry vs. a sio queue entry, both are mostly pointers to data blocks by weight and so are probably in the same ballpark).
Describe the feature you would like to see added to OpenZFS
I would like to propose another traversal "class" which
does not traverse the DDTs to enqueue block pointers (though it of course still traverses them for scanning as part of the MOS)
instead, traverses the pool data structure twice:
The first traversal enqueues only the autoditto DDT class and all deduplicated blocks
The second traversal enqueues only, and all, non-deduplicated blocks
That is, like "class 2" traversals, it visits all deduplicated blocks first and does not need to probe the DDT in the second traversal.
ETA: Other scan strategies are possible here, too, if desired, which split the two phases along the same lines as the other DDT-class N traversals. One could imagine, for example, walking the pool and first collecting all autoditto and multiply-referenced blocks, and then having the second pass collect singly-referenced dedup'd blocks along with non-deduplicated blocks.
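To make the basic proposal concrete, here is a rough sketch of the two-pass shape; enqueue_ddt_class() and traverse_pool_enqueue() are hypothetical placeholders for the real DDT-walk and tree-walk machinery, not existing OpenZFS functions:

```c
#include <sys/spa.h>
#include <sys/ddt.h>

/*
 * Hypothetical outline of the proposed traversal "class". The two helpers
 * below do not exist in OpenZFS and stand in for the real DDT walk and
 * block pointer tree walk; only the two-pass structure is the point.
 */
typedef enum scan_pass {
	SCAN_PASS_DEDUP,	/* enqueue only bps with the dedup bit set */
	SCAN_PASS_PLAIN		/* enqueue only bps without it */
} scan_pass_t;

static void enqueue_ddt_class(spa_t *spa, ddt_class_t class); /* placeholder */
static void traverse_pool_enqueue(spa_t *spa, scan_pass_t pass); /* placeholder */

static void
scan_proposed_class(spa_t *spa)
{
	/*
	 * Pass 1: harvest the autoditto DDT class (its ditto copies may not
	 * appear in the block pointer tree at all), then walk the tree and
	 * enqueue only block pointers with the dedup bit set. These arrive
	 * in file order rather than checksum order.
	 */
	enqueue_ddt_class(spa, DDT_CLASS_DITTO);
	traverse_pool_enqueue(spa, SCAN_PASS_DEDUP);

	/*
	 * Pass 2: walk the tree again and enqueue only non-deduplicated
	 * block pointers. No ddt_class_contains() probe is needed here:
	 * every deduplicated block was already handled in pass 1.
	 */
	traverse_pool_enqueue(spa, SCAN_PASS_PLAIN);
}
```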
How will this feature improve OpenZFS?
This new scrub strategy would enqueue deduplicated blocks in file order rather than in checksum order. While file order is not disk order, it's likely to be much closer than the random mapping dictated by cryptographic hash functions. Thus, we should expect many fewer gap-ful regions being created in sio queues, and many fewer repeated passes over the same region of the medium during a scan of a heavily deduplicated, large pool.
The downside is that we may repeatedly scrub a deduplicated block if we find it repeatedly in the tree of block pointers and the occurrences are sufficiently far from each other that the sio queues cannot merge the duplicate entries.