Background
The present in-order scan code has three possible behaviors with respect to the DDT, in 1:1 correspondence with the DDT classes (see the sketch after this list):
class 0: enqueue block pointers from the autoditto DDT class only; this is necessary because these ditto pointers may not exist in the ordinary block pointer tree
class 1: enqueue block pointers from the autoditto DDT and >1-reference DDT classes
class 2: enqueue block pointers from all DDT classes
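For orientation, here is a minimal sketch of the class enumeration this mapping refers to, with a hypothetical helper expressing the "harvest classes 0 through N" rule; the enum names follow OpenZFS's ddt.h as I recall them, so treat the exact spellings as an assumption:

```c
/*
 * Sketch of the DDT class enumeration (names per OpenZFS's ddt.h, from
 * memory) and the "class max" rule: a class-N scan walks the DDT ZAPs for
 * every class <= N and enqueues the block pointers found there.
 */
#include <stdbool.h>

typedef enum ddt_class {
	DDT_CLASS_DITTO = 0,	/* automatic extra copies ("autoditto") */
	DDT_CLASS_DUPLICATE,	/* entries with more than one reference */
	DDT_CLASS_UNIQUE,	/* entries with exactly one reference */
	DDT_CLASSES
} ddt_class_t;

/* Hypothetical helper: does a class-max-N traversal harvest this class? */
static bool
scan_harvests_ddt_class(ddt_class_t scn_ddt_class_max, ddt_class_t class)
{
	return (class <= scn_ddt_class_max);
}
```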
After harvesting block pointers from the DDTs, class 0 and class 1 traversals must probe the DDT (using ddt_class_contains) whenever they prepare to enqueue a block pointer during the usual tree walk, so as not to scan those blocks more than once. Class 2 scans, by virtue of knowing that the entire DDT has already been read in, are free to simply skip all deduplicated blocks during the usual tree walk, which can dramatically save I/O (this logic is in fact nicely encapsulated in ddt_class_contains, so no special case is needed in the scan code).
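For concreteness, a minimal sketch of that per-block-pointer decision, written against the ddt_class_contains(spa, max_class, bp) signature as I understand it; the wrapper function and its name are illustrative, not existing scan code:

```c
/*
 * Illustrative sketch only (not the actual dsl_scan.c code): the decision a
 * class 0/1/2 traversal has to make for each block pointer it meets during
 * the ordinary tree walk. For classes 0 and 1 this is a real DDT probe; for
 * class 2, ddt_class_contains() is true for every deduplicated block, so all
 * of them are skipped here and scanned from the DDT walk instead.
 */
#include <sys/spa.h>
#include <sys/ddt.h>

static boolean_t
scan_should_skip_bp(spa_t *spa, ddt_class_t scn_ddt_class_max,
    const blkptr_t *bp)
{
	if (!BP_GET_DEDUP(bp))
		return (B_FALSE);	/* never in the DDT: enqueue normally */

	/*
	 * The block lives in a DDT class this scan already harvested (or
	 * will harvest), so enqueueing it again would scan it twice.
	 */
	return (ddt_class_contains(spa, scn_ddt_class_max, bp));
}
```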
All of these iterations over the DDT ZAPs happen in checksum order, which is uncorrelated with on-disk layout of the blocks in question. Thus, a very large DDT will, statistically, create many sparse regions in the in-order scan sio queues, as we are essentially picking on-disk locations at random. Worse, as we are forced to shed queue memory and actually scan blocks, we will pick the largest, least-gapful sio entries. That is, we're likely to evict the very entries that have, by chance, managed to accumulate runs of disk blocks and so cause later sio queue entries to be even sparser on average! Thus, it is unlikely that large files, even if written sequentially on disk, will be scanned in anything resembling disk order; instead, several passes may need to be done over the same region of the medium, with different sectors considered to be in the gaps each time. This is, at the very least, somewhat antithetical to the intention of the in-order scan code!
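To make the "essentially picking on-disk locations at random" point concrete, here is a small userland toy (not OpenZFS code; mix64() is a stand-in for a checksum) that visits sequentially laid-out blocks in hash order and counts how many consecutive visits are still disk-adjacent and therefore mergeable into one extent:

```c
/*
 * Toy illustration (not OpenZFS code): blocks are laid out sequentially on
 * disk, but visited in "checksum order", simulated here with mix64(). We
 * count how many consecutive visits are still adjacent on disk, i.e. how
 * often a sorted scan queue could merge neighbours into one extent.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define	NBLOCKS	(1 << 20)
#define	BLKSIZE	131072ULL		/* 128K records */

struct rec {
	uint64_t key;			/* "checksum" used for visit order */
	uint64_t offset;		/* on-disk location */
};

static uint64_t
mix64(uint64_t x)			/* stand-in for a cryptographic hash */
{
	x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
	x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
	x ^= x >> 33;
	return (x);
}

static int
cmp_key(const void *a, const void *b)
{
	const struct rec *ra = a, *rb = b;
	return ((ra->key < rb->key) ? -1 : (ra->key > rb->key));
}

int
main(void)
{
	struct rec *recs = malloc(NBLOCKS * sizeof (*recs));
	uint64_t adjacent = 0;

	if (recs == NULL)
		return (1);

	for (uint64_t i = 0; i < NBLOCKS; i++) {
		recs[i].offset = i * BLKSIZE;	/* sequential on disk */
		recs[i].key = mix64(i);		/* visit in hash order */
	}
	qsort(recs, NBLOCKS, sizeof (*recs), cmp_key);

	/* How many consecutive visits could merge into one extent? */
	for (uint64_t i = 1; i < NBLOCKS; i++) {
		if (recs[i].offset == recs[i - 1].offset + BLKSIZE)
			adjacent++;
	}

	printf("disk-adjacent consecutive visits: %llu of %u\n",
	    (unsigned long long)adjacent, (unsigned)(NBLOCKS - 1));
	free(recs);
	return (0);
}
```

For a random visit order over 2^20 blocks, the count of disk-adjacent consecutive visits should come out vanishingly small (on the order of one in a million pairs), which is the sparse, gap-ful queue behavior described above.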
Note that this scan-fragmenting behavior emerges even if the entire DDT (or a given DDT class) fits in (ARC) memory; what matters here is how much of the DDT (class) fits in the in-order scan queues, which are likely to be significantly more limited than the ARC (by default, the scan queues are capped at a tenth of the ARC's size; while there is subtlety about the relative size of an in-core DDT ZAP entry vs. a sio queue entry, both are mostly pointers to data blocks by weight and so are probably in the same ballpark).
Describe the feature you would like to see added to OpenZFS
I would like to propose another traversal "class" which
does not traverse the DDTs to enqueue block pointers (though it of course still traverses them for scanning as part of the MOS)
instead, traverses the pool data structure twice:
The first traversal enqueues only the autoditto DDT class and all deduplicated blocks
The second traversal enqueues only, and all, non-deduplicated blocks
That is, like "class 2" traversals, it visits all deduplicated blocks first and does not need to probe the DDT in the second traversal.
ETA: Other scan strategies are possible here, too, if desired, which split the two phases along the same lines as the other DDT-class N traversals. One could imagine, for example, walking the pool and first collecting all autoditto and multiply-referenced blocks, and then having the second pass collect singly-referenced dedup'd blocks along with non-deduplicated blocks.
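To make the basic proposal concrete, here is a rough sketch of the two-pass shape; enqueue_ddt_class() and traverse_pool_enqueue() are hypothetical placeholders for the real DDT-walk and tree-walk machinery, not existing OpenZFS functions:

```c
#include <sys/spa.h>
#include <sys/ddt.h>

/*
 * Hypothetical outline of the proposed traversal "class". The two helpers
 * below do not exist in OpenZFS and stand in for the real DDT walk and
 * block pointer tree walk; only the two-pass structure is the point.
 */
typedef enum scan_pass {
	SCAN_PASS_DEDUP,	/* enqueue only bps with the dedup bit set */
	SCAN_PASS_PLAIN		/* enqueue only bps without it */
} scan_pass_t;

static void enqueue_ddt_class(spa_t *spa, ddt_class_t class); /* placeholder */
static void traverse_pool_enqueue(spa_t *spa, scan_pass_t pass); /* placeholder */

static void
scan_proposed_class(spa_t *spa)
{
	/*
	 * Pass 1: harvest the autoditto DDT class (its ditto copies may not
	 * appear in the block pointer tree at all), then walk the tree and
	 * enqueue only block pointers with the dedup bit set. These arrive
	 * in file order rather than checksum order.
	 */
	enqueue_ddt_class(spa, DDT_CLASS_DITTO);
	traverse_pool_enqueue(spa, SCAN_PASS_DEDUP);

	/*
	 * Pass 2: walk the tree again and enqueue only non-deduplicated
	 * block pointers. No ddt_class_contains() probe is needed here:
	 * every deduplicated block was already handled in pass 1.
	 */
	traverse_pool_enqueue(spa, SCAN_PASS_PLAIN);
}
```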
How will this feature improve OpenZFS?
This new scrub strategy would enqueue deduplicated blocks in file order rather than in checksum order. While file order is not disk order, it's likely to be much closer than the random mapping dictated by cryptographic hash functions. Thus, we should expect many fewer gap-ful regions being created in sio queues, and many fewer repeated passes over the same region of the medium during a scan of a heavily deduplicated, large pool.
The downside is that we may repeatedly scrub a deduplicated block if we find it repeatedly in the tree of block pointers and the occurrences are sufficiently far from each other that the sio queues cannot merge the duplicate entries.