
Speeding up in-order scans of large, dedup pools #12326


Open
nwf opened this issue Jul 4, 2021 · 0 comments
Labels: Type: Feature (Feature request or new feature)

Comments


nwf commented Jul 4, 2021

Background

The present in-order scan code has three possible behaviors wrt the DDT, in 1:1 correspondence with DDT classes:

  • class 0: enqueue block pointers from the autoditto DDT class only, necessary as these ditto pointers may not exist in the ordinary block pointer tree
  • class 1: enqueue block pointers from the autoditto DDT and >1-reference DDT classes
  • class 2: enqueue block pointers from all DDT classes

After harvesting block pointers from the DDTs, class 0 and 1 traversals must probe the DDT (via ddt_class_contains) whenever they prepare to enqueue a block pointer during the usual tree walk, so that those blocks are not scanned more than once. Class 2 scans, by virtue of knowing that the entire DDT has already been read in, are free to simply skip all deduplicated blocks during the usual tree walk, which can save a dramatic amount of I/O (this logic is in fact nicely encapsulated in ddt_class_contains, so no special case is needed in the scan code).
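For orientation, here is a rough sketch of that per-block-pointer check in the tree walk; ddt_class_contains itself is real, but the surrounding helper and exact names are simplified/invented for illustration and may not match dsl_scan.c exactly.

```c
/*
 * Rough sketch only, not the literal OpenZFS code; assumes the usual
 * sys/dsl_scan.h, sys/ddt.h and sys/spa.h definitions.
 * ddt_class_contains() is real (module/zfs/ddt.c); scan_enqueue_bp()
 * is a made-up stand-in for the actual in-order queueing path.
 */
static void
scan_visit_bp(dsl_scan_t *scn, spa_t *spa, const blkptr_t *bp)
{
	/*
	 * Class 0/1 scans must probe the DDT here so that a deduplicated
	 * block already harvested from the DDT ZAPs is not scanned again.
	 * Class 2 scans, knowing the entire DDT was enqueued up front,
	 * get to skip every deduplicated block outright; both cases fall
	 * out of the same ddt_class_contains() call.
	 */
	if (BP_GET_DEDUP(bp) &&
	    ddt_class_contains(spa, scn->scn_phys.scn_ddt_class_max, bp))
		return;

	scan_enqueue_bp(scn, bp);	/* hypothetical: queue for in-order issue */
}
```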

All of these iterations over the DDT ZAPs happen in checksum order, which is uncorrelated with on-disk layout of the blocks in question. Thus, a very large DDT will, statistically, create many sparse regions in the in-order scan sio queues, as we are essentially picking on-disk locations at random. Worse, as we are forced to shed queue memory and actually scan blocks, we will pick the largest, least-gapful sio entries. That is, we're likely to evict the very entries that have, by chance, managed to accumulate runs of disk blocks and so cause later sio queue entries to be even sparser on average! Thus, it is unlikely that large files, even if written sequentially on disk, will be scanned in anything resembling disk order; instead, several passes may need to be done over the same region of the medium, with different sectors considered to be in the gaps each time. This is, at the very least, somewhat antithetical to the intention of the in-order scan code!

Note that this scan-fragmenting behavior emerges even if the entire DDT (or a given DDT class) fits in (ARC) memory; what matters here is how much of the DDT (class) fits in the in-order scan queues, which are likely to be far more limited than the ARC (by default, the scan queues are capped to a tenth of the ARC's size; while there is some subtlety about the relative size of an in-core DDT ZAP entry vs. a sio queue entry, both are, by weight, mostly pointers to data blocks and so are probably in the same ballpark).
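To make the eviction point concrete, here is a purely illustrative comparator in the spirit of "issue the largest, most-filled extent first"; the struct and names are invented for this sketch, and the real policy lives in module/zfs/dsl_scan.c.

```c
/*
 * Invented-for-illustration sketch of the "shed the biggest, densest
 * extent first" behavior described above; not the actual dsl_scan.c code.
 */
#include <stdint.h>

typedef struct scan_extent {
	uint64_t se_start;	/* starting disk offset of the extent */
	uint64_t se_size;	/* span covered on disk, gaps included */
	uint64_t se_fill;	/* bytes of queued sios inside that span */
} scan_extent_t;

/* Sort so the extent holding the most queued data is issued (shed) first. */
static int
scan_extent_issue_compare(const void *a, const void *b)
{
	const scan_extent_t *ea = a, *eb = b;

	if (ea->se_fill != eb->se_fill)
		return (ea->se_fill > eb->se_fill ? -1 : 1);
	return (ea->se_start < eb->se_start ? -1 :
	    ea->se_start > eb->se_start ? 1 : 0);
}
```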

Describe the feature you would like to see added to OpenZFS

I would like to propose another traversal "class" which

  • does not traverse the DDTs to enqueue block pointers (though of course traverses them for scanning as part of the MOS)
  • instead, traverses the pool data structure twice:
    • The first traversal enqueues the autoditto DDT class and all deduplicated blocks (and only those)
    • The second enqueues all non-deduplicated blocks (and only those)

That is, like "class 2" traversals, it visits all deduplicated blocks first and does not need to probe the DDT in the second traversal.
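A minimal sketch of the per-block-pointer decision under the proposed two-pass scheme; the enum, helper, and pass flag below are hypothetical and do not exist in OpenZFS today.

```c
/*
 * Hypothetical sketch of the proposed two-pass policy; the enum and
 * helper are invented for illustration and do not exist in OpenZFS.
 * Assumes sys/spa.h (blkptr_t, BP_GET_DEDUP) and the kernel boolean_t.
 */
typedef enum scan_pass {
	/* pass 1: the autoditto DDT class (as in class 0) + all dedup'd blocks */
	SCAN_PASS_DEDUP = 1,
	/* pass 2: everything else; no DDT probes are needed here */
	SCAN_PASS_NONDEDUP = 2
} scan_pass_t;

static boolean_t
scan_pass_wants_bp(scan_pass_t pass, const blkptr_t *bp)
{
	boolean_t is_dedup = BP_GET_DEDUP(bp) ? B_TRUE : B_FALSE;

	if (pass == SCAN_PASS_DEDUP)
		return (is_dedup);
	return (!is_dedup);
}
```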

ETA: Other scan strategies are possible here too, if desired, that use the two phases analogously to the other DDT-class N traversals. One could imagine, for example, walking the pool and first collecting all autoditto and multiply-referenced blocks, with the second pass collecting singly-referenced dedup'd blocks and non-deduplicated blocks.

How will this feature improve OpenZFS?

This new scrub strategy would enqueue deduplicated blocks in file order rather than in checksum order. While file order is not disk order, it is likely to be much closer than the essentially random mapping dictated by cryptographic hash functions. Thus, we should expect far fewer gap-ful regions to be created in the sio queues, and far fewer repeated passes over the same region of the medium during a scan of a large, heavily deduplicated pool.

The downside is that we may scrub a deduplicated block more than once if it appears multiple times in the block pointer tree and the occurrences are far enough apart that the sio queues cannot merge the duplicate entries.

nwf added the Type: Feature label on Jul 4, 2021
rincebrain mentioned this issue on May 3, 2022