
ACP: abstract BufReader::peek #569


Open
Kixunil opened this issue Mar 31, 2025 · 7 comments
Labels
ACP-accepted (API Change Proposal is accepted, seconded with no objections) · api-change-proposal (a proposal to add or alter unstable APIs in the standard libraries) · T-libs-api

Comments

@Kixunil

Kixunil commented Mar 31, 2025

Proposal

Problem statement

Because of how BufRead::fill_buf works, chunked decoders such as base64 or hex that decode from generic readers are currently forced to maintain an internal buffer of unread byte chunks. This makes the code much more complex, error-prone, and slower, since it introduces additional copies along with all the branches required to handle the edge cases. BufReader::peek solves this, but only for the specific case where the reader is a BufReader, not for generic readers (e.g. slices).

Motivating examples or use cases

The base64 crate has this code:

pub struct DecoderReader<'e, E: Engine, R: io::Read> {
    // Unrelated fields omitted for brevity

    /// Where b64 data is read from
    inner: R,

    /// Holds b64 data read from the delegate reader.
    b64_buffer: [u8; BUF_SIZE],
    /// The start of the pending buffered data in `b64_buffer`.
    b64_offset: usize,
    /// The amount of buffered b64 data after `b64_offset` in `b64_buffer`.
    b64_len: usize,
}

The three additional fields hold an internal buffer that gets copied from the internal reader. This means that reading from e.g. a slice adds an additional copy and complex handling of edge cases. (You can see quite complicated code if you read that file.) The BufRead trait almost solves it, but one edge case still remains: if fill_buf returns fewer bytes than the chunk size, the only way to progress is to copy the bytes into an intermediate buffer before continuing with decoding. As mentioned, BufReader::peek can solve this, but storing a BufReader internally would add additional copying if an already-buffered reader is used.
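To make that edge case concrete, here is a minimal sketch (all names hypothetical) of the inner loop such a decoder is forced to write on top of plain BufRead; the slow path with the extra copy is exactly what this proposal eliminates:

use std::io::{self, BufRead};

const CHUNK: usize = 4; // encoded chunk size, e.g. 4 bytes of base64

/// Reads one whole encoded chunk, copying into `stash` only when
/// `fill_buf` hands back a partial chunk.
fn next_chunk<R: BufRead>(reader: &mut R, stash: &mut Vec<u8>) -> io::Result<Option<[u8; CHUNK]>> {
    loop {
        let buf = reader.fill_buf()?;
        if buf.is_empty() {
            // EOF; a non-empty stash here would mean a truncated chunk.
            return Ok(None);
        }
        if stash.is_empty() && buf.len() >= CHUNK {
            // Fast path: decode straight out of the reader's buffer.
            let chunk: [u8; CHUNK] = buf[..CHUNK].try_into().unwrap();
            reader.consume(CHUNK);
            return Ok(Some(chunk));
        }
        // Slow path: `fill_buf` returned a partial chunk, so the only way
        // to make progress is to copy it aside and ask again.
        let take = (CHUNK - stash.len()).min(buf.len());
        stash.extend_from_slice(&buf[..take]);
        reader.consume(take);
        if stash.len() == CHUNK {
            let chunk: [u8; CHUNK] = stash[..].try_into().unwrap();
            stash.clear();
            return Ok(Some(chunk));
        }
    }
}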

Solution sketch

We can solve these problems by adding this trait:

use std::io::{self, BufRead};

// Note that I'm changing the name from `peek` because the semantics are a bit
// different: it should return the entire internal buffer rather than just a sub-slice.
pub trait RequireBytes: BufRead {
    /// Returns the maximum number of bytes that can be requested, `None` meaning unlimited.
    ///
    /// Note that this doesn't mean `require_bytes` could return an unlimited number of
    /// bytes, merely that it can return any number of bytes until it reaches EOF.
    fn capacity(&self) -> Option<usize>;

    /// Attempts to fill the internal buffer until at least `num_bytes` are in it.
    ///
    /// This will return a shorter buffer if EOF is reached; in case of error the data is
    /// preserved inside the buffer. This may return a longer buffer if the underlying
    /// reader happened to have more bytes available.
    fn require_bytes(&mut self, num_bytes: usize) -> io::Result<&[u8]>;
}

Libraries that implement decoding, like base64, can then simply bound on RequireBytes and avoid all of that complexity, as sketched below.
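For illustration, a hypothetical hex decoder bounding on the proposed trait could look like this (decode logic simplified; require_bytes guarantees a whole 2-byte chunk unless EOF intervenes):

use std::io;

/// Hypothetical hex decoder: every 2 input bytes decode to 1 output byte.
fn decode_hex<R: RequireBytes>(reader: &mut R, out: &mut Vec<u8>) -> io::Result<()> {
    fn digit(b: u8) -> io::Result<u32> {
        (b as char)
            .to_digit(16)
            .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "bad hex digit"))
    }
    loop {
        let buf = reader.require_bytes(2)?;
        if buf.is_empty() {
            return Ok(()); // clean EOF
        }
        // Round down to whole chunks; a single leftover byte is only
        // possible at EOF and means the input was truncated.
        let whole = buf.len() / 2 * 2;
        if whole == 0 {
            return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "odd trailing byte"));
        }
        for pair in buf[..whole].chunks_exact(2) {
            out.push((digit(pair[0])? * 16 + digit(pair[1])?) as u8);
        }
        reader.consume(whole);
    }
}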

Alternatives

Strictly speaking, this doesn't require std. Any library wishing to do this can define its own trait and implement it for all std types (once BufReader::peek is stable). However, it would be annoying to coordinate implementations across various crates. Having a "common vocabulary" trait in std would make the implementation more easily accessible.
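A sketch of that out-of-std alternative, assuming the trait definition above: the impl for &[u8] is trivial because a slice is its own buffer (the BufReader impl would have to wait for BufReader::peek to stabilize):

impl RequireBytes for &[u8] {
    fn capacity(&self) -> Option<usize> {
        // A slice is its own buffer: any number of bytes can be
        // requested, bounded only by EOF.
        None
    }

    fn require_bytes(&mut self, _num_bytes: usize) -> std::io::Result<&[u8]> {
        // Everything remaining is already buffered; returning fewer than
        // `num_bytes` simply signals that EOF is near.
        Ok(*self)
    }
}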

Links and related work

What happens now?

This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capacity becomes available. Current response times do not have a clear estimate, but may be up to several months.

Possible responses

The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):

  • We think this problem seems worth solving, and the standard library might be the right place to solve it.
  • We think that this probably doesn't belong in the standard library.

Second, if there's a concrete solution:

  • We think this specific solution looks roughly right, approved, you or someone else should implement this. (Further review will still happen on the subsequent implementation PR.)
  • We're not sure this is the right solution, and the alternatives or other materials don't give us enough information to be sure about that. Here are some questions we have that aren't answered, or rough ideas about alternatives we'd want to see discussed.
Kixunil added the api-change-proposal and T-libs-api labels Mar 31, 2025
@Amanieu
Member

Amanieu commented Apr 1, 2025

> The three additional fields hold an internal buffer that gets copied from the internal reader. This means that reading from e.g. a slice adds an additional copy and complex handling of edge cases. (You can see quite complicated code if you read that file.) The BufRead trait almost solves it, but one edge case still remains: if fill_buf returns fewer bytes than the chunk size, the only way to progress is to copy the bytes into an intermediate buffer before continuing with decoding. As mentioned, BufReader::peek can solve this, but storing a BufReader internally would add additional copying if an already-buffered reader is used.

I don't think your proposal fully addresses this. It's perfectly valid to implement RequireBytes with a buffer size of only 1 byte, which means that base64 still has to use an intermediate buffer.

@Kixunil
Author

Kixunil commented Apr 2, 2025

Oh, it does, but I forgot to mention how. :) The decoder can check capacity upfront and error during construction if it's too small. I believe doing so is reasonable because short buffers should be rare, and even if one has a short buffer it's always possible to just stick a BufReader in between to get the current behavior, with the same performance penalty that is present today and the advantage that std code is used rather than custom code, reducing binary size. (Note that &[u8] has an "infinite" buffer even if it's an empty slice or any other length: a shorter returned buffer means it's at the end, so it's consistent. Same with Cursor.)
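A sketch of that constructor-time check (the decoder type and error message are invented for illustration):

use std::io;

const CHUNK: usize = 4; // one encoded base64 chunk

struct Decoder<R> {
    inner: R,
}

impl<R: RequireBytes> Decoder<R> {
    /// Errors out if the reader's buffer cannot hold one whole chunk;
    /// callers with such a reader can wrap it in `io::BufReader` instead.
    fn new(inner: R) -> io::Result<Self> {
        match inner.capacity() {
            Some(cap) if cap < CHUNK => Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                "buffer too small for one base64 chunk",
            )),
            _ => Ok(Decoder { inner }),
        }
    }
}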

@Amanieu
Member

Amanieu commented Apr 29, 2025

We discussed this in the @rust-lang/libs-api meeting today and we are happy to accept this, but with some changes.

  • The trait should be renamed to PeekableBufRead to indicate a buffered reader that supports peeking. The current name really doesn't describe what the trait does.
  • The require_bytes method should be renamed to peek and should match the behavior of BufReader::peek. Notably, we could can BufReader::peek to return excess bytes as you propose here.
  • capacity should just return a plain usize instead of an Option<usize>. Unlimited capacity can be represented by returning usize::MAX.

Please open a tracking issue and open a PR to rust-lang/rust to add it as an unstable feature. You can close this ACP once the tracking issue has been created.
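Putting those three changes together, the accepted shape would be roughly this (a sketch of the direction, not the final API):

use std::io::{self, BufRead};

pub trait PeekableBufRead: BufRead {
    /// Maximum number of bytes `peek` can return; `usize::MAX` means the
    /// reader is its own buffer and any request can be satisfied up to EOF.
    fn capacity(&self) -> usize;

    /// Fills the internal buffer until at least `n` bytes are available,
    /// returning fewer only at EOF, matching `BufReader::peek`.
    fn peek(&mut self, n: usize) -> io::Result<&[u8]>;
}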

Amanieu added the ACP-accepted label Apr 29, 2025
@programmerjake
Member

> Notably, we could can BufReader::peek

I'm assuming you meant "change" rather than "can".

@Kixunil
Author

Kixunil commented Apr 29, 2025

Sounds reasonable

> Unlimited capacity can be represented by returning usize::MAX.

Oh, this is true but non-obvious: no actual allocation in Rust can exceed isize::MAX bytes, so usize::MAX will never be misinterpreted as a real slice length.

Unfortunately I'm unable to do more than reply to this right now, will do the rest later.

@fintelia

If I'm interpreting this ACP correctly, then &[u8] and VecDeque<u8> would gain new capacity methods that unconditionally return usize::MAX? Isn't that very likely to cause issues given that Vec::capacity and VecDeque::capacity already exist but have different semantics?
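Inherent methods do take precedence over trait methods in Rust's method resolution, so the shadowing fintelia describes would indeed occur; a sketch of the surprise (using the accepted PeekableBufRead name):

use std::collections::VecDeque;

fn demo(deque: &VecDeque<u8>) {
    // Resolves to the inherent VecDeque::capacity (allocation size),
    // not the trait method, even with PeekableBufRead in scope;
    // the trait method would need PeekableBufRead::capacity(deque).
    let _alloc_capacity = deque.capacity();
}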

@programmerjake
Member

maybe rename capacity to peek_capacity?
