Warn that Chars iterator does not iterate "characters" #26689
Comments
Rust's We also document these differences in the book: http://doc.rust-lang.org/stable/book/strings.html#indexing
The documentation is technically correct, but I think insufficient to prevent users from falling into a trap. The difference between Unicode codepoint and a character is subtle and tricky, and to make it worse, in most cases (and all examples in Rust documentation) it appears to be the same thing. The documentation shows only simple cases where one codepoint is one grapheme cluster, which, along with the name of the

I could make pull requests for the docs adding explanatory notes to chars iterator, chars method and adding tricky cases to the book. Do you think that would be a good addition?
Yeah, I guess what I meant was "here's what we have, what specifically could improve?" A PR would be a great way to workshop it :)
@pornel are you still pursuing a PR, or should I write something?
Sorry, I've been a bit ill. It's still on my todo list.
It's all good! Feel better. I have other work I can do, was just curious.
Fixes #26689

This PR tries to clarify uses of "character" where it means "code point" or "UTF-8 sequence", which are almost, but not quite, the same. Edge cases are added to some examples to demonstrate this. However, I've kept the term "code point" instead of "Unicode scalar value", because in UTF-8 they're the same, and "code point" is more widely known.
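A minimal sketch of the terms in play, using only the standard library. The helper names `is_scalar_value` and `utf8_lengths` are illustrative and not part of the PR. It shows that a surrogate code point is not a scalar value (so it can never be a `char` or appear in a `str`), and that each scalar value maps to a UTF-8 sequence of one to four bytes.

```rust
/// Returns true if `cp` is a Unicode scalar value, i.e. a code point
/// that is at most U+10FFFF and outside the surrogate range U+D800..=U+DFFF.
/// `char::from_u32` rejects everything else.
fn is_scalar_value(cp: u32) -> bool {
    char::from_u32(cp).is_some()
}

/// Length in bytes of the UTF-8 sequence for each scalar value in `s`.
fn utf8_lengths(s: &str) -> Vec<usize> {
    s.chars().map(char::len_utf8).collect()
}

fn main() {
    // U+0041 is an ordinary scalar value.
    println!("U+0041 scalar value? {}", is_scalar_value(0x41));
    // U+D800 is a code point, but a surrogate, so not a scalar value.
    println!("U+D800 scalar value? {}", is_scalar_value(0xD800));
    // 'a', U+00E9, U+20AC and U+1F600 need 1, 2, 3 and 4 bytes.
    println!("{:?}", utf8_lengths("a\u{e9}\u{20ac}\u{1f600}"));
}
```

Because a `str` is always valid UTF-8, every code point it contains is also a scalar value, which is the sense in which the two terms coincide here.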
The Chars iterator iterates over Unicode Scalar Values, but when people think about "characters" they usually mean something closer to what Unicode calls "grapheme clusters". This leads to surprising results and thus potential errors, for example:
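A minimal sketch of this kind of surprise, using only the standard library. The strings and the helper names `scalar_count` and `reverse_scalars` are illustrative choices, not taken from the report. The same visible "é" counts as one or two items depending on how it is encoded, and reversing by scalar value detaches a combining accent from its base letter.

```rust
/// Counts the Unicode scalar values in `s`, which is what `str::chars` yields.
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}

/// Reverses `s` one scalar value at a time.
fn reverse_scalars(s: &str) -> String {
    s.chars().rev().collect()
}

fn main() {
    // "é" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    let decomposed = "e\u{301}";
    // The same visible text as the single precomposed scalar U+00E9.
    let precomposed = "\u{e9}";

    // One user-perceived character, but two scalar values versus one.
    println!("decomposed:  {} scalar values", scalar_count(decomposed));
    println!("precomposed: {} scalar values", scalar_count(precomposed));

    // Reversing "ae\u{301}" puts the accent first, so it no longer
    // follows the 'e' it was meant to modify.
    let reversed = reverse_scalars("ae\u{301}");
    println!("{:?}", reversed);
}
```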
I suggest adding a warning to the documentation that this iterator isn't iterating over "characters", and that users should consider using the UnicodeSegmentation::graphemes iterator instead.