Warn that Chars iterator does not iterate "characters" #26689

Closed
kornelski opened this issue Jun 30, 2015 · 6 comments

@kornelski
Contributor

The Chars iterator iterates over Unicode Scalar Values, but when people think about "characters" they usually mean something closer to what Unicode calls "grapheme clusters".

This leads to surprising results and thus potential errors, for example:

"éé".chars().count() == 3
"🇺🇸".chars().count() == 2

I suggest adding a warning to the documentation that this iterator isn't iterating over "characters", and that users should consider using UnicodeSegmentation::graphemes iterator instead.
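
The two counts above can be reproduced with the standard library alone. This is a minimal sketch, assuming the first string is one precomposed é (U+00E9) followed by one decomposed é (e plus U+0301), which is the reading that yields 3. The helper name `count_chars` is hypothetical, and the escapes are spelled out so the individual code points are visible.

```rust
/// Counts the `char`s (Unicode scalar values) in a string.
/// Hypothetical helper, only here to make the example testable.
fn count_chars(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    // Assumption: precomposed é (U+00E9), then decomposed é (e + U+0301).
    // A reader sees two characters; Chars yields three items.
    let accents = "\u{e9}e\u{301}";
    assert_eq!(count_chars(accents), 3);

    // Two regional indicator symbols (U+1F1FA, U+1F1F8) render as one flag.
    let flag = "\u{1F1FA}\u{1F1F8}";
    assert_eq!(count_chars(flag), 2);
}
```

Counting grapheme clusters instead is not possible with the standard library; the graphemes iterator mentioned above comes from the external unicode-segmentation crate.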

@steveklabnik
Member

Rust's char type is a Unicode scalar value, and its documentation does currently say 'codepoints', explicitly: http://doc.rust-lang.org/stable/std/primitive.str.html#method.chars

We also document these differences in the book: http://doc.rust-lang.org/stable/book/strings.html#indexing

@kornelski
Contributor Author

The documentation is technically correct, but I think insufficient to prevent users from falling into a trap.

The difference between a Unicode codepoint and a character is subtle and tricky, and to make it worse, in most cases (and in all examples in the Rust documentation) they appear to be the same thing.

The documentation shows only simple cases where one codepoint is one grapheme cluster, which—along with the name of the char type—could lead users to wrongly assume that codepoints are characters.

I could make pull requests for the docs, adding explanatory notes to the chars iterator and the chars method, and adding tricky cases to the book. Do you think that would be a good addition?

@steveklabnik
Member

Yeah, I guess what I meant was "here's what we have, what specifically could improve?"

A PR would be a great way to workshop it :)

@steveklabnik
Member

@pornel are you still pursuing a PR, or should I write something?

@kornelski
Contributor Author

Sorry, I've been a bit ill. It's still on my todo list.

@steveklabnik
Member

It's all good! Feel better. I have other work I can do, was just curious.

bors added a commit that referenced this issue Jul 26, 2015
Fixes #26689

This PR tries to clarify uses of "character" where it means "code point" or "UTF-8 sequence", which are almost, but not quite the same. Edge cases added to some examples to demonstrate this.

However, I've kept use of the term "code point" instead of "Unicode scalar value", because in UTF-8 they're the same, and "code point" is more widely known.
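
The distinction between "UTF-8 sequence", "code point" and "Unicode scalar value" drawn above can be checked with the standard library. This is a rough sketch, not code from the PR; the helper names `utf8_len`, `scalar_count` and `is_scalar_value` are hypothetical. It relies on the fact that surrogate code points (U+D800 to U+DFFF) are not scalar values and so cannot be a `char` or appear in a Rust `str`.

```rust
/// Number of bytes in the UTF-8 encoding of `s`.
fn utf8_len(s: &str) -> usize {
    s.len()
}

/// Number of `char`s (Unicode scalar values) in `s`.
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}

/// True if the code point is a Unicode scalar value, i.e. a valid `char`.
fn is_scalar_value(code_point: u32) -> bool {
    char::from_u32(code_point).is_some()
}

fn main() {
    // Decomposed é: 'e' is 1 byte, U+0301 is 2 bytes in UTF-8.
    let s = "e\u{301}";
    assert_eq!(utf8_len(s), 3); // three bytes
    assert_eq!(scalar_count(s), 2); // two chars, one visible character

    // U+00E9 is a scalar value; the surrogate U+D800 is a code point but not one.
    assert!(is_scalar_value(0xE9));
    assert!(!is_scalar_value(0xD800));
}
```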