Warn that Chars iterator does not iterate "characters" #26689

kornelski · 2015-06-30T21:00:33Z

The Chars iterator iterates over Unicode Scalar Values, but when people think about "characters" they usually mean something closer to what Unicode calls "grapheme clusters".

This leads to surprising results and thus potential errors, for example:

"éé".chars().count() == 3
"🇺🇸".chars().count() == 2

I suggest adding a warning to the documentation that this iterator isn't iterating over "characters", and that users should consider using UnicodeSegmentation::graphemes iterator instead.
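
The counts above can be reproduced with the standard library alone. This is a minimal sketch, assuming the first string is a precomposed é (U+00E9) followed by a decomposed one (`e` plus the combining accent U+0301), which is one way to arrive at 3. The helper name `scalar_count` is made up for illustration. Counting grapheme clusters needs the external `unicode-segmentation` crate, so it is not shown here.

```rust
// Hypothetical helper: counts Unicode scalar values,
// which is what `str::chars` yields.
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    // Assumed input: precomposed U+00E9, then 'e' + combining acute U+0301.
    // Both halves render as "é", but they hold 1 and 2 scalar values.
    let accents = "\u{e9}e\u{301}";

    // Two regional indicator symbols (U and S) that render as a single flag.
    let flag = "\u{1F1FA}\u{1F1F8}";

    println!("{} has {} scalar values", accents, scalar_count(accents));
    println!("{} has {} scalar values", flag, scalar_count(flag));
}
```

Writing the strings with `\u{...}` escapes makes the difference visible in the source, which a pasted literal such as `"éé"` hides.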


steveklabnik · 2015-06-30T21:10:38Z

Rust's char type is a Unicode scalar value, and its documentation does currently say 'codepoints', explicitly: http://doc.rust-lang.org/stable/std/primitive.str.html#method.chars

We also document these differences in the book: http://doc.rust-lang.org/stable/book/strings.html#indexing

kornelski · 2015-06-30T21:25:16Z

The documentation is technically correct, but I think insufficient to prevent users from falling into a trap.

The difference between a Unicode codepoint and a character is subtle and tricky, and to make it worse, in most cases (and all examples in Rust documentation) they appear to be the same thing.

The documentation shows only simple cases where one codepoint is one grapheme cluster, which—along with the name of the char type—could lead users to wrongly assume that codepoints are characters.

I could make pull requests for the docs adding explanatory notes to chars iterator, chars method and adding tricky cases to the book. Do you think that would be a good addition?

steveklabnik · 2015-06-30T22:14:53Z

Yeah, I guess what I meant was "here's what we have, what specifically could improve?"

A PR would be a great way to workshop it :)

steveklabnik · 2015-07-08T16:58:38Z

@pornel are you still pursuing a PR, or should I write something?

kornelski · 2015-07-08T22:17:03Z

Sorry, I've been a bit ill. It's still on my todo list.

steveklabnik · 2015-07-08T22:41:10Z

It's all good! Feel better. I have other work I can do, was just curious.

Fixes #26689. This PR tries to clarify uses of "character" where it means "code point" or "UTF-8 sequence", which are almost, but not quite, the same. Edge cases have been added to some examples to demonstrate this. However, I've kept the term "code point" instead of "Unicode scalar value", because in UTF-8 they're the same, and "code point" is more widely known.
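
To illustrate the distinction between a "UTF-8 sequence" and a "code point", here is a small sketch using only the standard library. It is not taken from the PR itself, and the function name `byte_and_scalar_counts` is hypothetical.

```rust
// Hypothetical helper: returns (UTF-8 byte length, number of code points).
// `str::len` counts bytes, while `str::chars` yields one `char` per code point.
fn byte_and_scalar_counts(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    let samples = [
        "abc",                  // ASCII: one byte per code point
        "\u{e9}",               // precomposed é: one code point, two bytes
        "e\u{301}",             // 'e' + combining acute: two code points, three bytes
        "\u{1F1FA}\u{1F1F8}",   // flag: two code points, four bytes each
    ];

    for s in samples.iter() {
        let (bytes, scalars) = byte_and_scalar_counts(s);
        println!("{:?}: {} bytes, {} code points", s, bytes, scalars);
    }
}
```

Only the ASCII sample has matching counts, which is why examples limited to ASCII make bytes, code points, and characters look interchangeable.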

steveklabnik added the A-docs label Jun 30, 2015

kornelski mentioned this issue Jul 13, 2015

Document Unicode complications when iterating "characters" #27012

Merged

bors closed this as completed in #27012 Jul 26, 2015
