ZWNJ in Persian #61
Comments
Yes, we understand how Arabic contextual forms work. The mandate is to follow that spec faithfully. That spec does call out that context-specific tailorings can exist, but those are usually use-case-dependent and this crate doesn't go that far. You can build such a tailoring on top of this to filter for ZWNJ.
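As a rough illustration of the kind of tailoring mentioned above, here is a minimal std-only sketch. It assumes the clusters have already been produced by some segmenter (for example the crate's `graphemes(true)` iterator); the helper name `strip_zwnj` is hypothetical and not part of any crate, and the input clusters are written by hand in the shape UAX #29 would give.

```rust
/// U+200C ZERO WIDTH NON-JOINER.
const ZWNJ: char = '\u{200C}';

/// Hypothetical helper (not part of any crate): removes ZWNJ from each
/// cluster and drops clusters that become empty as a result.
fn strip_zwnj<'a, I>(clusters: I) -> Vec<String>
where
    I: IntoIterator<Item = &'a str>,
{
    clusters
        .into_iter()
        .map(|c| c.chars().filter(|&ch| ch != ZWNJ).collect::<String>())
        .filter(|c| !c.is_empty())
        .collect()
}

fn main() {
    // Hand-written clusters for the Persian verb "mi-konad" typed with a
    // ZWNJ: the ZWNJ rides along with the preceding letter (U+06CC).
    let clusters = [
        "\u{0645}",
        "\u{06CC}\u{200C}",
        "\u{06A9}",
        "\u{0646}",
        "\u{062F}",
    ];
    let cleaned = strip_zwnj(clusters.iter().copied());
    println!("{:?}", cleaned);
}
```

This returns owned `String`s because the filtered cluster is no longer a contiguous slice of the input in the general case.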
Also, with my Unicode hat on, I think the spec is doing the right thing here: the zwnj has semantic meaning there -- it's not just a space -- and it's not its own "perceived character" -- so it has to be rolled in to something, and the spec rolls it into the previous one. Force-final forms aren't considered different in Arabic or Persian, it's a property of the word not the letter, but grapheme segmentation isn't about equality. A jeem and a jeem with a zwnj are both a user-perceived character, even though they are perceived by the user as the same character, because sameness isn't about encoding. If you're relying on equality while segmenting, you need to tweak how you look at equality for this to work. With Unicode algorithms it is important to understand if the algorithm is precisely for the conceptual purpose you want to use it for. Grapheme boundaries provide simple, bare-minimum logical places to do a bunch of segmentation operations (backspace, arrow keys, hyphenated linebreaking). They don't necessarily produce "graphemes" that are equal when you need them to be.
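One way to "tweak equality" along these lines is to compare clusters while skipping any ZWNJ. The following is only a sketch under that assumption; `eq_ignoring_zwnj` is a hypothetical helper, not something the crate provides.

```rust
/// U+200C ZERO WIDTH NON-JOINER.
const ZWNJ: char = '\u{200C}';

/// Hypothetical helper: compares two clusters code point by code point,
/// ignoring any ZWNJ on either side.
fn eq_ignoring_zwnj(a: &str, b: &str) -> bool {
    a.chars()
        .filter(|&c| c != ZWNJ)
        .eq(b.chars().filter(|&c| c != ZWNJ))
}

fn main() {
    let jeem = "\u{062C}"; // ARABIC LETTER JEEM
    let jeem_zwnj = "\u{062C}\u{200C}"; // jeem followed by ZWNJ

    println!("{}", jeem == jeem_zwnj); // false: the encodings differ
    println!("{}", eq_ignoring_zwnj(jeem, jeem_zwnj)); // true
}
```

Whether ignoring ZWNJ is the right notion of sameness depends on the use case; this only covers the jeem example discussed above.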
Thanks for your responses. Filtering for ZWNJ is easy enough. And I appreciate your last point; maybe I had the wrong idea of what grapheme segmentation is primarily meant for. Having said that, I'm not sure I agree about the spec itself. Where the ZWNJ is mentioned, it isn't related to the way it's used in Persian. The reference is to Indic languages, and those cases, from what I've seen, are less ambiguous (i.e., a difference in the user-perceived character). I'm not totally convinced that Persian usage was considered in the drafting of this spec… though again, I could be wrong. It's worth keeping in mind that the effort to get people to use the ZWNJ in Persian has been gradual, and it's still quite common to see spaces used instead. (That's the way I first learned to type in Persian, around fifteen years ago. Many of my academic colleagues continue to do so.) There are also contexts in which, failing the use of the ZWNJ, the letters can be allowed to connect. This is true of the verbal prefix mī. So you might see, for example, میکند or می کند or می‌کند and it's not the end of the world. The version with the ZWNJ is just the best option, since it's easy to read but also space-efficient and keeps the word together. Anyway, I could ramble about this for days, but my point is that the "semantic content" of the ZWNJ in Persian is debatable. Is it just like a space? Not quite, but a space can stand in for it in a pinch. Does it produce something new in combination with the preceding letter? No, or not perceptibly. Does it need to be treated as part of a grapheme cluster with the preceding letter? I don't think anyone could argue that it needs to, but maybe someone thinks it's better this way for encoding purposes. I'll get to the bottom of it eventually. Maybe I'll ask Thomas Milo; if he tells me I'm full of it, that'll shut me up real quick. Thanks again.
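To make the three spellings concrete, here is a small sketch that prints the code points of each variant. The string literals are my own transcription of the joined, spaced, and ZWNJ forms (using U+06CC for the Persian yeh and U+06A9 for the Persian kaf), so treat them as an assumption rather than a quotation.

```rust
/// Formats a string as space-separated code points, e.g. "U+0645 U+06CC".
fn code_points(s: &str) -> String {
    s.chars()
        .map(|c| format!("U+{:04X}", c as u32))
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    // Assumed transcriptions of the three spellings of the same verb.
    let variants = [
        ("joined", "\u{0645}\u{06CC}\u{06A9}\u{0646}\u{062F}"),
        ("space", "\u{0645}\u{06CC} \u{06A9}\u{0646}\u{062F}"),
        ("ZWNJ", "\u{0645}\u{06CC}\u{200C}\u{06A9}\u{0646}\u{062F}"),
    ];
    for &(label, word) in variants.iter() {
        println!("{:>6}: {}", label, code_points(word));
    }
}
```

The only difference between the variants is what sits between U+06CC and U+06A9: nothing, U+0020, or U+200C.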
I suspect Roozbeh would have noticed if the spec was wrong, but I can ask him.
It could easily be the "bare default spec" argument ("if you want something different for working with Persian, go ahead and customize"). I don't know. It occurred to me that much of what I wrote about Persian would also apply to the use of the ZWNJ in German to prevent a ligature across the stems of a compound word. Someone would have spoken up if it seemed wrong to include the ZWNJ in the preceding grapheme cluster. I'll ask around.
So I asked Roozbeh (Unicode expert, script expert, native Persian speaker) about this and he agrees with me, but felt that you should submit feedback through https://unicode.org/reporting.html anyway so we can discuss this at the next UTC.
Ok. I'll just ask that a sentence or two be added to explain how the treatment of ZWNJ in this spec fits with languages where it's used to prevent a connection or ligature. That much would have cowed me from the start. If the way it works now seems right to highly placed Iranians in the free software world (not a small group), then what can I do but eat my words.
Feel free to close this if there's nothing to be done about it. I saw that Annex 29 repeatedly excludes the zero-width non-joiner (ZWNJ) as a boundary.
In Persian, this character (U+200C) is used to prevent connection of letters between certain prefixes and suffixes, and the words to which they are attached. I know it has other purposes in other languages, but Persian is what I'm working with. (I also work in Arabic, where the ZWNJ is not used in any context that I know of.)
I was tinkering with a Rust program that involves (among other things) taking Arabic or Persian text input and segmenting the graphemes. Once I found this package, it worked immediately, with few exceptions. And I understood the exceptions that occurred. For example, if an Arabic letter is followed by a vowel mark or other diacritic, those code points stay together as a unit. That seems right, since the letter plus diacritic(s) can be said to represent the "user-perceived character."
But I have a problem with the ZWNJ in Persian. It does not create a new "user-perceived character" along with the preceding letter—which is how it's being treated in this segmentation scheme. Rather, the intention is, "act as though there's a space after this letter, but leave out the space."
At issue is the fact that letters in the Arabic or Persian alphabet have up to four contextual forms: isolated, initial, medial, and final. As you probably know, setting the correct form in a given context tends to be taken care of by the shaping engine. (Otherwise, typing would be incredibly tedious.) When a ZWNJ is added, it's an instruction not to use the medial form of the preceding letter, where it might otherwise be used. The result is that one of the other standard forms will be set instead, depending on the context.
When segmenting graphemes in Persian, then, I don't think it makes sense to exclude the ZWNJ as a boundary. It would better be segmented out, the way that spaces are. In fact, unless I've missed something, U+200C could be treated as a grapheme boundary when it occurs after any code point in the Arabic block. (It should not, however, be treated as a word or sentence boundary by default.)
But I could be wrong. There are people who would know better. And if the mandate here is to follow Annex 29 faithfully, then I suppose it doesn't matter. I found a workaround for my immediate purposes.
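For what it's worth, the rule proposed above (break before U+200C when it follows a code point in the Arabic block) could be sketched as a post-processing step over already-segmented clusters. This is not the workaround mentioned above, just an illustration; `split_zwnj` and `in_arabic_block` are hypothetical names, and "Arabic block" is taken to mean U+0600 through U+06FF only.

```rust
/// U+200C ZERO WIDTH NON-JOINER.
const ZWNJ: char = '\u{200C}';

/// True for code points in the basic Arabic block (U+0600..=U+06FF).
/// Assumption: supplement and presentation-form blocks are not included.
fn in_arabic_block(c: char) -> bool {
    ('\u{0600}'..='\u{06FF}').contains(&c)
}

/// Hypothetical tailoring: if a cluster ends in ZWNJ and the code point
/// right before it is in the Arabic block, emit the ZWNJ as its own unit.
fn split_zwnj<'a>(clusters: &[&'a str]) -> Vec<&'a str> {
    let mut out = Vec::new();
    for &cluster in clusters {
        if let Some(head) = cluster.strip_suffix(ZWNJ) {
            if head.chars().last().map_or(false, in_arabic_block) {
                out.push(head);
                out.push(&cluster[head.len()..]); // the ZWNJ itself
                continue;
            }
        }
        out.push(cluster); // unchanged: no ZWNJ, or not after Arabic
    }
    out
}

fn main() {
    // Hand-written clusters in the shape UAX #29 would give.
    let clusters = ["\u{0645}", "\u{06CC}\u{200C}", "\u{06A9}", "\u{0646}", "\u{062F}"];
    let tailored = split_zwnj(&clusters);
    println!("{} clusters -> {} units", clusters.len(), tailored.len());
}
```

A ZWNJ after a non-Arabic letter (the German ligature case, say) is left attached, since the rule as stated only covers the Arabic block.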
Thank you for your work on this project!