Skip to content
This repository was archived by the owner on Feb 16, 2024. It is now read-only.

Flag implication: v implies u? #23

Closed
mathiasbynens opened this issue May 26, 2021 · 17 comments
Closed

Flag implication: v implies u? #23

mathiasbynens opened this issue May 26, 2021 · 17 comments

Comments

@mathiasbynens
Copy link
Member

mathiasbynens commented May 26, 2021

The current draft spec says that the v flag implies the u behavior, but it doesn't make that explicit in that /./v.unicode === false. This was a conscious choice that I pushed for: I wanted to preserve the invariant that 1 flag affects exactly 1 getter. v and u are two distinct modes; it's just that v syntax + functionality happen to be a superset of u.

During the TC39 meeting, @waldemarhorwat and @msaboff brought up alternate options, and we should reconsider if we should change anything here.

  1. We could enforce the explicit use of the u flag whenever v is used.
    • Con: requires more typing. (pointed out by Waldemar)
    • Con: annoying to spec the early error. (pointed out by Waldemar)
    • Con: existing code that takes a different code path based on .unicode might now break, since v is parsed differently
  2. We could make the implication explicit by making /./v.unicode === true.
    • Con: This might break existing code that expects RegExp objects with.unicode === true to behave based on the u-only (non-v) behavior.
    • Con: .unicode === true would not correspond to .flags.includes('u'). (pointed out by Jordan)
@markusicu
Copy link
Collaborator

FWIW I don't have any strong opinion here. I totally defer to you + stage 3 reviewers.

@ljharb
Copy link
Member

ljharb commented May 26, 2021

It doesn't have to be an early error; due to new RegExp it'd have to be a runtime error there, so i'd think we could make https://tc39.es/ecma262/#sec-regexpinitialize throw whenever v is in F but u is not in F?

Another con of option 2 there is that .unicode === true would not correspond to .flags.includes('u').

@mathiasbynens
Copy link
Member Author

It doesn't have to be an early error; due to new RegExp it'd have to be a runtime error there

If we were to throw errors (which for the record, I hope we don't!) I think it would need to be an early error in the regular expression literal case: /./v would throw an early error, just like /./x does (where x is an unsupported flag). I agree that new RegExp('.', 'v') would then be a runtime error.

Another con of option 2 there is that .unicode === true would not correspond to .flags.includes('u').

Great point; added to the original post. Thanks!

@ljharb
Copy link
Member

ljharb commented May 26, 2021

Also, personally I don't find "requires more typing" an issue; code is read more often than it's written, so clarity/readability should be weighed more highly.

I also think that it would be very unfortunate if all code that currently checks .unicode had to change to x.unicode || x.uniSet, or code that checks .flags for u has to also check for v.

@sffc
Copy link
Collaborator

sffc commented May 26, 2021

Previous issue: #14

I personally would be happy with requiring /uv (where /v is a syntax error), because it meshes most consistently with my mental model of v adding the extra sets of strings semantics to u mode.

@mathiasbynens
Copy link
Member Author

I personally would be happy with requiring /uv (where /v is a syntax error), because it meshes most consistently with my mental model of v adding the extra sets of strings semantics to u mode.

I'd rather avoid this option. There is no precedent for introducing a new flag that cannot be used by itself.

@markusicu
Copy link
Collaborator

In email I asked:

I believe we agree that we don't want v semantics without u semantics, right? We don't want set operations and properties of strings with code unit matching and can-escape-any-letter, right?

Mathias replied:

I very strongly agree.

@markusicu
Copy link
Collaborator

Copied from #14 (comment) where I wrote:

Crazy idea: Could we change the "unicode" getter to return numeric value 2 instead of boolean false/true, when v is set? Would expressions like if (unicode) then ... still work?

@markusicu
Copy link
Collaborator

In email, Waldemar wrote:

There are two reasonable alternatives:

A. Flags as they are now. This has the disadvantage that user code checking for unicode mode must be modified to also look for the v flag.

B. Setting the v flag actually sets the u flag as well, with us picking whatever the most reasonable semantics are to achieve that.


I want to note that we have discussed another option as well: Require setting u when setting v. This is option 1 in Mathias’ original issue text above.

(Waldemar’s option B is Mathias’ original option 2. Waldemar’s option A is what we have in the current draft spec changes.)


Mathias responded about Waldemar’s option A:

I see this as an advantage; any existing code checking for .unicode and expecting (only the current) u syntax + semantics would otherwise break when it suddenly encounters some of the new v-only syntax. It seems better to restrict the potential breakage to just RegExps with the new flag/getter we're adding, and to not change anything else.

The mental model is that u and v are separate modes, and v just happens to be a superset of u.
Roughly similar to how JavaScript modules imply strict mode, but are still different (e.g. are parsed differently/support new syntax).


Note that if v does not set .unicode === true and some existing code checks for .unicode === false expecting pre-Unicode (BMP matching) behavior, then that code also needs to be updated.

@markusicu
Copy link
Collaborator

Discussed in today's meeting with @mathiasbynens @macchiati @sffc --

We realized that to say “v implies u” is slightly misleading. There are cases where expressions that are valid under u are syntax errors under v, or behave differently. For example, under v a bracket [ starts a nested class instead of just adding that bracket; and the proposed v syntax never allows a literal dash - as itself inside a CharacterClass. More generally, v requires escaping some syntax characters that are not (yet: issue #27) even allowed to be escaped under u.

In addition, in issue #30 we propose resolving CharacterClasses with top-level ^ complement and IgnoreCase differently.

Shane was wondering whether implementers that test (re.unicode) are primarily looking for whether the expression matches by whole code points (and supports Unicode properties), or whether literally the u flag is specified.

As for code that parses regular expression pattern strings, we think that adding the v mode will require an update no matter what the flag getters return. However, based on implementation experience with such parsers, Mathias thinks that setting (re.unicode == false) will break less of that kind of code: Code in an if (re.unicode) path would continue to work correctly.

Cc stage 3 reviewers @waldemarhorwat @gibson042 @msaboff

@ljharb
Copy link
Member

ljharb commented Jun 17, 2021

ftr, "v implies u" to me does not mean that the same syntax will be used, it means "u is for unicode semantics, v implies the same unicode semantics". I suspect most cases of people checking .unicode or .flags.match(/u/) are trying to determine code points and unicode properties, as opposed to caring about the actual regex pattern syntax.

@markusicu
Copy link
Collaborator

Follow-up from our meeting notes: Since we don't quite have “v implies u” and since it looks more compatible to have /./v.unicode == false, I think it’s best to forbid specifying both v and u.

@mathiasbynens
Copy link
Member Author

I suspect most cases of people checking .unicode or .flags.match(/u/) are trying to determine code points and unicode properties, as opposed to caring about the actual regex pattern syntax.

The only examples I know of, and which I was referring to during the TC39 meetings, are used for the latter purpose.

I share them below, but first, some background: any JS parser technically has to check each regular expression’s flags and parse accordingly, since invalid patterns result in early errors per spec, which are parse-time errors. A JS parser which is lenient in the face of an invalid regular expression literal is technically in violation of the spec. And of course, patterns exist that are only valid with or without the u flag.

  • Example pattern that’s only valid without the u flag:
  • Example patterns that are only valid with the u flag:
    • [\u{0}-z] — without the u flag, this creates a range from } to z, which is out of order.
    • [💩-💫] — without the u flag, surrogates are treated as individual "characters". The pattern is equivalent to [\uD83D\uDCA9-\uD83D\uDCAB], and the \uDCA9-\uD83D range is out of order.

With that out of the way, here are the JS-parser-in-JS examples which I referred do during the TC39 meetings.

(Note that Babel doesn’t bother implementing (or emulating) proper regular expression parsing, and happily accepts invalid patterns: https://github.com/babel/babel/blob/e498bee10f0123bb208baa228ce6417542a2c3c4/packages/babel-parser/src/tokenizer/index.js#L924-L973 It does check whether each flag is a known one, but it has no flag-specific code paths, not even for u.)

These are the only examples I’m aware of. All of the above implementations (incl. Babel) would need to be updated to support v, which is within expectations. Making v imply u would not change this.

I’m not aware of any real-world examples of library code checking re.unicode or .flags.includes('u') at runtime.

@mathiasbynens
Copy link
Member Author

There are three options:

  1. /v implies /u: using /uv is equivalent to /v
  2. /v by itself is invalid; developers must use /uv
  3. /uv is invalid; developers must use /v to get the new features, and /abc/v.unicode === false

I would like to resolve this issue with option 3, because it’s simplest. The mental model then becomes: v and u are separate modes, with no potentially surprising implications between them (one doesn’t silently enable the other). /v is syntactically and semantically different from /u.

Let’s discuss this during this week’s meeting.

@sffc
Copy link
Collaborator

sffc commented Sep 1, 2021

I agree with your logic for /v, so long as /v retains its own separate model. However, we got a lot of pushback from TC39 this week regarding the scope expansion. I think we need to answer that question before we can revisit this one.

@markusicu
Copy link
Collaborator

Even with the smaller scope of just the set notation the committee favored a new flag, and with the incompatibilities within the combined-so-far scope /u and /v are really separate modes.

@mathiasbynens
Copy link
Member Author

During yesterday’s weekly sync we decided to proceed with “option 3”. Closing this issue to mark its resolution. We’ll update the spec draft later.

@mathiasbynens mathiasbynens changed the title Flag implication: v implies u Flag implication: v implies u? Sep 27, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants