Flag implication: `v` implies `u`? #23

mathiasbynens · 2021-05-26T22:13:42Z

The current draft spec says that the v flag implies the u behavior, but it doesn't make that explicit in that /./v.unicode === false. This was a conscious choice that I pushed for: I wanted to preserve the invariant that 1 flag affects exactly 1 getter. v and u are two distinct modes; it's just that v syntax + functionality happen to be a superset of u.

During the TC39 meeting, @waldemarhorwat and @msaboff brought up alternate options, and we should reconsider if we should change anything here.

We could enforce the explicit use of the u flag whenever v is used.
- Con: requires more typing. (pointed out by Waldemar)
- Con: annoying to spec the early error. (pointed out by Waldemar)
- Con: existing code that takes a different code path based on .unicode might now break, since v is parsed differently
We could make the implication explicit by making /./v.unicode === true.
- Con: This might break existing code that expects RegExp objects with.unicode === true to behave based on the u-only (non-v) behavior.
- Con: .unicode === true would not correspond to .flags.includes('u'). (pointed out by Jordan)

The text was updated successfully, but these errors were encountered:

markusicu · 2021-05-26T22:25:12Z

FWIW I don't have any strong opinion here. I totally defer to you + stage 3 reviewers.

ljharb · 2021-05-26T22:27:24Z

It doesn't have to be an early error; due to new RegExp it'd have to be a runtime error there, so i'd think we could make https://tc39.es/ecma262/#sec-regexpinitialize throw whenever v is in F but u is not in F?

Another con of option 2 there is that .unicode === true would not correspond to .flags.includes('u').

mathiasbynens · 2021-05-26T22:35:17Z

It doesn't have to be an early error; due to new RegExp it'd have to be a runtime error there

If we were to throw errors (which for the record, I hope we don't!) I think it would need to be an early error in the regular expression literal case: /./v would throw an early error, just like /./x does (where x is an unsupported flag). I agree that new RegExp('.', 'v') would then be a runtime error.

Another con of option 2 there is that .unicode === true would not correspond to .flags.includes('u').

Great point; added to the original post. Thanks!

ljharb · 2021-05-26T22:43:26Z

Also, personally I don't find "requires more typing" an issue; code is read more often than it's written, so clarity/readability should be weighed more highly.

I also think that it would be very unfortunate if all code that currently checks .unicode had to change to x.unicode || x.uniSet, or code that checks .flags for u has to also check for v.

sffc · 2021-05-26T22:47:18Z

Previous issue: #14

I personally would be happy with requiring /uv (where /v is a syntax error), because it meshes most consistently with my mental model of v adding the extra sets of strings semantics to u mode.

mathiasbynens · 2021-06-10T12:02:23Z

I personally would be happy with requiring /uv (where /v is a syntax error), because it meshes most consistently with my mental model of v adding the extra sets of strings semantics to u mode.

I'd rather avoid this option. There is no precedent for introducing a new flag that cannot be used by itself.

markusicu · 2021-06-10T16:56:27Z

In email I asked:

I believe we agree that we don't want v semantics without u semantics, right? We don't want set operations and properties of strings with code unit matching and can-escape-any-letter, right?

Mathias replied:

I very strongly agree.

markusicu · 2021-06-10T16:59:34Z

Copied from #14 (comment) where I wrote:

Crazy idea: Could we change the "unicode" getter to return numeric value 2 instead of boolean false/true, when v is set? Would expressions like if (unicode) then ... still work?

markusicu · 2021-06-10T17:15:15Z

In email, Waldemar wrote:

There are two reasonable alternatives:

A. Flags as they are now. This has the disadvantage that user code checking for unicode mode must be modified to also look for the v flag.

B. Setting the v flag actually sets the u flag as well, with us picking whatever the most reasonable semantics are to achieve that.

I want to note that we have discussed another option as well: Require setting u when setting v. This is option 1 in Mathias’ original issue text above.

(Waldemar’s option B is Mathias’ original option 2. Waldemar’s option A is what we have in the current draft spec changes.)

Mathias responded about Waldemar’s option A:

I see this as an advantage; any existing code checking for .unicode and expecting (only the current) u syntax + semantics would otherwise break when it suddenly encounters some of the new v-only syntax. It seems better to restrict the potential breakage to just RegExps with the new flag/getter we're adding, and to not change anything else.

The mental model is that u and v are separate modes, and v just happens to be a superset of u.
Roughly similar to how JavaScript modules imply strict mode, but are still different (e.g. are parsed differently/support new syntax).

Note that if v does not set .unicode === true and some existing code checks for .unicode === false expecting pre-Unicode (BMP matching) behavior, then that code also needs to be updated.

markusicu · 2021-06-17T17:31:46Z

Discussed in today's meeting with @mathiasbynens @macchiati @sffc --

We realized that to say “v implies u” is slightly misleading. There are cases where expressions that are valid under u are syntax errors under v, or behave differently. For example, under v a bracket [ starts a nested class instead of just adding that bracket; and the proposed v syntax never allows a literal dash - as itself inside a CharacterClass. More generally, v requires escaping some syntax characters that are not (yet: issue #27) even allowed to be escaped under u.

In addition, in issue #30 we propose resolving CharacterClasses with top-level ^ complement and IgnoreCase differently.

Shane was wondering whether implementers that test (re.unicode) are primarily looking for whether the expression matches by whole code points (and supports Unicode properties), or whether literally the u flag is specified.

As for code that parses regular expression pattern strings, we think that adding the v mode will require an update no matter what the flag getters return. However, based on implementation experience with such parsers, Mathias thinks that setting (re.unicode == false) will break less of that kind of code: Code in an if (re.unicode) path would continue to work correctly.

Cc stage 3 reviewers @waldemarhorwat @gibson042 @msaboff

ljharb · 2021-06-17T17:34:15Z

ftr, "v implies u" to me does not mean that the same syntax will be used, it means "u is for unicode semantics, v implies the same unicode semantics". I suspect most cases of people checking .unicode or .flags.match(/u/) are trying to determine code points and unicode properties, as opposed to caring about the actual regex pattern syntax.

markusicu · 2021-06-17T17:36:01Z

Follow-up from our meeting notes: Since we don't quite have “v implies u” and since it looks more compatible to have /./v.unicode == false, I think it’s best to forbid specifying both v and u.

mathiasbynens · 2021-06-25T12:33:02Z

I suspect most cases of people checking .unicode or .flags.match(/u/) are trying to determine code points and unicode properties, as opposed to caring about the actual regex pattern syntax.

The only examples I know of, and which I was referring to during the TC39 meetings, are used for the latter purpose.

I share them below, but first, some background: any JS parser technically has to check each regular expression’s flags and parse accordingly, since invalid patterns result in early errors per spec, which are parse-time errors. A JS parser which is lenient in the face of an invalid regular expression literal is technically in violation of the spec. And of course, patterns exist that are only valid with or without the u flag.

Example pattern that’s only valid without the u flag:
- \z — with the u flag, backslash-escaped [a-zA-Z] literals are not allowed
Example patterns that are only valid with the u flag:
- [\u{0}-z] — without the u flag, this creates a range from } to z, which is out of order.
- [💩-💫] — without the u flag, surrogates are treated as individual "characters". The pattern is equivalent to [\uD83D\uDCA9-\uD83D\uDCAB], and the \uDCA9-\uD83D range is out of order.

With that out of the way, here are the JS-parser-in-JS examples which I referred do during the TC39 meetings.

Acorn: https://github.com/acornjs/acorn/blob/556a7ab62dd999c9024ed0e0920ebb5f6521c4de/acorn/src/regexp.js#L29-L34
- It looks for the u flag in order to correctly parse the regular expression.
Esprima: https://github.com/jquery/esprima/blob/0911ad869928fd218371b91eaae5d739f0c09aba/src/scanner.ts#L1139-L1169
- Rather than implement RegExp parsing, it instead performs some clever surgery on the regular expression pattern if the u flag is set, and defers to the host JS engine for the actual validation.

(Note that Babel doesn’t bother implementing (or emulating) proper regular expression parsing, and happily accepts invalid patterns: https://github.com/babel/babel/blob/e498bee10f0123bb208baa228ce6417542a2c3c4/packages/babel-parser/src/tokenizer/index.js#L924-L973 It does check whether each flag is a known one, but it has no flag-specific code paths, not even for u.)

These are the only examples I’m aware of. All of the above implementations (incl. Babel) would need to be updated to support v, which is within expectations. Making v imply u would not change this.

I’m not aware of any real-world examples of library code checking re.unicode or .flags.includes('u') at runtime.

mathiasbynens · 2021-09-01T09:58:25Z

There are three options:

/v implies /u: using /uv is equivalent to /v
/v by itself is invalid; developers must use /uv
/uv is invalid; developers must use /v to get the new features, and /abc/v.unicode === false

I would like to resolve this issue with option 3, because it’s simplest. The mental model then becomes: v and u are separate modes, with no potentially surprising implications between them (one doesn’t silently enable the other). /v is syntactically and semantically different from /u.

Let’s discuss this during this week’s meeting.

sffc · 2021-09-01T17:30:44Z

I agree with your logic for /v, so long as /v retains its own separate model. However, we got a lot of pushback from TC39 this week regarding the scope expansion. I think we need to answer that question before we can revisit this one.

markusicu · 2021-09-01T20:44:16Z

Even with the smaller scope of just the set notation the committee favored a new flag, and with the incompatibilities within the combined-so-far scope /u and /v are really separate modes.

mathiasbynens · 2021-09-03T08:42:18Z

During yesterday’s weekly sync we decided to proceed with “option 3”. Closing this issue to mark its resolution. We’ll update the spec draft later.

sffc mentioned this issue May 27, 2021

Bikeshed: flag + corresponding getter #14

Closed

mathiasbynens closed this as completed Sep 3, 2021

mathiasbynens mentioned this issue Sep 4, 2021

Add unicodeSet option for the RegExp set notation proposal jviereck/regjsparser#113

Merged

mathiasbynens changed the title ~~Flag implication: v implies u~~ Flag implication: v implies u? Sep 27, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Flag implication: `v` implies `u`? #23

Flag implication: `v` implies `u`? #23

mathiasbynens commented May 26, 2021 •

edited

Loading

markusicu commented May 26, 2021

ljharb commented May 26, 2021

mathiasbynens commented May 26, 2021

ljharb commented May 26, 2021

sffc commented May 26, 2021

mathiasbynens commented Jun 10, 2021

markusicu commented Jun 10, 2021

markusicu commented Jun 10, 2021

markusicu commented Jun 10, 2021

markusicu commented Jun 17, 2021

ljharb commented Jun 17, 2021

markusicu commented Jun 17, 2021

mathiasbynens commented Jun 25, 2021

mathiasbynens commented Sep 1, 2021

sffc commented Sep 1, 2021

markusicu commented Sep 1, 2021

mathiasbynens commented Sep 3, 2021

Flag implication: v implies u? #23

Flag implication: v implies u? #23

Comments

mathiasbynens commented May 26, 2021 • edited Loading

markusicu commented May 26, 2021

ljharb commented May 26, 2021

mathiasbynens commented May 26, 2021

ljharb commented May 26, 2021

sffc commented May 26, 2021

mathiasbynens commented Jun 10, 2021

markusicu commented Jun 10, 2021

markusicu commented Jun 10, 2021

markusicu commented Jun 10, 2021

markusicu commented Jun 17, 2021

ljharb commented Jun 17, 2021

markusicu commented Jun 17, 2021

mathiasbynens commented Jun 25, 2021

mathiasbynens commented Sep 1, 2021

sffc commented Sep 1, 2021

markusicu commented Sep 1, 2021

mathiasbynens commented Sep 3, 2021

Flag implication: `v` implies `u`? #23

Flag implication: `v` implies `u`? #23

mathiasbynens commented May 26, 2021 •

edited

Loading