improve number detection #149

wkeese · 2023-10-05T01:50:44Z

The current regexp for matching numeric literals does not account for:

- An initial (optional) +/- sign used to indicate negative or positive numbers, e.g. `-3` or `+5`. Admittedly it's hard or impossible via regexps to reliably differentiate between a negative number and a subtraction, e.g. `AGE-5` vs. `AGE * -5`.
- Exponential notation, e.g. `-1.2E-3`.

See https://dev.mysql.com/doc/refman/8.0/en/number-literals.html for the syntax (or at least the syntax for MySQL).

scriptcoded · 2023-10-25T09:16:33Z

As for exponential notation I don't think that it would be too hard to match. It has a pretty well defined and unique structure.

As for normal signed numbers, I think it could be done using only regex, but it feels like something that could easily spawn a lot of edge cases. As a proof of concept I threw together this regex, which matches signed numbers only: https://regex101.com/r/bs2Gvb/1. It has issues with queries that start with a signed number, and it defines a signed number as a +/- followed by a number and not preceded by another number.
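To illustrate the definition "a +/- followed by a number and not preceded by another number", here is a rough sketch. It is not the pattern behind the regex101 link; `signedNumberPattern` and `findSignedNumbers` are hypothetical names, and the lookbehind is just one assumed way to express "not preceded by a number".

```javascript
// Hypothetical pattern, NOT the one from the regex101 link: a + or - that is
// directly followed by digits and is not preceded by a digit (optionally
// separated from it by whitespace).
const signedNumberPattern = /(?<!\d\s*)[+-]\d+(?:\.\d+)?/g;

// Hypothetical helper: return every signed number found in the input.
function findSignedNumbers(sql) {
  return sql.match(signedNumberPattern) ?? [];
}

console.log(findSignedNumbers("AGE * -5"));    // sign after an operator
console.log(findSignedNumbers("WHERE X > +5")); // explicit plus sign
console.log(findSignedNumbers("5 -3"));         // preceded by a number: treated as subtraction
console.log(findSignedNumbers("1+2"));          // preceded by a number: treated as addition
```

An identifier that ends in a digit, such as `X2-3`, would also be read as a subtraction here, which is the kind of edge case this approach tends to accumulate.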

I have briefly been thinking about switching to something like Ohm for parsing, but it would add to the bundle size, add external dependencies and increase complexity. However, if that were to be implemented, this issue would be a breeze to solve.

wkeese · 2023-10-25T23:44:18Z

> As for exponential notation I don't think that it would be too hard to match. It has a pretty well defined and unique structure.

Agreed.

> As for normal signed numbers, I think it could be done using only regex, but it feels like something that could easily spawn a lot of edge cases. As a proof of concept I threw together this regex, which matches signed numbers only: https://regex101.com/r/bs2Gvb/1. It has issues with queries that start with a signed number, and it defines a signed number as a +/- followed by a number and not preceded by another number.

Yes, agreed. I'm not sure it's possible to solve that problem without using a parser, as I was trying to show in my example AGE-5 vs AGE * -5. But I think a safe enough heuristic is to just assume that a space after a + or - means that it's a binary operator, whereas a + or - followed directly by a number implies that it's a +/- sign.

In other words, I was assuming a regexp like /(?<number>[+-]?\d+(?:\.\d+)?(E[+-]?\d+)?)/ (https://regex101.com/r/nQSDHC/2).
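A small sketch of how that regexp behaves on the cases discussed in this thread. The helper name `findNumbers` and the added `g` flag are assumptions made for illustration, not the library's code.

```javascript
// The regexp from the comment above, with the `g` flag added (an assumption)
// so that every match in the input is returned.
const numberPattern = /(?<number>[+-]?\d+(?:\.\d+)?(E[+-]?\d+)?)/g;

// Hypothetical helper: list every numeric literal the pattern finds.
function findNumbers(sql) {
  return Array.from(sql.matchAll(numberPattern), (match) => match.groups.number);
}

console.log(findNumbers("AGE * -5"));    // sign directly before the digit: signed number
console.log(findNumbers("AGE - 5"));     // space after the minus: binary operator
console.log(findNumbers("AGE-5"));       // ambiguous case: still read as a signed number
console.log(findNumbers("X = -1.2E-3")); // exponential notation
```

The third call shows the accepted limitation of the heuristic: without a parser, `AGE-5` is read as `AGE` followed by the literal `-5`.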

> I have briefly been thinking about switching to something like Ohm for parsing, but it would add to the bundle size, add external dependencies and increase complexity. However, if that were to be implemented, this issue would be a breeze to solve.

Agreed.

You could also use a library like https://pegjs.org/, which generates pure JavaScript, and just ship that generated JavaScript in the bundle. That way, the external dependency exists only at build time and is not downloaded to the browser.

But both those approaches would still greatly bloat the code downloaded to the browser.

I think the regexp approach that you have is the best compromise, since a syntax highlighting error is not the end of the world.

scriptcoded · 2024-02-01T11:42:56Z

Same comment as on the other issues: I haven't had a lot of time to work on this library. Next week is vacation, so fingers crossed I get some time left over!

scriptcoded · 2024-04-03T19:09:43Z

Fixed in #192. Big thanks to @wkeese! 🥳

# [5.0.0](v4.4.2...v5.0.0) (2024-07-02)

* chore!: add support for Node 22 ([9478bf1](9478bf1))

### Bug Fixes

* improve number detection ([02d459a](02d459a)), closes [#149](#149)
* improve operator detection ([183a4fb](183a4fb)), closes [#150](#150)
* typo in unknown segments ([70af287](70af287)), closes [#148](#148) [#178](#178) [#148](#148)

### Features

* add way to style identifiers ([25677d4](25677d4)), closes [#147](#147)
* release 5.1.0 ([cb0c0f1](cb0c0f1))

### BREAKING CHANGES

* The `default` segment has been split into `identifier` and `whitespace` segments. There's also a new `unknown` segment that will only show up for malformed SQL such as an unclosed string. However, the `highlight()` function works largely the same as before, both normal mode and HTML mode, except for the bug fix to stop classifying identifiers as strings. In other words, SQL like `select * from EMP where NAME="John Smith"` will get highlighted the same as before, i.e. no syntax highlighting for `EMP` or `NAME`.
* drop support for Node 14.

# [6.0.0](v5.0.0...v6.0.0) (2024-07-02)

### Bug Fixes

* improve number detection ([02d459a](02d459a)), closes [#149](#149)
* improve operator detection ([183a4fb](183a4fb)), closes [#150](#150)
* typo in unknown segments ([70af287](70af287)), closes [#148](#148) [#178](#178) [#148](#148)

### Features

* add way to style identifiers ([25677d4](25677d4)), closes [#147](#147)
* release 5.1.0 ([3a58def](3a58def))

### BREAKING CHANGES

* The `default` segment has been split into `identifier` and `whitespace` segments. There's also a new `unknown` segment that will only show up for malformed SQL such as an unclosed string. However, the `highlight()` function works largely the same as before, both normal mode and HTML mode, except for the bug fix to stop classifying identifiers as strings. In other words, SQL like `select * from EMP where NAME="John Smith"` will get highlighted the same as before, i.e. no syntax highlighting for `EMP` or `NAME`.

scriptsbot · 2024-07-02T21:58:58Z

🎉 This issue has been resolved in version 6.0.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

wkeese mentioned this issue Oct 5, 2023

improve operator detection #150

Closed

scriptsbot added the Stale label Oct 19, 2023

scriptsbot removed the Stale label Oct 20, 2023

scriptcoded added the no-stale Prevent making as stale label Oct 24, 2023

wkeese added a commit to wkeese/sql-highlight that referenced this issue Apr 2, 2024

WIP fix: improve number detection

33fef1b

Fixes scriptcoded#149

wkeese added a commit to wkeese/sql-highlight that referenced this issue Apr 2, 2024

fix: improve number detection

db1c7a5

Fixes scriptcoded#149

wkeese mentioned this issue Apr 2, 2024

fix: improve number detection #192

Merged

scriptcoded pushed a commit that referenced this issue Apr 3, 2024

fix: improve number detection

765348e

Fixes #149

scriptcoded closed this as completed Apr 3, 2024

scriptcoded pushed a commit that referenced this issue Jun 23, 2024

fix: improve number detection

e4c735a

Fixes #149

scriptcoded pushed a commit that referenced this issue Jun 23, 2024

fix: improve number detection

9113ef6

Fixes #149

scriptcoded pushed a commit that referenced this issue Jun 23, 2024

fix: improve number detection

dcc963e

Fixes #149

scriptcoded pushed a commit that referenced this issue Jul 2, 2024

fix: improve number detection

02d459a

Fixes #149

This was referenced Jul 2, 2024

The automated release is failing 🚨 #219

Closed

The automated release is failing 🚨 #220

Closed

scriptsbot added the released label Jul 2, 2024
