Skip to content

improve number detection #149

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wkeese opened this issue Oct 5, 2023 · 7 comments
Closed

improve number detection #149

wkeese opened this issue Oct 5, 2023 · 7 comments
Labels
no-stale Prevent making as stale released

Comments

@wkeese
Copy link
Contributor

wkeese commented Oct 5, 2023

The current regexp for matching numeric literals does not account for:

  • An initial (optional) +/- sign used to indicate negative or positive numbers, ex: -3 or +5. Admittedly it's hard or impossible via regexps to reliably differentiate between a negative number vs. a subtraction, ex AGE-5 vs AGE * -5
  • Exponential notation, ex: -1.2E-3

See https://dev.mysql.com/doc/refman/8.0/en/number-literals.html for the syntax (or at least the syntax for MYSQL).

@scriptsbot

This comment was marked as off-topic.

@wkeese

This comment was marked as off-topic.

@scriptsbot scriptsbot removed the Stale label Oct 20, 2023
@scriptcoded scriptcoded added the no-stale Prevent making as stale label Oct 24, 2023
@scriptcoded
Copy link
Owner

As for exponential notation I don't think that it would be too hard to match. It has a pretty well defined and unique structure.

As for normal signed numbers I think it could be done only using regex, but feels like something that could easily spawn a lot of edge cases. As a proof of concept I threw together this which matches signed numbers only. It has issues with queries that start with a signed number and defines a signed number as a +/- followed by a number and that is not preceded by another number. https://regex101.com/r/bs2Gvb/1

I have briefly been thinking about switching to using something like Ohm for parsing, but it would add to the bundle size, add external dependencies and increase complexity. However if that were be to be implemented this issue would be a breeze to implement.

@wkeese
Copy link
Contributor Author

wkeese commented Oct 25, 2023

As for exponential notation I don't think that it would be too hard to match. It has a pretty well defined and unique structure.

Agreed.

As for normal signed numbers I think it could be done only using regex, but feels like something that could easily spawn a lot of edge cases. As a proof of concept I threw together this which matches signed numbers only. It has issues with queries that start with a signed number and defines a signed number as a +/- followed by a number and that is not preceded by another number. https://regex101.com/r/bs2Gvb/1

Yes, agreed. I'm not sure it's possible to solve that problem without using a parser, as I was trying to show in my example AGE-5 vs AGE * -5. But I think a safe enough heuristic is to just assume that a space after a + or - means that it's a binary operator, whereas a + or - followed directly by a number implies that it's a +/- sign.

In other words, I was assuming a regexp like /(?<number>[+-]?\d+(?:\.\d+)?(E[+-]?\d+)?)/ (https://regex101.com/r/nQSDHC/2).

I have briefly been thinking about switching to using something like Ohm for parsing, but it would add to the bundle size, add external dependencies and increase complexity. However if that were be to be implemented this issue would be a breeze to implement.

Agreed.

You could also use a library like https://pegjs.org/, which generates pure javascript, and just release that generated javascript in the bundle. In that way, the external dependency is only at build time, not downloaded to the browser.

But both those approaches would still greatly bloat the code downloaded to the browser.

I think the regexp approach that you have is the best compromise, since a syntax highlighting error is not the end of the world.

@scriptcoded
Copy link
Owner

Same comment as other issues but haven't had a lot of time to work on this library. Next week is vacation so fingers crossed I get some time left over!

wkeese added a commit to wkeese/sql-highlight that referenced this issue Apr 2, 2024
wkeese added a commit to wkeese/sql-highlight that referenced this issue Apr 2, 2024
scriptcoded pushed a commit that referenced this issue Apr 3, 2024
@scriptcoded
Copy link
Owner

scriptcoded commented Apr 3, 2024

Fixed in #192. Big thanks to @wkeese! 🥳

scriptcoded pushed a commit that referenced this issue Jun 23, 2024
scriptcoded pushed a commit that referenced this issue Jun 23, 2024
scriptcoded pushed a commit that referenced this issue Jun 23, 2024
scriptcoded pushed a commit that referenced this issue Jul 2, 2024
scriptsbot added a commit that referenced this issue Jul 2, 2024
# [5.0.0](v4.4.2...v5.0.0) (2024-07-02)

* chore!: add support for Node 22 ([9478bf1](9478bf1))

### Bug Fixes

* improve number detection ([02d459a](02d459a)), closes [#149](#149)
* improve operator detection ([183a4fb](183a4fb)), closes [#150](#150)
* typo in unknown segments ([70af287](70af287)), closes [#148](#148) [#178](#178) [#148](#148)

### Features

* add way to style identifiers ([25677d4](25677d4)), closes [#147](#147)
* release 5.1.0 ([cb0c0f1](cb0c0f1))

### BREAKING CHANGES

* The `default` segment has been split into `identifier` and `whitespace`
segments.  There's also a new `unknown` segment that will only show up for malformed
SQL such as an unclosed string.

However, the highlight() function works largely the same as before, both normal mode and HTML mode,
except for the bug fix to stop classifying identifiers as strings.  In other words, SQL like

select * from EMP where NAME="John Smith"

will get highlighted the same as before, i.e. no syntax highlighting for EMP or NAME.
* drop support for Node 14.
scriptsbot added a commit that referenced this issue Jul 2, 2024
# [5.0.0](v4.4.2...v5.0.0) (2024-07-02)

* chore!: add support for Node 22 ([9478bf1](9478bf1))

### Bug Fixes

* improve number detection ([02d459a](02d459a)), closes [#149](#149)
* improve operator detection ([183a4fb](183a4fb)), closes [#150](#150)
* typo in unknown segments ([70af287](70af287)), closes [#148](#148) [#178](#178) [#148](#148)

### Features

* add way to style identifiers ([25677d4](25677d4)), closes [#147](#147)

### BREAKING CHANGES

* The `default` segment has been split into `identifier` and `whitespace`
segments.  There's also a new `unknown` segment that will only show up for malformed
SQL such as an unclosed string.

However, the highlight() function works largely the same as before, both normal mode and HTML mode,
except for the bug fix to stop classifying identifiers as strings.  In other words, SQL like

select * from EMP where NAME="John Smith"

will get highlighted the same as before, i.e. no syntax highlighting for EMP or NAME.
* drop support for Node 14.
scriptsbot added a commit that referenced this issue Jul 2, 2024
# [6.0.0](v5.0.0...v6.0.0) (2024-07-02)

### Bug Fixes

* improve number detection ([02d459a](02d459a)), closes [#149](#149)
* improve operator detection ([183a4fb](183a4fb)), closes [#150](#150)
* typo in unknown segments ([70af287](70af287)), closes [#148](#148) [#178](#178) [#148](#148)

### Features

* add way to style identifiers ([25677d4](25677d4)), closes [#147](#147)
* release 5.1.0 ([3a58def](3a58def))

### BREAKING CHANGES

* The `default` segment has been split into `identifier` and `whitespace`
segments.  There's also a new `unknown` segment that will only show up for malformed
SQL such as an unclosed string.

However, the highlight() function works largely the same as before, both normal mode and HTML mode,
except for the bug fix to stop classifying identifiers as strings.  In other words, SQL like

select * from EMP where NAME="John Smith"

will get highlighted the same as before, i.e. no syntax highlighting for EMP or NAME.
@scriptsbot
Copy link
Collaborator

🎉 This issue has been resolved in version 6.0.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no-stale Prevent making as stale released
Projects
None yet
Development

No branches or pull requests

3 participants