Commit 673b43e

Update to { codePoint, position } form.
Fixes #1
1 parent 16f72fd commit 673b43e

3 files changed

+70
-49
lines changed

README.md

Lines changed: 47 additions & 38 deletions
@@ -13,15 +13,15 @@ to be able to tokenise a string into separate code points before handling them w
1313

1414
Currently language APIs provide two ways to access entire code points:
1515

16-
1. `codePointAt` allows to retrieve a code point at a known position. The issue is that position is usually unknown in advance if you're just iterating over the string, and you need to manually
17-
calculate it on each iteration with a manual `for(;;)` loop and a magically looking expression like
18-
`pos += currentCodePoint <= 0xFFFF ? 1 : 2`.
19-
1. `String.prototype[Symbol.iterator]` which allows a hassle-free iteration over string codepoints,
20-
but yields their string values, which are inefficient to work with in performance-critical lexers.
16+
1. `codePointAt` retrieves a code point at a known position. The issue is that the position is usually unknown in advance if you're just iterating over the string, and you need to manually
17+
calculate it on each iteration with a manual `for(;;)` loop and a magic-looking expression like
18+
`pos += currentCodePoint <= 0xFFFF ? 1 : 2`.
19+
1. `String.prototype[Symbol.iterator]` allows hassle-free iteration over string code points,
20+
but yields their string values, which are inefficient to work with in performance-critical lexers and still lack position information.
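
For illustration, the two existing approaches can be sketched side by side. The function names `viaCodePointAt` and `viaStringIterator` are hypothetical and exist only for this example; both rely solely on the standard `codePointAt` and string iterator.

```javascript
// Approach 1: codePointAt with manual position tracking.
function viaCodePointAt(input) {
  const result = [];
  for (let pos = 0; pos < input.length; ) {
    const currentCodePoint = input.codePointAt(pos);
    result.push(currentCodePoint);
    // Code points above U+FFFF occupy two UTF-16 code units.
    pos += currentCodePoint <= 0xFFFF ? 1 : 2;
  }
  return result;
}

// Approach 2: the string iterator yields one-code-point strings,
// so each value has to be converted back to a number and its
// position in the input is not available.
function viaStringIterator(input) {
  const result = [];
  for (const ch of input) {
    result.push(ch.codePointAt(0));
  }
  return result;
}
```

Both functions return the same list of numeric code points; the first needs the manual `pos` arithmetic, and the second allocates an intermediate string per code point.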
2121

2222
## Proposed solution
2323

24-
We propose the addition of a `codePoints()` method functionally similar to the `[@@iterator]`, but yielding numerical values of code points instead of string ones, this way combining the benefits of both approaches presented above while avoiding the related pitfalls in consumer code.
24+
We propose adding a `codePoints()` method functionally similar to `[@@iterator]`, but yielding positions and numeric values of code points instead of just string values. This combines the benefits of both approaches presented above while avoiding the related pitfalls in consumer code.
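
As a rough sketch of the intended behaviour (not the specification text), the proposed method could be approximated in user code with a generator. The standalone function name `codePointsOf` is an assumption made for this example so that `String.prototype` is left untouched; the `{ codePoint, position }` shape follows the form described above.

```javascript
// Hypothetical stand-in for the proposed String.prototype.codePoints().
function* codePointsOf(input) {
  let position = 0;
  while (position < input.length) {
    const codePoint = input.codePointAt(position);
    yield { codePoint, position };
    // Advance by one UTF-16 code unit, or two for astral code points.
    position += codePoint <= 0xFFFF ? 1 : 2;
  }
}

// Usage: positions are UTF-16 code unit offsets into the original string.
for (const { codePoint, position } of codePointsOf('a\u{1F600}b')) {
  console.log(position, codePoint.toString(16));
}
```

Positions here are indices into the string's UTF-16 code units, so they can be passed directly to `slice` on the original input.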
2525

2626
## Naming
2727

@@ -36,11 +36,11 @@ function isIdent(input) {
3636
let codePoints = input.codePoints();
3737
let first = codePoints.next();
3838

39-
if (first.done || !isIdentifierStart(first.value)) {
39+
if (first.done || !isIdentifierStart(first.value.codePoint)) {
4040
return false;
4141
}
4242

43-
for (let cp of codePoints) {
43+
for (let { codePoint: cp } of codePoints) {
4444
if (!isIdentifierContinue(cp)) {
4545
return false;
4646
}
@@ -50,41 +50,54 @@ function isIdent(input) {
5050
}
5151
```
5252

53-
### Tokenise a string with a state machine
53+
### Full-blown tokeniser
5454

5555
```javascript
5656
function toDigit(cp) {
5757
return cp - /* '0' */ 48;
5858
}
5959

60-
function *tokenise(input) {
61-
let token = {};
62-
63-
for (let cp of input) {
64-
let pos = /* see open question #1, we still need to know a pos somehow */;
65-
66-
if (token.type === 'Identifier') {
67-
if (isIdentifierContinue(cp)) {
68-
continue;
69-
}
70-
token.end = pos;
71-
token.name = input.slice(token.start, token.end);
72-
yield token;
73-
} else if (token.type === 'Number') {
74-
if (isDigit(cp)) {
75-
token.value = token.value * 10 + toDigit(cp);
76-
continue;
77-
}
78-
token.end = pos;
79-
yield token;
60+
// Generic helper
61+
class LookaheadIterator {
62+
constructor(inner) {
63+
this[Symbol.iterator] = this;
64+
this.inner = inner;
65+
this.next();
66+
}
67+
68+
next() {
69+
let next = this.lookahead;
70+
this.lookahead = this.inner.next();
71+
return next;
72+
}
73+
74+
skipWhile(cond) {
75+
while (!this.lookahead.done && cond(this.lookahead.value)) {
76+
this.next();
8077
}
78+
return this.lookahead;
79+
}
80+
}
8181

82-
if (isIdentifierStart(cp)) {
83-
token = { type: 'Identifier', start: pos };
84-
} else if (isDigit(cp)) {
85-
token = { type: 'Number', start: pos, value: toDigit(cp) };
82+
// Main tokeniser.
83+
function* tokenise(input) {
84+
let iter = new LookaheadIterator(input.codePoints());
85+
86+
for (let { position: start, codePoint } of iter) {
87+
if (isIdentifierStart(codePoint)) {
88+
yield {
89+
type: 'Identifier',
90+
start,
91+
end: iter.skipWhile(item => isIdentifierContinue(item.codePoint)).value?.position ?? input.length
92+
};
93+
} else if (isDigit(codePoint)) {
94+
yield {
95+
type: 'Number',
96+
start,
97+
end: iter.skipWhile(item => isDigit(item.codePoint)).value?.position ?? input.length
98+
};
8699
} else {
87-
throw new SyntaxError(`Expected an identifier or digit at ${tokenStart}`);
100+
throw new SyntaxError(`Expected an identifier or digit at ${start}`);
88101
}
89102
}
90103
}
@@ -93,7 +106,3 @@ function *tokenise(input) {
93106
## Specification
94107

95108
You can view the rendered spec [here](https://rreverser.github.io/string-prototype-codepoints/).
96-
97-
## Open questions
98-
99-
1. [Should the API yield `[position, codePoint]` pairs like `entries` API of standard collections?](https://github.com/RReverser/string-prototype-codepoints/issues/1)
