Commit 51dd32f

Merge pull request #879 from huonw/small-bases
RFC to lex binary and octal literals more eagerly.
2 parents df72584 + 3989c5a commit 51dd32f

1 file changed

+106
-0
lines changed

text/0000-small-base-lexing.md

Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
- Feature Name: stable, it only restricts the language
- Start Date: 2015-02-17
- RFC PR: (leave this empty)
- Rust Issue: (leave this empty)

# Summary

Lex binary and octal literals as if they were decimal, so that a digit
that is invalid for the base stays in the same token and is reported as
an error.

# Motivation

Lexing all digits (even ones not valid in the given base) allows for
improved error messages and future-proofing (this is more conservative
than the current approach), and causes less confusion, with little
downside.

Currently, the lexer stops lexing binary and octal literals (`0b10` and
`0o12345670`) as soon as it sees an invalid digit (2-9 or 8-9
respectively), and immediately starts lexing a new token;
e.g. `0b0123` is two tokens, `0b01` and `23`. Writing such a thing in
normal code gives a strange error message:

```rust
<anon>:2:9: 2:11 error: expected one of `.`, `;`, `}`, or an operator, found `23`
<anon>:2     0b0123
                 ^~
```

However, it is valid to write such a thing in a macro (e.g. using the
`tt` non-terminal), and thus lexing the adjacent digits as two tokens
can lead to unexpected behaviour.

```rust
macro_rules! expr { ($e: expr) => { $e } }

macro_rules! add {
    ($($token: tt)*) => {
        0 $(+ expr!($token))*
    }
}
fn main() {
    println!("{}", add!(0b0123));
}
```

prints `24` (`add` expands to `0 + 0b01 + 23`).

It would be nicer for both cases to print an error like:

```rust
error: found invalid digit `2` in binary literal
0b0123
    ^
```

(The non-macro case could be handled by detecting this pattern in the
lexer and special-casing the message, but this does not handle the
macro case.)

Code that wants two tokens can opt in to that by writing `0b01 23`, for
example. This is easy to write, and expresses the intent more clearly
anyway.
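
As a rough illustration of the opt-in, the sketch below reuses the `expr`
and `add` macros from the example above and passes the two tokens
separated by whitespace. The helper function `two_token_sum` is a name
introduced here purely for illustration; it is not part of this RFC.

```rust
macro_rules! expr { ($e: expr) => { $e } }

macro_rules! add {
    ($($token: tt)*) => {
        0 $(+ expr!($token))*
    }
}

// Hypothetical helper, not part of the RFC: the space makes `0b01` and `23`
// two separate tokens, so the macro expands to `0 + 0b01 + 23`.
fn two_token_sum() -> i32 {
    add!(0b01 23)
}
```

Calling `two_token_sum()` yields `24` (0 + 1 + 23), which is the result
the unseparated `0b0123` produces by accident under the current lexer.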

# Detailed design

The grammar that the lexer uses becomes

```
(0b[0-9]+ | 0o[0-9]+ | [0-9]+ | 0x[0-9a-fA-F]+) suffix
```

instead of just `[01]` and `[0-7]` for the first two, respectively.

However, it is always an error (in the lexer) to have invalid digits
in a numeric literal beginning with `0b` or `0o`. In particular, even
a macro invocation like

```rust
macro_rules! ignore { ($($_t: tt)*) => { {} } }

ignore!(0b0123)
```

is an error even though it doesn't use the tokens.
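
The rule in this section could be sketched roughly as follows. This is a
minimal standalone illustration, not the compiler's lexer: the names
`LexError` and `lex_int_literal` are hypothetical, and the sketch ignores
suffixes and `_` separators. It consumes every decimal digit after a `0b`
or `0o` prefix as part of one token, and reports the first digit that is
invalid for the base.

```rust
/// Errors this sketch can report (hypothetical type, not rustc's).
#[derive(Debug, PartialEq)]
enum LexError {
    /// A digit that is not valid in the literal's base, with its byte offset.
    InvalidDigit { digit: char, base: u32, offset: usize },
    /// No digits at all, e.g. a bare `0b`.
    NoDigits,
}

/// Lexes the digits of one integer literal at the start of `src`.
/// On success, returns the literal's length in bytes and its base.
fn lex_int_literal(src: &str) -> Result<(usize, u32), LexError> {
    let (base, prefix_len) = if src.starts_with("0b") {
        (2, 2)
    } else if src.starts_with("0o") {
        (8, 2)
    } else if src.starts_with("0x") {
        (16, 2)
    } else {
        (10, 0)
    };
    // Binary and octal literals are scanned with the full decimal digit set.
    let scan_radix = if base == 16 { 16 } else { 10 };

    let mut len = prefix_len;
    let mut first_invalid = None;
    for (i, c) in src[prefix_len..].char_indices() {
        if !c.is_digit(scan_radix) {
            break; // end of this token
        }
        // The digit stays in the token, but remember the first bad one.
        if first_invalid.is_none() && !c.is_digit(base) {
            first_invalid = Some(LexError::InvalidDigit {
                digit: c,
                base,
                offset: prefix_len + i,
            });
        }
        len = prefix_len + i + c.len_utf8();
    }

    if len == prefix_len {
        return Err(LexError::NoDigits);
    }
    match first_invalid {
        Some(err) => Err(err),
        None => Ok((len, base)),
    }
}
```

Under these assumptions, `lex_int_literal("0b0123")` reports the invalid
digit `2` at byte offset 4 rather than returning a four-byte token and
leaving `23` behind, while `lex_int_literal("0b01 23")` stops at the
space and returns the four-byte binary literal.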


# Drawbacks

This adds a slightly peculiar special case that is fairly specific to
Rust. On the other hand, most languages do not expose the lexical
grammar so directly, and so have more freedom in this respect. That
is, in many languages it is impossible to tell whether `0b1234` is one
or two tokens: it is *always* an error either way.


# Alternatives

Don't do it, obviously.

Consider `0b123` to just be `0b1` with a suffix of `23`, which is an
error or not depending on whether a suffix of `23` is valid. Handling
this uniformly would require `"foo"123` and `'a'123` to also be lexed as
a single token. (Which may be a good idea anyway.)

# Unresolved questions

None.
