| 1 | +- Feature Name: stable, it only restricts the language |
| 2 | +- Start Date: 2015-02-17 |
| 3 | +- RFC PR: (leave this empty) |
| 4 | +- Rust Issue: (leave this empty) |
| 5 | + |
| 6 | +# Summary |
| 7 | + |
| 8 | +Lex binary and octal literals as if they were decimal. |
| 9 | + |
| 10 | +# Motivation |
| 11 | + |
| 12 | +Lexing all digits (even ones not valid in the given base) allows for |
| 13 | +better error messages, future-proofing (this is more conservative |
| 14 | +than the current approach), and less confusion, with little downside. |
| 15 | + |
| 16 | +Currently, the lexer stops lexing binary and octal literals (`0b10` and |
| 17 | +`0o12345670`) as soon as it sees an invalid digit (2-9 or 8-9 |
| 18 | +respectively), and immediately starts lexing a new token, |
| 19 | +e.g. `0b0123` is two tokens, `0b01` and `23`. Writing such a thing in |
| 20 | +normal code gives a strange error message: |
| 21 | + |
| 22 | +``` |
| 23 | +<anon>:2:9: 2:11 error: expected one of `.`, `;`, `}`, or an operator, found `23` |
| 24 | +<anon>:2 0b0123 |
| 25 | + ^~ |
| 26 | +``` |
| 27 | + |
| 28 | +However, it is valid to write such a thing in a macro (e.g. using the |
| 29 | +`tt` non-terminal), and thus lexing the adjacent digits as two tokens |
| 30 | +can lead to unexpected behaviour. |
| 31 | + |
| 32 | +```rust |
| 33 | +macro_rules! expr { ($e: expr) => { $e } } |
| 34 | + |
| 35 | +macro_rules! add { |
| 36 | + ($($token: tt)*) => { |
| 37 | + 0 $(+ expr!($token))* |
| 38 | + } |
| 39 | +} |
| 40 | +fn main() { |
| 41 | + println!("{}", add!(0b0123)); |
| 42 | +} |
| 43 | +``` |
| 44 | + |
| 45 | +prints `24` (`add` expands to `0 + 0b01 + 23`). |
| 46 | + |
| 47 | +It would be nicer for both cases to print an error like: |
| 48 | + |
| 49 | +``` |
| 50 | +error: found invalid digit `2` in binary literal |
| 51 | +0b0123 |
| 52 | + ^ |
| 53 | +``` |
| 54 | + |
| 55 | +(The non-macro case could be handled by detecting this pattern in the |
| 56 | +lexer and special casing the message, but this does not handle the |
| 57 | +macro case.) |
| 58 | + |
| 59 | +Code that wants two tokens can opt in to it by `0b01 23`, for |
| 60 | +example. This is easy to write, and expresses the intent more clearly |
| 61 | +anyway. |
| 62 | + |
| 63 | +# Detailed design |
| 64 | + |
| 65 | +The grammar that the lexer uses becomes |
| 66 | + |
| 67 | +``` |
| 68 | +(0b[0-9]+ | 0o[0-9]+ | [0-9]+ | 0x[0-9a-fA-F]+) suffix |
| 69 | +``` |
| 70 | + |
| 71 | +instead of just `[01]` and `[0-7]` for the first two, respectively. |
| 72 | + |
| 73 | +However, it is always an error (in the lexer) to have invalid digits |
| 74 | +in a numeric literal beginning with `0b` or `0o`. In particular, even |
| 75 | +a macro invocation like |
| 76 | + |
| 77 | +```rust |
| 78 | +macro_rules! ignore { ($($_t: tt)*) => { {} } } |
| 79 | + |
| 80 | +ignore!(0b0123) |
| 81 | +``` |
| 82 | + |
| 83 | +is an error even though it doesn't use the tokens. |
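
As a rough illustration of the rule above, the following is a minimal sketch of how a lexer might consume every decimal digit after `0b` or `0o` and then report the first digit that is invalid for the base. It is not the compiler's implementation: the names `InvalidDigit` and `scan_small_base_literal` are hypothetical, and the sketch assumes ASCII input while ignoring suffixes, `_` separators, and the case of a prefix with no digits.

```rust
/// Hypothetical error type: the first digit that is not valid in the
/// literal's base, with its byte offset into the input.
#[derive(Debug, PartialEq)]
pub struct InvalidDigit {
    pub digit: char,
    pub offset: usize,
    pub base: u32,
}

/// Scans a `0b[0-9]+` or `0o[0-9]+` literal at the start of `src`.
///
/// Returns `None` if `src` does not start with such a literal,
/// `Some(Ok(len))` with the byte length of the single token if every
/// digit is valid for the base, and `Some(Err(..))` otherwise.
pub fn scan_small_base_literal(src: &str) -> Option<Result<usize, InvalidDigit>> {
    let base = if src.starts_with("0b") {
        2
    } else if src.starts_with("0o") {
        8
    } else {
        return None;
    };

    let mut end = 2; // byte length consumed so far (the prefix)
    let mut first_invalid = None;

    // Consume *all* decimal digits, not only those valid in `base`,
    // so that `0b0123` stays one token instead of `0b01` and `23`.
    for (i, c) in src.char_indices().skip(2) {
        if !c.is_ascii_digit() {
            break;
        }
        if first_invalid.is_none() && c.to_digit(base).is_none() {
            first_invalid = Some(InvalidDigit { digit: c, offset: i, base });
        }
        end = i + 1;
    }

    if end == 2 {
        // A bare prefix with no digits is outside this sketch.
        return None;
    }

    Some(match first_invalid {
        Some(err) => Err(err),
        None => Ok(end),
    })
}
```

Under these assumptions, `scan_small_base_literal("0b0123")` reports the digit `2` at byte offset 4 rather than ending the token after `0b01`, while `scan_small_base_literal("0b01 23")` accepts the four-byte token `0b01` and leaves `23` to be lexed separately.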
| 84 | + |
| 85 | + |
| 86 | +# Drawbacks |
| 87 | + |
| 88 | +This adds a slightly peculiar special case that is somewhat unique to |
| 89 | +Rust. On the other hand, most languages do not expose the lexical |
| 90 | +grammar so directly, and so have more freedom in this respect. That |
| 91 | +is, in many languages one cannot tell whether `0b1234` is one or |
| 92 | +two tokens: it is *always* an error either way. |
| 93 | + |
| 94 | + |
| 95 | +# Alternatives |
| 96 | + |
| 97 | +Don't do it, obviously. |
| 98 | + |
| 99 | +Consider `0b123` to just be `0b1` with a suffix of `23`, which is |
| 100 | +an error or not depending on whether a suffix of `23` is valid. Handling this |
| 101 | +uniformly would require `"foo"123` and `'a'123` also being lexed as a |
| 102 | +single token. (Which may be a good idea anyway.) |
| 103 | + |
| 104 | +# Unresolved questions |
| 105 | + |
| 106 | +None. |