Implement the new Unicode algorithm for preventing bidi-based source code spoofing #113363
Labels
A-lints
Area: Lints (warnings about flaws in source code) such as unused_mut.
A-rustfmt
Area: Rustfmt
A-Unicode
Area: Unicode
T-compiler
Relevant to the compiler team, which will review and decide on the PR/issue.
Rustc already mitigates against the potential spoofing effects (CVE-2021-42574) of bidirectional format characters.
What we currently do is a bit of a blunt hammer: we lint on any bidirectional format character except LRM and RLM, which is both too aggressive (it doesn't allow for "isolated" pairs of format characters) and not aggressive enough (RLM can have negative effects, and even plain visible characters can cause problems).
Since that CVE was released, Unicode has been working on a specification, (Draft) Unicode Technical Standard #55 Unicode Source Code Handling to help programming language designers implement various mitigations to such issues. The goal of this specification is to protect source code from spoofing while retaining the ability to use bidirectional text in source. It contains advice for multiple levels of the stack: compiler warning engines, code formatters, and source code editors.
Relevant to rustc is the section on Directional Formatting Characters, which refers to the algorithm in Conversion to Plain Text. That algorithm is primarily written for the purposes of a formatting tool like rustfmt, and I've filed a rustfmt issue to implement it there1, however it can also be tweaked to be a good lint too.
In essence, it will warn about format characters nestled amongst whitespace (except for LRM2), and also warn about mismatched stateful format characters in contexts where it matters (e.g. a block comment where there is other code on the same line after it). The difference from the current algorithm is that it would ignore these format characters when they don't have any effect that leaks outside of the "atom" (string, comment, etc), and furthermore lint on RLMs found in whitespace.
An important thing to highlight is that this change would allow the proposed rustfmt pass to generate warning-free code: Under the current regime, this pass would generate warning-free code in most cases provided you do not already have warnings of this form, but there is the FSI/PDI insertion section that is rare but will turn warning-free code into code that warns.
Here is the tweaked algorithm such that it is suitable for linting.
However, a thing that will need to be decided is this: Do we want to lint cases where visible RTL characters change the rendered direction of the text?
For example, the code
<arabic> = 2 - x;
will render aswhich can be confusing. This can be mitigated by inserting an LRM3 after the arabic identifier; which is what the algorithm recommends formatters like rustfmt do (and is what I am proposing rustfmt support).
We could in our lint require users do the same: any visible RTL code points that affect the directionality of surrounding text are required to have that effect be mitigated, usually with an LRM. This would roughly amount to requiring people run the rustfmt pass though we can have lint suggestions also do the same for them. This is an expansion of the current set of things we lint on: currently we only lint on invisible control characters and the effects of visible RTL characters are basically treated as okay since you can see the RTL character and know there is something fishy.
I generally lean on the side of not linting here. The algorithm above becomes much simpler in such a case, basically all the stuff producing and consuming
Needs_LRM
goes away, I can write up a modified algorithm if necessary.Happy to help mentor the implementation and provide additional background on this algorithm and the bidi algorithm/properties.
Thoughts? @estebank @pietroalbini
Footnotes
Likely initially as a config option; I do not want to make decisions on whether it ought to be the default just yet. ↩
LRM is a format character that, in a left-to-right context, can only fix problems, not cause them. It is the equivalent of a latin letter, except invisible. ↩
Treated as a space by Rust ↩
The text was updated successfully, but these errors were encountered: