WG21 approved delimited escape sequences and named escape
sequences.
Adjust the extension warnings accordingly, and update
the release notes.
Reviewed By: aaron.ballman
Differential Revision: https://reviews.llvm.org/D129664
Implements [[ https://wg21.link/p2071r1 | P2071 Named Universal Character Escapes ]] - as an extension in all language mode, the patch not warn in c++23 mode will be done later once this paper is plenary approved (in July).
We add
* A code generator that transforms `UnicodeData.txt` and `NameAliases.txt` to a space efficient data structure that can be queried in `O(NameLength)`
* A set of functions in `Unicode.h` to query that data, including
* A function to find an exact match of a given Unicode character name
* A function to perform a loose (ignoring case, space, underscore, medial hyphen) matching
* A function returning the best matching codepoint for a given string per edit distance
* Support of `\N{}` escape sequences in String and character Literals, with loose and typos diagnostics/fixits
* Support of `\N{}` as UCN with loose matching diagnostics/fixits.
Loose matching is considered an error to match closely the semantics of P2071.
The generated data contributes to 280kB of data to the binaries.
`UnicodeData.txt` and `NameAliases.txt` are not committed to the repository in this patch, and regenerating the data is a manual process.
Reviewed By: tahonermann
Differential Revision: https://reviews.llvm.org/D123064
\x{XXXX} \u{XXXX} and \o{OOOO} are accepted in all languages mode
in characters and string literals.
This is a feature proposed for both C++ (P2290R1) and C (N2785). The
papers have been seen by both committees but are not yet adopted into
either standard. However, they do have support from both committees.
Summary:
Previously, the -fdollars-in-identifiers flag allows the '$' symbol to be used
in an identifier but the universal character name equivalent '\u0024' is not
allowed.
This patch changes this, so that \u0024 is valid in identifiers.
Reviewers: rsmith, jordan_rose
Reviewed By: rsmith
Subscribers: dexonsmith, simoncook, cfe-commits
Tags: #clang
Differential Revision: https://reviews.llvm.org/D71758
This is a missing piece for C99 conformance.
This patch handles UCNs by adding a '\\' case to LexTokenInternal and
LexIdentifier -- if we see a backslash, we tentatively try to read in a UCN.
If the UCN is not syntactically well-formed, we fall back to the old
treatment: a backslash followed by an identifier beginning with 'u' (or 'U').
Because the spelling of an identifier with UCNs still has the UCN in it, we
need to convert that to UTF-8 in Preprocessor::LookUpIdentifierInfo.
Of course, valid code that does *not* use UCNs will see only a very minimal
performance hit (checks after each identifier for non-ASCII characters,
checks when converting raw_identifiers to identifiers that they do not
contain UCNs, and checks when getting the spelling of an identifier that it
does not contain a UCN).
This patch also adds basic support for actual UTF-8 in the source. This is
treated almost exactly the same as UCNs except that we consider stray
Unicode characters to be mistakes and offer a fixit to remove them.
llvm-svn: 173369