llvm-project

Commit Graph

Author	SHA1	Message	Date
Haojian Wu	edb8fb2659	[pseudo] Fix HeadsPartition is not initialized correctly. The bug was that if we recover from the token 0, we will make the Heads empty (Line646), which results no recovery being applied. Reviewed By: sammccall Differential Revision: https://reviews.llvm.org/D132388	2022-08-23 15:08:33 +02:00
Sam McCall	bd5cc6575b	[pseudo] Start rules are `_ := start-symbol EOF`, improve recovery. Previously we were calling glrRecover() ad-hoc at the end of input. Two main problems with this: - glrRecover() on two separate code paths is inelegant - We may have to recover several times in succession (e.g. to exit from nested scopes), so we need a loop at end-of-file Having an actual shift action for an EOF terminal allows us to handle both concerns in the main shift/recover/reduce loop. This revealed a recovery design bug where recovery could enter a loop by repeatedly choosing the same parent to identically recover from. Addressed this by allowing each node to be used as a recovery base once. Differential Revision: https://reviews.llvm.org/D130550	2022-08-19 16:49:37 +02:00
Sam McCall	2cc7463c85	[pseudo] Perform unconstrained reduction prior to recovery. Our GLR uses lookahead: only perform reductions that might be consumed by the shift immediately following. However when shift fails and so reduce is followed by recovery instead, this restriction is incorrect and leads to missing heads. In turn this means certain recovery strategies can't be made to work. e.g. ``` ns := NAMESPACE { namespace-body } [recover=Skip] ns-body := namespace_opt ``` When `namespace { namespace {` is parsed, we can recover the inner `ns` (using the `Skip` strategy to ignore the missing `}`). However this `namespace` will not be reduced to a `namespace-body` as EOF is not in the follow-set, and so we are unable to recover the outer `ns`. This patch fixes this by tracking which heads were produced by constrained reduce, and discarding and rebuilding them before performing recovery. This is a prerequisite for the `Skip` strategy mentioned above, though there are some other limitations we need to address too. Reviewed By: hokein Differential Revision: https://reviews.llvm.org/D130523	2022-08-19 15:07:36 +02:00
Sam McCall	07b7ff9838	[pseudo] Allow opaque nodes to represent terminals This allows incomplete code such as `namespace foo {` to be modeled as a normal sequence with the missing } represented by an empty opaque node. Differential Revision: https://reviews.llvm.org/D130551	2022-07-26 13:56:26 +02:00
Haojian Wu	2a88fb2ecb	[pseudo] Eliminate the dangling-else syntax ambiguity. - the grammar ambiguity is eliminated by a guard; - modify the guard function signatures, now all parameters are folded in to a single object, avoid a long parameter list (as we will add more parameters in the near future); Reviewed By: sammccall Differential Revision: https://reviews.llvm.org/D130160	2022-07-22 09:13:09 +02:00
Sam McCall	3132e9cd7c	[pseudo] Key guards by RuleID, add guards to literals (and 0). After this, NUMERIC_CONSTANT and strings should parse only one way. There are 8 types of literals, and 24 valid (literal, TokenKind) pairs. This means adding 8 new named guards (or 24, if we want to assert the token). It seems fairly clear to me at this point that the guard names are unneccesary indirection: the guards are in fact coupled to the rule signature. (Also add the zero guard I forgot in the previous patch.) Differential Revision: https://reviews.llvm.org/D130066	2022-07-21 22:42:31 +02:00
Sam McCall	7d8e2742d9	[pseudo] Define recovery strategy as grammar extension. Differential Revision: https://reviews.llvm.org/D129158	2022-07-06 15:03:38 +02:00
Sam McCall	3121167488	[pseudo] Add error-recovery framework & brace-based recovery The idea is: - a parse failure is detected when all heads die when trying to shift the next token - we can recover by choosing a nonterminal we're partway through parsing, and determining where it ends through nonlocal means (e.g. matching brackets) - we can find candidates by walking up the stack from the (ex-)heads - the token range is defined using heuristics attached to grammar rules - the unparsed region is represented in the forest by an Opaque node This patch has the core GLR functionality. It does not allow recovery heuristics to be attached as extensions to the grammar, but rather infers a brace-based heuristic. Expected followups: - make recovery heuristics grammar extensions (depends on D127448) - add recovery to our grammar for bracketed constructs and sequence nodes - change the structure of our augmented `_ := start` rules to eliminate some special-cases in glrParse. - (if I can work out how): avoid some spurious recovery cases described in comments (Previously mistakenly committed as `a0f4c10ae2`) Differential Revision: https://reviews.llvm.org/D128486	2022-07-05 20:49:41 +02:00
Haojian Wu	9ab67cc8bf	[pseudo] Implement guard extension. - Extend the GLR parser to allow conditional reduction based on the guard functions; - Implement two simple guards (contextual-override/final) for cxx.bnf; - layering: clangPseudoCXX depends on clangPseudo (as the guard function need to access the TokenStream); Differential Revision: https://reviews.llvm.org/D127448	2022-07-05 15:55:15 +02:00
Sam McCall	9fbf1107cc	[pseudo] Eliminate LRTable::Action. NFC The last remaining uses are in tests/test builders. Replace with a builder struct. Differential Revision: https://reviews.llvm.org/D129093	2022-07-05 14:35:41 +02:00
Haojian Wu	fe66aebd75	[pseudo] Define a clangPseudoCLI library. - define a common data structure Language which is a compiled result of the bnf grammar. It is defined in Language.h; - creates a clangPseudoCLI lib which defines a grammar commandline flag and expose a function to get the Language. It supports --grammar=cxx, --grammmar=/path/to/file.bnf; - use the clangPseudoCLI in clang-pseudo, fuzzer, and benchmark tools ( simplify the code and use the prebuilt cxx grammar); Split out from https://reviews.llvm.org/D127448. Differential Revision: https://reviews.llvm.org/D128679	2022-07-01 08:31:34 +02:00
Sam McCall	bc5e7ced1c	[pseudo] Fix bugs/inconsistencies in forest dump. - when printing a shared node for the second time, don't print its children (This keeps output proportional to the size of the structure) - when printing a shared node for the second time, print its type only, not rule (for consistency with above: don't dump details of nodes twice) - don't abbreviate shared nodes, to ensure we can prune the tree there Differential Revision: https://reviews.llvm.org/D128805	2022-06-29 22:56:26 +02:00
Sam McCall	743971faaf	Revert "[pseudo] Add error-recovery framework & brace-based recovery" This reverts commit `a0f4c10ae2`. This commit hadn't been reviewed yet, and was unintentionally included on another branch.	2022-06-28 21:11:09 +02:00
Sam McCall	a0f4c10ae2	[pseudo] Add error-recovery framework & brace-based recovery The idea is: - a parse failure is detected when all heads die when trying to shift the next token - we can recover by choosing a nonterminal we're partway through parsing, and determining where it ends through nonlocal means (e.g. matching brackets) - we can find candidates by walking up the stack from the (ex-)heads - the token range is defined using heuristics attached to grammar rules - the unparsed region is represented in the forest by an Opaque node This patch has the core GLR functionality. It does not allow recovery heuristics to be attached as extensions to the grammar, but rather infers a brace-based heuristic. Expected followups: - make recovery heuristics grammar extensions (depends on D127448) - add recover to our grammar for bracketed constructs and sequence nodes - change the structure of our augmented `_ := start` rules to eliminate some special-cases in glrParse. - (if I can work out how): avoid some spurious recovery cases described in comments - grammar changes to eliminate the hard distinction between init-list and designated-init-list shown in the recovery-init-list.cpp testcase Differential Revision: https://reviews.llvm.org/D128486	2022-06-28 21:08:43 +02:00
Sam McCall	3f028c02ba	[pseudo] Grammar::parseBNF returns Grammar not unique_ptr. NFC	2022-06-28 16:34:21 +02:00
Sam McCall	85eaecbe8e	[pseudo] Check follow-sets instead of tying reduce actions to lookahead tokens. Previously, the action table stores a reduce action for each lookahead token it should allow. These tokens are the followSet(action.rule.target). In practice, the follow sets are large, so we spend a bunch of time binary searching around all these essentially-duplicates to check whether our lookahead token is there. However the number of reduces for a given state is very small, so we're much better off linear scanning over them and performing a fast check for each. D128318 was an attempt at this, storing a bitmap for each reduce. However it's even more compact just to use the follow sets directly, as there are fewer nonterminals than (state, rule) pairs. It's also faster. This specialized approach means unbundling Reduce from other actions in LRTable, so it's no longer useful to support it in Action. I suspect Action will soon go away, as we store each kind of action separately. This improves glrParse speed by 42% (3.30 -> 4.69 MB/s). It also reduces LR table size by 59% (343 -> 142kB). Differential Revision: https://reviews.llvm.org/D128472	2022-06-28 00:36:16 +02:00
Sam McCall	b70ee9d984	Reland "[pseudo] Track heads as GSS nodes, rather than as "pending actions"." This reverts commit `2c80b53198`. Fixes LRTable::buildForTest to create states that are referenced but have no actions.	2022-06-23 18:21:44 +02:00
Sam McCall	2c80b53198	Revert "[pseudo] Track heads as GSS nodes, rather than as "pending actions"." This reverts commit `e3ec054dfd`. Tests fail in asserts mode: https://lab.llvm.org/buildbot/#/builders/109/builds/41217	2022-06-23 18:16:38 +02:00
Sam McCall	e3ec054dfd	[pseudo] Track heads as GSS nodes, rather than as "pending actions". IMO this model is simpler to understand (borrowed from the LR0 patch D127357). It also makes error recovery easier to implement, as we have a simple list of head nodes lying around to recover from when needed. (It's not quite as nice as LR0 in this respect though). It's slightly slower (2.24 -> 2.12 MB/S on my machine = 5%) but nothing close to as bad as LR0. However - I think we'd have to eat a litle performance loss otherwise to implement error recovery. - this frees up some complexity budget for optimizations like fastpath push/pop (this + fastpath is already faster than head) - I haven't changed the data structure here and it's now pretty dumb, we can make it faster Differential Revision: https://reviews.llvm.org/D128297	2022-06-23 17:26:42 +02:00
Haojian Wu	c70aeaad2b	[pseudo] Move grammar-related headers to a separate dir, NFC. We did that for .cpp, but forgot the headers. Differential Revision: https://reviews.llvm.org/D127388	2022-06-09 14:58:05 +02:00
Haojian Wu	7a05942dd0	[pseudo] Remove the explicit Accept actions. As pointed out in the previous review section, having a dedicated accept action doesn't seem to be necessary. This patch implements the the same behavior without accept acction, which will save some code complexity. Reviewed By: sammccall Differential Revision: https://reviews.llvm.org/D125677	2022-06-09 11:19:07 +02:00
Sam McCall	94b2ca18c1	[pseudo] GC GSS nodes, reuse them with a freelist Most GSS nodes have short effective lifetimes, keeping them around until the end of the parse is wasteful. Mark and sweep them every 20 tokens. When parsing clangd/AST.cpp, this reduces the GSS memory from 1MB to 20kB. We pay ~5% performance for this according to the glrParse benchmark. (Parsing more tokens between GCs doesn't seem to improve this further). Compared to the refcounting approach in https://reviews.llvm.org/D126337, this is simpler (at least the complexity is better isolated) and has >2x less overhead. It doesn't provide death handlers (for error-handling) but we have an alternative solution in mind. Differential Revision: https://reviews.llvm.org/D126723	2022-06-08 23:39:59 +02:00
Haojian Wu	1a65c491be	[pseudo] Support parsing variant target symbols. With this patch, we're able to parse smaller chunks of C++ code (statement, declaration), rather than translation-unit. The start symbol is listed in the grammar in a form of `_ := statement`, each start symbol has a dedicated state (`_ := • statement`). We create and track all these separate states in the LRTable. When we start parsing, we lookup the corresponding state to start the parser. LR pasing table changes with this patch: - number of states: 1467 -> 1471 - number of actions: 82891 -> 83578 - size of the table (bytes): 334248 -> 336996 Differential Revision: https://reviews.llvm.org/D125006	2022-05-16 10:38:16 +02:00
Haojian Wu	ed1b32791d	[pseudo] Print the GSS::Node details when the unittest fails, NFC.	2022-05-03 22:06:10 +02:00
Haojian Wu	9f38da258e	[pseudo] Implement the GLR parsing algorithm. This patch implements a standard GLR parsing algorithm, the core piece of the pseudoparser. - it parses preprocessed C++ code, currently it supports correct code only and parse them as a translation-unit; - it produces a forest which stores all possible trees in an efficient manner (only a single node being build for per (SymbolID, Token Range)); no disambiguation yet; Reland with a fix for g++'s -fpermissive error on previous declaration `GSS& GSS;`. Differential Revision: https://reviews.llvm.org/D121150	2022-05-03 20:25:23 +02:00
Haojian Wu	860eabb395	Revert "[pseudo] Implement the GLR parsing algorithm." This breaks some buildbots (on the declaration GSS& GSS), will fix it later. This reverts commit `eac22d0754`.	2022-05-03 15:54:10 +02:00
Sam McCall	eac22d0754	[pseudo] Implement the GLR parsing algorithm. This patch implements a standard GLR parsing algorithm, the core piece of the pseudoparser. - it parses preprocessed C++ code, currently it supports correct code only and parse them as a translation-unit; - it produces a forest which stores all possible trees in an efficient manner (only a single node being build for per (SymbolID, Token Range)); no disambiguation yet; Differential Revision: https://reviews.llvm.org/D121150	2022-05-03 15:42:07 +02:00

27 Commits