LALR(1) parser from official MySQL grammar by JanJakes · Pull Request #429 · WordPress/sqlite-database-integration

JanJakes · 2026-06-10T12:42:11Z

Note

The changed line numbers are misleading: about 115,000 of the added lines are just a testing query corpus.
(Copied to the new mysql-parser package from mysql-on-sqlite.)

LALR(1) parser from official MySQL grammar

A new experimental packages/mysql-parser package that implements a universal LALR(1) parser and builds a MySQL parse table from the official MySQL grammar.

This is the initial implementation, not used anywhere in the driver yet.

A full driver migration to this new parser is AI-prototyped in #432.

What it does

Grammar processing pipeline: Fetch sources → Bison → generate parse table and token data.
Lexer: The existing MySQL lexer was copied and adapted to the new LALR(1) grammar.
Parser: A new universal LALR(1) parser implementation.
MySQL grammar: A compacted MySQL 8.4 LTS grammar, extracted using the grammar processing pipeline.
MySQL query corpus: The ~70k MySQL query corpus was copied and updated to MySQL 8.4 LTS.
Benchmark: A no-JIT/JIT lexer + parser benchmark.
Test suite: New tests and a CI job.

What it doesn't do yet

Replace the current parser: It's a standalone package that doesn't replace the existing parser yet.
Multi-version: For now, the parser only tracks MySQL 8.4 LTS. Multi-version will be done as a follow-up.

Benchmarks

Measured on a MacBook Pro M4 Max with PHP 8.4, over the package's 8.4.10 corpus of ~70k queries, end-to-end (lex + parse), best of 5 timed passes after 2 warmups:

Metric	LL (trunk)	LALR (this)
Throughput, no JIT	11,010 QPS	59,457 QPS
Throughput, warm JIT	24,393 QPS	112,759 QPS
Cold boot, no opcache	~1.9 ms	~2.7 ms
Warm boot, opcache	~0.6 ms	~0.3 ms
Memory, no opcache	~3.4 MB	~5.4 MB
Memory, opcache worker	~1.8 MB	~3.1 MB
Generated parser/table file size	65 KB	177 KB
Full size (lexer + parser + grammar)	246 KB	260 KB

This parser is over 5× faster without JIT and over 4.5× faster with JIT. Cold boot is a bit slower; warm boot is faster. The memory footprint is a bit higher, and the overall size about 14 KB higher.
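
As a quick check of the ratios quoted above, the speedups can be recomputed from the throughput rows of the table. This is only an arithmetic sketch; the array layout and variable names are illustrative and not part of the package.

```php
<?php
// Throughput figures (QPS) copied from the benchmark table above.
$throughput = array(
	'no JIT'   => array( 'll' => 11010, 'lalr' => 59457 ),
	'warm JIT' => array( 'll' => 24393, 'lalr' => 112759 ),
);

$speedups = array();
foreach ( $throughput as $config => $qps ) {
	// Speedup is the LALR throughput divided by the LL throughput.
	$speedups[ $config ] = round( $qps['lalr'] / $qps['ll'], 2 );
}

foreach ( $speedups as $config => $ratio ) {
	printf( "%s: %.2fx\n", $config, $ratio );
}
```

The two ratios come out at roughly 5.40× and 4.62×, consistent with "over 5×" and "over 4.5×".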

Recognize-only

The same lex + parse runs, but without building an AST, so they measure recognition alone with no AST allocation:

Throughput	LL (trunk)	LALR (this)
no JIT	16,359 QPS	95,374 QPS
warm JIT	49,940 QPS	210,032 QPS

Dropping AST construction lifts both parsers by ~1.5–2×, but the gap between them stays at ~4.2–5.8×.

github-actions · 2026-06-10T12:43:09Z

🤖 Lexer benchmark

Changes to lexer-related files were detected and triggered a benchmark:

Config	Base (QPS)	This PR (QPS)	Speedup
no JIT	73,082	72,859	1.00×
tracing JIT	155,119	154,228	0.99×

Note: Hosted runners are noisy, and absolute numbers vary. Treat the results with caution and verify them locally.

To reproduce locally:

cd packages/mysql-on-sqlite && composer run bench-lexer

Add a new monorepo package for a MySQL parser generated from the official MySQL grammar. This commit sets up the package metadata; the source, tooling, and documentation follow in later commits.

Bring the MySQL lexer and the token and node classes over from the mysql-on-sqlite package unchanged, so the later adaptation to the official grammar is reviewable as a focused diff, and register src/ as the package Composer classmap (the WordPress-style file names rule out PSR-4).

Compile the grammar from the official MySQL sources: fetch sql_yacc.yy and lex.h at a pinned, checksum-verified mysql-server tag; run a pinned Bison build (Docker, version-asserted) to produce the automaton; compact the automaton into plain PHP ACTION/GOTO tables (about 7% of the dense cells); and derive the keyword table and token constants from lex.h, failing the build on any unresolved terminal. bin/build-grammar (composer run build-grammar) runs the pipeline end to end.
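
The checksum verification step can be sketched as follows. This is a minimal illustration, not the code in bin/build-grammar: the function name verify_pinned_checksum() and the choice of SHA-256 are assumptions, and the sample contents merely stand in for a downloaded sql_yacc.yy.

```php
<?php
/**
 * Hypothetical helper: fail loudly when fetched contents do not match a pinned digest.
 */
function verify_pinned_checksum( string $contents, string $expected_sha256 ): void {
	$actual = hash( 'sha256', $contents );
	// hash_equals() compares the two digests in constant time.
	if ( ! hash_equals( $expected_sha256, $actual ) ) {
		throw new RuntimeException( "Checksum mismatch: expected {$expected_sha256}, got {$actual}" );
	}
}

// Stand-in for a fetched source file; a real build would keep the pinned digest in the repository.
$fetched = "%token SELECT_SYM\n";
$pinned  = hash( 'sha256', $fetched );

verify_pinned_checksum( $fetched, $pinned ); // Passes silently.

$tamper_detected = false;
try {
	verify_pinned_checksum( $fetched . 'tampered', $pinned );
} catch ( RuntimeException $e ) {
	$tamper_detected = true;
}
```

Failing with an exception rather than a warning mirrors the "fail the build" behavior described for unresolved terminals.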

Commit the LALR(1) parse table produced by bin/build-grammar: a plain PHP array that compacts the grammar's dense ACTION/GOTO automaton to about 7% of its cells. Regenerate with composer run build-grammar. The token-level data (keyword table, paren-gated function keywords, and token constants) is generated into the lexer itself; see the next commit.

Make the lexer emit the grammar's own token numbers, with the keyword table generated from lex.h: keyword synonyms, paren-gated function keywords, and dropped keywords all follow MySQL's own data. Diagnostic token names are derived on demand instead of shipping a name map. The lexer produces MySQL's grammar token stream directly, the way MySQL's own lexer does, rather than scanning a different token model and reconciling it in a separate pass: "@" is a standalone terminal followed by its name, "WITH ROLLUP" is contracted via a one-token lookahead, NOT becomes NOT2 under HIGH_NOT_PRECEDENCE, and the input ends with END_OF_INPUT and Bison's end marker (omitted on invalid input). The pull iterator (next_token/get_token) and remaining_tokens() both yield this single stream; the scanner's internal sentinels stay private and never reach it.
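
The one-token lookahead used for "WITH ROLLUP" can be illustrated on a plain list of token names. This is a sketch only: contract_with_rollup() and the string token names are hypothetical, and the real lexer performs the contraction on numeric grammar tokens during scanning rather than in a separate pass.

```php
<?php
/**
 * Hypothetical illustration: merge a WITH token that is directly followed
 * by ROLLUP into a single WITH_ROLLUP token, using one token of lookahead.
 */
function contract_with_rollup( array $tokens ): array {
	$out   = array();
	$count = count( $tokens );
	for ( $i = 0; $i < $count; $i++ ) {
		$is_with         = 'WITH' === $tokens[ $i ];
		$next_is_rollup  = $i + 1 < $count && 'ROLLUP' === $tokens[ $i + 1 ];
		if ( $is_with && $next_is_rollup ) {
			$out[] = 'WITH_ROLLUP';
			$i++; // Skip the ROLLUP token that was just merged.
			continue;
		}
		$out[] = $tokens[ $i ];
	}
	return $out;
}

$grouped = contract_with_rollup( array( 'GROUP', 'BY', 'a', 'WITH', 'ROLLUP' ) );
$cte     = contract_with_rollup( array( 'WITH', 'cte', 'AS' ) );
// $grouped ends with a single WITH_ROLLUP token; $cte is returned unchanged.
```

Contracting the pair in the token stream lets the grammar treat "WITH ROLLUP" as one terminal, so a plain WITH (as in a common table expression) never competes with it in the parse table.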

A table-driven LALR(1) shift-reduce runtime (WP_Parser) over a WP_Parser_Grammar that expands a compact, generated ACTION/GOTO parse table, building a WP_Parser_Node AST. The grammar is unambiguous for LALR(1), so the loop is deterministic, with no conflict handling or backtracking. A rule that matches nothing produces no node, so empty optional rules are absent from the tree. This is grammar-agnostic: it knows nothing about MySQL, only how to run an LALR(1) parse table. Adapt the copied parse-tree primitives to the package: the runtime builds each node in a single step, so the old recursive parser's merge_fragment() is dropped, and the node and token docblocks no longer reference that parser.
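
To make the shift-reduce loop concrete, here is a toy table-driven parser for a two-rule grammar. It is a sketch under stated assumptions, not WP_Parser itself: the function toy_lalr_parse(), the table layout, and the node shape are all hypothetical, and the ACTION/GOTO tables were built by hand for the grammar S → "(" S ")" | "x".

```php
<?php
/**
 * Toy LR driver. Returns a nested array tree on success or null on a syntax error.
 */
function toy_lalr_parse( array $tokens ) {
	// Rule number => array( left-hand side, right-hand side length ).
	static $rules = array(
		1 => array( 'S', 3 ), // S -> '(' S ')'
		2 => array( 'S', 1 ), // S -> 'x'
	);
	static $action = array(
		0 => array( '(' => array( 'shift', 2 ), 'x' => array( 'shift', 3 ) ),
		1 => array( '$' => array( 'accept', 0 ) ),
		2 => array( '(' => array( 'shift', 2 ), 'x' => array( 'shift', 3 ) ),
		3 => array( ')' => array( 'reduce', 2 ), '$' => array( 'reduce', 2 ) ),
		4 => array( ')' => array( 'shift', 5 ) ),
		5 => array( ')' => array( 'reduce', 1 ), '$' => array( 'reduce', 1 ) ),
	);
	static $goto = array(
		0 => array( 'S' => 1 ),
		2 => array( 'S' => 4 ),
	);

	$tokens[] = '$'; // End marker.
	$states   = array( 0 );
	$nodes    = array();
	$i        = 0;
	while ( true ) {
		$state = end( $states );
		$cell  = $action[ $state ][ $tokens[ $i ] ] ?? null;
		if ( null === $cell ) {
			return null; // No table entry: syntax error.
		}
		list( $kind, $arg ) = $cell;
		if ( 'shift' === $kind ) {
			$states[] = $arg;
			$nodes[]  = $tokens[ $i++ ];
		} elseif ( 'reduce' === $kind ) {
			list( $lhs, $length ) = $rules[ $arg ];
			// Pop the right-hand side and build the node in a single step.
			$children = array_splice( $nodes, -$length );
			array_splice( $states, -$length );
			$nodes[]  = array( 'rule' => $lhs, 'children' => $children );
			$states[] = $goto[ end( $states ) ][ $lhs ];
		} else {
			return $nodes[0]; // Accept.
		}
	}
}

$tree    = toy_lalr_parse( array( '(', 'x', ')' ) );
$invalid = toy_lalr_parse( array( '(', 'x' ) );
```

Because every table cell holds at most one action, the loop never has to choose between alternatives, which is why no backtracking is needed.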

Wire the generated MySQL parse table into the generic LALR(1) runtime through a factory: WP_MySQL_Parser_Factory::create_parser() builds a WP_Parser over a WP_Parser_Grammar loaded from src/mysql-parse-table.php. The grammar is expanded once and shared between created parsers; create_grammar() exposes a fresh grammar for callers that want their own. This is the only piece that knows the parser is being used for MySQL.
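
The "expand once, share between parsers" behavior can be sketched with a static cache. The Toy_* classes below are hypothetical stand-ins and do not reproduce the real WP_MySQL_Parser_Factory, WP_Parser, or WP_Parser_Grammar signatures; only the method names create_parser() and create_grammar() are taken from the description above.

```php
<?php
// Hypothetical stand-in for the grammar class.
final class Toy_Grammar {
	public $rows;

	public function __construct( array $compact_table ) {
		// The real grammar expands the compact parse table here; the toy just stores it.
		$this->rows = $compact_table;
	}
}

// Hypothetical stand-in for the parser runtime.
final class Toy_Parser {
	public $grammar;

	public function __construct( Toy_Grammar $grammar ) {
		$this->grammar = $grammar;
	}
}

final class Toy_Parser_Factory {
	private static $shared_grammar = null;

	// Always builds a fresh grammar for callers that want their own.
	public static function create_grammar(): Toy_Grammar {
		return new Toy_Grammar( array( 0 => array( 'x' => 1 ) ) );
	}

	// Builds the shared grammar on first use, then reuses it for every parser.
	public static function create_parser(): Toy_Parser {
		if ( null === self::$shared_grammar ) {
			self::$shared_grammar = self::create_grammar();
		}
		return new Toy_Parser( self::$shared_grammar );
	}
}

$first  = Toy_Parser_Factory::create_parser();
$second = Toy_Parser_Factory::create_parser();
$own    = Toy_Parser_Factory::create_grammar();
```

Here $first and $second hold the same grammar object, while $own is a separate instance.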

Cut the generated parse table from 190 KB to 177 KB (-7%) with no behavior change: most shifts on a given terminal go to the same successor state, so those cells are stored as bare token lists (action_row_shift_tokens) and restored from a per-terminal target table (action_shift_targets) when the grammar is constructed. The smaller file also parses faster on a cold opcache.
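
The restore step can be sketched as below. The keys action_row_shift_tokens and action_shift_targets come from the commit message, but their exact shape is assumed here, and the action_rows key for explicitly stored cells, the function name, and the sample data are hypothetical.

```php
<?php
/**
 * Hypothetical expansion: rebuild full ACTION rows from bare token lists
 * plus a per-terminal table of the most common shift target.
 */
function expand_action_rows( array $table ): array {
	// Start from the cells that are stored explicitly (uncommon targets).
	$rows = $table['action_rows'];
	foreach ( $table['action_row_shift_tokens'] as $state => $tokens ) {
		foreach ( $tokens as $token ) {
			// A bare token means "shift to this terminal's usual successor state".
			$rows[ $state ][ $token ] = $table['action_shift_targets'][ $token ];
		}
	}
	return $rows;
}

$compact = array(
	'action_shift_targets'    => array( 'SELECT' => 7, 'FROM' => 12 ),
	'action_row_shift_tokens' => array(
		0 => array( 'SELECT' ),
		3 => array( 'SELECT', 'FROM' ),
	),
	'action_rows'             => array(
		3 => array( 'WHERE' => 20 ),
		5 => array( 'FROM' => 15 ), // An uncommon target stays explicit.
	),
);

$rows = expand_action_rows( $compact );
```

Storing a token name instead of a token-and-target pair is what saves space; the cost is one pass over the rows when the grammar is constructed.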

Bring the query corpus extracted from the MySQL server test suite, with the tooling that generates it, into the package: data/mysql-server-query-corpus/ plus a bin/build-corpus orchestrator (composer run build-corpus) that fetches the mysql-test directory at the pinned tag and extracts the queries. The SQLite driver package keeps its own copy for now; it will be retired when the driver is ported to this package.

Measure the corpus parse rate and end-to-end (lex + parse) throughput, with warmup and timed passes. The parser accepts 99.76% of the ~69k corpus queries.
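
A "best of N timed passes after warmups" measurement can be sketched as follows. This is not the package's benchmark script: best_qps() is a hypothetical helper, and the closure uses strtoupper() as a stand-in for lexing and parsing each query.

```php
<?php
/**
 * Hypothetical harness: run untimed warmups, then keep the fastest timed pass.
 */
function best_qps( callable $run_pass, int $query_count, int $warmups = 2, int $passes = 5 ): float {
	for ( $i = 0; $i < $warmups; $i++ ) {
		$run_pass(); // Warm caches (and the JIT, when enabled) without timing.
	}
	$best_ns = PHP_INT_MAX;
	for ( $i = 0; $i < $passes; $i++ ) {
		$start = hrtime( true ); // Monotonic clock, in nanoseconds.
		$run_pass();
		$best_ns = min( $best_ns, hrtime( true ) - $start );
	}
	// Guard against a zero-length measurement on very small inputs.
	return $query_count / ( max( 1, $best_ns ) / 1e9 );
}

$queries = array( 'select 1', 'select 2', 'select 3' );
$calls   = 0;
$qps     = best_qps(
	function () use ( $queries, &$calls ) {
		$calls++;
		foreach ( $queries as $query ) {
			strtoupper( $query ); // Stand-in for lex + parse.
		}
	},
	count( $queries )
);
```

Taking the best pass rather than the mean reduces the influence of scheduling noise, at the price of reporting an optimistic figure.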

Cover the token stream, the scanner (the exhaustive unit suite ported from the SQLite driver), the parser runtime, token value and name resolution, generated grammar-data invariants, and a corpus regression test pinning the exact acceptance tally. Run the suite on the oldest and newest supported PHP versions in CI.

Tokenizing a whole statement routed every token through the pull iterator (next_token -> produce -> scan_lexeme -> read_next_token -> enqueue_token), adding ~4 method calls plus token-queue bookkeeping per token over a plain scan-and-emit loop. Give remaining_tokens() a tight fast path that emits the common single-token lexemes inline and delegates only the rare multi-token ones (@, WITH ROLLUP, end markers) to the buffered producers. The pull API is unchanged and the output is byte-identical; ~24% faster (no JIT) / ~16% (JIT) end-to-end over the MySQL server corpus.

The pull iterator buffered produced tokens in a dynamic $token_queue drained by index. A scan step yields at most two grammar tokens, so a single $pending_token slot suffices: next_token() returns the first and holds the second. The multi-token producers (@, WITH ROLLUP, end markers) now append to a caller-supplied array, shared directly by both next_token() and remaining_tokens() — removing the queue bookkeeping and the duplicated drain in the fast path. A make_token() helper unifies token construction. Output is byte-identical and throughput is unchanged (the multi-token cases were already off the hot path); this is a structural cleanup.
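
The single-slot idea can be illustrated with a toy stream in which one scan step yields at most two tokens. The class Toy_Token_Stream, its scan() helper, and the pre-split lexeme input are hypothetical; only the names next_token() and $pending_token follow the commit message.

```php
<?php
final class Toy_Token_Stream {
	private $lexemes;
	private $position      = 0;
	private $pending_token = null;

	public function __construct( array $lexemes ) {
		$this->lexemes = $lexemes;
	}

	// Returns the next token, or null at the end of the input.
	public function next_token(): ?string {
		if ( null !== $this->pending_token ) {
			// Hand out the second token of the previous scan step.
			$token               = $this->pending_token;
			$this->pending_token = null;
			return $token;
		}
		if ( $this->position >= count( $this->lexemes ) ) {
			return null;
		}
		$produced            = $this->scan( $this->lexemes[ $this->position++ ] );
		$this->pending_token = $produced[1] ?? null; // Hold the second token, if any.
		return $produced[0];
	}

	// A lexeme like "@name" yields two tokens: "@" and the name.
	private function scan( string $lexeme ): array {
		if ( strlen( $lexeme ) > 1 && '@' === $lexeme[0] ) {
			return array( '@', substr( $lexeme, 1 ) );
		}
		return array( $lexeme );
	}
}

$stream = new Toy_Token_Stream( array( 'SELECT', '@user_var' ) );
$tokens = array();
while ( null !== ( $token = $stream->next_token() ) ) {
	$tokens[] = $token;
}
// $tokens is now array( 'SELECT', '@', 'user_var' ).
```

Because a scan step never produces more than two tokens, one nullable slot is enough and no queue index has to be tracked.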

adamziel · 2026-06-23T22:39:16Z

+				list( $type, $start, $length ) = $this->lookahead;
+				$this->lookahead               = null;
+			} else {
+				// Inlined scan_lexeme(): skip whitespace and comments, then scan.


We could remove all comments like that one. LLMs love to leave them for continuity between commits, but it doesn't really mean anything anymore.

adamziel · 2026-06-23T22:40:42Z

+			// Token constants share the class with the lexer's own constants;
+			// grammar tokens are the non-negative ints that are not SQL modes
+			// (the scanner sentinels are negative, the SQL modes are bit flags).
+			foreach ( ( new ReflectionClass( self::class ) )->getConstants() as $name => $value ) {


Reflections look concerning, especially in such a prominent code path. What's the performance hit?

adamziel · 2026-06-23T22:42:04Z

+					$type = self::LESS_OR_EQUAL_OPERATOR;
+				}
+			} elseif ( '>' === $next_byte ) {
+				$this->bytes_already_read += 1; // Consume the '>'.


These comments seem pretty redundant.

adamziel · 2026-06-23T22:44:27Z

+			// in the range of U+0080 to U+FFFF before looking at further bytes.
+			// If it can't, bail out early to avoid unnecessary UTF-8 decoding.
+			// Identifiers are usually ASCII-only, so we can optimize for that.
+			$byte_1 = ord(


An inline utf-8 parser? :D Let's see if it makes any sense to adapt the one @dmsnell built for WordPress core, that's less maintenance, less opportunity to have an error, and probably more performance.

adamziel · 2026-06-23T22:45:00Z

+	}
+
+	private function read_number(): ?int {
+		// @TODO: Support numeric-only identifier parts after "." (e.g., 1ea10.1).


Should this be addressed before merging?

adamziel · 2026-06-23T22:47:01Z

+			 * A backslash with any other character represents the character itself.
+			 * That is, \x evaluates to x, \\ evaluates to \, and \🙂 evaluates to 🙂.
+			 */
+			$preg_quoted_backslash = preg_quote( $backslash );


Can we just compute this on paper and inline it in the next line? :-)

adamziel · 2026-06-23T22:48:43Z

+ *   https://github.com/mysql/mysql-workbench/blob/8.0.38/library/parsers/grammars/MySQLLexer.g4
+ *   https://github.com/mysql/mysql-workbench/blob/8.0.38/library/parsers/mysql/MySQLBaseLexer.cpp
+ */
+class WP_MySQL_Lexer {


I'd love to understand better how this was adapted from the default lexer.

adamziel · 2026-06-23T22:50:27Z

Really good work here @JanJakes! I've left some non-blocking comments and will proceed with merging 🎉

JanJakes changed the title ~~Add an experimental MySQL parser built from the official 8.4 grammar~~ Experiment: LALR(1) parser from official MySQL grammar Jun 10, 2026

JanJakes force-pushed the lalr-parser branch 9 times, most recently from df8874b to 70d642d Compare June 11, 2026 15:43

JanJakes changed the title ~~Experiment: LALR(1) parser from official MySQL grammar~~ LALR(1) parser from official MySQL grammar Jun 12, 2026

JanJakes mentioned this pull request Jun 12, 2026

Experiment: Port MySQL-on-SQLite to LALR(1) parser #432

Draft

JanJakes force-pushed the lalr-parser branch 4 times, most recently from b3c39da to 1f88932 Compare June 12, 2026 15:27

JanJakes added 2 commits June 12, 2026 21:05

Scaffold the mysql-parser package

2cfc1d3

JanJakes force-pushed the lalr-parser branch 6 times, most recently from 296a9c5 to 0f841c5 Compare June 13, 2026 14:10

JanJakes force-pushed the lalr-parser branch 2 times, most recently from a9f8619 to aca2c38 Compare June 19, 2026 09:22

JanJakes marked this pull request as ready for review June 19, 2026 09:39

JanJakes added 10 commits June 19, 2026 15:36

Add the corpus benchmark

85c4b9a

Add README.md

e347cb2

JanJakes force-pushed the lalr-parser branch from aca2c38 to 91f53a1 Compare June 19, 2026 14:04

adamziel reviewed Jun 23, 2026

adamziel merged commit 49d783b into trunk Jun 23, 2026
10 of 11 checks passed

adamziel deleted the lalr-parser branch June 23, 2026 22:50

adamziel merged 14 commits into trunk from lalr-parser

2 participants
