argumentcomputer · samuelburnham · Jun 29, 2026
diff --git a/Benchmarks/Statistics/.gitignore b/Benchmarks/Statistics/.gitignore
@@ -0,0 +1,3 @@
+.venv/
+__pycache__/
+*.svg
diff --git a/Benchmarks/Statistics/README.md b/Benchmarks/Statistics/README.md
@@ -0,0 +1,88 @@
+# Benchmarks/Statistics
+
+A small, general-purpose Python project for **modelling and plotting measured
+benchmark data** for the proving stacks — currently **Aiur** (FFT-cost) and
+**Zisk** (zkVM cycles). Two regimes per system:
+
+1. **profiled features → cost** — predict the deterministic cost (Aiur FFT /
+   Zisk cycles) from cheap out-of-circuit profiler counters (`hb`, `subst`,
+   `defeq`, `bytes`), so you can pick benchmark targets by budget before running.
+2. **cost → runtime** — predict execution / proving **time, throughput, and RAM**
+   from the cost.
+
+## Layout
+
+```
+benchstats/        # the package: load.py, fit.py, plot.py, shard_model.py, __main__.py (CLI)
+data/
+  aiur/  cost.csv             FFT cost + exec/prove time + peak RAM, per constant
+         features.csv         native profiler features + measured FFT
+         predicted_vs_actual.csv
+  zisk/  single_shard.csv     full-closure single-constant: features + cycles
+         multi_shard.csv      per-shard: features + cycles
+         prove_real.csv       real GPU proves: steps, prove time, peak RAM
+docs/    aiur-cost-dataset.md        FFT → exec/prove/RAM models + the full table
+         aiur-fft-cost-model.md      predicting FFT from features (cost drivers, ceiling)
+         aiur-predicted-vs-actual.md per-constant predicted vs actual
+         zisk-prove-validation.md    shard.rs model vs real GPU proves (R²)
+figures/ # rendered PNGs (committed for reference; `--svg` output is gitignored; view with `kitten icat`)
+```
+
+## Usage
+
+The `nix develop` dev shell provides `python3` with matplotlib, so just run:
+
+```bash
+python3 -m benchstats all          # render every graph to figures/
+python3 -m benchstats aiur-runtime # or one graph; add --svg to also write an SVG
+```
+
+Outside Nix, `pip install matplotlib` into a venv first.
+
+## Graph catalogue
+
+| command | graph | data | status |
+|---|---|---|---|
+| `aiur-predictor` | profiled `hb/subst/defeq/bytes` vs **FFT** (one panel each, power-law fit) | `data/aiur/features.csv` | ✅ n=65 |
+| `aiur-runtime` | **FFT** vs exec {time, throughput, RAM}; and vs prove {time, throughput, RAM} (two figures, stacked panels) | `data/aiur/cost.csv` | ✅ n=98; RAM is the *process* peak over execute+prove (for runs that complete, the peak occurs in prove, so exec-phase RAM isn't separated) |
+| `zisk-predictor` | profiled features vs **cycles**, single-shard **and** multi-shard (two figures) | `data/zisk/{single,multi}_shard.csv` | ✅ single n=66, multi n=12 |
+| `zisk-runtime` | **cycles** vs prove {time, throughput, RAM} — projection of the `shard.rs` model | constants read from `shard.rs` | ✅ model projection (no fit) |
+| `zisk-prove-validate` | the **actual `shard.rs` model** (`50+33·B` RAM, `54+158·B` time) vs **real GPU proves**, with R² | `data/zisk/prove_real.csv` | ✅ n=31 leaves (single + sharded, witness 5 & 10); RAM R²=0.88, time R²=0.95; witness=10 ≈ no RAM change — see `docs/zisk-prove-validation.md` |
+
+> `zisk-runtime` and `zisk-prove-validate` read the model **constants from the
+> in-repo planner** (`crates/kernel/src/shard.rs`, the single source of truth) via
+> `benchstats/shard_model.py` — they plot the code's model, never a fitted one.
+> Override with `--shard-rs <path>` (defaults to the in-repo file; falls back to
+> `~/ix/crates/kernel/src/shard.rs` outside a full worktree).
+
+## Figures
+
+Rendered to `figures/` by `python3 -m benchstats all` and committed for reference
+(regenerate any time — they're deterministic from the data + `shard.rs`).
+
+### Aiur
+| profiled features → FFT | FFT → execution | FFT → proving |
+|---|---|---|
+| ![](figures/aiur_features_vs_fft.png) | ![](figures/aiur_exec_vs_fft.png) | ![](figures/aiur_prove_vs_fft.png) |
+
+### Zisk
+| features → cycles (single-shard) | features → cycles (multi-shard) |
+|---|---|
+| ![](figures/zisk_single_features_vs_cycles.png) | ![](figures/zisk_multi_features_vs_cycles.png) |
+
+| cycles → proving (`shard.rs` model projection) | **`shard.rs` model vs real GPU proves** |
+|---|---|
+| ![](figures/zisk_prove_vs_cycles.png) | ![](figures/zisk_prove_validation.png) |
+
+## Data provenance
+
+- **Aiur FFT cost** (`computeStats.totalFftCost`) and exec/prove/RAM: the on-tree
+  `bench-typecheck --ixe` / `ix check --stats-out` (full-closure `verify_claim`).
+- **Native profiler features** (`hb/subst/defeq/bytes`): the `ix` repo's
+  `only_const_profile` example (`IX_PROFILE_COUNTERS=1`, out of circuit).
+- **Zisk cycles** and the prove-time/RAM model: `zisk-host --execute` and the
+  shard-prover measurements — see `ix` repo `docs/zisk-cycle-cost-model.md`,
+  the companion to `docs/aiur-fft-cost-model.md`.
+
+Regenerate the underlying data with the commands in those docs; this project
+just models and plots it.
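As a worked example of the `shard.rs` model quoted in the graph catalogue (`50+33·B` GiB RAM, `54+158·B` s prove time, with `B` = steps in billions), the sketch below evaluates both lines and the derived throughput. The constants are hardcoded here only for illustration; the package itself reads them from `shard.rs`, so treat these defaults as a snapshot taken from the README, and the function names as hypothetical.

```python
# Illustrative only: constants copied from the README's catalogue row,
# not read from shard.rs as benchstats does.

def ram_gib(steps, base=50.0, per_bcycle=33.0):
    """Modelled peak host RAM in GiB for a prove of `steps` cycles."""
    return base + per_bcycle * steps / 1e9


def prove_secs(steps, setup=54.0, per_bcycle=158.0):
    """Modelled GPU prove time in seconds for `steps` cycles."""
    return setup + per_bcycle * steps / 1e9


def throughput_msteps(steps):
    """Modelled prove throughput in millions of steps per second."""
    return (steps / 1e6) / prove_secs(steps)


one_billion = 1e9
print(ram_gib(one_billion))     # 50 + 33*1  = 83.0 GiB
print(prove_secs(one_billion))  # 54 + 158*1 = 212.0 s
print(throughput_msteps(one_billion))
```

Because of the fixed setup term, modelled throughput rises with shard size and approaches `1000 / 158` M steps/s for very large `B`.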
diff --git a/Benchmarks/Statistics/benchstats/__init__.py b/Benchmarks/Statistics/benchstats/__init__.py
@@ -0,0 +1,4 @@
+"""benchstats — model & plot measured Aiur / Zisk benchmark data.
+
+See README.md for the graph catalogue. Run `python -m benchstats --help`.
+"""
diff --git a/Benchmarks/Statistics/benchstats/__main__.py b/Benchmarks/Statistics/benchstats/__main__.py
@@ -0,0 +1,125 @@
+"""CLI: render the benchmark cost-model graphs.
+
+    python -m benchstats all                  # every graph
+    python -m benchstats aiur-predictor        # profiled features  -> FFT
+    python -m benchstats aiur-runtime          # FFT -> exec/prove time, throughput, RAM
+    python -m benchstats zisk-predictor        # profiled features  -> cycles (single + multi shard)
+    python -m benchstats zisk-runtime          # cycles -> prove time/throughput/RAM (shard.rs model)
+    python -m benchstats zisk-prove-validate   # shard.rs model vs real GPU proves (with R²)
+
+Figures are written to `figures/` (view with `kitten icat`); `--svg` output is gitignored.
+"""
+import argparse
+
+from . import load, plot
+from .plot import ENV_COLOR
+
+
+def _project(rows, keep):
+    """Keep only `keep` columns (drop incidental ones like shard/blocks)."""
+    return [{k: r[k] for k in keep if k in r} for r in rows]
+
+
+def aiur_predictor(a):
+    rows = load.load("aiur/features.csv")
+    pts = _project(rows, ["name", "env", "hb", "subst", "defeq", "bytes", "fft"])
+    plot.feature_panels(pts, "fft", "Aiur FFT cost", "aiur_features_vs_fft.png",
+                        "Aiur — profiled features vs FFT cost", a.svg)
+
+
+def aiur_runtime(a):
+    rows = load.load("aiur/cost.csv")
+    fft, exec_us, prove_ms, rss_kb, kept = load.numeric(
+        rows, "fft_cost", "exec_us", "prove_ms", "peak_rss_kb")
+    colors = [ENV_COLOR.get(r.get("env", ""), "#1f77b4") for r in kept]
+    exec_ms = [v / 1e3 for v in exec_us]
+    exec_tput = [f / (e / 1e6) for f, e in zip(fft, exec_us)]   # FFT-units / s
+    prove_tput = [f / (p / 1e3) for f, p in zip(fft, prove_ms)]  # FFT-units / s
+    ram_gb = [v / 1024 / 1024 for v in rss_kb]
+    plot.cost_vs_runtime(fft, "Aiur FFT cost", [
+        dict(label="exec time", y=exec_ms, ylabel="exec time (ms)", yscale="log", fit="loglog"),
+        dict(label="exec throughput", y=exec_tput, ylabel="exec throughput\n(FFT-units/s)", yscale="linear"),
+        dict(label="peak RAM", y=ram_gb, ylabel="peak RAM (GB)", yscale="log", fit="loglog",
+             note="process peak RSS (execute+prove; peak is in prove for runs that complete)"),
+    ], "aiur_exec_vs_fft.png", "Aiur — FFT cost vs execution", colors, a.svg)
+    plot.cost_vs_runtime(fft, "Aiur FFT cost", [
+        dict(label="prove time", y=prove_ms, ylabel="prove time (ms)", yscale="log", fit="loglog"),
+        dict(label="prove throughput", y=prove_tput, ylabel="prove throughput\n(FFT-units/s)", yscale="linear"),
+        dict(label="peak RAM", y=ram_gb, ylabel="peak RAM (GB)", yscale="log", fit="loglog"),
+    ], "aiur_prove_vs_fft.png", "Aiur — FFT cost vs proving", colors, a.svg)
+
+
+def zisk_predictor(a):
+    single = _project(load.load("zisk/single_shard.csv"),
+                      ["name", "hb", "bytes", "subst", "cycles"])
+    plot.feature_panels(single, "cycles", "Zisk cycles (steps)",
+                        "zisk_single_features_vs_cycles.png",
+                        "Zisk single-shard (full-closure) — features vs cycles", a.svg)
+    multi = _project(load.load("zisk/multi_shard.csv"),
+                     ["shard", "hb", "bytes", "subst", "cycles"])
+    plot.feature_panels(multi, "cycles", "Zisk cycles (steps)",
+                        "zisk_multi_features_vs_cycles.png",
+                        "Zisk multi-shard (per shard) — features vs cycles", a.svg)
+
+
+def zisk_runtime(a):
+    # Project the actual shard.rs prove model (constants read from the code) over
+    # a cycle range. x = steps; B = steps in billions. See zisk-prove-validate for
+    # the same model checked against real GPU proves.
+    from . import shard_model
+    k = shard_model.read_constants(a.shard_rs)
+    B = lambda s: s / 1e9
+    plot.model_lines((2.7e8, 5.0e9), [
+        dict(label=f"{k['RAM_BASE_GIB']:.0f} + {k['RAM_GIB_PER_BCYCLE']:.0f}·B",
+             fn=lambda s: k["RAM_BASE_GIB"] + k["RAM_GIB_PER_BCYCLE"] * B(s),
+             ylabel="peak host RAM (GiB)", yscale="linear"),
+        dict(label=f"{k['PROVE_SETUP_SECS']:.0f} + {k['PROVE_SECS_PER_BCYCLE']:.0f}·B",
+             fn=lambda s: k["PROVE_SETUP_SECS"] + k["PROVE_SECS_PER_BCYCLE"] * B(s),
+             ylabel="GPU prove time (s)", yscale="linear"),
+        dict(label="M steps/s",
+             fn=lambda s: (s / 1e6) / (k["PROVE_SETUP_SECS"] + k["PROVE_SECS_PER_BCYCLE"] * B(s)),
+             ylabel="prove throughput\n(M steps/s)", yscale="linear"),
+    ], "zisk_prove_vs_cycles.png",
+        "Zisk — cycles vs proving (shard.rs model projection)", a.svg)
+
+
+def zisk_prove_validate(a):
+    """Real GPU proves (zisk-host, RTX PRO 6000) vs the actual shard.rs model."""
+    from . import shard_model
+    k = shard_model.read_constants(a.shard_rs)
+    steps, ram, leaf, wall, kept = load.numeric(
+        load.load("zisk/prove_real.csv"), "steps", "peak_ram_gib", "leaf_secs", "wall_secs")
+    kinds = [f"{r.get('kind','single')} w{r.get('witness','5')}" for r in kept]
+    ram_model = lambda s: k["RAM_BASE_GIB"] + k["RAM_GIB_PER_BCYCLE"] * s / 1e9
+    time_model = lambda s: k["PROVE_SETUP_SECS"] + k["PROVE_SECS_PER_BCYCLE"] * s / 1e9
+    plot.model_vs_data([
+        dict(x=steps, y=ram, model=ram_model, ylabel="peak host RAM (GiB)", groups=kinds,
+             formula=f"{k['RAM_BASE_GIB']:.0f} + {k['RAM_GIB_PER_BCYCLE']:.0f}·Bsteps"),
+        dict(x=steps, y=leaf, model=time_model, ylabel="prove time (s)", groups=kinds,
+             y2=wall, y2label="measured (wall, incl. setup)",
+             formula=f"{k['PROVE_SETUP_SECS']:.0f} + {k['PROVE_SECS_PER_BCYCLE']:.0f}·Bsteps"),
+    ], "zisk_prove_validation.png",
+        "Zisk shard.rs cost model vs real GPU proves (RTX PRO 6000 Blackwell)", a.svg)
+
+
+CMDS = {"aiur-predictor": aiur_predictor, "aiur-runtime": aiur_runtime,
+        "zisk-predictor": zisk_predictor, "zisk-runtime": zisk_runtime,
+        "zisk-prove-validate": zisk_prove_validate}
+
+
+def main():
+    ap = argparse.ArgumentParser(prog="benchstats", description=__doc__,
+                                 formatter_class=argparse.RawDescriptionHelpFormatter)
+    ap.add_argument("graph", choices=["all"] + list(CMDS), help="which graph(s) to render")
+    ap.add_argument("--svg", action="store_true", help="also write .svg")
+    ap.add_argument("--shard-rs", default=None,
+                    help="path to the ix repo's crates/kernel/src/shard.rs "
+                         "(model constants source; default: in-repo file, else ~/ix/...)")
+    a = ap.parse_args()
+    todo = CMDS.values() if a.graph == "all" else [CMDS[a.graph]]
+    for fn in todo:
+        fn(a)
+
+
+if __name__ == "__main__":
+    main()
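The `fit="loglog"` panels requested in `aiur_runtime` ask for a power-law trend line. Assuming `plot.py` (not shown in this diff) delegates to the `loglog` estimator in `benchstats/fit.py`, the sketch below mirrors that estimator in a self-contained form and checks it on synthetic data. The data and the exponent `1.5` are made up for the check, not measured.

```python
import math


def loglog_fit(x, y):
    """Fit y = a * x**b by ordinary least squares on (log x, log y)."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(x)
    sx, sy = sum(lx), sum(ly)
    sxx = sum(v * v for v in lx)
    sxy = sum(p * q for p, q in zip(lx, ly))
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope = exponent
    log_a = (sy - b * sx) / n                      # intercept = log(a)
    return math.exp(log_a), b


# Synthetic, noise-free power law spanning four decades.
x = [1e3, 1e4, 1e5, 1e6]
y = [3.0 * v ** 1.5 for v in x]
a, b = loglog_fit(x, y)
print(a, b)  # close to 3.0 and 1.5, up to floating-point rounding
```

On a log-log panel the fitted model is a straight line with slope `b`, which is why an exponent near 1 reads as "cost-proportional" at a glance.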
diff --git a/Benchmarks/Statistics/benchstats/fit.py b/Benchmarks/Statistics/benchstats/fit.py
@@ -0,0 +1,69 @@
+"""Least-squares estimators shared by the Aiur and Zisk cost models.
+
+Two regimes show up across the benchmarks:
+  * profiled features -> cost   (predict FFT / cycles from cheap counters)
+  * cost -> runtime/RAM         (predict time / throughput / RAM from cost)
+
+Costs span several orders of magnitude, so fits minimise *relative* error
+(WLS with 1/y^2 weights) and we also expose a log-log power-law fit.
+"""
+import math
+
+
+def nlogn(x: float) -> float:
+    """`x * log2(x + 2)` feature: the natural form for Aiur FFT (= sum w*h*log2 h); the +2 offset keeps it defined at x = 0."""
+    return x * math.log2(x + 2.0)
+
+
+def _solve(A, b):
+    n = len(A)
+    M = [row[:] + [b[i]] for i, row in enumerate(A)]
+    for i in range(n):
+        piv = max(range(i, n), key=lambda r: abs(M[r][i]))
+        M[i], M[piv] = M[piv], M[i]
+        d = M[i][i]
+        if abs(d) < 1e-300:
+            raise ValueError("singular normal matrix")
+        M[i] = [v / d for v in M[i]]
+        for r in range(n):
+            if r != i:
+                f = M[r][i]
+                M[r] = [a - f * c for a, c in zip(M[r], M[i])]
+    return [M[i][n] for i in range(n)]
+
+
+def wls(feature_cols, y, relative=True, intercept=True):
+    """Weighted least squares. `feature_cols` is a list of equal-length columns.
+
+    Returns dict(coef, r2 (weighted), mape (%), pred). With `relative`, weights
+    are 1/y^2 so the fit targets relative error (right for multi-decade costs).
+    """
+    n = len(y)
+    p = len(feature_cols) + (1 if intercept else 0)
+    rows = [[col[i] for col in feature_cols] + ([1.0] if intercept else []) for i in range(n)]
+    w = [1.0 / (v * v) if (relative and v) else 1.0 for v in y]
+    A = [[sum(w[i] * rows[i][j] * rows[i][k] for i in range(n)) for k in range(p)] for j in range(p)]
+    c = [sum(w[i] * rows[i][j] * y[i] for i in range(n)) for j in range(p)]
+    coef = _solve(A, c)
+    pred = [sum(coef[j] * rows[i][j] for j in range(p)) for i in range(n)]
+    ybar = sum(w[i] * y[i] for i in range(n)) / sum(w)
+    sst = sum(w[i] * (y[i] - ybar) ** 2 for i in range(n))
+    ssr = sum(w[i] * (y[i] - pred[i]) ** 2 for i in range(n))
+    mape = sum(abs(pred[i] - y[i]) / abs(y[i]) for i in range(n) if y[i]) / max(1, sum(1 for v in y if v)) * 100  # mean over nonzero y only
+    return dict(coef=coef, r2=(1 - ssr / sst if sst else 0.0), mape=mape, pred=pred)
+
+
+def loglog(x, y):
+    """Power-law fit y = a * x^b by OLS in log space (needs x, y > 0 and at least two distinct x). Returns (a, b, r2_log)."""
+    lx = [math.log(v) for v in x]
+    ly = [math.log(v) for v in y]
+    n = len(x)
+    sx, sy = sum(lx), sum(ly)
+    sxx = sum(v * v for v in lx)
+    sxy = sum(a * b for a, b in zip(lx, ly))
+    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
+    a = (sy - b * sx) / n
+    yb = sy / n
+    sst = sum((v - yb) ** 2 for v in ly)
+    ssr = sum((ly[i] - (a + b * lx[i])) ** 2 for i in range(n))
+    return math.exp(a), b, (1 - ssr / sst if sst else 0.0)
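To see why `wls` defaults to `1/y^2` weights, consider the simplest case it covers: one feature and no intercept, where the normal equations collapse to `coef = sum(w*x*y) / sum(w*x*x)`. The sketch below is a hypothetical reduction of `wls` to that case, run on invented data where the largest point is off-trend.

```python
def fit_through_origin(x, y, relative):
    """Single-feature, no-intercept WLS: coef = sum(w*x*y) / sum(w*x*x)."""
    w = [1.0 / (v * v) if relative else 1.0 for v in y]
    num = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    den = sum(wi * xi * xi for wi, xi in zip(w, x))
    return num / den


def mape(coef, x, y):
    """Mean absolute percentage error of the line y = coef * x."""
    return sum(abs(coef * xi - yi) / yi for xi, yi in zip(x, y)) / len(y) * 100


# Invented data: three points on y = 2x, one large point at y = 3x.
x = [1.0, 10.0, 100.0, 1000.0]
y = [2.0, 20.0, 200.0, 3000.0]

c_abs = fit_through_origin(x, y, relative=False)  # dominated by the big point
c_rel = fit_through_origin(x, y, relative=True)   # each point counts equally

print(c_abs, mape(c_abs, x, y))
print(c_rel, mape(c_rel, x, y))
```

The unweighted slope lands near 3 because the largest point carries almost all of the squared error, leaving the three small points roughly 50% off. With `1/y^2` weights the slope moves towards 2 and the mean relative error drops, which is the behaviour wanted when costs span several decades.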
diff --git a/Benchmarks/Statistics/benchstats/load.py b/Benchmarks/Statistics/benchstats/load.py
@@ -0,0 +1,35 @@
+"""CSV loading for the benchmark datasets under `data/`."""
+import csv
+import os
+
+ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
+DATA = os.path.join(ROOT, "data")
+FIGURES = os.path.join(ROOT, "figures")
+
+
+def load(rel_path):
+    """Load `data/<rel_path>` as a list of dict rows."""
+    with open(os.path.join(DATA, rel_path)) as f:
+        return list(csv.DictReader(f))
+
+
+def numeric(rows, *names, require=True):
+    """Return parallel float columns for `names`, dropping any row where a requested
+    field is missing or non-numeric (e.g. DNF). `require` is currently unused. Returns (cols..., kept_rows)."""
+    cols = [[] for _ in names]
+    kept = []
+    for r in rows:
+        vals = []
+        ok = True
+        for nm in names:
+            v = r.get(nm, "")
+            try:
+                vals.append(float(v))
+            except (TypeError, ValueError):
+                ok = False
+                break
+        if ok:
+            for i, v in enumerate(vals):
+                cols[i].append(v)
+            kept.append(r)
+    return (*cols, kept)
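The row-dropping behaviour of `numeric` is easiest to see on a tiny input. The sketch below re-implements the same logic in a compact, self-contained form (`numeric_columns` and the sample CSV are hypothetical, not part of the package) and feeds it a row whose `prove_ms` is `DNF`.

```python
import csv
import io


def numeric_columns(rows, *names):
    """Parallel float columns for `names`; rows with any bad field are dropped."""
    cols = [[] for _ in names]
    kept = []
    for r in rows:
        try:
            vals = [float(r.get(nm, "")) for nm in names]
        except (TypeError, ValueError):
            continue  # missing, empty, or non-numeric such as "DNF"
        for col, v in zip(cols, vals):
            col.append(v)
        kept.append(r)
    return (*cols, kept)


# Invented sample in the shape of data/aiur/cost.csv.
SAMPLE = """name,fft_cost,prove_ms
a,100,12.5
b,200,DNF
c,300,40
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
fft, prove_ms, kept = numeric_columns(rows, "fft_cost", "prove_ms")
print(fft)                        # [100.0, 300.0]
print(prove_ms)                   # [12.5, 40.0]
print([r["name"] for r in kept])  # ['a', 'c']
```

A row is dropped as a whole, so the returned columns stay aligned with `kept`. One caveat: `float` accepts strings such as `nan` and `inf`, so such cells would pass this filter and need separate handling if they can appear in the data.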