3 changes: 3 additions & 0 deletions Benchmarks/Statistics/.gitignore
@@ -0,0 +1,3 @@
.venv/
__pycache__/
*.svg
88 changes: 88 additions & 0 deletions Benchmarks/Statistics/README.md
@@ -0,0 +1,88 @@
# Benchmarks/Statistics

A small, general-purpose Python project for **modelling and plotting measured
benchmark data** for the proving stacks — currently **Aiur** (FFT-cost) and
**Zisk** (zkVM cycles). Two regimes per system:

1. **profiled features → cost** — predict the deterministic cost (Aiur FFT /
Zisk cycles) from cheap out-of-circuit profiler counters (`hb`, `subst`,
`defeq`, `bytes`), so you can pick benchmark targets by budget before running.
2. **cost → runtime** — predict execution / proving **time, throughput, and RAM**
from the cost.

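The two regimes chain naturally: a cheap profiler counter gives an estimated cost, and the cost gives an estimated runtime. The sketch below assumes both stages are simple power laws `y = a * x^b`. The coefficients, the candidate `hb` counts, and the helper names (`power_law`, `predict_seconds`, `targets_within_budget`) are hypothetical placeholders for illustration, not values fitted from the data in this project.

```python
def power_law(x, a, b):
    """Evaluate y = a * x**b."""
    return a * x ** b


# Hypothetical coefficients, NOT fitted from this project's data.
FEATURE_TO_COST = (120.0, 1.1)   # cost ~ 120 * hb^1.1
COST_TO_SECONDS = (2e-6, 1.0)    # seconds ~ 2e-6 * cost


def predict_seconds(hb):
    """Regime 1 (feature -> cost) followed by regime 2 (cost -> runtime)."""
    cost = power_law(hb, *FEATURE_TO_COST)
    return power_law(cost, *COST_TO_SECONDS)


def targets_within_budget(candidates, budget_seconds):
    """Keep the candidate names whose predicted runtime fits the budget."""
    return [name for name, hb in candidates.items()
            if predict_seconds(hb) <= budget_seconds]


candidates = {"small": 1e3, "medium": 1e5, "large": 1e7}  # made-up hb counts
print(targets_within_budget(candidates, budget_seconds=100.0))
```

Swapping in fitted coefficients for each stage would turn this into a budget filter that runs before any benchmark does.
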
## Layout

```
benchstats/   # the package: load.py, fit.py, plot.py, shard_model.py, __main__.py (CLI)
data/
  aiur/   cost.csv                      FFT cost + exec/prove time + peak RAM, per constant
          features.csv                  native profiler features + measured FFT
          predicted_vs_actual.csv
  zisk/   single_shard.csv              full-closure single-constant: features + cycles
          multi_shard.csv               per-shard: features + cycles
          prove_real.csv                real GPU proves: steps, prove time, peak RAM
docs/     aiur-cost-dataset.md          FFT → exec/prove/RAM models + the full table
          aiur-fft-cost-model.md        predicting FFT from features (cost drivers, ceiling)
          aiur-predicted-vs-actual.md   per-constant predicted vs actual
          zisk-prove-validation.md      shard.rs model vs real GPU proves (R²)
figures/  # rendered PNGs (committed; regenerate any time; `--svg` output is gitignored; view with `kitten icat`)
```

## Usage

The `nix develop` dev shell provides `python3` with matplotlib, so just:

```bash
python3 -m benchstats all # render every graph to figures/
python3 -m benchstats aiur-runtime # or one graph; add --svg for vector
```

Outside Nix, `pip install matplotlib` into a venv first.

## Graph catalogue

| command | graph | data | status |
|---|---|---|---|
| `aiur-predictor` | profiled `hb/subst/defeq/bytes` vs **FFT** (one panel each, power-law fit) | `data/aiur/features.csv` | ✅ n=65 |
| `aiur-runtime` | **FFT** vs exec {time, throughput, RAM}; and vs prove {time, throughput, RAM} (two figures, stacked panels) | `data/aiur/cost.csv` | ✅ n=98; RAM is the *process* peak (execute+prove — for completers the peak is in prove, so exec-phase RAM isn't separated) |
| `zisk-predictor` | profiled features vs **cycles**, single-shard **and** multi-shard (two figures) | `data/zisk/{single,multi}_shard.csv` | ✅ single n=66, multi n=12 |
| `zisk-runtime` | **cycles** vs prove {time, throughput, RAM} — projection of the `shard.rs` model | constants read from `shard.rs` | ✅ model projection (no fit) |
| `zisk-prove-validate` | the **actual `shard.rs` model** (`50+33·B` RAM, `54+158·B` time) vs **real GPU proves**, with R² | `data/zisk/prove_real.csv` | ✅ n=31 leaves (single + sharded, witness 5 & 10); RAM R²=0.88, time R²=0.95; witness=10 ≈ no RAM change — see `docs/zisk-prove-validation.md` |

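The R² reported by `zisk-prove-validate` scores a model whose constants are fixed in advance rather than fitted. A minimal sketch of that kind of score follows, assuming the usual definition `1 - SS_res / SS_tot`. The function name `r2_fixed_model` and the sample measurements are made up for illustration, and the actual computation in `plot.py` may differ.

```python
def r2_fixed_model(xs, ys, model):
    """Coefficient of determination of a fixed (unfitted) model against data."""
    mean_y = sum(ys) / len(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - model(x)) ** 2 for x, y in zip(xs, ys))
    return 1.0 - ss_res / ss_tot if ss_tot else 0.0


def ram_model(steps):
    """RAM model quoted in the table above: 50 + 33 * B GiB, B = steps / 1e9."""
    return 50.0 + 33.0 * steps / 1e9


# Made-up measurements (steps, peak RAM in GiB), for illustration only.
steps = [0.5e9, 1.0e9, 2.0e9, 4.0e9]
peak_ram_gib = [68.0, 81.0, 118.0, 180.0]

print(f"R^2 = {r2_fixed_model(steps, peak_ram_gib, ram_model):.3f}")
```

Because the model is not fitted to the data, this score can go negative when the model predicts worse than the plain mean of the measurements.
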
> `zisk-runtime` and `zisk-prove-validate` read the model **constants from the
> in-repo planner** (`crates/kernel/src/shard.rs`, the single source of truth) via
> `benchstats/shard_model.py` — they plot the code's model, never a fitted one.
> Override with `--shard-rs <path>` (defaults to the in-repo file; falls back to
> `~/ix/crates/kernel/src/shard.rs` outside a full worktree).

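One way such a reader could work is a regular-expression scan over the Rust source. The sketch below is an assumption about the approach, not the contents of `benchstats/shard_model.py`: it presumes the constants are declared as plain `const NAME: f64 = <number>;` items, `read_constants_from_text` is a hypothetical helper, and the sample source text is invented (its values mirror the `50+33·B` and `54+158·B` formulas in the table).

```python
import re

# Assumed declaration shape: `const NAME: f64 = 50.0;` (optionally `pub`).
CONST_RE = re.compile(
    r"^\s*(?:pub\s+)?const\s+([A-Z0-9_]+)\s*:\s*f64\s*=\s*([0-9_.eE+-]+)\s*;",
    re.MULTILINE,
)


def read_constants_from_text(source, wanted):
    """Return {name: float} for the `wanted` constants found in `source`."""
    found = {name: float(value.replace("_", ""))
             for name, value in CONST_RE.findall(source)}
    missing = [name for name in wanted if name not in found]
    if missing:
        raise KeyError(f"constants not found: {missing}")
    return {name: found[name] for name in wanted}


# Invented sample text standing in for shard.rs.
SAMPLE = """
pub const RAM_BASE_GIB: f64 = 50.0;
pub const RAM_GIB_PER_BCYCLE: f64 = 33.0;
const PROVE_SETUP_SECS: f64 = 54.0;
const PROVE_SECS_PER_BCYCLE: f64 = 158.0;
"""

k = read_constants_from_text(SAMPLE, ["RAM_BASE_GIB", "RAM_GIB_PER_BCYCLE"])
print(k["RAM_BASE_GIB"] + k["RAM_GIB_PER_BCYCLE"] * 2.0)  # RAM model at B = 2
```

Raising on a missing constant keeps the plots from silently drifting away from the planner if a name is ever renamed in the Rust code.
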
## Figures

Rendered to `figures/` by `python3 -m benchstats all` and committed for reference
(regenerate any time — they're deterministic from the data + `shard.rs`).

### Aiur
| profiled features → FFT | FFT → execution | FFT → proving |
|---|---|---|
| ![](figures/aiur_features_vs_fft.png) | ![](figures/aiur_exec_vs_fft.png) | ![](figures/aiur_prove_vs_fft.png) |

### Zisk
| features → cycles (single-shard) | features → cycles (multi-shard) |
|---|---|
| ![](figures/zisk_single_features_vs_cycles.png) | ![](figures/zisk_multi_features_vs_cycles.png) |

| cycles → proving (`shard.rs` model projection) | **`shard.rs` model vs real GPU proves** |
|---|---|
| ![](figures/zisk_prove_vs_cycles.png) | ![](figures/zisk_prove_validation.png) |

## Data provenance

- **Aiur FFT cost** (`computeStats.totalFftCost`) and exec/prove/RAM: the on-tree
`bench-typecheck --ixe` / `ix check --stats-out` (full-closure `verify_claim`).
- **Native profiler features** (`hb/subst/defeq/bytes`): the `ix` repo's
`only_const_profile` example (`IX_PROFILE_COUNTERS=1`, out of circuit).
- **Zisk cycles** and the prove-time/RAM model: `zisk-host --execute` and the
shard-prover measurements — see `ix` repo `docs/zisk-cycle-cost-model.md`,
the companion to `docs/aiur-fft-cost-model.md`.

Regenerate the underlying data with the commands in those docs; this project
just models and plots it.
4 changes: 4 additions & 0 deletions Benchmarks/Statistics/benchstats/__init__.py
@@ -0,0 +1,4 @@
"""benchstats — model & plot measured Aiur / Zisk benchmark data.

See README.md for the graph catalogue. Run `python -m benchstats --help`.
"""
125 changes: 125 additions & 0 deletions Benchmarks/Statistics/benchstats/__main__.py
@@ -0,0 +1,125 @@
"""CLI: render the benchmark cost-model graphs.

  python -m benchstats all                  # every graph
  python -m benchstats aiur-predictor       # profiled features -> FFT
  python -m benchstats aiur-runtime         # FFT -> exec/prove time, throughput, RAM
  python -m benchstats zisk-predictor       # profiled features -> cycles (single + multi shard)
  python -m benchstats zisk-runtime         # cycles -> prove time/throughput/RAM (shard.rs model)
  python -m benchstats zisk-prove-validate  # shard.rs model vs real GPU proves (with R²)

Figures are written to `figures/` (view with `kitten icat`); `--svg` output is gitignored.
"""
import argparse

from . import load, plot
from .plot import ENV_COLOR


def _project(rows, keep):
    """Keep only `keep` columns (drop incidental ones like shard/blocks)."""
    return [{k: r[k] for k in keep if k in r} for r in rows]


def aiur_predictor(a):
    rows = load.load("aiur/features.csv")
    pts = _project(rows, ["name", "env", "hb", "subst", "defeq", "bytes", "fft"])
    plot.feature_panels(pts, "fft", "Aiur FFT cost", "aiur_features_vs_fft.png",
                        "Aiur — profiled features vs FFT cost", a.svg)


def aiur_runtime(a):
    rows = load.load("aiur/cost.csv")
    fft, exec_us, prove_ms, rss_kb, kept = load.numeric(
        rows, "fft_cost", "exec_us", "prove_ms", "peak_rss_kb")
    colors = [ENV_COLOR.get(r.get("env", ""), "#1f77b4") for r in kept]
    exec_ms = [v / 1e3 for v in exec_us]
    exec_tput = [f / (e / 1e6) for f, e in zip(fft, exec_us)]    # FFT-units / s
    prove_tput = [f / (p / 1e3) for f, p in zip(fft, prove_ms)]  # FFT-units / s
    ram_gb = [v / 1024 / 1024 for v in rss_kb]
    plot.cost_vs_runtime(fft, "Aiur FFT cost", [
        dict(label="exec time", y=exec_ms, ylabel="exec time (ms)", yscale="log", fit="loglog"),
        dict(label="exec throughput", y=exec_tput, ylabel="exec throughput\n(FFT-units/s)", yscale="linear"),
        dict(label="peak RAM", y=ram_gb, ylabel="peak RAM (GB)", yscale="log", fit="loglog",
             note="process peak RSS (execute+prove; peak is in prove for completers)"),
    ], "aiur_exec_vs_fft.png", "Aiur — FFT cost vs execution", colors, a.svg)
    plot.cost_vs_runtime(fft, "Aiur FFT cost", [
        dict(label="prove time", y=prove_ms, ylabel="prove time (ms)", yscale="log", fit="loglog"),
        dict(label="prove throughput", y=prove_tput, ylabel="prove throughput\n(FFT-units/s)", yscale="linear"),
        dict(label="peak RAM", y=ram_gb, ylabel="peak RAM (GB)", yscale="log", fit="loglog"),
    ], "aiur_prove_vs_fft.png", "Aiur — FFT cost vs proving", colors, a.svg)


def zisk_predictor(a):
    single = _project(load.load("zisk/single_shard.csv"),
                      ["name", "hb", "bytes", "subst", "cycles"])
    plot.feature_panels(single, "cycles", "Zisk cycles (steps)",
                        "zisk_single_features_vs_cycles.png",
                        "Zisk single-shard (full-closure) — features vs cycles", a.svg)
    multi = _project(load.load("zisk/multi_shard.csv"),
                     ["shard", "hb", "bytes", "subst", "cycles"])
    plot.feature_panels(multi, "cycles", "Zisk cycles (steps)",
                        "zisk_multi_features_vs_cycles.png",
                        "Zisk multi-shard (per shard) — features vs cycles", a.svg)


def zisk_runtime(a):
    # Project the actual shard.rs prove model (constants read from the code) over
    # a cycle range. x = steps; B = steps in billions. See zisk-prove-validate for
    # the same model checked against real GPU proves.
    from . import shard_model
    k = shard_model.read_constants(a.shard_rs)
    B = lambda s: s / 1e9
    plot.model_lines((2.7e8, 5.0e9), [
        dict(label=f"{k['RAM_BASE_GIB']:.0f} + {k['RAM_GIB_PER_BCYCLE']:.0f}·B",
             fn=lambda s: k["RAM_BASE_GIB"] + k["RAM_GIB_PER_BCYCLE"] * B(s),
             ylabel="peak host RAM (GiB)", yscale="linear"),
        dict(label=f"{k['PROVE_SETUP_SECS']:.0f} + {k['PROVE_SECS_PER_BCYCLE']:.0f}·B",
             fn=lambda s: k["PROVE_SETUP_SECS"] + k["PROVE_SECS_PER_BCYCLE"] * B(s),
             ylabel="GPU prove time (s)", yscale="linear"),
        # fn returns millions of steps per second, so the label says so too.
        dict(label="M steps/s",
             fn=lambda s: (s / 1e6) / (k["PROVE_SETUP_SECS"] + k["PROVE_SECS_PER_BCYCLE"] * B(s)),
             ylabel="prove throughput\n(M steps/s)", yscale="linear"),
    ], "zisk_prove_vs_cycles.png",
        "Zisk — cycles vs proving (shard.rs model projection)", a.svg)


def zisk_prove_validate(a):
    """Real GPU proves (zisk-host, RTX PRO 6000) vs the actual shard.rs model."""
    from . import shard_model
    k = shard_model.read_constants(a.shard_rs)
    steps, ram, leaf, wall, kept = load.numeric(
        load.load("zisk/prove_real.csv"), "steps", "peak_ram_gib", "leaf_secs", "wall_secs")
    kinds = [f"{r.get('kind','single')} w{r.get('witness','5')}" for r in kept]
    ram_model = lambda s: k["RAM_BASE_GIB"] + k["RAM_GIB_PER_BCYCLE"] * s / 1e9
    time_model = lambda s: k["PROVE_SETUP_SECS"] + k["PROVE_SECS_PER_BCYCLE"] * s / 1e9
    plot.model_vs_data([
        dict(x=steps, y=ram, model=ram_model, ylabel="peak host RAM (GiB)", groups=kinds,
             formula=f"{k['RAM_BASE_GIB']:.0f} + {k['RAM_GIB_PER_BCYCLE']:.0f}·Bsteps"),
        dict(x=steps, y=leaf, model=time_model, ylabel="prove time (s)", groups=kinds,
             y2=wall, y2label="measured (wall, incl. setup)",
             formula=f"{k['PROVE_SETUP_SECS']:.0f} + {k['PROVE_SECS_PER_BCYCLE']:.0f}·Bsteps"),
    ], "zisk_prove_validation.png",
        "Zisk shard.rs cost model vs real GPU proves (RTX PRO 6000 Blackwell)", a.svg)


CMDS = {"aiur-predictor": aiur_predictor, "aiur-runtime": aiur_runtime,
        "zisk-predictor": zisk_predictor, "zisk-runtime": zisk_runtime,
        "zisk-prove-validate": zisk_prove_validate}


def main():
    ap = argparse.ArgumentParser(prog="benchstats", description=__doc__,
                                 formatter_class=argparse.RawDescriptionHelpFormatter)
    ap.add_argument("graph", choices=["all"] + list(CMDS), help="which graph(s) to render")
    ap.add_argument("--svg", action="store_true", help="also write .svg")
    ap.add_argument("--shard-rs", default=None,
                    help="path to the ix repo's crates/kernel/src/shard.rs "
                         "(model constants source; defaults to the in-repo file, "
                         "falling back to ~/ix/crates/kernel/src/shard.rs)")
    a = ap.parse_args()
    todo = CMDS.values() if a.graph == "all" else [CMDS[a.graph]]
    for fn in todo:
        fn(a)


if __name__ == "__main__":
    main()
69 changes: 69 additions & 0 deletions Benchmarks/Statistics/benchstats/fit.py
@@ -0,0 +1,69 @@
"""Least-squares estimators shared by the Aiur and Zisk cost models.

Two regimes show up across the benchmarks:
* profiled features -> cost (predict FFT / cycles from cheap counters)
* cost -> runtime/RAM (predict time / throughput / RAM from cost)

Costs span several orders of magnitude, so fits minimise *relative* error
(WLS with 1/y^2 weights) and we also expose a log-log power-law fit.
"""
import math


def nlogn(x: float) -> float:
    """`x * log2(x + 2)` feature, modelled on the Aiur FFT form (= sum w*h*log2 h).

    The `+ 2` offset keeps the logarithm defined and at least 1 for every x >= 0.
    """
    return x * math.log2(x + 2.0)


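def _solve(A, b):
    """Solve `A x = b` by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix [A | b]
    for i in range(n):
        # Swap in the row with the largest pivot candidate for column i.
        piv = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[piv] = M[piv], M[i]
        d = M[i][i]
        if abs(d) < 1e-300:
            raise ValueError("singular normal matrix")
        M[i] = [v / d for v in M[i]]
        # Eliminate column i from every other row.
        for r in range(n):
            if r != i:
                f = M[r][i]
                M[r] = [a - f * c for a, c in zip(M[r], M[i])]
    return [M[i][n] for i in range(n)]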


def wls(feature_cols, y, relative=True, intercept=True):
    """Weighted least squares. `feature_cols` is a list of equal-length columns.

    Returns dict(coef, r2 (weighted), mape (%), pred). With `relative`, weights
    are 1/y^2 so the fit targets relative error (right for multi-decade costs).
    """
    n = len(y)
    p = len(feature_cols) + (1 if intercept else 0)
    rows = [[col[i] for col in feature_cols] + ([1.0] if intercept else []) for i in range(n)]
    w = [1.0 / (v * v) if (relative and v) else 1.0 for v in y]
    # Normal equations: (X^T W X) coef = X^T W y.
    A = [[sum(w[i] * rows[i][j] * rows[i][k] for i in range(n)) for k in range(p)] for j in range(p)]
    c = [sum(w[i] * rows[i][j] * y[i] for i in range(n)) for j in range(p)]
    coef = _solve(A, c)
    pred = [sum(coef[j] * rows[i][j] for j in range(p)) for i in range(n)]
    ybar = sum(w[i] * y[i] for i in range(n)) / sum(w)
    sst = sum(w[i] * (y[i] - ybar) ** 2 for i in range(n))
    ssr = sum(w[i] * (y[i] - pred[i]) ** 2 for i in range(n))
    # MAPE is undefined where y == 0, so average over the non-zero rows only.
    nz = [i for i in range(n) if y[i]]
    mape = (sum(abs(pred[i] - y[i]) / y[i] for i in nz) / len(nz) * 100) if nz else 0.0
    return dict(coef=coef, r2=(1 - ssr / sst if sst else 0.0), mape=mape, pred=pred)


def loglog(x, y):
    """Power-law fit y = a * x^b by OLS in log space. Returns (a, b, r2_log).

    Requires every x and y to be strictly positive (`math.log` raises otherwise).
    """
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(x)
    sx, sy = sum(lx), sum(ly)
    sxx = sum(v * v for v in lx)
    sxy = sum(a * b for a, b in zip(lx, ly))
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope in log space = exponent
    a = (sy - b * sx) / n                          # intercept in log space = log(a)
    yb = sy / n
    sst = sum((v - yb) ** 2 for v in ly)
    ssr = sum((ly[i] - (a + b * lx[i])) ** 2 for i in range(n))
    return math.exp(a), b, (1 - ssr / sst if sst else 0.0)
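
To see why the module docstring prefers relative-error weights for multi-decade data, consider a one-coefficient fit `y ≈ c·x` with no intercept. The sketch below is a standalone illustration with made-up numbers, not a call into `wls`; `fit_slope` is a hypothetical helper that solves the same weighted normal equation in closed form.

```python
def fit_slope(xs, ys, relative):
    """Closed-form weighted least squares for y ~ c * x (no intercept).

    With `relative`, each point gets weight 1 / y**2, so the fit minimises
    the sum of squared *relative* residuals ((y - c*x) / y)**2.
    """
    ws = [1.0 / (y * y) if relative else 1.0 for y in ys]
    num = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    den = sum(w * x * x for w, x in zip(ws, xs))
    return num / den


# Made-up data: y = 2x over three decades, except the largest point is 50% high.
xs = [1.0, 10.0, 100.0, 1000.0]
ys = [2.0, 20.0, 200.0, 3000.0]

c_abs = fit_slope(xs, ys, relative=False)  # dominated by the largest point
c_rel = fit_slope(xs, ys, relative=True)   # every point counts equally in % terms
print(f"unweighted c = {c_abs:.3f}, relative-error c = {c_rel:.3f}")
```

With unit weights the single largest observation effectively sets the slope, while the `1/y^2` weights keep the estimate close to the value the three smaller points agree on.
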
35 changes: 35 additions & 0 deletions Benchmarks/Statistics/benchstats/load.py
@@ -0,0 +1,35 @@
"""CSV loading for the benchmark datasets under `data/`."""
import csv
import os

ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
DATA = os.path.join(ROOT, "data")
FIGURES = os.path.join(ROOT, "figures")


def load(rel_path):
    """Load `data/<rel_path>` as a list of dict rows."""
    with open(os.path.join(DATA, rel_path)) as f:
        return list(csv.DictReader(f))


def numeric(rows, *names, require=True):
    """Return parallel float columns for `names`, dropping any row where a
    requested field is missing / non-numeric (e.g. DNF). Returns (cols..., kept_rows)."""
    cols = [[] for _ in names]
    kept = []
    for r in rows:
        vals = []
        ok = True
        for nm in names:
            v = r.get(nm, "")
            try:
                vals.append(float(v))
            except (TypeError, ValueError):
                ok = False
                break
        if ok:
            for i, v in enumerate(vals):
                cols[i].append(v)
            kept.append(r)
    return (*cols, kept)
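
A usage sketch for the row-dropping behaviour described in the `numeric` docstring. To stay self-contained it uses a condensed stand-in, `numeric_columns` (a hypothetical name), rather than importing `benchstats.load`, and the CSV text is invented. One detail worth knowing: Python's `float()` accepts the strings `nan` and `inf`, so cells holding those values parse successfully and are kept unless filtered separately.

```python
import csv
import io


def numeric_columns(rows, *names):
    """Condensed stand-in for `numeric`: parallel float columns plus kept rows."""
    cols = [[] for _ in names]
    kept = []
    for row in rows:
        try:
            vals = [float(row.get(name, "")) for name in names]
        except (TypeError, ValueError):
            continue  # missing or non-numeric field (e.g. "DNF"): drop the row
        for col, val in zip(cols, vals):
            col.append(val)
        kept.append(row)
    return (*cols, kept)


# Invented CSV text; the second row did not finish proving.
CSV_TEXT = "name,fft_cost,prove_ms\na,1000,12.5\nb,2000,DNF\nc,4000,51\n"
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

fft, prove_ms, kept = numeric_columns(rows, "fft_cost", "prove_ms")
print(fft, prove_ms, [r["name"] for r in kept])
```

Returning the kept rows alongside the columns is what lets callers such as `aiur_runtime` look up per-row attributes (for example `env`) that stay aligned with the numeric data.
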