ForecastLite-Selector

A transparent, reproducible, leakage-safe reliability audit for deterministic short-horizon pedestrian trajectory baselines under observation noise and nonlinear motion.

Conclusion: ForecastLite-Selector is not a new state-of-the-art trajectory forecasting model. It is a reproducible study that maps when simple deterministic predictors work, when they fail, how their ranking changes under observation corruption, and how much a leakage-safe observed-only selector can reduce mixed-corruption ADE and oracle regret under declared synthetic corruption mixtures.

Current Status
Final Conclusion
Key Results
What This Project Studies
What This Project Does Not Claim
Data and Protocol
Method Overview
ForecastLite-Selector
Deterministic Reliability Envelope
Reproduce From Scratch
Expected Results Directory
How to Read Results
Code Structure
Script Map
Tests and Quality Gates
Paper and Documents
Claim Hygiene
Best Next Steps
FAQ
License and Data Notes

Current Status

This working tree keeps the source code, configuration files, tests, manuscript draft, and reproducibility log. The results/ directory is generated output and is ignored by .gitignore, so a normal checkout may not contain the full result artifacts. That is expected. The result files are reproducible from the commands below, and they are omitted from version control to avoid storing several gigabytes of derived artifacts in the repository.

The latest complete reproduction record is:

doc/FORECASTLITE_REPRODUCIBILITY_RUN_LOG_2026-06-15.md

That log records the formal commands, command outputs, generated artifacts, and how the key numbers used in the manuscript were regenerated.

Primary project documents:

doc/FORECASTLITE_SELECTOR_MANUSCRIPT_DRAFT_2026-06-14.md: manuscript draft.
doc/FORECASTLITE_COMPLETE_BEGINNER_GUIDE_ZH_2026-06-15.md: complete beginner-level guide in Chinese.
doc/FORECASTLITE_REPOSITIONING_LITERATURE_AUDIT_ZH_2026-06-14.md: literature-positioning audit in Chinese.
doc/forecastlite_selector_references.bib: BibTeX references.

Final Conclusion

The final research conclusion is:

Under ETH/UCY 8-observation / 12-prediction evaluation, transparent deterministic pedestrian predictors exchange dominance across motion and observation-corruption regimes. A leakage-safe observed-only selector can reduce mixed-corruption ADE and oracle regret under declared synthetic corruption mixtures, but no deterministic policy is universally best. The strongest current contribution is a reproducible deterministic reliability audit, not a SOTA learned forecaster, not a deployment guarantee, and not evidence that synthetic corruption fully matches real perception noise.

In practical terms:

On clean data, last-step constant velocity and Kalman constant velocity are both strong.
After observation corruption is introduced, the best predictor changes by corruption type, corruption level, motion regime, and scene.
forecastlite_loso_threshold_selector improves over fixed OLS-4 and OLS-8 on the declared full Gaussian-jitter mixed audit.
The expanded corruption grid has no universal winner.
The Motion LOSO ablation is slightly better on the uniform expanded mixture, but it is worse on Gaussian jitter, so it should remain an ablation rather than the headline method.
The largest failures concentrate in ETH homography-bias, abrupt-stop, and strong-turn cases, often where the oracle predictor is stationary but the selector chooses OLS-4 or OLS-8 and keeps extrapolating motion.

Key Results

The numbers below come from the complete 2026-06-15 reproducibility log and the manuscript draft. Running the commands in Reproduce From Scratch regenerates the corresponding results/ files.

Data Scale

Item	Value
Dataset	ETH/UCY via Trajectron++
Scenes	ETH, Hotel, Univ, Zara1, Zara2
Valid dense stride-1 windows	34,161
Trajectory groups	1,219
Observation length	8 steps
Prediction length	12 steps
Sampling interval	0.4 s
Prediction horizon	4.8 s

Scene breakdown:

Scene	Windows	Tracks
ETH	364	44
Hotel	1,197	122
Univ	24,334	722
Zara1	2,356	142
Zara2	5,910	189

Clean ETH/UCY

Clean pooled ADE, lowest first (last-step CV wins):

Model	Pooled ADE
Last-step CV	0.481554 m
Kalman CV	0.494961 m
OLS-4	0.541744 m
OLS-8	0.667404 m
Stationary	1.442155 m

Clean macro ADE, lowest first (Kalman CV wins):

Model	Macro ADE	95% grouped CI	Macro FDE
Kalman CV	0.529111 m	[0.479361, 0.584796]	1.134871 m
Last-step CV	0.534033 m	[0.486211, 0.587942]	1.147595 m
OLS-4	0.551975 m	[0.497044, 0.610357]	1.154689 m
OLS-8	0.648164 m	[0.577081, 0.712963]	1.273103 m
Stationary	1.726353 m	[1.582527, 2.032697]	3.109385 m

Interpretation:

Clean pooled data favors last-step CV.
Clean macro data favors Kalman CV.
Univ contributes 24,334 of the 34,161 windows (about 71%), so pooled and macro evaluation answer different questions.
Constant velocity being strong is not a new discovery. Prior work already established that CV-style baselines are hard to dismiss on ETH/UCY.

Full Gaussian-Jitter Audit

Command:

python scripts/run_selector.py --config configs/noise_study.yaml --output-dir results/selector_full --noise-sigmas 0 0.02 0.05 0.1 0.2 --noise-seeds 42 43 44 45 46

Mixed Gaussian-jitter ADE:

Policy / Model	Mixed ADE	95% grouped CI	Mean regret
LOSO threshold selector	0.662352 m	[0.630250, 0.698300]	0.208877 m
Motion LOSO selector	0.663879 m	[0.631394, 0.699734]	0.210404 m
Fixed ForecastLite selector	0.663932 m	[0.631355, 0.699853]	0.210457 m
OLS-8	0.716557 m	[0.680635, 0.754783]	0.263083 m
OLS-4	0.720154 m	[0.696024, 0.746804]	0.266680 m
Tree selector	0.742189 m	[0.706919, 0.782720]	0.288715 m
Kalman CV	0.903828 m	[0.885148, 0.924602]	0.450353 m
Last-step CV	1.168029 m	[1.151870, 1.185559]	0.714554 m
Stationary	1.463198 m	[1.366109, 1.566100]	1.009723 m

Paired matched-window deltas:

Comparison	Mean ADE delta	95% CI	Interpretation
LOSO selector - OLS-8	-0.054206 m	[-0.058516, -0.050375]	LOSO has lower ADE
LOSO selector - OLS-4	-0.057803 m	[-0.066286, -0.048296]	LOSO has lower ADE

Interpretation:

The selector improvement is distinguishable from zero (both paired 95% CIs exclude zero) but modest in size.
The result is roughly a 7.6%-8.0% mixed-distribution ADE reduction over strong fixed OLS baselines.
This should not be described as a forecasting breakthrough.
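
The 7.6%-8.0% range can be re-derived from the two tables above by dividing each paired delta by the matching fixed-baseline mixed ADE. The snippet below is only an arithmetic check on numbers already quoted in this README; the variable names are illustrative.

```python
# Relative mixed-ADE reduction of the LOSO selector over each fixed OLS baseline.
# All inputs are copied from the tables above; nothing here is re-measured.
baseline_ade = {"OLS-8": 0.716557, "OLS-4": 0.720154}    # mixed ADE, metres
paired_delta = {"OLS-8": -0.054206, "OLS-4": -0.057803}  # LOSO minus baseline, metres

relative_reduction = {
    name: -paired_delta[name] / baseline_ade[name] for name in baseline_ade
}

for name, value in relative_reduction.items():
    print(f"LOSO vs {name}: {value:.1%} lower mixed ADE")
```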

Expanded Corruption Audit

Expanded corruption types:

gaussian_jitter
final_step_spike
quantization
timestamp_jitter
homography_bias
consecutive_dropout

Levels:

0.05
0.10
0.20

Seeds:

42
43
44

Generated scale in the verified run:

Artifact	Rows
`results/selector_corruption_expanded/base_metrics_by_window.parquet`	9,223,470
`selector_metrics.csv`	1,844,694
`loso_threshold_selector_metrics.csv`	1,844,694
`tree_selector_metrics.csv`	1,844,694

Single-corruption winners:

Corruption	Best policy / model	Mixed ADE	Main lesson
consecutive dropout	LOSO / fixed / Motion / last-step CV tie	0.485897 m	Observed missingness is recoverable by last-step-style behavior
gaussian jitter	LOSO / fixed selector	0.726233 m	Selector improves over OLS-8 on this expanded subset
final-step spike	Motion LOSO	0.705787 m	Speed-trend features help; OLS-8 is close at 0.706699 m
homography bias	Last-step CV	0.657662 m	Selector has a clear feature gap
quantization	OLS-4	0.582928 m	Short-window smoothing wins
timestamp jitter	Kalman CV	0.498549 m	Timestamp-aware smoothing helps

Uniform expanded mixture:

Policy	Mixture ADE
Motion LOSO selector	0.628318 m
LOSO threshold selector	0.629112 m
Fixed selector	0.629112 m

Interpretation:

Motion LOSO is useful as an ablation.
It is not the headline selector because it worsens Gaussian jitter: LOSO/fixed score 0.726233 m, while Motion LOSO scores 0.733404 m.
The safest paper claim is mixture-dependent improvement, not universal dominance.

Tail Risk and Failure Concentration

Expanded audit, all corruption types:

Model / Policy	Mean regret	P95 regret	Regret >= 1m rate	Max regret
LOSO threshold selector	0.175440 m	0.748423 m	0.027612	7.075210 m
Fixed selector	0.175440 m	0.748423 m	0.027612	7.075210 m
Motion LOSO selector	0.174646 m	0.749803 m	0.027347	6.844848 m
OLS-4	0.204330 m	0.796091 m	0.028246	6.844848 m
OLS-8	0.259864 m	1.048856 m	0.056222	7.269652 m
Kalman CV	0.288400 m	1.290415 m	0.078545	6.972805 m
Last-step CV	0.404088 m	1.879473 m	0.128180	10.001165 m

Largest LOSO failure concentration:

Scene	Corruption	Mean regret	P95 regret	Max regret	Regret >= 1m rate	Common oracle	Common selected model
ETH	homography_bias	0.632800 m	2.380453 m	7.075210 m	0.218864	stationary	OLS-4

Interpretation:

The selector improves mean and some tail metrics relative to several fixed baselines.
It does not eliminate high-regret failures.
ETH abrupt-stop, strong-turn, and homography-like bias cases remain the most visible weaknesses.

What This Project Studies

ForecastLite-Selector studies a narrow but important question:

In short-horizon pedestrian trajectory prediction, if we only use transparent, deterministic predictors based on observed positions and timestamps, when are they reliable, when do they fail, and can an observed-only selector choose a better simple predictor from the available family?

More concretely, the project studies:

clean-data performance of deterministic predictors;
how noise, missing observations, timestamp irregularity, quantization, and homography-like bias change model ranking;
why pooled and macro summaries can disagree;
whether selector evaluation is free of future leakage;
selector regret relative to a deterministic oracle;
whether tail risk reveals failures hidden by mean ADE;
which claims are supported by evidence and which claims are not.

The project is suitable as:

a pedestrian trajectory baseline study;
an embodied-AI or robotics interview project;
a reliability-audit and reproducibility case study;
a sanity-check layer for larger learned forecasting systems;
an example of careful claim hygiene in an empirical ML paper.

What This Project Does Not Claim

ForecastLite-Selector does not claim that it:

is a state-of-the-art trajectory forecasting model;
beats Social LSTM, Social GAN, Trajectron++, NATRA, TrajImpute, or other learned methods;
models social interaction;
models multimodality;
generates probabilistic trajectories;
guarantees collision avoidance or planning safety;
proves that synthetic corruption is equivalent to real detector/tracker noise;
transfers to all datasets, camera setups, or embodied-agent settings;
proves that ADE/FDE are sufficient for social plausibility or safety.

These non-claims are part of the project quality. They prevent a reliability audit from being misrepresented as a model paper with broader evidence than the experiments actually provide.

Data and Protocol

Data Source

The data comes from the ETH/UCY split files curated in the Trajectron++ repository.

Recorded upstream commit:

1031c7bd1a444273af378c1ec1dcca907ba59830

This project expects the raw files at:

data/raw/eth_ucy/

That directory is ignored by .gitignore. Raw data should be reviewed against upstream licenses before redistribution.

Raw Data Schema

Each row has:

frame_id    track_id    pos_x    pos_y

Meaning:

frame_id: original frame number;
track_id: pedestrian trajectory ID;
pos_x, pos_y: world coordinates in meters.

Standard Protocol

Parameter	Value
Observed steps	8
Predicted steps	12
Sampling interval	0.4 s
Raw-frame step	10 frames
Standard window stride	1
Evaluated split	`test`
Scenes	`eth`, `hotel`, `univ`, `zara1`, `zara2`

A valid window must satisfy:

one consistent track_id;
enough length for 8 observed steps plus 12 future steps;
continuous frame spacing;
finite coordinates;
no duplicate frame/track pair.
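
The validity conditions above can be sketched as a single check. This is a minimal illustration, not the code in src/forecastlite/windows.py: the function name, the tuple layout, and the default arguments are assumptions taken from the protocol table (8 + 12 steps, 10 raw frames per step).

```python
import math

def is_valid_window(rows, obs_len=8, pred_len=12, frame_step=10):
    """Return True if `rows` forms one valid observation + prediction window.

    `rows` is a list of (frame_id, track_id, pos_x, pos_y) tuples sorted by frame.
    Hypothetical helper for illustration only.
    """
    if len(rows) != obs_len + pred_len:
        return False                                   # not enough (or too many) steps
    if len({track for _, track, _, _ in rows}) != 1:
        return False                                   # more than one track_id
    frames = [frame for frame, _, _, _ in rows]
    if len(set(frames)) != len(frames):
        return False                                   # duplicate frame/track pair
    if any(b - a != frame_step for a, b in zip(frames, frames[1:])):
        return False                                   # gap in frame spacing
    return all(math.isfinite(x) and math.isfinite(y) for _, _, x, y in rows)

good = [(10 * i, 1, 0.4 * i, 0.0) for i in range(20)]
gapped = good[:5] + [(frame + 10, t, x, y) for frame, t, x, y in good[5:]]
print(is_valid_window(good), is_valid_window(gapped))
```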

Method Overview

All predictors implement the same interface:

predict(observed_xy, observed_t, future_t) -> predicted_xy

This guarantees that every model sees the same observed history and the same future timestamps.

Stationary

Assumes the pedestrian stays at the last observed position:

future position = last observed position

Strength:

very strong for low-motion and abrupt-stop cases.

Weakness:

poor for normal walking motion.
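
The shared predict(...) signature and the stationary rule can be sketched together. The Protocol class and the StationaryPredictor name are assumptions for illustration; the project's own definitions live in src/forecastlite/predictors/ and may differ.

```python
from typing import Protocol, Sequence, Tuple

Point = Tuple[float, float]

class Predictor(Protocol):
    """Every base predictor takes the same three inputs (illustrative typing)."""

    def predict(
        self,
        observed_xy: Sequence[Point],
        observed_t: Sequence[float],
        future_t: Sequence[float],
    ) -> list[Point]: ...

class StationaryPredictor:
    """Repeat the last observed position for every future timestamp."""

    def predict(self, observed_xy, observed_t, future_t):
        last = observed_xy[-1]
        return [last for _ in future_t]

model: Predictor = StationaryPredictor()
print(model.predict([(0.0, 0.0), (1.0, 2.0)], [0.0, 0.4], [0.8, 1.2]))
```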

Last-Step Constant Velocity

Estimates velocity from only the last two observed points:

velocity = (last_position - previous_position) / dt

Strengths:

fast response on clean short-term motion;
best clean pooled ADE in this study.

Weakness:

if the final observed step is corrupted, the velocity estimate amplifies that error into the future.
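
A minimal sketch of this predictor, with a toy example of the amplification effect. The function name and the synthetic 1 m/s track are assumptions. A 0.1 m error on the final observed point becomes a 0.25 m/s velocity error at dt = 0.4 s, which grows to 0.1 + 0.25 * 4.8 = 1.3 m at the 4.8 s horizon.

```python
def last_step_cv(observed_xy, observed_t, future_t):
    """Extrapolate with the velocity implied by the last two observed points."""
    (x0, y0), (x1, y1) = observed_xy[-2], observed_xy[-1]
    dt = observed_t[-1] - observed_t[-2]
    vx, vy = (x1 - x0) / dt, (y1 - y0) / dt
    t_last = observed_t[-1]
    return [(x1 + vx * (t - t_last), y1 + vy * (t - t_last)) for t in future_t]

# Synthetic 1 m/s walk along x, sampled every 0.4 s (8 observed, 12 future steps).
obs_t = [0.4 * i for i in range(8)]
fut_t = [0.4 * i for i in range(8, 20)]
clean = [(t, 0.0) for t in obs_t]
spiked = clean[:-1] + [(clean[-1][0] + 0.1, 0.0)]   # 0.1 m error on the final step only

clean_final = last_step_cv(clean, obs_t, fut_t)[-1][0]
spiked_final = last_step_cv(spiked, obs_t, fut_t)[-1][0]
print(f"final-step error after a 0.1 m spike: {spiked_final - clean_final:.2f} m")
```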

OLS-4 / OLS-8 Constant Velocity

Fits a line through the most recent 4 or 8 observed points:

position ~= velocity * time + intercept

Strengths:

smooths observation noise;
can be more stable under quantization or Gaussian jitter.

Weaknesses:

a longer history can respond too slowly;
turns or abrupt speed changes can make linear extrapolation fail.
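
The line fit can be written in closed form per axis. The sketch below is an assumption about one reasonable implementation (ordinary least squares of position against time over the last `window` points), not a copy of linear_regression.py. The toy comparison applies a 0.1 m spike to the final observed point and shows that a longer fitting window dilutes it.

```python
def ols_cv(observed_xy, observed_t, future_t, window=4):
    """Fit position = velocity * time + intercept on the last `window` points."""
    xy, ts = observed_xy[-window:], observed_t[-window:]
    t_mean = sum(ts) / len(ts)
    denom = sum((t - t_mean) ** 2 for t in ts)
    per_axis = []
    for axis in (0, 1):
        values = [point[axis] for point in xy]
        v_mean = sum(values) / len(values)
        slope = sum((t - t_mean) * (v - v_mean) for t, v in zip(ts, values)) / denom
        intercept = v_mean - slope * t_mean
        per_axis.append([slope * t + intercept for t in future_t])
    return list(zip(*per_axis))

# Synthetic 1 m/s walk along x with a 0.1 m spike on the final observed point.
obs_t = [0.4 * i for i in range(8)]
fut_t = [0.4 * i for i in range(8, 20)]
clean = [(t, 0.0) for t in obs_t]
spiked = clean[:-1] + [(clean[-1][0] + 0.1, 0.0)]

def final_error(window):
    return ols_cv(spiked, obs_t, fut_t, window)[-1][0] - ols_cv(clean, obs_t, fut_t, window)[-1][0]

print(f"OLS-4 final-step error: {final_error(4):.3f} m")
print(f"OLS-8 final-step error: {final_error(8):.3f} m")
```

On clean data the same smoothing is what makes a longer window slow to react to a genuine speed change.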

Kalman CV

Uses a constant-velocity Kalman filter:

state = [x, y, vx, vy]
measurement = [x, y]

Strengths:

stable under timestamp jitter and compatible noise assumptions;
best clean macro ADE in this study.

Weakness:

if process-noise or measurement-noise assumptions do not match the data, it can underperform OLS or last-step CV.
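
A compact sketch of a constant-velocity Kalman filter follows. It makes several assumptions that are not taken from this project: the x and y axes are filtered independently (equivalent to the 4-state filter only when the noise covariances are block-diagonal per axis), the process noise is a white-acceleration model with intensity q, and the values of q, r, and the initial velocity variance are placeholders rather than the tuned settings in configs/kalman_grid.yaml.

```python
def kalman_cv_axis(zs, ts, future_t, q=0.5, r=0.05, v_var0=10.0):
    """1-D constant-velocity Kalman filter over positions zs at times ts."""
    p, v = zs[0], 0.0                 # state: position, velocity
    a, b, c = r, 0.0, v_var0          # covariance [[a, b], [b, c]]
    for z, t_prev, t in zip(zs[1:], ts, ts[1:]):
        dt = t - t_prev               # actual timestamp gap, so jitter is respected
        # Predict: x <- F x, P <- F P F^T + Q with F = [[1, dt], [0, 1]].
        p = p + v * dt
        a = a + 2 * dt * b + dt * dt * c + q * dt ** 3 / 3
        b = b + dt * c + q * dt ** 2 / 2
        c = c + q * dt
        # Update with a position-only measurement (H = [1, 0]).
        s = a + r
        k_p, k_v = a / s, b / s
        resid = z - p
        p, v = p + k_p * resid, v + k_v * resid
        a, b, c = (1 - k_p) * a, (1 - k_p) * b, c - k_v * b
    return [p + v * (t - ts[-1]) for t in future_t]

def kalman_cv(observed_xy, observed_t, future_t):
    xs = kalman_cv_axis([pt[0] for pt in observed_xy], observed_t, future_t)
    ys = kalman_cv_axis([pt[1] for pt in observed_xy], observed_t, future_t)
    return list(zip(xs, ys))

obs_t = [0.4 * i for i in range(8)]
fut_t = [0.4 * i for i in range(8, 20)]
walk = [(t, 0.0) for t in obs_t]      # synthetic 1 m/s walk along x
print(kalman_cv(walk, obs_t, fut_t)[-1])
```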

ForecastLite-Selector

The selector does not forecast directly. It chooses among base predictors:

stationary
last_step_cv
ols_4
ols_8
kalman_cv

Core Constraint: Observed-Only Selection

Selector features may only use:

observed positions;
observed timestamps;
future timestamps;
observed-only derived features.

They may not use:

future positions;
future turn labels;
oracle model choice;
ADE/FDE;
regret.

Future-derived information is allowed only for evaluation and failure analysis, not for model selection.

Fixed Selector

The fixed selector uses hand-written rules based on observed speed, displacement, OLS residuals, timestamp irregularity, missingness ratio, and related observed-only features.
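
To make the observed-only constraint concrete, here is a toy sketch in which the feature function simply has no argument through which future positions could enter. The feature names, the rule, and both thresholds are hypothetical and are not the project's selector rules.

```python
import math

def observed_features(observed_xy, observed_t):
    """Features computed from the observed history only."""
    steps = zip(observed_xy, observed_xy[1:], observed_t, observed_t[1:])
    speeds = [math.dist(p0, p1) / (t1 - t0) for p0, p1, t0, t1 in steps]
    return {
        "mean_speed": sum(speeds) / len(speeds),
        "last_speed": speeds[-1],
        "net_displacement": math.dist(observed_xy[0], observed_xy[-1]),
    }

def toy_selector(features, slow=0.1, jumpy=2.0):
    """Pick a base predictor name. Thresholds are made up for illustration."""
    if features["mean_speed"] < slow:
        return "stationary"
    if features["last_speed"] > jumpy * features["mean_speed"]:
        return "ols_8"            # final step looks like an outlier: smooth over it
    return "last_step_cv"

obs_t = [0.4 * i for i in range(8)]
walking = [(t, 0.0) for t in obs_t]
standing = [(0.0, 0.0)] * 8
print(toy_selector(observed_features(walking, obs_t)))
print(toy_selector(observed_features(standing, obs_t)))
```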

LOSO Threshold Selector

LOSO means leave-one-scene-out.

Procedure:

Hold out one scene.
Choose thresholds on the remaining scenes.
Freeze those thresholds.
Evaluate on the held-out scene.
Repeat for all five held-out scenes.
Merge the held-out results.

This is stricter than tuning and testing on the same scene.
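
The six steps above can be sketched as a short loop. Everything in this example is illustrative: the function names, the single-threshold rule, and the toy per-case numbers are assumptions, and the real selector tunes more than one threshold.

```python
def loso_threshold_search(cases_by_scene, candidate_thresholds, evaluate):
    """Leave-one-scene-out: tune on the other scenes, score on the held-out one."""
    held_out_results = {}
    for held_out in cases_by_scene:
        train = [
            case
            for scene, cases in cases_by_scene.items()
            if scene != held_out
            for case in cases
        ]
        best = min(candidate_thresholds, key=lambda th: evaluate(train, th))
        # Threshold is frozen here; the held-out scene never influences it.
        held_out_results[held_out] = (best, evaluate(cases_by_scene[held_out], best))
    return held_out_results

def mean_ade(cases, threshold):
    """Toy rule: use stationary below the speed threshold, otherwise CV."""
    picked = [stat if speed < threshold else cv for speed, stat, cv in cases]
    return sum(picked) / len(picked)

# Each case: (observed speed, ADE if stationary is used, ADE if CV is used). Made-up values.
toy = {
    "eth":   [(0.05, 0.1, 0.6), (1.2, 1.5, 0.4)],
    "hotel": [(0.08, 0.2, 0.5), (0.9, 1.1, 0.3)],
    "univ":  [(0.02, 0.1, 0.7), (1.0, 1.3, 0.5)],
}
result = loso_threshold_search(toy, [0.0, 0.1, 0.5, 2.0], mean_ade)
for scene, (threshold, ade) in result.items():
    print(scene, threshold, round(ade, 3))
```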

Tree Selector

The tree selector is a shallow decision-tree diagnostic. It is not the headline method. It tests whether a more flexible observed-only rule gives a large gain.

Motion LOSO Selector

The Motion LOSO selector adds speed-trend and deceleration features as an ablation. It is slightly better on the expanded uniform mixture, but it worsens Gaussian-jitter performance, so it should not replace the headline selector unless the target distribution is explicitly redefined.

Deterministic Reliability Envelope

The Deterministic Reliability Envelope is one of the central analysis tools in this project.

For each matched case, it asks:

If we are only allowed to choose within the current deterministic predictor family, what is the best possible ADE/FDE after seeing the outcome?

The selector cannot use this information. It is an evaluation-only oracle.

Uses:

compute oracle regret;
separate best fixed baseline, best selector policy, and per-case oracle;
measure how far a selector remains from the best available deterministic family member;
expose tail failures;
avoid hiding severe failures behind mean ADE.

The main reliability audit is generated by:

python scripts/build_reliability_audit_tables.py --gaussian-dir results/selector_full --expanded-dir results/selector_corruption_expanded --output-dir results/reliability_audit

Generated artifacts include:

claim_separation.csv
estimands.csv
claim_ledger.csv
external_validation_design.csv
reliability_envelope_cases.csv
reliability_envelope_summary.csv
regret_dominance_map.csv
mixture_sensitivity.csv
tail_risk_summary.csv
selector_failure_summary.csv
mixture and tail-risk figures.

Reproduce From Scratch

1. Install the Environment

Python 3.11 or newer is required. The latest complete verification used Python 3.13.

python -m pip install -e ".[dev]"

If installing from the lock file manually:

python -m pip install -r requirements-lock.txt
python -m pip install -e .

2. Prepare the Data

Place the ETH/UCY split files at:

data/raw/eth_ucy/

The structure should be readable by configs/benchmark.yaml. Raw data is not redistributed in this repository; review the Trajectron++ and ETH/UCY upstream licenses before redistribution.

3. Run Basic Quality Gates

ruff check .
python -m pytest -q
python -m compileall -q src scripts tests

Expected:

ruff: All checks passed.
pytest: 40 passed.
compileall: exit 0.

4. Run the Core Benchmark

python scripts/inspect_data.py --config configs/benchmark.yaml --output-dir results/data_inspection
python scripts/run_benchmark.py --config configs/benchmark.yaml --output-dir results/core

Generated files:

results/data_inspection/data_provenance.csv
results/data_inspection/window_counts.csv
results/data_inspection/exclusions.csv
results/core/metrics_by_window.parquet
results/core/summary_by_scene.csv
results/core/paired_comparisons.csv
results/core/manifest.json

5. Run Ablation Studies

python scripts/run_ablation.py --config configs/noise_study.yaml --study all --output-dir results/ablations

Generated files:

horizon_summary.csv
observation_length_summary.csv
noise_summary.csv
turn_stratification.csv
stride_sensitivity.csv

Note: this command can run for a while. In the 2026-06-15 reproduction, the wrapper timed out but the Python subprocess continued and completed. Completion was verified by output-file presence and row counts.
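
A small helper along the following lines can repeat the presence and row-count check. It is a sketch that assumes the five summary files sit directly under the --output-dir given above; adjust the path if the script nests them differently.

```python
import csv
from pathlib import Path

EXPECTED = [
    "horizon_summary.csv",
    "observation_length_summary.csv",
    "noise_summary.csv",
    "turn_stratification.csv",
    "stride_sensitivity.csv",
]

def check_outputs(output_dir, expected=EXPECTED):
    """Map each expected file to its data-row count, or None if it is missing."""
    report = {}
    for name in expected:
        path = Path(output_dir) / name
        if not path.is_file():
            report[name] = None
            continue
        with path.open(newline="") as handle:
            rows = sum(1 for _ in csv.reader(handle))
        report[name] = max(rows - 1, 0)   # exclude the header line
    return report

if __name__ == "__main__":
    for name, rows in check_outputs("results/ablations").items():
        status = "MISSING" if rows is None else f"{rows} rows"
        print(f"{name}: {status}")
```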

6. Run the Full Gaussian Selector Audit

python scripts/run_selector.py --config configs/noise_study.yaml --output-dir results/selector_full --noise-sigmas 0 0.02 0.05 0.1 0.2 --noise-seeds 42 43 44 45 46

Then run the derived audits:

python scripts/audit_stationary_branch.py --selector-dir results/selector_full
python scripts/audit_selector_failure_features.py --selector-dir results/selector_full
python scripts/audit_motion_dynamics_features.py --selector-dir results/selector_full
python scripts/run_motion_loso_selector.py --config configs/noise_study.yaml --selector-dir results/selector_full
python scripts/analyze_dominance_stability.py --input-dir results/selector_full
python scripts/plot_selector_failures.py --config configs/noise_study.yaml --selector-dir results/selector_full --output-dir results/selector_full/figures/eth_failure_cases

Finally, build paper tables:

python scripts/summarize_selector_results.py --input-dir results/selector_full --bootstrap-resamples 1000

7. Run the Expanded Corruption Audit

python scripts/run_selector.py --config configs/noise_study.yaml --output-dir results/selector_corruption_expanded --corruption-types gaussian_jitter final_step_spike quantization timestamp_jitter homography_bias consecutive_dropout --corruption-levels 0.05 0.1 0.2 --noise-seeds 42 43 44

Then run the derived audits:

python scripts/audit_stationary_branch.py --selector-dir results/selector_corruption_expanded
python scripts/audit_selector_failure_features.py --selector-dir results/selector_corruption_expanded
python scripts/audit_motion_dynamics_features.py --selector-dir results/selector_corruption_expanded
python scripts/run_motion_loso_selector.py --config configs/noise_study.yaml --selector-dir results/selector_corruption_expanded
python scripts/analyze_dominance_stability.py --input-dir results/selector_corruption_expanded
python scripts/plot_selector_failures.py --config configs/noise_study.yaml --selector-dir results/selector_corruption_expanded --output-dir results/selector_corruption_expanded/figures/eth_failure_cases

Finally, build paper tables:

python scripts/summarize_selector_results.py --input-dir results/selector_corruption_expanded --bootstrap-resamples 1000

Note: the expanded selector audit is the heaviest step. In the 2026-06-15 reproduction, the wrapper timed out but the Python subprocess continued and completed.

8. Generate Paper Figures and Reliability Tables

python scripts/plot_selector_paper_figures.py --gaussian-dir results/selector_full --expanded-dir results/selector_corruption_expanded --output-dir results/paper_figures
python scripts/build_reliability_audit_tables.py --gaussian-dir results/selector_full --expanded-dir results/selector_corruption_expanded --output-dir results/reliability_audit

9. Build the Release Package

python scripts/train_tiny_gru.py
python scripts/build_release.py

train_tiny_gru.py does not train a model. It writes the TinyGRU cancellation record, documenting that TinyGRU is outside the current deterministic core release.

build_release.py requires the core results/ artifacts to exist. If a normal checkout does not contain results/, running it immediately will fail. That is expected.

Expected Results Directory

After a complete reproduction, results/ should contain:

results/
  data_inspection/
  core/
  ablations/
  selector_full/
  selector_corruption_expanded/
  paper_figures/
  reliability_audit/
  release/
  tiny_gru_cancellation_record.md

Core files:

results/core/summary_by_scene.csv
results/selector_full/paper_tables/mixed_corruption_summary.csv
results/selector_full/paper_tables/paired_mixed_summary.csv
results/selector_corruption_expanded/paper_tables/mixed_corruption_summary.csv
results/reliability_audit/expanded/mixture_sensitivity.csv
results/reliability_audit/expanded/tail_risk_summary.csv
results/reliability_audit/expanded/selector_failure_summary.csv

How to Read Results

Pooled vs Macro

pooled:

averages over all windows together;
gives more weight to large scenes;
is strongly affected by Univ because Univ contributes many windows.

macro:

averages within each scene first;
then averages across scenes;
better answers the question "how does this behave across scenes on average?"

Both matter. Pooled results can be dominated by a large scene. Macro results can hide the actual sample distribution.
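
The difference is easy to see on a toy example with one large and one small scene. The function name and the ADE values are made up for illustration.

```python
def pooled_and_macro(ade_by_scene):
    """Return (pooled mean, macro mean) of per-window ADE values."""
    all_windows = [ade for values in ade_by_scene.values() for ade in values]
    pooled = sum(all_windows) / len(all_windows)
    scene_means = [sum(values) / len(values) for values in ade_by_scene.values()]
    macro = sum(scene_means) / len(scene_means)
    return pooled, macro

# One large easy scene and one small hard scene.
toy = {"univ": [0.4] * 8, "eth": [1.0] * 2}
pooled, macro = pooled_and_macro(toy)
print(f"pooled={pooled:.2f} m, macro={macro:.2f} m")
```

Here the pooled mean sits close to the large scene's value, while the macro mean weights both scenes equally.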

ADE vs FDE

ADE means:

Average Displacement Error

It is the average Euclidean distance error across all future timesteps.

FDE means:

Final Displacement Error

It only evaluates the last future timestep.

ADE/FDE are useful deterministic centerline diagnostics, but they do not prove safety, collision avoidance, or social plausibility.
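
Both metrics reduce to a few lines. This is a minimal sketch with an assumed function name; the example scores a stationary prediction against a pedestrian who keeps walking 0.4 m per step.

```python
import math

def ade_fde(predicted_xy, true_xy):
    """Average and final Euclidean displacement error over the future steps."""
    errors = [math.dist(pred, true) for pred, true in zip(predicted_xy, true_xy)]
    return sum(errors) / len(errors), errors[-1]

truth = [(0.4 * step, 0.0) for step in range(1, 13)]   # 12 future steps
stationary_prediction = [(0.0, 0.0)] * 12
ade, fde = ade_fde(stationary_prediction, truth)
print(f"ADE={ade:.2f} m, FDE={fde:.2f} m")
```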

Oracle Regret

Oracle regret is:

model ADE - best deterministic family ADE for the same case

If regret is zero, the model is the best available member of the deterministic family for that case.

If regret is large, the model is far from what the current deterministic family could have achieved with hindsight.
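
A per-case sketch of the regret computation follows. The ADE values are invented for illustration; they mimic the failure pattern described earlier, where stationary is the hindsight winner but the selector picks OLS-4.

```python
def oracle_regret(ade_by_model, chosen):
    """Return (oracle model, regret of `chosen`) for one matched case."""
    oracle_model = min(ade_by_model, key=ade_by_model.get)
    return oracle_model, ade_by_model[chosen] - ade_by_model[oracle_model]

# Hypothetical abrupt-stop case: every motion-extrapolating model overshoots.
case = {
    "stationary": 0.2,
    "last_step_cv": 1.4,
    "ols_4": 1.1,
    "ols_8": 1.0,
    "kalman_cv": 1.2,
}
oracle_model, regret = oracle_regret(case, chosen="ols_4")
print(oracle_model, round(regret, 3))
```

The oracle is computed after the outcome is known, so it is usable only for evaluation, never as a selector input.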

Dominance Map

A dominance map answers:

Under a specific corruption, level, observed regime, or turn bin, which model has the lowest average ADE?

This is more informative than a single global table because the central finding of the project is conditional dominance, not global dominance.
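
A dominance map is a group-by followed by an argmin. The sketch below uses assumed names and made-up ADE values, and keys each condition by (corruption, level).

```python
from collections import defaultdict

def dominance_map(rows):
    """rows: (condition, model, ade). Returns {condition: (best model, mean ADE)}."""
    sums = defaultdict(lambda: defaultdict(lambda: [0.0, 0]))
    for condition, model, ade in rows:
        cell = sums[condition][model]
        cell[0] += ade
        cell[1] += 1
    result = {}
    for condition, models in sums.items():
        means = {model: total / count for model, (total, count) in models.items()}
        best = min(means, key=means.get)
        result[condition] = (best, means[best])
    return result

rows = [  # illustrative values only
    (("quantization", 0.1), "ols_4", 0.5),
    (("quantization", 0.1), "kalman_cv", 0.7),
    (("timestamp_jitter", 0.1), "ols_4", 0.6),
    (("timestamp_jitter", 0.1), "kalman_cv", 0.4),
]
winners = dominance_map(rows)
for condition, (model, ade) in winners.items():
    print(condition, model, ade)
```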

Mixture Sensitivity

Mixture sensitivity answers:

If the weights of corruption types change, which policy is best?

This prevents a winner under one mixture distribution from being overstated as a universal winner.
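
The Key Results numbers already show this effect. The sketch below reweights the Gaussian-jitter share of the expanded mixture for the LOSO and Motion LOSO selectors. It rests on one loud assumption: that the uniform expanded mixture ADE is the equal-weight mean of the six per-corruption ADEs, which lets the mean over the five non-Gaussian types be backed out. If the project aggregates differently, the derived values will not match results/reliability_audit/expanded/mixture_sensitivity.csv.

```python
N_TYPES = 6
gaussian = {"loso": 0.726233, "motion_loso": 0.733404}   # Gaussian-jitter ADE from Key Results
uniform = {"loso": 0.629112, "motion_loso": 0.628318}    # uniform expanded mixture ADE

# Assumption: uniform = mean over six corruption types, so back out the other five.
other = {
    policy: (N_TYPES * uniform[policy] - gaussian[policy]) / (N_TYPES - 1)
    for policy in gaussian
}

def mixture_ade(policy, gaussian_weight):
    """ADE when Gaussian jitter gets `gaussian_weight` and the rest is shared equally."""
    return gaussian_weight * gaussian[policy] + (1 - gaussian_weight) * other[policy]

def best_policy(gaussian_weight):
    return min(gaussian, key=lambda policy: mixture_ade(policy, gaussian_weight))

for weight in (1 / 6, 0.5, 1.0):
    print(f"gaussian weight {weight:.2f}: best = {best_policy(weight)}")
```

Under that assumption the preferred policy flips somewhere between a Gaussian weight of 1/6 and 1/2, which is why the headline claim is stated as mixture-dependent.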

Tail Risk

Tail risk tracks P90 / P95 / P99 regret and high-regret rate.

It answers:

When mean ADE looks acceptable, are there still rare but severe failures?

This project's answer is yes. The selector reduces some tail risk, but high-regret failures remain.
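
A minimal sketch of these summaries, using a nearest-rank percentile. The percentile convention and the function name are assumptions; the project's tables may use an interpolating definition. The toy regrets show a small mean sitting alongside a severe P99.

```python
def tail_risk(regrets, threshold=1.0):
    """Mean, nearest-rank percentiles, high-regret rate, and max of per-case regret."""
    ordered = sorted(regrets)
    n = len(ordered)

    def percentile(q):
        rank = max(1, -(-q * n // 100))   # integer ceiling of q * n / 100
        return ordered[rank - 1]

    return {
        "mean": sum(ordered) / n,
        "p90": percentile(90),
        "p95": percentile(95),
        "p99": percentile(99),
        "high_regret_rate": sum(r >= threshold for r in ordered) / n,
        "max": ordered[-1],
    }

# 95 near-optimal cases and 5 severe failures (made-up values, metres).
summary = tail_risk([0.1] * 95 + [2.0] * 5)
print(summary)
```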

Code Structure

configs/
  benchmark.yaml              clean ETH/UCY benchmark config
  noise_study.yaml            noise/corruption study config
  kalman_grid.yaml            Kalman parameter grid notes
  tiny_gru.yaml               cancelled extension config

data/
  README.md                   data setup notes
  fixtures/                   tiny test fixture
  raw/                        ignored local raw data

src/forecastlite/
  bootstrap.py                grouped bootstrap CI helpers
  config.py                   YAML config loading and typed config objects
  corruption.py               observed-history corruption operators
  data.py                     raw file discovery, validation, provenance
  evaluate.py                 benchmark evaluation and result writing
  metrics.py                  ADE, FDE, summaries, paired comparisons
  plotting.py                 core benchmark plots
  reliability.py              reliability envelope, mixture, tail, claim tables
  schema.py                   TrajectoryWindow data object
  selector.py                 observed-only features, selectors, regret maps
  stratify.py                 future turn-bin analysis labels
  windows.py                  ETH/UCY window construction
  predictors/
    stationary.py
    constant_velocity.py
    linear_regression.py
    kalman.py
    protocol.py
    __init__.py

scripts/
  CLI entry points for inspection, benchmark, selector, audits, plotting, release

tests/
  unit, leakage, integration, selector, corruption, reliability tests

doc/
  manuscript, guide, reproduction log, upgrade decision, references, slides, papers

Script Map

Script	Purpose	Main output
`scripts/inspect_data.py`	Validate data and count windows	`results/data_inspection/`
`scripts/run_benchmark.py`	Clean five-scene benchmark	`results/core/`
`scripts/run_ablation.py`	Horizon, observation, noise, turn, stride studies	`results/ablations/`
`scripts/run_selector.py`	Base predictor + selector + oracle audit	`results/selector_*`
`scripts/summarize_selector_results.py`	Paper-ready selector tables	`paper_tables/`
`scripts/audit_stationary_branch.py`	Check stationary selector branch	`stationary_branch_audit/`
`scripts/audit_selector_failure_features.py`	Audit observed-feature failure overrides	`failure_feature_audit/`
`scripts/audit_motion_dynamics_features.py`	Audit speed-trend features	`motion_dynamics_audit/`
`scripts/run_motion_loso_selector.py`	Motion LOSO ablation	`motion_loso_selector_metrics.csv`
`scripts/analyze_dominance_stability.py`	Winner stability across cases	`dominance_stability/`
`scripts/plot_selector_failures.py`	High-regret qualitative figures	`figures/eth_failure_cases/`
`scripts/plot_selector_paper_figures.py`	Manuscript summary figures	`results/paper_figures/`
`scripts/build_reliability_audit_tables.py`	Reliability envelope and claim-safety tables	`results/reliability_audit/`
`scripts/train_tiny_gru.py`	Write cancellation record	`results/tiny_gru_cancellation_record.md`
`scripts/build_release.py`	Validate and package release artifacts	`results/release/`

Tests and Quality Gates

The test suite covers:

trajectory window construction;
predictor behavior;
ADE/FDE metrics;
Kalman filter state and covariance behavior;
corruption operators;
no-leakage boundaries;
selector behavior;
reliability envelope and claim tables;
a tiny integration pipeline.

Recommended before any release:

ruff check .
python -m pytest -q
python -m compileall -q src scripts tests

Last recorded full verification:

ruff: All checks passed.
pytest: 40 passed.
compileall: exit 0.

Paper and Documents

File	Use
doc/FORECASTLITE_SELECTOR_MANUSCRIPT_DRAFT_2026-06-14.md	Main manuscript draft
doc/FORECASTLITE_COMPLETE_BEGINNER_GUIDE_ZH_2026-06-15.md	Complete beginner-level explanation in Chinese
doc/FORECASTLITE_REPRODUCIBILITY_RUN_LOG_2026-06-15.md	Exact command/result reproduction log
doc/FORECASTLITE_WORKSHOP_UPGRADE_DECISION_2026-06-15.md	Upgrade-path decision memo
doc/FORECASTLITE_REPOSITIONING_LITERATURE_AUDIT_ZH_2026-06-14.md	Literature-positioning audit in Chinese
doc/forecastlite_selector_references.bib	BibTeX references
`doc/slides/`	Interview / presentation deck artifacts
`doc/paper/`	Local reference papers and extracted assets

Claim Hygiene

Safe Claims

It is safe to say:

ForecastLite-Selector is a leakage-safe deterministic reliability audit on ETH/UCY.
Constant velocity being strong is prior work, not the main contribution.
Deterministic predictor dominance is conditional on motion and corruption regime.
The observed-only selector improves declared Gaussian mixed ADE over fixed OLS baselines.
The expanded corruption grid has no universal winner.
Tail-regret tables expose remaining high-regret failures.
External validation and real-noise transfer remain future work.

Unsafe Claims

Do not say:

ForecastLite is SOTA.
ForecastLite discovered that constant velocity is strong.
The selector is universally robust.
Synthetic corruption proves deployment reliability.
ADE/FDE prove safety.
Motion LOSO is the main method. (This becomes defensible only if the target distribution is redefined and the Gaussian-jitter tradeoff is accepted.)
SDD / OpenTraj / T2FPV / EgoTraj-Bench external validation has already been completed.

Best Next Steps

Based on the workshop upgrade decision, the next priorities are:

Primary path: frozen-policy real-noise replay on T2FPV-style or EgoTraj-Bench-style data.
Backup path: frozen external validation on SDD / OpenTraj.
Do regardless: keep failure analysis, reliability envelope, mixture sensitivity, tail risk, and claim boundaries central.
Defer: conformal tubes, extra selector tuning, learned selectors, and risk-aware objectives until transfer evidence exists.

Reasoning:

The current largest weakness is not that the selector is too simple. The largest weakness is that ETH/UCY synthetic corruption has not yet been shown to transfer to real perception noise.
If real-noise replay fails, that negative result should be reported honestly.
A negative result is still valuable because it clarifies whether ForecastLite-Selector is an ETH/UCY diagnostic or a transferable policy.

FAQ

Why does this checkout not contain `results/`?

results/ is a reproducible generated directory and is ignored by .gitignore. Running the reproduction commands regenerates it. The complete historical result summary is in doc/FORECASTLITE_REPRODUCIBILITY_RUN_LOG_2026-06-15.md.

Why not commit all result artifacts?

The selector and expanded-corruption outputs are large. In the 2026-06-15 backup, the old results directory was about 7.55 GB. Committing it directly would make the repository difficult to use. The better approach is to keep source code, configs, logs, and exact reproduction commands.

Why not use a neural network?

The goal is not to win a leaderboard. The goal is to build a deterministic baseline reliability audit. Neural models can be future work, but the current contribution comes from transparent, interpretable, reproducible predictors and selector/regret analysis.

Why was TinyGRU cancelled?

TinyGRU was an early optional extension. The current core release is a deterministic baseline study. scripts/train_tiny_gru.py writes a cancellation record rather than training a model.

Why use only ADE/FDE?

This project evaluates deterministic centerline predictors and stays aligned with the ETH/UCY baseline tradition. The README, manuscript, and claim ledger all state clearly that ADE/FDE are not safety or social-plausibility metrics.

Why use grouped bootstrap?

Dense sliding windows are not independent samples. One trajectory produces many neighboring windows. Grouped bootstrap resamples by scene/source_file/track_id, so highly correlated windows are not treated as independent observations.
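
A sketch of the idea: resample whole trajectory groups with replacement, then recompute the statistic on all windows of the drawn groups. The function name, the percentile-interval construction, and the pooled-mean statistic are assumptions for illustration and may differ from src/forecastlite/bootstrap.py.

```python
import random

def grouped_bootstrap_ci(ade_by_group, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile CI for mean ADE, resampling groups rather than windows."""
    rng = random.Random(seed)
    groups = list(ade_by_group.values())
    means = []
    for _ in range(n_resamples):
        drawn = rng.choices(groups, k=len(groups))      # groups, with replacement
        windows = [ade for group in drawn for ade in group]
        means.append(sum(windows) / len(windows))
    means.sort()
    lower = means[int(alpha / 2 * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper

# Keys follow the scene/source_file/track_id grouping; values are made-up window ADEs.
toy = {
    ("eth", "file_a", 1): [0.50, 0.60, 0.55],
    ("eth", "file_a", 2): [1.20, 1.10],
    ("hotel", "file_b", 7): [0.30, 0.35, 0.32, 0.31],
}
lower, upper = grouped_bootstrap_ci(toy)
print(f"95% grouped CI: [{lower:.3f}, {upper:.3f}]")
```

Because neighbouring windows from one track move in and out of the resample together, the interval is wider than a naive per-window bootstrap would report.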

Why can the selector not use the future turn bin?

Because the future trajectory does not exist at prediction time. Future-derived labels are allowed only for post-hoc stratified analysis. The code and tests treat this boundary as a leakage-safety requirement.

What should I check if reproduced numbers differ from this README?

First check:

Whether the configs are exactly configs/benchmark.yaml and configs/noise_study.yaml.
Whether the data comes from the same Trajectron++ commit.
Whether raw files were modified.
Whether Python/package versions differ substantially.
Whether Motion LOSO was run before summarize_selector_results.py.
Whether paper tables were generated with 1000 bootstrap resamples.

Then compare against:

License and Data Notes

The code license is recorded in LICENSE.

Raw ETH/UCY data and copied upstream split files require separate license review before redistribution. The project records provenance and checksums when scripts/inspect_data.py runs, but it does not claim permission to redistribute the upstream dataset.
