A transparent, reproducible, leakage-safe reliability audit for deterministic short-horizon pedestrian trajectory baselines under observation noise and nonlinear motion.
Conclusion: ForecastLite-Selector is not a new state-of-the-art trajectory forecasting model. It is a reproducible study that maps when simple deterministic predictors work, when they fail, how their ranking changes under observation corruption, and how much a leakage-safe observed-only selector can reduce mixed-corruption ADE and oracle regret under declared synthetic corruption mixtures.
- Current Status
- Final Conclusion
- Key Results
- What This Project Studies
- What This Project Does Not Claim
- Data and Protocol
- Method Overview
- ForecastLite-Selector
- Deterministic Reliability Envelope
- Reproduce From Scratch
- Expected Results Directory
- How to Read Results
- Code Structure
- Script Map
- Tests and Quality Gates
- Paper and Documents
- Claim Hygiene
- Best Next Steps
- FAQ
- License and Data Notes
This working tree keeps the source code, configuration files, tests, manuscript
draft, and reproducibility log. The results/ directory is generated output and
is ignored by .gitignore, so a normal checkout may not contain the full result
artifacts. That is expected. The result files are reproducible from the commands
below, and they are omitted from version control to avoid storing several
gigabytes of derived artifacts in the repository.
The latest complete reproduction record is doc/FORECASTLITE_REPRODUCIBILITY_RUN_LOG_2026-06-15.md.
That log records the formal commands, command outputs, generated artifacts, and how the key numbers used in the manuscript were regenerated.
Primary project documents:
- doc/FORECASTLITE_SELECTOR_MANUSCRIPT_DRAFT_2026-06-14.md: manuscript draft.
- doc/FORECASTLITE_COMPLETE_BEGINNER_GUIDE_ZH_2026-06-15.md: complete beginner-level guide in Chinese.
- doc/FORECASTLITE_REPOSITIONING_LITERATURE_AUDIT_ZH_2026-06-14.md: literature-positioning audit in Chinese.
- doc/forecastlite_selector_references.bib: BibTeX references.
The final research conclusion is:
Under ETH/UCY 8-observation / 12-prediction evaluation, transparent deterministic pedestrian predictors exchange dominance across motion and observation-corruption regimes. A leakage-safe observed-only selector can reduce mixed-corruption ADE and oracle regret under declared synthetic corruption mixtures, but no deterministic policy is universally best. The strongest current contribution is a reproducible deterministic reliability audit, not a SOTA learned forecaster, not a deployment guarantee, and not evidence that synthetic corruption fully matches real perception noise.
In practical terms:
- On clean data, last-step constant velocity and Kalman constant velocity are both strong.
- After observation corruption is introduced, the best predictor changes by corruption type, corruption level, motion regime, and scene.
- forecastlite_loso_threshold_selector improves over fixed OLS-4 and OLS-8 on the declared full Gaussian-jitter mixed audit.
- The expanded corruption grid has no universal winner.
- The Motion LOSO ablation is slightly better on the uniform expanded mixture, but it is worse on Gaussian jitter, so it should remain an ablation rather than the headline method.
- The largest failures concentrate in ETH homography-bias, abrupt-stop, and strong-turn cases, often where the oracle predictor is stationary but the selector chooses OLS-4 or OLS-8 and keeps extrapolating motion.
The numbers below come from the complete 2026-06-15 reproducibility log and the
manuscript draft. Running the commands in Reproduce From Scratch
regenerates the corresponding results/ files.
| Item | Value |
|---|---|
| Dataset | ETH/UCY via Trajectron++ |
| Scenes | ETH, Hotel, Univ, Zara1, Zara2 |
| Valid dense stride-1 windows | 34,161 |
| Trajectory groups | 1,219 |
| Observation length | 8 steps |
| Prediction length | 12 steps |
| Sampling interval | 0.4 s |
| Prediction horizon | 4.8 s |
Scene breakdown:
| Scene | Windows | Tracks |
|---|---|---|
| ETH | 364 | 44 |
| Hotel | 1,197 | 122 |
| Univ | 24,334 | 722 |
| Zara1 | 2,356 | 142 |
| Zara2 | 5,910 | 189 |
Clean pooled ADE winner:
| Model | Pooled ADE |
|---|---|
| Last-step CV | 0.481554 m |
| Kalman CV | 0.494961 m |
| OLS-4 | 0.541744 m |
| OLS-8 | 0.667404 m |
| Stationary | 1.442155 m |
Clean macro ADE winner:
| Model | Macro ADE | 95% grouped CI | Macro FDE |
|---|---|---|---|
| Kalman CV | 0.529111 m | [0.479361, 0.584796] | 1.134871 m |
| Last-step CV | 0.534033 m | [0.486211, 0.587942] | 1.147595 m |
| OLS-4 | 0.551975 m | [0.497044, 0.610357] | 1.154689 m |
| OLS-8 | 0.648164 m | [0.577081, 0.712963] | 1.273103 m |
| Stationary | 1.726353 m | [1.582527, 2.032697] | 3.109385 m |
Interpretation:
- Clean pooled data favors last-step CV.
- Clean macro data favors Kalman CV.
- Univ contributes many dense windows, so pooled and macro evaluation answer different questions.
- Constant velocity being strong is not a new discovery. Prior work already established that CV-style baselines are hard to dismiss on ETH/UCY.
Command:
python scripts/run_selector.py --config configs/noise_study.yaml --output-dir results/selector_full --noise-sigmas 0 0.02 0.05 0.1 0.2 --noise-seeds 42 43 44 45 46
Mixed Gaussian-jitter ADE:
| Policy / Model | Mixed ADE | 95% grouped CI | Mean regret |
|---|---|---|---|
| LOSO threshold selector | 0.662352 m | [0.630250, 0.698300] | 0.208877 m |
| Motion LOSO selector | 0.663879 m | [0.631394, 0.699734] | 0.210404 m |
| Fixed ForecastLite selector | 0.663932 m | [0.631355, 0.699853] | 0.210457 m |
| OLS-8 | 0.716557 m | [0.680635, 0.754783] | 0.263083 m |
| OLS-4 | 0.720154 m | [0.696024, 0.746804] | 0.266680 m |
| Tree selector | 0.742189 m | [0.706919, 0.782720] | 0.288715 m |
| Kalman CV | 0.903828 m | [0.885148, 0.924602] | 0.450353 m |
| Last-step CV | 1.168029 m | [1.151870, 1.185559] | 0.714554 m |
| Stationary | 1.463198 m | [1.366109, 1.566100] | 1.009723 m |
Paired matched-window deltas:
| Comparison | Mean ADE delta | 95% CI | Interpretation |
|---|---|---|---|
| LOSO selector - OLS-8 | -0.054206 m | [-0.058516, -0.050375] | LOSO has lower ADE |
| LOSO selector - OLS-4 | -0.057803 m | [-0.066286, -0.048296] | LOSO has lower ADE |
Interpretation:
- The selector improvement is real but modest.
- The result is roughly a 7.6%-8.0% mixed-distribution ADE reduction over strong fixed OLS baselines.
- This should not be described as a forecasting breakthrough.
Expanded corruption types:
- gaussian_jitter
- final_step_spike
- quantization
- timestamp_jitter
- homography_bias
- consecutive_dropout
Levels:
- 0.05
- 0.10
- 0.20
Seeds:
- 42
- 43
- 44
Generated scale in the verified run:
| Artifact | Rows |
|---|---|
| results/selector_corruption_expanded/base_metrics_by_window.parquet | 9,223,470 |
| selector_metrics.csv | 1,844,694 |
| loso_threshold_selector_metrics.csv | 1,844,694 |
| tree_selector_metrics.csv | 1,844,694 |
Single-corruption winners:
| Corruption | Best policy / model | Mixed ADE | Main lesson |
|---|---|---|---|
| consecutive dropout | LOSO / fixed / Motion / last-step CV tie | 0.485897 m | Observed missingness is recoverable by last-step-style behavior |
| gaussian jitter | LOSO / fixed selector | 0.726233 m | Selector improves over OLS-8 on this expanded subset |
| final-step spike | Motion LOSO | 0.705787 m | Speed-trend features help; OLS-8 is close at 0.706699 m |
| homography bias | Last-step CV | 0.657662 m | Selector has a clear feature gap |
| quantization | OLS-4 | 0.582928 m | Short-window smoothing wins |
| timestamp jitter | Kalman CV | 0.498549 m | Timestamp-aware smoothing helps |
Uniform expanded mixture:
| Policy | Mixture ADE |
|---|---|
| Motion LOSO selector | 0.628318 m |
| LOSO threshold selector | 0.629112 m |
| Fixed selector | 0.629112 m |
Interpretation:
- Motion LOSO is useful as an ablation.
- It is not the headline selector because it worsens Gaussian jitter: LOSO/fixed score 0.726233 m, while Motion LOSO scores 0.733404 m.
- The safest paper claim is mixture-dependent improvement, not universal dominance.
Expanded audit, all corruption types:
| Model / Policy | Mean regret | P95 regret | Regret >= 1m rate | Max regret |
|---|---|---|---|---|
| LOSO threshold selector | 0.175440 m | 0.748423 m | 0.027612 | 7.075210 m |
| Fixed selector | 0.175440 m | 0.748423 m | 0.027612 | 7.075210 m |
| Motion LOSO selector | 0.174646 m | 0.749803 m | 0.027347 | 6.844848 m |
| OLS-4 | 0.204330 m | 0.796091 m | 0.028246 | 6.844848 m |
| OLS-8 | 0.259864 m | 1.048856 m | 0.056222 | 7.269652 m |
| Kalman CV | 0.288400 m | 1.290415 m | 0.078545 | 6.972805 m |
| Last-step CV | 0.404088 m | 1.879473 m | 0.128180 | 10.001165 m |
Largest LOSO failure concentration:
| Scene | Corruption | Mean regret | P95 regret | Max regret | Regret >= 1m rate | Common oracle | Common selected model |
|---|---|---|---|---|---|---|---|
| ETH | homography_bias | 0.632800 m | 2.380453 m | 7.075210 m | 0.218864 | stationary | OLS-4 |
Interpretation:
- The selector improves mean and some tail metrics relative to several fixed baselines.
- It does not eliminate high-regret failures.
- ETH abrupt-stop, strong-turn, and homography-like bias cases remain the most visible weaknesses.
ForecastLite-Selector studies a narrow but important question:
In short-horizon pedestrian trajectory prediction, if we only use transparent, deterministic predictors based on observed positions and timestamps, when are they reliable, when do they fail, and can an observed-only selector choose a better simple predictor from the available family?
More concretely, the project studies:
- clean-data performance of deterministic predictors;
- how noise, missing observations, timestamp irregularity, quantization, and homography-like bias change model ranking;
- why pooled and macro summaries can disagree;
- whether selector evaluation is free of future leakage;
- selector regret relative to a deterministic oracle;
- whether tail risk reveals failures hidden by mean ADE;
- which claims are supported by evidence and which claims are not.
The project is suitable as:
- a pedestrian trajectory baseline study;
- an embodied-AI or robotics interview project;
- a reliability-audit and reproducibility case study;
- a sanity-check layer for larger learned forecasting systems;
- an example of careful claim hygiene in an empirical ML paper.
ForecastLite-Selector does not claim that it:
- is a state-of-the-art trajectory forecasting model;
- beats Social LSTM, Social GAN, Trajectron++, NATRA, TrajImpute, or other learned methods;
- models social interaction;
- models multimodality;
- generates probabilistic trajectories;
- guarantees collision avoidance or planning safety;
- proves that synthetic corruption is equivalent to real detector/tracker noise;
- transfers to all datasets, camera setups, or embodied-agent settings;
- proves that ADE/FDE are sufficient for social plausibility or safety.
These non-claims are deliberate and part of the project's quality bar. They prevent a reliability audit from being misrepresented as a model paper with broader evidence than the experiments actually provide.
The data comes from the ETH/UCY split files curated in the Trajectron++ repository.
Recorded upstream commit:
1031c7bd1a444273af378c1ec1dcca907ba59830
This project expects the raw files at:
data/raw/eth_ucy/
That directory is ignored by .gitignore. Raw data should be reviewed against
upstream licenses before redistribution.
Each row has:
frame_id track_id pos_x pos_y
Meaning:
- frame_id: original frame number;
- track_id: pedestrian trajectory ID;
- pos_x, pos_y: world coordinates in meters.
| Parameter | Value |
|---|---|
| Observed steps | 8 |
| Predicted steps | 12 |
| Sampling interval | 0.4 s |
| Raw-frame step | 10 frames |
| Standard window stride | 1 |
| Evaluated split | test |
| Scenes | eth, hotel, univ, zara1, zara2 |
A valid window must satisfy:
- one consistent track_id;
- enough length for 8 observed steps plus 12 future steps;
- continuous frame spacing;
- finite coordinates;
- no duplicate frame/track pair.
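
The validity rules above can be sketched as a single check. This is a minimal illustration, not the project's windows.py: the function name `is_valid_window` and the tuple layout are assumptions, and the constants are taken from the protocol table.

```python
import math

OBS_LEN = 8      # observed steps (protocol table)
PRED_LEN = 12    # predicted steps (protocol table)
FRAME_STEP = 10  # raw-frame step between consecutive samples (protocol table)


def is_valid_window(rows):
    """Hypothetical validity check.

    rows: list of (frame_id, track_id, pos_x, pos_y) tuples in time order.
    """
    # Enough length for 8 observed plus 12 future steps.
    if len(rows) != OBS_LEN + PRED_LEN:
        return False
    # One consistent track_id.
    if len({r[1] for r in rows}) != 1:
        return False
    frames = [r[0] for r in rows]
    # No duplicate frame/track pair (track is already known to be unique).
    if len(set(frames)) != len(frames):
        return False
    # Continuous frame spacing.
    if any(b - a != FRAME_STEP for a, b in zip(frames, frames[1:])):
        return False
    # Finite coordinates.
    return all(math.isfinite(r[2]) and math.isfinite(r[3]) for r in rows)
```

A window that fails any one rule is rejected; how the project logs exclusions is not shown here.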
All predictors implement the same interface:
predict(observed_xy, observed_t, future_t) -> predicted_xy
This guarantees that every model sees the same observed history and the same future timestamps.
The stationary predictor assumes the pedestrian stays at the last observed position:
future position = last observed position
Strength:
- very strong for low-motion and abrupt-stop cases.
Weakness:
- poor for normal walking motion.
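
As a minimal sketch of the shared interface applied to the stationary rule (the function name `predict_stationary` is an assumption; the project's stationary.py may be organized differently):

```python
import numpy as np


def predict_stationary(observed_xy, observed_t, future_t):
    """Repeat the last observed position once per future timestamp.

    observed_t is unused here but kept so the signature matches the
    shared predict(observed_xy, observed_t, future_t) interface.
    """
    observed_xy = np.asarray(observed_xy, dtype=float)
    last_position = observed_xy[-1]
    return np.tile(last_position, (len(future_t), 1))
```

The output has one row per future timestamp, so a 12-step horizon yields an array of shape (12, 2).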
Last-step CV estimates velocity from only the last two observed points:
velocity = (last_position - previous_position) / dt
Strengths:
- fast response on clean short-term motion;
- best clean pooled ADE in this study.
Weakness:
- if the final observed step is corrupted, the velocity estimate amplifies that error into the future.
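
A minimal sketch of the last-step rule under the shared interface follows. The function name `predict_last_step_cv` is an assumption, not necessarily the name used in constant_velocity.py.

```python
import numpy as np


def predict_last_step_cv(observed_xy, observed_t, future_t):
    """Extrapolate with the velocity implied by the last two observations."""
    xy = np.asarray(observed_xy, dtype=float)
    t = np.asarray(observed_t, dtype=float)
    ft = np.asarray(future_t, dtype=float)
    dt = t[-1] - t[-2]
    velocity = (xy[-1] - xy[-2]) / dt
    # Broadcast the elapsed time over both coordinates.
    return xy[-1] + (ft - t[-1])[:, None] * velocity
```

The amplification weakness is easy to quantify: with the 0.4 s sampling interval, a 0.1 m error on the final observed point shifts the velocity estimate by 0.25 m/s, so the error at the 4.8 s horizon is 0.1 + 0.25 × 4.8 = 1.3 m.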
OLS-4 and OLS-8 fit a least-squares line through the most recent 4 or 8 observed points:
position ~= velocity * time + intercept
Strengths:
- smooths observation noise;
- can be more stable under quantization or Gaussian jitter.
Weaknesses:
- a longer history can respond too slowly;
- turns or abrupt speed changes can make linear extrapolation fail.
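
The line fit can be sketched with `np.polyfit`, fitting x and y against time independently. The function name `predict_ols` and the `k` parameter are illustrative assumptions; the project's linear_regression.py may solve the fit differently.

```python
import numpy as np


def predict_ols(observed_xy, observed_t, future_t, k=4):
    """Fit position ~= velocity * time + intercept on the last k points."""
    xy = np.asarray(observed_xy, dtype=float)[-k:]
    t = np.asarray(observed_t, dtype=float)[-k:]
    ft = np.asarray(future_t, dtype=float)
    t_ref = t[-1]  # center time on the last observation for conditioning
    # With a 2-D y, np.polyfit fits each column separately and returns
    # coefficients of shape (deg + 1, 2): row 0 is slope, row 1 is intercept.
    velocity, intercept = np.polyfit(t - t_ref, xy, deg=1)
    return intercept + (ft - t_ref)[:, None] * velocity
```

Calling it with `k=4` or `k=8` gives the two variants; a larger `k` averages out more observation noise but reacts more slowly to a recent change in velocity.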
Kalman CV uses a constant-velocity Kalman filter:
state = [x, y, vx, vy]
measurement = [x, y]
Strengths:
- stable under timestamp jitter and compatible noise assumptions;
- best clean macro ADE in this study.
Weakness:
- if process-noise or measurement-noise assumptions do not match the data, it can underperform OLS or last-step CV.
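
A minimal constant-velocity Kalman sketch with the state and measurement layout above is shown below. The white-noise-acceleration process model, the default `q` and `r` values, and the vague velocity prior are all assumptions made for illustration; they are not the project's tuned settings from kalman.py or kalman_grid.yaml.

```python
import numpy as np


def predict_kalman_cv(observed_xy, observed_t, future_t, q=0.5, r=0.05):
    """Filter the observed history, then extrapolate the final state.

    q: assumed acceleration-noise std (m/s^2); r: assumed measurement std (m).
    """
    xy = np.asarray(observed_xy, dtype=float)
    t = np.asarray(observed_t, dtype=float)

    def transition(dt):
        F = np.eye(4)
        F[0, 2] = F[1, 3] = dt  # position += velocity * dt
        return F

    def process_noise(dt):
        # White-noise-acceleration model, applied independently to x and y.
        Q = np.zeros((4, 4))
        for p, v in ((0, 2), (1, 3)):
            Q[p, p] = dt**4 / 4
            Q[p, v] = Q[v, p] = dt**3 / 2
            Q[v, v] = dt**2
        return q**2 * Q

    H = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])  # position only
    R = r**2 * np.eye(2)
    x = np.array([xy[0, 0], xy[0, 1], 0.0, 0.0])  # first fix, zero velocity
    P = np.diag([r**2, r**2, 10.0, 10.0])          # vague prior on velocity

    for k in range(1, len(xy)):
        dt = t[k] - t[k - 1]  # actual spacing, so irregular timestamps are handled
        F = transition(dt)
        x = F @ x
        P = F @ P @ F.T + process_noise(dt)
        innovation = xy[k] - H @ x
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ innovation
        P = (np.eye(4) - K @ H) @ P

    return np.array([(transition(ft - t[-1]) @ x)[:2] for ft in future_t])
```

Because the transition matrix is rebuilt from the actual timestamp spacing at every step, this form stays consistent when observed timestamps are irregular. The noise parameters `q` and `r` are the assumptions the weakness above refers to.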
The selector does not forecast directly. It chooses among base predictors:
stationary
last_step_cv
ols_4
ols_8
kalman_cv
Selector features may only use:
- observed positions;
- observed timestamps;
- future timestamps;
- observed-only derived features.
They may not use:
- future positions;
- future turn labels;
- oracle model choice;
- ADE/FDE;
- regret.
Future-derived information is allowed only for evaluation and failure analysis, not for model selection.
The fixed selector uses hand-written rules based on observed speed, displacement, OLS residuals, timestamp irregularity, missingness ratio, and related observed-only features.
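
The leakage boundary is easiest to see in the feature function's signature: it receives only the observed history. The sketch below computes three of the feature kinds named above; the function name, the feature names, and the exact definitions are assumptions, and OLS residuals and missingness ratio are omitted for brevity.

```python
import numpy as np


def observed_features(observed_xy, observed_t):
    """Hypothetical observed-only features; no future positions are accepted."""
    xy = np.asarray(observed_xy, dtype=float)
    t = np.asarray(observed_t, dtype=float)
    steps = np.diff(xy, axis=0)
    dts = np.diff(t)
    speeds = np.linalg.norm(steps, axis=1) / dts
    return {
        "mean_speed": float(speeds.mean()),                         # m/s
        "net_displacement": float(np.linalg.norm(xy[-1] - xy[0])),  # m
        "dt_irregularity": float(dts.std()),                        # s
    }
```

Because future positions never enter the function, a selector built on these features cannot leak outcome information by construction.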
LOSO means leave-one-scene-out.
Procedure:
- Hold out one scene.
- Choose thresholds on the remaining scenes.
- Freeze those thresholds.
- Evaluate on the held-out scene.
- Repeat for all five held-out scenes.
- Merge the held-out results.
This is stricter than tuning and testing on the same scene.
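
The procedure can be sketched with a deliberately tiny one-threshold selector. Everything here is illustrative: the `select` rule, the `speed` feature, and the case dictionary layout are assumptions, and the real selector uses more features and thresholds. Note that ADE from the training scenes is used only to fit the threshold, while ADE from the held-out scene is used only for scoring.

```python
def select(speed, threshold):
    """Toy rule: slow pedestrians get stationary, everyone else gets OLS-4."""
    return "stationary" if speed < threshold else "ols_4"


def mean_ade(cases, threshold):
    return sum(c["ade"][select(c["speed"], threshold)] for c in cases) / len(cases)


def loso_evaluate(cases, candidate_thresholds):
    """cases: dicts with 'scene', 'speed', and 'ade' (per-model ADE)."""
    results = []
    for held_out in sorted({c["scene"] for c in cases}):
        train = [c for c in cases if c["scene"] != held_out]
        test = [c for c in cases if c["scene"] == held_out]
        # Choose the threshold on the remaining scenes, then freeze it.
        frozen = min(candidate_thresholds, key=lambda th: mean_ade(train, th))
        # Evaluate on the held-out scene with the frozen threshold.
        for c in test:
            model = select(c["speed"], frozen)
            results.append({"scene": held_out, "threshold": frozen,
                            "model": model, "ade": c["ade"][model]})
    return results  # merged held-out results across all scenes
```

Each case is scored exactly once, by a threshold that never saw its scene.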
The tree selector is a shallow decision-tree diagnostic. It is not the headline method. It tests whether a more flexible observed-only rule gives a large gain.
The Motion LOSO selector adds speed-trend and deceleration features as an ablation. It is slightly better on the expanded uniform mixture, but it worsens Gaussian-jitter performance, so it should not replace the headline selector unless the target distribution is explicitly redefined.
The Deterministic Reliability Envelope is one of the central analysis tools in this project.
For each matched case, it asks:
If we are only allowed to choose within the current deterministic predictor family, what is the best possible ADE/FDE after seeing the outcome?
The selector cannot use this information. It is an evaluation-only oracle.
Uses:
- compute oracle regret;
- separate best fixed baseline, best selector policy, and per-case oracle;
- measure how far a selector remains from the best available deterministic family member;
- expose tail failures;
- avoid hiding severe failures behind mean ADE.
The main reliability audit is generated by:
python scripts/build_reliability_audit_tables.py --gaussian-dir results/selector_full --expanded-dir results/selector_corruption_expanded --output-dir results/reliability_audit
Generated artifacts include:
- claim_separation.csv
- estimands.csv
- claim_ledger.csv
- external_validation_design.csv
- reliability_envelope_cases.csv
- reliability_envelope_summary.csv
- regret_dominance_map.csv
- mixture_sensitivity.csv
- tail_risk_summary.csv
- selector_failure_summary.csv
- mixture and tail-risk figures.
Python 3.11 or newer is required. The latest complete verification used Python 3.13.
python -m pip install -e ".[dev]"
If installing from the lock file manually:
python -m pip install -r requirements-lock.txt
python -m pip install -e .
Place the ETH/UCY split files at:
data/raw/eth_ucy/
The structure should be readable by configs/benchmark.yaml. Raw data is not
redistributed in this repository; review the Trajectron++ and ETH/UCY upstream
licenses before redistribution.
ruff check .
python -m pytest -q
python -m compileall -q src scripts tests
Expected:
ruff: All checks passed.
pytest: 40 passed.
compileall: exit 0.
python scripts/inspect_data.py --config configs/benchmark.yaml --output-dir results/data_inspection
python scripts/run_benchmark.py --config configs/benchmark.yaml --output-dir results/core
Generated files:
- results/data_inspection/data_provenance.csv
- results/data_inspection/window_counts.csv
- results/data_inspection/exclusions.csv
- results/core/metrics_by_window.parquet
- results/core/summary_by_scene.csv
- results/core/paired_comparisons.csv
- results/core/manifest.json
python scripts/run_ablation.py --config configs/noise_study.yaml --study all --output-dir results/ablations
Generated files:
- horizon_summary.csv
- observation_length_summary.csv
- noise_summary.csv
- turn_stratification.csv
- stride_sensitivity.csv
Note: this command can run for a while. In the 2026-06-15 reproduction, the wrapper timed out but the Python subprocess continued and completed. Completion was verified by output-file presence and row counts.
python scripts/run_selector.py --config configs/noise_study.yaml --output-dir results/selector_full --noise-sigmas 0 0.02 0.05 0.1 0.2 --noise-seeds 42 43 44 45 46
Then run the derived audits:
python scripts/audit_stationary_branch.py --selector-dir results/selector_full
python scripts/audit_selector_failure_features.py --selector-dir results/selector_full
python scripts/audit_motion_dynamics_features.py --selector-dir results/selector_full
python scripts/run_motion_loso_selector.py --config configs/noise_study.yaml --selector-dir results/selector_full
python scripts/analyze_dominance_stability.py --input-dir results/selector_full
python scripts/plot_selector_failures.py --config configs/noise_study.yaml --selector-dir results/selector_full --output-dir results/selector_full/figures/eth_failure_cases
Finally, build paper tables:
python scripts/summarize_selector_results.py --input-dir results/selector_full --bootstrap-resamples 1000
python scripts/run_selector.py --config configs/noise_study.yaml --output-dir results/selector_corruption_expanded --corruption-types gaussian_jitter final_step_spike quantization timestamp_jitter homography_bias consecutive_dropout --corruption-levels 0.05 0.1 0.2 --noise-seeds 42 43 44
Then run the derived audits:
python scripts/audit_stationary_branch.py --selector-dir results/selector_corruption_expanded
python scripts/audit_selector_failure_features.py --selector-dir results/selector_corruption_expanded
python scripts/audit_motion_dynamics_features.py --selector-dir results/selector_corruption_expanded
python scripts/run_motion_loso_selector.py --config configs/noise_study.yaml --selector-dir results/selector_corruption_expanded
python scripts/analyze_dominance_stability.py --input-dir results/selector_corruption_expanded
python scripts/plot_selector_failures.py --config configs/noise_study.yaml --selector-dir results/selector_corruption_expanded --output-dir results/selector_corruption_expanded/figures/eth_failure_cases
Finally, build paper tables:
python scripts/summarize_selector_results.py --input-dir results/selector_corruption_expanded --bootstrap-resamples 1000
Note: the expanded selector audit is the heaviest step. In the 2026-06-15 reproduction, the wrapper timed out but the Python subprocess continued and completed.
python scripts/plot_selector_paper_figures.py --gaussian-dir results/selector_full --expanded-dir results/selector_corruption_expanded --output-dir results/paper_figures
python scripts/build_reliability_audit_tables.py --gaussian-dir results/selector_full --expanded-dir results/selector_corruption_expanded --output-dir results/reliability_audit
python scripts/train_tiny_gru.py
python scripts/build_release.py
train_tiny_gru.py does not train a model. It writes the TinyGRU cancellation
record, documenting that TinyGRU is outside the current deterministic core
release.
build_release.py requires the core results/ artifacts to exist. If a normal
checkout does not contain results/, running it immediately will fail. That is
expected.
After a complete reproduction, results/ should contain:
results/
data_inspection/
core/
ablations/
selector_full/
selector_corruption_expanded/
paper_figures/
reliability_audit/
release/
tiny_gru_cancellation_record.md
Core files:
results/core/summary_by_scene.csv
results/selector_full/paper_tables/mixed_corruption_summary.csv
results/selector_full/paper_tables/paired_mixed_summary.csv
results/selector_corruption_expanded/paper_tables/mixed_corruption_summary.csv
results/reliability_audit/expanded/mixture_sensitivity.csv
results/reliability_audit/expanded/tail_risk_summary.csv
results/reliability_audit/expanded/selector_failure_summary.csv
pooled:
- averages over all windows together;
- gives more weight to large scenes;
- is strongly affected by Univ because Univ contributes many windows.
macro:
- averages within each scene first;
- then averages across scenes;
- better answers the question "how does this behave across scenes on average?"
Both matter. Pooled results can be dominated by a large scene. Macro results can hide the actual sample distribution.
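
The difference between the two summaries is a two-line computation. The sketch below uses made-up per-window errors and hypothetical function names purely to show the mechanics.

```python
def pooled_mean(errors_by_scene):
    """Average over all windows together, so large scenes dominate."""
    all_errors = [e for errors in errors_by_scene.values() for e in errors]
    return sum(all_errors) / len(all_errors)


def macro_mean(errors_by_scene):
    """Average within each scene first, then across scenes."""
    scene_means = [sum(errors) / len(errors) for errors in errors_by_scene.values()]
    return sum(scene_means) / len(scene_means)


# Toy example: one large easy scene and one small hard scene.
toy = {"large_scene": [0.4] * 8, "small_scene": [1.0] * 2}
print(pooled_mean(toy), macro_mean(toy))
```

On the toy data the pooled mean is about 0.52 and the macro mean about 0.70. In this study Univ contributes 24,334 of the 34,161 windows (about 71%), which is why the pooled and macro tables can rank models differently.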
ADE means:
Average Displacement Error
It is the average Euclidean distance error across all future timesteps.
FDE means:
Final Displacement Error
It only evaluates the last future timestep.
ADE/FDE are useful deterministic centerline diagnostics, but they do not prove safety, collision avoidance, or social plausibility.
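
Both metrics come from the same per-step distance vector. A minimal sketch, assuming predictions and ground truth are arrays of shape (steps, 2) in meters (the function name `ade_fde` is hypothetical):

```python
import numpy as np


def ade_fde(predicted_xy, true_xy):
    """Return (ADE, FDE) in the same units as the inputs."""
    predicted_xy = np.asarray(predicted_xy, dtype=float)
    true_xy = np.asarray(true_xy, dtype=float)
    # Euclidean distance at each future timestep.
    distances = np.linalg.norm(predicted_xy - true_xy, axis=1)
    return float(distances.mean()), float(distances[-1])
```

For a predictor that is off by a constant (3 m, 4 m) offset at every step, both ADE and FDE are 5 m.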
Oracle regret is:
model ADE - best deterministic family ADE for the same case
If regret is zero, the model is the best available member of the deterministic family for that case.
If regret is large, the model is far from what the current deterministic family could have achieved with hindsight.
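
For a single case the computation is a subtraction against the per-case minimum. The ADE values below are hypothetical and the function name is an assumption.

```python
def oracle_regret(ade_by_model):
    """Regret of each model relative to the best family member for one case."""
    best = min(ade_by_model.values())
    return {model: ade - best for model, ade in ade_by_model.items()}


# Hypothetical ADEs (meters) for one abrupt-stop case.
case = {"stationary": 0.20, "last_step_cv": 0.90, "ols_4": 0.75,
        "ols_8": 0.80, "kalman_cv": 0.85}
print(oracle_regret(case))
```

Because the mean is linear, a policy's mean regret equals its mean ADE minus the mean oracle ADE. The mixed Gaussian-jitter table is consistent with this: every row's mixed ADE minus its mean regret is about 0.4535 m.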
A dominance map answers:
Under a specific corruption, level, observed regime, or turn bin, which model has the lowest average ADE?
This is more informative than a single global table because the central finding of the project is conditional dominance, not global dominance.
Mixture sensitivity answers:
If the weights of corruption types change, which policy is best?
This prevents a winner under one mixture distribution from being overstated as a universal winner.
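
A minimal sketch of the idea: reweight per-corruption ADE and see whether the winner flips. The policy names and ADE values are invented for illustration; only the corruption-type names come from this project.

```python
def mixture_ade(ade_by_corruption, weights):
    """Weighted average ADE; weights need not sum to one."""
    total = sum(weights.values())
    return sum(w * ade_by_corruption[c] for c, w in weights.items()) / total


def best_policy(table, weights):
    """table: {policy: {corruption: ADE}}. Returns the lowest-ADE policy."""
    return min(table, key=lambda policy: mixture_ade(table[policy], weights))


# Hypothetical per-corruption ADEs (meters) for two policies.
table = {
    "policy_a": {"gaussian_jitter": 0.70, "final_step_spike": 0.75},
    "policy_b": {"gaussian_jitter": 0.74, "final_step_spike": 0.70},
}
uniform = {"gaussian_jitter": 0.5, "final_step_spike": 0.5}
jitter_heavy = {"gaussian_jitter": 0.9, "final_step_spike": 0.1}
print(best_policy(table, uniform), best_policy(table, jitter_heavy))
```

Under the uniform weights policy_b wins (0.720 vs 0.725), while under the jitter-heavy weights policy_a wins (0.705 vs 0.736). The same mechanism explains why Motion LOSO can lead on the uniform expanded mixture yet trail on Gaussian jitter alone.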
Tail risk tracks P90 / P95 / P99 regret and high-regret rate.
It answers:
When mean ADE looks acceptable, are there still rare but severe failures?
This project's answer is yes: the selector reduces some tail risk, but high-regret failures remain.
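
A tail-risk summary of per-case regret can be sketched as follows. The function name and the 1 m default threshold mirror the "Regret >= 1m rate" column, but the percentile convention (NumPy's default linear interpolation) is an assumption and may differ from what reliability.py uses.

```python
import numpy as np


def tail_risk(regrets, threshold=1.0):
    """Summarize the upper tail of a per-case regret distribution (meters)."""
    r = np.asarray(regrets, dtype=float)
    return {
        "mean": float(r.mean()),
        "p90": float(np.percentile(r, 90)),
        "p95": float(np.percentile(r, 95)),
        "p99": float(np.percentile(r, 99)),
        "high_regret_rate": float((r >= threshold).mean()),
        "max": float(r.max()),
    }
```

For example, 95 cases at 0.1 m regret plus 5 cases at 2.0 m give a mean of only 0.195 m, while the high-regret rate is 0.05 and the maximum is 2.0 m. The mean alone would hide those five failures.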
configs/
benchmark.yaml clean ETH/UCY benchmark config
noise_study.yaml noise/corruption study config
kalman_grid.yaml Kalman parameter grid notes
tiny_gru.yaml cancelled extension config
data/
README.md data setup notes
fixtures/ tiny test fixture
raw/ ignored local raw data
src/forecastlite/
bootstrap.py grouped bootstrap CI helpers
config.py YAML config loading and typed config objects
corruption.py observed-history corruption operators
data.py raw file discovery, validation, provenance
evaluate.py benchmark evaluation and result writing
metrics.py ADE, FDE, summaries, paired comparisons
plotting.py core benchmark plots
reliability.py reliability envelope, mixture, tail, claim tables
schema.py TrajectoryWindow data object
selector.py observed-only features, selectors, regret maps
stratify.py future turn-bin analysis labels
windows.py ETH/UCY window construction
predictors/
stationary.py
constant_velocity.py
linear_regression.py
kalman.py
protocol.py
__init__.py
scripts/
CLI entry points for inspection, benchmark, selector, audits, plotting, release
tests/
unit, leakage, integration, selector, corruption, reliability tests
doc/
manuscript, guide, reproduction log, upgrade decision, references, slides, papers
| Script | Purpose | Main output |
|---|---|---|
| scripts/inspect_data.py | Validate data and count windows | results/data_inspection/ |
| scripts/run_benchmark.py | Clean five-scene benchmark | results/core/ |
| scripts/run_ablation.py | Horizon, observation, noise, turn, stride studies | results/ablations/ |
| scripts/run_selector.py | Base predictor + selector + oracle audit | results/selector_* |
| scripts/summarize_selector_results.py | Paper-ready selector tables | paper_tables/ |
| scripts/audit_stationary_branch.py | Check stationary selector branch | stationary_branch_audit/ |
| scripts/audit_selector_failure_features.py | Audit observed-feature failure overrides | failure_feature_audit/ |
| scripts/audit_motion_dynamics_features.py | Audit speed-trend features | motion_dynamics_audit/ |
| scripts/run_motion_loso_selector.py | Motion LOSO ablation | motion_loso_selector_metrics.csv |
| scripts/analyze_dominance_stability.py | Winner stability across cases | dominance_stability/ |
| scripts/plot_selector_failures.py | High-regret qualitative figures | figures/eth_failure_cases/ |
| scripts/plot_selector_paper_figures.py | Manuscript summary figures | results/paper_figures/ |
| scripts/build_reliability_audit_tables.py | Reliability envelope and claim-safety tables | results/reliability_audit/ |
| scripts/train_tiny_gru.py | Write cancellation record | results/tiny_gru_cancellation_record.md |
| scripts/build_release.py | Validate and package release artifacts | results/release/ |
The test suite covers:
- trajectory window construction;
- predictor behavior;
- ADE/FDE metrics;
- Kalman filter state and covariance behavior;
- corruption operators;
- no-leakage boundaries;
- selector behavior;
- reliability envelope and claim tables;
- a tiny integration pipeline.
Recommended before any release:
ruff check .
python -m pytest -q
python -m compileall -q src scripts tests
Last recorded full verification:
ruff: All checks passed.
pytest: 40 passed.
compileall: exit 0.
| File | Use |
|---|---|
| doc/FORECASTLITE_SELECTOR_MANUSCRIPT_DRAFT_2026-06-14.md | Main manuscript draft |
| doc/FORECASTLITE_COMPLETE_BEGINNER_GUIDE_ZH_2026-06-15.md | Complete beginner-level explanation in Chinese |
| doc/FORECASTLITE_REPRODUCIBILITY_RUN_LOG_2026-06-15.md | Exact command/result reproduction log |
| doc/FORECASTLITE_WORKSHOP_UPGRADE_DECISION_2026-06-15.md | Upgrade-path decision memo |
| doc/FORECASTLITE_REPOSITIONING_LITERATURE_AUDIT_ZH_2026-06-14.md | Literature-positioning audit in Chinese |
| doc/forecastlite_selector_references.bib | BibTeX references |
| doc/slides/ | Interview / presentation deck artifacts |
| doc/paper/ | Local reference papers and extracted assets |
It is safe to say:
- ForecastLite-Selector is a leakage-safe deterministic reliability audit on ETH/UCY.
- Constant velocity being strong is prior work, not the main contribution.
- Deterministic predictor dominance is conditional on motion and corruption regime.
- The observed-only selector improves declared Gaussian mixed ADE over fixed OLS baselines.
- The expanded corruption grid has no universal winner.
- Tail-regret tables expose remaining high-regret failures.
- External validation and real-noise transfer remain future work.
Do not say:
- ForecastLite is SOTA.
- ForecastLite discovered that constant velocity is strong.
- The selector is universally robust.
- Synthetic corruption proves deployment reliability.
- ADE/FDE prove safety.
- Motion LOSO is the main method unless the target distribution is redefined and the Gaussian-jitter tradeoff is accepted.
- SDD / OpenTraj / T2FPV / EgoTraj-Bench external validation has already been completed.
Based on the workshop upgrade decision, the next priorities are:
- Primary path: frozen-policy real-noise replay on T2FPV-style or EgoTraj-Bench-style data.
- Backup path: frozen external validation on SDD / OpenTraj.
- Do regardless: keep failure analysis, reliability envelope, mixture sensitivity, tail risk, and claim boundaries central.
- Defer: conformal tubes, extra selector tuning, learned selectors, and risk-aware objectives until transfer evidence exists.
Reasoning:
- The current largest weakness is not that the selector is too simple. The largest weakness is that ETH/UCY synthetic corruption has not yet been shown to transfer to real perception noise.
- If real-noise replay fails, that negative result should be reported honestly.
- A negative result is still valuable because it clarifies whether ForecastLite-Selector is an ETH/UCY diagnostic or a transferable policy.
results/ is a reproducible generated directory and is ignored by .gitignore.
Running the reproduction commands regenerates it. The complete historical result
summary is in doc/FORECASTLITE_REPRODUCIBILITY_RUN_LOG_2026-06-15.md.
The selector and expanded-corruption outputs are large. In the 2026-06-15
backup, the old results directory was about 7.55 GB. Committing it directly
would make the repository difficult to use. The better approach is to keep
source code, configs, logs, and exact reproduction commands.
The goal is not to win a leaderboard. The goal is to build a deterministic baseline reliability audit. Neural models can be future work, but the current contribution comes from transparent, interpretable, reproducible predictors and selector/regret analysis.
TinyGRU was an early optional extension. The current core release is a
deterministic baseline study. scripts/train_tiny_gru.py writes a cancellation
record rather than training a model.
This project evaluates deterministic centerline predictors and keeps alignment with ETH/UCY baseline tradition. The README, manuscript, and claim ledger all state clearly that ADE/FDE are not safety or social-plausibility metrics.
Dense sliding windows are not independent samples. One trajectory produces many
neighboring windows. Grouped bootstrap resamples by scene/source_file/track_id,
so highly correlated windows are not treated as independent observations.
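
A grouped percentile bootstrap can be sketched as below: whole groups are resampled with replacement, and every window in a drawn group travels with it. The function name, the percentile-interval choice, and the defaults are assumptions; bootstrap.py may implement the interval differently.

```python
import random
from collections import defaultdict


def grouped_bootstrap_ci(values, groups, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile CI for the mean, resampling groups rather than windows.

    groups: one hashable key per value, e.g. (scene, source_file, track_id).
    """
    by_group = defaultdict(list)
    for value, group in zip(values, groups):
        by_group[group].append(value)
    keys = list(by_group)
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Draw as many groups as exist, with replacement.
        drawn = rng.choices(keys, k=len(keys))
        sample = [v for key in drawn for v in by_group[key]]
        means.append(sum(sample) / len(sample))
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

Resampling individual windows instead would treat neighboring windows from the same track as independent and would typically produce intervals that are too narrow.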
The selector cannot use future-derived labels because the future trajectory does not exist at prediction time. Future-derived labels are allowed only for post-hoc stratified analysis. The code and tests treat this boundary as a leakage-safety requirement.
First check:
- Whether the configs are exactly configs/benchmark.yaml and configs/noise_study.yaml.
- Whether the data comes from the same Trajectron++ commit.
- Whether raw files were modified.
- Whether Python/package versions differ substantially.
- Whether Motion LOSO was run before summarize_selector_results.py.
- Whether paper tables were generated with 1000 bootstrap resamples.
Then compare against:
- doc/FORECASTLITE_REPRODUCIBILITY_RUN_LOG_2026-06-15.md
- doc/FORECASTLITE_SELECTOR_MANUSCRIPT_DRAFT_2026-06-14.md
The code license is recorded in LICENSE.
Raw ETH/UCY data and copied upstream split files require separate license review
before redistribution. The project records provenance and checksums when
scripts/inspect_data.py runs, but it does not claim permission to redistribute
the upstream dataset.