2XUID/ForecastLite


ForecastLite-Selector

A transparent, reproducible, leakage-safe reliability audit for deterministic short-horizon pedestrian trajectory baselines under observation noise and nonlinear motion.

Conclusion: ForecastLite-Selector is not a new state-of-the-art trajectory forecasting model. It is a reproducible study that maps when simple deterministic predictors work, when they fail, how their ranking changes under observation corruption, and how much a leakage-safe observed-only selector can reduce mixed-corruption ADE and oracle regret under declared synthetic corruption mixtures.

Current Status

This working tree keeps the source code, configuration files, tests, manuscript draft, and reproducibility log. The results/ directory is generated output and is ignored by .gitignore, so a normal checkout may not contain the full result artifacts. That is expected. The result files are reproducible from the commands below, and they are omitted from version control to avoid storing several gigabytes of derived artifacts in the repository.

The latest complete reproduction record is doc/FORECASTLITE_REPRODUCIBILITY_RUN_LOG_2026-06-15.md. That log records the formal commands, command outputs, generated artifacts, and how the key numbers used in the manuscript were regenerated.

Primary project documents are listed in the Paper and Documents section.

Final Conclusion

The final research conclusion is:

Under ETH/UCY 8-observation / 12-prediction evaluation, transparent deterministic pedestrian predictors exchange dominance across motion and observation-corruption regimes. A leakage-safe observed-only selector can reduce mixed-corruption ADE and oracle regret under declared synthetic corruption mixtures, but no deterministic policy is universally best. The strongest current contribution is a reproducible deterministic reliability audit, not a SOTA learned forecaster, not a deployment guarantee, and not evidence that synthetic corruption fully matches real perception noise.

In practical terms:

  • On clean data, last-step constant velocity and Kalman constant velocity are both strong.
  • After observation corruption is introduced, the best predictor changes by corruption type, corruption level, motion regime, and scene.
  • forecastlite_loso_threshold_selector improves over fixed OLS-4 and OLS-8 on the declared full Gaussian-jitter mixed audit.
  • The expanded corruption grid has no universal winner.
  • The Motion LOSO ablation is slightly better on the uniform expanded mixture, but it is worse on Gaussian jitter, so it should remain an ablation rather than the headline method.
  • The largest failures concentrate in ETH homography-bias, abrupt-stop, and strong-turn cases, often where the oracle predictor is stationary but the selector chooses OLS-4 or OLS-8 and keeps extrapolating motion.

Key Results

The numbers below come from the complete 2026-06-15 reproducibility log and the manuscript draft. Running the commands in Reproduce From Scratch regenerates the corresponding results/ files.

Data Scale

| Item | Value |
| --- | --- |
| Dataset | ETH/UCY via Trajectron++ |
| Scenes | ETH, Hotel, Univ, Zara1, Zara2 |
| Valid dense stride-1 windows | 34,161 |
| Trajectory groups | 1,219 |
| Observation length | 8 steps |
| Prediction length | 12 steps |
| Sampling interval | 0.4 s |
| Prediction horizon | 4.8 s |

Scene breakdown:

| Scene | Windows | Tracks |
| --- | --- | --- |
| ETH | 364 | 44 |
| Hotel | 1,197 | 122 |
| Univ | 24,334 | 722 |
| Zara1 | 2,356 | 142 |
| Zara2 | 5,910 | 189 |

Clean ETH/UCY

Clean pooled ADE winner:

| Model | Pooled ADE |
| --- | --- |
| Last-step CV | 0.481554 m |
| Kalman CV | 0.494961 m |
| OLS-4 | 0.541744 m |
| OLS-8 | 0.667404 m |
| Stationary | 1.442155 m |

Clean macro ADE winner:

| Model | Macro ADE | 95% grouped CI | Macro FDE |
| --- | --- | --- | --- |
| Kalman CV | 0.529111 m | [0.479361, 0.584796] | 1.134871 m |
| Last-step CV | 0.534033 m | [0.486211, 0.587942] | 1.147595 m |
| OLS-4 | 0.551975 m | [0.497044, 0.610357] | 1.154689 m |
| OLS-8 | 0.648164 m | [0.577081, 0.712963] | 1.273103 m |
| Stationary | 1.726353 m | [1.582527, 2.032697] | 3.109385 m |

Interpretation:

  • Clean pooled data favors last-step CV.
  • Clean macro data favors Kalman CV.
  • Univ contributes many dense windows, so pooled and macro evaluation answer different questions.
  • Constant velocity being strong is not a new discovery. Prior work already established that CV-style baselines are hard to dismiss on ETH/UCY.

Full Gaussian-Jitter Audit

Command:

python scripts/run_selector.py --config configs/noise_study.yaml --output-dir results/selector_full --noise-sigmas 0 0.02 0.05 0.1 0.2 --noise-seeds 42 43 44 45 46

Mixed Gaussian-jitter ADE:

| Policy / Model | Mixed ADE | 95% grouped CI | Mean regret |
| --- | --- | --- | --- |
| LOSO threshold selector | 0.662352 m | [0.630250, 0.698300] | 0.208877 m |
| Motion LOSO selector | 0.663879 m | [0.631394, 0.699734] | 0.210404 m |
| Fixed ForecastLite selector | 0.663932 m | [0.631355, 0.699853] | 0.210457 m |
| OLS-8 | 0.716557 m | [0.680635, 0.754783] | 0.263083 m |
| OLS-4 | 0.720154 m | [0.696024, 0.746804] | 0.266680 m |
| Tree selector | 0.742189 m | [0.706919, 0.782720] | 0.288715 m |
| Kalman CV | 0.903828 m | [0.885148, 0.924602] | 0.450353 m |
| Last-step CV | 1.168029 m | [1.151870, 1.185559] | 0.714554 m |
| Stationary | 1.463198 m | [1.366109, 1.566100] | 1.009723 m |

Paired matched-window deltas:

| Comparison | Mean ADE delta | 95% CI | Interpretation |
| --- | --- | --- | --- |
| LOSO selector - OLS-8 | -0.054206 m | [-0.058516, -0.050375] | LOSO has lower ADE |
| LOSO selector - OLS-4 | -0.057803 m | [-0.066286, -0.048296] | LOSO has lower ADE |

Interpretation:

  • The selector improvement is real but modest.
  • The result is roughly a 7.6%-8.0% mixed-distribution ADE reduction over strong fixed OLS baselines.
  • This should not be described as a forecasting breakthrough.
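The 7.6%-8.0% figure can be rechecked from the two tables above. The sketch below divides each paired ADE delta by the matching fixed-baseline mixed ADE; the helper name relative_reduction is introduced here for illustration and is not part of the repository.

```python
# Values copied from the mixed Gaussian-jitter tables above (meters).
ols8_ade, ols4_ade = 0.716557, 0.720154
delta_vs_ols8, delta_vs_ols4 = -0.054206, -0.057803


def relative_reduction(delta: float, baseline: float) -> float:
    """Percent ADE reduction of the selector relative to a fixed baseline."""
    return -delta / baseline * 100.0


reduction_ols8 = relative_reduction(delta_vs_ols8, ols8_ade)
reduction_ols4 = relative_reduction(delta_vs_ols4, ols4_ade)
print(f"{reduction_ols8:.1f}% vs OLS-8, {reduction_ols4:.1f}% vs OLS-4")
# prints "7.6% vs OLS-8, 8.0% vs OLS-4"
```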

Expanded Corruption Audit

Expanded corruption types:

  • gaussian_jitter
  • final_step_spike
  • quantization
  • timestamp_jitter
  • homography_bias
  • consecutive_dropout

Levels:

  • 0.05
  • 0.10
  • 0.20

Seeds:

  • 42
  • 43
  • 44

Generated scale in the verified run:

| Artifact | Rows |
| --- | --- |
| results/selector_corruption_expanded/base_metrics_by_window.parquet | 9,223,470 |
| selector_metrics.csv | 1,844,694 |
| loso_threshold_selector_metrics.csv | 1,844,694 |
| tree_selector_metrics.csv | 1,844,694 |

Single-corruption winners:

| Corruption | Best policy / model | Mixed ADE | Main lesson |
| --- | --- | --- | --- |
| consecutive dropout | LOSO / fixed / Motion / last-step CV tie | 0.485897 m | Observed missingness is recoverable by last-step-style behavior |
| gaussian jitter | LOSO / fixed selector | 0.726233 m | Selector improves over OLS-8 on this expanded subset |
| final-step spike | Motion LOSO | 0.705787 m | Speed-trend features help; OLS-8 is close at 0.706699 m |
| homography bias | Last-step CV | 0.657662 m | Selector has a clear feature gap |
| quantization | OLS-4 | 0.582928 m | Short-window smoothing wins |
| timestamp jitter | Kalman CV | 0.498549 m | Timestamp-aware smoothing helps |

Uniform expanded mixture:

| Policy | Mixture ADE |
| --- | --- |
| Motion LOSO selector | 0.628318 m |
| LOSO threshold selector | 0.629112 m |
| Fixed selector | 0.629112 m |

Interpretation:

  • Motion LOSO is useful as an ablation.
  • It is not the headline selector because it worsens Gaussian jitter: LOSO/fixed score 0.726233 m, while Motion LOSO scores 0.733404 m.
  • The safest paper claim is mixture-dependent improvement, not universal dominance.

Tail Risk and Failure Concentration

Expanded audit, all corruption types:

| Model / Policy | Mean regret | P95 regret | Regret >= 1 m rate | Max regret |
| --- | --- | --- | --- | --- |
| LOSO threshold selector | 0.175440 m | 0.748423 m | 0.027612 | 7.075210 m |
| Fixed selector | 0.175440 m | 0.748423 m | 0.027612 | 7.075210 m |
| Motion LOSO selector | 0.174646 m | 0.749803 m | 0.027347 | 6.844848 m |
| OLS-4 | 0.204330 m | 0.796091 m | 0.028246 | 6.844848 m |
| OLS-8 | 0.259864 m | 1.048856 m | 0.056222 | 7.269652 m |
| Kalman CV | 0.288400 m | 1.290415 m | 0.078545 | 6.972805 m |
| Last-step CV | 0.404088 m | 1.879473 m | 0.128180 | 10.001165 m |

Largest LOSO failure concentration:

| Scene | Corruption | Mean regret | P95 regret | Max regret | Regret >= 1 m rate | Common oracle | Common selected model |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ETH | homography_bias | 0.632800 m | 2.380453 m | 7.075210 m | 0.218864 | stationary | OLS-4 |

Interpretation:

  • The selector improves mean and some tail metrics relative to several fixed baselines.
  • It does not eliminate high-regret failures.
  • ETH abrupt-stop, strong-turn, and homography-like bias cases remain the most visible weaknesses.

What This Project Studies

ForecastLite-Selector studies a narrow but important question:

In short-horizon pedestrian trajectory prediction, if we only use transparent, deterministic predictors based on observed positions and timestamps, when are they reliable, when do they fail, and can an observed-only selector choose a better simple predictor from the available family?

More concretely, the project studies:

  • clean-data performance of deterministic predictors;
  • how noise, missing observations, timestamp irregularity, quantization, and homography-like bias change model ranking;
  • why pooled and macro summaries can disagree;
  • whether selector evaluation is free of future leakage;
  • selector regret relative to a deterministic oracle;
  • whether tail risk reveals failures hidden by mean ADE;
  • which claims are supported by evidence and which claims are not.

The project is suitable as:

  • a pedestrian trajectory baseline study;
  • an embodied-AI or robotics interview project;
  • a reliability-audit and reproducibility case study;
  • a sanity-check layer for larger learned forecasting systems;
  • an example of careful claim hygiene in an empirical ML paper.

What This Project Does Not Claim

ForecastLite-Selector does not claim that it:

  • is a state-of-the-art trajectory forecasting model;
  • beats Social LSTM, Social GAN, Trajectron++, NATRA, TrajImpute, or other learned methods;
  • models social interaction;
  • models multimodality;
  • generates probabilistic trajectories;
  • guarantees collision avoidance or planning safety;
  • proves that synthetic corruption is equivalent to real detector/tracker noise;
  • transfers to all datasets, camera setups, or embodied-agent settings;
  • proves that ADE/FDE are sufficient for social plausibility or safety.

These non-claims are part of the project quality. They prevent a reliability audit from being misrepresented as a model paper with broader evidence than the experiments actually provide.

Data and Protocol

Data Source

The data comes from the ETH/UCY split files curated in the Trajectron++ repository.

Recorded upstream commit:

1031c7bd1a444273af378c1ec1dcca907ba59830

This project expects the raw files at:

data/raw/eth_ucy/

That directory is ignored by .gitignore. Raw data should be reviewed against upstream licenses before redistribution.

Raw Data Schema

Each row has:

frame_id    track_id    pos_x    pos_y

Meaning:

  • frame_id: original frame number;
  • track_id: pedestrian trajectory ID;
  • pos_x, pos_y: world coordinates in meters.

Standard Protocol

| Parameter | Value |
| --- | --- |
| Observed steps | 8 |
| Predicted steps | 12 |
| Sampling interval | 0.4 s |
| Raw-frame step | 10 frames |
| Standard window stride | 1 |
| Evaluated split | test |
| Scenes | eth, hotel, univ, zara1, zara2 |

A valid window must satisfy:

  • one consistent track_id;
  • enough length for 8 observed steps plus 12 future steps;
  • continuous frame spacing;
  • finite coordinates;
  • no duplicate frame/track pair.
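The validity rules above can be sketched for a single track as follows. This is a minimal illustration, not the code in src/forecastlite/windows.py: the function name valid_window_starts and its argument layout are assumptions, and the "one consistent track_id" rule is assumed to be satisfied by passing in rows of one track only.

```python
import math


def valid_window_starts(frame_ids, xy, obs_len=8, pred_len=12, frame_step=10):
    """Return start indices of stride-1 windows that pass the checks for ONE track."""
    total = obs_len + pred_len
    # No duplicate frame/track pair (the track is fixed, so frames must be unique).
    if len(set(frame_ids)) != len(frame_ids):
        raise ValueError("duplicate frame/track pair")
    starts = []
    # The range is empty when the track is shorter than 8 + 12 steps.
    for start in range(len(frame_ids) - total + 1):
        frames = frame_ids[start:start + total]
        points = xy[start:start + total]
        # Continuous frame spacing: every gap equals the raw-frame step.
        continuous = all(b - a == frame_step for a, b in zip(frames, frames[1:]))
        # Finite coordinates only.
        finite = all(math.isfinite(x) and math.isfinite(y) for x, y in points)
        if continuous and finite:
            starts.append(start)
    return starts


frame_ids = [10 * i for i in range(22)]
xy = [(0.1 * i, 0.0) for i in range(22)]
print(valid_window_starts(frame_ids, xy))  # [0, 1, 2]
```

A 22-step track yields three dense windows; a single frame gap or NaN removes every window that overlaps it.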

Method Overview

All predictors implement the same interface:

predict(observed_xy, observed_t, future_t) -> predicted_xy

This guarantees that every model sees the same observed history and the same future timestamps.
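One way to express that shared interface is a structural type. The sketch below is an assumption about shape only: the class names Predictor and StationaryPredictor and the helper run are illustrative, and the actual definitions in src/forecastlite/predictors/protocol.py may differ.

```python
from typing import Protocol, Sequence

Point = tuple[float, float]


class Predictor(Protocol):
    """Structural type for the shared interface described above."""

    def predict(
        self,
        observed_xy: Sequence[Point],
        observed_t: Sequence[float],
        future_t: Sequence[float],
    ) -> list[Point]: ...


class StationaryPredictor:
    """Simplest family member: repeat the last observed position."""

    def predict(self, observed_xy, observed_t, future_t):
        last = observed_xy[-1]
        return [last for _ in future_t]


def run(predictor: Predictor, observed_xy, observed_t, future_t):
    # Every predictor receives exactly the same three inputs.
    return predictor.predict(observed_xy, observed_t, future_t)


observed_t = [0.4 * i for i in range(8)]
observed_xy = [(0.4 * i, 0.0) for i in range(8)]
future_t = [0.4 * i for i in range(8, 20)]
predicted = run(StationaryPredictor(), observed_xy, observed_t, future_t)
print(len(predicted), predicted[0])
```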

Stationary

Assumes the pedestrian stays at the last observed position:

future position = last observed position

Strength:

  • very strong for low-motion and abrupt-stop cases.

Weakness:

  • poor for normal walking motion.

Last-Step Constant Velocity

Estimates velocity from only the last two observed points:

velocity = (last_position - previous_position) / dt

Strengths:

  • fast response on clean short-term motion;
  • best clean pooled ADE in this study.

Weakness:

  • if the final observed step is corrupted, the velocity estimate amplifies that error into the future.
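The amplification effect is easy to see on a toy track. The sketch below assumes a pedestrian walking at 1 m/s along x and adds a 0.1 m spike to the final observed point only; the function name last_step_cv is illustrative and not taken from the repository.

```python
def last_step_cv(observed_xy, observed_t, future_t):
    """Extrapolate with the velocity estimated from the last two observations."""
    (x0, y0), (x1, y1) = observed_xy[-2], observed_xy[-1]
    dt = observed_t[-1] - observed_t[-2]
    vx, vy = (x1 - x0) / dt, (y1 - y0) / dt
    t_last = observed_t[-1]
    return [(x1 + vx * (t - t_last), y1 + vy * (t - t_last)) for t in future_t]


observed_t = [0.4 * i for i in range(8)]
future_t = [0.4 * i for i in range(8, 20)]
clean = [(0.4 * i, 0.0) for i in range(8)]               # 1 m/s along x
spiked = clean[:-1] + [(clean[-1][0] + 0.1, 0.0)]        # 0.1 m spike on the last step

clean_end = last_step_cv(clean, observed_t, future_t)[-1]
spiked_end = last_step_cv(spiked, observed_t, future_t)[-1]
final_error = abs(spiked_end[0] - clean_end[0])
print(f"final-step error from a 0.1 m spike: {final_error:.2f} m")
```

The spike shifts the velocity estimate by 0.1 / 0.4 = 0.25 m/s, so after 4.8 s the final-step error is 0.1 + 0.25 * 4.8 = 1.3 m.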

OLS-4 / OLS-8 Constant Velocity

Fits a line through the most recent 4 or 8 observed points:

position ~= velocity * time + intercept

Strengths:

  • smooths observation noise;
  • can be more stable under quantization or Gaussian jitter.

Weaknesses:

  • a longer history can respond too slowly;
  • turns or abrupt speed changes can make linear extrapolation fail.
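A closed-form least-squares fit shows the smoothing effect. This is a minimal pure-Python sketch, not the code in linear_regression.py; the function name ols_cv and the window argument are assumptions, and the toy track is a 1 m/s walker with a 0.1 m spike on the final observed point.

```python
def ols_cv(observed_xy, observed_t, future_t, window=4):
    """Fit position ~= velocity * time + intercept per axis on the last `window` points."""
    ts = list(observed_t[-window:])
    pts = list(observed_xy[-window:])
    t_mean = sum(ts) / len(ts)
    denom = sum((t - t_mean) ** 2 for t in ts)
    per_axis = []
    for axis in (0, 1):
        vals = [p[axis] for p in pts]
        v_mean = sum(vals) / len(vals)
        slope = sum((t - t_mean) * (v - v_mean) for t, v in zip(ts, vals)) / denom
        intercept = v_mean - slope * t_mean
        per_axis.append([slope * t + intercept for t in future_t])
    return list(zip(*per_axis))


observed_t = [0.4 * i for i in range(8)]
future_t = [0.4 * i for i in range(8, 20)]
clean = [(0.4 * i, 0.0) for i in range(8)]
spiked = clean[:-1] + [(clean[-1][0] + 0.1, 0.0)]

errors = {}
for window in (4, 8):
    clean_end = ols_cv(clean, observed_t, future_t, window)[-1]
    spiked_end = ols_cv(spiked, observed_t, future_t, window)[-1]
    errors[window] = abs(spiked_end[0] - clean_end[0])
print({w: round(e, 3) for w, e in errors.items()})
```

On this toy case the same spike costs roughly 0.43 m at the final step with a 4-point fit and roughly 0.14 m with an 8-point fit. The longer window smooths more, which is also why it reacts more slowly to real changes in motion.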

Kalman CV

Uses a constant-velocity Kalman filter:

state = [x, y, vx, vy]
measurement = [x, y]

Strengths:

  • stable under timestamp jitter and compatible noise assumptions;
  • best clean macro ADE in this study.

Weakness:

  • if process-noise or measurement-noise assumptions do not match the data, it can underperform OLS or last-step CV.
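A compact constant-velocity Kalman filter can be sketched without a matrix library by filtering x and y independently, which matches the joint [x, y, vx, vy] filter when process and measurement noise are uncorrelated across axes. The noise values q and r, the initial covariance, and the function names are assumptions for illustration; they are not the settings in configs/kalman_grid.yaml or kalman.py.

```python
def _filter_axis(values, times, future_t, q, r):
    """1D constant-velocity Kalman filter; state is (position, velocity)."""
    p, v = values[0], 0.0
    # Covariance entries; velocity starts highly uncertain (assumed prior).
    p00, p01, p10, p11 = r, 0.0, 0.0, 10.0
    for k in range(1, len(values)):
        dt = times[k] - times[k - 1]  # uses actual timestamps, so jitter is handled
        # Predict: x <- F x, P <- F P F^T + Q with white-noise-acceleration Q.
        p = p + v * dt
        n00 = p00 + dt * (p10 + p01) + dt * dt * p11 + q * dt ** 4 / 4
        n01 = p01 + dt * p11 + q * dt ** 3 / 2
        n10 = p10 + dt * p11 + q * dt ** 3 / 2
        n11 = p11 + q * dt ** 2
        # Update with a position-only measurement.
        s = n00 + r
        k0, k1 = n00 / s, n10 / s
        resid = values[k] - p
        p, v = p + k0 * resid, v + k1 * resid
        p00, p01 = (1 - k0) * n00, (1 - k0) * n01
        p10, p11 = n10 - k1 * n00, n11 - k1 * n01
    return [p + v * (t - times[-1]) for t in future_t]


def kalman_cv(observed_xy, observed_t, future_t, q=0.5, r=0.05 ** 2):
    xs = _filter_axis([pt[0] for pt in observed_xy], observed_t, future_t, q, r)
    ys = _filter_axis([pt[1] for pt in observed_xy], observed_t, future_t, q, r)
    return list(zip(xs, ys))


observed_t = [0.4 * i for i in range(8)]
observed_xy = [(0.4 * i, 0.0) for i in range(8)]
future_t = [0.4 * i for i in range(8, 20)]
pred = kalman_cv(observed_xy, observed_t, future_t)
print(pred[-1])
```

Raising r relative to q makes the filter trust its motion model more than the measurements; a mismatch in either direction is the failure mode named in the weakness above.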

ForecastLite-Selector

The selector does not forecast directly. It chooses among base predictors:

stationary
last_step_cv
ols_4
ols_8
kalman_cv

Core Constraint: Observed-Only Selection

Selector features may only use:

  • observed positions;
  • observed timestamps;
  • future timestamps;
  • observed-only derived features.

They may not use:

  • future positions;
  • future turn labels;
  • oracle model choice;
  • ADE/FDE;
  • regret.

Future-derived information is allowed only for evaluation and failure analysis, not for model selection.
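A leakage check of this kind can be sketched as a tampering test: change only the future positions and require the selector's choice to stay the same. The feature names, the toy rule, its 0.1 m/s cutoff, and the dictionary layout of a window are all hypothetical; the repository's own tests and selector.py may be organized differently.

```python
import math


def observed_features(observed_xy, observed_t):
    """Features computed from observed positions and timestamps only."""
    steps = [math.dist(a, b) for a, b in zip(observed_xy, observed_xy[1:])]
    dts = [b - a for a, b in zip(observed_t, observed_t[1:])]
    speeds = [s / dt for s, dt in zip(steps, dts)]
    return {
        "mean_speed": sum(speeds) / len(speeds),
        "net_displacement": math.dist(observed_xy[0], observed_xy[-1]),
        "dt_spread": max(dts) - min(dts),
    }


def toy_selector(window):
    """Hypothetical rule that reads only the observed part of the window."""
    feats = observed_features(window["observed_xy"], window["observed_t"])
    return "stationary" if feats["mean_speed"] < 0.1 else "ols_8"


def assert_no_future_leakage(selector, window, altered_future_xy):
    """The choice must not change when only future positions change."""
    tampered = dict(window, future_xy=altered_future_xy)
    assert selector(window) == selector(tampered), "selector depends on future positions"


window = {
    "observed_xy": [(0.4 * i, 0.0) for i in range(8)],
    "observed_t": [0.4 * i for i in range(8)],
    "future_xy": [(0.4 * i, 0.0) for i in range(8, 20)],
}
assert_no_future_leakage(toy_selector, window, [(99.0, 99.0)] * 12)
print(toy_selector(window))
```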

Fixed Selector

The fixed selector uses hand-written rules based on observed speed, displacement, OLS residuals, timestamp irregularity, missingness ratio, and related observed-only features.

LOSO Threshold Selector

LOSO means leave-one-scene-out.

Procedure:

  1. Hold out one scene.
  2. Choose thresholds on the remaining scenes.
  3. Freeze those thresholds.
  4. Evaluate on the held-out scene.
  5. Repeat for all five held-out scenes.
  6. Merge the held-out results.

This is stricter than tuning and testing on the same scene.
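The six steps can be sketched with a deliberately small policy: one observed-speed threshold that switches between two base predictors. The row layout, the candidate thresholds, and the ADE values are hypothetical toy data, and the real selector uses more features and more base models. Thresholds are fitted using ADE from the training scenes only.

```python
def policy_ade(rows, threshold):
    """Mean ADE of the rule: stationary below the speed threshold, OLS-8 otherwise."""
    chosen = [
        r["ade"]["stationary"] if r["speed"] < threshold else r["ade"]["ols_8"]
        for r in rows
    ]
    return sum(chosen) / len(chosen)


def loso_evaluate(rows, candidate_thresholds):
    results = {}
    for held_out in sorted({r["scene"] for r in rows}):              # steps 1 and 5
        train = [r for r in rows if r["scene"] != held_out]
        test = [r for r in rows if r["scene"] == held_out]
        best = min(candidate_thresholds,
                   key=lambda thr: policy_ade(train, thr))           # step 2
        results[held_out] = {"threshold": best,                      # step 3: frozen
                             "ade": policy_ade(test, best)}          # step 4
    return results                                                   # step 6: merged


# Hypothetical toy rows: observed speed (m/s) and per-model ADE (m).
rows = [
    {"scene": "eth", "speed": 0.05, "ade": {"stationary": 0.1, "ols_8": 0.6}},
    {"scene": "eth", "speed": 1.20, "ade": {"stationary": 1.5, "ols_8": 0.5}},
    {"scene": "hotel", "speed": 0.08, "ade": {"stationary": 0.2, "ols_8": 0.7}},
    {"scene": "hotel", "speed": 1.00, "ade": {"stationary": 1.3, "ols_8": 0.4}},
    {"scene": "univ", "speed": 0.30, "ade": {"stationary": 0.9, "ols_8": 0.3}},
    {"scene": "univ", "speed": 1.40, "ade": {"stationary": 1.8, "ols_8": 0.6}},
]
loso = loso_evaluate(rows, candidate_thresholds=[0.0, 0.1, 0.5])
for scene, result in loso.items():
    print(scene, result)
```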

Tree Selector

The tree selector is a shallow decision-tree diagnostic. It is not the headline method. It tests whether a more flexible observed-only rule gives a large gain.

Motion LOSO Selector

The Motion LOSO selector adds speed-trend and deceleration features as an ablation. It is slightly better on the expanded uniform mixture, but it worsens Gaussian-jitter performance, so it should not replace the headline selector unless the target distribution is explicitly redefined.

Deterministic Reliability Envelope

The Deterministic Reliability Envelope is one of the central analysis tools in this project.

For each matched case, it asks:

If we are only allowed to choose within the current deterministic predictor family, what is the best possible ADE/FDE after seeing the outcome?

The selector cannot use this information. It is an evaluation-only oracle.

Uses:

  • compute oracle regret;
  • separate best fixed baseline, best selector policy, and per-case oracle;
  • measure how far a selector remains from the best available deterministic family member;
  • expose tail failures;
  • avoid hiding severe failures behind mean ADE.

The main reliability audit is generated by:

python scripts/build_reliability_audit_tables.py --gaussian-dir results/selector_full --expanded-dir results/selector_corruption_expanded --output-dir results/reliability_audit

Generated artifacts include:

  • claim_separation.csv
  • estimands.csv
  • claim_ledger.csv
  • external_validation_design.csv
  • reliability_envelope_cases.csv
  • reliability_envelope_summary.csv
  • regret_dominance_map.csv
  • mixture_sensitivity.csv
  • tail_risk_summary.csv
  • selector_failure_summary.csv
  • mixture and tail-risk figures.

Reproduce From Scratch

1. Install the Environment

Python 3.11 or newer is required. The latest complete verification used Python 3.13.

python -m pip install -e ".[dev]"

If installing from the lock file manually:

python -m pip install -r requirements-lock.txt
python -m pip install -e .

2. Prepare the Data

Place the ETH/UCY split files at:

data/raw/eth_ucy/

The structure should be readable by configs/benchmark.yaml. Raw data is not redistributed in this repository; review the Trajectron++ and ETH/UCY upstream licenses before redistribution.

3. Run Basic Quality Gates

ruff check .
python -m pytest -q
python -m compileall -q src scripts tests

Expected:

ruff: All checks passed.
pytest: 40 passed.
compileall: exit 0.

4. Run the Core Benchmark

python scripts/inspect_data.py --config configs/benchmark.yaml --output-dir results/data_inspection
python scripts/run_benchmark.py --config configs/benchmark.yaml --output-dir results/core

Generated files:

  • results/data_inspection/data_provenance.csv
  • results/data_inspection/window_counts.csv
  • results/data_inspection/exclusions.csv
  • results/core/metrics_by_window.parquet
  • results/core/summary_by_scene.csv
  • results/core/paired_comparisons.csv
  • results/core/manifest.json

5. Run Ablation Studies

python scripts/run_ablation.py --config configs/noise_study.yaml --study all --output-dir results/ablations

Generated files:

  • horizon_summary.csv
  • observation_length_summary.csv
  • noise_summary.csv
  • turn_stratification.csv
  • stride_sensitivity.csv

Note: this command can run for a while. In the 2026-06-15 reproduction, the wrapper timed out but the Python subprocess continued and completed. Completion was verified by output-file presence and row counts.

6. Run the Full Gaussian Selector Audit

python scripts/run_selector.py --config configs/noise_study.yaml --output-dir results/selector_full --noise-sigmas 0 0.02 0.05 0.1 0.2 --noise-seeds 42 43 44 45 46

Then run the derived audits:

python scripts/audit_stationary_branch.py --selector-dir results/selector_full
python scripts/audit_selector_failure_features.py --selector-dir results/selector_full
python scripts/audit_motion_dynamics_features.py --selector-dir results/selector_full
python scripts/run_motion_loso_selector.py --config configs/noise_study.yaml --selector-dir results/selector_full
python scripts/analyze_dominance_stability.py --input-dir results/selector_full
python scripts/plot_selector_failures.py --config configs/noise_study.yaml --selector-dir results/selector_full --output-dir results/selector_full/figures/eth_failure_cases

Finally, build paper tables:

python scripts/summarize_selector_results.py --input-dir results/selector_full --bootstrap-resamples 1000

7. Run the Expanded Corruption Audit

python scripts/run_selector.py --config configs/noise_study.yaml --output-dir results/selector_corruption_expanded --corruption-types gaussian_jitter final_step_spike quantization timestamp_jitter homography_bias consecutive_dropout --corruption-levels 0.05 0.1 0.2 --noise-seeds 42 43 44

Then run the derived audits:

python scripts/audit_stationary_branch.py --selector-dir results/selector_corruption_expanded
python scripts/audit_selector_failure_features.py --selector-dir results/selector_corruption_expanded
python scripts/audit_motion_dynamics_features.py --selector-dir results/selector_corruption_expanded
python scripts/run_motion_loso_selector.py --config configs/noise_study.yaml --selector-dir results/selector_corruption_expanded
python scripts/analyze_dominance_stability.py --input-dir results/selector_corruption_expanded
python scripts/plot_selector_failures.py --config configs/noise_study.yaml --selector-dir results/selector_corruption_expanded --output-dir results/selector_corruption_expanded/figures/eth_failure_cases

Finally, build paper tables:

python scripts/summarize_selector_results.py --input-dir results/selector_corruption_expanded --bootstrap-resamples 1000

Note: the expanded selector audit is the heaviest step. In the 2026-06-15 reproduction, the wrapper timed out but the Python subprocess continued and completed.

8. Generate Paper Figures and Reliability Tables

python scripts/plot_selector_paper_figures.py --gaussian-dir results/selector_full --expanded-dir results/selector_corruption_expanded --output-dir results/paper_figures
python scripts/build_reliability_audit_tables.py --gaussian-dir results/selector_full --expanded-dir results/selector_corruption_expanded --output-dir results/reliability_audit

9. Build the Release Package

python scripts/train_tiny_gru.py
python scripts/build_release.py

train_tiny_gru.py does not train a model. It writes the TinyGRU cancellation record, documenting that TinyGRU is outside the current deterministic core release.

build_release.py requires the core results/ artifacts to exist. If a normal checkout does not contain results/, running it immediately will fail. That is expected.

Expected Results Directory

After a complete reproduction, results/ should contain:

results/
  data_inspection/
  core/
  ablations/
  selector_full/
  selector_corruption_expanded/
  paper_figures/
  reliability_audit/
  release/
  tiny_gru_cancellation_record.md

Core files:

results/core/summary_by_scene.csv
results/selector_full/paper_tables/mixed_corruption_summary.csv
results/selector_full/paper_tables/paired_mixed_summary.csv
results/selector_corruption_expanded/paper_tables/mixed_corruption_summary.csv
results/reliability_audit/expanded/mixture_sensitivity.csv
results/reliability_audit/expanded/tail_risk_summary.csv
results/reliability_audit/expanded/selector_failure_summary.csv

How to Read Results

Pooled vs Macro

pooled:

  • averages over all windows together;
  • gives more weight to large scenes;
  • is strongly affected by Univ because Univ contributes many windows.

macro:

  • averages within each scene first;
  • then averages across scenes;
  • better answers the question "how does this behave across scenes on average?"

Both matter. Pooled results can be dominated by a large scene. Macro results can hide the actual sample distribution.
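The difference is easy to see with the window counts from the scene breakdown. In the sketch below the window counts are the ones reported in this README, while the per-scene ADE values are hypothetical and chosen only to make the gap visible.

```python
# Window counts from the scene breakdown table.
windows = {"ETH": 364, "Hotel": 1197, "Univ": 24334, "Zara1": 2356, "Zara2": 5910}
# Hypothetical per-scene mean ADE in meters, for illustration only.
scene_ade = {"ETH": 1.00, "Hotel": 0.40, "Univ": 0.50, "Zara1": 0.45, "Zara2": 0.35}

total_windows = sum(windows.values())
# Pooled: every window counts once, so large scenes dominate.
pooled = sum(scene_ade[s] * windows[s] for s in windows) / total_windows
# Macro: every scene counts once, regardless of size.
macro = sum(scene_ade.values()) / len(scene_ade)
univ_share = windows["Univ"] / total_windows
print(f"pooled={pooled:.3f} m, macro={macro:.3f} m, Univ share={univ_share:.1%}")
```

Univ holds about 71% of the 34,161 windows, so the pooled value sits close to the Univ value while the macro value gives the small ETH scene the same weight as Univ.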

ADE vs FDE

ADE means:

Average Displacement Error

It is the average Euclidean distance error across all future timesteps.

FDE means:

Final Displacement Error

It only evaluates the last future timestep.

ADE/FDE are useful deterministic centerline diagnostics, but they do not prove safety, collision avoidance, or social plausibility.
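Both metrics reduce to a few lines. The sketch below is a minimal version for one window; the function name ade_fde is illustrative, and the toy prediction drifts 0.1 m further from the truth at each of the 12 future steps.

```python
import math


def ade_fde(predicted_xy, future_xy):
    """Return (ADE, FDE) in the units of the coordinates."""
    errors = [math.dist(p, f) for p, f in zip(predicted_xy, future_xy)]
    return sum(errors) / len(errors), errors[-1]


future_xy = [(0.4 * k, 0.0) for k in range(1, 13)]
predicted_xy = [(0.4 * k, 0.1 * k) for k in range(1, 13)]  # lateral drift grows per step
ade, fde = ade_fde(predicted_xy, future_xy)
print(f"ADE={ade:.2f} m, FDE={fde:.2f} m")
```

Here the per-step errors are 0.1, 0.2, ..., 1.2 m, giving an ADE of 0.65 m and an FDE of 1.2 m.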

Oracle Regret

Oracle regret is:

model ADE - best deterministic family ADE for the same case

If regret is zero, the model is the best available member of the deterministic family for that case.

If regret is large, the model is far from what the current deterministic family could have achieved with hindsight.
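For a single case the computation looks like the sketch below. The ADE values are hypothetical and only mimic the abrupt-stop pattern described in this README, where the oracle is stationary and the selector keeps extrapolating with OLS-4; the function name is illustrative.

```python
def oracle_and_regret(ade_by_model, selected):
    """Return the hindsight-best model and the regret of the selected one."""
    oracle = min(ade_by_model, key=ade_by_model.get)
    regret = ade_by_model[selected] - ade_by_model[oracle]
    return oracle, regret


# Hypothetical per-model ADE (m) for one abrupt-stop window.
case = {
    "stationary": 0.20,
    "last_step_cv": 1.90,
    "ols_4": 1.60,
    "ols_8": 1.75,
    "kalman_cv": 1.70,
}
oracle, regret = oracle_and_regret(case, selected="ols_4")
print(oracle, round(regret, 2))
```

Because the oracle is picked after seeing the outcome, it belongs in evaluation code only and must never feed back into selector features.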

Dominance Map

A dominance map answers:

Under a specific corruption, level, observed regime, or turn bin, which model has the lowest average ADE?

This is more informative than a single global table because the central finding of the project is conditional dominance, not global dominance.

Mixture Sensitivity

Mixture sensitivity answers:

If the weights of corruption types change, which policy is best?

This prevents a winner under one mixture distribution from being overstated as a universal winner.
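A small sketch shows how the winner can flip with the mixture weights. The two policies and their per-corruption ADE values are hypothetical, not numbers from the audit, and the function names are illustrative.

```python
# Hypothetical per-corruption ADE in meters; not values from the audit.
ade = {
    "policy_a": {"gaussian_jitter": 0.70, "homography_bias": 0.80},
    "policy_b": {"gaussian_jitter": 0.75, "homography_bias": 0.66},
}


def mixture_ade(per_corruption, weights):
    """Weighted mean ADE under a declared corruption mixture."""
    total = sum(weights.values())
    return sum(per_corruption[c] * w for c, w in weights.items()) / total


def best_policy(ade_table, weights):
    return min(ade_table, key=lambda p: mixture_ade(ade_table[p], weights))


winners = {}
for w_gauss in (0.9, 0.5, 0.1):
    weights = {"gaussian_jitter": w_gauss, "homography_bias": 1.0 - w_gauss}
    winners[w_gauss] = best_policy(ade, weights)
print(winners)
```

With these toy numbers policy_a wins only when Gaussian jitter carries most of the weight, which is why a mixture has to be declared before a winner is reported.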

Tail Risk

Tail risk tracks P90 / P95 / P99 regret and high-regret rate.

It answers:

When mean ADE looks acceptable, are there still rare but severe failures?

For this project the answer is yes: the selector reduces some tail risk, but high-regret failures remain.
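The tail summaries can be sketched as below. The regret values are hypothetical, the percentile uses linear interpolation between order statistics (the audit scripts may use a different convention), and the function names are illustrative.

```python
def percentile(sorted_vals, q):
    """Linear-interpolation percentile of an already sorted list, q in [0, 100]."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def tail_summary(regrets, high_regret=1.0):
    vals = sorted(regrets)
    return {
        "mean": sum(vals) / len(vals),
        "p90": percentile(vals, 90),
        "p95": percentile(vals, 95),
        "p99": percentile(vals, 99),
        "high_regret_rate": sum(v >= high_regret for v in vals) / len(vals),
        "max": vals[-1],
    }


# Hypothetical regrets (m): mostly small, with a few severe failures.
regrets = [0.1] * 95 + [3.0] * 5
summary = tail_summary(regrets)
print(summary)
```

In this toy sample the mean is about 0.245 m while 5% of cases have 3 m of regret, which is the kind of gap that a mean-only table hides.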

Code Structure

configs/
  benchmark.yaml              clean ETH/UCY benchmark config
  noise_study.yaml            noise/corruption study config
  kalman_grid.yaml            Kalman parameter grid notes
  tiny_gru.yaml               cancelled extension config

data/
  README.md                   data setup notes
  fixtures/                   tiny test fixture
  raw/                        ignored local raw data

src/forecastlite/
  bootstrap.py                grouped bootstrap CI helpers
  config.py                   YAML config loading and typed config objects
  corruption.py               observed-history corruption operators
  data.py                     raw file discovery, validation, provenance
  evaluate.py                 benchmark evaluation and result writing
  metrics.py                  ADE, FDE, summaries, paired comparisons
  plotting.py                 core benchmark plots
  reliability.py              reliability envelope, mixture, tail, claim tables
  schema.py                   TrajectoryWindow data object
  selector.py                 observed-only features, selectors, regret maps
  stratify.py                 future turn-bin analysis labels
  windows.py                  ETH/UCY window construction
  predictors/
    stationary.py
    constant_velocity.py
    linear_regression.py
    kalman.py
    protocol.py
    __init__.py

scripts/
  CLI entry points for inspection, benchmark, selector, audits, plotting, release

tests/
  unit, leakage, integration, selector, corruption, reliability tests

doc/
  manuscript, guide, reproduction log, upgrade decision, references, slides, papers

Script Map

| Script | Purpose | Main output |
| --- | --- | --- |
| scripts/inspect_data.py | Validate data and count windows | results/data_inspection/ |
| scripts/run_benchmark.py | Clean five-scene benchmark | results/core/ |
| scripts/run_ablation.py | Horizon, observation, noise, turn, stride studies | results/ablations/ |
| scripts/run_selector.py | Base predictor + selector + oracle audit | results/selector_* |
| scripts/summarize_selector_results.py | Paper-ready selector tables | paper_tables/ |
| scripts/audit_stationary_branch.py | Check stationary selector branch | stationary_branch_audit/ |
| scripts/audit_selector_failure_features.py | Audit observed-feature failure overrides | failure_feature_audit/ |
| scripts/audit_motion_dynamics_features.py | Audit speed-trend features | motion_dynamics_audit/ |
| scripts/run_motion_loso_selector.py | Motion LOSO ablation | motion_loso_selector_metrics.csv |
| scripts/analyze_dominance_stability.py | Winner stability across cases | dominance_stability/ |
| scripts/plot_selector_failures.py | High-regret qualitative figures | figures/eth_failure_cases/ |
| scripts/plot_selector_paper_figures.py | Manuscript summary figures | results/paper_figures/ |
| scripts/build_reliability_audit_tables.py | Reliability envelope and claim-safety tables | results/reliability_audit/ |
| scripts/train_tiny_gru.py | Write cancellation record | results/tiny_gru_cancellation_record.md |
| scripts/build_release.py | Validate and package release artifacts | results/release/ |

Tests and Quality Gates

The test suite covers:

  • trajectory window construction;
  • predictor behavior;
  • ADE/FDE metrics;
  • Kalman filter state and covariance behavior;
  • corruption operators;
  • no-leakage boundaries;
  • selector behavior;
  • reliability envelope and claim tables;
  • a tiny integration pipeline.

Recommended before any release:

ruff check .
python -m pytest -q
python -m compileall -q src scripts tests

Last recorded full verification:

ruff: All checks passed.
pytest: 40 passed.
compileall: exit 0.

Paper and Documents

| File | Use |
| --- | --- |
| doc/FORECASTLITE_SELECTOR_MANUSCRIPT_DRAFT_2026-06-14.md | Main manuscript draft |
| doc/FORECASTLITE_COMPLETE_BEGINNER_GUIDE_ZH_2026-06-15.md | Complete beginner-level explanation in Chinese |
| doc/FORECASTLITE_REPRODUCIBILITY_RUN_LOG_2026-06-15.md | Exact command/result reproduction log |
| doc/FORECASTLITE_WORKSHOP_UPGRADE_DECISION_2026-06-15.md | Upgrade-path decision memo |
| doc/FORECASTLITE_REPOSITIONING_LITERATURE_AUDIT_ZH_2026-06-14.md | Literature-positioning audit in Chinese |
| doc/forecastlite_selector_references.bib | BibTeX references |
| doc/slides/ | Interview / presentation deck artifacts |
| doc/paper/ | Local reference papers and extracted assets |

Claim Hygiene

Safe Claims

It is safe to say:

  • ForecastLite-Selector is a leakage-safe deterministic reliability audit on ETH/UCY.
  • Constant velocity being strong is prior work, not the main contribution.
  • Deterministic predictor dominance is conditional on motion and corruption regime.
  • The observed-only selector improves declared Gaussian mixed ADE over fixed OLS baselines.
  • The expanded corruption grid has no universal winner.
  • Tail-regret tables expose remaining high-regret failures.
  • External validation and real-noise transfer remain future work.

Unsafe Claims

Do not say:

  • ForecastLite is SOTA.
  • ForecastLite discovered that constant velocity is strong.
  • The selector is universally robust.
  • Synthetic corruption proves deployment reliability.
  • ADE/FDE prove safety.
  • Motion LOSO is the main method unless the target distribution is redefined and the Gaussian-jitter tradeoff is accepted.
  • SDD / OpenTraj / T2FPV / EgoTraj-Bench external validation has already been completed.

Best Next Steps

Based on the workshop upgrade decision, the next priorities are:

  1. Primary path: frozen-policy real-noise replay on T2FPV-style or EgoTraj-Bench-style data.
  2. Backup path: frozen external validation on SDD / OpenTraj.
  3. Do regardless: keep failure analysis, reliability envelope, mixture sensitivity, tail risk, and claim boundaries central.
  4. Defer: conformal tubes, extra selector tuning, learned selectors, and risk-aware objectives until transfer evidence exists.

Reasoning:

  • The current largest weakness is not that the selector is too simple. The largest weakness is that ETH/UCY synthetic corruption has not yet been shown to transfer to real perception noise.
  • If real-noise replay fails, that negative result should be reported honestly.
  • A negative result is still valuable because it clarifies whether ForecastLite-Selector is an ETH/UCY diagnostic or a transferable policy.

FAQ

Why does this checkout not contain results/?

results/ is a reproducible generated directory and is ignored by .gitignore. Running the reproduction commands regenerates it. The complete historical result summary is in doc/FORECASTLITE_REPRODUCIBILITY_RUN_LOG_2026-06-15.md.

Why not commit all result artifacts?

The selector and expanded-corruption outputs are large. In the 2026-06-15 backup, the old results directory was about 7.55 GB. Committing it directly would make the repository difficult to use. The better approach is to keep source code, configs, logs, and exact reproduction commands.

Why not use a neural network?

The goal is not to win a leaderboard. The goal is to build a deterministic baseline reliability audit. Neural models can be future work, but the current contribution comes from transparent, interpretable, reproducible predictors and selector/regret analysis.

Why was TinyGRU cancelled?

TinyGRU was an early optional extension. The current core release is a deterministic baseline study. scripts/train_tiny_gru.py writes a cancellation record rather than training a model.

Why use only ADE/FDE?

This project evaluates deterministic centerline predictors and keeps alignment with ETH/UCY baseline tradition. The README, manuscript, and claim ledger all state clearly that ADE/FDE are not safety or social-plausibility metrics.

Why use grouped bootstrap?

Dense sliding windows are not independent samples. One trajectory produces many neighboring windows. Grouped bootstrap resamples by scene/source_file/track_id, so highly correlated windows are not treated as independent observations.
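A grouped percentile bootstrap can be sketched as follows. This is a minimal illustration rather than the code in src/forecastlite/bootstrap.py: the function name, the dictionary keyed by group, the percentile-interval convention, and the toy ADE values are all assumptions.

```python
import random


def grouped_bootstrap_ci(values_by_group, resamples=1000, alpha=0.05, seed=0):
    """Percentile CI for the pooled mean, resampling whole groups with replacement."""
    rng = random.Random(seed)
    groups = list(values_by_group.values())
    means = []
    for _ in range(resamples):
        # Draw whole trajectories, so correlated windows move together.
        sample = rng.choices(groups, k=len(groups))
        total = sum(sum(g) for g in sample)
        count = sum(len(g) for g in sample)
        means.append(total / count)
    means.sort()
    lo = means[int((alpha / 2) * (resamples - 1))]
    hi = means[int((1 - alpha / 2) * (resamples - 1))]
    return lo, hi


# Hypothetical per-window ADE (m), keyed by (scene, source_file, track_id).
ade_by_group = {
    ("eth", "a.txt", 1): [0.40, 0.42, 0.41],
    ("eth", "a.txt", 2): [0.90, 0.95],
    ("hotel", "b.txt", 7): [0.30, 0.31, 0.29, 0.30],
    ("univ", "c.txt", 3): [0.55, 0.60],
}
ci_low, ci_high = grouped_bootstrap_ci(ade_by_group)
print(f"95% grouped CI: [{ci_low:.3f}, {ci_high:.3f}]")
```

Resampling individual windows instead would treat neighboring windows from the same trajectory as independent and would typically produce intervals that are too narrow.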

Why can the selector not use the future turn bin?

Because the future trajectory does not exist at prediction time. Future-derived labels are allowed only for post-hoc stratified analysis. The code and tests treat this boundary as a leakage-safety requirement.

What should I check if reproduced numbers differ from this README?

First check:

  1. Whether the configs are exactly configs/benchmark.yaml and configs/noise_study.yaml.
  2. Whether the data comes from the same Trajectron++ commit.
  3. Whether raw files were modified.
  4. Whether Python/package versions differ substantially.
  5. Whether Motion LOSO was run before summarize_selector_results.py.
  6. Whether paper tables were generated with 1000 bootstrap resamples.

Then compare against the reproduction log in doc/FORECASTLITE_REPRODUCIBILITY_RUN_LOG_2026-06-15.md.

License and Data Notes

The code license is recorded in LICENSE.

Raw ETH/UCY data and copied upstream split files require separate license review before redistribution. The project records provenance and checksums when scripts/inspect_data.py runs, but it does not claim permission to redistribute the upstream dataset.
