Select2Plan (S2P): Training-Free ICL-Based Planning through VQA and Memory Retrieval [IEEE RA-L 2025]
Davide Buoso, Luke Robinson, Giuseppe Averta, Philip Torr, Tim Franzmeyer, Daniele De Martini
Politecnico di Torino · University of Oxford
Welcome to the official repository for Select2Plan (S2P), a training-free framework for high-level robot planning that leverages off-the-shelf Vision-Language Models (VLMs) for autonomous navigation.
Unlike most learning-based approaches that require extensive task-specific training and large-scale data collection, S2P adapts inputs to align with the VLM's pretraining data — no fine-tuning required.
🚀 No task-specific training or fine-tuning.
🧠 In-Context Learning (ICL) from a small experiential memory bank.
🎯 Structured Visual Question Answering (VQA) to ground actions on the image.
🔄 Maximal Marginal Relevance (MMR) for diverse, relevant memory retrieval.
🤖 Two navigation paradigms: First-Person View (FPV) and Third-Person View (TPV).
📄 Read our paper (IEEE RA-L)
🌍 Project Page
⚠️ Disclaimer: Exact reproduction of the benchmark results reported in the paper is currently a work in progress. This release focuses on providing a clear, environment-agnostic reference implementation of the S2P methodology (visual prompting, experiential memory retrieval with MMR, and ICL-based planning) that researchers can adapt to their own setups. The agents ship with Google Gemini as the default VLM backend, but the pipeline is designed to be VLM-agnostic: you can extend or swap in other vision-language models by adapting the API calls in `fpv_agent.py` and `tpv_agent.py`.
S2P supports two distinct navigation paradigms:
| Paradigm | Agent | Description | Output |
|---|---|---|---|
| FPV | `Select2PlanFPVAgent` | Autonomous navigation using an onboard monocular camera | Discrete movement commands (1–9, 0) |
| TPV | `Select2PlanTPVAgent` | Infrastructure-driven planning using external/CCTV cameras | Multi-hop trajectory (waypoints) avoiding obstacles |
FPV maintains a 24-bin episodic memory (compass) to track objects outside the current field of view.
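
One way to picture the compass is as 24 angular bins of 15° each (360° / 24), which matches the 15° granularity of the rotation commands. The sketch below is illustrative only: the class name `CompassMemory`, its methods, the equal-width bin layout, and the sign convention for rotations are all assumptions, not the data structure used in `fpv_agent.py`.

```python
from dataclasses import dataclass, field

NUM_BINS = 24                    # from the README: 24-bin compass
BIN_WIDTH_DEG = 360 / NUM_BINS   # assumption: equal 15-degree bins


@dataclass
class CompassMemory:
    """Hypothetical heading-indexed memory of previously seen objects."""
    heading_deg: float = 0.0
    bins: list = field(default_factory=lambda: [[] for _ in range(NUM_BINS)])

    def current_bin(self) -> int:
        """Index of the bin that contains the current heading."""
        return int((self.heading_deg % 360) // BIN_WIDTH_DEG)

    def rotate(self, delta_deg: float) -> None:
        """Positive delta turns right, negative turns left (assumed convention)."""
        self.heading_deg = (self.heading_deg + delta_deg) % 360

    def observe(self, objects: list) -> None:
        """Store objects seen at the current heading, skipping duplicates."""
        seen = self.bins[self.current_bin()]
        for obj in objects:
            if obj not in seen:
                seen.append(obj)


memory = CompassMemory()
memory.observe(["Sofa"])
memory.rotate(45)
memory.observe(["Bowl", "Table"])
memory.rotate(-60)
print(memory.current_bin(), memory.bins[3])
```

After the agent turns away, the objects recorded at earlier headings remain available in their bins, which is the information a prompt could use to reason about things outside the current field of view.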
TPV generates dynamic concentric candidate rings on the scene and plans trajectories toward a target while respecting obstacle masks.
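
As a rough illustration of ring-based candidates, the sketch below places points on concentric circles around the robot and keeps only those that land on traversable pixels. The function name `ring_candidates`, the fixed radii, and the number of points per ring are hypothetical; the way `tpv_agent.py` sizes its rings dynamically is not reproduced here.

```python
import math


def ring_candidates(center, radii, points_per_ring, mask):
    """Return (x, y) candidates on concentric rings that fall on traversable pixels.

    `mask` is any 2D array-like indexed as mask[y][x]; non-zero means traversable.
    """
    height, width = len(mask), len(mask[0])
    cx, cy = center
    candidates = []
    for radius in radii:
        for k in range(points_per_ring):
            angle = 2 * math.pi * k / points_per_ring
            x = round(cx + radius * math.cos(angle))
            y = round(cy + radius * math.sin(angle))
            inside = 0 <= x < width and 0 <= y < height
            if inside and mask[y][x]:
                candidates.append((x, y))
    return candidates


# Toy 100x100 mask whose right half (x >= 50) is blocked.
mask = [[255 if x < 50 else 0 for x in range(100)] for _ in range(100)]
points = ring_candidates(center=(40, 50), radii=[10, 20], points_per_ring=8, mask=mask)

# Number the surviving candidates, as a visual prompt would label them.
for label, point in enumerate(points, start=1):
    print(label, point)
```

The numbered labels are what a VLM could refer to when it answers with a sequence of waypoints.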
Before running S2P, you need an Experiential Memory bank of successful demonstrations:
- A `.csv` file with demonstration metadata (the database index).
- A `.pt` file with pre-computed Swin Transformer visual embeddings of the demonstration frames.
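The sampler embeds frames with a Swin Transformer and retrieves similar situations from memory. To balance similarity and diversity among retrieved examples, it applies Maximal Marginal Relevance (MMR), which trades off similarity to the current frame against redundancy with the examples already selected.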
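
A minimal sketch of greedy MMR selection follows, using the standard formulation: at each step, pick the candidate that maximises `lam * sim(query, d) - (1 - lam) * max sim(d, s)` over the already selected items `s`. Cosine similarity, the function name `mmr_select`, and the parameters `k` and `lam` are assumptions for illustration; the exact variant and weights used by S2P may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query, candidates, k=3, lam=0.5):
    """Greedy MMR: return indices of up to k candidates, in selection order."""
    relevance = [cosine(query, c) for c in candidates]
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            # Redundancy is the highest similarity to anything already picked.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
memory = [
    [1.0, 0.0],    # identical to the query
    [0.99, 0.05],  # near-duplicate of the first item
    [0.6, 0.8],    # less similar, but adds diversity
]
print(mmr_select(query, memory, k=2, lam=0.3))  # diversity-weighted
print(mmr_select(query, memory, k=2, lam=1.0))  # pure similarity ranking
```

With `lam=1.0` the redundancy term vanishes and the selection reduces to plain nearest-neighbour ranking; lower values penalise near-duplicates more strongly.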
Once organized, your data directory should look like this:
```
S2P/
├── fpv_agent.py
├── tpv_agent.py
├── data/
│   ├── fpv/
│   │   ├── demonstrations.csv
│   │   ├── embeddings.pt
│   │   └── images/           # frames referenced by `ts` in the CSV
│   └── tpv/
│       ├── demonstrations.csv
│       ├── embeddings.pt
│       └── images/           # frames referenced by `timestamp` in the CSV
```
Your CSV file acts as the database index. Format it according to the agent you are using:
For FPV Navigation (Select2PlanFPVAgent):
| Column | Description |
|---|---|
| `ts` | Unique identifier/timestamp of the frame (used to load the image) |
| `goal` | Target object the agent was searching for (e.g., "Bowl") |
| `compass` | Stringified list of 24 lists representing the agent's spatial memory |
| `object` | Stringified list of objects visible in the current frame |
| `expl` | Natural language explanation of the action taken |
| `answer` | Integer command chosen (1–9, or 0 for success) |
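
Because `compass` and `object` are stored as stringified Python lists, they need to be parsed back before use. The sketch below builds a one-row CSV in memory and parses it with `ast.literal_eval`; the sample values and the helper name `parse_fpv_row` are hypothetical, and the agents may load the file differently (for example through pandas).

```python
import ast
import csv
import io

# Hypothetical one-row CSV following the FPV schema above.
compass = [[] for _ in range(24)]
compass[3] = ["Table"]
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["ts", "goal", "compass", "object", "expl", "answer"]
)
writer.writeheader()
writer.writerow({
    "ts": "000123",
    "goal": "Bowl",
    "compass": str(compass),
    "object": str(["Chair", "Sofa"]),
    "expl": "The kitchen is likely ahead, so move forward.",
    "answer": "4",
})


def parse_fpv_row(row):
    """Turn the stringified columns of one FPV row into Python objects."""
    bins = ast.literal_eval(row["compass"])
    if len(bins) != 24:
        raise ValueError(f"expected 24 compass bins, got {len(bins)}")
    return {
        "ts": row["ts"],
        "goal": row["goal"],
        "compass": bins,
        "objects": ast.literal_eval(row["object"]),
        "explanation": row["expl"],
        "command": int(row["answer"]),
    }


buffer.seek(0)
demo = parse_fpv_row(next(csv.DictReader(buffer)))
print(demo["goal"], demo["objects"], demo["command"])
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for columns that come from a file.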
For TPV Navigation (Select2PlanTPVAgent):
| Column | Description |
|---|---|
| `timestamp` | Unique identifier/timestamp of the frame |
| `coordinates` | Stringified tuple of the robot's starting coordinates, e.g., "(100, 200)" |
| `t_coordinates` | Stringified tuple of the target's coordinates |
| `luke_t` | Stringified list of the numerical sequence of points chosen to reach the target, e.g., "[1, 15, 23]" |
Before running the agents, pre-compute embeddings for your demonstration images and save them as a .pt tensor:
```python
import torch
from PIL import Image
from pathlib import Path
from transformers import AutoImageProcessor, SwinModel

model_id = "microsoft/swin-large-patch4-window7-224"
processor = AutoImageProcessor.from_pretrained(model_id, use_fast=True)
encoder = SwinModel.from_pretrained(model_id).eval()

image_dir = Path("data/fpv/images")
# Sorted file order determines the embedding order; it must match the CSV rows.
image_paths = sorted(image_dir.glob("*.png"))

embeddings = []
for path in image_paths:
    image = Image.open(path).convert("RGB")
    inputs = processor(image, return_tensors="pt")
    with torch.no_grad():
        # Mean-pool the patch tokens into a single vector per image.
        emb = encoder(**inputs).last_hidden_state.mean(dim=1)
    embeddings.append(emb.squeeze(0))

torch.save(torch.stack(embeddings), "data/fpv/embeddings.pt")
```
Note: The number of rows in your CSV must match the number of embeddings in the `.pt` file, and their order must be consistent.
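
A quick sanity check can catch a mismatch before the agent is run. The sketch below compares the CSV identifiers with the image files on disk and with an embedding count. It assumes that each frame is stored as `<ts>.png`, which this README does not confirm, and the helper name `check_alignment` is hypothetical. The demo runs on a temporary directory so that it is self-contained.

```python
import csv
import tempfile
from pathlib import Path


def check_alignment(csv_path, image_dir, num_embeddings, id_column="ts", suffix=".png"):
    """Return a list of problems; an empty list means the bank looks consistent."""
    with open(csv_path, newline="") as handle:
        ids = [row[id_column] for row in csv.DictReader(handle)]
    problems = []
    if len(ids) != num_embeddings:
        problems.append(f"{len(ids)} CSV rows vs {num_embeddings} embeddings")
    # The embedding script iterates over sorted file names, so the CSV
    # rows should appear in the same order (assumes files are named <id>.png).
    on_disk = sorted(p.stem for p in Path(image_dir).glob(f"*{suffix}"))
    if ids != on_disk:
        problems.append("CSV order differs from sorted image file names")
    return problems


# Toy memory bank whose CSV rows are deliberately out of order.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "images").mkdir()
    for name in ["0001", "0002", "0003"]:
        (root / "images" / f"{name}.png").touch()
    (root / "demonstrations.csv").write_text("ts,goal\n0001,Bowl\n0003,Mug\n0002,Cup\n")
    report = check_alignment(root / "demonstrations.csv", root / "images", num_embeddings=3)

print(report)
```

In a real setup, `num_embeddings` would come from the saved tensor, for example `torch.load("data/fpv/embeddings.pt").shape[0]`, and `id_column` would be `timestamp` for the TPV bank.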
The code has been tested with Python 3.10+ and PyTorch 2.x. To set up the environment using Conda:
```bash
conda create --name s2p python=3.10 -y
conda activate s2p
pip install torch torchvision
pip install transformers google-generativeai opencv-python numpy pandas pillow
```
You will also need a Google Gemini API key to query the VLM. Set it in your environment or pass it directly when initializing an agent:
```bash
export GEMINI_API_KEY="your-api-key-here"
```
The FPV agent annotates the camera frame with numbered movement prompts, retrieves relevant demonstrations via MMR, and queries the VLM for a discrete navigation command.
```python
import os
from PIL import Image
from fpv_agent import Select2PlanFPVAgent

agent = Select2PlanFPVAgent(
    gemini_api_key=os.environ["GEMINI_API_KEY"],
    model_name="models/gemini-2.0-flash",
    device="cuda",  # or "cpu"
)
agent.load_experiential_memory(
    embeddings_path="data/fpv/embeddings.pt",
    dataframe_path="data/fpv/demonstrations.csv",
)

frame = Image.open("data/fpv/images/current_frame.png")
result = agent.get_action(current_image=frame, target_obj="Bowl")
print(result)
# {"objects": [...], "explanation": "...", "command": "4", "next_step_command": "0"}
```
Command system:
| Command | Action |
|---|---|
| 1, 2, 3 | Rotate right (45°, 30°, 15°) |
| 4 | Move forward |
| 5, 6, 7 | Rotate left (15°, 30°, 45°) |
| 8, 9 | Look up / down |
| 0 | Target reached |
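
To act on the agent's output, the string in `result["command"]` has to be mapped to a motion primitive. The lookup below copies the table above. The action names, the tuple layout, and the reading of 8 as "look up" and 9 as "look down" are illustrative assumptions rather than part of the S2P API.

```python
# Mapping copied from the command table; (action, degrees) is an illustrative layout.
COMMANDS = {
    1: ("rotate_right", 45),
    2: ("rotate_right", 30),
    3: ("rotate_right", 15),
    4: ("move_forward", None),
    5: ("rotate_left", 15),
    6: ("rotate_left", 30),
    7: ("rotate_left", 45),
    8: ("look_up", None),
    9: ("look_down", None),
    0: ("target_reached", None),
}


def decode_command(result):
    """Translate the agent's string command into an (action, degrees) pair."""
    command = int(result["command"])
    if command not in COMMANDS:
        raise ValueError(f"unknown command: {command}")
    return COMMANDS[command]


result = {"objects": [], "explanation": "...", "command": "4", "next_step_command": "0"}
print(decode_command(result))
```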
The TPV agent overlays concentric rings of candidate waypoints on a bird's-eye or CCTV view, then queries the VLM to select a trajectory toward the target.
```python
import os
import cv2
import numpy as np
from PIL import Image
from tpv_agent import Select2PlanTPVAgent

agent = Select2PlanTPVAgent(
    gemini_api_key=os.environ["GEMINI_API_KEY"],
    model_name="models/gemini-2.0-flash",
    device="cuda",
)
agent.load_experiential_memory(
    embeddings_path="data/tpv/embeddings.pt",
    dataframe_path="data/tpv/demonstrations.csv",
)

frame = Image.open("data/tpv/images/current_frame.png")
traversable_mask = cv2.imread("data/tpv/masks/traversable.png", cv2.IMREAD_GRAYSCALE)
trajectory = agent.get_path(
    current_image=frame,
    start_coord=(100, 200),
    target_coord=(400, 350),
    traversable_mask=traversable_mask,
    horizon=3,
)
print(trajectory)  # [(x1, y1), (x2, y2), ...]
```
| File | Description |
|---|---|
| `fpv_agent.py` | FPV wrapper (`Select2PlanFPVAgent`). Manages the 24-bin episodic memory (compass) and outputs discrete movement commands. |
| `tpv_agent.py` | TPV wrapper (`Select2PlanTPVAgent`). Generates dynamic concentric candidate rings and outputs multi-hop trajectories. |
| `tpv_agent_old.py` | Legacy TPV implementation (kept for reference). |
S2P builds upon the following open-source projects:
- Hugging Face Transformers — Swin Transformer backbone for experiential memory retrieval
- Google Generative AI — Gemini VLM for planning
- OpenCV — Visual prompting and obstacle masking
If you find S2P useful in your research, please cite our work:
```bibtex
@ARTICLE{11151762,
  author={Buoso, Davide and Robinson, Luke and Averta, Giuseppe and Torr, Philip and Franzmeyer, Tim and De Martini, Daniele},
  journal={IEEE Robotics and Automation Letters},
  title={Select2Plan: Training-Free ICL-Based Planning Through VQA and Memory Retrieval},
  year={2025},
  volume={10},
  number={11},
  pages={11267-11274},
  keywords={Navigation;Robots;Planning;Visualization;Cameras;Training;Adaptation models;Transformers;Predictive models;Oral communication;Motion and path planning;vision-based navigation;autonomous agents},
  doi={10.1109/LRA.2025.3606790}
}
```