Select2Plan (S2P): Training-Free ICL-Based Planning through VQA and Memory Retrieval [IEEE RA-L 2025]

Davide Buoso, Luke Robinson, Giuseppe Averta, Philip Torr, Tim Franzmeyer, Daniele De Martini

Politecnico di Torino · University of Oxford

Welcome to the official repository for Select2Plan (S2P), a training-free framework for high-level robot planning that leverages off-the-shelf Vision-Language Models (VLMs) for autonomous navigation.

Unlike most learning-based approaches, which require extensive task-specific training and large-scale data collection, S2P reshapes its inputs to resemble the VLM's pretraining data, so no fine-tuning is required.

🚀 No task-specific training or fine-tuning.
🧠 In-Context Learning (ICL) from a small experiential memory bank.
🎯 Structured Visual Question Answering (VQA) to ground actions on the image.
🔄 Maximal Marginal Relevance (MMR) for diverse, relevant memory retrieval.
🤖 Two navigation paradigms: First-Person View (FPV) and Third-Person View (TPV).

📄 Read our paper (IEEE RA-L)
🌍 Project Page

⚠️ Disclaimer: Exact reproduction of the benchmark results reported in the paper is currently a work in progress. This release focuses on providing a clear, environment-agnostic reference implementation of the S2P methodology — visual prompting, experiential memory retrieval with MMR, and ICL-based planning — that researchers can adapt to their own setups. The agents ship with Google Gemini as the default VLM backend, but the pipeline is designed to be VLM-agnostic: you can extend or swap in other vision-language models by adapting the API calls in fpv_agent.py and tpv_agent.py.

👀 S2P in Action

S2P supports two distinct navigation paradigms:

| Paradigm | Agent | Description | Output |
| --- | --- | --- | --- |
| FPV | `Select2PlanFPVAgent` | Autonomous navigation using an onboard monocular camera | Discrete movement commands (1–9, 0) |
| TPV | `Select2PlanTPVAgent` | Infrastructure-driven planning using external/CCTV cameras | Multi-hop trajectory (waypoints) avoiding obstacles |

FPV maintains a 24-bin episodic memory (compass) to track objects outside the current field of view.
TPV generates dynamic concentric candidate rings on the scene and plans trajectories toward a target while respecting obstacle masks.
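
To make the compass idea concrete, here is a minimal sketch. It assumes the 24 bins split 360° evenly into 15° sectors (which would match the smallest rotation step of the FPV commands) and that each bin stores the labels of objects seen in that direction. The `Compass` class and its methods are hypothetical and are not taken from `fpv_agent.py`.

```python
from typing import List

NUM_BINS = 24                   # from the README: 24-bin compass
BIN_WIDTH_DEG = 360 / NUM_BINS  # 15 degrees per bin (assumed even split)


class Compass:
    """Hypothetical 24-bin memory of objects seen around the agent."""

    def __init__(self) -> None:
        self.bins: List[List[str]] = [[] for _ in range(NUM_BINS)]
        self.heading_deg = 0.0  # current heading; right turns are positive

    def rotate(self, delta_deg: float) -> None:
        """Update the heading; positive values turn right."""
        self.heading_deg = (self.heading_deg + delta_deg) % 360

    def current_bin(self) -> int:
        """Index of the bin the agent is currently facing."""
        return int(self.heading_deg // BIN_WIDTH_DEG) % NUM_BINS

    def observe(self, objects: List[str]) -> None:
        """Record objects detected in the current viewing direction."""
        seen = self.bins[self.current_bin()]
        for obj in objects:
            if obj not in seen:
                seen.append(obj)

    def where(self, obj: str) -> List[int]:
        """Return the bins in which an object was previously seen."""
        return [i for i, seen in enumerate(self.bins) if obj in seen]


compass = Compass()
compass.observe(["Table"])
compass.rotate(45)             # e.g. one 45-degree right turn
compass.observe(["Bowl", "Sink"])
compass.rotate(-45)            # turn back; the bowl is now out of view
print(compass.where("Bowl"))   # [3]
```

Even after turning away, the agent can still answer "where did I last see the bowl?" from the bins, which is the role the README assigns to the compass.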

📊 Data Preparation

Before running S2P, you need an Experiential Memory bank of successful demonstrations:

A .csv file with demonstration metadata (the database index).
A .pt file with pre-computed Swin Transformer visual embeddings of the demonstration frames.

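The sampler uses Swin Transformer embeddings to retrieve similar situations from memory. To balance similarity and diversity among retrieved examples, we employ Maximal Marginal Relevance (MMR). Here $Q$ is the query embedding, $M$ is the memory bank, $C$ is the set of examples already selected, $\langle \cdot, \cdot \rangle$ is the similarity between two embeddings, and $\lambda \in [0, 1]$ trades relevance against diversity: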

$$MMR(Q, M, C) = \arg\max_{s_i \in M \setminus C} \left[ \lambda \langle Q, s_i \rangle - (1-\lambda) \max_{s_j \in C} \langle s_i, s_j \rangle \right]$$
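
The formula selects one example at a time, so retrieval is usually run as a greedy loop. The sketch below is one way to write that loop in plain Python and is not the repository's sampler. It assumes dot-product similarity on roughly unit-norm embeddings and treats the redundancy term as 0 while $C$ is empty; the function name `mmr_select` and the toy vectors are illustrative only.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product, used here as the similarity <a, b>."""
    return sum(x * y for x, y in zip(a, b))


def mmr_select(query: Sequence[float],
               memory: Sequence[Sequence[float]],
               k: int,
               lam: float = 0.5) -> List[int]:
    """Greedily pick k memory indices by Maximal Marginal Relevance."""
    selected: List[int] = []                # the set C, in selection order
    candidates = list(range(len(memory)))   # indices of M \ C
    while candidates and len(selected) < k:
        best_idx, best_score = candidates[0], float("-inf")
        for i in candidates:
            relevance = dot(query, memory[i])
            # Highest similarity to anything already chosen (0 if C is empty).
            redundancy = max((dot(memory[i], memory[j]) for j in selected),
                             default=0.0)
            score = lam * relevance - (1 - lam) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


query = [1.0, 0.0]
memory = [
    [0.8, 0.6],   # most relevant
    [0.6, 0.8],   # relevant, but close to item 0
    [0.6, -0.8],  # equally relevant, and different from item 0
    [0.0, 1.0],   # irrelevant
]
print(mmr_select(query, memory, k=2, lam=0.5))  # [0, 2]
print(mmr_select(query, memory, k=2, lam=1.0))  # [0, 1]
```

With `lam=1.0` the redundancy term vanishes and the near-duplicate item 1 is returned; with `lam=0.5` the second pick moves to item 2, which is just as relevant but adds new information.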

Once organized, your data directory should look like this:

S2P/
├── fpv_agent.py
├── tpv_agent.py
├── data/
│   ├── fpv/
│   │   ├── demonstrations.csv
│   │   ├── embeddings.pt
│   │   └── images/          # frames referenced by `ts` in the CSV
│   └── tpv/
│       ├── demonstrations.csv
│       ├── embeddings.pt
│       ├── images/          # frames referenced by `timestamp` in the CSV
│       └── masks/           # traversable-area masks used by the TPV example (e.g., traversable.png)

CSV Formats

Your CSV file acts as the database index. Format it according to the agent you are using:

For FPV Navigation (Select2PlanFPVAgent):

| Column | Description |
| --- | --- |
| `ts` | Unique identifier/timestamp of the frame (used to load the image) |
| `goal` | Target object the agent was searching for (e.g., `"Bowl"`) |
| `compass` | Stringified list of 24 lists representing the agent's spatial memory |
| `object` | Stringified list of objects visible in the current frame |
| `expl` | Natural language explanation of the action taken |
| `answer` | Integer command chosen (1–9, or 0 for success) |
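
Because `compass` and `object` are stored as stringified Python lists, they need decoding after the CSV is read. The sketch below shows one safe way to do that with `ast.literal_eval`. The row contents and the helper name `parse_fpv_row` are made up for illustration, and the agent's own loader may work differently.

```python
import ast
import csv
import io

FIELDS = ["ts", "goal", "compass", "object", "expl", "answer"]

# Build a hypothetical one-row demonstrations.csv in memory.
compass = [[] for _ in range(24)]
compass[3] = ["Sink"]
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerow({
    "ts": "000123",
    "goal": "Bowl",
    "compass": str(compass),
    "object": str(["Table", "Chair"]),
    "expl": "The bowl is not visible, so move toward the table ahead.",
    "answer": 4,
})
buffer.seek(0)


def parse_fpv_row(row: dict) -> dict:
    """Decode the stringified columns of one FPV demonstration."""
    parsed = dict(row)
    # literal_eval only accepts Python literals, unlike eval().
    parsed["compass"] = ast.literal_eval(row["compass"])
    parsed["object"] = ast.literal_eval(row["object"])
    parsed["answer"] = int(row["answer"])
    if len(parsed["compass"]) != 24:
        raise ValueError("compass must contain 24 bins")
    if not 0 <= parsed["answer"] <= 9:
        raise ValueError("answer must be a command in 0-9")
    return parsed


demo = parse_fpv_row(next(csv.DictReader(buffer)))
print(demo["object"], demo["answer"])  # ['Table', 'Chair'] 4
```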

For TPV Navigation (Select2PlanTPVAgent):

| Column | Description |
| --- | --- |
| `timestamp` | Unique identifier/timestamp of the frame |
| `coordinates` | Stringified tuple of the robot's starting coordinates, e.g., `"(100, 200)"` |
| `t_coordinates` | Stringified tuple of the target's coordinates |
| `luke_t` | Stringified list of the numerical sequence of points chosen to reach the target, e.g., `"[1, 15, 23]"` |

Generating Swin Transformer Embeddings

Before running the agents, pre-compute embeddings for your demonstration images and save them as a .pt tensor:

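import torch
from PIL import Image
from pathlib import Path
from transformers import AutoImageProcessor, SwinModel

# Load the Swin backbone once; eval() puts it in inference mode.
model_id = "microsoft/swin-large-patch4-window7-224"
processor = AutoImageProcessor.from_pretrained(model_id, use_fast=True)
encoder = SwinModel.from_pretrained(model_id).eval()

# Use "data/tpv/images" (and the matching output path) for the TPV memory bank.
image_dir = Path("data/fpv/images")
# sorted() orders file names lexicographically, so "10.png" precedes "2.png";
# the CSV rows must follow this same order.
image_paths = sorted(image_dir.glob("*.png"))

embeddings = []
for path in image_paths:
    image = Image.open(path).convert("RGB")
    inputs = processor(image, return_tensors="pt")
    with torch.no_grad():
        # Average the patch tokens into a single feature vector per image.
        emb = encoder(**inputs).last_hidden_state.mean(dim=1)
    embeddings.append(emb.squeeze(0))

# Stack into one (num_images, hidden_dim) tensor and save it.
torch.save(torch.stack(embeddings), "data/fpv/embeddings.pt")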

Note: The number of rows in your CSV must match the number of embeddings in the .pt file, and their order must be consistent.
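
A quick sanity check can catch ordering mistakes before they silently pair demonstrations with the wrong embeddings. The sketch below assumes each image file is named after its CSV identifier (for example `000123.png` for `ts` = `000123`), which the repository does not guarantee. `check_alignment` is a hypothetical helper, not part of the codebase.

```python
from pathlib import Path
from typing import List


def check_alignment(csv_ids: List[str], image_paths: List[Path]) -> None:
    """Raise if CSV rows and the embedding order (image paths) disagree."""
    stems = [p.stem for p in image_paths]
    if len(csv_ids) != len(stems):
        raise ValueError(f"{len(csv_ids)} CSV rows vs {len(stems)} images")
    for row, (csv_id, stem) in enumerate(zip(csv_ids, stems)):
        if csv_id != stem:
            raise ValueError(
                f"row {row}: CSV has {csv_id!r}, embedding came from {stem!r}"
            )


# Zero-padded names sort the same way numerically and lexicographically.
check_alignment(["002", "010"], sorted(Path(n) for n in ["010.png", "002.png"]))

# Unpadded names do not: "10.png" sorts before "2.png".
try:
    check_alignment(["2", "10"], sorted(Path(n) for n in ["2.png", "10.png"]))
except ValueError as err:
    print(err)  # row 0: CSV has '2', embedding came from '10'
```

In practice, `csv_ids` would come from the `ts` (or `timestamp`) column and `image_paths` from the same sorted list used when computing the embeddings.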

⚙️ Environment Setup

The code has been tested with Python 3.10+ and PyTorch 2.x. To set up the environment using Conda:

conda create --name s2p python=3.10 -y
conda activate s2p
pip install torch torchvision
pip install transformers google-generativeai opencv-python numpy pandas pillow

You will also need a Google Gemini API key to query the VLM. Set it in your environment or pass it directly when initializing an agent:

export GEMINI_API_KEY="your-api-key-here"

🚶 First-Person View (FPV) Navigation

The FPV agent annotates the camera frame with numbered movement prompts, retrieves relevant demonstrations via MMR, and queries the VLM for a discrete navigation command.

import os
from PIL import Image
from fpv_agent import Select2PlanFPVAgent

agent = Select2PlanFPVAgent(
    gemini_api_key=os.environ["GEMINI_API_KEY"],
    model_name="models/gemini-2.0-flash",
    device="cuda",  # or "cpu"
)

agent.load_experiential_memory(
    embeddings_path="data/fpv/embeddings.pt",
    dataframe_path="data/fpv/demonstrations.csv",
)

frame = Image.open("data/fpv/images/current_frame.png")
result = agent.get_action(current_image=frame, target_obj="Bowl")

print(result)
# {"objects": [...], "explanation": "...", "command": "4", "next_step_command": "0"}

Command system:

| Command | Action |
| --- | --- |
| 1, 2, 3 | Rotate right (45°, 30°, 15°) |
| 4 | Move forward |
| 5, 6, 7 | Rotate left (15°, 30°, 45°) |
| 8, 9 | Look up / down |
| 0 | Target reached |
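
Downstream code has to turn the returned command into a robot action. The sketch below is one possible decoding of the table above. The action names, the signed-angle convention (right turns positive), and the reading of 8 as "look up" and 9 as "look down" are choices made here for illustration, not identifiers from `fpv_agent.py`.

```python
from typing import Tuple, Union

# command -> (action, degrees); degrees are only meaningful for rotations.
COMMANDS = {
    1: ("rotate", 45), 2: ("rotate", 30), 3: ("rotate", 15),     # right
    4: ("forward", 0),
    5: ("rotate", -15), 6: ("rotate", -30), 7: ("rotate", -45),  # left
    8: ("look_up", 0), 9: ("look_down", 0),
    0: ("done", 0),
}


def decode_command(raw: Union[str, int]) -> Tuple[str, int]:
    """Accept "4" or 4 (the example output uses strings) and validate it."""
    try:
        return COMMANDS[int(raw)]
    except (KeyError, ValueError, TypeError):
        raise ValueError(f"unknown command: {raw!r}") from None


print(decode_command("4"))  # ('forward', 0)
print(decode_command("7"))  # ('rotate', -45)
```

Validating the command before acting on it guards against malformed VLM output, such as an empty string or an out-of-range number.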

🛰️ Third-Person View (TPV) Navigation

The TPV agent overlays concentric rings of candidate waypoints on a bird's-eye or CCTV view, then queries the VLM to select a trajectory toward the target.

import os
import cv2
import numpy as np
from PIL import Image
from tpv_agent import Select2PlanTPVAgent

agent = Select2PlanTPVAgent(
    gemini_api_key=os.environ["GEMINI_API_KEY"],
    model_name="models/gemini-2.0-flash",
    device="cuda",
)

agent.load_experiential_memory(
    embeddings_path="data/tpv/embeddings.pt",
    dataframe_path="data/tpv/demonstrations.csv",
)

frame = Image.open("data/tpv/images/current_frame.png")
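traversable_mask = cv2.imread("data/tpv/masks/traversable.png", cv2.IMREAD_GRAYSCALE)
if traversable_mask is None:  # cv2.imread returns None, rather than raising, when the file cannot be read
    raise FileNotFoundError("data/tpv/masks/traversable.png")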

trajectory = agent.get_path(
    current_image=frame,
    start_coord=(100, 200),
    target_coord=(400, 350),
    traversable_mask=traversable_mask,
    horizon=3,
)

print(trajectory)  # [(x1, y1), (x2, y2), ...]
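
To illustrate what "concentric rings of candidate waypoints" could look like, the sketch below places points on circles around the robot and keeps only those that fall on traversable pixels. It is a simplified stand-in: the radii, the number of points per ring, the convention that non-zero mask values are traversable, and the function name `ring_candidates` are all assumptions, and `tpv_agent.py` may generate and number its candidates differently.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[int, int]  # (x, y) in pixel coordinates


def ring_candidates(center: Point,
                    radii: Sequence[float],
                    points_per_ring: int,
                    mask: Sequence[Sequence[int]]) -> List[Point]:
    """Place points on concentric circles and keep the traversable ones."""
    height, width = len(mask), len(mask[0])
    cx, cy = center
    kept: List[Point] = []
    for radius in radii:
        for k in range(points_per_ring):
            angle = 2 * math.pi * k / points_per_ring
            x = round(cx + radius * math.cos(angle))
            y = round(cy + radius * math.sin(angle))
            inside = 0 <= x < width and 0 <= y < height
            # Assumed convention: non-zero mask value means traversable.
            if inside and mask[y][x] > 0:
                kept.append((x, y))
    return kept


# Toy 11x11 mask: columns 7 to 10 are blocked, the rest is free.
mask = [[1 if x < 7 else 0 for x in range(11)] for _ in range(11)]
points = ring_candidates(center=(5, 5), radii=[2, 4], points_per_ring=4, mask=mask)
print(points)  # [(5, 7), (3, 5), (5, 3), (5, 9), (1, 5), (5, 1)]
```

The two candidates to the right of the robot, (7, 5) and (9, 5), land on blocked pixels and are dropped, so only reachable waypoints would be offered to the VLM.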

🛠️ Repository Structure

| File | Description |
| --- | --- |
| `fpv_agent.py` | FPV wrapper (`Select2PlanFPVAgent`). Manages the 24-bin episodic memory (compass) and outputs discrete movement commands. |
| `tpv_agent.py` | TPV wrapper (`Select2PlanTPVAgent`). Generates dynamic concentric candidate rings and outputs multi-hop trajectories. |
| `tpv_agent_old.py` | Legacy TPV implementation (kept for reference). |

🔗 Acknowledgements

S2P builds upon the following open-source projects:

Hugging Face Transformers — Swin Transformer backbone for experiential memory retrieval
Google Generative AI — Gemini VLM for planning
OpenCV — Visual prompting and obstacle masking

Citation

If you find S2P useful in your research, please cite our work:

@ARTICLE{11151762,
  author={Buoso, Davide and Robinson, Luke and Averta, Giuseppe and Torr, Philip and Franzmeyer, Tim and De Martini, Daniele},
  journal={IEEE Robotics and Automation Letters},
  title={Select2Plan: Training-Free ICL-Based Planning Through VQA and Memory Retrieval},
  year={2025},
  volume={10},
  number={11},
  pages={11267-11274},
  keywords={Navigation;Robots;Planning;Visualization;Cameras;Training;Adaptation models;Transformers;Predictive models;Oral communication;Motion and path planning;vision-based navigation;autonomous agents},
  doi={10.1109/LRA.2025.3606790}
}
