lambdavi/S2P

Select2Plan (S2P): Training-Free ICL-Based Planning through VQA and Memory Retrieval [IEEE RA-L 2025]

Davide Buoso, Luke Robinson, Giuseppe Averta, Philip Torr, Tim Franzmeyer, Daniele De Martini

Politecnico di Torino · University of Oxford

Welcome to the official repository for Select2Plan (S2P), a training-free framework for high-level robot planning that leverages off-the-shelf Vision-Language Models (VLMs) for autonomous navigation.

Unlike most learning-based approaches that require extensive task-specific training and large-scale data collection, S2P adapts inputs to align with the VLM's pretraining data — no fine-tuning required.

🚀 No task-specific training or fine-tuning.
🧠 In-Context Learning (ICL) from a small experiential memory bank.
🎯 Structured Visual Question Answering (VQA) to ground actions on the image.
🔄 Maximal Marginal Relevance (MMR) for diverse, relevant memory retrieval.
🤖 Two navigation paradigms: First-Person View (FPV) and Third-Person View (TPV).

📄 Read our paper (IEEE RA-L)
🌍 Project Page

⚠️ Disclaimer: Exact reproduction of the benchmark results reported in the paper is currently a work in progress. This release focuses on providing a clear, environment-agnostic reference implementation of the S2P methodology — visual prompting, experiential memory retrieval with MMR, and ICL-based planning — that researchers can adapt to their own setups. The agents ship with Google Gemini as the default VLM backend, but the pipeline is designed to be VLM-agnostic: you can extend or swap in other vision-language models by adapting the API calls in fpv_agent.py and tpv_agent.py.


👀 S2P in Action

S2P supports two distinct navigation paradigms:

| Paradigm | Agent | Description | Output |
|---|---|---|---|
| FPV | `Select2PlanFPVAgent` | Autonomous navigation using an onboard monocular camera | Discrete movement commands (1–9, 0) |
| TPV | `Select2PlanTPVAgent` | Infrastructure-driven planning using external/CCTV cameras | Multi-hop trajectory (waypoints) avoiding obstacles |

FPV maintains a 24-bin episodic memory (compass) to track objects outside the current field of view.
TPV generates dynamic concentric candidate rings on the scene and plans trajectories toward a target while respecting obstacle masks.
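
One way to picture the 24-bin compass is as a ring of 24 angular sectors of 15° each (360° / 24), so that every rotation command corresponds to a whole number of bins. The sketch below only illustrates that idea and is not the code in fpv_agent.py: the class name `Compass`, its methods, and the convention that bin 0 is the current heading with bins increasing clockwise are all assumptions.

```python
from typing import List

NUM_BINS = 24
BIN_WIDTH_DEG = 360 // NUM_BINS  # 15 degrees per bin


class Compass:
    """Hypothetical 24-bin episodic memory; bin 0 is the current heading."""

    def __init__(self) -> None:
        self.bins: List[List[str]] = [[] for _ in range(NUM_BINS)]

    def observe(self, label: str, bearing_deg: float) -> None:
        """Record an object at a bearing relative to the heading (clockwise positive)."""
        idx = int((bearing_deg % 360) // BIN_WIDTH_DEG)
        if label not in self.bins[idx]:
            self.bins[idx].append(label)

    def rotate(self, angle_deg: int) -> None:
        """Turn the agent clockwise by angle_deg (negative = counter-clockwise).

        Objects stay fixed in the world, so their relative bins shift the other way.
        Assumes angle_deg is a multiple of 15.
        """
        shift = (angle_deg // BIN_WIDTH_DEG) % NUM_BINS
        self.bins = self.bins[shift:] + self.bins[:shift]

    def where(self, label: str) -> List[int]:
        """Return the bins in which the object was recorded."""
        return [i for i, objs in enumerate(self.bins) if label in objs]


compass = Compass()
compass.observe("Bowl", 45)      # seen 45 degrees to the right -> bin 3
compass.rotate(45)               # rotate right by 45 degrees
print(compass.where("Bowl"))     # [0]: the bowl is now straight ahead
```

Under these assumptions, `str(compass.bins)` yields a stringified list of 24 lists, which is the shape of the `compass` column in the FPV CSV.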


📊 Data Preparation

Before running S2P, you need an Experiential Memory bank of successful demonstrations:

  • A .csv file with demonstration metadata (the database index).
  • A .pt file with pre-computed Swin Transformer visual embeddings of the demonstration frames.

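The sampler uses Swin Transformer embeddings to retrieve similar situations from memory. To balance similarity and diversity among the retrieved examples, we employ Maximal Marginal Relevance (MMR). Given the query embedding $Q$, the memory $M$, and the set $C$ of examples already selected, MMR picks the next example as follows, where $\lambda \in [0, 1]$ trades relevance to the query against redundancy with $C$: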

$$MMR(Q, M, C) = \arg\max_{s_i \in M \setminus C} \left[ \lambda \langle Q, s_i \rangle - (1-\lambda) \max_{s_j \in C} \langle s_i, s_j \rangle \right]$$
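
The formula above selects one example at a time, so retrieving k examples means applying it greedily k times. Below is a minimal, dependency-free sketch of that loop. The function name `mmr_select`, the use of plain Python lists instead of tensors, and the choice to treat the redundancy term as 0 while $C$ is still empty are assumptions made for illustration, not the repository's implementation.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product <a, b>."""
    return sum(x * y for x, y in zip(a, b))


def mmr_select(query: Vector, memory: Sequence[Vector], k: int, lam: float = 0.5) -> List[int]:
    """Greedily pick k memory indices using the MMR criterion."""
    selected: List[int] = []                 # the set C, in selection order
    candidates = list(range(len(memory)))    # M \ C
    while candidates and len(selected) < k:
        best_idx, best_score = candidates[0], float("-inf")
        for i in candidates:
            relevance = dot(query, memory[i])
            # Assumption: redundancy is 0 while nothing has been selected yet.
            redundancy = max((dot(memory[i], memory[j]) for j in selected), default=0.0)
            score = lam * relevance - (1.0 - lam) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


memory = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
query = [1.0, 0.3]
print(mmr_select(query, memory, k=2, lam=0.5))  # [0, 2]
print(mmr_select(query, memory, k=2, lam=1.0))  # [0, 1]
```

With `lam=1.0` the selection reduces to plain nearest-neighbour retrieval. With `lam=0.5` the second pick skips the near-duplicate of the first example in favour of a less similar one.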

Once organized, your data directory should look like this:

S2P/
├── fpv_agent.py
├── tpv_agent.py
├── data/
│   ├── fpv/
│   │   ├── demonstrations.csv
│   │   ├── embeddings.pt
│   │   └── images/          # frames referenced by `ts` in the CSV
│   └── tpv/
│       ├── demonstrations.csv
│       ├── embeddings.pt
│       └── images/          # frames referenced by `timestamp` in the CSV

CSV Formats

Your CSV file acts as the database index. Format it according to the agent you are using:

For FPV Navigation (Select2PlanFPVAgent):

| Column | Description |
|---|---|
| `ts` | Unique identifier/timestamp of the frame (used to load the image) |
| `goal` | Target object the agent was searching for (e.g., "Bowl") |
| `compass` | Stringified list of 24 lists representing the agent's spatial memory |
| `object` | Stringified list of objects visible in the current frame |
| `expl` | Natural language explanation of the action taken |
| `answer` | Integer command chosen (1–9, or 0 for success) |

For TPV Navigation (Select2PlanTPVAgent):

| Column | Description |
|---|---|
| `timestamp` | Unique identifier/timestamp of the frame |
| `coordinates` | Stringified tuple of the robot's starting coordinates, e.g., "(100, 200)" |
| `t_coordinates` | Stringified tuple of the target's coordinates |
| `luke_t` | Stringified list of the numerical sequence of points chosen to reach the target, e.g., "[1, 15, 23]" |
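
Several columns in both formats hold Python literals serialized as strings, so they need to be parsed back before use. The following sketch shows one safe way to do that with `ast.literal_eval` from the standard library. The helper name `load_demonstrations` and the 24-bin validation are illustrative assumptions, and the agents may parse these columns differently.

```python
import ast
import csv
import io
from typing import Any, Dict, List

# Columns that store stringified Python literals, per the column descriptions above.
LITERAL_COLUMNS = {
    "fpv": ("compass", "object"),
    "tpv": ("coordinates", "t_coordinates", "luke_t"),
}


def load_demonstrations(csv_text: str, paradigm: str) -> List[Dict[str, Any]]:
    """Parse demonstration rows and decode the stringified columns."""
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        row: Dict[str, Any] = dict(raw)
        for col in LITERAL_COLUMNS[paradigm]:
            # literal_eval only accepts literals, so it cannot execute arbitrary code.
            row[col] = ast.literal_eval(row[col])
        if paradigm == "fpv":
            row["answer"] = int(row["answer"])
            if len(row["compass"]) != 24:
                raise ValueError(f"expected 24 compass bins, got {len(row['compass'])}")
        rows.append(row)
    return rows


tpv_csv = (
    "timestamp,coordinates,t_coordinates,luke_t\n"
    '1700000000,"(100, 200)","(400, 350)","[1, 15, 23]"\n'
)
demo = load_demonstrations(tpv_csv, "tpv")[0]
print(demo["coordinates"], demo["luke_t"])  # (100, 200) [1, 15, 23]
```

Note that fields containing commas, such as the tuples and lists here, must be quoted in the CSV file for the columns to split correctly.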

Generating Swin Transformer Embeddings

Before running the agents, pre-compute embeddings for your demonstration images and save them as a .pt tensor:

import torch
from PIL import Image
from pathlib import Path
from transformers import AutoImageProcessor, SwinModel

model_id = "microsoft/swin-large-patch4-window7-224"
processor = AutoImageProcessor.from_pretrained(model_id, use_fast=True)
encoder = SwinModel.from_pretrained(model_id).eval()

image_dir = Path("data/fpv/images")
image_paths = sorted(image_dir.glob("*.png"))

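# Embed each frame. The loop follows the sorted file order, which the CSV rows must match.
embeddings = []
for path in image_paths:
    image = Image.open(path).convert("RGB")
    inputs = processor(image, return_tensors="pt")
    with torch.no_grad():
        # Mean-pool the final-stage patch tokens into one vector per image.
        emb = encoder(**inputs).last_hidden_state.mean(dim=1)
    embeddings.append(emb.squeeze(0))

# Stack into a single (num_images, hidden_size) tensor and save it.
torch.save(torch.stack(embeddings), "data/fpv/embeddings.pt")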

Note: The number of rows in your CSV must match the number of embeddings in the .pt file, and their order must be consistent.
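
Because the embedding script iterates over `sorted(image_dir.glob("*.png"))`, the embedding order is the lexicographic order of the file names, which differs from numeric order (for example, `10.png` sorts before `2.png`). The sketch below checks both the count and the order before loading a memory bank. The function `check_memory_alignment` is hypothetical, and it assumes that each image is named after its CSV identifier (`<ts>.png`), which the source does not state explicitly.

```python
from pathlib import PurePath
from typing import List, Sequence


def check_memory_alignment(
    csv_ids: Sequence[str], image_names: Sequence[str], n_embeddings: int
) -> List[str]:
    """Return a list of human-readable problems; an empty list means aligned."""
    problems = []
    if len(csv_ids) != n_embeddings:
        problems.append(f"CSV has {len(csv_ids)} rows but there are {n_embeddings} embeddings")
    # Sorting the file names mirrors sorting the paths of a single directory on POSIX.
    embedding_order = [PurePath(name).stem for name in sorted(image_names)]
    for position, (csv_id, stem) in enumerate(zip(csv_ids, embedding_order)):
        if str(csv_id) != stem:
            # Report only the first mismatch; later rows are usually shifted too.
            problems.append(f"row {position}: CSV id {csv_id!r} but embedding came from {stem!r}")
            break
    return problems


names = ["1.png", "2.png", "10.png"]
print(check_memory_alignment(["1", "2", "10"], names, 3))
# ["row 1: CSV id '2' but embedding came from '10'"]
print(check_memory_alignment(["1", "10", "2"], names, 3))
# []
```

In a real setup, `csv_ids` would come from the `ts` or `timestamp` column and `n_embeddings` from the first dimension of the saved tensor. Zero-padding the identifiers in the file names is one simple way to make lexicographic and numeric order agree.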


⚙️ Environment Setup

The code has been tested with Python 3.10+ and PyTorch 2.x. To set up the environment using Conda:

conda create --name s2p python=3.10 -y
conda activate s2p
pip install torch torchvision
pip install transformers google-generativeai opencv-python numpy pandas pillow

You will also need a Google Gemini API key to query the VLM. Set it in your environment or pass it directly when initializing an agent:

export GEMINI_API_KEY="your-api-key-here"

🚶 First-Person View (FPV) Navigation

The FPV agent annotates the camera frame with numbered movement prompts, retrieves relevant demonstrations via MMR, and queries the VLM for a discrete navigation command.

import os
from PIL import Image
from fpv_agent import Select2PlanFPVAgent

agent = Select2PlanFPVAgent(
    gemini_api_key=os.environ["GEMINI_API_KEY"],
    model_name="models/gemini-2.0-flash",
    device="cuda",  # or "cpu"
)

agent.load_experiential_memory(
    embeddings_path="data/fpv/embeddings.pt",
    dataframe_path="data/fpv/demonstrations.csv",
)

frame = Image.open("data/fpv/images/current_frame.png")
result = agent.get_action(current_image=frame, target_obj="Bowl")

print(result)
# {"objects": [...], "explanation": "...", "command": "4", "next_step_command": "0"}

Command system:

| Command | Action |
|---|---|
| 1, 2, 3 | Rotate right (45°, 30°, 15°) |
| 4 | Move forward |
| 5, 6, 7 | Rotate left (15°, 30°, 45°) |
| 8, 9 | Look up / down |
| 0 | Target reached |
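
To act on the agent's output, the `command` string has to be mapped to a concrete robot action. The lookup below is a minimal sketch of that mapping. The `Action` type and the action names are hypothetical. Reading "8, 9" as "8 = look up, 9 = look down" follows the order in the table and is an assumption. The forward step size and camera tilt angle are not given here, so they are left at 0.

```python
from typing import NamedTuple, Union


class Action(NamedTuple):
    kind: str      # hypothetical action name
    degrees: int   # rotation angle; 0 where the table gives no angle


COMMANDS = {
    "1": Action("rotate_right", 45),
    "2": Action("rotate_right", 30),
    "3": Action("rotate_right", 15),
    "4": Action("move_forward", 0),
    "5": Action("rotate_left", 15),
    "6": Action("rotate_left", 30),
    "7": Action("rotate_left", 45),
    "8": Action("look_up", 0),      # assumption: 8 = up, 9 = down
    "9": Action("look_down", 0),
    "0": Action("target_reached", 0),
}


def decode_command(command: Union[str, int]) -> Action:
    """Translate a command such as "4" (or 4) into an Action."""
    key = str(command).strip()
    if key not in COMMANDS:
        raise ValueError(f"unknown command: {command!r}")
    return COMMANDS[key]


result = {"command": "4", "next_step_command": "0"}
print(decode_command(result["command"]).kind)            # move_forward
print(decode_command(result["next_step_command"]).kind)  # target_reached
```

Rejecting unknown keys explicitly is useful because a VLM can return text outside the expected command set.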

🛰️ Third-Person View (TPV) Navigation

The TPV agent overlays concentric rings of candidate waypoints on a bird's-eye or CCTV view, then queries the VLM to select a trajectory toward the target.

import os
import cv2
from PIL import Image
from tpv_agent import Select2PlanTPVAgent

agent = Select2PlanTPVAgent(
    gemini_api_key=os.environ["GEMINI_API_KEY"],
    model_name="models/gemini-2.0-flash",
    device="cuda",
)

agent.load_experiential_memory(
    embeddings_path="data/tpv/embeddings.pt",
    dataframe_path="data/tpv/demonstrations.csv",
)

frame = Image.open("data/tpv/images/current_frame.png")
traversable_mask = cv2.imread("data/tpv/masks/traversable.png", cv2.IMREAD_GRAYSCALE)

trajectory = agent.get_path(
    current_image=frame,
    start_coord=(100, 200),
    target_coord=(400, 350),
    traversable_mask=traversable_mask,
    horizon=3,
)

print(trajectory)  # [(x1, y1), (x2, y2), ...]
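
The idea of concentric candidate rings filtered by a traversability mask can be illustrated with a few lines of geometry. This is a sketch under stated assumptions, not the logic in tpv_agent.py: the function `ring_candidates` is hypothetical, the radii are fixed rather than dynamic, the mask is indexed as `mask[y][x]` (row-major, as a grayscale image loaded with OpenCV would be), and non-zero pixels are taken to mean traversable.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[int, int]


def ring_candidates(
    center: Point,
    radii: Sequence[float],
    points_per_ring: int,
    mask: Sequence[Sequence[int]],
) -> List[Point]:
    """Place evenly spaced points on each ring and keep the traversable ones."""
    height, width = len(mask), len(mask[0])
    candidates: List[Point] = []
    for radius in radii:
        for k in range(points_per_ring):
            angle = 2 * math.pi * k / points_per_ring
            x = round(center[0] + radius * math.cos(angle))
            y = round(center[1] + radius * math.sin(angle))
            # Drop points outside the image or on non-traversable pixels.
            if 0 <= x < width and 0 <= y < height and mask[y][x] > 0:
                candidates.append((x, y))
    return candidates


# 11x11 toy mask: columns with x >= 6 are blocked.
mask = [[255 if x < 6 else 0 for x in range(11)] for _ in range(11)]
print(ring_candidates((5, 5), radii=[2, 4], points_per_ring=4, mask=mask))
# [(5, 7), (3, 5), (5, 3), (5, 9), (1, 5), (5, 1)]
```

In such a scheme, the surviving points would be numbered and drawn on the frame so that the VLM can answer with a sequence of point indices, in the style of the `luke_t` column.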

🛠️ Repository Structure

| File | Description |
|---|---|
| `fpv_agent.py` | FPV wrapper (`Select2PlanFPVAgent`). Manages the 24-bin episodic memory (compass) and outputs discrete movement commands. |
| `tpv_agent.py` | TPV wrapper (`Select2PlanTPVAgent`). Generates dynamic concentric candidate rings and outputs multi-hop trajectories. |
| `tpv_agent_old.py` | Legacy TPV implementation (kept for reference). |

🔗 Acknowledgements

S2P builds upon the following open-source projects:


Citation

If you find S2P useful in your research, please cite our work:

@ARTICLE{11151762,
  author={Buoso, Davide and Robinson, Luke and Averta, Giuseppe and Torr, Philip and Franzmeyer, Tim and De Martini, Daniele},
  journal={IEEE Robotics and Automation Letters},
  title={Select2Plan: Training-Free ICL-Based Planning Through VQA and Memory Retrieval},
  year={2025},
  volume={10},
  number={11},
  pages={11267-11274},
  keywords={Navigation;Robots;Planning;Visualization;Cameras;Training;Adaptation models;Transformers;Predictive models;Oral communication;Motion and path planning;vision-based navigation;autonomous agents},
  doi={10.1109/LRA.2025.3606790}
}
