Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction

Sunqi Fan, Qingle Liu, Runqi Yin, Meng-Hao Guo, Shuojin Yang
Tsinghua University, Beijing, China  ·  Corresponding author
Accepted by ECCV 2026


This is the official repository for our ECCV 2026 paper "Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction".

We advance video understanding from the low-level VideoQA paradigm toward the high-level Video-Guided Agentic Task paradigm. The latter requires models to learn procedural knowledge from a tutorial video and transfer it to long-horizon decision making, a form of video in-context learning.

Motivation: from low-level VideoQA to high-level video-guided agentic tasks

Abstract

Recent Multimodal Large Language Models (MLLMs) achieve remarkable performance on Video Question Answering (VideoQA) benchmarks, but existing benchmarks primarily test shallow visual perception and rarely examine whether MLLMs can learn deeper procedural skills from video tutorials and generalize them to long-horizon agentic tasks. To address this gap, we introduce VG-GUI-Bench (Video-Guided GUI Benchmark), a new benchmark that evaluates whether MLLM-based GUI agents can follow video tutorials to complete corresponding interactive GUI tasks.

We further observe that performance on both VideoQA and video-guided agentic tasks depends critically on effective keyframe extraction. Building on this observation, we propose TASKER (Task-driven and Scene-aware Keyframe searcher), a keyframe extraction algorithm that jointly considers task relevance and scene dynamics. TASKER yields significant gains on both VideoQA and video-guided agentic benchmarks, outperforming the best baseline by 2.0% on the EgoSchema full set and by 1.8% on NExT-QA, which highlights the potential of generalized keyframe extraction for video understanding.

Contributions

  • A Two-Level Taxonomy. We identify a key limitation of existing video benchmarks and propose a taxonomy connecting low-level VideoQA with high-level video-guided agentic tasks, highlighting the role of video in-context learning.
  • VG-GUI-Bench. A new benchmark that pairs tutorial videos with GUI agent tasks to evaluate procedural knowledge transfer from videos, with 1,000 long-horizon test cases and four complementary metrics.
  • TASKER. A task-driven and scene-aware keyframe extraction algorithm, formulated as a generalized graph search, that improves both accuracy and frame efficiency across VideoQA and agentic tasks.

Repository Structure

This repository is organized into two parts, corresponding to the two paradigms studied in the paper:

| Part | Description | Get Started |
| --- | --- | --- |
| 📱 VG-GUI-Bench | The Video-Guided GUI Agent benchmark: data pipeline, reference-frame modes, evaluation, and leaderboard scripts. | VG-GUI-Bench/README.md |
| 🔍 TASKER | The TASKER keyframe-search algorithm, covering both stages: VideoQA (videoqa/) and video-guided GUI tutorials (gui/). | TASKER/README.md |

VG-GUI-TASKER/
├── VG-GUI-Bench/       # Video-Guided GUI Agent benchmark   → see its README
├── TASKER/             # TASKER keyframe search (VideoQA + GUI stages) → see its README
│   ├── videoqa/        #   Stage 1: VideoQA keyframe search (EgoSchema, NExT-QA)
│   └── gui/            #   Stage 2: GUI-tutorial keyframe extraction for VG-GUI-Bench
└── README.md           # You are here

📱 Part 1 — VG-GUI-Bench

A dedicated benchmark for evaluating MLLM-based GUI agents on long-horizon tasks guided by video tutorials, released as the 🤗 VG-GUI-Bench dataset and built upon the MONDAY dataset. It provides a standardized action space (CLICK, SCROLL, TYPE, PRESS, ZOOM, FINISH) and four metrics (Accuracy, Completion, Efficiency, PIR).
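
As an illustration of what a standardized action space buys an evaluation harness, the following minimal sketch models the six action names listed above as an enum and validates a predicted action string against it. Only the six names come from this README. The `Action` dataclass, its free-form `argument` field, the `parse_action` helper, and the `NAME argument` string format are all hypothetical and are not the benchmark's actual schema; see VG-GUI-Bench/README.md for the real format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ActionType(Enum):
    """The six action names listed in this README."""
    CLICK = "CLICK"
    SCROLL = "SCROLL"
    TYPE = "TYPE"
    PRESS = "PRESS"
    ZOOM = "ZOOM"
    FINISH = "FINISH"


@dataclass(frozen=True)
class Action:
    kind: ActionType
    argument: Optional[str] = None  # hypothetical payload, not the real schema


def parse_action(text: str) -> Action:
    """Parse 'NAME' or 'NAME argument'; raise ValueError for unknown names."""
    name, _, argument = text.strip().partition(" ")
    try:
        kind = ActionType[name.upper()]
    except KeyError:
        raise ValueError(f"unknown action: {name!r}") from None
    return Action(kind, argument.strip() or None)


typed = parse_action("type hello world")
done = parse_action("FINISH")
```

Rejecting unknown action names at parse time keeps malformed model outputs from being silently scored as valid steps.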

➡️ Setup, data preparation, and evaluation: see VG-GUI-Bench/README.md.

➡️ Live results: see the Leaderboard.

🔍 Part 2 — TASKER

TASKER reformulates keyframe extraction as a generalized graph-search problem. The video is split into segments (nodes), and an MLLM evaluates cost functions and termination confidence to decide which segments to expand, yielding a compact yet informative set of keyframes. Variants include TASKER-BFS, TASKER-GBFS, TASKER-Dijkstra, and TASKER-A*.
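
To make the search formulation concrete, here is a minimal sketch of a greedy best-first search over video segments. It is not the TASKER implementation. It assumes segments are half-open frame ranges, that expanding a segment samples its middle frame and splits it in two, and that the `cost` and `confidence` callables stand in for the MLLM's judgments. All names, the splitting rule, and the toy scoring functions are hypothetical.

```python
import heapq
from typing import Callable, List, Tuple

Segment = Tuple[int, int]  # half-open frame range [start, end)


def best_first_keyframes(
    num_frames: int,
    cost: Callable[[Segment], float],          # stand-in for an MLLM cost estimate
    confidence: Callable[[List[int]], float],  # stand-in for termination confidence
    threshold: float = 0.9,
    max_keyframes: int = 8,
) -> List[int]:
    """Expand the cheapest segment first until confident or out of budget."""
    keyframes: List[int] = []
    root = (0, num_frames)
    frontier: List[Tuple[float, Segment]] = [(cost(root), root)]
    while frontier and len(keyframes) < max_keyframes:
        _, (start, end) = heapq.heappop(frontier)
        mid = (start + end) // 2
        if mid not in keyframes:
            keyframes.append(mid)  # sample the middle frame of the segment
        if confidence(keyframes) >= threshold:
            break  # enough evidence has been collected
        for child in ((start, mid), (mid, end)):
            if child[1] - child[0] > 1:  # skip segments too short to split
                heapq.heappush(frontier, (cost(child), child))
    return sorted(keyframes)


# Toy stand-ins: the "relevant" moment is frame 70 of a 100-frame video.
TARGET = 70


def toy_cost(segment: Segment) -> float:
    return abs((segment[0] + segment[1]) // 2 - TARGET)


def toy_confidence(frames: List[int]) -> float:
    return 1.0 if any(abs(f - TARGET) <= 2 for f in frames) else 0.0


result = best_first_keyframes(100, toy_cost, toy_confidence)
print(result)  # [50, 62, 68, 75]
```

In this toy run the search visits four frames out of 100 before the confidence check stops it. Swapping the priority used in the heap (for example, accumulated path cost, or path cost plus a heuristic) is the standard way to obtain Dijkstra-style or A*-style behavior from the same loop; how the TASKER variants define those terms is described in the paper and TASKER/README.md, not here.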

The same search procedure serves both paradigms studied in the paper, so the code is organized into two stages under one directory:

  • TASKER/videoqa/ — Stage 1: keyframe search for VideoQA (EgoSchema, NExT-QA), using a caption/text-based LLM.
  • TASKER/gui/ — Stage 2: keyframe extraction for video-guided GUI tutorials, using a multi-image VLM; its output feeds the tasker reference mode of VG-GUI-Bench.

➡️ Installation, demo, and experiments for both stages: see TASKER/README.md.

Note. The TASKER algorithm originates from our earlier work AKeyS (Agentic Keyframe Search for Video Question Answering, arXiv:2503.16032). TASKER generalizes and renames AKeyS, extending it from VideoQA to video-guided agentic tasks. See TASKER/README.md for details.

Acknowledgements

VG-GUI-Bench is built on the MONDAY dataset. We also thank the developers of LLoVi, VideoTree, VideoAgent, and Android-in-the-Wild for their public code releases.

Citation

If you find our work useful, please consider citing:

@inproceedings{fan2026bridging,
  title     = {Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction},
  author    = {Fan, Sunqi and Liu, Qingle and Yin, Runqi and Guo, Meng-Hao and Yang, Shuojin},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}

The TASKER algorithm builds upon our earlier AKeyS paper:

@misc{fan2025agentickeyframesearchvideo,
  title         = {Agentic Keyframe Search for Video Question Answering},
  author        = {Sunqi Fan and Meng-Hao Guo and Shuojin Yang},
  year          = {2025},
  eprint        = {2503.16032},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2503.16032}
}

License

This project is released under the MIT License.
