Sunqi Fan, Qingle Liu, Runqi Yin, Meng-Hao Guo, Shuojin Yang✉
Tsinghua University, Beijing, China · ✉ Corresponding author
Accepted by ECCV 2026
This is the official repository for our ECCV 2026 paper "Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction".
We advance video understanding from the low-level VideoQA paradigm toward the high-level Video-Guided Agentic Task paradigm. The latter requires models to learn procedural knowledge from a tutorial video and transfer it to long-horizon decision making — a form of video in-context learning.
Recent Multimodal Large Language Models (MLLMs) achieve remarkable performance on Video Question Answering (VideoQA) benchmarks, but existing benchmarks primarily test shallow visual perception and rarely examine whether MLLMs can learn deeper procedural skills from video tutorials and generalize them to long-horizon agentic tasks. To address this gap, we introduce VG-GUI-Bench (Video-Guided GUI Benchmark), a new benchmark that evaluates whether MLLM-based GUI agents can follow video tutorials to complete corresponding interactive GUI tasks.
We further observe that performance on both VideoQA and video-guided agentic tasks critically depends on effective keyframe extraction. Based on this, we propose TASKER (Task-driven and Scene-aware Keyframe searcher), a keyframe extraction algorithm that jointly considers task relevance and scene dynamics. TASKER yields significant gains on both VideoQA and video-guided agentic benchmarks, outperforming the best baseline by 2.0% on the EgoSchema fullset and 1.8% on NExT-QA, highlighting the potential of generalized keyframe extraction for video understanding.
- A Two-Level Taxonomy. We identify a key limitation of existing video benchmarks and propose a taxonomy connecting low-level VideoQA with high-level video-guided agentic tasks, highlighting the role of video in-context learning.
- VG-GUI-Bench. A new benchmark that pairs tutorial videos with GUI agent tasks to evaluate procedural knowledge transfer from videos, with 1,000 long-horizon test cases and four complementary metrics.
- TASKER. A task-driven and scene-aware keyframe extraction algorithm, formulated as a generalized graph search, that improves both accuracy and frame efficiency across VideoQA and agentic tasks.
This repository is organized into two parts, corresponding to the two paradigms studied in the paper:
| Part | Description | Get Started |
|---|---|---|
| 📱 VG-GUI-Bench | The Video-Guided GUI Agent benchmark: data pipeline, reference-frame modes, evaluation, and leaderboard scripts. | → VG-GUI-Bench/README.md |
| 🔍 TASKER | The TASKER keyframe-search algorithm, covering both stages: VideoQA (videoqa/) and video-guided GUI tutorials (gui/). | → TASKER/README.md |
VG-GUI-TASKER/
├── VG-GUI-Bench/ # Video-Guided GUI Agent benchmark → see its README
├── TASKER/ # TASKER keyframe search (VideoQA + GUI stages) → see its README
│ ├── videoqa/ # Stage 1: VideoQA keyframe search (EgoSchema, NExT-QA)
│ └── gui/ # Stage 2: GUI-tutorial keyframe extraction for VG-GUI-Bench
└── README.md # You are here
A dedicated benchmark for evaluating MLLM-based GUI agents on long-horizon tasks guided by video tutorials, released as the 🤗 VG-GUI-Bench dataset and built upon the MONDAY dataset. It provides a standardized action space (CLICK, SCROLL, TYPE, PRESS, ZOOM, FINISH) and four metrics (Accuracy, Completion, Efficiency, PIR).
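As a rough illustration of the standardized action space listed above, the six action names could be modelled as an enumeration. This is only a sketch: `ActionType` and `parse_action_type` are hypothetical names, not the benchmark's API, and the actual action arguments and serialization format are defined in VG-GUI-Bench/README.md.

```python
from enum import Enum


class ActionType(Enum):
    """The six action names listed in this README (arguments are omitted)."""
    CLICK = "CLICK"
    SCROLL = "SCROLL"
    TYPE = "TYPE"
    PRESS = "PRESS"
    ZOOM = "ZOOM"
    FINISH = "FINISH"


def parse_action_type(name: str) -> ActionType:
    """Map a raw action name to an ActionType, ignoring case and whitespace.

    Hypothetical helper: the real benchmark may parse actions differently.
    """
    key = name.strip().upper()
    try:
        return ActionType[key]
    except KeyError:
        raise ValueError(f"unknown action type: {name!r}") from None
```

Rejecting unknown names early keeps malformed agent outputs from being silently scored as valid steps.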
➡️ Setup, data preparation, and evaluation: see VG-GUI-Bench/README.md.
➡️ Live results: see the Leaderboard.
TASKER reformulates keyframe extraction as a generalized graph-search problem. The video is split into segments (nodes), and an MLLM evaluates cost functions and termination confidence to decide which segments to expand — selecting a compact yet informative set of keyframes. Variants include TASKER-BFS / GBFS / Dijkstra / A*.
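The search loop described above can be sketched in a few lines. This is a minimal best-first sketch under stated assumptions, not the released implementation: `search_keyframes`, `cost_fn`, and `confidence_fn` are hypothetical names, segments are split at their midpoint, and plain Python callables stand in for the MLLM that scores segments and estimates termination confidence.

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(order=True)
class Segment:
    cost: float                       # lower cost = expanded earlier
    start: int = field(compare=False)
    end: int = field(compare=False)   # exclusive frame index


def search_keyframes(
    num_frames: int,
    cost_fn: Callable[[int, int], float],          # stand-in for MLLM segment cost
    confidence_fn: Callable[[List[int]], float],   # stand-in for MLLM confidence
    threshold: float = 0.8,
    max_frames: int = 8,
) -> List[int]:
    """Expand the cheapest segment until confidence or the frame budget is hit."""
    if num_frames < 2:
        raise ValueError("need at least two frames")
    keyframes = {0, num_frames - 1}
    frontier = [Segment(cost_fn(0, num_frames), 0, num_frames)]
    while frontier and len(keyframes) < max_frames:
        if confidence_fn(sorted(keyframes)) >= threshold:
            break  # the current keyframes are judged sufficient
        seg = heapq.heappop(frontier)
        mid = (seg.start + seg.end) // 2
        if mid == seg.start:
            continue  # segment too short to split
        keyframes.add(mid)
        for s, e in ((seg.start, mid), (mid, seg.end)):
            if e - s > 1:
                heapq.heappush(frontier, Segment(cost_fn(s, e), s, e))
    return sorted(keyframes)


# Toy stand-ins: the "relevant" moment is frame 70.
def demo_cost(start: int, end: int) -> float:
    return 0.0 if start <= 70 < end else 1.0


def demo_confidence(frames: List[int]) -> float:
    return 1.0 if any(abs(f - 70) <= 2 for f in frames) else 0.0


frames = search_keyframes(100, demo_cost, demo_confidence, max_frames=16)
```

Ordering the frontier by cost alone gives a greedy best-first flavour. The other variants named above would presumably change how the priority is computed, which this sketch does not attempt to reproduce.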
The same search procedure serves both paradigms studied in the paper, so the code is organized into two stages under one directory:
- `TASKER/videoqa/`: Stage 1, keyframe search for VideoQA (EgoSchema, NExT-QA), using a caption/text-based LLM.
- `TASKER/gui/`: Stage 2, keyframe extraction for video-guided GUI tutorials, using a multi-image VLM; its output feeds the `tasker` reference mode of VG-GUI-Bench.
➡️ Installation, demo, and both-stage experiments: see TASKER/README.md.
Note. The TASKER algorithm originates from our earlier work AKeyS (Agentic Keyframe Search for Video Question Answering, arXiv:2503.16032). TASKER generalizes and renames AKeyS, extending it from VideoQA to video-guided agentic tasks. See TASKER/README.md for details.
- 📄 Paper (arXiv): https://arxiv.org/abs/2606.29445
- 🤗 Hugging Face Daily Paper: https://huggingface.co/papers/2606.29445
- 🌐 Project Page: https://vg-gui-tasker.github.io/
- 💻 Code: https://github.com/VG-GUI-TASKER/VG-GUI-TASKER
- 🗂️ Data (VG-GUI-Bench): https://huggingface.co/datasets/Aoraku/VG-GUI-Bench
- 🏆 Leaderboard: https://vg-gui-tasker.github.io/#leaderboard
VG-GUI-Bench is built on the MONDAY dataset. We also thank the developers of LLoVi, VideoTree, VideoAgent, and Android-in-the-Wild for publicly releasing their code.
If you find our work useful, please consider citing:
@inproceedings{fan2026bridging,
title = {Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction},
author = {Fan, Sunqi and Liu, Qingle and Yin, Runqi and Guo, Meng-Hao and Yang, Shuojin},
booktitle = {European Conference on Computer Vision (ECCV)},
year = {2026}
}

The TASKER algorithm builds upon our earlier AKeyS paper:
@misc{fan2025agentickeyframesearchvideo,
title = {Agentic Keyframe Search for Video Question Answering},
author = {Sunqi Fan and Meng-Hao Guo and Shuojin Yang},
year = {2025},
eprint = {2503.16032},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/2503.16032}
}

This project is released under the MIT License.
