Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction

Sunqi Fan, Qingle Liu, Runqi Yin, Meng-Hao Guo, Shuojin Yang^✉
Tsinghua University, Beijing, China · ^✉ Corresponding author
Accepted by ECCV 2026

This is the official repository for our ECCV 2026 paper "Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction".

We advance video understanding from the low-level VideoQA paradigm toward the high-level Video-Guided Agentic Task paradigm. The latter requires models to learn procedural knowledge from a tutorial video and transfer it to long-horizon decision making — a form of video in-context learning.

Abstract

Recent Multimodal Large Language Models (MLLMs) achieve remarkable performance on Video Question Answering (VideoQA) benchmarks, but existing benchmarks primarily test shallow visual perception and rarely examine whether MLLMs can learn deeper procedural skills from video tutorials and generalize them to long-horizon agentic tasks. To address this gap, we introduce VG-GUI-Bench (Video-Guided GUI Benchmark), a new benchmark that evaluates whether MLLM-based GUI agents can follow video tutorials to complete corresponding interactive GUI tasks.

We further observe that performance on both VideoQA and video-guided agentic tasks depends critically on effective keyframe extraction. Based on this observation, we propose TASKER (Task-driven and Scene-aware Keyframe searcher), a keyframe extraction algorithm that jointly considers task relevance and scene dynamics. TASKER yields significant gains on both VideoQA and video-guided agentic benchmarks, outperforming the best baseline by 2.0% on the EgoSchema full set and 1.8% on NExT-QA, highlighting the potential of generalized keyframe extraction for video understanding.

Contributions

A Two-Level Taxonomy. We identify a key limitation of existing video benchmarks and propose a taxonomy connecting low-level VideoQA with high-level video-guided agentic tasks, highlighting the role of video in-context learning.
VG-GUI-Bench. A new benchmark that pairs tutorial videos with GUI agent tasks to evaluate procedural knowledge transfer from videos, with 1,000 long-horizon test cases and four complementary metrics.
TASKER. A task-driven and scene-aware keyframe extraction algorithm, formulated as a generalized graph search, that improves both accuracy and frame efficiency across VideoQA and agentic tasks.

Repository Structure

This repository is organized into two parts, corresponding to the two paradigms studied in the paper:

Part	Description	Get Started
📱 VG-GUI-Bench	The Video-Guided GUI Agent benchmark: data pipeline, reference-frame modes, evaluation, and leaderboard scripts.	→ `VG-GUI-Bench/README.md`
🔍 TASKER	The TASKER keyframe-search algorithm, covering both stages: VideoQA (`videoqa/`) and video-guided GUI tutorials (`gui/`).	→ `TASKER/README.md`

VG-GUI-TASKER/
├── VG-GUI-Bench/       # Video-Guided GUI Agent benchmark   → see its README
├── TASKER/             # TASKER keyframe search (VideoQA + GUI stages) → see its README
│   ├── videoqa/        #   Stage 1: VideoQA keyframe search (EgoSchema, NExT-QA)
│   └── gui/            #   Stage 2: GUI-tutorial keyframe extraction for VG-GUI-Bench
└── README.md           # You are here

📱 Part 1 — VG-GUI-Bench

A dedicated benchmark for evaluating MLLM-based GUI agents on long-horizon tasks guided by video tutorials, released as the 🤗 VG-GUI-Bench dataset and built upon the MONDAY dataset. It provides a standardized action space (CLICK, SCROLL, TYPE, PRESS, ZOOM, FINISH) and four metrics (Accuracy, Completion, Efficiency, PIR).
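As a rough illustration of the action space listed above, the six action types can be modeled as an enumeration. This is only a sketch: the six names come from the description above, but the `Action` container, its `args` field, and the `parse_action_type` helper are hypothetical and are not taken from the benchmark code. See VG-GUI-Bench/README.md for the actual format.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict


class ActionType(Enum):
    """The six action types named in the benchmark description."""
    CLICK = "CLICK"
    SCROLL = "SCROLL"
    TYPE = "TYPE"
    PRESS = "PRESS"
    ZOOM = "ZOOM"
    FINISH = "FINISH"


@dataclass
class Action:
    """Hypothetical container: an action type plus free-form arguments."""
    type: ActionType
    # Assumption: argument names (coordinates, text, ...) are not specified
    # here, so they are kept as an untyped dictionary.
    args: Dict[str, Any] = field(default_factory=dict)


def parse_action_type(name: str) -> ActionType:
    """Map a case-insensitive action name to an ActionType.

    Raises ValueError for names outside the six-action space.
    """
    try:
        return ActionType[name.strip().upper()]
    except KeyError:
        raise ValueError(f"unknown action type: {name!r}") from None
```

Rejecting unknown names early keeps malformed model outputs from being silently scored as valid steps.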

➡️ Setup, data preparation, and evaluation: see VG-GUI-Bench/README.md.

➡️ Live results: see the Leaderboard.

🔍 Part 2 — TASKER

TASKER reformulates keyframe extraction as a generalized graph-search problem. The video is split into segments (nodes), and an MLLM evaluates cost functions and termination confidence to decide which segments to expand, yielding a compact yet informative set of keyframes. Variants include TASKER-BFS, TASKER-GBFS, TASKER-Dijkstra, and TASKER-A*.
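The idea of searching over video segments can be sketched with a small greedy best-first loop. This is a simplified illustration, not the repository's implementation. The function name `greedy_keyframe_search`, the midpoint-splitting rule, and the toy `cost_fn` and `confidence_fn` callables are all assumptions. In the actual system an MLLM would supply the cost and termination confidence, as described above.

```python
import heapq
from typing import Callable, List, Tuple

Segment = Tuple[int, int]  # (start, end) frame indices, end exclusive


def greedy_keyframe_search(
    num_frames: int,
    cost_fn: Callable[[Segment], float],
    confidence_fn: Callable[[List[int]], float],
    threshold: float = 0.9,
    max_keyframes: int = 8,
) -> List[int]:
    """Expand the lowest-cost segment until confidence reaches the threshold."""
    root: Segment = (0, num_frames)
    frontier: List[Tuple[float, Segment]] = [(cost_fn(root), root)]
    keyframes: List[int] = []

    while frontier and len(keyframes) < max_keyframes:
        _, (start, end) = heapq.heappop(frontier)
        mid = (start + end) // 2          # take the segment midpoint as a keyframe
        keyframes.append(mid)
        if confidence_fn(sorted(keyframes)) >= threshold:
            break                          # enough evidence has been gathered
        # Split around the midpoint and queue the non-empty children.
        for child in ((start, mid), (mid + 1, end)):
            if child[1] > child[0]:
                heapq.heappush(frontier, (cost_fn(child), child))
    return sorted(keyframes)


# Toy stand-ins for the MLLM: the "relevant" moment is frame 70.
TARGET = 70


def toy_cost(seg: Segment) -> float:
    start, end = seg
    if start <= TARGET < end:
        return 0.0
    return float(min(abs(TARGET - start), abs(TARGET - (end - 1))))


def toy_confidence(frames: List[int]) -> float:
    return 1.0 if any(abs(f - TARGET) <= 2 for f in frames) else 0.0


result = greedy_keyframe_search(100, toy_cost, toy_confidence)
print(result)  # [50, 63, 69, 75]
```

With these toy functions the search narrows in on frame 70 after four expansions. Changing how the frontier is ordered (for example, by accumulated path cost, or by cost plus a heuristic) gives the other classic search strategies within the same loop structure.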

The same search procedure serves both paradigms studied in the paper, so the code is organized into two stages under one directory:

TASKER/videoqa/ — Stage 1: keyframe search for VideoQA (EgoSchema, NExT-QA), using a caption/text-based LLM.
TASKER/gui/ — Stage 2: keyframe extraction for video-guided GUI tutorials, using a multi-image VLM; its output feeds the tasker reference mode of VG-GUI-Bench.

➡️ Installation, demo, and both-stage experiments: see TASKER/README.md.

Note. The TASKER algorithm originates from our earlier work AKeyS (Agentic Keyframe Search for Video Question Answering, arXiv:2503.16032). TASKER generalizes and renames AKeyS, extending it from VideoQA to video-guided agentic tasks. See TASKER/README.md for details.

Links

📄 Paper (arXiv): https://arxiv.org/abs/2606.29445
🤗 Hugging Face Daily Paper: https://huggingface.co/papers/2606.29445
🌐 Project Page: https://vg-gui-tasker.github.io/
💻 Code: https://github.com/VG-GUI-TASKER/VG-GUI-TASKER
🗂️ Data (VG-GUI-Bench): https://huggingface.co/datasets/Aoraku/VG-GUI-Bench
🏆 Leaderboard: https://vg-gui-tasker.github.io/#leaderboard

Acknowledgements

VG-GUI-Bench is built on the MONDAY dataset. We also thank the developers of LLoVi, VideoTree, VideoAgent, and Android-in-the-Wild for making their code publicly available.

Citation

If you find our work useful, please consider citing:

@inproceedings{fan2026bridging,
  title     = {Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction},
  author    = {Fan, Sunqi and Liu, Qingle and Yin, Runqi and Guo, Meng-Hao and Yang, Shuojin},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}

The TASKER algorithm builds upon our earlier AKeyS paper:

@misc{fan2025agentickeyframesearchvideo,
  title         = {Agentic Keyframe Search for Video Question Answering},
  author        = {Sunqi Fan and Meng-Hao Guo and Shuojin Yang},
  year          = {2025},
  eprint        = {2503.16032},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2503.16032}
}

License

This project is released under the MIT License.
