MomentSeeker: A Task-Oriented Benchmark For Long-Video Moment Retrieval, by Huaying Yuan and 9 other authors
Abstract: Accurately locating key moments within long videos is crucial for solving long video understanding (LVU) tasks. However, existing benchmarks are either severely limited in terms of video length and task diversity, or they focus solely on end-to-end LVU performance, making them inappropriate for evaluating whether key moments can be accurately accessed. To address this challenge, we propose MomentSeeker, a novel benchmark for long-video moment retrieval (LVMR), distinguished by the following features. First, it is built on long and diverse videos, averaging over 1200 seconds in duration and collected from various domains, e.g., movie, anomaly, egocentric, and sports. Second, it covers a variety of real-world scenarios at three levels (global-level, event-level, and object-level), spanning common tasks such as action recognition, object localization, and causal reasoning. Third, it incorporates rich forms of queries, including text-only queries, image-conditioned queries, and video-conditioned queries. On top of MomentSeeker, we conduct comprehensive experiments with both generation-based approaches (directly using MLLMs) and retrieval-based approaches (leveraging video retrievers). Our results reveal the significant challenges of long-video moment retrieval in terms of accuracy and efficiency, despite improvements from the latest long-video MLLMs and task-specific fine-tuning. We have publicly released MomentSeeker (this https URL) to facilitate future research in this area.
Submission history
From: Huaying Yuan [view email]
[v1] Tue, 18 Feb 2025 05:50:23 UTC (11,161 KB)
[v2] Mon, 10 Mar 2025 05:34:20 UTC (5,908 KB)
[v3] Wed, 16 Apr 2025 03:11:44 UTC (5,925 KB)
[v4] Tue, 20 May 2025 03:30:44 UTC (20,736 KB)
[v5] Sat, 10 Jan 2026 02:37:26 UTC (20,750 KB)