MLVTG: Mamba-Based Feature Alignment and LLM-Driven Purification for Multi-Modal Video Temporal Grounding
by Zhiyi Zhu and 3 other authors
Abstract: Video Temporal Grounding (VTG), which aims to localize video clips corresponding to natural language queries, is a fundamental yet challenging task in video understanding. Existing Transformer-based methods often suffer from redundant attention and suboptimal multi-modal alignment. To address these limitations, we propose MLVTG, a novel framework that integrates two key modules: MambaAligner and LLMRefiner. MambaAligner uses stacked Vision Mamba blocks as a backbone instead of Transformers to model temporal dependencies and extract robust video representations for multi-modal alignment. LLMRefiner leverages a specific frozen layer of a pre-trained Large Language Model (LLM) to implicitly transfer semantic priors, enhancing multi-modal alignment without fine-tuning. This dual alignment strategy, combining temporal modeling via structured state-space dynamics with semantic purification via textual priors, enables more precise localization. Extensive experiments on QVHighlights, Charades-STA, and TVSum demonstrate that MLVTG achieves state-of-the-art performance and significantly outperforms existing baselines.
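The dual-module design described in the abstract can be illustrated with a minimal PyTorch-style sketch. Everything below is an assumption for illustration only: the class and parameter names (GatedTemporalMixer, MLVTGSketch, d_model, n_aligner_blocks, the saliency head) are hypothetical, the selective-state-space Mamba block is stood in by a simplified gated causal convolution, and the frozen LLM layer is stood in by a frozen nn.TransformerEncoderLayer. The paper's actual architecture and prediction heads are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedTemporalMixer(nn.Module):
    """Simplified stand-in for a Vision Mamba block.

    A real Mamba block runs a selective state-space scan over the frame
    sequence; this placeholder mimics only its causal temporal-mixing
    role with a depthwise 1-D convolution plus SiLU gating.
    """

    def __init__(self, d_model: int, kernel_size: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size - 1, groups=d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        h, gate = self.in_proj(self.norm(x)).chunk(2, dim=-1)
        # Causal depthwise convolution: trim the right-side padding to T.
        h = self.conv(h.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        return x + self.out_proj(F.silu(h) * F.silu(gate))


class MLVTGSketch(nn.Module):
    """Hypothetical MLVTG-style pipeline: temporally align video features,
    fuse with the query, purify with a frozen layer (the LLMRefiner role),
    and score each frame for grounding."""

    def __init__(self, d_video: int, d_text: int, d_model: int = 256,
                 n_aligner_blocks: int = 4):
        super().__init__()
        self.video_proj = nn.Linear(d_video, d_model)
        self.text_proj = nn.Linear(d_text, d_model)
        # "MambaAligner" role: stacked temporal blocks over video frames.
        self.aligner = nn.Sequential(
            *[GatedTemporalMixer(d_model) for _ in range(n_aligner_blocks)]
        )
        # "LLMRefiner" role: one frozen layer supplying semantic priors.
        self.refiner = nn.TransformerEncoderLayer(
            d_model, nhead=8, batch_first=True)
        for p in self.refiner.parameters():
            p.requires_grad = False  # frozen: priors transfer, no fine-tuning
        self.saliency_head = nn.Linear(d_model, 1)

    def forward(self, video: torch.Tensor, query: torch.Tensor):
        # video: (B, T, d_video) frame features; query: (B, L, d_text) tokens
        v = self.aligner(self.video_proj(video))         # temporal modeling
        fused = torch.cat([self.text_proj(query), v], dim=1)
        fused = self.refiner(fused)                      # semantic purification
        v_refined = fused[:, query.size(1):]             # keep video tokens
        return self.saliency_head(v_refined).squeeze(-1)  # (B, T) frame scores


if __name__ == "__main__":
    model = MLVTGSketch(d_video=512, d_text=768)
    scores = model(torch.randn(2, 75, 512), torch.randn(2, 12, 768))
    print(scores.shape)  # torch.Size([2, 75])
```

Freezing the refiner while training the aligner and heads mirrors the abstract's claim of transferring LLM semantic priors without fine-tuning; gradients still flow through the frozen layer to the trainable modules upstream.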
Submission history
From: Zhiyi Zhu
[v1] Tue, 10 Jun 2025 07:20:12 UTC (2,640 KB)
[v2] Tue, 27 Jan 2026 18:07:12 UTC (3,142 KB)


