SegDAC: Visual Generalization in Reinforcement Learning via Dynamic Object Tokens
by Alexandre Brown and 1 other authors
Abstract: Visual reinforcement learning policies trained on pixel observations often struggle to generalize when visual conditions change at test time. Object-centric representations are a promising alternative, but most approaches use fixed-size slot representations, require image reconstruction, or need auxiliary losses to learn object decompositions. As a result, it remains unclear how to learn RL policies directly from object-level inputs without these constraints. We propose SegDAC, a Segmentation-Driven Actor-Critic that operates on a variable-length set of object token embeddings. At each timestep, text-grounded segmentation produces object masks from which spatially aware token embeddings are extracted. A transformer-based actor-critic processes these dynamic tokens, using segment positional encoding to preserve spatial information across objects. We ablate these design choices and show that both segment positional encoding and variable-length processing are individually necessary for strong performance. We evaluate SegDAC on 8 ManiSkill3 manipulation tasks under 12 visual perturbation types across 3 difficulty levels. SegDAC improves over prior visual generalization methods by 15% on easy, 66% on medium, and 88% on the hardest settings. SegDAC matches the sample efficiency of state-of-the-art visual RL methods while achieving improved generalization under visual changes. Project Page: this https URL
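The abstract's core pipeline — per-timestep segmentation masks turned into a variable-length set of spatially aware object tokens — can be illustrated with a minimal sketch. This is not the authors' code: the mask-averaging of a feature map and the use of a normalized segment centroid as the positional signal are simplifying assumptions standing in for the paper's learned embeddings and segment positional encoding.

```python
import numpy as np

def segment_tokens(feature_map, masks):
    """Illustrative sketch (not the SegDAC implementation): build one token
    per segmentation mask by mask-averaging a dense feature map, and append
    the segment's normalized centroid as a simple positional encoding.

    feature_map: (H, W, C) array of per-pixel features.
    masks: list of (H, W) boolean arrays, one per detected object; the list
           length varies per frame, so the token set is variable-length.
    Returns an (N, C + 2) array with N = number of non-empty masks.
    """
    H, W, C = feature_map.shape
    tokens = []
    for mask in masks:                                   # variable count per frame
        ys, xs = np.nonzero(mask)
        if len(ys) == 0:
            continue                                     # skip empty masks
        feat = feature_map[ys, xs].mean(axis=0)          # mask-averaged features, (C,)
        pos = np.array([ys.mean() / H, xs.mean() / W])   # normalized centroid
        tokens.append(np.concatenate([feat, pos]))       # one token, (C + 2,)
    return np.stack(tokens) if tokens else np.zeros((0, C + 2))
```

A transformer-based actor-critic would then attend over this `(N, C + 2)` token set directly, which is what lets the policy handle a different number of objects at every timestep.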
Submission history
From: Alexandre Brown
[v1] Tue, 12 Aug 2025 20:16:54 UTC (15,249 KB)
[v2] Fri, 17 Oct 2025 22:15:14 UTC (15,179 KB)
[v3] Mon, 12 Jan 2026 13:21:57 UTC (15,566 KB)
[v4] Fri, 13 Mar 2026 15:31:24 UTC (15,507 KB)


