
    [2506.08915] Two-stage Vision Transformers and Hard Masking offer Robust Object Representations

By Awais · April 2, 2026

    [Submitted on 10 Jun 2025 (v1), last revised 1 Apr 2026 (this version, v4)]

By Ananthu Aniraj and 3 other authors

Abstract: Context can strongly affect object representations, sometimes introducing undesired biases, particularly when objects appear against out-of-distribution backgrounds at inference. At the same time, many object-centric tasks require leveraging context to identify the relevant image regions. We posit that this conundrum, in which context is simultaneously needed and a potential nuisance, can be addressed by an attention-based approach that uses learned binary attention masks to ensure that only attended image regions influence the prediction. To test this hypothesis, we evaluate a two-stage framework: stage 1 processes the full image to discover object parts and identify task-relevant regions, for which context cues are likely to be needed, while stage 2 uses input attention masking to restrict its receptive field to these regions, enabling a focused analysis while filtering out potentially spurious information. Both stages are trained jointly, allowing stage 2 to refine stage 1. The explicit nature of the semantic masks also makes the model’s reasoning auditable, enabling powerful test-time interventions that further enhance robustness. Extensive experiments across diverse benchmarks demonstrate that this approach significantly improves robustness against spurious correlations and out-of-distribution backgrounds. Code: this https URL
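The core masking mechanism described in the abstract can be sketched minimally: a stage-1 scorer produces per-token relevance scores that are thresholded into a hard binary mask, and stage-2 attention assigns zero weight to masked-out tokens, so only attended regions can influence the prediction. This is an illustrative sketch, not the authors' implementation; the function names, the threshold value, and the toy single-head attention are all assumptions.

```python
import math


def softmax(xs):
    # Numerically stable softmax; -inf logits yield exactly zero weight.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def hard_mask(scores, threshold=0.5):
    # Stage 1 (assumed): per-token relevance scores in [0, 1] are
    # thresholded into a binary keep/drop mask.
    return [1 if s >= threshold else 0 for s in scores]


def masked_attention(query, keys, values, mask):
    # Stage 2 (assumed): single-head attention restricted to kept tokens.
    # Dropped tokens get -inf logits, so they receive zero attention
    # weight and cannot influence the output at all.
    logits = [
        sum(q * k for q, k in zip(query, key)) if m == 1 else float("-inf")
        for key, m in zip(keys, mask)
    ]
    weights = softmax(logits)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
```

Because the mask is explicit and binary, it can be inspected or edited at test time (e.g., zeroing a token suspected of carrying a spurious background cue), which is what makes the model's reasoning auditable in the sense the abstract describes.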

    Submission history

From: Ananthu Aniraj
[v1] Tue, 10 Jun 2025 15:41:22 UTC (19,418 KB)
[v2] Mon, 16 Jun 2025 08:52:37 UTC (19,420 KB)
[v3] Tue, 17 Jun 2025 13:45:06 UTC (19,420 KB)
[v4] Wed, 1 Apr 2026 10:28:00 UTC (19,423 KB)
