Two-stage Vision Transformers and Hard Masking offer Robust Object Representations, by Ananthu Aniraj and 3 other authors
Abstract: Context can strongly affect object representations, sometimes leading to undesired biases, particularly when objects appear in out-of-distribution backgrounds at inference. At the same time, many object-centric tasks require leveraging context to identify the relevant image regions. We posit that this conundrum, in which context is simultaneously needed and a potential nuisance, can be addressed by an attention-based approach that uses learned binary attention masks to ensure that only attended image regions influence the prediction. To test this hypothesis, we evaluate a two-stage framework: stage 1 processes the full image to discover object parts and identify task-relevant regions, for which context cues are likely to be needed, while stage 2 leverages input attention masking to restrict its receptive field to these regions, enabling a focused analysis while filtering out potentially spurious information. Both stages are trained jointly, allowing stage 2 to refine stage 1. The explicit nature of the semantic masks also makes the model's reasoning auditable, enabling powerful test-time interventions to further enhance robustness. Extensive experiments across diverse benchmarks demonstrate that this approach significantly improves robustness against spurious correlations and out-of-distribution backgrounds. Code: this https URL
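The abstract describes the mechanism but not its implementation. The following is a minimal PyTorch sketch of the idea, assuming pre-embedded patch tokens, a top-k straight-through binary mask, and generic `nn.TransformerEncoder` blocks for both stages; all names, the keep ratio, and the masking heuristic are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class TwoStageMaskedClassifier(nn.Module):
    """Illustrative two-stage model: stage 1 scores patches over the full
    image, a binary (hard) mask is derived from those scores, and stage 2
    attends only to the selected patches, so masked-out regions cannot
    influence the prediction."""

    def __init__(self, dim=256, num_classes=200, keep_ratio=0.25):
        super().__init__()
        self.keep_ratio = keep_ratio
        make_layer = lambda: nn.TransformerEncoderLayer(
            d_model=dim, nhead=8, batch_first=True
        )
        self.stage1 = nn.TransformerEncoder(make_layer(), num_layers=4)
        self.stage2 = nn.TransformerEncoder(make_layer(), num_layers=4)
        self.score_head = nn.Linear(dim, 1)          # patch relevance scores
        self.cls_head = nn.Linear(dim, num_classes)  # final classifier

    def forward(self, patch_tokens):
        # patch_tokens: (B, N, dim) embedded image patches
        ctx = self.stage1(patch_tokens)              # stage 1 sees the full image
        scores = self.score_head(ctx).squeeze(-1)    # (B, N)

        # Hard mask over patches: keep the top-k scored patches. A
        # straight-through estimator keeps the mask differentiable so both
        # stages can be trained jointly (an assumption made for this sketch).
        k = max(1, int(self.keep_ratio * scores.shape[1]))
        topk = scores.topk(k, dim=1).indices
        hard = torch.zeros_like(scores).scatter(1, topk, 1.0)
        soft = scores.sigmoid()
        mask = hard + (soft - soft.detach())         # values in {0, 1} on the forward pass

        # Stage 2: masked-out patches are zeroed and excluded from attention.
        key_padding_mask = mask < 0.5                # True = ignore this patch
        out = self.stage2(patch_tokens * mask.unsqueeze(-1),
                          src_key_padding_mask=key_padding_mask)
        pooled = (out * mask.unsqueeze(-1)).sum(1) / mask.sum(1, keepdim=True)
        return self.cls_head(pooled), mask
```

Because the mask is explicit and binary, it can be inspected or edited at test time (e.g., zeroing patches known to be background) before running stage 2, which is the kind of intervention the abstract alludes to.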
Submission history
From: Ananthu Aniraj
[v1]
Tue, 10 Jun 2025 15:41:22 UTC (19,418 KB)
[v2]
Mon, 16 Jun 2025 08:52:37 UTC (19,420 KB)
[v3]
Tue, 17 Jun 2025 13:45:06 UTC (19,420 KB)
[v4]
Wed, 1 Apr 2026 10:28:00 UTC (19,423 KB)