MUFASA is designed as a plug-and-play component, allowing seamless integration into existing slot attention-based methods by replacing their single-layer slot-attention module
with our proposed multi-layer framework. MUFASA is trained with no additional losses, relying solely on the training signals of the base model.
(a) Multi-layer slot-attention: Given an input image, features from multiple layers of a DINO encoder are extracted and processed by multiple slot-attention modules \(\mathrm{SA}_m\).
The slot-attention modules are independent of one another, enabling adaptation to layer-specific information. Each \(\mathrm{SA}_m\) produces layer-specific slots \(\mathcal{S}_m\) and corresponding attention masks \(\mathcal{A}_m^{\mathrm{Slot}}\).
After Hungarian Matching, a fusion module merges slots and masks.
A ViT decoder reconstructs the last encoder layer's features from fused slots.
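The per-layer processing in (a) can be illustrated with a toy numpy sketch. This is a heavily simplified slot-attention step (no GRU or MLP update, no learned projections); all shapes, the number of layers, and the random toy features are illustrative assumptions, not details from the paper.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(features, slots, n_iters=3):
    """Simplified slot-attention module (weighted-mean update only, an assumption).

    features: (N, D) tokens from one encoder layer; slots: (K, D).
    Returns updated slots (K, D) and attention masks (K, N).
    """
    for _ in range(n_iters):
        # Attention normalized over slots, so slots compete for each token.
        attn = softmax(slots @ features.T / np.sqrt(features.shape[1]), axis=0)  # (K, N)
        # Renormalize per slot, then update slots as weighted means of tokens.
        attn_norm = attn / (attn.sum(axis=1, keepdims=True) + 1e-8)
        slots = attn_norm @ features
    return slots, attn

# Independent modules per layer: each layer m gets its own slot initialization,
# mirroring that the SA_m are independent of one another.
rng = np.random.default_rng(0)
layer_feats = [rng.normal(size=(16, 8)) for _ in range(3)]  # toy features from 3 encoder layers
init_slots = [rng.normal(size=(4, 8)) for _ in range(3)]    # per-layer slots (toy)
outputs = [slot_attention(f, s) for f, s in zip(layer_feats, init_slots)]
```

Each entry of `outputs` holds one layer's slot set \(\mathcal{S}_m\) and its attention masks \(\mathcal{A}_m^{\mathrm{Slot}}\), which downstream are matched and fused.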
(b) Hungarian Matching: Prior to fusion, we ensure that two slot sets \(\mathcal{S}_m\) and \(\mathcal{S}_{m+1}\) of adjacent layers are aligned in the sense that the slots at corresponding indices learn to bind to the same objects across layers.
This is achieved via Hungarian Matching between the corresponding slot-attention masks \(\mathcal{A}_m^{\mathrm{Slot}}\) and \(\mathcal{A}_{m+1}^{\mathrm{Slot}}\).
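A minimal sketch of this matching step, using `scipy.optimize.linear_sum_assignment`; the mask shapes and the cosine-similarity cost are assumptions chosen for illustration, not necessarily the cost used in the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_slots(masks_m, masks_next):
    """Align two slot sets via Hungarian Matching on their attention masks.

    masks_m, masks_next: (K, N) arrays of slot-attention masks over N image
    tokens for K slots. Returns a permutation p with slot i of layer m
    matched to slot p[i] of layer m+1.
    """
    # Cosine similarity between every pair of masks across the two layers
    # (cost choice is an assumption).
    a = masks_m / (np.linalg.norm(masks_m, axis=1, keepdims=True) + 1e-8)
    b = masks_next / (np.linalg.norm(masks_next, axis=1, keepdims=True) + 1e-8)
    sim = a @ b.T                            # (K, K) similarity matrix
    _, col = linear_sum_assignment(-sim)     # maximize total similarity
    return col

# Toy example: one-hot masks where layer m+1's slots are a permutation of layer m's.
perm = match_slots(np.eye(3), np.eye(3)[[2, 0, 1]])
```

With the permutation in hand, the slots of layer \(m+1\) can be reindexed so that corresponding indices bind to the same objects before fusion.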
(c) Fusion module: To integrate the semantic information encoded across layers, we fuse all slot sets \(\mathcal{S}_m\) into a single set of slots \(\mathcal{S}_{\mathrm{fused}}\).
For this, we propose M-Fusion: First, each adjacent pair of slot sets is summed in a sliding-window fashion, encoding an inductive bias toward local interactions between slots of adjacent layers.
The resulting sums are concatenated and projected onto the fused set of slots through a learned MLP.
We analogously fuse the slot-attention masks \(\mathcal{A}_m^{\mathrm{Slot}}\) into a joint representation \(\mathcal{A}_{\mathrm{fused}}^{\mathrm{Slot}}\) through a weighted linear combination.
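The fusion step can be sketched as follows. The MLP architecture (one hidden ReLU layer), all shapes, and the choice of mask weights are illustrative assumptions; only the overall structure (pairwise sums, concatenation, MLP projection, weighted mask combination) follows the description above.

```python
import numpy as np

def mlp(x, W1, W2):
    # Two-layer MLP with ReLU; depth and width are assumptions.
    return np.maximum(x @ W1, 0.0) @ W2

def m_fusion(slot_sets, mask_sets, W1, W2, mask_weights):
    """Sketch of M-Fusion.

    slot_sets: list of M arrays (K, D), already aligned by Hungarian Matching;
    mask_sets: list of M arrays (K, N); mask_weights: M scalars.
    """
    # Sliding-window sums of adjacent slot sets: (S_1+S_2), (S_2+S_3), ...
    pair_sums = [slot_sets[m] + slot_sets[m + 1] for m in range(len(slot_sets) - 1)]
    concat = np.concatenate(pair_sums, axis=-1)          # (K, (M-1)*D)
    fused_slots = mlp(concat, W1, W2)                    # (K, D)
    # Masks are fused by a weighted linear combination.
    fused_masks = sum(w * a for w, a in zip(mask_weights, mask_sets))  # (K, N)
    return fused_slots, fused_masks

# Toy shapes: M=3 layers, K=4 slots, D=8 slot dim, N=16 tokens.
rng = np.random.default_rng(0)
slot_sets = [rng.normal(size=(4, 8)) for _ in range(3)]
mask_sets = [rng.uniform(size=(4, 16)) for _ in range(3)]
W1 = rng.normal(size=(16, 32))   # (M-1)*D -> hidden
W2 = rng.normal(size=(32, 8))    # hidden -> D
fused_slots, fused_masks = m_fusion(slot_sets, mask_sets, W1, W2, [1/3, 1/3, 1/3])
```

The fused slots then feed the ViT decoder, while the fused masks give the joint representation \(\mathcal{A}_{\mathrm{fused}}^{\mathrm{Slot}}\).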
We integrate MUFASA into state-of-the-art slot attention-based models, SPOT and DINOSAUR, to highlight the benefit of our approach.
In the task of unsupervised object segmentation (UOS), we substantially improve the performance of the respective base models,
thus establishing a new state of the art among unsupervised OCL methods on the evaluated benchmarks.
The segmentation results of MUFASA are of higher quality compared to the baselines, producing masks more consistent in shape that follow the object boundaries more closely.
In addition to enhanced UOS performance, integrating MUFASA improves training efficiency: MUFASA matches the performance of the baseline models
in substantially fewer epochs (\(\text{T}_\text{Base}\)), and training converges earlier (\(\text{T}_\text{Peak}\)), while incurring only minor overhead during inference.