MUFASA is designed as a plug-and-play component, allowing seamless integration into existing slot attention-based methods by replacing their single-layer slot-attention module
with our proposed multi-layer framework. MUFASA is trained with no additional losses, relying solely on the training signals of the base model.
(a) Multi-layer slot-attention: Given an input image, features from multiple layers of a DINO encoder are extracted and processed by multiple slot-attention modules \(\mathrm{SA}_m\).
The slot-attention modules are independent of one another, enabling adaptation to layer-specific information. Each \(\mathrm{SA}_m\) produces layer-specific slots \(\mathcal{S}_m\) and corresponding attention masks \(\mathcal{A}_m^{\mathrm{Slot}}\).
After Hungarian Matching, a fusion module merges slots and masks.
A ViT decoder reconstructs the last encoder layer's features from fused slots.
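The per-layer processing in (a) can be illustrated with a toy numpy sketch. This is a heavily simplified slot-attention step (no GRU or MLP update, no learned projections); all shapes, the number of layers, and the random toy features are illustrative assumptions, not details from the paper.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(features, slots, n_iters=3):
    """Simplified slot-attention module (weighted-mean update only, an assumption).

    features: (N, D) tokens from one encoder layer; slots: (K, D).
    Returns updated slots (K, D) and attention masks (K, N).
    """
    for _ in range(n_iters):
        # Attention normalized over slots, so slots compete for each token.
        attn = softmax(slots @ features.T / np.sqrt(features.shape[1]), axis=0)  # (K, N)
        # Renormalize per slot, then update slots as weighted means of tokens.
        attn_norm = attn / (attn.sum(axis=1, keepdims=True) + 1e-8)
        slots = attn_norm @ features
    return slots, attn

# Independent modules per layer: each layer m gets its own slot initialization,
# mirroring that the SA_m are independent of one another.
rng = np.random.default_rng(0)
layer_feats = [rng.normal(size=(16, 8)) for _ in range(3)]  # toy features from 3 encoder layers
init_slots = [rng.normal(size=(4, 8)) for _ in range(3)]    # per-layer slots (toy)
outputs = [slot_attention(f, s) for f, s in zip(layer_feats, init_slots)]
```

Each entry of `outputs` holds one layer's slot set \(\mathcal{S}_m\) and its attention masks \(\mathcal{A}_m^{\mathrm{Slot}}\), which downstream are matched and fused.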
(b) Hungarian Matching: Prior to fusion, we ensure that two slot sets \(\mathcal{S}_m\) and \(\mathcal{S}_{m+1}\) of adjacent layers are aligned in the sense that the slots at corresponding indices learn to bind to the same objects across layers.
This is achieved via Hungarian Matching between the corresponding slot-attention masks \(\mathcal{A}_m^{\mathrm{Slot}}\) and \(\mathcal{A}_{m+1}^{\mathrm{Slot}}\).
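A minimal sketch of this matching step, using `scipy.optimize.linear_sum_assignment`; the mask shapes and the cosine-similarity cost are assumptions chosen for illustration, not necessarily the cost used in the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_slots(masks_m, masks_next):
    """Align two slot sets via Hungarian Matching on their attention masks.

    masks_m, masks_next: (K, N) arrays of slot-attention masks over N image
    tokens for K slots. Returns a permutation p with slot i of layer m
    matched to slot p[i] of layer m+1.
    """
    # Cosine similarity between every pair of masks across the two layers
    # (cost choice is an assumption).
    a = masks_m / (np.linalg.norm(masks_m, axis=1, keepdims=True) + 1e-8)
    b = masks_next / (np.linalg.norm(masks_next, axis=1, keepdims=True) + 1e-8)
    sim = a @ b.T                            # (K, K) similarity matrix
    _, col = linear_sum_assignment(-sim)     # maximize total similarity
    return col

# Toy example: one-hot masks where layer m+1's slots are a permutation of layer m's.
perm = match_slots(np.eye(3), np.eye(3)[[2, 0, 1]])
```

With the permutation in hand, the slots of layer \(m+1\) can be reindexed so that corresponding indices bind to the same objects before fusion.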
(c) Fusion module: To integrate the semantic information encoded across layers, we fuse all slot sets \(\mathcal{S}_m\) into a single set of slots \(\mathcal{S}_{\mathrm{fused}}\).
For this, we propose M-Fusion: First, each adjacent pair of slot sets is summed in a sliding-window fashion, encoding an inductive bias toward local interactions between slots of adjacent layers.
The resulting sums are concatenated and projected onto the fused set of slots through a learned MLP.
We analogously fuse the slot-attention masks \(\mathcal{A}_m^{\mathrm{Slot}}\) into a joint representation \(\mathcal{A}_{\mathrm{fused}}^{\mathrm{Slot}}\) through a weighted linear combination.
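The fusion step can be sketched as follows. The MLP architecture (one hidden ReLU layer), all shapes, and the choice of mask weights are illustrative assumptions; only the overall structure (pairwise sums, concatenation, MLP projection, weighted mask combination) follows the description above.

```python
import numpy as np

def mlp(x, W1, W2):
    # Two-layer MLP with ReLU; depth and width are assumptions.
    return np.maximum(x @ W1, 0.0) @ W2

def m_fusion(slot_sets, mask_sets, W1, W2, mask_weights):
    """Sketch of M-Fusion.

    slot_sets: list of M arrays (K, D), already aligned by Hungarian Matching;
    mask_sets: list of M arrays (K, N); mask_weights: M scalars.
    """
    # Sliding-window sums of adjacent slot sets: (S_1+S_2), (S_2+S_3), ...
    pair_sums = [slot_sets[m] + slot_sets[m + 1] for m in range(len(slot_sets) - 1)]
    concat = np.concatenate(pair_sums, axis=-1)          # (K, (M-1)*D)
    fused_slots = mlp(concat, W1, W2)                    # (K, D)
    # Masks are fused by a weighted linear combination.
    fused_masks = sum(w * a for w, a in zip(mask_weights, mask_sets))  # (K, N)
    return fused_slots, fused_masks

# Toy shapes: M=3 layers, K=4 slots, D=8 slot dim, N=16 tokens.
rng = np.random.default_rng(0)
slot_sets = [rng.normal(size=(4, 8)) for _ in range(3)]
mask_sets = [rng.uniform(size=(4, 16)) for _ in range(3)]
W1 = rng.normal(size=(16, 32))   # (M-1)*D -> hidden
W2 = rng.normal(size=(32, 8))    # hidden -> D
fused_slots, fused_masks = m_fusion(slot_sets, mask_sets, W1, W2, [1/3, 1/3, 1/3])
```

The fused slots then feed the ViT decoder, while the fused masks give the joint representation \(\mathcal{A}_{\mathrm{fused}}^{\mathrm{Slot}}\).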
We integrate MUFASA into state-of-the-art slot attention-based models, SPOT and DINOSAUR, to highlight the benefit of our approach.
In the task of unsupervised object segmentation (UOS), we substantially improve the performance of the respective base models,
thus establishing a new state of the art among unsupervised OCL methods on the evaluated benchmarks.
The segmentation results of MUFASA are of higher quality compared to the baselines, producing masks more consistent in shape that follow the object boundaries more closely.
In addition to enhanced UOS performance, integrating MUFASA improves training efficiency: MUFASA matches the performance of the baseline models
in substantially fewer epochs (\(\text{T}_\text{Base}\)), and training converges earlier (\(\text{T}_\text{Peak}\)), while incurring only minor overhead during inference.