Learning Content and Positional Features for Object Detection via Self-Supervision

In recent years, self-supervised learning (SSL) has made significant progress as a method for extracting useful features from images without requiring human-annotated labels. These approaches have enabled models to achieve strong performance on various downstream tasks such as image classification and segmentation by leveraging large-scale image datasets.

However, most early SSL methods focused on learning global image-level representations and were not well suited for dense prediction tasks like object detection (OD) and instance segmentation (IS), which require fine-grained, spatially localized features. This paper addresses that limitation by proposing a novel SSL method that is specifically designed to support OD and IS using Vision Transformers (ViT) as the backbone.

Problem Setting and Contribution

OD and IS require identifying individual object instances within an image. For this, it is essential to extract features that not only capture the content of each region but also preserve its positional context. Conventional SSL methods such as DINO or MAE often ignore spatial alignment, as their positional embeddings remain fixed even when input images are cropped or augmented. This results in the loss of crucial spatial cues.

In response, the authors propose a new SSL framework that explicitly learns “intertwined content and positional features.” The goal is to produce pre-trained models that are more effective and sample-efficient for downstream OD and IS tasks.

Key Components of the Proposed Method

The method introduces two core innovations that address the limitations of prior approaches.

1. Position Encoding Aligned with Cropping

In conventional contrastive learning, crops of an image are treated as standalone views, and the same positional embeddings are applied regardless of the crop location. This breaks the link between a patch’s position and its global context, which is detrimental for tasks like OD.

To overcome this, the authors propose representing position embeddings as a vector field defined over the entire image. When a crop is taken, the corresponding region of the positional field is also cropped and sampled, maintaining spatial consistency. Furthermore, a novel augmentation is applied by randomly scaling and shifting the cropped position embeddings, which prevents the model from overfitting to absolute coordinates and improves generalization across datasets with varying object sizes.
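The crop-aligned sampling and the scale/shift jitter can be pictured with a short PyTorch-style sketch. This is an illustrative reconstruction rather than the authors' code: the function name crop_pos_embed, the jitter strengths, and the use of bilinear grid_sample over a full-image positional field are all assumptions.

```python
import torch
import torch.nn.functional as F


def crop_pos_embed(pos_field, crop_box, out_size, scale_jitter=0.2, shift_jitter=0.1):
    """Sample the positional field inside a crop, with random scale/shift jitter.

    pos_field: (1, D, H, W) positional embeddings covering the whole image.
    crop_box:  (y0, x0, y1, x1) crop corners in [0, 1] image coordinates.
    out_size:  number of patches per side in the cropped view.
    """
    y0, x0, y1, x1 = crop_box

    # Randomly scale and shift the sampled window so the model does not
    # overfit to absolute coordinates (jitter strengths are illustrative).
    s = 1.0 + (torch.rand(1).item() * 2 - 1) * scale_jitter
    dy = (torch.rand(1).item() * 2 - 1) * shift_jitter
    dx = (torch.rand(1).item() * 2 - 1) * shift_jitter

    cy, cx = (y0 + y1) / 2 + dy, (x0 + x1) / 2 + dx
    hh, hw = (y1 - y0) / 2 * s, (x1 - x0) / 2 * s

    # Sampling grid in [-1, 1] coordinates, as expected by grid_sample
    # (last dimension is (x, y)).
    ys = torch.linspace(cy - hh, cy + hh, out_size) * 2 - 1
    xs = torch.linspace(cx - hw, cx + hw, out_size) * 2 - 1
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    grid = torch.stack([gx, gy], dim=-1).unsqueeze(0)        # (1, out, out, 2)

    # Bilinearly interpolate the positional field at the crop locations.
    pos = F.grid_sample(pos_field, grid, mode="bilinear",
                        padding_mode="border", align_corners=False)
    return pos.flatten(2).transpose(1, 2)                    # (1, out*out, D)
```

For example, with a hypothetical 14×14 positional field for ViT-B (pos_field of shape (1, 768, 14, 14)), calling crop_pos_embed(pos_field, (0.1, 0.2, 0.7, 0.9), out_size=14) returns per-patch embeddings that track where the crop sits in the full image, instead of reusing one fixed grid for every view.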

2. Masking and Predicting Both Content and Position

Building on the Masked Image Modeling (MIM) paradigm, the method masks and reconstructs not only the content embeddings but also the positional embeddings, with the two masked independently. A dedicated special token is introduced to mark masked positions, and the prediction tasks for content and position are carried out separately.

Notably, the authors find that a cross-shaped masking pattern performs better for position embeddings than the conventional box-wise masking used in MAE and iBOT. This dual-masking approach enables the model to learn richer, spatially grounded representations.
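A rough sketch of how the position side of this dual masking might look, again as an assumption rather than the paper's actual implementation: a cross-shaped pattern selects which patches have their positional embeddings replaced by a learnable position-mask token, while content masking proceeds as in standard MIM. The helper names (cross_shaped_mask, mask_positions, pos_mask_token) and the mask geometry parameters are illustrative.

```python
import torch


def cross_shaped_mask(grid_size, num_crosses=2, arm=1):
    """Boolean (grid_size, grid_size) mask; True marks patches whose position is masked.

    Each cross covers a horizontal and a vertical band of half-width `arm`
    around a random center, instead of the contiguous boxes used in MAE/iBOT.
    """
    mask = torch.zeros(grid_size, grid_size, dtype=torch.bool)
    for _ in range(num_crosses):
        cy, cx = torch.randint(0, grid_size, (2,)).tolist()
        mask[max(0, cy - arm):cy + arm + 1, :] = True  # horizontal arm
        mask[:, max(0, cx - arm):cx + arm + 1] = True  # vertical arm
    return mask


def mask_positions(pos_tokens, pos_mask_token, mask):
    """Replace masked positional tokens with a learnable [POS_MASK] token.

    pos_tokens:     (B, N, D) per-patch positional embeddings from the crop.
    pos_mask_token: (1, 1, D) learnable parameter (the special position token).
    mask:           (grid, grid) boolean mask, e.g. from cross_shaped_mask().
    """
    masked = pos_tokens.clone()
    masked[:, mask.flatten(), :] = pos_mask_token.to(masked.dtype)
    return masked
```

In such a setup, the prediction head would regress targets derived from the original positional embeddings at the masked locations, with a loss kept separate from the content reconstruction loss, matching the paper's separation of the two prediction tasks.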

Experiments and Evaluation

The authors pre-train their model on ImageNet-1K and evaluate it on COCO for OD/IS and ADE20K for semantic segmentation. With both ViT-B and ViT-S backbones, the proposed method outperforms a wide range of state-of-the-art SSL methods, including general-purpose ones (e.g., DINO, MAE, iBOT) and OD/IS-specific ones (e.g., DropPos, LOCA).

Remarkably, even with fewer effective training epochs (e.g., 350 effective epochs compared to DINO's 1600), the method achieves superior performance (+0.9 box AP on COCO), demonstrating its efficiency and robustness. On ADE20K, the method performs on par with the best-performing approaches, indicating that its benefits are not limited to detection tasks.

Analysis and Visualization

To further validate their approach, the authors analyze the attention maps of ViT models pre-trained with different SSL methods. They find that their method yields attention maps that focus more sharply on individual object instances and separate foreground from background more clearly.

Additional ablation studies show that each component—positional embedding alignment, position augmentation, and dual masking—contributes significantly to performance gains, and that their combination is especially effective.

Conclusion and Significance

This work presents a novel SSL method tailored for dense prediction tasks such as OD and IS, addressing the challenge of learning representations that integrate both content and positional information. It extends contrastive and masked modeling frameworks with minimal changes and can be easily integrated into existing pipelines.

By explicitly modeling and learning position information in a crop-aware and task-relevant manner, the method pushes forward a relatively underexplored direction in SSL. The results and analyses strongly suggest that incorporating structured positional encoding in SSL can be a powerful tool for improving vision transformer-based models in dense vision tasks.

Publication

Kang-Jun Liu, Masanori Suganuma, and Takayuki Okatani. Self-Supervised Learning of Intertwined Content and Positional Features for Object Detection. In Proceedings of the International Conference on Machine Learning (ICML), 2025.