The composition of objects and their parts, together with object-object positional relationships, provides a rich source of information for representation learning. Hence, spatial-aware pretext tasks have been actively explored in self-supervised learning. Existing works commonly start from a grid structure, where the goal of the pretext task is to predict the absolute position index of patches within a fixed grid. However, grid-based approaches fall short of capturing the fluid and continuous nature of real-world object compositions. We introduce PART, a self-supervised learning approach that leverages continuous relative transformations between off-grid patches to overcome these limitations. By modeling how parts relate to each other in a continuous space, PART learns the relative composition of images: an off-grid structural relative positioning that is less tied to absolute appearance and can remain coherent under variations such as partial visibility or stylistic changes. In tasks requiring precise spatial understanding, such as object detection and time series prediction, PART outperforms grid-based methods like MAE and DropPos, while maintaining competitive performance on global classification tasks. By breaking free from grid constraints, PART opens up a new direction for general self-supervised pretraining across diverse data types, from images to EEG signals, with potential applications in medical imaging, video, and audio.
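To make the notion of "continuous relative transformations between off-grid patches" concrete, the sketch below samples patch boxes at continuous (non-grid-aligned) coordinates and computes a relative transformation between a pair of them. This is a minimal illustration under our own assumptions, not PART's actual implementation: the helper names (`sample_patch`, `relative_transform`) and the specific parameterization (normalized translation plus log scale ratios) are hypothetical choices for exposition.

```python
import math
import random

def sample_patch(img_w, img_h, min_size=16.0, max_size=64.0):
    """Sample an off-grid patch as an (x, y, w, h) box with continuous
    coordinates, unconstrained by any fixed grid (hypothetical helper)."""
    w = random.uniform(min_size, max_size)
    h = random.uniform(min_size, max_size)
    x = random.uniform(0.0, img_w - w)
    y = random.uniform(0.0, img_h - h)
    return (x, y, w, h)

def relative_transform(a, b):
    """Continuous relative transformation from patch a to patch b:
    translation normalized by a's size, plus log scale ratios.
    Such a quantity could serve as a regression target instead of
    a discrete grid-position index."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return (
        (bx - ax) / aw,     # horizontal offset in units of a's width
        (by - ay) / ah,     # vertical offset in units of a's height
        math.log(bw / aw),  # log width ratio
        math.log(bh / ah),  # log height ratio
    )

# Example: two continuously placed patches and their relative transform.
p1 = sample_patch(224, 224)
p2 = sample_patch(224, 224)
t = relative_transform(p1, p2)
```

Unlike an absolute grid index, this target varies smoothly with patch placement and depends only on how the two patches relate to each other, which is the intuition the abstract appeals to.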
†University of Amsterdam
