Depth-Wise Representation Development Under Blockwise Self-Supervised Learning for Video Vision Transformers
Preprint | FZJ-2026-01885
2026
arXiv
Please use a persistent id in citations: doi:10.48550/arXiv.2601.09040, doi:10.34734/FZJ-2026-01885
Abstract: End-to-end backpropagation couples all layers through a global error signal, enabling coordinated learning but requiring long-range credit assignment. Motivated by recent progress in blockwise self-supervised learning (BWSSL), we ask whether masked video transformers can be trained without end-to-end backpropagation. Applying BWSSL to masked video modeling remains relatively underexplored and requires handling spatiotemporal context and long-range temporal structure. More broadly, analyses comparing BWSSL and end-to-end training in terms of learning dynamics and depth-wise representation development remain sparse. We apply blockwise learning to a masked-autoencoding video vision transformer by partitioning the encoder into blocks, each of which is optimized with a local masked reconstruction loss. Across model sizes and partition granularities, training converges and yields representations close to those of matched end-to-end baselines under linear-probe and retrieval proxies. To compare intermediate representations, we analyze depth-wise decodability, inter-block similarity, and patch-level diagnostics. Blockwise training exposes higher-level structure earlier, while later blocks saturate and operate in a more geometry-preserving regime. It can also induce token-level shifts, consistent with stronger early mixing, that pooled metrics can miss. These findings point to late-block saturation and interface formation as contributors to the remaining gap.
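The training scheme described in the abstract, in which the encoder is partitioned into blocks and each block is optimized only by a local masked reconstruction loss, can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch illustration, not the authors' implementation: the block depth, the linear reconstruction head, the pixel-regression target, and all sizes and hyperparameters are assumptions; the essential points are the per-block loss and the stop-gradient between blocks.

```python
# Minimal sketch (assumed setup, not the paper's code) of blockwise training for a
# masked-autoencoding transformer encoder: each block has its own lightweight
# reconstruction head, and gradients are stopped between blocks so every block is
# trained only by its local masked-reconstruction loss.
import torch
import torch.nn as nn

class LocalBlock(nn.Module):
    """A group of transformer layers followed by a local reconstruction head."""
    def __init__(self, dim, depth, num_heads, patch_out_dim):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True, norm_first=True)
        self.layers = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, patch_out_dim)  # local masked-reconstruction head

    def forward(self, x):
        h = self.layers(x)
        return h, self.head(h)

def blockwise_step(blocks, tokens, target_patches, mask, optimizers):
    """One training step: each block minimizes its own loss; no gradient crosses blocks."""
    x = tokens
    losses = []
    for block, opt in zip(blocks, optimizers):
        h, recon = block(x)
        # reconstruct only the masked patches, as in masked autoencoding
        loss = ((recon - target_patches) ** 2)[mask].mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        losses.append(loss.item())
        x = h.detach()  # stop-gradient: the next block receives features only
    return losses

if __name__ == "__main__":
    dim, patch_dim, n_tokens = 192, 768, 196
    blocks = nn.ModuleList(LocalBlock(dim, depth=3, num_heads=3,
                                      patch_out_dim=patch_dim) for _ in range(4))
    opts = [torch.optim.AdamW(b.parameters(), lr=1e-4) for b in blocks]
    tokens = torch.randn(2, n_tokens, dim)         # stand-in for embedded tokens
    targets = torch.randn(2, n_tokens, patch_dim)  # stand-in for pixel-patch targets
    mask = torch.rand(2, n_tokens) > 0.25          # which tokens count toward the loss
    print(blockwise_step(blocks, tokens, targets, mask, opts))
```

Because each block's output is detached before being passed on, the partition granularity (how many transformer layers per block) is the main knob the abstract refers to; end-to-end training is recovered in the limit of a single block.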
Keyword(s): Computer Vision and Pattern Recognition (cs.CV); FOS: Computer and information sciences