Preprint FZJ-2026-01885

Depth-Wise Representation Development Under Blockwise Self-Supervised Learning for Video Vision Transformers


2026
arXiv

arXiv [10.48550/arXiv.2601.09040]


Please use a persistent id in citations: doi:10.48550/arXiv.2601.09040

Abstract: End-to-end backpropagation couples all layers through a global error signal, enabling coordinated learning but requiring long-range credit assignment. Motivated by recent progress in blockwise self-supervised learning (BWSSL), we ask whether masked video transformers can be trained without end-to-end backpropagation. Applying BWSSL to masked video modeling remains relatively underexplored and must handle spatiotemporal context and long-range temporal structure. More broadly, analyses that compare BWSSL and end-to-end training in terms of learning dynamics and depth-wise representation development remain sparse. We apply blockwise learning to a masked autoencoding video vision transformer by partitioning the encoder into blocks, each of which is optimized with a local masked reconstruction loss. Across model sizes and partition granularities, training converges and yields representations close to matched end-to-end baselines under linear-probe and retrieval proxies. In order to compare intermediate representations, we analyze depth-wise decodability, inter-block similarity, and patch-level diagnostics. Blockwise training exposes higher-level structure earlier, while later blocks saturate and operate in a more geometry-preserving regime. It can also induce token-level shifts consistent with stronger early mixing that pooled metrics can miss. These findings point to late-block saturation and interface formation as contributors to the remaining gap.
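As a rough illustration of the setup the abstract describes (an encoder partitioned into blocks, each trained with a local masked reconstruction loss), the following minimal sketch shows one way the partitioning and the loss could be expressed. It is not the authors' implementation: the function names `partition_layers` and `masked_mse`, the near-equal contiguous split, and the choice of mean squared error over masked patches only are all assumptions made for illustration.

```python
from typing import List, Sequence


def partition_layers(depth: int, num_blocks: int) -> List[range]:
    """Split layer indices 0..depth-1 into contiguous, near-equal blocks.

    Hypothetical helper: the record does not say how blocks are sized.
    """
    if not 1 <= num_blocks <= depth:
        raise ValueError("need 1 <= num_blocks <= depth")
    base, extra = divmod(depth, num_blocks)
    blocks, start = [], 0
    for b in range(num_blocks):
        # The first `extra` blocks take one additional layer.
        size = base + (1 if b < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


def masked_mse(
    pred: Sequence[Sequence[float]],
    target: Sequence[Sequence[float]],
    mask: Sequence[bool],
) -> float:
    """Mean squared error over masked patches only.

    mask[i] is True when patch i was hidden from the encoder.
    Assumption: the local loss is an MSE restricted to masked patches.
    """
    total, count = 0.0, 0
    for p, t, m in zip(pred, target, mask):
        if m:
            total += sum((pi - ti) ** 2 for pi, ti in zip(p, t))
            count += len(p)
    return total / count if count else 0.0


if __name__ == "__main__":
    for block in partition_layers(depth=12, num_blocks=4):
        print(list(block))
```

In an actual blockwise training loop, each block would presumably receive its own reconstruction head and loss of this kind, with gradients stopped at block boundaries so that no error signal crosses from one block into the previous one; that part is omitted here because it depends on the training framework.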

Keyword(s): Computer Vision and Pattern Recognition (cs.CV) ; FOS: Computer and information sciences


Contributing Institute(s):
  1. Strukturelle und funktionelle Organisation des Gehirns (INM-1)
Research Program(s):
  1. 5254 - Neuroscientific Data Analytics and AI (POF4-525)
  2. Helmholtz AI - Helmholtz Artificial Intelligence Coordination Unit – Local Unit FZJ (E.40401.62)
  3. HIBALL - Helmholtz International BigBrain Analytics and Learning Laboratory (InterLabs-0015)
  4. EBRAINS 2.0 - EBRAINS 2.0: A Research Infrastructure to Advance Neuroscience and Brain Health (101147319)

Appears in the scientific report 2026
Database coverage:
OpenAccess


 Record created 2026-03-03, last modified 2026-03-05

