Preprint FZJ-2026-01461

Optimal Scaling Needs Optimal Norm

2025
arXiv

Report No.: arXiv:2510.03871

Abstract: Despite recent progress in optimal hyperparameter transfer under model and dataset scaling, no unifying explanatory principle has been established. For Adam and Scion optimizers, we discover that joint optimal scaling across model and dataset sizes is conditioned on a single invariant: the operator norm of the output layer. Across models with up to 1.3B parameters trained on up to 138B tokens, the optimal learning rate/batch size pair $(\eta^{\ast}, B^{\ast})$ consistently has the same operator norm value, a phenomenon we term norm transfer. This constant norm condition is necessary but not sufficient: while for each dataset size, multiple $(\eta, B)$ reach the optimal norm, only a unique $(\eta^{\ast}, B^{\ast})$ achieves the best loss. As a sufficient condition, we provide the first measurement of $(\eta^{\ast}, B^{\ast})$ scaling with dataset size for Scion, and find that the scaling rules are consistent with those of Adam. Tuning per-layer-group learning rates also improves model performance, with the output layer being the most sensitive and hidden layers benefiting from lower learning rates. We provide practical insights on norm-guided optimal scaling and release our Distributed Scion (Disco) implementation with logs from over two thousand runs to support research on LLM training dynamics at scale.

Keyword(s): Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI) ; Machine Learning (stat.ML) ; FOS: Computer and information sciences


Research Program(s):
  1. 5112 - Cross-Domain Algorithms, Tools, Methods Labs (ATMLs) and Research Groups (POF4-511)
  2. Helmholtz AI Consultant Team FB Information (E54.303.11)
  3. TrustLLM - Democratize Trustworthy and Efficient Large Language Model Technology for Europe (101135671)

The record appears in these collections:
External Publications > Vita Publications
Workflow Collections > In Progress
Institute Collections > JSC

Record created 2026-01-29, last modified 2026-01-30


Restricted:
Download full text (PDF)
External link:
Download full text: Fulltext by arXiv.org