001053128 001__ 1053128
001053128 005__ 20260130140312.0
001053128 0247_ $$2arXiv$$aarXiv:2510.03871
001053128 037__ $$aFZJ-2026-01461
001053128 088__ $$2arXiv$$aarXiv:2510.03871
001053128 1001_ $$0P:(DE-Juel1)198706$$aFilatov, Oleg$$b0$$ufzj
001053128 245__ $$aOptimal Scaling Needs Optimal Norm
001053128 260__ $$barXiv$$c2025
001053128 3367_ $$0PUB:(DE-HGF)25$$2PUB:(DE-HGF)$$aPreprint$$bpreprint$$mpreprint$$s1769766699_29796
001053128 3367_ $$2ORCID$$aWORKING_PAPER
001053128 3367_ $$028$$2EndNote$$aElectronic Article
001053128 3367_ $$2DRIVER$$apreprint
001053128 3367_ $$2BibTeX$$aARTICLE
001053128 3367_ $$2DataCite$$aOutput Types/Working Paper
001053128 520__ $$aDespite recent progress in optimal hyperparameter transfer under model and dataset scaling, no unifying explanatory principle has been established. For Adam and Scion optimizers, we discover that joint optimal scaling across model and dataset sizes is conditioned on a single invariant: the operator norm of the output layer. Across models with up to 1.3B parameters trained on up to 138B tokens, the optimal learning rate/batch size pair $(\eta^{\ast}, B^{\ast})$ consistently has the same operator norm value - a phenomenon we term norm transfer. This constant norm condition is necessary but not sufficient: while for each dataset size, multiple $(\eta, B)$ reach the optimal norm, only a unique $(\eta^{\ast}, B^{\ast})$ achieves the best loss. As a sufficient condition, we provide the first measurement of $(\eta^{\ast}, B^{\ast})$ scaling with dataset size for Scion, and find that the scaling rules are consistent with those of Adam. Tuning per-layer-group learning rates also improves model performance, with the output layer being the most sensitive and hidden layers benefiting from lower learning rates. We provide practical insights on norm-guided optimal scaling and release our Distributed Scion (Disco) implementation with logs from over two thousand runs to support research on LLM training dynamics at scale.
001053128 536__ $$0G:(DE-HGF)POF4-5112$$a5112 - Cross-Domain Algorithms, Tools, Methods Labs (ATMLs) and Research Groups (POF4-511)$$cPOF4-511$$fPOF IV$$x0
001053128 536__ $$0G:(DE-Juel-1)E54.303.11$$aHelmholtz AI Consultant Team FB Information (E54.303.11)$$cE54.303.11$$x1
001053128 536__ $$0G:(EU-Grant)101135671$$aTrustLLM - Democratize Trustworthy and Efficient Large Language Model Technology for Europe (101135671)$$c101135671$$fHORIZON-CL4-2023-HUMAN-01-CNECT$$x2
001053128 588__ $$aDataset connected to DataCite
001053128 650_7 $$2Other$$aMachine Learning (cs.LG)
001053128 650_7 $$2Other$$aArtificial Intelligence (cs.AI)
001053128 650_7 $$2Other$$aMachine Learning (stat.ML)
001053128 650_7 $$2Other$$aFOS: Computer and information sciences
001053128 7001_ $$0P:(DE-Juel1)200557$$aWang, Jiangtao$$b1$$ufzj
001053128 7001_ $$0P:(DE-Juel1)187002$$aEbert, Jan$$b2$$ufzj
001053128 7001_ $$0P:(DE-Juel1)185654$$aKesselheim, Stefan$$b3$$ufzj
001053128 8564_ $$uhttps://juser.fz-juelich.de/record/1053128/files/2510.03871v2.pdf$$yRestricted
001053128 909CO $$ooai:juser.fz-juelich.de:1053128$$pextern4vita
001053128 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)198706$$aForschungszentrum Jülich$$b0$$kFZJ
001053128 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)200557$$aForschungszentrum Jülich$$b1$$kFZJ
001053128 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)187002$$aForschungszentrum Jülich$$b2$$kFZJ
001053128 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)185654$$aForschungszentrum Jülich$$b3$$kFZJ
001053128 9131_ $$0G:(DE-HGF)POF4-511$$1G:(DE-HGF)POF4-510$$2G:(DE-HGF)POF4-500$$3G:(DE-HGF)POF4$$4G:(DE-HGF)POF$$9G:(DE-HGF)POF4-5112$$aDE-HGF$$bKey Technologies$$lEngineering Digital Futures – Supercomputing, Data Management and Information Security for Knowledge and Action$$vEnabling Computational- & Data-Intensive Science and Engineering$$x0
001053128 9801_ $$aEXTERN4VITA
001053128 980__ $$apreprint
001053128 980__ $$aEDITORS
001053128 980__ $$aI:(DE-Juel1)JSC-20090406