001     1053128
005     20260130140312.0
024 7 _ |a arXiv:2510.03871
|2 arXiv
037 _ _ |a FZJ-2026-01461
088 _ _ |a arXiv:2510.03871
|2 arXiv
100 1 _ |a Filatov, Oleg
|0 P:(DE-Juel1)198706
|b 0
|u fzj
245 _ _ |a Optimal Scaling Needs Optimal Norm
260 _ _ |c 2025
|b arXiv
336 7 _ |a Preprint
|b preprint
|m preprint
|0 PUB:(DE-HGF)25
|s 1769766699_29796
|2 PUB:(DE-HGF)
336 7 _ |a WORKING_PAPER
|2 ORCID
336 7 _ |a Electronic Article
|0 28
|2 EndNote
336 7 _ |a preprint
|2 DRIVER
336 7 _ |a ARTICLE
|2 BibTeX
336 7 _ |a Output Types/Working Paper
|2 DataCite
520 _ _ |a Despite recent progress in optimal hyperparameter transfer under model and dataset scaling, no unifying explanatory principle has been established. For the Adam and Scion optimizers, we discover that joint optimal scaling across model and dataset sizes is conditioned on a single invariant: the operator norm of the output layer. Across models with up to 1.3B parameters trained on up to 138B tokens, the optimal learning rate/batch size pair $(\eta^{\ast}, B^{\ast})$ consistently has the same operator norm value, a phenomenon we term norm transfer. This constant norm condition is necessary but not sufficient: while for each dataset size multiple $(\eta, B)$ pairs reach the optimal norm, only a unique $(\eta^{\ast}, B^{\ast})$ achieves the best loss. As a sufficient condition, we provide the first measurement of $(\eta^{\ast}, B^{\ast})$ scaling with dataset size for Scion, and find that the scaling rules are consistent with those of Adam. Tuning per-layer-group learning rates also improves model performance, with the output layer being the most sensitive and hidden layers benefiting from lower learning rates. We provide practical insights on norm-guided optimal scaling and release our Distributed Scion (Disco) implementation with logs from over two thousand runs to support research on LLM training dynamics at scale.
536 _ _ |a 5112 - Cross-Domain Algorithms, Tools, Methods Labs (ATMLs) and Research Groups (POF4-511)
|0 G:(DE-HGF)POF4-5112
|c POF4-511
|f POF IV
|x 0
536 _ _ |a Helmholtz AI Consultant Team FB Information (E54.303.11)
|0 G:(DE-Juel-1)E54.303.11
|c E54.303.11
|x 1
536 _ _ |a TrustLLM - Democratize Trustworthy and Efficient Large Language Model Technology for Europe (101135671)
|0 G:(EU-Grant)101135671
|c 101135671
|f HORIZON-CL4-2023-HUMAN-01-CNECT
|x 2
588 _ _ |a Dataset connected to DataCite
650 _ 7 |a Machine Learning (cs.LG)
|2 Other
650 _ 7 |a Artificial Intelligence (cs.AI)
|2 Other
650 _ 7 |a Machine Learning (stat.ML)
|2 Other
650 _ 7 |a FOS: Computer and information sciences
|2 Other
700 1 _ |a Wang, Jiangtao
|0 P:(DE-Juel1)200557
|b 1
|u fzj
700 1 _ |a Ebert, Jan
|0 P:(DE-Juel1)187002
|b 2
|u fzj
700 1 _ |a Kesselheim, Stefan
|0 P:(DE-Juel1)185654
|b 3
|u fzj
856 4 _ |u https://juser.fz-juelich.de/record/1053128/files/2510.03871v2.pdf
|y Restricted
909 C O |o oai:juser.fz-juelich.de:1053128
|p extern4vita
910 1 _ |a Forschungszentrum Jülich
|0 I:(DE-588b)5008462-8
|k FZJ
|b 0
|6 P:(DE-Juel1)198706
910 1 _ |a Forschungszentrum Jülich
|0 I:(DE-588b)5008462-8
|k FZJ
|b 1
|6 P:(DE-Juel1)200557
910 1 _ |a Forschungszentrum Jülich
|0 I:(DE-588b)5008462-8
|k FZJ
|b 2
|6 P:(DE-Juel1)187002
910 1 _ |a Forschungszentrum Jülich
|0 I:(DE-588b)5008462-8
|k FZJ
|b 3
|6 P:(DE-Juel1)185654
913 1 _ |a DE-HGF
|b Key Technologies
|l Engineering Digital Futures – Supercomputing, Data Management and Information Security for Knowledge and Action
|1 G:(DE-HGF)POF4-510
|0 G:(DE-HGF)POF4-511
|3 G:(DE-HGF)POF4
|2 G:(DE-HGF)POF4-500
|4 G:(DE-HGF)POF
|v Enabling Computational- & Data-Intensive Science and Engineering
|9 G:(DE-HGF)POF4-5112
|x 0
980 1 _ |a EXTERN4VITA
980 _ _ |a preprint
980 _ _ |a EDITORS
980 _ _ |a I:(DE-Juel1)JSC-20090406