2026-01-29 21:31
[FZJ-2026-01461]
Preprint
Filatov, O.; Wang, J.; Ebert, J.; et al.
Optimal Scaling Needs Optimal Norm
[arXiv:2510.03871]
arXiv (2025)
Despite recent progress in optimal hyperparameter transfer under model and dataset scaling, no unifying explanatory principle has been established. For Adam and Scion optimizers, we discover that joint optimal scaling across model and dataset sizes is conditioned on a single invariant: the operator norm of the output layer. [...]
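The invariant described in the abstract can be measured directly. A minimal sketch, assuming the "operator norm" is the spectral norm (largest singular value) of the output layer's weight matrix; the paper's exact choice of induced norm, and the toy layer sizes below, are illustrative assumptions:

```python
# Sketch: measuring an operator norm of an output layer's weights.
# Assumption: spectral norm (largest singular value); the paper's
# exact norm may differ.
import numpy as np

def operator_norm(weight: np.ndarray) -> float:
    """Spectral norm, i.e. the l2 -> l2 induced operator norm."""
    return float(np.linalg.norm(weight, ord=2))

# Toy output layer: hidden width 256, output dimension 512
# (illustrative sizes, not taken from the paper).
rng = np.random.default_rng(0)
w_out = rng.normal(scale=1.0 / np.sqrt(256), size=(512, 256))
print(operator_norm(w_out))
```

Tracking this single scalar across model and dataset sizes is then enough to check whether two training runs sit on the same scaling condition, in the sense the abstract describes.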
Restricted: PDF;
2026-01-29 21:28
[FZJ-2026-01459]
Preprint
Khalfaoui, I.; Kesselheim, S.
Polynomial, trigonometric, and tropical activations
[arXiv:2502.01247]
arXiv (2025)
Which functions can be used as activations in deep neural networks? This article explores families of functions based on orthonormal bases, including the Hermite polynomial basis and the Fourier trigonometric basis, as well as a basis resulting from the tropicalization of a polynomial basis. Our study shows that, through simple variance-preserving initialization and without additional clamping mechanisms, these activations can successfully be used to train deep models, such as GPT-2 for next-token prediction on OpenWebText and ConvNeXt for image classification on ImageNet. [...]
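A rough illustration of the variance-preserving idea (my construction, not necessarily the paper's parameterization): take the probabilists' Hermite polynomials He_n, normalize by sqrt(n!) so they are orthonormal under a standard Gaussian, and choose coefficients whose squares sum to 1. The resulting activation then maps unit-variance Gaussian inputs to approximately unit-variance outputs, with no clamping needed:

```python
# Sketch of a Hermite-basis activation. Assumption: probabilists'
# Hermite polynomials He_n scaled by 1/sqrt(n!), which are orthonormal
# under x ~ N(0, 1); coefficients with sum of squares equal to 1 then
# preserve unit variance. The paper's exact construction may differ.
import math
import numpy as np

def hermite_activation(x: np.ndarray, coeffs) -> np.ndarray:
    """sum_n c_n * He_n(x) / sqrt(n!), with He_n built from the
    recurrence He_{n+1}(x) = x * He_n(x) - n * He_{n-1}(x)."""
    he_prev, he = np.ones_like(x), x.copy()  # He_0, He_1
    out = coeffs[0] * he_prev
    for n in range(1, len(coeffs)):
        out = out + coeffs[n] * he / math.sqrt(math.factorial(n))
        he_prev, he = he, x * he - n * he_prev
    return out

x = np.random.default_rng(0).standard_normal(100_000)
y = hermite_activation(x, [0.0, 0.6, 0.8])  # 0.6**2 + 0.8**2 = 1
print(y.var())  # close to 1 for unit-variance Gaussian input
```

Because the basis is orthonormal under the Gaussian input distribution, no extra normalization layer is needed to keep activations well-scaled at initialization, which matches the abstract's claim that clamping mechanisms can be dropped.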
Restricted: PDF;