Mechanics of deep neural networks beyond the Gaussian limit
Book/Dissertation / PhD Thesis | FZJ-2025-02100
2025
Forschungszentrum Jülich GmbH Zentralbibliothek, Verlag
Jülich
ISBN: 978-3-95806-815-5
Please use a persistent id in citations: urn:nbn:de:0001-2504220916057.519614615515 doi:10.34734/FZJ-2025-02100
Abstract: Current developments in artificial intelligence and neural network technology outpace our theoretical understanding of these networks. In the limit of infinite width, networks at initialization are well described by the neural network Gaussian process (NNGP): the distribution of outputs is a zero-mean Gaussian characterized by its covariance, or kernel, across data samples. In the lazy-learning regime, where network parameters change only slightly from their initial values, the neural tangent kernel characterizes networks trained with gradient descent. Despite the success of these Gaussian limits for deep neural networks, they do not capture important properties such as network trainability or feature learning. In this work, we go beyond the Gaussian limits of deep neural networks by obtaining higher-order corrections from field-theoretic descriptions of neural networks. From a statistical point of view, two complementary averages have to be considered: the distribution over data samples and the distribution over network parameters. We investigate both cases, gaining insights into the working mechanisms of deep neural networks.

In the former case, we study how data statistics are transformed across network layers to solve classification tasks. We find that, while the hidden layers are well described by a non-linear mapping of the Gaussian statistics, the input layer extracts information from higher-order cumulants of the data. The developed theoretical framework allows us to investigate the relevance of different cumulant orders for classification: on MNIST, Gaussian statistics account for most of the classification performance, and higher-order cumulants are required only to fine-tune the networks for the last few percent. In contrast, more complex data sets such as CIFAR-10 require the inclusion of higher-order cumulants for reasonable performance, offering an explanation for why fully-connected networks underperform compared to convolutional networks.

In the latter case, we investigate two different aspects. First, we derive the network kernels for the Bayesian network posterior of fully-connected networks and observe a non-linear adaptation of the kernels to the target, which is not present in the NNGP. These feature corrections result from fluctuation corrections to the NNGP in finite-size networks, which allow the networks to adapt to the data. While fluctuations become larger near criticality, we uncover a trade-off between criticality and feature-learning scales as a driving mechanism for feature learning. Second, we study the trainability of residual networks by deriving the network prior at initialization. From this, we obtain the response function as a leading-order correction to the NNGP, which describes signal propagation in the networks. We find that scaling the residual branch by a hyperparameter improves signal propagation, since it avoids saturation of the non-linearity and thus information loss. Finally, we observe a strong dependence of the optimal scaling of the residual branch on the network depth but only a weak dependence on other network hyperparameters, explaining the universal success of depth-dependent scaling of the residual branch.

Overall, we derive statistical field theories for deep neural networks that allow us to obtain systematic corrections to the Gaussian limits. In this way, we take a step towards a better mechanistic understanding of information processing and data representations in neural networks.
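To make the Gaussian-limit baseline concrete, the following is a minimal sketch (not the thesis code) of the standard NNGP kernel recursion for a fully-connected network with an erf nonlinearity, for which the layer-wise Gaussian expectation has a closed form. A residual variant with an illustrative scaling hyperparameter `alpha` (a name assumed here, not taken from the thesis) shows, under these assumptions, why a depth-dependent scaling of the residual branch keeps the kernel, and hence signal propagation, well behaved.

```python
# Minimal sketch: NNGP kernel recursion for a fully-connected network with
# erf nonlinearity (closed-form "arcsine kernel"), plus a residual variant
# whose shortcut branch is scaled by an illustrative hyperparameter `alpha`.
import numpy as np

def layer_update(K, sigma_w=1.0, sigma_b=0.1):
    """One NNGP layer: sigma_b**2 + sigma_w**2 * E[erf(u) erf(v)] under N(0, K)."""
    diag = np.diag(K)
    norm = np.sqrt((1.0 + 2.0 * diag)[:, None] * (1.0 + 2.0 * diag)[None, :])
    return sigma_b**2 + sigma_w**2 * (2.0 / np.pi) * np.arcsin(2.0 * K / norm)

def nngp_kernel(X, depth=10, residual=False, alpha=1.0, sigma_w=1.0, sigma_b=0.1):
    """Iterate the NNGP covariance across `depth` layers for inputs X of shape (n, d)."""
    n, d = X.shape
    K = sigma_b**2 + sigma_w**2 * (X @ X.T) / d           # input-layer kernel K^0
    for _ in range(depth):
        update = layer_update(K, sigma_w, sigma_b)        # non-linear branch
        K = (K + alpha**2 * update) if residual else update  # skip connection if residual
    return K

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 16))
depth = 64
# Unscaled residual branch: the kernel diagonal grows roughly linearly with depth,
# pushing pre-activations into the saturated regime of the non-linearity.
print(np.diag(nngp_kernel(X, depth, residual=True, alpha=1.0)))
# A depth-dependent scaling alpha ~ 1/sqrt(depth) keeps the kernel of order one.
print(np.diag(nngp_kernel(X, depth, residual=True, alpha=depth**-0.5)))
```

This recursion is only the leading-order Gaussian description; the finite-size fluctuation and higher-order-cumulant corrections derived in the thesis build on top of such a baseline.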