001041036 001__ 1041036
001041036 005__ 20250423202217.0
001041036 0247_ $$2datacite_doi$$a10.34734/FZJ-2025-02100
001041036 0247_ $$2URN$$aurn:nbn:de:0001-2504220916057.519614615515
001041036 020__ $$a978-3-95806-815-5
001041036 037__ $$aFZJ-2025-02100
001041036 1001_ $$0P:(DE-Juel1)180150$$aFischer, Kirsten$$b0$$eCorresponding author
001041036 245__ $$aMechanics of deep neural networks beyond the Gaussian limit$$f2020-11-15 - 2025-03-07
001041036 260__ $$aJülich$$bForschungszentrum Jülich GmbH Zentralbibliothek, Verlag$$c2025
001041036 300__ $$a165
001041036 3367_ $$2DataCite$$aOutput Types/Dissertation
001041036 3367_ $$0PUB:(DE-HGF)3$$2PUB:(DE-HGF)$$aBook$$mbook
001041036 3367_ $$2ORCID$$aDISSERTATION
001041036 3367_ $$2BibTeX$$aPHDTHESIS
001041036 3367_ $$02$$2EndNote$$aThesis
001041036 3367_ $$0PUB:(DE-HGF)11$$2PUB:(DE-HGF)$$aDissertation / PhD Thesis$$bphd$$mphd$$s1744785157_15412
001041036 3367_ $$2DRIVER$$adoctoralThesis
001041036 4900_ $$aSchriften des Forschungszentrums Jülich Reihe Information / Information$$v110
001041036 502__ $$aDissertation, RWTH Aachen University, 2025$$bDissertation$$cRWTH Aachen University$$d2025$$o2025-03-07
001041036 520__ $$aCurrent developments in the field of artificial intelligence and neural network technology outpace our theoretical understanding of these networks. In the limit of infinite width, networks at initialization are well described by the neural network Gaussian process (NNGP): the distribution of outputs is a zero-mean Gaussian characterized by its covariance or kernel across data samples. In the lazy-learning regime, where network parameters change only slightly from their initial values, the neural tangent kernel characterizes networks trained with gradient descent. Despite the success of these Gaussian limits for deep neural networks, they do not capture important properties such as network trainability or feature learning. In this work, we go beyond the Gaussian limits of deep neural networks by obtaining higher-order corrections from field-theoretic descriptions of neural networks. From a statistical point of view, two complementary averages have to be considered: the distribution over data samples and the distribution over network parameters. We investigate both cases, gaining insights into the working mechanisms of deep neural networks. In the former case, we study how data statistics are transformed across network layers to solve classification tasks. We find that, while the hidden layers are well described by a non-linear mapping of the Gaussian statistics, the input layer extracts information from higher-order cumulants of the data. The developed theoretical framework allows us to investigate the relevance of different cumulant orders for classification: on MNIST, Gaussian statistics account for most of the classification performance, and higher-order cumulants are required to fine-tune the networks for the last few percent. In contrast, more complex data sets such as CIFAR-10 require the inclusion of higher-order cumulants for reasonable performance, explaining why fully-connected networks underperform compared to convolutional networks. In the latter case, we investigate two different aspects: First, we derive the network kernels for the Bayesian network posterior of fully-connected networks and observe a non-linear adaptation of the kernels to the target, which is not present in the NNGP. These feature corrections result from fluctuation corrections to the NNGP in finite-size networks, which allow the networks to adapt to the data. While fluctuations become larger near criticality, we uncover a trade-off between criticality and feature-learning scales in networks as a driving mechanism for feature learning. Second, we study the trainability of residual networks by deriving the network prior at initialization. From this, we obtain the response function as a leading-order correction to the NNGP, which describes signal propagation in the networks. We find that scaling the residual branch by a hyperparameter improves signal propagation, since it avoids saturation of the non-linearity and thus information loss. Finally, we observe a strong dependence of the optimal scaling of the residual branch on the network depth but only a weak dependence on other network hyperparameters, explaining the universal success of depth-dependent scaling of the residual branch. Overall, we derive statistical field theories for deep neural networks that allow us to obtain systematic corrections to the Gaussian limits. In this way, we take a step towards a better mechanistic understanding of information processing and data representations in neural networks.
001041036 536__ $$0G:(DE-HGF)POF4-5232$$a5232 - Computational Principles (POF4-523)$$cPOF4-523$$fPOF IV$$x0
001041036 536__ $$0G:(DE-HGF)POF4-5234$$a5234 - Emerging NC Architectures (POF4-523)$$cPOF4-523$$fPOF IV$$x1
001041036 536__ $$0G:(DE-Juel1)HGF-SMHB-2014-2018$$aMSNN - Theory of multi-scale neuronal networks (HGF-SMHB-2014-2018)$$cHGF-SMHB-2014-2018$$fMSNN$$x2
001041036 536__ $$0G:(DE-Juel-1)BMBF-01IS19077A$$aRenormalizedFlows - Transparent Deep Learning with Renormalized Flows (BMBF-01IS19077A)$$cBMBF-01IS19077A$$x3
001041036 536__ $$0G:(DE-HGF)SO-092$$aACA - Advanced Computing Architectures (SO-092)$$cSO-092$$x4
001041036 536__ $$0G:(DE-82)EXS-SF-neuroIC002$$aneuroIC002 - Recurrence and stochasticity for neuro-inspired computation (EXS-SF-neuroIC002)$$cEXS-SF-neuroIC002$$x5
001041036 8564_ $$uhttps://juser.fz-juelich.de/record/1041036/files/Information_110.pdf$$yOpenAccess
001041036 909CO $$ooai:juser.fz-juelich.de:1041036$$pVDB$$pdriver$$purn$$popen_access$$popenaire$$pdnbdelivery
001041036 915__ $$0StatID:(DE-HGF)0510$$2StatID$$aOpenAccess
001041036 915__ $$0LIC:(DE-HGF)CCBY4$$2HGFVOC$$aCreative Commons Attribution CC BY 4.0
001041036 9141_ $$y2025
001041036 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)180150$$aForschungszentrum Jülich$$b0$$kFZJ
001041036 9131_ $$0G:(DE-HGF)POF4-523$$1G:(DE-HGF)POF4-520$$2G:(DE-HGF)POF4-500$$3G:(DE-HGF)POF4$$4G:(DE-HGF)POF$$9G:(DE-HGF)POF4-5232$$aDE-HGF$$bKey Technologies$$lNatural, Artificial and Cognitive Information Processing$$vNeuromorphic Computing and Network Dynamics$$x0
001041036 9131_ $$0G:(DE-HGF)POF4-523$$1G:(DE-HGF)POF4-520$$2G:(DE-HGF)POF4-500$$3G:(DE-HGF)POF4$$4G:(DE-HGF)POF$$9G:(DE-HGF)POF4-5234$$aDE-HGF$$bKey Technologies$$lNatural, Artificial and Cognitive Information Processing$$vNeuromorphic Computing and Network Dynamics$$x1
001041036 920__ $$lyes
001041036 9201_ $$0I:(DE-Juel1)IAS-6-20130828$$kIAS-6$$lComputational and Systems Neuroscience$$x0
001041036 980__ $$aphd
001041036 980__ $$aVDB
001041036 980__ $$aUNRESTRICTED
001041036 980__ $$abook
001041036 980__ $$aI:(DE-Juel1)IAS-6-20130828
001041036 9801_ $$aFullTexts