001010660 001__ 1010660
001010660 005__ 20240313103122.0
001010660 0247_ $$2doi$$a10.48550/ARXIV.2305.07715
001010660 0247_ $$2datacite_doi$$a10.34734/FZJ-2023-03175
001010660 037__ $$aFZJ-2023-03175
001010660 1001_ $$0P:(DE-Juel1)180150$$aFischer, Kirsten$$b0$$eCorresponding author
001010660 245__ $$aOptimal signal propagation in ResNets through residual scaling
001010660 260__ $$barXiv$$c2023
001010660 3367_ $$0PUB:(DE-HGF)25$$2PUB:(DE-HGF)$$aPreprint$$bpreprint$$mpreprint$$s1695103673_21286
001010660 3367_ $$2ORCID$$aWORKING_PAPER
001010660 3367_ $$028$$2EndNote$$aElectronic Article
001010660 3367_ $$2DRIVER$$apreprint
001010660 3367_ $$2BibTeX$$aARTICLE
001010660 3367_ $$2DataCite$$aOutput Types/Working Paper
001010660 520__ $$aResidual networks (ResNets) have significantly better trainability, and thus performance, than feed-forward networks at large depth. Introducing skip connections facilitates signal propagation to deeper layers. In addition, previous works found that adding a scaling parameter for the residual branch further improves generalization performance. While they empirically identified a particularly beneficial range of values for this scaling parameter, the associated performance improvement and its universality across network hyperparameters have yet to be understood. For feed-forward networks (FFNets), finite-size theories have led to important insights with regard to signal propagation and hyperparameter tuning. Here we derive a systematic finite-size theory for ResNets to study signal propagation and its dependence on the scaling of the residual branch. We derive analytical expressions for the response function, a measure of the network's sensitivity to inputs, and show that for deep networks the empirically found values of the scaling parameter lie within the range of maximal sensitivity. Furthermore, we obtain an analytical expression for the optimal scaling parameter that depends only weakly on other network hyperparameters, such as the weight variance, thereby explaining its universality across hyperparameters. Overall, this work provides a framework for theory-guided optimal scaling in ResNets and, more generally, a theoretical framework for studying ResNets at finite width.
001010660 536__ $$0G:(DE-HGF)POF4-5232$$a5232 - Computational Principles (POF4-523)$$cPOF4-523$$fPOF IV$$x0
001010660 536__ $$0G:(DE-HGF)POF4-5234$$a5234 - Emerging NC Architectures (POF4-523)$$cPOF4-523$$fPOF IV$$x1
001010660 536__ $$0G:(DE-Juel-1)BMBF-01IS19077A$$aRenormalizedFlows - Transparent Deep Learning with Renormalized Flows (BMBF-01IS19077A)$$cBMBF-01IS19077A$$x2
001010660 536__ $$0G:(DE-Juel1)HGF-SMHB-2014-2018$$aMSNN - Theory of multi-scale neuronal networks (HGF-SMHB-2014-2018)$$cHGF-SMHB-2014-2018$$fMSNN$$x3
001010660 536__ $$0G:(DE-HGF)SO-092$$aACA - Advanced Computing Architectures (SO-092)$$cSO-092$$x4
001010660 536__ $$0G:(DE-82)EXS-SF-neuroIC002$$aneuroIC002 - Recurrence and stochasticity for neuro-inspired computation (EXS-SF-neuroIC002)$$cEXS-SF-neuroIC002$$x5
001010660 536__ $$0G:(GEPRIS)368482240$$aGRK 2416 - GRK 2416: MultiSenses-MultiScales: Novel approaches to elucidating neuronal multisensory integration (368482240)$$c368482240$$x6
001010660 588__ $$aDataset connected to DataCite
001010660 650_7 $$2Other$$aDisordered Systems and Neural Networks (cond-mat.dis-nn)
001010660 650_7 $$2Other$$aMachine Learning (cs.LG)
001010660 650_7 $$2Other$$aMachine Learning (stat.ML)
001010660 650_7 $$2Other$$aFOS: Physical sciences
001010660 650_7 $$2Other$$aFOS: Computer and information sciences
001010660 7001_ $$0P:(DE-Juel1)156459$$aDahmen, David$$b1$$ufzj
001010660 7001_ $$0P:(DE-Juel1)144806$$aHelias, Moritz$$b2$$ufzj
001010660 773__ $$a10.48550/ARXIV.2305.07715
001010660 8564_ $$uhttps://arxiv.org/abs/2305.07715
001010660 8564_ $$uhttps://juser.fz-juelich.de/record/1010660/files/2305.07715.pdf$$yOpenAccess
001010660 909CO $$ooai:juser.fz-juelich.de:1010660$$pdnbdelivery$$pdriver$$pVDB$$popen_access$$popenaire
001010660 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)180150$$aForschungszentrum Jülich$$b0$$kFZJ
001010660 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)156459$$aForschungszentrum Jülich$$b1$$kFZJ
001010660 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)144806$$aForschungszentrum Jülich$$b2$$kFZJ
001010660 9131_ $$0G:(DE-HGF)POF4-523$$1G:(DE-HGF)POF4-520$$2G:(DE-HGF)POF4-500$$3G:(DE-HGF)POF4$$4G:(DE-HGF)POF$$9G:(DE-HGF)POF4-5232$$aDE-HGF$$bKey Technologies$$lNatural, Artificial and Cognitive Information Processing$$vNeuromorphic Computing and Network Dynamics$$x0
001010660 9131_ $$0G:(DE-HGF)POF4-523$$1G:(DE-HGF)POF4-520$$2G:(DE-HGF)POF4-500$$3G:(DE-HGF)POF4$$4G:(DE-HGF)POF$$9G:(DE-HGF)POF4-5234$$aDE-HGF$$bKey Technologies$$lNatural, Artificial and Cognitive Information Processing$$vNeuromorphic Computing and Network Dynamics$$x1
001010660 9141_ $$y2023
001010660 915__ $$0StatID:(DE-HGF)0510$$2StatID$$aOpenAccess
001010660 920__ $$lyes
001010660 9201_ $$0I:(DE-Juel1)INM-6-20090406$$kINM-6$$lComputational and Systems Neuroscience$$x0
001010660 9201_ $$0I:(DE-Juel1)IAS-6-20130828$$kIAS-6$$lTheoretical Neuroscience$$x1
001010660 9201_ $$0I:(DE-Juel1)INM-10-20170113$$kINM-10$$lJARA-Institute Brain structure-function relationships$$x2
001010660 9801_ $$aFullTexts
001010660 980__ $$apreprint
001010660 980__ $$aVDB
001010660 980__ $$aUNRESTRICTED
001010660 980__ $$aI:(DE-Juel1)INM-6-20090406
001010660 980__ $$aI:(DE-Juel1)IAS-6-20130828
001010660 980__ $$aI:(DE-Juel1)INM-10-20170113
001010660 981__ $$aI:(DE-Juel1)IAS-6-20130828