Optimal signal propagation in ResNets through residual scaling

Fischer, Kirsten; Dahmen, David; Helias, Moritz

doi:10.48550/ARXIV.2305.07715

Marc 21

001			1010660
005			20240313103122.0
024	7	_	\|a 10.48550/ARXIV.2305.07715 \|2 doi
024	7	_	\|a 10.34734/FZJ-2023-03175 \|2 datacite_doi
037	_	_	\|a FZJ-2023-03175
100	1	_	\|a Fischer, Kirsten \|0 P:(DE-Juel1)180150 \|b 0 \|e Corresponding author
245	_	_	\|a Optimal signal propagation in ResNets through residual scaling
260	_	_	\|c 2023 \|b arXiv
336	7	_	\|a Preprint \|b preprint \|m preprint \|0 PUB:(DE-HGF)25 \|s 1695103673_21286 \|2 PUB:(DE-HGF)
336	7	_	\|a WORKING_PAPER \|2 ORCID
336	7	_	\|a Electronic Article \|0 28 \|2 EndNote
336	7	_	\|a preprint \|2 DRIVER
336	7	_	\|a ARTICLE \|2 BibTeX
336	7	_	\|a Output Types/Working Paper \|2 DataCite
520	_	_	\|a Residual networks (ResNets) have significantly better trainability and thus performance than feed-forward networks at large depth. Introducing skip connections facilitates signal propagation to deeper layers. In addition, previous works found that adding a scaling parameter for the residual branch further improves generalization performance. While they empirically identified a particularly beneficial range of values for this scaling parameter, the associated performance improvement and its universality across network hyperparameters have yet to be understood. For feed-forward networks (FFNets), finite-size theories have led to important insights with regard to signal propagation and hyperparameter tuning. Here, we derive a systematic finite-size theory for ResNets to study signal propagation and its dependence on the scaling for the residual branch. We derive analytical expressions for the response function, a measure for the network's sensitivity to inputs, and show that for deep networks the empirically found values for the scaling parameter lie within the range of maximal sensitivity. Furthermore, we obtain an analytical expression for the optimal scaling parameter that depends only weakly on other network hyperparameters, such as the weight variance, thereby explaining its universality across hyperparameters. Overall, this work provides a framework for theory-guided optimal scaling in ResNets and, more generally, provides the theoretical framework to study ResNets at finite widths.
536	_	_	\|a 5232 - Computational Principles (POF4-523) \|0 G:(DE-HGF)POF4-5232 \|c POF4-523 \|f POF IV \|x 0
536	_	_	\|a 5234 - Emerging NC Architectures (POF4-523) \|0 G:(DE-HGF)POF4-5234 \|c POF4-523 \|f POF IV \|x 1
536	_	_	\|a RenormalizedFlows - Transparent Deep Learning with Renormalized Flows (BMBF-01IS19077A) \|0 G:(DE-Juel-1)BMBF-01IS19077A \|c BMBF-01IS19077A \|x 2
536	_	_	\|a MSNN - Theory of multi-scale neuronal networks (HGF-SMHB-2014-2018) \|0 G:(DE-Juel1)HGF-SMHB-2014-2018 \|c HGF-SMHB-2014-2018 \|f MSNN \|x 3
536	_	_	\|a ACA - Advanced Computing Architectures (SO-092) \|0 G:(DE-HGF)SO-092 \|c SO-092 \|x 4
536	_	_	\|a neuroIC002 - Recurrence and stochasticity for neuro-inspired computation (EXS-SF-neuroIC002) \|0 G:(DE-82)EXS-SF-neuroIC002 \|c EXS-SF-neuroIC002 \|x 5
536	_	_	\|a GRK 2416 - GRK 2416: MultiSenses-MultiScales: Neue Ansätze zur Aufklärung neuronaler multisensorischer Integration (368482240) \|0 G:(GEPRIS)368482240 \|c 368482240 \|x 6
588	_	_	\|a Dataset connected to DataCite
650	_	7	\|a Disordered Systems and Neural Networks (cond-mat.dis-nn) \|2 Other
650	_	7	\|a Machine Learning (cs.LG) \|2 Other
650	_	7	\|a Machine Learning (stat.ML) \|2 Other
650	_	7	\|a FOS: Physical sciences \|2 Other
650	_	7	\|a FOS: Computer and information sciences \|2 Other
700	1	_	\|a Dahmen, David \|0 P:(DE-Juel1)156459 \|b 1 \|u fzj
700	1	_	\|a Helias, Moritz \|0 P:(DE-Juel1)144806 \|b 2 \|u fzj
773	_	_	\|a 10.48550/ARXIV.2305.07715
856	4	_	\|u https://arxiv.org/abs/2305.07715
856	4	_	\|u https://juser.fz-juelich.de/record/1010660/files/2305.07715.pdf \|y OpenAccess
909	C	O	\|o oai:juser.fz-juelich.de:1010660 \|p openaire \|p open_access \|p VDB \|p driver \|p dnbdelivery
910	1	_	\|a Forschungszentrum Jülich \|0 I:(DE-588b)5008462-8 \|k FZJ \|b 0 \|6 P:(DE-Juel1)180150
910	1	_	\|a Forschungszentrum Jülich \|0 I:(DE-588b)5008462-8 \|k FZJ \|b 1 \|6 P:(DE-Juel1)156459
910	1	_	\|a Forschungszentrum Jülich \|0 I:(DE-588b)5008462-8 \|k FZJ \|b 2 \|6 P:(DE-Juel1)144806
913	1	_	\|a DE-HGF \|b Key Technologies \|l Natural, Artificial and Cognitive Information Processing \|1 G:(DE-HGF)POF4-520 \|0 G:(DE-HGF)POF4-523 \|3 G:(DE-HGF)POF4 \|2 G:(DE-HGF)POF4-500 \|4 G:(DE-HGF)POF \|v Neuromorphic Computing and Network Dynamics \|9 G:(DE-HGF)POF4-5232 \|x 0
913	1	_	\|a DE-HGF \|b Key Technologies \|l Natural, Artificial and Cognitive Information Processing \|1 G:(DE-HGF)POF4-520 \|0 G:(DE-HGF)POF4-523 \|3 G:(DE-HGF)POF4 \|2 G:(DE-HGF)POF4-500 \|4 G:(DE-HGF)POF \|v Neuromorphic Computing and Network Dynamics \|9 G:(DE-HGF)POF4-5234 \|x 1
914	1	_	\|y 2023
915	_	_	\|a OpenAccess \|0 StatID:(DE-HGF)0510 \|2 StatID
920	_	_	\|l yes
920	1	_	\|0 I:(DE-Juel1)INM-6-20090406 \|k INM-6 \|l Computational and Systems Neuroscience \|x 0
920	1	_	\|0 I:(DE-Juel1)IAS-6-20130828 \|k IAS-6 \|l Theoretical Neuroscience \|x 1
920	1	_	\|0 I:(DE-Juel1)INM-10-20170113 \|k INM-10 \|l Jara-Institut Brain structure-function relationships \|x 2
980	1	_	\|a FullTexts
980	_	_	\|a preprint
980	_	_	\|a VDB
980	_	_	\|a UNRESTRICTED
980	_	_	\|a I:(DE-Juel1)INM-6-20090406
980	_	_	\|a I:(DE-Juel1)IAS-6-20130828
980	_	_	\|a I:(DE-Juel1)INM-10-20170113
981	_	_	\|a I:(DE-Juel1)IAS-6-20130828
