001     1037903
005     20250203103256.0
024 7 _ |a 10.34734/FZJ-2025-01041
|2 datacite_doi
037 _ _ |a FZJ-2025-01041
100 1 _ |a Bencheikh, Wadjih
|0 P:(DE-Juel1)203192
|b 0
245 _ _ |a Optimal Gradient Checkpointing for Sparse and Recurrent Architectures using Off-Chip Memory
260 _ _ |c 2024
336 7 _ |a Preprint
|b preprint
|m preprint
|0 PUB:(DE-HGF)25
|s 1738234604_29870
|2 PUB:(DE-HGF)
336 7 _ |a WORKING_PAPER
|2 ORCID
336 7 _ |a Electronic Article
|0 28
|2 EndNote
336 7 _ |a preprint
|2 DRIVER
336 7 _ |a ARTICLE
|2 BibTeX
336 7 _ |a Output Types/Working Paper
|2 DataCite
520 _ _ |a Recurrent neural networks (RNNs) are valued for their computational efficiency and reduced memory requirements on tasks involving long sequence lengths, but they require high memory-processor bandwidth to train. Checkpointing techniques can reduce the memory requirements by storing only a subset of intermediate states, the checkpoints, but are still rarely used due to the computational overhead of the additional recomputation phase. This work addresses these challenges by introducing memory-efficient gradient checkpointing strategies tailored for the general class of sparse RNNs and Spiking Neural Networks (SNNs). SNNs are energy-efficient alternatives to RNNs thanks to their local, event-driven operation and potential neuromorphic implementation. We use the Intelligence Processing Unit (IPU) as an exemplary platform for architectures with distributed local memory, and we exploit its suitability for sparse and irregular workloads to scale SNN training to long sequence lengths. We find that Double Checkpointing emerges as the most effective method, optimizing the use of local memory resources while minimizing recomputation overhead. This approach reduces dependency on slower large-scale memory access, enabling training on sequences over 10 times longer, or networks 4 times larger, than previously feasible, with only marginal time overhead. The presented techniques demonstrate significant potential to enhance scalability and efficiency in training sparse and recurrent networks across diverse hardware platforms, and highlight the benefits of sparse activations for scalable recurrent neural network training.
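520 8 _ |a [Illustrative note, not part of the catalogued work:] The checkpointing idea summarized in the abstract, keeping only a sparse set of hidden states and recomputing the intermediate activations during the backward pass, can be sketched in a few lines. The snippet below is a minimal sketch in PyTorch and is not the authors' IPU or Double Checkpointing implementation; the GRU cell, segment length, and tensor shapes are placeholder assumptions.

# Minimal sketch of segment-wise gradient checkpointing for a recurrent model.
# Only the hidden state at each segment boundary is stored; activations inside
# a segment are recomputed on the backward pass. Not the paper's implementation.
import torch
from torch.utils.checkpoint import checkpoint

cell = torch.nn.GRUCell(input_size=32, hidden_size=64)  # placeholder model

def run_segment(h, segment):
    # Runs one chunk of timesteps; executed again during recomputation.
    for x_t in segment:
        h = cell(x_t, h)
    return h

def checkpointed_rnn(inputs, h0, segment_len=100):
    h = h0
    for start in range(0, inputs.shape[0], segment_len):
        segment = inputs[start:start + segment_len]
        # Keep only h at the segment boundary; recompute the rest when needed.
        h = checkpoint(run_segment, h, segment, use_reentrant=False)
    return h

x = torch.randn(1000, 8, 32)        # (sequence, batch, features), assumed sizes
h0 = torch.zeros(8, 64)
loss = checkpointed_rnn(x, h0).sum()
loss.backward()                      # gradients flow through recomputed segments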
536 _ _ |a 5234 - Emerging NC Architectures (POF4-523)
|0 G:(DE-HGF)POF4-5234
|c POF4-523
|f POF IV
|x 0
700 1 _ |a Finkbeiner, Jan
|0 P:(DE-Juel1)190112
|b 1
|u fzj
700 1 _ |a Neftci, Emre
|0 P:(DE-Juel1)188273
|b 2
|u fzj
856 4 _ |u https://doi.org/10.48550/arXiv.2412.11810
856 4 _ |u https://juser.fz-juelich.de/record/1037903/files/arxiv_Optimal%20Gradient%20Checkpointing%20for%20Sparse%20and%20Recurrent%20Architectures%20using%20Off-Chip%20Memory.pdf
|y OpenAccess
909 C O |o oai:juser.fz-juelich.de:1037903
|p openaire
|p open_access
|p VDB
|p driver
|p dnbdelivery
910 1 _ |a Forschungszentrum Jülich
|0 I:(DE-588b)5008462-8
|k FZJ
|b 1
|6 P:(DE-Juel1)190112
910 1 _ |a Forschungszentrum Jülich
|0 I:(DE-588b)5008462-8
|k FZJ
|b 2
|6 P:(DE-Juel1)188273
913 1 _ |a DE-HGF
|b Key Technologies
|l Natural, Artificial and Cognitive Information Processing
|1 G:(DE-HGF)POF4-520
|0 G:(DE-HGF)POF4-523
|3 G:(DE-HGF)POF4
|2 G:(DE-HGF)POF4-500
|4 G:(DE-HGF)POF
|v Neuromorphic Computing and Network Dynamics
|9 G:(DE-HGF)POF4-5234
|x 0
914 1 _ |y 2024
915 _ _ |a OpenAccess
|0 StatID:(DE-HGF)0510
|2 StatID
920 _ _ |l yes
920 1 _ |0 I:(DE-Juel1)PGI-15-20210701
|k PGI-15
|l Neuromorphic Software Eco System
|x 0
980 1 _ |a FullTexts
980 _ _ |a preprint
980 _ _ |a VDB
980 _ _ |a UNRESTRICTED
980 _ _ |a I:(DE-Juel1)PGI-15-20210701

