001 | 1038064 | ||
005 | 20250203103308.0 | ||
024 | 7 | _ | |a 10.48550/arXiv.2409.19315 |2 doi |
024 | 7 | _ | |a 10.34734/FZJ-2025-01113 |2 datacite_doi |
037 | _ | _ | |a FZJ-2025-01113 |
100 | 1 | _ | |a Leroux, Nathan |0 P:(DE-Juel1)194421 |b 0 |e Corresponding author |u fzj |
245 | _ | _ | |a Analog In-Memory Computing Attention Mechanism for Fast and Energy-Efficient Large Language Models |
260 | _ | _ | |c 2024 |b arXiv |
336 | 7 | _ | |a Preprint |b preprint |m preprint |0 PUB:(DE-HGF)25 |s 1738249459_31339 |2 PUB:(DE-HGF) |
336 | 7 | _ | |a WORKING_PAPER |2 ORCID |
336 | 7 | _ | |a Electronic Article |0 28 |2 EndNote |
336 | 7 | _ | |a preprint |2 DRIVER |
336 | 7 | _ | |a ARTICLE |2 BibTeX |
336 | 7 | _ | |a Output Types/Working Paper |2 DataCite |
520 | _ | _ | |a Transformer networks, driven by self-attention, are central to Large Language Models. In generative Transformers, self-attention uses cache memory to store token projections, avoiding recomputation at each time step. However, GPU-stored projections must be loaded into SRAM for each new generation step, causing latency and energy bottlenecks. We present a custom self-attention in-memory computing architecture based on emerging charge-based memories called gain cells, which can be efficiently written to store new tokens during sequence generation and enable the parallel analog dot-product computation required for self-attention. The analog gain-cell circuits, however, introduce non-idealities and constraints that prevent the direct mapping of pre-trained models. To circumvent this problem, we design an initialization algorithm that achieves text-processing performance comparable to GPT-2 without training from scratch. Our architecture reduces attention latency and energy consumption by up to two and five orders of magnitude, respectively, compared to GPUs, marking a significant step toward ultra-fast, low-power generative Transformers. |
536 | _ | _ | |a 5234 - Emerging NC Architectures (POF4-523) |0 G:(DE-HGF)POF4-5234 |c POF4-523 |f POF IV |x 0 |
536 | _ | _ | |a BMBF 16ME0400 - Verbundprojekt: Neuro-inspirierte Technologien der künstlichen Intelligenz für die Elektronik der Zukunft - NEUROTEC II - (16ME0400) |0 G:(BMBF)16ME0400 |c 16ME0400 |x 1 |
536 | _ | _ | |a BMBF 03ZU1106CA - NeuroSys: Algorithm-Hardware Co-Design (Projekt C) - A (03ZU1106CA) |0 G:(BMBF)03ZU1106CA |c 03ZU1106CA |x 2 |
536 | _ | _ | |a BMBF 03ZU1106CB - NeuroSys: Algorithm-Hardware Co-Design (Projekt C) - B (BMBF-03ZU1106CB) |0 G:(DE-Juel1)BMBF-03ZU1106CB |c BMBF-03ZU1106CB |x 3 |
588 | _ | _ | |a Dataset connected to DataCite |
650 | _ | 7 | |a Neural and Evolutionary Computing (cs.NE) |2 Other |
650 | _ | 7 | |a Artificial Intelligence (cs.AI) |2 Other |
650 | _ | 7 | |a Hardware Architecture (cs.AR) |2 Other |
650 | _ | 7 | |a Emerging Technologies (cs.ET) |2 Other |
650 | _ | 7 | |a FOS: Computer and information sciences |2 Other |
700 | 1 | _ | |a Manea, Paul-Philipp |0 P:(DE-Juel1)192242 |b 1 |e Corresponding author |u fzj |
700 | 1 | _ | |a Sudarshan, Chirag |0 P:(DE-Juel1)198888 |b 2 |u fzj |
700 | 1 | _ | |a Finkbeiner, Jan |0 P:(DE-Juel1)190112 |b 3 |u fzj |
700 | 1 | _ | |a Siegel, Sebastian |0 P:(DE-Juel1)174486 |b 4 |u fzj |
700 | 1 | _ | |a Strachan, John Paul |0 P:(DE-Juel1)188145 |b 5 |u fzj |
700 | 1 | _ | |a Neftci, Emre |0 P:(DE-Juel1)188273 |b 6 |u fzj |
773 | _ | _ | |a 10.48550/arXiv.2409.19315 |
856 | 4 | _ | |u https://doi.org/10.48550/arXiv.2409.19315 |
856 | 4 | _ | |u https://juser.fz-juelich.de/record/1038064/files/Analog%20In-Memory%20Computing%20Attention%20Mechanism%20for%20Fast%20and%20Energy-Efficient%20Large%20Language%20Models.pdf |y OpenAccess |
909 | C | O | |o oai:juser.fz-juelich.de:1038064 |p openaire |p open_access |p VDB |p driver |p dnbdelivery |
910 | 1 | _ | |a Forschungszentrum Jülich |0 I:(DE-588b)5008462-8 |k FZJ |b 0 |6 P:(DE-Juel1)194421 |
910 | 1 | _ | |a Forschungszentrum Jülich |0 I:(DE-588b)5008462-8 |k FZJ |b 1 |6 P:(DE-Juel1)192242 |
910 | 1 | _ | |a Forschungszentrum Jülich |0 I:(DE-588b)5008462-8 |k FZJ |b 2 |6 P:(DE-Juel1)198888 |
910 | 1 | _ | |a Forschungszentrum Jülich |0 I:(DE-588b)5008462-8 |k FZJ |b 3 |6 P:(DE-Juel1)190112 |
910 | 1 | _ | |a Forschungszentrum Jülich |0 I:(DE-588b)5008462-8 |k FZJ |b 4 |6 P:(DE-Juel1)174486 |
910 | 1 | _ | |a Forschungszentrum Jülich |0 I:(DE-588b)5008462-8 |k FZJ |b 5 |6 P:(DE-Juel1)188145 |
910 | 1 | _ | |a Forschungszentrum Jülich |0 I:(DE-588b)5008462-8 |k FZJ |b 6 |6 P:(DE-Juel1)188273 |
913 | 1 | _ | |a DE-HGF |b Key Technologies |l Natural, Artificial and Cognitive Information Processing |1 G:(DE-HGF)POF4-520 |0 G:(DE-HGF)POF4-523 |3 G:(DE-HGF)POF4 |2 G:(DE-HGF)POF4-500 |4 G:(DE-HGF)POF |v Neuromorphic Computing and Network Dynamics |9 G:(DE-HGF)POF4-5234 |x 0 |
914 | 1 | _ | |y 2024 |
915 | _ | _ | |a OpenAccess |0 StatID:(DE-HGF)0510 |2 StatID |
920 | _ | _ | |l yes |
920 | 1 | _ | |0 I:(DE-Juel1)PGI-15-20210701 |k PGI-15 |l Neuromorphic Software Eco System |x 0 |
920 | 1 | _ | |0 I:(DE-Juel1)PGI-14-20210412 |k PGI-14 |l Neuromorphic Compute Nodes |x 1 |
980 | _ | _ | |a preprint |
980 | _ | _ | |a VDB |
980 | _ | _ | |a UNRESTRICTED |
980 | _ | _ | |a I:(DE-Juel1)PGI-15-20210701 |
980 | _ | _ | |a I:(DE-Juel1)PGI-14-20210412 |
980 | 1 | _ | |a FullTexts |
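The abstract (field 520 above) refers to the key-value cache used in generative self-attention: at every decoding step, the stored projections of all previous tokens must be reread to compute dot products with the current query. The sketch below is my own minimal NumPy illustration of that standard cache access pattern, not code from the record or from the paper's gain-cell hardware; the function name `decode_step` and all shapes are hypothetical.

```python
import numpy as np

def decode_step(q, k_new, v_new, k_cache, v_cache):
    """One autoregressive decoding step of single-head self-attention.

    q:        (d,)   query projection of the current token
    k_new:    (d,)   key projection of the current token
    v_new:    (d,)   value projection of the current token
    k_cache:  (t, d) keys of the t previous tokens (the KV cache)
    v_cache:  (t, d) values of the t previous tokens
    """
    # Append the new token's projections instead of recomputing past ones.
    k_cache = np.vstack([k_cache, k_new[None, :]])
    v_cache = np.vstack([v_cache, v_new[None, :]])

    d = q.shape[0]
    # Dot products between the query and every cached key -- the operation
    # that forces the whole cache to be reloaded on conventional hardware.
    scores = k_cache @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()

    # Second cache-wide dot product: weighted sum over the cached values.
    out = weights @ v_cache
    return out, k_cache, v_cache


# Toy usage: d = 8, three tokens already cached.
rng = np.random.default_rng(0)
d, t = 8, 3
out, k_cache, v_cache = decode_step(
    rng.normal(size=d), rng.normal(size=d), rng.normal(size=d),
    rng.normal(size=(t, d)), rng.normal(size=(t, d)),
)
print(out.shape)  # (8,)
```

According to the abstract, the two cache-wide matrix-vector products above are what the proposed architecture carries out as parallel analog dot products inside gain-cell arrays, so the cached projections do not have to be moved into SRAM at each generation step.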