001038064 001__ 1038064
001038064 005__ 20250203103308.0
001038064 0247_ $$2doi$$a10.48550/arXiv.2409.19315
001038064 0247_ $$2datacite_doi$$a10.34734/FZJ-2025-01113
001038064 037__ $$aFZJ-2025-01113
001038064 1001_ $$0P:(DE-Juel1)194421$$aLeroux, Nathan$$b0$$eCorresponding author$$ufzj
001038064 245__ $$aAnalog In-Memory Computing Attention Mechanism for Fast and Energy-Efficient Large Language Models
001038064 260__ $$barXiv$$c2024
001038064 3367_ $$0PUB:(DE-HGF)25$$2PUB:(DE-HGF)$$aPreprint$$bpreprint$$mpreprint$$s1738249459_31339
001038064 3367_ $$2ORCID$$aWORKING_PAPER
001038064 3367_ $$028$$2EndNote$$aElectronic Article
001038064 3367_ $$2DRIVER$$apreprint
001038064 3367_ $$2BibTeX$$aARTICLE
001038064 3367_ $$2DataCite$$aOutput Types/Working Paper
001038064 520__ $$aTransformer networks, driven by self-attention, are central to Large Language Models. In generative Transformers, self-attention uses cache memory to store token projections, avoiding recomputation at each time step. However, GPU-stored projections must be loaded into SRAM for each new generation step, causing latency and energy bottlenecks. We present a custom self-attention in-memory computing architecture based on emerging charge-based memories called gain cells, which can be efficiently written to store new tokens during sequence generation and enable the parallel analog dot-product computation required for self-attention. However, the analog gain cell circuits introduce non-idealities and constraints that prevent the direct mapping of pre-trained models. To circumvent this problem, we design an initialization algorithm that achieves text processing performance comparable to GPT-2 without training from scratch. Our architecture reduces attention latency and energy consumption by up to two and five orders of magnitude, respectively, compared to GPUs, marking a significant step toward ultra-fast, low-power generative Transformers.
001038064 536__ $$0G:(DE-HGF)POF4-5234$$a5234 - Emerging NC Architectures (POF4-523)$$cPOF4-523$$fPOF IV$$x0
001038064 536__ $$0G:(BMBF)16ME0400$$aBMBF 16ME0400 - Verbundprojekt: Neuro-inspirierte Technologien der künstlichen Intelligenz für die Elektronik der Zukunft - NEUROTEC II - (16ME0400)$$c16ME0400$$x1
001038064 536__ $$0G:(BMBF)03ZU1106CA$$aBMBF 03ZU1106CA - NeuroSys: Algorithm-Hardware Co-Design (Projekt C) - A (03ZU1106CA)$$c03ZU1106CA$$x2
001038064 536__ $$0G:(DE-Juel1)BMBF-03ZU1106CB$$aBMBF 03ZU1106CB - NeuroSys: Algorithm-Hardware Co-Design (Projekt C) - B (BMBF-03ZU1106CB)$$cBMBF-03ZU1106CB$$x3
001038064 588__ $$aDataset connected to DataCite
001038064 650_7 $$2Other$$aNeural and Evolutionary Computing (cs.NE)
001038064 650_7 $$2Other$$aArtificial Intelligence (cs.AI)
001038064 650_7 $$2Other$$aHardware Architecture (cs.AR)
001038064 650_7 $$2Other$$aEmerging Technologies (cs.ET)
001038064 650_7 $$2Other$$aFOS: Computer and information sciences
001038064 7001_ $$0P:(DE-Juel1)192242$$aManea, Paul-Philipp$$b1$$eCorresponding author$$ufzj
001038064 7001_ $$0P:(DE-Juel1)198888$$aSudarshan, Chirag$$b2$$ufzj
001038064 7001_ $$0P:(DE-Juel1)190112$$aFinkbeiner, Jan$$b3$$ufzj
001038064 7001_ $$0P:(DE-Juel1)174486$$aSiegel, Sebastian$$b4$$ufzj
001038064 7001_ $$0P:(DE-Juel1)188145$$aStrachan, John Paul$$b5$$ufzj
001038064 7001_ $$0P:(DE-Juel1)188273$$aNeftci, Emre$$b6$$ufzj
001038064 773__ $$a10.48550/arXiv.2409.19315
001038064 8564_ $$uhttps://doi.org/10.48550/arXiv.2409.19315
001038064 8564_ $$uhttps://juser.fz-juelich.de/record/1038064/files/Analog%20In-Memory%20Computing%20Attention%20Mechanism%20for%20Fast%20and%20Energy-Efficient%20Large%20Language%20Models.pdf$$yOpenAccess
001038064 909CO $$ooai:juser.fz-juelich.de:1038064$$pdnbdelivery$$pdriver$$pVDB$$popen_access$$popenaire
001038064 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)194421$$aForschungszentrum Jülich$$b0$$kFZJ
001038064 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)192242$$aForschungszentrum Jülich$$b1$$kFZJ
001038064 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)198888$$aForschungszentrum Jülich$$b2$$kFZJ
001038064 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)190112$$aForschungszentrum Jülich$$b3$$kFZJ
001038064 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)174486$$aForschungszentrum Jülich$$b4$$kFZJ
001038064 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)188145$$aForschungszentrum Jülich$$b5$$kFZJ
001038064 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)188273$$aForschungszentrum Jülich$$b6$$kFZJ
001038064 9131_ $$0G:(DE-HGF)POF4-523$$1G:(DE-HGF)POF4-520$$2G:(DE-HGF)POF4-500$$3G:(DE-HGF)POF4$$4G:(DE-HGF)POF$$9G:(DE-HGF)POF4-5234$$aDE-HGF$$bKey Technologies$$lNatural, Artificial and Cognitive Information Processing$$vNeuromorphic Computing and Network Dynamics$$x0
001038064 9141_ $$y2024
001038064 915__ $$0StatID:(DE-HGF)0510$$2StatID$$aOpenAccess
001038064 920__ $$lyes
001038064 9201_ $$0I:(DE-Juel1)PGI-15-20210701$$kPGI-15$$lNeuromorphic Software Eco System$$x0
001038064 9201_ $$0I:(DE-Juel1)PGI-14-20210412$$kPGI-14$$lNeuromorphic Compute Nodes$$x1
001038064 980__ $$apreprint
001038064 980__ $$aVDB
001038064 980__ $$aUNRESTRICTED
001038064 980__ $$aI:(DE-Juel1)PGI-15-20210701
001038064 980__ $$aI:(DE-Juel1)PGI-14-20210412
001038064 9801_ $$aFullTexts