001 | 1038064 | ||
005 | 20250203103308.0 | ||
024 | 7 | _ | |a 10.48550/arXiv.2409.19315 |2 doi |
024 | 7 | _ | |a 10.34734/FZJ-2025-01113 |2 datacite_doi |
037 | _ | _ | |a FZJ-2025-01113 |
100 | 1 | _ | |a Leroux, Nathan |0 P:(DE-Juel1)194421 |b 0 |e Corresponding author |u fzj |
245 | _ | _ | |a Analog In-Memory Computing Attention Mechanism for Fast and Energy-Efficient Large Language Models |
260 | _ | _ | |c 2024 |b arXiv |
336 | 7 | _ | |a Preprint |b preprint |m preprint |0 PUB:(DE-HGF)25 |s 1738249459_31339 |2 PUB:(DE-HGF) |
336 | 7 | _ | |a WORKING_PAPER |2 ORCID |
336 | 7 | _ | |a Electronic Article |0 28 |2 EndNote |
336 | 7 | _ | |a preprint |2 DRIVER |
336 | 7 | _ | |a ARTICLE |2 BibTeX |
336 | 7 | _ | |a Output Types/Working Paper |2 DataCite |
520 | _ | _ | |a Transformer networks, driven by self-attention, are central to Large Language Models. In generative Transformers, self-attention uses cache memory to store token projections, avoiding recomputation at each time step. However, GPU-stored projections must be loaded into SRAM for each new generation step, causing latency and energy bottlenecks. We present a custom self-attention in-memory computing architecture based on emerging charge-based memories called gain cells, which can be efficiently written to store new tokens during sequence generation and enable the parallel analog dot-product computation required for self-attention. The analog gain-cell circuits, however, introduce non-idealities and constraints that prevent the direct mapping of pre-trained models. To circumvent this problem, we design an initialization algorithm that achieves text-processing performance comparable to GPT-2 without training from scratch. Our architecture reduces attention latency and energy consumption by up to two and five orders of magnitude, respectively, compared to GPUs, marking a significant step toward ultra-fast, low-power generative Transformers. |
536 | _ | _ | |a 5234 - Emerging NC Architectures (POF4-523) |0 G:(DE-HGF)POF4-5234 |c POF4-523 |f POF IV |x 0 |
536 | _ | _ | |a BMBF 16ME0400 - Verbundprojekt: Neuro-inspirierte Technologien der künstlichen Intelligenz für die Elektronik der Zukunft - NEUROTEC II - (16ME0400) |0 G:(BMBF)16ME0400 |c 16ME0400 |x 1 |
536 | _ | _ | |a BMBF 03ZU1106CA - NeuroSys: Algorithm-Hardware Co-Design (Projekt C) - A (03ZU1106CA) |0 G:(BMBF)03ZU1106CA |c 03ZU1106CA |x 2 |
536 | _ | _ | |a BMBF 03ZU1106CB - NeuroSys: Algorithm-Hardware Co-Design (Projekt C) - B (BMBF-03ZU1106CB) |0 G:(DE-Juel1)BMBF-03ZU1106CB |c BMBF-03ZU1106CB |x 3 |
588 | _ | _ | |a Dataset connected to DataCite |
650 | _ | 7 | |a Neural and Evolutionary Computing (cs.NE) |2 Other |
650 | _ | 7 | |a Artificial Intelligence (cs.AI) |2 Other |
650 | _ | 7 | |a Hardware Architecture (cs.AR) |2 Other |
650 | _ | 7 | |a Emerging Technologies (cs.ET) |2 Other |
650 | _ | 7 | |a FOS: Computer and information sciences |2 Other |
700 | 1 | _ | |a Manea, Paul-Philipp |0 P:(DE-Juel1)192242 |b 1 |e Corresponding author |u fzj |
700 | 1 | _ | |a Sudarshan, Chirag |0 P:(DE-Juel1)198888 |b 2 |u fzj |
700 | 1 | _ | |a Finkbeiner, Jan |0 P:(DE-Juel1)190112 |b 3 |u fzj |
700 | 1 | _ | |a Siegel, Sebastian |0 P:(DE-Juel1)174486 |b 4 |u fzj |
700 | 1 | _ | |a Strachan, John Paul |0 P:(DE-Juel1)188145 |b 5 |u fzj |
700 | 1 | _ | |a Neftci, Emre |0 P:(DE-Juel1)188273 |b 6 |u fzj |
773 | _ | _ | |a 10.48550/arXiv.2409.19315 |
856 | 4 | _ | |u https://doi.org/10.48550/arXiv.2409.19315 |
856 | 4 | _ | |u https://juser.fz-juelich.de/record/1038064/files/Analog%20In-Memory%20Computing%20Attention%20Mechanism%20for%20Fast%20and%20Energy-Efficient%20Large%20Language%20Models.pdf |y OpenAccess |
909 | C | O | |o oai:juser.fz-juelich.de:1038064 |p openaire |p open_access |p VDB |p driver |p dnbdelivery |
910 | 1 | _ | |a Forschungszentrum Jülich |0 I:(DE-588b)5008462-8 |k FZJ |b 0 |6 P:(DE-Juel1)194421 |
910 | 1 | _ | |a Forschungszentrum Jülich |0 I:(DE-588b)5008462-8 |k FZJ |b 1 |6 P:(DE-Juel1)192242 |
910 | 1 | _ | |a Forschungszentrum Jülich |0 I:(DE-588b)5008462-8 |k FZJ |b 2 |6 P:(DE-Juel1)198888 |
910 | 1 | _ | |a Forschungszentrum Jülich |0 I:(DE-588b)5008462-8 |k FZJ |b 3 |6 P:(DE-Juel1)190112 |
910 | 1 | _ | |a Forschungszentrum Jülich |0 I:(DE-588b)5008462-8 |k FZJ |b 4 |6 P:(DE-Juel1)174486 |
910 | 1 | _ | |a Forschungszentrum Jülich |0 I:(DE-588b)5008462-8 |k FZJ |b 5 |6 P:(DE-Juel1)188145 |
910 | 1 | _ | |a Forschungszentrum Jülich |0 I:(DE-588b)5008462-8 |k FZJ |b 6 |6 P:(DE-Juel1)188273 |
913 | 1 | _ | |a DE-HGF |b Key Technologies |l Natural, Artificial and Cognitive Information Processing |1 G:(DE-HGF)POF4-520 |0 G:(DE-HGF)POF4-523 |3 G:(DE-HGF)POF4 |2 G:(DE-HGF)POF4-500 |4 G:(DE-HGF)POF |v Neuromorphic Computing and Network Dynamics |9 G:(DE-HGF)POF4-5234 |x 0 |
914 | 1 | _ | |y 2024 |
915 | _ | _ | |a OpenAccess |0 StatID:(DE-HGF)0510 |2 StatID |
920 | _ | _ | |l yes |
920 | 1 | _ | |0 I:(DE-Juel1)PGI-15-20210701 |k PGI-15 |l Neuromorphic Software Eco System |x 0 |
920 | 1 | _ | |0 I:(DE-Juel1)PGI-14-20210412 |k PGI-14 |l Neuromorphic Compute Nodes |x 1 |
980 | _ | _ | |a preprint |
980 | _ | _ | |a VDB |
980 | _ | _ | |a UNRESTRICTED |
980 | _ | _ | |a I:(DE-Juel1)PGI-15-20210701 |
980 | _ | _ | |a I:(DE-Juel1)PGI-14-20210412 |
980 | 1 | _ | |a FullTexts |
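The abstract (field 520 above) refers to the key-value cache used in generative self-attention: at every decoding step, the stored projections of all previous tokens must be reread to compute dot products with the current query. The sketch below is my own minimal NumPy illustration of that standard cache access pattern, not code from the record or from the paper's gain-cell hardware; the function name `decode_step` and all shapes are hypothetical.

```python
import numpy as np

def decode_step(q, k_new, v_new, k_cache, v_cache):
    """One autoregressive decoding step of single-head self-attention.

    q:        (d,)   query projection of the current token
    k_new:    (d,)   key projection of the current token
    v_new:    (d,)   value projection of the current token
    k_cache:  (t, d) keys of the t previous tokens (the KV cache)
    v_cache:  (t, d) values of the t previous tokens
    """
    # Append the new token's projections instead of recomputing past ones.
    k_cache = np.vstack([k_cache, k_new[None, :]])
    v_cache = np.vstack([v_cache, v_new[None, :]])

    d = q.shape[0]
    # Dot products between the query and every cached key -- the operation
    # that forces the whole cache to be reloaded on conventional hardware.
    scores = k_cache @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()

    # Second cache-wide dot product: weighted sum over the cached values.
    out = weights @ v_cache
    return out, k_cache, v_cache


# Toy usage: d = 8, three tokens already cached.
rng = np.random.default_rng(0)
d, t = 8, 3
out, k_cache, v_cache = decode_step(
    rng.normal(size=d), rng.normal(size=d), rng.normal(size=d),
    rng.normal(size=(t, d)), rng.normal(size=(t, d)),
)
print(out.shape)  # (8,)
```

According to the abstract, the two cache-wide matrix-vector products above are what the proposed architecture carries out as parallel analog dot products inside gain-cell arrays, so the cached projections do not have to be moved into SRAM at each generation step.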