Analog In-Memory Computing Attention Mechanism for Fast and Energy-Efficient Large Language Models
Preprint | FZJ-2025-01113
2024
arXiv
Please use a persistent id in citations: doi:10.48550/arXiv.2409.19315 doi:10.34734/FZJ-2025-01113
Abstract: Transformer networks, driven by self-attention, are central to Large Language Models. In generative Transformers, self-attention uses cache memory to store token projections, avoiding recomputation at each time step. However, the projections stored in GPU memory must be loaded into SRAM for each new generation step, causing latency and energy bottlenecks. We present a custom self-attention in-memory computing architecture based on emerging charge-based memories called gain cells, which can be efficiently written to store new tokens during sequence generation and enable the parallel analog dot-product computation required for self-attention. The analog gain-cell circuits, however, introduce non-idealities and constraints that prevent the direct mapping of pre-trained models. To circumvent this problem, we design an initialization algorithm that achieves text-processing performance comparable to GPT-2 without training from scratch. Our architecture reduces attention latency and energy consumption by up to two and five orders of magnitude, respectively, compared to GPUs, marking a significant step toward ultra-fast, low-power generative Transformers.
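To make the cached-attention bottleneck described in the abstract concrete, the sketch below shows a single decode step of standard key/value-cached self-attention in NumPy. It is a generic, illustrative example only, not the paper's gain-cell circuit or its initialization algorithm; the function name `decode_step_attention` and all variable names are hypothetical.

```python
# Minimal sketch (assumption: single head, no batching) of one generation step
# of cached self-attention. Each new query is dotted against every cached key,
# which is why the whole cache must be streamed from memory at every step on a
# GPU -- the bottleneck the paper's analog in-memory architecture targets.
import numpy as np

def decode_step_attention(q, k_new, v_new, k_cache, v_cache):
    """One decode step of key/value-cached self-attention.

    q, k_new, v_new  : (d,)   projections of the newly generated token
    k_cache, v_cache : (t, d) projections of all previous tokens
    Returns the attention output (d,) and the updated caches.
    """
    # Append the new token's projections to the cache.
    k_cache = np.vstack([k_cache, k_new[None, :]])
    v_cache = np.vstack([v_cache, v_new[None, :]])

    # Scaled dot-product scores against every cached key.
    scores = k_cache @ q / np.sqrt(q.shape[0])          # (t+1,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                            # softmax

    # Weighted sum over cached values.
    out = weights @ v_cache                             # (d,)
    return out, k_cache, v_cache

# Usage with arbitrary sizes: head dimension 64, 10 previously cached tokens.
d, t = 64, 10
rng = np.random.default_rng(0)
out, k_cache, v_cache = decode_step_attention(
    rng.normal(size=d), rng.normal(size=d), rng.normal(size=d),
    rng.normal(size=(t, d)), rng.normal(size=(t, d)))
```

In the architecture described by the abstract, the cache append corresponds to writing new token projections into the gain-cell array, and the dot products against the cache are computed in parallel in the analog domain rather than by reloading the cache into SRAM.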
Keyword(s): Neural and Evolutionary Computing (cs.NE) ; Artificial Intelligence (cs.AI) ; Hardware Architecture (cs.AR) ; Emerging Technologies (cs.ET) ; FOS: Computer and information sciences