Analog in-memory computing attention mechanism for fast and energy-efficient large language models
Journal Article | FZJ-2026-00225
2025
Nature Research
London
Please use a persistent identifier in citations: doi:10.1038/s43588-025-00854-1 or doi:10.34734/FZJ-2026-00225
Abstract: Transformer networks, driven by self-attention, are central to large language models. In generative transformers, self-attention uses cache memory to store token projections, avoiding recomputation at each time step. However, graphics processing unit (GPU)-stored projections must be loaded into static random-access memory for each new generation step, causing latency and energy bottlenecks. Here we present a custom self-attention in-memory computing architecture based on emerging charge-based memories called gain cells, which can be efficiently written to store new tokens during sequence generation and enable parallel analog dot-product computation required for self-attention. However, the analog gain-cell circuits introduce non-idealities and constraints preventing the direct mapping of pre-trained models. To circumvent this problem, we design an initialization algorithm achieving text-processing performance comparable to GPT-2 without training from scratch. Our architecture reduces attention latency and energy consumption by up to two and four orders of magnitude, respectively, compared with GPUs, marking a substantial step toward ultrafast, low-power generative transformers.
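For orientation, the digital operation that the analog gain-cell array is described as replacing is the standard KV-cache attention step of a generative transformer. The sketch below is a minimal NumPy illustration (with hypothetical variable names, not taken from the paper) of the per-token cache update and the query-key dot products the abstract refers to; it shows the conventional digital computation, not the authors' analog circuit or their initialization algorithm.

```python
import numpy as np

def attention_step(q, k_cache, v_cache, k_new, v_new):
    """One generation step of single-head self-attention with a KV cache.

    q:                (d,)   query projection of the newest token
    k_cache, v_cache: (t, d) key/value projections of previous tokens
    k_new, v_new:     (d,)   key/value projections of the newest token
    """
    # Append the new token's projections instead of recomputing past ones.
    k_cache = np.vstack([k_cache, k_new])
    v_cache = np.vstack([v_cache, v_new])

    d = q.shape[-1]
    # Dot products between the query and every cached key: the operation that
    # an in-memory computing array can evaluate in parallel in the analog domain.
    scores = k_cache @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()

    # Weighted sum over cached values gives the attention output for this step.
    out = weights @ v_cache
    return out, k_cache, v_cache
```

In a GPU implementation, k_cache and v_cache grow with the sequence and must be moved from main memory to on-chip SRAM at every step, which is the latency and energy bottleneck the paper targets by computing these dot products directly where the projections are stored.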