TY  - JOUR
AU  - Leroux, Nathan
AU  - Manea, Paul
AU  - Sudarshan, Chirag
AU  - Finkbeiner, Jan
AU  - Siegel, Sebastian
AU  - Strachan, John Paul
AU  - Neftci, Emre
TI  - Analog in-memory computing attention mechanism for fast and energy-efficient large language models
JO  - Nature Computational Science
VL  - 5
IS  - 9
SN  - 2662-8457
CY  - London
PB  - Nature Research
M1  - FZJ-2026-00225
SP  - 813
EP  - 824
PY  - 2025
AB  - Transformer networks, driven by self-attention, are central to large language models. In generative transformers, self-attention uses cache memory to store token projections, avoiding recomputation at each time step. However, graphics processing unit (GPU)-stored projections must be loaded into static random-access memory for each new generation step, causing latency and energy bottlenecks. Here we present a custom self-attention in-memory computing architecture based on emerging charge-based memories called gain cells, which can be efficiently written to store new tokens during sequence generation and enable parallel analog dot-product computation required for self-attention. However, the analog gain-cell circuits introduce non-idealities and constraints preventing the direct mapping of pre-trained models. To circumvent this problem, we design an initialization algorithm achieving text-processing performance comparable to GPT-2 without training from scratch. Our architecture reduces attention latency and energy consumption by up to two and four orders of magnitude, respectively, compared with GPUs, marking a substantial step toward ultrafast, low-power generative transformers.
LB  - PUB:(DE-HGF)16
DO  - 10.1038/s43588-025-00854-1
UR  - https://juser.fz-juelich.de/record/1050455
ER  -