TY  - EJOUR
AU  - Leroux, Nathan
AU  - Manea, Paul-Philipp
AU  - Sudarshan, Chirag
AU  - Finkbeiner, Jan
AU  - Siegel, Sebastian
AU  - Strachan, John Paul
AU  - Neftci, Emre
TI  - Analog In-Memory Computing Attention Mechanism for Fast and Energy-Efficient Large Language Models
PB  - arXiv
M1  - FZJ-2025-01113
PY  - 2024
AB  - Transformer networks, driven by self-attention, are central to Large Language Models. In generative Transformers, self-attention uses cache memory to store token projections, avoiding recomputation at each time step. However, GPU-stored projections must be loaded into SRAM for each new generation step, causing latency and energy bottlenecks. We present a custom self-attention in-memory computing architecture based on emerging charge-based memories called gain cells, which can be efficiently written to store new tokens during sequence generation and enable the parallel analog dot-product computation required for self-attention. The analog gain cell circuits, however, introduce non-idealities and constraints that prevent the direct mapping of pre-trained models. To circumvent this problem, we design an initialization algorithm that achieves text processing performance comparable to GPT-2 without training from scratch. Our architecture reduces attention latency and energy consumption by up to two and five orders of magnitude, respectively, compared to GPUs, marking a significant step toward ultra-fast, low-power generative Transformers.
KW  - Neural and Evolutionary Computing (cs.NE)
KW  - Artificial Intelligence (cs.AI)
KW  - Hardware Architecture (cs.AR)
KW  - Emerging Technologies (cs.ET)
KW  - FOS: Computer and information sciences
LB  - PUB:(DE-HGF)25
DO  - 10.48550/arXiv.2409.19315
UR  - https://juser.fz-juelich.de/record/1038064
ER  -