001     1049542
005     20251213202221.0
024 7 _ |a 10.1109/IPDPS64566.2025.00075
|2 doi
037 _ _ |a FZJ-2025-05346
100 1 _ |a Lengvenis, Arijus
|0 P:(DE-Juel1)206763
|b 0
|e Corresponding author
|u fzj
111 2 _ |a 2025 IEEE International Parallel and Distributed Processing Symposium
|g IPDPS
|c Milano
|d 2025-06-03 - 2025-06-07
|w Italy
245 _ _ |a A New Spin on the Fast Multipole Method for GPUs: Rethinking the Far-Field Operators
260 _ _ |c 2025
|b IEEE
300 _ _ |a 12 p.
336 7 _ |a CONFERENCE_PAPER
|2 ORCID
336 7 _ |a Conference Paper
|0 33
|2 EndNote
336 7 _ |a INPROCEEDINGS
|2 BibTeX
336 7 _ |a conferenceObject
|2 DRIVER
336 7 _ |a Output Types/Conference Paper
|2 DataCite
336 7 _ |a Contribution to a conference proceedings
|b contrib
|m contrib
|0 PUB:(DE-HGF)8
|s 1765628388_21157
|2 PUB:(DE-HGF)
520 _ _ |a The Fast Multipole Method (FMM) is an optimally efficient algorithm for solving N-body problems: a fundamental challenge in fields like astrophysics, plasma physics and molecular dynamics. It is particularly suited for computing 1/r potentials present in Coulomb and gravitational particle systems. Despite the near-field phase being trivially parallelisable, the far-field phase of the 1/r FMM currently lacks an efficient, massively parallel GPU algorithm fitting for the era of Exascale computing. Current state-of-the-art approaches either favor highly parallel but inefficient expansion shift operators or asymptotically efficient but poorly parallelisable rotation-based ones. Recently, a breakthrough was made with the re-evaluation of a rotation operator variant called fast rotation, which dramatically increases caching effectiveness and marries the advantages of both methods. Thus, this paper incorporates this approach to create fast rotation-based operators that facilitate an efficient far-field algorithm for the FMM on GPUs. Additionally, a warp-centric data access scheme is co-developed alongside a matching octree design, which yields coalesced memory access patterns for the bottleneck operators of the far-field phase. The fast rotation algorithm is enhanced with a cache-tiling mechanism, maximising GPU cache utilisation. Compared to the state-of-the-art GPU FMM far-field implementation, our algorithm achieves lower running times across the board and a 2.47x speedup for an increased precision simulation, with the performance improvement growing as precision increases, providing concrete proof of efficacy for dense particle systems.
536 _ _ |a 5112 - Cross-Domain Algorithms, Tools, Methods Labs (ATMLs) and Research Groups (POF4-511)
|0 G:(DE-HGF)POF4-5112
|c POF4-511
|f POF IV
|x 0
588 _ _ |a Dataset connected to CrossRef Conference
700 1 _ |a Dachsel, Holger
|0 P:(DE-Juel1)132079
|b 1
|u fzj
700 1 _ |a Morgenstern, Laura
|0 P:(DE-Juel1)169856
|b 2
700 1 _ |a Kabadshow, Ivo
|0 P:(DE-Juel1)132152
|b 3
|u fzj
773 _ _ |a 10.1109/IPDPS64566.2025.00075
856 4 _ |u https://juser.fz-juelich.de/record/1049542/files/Publication.pdf
|y Restricted
909 C O |o oai:juser.fz-juelich.de:1049542
|p VDB
910 1 _ |a Forschungszentrum Jülich
|0 I:(DE-588b)5008462-8
|k FZJ
|b 0
|6 P:(DE-Juel1)206763
910 1 _ |a Forschungszentrum Jülich
|0 I:(DE-588b)5008462-8
|k FZJ
|b 1
|6 P:(DE-Juel1)132079
910 1 _ |a Forschungszentrum Jülich
|0 I:(DE-588b)5008462-8
|k FZJ
|b 3
|6 P:(DE-Juel1)132152
913 1 _ |a DE-HGF
|b Key Technologies
|l Engineering Digital Futures – Supercomputing, Data Management and Information Security for Knowledge and Action
|1 G:(DE-HGF)POF4-510
|0 G:(DE-HGF)POF4-511
|3 G:(DE-HGF)POF4
|2 G:(DE-HGF)POF4-500
|4 G:(DE-HGF)POF
|v Enabling Computational- & Data-Intensive Science and Engineering
|9 G:(DE-HGF)POF4-5112
|x 0
914 1 _ |y 2025
920 _ _ |l yes
920 1 _ |0 I:(DE-Juel1)JSC-20090406
|k JSC
|l Jülich Supercomputing Center
|x 0
980 _ _ |a contrib
980 _ _ |a VDB
980 _ _ |a I:(DE-Juel1)JSC-20090406
980 _ _ |a UNRESTRICTED

