001049541 001__ 1049541
001049541 005__ 20251213202221.0
001049541 037__ $$aFZJ-2025-05345
001049541 1001_ $$0P:(DE-Juel1)206763$$aLengvenis, Arijus$$b0$$eCorresponding author$$ufzj
001049541 1112_ $$a2025 IEEE International Parallel and Distributed Processing Symposium$$cMilano$$d2025-06-03 - 2025-06-07$$gIPDPS$$wItaly
001049541 245__ $$aA New Spin on the Fast Multipole Method for GPUs: Rethinking the Far-Field Operators
001049541 260__ $$c2025
001049541 3367_ $$033$$2EndNote$$aConference Paper
001049541 3367_ $$2DataCite$$aOther
001049541 3367_ $$2BibTeX$$aINPROCEEDINGS
001049541 3367_ $$2DRIVER$$aconferenceObject
001049541 3367_ $$2ORCID$$aLECTURE_SPEECH
001049541 3367_ $$0PUB:(DE-HGF)6$$2PUB:(DE-HGF)$$aConference Presentation$$bconf$$mconf$$s1765628766_19099$$xAfter Call
001049541 520__ $$aThe Fast Multipole Method (FMM) is an optimally efficient algorithm for solving N-body problems: a fundamental challenge in fields like astrophysics, plasma physics and molecular dynamics. It is particularly suited for computing 1/r potentials present in Coulomb and gravitational particle systems. Despite the near-field phase being trivially parallelisable, the far-field phase of the 1/r FMM currently lacks an efficient, massively parallel GPU algorithm fitting for the era of Exascale computing. Current state-of-the-art approaches either favor highly parallel but inefficient expansion shift operators or asymptotically efficient but poorly parallelisable rotation-based ones. Recently, a breakthrough was made with the re-evaluation of a rotation operator variant called fast rotation, which dramatically increases caching effectiveness and marries the advantages of both methods. Thus, this paper incorporates this approach to create fast rotation-based operators that facilitate an efficient far-field algorithm for the FMM on GPUs. Additionally, a warp-centric data access scheme is co-developed alongside a matching octree design, which yields coalesced memory access patterns for the bottleneck operators of the far-field phase. The fast rotation algorithm is enhanced with a cache-tiling mechanism, maximising GPU cache utilisation. Compared to the state-of-the-art GPU FMM far-field implementation, our algorithm achieves lower running times across the board and a 2.47x speedup for an increased precision simulation, with the performance improvement growing as precision increases, providing concrete proof of efficacy for dense particle systems.
001049541 536__ $$0G:(DE-HGF)POF4-5112$$a5112 - Cross-Domain Algorithms, Tools, Methods Labs (ATMLs) and Research Groups (POF4-511)$$cPOF4-511$$fPOF IV$$x0
001049541 7001_ $$0P:(DE-Juel1)132079$$aDachsel, Holger$$b1$$ufzj
001049541 7001_ $$0P:(DE-Juel1)169856$$aMorgenstern, Laura$$b2
001049541 7001_ $$0P:(DE-Juel1)132152$$aKabadshow, Ivo$$b3$$ufzj
001049541 909CO $$ooai:juser.fz-juelich.de:1049541$$pVDB
001049541 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)206763$$aForschungszentrum Jülich$$b0$$kFZJ
001049541 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)132079$$aForschungszentrum Jülich$$b1$$kFZJ
001049541 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)132152$$aForschungszentrum Jülich$$b3$$kFZJ
001049541 9131_ $$0G:(DE-HGF)POF4-511$$1G:(DE-HGF)POF4-510$$2G:(DE-HGF)POF4-500$$3G:(DE-HGF)POF4$$4G:(DE-HGF)POF$$9G:(DE-HGF)POF4-5112$$aDE-HGF$$bKey Technologies$$lEngineering Digital Futures – Supercomputing, Data Management and Information Security for Knowledge and Action$$vEnabling Computational- & Data-Intensive Science and Engineering$$x0
001049541 9141_ $$y2025
001049541 920__ $$lyes
001049541 9201_ $$0I:(DE-Juel1)JSC-20090406$$kJSC$$lJülich Supercomputing Center$$x0
001049541 980__ $$aconf
001049541 980__ $$aVDB
001049541 980__ $$aI:(DE-Juel1)JSC-20090406
001049541 980__ $$aUNRESTRICTED