001     889151
005     20220228143453.0
024 7 _ |a 10.1177/1094342020964857
|2 doi
024 7 _ |a 1078-3482
|2 ISSN
024 7 _ |a 1094-3420
|2 ISSN
024 7 _ |a 1741-2846
|2 ISSN
024 7 _ |a 2128/26689
|2 Handle
024 7 _ |a altmetric:94666778
|2 altmetric
024 7 _ |a WOS:000578560700001
|2 WOS
037 _ _ |a FZJ-2021-00076
082 _ _ |a 004
100 1 _ |a Kohnke, Bartosz
|0 0000-0002-6000-5490
|b 0
|e Corresponding author
245 _ _ |a A CUDA fast multipole method with highly efficient M2L far field evaluation
260 _ _ |a Thousand Oaks, Calif.
|c 2021
|b Sage Science Press
336 7 _ |a article
|2 DRIVER
336 7 _ |a Output Types/Journal article
|2 DataCite
336 7 _ |a Journal Article
|b journal
|m journal
|0 PUB:(DE-HGF)16
|s 1646033425_5297
|2 PUB:(DE-HGF)
336 7 _ |a ARTICLE
|2 BibTeX
336 7 _ |a JOURNAL_ARTICLE
|2 ORCID
336 7 _ |a Journal Article
|0 0
|2 EndNote
520 _ _ |a Solving an N-body problem, electrostatic or gravitational, is a crucial task and the main computational bottleneck in many scientific applications. Its direct solution is a ubiquitous showcase example for the compute power of graphics processing units (GPUs). However, the naïve pairwise summation has 𝒪(𝑁²) computational complexity. The fast multipole method (FMM) can reduce runtime and complexity to 𝒪(𝑁) for any specified precision. Here, we present a CUDA-accelerated C++ FMM implementation for multi-particle systems with 𝑟⁻¹ potential, as found, e.g., in biomolecular simulations. The algorithm involves several operators to exchange information in an octree data structure. We focus on the Multipole-to-Local (M2L) operator, as its runtime limits the overall performance. We propose, implement and benchmark three different M2L parallelization approaches. Approach (1) utilizes Unified Memory to minimize programming and porting efforts. It achieves decent speedups with little implementation effort. Approach (2) employs CUDA Dynamic Parallelism to significantly improve performance for high approximation accuracies. The presorted list-based approach (3) fits periodic boundary conditions particularly well. It exploits FMM operator symmetries to minimize both memory accesses and the number of complex multiplications. The result is a compute-bound implementation, i.e., performance is limited by arithmetic operations rather than by memory accesses. The complete CUDA-parallelized FMM is incorporated within the GROMACS molecular dynamics package as an alternative Coulomb solver.
536 _ _ |a 5112 - Cross-Domain Algorithms, Tools, Methods Labs (ATMLs) and Research Groups (POF4-511)
|0 G:(DE-HGF)POF4-5112
|c POF4-511
|f POF IV
|x 0
588 _ _ |a Dataset connected to CrossRef
700 1 _ |a Kutzner, Carsten
|0 0000-0002-8719-0307
|b 1
700 1 _ |a Beckmann, Andreas
|0 P:(DE-Juel1)157750
|b 2
|u fzj
700 1 _ |a Lube, Gert
|0 P:(DE-HGF)0
|b 3
700 1 _ |a Kabadshow, Ivo
|0 P:(DE-Juel1)132152
|b 4
|u fzj
700 1 _ |a Dachsel, Holger
|0 P:(DE-Juel1)132079
|b 5
|u fzj
700 1 _ |a Grubmüller, Helmut
|0 P:(DE-HGF)0
|b 6
773 _ _ |a 10.1177/1094342020964857
|g Vol. 35, no. 1, p. 97 - 117
|0 PERI:(DE-600)2017480-9
|n 1
|p 97 - 117
|t The international journal of high performance computing applications
|v 35
|y 2021
|x 1741-2846
856 4 _ |u https://juser.fz-juelich.de/record/889151/files/1094342020964857.pdf
|y OpenAccess
909 C O |o oai:juser.fz-juelich.de:889151
|p openaire
|p open_access
|p VDB
|p driver
|p dnbdelivery
910 1 _ |a Forschungszentrum Jülich
|0 I:(DE-588b)5008462-8
|k FZJ
|b 2
|6 P:(DE-Juel1)157750
910 1 _ |a Forschungszentrum Jülich
|0 I:(DE-588b)5008462-8
|k FZJ
|b 4
|6 P:(DE-Juel1)132152
910 1 _ |a Forschungszentrum Jülich
|0 I:(DE-588b)5008462-8
|k FZJ
|b 5
|6 P:(DE-Juel1)132079
913 1 _ |a DE-HGF
|b Key Technologies
|l Engineering Digital Futures – Supercomputing, Data Management and Information Security for Knowledge and Action
|1 G:(DE-HGF)POF4-510
|0 G:(DE-HGF)POF4-511
|3 G:(DE-HGF)POF4
|2 G:(DE-HGF)POF4-500
|4 G:(DE-HGF)POF
|v Enabling Computational- & Data-Intensive Science and Engineering
|9 G:(DE-HGF)POF4-5112
|x 0
913 0 _ |a DE-HGF
|b Key Technologies
|l Supercomputing & Big Data
|1 G:(DE-HGF)POF3-510
|0 G:(DE-HGF)POF3-511
|3 G:(DE-HGF)POF3
|2 G:(DE-HGF)POF3-500
|4 G:(DE-HGF)POF
|v Computational Science and Mathematical Methods
|x 0
914 1 _ |y 2021
915 _ _ |a DBCoverage
|0 StatID:(DE-HGF)0200
|2 StatID
|b SCOPUS
|d 2020-08-29
915 _ _ |a DBCoverage
|0 StatID:(DE-HGF)0300
|2 StatID
|b Medline
|d 2020-08-29
915 _ _ |a DBCoverage
|0 StatID:(DE-HGF)1160
|2 StatID
|b Current Contents - Engineering, Computing and Technology
|d 2020-08-29
915 _ _ |a Creative Commons Attribution CC BY 4.0
|0 LIC:(DE-HGF)CCBY4
|2 HGFVOC
915 _ _ |a DBCoverage
|0 StatID:(DE-HGF)0600
|2 StatID
|b Ebsco Academic Search
|d 2020-08-29
915 _ _ |a JCR
|0 StatID:(DE-HGF)0100
|2 StatID
|b INT J HIGH PERFORM C : 2018
|d 2020-08-29
915 _ _ |a WoS
|0 StatID:(DE-HGF)0113
|2 StatID
|b Science Citation Index Expanded
|d 2020-08-29
915 _ _ |a DBCoverage
|0 StatID:(DE-HGF)0150
|2 StatID
|b Web of Science Core Collection
|d 2020-08-29
915 _ _ |a IF < 5
|0 StatID:(DE-HGF)9900
|2 StatID
|d 2020-08-29
915 _ _ |a OpenAccess
|0 StatID:(DE-HGF)0510
|2 StatID
915 _ _ |a Peer Review
|0 StatID:(DE-HGF)0030
|2 StatID
|b ASC
|d 2020-08-29
915 _ _ |a Allianz-Lizenz
|0 StatID:(DE-HGF)0410
|2 StatID
|d 2020-08-29
|w ger
915 _ _ |a DBCoverage
|0 StatID:(DE-HGF)0160
|2 StatID
|b Essential Science Indicators
|d 2020-08-29
915 _ _ |a Nationallizenz
|0 StatID:(DE-HGF)0420
|2 StatID
|d 2020-08-29
|w ger
915 _ _ |a DBCoverage
|0 StatID:(DE-HGF)0199
|2 StatID
|b Clarivate Analytics Master Journal List
|d 2020-08-29
920 _ _ |l yes
920 1 _ |0 I:(DE-Juel1)JSC-20090406
|k JSC
|l Jülich Supercomputing Center
|x 0
980 _ _ |a journal
980 _ _ |a VDB
980 _ _ |a I:(DE-Juel1)JSC-20090406
980 _ _ |a UNRESTRICTED
980 1 _ |a FullTexts

