Scalability of 3D-DFT by block tensor-matrix multiplication on the JUWELS Cluster

Malapally, Nitin; Bolnykh, Viacheslav; Mandelli, Davide; Carloni, Paolo; Suarez, Estela; Lippert, Thomas
001005609 001__ 1005609
001005609 005__ 20240625095111.0
001005609 0247_ $$2Handle$$a2128/34243
001005609 037__ $$aFZJ-2023-01559
001005609 1001_ $$0P:(DE-Juel1)190684$$aMalapally, Nitin$$b0
001005609 245__ $$aScalability of 3D-DFT by block tensor-matrix multiplication on the JUWELS Cluster
001005609 260__ $$c2023
001005609 3367_ $$0PUB:(DE-HGF)25$$2PUB:(DE-HGF)$$aPreprint$$bpreprint$$mpreprint$$s1680511176_8653
001005609 3367_ $$2ORCID$$aWORKING_PAPER
001005609 3367_ $$028$$2EndNote$$aElectronic Article
001005609 3367_ $$2DRIVER$$apreprint
001005609 3367_ $$2BibTeX$$aARTICLE
001005609 3367_ $$2DataCite$$aOutput Types/Working Paper
001005609 520__ $$aThe 3D Discrete Fourier Transform (DFT) is a technique used to solve problems in disparate fields. Nowadays, the commonly adopted implementation of the 3D-DFT is derived from the Fast Fourier Transform (FFT) algorithm. However, evidence indicates that the distributed memory 3D-FFT algorithm does not scale well due to its use of all-to-all communication. Here, building on the work of Sedukhin et al. [Proceedings of the 30th International Conference on Computers and Their Applications, CATA 2015 pp. 193-200 (01 2015)], we revisit the possibility of improving the scaling of the 3D-DFT by using an alternative approach that uses point-to-point communication, albeit at a higher arithmetic complexity. The new algorithm exploits tensor-matrix multiplications on a volumetrically decomposed domain via three specially adapted variants of Cannon's algorithm. It has here been implemented as a C++ library called S3DFT and tested on the JUWELS Cluster at the Jülich Supercomputing Center. Our implementation of the shared memory tensor-matrix multiplication attained 88% of the theoretical single node peak performance. One variant of the distributed memory tensor-matrix multiplication shows excellent scaling, while the other two show poorer performance, which can be attributed to their intrinsic communication patterns. A comparison of S3DFT with the Intel MKL and FFTW3 libraries indicates that currently iMKL performs best overall, followed in order by FFTW3 and S3DFT. This picture might change with further improvements of the algorithm and/or when running on clusters that use network connections with higher latency, e.g. on cloud platforms.
001005609 536__ $$0G:(DE-HGF)POF4-5241$$a5241 - Molecular Information Processing in Cellular Systems (POF4-524)$$cPOF4-524$$fPOF IV$$x0
001005609 536__ $$0G:(DE-HGF)POF4-5121$$a5121 - Supercomputing & Big Data Facilities (POF4-512)$$cPOF4-512$$fPOF IV$$x1
001005609 7001_ $$0P:(DE-Juel1)168432$$aBolnykh, Viacheslav$$b1
001005609 7001_ $$0P:(DE-Juel1)142361$$aSuarez, Estela$$b2
001005609 7001_ $$0P:(DE-Juel1)145614$$aCarloni, Paolo$$b3
001005609 7001_ $$0P:(DE-Juel1)132179$$aLippert, Thomas$$b4
001005609 7001_ $$0P:(DE-Juel1)190906$$aMandelli, Davide$$b5$$eCorresponding author
001005609 8564_ $$uhttps://juser.fz-juelich.de/record/1005609/files/Scalability_of_3D_DFT_by_block_tensor_matrix_multiplication_on_the_JUWELS_cluster.pdf$$yOpenAccess
001005609 909CO $$ooai:juser.fz-juelich.de:1005609$$pdnbdelivery$$pdriver$$pVDB$$popen_access$$popenaire
001005609 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)190684$$aForschungszentrum Jülich$$b0$$kFZJ
001005609 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)142361$$aForschungszentrum Jülich$$b2$$kFZJ
001005609 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)145614$$aForschungszentrum Jülich$$b3$$kFZJ
001005609 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)132179$$aForschungszentrum Jülich$$b4$$kFZJ
001005609 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)190906$$aForschungszentrum Jülich$$b5$$kFZJ
001005609 9131_ $$0G:(DE-HGF)POF4-524$$1G:(DE-HGF)POF4-520$$2G:(DE-HGF)POF4-500$$3G:(DE-HGF)POF4$$4G:(DE-HGF)POF$$9G:(DE-HGF)POF4-5241$$aDE-HGF$$bKey Technologies$$lNatural, Artificial and Cognitive Information Processing$$vMolecular and Cellular Information Processing$$x0
001005609 9131_ $$0G:(DE-HGF)POF4-512$$1G:(DE-HGF)POF4-510$$2G:(DE-HGF)POF4-500$$3G:(DE-HGF)POF4$$4G:(DE-HGF)POF$$9G:(DE-HGF)POF4-5121$$aDE-HGF$$bKey Technologies$$lEngineering Digital Futures – Supercomputing, Data Management and Information Security for Knowledge and Action$$vSupercomputing & Big Data Infrastructures$$x1
001005609 9141_ $$y2023
001005609 915__ $$0StatID:(DE-HGF)0510$$2StatID$$aOpenAccess
001005609 920__ $$lyes
001005609 9201_ $$0I:(DE-Juel1)IAS-5-20120330$$kIAS-5$$lComputational Biomedicine$$x0
001005609 9201_ $$0I:(DE-Juel1)INM-9-20140121$$kINM-9$$lComputational Biomedicine$$x1
001005609 9201_ $$0I:(DE-Juel1)JSC-20090406$$kJSC$$lJülich Supercomputing Center$$x2
001005609 9801_ $$aFullTexts
001005609 980__ $$apreprint
001005609 980__ $$aVDB
001005609 980__ $$aUNRESTRICTED
001005609 980__ $$aI:(DE-Juel1)IAS-5-20120330
001005609 980__ $$aI:(DE-Juel1)INM-9-20140121
001005609 980__ $$aI:(DE-Juel1)JSC-20090406