Advancing the distributed Multi-GPU ChASE library through algorithm optimization and NCCL library

Wu, Xinzhe; Di Napoli, Edoardo
doi:10.1145/3624062.3624249
% IMPORTANT: The following is UTF-8 encoded.  This means that in the presence
% of non-ASCII characters, it will not work with BibTeX 0.99 or older.
% Instead, you should use an up-to-date BibTeX implementation like “bibtex8” or
% “biber”.

@INPROCEEDINGS{Wu:1019351,
      author       = {Wu, Xinzhe and Di Napoli, Edoardo},
      title        = {{A}dvancing the distributed {M}ulti-{GPU} {C}h{ASE} library
                      through algorithm optimization and {NCCL} library},
      publisher    = {ACM New York, NY, USA},
      reportid     = {FZJ-2023-05321},
      pages        = {1688--1696},
      year         = {2023},
      abstract     = {As supercomputers become larger with powerful Graphics
                      Processing Unit (GPU), traditional direct eigensolvers
                      struggle to keep up with the hardware evolution and scale
                      efficiently due to communication and synchronization
                      demands. Conversely, subspace eigensolvers, like the
                      Chebyshev Accelerated Subspace Eigensolver (ChASE), have a
                      simpler structure and can overcome communication and
                      synchronization bottlenecks. ChASE is a modern subspace
                      eigensolver that uses Chebyshev polynomials to accelerate
                      the computation of extremal eigenpairs of dense Hermitian
                      eigenproblems. In this work we show how we have modified
                      ChASE by rethinking its memory layout, introducing a novel
                      parallelization scheme, switching to a more performing
                      communication-avoiding algorithm for one of its inner
                      modules, and substituting the MPI library by the
                      vendor-optimized NCCL library. The resulting library can
                      tackle dense problems with size up to , and scales
                      effortlessly up to the full 900 nodes—each one powered by
                      4 × A100 NVIDIA GPUs—of the JUWELS Booster hosted at the
                      Jülich Supercomputing Centre.},
      month        = {Nov},
      date         = {2023-11-12},
      organization = {SC-W 2023: Workshops of The
                      International Conference on High
                      Performance Computing, Network,
                      Storage, and Analysis, Denver, CO
                      (USA), 12 Nov 2023 - 17 Nov 2023},
      cin          = {JSC / CASA},
      cid          = {I:(DE-Juel1)JSC-20090406 / I:(DE-Juel1)CASA-20230315},
      pnm          = {5111 - Domain-Specific Simulation $\&$ Data Life Cycle Labs
                      (SDLs) and Research Groups (POF4-511) / Simulation and Data
                      Laboratory Quantum Materials (SDLQM) (SDLQM)},
      pid          = {G:(DE-HGF)POF4-5111 / G:(DE-Juel1)SDLQM},
      typ          = {PUB:(DE-HGF)8},
      doi          = {10.1145/3624062.3624249},
      url          = {https://juser.fz-juelich.de/record/1019351},
}