001026292 001__ 1026292
001026292 005__ 20240524205938.0
001026292 0247_ $$2datacite_doi$$a10.34734/FZJ-2024-03363
001026292 037__ $$aFZJ-2024-03363
001026292 1001_ $$0P:(DE-HGF)0$$aZaourar, Lilia$$b0$$eCorresponding author
001026292 245__ $$aCase Studies on the Impact and Challenges of Heterogeneous NUMA Architectures for HPC
001026292 260__ $$c2024
001026292 3367_ $$0PUB:(DE-HGF)25$$2PUB:(DE-HGF)$$aPreprint$$bpreprint$$mpreprint$$s1716539971_2344
001026292 3367_ $$2ORCID$$aWORKING_PAPER
001026292 3367_ $$028$$2EndNote$$aElectronic Article
001026292 3367_ $$2DRIVER$$apreprint
001026292 3367_ $$2BibTeX$$aARTICLE
001026292 3367_ $$2DataCite$$aOutput Types/Working Paper
001026292 520__ $$aThe memory systems of High-Performance Computing (HPC) systems commonly feature non-uniform data paths to memory, i.e., they are non-uniform memory access (NUMA) architectures. Memory is divided into multiple regions, with each processing unit having its own local memory. Therefore, each processing unit can access its local memory region faster than non-local regions. Architectures with hybrid memory technologies result in further non-uniformity. This paper presents case studies of the performance potential and data placement implications of non-uniform and heterogeneous memory in HPC systems. Using the gem5 and VPSim simulation platforms, we model NUMA systems with processors based on the ARMv8 Neoverse V1 Reference Design. The gem5 simulator provides a cycle-accurate view, while VPSim offers greater simulation speed, with a high-level view of the simulated system. We highlight the performance impact of design trade-offs regarding NUMA node organization and System Level Cache (SLC) group assignment, as well as Network-on-Chip (NoC) configuration. Our case studies provide essential input to a co-design process involving HPC processor architects and system integrators. A comparison of system configurations for different NoC bandwidths shows reduced NoC latency and a large improvement in memory bandwidth when NUMA control is enabled. Furthermore, a configuration with HBM2 memory organized as four NUMA nodes highlights the memory bandwidth performance gap and NoC queuing latency impact when comparing local vs. remote memory accesses. On the other hand, NUMA can result in an unbalanced distribution of memory accesses and reduced SLC hit ratios, as shown with DDR4 memory organized as four NUMA nodes.
001026292 536__ $$0G:(DE-HGF)POF4-5122$$a5122 - Future Computing & Big Data Systems (POF4-512)$$cPOF4-512$$fPOF IV$$x0
001026292 536__ $$0G:(BMBF)16ME0507K$$aEPI SGA2 (16ME0507K)$$c16ME0507K$$x1
001026292 7001_ $$0P:(DE-HGF)0$$aBenazouz, Mohamed$$b1
001026292 7001_ $$0P:(DE-HGF)0$$aMouhagir, Ayoub$$b2
001026292 7001_ $$0P:(DE-Juel1)179531$$aFalquez, Carlos$$b3$$ufzj
001026292 7001_ $$0P:(DE-Juel1)177768$$aPortero, Antoni$$b4$$ufzj
001026292 7001_ $$0P:(DE-Juel1)176469$$aHo, Nam$$b5$$ufzj
001026292 7001_ $$0P:(DE-Juel1)142361$$aSuarez, Estela$$b6$$ufzj
001026292 7001_ $$0P:(DE-HGF)0$$aPetrakis, Polydoros$$b7
001026292 7001_ $$0P:(DE-HGF)0$$aMarazakis, Manolis$$b8
001026292 7001_ $$0P:(DE-HGF)0$$aSgherzi, Francesco$$b9
001026292 7001_ $$0P:(DE-HGF)0$$aFernandez, Ivan$$b10
001026292 7001_ $$0P:(DE-HGF)0$$aDolbeau, Romain$$b11
001026292 7001_ $$0P:(DE-HGF)0$$aPleiter, Dirk$$b12
001026292 8564_ $$uhttps://juser.fz-juelich.de/record/1026292/files/arcs_2024_preprint.pdf$$yOpenAccess
001026292 8564_ $$uhttps://juser.fz-juelich.de/record/1026292/files/arcs_2024_preprint.gif?subformat=icon$$xicon$$yOpenAccess
001026292 8564_ $$uhttps://juser.fz-juelich.de/record/1026292/files/arcs_2024_preprint.jpg?subformat=icon-1440$$xicon-1440$$yOpenAccess
001026292 8564_ $$uhttps://juser.fz-juelich.de/record/1026292/files/arcs_2024_preprint.jpg?subformat=icon-180$$xicon-180$$yOpenAccess
001026292 8564_ $$uhttps://juser.fz-juelich.de/record/1026292/files/arcs_2024_preprint.jpg?subformat=icon-640$$xicon-640$$yOpenAccess
001026292 909CO $$ooai:juser.fz-juelich.de:1026292$$pdriver$$pVDB$$popen_access$$popenaire$$pdnbdelivery
001026292 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)179531$$aForschungszentrum Jülich$$b3$$kFZJ
001026292 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)177768$$aForschungszentrum Jülich$$b4$$kFZJ
001026292 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)176469$$aForschungszentrum Jülich$$b5$$kFZJ
001026292 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)142361$$aForschungszentrum Jülich$$b6$$kFZJ
001026292 9131_ $$0G:(DE-HGF)POF4-512$$1G:(DE-HGF)POF4-510$$2G:(DE-HGF)POF4-500$$3G:(DE-HGF)POF4$$4G:(DE-HGF)POF$$9G:(DE-HGF)POF4-5122$$aDE-HGF$$bKey Technologies$$lEngineering Digital Futures – Supercomputing, Data Management and Information Security for Knowledge and Action$$vSupercomputing & Big Data Infrastructures$$x0
001026292 9141_ $$y2024
001026292 915__ $$0StatID:(DE-HGF)0510$$2StatID$$aOpenAccess
001026292 920__ $$lyes
001026292 9201_ $$0I:(DE-Juel1)JSC-20090406$$kJSC$$lJülich Supercomputing Center$$x0
001026292 9801_ $$aFullTexts
001026292 980__ $$apreprint
001026292 980__ $$aVDB
001026292 980__ $$aUNRESTRICTED
001026292 980__ $$aI:(DE-Juel1)JSC-20090406